Stable Diffusion WebUI Basic 01–Introduction

This entry is part 1 of 3 in the series Stable Diffusion basic algorithm principle

In the world of AI image generation, there are many tools and methods to explore. If you’re interested in Midjourney, check out our detailed Midjourney tutorial; if you’d like a guide to ComfyUI, we have one of those too. This series focuses on teaching you how to use Stable Diffusion WebUI systematically. From basic operations to advanced techniques, we’ll guide you step by step so you can master WebUI’s powerful features and create high-quality AI-generated images.

This article aims to explain the principles of Stable Diffusion in a more accessible manner. By the end, you will understand the following topics:

  1. What is Stable Diffusion?
  2. How is diffusion stabilized (with text-to-image as an example)?
  3. CLIP: How do text prompts influence outcomes?
  4. UNet: How does the diffusion model work?
  5. Understanding the encoding and decoding process of VAE.

Stable Diffusion is a powerful image generation and processing algorithm. The name combines “stable,” indicating controlled processing, and “diffusion,” which refers to the transformation of noise within an image.

The algorithm works by adding noise (forward diffusion) or removing it (reverse diffusion) following specific rules. For example, starting with random noise, Stable Diffusion gradually refines the image to match a prompt, such as “a red flower.” This process transforms chaotic patterns into a detailed, clear result, showcasing its impressive generative capabilities.
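
To make the idea of “adding noise by rule” concrete, here is a minimal sketch of the standard forward-diffusion (noising) step in Python. The function name and toy tensor shapes are illustrative only; in Stable Diffusion this operation is actually applied to a compressed latent rather than raw pixels.

```python
import torch

def forward_diffusion(x0, noise, alpha_bar_t):
    # Forward (noising) step: blend the clean image with Gaussian noise.
    # alpha_bar_t near 1 -> almost no noise; near 0 -> almost pure noise.
    return alpha_bar_t ** 0.5 * x0 + (1 - alpha_bar_t) ** 0.5 * noise

x0 = torch.rand(1, 3, 64, 64)          # toy "clean image"
noise = torch.randn_like(x0)
slightly_noisy = forward_diffusion(x0, noise, alpha_bar_t=0.9)
mostly_noise = forward_diffusion(x0, noise, alpha_bar_t=0.05)
```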

Stable Diffusion WebUI provides an accessible and efficient way to harness this advanced technology for creating art and processing images.

In Stable Diffusion, the process simplifies to a function, Fsd(prompt). When you input a natural language prompt, this function applies a series of transformations to refine the input into a cohesive image. The algorithm maintains stability by systematically adding and removing noise, guiding the image generation from random patterns to a clear result. This structured approach brings order to the complex task of creating visuals from text descriptions, ensuring controlled and reliable outputs.
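
If you want to see this “function” in actual code, the diffusers library wraps the entire pipeline behind a single call. This is a rough equivalent of Fsd(prompt), not the WebUI itself; the model ID, step count, and guidance scale below are just example values.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a Stable Diffusion 1.5 checkpoint (example model ID) and move it to the GPU
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# One call runs the whole text -> embedding -> denoising -> decoding chain
image = pipe("a red flower", num_inference_steps=20, guidance_scale=7.5).images[0]
image.save("red_flower.png")
```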


When we input a text prompt, the CLIP (Contrastive Language-Image Pretraining) algorithm plays a key role. CLIP, a type of Text Encoder, converts natural language prompts into feature vectors (embeddings). For example, if the prompt is “cute girl,” CLIP processes the semantic meaning and links it with specific features such as “big round eyes,” “fair skin,” and “adorable expression.” It then transforms these features into a series of token vectors, each with 768 dimensions, effectively capturing the essence of the input prompt.
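
As a rough illustration, the snippet below runs the text encoder that SD 1.x models typically use (OpenAI’s CLIP ViT-L/14) through the transformers library. The exact checkpoint is an assumption, but the output shape shows the “77 tokens × 768 dimensions” structure described above.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Tokenize the prompt and pad it to CLIP's fixed length of 77 tokens
tokens = tokenizer("cute girl", padding="max_length", max_length=77, return_tensors="pt")
with torch.no_grad():
    embeddings = text_encoder(tokens.input_ids).last_hidden_state

print(embeddings.shape)  # torch.Size([1, 77, 768]) -- 77 token vectors, 768 dimensions each
```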


At this point, you might be asking, “Why does my generated image look less appealing even with the same keywords?” The reason lies in the denoising algorithm, which plays a significant role in shaping the final output. While the text encoder generates consistent feature vectors for identical prompts, different models use varying denoising techniques, resulting in diverse outcomes. Now, let’s explore one of the key components in Stable Diffusion—UNet.

UNet refines images using the word vectors as input. It works with the embeddings generated by the CLIP algorithm, which convert the input prompt into machine-readable word vectors. Inside UNet, a cross-attention mechanism connects the two: Q (Query) is derived from the image’s latent features, while K (Key) and V (Value) are derived from the text embeddings. This attention directly influences how UNet denoises and refines the image at each diffusion step.
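
Below is a toy sketch of how Q, K, and V interact in a single cross-attention operation: Q comes from the image side, K and V from the text side. The dimensions and weight matrices are made-up stand-ins, not values from any real checkpoint.

```python
import torch

def cross_attention(latent_features, text_embeddings, w_q, w_k, w_v):
    # Q from the image (latent) features, K and V from the text embeddings
    q = latent_features @ w_q
    k = text_embeddings @ w_k
    v = text_embeddings @ w_v
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)  # scaled dot-product
    attn = scores.softmax(dim=-1)                              # which tokens each position attends to
    return attn @ v                                            # text-conditioned feature update

# Toy shapes: 64*64 = 4096 latent positions, 77 text tokens of 768 dims, 320-dim attention
latent = torch.randn(1, 4096, 320)
text = torch.randn(1, 77, 768)
w_q, w_k, w_v = torch.randn(320, 320), torch.randn(768, 320), torch.randn(768, 320)
print(cross_attention(latent, text, w_q, w_k, w_v).shape)  # torch.Size([1, 4096, 320])
```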

For example, if you set the denoising process to 20 steps, UNet progressively reduces the noise, shaping the image to match the features defined by the embeddings. This iterative process transforms the initial noise into a clear and visually coherent result.
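
The iterative loop itself can be sketched with a diffusers scheduler. In the snippet below the UNet call is replaced by a random placeholder, so it only shows the shape of the 20-step loop, not a real generation.

```python
import torch
from diffusers import DDIMScheduler

scheduler = DDIMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(20)                  # the "20 steps" example from above

latent = torch.randn(1, 4, 64, 64)           # start from pure noise in latent space
for t in scheduler.timesteps:
    noise_pred = torch.randn_like(latent)    # placeholder -- the real UNet predicts this noise
    latent = scheduler.step(noise_pred, t, latent).prev_sample
```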


It’s important to note that the denoising process in UNet is more complex than the diagram above suggests. If denoising were simply performed step by step, the result would often be poor, and the generated image would not precisely reflect the prompt description.

To ensure prompt accuracy, UNet employs Classifier-Free Guidance. During each denoising step (e.g., 20 steps), it generates two images: one guided by the prompt and one without. The difference between these two images serves as the feature signal influenced by the text prompt. This difference is then amplified, strengthening the effect of the prompt on the generated image.

Additionally, at each of the N denoising steps, the unguided prediction is subtracted from the prompt-guided prediction, and the difference is amplified before being applied. This ensures that the prompt has sufficient weight at every step.

In simple terms, this method increases the influence of the prompt on the generated image. In the Stable Diffusion WebUI, the amplification factor is exposed as the CFG Scale (sometimes described as prompt strength or prompt relevance). This key parameter controls how closely the generated image matches the input prompt.
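
The core arithmetic of Classifier-Free Guidance is small enough to show directly. This is a generic sketch of the formula described above, with dummy tensors and an example guidance value; it is not code lifted from the WebUI.

```python
import torch

def apply_cfg(noise_pred_uncond, noise_pred_text, guidance_scale=7.5):
    # Amplify the difference between the prompt-guided and unguided predictions
    return noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

# Dummy predictions standing in for the two UNet passes at one denoising step
uncond = torch.randn(1, 4, 64, 64)
cond = torch.randn(1, 4, 64, 64)
guided = apply_cfg(uncond, cond, guidance_scale=7.5)
```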


After discussing text-to-image generation, let’s also touch on image-to-image generation. When using the image-to-image feature in the Stable Diffusion WebUI, you provide both an image and a prompt. For example, with N=20 diffusion steps, the process first adds noise to the provided image (the denoising strength determines how much of the original is obscured). Then, using the UNet algorithm, it gradually denoises the image, incorporating both the original image’s features and the prompt’s details to create the final output.
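
For reference, the same image-to-image idea is available outside the WebUI through the diffusers img2img pipeline. The model ID, file names, and parameter values below are placeholders for illustration; strength is the knob that decides how much noise is added to the input image.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("sketch.png").convert("RGB").resize((512, 512))

# strength ~1.0: the input is almost fully noised (mostly ignored);
# lower values keep more of the original image's composition
result = pipe(
    prompt="a red flower, detailed oil painting",
    image=init_image,
    strength=0.75,
    num_inference_steps=20,
    guidance_scale=7.5,
).images[0]
result.save("flower_img2img.png")
```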

Finally, let’s briefly look at the VAE (Variational Autoencoder) encoding and decoding process. A VAE is an algorithm that compresses and decompresses data. The UNet algorithm works in the “latent space” (a compressed representation), not directly on pixels. When generating a 512×512 image, the VAE encoder compresses it into a much smaller 64×64 latent; UNet performs its denoising on that latent; and the VAE decoder then expands the result back to 512×512. In short, the VAE handles the compression and decompression of the image.
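
Here is a small sketch of the VAE round trip using diffusers, with a random tensor standing in for a real 512×512 image; the model ID is an example, and 0.18215 is the latent scaling constant used by SD 1.x checkpoints.

```python
import torch
from diffusers import AutoencoderKL

# Load only the VAE component of an SD 1.5 checkpoint (example model ID)
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

image = torch.randn(1, 3, 512, 512)   # stand-in for a 512x512 RGB image scaled to [-1, 1]
with torch.no_grad():
    latent = vae.encode(image).latent_dist.sample() * 0.18215   # -> (1, 4, 64, 64)
    decoded = vae.decode(latent / 0.18215).sample               # -> (1, 3, 512, 512)

print(latent.shape, decoded.shape)
```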


At this point, the principle of SD (Stable Diffusion) has been covered, and it’s not that complicated, right? In the next article, we’ll move on to how these models are trained.

If you’re excited to dive into the world of AI image generation, you’ve come to the right place! Want to create stunning images with Midjourney? Just click on our Midjourney tutorial and start learning! Interested in exploring ComfyUI? We’ve got a detailed guide for that too. Each guide is designed to be simple and fun, helping you master these powerful tools at your own pace. Here, you can learn all the AI knowledge you need, stay updated with the latest AI trends, and let your creativity run wild. Ready to start? Let’s explore the exciting world of AI together!


Series Navigation: Stable Diffusion WebUI Basic 02–Model Training Related Principles >>
