Stable Diffusion WebUI Basic 03 – Fine-Tuning Large Models

This entry is part 3 of 3 in the series Stable Diffusion basic algorithm principle

In the world of AI image generation, there are many tools and methods to explore. If you’re interested in Midjourney, see our detailed Midjourney tutorial; for ComfyUI, see our ComfyUI guide. This series focuses on teaching you how to use Stable Diffusion WebUI systematically. From basic operations to advanced techniques, we’ll guide you step by step to master WebUI’s powerful features and create high-quality AI-generated images.

This article aims to explain the principles of Stable Diffusion in a more accessible manner. By the end, you will understand the following topics:

  1. Why is fine-tuning necessary?
  2. Challenges in fine-tuning large models.
  3. Overview of common fine-tuning techniques (Dreambooth, LoRA, Embedding, Hypernetwork).

The UNET model, which we introduced earlier, is the core network of Stable Diffusion (SD), containing nearly a billion parameters. Training such a massive model from scratch requires enormous resources: roughly 1.5 billion image-text pairs, 256 A100 GPUs running for about 150,000 GPU hours, and an estimated cost of approximately 600,000 USD.

However, designers face a major challenge: unlike programmers, they typically lack the deep technical knowledge needed to adjust the model’s internal parameters directly.

While UNET is extremely versatile, this very versatility can sometimes lead to a lack of sufficient stylization. As a result, it becomes difficult to meet specific style requirements. To overcome this limitation, we typically fine-tune the large UNET model, customizing it to better align with our needs.

In the following, I will introduce several common model fine-tuning techniques used in Stable Diffusion training. Today, I won’t go into the specifics of how to train the model, but I will explain the core principles behind these techniques. Once you understand these principles, you’ll find it much easier to dive into training methods on your own.

These techniques are powerful tools that allow us to tailor the model to generate images that meet particular requirements, and mastering them will significantly enhance your ability to work with Stable Diffusion.

All fine-tuning techniques for large models aim to solve two main challenges:

  1. Addressing Model Generalization Issues: How can we ensure that the model generalizes well to new data? This issue often presents itself as overfitting or underfitting.
  2. Improving Training Efficiency and Image Quality: How can we reduce the number of training parameters, enhance efficiency, and produce higher-quality generated images?

Generalization refers to a model’s ability to adapt and perform well on unseen data. However, it can lead to two key problems during fine-tuning:


Overfitting: This occurs when the model’s semantic understanding shifts too much due to excessive focus on a specific set of data. For instance, if I repeatedly train the model with images of a cat named Jojo, labeled as “a Jojo cat,” the model may learn this association too well. When I later input “a Jojo cat,” the model will generate the desired cat. However, if I simply input “a cat,” the generated image may appear strange, indicating that the word “cat” has been too narrowly defined, resulting in overfitting.

Underfitting: On the other hand, underfitting happens when the model fails to recognize key features. For example, the model might not understand the connection between “Jojo” and “cat” if it hasn’t been trained properly, often due to insufficient or poor-quality training data.

Now that we understand the problems of overfitting and underfitting, let’s explore how fine-tuning techniques address these challenges. The four most common fine-tuning methods used in Stable Diffusion are:

  1. Dreambooth
  2. LoRA
  3. Embedding
  4. Hypernetwork

These techniques will play a significant role in our study of Stable Diffusion, and you may already be familiar with some of them. In the next section, we’ll dive deeper into these methods and uncover how they work to improve model performance.

Dreambooth, introduced by Google in August 2022, is a fine-tuning technique that addresses overfitting issues effectively. The key to its success lies in pairing “feature words + category” together during training. For instance, if you want to train a model of a cat named JOJO, but want to avoid overfitting the word “cat,” you would input it like this: “JOJO cat is a cat named JOJO.”

This method ensures that the AI understands “JOJO cat” refers to a specific cat, Jojo, without confusing the word “cat” with the unique features of Jojo, thus preventing overfitting.
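As a rough illustration, this “feature word + class” caption pairing can be sketched in code. The helper below is hypothetical, not Dreambooth’s actual training script; it simply shows how instance captions and plain class captions (prior preservation) are built side by side so the general meaning of “cat” survives training.

```python
# Hypothetical sketch of Dreambooth-style caption pairing (not the
# official training code).

def dreambooth_captions(identifier, class_noun, n_instance, n_prior):
    # Instance captions: teach the model what "JOJO cat" looks like.
    instance = [f"a photo of {identifier} {class_noun}"] * n_instance
    # Prior-preservation captions: remind the model what an ordinary
    # member of the class looks like, guarding against overfitting.
    prior = [f"a photo of a {class_noun}"] * n_prior
    return instance, prior

instance, prior = dreambooth_captions("JOJO", "cat", 3, 2)
print(instance[0])  # a photo of JOJO cat
print(prior[0])     # a photo of a cat
```

During training, batches mix both caption sets, so the rare identifier “JOJO” absorbs the new visual features while the class word “cat” keeps its original meaning.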


Dreambooth isn’t designed solely for character or artistic style training. Instead, it focuses on perfectly restoring detailed features from a small number of training images. Its primary goal is to capture and replicate visual features accurately.

When Dreambooth was first introduced, it impressed many at the Microsoft Developer Conference with its ability to restore intricate details. In one example, the model precisely captured a “yellow 3” next to an alarm clock, showing its power to retain and generate specific features.


In that demonstration, a yellow “3” appears to the right of the alarm clock in the training image, and Dreambooth reproduced it perfectly. This shows that Dreambooth can capture the visual features of the subject you want in full detail, which is why it made such an impression when it was first released.

Dreambooth fine-tunes every internal parameter of the UNET network, adjusting each layer for optimal output. This thorough adjustment changes a large portion of the model. As such, the benefits and drawbacks of Dreambooth are quite evident.


Advantages of Dreambooth

  • Perfect Integration of Visual Features: Dreambooth excels at capturing and integrating fine visual details into the model, providing more accurate and detailed outputs.

Disadvantages of Dreambooth

  • Time-Consuming Training: Since it adjusts every internal parameter of the UNET model, Dreambooth requires long training times.
  • Large Model Sizes: Due to the extensive adjustments made, the fine-tuned models tend to become larger, making storage and management more challenging.

In summary, Dreambooth offers exceptional capabilities for fine-tuning models to capture intricate details, but it comes with the trade-offs of longer training times and larger model sizes.

After discussing Dreambooth, let’s move on to LoRA, a technique many of you may have heard about and are eager to learn more about. Let’s dive into its principles and how it differs from Dreambooth.


To understand how LoRA works, we first need to revisit the UNET algorithm, the core network behind Stable Diffusion (SD). If you remember from earlier discussions, UNET is made up of many stacked computational layers. You can think of each layer as a small function, where the output of one layer becomes the input for the next. Through these layers, the model processes and understands the features of the data it receives.
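This “layers as composed functions” picture can be shown with a toy example. The snippet is purely illustrative, not real UNET code: each stand-in function plays the role of one computational layer, and the output of one feeds the next.

```python
# Toy illustration of stacked layers as composed functions
# (not real UNET code).

def layer_one(x):
    return x * 2          # stand-in for one computational layer

def layer_two(x):
    return x + 3          # stand-in for the next layer

def tiny_network(x):
    # The output of one layer becomes the input of the next.
    return layer_two(layer_one(x))

print(tiny_network(5))    # → 13
```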

In Dreambooth, we fine-tune every layer of the UNET model, which requires substantial computational resources, long training times, and results in large model sizes.

LoRA, however, takes a different approach. Its primary goal is to reduce the number of trainable parameters and improve training efficiency. It achieves this by freezing the weights of the pre-trained model and injecting small trainable matrices only into the Transformer (attention) layers. This lets LoRA modify the model’s behavior without disrupting the original structure, making it efficient and easy to apply, similar to a “plug-and-play” system.
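A minimal sketch of the low-rank idea, using NumPy and illustrative layer sizes (not SD’s actual shapes): the pretrained weight W stays frozen, and only two small matrices B and A (rank r) are trained. The effective weight is W + (alpha / r) · B·A, so the adapter can be merged in or removed at will.

```python
import numpy as np

# Illustrative LoRA layer: W is frozen; only A and B would be trained.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 4, 8

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-init

def lora_forward(x):
    # Frozen path plus the low-rank correction B @ (A @ x).
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B initialized to zero, the adapter starts as an exact no-op:
assert np.allclose(lora_forward(x), W @ x)
```

Zero-initializing B is the standard trick: training starts from exactly the pretrained model’s behavior, and the adapter only gradually steers it.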


The advantage of LoRA lies in injecting parameters into far fewer layers than Dreambooth retrains. As a result, LoRA cuts the number of trainable parameters by several orders of magnitude (the original LoRA paper reports reductions of up to 10,000x) and reduces GPU memory requirements by roughly a factor of three. This results in much smaller model files, making LoRA highly practical and accessible.

When you download models from platforms like Civitai, you’ll notice that LoRA models are typically just a few dozen megabytes in size, while full checkpoint models can be several gigabytes. This makes LoRA an excellent choice for users with limited resources or those seeking a lightweight solution for everyday use.
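A quick back-of-envelope calculation shows why the files are so small (illustrative layer size, not SD’s exact dimensions): a full d×k weight matrix stores d·k values, while a rank-r LoRA for the same layer stores only r·(d + k).

```python
# Why LoRA files are tiny: full fine-tuning vs. a rank-8 adapter
# for one hypothetical 4096 x 4096 attention weight.

d = k = 4096
r = 8
full = d * k          # values stored by full fine-tuning: 16,777,216
lora = r * (d + k)    # values stored by the LoRA adapter: 65,536
print(full // lora)   # → 256, i.e. a ~256x reduction for this layer
```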


LoRA as a “Filter” for Style Customization

For beginners, you can think of LoRA as adding a “filter” to an existing model. It guides the model toward the desired style without fundamentally altering its core functionality. For example, starting from the “revAnimated” base model, which leans toward an anime style, I can add a blind-box effect through a LoRA. Keeping the prompt the same but applying this “filter,” the output leans much more toward the blind-box style.

Next, let’s explore Embedding, also known as Textual Inversion, which you may have come across when downloading models from platforms like Civitai. Embedding is a technique that creates a mapping between a prompt and a corresponding vector, and it is especially suited to training models of particular people, objects, or concepts.

To understand how Embedding works, let’s revisit the CLIP model we discussed earlier. As a Text Encoder, CLIP’s main function is to convert natural language prompts into word characteristic vectors, also known as embeddings. This process allows us to train a model to establish a specific mapping between prompts and vectors, which helps in generating images for specific people or objects.


Since Embedding only involves creating a text-to-vector mapping, the resulting model is extremely small, often just a few hundred kilobytes.

Let’s look at an example with a well-known character—D.Va from Overwatch.

If we wanted to generate an image of D.Va, we might need to describe her using thousands of tags, including her physical appearance, personality, and other characteristics. Clearly, typing this long list of tags every time would be cumbersome. This is where Embedding comes in.


Instead of using thousands of tags, we can bundle all these tags into a new word, say OWDva, which doesn’t exist in the CLIP embedding space initially. Because it’s a new word, CLIP will create a new mapping space for it. After training, we can simply use the word OWDva to trigger the model to generate images of D.Va, effectively applying all the tags associated with her in a much more efficient manner.

Advantages of Embedding

  • Compact Size: Embedding models are small, typically only a few hundred kilobytes, making them easy to store and share.
  • Efficient Training: Embedding makes it easy to train specific objects or characters by mapping a complex set of tags to a single term, greatly simplifying the process.
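The OWDva idea can be sketched as follows (NumPy, purely illustrative; real Textual Inversion optimizes the new vector against the diffusion loss while the text encoder and UNET stay frozen). The key point: the only new, trainable weights are one embedding row, which is why the saved file is just kilobytes.

```python
import numpy as np

# Illustrative Textual Inversion: add one new token row to a frozen
# embedding table; only that row would be trained.
rng = np.random.default_rng(0)
vocab = {"a": 0, "cat": 1}
embed_dim = 768
table = rng.normal(size=(len(vocab), embed_dim))   # frozen embeddings

# Register the pseudo-token "OWDva" and give it a trainable vector.
vocab["OWDva"] = len(vocab)
new_vec = rng.normal(size=(1, embed_dim)) * 0.01   # the ONLY trainable weights
table = np.vstack([table, new_vec])

def encode(tokens):
    # Look up each token's vector, as a text encoder's embedding step would.
    return table[[vocab[t] for t in tokens]]

print(encode(["OWDva"]).shape)   # (1, 768): one learned vector
print(new_vec.nbytes)            # 6144 bytes: the whole "model" to save
```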

Finally, let’s briefly discuss Hypernetwork, a technique that is gradually being replaced by LoRA. We’ll provide an overview of its principle, but it’s important to note that Hypernetwork has limitations compared to newer techniques.

Within the context of UNET, the basic principle of Hypernetwork, compared with the two techniques we have already covered, can be summarized as follows:

  • Dreambooth: This technique adjusts the entire UNET model’s functions and parameters. As a result, it has the largest model size and the widest range of applications. However, it also comes with high training difficulty, time, and cost.
  • LoRA: Unlike Dreambooth, LoRA only injects training parameters into specific Transformer functions, leaving the original model largely intact. This makes LoRA plug-and-play, with a controllable model size. It is the focus of our future learning because of its efficiency and practicality.
  • Hypernetwork: Hypernetwork creates a separate neural network model that integrates into the middle layers of the original UNET. During training, the original model’s parameters remain frozen, and only the added network is trained. This enables the model to link the output image to the input instructions, while only modifying a small portion of the original model.
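A simplified sketch of the Hypernetwork idea (NumPy, purely illustrative; in actual SD WebUI hypernetworks, the small trained networks transform the keys and values of the cross-attention layers). The frozen layer weight W is untouched; only the small inserted network’s weights are trainable.

```python
import numpy as np

# Illustrative Hypernetwork insert around a frozen layer: only the
# small network (H1, H2) would be trained.
rng = np.random.default_rng(0)
d, h = 64, 8
W = rng.normal(size=(d, d))            # frozen original layer weight
H1 = rng.normal(size=(h, d)) * 0.01    # trainable: down-projection
H2 = np.zeros((d, h))                  # trainable: up-projection, zero-init

def layer_with_hypernetwork(x):
    # Frozen path plus a small learned correction injected mid-network.
    return W @ x + H2 @ np.tanh(H1 @ x)

x = rng.normal(size=d)
# Zero-initialized output projection: the insert starts as a no-op.
assert np.allclose(layer_with_hypernetwork(x), W @ x)
```

Structurally this resembles LoRA’s low-rank insert, but the added piece is a full (if small) nonlinear network rather than a pure low-rank matrix product, which is part of why it is harder to train well.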

Based on these principles, Hypernetwork is especially suited to training specific styles, such as pixel art. Many of the Hypernetwork models available on platforms like Civitai focus on style-based applications. While it is possible to train models of specific people or objects with Hypernetwork, the process is generally more complex than with LoRA.

In summary, Hypernetwork is a technique that is slowly being phased out in favor of LoRA. If you browse papers on platforms like Zhihu, you’ll notice that most Hypernetwork-related research was published before 2022. This suggests that Hypernetwork is becoming less relevant, with LoRA emerging as the preferred method. You can think of Hypernetwork as a “filter” for style training.

If you’re excited to dive into the world of AI image generation, you’ve come to the right place! Want to create stunning images with Midjourney? Just click on our Midjourney tutorial and start learning! Interested in exploring ComfyUI? We’ve got a detailed guide for that too. Each guide is designed to be simple and fun, helping you master these powerful tools at your own pace. Here, you can learn all the AI knowledge you need, stay updated with the latest AI trends, and let your creativity run wild. Ready to start? Let’s explore the exciting world of AI together!


