diffusers での Stable Diffusion 3 の使い方

2024年6月14日 08:46

以下の記事が面白かったので、簡単にまとめました。

・Diffusers welcomes Stable Diffusion 3

1. Stable Diffusion 3

「SD3」は、3つの異なるテキストエンコーダー (CLIP L/14、OpenCLIP bigG/14、T5-v1.1-XXL)、新しい MMDiT (Multimodal Diffusion Transformer)、および「Stable Diffusion XL」に類似した16チャネルAutoEncoderで構成される潜在拡散モデルです。

「SD3」は、テキスト入力とピクセル潜在を埋め込みシーケンスとして処理します。位置エンコーディングは潜在の2x2パッチに追加され、その後パッチエンコーディングシーケンスに平坦化されます。このシーケンスは、テキストエンコーディングシーケンスとともに MMDiTブロックに送られ、共通の次元に埋め込まれ、連結され、変調されたアテンションとMLPのシーケンスに渡されます。

2つのモダリティの違いを考慮するために、MMDiTブロックは2つの別々の重みセットを使用して、テキストシーケンスと画像シーケンスを共通の次元に埋め込みます。これらのシーケンスはアテンション操作の前に結合され、これにより、両方の表現が独自の空間で機能し、アテンション操作中にもう一方の表現を考慮に入れることができます。テキストと画像データ間のこの双方向の情報の流れは、テキスト情報が固定テキスト表現とのクロスアテンションを介して潜在変数に組み込まれる、テキストから画像への合成の以前のアプローチとは異なります。

「SD3」は、タイムステップ調整の一部として、両方のCLIPモデルからのプールされたテキスト埋め込みも使用します。これらの埋め込みは最初に連結され、タイムステップ埋め込みに追加されてから、各MMDiTブロックに渡されます。

2. Rectified Flow Matching による学習

「SD3」は、アーキテクチャの変更に加えて、conditional flow-matching objective to train the model を適用します。このアプローチでは、フォワードノイズプロセスは、データとノイズ分布を直線で接続する整流フローとして定義されます。

Rectified Flow Matchingサンプリングプロセスはより単純で、サンプリングステップの数を減らすとパフォーマンスが向上します。「SD3」による推論をサポートするために、Rectified Flow Matching定式化とオイラー法ステップを備えた新しいスケジューラ (FlowMatchEulerDiscreteScheduler) を導入しました。また、shiftパラメータを介して、解像度に応じたタイムステップスケジュールのshiftも実装します。shift値を増やすと、より高い解像度でのノイズスケーリングがより適切に処理されます。2Bモデルでは、shift=3.0 を使用することが推奨されています。

3. diffusers での Stable Diffusion 3 の使い方

3-1. セットアップ

(1) HuggingFaceのStable Diffusion 3 のページで、利用規約を受け入れるボタンを押す。

(2) 最新のdiffusersをインストール。

pip install --upgrade diffusers

(3) HuggingFaceにログイン。

huggingface-cli login

3-2. Text-To-Image

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

image = pipe(
    "A cat holding a sign that says hello world",
    negative_prompt="",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image

3-3. Image-To-Image

import torch
from diffusers import StableDiffusion3Img2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusion3Img2ImgPipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")
prompt = "cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"
image = pipe(prompt, image=init_image).images[0]
image

詳しくは、ドキュメントを参照。

4. Stable Diffusion 3 のメモリ最適化

4-1. Stable Diffusion 3 のメモリ最適化

「SD3」は3つのテキストエンコーダーを使用します。そのうちの1つは非常に大きな「T5-XXL」です。このため、fp16精度を使用する場合でも、24GB未満のVRAM を搭載したGPUでモデルを実行するのは困難です。

これを考慮して、diffusersにはメモリ最適化が組み込まれており、「SD3」をより幅広いデバイスで実行できます。

4-2. モデルオフロードによる推論の実行

diffusersで利用できる最も基本的なメモリ最適化では、推論中にモデルのコンポーネントをCPUにオフロードしてメモリを節約できますが、推論のレイテンシはわずかに増加します。モデルのオフロードでは、実行が必要な場合にのみモデルコンポーネントが GPU に移動され、残りのコンポーネントはCPU上に保持されます。

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

prompt = "smiling cartoon dog sits at a table, coffee mug on hand, as a room goes up in flames. “This is fine,” the dog assures himself."
image = pipe(prompt).images[0]

4-3. 推論中にT5テキストエンコーダーを削除

推論中にメモリを大量に消費する4.7Bパラメータの「T5-XXL」テキストエンコーダーを削除すると、パフォーマンスがわずかに低下するだけで、「SD3」のメモリ要件を大幅に削減できます。

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", text_encoder_3=None, tokenizer_3=None, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "smiling cartoon dog sits at a table, coffee mug on hand, as a room goes up in flames. “This is fine,” the dog assures himself."
image = pipe("").images[0]

4-4. T5-XXLモデルの量子化バージョンの使用

bitsandbytesを使用して「T5-XXL」を8bitでロードし、メモリ要件をさらに削減できます。

import torch
from diffusers import StableDiffusion3Pipeline
from transformers import T5EncoderModel, BitsAndBytesConfig

# Make sure you have `bitsandbytes` installed. 
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_id = "stabilityai/stable-diffusion-3-medium-diffusers"
text_encoder = T5EncoderModel.from_pretrained(
    model_id,
    subfolder="text_encoder_3",
    quantization_config=quantization_config,
)
pipe = StableDiffusion3Pipeline.from_pretrained(
    model_id,
    text_encoder_3=text_encoder,
    device_map="balanced",
    torch_dtype=torch.float16
)

詳しくはコードを参照。

5. Stable Diffusion 3 のパフォーマンス最適化

推論のレイテンシを短縮するには、torch.compile() を使用して、vaeとtransformerコンポーネントの最適化された計算グラフを取得します。

import torch
from diffusers import StableDiffusion3Pipeline

torch.set_float32_matmul_precision("high")

torch._inductor.config.conv_1x1_as_mm = True
torch._inductor.config.coordinate_descent_tuning = True
torch._inductor.config.epilogue_fusion = False
torch._inductor.config.coordinate_descent_check_all_directions = True

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16
).to("cuda")
pipe.set_progress_bar_config(disable=True)

pipe.transformer.to(memory_format=torch.channels_last)
pipe.vae.to(memory_format=torch.channels_last)

pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune", fullgraph=True)

# Warm Up
prompt = "a photo of a cat holding a sign that says hello world",
for _ in range(3):
    _ = pipe(prompt=prompt, generator=torch.manual_seed(1))

# Run Inference
image = pipe(prompt=prompt, generator=torch.manual_seed(1)).images[0]
image.save("sd3_hello_world.png")

詳しくはコードを参照。

6. Dreambooth と LoRA のファインチューニング

「LoRA」を活用した「SD3」用の「DreamBooth」ファインチューニングスクリプトも提供しています。このスクリプトは、「SD3」を効率的にファインチューニングするために使用でき、Rectified Flowベースの学習パイプラインを実装するためのリファレンスとして役立ちます。Rectified Flowの他の一般的な実装には、「minRF」があります。

export MODEL_NAME="stabilityai/stable-diffusion-3-medium-diffusers"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="dreambooth-sd3-lora"

accelerate launch train_dreambooth_lora_sd3.py \
  --pretrained_model_name_or_path=${MODEL_NAME}  \
  --instance_data_dir=${INSTANCE_DIR} \
  --output_dir=/raid/.cache/${OUTPUT_DIR} \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of sks dog" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-5 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --weighting_scheme="logit_normal" \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=25 \
  --seed="0" \
  --push_to_hub

詳しくはドキュメントを参照。

この記事が気に入ったらサポートをしてみませんか？