New Features in HuggingFace Diffusers v0.28.0

May 28, 2024 07:59

This post summarizes the new features in Diffusers v0.28.0.

1. Diffusers v0.28.0 Release Notes

This post is based on the Diffusers 0.28.0 release notes, which are available on the Diffusers GitHub releases page.

2. Marigold

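Proposed in "Marigold: Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation", Marigold introduces a diffusion model and an associated fine-tuning protocol for monocular depth estimation. It can also be extended to perform surface normal estimation.

The code for depth estimation is as follows.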

import diffusers
import torch

pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-lcm-v1-0", variant="fp16", torch_dtype=torch.float16
).to("cuda")

image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
depth = pipe(image)

vis = pipe.image_processor.visualize_depth(depth.prediction)
vis[0].save("einstein_depth.png")

depth_16bit = pipe.image_processor.export_depth_to_16bit_png(depth.prediction)
depth_16bit[0].save("einstein_depth_16bit.png")
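
To make the 16-bit export step more concrete, here is a minimal sketch of the usual quantization idea, assuming the predicted depth is normalized to the range [0, 1]. The helper depth_to_uint16 is hypothetical and only illustrates the idea; it is not the code behind export_depth_to_16bit_png.

```python
def depth_to_uint16(depth_rows):
    """Quantize depth values in [0, 1] to 16-bit integers (0..65535)."""
    max_value = 2**16 - 1
    quantized = []
    for row in depth_rows:
        # Clamp first so out-of-range values cannot leave the 16-bit range.
        clamped = [min(max(value, 0.0), 1.0) for value in row]
        quantized.append([round(value * max_value) for value in clamped])
    return quantized


# Toy 2x2 "depth map" (assumed values, not real model output).
toy_depth = [[0.0, 0.25], [1.0, 1.5]]
print(depth_to_uint16(toy_depth))
```

A 16-bit PNG keeps 65536 depth levels instead of the 256 levels of an 8-bit visualization, which matters when the depth map is processed further.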

See the API documentation and the pipeline guide for details.

3. Massive Refactor of from_single_file

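from_single_file has been refactored to bring its logic closer to from_pretrained. The biggest benefit is that single-file loading can now be extended beyond Stable Diffusion-like pipelines and models. It also makes it easier to load models that are saved and shared in their original format.

Some of the changes introduced by this refactor are listed below.

(1) When loading a single-file checkpoint, Diffusers uses the keys present in the checkpoint to infer a model repository on the Hugging Face Hub that can be used to configure the pipeline. For example, if you load a single-file checkpoint based on SD 1.5, the config files in the runwayml/stable-diffusion-v1-5 repository are used to configure the model components and the pipeline.

(2) If the inferred configuration is not appropriate for your checkpoint, you can override it with the config argument, passing either a path to a local model repository or a repo ID on the Hugging Face Hub.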

pipe = StableDiffusionPipeline.from_single_file("...", config=<model repo id or local repo path>)

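(3) Model configuration arguments to a pipeline's from_single_file, such as num_in_channels, scheduler_type, image_size, and upcast_attention, are deprecated. They are an anti-pattern supported in earlier versions of the library, when single-file loading was assumed to matter only for Stable Diffusion-based models. Given the demand to support other model types, single-file loading should follow the conventions set by the other loading methods. Configuring individual model components through a pipeline loading method is not supported in from_pretrained, so support for it in from_single_file will be removed as well.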
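
The key-based inference described in (1) and the override in (2) can be pictured with a small sketch. The key prefixes, the mapping to repositories, and the function infer_config_repo are all assumptions made for illustration; the real detection logic in Diffusers checks many more keys and model types.

```python
# Hypothetical key prefixes used only for this illustration.
MARKER_PREFIXES = {
    "sdxl": "conditioner.embedders.",
    "sd-v2": "cond_stage_model.model.",
}

# Repositories whose config files would be used for each guess (assumed mapping).
DEFAULT_REPOS = {
    "sdxl": "stabilityai/stable-diffusion-xl-base-1.0",
    "sd-v2": "stabilityai/stable-diffusion-2-1",
    "sd-v1": "runwayml/stable-diffusion-v1-5",
}


def infer_config_repo(checkpoint_keys, config=None):
    """Guess which Hub repo should supply the config for a checkpoint."""
    if config is not None:
        # An explicit override wins, mirroring the config argument in (2).
        return config
    for model_type, prefix in MARKER_PREFIXES.items():
        if any(key.startswith(prefix) for key in checkpoint_keys):
            return DEFAULT_REPOS[model_type]
    # No marker found: fall back to SD 1.5, as in the example from (1).
    return DEFAULT_REPOS["sd-v1"]


sd15_keys = ["cond_stage_model.transformer.text_model.embeddings.position_ids"]
print(infer_config_repo(sd15_keys))
print(infer_config_repo(sd15_keys, config="./my-local-repo"))
```

The point of the design is that the checkpoint only supplies weights, while everything about architecture and scheduler comes from a regular model repository, just as it does with from_pretrained.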

4. PixArt Sigma

PixArt Sigma is the successor to PixArt Alpha. It can directly generate images at 4K resolution, with markedly higher fidelity and better alignment with the text prompt. It also supports a much longer sequence length of 300 (the maximum for PixArt Alpha is 120).

import torch
from diffusers import PixArtSigmaPipeline

# You can also swap in PixArt-alpha/PixArt-Sigma-XL-2-512-MS
pipe = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
    torch_dtype=torch.float16
)

# Enable memory optimization
pipe.enable_model_cpu_offload()

prompt = "A small cactus with a happy face in the Sahara desert."
image = pipe(prompt).images[0]

See the documentation for details.

5. AnimateDiff SDXL

This is the SDXL version of AnimateDiff. Only a beta release of the motion adapter checkpoint is available, so this is currently an experimental feature.

import torch
from diffusers.models import MotionAdapter
from diffusers import AnimateDiffSDXLPipeline, DDIMScheduler
from diffusers.utils import export_to_gif

adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-sdxl-beta", torch_dtype=torch.float16)

model_id = "stabilityai/stable-diffusion-xl-base-1.0"
scheduler = DDIMScheduler.from_pretrained(
    model_id,
    subfolder="scheduler",
    clip_sample=False,
    beta_schedule="linear",
    steps_offset=1,
)
pipe = AnimateDiffSDXLPipeline.from_pretrained(
    model_id,
    motion_adapter=adapter,
    scheduler=scheduler,
    torch_dtype=torch.float16,
    variant="fp16",
)
# enable_model_cpu_offload() returns None, so call it separately instead of chaining it
pipe.enable_model_cpu_offload()

# enable memory savings
pipe.enable_vae_slicing()
pipe.enable_vae_tiling()

output = pipe(
    prompt="a panda surfing in the ocean, realistic, high quality",
    negative_prompt="low quality, worst quality",
    num_inference_steps=20,
    guidance_scale=8,
    width=1024,
    height=1024,
    num_frames=16,
)

frames = output.frames[0]
export_to_gif(frames, "animation.gif")

See the documentation for details.

6. Block-wise LoRA

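Block-wise LoRA gives fine-grained control over the scales of individual LoRA blocks. Depending on the LoRA checkpoint you are using, this granular control can have a significant impact on the quality of the generated output.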

...

adapter_weight_scales = { "unet": { "down": 0, "mid": 1, "up": 0} }
pipe.set_adapters("pixel", adapter_weight_scales)
image = pipe(
    prompt, num_inference_steps=30, generator=torch.manual_seed(0)
).images[0]
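
To show how such a scale dictionary could be read, here is a minimal sketch that resolves the scale for one attention layer. The function resolve_scale is hypothetical, and the nested per-block and per-layer forms in the second example are assumptions modeled on the same idea; this is not the lookup code used by set_adapters.

```python
def resolve_scale(scales, part, block=None, layer=None, default=1.0):
    """Look up the LoRA scale for one UNet attention layer."""
    unet_scales = scales.get("unet", default) if isinstance(scales, dict) else scales
    if not isinstance(unet_scales, dict):
        return float(unet_scales)  # a single number applies everywhere
    part_scale = unet_scales.get(part, default)
    if not isinstance(part_scale, dict):
        return float(part_scale)  # one scale for the whole part
    block_scale = part_scale.get(block, default)
    if isinstance(block_scale, (list, tuple)):
        return float(block_scale[layer])  # one scale per layer in the block
    return float(block_scale)


# Same shape as the example above: only the mid block keeps its LoRA weights.
mid_only = {"unet": {"down": 0, "mid": 1, "up": 0}}
print(resolve_scale(mid_only, "down"), resolve_scale(mid_only, "mid"))

# Assumed finer-grained form: per-block and per-layer scales in the up part.
nested = {"unet": {"down": 0.9, "up": {"block_0": 0.6, "block_1": [0.4, 0.8, 1.0]}}}
print(resolve_scale(nested, "up", "block_1", 1))
```

Parts that are not listed fall back to the default scale of 1.0 in this sketch, so a sparse dictionary only needs to name the blocks you want to change.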

See the documentation for details.

7. InstantStyle

The same fine-grained control over scales can also be extended to IP-Adapter.

...

scale = {
    "down": {"block_2": [0.0, 1.0]},
    "up": {"block_0": [0.0, 1.0, 0.0]},
}
pipeline.set_ip_adapter_scale(scale)

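This lets you generate images that follow only the style or only the layout of the image prompt, with significantly improved diversity. It works by activating the IP-Adapter only for specific parts of the model.

See the documentation for details.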
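
As a rough illustration of "activating the IP-Adapter only for specific parts", the sketch below lists which layers end up with a non-zero scale. It assumes the InstantStyle convention that down block_2 carries layout and up block_0 carries style, which is not stated above; the helper active_layers is hypothetical.

```python
def active_layers(scale):
    """List (part, block, layer_index) entries whose IP-Adapter scale is non-zero."""
    active = []
    for part, blocks in scale.items():
        for block, layer_scales in blocks.items():
            for index, value in enumerate(layer_scales):
                if value != 0.0:
                    active.append((part, block, index))
    return active


# Assumed convention: down/block_2 influences layout, up/block_0 influences style.
style_and_layout = {
    "down": {"block_2": [0.0, 1.0]},
    "up": {"block_0": [0.0, 1.0, 0.0]},
}
style_only = {"up": {"block_0": [0.0, 1.0, 0.0]}}

print(active_layers(style_and_layout))
print(active_layers(style_only))
```

Either dictionary would be passed to set_ip_adapter_scale; every layer that is not listed or has a scale of 0.0 ignores the image prompt.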

8. ControlNetXS

ControlNet-XS generates images of comparable quality to a regular ControlNet, but it is 20-25% faster (see the StableDiffusion-XL benchmark) and uses up to 45% less memory. ControlNet-XS is supported for both Stable Diffusion and SDXL.

9. Custom Timesteps

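Support for custom timesteps has been introduced in some pipelines and schedulers. You can now configure a scheduler with an arbitrary list of timesteps. For example, with the AYS (Align Your Steps) timestep schedule you can achieve very good results in just 10 steps.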

import torch
from diffusers import DPMSolverMultistepScheduler, StableDiffusionXLPipeline
from diffusers.schedulers import AysSchedules

sampling_schedule = AysSchedules["StableDiffusionXLTimesteps"]
pipe = StableDiffusionXLPipeline.from_pretrained(
    "SG161222/RealVisXL_V4.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, algorithm_type="sde-dpmsolver++")
prompt = "A cinematic shot of a cute little rabbit wearing a jacket and doing a thumbs up"
image = pipe(prompt=prompt, timesteps=sampling_schedule).images[0]
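
For intuition about what a custom timestep list looks like, the sketch below compares an evenly spaced 10-step schedule with a hand-picked one and runs a basic sanity check. It assumes 1000 training timesteps; the custom values are made up for illustration and are not the AYS schedule, and both helper functions are hypothetical.

```python
def uniform_timesteps(num_steps, num_train_timesteps=1000):
    """Evenly spaced, descending timesteps, as a baseline for comparison."""
    stride = num_train_timesteps // num_steps
    return [num_train_timesteps - 1 - i * stride for i in range(num_steps)]


def validate_timesteps(timesteps, num_train_timesteps=1000):
    """Check that a custom schedule is in range and strictly descending."""
    if any(not 0 <= t < num_train_timesteps for t in timesteps):
        raise ValueError("timesteps must lie in [0, num_train_timesteps)")
    if any(a <= b for a, b in zip(timesteps, timesteps[1:])):
        raise ValueError("timesteps must be strictly descending")
    return list(timesteps)


# Made-up schedule that spends more steps at low noise levels.
custom = [999, 850, 700, 560, 430, 310, 200, 120, 60, 15]

print(uniform_timesteps(10))
print(validate_timesteps(custom))
```

A tuned schedule like AYS keeps the number of steps fixed but places them unevenly, which is why it is passed through the timesteps argument rather than num_inference_steps.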

See the documentation for details.

10. device_map in Pipelines

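Experimental support for device_map has been introduced. This feature is relevant when you have multiple accelerators across which to distribute the components of a pipeline. Currently, only the "balanced" device_map strategy is supported.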

from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", 
    torch_dtype=torch.float16, 
    device_map="balanced"
)
image = pipeline("a dog").images[0]

If you are limited to low-VRAM accelerators, device_map can still help. The following example simulates (via the max_memory argument) having access to two GPUs with only 1 GB of VRAM each.

from diffusers import DiffusionPipeline
import torch

max_memory = {0:"1GB", 1:"1GB"}
pipeline = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16, 
    use_safetensors=True, 
    device_map="balanced",
    max_memory=max_memory
)
image = pipeline("a dog").images[0]
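
One way to picture a "balanced" placement under max_memory limits is a greedy assignment of components to the device with the most free memory. The sketch below is only an illustration under that assumption: the component sizes are hypothetical, and the function balanced_device_map is not the algorithm Diffusers actually uses.

```python
def balanced_device_map(component_sizes_gb, max_memory):
    """Greedily place the largest components on the device with the most free memory."""
    free = {}
    for device, limit in max_memory.items():
        if not limit.endswith("GB"):
            raise ValueError("this sketch only understands limits like '1GB'")
        free[device] = float(limit[:-2])

    placement = {}
    by_size = sorted(component_sizes_gb.items(), key=lambda item: item[1], reverse=True)
    for name, size in by_size:
        device = max(free, key=free.get)  # device with the most remaining memory
        if size > free[device]:
            placement[name] = "cpu"  # nothing fits: fall back to the CPU
        else:
            placement[name] = device
            free[device] -= size
    return placement


# Hypothetical component sizes in GB, chosen only to make the example readable.
sizes = {"unet": 0.9, "text_encoder": 0.25, "vae": 0.17}
print(balanced_device_map(sizes, {0: "1GB", 1: "1GB"}))
```

With these made-up sizes the largest component fills most of the first GPU and the smaller ones share the second, which is the kind of spread the max_memory example above is meant to force.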

See the documentation for details.

11. VQGAN Training Script

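VQGAN, proposed in "Taming Transformers for High-Resolution Image Synthesis", is a key component of the modern generative image modeling toolbox. Once trained, its encoder can be used to compute general-purpose tokens from input images. This release adds a script for training it.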

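12. VideoProcessor Class

Similar to the VaeImageProcessor class, VideoProcessor was introduced to simplify video preprocessing and postprocessing and to make them more consistent across pipelines.

See the documentation for details.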

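13. New Guides

Guides and tutorials are provided to help users get started with some of the most frequently used tasks in image and video generation. This release includes three guides on outpainting with different techniques.

・ControlNet Outpainting: Learn how to outpaint using a specific ControlNet model trained for this task. This method is best suited to creative outpainting.
・Differential Diffusion Outpainting: Uses a newer framework that allows the amount of change to be customized per pixel or per image region, producing seamless outpainting. It can be used to extend an image beyond its original size.
・Outpainting using an Inpaint Model: Learn how to outpaint with a regular inpainting model while keeping the original subject intact, using several techniques. It is well suited to product catalogs and similar use cases.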

14. Official Callbacks

Official callbacks that can be easily plugged into a pipeline have been introduced. For example, SDXLCFGCutoffCallback turns off classifier-free guidance once denoising reaches a given step.

import torch
from diffusers import StableDiffusionXLPipeline
from diffusers.callbacks import SDXLCFGCutoffCallback

callback = SDXLCFGCutoffCallback(cutoff_step_ratio=0.4)
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")
prompt = "a sports car at the road, best quality, high quality, high detail, 8k resolution"
out = pipeline(
    prompt=prompt,
    num_inference_steps=25,
    callback_on_step_end=callback,
)
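
The mechanism behind a cutoff callback can be sketched without any model. The toy classes below are hypothetical stand-ins that only track the guidance scale; they assume the cutoff step is computed as the number of steps times cutoff_step_ratio, rounded down, and they are not the actual Diffusers implementation.

```python
class ToyPipeline:
    """Stand-in for a real pipeline: it only records the guidance scale per step."""

    def __init__(self, guidance_scale=7.5):
        self.guidance_scale = guidance_scale
        self.num_timesteps = 0

    def __call__(self, num_inference_steps, callback_on_step_end=None):
        self.num_timesteps = num_inference_steps
        used_scales = []
        for step_index in range(num_inference_steps):
            used_scales.append(self.guidance_scale)  # scale used for this step
            if callback_on_step_end is not None:
                timestep = num_inference_steps - 1 - step_index  # placeholder value
                callback_on_step_end(self, step_index, timestep, {})
        return used_scales


class ToyCFGCutoff:
    """Sets the guidance scale to 0 once the cutoff step has finished."""

    def __init__(self, cutoff_step_ratio):
        self.cutoff_step_ratio = cutoff_step_ratio

    def __call__(self, pipeline, step_index, timestep, callback_kwargs):
        cutoff_step = int(pipeline.num_timesteps * self.cutoff_step_ratio)
        if step_index == cutoff_step:
            pipeline.guidance_scale = 0.0
        return callback_kwargs


used_scales = ToyPipeline()(num_inference_steps=25, callback_on_step_end=ToyCFGCutoff(0.4))
print(used_scales.count(7.5), used_scales.count(0.0))
```

With 25 steps and a ratio of 0.4, the cutoff lands on step index 10 in this sketch, so the remaining steps run without guidance. Skipping guidance in the later steps is where the time saving comes from.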

See the documentation for details.

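15. Community Pipelines and the from_pipe API

A growing number of pipelines start out as community pipelines and graduate to official pipelines once people begin using them frequently.

The from_pipe API has also been introduced. It is very useful for community pipelines that share checkpoints with official pipelines and improve generation quality in some way. With from_pipe(), you can load many community pipelines without additional memory requirements. The API makes it easy to switch between pipelines and apply different techniques.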
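
The reason from_pipe() needs no extra memory is that the new pipeline reuses the component objects of the existing one. The toy classes below are hypothetical and only demonstrate that sharing pattern; they are not Diffusers classes.

```python
class ToyBasePipeline:
    """Minimal pipeline that just holds named components."""

    def __init__(self, **components):
        self.components = components

    @classmethod
    def from_pipe(cls, pipeline, **overrides):
        # Reuse the same component objects; only the overrides are new.
        return cls(**{**pipeline.components, **overrides})


class ToyCommunityPipeline(ToyBasePipeline):
    """Pretend community pipeline with the same components but different logic."""


# Plain objects stand in for the model weights.
base = ToyBasePipeline(unet=object(), vae=object(), scheduler="ddim")
community = ToyCommunityPipeline.from_pipe(base, scheduler="dpm")

print(community.components["unet"] is base.components["unet"])
print(base.components["scheduler"], community.components["scheduler"])
```

Because both pipelines point at the same objects, switching between them costs nothing in memory, while lightweight parts such as the scheduler can still differ per pipeline.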

See the documentation for details.

15-1. BoxDiff

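BoxDiff enables more controlled generation by using bounding box coordinates.

import torch
from diffusers import DiffusionPipeline

# Assumes pipe_sd (an already loaded Stable Diffusion pipeline) and prompt are defined earlier
pipe_box = DiffusionPipeline.from_pipe(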
    pipe_sd,
    custom_pipeline="pipeline_stable_diffusion_boxdiff",
)
pipe_box.enable_model_cpu_offload()
phrases = ["aurora","reindeer","meadow","lake","mountain"]
boxes = [[1,3,512,202], [75,344,421,495], [1,327,508,507], [2,217,507,341], [1,135,509,242]]
boxes = [[x / 512 for x in box] for box in boxes]

generator = torch.Generator(device="cpu").manual_seed(42)
images = pipe_box(
    prompt,
    boxdiff_phrases=phrases,
    boxdiff_boxes=boxes,
    boxdiff_kwargs={
        "attention_res": 16,
        "normalize_eot": True
    },
    num_inference_steps=50,
    generator=generator,
).images

See the documentation for details.

15-2. HD-Painter

HD-Painter enhances inpainting pipelines with improved prompt faithfulness and can generate at higher resolutions (up to 2k).

import torch
from diffusers import DDIMScheduler, DiffusionPipeline
from diffusers.utils import load_image

# Reuses pipe_box from the BoxDiff example above
pipe = DiffusionPipeline.from_pipe(
    pipe_box,
    custom_pipeline="hd_painter"
)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

prompt = "wooden boat"
init_image = load_image("https://raw.githubusercontent.com/Picsart-AI-Research/HD-Painter/main/__assets__/samples/images/2.jpg")
mask_image = load_image("https://raw.githubusercontent.com/Picsart-AI-Research/HD-Painter/main/__assets__/samples/masks/2.png")

image = pipe(prompt, init_image, mask_image, use_rasg=True, use_painta=True, generator=torch.manual_seed(12345)).images[0]

See the documentation for details.

15-3. Differential Diffusion

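Differential Diffusion lets you customize the amount of change per pixel or per image region. It is very effective for inpainting and outpainting.

from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

# Assumes pipe_sdxl (a loaded SDXL pipeline), image (the input image), and mask (the change map) are defined earlier
pipeline = DiffusionPipeline.from_pipe(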
    pipe_sdxl,
    custom_pipeline="pipeline_stable_diffusion_xl_differential_img2img",
).to("cuda")
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config, use_karras_sigmas=True)

prompt = "a green pear"
negative_prompt = "blurry"

image = pipeline(
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=7.5,
    num_inference_steps=25,
    original_image=image,
    image=image,
    strength=1.0,
    map=mask,
).images[0]

See the documentation for details.

15-4. FRESCO

FRESCO (FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation) enables zero-shot video-to-video translation.

See the documentation for details.
