Trying RPG-DiffusionMaster on Google Colab

January 25, 2024 08:48

I tried RPG-DiffusionMaster on Google Colab, so here is a summary.

1. RPG-DiffusionMaster

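RPG-DiffusionMaster is a framework that uses multimodal LLMs to drive diffusion models for complex, compositional text-to-image generation and editing, achieving state-of-the-art performance. RPG is short for Recaption, Plan and Generate.

It is built on three core strategies.

・Multimodal Recaptioning
・Chain-of-Thought Planning
・Complementary Regional Diffusion

It can extend text-to-image generation with more conditions. Compared with ControlNet, RPG achieves a substantial improvement in prompt understanding and compositional semantic alignment.

It also performs well in (multi-round) text-guided image editing.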

2. Running on Colab

The steps for running it on Colab are as follows.

(1) Install the packages.

# Install packages
!git clone https://github.com/YangLing0818/RPG-DiffusionMaster
%cd RPG-DiffusionMaster
!pip install -r requirements.txt

(2) Reinstall torch.
Running RPG.py raised an error, so torch is reinstalled here.

# Reinstall torch (-y skips the confirmation prompt)
!pip uninstall -y torch
!pip install torch
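
After the reinstall, it can help to confirm which torch build is active and whether it sees the GPU. The following is a minimal sketch, not part of the original steps; `describe_torch` is a hypothetical helper name.

```python
def describe_torch():
    """Return a one-line summary of the installed torch build."""
    try:
        import torch
    except Exception as exc:  # covers a missing or broken installation
        return f"torch could not be imported: {exc}"
    cuda = torch.cuda.is_available()
    return f"torch {torch.__version__}, CUDA available: {cuda}"


print(describe_torch())
```

If CUDA is reported as unavailable, check that the Colab runtime type is set to GPU before running RPG.py.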

(3) Download the repositories.

# Download repositories
!mkdir repositories
!mkdir -p generated_imgs/demo_imgs
!mkdir -p models/Stable-diffusion
%cd repositories
!git clone https://github.com/Stability-AI/generative-models
!git clone https://github.com/Stability-AI/stablediffusion
!git clone https://github.com/sczhou/CodeFormer
!git clone https://github.com/crowsonkb/k-diffusion
!git clone https://github.com/salesforce/BLIP
!mv stablediffusion stable-diffusion-stability-ai
%cd ..
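
A quick check that every clone landed under the expected name can save a failed run later. This sketch assumes it is run from the RPG-DiffusionMaster directory; `missing_repositories` is a hypothetical helper, and the folder names are taken from the commands above.

```python
from pathlib import Path

# Folder names as they should exist after the clone and mv commands above.
EXPECTED_REPOS = [
    "generative-models",
    "stable-diffusion-stability-ai",  # renamed from "stablediffusion" by mv
    "CodeFormer",
    "k-diffusion",
    "BLIP",
]


def missing_repositories(root="repositories"):
    """Return the expected repository folders that are absent under root."""
    root = Path(root)
    return [name for name in EXPECTED_REPOS if not (root / name).is_dir()]


print(missing_repositories())
```

An empty list means all five folders are in place.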

(4) Download the model.
This time, I used animagine-xl.

# Download the model
!wget -P ./models/Stable-diffusion/ https://huggingface.co/Linaqruf/animagine-xl/resolve/main/animagine-xl.safetensors

(5) Run image generation.
In the command below, replace <your OpenAI API key> with an API key obtained from the OpenAI website (paid).

!python RPG.py \
    --user_prompt 'cute cat-eared maid spills water at a coffee shop' \
    --model_name 'animagine-xl.safetensors' \
    --version_number 0 \
    --api_key '<your OpenAI API key>' \
    --use_gpt

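The parameters are as follows.

--user_prompt : the original prompt, a rough summary of the content the image should contain
--model_name : the name of the model file in the models/Stable-diffusion/ folder
--version_number : the class of in-context examples used for generation.
The experiments suggest that, across a range of scenarios, using suitable in-context examples as few-shot samples substantially strengthens the MLLM's planning ability. Since the goal here is to compose a character with multiple attributes, option 0 is selected.
--api_key : the OpenAI API key for using GPT-4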
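
The same invocation can be assembled in Python, which keeps the API key out of the notebook cell text. This is a minimal sketch under assumptions: `build_rpg_command` is a hypothetical helper, it uses only the flags shown above, and the key string is a placeholder.

```python
import shlex


def build_rpg_command(user_prompt, model_name, version_number, api_key, use_gpt=True):
    """Build the argument list for RPG.py from the flags used in this article."""
    cmd = [
        "python", "RPG.py",
        "--user_prompt", user_prompt,
        "--model_name", model_name,
        "--version_number", str(version_number),
        "--api_key", api_key,
    ]
    if use_gpt:
        cmd.append("--use_gpt")
    return cmd


cmd = build_rpg_command(
    "cute cat-eared maid spills water at a coffee shop",
    "animagine-xl.safetensors",
    0,
    "<your OpenAI API key>",  # placeholder, not a real key
)
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)`, and the key can be read at run time with `getpass.getpass()` instead of being typed into the cell.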

The input prompt is as follows.

cute cat-eared maid spills water at a coffee shop

The output of the internal processing (the planning response followed by the resulting log lines) is as follows.

Alright, let's dissect the given caption and plan a detailed image layout using the split ratio method, adhering to your specified rules. We'll focus on segmenting elements and their attributes into distinct regions while crafting a visually cohesive and appealing composition.

### Original Caption:
"cute cat-eared maid spills water at a coffee shop"

### Key phrases identification:
We identify a "cat-eared maid" and two actions or situations: "spills water" and "at a coffee shop." The main focus should be the maid and the spilling action, while the coffee shop context serves as a backdrop.

1. Cute cat-eared maid (subject emphasis)
2. Spilled water (action emphasis)

So we need to split the image into 2 main subregions, and the coffee shop context will be incorporated in the background.

### Split Ratio Planning:
#### Horizontal Split Ratio: 1;1
- This ratio splits the image into two horizontal rows, each to be allocated for one aspect of the scene.

#### Vertical Split Ratio: None
- For this scene, we will not need to divide the image further vertically because each row will capture its respective key phrase.

#### Detailed Subregion Prompts:
1. **First Row** (`1`):
- **Region 0:** Cute cat-eared maid, highlighting her playful cat ears, charming outfit with apron, and a slight blush of embarrassment over the mishap.

2. **Second Row** (`1`):
- **Region 1:** Chaos of the spilled water, emphasizing the dynamic splash on the floor, reflecting the light and the maid's flustered reflection.

#### Composition Logic:
- The first row focuses exclusively on the maid, capturing her unique cat-ear feature, the details of her maid outfit, and her emotional reaction to spilling the water.
- The second row illustrates the spilled water, adding drama and reinforcing the storyline of the scene happening in the coffee shop.

#### Aesthetic Considerations:
- The cat-eared maid's description brings forth a sense of playfulness and charm, focusing on the adorable aspect of her persona in the context of the coffee shop.
- The spilled water on the second row introduces a dynamic element, suggesting movement and an unfolding story within the image.

By following this layout plan, we ensure that each region distinctively captures either a single element or two elements with a special relationship, using descriptive words for textures and colors. The overall layout adheres to the principles of human aesthetics, balancing the elements to create an inviting and cohesive scene.

Now, let's output the split ratio and regional prompt we derived from the planning process.

### Output:
Horizontal split ratio: 1;1
Vertical split ratio: None
Split ratio: 1;1
Regional Prompt: Cute cat-eared maid, highlighting her playful cat ears, charming outfit with apron, and a slight blush of embarrassment over the mishap. BREAK
Chaos of the spilled water, emphasizing the dynamic splash on the floor, reflecting the light and the maid's flustered reflection.
Horizontal split ratio: 1;1
Vertical split ratio: None
Split ratio: 1;1
Regional Prompt: Cute cat-eared maid, highlighting her playful cat ears, charming outfit with apron, and a slight blush of embarrassment over the mishap. BREAK
Chaos of the spilled water, emphasizing the dynamic splash on the floor, reflecting the light and the maid's flustered reflection.
{'split ratio': '1;1', 'Regional Prompt': "Cute cat-eared maid, highlighting her playful cat ears, charming outfit with apron, and a slight blush of embarrassment over the mishap. BREAK\nChaos of the spilled water, emphasizing the dynamic splash on the floor, reflecting the light and the maid's flustered reflection."}
select_checkpoint: animagine-xl.safetensors [6f4f816f9d]
process_script_args (True, False, 'Matrix', 'Columns', 'Mask', 'Prompt', '1;1', 0.3, False, False, False, 'Attention', [False], 0, 0, 0.4, None, 0, 0, False)
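
To see how the final dictionary maps onto the image, the split ratio and the regional prompt can be paired row by row. This is an illustrative sketch, not RPG's own parser. It assumes a rows-only ratio such as "1;1", where each number is a relative row height, and it assumes regions are separated by the BREAK keyword as in the log above. `parse_plan` is a hypothetical name.

```python
def parse_plan(split_ratio, regional_prompt):
    """Pair each row of a rows-only split ratio with its regional prompt.

    Returns a list of (fraction_of_image_height, prompt) tuples.
    Ratios that also contain per-row column splits are not handled here.
    """
    weights = [float(w) for w in split_ratio.split(";")]
    prompts = [p.strip() for p in regional_prompt.split("BREAK")]
    if len(weights) != len(prompts):
        raise ValueError("number of rows and regional prompts differ")
    total = sum(weights)
    return [(w / total, p) for w, p in zip(weights, prompts)]


# Shortened prompts for illustration; the real ones are in the log above.
for fraction, prompt in parse_plan("1;1", "Cute cat-eared maid BREAK\nChaos of the spilled water"):
    print(f"{fraction:.0%} of the height: {prompt}")
```

For the plan above, the maid takes the top half of the image and the spilled water takes the bottom half.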

The image is generated in the generated_imgs folder.
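
To locate the newest output without browsing the file tree, the folder can be scanned from Python. This is a minimal sketch: `list_generated_images` is a hypothetical helper, and the set of file extensions is an assumption about the output format.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}  # assumed output formats


def list_generated_images(folder="generated_imgs"):
    """Return image files under folder (recursively), newest first."""
    folder = Path(folder)
    if not folder.is_dir():
        return []
    files = [p for p in folder.rglob("*") if p.suffix.lower() in IMAGE_SUFFIXES]
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)


for path in list_generated_images():
    print(path)
```

In Colab, a returned path can be shown inline with `IPython.display.Image(filename=str(path))`.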

Memory usage is as follows.