Hands-On with Stable Diffusion 3 Large

April 23, 2024, 23:51

Hello, this is Browncat.
Stability AI, the developer of Stable Diffusion, announced "Stable Diffusion 3" on February 22, and on April 18 it opened up the APIs for Stable Diffusion 3 and Stable Diffusion 3 Turbo.

So, as I did with Stable Cascade, I generated a number of images, focusing specifically on the photorealistic style.

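Note (2024.6.19): This article covers the "Large" model of the Stable Diffusion 3 series.
The series also includes "Large Turbo", "Ultra" (a higher-end version of Large), and the openly released "Medium".
Please also see my article comparing SD3 Medium/Large/Ultra and the SDXL base model against one another.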

Using the Stable Diffusion 3 API

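To use the Stable Diffusion 3 API, register as a user on the Stability AI Developer Platform and obtain an API key.
Generation with Stable Diffusion 3 is paid: each generation consumes 6.5 credits (4 credits for the Turbo version). New accounts receive 25 free credits, and 1,000 credits can be purchased for 10 USD.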
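To put those numbers in perspective, here is a small back-of-the-envelope sketch. It uses only the credit figures quoted above (which may have changed since); the dictionary keys and function names are my own labels for illustration.

```python
# Credit figures as quoted in this article (April 2024); check current pricing.
CREDITS_PER_IMAGE = {"sd3": 6.5, "sd3-turbo": 4.0}
FREE_CREDITS = 25
USD_PER_CREDIT = 10 / 1000  # 10 USD buys 1,000 credits


def usd_per_image(model: str) -> float:
    """Approximate cost in USD of one generation."""
    return CREDITS_PER_IMAGE[model] * USD_PER_CREDIT


def images_for_credits(model: str, credits: float) -> int:
    """How many whole generations a given credit balance covers."""
    return int(credits // CREDITS_PER_IMAGE[model])


if __name__ == "__main__":
    for model in CREDITS_PER_IMAGE:
        print(
            model,
            f"{usd_per_image(model):.3f} USD/image,",
            images_for_credits(model, FREE_CREDITS), "images on the free credits,",
            images_for_credits(model, 1000), "images per 10 USD",
        )
```

In other words, the free credits cover only three Stable Diffusion 3 images (or six Turbo images), and one image costs roughly 0.065 USD.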
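Once you have an API key, refer to the API reference page and either use Fireworks AI, Google Colab, or the like, or code directly in Python.
Coding it in Python is not particularly hard, and it should work in any environment where "requests" runs properly (that is, where network security is not locked down too tightly).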
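As a rough idea of what the Python route looks like, here is a minimal sketch using "requests". The endpoint URL, header names, and form fields reflect my reading of the v2beta API reference at the time of writing and should be treated as assumptions; the function names and the output file name are made up for illustration.

```python
# Assumption: SD3 endpoint as listed in the v2beta API reference (April 2024).
API_URL = "https://api.stability.ai/v2beta/stable-image/generate/sd3"


def build_request(api_key, prompt, negative_prompt="", model="sd3", seed=0):
    """Return (headers, form fields) for one generation request."""
    headers = {
        "authorization": f"Bearer {api_key}",
        "accept": "image/*",  # ask for raw image bytes instead of JSON
    }
    # Assumption: seed 0 lets the service pick a random seed.
    data = {"prompt": prompt, "model": model, "seed": seed, "output_format": "png"}
    if negative_prompt:
        data["negative_prompt"] = negative_prompt
    return headers, data


def generate(api_key, prompt, out_path="output.png", **kwargs):
    """Send one request and save the returned image to out_path."""
    import requests  # imported here so build_request works without it installed

    headers, data = build_request(api_key, prompt, **kwargs)
    # The dummy "files" entry makes requests send multipart/form-data.
    response = requests.post(
        API_URL, headers=headers, files={"none": ""}, data=data, timeout=120
    )
    response.raise_for_status()  # raises on moderation or other API errors
    with open(out_path, "wb") as f:
        f.write(response.content)
    return out_path
```

Calling `generate("YOUR_API_KEY", "cinematic photo ...", negative_prompt="drawing, painting")` would then consume credits and write one PNG, assuming the endpoint behaves as described.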

Generation Examples

Now for the actual generation examples. For the SDXL and Stable Cascade images used for comparison, please see below.

1. Idol Girl in the Park

【Prompt】

【Positive】 cinematic photo Imagine a young Japanese woman adorned in gothic and lolita fashion, standing in a picturesque garden. She wears a frilled corset and has two-tone hair, a unique blend of blue and purple. Her outfit and surroundings exude the essence of gothic art, creating a captivating and enchanting scene. The contrasting colors of her hair add a vibrant touch to the serene garden setting, making her presence truly mesmerizing . 35mm photograph, film, bokeh, professional, 4k, highly detailed

【Negative】 drawing, painting, crayon, sketch, graphite, impressionist, noisy, blurry, soft, deformed, ugly

2. Woman in a Black Dress

【Prompt】

【Positive】 cinematic photo a Japanese woman young girl ballet dancer wearing black lace dress is posing in the Victorian room . 35mm photograph, film, bokeh, professional, 4k, highly detailed

【Negative】 drawing, painting, crayon, sketch, graphite, impressionist, noisy, blurry, soft, deformed, ugly

Note: Including the word "cleavage", which I had used with SDXL, triggers a content_moderation error at generation time, so I removed it.

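【Review】
In fine detail it surpasses the Stable Cascade base model, and the faces, Asian faces included, look very natural with nothing off about them.
Looking at my other generations as well, I think that as a base model it exceeds SDXL and Stable Cascade in quality.
On the other hand, although it happens in only a small share of images, the arms and hands sometimes fall apart. Musical instruments, a long-standing weak spot of the SD family, still appear to be a weak spot. There are also a few cases, though rare, of broken faces, which I had hardly ever seen in earlier SD models.
That said, the current model is reportedly still being improved, so I look forward to what comes next.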

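3. Text Output: Cafe "Browncat"

One advertised strength of Stable Diffusion 3 is improved text understanding and spelling compared with earlier versions.
So I tested whether text written in the prompt is rendered correctly in the image.
First, an 8-character word: I generated a cafe named "Browncat" together with one of its staff.

【Prompt】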

【Positive】cinematic film still, a young Japanese woman like an idol in a black and white maid cosplay with smile is posing In front of a western classical wooden cafe with sign 'Browncat'. shallow depth of field, vignette, highly detailed, high budget, bokeh, cinemascope, moody, epic, gorgeous, film grain, grainy,
【negative】anime, cartoon, graphic, text, painting, crayon, graphite, abstract, glitch, deformed, mutated, ugly, disfigured

【Review】
I generated several examples, and text of this length is rendered correctly, so the claim appears to hold up.
However, it does not work well with Japanese text.

4. Longer Text Output

The 8-character text came out well, so I tried a longer sentence to see what would happen.

【Prompt】

【Positive】cinematic film still Inside the party room, a beautiful girl like Japanese idol in a red dress is holding a banner 'Thank you for all of my followers.' by her beautiful hands . shallow depth of field, vignette, highly detailed, high budget, bokeh, cinemascope, moody, epic, gorgeous, film grain, grainy
【negative】anime, cartoon, graphic, text, painting, crayon, graphite, abstract, glitch, deformed, mutated, ugly, disfigured

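【Review】
The text was not reproduced 100% of the time, but it came out correctly with quite high probability.
On the other hand, broken hands and fingers occur in a certain share of images (in the image above I deliberately left the area around the broken fingers as is), so the probability of getting both the hands and the text right inevitably drops.
Hands can be fixed with SDXL's inpaint feature.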

5. Variation in Faces

Using the cafe prompt above, I changed the seed and generated several images.

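As you can see, the person's appearance differs considerably from one generation to the next.
Going forward, I would like a feature for maintaining character consistency, such as Midjourney's "cref", or LoRA and InstantID for SD1.5/SDXL.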
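For reference, a seed sweep like the one used in this section can be sketched as follows. The helper only builds the form payloads, which differ in nothing but the seed; the field names and the upper seed bound are assumptions based on my reading of the API reference, and the function name is hypothetical.

```python
import random


def seed_sweep(prompt, negative_prompt, count, rng_seed=None):
    """Build `count` request payloads that differ only in their seed."""
    rng = random.Random(rng_seed)  # fix rng_seed to make the sweep repeatable
    # sample() guarantees distinct seeds; the upper bound is an assumption.
    seeds = rng.sample(range(1, 2**32 - 1), count)
    return [
        {
            "prompt": prompt,
            "negative_prompt": negative_prompt,
            "seed": seed,
            "model": "sd3",
        }
        for seed in seeds
    ]


if __name__ == "__main__":
    for payload in seed_sweep("a wooden cafe with sign 'Browncat'", "anime, cartoon", 4):
        print(payload["seed"])
```

Recording the seed alongside each saved image makes it possible to regenerate a face you liked, although that still does not carry the face over to a different prompt.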

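Content Restrictions on the Developer Platform

My impression is that the Developer Platform's content restrictions are even stricter than those of DALL-E 3 or Midjourney. Besides content_moderation errors, you also get blurred images back even when there is not much exposed skin.
Writing "naked" in the negative prompt should reduce the failure rate.
Note that when a generation fails because of the restrictions, no credits are deducted.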
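When scripting generations, it may help to tell these outcomes apart and to add "naked" to the negative prompt automatically. The sketch below is only an illustration: the error name content_moderation is what I observed, but the idea that a blurred image is flagged with a finish-reason of "CONTENT_FILTERED" is an assumption I have not verified, and both function names are hypothetical.

```python
def classify_result(status_code, error_name=None, finish_reason=None):
    """Sort one API response into 'ok', 'blurred', 'blocked', or 'error'."""
    if status_code == 200:
        # Assumption: blurred output is signalled by this finish reason.
        return "blurred" if finish_reason == "CONTENT_FILTERED" else "ok"
    if error_name == "content_moderation":
        return "blocked"  # rejected before any image was produced
    return "error"


def add_negative_term(negative_prompt, term="naked"):
    """Append `term` to a comma-separated negative prompt if it is missing."""
    terms = [t.strip() for t in negative_prompt.split(",") if t.strip()]
    if term not in terms:
        terms.append(term)
    return ", ".join(terms)


if __name__ == "__main__":
    print(classify_result(403, error_name="content_moderation"))
    print(add_negative_term("drawing, painting, crayon"))
```

A script could call `add_negative_term` on the negative prompt before retrying a "blocked" or "blurred" result, though rewording the positive prompt (as with "cleavage" above) may still be necessary.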
