Efficient Text-to-Image Training (16x cheaper than Stable Diffusion) | Paper Explained

12,462 views

Outlier

9 months ago

Würstchen is a diffusion model whose text-conditional component works in a highly compressed latent space of images. Why is this important? Compressing data can reduce computational costs for both training and inference by orders of magnitude. Training on 1024x1024 images is far more expensive than training on 32x32. Usually, other works make use of a relatively small compression, in the range of 4x - 8x spatial compression. Würstchen takes this to an extreme. Through its novel design, we achieve a 42x spatial compression.
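To make the cost argument concrete, here is a back-of-the-envelope sketch (illustrative only; the 24x24 and 128x128 grids are derived from the 42x and 8x figures above, not from exact model configs):

```python
# Back-of-the-envelope: per-step cost scales with the number of latent
# positions the diffusion model works over, so 42x spatial compression
# shrinks the working grid dramatically compared to the usual 4x-8x.
pixel_res = 1024                  # target image resolution
wuerstchen_res = pixel_res // 42  # ~24x24 grid at 42x spatial compression
typical_res = pixel_res // 8      # 128x128 grid at a typical 8x compression

positions_wuerstchen = wuerstchen_res ** 2  # ~576 positions
positions_typical = typical_res ** 2        # 16384 positions
print(positions_typical / positions_wuerstchen)  # ~28x fewer positions per step
```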
If you want to dive even deeper into Würstchen, here are the links to the paper & code:
Arxiv: arxiv.org/abs/2306.00637
Huggingface: huggingface.co/docs/diffusers...
Github: github.com/dome272/wuerst...
We also created a community Discord for people interested in Generative AI:
/ discord

Comments: 81
@outliier · 9 months ago
Join our Discord for Generative AI: discord.com/invite/BTUAzb8vFY
@qiaozhaohui · 9 months ago
Good job!
@NoahElRhandour · 9 months ago
and this link is really not a virus?
@ml-ok3xq · 4 months ago
Congrats on stable cascade 🎉
@user-rw3xm8nv7u · 8 months ago
You are definitely the most detailed and understandable person I have ever seen.
@dbender · 4 months ago
Super nice video which explains the architecture behind Stable Cascade. Stage B was nicely visualized, but I still need a bit more time to fully grasp it. Well done!
@jeanbedry3941 · 9 months ago
This is great; models that are intuitive to understand are the best ones, I find. Great job of explaining it as well.
@ratside9485 · 9 months ago
We need more Würstchen! 🙏🍽️
@macbetabetamac8998 · 9 months ago
Amazing work, mate! 🙏
@dbssus123 · 9 months ago
Awesome!!! I always wait for your videos
@jonmichaelgalindo · 8 months ago
Awesome work and great insights! ❤
@mtolgacangoz · a month ago
Brilliant work!
@EvanSpades · a month ago
Love this - what a fantastic achievement!
@mik3lang3lo · 9 months ago
Great job as always
@adrienforbu5165 · 9 months ago
Amazing explanations, good job
@xyzxyz324 · 8 months ago
well explained, thank you!
@arpanpoudel · 9 months ago
thanks for the awesome content.
@e.galois4940 · 9 months ago
Thanks very much
@lookout816 · 9 months ago
Great video 👍👍
@leab.6600 · 9 months ago
Super helpful
@factlogyofficial · 9 months ago
good job guys !!
@omarei · 9 months ago
Awesome
@mohammadaljumaa5427 · 9 months ago
Amazing job, and I really love the idea of reducing the size of the models, since it just makes so much sense to me!! I have a small question: what GPUs did you use for training? Did you use a cloud provider, or do you have your own local station? If the latter, I'm interested to know which hardware components you have. Just curious, because I'm trying to decide between using cloud providers for training vs buying a local station 😊
@outliier · 9 months ago
Hey there. We were using the stability cluster.
@outliier · 9 months ago
Local would be much more expensive, I guess. What GPUs are you thinking of buying/renting, and how many?
@glazastik_original · 3 months ago
Hi! If it's not a secret, where do you get datasets for training text2img models? Great video!
@jeffg4686 · 3 months ago
Nice !
@timeTegus · 9 months ago
I love the video :) and I would love more detail 😮😮😮😮
@outliier · 9 months ago
Noted! In the case of Würstchen, you can take a look at the paper: arxiv.org/abs/2306.00637
@hayhay_to333 · 9 months ago
Damn, you're so smart thanks for explaining this to us. I hope you'll make millions of dollars.
@outliier · 9 months ago
Haha thank you!
@jollokim1948 · 6 months ago
Hi Dominic,
This is some great work you have accomplished, and definitely a step in the right direction for democratizing the diffusion method. I have some questions, and a little bit of critique, if that's okay.
You say you achieve a compression rate of 42x; however, is this a fair statement when that vector is never decompressed into an actual image? It looks more like your Stage C creates some sort of feature vectors of images in a very low-dimensional space using the text descriptions, which are then used to guide the actual image creation, along with the embedded text, in Stage B. In my opinion, it looks more like you have used Stage C to learn a feature-vector representation of the image, which is used as a condition, similar to how language-free text-to-image models might use the image itself as guidance during training. However, I don't believe this is a 42x image compression without the decompression. Have you tried connecting a decoder to the vectors coming out of Stage C? (I suspect that vector might not be big enough to create high-resolution images because of its dimensionality.)
I hope you can answer some of my questions or clear up any misunderstandings on my part. I'm currently doing my thesis on fast diffusion models and found your concept of extreme compression very compelling. Directions on where to go next regarding this topic would also be very much appreciated :) Best of luck with further research.
@eswardivi · 9 months ago
Amazing work. I am wondering how this video was made, i.e. the editing process and the cool animations.
@outliier · 9 months ago
Thank you a lot! I edit all videos in Premiere Pro, and some of the animations, like the GPU-hours comparison between Stable Diffusion and Würstchen, were made with manim (the library from 3Blue1Brown).
@NoahElRhandour · 9 months ago
🔥🔥🔥
@ChristProg · a month ago
Thank you so much. But please, I would prefer that you go through the maths and operations of training Würstchen in more detail 🎉🎉 thank you
@MiyawMiv · 9 months ago
Awesome
@flakky626 · 4 months ago
Can you please tell us where you studied all of your ML/deep learning? (Courses?)
@truck.-kun. · 5 months ago
This needs more reach!
@darrynrogers204 · 9 months ago
I very much like the image you are using at the opening of the video: the glitchy 3D graph that looks like an image generation gone wrong. How was it generated? Was it intentional, or a bit of buggy code?
@outliier · 9 months ago
Hey, which glitchy 3D graph? Could you give the timestamp?
@darrynrogers204 · 9 months ago
@@outliier The one at 0:01, right at the start. It says "outlier" at the bottom in mashed-up AI text. It's also the same image that you are using for your YouTube banner on your channel page.
@muhammadrezahaghiri · 9 months ago
That is a great project; I am excited to test it. Out of curiosity, how is it possible to fine-tune the model?
@outliier · 9 months ago
Hey, there is no official code for that yet. If you are interested, you can give it a shot yourself. With the diffusers release in the next few days, this should become much easier, I think.
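In the meantime, for plain inference, here is a minimal sketch of what sampling through the diffusers integration looks like; the warp-ai/wuerstchen model id matches the demo Space linked in this thread, but treat the exact pipeline API as an assumption and check the current diffusers docs:

```python
import torch
from diffusers import AutoPipelineForText2Image

# Combined pipeline: Stage C prior + Stage B/A decoder (assumed model id).
pipe = AutoPipelineForText2Image.from_pretrained(
    "warp-ai/wuerstchen", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "an astronaut riding a horse, highly detailed",
    height=1024,
    width=1024,
).images[0]
image.save("wuerstchen_sample.png")
```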
@swannschilling474 · 9 months ago
This is very interesting!! 😊
@davidgruzman5750 · 9 months ago
Thank you for the explanations! I am a bit puzzled: why do we call the state in the inner layers of an autoencoder "latent", since we can actually observe it?
@outliier · 9 months ago
Which "state" are you referring to? The ones from Stage B?
@davidgruzman5750 · 9 months ago
@@outliier I am referring to the one you mention at the 1:27 point of the video. It is probably Stage A.
@outliier · 9 months ago
@@davidgruzman5750 Ah, got it. Well, you can observe it, but you can't really understand it, right? If you print or visualize the latents, they are not really meaningful. There are strategies to make them more meaningful, though, but just by themselves they are hard to understand. That's what we usually call latents, I would say.
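As a rough illustration of "observable but not meaningful", one can encode an image with any pretrained autoencoder and inspect the raw latents; the sketch below uses a Stable Diffusion VAE as a stand-in (not one of Würstchen's stages), and photo.png is a placeholder input:

```python
import numpy as np
import torch
from PIL import Image
from diffusers import AutoencoderKL

# A Stable Diffusion VAE as a stand-in autoencoder for illustration.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

img = Image.open("photo.png").convert("RGB").resize((512, 512))
x = torch.from_numpy(np.array(img)).permute(2, 0, 1).float() / 127.5 - 1.0

with torch.no_grad():
    z = vae.encode(x.unsqueeze(0)).latent_dist.mode()  # shape (1, 4, 64, 64)

# The latents are fully "observable" as numbers, but a printed channel
# carries no obvious pixel-level meaning on its own.
print(z.shape, z.mean().item(), z.std().item())
```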
@TheAero · 9 months ago
Why use a second encoder? Isn't that what the VQGAN is supposed to do?
@outliier · 9 months ago
Yes, but the VQGAN can only do a certain amount of spatial compression; beyond that it gets really bad. That's why we introduce a second one.
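A toy sketch of the stacked-compression idea (purely illustrative layers and channel counts, not Würstchen's actual architecture; the powers of two here reach 32x overall, whereas the paper's 42x comes from its own design):

```python
import torch
import torch.nn as nn

first_stage = nn.Sequential(   # stands in for the first-stage VQGAN encoder
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.GELU(),
    nn.Conv2d(64, 128, 3, stride=2, padding=1),
)                              # two stride-2 convs: 4x spatial compression
second_stage = nn.Sequential(  # stands in for the second, semantic compressor
    nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.GELU(),
    nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.GELU(),
    nn.Conv2d(512, 512, 3, stride=2, padding=1),
)                              # a further 8x, so 32x overall in this toy

x = torch.randn(1, 3, 1024, 1024)
z1 = first_stage(x)    # (1, 128, 256, 256)
z2 = second_stage(z1)  # (1, 512, 32, 32)
print(z1.shape, z2.shape)
```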
@TheAero · 9 months ago
@@outliier So can we replace the GAN encoder with a better pre-trained encoder and reduce the expense of using two encoders instead of one? So fundamentally: start with a simple encoder, then swap in a better pre-trained one and continue training, so that you also improve the decoder?
@JT-hg7mj · 8 months ago
Did you use the same dataset as SDXL?
@streamtabulous · 9 months ago
What about decompression times? Are they faster, and would they use fewer resources on older systems? I'm curious whether the models from this would benefit users; i.e., most people still use the 1.5 and v2 models of SD because decompression with SDXL models takes so long.
@outliier · 9 months ago
Hey, we have a comparison of inference times against SDXL in the blog post here: huggingface.co/blog/wuerstchen And I think the model should be comparable to SD1.X in terms of speed.
@streamtabulous · 9 months ago
@@outliier I thought those were compression-only times, not decompression times; that's awesome to read. People like you are heroes to me.
@outliier · 9 months ago
@@streamtabulous Hey, those bar charts are for full sampling times, from feeding in the prompt until you receive the final image in pixel space. That is so kind of you; I appreciate it a lot. But people like Pablo, the HF team, and the other people helping us out are the real reason this was possible. And I promise this is only the start.
@streamtabulous · 9 months ago
@@outliier The whole team is a godsend. I'm on a disability pension (neuromuscular), so I can't afford pay-to-use tools like Adobe Firefly; it ticks me off that they charge while using Stable Diffusion. Being disabled, I game, so I have a GTX 1070, with an RTX 3060 in another system. One of the things I miss doing is art, and helping people by restoring their photos for free. I have Stable Diffusion on my PCs and I love it; it lets me do stuff I could not do before, including photo restorations, and it makes my life better because that work gives me joy. Knowing, from work like yours and your team's, that in the near future I will be able to do not just better art but better, faster, higher-quality photo restorations for people with my hardware means a massive amount to me. I'm making a video tomorrow to help teach people how I use SD and models to restore photos; I only found SD a few weeks ago, but I am working out how to use it to help people with damaged old photos.
@krisman2503 · 9 months ago
Hey, does it recover from pure noise or from an encoded xT during inference?
@outliier · 9 months ago
During inference you start from pure noise and begin denoising; after every denoising step, you noise the image again, then denoise again, and so on.
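That loop is standard ancestral sampling. A minimal DDPM-style sketch of it, where model and the linear beta schedule are placeholders rather than Würstchen's actual networks and schedule:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)   # placeholder noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample(model, shape):
    x = torch.randn(shape)              # start from pure noise
    for t in reversed(range(T)):
        eps = model(x, torch.full((shape[0],), t))  # predicted noise
        # Denoise: posterior mean given the predicted noise.
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) \
               / torch.sqrt(alphas[t])
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)  # re-noise
        else:
            x = mean
    return x
```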
@aiartbx · 9 months ago
Looks very interesting. Depending on how fast the generation is, real-time diffusion seems closer than expected. Btw, is there any Hugging Face Space demo where we can try this?
@outliier · 9 months ago
Hey thank you! The demo is available here: huggingface.co/spaces/warp-ai/Wuerstchen
@KienLe-md9yv · a month ago
At inference, the input of Stage A (the VQGAN decoder) is discrete latents. Continuous latents need to be quantized into discrete latents (each discrete latent is chosen from the codebook by finding the codebook vector nearest to the continuous vector). But the output of Stage B is continuous latents, and that output goes directly into Stage A... if that is right? How does Stage A (the VQGAN decoder) handle continuous latents? I checked the VQGAN paper and the Würstchen paper, and it is not clear. Please help me with that. Thank you
@outliier · a month ago
The VQGAN decoder can also decode continuous latents. It's as easy as that.
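A sketch of why that works: quantization is just a nearest-codebook lookup that maps a continuous latent onto a codebook vector of the same shape, so the decoder, an ordinary conv net, can consume the continuous latents directly as well (codebook and decoder are placeholders here):

```python
import torch

def quantize(z: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    # z: (B, C, H, W); codebook: (K, C). Snap each spatial vector to its
    # nearest codebook entry; the output shape is identical to the input.
    B, C, H, W = z.shape
    flat = z.permute(0, 2, 3, 1).reshape(-1, C)      # (B*H*W, C)
    idx = torch.cdist(flat, codebook).argmin(dim=1)  # nearest entries
    q = codebook[idx].reshape(B, H, W, C)
    return q.permute(0, 3, 1, 2)

# decoder(quantize(z, codebook))  # the standard discrete VQGAN path
# decoder(z)                      # continuous latents: same shape, also decodable
```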
@beecee793 · 9 months ago
If I need X time for SD inference on a given example GPU, what would I need, and how fast would inference with this be in the same environment? Will it run on my toaster?
@outliier · 9 months ago
Hey, take a look at the blog post. It has an inference-time bar chart: huggingface.co/blog/wuerstchen
@beecee793 · 9 months ago
@@outliier Thank you
@hipy-tz3qt · 9 months ago
Awesome! I have a question: who decided to call it "Würstchen" and why? I am German and just wondering
@akashdutta6235 · 9 months ago
Man who loves hot dogs😂
@outliier · 9 months ago
We called it Würstchen because Pablo is from Spain, and we called our first model Paella. I'm from Germany, so I thought let's name the next model after something German lol
@digiministrator · 7 months ago
Hello, how do I make a seamless pattern with Würstchen? I tried a few prompts, but the edges are always problematic.
@outliier · 7 months ago
Someone on the Discord was talking about using circular padding on the convolutions. Maybe you can try that.
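For anyone who wants to try that suggestion, a hedged sketch assuming a PyTorch model (model is a placeholder for whichever network you patch): switching the convolutions to circular padding makes feature maps wrap around at the borders, which tends to make outputs tile seamlessly:

```python
import torch.nn as nn

def make_seamless(model: nn.Module) -> None:
    # Switch every Conv2d to circular padding so convolutions wrap around
    # the image borders instead of padding with zeros.
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            m.padding_mode = "circular"
```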
@saulcanoortiz7902 · 5 months ago
How do you create the animated videos of NNs? I want to create a YouTube channel explaining theory & code in Spanish. Best regards.
@davidyang102 · 9 months ago
Why do you still use Stage A? Would it be possible to do Stage B directly from the image? I assume the issue is that Stage A is cheaper to train than Stage B?
@outliier · 9 months ago
Yeah, you can. We actually even tried that out, but it takes longer to learn, and as of now we didn't achieve quite the same results with a single compression stage. The VQGAN is just really neat and already provides a free compression, which simplifies things for Stage B a lot, I think. But definitely more experiments could be done here :D
@davidyang102 · 9 months ago
@@outliier Really cool work. Is the use of diffusion models to compress data in this way a generic technique that can be used anywhere? For example, could I use it to compress text?
@pablopernias · 9 months ago
@@davidyang102 The only issue with text is its discrete nature. If you're OK with having continuous latent representations for text instead of discrete tokens, then I think it could theoretically work, although we haven't properly tried anything other than RGB images. The important thing is having a powerful enough signal that the diffusion model can rely on it and only needs to fill in missing details, instead of having to make a lot of information up.
@KienLe-md9yv · a month ago
So, apparently, it sounds like Würstchen is essentially Stage C. Am I right?
@outliier · a month ago
What do you mean exactly?