Would love to see more good explanations like this for other models; your explanation is so good.
@MrScorpianwarrior 6 days ago
Hey! I am starting my CompSci Masters program in the Fall, and just wanted to say that I love this video. I've never really had time to sit down and learn PyTorch, so the brevity of this video is greatly appreciated! It gives me a fantastic starting point that I can tinker with, and I have an idea for applying this in a non-conventional way that I haven't seen much research on... Thanks again!
@outliier 6 days ago
Love to hear that! Good luck on your journey!
@bhavyaruparelia7431 9 days ago
Your explanations are simply great! I recommend you return to YouTube to cover the latest papers in this field :)
@WendaoZhao 10 days ago
One CRAZY thing to take from this code (and video): GREEK LETTERS CAN BE USED AS VARIABLE NAMES IN PYTHON
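This is indeed true; Python 3 identifiers may contain non-ASCII letters (PEP 3131), so the Greek symbols from the diffusion equations can be used directly. A minimal demonstration (values are made up for illustration):

```python
# Greek letters are valid Python identifiers under PEP 3131.
α = 0.9        # alpha, as in the noise schedule
β = 1 - α      # beta
ᾱ = α * α      # alpha-bar, e.g. a cumulative product over two steps

print(α, β, ᾱ)
```

This makes code that implements a paper read almost like the paper's own notation, at the cost of being harder to type.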
@astrophage381 10 days ago
These implementation videos are marvelous. You really should do more of them. Big fan of your channel!
@pratyanshvaibhav 12 days ago
The underrated OG channel
@freerockneverdrop1236 12 days ago
At 13:20, the formula is not exact: it is approximately equal, not strictly equal. This should have been made clear.
@iceinmylean3947 15 days ago
Great video! One question: at 22:40 you say "the authors decided to use a simple mean-squared error...". That part isn't clear to me. At this point we are already considering the loss; we need to minimize the given KL divergence. Why is a new loss being introduced here, and how is that justified?
@ShahNawazKhan-jz8wl 15 days ago
Insane!
@ParhamEftekhar 16 days ago
Awesome video.
@ParhamEftekhar 16 days ago
Great explanation. Thanks.
@erenenadream 17 days ago
Nice explanation, handsome dude; you just got a new subscriber.
@blancanthony9992 18 days ago
So far the best model: the fastest, highest-quality image generator on my 3070 GPU. Very, very great!!! I used transformer encoders for "denoising": 95% noise on the first iteration, not pure noise; no signal in the denoiser's inputs. Tested on CIFAR-100 with temperature = 0.7 and top-k = 40, and only 4 steps for denoising!!! Very impressive! It is the first time I have felt so confident about the power of a generative model!!!
@NinadDaithankar5 19 days ago
Amazing video; thanks a lot for going in depth on the math with simplified animations!
@utkarshujwal3286 22 days ago
Great video, buddy. If you could share some more resources for understanding the underlying math, that would be great.
@outliier 22 days ago
Most of the papers I linked have a good amount of the math, though often without detailed explanations. There are some good blog posts on diffusion models that you can easily find, too. I will soon have another video on this topic that should explain things much better.
@rma1563 23 days ago
Appreciate the effort you put into this. You definitely can teach. If only I had a brain for the math... I still got some bits here and there. Thanks.
@JidongLi-lb3zt 25 days ago
Thanks for your detailed introduction.
@khan.saqibsarwar 27 days ago
That's a really nice video. You summarised so much information so concisely, and the explanations were crystal clear. Thanks a lot.
@fcw1310 29 days ago
Usually the KL divergence is expressed as D_KL(q||p) = q * log(q/p) (summed or integrated over q), but in the slide at 16:48 it is written as D_KL(q||p) = log(q/p). Why is q dropped here?
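A possible resolution for later readers: D_KL(q||p) = E_q[log(q/p)], so when the term already sits inside an expectation over q (as in the ELBO derivation), writing just log(q/p) is consistent; the q weighting is carried by where the samples come from. A minimal Monte Carlo sketch with assumed Gaussians q = N(0, 1) and p = N(1, 1), whose analytic KL is 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal_pdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

# Samples drawn FROM q = N(0, 1): this is where the "missing" q enters.
x = rng.normal(0.0, 1.0, 1_000_000)

# Averaging log(q/p) over q-samples IS E_q[log(q/p)] = D_KL(q||p).
mc_kl = np.mean(log_normal_pdf(x, 0.0, 1.0) - log_normal_pdf(x, 1.0, 1.0))

print(mc_kl)  # ≈ 0.5, the analytic KL between N(0,1) and N(1,1)
```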
@fcw1310 a month ago
Thanks for such an amazing illustration of diffusion. One question about the equation in the slide at 13:16: how do you get the x_{t-2} and x_{t-3} terms?

x_t = sqrt(a_t) * x_{t-1} + sqrt(1 - a_t) * e
x_{t-1} = sqrt(a_{t-1}) * x_{t-2} + sqrt(1 - a_{t-1}) * e

x_t = sqrt(a_t) * [sqrt(a_{t-1}) * x_{t-2} + sqrt(1 - a_{t-1}) * e] + sqrt(1 - a_t) * e
    = sqrt(a_t * a_{t-1}) * x_{t-2} + [sqrt(a_t - a_t * a_{t-1}) + sqrt(1 - a_t)] * e

The rightmost term doesn't equal (or come close to) sqrt(1 - a_t * a_{t-1}) * e. Did I misunderstand something? Thanks again. @Outlier
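For what it's worth, the step this derivation usually hinges on is that the two e's are independent draws, so their variances (not their standard deviations) add: (a_t - a_t*a_{t-1}) + (1 - a_t) = 1 - a_t*a_{t-1}, and the sum of the two Gaussian terms is a single Gaussian with standard deviation sqrt(1 - a_t*a_{t-1}). A quick numerical sketch (the example alpha values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
a_t, a_tm1 = 0.9, 0.95   # assumed example values of alpha_t, alpha_{t-1}
n = 1_000_000

# Two INDEPENDENT standard-normal draws: the key point is they are not the same e.
e1 = rng.standard_normal(n)
e2 = rng.standard_normal(n)

# The combined noise term after substituting x_{t-1} into x_t.
combined = np.sqrt(a_t - a_t * a_tm1) * e1 + np.sqrt(1 - a_t) * e2

# Independent Gaussians add in variance, so combined ~ N(0, 1 - a_t * a_tm1).
print(combined.var())       # ≈ 1 - a_t * a_tm1
print(1 - a_t * a_tm1)
```

Summing the square roots directly, as in the comment, would only be valid if both terms used the same noise sample.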
@subashchandrapakhrin3537 a month ago
Very bad video
@outliier a month ago
:(
@user-hm6sh6pl7r a month ago
Thanks for the explanation, it's awesome! But I have a question. In cross-attention, if we set the text as V, the final attention output can be viewed as a weighted sum over the words in V (the "weighted" part coming from the Q and K similarity). If I understand correctly, that output should then contain values in the text domain, so why can we multiply by a W_out projection and get a result in the image domain (and add it to the original image)? Would it make more sense to set the text condition as Q, and the image as K and V?
@outliier a month ago
If the text conditioning were Q, it would not have the same shape as your image. So Q needs to be the image.
@mousamustafa1042 a month ago
I really liked that you showed the derivation in an understandable way.
@raphaelfeigl1209 a month ago
Amazing explanation, thanks a lot! Minor improvement suggestion: add a pop filter to your microphone :)
@tomasjavurek1030 a month ago
I think the statement N(mu, sigma) = mu + sigma * N(0, 1) is not exactly true as written. Just try that transformation: mu plays the role of a translation along the value axis. What is correct is that sampling from the left side acts the same as sampling from the right side. I am pointing this out because I got stuck on it for a while, but I might also have it completely wrong.
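As I understand it, the commenter is right that the equality holds in distribution (this is the reparameterization trick), not as a pointwise identity of density functions. A quick sanity check with assumed values mu = 3, sigma = 2:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 3.0, 2.0, 1_000_000

direct = rng.normal(mu, sigma, n)               # sample N(mu, sigma) directly
reparam = mu + sigma * rng.standard_normal(n)   # mu + sigma * N(0, 1)

# Same distribution, so the summary statistics agree.
print(direct.mean(), reparam.mean())  # both ≈ 3.0
print(direct.std(), reparam.std())    # both ≈ 2.0
```

The reparameterized form matters in practice because it moves the randomness into a fixed N(0, 1), letting gradients flow through mu and sigma during training.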
@tomasjavurek1030 a month ago
Also, later when working with the alphas, there is probably just an approximate equality, restricted to the first order of the Taylor expansion.
@EvanSpades a month ago
Love this - what a fantastic achievement!
@mtolgacangoz a month ago
Brilliant work!
@shojintam4206 a month ago
11:57
@jefersongallo8033 a month ago
This is a really great video; thanks for the big effort you put into explaining it!
@akkokagari7255 a month ago
Wonderful explanation! Not sure if this is in the original papers, but I find it very odd that there is no nonlinear function after V and before W_out. It seems like a waste to me, since Attention@V is itself a linear function, so W_out won't necessarily change the content of the data beyond what Attention@V already would have done through training.
@akkokagari7255 a month ago
Whoops, I meant the similarity matrix, not Attention.
@JeavanCooper a month ago
The strange pattern in the reconstructed image and the generated image is likely caused by the perceptual loss. I have no idea why, but it disappears when I take the perceptual loss away.
@ChristProg a month ago
Thank you so much. But please, I would prefer that you go through the math and operations in more detail for the training of Würstchen 🎉🎉 Thank you.
@RyanHelios a month ago
Really nice video; it helped me understand a lot❗
@mtolgacangoz a month ago
Great video!! At 13:34, is multiplying by a_0 correct?
@user-kx1nm3vw5s a month ago
Best explanation!
@siddharthshah9316 a month ago
This is an amazing video 🔥
@gintonic6204 a month ago
At 12:12, does anyone understand why, when \beta is linear, \sqrt{1-\beta} is linear as well?
@sciencerz7460 a month ago
The statement at 15:33 isn't right... is it? I have a counterexample: f(x) = x^2, g(x) = -x^2. Here f(x) >= g(x), but their derivatives are negatives of each other. Please help, I don't really understand the concept of the ELBO.
@KienLe-md9yv a month ago
At inference, the input of Stage A (the VQGAN decoder) is discrete latents. Continuous latents need to be quantized to discrete latents (each vector in the continuous latents is mapped to the nearest vector in the codebook). But the output of Stage B is continuous latents, and the output of Stage B goes directly into Stage A... is that right? How does Stage A (the VQGAN decoder) handle continuous latents? I checked the VQGAN paper and the Würstchen paper, and this is not clear. Please help me with this. Thank you.
@outliier a month ago
The VQGAN decoder can also decode continuous latents. It's as easy as that.
@KienLe-md9yv a month ago
So, apparently, it sounds like Würstchen is exactly Stage C. Am I right?
@outliier a month ago
What do you mean exactly?
@readbyname a month ago
Hey, great video. Can you tell me why random sampling of codebook vectors doesn't generate meaningful images? In a VAE we sample randomly from a standard Gaussian; why doesn't the same work for VQ autoencoders?
@outliier a month ago
Because in a VAE you only predict a mean and a standard deviation, so sampling from that is easier. Sampling the codebook vectors happens independently, and this is why the output doesn't give a meaningful result.
@Bhllllll 2 months ago
How did you manage to get 128 A100s for 3 weeks? I think the cost is about 100k USD for one run. Assuming you did multiple iterations, the overall cost could easily be 200k for this project.
@ashimdahal18 2 months ago
Just completed a 24-page handwritten note based on this video and a few other sources.
@outliier 2 months ago
Wanna share it? :D
@TheSlepBoi 2 months ago
Amazing explanation and thank you for taking the time to properly visualize everything
@Gruell 2 months ago
Sorry if I am misunderstanding, but at 19:10, shouldn't the code be "uncond_predicted_noise = model(x, t, None)" instead of "uncond_predicted_noise = model(x, labels, None)"? Also, according to the CFG paper's formula, shouldn't the next line be "predicted_noise = torch.lerp(predicted_noise, uncond_predicted_noise, -cfg_scale)" under the definition of lerp?

One last question: have you tried using L1Loss instead of MSELoss? In my implementation, L1 loss performs much better (although my implementation is different from yours). I know the ELBO term expands to essentially an MSE term with respect to the predicted noise, so I am confused as to why L1 loss performs better for my model. Thank you for your time.
@Gruell 2 months ago
Great videos, by the way.
@Gruell 2 months ago
Ah, I see you already fixed the first issue in the codebase.
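For readers puzzling over the torch.lerp question above: both forms reduce to "prediction plus a scaled push from unconditional toward conditional", with the guidance scale offset by one between them. A standalone NumPy sketch (the arrays and scale are made-up illustration values, not the video's actual code):

```python
import numpy as np

def lerp(a, b, w):
    """Linear interpolation with the same convention as torch.lerp: a + w * (b - a)."""
    return a + w * (b - a)

cond = np.array([1.0, 2.0])    # hypothetical conditional noise prediction
uncond = np.array([0.5, 1.0])  # hypothetical unconditional noise prediction
s = 3.0                        # guidance scale

# Form A: push from the unconditional toward the conditional prediction.
guided_a = lerp(uncond, cond, s)    # = uncond + s * (cond - uncond)

# Form B (the commenter's reading of the CFG paper): lerp(cond, uncond, -s).
guided_b = lerp(cond, uncond, -s)   # = cond + s * (cond - uncond)

print(guided_a)  # [2. 4.]
print(guided_b)  # [2.5 5. ]
```

The two differ by exactly one extra step of (cond - uncond), i.e. the same guidance direction with the scale shifted by one, so in practice either works with an appropriately chosen cfg_scale.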
@duduwe8071 2 months ago
Hey @Outlier, at 12:44 it looks like you mistakenly use "a" instead of the "alpha" symbol inside the product (Pi) notation, since the example multiplication below it uses the alpha notation, e.g. for t = 8:

alpha_bar_8 = alpha_1 * alpha_2 * alpha_3 * alpha_4 * alpha_5 * alpha_6 * alpha_7 * alpha_8

Is it intentional, though? Please let me know. Thanks.
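The product (Pi) notation in question is just a cumulative product over the alphas, which is easy to check numerically; a minimal sketch with an assumed linear beta schedule:

```python
import numpy as np

# Assumed linear beta schedule over 8 steps (endpoints follow the common DDPM choice).
T = 8
beta = np.linspace(1e-4, 0.02, T)
alpha = 1.0 - beta

# alpha_bar_t is the running product alpha_1 * alpha_2 * ... * alpha_t.
alpha_bar = np.cumprod(alpha)

# The last entry equals multiplying all eight alphas out by hand, as in the comment.
print(alpha_bar[-1])
```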
@coy457 2 months ago
This is a dumb question, but can anyone explain why, when beta increases linearly, the square root of 1 - beta decreases linearly at 12:13? Shouldn't it have some curve to it, given the square root?
@attilakun7850 14 days ago
Type these two formulas into Desmos:

\beta=\frac{\left(0.02-0.0001\right)}{999}x
\sqrt{1-\beta}

You can see that \sqrt{1-\beta} is indeed non-linear, but it curves very, very slightly in the plotted domain. You have to zoom out the x axis a lot to see the curvature.
@coy457 13 days ago
@@attilakun7850 Ahhh, thank you so much!
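The near-linearity can also be checked numerically; a small sketch using the same schedule endpoints as the Desmos formulas above, measuring how far sqrt(1 - beta) strays from the straight line through its endpoints:

```python
import numpy as np

# Linear beta schedule from 1e-4 to 0.02 over 1000 steps (common DDPM choice).
beta = np.linspace(1e-4, 0.02, 1000)
y = np.sqrt(1.0 - beta)

# Straight line through the endpoints, on the same uniform grid.
line = np.linspace(y[0], y[-1], 1000)

# Maximum deviation from the chord: on the order of 1e-5, far too small to see on a plot.
max_dev = np.abs(y - line).max()
print(max_dev)
```

Because beta stays tiny over the whole schedule, the first-order expansion sqrt(1 - beta) ≈ 1 - beta/2 is extremely accurate, which is why the curve looks like a straight line.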
@antongolles8896 2 months ago
At 22:32 you're missing a bar over the alpha on the bottom line. Please correct me if I'm wrong.
@outliier 2 months ago
You are probably right 🤔
@UnbelievableRam 2 months ago
Hi! Can you please explain why the output is getting two stitched images?
@outliier 2 months ago
What do you mean by two stitched images?
@arka-h274 2 months ago
How did the KL divergence expand to log(q/p)? You yourself mentioned it to be the integral of q * log(q/p) for D_KL(q||p). Perhaps that is too much of a simplification.