VQ-GAN: Taming Transformers for High-Resolution Image Synthesis | Paper Explained

  19,350 views

Aleksa Gordić - The AI Epiphany

A day ago

❤️ Become The AI Epiphany Patreon ❤️ ► / theaiepiphany
In this video I cover VQ-GAN or Taming Transformers for High-Resolution Image Synthesis.
It uses modified VQ-VAEs and a powerful transformer (GPT-2) to synthesize high-res images.
The important modifications they introduced to VQ-VAE are:
1) replacing the MSE reconstruction loss with a perceptual loss
2) adding an adversarial loss, which makes the images far crisper than the original VQ-VAE, whose outputs were blurry.
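As a rough illustration of how these loss terms could fit together, here is a toy sketch (my own illustration, not the paper's code): `perceptual_loss` is a pixel-MSE stand-in for a real feature-space loss, and the hinge form of the discriminator loss is one common choice for this kind of patch discriminator.

```python
import numpy as np

def perceptual_loss(x, x_hat):
    # Placeholder: a real perceptual loss compares deep network features,
    # not raw pixels. A pixel MSE stands in here purely for illustration.
    return float(np.mean((x - x_hat) ** 2))

def hinge_d_loss(d_real, d_fake):
    # Hinge loss for the patch discriminator: push scores on real patches
    # above +1 and scores on generated patches below -1.
    return float(np.mean(np.maximum(0.0, 1.0 - d_real)) +
                 np.mean(np.maximum(0.0, 1.0 + d_fake)))

def generator_total_loss(x, x_hat, d_fake, codebook_loss, lam):
    # Total generator-side objective: reconstruction (perceptual) loss
    # + codebook/commitment loss + lambda * adversarial term.
    l_rec = perceptual_loss(x, x_hat)
    l_gan = float(-np.mean(d_fake))  # generator wants D(x_hat) to be high
    return l_rec + codebook_loss + lam * l_gan
```

The weighting factor `lam` is not a fixed hyperparameter in the paper; it is computed adaptively from gradient magnitudes (discussed further down in the comments).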
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
✅ Paper: arxiv.org/abs/2012.09841
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
⌚️ Timetable:
00:00 Intro
01:50 A high-level VQ-GAN overview
04:00 Perceptual loss
05:10 Patch-based adversarial loss
06:45 Sequence prediction via GPT
09:50 Generating high-res images
12:45 Loss explained in depth
16:15 Training the transformer
17:50 Conditioning transformer
20:45 Comparisons and results
22:00 Sampling strategies
23:00 Comparisons and results continued
25:00 Rejection sampling with ResNet or CLIP
26:45 Receptive field effects
28:30 Comparisons with DALL-E
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
💰 BECOME A PATREON OF THE AI EPIPHANY ❤️
If these videos, GitHub projects, and blogs help you,
consider helping me out by supporting me on Patreon!
The AI Epiphany ► / theaiepiphany
One-time donation:
www.paypal.com/paypalme/theai...
Much love! ❤️
Huge thank you to these AI Epiphany patreons:
Eli Mahler
Petar Veličković
Zvonimir Sabljic
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
💡 The AI Epiphany is a channel dedicated to simplifying the field of AI through creative visualizations and, in general, a stronger focus on geometric and visual intuition rather than algebraic and numerical "intuition".
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
👋 CONNECT WITH ME ON SOCIAL
LinkedIn ► / aleksagordic
Twitter ► / gordic_aleksa
Instagram ► / aiepiphany
Facebook ► / aiepiphany
👨‍👩‍👧‍👦 JOIN OUR DISCORD COMMUNITY:
Discord ► / discord
📢 SUBSCRIBE TO MY MONTHLY AI NEWSLETTER:
Substack ► aiepiphany.substack.com/
💻 FOLLOW ME ON GITHUB FOR COOL PROJECTS:
GitHub ► github.com/gordicaleksa
📚 FOLLOW ME ON MEDIUM:
Medium ► / gordicaleksa
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
#vqvae #imagesynthesis #gpt

Comments: 29
@TheAIEpiphany
@TheAIEpiphany 3 years ago
What do you get combining DeepMind's VQ-VAE, GANs, perceptual loss, and OpenAI's GPT-2 and CLIP? Well, I dunno, but the results are awesome haha!
@moaidali874
@moaidali874 2 years ago
The in-depth explanation is pretty useful. Thank you so much.
@ronitrastogi9016
@ronitrastogi9016 A year ago
In-depth explanations are a game changer. Keep doing the same. Great work!!
@jisujeon5799
@jisujeon5799 2 years ago
YouTube should have recommended me this channel a year ago. What quality content! Keep it up :D
@TheAIEpiphany
@TheAIEpiphany 2 years ago
Hahah, mysterious are the ways of the YT algorithm. 😅
@johnpope1473
@johnpope1473 3 years ago
I like the low-level stuff. I attempt to read these papers, and your grasp and explanations give me confidence that I can decode them too. Almost always they're built on top of other work. I liked when you distilled that history in the StyleGAN session.
@TheAIEpiphany
@TheAIEpiphany 3 years ago
Thanks! It's a fairly complex tradeoff to decide when to stop digging into the nitty-gritty details. 😅 I am still figuring it out.
@johnpope1473
@johnpope1473 3 years ago
@@TheAIEpiphany I once came across some Python code on GitHub that could take a PDF and create multiple-choice quiz questions from any content. Maybe I could help you one day and have you work out the answers. You remember that sort of thing from physics class, where the teacher makes things clear, eliminating nonsense and elucidating the correct answer.
@alexijohansen
@alexijohansen 2 years ago
So great! Love the explanation of the loss functions.
@daesoolee1083
@daesoolee1083 2 years ago
I think you cover both the high-level explanation and details fairly well :) Keep it up, please.
@akashsuryawanshi6267
@akashsuryawanshi6267 A year ago
Keep it up with the detailed explanations. Those who aren't interested in the low-level stuff can just skip the detailed parts, so it's a win for both. Thank you.
@hoomansedghamiz2288
@hoomansedghamiz2288 3 years ago
Great work and explanation. As you have probably noticed, VQ-VAE is a bit rough to train since the quantization step is not differentiable. In parallel there is the Gumbel-Softmax trick, which is differentiable and therefore easier to train; wav2vec 2.0 uses it. It might be interesting to cover that next :) cheers
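The Gumbel-Softmax relaxation mentioned above fits in a few lines. A toy numpy sketch for a single categorical distribution (my illustration, not wav2vec's code): add Gumbel noise to the logits, then take a temperature-controlled softmax, which approaches a one-hot sample as the temperature goes to zero while staying differentiable.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    # Differentiable approximation to sampling from a categorical
    # distribution parameterized by `logits`.
    if rng is None:
        rng = np.random.default_rng()
    # Gumbel(0, 1) noise via the inverse-CDF trick.
    g = -np.log(-np.log(rng.uniform(1e-10, 1.0, size=logits.shape)))
    y = (logits + g) / tau          # lower tau -> closer to one-hot
    y = np.exp(y - y.max())         # numerically stable softmax
    return y / y.sum()
```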
@MostafaTIFAhaggag
@MostafaTIFAhaggag A year ago
This is a masterpieceee.
@vinciardovangoughci7775
@vinciardovangoughci7775 2 years ago
Great job! The conditioning part is super useful. The paper is confusing there.
@akashraut3581
@akashraut3581 3 years ago
You are on fire 🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥. This video was much needed for me, thank you so much.
@TheAIEpiphany
@TheAIEpiphany 3 years ago
I am just getting started 😂 awesome!
@MuhammadAli-mi5gg
@MuhammadAli-mi5gg 2 years ago
Thanks again, a masterpiece like the VQ-VAE one. But it would be great if you also covered the code like in the VQ-VAE video, perhaps in even more detail. Thanks a lot again!
@xxxx4570
@xxxx4570 2 years ago
Thanks for your awesome explanation of this paper. I want to ask a question: how does the transformer use its own characteristics to achieve autoregressive prediction?
@rikki146
@rikki146 A year ago
15:56 I thought it was arbitrary at first, but later realized it just balances the loss terms, namely L_{rec} and L_{GAN}: if the gradients of L_{GAN} are big, then less weight is put on L_{GAN}, and vice versa.
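Concretely, the paper computes the adaptive weight as λ = ||∇ L_rec|| / (||∇ L_GAN|| + δ), where both gradients are taken with respect to the decoder's last layer. A minimal sketch of that balancing rule, assuming the gradient tensors are already available (the clipping bound here is an illustrative choice, not a value from the paper):

```python
import numpy as np

def adaptive_gan_weight(grad_rec, grad_gan, delta=1e-6, max_weight=1e4):
    # lambda = ||grad(L_rec)|| / (||grad(L_GAN)|| + delta):
    # a large adversarial gradient yields a small lambda, so neither
    # loss term dominates the decoder's update.
    lam = np.linalg.norm(grad_rec) / (np.linalg.norm(grad_gan) + delta)
    return float(np.clip(lam, 0.0, max_weight))
```

This matches the intuition in the comment above: the bigger the L_GAN gradients, the less weight L_GAN receives.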
@kirtipandya4618
@kirtipandya4618 2 years ago
Answer: I find the in-depth explanations very, very useful. 🙂 You could also explain the code here. But great work. Thanks. 👍🏻🙂 Could you please also review the paper "A Disentangling Invertible Interpretation Network for Explaining Latent Representations" from the same authors? It would be great. Thank you. 🙂
@fly-code
@fly-code 3 years ago
thank you sooo much
@TheAIEpiphany
@TheAIEpiphany 3 years ago
You're welcome man!
@marcotroster8247
@marcotroster8247 A year ago
It's always interesting to me how a bit of resource constraint can produce very intelligent, next-gen results, instead of just pumping the model full of weights and using crazy amounts of compute 😂
@vinhphanxuan5654
@vinhphanxuan5654 2 years ago
How did you do it? Can you share with me? Thank you.
@jonathanballoch
@jonathanballoch A year ago
I feel like you lost me on the semantic segmentation → image generation step. You say that the semantic token vector from the semantic VQGAN is appended to the front of the CLS token and then the token vector of... the output VQGAN? And then this length-2N+1 vector is the input, and the output is a length-N vector? How is this possible? Aren't a transformer's input and output necessarily the same length?
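On the question above: a causal transformer does emit one output per input position, so input and output lengths match. Conditioning works because you only read the outputs at the positions that predict image tokens and ignore the outputs over the condition tokens. A toy sketch of that index bookkeeping (my illustration of standard causal decoding, not the repo's code):

```python
def next_token_logit_positions(n_cond, n_img):
    # Input sequence: [c_1 .. c_{n_cond}, s_1 .. s_{n_img}].
    # A causal transformer's output at position i predicts token i+1,
    # so image token s_t is predicted at position (n_cond - 1) + t.
    # We keep exactly n_img positions and discard the rest.
    seq_len = n_cond + n_img
    return list(range(n_cond - 1, seq_len - 1))
```

So the output is not "shorter" than the input: the full-length output is produced, and only the n_img slots that predict image tokens contribute to the loss or to sampling.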
@yasmimrodrigues5437
@yasmimrodrigues5437 2 years ago
Some timestamped segments in the video are not adjacent to each other.
@TheAIEpiphany
@TheAIEpiphany 2 years ago
What exactly do you mean by that?
@TF2Shows
@TF2Shows 3 days ago
The adversarial loss: I think the explanation is wrong. You said the discriminator tries to maximize it; however, you had just shown that it tries to minimize it (the term becomes 0 if D(x) is 1 and D(x̂) is 0). So the discriminator tries to minimize it (and since it's a loss function, that makes sense), and the generator tries to do the opposite, maximize it, to fool the discriminator. So I think you mislabeled the objective: we try to minimize L_GAN (minimize the loss) in order to train the discriminator.
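The two readings in this thread can be reconciled by sign convention: the discriminator maximizes the value function log D(x) + log(1 − D(x̂)), which is the same as minimizing its negation as a loss. A toy sketch using the classic binary-cross-entropy form (for illustration only; VQ-GAN's implementation may use a hinge variant instead):

```python
import math

def discriminator_loss(d_real, d_fake, eps=1e-12):
    # Negated value function, written as a loss the discriminator
    # MINIMIZES: -[log D(x) + log(1 - D(x_hat))].
    # Minimizing this is identical to MAXIMIZING log D(x) + log(1 - D(x_hat)).
    return -(math.log(d_real + eps) + math.log(1.0 - d_fake + eps))
```

A perfect discriminator (D(x) = 1, D(x̂) = 0) drives this loss to 0, its minimum, while the un-negated value function reaches 0 as its maximum, since both log terms are at most 0. Both statements describe the same optimum.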