How AI 'Understands' Images (CLIP) - Computerphile

107,370 views

Computerphile

12 days ago

With the explosion of AI image generators, AI images are everywhere, but how do they 'know' how to turn text strings into plausible images? Dr Mike Pound expands on his explanation of Diffusion models.
This video was filmed and edited by Sean Riley.
Computer Science at the University of Nottingham: bit.ly/nottscomputer
Computerphile is a sister project to Brady Haran's Numberphile. More at www.bradyharanblog.com
Thank you to Jane Street for their support of this channel. Learn more: www.janestreet.com

Comments: 213
@michaelpound9891 11 days ago
As people have correctly noted: when I talk about the way we train at 9:50, I should say we maximise the similarity on the diagonal, not the distance :) Brain failed me!
@adfaklsdjf 10 days ago
we gotcha 💚
@harpersneil 9 days ago
Phew, for a second there I thought you were dramatically more intelligent than I am!
@ArquimedesOfficial 7 days ago
Omg, I'm your fan since spiderman 😆, thanks for the lesson!
@edoardogribaldo1058 11 days ago
Dr. Pound's videos are on another level! He explains things with such passion and a clarity rarely found on the web! Cheers
@joker345172 8 days ago
Dr Pound is just amazing. I love all his videos.
@adfaklsdjf 10 days ago
thank you for "if you want to unlock your face with a phone".. i needed that in my life
@alib8396 9 days ago
Unlocking my face with my phone is the first thing I do when I wake up every day.
@keanualves7977 11 days ago
I'm a simple guy. I see a Mike Pound video, I click.
@jamie_ar 11 days ago
I pound the like button... ❤
@Afr0deeziac 11 days ago
@jamie_ar I see what you did there. But same here 🙂
@BooleanDisorder 11 days ago
I like to see Mike Pound videos too.
@kurdm1482 11 days ago
Same
@MikeUnity 11 days ago
We're all here for an intellectual pounding
@pyajudeme9245 11 days ago
This guy is one of the best teachers I have ever seen.
@eholloway 11 days ago
"There's a lot of stuff on the internet, not all of it good, I should add" - Dr Mike Pound, 2024
@rnts08 10 days ago
Understatement of the century, even for a Brit.
@aprilmeowmeow 11 days ago
Thanks for taking us to Pound town. Great explanation!
@pierro281279 11 days ago
Your profile picture reminds me of my cat! It's so cute!
@pvanukoff 11 days ago
pound town 😂
@rundown132 10 days ago
pause
@aprilmeowmeow 9 days ago
@pierro281279 That's my kitty! She's a ragdoll. That must mean your cat is pretty cute, too 😊
@BrandenBrashear 4 days ago
Pound was hella sassy this day.
@skf957 10 days ago
These guys are so watchable, and somehow they make an inherently inaccessible subject interesting and easy to follow.
@letsburn00 9 days ago
YouTube is like getting the best teacher in school. The world has hundreds or thousands of experts, but being able to explain well is really hard too.
@MichalKottman 11 days ago
9:45 - wasn't it supposed to be "minimize the distance on the diagonal, maximize it elsewhere"?
@michaelpound9891 11 days ago
Absolutely yes! I definitely should have added "the distance" or similar :)
@ScottiStudios 8 days ago
Yes, it should have been *minimise* the distance on the diagonal, not maximise it.
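The corrected objective discussed in this thread (maximise the similarity on the diagonal of the batch's image-text similarity matrix, which implicitly pushes similarity down everywhere else) is CLIP's symmetric contrastive loss. A minimal NumPy sketch, with toy embeddings standing in for real encoder outputs; the temperature value is illustrative (the real model learns it):

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Matching image/caption pairs sit on the diagonal of the similarity
    matrix; the loss rewards high similarity there and low elsewhere.
    """
    # L2-normalise so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (batch, batch) similarity matrix
    labels = np.arange(len(logits))         # image i matches caption i

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # average the image->text and text->image directions
    return (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2
```

With correctly matched toy pairs the loss is near zero; mismatch the pairs and it grows, which is exactly the pressure that pulls matching image/caption embeddings together during training.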
@TheRealWarrior0 9 days ago
A very important bit that was skipped over is how you get an LLM to talk about an image (multimodal LLM)! After you've got your embedding from the vision encoder, you train a simple projection layer that aligns the image embedding with the semantic space of the LLM. You train the projection layer so that the embedding from the vision encoder produces the desired text output describing the image (and/or executing the instructions in the image + prompt). You basically project the "thoughts" of the part that sees (the vision encoder) into the part that speaks (the massive LLM).
@or1on89 6 days ago
That's pretty much what he said after explaining how the LLM infers an image from written text. Did you watch the whole video?
@TheRealWarrior0 6 days ago
@or1on89 What? Inferring an image from written text? Is this a typo? You mean image generation? Anyway, did he make my same point? I must have missed it. Could you point to the minute where he roughly says that? I don't think he ever said anything like "projection layer", or talked about how multimodality in LLMs is "bolted on". It felt to me like he was talking about the actual CLIP paper rather than how CLIP is used in modern systems (like Copilot).
@exceptionaldifference392 4 days ago
I mean, the whole video was about how to align the embeddings of the vision transformer with LLM embeddings of captions of the images.
@TheRealWarrior0 4 days ago
@exceptionaldifference392 To me, the whole video seems to be about the CLIP paper, which is about zero-shot labelling of images. But that is a prerequisite to making something like LLaVA, which is able to talk, ask questions about the image, and execute instructions based on the image content! CLIP can't do that! I described the step from having a vision encoder and an LLM to having a multimodal LLM. That's it.
@TheRealWarrior0 4 days ago
@exceptionaldifference392 To be exceedingly clear: the video is about how you create the "vision encoder" in the first place (which does require you to also train a "text encoder" for matching the image to the caption), not how to attach the vision encoder to the more general LLM.
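The projection-layer idea described in this thread can be sketched as a single linear map from the vision encoder's embedding space into the LLM's token-embedding space. All dimensions and names below are invented for illustration; LLaVA-style systems train a small MLP like this while the encoder and LLM themselves stay frozen:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (invented): the vision encoder emits 512-d embeddings,
# the LLM expects 4096-d token embeddings.
VISION_DIM, LLM_DIM = 512, 4096

# A single learned linear projection; in real systems these weights are
# trained so the LLM produces the desired caption for the image.
W = rng.normal(scale=0.02, size=(VISION_DIM, LLM_DIM))

def project_image_embedding(img_emb):
    """Map a vision-encoder embedding into the LLM's token-embedding space."""
    return img_emb @ W

img_emb = rng.normal(size=(VISION_DIM,))      # stand-in for a CLIP-style image embedding
llm_token = project_image_embedding(img_emb)  # now usable as a 'soft token' for the LLM
```

The projected vector is fed to the LLM exactly as if it were the embedding of a prompt token, which is what lets the language model "see".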
@beardmonster8051 10 days ago
The biggest problem with unlocking a face with your phone is that you'll laugh too hard to hear the video for a minute or so.
@JohnMiller-mmuldoor 8 days ago
Been trying to unlock my face for 10:37 and it's still not working!
@bluekeybo 8 days ago
The man, the myth, the legend, Dr. Pound. The best lecturer on Computerphile.
@Shabazza84 2 days ago
Excellent. I could listen to him all day and even understand stuff.
@wouldntyaliktono 10 days ago
I love these encoder models, and I have seen these methods implemented in practice, usually as part of a recommender system handling unstructured free-text queries. Embeddings are so cool.
@musikdoktor 10 days ago
Love seeing AI problems explained on fanfold paper. Classy!
@AZTECMAN 11 days ago
CLIP is fantastic. It can be used as a 'zero-shot' classifier. It's both effective and easy to use.
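Zero-shot classification with CLIP works by embedding one prompt per candidate class (e.g. "a photo of a cat") and picking the prompt whose embedding is most similar to the image embedding. A sketch of that selection step; the embeddings here are toy placeholders for what the real text and image encoders would produce:

```python
import numpy as np

def zero_shot_classify(image_emb, class_prompts, text_embs):
    """Pick the class whose prompt embedding is most similar to the image.

    `text_embs[i]` is the embedding of `class_prompts[i]`. With a real
    CLIP model these come from the text encoder; here they are placeholders.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                       # cosine similarity per class prompt
    return class_prompts[int(np.argmax(sims))]
```

This is the guess-and-check scheme discussed in the video: you never decode the image embedding back to text, you only compare it against the embeddings of candidate captions.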
@RupertBruce 10 days ago
One day, we'll give these models some high-resolution images and comprehensive explanations, and their minds will be blown! It's astonishing how good even a basic perceptron can be, given 28x28 pixel images!
@rigbyb 10 days ago
6:09 "There isn't red cats" - Mike is hilarious and a great teacher lol
@orange-vlcybpd2 2 days ago
Legend has it that the series will only end when the last sheet of continuous printing paper has been written on.
@codegallant 11 days ago
Computerphile and Dr. Pound ♥️✨ I've been learning AI myself these past few months, so this is just wonderful. Thanks a ton! :)
@IOSARBX 11 days ago
Computerphile, this is great! I liked it and subscribed!
@sebastianscharnagl3173 3 days ago
Awesome explanation
@xersxo5460 6 days ago
Just writing this to crystallize my understanding (and for others to check me for accuracy). Rather than trying to instill "true" understanding (a hard fit in this context, given our semantics), the approach works with the case-specific properties of the medium (a digital image is made of pixels, so only pixel-related properties like colour and position matter) and filters against them, because here it happens to be easier to tell what something isn't than what it is (there are WAAAY more random groups of pixels that aren't an image of a cat, so your sample size for correction is also WAAY bigger). And if you control for the specific property that disqualifies an image (in this case, discrete noise), as he described with the noise-subtraction training that recreates a clean image, you can be even more efficient and effective by starting from already-relevant cases: a smattering of colours is not a cat, so it's easier to assume your training images already contain colours roughly like a cat than to train over the near-infinite combinations of random pixels. As for accuracy through specificity versus scalability, it was simply easier to use the huge sample size to approximate the match between embedded images and texts, because as a sample size increases, precision roughly increases too, given a rule (in crude terms). It's also a way to sidestep mass hard-coding of associations to approximate "meaning", because the system never has to deal directly with the user's inputs in the first place, just with their association values within the embedding space. I think that's a clever use of the properties of a system as limitations to solve for our human "black box" results.
Because the two methods, organic and mathematical, converge on a common factor: digital images are relevant to people precisely because they are useful approximations. We can only care about how close an "image" is to something we know, not whether it actually is that thing, which is why we don't get tripped up over individual pixels when determining the shape of a cat in the average Google search. In the same way, by treating pixel resolution and accuracy as variables, you can quantify the properties so a computer can calculate a usable result. That's so cool!
@sukaina4978 11 days ago
I just feel 10 times smarter after watching any Computerphile video
@Stratelier 10 days ago
When they say "high-dimensional" in the vector context, I like to imagine it like an RPG character stat sheet, as each independent stat on that sheet can be considered its own dimension.
@sbzr5323 11 days ago
The way he explains is very interesting.
@user-dv5gm2gc3u 11 days ago
I'm an IT guy and programmer, but this is kinda hard to understand. Thanks for the video, it gives a little idea about the concepts!
@aspuzling 11 days ago
I'd definitely recommend the last two videos on GPT from 3Blue1Brown. He explains the concept of embeddings in a really nice way.
@zxuiji 10 days ago
Personally I would've just done the colour comparison by putting the 24-bit RGB integer colour into a double (the 64-bit floating-point type) and dividing one by the other. If the result is greater than 0.01 or less than -0.01 then they're not close enough to deem the same overall colour, and thus not part of the same facing of a shape. **Edit:** When searching for images it might be better to use simple line paths (both a 2D and a 3D one) matching the given search text, and compare the shapes identified in the images to those two paths. If at least 20% of the line paths match a shape in the image set, then it likely contains what was searched for. Similarly, when generating images, the line paths would be traced to produce each image and then layered onto one image. Finally, to identify shapes in a given image, you just iterate through all stored line paths. I believe this is how our brains conceptualise shapes in the first place, given that our brains have nowhere to draw shapes to compare against. Instead they just have connections between... cells? Neurons? Someone will correct me. Anyway, they just have connections between what are effectively physical functions that equate to something like this in C: int neuron( float connections[CHAR_BIT * sizeof(uint)] ); Which tells me the same subshapes share neurons for comparisons, so a bigger shape is likely just an initial neuron to visit, how many neurons to visit, and what angle to direct the path at to identify the next neuron to visit. In other words, every subshape could revisit a previous subshape's neuron/function. There might be an extra value or two, but I'm no neural expert, so a rough guess should be accurate enough to get the ball rolling.
@barrotem5627 6 days ago
Brilliant, Mike!
@zzzaphod8507 10 days ago
4:35 "There is a lot of stuff on the internet, not all of it good." Today I learned 😀 6:05 I enjoyed that you mentioned the issues of red/black cats and the problem of cat-egorization. Video was helpful and explained well, thanks.
@stancooper5436 8 days ago
Thanks Mike, nice clear explanation. You can still get that printer paper!? Haven't seen that since my dad worked as a mainframe engineer for ICL in the 80s!
@VicenteSchmitt 9 days ago
Great video!
@Misiok89 6 days ago
6:30 If an LLM has "nodes of meaning", then you could look for those nodes in the description and build classes from them. If you can represent every language with the same nodes of meaning, that's even better for translating text from one language to another than an average non-LLM translator, and you should be able to use it for classification too.
@Funkymix18 9 days ago
Mike is the best
@jonyleo500 11 days ago
At 9:30, doesn't a distance of zero mean the image and caption have the same "meaning"? Therefore, shouldn't we want to minimize the diagonal and maximize the rest?
@michaelpound9891 11 days ago
Yes! We want to maximise the similarity measure on the diagonal - I forgot the word similarity!
@romanemul1 11 days ago
@michaelpound9891 C'mon. It's Mike Pound!
@FilmFactry 11 days ago
When will we see multimodal LLMs able to answer a question with a generated image? It could be "how do you wire an electric socket", and it would generate either a diagram or an illustration of the wire colors and positions. It should be able to do this, but it can't yet. Next would be a functional use of Sora: rendering a video of how you install a starter motor in a Honda.
@jonathan-._.- 11 days ago
Approximately how many samples do I need if I just want to do image categorisation (but with multiple categories per image)?
@pickyourlane6431 6 days ago
I was curious: when you are showing the paper from above, are you transforming the original footage?
@thestormtrooperwhocanaim496 11 days ago
A good edging session (for my brain)
@brdane 11 days ago
Oop. 😳
@Foxxey 11 days ago
14:36 Why can't you just train a network that decodes a vector in the embedded space back into text (either fixed-size or using a recurrent neural network)? Wouldn't it be as simple as training a decoder and encoder in parallel, using the text input of the encoder as the expected output of the decoder?
@or1on89 6 days ago
Because that's a whole different class of problem and would make the process highly inefficient. There are better ways to do that using a different approach.
@IceMetalPunk 10 days ago
For using CLIP as a classifier: couldn't you train a decoder network at the same time as you train CLIP, such that you then have a network that can take image embeddings and produce semantically similar text, i.e. captions? That way you don't have to guess-and-check every class one by one? Anyway, I can't believe CLIP has only existed for 3 years... despite the accelerating pace of AI progress, we really are still in the nascent stages of generalized generative AI, aren't we?
@GeoffryGifari 11 days ago
Can AI say "I don't know what I'm looking at"? Is there a limit to how much it can recognize parts of an image?
@throttlekitty1 10 days ago
No, but it can certainly get it wrong! Remember that it's looking for numerical similarity to things it does know, and by nature it has to come to a conclusion.
@OsomPchic 10 days ago
Well, in some ways. It would say that the picture has these embeddings: cat: 0.3, rainy weather: 0.23, white limo: 0.1, with each number representing how "confident" it is. So with a lot of scores below 0.5 you could say it has no idea what's in the picture.
@ERitMALT00123 10 days ago
Monte Carlo dropout can produce confidence estimates from a model. If the model doesn't know what it's looking at, the confidence should be low. CLIP doesn't have this natively, though.
@el_es 9 days ago
The "I don't know" answer is not received well by users, so there is an understandable aversion to it embedded into the model ;) possibly because it also means more work for the programmers... It would rather hallucinate than say it doesn't know something.
@MilesBellas 10 days ago
Stable Diffusion 3 = potential topic. Optimum workflow strategies using ControlNets, LoRAs, VAEs, etc.?
@el_es 9 days ago
@dr Pound: sorry if this is off topic here, but I wonder if the problem of hallucinations in AI comes from us not treating the "I don't know what I'm looking at" answer of a model as a valid outcome? If it were treated as a valid, neutral answer, could that reduce the rate of hallucinations?
@WilhelmPendragon 23 hours ago
So the vision-text encoder is dependent on the quality of the captioned photo dataset? If so, where do you find quality datasets?
@aleksszukovskis2074 6 days ago
There is stray audio in the background that you can faintly hear at 0:05
@utkua 11 days ago
How do you go from embeddings to text for something never seen before?
@JT-hi1cs 11 days ago
Awesome! I always wondered how the hell the AI "gets" that an image was made with a certain type of lens or film stock. Or how the hell AI generates objects that were never filmed that way, say, The Matrix filmed with a fisheye lens, or in Panavision in the 1950s.
@lancemarchetti8673 5 days ago
Amazing. Imagine the day when AI is able to detect digital image steganography. Not by vision primarily, but by bit inspection... iterating over the bytes and spitting out the hidden data. I think we're still years away from that, though.
@zurc_bot 10 days ago
Where did they get those images from? Any copyright infringement?
@quonxinquonyi8570 3 days ago
The internet has been a huge public repository since its inception
@j3r3miasmg 9 days ago
I didn't read the cited paper, but if I understood correctly, the 5 billion images need to be labeled for the training step?
@StashOfCode 7 days ago
There is a paper on The Gradient about reverting embeddings to text ("Do text embeddings perfectly encode text?")
@genuinefreewilly5706 10 days ago
Great explainer, appreciated. I hope someone will cover AI music next.
@suicidalbanananana 10 days ago
In short: most "AI music stuff" is literally just running stable diffusion in the backend. They train a model on images of spectrograms of songs, ask it to make an image like that, and then convert that spectrogram image back to sound.
@genuinefreewilly5706 9 days ago
@suicidalbanananana Yes, I can see that; however, AI music has made a sudden, marked departure in quality of late. It's pretty controversial among musicians. I can wrap my head around narrow AI applications in music, i.e. mastering, samples, etc. It's been a mixed bag of results until recently.
@or1on89 6 days ago
It surely would be interesting... I can see a lot of people embracing it for pop/trap music and genres with "simple" compositions... My worry as a musician is that it would make the landscape more boring than boy bands in the 90s (and it somewhat already is, without AI being involved). As a software developer I would love instead to explore the tool to refine filters, corrections and sampling during the production process... It's a bit of a mixed bag... The generative aspect is being marketed as the "real revolution" and that's a bit scary... Knowing the tech better and how ML can improve our tools would be great...
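The spectrogram round trip described above (song to spectrogram image, diffusion over images, generated image back to sound) hinges on the spectrogram step. A bare NumPy sketch of computing one; the 440 Hz tone, sample rate, and frame sizes are arbitrary illustration values, and real systems add mel scaling plus an inversion step (e.g. Griffin-Lim or a vocoder) to get back to a waveform:

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: the 'image of sound' such models train on.

    Slices the signal into overlapping frames, applies a Hann window,
    and takes the magnitude of each frame's FFT.
    """
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.abs(np.array([np.fft.rfft(f * window) for f in frames]))

# A pure 440 Hz tone sampled at 8 kHz shows up as one bright row of bins
sr = 8000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
peak_bin = int(spec.mean(axis=0).argmax())  # frequency bin with most energy
```

Each frequency bin spans sr/frame_len Hz (31.25 Hz here), so the tone's energy lands near bin 14; a full song produces the rich 2-D texture the diffusion model learns to imitate.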
@LupinoArts 9 days ago
3:55 As someone born in the former GDR, I find it cute to label a Trabi as "a car"...
@LukeTheB 10 days ago
Quick question from someone outside computer science: does the model actually instill "meaning" into the embedded space? What I mean is: is the Angel between "black car" and "red car" smaller than between "black car" and "bus", and is that smaller than between "black car" and "tree"?
@suicidalbanananana 10 days ago
Yeah that's correct, "black car" and "red car" will be much closer to each other than "black car" and "bus" or "black car" and "tree" would be. It's just pretty hard to visualize this in our minds because we're talking about some strange sort of thousands-of-dimensions-space with billions of data points in it. But there's definitely discernable "groups of stuff" in this data. (Also, "Angle" not "Angel" but eh, we get what you mean ^^)
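The geometry in this thread can be made concrete with cosine similarity, the angle-based measure CLIP-style models use. The 3-d vectors below are invented purely for illustration; real embeddings have hundreds of dimensions:

```python
import numpy as np

def cosine_similarity(a, b):
    """1.0 = same direction (tiny angle), 0.0 = orthogonal (unrelated)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: the two cars share most of their direction,
# while the tree points somewhere else entirely.
black_car = np.array([0.9, 0.1, 0.0])
red_car   = np.array([0.8, 0.3, 0.0])
tree      = np.array([0.0, 0.1, 0.9])

car_vs_car  = cosine_similarity(black_car, red_car)   # small angle
car_vs_tree = cosine_similarity(black_car, tree)      # near-right angle
```

So "a smaller angle" and "closer in the embedding space" are the same statement, which is why the groups of related concepts the reply describes show up as clusters.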
@nenharma82 11 days ago
This is as simple as it is ingenious, and it wouldn't be possible without the internet being what it is.
@IceMetalPunk 10 days ago
True! Although it also requires Transformers to exist, as previous AI architectures would never have been able to handle all the varying contexts, so it's the combination of the scale of the internet and the invention of the Transformer that made it all possible.
@Retrofire-47 9 days ago
@IceMetalPunk The transformer, as someone who is ignorant, what is that? I only know a transformer as a means of converting electrical voltage from AC - DC
@NeinStein 10 days ago
Oh look, a Mike!
@ianburton9223 11 days ago
Difficult to see how convergence can be ensured. Lots of very different functions can be closely mapped over certain controlled ranges but are then wildly different outside those ranges. What I have missed in many AI discussions is these concepts of validity matching and range identities to ensure that there is some degree of controlled convergence. Maybe this is just a human fear of the unknown.
@GeoffryGifari 11 days ago
How can AI determine the "importance" of parts of an image? Why would it output "people in front of boat" instead of "boat behind people" or "boat surrounded by people"? Or maybe the image is a grid of square white cells, and one cell then has its color progressively darkened to black. Would the AI describe these transitioning images differently?
@michaelpound9891 10 days ago
Interesting question! This very much comes down to the training data, in my experience. For the network to learn a concept such as "depth ordering", where something is in front of another thing, what we are really saying is that it has learnt a way to extract features (numbers in grids) representing different objects, and then recognize that an object is obscured, or some other signal that indicates this concept of being in front. For this to happen in practice, we need to see many examples of this in the training data, such that eventually such features occurring in an image lead to a predictable text response.
@GeoffryGifari 10 days ago
@michaelpound9891 The man himself! Thank you for your time
@GeoffryGifari 10 days ago
@michaelpound9891 I picked that example because... maybe it's not just depth? Maybe there are a myriad of factors that the AI summarizes as "important". For example, the man is in front of the boat, but the boat is far enough behind that it looks somewhat small... Or maybe that small boat has a bright color that contrasts with everything else (including the man in front). But your answer makes sense, that it's the training data.
@bennettzug 10 days ago
13:54 You actually probably can, at least to an extent. There's been some recent research on the idea of going backwards from embeddings to text; see the paper "Text Embeddings Reveal (Almost) As Much As Text" (Morris et al.). The same thing has been done with images from a CNN; see "Inverting Visual Representations with Convolutional Networks" (Dosovitskiy et al.). Neither of these uses CLIP models, so maybe future research? (Not that it'd produce better images than a diffusion model.)
@or1on89 6 days ago
You can, using a different type of network/model. We need to remember that all he said is in the context of a specific type of model and not in absolute terms; otherwise the lesson would quickly go out of scope and be hard to follow.
@bennettzug 6 days ago
@or1on89 I don't see any specific reason why CLIP model embeddings would be especially intractable, though.
@eigd 11 days ago
9:48 It's been a while since I took a machine learning class... Anyone care to tell me why I'm thinking of PCA? What's the connection?
@donaldhobson8873 10 days ago
Once you have CLIP, can't you train a diffusion model on pure images, just by putting an image into CLIP and training the diffusion model to output the same image?
@hehotbros01 7 days ago
Poundtown... sweet...
@charlesgalant8271 11 days ago
The answer given for "we feed the embedding into the denoise process" still felt a little hand-wavy to me as someone who would like to understand better, but overall a good video.
@michaelpound9891 10 days ago
Yes, I'm still skipping things :) The process this uses is called attention, which is basically a type of layer we use in modern deep networks. The layer allows features that are related to share information amongst themselves. Rob Miles covered attention a little in the video "AI Language Models & Transformers", but it may well be time to revisit this, since attention has become a lot more mainstream now, being put in all kinds of networks.
@IceMetalPunk 10 days ago
@michaelpound9891 It is, after all, all you need 😁 Speaking of attention: do you think you could do a video (either on Computerphile or elsewhere) about the recent Infini-attention paper? It sounds to me like a form of continual learning, which I think would be super important for getting large models to learn more like humans, but it's also a bit over my head, so I feel like I could be totally wrong about that. I'd appreciate an overview/rundown of it, if you've got the time and desire, please 💗
@unvergebeneid 10 days ago
It's a bit confusing to say that you want to maximise the distances on the diagonal. Of course you can define things however you want, but usually you'd say you want to maximise the cosine similarity and thus minimise the cosine distance on the diagonal.
@ginogarcia8730 9 days ago
I wish I could hear Professor Brailsford's thoughts on AI these days, man
@proc 11 days ago
9:48 I didn't quite get how similar embeddings end up close to each other if we maximize the distances to all other embeddings in the batch? Wouldn't two images of dogs in the same batch be pulled apart just like an image of a dog and an image of a cat would? Explain like Dr. Pound, please.
@drdca8263 10 days ago
First: I don't know. Now I'm going to speculate. Not sure if this has a relevant impact, but there are probably quite a few copies of the same image with different captions, and of the same caption for different images. Again, maybe that doesn't have an appreciable effect, I don't know. Also, maybe the number of image/caption pairs is large compared to the number of dimensions of the embedding vectors? The embedding dimension is pretty high, but maybe the number of pairs is large enough that some need to be fairly close together. Also, the mapping producing the embedding of the image presumably has to be continuous, so images that are sufficiently close in pixel space (though not ones that are only semantically similar) should have to have similar embeddings. Another thing they could do, if it doesn't happen automatically, is use random cropping and other small changes to the images, so that a variety of slightly different versions of the same image are encouraged to have embeddings similar to the embedding of the same caption.
@klyanadkmorr 10 days ago
Heyo, a Pound dogette here!
@EkShunya 9 days ago
I thought diffusion models had a VAE, not a ViT. Correct me if I'm wrong.
@quonxinquonyi8570 3 days ago
A diffusion model is an upgraded version of a VAE, with limitations in sampling speed
@MikeKoss 9 days ago
Can't you do something analogous to stable diffusion for text classification? Get the image embedding, then start with random noisy text and iteratively refine it in the direction of the image's embedding to get a progressively more accurate description of the image.
@quonxinquonyi8570 3 days ago
Image manifolds are of huge dimension compared to text manifolds... so guided diffusion from a low-dimensional manifold to a very high-dimensional one would have less information and more noise. Basic information-theoretic bounds still hold when you transform from a high-dimensional space to a low-dimensional embedding, but the other way around isn't as intuitive... some prior must be taken into account... it's still a hard problem.
@bogdyee
@bogdyee 11 күн бұрын
I'm curios about a thing. If you have a bunch of millions of photos of cats and dogs and they are also correctly labeled (with descriptions) but all these photos have the cats and dogs in the bottom half of the image, will the transformer be able to correctly classify them after training if they are put in the upper half of the image? (or images are rotated, color changed, filtered, etc..).
@Macieks300
@Macieks300 11 күн бұрын
Yes, it may learn it wrong. That's why scale is necessary for this. If you have a million of photos of a cats and dogs it's very unlikely that all of them are in the bottom half of the image.
@bogdyee
@bogdyee 10 күн бұрын
​@@Macieks300 That's why for me it pose a philosophical question. Will these things actually solve intelligence at some point? If so, what exactly might be the difference between a human brain an an artificial one.
@IceMetalPunk
@IceMetalPunk 10 күн бұрын
@@bogdyee Well, think of it this way: humans learn very similarly. It may not seem like it, because the chances of a human only ever seeing cats in the bottom of their vision and never anywhere else is basically zero... but we do. The main difference between human learning and AI learning, with modern networks, is the training data: we're constantly learning and gathering tons of data through our senses and changing environments, while these networks learn in batches and only get to learn from the training data we curate, which tends to be relatively static. But give an existing AI model the ability to do online learning (i.e. continual learning, not "look up on the internet" 😅) and put it in a robot body that it can control? And you'll basically have a human brain, perhaps at a different scale. And embodied AIs are constantly being worked on now, and continual learning for large models... I'm not sure about. I think the recent Infini-Attention is similar, though, so we might be making progress on that as well.
@suicidalbanananana
@suicidalbanananana 10 күн бұрын
@@bogdyee Nah, they won't solve intelligence going down the route they're currently on. The AI industry was working on actual "intelligence" for a while, but all this hype about shoving insane amounts of training data into "AI" has reduced the field to writing overly complex search engines that sort of mix results together... 🤷‍♂ It's not trying to think or understand anything at all at this stage (which is the actual goal of the AI field), it's really just trying to match patterns: "Ah, the user talked about dogs, my training data contains the following info about dog types a/b/c; oh, the user asks about trees, training data contains info about tree types a/b/c", etc. Actual AI (not even getting to 'general AI' yet, but certainly something much better than what we have now) would have little to no training data at all; instead it would start 'learning' as it's running, so you would talk to it about trees and it would go "idk what a tree is, please tell me more", and later on it might have some basic understanding: "ah yes, trees, I have heard about them, person x explained them to me, they let you all breathe & exist in types a/b/c, right? please tell me more about trees". Where the weirdness lies is that the companies behind current "AI" are telling the "AI" to respond in a similarly smart manner, so they are starting to APPEAR smart, but they're not actually capable of learning. None of the current AIs remember any conversation they have had outside of training, because remembering makes it super easy to turn Bing (or whatever) into yet another racist twitter bot (see Microsoft's history with AI chatbots)
@suicidalbanananana
@suicidalbanananana 10 күн бұрын
@@IceMetalPunk The biggest difference is that we (or any other biological intelligence) don't need insanely large amounts of training data: show a baby some spoons and forks and how to use them, and that person will recognize and use 99.9% of spoons and forks correctly for the rest of their life. Current overhyped AIs would have to see thousands of spoons and forks to maybe get it right 75% of the time, and that's just recognizing them; we're not even close to 'understanding how to use'. Also worth noting is that we (and again, any other biological intelligence) are always training, and much more versatile when it comes to new things: if you train an AI to recognize spoons and forks and then show it a knife, it's just going to classify it as a fork or spoon, whereas we would go "well, that's something I've not seen before, so it's NOT a spoon and NOT a fork"
@nightwishlover8913
@nightwishlover8913 10 күн бұрын
5.02 Never seen a "boat wearing a red jumper" before lol
@fredrik3685
@fredrik3685 9 күн бұрын
Question 🤚 Up until recently, all images of a cat on the internet were photos of real cats, and the system could use them in training. But now more and more cat images are AI-generated. If future systems use generated images in training it will be like the blind leading the blind: more and more distortion will be added. Or? Can that be avoided?
@quonxinquonyi8570
@quonxinquonyi8570 3 күн бұрын
Distortion and perceptual quality are the tradeoff we make when we use generative AI
@MattMcT
@MattMcT 10 күн бұрын
Do any of you ever get this weird feeling that you need to buy Mike a beer? Or perhaps, a substantial yet unknown factor of beers?
@Rapand
@Rapand 10 күн бұрын
Each time I watch one of these videos, I might as well be watching Apocalypto without subtitles. My brain is not made for this 🤓
@MedEighty
@MedEighty 6 күн бұрын
10:37 "If you want to unlock a face with your phone". Ha ha ha!
@bryandraughn9830
@bryandraughn9830 10 күн бұрын
I wonder if every cat image has specific "cat" types of numerical curves, textures, eyes and so on, such that a purely numerical calculation would conclude the image is of a cat. There's only so much variety of pixel arrangements at a given resolution, so it seems like images could be reduced to pure math. I'm probably so wrong. Just curious.
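This comment's intuition is right at the lowest level: an image really is just a grid of numbers, and every "feature" is arithmetic over them. A toy sketch, using a made-up 4x4 grayscale image and the crudest possible numeric feature (mean brightness) — nothing like what a real classifier computes:

```python
# A hypothetical 4x4 grayscale "image": just a grid of numbers (0-255).
image = [
    [ 10,  20,  30,  40],
    [ 50,  60,  70,  80],
    [ 90, 100, 110, 120],
    [130, 140, 150, 160],
]

# Any image "feature" is ultimately arithmetic over these numbers.
# Mean brightness is the simplest possible example of such a feature.
pixels = [p for row in image for p in row]
mean_brightness = sum(pixels) / len(pixels)

print(mean_brightness)  # → 85.0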
@quonxinquonyi8570
@quonxinquonyi8570 3 күн бұрын
You are absolutely right… images are of very high dimension, but the image manifold is considered to fill only a very low-dimensional region of the whole image hyperspace… the only way to manipulate or tweak that image manifold is by adding noise, but noise is of very low dimension compared to that high-dimensional image manifold, so that perturbation or guidance in the form of noise disturbs it along one of its many inherent directions… this is similar to finding the slope of a curve (manifold) by linearly approximating it with a line (noise), the method you learn in high-school maths… if you want to discuss more, I will clarify it further…
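The slope analogy in this comment can be made concrete: approximating a curve locally with a line is a finite-difference derivative. A minimal sketch (the function f and the step size h are arbitrary choices for illustration):

```python
# Locally approximate the curve f(x) = x^2 by a line (finite difference),
# echoing the comment's analogy of probing a manifold with a small perturbation.
def f(x):
    return x ** 2

x = 3.0
h = 1e-6          # a tiny "perturbation", analogous to the added noise
slope = (f(x + h) - f(x)) / h

print(round(slope, 3))  # → 6.0, matching the true derivative 2x at x = 3
```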
@MuaddibIsMe
@MuaddibIsMe 11 күн бұрын
"a mike"
@creedolala6918
@creedolala6918 10 күн бұрын
'and we want an image of foggonstilz' me: wat 'we want to pass the text of farngunstills' me: u wot m8
@CreachterZ
@CreachterZ 11 күн бұрын
How does he stay on top of all of this technology and still have time to teach? …and sleep?
@MilesBellas
@MilesBellas 10 күн бұрын
Stable Diffusion needs a CEO BTW ....just saying ... 😅
@babasathyanarayanathota8564
@babasathyanarayanathota8564 10 күн бұрын
Me: added to resume ai expert
@Ginto_O
@Ginto_O 11 күн бұрын
a yellow cat is called red cat
@RawrxDev
@RawrxDev 11 күн бұрын
Truly a marvel of human applications of mathematics and engineering, but boy do I think these tools have significantly more cons than pros in practical use.
@aprilmeowmeow
@aprilmeowmeow 11 күн бұрын
agreed. The sheer power required is an ethical concern
@suicidalbanananana
@suicidalbanananana 10 күн бұрын
We're currently experiencing an "AI bubble" that will pop within 2-3 years or less, no doubt about that at all. Companies are wasting money and resources trying to be the first to make something crappy appear less crappy than it actually is, but they don't fully realize yet that that's a harder task than it might seem, and it's going to be extremely hard to monetize the end result. We need to move back to AI research trying to recreate a biological brain; somehow the field has suddenly been reduced to people trying to recreate a search engine that mixes results together or something, which is just ridiculous and running in the opposite direction the AI field should be heading.
@RawrxDev
@RawrxDev 9 күн бұрын
@@suicidalbanananana That's my thought as well. I even recently watched a clip of Sam Altman saying they have no idea how to actually make money from AI without investors, and that he is just going to ask the AGI how to make a return once they achieve AGI, which to me seems..... optimistic.
@willhart2188
@willhart2188 10 күн бұрын
AI art is great.
@FLPhotoCatcher
@FLPhotoCatcher 11 күн бұрын
At 16:20 the 'cat' looks more like a shower head.
@djtomoy
@djtomoy 9 күн бұрын
Why is there always so much mess and clutter in the background of these videos? Do you film them in abandoned buildings?
@MagicPlants
@MagicPlants 10 күн бұрын
the gorilla camera moving around all the time is making me dizzy
@SkEiTaDEV
@SkEiTaDEV 10 күн бұрын
Isn't there an AI that fixed shaky video by now?
@creedolala6918
@creedolala6918 10 күн бұрын
Isn't that a problem that's been solved without AI already? Someone can ride on a mountain bike that's violently shaking, down a forest trail, with a GoPro on his helmet, and we get perfect smooth video of it somehow.
@JeiShian
@JeiShian 10 күн бұрын
The exchange at 6:50 made me laugh out loud and I had to show that part of the video to the people around me😆😆
@grantc8353
@grantc8353 10 күн бұрын
I swear that P took longer to come up than the rest.
@creedolala6918
@creedolala6918 10 күн бұрын
Normally this guy is great for explaining things and super clear, but in this case it feels like he's kind of assuming some prior knowledge or understanding, and not really giving us the 'explain it like I'm five years old' version. And for a subject this complicated, you kind of need that.
@quonxinquonyi8570
@quonxinquonyi8570 3 күн бұрын
Want to learn gen AI? Here is a SAT problem for you: adding one percent to an amount and then removing that same one percent will never give you back the original amount. That percentage is the "noise", removing that percentage is "generation", and automating this process with computational power is called generative AI. Want to know more? I'll teach you like a fifth grader about adding and removing noise in an image, because that is the only piece of art in generative AI
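The arithmetic claim in this comment is small enough to verify directly (the starting amount is arbitrary):

```python
# Add 1% "noise" to an amount, then remove 1% of the result:
# the second 1% is taken of a larger number, so the original isn't restored.
original = 100.0
noised = original * 1.01       # add 1%  -> 101.0
denoised = noised * 0.99       # remove 1% of the *new* amount

print(round(denoised, 2))  # → 99.99, not 100.0
```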
@YouTubeCertifiedCommenter
@YouTubeCertifiedCommenter 7 күн бұрын
This must have been the entire purpose of googles Picasa.
@ysidrovasquez4591
@ysidrovasquez4591 10 күн бұрын
Have you all noticed that these guys are a little group of oddballs who've got a screw loose… they make videos that don't say anything
@artseiff
@artseiff 3 күн бұрын
Oh, no… All cats are missing a tail 😿
@RupertBruce
@RupertBruce 10 күн бұрын
Cat picture needs a tail!
@MichaelPetito
@MichaelPetito 9 күн бұрын
Your cat needs a tail!
@sogwatchman
@sogwatchman 10 күн бұрын
10:36 When you unlock a face with your phone... Umm what?
@planesrift
@planesrift 5 күн бұрын
AI now understand what I cannot understand