Learning To Classify Images Without Labels (Paper Explained)

  48,057 views

Yannic Kilcher

How do you learn labels without labels? How do you classify images when you don't know what to classify them into? This paper investigates a new combination of representation learning, clustering, and self-labeling in order to group visually similar images together - and achieves surprisingly high accuracy on benchmark datasets.
OUTLINE:
0:00 - Intro & High-level Overview
2:15 - Problem Statement
4:50 - Why naive Clustering does not work
9:25 - Representation Learning
13:40 - Nearest-neighbor-based Clustering
28:00 - Self-Labeling
32:10 - Experiments
38:20 - ImageNet Experiments
41:00 - Overclustering
Paper: arxiv.org/abs/2005.12320
Code: github.com/wvangansbeke/Unsupervised-Classification
Abstract:
Is it possible to automatically classify images without the use of ground-truth annotations? Or when even the classes themselves are not known a priori? These remain important, open questions in computer vision. Several approaches have tried to tackle this problem in an end-to-end fashion. In this paper, we deviate from recent works and advocate a two-step approach where feature learning and clustering are decoupled. First, a self-supervised task from representation learning is employed to obtain semantically meaningful features. Second, we use the obtained features as a prior in a learnable clustering approach. In doing so, we remove the ability for cluster learning to depend on low-level features, which is present in current end-to-end learning approaches. Experimental evaluation shows that we outperform state-of-the-art methods by huge margins, in particular +26.9% on CIFAR10, +21.5% on CIFAR100-20 and +11.7% on STL10 in terms of classification accuracy. Furthermore, results on ImageNet show that our approach is the first to scale well up to 200 randomly selected classes, obtaining 69.3% top-1 and 85.5% top-5 accuracy, and marking a difference of less than 7.5% with fully-supervised methods. Finally, we applied our approach to all 1000 classes on ImageNet, and found the results to be very encouraging. The code will be made publicly available.
Authors: Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, Marc Proesmans, Luc Van Gool
Links:
KZfaq: / yannickilcher
Twitter: / ykilcher
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher

Comments: 102
@BeyondTheBrink 4 years ago
The fact that you allowed us to participate in your confusion about the norm-not-norm issue is sooo valuable. Great fan of your work, thx!
@twmicrosheep 4 years ago
Great explanations! The self-labeling step reminds me of the paper "ClusterFit: Improving Generalization of Visual Representations", which shows a lot of promising results by using pseudo labels from clustering to retrain a new classifier.
@katharinahochkamp5415 3 years ago
I am currently bingeing your videos during my work hours - but, as a PhD student in this field, I don't even feel guilty, because I am learning so much. Great work, keep it up!
@MrPrakalp 3 years ago
Great paper review and explanation!! Thanks a ton!!! It definitely saved a lot of my time in reading and understanding the entire paper. Now it's easy to go back and implement things.
@kumarrajamani2135 1 year ago
Wonderful video @Yannic. A couple of years back, during my postdoc, I learned attention through your video on "Attention Is All You Need" and then started building my research work on the intuition I got. I now have a good idea of self-supervised learning!!!
@gforman44 3 years ago
This is very nice, and a nice explanation of it. This works so well in this paper partly because the input dataset is nicely separable into discrete clusters. Try this with photos from the wild, not cropped to include the frog/car/object in the center of the photo. Unsupervised, it's pretty unlikely that you'll get classes you like.
@kapilchauhan9774 4 years ago
Thank you for such an amazing overview.
@Squirrelonsand 3 years ago
When I started watching the video, I was not sure if I'd be able to sit for 45 minutes to understand this paper, but thanks to your great explanation skills, I sailed through it...
@ruskinrajmanku2753 4 years ago
There were some really interesting RL papers at ICLR '20. You should cover a few of them. Great explanation again, keep up the good work!
@rongxinzhu 4 years ago
Can you provide some links? I'm really interested in those papers.
@MrAmirhossein1 4 years ago
Thanks for the great content! Honestly, the entire channel is an actual gold mine! Please keep up the excellent work :)
@saikrishnarallabandi1131 4 years ago
+1
@Phobos11 4 years ago
Cool! I was actually going to try doing this myself, exactly the same steps and all: unsupervised learning -> k-means -> self-labeling. Awesome to see I wasn't so crazy after all, great explanation 😁
@tedp9146 4 years ago
I also had something similar (and simpler) in mind before I watched this video: clustering the bottleneck encodings of images. Surely that's been done before, but I haven't found any results on the internet.
@esthermukoyagwada8578 3 years ago
@Victor, which dataset are you aiming to work with?
@dmc1308 4 months ago
I'd been wandering around inside the paper for hours; finding this video is a big gift for me.
@Fortnite_king954 4 years ago
Amazing review, thank you so much. Keep going....
@dimitrispolitikos1246 2 years ago
Nice explanation! Thank you Yannic!
@dippatel1739 4 years ago
Labels exist. Augmentation: "I am about to end this man's career."
@Renan-st1zb 1 year ago
Awesome explanation. Thanks a ton
@tamooora87 4 years ago
Thanks for the great effort 👍
@acl21 4 years ago
Great explanation as always, thank you! It would have been even better if you had explained the evaluation metrics: ACC (clustering accuracy), NMI (normalized mutual information) and ARI (adjusted Rand index).
@MrjbushM 4 years ago
Thanks, cool videos! Very informative; I always try to distill knowledge from your explanations :-)
@dinnerplanner9381 3 years ago
I have a question: what would happen if we passed images through a pretrained model such as Inception and then used the obtained feature map for clustering?
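For context, a minimal sketch of that idea, assuming torchvision and scikit-learn are available. Note that an ImageNet-pretrained backbone was itself trained with labels, so this variant is not label-free the way the paper's pretext task is; the backbone choice, cluster count and image list below are hypothetical:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.cluster import KMeans

# Frozen pretrained backbone; replace the classifier head with identity
# so the forward pass returns 2048-d feature vectors instead of logits.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(pil_images):
    # Stack preprocessed images into one batch and return numpy features
    batch = torch.stack([preprocess(im) for im in pil_images])
    return backbone(batch).numpy()

# features = embed(my_images)                        # hypothetical PIL image list
# clusters = KMeans(n_clusters=10).fit_predict(features)
```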
@rahuldeora5815 3 years ago
The point made in the last 30 seconds is such an important one. All hyperparameter choices are based on label information, making this more of a playground experiment than something robust.
@myelinsheathxd 1 year ago
Amazing method. I hope RL can use this during self-curiosity rewards; then there would be fewer manual rewards for a bunch of locomotion tasks.
@thebigmouth 2 years ago
Thanks for the amazing content!
@clivefernandes5435 3 years ago
Hi, I was training the model in the SCAN stage and the total loss displayed is negative, hence to reduce it we need to go from, say, -4 to -9, right? Silly question.
@ShivaramKR 4 years ago
Don't worry so much about the mistakes you make. You are doing a great job!
@jacobkritikos3499 3 years ago
Congrats on your video!!!
@huseyintemiz5249 4 years ago
Nice overview.
@ProfessionalTycoons 4 years ago
such dope research!!
@nightline9868 3 years ago
Great video, really easy to understand - thanks for that! Can I ask you something? I'm trying to compare the clustering results of different clustering approaches on image data. Is it possible to use internal validation indices, e.g. the Davies-Bouldin score? Or are there problems in terms of the Euclidean space? Keep it up!
@Vroomerify 1 year ago
How do we avoid the network projecting all images to 0 in the 1st step if we are not using a contrastive loss function?
@choedward3380 1 year ago
I have one question: if I have no labeled images, is it possible? When updating the memory bank (with SimCLR as the pretext task), does it need labels?
@bibiworm 2 years ago
6:58 - a stupid question here: if the downstream task is not a classification task, would the Euclidean distance still make sense in the learned representation space? I think it does, but I am not sure. I'd really appreciate it if anyone could shed some light here. Thanks.
@tuanad121 3 years ago
At 32:34 the accuracy of self-supervised learning followed by K-means is 35.9%. How do they decide the representative label of a cluster? Is it the majority label in the cluster?
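For context: deep clustering papers typically report accuracy after a one-to-one Hungarian matching between predicted clusters and ground-truth classes, which is close to, but stricter than, a per-cluster majority vote. A minimal sketch, assuming integer numpy label arrays:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    k = max(y_true.max(), y_pred.max()) + 1
    counts = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        counts[p, t] += 1                    # co-occurrence of cluster p and class t
    # Hungarian algorithm maximizes total matches (minimize negated counts)
    rows, cols = linear_sum_assignment(counts.max() - counts)
    mapping = dict(zip(rows, cols))          # cluster id -> class id
    return np.mean([mapping[p] == t for p, t in zip(y_pred, y_true)])
```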
@antonio.7557 4 years ago
great video, thanks!
@ravipashchapur5803 2 years ago
Hi there, hope you are doing well. I want to know: can we use only supervised learning for an unlabeled image dataset?
@ehtax 4 years ago
Super helpful, keep up the great work Yannic! Your ability to filter out the intuitions makes you an incredible instructor. PS: what is the note-taking software you're using?
@YannicKilcher 4 years ago
OneNote, thanks :)
@IndranilBhattacharya_1988 4 years ago
@YannicKilcher Fantastic, good job, keep going. I myself, before reading a paper, look through your videos in case you have reviewed it already.
@shix5592 10 months ago
@IndranilBhattacharya_1988 Me too, very good channel.
@CodeShow. 4 years ago
Can you explain the basics of deep learning using the published papers for the algorithms, as you do now? You have a way of teaching that makes me not fear scientific papers 🙂
@23kl104 3 years ago
I would suspect that overclustering is done by shoving in a whole block of data from one class and assigning the output with the highest peak the corresponding label, though I can't be sure. And shouldn't the accuracy be expected to be lower with more clusters, since the entropy term is maximizing the number of different clusters?
@bowenzhang4471 3 years ago
25:19 Why is the inner product always 1 in L2-normalized space?
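For context: if a vector is L2-normalized, its inner product with itself is its squared norm, i.e. exactly 1; the inner product with a different unit vector is the cosine of the angle between them, which only reaches 1 when the vectors coincide. A tiny numpy check:

```python
import numpy as np

x = np.random.randn(128)
x /= np.linalg.norm(x)        # L2-normalize
y = np.random.randn(128)
y /= np.linalg.norm(y)

print(x @ x)                  # 1.0 (up to float error): squared norm of a unit vector
print(x @ y)                  # in [-1, 1]; equals 1 only if y points the same way as x
```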
@BanjiLawal 1 year ago
This is what I have been looking for
@eternalsecretforgettingfor8525 4 years ago
OUTLINE: 0:00 - Intro & High-level Overview 2:15 - Problem Statement 4:50 - Why naive Clustering does not work 9:25 - Representation Learning 13:40 - Nearest-neighbor-based Clustering 28:00 - Self-Labeling 32:10 - Experiments 38:20 - ImageNet Experiments 41:00 - Overclustering
@nahakuma 4 years ago
Nice videos, in particular your skepticism. How do you select the papers you review? I find myself with a mountain of papers to read, but time is never enough.
@YannicKilcher 4 years ago
Same here, I just read what seems interesting.
@mohammadxahid5984 4 years ago
Yannic, could you please make a video on the essential mathematics required to be a DL researcher? I am a CS undergrad and I always find myself not knowing enough mathematics while reading papers. Is this the case for everyone? I am amazed at your ability to go through papers with such understanding. Could you share with us how you prepared yourself that way? PS: excuse my English.
@YannicKilcher 4 years ago
Hey, never be ashamed of your English, it's cool that you participate :) That's a good idea, but the answer will be a bit boring: linear algebra, real (multidimensional) calculus, probability / stats and numerics are most relevant
@ekkkkkeeee 4 years ago
I have a little question about equation 2 in the paper. How is the soft assignment Φ^c calculated? They simply say "The probability of sample Xi being assigned to cluster c is denoted as Φ^c_η(Xi)", but never mention how to calculate it. Am I missing something?
@YannicKilcher 4 years ago
It's probably a softmax after some linear layers
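A minimal sketch of such a clustering head, assuming it sits on top of backbone features (dimensions are hypothetical): a linear layer followed by a softmax, so each image gets a probability distribution over the C clusters.

```python
import torch
import torch.nn as nn

class ClusterHead(nn.Module):
    def __init__(self, feature_dim=512, num_clusters=10):
        super().__init__()
        self.linear = nn.Linear(feature_dim, num_clusters)

    def forward(self, features):
        # Soft assignments: each row is a distribution over clusters, summing to 1
        return torch.softmax(self.linear(features), dim=-1)

# probs = ClusterHead()(backbone_features)   # shape (batch, num_clusters)
```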
@sarvagyagupta1744 4 years ago
Hey, great videos. I have a question though: the representation learning part seems very similar to image reconstruction, like using a variational autoencoder, and some people consider that unsupervised learning. So what exactly is the difference between self-supervised and unsupervised learning?
@YannicKilcher 4 years ago
It's pretty much the same thing. Self-supervised tries to make it more explicit that it's "like" supervised, but with a self-invented label.
@sarvagyagupta1744 4 years ago
@YannicKilcher Thanks for the reply. So what, according to you, is a clear-cut example that differentiates self-supervised from unsupervised learning?
@vsiegel 3 years ago
From the examples, I had the suspicion that it may work based on *colour and structure of the background*, combined with *confirmation bias*. The shark cluster may not care much about the sharks, but more about the blue water that surrounds them. The spiders may just be things in focus in front of a blurred background, caused by the small depth of field of a macro photo. It may also be based on the shape of the colour histogram, which covers more of the example clusters shown and includes information about the structure and colours of object and background. At least in some examples it is a very strong effect - so strong that it needs confirmation bias by the authors to miss it. Maybe it is discussed in the paper; I did not check.
@NehadHirmiz 4 years ago
Your videos are amazing. Not only do you have the technical knowledge, but you also do a wonderful job explaining things. If I may suggest: create an advanced course where you show researchers/students how to implement the algorithms in these papers. I would be your first student lol :).
@YannicKilcher 4 years ago
God that sounds like work.... just kidding, thanks for the feedback :)
@NehadHirmiz 4 years ago
@YannicKilcher I know there is a fine line between too much fun and work :P. This would be a graduate-level course.
@sarc007 3 years ago
Hi, very interesting and informative video. I have a question: how do I go about detecting symbols in an engineering drawing using the technique explained here?
@YannicKilcher 3 years ago
You'd need a dataset.
@sarc007 3 years ago
@YannicKilcher Then it will be labeled data, right? Can you elaborate? My email id is sarc007@gmail.com
@sherryxia4763 3 years ago
The thing I love most is the sassy hand drawing lmao
@tedp9146 4 years ago
How well would it work to cluster the bottleneck-encoding of an autoencoder?
@YannicKilcher 4 years ago
Good question, worth a try
@dippatel1739 4 years ago
Summary of the paper:
1. Learn a good embedding.
2. Learn classes based on the embedding.
3. (Extra) Use the learned classes to train a new NN.
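Step 2 of that summary hinges on mining nearest neighbors in the learned embedding space; a minimal runnable sketch with random stand-in features (the real pipeline uses the pretext-task embeddings, with K = 20 neighbors per the paper):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

features = np.random.randn(1000, 128)                      # stand-in for step-1 embeddings
features /= np.linalg.norm(features, axis=1, keepdims=True)

nn_index = NearestNeighbors(n_neighbors=21).fit(features)  # 20 neighbors + the point itself
_, idx = nn_index.kneighbors(features)
neighbors = idx[:, 1:]                                     # drop self; shape (1000, 20)
# Step 2 then trains a clustering head so that each image and its `neighbors`
# receive consistent cluster assignments.
```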
@herp_derpingson 4 years ago
K-nearest neighbours but with neural networks
@hafezfarazi5513 3 years ago
I have a question: in representation learning, why won't the network cheat and classify everything (all kinds of classes) the same? Is there a regularization that is not shown here (for example, encouraging a diverse output)?
@YannicKilcher 3 years ago
There are a number of tricks, but mostly it's because of stochasticity, normalization and the inclusion of negatives.
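To illustrate the "negatives" part, a minimal sketch of a SimCLR-style NT-Xent loss (an assumption here; the paper supports several pretext tasks): agreement with the other augmented view of the same image is rewarded, while similarity to every other image in the batch is penalized, so mapping everything to one point stops being optimal.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    # z1, z2: L2-normalized embeddings of two augmented views, shape (B, dim)
    z = torch.cat([z1, z2], dim=0)                    # (2B, dim)
    sim = z @ z.t() / temperature                     # pairwise cosine similarities
    sim.fill_diagonal_(float('-inf'))                 # exclude trivial self-similarity
    B = z1.size(0)
    # The positive for row i is the other view of the same image
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets)              # positive vs. all in-batch negatives
```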
@MyTobirama 4 years ago
At 15:06, why do they use the log in the first term of the equation?
@YannicKilcher 4 years ago
I guess it's so they can interpret the inner product as a likelihood
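For reference, a minimal sketch of the objective being discussed: both Φ outputs are softmax distributions, so their dot product lies in (0, 1] and can be read as the probability that an image and its neighbor are assigned to the same cluster; the log turns that into a likelihood-style loss. The entropy weight below is an assumption:

```python
import torch

def scan_loss(probs_x, probs_nbr, entropy_weight=5.0):
    # probs_*: (batch, C) softmax outputs for images and their mined neighbors
    dot = (probs_x * probs_nbr).sum(dim=1)            # <Phi(x), Phi(neighbor)> per pair
    consistency = -torch.log(dot + 1e-8).mean()       # the log term in question
    mean_p = probs_x.mean(dim=0)                      # average assignment over the batch
    entropy = -(mean_p * torch.log(mean_p + 1e-8)).sum()
    return consistency - entropy_weight * entropy     # entropy is maximized to avoid collapse
```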
@egexiang588 4 years ago
Should I be familiar with an information theory textbook to appreciate this paper? I'm really not sure which math textbooks to read to understand ML papers better.
@YannicKilcher 4 years ago
Nope, this is very practical
@julespoon2884 4 years ago
43:30 I've not read the paper yet, but your argument about overclustering does not apply if the authors evaluated the model on a different set than they trained on.
@DarioCazzani 2 years ago
That's not a flute, it's an oboe! :p Always enjoy your videos btw :)
@linminhtoo 3 years ago
Could you re-explain why Euclidean distance would not work for raw images?
@YannicKilcher 3 years ago
Because two images can be very similar to humans while every pixel is different.
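A tiny demo of that point with a random array as a stand-in image: shifting by one pixel leaves the content essentially unchanged to a human, yet the pixel-space Euclidean distance becomes large.

```python
import numpy as np

img = np.random.rand(32, 32)
shifted = np.roll(img, 1, axis=1)         # shift the whole image right by one pixel
print(np.linalg.norm(img - shifted))      # large, although the content barely moved
print(np.linalg.norm(img - img))          # 0.0 - only identical pixels give zero distance
```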
@linminhtoo 3 years ago
@YannicKilcher This makes sense. Thanks!
@herp_derpingson 4 years ago
This is too good to be true. I wouldn't be surprised if nobody is able to replicate it. But if it does work, it could open up a lot of possibilities in unexplored territories of computer vision.
@TijsMaas 4 years ago
Many hyperparams indeed; the authors claim code + configuration files will be released soon, which sounds really promising. Defining the class (dis)agreement on the embedding neighbourhood is a fine piece of representation learning 👌.
@simonvandenhende5227 3 years ago
We released the code over here :) github.com/wvangansbeke/Unsupervised-Classification
@dennyw2383 3 years ago
@simonvandenhende5227 Great work. What's the best way to communicate with you guys? For example, the CIFAR100 ACC is significantly lower than the ImageNet-100 ACC - any thoughts why?
@simonvandenhende5227 3 years ago
@dennyw2383 You can contact me through email. CIFAR100 is evaluated using superclasses, e.g. vehicles = {bicycle, bus, motorcycle, pickup truck, train}, trees = {maple, oak, palm, pine, willow}. These groups were composed based on prior human knowledge, not on visual similarities alone. This is the main reason I see for the lower accuracy on CIFAR100. Another reason, which also relates to the use of superclasses, could be the increased intra-class variability.
@saikannanravichandar6171 4 years ago
This video is good 👌. If possible, can you explain the concept with code?
@shuvamgiri8601 4 years ago
+1
@bryand3576 4 years ago
It would be great to contact the authors to see what they think of your videos!
@clivefernandes5435 3 years ago
So the first step is the most important one, right? Because the later steps learn from it.
@YannicKilcher 3 years ago
yes
@Lord2225 3 years ago
Woooho that is smart xD
@samipshah5977 1 year ago
nice dog drawing
@AbdennacerAyeb 4 years ago
Would you make a tutorial every time?
@vevui7503 3 years ago
0:33
@ivan.zhidkov 4 years ago
Stop doing clicky sounds with your tongue. So annoying.
@ayushmehta5844 4 years ago
What a choot :)