Explaining the Segment Anything Model - Network architecture, Dataset, Training

Views: 19,810

Neural Breakdown with AVB

1 day ago

Comments: 55

@avb_fj 9 months ago

Here's me from the future posting a detailed analysis of Neural Attention: kzfaq.info/get/bejne/nNifptV9lqmpmKs.html

@SlashDL 14 days ago

Some more information at 10:25: in the token-to-image attention, the queries come from the prompt + output tokens and the keys and values come from the image. In the image-to-token attention, the queries come from the image embedding and the keys and values come from the prompt + output tokens.
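The two attention directions described above can be sketched in a few lines. This is only a minimal illustration, assuming single-head dot-product attention with no learned query/key/value projections, residuals, or normalization (a real decoder layer has all of these); `cross_attention`, `tokens`, and `image` are hypothetical names with made-up values.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, context):
    """Each query attends over `context`, which supplies both keys and values."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, context)) for i in range(dim)])
    return out

# Made-up 2-dimensional embeddings, for illustration only.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]              # prompt + output tokens
image = [[0.2, 0.1], [0.9, 0.3], [0.4, 0.8], [0.5, 0.5]]   # image embedding positions

# Token-to-image: queries from the tokens, keys/values from the image.
updated_tokens = cross_attention(tokens, image)
# Image-to-token: queries from the image, keys/values from the tokens.
updated_image = cross_attention(image, tokens)
```

The output always has one row per query, so the first call returns one updated vector per token and the second returns one per image position.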

@man9mj 9 months ago

I am flabbergasted by the quality of this content. Thank you for the effort. I just subscribed to your channel. Keep up the good work brother! We look for more :)

@DatuxGames 1 year ago

Your videos just keep getting better and better! Editing is on point with this one. Also great topic and really valuable to have you break things down like this.

@avb_fj 1 year ago

Thank you so much! I’m learning things as I go, so I really appreciate feedback like this!

@rmayer4086 1 year ago

@avb_fj I agree with him. Your pacing is excellent and you're giving a perfect level of detail.

@SlashDL 14 days ago

At 10:03, 4 new tokens are added to the sparse embeddings: 1 representing the IoU score and the remaining 3 representing the masks. Just a minor correction.

@user-xv8dn4nm5k 3 months ago

Thanks for sharing 👍

@jorgeabraham3414 1 year ago

this video will have tens of thousands of views in the upcoming days

@gingerderidder8665 1 year ago

So happy I got recommended this video. Great quality content!

@avb_fj 1 year ago

Nice! Glad you enjoyed it!

@anacaznok872 11 months ago

The best video on the subject. Thank you! I'll keep watching your videos

@avb_fj 11 months ago

Awesome! Welcome to the channel and I’m glad you liked the video!

@hinchengchen3153 3 months ago

Easy and short but splendid!!

@keneth4 1 year ago

Awesome explanation 👏🏼

@davidyu2372 2 months ago

great video!

@Sciencehub-oq5go 11 months ago

I really like this explanation. Thanks a lot!

@ItalianPizza64 1 year ago

Excellent video, thank you very much! After watching this, there's no doubt in my mind that transformer-based architectures will take over AI for computer vision

@avb_fj 1 year ago

Very true. Vision Transformers are definitely here to stay. The generalization power of transformers/attention is so surprising sometimes… decades of computer vision research suggested that CNNs are best for images because they can encode spatial information about the image… it’s just counterintuitive and mind boggling that ViTs can still learn from images after flattening individual patches and losing spatial structure.

@billy.n2813 1 year ago

Thank you for this!

@willikappler1401 1 year ago

Wonderful, I really like the way how you present complex topics!

@victorbjorklund 11 months ago

Good quality video. You got a subscriber.

@turboxxx8 11 months ago

Amazing video! Could you please explain what exactly are the "output tokens" and how do they get them?

@avb_fj 11 months ago

Someone else had the same question, copy-pasting that reply here…

So the output token is kind of a common trick that people use in Transformer based models to "aggregate information" about an input sequence. If you are familiar with the Next Sentence Prediction task in BERT models, they also use a similar concept with the [CLS] token. Basically, the concept goes as follows:
step 1> the output token is a dummy token you append onto the input sequence (say at the very end of the input seq)
step 2> pass it through the transformer/attention layers
step 3> the attention layers generate a sequence of contextual embeddings, one for each token in the input sequence
step 4> you extract the embedding at the index corresponding to the dummy output token (i.e. the last embedding coz that's where you put the output token in the input sequence in step 1)
step 5> this embedding now encapsulates or aggregates the entire context of the input sequence and can be used for downstream tasks like classification, etc.

Hope that helped. The literature may be a bit thin on output token embeddings in the Segmentation space, but I'd strongly recommend reading about the [CLS] token in the BERT paper for Next Sentence Prediction to get a better understanding.
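The five steps above can be sketched as follows. This is a toy illustration, assuming a single self-attention pass with no learned projections; in a real model the output token is a learned embedding and there are many layers. `self_attention`, `prompt_tokens`, and `output_token` are hypothetical names with made-up values.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(seq):
    """Return one contextual embedding per input token."""
    dim = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in seq]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, seq)) for i in range(dim)])
    return out

prompt_tokens = [[1.0, 0.0], [0.0, 1.0]]   # made-up input sequence
output_token = [0.5, 0.5]                  # dummy token (learned in a real model)

seq = prompt_tokens + [output_token]       # step 1: append the dummy token
contextual = self_attention(seq)           # steps 2-3: one embedding per token
summary = contextual[-1]                   # step 4: take the output token's slot
# step 5: `summary` would now feed a downstream head (classification, mask, ...)
```

Because the output token attends to every other token, its final embedding is a weighted mix of the whole sequence, which is what makes it usable as a summary.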

@SofieSimp 1 year ago

Nice video explaining the interactive training. I have one question: during the interactive training, is the loss calculated at each step or only at the end? To be more clear:
Step 1: I sample a point at the middle of the ground truth mask
Step 2: Feed the point as a prompt into the model
Step 3: Get the best mask from the model
Step 4: From the best mask, calculate the error region and sample another positive OR negative point in the error region
Step 5: Loop from step 2 until the maximum number of iterations is reached
Do I have to calculate the loss between step 3 and step 4, update the model, and then move on to step 4, or do I calculate the loss at the end after step 5?

@avb_fj 1 year ago

That's a great question. We should definitely calculate the loss between step 3 and step 4. Every iteration is regarded as an isolated training example, so basically for each training example we input an image and a prompt (with a dense mask), the model outputs predicted mask(s), and then we apply the loss over our prediction and the ground truth. As far as "updating the model" is concerned, it is largely a design choice I think. It's not incorrect to update the weights between each iteration, but it's probably better to do gradient accumulation (basically aggregating the losses over multiple iterations) before updating the weights to get a more stable training curve. Hope that helps!
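Gradient accumulation as described in this reply can be sketched with a toy one-parameter model. This is not SAM's training code: the "model" here is just `w * x` with a squared-error loss, and `train_with_accumulation` and the sample values are hypothetical. The point is only the order of operations: compute a loss and gradient at every interactive iteration, but update the weight once at the end.

```python
def train_with_accumulation(w, iterations, lr):
    """One weight update from gradients accumulated over several iterations.

    `iterations` is a list of (x, target) pairs, standing in for the
    (image + prompt, ground truth) pair seen at each interactive step.
    """
    grad_sum = 0.0
    for x, target in iterations:
        pred = w * x                            # forward pass for this iteration
        # loss for this iteration is (pred - target) ** 2
        grad_sum += 2.0 * (pred - target) * x   # d(loss)/dw, accumulated
    # Single update using the mean gradient over all iterations.
    return w - lr * grad_sum / len(iterations)

w_new = train_with_accumulation(w=0.0, iterations=[(1.0, 2.0), (2.0, 4.0)], lr=0.1)
```

Updating after every iteration instead would mean moving the `w - lr * ...` step inside the loop; both are valid, and accumulation trades update frequency for smoother steps.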

@SofieSimp 1 year ago

@avb_fj Thanks a lot! Really great explanation!

@avb_fj 1 year ago

@SofieSimp 🙌🙌

@Grenoble7 1 year ago

Hello. Great dense video. Suggestion: you are a bit too fast for me; I have to pause on every slide to read it. Usually I watch videos at 2x speed, but yours is the only one on YouTube where I've needed the opposite! Maybe you could describe each slide in more detail to give us time to understand it? Just an idea.

@wkgates 1 year ago

Great explanation!

@VictorVelazquezEspitia 7 months ago

Hey man, congrats on the great video. Right now I am doing my thesis on SAM and this was a big help. May I ask which camera you used?

@avb_fj 5 months ago

Good old iPhone. Good luck on your thesis man!

@EkShunya 1 year ago

i like your energy. can you help the community with resources you refer to and channels/people you follow?

@avb_fj 1 year ago

Thanks for the comment! That’s great feedback, I’ll try to share more in the upcoming videos!

@miyutube1 5 months ago

Very good, I am using SAM and want to understand better to tune the parameters, thus here I am struggling to understand your video (one of the few that actually try to explain the concepts...). What is pt in the focal loss definition?

@avb_fj 5 months ago

Could you add a timestamp?

@miyutube1 5 months ago

2:54

@avb_fj 5 months ago

@miyutube1 I see. “p” here is simply the probability output by the model for the classification task. You can find more info on page 3 of this paper: arxiv.org/pdf/1708.02002.pdf
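For reference, the linked focal loss paper defines p_t as p when the true label is 1 and as 1 - p otherwise, with the loss given by FL(p_t) = -(1 - p_t)^γ log(p_t). A minimal sketch for a single binary prediction follows; it leaves out the optional α class-balancing weight, and the function name and the default γ = 2 are choices made for this illustration.

```python
import math

def focal_loss(p, y, gamma=2.0):
    """Focal loss for one binary prediction.

    p: predicted probability of the positive class, strictly between 0 and 1
    y: true label, 0 or 1
    """
    p_t = p if y == 1 else 1.0 - p   # probability assigned to the true class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

easy = focal_loss(0.9, 1)   # confident and correct: heavily down-weighted
hard = focal_loss(0.6, 1)   # less confident: contributes more to the loss
```

Setting `gamma=0.0` removes the modulating factor and reduces this to ordinary cross-entropy.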

@timanb2491 8 months ago

May I ask you a question? One type of prompt is a segmentation mask. If we have a segmentation mask as a prompt, why should we use SAM? We already have a binary segmentation mask.

@avb_fj 8 months ago

Check out the part about interactive training at around 5:00. Basically the dense mask prompt is used during the training phase to iteratively improve the network’s segmentations. Kinda like asking the model, “Hey you gave me this mask last time, but here’s another internal/external point prompt, give me an updated mask”. During inference, we can pass an image full of zeros as the dense mask into the model (meaning we have no idea where the segmentation should be) and ask the model to update it. Once it gives an initial estimation, we might recursively pass the network’s output logits (not binary, but the prob distribution) back into it as a dense prompt to iteratively improve the predicted mask. In other words, don’t assume that the mask we pass in as a prompt needs to be the “correct one”… it will be incorrect / a gross estimation of the correct mask, and the network’s job is to iteratively improve it till it converges somewhere.
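The inference-time loop described in this reply can be sketched as below. This follows the reply's description only (start from an all-zero dense mask, feed the output logits back in); it does not use the real SAM API. `toy_predict` is a hypothetical stand-in model invented for this sketch, and `refine` is an illustrative helper name.

```python
def toy_predict(image, points, prev_logits):
    # Hypothetical stand-in for a segmentation model (NOT the SAM API):
    # blends the previous logits with a simple brightness-based score.
    # `points` is accepted only to mirror the interface and is unused here.
    return [[0.5 * prev + (pix - 0.5) for pix, prev in zip(img_row, logit_row)]
            for img_row, logit_row in zip(image, prev_logits)]

def refine(predict, image, points, steps=3):
    height, width = len(image), len(image[0])
    # Start with an all-zero dense prompt: no prior guess about the mask.
    mask_logits = [[0.0] * width for _ in range(height)]
    for _ in range(steps):
        # Feed the previous (non-binary) logits back in as the dense prompt.
        mask_logits = predict(image, points, mask_logits)
    # Threshold only at the very end to get a binary mask.
    return [[1 if v > 0 else 0 for v in row] for row in mask_logits]

image = [[0.9, 0.1], [0.8, 0.2]]   # made-up 2x2 "image"
final_mask = refine(toy_predict, image, points=[(0, 0)])
```

Keeping the logits continuous between rounds, and thresholding once at the end, preserves the model's uncertainty across iterations.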

@barbaraz5363 1 year ago

Hi, thanks for your video! It was explained so well! I have just one question about the predicted IoU score: how is it calculated during inference? In the paper they just said that it's calculated between the predicted mask and the object it covers. I wonder how they get the area of the object it covers (because basically we don't have GT).

@avb_fj 1 year ago

Yeah it’s kinda tricky to understand it. During training, they have the GT, so they can calculate the IOU with the 3 predicted masks and train the model to predict IOU scores. During inference, the three IOU predictions are simply considered as “confidence scores” for each of the three predicted masks. This allows them to rank each of the mask outputs according to how good the iou prediction is. For example, if one of the masks has a predicted IOU of 0.99 it suggests that the network thinks it’s strongly overlapping the queried object. That said, this is still a network’s own confidence on its own prediction, and there is no way of knowing the correct iou score coz we won’t have the GT during inference. It’s all an additional helpful output that tells us the network’s confidence on each of the masks. Hope that helps!
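At inference time, using the predicted IoU values as confidence scores amounts to a simple sort. A minimal sketch, with placeholder mask names and made-up scores (`rank_masks` is a hypothetical helper, not part of any library):

```python
def rank_masks(masks, predicted_ious):
    """Sort masks from highest to lowest predicted IoU (the model's own confidence)."""
    order = sorted(range(len(masks)), key=lambda i: predicted_ious[i], reverse=True)
    return [(masks[i], predicted_ious[i]) for i in order]

# Placeholders standing in for the three predicted masks and their scores.
masks = ["mask_a", "mask_b", "mask_c"]
predicted_ious = [0.71, 0.99, 0.40]

ranked = rank_masks(masks, predicted_ious)
best_mask, best_score = ranked[0]
```

No ground truth is involved here: the scores are the network's estimates, so the ranking is only as reliable as the IoU head itself.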

@barbaraz5363 1 year ago

@avb_fj Thanks for your quick reply! If I understand correctly, the so-called "confidence scores" during inference are in fact calculated by an MLP head with input (3, embedding_dim) and some hidden layers of 256 neurons, and in the end it outputs a tensor (3, 1) which represents the probability (0, 1) of each mask after passing through a sigmoid activation?

@avb_fj 9 months ago

@barbaraz5363 Sorry for the late response, I must've missed the notification. Fwiw, what you said makes perfect sense to me.

@nitinsurya1991 9 months ago

- What could be the intuition for having an MLP for IoU scores and MSE loss on top?
- From their repository, I don't see any interface for text prompt usage. Any examples available?

@avb_fj 9 months ago

- Predicting the IoU scores helps during inference to determine which of the three predicted masks is most likely to be the "correct" mask to show the user. The three IoU predictions are simply considered “confidence scores” for each of the three predicted masks. This allows them to rank each of the mask outputs according to how high the IoU prediction is. They are basically asking the network to output how confident it is for each of its three predictions. For example, if one of the masks has a predicted IoU of 0.99 it suggests that the network thinks it’s strongly overlapping the queried object. Again, note that this is all an inference-time thing. During training, we have the Ground Truth mask and use MSE loss to train the network to output the correct IoU scores. During inference, we just have the three predicted masks and the network's own confidence score as IoU for each of the three masks. Hope that helps.

- Regarding the text-prompt usage, they did not release it in their web-app. They just documented it in their paper. Don't know if there are plans to release it in the future, or if there are other ways to access it.
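The training-time side of this can be sketched as: compute the actual IoU of each predicted mask against the ground truth, then penalize the IoU head with MSE between its predictions and those actual values. The masks below are flattened to 1-D lists of 0/1 for brevity, and all values and helper names are made up for illustration.

```python
def iou(pred, gt):
    """Intersection over union of two flat binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 0.0

def mse(predicted, actual):
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)

gt = [1, 1, 1, 0, 0, 0]
predicted_masks = [
    [1, 1, 1, 0, 0, 0],   # perfect overlap
    [1, 1, 0, 0, 0, 0],   # misses one pixel
    [0, 0, 1, 1, 1, 0],   # mostly wrong
]

# Regression targets: only computable because the ground truth is available.
actual_ious = [iou(m, gt) for m in predicted_masks]

# What the IoU head guessed for the same three masks (made-up numbers).
predicted_ious = [0.9, 0.5, 0.4]
iou_loss = mse(predicted_ious, actual_ious)
```

Once trained this way, the head's outputs can be read as confidence scores at inference, when `gt` is no longer available.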

@nitinsurya1991 9 months ago

@avb_fj Thanks. Have you tried whether the CLIP text embeddings would do the trick? Essentially, they mentioned they did train the model with that input.

@prafulmathur4567 11 months ago

Hi. The explanation is awesome, but it's still not clear how SAM handles nested annotations. The GT annotations have no hierarchy defined; each part is independently annotated. Then how does SAM learn whole, part and subpart for each object?

@avb_fj 11 months ago

That’s a great question… you are right, the network does not explicitly output the part/subpart/whole stuff! The annotators are just asked to annotate in such a way. I believe what ends up happening is that the network automatically learns to map each output head to one of the three labeled GT. Because each output head also has its own unique output token embedding, they technically learn to associate with different output masks as well. They leave it up to backpropagation and gradient descent to handle the rest… Note that neither during inference nor training do they provide labels for whole/part/subpart to the network… and during prediction/inference too, the network doesn’t explicitly output those labels. It just returns 3 distinct masks and their confidence scores…

@Alice-yq6yy 1 month ago

How does SAM guess the IoU for new images when there is no ground truth available?

@avb_fj 1 month ago

During training, the ground truth images and their IOU scores are available, so we can train the SAM network to predict it using supervised training. During inference, the network predicts the segmentation masks and also the estimates of the IOU scores.

@Ye1324 1 year ago

In what format is the mask data saved? Is it in tensors or numpy arrays?

@avb_fj 1 year ago

It is mostly a design/implementation choice. Generally people would save the mask data in an image format (like png) or in any uint8 format (including numpy arrays)… during training though we would need to load/convert them to tensors for easy gradient calculations…
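As a rough illustration of that round trip, here is a sketch using only the Python standard library: the mask is stored as uint8 bytes (0 for background, 255 for object, the convention a PNG mask commonly uses) and converted back to floats before training. In practice one would use an image library and a tensor framework; `save_mask_bytes` and `load_mask_floats` are hypothetical helper names.

```python
def save_mask_bytes(mask):
    """Flatten a binary mask into uint8 bytes (0 = background, 255 = object)."""
    return bytes(255 if v else 0 for row in mask for v in row)

def load_mask_floats(buf, width):
    """Rebuild rows of floats in [0, 1], ready to be wrapped in a tensor."""
    flat = [b / 255.0 for b in buf]
    return [flat[i:i + width] for i in range(0, len(flat), width)]

mask = [[0, 1, 1], [0, 0, 1]]
stored = save_mask_bytes(mask)                 # compact storage: one byte per pixel
restored = load_mask_floats(stored, width=3)   # float values for loss computation
```

The stored form stays small and framework-independent, while the float form is what a loss function actually consumes.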