3 Vector-based Methods for Similarity Search (TF-IDF, BM25, SBERT)

37,577 views

James Briggs

A day ago

Vector similarity search is one of the fastest-growing domains in AI and machine learning. At its core, it is the process of matching relevant pieces of information together.
Similarity search is a complex topic and there are countless techniques for building effective search engines.
In this video, we'll cover three vector-based approaches for comparing language and identifying similar 'documents', spanning both vector similarity search and semantic search (a short TF-IDF sketch follows the chapter timestamps below):
- TF-IDF
- BM25
- Sentence-BERT
📰 Original article:
www.pinecone.io/learn/semanti...
🤖 70% Discount on the NLP With Transformers in Python course:
bit.ly/3DFvvY5
🎉 Sign-up For New Articles Every Week on Medium!
/ membership
Mining Massive Datasets Book (Similarity Search):
📚 amzn.to/3CC0zrc (3rd ed)
📚 amzn.to/3AtHSnV (1st ed, cheaper)
👾 Discord
/ discord
🕹️ Free AI-Powered Code Refactoring with Sourcery:
sourcery.ai/?YouTu...
00:00 Intro
01:37 TF-IDF
11:44 BM25
20:30 SBERT
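For readers who want to try the ideas alongside the video, here is a minimal TF-IDF sketch using scikit-learn (an illustrative example, not the exact code from the video):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "bananas are yellow and rich in potassium",
    "the sky is blue and vast",
    "ripe bananas turn from green to yellow",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents)  # sparse (n_docs, n_terms) matrix

# pairwise cosine similarity between the documents in TF-IDF space
print(cosine_similarity(tfidf))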

Comments: 58
@mayursanmugam4050 · 10 months ago
Found this after stumbling around for a good overview of BM25 & SBERT. This is a fantastic initial introduction: enough detail, and it introduces the right concepts that people can double down on for further learning. Thank you James!
@stepkurniawan · 7 months ago
7 mins to understand TF-IDF, you're my saviour
@bujin1977 · 1 year ago
Thanks, this was very helpful! I've recently started using SQL Server's full text search capabilities to drive course searches on our college website, but it was all a bit of a "black box" thing. No idea how it worked, I just trusted that it *did* work! Until I got a query from someone who wanted to know how to alter their search results to change the order that we display them on the website. I'm no stranger to complicated mathematical formulae, but I took one look at the BM25 formula on wikipedia and cried! Your explanation made it so much easier to understand what was going on. Now comes the hard part. Explaining how the staff member in question can alter their data to boost their results... 😬
@mcnubn · 5 months ago
Really helped clear up BM25 for me! Huge thank you for sharing this!
@lukekim4760 · 2 years ago
I am into document similarity ranking and I love your videos! Thank you so much :)
@jamesbriggs · 2 years ago
Great to hear! I made a full (and free) course on semantic search if you're interested :) www.pinecone.io/learn/nlp
@LuisRomaUSA · 2 years ago
Not many views yet, but please don't stop making content. This is the best video I have found in a week of searching.
@jamesbriggs · 2 years ago
haha happy to hear, I've committed to making videos so I'll be here for a long time 😅 check out the similarity search playlist if you're interested in these things, just finished it!
@AjayShivranBCSE · 2 years ago
Great work man!
@parth191079 · 2 months ago
This is super helpful! Thank you for this video.
@yonahcitron226 · 11 months ago
great explanations! thanks!
@Data_scientist_t3rmi · 1 year ago
Excellent video thank you!
@asedaaddai-deseh8152 · 2 years ago
Great explanation!
@szymonskorupinski5237 · 3 years ago
Great work!
@leonardvanduuren8708 · 1 year ago
Masterful !! Thx for this and all your other stuff !!
@jamesbriggs · 1 year ago
Glad you're enjoying them!
@MehdiMirzapour · 4 months ago
Great work! You are a great teacher! Although I already know these concepts, I enjoyed watching it a lot.
@ruimelo1039 · 2 years ago
I'm doing a uni project on this topic and your explanation was on point! Thank you
@li-pingho1441 · 10 months ago
extremely simple explanation!!!!!!!!
@qwerty8669 · 2 years ago
Thanks, this was helpful
@tomwalczak4992 · 3 years ago
Really good, simple explanations. Also really liked your Udemy course.
@jamesbriggs · 3 years ago
Hey Thomas, yes, I remember - you left a review on the course? Great to see you here too, and thanks!
@tomwalczak4992 · 3 years ago
@@jamesbriggs Yup ;) I'm getting into NLP so your videos have been super useful. Just finished my first project that uses both sparse and dense embeddings: share.streamlit.io/tomwalczak/pubmed-abstract-analyzer And as you say in the video, dense embeddings and complex models don't always work better, at least not out-of-the-box. Looking forward to more vids :)
@jamesbriggs · 3 years ago
@@tomwalczak4992 That's a very cool project, first one too? I'm impressed! Awesome to see you're getting into it though, looking forward to seeing you around!
@UnpluggedPerformance · 2 years ago
That BERT outcome is certainly cool!!!!! You made my day, man!! Awesome! How can we support you? (besides likes, etc.)
@jamesbriggs · 2 years ago
comments like this! Really happy it helped :)
@abhishekrathi6253 · 2 years ago
Nice explanation
@UnpluggedPerformance · 2 years ago
bro super good explanations
@pfinardii · 2 years ago
Hi James, fantastic video!!! A question: when using BERT to extract dense representations from the hidden_state or last_hidden_state layers, we compute masked_embeddings = embeddings * mask (where mask is the attention_mask from the tokenizer) to zero out the padding tokens. Should we maybe also consider the special tokens [CLS] and [SEP]? The attention mask for these special tokens is 1, so when using a hidden layer from BERT, do we need to perform a slice like masked_embeddings = masked_embeddings[:, 1:sep_token_pos, :], where sep_token_pos is the [SEP] position in the sequence: [[CLS], tokens of the sequence, [SEP], [PAD], [PAD], ...]?
@jamesbriggs · 2 years ago
Hey Paulo, good question. I believe the other sentence transformer models that build these embeddings keep both, but I have never seen them explicitly state that they do (or why) in any papers, so I can't say for sure, sorry! Nonetheless, my understanding is that the CLS and SEP tokens are included within the embeddings as they still contain useful information about the input data. The CLS token itself can actually be used to build sentence embeddings (although it is ineffective compared to mean pooling, afaik). The significance of that is that the CLS token contains enough information about the sequence to be (somewhat) effectively used as a single vector representation of the whole sequence, so it holds quite useful information that would be lost if removed. As for the SEP token, I don't believe it is as important as the CLS, but I can't say I know how relevant it is. I'd be curious to see a comparison of embedding performance with/without the CLS/SEP tokens. I'm sure it has been tested, but I've never seen it mentioned.
@pfinardii · 2 years ago
@@jamesbriggs Hi James, I did a test with MNR loss. During the tokenization process I set the tokenizer parameter add_special_tokens=False and I got 0.83 against 0.81 with the default value (True). I still need to test with only the [SEP] token removed to make the results more robust. Thanks for the reply :)
@jamesbriggs · 2 years ago
@@pfinardii oh so it's better? Wow I'll have to try it too - that's awesome :)
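For anyone following this thread, here is a minimal sketch of the masking and mean pooling being discussed, with an optional step for dropping the special-token positions (the model name and sentences are illustrative, not the exact code from the video):

import torch
from transformers import AutoTokenizer, AutoModel

# any BERT-style encoder works the same way; this model choice is illustrative
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["vector search matches relevant documents",
             "similarity search finds related passages"]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state       # (batch, seq_len, hidden)

mask = inputs["attention_mask"].clone()              # 1 for real tokens, 0 for padding
# optional (the experiment above): also zero out the [CLS]/[SEP] positions in mask here
mask = mask.unsqueeze(-1).float()                    # (batch, seq_len, 1)

# zero the padded positions, then average over the remaining token embeddings
summed = (hidden * mask).sum(dim=1)
counts = mask.sum(dim=1).clamp(min=1e-9)
sentence_embeddings = summed / counts                # (batch, hidden)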
@23232323rdurian · 11 months ago
Point taken and understood about, e.g., the same meaning being expressed with different words. However, in practice a straightforward word-to-word comparison with frequency stats works pretty well, because words have usage frequencies: anybody MEANING to say ..... is gonna say ....., not ....., at something like 100-to-1 odds; and .....? Well, that's extremely rare; ..... is gonna be hundreds of times more frequent in this context than ..... Then, further, ACROSS languages (e.g. English, Japanese) the word frequencies don't necessarily translate; sometimes frequent English words are infrequent in Japanese and vice versa.
@li-pingho1441 · 10 months ago
thank you so muchhhhhhh
@wenzeloong · 2 years ago
It's a great video!! I need your opinion, sir James. In this video you are using cosine similarity to calculate the distance. What do you think about combining these methods with ANN (approximate nearest neighbour) search using angular distance? Is that better than using cosine similarity?
@jamesbriggs · 2 years ago
Hey Iven, thanks! I think you should absolutely use ANN, definitely if you have lots of vectors. As for cosine similarity vs angular similarity: angular similarity can distinguish better between already very similar vectors, but I'm not sure it matters much in most use cases. Most applications from pretty smart people tend to use cosine similarity, so that is (for me) evidence that cosine similarity is 'good enough'. If you're interested in ANN and more of this, I have a big playlist on it here: kzfaq.info/sun/PLIUOU7oqGTLhlWpTz4NnuT3FekouIVlqc Hope it helps :)
@wenzeloong · 2 years ago
@@jamesbriggs Thank you for your opinion and the playlist is quite amazing..! It helps me a lot.. thank you !
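As a small illustration of the cosine vs angular discussion above, here is a hedged sketch (the vectors are made up; angular distance is derived directly from cosine similarity):

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def angular_distance(a, b):
    # arccos maps the cosine similarity to an angle; dividing by pi scales it to [0, 1]
    return float(np.arccos(np.clip(cosine_similarity(a, b), -1.0, 1.0)) / np.pi)

a = np.array([0.9, 0.1, 0.3])
b = np.array([0.8, 0.2, 0.35])
print(cosine_similarity(a, b), angular_distance(a, b))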
@Krobongo · 2 years ago
I'm a bit confused. Is SBERT just the embedding layer, which is fed to an ML model, or is it also the model itself that does e.g. text classification?
@peterthomas7523 · 2 years ago
Excellent video as always :) kinda makes me wonder why I bother spending my grant money on training courses when your whole channel is simply better. I had a question about using S-BERT for similarities between documents, rather than sentences within a document. Could I just average the embeddings of the sentences within each document and calculate cosine similarity between these? Or is there a better way? Thanks!
@jamesbriggs · 2 years ago
You can do this, but it's not that effective. Another option would be to compare all paragraphs and take an average score, or create some sort of threshold like 'if 5 paragraphs have similarity > 0.8', etc. It's hard to do. I have a free 'course' on semantic search here, hopefully you can save some more of your grant money: www.pinecone.io/learn/nlp/
@peterthomas7523 · 2 years ago
@@jamesbriggs Thanks a lot :) I've been working through your pinecone course and am really liking it so far!
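A minimal sketch of the first approach mentioned above, averaging sentence embeddings per document and comparing the documents with cosine similarity (the model name and documents are illustrative):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

doc_a = ["Vector search matches queries to relevant documents.",
         "TF-IDF and BM25 build sparse vectors from term frequencies."]
doc_b = ["BM25 ranks documents using term frequency and document length.",
         "Dense models like SBERT encode meaning rather than exact words."]

# encode each document's sentences, then mean-pool into one vector per document
emb_a = model.encode(doc_a).mean(axis=0)
emb_b = model.encode(doc_b).mean(axis=0)

print(float(util.cos_sim(emb_a, emb_b)))  # document-level similarity score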
@23232323rdurian · 11 months ago
28:50 ... both B and G SHARE this phrase and several of its words, so THAT's why they share a high similarity score.
@venkateshkulkarni2227 · 2 years ago
I think the bert-base-nli-tokens models are deprecated now according to the Hugging Face website. Which Sentence Transformer model should we now use for SBERT?
@jamesbriggs · 2 years ago
I like mpnet models the most for generic sentence vectors: huggingface.co/flax-sentence-embeddings/all_datasets_v3_mpnet-base I cover a load of models, training methods, etc. in this playlist: kzfaq.info/sun/PLIUOU7oqGTLgz-BI8bNMVGwQxIMuQddJO Hope it helps :)
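A hedged sketch of loading the model linked above with the sentence-transformers library and using it for a tiny semantic search (newer models may have superseded this one; the query and passages are made up):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("flax-sentence-embeddings/all_datasets_v3_mpnet-base")

query = "how do I find similar documents?"
passages = [
    "BM25 ranks documents by term frequency and length.",
    "Sentence embeddings let us compare texts by meaning.",
    "Bananas are rich in potassium.",
]

# encode and rank passages by cosine similarity to the query
scores = util.cos_sim(model.encode(query), model.encode(passages))
print(scores)  # higher score means the passage is more similar to the query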
@smcgpra3874 · 1 year ago
Can we classify tabular data where each row is one dataset?
@user-bs9bu1ko5f · 2 years ago
It would be great to have some scripts or subtitles for your video, thank you!
@AlexGuemez · 1 year ago
Is there a way to "reverse" TF-IDF to see if Google uses it in its algorithm?
@jamesbriggs · 1 year ago
I've not heard of a way, but it could be possible. Google's algorithm uses a lot of different things though (BERT included), so I'm not sure it would be possible to identify specific parts of it like TF-IDF.
@wilfredomartel7781 · 2 years ago
How do you train SBERT on a specific domain?
@jamesbriggs · 2 years ago
Hey, I have a few articles + videos on this; what does your training data look like?
If you have sentence pairs + scores, you can use MSE loss, which I cover at the end of: www.pinecone.io/learn/gpl/
If you have no training data, just raw text, you can use unsupervised methods like GPL (above), GenQ, or TSDAE (all found here): www.pinecone.io/learn/nlp/
If you have sentence pairs *without* labels, you can use softmax loss or, preferably, MNR loss: www.pinecone.io/learn/fine-tune-sentence-transformers-mnr/
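For the last case above (sentence pairs without labels), a minimal sketch of MNR-loss fine-tuning with the sentence-transformers library; the starting checkpoint and the two example pairs are made up, see the linked article for the full approach:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative starting checkpoint

# (anchor, positive) pairs from your domain; other in-batch pairs act as negatives
train_examples = [
    InputExample(texts=["what is BM25?", "BM25 is a ranking function used in search"]),
    InputExample(texts=["define TF-IDF", "TF-IDF weights terms by frequency and rarity"]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)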
@edgar23vargas53 · 3 years ago
Hey is there any way we can get in contact with you?
@jamesbriggs · 3 years ago
Yes on the 'About' page of my YT channel you'll be able to find my email
@edgar23vargas53 · 3 years ago
@@jamesbriggs DMed you on Instagram
@edgar23vargas53 · 3 years ago
@@jamesbriggs DMed you
@jamesbriggs · 3 years ago
@@edgar23vargas53 got it
@edgar23vargas53 · 3 years ago
@@jamesbriggs shot you an email
@ErginSoysal · 2 years ago
You don't know what b and k are in BM25, do you? 😏
@gorgolyt · 2 years ago
You need to tighten up your math notation. Writing f(t, D) for the "total number of terms in the document" is really confusing and makes no sense. What is t in this function? You either need to sum over all t in D, which you haven't written, or you should just drop the t and use some function g(D) to denote the total number of terms in the document. When you get onto BM25 it's even worse; I'm not sure your explanation of your notation is even correct. It should be f(q, D) in the denominator, the same q that is in the numerator, not f(t, D), whatever that means.
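For reference, the standard form of BM25 that the two comments above refer to is (textbook notation, not necessarily the exact notation used in the video):

\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}

Here f(q_i, D) is the frequency of query term q_i in document D, |D| is the length of D in terms, avgdl is the average document length in the collection, and k_1 (typically around 1.2 to 2.0) and b (typically around 0.75) are the free parameters controlling term-frequency saturation and length normalization.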