Building a Recommendation System in Python

Views: 71,123

===== Likes: 652 👍: Dislikes: 21 👎: 96.88% : Updated on 01-21-2023 11:57:17 EST =====
Ever wonder how the recommendation algorithms at large tech companies (Facebook, Google, Apple, Netflix, Amazon, etc.) work? Look no further! I explain how recommendation systems work and how to create your own using Matrix Factorization and K-means clustering.
I create a recommendation system for movies. So, stay tuned! ;)
Github for code: github.com/Spe...
Data Citation:
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1-19:19.
Data Link: (MovieLens)
grouplens.org/...
0:00 - Why do we care about Recommendation Algorithms & Systems?
1:22 - Game Plan!
1:38 - Collaborative Filtering and Content-Based Filtering & Objective
3:39 - Google Colab Setup & Data
7:18 - Matrix Factorization Model Initialization & Training / Tuning Model
10:30 - Kmeans Clustering & Movie Recommendations

Comments: 94

@nayibahued5955 2 years ago

deepest data scientist voice in the world

@umeshkumarasamy6608 4 months ago

He learnt deep

@naderkhaled9410 2 years ago

Dude I know this is off topic, but ur voice is insanely satisfying !!

@SpencerPaoHere 2 years ago

😂

@user-jj3we9jv9i 10 months ago

Holy cow! That is a really good recommendation system! Humbling tutorial as well!

@Agent7155 2 years ago

Ended up searching up for movies to watch at the end xD

@SpencerPaoHere 2 years ago

😂

@ea1766 1 year ago

easily the best video on this subject, all the other videos were so boring and mundane. I wish KZfaq promoted this video more to the top.

@icequeen2778 1 year ago

Would love to see more of this type of video!

@folahan 1 year ago

This is the first time I have followed a tutorial using my own dataset and didn't get a single error from start to finish.

@ayushthombare9235 2 years ago

Very informative and useful video.... Thank you so much

@vincent_hall 1 year ago

Thank you sir. I have forked it and shall have a go collaborating with a friend.

@nikhilsastry6631 16 days ago

Deepest Learning

@dan7582 2 years ago

Nice video, keep up the good work!!

@user-cn4co9mt4p 2 months ago

Dude not only learned deep learning but also a deep voice. Damn.

@robbillington1603 1 month ago

Jaba ah voice! Great video

@vinayvajrala4366 1 month ago

A big like for that voice

@gauravpoudel7288 11 months ago

Thanks for the awesome content. BTW Is that really your voice?

@marcelomlr 4 months ago

Hey man, nice video, and thanks for the tutorial. I'm actually trying to build a recommendation system for online courses, like Udemy, but I can't find any datasets of user reviews to use for collaborative filtering. So I decided to manually create a dataset, and thought of choosing about 4 subjects and having some users rate 10-15 courses in each subject. Do you know if something like that can work, or have any tips you can give me?

@stmasanti 9 months ago

Great video!

@alexhort__ 15 days ago

How would you do it from a real-time database, with real users?

@ryderthewatermelon611 1 month ago

If I were to adapt this methodology to recommend songs based on a user's song selection, using a dataset of song parameters, how would I do that?

@elisama2936 1 year ago

Hello! :) Ty for the video. I have a question regarding the line " def __init__(self, n_users, n_items, n_factors=20)". Can you explain why 20?

@SpencerPaoHere 1 year ago

Number of latent factors was arbitrary! Though, you could optimize for that value.

@elisama2936 1 year ago

@@SpencerPaoHere Thank you for your answer!
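
The `n_factors` argument discussed above sets the length of each user and movie vector in a matrix factorization model. One way to "optimize for that value" is to train with a few candidate sizes and compare the error on held-out ratings. The sketch below is a dependency-free stand-in written for illustration, not the model from the video: `train_mf`, `mse`, the toy ratings, and the hyperparameters are all assumptions.

```python
import random


def train_mf(ratings, n_users, n_items, n_factors=20,
             epochs=200, lr=0.05, reg=0.02, seed=0):
    """Fit user (P) and item (Q) latent factors with plain SGD on squared error."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(pu * qi for pu, qi in zip(P[u], Q[i]))
            err = r - pred
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]  # keep old values for both updates
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def mse(ratings, P, Q):
    """Mean squared error between predicted and actual ratings."""
    errors = [(r - sum(a * b for a, b in zip(P[u], Q[i]))) ** 2
              for u, i, r in ratings]
    return sum(errors) / len(errors)


# Toy (user_idx, item_idx, rating) triples, made up for this example.
train = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0),
         (2, 1, 5.0), (2, 2, 2.0), (3, 0, 5.0), (3, 2, 1.0)]
valid = [(1, 1, 4.0), (3, 1, 4.5)]

# Try several latent sizes and keep the one with the lowest validation error.
results = {}
for k in (1, 2, 5, 20):
    P, Q = train_mf(train, n_users=4, n_items=3, n_factors=k)
    results[k] = mse(valid, P, Q)
best_k = min(results, key=results.get)
```

On real MovieLens data the same loop would compare validation MSE for each candidate size; a toy set this small only shows the mechanics.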

@lilyh4573 2 years ago

I'm sorry I was distracted by your good looks xD

@nazrulabuzhar2210 2 years ago

What is your skincare routine sir? You're looking good

@SpencerPaoHere 2 years ago

😂😂😂 Comment made my day! Cleanser + Moisturizer

@erick388 1 year ago

Heyo, and thanks for the video! This was incredibly helpful for learning how to make something rudimentary (even if I imagine a full-fledged system would be SO much more complex in how it measures user input and live data to form a more robust recommendation).

I do have one quick question, though. When I tried making my own slight variation (mostly changing the dataset and some small aspects), I ran into an issue with the loading step. To make this run faster, I used pandas to merge the ratings and movies CSVs, then shuffled and split them to get an even distribution with fewer rows (this is for a class of mine more than anything, and 100k entries is a lot to run during a presentation). The columns and headers remain the same; all that has changed is the order in which the rows appear (that is, it's no longer a bunch of Toy Story reviews in a row, then a bunch of Star Wars reviews in a row, etc.). Then I got an error at these lines:

    self.ratings.movieId = ratings_df.movieId.apply(lambda x: self.movieid2idx[x])
    self.ratings.userId = ratings_df.userId.apply(lambda x: self.userid2idx[x])

It processes movieId correctly, but when it applies the lambda to userId it returns a KeyError on NaN. Given that the CSV is the same apart from the row order, and the values are all numeric, what would be a feasible way to fix this error? Or could the way I shuffled the dataset be causing it to treat the numeric values as NaN, so that I have to shuffle in some particular way?

Also, on a fun side note, I've run this both with and without CUDA installed. I didn't notice any difference, but maybe that's just me. It runs regardless, though I presume that will create its own problems down the line.

@SpencerPaoHere 1 year ago

Glad you enjoyed it! This might be an issue with how you are shuffling or combining the data. There could be many reasons for this, so I'd recommend taking a small subset of your dataset and running the cleaning steps on that (it'd be easier to debug). It seems you are attempting to combine two datasets based on movieId. Have you tried join statements (an inner join, to be specific)? Also double-check that the casting is appropriate: you may be getting a null value because the userId somehow became a string. Otherwise, could you provide an example of what the current dataset looks like and what you are trying to achieve?
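
The KeyError on NaN described above is what a dictionary lookup does when an id is missing from the index, for example when a merge introduced NaN or turned ids into strings. A minimal sketch of a pre-flight check in plain Python follows; `build_index`, `find_bad_ids`, and the sample rows are hypothetical, and the real notebook works on pandas DataFrames rather than lists of dicts.

```python
import math


def build_index(ids):
    """Map each distinct raw id to a contiguous 0-based index."""
    return {raw: idx for idx, raw in enumerate(sorted(set(ids)))}


def find_bad_ids(rows, column, index):
    """Return (row_number, value) pairs that would raise KeyError on lookup."""
    bad = []
    for n, row in enumerate(rows):
        value = row[column]
        is_nan = isinstance(value, float) and math.isnan(value)
        if is_nan or value not in index:
            bad.append((n, value))
    return bad


# Hypothetical rows after a faulty merge: one userId is NaN, one is a string.
rows = [
    {"userId": 1, "movieId": 10, "rating": 4.0},
    {"userId": float("nan"), "movieId": 10, "rating": 3.0},
    {"userId": "2", "movieId": 20, "rating": 5.0},
    {"userId": 2, "movieId": 20, "rating": 2.5},
]

# Index built from the ids that are known to be valid.
userid2idx = build_index([1, 2])

bad = find_bad_ids(rows, "userId", userid2idx)
bad_rows = {n for n, _ in bad}
clean = [row for n, row in enumerate(rows) if n not in bad_rows]
```

Listing the offending rows first usually shows whether the problem is missing values (fix the join) or a type mismatch (cast the column back to integers) before any rows are dropped.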

@erick388 1 year ago

@@SpencerPaoHere Yeah I got it working. I think it was a messed-up join on my end which prematurely ended my experimenting with the dataset, so all's good!

On another side note, as I'm still learning some machine learning stuff, I have friends who keep talking about accuracy for machine learning algorithms, and the more I look into it the more I wonder how that applies here, or if it's even possible to quantify here. I know that MSE calculates the error between predicted values and actual rating values (do correct me if I'm wrong), which makes me question whether 'accuracy' or 'error' are actual aspects of this algorithm, or if that's related to other kinds of algorithms that are more specific in their goal.

Regardless! Big thanks for the help and awesome video. This was honestly a pretty good starting point, as it got me curious about a lot of topics I had never touched before.

@SpencerPaoHere 1 year ago

@@erick388 Glad you enjoyed the content! Regarding accuracy, there are actually several metrics you can optimize for. A great optimizer would be Adam. Accuracy by itself is not that 'accurate'; you need precision as well. Take a look at F1 scores. That'll help. Increasing "accuracy" comes down to additional features, more data, different ML algorithms, or tuning algorithms. That's essentially the world of Data Science.

@erick388 1 year ago

@@SpencerPaoHere Gotcha, I'll look into that too. It's a lot to take in but it's always fun and interesting to learn. Appreciate all the advice!

@erick388 1 year ago

@@SpencerPaoHere Actually, I suppose one final question is how I would qualify something as a false positive, or a true positive (or really any of the prerequisite information) for the calculations of F1 Scores (such as the requirements for Precision, Recall, etc). I'm not quite sure how to do that given that in this example here we're giving a recommendation of ten movies based on their overall rating, and I don't really know what would quantify as a false positive (or a true positive).
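
For a top-N recommender, one common convention (an assumption here, not something defined in the video) is to hold out some of a user's ratings, call a held-out movie "relevant" if its rating meets a threshold, and count a recommended movie as a true positive when it is relevant. A recommended movie that is not relevant is a false positive, and a relevant movie that was not recommended is a false negative. The function name, the 4.0 threshold, and the sample data below are illustrative only.

```python
def precision_recall_f1(recommended, relevant):
    """Score one user's recommendation list against their relevant items."""
    recommended = list(recommended)
    relevant = set(relevant)
    true_pos = sum(1 for item in recommended if item in relevant)
    precision = true_pos / len(recommended) if recommended else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical held-out ratings for one user (not used during training).
held_out = {"Toy Story": 5.0, "Heat": 2.0, "Alien": 4.5, "Fargo": 4.0}

# Assumption: a rating of 4.0 or more means the user liked the movie.
relevant = {title for title, rating in held_out.items() if rating >= 4.0}

# Hypothetical top-4 list from the recommender. "Casino" has no held-out
# rating, so under this convention it counts as a false positive.
top_k = ["Toy Story", "Heat", "Alien", "Casino"]

precision, recall, f1 = precision_recall_f1(top_k, relevant)
```

Here two of the four recommendations are relevant (precision 0.5) and two of the three relevant movies were recommended (recall 2/3). Averaging these per-user scores over all users gives a single figure for the model.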

@Bjorn_R 6 months ago

Hello Spencer, I'm split between collaborative recommender systems and a confirmation tree project for my master's thesis. What would be most beneficial?

@vaiterius 10 months ago

How do you know which libraries/functions to use to make these algorithms? I’m trying to make a videogame recommendation system from a Steam games dataset, similar to what you’re doing here

@hamzak5674 6 months ago

Hey, I’m making something similar using the RAWG dataset. Did you manage to get anywhere? I’m planning to start in the next few days

@SpencerPaoHere 6 months ago

Python typically wraps a lot of the theoretical machinery implemented in C/C++. When it comes to a base for a recommendation system, TensorFlow/Keras are the building blocks and are quite effective when building something from scratch or something fine-tunable.

@NobixLee 1 year ago

Great video, but how do we then get scores for the User_ID? Something like there is this much probability that User_ID 2 will be in cluster 2? Thank you.

@SpencerPaoHere 1 year ago

One way that you can go about this: you'd need more data to do this more accurately, since there are only 4 features (userID, movieID, rating, timestamp) in the dataset I am using in this video. However, with the approach in the video, you can take the average of the ratings that each user has given to the movies in each cluster. Normalize across all clusters and sort by highest rating per cluster for the user. Whichever movies in the cluster have not been seen by the user should be recommended to the user. I am open to hearing your thoughts on this!
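
The steps in the reply above can be sketched roughly as follows. This is one reading of the suggestion, not code from the video: `cluster_affinity`, `recommend_unseen`, and the toy data are hypothetical, and the normalized values are relative scores rather than calibrated probabilities.

```python
from collections import defaultdict


def cluster_affinity(user_ratings, movie_cluster):
    """Average a user's ratings per cluster, then normalize so scores sum to 1."""
    totals, counts = defaultdict(float), defaultdict(int)
    for movie, rating in user_ratings.items():
        cluster = movie_cluster[movie]
        totals[cluster] += rating
        counts[cluster] += 1
    means = {c: totals[c] / counts[c] for c in totals}
    norm = sum(means.values())
    if norm == 0:
        # Degenerate case: all ratings are zero, so treat clusters equally.
        return {c: 1.0 / len(means) for c in means}
    return {c: m / norm for c, m in means.items()}


def recommend_unseen(user_ratings, movie_cluster, affinity):
    """Return unseen movies from the user's highest-scoring cluster."""
    best = max(affinity, key=affinity.get)
    return sorted(m for m, c in movie_cluster.items()
                  if c == best and m not in user_ratings)


# Hypothetical K-means output: movie title -> cluster label.
movie_cluster = {"A": 0, "B": 0, "C": 1, "D": 1, "E": 0}
# Hypothetical ratings from a single user.
user_ratings = {"A": 5.0, "B": 4.0, "C": 2.0}

affinity = cluster_affinity(user_ratings, movie_cluster)
recs = recommend_unseen(user_ratings, movie_cluster, affinity)
```

Clusters the user has never rated get no score at all in this sketch, which is one reason more data per user helps.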

@obi666 9 months ago

I'm not sure what these clusters are (for example Cluster #1 and printed titles), are they some sort of groups of similar movies?

@SpencerPaoHere 6 months ago

Yep! Each cluster represents a group of data points that are similar.

@casewhite5048 2 years ago

How do you set a rating system for the output of movies? Let's say it recommends a movie you never want to watch, like Fried Green Tomatoes, or it recommends Avengers: Endgame and you tell it to rate it 10/10. How do you train it to find more clusters with higher ratings, and to keep finding more of these over time as more movies come out?

@SpencerPaoHere 2 years ago

There are many ways that you can go about doing this. I'd check out the Elo/FIDE rating system. Users manually click either "Yes" or "No" depending on whether they like the recommendation, and you can use that input to tailor the prediction output to the customer.
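
As a rough illustration of the Elo idea, the sketch below keeps a score per movie and treats each "Yes" or "No" click as a match against a fixed baseline rating. The Elo formulas themselves are standard, but mapping clicks to wins and losses, the 1500 baseline, and the K-factor of 32 are assumptions made for this example.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating, opponent, outcome, k=32):
    """Move a rating toward the observed outcome (1.0 = win, 0.0 = loss)."""
    return rating + k * (outcome - expected_score(rating, opponent))


BASELINE = 1500.0  # assumed fixed "opponent" for every click

# Every movie starts at the baseline for this user.
scores = {"Fried Green Tomatoes": 1500.0, "Avengers: Endgame": 1500.0}

# Hypothetical click log: (recommended title, user's answer).
feedback = [
    ("Avengers: Endgame", "yes"),
    ("Fried Green Tomatoes", "no"),
    ("Avengers: Endgame", "yes"),
]

for title, answer in feedback:
    outcome = 1.0 if answer == "yes" else 0.0
    scores[title] = update(scores[title], BASELINE, outcome)

# Movies the user liked float to the top of the ranking.
ranked = sorted(scores, key=scores.get, reverse=True)
```

The per-user scores could then re-rank or filter the titles that come out of each cluster before they are shown.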

@sachamallet5157 1 year ago

Hi, I would like to know if the Mac mini M2 Pro with only 16 GB of RAM is enough for analyzing 8 GB of data. Thank you so much for your feedback

@SpencerPaoHere 1 year ago

Yeah it should be good for smaller datasets. Though you never know until you try! (Maybe try 2 GB and see how long that’ll take, and approximate from there)

@bhadauriaji 2 years ago

Hi Spencer. I was working on a similar problem where I have users who have listened to a set of songs, and based on their listen history I have to recommend new songs to the user, about 10 of them. How do I do that? Also, I don't have ratings for songs; I have a listen count for each song, and the listen count is per user.

@SpencerPaoHere 2 years ago

You'll probably need additional features such as length of listen, genre, artist, etc. for a better recommendation algorithm. You could use the frequentist approach (to start), where you recommend the song that has been listened to the most, and slowly make your application more advanced once you've accumulated more focused data.

@bhadauriaji 2 years ago

@@SpencerPaoHere The problem is I can't have more features. My dataset has UserId, SongID, listen count, artist, song title, and date of the song only. I have to build a recommendation engine using that only. Also, I tried using K-means and some brute-force filtering techniques but I'm not getting good accuracy.

@SpencerPaoHere 2 years ago

@@bhadauriaji Unfortunately, those features aren't going to do recommendations justice. You could, however, do a weighted-sampling song recommendation based on hits. It's not perfect, but it may be what you are looking for.

@bhadauriaji 2 years ago

@@SpencerPaoHere Thanks a lot for the info, will try that surely. 🤗
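
The weighted-sampling idea from this thread can be sketched with the standard library alone: songs are drawn with probability proportional to their total listen count, skipping anything the user has already heard. The function name `recommend_by_hits` and the counts below are hypothetical.

```python
import random


def recommend_by_hits(listen_counts, already_heard, k=10, seed=None):
    """Sample up to k distinct songs, weighted by total listen count."""
    rng = random.Random(seed)
    # Only consider songs the user has not heard and that have some listens.
    pool = {song: count for song, count in listen_counts.items()
            if song not in already_heard and count > 0}
    picks = []
    while pool and len(picks) < k:
        songs, weights = zip(*pool.items())
        choice = rng.choices(songs, weights=weights, k=1)[0]
        picks.append(choice)
        del pool[choice]  # sample without replacement
    return picks


# Hypothetical listen counts summed over all users.
global_counts = {"s1": 120, "s2": 45, "s3": 300, "s4": 5, "s5": 80}
# Songs this particular user has already listened to.
heard = {"s3"}

picks = recommend_by_hits(global_counts, heard, k=3, seed=42)
```

Compared with always recommending the single most-played song, sampling gives less popular songs an occasional chance, which also generates the feedback data needed for a better model later.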

@abi_xyz 1 year ago

great

@dustinvo6097 2 years ago

Hi Spencer. Nice video as always. I am working on a problem where the users interact with a banking website and app. So I have userid, the interaction name, timestamps and some demographic variables. I'm trying to cluster them into some "personas" based on their interactions and timestamps for business use. Do you have any ideas how to do that? Thanks.

@SpencerPaoHere 2 years ago

Glad you enjoyed it! That use case can definitely be quite tricky. You'd first need to define which personas you are trying to bucket users into. Based on those personas, what actions (i.e. features) would link users to a given persona? I'd suspect that a lot of A/B testing would be required to test your hypotheses. But if it's literally just something related to money management via banking, I'd probably look at it from the angle of on-time payments, quantity, frequency, tiered users, time of withdrawal from ATM, fees encountered, zip codes, and features related to that (excluding PII unless the TOS permits it).

@dustinvo6097 2 years ago

@@SpencerPaoHere Thanks for the advice. Another question: if I focus on just userid and interactionname, how can I cluster the userids based on the interactions (withdraw, request credit score, ...) when they are repeated categorical measurements? Is k-modes a good choice?

@SpencerPaoHere 2 years ago

@@dustinvo6097 I think I have just the video for you :) kzfaq.info/get/bejne/hLGBo7mGlrK4nWw.html (If you haven't seen it already)

@aumasandra9307 1 year ago

Why do I keep getting KeyError: 46970 at the line train_set = Loader()? And how do I solve this error?

@SpencerPaoHere 1 year ago

Is this my code? Did you run through all the cells? If so, check out the Loader(Dataset) class and add some logging statements to see which lines are throwing that error.

@user-vo2lc5he9m 2 months ago

Hello brother, can I use any movie dataset from Kaggle?

@sospixs 1 year ago

Hi Spencer, thanks for your video. I've arranged the code, but I got stuck at the tqdm for loop, where len(losses) = 0:

    for it in tqdm(range(num_epochs)):
        ....
        ....

    ZeroDivisionError Traceback (most recent call last)
    Input In [59], in ()
    11 optimizer.step()
    12 #print(loss.item())
    ---> 13 print("iter #{}".format(it), "Loss:", sum(losses) / len(losses))
    ZeroDivisionError: division by zero

Any ideas?

@SpencerPaoHere 1 year ago

Yeah. Whatever is populating your losses is not working correctly, or there is a divergence issue. len(losses) == 0, so you'd need to figure out why the list is empty.

@sospixs 1 year ago

@@SpencerPaoHere Yep, I'm using Jupyter on my PC, and it reports "Is running on GPU: False". I think that's the problem.
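
A `ZeroDivisionError` on `sum(losses) / len(losses)` means no loss was recorded during the epoch. One possible cause, consistent with the follow-up above but not confirmed against the notebook, is a training step that only runs inside a GPU-only branch, so nothing happens on a CPU-only machine. The sketch below uses plain Python stand-ins (`mean_epoch_loss`, `fake_step`) to show a guard that fails with a clearer message; it is not the notebook's training loop.

```python
def mean_epoch_loss(batches, train_step):
    """Run train_step on every batch and return the mean loss for the epoch."""
    losses = []
    for batch in batches:
        # The step should run on CPU and GPU alike; only moving data to the
        # device should depend on whether a GPU is available.
        losses.append(train_step(batch))
    if not losses:
        raise RuntimeError(
            "No batches were processed: check that the data loader is not "
            "empty and that the training step is not skipped on CPU."
        )
    return sum(losses) / len(losses)


def fake_step(batch):
    """Stand-in for forward pass, loss, backward pass and optimizer step."""
    return batch * 2.0


fake_batches = [1.0, 2.0, 3.0]
epoch_loss = mean_epoch_loss(fake_batches, fake_step)

# An empty loader now produces an explicit error instead of dividing by zero.
try:
    mean_epoch_loss([], fake_step)
except RuntimeError as err:
    empty_message = str(err)
```

With the guard in place, an empty epoch points directly at the data loader or the skipped step rather than at the print statement.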

@kain5244 2 years ago

thanks

@appyviral8753 2 years ago

How much do you charge for making a video recommendation system for an Android app?

@SpencerPaoHere 2 years ago

If it's highly interesting, $0.00.

@appyviral8753 2 years ago

@@SpencerPaoHere it will be! how to contact u?

@SpencerPaoHere 2 years ago

@@appyviral8753 You can send me a message at business.inquiry.spao@gmail.com

@seankirbycordova3937 1 year ago

Can I ask for the source code? I'm building a library system and have no idea how to implement the collaborative filtering algorithm. Thank you if you can help me 😊

@ujjwal.kandel 2 years ago

How would I pass a movie title to the recommender and get a list of recommendations?

@SpencerPaoHere 2 years ago

Great question! You might have to change the model itself to be more 'linear' to return a movie title that is most similar to the input. With the K-means algorithm, you can technically "pass in a movie title" and the list would be the cluster associated with that movie title. You can then sort by shortest distance and get the top-rated movies. Some additional coding will be required to do that.

@ujjwal.kandel 2 years ago

@@SpencerPaoHere I could really use that extra code you're talking about. I'm doing a recommender for my final year project with zero experience in machine learning. Half this code is gibberish to me lol. I just need 10 recommendations for any list of movies. That's all I ask for😭
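
A minimal sketch of the title lookup described in this thread: given each movie's learned vector and its K-means label, return the other movies in the same cluster ordered by distance to the query. It is not the extra code from the video; `similar_in_cluster`, the 2-D vectors, and the labels are made up for illustration, and real embeddings would come from the trained model.

```python
import math


def similar_in_cluster(title, embeddings, labels, top_n=10):
    """Return up to top_n titles from the query's cluster, nearest first."""
    if title not in embeddings:
        raise KeyError(f"unknown title: {title!r}")
    target = embeddings[title]
    cluster = labels[title]
    candidates = [
        (math.dist(target, vec), other)
        for other, vec in embeddings.items()
        if other != title and labels[other] == cluster
    ]
    return [other for _, other in sorted(candidates)[:top_n]]


# Hypothetical 2-D movie vectors (real ones would have n_factors dimensions).
embeddings = {
    "Toy Story": (0.9, 0.1),
    "A Bug's Life": (0.8, 0.2),
    "Shrek": (0.7, 0.1),
    "Heat": (-0.9, 0.5),
    "Casino": (-0.8, 0.6),
}
# Hypothetical K-means labels for the same movies.
labels = {"Toy Story": 0, "A Bug's Life": 0, "Shrek": 0, "Heat": 1, "Casino": 1}

recommendations = similar_in_cluster("Toy Story", embeddings, labels)
```

For a list of input movies, the same function could be called once per title and the results merged, for instance by keeping the ten smallest distances overall.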

@christianmoreno7390 2 years ago

dang bro do you practice retention ??

@guitar300k 2 years ago

How to solve big scale problem, you guys?

@SpencerPaoHere 2 years ago

It depends on the use case, but there are many ways to scale a problem. All of which are somewhat unique. For deployment on a website for example, Kubernetes is quite popular.

@maximshidlovski23 1 year ago

Hi Spencer, thanks for the video. I am currently working on the problem of creating a tag-based recommendation system. The user has a list of tags of interest to him and needs to recommend content based on tags and words that are hyperonyms and hyponyms of these tags. I have the user's UserId, FavoriteUserTagsIds and the content's ContentID and ContentTagsIds. Do you have any ideas how to do that? What is best way to create tag-based recommendation system? Thanks.

@SpencerPaoHere 1 year ago

This seems like an NLP type problem! You can check out a generalized large language model to see if your keywords exist within its vocabulary. Then, using its word embeddings, you can perhaps utilize the distances between the vectors as a gauge behind the meaning. Then, you can plug in the output of the NLP model to a recommendation system.

@maximshidlovski23 1 year ago

@@SpencerPaoHere Thanks, I came up with a similar solution yesterday, now I'm working on implementing it.
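
The embedding-distance idea in this thread can be illustrated with cosine similarity between tag vectors. In the sketch below the 3-D vectors are invented by hand purely for the example; in practice they would come from a pretrained embedding model, and `tag_match_score` is a hypothetical scoring rule (best match per user tag, averaged), not an established method.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tag_match_score(user_tags, content_tags, vectors):
    """Average, over user tags, of the best similarity to any content tag."""
    # Tags missing from the vocabulary are skipped rather than guessed.
    user_vecs = [vectors[t] for t in user_tags if t in vectors]
    content_vecs = [vectors[t] for t in content_tags if t in vectors]
    if not user_vecs or not content_vecs:
        return 0.0
    best = [max(cosine(u, c) for c in content_vecs) for u in user_vecs]
    return sum(best) / len(best)


# Hand-made toy vectors: related words point in similar directions.
vectors = {
    "dog": (0.9, 0.1, 0.0),
    "animal": (0.8, 0.3, 0.1),   # broader term for "dog"
    "poodle": (0.95, 0.05, 0.0),  # narrower term for "dog"
    "finance": (0.0, 0.1, 0.9),
}

user_tags = ["dog"]
content = {"article_a": ["poodle", "animal"], "article_b": ["finance"]}

scores = {cid: tag_match_score(user_tags, tags, vectors)
          for cid, tags in content.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Because broader and narrower terms tend to sit close to the original tag in embedding space, content tagged with them scores high without an explicit hierarchy lookup.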

@sssaturn 1 year ago

Is there a reason you don't split the data set?

@SpencerPaoHere 1 year ago

I just wanted to highlight the recommendation aspect (not necessarily the training aspect). Though, in an ML model, you definitely want to do the typical 60/20/20 split!

@sssaturn 1 year ago

@@SpencerPaoHere cool, thank you spencer!
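
The 60/20/20 split mentioned above divides the ratings into training, validation, and test sets. A minimal standard-library sketch follows; `split_60_20_20` is a hypothetical helper, and the integers stand in for rating rows.

```python
import random


def split_60_20_20(rows, seed=0):
    """Shuffle rows and split them into train, validation and test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(rows) * 6 // 10
    n_valid = len(rows) * 2 // 10
    train = rows[:n_train]
    valid = rows[n_train:n_train + n_valid]
    test = rows[n_train + n_valid:]
    return train, valid, test


# Ten stand-in rating rows.
train_rows, valid_rows, test_rows = split_60_20_20(range(10))
```

For recommenders, a purely random split can leave some users or movies with no rows in the training set, so it is worth checking that every id in the validation and test sets also appears in training.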