My top 50 scikit-learn tips

Views: 12,142

Data School

A day ago

Comments: 49

@dataschool 1 year ago

👩‍💻 Code: github.com/justmarkham/scikit-learn-tips 🤖 Learn ML from me: courses.dataschool.io/ml-courses 💌 Weekly Data Science tips: tuesday.tips/ Thanks for watching! 🙌

@KartikeyRiyal 1 year ago

Amazing as always. I have been following you since 2019 and every time it's something new.

@dataschool 1 year ago

Thank you so much for your kind words! 🙏

@KartikeyRiyal 1 year ago

@dataschool welcome

@KenJee_ds 1 year ago

These are amazing! I learned a lot!

@dataschool 1 year ago

Thanks Ken! 🙌

@akbarboghani1 1 year ago

Great video, very informative. Thank you so much for sharing.

@dataschool 1 year ago

You're very welcome!

@philwebb59 1 year ago

24:08 handle_unknown='ignore'. A most useful tip! If only I'd read the docs. But I don't understand what you mean when you say to go back and include the previously unknown categories. How can you train on unknown data? Even if you include the unknown "labels" in your encoder, they will all be zero during training because, obviously, they weren't in your training data. I think it's best to just leave it alone. If it wasn't in your training data, then it's probably a rare occurrence and you can just ignore it. Zeros in all known categories simplifies what happens downstream? If you want to train on unknown data, you would need to use "dummy data" and set min_frequency or max_categories, then handle_unknown='infrequent_if_exist' to give downstream modules something to work with.

@dataschool 1 year ago

Glad tip 7 was useful to you! When I said "go back and include previously unknown categories", that means that the next time you train your model, you can incorporate that sample into your training data, and thus that previously unknown category will now be a known category.
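
A minimal sketch of the behavior discussed in this exchange, assuming a toy single-column dataset (the category values, including the unseen "X", are hypothetical and not from the video). With handle_unknown='ignore', an unseen category is encoded as all zeros; once that sample is added to the training data and the encoder is refit, the category becomes a known one.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data with three known categories.
X_train = np.array([["S"], ["C"], ["S"], ["Q"]])
# New data containing "X", which was never seen during fit.
X_new = np.array([["C"], ["X"]])

# With handle_unknown="ignore", transform does not raise on "X".
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(X_train)
encoded_new = ohe.transform(X_new).toarray()
# The row for "X" is all zeros: no known category matched.

# "Going back": add the new sample to the training data and refit,
# so that "X" becomes a known category with its own column.
X_retrain = np.vstack([X_train, [["X"]]])
ohe_retrained = OneHotEncoder(handle_unknown="ignore").fit(X_retrain)
known_after_retrain = list(ohe_retrained.categories_[0])
```

In a real project the refit would happen as part of retraining the whole pipeline on the enlarged dataset, not on the encoder alone.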

@uncledez8 1 year ago

This is master's-level info on data science.

@dataschool 1 year ago

Thank you! 🙏 Just wait for my next Machine Learning course, it will blow your mind 🤯

@user-oj6rl5kc6i 1 year ago

Excellent, well done and thank you!

@dataschool 1 year ago

You're very welcome!

@Ahmed_Eid 1 year ago

I'm a new subscriber. I'm so glad I found you. Amazing explanation!

@dataschool 1 year ago

Thank you!

@tassoskat8623 1 year ago

Hello Kevin! Thank you for your great work and tips. Could you please include in the repository notebooks for the tips that are missing? I suppose those are the ones that do not contain code. However, it would be great to have those included in some way so nothing is missing when someone would like to do a quick review. Again, thank you so much for your sharing!

@dataschool 1 year ago

Thanks for your kind words! You are right that those 6 tips don't have notebooks, since they don't have code. I'll consider adding notebooks for those tips in the future... thanks for the suggestion!

@maziarjamshidi4505 1 year ago

Awesome resource for Machine Learning. Thanks!

@dataschool 1 year ago

You're very welcome! Glad it's helpful to you!

@philwebb59 1 year ago

2:10:00 Yeah, if you have the time and the determination, you could run DecisionTreeClassifier, then plot_tree, and look through it for conditions like name != value. Then, you could use the order the decision tree "discovers" categories as the ordinal value for that feature, 0 being first. You just need to write a custom transformer to preprocess your validation data and assign -1 to all unknowns. Another trick I've had success with is ordering by frequency, with 0 being the most frequent. In that case, your custom transformer should assign 0 to all unknowns. Easy-peasy.

@dataschool 1 year ago

Thanks for sharing, Phil!
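
The frequency-ordering trick Phil describes could be sketched roughly as below. This is an illustration under assumptions, not his actual code: the class name FrequencyOrdinalEncoder and the toy data are hypothetical, and ties in frequency are broken however pandas' value_counts happens to order them.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FrequencyOrdinalEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: rank categories by training frequency.

    The most frequent category gets 0, the next gets 1, and so on.
    Categories not seen during fit are also assigned 0.
    """

    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        self.mappings_ = {}
        for col in X.columns:
            # value_counts() sorts by descending frequency by default.
            ordered = X[col].value_counts().index
            self.mappings_[col] = {cat: rank for rank, cat in enumerate(ordered)}
        return self

    def transform(self, X):
        X = pd.DataFrame(X)
        out = pd.DataFrame(index=X.index)
        for col in X.columns:
            # Unknown categories map to NaN, which is then replaced by 0.
            out[col] = X[col].map(self.mappings_[col]).fillna(0).astype(int)
        return out


train = pd.DataFrame({"color": ["a", "b", "a", "c", "a", "b"]})
new = pd.DataFrame({"color": ["c", "z", "a"]})  # "z" is unseen

encoder = FrequencyOrdinalEncoder().fit(train)
encoded = encoder.transform(new)["color"].tolist()
```

Note that with this scheme an unknown category is indistinguishable from the most frequent one, which is the trade-off the comment accepts.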

@rohitchan007 1 year ago

We need more videos like these.

@dataschool 10 months ago

Glad you like it!

@shahriyarabedinnezhad3162 1 year ago

Super useful...Thanks Kevin

@dataschool 1 year ago

You're welcome!

@philwebb59 1 year ago

2:09:40 Hopefully, you'll never have 200 columns to pass through, but I think specifying which columns to pass through makes your intent clearer. The default is remainder='drop', so the author presumably thought so as well.

@dataschool 1 year ago

Sure! But there's nothing necessarily wrong with passing through 200 (or 200,000) columns if they don't need transformations.
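
For comparison, here is a small sketch of the two styles being discussed, assuming a made-up three-column DataFrame (the column names are hypothetical). Both ColumnTransformers produce the same matrix here; the first relies on remainder='passthrough', the second names the passthrough columns explicitly and keeps the default remainder='drop'.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one categorical column, two numeric columns.
df = pd.DataFrame({
    "city": ["A", "B", "A"],
    "age": [30, 40, 50],
    "fare": [7.5, 8.0, 9.25],
})

# Style 1: everything not listed is passed through untouched.
ct_remainder = ColumnTransformer(
    [("ohe", OneHotEncoder(), ["city"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

# Style 2: passthrough columns are named explicitly;
# anything not listed would be dropped (the default).
ct_explicit = ColumnTransformer(
    [
        ("ohe", OneHotEncoder(), ["city"]),
        ("keep", "passthrough", ["age", "fare"]),
    ],
    sparse_threshold=0,
)

out_remainder = np.asarray(ct_remainder.fit_transform(df), dtype=float)
out_explicit = np.asarray(ct_explicit.fit_transform(df), dtype=float)
same_output = np.allclose(out_remainder, out_explicit)
```

The explicit style documents intent, while remainder='passthrough' scales to any number of untouched columns without listing them.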

@gary8421 1 year ago

Thank you Kevin.

@dataschool 1 year ago

You're welcome Gary!

@TexasStar007 1 year ago

Thanks Kevin!

@dataschool 1 year ago

You're welcome Shashi!

@venkataramana6975 1 year ago

Good work❤

@dataschool 1 year ago

Thank you!

@hedeyhod 1 year ago

thank you 🙏

@dataschool 1 year ago

You're welcome!

@philwebb59 1 year ago

30:20 Missingness. So, what happens when a feature is fully populated in your training data, but has missing values in your validation data? Just bringing that up in case you don't get to it.

@dataschool 1 year ago

If a feature has no missing values in training, but has missing values in testing, then the prediction step will fail. If that happens, you can go back and set up an imputer for that feature, and thus the prediction step will no longer fail.
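
A rough illustration of that answer, assuming a tiny one-feature dataset and LogisticRegression as a stand-in model (neither is taken from the video). Without an imputer, predicting on a NaN raises an error; with a SimpleImputer in the pipeline, the imputation value learned from the fully populated training data is used instead.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical data: training feature is fully populated,
# new data has a missing value.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([0, 0, 1, 1])
X_new = np.array([[np.nan], [3.5]])

# Without an imputer, prediction fails on the NaN.
plain_model = LogisticRegression().fit(X_train, y_train)
try:
    plain_model.predict(X_new)
    prediction_failed = False
except ValueError:
    prediction_failed = True

# With an imputer, the training mean is learned even though
# no training value was missing, and is used at prediction time.
pipe = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
pipe.fit(X_train, y_train)
preds = pipe.predict(X_new)
learned_mean = pipe.named_steps["simpleimputer"].statistics_[0]
```

Some estimators accept NaN natively, so whether the unimputed prediction fails depends on the model chosen.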

@ayyappahemanth7134 1 year ago

U r awesome sir

@dataschool 1 year ago

Thank you!

@philwebb59 1 year ago

2:03:00 drop='if_binary' makes sense; otherwise you have two columns which are perfectly redundant, not just implied. At least, it's a happy compromise. My only hesitation, without playing with it, is that the order is probably alphabetic. If it assigned 0 to the most frequent category, then handle_unknown='ignore' would make sense. Otherwise, you're lumping unknowns in with the "least" alphabetic category. That's kinda silly.

@dataschool 1 year ago

You're correct that the left-to-right order of categories in the matrix is alphabetical.