Adapt this pattern to solve many Machine Learning problems

12,442 views

Data School

Here's a simple pattern that can be adapted to solve many ML problems. It has plenty of shortcomings, but can work surprisingly well as-is!
Shortcomings include:
- Assumes all columns have proper data types
- May include irrelevant or improper features
- Does not handle text or date columns well
- Does not include feature engineering
- Ordinal encoding may be better
- Other imputation strategies may be better
- Numeric features may not need scaling
- A different model may be better
- And so on...
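
For readers who want to see the general shape of such a pattern, here is a minimal sketch built from the tools named in the tips below (ColumnTransformer, column selection by dtype, imputation with a missing indicator, one-hot encoding with unknown-category handling, and cross_val_score). The toy DataFrame, its column names, and the choice of median imputation, StandardScaler, and LogisticRegression are assumptions made for illustration, not necessarily the exact steps shown in the video.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data (not from the video): two numeric columns, one categorical
rng = np.random.default_rng(0)
n = 100
X = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "city": pd.Series(rng.choice(["A", "B", "C"], n), dtype=object),
})
X.loc[::10, "age"] = np.nan      # some missing numeric values
X.loc[5::20, "city"] = np.nan    # some missing categorical values
y = (X["income"] > 50_000).astype(int)

# Select columns by data type (assumes the dtypes are already correct)
num_cols = make_column_selector(dtype_include="number")
cat_cols = make_column_selector(dtype_exclude="number")

# Numeric columns: impute, flag what was missing, then scale
num_pipe = make_pipeline(
    SimpleImputer(strategy="median", add_indicator=True),
    StandardScaler(),
)
# Non-numeric columns: impute a constant label, then one-hot encode
cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)

preprocessor = make_column_transformer((num_pipe, num_cols), (cat_pipe, cat_cols))
pipe = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))

# Preprocessing is refit inside each fold, so no information leaks across folds
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```

Each shortcoming listed above corresponds to a piece you could swap out: the selectors, the imputers, the encoder, the scaler, or the model.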
Want to watch all 50 scikit-learn tips? Enroll in my FREE online course:
👉 courses.datasc... 👈
Tips mentioned in this video:
Tip 1: • Use ColumnTransformer ...
Tip 2: • Seven ways to select c...
Tip 6: • Encode categorical fea...
Tip 7: • Handle unknown categor...
Tip 9: • Add a missing indicato...
Tip 11: • Impute missing values ...
Tip 16: • Use cross_val_score an...
Tip 27: • Two ways to impute mis...
Tip 43: • Use OrdinalEncoder ins...
=== WANT TO GET BETTER AT MACHINE LEARNING? ===
1) LEARN THE FUNDAMENTALS in my intro course (free!): courses.datasc...
2) BUILD YOUR ML CONFIDENCE in my intermediate course: courses.datasc...
3) LET'S CONNECT!
- Newsletter: www.dataschool...
- Twitter: / justmarkham
- Facebook: / datascienceschool
- LinkedIn: / justmarkham

Comments: 16
@dataschool 2 years ago
Want to watch all 50 scikit-learn tips? Enroll in my FREE online course: courses.dataschool.io/scikit-learn-tips This is the last scikit-learn tip I'll be posting... thank you SO MUCH for watching! 🙌
@grzegorzzawadzki8718 2 years ago
I recently learned that you can use handle_unknown with OrdinalEncoder, but this requires scikit-learn 0.24 or later. What do you think about using OneHotEncoder for only the 5 or 10 most common values?
@dataschool 2 years ago
Regarding handle_unknown with OrdinalEncoder, that's correct! I was excited to see that option released. Regarding OneHotEncoder with a frequency cut-off, that can be a useful strategy. It's not currently easy to do in scikit-learn, but it will be possible in a future version. Thanks for your comment!
@KartikeyRiyal 2 years ago
Great video. I have been learning from your videos since the end of 2018. Thank you so much and God bless you, Kevin. From India.
@dataschool 2 years ago
That's great to hear! 🙏
@johnanih56 2 years ago
THE BEST TIP SO FAR!
@dataschool 2 years ago
You are so kind, thank you! 🙏
@blink4037 2 years ago
Thank you for all the tips, I learned so much. I was wondering: is it possible, or appropriate, to use something like FeatureUnion instead of make_pipeline when combining transformer objects, and pass them as featureunion1 and featureunion2 with these numerical/non-numerical constraints?
@RRSS-ce5hf 2 years ago
Hey Kevin, very helpful videos! In this video, you use num_cols = make_column_selector(dtype_include='number'). Does num_cols here also include the dependent/target column (assuming it is a numerical column)? If yes, say we are scaling the other independent features using RobustScaler() because of the presence of a lot of outliers, but the target column does not have many outliers. Will it affect the regression output? What is the way out (I want to scale all numerical columns except the target column)?
@dataschool 1 year ago
Excellent question! No, num_cols does not include the target column, because the preprocessor is only applied to the columns in X. Hope that helps!
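
To make the answer above concrete, here is a minimal sketch. The DataFrame and its column names are hypothetical. Because the selector is only ever evaluated on X, a numeric target held separately in y is never selected or scaled.

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import RobustScaler

# Hypothetical data: "price" is the numeric target
df = pd.DataFrame({
    "rooms": [2, 3, 4, 10],
    "area": [50.0, 70.0, 90.0, 400.0],
    "city": ["A", "B", "A", "C"],
    "price": [100.0, 150.0, 200.0, 250.0],
})
X = df.drop(columns="price")  # features only
y = df["price"]               # target kept separate

num_cols = make_column_selector(dtype_include="number")
selected = num_cols(X)        # the selector sees X, so "price" cannot appear
print(selected)

# Scale only the numeric feature columns; other columns are dropped by default
preprocessor = make_column_transformer((RobustScaler(), num_cols))
X_scaled = preprocessor.fit_transform(X)
print(X_scaled.shape)
```

If you did want to transform the target as well, that would be a separate step (for example with TransformedTargetRegressor), independent of the preprocessor.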
@pruthvips9565 2 years ago
Can you explain how we can perform EDA in NLP?
@abir95571 2 years ago
Great videos! One question: if the number of categories in a column is large, what is the ideal encoding? One-hot encoding isn't really ideal, as it will create too many dummy columns.
@dataschool 2 years ago
Glad you like the videos! As for your question, there are a lot of factors that influence the optimal encoding, but you can certainly try OrdinalEncoder instead. However, you will find that it's often not a problem to create thousands of dummy columns, and that feature will still be improving the performance of your model. Hope that helps!
@abir95571 2 years ago
@dataschool I considered ordinal encoding, but it inherently introduces a ranking (such as 1 < 2 < 3). In my case the categories have no order and all carry equal weight. I've chosen binary encoding because it at least reduces the number of columns to about log N, where N is the count of distinct categories. My only doubt is whether binary encoding introduces an order or leaves the categories unordered.
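
On the earlier point in this thread that thousands of dummy columns are often not a problem: one reason, at least as far as memory is concerned, is that OneHotEncoder returns a sparse matrix by default, which stores only the non-zero entries. The sketch below uses a made-up high-cardinality column to show this. It does not address whether the extra columns help or hurt a particular model.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column: 10,000 rows drawn from up to 2,000 category labels
rng = np.random.default_rng(0)
codes = rng.integers(0, 2000, size=10_000)
col = np.array([f"cat_{c}" for c in codes], dtype=object).reshape(-1, 1)

ohe = OneHotEncoder(handle_unknown="ignore")
dummies = ohe.fit_transform(col)  # SciPy sparse matrix by default

n_rows, n_cols = dummies.shape
stored = dummies.nnz              # number of values actually stored
print(n_rows, n_cols, stored)
print(f"fraction of cells stored: {stored / (n_rows * n_cols):.5f}")
```

Each row holds a single 1, so the stored values grow with the number of rows rather than with the number of categories.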
@sargonsarkis1292 2 years ago
Awesome!
@dataschool 2 years ago
Thanks!