Drop the first category from binary features (only) with OneHotEncoder

  3,018 views

Data School

A day ago

New in version 0.23: Use drop='if_binary' with OneHotEncoder to drop the first category ONLY if it's a binary feature (meaning it has exactly two categories).
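As a minimal sketch of the behavior described above (the toy data is an assumption for illustration, not taken from the video's notebook), the encoder below keeps all three columns for a three-category feature and drops the first category only for the two-category feature:

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: column 0 (shape) has three categories,
# column 1 (color) has exactly two, so only color is "binary"
X = [
    ['square', 'pink'],
    ['square', 'yellow'],
    ['oval', 'pink'],
    ['circle', 'yellow'],
]

# Requires scikit-learn 0.23 or later
ohe = OneHotEncoder(drop='if_binary')

# fit_transform returns a sparse matrix by default; densify it for inspection
encoded = ohe.fit_transform(X).toarray()

# Categories are sorted alphabetically within each feature
print(ohe.categories_)  # circle/oval/square and pink/yellow
# drop_idx_ holds None for the shape feature (nothing dropped)
# and 0 for the color feature (its first category, 'pink', is dropped)
print(ohe.drop_idx_)
# Output columns: circle, oval, square, yellow
print(encoded)
```

The result has four columns rather than five: the shape feature is fully one-hot encoded, while color collapses to a single "yellow" indicator.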
Note: Beginning in scikit-learn 1.0, drop='first' and drop='if_binary' can both be used with handle_unknown='ignore'. However, the dropped category and an unknown category will both be encoded as all zeros.
👉 New tips every TUESDAY and THURSDAY! 👈
🎥 Watch all tips: • scikit-learn tips
🗒️ Code for all tips: github.com/jus...
💌 Get tips via email: scikit-learn.tips
=== WANT TO GET BETTER AT MACHINE LEARNING? ===
1) LEARN THE FUNDAMENTALS in my intro course (free!): courses.datasc...
2) BUILD YOUR ML CONFIDENCE in my intermediate course: courses.datasc...
3) LET'S CONNECT!
- Newsletter: www.dataschool...
- Twitter: / justmarkham
- Facebook: / datascienceschool
- LinkedIn: / justmarkham

Comments: 9
@dataschool 2 years ago
Big news! I just launched a free, 3-hour course that contains all 50 scikit-learn tips! Join here: courses.dataschool.io/scikit-learn-tips
@sachink9102 10 months ago
Explained very well. May I know what the multicollinearity problem is?
@timfwater 1 year ago
I think the binary justification is that with binary, it's either yes or no, so one binary column can record that. If you just remove a random column for one of your 3 shapes, say 'square', then haven't you just lost that information from your dataset? I guess you could infer that, since there are only 3 discrete categories, a '0' value for both circle and oval implies that it must be a square. But then how would the presence of 'square' be returned as a predictive value in a later model, if square isn't an explicitly listed option?

In the case with 2 separate "Pink" and "Yellow" values, both would be exactly correlated with one another, as the dichotomy is either/or. They are perfect opposites, and the absence of one of the 2 options enables you to infer the value 100% of the time. In the case of 3 categories, the columns would not have a similar symmetric/binary relationship, as the absence of "square" doesn't allow you to directly infer the presence of either circle or oval, the way the absence of Pink does for Yellow. Having 2 alternatives instead of 1 introduces ambiguity that is not present in a binary relationship.

Anyway, just my thoughts. Thank you for the great content!
@dataschool 1 year ago
Great question! The information is not lost when you drop the first column, because the original categories are stored in the categories_ attribute of the OneHotEncoder (ohe.categories_). Hope that helps!
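To illustrate the reply above, here is a minimal sketch (with made-up shape data) showing that the dropped category still appears in `categories_` and that `inverse_transform` maps an all-zero row back to it:

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single feature with three categories
X = [['square'], ['oval'], ['circle']]

ohe = OneHotEncoder(drop='first')
encoded = ohe.fit_transform(X).toarray()

# All three original categories are still stored, sorted alphabetically,
# even though the first one ('circle') has no output column
print(ohe.categories_)

# Remaining columns are oval, square; 'circle' is the all-zero row
print(encoded)

# inverse_transform recovers the original values, including 'circle'
recovered = ohe.inverse_transform(encoded)
print(recovered)
```

Note that `drop='first'` drops the alphabetically first category ('circle' here), not an arbitrary one.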
@johnanih56 2 years ago
YOU ARE AWESOME!
@dataschool 2 years ago
Thank you! 🙏
@Atulmishra-hs8ch 2 years ago
Well, my understanding is that "a binary feature, when one-hot encoded, will always give a 2*2 matrix, and a non-binary feature always gives an n*2 matrix". This could be the supporting pillar for using "if_binary", as it removes redundancy from something very close to an identity matrix.
@dataschool 2 years ago
Thanks for your comment! I still don't quite understand, because regardless of whether the feature has 2 categories or 10 categories, there is still always 1 column (after one-hot encoding) that is redundant.
@sv1562 2 years ago
@dataschool Because in the case of binary it will always be negative collinearity?!
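The two points raised in this thread can be checked numerically. The sketch below uses made-up data and is only an illustration, not the video's code: without `drop`, the one-hot columns of any feature sum to 1 in every row, so one column is always redundant, and for a binary feature the two columns are also perfectly negatively correlated.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical features: one with three categories, one with two
shape = [['square'], ['square'], ['oval'], ['circle']]
color = [['pink'], ['yellow'], ['pink'], ['yellow']]

# Full one-hot encoding (no category dropped)
shape_cols = OneHotEncoder().fit_transform(shape).toarray()
color_cols = OneHotEncoder().fit_transform(color).toarray()

# Every row sums to 1 for both features, so any single column can be
# reconstructed as 1 minus the sum of the others
shape_row_sums = shape_cols.sum(axis=1)
color_row_sums = color_cols.sum(axis=1)
print(shape_row_sums, color_row_sums)

# For the binary feature, the two columns are exact opposites,
# which gives a Pearson correlation of -1
color_corr = np.corrcoef(color_cols[:, 0], color_cols[:, 1])[0, 1]
print(color_corr)
```

This exact linear dependence among the columns is what is usually meant by the multicollinearity problem asked about earlier in the comments.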