Drop the first category from binary features (only) with OneHotEncoder


3,018 views


New in version 0.23: Use drop='if_binary' with OneHotEncoder to drop the first category ONLY if it's a binary feature (meaning it has exactly two categories).
Note: Beginning in scikit-learn 1.0, drop='first' and drop='if_binary' can both be used with handle_unknown='ignore'. However, the dropped category and an unknown category will both be encoded as all zeros.

Comments: 9

@dataschool 2 years ago

Big news! I just launched a free, 3-hour course that contains all 50 scikit-learn tips! Join here: courses.dataschool.io/scikit-learn-tips

@sachink9102 10 months ago

Explained very well. May I know what the multicollinearity problem is?

@timfwater 1 year ago

I think the binary justification is that a binary feature is either yes or no, so one binary column can record it. If you just remove a random column for one of your 3 shapes (say 'square'), then haven't you lost that information from your dataset? I guess you could infer that, since there are only 3 discrete categories, a '0' value for both circle and oval implies that it must be a square. But then how would the presence of 'square' be returned as a predictive value in a later model, if square isn't an explicitly listed option?

In the case with 2 separate "Pink" and "Yellow" columns, both would be exactly correlated with one another, as the dichotomy is either/or. They are perfect opposites, and the absence of one of the 2 options lets you infer the other 100% of the time. In the case of 3 categories, the columns would not have a similar symmetric/binary relationship: the absence of "square" doesn't let you directly infer the presence of circle or oval, the way the absence of Pink does for Yellow. Having 2 alternatives instead of 1 introduces ambiguity that is not present in a binary relationship.

Anyway, just my thought. Thank you for the great content!

@dataschool 1 year ago

Great question! The information is not lost when you drop the first column, because the original categories are stored in the categories_ attribute of the OneHotEncoder (ohe.categories_). Hope that helps!
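A small sketch of this point, using hypothetical shape data and drop='first': the fitted encoder still lists every category in categories_, records which one was dropped in drop_idx_, and inverse_transform maps an all-zeros row back to the dropped category.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single feature with three categories
X = [["circle"], ["oval"], ["square"]]

ohe = OneHotEncoder(drop="first")
encoded = ohe.fit_transform(X).toarray()  # two columns: oval, square

# All three categories are still stored on the fitted encoder
cats = ohe.categories_[0].tolist()

# drop_idx_ holds the index of the dropped category for each feature
dropped = cats[ohe.drop_idx_[0]]

# An all-zeros row is decoded back to the dropped category
recovered = ohe.inverse_transform(encoded).ravel().tolist()
```

So a row of zeros is not "missing" information: given the fitted encoder, it decodes to the dropped category.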

@johnanih56 2 years ago

YOU ARE AWESOME!

@dataschool 2 years ago

Thank you! 🙏

@Atulmishra-hs8ch 2 years ago

Well, my understanding is that "a binary feature, when one-hot encoded, will always give a 2*2 matrix and a non-binary feature always gives an n*2 matrix". This could be the supporting pillar for using "if_binary", as it removes redundancy from a very nearly identity matrix.

@dataschool 2 years ago

Thanks for your comment! I still don't quite understand, because regardless of whether the feature has 2 categories or 10 categories, there is still always 1 column (after one-hot encoding) that is redundant.

@sv1562 2 years ago

@dataschool Because in the case of binary it will always be negative collinearity?!
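The two views in this thread can be checked with a rough sketch on made-up data (the helper name encode_all is hypothetical). With no column dropped, one column is always recoverable from the rest regardless of the number of categories, but only in the binary case are the two columns perfectly negatively correlated with each other.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder


def encode_all(values):
    """One-hot encode a single feature without dropping any column."""
    X = [[v] for v in values]
    return OneHotEncoder().fit_transform(X).toarray()


two = encode_all(["pink", "yellow", "pink"])              # 2 categories
three = encode_all(["circle", "oval", "square", "oval"])  # 3 categories

# Each row sums to 1, so the first column equals 1 minus the other columns
redundant_two = np.array_equal(two[:, 0], 1 - two[:, 1:].sum(axis=1))
redundant_three = np.array_equal(three[:, 0], 1 - three[:, 1:].sum(axis=1))

# Pairwise correlation between the first two columns of each encoding
corr_two = np.corrcoef(two[:, 0], two[:, 1])[0, 1]      # exactly opposite columns
corr_three = np.corrcoef(three[:, 0], three[:, 1])[0, 1]  # negative, but not -1
```

Both redundancy checks come out true, which matches the reply that one column is always redundant. The pairwise correlation is -1 only for the two-category feature, which is the asymmetry the commenters are pointing at.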