Feature Engineering-How to Perform One Hot Encoding for Multi Categorical Variables

265,783 views

Krish Naik

5 years ago

Hi All,
After completing this video you will understand how to perform one-hot encoding for multi-categorical features.
amazon url: www.amazon.in/Hands-Python-Fi...
Buy the best book on machine learning and deep learning with Python, sklearn, and TensorFlow from the link below
amazon url:
www.amazon.in/Hands-Machine-L...
Connect with me here:
Twitter: / krishnaik06
Facebook: / krishnaik06
Instagram: / krishnaik06
Subscribe to my unboxing channel
/ @krishnaikhindi
Below are the various playlists created on ML, Data Science, and Deep Learning. Please subscribe and support the channel. Happy Learning!
Deep Learning Playlist: • Tutorial 1- Introducti...
Data Science Projects playlist: • Generative Adversarial...
NLP playlist: • Natural Language Proce...
Statistics Playlist: • Population vs Sample i...
Feature Engineering playlist: • Feature Engineering in...
Computer Vision playlist: • OpenCV Installation | ...
Data Science Interview Question playlist: • Complete Life Cycle of...
You can buy my book on finance with machine learning and deep learning from the URL below
amazon url: www.amazon.in/Hands-Python-Fi...

Comments: 173
@ttowelie 4 years ago
I spent my whole week trying to solve the same sort of problem. Thank you for your solution!
@niteshmishra3923 4 years ago
I was stuck with a similar kind of dataset for my class project. This has been an immense help in making things clearer! Thanks a ton.
@ajaykumar-rh2gz 3 years ago
Krish Naik Sir, you are doing an amazing job here. I follow you and your channel closely. I have also taken your paid services: admission to the affordable AI course at iNeuron. So far I have recommended your channel to more than 100 students, and most of them are following you. Thank you once again for this support, sir. Ajay Kumar, Ex Indian Navy.
@umakanta7 4 years ago
The best trainer on YouTube, I feel, for simplicity in explaining. Great!
@pushpitkumar99 3 years ago
Your videos are amazing, Sir. Very informative and easy to understand. Thank you so much for all your hard work.
@pradeepc5207 4 years ago
I have been waiting to understand something related to this. Now I have understood the flow. Superb explanation :-)
@cocum2 4 years ago
Great video! This is the solution I was looking for, very well explained, thank you very much for sharing!
@pradeepc5207 4 years ago
Same here.
@harshithbangera7905 3 years ago
Same here. I have always found your videos very useful.
@yosupalex8276 2 years ago
Hey dude, your feature engineering and stats videos SAVED MY LIFE! THANK YOU SO MUCH!
@poornakumar1508 4 years ago
Really cool! I got stuck because I did not know this. Thanks a lot!
@adithyarajagopal1288 4 years ago
Many YouTubers have videos on building models and the intuition behind them, but not many have a feature engineering playlist as comprehensive as yours. All the best.
@abdullahalmahfuz6700 2 months ago
Do I need to know feature engineering in 2024?
@prathameshgurav8313 4 years ago
This video really helped me gain knowledge. Thank you!
@dineshnaik4904 3 years ago
Amazing! Thank you very much for the solid explanation!
@programsolve3053 2 months ago
Thank you so much for the easy explanation of an obscure topic. 🎉🎉🎉🎉
@agastyasharma1641 2 years ago
It is my 2nd day of learning ML, and this is the first video I got when I searched for feature engineering. The video explains things simply enough for a student who is new to AI & ML. @Krish Can I share the link to this video on the Udemy course I am learning from?
@kishoredev6004 4 years ago
Awesome video! Krish, thank you so much.
@shashwatsingh253 4 years ago
Great explanation, Sir! Thank you, Sir.
@Raja-tt4ll 4 years ago
It was a very nice video. Thank you.
@chandrashekharbagul5825 a year ago
Thanks for the help, sir. I was facing exactly the same kind of issue with my data at the workplace.
@lawrencenanagyan489 a month ago
You changed my life! God bless you!
@Futureyouth-be1bo a month ago
Pro, I have a problem: I am using two different datasets, one from Kaggle and one local. The problem comes when I one-hot encode them like this:
flightdata = pd.get_dummies(flightdata, columns=['OriginCityName', 'DestCityName'])
df = pd.get_dummies(df, columns=['OriginCityName', 'DestCityName'])
# Ensure both datasets have the same dummy variables
flightdata, df = flightdata.align(df, join='inner', axis=1)
The public dataset has many more categories than the local one. How can I solve this?
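One possible way to handle the mismatch, sketched under the assumption that the larger dataset should define the reference column layout: encode both frames, then reindex the smaller one onto the larger one's columns so that missing dummy columns are created and filled with 0. The city values below are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Toy stand-ins for the two datasets in the question (values are invented).
flightdata = pd.DataFrame({"OriginCityName": ["Delhi", "Mumbai", "Pune"],
                           "DestCityName": ["Pune", "Delhi", "Goa"]})
df = pd.DataFrame({"OriginCityName": ["Delhi", "Delhi"],
                   "DestCityName": ["Pune", "Goa"]})

cols = ["OriginCityName", "DestCityName"]
flight_enc = pd.get_dummies(flightdata, columns=cols, dtype=int)
df_enc = pd.get_dummies(df, columns=cols, dtype=int)

# Reuse the column layout of the larger dataset. Dummy columns that the
# smaller dataset never produced are added and filled with 0.
df_enc = df_enc.reindex(columns=flight_enc.columns, fill_value=0)

print(df_enc.columns.tolist())
```

Unlike `align(..., join='inner')`, which keeps only the columns both frames share, this keeps every category seen in the reference dataset. Categories that appear only in the smaller frame are dropped by `reindex`, so the choice of reference frame matters.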
@yogeshrunthla9350 4 years ago
Very thankful for your efforts 🙌🙌🙌🙌
@anirvansen6591 4 years ago
Learnt this new technique. Thanks.
@kanhataak1269 4 years ago
All the videos are really nice and very well explained. How should I explain a project in front of an interviewer? When they ask "tell me about your project" and "tell me about yourself", I get confused about where to start. Please explain with an example, and please make a video on this topic in both Hindi and English. Thanks.
@debbie2017 3 years ago
Great! Thanks for saving a lot of time.
@debanganabhattacharjee3706 2 years ago
Hi! Could you please explain how to do the same thing when there are multiple values in each row of a column? For example, a genre column contains many genres separated by commas, like Comedy,Drama,Thriller, and I need them as 3 separate columns with 1/0 values wherever applicable. With this approach, such an entry is identified as a single genre. How do I split it into 3 distinct genres?
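For a comma-separated column like the one described, pandas has `Series.str.get_dummies`, which splits each cell on a separator and returns one 0/1 column per distinct value. A minimal sketch, with an invented `movies` frame standing in for the commenter's data:

```python
import pandas as pd

# Hypothetical data: each row can carry several genres in one cell.
movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genre": ["Comedy,Drama,Thriller", "Drama", "Comedy,Thriller"],
})

# Split on commas and build one indicator column per genre.
genre_dummies = movies["genre"].str.get_dummies(sep=",")

# Attach the indicator columns and drop the original text column.
movies_encoded = pd.concat([movies.drop(columns="genre"), genre_dummies], axis=1)
print(movies_encoded)
```

If the cells contain spaces after the commas (for example "Comedy, Drama"), the values would need stripping first, otherwise " Drama" and "Drama" end up as separate columns.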
@rohanchess8332 a year ago
Very informative!
@mohammaddehghan8762 3 years ago
Thanks a lot for all the tutorials I have learned from.
@rupambose4830 3 years ago
Amazing explanation
@bhushandhamankar 3 years ago
I suggest watching the 2nd video in this playlist first, then coming back to this one :)
@AshutoshSingh-do4ts 2 years ago
Thank you, sir, for this video!
@anupampurkait6066 3 years ago
I think we may not need the 'sort_values' function here, because the 'value_counts' method sorts the values in descending order by default.
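The point about `value_counts` can be checked on a small made-up Series: with its default arguments it already returns counts in descending order, so the extra `sort_values(ascending=False)` yields the same top labels here.

```python
import pandas as pd

# Toy data with distinct counts: a=3, b=2, c=1.
s = pd.Series(["a", "b", "a", "c", "a", "b"])

# value_counts() sorts by count, descending, by default.
top_default = s.value_counts().head(2).index.tolist()

# The explicit sort from the video's pattern gives the same labels.
top_explicit = s.value_counts().sort_values(ascending=False).head(2).index.tolist()

print(top_default, top_explicit)
```

When several labels share the same count, the relative order of the tied labels is not something to rely on, so the two forms may differ in which tied label lands at the cutoff.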
@ritwikmukherjee3572 a month ago
@krish Naik Hello sir, first of all I would like to thank you for giving us so many wonderful videos from which we learn so much. Could you please provide the link to this file so that I can practice the coding part?
@MrDareh 3 years ago
Great! How does this compare to using word embeddings for encoding categorical features?
@abdmo7281 4 years ago
Great video. Can I ask, is that multi-hot encoding?
@abhishekverma549 4 years ago
Sir, I need this .ipynb file. Please share it with us.
@deepeshkumarsharma6514 5 years ago
Sir, if you get time, please create a video about mean encoding; that's also a good encoding technique.
@sujankumar215 3 years ago
Hi Krish, please let me know where I can find the code you used in these videos. I also found that the code for many videos is not available in the description.
@siddharthrao3115 2 years ago
Amazing
@sidgirase a year ago
Hey Krish. I am trying to build an anomaly detection model with many categorical columns. Grouping rare values into a single group would negatively impact my model. Am I thinking right?
@sandipansarkar9211 2 years ago
Finished watching.
@sathishsivam635 a year ago
Only one suggestion I wanted to give you, bro: kindly arrange the videos based on the data science syllabus. It is very difficult to find the frequency.
@mukulmishra2296 5 years ago
Can't we use frequency encoding or target encoding?
@anoshkaniskar3117 3 years ago
Hi Krish, can we also perform mean encoding for this type of problem? Please let me know. Also, thanks for sharing this type of info.
@aroaro4963 4 years ago
Once I feed in the encoded data, I receive encoded output. How can I map it back to the real categorical data (decoding)?
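For a full one-hot encoding (every category has its own column), one way to decode is to take, per row, the name of the column holding the 1 and strip the prefix. This is only a sketch on invented data; with the top-10 variant from the video, rows outside the top labels are all zeros and cannot be recovered this way.

```python
import pandas as pd

# Toy categorical column.
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# Full one-hot encoding: columns color_blue, color_green, color_red.
encoded = pd.get_dummies(colors, prefix="color", dtype=int)

# idxmax(axis=1) returns, for each row, the column name with the largest
# value, which is the column containing the 1.
decoded = encoded.idxmax(axis=1).str.replace("color_", "", regex=False)

print(decoded.tolist())
```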
@yikheichan1653 3 months ago
I'm so confused about how to use this when I have a dataset. So the values with lower frequency are set to 0? And are they still useful for the dataset? For example, when I fit a model like multinomial logistic regression, is your method useful? I ask because when there are more than 2 classes (more than just 0 and 1), I need multinomial logistic regression.
@abelsontenny7537 2 years ago
How do I iterate over the variable (feature) names in a for loop to do the entire process without having to call the one_hot_top_x function again and again?
@rishilramesh946 3 years ago
Is it fine to one-hot encode before the train/test split, or should we do it only after the split? Does one-hot encoding before the split cause data leakage?
@ritvikpant7107 2 years ago
Here we considered the 10 most frequent labels in the dataset. What criterion tells us how many labels to use so that they represent the data well? Anyone can reply.
@RBSTREAMS a year ago
Sir, where can I find these Jupyter notebooks? I don't see any link in the description. Can anybody please help me with that?
@a.r.s.6301 4 years ago
Well sir, I want to ask you something: doesn't your approach lose features? Let's say I have a dataset that contains lots of car brands and I want to do regression. I think your approach works fine for the 10 most frequent brands, but the other brands always become 0. What if I want the model to learn those brands' values? How does it work then?
@janithpanditharathne6196 2 years ago
When there are multi-categorical variables, can we use one-hot encoding with a Support Vector Machine?
@manideepgupta2433 4 years ago
Hi Krish, that was really a wonderful video. But I have a question: I used mean encoding on one of my datasets, on 3 columns containing state, city, and ward values. Is this method better than mean encoding? And if I perform mean encoding on several columns (state, city, ward), does that cause high correlation among the data?
@akashravindra.. 2 years ago
I think mean encoding is better in a way, because it gives different values for different categories, and later you can standardize or normalize them. Here, all the categories are treated as 1, which means a huge loss of data.
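Several comments mention mean (target) encoding and frequency encoding as alternatives. A minimal sketch of both on invented data follows; the column names `city` and `y` are assumptions for illustration. In practice the mapping would usually be computed on the training split only, since the target mean otherwise leaks label information.

```python
import pandas as pd

# Toy training data: one categorical feature and a binary target.
train = pd.DataFrame({"city": ["A", "A", "B", "B", "C"],
                      "y":    [1,   0,   1,   1,   0]})

# Mean (target) encoding: replace each category with the mean target
# observed for that category.
city_means = train.groupby("city")["y"].mean()
train["city_mean_enc"] = train["city"].map(city_means)

# Frequency encoding: replace each category with its share of the rows.
city_freq = train["city"].value_counts(normalize=True)
train["city_freq_enc"] = train["city"].map(city_freq)

print(train)
```

Both encodings produce a single numeric column per feature, which is the property the comment above contrasts with the top-10 one-hot approach.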
@snehithoddula7905 4 years ago
Instead of doing it separately for X1, X2, X3, ..., can't we do it like this?
for i in data.columns:
    top_10=[x for x in data.i.value_counts().sort_values(ascending=False).head(10).index]
    for label in top_10:
        data[label]=np.where(data['i']==label,1,0)
    data[['i']+top_10]
When I try this, I get an error saying that i is not an attribute of data. How can I resolve this? Can somebody help?
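The error in the snippet above most likely comes from `data.i` and `data['i']`: both look for a column literally named "i" instead of using the loop variable. Indexing with `data[i]` uses the variable's value. A small sketch on invented data, with `top_n` set to 2 instead of 10 to keep the toy example short:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the categorical columns of the dataset.
data = pd.DataFrame({"X1": ["a", "a", "a", "b", "b", "c"],
                     "X2": ["p", "p", "p", "q", "q", "r"]})
top_n = 2  # the video uses 10

# Take a snapshot of the column names, because the loop adds new columns.
for col in list(data.columns):
    top_labels = data[col].value_counts().head(top_n).index.tolist()
    for label in top_labels:
        # Prefix with the column name so labels shared by two columns
        # do not overwrite each other.
        data[f"{col}_{label}"] = np.where(data[col] == label, 1, 0)

print(data.head())
```

The column-name prefix is a choice made here, not something the original snippet does; without it, a label such as "a" appearing in two features would map to the same output column.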
@dhainik.suthar 3 years ago
How can we handle this data during model deployment? We need to assign one value as 1 and all the others as 0, which is very time-consuming. Is there a better way? If so, please tell me.
@vininitdgp 3 years ago
So, is it recommended to one-hot encode all the binary categorical features in a dataset, or is it OK to leave them as a single feature (column)?
@akashravindra.. 2 years ago
For binary features it's fine to use, but honestly you can simply replace one of the binary values with zero instead of one-hot encoding and then dropping the original column. You can use the str.replace() function to replace a binary value.
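For a two-valued column, the idea in the reply can be sketched with `Series.map` and an explicit dictionary. This swaps in `map` for the `str.replace()` the reply mentions, because it converts both values to integers in one step; the `gender` column is invented for illustration.

```python
import pandas as pd

# Hypothetical binary categorical feature.
df = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

# Map the two categories to 0 and 1 in a single numeric column.
df["gender_bin"] = df["gender"].map({"male": 0, "female": 1})

print(df)
```

Any value missing from the dictionary would become NaN, which makes unexpected categories easy to spot.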
@gouthamipalarapu909 3 years ago
Hello Krish. I am watching this video on repeat, but I could not understand any of it. Can you please take another dataset to explain OHE? The Mercedes Benz one is really confusing. Awaiting your reply. Please help.
@manojgupta91 3 years ago
First of all, thanks for these wonderful videos. I have a question. Suppose we have a categorical variable with many distinct values. Using a one-hot encoder will add too many features/dimensions. Instead of the top (most frequent) approach, can we add these dimensions and then use dimensionality reduction, e.g. PCA?
@RahulKumar-lv9yz 3 years ago
Do we have to be a member to get this Jupyter notebook and other content?
@sanyuktabaluni4608 4 years ago
Hi Krish! What if we have a dependent variable with categories such as Never, Rarely, Sometimes, Often, or a dependent variable for weather prediction with "Sunny", "Monsoon", "Windy"? How do we deal with a dependent variable with so many categories? Can a dependent variable y have more than 1 column?
@akashravindra.. 2 years ago
Use naive Bayes.
@preenu7528 3 years ago
Could you please provide the link to the Jupyter notebook?
@sreenathgupta6767 3 years ago
Nice. What if I am dealing with a dataset similar to an airline dataset, where source and destination airports are important and we need to consider all airports? How can we deal with such a dataset?
@ranasagar699 3 years ago
You can use the same technique: the 10 most frequent categories for source and destination.
@shreyasaxena5169 4 years ago
Is this right for applying to the whole dataset?
data=pd.read_csv('mercedes.csv',usecols=['X1','X2','X3','X4','X5','X6'])
usecol=['X1','X2','X3','X4','X5','X6']
for a in usecol:
    def cal_top(df,variable):
        tops=[x for x in df[variable].value_counts().sort_values(ascending=False).head(10).index]
        return tops
    top1=cal_top(data,a)
    def one_hot_top_x(df,variable,top_lables):
        for label in top_lables:
            df[variable+'_'+label]=np.where(data[variable]==label,1,0)
    one_hot_top_x(data,a,top1)
@priyankapradhan4539 4 years ago
In one of your videos (optimize CNN model) you used the fashion_mnist dataset to optimize the model. But when I use the same code to read a dataset from a local drive, it shows lots of errors. I did it using the glob module. Sir, please make a video in which we optimize a CNN model using our own dataset from a local drive, or from Google Drive when working with Colab. Please do the needful. Thank you in advance.
@nan0mchgaming937 3 years ago
We can also use nunique.
@bhanupratapyadav6449 2 years ago
Brother, I am getting this error: "maximum recursion depth exceeded while calling a Python object", and the column data type is also getting changed, when I use this code:
for features in MainData.columns:
    MainData[features].replace(np.nan,MainData[features].mean,inplace=True)
Please help.
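A likely cause in the snippet above is that `MainData[features].mean` is written without parentheses, so `replace` receives the method object instead of a number. A hedged sketch of one way to fill missing values with the column mean, restricted to numeric columns since text columns have no mean (the toy `MainData` frame is invented):

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and one text column, each with a gap.
MainData = pd.DataFrame({"age": [20.0, np.nan, 40.0],
                         "city": ["A", None, "B"]})

# Only numeric columns have a meaningful mean.
for feature in MainData.select_dtypes(include="number").columns:
    # Call mean() with parentheses so a number, not a method, is passed.
    MainData[feature] = MainData[feature].fillna(MainData[feature].mean())

print(MainData)
```

Categorical columns are left untouched here; they would need a different strategy, such as filling with the most frequent value.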
@DanishAnsari-sn2sy 3 years ago
Hello Krish, hope you are doing great. You have shown us how to take the top 10 categories of a variable, but you used the top 10 categories of X2 for all of the variables. If we are encoding each variable separately, shouldn't we take the top 10 categories of each variable? Can you please help me out with this?
@muskan_bagrecha 3 years ago
You can find the top 10 for each column and pass them to the function.
@MsKoniki 4 years ago
What a coincidence, I was just watching a video about the Mercedes-Benz F1 engine. Hahaha
@sandipansarkar9211 2 years ago
Where is the IPython notebook for practice? I am unable to locate it.
@vaibhavyaramwar 3 years ago
Do we need to perform encoding only on the training data or on the entire dataset? If we encode only the training dataset, we will face a column-mismatch issue when applying the model to the test data. Can you please explain this briefly? It would be really helpful.
@akashravindra.. 2 years ago
Encoding is always done on the entire dataset, because your test data cannot contain raw categories if you expect the model to predict the output from it.
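One common way to avoid both the column mismatch and the leakage concern raised in this thread is to learn the top labels from the training split only and then apply that same label list to the test split. The sketch below assumes a helper named `one_hot_top`, which is hypothetical and only loosely modelled on the function names quoted in the comments; the data is invented.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"X2": ["as", "as", "as", "ae", "ae", "ai"]})
test = pd.DataFrame({"X2": ["ae", "zz", "as"]})  # "zz" never seen in train

# Learn the most frequent labels from the training data only.
top_labels = train["X2"].value_counts().head(2).index.tolist()

def one_hot_top(df, variable, labels):
    """Return a copy of df with one 0/1 column per label in `labels`."""
    out = df.copy()
    for label in labels:
        out[f"{variable}_{label}"] = np.where(out[variable] == label, 1, 0)
    return out

# The same label list is used for both splits, so the columns match.
train_enc = one_hot_top(train, "X2", top_labels)
test_enc = one_hot_top(test, "X2", top_labels)

print(test_enc)
```

Unseen test categories such as "zz" simply get zeros in every indicator column. scikit-learn's `OneHotEncoder` with `handle_unknown="ignore"` behaves similarly when fitted on the training data and used to transform the test data.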
@datadrix 5 years ago
One stupid question from my side: what are the different roles in machine learning? For example, other fields have developer, tester, coder, etc.
@salvindsouza7053 4 years ago
Analytics and analysis of data in all fields, for automation!
@mahikhan5716 2 years ago
@krish naik Could I have the dataset? That would be very helpful for me.
@vinjad5672 4 years ago
Hi sir, I am not able to get this Mercedes dataset from Kaggle. Can you help me with that?
@snehithoddula7905 4 years ago
www.kaggle.com/c/mercedes-benz-greener-manufacturing/data You can download it from here.
@mritunjay3723 2 years ago
I have joined as a member. How do I get the feature engineering notes?
@vamsireddy6306 5 years ago
Sir, can we say that one-hot encoding with top labels is the only way to improve model performance for datasets with many labels? Suppose a dataset has 100 labels with the same frequency; neglecting 90 of the 100 labels would make our model less effective.
@krishnaik06 5 years ago
As I said, this will not always work. It works only when you have imbalanced categories in your features. Still, I will be uploading more videos to handle different scenarios.
@georgedong3789 4 years ago
Agree with you. Moreover, this kind of encoding can overfit.
@surajrahinj4797 2 years ago
Hey Krish, please provide the notebook in the video description.
@cinderellaman7534 2 years ago
I have 13k categories in one column and only 8 GB of RAM to perform magical modeling. My soul is already crying.
@abhishekprasad7030 5 years ago
Hello, can I get to know where you are currently working? I mean the city and company.
@pankushkukreja3101 5 years ago
Lead Data Scientist, Panasonic, Bangalore, as per GitHub.
@rajatjain328 4 years ago
Please update this playlist.
@banankulovski 8 months ago
What about the dummy variable trap?
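On the dummy variable trap: with a full one-hot encoding, the indicator columns of a feature always sum to 1, so one of them is redundant for linear models. `pandas.get_dummies` has a `drop_first` flag for this. A short sketch on an invented `fuel` column:

```python
import pandas as pd

df = pd.DataFrame({"fuel": ["petrol", "diesel", "cng", "petrol"]})

# Full encoding: one column per category; each row sums to exactly 1.
full = pd.get_dummies(df, columns=["fuel"], dtype=int)

# drop_first=True removes the first category's column, which breaks the
# exact linear dependence among the indicator columns.
reduced = pd.get_dummies(df, columns=["fuel"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

With the top-10 approach from the video, rows whose category falls outside the top labels are all zeros, so the indicator columns do not sum to 1 on every row and the exact dependence generally does not arise.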
@shaz-z506 5 years ago
Hi Krish, I just have one question: how do we decide between the top 10 or the top 20? Choosing the threshold seems tedious. Does the threshold depend on the business and the domain we are applying this technique to? Please let me know.
@aditisrivastava7079 5 years ago
I also have this doubt.
@vishal56765 5 years ago
We can tell from value_counts(). We can take the categories up to the point where the counts start dropping sharply.
@fysalsyed3104 3 years ago
Can you please give me the GitHub link to this notebook?
@mahikhan5716 2 years ago
Where can I get the dataset?
@ashishdhiman4097 3 years ago
The sum of all the labels was 123; however, shape showed only 117 columns. Were some labels missed?
@sandipansarkar9211 2 years ago
Finished practicing the code.
@SahilShah-cd5bi 5 months ago
Sir, can you please explain when we should use one-hot encoding, label encoding, or ordinal encoding?
@SahilShah-cd5bi 5 months ago
What should the conditions be?
@Aman-lw3vq 4 years ago
Wouldn't it be better to use label encoding for every categorical variable instead of creating new variables and making the data messier?
@raghavchhabra4783 4 years ago
What if I have 150+ categorical variables?
@sakshibansal3155 3 years ago
Could you please confirm how to download the data?
@niveshtayal979 5 years ago
Hi Krish, I think this technique is not useful when working on a real-time project. Can you please explain the same thing with some other technique? That would be really helpful.
@Jam05_ 4 years ago
Frequency encoding or target encoding, maybe.
@Mahmil 4 years ago
Please share the link to this notebook file if you have uploaded it to GitHub. Thank you.
@moussaabgaming3463 4 months ago
How can I get your notebook, sir?
@cdhanunjay5497 4 years ago
I have a dataset with more than 1500 different labels. What should I do? If I apply the same thing, there will be more features.
@littlecutiepiedia2940 4 years ago
Take the percentage ratio by applying the 80/20 rule: if 80% of the data lies in the top 10 to 20 categories, then you can apply this; otherwise convert to target-guided mean values.
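The coverage idea in this reply (and the earlier suggestion to look at where `value_counts()` drops off) can be sketched as a cumulative share of rows. The 80% threshold is taken from the comment; the data is invented so that the counts are 50, 25, 15, 5, and 5.

```python
import pandas as pd

# Toy column with a skewed category distribution.
s = pd.Series(["a"] * 50 + ["b"] * 25 + ["c"] * 15 + ["d"] * 5 + ["e"] * 5)

# Share of rows per category, most frequent first, then the running total.
coverage = s.value_counts(normalize=True).cumsum()

# Smallest number of top categories whose combined share reaches 80%.
k = int((coverage < 0.80).sum()) + 1

print(coverage)
print("categories needed for 80% coverage:", k)
```

If `k` turns out small relative to the number of distinct categories, the top-k one-hot approach covers most rows; if it is large, the reply's suggestion of a mean-based encoding may be the more practical choice.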
@akatsukidawn 11 months ago
I am currently doing an MTech in machine learning, but I can't understand anything from this video. I have lots of assignments to do, but I am stuck.
@PapunRout-zk9ip a year ago
Where are the Jupyter notebook links?
@varadpadalkar4879 3 years ago
Please upload the notebook to your GitHub.
@vininitdgp 3 years ago
Please share the Git link for this too. Thanks.
@akhibali8405 3 years ago
When should label encoding be used?
@anuvratshukla7061 4 years ago
Why can't we use Label Encoder?
@animeshmuduli1043 2 years ago
Please provide us the Jupyter notebook file 🙏