Kaggle's 30 Days Of ML (Competition Part-3): What is Target Encoding and how does it work?

Views: 10,486

Abhishek Thakur

Comments: 32

@geekyprogrammer4831 3 years ago

Abhishek bhai, please don't stop making videos. I followed them very regularly even though I was very tired after my working hours. I have put in a lot of dedication, and I'm pretty sure many others around the world are doing the same. So please continue making videos for us. So far, you have done an outstanding job!!

@madhu1987ful 3 years ago

Hi Abhishek, why do we need target encoding at all? What are the benefits? Can you explain briefly? Thanks.

@debarchanbasu1962 3 years ago

Bhaiya, would it be possible for you to create a playlist/course on Kaggle competition-specific machine learning and deep learning? Just a request! 😅

@abhishekkrthakur 3 years ago

there are many videos on my channel... sometimes live competitions too :)

@JeremyWhittakerAZ 3 years ago

Appreciating your videos and just got your book yesterday. Completely unrelated question, what software do you use to make your videos with the camera overlay?

@abhishekkrthakur 3 years ago

OBS

@JeremyWhittakerAZ 2 years ago

@abhishekkrthakur So I downloaded OBS and got it working. My second question is: how do you do that awesome cropping around your head? I don't see it in the app natively. Do you use a green screen?

@abhishekkrthakur 2 years ago

@JeremyWhittakerAZ yes. please take a look at OBS and streaming videos :D

@ram9208 2 years ago

Hi Abhishek, thanks for the video. Quick question: how would target encoding work in real life, since we don't have target values for the test dataset?
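
A minimal sketch of how this can play out, assuming a toy frame with hypothetical column names `cat` and `target`: the category-to-mean mapping is learned only from labelled training rows and then looked up for unlabelled rows, so no test target is needed. The fallback to the global training mean for unseen categories is an assumption of this sketch, not something shown in the video.

```python
import pandas as pd

# Labelled training data (toy values; column names are hypothetical).
train = pd.DataFrame({
    "cat": ["a", "a", "b", "b", "b"],
    "target": [1.0, 3.0, 10.0, 20.0, 30.0],
})
# Unlabelled data seen at prediction time: there is no target column.
test = pd.DataFrame({"cat": ["a", "b", "c"]})

# Learn category -> mean(target) from the training rows only.
mapping = train.groupby("cat")["target"].mean().to_dict()
global_mean = train["target"].mean()

# Look the mapping up for new rows; the unseen category "c" falls back
# to the global training mean (an assumption of this sketch).
test["tar_enc_cat"] = test["cat"].map(mapping).fillna(global_mean)
print(test)
```

The encoding step only reads the `cat` column of the new rows; the targets it relies on all come from the training data.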

@vikasmishra393 3 years ago

Thanks, Abhishek, for your great support. Regards.

@heyrobined 3 years ago

Thanks for the new lesson again.

@vigneshbalasubramanian7878 3 years ago

One small doubt: is so much generalization really required? We encode every categorical variable in the training set (df) by assigning a fold number to each record and then computing its target encoding for a fold from all the other folds it is not part of, right? Then we average the target encodings from each fold in order to encode the same categorical column in the test set, right? My question is: why are we encoding the test set at this stage? We mainly use folds to fine-tune the parameters and make sure we have the correct model. Once the model is set and the parameters are found, couldn't we encode the test set directly from the entire training set instead of averaging over folds, which are themselves already generalized since they contain encodings from other folds?
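
To make the two options in this question concrete, here is a small sketch on made-up data. Option 1 follows the fold-averaging loop from the video; Option 2 is the alternative proposed in the comment. The frame, its values and the `n_folds` name are assumptions for illustration, and the sketch does not claim that either option is better.

```python
import pandas as pd

# Toy training data with a precomputed, deliberately unbalanced fold column.
df = pd.DataFrame({
    "cat":    ["a", "a", "a", "b", "b", "b"],
    "target": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    "kfold":  [0, 0, 1, 1, 2, 2],
})
df_test = pd.DataFrame({"cat": ["a", "b"]})
n_folds = 3

# Option 1: average the per-fold mappings, as in the video's loop.
fold_avg = None
for fold in range(n_folds):
    xtrain = df[df.kfold != fold]
    feat = xtrain.groupby("cat")["target"].mean().to_dict()
    mapped = df_test["cat"].map(feat)
    fold_avg = mapped if fold_avg is None else fold_avg + mapped
fold_avg = fold_avg / n_folds

# Option 2: one mapping computed from the entire training set.
full = df_test["cat"].map(df.groupby("cat")["target"].mean().to_dict())

print(pd.DataFrame({"cat": df_test["cat"], "fold_avg": fold_avg, "full_train": full}))
```

On this toy data the two encodings are close but not identical. When every category is spread evenly across the folds, they coincide.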

@yogitad4136 3 years ago

Abhishek bhai, you've confused me. Out of all the encoding methods, I used the one that counts the frequency, and I think that is what is missing from your notebook. Please confirm:

    for col in cat_col:
        train[f"cont_{col}"] = train.groupby(col)[col].transform("count")
        test[f"cont_{col}"] = test.groupby(col)[col].transform("count")

So as per my understanding, this should have been done on folds, i.e. xtrain, xvalid and xtest. What we saw yesterday was frequency encoding: a way to use the frequency of the categories as labels. But what you showed us today is target encoding. Frequency encoding can also be used for categorical variables. Is my understanding correct?

@abhishekkrthakur 3 years ago

Apologies. Let me try again. The first step would be to forget about:

    for col in cat_col:
        train[f"cont_{col}"] = train.groupby(col)[col].transform("count")
        test[f"cont_{col}"] = test.groupby(col)[col].transform("count")

If you do any encoding this way, it will overfit! There is a high chance of data leakage. Now let's take a look at the chunk of code from this video:

    for col in object_cols:
        temp_df = []
        temp_test_feat = None
        for fold in range(5):
            xtrain = df[df.kfold != fold].reset_index(drop=True)
            xvalid = df[df.kfold == fold].reset_index(drop=True)
            feat = xtrain.groupby(col)["target"].agg("mean")
            feat = feat.to_dict()
            xvalid.loc[:, f"tar_enc_{col}"] = xvalid[col].map(feat)
            temp_df.append(xvalid)
            if temp_test_feat is None:
                temp_test_feat = df_test[col].map(feat)
            else:
                temp_test_feat += df_test[col].map(feat)
        temp_test_feat /= 5
        df_test.loc[:, f"tar_enc_{col}"] = temp_test_feat
        df = pd.concat(temp_df)

In the code above, look carefully: if you replace "target" with col and "mean" with "count", it becomes frequency encoding. Please let me know if it's clear now. :)

@yogitad4136 3 years ago

@abhishekkrthakur Oh yes, I get it now. Thank you so much for the explanation!
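
The substitution described in this thread ("target" replaced by col, "mean" replaced by "count") can be sketched as follows. The names `df`, `df_test`, `object_cols` and `kfold` mirror the loop from the video; the data, the `n_folds` variable and the `cnt_enc_` prefix are made up for this illustration.

```python
import pandas as pd

# Toy data; the values are made up for this sketch.
df = pd.DataFrame({
    "cat0":  ["a", "a", "b", "a", "b", "b"],
    "kfold": [0, 1, 0, 1, 0, 1],
})
df_test = pd.DataFrame({"cat0": ["a", "b", "c"]})
object_cols = ["cat0"]
n_folds = 2

for col in object_cols:
    temp_df = []
    temp_test_feat = None
    for fold in range(n_folds):
        xtrain = df[df.kfold != fold].reset_index(drop=True)
        xvalid = df[df.kfold == fold].reset_index(drop=True)
        # "target" -> col and "mean" -> "count": category frequencies
        # counted on the training folds only.
        feat = xtrain.groupby(col)[col].agg("count").to_dict()
        xvalid.loc[:, f"cnt_enc_{col}"] = xvalid[col].map(feat)
        temp_df.append(xvalid)
        mapped = df_test[col].map(feat)
        temp_test_feat = mapped if temp_test_feat is None else temp_test_feat + mapped
    # Test rows get the average of the per-fold counts.
    df_test.loc[:, f"cnt_enc_{col}"] = temp_test_feat / n_folds
    df = pd.concat(temp_df).reset_index(drop=True)

print(df)
print(df_test)
```

The category "c" never appears in the training data, so its encoding stays NaN here; how to fill it is left open.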

@swayamsingh4650 3 years ago

Thanks for the new concept. I have a doubt: when we calculate the grouped mean on x_train, why do we map it onto x_valid?

@affahrizain 3 years ago

I think it is to prevent target leakage: we want the encoding for x_valid to be derived from the targets in x_train, not from its own.

@swayamsingh4650 3 years ago

@affahrizain thanks
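
A tiny illustration of the leakage point in this thread, on made-up data with hypothetical column names: when the mean is computed on the same rows that are being encoded, the feature can simply reproduce the target, whereas the out-of-fold version only uses targets from the other fold.

```python
import pandas as pd

# Toy data where every category appears once per fold (values made up).
df = pd.DataFrame({
    "cat":    ["a", "b", "a", "b"],
    "target": [1.0, 5.0, 3.0, 9.0],
    "kfold":  [0, 0, 1, 1],
})

xtrain = df[df.kfold != 0].reset_index(drop=True)
xvalid = df[df.kfold == 0].reset_index(drop=True)

# Leaky: the mean is computed on the very rows being encoded, so the
# feature is a copy of the target in this extreme case.
leaky = xvalid["cat"].map(xvalid.groupby("cat")["target"].mean())

# Out-of-fold: the mean comes from the other fold only.
oof = xvalid["cat"].map(xtrain.groupby("cat")["target"].mean())

print(pd.DataFrame({"target": xvalid["target"], "leaky": leaky, "oof": oof}))
```

The single-row-per-category setup is chosen to exaggerate the effect; with more rows per category the leak is weaker but still present.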

@igordedkov3686 3 years ago

Why do we do target encoding based on categorical columns? What if I apply "OrdinalEncoder" to the whole df at the very beginning, before the folds? Can I still use the cat columns for target encoding if they now hold numerical values instead of text?

@abhishekkrthakur 3 years ago

If you implement OrdinalEncoder before the folds, there will be data leakage. Yes, you can use target encoding after converting them into numbers.

@iwrestling4020 3 years ago

Sir, can you explain target encoding again, or point to some resource that explains it? The explanation was split between two videos, and I couldn't understand target encoding from this one.

@abhishekkrthakur 3 years ago

again? 😂 Have you seen the part-2 video? All videos in this series are connected. If you miss the previous one, you won't understand this. Regarding resources: a simple Google search will provide you with many resources :)

@mikayilshahtakhtinski8939 3 years ago

Hello. One of the lessons said that applying OrdinalEncoder can cause problems: for instance, if the validation data contains values that don't appear in the training data, the encoder will throw an error, because these values won't have an integer assigned to them. How did you deal with that? I see that you just used the encoder without removing the bad columns.

@abhishekkrthakur 3 years ago

there are no bad columns in this dataset :)
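
For datasets where the problem raised in the question does occur, one possible workaround is sketched below. It is not what the video does; it assumes scikit-learn 0.24 or newer, where `OrdinalEncoder` accepts `handle_unknown="use_encoded_value"`, and the placeholder value -1 and the toy arrays are choices made for this sketch.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy folds: "c" appears only in the validation data.
xtrain = np.array([["a"], ["b"], ["a"]], dtype=object)
xvalid = np.array([["b"], ["c"]], dtype=object)

# Unseen categories get the placeholder -1 instead of raising an error.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(xtrain)

encoded = enc.transform(xvalid)
print(encoded)
```

With the default `handle_unknown="error"`, the same `transform` call would raise a `ValueError` on "c".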

@maxidiazbattan 3 years ago

Sir, I have a doubt: is it wrong if I do the fold split at the end of the whole process? I mean, first I do all the feature creation and selection, and then, with the dataset "ready" to train, I do the fold split along with the encoding or scaling.

@abhishekkrthakur 3 years ago

If you do it at the end, there is a high possibility of data leakage: kzfaq.info/get/bejne/edRgndKLsJqWdKc.html

@maxidiazbattan 3 years ago

@abhishekkrthakur thank you so much, Sir
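
A rough sketch of the "folds first" ordering discussed in this thread, under stated assumptions: the frame, the feature name `x`, the three-way split and the manual standardisation step are all made up for illustration. The fold column is created before any fitted preprocessing, and the statistics used to scale each validation fold come from the corresponding training folds only.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Toy data (values are made up); "x" is a hypothetical numeric feature.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "target": [0, 1, 0, 1, 0, 1],
})

# Step 1: assign folds before any fitted preprocessing.
df["kfold"] = -1
kf = KFold(n_splits=3, shuffle=True, random_state=42)
for fold, (_, valid_idx) in enumerate(kf.split(df)):
    df.loc[valid_idx, "kfold"] = fold

# Step 2: inside each fold, learn statistics from the training part only
# and apply them to the validation part.
for fold in range(3):
    xtrain = df[df.kfold != fold]
    xvalid = df[df.kfold == fold].copy()
    mean, std = xtrain["x"].mean(), xtrain["x"].std()
    xvalid["x_scaled"] = (xvalid["x"] - mean) / std
    print(fold, xvalid[["x", "x_scaled"]].round(3).to_dict("list"))
```

The same pattern applies to target encoding: the fold column exists first, and every learned quantity is computed per fold from the rows outside that fold.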