Welcome to the fourth issue of DIY (Do It Yourself). In this section, every week, a key concept in Data Science will be introduced to you. After that, you will be able to practice what you learned! We continue this section with data or feature encoding. I hope you enjoy it!
💊 Pill of the week
In this issue, we will deal with encoding.
What are categorical variables?
Categorical variables are those that take on a limited, fixed number of distinct categories or classes. They are often human-understandable labels such as "red," "blue," "apple," "banana," etc., and don't have a natural numerical representation. To convert these variables into a format that can be provided to machine learning algorithms, we use encoding.
What is encoding?
Encoding is the process of converting data from one form to another. In the context of machine learning and data science, encoding often refers to the transformation of categorical variables into a numerical format that can be easily used by algorithms. Most machine learning algorithms require numerical input and output variables, hence the need for encoding.
The main goal of encoding in machine learning is to translate the data into a form that is useful and understandable by the algorithm, without distorting the characteristic properties and relationships in the data.
It's worth noting that improper encoding can introduce biases or inaccuracies in the model, so choosing the right encoding method is crucial.
Main techniques
1. Label Encoding
In this technique, each unique category is mapped to an integer starting from 0. It does not assume any relationship of order or magnitude between the categories. Categories are numbered arbitrarily.
When to Use: Since the integers are assigned arbitrarily, it is best used with algorithms that do not interpret the codes as magnitudes (e.g., decision trees and other tree-based models), or for encoding target labels.
Pros:
Simple to implement.
Does not increase dimensionality.
Cons:
The arbitrary integer codes can be misread by the algorithm as having magnitude or order, leading the model to learn spurious relationships.
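As a quick illustration (the fruit values here are invented), scikit-learn's LabelEncoder maps each unique category to an integer, assigning codes in alphabetical order of the unique values:

```python
from sklearn.preprocessing import LabelEncoder

# Toy data: a nominal feature with three distinct categories
fruits = ["apple", "banana", "apple", "cherry"]

encoder = LabelEncoder()
codes = encoder.fit_transform(fruits)  # apple=0, banana=1, cherry=2

print(list(codes))             # integer codes, one per sample
print(list(encoder.classes_))  # the learned category-to-code mapping
```

Note that nothing about "apple" makes it smaller than "cherry"; the 0 &lt; 2 relationship exists only in the encoding, which is exactly the con described above.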
2. One-Hot Encoding
It creates a new binary column for each unique category, marking presence with 1 and absence with 0.
When to Use: For nominal categories where no ordinal relationship exists.
Pros:
Easy to use and interpret.
No ordinal relationships are introduced.
Cons:
Dimensionality: Introduces as many new columns as there are unique values in the original column, which can explode dimensionality.
Multicollinearity: The encoding can introduce multicollinearity which can be problematic for certain algorithms (like linear regression).
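A minimal sketch with pandas (the color data is made up): `pd.get_dummies` creates one binary column per category, and its `drop_first=True` option drops one of the columns to mitigate the multicollinearity mentioned above.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

# One binary column per unique category; pass drop_first=True
# to drop one column and avoid perfect multicollinearity
encoded = pd.get_dummies(df, columns=["color"])

print(encoded.columns.tolist())  # ['color_blue', 'color_green', 'color_red']
```

With 3 unique colors the single column becomes 3 columns; with 3,000 unique values it would become 3,000, which is the dimensionality explosion noted above.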
3. Ordinal Encoding
It maps each unique category to an integer based on the inherent ordinal nature of the category.
When to Use: For ordinal data where the order of categories is important.
Pros:
Keeps the ordinal relationship.
Does not increase dimensionality.
Cons:
Requires the category order to be specified manually, which is easy to get wrong.
Subjectivity: Determining the correct order of categories can be subjective and/or data-dependent.
Misinterpretation: Incorrect ordering may lead the model to learn inaccurate patterns from the data.
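A sketch with scikit-learn's OrdinalEncoder (the size data is invented); the key point is passing the category order explicitly, since the default ordering is alphabetical rather than semantic:

```python
from sklearn.preprocessing import OrdinalEncoder

# 2D input: one feature column with an inherent order small < medium < large
sizes = [["small"], ["large"], ["medium"], ["small"]]

# Spell out the order; otherwise categories are sorted alphabetically,
# which would wrongly give large=0, medium=1, small=2
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
codes = encoder.fit_transform(sizes)

print(codes.ravel().tolist())  # [0.0, 2.0, 1.0, 0.0]
```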
4. Frequency or Count Encoding
Categories are replaced with their frequency or count in the data set.
When to Use: When there are too many categories and one-hot encoding increases dimensionality too much.
Pros:
Effective for high cardinality features.
Does not increase dimensionality.
Cons:
Loses information about the categories.
Different categories might end up having the same frequency, causing a collision.
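A minimal pandas sketch (the city names are invented): each category is mapped to its count, or to its relative frequency, in the data set.

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", "NY", "SF", "NY", "LA"]})

counts = df["city"].value_counts()                  # NY: 3, LA: 2, SF: 1
df["city_count"] = df["city"].map(counts)           # raw counts
df["city_freq"] = df["city"].map(counts / len(df))  # relative frequencies
```

If two cities happened to appear the same number of times, they would receive the same code; that is the collision mentioned above.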
5. Target Encoding
Categories are replaced with the mean of the target variable for that category.
When to Use: When the category has some correlation with the target variable. Be cautious of data leakage.
Pros:
Can capture information within the category that can aid in prediction.
Useful for high cardinality features.
Cons:
Prone to data leakage: If not done correctly, target encoding can result in data leakage that can inflate the performance metrics.
Overfitting: This encoding is sensitive to outliers and can overfit when some categories have only a few observations.
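A deliberately naive sketch (the data is invented) that replaces each category with the mean of the target. In real use, compute the means on the training folds only (e.g., cross-fold or smoothed target encoding) to avoid the leakage described above.

```python
import pandas as pd

df = pd.DataFrame({
    "city":  ["NY", "NY", "LA", "LA", "SF"],
    "price": [10, 20, 30, 50, 40],  # the target variable
})

# Mean of the target per category -- computed on the FULL data here,
# which leaks the target; restrict this to the training fold in practice
means = df.groupby("city")["price"].mean()  # NY: 15, LA: 40, SF: 40
df["city_encoded"] = df["city"].map(means)
```

Notice how "SF", with a single observation, is encoded as its one target value exactly; this is the small-sample overfitting risk listed above.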
⚠️ Each encoding technique has its own advantages and disadvantages, and the choice of which to use often depends on the specific problem and the type of data you're working with. Always remember to test out different approaches and validate their effectiveness using cross-validation or a separate validation set.
👉 Check the next section to put what you've learned into practice!
This could perfectly be a Data Science interview question. You can check additional questions on the website!
If you like it, subscribe for free to support us:
🛠️ Do It Yourself!
Now that you know the theory about encoding and when you should use each of the techniques, it’s time to apply it!
How does it work?
📜I will share a notebook with some guided initial steps.
📌I will ask you some tasks that you should complete.
🎯I will share the outcome so you can check if you did well or not!
Now it’s your turn. Let’s play!
I want you to apply each of the five techniques I previously introduced:
🎲Label Encoding - Difficulty ⭐
🧮One-Hot Encoding - Difficulty ⭐⭐
🔢Ordinal Encoding - Difficulty ⭐
⏱Frequency Encoding - Difficulty ⭐⭐
⚖️Target Encoding - Difficulty ⭐⭐
Here you can find a Kaggle notebook with everything you need:
Subscribe to Machine Learning Pills to keep reading this post and get 7 days of free access to the full post archives.