Issue #44 - How important is each feature in your model?
💊 Pill of the week
Last week we generated multiple features for our Time Series data; however, some of them may not be useful at all. How do we determine which ones are the best? That's the topic of this week's article: feature importance. You can read it here.
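As a teaser, here is a minimal sketch of one common way to measure feature importance, permutation importance, using only NumPy. The data, the linear model, and the two columns are made up purely for illustration; the full article covers the approach properly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y depends strongly on the first column and not at all on the second
n = 500
X = rng.normal(size=(n, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=n)

# Fit a simple linear model via least squares
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def mse(X, y, coef):
    return float(np.mean((y - X @ coef) ** 2))

baseline = mse(X, y, coef)

# Permutation importance: shuffle one column at a time and
# measure how much the error increases
importances = []
for j in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    importances.append(mse(X_perm, y, coef) - baseline)

print(importances)  # the first feature should matter far more than the second
```

The same idea works with any fitted model: a feature whose shuffling barely hurts the error is one the model is not really using.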
To check the answers to last week’s DIY issue go to the end of this issue. There is also a notebook waiting for you!
🤖 Tech Round-Up
No time to check the news this week?
This week's TechRoundUp comes full of AI news. From Meta's new AI to Google's educational tools, the future is zooming towards us! 🚀
Let's dive into the tech highlights you shouldn't miss this week 💥
1️⃣ 𝗠𝗲𝘁𝗮 𝗯𝗲𝗴𝗶𝗻𝘀 𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 𝗟𝗹𝗮𝗺𝗮 3
Zuckerberg's Meta has already started the next generation of Llama 🦙
Promising a new era in open-source AI technology!
Rumour has it that it might launch during the first half of 2024
2️⃣ 𝗚𝗼𝗼𝗴𝗹𝗲 𝗖𝗹𝗼𝘂𝗱 𝗮𝗻𝗱 𝗛𝘂𝗴𝗴𝗶𝗻𝗴 𝗙𝗮𝗰𝗲 𝗽𝗮𝗿𝘁𝗻𝗲𝗿𝘀𝗵𝗶𝗽
Google Cloud teams up with Hugging Face.
Their main goal? Attract AI developers to GCP.
Together, they're unleashing new potentials in AI tech. 🤝
3️⃣ 𝗥𝗮𝗯𝗯𝗶𝘁 𝗥1 𝘂𝘀𝗶𝗻𝗴 𝗣𝗲𝗿𝗽𝗹𝗲𝘅𝗶𝘁𝘆 𝗔𝗜'𝘀 𝘁𝗲𝗰𝗵
Meet Rabbit R1, the tech marvel using Perplexity AI to answer your queries.
Think smarter, faster, and more accurate responses.
AI just got real! 🐰🧠
4️⃣ 𝗚𝗼𝗼𝗴𝗹𝗲'𝘀 𝗔𝗜-𝗽𝗼𝘄𝗲𝗿𝗲𝗱 𝗳𝗲𝗮𝘁𝘂𝗿𝗲𝘀 𝗳𝗼𝗿 𝗲𝗱𝘂𝗰𝗮𝘁𝗶𝗼𝗻
Google is transforming education with AI-powered features! 🎓
Interactive learning and personalized education experiences are no longer sci-fi fantasies.
5️⃣ 𝗩𝗼𝗶𝗰𝗲 𝗰𝗹𝗼𝗻𝗶𝗻𝗴 𝘀𝘁𝗮𝗿𝘁𝘂𝗽 𝗘𝗹𝗲𝘃𝗲𝗻𝗟𝗮𝗯𝘀
ElevenLabs is soaring high with voice cloning!
With a whopping $80M funding, they've achieved unicorn status.
The era of personalized digital voices is here! 🗣️✨
🎓Learn Real-World Machine Learning!*
Do you want to learn Real-World Machine Learning?
Data Science doesn't end with model training… There is much more!
Here you will learn how to deploy and maintain your models, so they can be used in a Real-World environment:
Elevate your ML skills with "Real-World ML Tutorial & Community"! 🚀
Business to ML: Turn real business challenges into ML solutions.
Data Mastery: Craft perfect ML-ready data with Python.
Train Like a Pro: Boost your models for peak performance.
Deploy with Confidence: Master MLOps for real-world impact.
🎁 Special Offer: Use "MASSIVE50" for 50% off.
*Sponsored
📝Check if you were right!
Last week we introduced the concept of feature engineering for Time Series and we suggested some techniques. We also provided an incomplete notebook for you to play with. This week, the answer!
But before that, you can check the previous issue:
You can also go straight to the answer notebook at the end.
Lag Features
Lag features are values at prior time steps. They can help capture the temporal dependencies in the data. For example, the sales one week ago and one month ago (28 days, i.e. 4 weeks, so the lag falls on the same day of the week).
df['sales_a_week_ago'] = df['sales'].shift(7)
df['sales_a_month_ago'] = df['sales'].shift(28)
Rolling window features
These features are statistical measures like mean, median, standard deviation, etc., over a sliding or rolling window of time periods. In this case, we are considering the mean, median and standard deviation of the sales during the last week.
window_size = 7 # a week
df['last_week_mean'] = df['sales'].rolling(window_size).mean()
df['last_week_median'] = df['sales'].rolling(window_size).median()
df['last_week_std'] = df['sales'].rolling(window_size).std()
Expanding window features
These are similar to rolling window features, but the window grows over time: each row's statistics are computed over all the data up to that date.
df['historic_mean'] = df['sales'].expanding().mean()
df['historic_median'] = df['sales'].expanding().median()
df['historic_std'] = df['sales'].expanding().std()
Domain-specific features
These are features that are specific to the problem at hand. Since the data for this example is about the sales in Ecuador, an interesting feature could be the average daily temperature. We can consider the capital city, Quito.
# I used Meteostat library for obtaining df_quito
df['temperature'] = df_quito.tavg.interpolate()
df['temperature'] = df['temperature'].ffill().bfill()
Time Since an Event
This feature measures the time that has passed since a particular event occurred. In this case, we assume there was an event on the 5th of January 2015 and calculate, for each row, the number of days that have passed since then. This is just an example, but this kind of feature can be much more elaborate and informative if done well.
# Define the event date (the start of the marketing campaign in this case)
event_date = pd.to_datetime('2015-01-05')
# Create the 'Days Since Event' feature
df['days_since_campaign'] = (df.index - event_date).days
Autoregressive features
These are based on the idea that past values have an influence on current values. An autoregressive feature of order p would use the last p values. This is similar to lag features but instead of using the raw values, we use the values predicted by an autoregressive model.
from statsmodels.tsa.ar_model import AutoReg
# Fit an autoregressive model of order 7
model = AutoReg(df['sales'], lags=7)
model_fit = model.fit()
# Use the fitted model's in-sample predictions
# (the first 7 rows have no prediction and will be NaN)
df['ar_feature'] = model_fit.predict()
Seasonal Features
These features capture recurring calendar patterns such as holidays and other seasonal events. For example, a flag indicating whether the date falls within the Christmas period (25th Dec - 1st Jan).
# Create a binary feature for Christmas period
df['is_christmas_period'] = (((df.index.month == 12) & (df.index.day >= 25)) |
                             ((df.index.month == 1) & (df.index.day <= 1))).astype(int)
Cyclical Features
Some time series data exhibits cyclical patterns that may not align with standard date-related features. Creating cyclical features can help capture such patterns. Here we consider a weekly cycle.
df['weekday_sin'] = np.sin(2 * np.pi * df['day_of_week'] / 7)
df['weekday_cos'] = np.cos(2 * np.pi * df['day_of_week'] / 7)
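To see why this encoding helps: with sin/cos features, Sunday and Monday end up as close together as any other pair of adjacent days, which a plain 0-6 integer would not give you. A small self-contained check:

```python
import numpy as np

def encode_weekday(d):
    """Map a day of week (0-6) onto a point on the unit circle."""
    angle = 2 * np.pi * d / 7
    return np.array([np.sin(angle), np.cos(angle)])

# Distance between adjacent days is constant, even across the week wrap-around
mon_tue = np.linalg.norm(encode_weekday(0) - encode_weekday(1))
sun_mon = np.linalg.norm(encode_weekday(6) - encode_weekday(0))

print(mon_tue, sun_mon)  # identical chord lengths
```

With the raw integer, the "distance" from Sunday (6) to Monday (0) would be 6 instead of 1, so models could wrongly treat them as far apart.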
Exponential Moving Averages
Similar to moving averages, exponential moving averages give more weight to recent observations.
span = 3
df['sales_ema'] = df['sales'].ewm(span=span, adjust=False).mean()
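Concretely, with `adjust=False` pandas computes the EMA recursively as `ema[t] = alpha * x[t] + (1 - alpha) * ema[t-1]`, where `alpha = 2 / (span + 1)`, so each new observation carries weight `alpha` and older ones decay geometrically. A quick check of that recursion on toy data:

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 11.0, 13.0])
span = 3
alpha = 2 / (span + 1)  # 0.5 for span=3

# Manual recursion, seeded with the first observation
manual = [s.iloc[0]]
for x in s.iloc[1:]:
    manual.append(alpha * x + (1 - alpha) * manual[-1])

ema = s.ewm(span=span, adjust=False).mean()
print(list(ema), manual)  # the two match
```

Shorter spans mean a larger `alpha`, so the EMA reacts faster to recent changes.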
Difference features
These features represent the difference between values a fixed number of time steps apart. Differences can highlight trends or abrupt changes in the data. Here we compute the change in sales relative to the previous day, the previous week, and the previous month (28 days).
df['sales_diff_yesterday'] = df['sales'].diff(1)
df['sales_diff_last_week'] = df['sales'].diff(7)
df['sales_diff_last_month'] = df['sales'].diff(28)
And finally, you can get all the code here: