Machine Learning Pills

RW #3 - EDA applied to Netflix (part I)
Real-World

David Andrés
and
Muhammad Anas
Mar 30, 2025

💊 Pill of the Week

Exploratory Data Analysis (EDA) is the foundation of any data-driven project. It’s where you get your first "feel" of the data — what hides beneath, what patterns emerge, and where problems lurk. Think of EDA as mapping uncharted territory before building anything on top.

In this week's pill, we're breaking down the first part of a practical EDA series using Netflix’s Movies and TV Shows dataset. You'll learn why and how EDA matters, and how every plot you generate now can feed into your Machine Learning pipeline later.

✏️ Article and code by Muhammad Anas.

What is this EDA Series?

Over multiple parts, we’ll:

  • Theoretically explain why EDA matters

  • Use Netflix data to practice — because who doesn’t binge Netflix?

  • Show how every insight translates into ML applications

Why Do EDA?

Exploratory Data Analysis (EDA) is a crucial first step in any data science or machine learning project. It helps you:

✅ Understand Data Distributions
Gain insight into how your variables are spread—are they skewed, normally distributed, or full of surprises?

✅ Detect Missing Values, Outliers, and Inconsistencies
Spot issues early—missing data, anomalous values, or strange patterns that could skew your analysis or mislead your models.

✅ Discover Relationships Between Variables
Identify trends, correlations, and potential causal links. This helps guide both your modeling approach and business interpretation.

✅ Inform Feature Engineering for ML Models
EDA reveals patterns and data quirks that can inspire the creation of powerful new features—or the removal of redundant ones.

✅ Refine Business Questions and Assumptions
Sometimes the data tells a different story than expected. EDA helps align your hypotheses with reality and may uncover new questions worth asking.

🔍 Reminder: Garbage in, garbage out.
Good EDA saves you from wasting time building models on messy, misleading, or misunderstood data. Think of it as the detective work that sets the stage for everything else.

Do you want more details? Check our previous MLPills issue:

Issue #67 - Exploratory Data Analysis
David Andrés and Josep Ferrer · July 28, 2024

0. Dataset - Netflix Movies & TV Shows

What’s inside:

  • 8,807 titles from Netflix (as of 2021)

  • Columns: title, cast, director, country, release year, rating, duration, genre

The dataset is simple but loaded with insights about what Netflix adds, when, and from where.
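Before plotting anything, a quick structural check is usually worthwhile. The sketch below is only an illustration: it uses a few made-up rows as a stand-in for the real table, and the file name in the comment is an assumption. Column names follow the ones used in the snippets in this post.

```python
import pandas as pd

# Hypothetical stand-in rows; in practice you would load the downloaded file,
# e.g. netflix_df = pd.read_csv("netflix_titles.csv")  # file name is an assumption
netflix_df = pd.DataFrame(
    {
        "title": ["Title A", "Title B", "Title C", "Title D"],
        "type": ["Movie", "TV Show", "Movie", "Movie"],
        "director": ["Director X", None, "Director Y", None],
        "country": ["United States", "India", None, "United States"],
        "release_year": [2018, 2019, 1925, 2018],
        "rating": ["TV-MA", "TV-14", "TV-PG", "74 min"],
    }
)

# Size of the table: (rows, columns)
n_rows, n_cols = netflix_df.shape

# Missing values per column, largest first
missing_per_column = netflix_df.isna().sum().sort_values(ascending=False)

# Number of distinct non-missing values per column,
# useful for spotting categorical fields
unique_per_column = netflix_df.nunique()

print(n_rows, n_cols)
print(missing_per_column)
print(unique_per_column)
```

On the real data, the same three checks show which columns need cleaning before any of the plots in the following sections.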

Get the dataset

1. Release Year Distribution

Why care: Content release years tell us if Netflix focuses more on newer or older content.

Finding: There’s a dramatic rise in content starting around 2010, peaking in 2018 — the year with the most titles released. Prior to 2000, content is sparse, with only a trickle of titles from the 20th century, including a few surprises from as far back as 1925.

ML Angle:

  • Time-decay features: Recency bias in content could influence recommendation ranking.

  • Content longevity modeling: Which older titles continue to perform well?

🔥 Pro Tip: If you’re pitching content to Netflix, target trends from the post-2015 surge — that’s when they were clearly scaling aggressively.

💎 Here’s a snippet of the code. The full notebook, including all the code, will be sent exclusively to paid subscribers on Wednesday. This is a one-time send—only subscribers with an active paid membership at that time will receive it via email.💎

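# Count titles per release year and plot the counts as horizontal bars.
# The outer parentheses let the method chain span several lines.
(
    netflix_df["release_year"]
    .value_counts()
    .plot.barh(
        figsize=(30, 20),
        color="#32a883"
    )
)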

The release year distribution clearly shows Netflix’s aggressive expansion post-2010, aligning with its pivot from distributor to producer. 2018 stands out as the peak year, followed closely by adjacent years. Older content (pre-2000s) is minimal — likely classic films added for niche interest. The long tail before 2000 emphasizes that Netflix’s library is overwhelmingly modern, reflecting a strategy that favors recent, high-engagement content over archival depth.
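The time-decay idea from the ML angle above can be sketched as an exponential weight on title age. This is a minimal illustration, not the authors' method: the sample rows, the 5-year half-life, and the column names `age_years` and `recency_weight` are all assumptions. The reference year of 2021 matches the dataset snapshot.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of release years (the real column has 8,807 rows)
netflix_df = pd.DataFrame({"release_year": [2021, 2018, 2011, 1925]})

REFERENCE_YEAR = 2021   # the dataset snapshot year
HALF_LIFE_YEARS = 5.0   # assumed: the weight halves every 5 years

# Age of each title relative to the snapshot
netflix_df["age_years"] = REFERENCE_YEAR - netflix_df["release_year"]

# Exponential decay: 1.0 for brand-new titles, 0.5 after one half-life
netflix_df["recency_weight"] = np.power(
    0.5, netflix_df["age_years"] / HALF_LIFE_YEARS
)

print(netflix_df)
```

A shorter half-life makes a ranking model favor recent titles more strongly; the right value is something to tune against engagement data rather than fix up front.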

2. Type of Shows - Movies vs TV Shows

Why care: Determines how users engage — movies offer quick hits, while shows drive long-term retention.

Finding: Movies slightly outnumber TV Shows, but not by much. The split is relatively balanced, signaling that both formats are core to Netflix’s content strategy.

ML Angle:

  • Consumption patterns: A user who watches more TV Shows might prefer serialized storytelling and be less churn-prone.

  • Model feature: A simple binary feature like is_series can significantly impact predictions for engagement or completion rates.

💡 Fun Fact: TV series retention rate is Netflix’s secret growth hack — that's why you auto-play into the next episode.

# countplot counts the rows per category; barplot with the index as y would plot the mean row index instead
sns.countplot(x="type", data=netflix_df)

The bar chart confirms that movies still hold the majority, but the gap is narrower than expected — suggesting that Netflix invests heavily in both formats. This balance reflects two strategies: movies provide quick gratification and variety, while TV shows create long-term user hooks. For ML models, distinguishing between these types can inform everything from watch-time predictions to churn modeling.
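To back the chart with numbers and to build the `is_series` feature mentioned above, something like the following could work. The sample rows are made up, and the labels "Movie" and "TV Show" are assumed to be the values stored in the `type` column.

```python
import pandas as pd

# Hypothetical sample; "Movie" and "TV Show" are the labels assumed for `type`
netflix_df = pd.DataFrame(
    {"type": ["Movie", "TV Show", "Movie", "Movie", "TV Show"]}
)

# Absolute counts and relative shares per content type
type_counts = netflix_df["type"].value_counts()
type_shares = netflix_df["type"].value_counts(normalize=True)

# Binary model feature: 1 for TV Shows, 0 for Movies
netflix_df["is_series"] = (netflix_df["type"] == "TV Show").astype(int)

print(type_counts)
print(type_shares.round(2))
```

Printing the shares next to the plot is a cheap sanity check: if the bars and the numbers disagree, the plot is probably aggregating something other than row counts.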

3. Netflix Ratings Distribution - Is Netflix kid-friendly?

Why care: Ratings define the audience — mature vs family content can drastically shift engagement, retention, and trust.

Finding: TV-MA is by far the most dominant rating, followed by TV-14 and TV-PG. Kid-friendly ratings like TV-Y, TV-G, and G appear far less frequently. A few irregular entries like “min84” or “74 min” likely reflect data entry errors.

ML Angle:

  • Parental controls & content filters: Ratings help segment users for safe content recommendations.

  • User profiling: Ratings can help predict preferences — e.g., users who watch PG content may churn if served too much TV-MA.

💡 Nugget: Planning a content platform? Use rating-based filters to target niche audiences (e.g., family-only, horror buffs, etc.).

sns.countplot(x="rating", data=netflix_df)

The distribution shows a heavy lean toward mature content — TV-MA leads with over 3,000 titles, clearly positioning Netflix as an adult-first platform. TV-14 and TV-PG add some balance, appealing to teens and broader audiences. However, content for young children is relatively sparse, with minimal titles rated TV-Y, G, or TV-Y7. The presence of non-standard ratings like “min84” underscores the importance of cleaning and validating categorical data in real-world datasets.
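One possible way to handle the irregular rating entries is to flag anything that looks like a duration and blank it out in a cleaned copy. This is a sketch under assumptions: the sample rows are invented, the "contains `min`" rule is a simple heuristic, and `rating_clean` is a hypothetical column name.

```python
import pandas as pd

# Hypothetical sample including duration-like strings in the rating column
netflix_df = pd.DataFrame(
    {"rating": ["TV-MA", "TV-14", "74 min", "TV-PG", "min84", None, "TV-Y"]}
)

# Flag values containing "min", which look like durations rather than ratings
looks_like_duration = netflix_df["rating"].str.contains(
    "min", case=False, na=False
)

# Keep the raw column for auditing and store a cleaned copy alongside it
netflix_df["rating_clean"] = netflix_df["rating"].mask(looks_like_duration)

print(netflix_df[looks_like_duration])
print(netflix_df["rating_clean"].value_counts(dropna=False))
```

Keeping the raw column makes it easy to check later whether the flagged values belong in another field, such as `duration`.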

4. Ratings Distribution by Type (Movies/TV Shows)

Why care: Understanding how content type correlates with audience maturity helps tailor recommendations and refine user segments.

Finding:

  • Movies dominate the mature categories (TV-MA, R, TV-14), indicating a strong focus on adult and teen content.

  • TV Shows are more concentrated in TV-MA and TV-14, but have a slightly better spread in family-friendly ratings like TV-Y and TV-Y7.

ML Angle:

  • Interaction features: Combining type and rating can enhance model accuracy — a user watching PG movies may behave differently than one watching PG shows.

  • Personalization layers: ML models can adapt recommendations by preferred content tone and format.

💡 Business Insight: Family-focused platforms could capitalize on Netflix’s thin children's catalog — especially in the TV show space.

Here is some of the code. Remember: next Wednesday we will share the full notebook with all paid subscribers, as a one-time send.

# The palette needs one color per hue level (Movie, TV Show); "Set2" is an illustrative choice
sns.countplot(
    x="rating",
    data=netflix_df,
    hue="type",
    palette="Set2"
)

The distribution shows that Movies skew more adult, with higher counts in TV-MA, TV-14, and R. TV Shows also lean mature but offer relatively more in child-safe categories (TV-Y, TV-Y7, TV-G), possibly due to serialized educational/kids’ content. These rating-pattern differences suggest distinct audience strategies: movies deliver intensity and range, while TV shows balance bingeable maturity with broader age appeal.
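Raw counts make it hard to compare the two formats when one has more titles, so a normalized cross-tabulation can help. The sketch below also builds the type-by-rating interaction feature from the ML angle. The sample rows and the `type_rating` column name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample rows
netflix_df = pd.DataFrame(
    {
        "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show", "TV Show"],
        "rating": ["TV-MA", "R", "TV-14", "TV-MA", "TV-Y", "TV-14"],
    }
)

# Share of each rating within each content type (every row sums to 1)
rating_share_by_type = pd.crosstab(
    netflix_df["type"], netflix_df["rating"], normalize="index"
)

# Interaction feature combining both columns, e.g. "Movie_TV-MA"
netflix_df["type_rating"] = netflix_df["type"] + "_" + netflix_df["rating"]

print(rating_share_by_type.round(2))
print(netflix_df["type_rating"].tolist())
```

With `normalize="index"`, a statement such as "TV Shows offer relatively more child-safe content" can be read directly from the table as a within-type share.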

5. Top 5 Netflix Countries - Who’s producing what you binge?

Why care: Regional content diversity drives global subscriber growth. Netflix’s reach depends on balancing domestic appeal with international flavor.

Finding:

  • The United States dominates with 66.5% of the titles among the top five countries.

  • India follows with 17.7%, a strong showing driven by Bollywood and regional productions.

  • United Kingdom contributes 7.6%, while Japan (4.5%) and South Korea (3.6%) round out the top 5.

ML Angle:

  • Geo-personalization: Country tags can power location-aware recommendations.

  • Forecasting trends: Historical production data can predict future regional content expansion.

💡 Business Twist: There's white space for non-US players — especially platforms targeting Asian or European markets with local-first libraries.

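# Share of titles among the five most frequent countries
# (`colors` is assumed to be a list of five color codes defined earlier in the notebook)
(
    netflix_df["country"]
    .value_counts()
    .nlargest(n=5)
    .plot.pie(
        autopct="%1.1f%%",
        figsize=(15, 10),
        colors=colors
    )
)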

Among the top five countries, Netflix's catalog is heavily skewed toward U.S.-produced content, which accounts for roughly two-thirds of those titles. Note that these percentages are shares of the top five combined, not of the full catalog. India, with nearly 18%, is a clear second, consistent with a large audience for local stories. The UK follows, while Japan and South Korea, though culturally influential, contribute relatively fewer titles. This may point to high impact per title for East Asian countries, and possibly to selective licensing rather than sheer volume.
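A pie chart of the top countries shows each country's share of that selection only, which can differ noticeably from its share of the whole catalog. The sketch below illustrates the difference on a made-up series of ten rows, using the top two countries instead of five to keep it short.

```python
import pandas as pd

# Hypothetical sample: 10 rows, 2 of them with a missing country
countries = pd.Series(
    ["United States"] * 4 + ["India"] * 2 + ["United Kingdom", "Japan", None, None],
    name="country",
)

# value_counts() ignores missing values by default
top_counts = countries.value_counts().nlargest(n=2)

# Share among the selected top countries only (what a pie chart displays)
share_within_top = top_counts / top_counts.sum()

# Share of the whole catalog, counting rows with a missing country
share_of_catalog = top_counts / len(countries)

print(share_within_top.round(3))
print(share_of_catalog.round(3))
```

Reporting both denominators side by side avoids reading a top-N share as a platform-wide figure.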

Wrapping Up Part I - Stay for the next binge!

So far, we:

  • ✅ Time-traveled through Netflix’s content library

  • ✅ Compared how many movies and TV shows are in the catalog

  • ✅ Saw how spicy Netflix really is

  • ✅ Found Netflix’s global production hubs

⚠️REMEMBER⚠️

💎 The full notebook, including all the code, will be sent exclusively to paid subscribers on Wednesday. This is a one-time send—only subscribers with an active paid membership at that time will receive it via email.💎

🗓️ Wednesday 2nd of April 🗓️

What’s Next in Part II?

In the next pill, we tackle:

  • Genre breakdowns

  • Actor/director impact analysis

  • Content addition seasonality trends

All with one goal: prepping these insights for real ML pipelines.

Stay tuned, and remember — better EDA = better models.


🎓 Further Learning*

Let us present: “From Beginner to Advanced LLM Developer”. This comprehensive course takes you from foundational skills to mastering scalable LLM products through hands-on projects, fine-tuning, RAG, and agent development. Whether you're building a standout portfolio, launching a startup idea, or enhancing enterprise solutions, this program equips you to lead the LLM revolution and thrive in a fast-growing, in-demand field.

Who Is This Course For?

This certification is for software developers, machine learning engineers, data scientists or computer science and AI students to rapidly convert to an LLM Developer role and start building

*Sponsored: by purchasing any of their courses you would also be supporting MLPills.

