Machine Learning Pills

Issue #67 - Exploratory Data Analysis

David Andrés and Josep Ferrer
Jul 28, 2024

💊 Pill of the Week

Exploratory Data Analysis (EDA) is a crucial step in the Data Science process that involves investigating and understanding our data before applying more formal statistical or machine learning methods. Today we will explain EDA and highlight key aspects to consider during our analysis.

What is Exploratory Data Analysis?

EDA is a process used to:

  • Investigate data

  • Discover patterns

  • Identify anomalies

  • Understand relationships

  • Uncover trends

This is typically done using a combination of statistical summaries and visual methods. EDA is essential for grasping the underlying structure and characteristics of our data, setting the foundation for more advanced analyses.

Importance of EDA

EDA plays a critical role in the data science workflow for several reasons:

  1. Data Understanding: It helps you gain a deep understanding of your dataset, its structure, and its peculiarities.

  2. Hypothesis Generation: Through exploration, you can formulate hypotheses about your data that can be tested later.

  3. Data Cleaning: EDA often reveals data quality issues that need to be addressed before further analysis.

  4. Feature Selection: It can help identify which variables are most important for your analysis or modeling.

  5. Assumption Checking: Many statistical methods have underlying assumptions about the data. EDA helps you check if these assumptions are met.

  6. Communication: Visualizations and summaries from EDA are powerful tools for communicating insights to stakeholders.

EDA aspects

When performing EDA, there are several important points to consider:

1. Distribution of data

Understanding how your data is distributed is crucial. This involves:

  • Assessing whether the distribution is normal, skewed, or follows another pattern

  • Using tools like histograms and box plots

  • Calculating summary statistics to understand central tendency and variability

Practical Tip: Pay attention to the shape of the distribution. Is it symmetric or skewed? Are there multiple peaks? This can inform your choice of statistical methods later on.
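
As a minimal sketch of this step, the snippet below summarizes central tendency, variability, and skewness with pandas. The `income` and `height` columns are synthetic data made up for illustration, not taken from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a right-skewed "income" column and a roughly symmetric "height" column.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=0.8, size=1_000),
    "height": rng.normal(loc=170, scale=10, size=1_000),
})

# Central tendency and variability in one table (one row per column of df).
summary = df.describe().T[["mean", "50%", "std", "min", "max"]]
summary = summary.rename(columns={"50%": "median"})

# Skewness near 0 suggests a symmetric shape;
# a clearly positive value suggests a long right tail.
summary["skew"] = df.skew()

print(summary.round(2))
```

A mean well above the median, together with a positive skew, is a common sign of a right-skewed variable, where a log transform or rank-based methods may be worth considering.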

2. Missing data

Dealing with missing data is a critical part of EDA:

  • Identify the extent and pattern of missing values

  • Decide on appropriate strategies (e.g., imputation, deletion)

  • Understand the reasons behind missing data, if possible

Practical Tip: Look for patterns in missing data. Are certain variables more likely to have missing values? Is missingness related to other variables? This can help you choose the most appropriate method for handling missing data.
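
The checks above could look roughly like this in pandas. The survey table and its column names (`channel`, `age`, `income`) are hypothetical and only serve to illustrate the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: "income" is missing more often for the "web" channel.
df = pd.DataFrame({
    "channel": ["web", "web", "web", "web", "phone", "phone", "phone", "phone"],
    "age":     [25, 31, np.nan, 40, 52, 47, 38, 29],
    "income":  [np.nan, np.nan, 42_000, np.nan, 58_000, 61_000, np.nan, 39_000],
})

# Extent: share of missing values per column, largest first.
missing_share = df.isna().mean().sort_values(ascending=False)

# Pattern: is missingness in "income" related to another variable?
income_missing_by_channel = df["income"].isna().groupby(df["channel"]).mean()

print(missing_share)
print(income_missing_by_channel)
```

Here `income` is missing in half of the rows, and far more often for `web` than for `phone`. A gap like that suggests the values are not missing completely at random, so simply deleting incomplete rows could bias the analysis.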

3. Outliers

Outliers can significantly impact your analysis:

  • Detect outliers using statistical methods or visualizations

  • Examine their impact on the dataset

  • Decide how to handle them (e.g., removal, transformation, or retention)

Practical Tip: Don’t automatically remove outliers. Investigate them first – they might be errors in data collection, or they might represent important rare events in your dataset.
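
One common statistical method is the interquartile range (IQR) rule, sketched below. The 1.5 × IQR threshold is a widely used convention rather than something this issue prescribes, and the `response_ms` values are made up.

```python
import pandas as pd

# Hypothetical response times in milliseconds, with one suspicious value.
values = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 21, 95], name="response_ms")

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]

# Examine the impact before deciding what to do with the flagged points.
mean_with = values.mean()
mean_without = values[~is_outlier].mean()

print(outliers.tolist())
print(round(mean_with, 2), round(mean_without, 2))
```

The flagged points are only candidates for investigation. Comparing a statistic with and without them shows how much they influence the result.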

4. Correlations

Understanding relationships between variables is key:

  • Use correlation coefficients to quantify relationships

  • Create scatter plots to visualize these relationships

  • Identify potential dependencies between variables

Practical Tip: Remember that correlation doesn’t imply causation. High correlation between variables might suggest a relationship worth investigating further, but it doesn’t prove that one variable causes changes in another.
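
A short sketch of quantifying relationships with pandas follows. It compares Pearson correlation (linear association) with Spearman correlation (monotonic association) on synthetic columns whose names are invented for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one linear relationship, one monotonic but non-linear, one unrelated.
rng = np.random.default_rng(seed=1)
x = rng.uniform(0, 10, size=200)
df = pd.DataFrame({
    "x": x,
    "linear": 2 * x + rng.normal(0, 1, size=200),
    "monotonic": np.exp(x),
    "noise": rng.normal(0, 1, size=200),
})

# Pearson measures linear association; Spearman works on ranks,
# so it also captures monotonic relationships that are not linear.
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

print(pearson["x"].round(2))
print(spearman["x"].round(2))
```

A large gap between the two coefficients for the same pair of variables is a hint that the relationship is non-linear and deserves a scatter plot.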

5. Patterns and trends

Look for overarching patterns in your data:

  • Use line graphs for time-series data

  • Employ bar charts for categorical comparisons

  • Identify any anomalies that deviate from expected patterns

Practical Tip: When examining time-series data, look for seasonality, cyclical patterns, and long-term trends. These can be crucial for forecasting and understanding the underlying dynamics of your data.
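
To make the trend and seasonality idea concrete, here is a rough sketch on a synthetic daily `sales` series (hypothetical data with a built-in upward trend and a weekend bump). It uses a centered 7-day rolling mean as a simple trend estimate. This is one of several possible approaches, not a full decomposition.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: upward trend + weekend bump + noise.
rng = np.random.default_rng(seed=2)
dates = pd.date_range("2024-01-01", periods=364, freq="D")
t = np.arange(len(dates))
weekend_bump = np.where(dates.dayofweek >= 5, 20.0, 0.0)
sales = pd.Series(
    100 + 0.5 * t + weekend_bump + rng.normal(0, 3, size=len(t)),
    index=dates,
    name="sales",
)

# Long-term trend: a centered 7-day rolling mean averages out the weekly pattern.
trend = sales.rolling(window=7, center=True).mean()

# Weekly seasonality: average deviation from the trend per day of week (0 = Monday).
detrended = sales - trend
weekday_profile = detrended.groupby(detrended.index.dayofweek).mean()

print(weekday_profile.round(1))
```

The window length should match the suspected seasonal period (7 days here). A different period, such as 12 months for monthly data, would need a different window.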

6. Group comparisons

Compare metrics across different subsets of your data:

  • Look at differences between categories

  • Analyze changes over time periods

  • Identify significant similarities or differences between groups

Practical Tip: Use statistical tests (such as a t-test for two groups or ANOVA for three or more) to determine whether differences between groups are statistically significant, not just visually apparent.
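
As an illustration, the sketch below compares two groups with Welch's t-test from SciPy (`scipy.stats.ttest_ind` with `equal_var=False`). The `group` and `score` columns are synthetic, and choosing Welch's variant is an assumption made here because it does not require equal variances.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical scores for two groups of 50 observations each.
rng = np.random.default_rng(seed=3)
df = pd.DataFrame({
    "group": ["A"] * 50 + ["B"] * 50,
    "score": np.concatenate([rng.normal(70, 10, 50), rng.normal(80, 10, 50)]),
})

# Descriptive comparison first: mean, spread, and size of each group.
group_summary = df.groupby("group")["score"].agg(["mean", "std", "count"])

# Welch's t-test: compares two group means without assuming equal variances.
a = df.loc[df["group"] == "A", "score"]
b = df.loc[df["group"] == "B", "score"]
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)

print(group_summary.round(2))
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A small p-value indicates that a difference this large would be unlikely if the group means were equal. It says nothing about whether the difference is large enough to matter in practice.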

7. Data types assessment

Understand the nature of your variables:

  • Identify numerical and categorical variables, and distinguish nominal from ordinal categories

  • Ensure appropriate treatment of each data type in your analysis

Practical Tip: Pay special attention to categorical variables. Are they nominal or ordinal? This distinction will affect how you can analyze and visualize them.
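
In pandas, the nominal versus ordinal distinction can be encoded explicitly, as in this small sketch. The `price`, `color`, and `size` columns are hypothetical.

```python
import pandas as pd

# Hypothetical product table with one numerical, one nominal, and one ordinal column.
df = pd.DataFrame({
    "price": [9.99, 14.50, 7.25, 12.00],
    "color": ["red", "blue", "red", "green"],  # nominal: no natural order
    "size": ["M", "S", "L", "M"],              # ordinal: S < M < L
})

# Strings are loaded as generic objects, so the intended type must be declared.
df["color"] = df["color"].astype("category")
size_type = pd.CategoricalDtype(categories=["S", "M", "L"], ordered=True)
df["size"] = df["size"].astype(size_type)

# Ordered categories support comparisons and min/max; unordered ones do not.
at_least_medium = df[df["size"] >= "M"]
largest = df["size"].max()

print(df.dtypes)
print(at_least_medium)
```

Declaring the order once means sorting, filtering, and plotting follow S < M < L instead of alphabetical order.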

8. Data quality assessment

Evaluate the overall quality of your dataset:

  • Look for errors or inconsistencies

  • Identify areas that may need correction or further investigation

Practical Tip: Check for inconsistencies in units of measurement, especially if data comes from multiple sources. Also, look for duplicate entries that might skew your analysis.
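
A quick sketch of both checks follows, using a made-up table in which two hypothetical sources (`clinic_a`, `clinic_b`) record `height` in different units and one row is duplicated.

```python
import pandas as pd

# Hypothetical data: clinic_a records height in cm, clinic_b in metres,
# and patient 3 appears twice.
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 3, 4, 5],
    "source": ["clinic_a", "clinic_a", "clinic_a", "clinic_a", "clinic_b", "clinic_b"],
    "height": [172.0, 165.0, 180.0, 180.0, 1.68, 1.75],
})

# Duplicates: count fully identical rows, then drop them.
n_duplicates = int(df.duplicated().sum())
deduped = df.drop_duplicates()

# Units: value ranges that differ by orders of magnitude between sources
# are a strong hint that the units are inconsistent.
height_by_source = deduped.groupby("source")["height"].agg(["min", "median", "max"])

print(n_duplicates)
print(height_by_source)
```

Note that `duplicated()` only flags rows that match on every column. Near-duplicates, such as the same patient with slightly different values, need a check on a subset of key columns.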

9. Visual exploration

Leverage various visualization techniques:

  • Use heatmaps to show correlations between multiple variables

  • Create pair plots to visualize relationships across the entire dataset

  • Employ other charts as needed to gain intuitive understanding of complex relationships

Practical Tip: Don’t limit yourself to basic charts. Consider more advanced visualizations like parallel coordinates for high-dimensional data, or geographic maps for spatial data.
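
The first two chart types can be produced with matplotlib and pandas alone, as in this rough sketch on synthetic columns `x`, `y`, and `z`. Libraries such as seaborn offer higher-level versions of the same plots, and the choice of plain matplotlib here is only an assumption to keep the example small.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: y depends on x, z is unrelated.
rng = np.random.default_rng(seed=4)
x = rng.normal(size=150)
df = pd.DataFrame({
    "x": x,
    "y": 0.8 * x + rng.normal(scale=0.5, size=150),
    "z": rng.normal(size=150),
})

# Heatmap of the correlation matrix.
corr = df.corr()
fig, ax = plt.subplots(figsize=(4, 4))
image = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(image, ax=ax, label="Pearson correlation")

# Pair plot: scatter plots for every pair of columns, histograms on the diagonal.
axes = pd.plotting.scatter_matrix(df, figsize=(6, 6), diagonal="hist")
```

Call `fig.savefig(...)` to write the heatmap to a file, or remove the `matplotlib.use("Agg")` line and call `plt.show()` to view both figures interactively.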

Advanced EDA Techniques

As you become more comfortable with basic EDA, consider incorporating these advanced techniques:
