Welcome to the sixth DIY issue! Today we are going back to basics. This will serve as a foundation for more advanced concepts in future issues. I understand that this may be too simple for some of you, but I considered it necessary to bring everyone up to a minimum level.
💊 Pill of the week: Build a regression ML model
We are going to build a very simple model that will allow us to predict the salary of a person based on their years of experience and age.
You can get the notebook at the end of the issue 👇
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
Read the dataset. You can download it here. It is a very simple dataset with three columns:
Years of experience
Age
Salary: this is the target, the value we want to predict. During training, it will act as the label, teaching the model how to predict salaries from the other features.
We can read this dataset (CSV file) easily using pandas read_csv.
df = pd.read_csv('/kaggle/input/salary-data-with-age-and-experience/Salary_Data.csv')
It is always good to have a look at your data to verify that it was correctly imported. You can check the top 5 or 10 rows (df.head(5)), the bottom 5 or 10 rows (df.tail(5)) or simply a random sample (df.sample(5)). Here we checked 5 random rows of the dataset:
df.sample(5)
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is a process used for investigating your data to discover patterns, anomalies, relationships, or trends using statistical summaries and visual methods. This approach is essential for understanding the data's underlying structure and characteristics before applying more formal statistical or Machine Learning methods. Some key points that are normally checked are listed below (a short code sketch follows the list):
Distribution of Data: Assessing the distribution of data (e.g., normal, skewed) using histograms, box plots, and summary statistics helps understand the central tendency and variability.
Missing Values: Identifying and addressing missing data is crucial, as it can significantly affect analyses. Techniques include imputation, deletion, or understanding the reasons for missingness.
Outliers: Detecting and examining outliers to understand their impact on the dataset and deciding how to handle them (e.g., removal, transformation).
Correlations: Analyzing correlations between variables using correlation coefficients and scatter plots to identify relationships and potential dependencies.
Patterns and Trends: Looking for patterns, trends, or anomalies in the data, which can be visualized using line graphs, bar charts, or time-series analysis.
Group Comparisons: Comparing metrics across different groups (e.g., categories, time periods) to identify significant differences or similarities.
Data Type Assessment: Understanding the types of data (numerical, categorical, ordinal) and their appropriate treatment in analysis.
Data Quality Assessment: Evaluating data quality to identify errors or inconsistencies that may need correction.
Visual Exploration: Employing various visualization tools (like heatmaps, pair plots) to intuitively understand complex relationships in the data.
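To make a few of these checks concrete, here is a minimal sketch of what they look like in pandas (assuming df is the DataFrame we loaded above; a proper EDA would go much further):

df.isnull().sum()       # missing values per column
df.duplicated().sum()   # duplicate rows, a basic data quality check
df.corr()               # pairwise correlations between the numeric columns

In a notebook, you can run each of these in its own cell to inspect the output.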
The purpose of this issue is to introduce the general Data Science process, so we will not do a deep dive into EDA. However, we can do a very basic one. For example, we can check the data types, missing values and number of records using the "info" method:
df.info()
We can see that there are no null or missing values, all columns contain numbers (floats or integers), and there are 30 rows in total. I know, this is very simple and not a real-world example! Normally it won’t be this easy…
Before continuing… why are missing values problematic? Check this if you want to find out more:
We can also use the "describe" method to check some descriptive statistics: for example, the average value, the maximum and minimum, etc.
df.describe()
This is a quick way of checking the data distribution, but it would also be good to do some plots to understand it better. That is enough for today… We will do this properly in future issues.
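If you are curious, though, here is a minimal preview of what that could look like (a sketch assuming matplotlib is installed):

import matplotlib.pyplot as plt

df.hist(figsize=(8, 3))   # one histogram per column to inspect each distribution
plt.tight_layout()
plt.show()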
Learn Advanced Machine Learning Concepts!*
Have you outgrown introductory courses? Ready for a deeper dive?
Explore feature engineering and feature selection methods
Discover tactics for optimizing hyperparameters and addressing imbalanced data
Master fundamental machine learning methods and their Python application
Enroll today and take the next step in mastering the world of data science!
*Sponsored: by purchasing any of their courses you would also be supporting MLPills.
Split data into train and test
You need to train your model using only part of the data (the training set), and then evaluate it on the part you left aside (the testing set). This tells you whether your model will behave correctly in the real world, with data it has never seen before.
First, we divide the dataset into the label y (salary) and the features X (years of experience and age). Using iloc, [:, :-1] selects every column except the last one, and [:, -1] selects only the last column (the salary):
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
Now we need to get the training set and the testing set. The majority of the data goes into the training set because that is what the model learns from. For example, we can use 80% of the data for training and 20% for testing. This split needs to be adapted to the data: if you have a lot of data, you can reduce the testing percentage, because even a small percentage will include many samples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
Train the model
We will use Linear Regression, one of the simplest models. I introduced it here:
You can find more details here:
First, let's instantiate (load) the model. We could also specify some parameters if we wished. You can get more details here.
lr = LinearRegression()
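For instance, we could write the defaults out explicitly (fit_intercept and n_jobs are actual scikit-learn parameters; the plain call above is all we need here):

lr = LinearRegression(fit_intercept=True, n_jobs=None)   # the scikit-learn defaults, made explicit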
Fit the training data. This is the model training step: the process of teaching the machine learning algorithm to make predictions or decisions by learning from data. Fitting tunes the parameters of the model so that it can accurately generalize from the training data to new, unseen data.
lr = lr.fit(X_train, y_train)
After fitting the training data, we should assess the performance of the model on the testing set, to make sure that it can accurately generalize to unseen data.
lr.score(X_test, y_test)
For our dataset, it achieved an R-squared of 0.77. R-squared, also known as the coefficient of determination, is a statistical measure commonly used to assess the performance of regression models. It represents the proportion of the variance in the target that is explained by the model: 1 is a perfect fit, while 0 means the model does no better than always predicting the mean.
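If you are wondering where that number comes from, here is a small sketch showing that score is just R-squared computed from the model's predictions (using the variables defined above):

import numpy as np

y_pred = lr.predict(X_test)                      # predictions on the testing set
ss_res = np.sum((y_test - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)   # total sum of squares
print(1 - ss_res / ss_tot)                       # matches lr.score(X_test, y_test)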