Issue #51 - ARIMA models: Box-Jenkins method
💊 Pill of the week
This week it's time to give an overall view of the ARIMA modelling methodology, also called the Box-Jenkins method. We will link each step to previous issues of MLPills so you can revise each step and become an ARIMA master!
The Box-Jenkins method, also known as the Box-Jenkins Methodology or the ARIMA (Autoregressive Integrated Moving Average) methodology, is a widely used approach for modelling and forecasting time series data.
It consists of three steps: identification, estimation, and model diagnostics.

Let's see each of them in more detail with some examples. Assume your time series data is in the column `Value` of your dataframe `df`, with a monthly index spanning several years.
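If you don't have data at hand, here is a minimal, hypothetical setup you can use to follow along. The trend-plus-AR(1) process and the date range are made up purely for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: linear trend plus AR(1) noise, 10 years of data
rng = np.random.default_rng(42)
n = 120
noise = np.zeros(n)
for t in range(1, n):
    noise[t] = 0.7 * noise[t - 1] + rng.normal(scale=5)

df = pd.DataFrame(
    {"Value": 100 + 0.5 * np.arange(n) + noise},
    index=pd.date_range("2014-01-01", periods=n, freq="MS"),
)
print(df.head())
```

Any monthly-indexed dataframe with a `Value` column will work just as well in the snippets below.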
1️⃣ Identification
This step involves analyzing the time series data to identify its characteristics and determine the appropriate ARIMA model. The key tasks in this step are:
Checking for stationarity: Time series data is considered stationary if its statistical properties (mean, variance, and autocorrelation) remain constant over time. If the data is non-stationary, differencing techniques are applied to make it stationary. You will use the Augmented Dickey-Fuller (ADF) test and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test:
```python
from statsmodels.tsa.stattools import adfuller

# Perform Augmented Dickey-Fuller test
result_adf = adfuller(df.Value)
print('ADF Statistic:', result_adf[0])
print('p-value:', result_adf[1])
print('Critical Values:')
for key, value in result_adf[4].items():
    print('\t%s: %.3f' % (key, value))

# Determine stationarity based on p-value
if result_adf[1] < 0.05:
    print("ADF test: Series is stationary")
else:
    print("ADF test: Series is not stationary")
```
```python
from statsmodels.tsa.stattools import kpss

# Perform KPSS test
result_kpss = kpss(df.Value)
print('KPSS Statistic:', result_kpss[0])
print('p-value:', result_kpss[1])
print('Lags Used:', result_kpss[2])
print('Critical Values:')
for key, value in result_kpss[3].items():
    print('\t%s: %.3f' % (key, value))

# Determine stationarity based on p-value
# (note the KPSS null hypothesis is stationarity, the opposite of ADF)
if result_kpss[1] < 0.05:
    print("KPSS test: Series is not stationary")
else:
    print("KPSS test: Series is stationary")
```
Identifying the order of differencing: The number of times the data needs to be differenced to achieve stationarity is determined and represented by the parameter 'd' in the ARIMA model.
Check this previous issue to learn how to get ‘d’:
```python
import pandas as pd

# Difference your series until it becomes stationary to select 'd'
differenced_series = df.Value.diff().dropna()
```
Examining the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots: These plots provide insights into the presence and structure of autoregressive (AR) and moving average (MA) components in the data.
```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Plot ACF
plot_acf(df.Value, lags=30)
plt.title('Autocorrelation Function (ACF)')
plt.xlabel('Lag')
plt.ylabel('Autocorrelation')
plt.show()

# Plot PACF
plot_pacf(df.Value, lags=30)
plt.title('Partial Autocorrelation Function (PACF)')
plt.xlabel('Lag')
plt.ylabel('Partial Autocorrelation')
plt.show()
```
Tentatively identifying the orders of the AR and MA components: Based on the ACF and PACF plots, the tentative orders of the AR (p) and MA (q) components are determined.
Check this previous issue to find out how to do it:
2️⃣ Estimation
In this step, the parameters of the ARIMA model identified in the previous step are estimated using appropriate methods, such as maximum likelihood estimation or conditional sum of squares estimation. The estimation process involves:
Fitting the ARIMA model: The identified ARIMA(p,d,q) model is fitted to the data, and the parameters (autoregressive coefficients, moving average coefficients, and constant term) are estimated.
```python
from statsmodels.tsa.arima.model import ARIMA

# Fit ARIMA model with the orders (p, d, q) chosen during identification
model = ARIMA(df.Value, order=(p, d, q))
fit_model = model.fit()

# Print summary of the fitted model
print(fit_model.summary())
```
Checking for model adequacy: Various diagnostic tests, such as residual analysis and information criteria (e.g., AIC, BIC), are performed to assess the adequacy of the fitted model.
Check this previous issue for more details:
Refining the model if necessary: If the model is found to be inadequate, the identification step is revisited, and a different ARIMA model is tentatively identified and estimated.
3️⃣ Model Diagnostics
This step involves evaluating the fitted ARIMA model to ensure its validity and usefulness for forecasting. I will cover this in more depth in future issues of MLPills, but for now, here is an introduction to the key tasks in this step:
Residual analysis: The residuals (differences between the observed values and the fitted values) are analyzed for patterns, autocorrelation, and normality. If the residuals exhibit any patterns or non-normality, it may indicate an inadequate model fit.
Assessing forecast accuracy: Various accuracy measures, such as mean squared error (MSE), root mean squared error (RMSE), or mean absolute percentage error (MAPE), are calculated to evaluate the model's forecasting performance.
Model validation: If the model passes the diagnostic tests and exhibits satisfactory forecasting accuracy, it is considered valid and can be used for forecasting future values of the time series.
The Box-Jenkins method is an iterative process, where the steps may need to be repeated until a satisfactory ARIMA model is identified, estimated, and validated. It is a powerful and flexible approach for modelling and forecasting a wide range of time series data, including those exhibiting trends, seasonality, and other patterns.
We are getting closer to the end of the ARIMA series; soon I will release the full notebook so you can train your ARIMA model step by step. Don't forget to subscribe if you haven't already, so you don't miss it!
🎓Learn Real-World Machine Learning!*
Do you want to learn Real-World Machine Learning?
Data Science doesn’t finish with the model training… There is much more!
Here you will learn how to deploy and maintain your models, so they can be used in a Real-World environment:
Elevate your ML skills with "Real-World ML Tutorial & Community"! 🚀
Business to ML: Turn real business challenges into ML solutions.
Data Mastery: Craft perfect ML-ready data with Python.
Train Like a Pro: Boost your models for peak performance.
Deploy with Confidence: Master MLOps for real-world impact.
🎁 Special Offer: Use "MASSIVE50" for 50% off.
*Sponsored
🤖 Tech Round-Up
No time to check the news this week?
This week's TechRoundUp comes full of AI news. From NVIDIA's AGI predictions to X's new LLM Grok, the future is zooming towards us! 🚀
Let's dive into the latest Tech highlights you probably shouldn’t this week 💥
Nvidia's CEO suggests AI hallucinations are fixable and predicts artificial general intelligence (AGI) could be a reality in 5 years, opening up vast cognitive capabilities.
2️⃣ Apple and Google Tech Titans Collaborate
Apple may partner with Google, integrating Gemini AI into iPhones to enhance AI features, indicating a major leap in smartphone AI capabilities.
3️⃣ GitHub New Code Security Breakthrough
GitHub unveils an AI tool that can automatically fix code vulnerabilities, merging GitHub's Copilot with CodeQL for safer, more efficient coding.
The ChatGPT store struggles with spam, showing the complexities of platform moderation as it grows, packed with diverse and not always compliant GPTs.
Discussions swirl around Elon Musk's open-sourcing of Grok, highlighting the diverse impacts on AI ethics and development, and sparking industry-wide conversation.