Machine Learning Pills

Issue #51 - ARIMA models: Box-Jenkins method

David Andrés
and
Josep Ferrer
Mar 23, 2024

💊 Pill of the week

This week we give an overall view of the ARIMA modelling methodology, also called the Box-Jenkins method. We will link each step to previous issues of MLPills, so you can review each one and become an ARIMA master!

The Box-Jenkins method, also known as the Box-Jenkins Methodology or the ARIMA (Autoregressive Integrated Moving Average) methodology, is a widely used approach for modelling and forecasting time series data.

It consists of the following three steps: identification, estimation, and model diagnostics.

Let’s look at each of them in more detail with some examples. Assume your time series is in the column Value of your dataframe df, whose index holds monthly dates spanning several years.
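If you want to follow along without your own data, here is a minimal sketch that builds a synthetic df with the layout assumed above. The start date, the length (10 years) and the random-walk-with-drift values are illustrative assumptions, not data from this issue.

```python
import numpy as np
import pandas as pd

# Assumption: 10 years of monthly observations starting in January 2014
rng = np.random.default_rng(seed=42)
index = pd.date_range(start="2014-01-01", periods=120, freq="MS")  # month-start dates

# Random walk with drift: non-stationary in levels, stationary after one difference
values = 100 + np.cumsum(0.5 + rng.normal(scale=2.0, size=len(index)))

df = pd.DataFrame({"Value": values}, index=index)
print(df.head())
```

Giving the index an explicit monthly frequency also lets statsmodels label forecasts with proper dates later on.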

1️⃣ Identification

This step involves analyzing the time series data to identify its characteristics and determine the appropriate ARIMA model. The key tasks in this step are:

  • Checking for stationarity: Time series data is considered stationary if its statistical properties (mean, variance, and autocorrelation) remain constant over time. If the data is non-stationary, differencing is applied to make it stationary. You can use the Augmented Dickey-Fuller (ADF) test and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test. They have opposite null hypotheses (ADF: the series has a unit root; KPSS: the series is stationary), so they complement each other:

    from statsmodels.tsa.stattools import adfuller
    
    # Perform Augmented Dickey-Fuller test
    result_adf = adfuller(df.Value)
    print('ADF Statistic:', result_adf[0])
    print('p-value:', result_adf[1])
    print('Critical Values:')
    for key, value in result_adf[4].items():
        print('\t%s: %.3f' % (key, value))
    
    # ADF null hypothesis: the series has a unit root (is non-stationary),
    # so a p-value below 0.05 is evidence of stationarity
    if result_adf[1] < 0.05:
        print("ADF test: Series is stationary")
    else:
        print("ADF test: Series is not stationary")
    
    from statsmodels.tsa.stattools import kpss
    
    # Perform KPSS test
    result_kpss = kpss(df.Value)
    print('KPSS Statistic:', result_kpss[0])
    print('p-value:', result_kpss[1])
    print('Lags Used:', result_kpss[2])
    print('Critical Values:')
    for key, value in result_kpss[3].items():
        print('\t%s: %.3f' % (key, value))
    
    # KPSS null hypothesis: the series is stationary,
    # so a p-value below 0.05 is evidence of non-stationarity
    if result_kpss[1] < 0.05:
        print("KPSS test: Series is not stationary")
    else:
        print("KPSS test: Series is stationary")
    
  • Identifying the order of differencing: The number of times the data needs to be differenced to achieve stationarity is determined and represented by the parameter 'd' in the ARIMA model.

    Check this previous issue to learn how to get ‘d’:

    Issue #48 - ARIMA models: stationarity and differencing (David Andrés and Josep Ferrer, March 2, 2024)

    import pandas as pd
    
    # Difference your series until it becomes stationary; the number of
    # differences applied is 'd' (a single difference is shown here)
    differenced_series = df.Value.diff().dropna()
  • Examining the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots of the stationary series: These plots provide insights into the presence and structure of autoregressive (AR) and moving average (MA) components in the data.

    import matplotlib.pyplot as plt
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
    
    # Use the stationary series: df.Value itself if d = 0, otherwise its d-th
    # difference (one difference is shown here as an example)
    stationary_series = df.Value.diff().dropna()
    
    # Plot ACF
    plot_acf(stationary_series, lags=30)
    plt.title('Autocorrelation Function (ACF)')
    plt.xlabel('Lag')
    plt.ylabel('Autocorrelation')
    plt.show()
    
    # Plot PACF
    plot_pacf(stationary_series, lags=30)
    plt.title('Partial Autocorrelation Function (PACF)')
    plt.xlabel('Lag')
    plt.ylabel('Partial Autocorrelation')
    plt.show()

  • Tentatively identifying the orders of the AR and MA components: Based on the ACF and PACF plots, the tentative orders of the AR (p) and MA (q) components are determined.

    Check this previous issue to find out how to do it:

    Issue #50 - ARIMA models: selection of p and q (David Andrés and Josep Ferrer, March 16, 2024)

2️⃣ Estimation

In this step, the parameters of the ARIMA model identified in the previous step are estimated using appropriate methods, such as maximum likelihood estimation or conditional sum of squares estimation. The estimation process involves:

  • Fitting the ARIMA model: The identified ARIMA(p,d,q) model is fitted to the data, and the parameters (autoregressive coefficients, moving average coefficients, and constant term) are estimated.

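    from statsmodels.tsa.arima.model import ARIMA
    
    # Orders chosen in the identification step (example values; replace with your own)
    p, d, q = 1, 1, 1
    
    # Fit ARIMA model
    model = ARIMA(df.Value, order=(p, d, q))
    fit_model = model.fit()
    
    # Print summary of the fitted model
    print(fit_model.summary())
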
  • Checking for model adequacy: Residual analysis and information criteria (e.g., AIC, BIC) are used to assess the adequacy of the fitted model and to compare it with alternative candidates.

    Check this previous issue for more details:

    Issue #49 - ARIMA models: Criteria for selection (David Andrés and Josep Ferrer, March 9, 2024)

  • Refining the model if necessary: If the model is found to be inadequate, the identification step is revisited, and a different ARIMA model is tentatively identified and estimated.

3️⃣ Model Diagnostics

This step involves evaluating the fitted ARIMA model to ensure its validity and usefulness for forecasting. I will cover this in future issues of MLPills, but for now, here is an introduction to the key tasks in this step:

  • Residual analysis: The residuals (differences between the observed values and the fitted values) are analyzed for patterns, autocorrelation, and normality. If the residuals exhibit any patterns or non-normality, it may indicate an inadequate model fit.

  • Assessing forecast accuracy: Various accuracy measures, such as mean squared error (MSE), root mean squared error (RMSE), or mean absolute percentage error (MAPE), are calculated to evaluate the model's forecasting performance.

  • Model validation: If the model passes the diagnostic tests and exhibits satisfactory forecasting accuracy, it is considered valid and can be used for forecasting future values of the time series.

The Box-Jenkins method is an iterative process, where the steps may need to be repeated until a satisfactory ARIMA model is identified, estimated, and validated. It is a powerful and flexible approach for modelling and forecasting a wide range of time series data, including series with trends and, through seasonal extensions such as SARIMA, seasonality.

We are getting close to the end of the ARIMA series. Soon I will release the full notebook to train your ARIMA model step by step. Don’t forget to subscribe if you haven’t already, so you don’t miss it!


‍🎓Learn Real-World Machine Learning!*

Do you want to learn Real-World Machine Learning?

Data Science doesn’t finish with the model training… There is much more!

Here you will learn how to deploy and maintain your models, so they can be used in a Real-World environment:

  • Elevate your ML skills with "Real-World ML Tutorial & Community"! 🚀

  • Business to ML: Turn real business challenges into ML solutions.

  • Data Mastery: Craft perfect ML-ready data with Python.

  • Train Like a Pro: Boost your models for peak performance.

  • Deploy with Confidence: Master MLOps for real-world impact.

🎁 Special Offer: Use "MASSIVE50" for 50% off.

Learn Real-World Machine Learning

*Sponsored


🤖 Tech Round-Up

No time to check the news this week?

This week's TechRoundUp comes full of AI news. From NVIDIA AGI's future to X's new LLM Grok, the future is zooming towards us! 🚀

Let's dive into the latest tech highlights you probably shouldn’t miss this week 💥

1️⃣ NVIDIA AGI's Horizon

Nvidia's CEO suggests AI hallucinations are fixable and predicts artificial general intelligence (AGI) could be a reality in 5 years, opening up vast cognitive capabilities.

2️⃣ Apple and Google Tech Titans Collaborate

Apple may partner with Google, integrating Gemini AI into iPhones to enhance AI features, indicating a major leap in smartphone AI capabilities.

3️⃣ GitHub New Code Security Breakthrough

GitHub unveils an AI tool that can automatically fix code vulnerabilities, merging GitHub's Copilot with CodeQL for safer, more efficient coding.

4️⃣ Spam Challenge at OpenAI

The ChatGPT store struggles with spam, showing the complexities of platform moderation as it grows, packed with diverse and not always compliant GPTs.

5️⃣ Open Source AI Debate

Discussions swirl around Elon Musk's open-sourcing of Grok, highlighting the diverse impacts on AI ethics and development, and sparking industry-wide conversation.

Follow Josep on 𝕏
