Statsmodels for time series data

A brief introduction to statsmodels, a Python library that helps in dealing with time-series data.

Published in

Level Up Coding

4 min read · Jan 29, 2021

In Python, statsmodels is a widely used library for working with time-series data. It is built on NumPy and SciPy, and it supports R-style model formulas. The library helps you analyze data, run statistical tests, and build statistical models. It also includes plotting functions.

So let’s dive into it!

→ Installation

statsmodels is not part of the Python standard library, but it ships with the Anaconda distribution. If your conda environment does not already have it, install it with the command below (pip users can run pip install statsmodels instead):

conda install statsmodels
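
To confirm that the installation worked, a quick check along these lines can help. It is only a sketch and assumes nothing beyond the package being importable:

```python
import statsmodels

# If this import fails with ImportError, the package is not installed
# in the active environment.
print(statsmodels.__version__)
```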

→ Importing packages

NumPy and pandas are imported to handle the data, along with matplotlib for plotting. The %matplotlib inline line is a Jupyter/IPython magic that renders plots inside the notebook; leave it out in a plain Python script.

>>> import pandas as pd
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> %matplotlib inline

Then statsmodels itself is imported.

>>> import statsmodels.api as sm

→ Obtain data

statsmodels ships with a collection of example datasets, available under sm.datasets.

We will use macrodata, since it is a time-series dataset. The load_pandas() method loads it, and the data attribute of the result holds the pandas DataFrame.

>>> data = sm.datasets.macrodata.load_pandas().data
>>> data.head()

>>> data.tail()

To understand what the column headings mean, we can print the details using the NOTE attribute.

>>> print(sm.datasets.macrodata.NOTE)
::
    Number of Observations - 203
    Number of Variables - 14
    Variable name definitions::

        year      - 1959q1 - 2009q3
        quarter   - 1-4
        realgdp   - Real gross domestic product (Bil. of chained 2005 US$,
                    seasonally adjusted annual rate)
        realcons  - Real personal consumption expenditures (Bil. of chained
                    2005 US$, seasonally adjusted annual rate)
        realinv   - Real gross private domestic investment (Bil. of chained
                    2005 US$, seasonally adjusted annual rate)
        realgovt  - Real federal consumption expenditures & gross investment
                    (Bil. of chained 2005 US$, seasonally adjusted annual rate)
        realdpi   - Real private disposable income (Bil. of chained 2005
                    US$, seasonally adjusted annual rate)
        cpi       - End of the quarter consumer price index for all urban
                    consumers: all items (1982-84 = 100, seasonally adjusted).
        m1        - End of the quarter M1 nominal money stock (Seasonally
                    adjusted)
        tbilrate  - Quarterly monthly average of the monthly 3-month
                    treasury bill: secondary market rate
        unemp     - Seasonally adjusted unemployment rate (%)
        pop       - End of the quarter total population: all ages incl. armed
                    forces over seas
        infl      - Inflation rate (ln(cpi_{t}/cpi_{t-1}) * 400)
        realint   - Real interest rate (tbilrate - infl)

To work with a time series, the DataFrame should be indexed by date rather than by the default integer index. The time series analysis (tsa) module of statsmodels provides a function called dates_from_range that generates the dates for a given range. We start at the first quarter of 1959 (1959Q1) and end at the third quarter of 2009 (2009Q3), then wrap the result in a pandas index.

>>> idx = pd.Index(sm.tsa.datetools.dates_from_range('1959Q1','2009Q3'))

Now that the index is created, we can assign it to the dataframe.

>>> data.index = idx
>>> data.head()

→ Visualization

A line plot of realdpi (real disposable income) shows the overall trend.

>>> data['realdpi'].plot()

statsmodels can also estimate the trend. The Hodrick-Prescott filter separates a time series into a trend component and a cyclical component. Applying it returns a tuple containing the estimated cycle followed by the estimated trend.

>>> dpi = sm.tsa.filters.hpfilter(data['realdpi'])
>>> dpi
(1959-03-31     32.611738
 1959-06-30     45.961546
 1959-09-30     23.190972
 1959-12-31     18.550907
 1960-03-31     23.077748
                  ...    
 2008-09-30   -128.596455
 2008-12-31    -87.557288
 2009-03-31   -122.358968
 2009-06-30    -11.941350
 2009-09-30    -89.467814
 Name: realdpi_cycle, Length: 203, dtype: float64,
 1959-03-31     1854.288262
 1959-06-30     1873.738454
 1959-09-30     1893.209028
 1959-12-31     1912.749093
 1960-03-31     1932.422252
                   ...     
 2008-09-30     9966.896455
 2008-12-31    10007.957288
 2009-03-31    10048.758968
 2009-06-30    10089.441350
 2009-09-30    10130.067814
 Name: realdpi_trend, Length: 203, dtype: float64)

Using tuple unpacking, the trend is extracted, stored in a new column, and plotted against the original series.

>>> dpi_cycle,dpi_trend = sm.tsa.filters.hpfilter(data['realdpi'])
>>> data['trend'] = dpi_trend
>>> data[['realdpi','trend']].plot()

Let’s zoom in to get a better idea of the plot.

>>> data[['realdpi','trend']]['2005-01-01':].plot()

With this, you have a basic understanding of using the statsmodels library. For more detailed information, check out the official documentation below.

Introduction - statsmodels

statsmodels is a Python module that provides classes and functions for the estimation of many different statistical…

www.statsmodels.org

Refer to the notebook for code here.

Reach out to me: LinkedIn
Check out my other work: GitHub

Written by Jayashree domala
