Chapter 4: Framing Time Series

Time series data must be reframed as a supervised learning dataset before we can apply machine learning algorithms. There is no built-in notion of input and output features in a time series. Instead, we must choose the variable to be predicted and use feature engineering to construct all of the inputs that will be used to make predictions for future time steps. In this chapter, you will discover how to perform feature engineering on time series data with Python so that you can model your time series problem with machine learning algorithms.

Learning outcomes:

After completing this chapter, you will know:

  1. What supervised learning is and how it is the foundation for all predictive modeling machine learning algorithms.

  2. The sliding window method for framing a time series dataset and how to use it.

  3. How to use the sliding window for multivariate data and multi-step forecasting.

Let’s dive in.

Supervised Machine Learning

The majority of practical machine learning uses supervised learning. Supervised learning is where you have input variables (X) and an output variable (y) and you use an algorithm to learn the mapping function from the input to the output.

\[Y = f(X)\]

The goal is to approximate the real underlying mapping so well that when you have new input data \((X)\), you can predict the output variables \((y)\) for that data. Below is a contrived example of a supervised learning dataset where each row is an observation comprised of one input variable \((X)\) and one output variable to be predicted \((y)\).

X, y
5, 0.9
4, 0.8
5, 1.0
3, 0.7
4, 0.9

It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process. We know the correct answers; the algorithm iteratively makes predictions on the training data and is corrected by making updates.

Learning stops when the algorithm achieves an acceptable level of performance. Supervised learning problems can be further grouped into regression and classification problems.

  • Classification: A classification problem is when the output variable is a category, such as red and blue or disease and no disease.

  • Regression: A regression problem is when the output variable is a real value, such as dollars or weight. The contrived example above is a regression problem.

However, a time series dataset looks like this:

time 1, value 1
time 2, value 2
time 3, value 3

To use it for machine learning, it must be transformed or reframed into something that looks like this:

input 1, output 1
input 2, output 2
input 3, output 3

Only then can we train a supervised learning algorithm. Input variables are also called features in the field of machine learning, and the task before us is to create or invent new input features from our time series dataset. Ideally, we only want input features that best help the learning methods model the relationship between the inputs \((X)\) and the outputs \((y)\) that we would like to predict. In this chapter, we will look at three classes of features that we can create from our time series dataset:

  • Date Time Features: these are components of the time step itself for each observation.

  • Lag Features: these are values at prior time steps.

  • Window Features: these are a summary of values over a fixed window of prior time steps.
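The three feature classes above can be sketched with pandas on a tiny hypothetical series (the column names and values here are made up purely for illustration):

```python
import pandas as pd

# A small made-up daily series to illustrate the three feature classes.
s = pd.DataFrame({
    "Date": pd.date_range("2021-01-01", periods=5, freq="D"),
    "value": [100, 110, 108, 115, 120],
})

# Date-time features: components of the time step itself.
s["month"] = s["Date"].dt.month
s["dayofweek"] = s["Date"].dt.dayofweek

# Lag feature: the value at the prior time step.
s["lag1"] = s["value"].shift(1)

# Window feature: a summary (here the mean) over a window of prior steps.
s["rolling_mean2"] = s["value"].shift(1).rolling(window=2).mean()

print(s)
```

Note that the window feature is computed on the shifted series, so it summarizes only past values and never leaks the current observation into its own inputs.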

Sliding Window

Time series data can be phrased as supervised learning: given a sequence of numbers for a time series dataset, we can restructure the data to look like a supervised learning problem by using previous time steps as input variables and the next time step as the output variable. Let’s make this concrete with an example. Imagine we have a time series as follows, representing the daily step count of a patient:

time, measure
1,    100
2,    110
3,    108
4,    115
5,    120

We can restructure this time series dataset as a supervised learning problem by using the value at the previous time step to predict the value at the next time step. Reorganizing the time series dataset this way, the data would look as follows:

X,    y
?,   100
100, 110
110, 108
108, 115
115, 120
120,  ?

Take a look at the above transformed dataset and compare it to the original time series. Here are some observations:

  • We can see that the previous time step is the input (X) and the next time step is the output (y) in our supervised learning problem.

  • We can see that the order between the observations is preserved, and must continue to be preserved when using this dataset to train a supervised model.

  • We can see that we have no previous value that we can use to predict the first value in the sequence. We will delete this row as we cannot use it.
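The reframing just described can be reproduced in a few lines of pandas (a minimal sketch using the contrived step-count series above):

```python
import pandas as pd

# The daily step-count series from the example above.
steps = pd.Series([100, 110, 108, 115, 120], name="y")

# Use the previous time step as input X and the current step as output y.
frame = pd.DataFrame({"X": steps.shift(1), "y": steps})

# The first row has no previous value, so it is dropped.
frame = frame.dropna()
print(frame)
```

The last row of the original table (input 120 with an unknown output) disappears here as well, because the unshifted column simply ends at the final observation.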

Example

With the intense heat experienced in Dire Dawa, a city located in the eastern part of Ethiopia, the government has initiated a project to analyze the overall all-sky surface radiation in the area. This study aims to establish a solar power facility that will convert the radiant energy into electricity for residential use.

The all-sky surface radiation is calculated by accounting for the influences of the clear-sky atmosphere, clouds, and the multiple reflections between cloud and land surface. Solar ultraviolet radiation (UVR) is divided into three wavebands: UV-C (100–280 nm), UV-B (280–315 nm), and UV-A (315–400 nm).

UV-C is absorbed in the stratosphere, whereas both the UV-A and UV-B bands reach ground level in amounts that depend on several factors. The first goal is to forecast the UVA using the historical data delivered by the Ethiopian Electric Power (EEP) company.

The data comprise daily UVA and dew/frost point temperature at 2 meters from 1 January 2021 to 31 March 2021.

import pandas as pd

diradawe_data = pd.read_csv('data/Dire_Dawa_data.csv', skiprows=10)
diradawe_data.head()
YEAR MO DY ALLSKY_SFC_UVA T2MDEW
0 2021 1 1 15.64 8.24
1 2021 1 2 15.73 8.89
2 2021 1 3 15.65 8.05
3 2021 1 4 15.87 3.81
4 2021 1 5 15.87 5.65
diradawe_data.columns = ['year', 'month', 'day', 'ALLSKY_SFC_UVA', 'T2MDEW']
diradawe_data.head()
year month day ALLSKY_SFC_UVA T2MDEW
0 2021 1 1 15.64 8.24
1 2021 1 2 15.73 8.89
2 2021 1 3 15.65 8.05
3 2021 1 4 15.87 3.81
4 2021 1 5 15.87 5.65
diradawe_data['Date'] = pd.to_datetime(diradawe_data[['year', 'month', 'day']])
diradawe_data = diradawe_data[['Date', 'year', 'month', 'day', 'ALLSKY_SFC_UVA', 'T2MDEW']]
diradawe_data.head()
Date year month day ALLSKY_SFC_UVA T2MDEW
0 2021-01-01 2021 1 1 15.64 8.24
1 2021-01-02 2021 1 2 15.73 8.89
2 2021-01-03 2021 1 3 15.65 8.05
3 2021-01-04 2021 1 4 15.87 3.81
4 2021-01-05 2021 1 5 15.87 5.65

Lag Features

Univariate forecast

Lag features are the classical way that time series forecasting problems are transformed into supervised learning problems. The simplest approach is to predict the value at the next time \((t+1)\) given the value at the current time \((t)\). The Pandas library provides the shift() function to help create these shifted or lag features from a time series dataset.

Shifting the dataset by 1 creates the t-1 column, adding a NaN (unknown) value for the first row; the unshifted series then holds the value to predict at time t. Let’s make this concrete with an example, focusing on the UVA column only.

uva_data = diradawe_data[['Date', 'ALLSKY_SFC_UVA']].copy()  # .copy() avoids a SettingWithCopyWarning when we modify this frame later
uva_data.head()
Date ALLSKY_SFC_UVA
0 2021-01-01 15.64
1 2021-01-02 15.73
2 2021-01-03 15.65
3 2021-01-04 15.87
4 2021-01-05 15.87
uva_shift1 = uva_data['ALLSKY_SFC_UVA'].shift(1)
uva_shift1
0       NaN
1     15.64
2     15.73
3     15.65
4     15.87
      ...  
85    19.84
86    19.60
87    19.38
88    18.81
89    17.12
Name: ALLSKY_SFC_UVA, Length: 90, dtype: float64

The value passed in the shift function determines the length by which the Series is shifted. We can concatenate the shifted columns together into a new DataFrame using the concat() function along the column axis (axis=1).

uva_new1 = pd.concat([uva_data, uva_shift1], axis=1)
uva_new1
Date ALLSKY_SFC_UVA ALLSKY_SFC_UVA
0 2021-01-01 15.64 NaN
1 2021-01-02 15.73 15.64
2 2021-01-03 15.65 15.73
3 2021-01-04 15.87 15.65
4 2021-01-05 15.87 15.87
... ... ... ...
85 2021-03-27 19.60 19.84
86 2021-03-28 19.38 19.60
87 2021-03-29 18.81 19.38
88 2021-03-30 17.12 18.81
89 2021-03-31 16.53 17.12

90 rows × 3 columns

For clarity, we can rename the columns with the corresponding lag attributes, taking into account the chronological order of the observations.

uva_new1.columns = ['Date', 'uva_t', 'uva_t-1']
uva_new1 = uva_new1[['Date', 'uva_t-1', 'uva_t']]
uva_new1.head()
Date uva_t-1 uva_t
0 2021-01-01 NaN 15.64
1 2021-01-02 15.64 15.73
2 2021-01-03 15.73 15.65
3 2021-01-04 15.65 15.87
4 2021-01-05 15.87 15.87

You can see that we would have to discard the first row to use the dataset to train a supervised learning model, as it does not contain enough data to work with. The addition of lag features is called the sliding window method, in this case with a window width of 1. The next step is to clear the rows with missing values, which is done using the .dropna() function.

uva_new1.dropna(how = 'any', inplace=True)
uva_new1
Date uva_t-1 uva_t
1 2021-01-02 15.64 15.73
2 2021-01-03 15.73 15.65
3 2021-01-04 15.65 15.87
4 2021-01-05 15.87 15.87
5 2021-01-06 15.87 15.80
... ... ... ...
85 2021-03-27 19.84 19.60
86 2021-03-28 19.60 19.38
87 2021-03-29 19.38 18.81
88 2021-03-30 18.81 17.12
89 2021-03-31 17.12 16.53

89 rows × 3 columns

The supervised learning problem to solve here is to use the previous day’s value to predict the current day’s value.
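As a sketch of how this reframed dataset could feed a supervised learner, the following fits a least-squares line with np.polyfit as a stand-in for any learning algorithm. The series here is synthetic (a made-up random walk), since the real CSV is not reproduced in this sketch:

```python
import numpy as np
import pandas as pd

# A synthetic stand-in for the UVA series: a random walk around 16.
rng = np.random.default_rng(42)
series = pd.Series(16 + np.cumsum(rng.normal(0, 0.2, size=90)))

# Lag-1 reframing: yesterday's value is the input, today's value the output.
frame = pd.DataFrame({'uva_t-1': series.shift(1), 'uva_t': series}).dropna()

# Fit a straight line uva_t = a * uva_t-1 + b as a minimal supervised model.
a, b = np.polyfit(frame['uva_t-1'], frame['uva_t'], 1)

# One-step-ahead forecast from the last observed value.
forecast = a * series.iloc[-1] + b
print(round(forecast, 2))
```

Any regression algorithm could sit in place of the fitted line; the point is that once the data is in (input, output) form, the forecasting problem is an ordinary supervised learning problem.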

For a reframing with 2 lag features, which means using the past 2 days to predict the current day, we can proceed in the same way.

All of the steps carried out above can be summarized as:
uva_shift1 = uva_data['ALLSKY_SFC_UVA'].shift(1)
uva_shift2 = uva_data['ALLSKY_SFC_UVA'].shift(2)
uva_new2 = pd.concat([uva_data, pd.DataFrame(uva_shift1), pd.DataFrame(uva_shift2)], axis=1)
uva_new2
Date ALLSKY_SFC_UVA ALLSKY_SFC_UVA ALLSKY_SFC_UVA
0 2021-01-01 15.64 NaN NaN
1 2021-01-02 15.73 15.64 NaN
2 2021-01-03 15.65 15.73 15.64
3 2021-01-04 15.87 15.65 15.73
4 2021-01-05 15.87 15.87 15.65
... ... ... ... ...
85 2021-03-27 19.60 19.84 19.80
86 2021-03-28 19.38 19.60 19.84
87 2021-03-29 18.81 19.38 19.60
88 2021-03-30 17.12 18.81 19.38
89 2021-03-31 16.53 17.12 18.81

90 rows × 4 columns

uva_new2.columns = ['Date', 'uva_t', 'uva_t-1', 'uva_t-2']
uva_new2 = uva_new2[['Date', 'uva_t-2', 'uva_t-1', 'uva_t']]
uva_new2.head()
Date uva_t-2 uva_t-1 uva_t
0 2021-01-01 NaN NaN 15.64
1 2021-01-02 NaN 15.64 15.73
2 2021-01-03 15.64 15.73 15.65
3 2021-01-04 15.73 15.65 15.87
4 2021-01-05 15.65 15.87 15.87
uva_new2.dropna(how = 'any', inplace=True)
uva_new2.head()
Date uva_t-2 uva_t-1 uva_t
2 2021-01-03 15.64 15.73 15.65
3 2021-01-04 15.73 15.65 15.87
4 2021-01-05 15.65 15.87 15.87
5 2021-01-06 15.87 15.87 15.80
6 2021-01-07 15.87 15.80 15.75

Task 5

  1. Reframe the data for a lag-3 feature problem.

  2. Write a function that summarizes the above process. The function should take only the lag number as an argument and return the corresponding DataFrame.

  3. Now consider a bivariate forecast problem (using the history of two features to predict the dynamics of only one feature). Use the function written above to build that scenario.