Chapter 1: Working with Time Data#
Time series data lies at the heart of numerous practical applications across various fields, underlining its immense importance. By capturing data points at successive time intervals, time series analysis enables us to uncover valuable insights, observe patterns, and make accurate forecasts. It finds practical application in diverse domains such as finance, economics, weather forecasting, stock market analysis, demand forecasting, anomaly detection, and predictive maintenance.
Whether we aim to
Predict future stock prices in Douala Stock Exchange in Cameroon,
Understand consumer behavior over time on the Takealot.com e-commerce website in Zimbabwe,
Monitor sensor readings for spotting anomalies in the Engineering Division of DAL Group in Sudan,
Forecast seasonal trends of fashion accessories designed by LE Creation in Ivory Coast,
a solid grasp of time series analysis proves indispensable. So, let’s delve into the fascinating world of time series and explore the tools and techniques that enable us to extract meaningful information from temporal data.
Learning Objectives:
Get the standard definitions of time series,
Perform time series analysis,
Identify the important components to consider in time series data,
Study examples of time series to make your understanding concrete.
Working with Time in Pandas#
One of the best reasons to use Pandas is its capability to identify and handle time and date variables in the data values we work on. In the previous chapters, we saw some simple cases where the data was nothing more than string labels and numeric values.
In reality, data comes in many formats, and dates and times are almost omnipresent: they are introduced as soon as data is collected or sampled.
Values of this type are indispensable for carrying out meaningful statistics, so it is essential that analysis tools are able to recognize and use them. The Pandas library manages this type of data based on the following two specific concepts:
Date Times
Time deltas
Date Times#
Date Times represent the concept of date and time, typically indicated as a certain instant. Working with data, we often encounter dates and times expressed as strings in various formats. For example, the order of years, months, and days (and sometimes hours, minutes, and seconds) can differ significantly based on the country of origin.
The African continent, for instance, spans six time zones (the offsets actually in use can vary):
Mauritius and Seychelles Time (UTC/GMT+4),
East Africa Time (UTC/GMT+3),
Central Africa Time (UTC/GMT+2),
West Africa Time (UTC/GMT+1),
Greenwich Mean Time (UTC/GMT),
Cape Verde Time (UTC-1).
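These offsets matter as soon as timestamps recorded in different countries are compared. As a minimal sketch, assuming the IANA zone names Africa/Lagos (West Africa Time) and Africa/Nairobi (East Africa Time) are available on your system, a time-zone-aware Timestamp can be converted from one zone to another:

```python
import pandas as pd

# Noon in West Africa Time (UTC+1); the zone name is an illustrative choice
noon_wat = pd.Timestamp('2021-11-30 12:00', tz='Africa/Lagos')

# The same instant expressed in East Africa Time (UTC+3)
noon_eat = noon_wat.tz_convert('Africa/Nairobi')

print(noon_wat)
print(noon_eat)
```

Both objects describe the same instant; only the local clock reading differs.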
Different specifications can apply to the way dates and times are written:
Months can be numeric (e.g., 11 for November) or literal (e.g., Nov).
Years can have four or two digits (e.g., 2021 or 21).
The separator character between values varies widely ('-', '/', ':', etc.).
Parsing these values accurately is challenging, hence the need for a powerful library that simplifies the process for data analysts. Pandas builds on the datetime module of the standard library and provides the Timestamp class, which handles date-time values efficiently. Its constructor accepts a string and converts it to datetime64, the NumPy-based data type that Pandas uses for date-time values, with a resolution down to the nanosecond. This precision is ideal for scientific applications involving short time spans.
Let’s write a series of examples on recognizing and converting dates with the Timestamp() constructor, as follows:
import pandas as pd
import numpy as np
pd.Timestamp('2021-11-30')
Timestamp('2021-11-30 00:00:00')
pd.Timestamp('2021-Nov-30')
Timestamp('2021-11-30 00:00:00')
pd.Timestamp('2021/11/30')
Timestamp('2021-11-30 00:00:00')
It can be deduced from the preceding examples that, regardless of the format, the Timestamp() constructor recognizes the same date. The constructor also works with numeric arguments, not only with strings. In this case, the arguments correspond, in order, to the year, month, day, hour, minute, and second, as follows:
pd.Timestamp(2021,11,30,0,0,0)
Timestamp('2021-11-30 00:00:00')
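Once a Timestamp has been built, its parts can be read back through attributes. The following sketch uses the same date with an arbitrary time of day; the variable name ts is only illustrative:

```python
import pandas as pd

ts = pd.Timestamp(2021, 11, 30, 8, 45, 0)

# Individual components of the instant
print(ts.year, ts.month, ts.day)
print(ts.hour, ts.minute, ts.second)

# Name of the weekday on which the date falls
print(ts.day_name())
```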
The Pandas library also provides the to_datetime() function, which accepts a collection of objects (for example, a list, a Series, or a DataFrame with year, month, and day columns) and converts them into DateTime values. Consider the following DataFrame:
df = pd.DataFrame({'year': [2019, 2020, 2021],
'month': [10, 6, 4],
'day': [15, 10, 20]})
df
|   | year | month | day |
|---|---|---|---|
| 0 | 2019 | 10 | 15 |
| 1 | 2020 | 6 | 10 |
| 2 | 2021 | 4 | 20 |
It can be converted to DateTime values using the to_datetime() function, as follows:
t_stamp = pd.to_datetime(df)
t_stamp
0 2019-10-15
1 2020-06-10
2 2021-04-20
dtype: datetime64[ns]
As we can see, the function has converted a DataFrame into a Series of datetime64 values.
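A Series of datetime64 values exposes the .dt accessor, which gives access to date parts for every row at once. A small sketch, rebuilding the same DataFrame and storing the results in hypothetical date, weekday, and quarter columns:

```python
import pandas as pd

df = pd.DataFrame({'year': [2019, 2020, 2021],
                   'month': [10, 6, 4],
                   'day': [15, 10, 20]})

# Assemble the three columns into a single datetime64 column
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])

# Derive new columns from the dates through the .dt accessor
df['weekday'] = df['date'].dt.day_name()
df['quarter'] = df['date'].dt.quarter
print(df)
```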
Another possibility is converting strings that describe a date and a time in a format too unusual for the constructor to recognize. For this, the to_datetime() function accepts the optional format parameter, in which you describe the layout of the string by means of format codes (for example, %S for seconds) so that it can be parsed. In addition, with the optional errors parameter set to 'raise' (the default), you get an error message whenever the to_datetime() function fails to parse the string into a datetime64 value, as follows:
t = pd.to_datetime('2021 USA 31--12', format='%Y USA %d--%m',
errors='raise')
t
Timestamp('2021-12-31 00:00:00')
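When a column mixes valid and invalid strings, raising an error on the first failure is not always convenient. As a sketch on made-up input values, errors='coerce' replaces every string that does not match the format with NaT (not a time) instead of raising:

```python
import pandas as pd

# Illustrative input: the second entry does not match the format
raw = pd.Series(['30/11/2021', 'not a date', '01/12/2021'])

parsed = pd.to_datetime(raw, format='%d/%m/%Y', errors='coerce')
print(parsed)

# Rows that could not be parsed
print(raw[parsed.isna()])
```

Inspecting the rows that became NaT is a simple way to audit data quality before any analysis.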
Time Deltas#
DateTimes are precise instants to which data can be assigned. Another concept of time on which Pandas is based is the Time Delta, a measurement of duration, that is, the time interval between two precise instants (DateTimes). This concept is also fundamental to working with times and dates in data analysis.
Pandas implements this concept through Timedelta, a class very similar to the timedelta class in the datetime module of the standard library. This class also has its own constructor, and defining a time interval is simple and intuitive. For example, to define a duration of one day, it is enough to write the following:
pd.Timedelta('1 Day')
Timedelta('1 days 00:00:00')
If we want to update a Datetime value, for example by increasing it by an hour, it is enough to create a Timedelta value equivalent to one hour and add it to the Datetime variable, thus obtaining the updated value. It is therefore possible to perform arithmetic calculations with times using Datetime and Timedelta together, as follows:
df1 = pd.Timestamp('2021-Nov-16 10:30AM')
df1
Timestamp('2021-11-16 10:30:00')
plus_1= pd.Timedelta('1 Hour')
df2 = plus_1 + df1
df2
Timestamp('2021-11-16 11:30:00')
Another, more formal way of defining time intervals is to enter the amount and the unit of time separately. For example, a duration of one day can be defined by entering 1 as a numeric value and setting the optional unit parameter to 'd', for days, as follows:
pd.Timedelta(1, unit="d")
Timedelta('1 days 00:00:00')
When smaller time units such as seconds are used, the constructor converts them to the larger time units, expressing the duration in the corresponding days, hours, minutes, and seconds, as follows:
pd.Timedelta(150045, unit="s")
Timedelta('1 days 17:40:45')
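The breakdown into larger units can be inspected through the components attribute of a Timedelta. A brief sketch with the same 150045 seconds (the variable names are illustrative):

```python
import pandas as pd

delta = pd.Timedelta(150045, unit='s')

# Named tuple holding days, hours, minutes, seconds, and smaller units
parts = delta.components
print(parts.days, parts.hours, parts.minutes, parts.seconds)
```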
There are also more complex ways to define time intervals, which are useful when you have multiple units together, such as days, minutes, and hours, as shown as follows:
pd.Timedelta('3D 5:34:23')
Timedelta('3 days 05:34:23')
We have seen that Timedelta objects define time intervals and are useful for performing arithmetic calculations on Timestamp objects. This concept is even more explicit when we consider the following example:
ts2 = pd.Timestamp('2021-Jan-01 11:30AM')
ts3 = pd.Timestamp('2021-Mar-13 4:15PM')
df3 = ts3 - ts2
df3
Timedelta('71 days 04:45:00')
As we can see from the result, taking the difference between two Timestamps automatically produces an object of type Timedelta, without using the constructor.
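A Timedelta obtained this way can be turned into a plain number, which is often what a report or a model needs. A minimal sketch using the same two Timestamps (the variable names are illustrative):

```python
import pandas as pd

start = pd.Timestamp('2021-Jan-01 11:30AM')
end = pd.Timestamp('2021-Mar-13 4:15PM')
elapsed = end - start

# Whole days contained in the interval
print(elapsed.days)

# Full duration expressed in seconds, then in hours
print(elapsed.total_seconds())
print(elapsed / pd.Timedelta(hours=1))
```

Dividing one Timedelta by another returns a float, so any unit can serve as the yardstick.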
There is also a data conversion function for Timedelta, called to_timedelta(). In this case, however, parsing is less flexible than with to_datetime(). In fact, there is no optional format parameter, and only the default formats that work with the constructor are recognized, as follows:
ts4 = pd.Series(['00:22:33','00:13:12','00:24:14'])
ts4
0 00:22:33
1 00:13:12
2 00:24:14
dtype: object
ts5 = pd.to_timedelta(ts4, errors='raise')
ts5
0 0 days 00:22:33
1 0 days 00:13:12
2 0 days 00:24:14
dtype: timedelta64[ns]
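Once converted, the durations support numeric operations. As a sketch on the same three values, they can be summed, compared, and expressed in minutes:

```python
import pandas as pd

durations = pd.to_timedelta(pd.Series(['00:22:33', '00:13:12', '00:24:14']))

# Total and longest duration
print(durations.sum())
print(durations.max())

# Each duration as a number of minutes
print(durations.dt.total_seconds() / 60)
```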
Creating Arrays of Time Values#
Now that we have seen what Timestamps are in Pandas, let’s take a further step and see how to generate an array of such values in a simple and automatic way.
For this purpose, there is the date_range() function, which creates a predetermined number of values starting from a given date (Timestamp) and separated by a chosen time interval, generating a series of uniformly spaced Datetime values.
For example, it is possible to generate an array of 10 consecutive days starting from a specific date (passed as the first argument) by setting two optional parameters: freq='D', which defines a step of 1 day from one value to the next, and periods, which sets the number of values to generate. In this case, the number of consecutive days is 10, as follows:
pd.date_range('2021/01/01', freq='D', periods=10)
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
'2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08',
'2021-01-09', '2021-01-10'],
dtype='datetime64[ns]', freq='D')
As we can see, the date_range() function creates a DatetimeIndex object containing 10 consecutive days starting from 1st January 2021.
Another possibility is to pass two different Timestamps, one starting and one ending, as arguments to the date_range() function. The default interval is one day (written out explicitly as freq='D' in the following example):
pd.date_range("2021/01/01", "2021/01/10" , freq='D')
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
'2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08',
'2021-01-09', '2021-01-10'],
dtype='datetime64[ns]', freq='D')
If, on the other hand, we want values spaced by a different interval, we must specify it explicitly with the freq parameter. Alternatively, we can ask for a given number of evenly spaced values across the interval; in this case, we use the optional parameter periods.
For example, to create a Timestamp array with values 5 minutes apart within the specified time limits, we pass 5T (in lowercase, 5t, in the code that follows) to the optional freq parameter to indicate 5 minutes. Newer Pandas releases favor the alias 5min and may warn that T is deprecated. The call is as follows:
pd.date_range("2021/01/01 8:00","2021/01/01 10:00", freq='5t')
DatetimeIndex(['2021-01-01 08:00:00', '2021-01-01 08:05:00',
'2021-01-01 08:10:00', '2021-01-01 08:15:00',
'2021-01-01 08:20:00', '2021-01-01 08:25:00',
'2021-01-01 08:30:00', '2021-01-01 08:35:00',
'2021-01-01 08:40:00', '2021-01-01 08:45:00',
'2021-01-01 08:50:00', '2021-01-01 08:55:00',
'2021-01-01 09:00:00', '2021-01-01 09:05:00',
'2021-01-01 09:10:00', '2021-01-01 09:15:00',
'2021-01-01 09:20:00', '2021-01-01 09:25:00',
'2021-01-01 09:30:00', '2021-01-01 09:35:00',
'2021-01-01 09:40:00', '2021-01-01 09:45:00',
'2021-01-01 09:50:00', '2021-01-01 09:55:00',
'2021-01-01 10:00:00'],
dtype='datetime64[ns]', freq='5T')
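The spelling of frequency aliases has changed between Pandas versions. As a hedged sketch, the alias min for minutes should be accepted by both older and recent releases, and it produces the same 25 values:

```python
import pandas as pd

every_5_min = pd.date_range('2021/01/01 8:00', '2021/01/01 10:00', freq='5min')

# Both endpoints are included, so two hours yield 25 values
print(len(every_5_min))
print(every_5_min[0], every_5_min[-1])
```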
Or we might want to generate 10 evenly spaced Timestamps across this time interval, which divides it into 9 equal parts. In this case, 10 is assigned to the optional parameter periods, as follows:
pd.date_range("2021/01/01 8:00","2021/01/01 10:00", periods=10)
DatetimeIndex(['2021-01-01 08:00:00', '2021-01-01 08:13:20',
'2021-01-01 08:26:40', '2021-01-01 08:40:00',
'2021-01-01 08:53:20', '2021-01-01 09:06:40',
'2021-01-01 09:20:00', '2021-01-01 09:33:20',
'2021-01-01 09:46:40', '2021-01-01 10:00:00'],
dtype='datetime64[ns]', freq=None)
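The spacing of 13 minutes and 20 seconds follows from dividing the two-hour span by the 9 gaps that separate 10 values. A short check, with illustrative variable names:

```python
import pandas as pd

start = pd.Timestamp('2021/01/01 8:00')
end = pd.Timestamp('2021/01/01 10:00')
periods = 10

# 10 values leave 9 gaps between them
step = (end - start) / (periods - 1)
print(step)

# The spacing produced by date_range() matches the computed step
idx = pd.date_range(start, end, periods=periods)
print(idx[1] - idx[0] == step)
```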
In all these cases, we have seen the generation of an object of type DatetimeIndex containing a list of Datetime values. In many cases, this kind of object is useful for building a Series or a DataFrame. For example, it is quite common to use these values as the Index, both in a Series and in a DataFrame, as follows:
range_ = pd.date_range("2021/01/01 8:00","2021/01/01 10:00", periods=10)
tsi = pd.Series(np.random.randn(len(range_)), index=range_)
tsi
2021-01-01 08:00:00 -0.207443
2021-01-01 08:13:20 0.311352
2021-01-01 08:26:40 1.495251
2021-01-01 08:40:00 -1.762571
2021-01-01 08:53:20 -0.092389
2021-01-01 09:06:40 -0.549919
2021-01-01 09:20:00 -1.174873
2021-01-01 09:33:20 0.038300
2021-01-01 09:46:40 0.235347
2021-01-01 10:00:00 1.260554
dtype: float64
But a DatetimeIndex can also be used to populate the values of a Series directly, as follows:
ts5 = pd.Series(range_)
ts5
0 2021-01-01 08:00:00
1 2021-01-01 08:13:20
2 2021-01-01 08:26:40
3 2021-01-01 08:40:00
4 2021-01-01 08:53:20
5 2021-01-01 09:06:40
6 2021-01-01 09:20:00
7 2021-01-01 09:33:20
8 2021-01-01 09:46:40
9 2021-01-01 10:00:00
dtype: datetime64[ns]
In the first case (the Series tsi), using datetime64 values as the index allows you to easily select the rows that fall within a particular time interval, as follows:
tsi["2021/01/01 8:00":"2021/01/01 8:30"]
2021-01-01 08:00:00 -0.207443
2021-01-01 08:13:20 0.311352
2021-01-01 08:26:40 1.495251
dtype: float64
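A DatetimeIndex also accepts partial date strings. As a sketch with seeded random values (the seed and the variable names are arbitrary), passing only the date and the hour selects every row that falls within that hour:

```python
import numpy as np
import pandas as pd

range_ = pd.date_range('2021/01/01 8:00', '2021/01/01 10:00', periods=10)
rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
tsi = pd.Series(rng.standard_normal(len(range_)), index=range_)

# All rows whose timestamp lies between 09:00:00 and 09:59:59
nine_oclock = tsi.loc['2021-01-01 09']
print(nine_oclock)
```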
Task 1: #
Black Friday has become a significant shopping day globally, thanks to its wide range of discounts and offers. It enables shoppers to seize limited-time deals and locate the best ones. While originally set for the last Friday of November annually, some companies now stretch it until December 31st, starting from the last Monday of November’s final week.
Mamadou Sekou, from the Gambia, owns a superette called SekouShop in Bouake, the second largest city in Ivory Coast. He has been wondering lately why his customers complained about the absence of discounts during the holiday season last year.
As a result, he believes he missed out on boosting his sales during what they called “the Black Friday Period” (BFP) and seeks to prepare better for the next one. To achieve this, he wants to understand the dynamics of his sales from the BFP last year and compare them with this year’s data, once available. You work with him as a consultant to help him carry out this analysis. You have been given the sales data from 1st November 2022 until 31st December 2022 in a CSV file called SekouShop_sales.csv.
Where does the Black Friday term originate from?
Re-arrange the data into two sub-seasons: the Black Friday and normal sales seasons. What are the cumulative sales for each of the sub-seasons?
The manager of the superette brought to your attention that there were days when the superette did not open its doors. List those days for each of the sub-seasons. What proportion did they represent out of each sub-season and out of the whole holiday season?
Usually, the superette produced its best sales on Fridays. Report all the Friday sales for each of the sub-seasons. Which Friday had the best tally?
Task 2: #
To prepare better for the upcoming year 2023, Mamadou wants to create schedules or calendars to keep track of certain processes such as restocking, utility bills, and special sales.
SekouShop restocks certain perishable items every 3 days. They want you to create a restocking schedule for the next three months (Jan, Feb, Mar 2023) to help with inventory management.
SekouShop pays its utility bills on the first day of every month. The manager wants you to generate reminders five days before each due date for the upcoming year to avoid late fees.
They are planning to host bi-weekly special sales. Can you create a calendar for all their event dates for the upcoming year?