Chapter 1: Working with Series and DataFrames#
Learning Objectives:
Getting familiar with the concept of tabular data and iid data.
Getting familiar with the concept of
Series
andDataFrames
.Developing the ability to query dataframes to answer practical questions.
Updating dataframes based on useful insights.
The concept of Tabular data and iid data#
Tabular data refers to structured data that is organized into rows
and columns
, similar to what you might find in a spreadsheet or a database table. Each row typically represents an individual observation or record, while each column signifies a particular attribute or feature of the data.
For instance, in the African context, a table might list various agricultural products (rows) and their respective prices, production quantities, and export destinations (columns). This structured format of tabular data enables easy querying, visualization, and statistical analysis. Whether it’s used for monitoring rainfall patterns across various regions in Africa or tracking the spread of diseases like Malaria, tabular data plays a pivotal role in decision-making processes across the continent.
IID data, or Independent and Identically Distributed data, refers to a collection of random variables where each variable has the same probability distribution as the others and all are mutually independent. In layman’s terms, it means that each data point is generated in the same way and does not depend on previous data points.
For example, if one were to conduct a survey across various African nations about a particular product’s preference, the responses from one country or individual would not influence the responses from another. However, it’s essential to note that real-life data, especially in complex environments like Africa with its diverse cultures, economies, and ecosystems, might not always strictly adhere to the IID assumption. Recognizing when this assumption holds and when it doesn’t is crucial for accurate data analysis and modeling.
Building on the concept of tabular data, modern data science tools have provided versatile structures to handle and manipulate such data with ease. In the Python programming language, the pandas library stands out as a principal tool for this purpose. At the heart of pandas are two primary data structures: Series and DataFrames which is the focus of this chapter.
Introduction to Pandas#
Pandas
is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Certainly among the most important tools for data analysts and data scientists.
The most popular library for working with tabular data in Python
To import it, simply do:
import pandas as pd
To check the version of Pandas you are running, do:
pd.__version__
'2.0.3'
Pandas Series#
To get started with pandas, you will need to get comfortable with its two data structures: Series and Dataframe. They provide a solid, easy-to-use basis for most applications. Every object successfully returned by Pandas is either Series or DataFrame.
DataFrames and Series are not simply storage containers. Since Pandas treat them similarly, they have built-in support for a variety of data-wrangling operations, such as:
Single-level and hierarchical indexing
Handling missing data
Arithmetic and Boolean operations on entire columns and tables
Database-type operations (such as merging and aggregation)
Plotting individual columns and whole tables
Reading data from files and writing data to files
Series#
A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index. You can create a simple series from any sequence: a list, a tuple, or a numpy array or even a python dictionary.
From a list of numbers
baby_ages = pd.Series( [2 , 10 , 12] )
baby_ages
0 2
1 10
2 12
dtype: int64
From a list of strings
countries = pd.Series( ['Uganda' , 'Mali' , 'Chad' , 'Niger'] )
countries
0 Uganda
1 Mali
2 Chad
3 Niger
dtype: object
From a list of objects (data of different types)
travel_infos = pd.Series( [ "£12" , 25 , "Banjul", "50km" ])
travel_infos
0 £12
1 25
2 Banjul
3 50km
dtype: object
From a tuple#
diseases = pd.Series( ("malaria" , "tuberculosis", "influenza") )
diseases
0 malaria
1 tuberculosis
2 influenza
dtype: object
From a numpy array#
#Let's import the numpy library first
import numpy as np
index_poverty = np.arange(0,10,2)
index_poverty
array([0, 2, 4, 6, 8])
index_poverty_series = pd.Series(index_poverty)
index_poverty_series
0 0
1 2
2 4
3 6
4 8
dtype: int64
From a python dictionary#
ebola_infos = {"Outbreak": 'Ebola' , "City": 'Goma' ,
"Country": 'DRC' ,"Continent": 'Africa'}
ebola_infos
{'Outbreak': 'Ebola', 'City': 'Goma', 'Country': 'DRC', 'Continent': 'Africa'}
ebola_infos_series = pd.Series(ebola_infos)
ebola_infos_series
Outbreak Ebola
City Goma
Country DRC
Continent Africa
dtype: object
As you can notice, there is column on the left always appearing when printing a series. It’s a column index
which, by default run from \(0\) to \(n-1\) where \(n\) is the length of the series. In the case of a dictionary, it is automatically replaced by the key of the dictionary. And the values of the dictionary are the actual content of the Pandas Series.You can verify it by typing the command below.
ebola_infos_series.index
Index(['Outbreak', 'City', 'Country', 'Continent'], dtype='object')
ebola_infos_series.values
array(['Ebola', 'Goma', 'DRC', 'Africa'], dtype=object)
Series can also be created along with its indices
# An information recorded from a patient during a survey in a Hospital in Mbour, Senegal.
patient_senegal_infos = pd.Series( ["Traore" , "Senegalese" , "Single" , "Wolof"],
index = ["Name" , "Nationality" , "Status" , "Language"])
patient_senegal_infos
Name Traore
Nationality Senegalese
Status Single
Language Wolof
dtype: object
#The age of 4 siblings members of a family in Bouake, Ivory Coast as recorded during a survey
bouake_kids = pd.Series([12 , 25 , 7 , 58 , 39],
index = ["Yao" , "Kouassi" , "Senan", "Bony" , "Marguerite"])
bouake_kids
Yao 12
Kouassi 25
Senan 7
Bony 58
Marguerite 39
dtype: int64
We can get to each of the terms easily, by using the index operators, which is the open and close square brackets.
bouake_kids['Yao']
12
bouake_kids['Bony']
58
We can check the number of people below the age of 20
bouake_kids
Yao 12
Kouassi 25
Senan 7
Bony 58
Marguerite 39
dtype: int64
bouake_kids < 20
Yao True
Kouassi False
Senan True
Bony False
Marguerite False
dtype: bool
Technically speaking, bouake_kids < 20
is a boolean expression, which means it is an expressions that returns Boolean values: True
where there is a match and False
otherwise. But, what if we want to get the actual values, we need to use the index operator and passing the boolean expression as argument. and we will get sliced pandas series with only the matching rows.
bouake_kids[bouake_kids<45]
Yao 12
Kouassi 25
Senan 7
Marguerite 39
dtype: int64
We can also enlarge the scope of our query buy feeding the index operator with a more sophisticated expression or collection of expressions as you see below. For instance, we are interesred in kids between the age of 12 and 45.
bouake_kids[(bouake_kids<45) & (bouake_kids>12) ]
Kouassi 25
Marguerite 39
dtype: int64
Basic Statistics#
Basic statistical operations could be performed on pandas series. Let’s explore a few.
The sum of all the elements
#### A summary of the series
The describe()
method is used for calculating some statistical data like percentile, mean and std of the numerical values of the Series. Lets look at the bouake_kids
again.
bouake_kids.describe()
count 5.000000
mean 28.200000
std 20.777392
min 7.000000
25% 12.000000
50% 25.000000
75% 39.000000
max 58.000000
dtype: float64
summing over items#
bouake_kids.sum()
141
averaging over items#
bouake_kids.mean()
28.2
The lowest value
bouake_kids.min()
7
The largest value
bouake_kids.max()
58
The variance
bouake_kids.var()
431.70000000000005
The standard deviation
bouake_kids.std()
20.777391559096152
Pandas Dataframe#
A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). It can also simply be viewed as a collection of Series. There are numerous ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays.
#Some temperature values recorded across African cities last rainy season
temp_african_cities = {"City": ['Gao' , 'Thies' , 'Abeche','Bobo-Dioulasso', 'Port Harcourt'],
"Temperature(°C)": [29,32,27,19,35] }
df_temp_afr_cities = pd.DataFrame(temp_african_cities)
df_temp_afr_cities
City | Temperature(°C) | |
---|---|---|
0 | Gao | 29 |
1 | Thies | 32 |
2 | Abeche | 27 |
3 | Bobo-Dioulasso | 19 |
4 | Port Harcourt | 35 |
#The birth rate from several hospitals in Togo, Benin on a specific week in 2003
togo_birth_rate = {"Hospital": ["Hopital General de Dapaong" , "Deutsche Hospital Apaong" ,
"Clinique Notre-Dame de Lomé", "hôpital Nachtigal d’aného",
"hôpital Germanique de kpalimé", "Liberty Klinic"],
"Birth_Rate" :[ 0.4 , 0.25 , 0.98 , 0.18 , 0.62 , 0.16]
}
df_togo_birth_rate = pd.DataFrame(togo_birth_rate)
df_togo_birth_rate
Hospital | Birth_Rate | |
---|---|---|
0 | Hopital General de Dapaong | 0.40 |
1 | Deutsche Hospital Apaong | 0.25 |
2 | Clinique Notre-Dame de Lomé | 0.98 |
3 | hôpital Nachtigal d’aného | 0.18 |
4 | hôpital Germanique de kpalimé | 0.62 |
5 | Liberty Klinic | 0.16 |
To query the content of a dataframe, we can call the different series using the index operator and passing in the name of a column as a string.
df_togo_birth_rate['Hospital']
0 Hopital General de Dapaong
1 Deutsche Hospital Apaong
2 Clinique Notre-Dame de Lomé
3 hôpital Nachtigal d’aného
4 hôpital Germanique de kpalimé
5 Liberty Klinic
Name: Hospital, dtype: object
df_togo_birth_rate['Birth_Rate']
0 0.40
1 0.25
2 0.98
3 0.18
4 0.62
5 0.16
Name: Birth_Rate, dtype: float64
We might decide to add another column. For instance, we want to add information related to the electricity outage in those hospitals during that week.
df_togo_birth_rate['Light'] = ["Yes" , "Yes" , "No" , "Yes" , "No" , "No"]
df_togo_birth_rate
Hospital | Birth_Rate | Light | |
---|---|---|---|
0 | Hopital General de Dapaong | 0.40 | Yes |
1 | Deutsche Hospital Apaong | 0.25 | Yes |
2 | Clinique Notre-Dame de Lomé | 0.98 | No |
3 | hôpital Nachtigal d’aného | 0.18 | Yes |
4 | hôpital Germanique de kpalimé | 0.62 | No |
5 | Liberty Klinic | 0.16 | No |
We can get a subset of dataframe based on some conditions. Let’s find out from the task below.
Task 1: #
In our quest to understand the interplay between health metrics and infrastructural development in Togo, we have obtained information from various hospitals across Togo. The data is provided below.
Take a look at the data and think of some practical questions that can be answered from querying that dataframe
Phrase them on a piece of paper.
Write the Python codes that solve them
Think of a recommendation you can make out of your answers.
Answer the following questions:
What are the hospitals where the birth rate is less than
0.50
?What are the hospitals which had electricity supplied on that day?
Are the Germans Hospitals in Togo always supplied with electricity?
Siemens, under the auspices of the German-Togo Collaboration, has committed to equipping selected hospitals with backup power generators to mitigate service disruptions during power outages, ensuring continuous patient care. Siemens has judiciously scheduled the delivery of these generators to each beneficiary hospital. Below, you will find the scheduled delivery dates stored in the
scheduled_dates
variable. Usepd.to_datetime
to convert these dates, ensuring they are in the appropriate datetime format for further analysis.In light of directives from His Excellency Mr. Fore Yassingbe, the Togolese Minister of Health and his team will be present for the generator handover ceremonies. Given his tight schedule, his secretary informed that he can only attend events scheduled between the 20th and 40th weeks of the year. Based on this timeframe, how many ceremonies can he attend in person?
togo_data = {
'Hospital Name': [
'Lomé Central Hospital',
'Kara Regional Hospital',
'Sokodé District Hospital',
'Dapaong Clinic',
'Atakpamé Health Center',
'Bassar Hospital',
'Tsevie Hospital',
'Aného Medical Clinic',
'Kpalimé Health Center',
'Mango District Hospital',
'Lomé-Berlin Partnership Clinic',
'Deutsche Togo Hospital',
'Badou Health Center',
'Notse Hospital',
'Sotouboua Medical Facility',
'German-Togolese Medical Institute',
'Vogan Health Clinic',
'Tabligbo Hospital',
'Kande Medical Center',
'German Heart Clinic of Lomé'
],
'Birth rate': [
0.152, 0.124, 0.147, 0.108, 0.135,
0.119, 0.141, 0.130, 0.125, 0.148,
0.151, 0.133, 0.137, 0.128, 0.122,
0.144, 0.136, 0.139, 0.115, 0.155
],
'Access to Electricity': [
'Yes', 'No', 'Yes', 'No', 'Yes',
'No', 'Yes', 'Yes', 'Yes', 'No',
'Yes', 'Yes', 'No', 'Yes', 'Yes',
'No', 'Yes', 'Yes', 'No', 'Yes'
]
}
scheduled_dates = ['2024-06-12', '2024-08-02', '2024-04-11', '2024-09-23',
'2024-05-30', '2024-09-14', '2024-07-04', '2024-06-26',
'2024-08-15', '2024-04-25', '2024-09-10', '2024-06-30',
'2024-12-15', '2024-05-17', '2024-05-13', '2024-05-15',
'2024-10-01', '2024-10-04', '2024-10-03', '2024-10-02']