Chapter 1: Working with Series and DataFrames#

Learning Objectives:

  • Getting familiar with the concept of tabular data and iid data.

  • Getting familiar with the concept of Series and DataFrames.

  • Developing the ability to query dataframes to answer practical questions.

  • Updating dataframes based on useful insights.

The concept of Tabular data and iid data#

Tabular data refers to structured data that is organized into rows and columns, similar to what you might find in a spreadsheet or a database table. Each row typically represents an individual observation or record, while each column signifies a particular attribute or feature of the data.

For instance, in the African context, a table might list various agricultural products (rows) and their respective prices, production quantities, and export destinations (columns). This structured format of tabular data enables easy querying, visualization, and statistical analysis. Whether it’s used for monitoring rainfall patterns across various regions in Africa or tracking the spread of diseases like Malaria, tabular data plays a pivotal role in decision-making processes across the continent.

IID data, or Independent and Identically Distributed data, refers to a collection of random variables where each variable has the same probability distribution as the others and all are mutually independent. In layman’s terms, it means that each data point is generated in the same way and does not depend on previous data points.

For example, if one were to conduct a survey across various African nations about a particular product’s preference, the responses from one country or individual would not influence the responses from another. However, it’s essential to note that real-life data, especially in complex environments like Africa with its diverse cultures, economies, and ecosystems, might not always strictly adhere to the IID assumption. Recognizing when this assumption holds and when it doesn’t is crucial for accurate data analysis and modeling.

Building on the concept of tabular data, modern data science tools have provided versatile structures to handle and manipulate such data with ease. In the Python programming language, the pandas library stands out as a principal tool for this purpose. At the heart of pandas are two primary data structures: Series and DataFrames which is the focus of this chapter.

Introduction to Pandas#

Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

  • Certainly among the most important tools for data analysts and data scientists.

  • The most popular library for working with tabular data in Python

To import it, simply do:

import pandas as pd

To check the version of Pandas you are running, do:

pd.__version__
'2.0.3'

Pandas Series#

To get started with pandas, you will need to get comfortable with its two data structures: Series and Dataframe. They provide a solid, easy-to-use basis for most applications. Every object successfully returned by Pandas is either Series or DataFrame.

DataFrames and Series are not simply storage containers. Since Pandas treat them similarly, they have built-in support for a variety of data-wrangling operations, such as:

  • Single-level and hierarchical indexing

  • Handling missing data

  • Arithmetic and Boolean operations on entire columns and tables

  • Database-type operations (such as merging and aggregation)

  • Plotting individual columns and whole tables

  • Reading data from files and writing data to files

Series#

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index. You can create a simple series from any sequence: a list, a tuple, or a numpy array or even a python dictionary.

From a list of numbers

baby_ages = pd.Series( [2 , 10 , 12]  )
baby_ages
0     2
1    10
2    12
dtype: int64

From a list of strings

countries = pd.Series( ['Uganda' , 'Mali' , 'Chad' , 'Niger']   )
countries
0    Uganda
1      Mali
2      Chad
3     Niger
dtype: object

From a list of objects (data of different types)

travel_infos = pd.Series(  [ "£12" , 25 , "Banjul", "50km"  ])
travel_infos
0       £12
1        25
2    Banjul
3      50km
dtype: object

From a tuple#

diseases = pd.Series( ("malaria" , "tuberculosis", "influenza") )
diseases
0         malaria
1    tuberculosis
2       influenza
dtype: object

From a numpy array#

#Let's import the numpy library first
import numpy as np
index_poverty = np.arange(0,10,2)
index_poverty
array([0, 2, 4, 6, 8])
index_poverty_series = pd.Series(index_poverty)
index_poverty_series
0    0
1    2
2    4
3    6
4    8
dtype: int64

From a python dictionary#

ebola_infos = {"Outbreak": 'Ebola' , "City": 'Goma' , 
                  "Country": 'DRC' ,"Continent": 'Africa'}
ebola_infos
{'Outbreak': 'Ebola', 'City': 'Goma', 'Country': 'DRC', 'Continent': 'Africa'}
ebola_infos_series = pd.Series(ebola_infos)
ebola_infos_series
Outbreak      Ebola
City           Goma
Country         DRC
Continent    Africa
dtype: object

As you can notice, there is column on the left always appearing when printing a series. It’s a column index which, by default run from \(0\) to \(n-1\) where \(n\) is the length of the series. In the case of a dictionary, it is automatically replaced by the key of the dictionary. And the values of the dictionary are the actual content of the Pandas Series.You can verify it by typing the command below.

ebola_infos_series.index
Index(['Outbreak', 'City', 'Country', 'Continent'], dtype='object')
ebola_infos_series.values
array(['Ebola', 'Goma', 'DRC', 'Africa'], dtype=object)

Series can also be created along with its indices

# An information recorded from a patient during a survey in a Hospital in Mbour, Senegal.
patient_senegal_infos = pd.Series( ["Traore" , "Senegalese" , "Single" , "Wolof"],
              index = ["Name" , "Nationality" , "Status" , "Language"])
patient_senegal_infos
Name               Traore
Nationality    Senegalese
Status             Single
Language            Wolof
dtype: object
#The age of 4 siblings members of a family in Bouake, Ivory Coast as  recorded during a survey
bouake_kids =  pd.Series([12 , 25 , 7 , 58 , 39],
               index = ["Yao" , "Kouassi" , "Senan", "Bony" , "Marguerite"])
bouake_kids
Yao           12
Kouassi       25
Senan          7
Bony          58
Marguerite    39
dtype: int64

We can get to each of the terms easily, by using the index operators, which is the open and close square brackets.

bouake_kids['Yao']
12
bouake_kids['Bony']
58

We can check the number of people below the age of 20

bouake_kids
Yao           12
Kouassi       25
Senan          7
Bony          58
Marguerite    39
dtype: int64
bouake_kids < 20
Yao            True
Kouassi       False
Senan          True
Bony          False
Marguerite    False
dtype: bool

Technically speaking, bouake_kids < 20 is a boolean expression, which means it is an expressions that returns Boolean values: True where there is a match and False otherwise. But, what if we want to get the actual values, we need to use the index operator and passing the boolean expression as argument. and we will get sliced pandas series with only the matching rows.

bouake_kids[bouake_kids<45]
Yao           12
Kouassi       25
Senan          7
Marguerite    39
dtype: int64

We can also enlarge the scope of our query buy feeding the index operator with a more sophisticated expression or collection of expressions as you see below. For instance, we are interesred in kids between the age of 12 and 45.

bouake_kids[(bouake_kids<45) & (bouake_kids>12)  ]
Kouassi       25
Marguerite    39
dtype: int64

Basic Statistics#

Basic statistical operations could be performed on pandas series. Let’s explore a few.

The sum of all the elements

#### A summary of the series

The describe() method is used for calculating some statistical data like percentile, mean and std of the numerical values of the Series. Lets look at the bouake_kids again.

bouake_kids.describe()
count     5.000000
mean     28.200000
std      20.777392
min       7.000000
25%      12.000000
50%      25.000000
75%      39.000000
max      58.000000
dtype: float64

summing over items#

bouake_kids.sum()
141

averaging over items#

bouake_kids.mean()
28.2

The lowest value

bouake_kids.min()
7

The largest value

bouake_kids.max()
58

The variance

bouake_kids.var()
431.70000000000005

The standard deviation

bouake_kids.std()
20.777391559096152

Pandas Dataframe#

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). It can also simply be viewed as a collection of Series. There are numerous ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays.

#Some temperature values recorded across African cities last rainy season
temp_african_cities = {"City": ['Gao' , 'Thies' , 'Abeche','Bobo-Dioulasso', 'Port Harcourt'],
      
                          "Temperature(°C)": [29,32,27,19,35]     }
df_temp_afr_cities = pd.DataFrame(temp_african_cities)
df_temp_afr_cities
City Temperature(°C)
0 Gao 29
1 Thies 32
2 Abeche 27
3 Bobo-Dioulasso 19
4 Port Harcourt 35
#The birth rate from several hospitals in Togo, Benin on a specific week in 2003
togo_birth_rate = {"Hospital": ["Hopital General de Dapaong" , "Deutsche Hospital Apaong" , 
                   "Clinique Notre-Dame de Lomé", "hôpital Nachtigal d’aného", 
                   "hôpital Germanique de kpalimé", "Liberty Klinic"],
    "Birth_Rate" :[ 0.4 , 0.25 , 0.98 , 0.18 , 0.62 , 0.16]
     }
df_togo_birth_rate   = pd.DataFrame(togo_birth_rate)
df_togo_birth_rate
Hospital Birth_Rate
0 Hopital General de Dapaong 0.40
1 Deutsche Hospital Apaong 0.25
2 Clinique Notre-Dame de Lomé 0.98
3 hôpital Nachtigal d’aného 0.18
4 hôpital Germanique de kpalimé 0.62
5 Liberty Klinic 0.16

To query the content of a dataframe, we can call the different series using the index operator and passing in the name of a column as a string.

df_togo_birth_rate['Hospital']
0       Hopital General de Dapaong
1         Deutsche Hospital Apaong
2      Clinique Notre-Dame de Lomé
3        hôpital Nachtigal d’aného
4    hôpital Germanique de kpalimé
5                   Liberty Klinic
Name: Hospital, dtype: object
df_togo_birth_rate['Birth_Rate']
0    0.40
1    0.25
2    0.98
3    0.18
4    0.62
5    0.16
Name: Birth_Rate, dtype: float64

We might decide to add another column. For instance, we want to add information related to the electricity outage in those hospitals during that week.

df_togo_birth_rate['Light'] = ["Yes" , "Yes" , "No" , "Yes" , "No" , "No"]
df_togo_birth_rate
Hospital Birth_Rate Light
0 Hopital General de Dapaong 0.40 Yes
1 Deutsche Hospital Apaong 0.25 Yes
2 Clinique Notre-Dame de Lomé 0.98 No
3 hôpital Nachtigal d’aného 0.18 Yes
4 hôpital Germanique de kpalimé 0.62 No
5 Liberty Klinic 0.16 No

We can get a subset of dataframe based on some conditions. Let’s find out from the task below.

Task 1: #

In our quest to understand the interplay between health metrics and infrastructural development in Togo, we have obtained information from various hospitals across Togo. The data is provided below.

  1. Take a look at the data and think of some practical questions that can be answered from querying that dataframe

  2. Phrase them on a piece of paper.

  3. Write the Python codes that solve them

  4. Think of a recommendation you can make out of your answers.

  5. Answer the following questions:

    • What are the hospitals where the birth rate is less than 0.50?

    • What are the hospitals which had electricity supplied on that day?

    • Are the Germans Hospitals in Togo always supplied with electricity?

  6. Siemens, under the auspices of the German-Togo Collaboration, has committed to equipping selected hospitals with backup power generators to mitigate service disruptions during power outages, ensuring continuous patient care. Siemens has judiciously scheduled the delivery of these generators to each beneficiary hospital. Below, you will find the scheduled delivery dates stored in the scheduled_dates variable. Use pd.to_datetime to convert these dates, ensuring they are in the appropriate datetime format for further analysis.

  7. In light of directives from His Excellency Mr. Fore Yassingbe, the Togolese Minister of Health and his team will be present for the generator handover ceremonies. Given his tight schedule, his secretary informed that he can only attend events scheduled between the 20th and 40th weeks of the year. Based on this timeframe, how many ceremonies can he attend in person?

togo_data = {
    'Hospital Name': [
        'Lomé Central Hospital', 
        'Kara Regional Hospital', 
        'Sokodé District Hospital', 
        'Dapaong Clinic', 
        'Atakpamé Health Center', 
        'Bassar Hospital',
        'Tsevie Hospital',
        'Aného Medical Clinic',
        'Kpalimé Health Center',
        'Mango District Hospital',
        'Lomé-Berlin Partnership Clinic',  
        'Deutsche Togo Hospital',          
        'Badou Health Center',
        'Notse Hospital',
        'Sotouboua Medical Facility',
        'German-Togolese Medical Institute', 
        'Vogan Health Clinic',
        'Tabligbo Hospital',
        'Kande Medical Center',
        'German Heart Clinic of Lomé'       
    ],
    'Birth rate': [
        0.152, 0.124, 0.147, 0.108, 0.135, 
        0.119, 0.141, 0.130, 0.125, 0.148,
        0.151, 0.133, 0.137, 0.128, 0.122,
        0.144, 0.136, 0.139, 0.115, 0.155
    ],
    'Access to Electricity': [
        'Yes', 'No', 'Yes', 'No', 'Yes', 
        'No', 'Yes', 'Yes', 'Yes', 'No',
        'Yes', 'Yes', 'No', 'Yes', 'Yes',
        'No', 'Yes', 'Yes', 'No', 'Yes'
    ]
}
scheduled_dates = ['2024-06-12', '2024-08-02', '2024-04-11', '2024-09-23', 
         '2024-05-30', '2024-09-14', '2024-07-04', '2024-06-26', 
         '2024-08-15', '2024-04-25', '2024-09-10', '2024-06-30',
         '2024-12-15', '2024-05-17', '2024-05-13', '2024-05-15', 
         '2024-10-01', '2024-10-04', '2024-10-03', '2024-10-02']