Chapter 2: Data Reading Methods

Chapter 2: Data Reading Methods#

Before diving deep into data manipulation and analysis, mastering the art of data ingestion using pandas is pivotal. This chapter will guide you through the essentials of reading data from various sources, ensuring a robust foundation for your data exploration journey.

Learning Objectives:

Introducing different files reading methods
Developing the ability to query dataframes to answer practical questions
Updating dataframes based on useful insights

Reading Methods#

Python has become a beloved language for text and file munging due to its simple syntax for interacting with files, intuitive data structures, and convenient features. Pandas features a number of functions for reading tabular data as a DataFrame object. Let’s import it

import pandas as pd

A plethore of reading method is available in the pandas library. You can have access to it by typing the tab key after typing pd.:

Q: #

How many of them could you see and at first glance, what could they possibly mean to you?

For sake of the context , we will just focus on the 3 main reading methods

\(\Large pd.read\_csv\)
\(\Large pd.read\_excel\)
\(\Large pd.read\_clipboard\)

The `pandas.read_csv` method#

Reading a csv file requires passing as argument the path to the actual file. For simplicity, the file should be saved in the same folder as the notebook you are running but you could also specify the path to it. See below.

Here is the pattern to follow:

dataframe_name = pd.read_csv( "path/to/the/file.csv" )

Task 2: #

It was a somber afternoon, and Dr. Amara stood at the front of the conference room, a projector screen displaying the infamous date: 2014. Everyone in the room knew what that year meant, especially for West Africa.

“2014,” Dr. Amara began, “the year the Ebola crisis took a grip on some of our neighboring countries. But before we delve into the specifics, can anyone highlight the distinction between an ‘outbreak’ and a ‘pandemic’?”
She paused for a moment, letting the room reflect on the question. Then, pointing to the screen which now showcased a map of Africa, she continued, “Ebola wasn’t new to science, but its rapid spread in 2014 took many by surprise. Does anyone know its origins and why it spread with such ferocity at that particular time?”
After a brief discussion, she steered the group towards the day’s task. “On your workstations, you’ll find a folder labeled data. Within it, there’s a file called ebola_WA.csv. It captures the historical trajectory of Ebola cases in Nigeria, Guinea, and Sierra Leone. Take a moment, ensure the file’s there and load it using the standard protocols we’ve practiced.”
Minutes later, as rows and columns filled various screens, Dr. Amara asked, “From the data in front of you, what anomalies or patterns stand out? Let’s discuss.”

Task 3: #

Dr. Amara took a momentary pause, then shifted her focus. “Often,” she began, with an understanding tone, “as data analysts, we aren’t always involved in the actual data collection phase. This detachment sometimes means we encounter datasets with unfamiliar structures or entries saved differently than what we’re used to.”

Pulling up a new slide, she continued, “I’ve been handed another set of historical data - this time from our colleagues who worked in parts of Central Africa. I’d like you all to delve into this. Access it via ebola_CA.csvand share your initial observations.”
As attendees began accessing the new set of data, she added, **”Once you’ve skimmed through, use the same method we practiced earlier to load this data into a dataframe ande tell us what you observe.”

You might have guessed it. CSV stands for comma separated values. Other files similar file extensions are also used. See below:

"," for the csv file
"\t" for a tsv file
or any other character that separates the different variables for an observation

As example of the .tsv file, see the gapminder dataset below. Let’s read this dataset where variables are separated by a tabulation key.

world_gapminder = pd.read_csv('data/gapminder.tsv' , sep = "\t")
world_gapminder.head(5)

	country	continent	year	lifeExp	pop	gdpPercap
0	Afghanistan	Asia	1952	28.801	8425333	779.445314
1	Afghanistan	Asia	1957	30.332	9240934	820.853030
2	Afghanistan	Asia	1962	31.997	10267083	853.100710
3	Afghanistan	Asia	1967	34.020	11537966	836.197138
4	Afghanistan	Asia	1972	36.088	13079460	739.981106

The `pandas.read_excel` method#

Excel files are also quite common files for saving data. The XLS file extension is used for files saved as Microsoft Excel worksheets. Reading an excel file requires the same process as reading a csv file. The only difference lies in specifying the sheet you want to get access to.

The dataset below comes from the mining industry. It contains mining locations in some African countries both at the region and at the district level. See here a link to an interesting website where you can download the shapefiles of such regions. https://gadm.org/formats.html

The code below accesses an excel sheet called Mines_Africa_Districtsand the sheet named Africa_Districts

mines_sites = pd.read_excel('data/Mines_Africa_Districts.xls' , 
                    sheet_name='Africa_Districts')
mines_sites.head(5)

	countrycode	countryname	ADM1	ADM2
0	AGO	Angola	BiÚ	Nharea
1	AGO	Angola	HuÝla	Humpata
2	AGO	Angola	Cuando Cubango	Cuangar
3	AGO	Angola	Cabinda	Buco Zau
4	AGO	Angola	Lunda Norte	Cuango

Could you possibly guess what ADM2 might mean?

The code below access a sheet named Type_of_Mines

types_mines = pd.read_excel('data/Mines_Africa_Districts.xls' , 
                    sheet_name='Type_of_Mines')
types_mines.tail(6)

	mines	aluminum_mine	asbestos_mine	bariumbarite_mine	beryllium_mine
5829	0	3	12.0	3	3
5830	0	4	2.0	5	2
5831	5	5	1.5	2	3
5832	1	6	2.0	3	1
5833	22	7	1.0	8	2
5834	0	0	0.0	1	2

The `pandas.read_clipboard` method#

The pandas.read_clipboard method is a convenient tool that reads the contents of your clipboard directly into a pandas DataFrame. This method can be quite useful when you want to quickly grab tables or data that you’ve copied from websites, spreadsheets, or other sources without having to save them as files first.

# lets practice some.  
# Take a look at this website https://www.timeanddate.com/


import pandas as pd
#pd.read_clipboard()

There are several other pandas reading methods. Depending on the files you are dealing with, you might have to choose the convenient method. Other popular methods include pd.read_json (for json files) , pd.read_html (for html files) , pd.read_

The principle stays the same, which is reading from external sources and casting it into a pandas dataframe for facilitating the analysis.

Shape and datatype#

ebola_wa = pd.read_csv('data/ebola_WA.csv')

To get the first n rows and the last n rows

ebola_wa.head(4)

	Date	Day	Cases_Guinea	Cases_Liberia	Cases_SierraLeone	Cases_Nigeria
0	1/5/2015	289	2776.0	NaN	10030.0	NaN
1	1/4/2015	288	2775.0	NaN	9780.0	NaN
2	1/3/2015	287	2769.0	8166.0	9722.0	NaN
3	1/2/2015	286	NaN	8157.0	NaN	NaN

ebola_wa.tail(5)

	Date	Day	Cases_Guinea	Cases_Liberia	Cases_SierraLeone	Cases_Nigeria
117	3/27/2014	5	103.0	8.0	6.0	NaN
118	3/26/2014	4	86.0	NaN	NaN	NaN
119	3/25/2014	3	86.0	NaN	NaN	NaN
120	3/24/2014	2	86.0	NaN	NaN	NaN
121	3/22/2014	0	49.0	NaN	NaN	NaN

A random sample of 3 rows

ebola_wa.sample(3)

	Date	Day	Cases_Guinea	Cases_Liberia	Cases_SierraLeone	Cases_Nigeria
62	8/16/2014	147	543.0	834.0	848.0	15.0
40	10/11/2014	203	NaN	4249.0	NaN	NaN
67	8/4/2014	135	495.0	516.0	691.0	9.0

To check the shape of the dataframe (rows and columns)

ebola_wa.shape

(122, 6)

To check the datatype of each of the columns

ebola_wa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122 entries, 0 to 121
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Date               122 non-null    object 
 1   Day                122 non-null    int64  
 2   Cases_Guinea       93 non-null     float64
 3   Cases_Liberia      83 non-null     float64
 4   Cases_SierraLeone  87 non-null     float64
 5   Cases_Nigeria      38 non-null     float64
dtypes: float64(4), int64(1), object(1)
memory usage: 5.8+ KB

We can notice that the dataset contains 122 entries with their index ranging from 0 to 121
The columns Cases_Guinea, Cases_Liberia, Cases_SierraLeone and Cases_Nigeria are float64 type. Which means they contain floating points.
The *Day column is of int64 type
Some of the columns contains missing values: 122-93 = 29 for the Cases_Guinea 122-83 = 39 for the Cases_Liberia 122-87 = 35 for the Cases_SierraLeone 122-38 = 84 for the Cases_Nigeria

To check the list of columns

ebola_wa.columns

Index(['Date', 'Day', 'Cases_Guinea', 'Cases_Liberia', 'Cases_SierraLeone',
       'Cases_Nigeria'],
      dtype='object')

Let’s consider another dataset

world_gapminder = pd.read_csv('data/gapminder.tsv' , sep='\t')
world_gapminder.head(5)

	country	continent	year	lifeExp	pop	gdpPercap
0	Afghanistan	Asia	1952	28.801	8425333	779.445314
1	Afghanistan	Asia	1957	30.332	9240934	820.853030
2	Afghanistan	Asia	1962	31.997	10267083	853.100710
3	Afghanistan	Asia	1967	34.020	11537966	836.197138
4	Afghanistan	Asia	1972	36.088	13079460	739.981106

To check the unique values inside a column

world_gapminder['continent'].unique()

array(['Asia', 'Europe', 'Africa', 'Americas', 'Oceania'], dtype=object)

world_gapminder['continent'].value_counts()

continent
Africa      624
Asia        396
Europe      360
Americas    300
Oceania      24
Name: count, dtype: int64

To get a subset of the data, let’s query the dataframe and get some insights

Task 4: #

Sitting in a bustling cafe in Luanda, Maria, a public health student, turned to her colleague, João, and posed a question, “João, do you recall the years when Angola’s life expectancy surpassed 40? And what do you think such a figure signifies about our nation’s health and development?”
Under the shade of a large acacia tree, two economics professors, Dr. Kofi and Dr. Amina, sat enjoying their afternoon tea. Dr. Kofi, with a nostalgic glint in his eyes, posed a question, “Amina, think back to the 80s. Which African nation do you reckon had the highest GDP?
- Could you guess before computing?
- What could be the explanation for that?

Task 5: #

Look at the gapminder dataset and answer the following questions:

Southern European GDP in the 2000s:

At a global financial seminar in Johannesburg, attendees were embarked in discussions about economic shifts. Sofia, a delegate from Greece, turned to Kwame, a noted economic analyst from Ghana. “Kwame,” she began, “in the 2000s, do you know which Southern European countries stood out in terms of GDP? both at the top and bottom?

African Life Expectancy in the 1970s:

In a health and development conference in Addis Ababa, there was a buzz about historical health metrics. Fatima, a public health student from Algeria, approached Zola, an esteemed demographic analyst from Tanzania. “Zola, in the 1970s, can you pinpoint the African countries that led and trailed in life expectancy? What factors do you think contributed to such outcomes?”

Asian vs. European GDP in the 90s:

During a multinational economic colloquium in Dakar, conversations were abuzz about past economic giants. Ravi, a researcher from India, curiously asked Chijioke, a skilled economic analyst from Cameroon, “Chijioke, reflecting on the 90s, how many Asian countries do you recall, exceeded the European average GDP?”

SE_countries = ['Albania', 'Andorra', 'Bosnia and Herzegovina', 'Croatia', 'Gibraltar', 
                   'Greece', 'Italy', 'Kosovo', 'Malta', 'Montenegro', 'North Macedonia', 
                   'Portugal', 'San Marino', 'Serbia', 'Slovenia', 'Spain', 'Vatican City']