Chapter 2: Data Reading Methods#

Before diving deep into data manipulation and analysis, mastering the art of data ingestion using pandas is pivotal. This chapter will guide you through the essentials of reading data from various sources, ensuring a robust foundation for your data exploration journey.

Learning Objectives:

  • Introducing different files reading methods

  • Developing the ability to query dataframes to answer practical questions

  • Updating dataframes based on useful insights

Reading Methods#

Python has become a beloved language for text and file munging due to its simple syntax for interacting with files, intuitive data structures, and convenient features. Pandas features a number of functions for reading tabular data as a DataFrame object. Let’s import it

import pandas as pd

A plethore of reading method is available in the pandas library. You can have access to it by typing the tab key after typing pd.:

Q: #

  • How many of them could you see and at first glance, what could they possibly mean to you?

For sake of the context , we will just focus on the 3 main reading methods

  • \(\Large pd.read\_csv\)

  • \(\Large pd.read\_excel\)

  • \(\Large pd.read\_clipboard\)

The pandas.read_csv method#

Reading a csv file requires passing as argument the path to the actual file. For simplicity, the file should be saved in the same folder as the notebook you are running but you could also specify the path to it. See below.

Here is the pattern to follow:

dataframe_name  =  pd.read_csv( "path/to/the/file.csv" )

Task 2: #

It was a somber afternoon, and Dr. Amara stood at the front of the conference room, a projector screen displaying the infamous date: 2014. Everyone in the room knew what that year meant, especially for West Africa.

  1. “2014,” Dr. Amara began, “the year the Ebola crisis took a grip on some of our neighboring countries. But before we delve into the specifics, can anyone highlight the distinction between an ‘outbreak’ and a ‘pandemic’?”

  2. She paused for a moment, letting the room reflect on the question. Then, pointing to the screen which now showcased a map of Africa, she continued, “Ebola wasn’t new to science, but its rapid spread in 2014 took many by surprise. Does anyone know its origins and why it spread with such ferocity at that particular time?”

  3. After a brief discussion, she steered the group towards the day’s task. “On your workstations, you’ll find a folder labeled data. Within it, there’s a file called ebola_WA.csv. It captures the historical trajectory of Ebola cases in Nigeria, Guinea, and Sierra Leone. Take a moment, ensure the file’s there and load it using the standard protocols we’ve practiced.”

  4. Minutes later, as rows and columns filled various screens, Dr. Amara asked, “From the data in front of you, what anomalies or patterns stand out? Let’s discuss.”


Task 3: #

Dr. Amara took a momentary pause, then shifted her focus. “Often,” she began, with an understanding tone, “as data analysts, we aren’t always involved in the actual data collection phase. This detachment sometimes means we encounter datasets with unfamiliar structures or entries saved differently than what we’re used to.”

  1. Pulling up a new slide, she continued, “I’ve been handed another set of historical data - this time from our colleagues who worked in parts of Central Africa. I’d like you all to delve into this. Access it via ebola_CA.csvand share your initial observations.”

  2. As attendees began accessing the new set of data, she added, **”Once you’ve skimmed through, use the same method we practiced earlier to load this data into a dataframe ande tell us what you observe.”

You might have guessed it. CSV stands for comma separated values. Other files similar file extensions are also used. See below:

  • "," for the csv file

  • "\t" for a tsv file

  • or any other character that separates the different variables for an observation

As example of the .tsv file, see the gapminder dataset below. Let’s read this dataset where variables are separated by a tabulation key.

world_gapminder = pd.read_csv('data/gapminder.tsv' , sep = "\t")
world_gapminder.head(5)
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106

The pandas.read_excel method#

Excel files are also quite common files for saving data. The XLS file extension is used for files saved as Microsoft Excel worksheets. Reading an excel file requires the same process as reading a csv file. The only difference lies in specifying the sheet you want to get access to.

The dataset below comes from the mining industry. It contains mining locations in some African countries both at the region and at the district level. See here a link to an interesting website where you can download the shapefiles of such regions. https://gadm.org/formats.html

The code below accesses an excel sheet called Mines_Africa_Districtsand the sheet named Africa_Districts

mines_sites = pd.read_excel('data/Mines_Africa_Districts.xls' , 
                    sheet_name='Africa_Districts')
mines_sites.head(5)
countrycode countryname ADM1 ADM2
0 AGO Angola BiÚ Nharea
1 AGO Angola HuÝla Humpata
2 AGO Angola Cuando Cubango Cuangar
3 AGO Angola Cabinda Buco Zau
4 AGO Angola Lunda Norte Cuango

Could you possibly guess what ADM2 might mean?

The code below access a sheet named Type_of_Mines

types_mines = pd.read_excel('data/Mines_Africa_Districts.xls' , 
                    sheet_name='Type_of_Mines')
types_mines.tail(6)
mines aluminum_mine asbestos_mine bariumbarite_mine beryllium_mine
5829 0 3 12.0 3 3
5830 0 4 2.0 5 2
5831 5 5 1.5 2 3
5832 1 6 2.0 3 1
5833 22 7 1.0 8 2
5834 0 0 0.0 1 2

The pandas.read_clipboard method#

The pandas.read_clipboard method is a convenient tool that reads the contents of your clipboard directly into a pandas DataFrame. This method can be quite useful when you want to quickly grab tables or data that you’ve copied from websites, spreadsheets, or other sources without having to save them as files first.

# lets practice some.  
# Take a look at this website https://www.timeanddate.com/


import pandas as pd
#pd.read_clipboard()

There are several other pandas reading methods. Depending on the files you are dealing with, you might have to choose the convenient method. Other popular methods include pd.read_json (for json files) , pd.read_html (for html files) , pd.read_

The principle stays the same, which is reading from external sources and casting it into a pandas dataframe for facilitating the analysis.

Shape and datatype#

ebola_wa = pd.read_csv('data/ebola_WA.csv')

To get the first n rows and the last n rows

ebola_wa.head(4)
Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone Cases_Nigeria
0 1/5/2015 289 2776.0 NaN 10030.0 NaN
1 1/4/2015 288 2775.0 NaN 9780.0 NaN
2 1/3/2015 287 2769.0 8166.0 9722.0 NaN
3 1/2/2015 286 NaN 8157.0 NaN NaN
ebola_wa.tail(5)
Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone Cases_Nigeria
117 3/27/2014 5 103.0 8.0 6.0 NaN
118 3/26/2014 4 86.0 NaN NaN NaN
119 3/25/2014 3 86.0 NaN NaN NaN
120 3/24/2014 2 86.0 NaN NaN NaN
121 3/22/2014 0 49.0 NaN NaN NaN

A random sample of 3 rows

ebola_wa.sample(3)
Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone Cases_Nigeria
62 8/16/2014 147 543.0 834.0 848.0 15.0
40 10/11/2014 203 NaN 4249.0 NaN NaN
67 8/4/2014 135 495.0 516.0 691.0 9.0

To check the shape of the dataframe (rows and columns)

ebola_wa.shape
(122, 6)

To check the datatype of each of the columns

ebola_wa.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122 entries, 0 to 121
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Date               122 non-null    object 
 1   Day                122 non-null    int64  
 2   Cases_Guinea       93 non-null     float64
 3   Cases_Liberia      83 non-null     float64
 4   Cases_SierraLeone  87 non-null     float64
 5   Cases_Nigeria      38 non-null     float64
dtypes: float64(4), int64(1), object(1)
memory usage: 5.8+ KB
  • We can notice that the dataset contains 122 entries with their index ranging from 0 to 121

  • The columns Cases_Guinea, Cases_Liberia, Cases_SierraLeone and Cases_Nigeria are float64 type. Which means they contain floating points.

  • The *Day column is of int64 type

  • Some of the columns contains missing values: 122-93 = 29 for the Cases_Guinea 122-83 = 39 for the Cases_Liberia 122-87 = 35 for the Cases_SierraLeone 122-38 = 84 for the Cases_Nigeria

To check the list of columns

ebola_wa.columns
Index(['Date', 'Day', 'Cases_Guinea', 'Cases_Liberia', 'Cases_SierraLeone',
       'Cases_Nigeria'],
      dtype='object')

Let’s consider another dataset

world_gapminder = pd.read_csv('data/gapminder.tsv' , sep='\t')
world_gapminder.head(5)
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106

To check the unique values inside a column

world_gapminder['continent'].unique()
array(['Asia', 'Europe', 'Africa', 'Americas', 'Oceania'], dtype=object)
world_gapminder['continent'].value_counts()
continent
Africa      624
Asia        396
Europe      360
Americas    300
Oceania      24
Name: count, dtype: int64

To get a subset of the data, let’s query the dataframe and get some insights

Task 4: #

  1. Sitting in a bustling cafe in Luanda, Maria, a public health student, turned to her colleague, João, and posed a question, “João, do you recall the years when Angola’s life expectancy surpassed 40? And what do you think such a figure signifies about our nation’s health and development?”

  2. Under the shade of a large acacia tree, two economics professors, Dr. Kofi and Dr. Amina, sat enjoying their afternoon tea. Dr. Kofi, with a nostalgic glint in his eyes, posed a question, “Amina, think back to the 80s. Which African nation do you reckon had the highest GDP?

    • Could you guess before computing?

    • What could be the explanation for that?

Task 5: #

Look at the gapminder dataset and answer the following questions:

  1. Southern European GDP in the 2000s:

At a global financial seminar in Johannesburg, attendees were embarked in discussions about economic shifts. Sofia, a delegate from Greece, turned to Kwame, a noted economic analyst from Ghana. “Kwame,” she began, “in the 2000s, do you know which Southern European countries stood out in terms of GDP? both at the top and bottom?

  1. African Life Expectancy in the 1970s:

In a health and development conference in Addis Ababa, there was a buzz about historical health metrics. Fatima, a public health student from Algeria, approached Zola, an esteemed demographic analyst from Tanzania. “Zola, in the 1970s, can you pinpoint the African countries that led and trailed in life expectancy? What factors do you think contributed to such outcomes?”

  1. Asian vs. European GDP in the 90s:

During a multinational economic colloquium in Dakar, conversations were abuzz about past economic giants. Ravi, a researcher from India, curiously asked Chijioke, a skilled economic analyst from Cameroon, “Chijioke, reflecting on the 90s, how many Asian countries do you recall, exceeded the European average GDP?”

SE_countries = ['Albania', 'Andorra', 'Bosnia and Herzegovina', 'Croatia', 'Gibraltar', 
                   'Greece', 'Italy', 'Kosovo', 'Malta', 'Montenegro', 'North Macedonia', 
                   'Portugal', 'San Marino', 'Serbia', 'Slovenia', 'Spain', 'Vatican City']