Chapter 2: Data Reading Methods#
Before diving deep into data manipulation and analysis, mastering the art of data ingestion using pandas is pivotal. This chapter will guide you through the essentials of reading data from various sources, ensuring a robust foundation for your data exploration journey.
Learning Objectives:
Introducing different files reading methods
Developing the ability to query dataframes to answer practical questions
Updating dataframes based on useful insights
Reading Methods#
Python has become a beloved language for text and file munging due to its simple syntax for interacting with files, intuitive data structures, and convenient features. Pandas features a number of functions for reading tabular data as a DataFrame object. Let’s import it
import pandas as pd
A plethore of reading method is available in the pandas library. You can have access to it by typing the tab
key after typing pd.
:
Q: #
How many of them could you see and at first glance, what could they possibly mean to you?
For sake of the context , we will just focus on the 3 main reading methods
\(\Large pd.read\_csv\)
\(\Large pd.read\_excel\)
\(\Large pd.read\_clipboard\)
The pandas.read_csv
method#
Reading a csv file requires passing as argument the path to the actual file. For simplicity, the file should be saved in the same folder as the notebook you are running but you could also specify the path to it. See below.
Here is the pattern to follow:
dataframe_name = pd.read_csv( "path/to/the/file.csv" )
Task 2: #
It was a somber afternoon, and Dr. Amara stood at the front of the conference room, a projector screen displaying the infamous date: 2014. Everyone in the room knew what that year meant, especially for West Africa.
“2014,” Dr. Amara began, “the year the Ebola crisis took a grip on some of our neighboring countries. But before we delve into the specifics, can anyone highlight the distinction between an ‘outbreak’ and a ‘pandemic’?”
She paused for a moment, letting the room reflect on the question. Then, pointing to the screen which now showcased a map of Africa, she continued, “Ebola wasn’t new to science, but its rapid spread in 2014 took many by surprise. Does anyone know its origins and why it spread with such ferocity at that particular time?”
After a brief discussion, she steered the group towards the day’s task. “On your workstations, you’ll find a folder labeled
data
. Within it, there’s a file calledebola_WA.csv
. It captures the historical trajectory of Ebola cases in Nigeria, Guinea, and Sierra Leone. Take a moment, ensure the file’s there and load it using the standard protocols we’ve practiced.”Minutes later, as rows and columns filled various screens, Dr. Amara asked, “From the data in front of you, what anomalies or patterns stand out? Let’s discuss.”
Task 3: #
Dr. Amara took a momentary pause, then shifted her focus. “Often,” she began, with an understanding tone, “as data analysts, we aren’t always involved in the actual data collection phase. This detachment sometimes means we encounter datasets with unfamiliar structures or entries saved differently than what we’re used to.”
Pulling up a new slide, she continued, “I’ve been handed another set of historical data - this time from our colleagues who worked in parts of Central Africa. I’d like you all to delve into this. Access it via
ebola_CA.csv
and share your initial observations.”As attendees began accessing the new set of data, she added, **”Once you’ve skimmed through, use the same method we practiced earlier to load this data into a dataframe ande tell us what you observe.”
You might have guessed it. CSV stands for comma separated values. Other files similar file extensions are also used. See below:
","
for the csv file"\t"
for a tsv fileor any other character that separates the different variables for an observation
As example of the .tsv
file, see the gapminder dataset below. Let’s read this dataset where variables are separated by a tabulation key.
world_gapminder = pd.read_csv('data/gapminder.tsv' , sep = "\t")
world_gapminder.head(5)
country | continent | year | lifeExp | pop | gdpPercap | |
---|---|---|---|---|---|---|
0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |
1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.853030 |
2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.100710 |
3 | Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.197138 |
4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 |
The pandas.read_excel
method#
Excel files are also quite common files for saving data. The XLS file extension is used for files saved as Microsoft Excel worksheets. Reading an excel file requires the same process as reading a csv file. The only difference lies in specifying the sheet you want to get access to.
The dataset below comes from the mining industry. It contains mining locations in some African countries both at the region and at the district level. See here a link to an interesting website where you can download the shapefiles of such regions. https://gadm.org/formats.html
The code below accesses an excel sheet called Mines_Africa_Districts
and the sheet named Africa_Districts
mines_sites = pd.read_excel('data/Mines_Africa_Districts.xls' ,
sheet_name='Africa_Districts')
mines_sites.head(5)
countrycode | countryname | ADM1 | ADM2 | |
---|---|---|---|---|
0 | AGO | Angola | BiÚ | Nharea |
1 | AGO | Angola | HuÝla | Humpata |
2 | AGO | Angola | Cuando Cubango | Cuangar |
3 | AGO | Angola | Cabinda | Buco Zau |
4 | AGO | Angola | Lunda Norte | Cuango |
Could you possibly guess what ADM2 might mean?
The code below access a sheet named Type_of_Mines
types_mines = pd.read_excel('data/Mines_Africa_Districts.xls' ,
sheet_name='Type_of_Mines')
types_mines.tail(6)
mines | aluminum_mine | asbestos_mine | bariumbarite_mine | beryllium_mine | |
---|---|---|---|---|---|
5829 | 0 | 3 | 12.0 | 3 | 3 |
5830 | 0 | 4 | 2.0 | 5 | 2 |
5831 | 5 | 5 | 1.5 | 2 | 3 |
5832 | 1 | 6 | 2.0 | 3 | 1 |
5833 | 22 | 7 | 1.0 | 8 | 2 |
5834 | 0 | 0 | 0.0 | 1 | 2 |
The pandas.read_clipboard
method#
The pandas.read_clipboard
method is a convenient tool that reads the contents of your clipboard directly into a pandas DataFrame. This method can be quite useful when you want to quickly grab tables or data that you’ve copied from websites, spreadsheets, or other sources without having to save them as files first.
# lets practice some.
# Take a look at this website https://www.timeanddate.com/
import pandas as pd
#pd.read_clipboard()
There are several other pandas reading methods. Depending on the files you are dealing with, you might have to choose the convenient method. Other popular methods include pd.read_json
(for json
files) , pd.read_html
(for html
files) , pd.read_
The principle stays the same, which is reading from external sources and casting it into a pandas dataframe for facilitating the analysis.
Shape and datatype#
ebola_wa = pd.read_csv('data/ebola_WA.csv')
To get the first n rows and the last n rows
ebola_wa.head(4)
Date | Day | Cases_Guinea | Cases_Liberia | Cases_SierraLeone | Cases_Nigeria | |
---|---|---|---|---|---|---|
0 | 1/5/2015 | 289 | 2776.0 | NaN | 10030.0 | NaN |
1 | 1/4/2015 | 288 | 2775.0 | NaN | 9780.0 | NaN |
2 | 1/3/2015 | 287 | 2769.0 | 8166.0 | 9722.0 | NaN |
3 | 1/2/2015 | 286 | NaN | 8157.0 | NaN | NaN |
ebola_wa.tail(5)
Date | Day | Cases_Guinea | Cases_Liberia | Cases_SierraLeone | Cases_Nigeria | |
---|---|---|---|---|---|---|
117 | 3/27/2014 | 5 | 103.0 | 8.0 | 6.0 | NaN |
118 | 3/26/2014 | 4 | 86.0 | NaN | NaN | NaN |
119 | 3/25/2014 | 3 | 86.0 | NaN | NaN | NaN |
120 | 3/24/2014 | 2 | 86.0 | NaN | NaN | NaN |
121 | 3/22/2014 | 0 | 49.0 | NaN | NaN | NaN |
A random sample of 3 rows
ebola_wa.sample(3)
Date | Day | Cases_Guinea | Cases_Liberia | Cases_SierraLeone | Cases_Nigeria | |
---|---|---|---|---|---|---|
62 | 8/16/2014 | 147 | 543.0 | 834.0 | 848.0 | 15.0 |
40 | 10/11/2014 | 203 | NaN | 4249.0 | NaN | NaN |
67 | 8/4/2014 | 135 | 495.0 | 516.0 | 691.0 | 9.0 |
To check the shape of the dataframe (rows and columns)
ebola_wa.shape
(122, 6)
To check the datatype of each of the columns
ebola_wa.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122 entries, 0 to 121
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 122 non-null object
1 Day 122 non-null int64
2 Cases_Guinea 93 non-null float64
3 Cases_Liberia 83 non-null float64
4 Cases_SierraLeone 87 non-null float64
5 Cases_Nigeria 38 non-null float64
dtypes: float64(4), int64(1), object(1)
memory usage: 5.8+ KB
We can notice that the dataset contains 122 entries with their index ranging from 0 to 121
The columns Cases_Guinea, Cases_Liberia, Cases_SierraLeone and Cases_Nigeria are float64 type. Which means they contain floating points.
The *Day column is of int64 type
Some of the columns contains missing values: 122-93 = 29 for the Cases_Guinea 122-83 = 39 for the Cases_Liberia 122-87 = 35 for the Cases_SierraLeone 122-38 = 84 for the Cases_Nigeria
To check the list of columns
ebola_wa.columns
Index(['Date', 'Day', 'Cases_Guinea', 'Cases_Liberia', 'Cases_SierraLeone',
'Cases_Nigeria'],
dtype='object')
Let’s consider another dataset
world_gapminder = pd.read_csv('data/gapminder.tsv' , sep='\t')
world_gapminder.head(5)
country | continent | year | lifeExp | pop | gdpPercap | |
---|---|---|---|---|---|---|
0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |
1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.853030 |
2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.100710 |
3 | Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.197138 |
4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 |
To check the unique values inside a column
world_gapminder['continent'].unique()
array(['Asia', 'Europe', 'Africa', 'Americas', 'Oceania'], dtype=object)
world_gapminder['continent'].value_counts()
continent
Africa 624
Asia 396
Europe 360
Americas 300
Oceania 24
Name: count, dtype: int64
To get a subset of the data, let’s query the dataframe and get some insights
Task 4: #
Sitting in a bustling cafe in Luanda, Maria, a public health student, turned to her colleague, João, and posed a question, “João, do you recall the years when Angola’s life expectancy surpassed 40? And what do you think such a figure signifies about our nation’s health and development?”
Under the shade of a large acacia tree, two economics professors, Dr. Kofi and Dr. Amina, sat enjoying their afternoon tea. Dr. Kofi, with a nostalgic glint in his eyes, posed a question, “Amina, think back to the 80s. Which African nation do you reckon had the highest GDP?
Could you guess before computing?
What could be the explanation for that?
Task 5: #
Look at the gapminder dataset and answer the following questions:
Southern European GDP in the 2000s:
At a global financial seminar in Johannesburg, attendees were embarked in discussions about economic shifts. Sofia, a delegate from Greece, turned to Kwame, a noted economic analyst from Ghana. “Kwame,” she began, “in the 2000s, do you know which Southern European countries stood out in terms of GDP? both at the top and bottom?
African Life Expectancy in the 1970s:
In a health and development conference in Addis Ababa, there was a buzz about historical health metrics. Fatima, a public health student from Algeria, approached Zola, an esteemed demographic analyst from Tanzania. “Zola, in the 1970s, can you pinpoint the African countries that led and trailed in life expectancy? What factors do you think contributed to such outcomes?”
Asian vs. European GDP in the 90s:
During a multinational economic colloquium in Dakar, conversations were abuzz about past economic giants. Ravi, a researcher from India, curiously asked Chijioke, a skilled economic analyst from Cameroon, “Chijioke, reflecting on the 90s, how many Asian countries do you recall, exceeded the European average GDP?”
SE_countries = ['Albania', 'Andorra', 'Bosnia and Herzegovina', 'Croatia', 'Gibraltar',
'Greece', 'Italy', 'Kosovo', 'Malta', 'Montenegro', 'North Macedonia',
'Portugal', 'San Marino', 'Serbia', 'Slovenia', 'Spain', 'Vatican City']