Chapter 3: Introducing Features and Observations#
Learning Objectives:
Understand the importance of structured data and the key principles of tidy data.
Differentiate between variables and observations in a dataset.
Learn to identify when and why to reshape datasets.
Master the usage of the melt function in pandas to transform data from a wide format to a long format.
Analyze real-world data to detect and rectify structural anomalies.
Gain hands-on experience in preparing data for further statistical analysis or visualization by ensuring it adheres to the tidy data principles.
Introduction:#
It is often said that 80% of data analysis is spent on cleaning and preparing data. This is not just a first step: it must be repeated many times over the course of the analysis as new problems come to light or new data is collected. To get a handle on the problem, this part focuses on a small but important aspect of data cleaning that we call data tidying: structuring datasets to facilitate analysis. It also formally introduces the concept of features and observations.
Task 6: #
As the global demand for sustainable and efficient agricultural practices intensifies, the transformative power of Artificial Intelligence (AI) in farming becomes increasingly evident. Recognizing this potential, the Zambia Farmers’ Federation has partnered with the University of Lusaka’s Department of Agriculture. The focus of this partnership is to explore innovative solutions that could elevate Zambia’s agricultural output.
You have been chosen as the lead data analyst for this project due to your exceptional expertise in the field. The project’s immediate objective is to test the efficacy of two novel fertilizers, with the aim of boosting crop yields. Your task is to scrutinize the given data, apply your analytical skills, and derive meaningful insights that will guide the next stages of this venture.
You’ve just received a detailed report from the leading Agri-expert on the team. Here’s the content of their message:
=====================================#
Greetings!
In agricultural research, when we refer to the use of fertilizers on crops, we often term it as a “treatment”. Presently, I’ve conducted tests using two distinctive treatments: Axida (Treatment A) and Bross (Treatment B) on three select crops: mango, avocado, and pineapple. The formulation for Axida is primarily based on nitrogen-enriched organic compounds, while Bross has a base of potassium-rich minerals with micro-nutrient additives. One of the intriguing metrics we measure is the gas emission from the crops post-application, which can be a direct indicator of the plant’s response to these treatments.
Here are the specifics:
For Axida (Treatment A):
Mango: 4.5 units of gas emission
Avocado: 2.1 units of gas emission
Pineapple: 1.9 units of gas emission
For Bross (Treatment B):
Mango: 5.1 units of gas emission
Avocado: 1.3 units of gas emission
Pineapple: 5.3 units of gas emission
I eagerly await your expert analysis on this data. Let’s make a significant impact together!
=====================================#
Plants have always exhibited unique ways to interact with their environment and among themselves. Can you think of methods plants might use to “communicate”? What could be some scientific explanations for these phenomena?
Translate the information from the email that was sent to you by the Agri-expert into a form that can be used for analysis.
Two analysts, Anna and Jonas, have translated that email into the sheets below. What do you observe?
Note that this type of data might be good for presentation but it is not tidy for analysis.
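The two sheets are not reproduced here, so the layouts below are an assumption about what two analysts might plausibly produce from the same email: one table with a row per treatment, and one with a row per crop. The variable names (`by_treatment`, `by_crop`) are illustrative only; the numbers come from the Agri-expert's message.

```python
import pandas as pd

# Layout 1 (assumed): one row per treatment, one column per crop.
by_treatment = pd.DataFrame(
    {
        "mango": [4.5, 5.1],
        "avocado": [2.1, 1.3],
        "pineapple": [1.9, 5.3],
    },
    index=pd.Index(["A", "B"], name="treatment"),
)

# Layout 2 (assumed): one row per crop, one column per treatment.
# It holds exactly the same values, only transposed.
by_crop = by_treatment.T.rename_axis(index="crop", columns="treatment")

print(by_treatment)
print(by_crop)
```

Both tables hold the same six measurements, yet in each of them one variable (crop or treatment) lives in the column headers rather than in a column of its own.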
Tidying the data#
The idea here is to provide a standard way to organize the data values within a dataset, formalizing the roles of rows and columns so that the analyst can spend more time on the interesting domain problem and less on the uninteresting logistics of the data. In tidy data:
Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.
Some common data problems
Column headers are values, not variable names.
Multiple variables are stored in one column.
Variables are stored in both rows and columns.
Multiple types of observational units are stored in the same table.
A single observational unit is stored in multiple tables.
Although the structure of the data above could be repaired manually, pandas provides a function called melt that does this work for us.
It makes use of three main parameters: id_vars, var_name and value_name.
id_vars: the column(s) to use as identifier variables; they are kept as they are.
var_name: the name to give the new column that holds the former column headers (the variable that runs across the header from left to right).
value_name: the name to give the new column that holds the values taken from those columns.
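As a minimal sketch of the three parameters in action, the fertilizer measurements from the email can be melted from a wide table into a tidy one. The wide layout and the column names `treatment`, `crop` and `gas_emission` are assumptions made for this illustration.

```python
import pandas as pd

# Wide layout (assumed): the crop names are column headers, i.e. values
# of a "crop" variable rather than variable names.
wide = pd.DataFrame(
    {
        "treatment": ["A", "B"],
        "mango": [4.5, 5.1],
        "avocado": [2.1, 1.3],
        "pineapple": [1.9, 5.3],
    }
)

# Keep "treatment" as the identifier, move the crop headers into a
# "crop" column and the measurements into a "gas_emission" column.
tidy = wide.melt(
    id_vars="treatment",
    var_name="crop",
    value_name="gas_emission",
)
print(tidy)

# Once the data is tidy, a summary is a one-liner.
mean_by_treatment = tidy.groupby("treatment")["gas_emission"].mean()
print(mean_by_treatment)
```

Each of the six rows of `tidy` is now one observation (a treatment applied to a crop), and each column is one variable.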
Task: What insight can you extract from the data?
Task 7: #
In 2014, the Mayor of Arua City, in Uganda, approved a bill to curb the spread of religious institutions in the city. In his press release, it was stated that starting February 2014, religious authorities would be required to submit proof of qualification (a Degree in Theology) to the City Council upon approval to teach the Sacred Texts.
That strict measure was taken due to social media reports that some religious authorities claimed to possess supernatural powers bestowed by a divinity and could perform miraculous acts that could positively change people’s lives. Consequently, many of them enriched themselves while impoverishing people of modest means. To assess the situation, the Mayor ordered his IT Services to hire surveyors, who spent three months collecting data on the salary ranges of these religious authorities, even though some had fled the country. From each religious authority that was sampled, the team collected their religion and their salary range. The data was shared with you via email in a CSV file called arua_religious_2014.csv.
What is your subjective view of religions in Africa? Do we need them? Why?
Load it using pandas and tell us what you observe.
If you observe any anomaly, how could you fix that?
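One possible fix, sketched under a strong assumption: that the file has one row per religion and one column per salary range, so that the column headers are values rather than variable names. The actual columns of arua_religious_2014.csv are not shown in this chapter; the helper `tidy_salary_counts`, the column names and the placeholder numbers below are all hypothetical.

```python
import pandas as pd


def tidy_salary_counts(wide: pd.DataFrame, id_col: str = "religion") -> pd.DataFrame:
    """Melt salary-range columns into (religion, salary_range, count) rows."""
    return wide.melt(id_vars=id_col, var_name="salary_range", value_name="count")


# Placeholder frame standing in for pd.read_csv("arua_religious_2014.csv").
# Religion labels, salary ranges and counts are invented for illustration.
example = pd.DataFrame(
    {
        "religion": ["Religion X", "Religion Y"],
        "low": [3, 5],
        "high": [7, 2],
    }
)

tidy = tidy_salary_counts(example)
print(tidy)
```

With the real file, `example` would be replaced by the loaded DataFrame and `id_col` by whatever the religion column is actually called.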
Task 8: #
With 60 million active users, Boomplay is the most popular music streaming service in Africa. The Chinese-owned, Africa-focused company is available throughout the continent and runs a freemium model. They are planning to open new offices in the County of Zwedru in Liberia. You were lucky enough to secure a fully funded internship with them. On your first day in the office, the Regional Manager stated that they are working on remixing the classics from the Billboard charts and distributing them on their platform. The Billboard charts tabulate the relative weekly popularity of songs and albums in the United States and elsewhere. For the first phase, they chose the classics from the beginning of the millennium: the big year 2000. The data was scraped from the Billboard database and given to you in a CSV file called billboard_2000.csv.
How do you think music streaming platforms make money if you can listen to music there for free? And how do artists benefit from it?
Load the data in pandas and tell us what you observe.
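A common shape for chart data is one row per song with one column per week on the chart. The sketch below assumes that shape and assumes the week columns share a prefix such as `wk`; neither is confirmed for billboard_2000.csv, and the helper `melt_weeks` and the placeholder rows are hypothetical.

```python
import pandas as pd


def melt_weeks(charts: pd.DataFrame, week_prefix: str = "wk") -> pd.DataFrame:
    """Turn wk1, wk2, ... columns into (week, rank) rows, one per charted week."""
    week_cols = [c for c in charts.columns if c.startswith(week_prefix)]
    id_cols = [c for c in charts.columns if c not in week_cols]
    long = (
        charts.melt(
            id_vars=id_cols,
            value_vars=week_cols,
            var_name="week",
            value_name="rank",
        )
        # A missing rank means the song was not on the chart that week.
        .dropna(subset=["rank"])
        # "wk3" -> 3, so weeks sort numerically.
        .assign(week=lambda d: d["week"].str[len(week_prefix):].astype(int))
    )
    return long.sort_values(id_cols + ["week"]).reset_index(drop=True)


# Placeholder rows standing in for pd.read_csv("billboard_2000.csv").
nan = float("nan")
example = pd.DataFrame(
    {
        "artist": ["Artist A", "Artist B"],
        "track": ["Song A", "Song B"],
        "wk1": [10.0, 50.0],
        "wk2": [8.0, nan],
        "wk3": [nan, nan],
    }
)

long = melt_weeks(example)
print(long)
```

Dropping the missing ranks is a design choice: it keeps only the weeks in which a song actually charted, which is usually what later analysis needs.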
## Multiple variables are stored in one column.
Task 9: #
Bindura is a small town in the Mashonaland Central province of Zimbabwe, located north-east of Harare. At Howard Hospital (HH), a small medical facility, the incidence of tuberculosis (TB) increased by 35% in 2008 compared to the baseline rates observed from 2003–2007.
Under the Makeba Funding initiative, which promotes data-sharing among African medical institutions, a team of research scientists from Hôpital General de Befelatanana in Antananarivo has developed a drug to treat patients with severe TB symptoms, including fatigue, chest pain, fever, and cough. As a data analyst, you have been selected to join the team traveling to Bindura to study the drug’s side effects on patients.
At Howard Hospital, the drug has been administered to 40 patients, both men and women, aged between 19 and 46. The team has monitored the patients’ fatigue levels for 100 days and recorded the results in an Excel spreadsheet. The data includes fatigue levels ranging from 0 to 10, where 0 indicates no signs of fatigue and 10 indicates extreme fatigue.
The data file, bindura_tb_patients.csv, contains the relevant information, and you are assigned to work with it.
Do you know how tuberculosis spreads from person to person?
Load the data file and tell us what you observe.
Use the melt function to fix the inconsistencies within the data.
What insights can you extract from the data?
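When a column header packs several variables together, melting is only the first step; the combined label then has to be split. The sketch below assumes one row per day and one column per patient, with headers of the hypothetical form `M_23` (sex, underscore, age). The real layout of bindura_tb_patients.csv is not shown here, so the pattern, the helper `tidy_fatigue` and the placeholder values are assumptions.

```python
import pandas as pd


def tidy_fatigue(wide: pd.DataFrame, id_col: str = "day") -> pd.DataFrame:
    """Melt patient columns, then split the 'M_23'-style label into sex and age."""
    long = wide.melt(id_vars=id_col, var_name="patient", value_name="fatigue")
    # Named groups become the column names of the extracted frame.
    parts = long["patient"].str.extract(r"^(?P<sex>[MF])_(?P<age>\d+)$")
    long["sex"] = parts["sex"]
    # astype(int) raises if a header does not match the assumed pattern.
    long["age"] = parts["age"].astype(int)
    return long.drop(columns="patient")


# Placeholder frame standing in for pd.read_csv("bindura_tb_patients.csv").
example = pd.DataFrame(
    {
        "day": [1, 2],
        "M_23": [4, 3],
        "F_31": [6, 5],
    }
)

tidy = tidy_fatigue(example)
print(tidy)
```

If the real headers follow a different pattern, only the regular expression needs to change; the melt-then-split structure stays the same.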
Task 10: #
The Covid-19 pandemic has ravaged the globe, claiming countless lives. As part of the Russia-East Africa Partnership (REAP), the Russian Ministry of Health has forged an agreement with government agencies in East Africa to initiate vaccination campaigns. The Sekou Toure Foundation has been enlisted to conduct a comprehensive survey across East Africa, collecting data on the prevalence of Covid-19 in terms of active cases and fatalities. Due to stringent protective measures implemented by the foundation’s personnel, the survey campaign was carried out only from October 2021 to January 2022. The data file has now reached the Data Science Department of Université polytechnique de Kougouleu in Libreville. They have reached out to you, seeking your expertise in interpreting the data. The data file is named covid_19_eastafr.csv.
Do you know who Sekou Toure was, and what he did for the continent?
Load the data file and tell us what you observe.
Use the melt function to fix the inconsistencies within the data.
What insights can you extract from the data?
east_africa_countries = [
    'Burundi', 'Comoros', 'Djibouti', 'Eritrea',
    'Ethiopia', 'Kenya',
    'Rwanda', 'Seychelles', 'Somalia', 'South Sudan',
    'Tanzania', 'Uganda', 'North Sudan',
]
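The country list can be used to keep only East African rows before melting. The sketch below assumes the file has a `country` column followed by one column per survey month; that layout, the helper `tidy_covid`, the month labels and the placeholder counts are hypothetical, since the contents of covid_19_eastafr.csv are not shown.

```python
import pandas as pd

east_africa_countries = [
    'Burundi', 'Comoros', 'Djibouti', 'Eritrea',
    'Ethiopia', 'Kenya',
    'Rwanda', 'Seychelles', 'Somalia', 'South Sudan',
    'Tanzania', 'Uganda', 'North Sudan',
]


def tidy_covid(wide: pd.DataFrame, countries: list, id_col: str = "country") -> pd.DataFrame:
    """Keep the listed countries, then melt month columns into (month, cases) rows."""
    subset = wide[wide[id_col].isin(countries)]
    return subset.melt(id_vars=id_col, var_name="month", value_name="cases")


# Placeholder frame standing in for pd.read_csv("covid_19_eastafr.csv").
# Counts are invented; "Gabon" is included only to show the filter at work.
example = pd.DataFrame(
    {
        "country": ["Kenya", "Uganda", "Gabon"],
        "Oct 2021": [10, 20, 30],
        "Nov 2021": [11, 21, 31],
    }
)

tidy = tidy_covid(example, east_africa_countries)
print(tidy)
```

If the real file stores active cases and fatalities side by side, the indicator would need its own identifier column passed through `id_vars` as well.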