Chapter 5: Grouping the Data#
Introduction#
In data analysis tasks, insights are often unveiled only once the data are aggregated or grouped. Such operations play a pivotal role in real-world applications by enabling powerful data analysis and decision-making. The ability to group data based on specific criteria and to perform aggregations and transformations opens up a multitude of possibilities. For instance,
In finance, groupby operations could be used by the Nigerian Exchange Group (formerly known as the Nigerian Stock Exchange) to analyze market trends and identify patterns in stock or bond prices.
In marketing, they could help Tanzlite Agency (in Dar es Salaam, Tanzania) understand their customer segments and behavior for targeted campaigns.
In healthcare, groupby operations could aid Data Analyst interns at the Regional Hospital of Limbe, Cameroon, in examining patient data to identify disease patterns and provide insights for medical doctors to devise personalized treatments.
In operations management, these operations could be used by Bollore Logistics to enable efficient resource allocation and optimization.
The pandas module offers a variety of operations under the groupby API. These groupby operations provide valuable insights and facilitate data-driven decision-making across various domains, making them indispensable in real-world applications. In this section, we will present some of them, including splitting, applying and combining.
Splitting#
Let's take a look at the Nobel Prize winners dataset, stored in the nobel_winners.csv file, to understand some of the groupby operations; we will come back to it in the tasks at the end of this section.
The pandas DataFrame.groupby() function is used to split the data into groups based on some criteria. In the abstract, grouping means providing a mapping from row labels to group names.
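As a minimal sketch of this idea (using a small, made-up DataFrame rather than the datasets in this chapter), groupby() can take a column name, but it also accepts an explicit mapping from row labels to group names:

import pandas as pd

# toy example: the dictionary maps each row label to a group name
df = pd.DataFrame({'value': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])
label_to_group = {'a': 'g1', 'b': 'g1', 'c': 'g2', 'd': 'g2'}
df.groupby(label_to_group).sum()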
Task 15: #
One of the key moments of the Russia-Africa Summit held in St Petersburg last July was the speech delivered by the 35-year-old president of Burkina Faso, Captain Ibrahim Traoré. More specifically, he said: “My generation does not understand this: how can Africa, which has so much wealth, become the poorest continent in the world today? And why do African leaders travel the world to beg?” That statement ignited a sense of patriotism and unity among African leaders. One of the resolutions made immediately after the Summit was the exchange of skills between Russian and African farmers. Consequently, a call for applications was initiated throughout Africa, allowing farmers to apply and compete for the opportunity to win the Nancy Abeid Prize (in honor of Nancy Abeid Arahamane, who established Africa’s first camel milk dairy in 1989).
The Prize essentially acknowledges farmers based on the height and width of their crops for four different types of grain: coffee, beans, maize, and wheat. The winning farmers will receive funding to expand their businesses and have their products promoted throughout the continent, fostering potential partnerships with industries in the future. The results are enclosed in a csv file called grain_award.csv.
import pandas as pd
import numpy as np
grain_prize = pd.read_csv('data/grain_award.csv')
grain_prize
Name | birth_country | gender | grainType | cropHeight | cropWidth | prize | |
---|---|---|---|---|---|---|---|
0 | Siyabonga Luthuli | South Africa | Male | wheat | 2.17 | 3.07 | 440000 |
1 | Lucien Nirina | Madagadscar | Male | coffee | 2.05 | 3.03 | 420000 |
2 | Pauline Dossevi | Togo | Female | beans | 2.16 | 3.21 | 290000 |
3 | Garba Kone | Burkina Faso | Male | maize | 2.18 | 3.03 | 440000 |
4 | Adeyoyin Samuel | Nigeria | Male | wheat | 1.95 | 3.12 | 220000 |
5 | Geoffrey Birina | Burundi | Male | wheat | 2.07 | 3.05 | 250000 |
6 | Sylvie Kurwa | Rwanda | Female | coffee | 2.08 | 3.18 | 230000 |
7 | Joao Pedro | Angola | Male | beans | 2.11 | 3.14 | 150000 |
8 | Maria Siphiwe | Botswana | Female | maize | 2.06 | 3.10 | 370000 |
9 | Lejeune Bahati | Congo | Male | wheat | 2.16 | 3.11 | 430000 |
10 | Touadoum Nobre | Chad | Male | coffee | 2.00 | 3.11 | 370000 |
11 | Ebale Gervais | Cameroon | Male | maize | 2.01 | 3.14 | 390000 |
12 | Fatema Nourdin | Morocco | Female | coffee | 2.04 | 3.08 | 120000 |
13 | Aboubakar Houn | Mauritania | Male | beans | 1.88 | 3.10 | 270000 |
14 | Salif Cisse | Mali | Male | wheat | 1.95 | 3.02 | 190000 |
15 | Aicha Boumediene | Algeria | Female | maize | 1.87 | 3.09 | 360000 |
Each row in this dataset represents a prize winner, along with the country of birth, the gender, the grain type, the crop height and width, and the prize amount. At this stage, we can call the pandas DataFrame.groupby() function to split the data into groups based on predefined criteria, usually a categorical column. Let's start by grouping the data by gender. We do this by creating a group_gen object.
group_gen = grain_prize.groupby('gender')
group_gen
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fde08d99a60>
It is important to note that creating a GroupBy object only checks whether we have passed a valid mapping; it does not actually perform any of the operations. Conceptually, it re-organizes the dataframe according to the chosen criterion, but nothing is computed until an operation is applied to the groups. We can check the content of the chosen criterion to see the different values by which the dataframe will be grouped:
grain_prize['gender'].unique()
array(['Male', 'Female'], dtype=object)
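Before applying any operation, we can also inspect the grouping directly on the GroupBy object, for instance to see how many rows fall into each group and which row indices belong to each group:

group_gen.size()    # number of rows per group (Female / Male)
group_gen.groups    # mapping from each group label to its row indices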
Using the .first() (or .last()) method, we can check the first (or last) entry of each group in the re-organized dataframe.
group_gen.first()
Name | birth_country | grainType | cropHeight | cropWidth | prize | |
---|---|---|---|---|---|---|
gender | ||||||
Female | Pauline Dossevi | Togo | beans | 2.16 | 3.21 | 290000 |
Male | Siyabonga Luthuli | South Africa | wheat | 2.17 | 3.07 | 440000 |
It is important to note that the grouping can also be done using more than one column, simply by passing a list of columns as the argument. To illustrate this, let's group our data first by gender and then, inside each of the created groups, apply an additional grouping based on the grainType:
group_gen_grType = grain_prize.groupby(['gender' , 'grainType'])
group_gen_grType.last()
Name | birth_country | cropHeight | cropWidth | prize | ||
---|---|---|---|---|---|---|
gender | grainType | |||||
Female | beans | Pauline Dossevi | Togo | 2.16 | 3.21 | 290000 |
coffee | Fatema Nourdin | Morocco | 2.04 | 3.08 | 120000 | |
maize | Aicha Boumediene | Algeria | 1.87 | 3.09 | 360000 | |
Male | beans | Aboubakar Houn | Mauritania | 1.88 | 3.10 | 270000 |
coffee | Touadoum Nobre | Chad | 2.00 | 3.11 | 370000 | |
maize | Ebale Gervais | Cameroon | 2.01 | 3.14 | 390000 | |
wheat | Salif Cisse | Mali | 1.95 | 3.02 | 190000 |
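As a quick sketch, the size() method combined with unstack() gives a compact view of how many winners fall into each gender and grain-type combination:

# count winners per (gender, grainType) combination and pivot grainType into columns
group_gen_grType.size().unstack(fill_value=0)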
If, instead, we want to select a single group in the form of a DataFrame, we can use the get_group() method on the GroupBy object, passing as an argument the group we want to focus on. Let's focus only on the prize winners for the coffee grain type.
group_coffee = grain_prize.groupby('grainType').get_group('coffee')
group_coffee
Name | birth_country | gender | grainType | cropHeight | cropWidth | prize | |
---|---|---|---|---|---|---|---|
1 | Lucien Nirina | Madagadscar | Male | coffee | 2.05 | 3.03 | 420000 |
6 | Sylvie Kurwa | Rwanda | Female | coffee | 2.08 | 3.18 | 230000 |
10 | Touadoum Nobre | Chad | Male | coffee | 2.00 | 3.11 | 370000 |
12 | Fatema Nourdin | Morocco | Female | coffee | 2.04 | 3.08 | 120000 |
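Note that, for a single group, plain boolean indexing gives the same rows without creating a GroupBy object; a minimal equivalent sketch:

# equivalent selection using a boolean mask
grain_prize[grain_prize['grainType'] == 'coffee']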
Applying#
After splitting the original data and (optionally) inspecting the resulting groups, we can perform one of the following operations or their combination (not necessarily in the given order) on each group:
Aggregation: calculating a summary statistic for each group (e.g., group sizes, means, medians, or sums) and outputting a single number for many data points.
Transformation: conducting some operations by group, such as calculating the z-score for each group.
Filtration: rejecting some groups based on a predefined condition, such as group size, mean, median, or sum. This can also include filtering out particular rows from each group.
Aggregation#
To aggregate the data of a GroupBy object (i.e., to calculate a summary statistic by group), we can either use the built-in GroupBy functions (like mean, max, min, std, median, sum, etc.) or use the agg() method on the object.
For instance, we might be interested in the average of cropHeight and cropWidth (or any numerical column) per grain type.
group_grType = grain_prize.groupby('grainType')
group_grType[['cropHeight' , 'cropWidth']].mean()
cropHeight | cropWidth | |
---|---|---|
grainType | ||
beans | 2.0500 | 3.150 |
coffee | 2.0425 | 3.100 |
maize | 2.0300 | 3.090 |
wheat | 2.0600 | 3.074 |
or
group_grType[['cropHeight' , 'cropWidth']].agg(np.mean)
cropHeight | cropWidth | |
---|---|---|
grainType | ||
beans | 2.0500 | 3.150 |
coffee | 2.0425 | 3.100 |
maize | 2.0300 | 3.090 |
wheat | 2.0600 | 3.074 |
The code above groups all the prize winners by grainType and computes the average of both cropHeight and cropWidth.
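Note that recent pandas versions prefer string function names over NumPy callables passed to agg() (passing np.mean directly is deprecated in newer releases); the same aggregation can also be written as:

# same result using string function names
group_grType[['cropHeight', 'cropWidth']].agg('mean')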
It is possible to apply several functions at once to one or more columns of a GroupBy object. For this purpose, we again need the agg() method and a list of the functions of interest:
group_grType[['cropHeight' , 'cropWidth']].agg([np.mean, np.sum , np.std])
cropHeight | cropWidth | |||||
---|---|---|---|---|---|---|
mean | sum | std | mean | sum | std | |
grainType | ||||||
beans | 2.0500 | 6.15 | 0.149332 | 3.150 | 9.45 | 0.055678 |
coffee | 2.0425 | 8.17 | 0.033040 | 3.100 | 12.40 | 0.062716 |
maize | 2.0300 | 8.12 | 0.128323 | 3.090 | 12.36 | 0.045461 |
wheat | 2.0600 | 10.30 | 0.107703 | 3.074 | 15.37 | 0.041593 |
Also, we can consider applying different aggregation functions to different columns of a GroupBy object by passing a dictionary:
group_grType.agg({'cropHeight': [np.sum, np.std], 'prize': np.sum})
cropHeight | prize | ||
---|---|---|---|
sum | std | sum | |
grainType | |||
beans | 6.15 | 0.149332 | 710000 |
coffee | 8.17 | 0.033040 | 1140000 |
maize | 8.12 | 0.128323 | 1560000 |
wheat | 10.30 | 0.107703 | 1530000 |
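pandas also supports named aggregation, which lets us choose the names of the output columns directly; a small sketch (the output column names mean_height and total_prize are just illustrative):

group_grType.agg(
    mean_height=('cropHeight', 'mean'),   # average crop height per grain type
    total_prize=('prize', 'sum'),         # total prize money per grain type
)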
Transformation#
Unlike aggregation methods, transformation methods return a new DataFrame with the same shape and indexing as the original one but with transformed individual values. Here it is important to note that the transformation must not modify any values in the original DataFrame, meaning that such operations cannot be performed in place.
A common pandas method for transforming the data of a GroupBy object is apply(). For example, it can be helpful for computing the z-score, or any function that can be applied on an element-by-element basis. For instance, when working with numerical values of different scales, one might choose to apply a normalization method, whose role is to constrain the scale of the output without losing the information it contains.
The z-score is computed by subtracting the group mean from each value and dividing the result by the group standard deviation:
$$\tilde{x} = \frac{x - \bar{x}}{\sigma}$$
The z-score can be defined as a named function or written inline as a lambda expression; either form can be passed as an argument to the apply() method.
def Norm(val_):
    # z-score: subtract the group mean and divide by the (sample) standard deviation
    x_bar = np.mean(val_)
    std_ = np.std(val_, ddof=1)  # ddof=1 matches the default of pandas' .std()
    return (val_ - x_bar) / std_
group_grType[['prize']].apply(lambda x: (x - x.mean()) / x.std())
prize | ||
---|---|---|
grainType | ||
beans | 2 | 0.704361 |
7 | -1.144586 | |
13 | 0.440225 | |
coffee | 1 | 0.990756 |
6 | -0.403641 | |
10 | 0.623809 | |
12 | -1.210924 | |
maize | 3 | 1.404879 |
8 | -0.561951 | |
11 | 0.000000 | |
15 | -0.842927 | |
wheat | 0 | 1.119391 |
4 | -0.718415 | |
5 | -0.467805 | |
9 | 1.035854 | |
14 | -0.969025 |
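A closely related method is transform(), which returns a result aligned with the original rows; this makes it convenient to attach the group-wise z-score as a new column. A minimal sketch (prize_z is an illustrative column name, and assign() returns a copy, so grain_prize itself is left unchanged):

# group-wise z-score aligned with the original rows
prize_z = group_grType['prize'].transform(lambda x: (x - x.mean()) / x.std())
grain_prize.assign(prize_z=prize_z).head()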
We can also define our own custom function and apply it to the grouped data. Note that such a custom function must reduce each group to a single aggregated value.
# just define your own function
import numpy as np

def custom_calculation(values):
    """Sum the values within a group, add 2, and divide the result by the
    number of elements in the group multiplied by 10."""
    n = len(values)
    s = np.sum(values) + 2
    m = s / (n * 10)
    return m
group_grType[['prize']].apply(custom_calculation)
prize | |
---|---|
grainType | |
beans | 23666.733333 |
coffee | 28500.050000 |
maize | 39000.050000 |
wheat | 30600.040000 |
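Since custom_calculation reduces each group to a single value, it can equally be passed to the agg() method, which should give the same numbers as the apply() call above:

# agg() accepts any callable that reduces each group to a single value
group_grType[['prize']].agg(custom_calculation)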
Filtration#
Filtration methods discard entire groups, or particular rows from each group, based on a predefined condition and return a subset of the original data. For example, we may want to keep only the values of a certain column from the groups whose mean for that column satisfies some condition. In the case of our DataFrame, let's keep only the groups whose mean cropHeight is smaller than 2.05, and keep only this column in the output. First, let's look at the group means:
group_grType['cropHeight'].mean()
grainType
beans 2.0500
coffee 2.0425
maize 2.0300
wheat 2.0600
Name: cropHeight, dtype: float64
In order to get the groups where the mean is smaller than 2.05, we use the filter() method:
group_grType['cropHeight'].filter(lambda x: x.mean() < 2.05)
1 2.05
3 2.18
6 2.08
8 2.06
10 2.00
11 2.01
12 2.04
15 1.87
Name: cropHeight, dtype: float64
We can then use the corresponding indices to reconstruct the matching rows of the original dataframe:
grain_prize.iloc[group_grType['cropHeight'].filter(lambda x: x.mean() < 2.05).index,:]
Name | birth_country | gender | grainType | cropHeight | cropWidth | prize | |
---|---|---|---|---|---|---|---|
1 | Lucien Nirina | Madagadscar | Male | coffee | 2.05 | 3.03 | 420000 |
3 | Garba Kone | Burkina Faso | Male | maize | 2.18 | 3.03 | 440000 |
6 | Sylvie Kurwa | Rwanda | Female | coffee | 2.08 | 3.18 | 230000 |
8 | Maria Siphiwe | Botswana | Female | maize | 2.06 | 3.10 | 370000 |
10 | Touadoum Nobre | Chad | Male | coffee | 2.00 | 3.11 | 370000 |
11 | Ebale Gervais | Cameroon | Male | maize | 2.01 | 3.14 | 390000 |
12 | Fatema Nourdin | Morocco | Female | coffee | 2.04 | 3.08 | 120000 |
15 | Aicha Boumediene | Algeria | Female | maize | 1.87 | 3.09 | 360000 |
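Alternatively, calling filter() on the grouped DataFrame itself returns the kept rows directly, with all columns and the original index, so no manual reconstruction is needed:

# keep every row belonging to a grain type whose mean cropHeight is below 2.05
grain_prize.groupby('grainType').filter(lambda g: g['cropHeight'].mean() < 2.05)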
Task 14: #
Arba Minch University in Ethiopia has been chosen by the African Union to store files related to Nobel Prize winners. As a Data Analyst working with the African Union, you have the task of grouping some information from that database in order to understand some patterns among Nobel Prize winners, be it their gender, continent of birth, prize category, etc. The data is given to you in a csv file called nobel_winners.csv.
What is the purpose of the Nobel Prize, in your view?
Load the data. In which fields have African-born personalities mostly won the Nobel Prize? What does that tell us?
Do you recall the last two African winners of the Prize? What were their contributions?
Use the groupby function and answer the questions below:
How many women were awarded the Nobel Prize? How many of them were African?
Out of the winners born in Africa, what are the average and the minimum winning prize amounts for men and women?
What are the categories of winners whose average adjusted prize amount is smaller than $6,000,000?
List all of the Africa-born winners in those categories. Any women? If so, in which category?
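A possible starting point is sketched below; the path and the column names (such as gender) are assumptions about nobel_winners.csv and may need adjusting once you have inspected the file:

# hypothetical starter: adjust the path and column names to match nobel_winners.csv
nobel = pd.read_csv('data/nobel_winners.csv')
print(nobel.columns)            # check which columns are actually available
nobel.groupby('gender').size()  # e.g., count winners per gender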
Task 15: #
Consider the gapminder dataset, saved here as gapminder.tsv, and address the following questions:
What were the average and standard deviation of life expectancy across the various continents before the year 2000?
Group the data by decades using the following pattern:
1950 < year < 1960 ==> “d50”
1960 < year < 1970 ==> “d60”
1970 < year < 1980 ==> “d70” …
Provide the lowest and highest population, and the lowest and highest life expectancy, for each of the decades.
Write code that provides the corresponding countries for each of the above. What historical facts could you derive from the insights you have obtained?
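One possible way to start is sketched below; it assumes the usual gapminder columns (country, continent, year, lifeExp, pop) and an illustrative decade-labelling rule, so adjust as needed:

# load the tab-separated gapminder file (adjust the path if necessary)
gap_data = pd.read_csv('data/gapminder.tsv', sep='\t')

# map each year to a decade label such as 'd50', 'd60', ...
# (years from 2000 onward would map to 'd0' and may need special handling)
gap_data['decade'] = 'd' + ((gap_data['year'] // 10 * 10) % 100).astype(str)

# smallest and largest population and life expectancy for each decade
gap_data.groupby('decade')[['pop', 'lifeExp']].agg(['min', 'max'])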