Chapter 5: Grouping the Data#

Introduction#

In data analysis tasks, insights are often unveiled when the data are aggregated or grouped. Such operations play a pivotal role in real-world applications by enabling powerful data analysis and decision-making. The ability to group data based on specific criteria and to perform aggregations and transformations opens up a multitude of possibilities. For instance,

  • In finance, groupby operations could be used by the Nigerian Exchange Group (formerly known as the Nigeria Stock Exchange) to analyze market trends and identify patterns in stock or bond prices.

  • In marketing, they could help Tanzlite Agency (in Dar es Salaam, Tanzania) understand its customer segments and behavior for targeted campaigns.

  • In healthcare, groupby operations could aid Data Analyst interns at the Regional Hospital of Limbe, Cameroon, in examining patient data to identify disease patterns and provide insights for medical doctors to devise personalized treatments.

  • In operations management, these operations could be used by Bollore Logistics to enable efficient resource allocation and optimization.

The pandas module offers a variety of operations under the groupby API. These groupby operations provide valuable insights and facilitate data-driven decision-making across various domains, making them indispensable in real-world applications. In this section, we will present some of them, including splitting, applying, and combining.

Splitting#

Let’s take a look at the Nobel Prize winners dataset to understand some of the groupby operations. Let’s load the nobel_winners.csv file.

The pandas DataFrame.groupby() function is used to split the data into groups based on some criteria. The abstract definition of grouping is to provide a mapping of labels to group names.
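As a minimal sketch of that mapping, the .groups attribute of a GroupBy object shows which row labels fall into which group (the DataFrame below is a toy example invented for illustration):

```python
import pandas as pd

# Toy DataFrame, invented for illustration
df = pd.DataFrame({'team': ['A', 'B', 'A', 'B'],
                   'score': [1, 2, 3, 4]})
grouped = df.groupby('team')

# .groups maps each group label to the row index labels it contains
print(grouped.groups)
```

Here rows 0 and 2 map to group 'A' and rows 1 and 3 map to group 'B'.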

Task 15: #

One of the key moments of the Russia-Africa Summit held in St Petersburg last July was the speech delivered by the 35-year-old president of Burkina Faso, Captain Ibrahim Traoré. More specifically, he said: “My generation does not understand this: how can Africa, which has so much wealth, become the poorest continent in the world today? And why do African leaders travel the world to beg?” That statement ignited a sense of patriotism and unity among African leaders. One of the resolutions made immediately after the Summit was the exchange of skills between Russian and African farmers. Consequently, a call for applications was initiated throughout Africa, allowing farmers to apply and compete for the opportunity to win the Nancy Abeid Prize (in honor of Nancy Abeid Arahamane, who established Africa’s first camel milk dairy in 1989).

The Prize essentially acknowledges farmers based on the height and width of their crops for four different types of grain: Coffee, Beans, Maize, and Wheat. The winning farmers will receive funding to expand their businesses and have their products promoted throughout the continent, fostering potential partnerships with industries in the future. The results are enclosed in a csv file called grain_award.csv.

import pandas as pd
import numpy as np
grain_prize = pd.read_csv('data/grain_award.csv')
grain_prize
Name birth_country gender grainType cropHeight cropWidth prize
0 Siyabonga Luthuli South Africa Male wheat 2.17 3.07 440000
1 Lucien Nirina Madagadscar Male coffee 2.05 3.03 420000
2 Pauline Dossevi Togo Female beans 2.16 3.21 290000
3 Garba Kone Burkina Faso Male maize 2.18 3.03 440000
4 Adeyoyin Samuel Nigeria Male wheat 1.95 3.12 220000
5 Geoffrey Birina Burundi Male wheat 2.07 3.05 250000
6 Sylvie Kurwa Rwanda Female coffee 2.08 3.18 230000
7 Joao Pedro Angola Male beans 2.11 3.14 150000
8 Maria Siphiwe Botswana Female maize 2.06 3.10 370000
9 Lejeune Bahati Congo Male wheat 2.16 3.11 430000
10 Touadoum Nobre Chad Male coffee 2.00 3.11 370000
11 Ebale Gervais Cameroon Male maize 2.01 3.14 390000
12 Fatema Nourdin Morocco Female coffee 2.04 3.08 120000
13 Aboubakar Houn Mauritania Male beans 1.88 3.10 270000
14 Salif Cisse Mali Male wheat 1.95 3.02 190000
15 Aicha Boumediene Algeria Female maize 1.87 3.09 360000

Each row in this dataset represents a prize winner, with the country of birth, the gender, the grain type, the crop height and width, and the prize. At this stage, we can call the pandas DataFrame.groupby() function to split the data into groups based on predefined criteria, usually a categorical column. Let’s start by grouping the data by gender. We do this by creating a group_gen object.

group_gen = grain_prize.groupby('gender')
group_gen
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fde08d99a60>

It is important to note that creating a GroupBy object only checks whether we have passed a correct mapping; it doesn’t actually perform any of the operations. It simply reorganizes the display of the DataFrame according to the chosen criterion. We can inspect the content of the chosen criterion to see the different instances by which the DataFrame will be grouped:

grain_prize['gender'].unique()
array(['Male', 'Female'], dtype=object)

Using the .first() (or .last()) method, we can check the first (or last) entry of each group in the reorganized DataFrame.

group_gen.first()
Name birth_country grainType cropHeight cropWidth prize
gender
Female Pauline Dossevi Togo beans 2.16 3.21 290000
Male Siyabonga Luthuli South Africa wheat 2.17 3.07 440000
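Beyond first() and last(), a GroupBy object can be inspected with size(), which counts the rows per group, and it is also iterable. A short sketch, using a hand-typed subset of the grain data so the snippet runs on its own:

```python
import pandas as pd

# Hand-typed subset of the grain_award data, for self-containment
sample = pd.DataFrame({
    'Name': ['Siyabonga Luthuli', 'Pauline Dossevi', 'Garba Kone'],
    'gender': ['Male', 'Female', 'Male'],
    'grainType': ['wheat', 'beans', 'maize'],
})
by_gender = sample.groupby('gender')

# size() reports how many rows each group holds
print(by_gender.size())

# Iterating yields (group label, sub-DataFrame) pairs
for label, frame in by_gender:
    print(label, frame['Name'].tolist())
```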

It is important to note that the grouping can also be done using more than one column, simply by passing a list of columns as the argument. To illustrate this, let’s group our data first by gender; then, inside each of the created groups, we will apply an additional grouping based on grainType:

group_gen_grType = grain_prize.groupby(['gender', 'grainType'])
group_gen_grType.last()
Name birth_country cropHeight cropWidth prize
gender grainType
Female beans Pauline Dossevi Togo 2.16 3.21 290000
coffee Fatema Nourdin Morocco 2.04 3.08 120000
maize Aicha Boumediene Algeria 1.87 3.09 360000
Male beans Aboubakar Houn Mauritania 1.88 3.10 270000
coffee Touadoum Nobre Chad 2.00 3.11 370000
maize Ebale Gervais Cameroon 2.01 3.14 390000
wheat Salif Cisse Mali 1.95 3.02 190000

If instead, for some purpose, we want to select a single group in the form of a DataFrame, we should use the get_group() method on the GroupBy object, passing as an argument the sub-group instance we want to focus on. Let’s focus only on the prize winners in the coffee grain type.

group_coffee = grain_prize.groupby('grainType').get_group('coffee')
group_coffee
Name birth_country gender grainType cropHeight cropWidth prize
1 Lucien Nirina Madagadscar Male coffee 2.05 3.03 420000
6 Sylvie Kurwa Rwanda Female coffee 2.08 3.18 230000
10 Touadoum Nobre Chad Male coffee 2.00 3.11 370000
12 Fatema Nourdin Morocco Female coffee 2.04 3.08 120000
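When grouping by several columns, get_group() expects a tuple with one value per grouping key. A small sketch, with made-up rows mirroring the structure of grain_prize:

```python
import pandas as pd

# Made-up rows mirroring the structure of grain_prize
df = pd.DataFrame({
    'Name': ['Sylvie Kurwa', 'Fatema Nourdin', 'Touadoum Nobre'],
    'gender': ['Female', 'Female', 'Male'],
    'grainType': ['coffee', 'coffee', 'coffee'],
})

# With two grouping keys, the group label is a (gender, grainType) tuple
female_coffee = df.groupby(['gender', 'grainType']).get_group(('Female', 'coffee'))
print(female_coffee)
```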

Applying#

After splitting the original data and (optionally) inspecting the resulting groups, we can perform one of the following operations, or a combination of them (not necessarily in the given order), on each group:

  • Aggregation: calculating a summary statistic for each group (e.g., group sizes, means, medians, or sums) and outputting a single number for many data points.

  • Transformation: conducting some operations by group, such as calculating the z-score for each group.

  • Filtration: rejecting some groups based on a predefined condition, such as group size, mean, median, or sum. This can also include filtering out particular rows from each group.

Aggregation#

To aggregate the data of a GroupBy object (i.e., to calculate a summary statistic by group), we can either use built-in methods (like mean, max, min, std, median, sum, etc.) of the GroupBy object or use the agg() method on it:

For instance, we might be interested in the average of cropHeight and cropWidth (or any numerical column) per grain type.

group_grType = grain_prize.groupby('grainType')
group_grType[['cropHeight', 'cropWidth']].mean()
cropHeight cropWidth
grainType
beans 2.0500 3.150
coffee 2.0425 3.100
maize 2.0300 3.090
wheat 2.0600 3.074

or

group_grType[['cropHeight', 'cropWidth']].agg('mean')
cropHeight cropWidth
grainType
beans 2.0500 3.150
coffee 2.0425 3.100
maize 2.0300 3.090
wheat 2.0600 3.074

The code above combines all the prize winners per grainType and computes the average of both cropHeight and cropWidth. It is possible to apply several functions at once to one or more columns of a GroupBy object. For this purpose, we again need the agg() method, this time with a list of the functions of interest:

group_grType[['cropHeight', 'cropWidth']].agg(['mean', 'sum', 'std'])
cropHeight cropWidth
mean sum std mean sum std
grainType
beans 2.0500 6.15 0.149332 3.150 9.45 0.055678
coffee 2.0425 8.17 0.033040 3.100 12.40 0.062716
maize 2.0300 8.12 0.128323 3.090 12.36 0.045461
wheat 2.0600 10.30 0.107703 3.074 15.37 0.041593

Also, we can consider applying different aggregation functions to different columns of a GroupBy object by passing a dictionary:

group_grType.agg({'cropHeight': ['sum', 'std'], 'prize': 'sum'})
cropHeight prize
sum std sum
grainType
beans 6.15 0.149332 710000
coffee 8.17 0.033040 1140000
maize 8.12 0.128323 1560000
wheat 10.30 0.107703 1530000
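pandas also supports named aggregation, where keyword arguments to agg() produce flat, descriptive column names instead of the two-level headers above. A sketch on toy data standing in for grain_prize:

```python
import pandas as pd

# Toy data standing in for grain_prize
df = pd.DataFrame({
    'grainType': ['beans', 'beans', 'coffee'],
    'cropHeight': [2.16, 2.11, 2.05],
    'prize': [290000, 150000, 420000],
})

# Each keyword names an output column as a (source column, function) pair
summary = df.groupby('grainType').agg(
    mean_height=('cropHeight', 'mean'),
    total_prize=('prize', 'sum'),
)
print(summary)
```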

Transformation#

Unlike aggregation methods, transformation methods return a new DataFrame with the same shape and indexing as the original one, but with transformed individual values. It is important to note that a transformation must not modify any values in the original DataFrame, meaning that such operations cannot be performed in place.

The most common pandas method to transform the data of a GroupBy object is apply(). For example, it can be helpful for computing the z-score, or any function that can be applied on an element-by-element basis. For instance, when working with numerical values of different scales, one might choose to apply a normalization method, whose role is to constrain the scale of the output without losing the information in it.

The z-score is computed by subtracting the mean of the whole column from each value and dividing the result by the standard deviation:

\[\tilde{x} = \frac{x - \bar{x}}{\sigma}\]

The z-score function, written as a lambda, is passed as an argument to the apply method. An equivalent named function could look like this:

def Norm(val_):
    # subtract the column mean and divide by the sample standard deviation
    x_bar = sum(val_) / len(val_)
    std_ = np.std(val_, ddof=1)  # ddof=1 matches pandas' default .std()
    return (val_ - x_bar) / std_
group_grType[['prize']].apply(lambda x: (x - x.mean()) / x.std())
prize
grainType
beans 2 0.704361
7 -1.144586
13 0.440225
coffee 1 0.990756
6 -0.403641
10 0.623809
12 -1.210924
maize 3 1.404879
8 -0.561951
11 0.000000
15 -0.842927
wheat 0 1.119391
4 -0.718415
5 -0.467805
9 1.035854
14 -0.969025
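For this kind of per-group arithmetic, GroupBy also offers transform(), which returns a result aligned with the original index; this makes it easy to attach a group statistic as a new column. A sketch on toy data standing in for grain_prize:

```python
import pandas as pd

# Toy data standing in for grain_prize
df = pd.DataFrame({'grainType': ['beans', 'beans', 'coffee'],
                   'prize': [290000, 150000, 420000]})

# transform('mean') broadcasts each group's mean back to every row of the group
df['group_mean'] = df.groupby('grainType')['prize'].transform('mean')
print(df)
```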

We can also define our own custom function and apply it to the grouped data. Please note that each sub-function within the custom function must produce an aggregated value.

The calculation implemented below is

\[ val = \frac{\sum (group) + 2}{10 \times Card(group)}\]

# just define your own function
import numpy as np
def custom_calculation(values):
    """Sum the values within a group, add 2, and divide by the number of
    elements in the group multiplied by 10."""
    n = len(values)
    s = np.sum(values) + 2
    m = s / (n * 10)
    return m
group_grType[['prize']].apply(custom_calculation)
prize
grainType
beans 23666.733333
coffee 28500.050000
maize 39000.050000
wheat 30600.040000

Filtration#

Filtration methods discard groups, or particular rows from each group, based on a predefined condition and return a subset of the original data. For example, we may want to keep only the values of a certain column from all the groups where the group mean for that column is greater than a predefined value. In the case of our DataFrame, let’s keep only the groups with a group mean for the cropHeight column smaller than 2.05, and keep only this column in the output.

group_grType['cropHeight'].mean()
grainType
beans     2.0500
coffee    2.0425
maize     2.0300
wheat     2.0600
Name: cropHeight, dtype: float64

In order to get the groups where the mean is smaller than 2.05, we use the filter method:

group_grType['cropHeight'].filter(lambda x: x.mean() < 2.05)
1     2.05
3     2.18
6     2.08
8     2.06
10    2.00
11    2.01
12    2.04
15    1.87
Name: cropHeight, dtype: float64

We can use the corresponding indices to reconstruct the DataFrame:

grain_prize.loc[group_grType['cropHeight'].filter(lambda x: x.mean() < 2.05).index]
Name birth_country gender grainType cropHeight cropWidth prize
1 Lucien Nirina Madagadscar Male coffee 2.05 3.03 420000
3 Garba Kone Burkina Faso Male maize 2.18 3.03 440000
6 Sylvie Kurwa Rwanda Female coffee 2.08 3.18 230000
8 Maria Siphiwe Botswana Female maize 2.06 3.10 370000
10 Touadoum Nobre Chad Male coffee 2.00 3.11 370000
11 Ebale Gervais Cameroon Male maize 2.01 3.14 390000
12 Fatema Nourdin Morocco Female coffee 2.04 3.08 120000
15 Aicha Boumediene Algeria Female maize 1.87 3.09 360000
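Alternatively, calling filter() on the GroupBy of the full DataFrame (rather than on a single column) returns the surviving rows directly, with all columns and no index reconstruction. A sketch on toy data:

```python
import pandas as pd

# Toy data: the coffee rows average below 2.05, the wheat rows above it
df = pd.DataFrame({'grainType': ['coffee', 'wheat', 'coffee', 'wheat'],
                   'cropHeight': [2.00, 2.10, 2.04, 2.20]})

# Groups whose mean cropHeight is below 2.05 are kept whole; others are dropped
kept = df.groupby('grainType').filter(lambda g: g['cropHeight'].mean() < 2.05)
print(kept)
```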

Task 14: #

The Arba Minch University in Ethiopia has been chosen by the African Union to store files related to Nobel Prize winners. As a Data Analyst working with the African Union, you have the task of grouping some information from that database in order to understand some patterns among Nobel Prize winners: their gender, continent of birth, prize category, etc. The file is given to you as a csv file named nobel_winners.csv.

  1. What is the purpose of the Nobel Prize according to you?

  2. Load the data. In which field have African-born personalities mostly won the Nobel Prize? What does that tell us?

  3. Do you recall the last two African Winners of the Prize? What were their contributions?

Use the groupby function and answer the questions below:

  1. How many women were awarded the Nobel Prize? How many of them were African?

  2. Out of the winners that were born in Africa, what are the average and the minimum winning prize for men and women?

  3. What are the categories of winners whose average adjusted prize amount is smaller than $6,000,000?

  4. List all of the Africa-born winners in those categories. Are there any women? If so, in which category?


Task 15: #

Consider the gapminder dataset, saved here as gapminder.tsv, and address the following questions:

  1. What were the life expectancy average and standard deviation across the various continents before the year 2000?

  2. Group the data by decades using the following pattern:

    • 1950 < year < 1960 ==> “d50”

    • 1960 < year < 1970 ==> “d60”

    • 1970 < year < 1980 ==> “d70” …

Provide the lowest and highest population, and the lowest and highest life expectancy, for each of the decades.

  1. Write code that provides the corresponding countries for each of the above. What historical facts could you derive from the insights you have obtained?