Chapter 5: Grouping the Data#

Introduction#

In data analysis tasks, insights are often unveiled when the data are aggregated or grouped. Such operations play a pivotal role in real-world applications by enabling powerful data analysis and decision-making. The ability to group data based on specific criteria and to perform aggregations and transformations opens up a multitude of possibilities. For instance,

  • In finance, groupby operations could be used by the Nigerian Exchange Group (formerly known as the Nigeria Stock Exchange) to analyze market trends and identify patterns in stock or bond prices.

  • In marketing, they could help Tanzlite Agency (in Dar es Salaam, Tanzania) understand its customer segments and behavior for targeted campaigns.

  • In healthcare, groupby operations could aid Data Analyst interns at the Regional Hospital of Limbe, Cameroon, in examining patient data to identify disease patterns and provide insights for medical doctors to devise personalized treatments.

  • In operations management, these operations could be used by Bollore Logistics to enable efficient resource allocation and optimization.

The pandas module offers a variety of operations under the groupby API. These groupby operations provide valuable insights and facilitate data-driven decision-making across various domains, making them indispensable in real-world applications. In this section, we will present some of them, including splitting, applying, and combining.

Splitting#

Let’s take a look at the Nobel Prize winners dataset to understand some of the groupby operations. Let’s load the nobel_winners.csv file.

The pandas DataFrame.groupby() function is used to split the data into groups based on some criteria. The abstract definition of grouping is to provide a mapping of labels to group names.
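As a minimal sketch of that mapping, the .groups attribute of a GroupBy object shows which row labels fall into which group (the DataFrame below is a toy example invented for illustration):

```python
import pandas as pd

# Toy DataFrame, invented for illustration
df = pd.DataFrame({'team': ['A', 'B', 'A', 'B'],
                   'score': [1, 2, 3, 4]})
grouped = df.groupby('team')

# .groups maps each group label to the row index labels it contains
print(grouped.groups)
```

Here rows 0 and 2 map to group 'A' and rows 1 and 3 map to group 'B'.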

Task 15: #

One of the key moments of the Russia-Africa Summit held in St Petersburg last July was the speech delivered by the 35-year-old president of Burkina Faso, Captain Ibrahim Traoré. More specifically, he said: “My generation does not understand this: how can Africa, which has so much wealth, become the poorest continent in the world today? And why do African leaders travel the world to beg?” That statement ignited a sense of patriotism and unity among African leaders. One of the resolutions made immediately after the Summit was the exchange of skills between Russian and African farmers. Consequently, a call for applications was initiated throughout Africa, allowing farmers to apply and compete for the opportunity to win the Nancy Abeid Prize (in honor of Nancy Abeid Arahamane, who established Africa’s first camel milk dairy in 1989).

The Prize essentially acknowledges farmers based on the height and width of their crops for four different types of grain: Coffee, Beans, Maize, and Wheat. The winning farmers will receive funding to expand their businesses and have their products promoted throughout the continent, fostering potential partnerships with industries in the future. The results are enclosed in a csv file called grain_award.csv.

import pandas as pd
import numpy as np
grain_prize = pd.read_csv('data/grain_award.csv')
grain_prize
Name birth_country gender grainType cropHeight cropWidth prize
0 Siyabonga Luthuli South Africa Male wheat 2.17 3.07 440000
1 Lucien Nirina Madagadscar Male coffee 2.05 3.03 420000
2 Pauline Dossevi Togo Female beans 2.16 3.21 290000
3 Garba Kone Burkina Faso Male maize 2.18 3.03 440000
4 Adeyoyin Samuel Nigeria Male wheat 1.95 3.12 220000
5 Geoffrey Birina Burundi Male wheat 2.07 3.05 250000
6 Sylvie Kurwa Rwanda Female coffee 2.08 3.18 230000
7 Joao Pedro Angola Male beans 2.11 3.14 150000
8 Maria Siphiwe Botswana Female maize 2.06 3.10 370000
9 Lejeune Bahati Congo Male wheat 2.16 3.11 430000
10 Touadoum Nobre Chad Male coffee 2.00 3.11 370000
11 Ebale Gervais Cameroon Male maize 2.01 3.14 390000
12 Fatema Nourdin Morocco Female coffee 2.04 3.08 120000
13 Aboubakar Houn Mauritania Male beans 1.88 3.10 270000
14 Salif Cisse Mali Male wheat 1.95 3.02 190000
15 Aicha Boumediene Algeria Female maize 1.87 3.09 360000

Each row in this dataset represents a prize winner, with the country of birth, the gender, the grain type, the crop height and width, and the prize. At this stage, we can call the pandas DataFrame.groupby() function to split the data into groups based on predefined criteria, usually a categorical column. Let’s start by grouping the data by gender. We do this by creating a group_gen object.

group_gen = grain_prize.groupby('gender')
group_gen
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fde08d99a60>

It is important to note that creating a GroupBy object only checks whether we have passed a correct mapping; it doesn’t actually perform any of the operations. It simply reorganizes the display of the DataFrame according to the chosen criterion. We can inspect the content of the chosen criterion to see the different instances by which the DataFrame will be grouped:

grain_prize['gender'].unique()
array(['Male', 'Female'], dtype=object)

Using the .first() (or .last()) method, we can check the first (or last) entry of each group in the reorganized DataFrame.

group_gen.first()
Name birth_country grainType cropHeight cropWidth prize
gender
Female Pauline Dossevi Togo beans 2.16 3.21 290000
Male Siyabonga Luthuli South Africa wheat 2.17 3.07 440000
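Beyond first() and last(), a GroupBy object can be inspected with size(), which counts the rows per group, and it is also iterable. A short sketch, using a hand-typed subset of the grain data so the snippet runs on its own:

```python
import pandas as pd

# Hand-typed subset of the grain_award data, for self-containment
sample = pd.DataFrame({
    'Name': ['Siyabonga Luthuli', 'Pauline Dossevi', 'Garba Kone'],
    'gender': ['Male', 'Female', 'Male'],
    'grainType': ['wheat', 'beans', 'maize'],
})
by_gender = sample.groupby('gender')

# size() reports how many rows each group holds
print(by_gender.size())

# Iterating yields (group label, sub-DataFrame) pairs
for label, frame in by_gender:
    print(label, frame['Name'].tolist())
```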

It is important to note that the grouping can also be done using more than one column, simply by passing a list of columns as the argument. To illustrate this, let’s group our data first by gender; then, inside each of the created groups, we will apply an additional grouping based on grainType:

group_gen_grType = grain_prize.groupby(['gender', 'grainType'])
group_gen_grType.last()
Name birth_country cropHeight cropWidth prize
gender grainType
Female beans Pauline Dossevi Togo 2.16 3.21 290000
coffee Fatema Nourdin Morocco 2.04 3.08 120000
maize Aicha Boumediene Algeria 1.87 3.09 360000
Male beans Aboubakar Houn Mauritania 1.88 3.10 270000
coffee Touadoum Nobre Chad 2.00 3.11 370000
maize Ebale Gervais Cameroon 2.01 3.14 390000
wheat Salif Cisse Mali 1.95 3.02 190000

If instead, for some purpose, we want to select a single group in the form of a DataFrame, we should use the get_group() method on the GroupBy object, passing as an argument the sub-group instance we want to focus on. Let’s focus only on the prize winners in the coffee grain type.

group_coffee = grain_prize.groupby('grainType').get_group('coffee')
group_coffee
Name birth_country gender grainType cropHeight cropWidth prize
1 Lucien Nirina Madagadscar Male coffee 2.05 3.03 420000
6 Sylvie Kurwa Rwanda Female coffee 2.08 3.18 230000
10 Touadoum Nobre Chad Male coffee 2.00 3.11 370000
12 Fatema Nourdin Morocco Female coffee 2.04 3.08 120000
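When grouping by several columns, get_group() expects a tuple with one value per grouping key. A small sketch, with made-up rows mirroring the structure of grain_prize:

```python
import pandas as pd

# Made-up rows mirroring the structure of grain_prize
df = pd.DataFrame({
    'Name': ['Sylvie Kurwa', 'Fatema Nourdin', 'Touadoum Nobre'],
    'gender': ['Female', 'Female', 'Male'],
    'grainType': ['coffee', 'coffee', 'coffee'],
})

# With two grouping keys, the group label is a (gender, grainType) tuple
female_coffee = df.groupby(['gender', 'grainType']).get_group(('Female', 'coffee'))
print(female_coffee)
```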

Applying#

After splitting the original data and (optionally) inspecting the resulting groups, we can perform one of the following operations, or a combination of them (not necessarily in the given order), on each group:

  • Aggregation: calculating a summary statistic for each group (e.g., group sizes, means, medians, or sums) and outputting a single number for many data points.

  • Transformation: conducting some operations by group, such as calculating the z-score for each group.

  • Filtration: rejecting some groups based on a predefined condition, such as group size, mean, median, or sum. This can also include filtering out particular rows from each group.

Aggregation#

To aggregate the data of a GroupBy object (i.e., to calculate a summary statistic by group), we can either use built-in methods (like mean, max, min, std, median, sum, etc.) of the GroupBy object or use the agg() method on it:

For instance, we might be interested in the average of cropHeight and cropWidth (or any numerical column) per grain type.

group_grType = grain_prize.groupby('grainType')
group_grType[['cropHeight', 'cropWidth']].mean()
cropHeight cropWidth
grainType
beans 2.0500 3.150
coffee 2.0425 3.100
maize 2.0300 3.090
wheat 2.0600 3.074

or

group_grType[['cropHeight', 'cropWidth']].agg('mean')
cropHeight cropWidth
grainType
beans 2.0500 3.150
coffee 2.0425 3.100
maize 2.0300 3.090
wheat 2.0600 3.074

The code above combines all the prize winners per grainType and computes the average of both cropHeight and cropWidth. It is possible to apply several functions at once to one or more columns of a GroupBy object. For this purpose, we again need the agg() method, this time with a list of the functions of interest:

group_grType[['cropHeight', 'cropWidth']].agg(['mean', 'sum', 'std'])
cropHeight cropWidth
mean sum std mean sum std
grainType
beans 2.0500 6.15 0.149332 3.150 9.45 0.055678
coffee 2.0425 8.17 0.033040 3.100 12.40 0.062716
maize 2.0300 8.12 0.128323 3.090 12.36 0.045461
wheat 2.0600 10.30 0.107703 3.074 15.37 0.041593

Also, we can consider applying different aggregation functions to different columns of a GroupBy object by passing a dictionary:

group_grType.agg({'cropHeight': ['sum', 'std'], 'prize': 'sum'})
cropHeight prize
sum std sum
grainType
beans 6.15 0.149332 710000
coffee 8.17 0.033040 1140000
maize 8.12 0.128323 1560000
wheat 10.30 0.107703 1530000
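pandas also supports named aggregation, where keyword arguments to agg() produce flat, descriptive column names instead of the two-level headers above. A sketch on toy data standing in for grain_prize:

```python
import pandas as pd

# Toy data standing in for grain_prize
df = pd.DataFrame({
    'grainType': ['beans', 'beans', 'coffee'],
    'cropHeight': [2.16, 2.11, 2.05],
    'prize': [290000, 150000, 420000],
})

# Each keyword names an output column as a (source column, function) pair
summary = df.groupby('grainType').agg(
    mean_height=('cropHeight', 'mean'),
    total_prize=('prize', 'sum'),
)
print(summary)
```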

Transformation#

Unlike aggregation methods, transformation methods return a new DataFrame with the same shape and indexing as the original one, but with transformed individual values. It is important to note that a transformation must not modify any values in the original DataFrame, meaning that such operations cannot be performed in place.

The most common pandas method to transform the data of a GroupBy object is apply(). For example, it can be helpful for computing the z-score, or any function that can be applied on an element-by-element basis. For instance, when working with numerical values of different scales, one might choose to apply a normalization method, whose role is to constrain the scale of the output without losing the information in it.

The z-score is computed by subtracting the mean of the whole column from each value and dividing the result by the standard deviation:

\[\tilde{x} = \frac{x - \bar{x}}{\sigma}\]

The z-score function, written as a lambda, is passed as an argument to the apply method. An equivalent named function could look like this:

def Norm(val_):
    # subtract the column mean and divide by the sample standard deviation
    x_bar = sum(val_) / len(val_)
    std_ = np.std(val_, ddof=1)  # ddof=1 matches pandas' default .std()
    return (val_ - x_bar) / std_
group_grType[['prize']].apply(lambda x: (x - x.mean()) / x.std())
prize
grainType
beans 2 0.704361
7 -1.144586
13 0.440225
coffee 1 0.990756
6 -0.403641
10 0.623809
12 -1.210924
maize 3 1.404879
8 -0.561951
11 0.000000
15 -0.842927
wheat 0 1.119391
4 -0.718415
5 -0.467805
9 1.035854
14 -0.969025
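For this kind of per-group arithmetic, GroupBy also offers transform(), which returns a result aligned with the original index; this makes it easy to attach a group statistic as a new column. A sketch on toy data standing in for grain_prize:

```python
import pandas as pd

# Toy data standing in for grain_prize
df = pd.DataFrame({'grainType': ['beans', 'beans', 'coffee'],
                   'prize': [290000, 150000, 420000]})

# transform('mean') broadcasts each group's mean back to every row of the group
df['group_mean'] = df.groupby('grainType')['prize'].transform('mean')
print(df)
```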

We can also define our own custom function and apply it to the grouped data. Please note that each sub-function within the custom function must produce an aggregated value.

The calculation implemented below is

\[ val = \frac{\sum (group) + 2}{10 \times Card(group)}\]

# just define your own function
import numpy as np
def custom_calculation(values):
    """Sum the values within a group, add 2, and divide by the number of
    elements in the group multiplied by 10."""
    n = len(values)
    s = np.sum(values) + 2
    m = s / (n * 10)
    return m
group_grType[['prize']].apply(custom_calculation)
prize
grainType
beans 23666.733333
coffee 28500.050000
maize 39000.050000
wheat 30600.040000

Filtration#

Filtration methods discard groups, or particular rows from each group, based on a predefined condition and return a subset of the original data. For example, we may want to keep only the values of a certain column from all the groups where the group mean for that column is greater than a predefined value. In the case of our DataFrame, let’s keep only the groups with a group mean for the cropHeight column smaller than 2.05, and keep only this column in the output.

group_grType['cropHeight'].mean()
grainType
beans     2.0500
coffee    2.0425
maize     2.0300
wheat     2.0600
Name: cropHeight, dtype: float64

In order to get the groups where the mean is smaller than 2.05, we use the filter method:

group_grType['cropHeight'].filter(lambda x: x.mean() < 2.05)
1     2.05
3     2.18
6     2.08
8     2.06
10    2.00
11    2.01
12    2.04
15    1.87
Name: cropHeight, dtype: float64

We can use the corresponding indices to reconstruct the DataFrame:

grain_prize.loc[group_grType['cropHeight'].filter(lambda x: x.mean() < 2.05).index]
Name birth_country gender grainType cropHeight cropWidth prize
1 Lucien Nirina Madagadscar Male coffee 2.05 3.03 420000
3 Garba Kone Burkina Faso Male maize 2.18 3.03 440000
6 Sylvie Kurwa Rwanda Female coffee 2.08 3.18 230000
8 Maria Siphiwe Botswana Female maize 2.06 3.10 370000
10 Touadoum Nobre Chad Male coffee 2.00 3.11 370000
11 Ebale Gervais Cameroon Male maize 2.01 3.14 390000
12 Fatema Nourdin Morocco Female coffee 2.04 3.08 120000
15 Aicha Boumediene Algeria Female maize 1.87 3.09 360000
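Alternatively, calling filter() on the GroupBy of the full DataFrame (rather than on a single column) returns the surviving rows directly, with all columns and no index reconstruction. A sketch on toy data:

```python
import pandas as pd

# Toy data: the coffee rows average below 2.05, the wheat rows above it
df = pd.DataFrame({'grainType': ['coffee', 'wheat', 'coffee', 'wheat'],
                   'cropHeight': [2.00, 2.10, 2.04, 2.20]})

# Groups whose mean cropHeight is below 2.05 are kept whole; others are dropped
kept = df.groupby('grainType').filter(lambda g: g['cropHeight'].mean() < 2.05)
print(kept)
```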

Task 14: #

The Arba Minch University in Ethiopia has been chosen by the African Union to store files related to Nobel Prize winners. As a Data Analyst working with the African Union, you have the task of grouping some information from that database in order to understand some patterns among Nobel Prize winners: their gender, continent of birth, prize category, etc. The file is given to you as a csv file named nobel_winners.csv.

  1. What is the purpose of the Nobel Prize according to you?

  2. Load the data. In which field have African-born personalities mostly won the Nobel Prize? What does that tell us?

  3. Do you recall the last two African Winners of the Prize? What were their contributions?

Use the groupby function and answer the questions below:

  1. How many women were awarded the Nobel Prize? How many of them were African?

  2. Out of the winners that were born in Africa, what are the average and the minimum winning prize for men and women?

  3. What are the categories of winners whose average adjusted prize amount is smaller than $6,000,000?

  4. List all of the Africa-born winners in those categories. Are there any women? If so, in which category?


Task 15: #

Consider the gapminder dataset, saved here as gapminder.tsv, and address the following questions:

  1. What were the life expectancy average and standard deviation across the various continents before the year 2000?

  2. Group the data by decades using the following pattern:

    • 1950 < year < 1960 ==> “d50”

    • 1960 < year < 1970 ==> “d60”

    • 1970 < year < 1980 ==> “d70” …

Provide the lowest and highest population, and the lowest and highest life expectancy, for each of the decades.

  1. Write code that provides the corresponding countries for each of the above. What historical facts could you derive from the insights you have obtained?