Chapter 7: Data Organization Methods#

Selecting, subsetting, and organizing data#

Having learned how to define and build a DataFrame, in this chapter, we will address some of its basic features, largely applicable also to Series. Through a series of basic operations and examples of the concepts related to them, we will discover how the structure and content of a DataFrame can be modified as we wish. All these operations serve a common purpose – preparing the DataFrame we are working on, and the data within it, for manipulation.

Learning Objectives#

In this section, we will cover the following topics:

  • Viewing

  • Selection

  • Filtering

  • Editing

  • Descriptive Statistics

  • Transposition, Sorting, and Reindexing

After this chapter, you will be able to see and understand the content of a DataFrame (viewing and statistics), select appropriate parts (subsetting) or filter its elements based on appropriate conditions (filtering), add new ones (editing), and finally modify the structure (reindexing and transposition). All will be aimed at focusing on that part of the DataFrame on which we will later want to operate (manipulation) – which we will discuss in the next chapter.

The head() and tail() methods#

To get a quick view of the structure of a DataFrame and an idea of the data it contains, the Pandas library provides us with the two methods head() and tail(). Calling these methods on a DataFrame prints its first five rows, with the head() method, or its last five rows, with the tail() method, as follows:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(100).reshape(20,5))
df
0 1 2 3 4
0 0.074081 0.143229 0.172670 0.230184 0.209971
1 0.202420 0.105604 0.927748 0.036043 0.250239
2 0.189448 0.612557 0.134308 0.545073 0.492270
3 0.104358 0.847961 0.496373 0.480613 0.892998
4 0.142162 0.330063 0.253526 0.935058 0.640519
5 0.546463 0.184395 0.767790 0.095399 0.858232
6 0.368070 0.705653 0.431129 0.237202 0.375348
7 0.602861 0.583231 0.101862 0.632895 0.403986
8 0.718824 0.652524 0.488652 0.750440 0.684057
9 0.845760 0.310396 0.541246 0.555395 0.193282
10 0.062214 0.458687 0.131933 0.317933 0.103716
11 0.006183 0.295220 0.974458 0.258122 0.210710
12 0.845106 0.217333 0.364789 0.208822 0.425059
13 0.920062 0.197460 0.946817 0.093274 0.114467
14 0.465570 0.240329 0.272167 0.992029 0.079253
15 0.987038 0.867577 0.527711 0.936863 0.215796
16 0.398053 0.719077 0.014737 0.608792 0.532576
17 0.730112 0.222590 0.247289 0.683116 0.646272
18 0.215039 0.162615 0.724696 0.859992 0.603788
19 0.657143 0.648049 0.966288 0.455964 0.937608
df.head()
0 1 2 3 4
0 0.074081 0.143229 0.172670 0.230184 0.209971
1 0.202420 0.105604 0.927748 0.036043 0.250239
2 0.189448 0.612557 0.134308 0.545073 0.492270
3 0.104358 0.847961 0.496373 0.480613 0.892998
4 0.142162 0.330063 0.253526 0.935058 0.640519

Similarly, the tail() method, if used in a Jupyter Notebook, shows the last five rows of the DataFrame with the particular formatting seen previously, as follows:

df.tail()
0 1 2 3 4
15 0.987038 0.867577 0.527711 0.936863 0.215796
16 0.398053 0.719077 0.014737 0.608792 0.532576
17 0.730112 0.222590 0.247289 0.683116 0.646272
18 0.215039 0.162615 0.724696 0.859992 0.603788
19 0.657143 0.648049 0.966288 0.455964 0.937608

These two methods show the first and last five rows of the DataFrame. This default behavior can be changed by passing the number of rows to be displayed as an argument to either method. For example, if we want to see only the first three rows of the DataFrame, we will get the result shown below:

df.head(3)
0 1 2 3 4
0 0.074081 0.143229 0.172670 0.230184 0.209971
1 0.202420 0.105604 0.927748 0.036043 0.250239
2 0.189448 0.612557 0.134308 0.545073 0.492270
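Though less commonly known, head() and tail() also accept negative values, which exclude rows from the opposite end instead. A small sketch (the 10-row DataFrame here is our own example, not the one above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=['A', 'B'])

# head(-n) returns all rows except the last n
print(df.head(-3).shape)   # (7, 2)

# tail(-n) returns all rows except the first n
print(df.tail(-3).shape)   # (7, 2)
```

This is a convenient shorthand for trimming a fixed number of rows from either end without computing the DataFrame's length first.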

Selection#

In the previous chapter, during the creation of the DataFrame, we already learned how to select the elements inside it. In this section, we will address the selection in the foreground, deepening this topic which will then be the basis of all the other features present in this and subsequent chapters. It is, therefore, crucial to start with this topic to better understand the concepts, and then to address all the subsequent topics in this book with greater clarity.

Subsetting by index or by position#

Structured objects such as DataFrame and Series have been created specifically to make the selection process on them simple and efficient. Within these objects, each dimension is characterized by an Index object, whose purpose is to label the individual elements along it. It is precisely the Index objects that differentiate DataFrame and Series from the n-dimensional arrays (ndarray) of NumPy. The latter have no Index objects, so the positions of the elements are specified by a sequence of progressive integers from 0 to N-1 (where N is the length of the dimension). Since the Pandas library is built on the NumPy library and, in its way, extends some of its features, it keeps the possibility of selecting the elements of DataFrame and Series by position numbers, while also extending this concept through the Indexes and their labels. Subsetting can then be done via the following:

  • Indexing: Users can select the data using the labels that distinguish the individual rows and columns, through the Index objects.

  • Integer Positioning: Users can select data using the integer positions of rows and columns, as with NumPy ndarrays.

When you want to select a single element, things are quite simple – you specify the label corresponding to the row and the corresponding column (indexing) or the position within the DataFrame by means of the pair of integers (integer positioning) – as shown as follows:

df = pd.DataFrame(np.random.randint(0,9,16).reshape(4,4),
                           index = ['a','b','c','d'],
                           columns = ['A','B','C','D'])
df
A B C D
a 6 2 4 0
b 4 0 5 1
c 8 1 7 4
d 8 0 5 3
print(df.at['a','A']) #indexing
print(df.iat[0,0])    #integer positioning
6
6

But most of the time, we need to select certain subsets of the DataFrame and not a single element. So, we will have to extend the single values (integers or labels) with notations that express a particular subset on which to make the selection.

Selection operators and indexers accept the following terms in this regard:

  • Lists (integer or labels)

  • Slices (slicing)

Lists specify all the columns or rows we want to include in the subsetting. As the name implies, they are simple Python lists containing integer values or labels corresponding to the rows or columns to be selected.

Slices, on the other hand, specify ranges of indexes, thus including, in a single notation, a set of contiguous columns or rows, without having to specify them individually. To define the slices, the slicing technique is used with the ‘:’ operator (the same way as in NumPy with ndarray).

loc[ ] – Selection by Labels#

The loc[ ] operator focuses on selection based on the labels contained within the Index objects, as shown as follows:

loc[ row labels, column labels]

If you want to select a single element within the DataFrame, you will first specify the row label and then that of the column, as shown as follows:

print(df.loc['a','B'])
2

A single element as a scalar is obtained as a return value. Whereas, if you want to select a subset of adjacent elements, you can use the slicing technique with labels, as shown as follows:

print(df.loc['b':'d','A':'C']) 
   A  B  C
b  4  0  5
c  8  1  7
d  8  0  5

In slicing with loc[ ], all the elements between the extremes expressed by the two labels are selected, including both extremes. A smaller DataFrame is obtained as a return value. If, on the other hand, you want to select the entire range of columns or rows, just write the slicing operator :, as follows:

print(df.loc['a':'c', :]) 
   A  B  C  D
a  6  2  4  0
b  4  0  5  1
c  8  1  7  4

Exercise#

Try the following:

  1. df.loc[['a','c'], ['A','D']]

  2. df.loc['a':'c', 'A':'B'].join(df.loc['a':'c', 'D'])

  3. df.loc[:,['A','B','D']].loc['a':'c', :]

iloc[ ] – Selection by location#

The iloc[ ] operator works with the same rules as the selection operator [] for lists in Python. That is, it focuses on the positions of the elements expressed by sequences of integers from 0 to N-1 with N as the length of the Index, as shown as follows:

iloc[ row number, column number]

So, if we want to select a single element, it will be enough to add its numerical coordinates of the corresponding position, as shown as follows:

df.iloc[1,3]
1

Whereas, if we want to select a subset of the DataFrame, we can use the slicing technique by specifying the ranges of the numerical indices if they are adjacent, as shown as follows:

df.iloc[0:2,1:3]
B C
a 2 4
b 0 5

And if the elements are not adjacent, we can replace slices with lists, as shown as follows:

df.iloc[[1],[0,2]]
A C
b 4 5
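A difference worth remembering between the two indexers: loc[ ] slices include both endpoints, while iloc[ ] slices follow the usual Python convention and exclude the endpoint. A minimal comparison (using a small example DataFrame of our own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=['a', 'b', 'c', 'd'],
                  columns=['A', 'B', 'C', 'D'])

# loc[] slicing includes both endpoints: rows 'a', 'b' and 'c'
print(df.loc['a':'c', 'A'].shape)   # (3,)

# iloc[] slicing excludes the endpoint: only rows 0 and 1
print(df.iloc[0:2, 0].shape)        # (2,)
```

Keeping this asymmetry in mind avoids a common off-by-one mistake when switching between label-based and position-based selection.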

Selection on dtype#

So far, we have seen that the selections can be made by subsetting by exploiting the indexing of the DataFrame. But there are also other selection methods, for instance, based on the values contained in the DataFrame. For example, selections can be made based on the type of data (dtype). To make this type of selection, we do not use an operator, but a method called select_dtypes(). This method accepts the following two possible parameters:

  • include

  • exclude

A list containing the types of data (dtype) to be included or excluded in the selection will be passed on to these parameters. In this regard, we create a DataFrame containing different types of data for each column, as shown as follows:

df = pd.DataFrame([[6,0.3,'one', True],[2,5.1,'two', False],[1,4.3,'three', True]], 
                     index=['a','b','c'], 
                     columns=['A','B','C','D'])
df
A B C D
a 6 0.3 one True
b 2 5.1 two False
c 1 4.3 three True

To obtain the list of the dtypes of each single column of the DataFrame, we can call the dtypes attribute, as shown as follows:

df.dtypes
A      int64
B    float64
C     object
D       bool
dtype: object

At this point, based on dtypes, we can select by passing, for example, a list containing the dtypes we want in the include parameter, as shown as follows:

df2 = df.select_dtypes(include=['bool','int64'])
df2
A D
a 6 True
b 2 False
c 1 True

If, on the other hand, we want to select by exclusion, we use the exclude parameter instead with a list containing the dtypes we want to exclude, as shown as follows:

df2 = df.select_dtypes(exclude=['bool','int64'])
print(df2)
     B      C
a  0.3    one
b  5.1    two
c  4.3  three

Filtering#

We saw in the previous section how it is possible to select elements within a DataFrame (or even a Series) by specifying appropriate spatial indicators, such as labels or numerical position indices. In this section, we will see a very similar operation, which also selects, but is totally focused on the values of the elements – filtering. In this case, we will set particular conditions designed to select only the elements whose values satisfy them. This amounts, therefore, to filtering the values of a DataFrame.

The Boolean condition#

The Filtering technique is, therefore, based on a particular condition, called a Boolean condition, which is nothing more than a Boolean expression, that is, a mathematical expression that results in True or False (Boolean values). An example of a Boolean expression could be as follows:

x > 5 (values greater than 5)

If this condition is true, it will return True, otherwise False. These Boolean expressions are characterized by the use of comparison operators (<, >, !=, ==, <=, >=), which compare two values and set a condition that can either be verified or not. In the case of DataFrames, this Boolean condition is applied to the values contained in a particular subset of the DataFrame, establishing which part (column or row) responds to the condition and returning a Series of Boolean values (True or False). The final result is a filtering that returns a DataFrame that preserves the same structure as the original, but includes only some of its rows or columns.

Filtering on rows#

The most intuitive filtering is filtering on rows, that is, the one that passes only some rows of a DataFrame based on a Boolean condition. We first create a DataFrame of random values between 0 and 1, distributed over 4 columns, as follows:

df = pd.DataFrame(np.random.rand(40).reshape(10,4), columns = ['A','B','C','D'])
print(df)
          A         B         C         D
0  0.760557  0.860962  0.149834  0.957953
1  0.113805  0.154022  0.935985  0.283209
2  0.025479  0.234475  0.511874  0.379375
3  0.527274  0.699410  0.539590  0.290462
4  0.626130  0.267883  0.238626  0.104339
5  0.425794  0.859567  0.394352  0.891082
6  0.951322  0.377176  0.427899  0.778968
7  0.678811  0.105444  0.327223  0.210827
8  0.292881  0.405660  0.538095  0.296994
9  0.831758  0.387747  0.294632  0.869651

Now, we will define a Boolean condition, based on which the selection will be made. For example, we want to filter only those rows of the DataFrame that have a value of column A greater than 0.5. Since these are floating-point numbers, let’s add some significant digits, as follows:

df['A'] > 0.500000
0     True
1    False
2    False
3     True
4     True
5    False
6     True
7     True
8    False
9     True
Name: A, dtype: bool

As an indexing operator, we use the one represented by square brackets [ ]. This operator will pass only the rows corresponding to the True values, based on the indices of the Series generated by the Boolean condition, as follows:

df2 = df[ df['A'] > 0.500000 ]
df2
A B C D
0 0.760557 0.860962 0.149834 0.957953
3 0.527274 0.699410 0.539590 0.290462
4 0.626130 0.267883 0.238626 0.104339
6 0.951322 0.377176 0.427899 0.778968
7 0.678811 0.105444 0.327223 0.210827
9 0.831758 0.387747 0.294632 0.869651

But you can also use other indexing operators, such as indexer loc[], as shown as follows:

df2 = df.loc[ df['A'] > 0.500000]
df2
A B C D
0 0.760557 0.860962 0.149834 0.957953
3 0.527274 0.699410 0.539590 0.290462
4 0.626130 0.267883 0.238626 0.104339
6 0.951322 0.377176 0.427899 0.778968
7 0.678811 0.105444 0.327223 0.210827
9 0.831758 0.387747 0.294632 0.869651

The result is the same.

Filtering on columns#

The complementary case to the previous one is filtering on columns. In this case, only those columns of the DataFrame whose values meet certain requirements are kept. Again, we will define a Boolean condition. For example, we want to select only those columns whose mean value is greater than 0.5, as follows:

df.mean() > 0.500000
A     True
B    False
C    False
D     True
dtype: bool
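To complete the filtering, this Boolean Series can be passed to the loc[ ] indexer on the column axis, keeping all the rows. A minimal sketch (the random seed is ours, added only for reproducibility):

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # seed chosen only to make the example reproducible
df = pd.DataFrame(np.random.rand(40).reshape(10, 4),
                  columns=['A', 'B', 'C', 'D'])

# pass the Boolean Series to loc[] on the column axis:
# only the columns whose mean exceeds 0.5 are kept
df2 = df.loc[:, df.mean() > 0.5]
print(df2.columns.tolist())
```

Note that the condition is evaluated per column (df.mean() returns one value per column), so the mask aligns with the column labels rather than the row index.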

Application of multiple conditions#

In Filtering, you can apply several Boolean conditions at the same time, combining them with the logical operators AND, OR, and NOT. In the Pandas library, these logical operators are expressed in the following way:

  • & (and)

  • | (or)

  • ~ (not)

So, every single condition is enclosed in parentheses and joined to the others by the previous logical operators. We can express joint Boolean conditions in the following way:

(df['A'] < 0.300000) | (df['A'] > 0.700000)
0     True
1     True
2     True
3    False
4    False
5    False
6     True
7    False
8     True
9     True
Name: A, dtype: bool

Apply it inside an indexing operator or an indexer, as shown as follows:

df2 = df[(df['A'] < 0.300000) | (df['A'] > 0.700000)]
df2
A B C D
0 0.760557 0.860962 0.149834 0.957953
1 0.113805 0.154022 0.935985 0.283209
2 0.025479 0.234475 0.511874 0.379375
6 0.951322 0.377176 0.427899 0.778968
8 0.292881 0.405660 0.538095 0.296994
9 0.831758 0.387747 0.294632 0.869651
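The ~ (not) operator works the same way, inverting a Boolean Series. By De Morgan's laws, the condition above can be rewritten through negation; a minimal sketch (the random seed is ours, for reproducibility only):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # for reproducibility only
df = pd.DataFrame(np.random.rand(40).reshape(10, 4),
                  columns=['A', 'B', 'C', 'D'])

# ~ negates a condition: keep the rows where A is NOT in [0.3, 0.7],
# equivalent to (df['A'] < 0.3) | (df['A'] > 0.7)
df2 = df[~((df['A'] >= 0.3) & (df['A'] <= 0.7))]
print(df2['A'])
```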

Boolean reductions#

One topic related to the concept of filtering is Boolean reductions. There are a number of methods, called reduction methods, which reduce a whole set of Boolean values to a single result, as follows:

  • any()

  • all()

Let us first create a new DataFrame of random values, as follows:

df = pd.DataFrame(np.random.rand(40).reshape(10,4), columns = ['A','B','C','D'])
df
A B C D
0 0.175424 0.071189 0.130327 0.601100
1 0.759364 0.192506 0.832615 0.976395
2 0.321767 0.156089 0.325699 0.910255
3 0.647416 0.963991 0.995218 0.157431
4 0.128737 0.992937 0.866832 0.475004
5 0.186344 0.135137 0.650678 0.734902
6 0.712324 0.360990 0.322299 0.567984
7 0.131138 0.241094 0.491421 0.773954
8 0.124443 0.409533 0.026339 0.376774
9 0.165771 0.687265 0.677189 0.299781

But, let’s see some examples to clarify their function. We apply a Boolean condition in which all the elements of the DataFrame are evaluated. For example, we want to know which elements of the DataFrame have a value greater than 0.1, as shown as follows:

df2 = df > 0.1
df2
A B C D
0 True False True True
1 True True True True
2 True True True True
3 True True True True
4 True True True True
5 True True True True
6 True True True True
7 True True True True
8 True True False True
9 True True True True

In this case, we got a DataFrame full of Boolean values. A reduction method can be applied to this DataFrame, such as the all() method, to know whether, within each column, all the values meet the requirement (all True), as follows:

(df > 0.1).all()
A     True
B    False
C    False
D     True
dtype: bool

Exercise#

What do the following commands do?

  1. (df > 0.1).any()

  2. (df > 0.1).any().any()

Filtering with isin()#

Another method of Filtering is isin( ). This method lets you know if one or more elements are present inside an object such as a DataFrame or a Series. In this regard, we define an example DataFrame, with predefined values, as shown as follows:

df = pd.DataFrame(np.arange(9).reshape(3,3), columns = ['A','B','C'])
df
A B C
0 0 1 2
1 3 4 5
2 6 7 8

Now, let’s apply the isin() method with some values to search for, as follows:

df2 = df.isin([0,3])
df2
A B C
0 True False False
1 True False False
2 False False False

Then, passing this Boolean DataFrame to the indexing operator, you get a filtering of the values: only those corresponding to True are passed, while all the others, corresponding to False, are evaluated as NaN, as follows:

df[df2]
A B C
0 0.0 NaN NaN
1 3.0 NaN NaN
2 NaN NaN NaN

At this point, we can replace these NaN values with other values, or remove them entirely from the DataFrame, keeping only the matched elements. In the following example, dropna(thresh=1) drops the rows that do not contain at least one valid value, and dropna(axis=1) then drops the columns still containing NaN, as follows:

df2 = df[df.isin([0,3])].dropna(thresh=1).dropna(axis=1)
df2
A
0 0.0
1 3.0
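The isin() method also accepts a dictionary, which restricts the search values per column instead of applying them everywhere. A minimal sketch (the per-column values are our own example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['A', 'B', 'C'])

# with a dict, each key is a column label and each value is the list
# of values searched only in that column: 0 and 3 in 'A', 7 in 'B';
# columns not mentioned (here 'C') are entirely False
mask = df.isin({'A': [0, 3], 'B': [7]})
print(mask)
```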

Adding, inserting, and deleting a column in a DataFrame#

As you will have understood by now, because of the way DataFrames are structured internally, you can think of the columns as separate Series, assembled together to form a data table. Editing operations that work on the columns of a DataFrame are therefore somewhat easier.

Editing#

Let us start with a simple test DataFrame, as follows:

df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], 
                     index=['a','b','c'], 
                     columns=['A','B','C'])
df
A B C
a 1 2 3
b 4 5 6
c 7 8 9

Adding a column at the end is a fairly simple operation. It is sufficient to specify a new label within the indexing operator, as if selecting a column that does not yet exist, and assign it a value. A new column will automatically be created and added after the last existing column. For example, if we assign a scalar value to a non-existent column, we will get a column entirely filled with that number, as follows:

df['D'] = 17 
df
A B C D
a 1 2 3 17
b 4 5 6 17
c 7 8 9 17

From now on, if you select this column, the behavior will be the normal selection behavior, where the returned value will be a Series, as shown as follows:

df['D'] 
a    17
b    17
c    17
Name: D, dtype: int64

If we wanted to create a new column with all different values, instead of a single scalar, we can pass a list containing all the values, as shown as follows:

df['E'] = [13,14,15]
df
A B C D E
a 1 2 3 17 13
b 4 5 6 17 14
c 7 8 9 17 15

The same technique can be used to update the values of an already existing column. For example, passing a list of values to an already existing column will replace those already present, as shown as follows:

df['D'] = ['one','two', 'three']
df
A B C D E
a 1 2 3 one 13
b 4 5 6 two 14
c 7 8 9 three 15

To assign values to a column, you can use an existing Series instead of a list. In this case, however, you must pay close attention to the correspondence between the labels of the Series and those of the DataFrame, as follows:

sr = pd.Series([10,11,12],index=['a','b','c'])
df['D'] = sr 
df
A B C D E
a 1 2 3 10 13
b 4 5 6 11 14
c 7 8 9 12 15

Another very useful operation is the removal of a column from a DataFrame. The del keyword exists in Python that can be used for this purpose, as shown as follows:

del df['E']
df
A B C D
a 1 2 3 10
b 4 5 6 11
c 7 8 9 12

In this case, the column is permanently removed from the DataFrame, but there may be cases where you want to preserve the removed column in some way. The pop() method allows you to remove the column from the DataFrame and return it as a Series that can be stored in a new variable, as follows:

sr = df.pop('D')  
print(type(sr))   
df
<class 'pandas.core.series.Series'>
A B C
a 1 2 3
b 4 5 6
c 7 8 9

Another useful method for editing a column within a DataFrame is the insert() method. This method inserts a new column into a DataFrame at a predetermined position. For example, we can insert the Series we extracted earlier with the pop() method back into the DataFrame at position 1, assigning it the label 'Z'. All this information is passed as arguments to the insert() method, as follows:

df.insert(1, 'Z', sr) 
df
A Z B C
a 1 10 2 3
b 4 11 5 6
c 7 12 8 9

The same insert() method can also be used to copy a column within a DataFrame. For example, in the previous case, the column to be copied is passed instead of an external Series, as shown as follows:

df.insert(3, 'B2', df['B'])
df
A Z B B2 C
a 1 10 2 2 3
b 4 11 5 5 6
c 7 12 8 8 9

Adding, inserting, and deleting a row in a DataFrame#

As for the editing of rows, a few small considerations are needed to understand how to carry out these operations, which are very similar to the case of columns. First, let’s start again with a test DataFrame, as follows:

df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], 
                     index=['a','b','c'], 
                     columns=['A','B','C'])
df
A B C
a 1 2 3
b 4 5 6
c 7 8 9

We saw that when working with columns, we are at an advantage because we largely exploit the internal structure of the DataFrame, which stores the columns as Series. So, just call a column with the corresponding label to get a Series, as follows:

df['A']
a    1
b    4
c    7
Name: A, dtype: int64

This design is pushed so far into the library’s syntax that the columns can be accessed as real attributes, without even using the indexing operator, as follows:

df.A
a    1
b    4
c    7
Name: A, dtype: int64

You can then add a row in the same way as for columns. For example, by assigning a scalar value to a selection of a non-existing row, a new row is added at the bottom of the DataFrame, whose values are the repetition of that scalar value, and the label is the one passed in the selection operator, as shown as follows:

df.loc['d'] = 12 
df
A B C
a 1 2 3
b 4 5 6
c 7 8 9
d 12 12 12

Or, you can change it by using a list of values, as shown as follows:

df.loc['d'] = [10,11,12]
df
A B C
a 1 2 3
b 4 5 6
c 7 8 9
d 10 11 12

Or, instead of a list, you can change it by using an already existing Series, as shown as follows:

sr = pd.Series([13,14,15], index=['A','B','C'], name="e") # add a row with a Series
df.loc['d'] = sr
df
A B C
a 1 2 3
b 4 5 6
c 7 8 9
d 13 14 15
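Rows can also be appended by concatenation, which is convenient when the new row comes from another DataFrame; unlike the loc[ ] assignment, this returns a new object and leaves the original untouched. A sketch using pd.concat() (the small DataFrame and the label 'c' are our own example):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                  index=['a', 'b'], columns=['A', 'B', 'C'])

# wrap the new row in a one-row DataFrame, then concatenate;
# pd.concat() returns a new DataFrame, the original is unchanged
new_row = pd.DataFrame([[7, 8, 9]], index=['c'], columns=['A', 'B', 'C'])
df2 = pd.concat([df, new_row])
print(df2)
```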

Then, to delete a row, use the drop() method with axis=0, as follows:

df2 = df.drop(['a'],axis=0) 
df2
A B C
b 4 5 6
c 7 8 9
d 13 14 15

Whereas, to delete a column, use axis=1, as follows:

df3 = df.drop(['A'],axis=1) 
df3
B C
a 2 3
b 5 6
c 8 9
d 14 15

If we want to delete multiple rows or columns at the same time, just add the corresponding labels in the passed list as an argument, as shown as follows:

df2 = df.drop(['a','c'],axis=0) 
df2
A B C
b 4 5 6
d 13 14 15

Adding new columns with assign()#

Among the methods applicable to DataFrames that have editing functions, there is the assign() method, which allows you to easily create new columns derived from those already present. For example, we can add a new column starting from the values of an already existing column, multiplying it by 2. First, we create a test DataFrame, as follows:

df = pd.DataFrame(np.random.randint(1,10,9).reshape(3,3), 
                     index=['a','b','c'], 
                  columns=['A','B','C'])
df
A B C
a 4 7 9
b 2 1 4
c 9 1 3

The assign() method does not modify the original DataFrame but creates a new copy of it as a return value. So, the result can be assigned to a new variable, or reassigned to the original DataFrame, modifying it. The choice of what to do is therefore up to the user, as follows:

df2 = df.assign(D = df['A']*2)
df2
A B C D
a 4 7 9 8
b 2 1 4 4
c 9 1 3 18
df = df.assign(D = df['A']*2)
df
A B C D
a 4 7 9 8
b 2 1 4 4
c 9 1 3 18

But the assign() method can also be used to overwrite an existing column, which is often very useful for correcting or modifying the data within the DataFrame, as shown as follows:

df = df.assign(D = np.sqrt(df['A']))
df
A B C D
a 4 7 9 2.000000
b 2 1 4 1.414214
c 9 1 3 3.000000
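In recent versions of pandas, assign() also accepts callables, which makes it possible to chain derived columns in a single call; each callable receives the DataFrame as built up to that point. A minimal sketch (the column names are our own example):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], index=['a', 'b'], columns=['A', 'B'])

# each lambda receives the DataFrame as built so far,
# so 'D' can be derived from the freshly created column 'C'
df2 = df.assign(C=lambda d: d['A'] + d['B'],
                D=lambda d: d['C'] * 2)
print(df2)
```

As with a plain assign(), the original df is left unmodified.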

Descriptive Statistics#

Pandas is a library designed for data analysis, and the DataFrames are the basic objects on which all the related activities are carried out. So having an idea of their contents and their structure at the outset is important in order to better target the operations to be carried out. Among the basic activities, there is, therefore, the calculation of general statistics that can somehow give an idea of the values contained within the DataFrame under analysis. In this section, we will see many basic features that the Pandas library offers to obtain this type of descriptive data.

The describe() method#

Inside the Pandas library, there is a describe() method that calculates a whole series of descriptive statistics of the reference DataFrame, as shown as follows:

df = pd.DataFrame(np.random.rand(40).reshape(10,4))
df
0 1 2 3
0 0.124333 0.500606 0.336289 0.751724
1 0.311459 0.613847 0.205989 0.659264
2 0.102582 0.136974 0.963274 0.137202
3 0.402261 0.601782 0.866679 0.839778
4 0.149013 0.558864 0.106777 0.266392
5 0.724342 0.916766 0.013880 0.041685
6 0.015793 0.297276 0.524495 0.456131
7 0.163569 0.736776 0.616999 0.558636
8 0.959790 0.062633 0.024861 0.997334
9 0.388422 0.049508 0.856516 0.381187

It is launched directly on the DataFrame to be analyzed, and the result is a table of descriptive statistics carried out at the column level and reported in the structure of a DataFrame, as shown below

df.describe() 
0 1 2 3
count 10.000000 10.000000 10.000000 10.000000
mean 0.334156 0.447503 0.451576 0.508933
std 0.300767 0.297415 0.364685 0.310277
min 0.015793 0.049508 0.013880 0.041685
25% 0.130503 0.177049 0.131580 0.295091
50% 0.237514 0.529735 0.430392 0.507384
75% 0.398801 0.610831 0.796637 0.728609
max 0.959790 0.916766 0.963274 0.997334

The statistics are computed column by column and returned in the form of a DataFrame. Among the reported values, we see the element count, the mean and the standard deviation, the maximum and minimum values, and three percentiles. This information is reported only for the columns of the DataFrame that contain numeric values, as shown below:

df2 = pd.DataFrame([[1,'one', True],[2,'two',False],
                    [3,'three',True],[4,'four',False],[5,'five', False]], 
                   columns=['numb','word','bool'])
df2.describe()
numb
count 5.000000
mean 3.000000
std 1.581139
min 1.000000
25% 2.000000
50% 3.000000
75% 4.000000
max 5.000000
df2[['word','bool']].describe()
word bool
count 5 5
unique 5 2
top one False
freq 1 3

So, in a DataFrame with many columns, it is good practice to be able to select only some columns (or exclude the ones that don’t interest us) without having to first make a subsetting selection and then apply describe(), as done before. In this regard, there are two parameters, include and exclude, which allow us to specify explicitly which columns should be included in or excluded from the calculation of descriptive statistics. The peculiarity is that you do not pass lists containing labels, but dtypes.

df2.describe(include=['int'])
numb
count 5.000000
mean 3.000000
std 1.581139
min 1.000000
25% 2.000000
50% 3.000000
75% 4.000000
max 5.000000
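The describe() method also accepts include='all', which reports the statistics for every column at once, mixing the numeric and non-numeric tables seen above; fields that do not apply to a given column are filled with NaN. A minimal sketch:

```python
import pandas as pd

df2 = pd.DataFrame([[1, 'one', True], [2, 'two', False],
                    [3, 'three', True]],
                   columns=['numb', 'word', 'bool'])

# include='all' merges numeric statistics (mean, std, ...) and
# categorical ones (unique, top, freq) into a single table
stats = df2.describe(include='all')
print(stats)
```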

There are several methods for calculating descriptive statistics and other related operations on Series and DataFrames. Those mainly taken into consideration are a whole series of methods taken from the NumPy library that are also applicable to DataFrames. Many of these methods produce aggregations, that is, they produce a smaller result by agglomerating a group of values (rows or columns) into a single result. Aggregating functions are, for example, sum(), mean(), and quantile(), which, if applied to a DataFrame, aggregate its values returning a Series. Other methods of this type produce results of the same size as the input, for example, functions like cumsum() and cumprod(). Generally, these methods take an axis parameter passed as an argument to determine whether the calculation should be evaluated by column (axis = 0) or by row (axis = 1). The statistical evaluation by column is the default, and you can also omit the axis parameter. For example, if you want to calculate the mean of the values in each column, you can simply use the mean() method, as follows:

df.mean(0) #df.mean(axis=0) #df.mean()
0    0.334156
1    0.447503
2    0.451576
3    0.508933
dtype: float64

In fact, we obtain as many results as there are columns in the DataFrame. Whereas, if we want to carry out the statistical evaluation by row, the value 1 must be specified (with or without the axis keyword), as follows:

df.mean(1) #df.mean(axis=1)
0    0.428238
1    0.447640
2    0.335008
3    0.677625
4    0.270262
5    0.424168
6    0.323424
7    0.518995
8    0.511154
9    0.418908
dtype: float64
The same applies to the other statistical methods, such as std() for the standard deviation, as follows:

df.std()
0    0.300767
1    0.297415
2    0.364685
3    0.310277
dtype: float64
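As mentioned earlier, methods such as cumsum() and cumprod() do not aggregate: they return a result of the same size as the input. A quick sketch of cumsum() (on a small example DataFrame of our own):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['A', 'B'])

# cumsum() preserves the DataFrame's shape: each value is replaced
# by the running total of its column (axis=0 by default)
print(df.cumsum())
```

Here column A becomes 1, 4, 9 and column B becomes 2, 6, 12.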

The standardization of data#

To conclude this section, let’s add data standardization. This technique is widely used in statistics and data analysis to make different data series better comparable to each other, making them “standard”. These data groups are somehow “aligned” statistically in order to compare them more easily and highlight their differences. In practical terms, standardization consists of making the mean of each group of values equal to 0 and its standard deviation equal to 1, as follows:

std = ( df - df.mean())/ df.std()
print(std.mean())
print(std.std())
0   -2.137179e-16
1   -1.332268e-16
2   -1.554312e-16
3   -1.776357e-16
dtype: float64
0    1.0
1    1.0
2    1.0
3    1.0
dtype: float64

Transposition, Sorting and Reindexing#

In this final section, we will see some of the techniques that somehow rearrange the internal structure of the DataFrame by acting at the index level. In fact, the order of the labels within a DataFrame is not immutable but can be changed during the analysis. There are, therefore, some techniques that act at this level including transposition, sorting, and reindexing.

Transposition#

The transposition of a DataFrame is equivalent to the mathematical operation of the transposition of a matrix. In this operation, the rows become the columns and vice versa. Applying a transposition to a DataFrame will, therefore, have the exchange of labels between index and columns, with the consequent reordering of the values related to them, as shown as follows:

df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], 
                     index=['a','b','c'], 
                     columns=['A','B','C'])
df
A B C
a 1 2 3
b 4 5 6
c 7 8 9

To carry out the transposition of a DataFrame, it is sufficient to recall the attribute T, as shown as follows:

df.T
a b c
A 1 4 7
B 2 5 8
C 3 6 9

We, thus, obtain the transposed DataFrame.
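Since transposing simply swaps the two axes, applying it twice returns the original DataFrame. A quick check:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  index=['a', 'b', 'c'],
                  columns=['A', 'B', 'C'])

# Transposing swaps index and columns ...
dft = df.T
# ... and transposing again restores the original shape and labels.
roundtrip = dft.T

print(dft)
```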

Sorting#

Often, when loading a DataFrame from a data source, the order of the rows or columns is not the desired one. Or, during the various analysis phases, we may wish to sort the values of a specific column in ascending or descending order. All these sorting operations can be summarized in the following three types:

  • Sorting by labels

  • Sorting by values

  • Sorting by a combination of both

All of these sorting types are feasible with the Pandas library. For this purpose, we will create a DataFrame with the index and columns labels arranged in an unordered way (which is practically the case for almost all real DataFrames), as follows:

df = pd.DataFrame(np.random.randint(10, size=16).reshape(4,4), 
                     index=['b','d','c','a'], 
                     columns=['C','D','A','B'])
df
C D A B
b 5 5 7 2
d 9 1 2 5
c 6 1 4 1
a 7 2 2 4

Sorting by label#

Sorting by labels, or sorting by index, sorts the labels in index or columns based on their value (numeric or string), and is performed efficiently by the sort_index() method. By default, this method arranges the index labels of the DataFrame in ascending order. The values on each row follow the same reordering, as follows:

df2 = df.sort_index()
df2
C D A B
a 7 2 2 4
b 5 5 7 2
c 6 1 4 1
d 9 1 2 5

As we can see from the result, the rows of the DataFrame have been reordered following the ascending order of the index labels. The sorting, therefore, acted by row, while the order of the columns remained unchanged. If, on the other hand, we need to sort the labels in descending order, simply add the ascending parameter set to False (by default, ascending = True), as follows:

df2 = df.sort_index(ascending=False)
df2
C D A B
d 9 1 2 5
c 6 1 4 1
b 5 5 7 2
a 7 2 2 4

If we want to sort the labels in columns in ascending order, it will be necessary to explicitly define the parameter axis = 1 (by default axis = 0), as shown as follows:

df2 = df.sort_index(axis=1)
df2
A B C D
b 7 2 5 5
d 2 5 9 1
c 4 1 6 1
a 2 4 7 2

As you can see from the result, this time, it will be the column labels that will be sorted in the ascending order. The same thing will be done for the corresponding values on the columns. Also in this case, if we want to sort in the descending order, we explicitly set ascending = False, as shown as follows:

df2 = df.sort_index(axis=1, ascending=False)
df2
D C B A
b 5 5 2 7
d 1 9 5 2
c 1 6 1 4
a 2 7 4 2

If we want to sort by both row labels and column labels, we can do it safely by chaining the methods. We sort the DataFrame first by row, and then later by column, or vice versa, as shown as follows:

df2 = df.sort_index(axis=0).sort_index(axis=1)
df2
A B C D
a 2 4 7 2
b 7 2 5 5
c 4 1 6 1
d 2 5 9 1
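In addition, sort_index() accepts an optional key parameter (available in pandas 1.1 and later), a function applied to the labels before comparison. A minimal sketch, using it for case-insensitive sorting:

```python
import pandas as pd

df = pd.DataFrame({'val': [1, 2, 3]}, index=['b', 'C', 'a'])

# A plain sort_index() would give ['C', 'a', 'b'], because uppercase
# letters sort before lowercase ones. With key=, each label is
# transformed (here lowercased) before comparison (pandas >= 1.1).
df2 = df.sort_index(key=lambda idx: idx.str.lower())

print(df2)
```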

Sorting by values#

Another way to sort a DataFrame is to sort the rows or columns according to their contained values. Sorting by values is done by the sort_values() method. In order to work, this method needs a mandatory parameter that must be explicitly set: the by parameter, which must be set to a column label. The values within that column will then be sorted in ascending order, as follows:

df2 = df.sort_values(by='A')
df2
C D A B
d 9 1 2 5
a 7 2 2 4
c 6 1 4 1
b 5 5 7 2

As can be seen from the result, the DataFrame has the values of column A in ascending order. Consequently, the order of the corresponding rows of the DataFrame has also changed to follow that order. The by parameter can also accept a list of column labels. The sort_values() method will first sort the values of the column corresponding to the first label in the list. Then, within the runs of equal values in that column, it will sort on the column corresponding to the second label in the list, and so on, as follows:

df['A'] = [0,0,2,2]
df2 = df.sort_values(by=['A','B'])
print(df2)
   C  D  A  B
b  5  5  0  2
d  9  1  0  5
c  6  1  2  1
a  7  2  2  4

As we can see from the result, the values of column B are sorted only within the runs of identical values in column A. Although this may seem trivial in this simple example, this combined sorting is very useful in DataFrames with many rows, where the values of some columns repeat frequently.
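sort_values() also accepts the ascending parameter, which, when by is a list, can itself be a list of booleans to control the direction per column. A small sketch, reusing the values from the example above:

```python
import pandas as pd

df = pd.DataFrame({'C': [5, 9, 6, 7],
                   'D': [5, 1, 1, 2],
                   'A': [0, 0, 2, 2],
                   'B': [2, 5, 1, 4]},
                  index=['b', 'd', 'c', 'a'])

# Sort column A ascending and, within equal values of A,
# column B descending.
df2 = df.sort_values(by=['A', 'B'], ascending=[True, False])
print(df2)
```

Within A = 0, row d (B = 5) now precedes row b (B = 2); within A = 2, row a (B = 4) precedes row c (B = 1).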

Finally, note that neither the sort_index() method nor the sort_values() method modifies the original DataFrame; both return an ordered copy of it. Any changes made to the copy will not affect the original DataFrame, as follows:

df2 = df.sort_index(axis=0)
df2.loc['a','A'] = 0
df
C D A B
b 5 5 0 2
d 9 1 0 5
c 6 1 2 1
a 7 2 2 4
df2
C D A B
a 7 2 0 4
b 5 5 0 2
c 6 1 2 1
d 9 1 0 5

So, if we want our changes to be permanent, it will be necessary to assign the DataFrame returned by the method back to the starting DataFrame, as follows:

df = df.sort_values(by='A')
print(df)
   C  D  A  B
b  5  5  0  2
d  9  1  0  5
c  6  1  2  1
a  7  2  2  4
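Alternatively, both sorting methods accept an inplace parameter: with inplace=True, the DataFrame itself is sorted and the method returns None. A quick sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 0, 1]}, index=['x', 'y', 'z'])

# With inplace=True, df itself is reordered and nothing is returned.
result = df.sort_values(by='A', inplace=True)

print(df)
```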

Reindexing#

Reindexing is the fundamental method of aligning data in Pandas. Reindexing means conforming the data to match a given set of labels along a particular axis. This involves several operations, which are as follows:

  • Reordering of existing data to respond to the new set of labels

  • The insertion of NaN in the positions where there is no correspondence

  • The possibility of filling the positions of missing data (NaN) with specific values following appropriate rules

Refer to the following:

df = pd.DataFrame(np.random.randint(10, size=16).reshape(4,4), 
                     index=['b','d','c','a'], 
                     columns=['C','D','A','B'])
df
C D A B
b 2 3 8 4
d 7 6 2 6
c 7 0 8 6
a 6 1 2 1

To reindex, use the reindex() method specifying the new order of the labels that the DataFrame must have. The method will not operate on the reference DataFrame, but will return a copy of the DataFrame that will follow the new indexing rules. To select which particular Index object will be reindexed, we will add the axis parameter with its specific value as the function argument, as shown as follows:

  • axis = 0 to edit rows (index)

  • axis = 1 to edit columns (columns)

df2 = df.reindex(['a','b','c','d'],axis=0)
df2
C D A B
a 6 1 2 1
b 2 3 8 4
c 7 0 8 6
d 7 6 2 6

As you can see, the order of the rows of the DataFrame has been completely changed, following the order in the list of labels specified as an argument of the method.

The new DataFrame will be a completely independent copy, and any changes made to it will not affect the original one in the least, as shown as follows:

df2.iloc[1,1] = 0
df2
C D A B
a 6 1 2 1
b 2 0 8 4
c 7 0 8 6
d 7 6 2 6
df
C D A B
b 2 3 8 4
d 7 6 2 6
c 7 0 8 6
a 6 1 2 1

But what happens if we insert labels that are not present in our DataFrame? Or, if we omit some present label? Refer to the following:

df2 = df.reindex(['A','B','C','D'],axis=1)
df2
A B C D
b 8 4 2 3
d 2 6 7 6
c 8 6 7 0
a 2 1 6 1

As we can see, if a new label is inserted during reindexing, the method creates a new column (or row) filled with NaN (missing values). If, instead, an existing label is omitted, the corresponding column (or row) is removed from the returned DataFrame, as follows:

df2 = df.reindex(['A','E','C','D',],axis=1)
df2
A E C D
b 8 NaN 2 3
d 2 NaN 7 6
c 8 NaN 7 0
a 2 NaN 6 1
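The third point in the list above, filling the positions that would otherwise become NaN, can be handled with the fill_value parameter of reindex(). A minimal sketch on a small DataFrame:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]],
                  index=['a', 'b'],
                  columns=['A', 'B'])

# 'E' does not exist in df: without fill_value the new column would
# contain NaN; fill_value replaces those positions with a constant.
df2 = df.reindex(['A', 'E', 'B'], axis=1, fill_value=0)

print(df2)
```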

Reindexing using another DataFrame as a reference#

There is also the possibility of using a reference DataFrame from which to take the order of the indexes (both by row and by column) and apply it to our DataFrame. This is possible with the reindex_like() method. This method will also not change the object that it is called on but will return a copy of the DataFrame as a result. In this regard, we create a reference DataFrame, as shown as follows:

dfo = pd.DataFrame(np.zeros(16).reshape(4,4), 
                     index=['a','b','c','d'], 
                     columns=['A','B','C','D'])
dfo
A B C D
a 0.0 0.0 0.0 0.0
b 0.0 0.0 0.0 0.0
c 0.0 0.0 0.0 0.0
d 0.0 0.0 0.0 0.0

We will apply the reindex_like() method on our DataFrame on which we want to reindex, as shown as follows:

df2 = df.reindex_like(dfo)
df2
A B C D
a 2 1 6 1
b 8 4 2 3
c 8 6 7 0
d 2 6 7 6

In this case too, if the reference DataFrame contains labels that do not exist in the DataFrame being reindexed, the corresponding rows or columns are added to the returned DataFrame filled with NaN (missing values). Conversely, if some labels of our DataFrame are not present in the reference one, the corresponding rows or columns are removed from the returned DataFrame, as follows:

dfo = pd.DataFrame(np.zeros(16).reshape(4,4), 
                     index=['a','b','c','e'], 
                     columns=['A','B','W','Z'])
dfo
A B W Z
a 0.0 0.0 0.0 0.0
b 0.0 0.0 0.0 0.0
c 0.0 0.0 0.0 0.0
e 0.0 0.0 0.0 0.0
df2 = df.reindex_like(dfo)
df2
A B W Z
a 2.0 1.0 NaN NaN
b 8.0 4.0 NaN NaN
c 8.0 6.0 NaN NaN
e NaN NaN NaN NaN

Conclusion#

In this chapter, we dealt with a whole series of basic operations that apply to the DataFrame (and also to the Series) and which greatly help in understanding this type of object and how it works. The concept of selection is very important because it allows us to focus our attention on a subset of the DataFrame. This type of object often has dimensions too large to be considered in its entirety, and when selection alone is not enough, a filtering system comes to our aid, allowing us to keep only the elements that meet particular requirements.