Chapter 6: Basic Data Explorations#

import numpy as np
import pandas as pd

1. Viewing Data#

In this section, we will see how to view the data within a DataFrame in a simple and direct way, using the appropriate methods made available by the Pandas library.

random_numbers = np.random.rand(100).reshape(20,5)
df = pd.DataFrame(random_numbers)
df
0 1 2 3 4
0 0.384515 0.196664 0.334680 0.564900 0.796844
1 0.491713 0.964695 0.741007 0.793240 0.057483
2 0.146637 0.048864 0.505800 0.762624 0.880879
3 0.222989 0.843822 0.951146 0.856485 0.279702
4 0.824133 0.872540 0.780782 0.664067 0.670416
5 0.537655 0.230492 0.990974 0.382503 0.364426
6 0.053221 0.743138 0.150317 0.731637 0.181720
7 0.300088 0.299194 0.282009 0.695885 0.697197
8 0.296056 0.379536 0.126203 0.719643 0.086114
9 0.395368 0.953334 0.221333 0.732485 0.034668
10 0.379021 0.119772 0.707395 0.007248 0.725467
11 0.661449 0.644149 0.552374 0.241897 0.367764
12 0.688481 0.182479 0.813738 0.117468 0.451152
13 0.253444 0.729272 0.853668 0.878688 0.248563
14 0.299979 0.112688 0.075158 0.622681 0.714635
15 0.979649 0.093752 0.175819 0.461655 0.896520
16 0.328408 0.402057 0.689510 0.681640 0.833615
17 0.023958 0.737400 0.183078 0.122998 0.003042
18 0.061089 0.687200 0.059927 0.257995 0.122600
19 0.980161 0.906967 0.910993 0.511182 0.361576

As you can see, the contents of the DataFrame have been displayed, including the Index objects with their labels.

1.a) The .head() and .tail() methods#

To get a view sufficient to roughly understand the structure of a DataFrame and to get an idea of the data it contains, the Pandas library provides us with the two methods head() and tail(). By calling these methods on a reference DataFrame, the first five rows are printed, with the head() method, or the last five rows, with the tail() method, as follows:

df.head()
0 1 2 3 4
0 0.384515 0.196664 0.334680 0.564900 0.796844
1 0.491713 0.964695 0.741007 0.793240 0.057483
2 0.146637 0.048864 0.505800 0.762624 0.880879
3 0.222989 0.843822 0.951146 0.856485 0.279702
4 0.824133 0.872540 0.780782 0.664067 0.670416

As you can see above, the head() method, when used in a Jupyter Notebook, shows the first five rows of the DataFrame with the particular formatting seen previously. The tail() method behaves in the same way, showing the last five rows:

df.tail()
0 1 2 3 4
15 0.979649 0.093752 0.175819 0.461655 0.896520
16 0.328408 0.402057 0.689510 0.681640 0.833615
17 0.023958 0.737400 0.183078 0.122998 0.003042
18 0.061089 0.687200 0.059927 0.257995 0.122600
19 0.980161 0.906967 0.910993 0.511182 0.361576

A related method is sample(), which returns the requested number of rows drawn at random from the DataFrame, as follows:

df.sample(4)
0 1 2 3 4
4 0.824133 0.872540 0.780782 0.664067 0.670416
3 0.222989 0.843822 0.951146 0.856485 0.279702
19 0.980161 0.906967 0.910993 0.511182 0.361576
1 0.491713 0.964695 0.741007 0.793240 0.057483

These two methods allow us to see the first and the last five rows of the DataFrame. This default behavior can be changed by passing the number of rows to be displayed as the argument of either method. For example, if we want to see only the first three rows of the DataFrame, we will get the following result:

df.head(3)
0 1 2 3 4
0 0.384515 0.196664 0.334680 0.564900 0.796844
1 0.491713 0.964695 0.741007 0.793240 0.057483
2 0.146637 0.048864 0.505800 0.762624 0.880879

Row slicing with the indexing operator [ ] achieves a similar result; for example, we can select only the first row:

df[:1]
0 1 2 3 4
0 0.384515 0.196664 0.33468 0.5649 0.796844

2. Selection#

During the creation of DataFrames, we already learned how to select the elements inside them. In this section, we will bring selection to the foreground, deepening a topic that is the basis of all the other features presented in this and subsequent sections.

Since the Pandas library is built on top of the NumPy library, and in its own way extends some of its features, it retains the possibility of selecting the elements of a DataFrame or Series by their integer positions, while extending this concept through the Index objects and their labels. Subsetting can then be done via the following:

  • Indexing: Users can select the data using the labels that distinguish the individual rows and columns, through the Index objects.

  • Integer Positioning: Users can select data using the integer positions of rows and columns, as with NumPy ndarrays.

When you want to select a single element, things are quite simple – you specify the label of the corresponding row and column (indexing), or the position within the DataFrame by means of a pair of integers (integer positioning) – as follows:

df1 = pd.DataFrame(np.random.randint(0,9,16).reshape(4,4),
                           index = ['a','b','c','d'],
                           columns = ['A','B','C','D'])
df1
A B C D
a 0 1 0 8
b 0 7 0 8
c 1 7 8 3
d 7 5 7 6
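
Before examining each operator in detail, here is a minimal sketch contrasting the two approaches on df1; both expressions return the same scalar, the element at row 'b' and column 'D':

df1.loc['b','D']    # indexing: by row and column labels
8
df1.iloc[1,3]       # integer positioning: second row, fourth column
8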

2.a) Selection by Labels: The .loc[ ] method#

The .loc[ ] operator focuses on selection based on the labels contained within the Index objects, with the following syntax:

.loc[row labels, column labels]

If you want to select a single element within the DataFrame, you will first specify the row label and then that of the column, as follows:

df
0 1 2 3 4
0 0.384515 0.196664 0.334680 0.564900 0.796844
1 0.491713 0.964695 0.741007 0.793240 0.057483
2 0.146637 0.048864 0.505800 0.762624 0.880879
3 0.222989 0.843822 0.951146 0.856485 0.279702
4 0.824133 0.872540 0.780782 0.664067 0.670416
5 0.537655 0.230492 0.990974 0.382503 0.364426
6 0.053221 0.743138 0.150317 0.731637 0.181720
7 0.300088 0.299194 0.282009 0.695885 0.697197
8 0.296056 0.379536 0.126203 0.719643 0.086114
9 0.395368 0.953334 0.221333 0.732485 0.034668
10 0.379021 0.119772 0.707395 0.007248 0.725467
11 0.661449 0.644149 0.552374 0.241897 0.367764
12 0.688481 0.182479 0.813738 0.117468 0.451152
13 0.253444 0.729272 0.853668 0.878688 0.248563
14 0.299979 0.112688 0.075158 0.622681 0.714635
15 0.979649 0.093752 0.175819 0.461655 0.896520
16 0.328408 0.402057 0.689510 0.681640 0.833615
17 0.023958 0.737400 0.183078 0.122998 0.003042
18 0.061089 0.687200 0.059927 0.257995 0.122600
19 0.980161 0.906967 0.910993 0.511182 0.361576
df1.loc['a','B']
1

A single element is obtained as a scalar return value. Whereas, if you want to select a subset of adjacent elements, you can use the slicing technique with labels. Note that, unlike Python's integer slicing, slicing by labels includes both endpoints, as follows:

df1.loc['b':'d','A':'C']
A B C
b 0 7 0
c 1 7 8
d 7 5 7
df1.loc['a':'c', :]
A B C D
a 0 1 0 8
b 0 7 0 8
c 1 7 8 3

But the columns and rows to be selected may also not be contiguous. In this case, you can use lists containing the labels to be selected, as follows:

df1.loc[['a','c'], ['A','D']]
A D
a 0 8
c 1 3

So, where we have selections with contiguous rows and columns, it is advisable to specify the range with slicing, and where instead the rows and columns are isolated, the labels are specified in the form of lists. However, mixing a slice and isolated labels within the same row or column selector is not allowed and gives an error. To combine the two, you can join two separate selections, or chain two selectors, as follows:

df1.loc['a':'c', 'A':'B'].join(df1.loc['a':'c', 'D'])
A B D
a 0 1 8
b 0 7 8
c 1 7 3
df1.loc[:,['A','B','D']].loc['a':'c', :]
A B D
a 0 1 8
b 0 7 8
c 1 7 3

2.b) Selection by Location: The .iloc[ ] method#

The .iloc[ ] operator works with the same rules as the selection operator [ ] for lists in Python. That is, it focuses on the positions of the elements, expressed as integers from 0 to N-1, with N the length of the Index, with the following syntax:

.iloc[row numbers, column numbers]

So, if we want to select a single element, it will be enough to pass the numerical coordinates of the corresponding position, as follows:

df1.iloc[1,3]
8

Whereas, if we want to select a subset of the DataFrame, we can use the slicing technique by specifying the ranges of the numerical indices, if they are adjacent, as follows:

df1.iloc[0:2,1:3]
B C
a 1 0
b 7 0

And if the elements are not adjacent, we can replace the slices with lists, as follows:

df1.iloc[[1],[0,2]]
A C
b 0 0
type(df1.iloc[[1],[0,2]])
pandas.core.frame.DataFrame
df1.iloc[1,[0,2]]
A    0
C    0
Name: b, dtype: int64

If you need to select a single row and obtain, as the return value, a DataFrame that preserves the same indexing structure, insert its position within a single-element list; passing it as a plain scalar returns a Series instead, as shown:

type(df1.iloc[1,[0,2]])
pandas.core.series.Series

2.c) Selection on dtype#

So far, we have seen that the selections can be made by subsetting or by exploiting the indexing of the DataFrame. But there are also other selection methods, for instance, based on the values contained in the DataFrame.

For example, selections can be made based on the type of data (dtype). To make this type of selection, we do not use an operator, but a method called select_dtypes(). This method accepts the following two possible parameters:

  • include

  • exclude

df2 = pd.DataFrame([[6,0.3,'one', True],[2,5.1,'two', False],[1,4.3,'three', True]], 
                     index=['a','b','c'], 
                     columns=['A','B','C','D'])
df2
A B C D
a 6 0.3 one True
b 2 5.1 two False
c 1 4.3 three True

To obtain the list of the dtypes of each column of the DataFrame, we can call the dtypes attribute, as follows:

df2.dtypes
A      int64
B    float64
C     object
D       bool
dtype: object

To select only the columns of certain dtypes, we pass a list of them to the include parameter. If, on the other hand, we want to select by exclusion, we use the exclude parameter instead, with a list containing the dtypes we want to exclude, as follows:

df2_ = df2.select_dtypes(include=['bool','int64'])
df2_
A D
a 6 True
b 2 False
c 1 True
df2__ = df2.select_dtypes(exclude=['bool','int64'])
print(df2__)
     B      C
a  0.3    one
b  5.1    two
c  4.3  three
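
As an aside, select_dtypes() also understands generic type names: for example, passing 'number' selects all the numeric columns at once, regardless of whether they are int64 or float64 (note that bool columns are not included), as in this minimal sketch:

df2.select_dtypes(include=['number'])
A B
a 6 0.3
b 2 5.1
c 1 4.3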

3. Filtering#

We saw in the previous section how it is possible to select elements within a DataFrame (or even a Series) by specifying appropriate spatial indicators such as labels or numerical position indices. In this section, we will see another very similar operation, which also selects, but is totally focused on the values of the elements – filtering. In this case, we will set particular conditions designed to select only the elements whose values satisfy them. There is, therefore, a filtering of the values of a DataFrame.

3.a) The Boolean condition#

The filtering technique is, therefore, based on a particular condition, called a Boolean condition, which is nothing more than a Boolean expression, that is, an expression that evaluates to True or False (Boolean values).

An example of a Boolean expression could be as follows: x > 5 (values greater than 5)

If this condition is met, it will return True; otherwise, False. These Boolean expressions are characterized by the use of comparison operators (<, >, !=, ==, <=, >=), which compare two values and set a condition that may or may not be satisfied.

x = 4
x > 5
False

In the case of DataFrames, this Boolean condition is applied to the values contained in a particular subset of the DataFrame, establishing which part (column or row) will respond to this condition and returning a Series of Boolean values (True or False).

The final result is a filtering that returns us a DataFrame that preserves the same structure as the original but that includes only some rows or columns of the original DataFrame.

3.a.1. Filtering on the rows#

The most intuitive filtering is filtering on the rows, that is, filtering that passes only certain rows of a DataFrame based on a Boolean condition. We first create a DataFrame of random values between 0 and 1, distributed over 4 columns, as follows:

df3 = pd.DataFrame(np.random.rand(40).reshape(10,4), columns = ['A','B','C','D'])
df3
A B C D
0 0.775442 0.887318 0.672273 0.002341
1 0.171725 0.640620 0.393520 0.992946
2 0.355619 0.572919 0.427798 0.836267
3 0.827478 0.239240 0.208017 0.336199
4 0.831432 0.857802 0.689031 0.601690
5 0.613891 0.665847 0.860193 0.773083
6 0.055869 0.141251 0.888261 0.723561
7 0.317998 0.106770 0.786794 0.806087
8 0.965558 0.683751 0.111327 0.143663
9 0.968776 0.447863 0.097376 0.595593

Now, we will define a Boolean condition, based on which the selection will be made. For example, we want to filter only those rows of a DataFrame that have the value of column A greater than 0.5.

Since these are floating point numbers, let’s add some significant digits, as follows:

df3['A'] > 0.500000
0     True
1    False
2    False
3     True
4     True
5     True
6    False
7    False
8     True
9     True
Name: A, dtype: bool

As an indexing operator, we use the square brackets [ ]. This operator will then take care of passing only the rows corresponding to the True values, based on the indices of the Boolean Series generated by the condition, as follows:

df3_ = df3[ df3['A'] > 0.500000 ]
df3_
A B C D
0 0.775442 0.887318 0.672273 0.002341
3 0.827478 0.239240 0.208017 0.336199
4 0.831432 0.857802 0.689031 0.601690
5 0.613891 0.665847 0.860193 0.773083
8 0.965558 0.683751 0.111327 0.143663
9 0.968776 0.447863 0.097376 0.595593

But you can also use other indexing operators, such as the indexer .loc[ ], as follows:

df3__ = df3.loc[ df3['A'] > 0.500000]
df3__
A B C D
0 0.775442 0.887318 0.672273 0.002341
3 0.827478 0.239240 0.208017 0.336199
4 0.831432 0.857802 0.689031 0.601690
5 0.613891 0.665847 0.860193 0.773083
8 0.965558 0.683751 0.111327 0.143663
9 0.968776 0.447863 0.097376 0.595593
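
As an aside, the same row filter can also be written compactly with the query() method, which accepts the Boolean condition as a string expression and returns the same DataFrame as above (a minimal sketch):

df3.query('A > 0.5')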

3.a.2. Filtering on columns#

The complementary case to the previous one is filtering on columns. In this case, only those columns of the DataFrame whose values meet certain requirements are passed. Again, we will define a Boolean condition; for example, we want to select only those columns whose mean value is greater than 0.5, as follows:

df3.mean() > 0.500000
A    True
B    True
C    True
D    True
dtype: bool

From the Series returned as a result, we can see which columns meet this condition from the elements with the value True. We apply this Boolean condition to the indexer .loc[ ] to filter on the columns, as follows:

df3__ = df3.loc[:,df3.mean() > 0.500000]
df3__
A B C D
0 0.775442 0.887318 0.672273 0.002341
1 0.171725 0.640620 0.393520 0.992946
2 0.355619 0.572919 0.427798 0.836267
3 0.827478 0.239240 0.208017 0.336199
4 0.831432 0.857802 0.689031 0.601690
5 0.613891 0.665847 0.860193 0.773083
6 0.055869 0.141251 0.888261 0.723561
7 0.317998 0.106770 0.786794 0.806087
8 0.965558 0.683751 0.111327 0.143663
9 0.968776 0.447863 0.097376 0.595593

3.a.3. Application of multiple conditions#

In filtering, you can apply several Boolean conditions at the same time, thanks to the logical operators AND, OR, and NOT. In the Pandas library, these logical operators are expressed in the following way:

  • & ==> (and)

  • | ==> (or)

  • ~ ==> (not)

So, every single condition is enclosed in parentheses and joined to the others by the previous logical operators (the parentheses are necessary, since these operators have a higher precedence than the comparison operators). We can express joint Boolean conditions in the following way:

df3
A B C D
0 0.775442 0.887318 0.672273 0.002341
1 0.171725 0.640620 0.393520 0.992946
2 0.355619 0.572919 0.427798 0.836267
3 0.827478 0.239240 0.208017 0.336199
4 0.831432 0.857802 0.689031 0.601690
5 0.613891 0.665847 0.860193 0.773083
6 0.055869 0.141251 0.888261 0.723561
7 0.317998 0.106770 0.786794 0.806087
8 0.965558 0.683751 0.111327 0.143663
9 0.968776 0.447863 0.097376 0.595593
(df3['A'] < 0.300000) | (df3['A'] > 0.700000)
0     True
1     True
2    False
3     True
4     True
5    False
6     True
7    False
8     True
9     True
Name: A, dtype: bool

We then apply it inside an indexing operator or an indexer, as follows:

df3___ = df3[(df3['A'] < 0.300000) | (df3['A'] > 0.700000)]
df3___
A B C D
0 0.775442 0.887318 0.672273 0.002341
1 0.171725 0.640620 0.393520 0.992946
3 0.827478 0.239240 0.208017 0.336199
4 0.831432 0.857802 0.689031 0.601690
6 0.055869 0.141251 0.888261 0.723561
8 0.965558 0.683751 0.111327 0.143663
9 0.968776 0.447863 0.097376 0.595593

If the same Boolean conditions are often used within the code, it is good practice to first define them as variables and then use them when necessary. For example, by putting the previously used Boolean conditions into variables, you can write the filtering criteria in a concise and readable way inside the indexing operator, as follows:

cond1 = df3['A'] < 0.300000
cond2 = df3['A'] > 0.700000
cond3 = df3.mean() > 0.500000

df3x = df3.loc[cond1 | cond2 ,cond3]
print(df3x)
          A         B         C         D
0  0.775442  0.887318  0.672273  0.002341
1  0.171725  0.640620  0.393520  0.992946
3  0.827478  0.239240  0.208017  0.336199
4  0.831432  0.857802  0.689031  0.601690
6  0.055869  0.141251  0.888261  0.723561
8  0.965558  0.683751  0.111327  0.143663
9  0.968776  0.447863  0.097376  0.595593

3.a.4. Boolean reductions#

One topic related to the concept of filtering is Boolean reductions. There are a number of methods, called reduction methods, which reduce a DataFrame of Boolean values to a Series, or even to a single value, as follows:

  • any()

  • all()

df4 = pd.DataFrame(np.random.rand(40).reshape(10,4), columns = ['A','B','C','D'])
df4
A B C D
0 0.702331 0.449733 0.137058 0.268719
1 0.669042 0.323097 0.388716 0.745808
2 0.178261 0.346203 0.862937 0.833575
3 0.367755 0.888903 0.726919 0.035728
4 0.676982 0.774034 0.300389 0.302350
5 0.543669 0.548670 0.645377 0.487328
6 0.051030 0.600260 0.307622 0.648873
7 0.922387 0.611866 0.694456 0.383340
8 0.605769 0.698918 0.556309 0.120186
9 0.201076 0.836606 0.234699 0.287532

Let’s see some examples to clarify how they work. We apply a Boolean condition in which all the elements of the DataFrame are evaluated; for example, we want to know which elements of the DataFrame have a value greater than 0.1, as follows:

df4_ = df4 > 0.1
df4_
A B C D
0 True True True True
1 True True True True
2 True True True True
3 True True True False
4 True True True True
5 True True True True
6 False True True True
7 True True True True
8 True True True True
9 True True True True

In this case, we got a DataFrame full of Boolean values. A reduction method can be applied to this DataFrame, such as the all() method, to know whether, within each column, all the values meet the requirement (all True), as follows:

(df4 > 0.1).all()
A    False
B     True
C     True
D    False
dtype: bool

The any() method, instead, tells us whether at least one value in each column meets the requirement, as follows:

(df4 > 0.1).any()
A    True
B    True
C    True
D    True
dtype: bool

Applying any() twice reduces the result further, to a single Boolean value that tells us whether the condition holds anywhere in the DataFrame, as follows:

(df4 > 0.1).any().any()
True

3.a.5. Filtering with the .isin( ) method#

Another filtering method is .isin( ). This method lets you know whether one or more elements are present inside an object such as a DataFrame or a Series. In this regard, we define an example DataFrame with predefined values, as follows:

df5 = pd.DataFrame(np.arange(9).reshape(3,3), columns = ['A','B','C'])
df5
A B C
0 0 1 2
1 3 4 5
2 6 7 8

Now, let’s apply the .isin( ) method with some values to search for, as follows:

df5_ = df5.isin([0,3])
df5_
A B C
0 True False False
1 True False False
2 False False False

As we can see, by passing a list of values as an argument to the .isin( ) method applied to the DataFrame, we get a Boolean DataFrame of the same size, where we have True if the value belongs to the list, and False otherwise.

This, too, is a filtering operation. Passing this Boolean DataFrame to the indexing operator, you get a filtering of the values, in which only those corresponding to True are passed, while all the others, corresponding to False, are evaluated as NaN, as follows:

df5[df5_]
A B C
0 0.0 NaN NaN
1 3.0 NaN NaN
2 NaN NaN NaN

The NaN values can then be cleaned up: here, dropna(thresh=1) keeps only the rows with at least one non-NaN value, and dropna(axis=1) then drops the columns that still contain NaN, as follows:

df5__ = df5[df5.isin([0,3])].dropna(thresh=1).dropna(axis=1)
df5__
A
0 0.0
1 3.0
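
In everyday use, .isin( ) is more often applied to a single column, so as to filter whole rows of the DataFrame through the indexing operator, as seen in the previous sections (a minimal sketch):

df5[df5['A'].isin([0,3])]
A B C
0 0 1 2
1 3 4 5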

4. Editing#

In the previous sections, we saw how to select some parts (subsets) of a DataFrame to view them, to extract them as a new object (a new Series or DataFrame), or more often, to manipulate them later, operating on their values in some way. Another basic operation that is carried out is that of editing. That is, once you have created a DataFrame with a constructor, it will often be necessary to make changes to its structure. Sometimes, it will be necessary to append, or even insert inside, new elements (such as rows or columns), and other times, it will be necessary to remove entire rows or columns.

In this section, we will, therefore, deal with the editing of a complex object such as a DataFrame, so that we can then learn how to make all the changes necessary to prepare a DataFrame for analysis.

4.a. Adding, inserting, and deleting a column in a DataFrame#

You will have understood by now that, due to the way DataFrames are structured, you can think of the columns as separate Series, assembled together to form data tables. The editing operations of a DataFrame that work on the columns are, therefore, somewhat easier. Let’s create an example DataFrame like the following:

df6 = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], 
                     index=['a','b','c'], 
                     columns=['A','B','C'])
df6
A B C
a 1 2 3
b 4 5 6
c 7 8 9

Adding a column at the end is a fairly simple operation. It is sufficient to specify a new, not yet existing, label within the indexing operator and assign a value to it in some way. A new column will then automatically be created and added after the last existing column. For example, if we assign a scalar value to a non-existent column, we will get a column entirely filled with that number, as follows:

df6['D'] = 17 
df6
A B C D
a 1 2 3 17
b 4 5 6 17
c 7 8 9 17

From now on, selecting this column shows the normal selection behavior, where the returned value is a Series, as follows:

df6['D'] 
a    17
b    17
c    17
Name: D, dtype: int64

We can also assign a list of values, provided its length matches the number of rows of the DataFrame, as follows:

df6['E'] = [13,14,15]
df6
A B C D E
a 1 2 3 17 13
b 4 5 6 17 14
c 7 8 9 17 15

The same technique can be used to update the values of an already existing column. For example, passing a list of values to an already existing column will replace the values already present, as follows:

df6['D'] = ['one','two', 'three']
df6
A B C D E
a 1 2 3 one 13
b 4 5 6 two 14
c 7 8 9 three 15

To assign values to a column, you can also use an existing Series instead of a list. In this case, however, you must pay close attention to the correspondence between the labels of the Series and those of the DataFrame, as follows:

sr = pd.Series([10,11,12],index=['a','b','c'])
df6['D'] = sr 
df6
A B C D E
a 1 2 3 10 13
b 4 5 6 11 14
c 7 8 9 12 15
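
To see why this correspondence matters, here is a minimal sketch on a throwaway copy (df6c is a hypothetical name): the rows of the DataFrame whose labels are not present in the assigned Series are filled with NaN.

df6c = df6.copy()
df6c['F'] = pd.Series([1.5, 2.5], index=['a','x'])   # 'x' matches no row label
df6c['F']
a    1.5
b    NaN
c    NaN
Name: F, dtype: float64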

Another very useful operation is the removal of a column from a DataFrame. Python's del keyword can be used for this purpose, as follows:

del df6['E']
df6
A B C D
a 1 2 3 10
b 4 5 6 11
c 7 8 9 12

In this case, the column is permanently removed from the DataFrame, but there may be cases where we want to preserve the removed column in some way. The .pop() method allows you to remove the column from the DataFrame and return it as a Series that can be stored in a new variable, as follows:

sr = df6.pop('D')  
print(type(sr))   
df6
<class 'pandas.core.series.Series'>
A B C
a 1 2 3
b 4 5 6
c 7 8 9

Another useful method for editing the columns of a DataFrame is the insert() method. This method inserts a new column into a DataFrame at a predetermined position. For example, we can insert the Series we extracted earlier with the pop() method back into the DataFrame at position 1, giving it the label 'Z'. All this information is passed as arguments to the insert() method, as follows:

df6.insert(1, 'Z', sr) 
df6
A Z B C
a 1 10 2 3
b 4 11 5 6
c 7 12 8 9

The same insert() method can also be used to copy a column within a DataFrame. For example, repeating the previous case, the column to be copied is passed instead of an external Series, as follows:

df6.insert(3, 'B2', df6['B'])
df6
A Z B B2 C
a 1 10 2 2 3
b 4 11 5 5 6
c 7 12 8 8 9
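
As an aside, the drop() method offers a non-destructive alternative to del: it returns a copy of the DataFrame without the indicated columns, leaving the original untouched (a minimal sketch):

df6.drop(columns=['B2'])
A Z B C
a 1 10 2 3
b 4 11 5 6
c 7 12 8 9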

4.b. Adding new columns with assign()#

Among the methods applicable to DataFrames that have editing functions, there is the .assign( ) method, which allows you to easily create new columns, deriving them from those already present. For example, we can add a new column to the DataFrame, starting from the values of an already existing column multiplied by 2, as follows:

df7 = pd.DataFrame(np.random.randint(1,10,9).reshape(3,3), 
                     index=['a','b','c'], 
                     columns=['A','B','C'])
df7
A B C
a 9 8 2
b 6 5 6
c 8 9 9
df7_ = df7.assign(D = df7['A']*2)
df7_
A B C D
a 9 8 2 18
b 6 5 6 12
c 8 9 9 16

The .assign() method does not modify the original DataFrame but returns a new copy of it. So, the result can be assigned to a new variable, as we have just done, or reassigned to the original DataFrame, modifying it. The choice of what to do is, therefore, up to the user, as follows:

df7 = df7.assign(D = df7['A']*2)
df7
A B C D
a 9 8 2 18
b 6 5 6 12
c 8 9 9 16

But the .assign() method can also be used to overwrite an existing column, such as the column D we have just added, which is often very useful for correcting or modifying the data within the DataFrame, as follows:

df7x = df7.assign(D = np.sqrt(df7['A']))
df7x
A B C D
a 9 8 2 3.000000
b 6 5 6 2.449490
c 8 9 9 2.828427

As you can see, in addition to arithmetic operations, functions can also be used to generate the values of a column. Another useful application of the .assign() method is to create new columns containing values generated by special functions; for example, we often need to generate columns containing the same repeated numeric value, as follows:

df7y = df7.assign(D = np.repeat(1,3))
df7y
A B C D
a 9 8 2 1
b 6 5 6 1
c 8 9 9 1
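
As an aside, assign() also accepts callables: the function receives the DataFrame it is being called on, which makes this form convenient in method chains (a minimal sketch, equivalent to the first assign example above):

df7.assign(D = lambda x: x['A']*2)
A B C D
a 9 8 2 18
b 6 5 6 12
c 8 9 9 16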

5. Descriptive Statistics#

Pandas is a library designed for data analysis, and the DataFrames are the basic objects on which all the related activities are carried out. So having an idea of their contents and their structure at the outset is important in order to better target the operations to be carried out. Among the basic activities, there is, therefore, the calculation of general statistics that can somehow give an idea of the values contained within the DataFrame under analysis. In this section, we will see many basic features that the Pandas library offers to obtain this type of descriptive data.

5.a) The .describe( ) method#

Inside the Pandas library, there is a describe() method that calculates a whole series of descriptive statistics on the reference DataFrame, as follows:

df8 = pd.DataFrame(np.random.rand(40).reshape(10,4))
df8
0 1 2 3
0 0.647014 0.237350 0.438642 0.541500
1 0.451720 0.608705 0.858325 0.757686
2 0.266615 0.180158 0.603805 0.668843
3 0.846353 0.590789 0.324114 0.965631
4 0.245728 0.403458 0.452438 0.435021
5 0.315291 0.069496 0.285218 0.759164
6 0.093878 0.025662 0.725876 0.064535
7 0.248943 0.274586 0.794715 0.870641
8 0.754469 0.054196 0.290570 0.315829
9 0.986875 0.381342 0.216245 0.239573

It is launched directly on the DataFrame to be analyzed, and the result is a table of descriptive statistics, calculated at the column level and reported in the structure of a DataFrame, as shown:

df8.describe() 
0 1 2 3
count 10.000000 10.000000 10.000000 10.000000
mean 0.485689 0.282574 0.498995 0.561842
std 0.302740 0.211315 0.232023 0.293510
min 0.093878 0.025662 0.216245 0.064535
25% 0.253361 0.097162 0.298956 0.345627
50% 0.383506 0.255968 0.445540 0.605172
75% 0.727605 0.397929 0.695358 0.758795
max 0.986875 0.608705 0.858325 0.965631
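
Among the rows of the returned table are three percentiles. By default, describe() reports the 25th, 50th, and 75th; the percentiles parameter lets us request different ones (the 50% row is always included), as in this minimal sketch:

df8.describe(percentiles=[0.1, 0.9])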

Each statistic is calculated on every single column, and the results are collected in the returned DataFrame. Among the reported values, we see the element count, the mean and its standard deviation, the maximum and minimum values, and the percentiles. This information is reported only for the columns of the DataFrame that contain numeric values, as shown:

df8_ = pd.DataFrame([[1,'one', True],[2,'two',False],
                    [3,'three',True],[4,'four',False],[5,'five', False]], 
                   columns=['numb','word','bool'])
df8_.describe()
numb
count 5.000000
mean 3.000000
std 1.581139
min 1.000000
25% 2.000000
50% 3.000000
75% 4.000000
max 5.000000

If, on the other hand, there are no columns with numerical values to be calculated, describe() will present, in addition to the number of elements present, other statistics such as the number of unique elements and the most frequent value of every single column, with its frequency, as shown:

df8_[['word','bool']].describe()
word bool
count 5 5
unique 5 2
top one False
freq 1 3
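
If we want both kinds of statistics in a single table, we can pass include='all'; the statistics that do not apply to a column's dtype are filled with NaN (a minimal sketch):

df8_.describe(include='all')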

So, in a DataFrame with many columns, it is good practice to select only a few columns (or exclude the ones that don’t interest us) without having to first make a subsetting selection and then apply describe(), as done before. In this regard, there are two parameters, include and exclude, which allow us to explicitly specify which columns should be included in, or excluded from, the calculation of descriptive statistics. The peculiarity is that you do not pass lists containing labels, but dtypes; that is, the selection is made by data type.

df8_.describe(include=['int'])
numb
count 5.000000
mean 3.000000
std 1.581139
min 1.000000
25% 2.000000
50% 3.000000
75% 4.000000
max 5.000000

Individual statistics can also be calculated directly with dedicated methods, such as mean() and std(). The numeric_only=True parameter restricts the calculation to the numeric columns (note that bool columns are treated as 0/1), while the axis parameter selects whether to calculate per column (axis=0, the default) or per row (axis=1), as follows:

df8_.mean(numeric_only=True, axis=0)   # column means
numb    3.0
bool    0.4
dtype: float64
df8_.mean(numeric_only=True, axis=1)   # row means
0    1.0
1    1.0
2    2.0
3    2.0
4    2.5
dtype: object
df8_.std(numeric_only=True, axis=0)
numb    1.581139
bool    0.547723
dtype: float64

As a practical application, we can standardize the numeric columns by subtracting their mean and dividing by their standard deviation; the standardized columns then have mean 0 and standard deviation 1, while the non-numeric column produces NaN, as follows:

standardized = (df8_ - df8_.mean(numeric_only=True)) / df8_.std(numeric_only=True)
print(standardized.mean())
print(standardized.std())
bool    0.0
numb    0.0
word    NaN
dtype: object
bool    1.0
numb    1.0
word    NaN
dtype: object

6. Transposition, Sorting and Reindexing#

In this final section, we will see some of the techniques that rearrange the internal structure of the DataFrame by acting at the Index level. In fact, the order of the labels within a DataFrame is not immutable but can be changed during the analysis. The techniques that act at this level include transposition, sorting, and reindexing.

6.a. Transposition#

The transposition of a DataFrame is equivalent to the mathematical operation of the transposition of a matrix. In this operation, the rows become the columns and vice versa. Applying a transposition to a DataFrame will, therefore, exchange the labels between index and columns, with the consequent rearrangement of the related values, as follows:

df9 = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], 
                     index=['a','b','c'], 
                     columns=['A','B','C'])
df9
A B C
a 1 2 3
b 4 5 6
c 7 8 9

To carry out the transposition of a DataFrame, it is sufficient to access the attribute T, as follows:

df9.T
a b c
A 1 4 7
B 2 5 8
C 3 6 9
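
One caveat worth keeping in mind: since each column of a DataFrame has a single dtype, transposing a DataFrame with mixed column types upcasts the transposed columns to object. A minimal sketch on the mixed-type df2 defined earlier:

df2.T.dtypes
a    object
b    object
c    object
dtype: object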

6.b. Sorting#

Often, when loading a DataFrame from a data source, it happens that the order of the rows or columns is not the desired one. Or, during the various analysis phases, we may wish to sort the values of a specific column in ascending or descending order. All these sorting operations can be summarized in the following three types:

  • Sorting by labels

  • Sorting by values

  • Sorting by a combination of both

All of these sorting types are feasible with the Pandas library. For this purpose, we will create a DataFrame with the index and column labels arranged in an unordered way (as is practically the case for almost all real DataFrames), as follows:

df10 = pd.DataFrame(np.random.randint(10, size=16).reshape(4,4), 
                     index=['b','d','c','a'], 
                     columns=['C','D','A','B'])
df10
C D A B
b 3 6 5 5
d 0 1 8 0
c 5 9 6 5
a 4 9 2 6

6.b.1. Sorting by label#

Sorting by labels, or sorting by index, means sorting the labels in index or columns based on their value (numeric or string), and it is performed efficiently by the sort_index() method. By default, this method puts the labels of the index in ascending order; the values on each row will follow their label, as follows:

df10_ = df10.sort_index()
df10_
C D A B
a 4 9 2 6
b 3 6 5 5
c 5 9 6 5
d 0 1 8 0

As we can see from the result, the rows of the DataFrame have been reordered following the ascending order of the index labels. We, therefore, have a sort by row, while the order of the columns remains unchanged. If, on the other hand, we need to sort the labels in descending order, simply add the ascending parameter set to False (by default, ascending=True), as follows:

df10__ = df10.sort_index(ascending=False)
df10__
C D A B
d 0 1 8 0
c 5 9 6 5
b 3 6 5 5
a 4 9 2 6

If we want to sort the labels in columns in ascending order, it is necessary to explicitly set the parameter axis=1 (by default, axis=0), as follows:

df10x = df10.sort_index(axis=1)
df10x
A B C D
b 5 5 3 6
d 8 0 0 1
c 6 5 5 9
a 2 6 4 9

Here too, the ascending parameter applies, and the two calls can be chained to sort both rows and columns at once, as follows:

df10y = df10.sort_index(axis=1, ascending=False)
df10y
D C B A
b 6 3 5 5
d 1 0 0 8
c 9 5 5 6
a 9 4 6 2
df10z = df10.sort_index(axis=0).sort_index(axis=1)
df10z
A B C D
a 2 6 4 9
b 5 5 3 6
c 6 5 5 9
d 8 0 0 1

6.b.2. Sorting by values#

Another way to sort a DataFrame is to sort the rows according to their contained values. Sorting by values is done with the sort_values() method. In order to work, this method needs a mandatory parameter, by, which must be set to a column label; the rows will then be sorted so that the values of that column appear in ascending order, as follows:

df10t = df10.sort_values(by='A')
df10t
C D A B
a 4 9 2 6
b 3 6 5 5
c 5 9 6 5
d 0 1 8 0

The by parameter can also accept a list of column labels. The sort_values() method will first sort the column values corresponding to the first label in the list. Then, where equal values are present in this column, it will sort on the column corresponding to the second label in the list, and so on, as follows (a further variant is sketched after the example):

df10['A'] = [0,0,2,2]   # overwrite column A to introduce ties
df10p = df10.sort_values(by=['A','B'])
df10p
C D A B
d 0 1 0 0
b 3 6 0 5
c 5 9 2 5
a 4 9 2 6
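
The ascending parameter is also available here and accepts a list parallel to by, one flag per column; for example, sorting A in ascending and B in descending order (a minimal sketch):

df10.sort_values(by=['A','B'], ascending=[True, False])
C D A B
b 3 6 0 5
d 0 1 0 0
a 4 9 2 6
c 5 9 2 5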

Finally, neither the sort_index() method nor the sort_values() method modifies the original DataFrame; both return an ordered copy of it. Any changes made to the latter will not affect the original DataFrame in the least, as follows:

df10q = df10.sort_index(axis=0)
df10q.loc['a','A'] = 0
df10q
C D A B
a 4 9 0 6
b 3 6 0 5
c 5 9 2 5
d 0 1 0 0
df10r = df10.sort_values(by='A')
df10r
C D A B
b 3 6 0 5
d 0 1 0 0
c 5 9 2 5
a 4 9 2 6

6.c. Reindexing#

Reindexing is the fundamental method of aligning data in Pandas. Reindexing means conforming the data to match a given set of labels along a particular axis. This involves several things, which are as follows:

  • Reordering of existing data to respond to the new set of labels

  • The insertion of NaN in the positions where there is no correspondence

  • Possibility of filling in cases of missing data (NaN) with special data following appropriate rules

df11 = pd.DataFrame(np.random.randint(10, size=16).reshape(4,4), 
                     index=['b','d','c','a'], 
                     columns=['C','D','A','B'])
df11
C D A B
b 7 4 6 9
d 7 8 3 4
c 0 0 5 9
a 9 4 3 2

To reindex, use the reindex() method, specifying the new order of the labels that the DataFrame must have. The method does not operate on the reference DataFrame but returns a copy of the DataFrame that follows the new indexing rules. To select which particular Index object will be reindexed, we add the axis parameter with its specific value as a function argument, as follows:

  • axis = 0 to edit rows (index)

  • axis = 1 to edit columns (columns)

df11_ = df11.reindex(['a','b','c','d'],axis=0)
df11_
C D A B
a 9 4 3 2
b 7 4 6 9
c 0 0 5 9
d 7 8 3 4

The new DataFrame will be a completely independent copy, and any changes made to it will not affect the original one in the least, as follows:

df11_.iloc[1,1] = 0

df11_
C D A B
a 9 4 3 2
b 7 0 6 9
c 0 0 5 9
d 7 8 3 4
df11
C D A B
b 7 4 6 9
d 7 8 3 4
c 0 0 5 9
a 9 4 3 2

The same goes for the columns, as follows:

df11__ = df11.reindex(['A','B','C','D'],axis=1)
df11__
A B C D
b 6 9 7 4
d 3 4 7 8
c 5 9 0 0
a 3 2 9 4

But what happens if we insert labels that are not present in our DataFrame? Or, if we omit a label that is present? Refer to the following (and the sketch after it):

df11x = df11.reindex(['A','E','C','D'],axis=1)
df11x
A E C D
b 6 NaN 7 4
d 3 NaN 7 8
c 5 NaN 0 0
a 3 NaN 9 4
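
As anticipated in the third point at the beginning of this section, the NaN values introduced for the new labels can also be replaced at reindexing time with the fill_value parameter (a minimal sketch):

df11.reindex(['A','E','C','D'], axis=1, fill_value=0)
A E C D
b 6 0 7 4
d 3 0 7 8
c 5 0 0 0
a 3 0 9 4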

6.c.1. Reindexing using another DataFrame as a reference#

There is also the possibility of using a reference DataFrame from which to take the order of the indexes (both by row and by column) and apply it to our DataFrame. This is possible with the reindex_like( ) method. This method, too, will not change the object it is called on but will return a copy of the DataFrame as a result.

df11y = pd.DataFrame(np.zeros(16).reshape(4,4), 
                     index=['a','b','c','d'], 
                     columns=['A','B','C','D'])
df11y
A B C D
a 0.0 0.0 0.0 0.0
b 0.0 0.0 0.0 0.0
c 0.0 0.0 0.0 0.0
d 0.0 0.0 0.0 0.0

We apply the reindex_like( ) method to the DataFrame we want to reindex, passing the reference DataFrame as the argument, as follows:

df11z = df11.reindex_like(df11y)
df11z
A B C D
a 3 2 9 4
b 6 9 7 4
c 5 9 0 0
d 3 4 7 8

Also in this case, if the reference DataFrame contains labels that do not exist in the DataFrame being reindexed, the corresponding rows or columns will be added to the new returned DataFrame, filled with NaN (missing) values. Whereas, if some labels of our DataFrame are not present in the reference one, the corresponding rows or columns will be removed from the new returned DataFrame, as follows:

df12 = pd.DataFrame(np.zeros(16).reshape(4,4), 
                     index=['a','b','c','e'], 
                     columns=['A','B','W','Z'])
df12
A B W Z
a 0.0 0.0 0.0 0.0
b 0.0 0.0 0.0 0.0
c 0.0 0.0 0.0 0.0
e 0.0 0.0 0.0 0.0
df12_ = df11.reindex_like(df12)
df12_
A B W Z
a 3.0 2.0 NaN NaN
b 6.0 9.0 NaN NaN
c 5.0 9.0 NaN NaN
e NaN NaN NaN NaN

Conclusion#

In this chapter, we dealt with a whole series of basic operations that apply to DataFrames (and also to Series) and which greatly help in understanding this type of object and how it works. The concept of selection is very important because it allows us to focus our attention on a subset of the DataFrame; these objects often have dimensions that are too large to be considered in their entirety. And since selection alone is often insufficient, filtering comes to our aid, allowing us to select only the data that meet particular requirements.