Chapter 4: Handling Text Data

Learning objectives:

  • Foundation of Textual Data:

    • Define what a string data type is and identify its significance in datasets.

    • Differentiate between numerical, categorical, and textual data in a dataset.

  • Pandas Essentials for Text Data:

    • Introduce the pandas library and its significance in data manipulation.

    • Highlight the str accessor in pandas and its use for vectorized string operations.

  • Basic Text Processing with Pandas:

    • Learn to convert data columns to string data type using the astype(str) method.

    • Demonstrate how to slice strings using the str accessor.

    • Explore common string methods like lower(), upper(), len(), strip(), and their equivalents in pandas.

  • Advanced String Manipulation:

    • Use str.split() to break strings into parts and understand the significance of the expand parameter.

    • Highlight the use of str.contains() to find substrings within a Series.

    • Understand and implement the str.replace() method for replacing textual content.

  • Regular Expressions in Pandas:

    • Introduce the concept of regular expressions for advanced text matching and manipulation.

    • Apply regular expressions with pandas functions like str.extract(), str.contains(), and str.replace().

Introduction:

What is Text Data?

Text data, also known as string data, refers to a collection of characters or words that form textual information. In data science, text data plays a crucial role due to the vast amount of unstructured textual information available today.

Working with text data offers numerous benefits.

  1. Firstly, text data provides rich context and insights into human behavior, opinions, and sentiments, which can be leveraged for various applications such as sentiment analysis, customer feedback analysis, and social media monitoring.

  2. Secondly, text data enables natural language processing (NLP) techniques, allowing machines to understand and process human language, enabling tasks like machine translation, chatbot development, and text summarization.

Furthermore, working with text data enables information retrieval, text classification, topic modeling, and other text-based machine learning algorithms, aiding in tasks like document clustering, content recommendation, and text generation.

Working with text data is foundational to understanding the building blocks of large language models, which are revolutionizing the way we interact with the internet today. It all starts with strings.

A string is a sequence of characters. You can access the characters one at a time with the bracket operator.

town = 'douala'
town[1]
'o'

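In Python, the index is an offset from the beginning of the string, and the offset of the first letter is zero. A slice such as town[0:2] returns the characters from offset 0 up to, but not including, offset 2.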

town[0]
'd'
town[0:2]
'do'

What about accessing the last letter?

town[5]
'a'

An intuitive way of accessing the last element of the string is by passing -1 as an index

town[-1]
'a'

To get the whole string except the last element, slice up to -1:

town[:-1]
'doual'
dir(town)  # lists every attribute and method available on a string
['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'removeprefix',
 'removesuffix',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']
town.islower()
True
town.isupper()
False
town.title()
'Douala'
town.upper()
'DOUALA'
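A few other methods from the list above are worth trying on a plain Python string before moving on. The snippet below is only a sketch: the value of raw is a made-up example, not data from this chapter.

```python
# Hypothetical messy input, chosen only for illustration.
raw = "  Douala, Cameroon  "

cleaned = raw.strip()                           # remove leading/trailing white space
lowered = cleaned.lower()                       # lowercase copy of the string
parts = cleaned.split(", ")                     # split on the ', ' delimiter
swapped = cleaned.replace("Douala", "Yaounde")  # replace a substring

print(cleaned)                    # Douala, Cameroon
print(lowered)                    # douala, cameroon
print(parts)                      # ['Douala', 'Cameroon']
print(swapped)                    # Yaounde, Cameroon
print(cleaned.startswith("Dou"))  # True
print(cleaned.count("a"))         # 3
```

Each call returns a new string (or list); the original raw string is never modified, because Python strings are immutable.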

Task 11:

  1. Write a function that takes a string as input and returns a new string with the same letters in reverse order.

  2. Write a function that takes a sentence (a collection of words separated by a space character) as input and returns the longest word in the sentence.

  3. Write a function that takes a string as input and returns the number of vowels in the string. You may assume that all the letters are lowercase.

  4. Write a function that takes a string and returns True if it is a palindrome. A palindrome is a string that reads the same backward and forward. Assume that there are no spaces and that only lowercase letters will be given.

Pandas String Operations

While Python offers some built-in string methods for basic text processing, pandas extends these functionalities by introducing vectorized string operations that can be applied to entire columns or series of text data. This shift unlocks the power of performing efficient and scalable operations on large datasets, allowing for faster data preprocessing, cleaning, and analysis.

With pandas string operations, tasks like extracting substrings, replacing patterns, splitting strings, and performing regex operations can be carried out across a whole column in a single call. In this section, we cover a few of them, including islower, isupper, title, lower, upper, replace, rsplit, strip, and startswith.
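As a quick preview, here is a minimal sketch of three of these methods applied through the str accessor. The towns series is a hypothetical example introduced only for this illustration.

```python
import pandas as pd

# Hypothetical series, used only for illustration.
towns = pd.Series(["douala", "Dakar", "KIGALI", "dodoma"])

# True where every cased character of the string is lowercase.
is_lower = towns.str.islower()

# Case-insensitive prefix test: lowercase first, then check the prefix.
starts_with_d = towns.str.lower().str.startswith("d")

# Literal (non-regex) replacement of every 'a' by '@'.
replaced = towns.str.replace("a", "@", regex=False)

print(pd.DataFrame({"town": towns,
                    "is_lower": is_lower,
                    "starts_with_d": starts_with_d,
                    "replaced": replaced}))
```

Each method returns a new Series of the same length, so the results can be stored side by side as new columns.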

import numpy as np
import pandas as pd

animals = pd.Series(['cat' , 'dog' , 'mouse' , 'rabbit', 'lion'])
animals
0       cat
1       dog
2     mouse
3    rabbit
4      lion
dtype: object
animals.str.upper()
0       CAT
1       DOG
2     MOUSE
3    RABBIT
4      LION
dtype: object
animals.str.title()
0       Cat
1       Dog
2     Mouse
3    Rabbit
4      Lion
dtype: object
animals.str.len()
0    3
1    3
2    5
3    6
4    4
dtype: int64
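The str accessor only works on text, so numeric columns have to be converted first. The sketch below, which assumes a hypothetical series of integer codes, shows astype(str) followed by slicing through the str accessor.

```python
import pandas as pd

# Hypothetical integer codes, used only for illustration.
codes = pd.Series([10115, 20095, 80331])

codes_str = codes.astype(str)    # convert every value to a string
prefix = codes_str.str[:2]       # first two characters of each string
last_digit = codes_str.str[-1]   # last character of each string

print(prefix.tolist())      # ['10', '20', '80']
print(last_digit.tolist())  # ['5', '5', '1']
```

The slicing syntax inside str[...] follows the same rules as slicing a plain Python string, applied to every element of the series.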

A common data entry problem

One common problem that often arises during data entry, particularly when dealing with string data, is the presence of extra white spaces and/or mishandling of cases. Extra white spaces can occur due to inadvertent keyboard inputs or inconsistent data entry practices, leading to inconsistencies in the data.

Similarly, mishandling of cases, such as inconsistent capitalization or differences in letter casing, can introduce variability and inaccuracies in the dataset. These issues hamper data analysis, as they make it difficult to match and compare strings accurately, resulting in errors or skewed results.
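To see how much damage this can do, consider the following sketch. The answers series is hypothetical; it imitates the same two country names typed with different spacing and casing.

```python
import pandas as pd

# Hypothetical survey answers with inconsistent spacing and casing.
answers = pd.Series(["Kenya", " kenya", "KENYA ", "Mali", "mali  "])

# Every raw spelling is counted as a different value.
print(answers.nunique())  # 5

# Strip the white space, then harmonise the casing.
normalized = answers.str.strip().str.title()
print(normalized.value_counts())
```

After normalization only two distinct values remain, so grouping or counting by country gives the expected result.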

Series and Index are equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically. These are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods:

instruments = pd.Series([' guitar' , '   drum ', 'piano ', np.nan , "violon       "])
instruments
0           guitar
1            drum 
2           piano 
3              NaN
4    violon       
dtype: object

The .strip() method removes extra white space from both the left and the right side of each string. Note that it does not modify the series in place, so you have to assign the result back to instruments to keep the tidy version.

instruments  = instruments.str.strip() 
instruments
0    guitar
1      drum
2     piano
3       NaN
4    violon
dtype: object

Extra white spaces are hard to spot when a DataFrame is displayed, as you can see below.

NorthAfr_cities = pd.DataFrame(['    Faiyhum' , '   Annaba  ' , '   Bizerte  ' ,
                                '  Luxor  ' , 'Djelfa', 'Hammamet   '] , columns=['na_Cities'])
NorthAfr_cities
na_Cities
0 Faiyhum
1 Annaba
2 Bizerte
3 Luxor
4 Djelfa
5 Hammamet
NorthAfr_cities.iloc[0,0]
'    Faiyhum'
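Inspecting cells one by one does not scale, so it helps to flag the affected rows programmatically. The following is one possible sketch, using a small hypothetical cities DataFrame: a value contains stray white space exactly when stripping it changes it.

```python
import pandas as pd

# Hypothetical column with hidden leading/trailing spaces.
cities = pd.DataFrame({"city": ["  Luxor ", "Djelfa", "Annaba  "]})

# True where stripping would change the value.
has_extra_space = cities["city"] != cities["city"].str.strip()
print(cities[has_extra_space])

# repr() shows each string with its quotes, which makes the spaces visible.
visible = cities["city"].map(repr).tolist()
print(visible)
```

Summing has_extra_space gives the number of affected rows, which is a handy sanity check before and after cleaning.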

Using the .strip() operation on the untidy pandas Series, one is able to get rid of the extra white space in a vectorized fashion.

NorthAfr_cities['tidyUp_cities'] = NorthAfr_cities['na_Cities'].str.strip()
NorthAfr_cities = NorthAfr_cities[['tidyUp_cities']]
NorthAfr_cities
tidyUp_cities
0 Faiyhum
1 Annaba
2 Bizerte
3 Luxor
4 Djelfa
5 Hammamet
NorthAfr_cities.iloc[0,0]
'Faiyhum'

Sometimes, two (or more) different types of information are tied up in a single string, and the data entry person uses a common delimiter to separate them instead of entering them in different columns. For instance, take a look at this dataframe of female African artists. Each record is split on the underscore with str.split('_'), and str.get(i) then picks the i-th piece.

female_ziks= pd.DataFrame(['Fatoumata Diawara_Maliba_Mali' , 'Cina Soul_ For Times we lost_Ghana' ,
                          'Mampi_Love Recipe_Zambia',  'Aster Aweke_Soba_Ethiopia'] ,columns=['music'] )
female_ziks
music
0 Fatoumata Diawara_Maliba_Mali
1 Cina Soul_ For Times we lost_Ghana
2 Mampi_Love Recipe_Zambia
3 Aster Aweke_Soba_Ethiopia
female_ziks['composer'] = female_ziks['music'].str.split('_').str.get(0)
female_ziks['album_title'] = female_ziks['music'].str.split('_').str.get(1)
female_ziks['country'] = female_ziks['music'].str.split('_').str.get(2)
female_ziks
                                music           composer         album_title   country
0       Fatoumata Diawara_Maliba_Mali  Fatoumata Diawara              Maliba      Mali
1  Cina Soul_ For Times we lost_Ghana          Cina Soul   For Times we lost     Ghana
2            Mampi_Love Recipe_Zambia              Mampi         Love Recipe    Zambia
3           Aster Aweke_Soba_Ethiopia        Aster Aweke                Soba  Ethiopia
female_ziks =  female_ziks[['composer', 'album_title', 'country']]
female_ziks
            composer         album_title   country
0  Fatoumata Diawara              Maliba      Mali
1          Cina Soul   For Times we lost     Ghana
2              Mampi         Love Recipe    Zambia
3        Aster Aweke                Soba  Ethiopia
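The same result can be obtained with a single split by passing expand=True, which returns one column per piece instead of a list per row. The sketch below works on a hypothetical two-row copy of the records (the records and parts names are introduced here for illustration) and also strips the stray space that follows one of the delimiters.

```python
import pandas as pd

# Hypothetical two-row copy of the underscore-delimited records.
records = pd.DataFrame({"music": ["Mampi_Love Recipe_Zambia",
                                  "Cina Soul_ For Times we lost_Ghana"]})

# expand=True spreads the pieces over separate columns (0, 1, 2).
parts = records["music"].str.split("_", expand=True)
parts.columns = ["composer", "album_title", "country"]

# Remove stray spaces left around the delimiter in every column.
parts = parts.apply(lambda col: col.str.strip())
print(parts)
```

Without expand=True, str.split returns a Series of lists, which is why the earlier version needed str.get to pick each element.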

Task 12:

  1. Run the command %load goat1.py in a cell below. It provides a list of popular African writers and their country of origin.

    a. Report any errors that you observe in the data.

    b. Use some of the above pandas string operations to tidy up the data.

  2. Consider the very same information as above, but stored in a csv file called african_writers.csv. Check the data for inconsistencies and address them, if any.

  3. Run the command %load goat3.py in a cell below. It provides a list of popular African singers and their country of origin.

    a. Report any errors that you observe in the data.

    b. Use some of the above pandas string operations to tidy up the data.

Use of Regular Expressions

Regular expressions (regex) are a powerful tool for pattern matching in text. They can be used to extract, clean, and transform text data in a variety of ways. For example, regex can be used to:

  • Extract specific pieces of information from text, such as email addresses, phone numbers, or dates.

  • Clean up text data by removing unwanted characters or formatting.

  • Transform text data by converting it to a different format, such as changing all uppercase letters to lowercase letters.

The use of regex with pandas dataframes can be a bit daunting at first, but it is a powerful tool that can be used to solve a wide variety of problems. Some basic regex patterns and examples include:

  • \d matches any single digit. For example, the regex pattern \d matches each character of the string "1234", and \d+ matches the whole string.

  • \w matches any single word character (a letter, a digit, or an underscore). For example, the regex pattern \w+ matches the string "hello".

  • \s matches any whitespace character. For example, the regex pattern \s matches the string " ".

  • . matches any character except a newline. For example, the regex pattern . matches the string "a" or the string ".".

  • ^ matches the beginning of a string. For example, the regex pattern ^hello would match the string "hello" only if it was at the beginning of the string.

  • [ ] matches a set of characters. For example, the regex pattern [0-9] would match any digit from 0 to 9.

  • {} specifies the number of times a character or set of characters can be matched. For example, the regex pattern \d{5} would match a string of five digits.

  • | matches either one or the other of two patterns. For example, the regex pattern hello|world would match the string "hello" or the string "world".
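These building blocks can be tried out with Python's built-in re module before using them in pandas. The snippet below is a small sketch; the test strings are made up for illustration.

```python
import re

# re.search returns a match object, or None when nothing matches.
has_digit = re.search(r"\d", "room 7") is not None          # a digit somewhere
starts = re.search(r"^hello", "hello world") is not None    # 'hello' at the start
not_start = re.search(r"^hello", "say hello") is not None   # 'hello' is not at the start

# re.findall returns every non-overlapping match as a list.
years = re.findall(r"\d{4}", "1960 and 1987")         # four digits in a row
either = re.findall(r"hello|world", "hello world")    # one alternative or the other
digits = re.findall(r"[0-9]", "a1b2")                 # single characters from a set
spaces = re.findall(r"\s", "a b\tc")                  # whitespace characters

print(has_digit, starts, not_start)  # True True False
print(years)                         # ['1960', '1987']
print(either)                        # ['hello', 'world']
print(digits)                        # ['1', '2']
print(spaces)                        # [' ', '\t']
```

Writing patterns as raw strings (the r prefix) keeps Python from interpreting backslashes before the regex engine sees them.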

Example:

The series below contains a few labels related to Mali and Côte d'Ivoire, two countries located in West Africa.

import pandas as pd

# Create a pandas series
MaliCiv_infos = pd.Series(["mali", "EaglesOfMali", "+233", "mali223", "Mali_1960" , "MALIPUISSANCI", 
                          "cotedivoire" , "ElephantCoteDivoire" ,"+225" ,  "civ225", "Civ_1960" , "CIVYAFOYE"])

MaliCiv_infos
0                    mali
1            EaglesOfMali
2                    +233
3                 mali223
4               Mali_1960
5           MALIPUISSANCI
6             cotedivoire
7     ElephantCoteDivoire
8                    +225
9                  civ225
10               Civ_1960
11              CIVYAFOYE
dtype: object
  1. Create a regular expression to match strings that contain only lowercase letters

regex_1 = "^[a-z]+$"

# Create a sub series of strings that match the regex
query1 = MaliCiv_infos[MaliCiv_infos.str.match(regex_1)]

# Print the sub series
print(query1)
0           mali
6    cotedivoire
dtype: object
  2. Create a regular expression to match strings that contain only lowercase or uppercase letters

regex_2 = "^[a-zA-Z]+$"

# Create a sub series of strings that match the regex
query2 = MaliCiv_infos[MaliCiv_infos.str.match(regex_2)]
query2
0                    mali
1            EaglesOfMali
5           MALIPUISSANCI
6             cotedivoire
7     ElephantCoteDivoire
11              CIVYAFOYE
dtype: object
  3. Create a regular expression to match strings that contain only uppercase letters

regex_3 = "^[A-Z]+$"
# Create a sub series of strings that match the regex
query3 = MaliCiv_infos[MaliCiv_infos.str.match(regex_3)]
query3
5     MALIPUISSANCI
11        CIVYAFOYE
dtype: object
  4. Create a regular expression to match strings that end with four digits

regex_4 = r"^.*\d{4}$"
# Create a sub series of strings that match the regex
query4 = MaliCiv_infos[MaliCiv_infos.str.match(regex_4)]
query4
4     Mali_1960
10     Civ_1960
dtype: object
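The queries above rely on str.match, which only requires the pattern to match at the beginning of each string. The sketch below contrasts it with str.contains and str.fullmatch on a hypothetical series; it assumes a pandas version recent enough to provide str.fullmatch.

```python
import pandas as pd

# Hypothetical labels, used only for illustration.
labels = pd.Series(["mali223", "Mali_1960", "1960mali", "1960"])
pattern = r"\d{4}"

at_start = labels.str.match(pattern)      # pattern must match at the beginning
anywhere = labels.str.contains(pattern)   # pattern may match anywhere
whole = labels.str.fullmatch(pattern)     # pattern must cover the whole string

print(pd.DataFrame({"label": labels, "match": at_start,
                    "contains": anywhere, "fullmatch": whole}))
```

This is why anchors matter: with str.match a leading ^ is redundant, but a trailing $ is still needed when the pattern has to reach the end of the string.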
  5. Create a regular expression to match strings that start with a plus sign

regex_5 = r"^\+"
# Create a sub series of strings that match the regex
query5 = MaliCiv_infos[MaliCiv_infos.str.match(regex_5)]
query5
2    +233
8    +225
dtype: object
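Regular expressions are not limited to filtering rows. As a rough sketch, str.extract can pull a captured group out of each string and str.replace can rewrite the parts that match a pattern. The labels series below is a hypothetical subset of values similar to the ones used in this example.

```python
import pandas as pd

# Hypothetical labels, similar to the series used in this example.
labels = pd.Series(["mali223", "Mali_1960", "civ225", "Civ_1960", "mali"])

# Capture the trailing run of digits; rows without trailing digits get NaN.
digits = labels.str.extract(r"(\d+)$", expand=False)

# Remove everything that is not a letter, then harmonise the casing.
names = labels.str.replace(r"[^A-Za-z]", "", regex=True).str.lower()

print(pd.DataFrame({"label": labels, "digits": digits, "name": names}))
```

With expand=False and a single capture group, str.extract returns a Series, which is convenient for assigning the result to a new column.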

Task 13:

Egerton University is the oldest university in Kenya, having been founded in 1939. It was founded as Egerton Farm College and became part of the University of Nairobi in 1986 and a university in its own right in 1987. It hosts a Public Policy Hub that has secured funding from the National Research Fund (NRF) Kenya to conduct a study across different African countries to understand the “Year of Africa”.

The Year of Africa is a term often used for a particular year in which many countries across the continent celebrated the joy, excitement, and possibilities of independence. The end goal is to develop and propose a new history book “for Africans by Africans” that will be used as an educational manual across several African countries to teach a different narrative to the upcoming generation. They have collected information on several African countries through the corresponding state agencies. This information includes their date of independence, their official language, and other country-related details.

Through the grant they received, they will hire a data analyst to assist them on that project. You just completed your specialization in Statistics from Jomo Kenyatta University of Agriculture and Technology and have been given a role in the project as a data analyst. Their data sits in a csv file called classified_african_info.csv.

  1. What are the two countries/empires that were at the forefront of imperialism and colonialism in Africa? In your opinion, what was the reason behind this?

  2. Load the data and tell us what you observe.

  3. Which countries obtained their independence in 1960? Do you have any story to tell behind that?

  4. Which countries obtained their independence before 1960? What do you observe about them? Do you have any story to tell behind that?

  5. Some African countries wrote their independence day in letters instead of numbers. Display those countries and tell how many of them are French-speaking nations.

  6. Break down the independence date into day, month, and year. Which month was the most common for independence?