Chapter 2: Text Mining Operations#

Introduction#

Text mining is the process of extracting useful information from unstructured text data. One of the first steps in text mining is preprocessing, which includes tokenization, stemming, and lemmatization.

Tokenization is one of the foundational steps in text mining and Natural Language Processing (NLP). It involves breaking text down into smaller units such as sentences or words. This chapter will walk you through various tokenization methods provided by NLTK, using examples inspired by the African context.

Learning Objectives:

  • Tokenization Proficiency: Understand and apply various tokenization methods to break down text into smaller, meaningful units such as words or sentences.

  • Stemming Skills: Grasp the concept of stemming and employ NLTK’s stemming functions to reduce words to their root forms, aiding in data normalization.

  • Lemmatization Mastery: Differentiate between stemming and lemmatization, and utilize lemmatization techniques to convert words to their base or dictionary forms.

Tokenization Operations#

There are several tokenizer functions in the NLTK library.
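
The examples in this chapter assume that NLTK is installed and that its tokenizer and lemmatizer resources have been downloaded. A minimal setup sketch is shown below; the exact resource names can vary slightly between NLTK versions.

import nltk

# One-time downloads used by the examples in this chapter.
nltk.download('punkt')    # models used by word_tokenize and sent_tokenize
nltk.download('wordnet')  # dictionary data used by WordNetLemmatizer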

Let’s explore some of them, starting with the simple word_tokenize function.

word_tokenize#

The word_tokenize function is a standard method for splitting a piece of text into individual words. It breaks text into words, punctuation marks, and other tokens, and is somewhat similar to the split() method on Python strings.

from nltk.tokenize import word_tokenize

text_1 = "Nairobi is the capital of Kenya."

tokens_1 = word_tokenize(text_1)

print(tokens_1)
['Nairobi', 'is', 'the', 'capital', 'of', 'Kenya', '.']
text_2 = "Jollof rice is a popular dish in West Africa."

tokens_2 = word_tokenize(text_2)

print(tokens_2)
['Jollof', 'rice', 'is', 'a', 'popular', 'dish', 'in', 'West', 'Africa', '.']
text_3 = "Afrobeats has gained global recognition."

tokens_3 = word_tokenize(text_3)

print(tokens_3)
['Afrobeats', 'has', 'gained', 'global', 'recognition', '.']

sent_tokenize#

The sent_tokenize function breaks down a piece of text into individual sentences. This is particularly useful when analyzing text data on a per-sentence basis.

from nltk.tokenize import sent_tokenize

text_1 = "Cairo is a historic city. It is known for its ancient pyramids."

sentences_1 = sent_tokenize(text_1)

print(sentences_1)
['Cairo is a historic city.', 'It is known for its ancient pyramids.']
text_2 = "African proverbs convey wisdom. Many of them are passed down through generations."

sentences_2 = sent_tokenize(text_2)

print(sentences_2)
['African proverbs convey wisdom.', 'Many of them are passed down through generations.']
text_3 = "Kente is a colorful fabric. It originates from Ghana."

sentences_3 = sent_tokenize(text_3)

print(sentences_3)
['Kente is a colorful fabric.', 'It originates from Ghana.']

PunktSentenceTokenizer#

The PunktSentenceTokenizer is a pre-trained, unsupervised machine learning tokenizer. It comes trained for English but can be retrained on other languages or on specialized text. In its most basic form it uses punctuation marks as sentence delimiters while keeping the punctuation attached to the sentences; because it is trained on real text, it also learns to recognize cases, such as abbreviations, where punctuation does not end a sentence.

from nltk.tokenize import PunktSentenceTokenizer

text_1 = "The Sahara desert is vast. Many nomads call it home."

punkt_tokenizer = PunktSentenceTokenizer()

sentences_1 = punkt_tokenizer.tokenize(text_1)

print(sentences_1)
['The Sahara desert is vast.', 'Many nomads call it home.']
text_2 = "I am just thinking! Is it worth to call this place home? Because I am coughing:."

punkt_tokenizer = PunktSentenceTokenizer()

sentences_2 = punkt_tokenizer.tokenize(text_2)

print(sentences_2)
['I am just thinking!', 'Is it worth to call this place home?', 'Because I am coughing:.']
text_3 = "The Maasai people are known for their vibrant culture!!! They reside in parts of Kenya, and Tanzania!. "

sentences_3 = punkt_tokenizer.tokenize(text_3)

print(sentences_3)
['The Maasai people are known for their vibrant culture!!!', 'They reside in parts of Kenya, and Tanzania!.']

RegexpTokenizer#

The RegexpTokenizer allows for more control by using regular expressions to specify the pattern for tokenization. It’s ideal when standard tokenization methods don’t suffice.

from nltk.tokenize import RegexpTokenizer

text_1 = "The river Nile flows through many African countries."

tokenizer = RegexpTokenizer(r'\w+')

tokens_1 = tokenizer.tokenize(text_1)

print(tokens_1)
['The', 'river', 'Nile', 'flows', 'through', 'many', 'African', 'countries']

The pattern r'\w+' in a regular expression matches a sequence of word characters. Let’s break it down:

  • \w: matches any single word character (a letter, digit, or underscore).

  • +: a quantifier meaning “one or more” of the preceding pattern.

Some other useful patterns:

  • r'\d+': matches sequences of digits.

  • r'[a-zA-Z]+\?*': matches sequences of letters, optionally followed by question marks.

  • r'Naija\s*\w*': matches the word “Naija” followed by zero or more spaces and any word characters.

  • r'20[0-9]{2}': matches sequences starting with “20” followed by exactly two digits, useful for capturing years from 2000 to 2099.

from nltk.tokenize import RegexpTokenizer

text = "Waka Waka: 230 seconds, Joromi: 210 seconds, Kuliko Jana: 240 seconds"

tokenizer = RegexpTokenizer(r'\d+')

durations = tokenizer.tokenize(text)

print(durations)  
['230', '210', '240']
text = "When is the best time to plant a tree? Twenty years ago. And the next best time? Now."

tokenizer = RegexpTokenizer(r'[a-zA-Z]+\?*')

proverbs = tokenizer.tokenize(text)

print(proverbs)  
['When', 'is', 'the', 'best', 'time', 'to', 'plant', 'a', 'tree?', 'Twenty', 'years', 'ago', 'And', 'the', 'next', 'best', 'time?', 'Now']
text = "Popular music styles in Africa include Naija beats, NaijaJazz, Afrobeats, and Benga."

tokenizer = RegexpTokenizer(r'Naija\s*\w*')

styles = tokenizer.tokenize(text)

print(styles)  
['Naija beats', 'NaijaJazz']
text = "In 2004, Ghana celebrated its Golden Jubilee. South Africa hosted the FIFA World Cup in 2010. Nigeria became the largest economy in Africa in 2014."

tokenizer = RegexpTokenizer(r'20[0-9]{2}')

years = tokenizer.tokenize(text)
print(years) 
['2004', '2010', '2014']

Task 1: #

African News Headlines#

Imagine you’re a data analyst at "AfriNews", a prominent African news outlet. The chief editor provides you with a lengthy editorial article summarizing major events across the continent in the past month. Your task is to tokenize words to help create an infographic highlighting the most frequently mentioned African countries.

  1. Run this command to load the actual editorial article. %load article_edito.py

  2. Tokenize the text into individual words.

  3. Filter out common English words, using the set below:

popular_eng_words = set(["the", "and", "in", "is", "of", "to", "a", "on", "with"])

  4. Count occurrences of each country name.

  5. List the top 5 most mentioned countries.
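
One possible approach is sketched below; the short string and the variable name article are placeholders for the text loaded from article_edito.py.

from collections import Counter
from nltk.tokenize import word_tokenize

# Placeholder text; in the task, run %load article_edito.py instead.
article = "Kenya and Nigeria signed a trade deal. Nigeria later hosted talks with Ghana."

popular_eng_words = set(["the", "and", "in", "is", "of", "to", "a", "on", "with"])

tokens = word_tokenize(article)
# Keep alphabetic tokens and drop the common English words.
words = [t for t in tokens if t.isalpha() and t.lower() not in popular_eng_words]

# Capitalised tokens serve as a rough proxy for country names here;
# a curated list of African country names would be more reliable.
country_mentions = Counter(w for w in words if w[0].isupper())
print(country_mentions.most_common(5))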

Task 2: #

Deciphering Ancient African Chronicles#

As an African historian, you stumbled upon a long-forgotten manuscript detailing timelines of various African empires. Your task is to extract all the years mentioned to form a chronological timeline.

  1. Run this command to load the actual manuscript. %load african_manuscript.py

  2. Use a regular expression to tokenize and capture all the years (BC and AD).

  3. List all extracted years in chronological order.
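
A sketch of one way to do this, assuming years appear in forms such as “300 BC” or “1324 AD”; the sample string is only a placeholder for the loaded manuscript.

import re
from nltk.tokenize import RegexpTokenizer

# Placeholder text; in the task, run %load african_manuscript.py instead.
manuscript = "The empire rose around 300 BC and peaked by 1324 AD before declining in 1591 AD."

# One to four digits, optionally followed by BC or AD.
tokenizer = RegexpTokenizer(r'\d{1,4}\s?(?:BC|AD)?')
years = tokenizer.tokenize(manuscript)

# BC years sort before AD years by treating them as negative numbers.
def year_key(year):
    value = int(re.sub(r'\D', '', year))
    return -value if 'BC' in year else value

print(sorted(years, key=year_key))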

Task 3: #

African Folktales Interpretation#

As a literature professor, you are introduced to a long African folktale. You want to split the tale into individual sentences for a detailed analysis with your students.

  1. Run this command to load the folk tale. %load african_folktale.py

  2. Tokenize the tale into standalone sentences.

  3. Count and display the total number of sentences.

  4. Display the first 5 sentences.
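
A brief sketch, with a placeholder string standing in for the loaded tale:

from nltk.tokenize import sent_tokenize

# Placeholder text; in the task, run %load african_folktale.py instead.
tale = "Anansi was clever. He wanted all the wisdom in the world. So he set out to collect it."

sentences = sent_tokenize(tale)
print(len(sentences))   # total number of sentences
print(sentences[:5])    # first five sentences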

Stemming Functions#

Stemming is a fundamental process in natural language processing (NLP) where words are reduced to their base or root form. For instance, the stemmed version of the word "running" would be "run". By doing so, stemming helps in compressing the textual data and matches different forms of the same word to its base form. However, it’s essential to understand that the stemmed word may not always be a valid word in the language. Instead, it serves as a representation of the base form.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
word_1 = "running"
print(stemmer.stem(word_1))  
run
word_2 = "beautifully"
print(stemmer.stem(word_2))  
beauti
word_3 = "African"
print(stemmer.stem(word_3))  
african
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Example words
words = ["flies", "flying", "runner", "running", "quickly", "national", "nationality"]

# Applying stemming
stemmed_words = [stemmer.stem(word) for word in words]

print(stemmed_words)
['fli', 'fli', 'runner', 'run', 'quickli', 'nation', 'nation']
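
For comparison, the next example uses NLTK’s WordNetLemmatizer, which returns valid dictionary forms rather than truncated stems; lemmatization is covered in detail later in this chapter.
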
from nltk.stem import WordNetLemmatizer

lemmatizer =  WordNetLemmatizer()

word_2 = "geese"

print(lemmatizer.lemmatize(word_2))  
goose
word_3 = "Africans"

print(lemmatizer.lemmatize(word_3))  
Africans

Task 4: #

Reviewing Tourist Feedback on African Safari Tours#

You work for “SavannaScape”, a renowned African travel agency specializing in safari tours. To improve services, the agency collects feedback from tourists. You decide to analyze this feedback to pinpoint common themes and concerns by stemming the reviews.

  1. Load the text reviews using this command %load savanna_reviews.py

  2. Tokenize the feedback into individual words.

  3. Apply stemming to the tokens.

  4. Count occurrences of each stemmed word to identify common themes.

  5. Display the top 5 most frequent stemmed words.
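
A compact sketch of the stemming-and-counting pipeline, with a placeholder string standing in for the loaded reviews; the same pipeline can be reused for the recipe text in Task 5.

from collections import Counter
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Placeholder text; in the task, run %load savanna_reviews.py instead.
reviews = "The guides were amazing and the drives were amazingly scenic, though the camp food was plain."

stemmer = PorterStemmer()
tokens = [t for t in word_tokenize(reviews) if t.isalpha()]
stems = [stemmer.stem(t) for t in tokens]

# The most frequent stems point to recurring themes in the feedback.
print(Counter(stems).most_common(5))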

Task 5: #

Analyzing a Traditional African Cooking Recipe#

You’re a culinary researcher focusing on traditional African dishes. You have a lengthy recipe of a traditional dish, and you wish to analyze it to understand the most emphasized ingredients and processes.

  1. Load the recipe texts using the following command %load afro_recipes.py

  2. Tokenize the recipe into individual words.

  3. Apply stemming to the tokens.

  4. Count occurrences of each stemmed word to identify the main ingredients and methods.

  5. Display the top 5 most frequent stemmed words.

Lemmatization Operations#

Lemmatization is the process of reducing a word to its base or dictionary form. Unlike stemming, the result of lemmatization is always a valid word. It might require POS (Part Of Speech) tags for accurate lemmatization.

Why Lemmatization?

  • Reduces word complexity: by converting words to their root form, lemmatization ensures that various inflected forms of a word are represented by a common base form.

  • Improves text analysis: text analytics algorithms often perform better when variations of words are reduced to their base form.

  • Saves storage space: in large text databases, storing the base form of words can lead to significant storage savings.

# Without a POS tag, the lemmatizer treats the word as a noun,
# so "running" is returned unchanged.
print(lemmatizer.lemmatize("running"))  
running
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

word_1 = "running"
print(lemmatizer.lemmatize(word_1, pos="v"))  
run
word_2 = "geese"
print(lemmatizer.lemmatize(word_2))  
goose
word_3 = "Africans"
print(lemmatizer.lemmatize(word_3)) 
Africans
print(lemmatizer.lemmatize("better", pos="a"))  
print(lemmatizer.lemmatize("best", pos="a"))    
good
best

Here, the adjective "better" is reduced to its base form, "good". "best", however, is returned unchanged because the lemmatizer does not map it back to "good".
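
Because the correct lemma depends on the part of speech, a common pattern is to POS-tag the tokens first and map the Penn Treebank tags onto WordNet’s categories. The sketch below illustrates this idea, assuming the NLTK POS tagger data (e.g. the 'averaged_perceptron_tagger' resource) has also been downloaded.

from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def to_wordnet_pos(treebank_tag):
    # Map Penn Treebank tags (JJ, VB, RB, NN, ...) to WordNet POS constants.
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
tokens = word_tokenize("The dancers were running and singing better songs")
tagged = pos_tag(tokens)  # e.g. [('The', 'DT'), ('dancers', 'NNS'), ...]

lemmas = [lemmatizer.lemmatize(word, to_wordnet_pos(tag)) for word, tag in tagged]
print(lemmas)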

Task 6: #

Analyzing an African Folk Tale#

You are a cultural researcher documenting and analyzing African folk tales. The tale of “Anansi the Spider” has been collected from a local storyteller. Your goal is to understand the recurring themes and characters in the tale by lemmatizing the narrative and analyzing the frequently mentioned lemmas.

  1. Load the folk tale text using the following command %load african_folktale.py

  2. Tokenize the folk tale into individual words.

  3. Apply lemmatization to the tokens.

  4. Count the occurrences of each lemma to identify the primary characters and themes.

  5. Display the top 5 most frequent lemmas.