Chapter 3: Feature Extraction from Text Data#
Introduction#
Feature extraction is a pivotal step in the text mining process. Essentially, it translates textual data into a numerical form that machine learning models can understand, and it is the bedrock of many natural language processing tasks. Scikit-learn (sklearn) provides a suite of tools to efficiently transform text data into a format suitable for machine learning. Through African-context examples, this chapter demonstrates the versatility and applicability of these tools across various textual scenarios; as we venture into more advanced topics, mastering the basics of feature extraction remains paramount.
This chapter offers an exploration of sklearn's text feature extraction techniques.
Learning Objectives:
Understand Basic Text Representation: Comprehend the necessity of converting textual data into numerical format for machine learning applications, and appreciate the significance of feature extraction in text mining.
Master CountVectorizer: Confidently utilize the CountVectorizer method to transform text documents into a matrix of token counts, distinguishing how individual words and tokens are represented in this format.
Differentiate Vectorization Techniques: Discern the differences between TfidfVectorizer and the combination of CountVectorizer with TfidfTransformer. Know when to apply each method based on the task at hand.
Understanding Document-Term Matrix (DTM)#
The Document-Term Matrix (DTM) is a matrix representation of a text dataset in which each row corresponds to a document, each column represents a term (typically a word), and each cell contains the frequency of that term in the document.
Consider two sentences:
“I love machine learning.”
“Learning machine algorithms is fun.”
The DTM for these sentences has a row for each sentence and a column for each unique word.
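Counting every lowercased word as a term, the conceptual DTM looks like this:

|   | algorithms | fun | i | is | learning | love | machine |
|---|---|---|---|---|---|---|---|
| Sentence 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
| Sentence 2 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |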
CountVectorizer#
CountVectorizer turns text documents into a matrix of token counts. Each row represents a document and each column represents a token (word), with the value indicating the count of that token in the respective document.
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
# Constructing a DTM using the above example
sample_sentences = ["I love machine learning.", "Learning machine algorithms is fun."]
vectorizer = CountVectorizer()
X0 = vectorizer.fit_transform(sample_sentences)
# Convert to a DataFrame for better visualization
df = pd.DataFrame(X0.toarray(), columns=vectorizer.get_feature_names_out())
df
|   | algorithms | fun | is | learning | love | machine |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 0 | 1 |
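Notice that "I" from the first sentence is missing: CountVectorizer lowercases the text, and its default token pattern keeps only tokens of two or more alphanumeric characters, so single-character words are dropped. The fitted vocabulary, which maps each token to its column index, can be inspected directly:

# Tokens are assigned column indices in alphabetical order;
# single-character tokens such as "i" were never extracted.
print(sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1]))
# [('algorithms', 0), ('fun', 1), ('is', 2), ('learning', 3), ('love', 4), ('machine', 5)]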
docs_1 = ["Nairobi is the capital of Kenya.",
"Lagos is a bustling city in Nigeria.",
"Cairo is the heart of Egypt."]
vectorizer = CountVectorizer()
X1 = vectorizer.fit_transform(docs_1)
# Convert to a DataFrame for better visualization
capitals_df = pd.DataFrame(X1.toarray(), columns=vectorizer.get_feature_names_out())
capitals_df
|   | bustling | cairo | capital | city | egypt | heart | in | is | kenya | lagos | nairobi | nigeria | of | the |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
# stop_words='english' removes common English words such as "is", "the", and "of"
vectorizer_2 = CountVectorizer(stop_words='english')
X1 = vectorizer_2.fit_transform(docs_1)
# Convert to a DataFrame for better visualization
capitals_df = pd.DataFrame(X1.toarray(), columns=vectorizer_2.get_feature_names_out())
capitals_df
|   | bustling | cairo | capital | city | egypt | heart | kenya | lagos | nairobi | nigeria |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
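Words like "is", "the", and "of" have disappeared because they appear in scikit-learn's built-in English stop word list. The list actually in effect can be inspected; a quick sketch:

# get_stop_words() returns the stop word set the vectorizer is using.
stop_list = vectorizer_2.get_stop_words()
print(sorted(stop_list)[:10])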
docs_2 = ["African proverbs are wise sayings.",
"Proverbs offer wisdom and life lessons.",
"Wisdom is wealth."]
X2 = vectorizer.fit_transform(docs_2)
print(vectorizer.get_feature_names_out())
proverbs_df = pd.DataFrame(X2.toarray(), columns=vectorizer.get_feature_names_out())
proverbs_df
['african' 'and' 'are' 'is' 'lessons' 'life' 'offer' 'proverbs' 'sayings'
'wealth' 'wisdom' 'wise']
|   | african | and | are | is | lessons | life | offer | proverbs | sayings | wealth | wisdom | wise |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
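A fitted vectorizer can also transform documents it has never seen; tokens outside the learned vocabulary are silently ignored. A small sketch with a hypothetical document:

# "heritage" is not in the vocabulary learned from docs_2, so it is ignored;
# only "african", "lessons", and "wisdom" are counted.
new_doc = ["African heritage wisdom lessons"]  # hypothetical example
print(vectorizer.transform(new_doc).toarray())
# [[1 0 0 0 1 0 0 0 0 0 1 0]]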
# Custom token pattern: \b\w+\b keeps every run of word characters,
# including single characters and numeric tokens such as "54" and "2nd"
vectorizer_3 = CountVectorizer(token_pattern=r'\b\w+\b')
docs_3 = ["Africa has 54 countries.",
"It is the 2nd largest continent.",
"Kilimanjaro is the tallest mountain in Africa."]
X3 = vectorizer_3.fit_transform(docs_3)
facts_df = pd.DataFrame(X3.toarray(), columns=vectorizer_3.get_feature_names_out())
facts_df
|   | 2nd | 54 | africa | continent | countries | has | in | is | it | kilimanjaro | largest | mountain | tallest | the |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |
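If the goal were instead to keep purely alphabetic words, excluding tokens like "54" and "2nd", a letters-only pattern would work; a sketch (the name vectorizer_letters is just for illustration):

# Letters only: "54" and "2nd" are excluded from the vocabulary entirely.
vectorizer_letters = CountVectorizer(token_pattern=r'\b[a-zA-Z]+\b')
X3_letters = vectorizer_letters.fit_transform(docs_3)
print(vectorizer_letters.get_feature_names_out())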
TfidfVectorizer#
TfidfVectorizer converts text documents into a matrix of token counts and then transforms that count matrix into a tf-idf representation. Tf-idf stands for "Term Frequency-Inverse Document Frequency": it scores the importance of a word (token) in a document, weighting it up when it occurs often in that document and down when it occurs across many documents.
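Concretely, with scikit-learn's default settings (smooth_idf=True, norm='l2'), the score of term $t$ in document $d$ is

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \left( \ln \frac{1 + n}{1 + \text{df}(t)} + 1 \right),$$

where $\text{tf}(t, d)$ is the raw count of $t$ in $d$, $n$ is the number of documents, and $\text{df}(t)$ is the number of documents containing $t$. Each document's vector is then rescaled to unit Euclidean length, which is why the scores in the tables below fall between 0 and 1.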
# Basic Tf-idf Scores
from sklearn.feature_extraction.text import TfidfVectorizer
docs_4 = ["Nelson Mandela was a leader in South Africa.",
"Mandela fought for freedom and equality.",
"South Africa saw profound change under Mandela's leadership."]
vectorizer_4 = TfidfVectorizer()
X4 = vectorizer_4.fit_transform(docs_4)
sa_facts = pd.DataFrame(X4.toarray(), columns=vectorizer_4.get_feature_names_out())
sa_facts
|   | africa | and | change | equality | for | fought | freedom | in | leader | leadership | mandela | nelson | profound | saw | south | under | was |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.324124 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.426184 | 0.426184 | 0.000000 | 0.251711 | 0.426184 | 0.000000 | 0.000000 | 0.324124 | 0.000000 | 0.426184 |
| 1 | 0.000000 | 0.432385 | 0.000000 | 0.432385 | 0.432385 | 0.432385 | 0.432385 | 0.000000 | 0.000000 | 0.000000 | 0.255374 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 2 | 0.298174 | 0.000000 | 0.392063 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.392063 | 0.231559 | 0.000000 | 0.392063 | 0.392063 | 0.298174 | 0.392063 | 0.000000 |
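Notice that "mandela", which occurs in all three documents, receives the lowest nonzero weight in every row, while words unique to a single document score highest. The fitted idf weights make this explicit:

# Terms appearing in every document get the smallest possible idf (here exactly 1.0).
idf_scores = pd.Series(vectorizer_4.idf_, index=vectorizer_4.get_feature_names_out())
print(idf_scores.sort_values().head())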
docs_5 = ["Lions and cheetahs are both big cats found in Africa.",
"Elephants in Africa are known for their large tusks.",
"Cheetahs are the fastest land animals."]
X5 = vectorizer_4.fit_transform(docs_5)
wildlife_df = pd.DataFrame(X5.toarray(), columns=vectorizer_4.get_feature_names_out())
wildlife_df
|   | africa | and | animals | are | big | both | cats | cheetahs | elephants | fastest | for | found | in | known | land | large | lions | the | their | tusks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.267485 | 0.351711 | 0.000000 | 0.207726 | 0.351711 | 0.351711 | 0.351711 | 0.267485 | 0.000000 | 0.000000 | 0.000000 | 0.351711 | 0.267485 | 0.000000 | 0.000000 | 0.000000 | 0.351711 | 0.000000 | 0.000000 | 0.000000 |
| 1 | 0.277601 | 0.000000 | 0.000000 | 0.215582 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.365011 | 0.000000 | 0.365011 | 0.000000 | 0.277601 | 0.365011 | 0.000000 | 0.365011 | 0.000000 | 0.000000 | 0.365011 | 0.365011 |
| 2 | 0.000000 | 0.000000 | 0.450504 | 0.266075 | 0.000000 | 0.000000 | 0.000000 | 0.342620 | 0.000000 | 0.450504 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.450504 | 0.000000 | 0.000000 | 0.450504 | 0.000000 | 0.000000 |
# ngram_range=(1, 2) extracts both unigrams and bigrams
vectorizer_6 = TfidfVectorizer(ngram_range=(1,2))
docs_6 = ["The Sahara is a vast desert.",
"The Nile cuts through several African countries.",
"The Congo rainforest is vast and diverse."]
X6 = vectorizer_6.fit_transform(docs_6)
rivers_df = pd.DataFrame(X6.toarray(), columns=vectorizer_6.get_feature_names_out())
rivers_df
|   | african | african countries | and | and diverse | congo | congo rainforest | countries | cuts | cuts through | desert | ... | several african | the | the congo | the nile | the sahara | through | through several | vast | vast and | vast desert |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.375716 | ... | 0.000000 | 0.221904 | 0.000000 | 0.000000 | 0.375716 | 0.000000 | 0.000000 | 0.285742 | 0.000000 | 0.375716 |
| 1 | 0.284569 | 0.284569 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.284569 | 0.284569 | 0.284569 | 0.000000 | ... | 0.284569 | 0.168071 | 0.000000 | 0.284569 | 0.000000 | 0.284569 | 0.284569 | 0.000000 | 0.000000 | 0.000000 |
| 2 | 0.000000 | 0.000000 | 0.300366 | 0.300366 | 0.300366 | 0.300366 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.177401 | 0.300366 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.228436 | 0.300366 | 0.000000 |
3 rows × 30 columns
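Keeping both single words and two-word phrases is why the matrix grows to 30 columns here. To study multi-word phrases in isolation, a bigram-only variant could be used; a sketch:

# ngram_range=(2, 2) keeps bigrams only, discarding single words.
bigram_vectorizer = TfidfVectorizer(ngram_range=(2, 2))
X_bigrams = bigram_vectorizer.fit_transform(docs_6)
print(bigram_vectorizer.get_feature_names_out())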
TfidfTransformer#
While TfidfVectorizer takes in raw text and produces tf-idf scores, TfidfTransformer is used after CountVectorizer to convert the count matrix into a tf-idf representation.
from sklearn.feature_extraction.text import TfidfTransformer
docs_7 = ["Accra is the hub of Ghana.",
"Ghana is known for its gold resources.",
"Accra hosts several historic sites."]
count_vect = CountVectorizer()
X7_count = count_vect.fit_transform(docs_7)
tfidf_transformer = TfidfTransformer()
X7_tfidf = tfidf_transformer.fit_transform(X7_count)
ghana_df = pd.DataFrame(X7_tfidf.toarray(), columns=count_vect.get_feature_names_out())
ghana_df
|   | accra | for | ghana | gold | historic | hosts | hub | is | its | known | of | resources | several | sites | the |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.349498 | 0.000000 | 0.349498 | 0.000000 | 0.000000 | 0.000000 | 0.459548 | 0.349498 | 0.000000 | 0.000000 | 0.459548 | 0.000000 | 0.000000 | 0.000000 | 0.459548 |
| 1 | 0.000000 | 0.403016 | 0.306504 | 0.403016 | 0.000000 | 0.000000 | 0.000000 | 0.306504 | 0.403016 | 0.403016 | 0.000000 | 0.403016 | 0.000000 | 0.000000 | 0.000000 |
| 2 | 0.355432 | 0.000000 | 0.000000 | 0.000000 | 0.467351 | 0.467351 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.467351 | 0.467351 | 0.000000 |
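As a sanity check, this two-step route should produce exactly what TfidfVectorizer yields directly on the same documents, since both use the same default settings; a quick sketch:

import numpy as np
# CountVectorizer + TfidfTransformer is equivalent to TfidfVectorizer.
direct = TfidfVectorizer().fit_transform(docs_7)
print(np.allclose(direct.toarray(), X7_tfidf.toarray()))  # expected: True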
docs_8 = ["Jollof rice is a popular dish in West Africa.",
"Ugali is a staple food in East Africa.",
"Biltong is a snack originating from South Africa."]
X8_count = count_vect.fit_transform(docs_8)
X8_tfidf = tfidf_transformer.fit_transform(X8_count)
afrofoods_df = pd.DataFrame(X8_tfidf.toarray(), columns=count_vect.get_feature_names_out())
afrofoods_df
|   | africa | biltong | dish | east | food | from | in | is | jollof | originating | popular | rice | snack | south | staple | ugali | west |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.235756 | 0.00000 | 0.399169 | 0.000000 | 0.000000 | 0.00000 | 0.303578 | 0.235756 | 0.399169 | 0.00000 | 0.399169 | 0.399169 | 0.00000 | 0.00000 | 0.000000 | 0.000000 | 0.399169 |
| 1 | 0.257129 | 0.00000 | 0.000000 | 0.435357 | 0.435357 | 0.00000 | 0.331100 | 0.257129 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.00000 | 0.00000 | 0.435357 | 0.435357 | 0.000000 |
| 2 | 0.247433 | 0.41894 | 0.000000 | 0.000000 | 0.000000 | 0.41894 | 0.000000 | 0.247433 | 0.000000 | 0.41894 | 0.000000 | 0.000000 | 0.41894 | 0.41894 | 0.000000 | 0.000000 | 0.000000 |
# norm='l1' rescales each row so that its tf-idf weights sum to 1
tfidf_transformer_9 = TfidfTransformer(norm='l1')
docs_9 = ["African music is diverse.",
"Afrobeats and Highlife are popular genres.",
"Music festivals are common across the continent."]
X9_count = count_vect.fit_transform(docs_9)
X9_tfidf = tfidf_transformer_9.fit_transform(X9_count)
afrimusic_df = pd.DataFrame(X9_tfidf.toarray(), columns=count_vect.get_feature_names_out())
afrimusic_df
|   | across | african | afrobeats | and | are | common | continent | diverse | festivals | genres | highlife | is | music | popular | the |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00000 | 0.26592 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.00000 | 0.26592 | 0.00000 | 0.000000 | 0.000000 | 0.26592 | 0.202239 | 0.000000 | 0.00000 |
| 1 | 0.00000 | 0.00000 | 0.173595 | 0.173595 | 0.132024 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.173595 | 0.173595 | 0.00000 | 0.000000 | 0.173595 | 0.00000 |
| 2 | 0.15335 | 0.00000 | 0.000000 | 0.000000 | 0.116626 | 0.15335 | 0.15335 | 0.00000 | 0.15335 | 0.000000 | 0.000000 | 0.00000 | 0.116626 | 0.000000 | 0.15335 |
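With norm='l1', the weights in each row sum to 1, so every row can be read as a distribution over that document's terms (compare the default norm='l2', where each row has unit Euclidean length). This is easy to verify:

# Each row of the l1-normalized matrix sums to 1.
print(X9_tfidf.toarray().sum(axis=1))  # expected: [1. 1. 1.]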
Task 7: #
Analyzing News Articles on African Youth Unemployment#
You’re a sociologist who’s investigating the portrayal of youth unemployment in African news media. You’ve collected several news articles discussing youth unemployment in various African nations. Your aim is to identify the most discussed themes and assess the importance of different terms in the articles using feature extraction methods.
1. Load the youth employment articles using the command `%load youth_emp_article.py`.
2. Tokenize the articles into individual words.
3. Use `CountVectorizer` to count word occurrences.
4. Use `TfidfVectorizer` to compute the Term Frequency-Inverse Document Frequency (TF-IDF) values for each term.
5. Alternatively, use `TfidfTransformer` to compute the TF-IDF values if starting from the raw counts produced by `CountVectorizer`.
6. Analyze the top terms to understand the main themes in the articles (a starter sketch follows this list).
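A possible starting point is sketched below. It assumes the loaded file defines a list of article strings; the variable name articles is an assumption, so adapt it to whatever youth_emp_article.py actually provides. The same skeleton applies to Task 8.

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Hypothetical: `articles` is the list of article strings from youth_emp_article.py.
tfidf_vect = TfidfVectorizer(stop_words='english')
X_articles = tfidf_vect.fit_transform(articles)

# Sum each term's tf-idf weight across all articles and list the strongest themes.
term_totals = pd.Series(X_articles.toarray().sum(axis=0),
                        index=tfidf_vect.get_feature_names_out())
print(term_totals.nlargest(10))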
Task 8: #
Analyzing Economic Reports on the African Agricultural Export Potential#
You’re an economist at the African Union’s Department of Economic Affairs. With increasing talks about intra-African trade and global exports, you’ve gathered several economic reports discussing the potential of African agricultural exports and their economic impact. Your goal is to extract insights about the most emphasized agricultural commodities and understand the most significant themes across the reports using text analysis techniques.
1. Load the economic reports using the command `%load eco_reports.py`.
2. Tokenize the economic reports into individual words.
3. Use `CountVectorizer` to compute the frequency of word occurrences.
4. Apply `TfidfVectorizer` to determine the Term Frequency-Inverse Document Frequency (TF-IDF) values for each term. Alternatively, if starting from the raw counts produced by `CountVectorizer`, use `TfidfTransformer` to calculate the TF-IDF values.
5. Evaluate the top terms to decipher the primary commodities and themes in the economic reports.