Chapter 3: Feature Extraction from Text Data#
Introduction#
Feature extraction is a pivotal step in the text mining process. Essentially, it translates textual data into a numerical form that machine learning models can understand, and it is the bedrock of many natural language processing tasks. Scikit-learn (sklearn) provides a suite of tools to efficiently transform text data into a format suitable for machine learning. Through African-context examples, this chapter demonstrates the versatility and applicability of these tools across various textual scenarios; as we venture into more advanced topics, mastering the basics of feature extraction remains paramount.
This chapter offers an exploration of sklearn's text feature extraction techniques.
Learning Objectives:
Understand Basic Text Representation: Comprehend the necessity of converting textual data into numerical format for machine learning applications, and appreciate the significance of feature extraction in text mining.
Master CountVectorizer: Confidently utilize the CountVectorizer method to transform text documents into a matrix of token counts, distinguishing how individual words and tokens are represented in this format.
Differentiate Vectorization Techniques: Discern the differences between TfidfVectorizer and the combination of CountVectorizer with TfidfTransformer. Know when to apply each method based on the task at hand.
Understanding Document-Term Matrix (DTM)#
The Document-Term Matrix (DTM) is a matrix representation of a text dataset in which each row corresponds to a document, each column represents a term (typically a word), and each cell contains the frequency of that term in the document.
Consider two sentences:
“I love machine learning.”
“Learning machine algorithms is fun.”
The DTM for these sentences has a row for each sentence and a column for each unique word.
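Counting every lowercased word as a term, the conceptual DTM looks like this:

|   | algorithms | fun | i | is | learning | love | machine |
|---|---|---|---|---|---|---|---|
| Sentence 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
| Sentence 2 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |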
CountVectorizer#
CountVectorizer turns text documents into a matrix of token counts. Each row represents a document and each column represents a token (word), with the value indicating the count of that token in the respective document.
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
# Constructing a DTM using the above example
sample_sentences = ["I love machine learning.", "Learning machine algorithms is fun."]
vectorizer = CountVectorizer()
X0 = vectorizer.fit_transform(sample_sentences)
# Convert to a DataFrame for better visualization
df = pd.DataFrame(X0.toarray(), columns=vectorizer.get_feature_names_out())
df
|   | algorithms | fun | is | learning | love | machine |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 0 | 1 |
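Notice that "I" from the first sentence is missing: CountVectorizer lowercases the text, and its default token pattern keeps only tokens of two or more alphanumeric characters, so single-character words are dropped. The fitted vocabulary, which maps each token to its column index, can be inspected directly:

# Tokens are assigned column indices in alphabetical order;
# single-character tokens such as "i" were never extracted.
print(sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1]))
# [('algorithms', 0), ('fun', 1), ('is', 2), ('learning', 3), ('love', 4), ('machine', 5)]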
docs_1 = ["Nairobi is the capital of Kenya.",
"Lagos is a bustling city in Nigeria.",
"Cairo is the heart of Egypt."]
vectorizer = CountVectorizer()
X1 = vectorizer.fit_transform(docs_1)
# Convert to a DataFrame for better visualization
capitals_df = pd.DataFrame(X1.toarray(), columns=vectorizer.get_feature_names_out())
capitals_df
|   | bustling | cairo | capital | city | egypt | heart | in | is | kenya | lagos | nairobi | nigeria | of | the |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
# stop_words='english' removes common English words such as "is", "the", and "of"
vectorizer_2 = CountVectorizer(stop_words='english')
X1 = vectorizer_2.fit_transform(docs_1)
# Convert to a DataFrame for better visualization
capitals_df = pd.DataFrame(X1.toarray(), columns=vectorizer_2.get_feature_names_out())
capitals_df
|   | bustling | cairo | capital | city | egypt | heart | kenya | lagos | nairobi | nigeria |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
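Words like "is", "the", and "of" have disappeared because they appear in scikit-learn's built-in English stop word list. The list actually in effect can be inspected; a quick sketch:

# get_stop_words() returns the stop word set the vectorizer is using.
stop_list = vectorizer_2.get_stop_words()
print(sorted(stop_list)[:10])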
docs_2 = ["African proverbs are wise sayings.",
"Proverbs offer wisdom and life lessons.",
"Wisdom is wealth."]
X2 = vectorizer.fit_transform(docs_2)
print(vectorizer.get_feature_names_out())
proverbs_df = pd.DataFrame(X2.toarray(), columns=vectorizer.get_feature_names_out())
proverbs_df
['african' 'and' 'are' 'is' 'lessons' 'life' 'offer' 'proverbs' 'sayings'
'wealth' 'wisdom' 'wise']
|   | african | and | are | is | lessons | life | offer | proverbs | sayings | wealth | wisdom | wise |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
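A fitted vectorizer can also transform documents it has never seen; tokens outside the learned vocabulary are silently ignored. A small sketch with a hypothetical document:

# "heritage" is not in the vocabulary learned from docs_2, so it is ignored;
# only "african", "lessons", and "wisdom" are counted.
new_doc = ["African heritage wisdom lessons"]  # hypothetical example
print(vectorizer.transform(new_doc).toarray())
# [[1 0 0 0 1 0 0 0 0 0 1 0]]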
# Custom token pattern: \b\w+\b keeps every run of word characters,
# including single characters and numeric tokens such as "54" and "2nd"
vectorizer_3 = CountVectorizer(token_pattern=r'\b\w+\b')
docs_3 = ["Africa has 54 countries.",
"It is the 2nd largest continent.",
"Kilimanjaro is the tallest mountain in Africa."]
X3 = vectorizer_3.fit_transform(docs_3)
facts_df = pd.DataFrame(X3.toarray(), columns=vectorizer_3.get_feature_names_out())
facts_df
|   | 2nd | 54 | africa | continent | countries | has | in | is | it | kilimanjaro | largest | mountain | tallest | the |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |
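If the goal were instead to keep purely alphabetic words, excluding tokens like "54" and "2nd", a letters-only pattern would work; a sketch (the name vectorizer_letters is just for illustration):

# Letters only: "54" and "2nd" are excluded from the vocabulary entirely.
vectorizer_letters = CountVectorizer(token_pattern=r'\b[a-zA-Z]+\b')
X3_letters = vectorizer_letters.fit_transform(docs_3)
print(vectorizer_letters.get_feature_names_out())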
TfidfVectorizer#
TfidfVectorizer converts text documents into a matrix of token counts and then transforms that count matrix into a tf-idf representation. Tf-idf stands for "Term Frequency-Inverse Document Frequency": it scores the importance of a word (token) in a document, weighting it up when it occurs often in that document and down when it occurs across many documents.
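Concretely, with scikit-learn's default settings (smooth_idf=True, norm='l2'), the score of term $t$ in document $d$ is

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \left( \ln \frac{1 + n}{1 + \text{df}(t)} + 1 \right),$$

where $\text{tf}(t, d)$ is the raw count of $t$ in $d$, $n$ is the number of documents, and $\text{df}(t)$ is the number of documents containing $t$. Each document's vector is then rescaled to unit Euclidean length, which is why the scores in the tables below fall between 0 and 1.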
# Basic Tf-idf Scores
from sklearn.feature_extraction.text import TfidfVectorizer
docs_4 = ["Nelson Mandela was a leader in South Africa.",
"Mandela fought for freedom and equality.",
"South Africa saw profound change under Mandela's leadership."]
vectorizer_4 = TfidfVectorizer()
X4 = vectorizer_4.fit_transform(docs_4)
sa_facts = pd.DataFrame(X4.toarray(), columns=vectorizer_4.get_feature_names_out())
sa_facts
|   | africa | and | change | equality | for | fought | freedom | in | leader | leadership | mandela | nelson | profound | saw | south | under | was |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.324124 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.426184 | 0.426184 | 0.000000 | 0.251711 | 0.426184 | 0.000000 | 0.000000 | 0.324124 | 0.000000 | 0.426184 |
| 1 | 0.000000 | 0.432385 | 0.000000 | 0.432385 | 0.432385 | 0.432385 | 0.432385 | 0.000000 | 0.000000 | 0.000000 | 0.255374 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 2 | 0.298174 | 0.000000 | 0.392063 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.392063 | 0.231559 | 0.000000 | 0.392063 | 0.392063 | 0.298174 | 0.392063 | 0.000000 |
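Notice that "mandela", which occurs in all three documents, receives the lowest nonzero weight in every row, while words unique to a single document score highest. The fitted idf weights make this explicit:

# Terms appearing in every document get the smallest possible idf (here exactly 1.0).
idf_scores = pd.Series(vectorizer_4.idf_, index=vectorizer_4.get_feature_names_out())
print(idf_scores.sort_values().head())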
docs_5 = ["Lions and cheetahs are both big cats found in Africa.",
"Elephants in Africa are known for their large tusks.",
"Cheetahs are the fastest land animals."]
X5 = vectorizer_4.fit_transform(docs_5)
wildlife_df = pd.DataFrame(X5.toarray(), columns=vectorizer_4.get_feature_names_out())
wildlife_df
|   | africa | and | animals | are | big | both | cats | cheetahs | elephants | fastest | for | found | in | known | land | large | lions | the | their | tusks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.267485 | 0.351711 | 0.000000 | 0.207726 | 0.351711 | 0.351711 | 0.351711 | 0.267485 | 0.000000 | 0.000000 | 0.000000 | 0.351711 | 0.267485 | 0.000000 | 0.000000 | 0.000000 | 0.351711 | 0.000000 | 0.000000 | 0.000000 |
| 1 | 0.277601 | 0.000000 | 0.000000 | 0.215582 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.365011 | 0.000000 | 0.365011 | 0.000000 | 0.277601 | 0.365011 | 0.000000 | 0.365011 | 0.000000 | 0.000000 | 0.365011 | 0.365011 |
| 2 | 0.000000 | 0.000000 | 0.450504 | 0.266075 | 0.000000 | 0.000000 | 0.000000 | 0.342620 | 0.000000 | 0.450504 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.450504 | 0.000000 | 0.000000 | 0.450504 | 0.000000 | 0.000000 |
# ngram_range=(1, 2) extracts both unigrams and bigrams
vectorizer_6 = TfidfVectorizer(ngram_range=(1,2))
docs_6 = ["The Sahara is a vast desert.",
"The Nile cuts through several African countries.",
"The Congo rainforest is vast and diverse."]
X6 = vectorizer_6.fit_transform(docs_6)
rivers_df = pd.DataFrame(X6.toarray(), columns=vectorizer_6.get_feature_names_out())
rivers_df
|   | african | african countries | and | and diverse | congo | congo rainforest | countries | cuts | cuts through | desert | ... | several african | the | the congo | the nile | the sahara | through | through several | vast | vast and | vast desert |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.375716 | ... | 0.000000 | 0.221904 | 0.000000 | 0.000000 | 0.375716 | 0.000000 | 0.000000 | 0.285742 | 0.000000 | 0.375716 |
| 1 | 0.284569 | 0.284569 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.284569 | 0.284569 | 0.284569 | 0.000000 | ... | 0.284569 | 0.168071 | 0.000000 | 0.284569 | 0.000000 | 0.284569 | 0.284569 | 0.000000 | 0.000000 | 0.000000 |
| 2 | 0.000000 | 0.000000 | 0.300366 | 0.300366 | 0.300366 | 0.300366 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.177401 | 0.300366 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.228436 | 0.300366 | 0.000000 |
3 rows × 30 columns
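Keeping both single words and two-word phrases is why the matrix grows to 30 columns here. To study multi-word phrases in isolation, a bigram-only variant could be used; a sketch:

# ngram_range=(2, 2) keeps bigrams only, discarding single words.
bigram_vectorizer = TfidfVectorizer(ngram_range=(2, 2))
X_bigrams = bigram_vectorizer.fit_transform(docs_6)
print(bigram_vectorizer.get_feature_names_out())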
TfidfTransformer#
While TfidfVectorizer takes in raw text and produces tf-idf scores, TfidfTransformer is used after CountVectorizer to convert the count matrix into a tf-idf representation.
from sklearn.feature_extraction.text import TfidfTransformer
docs_7 = ["Accra is the hub of Ghana.",
"Ghana is known for its gold resources.",
"Accra hosts several historic sites."]
count_vect = CountVectorizer()
X7_count = count_vect.fit_transform(docs_7)
tfidf_transformer = TfidfTransformer()
X7_tfidf = tfidf_transformer.fit_transform(X7_count)
ghana_df = pd.DataFrame(X7_tfidf.toarray(), columns=count_vect.get_feature_names_out())
ghana_df
|   | accra | for | ghana | gold | historic | hosts | hub | is | its | known | of | resources | several | sites | the |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.349498 | 0.000000 | 0.349498 | 0.000000 | 0.000000 | 0.000000 | 0.459548 | 0.349498 | 0.000000 | 0.000000 | 0.459548 | 0.000000 | 0.000000 | 0.000000 | 0.459548 |
| 1 | 0.000000 | 0.403016 | 0.306504 | 0.403016 | 0.000000 | 0.000000 | 0.000000 | 0.306504 | 0.403016 | 0.403016 | 0.000000 | 0.403016 | 0.000000 | 0.000000 | 0.000000 |
| 2 | 0.355432 | 0.000000 | 0.000000 | 0.000000 | 0.467351 | 0.467351 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.467351 | 0.467351 | 0.000000 |
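As a sanity check, this two-step route should produce exactly what TfidfVectorizer yields directly on the same documents, since both use the same default settings; a quick sketch:

import numpy as np
# CountVectorizer + TfidfTransformer is equivalent to TfidfVectorizer.
direct = TfidfVectorizer().fit_transform(docs_7)
print(np.allclose(direct.toarray(), X7_tfidf.toarray()))  # expected: True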
docs_8 = ["Jollof rice is a popular dish in West Africa.",
"Ugali is a staple food in East Africa.",
"Biltong is a snack originating from South Africa."]
X8_count = count_vect.fit_transform(docs_8)
X8_tfidf = tfidf_transformer.fit_transform(X8_count)
afrofoods_df = pd.DataFrame(X8_tfidf.toarray(), columns=count_vect.get_feature_names_out())
afrofoods_df
|   | africa | biltong | dish | east | food | from | in | is | jollof | originating | popular | rice | snack | south | staple | ugali | west |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.235756 | 0.00000 | 0.399169 | 0.000000 | 0.000000 | 0.00000 | 0.303578 | 0.235756 | 0.399169 | 0.00000 | 0.399169 | 0.399169 | 0.00000 | 0.00000 | 0.000000 | 0.000000 | 0.399169 |
| 1 | 0.257129 | 0.00000 | 0.000000 | 0.435357 | 0.435357 | 0.00000 | 0.331100 | 0.257129 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.00000 | 0.00000 | 0.435357 | 0.435357 | 0.000000 |
| 2 | 0.247433 | 0.41894 | 0.000000 | 0.000000 | 0.000000 | 0.41894 | 0.000000 | 0.247433 | 0.000000 | 0.41894 | 0.000000 | 0.000000 | 0.41894 | 0.41894 | 0.000000 | 0.000000 | 0.000000 |
# norm='l1' rescales each row so that its tf-idf weights sum to 1
tfidf_transformer_9 = TfidfTransformer(norm='l1')
docs_9 = ["African music is diverse.",
"Afrobeats and Highlife are popular genres.",
"Music festivals are common across the continent."]
X9_count = count_vect.fit_transform(docs_9)
X9_tfidf = tfidf_transformer_9.fit_transform(X9_count)
afrimusic_df = pd.DataFrame(X9_tfidf.toarray(), columns=count_vect.get_feature_names_out())
afrimusic_df
|   | across | african | afrobeats | and | are | common | continent | diverse | festivals | genres | highlife | is | music | popular | the |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00000 | 0.26592 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.00000 | 0.26592 | 0.00000 | 0.000000 | 0.000000 | 0.26592 | 0.202239 | 0.000000 | 0.00000 |
| 1 | 0.00000 | 0.00000 | 0.173595 | 0.173595 | 0.132024 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.173595 | 0.173595 | 0.00000 | 0.000000 | 0.173595 | 0.00000 |
| 2 | 0.15335 | 0.00000 | 0.000000 | 0.000000 | 0.116626 | 0.15335 | 0.15335 | 0.00000 | 0.15335 | 0.000000 | 0.000000 | 0.00000 | 0.116626 | 0.000000 | 0.15335 |
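With norm='l1', the weights in each row sum to 1, so every row can be read as a distribution over that document's terms (compare the default norm='l2', where each row has unit Euclidean length). This is easy to verify:

# Each row of the l1-normalized matrix sums to 1.
print(X9_tfidf.toarray().sum(axis=1))  # expected: [1. 1. 1.]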
Task 7: #
Analyzing News Articles on African Youth Unemployment#
You’re a sociologist who’s investigating the portrayal of youth unemployment in African news media. You’ve collected several news articles discussing youth unemployment in various African nations. Your aim is to identify the most discussed themes and assess the importance of different terms in the articles using feature extraction methods.
1. Load the youth employment articles using the command `%load youth_emp_article.py`.
2. Tokenize the articles into individual words.
3. Use `CountVectorizer` to count word occurrences.
4. Use `TfidfVectorizer` to compute the Term Frequency-Inverse Document Frequency (TF-IDF) values for each term.
5. Alternatively, use `TfidfTransformer` to compute the TF-IDF values if starting from the raw counts produced by `CountVectorizer`.
6. Analyze the top terms to understand the main themes in the articles (a starter sketch follows this list).
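A possible starting point is sketched below. It assumes the loaded file defines a list of article strings; the variable name articles is an assumption, so adapt it to whatever youth_emp_article.py actually provides. The same skeleton applies to Task 8.

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Hypothetical: `articles` is the list of article strings from youth_emp_article.py.
tfidf_vect = TfidfVectorizer(stop_words='english')
X_articles = tfidf_vect.fit_transform(articles)

# Sum each term's tf-idf weight across all articles and list the strongest themes.
term_totals = pd.Series(X_articles.toarray().sum(axis=0),
                        index=tfidf_vect.get_feature_names_out())
print(term_totals.nlargest(10))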
Task 8: #
Analyzing Economic Reports on the African Agricultural Export Potential#
You’re an economist at the African Union’s Department of Economic Affairs. With increasing talks about intra-African trade and global exports, you’ve gathered several economic reports discussing the potential of African agricultural exports and their economic impact. Your goal is to extract insights about the most emphasized agricultural commodities and understand the most significant themes across the reports using text analysis techniques.
1. Load the economic reports using the command `%load eco_reports.py`.
2. Tokenize the economic reports into individual words.
3. Use `CountVectorizer` to compute the frequency of word occurrences.
4. Apply `TfidfVectorizer` to determine the Term Frequency-Inverse Document Frequency (TF-IDF) values for each term. Alternatively, if starting from the raw counts produced by `CountVectorizer`, use `TfidfTransformer` to calculate the TF-IDF values.
5. Evaluate the top terms to decipher the primary commodities and themes in the economic reports.