Chapter 3: Feature Extraction from Text Data#

Introduction#

Feature extraction is a pivotal step in the text mining process: it translates textual data into a numerical form that machine learning models can work with, and it is the bedrock of many natural language processing tasks. Sklearn provides a suite of tools to transform text data efficiently into a format suitable for machine learning. Through African-context examples, we will see the versatility and applicability of these tools across a variety of textual scenarios. As we venture into more advanced topics, mastering the basics of feature extraction remains paramount.

This chapter offers an exploration of sklearn’s text feature extraction techniques.

Learning Objectives:

  • Understand Basic Text Representation: Comprehend the necessity of converting textual data into numerical format for machine learning applications, and appreciate the significance of feature extraction in text mining.

  • Master CountVectorizer: Confidently utilize the CountVectorizer method to transform text documents into a matrix of token counts, distinguishing how individual words and tokens are represented in this format.

  • Differentiate Vectorization Techniques: Discern the differences between TfidfVectorizer and the combination of CountVectorizer with TfidfTransformer. Know when to apply each method based on the task at hand.

Understanding Document-Term Matrix (DTM)#

The Document-Term Matrix (DTM) is a matrix representation of a text dataset in which each row corresponds to a document, each column represents a term (typically a word), and each cell contains the frequency of that term in the document.

Consider two sentences:

  1. “I love machine learning.”

  2. “Learning machine algorithms is fun.”

The DTM for these sentences would have a row for each sentence and columns for each unique word.

CountVectorizer#

CountVectorizer turns text documents into a matrix of token counts. Each row represents a document and each column represents a token (word), with each value indicating how many times that token appears in the corresponding document.

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
# Constructing a DTM using the above example

sample_sentences = ["I love machine learning.", "Learning machine algorithms is fun."]

vectorizer = CountVectorizer()
X0 = vectorizer.fit_transform(sample_sentences)

# Convert to a DataFrame for better visualization
df = pd.DataFrame(X0.toarray(), columns=vectorizer.get_feature_names_out())
df
   algorithms  fun  is  learning  love  machine
0           0    0   0         1     1        1
1           1    1   1         1     0        1

Note that the word “I” does not appear as a column: CountVectorizer lowercases the text, and its default token pattern keeps only tokens of two or more characters.
docs_1 = ["Nairobi is the capital of Kenya.", 
          "Lagos is a bustling city in Nigeria.", 
          "Cairo is the heart of Egypt."]

vectorizer = CountVectorizer()
X1 = vectorizer.fit_transform(docs_1)

# Convert to a DataFrame for better visualization
capitals_df = pd.DataFrame(X1.toarray(), columns=vectorizer.get_feature_names_out())
capitals_df
   bustling  cairo  capital  city  egypt  heart  in  is  kenya  lagos  nairobi  nigeria  of  the
0         0      0        1     0      0      0   0   1      1      0        1        0   1    1
1         1      0        0     1      0      0   1   1      0      1        0        1   0    0
2         0      1        0     0      1      1   0   1      0      0        0        0   1    1
# Removing common English stop words ('is', 'the', 'of', 'in', ...)
vectorizer_2 = CountVectorizer(stop_words='english')
X1 = vectorizer_2.fit_transform(docs_1)

# Convert to a DataFrame for better visualization
capitals_df = pd.DataFrame(X1.toarray(), columns=vectorizer_2.get_feature_names_out())
capitals_df
   bustling  cairo  capital  city  egypt  heart  kenya  lagos  nairobi  nigeria
0         0      0        1     0      0      0      1      0        1        0
1         1      0        0     1      0      0      0      1        0        1
2         0      1        0     0      1      1      0      0        0        0
docs_2 = ["African proverbs are wise sayings.", 
          "Proverbs offer wisdom and life lessons.", 
          "Wisdom is wealth."]

X2 = vectorizer.fit_transform(docs_2)
print(vectorizer.get_feature_names_out())
proverbs_df = pd.DataFrame(X2.toarray(), columns=vectorizer.get_feature_names_out())
proverbs_df
['african' 'and' 'are' 'is' 'lessons' 'life' 'offer' 'proverbs' 'sayings'
 'wealth' 'wisdom' 'wise']
   african  and  are  is  lessons  life  offer  proverbs  sayings  wealth  wisdom  wise
0        1    0    1   0        0     0      0         1        1       0       0     1
1        0    1    0   0        1     1      1         1        0       0       1     0
2        0    0    0   1        0     0      0         0        0       1       1     0
# Token patterns: customize which character sequences count as tokens.
# Note: \w also matches digits, so numeric tokens like '54' and '2nd' are kept here.
vectorizer_3 = CountVectorizer(token_pattern=r'\b\w+\b')
docs_3 = ["Africa has 54 countries.", 
          "It is the 2nd largest continent.", 
          "Kilimanjaro is the tallest mountain in Africa."]

X3 = vectorizer_3.fit_transform(docs_3)

facts_df = pd.DataFrame(X3.toarray(), columns=vectorizer_3.get_feature_names_out())
facts_df
   2nd  54  africa  continent  countries  has  in  is  it  kilimanjaro  largest  mountain  tallest  the
0    0   1       1          0          1    1   0   0   0            0        0         0        0    0
1    1   0       0          1          0    0   0   1   1            0        1         0        0    1
2    0   0       1          0          0    0   1   1   0            1        0         1        1    1

TfidfVectorizer#

TfidfVectorizer converts text documents into a matrix of token counts and then transforms this count matrix into a tf-idf representation. Tf-idf stands for “Term Frequency-Inverse Document Frequency”. It scores a token highly when it appears often within a document but rarely across the rest of the corpus, so distinctive words receive more weight than ubiquitous ones.
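Concretely, for a term t in a document d:

tf-idf(t, d) = tf(t, d) × idf(t),    where    idf(t) = ln((1 + n) / (1 + df(t))) + 1

under scikit-learn’s default settings (smooth_idf=True), with n the total number of documents and df(t) the number of documents containing t. Each row is then normalized to unit length (norm='l2' by default), which is why the scores in the outputs below are fractions rather than raw counts.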

# Basic tf-idf scores
from sklearn.feature_extraction.text import TfidfVectorizer

docs_4 = ["Nelson Mandela was a leader in South Africa.", 
          "Mandela fought for freedom and equality.", 
          "South Africa saw profound change under Mandela's leadership."]

vectorizer_4 = TfidfVectorizer()
X4 = vectorizer_4.fit_transform(docs_4)

sa_facts = pd.DataFrame(X4.toarray(), columns=vectorizer_4.get_feature_names_out())
sa_facts
     africa       and    change  equality       for    fought   freedom        in    leader  leadership   mandela    nelson  profound       saw     south     under       was
0  0.324124  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.426184  0.426184    0.000000  0.251711  0.426184  0.000000  0.000000  0.324124  0.000000  0.426184
1  0.000000  0.432385  0.000000  0.432385  0.432385  0.432385  0.432385  0.000000  0.000000    0.000000  0.255374  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
2  0.298174  0.000000  0.392063  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000    0.392063  0.231559  0.000000  0.392063  0.392063  0.298174  0.392063  0.000000
docs_5 = ["Lions and cheetahs are both big cats found in Africa.", 
          "Elephants in Africa are known for their large tusks.", 
          "Cheetahs are the fastest land animals."]

X5 = vectorizer_4.fit_transform(docs_5)

animals_df = pd.DataFrame(X5.toarray(), columns=vectorizer_4.get_feature_names_out())
animals_df
     africa       and   animals       are       big      both      cats  cheetahs  elephants   fastest       for     found        in     known      land     large     lions       the     their     tusks
0  0.267485  0.351711  0.000000  0.207726  0.351711  0.351711  0.351711  0.267485   0.000000  0.000000  0.000000  0.351711  0.267485  0.000000  0.000000  0.000000  0.351711  0.000000  0.000000  0.000000
1  0.277601  0.000000  0.000000  0.215582  0.000000  0.000000  0.000000  0.000000   0.365011  0.000000  0.365011  0.000000  0.277601  0.365011  0.000000  0.365011  0.000000  0.000000  0.365011  0.365011
2  0.000000  0.000000  0.450504  0.266075  0.000000  0.000000  0.000000  0.342620   0.000000  0.450504  0.000000  0.000000  0.000000  0.000000  0.450504  0.000000  0.000000  0.450504  0.000000  0.000000
vectorizer_6 = TfidfVectorizer(ngram_range=(1, 2))  # extract unigrams and bigrams

docs_6 = ["The Sahara is a vast desert.", 
          "The Nile cuts through several African countries.", 
          "The Congo rainforest is vast and diverse."]

X6 = vectorizer_6.fit_transform(docs_6)

geography_df = pd.DataFrame(X6.toarray(), columns=vectorizer_6.get_feature_names_out())
geography_df
    african  african countries       and  and diverse     congo  congo rainforest  countries      cuts  cuts through    desert  ...  several african       the  the congo  the nile  the sahara   through  through several      vast  vast and  vast desert
0  0.000000           0.000000  0.000000     0.000000  0.000000          0.000000   0.000000  0.000000      0.000000  0.375716  ...         0.000000  0.221904   0.000000  0.000000    0.375716  0.000000         0.000000  0.285742  0.000000     0.375716
1  0.284569           0.284569  0.000000     0.000000  0.000000          0.000000   0.284569  0.284569      0.284569  0.000000  ...         0.284569  0.168071   0.000000  0.284569    0.000000  0.284569         0.284569  0.000000  0.000000     0.000000
2  0.000000           0.000000  0.300366     0.300366  0.300366          0.300366   0.000000  0.000000      0.000000  0.000000  ...         0.000000  0.177401   0.300366  0.000000    0.000000  0.000000         0.000000  0.228436  0.300366     0.000000

[3 rows x 30 columns]

TfidfTransformer#

While TfidfVectorizer takes raw text and produces tf-idf scores directly, TfidfTransformer is applied after CountVectorizer to convert an existing count matrix into a tf-idf representation.

from sklearn.feature_extraction.text import TfidfTransformer

docs_7 = ["Accra is the hub of Ghana.", 
          "Ghana is known for its gold resources.", 
          "Accra hosts several historic sites."]


count_vect = CountVectorizer()
X7_count = count_vect.fit_transform(docs_7)

tfidf_transformer = TfidfTransformer()
X7_tfidf = tfidf_transformer.fit_transform(X7_count)

ghana_df = pd.DataFrame(X7_tfidf.toarray(), columns=count_vect.get_feature_names_out())
ghana_df
      accra       for     ghana      gold  historic     hosts       hub        is       its     known        of  resources   several     sites       the
0  0.349498  0.000000  0.349498  0.000000  0.000000  0.000000  0.459548  0.349498  0.000000  0.000000  0.459548   0.000000  0.000000  0.000000  0.459548
1  0.000000  0.403016  0.306504  0.403016  0.000000  0.000000  0.000000  0.306504  0.403016  0.403016  0.000000   0.403016  0.000000  0.000000  0.000000
2  0.355432  0.000000  0.000000  0.000000  0.467351  0.467351  0.000000  0.000000  0.000000  0.000000  0.000000   0.000000  0.467351  0.467351  0.000000
docs_8 = ["Jollof rice is a popular dish in West Africa.", 
          "Ugali is a staple food in East Africa.", 
          "Biltong is a snack originating from South Africa."]


X8_count = count_vect.fit_transform(docs_8)
X8_tfidf = tfidf_transformer.fit_transform(X8_count)

afrofoods_df = pd.DataFrame(X8_tfidf.toarray(), columns=count_vect.get_feature_names_out())
afrofoods_df
     africa   biltong      dish      east      food      from        in        is    jollof  originating   popular      rice     snack     south    staple     ugali      west
0  0.235756   0.00000  0.399169  0.000000  0.000000   0.00000  0.303578  0.235756  0.399169      0.00000  0.399169  0.399169   0.00000   0.00000  0.000000  0.000000  0.399169
1  0.257129   0.00000  0.000000  0.435357  0.435357   0.00000  0.331100  0.257129  0.000000      0.00000  0.000000  0.000000   0.00000   0.00000  0.435357  0.435357  0.000000
2  0.247433   0.41894  0.000000  0.000000  0.000000   0.41894  0.000000  0.247433  0.000000      0.41894  0.000000  0.000000   0.41894   0.41894  0.000000  0.000000  0.000000
# L1 normalization: each document's tf-idf scores sum to 1 (the default is norm='l2')
tfidf_transformer_9 = TfidfTransformer(norm='l1')
docs_9 = ["African music is diverse.", 
          "Afrobeats and Highlife are popular genres.", 
          "Music festivals are common across the continent."]

X9_count = count_vect.fit_transform(docs_9)
X9_tfidf = tfidf_transformer_9.fit_transform(X9_count)

afrimusic_df = pd.DataFrame(X9_tfidf.toarray(), columns=count_vect.get_feature_names_out())
afrimusic_df
     across   african  afrobeats       and       are    common  continent   diverse  festivals    genres  highlife        is     music   popular       the
0   0.00000   0.26592   0.000000  0.000000  0.000000   0.00000    0.00000   0.26592    0.00000  0.000000  0.000000   0.26592  0.202239  0.000000   0.00000
1   0.00000   0.00000   0.173595  0.173595  0.132024   0.00000    0.00000   0.00000    0.00000  0.173595  0.173595   0.00000  0.000000  0.173595   0.00000
2   0.15335   0.00000   0.000000  0.000000  0.116626   0.15335    0.15335   0.00000    0.15335  0.000000  0.000000   0.00000  0.116626  0.000000   0.15335
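Because of the L1 norm, each document’s scores in afrimusic_df sum to 1, which is easy to confirm:

# Every row of the L1-normalized matrix sums to 1
print(afrimusic_df.sum(axis=1))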

Task 7: Analyzing News Articles on African Youth Unemployment#

You’re a sociologist investigating the portrayal of youth unemployment in African news media. You’ve collected several news articles discussing youth unemployment in various African nations. Your aim is to identify the most discussed themes and assess the importance of different terms in the articles using feature extraction methods.

  1. Load the youth employment articles using the following command: %load youth_emp_article.py

  2. Tokenize the articles into individual words.

  3. Use the CountVectorizer to count word occurrences.

  4. Use the TfidfVectorizer to compute the Term Frequency-Inverse Document Frequency (TF-IDF) values for each term.

  5. Alternatively, use the TfidfTransformer to compute TF-IDF values if starting with raw counts from CountVectorizer.

  6. Analyze the top terms to understand the main themes in the articles (a starter sketch follows below).
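A minimal starter sketch for steps 2–6 is shown below. It assumes the loaded file provides the articles as a list of strings; the articles variable and its contents here are placeholders, not the actual contents of youth_emp_article.py.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Placeholder articles -- replace with the list loaded from youth_emp_article.py
articles = ["Youth unemployment remains high in many African countries.",
            "Governments are launching skills programmes for young job seekers."]

# Steps 2-3: tokenize the articles and count word occurrences
count_vect = CountVectorizer(stop_words='english')
counts = count_vect.fit_transform(articles)

# Step 4: compute tf-idf scores directly from the raw text
tfidf_vect = TfidfVectorizer(stop_words='english')
tfidf = tfidf_vect.fit_transform(articles)

# Step 6: rank terms by total tf-idf across all articles to surface themes
totals = np.asarray(tfidf.sum(axis=0)).ravel()
terms = tfidf_vect.get_feature_names_out()
print(terms[np.argsort(totals)[::-1][:10]])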

Task 8: Analyzing Economic Reports on the African Agricultural Export Potential#

You’re an economist at the African Union’s Department of Economic Affairs. With increasing talks about intra-African trade and global exports, you’ve gathered several economic reports discussing the potential of African agricultural exports and their economic impact. Your goal is to extract insights about the most emphasized agricultural commodities and understand the most significant themes across the reports using text analysis techniques.

  1. Load the economic reports using the following command: %load eco_reports.py

  2. Tokenize the economic reports into individual words.

  3. Use the CountVectorizer to compute the frequency of word occurrences.

  4. Apply the TfidfVectorizer to determine the Term Frequency-Inverse Document Frequency (TF-IDF) values for each term. Alternatively, if starting with raw counts from CountVectorizer, use the TfidfTransformer to calculate TF-IDF values.

  5. Evaluate the top terms to identify the primary commodities and themes in the economic reports (see the sketch below).
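For the TfidfTransformer route in step 4, a sketch along the following lines converts raw counts into tf-idf values and then pulls the top terms per report. As before, the reports list is a placeholder standing in for the contents of eco_reports.py.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Placeholder reports -- replace with the list loaded from eco_reports.py
reports = ["Cocoa and coffee dominate West African export earnings.",
           "Maize and cassava underpin regional food trade."]

# Step 3: word frequencies
count_vect = CountVectorizer(stop_words='english')
counts = count_vect.fit_transform(reports)

# Step 4 (alternative): transform the counts into tf-idf values
tfidf = TfidfTransformer().fit_transform(counts)

# Step 5: top three terms per report by tf-idf weight
terms = count_vect.get_feature_names_out()
for i, row in enumerate(tfidf.toarray()):
    top = terms[np.argsort(row)[::-1][:3]]
    print(f"Report {i}: {', '.join(top)}")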