Chapter 1: Introduction to Text Data

Introduction:

In today’s digital age, vast amounts of text are produced daily on blogs, social platforms, and web pages. This data holds significant insights, but converting it into actionable information is a challenge. Natural Language Processing (NLP) offers a solution. As a subfield of artificial intelligence, NLP focuses on enabling machines to process and interpret human language, whether written or spoken.

Many companies accumulate massive amounts of text data, often gigabytes or more. Analyzing this manually is almost impossible. Yet, with NLP techniques, we can train computers to interpret and extract valuable insights from this text. This includes identifying parts of speech, such as nouns, verbs, and adjectives. Moreover, NLP can gauge the sentiment behind a piece of writing, determining whether it conveys a positive or negative tone. This capability has empowered businesses to better understand customer opinions about their products by analyzing their feedback.
Let’s start from the basics: understanding textual data.

Learning Objectives:

  1. Grasp the essence of text data and its significance in real-world scenarios.

  2. Explore the power and functionality of Python string operations.

  3. Understand the importance of tokenization and how to implement it from scratch.

  4. Discover the world of regex for searching and modifying text data.

What is Text Data?

Text data, in its simplest form, is information conveyed in a written format. This could range from a single word to extensive documents.

Real-world Applications of Text Data:

  • Sentiment Analysis: Evaluating feedback on local e-commerce platforms in African countries to enhance customer satisfaction. Imagine being able to gauge user reactions to a product launch in Nairobi using reviews posted online.

  • Information Retrieval: Consider the importance of having a specialized search engine for African academic papers to promote research within the continent.

  • Machine Translation: Think about breaking language barriers, like translating real-time news from Afrikaans to Igbo, from Wolof to Chichewa, enabling more widespread information dissemination.

Introduction to Strings in Python

In the coding world, strings are how we represent and manipulate text. Python, being a versatile language, provides a suite of tools to handle strings effectively.

example_string = "African proverbs carry wisdom and depth, like 'Only a fool tests the depth of a river with both feet.'"

Common String Operations:

Strings in Python come equipped with a set of built-in methods that allow for efficient manipulation and evaluation. Understanding these methods, especially in the context of textual data analysis, is essential for any data scientist.

The .split() method

This method splits a string into a list of substrings. By default it splits on whitespace, but we can specify another delimiter as an argument.

proverb = "Unity is strength, division is weakness."

words = proverb.split()
print(words)
['Unity', 'is', 'strength,', 'division', 'is', 'weakness.']
# Using a different delimiter
phrases = proverb.split(', ')

print(phrases) 
['Unity is strength', 'division is weakness.']
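The difference between the default behaviour and an explicit delimiter matters when the spacing is irregular. The sketch below uses a made-up string (`messy` is an illustrative name, not part of the example above) to compare the two, and also shows the optional `maxsplit` argument.

```python
# Hypothetical input with irregular spacing
messy = "Unity  is   strength"

# No argument: any run of whitespace counts as one separator
default_split = messy.split()
print(default_split)   # ['Unity', 'is', 'strength']

# Explicit single space: every space is a separator, so empty strings appear
space_split = messy.split(" ")
print(space_split)     # ['Unity', '', 'is', '', '', 'strength']

# maxsplit limits the number of splits, counted from the left
first_split = messy.split(maxsplit=1)
print(first_split)     # ['Unity', 'is   strength']
```

For free text, calling .split() without an argument is usually the safer choice, since it avoids empty tokens.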

The .join() method

The opposite of .split(), .join() combines a list of strings into a single string. It is called on the delimiter, which is placed between the items.

african_countries = ["Nigeria", "Kenya", "Tanzania"]
sentence = ', '.join(african_countries)
print(sentence) 
Nigeria, Kenya, Tanzania

The .join() method can also be used to build file paths on our computer. For instance, we can get the path to the current working directory, which is returned as a string.

import os
os.getcwd()
'/home/rockefeller/Desktop/Teachings_n_Talks/Teachings/AIMSCmr_2324/course_content/data_prep/part_4'

We can then build the path to a specific folder by joining the two strings with a separator:

data_folder = 'data/'

path_to_data = "/".join([os.getcwd(), data_folder])
path_to_data
'/home/rockefeller/Desktop/Teachings_n_Talks/Teachings/AIMSCmr_2324/course_content/data_prep/part_4/data/'
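Joining with "/" works on Linux and macOS, but the separator differs on Windows. As an alternative sketch, the standard library function `os.path.join` picks the separator for the current operating system. The folder name `data` is the same one assumed above, written without the trailing slash.

```python
import os

data_folder = "data"

# os.path.join inserts the path separator of the current operating system
path_to_data = os.path.join(os.getcwd(), data_folder)
print(path_to_data)
```

The printed path depends on where the code is run, so no output is shown here.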

The .replace() method

This method is handy when you need to replace a substring with another substring.

swahili_greeting = "Habari gani!"

english_translation = swahili_greeting.replace("Habari gani", "How are you")

print(english_translation) 
How are you!
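Two details of .replace() are easy to miss: it is case-sensitive, and it replaces every occurrence unless a count is given. A small sketch with a made-up string (`chant` is illustrative only):

```python
chant = "pole pole, Pole pole"   # hypothetical example string

# Case-sensitive: the capitalised "Pole" is left untouched
all_replaced = chant.replace("pole", "slowly")
print(all_replaced)    # slowly slowly, Pole slowly

# The optional third argument caps the number of replacements (left to right)
one_replaced = chant.replace("pole", "slowly", 1)
print(one_replaced)    # slowly pole, Pole pole

# Strings are immutable, so the original is unchanged
print(chant)           # pole pole, Pole pole
```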

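The .startswith() & .endswith() methods

These methods check whether a string starts or ends with a specified substring. In the example below both checks return False: the string starts with "Jaraba", and it ends with " !" rather than "Africa".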

song = "Jaraba Africa !"

print(song.startswith("Africa"))  
False
print(song.endswith("Africa"))       
False
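For contrast, here is a short sketch on the same string where the checks succeed. It also shows that both methods accept a tuple of candidates, which is convenient when several prefixes or suffixes are acceptable.

```python
song = "Jaraba Africa !"

starts = song.startswith("Jaraba")
print(starts)   # True

# Trim the trailing space and "!" before testing the ending
ends = song.rstrip(" !").endswith("Africa")
print(ends)     # True

# A tuple of candidates: True if any one of them matches
either = song.startswith(("Africa", "Jaraba"))
print(either)   # True
```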

The .find() & .rfind() methods

Use .find() to locate the position of the first occurrence of a substring in a string. If the substring isn’t found, it returns -1. .rfind() searches from the right instead and returns the position of the last occurrence.

rhyme = "African sunsets are a sight to behold."
position = rhyme.find("sunsets")
print(position) 
8
last_occurrence = rhyme.rfind("a")
print(last_occurrence) 
20
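Because -1 is also a valid index in Python (it refers to the last character), the result of .find() should be checked before it is used for slicing. A minimal sketch on the same sentence:

```python
rhyme = "African sunsets are a sight to behold."

# A missing substring is reported as -1, not as an error
missing = rhyme.find("sunrise")
print(missing)    # -1

# When only presence matters, the `in` operator is simpler
present = "sunsets" in rhyme
print(present)    # True

# find() is case-sensitive: the first lowercase "a" is at index 5, not 0
first_a = rhyme.find("a")
print(first_a)    # 5
```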

The .lower() & .upper() methods

These are used for changing the case of the string.

phrase = "African Rhythms"
print(phrase.lower())  
african rhythms
print(phrase.upper())  
AFRICAN RHYTHMS
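A common use of case conversion in text analysis is normalisation: making comparisons and counts independent of capitalisation. The strings below are made up for illustration.

```python
phrase_a = "African Rhythms"
phrase_b = "AFRICAN rhythms"   # hypothetical variant of the same phrase

same_raw = phrase_a == phrase_b
print(same_raw)      # False

same_lower = phrase_a.lower() == phrase_b.lower()
print(same_lower)    # True

# Counting a word regardless of how it was capitalised
words = "Jollof jollof JOLLOF rice".lower().split()
jollof_count = words.count("jollof")
print(jollof_count)  # 3
```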

The .strip(), .rstrip(), & .lstrip() methods

These methods trim whitespace (the default) or a specified set of characters from both ends (.strip()), the right end (.rstrip()), or the left end (.lstrip()) of a string.

data_entry = "    Timbuktu   "
print(data_entry.strip())
Timbuktu
extra_chars = "XYZThis is an exampleXYZ"
print(extra_chars.strip('XYZ')) 
This is an example
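One caveat: the argument is treated as a set of characters, not as a prefix or suffix, so trimming can remove more than intended. The sketch below uses a hypothetical file name to show the pitfall, together with one way to remove an exact suffix using slicing.

```python
filename = "sunset.txt"   # hypothetical example

# '.', 't' and 'x' are each stripped from the right, including the final 't' of "sunset"
over_stripped = filename.rstrip(".txt")
print(over_stripped)   # sunse

# Removing an exact suffix instead
suffix = ".txt"
stem = filename[:-len(suffix)] if filename.endswith(suffix) else filename
print(stem)            # sunset

# One-sided trimming of whitespace
padded = "   Timbuktu   "
left_trimmed = padded.lstrip()
right_trimmed = padded.rstrip()
print(repr(left_trimmed))    # 'Timbuktu   '
print(repr(right_trimmed))   # '   Timbuktu'
```

On Python 3.9 and later, str.removesuffix() and str.removeprefix() do the same job as the slicing line.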

Tokenization Methods from Scratch

Tokenization is the process of splitting a large paragraph into sentences or words. Essentially, it involves breaking up text into units, known as tokens.

  • Why Tokenization? In languages and dialects across Africa, from Swahili in Tanzania to Zulu in South Africa or Yoruba in Nigeria, words often carry a lot of weight. They can represent culture, traditions, and histories. Thus, when analyzing text data from such rich languages, it’s crucial to break it down accurately to understand the essence of each token.

  • Word Tokenization: The most basic form of tokenization is word tokenization, where we break paragraphs into individual words.

def word_tokenize(text):
    # Splitting by space
    return text.split()


sentence = "Habari yako rafiki?"
tokens = word_tokenize(sentence)
print(tokens) 
['Habari', 'yako', 'rafiki?']
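Note that splitting on whitespace leaves punctuation attached to the last word ('rafiki?'). One possible refinement, sketched here with the `re` module, is to treat each punctuation mark as its own token. The function name `word_tokenize_punct` is illustrative, and this is only one of several reasonable tokenization rules.

```python
import re

def word_tokenize_punct(text):
    # \w+      : a run of letters, digits, or underscores (a word)
    # [^\w\s]  : a single character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

tokens = word_tokenize_punct("Habari yako rafiki?")
print(tokens)
# ['Habari', 'yako', 'rafiki', '?']
```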

Sentence Tokenization

Here, instead of words, we aim to split the paragraphs into individual sentences.

import re

def sentence_tokenize(text):
    # Split on full stop, question mark, and exclamation mark,
    # then drop pieces that are empty or contain only whitespace
    return [sent.strip() for sent in re.split(r'[.!?]', text) if sent.strip()]


paragraph = "Karibu Tanzania! Je, unapenda safari? Ndiyo, napenda sana."

tokens = sentence_tokenize(paragraph)
print(tokens)  
['Karibu Tanzania', 'Je, unapenda safari', 'Ndiyo, napenda sana']
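The version above discards the punctuation, so a question can no longer be told apart from a statement. A possible variant, sketched below, splits on the whitespace that follows a sentence-ending mark and therefore keeps the mark. The name `sentence_tokenize_keep` is illustrative, and the rule is still naive: abbreviations such as "Dr." would be split too.

```python
import re

def sentence_tokenize_keep(text):
    # (?<=[.!?]) is a lookbehind: the split point must come right after
    # ., ! or ?, but the mark itself is not consumed by the split
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

paragraph = "Karibu Tanzania! Je, unapenda safari? Ndiyo, napenda sana."
sentences = sentence_tokenize_keep(paragraph)
print(sentences)
# ['Karibu Tanzania!', 'Je, unapenda safari?', 'Ndiyo, napenda sana.']
```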

Basic Regex Operations

Regular expressions allow us to perform complex searches in a text, making them indispensable in text processing tasks. Let’s look at their basic operations, using examples from the African setting.

Searching for Patterns

Regex can help search for specific patterns or words in a text.

import re

def find_pattern(pattern, text):
    return re.findall(pattern, text)

text = "Nairobi is the capital of Kenya. Kampala is the capital of Uganda."

pattern = "capital"
matches = find_pattern(pattern, text)

print(matches) 
['capital', 'capital']
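re.findall returns only the matched text. When the location of each match is needed, re.search and re.finditer return match objects that carry positions. The sketch below reuses the same sentence and also shows the re.IGNORECASE flag for case-insensitive searches.

```python
import re

text = "Nairobi is the capital of Kenya. Kampala is the capital of Uganda."

# re.search returns the first match (or None if there is no match)
first = re.search("capital", text)
print(first.group(), first.start(), first.end())   # capital 15 22

# re.finditer yields one match object per occurrence
positions = [m.start() for m in re.finditer("capital", text)]
print(positions)                                   # [15, 48]

# Flags change how the pattern is matched; here case is ignored
insensitive = re.findall("kampala", text, flags=re.IGNORECASE)
print(insensitive)                                 # ['Kampala']
```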

Character Sets

You can search for any character from a specific set.

pattern = "[NK]ampala" 
matches = find_pattern(pattern, text)

print(matches)  
['Kampala']

Here, [NK]ampala will match strings that start with either 'N' or 'K' followed by 'ampala'.
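Character sets can also hold ranges such as A-Z, and a leading ^ negates the set. As a rough sketch on the same sentence, the first pattern below picks out capitalised words and the second finds everything that is neither an ASCII letter nor a space.

```python
import re

text = "Nairobi is the capital of Kenya. Kampala is the capital of Uganda."

# [A-Z] : one uppercase ASCII letter, [a-z]+ : one or more lowercase ASCII letters
capitalised = re.findall(r"[A-Z][a-z]+", text)
print(capitalised)   # ['Nairobi', 'Kenya', 'Kampala', 'Uganda']

# [^...] matches any single character NOT listed in the set
others = re.findall(r"[^a-zA-Z ]", text)
print(others)        # ['.', '.']
```

These ranges cover ASCII letters only; accented or non-Latin characters would need a broader pattern such as \w.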

Wildcard Characters

The dot . is a wildcard character that matches any character (except for a newline).

pattern = ".ampala"

matches = find_pattern(pattern, text)

print(matches)  
['Kampala']
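Since the dot matches any character, a literal full stop has to be escaped as \. in a pattern. The following sketch contrasts the two uses on the same sentence.

```python
import re

text = "Nairobi is the capital of Kenya. Kampala is the capital of Uganda."

# Each unescaped dot stands for exactly one arbitrary character
k_plus_four = re.findall(r"K....", text)
print(k_plus_four)    # ['Kenya', 'Kampa']

# \. matches a literal full stop: words that end a sentence
sentence_final = re.findall(r"\w+\.", text)
print(sentence_final) # ['Kenya.', 'Uganda.']
```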

Replacing Text

Regex can also be used to replace specific patterns in a string.

def replace_pattern(pattern, replacement, text):
    return re.sub(pattern, replacement, text)

new_text = replace_pattern("capital", "hub", text)
print(new_text)  
Nairobi is the hub of Kenya. Kampala is the hub of Uganda.
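For a fixed word like "capital", the plain .replace() string method would give the same result. re.sub becomes useful when the replacement depends on what was matched. As an illustrative sketch, parentheses capture parts of the match and \1, \2 refer back to them in the replacement.

```python
import re

text = "Nairobi is the capital of Kenya. Kampala is the capital of Uganda."

# (\w+) captures the city as group 1 and the country as group 2;
# the replacement string reuses them in reverse order
swapped = re.sub(r"(\w+) is the capital of (\w+)", r"\2's capital is \1", text)
print(swapped)
# Kenya's capital is Nairobi. Uganda's capital is Kampala.
```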