Step-by-Step Guide to NLP Basics: Text Preprocessing with Python Part 2


Learn the essential methods for representing text data in Natural Language Processing (NLP). Explore techniques like Bag of Words (BoW), TF-IDF, One-Hot Encoding, Word Embeddings, and Sentence Embeddings, all explained with Python code examples. These foundational methods transform raw text into numerical formats for machine learning and deep learning applications. Perfect for beginners looking to master text preprocessing and representation in NLP.

Once text is preprocessed, we need to represent it in a format that machine learning models can work with. Text representation transforms words or documents into numerical data while preserving their contextual meaning.
I have written Part 1; you can click through to it to review the related key concepts of NLP.

Key Concepts in Text Representation

1. Bag of Words (BoW)

The Bag of Words model represents text as a collection of words, disregarding grammar and word order but retaining word frequency. Each document is converted into a fixed-length numerical vector by counting how often each vocabulary word appears in it.

Advantages:

  • Easy to implement and understand.
  • Works well for text classification and clustering tasks when combined with machine learning models.

Limitations:

  • Ignores the order of words, losing context (e.g., “not good” vs. “good”).
  • Produces sparse vectors (many zeros), which are computationally expensive.

Use Cases:

  • Document classification (e.g., spam detection).
  • Topic modeling, and feature input for basic classifiers like Naive Bayes or SVM.

Python Example: Creating a BoW Representation

from sklearn.feature_extraction.text import CountVectorizer

# Example corpus
corpus = [
    "The cat sat on the mat.",
    "The dog barked at the mailman.",
]

# Create BoW representation
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)

# Display feature names and BoW array
print("Feature Names:", vectorizer.get_feature_names_out())
print("BoW Representation:\n", bow.toarray())

Output:

Feature Names: ['at' 'barked' 'cat' 'dog' 'mailman' 'mat' 'on' 'sat' 'the']
BoW Representation:
 [[0 0 1 0 0 1 1 1 2]
 [1 1 0 1 1 0 0 0 2]]
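
To see what CountVectorizer is doing under the hood, here is a minimal sketch that builds the same kind of count vectors using only the Python standard library (tokenization is deliberately simplified to lowercase text and whitespace splitting):

from collections import Counter

# Same two documents, stripped of punctuation for simplicity
docs = ["the cat sat on the mat", "the dog barked at the mailman"]
tokenized = [doc.split() for doc in docs]

# Fixed vocabulary across the whole corpus, sorted for a stable column order
vocab = sorted({word for doc in tokenized for word in doc})

# Each document becomes a fixed-length vector of word counts
bow_counts = [[Counter(doc)[word] for word in vocab] for doc in tokenized]

print("Vocabulary:", vocab)
print("Counts:", bow_counts)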

2. Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF weighs how often a word appears in a document against how often it appears across all documents, highlighting terms that are distinctive to a document while downplaying common ones.

  • Term Frequency (TF) measures how often a word appears in a document.
  • Inverse Document Frequency (IDF) reduces the weight of words that appear in many documents (e.g., “the,” “is”) to highlight more meaningful terms.

Advantages:

  • Highlights important words while downplaying common terms.
  • Captures relevance of terms for specific documents.

Limitations:

  • Still produces sparse vectors.
  • Lacks semantic understanding of words (e.g., synonyms).

Use Cases:

  • Search engines for ranking documents based on relevance.
  • Keyword extraction for summarizing content.

Python Example: Calculating TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

# Create TF-IDF representation (reuses the `corpus` from the BoW example)
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(corpus)

# Display feature names and TF-IDF array
print("Feature Names:", tfidf_vectorizer.get_feature_names_out())
print("TF-IDF Representation:\n", tfidf.toarray())

Output:

Feature Names: ['at' 'barked' 'cat' 'dog' 'mailman' 'mat' 'on' 'sat' 'the']
TF-IDF Representation:
 [[0.         0.         0.40740124 0.         0.         0.40740124
  0.40740124 0.40740124 0.57973867]
 [0.40740124 0.40740124 0.         0.40740124 0.40740124 0.
  0.         0.         0.57973867]]
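
To make these numbers less mysterious, here is a small sketch that reproduces one document's weights by hand, assuming scikit-learn's default settings (smoothed IDF and L2 normalization of each document vector):

import math

# Corpus from the BoW example: 2 documents
n_docs = 2

# Smoothed IDF, as used by TfidfVectorizer's defaults:
# idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1
def idf(doc_freq):
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Document 1: "The cat sat on the mat."
# 'the' occurs twice and appears in both documents (df=2);
# the other four words occur once and appear in only one document (df=1).
weights = {
    "the": 2 * idf(2),   # common word -> lower idf
    "cat": 1 * idf(1),
    "sat": 1 * idf(1),
    "on":  1 * idf(1),
    "mat": 1 * idf(1),
}

# L2-normalize the document vector, as TfidfVectorizer does by default
norm = math.sqrt(sum(w ** 2 for w in weights.values()))
normalized = {word: w / norm for word, w in weights.items()}

print(normalized)   # 'cat' ≈ 0.4074, 'the' ≈ 0.5797 -- matching the output above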

3. Word Embeddings

Word embeddings like Word2Vec and GloVe represent words in a dense vector space where similar words have closer representations.

They capture relationships between words, such as:

  • Vector(king) - Vector(man) + Vector(woman) ≈ Vector(queen)

Advantages:

  • Preserves semantic and syntactic relationships between words.
  • Reduces dimensionality compared to BoW and TF-IDF.
  • Works well in deep learning models.

Limitations:

  • Requires large corpora for training embeddings.
  • Pretrained embeddings may not capture domain-specific nuances.

Use Cases:

  • Sentiment analysis.
  • Chatbots and conversational AI.
  • Machine translation.

Python Example: Using Pretrained Word2Vec

from gensim.models import KeyedVectors

# Load pretrained Word2Vec model (download 'GoogleNews-vectors-negative300.bin' first)
model_path = "GoogleNews-vectors-negative300.bin"
w2v_model = KeyedVectors.load_word2vec_format(model_path, binary=True)

# Get vector for a word
word_vector = w2v_model["cat"]
print("Vector for 'cat':", word_vector)

# Similarity between words
similarity = w2v_model.similarity("cat", "dog")
print("Similarity between 'cat' and 'dog':", similarity)

4. One-Hot Encoding

Each word in the vocabulary is represented as a binary vector with a single high bit corresponding to the word’s position. For example, for a vocabulary of {cat, dog, sat}, the word "cat" is [1, 0, 0].

Advantages:

  • Simple to implement and interpret.
  • Useful for small vocabularies or when combined with shallow machine learning models.

Limitations:

  • Creates very high-dimensional vectors for large vocabularies.
  • No notion of semantic similarity (e.g., “cat” and “dog” are unrelated).

Use Cases:

  • Encoding categorical data for text-based machine learning.
  • Serving as input for simpler models when feature space is small.

Python Example: One-Hot Encoding with Keras

from tensorflow.keras.preprocessing.text import Tokenizer

# Example corpus
texts = ["The cat sat on the mat", "The dog barked at the mailman"]

# Fit the tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

# Convert each text to a binary vector marking which vocabulary words it contains
one_hot = tokenizer.texts_to_matrix(texts, mode="binary")
print("One-Hot Encoding:\n", one_hot)

Output:

One-Hot Encoding:
 [[0. 1. 1. 1. 1. 1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 1. 1. 1. 1.]]
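
Note that texts_to_matrix with mode="binary" produces one vector per document, marking which vocabulary words it contains. To obtain one vector per word, as described at the start of this section, a minimal sketch using the same tokenizer looks like this (Keras reserves index 0 for padding, hence the +1):

from tensorflow.keras.utils import to_categorical

# Map each sentence to a sequence of word indices, then one-hot encode each index
sequences = tokenizer.texts_to_sequences(texts)
vocab_size = len(tokenizer.word_index) + 1

one_hot_words = [to_categorical(seq, num_classes=vocab_size) for seq in sequences]
print("One-hot vector for the first word of sentence 1:\n", one_hot_words[0][0])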

5. Sentence Embeddings

While word embeddings capture word-level meanings, sentence embeddings capture sentence-level context. Libraries such as SentenceTransformers make them easy to generate. These embeddings capture contextual and semantic meaning across whole phrases and are especially useful for tasks like document retrieval or sentiment analysis.

Advantages:

  • Captures the overall meaning of sentences.
  • Reduces the complexity of working with individual word vectors.
  • Useful for tasks requiring higher-level understanding (e.g., paraphrase detection).

Limitations:

  • Pretrained models may not adapt well to niche domains.
  • Requires significant computational resources for training on large corpora.

Use Cases:

  • Question-answering systems.
  • Semantic search and document similarity.
  • Summarization and translation.

Python Example: Sentence Embeddings with SentenceTransformers

from sentence_transformers import SentenceTransformer

# Load pretrained sentence embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings for the sentences (reuses `corpus` from the BoW example)
sentence_embeddings = model.encode(corpus)
print("Sentence Embeddings:\n", sentence_embeddings)

Output: Each sentence is represented as a dense numerical vector (only the first few values are shown below).

Sentence Embeddings:
 [[ 1.30237207e-01 -1.57728121e-02 -3.67166810e-02  5.79864047e-02
   -5.97917512e-02  3.30537371e-02  3.01239621e-02  2.89271642e-02
   -1.86052322e-02  5.52965626e-02 -2.81547513e-02  6.97595030e-02
   ... ]]
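
Once sentences are embedded, comparing them is a single cosine similarity away. Here is a small sketch using the util helper shipped with sentence_transformers; the example sentences are only illustrative:

from sentence_transformers import SentenceTransformer, util

# Reuse the model loaded above
sentences = [
    "The cat sat on the mat.",
    "A cat is resting on a rug.",
    "The stock market fell sharply today.",
]
embeddings = model.encode(sentences)

# Cosine similarity between sentence pairs: semantically close sentences score higher
print("cat/rug similarity:   ", util.cos_sim(embeddings[0], embeddings[1]).item())
print("cat/market similarity:", util.cos_sim(embeddings[0], embeddings[2]).item())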

This part focused on converting text into numerical representations using various techniques, forming the backbone of many NLP tasks. Stay tuned for Part 3!
