Learn the essential methods for representing text data in Natural Language Processing (NLP). Explore techniques like Bag of Words (BoW), TF-IDF, One-Hot Encoding, Word Embeddings, and Sentence Embeddings, all explained with Python code examples. These foundational methods transform raw text into numerical formats for machine learning and deep learning applications. Perfect for beginners looking to master text preprocessing and representation in NLP.
Once text is preprocessed, we need to represent it in a format that machine learning models can work with. Text representation transforms words or documents into numerical data while preserving their contextual meaning.
I have written Part I; you can click through to it to review the related key concepts of NLP.
The Bag of Words (BoW) model represents text as a collection of words, disregarding grammar and word order but retaining word frequency. Each document is converted into a fixed-length numerical vector by counting how often every word in the vocabulary appears in it.
Advantages:
Limitations:
Use Cases:
Python Example: Creating a BoW Representation
from sklearn.feature_extraction.text import CountVectorizer
# Example corpus
corpus = [
"The cat sat on the mat.",
"The dog barked at the mailman.",
]
# Create BoW representation
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)
# Display feature names and BoW array
print("Feature Names:", vectorizer.get_feature_names_out())
print("BoW Representation:\n", bow.toarray())
Output:
Feature Names: ['at' 'barked' 'cat' 'dog' 'mailman' 'mat' 'on' 'sat' 'the']
BoW Representation:
[[0 0 1 0 0 1 1 1 2]
[1 1 0 1 1 0 0 0 2]]
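To make the counting explicit, here is a minimal pure-Python sketch of the same idea (a conceptual illustration, not how CountVectorizer works internally):
from collections import Counter
import re
# The same two-document corpus as above
corpus = [
    "The cat sat on the mat.",
    "The dog barked at the mailman.",
]
# Lowercase and split each document into word tokens
tokenized = [re.findall(r"\w+", doc.lower()) for doc in corpus]
# Vocabulary = sorted set of all words in the corpus
vocabulary = sorted({word for doc in tokenized for word in doc})
# Count how often each vocabulary word occurs in each document
bow = [[Counter(doc)[word] for word in vocabulary] for doc in tokenized]
print("Vocabulary:", vocabulary)
print("BoW:", bow)
This prints the same vocabulary and the same count vectors as the scikit-learn example above.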
TF-IDF (Term Frequency-Inverse Document Frequency) weights how often a word appears in a document against how often it appears across all documents. It highlights words that are distinctive to a document while downplaying words that are common throughout the corpus.
Advantages:
Limitations:
Use Cases:
Python Example: Calculating TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
# Create TF-IDF representation
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(corpus)
# Display feature names and TF-IDF array
print("Feature Names:", tfidf_vectorizer.get_feature_names_out())
print("TF-IDF Representation:\n", tfidf.toarray())
Output:
Feature Names: ['at' 'barked' 'cat' 'dog' 'mailman' 'mat' 'on' 'sat' 'the']
TF-IDF Representation:
[[0. 0. 0.40740124 0. 0. 0.40740124
0.40740124 0.40740124 0.57973867]
[0.40740124 0.40740124 0. 0.40740124 0.40740124 0.
0. 0. 0.57973867]]
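As a sanity check, the value 0.4074 for "cat" in the first document can be reproduced by hand. The sketch below assumes scikit-learn's defaults (smooth_idf=True, so idf = ln((1 + n_docs) / (1 + df)) + 1, followed by L2 normalisation of each document vector):
import numpy as np
n_docs = 2
idf_rare = np.log((1 + n_docs) / (1 + 1)) + 1    # "cat", "mat", "on", "sat": each in 1 document
idf_common = np.log((1 + n_docs) / (1 + 2)) + 1  # "the": appears in both documents
# First document counts: "cat", "mat", "on", "sat" once each, "the" twice
weights = np.array([idf_rare] * 4 + [2 * idf_common])
weights /= np.linalg.norm(weights)  # L2-normalise the document vector
print(weights)  # ~[0.4074 0.4074 0.4074 0.4074 0.5797], matching the output above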
Word embeddings like Word2Vec and GloVe represent words in a dense vector space where similar words have closer representations.
They capture relationships between words, such as similarity ("cat" and "dog" end up close together) and analogies (vector("king") - vector("man") + vector("woman") lands near vector("queen")); the sketch after the code example below explores one such analogy.
Advantages:
Limitations:
Use Cases:
Python Example: Using Pretrained Word2Vec
from gensim.models import KeyedVectors
# Load pretrained Word2Vec model (download 'GoogleNews-vectors-negative300.bin' first)
model_path = "GoogleNews-vectors-negative300.bin"
w2v_model = KeyedVectors.load_word2vec_format(model_path, binary=True)
# Get vector for a word
word_vector = w2v_model["cat"]
print("Vector for 'cat':", word_vector)
# Similarity between words
similarity = w2v_model.similarity("cat", "dog")
print("Similarity between 'cat' and 'dog':", similarity)
Each word in the vocabulary is represented as a binary vector with a single high bit at the word’s position. For example, with the vocabulary {cat, dog, sat}, the word "cat" is [1, 0, 0], "dog" is [0, 1, 0], and "sat" is [0, 0, 1].
Advantages:
Limitations:
Use Cases:
Python Example: One-Hot Encoding with Keras
from tensorflow.keras.preprocessing.text import Tokenizer
# Example corpus
texts = ["The cat sat on the mat", "The dog barked at the mailman"]
# Fit the tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
# Convert each text into a binary word-presence vector (one row per document)
one_hot = tokenizer.texts_to_matrix(texts, mode="binary")
print("One-Hot Encoding:\n", one_hot)
Output:
One-Hot Encoding:
[[0. 1. 1. 1. 1. 1. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 1. 1. 1. 1.]]
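Note that texts_to_matrix(mode="binary") returns one binary presence vector per document rather than one vector per word. To get a per-word one-hot vector, as in the {cat, dog, sat} illustration above, one option is to one-hot encode the word indices. A minimal sketch reusing the tokenizer fitted above:
from tensorflow.keras.utils import to_categorical
# Word indices assigned by the tokenizer (1-based; index 0 is reserved)
sequences = tokenizer.texts_to_sequences(texts)
vocab_size = len(tokenizer.word_index) + 1
# One one-hot row per word in the first sentence
one_hot_words = to_categorical(sequences[0], num_classes=vocab_size)
print("One-hot vectors for the first sentence:\n", one_hot_words)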
While word embeddings capture word-level meaning, sentence embeddings capture sentence-level context and semantics across an entire phrase or sentence. Libraries like SentenceTransformers make them easy to generate, and they are especially useful for tasks like document retrieval or sentiment analysis.
Advantages:
Limitations:
Use Cases:
Python Example: Sentence Embeddings with SentenceTransformers
from sentence_transformers import SentenceTransformer
# Load pretrained sentence embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Generate embeddings for sentences
sentence_embeddings = model.encode(corpus)
print("Sentence Embeddings:\n", sentence_embeddings)
Output: The sentence embeddings form a numerical array in which each sentence is represented as a dense vector (384 dimensions for all-MiniLM-L6-v2). Only the first few values are shown below.
Sentence Embeddings:
[[ 1.30237207e-01 -1.57728121e-02 -3.67166810e-02  5.79864047e-02
  -5.97917512e-02  3.30537371e-02  3.01239621e-02  2.89271642e-02
  ...]]
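A common next step is to compare sentences by cosine similarity. A short usage sketch with the util.cos_sim helper from sentence-transformers, reusing the embeddings computed above:
from sentence_transformers import util
# Cosine similarity between the embeddings of the two corpus sentences
similarity = util.cos_sim(sentence_embeddings[0], sentence_embeddings[1])
print("Cosine similarity:", similarity.item())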
This part focused on converting text into numerical representations using various techniques, which form the backbone of many NLP tasks. Stay tuned for Part 3!
As the founder and passionate educator behind this platform, I’m dedicated to sharing practical knowledge in programming to help you grow. Whether you’re a beginner exploring Machine Learning, PHP, Laravel, Python, Java, or Android Development, you’ll find tutorials here that are simple, accessible, and easy to understand. My mission is to make learning enjoyable and effective for everyone. Dive in, start learning, and don’t forget to follow along for more tips and insights!