Learn the essential methods for representing text data in Natural Language Processing (NLP). Explore techniques like Bag of Words (BoW), TF-IDF, One-Hot Encoding, Word Embeddings, and Sentence Embeddings, all explained with Python code examples. These foundational methods transform raw text into numerical formats for machine learning and deep learning applications. Perfect for beginners looking to master text preprocessing and representation in NLP.
Once text is preprocessed, we need to represent it in a format that machine learning models can work with. Text representation transforms words or documents into numerical data while preserving their contextual meaning.
I have already written Part I, which you can click through to review the related key concepts of NLP.
The Bag of Words model represents text as a collection of words, disregarding grammar and word order but retaining frequency. It converts each document into a fixed-length numerical vector by counting how often each vocabulary word appears in it.
Advantages:
Limitations:
Use Cases:
Python Example: Creating a BoW Representation
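The original code is not shown here, so below is a minimal sketch using scikit-learn's CountVectorizer; the two example sentences are inferred from the feature names and counts in the output that follows.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Example documents (inferred from the output below)
documents = [
    "The cat sat on the mat",
    "The dog barked at the mailman",
]

# Build the vocabulary and count each word's occurrences per document
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

print("Feature Names:", vectorizer.get_feature_names_out().tolist())
print("BoW Representation:")
print(bow_matrix.toarray())
```

Each row of the matrix corresponds to one document, and each column to one vocabulary word; note that "the" gets a count of 2 in both rows because it appears twice in each sentence.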
Output:
Feature Names: ['at', 'barked', 'cat', 'dog', 'mailman', 'mat', 'on', 'sat', 'the']
BoW Representation:
[[0 0 1 0 0 1 1 1 2]
[1 1 0 1 1 0 0 0 2]]
TF-IDF (Term Frequency-Inverse Document Frequency) adjusts raw word counts by considering how often a word appears across all documents. It highlights words that are distinctive to a document while downplaying common ones.
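As a quick reference, the standard formulation is TF-IDF(t, d) = TF(t, d) x IDF(t), where TF(t, d) is how often term t occurs in document d and IDF(t) = log(N / df(t)), with N the total number of documents and df(t) the number of documents containing t. Note that scikit-learn, used in the example below, applies a smoothed IDF and L2-normalizes each document vector, so the values in its output are not the raw ratios from this formula.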
Advantages:
Limitations:
Use Cases:
Python Example: Calculating TF-IDF
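Again the original code isn't shown, so here is a minimal sketch with scikit-learn's TfidfVectorizer on the same two sentences used in the BoW example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same example documents as in the BoW example
documents = [
    "The cat sat on the mat",
    "The dog barked at the mailman",
]

# Compute TF-IDF weights (smoothed IDF and L2 normalization by default)
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

print("Feature Names:", vectorizer.get_feature_names_out())
print("TF-IDF Representation:")
print(tfidf_matrix.toarray())
```

In the output below, "the" appears in both documents, so its IDF is lower than that of the unique words; it still ends up with the largest weight (about 0.58 versus 0.41) only because it occurs twice in each sentence.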
Output:
Feature Names: ['at' 'barked' 'cat' 'dog' 'mailman' 'mat' 'on' 'sat' 'the']
TF-IDF Representation:
[[0. 0. 0.40740124 0. 0. 0.40740124
0.40740124 0.40740124 0.57973867]
[0.40740124 0.40740124 0. 0.40740124 0.40740124 0.
0. 0. 0.57973867]]
Word embeddings like Word2Vec and GloVe represent words in a dense vector space where similar words have closer representations.
They capture semantic relationships between words, such as analogies (for example, vector("king") - vector("man") + vector("woman") is close to vector("queen")) and similarity between related terms.
Advantages:
Limitations:
Use Cases:
Python Example: Using Pretrained Word2Vec
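No code or output is shown for this example, so the following is a minimal sketch assuming the pretrained Google News vectors available through gensim's downloader API; the model name and the probe words are my choices, not necessarily the original author's.

```python
import gensim.downloader as api

# Load pretrained Word2Vec vectors (large download, 300 dimensions)
model = api.load("word2vec-google-news-300")

# Dense vector for a single word
print("Vector for 'cat' (first 10 dims):", model["cat"][:10])

# Words whose vectors are closest to 'cat'
print("Most similar to 'cat':", model.most_similar("cat", topn=5))

# Classic analogy: king - man + woman is closest to queen
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```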
Each word in the vocabulary is represented as a binary vector with a single high bit corresponding to the word’s position. For example, for a vocabulary of {cat, dog, sat}, the word "cat" is [1, 0, 0].
Advantages:
Limitations:
Use Cases:
Python Example: One-Hot Encoding with Keras
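The code itself is missing here; a minimal sketch using Keras' Tokenizer with texts_to_matrix in binary mode reproduces the output below on the same two sentences. Strictly speaking this produces a multi-hot document matrix, and index 0 is reserved by the Tokenizer, which is why the first column is always zero.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Same example sentences as before
documents = [
    "The cat sat on the mat",
    "The dog barked at the mailman",
]

# Map each word to an integer index (1-based; index 0 is reserved)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(documents)

# Binary matrix: 1 if the word occurs in the document, 0 otherwise
one_hot = tokenizer.texts_to_matrix(documents, mode="binary")

print("One-Hot Encoding:")
print(one_hot)
```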
Output:
One-Hot Encoding:
[[0. 1. 1. 1. 1. 1. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 1. 1. 1. 1.]]
While word embeddings capture word-level meanings, sentence embeddings capture sentence-level context. Libraries like SentenceTransformers make generating them straightforward. Sentence embeddings capture contextual and semantic meaning across whole phrases and are especially useful for tasks like document retrieval or sentiment analysis.
Advantages:
Limitations:
Use Cases:
Python Example: Sentence Embeddings with SentenceTransformers
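Here is a minimal sketch with the sentence-transformers library; the model name ("all-MiniLM-L6-v2") and the example sentences are assumptions, since the original code isn't shown.

```python
from sentence_transformers import SentenceTransformer

# Pretrained sentence embedding model (assumed; any SentenceTransformers model works)
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sat on the mat.",
    "The dog barked at the mailman.",
]

# Each sentence becomes a fixed-length dense vector (384 dimensions for this model)
embeddings = model.encode(sentences)

print("Sentence Embeddings:")
print(embeddings)
```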
Output: Sentence embeddings will be a numerical array with each sentence represented as a vector.
Sentence Embeddings:
[[ 1.30237207e-01 -1.57728121e-02 -3.67166810e-02 5.79864047e-02
-5.97917512e-02 3.30537371e-02 3.01239621e-02 2.89271642e-02
-1.86052322e-02 5.52965626e-02 -2.81547513e-02 6.97595030e-02
3.96658182e-02 2.59351693e-02 -5.51402606e-02 -5.22531345e-02
-5.34441136e-02 -3.66815813e-02 6.08156510e-02 3.70685942e-02
-3.03818192e-02 -1.30533855e-02 3.09664086e-02 -6.17445149e-02
-5.17991483e-02 7.61938766e-02 -2.98118200e-02 -5.23094833e-02
This part focused on converting text into numerical representations using various techniques, which form the backbone of many NLP tasks. Stay tuned for Part 3!
As the founder and passionate educator behind this platform, I’m dedicated to sharing practical knowledge in programming to help you grow. Whether you’re a beginner exploring Machine Learning, PHP, Laravel, Python, Java, or Android Development, you’ll find tutorials here that are simple, accessible, and easy to understand. My mission is to make learning enjoyable and effective for everyone. Dive in, start learning, and don’t forget to follow along for more tips and insights!