💊 Pill of the Week
In this DIY issue, we’ll roll up our sleeves and craft a simple yet effective text classification project using the Naive Bayes algorithm. Our mission? Classifying SMS messages as either 'spam' or 'ham' (not spam). We’ll cover loading the data, preprocessing it, building the model, and evaluating its performance.
Before starting, you can review the theory in this previous issue:
Ready? This is what we will be covering:
Loading and Exploring the Dataset
Data Preprocessing
Splitting Data for Training and Testing
Text Vectorization with TF-IDF
Training the Naive Bayes Model
Evaluating Model Performance
Let's dive in!
1. Loading and Exploring the Dataset
First, we need data! For this project, we’ll be using a dataset containing labeled SMS messages as either spam or ham. Here’s how we get started:
import pandas as pd
# Load the dataset directly from a URL
url = "..."  # replace with the URL of the SMS spam dataset CSV
df = pd.read_csv(url, encoding='latin-1')
# Select relevant columns and rename them for clarity
df = df[['v1', 'v2']]
df.columns = ['label', 'message']
What just happened?
We loaded data using pandas straight from a URL.
Kept two important columns: 'label' and 'message'.
Renamed them for better readability. Quick, simple, and clean!
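If you want a quick sanity check on what we just loaded, an optional peek at the first rows and the dataset size looks like this:
# Optional sanity check: preview the first rows and the overall size
print(df.head())    # first five messages with their labels
print(df.shape)     # (number of rows, number of columns)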
2. Data Preprocessing
Now, let's prep our data for modeling:
Convert labels to numerical values: 0 for ham, 1 for spam. This gives the model numeric targets to work with.
Check and remove duplicates, ensuring cleaner data for accurate training.
# Transform labels: 0 for 'ham', 1 for 'spam'
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
# Remove duplicates and check for missing values
df = df.drop_duplicates()
print(df.isnull().sum()) # Expect all zeros (no missing values)
How balanced is our data? Quick peek:
print(df['label'].value_counts())
This gives us an idea of how evenly distributed our spam and ham messages are.
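If you prefer proportions to raw counts, the same check with normalize=True shows each class as a fraction of the dataset (an optional extra, not in the snippet above):
# Optional: class balance as proportions instead of raw counts
print(df['label'].value_counts(normalize=True))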
3. Splitting Data for Training and Testing
To evaluate our model fairly, we split our data into training and testing sets.
from sklearn.model_selection import train_test_split
# Features and target labels
X = df['message']
y = df['label']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
What just happened?
We defined our X and y (features and labels/targets)
Then we split them into 75% training and 25% testing
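Spam datasets are usually imbalanced, so an optional refinement (not used above) is a stratified split, which preserves the spam/ham ratio in both sets:
# Optional: a stratified split keeps the spam/ham ratio equal in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)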
4. Text Vectorization with TF-IDF
Text data needs a numerical representation for modeling. Enter TF-IDF (Term Frequency-Inverse Document Frequency), which weighs each word by how often it appears in a message, discounted by how common it is across all messages.
from sklearn.feature_extraction.text import TfidfVectorizer
# Convert text messages to numerical features using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
Explanation:
TF-IDF transforms text into a matrix of numerical values, emphasizing unique and relevant words.
Common stop words (like 'the', 'and') are removed for better focus.
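To get a feel for what the vectorizer produced, here is an optional peek at the resulting matrix and a few vocabulary terms (get_feature_names_out requires scikit-learn 1.0+):
# Optional: inspect the TF-IDF output
print(X_train_tfidf.shape)                        # (number of messages, vocabulary size)
print(vectorizer.get_feature_names_out()[:10])    # a sample of the learned vocabulary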
5. Training the Naive Bayes Model
Now for the magic! We’ll use the Multinomial Naive Bayes algorithm, perfect for text classification.
from sklearn.naive_bayes import MultinomialNB
# Initialize and train the model
nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)
# Make predictions on the test set
y_pred = nb_model.predict(X_test_tfidf)
Why Multinomial Naive Bayes (MultinomialNB)?
MultinomialNB is tailored for discrete features like word counts or term frequencies, making it ideal for text classification tasks such as spam detection and sentiment analysis. It works by applying Bayes' Theorem under the assumption that features (words) are conditionally independent given the class label.
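The scoring rule behind this, reconstructed here from the terms defined below, is: for a document d, each class Cᵢ receives the score
P(Cᵢ | d) ∝ P(Cᵢ) · ∏ⱼ P(wⱼ | Cᵢ)^f(wⱼ, d)
and the class with the highest score is predicted.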
Breaking it Down Simply:
P(Cᵢ): The prior probability of class Cᵢ, indicating how common a class is (e.g., how likely it is that a message is "spam" versus "not spam").
P(wⱼ|Cᵢ): The probability that word wⱼ appears in class Cᵢ. For instance, if "win" frequently appears in spam messages, P("win"|spam) would be high.
f(wⱼ, d): The frequency of word wⱼ in document d.
What Does the Formula Do?
It multiplies the prior probability of the class P(Cᵢ) by the probabilities of each word in the document appearing in that class, raised to the power of how often the word appears.
The higher this product, the more likely the document belongs to that class.
In simple terms, Multinomial Naive Bayes predicts the class of a document by considering how common each word is in that class, adjusted for how often the class itself appears overall. It’s like scoring how "spammy" or "hammy" the words in a message are!
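If you are curious which words the trained model actually treats as "spammy", here is a short optional sketch using the fitted model and vectorizer from above:
import numpy as np

# Optional: words with the largest gap between spam and ham log-probabilities
feature_names = vectorizer.get_feature_names_out()
spam_vs_ham = nb_model.feature_log_prob_[1] - nb_model.feature_log_prob_[0]  # spam minus ham
top_spam_words = feature_names[np.argsort(spam_vs_ham)[-10:]]  # ten most spam-indicative words
print(top_spam_words)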
Key Advantages:
Count-Based Data: Handles features represented by counts or frequencies efficiently (e.g., word occurrence in text).
High-Dimensional Data: Performs well with large vocabularies and sparse matrices, which are common in text data.
Comparison to GaussianNB:
Discrete vs. Continuous Data: MultinomialNB works with discrete data, like word counts. In contrast, GaussianNB assumes continuous features that follow a normal distribution, making it more suitable for numerical data.
Memory Efficiency: MultinomialNB leverages word counts directly, avoiding complex calculations of distribution parameters, which makes it faster for sparse data.
Other Naive Bayes Variants:
BernoulliNB: Suitable for binary/boolean features (presence/absence of words). Works well when the focus is on word presence rather than frequency.
GaussianNB: Assumes continuous data with a Gaussian distribution (e.g., features like height, weight).
ComplementNB: Designed to handle imbalanced classes better, making it effective for datasets where one class is significantly smaller than the other.
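All of these variants share the same scikit-learn interface, so trying one is a drop-in change. A minimal sketch, assuming the TF-IDF features from above (BernoulliNB would usually pair better with binary word-presence features):
from sklearn.naive_bayes import ComplementNB

# Drop-in alternative: ComplementNB often copes better when spam is the minority class
cnb_model = ComplementNB()
cnb_model.fit(X_train_tfidf, y_train)
print(cnb_model.score(X_test_tfidf, y_test))  # accuracy on the test set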
6. Evaluating Model Performance
Did it work well? Let’s find out!
from sklearn.metrics import accuracy_score, classification_report
# Calculate and print the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy:.2f}')
# Display a detailed classification report
print(classification_report(y_test, y_pred))
The accuracy score tells us the overall success rate of our model, while the classification report provides deeper insights into precision, recall, and F1-score for spam and ham messages.
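Because accuracy alone can look flattering on imbalanced data, an optional extra check is the confusion matrix, which shows exactly how many ham and spam messages were classified correctly:
from sklearn.metrics import confusion_matrix

# Rows = actual class (0 = ham, 1 = spam), columns = predicted class
print(confusion_matrix(y_test, y_pred))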
🎉 Recap
Data Loading & Preprocessing: Cleaned and prepared data.
Text Vectorization: Transformed text to numerical features using TF-IDF.
Model Training & Evaluation: Trained a Naive Bayes classifier and checked its performance.
Quick, effective, and surprisingly powerful for such a lightweight approach! Stay tuned for more optimizations and advanced techniques in future issues!
If you want the full notebook, you can find it at the end of the issue!
It covers the same steps plus three additional sections (marked 🆕):
Data Loading and Exploration
Data Preprocessing
Data Splitting
Text Vectorization
Model Training
Model Evaluation
Hyperparameter Tuning 🆕
Model and Vectorizer Saving 🆕
Custom Predictions 🆕
⚡Power-Up Corner
To make your text classification model more robust, here are some advanced tips and common pitfalls to watch for:
Feature Engineering Beyond TF-IDF: