Building and Deploying a Spam Detection Model: Section 1


This is a guest post written by Ehiaghe Aigiomawu.

Spam is an unsolicited email, instant message, or text message, usually sent to the recipient for commercial purposes. In other words, the recipient never explicitly asked for it, yet they get it anyway.

In this article, we're going to learn how to build a spam detection model in Python to automatically classify a message as either spam or ham (a legitimate message). When we're done building our model, we will host it on Heroku using Flask, a lightweight web application framework, so that it becomes available for anyone to use.

This article is divided into two sections. In the first section, we focus on building the spam detection model, and in the second section we deploy the model on Heroku using Flask. We will assume you are already familiar with the following:

Getting started: First, we need to install NLTK (if you don't have it installed already) from the command prompt as follows:

conda install nltk

or

pip install nltk
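
Installing the package does not fetch the corpora by itself. To launch the NLTK downloader, run the following in a Python shell (a minimal sketch):

import nltk
nltk.download()  # with no arguments, this opens the NLTK downloader window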

You should see a pop-up window; select the "all" option and click Download. This will download all the corpora we need for this exercise.


Building the Spam Message Classifier

Loading the Dataset: The dataset we will use is a collection of SMS messages tagged as spam or ham and can be found here – go ahead and download it. Once you have the dataset, open your Jupyter notebook and let's get to work. Our first step is to load the data using pandas' read_csv function.

import pandas as pd
df = pd.read_csv('spam.csv', encoding="latin-1")

Note that we specified encoding="latin-1" while reading the CSV. This is because the file is not UTF-8 encoded; if you omit the encoding, pandas will raise a decoding error.
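
To see what we just loaded (including the extra unnamed columns we will deal with next), preview the first few rows:

df.head()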


Cleaning the Dataset: Looking at the dataframe, we notice three columns, Unnamed: 2, Unnamed: 3, and Unnamed: 4, whose rows are mostly NaN values. We will drop them because they're not useful for our classification. We will also replace the v1 and v2 columns with appropriately named ones:

#Drop the columns not needed
df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)

#Create a new 'label' column from v1, mapping ham to 0 and spam to 1 (the format we need for prediction)
df['label'] = df['v1'].map({'ham': 0, 'spam': 1})

#Create a new column having the same values as v2 column
df['message'] = df['v2']

#Now drop the v1 and v2
df.drop(['v1', 'v2'], axis=1, inplace=True)

df.head(10)
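
Before moving on, it can also help to check how many messages fall into each class (an optional check, not part of the original walkthrough):

print(df['label'].value_counts())  # 0 = ham, 1 = spam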


Creating the Bag-of-Words Model: Now that we have an idea of what our data looks like, the next step is to create a bag-of-words model using the CountVectorizer class from scikit-learn. It works by splitting each message into tokens and creating a column for every word in the corpus, so each message becomes a vector of word counts. Once fitted, CountVectorizer has built a dictionary of feature indices: each word in the vocabulary is assigned a column index, and the value in that column counts how often the word appears in a given message.

from sklearn.feature_extraction.text import CountVectorizer
bow_transformer = CountVectorizer()
X = bow_transformer.fit_transform(df['message'])
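
To get a feel for what the vectorizer learned, you can check the shape of the resulting matrix and the size of the vocabulary (a quick sanity check, not part of the original post):

print(X.shape)  # (number of messages, vocabulary size)
print(len(bow_transformer.vocabulary_))  # number of distinct tokens in the corpus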

Training a model: With messages represented as vectors, we can finally train our spam vs. ham classifier. Our classifier of choice for this tutorial is the Naive Bayes algorithm, because it is well suited to document classification: it works by modelling the distribution of words in each class. Let's go ahead and create our model:

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

#Split the data
X_train, X_test, y_train, y_test = train_test_split(X, df['label'], test_size=0.33, random_state=42)

#Naive Bayes Classifier
clf = MultinomialNB()
clf.fit(X_train,y_train)
clf.score(X_test,y_test)

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

Running this prints the test accuracy along with a per-class classification report.


As we can see, our model achieves roughly 97% accuracy on the test set. Great job! We've built a model that can classify messages as spam or ham.
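
As a quick illustration (the sample text below is made up), you can also classify a brand-new message by transforming it with the same fitted vectorizer before calling predict:

sample = ["Congratulations! You have won a free prize, call now to claim it."]
sample_bow = bow_transformer.transform(sample)
print(clf.predict(sample_bow))  # 1 means spam, 0 means ham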

Saving the classifier: As a final step in this section, let's save the trained model so that we can reuse it whenever necessary. We will do this using joblib:

import joblib  # in older versions of scikit-learn this was: from sklearn.externals import joblib
joblib.dump(clf, 'our_model.pkl')

As a side note, when it’s time to use the model, we’ll load it using:

clf = joblib.load('our_model.pkl')
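
One caveat worth noting: the saved classifier only accepts vectorized input, so to classify raw messages later (for example, from the Flask app in section two) you will also need the fitted CountVectorizer. A simple approach, using a placeholder filename, is to persist and reload it the same way:

joblib.dump(bow_transformer, 'our_vectorizer.pkl')  # 'our_vectorizer.pkl' is just a placeholder name
bow_transformer = joblib.load('our_vectorizer.pkl')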

By now, your completed model should look like this:

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

df = pd.read_csv('spam.csv', encoding="latin-1")
df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)
df['label'] = df['v1'].map({'ham': 0, 'spam': 1})
df['message'] = df['v2']
df.drop(['v1', 'v2'], axis=1, inplace=True)

#Creating a BOW model
bow_transformer = CountVectorizer()
X = bow_transformer.fit_transform(df['message'])

#Split the data
X_train, X_test, y_train, y_test = train_test_split(X, df['label'], test_size=0.33, random_state=42)

#Naive Bayes Classifier
clf = MultinomialNB()
clf.fit(X_train,y_train)
clf.score(X_test,y_test)

#Evaluate with a classification report
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

If you've made it to the end of this tutorial, congratulations. You've just learnt how to train a spam classifier. Next week, we will publish section two of this tutorial, which covers deploying the model to Heroku using Flask. Until then, remember to share the link.

We hope you've learnt a lot from this tutorial. Join our mission to promote more women of color in tech: share this link and refer any awesome woman who should be featured in our Guest Lounge via techinpinkafrica@gmail.com.
