Back to Basics: Naive Bayes Demystified
Our goal is to learn about Naïve Bayes and apply it to a real-world problem: spam detection. Why Naïve Bayes? Consider building an email spam filter using machine learning. You would like to classify messages according to whether they are unsolicited commercial email (spam) or legitimate email (ham). Classifying spam vs. ham is one example of a broader set of problems called text classification, a set of problems that Naïve Bayes can solve with relatively high performance. The Naïve Bayes classifier is a powerful classification algorithm built on modeling conditional probabilities. It is particularly useful for classification problems with many features and, as we will find out shortly, negligible conditional dependencies between them. The name “Naïve” suggests that we are making a wrong assumption here; in fact, we are! But as it turns out, our naïve assumption works pretty well in practice.
Naive Bayes
Let us assume we are interested in classifying a large collection of emails as spam vs. ham. We denote our input vector by x. The input vector is the ordered set of words in an email, say x = {I, am, happy}. The goal of our trained model is to decide which class of outcomes this vector belongs to. We denote our set of possible outcomes by C; in this example, C = {spam, ham}.
We can think of spam detection as a probability problem: we want to know the chance of our email being spam vs. ham, and naturally go with the one that has the higher chance. As an example, we might find the chance of spam to be 40% and, consequently, ham to be 60% (60% = 100% - 40%); then, we expect our model to label the given email as ham. Mathematically, we write this as P(spam|x) = 0.4 and P(ham|x) = 0.6, or if we really want to spell it out, P(spam|I, am, happy) = 0.4 and P(ham|I, am, happy) = 0.6. The term P(spam|x) is read as the probability of the email being spam (spam happening), given that the email was x (x happened).
So far, so good! But here is a question: how can we find these probabilities? We would need to find all of our emails with x as the content and count the number of times they were spam vs. ham. Say we received 100 emails with x as the content, and 60 of them were not spam and 40 were; then:
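Counting directly from those records, the probabilities would be:

P(ham | x) = 60 / 100 = 0.6
P(spam | x) = 40 / 100 = 0.4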
There is a small problem here, though! To classify an email and decide whether it is spam or not, we would need to have received many copies of the exact same email and have them in our records. Not practical! Fortunately, Bayes Theorem for conditional probabilities can help us. Please note that Bayes Theorem is different from the Naïve Bayes Assumption! Bayes Theorem can be written as:
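For a class C and an email x:

P(C | x) = P(x | C) · P(C) / P(x)

In words: the probability of the class given the email equals the probability of the email given the class, times the prior probability of the class, divided by the overall probability of the email.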
Bayes Theorem is a fundamental theorem in probability theory. Ok, let’s use it!
It’s helpful to note that we really want to find whether P(spam|x) / P(ham|x) > 1 or not; we don’t really care about their absolute values. So what we really want is an estimate of:
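Applying Bayes Theorem to both the numerator and the denominator (P(x) appears in both and cancels out, which is why it disappears):

P(spam | x) / P(ham | x) = [P(x | spam) · P(spam)] / [P(x | ham) · P(ham)]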
The good news is that we are in better shape for estimating P(x|spam), P(x|ham), P(spam), and P(ham) than P(spam|x) and P(ham|x) directly. We supposedly have many emails that are either spam or ham, and some of them hopefully have x as their content. This helps us with emails that are less frequent but present in our database. However, we might find no emails at all with x as the content! What shall we do? That’s where Naïve Bayes comes to our help! There is another fundamental probability theorem we can use.
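That theorem is the chain rule of conditional probability: for any events A, B, and C,

P(A, B | C) = P(A | B, C) · P(B | C)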
Let us apply this theorem to our email example.
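Writing x = {I, am, happy} and taking spam as the class:

P(I, am, happy | spam) = P(I | am, happy, spam) · P(am, happy | spam)

(the same decomposition holds with ham in place of spam).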
Looks better! OK, let’s apply the same theorem once again.
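This time to the remaining pair of words:

P(am, happy | spam) = P(am | happy, spam) · P(happy | spam)

so that, putting the two steps together,

P(I, am, happy | spam) = P(I | am, happy, spam) · P(am | happy, spam) · P(happy | spam)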
Now is the time for the Naïve Bayes Assumption to shine! Naïve Bayes assumes that the elements of x are conditionally independent given C. In other words, in our example, we will assume that:
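P(I | am, happy, spam) = P(I | spam)
P(am | happy, spam) = P(am | spam)

and similarly for ham: each word’s probability depends only on the class, not on the other words.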
Not only does this assumption sound wrong, it is absurd! We naturally expect a higher chance of seeing “I” if we have “am” in an email. However, it works pretty well! Why? That needs a bit more advanced treatment of what’s happening, but essentially the individual probabilities of each word are enough for most common types of classification. OK, let’s put everything together in one equation:
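P(spam | x) / P(ham | x) = [P(I | spam) · P(am | spam) · P(happy | spam) · P(spam)] / [P(I | ham) · P(am | ham) · P(happy | ham) · P(ham)]

If this ratio is greater than 1, we classify the email as spam; otherwise, as ham.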
What’s left is easy! We only need to look for emails that contain the individual words, not all of them together. That is a much more practical exercise! Say we have 10 million emails that are spam and 1,000 of them have “happy” in them. Also, say we have 100 million that are not spam and 1 million of them have “happy” in them; then:
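P(happy | spam) = 1,000 / 10,000,000 = 0.0001
P(happy | ham) = 1,000,000 / 100,000,000 = 0.01

Every other word probability, as well as the priors P(spam) and P(ham), is estimated from simple counts in exactly the same way.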
Spam Detection Using Python
I’m pretty sure no one wants to do all of the above calculations by hand; so let’s look at a Python library! We are going to use a handful of Python libraries to classify emails as spam or ham. Computers are not great at understanding raw text, so the first thing we need to do is represent the content of each email in another form. One way of representing text in a machine learning context is the bag-of-words (BOW) representation. After collecting all of the possible words and putting them in a dictionary, we assign a vector to each word in our dictionary. The length of each vector is equal to the number of words in the dictionary. For example, if we only have the four words {hello, I, am, happy} in our dictionary, then we can assign them the following vectors:
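hello → [1, 0, 0, 0]
I → [0, 1, 0, 0]
am → [0, 0, 1, 0]
happy → [0, 0, 0, 1]

Each word gets a 1 in its own position and 0 everywhere else.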
In the BOW representation, a document is simply represented as a vector of word counts; the word order is ignored. A simple BOW representation of ‘I am happy happy’ is:
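[0, 1, 1, 2]

that is, zero occurrences of ‘hello’, one each of ‘I’ and ‘am’, and two of ‘happy’.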
In this article, we are going to use the data available here (thanks, Kaggle). It’s technically not an email dataset, but that’s OK; Naïve Bayes works here too! The dataset needs a very small amount of cleaning before being ready to be consumed by you, the hungry data scientist! After downloading the CSV file, we use our friend ‘pandas’ to load the data.
import pandas as pd
file_encoding = 'utf8'
input_fd = open('./spam.csv', encoding=file_encoding, errors = 'backslashreplace') #the file has some non utf8 parts
data = pd.read_csv(input_fd)
data = data.iloc[:,0:2] #removing the null columns
data.columns = ['label','content']
data.head()
We need to put our emails into the BOW representation. No problem! We are going to use the ‘text’ module from ‘sklearn’ to take care of this. First, we need to tell Python to build the dictionary we need.
from sklearn.feature_extraction import text
word_bagger = text.CountVectorizer()
Note: in any language there are words that don’t carry much value for text classification; one reason is that they are very common in every context. For example, ‘the’, ‘am’, and ‘this’ are everywhere. We can use the following option to get rid of them.
word_bagger = text.CountVectorizer(stop_words="english")
If you are interested in seeing how many words you have in your dictionary, and even having a look at them, be my guest! First fit the vectorizer on our messages so it can build the dictionary, then take a look:
word_bagger.fit(data['content']) # learn the vocabulary from our messages
words = word_bagger.get_feature_names()
len(words)
As you can see, we have 8692 words in our dictionary. If you’re wondering about words 2001 to 2010, you can take a look!
words[2000:2010]
We get the following words:
Now that we have a dictionary, we can transform any sentence into a vector. Just follow this:
X = word_bagger.transform(data['content'])
Let’s see what it does by transforming ‘I am happy happy’!
test = word_bagger.transform(['I am happy happy'])
If you look at the variable test, you’ll see it’s stored as a ‘Compressed Sparse Row’ matrix. This is because ‘sklearn’ is careful about using your memory efficiently and wants to minimize redundant storage, so it only keeps track of the nonzero elements. Just use the print command to see the nonzero elements of test.
print(test)
Train the Model:
Believe it or not, we have done most of the work! We are now going to use another ‘sklearn’ module to train our Naïve Bayes classifier. Depending on the characteristics of the input features, we need to use a different version of Naïve Bayes. What are they? Let us take a look.
- Gaussian Naive Bayes: for continuous data where the features follow a Gaussian distribution.
- Multinomial Naive Bayes: for discrete data where the features are counts (i.e. non-negative integers).
- Bernoulli Naive Bayes: for discrete data where the features are binary-valued.
In our case, we need to use Multinomial Naive Bayes; let’s import it then and train the model on our dataset.
from sklearn.naive_bayes import MultinomialNB
y = data['label'] # the spam/ham labels we want to predict
spam_nb = MultinomialNB()
spam_nb.fit(X, y)
We can easily check the in-sample accuracy of our trained model using the native score function provided by ‘sklearn’.
spam_nb.score(X,y)
Note: this is way too high a score! We are definitely over-fitting here, since we are scoring the model on the very data it was trained on. This can be detected with the usual cross-validation, or mitigated with other standard methods for avoiding over-fitting.
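As one option, here is a minimal sketch of such a check using sklearn’s cross_val_score with 5 folds (the fold count is an arbitrary choice):

from sklearn.model_selection import cross_val_score

# Average accuracy over 5 held-out folds; this is a far less optimistic
# estimate than scoring the model on the data it was trained on.
cv_scores = cross_val_score(MultinomialNB(), X, y, cv=5)
print(cv_scores.mean())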
We are done! Short and sweet, right? We can try our trained model on some sample cases and see if the results match our expectations.
ham = "Let me know if it works for you? I will keep you posted!"
spam = "Buy real cheap eyeglasses here, eyeglasses.club!"
test_bagged = word_bagger.transform([ham, spam])
spam_nb.predict(test_bagged)
We are doing great, indeed! In conclusion, Naïve Bayes is a fundamental machine learning algorithm you can rely on for many reasons, like its ease of use and speed.
Happy reading & stay curious!