Naive Bayes

Naive Bayes assumes that all features are independent of one another, and that each affects the response y differently.

The cumulative impact is simply the product of the individual impacts.

Let's look at a simple example:

Categorical Variables

Consider a data set with 1500 observations and the following output classes:

  • Cat
  • Parrot
  • Turtle

The predictor variables are categorical in nature, i.e., they take two values, either True or False:

  • Swim
  • Wings
  • Green Color
  • Sharp Teeth

\[ \begin{array}{|c|c|c|c|c|} \hline \text { Type } & \text { Swim } & \text { Wings } & \text { Green } & \text { Sharp teeth } \\ \hline \text { Cat } & 450 / 500 & 0 & 0 & 500 / 500 \\ \hline \text { Parrot } & 50 / 500 & 500 / 500 & 400 / 500 & 0 \\ \hline \text { Turtle } & 500 / 500 & 0 & 100 / 500 & 50 / 500 \\ \hline \end{array} \]

From the above table, we can summarise that:

  • There are 1500 animals.

  • There are 4 predictors, X: can swim, has wings, is green, has sharp teeth.

  • The response variable, Y (Type), has three classes.

  • The class Cat shows that:
    - Out of 500 cats, 450 (90%) can swim
    - No cats have wings
    - No cats are green
    - All 500 cats have sharp teeth

  • The class Parrot shows that:
    - 50 (10%) parrots have a true value for swim
    - All 500 parrots have wings
    - Out of 500 parrots, 400 (80%) are green in color
    - No parrots have sharp teeth

  • The class Turtle shows that:
    - All 500 turtles can swim
    - No turtles have wings
    - Out of 500 turtles, 100 (20%) are green in color
    - 50 out of 500 (10%) turtles have sharp teeth

Now, with the available data, let's classify the following observation into one of the output classes (Cat, Parrot or Turtle) by using the Naive Bayes classifier.

\[ \begin{array}{|c|c|c|c|c|} \hline & \text { Swim } & \text { Wings } & \text { Green } & \text { Sharp Teeth } \\ \hline \text { Observation } & \text { True } & \text { False } & \text { True } & \text { False }\\\hline \end{array} \]

The goal here is to predict whether the new observed type is a Cat, Parrot or a Turtle based on the defined predictor variables (swim, wings, green, sharp teeth).

What type of animal is this? That is, it can swim and it is green, but it does not have wings or sharp teeth. Looking at our initial table of conditional probabilities, we see that only the Turtle fits this description. Can we express this in mathematical notation?

To solve this, we will use some notation. Let

\[ \begin{aligned} X_{ij} =& \text{ variable } j \text{ for observation } i\\ x_{ij} =& \text{ the value observation }i \text{ takes for variable } j\text{, in this case 1 or 0}\\ C = &\text{ the response variable}\\ k =&\text{ a value of the response variable, i.e., Cat, Parrot or Turtle} \end{aligned} \]

\[ P(\text{X}_{ij} = x_{ij} \mid \text{Class} = k) = P(X_{ij}\mid C=k)\text{ and }P( \text{Class}=k) = P(C = k) \]

Then we can compute the probability of the true class being \(k\) given the new observation as:

\[ \begin{aligned}P(C = k\mid X_{i1}= x_{i1},&X_{i2}=x_{i2},\cdots,X_{ip}= x_{ip} )\\ &=\frac{P(X_{i1}=x_{i1} \mid C = k)P(X_{i2}=x_{i2}\mid C = k) \cdots P(X_{ip} =x_{ip}\mid C = k)P(C= k)}{\text{Normalizing Constant}}\\ \\ &=\frac{P(C = k)\prod_{j=1}^p P(X_{ij} = x_{ij}\mid C = k)}{\sum_{k\in K}P(C = k)\prod_{j=1}^p P(X_{ij} = x_{ij}\mid C = k)} \end{aligned} \]

To understand this formula, let's continue with the example:

Using \(S = Swim\), \(W = Wings\), \(G = Green\) and \(T = Sharp ~Teeth\), we can determine the probabilities of the classes given the new observation as follows:

Prior Probabilities: \[ P(C = Cat) = 500/1500 = 1/3\quad P(C=Parrot) = 1/3\quad P(C=Turtle) = 1/3 \]

Posterior Probabilities:

Cat:

\[ \begin{aligned} P( C = Cat \mid \text{S=1,W=0,G=1,T=0})=&\frac{P( \text{S=1} \mid Cat ) P(\text{W=0}\mid Cat)P(\text{G=1} \mid Cat ) P(\text{T=0}\mid Cat)P(Cat)}{P( \text{S=1,W=0,G=1,T=0} )}\\ =&\frac{(0.9)(1)(0) (0)(1/3)}{\text{Normalizing Constant}}\\ =&0 \end{aligned} \]

Parrot:

\[ \begin{aligned} P( C = Parrot \mid \text{S=1,W=0,G=1,T=0})=&\frac{P( \text{S=1} \mid Parrot ) P(\text{W=0}\mid Parrot)P(\text{G=1} \mid Parrot ) P(\text{T=0}\mid Parrot)P(Parrot)}{P( \text{S=1,W=0,G=1,T=0} )}\\ =&\frac{(0.1)(0) (0.80)(1)(1/3)}{\text{Normalizing Constant}}\\ =&0 \end{aligned} \]

Turtle:

\[ \begin{aligned} P( C= Turtle \mid \text{S=1,W=0,G=1,T=0})=&\frac{P( \text{S=1} \mid Turtle ) P(\text{W=0}\mid Turtle)P(\text{G=1} \mid Turtle ) P(\text{T=0}\mid Turtle)P(Turtle)}{P( \text{S=1,W=0,G=1,T=0} )}\\ =&\frac{(1)(1)(0.2)(0.9)(1/3)}{\text{Normalizing Constant}}\\ =& \frac{0.06}{\text{Normalizing Constant}}\\ \end{aligned} \]

Where \(\text{Normalizing Constant} = 0 + 0 + 0.06 = 0.06\). Thus we get the posterior probabilities of the new observation belonging to Cat, Parrot and Turtle as 0, 0 and 1 respectively. This shows that we are quite sure it's a turtle. What if we had different values? We pick the class with the largest posterior probability.

Note that in computer computation we use logarithms. This is because the product of many very small numbers quickly underflows to zero and cannot be represented exactly in binary floating point. But another issue arises when using logarithms: there is no log of 0. To get around this, we introduce a threshold whereby any probability below the threshold is replaced by the threshold. For example, using a threshold of \(0.0001 = 10^{-4}\) (i.e., 1e-4), we can compute the posteriors as:

\[ \begin{aligned} \text{Cat}:&\quad\exp\left\{\log(0.9) + \log(10^{-4}) + \log(10^{-4}) +\log(1)+\log(1/3)\right\}=3\times 10^{-9}\\ \text{Parrot}:&\quad\exp\left\{\log(0.1)+\log(10^{-4})+\log(0.80)+\log(1)+\log(1/3)\right\} = 2.666667\times 10^{-6}\\ \text{Turtle}:&\quad\exp\left\{\log(1)+\log(1)+\log(0.2)+\log(0.9)+\log(1/3)\right\} = 0.06\\ \text{Normalizing Constant}:&\quad 3\times 10^{-9} + 2.666667\times 10^{-6}+0.06 = 0.06000267\\ \text{Resulting posterior probabilities}:&\quad [4.999778\times 10^{-8},\; 4.444247\times 10^{-5},\; 0.9999555] \end{aligned} \]

We can still predict that the class for the new observation is turtle.

TASK.

Compute the above using R/Python. Note that if a probability is zero, you could use a threshold in order to be able to take logarithms.
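One possible R sketch for this task (a sketch, not the only approach): the conditional probabilities are hard-coded from the table above, and zero entries are replaced by the \(10^{-4}\) threshold before taking logs.

threshold <- 1e-4

# rows: Cat, Parrot, Turtle; columns: P(Swim = 1 | k), P(Wings = 1 | k), P(Green = 1 | k), P(Teeth = 1 | k)
cond <- rbind(
  Cat    = c(0.9, 0.0, 0.0, 1.0),
  Parrot = c(0.1, 1.0, 0.8, 0.0),
  Turtle = c(1.0, 0.0, 0.2, 0.1)
)
prior <- c(Cat = 1/3, Parrot = 1/3, Turtle = 1/3)

# new observation: Swim = 1, Wings = 0, Green = 1, Sharp teeth = 0
x <- c(1, 0, 1, 0)

# P(X_j = x_j | k): the table probability when x_j = 1, its complement when x_j = 0
lik <- t(apply(cond, 1, function(p) ifelse(x == 1, p, 1 - p)))
lik <- pmax(lik, threshold)   # replace zeros so the logarithm is defined

log_unnorm <- log(prior) + rowSums(log(lik))
posterior  <- exp(log_unnorm) / sum(exp(log_unnorm))
posterior
# matches the hand computation above: 4.999778e-08, 4.444247e-05, 0.9999555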

Note that the data is sometimes not presented as neatly as above, but rather as individual observations. We then have to aggregate it to get the conditional probabilities.
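For illustration, here is a minimal R sketch of that aggregation step, using a hypothetical simulated data frame animals (one row per animal, logical predictor columns plus a type column) that roughly mimics the table above:

# hypothetical raw data: one row per animal (simulated here, so the
# proportions only approximately reproduce the table above)
set.seed(1)
animals <- data.frame(
  type  = rep(c("Cat", "Parrot", "Turtle"), each = 500),
  swim  = c(runif(500) < 0.9, runif(500) < 0.1, rep(TRUE, 500)),
  wings = rep(c(FALSE, TRUE, FALSE), each = 500),
  green = c(rep(FALSE, 500), runif(500) < 0.8, runif(500) < 0.2),
  teeth = c(rep(TRUE, 500), rep(FALSE, 500), runif(500) < 0.1)
)

# prior probabilities P(C = k)
prop.table(table(animals$type))

# conditional probabilities P(X_j = TRUE | C = k): proportion of TRUEs within each class
aggregate(cbind(swim, wings, green, teeth) ~ type, data = animals, FUN = mean)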

Continuous Variables

What if a variable is continuous? How do we proceed? We compute the likelihood of being in class \(k\) using the normal density and then include the prior, i.e.:

Taking

\[ p(X_{i} \mid \text{Class} = k)=p(x_{i}\mid\mu_k,\sigma_k) = \mathcal{N}(x_{i};~\mu_k, ~\sigma_k) \]

We can compute the posterior density of class \(k\) as:

\[p( C=k \mid X_{i1}=x_{i1},\cdots, X_{ip}=x_{ip} )=\frac{P(C = k)\prod_{j=1}^p\mathcal{N}(x_{ij} ;~\mu_k, ~\sigma_k)}{\sum_{k\in K} P(C = k)\prod_{j=1}^p\mathcal{N}(x_{ij} ;~\mu_k, ~\sigma_k)}\]

Note that to perform the computation, we use the log-likelihood for stability, since the likelihood itself produces numbers that are essentially 0. Afterwards we exponentiate the results before normalizing.

ie

\[ \ell = \log P(C=k) + \sum_{j=1}^p\log \mathcal{N}(x_{ij} ,~\mu_k, ~\sigma_k) - \log(\text{Normalizing Constant}) \]
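As a small illustration, the exponentiate-and-normalize step can be wrapped in a helper function (a hypothetical helper, not from the text); subtracting the largest value before exponentiating is a standard trick that avoids underflow:

posterior_from_log <- function(log_unnorm) {
  # log_unnorm: vector of log(prior) + log-likelihood values, one per class
  w <- exp(log_unnorm - max(log_unnorm))   # shift so the largest term becomes exp(0) = 1
  w / sum(w)
}

posterior_from_log(c(-310.2, -305.7, -320.9))   # example log scores for three classes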

For example, assume we had the following data:

\[ \begin{matrix}\begin{array}{ccc} X_1 & X_2 & y\\ \hline 5.1 & 3.5 & 1\\ 4.9 & 3.0 & 1\\ 7.0 & 3.2 & 2\\ 6.4 & 3.2 & 2\\ 6.3 & 3.3 & 3\\ 5.8 & 2.7 & 3 \end{array} & \begin{array}{c}\text{And we want to predict the} \\ \text{class of some new observations:}\end{array} & \begin{array}{cc} X_1 & X_2\\ \hline 5.9 & 3.2\\ 5.1 & 3.5\\ 5.7 & 2.8\\ \end{array} \end{matrix} \]

How can we go about this?

Solution:
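A straightforward (deliberately step-by-step) R sketch of the computation follows. The class-2 standard deviation of \(X_2\) is zero, so a small variance floor eps is added here; this is an assumption made purely for numerical stability.

# training data and new observations from the tables above
train <- data.frame(
  X1 = c(5.1, 4.9, 7.0, 6.4, 6.3, 5.8),
  X2 = c(3.5, 3.0, 3.2, 3.2, 3.3, 2.7),
  y  = c(1, 1, 2, 2, 3, 3)
)
new <- data.frame(X1 = c(5.9, 5.1, 5.7), X2 = c(3.2, 3.5, 2.8))
eps <- 1e-3   # small variance floor (assumption): class 2 has zero variance in X2

# class-wise means and (floored) standard deviations, one class at a time
m1 <- colMeans(train[train$y == 1, 1:2])
s1 <- sqrt(apply(train[train$y == 1, 1:2], 2, var) + eps)
m2 <- colMeans(train[train$y == 2, 1:2])
s2 <- sqrt(apply(train[train$y == 2, 1:2], 2, var) + eps)
m3 <- colMeans(train[train$y == 3, 1:2])
s3 <- sqrt(apply(train[train$y == 3, 1:2], 2, var) + eps)

prior <- c(1, 1, 1) / 3

# unnormalized posterior for each class (repetitive on purpose)
u1 <- prior[1] * dnorm(new$X1, m1["X1"], s1["X1"]) * dnorm(new$X2, m1["X2"], s1["X2"])
u2 <- prior[2] * dnorm(new$X1, m2["X1"], s2["X1"]) * dnorm(new$X2, m2["X2"], s2["X2"])
u3 <- prior[3] * dnorm(new$X1, m3["X1"], s3["X1"]) * dnorm(new$X2, m3["X2"], s3["X2"])

probs <- cbind(u1, u2, u3) / (u1 + u2 + u3)
round(probs, 4)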

From the probabilities above, we can tell that the first new observation belongs to class 2, as this class has the highest probability. Note that the code is repetitive and redundant. R and numpy support vectorization, and it is good to make use of this feature; a vectorized version is sketched below.
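One possible vectorized sketch, reusing the train, new, eps and prior objects defined above:

classes <- sort(unique(train$y))

# 2 x 3 matrices of class-wise means and (floored) standard deviations
mus <- sapply(classes, function(k) colMeans(train[train$y == k, 1:2]))
sds <- sapply(classes, function(k) sqrt(apply(train[train$y == k, 1:2], 2, var) + eps))

# log prior + summed log-densities for every new observation (rows) and class (columns)
loglik <- sapply(seq_along(classes), function(j) {
  ll <- mapply(dnorm, new, mus[, j], sds[, j], MoreArgs = list(log = TRUE))
  rowSums(ll) + log(prior[j])
})

# exponentiate (after subtracting the row maximum for stability) and normalize
probs <- exp(loglik - apply(loglik, 1, max))
probs <- probs / rowSums(probs)
round(probs, 4)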

The predicted classes can easily be obtained as

# R: index of the column (class) with the largest probability in each row
max.col(probs)

# Python: the equivalent for a NumPy array of probabilities
probs.argmax(1)

The implementation above involves numeric variables only. Naive Bayes assumes that the continuous variables are independent of one another given the class and are normally distributed within each class. For categorical variables, the implementation is as shown in the first example.

There are packages that implement Naive Bayes classification.

In R we can use the e1071 package to perform Naive Bayes. It accepts both numeric and categorical predictors, so it is preferable to pass in a data frame. The column names of the test set must match the column names of the training set, or neither should have column names.

# fit a Naive Bayes model on the training predictors and labels
model <- e1071::naiveBayes(trainX, trainY)

# predict classes for the test set
predict(model, testX)

# predict posterior probabilities
predict(model, testX, type = 'raw')

In Python we can use the GaussianNB class from the sklearn.naive_bayes module. There are also CategoricalNB, BernoulliNB and MultinomialNB classes. Note that there is no built-in class for mixed data, but you could easily implement this yourself.

from sklearn.naive_bayes import GaussianNB
model = GaussianNB().fit(trainX, trainY)
model.predict(testX)

# Predict probabilities
model.predict_proba(testX)

These probability results do not exactly match the ones in R. This is because scikit-learn uses a variance-smoothing parameter that adds a small epsilon to every element of the variance matrix. Also, numpy uses ddof = 0 (delta degrees of freedom) by default, thus dividing by N when computing the variance instead of by N - 1 as R does. These subtle differences lead to the differences in the results.

Exercise:

Implement a Naive Bayes function that predicts the class of new observations from a training set containing both numeric and categorical predictors. The training set must be a data frame. Use the Horseshoe Crab data provided below to test your function, taking y as the response variable. Note that color, spine and y are categorical variables.

Your function should match the following results for the head of the test set:

## Probabilities
predict(e1071::naiveBayes(y~.,train), test, type = 'raw')|>head()
                0            1
[1,] 0.0282940500 0.9717059500
[2,] 0.9999067092 0.0000932908
[3,] 0.0001659840 0.9998340160
[4,] 0.1459058444 0.8540941556
[5,] 0.0003250912 0.9996749088
[6,] 0.9998464750 0.0001535250
# Accuracy
Metrics::accuracy(test$y, predict(e1071::naiveBayes(y~.,train), test))
[1] 1