Naive Bayes is a classification technique based on Bayes' theorem with an assumption of independence between predictors. A Naive Bayes classifier assumes that, within a given class, the presence of a particular feature is unrelated to the presence of any other feature.
A Naive Bayes model is easy to build and particularly useful for very large data sets. Despite its simplicity, it often performs competitively with, and on some problems better than, far more sophisticated classification methods.
Naive Bayes uses Bayes' theorem to predict the probability of each class from the attributes of an instance. The algorithm is mostly used in text classification and in problems with multiple classes.
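Bayes' theorem gives the probability P(c|x), where c is one of the possible classes and x is the instance to be classified, described by its features:
P(c|x) = P(x|c) * P(c) / P(x)
Here P(c) is the prior probability of the class, P(x|c) is the likelihood of the instance given the class, and P(x) is the overall probability of the instance.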
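The independence assumption described earlier is what makes this formula practical: the likelihood splits into one factor per feature. A sketch of that step, writing x = (x_1, ..., x_n) for the individual feature values (this subscript notation is introduced here for illustration and does not appear in the text above):

```latex
P(x \mid c) = \prod_{i=1}^{n} P(x_i \mid c)
\quad\Longrightarrow\quad
P(c \mid x) = \frac{P(c)\,\prod_{i=1}^{n} P(x_i \mid c)}{P(x)},
\qquad
\hat{c} = \arg\max_{c}\; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

Because P(x) is the same for every class, it can be left out when only the most probable class is needed.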
In text classification, a Naive Bayes classifier predicts the tag of a text: it calculates the probability of each tag for the given text and then outputs the tag with the highest probability.
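The scoring described above can be sketched from scratch with the standard library. This is a minimal illustration under stated assumptions, not the scikit-learn implementation: the function names train and predict and the four sample texts are made up for this example, add-one (Laplace) smoothing is assumed so that unseen word and tag combinations do not get zero probability, and log-probabilities are summed instead of multiplying raw probabilities.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """Count tags and per-tag word frequencies from (text, tag) pairs."""
    tag_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, tag in samples:
        words = text.lower().split()
        tag_counts[tag] += 1
        word_counts[tag].update(words)
        vocab.update(words)
    return tag_counts, word_counts, vocab


def predict(text, tag_counts, word_counts, vocab):
    """Return (best_tag, scores); each score is log P(tag) + sum of log P(word | tag)."""
    total = sum(tag_counts.values())
    scores = {}
    for tag, n in tag_counts.items():
        score = math.log(n / total)  # log prior
        # Add-one smoothing: every vocabulary word gets one extra count (assumption)
        denom = sum(word_counts[tag].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[tag][word] + 1) / denom)
        scores[tag] = score
    return max(scores, key=scores.get), scores


# Hypothetical toy data, not the data set used in this article
samples = [
    ("nice story good movie", "positive"),
    ("good acting nice songs", "positive"),
    ("sad boring movie", "negative"),
    ("boring ending", "negative"),
]
model = train(samples)
best_tag, scores = predict("nice movie", *model)
print(best_tag)  # positive
```

The tag with the largest score wins; the scores themselves are unnormalized log-probabilities, which is enough for picking the most likely tag.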
Code:
Cleaning texts
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
dataset = [["I liked the movie", "positive"],
           ["It's a good movie. Nice story", "positive"],
           ["Hero's acting is bad but heroine looks good. "
            "Overall nice movie", "positive"],
           ["Nice songs. But sadly boring ending.", "negative"],
           ["sad movie, boring movie", "negative"]]
dataset = pd.DataFrame(dataset)
dataset.columns = ["Text", "Reviews"]
nltk.download('stopwords')
# The stemmer is created once here; stop-word removal and stemming
# are not applied in the loop below
ps = PorterStemmer()
corpus = []
for i in range(len(dataset)):
    # Replace every non-letter with a space so that words stay separated
    text = re.sub('[^a-zA-Z]', ' ', dataset['Text'][i])
    text = text.lower()
    text = text.split()
    # Join the tokens back into one space-separated string
    text = ' '.join(text)
    corpus.append(text)
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values
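To make the bag-of-words step concrete, the sketch below builds a count matrix by hand with the standard library. It is a simplified stand-in for illustration, not CountVectorizer's implementation: the function name bag_of_words and the two sample strings are made up here, and CountVectorizer additionally drops one-letter tokens by default and, when max_features is set, keeps only the most frequent terms.

```python
def bag_of_words(corpus):
    """Return (vocabulary, matrix) with one row of word counts per document."""
    tokenized = [doc.lower().split() for doc in corpus]
    # Sorted vocabulary: one column per distinct word
    vocabulary = sorted({word for doc in tokenized for word in doc})
    index = {word: i for i, word in enumerate(vocabulary)}
    matrix = []
    for doc in tokenized:
        row = [0] * len(vocabulary)
        for word in doc:
            row[index[word]] += 1
        matrix.append(row)
    return vocabulary, matrix


# Hypothetical sample documents
vocabulary, matrix = bag_of_words(["sad movie boring movie", "nice movie"])
print(vocabulary)  # ['boring', 'movie', 'nice', 'sad']
print(matrix)      # [[1, 2, 0, 1], [0, 1, 1, 0]]
```

Each row corresponds to one document and each column to one vocabulary word, which is the same layout as the array X above.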
Splitting the data set into training set and test set
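# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)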
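With only five reviews, it helps to see how many of them end up in each set. The sketch below is a simplified stand-in for the split, assuming a plain shuffle followed by a cut; the function name simple_train_test_split and the demo data are hypothetical, and the rows it selects will generally differ from the ones scikit-learn picks for the same seed. Like scikit-learn, it rounds a fractional test size up.

```python
import math
import random


def simple_train_test_split(X, y, test_size=0.25, seed=0):
    """Shuffle the row indices, then cut off the first ceil(test_size * n) as the test set."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * len(X))  # fractional sizes are rounded up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Hypothetical demo data with the same size as the review data set
X_demo = [[i] for i in range(5)]
y_demo = ["positive", "positive", "positive", "negative", "negative"]
X_tr, X_te, y_tr, y_te = simple_train_test_split(X_demo, y_demo)
print(len(X_tr), len(X_te))  # 3 2
```

So a test_size of 0.25 on five samples leaves three reviews for training and two for testing, which is why the confusion matrix at the end of this section counts only two predictions.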
Fitting naive bayes to the training set
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
classifier = GaussianNB()
classifier.fit(X_train, y_train)
Predicting test set results
y_pred = classifier.predict(X_test)
Making the confusion matrix
cm = confusion_matrix(y_test, y_pred)
cm
Output:
array([[0, 0],
       [2, 0]])
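scikit-learn's confusion_matrix puts the true labels in rows and the predicted labels in columns, with the labels in sorted order, so here index 0 is 'negative' and index 1 is 'positive'. Read that way, the matrix above says that both test reviews were actually positive and both were predicted as negative. The sketch below reproduces that layout with the standard library; the function name confusion is hypothetical, and the two label lists are inferred from the matrix rather than taken from a run.

```python
def confusion(y_true, y_pred):
    """Return (labels, matrix) where matrix[i][j] counts true label i predicted as label j."""
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[index[true_label]][index[predicted_label]] += 1
    return labels, matrix


# Label lists inferred from the reported matrix (assumption, not program output)
labels, matrix = confusion(["positive", "positive"], ["negative", "negative"])
print(labels)  # ['negative', 'positive']
print(matrix)  # [[0, 0], [2, 0]]

# Accuracy is the diagonal sum divided by the number of predictions
accuracy = sum(matrix[i][i] for i in range(len(labels))) / 2
print(accuracy)  # 0.0
```

With only two test samples, a single matrix like this says little about the classifier in general; a larger data set would give a more meaningful picture.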