Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space, defined as the cosine of the angle between them.
Similarity = (A · B) / (||A|| × ||B||), where A and B are vectors.
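For example, here is a minimal sketch of the formula in plain Python, using two small vectors chosen only for illustration:

import math

# two small example vectors (chosen only for illustration)
A = [1, 1, 0]
B = [1, 0, 1]

dot = sum(a * b for a, b in zip(A, B))       # A . B
norm_A = math.sqrt(sum(a * a for a in A))    # ||A||
norm_B = math.sqrt(sum(b * b for b in B))    # ||B||

print(dot / (norm_A * norm_B))               # ~0.5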
nltk.tokenize: It is used for tokenization. Tokenization is the process by which a large quantity of text is divided into smaller parts called tokens. word_tokenize(X) splits the given sentence X into words and returns a list of tokens.
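For example (a quick sketch, assuming the 'punkt' data mentioned below has already been downloaded):

from nltk.tokenize import word_tokenize

print(word_tokenize("I love horror movies"))
# ['I', 'love', 'horror', 'movies']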
nltk.corpus: It is used to get a list of stopwords. A stop word is a commonly used word (such as “the”, “a”, “an”, “in”).
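A quick way to inspect that list (again a sketch, assuming the 'stopwords' corpus mentioned below has been downloaded):

from nltk.corpus import stopwords

sw = stopwords.words('english')       # list of common English stopwords
print('the' in sw, 'horror' in sw)    # True False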
We have to download the required NLTK data first: nltk.download('punkt') and nltk.download('stopwords').
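This is a one-time setup step, for example:

import nltk

# download the tokenizer models and the stopword corpus (run once)
nltk.download('punkt')
nltk.download('stopwords')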
Code:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# X = input("Enter first string: ").lower()
# Y = input("Enter second string: ").lower()
X = "I love horror movies"
Y = "Lights out is a horror movie"

# tokenization
X_list = word_tokenize(X)
Y_list = word_tokenize(Y)

# sw contains the list of stopwords
sw = stopwords.words('english')
l1 = []
l2 = []

# remove stop words from the strings
X_set = {w for w in X_list if not w in sw}
Y_set = {w for w in Y_list if not w in sw}

# form a set containing keywords of both strings
rvector = X_set.union(Y_set)
for w in rvector:
    if w in X_set:
        l1.append(1)   # build the binary vector for X
    else:
        l1.append(0)
    if w in Y_set:
        l2.append(1)   # build the binary vector for Y
    else:
        l2.append(0)

# cosine formula
c = 0
for i in range(len(rvector)):
    c += l1[i] * l2[i]
cosine = c / float((sum(l1) * sum(l2)) ** 0.5)
print("similarity:", cosine)
Output:
similarity: 0.2886751345948129
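As a sanity check (assuming the standard NLTK English stopword list): after stopword removal, X_set holds 4 keywords ({'I', 'love', 'horror', 'movies'}; the capitalized 'I' does not match the lowercase stopword 'i') and Y_set holds 3 ({'Lights', 'horror', 'movie'}), with only 'horror' shared. The formula then gives 1 / (4 × 3)**0.5 ≈ 0.2887, which matches the printed value.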