Finding Text Similarity using Python

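Cosine similarity measures how similar two non-zero vectors of an inner product space are by taking the cosine of the angle between them. A value of 1 means the vectors point in the same direction, and 0 means they are orthogonal, which for text means the two strings share no keywords.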

Similarity = (A · B) / (||A|| × ||B||), where A and B are vectors, A · B is their dot product, and ||A|| is the length (norm) of A.
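As a quick numeric check of the formula, the following minimal sketch computes the similarity of two small vectors using plain Python lists. The function name cosine_similarity and the sample vectors are illustrative assumptions, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Return (A.B) / (||A|| * ||B||) for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical sample vectors, chosen only for illustration.
A = [1, 1, 0]
B = [1, 0, 1]

# Dot product is 1 and both norms are sqrt(2), so the result is close to 0.5.
print(cosine_similarity(A, B))
```

The sketch assumes neither vector is all zeros; a zero vector would make the denominator zero.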

nltk.tokenize: Used for tokenization, the process of dividing a large body of text into smaller parts called tokens. word_tokenize(X) splits the given sentence X into words and returns them as a list.
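To show what tokenization produces without needing the NLTK data files, here is a rough stand-in built on Python's re module. It is only an approximation: simple_tokenize is a hypothetical helper, and word_tokenize itself treats punctuation and contractions with more care.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    A rough approximation of word-level tokenization, not NLTK's algorithm.
    """
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Lights out is a horror movie"))
# ['Lights', 'out', 'is', 'a', 'horror', 'movie']
```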

nltk.corpus: Used to get a list of stop words. A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that carries little meaning on its own and is usually filtered out before comparing texts.
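The filtering step can be sketched with a small hand-written stop list. The set SAMPLE_STOPWORDS below is an assumed subset used only for illustration; in practice the list comes from stopwords.words('english'). That list is lowercase, so the lookup is case-sensitive, which the sketch demonstrates with the token "I".

```python
# Assumed subset of English stop words, for illustration only.
SAMPLE_STOPWORDS = {"i", "a", "an", "the", "in", "is", "out"}

def remove_stopwords(tokens, stop_words):
    """Return the set of tokens that are not in stop_words (case-sensitive)."""
    return {w for w in tokens if w not in stop_words}

tokens = ["I", "love", "horror", "movies"]

# "I" survives because the stop list only contains the lowercase "i".
kept_as_is = remove_stopwords(tokens, SAMPLE_STOPWORDS)
print(sorted(kept_as_is))

# Lowercasing the tokens first lets the lookup catch it.
kept_lowered = remove_stopwords([w.lower() for w in tokens], SAMPLE_STOPWORDS)
print(sorted(kept_lowered))
```

Whether to lowercase before filtering is a design choice: it removes more stop words, but it also changes the resulting keyword sets and therefore the similarity score.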

Before running the code, download the required NLTK data: nltk.download('punkt') and nltk.download('stopwords').

Code:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Uncomment these two lines to read the strings from the user instead:
# X = input("Enter first string: ").lower()
# Y = input("Enter second string: ").lower()
X = "I love horror movies"
Y = "Lights out is a horror movie"

# tokenization
X_list = word_tokenize(X)
Y_list = word_tokenize(Y)

# sw contains the list of stopwords
sw = stopwords.words('english')
l1 = []
l2 = []

# remove stop words from the strings
X_set = {w for w in X_list if w not in sw}
Y_set = {w for w in Y_list if w not in sw}

# form a set containing keywords of both strings
rvector = X_set.union(Y_set)
for w in rvector:
    # create a 0/1 vector for each string over the combined keywords
    if w in X_set:
        l1.append(1)
    else:
        l1.append(0)
    if w in Y_set:
        l2.append(1)
    else:
        l2.append(0)
c = 0

# cosine formula: dot product divided by the product of the vector lengths
for i in range(len(rvector)):
    c += l1[i] * l2[i]
cosine = c / float((sum(l1) * sum(l2)) ** 0.5)
print("similarity:", cosine)

Output:
similarity: 0.2886751345948129
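Because each vector holds only 0s and 1s, the dot product equals the number of shared keywords, and each squared norm equals the size of the corresponding keyword set. Under that observation the same computation can be sketched more compactly with set operations. The function name keyword_cosine and its empty-input guard are illustrative additions, and the keyword sets below assume that "I" is kept because the stop list is lowercase.

```python
import math

def keyword_cosine(x_set, y_set):
    """Cosine similarity of two keyword sets treated as binary vectors."""
    if not x_set or not y_set:
        # Assumed convention: return 0.0 when either set is empty,
        # which also avoids dividing by zero.
        return 0.0
    shared = len(x_set & y_set)  # dot product of the 0/1 vectors
    return shared / math.sqrt(len(x_set) * len(y_set))

# Keyword sets for the two example sentences after stop-word removal
# (assumed result of the filtering step, not computed by NLTK here).
X_set = {"I", "love", "horror", "movies"}
Y_set = {"Lights", "horror", "movie"}

# One shared keyword over sets of size 4 and 3: 1 / sqrt(12), about 0.2887.
print("similarity:", keyword_cosine(X_set, Y_set))
```

Only "horror" is shared here; "movies" and "movie" count as different keywords because no stemming is applied.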
