A normalizing method in Python
JIRA CODE: JJ-134
Stemming:
Stemming is a normalization method: it reduces the inflected or derived forms of a word to a common root, called the stem. Many variations of a word carry the same core meaning and differ only in tense, number, or other inflection.
Stemming can go wrong in two main ways: over-stemming and under-stemming. Over-stemming occurs when two words with different meanings are reduced to the same stem even though they should stay distinct. Under-stemming occurs when two words that should share a stem are reduced to different stems.
Stemming is used in information retrieval systems such as search engines.
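The two error types can be made concrete with a deliberately crude stemmer. The sketch below is not the Porter algorithm; `naive_stem` and its `SUFFIXES` list are hypothetical names invented for illustration, and the rules are chosen only so that both kinds of error show up on a few words.

```python
# Toy suffix-stripping stemmer: NOT the Porter algorithm, only an illustration.
SUFFIXES = ("ing", "ers", "er", "es", "s")  # hypothetical rule list


def naive_stem(word):
    """Lowercase the word and strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Over-stemming: words with different meanings collapse to one stem.
print(naive_stem("corner"), naive_stem("corn"))   # corn corn

# Under-stemming: related words end up with different stems.
print(naive_stem("running"), naive_stem("run"))   # runn run
```

A real stemmer uses far more careful rules, but the same trade-off applies: more aggressive rules tend to cause over-stemming, and more conservative rules tend to cause under-stemming.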
Python
Code:
from nltk.stem import PorterStemmer

# Create a Porter stemmer and apply it to several variants of one word
ps = PorterStemmer()
words = ["program", "programs", "programer", "programing", "programers"]
for w in words:
    print(f"{w} : {ps.stem(w)}")
Output:
program : program
programs : program
programer : program
programing : program
programers : program
Stemming words from sentences
Code:
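from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# word_tokenize needs NLTK's Punkt tokenizer data, e.g. nltk.download("punkt")
# (newer NLTK releases may ask for "punkt_tab" instead)
ps = PorterStemmer()
sentence = "Programers program with programing languages"
# Split the sentence into word tokens, then stem each token
words = word_tokenize(sentence)
for w in words:
    print(f"{w} : {ps.stem(w)}")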
Output:
Programers : program
program : program
with : with
programing : program
languages : languag
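To show why stemming matters for information retrieval, here is a minimal sketch of a stem-based inverted index. The function names `build_index`, `search`, and `simple_stem` are assumptions made for this example. `simple_stem` is a trivial stand-in, not a real stemmer, so that the sketch runs without extra dependencies.

```python
from collections import defaultdict


def simple_stem(word):
    """Hypothetical stand-in stemmer: lowercase and drop one trailing 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def build_index(documents, stem=simple_stem):
    """Map each stem to the set of document ids containing a word with that stem."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in text.split():  # whitespace split keeps the sketch dependency-free
            index[stem(word)].add(doc_id)
    return index


def search(index, query, stem=simple_stem):
    """Return sorted ids of documents that share a stem with the query word."""
    return sorted(index.get(stem(query), set()))


documents = ["Programs need testing", "A program crashed", "Languages evolve"]
index = build_index(documents)
print(search(index, "program"))    # [0, 1]
print(search(index, "languages"))  # [2]
```

With NLTK installed, the `stem` method of a `PorterStemmer` instance can be passed as the `stem` argument to both functions in place of `simple_stem`, so that queries and documents are normalized by the same rules.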