In Matlab In linguistics, stemming is the process of reducing inflected words to
Question
In Matlab
In linguistics, stemming is the process of reducing inflected words to their word stem, base, or root form. In this assignment, you are to write a simple word stemmer for English. The input is a string, text, that may contain punctuation or other non-alphabetical characters. Your program should stem the words in the text and return these words as a cell array.
Here are the steps your program should perform to derive and filter the word stems:
Convert any upper case letter to lower case.
Replace each character that is neither a letter nor a space with a space character, e.g., "My 1st NLP program!!!" should become: "my st nlp program "
Extract the words from the string. e.g., "my st nlp program " will result in the list: "my", "st", "nlp", and "program".
Strip the following suffixes from the words that have them: -ly, -ed, -ing, -es, -s. Each suffix should be considered once and in that order (first strip -ly, then strip -ed, then strip -ing, etc.), e.g., the word "excitedly" turns into "excit"; the word "feeding" turns into "feed".
Remove any word from the list that is 2 characters or less.
Remove the following common words from the list: the, and, that, have, for, not
Note that the stemming strategies used in this program are over-simplistic and may not give sensible results.
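Because each suffix is considered once and in order, a single word can lose more than one suffix. The following is a minimal sketch of that one step, tracing the word "excitedly" from the example; the variable names are illustrative assumptions and not part of the assignment.

```matlab
% Illustrative sketch: strip each suffix once, in the order given.
suffixes = {'ly', 'ed', 'ing', 'es', 's'};
word = 'excitedly';
for k = 1:numel(suffixes)
    % The '$' anchor restricts the match to the end of the word.
    word = regexprep(word, [suffixes{k} '$'], '');
end
disp(word)   % excit
```

Here -ly is removed first ("excited"), then -ed ("excit"); the remaining suffixes do not match.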
Explanation / Answer
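function out = simplestemmer(text)
% SIMPLESTEMMER Stem the words in TEXT and return them as a cell array.
text = lower(text);                         % 1. convert to lower case
text = regexprep(text, '[^a-z ]', ' ');     % 2. replace non-letters with spaces
words = strsplit(strtrim(text));            % 3. split on runs of whitespace
suffixes = {'ly', 'ed', 'ing', 'es', 's'};
for k = 1:numel(suffixes)                   % 4. strip each suffix once, in order
    words = regexprep(words, [suffixes{k} '$'], '');
end
words = words(cellfun(@numel, words) > 2);  % 5. drop words of 2 characters or fewer
stopwords = {'the', 'and', 'that', 'have', 'for', 'not'};
out = words(~ismember(words, stopwords));   % 6. drop the listed common words (whole-word match)
end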
I kept the code simple and commented it to make it easy to follow. If you have trouble understanding the code, please feel free to comment below. I shall be glad to help you.