Question
Topic 6 (3-8 points): Document Classification

One of the hot topics in data science is document classification. The link below provides a tutorial on the most basic method used in document classification. Read the article in the link by yourself and finish a project for document classification.

www.kdna m/201

When you demo your project, you should prepare some articles for processing. At least three articles from different categories (e.g. sport, politics, and money) should be used as references. Then classify another 20 or more articles from different categories. Compare your results to the real categories of the tested 20 or more articles.

Note:
1) Each news article could be stored in a single file.
2) If you use a more advanced algorithm (google it by yourself) to process a large data set, you will get more points, e.g. the data set in http://www.daviddlewis.com/resources/testcollections/reuters21578/
3) You can use processes in your project to make your program run faster.
4) If you are unclear about the subject of this topic, please feel free to contact your instructor.

Explanation / Answer
The best way to solve this problem is to train a classifier that assigns any document to one of the required 3 categories (fashion, celebrity and politics). To train such a classifier you need pre-tagged data, i.e. documents that you already know belong to the fashion, celebrity or politics category. The bigger the training set, the better your classifier will generally be. Make sure you have approximately the same number of documents for each category, so that each category is properly trained.
Now divide the tagged documents into a train-set and a test-set in some ratio, like 0.7 and 0.3.
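The split described above can be sketched in plain Python as follows. This is only a minimal illustration: the function name `split_by_category`, the fixed seed and the placeholder corpus are assumptions made for the example, and the split is done per category so that each category keeps roughly the same share in both sets.

```python
import random
from collections import defaultdict

def split_by_category(docs, train_ratio=0.7, seed=0):
    """Split (text, label) pairs into train and test sets, category by category."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    by_label = defaultdict(list)
    for text, label in docs:
        by_label[label].append((text, label))
    train, test = [], []
    for items in by_label.values():
        rng.shuffle(items)
        cut = round(len(items) * train_ratio)
        train.extend(items[:cut])      # first 70% of each category
        test.extend(items[cut:])       # remaining 30% of each category
    return train, test

# Hypothetical placeholder corpus: 10 tagged documents per category.
docs = [(f"{label} article {i}", label)
        for label in ("fashion", "celebrity", "politics")
        for i in range(10)]

train_set, test_set = split_by_category(docs)
print(len(train_set), len(test_set))
```

In a real project each text would be read from its own file, and the label would come from however the files were tagged (for example, the folder name).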
Train a model on the train-set. There are many approaches to training such a model. A naive approach is to remove stopwords and find the words that occur in one category of documents but not in the other two. More generally, try to estimate the probability that a word belongs to a certain class; a classifier built on these per-word probabilities is called a Naive Bayes classifier. Train the model and then test it on the remaining 0.3 test-set documents. Make a confusion matrix to find out how well your model is performing. If the performance is good, your model is ready. If the training data isn't big, such a model may not give good results, especially when the categories more or less use the same words. In that case you may either try to get more training data or follow a slightly more complex approach.
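As a rough sketch of the Naive Bayes approach and the confusion matrix described above, the following uses only the Python standard library. The stopword list, the toy sentences and all names (`tokenize`, `NaiveBayes`, `confusion_matrix`) are assumptions made up for illustration, and add-one (Laplace) smoothing is one common choice that the answer itself does not specify.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed tiny stopword list; a real project would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "by", "after"}

def tokenize(text):
    """Lowercase, keep alphabetic words only, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

class NaiveBayes:
    def fit(self, docs):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        for text, label in docs:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)      # log prior of the class
            for w in tokenize(text):
                if w in self.vocab:                    # skip words never seen in training
                    score += math.log((counts[w] + 1) / denom)  # add-one smoothing
            if score > best_score:
                best_label, best_score = label, score
        return best_label

def confusion_matrix(model, docs):
    """Return matrix[actual][predicted] = number of documents."""
    matrix = defaultdict(Counter)
    for text, label in docs:
        matrix[label][model.predict(text)] += 1
    return matrix

# Hypothetical toy data standing in for the 0.7 / 0.3 split.
train_docs = [
    ("the designer showed a new dress collection on the runway", "fashion"),
    ("spring fashion trends favor denim and linen", "fashion"),
    ("the actor married the singer in a private ceremony", "celebrity"),
    ("the pop star announced a world tour", "celebrity"),
    ("parliament passed the budget bill after a long debate", "politics"),
    ("the senator won the election by a narrow vote", "politics"),
]
test_docs = [
    ("a runway show of the new denim collection", "fashion"),
    ("the singer and the actor announced a tour", "celebrity"),
    ("the vote on the bill split parliament", "politics"),
]

model = NaiveBayes().fit(train_docs)
matrix = confusion_matrix(model, test_docs)
for actual, row in matrix.items():
    print(actual, dict(row))
```

Rows of the matrix are the real categories and columns are the predicted ones, so counts off the diagonal show which categories the model confuses. The toy sentences were chosen so that the classes share no words; real articles will overlap far more, which is exactly the situation the paragraph above warns about.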
You may make a better model by creating features from the words in the document. Apart from extracting keywords, also find the part of speech associated with each word. You may use pre-built libraries such as Apache OpenNLP or Stanford CoreNLP to do so. Once you have a word and its associated part of speech, you may measure the similarity of one (word, POS) pair to another (word, POS) pair with a similarity algorithm like Wu-Palmer. If you are going to use Wu-Palmer, only take the pairs whose part of speech is Noun or Verb. Wu-Palmer is also available in many word similarity libraries. Now, when you have a new document, find the (word, POS) pairs in that document and compute the probability of such pairs being in one of the above classes (fashion, celebrity or politics). If the document set is small, you may do this by comparing the new document's pairs with those of each class and computing the Jaccard index for each class. The class with the highest value will be your answer. If the document set is big, train a classifier as in the Naive Bayes case and compute the probabilities. Then test the classifier to check that it works.
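For the small-document-set case, the Jaccard index comparison could look roughly like the sketch below. It assumes the (word, POS) pairs have already been produced by a tagger; the pairs, the `NOUN`/`VERB` tag names and the helper names `jaccard` and `classify` are hypothetical and written by hand for illustration. Wu-Palmer similarity is not shown because it needs a WordNet-backed library.

```python
def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def classify(pairs, class_pairs):
    """Score a document's (word, POS) pairs against each class and pick the best."""
    scores = {label: jaccard(pairs, ref) for label, ref in class_pairs.items()}
    return max(scores, key=scores.get), scores

# Hypothetical reference pairs collected from the tagged documents of each class.
class_pairs = {
    "fashion": {("dress", "NOUN"), ("runway", "NOUN"),
                ("wear", "VERB"), ("designer", "NOUN")},
    "celebrity": {("actor", "NOUN"), ("singer", "NOUN"),
                  ("marry", "VERB"), ("tour", "NOUN")},
    "politics": {("election", "NOUN"), ("vote", "VERB"),
                 ("senator", "NOUN"), ("bill", "NOUN")},
}

# Hypothetical pairs extracted from a new, unlabeled document.
new_doc = {("senator", "NOUN"), ("vote", "VERB"), ("tour", "NOUN")}

label, scores = classify(new_doc, class_pairs)
print(label, scores)
```

Here the new document shares two of its three pairs with the politics set (2 shared out of 5 distinct pairs, so 0.4) and one with the celebrity set, so politics has the highest Jaccard index.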