


Question

Naive Bayes is a simple but effective machine learning/data mining solution to the problem of document classification. For this assignment, you will implement a Naive Bayes classifier to classify newsgroup articles using Java code.
Data set: Please use "Twenty Newsgroups Data Set" from here: http://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups
Your program will read the data, randomly split it into a training set (70% of the data) and a testing set (30% of the data), train and test the classifier, and evaluate the classifier's performance using two methods: a confusion matrix and a precision-recall graph.

Explanation / Answer
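As a starting point for the reading, splitting, and evaluation steps listed in the question, here is a minimal Java skeleton. The Document class and the loadArticles and classify methods are illustrative placeholders, not part of the data set or of any required API; a real solution would read the per-newsgroup directories and plug in the Naive Bayes decision rule sketched at the end of this answer.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;
    import java.util.TreeMap;

    // Illustrative skeleton only: Document, loadArticles, and classify are
    // placeholders for code the assignment asks you to write.
    public class SplitAndEvaluateSketch {

        // A labeled article: raw text plus the newsgroup it came from.
        static class Document {
            final String text;
            final String label;
            Document(String text, String label) { this.text = text; this.label = label; }
        }

        public static void main(String[] args) {
            List<Document> all = loadArticles("20_newsgroups");   // hypothetical loader

            // Randomly split the data: 70% training, 30% testing.
            Collections.shuffle(all, new Random(42));
            int cut = (int) Math.round(all.size() * 0.7);
            List<Document> train = all.subList(0, cut);
            List<Document> test = all.subList(cut, all.size());

            // Confusion matrix: rows are true labels, columns are predicted labels.
            Map<String, Map<String, Integer>> confusion = new TreeMap<>();
            for (Document d : test) {
                String predicted = classify(d, train);             // placeholder classifier
                confusion.computeIfAbsent(d.label, k -> new TreeMap<>())
                         .merge(predicted, 1, Integer::sum);
            }
            System.out.println(confusion);
        }

        static List<Document> loadArticles(String rootDir) {
            // Walk the per-newsgroup directories under rootDir and read each file (omitted).
            return new ArrayList<>();
        }

        static String classify(Document d, List<Document> train) {
            // Replace with the Naive Bayes decision rule sketched at the end of this answer.
            return "unknown";
        }
    }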

A question that often comes up with Naive Bayes is whether the posteriors need to be normalized when documents have different numbers of features. This is a good question, but I am not sure there is actually a problem here. The posterior probability simply gives you the probability of each class given a document. When classifying a document you only compare posteriors for that same document, so the number of features does not change (you are never comparing across documents). Dropping the shared evidence term, the score for each class is proportional to the prior times the feature likelihoods:

P(class1 | document) ∝ P(class1) * P(feature1 | class1) * ... * P(featureK | class1)
...
P(classN | document) ∝ P(classN) * P(feature1 | classN) * ... * P(featureK | classN)

The class with the highest score becomes the label for the document. Since the number of features depends on the document and not on the class, there is no need to normalize for classification.

If you want to do more than classify, for example compare the most likely documents of a particular class, then you need the actual definition of the posterior probability, including the normalizing denominator:

P(class1 | document) = P(class1) * P(feature1 | class1) * ... * P(featureK | class1) / Sum_over_all_classes[ P(class) * P(feature1 | class) * ... * P(featureK | class) ]

This normalizes correctly across documents with varying numbers of features.

Follow-up: the denominator (the "sum over all numerators") is the sum of the class numerators over every class, which is exactly P(feature1, ..., featureK), the evidence term in the full definition of the posterior. The key point is that the denominator can be left out only when comparing posteriors across classes for a single document; when comparing across documents it must be included, which effectively normalizes the posterior probabilities.
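To make the two computations above concrete, here is a minimal Java sketch that works with log probabilities to avoid numerical underflow. The classLogPrior and wordLogLikelihood maps, and the method names logNumerator, classify, and posterior, are illustrative assumptions; estimating the priors and likelihoods from the training set (for example with Laplace smoothing) is not shown.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative sketch: classLogPrior and wordLogLikelihood are assumed to have
    // been estimated from the training set (e.g. with Laplace smoothing), which is
    // not shown here.
    public class PosteriorSketch {

        // log P(class) for each class
        static Map<String, Double> classLogPrior = new HashMap<>();
        // log P(word | class) for each class and word
        static Map<String, Map<String, Double>> wordLogLikelihood = new HashMap<>();

        // Unnormalized log posterior: log P(class) + sum of log P(word | class).
        // Comparing these values across classes for ONE document is enough to label it.
        static double logNumerator(String cls, List<String> words) {
            double logP = classLogPrior.get(cls);
            Map<String, Double> likelihood = wordLogLikelihood.get(cls);
            for (String w : words) {
                // A real implementation needs a smoothed probability for unseen words.
                logP += likelihood.getOrDefault(w, Math.log(1e-10));
            }
            return logP;
        }

        // Classification: pick the class with the highest unnormalized score.
        static String classify(List<String> words) {
            String best = null;
            double bestLogP = Double.NEGATIVE_INFINITY;
            for (String cls : classLogPrior.keySet()) {
                double logP = logNumerator(cls, words);
                if (logP > bestLogP) { bestLogP = logP; best = cls; }
            }
            return best;
        }

        // Normalized posterior P(class | document): the class numerator divided by the
        // sum of all class numerators, computed in log space for numerical stability.
        static double posterior(String cls, List<String> words) {
            double max = Double.NEGATIVE_INFINITY;
            for (String c : classLogPrior.keySet()) {
                max = Math.max(max, logNumerator(c, words));
            }
            double sum = 0.0;
            for (String c : classLogPrior.keySet()) {
                sum += Math.exp(logNumerator(c, words) - max);
            }
            return Math.exp(logNumerator(cls, words) - max) / sum;
        }
    }

Here classify corresponds to the unnormalized comparison above (the shared denominator cancels across classes for a single document), while posterior divides by the sum of all class numerators, so its values are true probabilities that can be compared across documents of different lengths.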