


Question

You are hired by a new start-up called Readr that wants to offer automatic book recommendations based on a book's text. You are tasked with building a classifier that, given a book and its content, classifies the book as good or bad. After consulting with a top literary critic, you decide that you only need to consider two binary features for this task: a) whether the length of the book is more than 500 pages (represented by random variable L with values short and long), and b) whether the word "wow" appears in the book text (represented by random variable W with values true and false).

a) You have a dataset of 10,000 books with good/bad labels provided by top literary critics, and you observe the following: i) 2,500 books are labeled as good and the rest as bad, ii) 2,000 of the good books are long, iii) 5,000 of the bad books are short, iv) 50 of the good books have the word "wow", and v) 1,500 of the bad books have the word "wow". Let G be the hypothesis that a book is good, and B the hypothesis that it is bad. What values would you pick for the priors P(G) and P(B), and the likelihoods P(L = long | G), P(L = long | B), P(W = true | G), P(W = true | B)?

b) You decide to build your classifier using Naive Bayes, building on the probability mass functions derived in part a). Suppose your classifier receives a new book with length of 700 words and that does not contain the word "wow". What will your classifier predict about this book?

c) Suppose you go back to your dataset and notice that all the good books that are long do not have the word "wow", and that 750 of the bad books that are long do not have the word "wow". Would this information lead you to a different answer than the one produced by your classifier from part b)? If so, why?

Explanation / Answer

a) Of the 10,000 books in the dataset, 2500 are labelled as good and the remaining 7500 as bad.

Hence the prior probability of a new book being good is P(G) = 2500/10000 = 0.25.

Similarly, the prior probability of a new book being bad is P(B) = 7500/10000 = 0.75.

It is also given that 2000 of the 2500 good books are long, while 5000 of the 7500 bad books are short.

Hence the likelihood P(L = long | G) = 2000 / 2500 = 0.8

and the likelihood P(L = long | B) = (7500 - 5000) / 7500 = 0.3333

It is further given that 50 of the good books have the word "wow" and 1500 of the bad books have the word "wow".

Hence P(W = true | G) = 50 / 2500 = 0.02

and P(W = true | B) = 1500 / 7500 = 0.2
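
The estimates in part a) are plain relative frequencies, so they can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and the counts are the ones given in the question.

```python
from fractions import Fraction

# Counts taken from the problem statement.
total = 10_000
good = 2_500
bad = total - good          # 7,500 books labelled bad
good_long = 2_000
bad_short = 5_000
good_wow = 50
bad_wow = 1_500

# Priors and likelihoods as exact relative frequencies.
priors = {"G": Fraction(good, total), "B": Fraction(bad, total)}
p_long = {"G": Fraction(good_long, good), "B": Fraction(bad - bad_short, bad)}
p_wow = {"G": Fraction(good_wow, good), "B": Fraction(bad_wow, bad)}

for label in ("G", "B"):
    print(
        f"P({label}) = {float(priors[label]):.4f}, "
        f"P(L=long | {label}) = {float(p_long[label]):.4f}, "
        f"P(W=true | {label}) = {float(p_wow[label]):.4f}"
    )
```

Using Fraction keeps the values exact (for example 1/3 rather than 0.3333), which avoids rounding noise when the numbers are multiplied later.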

b) From the given data, there are 4500 long books in total, of which 2000 are good and 2500 are bad. There are also 5500 short books in total, of which 500 are good and 5000 are bad.

Further, there are 1550 books in total with the word "wow", of which 50 are good and 1500 are bad.

Consequently, there are 8450 books without "wow", of which 2450 are good and 6000 are bad.

It is given that the new book has 700 words. This must be a typo, so let us assume that it has 700 pages. Hence the new book is long and also does not have "wow".

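Taking each feature on its own, the probability of the new book being good given that it is long is P(G | L = long) = 2000/4500 = 0.4444, and the probability of it being good given that it does not have "wow" is P(G | W = false) = 2450/8450 = 0.2899.

Naive Bayes, however, combines the prior with both likelihoods from part a), assuming L and W are independent given the class. With P(W = false | G) = 1 - 0.02 = 0.98 and P(W = false | B) = 1 - 0.2 = 0.8:

P(G) × P(L = long | G) × P(W = false | G) = 0.25 × 0.8 × 0.98 = 0.196

P(B) × P(L = long | B) × P(W = false | B) = 0.75 × 0.3333 × 0.8 = 0.2

Since 0.2 > 0.196, the classifier will predict that the new book is bad. The margin is narrow: after normalizing, P(G | L = long, W = false) = 0.196 / (0.196 + 0.2) = 0.4949.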
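
The Naive Bayes decision for a long book without "wow" can be sketched in Python as below. The function name naive_bayes_scores is hypothetical, and the probabilities are the part a) estimates; the sketch assumes the standard Naive Bayes rule of multiplying the prior by each feature likelihood.

```python
from fractions import Fraction

# Estimates from part a).
prior = {"G": Fraction(1, 4), "B": Fraction(3, 4)}
p_long = {"G": Fraction(4, 5), "B": Fraction(1, 3)}    # P(L = long | class)
p_wow = {"G": Fraction(1, 50), "B": Fraction(1, 5)}    # P(W = true | class)


def naive_bayes_scores(is_long, has_wow):
    """Return the unnormalized score P(class) * P(L | class) * P(W | class)."""
    scores = {}
    for label in prior:
        length_term = p_long[label] if is_long else 1 - p_long[label]
        wow_term = p_wow[label] if has_wow else 1 - p_wow[label]
        scores[label] = prior[label] * length_term * wow_term
    return scores


# New book: long (assumed 700 pages) and without "wow".
scores = naive_bayes_scores(is_long=True, has_wow=False)
posterior_good = scores["G"] / sum(scores.values())
prediction = max(scores, key=scores.get)

print({label: float(value) for label, value in scores.items()})
print(f"P(G | long, no wow) = {float(posterior_good):.4f}, prediction = {prediction}")
```

The two scores come out as 0.196 for G and 0.2 for B, so the prediction is B by a small margin.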

c) Now it is further given that all 2000 good books that are long do not have "wow", while 750 of the 2500 bad books that are long do not have "wow". Hence there are 2750 long books without "wow" in total, of which 2000 are good.

Thus the probability of the new book being good, given that it is long and does not have "wow", is P(G | L = long and W = false) = 2000/2750 = 0.7273.

The additional information therefore leads to a different prediction: the book will be good. This is because in part b) we had no information on the relationship between the length of a book and the occurrence of "wow" in it, so the two features were treated as independent given the class, while in part c) we have their joint counts. With those counts it becomes clear that if a new book is long and does not have "wow", the probability of it being good is quite high (> 70%).
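
To see where the independence assumption breaks down, the joint counts from part c) can be compared with the products that Naive Bayes would use. This is a sketch with illustrative variable names; the class sizes and likelihoods are the part a) values.

```python
from fractions import Fraction

# Joint counts stated in part c) for long books without "wow".
good_long_no_wow = 2_000
bad_long_no_wow = 750

# Direct estimate of P(G | L = long, W = false) from the joint counts.
p_good_joint = Fraction(good_long_no_wow, good_long_no_wow + bad_long_no_wow)

# P(L = long, W = false | class): observed from the data versus the
# product P(L = long | class) * P(W = false | class) assumed by Naive Bayes.
observed = {
    "G": Fraction(good_long_no_wow, 2_500),
    "B": Fraction(bad_long_no_wow, 7_500),
}
assumed = {
    "G": Fraction(4, 5) * Fraction(49, 50),
    "B": Fraction(1, 3) * Fraction(4, 5),
}

print(f"P(G | long, no wow) from joint counts = {float(p_good_joint):.4f}")
for label in ("G", "B"):
    print(
        f"{label}: observed = {float(observed[label]):.4f}, "
        f"Naive Bayes product = {float(assumed[label]):.4f}"
    )
```

For good books the two values nearly agree (0.8 versus 0.784), but for bad books the observed joint probability is 0.1 while the Naive Bayes product is about 0.2667. That overestimate for the bad class is what tips the part b) prediction toward bad.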
