Question
https://github.com/sundeepblue/movie_rating_prediction
http://www.the-numbers.com
https://support.microsoft.com/en-us/help/214269/how-to-use-the-histogram-tool-in-excel
Instructions:
- The following list of questions is based on the material covered in the text book Chapters 1, 2, 3, 5 and 6. Please work through these problems and present your calculations and/or Excel output where required.
- You will need to create a Microsoft Word document, copy your Excel output in the form of tables into the Word document, or take a screenshot and then insert the image into the Word document, as well as present your written answers to the questions in the report.
- At the completion of the report, save it to a .pdf file and submit online to the Course Blackboard.
- For this assignment, we'll be using IMDB movie data. The dataset we're working from can be [...]

Question 1 (total 2.5 marks)
This is a subset of the original data, filtered to only look at movies produced in Australia or the USA. The original dataset was created using a web scraping library called scrapy, which selected 5000 movies by browsing http://www.the-numbers.com. Assume that the scraper simply selected the first 5000 movie titles it came across, and that our population of interest is all English-language feature-length films ever created.
a. What sort of sample is this? (0.5 Marks)
b. How may this have led to bias in our dataset? (1 Mark) (Hint: What might make a movie more likely to be easily found/popular on the website?)
c. Create a frequency table showing the year in which the movie came out. Does this reflect the biases that you noted in part b? (1 Mark) (Hint: use the Histogram tool in the Analysis ToolPak, and bins based on decades 1920, 1930, 1940 ... 2000, 2010.) A guide to how to use this can be found here https://support.microsoft.com/en-us/help/214269/how-to-use-the-histogram-tool

Explanation / Answer
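a) This is a convenience sample, a form of non-probability sampling. It is not a simple random sample.
The scraper took the first 5000 titles it came across, so not every movie in the population had the same chance of being selected. Which titles were picked depended on how the website orders and links its pages.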
b) This is likely to have biased the dataset toward recent and popular movies.
Because the scraper selected the first 5000 movies it came across on the site, it would mostly have picked movies that are easy to find, such as those currently attracting discussion on IMDB. It is fair to assume that recent movies attract more discussion than old ones. The histogram results in part c are consistent with this.
The bias comes from picking the FIRST 5000 movies encountered. A better approach would be to go to randomly chosen pages and pick the movies from those.
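The difference between the two approaches can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the catalogue is synthetic (100 titles per year, listed newest first), and the helper names `first_n` and `random_n` are hypothetical. It says nothing about how the-numbers.com actually orders its pages.

```python
import random
from statistics import mean

def first_n(items, n):
    """Convenience sample: take whatever comes first in the listing."""
    return items[:n]

def random_n(items, n, seed=0):
    """Simple random sample without replacement from the whole listing."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return rng.sample(items, n)

# Synthetic catalogue (assumption): release years only, 100 titles per year
# from 2016 back to 1916, listed newest first.
catalogue = [year for year in range(2016, 1915, -1) for _ in range(100)]

convenience = first_n(catalogue, 500)
randomized = random_n(catalogue, 500)

print("first 500, mean year: ", mean(convenience))
print("random 500, mean year:", round(mean(randomized), 1))
```

Under this assumed ordering, the first-500 sample contains only the five most recent years, while the random sample is spread across the full range.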
c) Make bins from 1916 (the earliest year in the data) to 2016 (the most recent year in the data) in increments of 10 years.
Create the histogram from Data > Data Analysis > Histogram (Analysis ToolPak). For the input range, select the title year data in X2:X5044. For the bin range, select the bin values created above (1916 to 2016).
The histogram is as follows.
The histogram shows that the sample is skewed toward recent movies, which is the bias noted in part b.
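For readers without Excel, the same kind of frequency table can be sketched in Python. This is a minimal illustration, not the method used in the answer: it uses the decade bins from the question's hint (1920 to 2010), follows the Histogram tool's convention of counting a value in the first bin that is greater than or equal to it, and puts anything above the last bin in a "More" row. The function name `decade_frequency` and the short `sample_years` list are made up for the example and are not taken from the dataset.

```python
from bisect import bisect_left
from collections import Counter

def decade_frequency(years, bins):
    """Count years per bin: a year falls in the first bin that is >= year.

    Years above the largest bin are counted under "More".
    """
    bins = sorted(bins)
    counts = Counter()
    for year in years:
        idx = bisect_left(bins, year)  # index of first bin >= year
        label = bins[idx] if idx < len(bins) else "More"
        counts[label] += 1
    return [(b, counts[b]) for b in bins] + [("More", counts["More"])]

# Decade bins from the question's hint: 1920, 1930, ..., 2010.
bins = list(range(1920, 2011, 10))

# Hypothetical title years, for illustration only.
sample_years = [1916, 1920, 1939, 1977, 1999, 2000, 2001, 2010, 2016]

for label, count in decade_frequency(sample_years, bins):
    print(label, count)
```

With the real data, `sample_years` would be replaced by the title year column, and a table whose counts grow sharply in the most recent bins would show the same skew as the histogram.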