


Question

Data set: https://harlanhappydog.github.io/STAT306/docs/moviegross.txt

# Read in the data set
mov <- read.table("moviegross.txt", header=T, skip=2)
names(mov)
# Variables are: year, movie, studio, openweekendgross, gross, ST
# ST is a shorter version of studio.
# openweekendgross and gross (the final gross) are in millions of U.S. dollars.
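# Side note (a sketch, not run here): read.table() also accepts a URL, so the
# file can be read directly without saving a local copy first (assuming an
# internet connection is available), e.g.
# mov <- read.table("https://harlanhappydog.github.io/STAT306/docs/moviegross.txt",
#                   header=TRUE, skip=2)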

# The goal is to relate the opening weekend gross to the "final" gross
# for the US market; another explanatory variable is the
# studio that produced the movie.
mov$lngross <- log(mov$gross)
mov$lnopen <- log(mov$openweekendgross)
attach(mov)
#attach "releases" all the columns of the data.frame into the working space
#You do not have to use mov$gross to call the "gross" column of "mov" data set.
#This can be dangerous when you have multiple data frames in the working space,
#especially when they have some columns of the same name.
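# A sketch of a safer alternative (same results, assuming the mov data frame above):
# pass data = mov to lm(), or wrap one-off expressions in with(), so that column
# names never shadow other objects in the workspace.
fit_alt <- lm(log(gross) ~ log(openweekendgross) + ST, data = mov)  # hypothetical name
with(mov, cor(openweekendgross, gross))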

print(cor(openweekendgross,gross)) # 0.827
print(cor(lnopen,lngross)) # 0.867

# Regression on the original scale
par(mfrow=c(2,4), mar=c(4, 4, 1, 1))
plot(openweekendgross, gross, type="n")
text(openweekendgross, gross, label=ST)
fit <- lm(gross ~ openweekendgross + ST)
print(summary(fit))

# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 71.5602 22.9819 3.114 0.00324 **
#openweekendgross 2.9113 0.2854 10.202 3.59e-13 ***
#STf -42.3190 24.3780 -1.736 0.08957 .
#STs -53.7543 27.4693 -1.957 0.05673 .
#STw -27.8388 24.9629 -1.115 0.27082
#Residual standard error: 63.26 on 44 degrees of freedom
#Multiple R-squared: 0.7149, Adjusted R-squared: 0.689

# Residual standard error computed by hand (matches summary(fit)$sigma)
residSE <- sqrt(sum(residuals(fit)^2)/df.residual(fit))
plot(openweekendgross, fit$resid)
plot(fit$fitted, fit$resid)
abline(h=2*residSE)
abline(h=-2*residSE)

qqnorm(fit$residuals)
qqline(fit$residuals, col="red")
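# Side note (a sketch, not part of the original analysis): R's built-in lm
# diagnostics produce similar residuals-vs-fitted and normal QQ plots; left
# commented out so it does not take up panels in the 2x4 layout above.
# plot(fit, which = 1:2)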

## Fit the model to the log-transformed data.
plot(lnopen, lngross, type="n")
text(lnopen, lngross, label=ST)
fit2 <- lm(lngross ~ lnopen + ST)
print(summary(fit2))

# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 1.933926 0.279079 6.930 1.45e-08 ***
#lnopen 0.874742 0.069323 12.618 3.23e-16 ***
#STf -0.278467 0.136243 -2.044 0.0470 *
#STs -0.330307 0.153144 -2.157 0.0365 *
#STw -0.009481 0.139467 -0.068 0.9461
#Residual standard error: 0.3531 on 44 degrees of freedom
#Multiple R-Squared: 0.7929, Adjusted R-squared: 0.7741

# The log-transformed model has a greater R-squared (0.793 vs 0.715).
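# A sketch for interpreting the coefficients on the log scale (assumes fit2 from above):
# exponentiating gives multiplicative effects on the original gross scale.
exp(coef(fit2))
# e.g. exp(-0.2785) is about 0.76, so a Fox release is associated with roughly 24%
# lower final gross than a Disney release with the same opening weekend gross;
# the lnopen coefficient (~0.87) acts like an elasticity: a 1% higher opening
# weekend gross goes with roughly a 0.87% higher final gross.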

# Residual standard error computed by hand (matches summary(fit2)$sigma)
residSE2 <- sqrt(sum(residuals(fit2)^2)/df.residual(fit2))
plot(lnopen, fit2$resid)
plot(fit2$fitted, fit2$resid, ylim=c(-1, 1))
abline(h=2*residSE2)
abline(h=-2*residSE2)

qqnorm(fit2$residuals)
qqline(fit2$residuals, col="red")

# Comparing the first column of plots: a linear model fits the log-transformed data better.
# In the second column, the variance is more homogeneous for the transformed data.
# In the third column, there are fewer outliers in the residual plots.
# In the fourth column, the deviation from the QQ line is less severe in the lower tail.

# Dummy variables
print(table(studio))
#studio
#20thcenturyfox disney sony warner
# 14 13 9 13
print(table(ST))
# d f s w
#13 14 9 13
# ST is an abbreviation of studio.
# Using short names makes the regression output easier to read;
# this only changes the labels, not the fitted model (see the quick check below).
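# A quick check (a sketch): refitting with the long studio names gives the same
# fitted model; only the coefficient labels (and possibly the baseline level) change.
fit_long <- lm(gross ~ openweekendgross + studio)   # hypothetical refit
all.equal(unname(fitted(fit_long)), unname(fitted(fit)))   # should be TRUE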

class(ST)
#ST is a "factor" in the data frame.
#Sometimes if your factor is coded in numbers, i.e. 1, 2, .., R will treat them as numbers.
#you can force them into factors by ST <- as.factor(ST)

levels(ST)
# disney is the "baseline" studio in the regression below,
# because d < f, s, w
# By default, R will order the factor levels alphabetically.
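# A sketch (hypothetical object, not used below): you can also set the level
# order yourself instead of relying on alphabetical order.
ST_reordered <- factor(ST, levels = c("w", "d", "f", "s"))
levels(ST_reordered)   # "w" would be the baseline in a regression using ST_reordered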

ST <- relevel(ST, ref = "s")   # same as relevel(ST, 3)
summary(lm(lngross ~ lnopen + ST))
# We can use the relevel() function to change the baseline to sony.
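# A quick sanity check (a sketch, assuming fit2 from above): changing the baseline
# only reparameterizes the model, so the fitted values are unchanged.
all.equal(fitted(lm(lngross ~ lnopen + ST)), fitted(fit2))   # should be TRUE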

M <- model.matrix(fit)
#Obtain the design matrix of our regression model

#Check the design matrix of the dummy variable.
head(mov, 10)
head(M, 10)
# Or you can compare the whole matrices, or check specific rows:
which(as.numeric(M[,3])!=0)
which(ST=="f")

#Compare the STf STs STw columns of model.matrix(fit) to ST in the mov data,
#what do you find?
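# A sketch of one way to answer (assumes M and mov from above): each dummy column
# of the design matrix is the 0/1 indicator of the corresponding studio, with the
# baseline level ("d") absorbed into the intercept.
all(M[, "STf"] == as.numeric(mov$ST == "f"))   # should be TRUE
all(M[, "STs"] == as.numeric(mov$ST == "s"))   # should be TRUE
all(M[, "STw"] == as.numeric(mov$ST == "w"))   # should be TRUE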

Let

fit2 <- lm(lngross ~ lnopen + ST)
summ <- summary(fit2)

and answer the following questions:

Question 3: The movie "Kung Fu Panda 3" was distributed by 20th Century Fox (ST = "f") and achieved an opening weekend gross of 75.7 million U.S. dollars (openweekendgross = 75.7). Use the regression model fitted with the log-transformed variables, namely fit2, to predict the final gross of "Kung Fu Panda 3".

Your prediction is: 5.4403
The 95% prediction interval for the predicted logarithm of the final gross is: Lower endpoint: 4.7305, Upper endpoint: 6.15
The standard error of the predicted logarithm of the final gross is: 0.3531248

Explanation / Answer

> fit2=lm(lngross~lnopen+ST)
> summary(fit2)

Call:
lm(formula = lngross ~ lnopen + ST)

Residuals:
Min 1Q Median 3Q Max
-0.63098 -0.21414 -0.03501 0.16224 0.87535

Coefficients:
Estimate Std. Error t value Pr(>|t|)   
(Intercept) 1.933926 0.279079 6.930 1.45e-08 ***
lnopen 0.874742 0.069323 12.618 3.23e-16 ***
STf -0.278467 0.136243 -2.044 0.0470 *  
STs -0.330307 0.153144 -2.157 0.0365 *  
STw -0.009481 0.139467 -0.068 0.9461   
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3531 on 44 degrees of freedom
Multiple R-squared: 0.7929, Adjusted R-squared: 0.7741
F-statistic: 42.12 on 4 and 44 DF, p-value: 1.665e-14

> newdata=data.frame(lnopen=log(75.7),ST="f")
> predict(fit2,newdata,interval='predict')
fit lwr upr
1 5.440274 4.697666 6.182881

Prediction: 5.440274
Lower endpoint: 4.697666
Upper endpoint: 6.182881
Standard error: 0.3531 (the residual standard error, summary(fit2)$sigma)
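
For reference, here is a sketch of how the prediction interval above can be reconstructed from the pieces that predict() returns with se.fit = TRUE (it assumes the fit2 and newdata objects defined above). The standard error used for a prediction interval combines the standard error of the fitted mean with the residual standard error:

pr <- predict(fit2, newdata, se.fit = TRUE)
sigma_hat <- pr$residual.scale                 # residual standard error, 0.3531
se_pred   <- sqrt(pr$se.fit^2 + sigma_hat^2)   # SE for predicting a new observation
tcrit     <- qt(0.975, df = pr$df)             # t quantile on 44 degrees of freedom
c(fit = as.numeric(pr$fit),
  lwr = as.numeric(pr$fit) - tcrit * se_pred,
  upr = as.numeric(pr$fit) + tcrit * se_pred)
# The lwr/upr values should match the interval = 'predict' output shown above.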