Exercise 2 (Chapter 4, #10 Revised): Weekly data set
Question
Exercise 2 (Chapter 4, #10 Revised)
This exercise involves the Weekly data set, which is part of the ISLR package. If you’ve already completed Exercise 1, the package is already installed, so simply load the ISLR library to continue.
The Weekly data set contains 1,089 weekly percentage returns of the S&P 500 stock index between 1990 and 2010.
• Year: The year in which the observation was recorded
• Lag1: Percentage return for previous week
• Lag2: Percentage return for 2 weeks previous
• Lag3: Percentage return for 3 weeks previous
• Lag4: Percentage return for 4 weeks previous
• Lag5: Percentage return for 5 weeks previous
• Volume: Volume of shares traded (average number of daily shares traded in billions)
• Today: Percentage return for this week
• Direction: A factor with levels Down and Up indicating whether the market had a negative or positive return in a given week
a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
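One possible starting point for part (a) is sketched below. It assumes the ISLR package is installed (the block is skipped otherwise); the variable name numeric_cols is illustrative and not part of the exercise.

```r
# Sketch for part (a): numerical and graphical summaries of Weekly.
# Assumes the ISLR package is installed; the block is skipped otherwise.
if (requireNamespace("ISLR", quietly = TRUE)) {
  data("Weekly", package = "ISLR")

  # Numerical summaries: ranges, quartiles, and the Down/Up counts.
  print(summary(Weekly))

  # Pairwise correlations among the numeric columns. Direction is a factor,
  # so it is excluded before calling cor().
  numeric_cols <- sapply(Weekly, is.numeric)
  print(round(cor(Weekly[, numeric_cols]), 3))

  # Graphical summaries: a scatterplot matrix and Volume plotted against Year.
  pairs(Weekly[, numeric_cols])
  plot(Weekly$Year, Weekly$Volume, xlab = "Year", ylab = "Volume")
}
```

When reading the output, it is worth checking how Volume changes with Year and whether any of the lag variables show a visible relationship with Today.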
Let’s divide the data into training and testing sets. We want to allocate 80% of the data into the training set called WeeklyTrain and reserve the rest of the data (20%) to test our model’s predictive accuracy.
> install.packages("caTools") # Only need to run once
> library(caTools)
> set.seed(88)
> split = sample.split(Weekly$Direction, SplitRatio = 0.80)
> WeeklyTrain = subset(Weekly, split == TRUE)
> WeeklyTest = subset(Weekly, split == FALSE)
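The test subset has to be selected with the comparison split == FALSE. Written with a single =, the expression is passed to subset() as a named argument and no rows are filtered out. The toy example below illustrates the difference; the data frame toy and the vector mask are made-up stand-ins for Weekly and the output of sample.split().

```r
# Toy data frame standing in for Weekly (hypothetical values).
toy <- data.frame(x = 1:10)

# Logical mask standing in for the result of sample.split():
# TRUE marks training rows, FALSE marks held-out rows.
mask <- rep(c(TRUE, FALSE), times = c(8, 2))

train    <- subset(toy, mask == TRUE)
test_ok  <- subset(toy, mask == FALSE)  # comparison: keeps only the held-out rows
test_bad <- subset(toy, mask = FALSE)   # named argument: no row filter is applied

# A correct split partitions the rows exactly.
stopifnot(nrow(train) + nrow(test_ok) == nrow(toy))

nrow(test_ok)   # 2
nrow(test_bad)  # 10, i.e. every row, so it overlaps the training set
```

The same check, nrow(WeeklyTrain) + nrow(WeeklyTest) == nrow(Weekly), is a quick way to confirm the real split.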
b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
c) Compute the predicted response of the model created in (b) using the WeeklyTest data set. Compute the confusion matrix and overall fraction of correct predictions (i.e., accuracy). Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
d) What happens to the model’s accuracy with respect to the WeeklyTest data set if the threshold value is set to 0.6?
Explanation / Answer
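The R snippet is as follows:
install.packages("caTools")  # Only need to run once
install.packages("ISLR")     # Only need to run once
library(caTools)
library(ISLR)
data(Weekly)  # loads the Weekly data frame; data() returns only the data set's name, so the result is not assigned
set.seed(88)
split = sample.split(Weekly$Direction, SplitRatio = 0.80)
WeeklyTrain = subset(Weekly, split == TRUE)
WeeklyTest = subset(Weekly, split == FALSE)  # comparison (==); a single = would not filter any rows
# fit the model and summarise the results
# note: Direction ~ . uses every other column, including Year and Today, not just the five lags and Volume
model <- glm(Direction ~ ., family = binomial(link = 'logit'), data = WeeklyTrain)
summary(model)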
The results are:
> summary(model)

Call:
glm(formula = Direction ~ ., family = binomial(link = "logit"),
    data = WeeklyTrain)

Deviance Residuals:
       Min          1Q      Median          3Q         Max
-1.131e-03  -2.000e-08   2.000e-08   2.000e-08   1.087e-03

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.026e+03  1.911e+06   0.002    0.999
Year        -1.526e+00  9.626e+02  -0.002    0.999
Lag1        -2.599e-01  1.467e+03   0.000    1.000
Lag2        -2.826e-01  2.978e+03   0.000    1.000
Lag3         3.132e+00  1.002e+03   0.003    0.998
Lag4         6.181e-01  1.730e+03   0.000    1.000
Lag5         9.026e-01  2.923e+03   0.000    1.000
Volume       1.064e+01  8.122e+03   0.001    0.999
Today        6.319e+02  1.669e+04   0.038    0.970

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.1966e+03  on 870  degrees of freedom
Residual deviance: 5.4576e-06  on 862  degrees of freedom
AIC: 18

Number of Fisher Scoring iterations: 25

As can be seen from the coefficients above, none of the p-values is less than 0.05, so no variable appears significant. This fit should not be taken at face value, though: Direction is determined by the sign of Today, and Today is among the predictors, so the two classes are perfectly separated. The near-zero residual deviance (5.4576e-06), the very large standard errors, and the 25 Fisher scoring iterations are all symptoms of this. Part (b) asks for only the five lag variables and Volume as predictors, so Year and Today should be dropped from the formula.
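A minimal sketch of the fit that part (b) describes, with only Lag1 to Lag5 and Volume on the right-hand side, might look like the following. It is fitted on WeeklyTrain, as in the answer above, which is an assumption; pass data = Weekly to follow the wording of (b) literally. The names in_train and lag_model are illustrative, and the block is skipped if ISLR or caTools is not installed.

```r
# Sketch: logistic regression of Direction on the five lags plus Volume.
# Assumes ISLR and caTools are installed; skipped otherwise.
if (requireNamespace("ISLR", quietly = TRUE) &&
    requireNamespace("caTools", quietly = TRUE)) {
  data("Weekly", package = "ISLR")

  # Same seed and split ratio as in the exercise.
  set.seed(88)
  in_train <- caTools::sample.split(Weekly$Direction, SplitRatio = 0.80)
  WeeklyTrain <- subset(Weekly, in_train == TRUE)
  WeeklyTest  <- subset(Weekly, in_train == FALSE)

  # Year and Today are deliberately left out of the formula.
  lag_model <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                   family = binomial(link = "logit"),
                   data = WeeklyTrain)
  print(summary(lag_model))
}
```

The Pr(>|z|) column of this summary, rather than the one from the Direction ~ . fit, is the one to read when answering which predictors appear statistically significant.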
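The confusion matrix is computed as follows:
library(caret)
prob <- predict(model, newdata = WeeklyTest, type = "response")  # predicted probability of "Up"
p <- factor(ifelse(prob > 0.5, "Up", "Down"), levels = levels(WeeklyTest$Direction))  # classify at a 0.5 threshold
confusionMatrix(p, WeeklyTest$Direction)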
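For parts (c) and (d), the confusion matrix and accuracy can also be computed in base R for any threshold. The helper confusion_at below is hypothetical (it is not part of caret or ISLR), and the probabilities and outcomes are made-up values used only to show the mechanics.

```r
# Hypothetical helper: classify as "Up" when the predicted probability exceeds
# `threshold`, then tabulate predictions against the observed classes.
confusion_at <- function(prob, actual, threshold = 0.5) {
  predicted <- factor(ifelse(prob > threshold, "Up", "Down"),
                      levels = c("Down", "Up"))
  tab <- table(Predicted = predicted, Actual = actual)
  # Correct predictions sit on the diagonal of the 2 x 2 table.
  list(table = tab, accuracy = sum(diag(tab)) / sum(tab))
}

# Toy probabilities and outcomes (made-up values, not taken from Weekly).
prob   <- c(0.45, 0.55, 0.58, 0.62, 0.70, 0.52)
actual <- factor(c("Down", "Down", "Up", "Up", "Up", "Down"),
                 levels = c("Down", "Up"))

acc_05 <- confusion_at(prob, actual, threshold = 0.5)$accuracy  # 4 of 6 correct
acc_06 <- confusion_at(prob, actual, threshold = 0.6)$accuracy  # 5 of 6 correct
print(confusion_at(prob, actual, threshold = 0.6)$table)
```

With a fitted model, prob would come from predict(model, newdata = WeeklyTest, type = "response") and actual would be WeeklyTest$Direction. Raising the threshold to 0.6 makes the classifier predict Up less often; whether accuracy rises or falls depends on the data, so the toy numbers here say nothing about the Weekly result. The off-diagonal cells show the two kinds of mistakes: weeks predicted Up that went Down, and weeks predicted Down that went Up.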