Question
The generated data set in this question is taken from Mantel (1970). The data are given in Table 7.3.
Table 7.3. Mantel's generated data

Case    Y    X1     X2     X3
1       5     1   1004    6.0
2       6   200    806    7.3
3       8   -50   1058   11.0
4       9   909    100   13.0
5      11   506    505   13.1
Interest centers on using variable selection to choose a subset of the predictors to model Y. The data were generated such that the full model
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + e$  (7.8)
is a valid model for the data.
1.) Identify the optimal model or models based on adjusted R² (R²adj), AIC and BIC from the approach based on all possible subsets.
2.) Identify the optimal model or models based on AIC and BIC from the approach based on forward selection.
3.) Carefully explain why different models are chosen in (1) and (2).
4.) Decide which model you would recommend. Give detailed reasons to support your choice.
https://cms.psu.edu/Spring2/201516SP/201516SPUP___RSTAT_462_001/_assoc/8C2A025D521B4EA39C4F607549F3BADD/Mantel.csv
Can someone help me with this problem using RStudio? Thank you.
Explanation / Answer
The R output is given first:
> mantel <- read.csv("Mantel.csv")
> head(mantel)
   Y  X1   X2   X3
1  5   1 1004  6.0
2  6 200  806  7.3
3  8 -50 1058 11.0
4  9 909  100 13.0
5 11 506  505 13.1
> mlm <- lm(Y~.,data=mantel)
> summary(mlm)

Call:
lm(formula = Y ~ ., data = mantel)

Residuals:
         1          2          3          4          5
 1.372e-14 -1.609e-14 -2.054e-15  2.142e-15  2.280e-15

Coefficients:
              Estimate Std. Error    t value Pr(>|t|)
(Intercept) -1.000e+03  1.501e-11 -6.660e+13 9.56e-15 ***
X1           1.000e+00  1.501e-14  6.661e+13 9.56e-15 ***
X2           1.000e+00  1.501e-14  6.664e+13 9.55e-15 ***
X3           4.108e-15  1.186e-14  3.460e-01    0.788
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.147e-14 on 1 degrees of freedom
Multiple R-squared:  1,  Adjusted R-squared:  1
F-statistic: 1.648e+28 on 3 and 1 DF,  p-value: 5.726e-15
> n <- nrow(mantel)
> null_mantel <- lm(Y~1,data=mantel)
> full_mantel <- lm(Y~.,data=mantel)
> step(null_mantel,scope=list(lower=null_mantel, upper=full_mantel),direction="forward")
Start:  AIC=9.59
Y ~ 1

       Df Sum of Sq     RSS     AIC
+ X3    1   20.6879  2.1121 -0.3087
+ X1    1    8.6112 14.1888  9.2151
+ X2    1    8.5064 14.2936  9.2519
<none>              22.8000  9.5866

Step:  AIC=-0.31
Y ~ X3

       Df Sum of Sq    RSS      AIC
<none>              2.1121 -0.30875
+ X2    1  0.066328 2.0458  1.53172
+ X1    1  0.064522 2.0476  1.53613

Call:
lm(formula = Y ~ X3, data = mantel)

Coefficients:
(Intercept)           X3
     0.7975       0.6947
> step(null_mantel,scope=list(lower=null_mantel, upper=full_mantel),direction="forward",k = log(n))
Start:  AIC=9.2
Y ~ 1

       Df Sum of Sq     RSS     AIC
+ X3    1   20.6879  2.1121 -1.0899
+ X1    1    8.6112 14.1888  8.4339
+ X2    1    8.5064 14.2936  8.4707
<none>              22.8000  9.1961

Step:  AIC=-1.09
Y ~ X3

       Df Sum of Sq    RSS      AIC
<none>              2.1121 -1.08987
+ X2    1  0.066328 2.0458  0.36003
+ X1    1  0.064522 2.0476  0.36444

Call:
lm(formula = Y ~ X3, data = mantel)

Coefficients:
(Intercept)           X3
     0.7975       0.6947
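The criterion values in the two step() traces can be checked by hand. For a linear model, step() relies on extractAIC(), which computes n * log(RSS / n) + k * edf, where edf is the number of estimated coefficients and k = 2 gives AIC while k = log(n) gives BIC. In the second run the column is still labelled "AIC" even though it holds BIC values. The sketch below plugs in the RSS values printed above; the helper name step_criterion is made up for this illustration.

```r
# Hypothetical helper: the criterion step() reports for an lm fit,
#   n * log(RSS / n) + k * edf
step_criterion <- function(rss, n, edf, k = 2) {
  n * log(rss / n) + k * edf
}

n <- 5  # five cases in Mantel's data

# RSS values copied from the step() output above
aic <- c(
  null  = step_criterion(22.8000, n, edf = 1),  # Y ~ 1
  X3    = step_criterion(2.1121,  n, edf = 2),  # Y ~ X3
  X3_X2 = step_criterion(2.0458,  n, edf = 3)   # Y ~ X3 + X2
)
bic <- c(
  null  = step_criterion(22.8000, n, edf = 1, k = log(n)),
  X3    = step_criterion(2.1121,  n, edf = 2, k = log(n)),
  X3_X2 = step_criterion(2.0458,  n, edf = 3, k = log(n))
)

# Compare with the AIC columns of the first and second step() runs
print(round(rbind(AIC = aic, BIC = bic), 4))
```

Adding X2 to Y ~ X3 lowers RSS only from 2.1121 to 2.0458, so the extra penalty of 2 (or log(5) for BIC) outweighs the gain and the search stops.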
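1) The full model has adjusted R² = 1, as expected, since the Mantel data set was designed that way. The fit is exact because Y = -1000 + X1 + X2 holds for all five cases, and the X3 coefficient is zero up to rounding error (4.1e-15, p = 0.788). The two-predictor model Y ~ X1 + X2 therefore also fits exactly, with adjusted R² = 1 and one fewer parameter. AIC and BIC cannot improve on an exact fit and their penalty terms favor the smaller model, so Y ~ X1 + X2 is the optimal subset under all three criteria.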
2) Under forward selection, both AIC and BIC select the model with X3 as the only regressor, Y ~ X3.
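3) The two approaches search differently. Forward selection adds one predictor at a time: X3 gives by far the largest single-variable drop in RSS (20.69, versus 8.61 for X1 and 8.51 for X2), and once X3 is in the model, adding X1 or X2 alone lowers RSS by only about 0.07, which does not offset the penalty, so the search stops at Y ~ X3. X1 and X2 are useful only jointly, and forward selection never evaluates that pair, whereas the all-possible-subsets approach does.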
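4) A perfect fit with adjusted R² = 1 would normally be suspect, and with four coefficients estimated from five observations the full model leaves only one residual degree of freedom, so it is overparameterized. Here, though, the exact fit comes from how the data were generated rather than from overfitting: X1 and X2 alone reproduce Y exactly, while the X3-only model picked by AIC and BIC under forward selection leaves RSS = 2.11. I would recommend Y ~ X1 + X2 over both the full model and the X3-only model.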
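Part 1 asks for all possible subsets, which the R output above does not enumerate. A minimal base-R sketch follows, assuming the data are typed in from Table 7.3 so that it runs without Mantel.csv. It uses extractAIC() so that the AIC and BIC values are on the same scale as the step() traces. The names subsets, results and get_row are illustrative only.

```r
# Mantel's data, typed in from Table 7.3
mantel <- data.frame(
  Y  = c(5, 6, 8, 9, 11),
  X1 = c(1, 200, -50, 909, 506),
  X2 = c(1004, 806, 1058, 100, 505),
  X3 = c(6, 7.3, 11, 13, 13.1)
)
n <- nrow(mantel)
predictors <- c("X1", "X2", "X3")
tss <- sum((mantel$Y - mean(mantel$Y))^2)  # total sum of squares

# All 7 non-empty subsets of the predictors, as a flat list
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit each subset and collect RSS, adjusted R-squared, AIC and BIC
results <- do.call(rbind, lapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "Y"), data = mantel)
  rss <- sum(resid(fit)^2)
  p   <- length(vars)
  data.frame(
    model = paste(vars, collapse = " + "),
    RSS   = rss,
    R2adj = 1 - (rss / (n - p - 1)) / (tss / (n - 1)),
    AIC   = extractAIC(fit)[2],
    BIC   = extractAIC(fit, k = log(n))[2]
  )
}))
print(results)

# Direct check of the exact relation behind the perfect fit
print(mantel$Y - (mantel$X1 + mantel$X2))  # -1000 for every case
```

Two models, X1 + X2 and X1 + X2 + X3, fit exactly, so their RSS values are rounding noise and their AIC and BIC values are very large negative numbers whose relative order is not numerically meaningful. Between two exact fits, the one with fewer parameters is the natural choice.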