


Question

prostate data: https://psu.instructure.com/courses/1816996/files/81708589/download?wrap=1


Choosing the ridge regression tuning parameter

Download the "prostate.RData" file from Canvas and load it into R using the "load" command. Each row of this data set corresponds to clinical measurements of a patient. The response variable is the last column, labeled "lpsa", which measures the log of the amount of a protein that is produced by the prostate and used to be thought to correspond to cancer risk. You will build a regression model that linearly relates the first six columns of the dataset to lpsa. You are going to use ridge regression, which finds the $\hat{\beta}^{\mathrm{ridge}}$ that minimizes

$$\sum_{i=1}^{n}\left(y_i - x_i^{\top}\beta\right)^{2} + \lambda\sum_{j}\beta_j^{2}$$

for some tuning parameter $\lambda \ge 0$. Your job is to pick a good value of $\lambda$.

One popular way to choose $\lambda$ is to pick the value that minimizes prediction error, which can be estimated using k-fold cross-validation. Here is how cross-validation works. First, randomly divide the dataset into k equally sized (as close to equal as possible, anyway) partitions, called "folds." Then, for each fold, predict the response in that fold using the $\hat{\beta}^{\mathrm{ridge}}$ estimated from the other k − 1 folds. Next, calculate the differences between the actual responses in the k folds and the responses predicted by the k values of $\hat{\beta}^{\mathrm{ridge}}$. Finally, choose the $\lambda$ that minimizes this error.
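In symbols (this formalization is my own addition, with $F_j$ denoting the observations in the $j$-th fold), the criterion described above is

$$\widehat{\mathrm{Err}}_{\mathrm{CV}}(\lambda) = \frac{1}{n}\sum_{j=1}^{k}\sum_{i \in F_j}\left(y_i - x_i^{\top}\hat{\beta}^{\mathrm{ridge}}_{-F_j}(\lambda)\right)^{2}, \qquad \hat{\lambda} = \arg\min_{\lambda \ge 0}\widehat{\mathrm{Err}}_{\mathrm{CV}}(\lambda),$$

where $\hat{\beta}^{\mathrm{ridge}}_{-F_j}(\lambda)$ is the ridge estimate computed with fold $F_j$ held out.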

Explanation / Answer

Here is the R code for the ridge regression analysis:

> prostate=read.table("D:/Msc 2nd sem egression/ST 12(P)/prac 1/prostate.txt",header=TRUE,sep=" ")   # use forward slashes (or doubled backslashes) in Windows paths
> head(prostate)
lcavol lweight age lbph svi lcp gleason pgg45 lpsa
1 -0.5798185 2.7695 50 -1.386294 0 -1.38629 6 0 -0.43078
2 -0.9942523 3.3196 58 -1.386294 0 -1.38629 6 0 -0.16252
3 -0.5108256 2.6912 74 -1.386294 0 -1.38629 7 20 -0.16252
4 -1.2039728 3.2828 58 -1.386294 0 -1.38629 6 0 -0.16252
5 0.7514161 3.4324 62 -1.386294 0 -1.38629 6 0 0.37156
6 -1.0498221 3.2288 50 -1.386294 0 -1.38629 6 0 0.76547
> class(prostate)
[1] "data.frame"
> n=nrow(prostate)
> n
[1] 97
> y=prostate[,9]
> head(y)
[1] -0.43078 -0.16252 -0.16252 -0.16252 0.37156 0.76547
> x1=prostate[,1]
> x2=prostate[,2]
> x3=prostate[,3]
> x4=prostate[,4]
> x5=prostate[,5]
> x6=prostate[,6]
> x7=prostate[,7]
> x8=prostate[,8]
> model=lm(y~x1+x2+x3+x4+x5+x6+x7+x8,data=prostate)
> shapiro.test(model$residuals)

Shapiro-Wilk normality test

data: model$residuals
W = 0.99113, p-value = 0.7721

The Shapiro-Wilk test for normality gives a p-value > 0.05, so we fail to reject the null hypothesis and conclude that the residuals are approximately normally distributed.

> round(model$coefficients,3)

(Intercept) x1 x2 x3 x4 x5
0.669 0.587 0.454 -0.020 0.107 0.766
x6 x7 x8
-0.105 0.045 0.005


> n=dim(prostate)[1]
> n
[1] 97

Standardize the response and the regressors to unit length and fit the model using these standardized quantities.
> stdD=scale(prostate,center=TRUE)/sqrt(n-1)
> stdD=as.data.frame(stdD)
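A quick check (this line is my addition, not part of the original run) that the scaling above gives unit-length columns:
> round(colSums(as.matrix(stdD)^2),6)   # every column should have sum of squares equal to 1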
> model=lm(y~x1+x2+x3+x4+x5+x6+x7+x8,data=stdD)
> model$coefficients
(Intercept) x1 x2 x3 x4 x5
0.669336698 0.587021826 0.454467424 -0.019637176 0.107054031 0.766157326
x6 x7 x8
-0.105474263 0.045141598 0.004525231

To check for the presence of multicollinearity in the data:
> stdX1=model.matrix(model)

1) Check the determinant of X'X (a value near 0 would indicate multicollinearity):
> det(t(stdX1)%*%stdX1)
[1] 1.550296e+20
> eval=eigen(t(stdX1)%*%stdX1)$values
> eval
[1] 4.791752e+05 6.191026e+04 2.110912e+02 1.756571e+02 6.546074e+01
[6] 4.456108e+01 2.025158e+01 8.094119e+00 2.947521e-01
> max(eval)/eval
[1] 1.000000e+00 7.739835e+00 2.269992e+03 2.727902e+03 7.320039e+03
[6] 1.075322e+04 2.366113e+04 5.920041e+04 1.625689e+06
> diag(solve(t(stdX1)%*%stdX1))
(Intercept) x1 x2 x3 x4 x5
3.348833e+00 1.540289e-02 5.759506e-02 2.487380e-04 6.807399e-03 1.189333e-01
x6 x7 x8
1.650575e-02 4.940709e-02 3.894933e-05
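As an additional multicollinearity diagnostic (these two lines are my addition, not part of the original transcript), the variance inflation factors can be obtained from the inverse of the correlation matrix of the regressors:
> X=as.matrix(prostate[,1:8])             # the eight regressors
> round(diag(solve(cor(X))),2)            # VIF for each regressor = diagonal of R^{-1}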
> library(MASS)
> modelr=lm.ridge(y~x1+x2+x3+x4+x5+x6+x7+x8,data=stdD)
> select(modelr)
modified HKB estimator is 4.256152
modified L-W estimator is 3.487311
smallest value of GCV at 0
> HKB=modelr$kHKB
> length1=seq(0,1.5,by=0.005)
> modelq=lm.ridge(y~x1+x2+x3+x4+x5+x6+x7+x8,data=stdD,lambda = length1)
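Since modelq is fitted over the whole grid of lambda values, the lambda minimizing generalized cross-validation (GCV) can also be read off directly (this line is my addition):
> modelq$lambda[which.min(modelq$GCV)]    # lambda with the smallest GCV score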
> modelw=list()
> for(i in 1:length(length1))
+ {
+ mod=lm.ridge(y~x1+x2+x3+x4+x5+x6+x7+x8,data=stdD,lambda = length1[i])
+ modelw[[i]]=coef(mod)
+ }
> modelw
[[1]]
x1 x2 x3 x4 x5
0.639785867 0.577025769 0.453430195 -0.019054230 0.105511244 0.755882011
x6 x7 x8
-0.095846251 0.047795017 0.004352673

[[2]]
x1 x2 x3 x4 x5
0.636192856 0.575779119 0.453289513 -0.018981417 0.105318192 0.754606672
x6 x7 x8
-0.094659695 0.048114299 0.004331835

[[3]]
x1 x2 x3 x4 x5
0.63264488 0.57454111 0.45314728 -0.01890909 0.10512634 0.75334153
x6 x7 x8
-0.09348452 0.04842886 0.00431129

[[4]]
x1 x2 x3 x4 x5
0.629141157 0.573311639 0.453003529 -0.018837229 0.104935665 0.752086466
x6 x7 x8
-0.092320565 0.048738808 0.004291034

[[5]]
x1 x2 x3 x4 x5
0.625680945 0.572090595 0.452858282 -0.018765843 0.104746163 0.750841334
x6 x7 x8
-0.091167681 0.049044221 0.004271062

[[6]]
x1 x2 x3 x4 x5
0.622263506 0.570877879 0.452711569 -0.018694920 0.104557818 0.749606009
x6 x7 x8
-0.090025717 0.049345195 0.004251368

[[7]]
x1 x2 x3 x4 x5
0.618888118 0.569673387 0.452563416 -0.018624458 0.104370618 0.748380364
x6 x7 x8
-0.088894524 0.049641816 0.004231947

[[8]]
x1 x2 x3 x4 x5
0.615554078 0.568477020 0.452413849 -0.018554449 0.104184550 0.747164274
x6 x7 x8
-0.087773958 0.049934170 0.004212796
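Rather than inspecting the list element by element, the ridge trace can be plotted with the plot method for "ridgelm" objects (these two lines are my addition):
> plot(modelq)                            # coefficient paths as a function of lambda
> abline(v=HKB,lty=2)                     # mark the HKB choice of lambda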

> modelr1=lm.ridge(y~x1+x2+x3+x4+x5+x6+x7+x8,data=stdD,lambda=HKB)
> coef(modelr1)
x1 x2 x3 x4 x5
0.537056968 0.538132230 0.447782213 -0.016772215 0.099421692 0.716752515
x6 x7 x8
-0.060380349 0.056638306 0.003772375
> coef(model)
(Intercept) x1 x2 x3 x4 x5
0.669336698 0.587021826 0.454467424 -0.019637176 0.107054031 0.766157326
x6 x7 x8
-0.105474263 0.045141598 0.004525231
> I=diag(1,9)
> dim(I)
[1] 9 9
> C=(t(stdX1)%*%(stdX1))+(HKB*I)   # X'X + k*I at the HKB ridge constant
> V=diag(solve(C)%*%t(stdX1)%*%(stdX1)%*%solve(C))   # variances of the ridge coefficients (up to sigma^2)
> sqrt(V)
(Intercept) x1 x2 x3 x4 x5
0.119294206 0.112115575 0.174773575 0.013777800 0.073635757 0.226356807
x6 x7 x8
0.112735405 0.127459680 0.005214907
> sum(V)+(HKB*HKB*t(coef(model))%*%solve(C%*%C)%*%coef(model))   # total variance plus squared bias of the ridge estimator
[,1]
[1,] 0.559632
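
The question actually asks for lambda to be chosen by k-fold cross-validation rather than by the HKB/GCV criteria used above. Below is a minimal sketch of that procedure with lm.ridge, using all eight regressors as in the fits above; the number of folds, the random seed, and the object names (folds, lambdas, cv.err) are my own choices and not part of the original transcript.

set.seed(1)                                   # reproducible fold assignment (arbitrary seed)
k=10                                          # number of folds
folds=sample(rep(1:k,length.out=nrow(prostate)))   # randomly assign each row to a fold
lambdas=seq(0,1.5,by=0.005)                   # same grid of tuning parameters as above
cv.err=numeric(length(lambdas))
for(l in 1:length(lambdas))
{
  sq.err=c()
  for(j in 1:k)
  {
    train=prostate[folds!=j,]                 # fit on the other k-1 folds
    test=prostate[folds==j,]                  # predict the held-out fold
    fit=lm.ridge(lpsa~.,data=train,lambda=lambdas[l])
    b=coef(fit)                               # intercept first, then the eight slopes
    Xte=as.matrix(test[,setdiff(names(test),"lpsa")])
    pred=drop(cbind(1,Xte)%*%b)               # lm.ridge has no predict method, so predict by hand
    sq.err=c(sq.err,(test$lpsa-pred)^2)
  }
  cv.err[l]=mean(sq.err)                      # k-fold CV estimate of prediction error
}
lambdas[which.min(cv.err)]                    # the lambda that minimizes the estimated prediction error

In practice the cv.glmnet function from the glmnet package automates this search, though its lambda is parameterized differently from lm.ridge's.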