Question

***Software to be used- R***

A medical center is interested in modeling prostate-specific antigen (PSA) and a number of prognostic clinical measurements in men with advanced prostate cancer. Data were collected on 97 men who were about to undergo radical prostatectomies.

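Data (columns in order: ID number, PSA level, cancer volume, weight, age, benign hyperplasia, seminal vesicle invasion, capsular penetration, Gleason score)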

   1     0.651    0.5599    15.959    50    0.0000    0    0.0000    6

   2     0.852    0.3716    27.660    58    0.0000    0    0.0000    7

   3     0.852    0.6005    14.732    74    0.0000    0    0.0000    7

   4     0.852    0.3012    26.576    58    0.0000    0    0.0000    6

   5     1.448    2.1170    30.877    62    0.0000    0    0.0000    6

   6     2.160    0.3499    25.280    50    0.0000    0    0.0000    6

   7     2.160    2.0959    32.137    64    1.8589    0    0.0000    6

   8     2.340    1.9937    34.467    58    4.6646    0    0.0000    6

   9     2.858    0.4584    34.467    47    0.0000   0    0.0000    7

10     2.858    1.2461    25.534    63    0.0000    0    0.0000    6

11     3.561    1.2840    36.598    65    0.0000    0    0.0000    6

12     3.561    0.2592    36.598    63    3.5609    0    0.0000    6

13     3.561    5.0028    20.491    63    0.0000    0    0.5488    7

14     3.857    4.3929    20.086    67    0.0000    0    0.0000    7

15     4.055    3.3535    31.187    57    0.0000    0    0.6505    7

16     4.263    4.6646    21.328    66    0.0000    0    0.0000   6

17     4.349    0.6570    33.784    70    3.4556    0    0.5488    7

18     4.437    9.8749    38.475    66    0.0000    0    1.4477    6

19     4.759    0.5712    26.311    41    0.0000    0    0.0000    6

20     4.953    1.1972    46.063    70    5.2593    0    0.0000    7

21     5.155    3.1582    30.569    59    0.0000    0    0.0000    6

22     5.259    7.8460    33.115    60    4.3492    0    3.8574    7

23     5.474    0.5827    29.371    59    0.4493    0    0.0000    6

24     5.529    5.9299    31.500    63    1.5527    0    3.2544    7

25     5.641    1.4770    39.252    69    4.9530    0    0.0000    6

26     5.871    4.2631    22.646    68    1.3499    0    0.0000    6

27     6.050    1.6653    41.264    65    0.0000    0    0.4493    7

28     6.172    0.6703    47.942    67    6.1719    0    0.0000    7

29     6.360    2.8292    22.874    67    1.2461    0    1.0513    7

30     6.619   11.1340    29.371    65    0.0000    0    5.0531    6

31     6.821    1.3364   59.740    65    7.0993    0    0.4493    6

32     7.463    1.1972   450.339    65    5.4739    0    0.0000    6

33     7.463    3.5966    20.905    71    3.5609    0    0.0000    6

34     7.538    1.0101    26.311    54    0.0000    0    0.0000    6

35     7.768    0.9900    25.028    63    0.0000    0    0.4493    6

36     8.085    3.7062    61.559    64    8.7583    0    0.0000    7

37     8.671    4.1371    38.861    73    0.5599    0    5.2593    8

38     8.935    1.5841    10.697    64   0.0000    0    0.0000    7

39     9.116   14.2963    59.740    68    3.9354    1    6.2339    7

40     9.777    2.2255    20.287    56    2.5600    0    0.8521    7

41     9.974    1.8589    23.104    60    0.0000    0    0.0000    8

42    10.074    4.2207    39.646    68    0.0000    0    0.0000    7

43    10.278    1.7860    47.942    62    5.5290    0    0.6505    6

44    10.697    5.8709    49.402    61    0.0000    0    2.2479    7

45    12.429    4.4371    30.265    66    5.7546    0   0.6505    7

46    12.807    5.2593    29.666    61    1.8589    0    0.0000    7

47    13.066   15.3329    54.598    79    6.5535    1   14.2963    8

48    13.066    3.1899    56.826    68    5.5290    0    0.6505    7

49    13.330    5.7546    33.115    43    0.0000    0    0.0000    6

50    13.330    3.3872    35.517    70    3.9354    0    0.4493    6

51    14.296    2.9743    54.055    68    0.0000    0    0.0000    7

52    14.585    5.2593    68.717    64    7.9248    0    0.0000    6

53    14.585    1.6653    37.713    64    4.4371    0    1.0513    7

54    14.732    8.4149    61.559    68    5.8709    0    4.2631    7

55    14.880   23.3361    33.784    59    0.0000    0    0.0000    8

56    15.180    3.5609    72.240    66   8.3311    0    0.0000    7

57    16.281    2.6379    17.637    47    0.0000    0    1.6487    7

58    16.281    1.5841    42.948    49    4.1371    0    0.0000    6

59    16.610    1.7160    65.366    70    1.5527    0    0.0000    8

60    16.610    2.8864    46.993    61    3.6328    0    0.0000    7

61    17.116    1.5841    91.836    73   10.2779    0    0.0000    6

62    17.288    7.3891    41.264    63    5.0531    1    6.7531    7

63    17.288   16.1190    33.784    72    0.0000    0   4.7588    8

64    17.814    7.6141    50.400    66    7.4633    1    8.2482    7

65    17.814    7.9248    37.338    64    0.0000    0    0.0000    6

66    17.993    4.3060    46.525    61    3.7434    0    0.6505    7

67    18.541    7.5383    48.424    68    5.9299    0    3.7434    7

68    19.298    9.0250    57.397    72   10.0744    0    0.6505    7

69    19.298    0.6376    82.269    69    0.0000    0    0.0000    6

70    19.492    3.2871   119.104    72   10.2779    0    0.4493    7

71    20.287    6.4237    36.234    60    0.0000    1    3.7434    7

72    20.905    3.1899    28.219    77    5.7546    0    0.0000    7

73    21.328    3.3535    46.063    69    0.0000    1    1.2461    7

74    21.758    6.2965    25.534    60    1.5527    1    3.2544    8

75    26.576   20.0855    46.993    69    0.0000    1    6.7531    8

76    28.219   23.1039    26.050    68    0.9512    1   11.2459    6

77    29.666    7.4633    83.931    72    8.3311    0    1.6487    8

78    31.187   12.6797    77.478    78   10.2779    0    0.0000    8

79    31.817   14.1540    35.874    69    0.0000    1   13.1971    7

80    33.448   16.1190    45.604    63    0.0000    0    1.4477    8

81    33.784    4.3492    21.542    66    1.7507    0    1.2461    7

82    34.124   12.3049    32.137    57    1.5527    0   10.2779    7

83    35.517   13.5991    48.911    77    0.5886    1    1.7507    7

84    35.517   14.5851    46.525    65    3.0649    0    5.7546    8

85    36.234    4.7588    40.854    60    5.4739    0    2.2479    8

86    37.713   27.1126    33.784    64    0.0000    1   10.2779    8

87    39.646    7.5383    41.679    58    5.1552    0    0.0000    6

88    40.854    5.6407    29.079    62    0.0000    1    1.3499    7

89    53.517   16.6099   112.168    65    0.0000    1   11.7048    8

90    54.055    4.7588    40.447    76    2.5600    1    2.2479    8

91    56.261   25.7903    60.340    68    0.0000    0    0.0000    6

92    62.178   12.5535    39.646    61    3.8574    1    0.0000    7

93    80.640   16.9455    48.424    68    0.0000    1    3.7434    8

94   107.770   45.6042    49.402    44    0.0000    1    8.7583    8

95   170.716   18.3568    29.964    52    0.0000    1   11.7048    8

96   239.847   17.8143    43.380    68    4.7588    1    4.7588    8

97   265.072   32.1367    52.985    68    1.5527    1   18.1741    8

Each line of the data set has an identification number and provides information on 8 other variables.

Develop a “best” model for predicting PSA and interpret it. In addition, create a 90% prediction interval for PSA levels for an individual who has the following values.

Variable Number  Variable Name             Description
1                ID number                 1-97
2                PSA level                 Serum prostate-specific antigen level (mg/ml)
3                Cancer volume             Estimate of prostate cancer volume (cc)
4                Weight                    Prostate weight (grams)
5                Age                       Age of patient (years)
6                Benign hyperplasia        Amount of benign prostatic hyperplasia (cm^2)
7                Seminal Vesicle invasion  Presence of seminal vesicle invasion: 1 yes; 0 otherwise
8                Capsular penetration      Degree of capsular penetration (cm)
9                Gleason score             Pathologically determined grade of disease (scores were either 6, 7, or 8, with higher scores indicating worse prognosis)

Explanation / Answer

I am using R software to solve this problem.

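First, I copied the data into a CSV file with a header row naming the nine columns (IDNum, PSALevel, CancerVolume, Weight, Age, BenignHyperPlasia, SeminalVesicleInvasion, CapsularPenetration, GleasonScore). The data can then be loaded into R with the read.csv function:

InputData <- read.csv("Data1.txt", header = TRUE)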

# Check the dimensions (rows, columns)

dim(InputData)

[1] 97  9
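
If the data are kept in the whitespace-separated layout shown in the question rather than converted to CSV, read.table could be used instead. The following is a minimal sketch under that assumption, using only the first five rows of the listing; the object names raw, cols and Sample are illustrative, and the column names are the ones used throughout this answer.

```r
# First five rows of the listing, whitespace-separated, no header row
raw <- "
1  0.651  0.5599  15.959  50  0.0000  0  0.0000  6
2  0.852  0.3716  27.660  58  0.0000  0  0.0000  7
3  0.852  0.6005  14.732  74  0.0000  0  0.0000  7
4  0.852  0.3012  26.576  58  0.0000  0  0.0000  6
5  1.448  2.1170  30.877  62  0.0000  0  0.0000  6
"

# Column names follow the variable list in the question
cols <- c("IDNum", "PSALevel", "CancerVolume", "Weight", "Age",
          "BenignHyperPlasia", "SeminalVesicleInvasion",
          "CapsularPenetration", "GleasonScore")

# text = raw stands in for a file path; blank lines are skipped by default
Sample <- read.table(text = raw, header = FALSE, col.names = cols)
dim(Sample)
str(Sample)
```

For the full data set, text = raw would be replaced by the path to the file.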

#Convert SeminalVesicleInvasion and GleasonScore to factors
InputData$SeminalVesicleInvasion <- as.factor(InputData$SeminalVesicleInvasion)
InputData$GleasonScore <- as.factor(InputData$GleasonScore)
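
Converting these columns to factors makes lm estimate a separate shift for each level relative to a reference level, rather than a single slope. As a small illustration of the dummy coding R builds, the sketch below uses the Gleason scores of rows 1 to 4 and row 37 of the data; the data frame name Toy is illustrative.

```r
# Gleason scores of rows 1-4 and row 37 of the data
Toy <- data.frame(GleasonScore = as.factor(c(6, 7, 7, 6, 8)))

# Design matrix that lm would build for this factor:
# level 6 is absorbed into the intercept (reference level),
# levels 7 and 8 each get a 0/1 indicator column
X <- model.matrix(~ GleasonScore, data = Toy)
print(X)
```

This is why the model summaries report GleasonScore7 and GleasonScore8 but no GleasonScore6 row.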

#Fit a linear model with all the variables using lm function

#Excluding IDNum as it is just a unique identifier

fit <- lm(PSALevel ~ . - IDNum, data = InputData)
summary(fit)

Call:
lm(formula = PSALevel ~ . - IDNum, data = InputData)

Residuals:
    Min      1Q  Median      3Q     Max
-68.153  -7.323  -0.177   6.403 161.547

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)             31.849265  28.958981   1.100  0.27442
CancerVolume             1.748107   0.615858   2.838  0.00563 **
Weight                  -0.004546   0.074038  -0.061  0.95118
Age                     -0.537278   0.471991  -1.138  0.25808
BenignHyperPlasia        1.530782   1.201007   1.275  0.20581
SeminalVesicleInvasion1 21.108723  10.844893   1.946  0.05479 .
CapsularPenetration      1.097882   1.322879   0.830  0.40883
GleasonScore7           -1.661862   7.570741  -0.220  0.82676
GleasonScore8           18.423157  10.661795   1.728  0.08750 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 30.91 on 88 degrees of freedom
Multiple R-squared: 0.4733,   Adjusted R-squared: 0.4254
F-statistic: 9.886 on 8 and 88 DF, p-value: 1.037e-09
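
To see where the t value and Pr(>|t|) columns come from, the CancerVolume row can be reproduced by hand. This is a sketch using the rounded figures printed in the summary, so agreement is only up to rounding; the variable names are illustrative.

```r
# Values copied from the CancerVolume row of the summary above
estimate  <- 1.748107
std_error <- 0.615858
df_resid  <- 88          # residual degrees of freedom

# t statistic: estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution with 88 degrees of freedom
p_value <- 2 * pt(-abs(t_value), df = df_resid)

round(t_value, 3)
signif(p_value, 3)
```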

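In the full model, CancerVolume is significant at the 1% level (p = 0.0056), while SeminalVesicleInvasion (p = 0.055) and the GleasonScore8 level (p = 0.088) are significant only at the 10% level. None of the remaining predictors is significant, so let's refit the model with only these three variables.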

fit <- lm(PSALevel ~ CancerVolume + SeminalVesicleInvasion + GleasonScore, data = InputData)
summary(fit)

Call:
lm(formula = PSALevel ~ CancerVolume + SeminalVesicleInvasion +
    GleasonScore, data = InputData)

Residuals:
    Min      1Q  Median      3Q     Max
-59.879  -6.706   0.501   4.983 162.012

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)               1.4890     5.7769   0.258 0.797169
CancerVolume              1.9706     0.5485   3.592 0.000528 ***
SeminalVesicleInvasion1  23.1265     9.5612   2.419 0.017541 *
GleasonScore7            -1.0806     7.2794  -0.148 0.882317
GleasonScore8            18.1149    10.3085   1.757 0.082198 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 30.74 on 92 degrees of freedom
Multiple R-squared: 0.4557,   Adjusted R-squared: 0.432
F-statistic: 19.26 on 4 and 92 DF, p-value: 1.551e-11

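CancerVolume (p = 0.0005) and SeminalVesicleInvasion (p = 0.018) are now significant at the 5% level, and the GleasonScore8 level is significant at the 10% level (p = 0.082). The p-value of the overall F-statistic (1.551e-11) is very small, so the model fits much better than an intercept-only (null) model. Adjusted R-squared also rises slightly, from 0.4254 to 0.432, even though four predictors were dropped.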
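
As a rough check on dropping Weight, Age, BenignHyperPlasia and CapsularPenetration together, a partial F-test can be approximated from the two printed R-squared values. This is only a sketch: it works from rounded summary figures, and the variable names are illustrative.

```r
# Values copied from the two summaries above
n       <- 97
r2_full <- 0.4733; df_full <- 88   # full model residual df
r2_red  <- 0.4557; df_red  <- 92   # reduced model residual df

# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / residual df
adj_full <- 1 - (1 - r2_full) * (n - 1) / df_full
adj_red  <- 1 - (1 - r2_red)  * (n - 1) / df_red

# Partial F-test for the coefficients dropped from the full model
q      <- df_red - df_full   # number of dropped coefficients (4)
F_stat <- ((r2_full - r2_red) / q) / ((1 - r2_full) / df_full)
p_val  <- pf(F_stat, q, df_full, lower.tail = FALSE)

round(c(adj_full = adj_full, adj_red = adj_red, F = F_stat, p = p_val), 4)
```

A large p-value here would indicate that the four dropped predictors add little once the other three are in the model. With both fitted models stored under different names, anova() on the two objects would give the exact version of this test.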

We can check for multicollinearity using the vif function from the car package:

library(car)
vif(fit)

                           GVIF Df GVIF^(1/(2*Df))
CancerVolume           1.899061  1        1.378064
SeminalVesicleInvasion 1.592206  1        1.261826
GleasonScore           1.551470  2        1.116056

All GVIF values are below 2, well under the common rule-of-thumb thresholds of 5 or 10, so there is no sign of problematic multicollinearity.
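
The last column of the vif output rescales GVIF so that terms with different degrees of freedom are comparable. A short sketch reproducing that column from the printed GVIF and Df values (the variable names are illustrative):

```r
# Values copied from the vif() output above
gvif <- c(CancerVolume = 1.899061,
          SeminalVesicleInvasion = 1.592206,
          GleasonScore = 1.551470)
dfs  <- c(1, 1, 2)   # GleasonScore has 3 levels, hence 2 dummy columns

# GVIF^(1/(2*Df)): for a 1-df term this is the square root of the usual VIF
adj <- gvif^(1 / (2 * dfs))
round(adj, 6)
```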

The coefficient of CancerVolume is 1.9706. Holding the other predictors fixed, each 1 cc increase in prostate cancer volume is associated with an average increase of 1.9706 mg/ml in PSA level.

The coefficient of SeminalVesicleInvasion1 is 23.1265. Holding the other predictors fixed, PSA level is on average 23.1265 mg/ml higher when seminal vesicle invasion is present than when it is absent.

The coefficient of GleasonScore8 is 18.1149. Holding the other predictors fixed, PSA level is on average 18.1149 mg/ml higher for a Gleason score of 8 than for a score of 6 (the reference level). The GleasonScore7 coefficient (-1.0806, p = 0.88) is not significantly different from zero, so scores of 6 and 7 are not distinguishable in this model.
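
These interpretations can be made concrete by writing the fitted equation out by hand. The sketch below uses the rounded coefficients from the reduced-model summary; fitted_psa is a hypothetical helper, not part of the original analysis, and the two patients are hypothetical as well.

```r
# Coefficients copied from the reduced-model summary above
b0    <- 1.4890
b_vol <- 1.9706
b_svi <- 23.1265
b_gl  <- c("6" = 0, "7" = -1.0806, "8" = 18.1149)   # 6 is the reference level

# Hypothetical helper: fitted PSA for given predictor values
fitted_psa <- function(volume, svi, gleason) {
  b0 + b_vol * volume + b_svi * svi + b_gl[[as.character(gleason)]]
}

# Two hypothetical patients differing only in seminal vesicle invasion
without_svi <- fitted_psa(volume = 10, svi = 0, gleason = 8)
with_svi    <- fitted_psa(volume = 10, svi = 1, gleason = 8)
with_svi - without_svi   # equals the SeminalVesicleInvasion1 coefficient
```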

To do the prediction for a new individual we can create a dataframe as below:

# Build the new observation. The factors are created with the training
# levels so that each value is matched by label, not by position.
NewData <- data.frame(CancerVolume = 4.2633, Weight = 22.783, Age = 68,
                      BenignHyperPlasia = 1.35, CapsularPenetration = 0,
                      SeminalVesicleInvasion = factor(0, levels = levels(InputData$SeminalVesicleInvasion)),
                      GleasonScore = factor(6, levels = levels(InputData$GleasonScore)))
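
A note on aligning factor levels for new data: assigning with levels(x) <- ... relabels the existing levels by position, whereas factor(x, levels = ...) matches values by label. For the values used here (0 and 6, each the first training level) both approaches give the same result, but they can differ for other values. A small sketch with a hypothetical Gleason score of 7:

```r
train_levels <- c("6", "7", "8")   # levels of GleasonScore in the training data

# Relabelling: the single existing level "7" is renamed to the first new label
relabelled <- as.factor(7)
levels(relabelled) <- train_levels
as.character(relabelled)

# Matching by value: 7 is looked up among the training levels
matched <- factor(7, levels = train_levels)
as.character(matched)
```

The first approach silently turns the score of 7 into a 6, so matching by value is the safer choice when building new observations.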

# Predictions come from the predict function; the argument interval = "prediction" returns a prediction interval

predict(fit,newdata = NewData, interval="prediction", level = 0.90)

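     fit       lwr      upr
9.890279 -41.95032 61.73087

The predicted PSA level for this individual is about 9.89 mg/ml, with a 90% prediction interval of (-41.95, 61.73) mg/ml. Since PSA cannot be negative, the lower limit is effectively 0. The interval is wide because the residual standard error of the model (30.74) is large relative to the prediction.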
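
The pieces of this interval can be checked by hand. The sketch below recomputes the point prediction from the rounded coefficients and backs out the standard error of prediction from the printed limits; it relies on rounded summary figures, so the results are approximate, and the variable names are illustrative.

```r
# Values copied from the predict() output and the reduced-model summary
fit_val <- 9.890279
lwr     <- -41.95032
upr     <- 61.73087
s       <- 30.74   # residual standard error
df_res  <- 92      # residual degrees of freedom

# Point prediction by hand: no invasion and Gleason 6 contribute nothing
by_hand <- 1.4890 + 1.9706 * 4.2633

# A 90% interval is fit +/- qt(0.95, df) * se_pred
t_crit  <- qt(0.95, df_res)
se_pred <- (upr - lwr) / (2 * t_crit)

# se_pred^2 = s^2 + se_fit^2, so the part due to estimating the mean is
se_fit  <- sqrt(se_pred^2 - s^2)

round(c(by_hand = by_hand, t_crit = t_crit, se_pred = se_pred, se_fit = se_fit), 3)
```

The residual standard error accounts for almost all of se_pred, which is why the interval stays wide even though the fitted mean itself is estimated fairly precisely.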