
[4pts] Suppose that a data scientist has 200 observations, 50 input variables, and


Question

[4pts] Suppose that a data scientist has 200 observations, 50 input variables, and an output variable. He fit a random forest of regression trees. Based on the variable importance plot from the random forest, it seemed that there were 10 important input variables (i.e., the 10 inputs had high importance measure values, while the other inputs had relatively low values). To improve interpretation, he built a linear regression model with the 10 input variables selected from the variable importance plot. However, the test error of the linear regression model was much higher than that of the random forest. Give at least two possibilities why the prediction of the linear regression model was poor, and then justify your answer briefly.

Explanation / Answer

The two possibilities are:

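1. Non-linear relationship – The relationship between the output and the selected inputs may be non-linear, or may involve interactions between inputs. A random forest can capture this kind of structure, and its importance measure reflects it, but a linear regression with only main effects cannot. An input can therefore rank as highly important in the forest and still have a weak linear association with the output, which leaves the linear model with a high test error.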

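2. Heteroscedasticity – The variance of the error terms is not constant across the values of the independent variables. Ordinary least squares remains unbiased in that case but is no longer efficient, so with only 200 observations the coefficient estimates can be noisy, which can hurt test error.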
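
The first possibility can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions: the data are synthetic, with a single input and a quadratic relationship, and the binned-mean predictor is a crude stand-in for a tree-based model rather than an actual random forest. All function names (fit_line, fit_bins, simulate, mse) are hypothetical and chosen for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x, using the closed-form solution."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_bins(xs, ys, lo=-3.0, hi=3.0, n_bins=12):
    """Piecewise-constant fit: predict the training mean of y in each x-bin.

    Assumption: this is a simple stand-in for a tree-based learner,
    not a random forest.
    """
    width = (hi - lo) / n_bins

    def bin_index(x):
        # Clamp so that x == hi (or anything out of range) maps to a valid bin.
        return min(n_bins - 1, max(0, int((x - lo) / width)))

    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        sums[bin_index(x)] += y
        counts[bin_index(x)] += 1
    overall_mean = sum(ys) / len(ys)
    # Fall back to the overall mean for any bin with no training points.
    means = [s / c if c else overall_mean for s, c in zip(sums, counts)]
    return lambda x: means[bin_index(x)]


def mse(model, xs, ys):
    """Mean squared prediction error of a fitted model."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def simulate(n, rng):
    """Synthetic data (an assumption): y = x^2 + Gaussian noise, x uniform on [-3, 3]."""
    xs = [rng.uniform(-3.0, 3.0) for _ in range(n)]
    ys = [x ** 2 + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys


rng = random.Random(0)
x_train, y_train = simulate(200, rng)
x_test, y_test = simulate(200, rng)

linear_mse = mse(fit_line(x_train, y_train), x_test, y_test)
binned_mse = mse(fit_bins(x_train, y_train), x_test, y_test)

print(f"linear regression test MSE: {linear_mse:.2f}")
print(f"binned-mean model test MSE: {binned_mse:.2f}")
```

In this setup the input clearly determines the output, so a flexible model would rate it as important, yet the fitted line is nearly flat because the quadratic shape is symmetric around zero. The linear model's test error should therefore come out several times larger than that of the piecewise-constant model.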
