Question
Data were collected at a large university on all first-year computer science majors in a particular year. The purpose of the study was to attempt to predict success in the early university years. One measure of success was the cumulative grade point average (GPA) after three semesters. Explanatory variables under study were average high school grades in mathematics (HSM), science (HSS), and English (HSE). We also include SAT mathematics (SATM) and SAT verbal (SATV) scores as explanatory variables. The SAS output relates to this problem. Discuss the main issue(s) that arise in each of the scenarios that are presented.
(a) Fitting the polynomial model: gpa = beta_0 + beta_1*hsm + beta_2*(hsm)^2 + ... + beta_100*(hsm)^100 + epsilon
(b) Including all possible interactions between the explanatory variables in the regression model. That is, satm*hsm, and so on, up to satm*satv*hsm*hss*hse.
Explanation / Answer
(a)
The main issue that arises in this scenario is overfitting.
Overfitting occurs when we attempt to estimate too many parameters from a sample that is too small. Regression analysis uses a single sample to estimate the coefficients of every term in the equation, so the sample size limits the number of terms that can safely be included before the model is overfit. The count of terms includes all predictors, interaction effects, and polynomial terms (used to model curvature).
From the degrees of freedom for C Total in the SAS output, the sample size appears to be 224, while the model contains 100 polynomial terms. This overfits the model: we do not have a sufficient number of observations per term. Simulation studies suggest a rule of thumb of 10-15 observations per term in multiple linear regression, so a degree-100 polynomial would require roughly 15 x 100 = 1500 observations.
So this is an overfit model: the regression coefficients, p-values, and R-squared will be misleading, and the model will not predict new data reliably.
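The effect can be illustrated with a small simulation. The sketch below is a hypothetical Python example (the actual study used SAS, and the data here are made up, not the study data): it generates 224 observations in which GPA depends only weakly and linearly on HSM, then compares polynomial fits of increasing degree on held-out data. The variable names and the use of numpy's Polynomial.fit are illustrative assumptions; the pattern to expect is that the in-sample R-squared climbs as the degree grows while the R-squared on held-out data collapses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data (NOT the study's SAS data set):
# n = 224 students, GPA depends only weakly and linearly on HSM plus noise.
n = 224
hsm = rng.uniform(4, 10, size=n)
gpa = 1.0 + 0.2 * hsm + rng.normal(scale=0.6, size=n)

# Hold out every other observation to check prediction on new data.
train, test = np.arange(0, n, 2), np.arange(1, n, 2)

def r_squared(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

for degree in (1, 10, 100):
    # Polynomial.fit rescales hsm to [-1, 1] internally and estimates
    # degree + 1 coefficients; at degree 100 it may warn that the fit is
    # poorly conditioned, which is itself a symptom of too many terms.
    poly = np.polynomial.Polynomial.fit(hsm[train], gpa[train], deg=degree)
    print(f"degree {degree:>3}: "
          f"train R^2 = {r_squared(gpa[train], poly(hsm[train])):7.3f}, "
          f"test R^2 = {r_squared(gpa[test], poly(hsm[test])):7.3f}")
```

With 112 training points and 101 coefficients at degree 100, the fit nearly interpolates the training sample but has no business predicting new students, which is exactly the problem with the proposed model.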
(b)
This scenario is similar to part (a). The model includes a large number of interaction terms (too many parameters), many of which are not significant predictors and are not needed. So the main issue in this scenario is also overfitting.
From the degrees of freedom for C Total, the sample size again appears to be 224, while including every possible interaction of the five explanatory variables adds 26 interaction terms (all products of 2, 3, 4, or 5 of the variables) on top of the 5 main effects. We again lack a sufficient number of observations per term: by the same 10-15 observations-per-term rule of thumb, roughly 15 x 26 = 390 observations would be needed for the interaction terms alone.
So this too is an overfit model, and the regression coefficients, p-values, and R-squared will be misleading, with poor prediction of new data.
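To see where the count of about 26 comes from, the short sketch below simply enumerates every product of two or more of the five explanatory variables; it is a generic combinatorial illustration in plain Python, not tied to the SAS output.

```python
from itertools import combinations

predictors = ["satm", "satv", "hsm", "hss", "hse"]

# Every interaction is a product of 2, 3, 4, or 5 distinct predictors:
# C(5,2) + C(5,3) + C(5,4) + C(5,5) = 10 + 10 + 5 + 1 = 26 terms.
interactions = [
    "*".join(combo)
    for size in range(2, len(predictors) + 1)
    for combo in combinations(predictors, size)
]

print(len(interactions))                      # 26 interaction terms
print(len(predictors) + len(interactions))    # 31 terms before the intercept
print(interactions[0], "...", interactions[-1])
```

Counting the five main effects as well, the full-interaction model estimates 31 slope coefficients plus an intercept from only 224 observations, about 7 observations per parameter, well short of the 10-15 observations-per-term guideline.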