Most computer solutions for multiple regression begin with a correlation matrix.
Question
Most computer solutions for multiple regression begin with a correlation matrix. Examining this matrix is often the first step when analyzing a regression problem that involves more than one independent variable. Answer the following questions concerning the correlation matrix given in the table below.

        1      2      3      4      5      6
1    1.00   0.55   0.20  -0.51   0.79   0.70
2           1.00   0.27   0.09   0.39   0.45
3                  1.00   0.04   0.17   0.21
4                         1.00  -0.44  -0.14
5                                1.00   0.69
6                                       1.00

A. If variable 1 is the dependent variable, which independent variables have the highest degree of linear association with variable 1?
B. What kind of association exists between variables 1 and 4?
C. Does this correlation matrix show any evidence of multicollinearity?
D. In your opinion, which variable or variables will be included in the best forecasting model? Please give an explanation.
E. Why are all the entries on the main diagonal equal to 1.00?
F. Why is the bottom half of the matrix below the main diagonal blank?

Explanation / Answer
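Before the discussion that follows, a minimal sketch (my own illustration, not part of the original explanation) may help make the structure of such a matrix concrete: a correlation matrix always has 1.00 on the main diagonal, because every variable correlates perfectly with itself, and it is symmetric, which is why only the upper triangle needs to be printed. The data and the choice of six variables here are simulated purely for parallelism with the table above.

```python
import numpy as np

# Simulated scores for 6 hypothetical variables; purely illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))          # 100 cases, 6 variables

R = np.corrcoef(X, rowvar=False)       # the 6 x 6 correlation matrix

print(np.round(R, 2))
print(np.allclose(np.diag(R), 1.0))    # True: each variable correlates 1.00 with itself
print(np.allclose(R, R.T))             # True: symmetric, so only the upper triangle need be shown
```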
Linear Functions of Predictors

To understand rotation, first consider a problem that doesn't involve factor analysis. Suppose you want to predict the grades of college students (all in the same college) in many different courses, from their scores on general "verbal" and "math" skill tests. To develop predictive formulas, you have a body of past data consisting of the grades of several hundred previous students in these courses, plus the scores of those students on the math and verbal tests. To predict grades for present and future students, you could use these data from past students to fit a series of two-variable multiple regressions, each regression predicting grade in one course from scores on the two skill tests.

Now suppose a co-worker suggests summing each student's verbal and math scores to obtain a composite "academic skill" score I'll call AS, and taking the difference between each student's verbal and math scores to obtain a second variable I'll call VMD (verbal-math difference). The co-worker suggests running the same set of regressions to predict grades in individual courses, except using AS and VMD as predictors in each regression, instead of the original verbal and math scores. In this example, you would get exactly the same predictions of course grades from these two families of regressions: one predicting grades in individual courses from verbal and math scores, the other predicting the same grades from AS and VMD scores. In fact, you would get the same predictions if you formed composites of 3 math + 5 verbal and 5 math + 3 verbal, and ran a series of two-variable multiple regressions predicting grades from these two composites. These composites are all linear functions of the original verbal and math scores.

The central point is that if you have m predictor variables, and you replace the m original predictors by m linear functions of those predictors, you generally neither gain nor lose any information--you could, if you wished, use the scores on the linear functions to reconstruct the scores on the original variables. Multiple regression uses whatever information you have in the optimum way (as measured by the sum of squared errors in the current sample) to predict a new variable (e.g. grades in a particular course). Since the linear functions contain the same information as the original variables, you get the same predictions as before.

Given that there are many ways to get exactly the same predictions, is there any advantage to using one set of linear functions rather than another? Yes there is: one set may be simpler than another. One particular pair of linear functions may enable many of the course grades to be predicted from just one variable (that is, one linear function) rather than from two. If we regard regressions with fewer predictor variables as simpler, then we can ask this question: out of all the possible pairs of predictor variables that would give the same predictions, which is simplest to use, in the sense of minimizing the number of predictor variables needed in the typical regression? The pair of predictor variables maximizing some measure of simplicity could be said to have simple structure. In this example involving grades, you might be able to predict grades in some courses accurately from just a verbal test score, and grades in other courses accurately from just a math score. If so, then you would have achieved a "simpler structure" in your predictions than if you had used both tests for all predictions.
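The claim that any (nondegenerate) set of linear functions of the predictors yields identical predictions can be checked numerically. The sketch below is my own illustration, not part of the original text; the names verbal, math, AS, and VMD follow the example above, and the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
verbal = rng.normal(size=n)
math = rng.normal(size=n)
grade = 0.6 * verbal + 0.3 * math + rng.normal(scale=0.5, size=n)   # simulated course grade

def ols_predictions(predictors, y):
    """Least-squares predictions of y from the given predictor columns (with intercept)."""
    X = np.column_stack([np.ones(len(y))] + predictors)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

pred_original = ols_predictions([verbal, math], grade)                  # verbal and math as predictors
pred_linear   = ols_predictions([verbal + math, verbal - math], grade)  # AS and VMD as predictors

print(np.allclose(pred_original, pred_linear))   # True: identical predictions, as argued above
```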
Simple Structure in Factor Analysis

The points of the previous section apply when the predictor variables are factors. Think of the m factors F as a set of independent or predictor variables, and think of the p observed variables X as a set of dependent or criterion variables. Consider a set of p multiple regressions, each predicting one of the variables from all m factors. The standardized coefficients in this set of regressions form a p x m matrix called the factor loading matrix. If we replaced the original factors by a set of linear functions of those factors, we would get exactly the same predictions as before, but the factor loading matrix would be different. Therefore we can ask which, of the many possible sets of linear functions we might use, produces the simplest factor loading matrix. Specifically, we will define simplicity as the number of zero or near-zero entries in the factor loading matrix--the more zeros, the simpler the structure. Rotation does not change matrix C or U at all, but it does change the factor loading matrix.

In the extreme case of simple structure, each X-variable would have only one large entry, so that all the others could be ignored. But that would be a simpler structure than you would normally expect to achieve; after all, in the real world each variable isn't normally affected by only one other variable. You then name the factors subjectively, based on an inspection of their loadings.

In common factor analysis the process of rotation is actually somewhat more abstract than I have implied here, because you don't actually know the individual scores of cases on factors. However, the statistics for a multiple regression that are most relevant here--the multiple correlation and the standardized regression slopes--can all be calculated just from the correlations of the variables and factors involved. Therefore we can base the calculations for rotation to simple structure on just those correlations, without using any individual scores.

A rotation which requires the factors to remain uncorrelated is an orthogonal rotation, while others are oblique rotations. Oblique rotations often achieve greater simple structure, though at the cost that you must also consider the matrix of factor intercorrelations when interpreting results. Manuals are generally clear about which is which, but if there is ever any ambiguity, a simple rule is that if a procedure can print out a matrix of factor correlations, the rotation is oblique, since no such capacity is needed for orthogonal rotations.
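As a hedged illustration of the geometry involved (my own sketch, not part of the original text), the code below rotates an invented two-factor loading matrix by an arbitrary orthogonal matrix: the loadings change, but the correlations they reproduce do not, which is why rotation leaves the fit untouched while offering a chance at simpler structure.

```python
import numpy as np

# An invented loading matrix: 4 variables on 2 uncorrelated factors.
A = np.array([[ .7,  .5],
              [ .6,  .4],
              [ .5, -.6],
              [ .4, -.5]])

theta = np.deg2rad(40)                            # any angle gives an orthogonal rotation
T = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

B = A @ T                                         # rotated loading matrix

print(np.round(B, 2))                             # the loadings have changed ...
print(np.allclose(A @ A.T, B @ B.T))              # ... but the reproduced correlations have not
```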
An Example

Table 1 illustrates the outcome of rotation with a factor analysis of 24 measures of mental ability.

Table 1
Oblique Promax rotation of 4 factors of 24 mental ability variables
From Gorsuch (1983)

                              Verbal   Numerical   Visual   Recognition
General information             .80        .10      -.01        -.06
Paragraph comprehension         .81       -.10       .02         .09
Sentence completion             .87        .04       .01        -.10
Word classification             .55        .12       .23        -.08
Word meaning                    .87       -.11      -.01         .07
Add                             .08        .86      -.30         .05
Code                            .03        .52      -.09         .29
Counting groups of dots        -.16        .79       .14        -.09
Straight & curved capitals     -.01        .54       .41        -.16
Woody-McCall mixed              .24        .43       .00         .18
Visual perception              -.08        .03       .77        -.04
Cubes                          -.07       -.02       .59        -.08
Paper form board               -.02       -.19       .68        -.02
Flags                           .07       -.06       .66        -.12
Deduction                       .25       -.11       .40         .20
Numerical puzzles              -.03        .35       .37         .06
Problem reasoning               .24       -.07       .36         .21
Series completion               .21        .05       .49         .06
Word recognition                .09       -.08      -.13         .66
Number recognition             -.04       -.09      -.02         .64
Figure recognition             -.16       -.13       .43         .47
Object-number                   .00        .09      -.13         .69
Number-figure                  -.22        .23       .25         .42
Figure-word                     .00        .05       .15         .37

This table reveals quite a good simple structure. Within each of the four blocks of variables, the high values (above about .4 in absolute value) are generally all in a single column--a separate column for each of the four blocks. Further, the variables within each block all seem to measure the same general kind of mental ability. The major exception to both these generalizations comes in the third block. The variables in that block seem to include measures of both visual ability and reasoning, and the reasoning variables (the last four in the block) generally have loadings in column 3 not far above their loadings in one or more other columns. This suggests that a 5-factor solution might be worth trying, in the hope that it might yield separate "visual" and "reasoning" factors. The factor names in Table 1 were given by Gorsuch, but inspection of the variables in the second block suggests that "simple repetitive tasks" might be a better name for factor 2 than "numerical".

I don't mean to imply that you should always try to make every variable load highly on only one factor. For instance, a test of ability to deal with arithmetic word problems might well load highly on both verbal and mathematical factors. This is actually one of the advantages of factor analysis over cluster analysis, since you cannot put the same variable in two different clusters.
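One informal way to "read" simple structure in a table like Table 1 is to flag, for each variable, the factors whose loadings exceed about .4 in absolute value. The sketch below (my own, not Gorsuch's) does this for the first block of Table 1 only, to keep the example short.

```python
import numpy as np

factors = ["Verbal", "Numerical", "Visual", "Recognition"]
variables = ["General information", "Paragraph comprehension",
             "Sentence completion", "Word classification", "Word meaning"]
loadings = np.array([[ .80,  .10, -.01, -.06],     # first block of Table 1
                     [ .81, -.10,  .02,  .09],
                     [ .87,  .04,  .01, -.10],
                     [ .55,  .12,  .23, -.08],
                     [ .87, -.11, -.01,  .07]])

for name, row in zip(variables, loadings):
    big = [factors[j] for j in range(len(factors)) if abs(row[j]) > .4]
    print(f"{name:25s} loads mainly on: {', '.join(big) or '(none)'}")
```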
Principal Component Analysis (PCA) Basics

I have introduced principal component analysis (PCA) so late in this chapter primarily for pedagogical reasons. It solves a problem similar to that of common factor analysis, but different enough to lead to confusion. It is no accident that common factor analysis was invented by a scientist (differential psychologist Charles Spearman) while PCA was invented by a statistician. PCA states and then solves a well-defined statistical problem, and except for special cases always gives a unique solution with some very nice mathematical properties. One can even describe some very artificial practical problems for which PCA provides the exact solution. The difficulty comes in trying to relate PCA to real-life scientific problems; the match is simply not very good. Actually, PCA often provides a good approximation to common factor analysis, but that fact matters little now that both methods are equally easy to carry out.

The central concept in PCA is representation or summarization. Suppose we want to replace a large set of variables by a smaller set which best summarizes the larger set. For instance, suppose we have recorded the scores of hundreds of pupils on 30 mental tests, and we don't have the space to store all those scores. (This is a very artificial example in the computer age, but was more appealing before then, when PCA was invented.) For economy of storage we would like to reduce the set to 5 scores per pupil, from which we would like to be able to reconstruct the original 30 scores as accurately as possible. Let p and m denote respectively the original and reduced number of variables--30 and 5 in the current example. The original variables are denoted X, the summarizing variables F for factor.

In the simplest case our measure of accuracy of reconstruction is the sum of p squared multiple correlations between X-variables and the predictions of X made from the factors. In the more general case we can weight each squared multiple correlation by the variance of the corresponding X-variable. Since we can set those variances ourselves by multiplying scores on each variable by any constant we choose, this amounts to the ability to assign any weights we choose to the different variables.

We now have a problem which is well-defined in the mathematical sense: reduce p variables to a set of m linear functions of those variables which best summarize the original p in the sense just described. It turns out, however, that infinitely many linear functions provide equally good summaries. To narrow the problem to one unique solution, we introduce three conditions. First, the m derived linear functions must be mutually uncorrelated. Second, any set of m linear functions must include the functions for a smaller set. For instance, the best 4 linear functions must include the best 3, which include the best 2, which include the best one. Third, the squared weights defining each linear function must sum to 1. These three conditions provide, for most data sets, one unique solution. Typically there are p linear functions (called principal components) declining in importance; by using all p you get perfect reconstruction of the original X-scores, and by using the first m (where m ranges from 1 to p) you get the best reconstruction possible for that value of m.

Define each component's eigenvector (or characteristic vector or latent vector) as the column of weights used to form it from the X-variables. If the original matrix R is a correlation matrix, define each component's eigenvalue (or characteristic value or latent value) as its sum of squared correlations with the X-variables. If R is a covariance matrix, define the eigenvalue as a weighted sum of squared correlations, with each correlation weighted by the variance of the corresponding X-variable. The sum of the eigenvalues always equals the sum of the diagonal entries in R. Nonunique solutions arise only when two or more eigenvalues are exactly equal; it then turns out that the corresponding eigenvectors are not uniquely defined. This case rarely arises in practice, and I shall ignore it henceforth.

Each component's eigenvalue is called the "amount of variance" the component explains. The major reason for this is the eigenvalue's definition as a weighted sum of squared correlations. However, it also turns out that the actual variance of the component scores equals the eigenvalue. Thus in PCA the "factor variance" and "amount of variance the factor explains" are always equal. Therefore the two phrases are often used interchangeably, even though conceptually they stand for very different quantities.
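The statements above about eigenvalues and component variances can be verified directly. The sketch below is my own illustration under stated assumptions (standardized variables, analysis of the correlation matrix R), not a full PCA routine; the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))                      # simulated scores: 200 cases, 5 variables
X = (X - X.mean(axis=0)) / X.std(axis=0)           # standardize the variables
R = np.corrcoef(X, rowvar=False)                   # their correlation matrix

eigvals, eigvecs = np.linalg.eigh(R)               # eigenvalues and eigenvectors of R
order = np.argsort(eigvals)[::-1]                  # sort components in order of importance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = X @ eigvecs                               # principal component scores

print(np.allclose(eigvals.sum(), np.trace(R)))            # eigenvalues sum to the trace of R
print(np.allclose(scores.var(axis=0, ddof=0), eigvals))   # each component's variance equals its eigenvalue

m = 2
X_hat = scores[:, :m] @ eigvecs[:, :m].T           # best reconstruction of X from the first m components
print(round(float(eigvals[:m].sum() / eigvals.sum()), 2), "of the total variance retained")
```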
The Number of Principal Components

It may happen that m principal components will explain all the variance in a set of X-variables--that is, allow perfect reconstruction of X--even though m < p.
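This happens when the X-variables are linearly dependent: some eigenvalues of R are then zero, and fewer than p components already reproduce the data perfectly. The sketch below (my own illustration, with simulated data) constructs such a case, in which the third variable is an exact sum of the first two.

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, b, a + b])                 # the third variable is redundant
X = (X - X.mean(axis=0)) / X.std(axis=0)
R = np.corrcoef(X, rowvar=False)

eigvals = np.linalg.eigvalsh(R)[::-1]              # eigenvalues in descending order
print(np.round(eigvals, 6))                        # the last one is (numerically) zero:
                                                   # 2 components explain all the variance of 3 variables
```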