Academic Integrity: tutoring, explanations, and feedback — we don’t complete graded work or submit on a student’s behalf.

In this question you will explore and summarize the cereals.xls dataset.<?xml:na

ID: 3537414 • Letter: I

Question

In this question you will explore and summarize the cereals.xls dataset.<?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" />

(a)     For each of the 16 variables in the dataset, state whether the variable is categorical (qualitative) or numerical (quantitative). For categorical variables, state whether the variable is nominal or ordinal.

(b)     Create a table that summarizes the mean, median, min, max, standard deviation, and coefficient of variation for each of the quantitative variables.

(c)     Examine the summary statistics for the eight variables that reflect the nutritional value of the cereals (calories through potassium). Use the coefficient of variation to find the three nutritional variables with the largest variability (we cannot use the standard deviation because the variables have different units). Generate histograms and boxplots for these three variables.

(i)      Describe the distributions. Are the variables skewed?

(ii)     Are there any outliers for these variables?

(d)     Plot a side-by-side boxplot comparing the calories in hot versus cold cereals. What does this plot show us?

(e)     What type of plot would you use to compare calories to rating? What type of plot would you use to compare type to rating? Generate the plots and describe the relationship between the variables.

(f)      Generate the correlation matrix for the quantitative variables.

(i)      Which variable has the strongest positive correlation with calories? Which variable has the strongest negative correlation with calories?

(ii)     What percentage of the variance in these three variables would we capture if we replace the variables with their first two principle component?

Explanation / Answer

Answer

General Purpose

The main applications of factor analytic techniques are: (1) to reduce the number of variables and (2) to detect structure in the relationships between variables, that is to classify variables. Therefore, factor analysis is applied as a data reduction or structure detection method (the term factor analysis was first introduced by Thurstone, 1931). The topics listed below will describe the principles of factor analysis, and how it can be applied towards these two purposes. We will assume that you are familiar with the basic logic of statistical reasoning as described in Elementary Concepts. Moreover, we will also assume that you are familiar with the concepts of variance and correlation; if not, we advise that you read the Basic Statistics topic at this point.

There are many excellent books on factor analysis. For example, a hands-on how-to approach can be found in Stevens (1986); more detailed technical descriptions are provided in Cooley and Lohnes (1971); Harman (1976); Kim and Mueller, (1978a, 1978b); Lawley and Maxwell (1971); Lindeman, Merenda, and Gold (1980); Morrison (1967); or Mulaik (1972). The interpretation of secondary factors in hierarchical factor analysis, as an alternative to traditional oblique rotational strategies, is explained in detail by Wherry (1984).

Confirmatory factor analysis. Structural Equation Modeling (SEPATH) allows you to test specific hypotheses about the factor structure for a set of variables, in one or several samples (e.g., you can compare factor structures across samples).

Correspondence analysis. Correspondence analysis is a descriptive/exploratory technique designed to analyze two-way and multi-way tables containing some measure of correspondence between the rows and columns. The results provide information which is similar in nature to those produced by factor analysis techniques, and they allow you to explore the structure of categorical variables included in the table. For more information regarding these methods, refer to Correspondence Analysis. To index



Basic Idea of Factor Analysis as a Data Reduction Method

Suppose we conducted a (rather "silly") study in which we measure 100 people's height in inches and centimeters. Thus, we would have two variables that measure height. If in future studies, we want to research, for example, the effect of different nutritional food supplements on height, would we continue to use both measures? Probably not; height is one characteristic of a person, regardless of how it is measured.

Let's now extrapolate from this "silly" study to something that you might actually do as a researcher. Suppose we want to measure people's satisfaction with their lives. We design a satisfaction questionnaire with various items; among other things we ask our subjects how satisfied they are with their hobbies (item 1) and how intensely they are pursuing a hobby (item 2). Most likely, the responses to the two items are highly correlated with each other. (If you are not familiar with the correlation coefficient, we recommend that you read the description in Basic Statistics - Correlations) Given a high correlation between the two items, we can conclude that they are quite redundant.

Combining Two Variables into a Single Factor. You can summarize the correlation between two variables in a scatterplot. A regression line can then be fitted that represents the "best" summary of the linear relationship between the variables. If we could define a variable that would approximate the regression line in such a plot, then that variable would capture most of the "essence" of the two items. Subjects' single scores on that new factor, represented by the regression line, could then be used in future data analyses to represent that essence of the two items. In a sense we have reduced the two variables to one factor. Note that the new factor is actually a linear combination of the two variables.

Principal Components Analysis. The example described above, combining two correlated variables into one factor, illustrates the basic idea of factor analysis, or of principal components analysis to be precise (we will return to this later). If we extend the two-variable example to multiple variables, then the computations become more involved, but the basic principle of expressing two or more variables by a single factor remains the same.

Extracting Principal Components. We do not want to go into the details about the computational aspects of principal components analysis here, which can be found elsewhere (references were provided at the beginning of this section). However, basically, the extraction of principal components amounts to a variance maximizing (varimax) rotation of the original variable space. For example, in a scatterplot we can think of the regression line as the original X axis, rotated so that it approximates the regression line. This type of rotation is called variance maximizing because the criterion for (goal of) the rotation is to maximize the variance (variability) of the "new" variable (factor), while minimizing the variance around the new variable (see Rotational Strategies).

Generalizing to the Case of Multiple Variables. When there are more than two variables, we can think of them as defining a "space," just as two variables defined a plane. Thus, when we have three variables, we could plot a three- dimensional scatterplot, and, again we could fit a plane through the data.

Hire Me For All Your Tutoring Needs
Integrity-first tutoring: clear explanations, guidance, and feedback.
Drop an Email at
drjack9650@gmail.com
Chat Now And Get Quote