
7 Principal Components, Factor, and Cluster Analyses, and Application in Social Area Analysis

This chapter discusses three important multivariate statistical analysis methods: principal components analysis (PCA), factor analysis (FA), and cluster analysis (CA). PCA and FA are often used together for data reduction by structuring many variables into a limited number of components (factors). The techniques are particularly useful for eliminating variable collinearity and uncovering latent variables. Applications of the methods are widely seen in socioeconomic studies (also see case study 8 in Section 8.4). While PCA and FA group variables, CA classifies observations into categories according to similarity among their attributes. In other words, given a dataset as a table, PCA and FA reduce the number of columns, and CA reduces the number of rows. Social area analysis is used to illustrate the techniques, as it employs all three methods. The interpretation of social area analysis results also leads to a review and comparison of three classic models of urban structure, namely, the concentric zone model, the sector model, and the multinuclei model. The analysis demonstrates how analytical statistical methods synthesize descriptive models into one framework. Beijing, the capital city of China, on the verge of forming its social areas after decades under a socialist regime, is chosen as the study area for a case study. Usage of GIS in this case study is limited to mapping of spatial patterns.

Section 7.1 discusses principal components and factor analysis. Section 7.2 explains cluster analysis. Section 7.3 reviews social area analysis. A case study on the social space in Beijing is presented in Section 7.4 to provide a new perspective on the fast-changing urban structure in China. The chapter concludes with a discussion and brief summary in Section 7.5.
7.1 PRINCIPAL COMPONENTS AND FACTOR ANALYSIS

Principal components and factor analysis are often used together for data reduction. Benefits of this approach include uncovering latent variables for easy interpretation and removing multicollinearity for subsequent regression analysis. In many socioeconomic applications, variables extracted from census data are often correlated with each other, and thus contain duplicated information to some extent. Principal components and factor analysis use fewer factors to represent the original variables, and thus simplify the structure for analysis. Resulting component or factor scores are uncorrelated with each other (if not rotated or orthogonally rotated), and thus can be used as explanatory variables in regression analysis.

Despite the commonalities, principal components and factor analysis are "both conceptually and mathematically very different" (Bailey and Gatrell, 1995, p. 225). Principal components analysis uses the same number of variables (components) to simply transform the original data, and thus is a mathematical transformation (strictly speaking, not a statistical operation). Factor analysis uses fewer variables (factors) to capture most of the variation among the original variables (with error terms), and thus is a statistical analysis process. Principal components analysis attempts to explain the variance of observed variables, whereas factor analysis intends to explain their intercorrelations (Hamilton, 1992, p. 252). In many applications (as in ours), the two methods are used together. In SAS, principal components analysis is offered as an option under the procedure for factor analysis.

© 2006 by Taylor & Francis Group, LLC
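The point that PCA is an exact, invertible transformation (not a statistical model) can be illustrated numerically. The following is a minimal sketch with toy numbers (not from the book): for two standardized variables, the eigenvectors of the correlation matrix [[1, r], [r, 1]] are known analytically, so the two component scores can be computed directly and the original z-scores recovered exactly.

```python
# Toy illustration (hypothetical z-scores, not from the book): PCA on two
# standardized variables. The eigenvectors of a 2x2 correlation matrix
# [[1, r], [r, 1]] are (1, 1)/sqrt(2) and (1, -1)/sqrt(2), regardless of r.
from math import sqrt

z1 = [1.2, -0.5, 0.8, -1.5]   # hypothetical z-scores of variable 1
z2 = [0.9, -0.7, 1.1, -1.3]   # hypothetical z-scores of variable 2

# Component scores: project each observation onto the two eigenvectors.
f1 = [(a + b) / sqrt(2) for a, b in zip(z1, z2)]
f2 = [(a - b) / sqrt(2) for a, b in zip(z1, z2)]

# Using all K = 2 components reconstructs the data exactly, as in a
# full principal components transformation: no information is lost.
z1_back = [(a + b) / sqrt(2) for a, b in zip(f1, f2)]
z2_back = [(a - b) / sqrt(2) for a, b in zip(f1, f2)]
assert all(abs(a - b) < 1e-12 for a, b in zip(z1, z1_back))
assert all(abs(a - b) < 1e-12 for a, b in zip(z2, z2_back))
```

Data reduction only occurs once some components are discarded, which is the subject of the next subsection.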
7.1.1 PRINCIPAL COMPONENTS FACTOR MODEL

In formula, principal components analysis (PCA) transforms original data on K observed variables Zk into data on K principal components Fk that are independent of (uncorrelated with) each other:

Zk = lk1F1 + lk2F2 + ... + lkjFj + ... + lkKFK (7.1)

Retaining only the J largest components (J < K), we have

Zk = lk1F1 + lk2F2 + ... + lkJFJ + vk (7.2)

where the discarded components are represented by the residual term vk, such that

vk = lk,J+1FJ+1 + lk,J+2FJ+2 + ... + lkKFK (7.3)

Equations 7.2 and 7.3 represent a model termed principal components factor analysis (PCFA). The PCFA retains the largest components to capture most of the variance while discarding minor components with small variance. The PCFA is the method used in social area analysis (Cadwallader, 1996, p. 137) and is simply referred to as factor analysis in the remainder of this chapter.

In a true factor analysis (FA), the residual (error) term, denoted as uk to distinguish it from vk in a PCFA, is unique to each variable Zk:

Zk = lk1F1 + lk2F2 + ... + lkJFJ + uk

The uk are termed unique factors (in contrast to the common factors Fj). In the PCFA, the residual vk is a linear combination of the discarded components (FJ+1, ..., FK) and thus cannot be uncorrelated like the uk in a true FA (Hamilton, 1992, p. 252).

7.1.2 FACTOR LOADINGS, FACTOR SCORES, AND EIGENVALUES

For convenience, the original data of observed variables Zk are first standardized1 prior to the PCA and FA, and the initial values for components (factors) are also standardized. When both Zk and Fj are standardized, the lkj in Equations 7.1 and 7.2 are standardized coefficients in the regression of variables Zk on components (factors) Fj, also termed factor loadings. For example, lk1 is the loading of variable Zk on standardized component F1.
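The difference between the PCFA residual vk and the unique factors uk can be seen concretely. The following toy sketch (hypothetical numbers, not from the book) continues the two-variable case: dropping the smaller of K = 2 components (J = 1) leaves residuals that are built from the same discarded component, so they are perfectly correlated rather than unique.

```python
# Toy illustration (hypothetical z-scores, not from the book): PCFA with
# J = 1 retained component out of K = 2. The residual v_k of Equation 7.3
# is the contribution of the discarded component F2.
from math import sqrt

z1 = [1.2, -0.5, 0.8, -1.5]
z2 = [0.9, -0.7, 1.1, -1.3]
f1 = [(a + b) / sqrt(2) for a, b in zip(z1, z2)]  # retained component
f2 = [(a - b) / sqrt(2) for a, b in zip(z1, z2)]  # discarded component

# Loadings of z1 and z2 on f1 are both 1/sqrt(2) here, so the residuals
# (Equation 7.2 rearranged) are what the discarded component contributes.
v1 = [a - b / sqrt(2) for a, b in zip(z1, f1)]
v2 = [a - b / sqrt(2) for a, b in zip(z2, f1)]

# v1 = f2/sqrt(2) and v2 = -f2/sqrt(2): the residuals are perfectly
# (negatively) correlated, unlike the unique factors u_k of a true FA.
assert all(abs(a - b / sqrt(2)) < 1e-12 for a, b in zip(v1, f2))
assert all(abs(a + b) < 1e-12 for a, b in zip(v1, v2))
```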
Factor loadings reflect the strength of the relations between variables and components. Conversely, the components Fj can be reexpressed as linear combinations of the original variables Zk:

Fj = a1jZ1 + a2jZ2 + ... + aKjZK (7.4)

Estimates of these components (factors) are termed factor scores. Estimates of akj are factor score coefficients, i.e., coefficients in the regression of factors on variables.

The components Fj are constructed to be uncorrelated with each other and are ordered such that the first component F1 has the largest sample variance (l1), F2 the second largest, and so on. The variances lj corresponding to the various components are termed eigenvalues, and l1 > l2 > .... Since standardized variables have variances of 1, the total variance of all variables equals the number of variables, that is,

l1 + l2 + ... + lK = K (7.5)

Therefore, the proportion of total variance explained by the jth component is lj/K.

Eigenvalues provide a basis for judging which components (factors) are important and which are not, and thus for deciding how many components to retain. One rule of thumb is that only eigenvalues greater than 1 are important (Griffith and Amrhein, 1997, p. 169). Since the variance of each standardized variable is 1, a component with l < 1 accounts for less variation than an original variable, and thus does not serve the purpose of data reduction. The eigenvalue-1 rule is arbitrary, however. A scree graph, which plots eigenvalues against component (factor) number, provides more useful guidance (Hamilton, 1992, p. 258). For example, Figure 7.1 shows the scree graph of eigenvalues in a case of 14 components (using the result from case study 7 in Section 7.4). The graph levels off after component 4, indicating that components 5 to 14 account for relatively little additional variance. Therefore, four components may be retained as principal components.
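Equation 7.5 and the eigenvalue-1 rule can be checked on a small example. The following sketch uses a hypothetical correlation r between two standardized variables (not from the book); for a 2x2 correlation matrix [[1, r], [r, 1]], the eigenvalues are 1 + r and 1 - r.

```python
# Toy illustration (hypothetical correlation, not from the book):
# eigenvalues of a 2x2 correlation matrix [[1, r], [r, 1]] are 1 + r
# and 1 - r, which sum to K = 2 as stated in Equation 7.5.
r = 0.6
eigenvalues = [1 + r, 1 - r]
K = len(eigenvalues)

assert abs(sum(eigenvalues) - K) < 1e-12           # Equation 7.5

# Proportion of total variance explained by each component: lj / K.
proportions = [lam / K for lam in eigenvalues]     # first one explains 80%

# Eigenvalue-1 rule of thumb: keep only components with lj > 1.
retained = [lam for lam in eigenvalues if lam > 1]
assert len(retained) == 1
```

With strongly correlated variables (large r), the first component dominates and one component suffices; with r near 0, both eigenvalues approach 1 and little reduction is possible.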
Outputs from statistical analysis software such as SAS include important information, such as factor loadings, eigenvalues, and proportions (of total variance). Factor scores can be saved in a predefined external file. The factor analysis procedure in SAS also outputs a correlation matrix between the observed variables for analysts to examine their relations.

FIGURE 7.1 Scree graph for principal components analysis.

7.1.3 ROTATION

Initial results from PCFA are often hard to interpret, as variables load across factors. While fitting the data equally well, rotation generates a simpler structure and more interpretable factors by maximizing the loading (positive or negative) of each variable on one factor and minimizing the loadings on the others. As a result, we can detect which factor (latent variable) captures the information contained in which observed variables, and subsequently label the factors adequately. Orthogonal rotation generates independent (uncorrelated) factors, an important property for many applications. A widely used orthogonal rotation method is Varimax rotation, which maximizes the variance of the squared loadings for each factor, and thus polarizes loadings (either high or low on factors). Varimax rotation is often the rotation technique used in social area analysis. Oblique rotation (e.g., promax rotation) generates even greater polarization, but allows correlation between factors. In SAS, an option is provided to specify which rotation to use.

As a summary, Figure 7.2 illustrates the process of PCFA:

1. The original dataset of K observed variables with n records is first standardized to a dataset of Z scores with the same number of variables and records.
2. PCA then uses K uncorrelated components to explain all the variance of the K variables.
3.
PCFA keeps only J (J < K) principal components to capture most of the variance.
4. A rotation method is used to load each variable strongly on one factor (and near zero on the others) for easier interpretation.

FIGURE 7.2 Data processing steps in principal components factor analysis.

The SAS procedure for factor analysis (FA) is FACTOR, which also reports the principal components analysis (PCA) results preceding those of FA. The following sample SAS statements implement a factor analysis that uses four factors to capture the structure of 14 variables, x1 through x14, and adopts the Varimax rotation technique:

proc factor out=FACTSCORE (replace=yes) nfact=4 rotate=varimax;
  var x1-x14;
run;

The SAS data set FACTSCORE contains the factor scores, which can be saved to an external file. Note that a SAS program is not case sensitive.

7.2 CLUSTER ANALYSIS

Cluster analysis (CA) groups observations according to similarity among their attributes. As a result, the observations within a cluster are more similar than observations between clusters, as measured by the clustering criterion. Note the difference between CA and a similar multivariate analysis technique, discriminant function analysis (DFA). Both group observations into categories based on characteristic variables, but the categories are unknown in CA and known in DFA. See Appendix 7A for further discussion of DFA. Geographers have a long-standing interest in cluster analysis, which has been applied to problems such as regionalization and city classification.
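Before turning to clustering in detail, the Varimax rotation of Section 7.1.3 can also be sketched numerically. The following toy example (hypothetical loadings, not from the book, and a brute-force angle search rather than the iterative algorithm statistical packages use) rotates a two-factor loading matrix to maximize the variance of the squared loadings in each column, so that each variable ends up loading strongly on one factor and near zero on the other.

```python
# Toy illustration (hypothetical loadings, not from the book): orthogonal
# Varimax-style rotation of a 2-factor loading matrix by brute-force search
# over rotation angles.
from math import cos, sin, pi

loadings = [(0.7, 0.7), (0.7, -0.7), (0.6, 0.6), (0.6, -0.6)]

def varimax_criterion(L):
    # Sum, over factors, of the variance of the squared loadings in
    # that factor's column (the quantity Varimax maximizes).
    total = 0.0
    for j in range(2):
        col = [row[j] ** 2 for row in L]
        mean = sum(col) / len(col)
        total += sum((c - mean) ** 2 for c in col) / len(col)
    return total

def rotate(L, theta):
    # Apply one orthogonal rotation of the two factor axes.
    c, s = cos(theta), sin(theta)
    return [(a * c + b * s, -a * s + b * c) for a, b in L]

# Search angles in [0, pi) in 0.1-degree steps for the best rotation.
best = max((rotate(loadings, k * pi / 1800) for k in range(1800)),
           key=varimax_criterion)

# After rotation, each variable loads strongly on one factor and
# near zero on the other: a "simple structure".
for row in best:
    assert max(abs(v) for v in row) > 0.8
    assert min(abs(v) for v in row) < 0.1
```

Here the unrotated variables all load about equally on both factors; the rotated solution is far easier to label, which is exactly why rotation is applied in social area analysis.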
In the case of social area analysis, cluster analysis is used to further analyze the results from factor analysis (i.e., factor scores of various components across space) and group areas into different types of social areas. A key element in deciding the assignment of observations to clusters is distance, which can be measured in various ways. The most commonly used distance measure is Euclidean distance:

d_ij = [(x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_iK - x_jK)^2]^(1/2)

where d_ij is the distance between observations i and j, and x_ik and x_jk are their values on the kth attribute.
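One widely used clustering criterion built on Euclidean distance is k-means (chosen here purely for illustration; it is not necessarily the method used later in this chapter). The following minimal sketch, with hypothetical factor scores, assigns each observation to its nearest cluster centroid and then updates the centroids:

```python
# Toy illustration (hypothetical factor scores, not from the book):
# k-means clustering of observations by Euclidean distance in attribute
# space (here two factor scores per observation).
from math import dist  # Euclidean distance, Python 3.8+

points = [(0.1, 0.2), (0.0, 0.3), (0.2, 0.1),   # one social-area type
          (2.0, 2.1), (2.2, 1.9), (1.9, 2.2)]   # another type

def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        # Assign each observation to its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)),
                    key=lambda j: dist(p, centroids[j]))
            groups[j].append(p)
        # Move each centroid to the mean of its group.
        centroids = [tuple(sum(c) / len(g) for c in zip(*g))
                     for g in groups]
    return groups

clusters = kmeans(points, centroids=[points[0], points[3]])
assert sorted(len(g) for g in clusters) == [3, 3]
```

The result groups the six observations into the two compact clusters visible in the data, which is the CA counterpart of labeling areas as distinct social-area types.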