
Independent Component Analysis. Aapo Hyvärinen, Juha Karhunen, Erkki Oja. Copyright © 2001 John Wiley & Sons, Inc. ISBNs: 0-471-40540-X (Hardback); 0-471-22131-7 (Electronic)

14 Overview and Comparison of Basic ICA Methods

In the preceding chapters, we introduced several different estimation principles and algorithms for independent component analysis (ICA). In this chapter, we provide an overview of these methods. First, we show that all these estimation principles are intimately connected, and that the main choices are between cumulant-based vs. negentropy/likelihood-based estimation methods, and between one-unit vs. multi-unit methods. In other words, one must choose the nonlinearity and the decorrelation method. We discuss the choice of the nonlinearity from the viewpoint of statistical theory. In practice, one must also choose the optimization method. We compare the algorithms experimentally, and show that the main choice here is between on-line (adaptive) gradient algorithms vs. fast batch fixed-point algorithms. At the end of this chapter, we provide a short summary of the whole of Part II, that is, of basic ICA estimation.

14.1 OBJECTIVE FUNCTIONS VS. ALGORITHMS

A distinction that has been used throughout this book is between the formulation of the objective function and the algorithm used to optimize it. One might express this in the following "equation":

    ICA method = objective function + optimization algorithm

In the case of explicitly formulated objective functions, one can use any of the classic optimization methods, for example, (stochastic) gradient methods and Newton methods. In some cases, however, the algorithm and the estimation principle may be difficult to separate.

The properties of the ICA method depend on both the objective function and the optimization algorithm. In particular:

- the statistical properties (e.g., consistency, asymptotic variance, robustness) of the ICA method depend on the choice of the objective function,
- the algorithmic properties (e.g., convergence speed, memory requirements, numerical stability) depend on the optimization algorithm.

Ideally, these two classes of properties are independent in the sense that different optimization methods can be used to optimize a single objective function, and a single optimization method can be used to optimize different objective functions. In this section, we shall first treat the choice of the objective function, and then consider optimization of the objective function.
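As a concrete illustration of this separation, the following sketch (ours, not part of the book; it assumes NumPy and a two-dimensional toy mixture) optimizes one and the same kurtosis-based objective with two different algorithms: a simple batch gradient update and a FastICA-type fixed-point update of the kind compared experimentally later in this chapter.

import numpy as np

rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 10000))             # supergaussian sources
A = rng.normal(size=(2, 2))                  # mixing matrix
x = A @ s
d, E = np.linalg.eigh(np.cov(x))             # whiten the mixtures
z = E @ np.diag(d ** -0.5) @ E.T @ x

def kurtosis(y):
    return np.mean(y ** 4) - 3.0             # unit-variance projections assumed

w0 = rng.normal(size=2)
w0 /= np.linalg.norm(w0)

# Algorithm 1: gradient update of the kurtosis objective, with renormalization
w_grad = w0.copy()
for _ in range(200):
    y = w_grad @ z
    w_grad += 0.1 * np.sign(kurtosis(y)) * (z * y ** 3).mean(axis=1)
    w_grad /= np.linalg.norm(w_grad)

# Algorithm 2: fixed-point (FastICA-type) update for the same objective
w_fp = w0.copy()
for _ in range(20):
    y = w_fp @ z
    w_fp = (z * y ** 3).mean(axis=1) - 3.0 * w_fp
    w_fp /= np.linalg.norm(w_fp)

# Both runs maximize the magnitude of the kurtosis of the projection; each
# should recover an independent component (a laplacian IC has kurtosis about 3)
print(kurtosis(w_grad @ z), kurtosis(w_fp @ z))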
14.2 CONNECTIONS BETWEEN ICA ESTIMATION PRINCIPLES

Earlier, we introduced several different statistical criteria for estimation of the ICA model, including mutual information, likelihood, nongaussianity measures, cumulants, and nonlinear principal component analysis (PCA) criteria. Each of these criteria gave an objective function whose optimization enables ICA estimation. We have already seen that some of them are closely connected; the purpose of this section is to recapitulate these results. In fact, almost all of these estimation principles can be considered as different versions of the same general criterion. After this, we discuss the differences between the principles.

14.2.1 Similarities between estimation principles

Mutual information gives a convenient starting point for showing the similarity between different estimation principles. For an invertible linear transformation \mathbf{y} = \mathbf{B}\mathbf{x}, we have

    I(y_1, y_2, \ldots, y_n) = \sum_i H(y_i) - H(\mathbf{x}) - \log |\det \mathbf{B}|        (14.1)

If we constrain the y_i to be uncorrelated and of unit variance, the last term on the right-hand side is constant; the second term does not depend on \mathbf{B} anyway (see Chapter 10). Recall that entropy is maximized by a gaussian distribution when variance is kept constant (Section 5.3). Thus we see that minimization of mutual information means maximizing the sum of the nongaussianities of the estimated components. If these entropies (or the corresponding negentropies) are approximated by the approximations used in Chapter 8, we obtain the same algorithms as in that chapter.

Alternatively, we could approximate mutual information by approximating the densities of the estimated ICs by some parametric family, and using the obtained log-density approximations in the definition of entropy. Thus we obtain a method that is essentially equivalent to maximum likelihood (ML) estimation.

The connections to other estimation principles can easily be seen using likelihood. First of all, to see the connection to nonlinear decorrelation, it is enough to compare the natural gradient method for ML estimation shown in (9.17) with the nonlinear decorrelation algorithm (12.11): they are of the same form. Thus, ML estimation gives a principled method for choosing the nonlinearities in nonlinear decorrelation. The nonlinearities used are determined as certain functions of the probability density functions (pdf's) of the independent components. Mutual information does the same thing, of course, due to the equivalency discussed earlier. Likewise, the nonlinear PCA methods were shown to be essentially equivalent to ML estimation (and, therefore, to most other methods) in Section 12.7.

The connection of the preceding principles to cumulant-based criteria can be seen by considering the approximation of negentropy by cumulants as in Eq. (5.35):

    J(y) \approx \frac{1}{12} E\{y^3\}^2 + \frac{1}{48} \mathrm{kurt}(y)^2        (14.2)

where the first term could be omitted, leaving just the term containing kurtosis. Likewise, cumulants could be used to approximate mutual information, since mutual information is based on entropy. More explicitly, we could consider the following approximation of mutual information:

    I(y_1, y_2, \ldots, y_n) \approx c_1 - c_2 \sum_i \mathrm{kurt}(y_i)^2        (14.3)

where c_1 and c_2 are some constants. This shows clearly the connection between cumulants and minimization of mutual information. Moreover, the tensorial methods in Chapter 11 were seen to lead to the same fixed-point algorithm as the maximization of nongaussianity as measured by kurtosis, which shows that they are doing very much the same thing as the other kurtosis-based methods.

14.2.2 Differences between estimation principles

There are, however, a couple of differences between the estimation principles as well.

1. Some principles (especially maximum nongaussianity) are able to estimate single independent components, whereas others need to estimate all the components at the same time.

2. Some objective functions use nonpolynomial functions based on the (assumed) probability density functions of the independent components, whereas others use polynomial functions related to cumulants. This leads to different nonquadratic functions in the objective functions; a small numerical sketch of this difference is given after this list.

3. In many estimation principles, the estimates of the ICs are constrained to be uncorrelated. This reduces somewhat the space in which the estimation is performed. Considering, for example, mutual information, there is no reason why mutual information would be exactly minimized by a decomposition that gives uncorrelated components. Thus, this decorrelation constraint slightly reduces the theoretical performance of the estimation methods. In practice, this may be negligible.

4. One important difference in practice is that often in ML estimation, the densities of the ICs are fixed in advance, using prior knowledge on the independent components. This is possible because the pdf's of the ICs need not be known with any great precision: in fact, it is enough to estimate whether they are sub- or supergaussian. Nevertheless, if the prior information on the nature of the independent components is not correct, ML estimation will give completely wrong results, as was shown in Chapter 9. Some care must therefore be taken with ML estimation. In contrast, this problem does not usually arise when using approximations of negentropy, since the approximations we have used in this book do not depend on having reasonable approximations of the densities. Therefore, these approximations are less problematic to use.
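As a small illustration of point 2 (the sketch referred to above, ours and not from the book), the following code contrasts a polynomial measure of nongaussianity, namely the cumulant-based approximation (14.2), with a nonpolynomial one built from G(y) = log cosh y as in Chapter 8; it assumes NumPy and compares a laplacian sample with a gaussian one.

import numpy as np

def negentropy_kurtosis(y):
    # J(y) ~ (1/12) E{y^3}^2 + (1/48) kurt(y)^2, cf. (14.2)
    skew_term = np.mean(y ** 3) ** 2 / 12.0
    kurt = np.mean(y ** 4) - 3.0
    return skew_term + kurt ** 2 / 48.0

def negentropy_logcosh(y, n_gauss=100000, seed=0):
    # proportional to (E{G(y)} - E{G(nu)})^2 with G(y) = log cosh y and nu
    # standard gaussian; the gaussian expectation is estimated by Monte Carlo
    # and the proportionality constant is omitted
    nu = np.random.default_rng(seed).normal(size=n_gauss)
    G = lambda u: np.log(np.cosh(u))
    return (np.mean(G(y)) - np.mean(G(nu))) ** 2

rng = np.random.default_rng(1)
y_laplace = rng.laplace(size=10000) / np.sqrt(2.0)   # unit variance
y_gauss = rng.normal(size=10000)
for name, y in [("laplacian", y_laplace), ("gaussian", y_gauss)]:
    print(name, negentropy_kurtosis(y), negentropy_logcosh(y))
# Both measures are close to zero for the gaussian sample and clearly positive
# for the laplacian one; the log cosh version uses a slowly growing
# nonlinearity instead of third and fourth powers.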
14.3 STATISTICALLY OPTIMAL NONLINEARITIES

Thus, from a statistical viewpoint, the choice of estimation method is more or less reduced to the choice of the nonquadratic function G that gives information on the higher-order statistics in the form of the expectation E\{G(\mathbf{b}^T\mathbf{x})\}. In the algorithms, this choice corresponds to the choice of the nonlinearity g that is the derivative of G. In this section, we analyze the statistical properties of different nonlinearities. This is based on the family of approximations of negentropy given in (8.25). This family includes kurtosis as well. For simplicity, we consider here the estimation of just one IC, given by maximizing this nongaussianity measure. This is essentially equivalent to the problem

    \max_{\mathbf{b}} \; \pm E\{G(\mathbf{b}^T\mathbf{x})\}  \quad \text{subject to} \quad  E\{(\mathbf{b}^T\mathbf{x})^2\} = 1        (14.4)

where the sign depends on the estimate of the sub- or supergaussianity of \mathbf{b}^T\mathbf{x}. The obtained vector is denoted by \hat{\mathbf{b}}. The two fundamental statistical properties of \hat{\mathbf{b}} that we analyze are asymptotic variance and robustness.

14.3.1 Comparison of asymptotic variance *

In practice, one usually has only a finite sample of T observations of the vector \mathbf{x}. Therefore, the expectations in the theoretical definition of the objective function are in fact replaced by sample averages. This results in certain errors in the estimator \hat{\mathbf{b}}, and it is desired to make these errors as small as possible. A classic measure of this error is the asymptotic (co)variance, which means the limit of the covariance matrix of \sqrt{T}\,\hat{\mathbf{b}} as T \to \infty. This gives an approximation of the mean-square error of \hat{\mathbf{b}}, as was already discussed in Chapter 4. Comparison of, say, the traces of the asymptotic variances of two estimators enables direct comparison of the accuracy of the two estimators. One can solve analytically for the asymptotic variance of \hat{\mathbf{b}}, obtaining the following theorem [193]:

Theorem 14.1 The trace of the asymptotic variance of \hat{\mathbf{b}} as defined above, for the estimation of the independent component s_i, equals

    V_G = C(\mathbf{A}) \, \frac{E\{g(s_i)^2\} - E\{s_i g(s_i)\}^2}{\left( E\{s_i g(s_i) - g'(s_i)\} \right)^2}        (14.5)

where g is the derivative of G, and C(\mathbf{A}) is a constant that depends only on \mathbf{A}.

The theorem is proven in the appendix of this chapter. Thus the comparison of the asymptotic variances of two estimators for two different nonquadratic functions G boils down to a comparison of the values V_G.
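For instance, Theorem 14.1 can be used to compare the classic kurtosis nonlinearity g(y) = y^3 with the slowly growing g(y) = tanh(y) for a given source density. The sketch below (ours, not from the book; it assumes NumPy and a unit-variance laplacian component) estimates the two values of V_G by Monte Carlo, omitting the common constant C(A).

import numpy as np

rng = np.random.default_rng(2)
s = rng.laplace(size=1_000_000) / np.sqrt(2.0)   # unit-variance laplacian IC

def v_g(g, g_prime, s):
    # (E{g(s)^2} - E{s g(s)}^2) / (E{s g(s) - g'(s)})^2, cf. (14.5), without C(A)
    num = np.mean(g(s) ** 2) - np.mean(s * g(s)) ** 2
    den = np.mean(s * g(s) - g_prime(s)) ** 2
    return num / den

# kurtosis-based contrast: g(y) = y^3, g'(y) = 3 y^2
v_kurt = v_g(lambda y: y ** 3, lambda y: 3 * y ** 2, s)
# nonpolynomial contrast: g(y) = tanh(y), g'(y) = 1 - tanh(y)^2
v_tanh = v_g(np.tanh, lambda y: 1 - np.tanh(y) ** 2, s)

# For this supergaussian source the tanh nonlinearity gives the smaller V_G,
# i.e., the asymptotically more accurate estimator
print(v_kurt, v_tanh)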
In particular, one can use variational calculus to find a G that minimizes V_G. Thus one obtains the following theorem [193]:

Theorem 14.2 The trace of the asymptotic variance of \hat{\mathbf{b}} is minimized when G is of the form

    G_{\mathrm{opt}}(y) = c_1 \log p_i(y) + c_2 y^2 + c_3        (14.6)

where p_i is the density function of s_i, and c_1, c_2, c_3 are arbitrary constants.

For simplicity, one can choose G_{\mathrm{opt}}(y) = \log p_i(y). Thus, we see that the optimal nonlinearity is in fact the one used in the definition of negentropy. This shows that negentropy is the optimal measure of nongaussianity, at least among those measures that lead to estimators of the form considered here.[1] Also, one sees that the optimal function is the same as the one obtained for several units by the maximum likelihood approach.

[1] One has to take into account, however, that in the definition of negentropy, the nonquadratic function is not fixed in advance, whereas in our nongaussianity measures, G is fixed. Thus, the statistical properties of negentropy can only be approximately derived from our analysis.

14.3.2 Comparison of robustness *

Another very desirable property of an estimator is robustness against outliers. This means that single, highly erroneous observations do not have much influence on the estimator. In this section, we shall treat the question: How does the robustness of the estimator \hat{\mathbf{b}} depend on the choice of the function G? The main result is that G(y) should not grow fast as a function of |y| if we want robust estimators. In particular, this means that kurtosis gives nonrobust estimators, which may be very disadvantageous in some situations.
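A quick numerical way to see this nonrobustness is to check how much a single outlier moves the sample version of each contrast. The sketch below (ours, not from the book; it assumes NumPy) compares the sample kurtosis with the sample average of the slowly growing function G(y) = log cosh y used in earlier chapters.

import numpy as np

rng = np.random.default_rng(3)
y = rng.laplace(size=10000) / np.sqrt(2.0)   # unit-variance laplacian sample

def kurt(y):
    return np.mean(y ** 4) - 3.0 * np.mean(y ** 2) ** 2

def logcosh_contrast(y):
    return np.mean(np.log(np.cosh(y)))

y_out = y.copy()
y_out[0] = 50.0                              # one grossly erroneous observation

print("kurtosis:", kurt(y), "->", kurt(y_out))
print("log cosh:", logcosh_contrast(y), "->", logcosh_contrast(y_out))
# The kurtosis estimate jumps by hundreds (50**4 / 10000 alone is 625), while
# the log cosh statistic moves only on the order of 50 / 10000, because
# log cosh grows roughly linearly in |y|.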