
3
Random Forest Classification of Remote Sensing Data

Sveinn R. Joelsson, Jon Atli Benediktsson, and Johannes R. Sveinsson

CONTENTS
3.1 Introduction
3.2 The Random Forest Classifier
  3.2.1 Derived Parameters for Random Forests
    3.2.1.1 Out-of-Bag Error
    3.2.1.2 Variable Importance
    3.2.1.3 Proximities
3.3 The Building Blocks of Random Forests
  3.3.1 Classification and Regression Tree
  3.3.2 Binary Hierarchy Classifier Trees
3.4 Different Implementations of Random Forests
  3.4.1 Random Forest: Classification and Regression Tree
  3.4.2 Random Forest: Binary Hierarchical Classifier
3.5 Experimental Results
  3.5.1 Classification of a Multi-Source Data Set
    3.5.1.1 The Anderson River Data Set Examined with a Single CART Tree
    3.5.1.2 The Anderson River Data Set Examined with the BHC Approach
  3.5.2 Experiments with Hyperspectral Data
3.6 Conclusions
Acknowledgment
References

3.1 Introduction

Ensemble classification methods train several classifiers and combine their results through a voting process. Many ensemble classifiers [1,2] have been proposed. These classifiers include consensus theoretic classifiers [3] and committee machines [4]. Boosting and bagging are widely used ensemble methods. Bagging (or bootstrap aggregating) [5] is based on training many classifiers on bootstrapped samples from the training set and has been shown to reduce the variance of the classification. In contrast, boosting uses iterative re-training, where the incorrectly classified samples are given more weight in successive training iterations. This makes the algorithm slow (much slower than bagging), while in most cases it is considerably more accurate than bagging.
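The contrast between bagging and boosting can be made concrete with a short sketch. The example below is illustrative only and assumes scikit-learn with a synthetic data set; it is not the implementation used in this chapter. A bagged ensemble trains independent trees on bootstrap samples and combines them by majority vote, while a boosted ensemble trains trees sequentially and re-weights misclassified samples between iterations.

```python
# Illustrative sketch (assumes scikit-learn); not the FORTRAN implementation
# referred to later in this chapter.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a remote sensing feature set (e.g., spectral bands).
X, y = make_classification(n_samples=2000, n_features=30, n_informative=10,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: each base tree sees a bootstrap sample; votes are combined by majority.
bagging = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)

# Boosting: base trees are trained sequentially; misclassified samples get more weight.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

for name, clf in [("bagging", bagging), ("boosting", boosting)]:
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))
```

Both ensembles use decision trees as base learners by default; the boosted run is typically slower because its trees cannot be trained independently.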
Boosting generally reduces both the variance and the bias of the classification and has been shown to be a very accurate classification method. However, it has various drawbacks: it is computationally demanding, it can overtrain, and it is also sensitive to noise [6]. Therefore, there is much interest in investigating methods such as random forests.

In this chapter, random forests are investigated in the classification of hyperspectral and multi-source remote sensing data. A random forest is a collection of classification trees or treelike classifiers. Each tree is trained on a bootstrapped sample of the training data, and at each node in each tree the algorithm only searches across a random subset of the features to determine a split. To classify an input vector in a random forest, the vector is submitted as an input to each of the trees in the forest. Each tree gives a classification, and it is said that the tree votes for that class. In the classification, the forest chooses the class having the most votes (over all the trees in the forest). Random forests have been shown to be comparable to boosting in terms of accuracies, but without the drawbacks of boosting. In addition, random forests are computationally much less intensive than boosting.

Random forests have recently been investigated for classification of remote sensing data. Ham et al. [7] applied them in the classification of hyperspectral remote sensing data. Joelsson et al. [8] used random forests in the classification of hyperspectral data from urban areas, and Gislason et al. [9] investigated random forests in the classification of multi-source remote sensing and geographic data. All studies report good accuracies, especially when computational demand is taken into account.

The chapter is organized as follows. First, random forest classifiers are discussed. Then, two different building blocks for random forests, that is, the classification and regression tree (CART) and the binary hierarchical classifier (BHC) approaches, are reviewed. In Section 3.4, random forests with the two different building blocks are discussed. Experimental results for hyperspectral and multi-source data are given in Section 3.5. Finally, conclusions are given in Section 3.6.

3.2 The Random Forest Classifier

A random forest classifier is a classifier comprising a collection of treelike classifiers. Ideally, a random forest classifier is an i.i.d. randomization of weak learners [10]. The classifier uses a large number of individual decision trees, all of which are trained (grown) to tackle the same problem. A sample is decided to belong to the most frequently occurring of the classes as determined by the individual trees. The individuality of the trees is maintained by three factors:

1. Each tree is trained using a random subset of the training samples.
2. During the growing process of a tree, the best split at each node is found by searching through m randomly selected features. For a data set with M features, m is selected by the user and kept much smaller than M.
3. Every tree is grown to its fullest to diversify the trees, so there is no pruning.

As described above, a random forest is an ensemble of treelike classifiers, each trained on a randomly chosen subset of the input data, where the final classification is based on a majority vote by the trees in the forest. Each node of a tree in a random forest looks to a random subset of features of fixed size m when deciding a split during training.
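These three factors map directly onto the parameters of common random forest implementations. The following sketch is a minimal illustration assuming scikit-learn and synthetic data (the experiments reported later in this chapter use the FORTRAN implementation of Breiman and Cutler); here max_features plays the role of m.

```python
# Illustrative sketch of the three factors above (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a data set with M = 30 features.
X, y = make_classification(n_samples=2000, n_features=30, n_informative=10,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,    # number of trees in the forest
    bootstrap=True,      # factor 1: each tree sees a bootstrapped sample
    max_features=5,      # factor 2: m = 5 randomly selected features per split
    max_depth=None,      # factor 3: trees are grown to their fullest (no pruning)
    random_state=0,
)
forest.fit(X, y)

# Classification is by majority vote over all trees in the forest.
print(forest.predict(X[:3]))
```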
The trees can thus be viewed as random vectors of integers (the features used to determine a split at each node). There are two points to note about the parameter m:

1. Increasing m increases the correlation between the trees in the forest, which increases the error rate of the forest.
2. Increasing m increases the classification accuracy of every individual tree, which decreases the error rate of the forest.

An optimal interval for m lies between these somewhat fuzzy extremes. The parameter m is often said to be the only adjustable parameter to which the forest is sensitive, and the "optimal" range for m is usually quite wide [10].

3.2.1 Derived Parameters for Random Forests

Three quantities can be derived from a random forest: the out-of-bag (OOB) error, the variable importance, and the proximities.

3.2.1.1 Out-of-Bag Error

To estimate the test set accuracy, the out-of-bag samples (the training set samples that are not in the bootstrap sample of a particular tree) of each tree can be run down through that tree (cross-validation). The OOB error estimate is derived from the classification error for the samples left out of each tree, averaged over the total number of trees. In other words, for all the trees where case n was out-of-bag, run case n down the trees and note whether it is correctly classified. The proportion of times the classification is in error, averaged over all cases, is the OOB error estimate. Let us consider an example: each tree is trained on a random 2/3 of the sample population (the training set), while the remaining 1/3 is used to derive the OOB error rate for that tree. The OOB error rate is then averaged over all the OOB cases, yielding the final or total OOB error. This error estimate has been shown to be unbiased in many tests [10,11].

3.2.1.2 Variable Importance

For a single tree, run it on its OOB cases and count the votes for the correct class. Then repeat this after randomly permuting the values of a single variable in the OOB cases. Now subtract the number of correctly cast votes for the permuted data from the number of correctly cast votes for the original OOB data. The average of this value over the whole forest is the raw importance score for the variable [5,6,11]. If the values of this score are independent from tree to tree, then the standard error can be computed by a standard computation [12]. The correlations of these scores between trees have been computed for a number of data sets and proved to be quite low [5,6,11]. Therefore, the standard errors are computed in the classical way: divide the raw score by its standard error to get a z-score, and assign a significance level to the z-score assuming normality [5,6,11].

3.2.1.3 Proximities

After a tree is grown, all the data are passed through it. If cases k and n end up in the same terminal node, their proximity is increased by one. The proximity measure can be used (directly or indirectly) to visualize high-dimensional data [5,6,11]. As the proximities are indicators of the "distance" to other samples, this measure can be used to detect outliers, in the sense that an outlier is "far" from all other samples.

3.3 The Building Blocks of Random Forests

Random forests are made up of several trees or building blocks. The building blocks considered here are CART trees, which partition the input data, and BHC trees, which partition the labels (the output).
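Before describing the two building blocks in detail, the following sketch illustrates how the three derived quantities of Section 3.2.1 can be obtained from a trained forest. It assumes scikit-learn rather than the implementation used in this chapter's experiments; note that its permutation importance is computed on a held-out set instead of on each tree's OOB cases, and the proximity is computed as the fraction of trees in which two cases share a terminal node.

```python
# Illustrative sketch of the OOB error, variable importance, and proximities
# (assumes scikit-learn; an approximation of the OOB-based procedures above).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=30, n_informative=10,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, max_features=5,
                                bootstrap=True, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

# Out-of-bag error: 1 minus the OOB accuracy estimated from the left-out samples.
print("OOB error:", 1.0 - forest.oob_score_)

# Variable importance by permutation (here on a held-out set, not per-tree OOB cases).
imp = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
print("most important feature:", np.argmax(imp.importances_mean))

# Proximity: the fraction of trees in which two cases share a terminal node.
leaves = forest.apply(X_test)            # shape (n_samples, n_trees)
k, n = 0, 1
print("proximity of cases 0 and 1:", np.mean(leaves[k] == leaves[n]))
```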
3.3.1 Classification and Regression Tree

CART is a decision tree in which each split is made on the variable/feature/dimension that yields the greatest decrease in impurity (or the minimum resulting impurity) at that node of the tree [12]. The growing of a tree continues until the change in impurity stops or falls below some bound, or until the number of samples left to split is too small according to the user. CART trees are easily overtrained, so a single tree is usually pruned to increase its generality. However, a collection of unpruned trees, where each tree is trained to its fullest on a subset of the training data to diversify the individual trees, can be very useful. When collected in a multi-classifier ensemble and trained using the random forest algorithm, these are called RF-CART.

3.3.2 Binary Hierarchy Classifier Trees

A binary hierarchy of classifiers, where each node is based on a split regarding the labels (the output) instead of the input as in the CART case, is naturally organized as a tree and can as such be combined, under rules similar to those for CART trees, to form an RF-BHC. In a BHC, the best split at each node is based on (meta-)class separability, starting with a single meta-class that is split into two meta-classes, and so on; the true classes are realized in the leaves. Simultaneously with the splitting process, the Fisher discriminant and the corresponding projection are computed, and the data are projected along the Fisher direction [12]. In "Fisher space," the projected data are used to estimate the likelihood of a sample belonging to a meta-class; from there, the probabilities of a true class belonging to a meta-class are estimated and used to update the Fisher projection. The data are then projected using this updated projection, and so forth, until a user-supplied level of separation is reached. This approach utilizes natural class affinities in the data, that is, the most natural splits occur early in the growth of the tree [13]. A drawback is the possible instability of the split algorithm. The Fisher projection involves an inverse of an estimate of the within-class covariance matrix, which can be unstable at some nodes of the tree, depending on the data being considered; if this matrix estimate is singular (to numerical precision), the algorithm fails. As mentioned above, the BHC trees can be combined into an RF-BHC, where the best splits on classes are performed on a subset of the features in the data to diversify the individual trees and stabilize the aforementioned inverse. Since the number of leaves in a BHC tree is the same as the number of classes in the data set, the trees themselves can be very informative when compared to CART-like trees.

3.4 Different Implementations of Random Forests

3.4.1 Random Forest: Classification and Regression Tree

The RF-CART approach is based on CART-like trees, where trees are grown to minimize an impurity measure. When trees are grown using the minimum Gini impurity criterion [12], the impurity of the two descendent nodes in a tree is less than that of the parent. Adding up the decrease in the Gini value for each variable over all the trees in the forest gives a variable importance that is often very consistent with the permutation importance measure.

3.4.2 Random Forest: Binary Hierarchical Classifier

RF-BHC is a random forest based on an ensemble of BHC trees.
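Before detailing how RF-BHC selects its split features, the following sketch illustrates the core computation at a single BHC node: projecting the data onto the Fisher direction that best separates two meta-classes. This is a simplified, hypothetical illustration in numpy only; the iterative re-estimation of meta-class memberships and of the projection performed by the full BHC algorithm is omitted.

```python
# Simplified, hypothetical sketch of a single BHC split: the Fisher direction
# separating two meta-classes, computed on a random subset of m features as in
# RF-BHC. The full iterative BHC procedure is not shown.
import numpy as np

rng = np.random.default_rng(0)

def fisher_direction(X1, X2):
    """Fisher discriminant direction w = S_W^{-1} (mu1 - mu2)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S_w = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)
    # The within-class covariance estimate can be singular (the instability
    # discussed above); a pseudo-inverse keeps this sketch numerically robust.
    return np.linalg.pinv(S_w) @ (mu1 - mu2)

# Two hypothetical meta-classes with 200 samples and 30 features each.
X1 = rng.normal(loc=0.0, size=(200, 30))
X2 = rng.normal(loc=0.5, size=(200, 30))

# RF-BHC style: evaluate the split on m randomly selected features only.
m = 5
features = rng.choice(30, size=m, replace=False)
w = fisher_direction(X1[:, features], X2[:, features])

# Project both meta-classes onto the Fisher direction and compare their means.
p1, p2 = X1[:, features] @ w, X2[:, features] @ w
print("meta-class separation along w:", abs(p1.mean() - p2.mean()))
```

Restricting the split to m randomly chosen features both diversifies the trees and reduces the dimension of the covariance matrix that must be inverted, which is how RF-BHC stabilizes the Fisher projection.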
In the RF-BHC, a split in the tree is based on the best separation between meta-classes. At each node, the best separation is found by examining m features selected at random. The value of m can be selected by trials to yield optimal results. In the case where the number of samples is small enough to induce the "curse" of dimensionality, m is calculated by looking to a user-supplied ratio R between the number of samples and the number of features; m is then either used unchanged as the supplied value or recalculated to preserve the ratio R, whichever is smaller at the node in question [7]. An RF-BHC is uniform with regard to tree size (depth), because the number of nodes is a function of the number of classes in the data set.

3.5 Experimental Results

Random forests have many important qualities, many of which apply directly to multi- or hyperspectral data. It has been shown that the volume of a hypercube concentrates in the corners and the volume of a hyperellipsoid concentrates in an outer shell, implying that, with limited data points, much of the hyperspectral data space is empty [17]. Making a collection of trees is attractive when each of the trees seeks to minimize or maximize some information-content-related criterion given a subset of the features. This means that the random forest can arrive at a good decision boundary without deleting or extracting features explicitly, while making the most out of the training set. This ability to handle thousands of input features is especially attractive when dealing with multi- or hyperspectral data, because more often than not such data consist of tens to hundreds of features and a limited number of samples. The unbiased nature of the OOB error rate can in some cases (if not all) eliminate the need for a validation data set, which is another plus when working with a limited number of samples. In the experiments, the RF-CART approach was tested using a FORTRAN implementation of random forests supplied on a web page maintained by Leo Breiman and Adele Cutler [18].

3.5.1 Classification of a Multi-Source Data Set

In this experiment we use the Anderson River data set, which is a multi-source remote sensing and geographic data set made available by the Canada Centre for Remote Sensing ...