
Text Chunking using Regularized Winnow

Tong Zhang, Fred Damerau and David Johnson
IBM T.J. Watson Research Center, Yorktown Heights, New York 10598, USA
tzhang@watson.ibm.com  damerau@watson.ibm.com  dejohns@us.ibm.com

Abstract

Many machine learning methods have recently been applied to natural language processing tasks. Among them, the Winnow algorithm has been argued to be particularly suitable for NLP problems, due to its robustness to irrelevant features. However, in theory Winnow may not converge for non-separable data. To remedy this problem, a modification called regularized Winnow has been proposed. In this paper, we apply this new method to text chunking. We show that this method achieves state of the art performance with significantly less computation than previous approaches.

1 Introduction

Recently there has been considerable interest in applying machine learning techniques to problems in natural language processing. One method that has been quite successful in many applications is the SNoW architecture (Dagan et al., 1997; Khardon et al., 1999). This architecture is based on the Winnow algorithm (Littlestone, 1988; Grove and Roth, 2001), which in theory is suitable for problems with many irrelevant attributes. In natural language processing, one often encounters a very high dimensional feature space, although most of the features are irrelevant. Therefore the robustness of Winnow to high dimensional feature spaces is considered an important reason why it is suitable for NLP tasks.

However, the convergence of the Winnow algorithm is only guaranteed for linearly separable data. In practical NLP applications, data are often linearly non-separable. Consequently, a direct application of Winnow may lead to numerical instability. A remedy for this, called regularized Winnow, has been recently proposed in (Zhang, 2001). This method modifies the original Winnow algorithm so that it solves a regularized optimization problem. It converges both in the linearly separable case and in the linearly non-separable case. Its numerical stability implies that the new method can be more suitable for practical NLP problems that may not be linearly separable.

In this paper, we compare the regularized Winnow and Winnow algorithms on text chunking (Abney, 1991). In order for us to rigorously compare our system with others, we use the CoNLL-2000 shared task dataset (Sang and Buchholz, 2000), which is publicly available from http://lcg-www.uia.ac.be/conll2000/chunking. An advantage of using this dataset is that a large number of state of the art statistical natural language processing methods have already been applied to the data. Therefore we can readily compare our results with other reported results.

We show that state of the art performance can be achieved by using the newly proposed regularized Winnow method. Furthermore, we can achieve this result with significantly less computation than earlier systems of comparable performance.

The paper is organized as follows. In Section 2, we describe the Winnow algorithm and the regularized Winnow method. Section 3 describes the CoNLL-2000 shared task. In Section 4, we give a detailed description of our system that employs the regularized Winnow algorithm for text chunking. Section 5 contains experimental results for our system on the CoNLL-2000 shared task. Some final remarks will be given in Section 6.
2 Winnow and regularized Winnow for binary classification

We review the Winnow algorithm and the regularized Winnow method. Consider the binary classification problem: to determine a label y in {-1, +1} associated with an input vector x. A useful method for solving this problem is through linear discriminant functions, which consist of linear combinations of the components of the input variable. Specifically, we seek a weight vector w and a threshold theta such that w^T x < theta if the label is y = -1 and w^T x >= theta if the label is y = +1. For simplicity, we shall assume theta = 0 in this paper. The restriction does not cause problems in practice since one can always append a constant feature to the input data x, which offsets the effect of theta.

Given a training set of labeled data (x^1, y^1), ..., (x^n, y^n), a number of approaches to finding linear discriminant functions have been advanced over the years. We are especially interested in the Winnow multiplicative update algorithm (Littlestone, 1988). This algorithm updates the weight vector w by going through the training data repeatedly. It is mistake driven in the sense that the weight vector is updated only when the algorithm is not able to correctly classify an example.

The Winnow algorithm (with positive weights) employs a multiplicative update: if the linear discriminant function misclassifies an input training vector x^i with true label y^i, then we update each component j of the weight vector w as

    w_j \leftarrow w_j \exp(\eta\, x_j^i y^i),                                (1)

where eta > 0 is a parameter called the learning rate. The initial weight vector can be taken as w_j = mu_j > 0, where mu is a prior which is typically chosen to be uniform.

There can be several variants of the Winnow algorithm. One is called balanced Winnow, which is equivalent to an embedding of the input space into a higher dimensional space as \tilde{x} = [x, -x]. This modification allows the positive weight Winnow algorithm for the augmented input \tilde{x} to have the effect of both positive and negative weights for the original input x.

One problem of the Winnow online update algorithm is that it may not converge when the data are not linearly separable. One may partially remedy this problem by decreasing the learning rate parameter eta during the updates. However, this is rather ad hoc since it is unclear what the best way to do so is. Therefore in practice it can be quite difficult to implement this idea properly. In order to obtain a systematic solution, we shall first examine a derivation of the Winnow algorithm in (Gentile and Warmuth, 1998), which motivates a more general solution to be presented later.

Following (Gentile and Warmuth, 1998), we consider the loss function L(w; x^i, y^i) = \max(-w^T x^i y^i, 0), which is often called the "hinge loss". For each data point (x^i, y^i), we consider an online update rule such that the weight w^{i+1} after seeing the i-th example is given by the solution to

    \min_{w} \Big[ \sum_j w_j \ln \frac{w_j}{w_j^i} + \eta\, \max(-w^T x^i y^i,\, 0) \Big].          (2)

Setting the gradient of the above formula to zero, we obtain

    w_j^{i+1} = w_j^i \exp\big(-\eta\, \nabla_1 L(w^{i+1}; x^i, y^i)\big).     (3)

In the above equation, \nabla_1 L denotes the gradient (or more rigorously, a subgradient) of the hinge loss with respect to the weight vector, which takes the value -x^i y^i when the example is misclassified and 0 otherwise, so that (3) recovers the multiplicative update (1). While this derivation does not by itself handle data that are not linearly separable, it does provide valuable insights which can lead to a more systematic solution of the problem.

The basic idea was given in (Zhang, 2001), where the original Winnow algorithm was converted into a numerical optimization problem that can handle linearly non-separable data. The resulting formulation is closely related to (2). However, instead of looking at one example at a time as in an online formulation, we incorporate all examples at the same time. In addition, we add a margin condition into the "hinge loss". Specifically, we seek a linear weight \hat{w} that solves

    \min_{w} \Big[ \sum_j w_j \ln \frac{w_j}{\mu_j} + C \sum_{i=1}^n \max\big(1 - w^T x^i y^i,\, 0\big) \Big],

where C > 0 is a given parameter called the regularization parameter. The optimal solution \hat{w} of the above optimization problem can be derived from the solution \hat{\alpha} of the following dual optimization problem:

    \max_{\alpha} \Big[ \sum_i \alpha_i - \sum_j \mu_j \exp\Big( \sum_i \alpha_i x_j^i y^i \Big) \Big]
    \text{s.t. } \alpha_i \in [0, C] \quad (i = 1, \ldots, n).

The j-th component of \hat{w} is given by

    \hat{w}_j = \mu_j \exp\Big( \sum_i \hat{\alpha}_i x_j^i y^i \Big).

A Winnow-like update rule can be derived for the dual regularized Winnow formulation. At each data point (x^i, y^i), we fix all alpha_k with k != i, and update alpha_i to approximately maximize the dual objective functional using gradient ascent:

    \alpha_i \leftarrow \min\big(\max\big(\alpha_i + \eta (1 - w^T x^i y^i),\, 0\big),\, C\big),      (4)

where w_j = \mu_j \exp(\sum_k \alpha_k x_j^k y^k). We update alpha and the weights by repeatedly going over the data for i = 1, ..., n.

Learning bounds of regularized Winnow that are similar to the mistake bound of the original Winnow have been given in (Zhang, 2001). These results imply that the new method, while it can properly handle non-separable data, shares similar theoretical advantages of Winnow in that it is also robust to irrelevant features. This theoretical insight implies that the algorithm is suitable for NLP tasks with large feature spaces.
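To make the dual update concrete, the following is a minimal sketch of the training loop implied by update (4) as reconstructed above, written for dense NumPy arrays and a generic binary classification dataset. The function name, hyperparameter names (C, eta, mu0, n_epochs) and the data layout are our own illustration, not the authors' implementation; in the real system the features are sparse and only non-zero components would be touched.

```python
import numpy as np

def regularized_winnow_dual(X, y, C=0.1, eta=0.01, mu0=1.0, n_epochs=10):
    """Sketch of dual regularized Winnow training via clipped gradient ascent.

    X : (n, d) array of feature vectors; y : (n,) array of +1/-1 labels.
    Weights stay positive; signed weights are obtained by augmenting X with -X
    (balanced Winnow), as described in the text.
    """
    n, d = X.shape
    alpha = np.zeros(n)              # one dual variable per training example
    mu = np.full(d, mu0)             # positive prior on the weights
    w = mu.copy()                    # w_j = mu_j * exp(sum_i alpha_i * x_ij * y_i)
    for _ in range(n_epochs):
        for i in range(n):
            margin = y[i] * np.dot(w, X[i])
            # gradient-ascent step on alpha_i, clipped to the box [0, C]
            new_alpha = np.clip(alpha[i] + eta * (1.0 - margin), 0.0, C)
            delta = new_alpha - alpha[i]
            if delta != 0.0:
                # multiplicative weight update keeps w consistent with alpha
                w *= np.exp(delta * y[i] * X[i])
                alpha[i] = new_alpha
    return w, alpha
```

Because each pass only rescales the weight components whose features are non-zero in the current example, the cost per example depends on the number of active features rather than on the total dimensionality, which is the property exploited later in Section 4.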
3 CoNLL-2000 chunking task

The text chunking task is to divide text into syntactically related non-overlapping groups of words (chunks). It is considered an important problem in natural language processing. As an example of text chunking, the sentence "Balcor, which has interests in real estate, said the position is newly created." can be divided as follows:

[NP Balcor], [NP which] [VP has] [NP interests] [PP in] [NP real estate], [VP said] [NP the position] [VP is newly created].

In this example, NP denotes noun phrase, VP denotes verb phrase, and PP denotes prepositional phrase.

The CoNLL-2000 shared task (Sang and Buchholz, 2000), introduced last year, is an attempt to set up a standard dataset so that researchers can compare different statistical chunking methods. The data are extracted from sections of the Penn Treebank. The training set consists of WSJ sections 15-18 of the Penn Treebank, and the test set consists of WSJ section 20. Additionally, a part-of-speech (POS) tag was assigned to each token by a standard POS tagger (Brill, 1994) that was trained on the Penn Treebank. These POS tags can be used as features in a machine learning based chunking algorithm. See Section 4 for detail.

The data contain eleven different chunk types. However, except for the most frequent three types, NP (noun phrase), VP (verb phrase), and PP (prepositional phrase), each of the remaining chunk types occurs comparatively rarely. The chunks are represented by the following three types of tags:

B-X  first word of a chunk of type X
I-X  non-initial word in an X chunk
O    word outside of any chunk

A standard software program has been provided (available from http://lcg-www.uia.ac.be/conll2000/chunking) to compute the performance of each algorithm. For each chunk type, three figures of merit are computed: precision (the percentage of detected phrases that are correct), recall (the percentage of phrases in the data that are found), and the F_{beta=1} metric, which is the harmonic mean of the precision and the recall. The overall precision, recall and F_{beta=1} metric on all chunks are also computed. The overall F_{beta=1} metric gives a single number that can be used to compare different algorithms.
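The official shared-task script is the authoritative scorer; the sketch below is only an illustrative re-implementation of the same idea, assuming sentences are given as lists of B-X/I-X/O tags. The function names and the span convention (half-open token ranges) are our own.

```python
def tags_to_chunks(tags):
    """Convert a B-X / I-X / O tag sequence into a set of (start, end, type) spans."""
    chunks, start, ctype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):        # sentinel flushes the last chunk
        continues = tag.startswith("I-") and tag[2:] == ctype
        if not continues:
            if ctype is not None:
                chunks.add((start, i, ctype))           # close the open chunk
            if tag == "O":
                start, ctype = None, None
            else:
                start, ctype = i, tag[2:]               # B-X, or a stray I-X, starts a chunk
    return chunks

def chunk_prf(gold_tags, pred_tags):
    """Precision, recall and F1 over whole chunks (exact span and type match)."""
    gold, pred = tags_to_chunks(gold_tags), tags_to_chunks(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Note that credit is only given when an entire chunk (both boundaries and its type) is recovered, which is why whole-chunk F1 is a stricter measure than per-token tagging accuracy.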
4 System description

4.1 Encoding of basic features

An advantage of regularized Winnow is its robustness to irrelevant features. We can thus include as many features as possible, and let the algorithm itself find the relevant ones. This strategy ensures that we do not miss any features that are important. However, using more features requires more memory and slows down the algorithm. Therefore in practice it is still necessary to limit the number of features used.

Let tok_{c-k}, ..., tok_{c-1}, tok_c, tok_{c+1}, ..., tok_{c+k} be a string of tokenized text (each token is a word or punctuation). We want to predict the chunk type of the current token tok_c. For each token tok_i, we let pos_i denote the associated POS tag, which is assumed to be given in the CoNLL-2000 shared task. The following is a list of the features we use as input to the regularized Winnow; the indices range over fixed small windows centered at the current position c:

  first order features: tok_i and pos_i;

  second order features: POS pairs pos_i x pos_j, and mixed pairs pos_i x tok_j.

In addition, since in a sequential process the predicted chunk tags t_i for earlier positions i < c are available when we process tok_c, we include the following extra chunk-type features:

  first order chunk-type features: t_i;

  second order chunk-type features: pairs t_i x t_j, and POS-chunk interactions t_i x pos_j.

For each data point (corresponding to the current token tok_c), the associated features are encoded as a binary vector x, which is the input to Winnow. Each component of x corresponds to a possible feature value of a feature in one of the above feature lists. The value of the component is one if the corresponding feature achieves that value, and zero if the corresponding feature achieves another feature value. For example, since pos_c is in our feature list, each possible POS value of pos_c corresponds to a component of x: the component has value one if the feature value represented by the component is active, and value zero otherwise. Similarly, for a second order feature in our feature list such as pos_c x pos_{c+1}, each possible value in the set of POS-tag pairs is represented by a component of x: the component has value one if both tags take the values represented by the component (the feature value represented by the component is active), and value zero otherwise. The same encoding is applied to all other first order and second order features, with each possible test of "feature = feature value" corresponding to a unique component in x.

Clearly, in this representation, the high order features are conjunction features that become active when all of their components are active. In principle, one may also consider disjunction features that become active when some of their components are active. However, such features are not considered in this work. Note that the above representation leads to a sparse, but very high dimensional, vector x. This explains why we do not include all possible second order features, since this would quickly consume more memory than we can handle.
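As an illustration of this encoding, the sketch below maps each active "feature = value" test to a unique dimension of a sparse binary vector. The template strings and the dictionary-based index are our own simplification, not the exact data structures used in the authors' system.

```python
class SparseBinaryEncoder:
    """Map 'feature=value' tests to unique columns of a sparse binary vector."""

    def __init__(self):
        self.index = {}                       # e.g. "pos[-1]=JJ" -> column id

    def encode(self, tests, grow=True):
        """Return the sorted list of active column ids for one token position."""
        cols = []
        for test in tests:                    # each test is a "feature=value" string
            col = self.index.get(test)
            if col is None and grow:          # unseen test: allocate a new dimension
                col = self.index[test] = len(self.index)
            if col is not None:
                cols.append(col)
        return sorted(cols)

# Example: first order tests plus one conjunction (second order) test
enc = SparseBinaryEncoder()
tests = ["tok[0]=estate", "pos[0]=NN", "pos[-1]=JJ", "pos[-1]=JJ&pos[0]=NN"]
print(enc.encode(tests))                      # -> [0, 1, 2, 3]
```

Only the active columns need to be stored for each example, which is why, as noted below, the running time of Winnow depends on the number of non-zero features per example rather than on the total dimensionality.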
Also, the above list of features is not necessarily the best available. We only included the most straight-forward features and pair-wise feature interactions. One might try even higher order features to obtain better results.

Since Winnow is relatively robust to irrelevant features, it is usually helpful to provide the algorithm with as many features as possible, and let the algorithm pick up the relevant ones. The main problem that prohibits us from using more features in the Winnow algorithm is memory consumption (mainly in training). The time complexity of the Winnow algorithm does not depend on the number of features, but rather on the average number of non-zero features per data point, which is usually quite small.

Due to the memory problem, in our implementation we have to limit the number of token features (words or punctuation): we sort the tokens by their frequencies in the training set from high frequency to low frequency; we then treat all tokens whose rank falls beyond a fixed cutoff as the same token. Since the cutoff is still reasonably large, this restriction is relatively minor.

There are possible remedies to the memory consumption problem, although we have not implemented them in our current system. One solution comes from noticing that although the feature vector is of very high dimension, most dimensions are empty. Therefore one may create a hash table for the features, which can significantly reduce the memory consumption.

4.2 Using enhanced linguistic features

We were interested in determining if additional features with more linguistic content would lead to even better performance. The ESG (English Slot Grammar) system in (McCord, 1989) is not directly comparable to the phrase structure grammar implicit in the WSJ treebank. ESG is a dependency grammar in which each phrase has a head and dependent elements, each marked with a syntactic role. ESG normally produces multiple parses for a sentence, but has the capability, which we used, to output only the highest ranked parse, where rank is determined by a system-defined measure.

There are a number of incompatibilities between the treebank and ESG in tokenization, which had to be compensated for in order to transfer the syntactic role features to the tokens in the standard training and test sets. We also transferred the ESG part-of-speech codes (different from those in the WSJ corpus) and made an attempt to attach B-PP, B-NP and I-NP tags as inferred from the ESG dependency structure. In the end, the latter two tags did not prove useful. ESG is also very fast, parsing several thousand sentences on an IBM RS/6000 in a few minutes of clock time.

It might seem odd to use parser output as input to a machine learning system whose goal is to find syntactic chunks. As noted above, ESG or any other parser normally produces many analyses, whereas in the kind of applications for which chunking is used, e.g., information extraction, only one solution is normally desired. In addition, due to many incompatibilities between ESG and the WSJ treebank, only a fraction of the ESG generated syntactic role tags are in agreement with WSJ chunks. However, the ESG syntactic role tags can be regarded as features in a statistical chunker. Another view is that the statistical chunker can be regarded as a machine learned transformation that maps ESG syntactic role tags into WSJ chunks.

We denote by sr_i the syntactic role tag associated with token tok_i. Each tag takes one of 138 possible values. The following features are added to our system:

  first order features: sr_i;

  second order features: self interactions sr_i x sr_j, and interactions with POS tags sr_i x pos_j.

4.3 Dynamic programming

In text chunking, we predict hidden states (chunk types) based on a sequence of observed states (text). This resembles hidden Markov models, where dynamic programming has been widely employed. Our approach is related to ideas described in (Punyakanok and Roth, 2001). Similar methods have also appeared in other natural language processing systems (for example, in (Kudoh and Matsumoto, 2000)).

Given input vectors x consisting of features constructed as above, we apply the regularized Winnow algorithm to train linear weight vectors. Since the Winnow algorithm only produces positive weights, we employ the balanced version of Winnow, with x transformed into the augmented vector [x, -x, 1]. As explained earlier, the appended constant term is used to offset the effect of the threshold theta. Once a weight vector [w+, w-, b] is obtained, we let w = w+ - w- and theta = -b, so that the prediction for an incoming feature vector x is given by the sign of w^T x - theta.

Since Winnow only solves binary classification problems, we train one linear classifier for each chunk type. In this way, we obtain twenty-three linear classifiers, one for each chunk type.
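A minimal sketch of this balanced construction is shown below: the positive-weight learner is run on the augmented input [x, -x, 1], and an ordinary signed linear score is recovered afterwards. The helper names, and the reuse of the trainer from the earlier sketch, are our own illustration under the reconstruction given above, not the authors' exact code.

```python
import numpy as np

def augment(X):
    """Balanced-Winnow embedding [x, -x, 1]: positive weights then act as signed ones."""
    ones = np.ones((X.shape[0], 1))
    return np.hstack([X, -X, ones])

def signed_weights(w_aug, d):
    """Recover a signed linear classifier (w, bias) from the positive augmented weights."""
    w_pos, w_neg, bias = w_aug[:d], w_aug[d:2 * d], w_aug[2 * d]
    return w_pos - w_neg, bias            # score(x) = (w_pos - w_neg) . x + bias

# Usage sketch: one binary classifier per chunk tag, trained one-vs-rest.
# w_aug, _ = regularized_winnow_dual(augment(X_train), y_train)   # earlier sketch
# w, b = signed_weights(w_aug, X_train.shape[1])
# score = X_test @ w + b                                          # higher = more likely this tag
```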
Denote by w^t the weight vector associated with chunk type t. Then a straight-forward method to classify an incoming datum x is to assign the chunk tag as the one with the highest score (w^t)^T x.

However, there are constraints in any valid sequence of chunk types: if the current chunk is of type I-X, then the previous chunk type can only be either B-X or I-X. This constraint can be explored ...
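As the section title indicates, one natural way to use this constraint is a Viterbi-style dynamic program over the per-position classifier scores. The sketch below illustrates that idea under our own assumptions (tag inventory, score-matrix layout, and transition rule are ours, not necessarily the authors' exact procedure).

```python
import numpy as np

def allowed(prev_tag, tag):
    """Sequence constraint: I-X may only follow B-X or I-X of the same type X."""
    if tag.startswith("I-"):
        return prev_tag in ("B-" + tag[2:], "I-" + tag[2:])
    return True                                   # B-X and O may follow anything

def viterbi_decode(scores, tags):
    """scores: (n_positions, n_tags) classifier scores; returns the best valid tag path."""
    n, m = scores.shape
    best = np.full((n, m), -np.inf)               # best path score ending in tag j at i
    back = np.zeros((n, m), dtype=int)            # backpointers
    best[0] = scores[0]
    for i in range(1, n):
        for j, tag in enumerate(tags):
            for k, prev in enumerate(tags):
                cand = best[i - 1, k] + scores[i, j]
                if allowed(prev, tag) and cand > best[i, j]:
                    best[i, j] = cand
                    back[i, j] = k
    path = [int(np.argmax(best[-1]))]
    for i in range(n - 1, 0, -1):                 # follow backpointers to recover the path
        path.append(back[i, path[-1]])
    return [tags[j] for j in reversed(path)]
```

Since earlier predicted chunk tags are themselves used as features (Section 4.1), a full decoder would have to interleave feature construction with the search; the sketch above ignores that interaction and only enforces the tag-sequence constraint.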