
The following is an example of a rule to identify financial status for this case:

    for each keyword that is a candidate for denoting a financial item (e.g., sales)
        if the tagger has identified that keyword as a noun or plural noun in the sentence
        then a form of a corresponding financial status keyword (e.g., increase)
             should be present in the immediately preceding verb phrase
        end if
    end for

In the previous examples, "saw strong" and "increase" respectively precede the financial item "sales". Thus, for the first statement, FIRST fills the slots as:

Financial Item: sales
Financial Status: strong

For the second sentence, FIRST fills the slots as:

Financial Item: sales
Financial Status: increase

Semantic Analysis

FIRST does not do full semantic analysis, but it is able to recognize that certain words have similar meanings. FIRST relies heavily on WordNet as a source of such semantic information. WordNet is an online lexical database developed by the Cognitive Science Laboratory at Princeton University, under the direction of Professor George A. Miller (Fellbaum, 1998; Miller et al., 1990; Miller, 1995). WordNet is organized around a lexical concept called a synonym set, or synset: a set of words that can be interchanged in some context without changing the truth value of the proposition in which they are embedded. WordNet contains information about nouns, verbs, adjectives, and adverbs. Each synset consists of a list of words (or phrases) and the pointers that describe the relations between that synset and other synsets. These semantic relations between words include: hypernymy/hyponymy (or superordinate/subordinate relationships, e.g., a "car door" is a kind of "door"); antonymy (or opposites, e.g., "hate" is an antonym of "love"); entailment; and meronymy/holonymy (or part-of relationships, e.g., a "lock" is a part of a "door") (http://wordnet.princeton.edu/man/wngloss.7WN). Box 1 shows WordNet's hypernyms for the word "finance".

Box 1.
1. commercial_enterprise
2. business
3. business_enterprise
4. management
5. direction
6. economics
7. economic_science
8. political_economy
9. committee
10. commission
11. nondepository_financial_institution
12. minister
13. government_minister
14. assets
15. pay
16. credit

When many concepts are interconnected, semantic networks can be formed (Miller & Fellbaum, 1991). A semantic network, or net, represents knowledge using graphs, where arcs interconnect the nodes. The nodes represent objects or concepts, and the links represent relations between nodes. The network defines a set of binary relations on the set of nodes (Sowa, 2000, n.d.).

The Output

We tested FIRST with some online financial articles appearing in the online edition of the WSJ, such as the Web article shown in Figure 6. FIRST produces output in a template, as shown in Figure 7.

Figure 6. A sample input file for extraction
Figure 7. Output of the extraction process

System Performance

FIRST was evaluated using the standard evaluation criteria: recall, precision, and the F-measure. Recall measures, as a percentage, how many of the embedded facts FIRST is able to find and extract from a target document or collection of documents. Precision measures how accurately FIRST extracts these facts. Both measures are found by comparing FIRST's extraction results with manual extractions of the same documents by domain experts.
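As a rough illustration of that comparison, the following Python fragment computes recall and precision from a set of system slot fills and a set of expert ("gold") slot fills. It is only a minimal sketch, not the authors' evaluation code; the data structures, slot names, and example values are hypothetical.

```python
# Minimal sketch: score system slot fills against expert ("gold") fills.
# The (document_id, slot_name) -> value representation is an assumption.

def recall_precision(system_fills, expert_fills):
    """Return (recall, precision) for dictionaries of slot fills."""
    correct = sum(1 for key, value in system_fills.items()
                  if expert_fills.get(key) == value)
    recall = correct / len(expert_fills) if expert_fills else 0.0
    precision = correct / len(system_fills) if system_fills else 0.0
    return recall, precision

# Hypothetical example: experts fill three slots; the system fills two, both correctly.
expert = {("doc1", "Financial Item"): "sales",
          ("doc1", "Financial Status"): "increase",
          ("doc2", "Financial Item"): "revenue"}
system = {("doc1", "Financial Item"): "sales",
          ("doc1", "Financial Status"): "increase"}

r, p = recall_precision(system, expert)
print(f"recall = {r:.0%}, precision = {p:.0%}")  # recall = 67%, precision = 100%
```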
For example, suppose the template has 20 slots, and the domain experts are able to find answers to fill all 20 slots, but the system is only able to find correct answers for some of them; the recall is then the number of correct answers found, divided by 20. If the system finds answers for all 20 slots, but only 12 are accurately filled, then the precision rate is 12/20 = 60%.

The F-measure combines recall and precision into a single measure. It uses the harmonic mean of precision and recall, which is:

F-measure = 2 * (recall * precision) / (recall + precision)  (Van Rijsbergen, 1979)

We evaluated FIRST by comparing the output of the system with the answers that people found in the same articles. We ran FIRST using WSJ documents in the domain of finance. We measured the system using recall, precision, and F-measure values, as shown in Box 2.

Box 2.
Recall = (the number of items correctly tagged by the system) / (the number of possible items that experts would tag)
FIRST's recall = 85%
Precision = (the number of items correctly tagged by the system) / (the number of items tagged by the system)
FIRST's precision = 90%
F-measure = 2(R * P) / (R + P)
FIRST's F-measure = 87.43%

XML Formatting

To maximize the usefulness of a system like FIRST, it should extract facts and record them in a format that will travel well from one e-business application to another. XML is such a format. Thus, FIRST has been enhanced with an XML converter. To convert an online WSJ corporate earnings article into XML, the user enters the article's URL into a browser. This triggers the FIRST system to semantically process the article. The facts extracted by FIRST are fed as input to the XML processor, which is implemented in Java. Data items are tagged as a set of companies or organizations, along with generic header information, like the title and date, followed by each company's financial details, such as company name, earnings, revenue information, and so forth. A sample input file is shown in Figure 8. Figure 9 shows the user interface page, while Figure 10 shows the results that the XML processor sent back to the browser in XML format.

FUTURE TRENDS

Information extraction from natural language will become increasingly important as the number of documents on the Web continues to explode. This makes timely manual processing ever less feasible as a means of seeking competitive advantage in business. Such processing will continue to be a difficult task and, in fact, one that cannot be perfectly achieved. In addition to the manual, pattern-based rule creation techniques discussed in this article, machine learning algorithms are also being used by some researchers to teach computers to recognize the meanings of new texts based on the known meanings of previously human-deciphered texts. We plan to hybridize our own technique to include machine learning algorithms, to see whether they incrementally enhance the recall and precision of FIRST.

The explosion of Web documents, many of which are different descriptions of the same facts, will also bring about the need to recognize which facts are conceptually equivalent. Craven et al. (2002) refer to this as the multiple Elvis problem. In our current work, we extract facts from multiple Web sources, including not only the WSJ but also Reuters, filter out duplicate facts, and use this information to create a knowledge base that contains only novel facts, as sketched below.
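The following Python fragment is a minimal sketch, under invented assumptions, of how such de-duplication might be organized: each fact is reduced to a crude canonical key, stored once, and every source URL that reported it is remembered. The fact fields, the canonicalization, and the example company and URLs are all hypothetical; this is not the FIRST implementation.

```python
# Minimal de-duplication sketch (hypothetical data model, not FIRST's code):
# each fact is stored once, but every source URL that reported it is kept.
from collections import defaultdict

knowledge_base = {}         # canonical fact key -> fact record
sources = defaultdict(set)  # canonical fact key -> set of source article URLs

def add_fact(company, financial_item, financial_status, url):
    """Add a fact only if it is novel; always record where it was seen."""
    key = (company.lower(), financial_item.lower(), financial_status.lower())
    if key not in knowledge_base:
        knowledge_base[key] = {"company": company,
                               "item": financial_item,
                               "status": financial_status}
    sources[key].add(url)

# Hypothetical reports of the same fact from two different sources.
add_fact("Acme Corp", "sales", "increase", "http://online.wsj.com/example-article")
add_fact("Acme Corp", "Sales", "increase", "http://www.reuters.com/example-article")

key = ("acme corp", "sales", "increase")
print(len(knowledge_base), "novel fact;", len(sources[key]), "recorded sources")
# -> 1 novel fact; 2 recorded sources
```

A real system would, of course, need a much richer notion of fact equivalence than lowercased strings, which is precisely the multiple Elvis problem mentioned above.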
Semantically conflicting facts are identified and quarantined until new information validates or disavows one or the other and the conflict can be resolved. In this approach, the multiple sources of a given fact are remembered (via URL references to the source articles) for verification purposes, but each fact is stored only once.

Figure 8. A document used by FIRST for extraction
Figure 9. User interface page
Figure 10. The XML-formatted output file

Since Web information providers may be slow to convert their existing content into a rich XML format, much of the semantic encoding may have to be done by third-party e-business service providers, or by end users themselves, using browser-side extracting and encoding tools, such as the Thresher tool proposed by Hogue and Karger (2005).

If the Web evolves as expected, online information will be encoded in the XML-based semantic language layers of RDF, RDF Schema, and OWL. Ontologies will emerge in various domains, including those of financial services and reporting. To adapt FIRST to the Semantic Web, we will teach it to convert extracted facts into semantic facts that, unlike XBRL, reference terms in some RDF-based financial ontology. These semantic facts can then be automatically discovered by automated agents on the Web. We will also build our own Web service on top of the FIRST knowledge base, to provide explicit informational functions based on FIRST knowledge.

CONCLUSION

For e-business systems to maximally empower those seeking informational advantage in the fast-moving world of business, these systems must present accurate, timely, and relevant information. Much of this information becomes available quarterly, monthly, weekly, daily, or hourly, in the form of corporate reports or online news articles that are prepared for the human reader. Humans are creative thinkers, but slow and inefficient processors of information. Businesses that can leverage computing technology to process this information more quickly and efficiently should reap a competitive advantage in the marketplace. Manually converting existing textual data into the relations and data structures of today's e-business applications, or into the knowledge networks of tomorrow's Semantic Web, is, again, a costly enterprise for humans. Thus, artificial intelligence, machine learning, and other unconventional approaches must be employed to automatically extract facts from existing Web texts and convert them to portable formats that conventional ...