Automatically Extracting and Tagging Business Information for E-Business Systems

… marketplaces in which they compete. The World Wide Web is a rich but unmanageably huge source of human-readable business information: some of it novel, accurate, and relevant; some of it repetitive, wrong, or out of date. As the flood of Web documents tops 11.5 billion pages and continues to rise (Gulli & Signorini, 2005), the human task of grasping the business information it bears seems more and more hopeless. Today's Really Simple Syndication (RSS) news syndication and aggregation tools provide only marginal relief to information-hungry, document-weary managers and investors. In the envisioned Semantic Web, business information will come with handles (semantic tags) that computers can intelligently grab onto, to perform tasks in the business-to-business (B2B), business-to-consumer (B2C), and consumer-to-consumer (C2C) environments.

Semantic encoding and decoding is a difficult problem for computers, however, because any very expressive language (English, for example) provides a large number of equally valid ways to represent a given concept. Further, phrases in most natural (i.e., human) languages tend to have a number of different possible meanings (semantics), with the correct meaning determined by context. This is especially challenging for computers. As a standard artificial language emerges, computers will become semantically enabled, but humans will face a monumental encoding task. For e-business applications, it will no longer be sufficient to publish accurate business information on the Web in, say, English or Spanish. Rather, that information will have to be encoded into the artificial language of the Semantic Web, another time-consuming, tedious, and error-prone process. Pre-standard Semantic Web creation and editing tools are already emerging to assist early adopters with Semantic Web publishing, but even as the tools and technologies stabilize, many businesses will be slow to follow. Furthermore, a great deal of textual data in the pre-Semantic Web contains valuable business information floating there along with the outdated debris, and the new Web vessels, automated agents, cannot navigate this old-style information. If the rising sea of human-readable knowledge on the Web is to be tapped, and streams of it purified for computer consumption, e-business systems must be developed to process this information, package it, and distribute it to decision makers in time for competitive action. Tools that can automatically extract and semantically tag business information from natural language texts will thus comprise an important component of both the e-business systems of tomorrow and the Semantic Web of the day after.

In this chapter, we give some background on the Semantic Web, ontologies, and the valuable sources of Web information available for e-business applications. We then describe how textual information can be extracted to produce XML files automatically. Finally, we discuss future trends for this research and conclude.

BACKGROUND

The World Wide Web Consortium (W3C) is leading efforts to standardize languages for knowledge representation on the Semantic Web and is developing tools that can verify that a given document is grammatically correct according to those standards. The XML standard, already widely adopted commercially as a data interchange format, forms the syntactic base for this layered framework.
XML is semantically neutral, so the Resource Description Framework (RDF) adds a protocol for defining semantic relationships between XML-encoded data components. The Web Ontology Language (OWL) adds to RDF tools for defining more sophisticated semantic constructs (classes, relationships, constraints), still using the RDF-constrained XML syntax. Computers can be programmed to parse the XML syntax, find RDF-encoded semantic relationships, and resolve meanings by looking for equivalence relationships as defined by OWL-based vocabularies, or ontologies.

Ontologies are virtual dictionaries that formally define the meanings of relevant concepts. Ontologies may be foundational (general) or domain-specific, and they are often specified hierarchically, relating concepts to one another via their attributes. As ontologies emerge across the Semantic Web, many will overlap, and different terms will come to define any given concept. Semantic maps will be built to relate the same concepts defined differently from one ontology to another (Doan, Madhavan, Domingos, & Halevy, 2002). Software programs called intelligent agents will be built to navigate the Semantic Web, searching not only for keywords or phrases, but also for concepts semantically encoded into Web documents (Berners-Lee, Hendler, & Lassila, 2001). They may also find semantic content by negotiating with semantically enhanced Web services, which Medjahed, Bouguettaya, and Elmagarmid define as sets "of functionalities that can be programmatically accessed through the Web" (p. 333). Web services may process information from domain-specific knowledge bases, and the facts in these knowledge bases may, in turn, be represented in terms of an ontology from the same domain. An important tool for constructing domain models and knowledge-based applications with ontologies is Protégé (n.d.), a free, open-source platform.

Ontologies are somewhat static and should be created carefully by domain experts. Knowledge bases, while structurally static, should have dynamic content. That is, to be useful, especially in the competitive realm of business, they should be continually updated with the latest, best-known information in the domain and regularly purged of knowledge that has become stale or been proven wrong. In business domains, the world evolves quickly, and processing the torrents of information describing that evolution is a daunting task. Much of the emerging information about the business world is published online daily in government reports, in financial reports such as those in the Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system database, and in Web articles by such sources as the Wall Street Journal (WSJ), Reuters, and the Associated Press. Such sources contain a great deal of information, but in forms that computers cannot use directly; the facts must therefore be processed by people before they can be put into a database. It is desirable, but impossible for a person and expensive for a company, to retrieve, read, and synthesize all of the day's Web news from a given domain and enter the resulting knowledge into a knowledge base to support the company's decision making for that day.
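As a concrete illustration of the XML/RDF/OWL layering and of the semantic maps described above, the following minimal sketch (assuming the third-party rdflib Python package) declares two invented ontology terms and states that they name the same concept. The namespaces and class names are placeholders, not part of any published financial ontology or of the system described in this chapter.

```python
# A minimal sketch of relating two differently named ontology terms to the
# same business concept with RDF/OWL statements. All URIs are invented.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS

FIN_A = Namespace("http://example.org/ontologyA#")   # hypothetical ontology A
FIN_B = Namespace("http://example.org/ontologyB#")   # hypothetical ontology B

g = Graph()
g.bind("finA", FIN_A)
g.bind("finB", FIN_B)

# Ontology A defines "NetIncome"; ontology B calls the same concept "NetEarnings".
g.add((FIN_A.NetIncome, RDF.type, OWL.Class))
g.add((FIN_B.NetEarnings, RDF.type, OWL.Class))
g.add((FIN_A.NetIncome, RDFS.label, Literal("Net income")))

# An OWL equivalence statement is one way a semantic map between the two
# ontologies could be expressed, so agents can resolve either term.
g.add((FIN_A.NetIncome, OWL.equivalentClass, FIN_B.NetEarnings))

# Serialize as RDF/XML, the layered encoding described above
# (recent rdflib versions return a string here).
print(g.serialize(format="xml"))
```

An agent that understands OWL equivalence can then treat a fact tagged with either term as an instance of the same concept.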
While the protocols and information retrieval technologies of the Web make these articles reachable by computer, they are written for human consumption and still lack the semantic tags that would allow computers to process their content easily. It is a difficult proposition to teach a computer to correctly read (syntactically parse) natural language texts and correctly interpret (semantically parse) all that is encoded there. However, automatically learning even some of the daily emerging facts underlying Web news articles could provide enough competitive advantage to justify the effort. We envision the emergence of e-business services, based on knowledge bases fed from a variety of Web news sources, which serve this knowledge to subscribing customers in a variety of ways, including both semantic and nonsemantic Web services.

One domain of great interest to investors is that dealing with the earnings performance and forecasts of companies. Many firms provide market analyses on a variety of publicly traded corporations. However, profit margins drive their choices of which companies to analyze, leaving over half of the 10,000 or so publicly traded U.S. companies unanalyzed (Berkeley, 2002). Building tools that automatically parse the earnings statements of these thousands of unanalyzed smaller companies, and that convert those statements into XML for Web distribution, would benefit investors and the companies themselves, whose public exposure would increase and whose disclosures to regulatory agencies would be eased.

A number of XML-based languages and ontologies have been developed and proposed as standards for representing such semantic information in the financial services industry, but most have struggled to achieve wide adoption. Examples include News Markup Language (NewsML) for news, Financial products Markup Language (FpML) for derivatives, Investment Research Markup Language (IRML) for investment research, and the Financial Exchange Framework (FEF) Ontology (FEF: Financial Ontology, 2003; Market Data Markup Language, 2000). However, the Extensible Business Reporting Language (XBRL), an XML derivative, has been emerging over the last several years as an e-business standard format for electronic financial reporting, having enjoyed early endorsement by such industry giants as NASDAQ, Microsoft, and PricewaterhouseCoopers (Berkeley, 2002). By 2005, the U.S. Securities and Exchange Commission (SEC) had begun accepting voluntary financial filings in XBRL, the Federal Deposit Insurance Corporation (FDIC) was requiring XBRL reporting, and a growing number of publicly traded corporations were producing financial statements in XBRL (XBRL, 2006).

We present a prototype system that uses natural language processing techniques to perform information extraction of specific types of facts from corporate earnings articles of the Wall Street Journal. These facts are represented in template form to demonstrate their structured nature and converted into XBRL for Web portability.
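To give a rough sense of what such output might look like, the sketch below serializes one extracted earnings fact as a small XBRL-style XML fragment using only the Python standard library. The element names, namespace usage, and context structure are simplified placeholders chosen for illustration; they do not follow an actual XBRL taxonomy or the chapter's own converter.

```python
# A minimal sketch of emitting a single extracted earnings fact as an
# XBRL-style fragment. Element names, context, and unit identifiers are
# simplified placeholders, not a real XBRL taxonomy.
import xml.etree.ElementTree as ET

XBRLI = "http://www.xbrl.org/2003/instance"

def earnings_fact_to_xml(company, period, net_income):
    ET.register_namespace("xbrli", XBRLI)
    root = ET.Element(f"{{{XBRLI}}}xbrl")

    # Context: which company and reporting period the fact refers to.
    context = ET.SubElement(root, f"{{{XBRLI}}}context", id="c1")
    entity = ET.SubElement(context, f"{{{XBRLI}}}entity")
    ET.SubElement(entity, f"{{{XBRLI}}}identifier",
                  scheme="http://example.org/tickers").text = company
    ET.SubElement(context, f"{{{XBRLI}}}period").text = period

    # The extracted fact itself, tied to the context above.
    fact = ET.SubElement(root, "NetIncome", contextRef="c1", unitRef="USD")
    fact.text = str(net_income)

    return ET.tostring(root, encoding="unicode")

print(earnings_fact_to_xml("ACME", "2005-Q3", 1250000.0))
```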
EXTRACTING INFORMATION FROM ONLINE ARTICLES

This section discusses the process of generating XML-formatted files from online documents. Our system, the Flexible Information extRaction SysTem (FIRST), analyzes online documents from the WSJ using syntactic and simple semantic analysis (Hale, Conlon, McCready, Lukose, & Vinjamur, 2005; Lukose, Mathew, Conlon, & Lawhead, 2004; Vinjamur, Conlon, Lukose, McCready, & Hale, 2005). Syntactic analysis helps FIRST to detect sentence structure, while semantic analysis helps FIRST to identify the concepts that are represented by different terms. The overall process is shown in Figure 1. This section starts with a discussion of the information extraction literature. Later, we discuss how FIRST extracts information from online documents to produce XML-formatted files.

Figure 1. Information extraction and XML tagging process

Information Extraction

The explosion of textual information on the Web requires new technologies that can recognize information originally structured for human consumption rather than for data processing. Research in artificial intelligence (AI) has been trying to find ways to help computers process tasks which would otherwise require human judgment. Natural language processing (NLP), a sub-area of AI, deals with spoken and written human languages. NLP subareas include machine translation, natural language interfaces, language understanding, and text generation. Since NLP tasks are very difficult, few NLP application areas have been developed commercially. Currently, the most successful applications are grammar checking and machine translation programs.

To deal with textual data, information systems need to be able to understand the documents they read. Information extraction (IE) research has sought automated ways to recognize and convert information from textual data into more structured, computer-friendly formats, such as display templates or database relations (Cardie, 1997; Cowie & Lehnert, 1996). Many business areas can benefit from IE research, such as underwriting, clustering, and extracting information from financial documents. Previous IE research prototypes include the System for Conceptual Information Summarization, Organization, and Retrieval (SCISOR) (Jacobs & Rau, 1990), EDGAR-Analyzer (Gerdes, 2003), and Edgar2xml (Leinnemann, Schlottmann, Seese, & Stuempert, 2001). Moens, Uyttendaele, and Dumortier (2000) researched the extraction of information from databases of court decisions. The major research organization promoting information extraction technology is the Message Understanding Conference (MUC). MUC's original goals were to evaluate and support research on the automation and analysis of military messages containing textual information.

IE systems' input documents are normally domain-specific (Cardie, 1997; Cowie & Lehnert, 1996). Generally, documents from the same publisher, reporting stories in the same domain, have similar formats and use common vocabularies for expressing certain types of facts, styles that people can detect as patterns. If knowledge engineers who build computer systems team up with subject matter experts who are fluent in the information types and expression patterns of the domain, computer systems can be built to look for the concepts represented by these familiar patterns. Humans do this now, but computers will be able to do it much faster.

Unfortunately, the extraction process presents many difficulties. One involves the syntactic structure of sentences, and another involves inferring sentence meanings. For example, it is quite easy for a human to recognize that the sentences "The Dow Jones industrial average is down 2.7%" and "The Dow Jones industrial average dipped" are semantically synonymous, though slightly different. For a computer to extract the same meaning from the two different representations, it must first be taught to parse the sentences and then taught which words or phrases are synonyms. Also, just as children learn to recognize which sentences in a paragraph are the topic or key sentences, computers must be taught how to recognize which sentences in a text are paramount and which are simply expository. Once these key sentences are found, the computer programs can extract the vital information from them for inclusion in templates or databases.
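To make the synonym problem concrete, here is a toy, hand-written pattern in Python that maps both surface forms onto the same extraction template. The synonym list, pattern, and template fields are invented for illustration; they are not the rules actually used by FIRST.

```python
# Toy illustration of mapping semantically synonymous surface forms
# ("is down 2.7%", "dipped") onto one structured template.
import re

# Verb phrases treated as synonyms for a downward movement (illustrative only).
DOWN_SYNONYMS = r"(?:is down|dipped|fell|dropped|declined)"

PATTERN = re.compile(
    rf"(?P<entity>The Dow Jones industrial average)\s+{DOWN_SYNONYMS}"
    rf"(?:\s+(?P<percent>\d+(?:\.\d+)?)%)?"
)

def extract(sentence):
    """Return a filled template, or None if the sentence does not match."""
    m = PATTERN.search(sentence)
    if not m:
        return None
    return {
        "entity": m.group("entity"),
        "direction": "down",
        "change_percent": float(m.group("percent")) if m.group("percent") else None,
    }

for s in ("The Dow Jones industrial average is down 2.7%",
          "The Dow Jones industrial average dipped"):
    print(extract(s))   # both yield a template with direction "down"
```

Real systems need far richer grammars and synonym resources than a single regular expression, which is exactly why the two approaches discussed next differ in how those rules are obtained.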
There are two major approaches to building information extraction systems: the knowledge engineering approach and the automatic training approach (Appelt & Israel, 1999). In the knowledge engineering approach, knowledge engineers employ their own understanding of natural language, along with the domain expertise they extract from subject matter experts, to build rules which allow computer programs to extract information from text documents. With this approach, the grammars are generated manually, and written patterns are discovered by a human expert analyzing a corpus of text documents from the domain. This becomes quite labor-intensive as the size, number, and stylistic variety of these training texts grows (Appelt & Israel, 1999).

Unlike the knowledge engineering approach, the automatic training approach does not require computer experts who know how IE systems work or how to write rules. Instead, a subject matter expert annotates a training corpus, and corpus statistics or rules are then derived automatically from the training data and used to process novel data. Since this technique requires large volumes of training data, finding enough training data can be difficult (Appelt & Israel, 1999; Manning & Schutze, 2002). Research using this approach includes Neus, Castell, and Martín (2003).

Advanced research in information extraction appears in journals and conferences run by several AI and NLP organizations, such as the MUC, the Association for Computational Linguistics (ACL) (www.aclweb.org/), the International Joint Conference on Artificial Intelligence (IJCAI) (http://ijcai.org/), and the American Association for Artificial Intelligence (AAAI) (http://www.aaai.org/).

FIRST: Flexible Information extRaction SysTem

This section discusses our experimental system, FIRST, which extracts information from financial documents to produce XML files for other e-business applications. According to Appelt and Israel (1999), the knowledge engineering approach performs best when linguistic resources such as lexicons are available, when knowledge engineers who can write rules are available, and when training data is sparse and expensive to find. Based on these constraints, FIRST employs the knowledge engineering approach. FIRST is an experimental system for extracting semantic facts from online documents. Currently, it works in the domain of finance, extracting primarily from the WSJ. The inputs to FIRST are news articles, while the output is the extracted information in explicit form, contained in a template. After the extraction process is completed, this information can be put into a database or converted into an XML-formatted file. Figure 2 shows FIRST's system architecture.

FIRST is built in two phases: the build phase and the functional phase. The build phase uses resources such as the training documents and tools such as a KeyWord In Context (KWIC) index builder (Luhn, 1960), the CMU-SLM toolkit (Clarkson & Rosenfeld, 1997; Clarkson & Rosenfeld, 1999), and a part-of-speech tagger to analyze patterns in the documents from our area of interest.
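As an illustration of the kind of pattern analysis the build phase relies on, here is a minimal KWIC index sketch in Python. It is an illustrative stand-in under our own assumptions, not the KWIC index builder, language-model toolkit, or tagger actually used by FIRST, and the sample sentences are invented.

```python
# A minimal KeyWord In Context (KWIC) index: for each occurrence of a word,
# record the surrounding words so a knowledge engineer can inspect how a
# given fact tends to be phrased across a corpus.
from collections import defaultdict

def build_kwic(sentences, window=3):
    """Map each word to a list of (left context, word, right context) triples."""
    index = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            index[word.lower()].append((left, word, right))
    return index

corpus = [                                   # invented sample sentences
    "Acme Corp reported net income of 1.2 million dollars",
    "Net income at Widget Inc dipped 4 percent last quarter",
]
for left, word, right in build_kwic(corpus)["income"]:
    print(f"{left:>25} | {word} | {right}")
```

Browsing such concordance lines is one way recurring phrasings of the same fact type become visible to the knowledge engineer.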
Through the knowledge engineering process, we learn how the authors of the articles write the stories: how they tend to phrase recurring facts of the same type. We employ these recurring patterns to create rules.

Figure 2. System architecture of FIRST