
70 ASHBURNER & LEWIS

spelling error or as drastic as being a new lexical string. If the change does not alter the meaning of the term then there is no change to the GO identifier. If the meaning is changed, however, then the old term, its identifier and definition are retired (they are marked as ‘obsolete’; they never disappear from the database) and the new term gets a new identifier and a new definition. Indeed this is true even if the lexical string is identical between old and new terms; thus if we use the same words to describe a different concept then the old term is retired and the new is created with its own definition and identifier. This is the only case where, within any one of the three GO ontologies, two or more concepts may be lexically identical; all except one of them must be flagged as being obsolete. Because the nodes represent semantic concepts (as described by their definitions) it is not strictly necessary that the terms are unique, but this restriction is imposed in order to facilitate searching. This mechanism helps with maintaining and synchronizing other databases that must track changes within GO, which is, by design, being updated frequently. Keeping everything and everyone consistent is a difficult problem that we had to solve in order to permit this dynamic adaptability of GO.

The edges between the nodes represent the relationships between them. GO uses two very different classes of semantic relationship between nodes: ‘isa’ and ‘partof’. Both the isa and partof relationships within GO should be fully transitive. That is to say, an instance of a concept is also an instance of all of the parents of that concept (to the root); a part concept that is partof a whole concept is a partof all of the parents of that concept (to the root). Both relationships are reflexive (see below).
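The transitivity described above can be illustrated with a minimal sketch. It assumes a toy graph held in a Python dictionary; the term names are taken from this chapter, but the edges marked as assumed, the `EDGES` layout and the `ancestors` function are hypothetical and are not part of GO or of any GO software.

```python
# Toy graph: each child term maps to a list of (relationship, parent) edges.
# Edges marked "assumed" are illustrative only and are not stated in the text.
EDGES = {
    "DNA binding": [("isa", "nucleic acid binding")],
    "nucleic acid binding": [("isa", "molecular_function")],  # assumed
    "cytosolic small ribosomal subunit": [
        ("partof", "cytosolic ribosome"),
        ("isa", "small ribosomal subunit"),
    ],
    "cytosolic ribosome": [("isa", "ribosome")],  # assumed
    "small ribosomal subunit": [("partof", "ribosome")],
    "ribosome": [("partof", "cell")],  # assumed
    "mitochondrion": [("partof", "cell")],
}


def ancestors(term, edges=EDGES, relations=("isa", "partof")):
    """Collect every term reachable from `term` by following parent edges.

    Only edges whose relationship is listed in `relations` are followed,
    so a caller can restrict the walk to isa or to partof alone.
    """
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for relationship, parent in edges.get(current, []):
            if relationship in relations and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    print(sorted(ancestors("DNA binding")))
    print(sorted(ancestors("cytosolic small ribosomal subunit")))
```

By default the sketch follows both kinds of edge without distinguishing them, which is a simplification; passing `relations=("isa",)` restricts the walk to a single relationship.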
The isa relationship is one of subsumption, a relationship that permits refinement in concepts and definitions and thus enables annotators to draw coarser or finer distinctions, depending on the present degree of knowledge. This class of relationship is known as hyponymy (and its reflexive relation hypernymy) to the authors of the lexical database WordNet (Fellbaum 1998). Thus the term DNA binding is a hyponym of the term nucleic acid binding; conversely nucleic acid binding is a hypernym of DNA binding. DNA binding is the more specific of the two terms, and hence the child. It has been argued that the isa relationship, both generally (see below) and as used by GO (P. Karp, personal communication; S. Schultze-Kremer, personal communication), is complex and that further information describing the nature of the relationship should be captured. Indeed this is true, because the precise connotation of the isa relationship is dependent upon each unique pairing of terms and the meanings of these terms. Thus the isa relationship is not a relationship between terms, but rather is a relationship between particular concepts. Therefore the isa relationship is not a single type of relationship; its precise meaning is dependent on the parent and child terms it connects. The relationship simply describes the parent as the more general concept and the child as the more precise concept and says nothing about how the child specifically refines the concept.

ONTOLOGIES FOR BIOLOGISTS 71

The partof relationship (meronymy and its reflexive relationship holonymy) (Cruse 1986, cited in Miller 1998) is also semantically complex as used by GO (see Winston et al 1987, Miller 1998, Priss 1998, Rogers & Rector 2000). It may mean that a child node concept ‘is a component of’ its parent concept. (The reflexive relationship [holonymy] would be ‘has a component’.)
The mitochondrion ‘is a component of’ the cell; the small ribosomal subunit ‘is a component of’ the ribosome. This is the most common meaning of the partof relationship in the GO cellular_component ontology. In the biological_process ontology, however, the semantic meaning of partof can be quite different: it can mean ‘is a subprocess of’; thus the concept amino acid activation ‘is a subprocess of’ the concept protein biosynthesis. It is a task for the future for the GO Consortium to clarify these semantic relationships while, at the same time, not making the vocabularies too cumbersome and difficult to maintain and use.

Meronymy and hyponymy cause terms to ‘become intertwined in complex ways’ (Miller 1998:38). This is because one term can be a hyponym with respect to one parent, but a meronym with respect to another. Thus the concept cytosolic small ribosomal subunit is both a meronym of the concept cytosolic ribosome and a hyponym of the concept small ribosomal subunit, since there also exists the concept mitochondrial small ribosomal subunit.

The third semantic relationship represented in GO is the familiar relationship of synonymy. Each concept defined in GO (i.e. each node) has one primary term (used for identification) and may have zero or many synonyms. In the sense of the WordNet noun lexicon a term and its synonyms at each node represent a synset (Miller 1998); in GO, however, the relationship between synonyms is strong, and not as context dependent as in WordNet’s synsets. This means that in GO all members of a synset are completely interchangeable in whatever context the terms are found. That is to say, for example, that ‘lymphocyte receptor of death’ and ‘death receptor 3’ are equivalent labels for the same concept and are conceptually identical. One consequence of this strict usage is that synonyms are not inherited from parent to child concepts in GO.

The final semantic relationship in GO is a cross-reference to some other database resource, representing the relationship ‘is equivalent to’.
Thus the cross-reference between the GO concept alcohol dehydrogenase and the Enzyme Commission’s number EC:1.1.1.1 is an equivalence (but not necessarily an identity; these cross-references within GO are for a practical rather than theoretical purpose). As with synonyms, database cross-references are not inherited from parent to child concept in GO.

As we have expressed, we are not fully satisfied that the two major classes of relationship within GO, isa and partof, are yet defined as clearly as we would like. There is, moreover, some need for a wider agreement in this field on the classes of relationship that are required to express complex relationships between biological concepts. Others are using relationships that, at first sight, appear to be similar to these. For example, within the aMAZE database (van Helden et al 2001) the relationships ContainedCompartment and SubType appear to be similar to GO’s partof and isa, respectively. Yet ContainedCompartment and partof have, on closer inspection, different meanings (GO’s partof seems to be a much broader concept than aMAZE’s ContainedCompartment).

The three domains now considered by the GO Consortium, molecular_function, biological_process and cellular_component, are orthogonal. They can be applied independently of each other to describe separable characteristics. A curator can describe where some protein is found without knowing what process it is involved in. Likewise, it may be known that a protein is involved in a particular process without knowing its function. There are no edges between the domains, although we realize that there are relationships between them. This constraint was made because of problems in defining the semantic meanings of edges between nodes in different ontologies (see Rogers & Rector 2000, for a discussion of the problems of transitivity met within an ontology that includes different domains of knowledge). This structure is, however, to a degree, artificial.
Thus all (or, certainly, most) gene products annotated with the GO function term transcription factor will be involved in the process transcription, DNA-dependent and the majority will have the cellular location nucleus. This really becomes important not so much within GO itself, but at the level of the use of GO for annotation. For example, if a curator were annotating genes in FlyBase, the genetic and genomic database for Drosophila (FlyBase 2002), then it would be an obvious convenience for a gene product annotated with the function term transcription factor to inherit both the process transcription, DNA-dependent and the location nucleus. There are plans to build a tool to do this, but one that allows a curator to say to the system ‘in this case do not inherit’ where to do so would be misleading or wrong.

Annotation using GO

There are two general methods for using GO to annotate gene products within a database. These may be characterized as the ‘curatorial’ and ‘automatic’ methods. By ‘curatorial’ we mean that a domain expert annotates gene products with GO terms as the result either of reading the relevant literature or of evaluating a computational result (see for example Dwight et al 2002). Automated methods rely solely on computational sequence comparisons such as the result of a BLAST (Altschul et al 1990) or InterProScan (Zdobnov & Apweiler 2001) analysis of a gene product’s known or predicted protein sequence. Whatever method is used, the basis for the annotation is then summarized, using a small controlled list of phrases (www.geneontology.org/GO.evidence.html); perhaps ‘inferred from direct assay’ if annotating on the evidence of experimental data in a publication, or ‘inferred from sequence comparison with database:object’ (where database:object could be, for example, SWISS-PROT:P12345, where P12345 is a sequence accession in the SWISS-PROT database of protein sequences) if the inference is made from a BLAST or InterProScan computation which has been evaluated by a curator.
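One way to picture an annotation that carries its evidence is sketched below. This is only an illustration under stated assumptions: the `Annotation` class, its field names and the `EVIDENCE_PHRASES` set are hypothetical, the two phrases are those quoted in the text rather than the full controlled list, and the gene product names are invented placeholders.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical subset of the controlled evidence phrases, limited to the
# two examples quoted in the text.
EVIDENCE_PHRASES = {
    "inferred from direct assay",
    "inferred from sequence comparison",
}


@dataclass(frozen=True)
class Annotation:
    """A gene product linked to a GO term, with the basis for the link."""

    gene_product: str
    go_term: str
    evidence: str
    with_object: Optional[str] = None  # 'database:object', e.g. SWISS-PROT:P12345

    def __post_init__(self):
        if self.evidence not in EVIDENCE_PHRASES:
            raise ValueError(f"unknown evidence phrase: {self.evidence!r}")
        if self.evidence == "inferred from sequence comparison":
            # A sequence-based inference must name the sequence it rests on.
            database, sep, obj = (self.with_object or "").partition(":")
            if not (database and sep and obj):
                raise ValueError("sequence comparison needs a 'database:object' reference")


if __name__ == "__main__":
    # Placeholder gene product names; not real database entries.
    print(Annotation("gene-product-1", "transcription factor", "inferred from direct assay"))
    print(Annotation("gene-product-2", "peptidase",
                     "inferred from sequence comparison", "SWISS-PROT:P12345"))
```

Rejecting a sequence-based annotation that lacks its `database:object` reference keeps the inference traceable to the sequence on which it was based.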
The incorrect inference of a protein’s or predicted protein’s function from sequence comparison is well known to be a major problem and one that has often contaminated both databases and the literature (Kyrpides & Ouzounis 1998, for one example among many). The syntax of GO annotation in databases allows curators to annotate a protein as NOT having a particular function despite impressive BLAST data. For example, in the genome of Drosophila melanogaster there are at least 480 proteins or predicted proteins that any casual or routine curation of BLASTP output would assign the function peptidase (or one of its child concepts) yet, on closer inspection, at least 14 of these lack residues required for the catalytic function of peptidases (D. Coates, personal communication). In FlyBase these are curated with the ‘function’ ‘NOT peptidase’. What is needed is a comprehensive set of computational rules to allow curators, who cannot be experts in every protein family, to automatically detect the signatures of these cases, cases where the transitive inference would be incorrect (Kretschmann et al 2001). It is also conceivable that triggers to correct dependent annotations could be constructed because GO annotations track the identifiers of the sequence upon which annotation is based.

Curatorial annotation will be at a quality proportional both to the extent of the available evidence for annotation and the human resources available for annotation. Potentially, its quality is high but at the expense of human effort. For this reason several ‘automatic’ methods for the annotation of gene products are being developed. These are especially valuable for a first-pass annotation of a large number of gene products, those, for example, from a complete genome sequencing project. One of the first to be used was M. Yandell’s program LOVEATFIRSTSIGHT developed for the annotation of the gene products predicted from the complete genome of Drosophila melanogaster (Adams et al 2000).
Here, the sequences were matched (by BLAST) to a set of sequences from other organisms that had already been curated using GO.

Three other methods, DIAN (Pouliot et al 2001), PANTHER (Kerlavage et al 2002) and GO Editor (Xie et al 2002), also rely on comprehensive databases of sequences or sequence clusters that have been annotated with GO terms by curation, albeit with a large element of automation in the early stages of the process. PANTHER is a method in which proteins are clustered into ‘phylogenetic’ families and subfamilies, which are then annotated with GO terms by expert curators. New proteins can then be matched to a cluster (in fact to a Hidden Markov Model describing the conserved sequence patterns of that cluster) and transitively annotated with appropriate GO terms. In a recent experiment PANTHER performed well in comparison with the curated set of GO annotations of Drosophila genes in FlyBase (Mi et al 2002). DIAN matches proteins to a curated set using two algorithms: one is vocabulary based and is only suitable for sequences that already have some attached annotation; the other is domain based, using Pfam Hidden Markov Models of protein domains.

Even simpler methods have also been used. For example, much of the first-pass GO annotation of mouse proteins was done by parsing the KEYWORDs attached to SWISS-PROT records of mouse proteins, using a file that semantically mapped these KEYWORDs to GO concepts (see www.geneontology.org/external2go/spkw2go) (Hill et al 2001).

Automatic annotations have the advantage of speed, essential if large protein data sets are to be analysed within a short time. Their disadvantage is that the accuracy of annotation may not be high and the risk of errors by incorrect transitive inference is great. For this reason, all annotations made by such methods are tagged in GO gene-association files as being ‘inferred by electronic annotation’. Ideally, all such annotations are reviewed by curators and subsequently replaced by annotations of higher confidence.
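The keyword-based first pass described above can be sketched in a few lines. This is a rough illustration only: the `KEYWORD_TO_GO` table is a hypothetical in-memory stand-in for a mapping file (it does not reproduce the format or contents of spkw2go), the keyword entries are invented examples, and `annotate_from_keywords` is not the name of any real tool.

```python
# Hypothetical mapping from database keywords to GO terms. A real mapping
# would be loaded from a curated file; these two entries are invented.
KEYWORD_TO_GO = {
    "DNA-binding": ["DNA binding"],
    "Nuclear protein": ["nucleus"],
}

# Every annotation produced this way carries the same evidence tag,
# so that a curator can later find and review it.
ELECTRONIC = "inferred by electronic annotation"


def annotate_from_keywords(protein_id, keywords, mapping=KEYWORD_TO_GO):
    """Return (protein, GO term, evidence) triples for each mapped keyword.

    Keywords with no entry in the mapping are skipped silently.
    """
    annotations = []
    for keyword in keywords:
        for go_term in mapping.get(keyword, []):
            annotations.append((protein_id, go_term, ELECTRONIC))
    return annotations


if __name__ == "__main__":
    for row in annotate_from_keywords("P12345", ["DNA-binding", "Nuclear protein"]):
        print(row)
```

Because the evidence tag is attached at the moment of annotation, the later replacement of these entries by curated annotations of higher confidence becomes a matter of filtering on that tag.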
The problems of complexity and redundancy

There are in the biological_process ontology many words or strings of words that have no business being there. The major examples of offending concepts are chemical names and anatomical parts. There are two reasons why this is problematic, one practical and the other of more theoretical importance. The practical problem is one of maintainability. The number of chemical compounds that are metabolized by living organisms is vast. Each one deserves its own unique set of GO terms: carbohydrate metabolism (and its children carbohydrate biosynthesis, carbohydrate catabolism), carbohydrate transport and so on. In the ideal world there would exist a public domain ontology for natural (and xenobiotic) compounds:

carbohydrate
    simple carbohydrate
        pentose
        hexose
            glucose
            galactose
    polysaccharide
...