GO Consortium Meeting St. Croix, USVI March 31 - April 2, 2006 [Next meeting: November, 2006, Hinxton or Marseille (to be determined)] GROUP PARTICIPANT LIST BDGP/SO John Day-Richter, Karen Eilbeck, Suzanna Lewis, Chris Mungall, Shu ShengQiang DictyBase Rex Chisholm FlyBase Michael Ashburner, Susan Tweedie GeneDB Pathogen Matt Berriman GeneDB Pombe Val Wood GOA Daniel Barrell, Evelyn Camon GOEO Jennifer Clark, Midori Harris, Amelia Ireland, Jane Lomax MGI Judy Blake, Alex Diehl, Mary Dolan, Harold Drabkin, David Hill NCBO Barry Smith PAMGO Candace Collmer, Trudy Torto-Alallibo Reactome Lisa Matthews RGD Susan Bromberg SGD / CGD Mike Cherry, Karen Christie, Stan Dong, Stacia Engel, Eurie Hong, Marek Skrzypek TAIR Tanya Berardini, Sue Rhee TIGR Linda Hannick, Michelle Gwinn-Giglio WB Ranjana Kishore, Kimberly Van Auken ZFIN Doug Howe Friday, March 31, 2006 Principles of Biomedical Ontology Design Barry Smith, Department of Philosophy, University at Buffalo, National Center for Biomedical Ontology Barry's talked was divided into five sections, the first four covered general ontological issues, while the fifth was devoted to GO, specifically. There are many other ontologies besides GO, each with various pros and cons regarding their construction. The least sloppy ontology is the Foundational Model of Anatomy (FMA), characterized as follows: Pros: * Clear statement of scope, we know what it is: human structural anatomy * Powerful, proper (formal) treatment of definitions - most important feature * Single inheritance is_a hierarchy - an objective good * From the whole organism to the biological molecule Cons: * Some unfortunate artifacts in the ontology deriving from its specific computer representation (Protege) - FMA was built manually, but there came a point where it was too big to be maintained manually. [An ontology should never contain entities dictated by your programmer. Needed to include non-anatomical terms to make it work with the program; this is bad.] Formal Aristotelian Definitions An A = Def. a B which Cs Parent which differentiates in this way. This is also why single inheritance is good; there is only one place to look. For example: cell = an anatomical structure which consists of cytoplasm, etc. There are many circular definitions in GO. These are useless. Not bad, just useless. Every single definition should tell you where in the is_a hierarchy the term belongs. Every definition is an encapsulation; they give you the content you need in a modular form. An ontology has to be designed both for human beings and computers. Terms used in definitions should be simpler than the term to be defined. Many of GO's existing definitions have this problem. (FMA 90,000 terms - it would be a nice discipline for the GO to represent terms in this diagram, because this lets you know if you have terms that don't go anywhere.) The Gene Ontology is characterized as follows: Pros: * Open source * Cross-species * Impressive annotation resource * Impressive policies for maintenance * Has recognized the need for reform Cons: * Poor formal architecture * Poor support for automatic reasoning and error-checking * No cross-ontology relations * Not (yet) transgranular Granularity is one very important challenge for bioinformatics. If data comes in granular packages, then ontologies must be granular. GO doesn't give a basis for reasoning in organized granularity. GO can make big strides forward without changing the content, i.e. distinguish cellular from physiological process. GO deals with definitions in a way that is worse than useless. Logically speaking, they are total nonsense. For example, GO: hemolysis of host red blood cells is defined as: The processes by which an organism effects hemolysis, the lytic destruction of red blood cells with the release of intracellular hemoglobin, in its host. This sort of definition is worse than circular. David: What if there is a parent that defined hemolysis? Barry: This would be fine. GO is now adopting structured definitions which are built out of genus and differentiae. For example, GO: neuronal cell differentiation - differentiation by which a cell acquires features of a neuron. Michael raised the issue that, in the past, Barry has criticized the GO because they have complex terms in the definitions. But, some chemical terms are inherently complex, right? Barry's response: Definitions should use terms that are less complex than the terms itself. You're going to have to produce a computer-friendly version of these definitions. If possible, you'll need to produce a human-friendly version, as well. Judy pointed out that we are also dealing with community vocabularies that were constructed with different concerns. Barry's response: We need to move to a new kind of paradigm. We don't need to allow their terminology to thwart current ontology efforts. Barry: The problem is that the UMLS accepts any group (community) developed ontologies without worrying about their quality. Another example, the National Cancer Institute Thesaurus (NCIT): Pros: * Open source * Broad coverage * Some formal structure (OWL-DL) * OWL-DL: a collection of languages used by WWW, DL maximally expressible formal logic that is still computable * Has realized the error of its ways (a good ontology needs a more expressive language than DL.) Cons * Full of errors (many inherited from UMLS) * Has verbal definitions * Has logically incompatible definitions * Confuses definitions with descriptions Goals: to make use of current terminology best practices to relate relevant concepts to one another in a formal structure, i.e., to support automatic reasoning. Of 37,261 nodes, 33,720 remain formally undefined, while about half have verbal definitions, sometimes more than one, e.g., disease progression. This assumes that people already know what is meant. Three verbal definitions are logically incompatible. For example, cancer is defined as a process and an object. Disease definitions treat them as a condition and a process. Like the GO, the definitions here get confused with descriptions. The NCIT recognizes three classes of plants and three kinds of cells. How best to deal with this? Barry's response: Generally, use of of_a (use of 'other') is bad practice. There are three kinds of cells in the NCIT that do not overlap: * Abnormal cell - top-level class * Normal cell - is a subclass of 'microanatomy' * Cell is a subclass of 'Other anatomic concept' (so that cells themselves are concepts) Neither abnormal or normal cells are types of cells Another example, the UMLS Semantic Network, an upper level ontology for the biomedical domain: Pros: * Broad coverage * No multiple inheritance Cons: * Incoherent use of 'conceptual entities' * Relationship: location_of For example, 'fungus location_of vitamin' - what does this really mean? Every instance of fungus located in some vitamin? Every instance of fungus is located in every vitamin? Should be: every instance of A is such that there is some instance of B. General Ontological Overview Good ontologies require a consistent use of terms, supported by logically coherent (non-circular) definitions and a coherent shared treatment of relations in equivalent human-readable and computable formats. There are Three Fundamental Dichotomies: * Continuants vs. Occurrents * Dependents vs. Independents * Types vs. Instances ONTOLOGIES ARE REPRESENTATIONS OF TYPES, NOT INSTANCES. Types exist to bind different communities, and this is precisely what is missing from the UMLS where the terms are all different and produced by different groups. Types are sometimes called kinds, universals, categories, species, genera, etc. GO has three ontologies: * Molecules, cell components, organisms are independent continuants which have functions. * Functions are dependent continuants that become realized through special sorts of processes we call functionings. * Processes (occurrents) include: functionings, side-effects, stochastic processes Continuants (aka endurants) have continuous existence in time. They can gain or lose parts, i.e. preserve their identity through change, but they exist in toto whenever they exist at all. Snapshots of continuants (you, 3D) Occurrents are never wholly there. They unfold themselves in successive phases and exist only in their phases. Videos of occurrents (your life, 4D: 3D + time) How should complexes be treated? There are special problems that arise in the world of molecules..... Dependent entities require independent continuants as their bearers, e.g., there is no grin without a cat. Independent continuants are such things as organisms, cells, molecules, and environments. Dependent continuants are things like qualities, function, or spatial region. All occurrents are dependent entities. They are dependent upon independents. The basic ontology has three things: * Independent continuant: component * Dependent continuant: function * Occurrent: process Two families are occurrents: functioning vs. side-effects, stochastic processes. Michael asked if it makes sense to have a process of temperature regulation as instantiated in a person? David's response: that would be a great way to define a biological process in the GO. Midori asked if independent continuants can represent entities when something is removed instead of added. Barry's response: Yes, but.... Some dependent continuants are realizable, such as 'expression of a gene', 'applications of a therapy', 'course of a disease', or 'execution of an algorithm.' Functions vs. functionings. The function of your heart = to pump blood into your body. The OBO Foundry There is a movement in the NIH to try to avoid waste of data and to try to encourage reuse of data. There are very generously funded projects to try to serve this need and this is money down the drain. They're not proactive; they accept the data that is thrown at them. You will never make data interoperable unless you actively pursue that. Old strategy: UMLS - rooted in faithfulness to the ways language is used by different communities. Each community created their own terminology and structure independently. We need common, enduring ways of organizing biomedical data. We need preestablished, reference ontology upon which groups can draw. This can indeed help make data interoperable. New strategy: OBO foundry - preemptive regimentation of language, structure, and format. We are making progress on the first two. Draft version can be found on: http://obofoundry.org/ The goal is a step-by-step evolution. In time, the OBO foundry ontologies will be so recognizably good that groups will enforce use of specific terminology in reporting results. The OBO foundry will be initiated by a subset of ontologies who agree to a core set of principles. OBO Foundry * OBO-UBO * GO * SO * RNA ontology * PATO * FuGO (Functional Genomics Investigation Ontology) * Some others The OBO foundry will consist of two kinds of ontologies: a reference ontology and an application ontology, like NCIT or the FuGO ontology. The reference ontologies will provide a repertoire of database schemas to use. Criteria for inclusion: the ontologies must be open, must agree to collaborate, must have common formal language, identifier space, versioning, clearly delineated content, textual definitions, well-documented, and a plurality of independent users. Further criteria will be added over time to begin to improve quality. The main non-trivial step forward is the adoption of methodology of shared, coherent, defined definitions which promotes quality control, guarantees automatic reasoning, and yields direct connection to temporally indexed instance data. Types and Instances We've now seen a distinction between types and instances: science text vs. clinical document, man vs. Michael. Instances are not represented in an ontology. We're interested in generalizations. Nevertheless, instances must still be taken into account. Instances are divided into types which are arranged hierarchically. Once you've got the types in order, the instance becomes less important. But....they should always be in the back of your mind. Each node in an ontology should consist of a term, an identifier, synonyms, and a definition. An ontology is a computable representation of biological reality. When people talk about concepts, they are expressing a fundamental confusion. There are terms in your ontology and types in reality. Nodes are connected by relationships. We're trying to capture reality; that's why we curate the scientific literature. We want to teach the computer to reason about biological reality like we do. The computer can't read science texts, so the annotator is finding a way to teach the computer how the terms fit the instances. The computer should have up-to-date knowledge. There are some rules on types. Don't confuse types with words, concepts, ways of getting to know types, etc. Once you have a good word for a type, you should use that term forever more. John asked what the problem is with concepts. Barry's response: Concepts encourage inward thinking. is_a and part_of should be used the same way in all ontologies referring to the same types and relations. There is no type non-mammal, non-membrane, other metalworker in New Zealand. Ontology of terms is NOT equal to a logic of terms, e.g., there are no conjunctive and disjunctive types * anatomic structure, system, or substance * musculoskeletal and connective tissue disorder * rheumatism, excluding the back Which types exist in reality is not a function of our knowledge. Rex asked about the musculoskeletal example. What if there was a disorder that affected both musculoskeletal and connective tissue? Would you add the word 'both' to the term? One point we need to remember is that we wouldn't deny that something exists, but we need to think about how these terms are put together to avoid confusion. The solution here: 'disorder affecting both musculoskeletal and connective tissue.' The word both makes a difference. It's never wrong to be painstakingly literal. John asked what the basis is for saying something is or is not a type. Precoordination vs. post-coordination. We can observe instances of a type, and we should examine terms in the ontology with and/or to see if they really represent a type. Should we provide an AND search union or intersection? What do people really want? In the world of instances, there are clear boundaries, but there are also continuums, e.g., temperature, color, bowls, cups. This means that there is a necessary element of conventionality to how you divide the continuum. Multiple Inheritance All multiple inheritance can be unpacked into clear, separate hierarchies. There are technologies for normalizing hierarchies, and you can generate any combination of normalized hierarchy. Using breast cancer as an example: breast cancer can have a parent term neoplasm and a parent term disease of the breast. These could be split into two ontologies: one classified as location, the other as manifestation of disease. Most of the 'diamonds' can be cleanly unpacked into two different hierarchies, and then one can map between the two; this is the way this should be done. Problems with multiple inheritance: * source of errors * encourages laziness * serves as obstacles to integration with neighboring ontologies * hampers use of Aristotelian method of definitions Compositionality The meanings of compound terms should be determined by the meanings of the simpler terms. Common rules allow alignment with other ontologies. There are 15 such rules, which can be sent around, if wanted. If we have rules stated, then it's easier to train, avoid mistakes, and classify. But most of all, if all the ontologies use the same rules, then those ontologies become automatically aligned with each other. The Gene Ontology is useful because lots of people use it. We want lots of people to use the GO and thereby use the cell ontology, the SO, etc. OBO Relation Ontology The relation ontology consists of formally defined relationships. An ontology comprises terms with well-defined relationships and good definitions. is_a Correct definition of is_a: Every instance of A is an instance of B. A is_a B = def for all x, if x is an instance of A, then x instance of B. Occurrents: the is_a definition works fine for occurrents. Continuants: only continuants change. This means that continuants need to take time continuously into account. Every instance of A at time t is an instance of B at time t. This is being a little more careful about time. In the ontology, we're only ever going to say things about time and instances. part_of part_of as a relation between types is more problematic than is standardly supposed. There are two kinds of part_of: relations between types and relations between instances: human heart part_of human and Mary's heart part_of Mary. This is incredibly important if you want to avoid mistakes. All-some structure: all instances of A are instance-level parts of some instance of B. This works in the untensed sense of processes. But continuants needs to take some account of time. How to use the OBO Relation Ontology? The all-some form gives us cascading inferences because if you have all-some form, whichever A you choose as the first term, the instance of B of which it is a part will be included in some C, which will include as part also the A with which you began. The same principle applies to the other relations in the OBO Relation Ontology. What about something that occurs only sometimes as part_of a process? Barry: That's okay. There are three kinds of relations: between types, between instances and types, and between instances. You need to keep these three kinds of relations always in mind. There's no constraint on single or multiple inheritance for part_of relations. You can't define everything. You have to take some terms and relations as primitive. We now need to deal with continuity. The human body is very highly connected. This means that you have parts which have no joints between them. This means that there are physical boundaries, but that there are also flat boundaries that are not physical boundaries, but boundaries that we create by fiat. There is continuity, attachment, and adjacent. Practically the only things in the body not connected to other things in the body are blood cells. Sample relationships: * attached_to * synapsed_with There is also attachment, location, and containment. In order to understand containment, you need to understand the different kinds of holes. Containment involves relations to a hole or cavity, e.g., a hole that you dig in the ground has a flat lid, your mouth has a fiat boundary. This is why you need to distinguish between instances and type relations. A continuous_with B is different for instance and type. continuous_with is not always symmetric. Every lymph node is continuous with some lymphatic vessel. adjacent_to is also not symmetric. This is important because there is an expectation of symmetry. transformation_of: child become adult, pre-RNA become mature RNA. Always think about the order! derives_from: zygote derives_from ovum and sperm. Two instances become one instance. Budding and capture are two other relations that need to be considered. A biological example of capture might be eating (not general agreement about this). There is a suite of defined relations between types: foundational, spatial, temporal, participation. To be added are: lacks, dependent_on, quality_of, functioning_of. Alex pointed out that lacks relates an instance and a type, e.g,. this fly lack wings. Is this explicit for this relation? Barry: Yes. What would be an example of quality_of? Barry: temperature. We must choose the relations that we can assert. Comment: There are lots of instances in biology where there are multiple ways to get to a particular state. How do you address that in an ontology? We're going to need a pathway ontology at different levels of granularity. We will need much cruder ontologies for pathways that will take care of every level of granularity. The Gene Ontology The Gene Ontology is composed of three ontologies, or so it thinks, with three central questions that need to be addressed: location, function, and process, and three granularities: cellular, molecular, and organ + organism. GO has cells, but it does not include terms for molecules or organisms within any of its three ontologies, except e.g., GO: xxx host which was a hack. OBO-UBO will provide top level terms, so you would choose the term host from the UBO. Host is kind of a relational term, but UBO has the facility to talk about this. Instance - a particular entity in spatiotemporal reality. Type - A general kind instantiated by an open-ended totality of instances which have certain qualities and propensities in common of the sort that can be documented in scientific literature. Biological process instance - A change or complex of changes on the level of granularity of the cell or organism, mediated by one or more gene products. Molecular function instance - The propensity of a gene product instance to perform actions, such as catalysis or binding, on the molecular level of granularity. Molecular function execution instance, aka "functioning": a process instance on the molecular level of granularity that is the result of the action of a gene product instance. Are the relations between functions and processes a matter of granularity? Molecular functions are defined as the building blocks of biological processes, but you do not assert part_of between ontologies. Michael pointed out that this was a very conscious design choice. You must get relations between molecular and higher level terms correct. What does function mean? To say that an entity has a biological function means that it's part of an organism and has a propensity to act reliably to contribute to survival. A better definition would be: function means it's part of an organism and has a disposition to act reliably in such a way as to contribute to the organism's canonical life plan. Does this exclude the idea of abnormal? What is canonical vs. variance vs. pathological? There are biological functions and there are molecular functions? Are all molecular functions biological functions? The function of the heart is to pump blood. But you can have malfunctionings, side-effects, accidents, and background stochastic activity. (Examples?) These things exist on all levels of granularity. If you do not have a prototype of good function, then you do not have a function. Where you have a function, then you have a scale: heart, healthy heart, unhealthy heart. What about cases like sickle cell, where there is positive selection, but only in some environments? Response: We will need an ontology of biological environments, niches, habitats. A 'reliable' term has built into it the idea of a certain environment. The sickle cell example is really about two different functions: oxygen carrying and malarial resistance. Can variant be thought of as an intermediate between canonical and pathological? For example, most left lungs have only two lobes, but three lobes is a variant. How does this relate to instances and types? There are no pathological functions. Malfunctions lead to pathology. We're going to have to recognize variant functions. Why did we introduce variants? Because functions always come with a scale. It only makes sense to talk about functions with a prototype function. Functions are associated with certain characteristic process shapes. If it's true that there is always a prototypical end to the function, then it follows that there are no bad functions. Hypothesis: there are no 'bad' functions. It is not the function of an oncogene to cause cancer. Oncogenes were in every case proto-oncogenes with functions of their own. They become oncogenes because of bad (non-prototypical) environments. Comment that even using the terms oncogene and proto-oncogene involves pathology. Response: Talking about function is part and parcel of talking about pathologies. Functions are non-pathological. Is this true for molecules? Yes. Is it true on all levels of granularity? Does it make sense to talk about a pathological molecule? An oncogene would be an example. Comment: What about hypersensitivity? At the cellular level it results in cell death. This is good for the organism, but is it good for the cell? Response: An immune response is a response at the biological level which includes functions that may be good for the organism, but not good for the cell. Some things may even be good for the population on the whole. But, we need to recognize that there is a huge amount of thought in population genetics and we should be very careful about how we speak about this. Are there any exceptions to the definition of molecular function? Response: I don't believe there is an exception for molecular functions. They always make a contribution to the canonical life plan. There is frequent discussion about the use of evidence from pathological molecules to inform what is represented in the GO. Like the FMA, so the GO is a canonical ontology. That's why thinking about a variance could be important. What does canonical mean? What does normal mean? The gene ontology is a canonical ontology, a computational representation of the ways in which genes normally function. You need to think carefully about what this means. The FMA is a canonical representation, a computational representation of types and relations. Granularity There are two kinds of causality: successive causality Each stage in the history of a disease presupposes the earlier stages In this case, we need to reason across time, track the order of events in times. We need pathway ontologies on every level of granularity. We especially need these things for the disease level. simultaneous causality Illustrated by Boyle's law. Two things happen simultaneously. It's not about events, it's about changes. (compare Boyle's law: a rise in temperature causes a simultaneous increase in pressure. Networks are continuants. At any given time, there are networks existing in the organism at different level of granularity. Changes in one cause simultaneous changes in all the others. Generally speaking, when you're dealing with organisms at coarser grains, you're dealing with networks at higher levels. We need ontologies of networks at the molecular and at higher levels, e.g., digestive system - simultaneous causality. But there is a granularity gulf. The way data is collected, i.e. For most existing data sources, there is a fixed, single granularity. However, many clinical phenomena cross granularity. The GO consists of three ontologies: MF and BP are dependent, while CC is independent. If we normalize, then we realize that we're missing the independent bearers, such as organism and complex. Judy raised the issue that we may need separate ontologies for cell and physiological processes. Are many existing terms cross-products of these two? Yes, consider cell differentiation. But aren't cellular processes dependent upon molecular function? Barry's response: Harping on about granularity. Do the coarser grains as much service as the finer grains. That's why we need a disease ontology. GO has cellular components, but we've never had anatomy terms. Normalization of granular levels is key: molecule, molecular function, molecular process cellular component, cellular function, cellular process organism, organism-level biological function, organism-level biological process What about hosts? Would we need to go up a level? Response: Host is an organism, is this not included in the GO? Does there need to always be an example, or an instance, of each level? There are likely to be situations where the molecular function and the process are the same thing. Annotation Methodology Scientific curators use experimental observations reported in the literature to link gene products with gene ontology terms - actually observe instances. Is it true that they're always looking at typical instances? The annotations yield a slowly growing map of biological reality. If done properly, this institutes a virtuous cycle, and the bigger and better the ontology becomes. What we're doing when we're annotating: an experiment is an instance from which we infer facts about types. But, we also learn about the instances acted upon. The instances described are typical in that sense that there's nothing interfering with them that would mess up the conclusions, i.e. this is not an artifactual experiment. Experimental records document a variety of such instances; they document the existence of real-world molecules that have the potential to execute. Annotations will help determine what is typical and what is not. We have a glossary now: Instance - a particular entity in spatiotemporal reality. Type - A general kind instantiated by an open-ended totality of instances which share certain qualities and propensities in common of the sort that can be documented in the scientific literature. Gene product instance - Generated by expression of a DNA sequence, that plays a role. Biological process instance - A change or complex of changes on the level of granularity of the cell or organism, mediated by one or more gene products. Cellular component instance - Molecular function instance - The propensity of a gene product to perform actions, such as catalysis or binding, on the molecular level of granularity. Types are trivial once you know the instances! Molecular function execution instance, (aka, "functionings"): a process instance on the molecular level of granularity that is the result of the action of a gene product instance. Type - a type of molecular function execution instance (aka, a type of functioning). Should 'activity' be dropped from Molecular Function terms? Pros: * functions are never activities (they are propensities, potentials) * many functions are never realized * current remedy is ugly, and not universally acceptable structural constituent of bone Cons: * much renaming work would be needed to advance clarity As soon as you try to state carefully what annotators are doing, the activity term messes things up. Suzi pointed out that this illustrates that additional relationships need to be added in order to go from function to process. Barry's response: One problem that needs to be addressed is that if you have a relationship between molecular function and biological process, then it will look like you are appeasing yourself. Jane also pointed out that there is a conceptual problem in that there is a mixture of functions and activities in the GO, e.g., catalysis vs. transcription factor. Rex explained that one of the reasons that 'activity' was added was to distinguish between the gene product and the activity, e.g., DNA polymerase vs. DNA polymerase activity. Barry's response: activities should go into a molecular activity ontology, while functions should go in a molecular function ontology. We need to create a new level of clarity in the way people think and speak. Can we replace activity with function, since they're all functions when they're under that branch of the tree? If we searched and replaced with function, does this work? Sometimes the user community mistakes gene products for function. If a potential user of the GO has a protein which has a molecular function characterized as alcohol dehydrogenase, but is not known as such, isn't this confusing to people? One possible solution would be to have names of molecules classified according to function, delete the word 'activity' and then make sure there are no strange terms. Also, do we want a molecular functioning ontology? Does this parallel a molecular function ontology? Judy: We cannot change the nomenclature of gene/gene products, this is fixed. Alex: I am worried about removing 'activity' since it is commonly used. John: This is just lazy grammar, 'activity' is what it does, not what is has, can be changed. David: This is more common in the biochemical world, than anywhere else. ACTION ITEM #1: Very seriously consider removing the word 'activity' from the molecular function terms and consider renaming the molecular function ontology. Principles for Building Biomedical Ontologies : A GO Perspective David Hill, MGI David's talk was centered around the idea of taking the principles that Barry talked about and discussing how we've applied them in GO. Most of the big arguments that GO has had addresses issues that Barry raised. The principles for building a good biomedical ontology are as follows: * Univocity - word means the same thing * Positivity - not a membrane is not a good term * Objectivity * Single Inheritance * Definitions - formal, written definitions * Basis in Reality * Types vs. Instances * Ontology Alignment The Challenge of Univocity One of the first problems that GO dealt with is that different people in different communities use words differently. So, how does a computer know what people are talking about? In GO, we dealt with this by creating primary terms and many characterized synonyms. Another challenge is illustrated by the question, what does a bud mean? The answer is different for vertebrates, plants, and yeasts. This is now the inverse of what we had before: people use the same words to describe different things. So how did GO deal with that? The computer doesn't know the difference, so that was when we decided to create the sensu designations for terms. These describe the term in the case of metazoans, fungi, etc., as the biologists in the field think about it. Synonyms are incredibly important, as we had to have terms that biologists will search on. One question that arose: how to represent the function of something like a tRNA? It's term: triplet codon-amino acid adaptor activity. But no biologist is ever going to search on this term. That's why we created tRNA as a synonym. What would happen if we removed activity from this term? Not all tRNAs could then be annotated to it. The Challenge of Positivity Sometimes absence is a distinction in the biologist's mind. Some organelles have membranes around them, but some, like centrioles, do not. So the GO came up with two types of organelles: membrane and non-membrane bound. Note the logical difference here between 'non-membranous bound organelle' and 'not a membrane bound organelle.' We don't want the latter, as it signifies everything other than what we're talking about. Alex: Biologists would understand 'non-membrane bound organelle'. Rex: A better term is needed. The Challenge of Objectivity For some gene products, we have no idea what they do. Database users want to know if we don't know anything (exhaustiveness with respect to knowledge) so the unknown terms were created. Annotating to these terms means that an annotator looked at all the literature, and we don't know what the function is. But consider 'G-protein coupled receptor, unknown ligand' - there is no difference between this term and the parent term 'G-protein coupled receptor.' In this case, instead of using a term that incorporates 'unknown,' we should annotate to the parent term. How should we annotate genes that we assume have some function, but we don't know what that is? This will be especially critical for annotating the reference genomes. We will want to annotate to molecular function and use the ND evidence code (see discussion on Sunday). Single Inheritance This is something that is going to be really hard for the GO, due to incompleteness in the graph, as well as other reasons. GO has many is_a diamonds. Technically the diamonds are correct, but they could be eliminated. This would involve choosing one term as the primary parent, while the other would be derived. What do these pairs have in common? * Locomotory behavior vs. larval behavior * GTPase regulator activity vs. enzyme activator activity * Non-membrane bound organelle vs. intracellular organelle All of these terms differentiate from the parent by a different factor - type of behavior vs. what is behaving. One side is a type of behavior, the other is location. Conceptually, we can insert an intermediate groups term, a descriptive behavior term, behavior of a thing. This is no better but allows you to figure out why you have this diamond. Chris: There are more complex diamonds in the GO, where the result is not a cross-product. Alex asked why you couldn't reason with this type of structure? As a biologist, it seems reasonable. Could we just annotate to two different terms: larval behavior and locomotory behavior? Does this get users to the same information? Would doing this make it more difficult for annotators to always know all the right terms to select? Judy pointed out that there is a danger here in having dependency on annotation to represent a term that maybe should be in the ontology. There is also some feeling that annotating to each term separately does not capture all of the same information. Having Good GO Definitions There are two types of definitions: * definition written by a biologist (the one we all see when it's written out) * definition given by where the term is placed in the graph We all strive to make the former definitions necessary and sufficient. The latter are also necessary, but they are not often sufficient. The set of necessary conditions is determined by the graph, and the graph is only considered a partial definition dependent upon placement (selecting the appropriate parents) and relationships. Requests for new terms, thus, need to consider all relationships. For example, a proteasome complex, which may be part_of the cytosol, part_of the endoplasmic reticulum and part_of the nucleus. Does this mean that logically is has to be a part_of each place all the time? Chris' response: It's valid to not be a part of these parents. But, if it's present in the ER, cytosol, and nucleus and moves between these places, is that still valid? No. It could be wrong if it's not part of each thing. For example, red blood cells wouldn't have the nuclear proteasome complex. If the same complex is found in two places (no difference in the composition of the complex), are we double annotating if we annotate to both? The feeling is that curators should annotate to both types. At present, the definition means that if something is a cytosolic proteasome complex, then it is, at some point, part_of the cytosol. Barry stated that is is fine for an is_a relationship, but part_of always means part_of. Chris: We're not making that strong a statement with our part_of complex. Barry: Are you talking about the part_of relationship in the relationship ontology of the GO? Which definition of part_of is used? How do you account for motion? Earlier when this issue was discussed, we said that it had to be part_of sometimes. Is this true for all ontologies? What about the cytokinesis example? Where should we put it-part of mitosis, part of cell cycle - we know that this is not true for all situations all of the time. The Importance of Relationships We annotate to regulatory subunits of catalytic activities. But, we didn't know how to express the regulation of that catalytic activity, so we need a new relationship for regulates. This new relationship type would address the idea that this gene product impinges upon, but doesn't actually have the catalytic activity. There was some discussion of the extent to which the regulates relationship should be used. As an example, suppose that you have a wise father who regulates the behavior of a child who, in turn, regulates the behavior of her pet dog. Does the father also regulate the behavior of the dog via the daughter? What about a kinase regulating a transcription factor that regulates some other activity? Or MAPK signaling pathways - does a MAPKKK regulate a MAPK? There are probably many examples in biochemistry where A regulates B which regulates C, but we only know about A and C. The real issue here, then, is what is the definition of regulation: directly regulates or indirectly regulates? GO Textual Definitions: we strive to created similarly structured, normalized definitions, e.g., glial cell differentiation. For process terms, David would like to propose that every process is defined by a beginning and an end. Does this work for all situations? Can we make extremely explicit process definitions? Would this lead to a lot of new child terms? New sensu terms? Basis in Reality GO is designed by a consortium and represents a consensus. Large-scale developments are the result of compromise and annotators are constantly looking for examples in the literature. If we align different ontologies, such as the cell type or chemical ontologies, with GO, then that permits generation of consistent and complete definitions - formal definitions with necessary and sufficient conditions in both human readable and computer readable forms. Types vs. Instances What are the instances we are dealing with in our work as ontology builders and scientific curators? The first question to answer is: What knowledge are we trying to capture? We're interested in understanding how genes contribute to the biology of an organism. What is meant by gene products? We have gene products types and gene product instances, e.g., shh in this cell in this mouse. The instance is the actual molecule that can be physically isolated and takes up space. What do the experiments do? Experiments are designed to study the properties of gene products. How do we represent that accumulated knowledge? We connect what the wet bench biologists see to our understanding of biology in the ontology. What are the instances? They're in the labs. We use what experimenters report about those instances. How do we connect instances with knowledge representation in the GO? Some examples: * A molecular function annotation using IDA Paper reports assaying the activity of retinoid dehydrogenase Annotation made: retinal dehydrogenase activity What are the instances? There are gene products instances, molecular function instances revealed by that assay, and instances of molecular function associated with an instance of the enzymes. Conclusion: If I have this molecule, it has the potential to have this function. * Using IMP Gucy2c activity goes away in mutant cells - observation was that when they had this molecule, they had cGMP, when it was gone, they didn't have it. * Using IGI Don't have enzymatic activity when animals are doubly mutant for subunits of the enzyme. Single mutants have activity intermediate between wild-type and double mutants. The instances are the gene products, hexA and hexB, and the molecular function instances. We say that these molecules have the potential to execute the activity. Sue commented that the assertion being made in each of these cases is slightly different. For IDA, the annotation is clear, but the latter cases suggest that the gene products are involved, but is their role direct or regulatory? Should we use the contributes_to qualifier in these cases? There is agreement that the evidence code does tell you something about the quality of the inference and what types of caveats a user should expect. But, can you compute that aspect of the evidence code? Another example is that of molecular function terms, such as protein binding, with the IPI evidence code. What are the instances? Flag-tagged molecules, molecules to which an antibody exists, execution of binding, potential to execute binding, potential to execute specific binding. Some discussion about whether protein binding is a good function term. Conclusions? Process annotation using IMP: * Observation is that in the presence of Shh you get a specific process of heart development. Instances are functional and non-functional molecules of Shh, development of a mouse heart, functioning of a Shh molecule. This process is the result of gene products executing their functions. Process annotation using IPI: * IPI is often used to annotate to MF: protein binding * In this example, a catenin-interacting protein was annotated to BP: Wnt receptor signaling pathway. * There was no instance of a function during Wnt receptor signaling, so where did the missing information come from? * An inference was made based upon previous annotations. * This represents a chain of inferences. How far do we want to take it? Much discussion ensued on the chain of inferences. Some felt that IPI for protein binding would be the only annotation that could be made. Others felt that the annotation depended on the context of the experiment and what the authors state. If the authors claim that this [participation in a process] is true based upon their IPI experiment, and the experiment is in a peer-reviewed journal then it's okay for an annotator to make that annotation. Otherwise, the correct evidence code would be IC. In the signaling literature, it is not uncommon to find that authors do co-IP experiments to show that a protein of interest in involved in a particular signaling cascade. But how far do we take the inference? When do we take the inference? Who makes the assertion? Does that matter? Does it matter what the community assumes to be true? If, for example, protein A bind protein B and protein B is involved in two completely different processes, would you annotate to both? No. If the binding experiment is the only one performed in the paper is that enough? If not, do the other experiments influence the process annotation? If we comprehensively annotated\ genes, can we make the same conclusions as wet bench biologists? Can we make a rule about if and when we will use IPI for biological process annotations? A check of the database indicates that almost all of the core databases have used IPI for process annotations. The conclusion was that we should think about this more and come up with examples of when we would or would not make the inference to inform our decision about making a rule for IPI and process annotations. This would be an agenda item for the coming GO annotation camp. Saturday, April 1, 2006 Discussion of Goals in the New Grant Proposal Aim 1 - Ontology Development Suzi Lewis, BDGP and David Hill, MGI What is our aim? Suzi Lewis, BDGP As NIH program director Peter Good has stated, the heart of the project is still the ontology; that's the resource that people still use. Thus, GO will maintain comprehensive, logically rigorous, biologically accurate ontologies paying close attention to both content and relationships in these ontologies. Content Development David Hill, MGI Comprehensive annotation can drive ontology development. With this in mind, David presented an example of ontology development using the process of blood pressure regulation, an important and well-funded area of research. In mid-November, when David began focusing on curation of genes having to do with blood pressure regulation, there were three relevant terms in the GO: regulation of blood pressure, and positive and negative regulation of blood pressure. To further develop this part of the ontology, he first got a textbook on medical physiology and proposed a basic structure in a SourceForge item that included 43 new terms. Then, the refinement process began... A textbook was not sufficient for describing all aspects of blood pressure regulation and from reading published papers, he was able to add more terms, as well as synonyms for many terms. Work continued; he read more and more papers and added more and more terms. One interesting aspect that came out of this was that not all new terms added had to do with blood pressure regulation, because the genes involved in regulating blood pressure affect other processes as well, such as water consumption and kidney function. Thus, in the process, the blood pressure node improved and other nodes improved, as well since you're now starting to understand genes as you expand nodes in the graph. In the end, most new terms added were new leaf nodes, with relatively few changes to the actual structure of the graph. Summary of ontology development steps: 1 Consult a textbook. 2 Identify papers. 3 Read papers. 4 Enter SourceForge items. 5 Modify GO. 6 Curate papers. Good news: 14 new genes initially annotated to blood pressure regulation. Now, 23 genes annotated, with five genes comprehensively annotated. Along the way, other genes got annotated and not all annotations had to do with blood pressure, which is good for clinicians. Bad news: There is still an outstanding SourceForge item about the terms, as Amelia found inconsistencies in some terms and definitions. Although these issues are relatively minor, they're still there and need to be addressed and fixed. When proposing new terms, it is important to think as an ontology developer, not as an annotator. Since the issues raised did not get in the way of curation, it was too easy to switch over to the role of an annotator and just go ahead and curate the papers, without tidying up the ontology issues. How can we prevent this type of short-circuiting? One way would be to have some assignment of responsibility for ontology development. The ontology developers could point out logical issues, while expert curators deal with major logical issues. Minor issues could then be addressed as concrete proposals that could perhaps be presented in a way where the curator could quickly accept or reject the proposal. Any final decisions would then be based on whether the ontology really represents the biology. Some discussion on this: Does it become the responsibility of a curator to assume the role of an ontology developer when they recognize that new terms are needed? To some extent yes, as we need to keep in mind that ontology development is key to the GO and dependent upon feedback. We have become a more cohesive group and are really at a point where ontology development and annotation intersect, leading to shifting responsibilities. This may mean that we need some changes in how we interact and do things and that we may need to address any problems that exist in how we manage ontology development. We should be creative in how we think about this; maybe the SourceForge method needs improvement, maybe there are other options besides SourceForge? Harold: When you have the interest and the background, that's when the interest to develop the ontology kicks in. Lisa: More collaboration between groups is good, we are doing pathways related to human disease. Rex: We should identify areas of interest or holes in the ontology. Suzi: We need more structure in how this is done. There may be some value in alerting people that a particular area of the ontology needs attention. That way, people with expertise in the field can provide insight, since it is not always clear to an annotator when some terms might be missing, especially if the area under development is not congruent with their background and/or interests. It may also be necessary to have someone at each database be responsible for ontology development. Identifying such people would give the whole process more structure. In addition, the editorial office could provide a timeline of projects outlining different areas of development. Annotators could subscribe to specific mailing lists about these issues. Bottom line: there is shared responsibility for ontology development!! Editorial office had a poster about time lines of ontology development projects. (PRE)ACTION ITEM #2: Need to work out the balance of power/responsibility between the GO office and annotator/ontology developers to complete SourceForge items. David to the GO editorial office: Do you want to be the enforcer? Jane: We do hound people to finish an item. Judy: We need a systematic way to bring closure to an item. Aim 2 - Reference Genomes Rex Chisholm, dictyBase and Judy Blake, MGI The second aim of the grant built on the idea of ontology development. In light of the criticisms of the previous proposal, it was important to pay attention to the biology and pay strong attention to how the ontology was used. What is valuable to biologists about GO? There is an enormous effort to sequence more genomes. There are 100 billion letters of sequence in GenBank representing 160,000 different organisms. The next step, however, is to use that information to understand these organisms and GO has an important role here because some organisms may only have a handful of experiments. In light of this, GO would like to have a set of well defined, well characterized reference genomes that could be used for electronic annotation of new organisms. What are the characteristics of a reference genome? 1 Needs to have a sequenced genome. 2 Needs to have an active and robust MOD supporting it. 3 Needs to have broad-scale functional genomics projects. 4 Needs to have an adequate research community adding new information. 5 Needs to have a sufficient literature base. 6 The reference genome list needs to be distributed across the tree of life in such a way as to ensure a range of organisms are represented. So, a list of nine reference genomes was chosen, starting with the two extremes: E. coli and humans. Humans have to be there since that's where interest lies and much funding derives. E. coli is there since it is the largest component of our biomass and there is a proposal for an E. coli database that is planning to use GO in their annotation. Other genomes identified include mouse, fly, worm, S. cerevisiae, Arabidopsis, and Dictyostelium. Those genomes not identified as reference genomes are not considered less significant, but the reference genomes will be the focus of a committed, coordinated annotation effort. What did we commit to? Curation of the reference genomes will provide broad and deep annotation, with each of the genomes being consulted and agreeing to establish foci of curation. Curation of genes that are relevant to human disease will be emphasized and at a minimum their curation will include: 1) looking at a provided list of genes, 2) using that list to prioritize annotations to determine if you can add information about the function of those genes in your organism, and 3) looking at those genes and also at the human annotations to see if the human annotations can be improved based upon your annotations. As the reference genomes are annotated, it will be important to provide data for metrics. It is unlikely that we will have complete annotation for those nine reference genomes, but we want to be able to measure progress. This may require some slight reorganization of how information about GO curation is captured. We want to establish a GO annotation team for this part of the project. For most of the reference genomes, there already are GO-supported, "embedded" curators. This fact is very relevant to the previous discussion, as these individuals will play a coordination, as well as annotation, role for their respective genomes. These individuals will be funded by GO and will answer more to the GO, as we will need to expand leverage to complete these annotations. Another role that these individuals will play is in providing outreach and training to other curation efforts. They will agree to participate in annotation consistency exercises to see if curators within their database have some level of consistency in how to think about curating GO. Part of the annotation consistency efforts may include everyone curating a human paper and comparing annotations. They will also participate in annotation camps and workshops, for which there will be a more regular process. There will be regular communication in the form of biweekly (fortnightly) phone conferences. And lastly, there will be a bidirectional process where each database has a responsibility of reviewing the GOA annotations for their organisms, ensuring feedback between GOA and the reference MODs. This will require a lot of communication with the EBI group about human gene curation. EBI already accepts some human annotation from MGI and other groups and would be happy to take additional annotations. Using EBI's curation tools, annotators can directly annotate to the GOA database. However, there is some concern about other MODs performing human gene curation. How will the outside world view this? We will need to make concrete and tangible progress on this or it will look bad. It might be best to focus on particular genes and show that we've made progress. All MODs have high priorities within their own organisms and so we will need to be realistic and methodical in our approach. The key role of the MODs will be to still do what the MODs do. That's the easy part of this! We're promising to do what we're already doing. GOA/EBI group annotation tool is online and can be made available to outside groups, e.g. AgBase, _____, Roslin group for chicken Sue expressed concerns that the annotation of human could be problematic, if we say we will do human, and then don't do it, we will look bad. MGI already has a list of human disease genes, ~4000, people are already contributing to InParanoid and using it to compare MOD genes to the human list, we can then also track MOD progress on this list of genes. This new effort will need a lot of coordination and we will need to know how to represent progress. We will generate a list of human disease genes, which will number several thousand. This is not an impossible list to generate. In addition, we can provide a list of InParanoid orthologs to help groups identify the relevant genes to annotate. This leads well into the discussion on co-annotation of the mouse/rat/human genomes..... Cooperative Annotation - Human/Mouse/Rat Annotation Judy Blake, MGI There is a heavy call from the GO user community for human annotation, and it will be beneficial to look at how the mouse/rat/human groups can work collaboratively since in many cases papers report experiments on genes from all three species. Mary Dolan has already done some work on the annotation consistency between mouse and human annotations and will also be adding chicken, fly, and zebrafish to this analysis (all predicated on knowing the orthology). One consideration, though, is that there are different intersections of information based upon the experimental focus on an organism. For example, rat gene products are often studied in the context of neurophysiology. At present, these groups do things independently and likely have similar but different priority in curation sets. How can these groups exchange information and help each other out? One way would be to create a shared annotation group, which may be a core group of people funded by GO, of human, mouse, and rat annotators that focus on gene products implicated in human disease. This would involve shared curation of the literature, collective work on the same dataset, and an update of each group's editorial tools to accommodate multiple species. These groups would also need to develop shared quality control protocols that would ensure high quality curation. Currently at MGI, they annotate rat gene products as they come to them, and these annotation lines are sent straight to GOA and posted on the MGI ftp site. Another proposal that relates to David's idea of depth of annotation: when these groups take on a certain area for curation, they also take on coordinated ontology development. The groups would begin by identifying and reading reviews and then using the reviews to improve data sets, triage the primary literature, and improve the ontologies. This would involve identifying all of the organisms and gene products described, and they are working on various computational ways to identify that information. Overall, these efforts would help the groups to simultaneously curate the same, focused set of genes and allow for development of a working group amongst these three genomes. Many mouse papers have data from rats and humans and it doesn't make sense to have three different MODs look at the same paper. Curation of rat genes begins with abstracts and then focuses on disease genes, particularly nervous system disease genes. This is fine, because concentrating on disease collectively gets the synergy. It is agreed that it doesn't make sense for people to look at the same genes in mouse and human, but with this new idea, people would be looking at the same group of genes involved in a particular area and this would help by reflecting how people in each field currently think about this topic. How would the curation be split up? Ideally, by references and by organism. This effort would involve literature-based, shared discussions that could start with a recent review and then choose a focus for annotation. This work might also lead to productive interactions with Reactome. If out of this work came a list of genes that were well curated in mouse/rat/human, then these genes could be bumped to the top of Reactome's curation priority list and help give them ideas about what pathways to coordinate. The exact tools and processes for doing this are not all in place yet, but will come via discussion as work progresses and we see growth in each of these areas. For GOA, this will be different from their normal curation strategy, as they currently fully curate each gene product. However, the hope is that this strategy extends beyond mouse/rat/human. For example, FlyBase could look at their homologs, too. In addition, we could put in place responsibilities so that when you curate a paper, you curate the whole paper and all the information in that paper. Sharing this gene list amongst all the reference genomes will provide the broadest representation of what these genes do in biology. Reference Genomes Rex Chisholm, dictyBase How do we think of metrics for both breadth and depth? One important point in annotating the reference genomes is to have metrics, coordinated by and agreed upon by the consortium, to assess the breadth and depth of annotation. How do we develop these metrics? There was some discussion of this on the email list while this proposal was being written. In some ways the answer seems obvious, but there are a lot of hidden complexities and different groups think about this in different ways. Nevertheless, we should all collect the same set of numbers. What is meant by breadth? A group would have broad annotation when they've annotated all the genes in the genome. There is, however, different kinds of information from different organisms. For example, rat is good for physiology, mouse for disease models, yeast for signaling, etc., and biologists tend to think about a process based upon what they know from a variety of organisms. For a given organism, we will need to track the number of genes that actually have experimental annotations. In addition, we'll want to know what genes' functions are known mainly by ISS and what percentage of the genome is annotated with ISS exclusively. We'll also want to know what is annotated based on IEA (sometimes this is the only type of annotation known). Tracking these numbers for each reference genome and collectively we allow a way to monitor progress towards breadth. The goal is to do the entire job - a huge task - and we need to be able to show progress along the way. It would be good to compare the snapshot taken at the time the grant was being written with one taken this summer. It would also be good to add, in the annotation file, the number of annotatable genes and proteins, since this will be critical for assessing metrics. It is not always trivial, however, to know what percentage of genes you can annotate. The cerevisiae genome, for example, has 20-25% of genes with no annotatable information and it has a compact genome with a long history of experimental study! It may still take a future publication to annotate these genes. In addition to measuring how many annotatable genes and proteins exist for each organism, it might also be helpful to record how much ontology space is covered for each organism. This bears on the second point, which is how to measure depth. If you've annotated every single paper ever published, then you've achieved 100% depth. But that's not realistic, as many papers aren't gene-centric and not all papers have curatable information. So, what should the denominator be here? Is there any way to capture the number of relevant papers? Would working with natural language processing groups be useful for gathering this kind of information? Another measure of depth might include an assessment of how far down any given branch the GO is used, ie how close do you get to the leaves? Shu (FlyBase) has written an algorithm for assessing this. One off-shoot (so to speak) of this would be that lots of annotations to a leaf node would highlight potential new areas for ontology development. Depth: can be measured as: % of papers used for curation average # papers per curated gene can we capture the # of relevant papers since we know it is a subset of all papers. How? Problems: Even if a paper mentions rat, it may not mention any genes Even if a paper mentions a gene, it may not give you information that is curatable for GO Some of these types of projects will be very useful to the natural language processing groups; Mike, Judy already have collaborations with such groups. ACTION ITEM #3: Begin to coordinate processes for reference genomes to start setting priorities and tracking progress. Acquire and distribute lists of genes for curation focus and set-up fortnightly discussions. [Point Person: Rex Chisholm] We need to develop a list of "disease genes" and process to monitor progress. We need to refocus and integrate roles of GO funded annotators. Aim 3 - Outreach Michael Ashburner, FlyBase and Suzi Lewis, BDGP We promised a tool that would provide better ways of annotating through the ortholog sets. This might be software that shows a phylogenetic tree and protein alignments and allows curators to click on any gene in the tree to see the gene's annotations. It might also allow for dragging terms from one organism to the other, resulting in ISS annotations. To accomplish this, we will need ortholog sets (from other groups) and protein sets from each of the reference genomes. Many groups are calculating orthologs - each does this differently and each uses a different data format. It would be really nice to have a common dataset so that we can compare these different methods on one set. It would also be nice to have a common output file. We'd like to be able to add in any group/organism that can provide a protein set (Matt has 40). Michelle mentioned differences in current annotation of gene models in eukaryotic organisms. Common Coding Sequence ?not sure of name? - collaboration between 3 groups annotating human. ACTION ITEM #4: Any other ideas for shared curation software, please forward to Chris Mungall. To avoid circular annotations, curators would only want to drag and drop experimentally derived annotations, which could be color-coded based upon evidence code. Where would the annotations go afterwards? The tool could generate an annotation file, and GOA already has a tool to do this. One thing to keep in mind, though, is that if anything changes about the original annotation, you would want to check the related annotation for continued accuracy. WORKING GROUP: We should establish a working group for this software so that we can leverage what other groups have done and figure out, technically, how to pull this all together. Two critical need for this type of software are: 1) reliable, agreed upon protein sets for establishing orthologs, and 2) a common output from the various ortholog groups. There is pressure to keep the number of genomes annotated up for taxonomic spread, but also to keep it down because of the enormity of work involved. The GO will support annotation across all organisms, and has been doing so for some time now. TIGR became an official member in GO in 2001, and TAIR joined us before that. As of the time the grant was written, there were 1,867 genome projects, of which 339 were published and in some sense complete. The former number includes over 900 prokaryotic genomes, over 500 eukaryotic genomes, and 26 meta-genomes. A number of additional mammalian genomes have just been funded. Over 100 pathogens are being sequenced at Sanger. Thus, there will be an increasing number of genomes coming in the future. For the great majority of these genomes, however, there is either no or relatively poor funding for annotation efforts and associated databases. This provides quite a major problem for the GO which has traditionally interacted with database groups and not with individual sequencing projects. It is in the interests of science and the GO Consortium to reach out to these groups to encourage and support GO annotations across as broad a phylogenetic breadth as possible. To this end, it is important for all of us to assume an ambassadorial role at talks and meetings and this role has been formalized under Jen Clark at the editorial office. But, we need to take a number of additional steps to go beyond outreach and beyond initial contacts. We need to provide a basic toolkit of training materials for new groups to annotate their organisms. This would include reference annotations from reference genomes that could be used for transferring annotations. This is one of the reasons for having a broad phylogenetic spread of reference genomes. In the past, we have also been asked to give advice about tools. People want to know what the best tools are for annotating a genome, but we have been reluctant to recommend one groups' tool over that of another. What we could do, however, is provide the best SOPs for using the existing tools. There are many tools that use GO for microarray analysis, and when people see the list on the web page, they often don't know where to start. When giving tutorials, we currently use one or two of these tools, but don't necessarily work that closely with developers to give them feedback on the software. It would be very good to begin working with these people behind the scenes to help improve the tools. It would also be good to have a wiki for each of the tools and to partner with groups that have a vested interest in further developing and improving these tools. Mike suggested that we could partner with someone, e.g. MGED, to work on this. Sue suggested that we could be more proactive on collaborating with these various software developers - they are often looking for nodes that are overrepresented, which of course will be affected by nodes which are underannotated. GO slims are often too broad for some uses, so we may need to help people develop better slims to really do what they need. We also need to encourage developers to incorporate evidence codes, so they don't get ignored by so many people. Mike seconded Sue's comment about some developers ignoring evidence codes entirely, and they are using a gene-association file to predict other annotations, and have no idea whether they are using experimental or IEA annotations. In summary, we know that some tools don't use the evidence codes, some ignore the WITH column and thus, the NOT qualifier, and that tools using the GO slims may be using GO annotations that are too high in the tree to be really useful. Accordingly, each group should think about what is the right depth of the ontology for their organism. Has anyone performed an independent analysis of GO tools? Sorin Draghici at Wayne State University has done this. (See his publication in Bioinformatics, Khatri and Draghici 21 (18): 3587.) End Discussion. Another form of curation outreach can be provided in the mappings of external classification systems to GO, such as ec2go or interpro2go. When using these, though, it is important to consider how extensively they are maintained, since some are updated regularly and some probably haven't been updated in three or four years. Information about frequency of updates should be transparent in the mappings file. Also, we should always be on the lookout for new classification systems to map to GO, such as the KEGG orthology. Most importantly, however, is that we're going to provide training. We've already had two annotation camps, and are planning a third for this coming July in Palo Alto. We would also like to have one annotation workshop per year, alternating on which side of the Atlantic it's held. We should especially reach out to Asian groups, particularly some groups in Japan who've already sent curators to the 2005 Annotation Camp and are very interested in learning how to do manual annotation . WORKING GROUP: There is going to be an outreach working group and they will need to come up with a plan for how to encourage and support annotations from a broad spectrum of genome projects. Part of this group's focus should be on writing SOPs for GO curation and working very closely with new groups to ensure regular submission of gene association files. This working group should include the "embedded" GO curators at each of the MODs. Another issue related to this has to do with freshness, or currency, of annotations, since we know that annotations can go stale for a number of reasons. For reference genomes and well-supported MODs refreshing annotations is an integral part of their job. For single-pass annotations, however, there may not be resources to refresh them, especially if they are performed at a sequencing center that has long since moved on to the next genome. We do have an existing system to do with this in that a gene association file that has not been refreshed within the last twelve month is put into a separate gene association file archive. Another outreach idea is that of an annual annotation challenge. This might be similar to the CASP (Critical Assessment of Techniques for Protein Structure Prediction) and GASP (Genome Annotation Assessment Project) projects. An annotation challenge might also be a good way to interact with software developers creating tools for GO. We could give groups a set of genomes, have them run their programs on them, submit resulting GO annotations, and then compare their annotations to known annotations to see how the different programs perform. The winning program could then be run against all of the genomes that need refreshing. The annotation challenge could perhaps also be extended to evaluating tools that assess consistency of annotation within the consortium. GO should also continue trying to reach out to additional prokaryotic annotations groups, as there are now a number of different groups in many different locations, such as the Joint Genome Institute, Argonne National Labs or the Pasteur Institute, to name a few. We could also reach out to groups involved in the NIAID program to sequence organisms that are potential bioterrorist agents, which is ongoing at several BRCs (Bioinformatics Resource Centers). There are some particular issues, such as the operon issue or annotation by pathway hole filling, that arise when annotating prokaryotic genomes that may merit some changes in the GO. Specifically, this may require changes to the evidence codes, and we may need to improve links between function and process. These issues are especially important as annotation of the E. coli genome, one of the reference genomes, becomes more coordinated (see Riley et al, NAR, 2006 Vol. 34, No. 1, 1-9). Aim 4 - Community Advocacy Mike Cherry, SGD The main idea behind community advocacy is that there would be advocates to seek out what is needed by those people that use the GO. As a group, we want to know, and should start thinking about, who else is using the GO and what can we do to meet their needs? To begin to address this, we may want to designate a publicity person for the GO and have a place on the website where we highlight the use of the GO. This is not simply an issue of interface design, but also really addresses the questions of what is GO and why should I care about it in an effort to draw people in more. We need to recognize that in talking to non-genomics biologists about GO, we may need to use new language and be pro-active about determining exactly what people do and do not know about GO. GO is now a resource that is expected to be there and as such, will be subject to criticism. But, it is our responsibility to get out there and help people who are using it and address issues they might have. Michelle suggested that we might want to work with someone professional to develop training materials. Candace seconded this and mentioned that there are lots of people at colleges/universities using GO in their courses and training materials would be really helpful. Perhaps we could collaborate with an educational professional to produce GO training materials. Reactome has had some experience with a group that came in to produce training material that they would then sell as a subscription. The number of people subscribed might help to give some idea of how many people use their resource. How else to determine who uses GO? We could compile lists of people who mention GO in publications and in talks at meetings. For the former, we could perform a full-text search using "gene ontology." It would also be good to foster a user community that gets immediate feedback from GO in the form of quick responses to email queries. We could also try some sort of monthly call-in where people can ask questions or simply lurk and listen. Could we put into place some mechanism that allows people to tell us what genes they would really like to see annotated? We could also make some changes to the email lists. Sometimes people post to a GO mailing list, but it's not to the right list. We should consider consolidating the lists and making it more transparent which list people should post to when they have a question or comment. Having good training materials on the web is key. Courses on genomics and informatics would use GO if there were training materials available. This would increase student awareness of GO. Is there any way that we can determine who is downloading files and perhaps get automatic feedback from them when they do so? Tracking downloads, however, would be a low estimate of how many people use GO. Sue was recently at a Systems Biology meeting, and nearly every talk mentioned GO. Attending meetings may be one way to get a feel who is using it. She also felt that looking for citations in publications is also a decent step. John: We need to foster the idea that we are responsive to the community. Karen brought up Rama's comment that we need to consolidate/manage our various email lists: go-database, amigo, go, and make sure that questions don't get lost, that everything gets answered in a timely fashion. David: Medium to large scale analyses where they use GO for analysis- we could offer to focus on annotating genes which would be relevant to the analysis and get it done so that they can use it before publication of the paper. Jane suggested that we could have a tracker for this. What about a monthly newsletter, a wiki, rss feeds? As many of the MODs already send out monthly newsletters, perhaps they could append a new section devoted exclusively to GO. Using the MODs to reach more people was especially helpful for the GO survey done last fall, results of which are included as an appendix to the grant submission. Also, any courses that people teach, such as those at TIGR, Jackson Labs, Woods Hole, CSHL, could include GO as part of the curriculum. There will be no new hires for outreach, so existing GO curators should think about how to reach out to their user community. If you have more ideas about how best to do this, please send them to Eurie Hong at SGD who will be coordinating community outreach. One other point to keep in mind is that a lot of groups doing functional annotation are doing structural annotation, as well. This is partly why we need the Sequence Ontology (SO). SO has become the fourth ontology, and although there are differences between the SO and GO ontologies, they are all supported and needed to do the complexity of gene annotation. Aim 5 - Organization Suzi Lewis, BDGP Three years ago, the aims of the GO grant proposal were to create structured vocabularies, support and promote use of GO in annotation projects, add new MODs to the consortium, build and disseminate informatics resources and tools to support community use of GO vocabularies.. How can we measure progress towards these aims? The number of hits to the web site has gone up, the number of publications mentioning GO has gone up, and the number of links to the GO web site has increased. 17/24 NIH agencies that fund research have supported projects that use GO and there are many hits when searching for GO in Google Scholar. The GO survey netted close to 1500 replies in three weeks, almost all of which were positive. So, where do we go from here? Looking at the 2006 aims: 1 maintain comprehensive, logically rigorous, biologically accurate ontologies 2 comprehensively annotate nine reference genomes in as complete detail as possible 3 support annotations across all organisms 4 provide annotations and tools to the research community Given that the funding level is expected to remain the same, how are we going to scale up? The answer is that we need to do a little better, need to get more efficient, and need to be more coordinated. This can be done! We already have some working groups, such as the AmiGO and OBO-Edit working groups, and these have been successful. But, any new working groups that form will have specific questions to ask, including: 1 Why is this group here? 2 What is the lifespan of this group? (could differ greatly between groups) 3 Who is the group leader? 4 Are we making progress? What are the obstacles to making progress? 5 What are the group's priorities and what are the criteria for setting priorities? 6 What will the group deliver? 7 What are the criteria for membership? Who has a vested interest in this issue? 8 How often should you meet? How should you meet? Via email? Other ways? Each group can decide what works for them and what communication method will work best, but having metrics is essential. How are you going to measure your progress/success? What tests do you want to build? Different groups will need to communicate with one another. How are you going to share information? The decision making process for all of this still needs to be fleshed out, but efficiency and lack of bureaucracy is key. If there are software issues that come up, please raise them with Chris. As an example, the reference genome group will have Rex Chisholm in charge, produce annotations and protein sets, will comprise members for each of the nine reference genomes, participate in fortnightly conference calls, and have as their metrics the percentage of the genome that is annotated. This group will work with the computational group and the user community, processes for which will be decided. Levels of organization in the GOC can be described as follows: 1 GO PIs, Judy, Michael, Mike, Suzi - set priorities, obtain funding. 2 Midori, David, Jen, Rex, Eurie, and Chris - ontology content, annotator-ontology liason, annotation outreach, reference annotation, community advocacy, computational architecture, respectively. 3 Curators, some of whom are directly funded, from MODs: human, mouse, zebrafish, fly, weed, worm, slime mold, budding yeast, microbes - perform annotation Parting thoughts: This isn't a drastic change from what we're already doing, but it is trying to be more conscious about what we're doing. We don't want things to fall through the cracks. One of the reasons that projects fail is that when there are so many things to do, you pick the easy things over the things you should do. We need to do the hard, important things. Ontological Content: Relationships and Terms Missing "is_a"s Chris Mungall, BDGP Tangled DAGs and complexity - complexity is increasing exponentially (i.e total number of paths to the root node). The first topic addresses the issue of missing is_a relationships in the GO. Jane Lomax has worked on filling in missing is_a relationships in the cellular component ontology. At the onset, the process and component ontologies are not is_a complete. In the past, our methodology has been that as long as a term has a parent, we haven't cared so much about the relationship. Why is there a need to have an is_a parent for each term? Because without them, we're missing terms and when we're missing terms it's much harder to get the ontology to work with other ontologies and tools that are out there. For example, the Protˇgˇ ontology editor assumes is_a completeness and so does not work well with GO, as orphan terms appear as root terms. Accurate ontologies also aid accurate searching. How can we get started at untangling complex DAGs? In our current display, when drawing the DAG, we ignore the relationship type. We label it, but we don't use the information on the label. For example, the Srb-mediator complex in it's current incarnation is displayed multiple times showing every possible way to draw a route to the top. In Jane's work, she took all of the is_a orphans and either found the correct parent or created the correct parent. There were 277 is_a orphans in the cellular component ontology. In the 'fixed' version of the cellular component ontology, these is_a orphans are gone, which increases the total number of mixed-paths-to-root, actually making it more complex to look at. Some browsers, such as that of the FMA (Foundational Model of Anatomy) have distinct is_a and part_of browsers, which helps to think about the ontology better. A summary of Jane's work: old * 277 is_a orphans/1688 terms * avg is_a paths to root: 1.4 * avg mixed-paths to root 6.97 new: * 0 is_ a orphans * avg is_a paths to root: 3.36 * avg mixed-paths to root: 38.6 This work illustrates how this project needs to coordinate with the AmiGO working group. We want to make this new version live as soon as possible, but the display will need to be addressed. It would be good to let people have a look, since the component ontology will look very different when these changes are implemented. There are also some terms missing part_of roots. For example, unlocalized complexes don't have part_of roots at the top of the tree. All of these complexes need to be housed somewhere, although some of them are so old that there are no references, no definitions, and no annotations made to them. **Demonstration of new is_a complete cellular component ontology in OBO-Edit** To ensure is_a completeness, a new high level term 'X component' was introduced. For example, cell projection component was introduced to have a complete is_a path for existing terms, such as cell projection membrane. This is analogous to what the FMA, which is is_a complete, did when they invented the term 'cardinal body part' to have a parent term for the 'head', 'trunk', etc. since extremities are not organs. FuGO also needed these types of top level terms. One issue that still remains is what to have as an is_a parent for cell components, ie terms that are part_of cell. Would it be okay to have a parent term cell component, even though the whole ontology is called cellular component? Barry Smith recommended that this trick only be used when dealing with entities that are not anything else, since we don't want to introduce multiple inheritance for this purpose. If the cell component term was introduced, we'd have the same number of high level terms and could remove some redundancies. Chris and Suzi also suggested renaming the ontology. We could refer to a 'cell-level entity' and then also have cell parts. Barry pointed out that the FMA uses part in the name, and uses 'cardinal part' if it's the only way to have an is_a relationship for every term. It is used, however, for entities that are there only when the whole is there. So, we could use the term 'cell part', 'membrane part', etc...... So Jane has done the following: * created terms like 'cell projection component' to be is_a parent to give the various types of cell projections an is_a parent. * may have added some is_a parents that are needed * an issue came up about: she didn't create a term "cell component" which is almost identical to name of the ontology, i.e. "cellular component" so there are still a bunch of things right off the top. ACTION ITEM #5: Consider what, if any, are the repercussions of renaming the cellular component to cell level entity? ACTION ITEM #6: Change new terms ending in '...component' to now end in '...part.' [Jane Lomax] Note that the current version of AmiGO does not have the ability to disentangle paths, nor do other groups browsers. In the meantime, the MODs will need to address this in their display, AmiGO will need to deal with this. We don't necessarily want to wait for the tools to catch up to implement this, but the displays will explode a bit. Annotators should take a look at this in OBO-Edit to see if they can live with this. There may be some redundancies that could be removed, which would help with the increased paths to root. OBO-Edit needs a verification system to check for is_a completeness and warn before saving. ACTION ITEM #7: Jane will send an is_a complete cellular component ontology to the GO list. There are also lots (i.e., thousands) of missing is_a relationships in the process ontology that will need to be dealt with. This should be a little bit easier since the tools are already dealing with this with respect to cellular component. However, once we make this live, we're committed to maintaining is_a completeness. Any new terms entered should not create new is_a orphans. OBO-Edit has a verification system for this that could be revived and could be tested by the OBO-Edit WORKING GROUP. Fixing the Upper Levels of Biological Process Chris Mungall, BDGP and Barry Smith Examining the top levels of the Biological Process ontology, we also find that there are some issues with granularity. We have been talking about diamonds and DAGs and how they're not so good for ontologies. If diamonds are treated correctly they're not such a problem., but diamonds at the upper levels of the ontology are more of a problem, especially when they are at the very top. What is the solution? Some of the high level terms in the BP ontology are: cellular process, physiological process, cellular physiological process, and organismal physiological process. Do these terms define the granular level of the process or of the end result? The definitions seem inconsistent. In some cases, the definitions refer to the goal of the process, not the level at which the process occurs. Further there is a lack of symmetry in these terms and definitions. For example, why is there no term 'organismal process'? And what does physiological process really mean? It is vague and hard to distinguish from biological process, although some dictionaries define it as everything but developmental processes. We should think of and define a process as something that has a beginning and an end. But where's the beginning and end of a physiological process? Of metabolism? There is also an issue with behavior terms. We've got something called behavior which is not considered a physiological process, but 'behavioral physiological' child terms are useful. All of the behavior terms are now 'response to...' terms, but shouldn't response be a physiological process, since the definition is about a set of things that happen in response to something else? Jen commented that she once looked up "physiology" and it appears to be a catch for anything that wasn't "development" and some other major branch of research at the time the term was introduced; David feels that physiological process is equivalent to biological process; idea of eliminating "physiological process". Should we just eliminate the term 'physiological process'? Alternatively, we could add the granularity at the top level of the process ontology itself. We could have molecular, cellular, and organismal levels. These are disjointed, whereas physiology is not. The advantage is that each is distinguished from a parent in the same way. So, for example, a term like cardiac contraction would be an is_a term of organism level process, cell contraction would be a part_of that and also an is_a of a cellular level process. Barry's revision: Biological process: * molecular level process * cellular level process * organismal level process Barry: "I do not believe that there is such a thing as "molecular physiology", however there are such things as "molecular biological process". Rex and David provided examples to the contrary that people do actually talk about "molecular physiology". What about single-celled organisms? This has been a source of past and current confusion, since in these cases cell and organism are the same thing. From the perspective of the GO, we need to make a decision about what a single-celled organism is. We may also want the term 'multicellular level process' or 'multicellular organism process', and we need to consider the cases of organisms like Dicty that can exist in both states and some bacterial species that associate but don't meet criteria for being deemed multicellular. Barry suggested adding two children of cellular level process: single cell process and multicellular process. We could also have single organism process and multiorganism process. Agreed that we should look at a few branches of the ontology and see how this would work. What will the effects of these changes be? We can sort through this in-house as there may be many merges and changes, but we shoulud keep an eye towards what other ontologies, such as FMA, do. We also need to consider the implications of this for pathological processes. What are molecular level processes vs cellular processes? For example, if a molecular penetrates a cell, it's a molecular process, but if a cell captures a molecule this would be a cellular process. We could figure out the granularity by determining the agent. The trick is to find an example where there is no apparent classification as one or the other. It should be obvious, though, where the action is - at the molecular or cellular level. Think about viruses attaching to cells, DNA molecules penetrating a hole in the cell, etc. Also think about how this reorganization would affect development and cell differentiation terms in the GO. Development is an organismal process, but a big part of that, cell differentiation, is a cellular process. Will this be okay? The relationship to the former would be part_of, while the relationship to the latter would be is_a. ACTION ITEM #8: Take the new molecular/cellular/multicellular arrangement of the biological process ontology, try it out, and see how it works. The WORKING GROUP for this will include: Chris, Jane, David, Alex, Michelle, Rex, and Val. Jane will make a .obo file so that people can look at this. The group will also need to create a document outlining their philosophical approach and the results. Should there be a development site to help with working on this? David brought up the sticky issue of making developmental process be under 'organismal process' and then you eventually get down to 'cell differentitation' which also needs to be under ' cellular process'. Problem of development and the child terms of development. Relations Between Function, Process, and Component Chris Mungall, BDGP How can we make links between the GO ontologies? The discussion of this focused on links between molecular function and biological process using histidine degradation as an example. (See Overbeek, et al, NAR 2005 33(17):5691-5702). Histidine degradation is a complex process with a branched pathway. In GO, it is represented by the BP term 'histidine catabolism.' Each of the reactions in the process are mapped to GO MF terms, such as 'histidine ammonia-lyase' activity. We have some variants for different ways that histidine catabolism takes place in different organisms and could map pathway branches seen in different organisms to GO child terms. Two could be mapped, histidine catabolism to glutamate and formate, and histidine catabolism to glutamate and formamide, but one could not: histidine catabolism to glutamate and formiminotetrahydrofolate is not represented in the GO. Because of this, there is no way to link relevant functions to processes. Michelle: The microbial community is looking for this type of relationship because reaction steps for each process is how they (TIGR) annotate a bacterial genome. There is clearly lots of curation work that needs to be done. How will the results fit into the current ontological structure? But, an important issue here is what is a function and what should be in the function ontology? We must settle that first. We also need to be clear about what we mean when we are talking about types, processes, and functions. Every single term in the GO should represent an actual type in biology, but the converse is not necessarily true. What are the relations in reality? There are relationships between types in the same ontology and between functions and processes at a given level of granularity. What are the instances and relations in reality? Some particular gene product has some molecular function instance - has a functioning- which makes up a more complex multi-step process. Examples of type: histidine ammonia lyase function, histidine ammonia lyase reaction, histidine catabolism. How hard is it going to be to determine when we can make these relations? What are the types and relations in reality? Some reactions will only take place in one pathway. Barry commented that if this is the case, then we can't assert part_of relations. But can lack of knowledge be used to exclude a part_of relationship? What about using the relationship has_part? This exists in the OBO relation ontology. We need a way to distinguish those reactions that always occur as part of 'histidine catabolism' versus those that are sometimes part of histidine catabolism depending upon the organism and maybe sometimes part of other processes. For example, alcohol dehydrogenase functions in many pathways. Barry's comments: if we have multiple inheritance in part_of parents, then every single instance must have that part. If we have an entity that is sometimes part of one thing and sometimes another, then we cannot assert the part_of relationship. The part_of definition means all the time; has_part means the parent process necessarily has that function as part of the process. We need to collect more data and think about how to integrate that data. The data is captured in annotations and so we should be able to use annotations to help predict the relationships between process and function. These ideas will probably work fine for well-characterized pathways, but what about other pathways, such as cytokinesis, where all of the molecular events are not yet defined? At some point we are going to want to talk about pathways at an organismal level, not just at the molecular level. To do this we will want to consult with the pathway databases and have some level of synchronization with them. In GO, however, we will still be missing information about the order of events, so we would need to bring in new relations such as 'preceded_by.' Prokaryotic annotators would like to see those relationships brought into GO soon. (Michelle gets complaints from people in her classes that GO doesn't represent pathways.) So, how would we manifest this in GO? Molecular functions reside in gene products and the function is distinct from the functioning. In reality, there is a certain bit of redundancy here; do we want to manifest this directly in the GO? Do we want to have functioning terms in the process ontology? We know that not all GO function terms have a corresponding functioning term. Some redundancy already exists in the GO, in cases such as 'iron transport' and 'iron transporter.' We also have some redundancy between component types and function types. This is not necessarily bad, but we should be aware of the reasons why they're there. The functioning of the gene product is implicit; it exists in reality but not in GO. If we're not going to create functioning terms, then we will need to link the function and process ontologies in another way. We shouldn't use part_of between the ontologies. What about having a functioning_of relationship? Another proposal would be to remove the function terms and replace them with highly granular process terms [see below]. One point to consider is that every organism may not always have a particular reaction as part of its pathway. This is why we're using the has_part relation and not the part_of relation, but this arrangement might still be problematic for creating GO slims. It would then be important to annotation NOT to the process if you know a reaction/enzyme exists but the pathway is broken, which happens a lot in bacteria. Use of the NOT qualifier in these cases would be consistent with the way we normally use it. Can one function have many types of functionings? There can be many instances of functionings but only one functioning type. Processing Function Terms Amelia Ireland, GOEO In the past there has been a lot of discussion about function terms and a number of term obsoletions. The proposal presented here would move function terms representing steps in a process to the process ontology. For example, many terms under 'protein modification' in the process ontology are just one-step reactions. On the other hand, we have multi-step reactions that appear in the function ontology, e.g., ent-kaurene oxidase activity. We also have processes and functions that appear to be identical, such as histone methylation and histone methyltransferase activity. We have had a one-step rule for creating functions, but what constitutes a step can vary between functions. Some function terms fit the definition and some do not. Some discussion as to where the one-step rule is stated. It is in the documentation and has been used by the editorial office for quite some time. When the one-step rule was instated, no one spoke up to disagree. But, perhaps the one-step rule does not clearly state what we want and can be interpreted in different ways. We seem to agree that a function is an instance of what a gene product can do at the level of the molecule and does not represent just one step in a process. For example, think of the Michaelis-Menten equation which takes into account several steps in an enzymatic reaction, e. g., binding, catalysis. However, even though there are steps, some said that there is still just one thing that the molecule does. Does the presence or absence of reaction intermediates have any bearing on this discussion? No, since for any enzymatic reaction that occurs there are going to be some intermediate steps. What about situations where, for a given function, there is one gene product in eukaryotes, but many gene products in prokaryotes? Would we represent the same reaction in different ways in GO? Would there be multiple functions for the prokaryotic gene products and one bundled function for the eukaryotic gene product? But why annotate to one conglomerate term when the current incarnation of GO allows annotators to annotate to more than one term, i.e. one term for each reaction? (See cerevisiae fatty acid synthetase, for example.) This could create problems in term nomenclature. What about multifunctional proteins? In these cases, curators would annotate to one term if this function represented a cascade of events that won't stop once started. Otherwise, curators should annotate to separate functions. Amelia commented here that it seems clear that many functions don't fit the definition of "elemental activities describing the actions of a gene product at the molecular level", and that there are "clean" and "dirty" functions in the GO. "Clean" functions include terms like 'arginase activity', 'transaminase activity', etc. and we can make "clean" relations between these functions and the process term 'arginine catabolism to glutatmate.' Another example of a "clean" function would be 'Notch binding' which can be related to 'Notch signaling pathway.' On the other hand, "dirty" functions are not necessarily steps in a process, but represent a combination of two or more attributes from function, component, process, or something other domain. An example of a "dirty" function is 'transcription factor activity' which is defined as: Any activity required to initiate or regulate transcription; includes the actions of both gene regulatory proteins as well as general transcription factors. Other examples include receptor activity, structural molecule activity, enzyme regulator activity, hormone activity, etc. These "dirty" functions represent a role or a class and cannot be linked to process terms using existing GO relations. What is the solution? Amelia proposes that we move terms representing events into the process ontology. We could start with catalytic activities and move them under the parent term 'metabolism,' creating terms that correspond to particular EC numbers. But, if enzyme activities are "clean", why move them to process? Because these functions are events - occurrents - and should be dealt with in a different way to those function terms representing a role or class of gene product. Using arginine biosynthesis as an example, individual reactions could be organized according to the substances involved. Reactions involving arginine would be is_a children of arginine metabolism. The reactions making up a specific arginine biosynthesis pathway could be made part_of children of the term representing that pathway, e.g. 'arginine biosynthesis from xxx'. Will listing the individual steps as children of arginine metabolism and the relevant pathway be consistent with how biologists think about this? Another example of moving function terms is illustrated by the binding terms. For example, 'Notch binding' could be a part_of 'Notch signaling pathway.' We could do the same for transporter activity, permease activity, receptor activity, ligand binding during signaling pathway, regulator activity, etc. We could also reinstate some obsolete terms such as cell adhesion molecule as the concept of cell adhesion receptor binding. All moved terms will be given many function- and process-style synonyms. So, what's left? The Brave New Function World. This world would redefine function in more colloquial terms, consistent with the dictionary definition that defines a function as the purpose that a gene product serves in the normal activity of an organism. We could keep "dirty" terms representing combinations of function, process, or other information in function, and the new function definition would allow us to add useful terms currently not allowed, such as toxin, ligand, other suggestions? Does this proposal reflect that there are problems with our function terms, or with the definitions? For example, biochemists and cell biologists look at ligands differently. There is a function in the connotation of ligand; was it just poorly defined? Maybe the definitions need to be clarified. The Brave New Function World would include the following functions: energy transducer enzyme motor nutrient reservoir signal structural molecule receptor regulator transporter toxin There is great concern that this will create new problems and confusion, in part because these are names of things, not functions. Barry's suggestion is that perhaps GO should create an ontology of molecular entities so that GO can clarify use of words to describe what a molecule is, what it does, what it could do, etc. But do we want to have a molecular entity ontology and a function ontology? This discussion seems to be raising a number of separate issues: * For "clean" functions, how would you show which functions are needed for a particular process? * Perhaps we need to talk about better upper level organization for function, similar to what we discussed for process? * There is a lot of messiness in the function ontology, but processing terms to group functions is not right. * There is an issue with physical entities. We do have a chance to do this clean, and we don't want to invent a jargon any more than we have to. We could keep the Brave New Functions, but call them molecular function entities or add 'function' to the names. There has been a long-standing issue with trying to define the intersection between the molecules we can assay and our understanding of their function. This is part of why we added the word 'activity' to the function terms. We've purged anything from the function ontology that sounded like a protein, but we still have these issues. We need to readdress the one-step rule and correct this, if needed. The prokaryotic community wants an association between function and process, so we're going to have to find a way to do this, but not by moving function terms into process. This seems backwards to people, as the "cleanest" functions, such as enzymatic activities, are the ones that people think biologists would most expect to see in a function ontology. In summary, Amelia presented the pros and cons to her proposal: Pros: * Reactions involving a certain substrate can all be homed underneath one term, xxx metabolism * Reactions can be linked to pathways * Greater precision in annotation * Ontologies more consistent and pure * Less redundancy between process and function * GO function closer to colloguial interpretation Cons: * Some adjustment needed by those accustomed to seeing binding and enzymatic reactions in function * May take some time to implement An alternative solution presented would be to allow complex functions and redefine molecular function to for terms that represent all the biological activities a gene product has. But some molecules have completely separate functions, such as actin which polymerizes into filaments and inhibits DNaseI. Other examples cited include GPCRs and immunoglobulins. Other thoughts from this discussion: * Make grouping terms under function that indicates that this function initiates processes. * Perhaps we need to broaden our definition of entity so that we can annotate not only to the function of gene products, but to complexes, as well, since in some cases, such as RNA polymerase activity, no single gene product has that function. This relates to the issue of granularity of entities. * Can we establish links between function and complex terms in cellular component? ACTION ITEM #9: Make actin polymerization a function term. ACTION ITEM #10: Amelia, Chris, and David (and Barry, if he's available) should get together and come up with a single proposal for making connections between function and process. This will likely include writing new definitions for function and process. Sunday, April 2nd, 2006 Two Taxon Option Jane Lomax, GOEO This issue arose from the content meeting held at TIGR in November 2005, where a proposal was drafted for a dual annotation system that accommodates annotation of multi-organism processes. A key component of this proposal is that for multi-organism annotation, two taxa may be placed in the Taxon ID field. The first taxon ID refers to the organism that encodes the gene product, while the second refers to the organism that interacts with it. The two taxa should be pipe-separated. Dual-taxon annotation will also require adding a lot more terms and so we're considering having a few GO curators visit PAMGO later this year to develop that part of the ontology. Other plans include writing detailed documentation and guidance for annotators that may need to curate these interactions. There will also be an announcement of this documentation on the annotation mailing list and dual-taxon annotation will be discussed at the upcoming annotation camp. At the moment, though, only TIGR and PAMGO groups are probably doing this. If the interaction is between two individuals of the same species should annotators still put two taxa in the ID field? Yes. The issue of dual-taxon annotation started out originally because some bacterial proteins are injected into plants cells, which leads to the 'hypersensitive response'. Currently, this term is a child of 'programmed cell death', but is this the right parentage? We made need to create a separate ontology for these interactions, since 'hypersensitive response' doesn't fit where it currently resides. It is agreed that the GO recognizes that there are still some big problems in the underlying structure of the ontology with regard to this type of annotation and we may need to recognize a new structure of the process ontology to get this right. Having an upper level term of 'organismal process' may help understand how to place these terms correctly. ACTION ITEM #11: Add detailed documentation on dual-taxon annotation, announce this to the annotators' mailing list, include the info in annotation camp discussions, and work on developing the ontology. [Candace Collmer, Jane Lomax, Amelia Ireland, others?] Two-Taxon Annotation - Capturing the Host Side of Interactions Jane Lomax, GOEO Given that there is a range of interactions, some harmful, some beneficial, are we annotating these interactions consistently? At present, when an interaction is mutually beneficial, we annotate both organisms. But, if the interaction is pathogenic, we only annotate the agent, not the host. A new proposal would allow for annotation of host gene products in pathogenic interactions. There is some opposition to this because, for the host, this represents abnormal processes. For example, would we want to annotate CD4 to HIV binding? Would we want to allow for ISS transference of this kind of annotation? In the case of viruses this could be particularly dangerous as there are many species-specific host-virus interactions. What about cases of rhizobium binding to plant proteins (plant nodulation)? Does apparent co-evolution, or positive selection from both organisms, matter for this discussion? There may be a grey area between interactions that are called mutualistic versus interaction that are pathogenic. Is it always clear how to distinguish the two? Where does normal start and stop? How do you distinguish an agent in the environment that is neutral versus one that causes a pathogenic response? A binding reaction could lead to a pathogenic response; how do you not annotate that activity? Isn't it a normal process to bind something in the environment? We do have MF terms that involve a gene product interacting with DDT, which may not reflect that gene product's normal function. Does the canonical life plan include interaction with viruses? Yes. But again, the normal function of CD4 is not to bind HIV. Yet it does bind, and not when mutated. There is some agreement that it is difficult to draw the line between what is solely beneficial and what may sometimes be pathogenic (think about intestinal flora here). In the absence of any real objective dividing line, then at this point, we have to say that capturing these types of annotations is not okay and we continue to only annotate 'normal' processes. But, there are likely many different examples that will need to be considered before taking a definitive stance on this issue. ACTION ITEM #12: Candace and Trudy will send examples and relevant references to the annotation list as they come up, so that we can consider these on a case-by-case basis. [Candace Collmer, Trudy Torto-Alallibo] Obsoleting Justification Policy Judy Blake, MGI We have been getting a lot of feedback on term obsoletion from the GO user community, for example, GO users at the BRC meetings. It is necessary, therefore, to try to clarify the discussion about complex functions and about how we obsolete terms. The one-step rule for defining functions should have been challenged before becoming embedded into the system. There are two main issues with the current obsoletion situation: 1 when we obsolete terms 2 the way we handle our systems when we do obsolete terms. When searching in AmiGO for a term that has been obsoleted, a term that people use biologically, the obsolete terms come to the top of the page. We need to correct this. The AMIGO WORKING GROUP will address this. We continue to debate how we work with the rules about changing definitions. We can take an absolutist approach and say that any change requires a new ID, but what we've come to over time is that if we felt a definition didn't quite capture the essence of the term, we would change the definition, but not the ID. Another issue (see Item #7 in Judy's handout, Obsoletes Redux) is that obsoletion has been used as a shorthand way to get people to change their annotations. But is there a better way to get people to look at their annotations? We could have annotations deprecated until they are re-checked. It may also be necessary to have a better versioning system for GO terms. It would be possible to granularly track the way terms are changing, and perhaps we could embed the version of the term in the OBO file. One question, though: what constitutes a version-worthy change? OBO-Edit already has a mechanism in place to track obsolete with new replacement, but do people need to track less fundamental changes? SGD has a way for curators to add 'date last updated' tags to their annotations. This is manual and it is always left up to the curator to decide what changes necesitate a change to the update tag. OBO-Edit 1.2 can assign automatic or manual replacement. We need to be clear on why we obsolete terms, though. Sometimes we really need to obsolete a term but in other cases we know that there is a concept we want and we wouldn't make the term obsolete except for the fact that we know there are erroneous annotations. Bottom line: we'll still want to have automatic replacement as well as suggestions for having a look at another possible replacement term. One concern is that our users don't understand why we keep adding and deleting terms such as cytokinesis. Cytokinesis has been obsoleted because we've diddled with the definition, but users don't understand why we're doing this. The key here is that probably the definition hasn't altered the correctness of the term. Another case concerns that of the molecular function plasma protein. We obsoleted this term, but didn't make a term to replace it, such as a component term 'extracellular.' Would this still have been within the scope of the GO, ie is plasma a cellular component? There seem to be two issues here: 1) There does need to be some mechanism for obsoleting terms and checking annotations. But not any change to the definition warrants obsoletion. Trivial changes don't warrant obsoletion. 2) Annotations drive what the GO is. Operationally, there needs to be a lot closer communication between the editorial office and the annotators. We need to think about our policies for communication. ACTION ITEM #13: Develop a new policy for communicating about term obsoletions. The person proposing obsoletion should get in touch with the annotating groups (using contact information from the gene association file) informing them that the term is under review while also soliciting input and suggestions on what changes to make. This could be scripted. But, we still need to have a system for making sure people check their annotations if changes have been made to a term. Perhaps we could use the regular monthly reports to alert people that a term has changed somewhat and ask them to please review their annotations. But how do we enforce review? It could be set up so that until a term is reviewed it will be removed from the file. Filtering would be done by annotation date. An additional feature built into this script could alert curators when leaves have been added to the original parent annotation so that curators could check and possibly add higher granular annotations. Concerns: Is there a way, without having a gene association file, to find and check for obsolete terms? How do we make term obsoletions obvious to groups that are not part of the consortium? We may now have a mechanism in place for communicating obsoletions, but we still need criteria for deciding when an annotation is potentially affected by a change in the definition. And, people really need to check their annotations before a change happens. Should we revisit old obsoletes and for those made obsolete because of annotation issues, should we go back to the old terms and make their IDs secondary IDs? ACTION ITEM #14: Reinforce the policy that we will no longer obsolete a term just because the definition has changed or because annotations are thought to be bad or incorrect. ACTION ITEM #15: Explore the proper technical solution for establishing a mechanism to notify GO users when a term has changed, or rather, when we are thinking of changing a term. This solution should consider: * Do we need to enforce a way of making sure people make any necessary changes? * How best to contact people? Monthly reports? Newsletter? Web page? * Can we adapt the existing system but have a version for 'proposed changes to a term's definition'? * Could we have users subscribe to a mailing list to be alerted when there are changes to terms they're interested in? We could have a 'track this term' feature in AmiGO for communication outside GO. * Which people should be alerted to obsoletions? The EC has a procedure whereby, on a periodic basis, they post all proposed changes and have a public comment period. The editorial office shouldn't have to pester people to review their annotations; this is something that the PIs should do. At many past GO meetings, for many past proposals, there were no comments or objections raised. Is silence tacit approval? That seemed to be the case, but the editorial office would like an acknowledgement that the proposed changes are really okay with people, since silence does not mean 'okay.' Maybe we also need to have a minimal length of time for which a SourceForge item must stay open to allow people time to look things over. With the new grant proposal refocusing 'embedded' GO curators efforts, there should be a greater response to potential term changes. We could also try having a wiki where each group needs to actively check a box, yes or no, by a certain time. We will need to give people a reasonable amount of time to respond, and also perhaps provide a box for comments. A script would remind people, 24 hours before the deadline, if they haven't responded. Each site must appoint someone to take care of this responsibility. GO PIs will be notified of chronically non-responding groups. ACTION ITEM #16: Revisit obsolete terms to see which can be merged with current terms. Decide if the IDs of obsoletes could be made secondary IDs to the currently existing term. ACTION ITEM #17: John Day-Richter will talk to the GO Editorial Office to find the best way to implement these changes and suggestions in OBO-Edit. ACTION ITEM #18: Add a term creation date to the .obo file. Annotation Issues Issues Arising from Annotation Camp Karen Christie, SGD There were a number of annotation issues that arose at the 2005 annotation camp that seemed appropriate for discussion by the entire consortium. However, there has been a considerable amount of time (nine months) between the annotation camp and this latest consortium meeting. Since we are about to have another annotation camp, we should work out procedures for how to reconcile annotation issues that arise at camp in a timely fashion. The upcoming annotation camp will consist of two parts, one of which is internal and will involve representatives from each of the reference genomes and the other of which will consist of outreach and training. For non-contentious annotation issues, is it okay for camp participants to put concrete proposals directly to the list to be checked and ratified, rather than wait for the next consortium meeting? Such proposals would be specifically about annotation issues; content issues would still need to go through the appropriate channels. The general feeling is that yes, this would be okay, but it would be good to provide a bullet summary of the recommendations in the camp minutes, which were excellent last year. Also, since we now have group leaders for different aspects of GO, such as AmiGO, content, we should make explicit contact with each of the other groups if issues arise at camp that affect these groups. The upcoming annotations camp will emphasize annotation consistency. How can we measure and track annotation consistency and quality across groups? This is something that we said we wanted to do in the grant proposal. Dates for the upcoming annotation camp are tentatively set for July 10 -14, 2006. Camp will be funded, in part, by the Genetics Department at Stanford. Annotating to Unknown David Hill, MGI Proposal: unknown terms in the ontology would be eliminated and annotations to these terms would be removed and gene products reannotated to the root of each branch of the ontology, BP, MF, CC. Everyone agreed that this was okay, but we did not establish a time table for this change. Annotation of Common Knowledge Paper Introductions Karen Christie, SGD This issue came up at the last annotation camp. Some groups allow curation of information from the introduction of papers where authors cite research papers as references. These groups use TAS for the evidence code in these cases. Other groups require curators to actually look these statements up. At camp, we didn't come up with a recommendation for the best annotation practice for this situation. It was generally agreed that TAS is not a high quality evidence code and for reference genomes especially, the goal should be for each gene product to be annotated with an experimental evidence code. References cited in the introduction of a paper are not always the most relevant and may not even be from the same species as the gene product being annotated. Some groups have TAS and have used it for different ways. Rat uses it, pombe uses it as a placeholder until the original paper can be curated, sgd used it to curate information from reviews. TAIR has a large number of TAS annotations and GOA has TAS annotations inherited from Proteome and annotation from Swiss-Prot where annotators weren't distinguishing between author statements and experimental evidence. Should groups retro-curate and remove TAS annotations? Yes, if possible. ACTION ITEM #19: TAS is no longer considered a useful evidence code and will not be used in any consistency measures of reference genome annotation. Since part of the idea of the reference genomes is to provide a source of IEA annotations for other groups, we strongly encourage reference genome annotators to not use TAS, and instead use experimental evidence codes whenever possible. The GO documentation should also state this in a clear fashion. Annotation of Common Knowledge 'Textbook Knowledge' Karen Christie, SGD Common knowledge, such as that found in textbooks like Stryer's Biochemistry text, has, in the past, been annotated using the TAS evidence code. For example, alcohol dehydrogenase has been annoated to CC:cytosol using TAS and Stryer, but this information really isn't traceable from Stryer to an experimental paper. This type of TAS annotation predates the IC evidence code which might be a much better way to annotate this kind of information. Consider the case of ribosomal proteins, where there may be strong sequence conservation leading to an ISS annotation to CC:ribosome. In this case, then, an annotation to BP:translation, using the IC evidence code would be appropriate. (Note how IC is being used in this way to link ontologies together.) What reference should be used? What entity goes in the WITH column? Curators should cite the paper that they're reading in the reference column and put the GO term that they used to make the connection in the WITH column. If the original ISS annotation was made with an internal db_ref, then that reference would be used for the IC annotation, too. We should consider establishing criteria for making IC statements, as some inferences may be more explicit than others. This would be a good topic for discussion at annotation camp. NAS vs Experimental Evidence Codes - Data Not Shown Harold Drabkin, MGI This is another issue that came up at annotation camp. When experimental results are reported in a paper but followed with 'data not shown,' what is the appropriate evidence code to use? MGI uses IDA (or whatever is the correct experimental evidence code) because it is often the case that the experimental method is clear and journals would require the data to be shown, if needed. 'Data not shown' can often be the result of space limitations for publication. 'Data not shown' can be subdivided into two types, though: the first being cases where it is clear what assay was used, and the second being cases where it's not so clear how the data was acquired. Should curators annotate the latter, and if so, using what evidence code? NAS? Should curators contact the authors about this type of data? What about information cited as personal communication? The general consensus seems to be that curators should use their best judgement in these cases. If you are confident that the author is clearly stating what they did, then use an experimental evidence code. It's okay to accept that the authors did what they said they did. If