This file contains the collected minutes of Gene Ontology Consortium meetings.

===================================================================

Ontology Meeting, Palo Alto, Jan 10 & 11, 1999

Participants:

SCD and Arabidopsis Databases
  Mike Cherry, David Botstein, curators of SCD and Arabidopsis db.
FlyBase
  Michael Ashburner & Suzi Lewis
MGD
  Janan Eppig, Joel Richardson, Judith Blake
Astra
  Ken Fasman

1. One or three? Answer is Three:

The first discussion reaffirmed that the Ontologies will be developed independently. In the future, we may explore edges between the independent ontologies, but not now. After stabilization and initial annotations of product to list terms, we will see whether to combine all together, what the levels of annotation are by species, and see about establishing edges between the three sets of terms.

- don't get into describing 'location' as part of 'function'
- three ontologies are a) function b) process c) subcellular location (but see below).

2. Gene Function vs. Gene Products?

Recognized confusion in gene product/complex listings in function list. Agreed to move protein complexes and their components to the 'cellular' list. Renamed 'cellular location' to 'cellular components'. Decided not to create a fourth category, but rather to think of 'cellular' syntax as 'is located in' with synonym 'is subcomponent of'. Recognized a subtle distinction between something that is 'part of' a macromolecule and something that is 'located in', i.e. 'part of', the nucleus. Still, decided all would be placed in this category. This is a system designed to deal with a state of incomplete knowledge. Cell is composed of the whole set of product complexes.

3. Define syntax:

a. function ... 'is a', hierarchical ...
b. process ... 'is a subprocess of' but may also list 'is an instance of', a DAG
c. component ... 'is located in' synonym: 'is subcomponent of' but may also list 'is an instance of', a DAG.

4. Specific function syntax ...

a) Not for ... Drop the 'Not for Mus, Not for Drosophila' distinctions
b) Precursors ... receive function process annotation of mature protein.
c) Facultative/Obligatory ... 'maybe part of', 'sometimes part of', 'part of under certain conditions': some things present at some times, not at others, or don't know. Some things spend some of the time in one compartment and some in the other (cell cycle proteins). Decide not to annotate these distinctions. Decide function ontology will be a straight 'is a', no qualifiers.

5) Cell is a generic cell, will not divide by cell type. Generic gene products will be listed as needed. Example: Alpha tubulin ... yeast has two, elegans has 6, some are there in mitosis, some are not. What are leaves of the trees? Are they nodes representing genes from all species? So, we've thrown out individual gene products except as needed, but gene complexes still here. Not creating laundry list of gene products of each species, but adding component parts as necessary.

Example 1.
a) cellular component: gene product A, gene product B, gene product C
b) function: alpha tubulin
c) process: mitosis, axonal transport

Not going to put in the relatedness between complex, process, and function until we have some data and a better understanding of how this will work.

6) If you have process, function and localization, it's almost as good as having a small paragraph about the gene. (example of micro-array paper)

7) Implementation Plan
a) Michael does revision of current version
b) All participants edit list very carefully
c) Suzi assigns GO numbers 'for real'
d) Start alpha annotations
e) Future terms added through 'ontology manager', currently Michael.
8) Immediate future
a) requires prototype funding (possibly from Astra via Ken Fasman, other ideas too)
b) each database needs one curator for project
c) need overall curation manager (currently Michael, person would work with Michael)
d) each database needs capital money for computer 'kit'
e) need money for meetings/travel. two meetings/year, alternate coasts
f) need programmer, ultimately 2, one for database DBA, when db created, second for interface, tools.

Astra is buying our commitment to annotate our databases to the common vocabulary ... looking for increased usability and searchability of each individually and collectively.

===================================================================

Minutes of the GO Project Meeting

The GO Meeting was held 17-19 May 1999 at the Banbury campus of CSH.

In attendance were:
FlyBase (Michael Ashburner, Suzi Lewis)
SGD (Mike Cherry, Midori Harris)
MGI (Janan Eppig, Judy Blake, Joel Richardson, Martin Ringwald, Allan Davis)

A Summary of the Meeting:

The outlined agenda for the meeting was:
1. semantics
2. species specificity
3. content
4. implementation
5. software
6. resources

1. The GO Project is recognized as a shared, pragmatic database resource involving three separate ontologies (Gene Function, Process, Cellular Component) that represent independent structured sets of terms for performing biological queries across different species genomic databases. It is not a definitive phylogenetic classification system of biology. The current GO Project is composed of three Organism Databases: FlyBase (Drosophila), SGD (yeast), and MGD (mouse). It is hoped that additional organism databases may subsequently join. Each Organism Database will annotate their genes to the three ontological categories and deposit these results in a Universal GO browser (currently being constructed by Suzi). Concurrently, each Organism Database will use their annotations in whatever way they see fit at their own web sites.
The three database (DB) groups are anxious to start implementing the project. It is felt that once people actually start 'getting their hands dirty' by annotating and testing out the system, we will recognize potential pitfalls and successes and will see areas of the ontologies that need to be developed further. 2. The system will be structured as: a Master GO List with assigned GO# identifiers. This Master List will be made available to all DB. Any curator can add to/modify this list (see below). All additions or modifiers will pass through a Central Manager (a biology-trained individual) who will then update the Master List accordingly. The Master List is anticipated to go through a period of flux during the initial phases and when any new organism joins the group. It will be the responsibility of each DB to regularly check in with the Master List and read all communications between DBs so that everyone is 'on the same page'. 3. It was recognized that this project is a WORK IN PROGRESS: meaning, it is a dynamic system where things will be added and changed as curators see problems and concerns. With subsequent biological knowledge, a re-organization of the hierarchies may even become necessary. Provisions for this re-organization will be made by the following: each DB group will initially be assigned a block of 'free' GO identification numbers (GO#) which can be used to add or modify new terms to the structured GO list. (New terms should use unique words/phrases). Any such modification will be simultaneously submitted to a Central Manager and representative of each DB via email/XML working from the current GO List. We anticipate that these initial modifications will be at lower levels of the hierarchy and not include any major re-organization that could directly and immediately affect the annotations of other DB. 
If that happens to be the case, the curator is responsible for FIRST contacting the Central Manager for inquiry, discussion, and approval PRIOR to initiating or submitting any changes to the GO list. Since a Central Manager has yet to be hired, it is prudent for curators not to initiate any major renovations to the GO List for quite some time. 4. Common Sense & Individual Responsibility must rule here: that is, each group is responsible for reading the corresponding emails submitted by each DB; each curator should check in frequently with the modified GO list on a regular basis; if a curator anticipates to spend a long time modifying some structure of the GO List he should alert all other members and then work as quickly as possible to update the changes. Effective communication is key to the success of this project. 5. Each DB will annotate their genes/gene products to any 'depth' they see fit. This should include annotating genes to complexes and not subunits of those complexes, since the cross-species homology will not necessarily hold true to such detailed levels. Once the practice is put in motion, curators will develop a better feel for how 'deep' they should be annotating based upon the work of other curators from other DB (see below). 6. Currently, the GO List records some species-specific GO terms. With subsequent biological knowledge, it may be found that these are no longer species-specific but shared amongst different organisms, thus requiring a name change. Currently species-specific terms will only be updated when it is shown that the term is not species specific. If a species does not have a gene annotated to a specific GO term, then either it is not present in the species or it hasn't been found yet. 7. Structures will have to inherently change with the discovery of new genes that do not necessarily follow the currently established hierarchies. 
Curators are to be aware of the "True Path (Judy's) Rule": the pathway up the hierarchy must always be true. If a new gene is found to break this rule or species-specificity becomes a problem, the hierarchy should be restructured by adding more nodes and connecting terms to create a new path that keeps the upward hierarchy true. When a term is added to the Master GO List, the curator needs to add all of the parents and children of the new term.

A suggestion was made to simplify the hierarchy: we might consider throwing out the "part of" GO items and instead use only the "is a" GO terms. After discussion, it was realized that too much information would be lost by eliminating the terms. The GO List will maintain both "is a" and "part of" terms.

The example used to work through this effort was that of the Process Ontology for the gene product 'chitin'. Chitin metabolism is a part of cuticle synthesis in the fly, and part of cell wall organization in yeast. As a result of the above discussion, the parent 'chitin metabolism' will now have daughters 'cuticle chitin metabolism' and 'cell wall chitin metabolism', with the appropriate catabolism and synthesis terms underneath them.

chitin metabolism
    chitin biosynthesis
    chitin catabolism
    cuticle chitin metabolism
        cuticle chitin biosynthesis
        cuticle chitin catabolism
    cell wall chitin metabolism
        cell wall chitin biosynthesis
        cell wall chitin catabolism

The procedure to add terms is made particularly difficult because the Process Ontology is a DAG, and given the current state of knowledge, it is volatile. We need an automated procedure to add all the arcs when necessary to expand the structure. A tool is needed that will:
1. not allow bad paths (solution: add extra nodes)
2. let curators see all paths
3. automate a split, given a curator's decision

8. A CVS file system will be used to process changes made to the GO list. This will automatically record the curator's name and the date; the curator must provide a succinct reason for implementing the change. The CVS client server will be set up by Mike Cherry: accounts will be set up for each individual curator for access and modifications. Additional curators may join by submitting a User name to Mike.

9. An idea was proposed that the GO list hierarchy encode next to the GO terms the number of genes specific to each organism DB listed underneath that particular GO term. For example:

%DNA metabolism (M-3, D-9, Y-5)
 %DNA replication (M-1, D-4, Y-3)
  %DNA dependent DNA replication (M-2, D-5, Y-2)

Meaning that there are 3 mouse (M) genes, 9 Drosophila (D) genes, and 5 yeast (Y) genes under the entire category of %DNA metabolism, and that they can subsequently be broken down further to each subordinate GO term for finer resolution. This should help curators understand to which level of 'depth' other curators are annotating.

10. EC#s should be kept in the GO term lists because they give curators a searchable key during annotation.

11. The following ideas and agendas were proposed for goals within a 1-2 week (?) time frame from the end of this GO Meeting:
a) Stanford curators stop GO and send an updated file to Michael Ashburner.
b) Michael Ashburner parses v. 0.2a7 ---> v. 0.9
c) v. 0.9 ---> Suzi for syntax check ---> assign unique GO# to all terms (v. 1.0); and parse into XML
d) CVS established/tested by Mike Cherry; as well, Mike will try to register the web site www.gene.org if still available; if not, other suggested names: www.genestogo.org or some combination of the words GO and gene (GOgeneGO; GOgene; geneGO; etc....)
e) all GO again.

Independently, Suzi and Joel will determine necessary XML syntax and processes.
Between June 1999 and the next meeting, two stages will be implemented: STAGE 1: CVS established initial curation: annotate as many genes as possible (FlyBase hopes to get 3000-4000 genes done; are other DB up to the challenge?) XML exports to Suzi established a Central Manager will be hired STAGE 2: a working database something of real functionality for the public: "genes to GO" 12. Making the database available to public users should coincide with a descriptive (promotional) write-up of the GO Project in a widely circulated genetically-oriented journal, such as Trends in Genetics. Members of the GO Project should be thinking of ideas for this paper and what they would like to see in it. 13. Janan will meet with Ken Fasman (Astra) and Lisa Brooks (NIH Program Officer) at the JAX MGI Advisory Board meeting in early June 1999 and discuss initiating a co-operative (MGI, SGD, FlyBase) grant for submission in November 1999. After this meeting, Janan will update other members of the GO Project on the ideas for the grant via email correspondence. 14. Resources: $200K from Astra '99 and a promise of $200K for '00. Michael Ashburner will have the money established in an EBI U.S. account in Cambridge Trust Company. 15. The next GO Meeting is scheduled for 6-9 October 1999 to be hosted by MGI in Bar Harbor, Maine. =================================================================== GO MEETING - The Jackson Labs. Oct 7-8 1999. PEOPLE MGD Judy Blake David Hill Joel Richardson Martin Ringwold Janan Eppig Charlie Ray Ben King - Mouse sequencing Jeff Davies Richard Balderelli Allan Davies SGD Andrew Kasarskis Mike Cherry Midori Harris FB Heather Butler Michael Ashburner Suzanna Lewis Astra Zeneca Michael Rebhan AGENDA 1. Current CVS/annotation of GO 2. Putting sets together for common query interface 3. Publications 4. WWW pages 5. Other collaborations 6. Funding & resources 7. People MINUTES 1. Progress FB/Berkeley. Nothing new on software; New versions imported into query tool. 
FB/Cambridge
Report on progress of attribution in FB. About 1700 done. Celera annotation plans were reported. It is hoped that they will use GO for functional inference. FB to get its reference CDS set of genes GO'd by November 7. (Ashburner/Heather).

SGD
Midori annotating yeast genes with GO, done about 300 plus tRNAs. Also doing gene summaries of each gene in SGD. Have about 3000 to do. GO query tool for internal use on www for curators; better diff files.

MGD
Alan and David Hill have been doing assignments. Detailed hand annotation with MLC and GXD - have to write detailed reports on genes and then add GO terms. At the same time do first pass "GO-FISH" - have mapped 3,000 genes with GO terms. Also mapped via EC numbers. Had not been using CVS but keeping a file of changes. Mapping SWP Keywords to GO terms - done to letter 'E'. 650 SWP Keywords that seem to be relevant to GO. 40-50% map directly to GO. David will [or could !] finish within a week !
dph@informatics.jax.org - David Hill
MGD now beginning to use CVS (Allen)

For CVS problems: mark@genome.stanford.edu
Use "update" rather than "checkout".

Agreed number series for new terms:
SGD 0000001-0001500.
MGD 0001501-0003000.
FB 0008001-0009500.

2. Putting sets together:

What we are using now:
FB tagged value format
SGD tabbed list
MGD Excel file

Evidence statements - MGD argue for "stated by author". Following agreed as valid values:

IMP inferred from mutant phenotype
IGI inferred from genetic interaction {with }
IPI inferred from physical interaction {with } **note we changed this from protein interaction
ISS inferred from sequence similarity {with }
IDA inferred from direct assay
ASS author said so
NA not available

Evidence must not be null, even if the record is "not available".

We now want to agree on a tab delimited format - which SL can parse into XML.

MEOW Core
database. [mandatory] cardinality 1 ; controlled: MGI, FB, SGD
gene symbol. [mandatory] cardinality 1
gene symbol synonym. cardinality 0, 1, >1 [white space allowed]
gene name. cardinality 0, 1 [white space allowed]
gene identifier. [mandatory] cardinality 1
chromosome. cardinality 0, 1
map position. cardinality 0, 1
short gene description. cardinality 1
db xref, NA, protein. cardinality 0, 1, >1

GO add-on
GO id. [mandatory] cardinality 1, >1
reference id. [mandatory] cardinality 1, >1 ; must be within domain of database identified in MEOW core
evidence. [mandatory] cardinality 1, >1 ; controlled, see above
aspect. cardinality 1 ; controlled F|P|C

DB,Gene_id,Gene_symbol,GOid,ref(|refs),evidence(|evidence),aspect,name,synonym(|synonym)

tab delimiter between fields (NOT commas)
within field delimiter is |
hard return at end of record
ascii

SGD_GO_files/gene_associations
MGD_GO_files/gene_associations
FB_GO_files/gene_associations

SGD & FB do a remove of old versions before committing new.

At this stage other data will not be dumped by contributing databases to GO.

2. Query/Editor tools/databases.

Private editorial tools
Local editorial interface to modify GO (ie to replace CVS) - but changes to go to editor for commitment. Stanford work on editor tool.
How do we compare for internal purposes between collab. d/bases ?

Public tools
At local sites [responsibility of collab d/bases]
Cross-genome
Data base
Servlet ? or other performance enhancement
Improved query database
GO query tool must have comment to GO email button (at first to all of GO list, so that we can all see what is going on).
Each database should implement its own query tool for GO. - all

3. WWW

Mike has registered: www.geneontology.org & www.genename.org
We agree to use geneontology.org as prime address and to close down the existing ebi and fruitfly sites (these then point to geneontology.org).

Need a top page - Cherry
Suzanna to check that the Query applet can run from this new web site. - Suzi
Suzanna will activate URL hyperlinks from query report. - Suzi
Needs url syntax for MGD (see MGD Tools for Developers on home page - or contact Joel) and for SGD (contact Mike Cherry).
Tree will show number of gene_associations per node.
The CVS can automatically update the text files and automatically write a new version and date at top of file - Cherry
ftp - three ontologies in both hierarchical and xml (rename "compartment" as "cellular component" in CVS repository). - Cherry
will xml files be automatically updated by a script when ontologies are updated ? - yes, but need to look into mechanism - Suzi.
- GO.bib
- GO.doc .. MA to re-write as an html document. Add GXD as collaborator indep of MGD
- GO.defs
- ISMB paper
- geneassociations.fly
- geneassociations.mouse
- geneassociations.yeast
GO query tool from Suzanna
email button for contacts; go to entire list - Cherry
Must change proofs of the SGD/FB/MGD NAR January issue papers, for new url.
MA to write general introduction for web page - Ashburner
MA to update GO.doc - Ashburner
Suzi to give collaborators urls for definitions. (OUP acknowledgement) - Suzi

4. Publications

Where - TIGS .. probably the best for this first paper.
Alternatives:
Genome Research
NAR
Nature Genetics
Bioinformatics
Talk to Roberts about paper for NAR Special Issue for 2001. Ashburner, but next year.
Botstein & Cherry to do a draft then to Alan Davies at MGD - Botstein/Cherry/Alan

4. Other collaborators.

C. elegans - Sternberg's NIH application for WormBase has been submitted - for summer 2000 funding.
Arabidopsis: TAIR (The Arabidopsis Information Resource) - Carnegie-Stanford (science)/NCGR (computing). Started Sept 1, all of old AtDB curators moved over to Carnegie. Chris Town of TIGR is on TAIR grant. MA worried that could be more than one push - TIGR (NSF annotation grant); Mike Bevan at John Innes. Ashburner to follow up.
Monica Riley/Gretta Serres - functional assignments for E. coli.
Need to talk to TIGR about prokaryotes. Ashburner to follow up
Look at TRANSFAC classification.
Incyte collaboration, further discussions with Frank Russo. Ashburner/Suzi
Swiss-Prot. Ashburner

5. Grants.

Janan will lead on an NIH-NHGRI R01 grant - Liza Brookes - for Feb 1 2000. - Janan

What should we ask for:
curator for MGD
curator for SGD
curator for WormBase ? as supplement
[curator for FB already on MRC grant]
Core:
GO manager/editor
software support
travel/kit

Funding cycle:
FB to 2003 NIH, 2002 MRC
SGD 2001 NIH
MGD 2001 NIH
GXD 2000-2005 (NIH Institute of Child Health)
GO 8/00-8/03 ?

Astra-Zeneca: Would Ken be willing to write two cheques, one to EBI and one to UCB, since we are the only two who now need to draw on funds ? Contracts between EBI and Jax and EBI and Stanford are academic at the moment. Should we set up a non-profit GO Inc ? Ashburner for action

6. Content

MA to finish Style Manual, work on with Andrew - Ashburner/Andrew
Need to look again at %enzyme - split by EC - what would we lose ? - use classification of substrates imposed on EC ? - Ashburner

7. Next meeting

Feb 24-26 2000 - Boston / Harvard. Talk to Bill. Ashburner
Talk to FCK re: a meeting in Les Treilles. Ashburner
Friends of GO - activate and update - add Mike Rebhan. - Ashburner
bionet.announce when new pages up and data into query tool.

FINAL REMARKS

Substantial progress has been made by all three database groups in implementing GO over the summer. This is very encouraging. Although there have been some areas of GO content that have needed changing (and several that have needed adding, as expected), in general the three ontologies seem to be working rather well. A major message of this meeting is that we must get something substantial in the public view as soon as possible. To this end we have rationalised the web sites for GO and agreed an output format for gene associations to be sent to Suzanna to drive the Query Tool. We have also agreed on a paper about GO for TIGS to be done this year.
We hope that the new web pages with a Query Tool with content can be up in a matter of weeks, tho we know that until mid-November Suzanna and Ashburner are very busy with the fly annotation. =================================================================== Gene Ontology Meeting February 25-26, 2000 at Astra-Zeneca in Cambridge, MA Attendees: Michael Ashburner (FlyBase) Suzanna Lewis (FlyBase) Heather Butler (FlyBase) Judy Blake (MGI) Janan Eppig (MGI) David Hill (MGI) Joel Richardson (MGI) Martin Ringwald (MGI) Allan Peter Davis (MGI) Michael Rebhan (Astra Zeneca) Mike Cherry (SGD) Cathy Ball (SGD) Midori Harris (SGD) Andrew Kasarskis (SGD) AGENDA ITEMS Progress Reports Celera Report Papers Collaborators and Other Projects Ontology Issues Style and Work Practices Tools for GO Questions from Michael R. Plans for Next Meeting PROGRESS REPORTS Mouse folks: Judy Blake has submitted the GO grant. The mouse members have assigned approximately 4500 genes to GO terms. 100 by hand 650 by EC number 1270 using Swiss-Prot 2500 using mouse nomenclature Without counting the Swiss-Prot data, they used 474 Molecular Function terms, 50 Cellular Component terms and 80 Biological Process terms. Since they are using automated annotation, they have performed a variety of quality checks, such as looking for more than one annotation within an ontology. They have come close to exhausting the current automated assignments and are going to be doing more by hand in the future. Yeast Folks: SGD has 1524 genes in the gene association file. About 1000 of these are ORFs and the rest are tRNAs or snoRNAs. All SGD genes have been GO-annotated by hand. Fly Folks: FlyBase currently has about 3000 genes annotated mostly by hand. Heather has worked through the protein kinases and will next tackle the protein phosphatases. Annotation of new genes will be largely done by sequence similarity, while existing genes will be done by hand in related chunks. 
When the Drosophila sequence is released in March, there will be a large amount of sequences annotated to a high-level GO ID. These will be deepened to more specific GO nodes with time. CELERA REPORT GO was used in the annotation of the Drosophila genome at Celera. Suzanna made a dataset with all genes annotated to the molecular function GO and used it for BLAST searches. Usually, the level of GO node was quite high -- only one or two terms from the top. Where experts in a field were expected to be annotating genes, the specificity of the GO nodes used were increased (for example, olfactory receptors). Ultimately, there were 40 bins labelled by GO name (the 40th was "unknown"). Annotators were then able to have a pretty reasonable guess as to the function of the new fly gene. A second binning with biological process and cellular component showed a terrific correlation with the first. About half the genes from the Celera set are associated with a GO term. Since a given gene has a less than 50% chance of having been seen by a human, an association with a GO term is very valuable. FlyBase is still waiting to receive the sequence -- it will be released with the publication of the papers in March. FlyBase will be responsible for updating the sequence in GenBank. PAPERS We agreed to immediately pursue three publications: 1) Nature Genetics solicited a short (2500 word) article from David Botstein. It will be submitted March 10, with a short author list. 2) Genome Biology -- Michael Ashburner has been asked to write a short (1000 word) article for their premier issue. It will most likely have an authorship along the lines of "The GO Consortium". 3) Genome Research -- Judy Blake will adapt the grant to a "big" paper to be submitted to Genome Research. 4) NAR database issue -- We will submit a paper to NAR as a matter of course. The submission won't be until August or September. 
Since there are likely to be changes in the NAR policies, we will discuss the details of the NAR paper at the next meeting. COLLABORATORS AND OTHER PROJECTS There was a great deal of discussion about taking on other organisms and collaborators. The conclusion was that before we take on other organisms, we must first meet the following goals: --We need to be in a database (Suzanna Lewis will be working on this, with help from Joel Richardson). Hopefully, this will be accomplished by the next meeting. See "Plans for Next Meeting" for more detailed steps. --Documentation of philosophy, styles and practices needs to be written to record and communicate our current thinking. See "Plans for Next Meeting" for more detailed steps. --A "GO manager" to coordinate changes to the ontology, arrange training, communicate with all groups, etc needs to be hired. Midori Harris has volunteered to assume the responsibility. Michael Ashburner suggested we have two classes of partners -- the first with write permission and the second without it. These "second class" partners will have to funnel suggestions and comments through a full partner. Other organism groups that have expressed interest include worm, Arabidopsis, and S. pombe. We'll invite a representative from the worm and Arabidopsis database groups to the next GO meeting. Michael Ashburner has received a grant application for "BioBabel" -- a proposal to adopt GO terms within SwissProt, Enzyme Commission and Interpro. Representatives from this group can also be invited to the next meeting. ONTOLOGY ISSUES Methods and practices for editing and maintaining the ontology took up a large portion of the discussions. Conclusions will be listed, and in the cases where the discussion is particularly illuminating, the discarded options will be listed as well. 1) Changes to GO nodes that have multiple parents... When editing one of the ontologies, it is more convenient to add another node in only one position. 
For example, if we start with the structure shown below:

a
 b
  d
   e
   f

If we want to add node 'c' as a parent of 'd' and a child of node 'a', do we need to edit all the appropriate lines, or just one? The group decided to make an "editable" non-redundant version of the ontologies:

Linear, redundant format (for viewing):

a
 b
  d
   e
   f
 c
  d
   e
   f

Non-redundant format (for editing):

a
 b
  d % c
   e
   f

The envisioned procedure is that a curator checks out the compressed, or non-redundant, version and then views an expanded version using a planned tool we're calling "The Validator." When an edit needs to be made to an ontology, it is made in the compressed version and tested with the Validator. The compressed version is then checked back into the cvs. The Validator will be written by Joel, using specifications mentioned later. The web will display the expanded, read-only format.

2) We will add GO id to parent terms. For example, we used to state:

term1 ; GOID1 % term2

Now we will state:

term1 ; GOID1 % term2 ; GOID2

3) GO nodes should aggressively avoid using species-specific definitions. We agreed to substitute "Yeast mating" with "Mating, sensu Saccharomyces." Using the "sensu" reference makes the node available to other species that use the same process/function/component. Each organism database will take care of their contributions to the species-specific language.

4) We will get rid of cellular component references in the function ontology. For example, "mitochondrial primase" needs only be "primase." There are many cases where component terms are appropriate in the process ontology, so those will remain. Michael A. will take care of this.

5) Joel pointed out these logical relationships that we need to make sure are true in the ontologies:

if A is part of B and C isa B, is A part of C? --- YES
if A isa B and B isa C, is A isa C? --- YES
if A is part of B and B is part of C, is A part of C? --- YES
if A isa B and C is part of B, is C part of A? --- NOT NECESSARILY

Joel will send out a list of the logical inconsistencies that he has detected.

6) An example that got a lot of attention is the case of the mitotic chromosome's location in the cellular component ontology. While the mitotic chromosome resides in the nucleus in yeast, it is cytoplasmic at this stage of cell life in mouse or fly. In addition, many organisms have chromosomes that are NOT located in the nucleus. The solution arrived at was to remove chromosome from the nucleus in general and place the appropriate subsets of chromosomes in the correct place (nuclear, cytoplasmic, mitochondrial).

7) We need to track deleted GO ids. There are three types of things that can happen to GO terms -- merging two (or more) nodes, splitting a node, deleting a term.

a. When a term is deleted, we will cut the line out and paste it at the end of the file (or as a child of the parent "defunct", I don't recall the final decision), using the following format: [...] and tags.

11) We currently cannot standardize rules for subdividing ontology terms, but instead will continue to make each decision on a case-by-case basis.

12) Gene products in themselves are not nodes of the function ontology, although doing something with or to a specific gene product can be one. For example, being hedgehog is not likely to be a function, but being a hedgehog receptor or hedgehog receptor ligand are functions.

13) We may eventually need a synonym table to facilitate queries.

14) Changes that need to be made to the ontology to meet the current style include eliminating unnecessary hyphens, adjusting grammar so that "transporters" are described as "transport" and "transporting," and removing words like "protein" and "factor" where we can be more explicit.

15) Heather and Midori will write some documentation about the evidence codes.

16) We need to think about a "best practices" document that will state and explain good work habits for both current and future annotators.
In the meantime, we will share any help documents, such as SGD's "Instructions for Annotating Genes Using GO."
TOOLS FOR GO
1) Database Suzanna will get a handle on this. The major difficulty has been hiring a programmer. Michael R. offered some help on this from Astra-Zeneca. Suzanna is planning on using MySQL to create a version to distribute from the central site. The schema is not yet ready, but Suzanna and Joel will work on this together. The database will also need the ability for bulk load.
2) Validator - Joel will do this We need a validator to check for: a. cycles b. deletion of nodes used in gene association files c. syntactic correctness (refer to logical relationships described in the ONTOLOGY ISSUES section.) d. unique IDs e. warning message of the number of affected nodes f. orphans g. new nodes have IDs Associated with the validator is the ability to compact and expand the ontologies for writing and reading. The validator will run on the central site, as well as locally for checking before an edited ontology is checked back in. Joel plans on writing this in Python, so each site will need to install it.
3) GO BLAST server - Mike C. will take care of this The GO BLAST server will use a dataset of GO-annotated protein sequences. The results should show each GO node associated with a gene product, as well as a few generations of ancestors.
4) Annotation aids It would be nice for curators to have a tool that, given a single node, displays all other gene products at that node (and nearby nodes) as well as all their other GO associations. This would assist curators in assigning a gene product to as many GO terms as needed, by showing them all other GO terms that might be related.
5) Suzanna's browser needs to be installed at Stanford, so we can all be using it from the same server.
6) Michael R. suggested we make a link to a DTD (document type definition) file. Suzanna will look into finding a tool that will read the XML and create a DTD file.
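As an illustration of three of the Validator checks listed under TOOLS FOR GO (cycles, unique IDs, orphans), here is a minimal sketch in Python, the language Joel plans to use. It is not Joel's design: the function names, the (child, parent) edge representation, and the one-letter term names are assumptions made only for this example.

```python
from collections import defaultdict


def find_duplicate_ids(terms):
    """terms: list of (go_id, name) pairs. Return IDs that appear more than once."""
    seen, dupes = set(), set()
    for go_id, _name in terms:
        if go_id in seen:
            dupes.add(go_id)
        seen.add(go_id)
    return sorted(dupes)


def find_orphans(term_ids, edges, roots):
    """Return terms that are not declared roots and have no parent edge."""
    has_parent = {child for child, _parent in edges}
    return sorted(t for t in term_ids if t not in roots and t not in has_parent)


def has_cycle(term_ids, edges):
    """Depth-first search over child -> parent edges; True if any cycle exists."""
    parents = defaultdict(list)
    for child, parent in edges:
        parents[child].append(parent)
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {t: WHITE for t in term_ids}

    def visit(node):
        colour[node] = GREY
        for p in parents[node]:
            state = colour.get(p, WHITE)
            if state == GREY:  # reached a node already on the current path
                return True
            if state == WHITE and visit(p):
                return True
        colour[node] = BLACK
        return False

    return any(colour[t] == WHITE and visit(t) for t in term_ids)


# Hypothetical graph: 'd' has two parents ('b' and 'c'), so this is a DAG, not a tree.
term_ids = ["a", "b", "c", "d", "e", "f"]
edges = [("b", "a"), ("c", "a"), ("d", "b"), ("d", "c"), ("e", "d"), ("f", "d")]
```

A check for deletion of nodes used in gene association files would need those files as a second input and is left out here.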
ANSWERS TO QUESTIONS FROM MICHAEL R. 1) GO ids will be stable. They may be "defuncted", but they will not go away. 2) "is a" and "part of" are likely to be used for quite some time. However, "part of" means "can be a part of", NOT "is always a part of." 3) Incyte still expresses interest, but that's all we've received from them. 4) Homepage recommendations -- Mike C. will add a bit from the grant to add more detail to the homepage. It might also benefit from the addition of statistics from the gene association files. 5) Should we have an ftp site that allows one to download the most recent version of GO? 6) Michael R. will create a FAQ to be linked from the home page. 7) Mike C. will put SGD's PowerPoint GO presentations to the GO site. PLANS FOR THE NEXT MEETING The next GO meeting will be in Cambridge, UK June 29 and 30. The plans are: 1) Have documentation ready a. GO philosophy document (Michael A., Judy and Midori) b. Rules for making changes to GO (Michael A. and Andrew) c. Rules for applying GO terms -- this is currently project-specific. Each project needs to think about this and bring something to the table next time. This should also include particularly illuminating examples, such as chitin synthesis, mitotic chromosomes. It should also emphasize how to avoid making GO nodes too species-specific, and mention the logical aspects of inserting or moving nodes. 2) Invite representatives from BioBabel, Arabidopsis, and C. elegans. 3) Have database in place 4) Create programs described above 5) Work HARD on adding more GO definitions. We have permission to use the Oxford Dictionary of Biochemistry and Molecular Biology. 6) Make the ontology edits mentioned above 7) Write three (!) 
papers
History We need to establish a FAQ page We need to arrange for introductory sessions for new groups
Status Assumption built into this is that if a term is associated with a gene product it must necessarily follow that all parent terms also are an accurate and truthful description of that gene. The structure of the ontology is tested and validated continuously as the curators assure that the parents, parts and go terms are all true.
* Yeast 1800 genes associated to GO, represent half of the total number of yeast genes that have name, evidence code for almost all of them
* Fly, automated annotation at jamboree, but not submitted to GO until curators validate them.
* Mouse mostly done automatically until now; see the handout for the numbers. When conflicts arise they try not to change the ontology unless they have to. This is done by going up to a broader term. Moving to hand annotation particularly for new genes.
Tools and Common resources
* Now available, John's web browser (www.informatics.jax.org/~jpc/GO) modeled after MESH and Brad's browser (www.fruitfly.org/~bradmars/cgi-bin/go.cgi) that is running off the Informix database
* Database (Informix, MySQL, and Oracle?) implemented and there are Perl object methods in repository. We will be writing updates to the ontology in the database after the fall meeting.
* Ontology editor, first priority. Suzi (et al.) to do by next meeting
* Mike C. to use Ian's scripts to automatically perform regular validation for text version until editor is ready.
* Merge two html versions of browser (Brad and John)
* Suzi/John to fix Java browser and decide to either pull the plug or continue development. Add link to Java help page
* Each organism database to provide a fasta file of protein sequences for those gene products that have been annotated. Suzi (et al.)
will set up blast search services for GO
* API to be refined as applications are developed
* Steffan and Heather to work up prototype for next meeting of rules between the separate ontologies
* Definitions, Michael is to contact Julian Dow for definitions from Dictionary of Cell Biology
* Mike to e-mail style manual to Michael, who will then check it into CVS
* Suzi/Brad to clean up XML version
Content
* Use part-of relationship to solve the 's/t/ protein kinase' (multipart protein) problem. E.g. %s/t protein kinase
20 groups worldwide to handle nomenclature for Candida. The Candida sequence is close to being completed. The pombe sequence is completed and the paper has been submitted for publication (Sanger group). A GO annotation set has been submitted with external references being Swissprot IDs.
Progress Report - MGI (David Hill) GO Browser now publicly available for MGI records (created/implemented by John Corradi) Gene-by-gene annotation of 15,000 genes effort is concentrated on moving IEA evidence to ISS evidence (or greater) updated annotations now being downloaded to GO webpage every Friday Major Revision of GO subsections Process Ontology: apoptosis (with Flybase/Heather)
Progress Report - TAIR (Leonore Reiser) Major Revisions Working (with Ji Yoon) on the introduction of terms into the three Ontologies to support Arabidopsis in the first instance, Plantae in general as a followthrough Plant terms - 162 total to date, 63 in the Component Ontology (50 with definitions) Expressed strong support for visual-based ontology curation and browsing tools Thesauri Worked on EGAD plant <> GO associations A contact has been initiated with Maize DB, but little feedback has been received to date (is this contact via Susan at Cornell?).
A contact has been established at the Carnegie Institute within the Rice community regarding working with GO and Cyanobase A plant-related mailing list has been initiated The notion of a 'plant clearinghouse' for GO annotation is embryonic and being organized by Lenore and Richard Burkerist (formerly of the Sanger). Progress Report - Wormbase (Erich Schwarz) Erich is Paul Sternberg's first employee Wormbase growth Wen Chen (will be adding expression data to Wormbase AceDB) Raymond Lee (will be porting Wormbase from AceDB to other platforms) to fill one more curator's slot by summer & one db programmer Erich will be the GO-responsible party AceDB is beautiful for assisting in positional cloning. What is the relationship with Lincoln Stein? Invaluable He is on the grant that currently underwrites Wormbase Essential at the level of asking Lincoln to do things .. Lincoln put together the Wormbase.Org site at Paul Sternberg's request Eventually Wormbase will move from Cold Spring Harbor to CalTech relationship to Proteome ... they are using GO ... ?in the human curation? General feeling is that the scientists have lost control of Proteome, the businessmen have smelled profit. Proteome is helping out Worm by providing Worm database definition lines (?). relationship with the Sanger ... John Hodgin (gene mapping db) ... Sanger/WashU sequence feed ... very complex interaction map with other smaller dbs ... Richard Durbin involved as a subsidiary PI, like Lincoln Stein goal ... weekly updates rather than once every 3 months Progress Report - Prokaryotes & Protozoa Charlie Hodgkin (Glaxo; used to write software for Michael) initially contacted Michael and expressed his intention to use the GO for annotating E. coli. This launched an e-mail flurry that eventually led to interaction between Michael and Heather, and Monica Riley and her postdoc Greta Serres (around ISMB '99). Charlie, with Monica's support, is quite interested in putting the E. 
coli GO annotations in the public domain, but there is no indication of when this might be completed. Michael and Heather have mapped Monica Riley's latest (non-GO) classification to the GO, but it cannot be publicly released. This mapping required the addition of many terms to the GO and has set up the GO for use for most enteric bacteria. Monica has a great deal (10 years of work) invested in her classification scheme and has a great deal of interest in seeing a proper mapping/merge between GO and her scheme. Michael and Heather have also obtained from Monica Riley the Genprotec enzymes list (a list of E. coli proteins), and this has been parsed into the Function Ontology. The situation around EcoCyc is complicated and it is unclear when EcoCyc <> GO mapping might be done. EcoCyc ownership is being resolved between DoubleTwist, Pangea (now DoubleTwist), and SRI; an NIH grant (content unknown) is being held up pending resolution. There is an interest at Stanford in annotating cyanobacteria with GO, but it is unclear where this is going in the near term. A previously noted protist interaction is dead (Russ Altman?) : a Canadian group has recently become interested in annotating protists with the GO in conjunction with some large scale sequencing. There is currently no activity on the horizon for annotating viruses with the GO. The group has not looked at the 'minimal microbial genome' to see if it is fully represented in the GO. Michael would like to see Pseudomonas included (for its xenobiotic metabolism) and Streptomyces (for its antibiotic biosynthesis) Three groups have independently applied for funding to the Wellcome Trust for the sequencing of 3 different protozoa (Leishmania, Trypanosoma cruzi, and Trypanosoma brucei), in close collaboration with a new Plasmodium database at the University of Pennsylvania.
Al Ivans (Sanger), who I presume is attached to one or more of these, is interested in using the GO in their gene annotations, though nothing is firm to date General message: would like to identify communities, work to build a foundation for them, then invite them to build on that foundation Software Developments A relational database containing the GO terms is now available. It was unclear to me whether this is in Oracle in addition to MySQL, but that is not really relevant at this juncture. The time is imminent when the following will be available to curators: direct writes to a pre-production (development) database instance elimination of duplicate ID and basic syntax errors owing to data integrity functions (currently no data integrity ... everything done as flat textfiles). Curation interface with undo, DAG view, and commit functionality (see below) Rollback to previous version, owing to audit trail functions Java Browser (John Richter) Online at fruitfly.org Demonstration was very well received by participants A lot of effort has gone into making the browser platform independent All GO terms in all three Ontologies are found in a single tree (one window in Browser) Clicking any term brings up a minimal DAG in an accessory window Queries run as Perl5 regular expressions Query Window supports either Boolean or Perl5 regular expression queries After pulling a gene product, can click on a term block and thereby highlight the terms in the Ontology Tree Window; the minimal DAGs are also shown for these terms in an accessory window The number of gene products is no longer indicated in the interface Java Editor (John Richter) The editor will have the same look and feel as the Browser when completed Editor runs as an application rather than as an applet; it is available under CVS and must be installed to be used GO ID is fixed; new terms get automatic assignment (there was brief discussion around how to assign ID's which did not reach final resolution; currently, ID 
ranges are apparently assigned to curation groups)
Structural Edits (edits to the DAG structure)
simple select/click/drag to change structure
'infinite' undoes
term-merge by click/drag; one term lives on while the other is obsolesced and becomes a synonym of the live term
term-splitting enabled (do not recall details)
term obsolescence is automatic if all children are obsolesced
cyclic graphs are not automatically disallowed; thus curators need to avoid descent along obsolesced terms
For searches, obsolesced terms should all be treated as leaves
Rollbacks: In theory, 'infinite' rollback is possible, but there is currently no 'History Viewer' and John suggested that design of one would be significantly more complex than either Browser or Editor.
Logic Validation is currently not within the remit of the Editor
Erich: would like to have a visual cue that indicates which terms have been changed recently
Provides for tagging a term as part of GO Slim, and capability of adding any number of additional segregation tags in the future.
Import & Export of subtrees will soon be available (a very useful feature)
A Gene Product Viewer will be added in the near future, the nature of which is currently unknown.
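The term-merge behaviour described for the Editor (one term lives on, the other is obsolesced and becomes a synonym of the live term) can be sketched in Python as below. This is only an illustration and not the Java Editor's code: the Term class, the merge_terms function, and the GO IDs are made up, and carrying the retired term's own synonyms across is an assumption the minutes do not state.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    go_id: str
    name: str
    synonyms: list = field(default_factory=list)
    obsolete: bool = False


def merge_terms(survivor, retired):
    """Merge 'retired' into 'survivor'.

    The retired term keeps its ID but is flagged obsolete, and its name is
    recorded as a synonym of the surviving term. Copying the retired term's
    existing synonyms as well is an assumption of this sketch.
    """
    if survivor.go_id == retired.go_id:
        raise ValueError("cannot merge a term with itself")
    for label in [retired.name, *retired.synonyms]:
        if label != survivor.name and label not in survivor.synonyms:
            survivor.synonyms.append(label)
    retired.obsolete = True
    return survivor


# Hypothetical terms and IDs, for illustration only.
live = Term("GO:0000001", "primase")
dead = Term("GO:0000002", "mitochondrial primase", synonyms=["mt primase"])
merge_terms(live, dead)
```

Keeping the retired ID in place, flagged rather than removed, matches the stated intent elsewhere in these minutes that GO ids stay stable even when a term is defuncted.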
Technical Discussion : aiming for a Generic Ontology Builder The current Editor is componentized The DAG component is generic, not specific for GO Using Java components has allowed fancy click/drag functionality, including a smooth table resizer All components are available on the CVS repository : with DOCUMENTATION (fancy that) The GO Schema is available at www.fruitfly.org/annot/go/database/index.html People are encouraged to write ports to different database formats and submit them to the GO Database mailing list
HTML Browser (Brad Mars) The HTML Browser is running off of the GO Relational Database (Chris Mungall's Perl API) rather than off of the GO Flatfiles, as the Java Browser does. Currently found at http://www.fruitfly.org/~bradmars/cgi-bin/go.cgi?accession=3700 Similar : though generally weaker : functionality to the Java Browser. Plans are to increase functionality to be on par with the Java Browser
BLAST Server (Suzanna Lewis) There is an obvious need for the ability to BLAST against organism sequence sets using sequences retrieved via GO searches. An unasked question : why do they not rely on the archival databases to serve BLASTs? There was some discussion around whether to have 'all 400 versions of ADH' included or not The current thought is to have protein sequences in fasta format deposited on the GO website and available for Blast searches In the case of MGI, the plan is to submit a single protein sequence per gene. This single sequence will be the primary Swissprot sequence that represents each gene product. MGI annotation is to the level of the gene, not to the gene product; therefore, spliceforms and post-translational modifications do not matter that much to them at this point in time. In the case of yeast, about 2000 yeast protein sequences have already been provided. The fasta header line for the yeast sequences was discussed as a standard.
general syntax : key:value[space] (key order not constrained) 'provide everything that you can provide, the more the better' community name for the gene (that used locally within a particular domain) gene name (the generally accepted, or global name for the gene) tab delimited list of GO ID's one or more external database cross-references Timing, Resource, and Prioritization Issues The primary software developers associated with the GO Consortium are John Richter and Chris Mungall. John is the primary driver behind development of the Java Browser and Editor, which appear to be the currently favored for GO delivery to end users and curators. About 3 weeks of work remain (according to John) to reach a final stage on Java Editor & Browser John is currently under obligation for development of an open-source gene annotation platform, Apollo, and timing for completion of software development for the GO Consortium is currently critically dependent upon his obligations to Apollo. Also competing for resource is a Fly Genome Reannotation project that is on-going. There is significant concern (voiced by Michael) around the absence of rigorous syntactic control and internal checks. John indicated, though, that syntactic checking algorithms are 'the stuff of papers' and were going to devour the majority of the development time. (I am quite fuzzy on exactly what the issue is here). The concern was voiced that the GO DAG could be really screwed up if the Editor is used improperly ... powerful tools mean powerful edits. However, the potential for rollbacks really softens the potential impact of this problem. Suzanna voiced the opinion that the priority lay in getting out of flatfile mode and into a relational database environment. It seemed that most everyone agreed with this. A key untackled problem appears to be 'how to commit DAGs to a database,' which apparently both John and Judy are thinking hard about. (again, I'm not quite clear on what this issue is here.) 
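The fasta header syntax discussed under BLAST Server above (space-separated key:value pairs in any order, with a tab-delimited list of GO IDs) could be read with something like the Python sketch below. The minutes do not name the keys, so 'symbol', 'name', 'go' and 'xref' are hypothetical, as are the example header and all IDs in it.

```python
def parse_fasta_header(line):
    """Parse a '>key:value key:value ...' header into a dict of lists.

    Pairs are separated by single spaces and may come in any order. Each
    token is split at its first colon, so values may themselves contain
    colons (as GO IDs do). A key may repeat (e.g. several cross-references).
    """
    fields = {}
    for token in line.lstrip(">").strip().split(" "):
        key, sep, value = token.partition(":")
        if not sep or not key:
            continue  # skip anything that is not a key:value pair
        fields.setdefault(key, []).append(value)
    return fields


def go_ids(fields, key="go"):
    """Split the tab-delimited GO ID list stored under the (assumed) 'go' key."""
    ids = []
    for value in fields.get(key, []):
        ids.extend(part for part in value.split("\t") if part)
    return ids


# Made-up header for illustration; no real gene or accession is implied.
header = ">symbol:abc1 name:ABC1 go:GO:0000001\tGO:0000002 xref:DB1:X0001 xref:DB2:Y0002"
parsed = parse_fasta_header(header)
```

Splitting only on spaces keeps the tab-delimited GO list intact inside one value; a header whose gene names contain spaces would need a different delimiter or quoting rule, which the minutes do not settle.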
Interim Solution during resource-limited period John will do enough so that the Java Editor is available and will produce flatfiles; the flatfiles will be fed to the GO via the currently implemented CVS system. Downloads from the 'outside' Data is currently provided to external parties on an ad hoc basis. The MySQL database can be downloaded with the following caveat: 'If you make any changes, do not publish this database'. Among the groups who have downloaded is AstraZeneca (Bo Servenius, Lund, Sweden). Presumably, people are writing applications on top of the database, so John is trying to give a lead time on large changes. A Perl Development Kit has been assembled to assist in the loading/updating of external representations. A defined update/error reporting process has not been set up to date. However, each of these is currently handled through the e-mail lists (GO-FRIENDS, GO-DATABASE, GO-DIFF). The contact points should probably be Chris Mungall and John Richter. Send format problems to John Richter (http://www.fruitfly.org/annot/go/) and suggestions to John at go-admin@bdgp.lbl.gov Right now, remote access is relatively slow, presumably because there is lots of SQL being done remotely. John and Chris have some ideas bouncing about regarding 'fast access to GO Objects'. Consult John on this matter. Discussion around the Editing Process (various) (see also Software Development: Java Editor) Judy: a database administrator would help to stem the proliferation of terms and provide some control over the process Still trying to resolve the issue of how to distribute editing to a database globally across groups and disciplines. A software tool is emerging (Java Editor, design by John Richter) but the issue of editing control is still being debated. Michael: Minor Changes - Just commit them; Major Changes - consult the group via e-mail and commit after period (currently ~10 days); Really Major Changes - talk in person.
Michael does not really want to see the current process changed. There was some discussion between Michael and Judy around similarity between the GO and extant nomenclatures, where committees meet to commit changes, but Michael would really want to avoid the potential emergence of bureaucracy, favoring maintenance of the present 'sociology' around the GO Consortium. It was generally accepted that there should be both a Working Database (Development) and an Authoritative Database (Production). There was some split of opinion over whether the Working Database should be publicly visible, but it appears that it will be made so, the argument being that the more skilled eyes that are looking at the work in development, the better. Schema Discussion, with emphasis on DBxREF A discussion ensued around whether the DBxREF field should refer to Terms and Definitions in the aggregate or separately. This discussion gets to the issue of whether a particular piece of information derived from a Source should be traceable to the Source or not. The current record structure looks like this: GO ID Term Definition Definition Refs Synonyms Synonym Refs DBxREFS (+ Boolean; supports Definition, Term, etc.) Suzanna led the discussion around proposed changes to the Schema. suggest adding a 'subclass' table to support tags that allow quick segregation of the GO into subsets (such as GO SLIM). suggest addition of sequence relationships to support sequence accession number searches, with coincident links to Gene Products Suggest addition of a type flag to the term_dbxref table to allow Term vs. Definition (or other type) values to be distinguished. 
Also, suggest addition of DBxREF tables as needed around other value-sets (such as synonyms)
Trademark (Mike Cherry) A copyright statement has been added to the GO Consortium Page A logo [designed by John Matese] is available for the GO Consortium (found on http://www.geneontology.org/GO.usage.html; specifically http://genome-www.stanford.edu/images/GOthumbnail.gif) The Trademarking/Copyrighting issue comes up now as more people become interested in using GO and the potential for product developers to incorporate GO without attribution or in an independent but apparently associated manner increases.
GO SLIM The utility of GO SLIM was generally recognized. A referent/cutout/subclass table will be added to the GO Relational DB that will allow extraction of GO SLIM, GO FAT (full), Plant Terms, and other term subsets.
Utility of GO SLIM Initially needed for high level Celera annotation Subsequently needed for Riken Annotation Jamboree (FANTOM meeting) this was apparently an effort to annotate by 'computational means' about 2000 previously undescribed mouse cDNAs 11 methods were used and a manuscript is apparently being assembled one thing that the group rues is the absence of an upfront comparative metric between the 11 methods; they could tease out a comparison between the methods, but it would be tedious, though perhaps worth it This effort involved the MGI (David) and took place around the time of the global genomics meeting in Japan Problem: The GO SLIM used for Celera and Riken were not identical Useful for easy to apprehend graphics, such as pie charts Useful for high level classification, such as across genes in a microarray experiment Potentially more stable than full GO (though this is not a sentiment shared by all participants). In particular, the 'lower half' of the Process Ontology is in flux.
Status of GO SLIM Currently maintained by hand (by Michael) One outcome of meeting was a decision to add a GO SLIM entry to each Gene Product annotated.
A commitment to automating this process was discussed. This GO SLIM view would be included in Swissprot and NCBI representations alongside the GO FAT view. The current criterion for a term being part of GO SLIM is 'someone says so'. There is no intrinsic connotation to the node-levels in GO. This means that there is nothing inherently similar among terms at Level 3, for instance. Thus, the slice through the Ontologies that would define GO SLIM is necessarily ragged (occurring at different levels for different branches). Currently no alerting system to highlight changes in GO SLIM; requires consultation of DIFF files and manual extraction. Most recent version 100-150 terms; see Appendix A.
Funding The GO Consortium expects to receive full funding for pending grants. Funding is for 3 years beginning 1 December 2000 There is currently an administrative (Congressional Budget) delay on the disbursement of funds. Funding is initially to support 3 groups in the GO Consortium (Mouse, Yeast, Fly), with the addition of 2 groups after the 1st year (presumably Arabidopsis and C. elegans). Other funding ... Incyte (no details) ... AstraZeneca (no details)
Publications A Genome Research paper is in the process of being written (GO Consortium), which will provide more details on the GO and is a followthrough from the Nature Genetics paper recently published. A Genomics paper is in the process of being written (MGI) which will detail some of the recent mouse gene annotations incorporating the GO
Next Meeting 3, 4, 5 March Palo Alto, CA hosted by Sue Rhee, Carnegie Institute, Stanford University
Immediate Actions Doug:send THE LETTER to him to for DoubleTwist:.invite them to Dec. meeting:.CALL ANDREW AND ASK HIM WHO TO SEND THE LETTER TO:EXPLAIN THE SITUATION..
review John Richter's FAQ about GO page: WE NEED TO SEND FASTA/GO FILE TO GO-SLIM WE NEED TO POST AT mgi THE mgi/GO FILE THAT IS SENT TO www.geneontology.org Appendix A: Current GO SLIM From: Suzanna Lewis[SMTP:suzi@bdgp.lbl.gov] Sent: Wednesday, October 25, 2000 11:34 AM To: ma11@gen.cam.ac.uk; midori@genome.Stanford.EDU; suzi@bdgp.lbl.gov Cc: go@genome.Stanford.EDU; dph@titan.informatics.jax.org Subject: Re: go_slim also needed to add unlocalised to component slim. here is the updated version $Gene_Ontology ; GO:0003673 $cellular_component ; GO:0005575 %cell wall ; GO:0005618 %extracellular ; GO:0005576 20 groups worldwide to handle nomenclature for Candida. The Candida sequence is close to being completed. The pombe sequence is completed and the paper has been submitted for publication (Sanger group). A GO annotation set has been submitted with external references being Swissprot IDs. Progress Report - MGI (David Hill) GO Browser now publically available for MGI records (created/implemented by John Corradi) Gene-by-gene annotation of 15,000 genes effort is concentrated on moving IEA evidence to ISS evidence (or greater) updated annotations now being downloading to GO webpage every Friday Major Revision of GO subsections Process Ontology: apoptosis (with Flybase/Heather) Progress Report - TAIR (Leonore Reiser) Major Revisions Working (with J. Yoon) on the introduction of terms into the three Ontologies to support Arabidopsis in the first instance, Plantae in general as a followthrough Plant terms - 162 total to date, 63 in the Component Ontology (50 with definitions) Expressed strong support for visual-based ontology curation and browsing tools Thesauri Worked on EGAD plant <> GO associations A contact has been initiated with Maize DB, but little feedback has been received to date (is this contact via Susan at Cornell?). 
A contact has been established at IRRI within the Rice community regarding working with GO and Cyanobase A plant-related mailing list has been initiated The notion of a 'plant clearinghouse' for GO annotation is embryonic and being organized by Lenore and Richard Burkiewich (formerly of the Sanger). Progress Report - Wormbase (Erich Schwarz) Erich is Paul Sternberg's first WormBase employee Wormbase growth Wen Chen (will be adding expression data to Wormbase AceDB) Raymond Lee (will be porting Wormbase from AceDB to other platforms) to fill one more curator's slot by summer & one db programmer Erich will be the GO-responsible party AceDB is beautiful for assisting in positional cloning. What is the relationship with Lincoln Stein? Invaluable He is on the grant that currently underwrites Wormbase Essential at the level of asking Lincoln to do things .. Lincoln put together the Wormbase.Org site at Paul Sternberg's request Eventually Wormbase will move from Cold Spring Harbor to CalTech Proteome is helping out Worm by providing Worm sequence definition lines relationship with the Sanger ... John Hodgin (gene mapping db) ... Sanger/WashU sequence feed ... very complex interaction map with other smaller dbs ... Richard Durbin involved as a subsidiary PI, like Lincoln Stein Progress Report - Prokaryotes & Protozoa Charlie Hodgkin (Glaxo; used to write software for Michael) initially contacted Michael and expressed his intention to use the GO for annotating E. coli. This launched an e-mail flurry that eventually led to interaction between Michael and Heather, and Monica Riley and her postdoc Greta Serres (around ISMB '99). Charlie, with Monica's support, is quite interested in putting the E. coli GO annotations in the public domain, but there is not indication of when this might be completed. There is an interest at Carnegie in annotating cyanobacteria with GO, but it is unclear where this is going in the near term. 
A previously noted protist interaction is dead: a Canadian group has recently become interested in annotating protists with the GO in conjunction with some large scale sequencing. There is currently no activity on the horizon for annotating viruses with the GO. The group has not looked at the 'minimal microbial genome' to see if it is fully represented in the GO. Michael would like to see Pseudomonas included (for its xenobiotic metabolism) and Streptomyces (for its antibiotic biosynthesis) Three groups have independently applied for funding to the Wellcome Trust for the sequencing of 3 different protozoa (Leishmania, Trypanosoma cruzi, and Trypanosoma brucei), in close collaboration with a new Plasmodium database at the University of Pennsylvania. Al Ivans (Sanger), who I presume is attached to one or more of these, is interested in using the GO in their gene annotations, though nothing is firm to date General message: would like to identify communities, work to build a foundation for them, then invite them to build on that foundation Software Developments A relational database containing the GO terms is now available. It was unclear to me whether this is in Oracle in addition to MySQL, but that is not really relevant at this juncture. The time is imminent when the following will be available to curators: direct writes to a pre-production (development) database instance elimination of duplicate ID and basic syntax errors owing to data integrity functions (currently no data integrity ... everything done as flat textfiles).
Curation interface with undo, DAG view, and commit functionality (see below) Rollback to previous version, owing to audit trail functions Java Browser (John Richter) Online at fruitfly.org Demonstration was very well received by participants A lot of effort has gone into making the browser platform independent All GO terms in all three Ontologies are found in a single tree (one window in Browser) Clicking any term brings up a minimal DAG in an accessory window Queries run as Perl5 regular expressions Query Window supports either Boolean or Perl5 regular expression queries After pulling a gene product, can click on a term block and thereby highlight the terms in the Ontology Tree Window; the minimal DAGs are also shown for these terms in an accessory window The number of gene products is no longer indicated in the interface Java Editor (John Richter) The editor will have the same look and feel as the Browser when completed Editor runs as an application rather than as an applet; it is available under CVS and must be installed to be used GO ID is fixed; new terms get automatic assignment (there was brief discussion around how to assign ID's which did not reach final resolution; currently, ID ranges are apparently assigned to curation groups) Structural Edits (edits to the DAG structure) simple select/click/drag to change structure 'infinite' undoes term-merge by click/drag; one term lives on while the other is obsolesced and becomes a synonym of the live term term-splitting enabled (do not recall details) term obsolescence is automatic if all children are obsolesced cyclic graphs are not automatically disallowed; thus curators need to avoid descent along obsolesced terms For searches, obsolesced terms should all be treated as leaves Rollbacks: In theory, 'infinite' rollback is possible, but there is currently no 'History Viewer' and John suggested that design of one would be significantly more complex than either Browser or Editor.
- Logic Validation is currently not within the remit of the Editor
- Erich: would like to have a visual cue that indicates which terms have been changed recently
- Provides for tagging a term as part of GO Slim, and capability of adding any number of additional segregation tags in the future.
- Import & Export of subtrees will soon be available (a very useful feature)
- A Gene Product Viewer will be added in the near future, the nature of which is currently unknown.

Technical Discussion: aiming for a Generic Ontology Builder
- The current Editor is componentized
- The DAG component is generic, not specific for GO
- Using Java components has allowed fancy click/drag functionality, including a smooth table resizer
- All components are available on the CVS repository, with DOCUMENTATION (fancy that)
- The GO Schema is available at http://www.godatabase.org/dev/database/database/
- People are encouraged to write ports to different database formats and submit them to the GO Database mailing list

HTML Browser (Brad Mars)
- The HTML Browser is running off of the GO Relational Database (Chris Mungall's Perl API) rather than off of the GO Flatfiles, as the Java Browser does
- currently found at http://www.godatabase.org/cgi-bin/go.cgi
- Similar, though generally weaker, functionality to the Java Browser. Plans are to increase functionality to be on par with the Java Browser

BLAST Server (Suzanna Lewis)
- There is an obvious need for the ability to BLAST against organism sequence sets using sequences retrieved via GO searches. An unasked question: why do they not rely on the archival databases to serve BLASTs?
- There was some discussion around whether to have 'all 400 versions of ADH' included or not
- The current thought is to have protein sequences in fasta format deposited on the GO website and available for BLAST searches
- In the case of MGI, the plan is to submit a single protein sequence per gene. This single sequence will be the primary Swissprot sequence that represents each gene product.
MGI annotation is to the level of the gene, not to the gene product; therefore, spliceforms and post-translational modifications do not matter that much to them at this point in time.
- In the case of yeast, about 2000 yeast protein sequences have already been provided. The fasta header line for the yeast sequences was discussed as a standard.
  - general syntax: key:value[space] (key order not constrained)
  - 'provide everything that you can provide, the more the better'
  - community name for the gene (that used locally within a particular domain)
  - gene name (the generally accepted, or global name for the gene)
  - tab-delimited list of GO IDs
  - one or more external database cross-references

Timing, Resource, and Prioritization Issues

The primary software developers associated with the GO Consortium are John Richter and Chris Mungall. John is the primary driver behind development of the Java Browser and Editor, which appear to be currently favored for GO delivery to end users and curators. About 3 weeks of work remain (according to John) to reach a final stage on the Java Editor & Browser. John is currently under obligation for development of an open-source gene annotation platform, Apollo, and timing for completion of software development for the GO Consortium is currently critically dependent upon his obligations to Apollo. Also competing for resource is a Fly Genome Reannotation project that is ongoing. There is significant concern (voiced by Michael) around the absence of rigorous syntactic control and internal checks. John indicated, though, that syntactic checking algorithms are 'the stuff of papers' and were going to devour the majority of the development time. (I am quite fuzzy on exactly what the issue is here.) The concern was voiced that the GO DAG could be really screwed up if the Editor is used improperly ... powerful tools mean powerful edits. However, the potential for rollbacks really softens the potential impact of this problem.
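
Returning briefly to the fasta header convention discussed under the BLAST Server (space-separated key:value pairs in any order, with a tab-delimited list of GO IDs), a reader for such a header might be sketched as follows. The key names `symbol`, `name`, `go` and `xref`, and all sample values other than the two GO IDs (taken from Appendix A), are hypothetical; the minutes do not record the actual keys that were agreed.

```python
def parse_header(line):
    """Parse '>key:value key:value ...' into a dict; key order is free.

    Key names are assumptions for illustration only.
    """
    fields = {}
    for token in line.lstrip(">").strip().split(" "):
        if not token:
            continue
        # Split on the first colon only, since values such as GO IDs
        # contain colons themselves.
        key, sep, value = token.partition(":")
        if not sep:
            raise ValueError(f"not a key:value pair: {token!r}")
        fields[key] = value
    # The GO IDs are a tab-delimited list inside a single value.
    if "go" in fields:
        fields["go"] = fields["go"].split("\t")
    return fields


# Made-up header line; only the GO IDs come from the minutes.
header = ">symbol:abc1 go:GO:0005618\tGO:0005576 name:ABC1 xref:DB:X0001"
record = parse_header(header)
print(record["go"])
print(record["xref"])
```

Splitting on single spaces rather than on any whitespace is what keeps the tab-delimited GO ID list together as one value.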
Suzanna voiced the opinion that the priority lay in getting out of flatfile mode and into a relational database environment. It seemed that most everyone agreed with this. A key untackled problem appears to be 'how to commit DAGs to a database,' which apparently both John and Judy are thinking hard about. (Again, I'm not quite clear on what the issue is here.)

Interim Solution during resource-limited period

John will do enough so that the Java Editor is available and will produce flatfiles; the flatfiles will be fed to the GO via the currently implemented CVS system.

Downloads from the 'outside'
- Data is currently provided to external parties on an ad hoc basis.
- The MySQL database can be downloaded with the following caveat: 'If you make any changes, do not publish this database'.
- Among the groups who have downloaded is AstraZeneca (Bo Servenius, Lund, Sweden).
- Presumably, people are writing applications on top of the database, so John is trying to give a lead time on large changes.
- A Perl Development Kit has been assembled to assist in the loading/updating of external representations.
- A defined update/error reporting process has not been set up to date. However, each of these is currently handled through the e-mail lists (GO-FRIENDS, GO-DATABASE, GO-DIFF). The contact points should probably be Chris Mungall and John Richter.
- Send format problems to John Richter (http://www.fruitfly.org/annot/go/) and suggestions to John at go-admin@bdgp.lbl.gov
- Right now, remote access is relatively slow, presumably because there is lots of SQL being done remotely. John and Chris have some ideas bouncing about regarding 'fast access to GO Objects'. Consult John on this matter.
Discussion around the Editing Process (various) (see also Software Development: Java Editor) Judy: a database administrator would help to stem the proliferation of terms and provide some control over the process Still trying to resolve the issue of how to distribute editing to a database globally across groups and disciplines. A software tool is emerging (Java Editor, design by John Richter) but the issue of editing control is still being debated. Michael: Minor Changes - Just commit them; Major Changes - consult the group via e-mail and commit after period (currently ~10 days); Really Major Changes - talk in person. Michael does not really want to see the current process changed. There was some discussion between Michael and Judy around similarity between the GO and extant nomenclatures, where committees meet to commit changes, but Michael would really want to avoid the potential emergence of bureaucracy, favoring maintenance of the present 'sociology' around the GO Consortium. It was generally accepted that there should be both a Working Database (Development) and an Authoritative Database (Production). There was some split of opinion over whether the Working Database should be publicly visible, but it appears that it will be made so, the argument being that the more skilled eyes that are looking at the work in development, the better. Schema Discussion, with emphasis on DBxREF A discussion ensued around whether the DBxREF field should refer to Terms and Definitions in the aggregate or separately. This discussion gets to the issue of whether a particular piece of information derived from a Source should be traceable to the Source or not. The current record structure looks like this: GO ID Term Definition Definition Refs Synonyms Synonym Refs DBxREFS (+ Boolean; supports Definition, Term, etc.) Suzanna led the discussion around proposed changes to the Schema. 
- Suggest adding a 'subclass' table to support tags that allow quick segregation of the GO into subsets (such as GO SLIM).
- Suggest addition of sequence relationships to support sequence accession number searches, with coincident links to Gene Products.
- Suggest addition of a type flag to the term_dbxref table to allow Term vs. Definition (or other type) values to be distinguished. Also, suggest addition of DBxREF tables as needed around other value sets (such as synonyms).

Trademark (Mike Cherry)
- A copyright statement has been added to the GO Consortium Page
- A logo [designed by John Matese] is available for the GO Consortium (found on http://www.geneontology.org/GO.usage.html; specifically http://genome-www.stanford.edu/images/GOthumbnail.gif)
- The Trademarking/Copyrighting issue comes up now as more people become interested in using GO and the potential for product developers to incorporate GO without attribution or in an independent but apparently associated manner increases.

GO SLIM

The utility of GO SLIM was generally recognized. A referent/cutout/subclass table will be added to the GO Relational DB that will allow extraction of GO SLIM, GO FAT (full), Plant Terms, and other term subsets.
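
As a rough illustration of how such a subset table could support extraction of GO SLIM and other term subsets, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (`term`, `term_subset`, `acc`, `subset`) are assumptions and do not reflect the actual GO schema; the term IDs are those listed in Appendix A, and the 'plant' tag on 'cell wall' is purely illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE term (
    acc  TEXT PRIMARY KEY,   -- GO ID
    name TEXT NOT NULL
);
-- One row per (term, tag); a term may carry any number of tags.
CREATE TABLE term_subset (
    acc    TEXT NOT NULL REFERENCES term(acc),
    subset TEXT NOT NULL,
    PRIMARY KEY (acc, subset)
);
""")

conn.executemany("INSERT INTO term VALUES (?, ?)", [
    ("GO:0003673", "Gene_Ontology"),
    ("GO:0005575", "cellular_component"),
    ("GO:0005618", "cell wall"),
    ("GO:0005576", "extracellular"),
])
conn.executemany("INSERT INTO term_subset VALUES (?, ?)", [
    ("GO:0003673", "slim"),
    ("GO:0005575", "slim"),
    ("GO:0005618", "slim"),
    ("GO:0005576", "slim"),
    ("GO:0005618", "plant"),   # illustrative tag only
])


def terms_in_subset(conn, subset):
    """Return (acc, name) pairs for every term tagged with `subset`."""
    rows = conn.execute(
        "SELECT t.acc, t.name FROM term t "
        "JOIN term_subset s ON s.acc = t.acc "
        "WHERE s.subset = ? ORDER BY t.acc",
        (subset,),
    )
    return rows.fetchall()


print(terms_in_subset(conn, "slim"))
print(terms_in_subset(conn, "plant"))
```

Keeping the tags in a separate table, rather than as a column on the term, is what lets one term belong to several subsets at once.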
Utility of GO SLIM
- Initially needed for high level Celera annotation
- Subsequently needed for Riken Annotation Jamboree (FANTOM meeting)
  - this was apparently an effort to annotate by 'computational means' about 2000 previously undescribed mouse cDNAs
  - 11 methods were used and a manuscript is apparently being assembled
  - one thing that the group rues is the absence of an upfront comparative metric between the 11 methods; they could tease out a comparison between the methods, but it would be tedious, though perhaps worth it
  - This effort involved the MGI (David) and took place around the time of the global genomics meeting in Japan
- Problem: The GO SLIMs used for Celera and Riken were not identical
- Useful for easy to apprehend graphics, such as pie charts
- Useful for high level classification, such as across genes in a microarray experiment
- Potentially more stable than full GO (though this is not a sentiment shared by all participants). In particular, the 'lower half' of the Process Ontology is in flux.

Status of GO SLIM
- Currently maintained by hand (by Michael)
- One outcome of meeting was a decision to add a GO SLIM entry to each Gene Product annotated. A commitment to automating this process was discussed. This GO SLIM view would be included in Swissprot and NCBI representations alongside the GO FAT view.
- The current criterion for a term being part of GO SLIM is 'someone says so'.
- There is no intrinsic connotation to the node-levels in GO. This means that there is nothing inherently similar among terms at Level 3, for instance. Thus, the slice through the Ontologies that would define GO SLIM is necessarily ragged (occurring at different levels for different branches).
- Currently no alerting system to highlight changes in GO SLIM; requires consultation of DIFF files and manual extraction.
- Most recent version 100-150 terms; see Appendix A.

Funding
- The GO Consortium expects to receive full funding for pending grants.
- Funding is for 3 years beginning 1 December 2000
- There is currently an administrative (Congressional Budget) delay on the disbursement of funds.
- Funding is initially to support 3 groups in the GO Consortium (Mouse, Yeast, Fly), with the addition of 2 groups after the 1st year (presumably Arabidopsis and C. elegans).
- Other funding ... Incyte (no details) ... AstraZeneca (no details)

Publications
- A Genome Research paper is in the process of being written (GO Consortium), which will provide more details on the GO and is a follow-through from the Nature Genetics paper recently published.
- A Genomics paper is in the process of being written (MGI) which will detail some of the recent mouse gene annotations incorporating the GO

Next Meeting

3, 4, 5 March, Palo Alto, CA, hosted by Sue Rhee, Carnegie Institute, Stanford University

Immediate Actions
- Doug: send THE LETTER to him to for DoubleTwist: invite them to Dec. meeting: CALL ANDREW AND ASK HIM WHO TO SEND THE LETTER TO: EXPLAIN THE SITUATION.
- review John Richter's FAQ about GO page
- WE NEED TO SEND FASTA/GO FILE TO GO-SLIM
- WE NEED TO POST AT mgi THE mgi/GO FILE THAT IS SENT TO www.geneontology.org

Appendix A: Current GO SLIM

From: Suzanna Lewis[SMTP:suzi@bdgp.lbl.gov]
Sent: Wednesday, October 25, 2000 11:34 AM
To: ma11@gen.cam.ac.uk; midori@genome.Stanford.EDU; suzi@bdgp.lbl.gov
Cc: go@genome.Stanford.EDU; dph@titan.informatics.jax.org
Subject: Re: go_slim

also needed to add unlocalised to component slim. here is the updated version

$Gene_Ontology ; GO:0003673 $cellular_component ; GO:0005575 %cell wall ; GO:0005618 %extracellular ; GO:0005576

GO mapping - GO needs modification to allow full use with human - problem of the irregularity of GO updates.
For AstraZeneca, a company that has been generous in its financial help to the GO Consortium, Ken Fasman said that there was increasing concern with different ontologies being used by different data providers and that a policy decision had been made to insist on the use of GO for any product that they would purchase after a 24 month notice period. AstraZeneca intends to seek support for this policy more broadly in the pharmaceutical industry. Ken was also concerned that there might be a number of different and non-collaborating efforts to use GO for human genes, and, even worse, modifications of the GO ontologies independent of the GO Consortium.

Outcomes.
=========

The objectives of the GO Consortium are simple: to collaborate with others (preferably one other) to develop the GO ontologies so that they can be most effectively used for the annotation of human genes, and to receive from the collaborating group(s) a table of assignments of GO terms to human genes that will allow the human genome to be searched along with the genomes of others by GO terms.

The GO Consortium recognises that there are pre-conditions necessary for these objectives to be achieved:

A mechanism must exist for those using GO, but outwith the Consortium, to suggest changes (additions, corrections etc) to the GO ontologies. For human genes this will be through David Hill of the Jackson Labs. GO expects those who propose new terms to define these terms (see GO documentation) at the time of request. The companies now using GO for human genes in products (i.e. Celera Genomics and Proteome) both said that they will now begin to feed new terms to GO and to suggest changes required for the annotation of human genes.

The Consortium must increase the rigour of their syntactical checks on GO data and the synchrony of release of the same data in different forms. This will require a single validation script for each class of file to be run whenever data is committed.
It may be necessary (as was suggested by Andrew) to go to a regular (e.g. monthly) public release at pre-determined times and dates.

We need a mechanism to ensure much better user feedback. One suggestion is to run an open Users' Meeting once (or more) a year. This will not be by invitation but will require pre-registration so as to avoid a logistic catastrophe. We expect that one of these meetings will be at the time of one of the regular Consortium meetings and that maybe another could occur at the time of, e.g., ISMB or a similar meeting.

GO is in the public domain (there was some discussion as to whether protection under, e.g. a GPL, is desirable). There is an implicit contract between the GO Consortium and commercial users of GO - the commercial users get the information for free, but they have an obligation to give the Consortium useful feedback. There can, of course, be problems with public feedback to GO from commercial companies. GO should establish mechanisms other than the public mailing list to allow people to comment on GO, both in general and in detail, in a manner that is private (although, of course, any resulting changes to GO would be public).

It is in the long term interests of the commercial users of genomic data for there to be stability and uniformity in annotation. For the consumers of data, their interests are that they can use the same analytical methods on data coming from the public and commercial domains or from two or more different commercial concerns. For the providers of data, their interests are not to have to spend resources re-inventing the wheel and to be able to easily QC their data by comparison with public data or data from other commercial sources.

There are two major public groups annotating the "complete" human genome sequence, the Ensembl group at Hinxton and the NCBI group.
At the moment they are using different assemblies of the sequence (Santa Cruz and Schuler, respectively) but there is an agreement in principle, at least, for the two groups to share a common name space - the International Gene Index. There are clearly a number of name space/identifier issues that will make everyone's job harder - at least in the short term - but these were well beyond the remit of this group. To some extent, for the purposes of GO, there is already a common name space between the EBI and NCBI for gene products - the protein_id's of the GenBank/EMBL-Bank/DDBJ records. A maintained table of correspondence(s) between protein_id's and other name spaces (Swiss-Prot, HUGO, LocusLink, Ensembl etc) might be a good idea, at least until the IGI reaches full term.

The NCBI will be importing GO annotation for about 10,000 "known" human genes from Proteome Inc into both RefSeq and LocusLink. The NCBI will provide a general methodological statement as to how the particular gene product to GO assignments were made. All of these assignments will be attributed to Proteome Inc.

The Ensembl team will work very closely with others at the EBI and with the HUGO Gene Nomenclature Committee in London, to establish a central, open repository, called GOAH, to track assignments of GO terms to human gene products which can be used by other databases worldwide (see Appendix).

Some action items.

* Kevin Roberg-Perez - send in proposal for partitioning enzymes (the issue here being that the children of "enzyme" in $molecular_function are very 'flat').
* Richard Mural - send in problems with mouse gene associations.
* GO - to arrange users' meeting at next meeting (hosted by TAIR).
* GO - establish email methods to allow companies to comment on GO privately to the GO consortium.
* Suzanna Lewis - immediate validation of gene associations on CVS commit.
* GO - consider regular dated updates, rather than updates on edit as is now the case.

Thankyous.
==========

We thank Mrs Beatrice Toliver for her help and courtesy at the Banbury Centre and Dr. Jan Witkowski for allowing us to use the Conference Centre for this meeting.

Attendees & their affiliations.
===============================

Rolf Apweiler (European Bioinformatics Institute - Swiss-Prot; InterPro)
Michael Ashburner (European Bioinformatics Institute - GO Consortium - FlyBase)
Judith Blake (Jackson Laboratory - Mouse Genome Database - GO Consortium)
Mike Cherry (Department of Genetics, Stanford - GO Consortium - SaccDB)
Janan Eppig (Jackson Laboratory - Mouse Genome Database - GO Consortium)
David Hill (Jackson Laboratory - Gene Expression Database - GO Consortium)
Suzanna Lewis (BDGP, Berkeley - GO Consortium)
Martin Ringwald (Jackson Laboratory - GO Consortium - Gene Expression Database)
Michele Clamp (Sanger Centre - Ensembl)
Donna Maglott (NCBI - LocusLink - RefSeq)
Lincoln Stein (Cold Spring Harbor Lab. - WormBase - DAS)
Sue Povey (University College London - HUGO Nomenclature Committee)
Lisa Brooks (NHGRI)
Kevin Roberg-Perez (Proteome Inc)
Darryl Gietzen (Incyte)
Ken Fasman (AstraZeneca, Boston)
Andrew Kasarskis (DoubleTwist)
Richard Mural (Celera Genomics, East)
Paul Thomas (Celera Genomics, West)
Jennifer Wortman (Celera Genomics, East)

Appendix - The Ensembl/EBI GOAH proposal.
=========================================

GOAH - GO Annotation of Human.

The EBI proposes to provide a central, open database tracking assignments of human gene products to the Gene Ontology (GO) resource. The Gene Ontology project (www.geneontology.org) provides a framework to assign functional information to gene products. GO was founded by three model organism databases (FlyBase, SGD and MGI) and has expanded to 5 databases, taking in WormBase and TAIR. GO has proved to be very successful in these databases, capturing functional information in a way which can be queried across species databases and providing a consistent framework for aspects such as evidence tracking.
The effective use of the human genome will require some aspect of functional tracking of gene products. The proven GO experience, in particular in Mouse, indicates that GO will work well for this task. Unlike other organisms, there is no clear central database for human genome resources, and it is likely that this will remain the case. We propose therefore to provide a central, open repository, called GOAH, to track assignments of GO terms to human gene products which can be used by other databases worldwide.

This repository would be manned with two or three editors providing overall curation and quality control. These editors would be the point of contact for individual researchers wishing to contact GOAH. For large scale projects with a proven track record of functional assignment, such as HUGO, Proteome Inc, SWISS-PROT and MIM, direct editing of the GOAH database will be allowed with the editors providing conflict resolution and general consistency of the project. The human gene products would be tracked via the internationally agreed protein identifiers (protein_id), an established identifier system for proteins shared by the International Collaboration of DNA databases. All information stored in GOAH would be placed in the public domain without restriction.

The EBI provides an ideal location to provide the GOAH resource, with synergies to the Ensembl team of genome annotation and the SWISS-PROT team of protein functional assignment. In addition the EBI has strong links to the main players in this field, such as the NCBI, Proteome Inc, Celera and CSHL. The necessary resources for GOAH have already been found at the EBI and committed to furthering functional assignment in human, either directly in this GOAH project or in some collaboration with other interested parties.

===================================================================

Summary of the GO Consortium Meeting held March 4 & 5 at the Carnegie Institution, Stanford University, Palo Alto, CA.
****************

Hosts: Sue Rhee and the TAIR group
Participants: TAIR, SGD, BDGP and FlyBase, MGI, DictyBase, Worm (full list of individuals at the end of the document)
Guests: Han Xie of Compugen; Mark Wilkinson, of the NRC Canada, a TAIR collaborator

Summary Agenda... This is not in the order of the meeting, but rather supports some structure for this report.

1. Introduction to GO for New People and Systems.
2. Revision of Enzymes to incorporate E.C. terms.
3. Short Notes, Updates and Action Items.
   Definitions
   SP2GO
   InterPro
   GO-SLIM
   Obsolete terms
   Energy Derivation
   Top Level Terms
4. Sort Process Ontology into component parts and other Process considerations.
5. "Determination", "Differentiation" and "Development".
6. Major Divisions of the Process Ontology
7. Physiology - Initial Discussions
8. Report on Narrative vs. Combinatorial approach re anatomy in biological process terms.
9. Software Update from BDGP group.
10. New procedures for revising ontologies.
11. In General, Things to Do, some sooner, some later.
12. Specifically, For the Next Meeting (July 14, 15) in Bar Harbor
13. Progress Reports incl. short reports from Compugen and NRC Canada
14. Full List of Participants.

*******************

1. Introduction to GO for New People and Systems

Brief History; What ontologies are being developed; What are the rules and procedures for both ontology development and annotation of genomes; Review of the public presence of the GO consortium.

2. Revision of Enzymes to incorporate E.C. terms.

It was agreed that incorporating EC higher level terms was a good thing to do. Some of the EC strings are very long because they are adding definitions into the term. We will move the definitions into the definitions file. This will not restrict searches since the search includes the definitions. Michael will work to tidy the list and replace current enzyme set with new representation. Synonyms will be added as needed.
Curators are reminded that synonyms for the protein should NOT be entered, just synonyms for the molecular function.

3. Short Notes, Updates and Action Items.

a) Definitions: We still only have about 10% of terms with definitions. The rule is, if you add a new term, you need to add a definition. We remind ourselves that the GO:ID goes with the definition, not the term, in cases of revision in the use of a term. SGD crew are scanning OUP Dict of Molec. Biol. into an ascii file for us to use in adding the definitions. This will be incorporated into the GO-EDITOR (John Richter). We will add ISBN numbers to each definition, as well as a personal signature to each definition we add.

b) Update of SP2GO files. Need to continue with timely updates. New process from MGI will update SP with each MGI update. David Hill (dph@informatics.jax.org) continues to be the primary person managing this file.

c) InterPro: Michael Ashburner reviewed history of InterPro for new people. Mapping of InterPro to GO is public at EBI, but is not posted at the GO site yet. Place InterPro:GO mapping at GO site.

d) GO-SLIM: At the moment, there is a hand-curated GO-SLIM. Ultimately, an attribute of a GO term will be that it is a member of a certain GO-SLIM representation. We recognize that there will be different slices of the GO that will be useful to different annotation communities. So we expect to support different GO-SLIM sets. GO-SLIM implementation will wait for database.

e) Obsolete terms: When a term becomes obsolete, the definition should be appended to explain why it became obsolete. The note might also contain suggested terms to search if you are considering this obsolete term. John Richter will make sure that obsolete terms are supported in the GO-EDITOR.

f) Energy Derivation: Natasha Maltsev of Argonne Natl Lab has a list of energy derivations that will be a starting point for expansion in this area. Michael will get list from Natasha.
g) Top Level Terms: We don't want to limit top level terms. We need to think of them as 'collectors'. So when considering the addition of high level terms, consider 'Do we need this collective term?'. When we consider that we have a term (growth and maintenance, for example) because we cannot distinguish by experimental data to which term we should annotate a protein, that is an annotation perspective. But we also want to include terms so that we can group things.

h) Prions will not be represented since they relate to disease state.

4. Sort Process Ontology into component parts and other Process considerations.

We discussed whether the process ontology should be separated into two parts: cellular and multicellular. This discussion is not new. We recognized the utility of having a complete unit of the process ontology representing cellular-level processes since this is needed and practical for the unicellular organisms. Thus we will work towards a robust representation of cellular processes that will be useful to all. This decision led further to a recognition that the process ontology is sorting into 4 major components. These are: cellular processes, developmental processes, physiological processes, and behavioral processes. We agreed to break out cellular processes and to specifically represent them at the top of the Process ontology. Most of the discussion over the rest of the meeting then focused on developmental processes.

a.) Differentiation is a cellular process; morphogenesis is a multicellular process.

b.) We discussed whether to break apart the terms 'growth and maintenance' and 'cell organization and biogenesis'. At first, it seemed that we should. However, we quickly realized the utility of these terms in that some preliminary experimental evidence couldn't distinguish as to whether a gene product was involved in 'growth' or in the 'maintenance' of an organism.
We did agree to change the term 'cell organization and biogenesis' to 'cell organization and/or biogenesis'. Midori will incorporate this into the work to carefully define the high level terms. Still some confusion as to the difference between the 'cell organization and/or biogenesis' node and the 'growth and maintenance' node.

c.) Every high-level node needs careful definitions: Midori and Michael will work on this soon.

d.) Remove terms 'oncogenesis' and 'tumor suppressor'. These terms reflect phenotypes. 'Oncogenesis' is really 'unregulated or mis-regulated cell cycle control'. The ontology term relates to cell cycle regulation. The evidence for the association of a gene product with the process of cell cycle regulation often comes from the study of the disease state. This same argument supports the removal of the term 'tumor suppressor', which is, after all, a phenotype statement and not a biological process statement.

e.) Cell Motility: Under 'cell motility', 'vesicle transport' and 'spindle function' are examples of cell motility. So, maybe need to extend upper level with a term 'motility', then a daughter term 'cell motility' and a daughter term of that, 'cytokinesis'. So 'cytokinesis' would be a part of 'cell motility'. Cell division is a synonym rather than a GO term because the term is used both as 'division of the nucleus' and as a synonym for 'cytokinesis', i.e., division of the entire cell. So, synonyms need not be unique. We need some further work here under 'cell cycle' as there are multiple usages of these terms. So, need precise definitions of our usages.

5. "Determination", "Differentiation" and "Development".

Definition of 'Determination' and definition of 'Differentiation'. How shall we represent these concepts? Reflects a 50 year debate in developmental biology. Need to rewrite these definitions so that they are less experimentally based.
Consider, throughout, 'has the definition been written in terms of the experimental method?' If so, consider revising the definition.

Sound bites from this interesting discussion
- determination: when the decision has been made to adopt a developmental stage (tricky because it is often before the actual differentiation occurs)
- differentiation: when you actually express a set of characteristics...process whereby relatively unspecialized cells acquire
- so is 'cell specification' a synonym for determination? Or is it that specification is the same as establishing an identity but not yet determined. It is a temporal thing. You are getting signal. It is not the same as determination.
- autonomous specification: specification produced by an inheritance of molecules. A type of cell specification.
- conditional specification: the specification determined by the relative position of cells in an organism. A type of cell specification.
- not the same as competence, which is a characteristic of a cell.

Conclusion: Competence is the 'ability' to do something. Competence is not a process, it's a state. So we throw it out. But, if useful, we could have 'establishment of competence' or 'maintenance of competence'.

Conclusion: the term 'Development' as a high level process will be used to consider the whole history of the organism. This generated a lot of discussion as we considered 'embryonic development' and 'post-embryonic development'. This is a hard distinction to support for plants and for larval development. Different communities use these terms in different ways. Post-embryonic development is useful for fly...keep it in??? What is covered by the term 'brain development'? It continues throughout life. What do we mean by a term like 'heart development'? Does that mean the developmental process up until you have a heart? Or does it include the further development of the heart after a recognizable organ is formed?
Embryogenesis, morphogenesis, organogenesis are all DAGs...some things that are parts of embryogenesis will be part of morphogenesis as well. So...

development morphogenesis aggregation differentiation maturation aging senescence Option 1 global heart development formation of the heart beyond formation of the heart

6. Major Divisions of the Process Ontology.

% cell
% development
% physiology
% behavior

7. Physiology - Initial Discussions.

Having struggled through the beginnings of a representation of development, and at least conceptualizing the work needed to realize this part of the process ontology, we recognize that physiology is the next big area to struggle with. Animal and plant physiologies will be pretty independent. Cellular processes are independent of physiology. So here is where the DAG structure becomes imperative. For example, 'hormone response' has both physiological and cellular components. We can relate them through the use of the DAG structure. Remember that we are trying to develop a tool for biologists that works....not trying to represent all biology. Need to make somewhat arbitrary decisions, such as where to put 'germination', that address what the user cares about, i.e. 'what genes are involved in the process of germination?'.

Physiological processes are heavily impacted by outside signal (environment). Changes in response to environment. While not absolute, physiological statements more often reflect processes in the mature organisms. In defining physiology as a grouping mechanism, we need to work down to the next level now. 'Transpiration' is an example of a physiological process, ditto 'perception of external stimuli', 'stress response', 'immune response', etc. The 'seed germination' and 'release of dormancy' terms are both physiological and developmental processes. Physiological processes ultimately will go down to the granularity of cellular processes. The DAG structure will help in the representation of all these terms in the Process ontology.
8. Report on Narrative vs. Combinatorial approach re anatomy in biological process terms.
This was the major event of this meeting. For many meetings, we have come back to the issue of species-specific anatomies and the incorporation of anatomical terms in the process ontology. Over a year ago, Joel Richardson proposed a combinatorial approach wherein a process term combined with an anatomical term would be used to annotate knowledge about a gene product. At the Hinxton meeting, the group agreed that this was a sensible and powerful approach. However, subsequent implementation efforts revealed difficulties in incorporating such biologically useful concepts as 'gametogenesis'. Also, the management of the combinatorial approach would be harder than the further development of what is now called the 'narrative' approach. The narrative approach is the current paradigm of building up the ontology incrementally as we describe the process in biological terms. Yet, in following discussions, the issue of whether or not to incorporate anatomies, which are themselves highly developed and precise ontologies, in the process ontology kept arising. Finally, at the human annotation meeting at Banbury last summer, we agreed that David Hill would 'do the experiment' and give a presentation at this meeting for the group to consider.
David used the example of "Heart Development". He developed an ontology for heart development in both the narrative and the combinatorial manners. A copy of that presentation is available. The end result was that the group was overwhelmed with the power of the combinatorial approach both to provide self-structured cross-product terms and to reveal new information and avenues for experimentation.
1. Do we leave it up to each group to decide whether to use this approach to process annotation? A resounding NO from the group.
2. Can we separate out subtrees that can be used to generate cross-products? Yes, could use GO-SLIM or other subtrees.
In fact, the GO-SLIM set may be the mechanism for grouping annotations across species. 3. There could be cross-products of cross-products....how far do we want to break this down? Don't have to go all the way down as long as the representation of the biology is correct. 4. Works as long as the two concepts are orthogonal, can't do with just anything and get the consistency needed. 5. Big worry...if each group is incorporating combined terms relative to their particular anatomy, we lose the power of the combination of all annotations. One approach is to ask the query...'give all products in heart development', and have query go out against all cross-products. We will have to work on this. 6. Can we have a join of the anatomies? then have a single anatomy to use in the cross-product with developmental processes? don't know...right now, we think the combinatorial approach is the right way to go, we will have to work on the implementation. 7. Some concern about ripping out anatomical terms from process right now. Can the primary process ontology be made more amenable to cross-species specific anatomical parts? 8. If we have multiple anatomies, then the search needs to go against anatomies...this can be done. Summary....Issues 1. There is general consensus to go forward with the combinatorial approach. 2. Do we need to have a shared anatomy? 3. How will others be able to use the ontologies to annotate if we have this complicated approach? 4. Parser...need to put into better language...earlier we tackle the problem of language, the better we can promote this for ourselves and others. 5. GOAL...write definitions for common developmental process terms. 6. Start working on further experiments with this approach...write definitions, work out mathematical properties. 7. Each group needs to provide an anatomy. 8. The anatomies needn't have GO:IDs, but the cross-products should have IDs. 9. We will use the developmental process as a demonstration of this approach.. 10. 
Immediate action items include:
a. schema changes (Joel and Suzi)
b. editor will work fine for now.

9. Software Update from BDGP group.
Brad Marshall - GO-Browser. Objective is to make the browser better. Have moved from cgi scripts to an XML backend with RDI to associate with different data sources. This makes for a more flexible backend. Want to chain a lot of data sources together with GO associations. Only want to retrieve a subgraph at a given depth. Much enthusiasm for the power of this approach.
John Richter - GO-EDIT. This new editor is an open-source application that provides an annotation tool for GO type ontologies. Can rearrange, define terms, designate subsets... Released and available. Will start using this right away. First commit will result in a messy Diff file, but then all will be well.
1st change....editors using Editor will write out to files and commit via CVS
2nd change...editors using Editor will write out directly to database...don't know when that will be.
John Richter - GO database. The database could be ready 3 wks after he starts working on it, but right now he's working on Apollo.
Suzi Lewis - Apollo Pedigree

10. New procedures for revising ontologies.
* who has 'write' access in each instance? Michael, Heather, David, Harold, Leonore, Midori, Karen
* how are people outside supposed to communicate suggestions to us?
Suggestion...two databases, one public, one writable (production). Curators will have db accounts...login to edit and write. A few have publish access. Publishing takes it over to the public db. We track changes, etc.
Publishers.... Michael, Heather, David, Harold, Leonore, Midori, Karen
Publishing involves clearing and reloading. The event gets a version number. 'Static' version would be a cron job (midnight of the first of every month???). There would need to be a name for each release.

11. In General Things to Do, some sooner, some later
1) Cross Products: Post GO:IDs for cross-products not for Anatomy...
2) Details for Cross-Products...may be summer before we can commit all this. 3) Post InterPro:GO mappings at GO Web site (Michael) 4) Post Anatomy Files from different groups. 5) Post FASTA files of unique set of AA seqs with GO annotations at GO site, include SP:ID in header. Set up for searching. 6) Review and update GO-SLIM files 7) Develop top levels of Physiology for next meeting 8) Add seqID column to Gene Associations table (use SP ID). 9) Consider posting 'good' other ontologies at GO site. This will involve a lot of discussion...Don't want impression that GO is responsible for quality of other ontologies. 10) Update SP translation tables (David Hill) 11) Move comments about reasons for changes from CVS to database (when we can) (John Richter). 12) xml dumps from Editor to Suzi. 13) Support old GO:IDs in database (John Richter). 14) Post 'Citing these data' on Web pages (MikeC). 15) Provide some kind of static or versioned GO for use by tools that incorporate the GO as part of their annotation suite. 16) Suzi needs to deal with <>...where ever you want to keep this character, put the / in front of it. 17) MikeC will create an anonymous server behind firewall at Stanford for cvs or provide a machine outside the firewall. 18) Make CVS world readable..provide a repository..may want to consider SourceForge... 12. Specifically, For the Next Meeting (July 14-15, Bar Harbor). 1. Midori and Michael will have some High-Level definitions for review (i.e., when does the process of differentiation start? when does it stop?). Change the term 'cell organization and biogenesis' to 'cell organization and/or biogenesis'. Also, need children of combined terms....thus, need 'cell organization' term and 'biogenesis' terms with definitions. 2. Need some work done to clarify and define terms in the area of 'Cell cycle'. Full List of Participants. TAIR: Sue Rhee, Leonore Reiser, Aisling Doyle, J. 
Yoon, Margarita Garcia DictyBase: Rex Chisholm, Warren Kibbe SGD: Mike Cherry, David Botstein, Midori Harris, Selina Dwight, Karen Christie, Dianna Fisk, Anand Sethuraman, Cathy Ball, Gavin Sherlock, Worm: Wen Chen BDGP and FlyBase: Michael Ashburner, Suzi Lewis, Heather Butler, John Richter, Brad Marshall, Chris Mungall MGI: Judith Blake, Joel Richardson, David Hill, Martin Ringwald, Janan Eppig, Harold Drabkin =================================================================== GO Meeting July 14-15,2001 Hosted by Judy Blake and the Jackson Laboratory in Bar Harbor, ME. Minutes compiled by Leonore Reiser Attendees (name-organization): Judith Blake-MGI Janan Eppig-MGI Joel Richardson-MGI Martin Ringwald-MGI David Hill-MGI Harold Drabkin-MGI Michael Ashburner-FlyBase Midori Harris-GO Suzi Lewis-BDGP John Richter-BDGP Brad Marshall-BDGP Pavel Tomancak-BDGP Mike Cherry-SGD Anand Sethuraman-SGD Karen Christie-SGD Selina Dwight-SGD Dianna Fisk-SGD Erich Schwarz-WormBase Raymond Lee-WormBase Rex Chisholm-DictyBase Liat Mintz-Compugen Courtland Yockey-Astra Zeneca Rolf Apweiler-SWISS-PROT Nicola Mulder-SWISS-PROT Jennifer Hogan-Incyte Matt Berriman-Sanger Tom Weigers-MGI Jim Kadin-MGI Carol Bult-MGI Sue Rhee-TAIR Leonore Reiser-TAIR Progress reports by group (presenter): MGD (Harold Drabkin): from handout supplied. 1. As of 7/12/01 MGI has 15638 annotations. 2. Total number of mouse genes with at least 1 associated GO term: 5673 Genes with molecular function term: 4615 Genes with process term: 3487 Genes with component term: 3584 3. Breakdown by annotation type: IEA: 3690 process; 5333 function; 3835 component Of these: 760 annotations are from EC mapping (694 function; 66 component) and 6542 from SWISS-PROT mapping (1740 process, 2603 molecular function; 2199 component). The remainder are from GOFISH. Hand Annotations: Annotating at a rate of 40-45/week. 
1115 genes annotated this way (636 process; 637 component; 788 function) that correspond to a total of 2780 annotations (1058 process; 722 component; 950 function). SGD (Karen Christie): 1. How SGD deals with annotations to unknown (function, component or process) is by assigning a reference, from either curation or from a publication with the appropriate evidence code. This distinguishes between cases where a curator has looked and nothing is known and cases where the literature for the gene has not yet been examined in order to make a GO annotation to a given ontology. We decided that this was a good approach for all curators to take and that this decision would be reflected in the documentation on the website. 2.Validation checking: includes checking that all of the required fields were being filled and scripts to identify terms that become obsolete so that annotations to these terms can be updated. 3. Cathy Ball and others have done a four- way genome comparison- Arabidopsis, Yeast, Fly ,Worm:- each sequence is BLASTED against each other to build gene families Criteria is (P =50 and 80% in one HSP).Each sequence is present only in one tree. Compared annotations with a high node, GO Slim type, process ontology and curated each of them by hand (by sequence similarity and FASTA definition lines). Annotating the entire gene cluster (have 1400 families- hand curated for process-14/15 very high level processes, 23 families that could not be ascribed to a process). Function calls were made electronically via a script. This dataset can be used as a check for annotations- and as a general guide for annotation. A limitation is that the represented proteins are only those found in all genomes. Flybase (Michael Ashburner): Working on cleaning up existing annotations: 1. Heather was fixing all legacy annotations that did not have literature or evidence codes. 2. Go Slim high level annotations being gone over for 7000-8000 genes. 
Michael has gone through all of them, annotated by hand.
3. Several thousand BLAST results being gone through.
4. 9000 genes with at least one GO annotation.

WormBase (Erich Schwarz):
1. One third of C. elegans genes have been annotated using InterPro. This data is up at the GO site.
2. For 19,000 cds sequences, ~2,200 have 'reasonable' phenotypes from RNAi screens. Working on turning phenotype data from RNAi into GO annotations by mapping phenotypes to GO. For example, the paralyzed phenotype would map to the GO term 'locomotory behavior', with evidence inferred from RNAi (which would be given the evidence code IMP; this additional definition of IMP will be reflected in the documentation on the web pages).

DictyBase (Rex Chisholm):
1. Initial annotation from genome sequence, 5X coverage, ~8000 ORFs.
2. Have about 50 hand annotations. Will use InterPro, SWISS-PROT, EC mapping and have a set of annotations by next meeting.

Compugen (Liat Mintz):
1. Concentrating on GO annotations at the level of transcripts. Methods include:
a. Protein clustering (Smith-Waterman)
b. Literature clustering-text mining tool.
c. mRNA clustering
First release includes human, rat, mouse annotations. Update to be released Aug. 1 will have additional annotations.

SWISS-PROT/InterPro (Rolf Apweiler):
1. Nicola Mulder's map of InterPro to GO terms has about 2700/4000 terms mapped. Many can't be mapped yet (as they include lots of viral terms).
2. SWISS-PROT mapping. From this about 50% of the terms have been mapped to GO but this still needs some manual curation.
3. Human gene annotation status. Yesterday (July 13, 2001) the first pass annotation from SWISS-PROT, TrEMBL and Ensembl annotations was completed using three electronic mapping methods. Imported 7316 GO annotations from Proteome and literature associations. The plan is to have complete coverage at a high level by September-October. After this, manual GO annotations will be part of the normal curation pipeline.

GO (Midori Harris):
1.
Training SWISS-PROT curators to do GO annotations. 2. A bit of PR work with Nature and the Wall St. Journal. 3. Working with Karen Christie of SGD to revamp the GO web pages. TAIR (Leonore Reiser): 1. 3901 annotations to genes corresponding to TIGR open reading frames using term matching (exact matches between GO terms and definition lines). Matches generated by script and validated by curator went in as IEA annotations to GO and TAIR FTP site. All annotations are to molecular function. 2. Progress on development of an anatomy ontology for Arabidopsis using GO editor to make DAGs. 266 terms as of 7/12/01, 50% have definitions Will work with rice (Gramene DB, IRRI) and Maize (MMP) to find top level terms for plants. Discussion Points: 1. GO Slim a). Consensus that there needs to be a new GOSLIM developed. Terms for GOSLIM will be selected by a small working group. b). A directory of the GOSLIM versions that have been used should be made available via the website. c). Some considerations in using GOEDIT to make GOSLIM files: Will have to wait until the database is up to implement GOSLIM notations as this is not accommodated by the flat files. Also, having everything in the database will make it easier to keep GOSLIM in synch with the current GO. The 'canonical' GOSLIM will be in the database and other versions (specific to certain projects) will be posted as flat files. d). Chris Mungall has been working on software for mapping full GO to GOSLIM. e). Midori Harris will take charge of new GOSLIM. 2.Gene association files a). Should we get rid of aspect in the files? Consensus is no, as this information is useful in consistency checking. b). How to deal with the fact that each group is annotating at different levels/to different database objects. At the moment, most groups are annotating at the level of gene, or transcript, but this differs from database to database. There is a need to define the object being annotated explicitly. 
Changes to be made are:
1) Add a column that defines the object being annotated (for the moment the options will be gene, transcript and protein).
2) The symbol used in the association file will be the symbol for that object (e.g. if annotation is to a gene object then symbol = gene symbol; if annotation is to a protein object then symbol = protein symbol). The same holds for synonyms: they should match the object being annotated.
3) Add a column for Taxonomy_ID (from NCBI) that defines the taxonomic node for the organism whose gene/protein/transcript is being annotated.
Symbol is still mandatory, but not every database has symbols for everything. In that case, using alternative names is OK (e.g. a gene symbol when annotation is to a transcript or protein). There may need to be further discussion on the issue of the symbol column.
c). Should there be another column specifically for PubMed references in addition to the database references for the evidence for the association? No, but we can allow for multiple entries in the DB:reference column to accommodate this. Multiple reference identifiers (e.g. MGI:209393|PMID:123333) will be separated using pipes, with the model organism database identifier preceding the secondary identifier.
d). What to do with sequence identifiers? We reaffirmed a previous decision NOT to put GenBank or EMBL identifiers in the association files. Instead, sequence identifiers will be provided in a separate file.
e). Action Items: Midori will update the web pages with the new information/format for the association files. An XML format will be created to export the association files.

3. Definitions status: Up to 13% of terms are defined. Everyone agrees it is harder to define the higher nodes, but crucial, especially as these are used for GOSLIM. Midori has taken a stab at some of the higher nodes in process and these need to be looked at. With respect to making it easier to add definitions, SGD has converted a dictionary into an electronic format.

4.
Discussion of "is this a function?". Many people have noted that there still seem to be gene products in the GO, both in the function and process ontologies. Dynein was cited as an example of a protein name in the function ontology.
a. Rex will look into identifying areas that need to be cleaned up as far as protein names and bring the suggestions back to the group.

5. How granular should we get in the ontologies- what belongs? The discussion about what is a function raised questions about how granular the ontologies should get and what is a function vs. a process term.
a). Granularity. In general, the answer is as far as we can go, within the bounds already defined. Community feedback into the ontologies is especially important as a means to improve granularity.
b). What to do with molecules? If we want to represent the function of a subunit (a molecule):
1. Add parts of the complex (e.g. add regulatory/catalytic component)
2. Remember we are annotating the product to the potential of the component.
3. To avoid a proliferation of terms, we can collapse nodes and gain specificity by the combined annotation. For example, RNA polymerases might be collapsed into a single function node and the component or process gives the specificity.
c). What about phosphorylation? It was suggested that having phosphorylation as a process was redundant with kinase in function and is inconsistent with the rule that a process requires more than one function. In this case we decided to keep protein phosphorylation in the process ontology because it would be expected as a child of protein modification.

6. Status of the GOEditor (John Richter).
a). Changes made to the editor.
1. Dropped obsolete relationships: the editor no longer shows obsolete terms.
2. History now shows differently. You can now view the entire history of a term.
3. There is a new data adaptor that includes the relationship "develops from". But this only works with the database adaptor, not flat files.
4.
Saving to the database now works. The editor tracks all changes in your session, checks that you are working on the database and adds history tracking. From now on history will show everyone's saves. Can now query the history of a term: selecting a term highlights the edits for that term for the life of that term.
5. Conflict checking is working- save fails if a conflict is detected.
6. ID generation will be from the database. ID ranges will no longer be required.
7. All three ontologies are loaded when using the database adaptor.
b). In progress.
1. A plug-in for generating IDs other than GO IDs. For example, people using the editor to create anatomy ontologies will be able to have their own prefixes.
2. A gene association plug-in.
3. An update to require passwords for loading and saving to the database will be instituted.
c). Discussion/comments on the editor.
1. When importing from other flat files (e.g. anatomy) they can be parsed into the database by stripping terms. However, there may be reasons to save the original database identifiers, perhaps as synonyms.
2. While conflict checking works, top level node edits need to be announced to the group before starting.
3. When will the database be the real thing? Perhaps in about 3 weeks but, perhaps not.

7. GO HTML Browser (Brad Marshall):
a.) Displays DAGs, gene associations, definitions. Searching by organism, term, etc... still some issues to resolve in the searching. BDGP has registered the domain name godatabase.org but this is not up yet.
b.) Some more work to be done on UI issues before release.
c.) Related to general software issues: Chris Mungall has been making progress with a BLAST server using GO annotated sequences from yeast.

8. Website documentation issues:
a). FTP site including CVS repository is moving to the outside of the firewall. This will allow anonymous read access to the CVS.
b). Updating to the FTP site. SGD has the only write access to the site.
So in the event that Chris wants to dump the database XML files, Mike will grab the specified files from BDGP and load them. c). DTD for the XML format needs to be moved to the website so that people can get to that. Currently the XML is updated once a month from Stanford to the database and then dumped out. This will move to nightly dumps once the DTD and XML is in place. d) . Currently we are just exporting the tree but will also be exporting the associations (as above discussion on an XML format and appropriate DTD to be written). e). The new format will allow people to specify and import subgraphs from the files. f). A validation step will be added to the XML dump to make sure it is correct. g). Move old documentation out of CVS but leave on the FTP site. Things to be archived include past minutes from meetings, old XML DTD files, old ontologies. Move the archive directory out of CVS but leave on the FTP site. Note that the abstracts/ directory is empty and could be removed. h). As part of the rearrangement that Midori and Karen are undertaking , we can add a jobs page. 9. GO expansion (specific issues and general concerns): a). Matt Berriman from the Pathogen Sequencing Unit at the Sanger Center raised some issues about how to expand the GO to include parasites. In particular the question was raised as how we can represent the components outside of the cell ... moving from place to place. Alterations to component ontology need to be made to accommodate this. A few basic ideas to deal with this: 1. add a host %extracellular %host Phenotype is clearly outside of the GO . There are no plans to expand to include phenotypes in the ontologies. >Some groups, like SWISS-PROT, may be annotating to groups of protein products. Eventually, these should move out to the level of a transcript. >In general we do not consider natural variants/polymorphisms, but it is really up to individual databases to decide on that. 
>Splice variants, isozymes and other polymorphisms will be an issue for databases to deal with. Each variant should be annotated appropriately. e). Expansion of the evidence codes. Generally agreed that more information needs to be provided so that it is clear how the annotation call was derived. Case in point by Rolf Apweiler For 26098 proteins 20757 from SWISS-PROT;5841 from Ensembl Annotations include: 1448 mappings from EC 7696 SWISS-PROT keyword mapping 11025 InterPro mappings 1. 3022 unique GO terms that are all IEA. In general, there is not need to expand upon the IEA code but rather to have a more specific DB: reference for each type of annotation. So rather than dumping everything into IEA and giving a general reference, use IEA and the reference should explicitly state if the association was made from InterPro mapping or keyword mapping, etc. Send the references for the analysis as part of the annotation record. Each analysis method should have its own reference (e.g. GOFISH). 2. When is it an ISS vs. IEA call.. It is only ISS if a person has looked at it, in Swiss-Prot there is always a person looking at the sequence. Can we enforce a standard of reporting for sequence analysis methods? Probably this is unrealistic to attain. The reference column of the association files will be used for providing more information on the different computational methods used for annotations. In addition, a set of descriptions of the methods used by each group will be put up on the GO website. f). Integration of annotations from non-model organism databases. SWISS-PROT and Compugen will provide first pass annotations to model organism genes to model organism databases to incorporate them and pass them on to the GO site. For non-model organisms, the associations will go directly to the GO site. 10.Organizational issues a) . Anyone can participate in discussion. Suggestion of terms can and should come from anyone however write access to the ontologies should remain limited. 
b). Who contributes to the annotations. The associations have to be from a database.
c). What constitutes membership. In general it requires that participating groups accept the principles of the GO and are willing to commit to ongoing development of the GO.
1. putting associations into the public domain.
2. contribution of financial support or data.
3. contribution of software tools.
d). Due to the size of the group (currently 10 members) it may be more efficient to have smaller working groups that meet for various reasons (such as an executive committee, or ontology working groups) in order to focus on specific action items. To keep meeting at this rate (3-4 times a year) and to get things done, it may be helpful to change the structure of the meetings.
e). Should the GO site be a clearing house for information about other ontologies?

11. More about cross products and anatomy.
a). Cross products- David Hill expanded on the idea of cross products. We are in agreement to strip anatomy from process terms and proceed with cross products. For making cross products, it is important that the ontologies be orthogonal. We can expand the concept of cross products to many areas and it would be good to have a general tool for doing this that allows you to select specific nodes to create cross products (see d below). With respect to making anatomy ontologies for generating cross products with the developmental process node in the process ontology, we need to first make orthogonal ontologies of anatomy and developmental stage, then take the appropriate cross products from these to make the cross product with development. It is essential to take the time component out from staging (e.g. days post-fertilization, post-germination are not useful as there is a lot of variation in how rapidly development occurs within a species). An example: [stage (organism specific, internal ID) X anatomy (organism specific, internal ID)] X [developmental process (GO-generic, GO ID)] = GO ID.
Also, the cross-product of stage X anatomy will have an internal ID. b). Anatomy browsing- Pavel from BDGP demonstrated a browser being developed at BDGP to display gene expression patterns and fly anatomy. Uses both images and text display. c). Each organism database contributes their anatomy/ developmental stage ontologies and definitions to the GO and it will go into the CVS. Each group should be responsible (and responsive) for updates to their anatomy ontology. d). Developing a tool to generate cross products. John Richter thinks he can adapt the editor to have this function. Realistically there will not be a tool till after October for generating cross products. e). Report on papers citing or using the GO and activities related to GO. 1. MGI/RIKEN annotation paper is out. 2. Cathy Ball and SGD have their 4 way comparison paper in the works. 3. GO paper is out in Genome Research. 4. Matt Berriman -Parasites are GO in Trends in Parasitology. 5. David and Joel working on a paper about cross-products. 6. Courtland is doing a seminar for library sciences. 7. Michael will be doing a course in October. 8. TAIR review on plant data management includes small section on GO. 9. Postponed request from Annual Review of Genetics until next year. f). Next meeting will be at the Chicago Omni Hotel and includes a users meeting. Regular meeting will be from the 12th (new groups)-14th and users meeting on the 15th... =================================================================== GO MEETING in Chicago on Oct 13, 14 2001 Northwestern University, Chicago, Illinois USA Rex Chisholm, host List of Participants 1. Chris Mungall - BDGP, Berkeley 2. Brad Marshall - BDGP, Berkeley 3. Rex Chisholm - DictyBase, Northwestern 4. Suzi Lewis - BDGP, Berkeley 5. Michael Ashburner - FlyBase, EBI 6. J. Yoon - TAIR, Carnegie 7. Sue Rhee -TAIR, Carnegie 8. Peter Good - NHGRI 9. Trisha Dyck - DictyBase, Northwestern 10. Karen Christie - SGD, Stanford 11. 
Matt Berriman - Parasitic genomes, Sanger
12. Judith Blake - MGI, Jackson Laboratory
13. David Hill - MGI, Jackson Laboratory
14. Harold Drabkin - MGI, Jackson Laboratory
15. Janan Eppig - MGI, Jackson Laboratory
16. Raymond Lee - WormBase, CalTech
17. Wen Chen - WormBase, CalTech
18. Midori Harris - GO EDITOR, EBI
19. Evelyn Camon - SP-human proteins, EBI
20. Bernard de Bono - visitor from Cambridge
21. John Richter - BDGP, Boulder
22. Erich Schwarz - WormBase, CalTech
23. Jason Stewart - BDGP, Albuquerque
24. Hanqing Xie, Compugen

ACTION ITEMS FROM CHICAGO MEETING - OCTOBER 2001

1. ACTION ITEM: Get temporal and anatomical CVs into CVS from participating databases.
TEMPORAL: need to add temporal ontologies, but maybe we need to add dimensions rather than time or relative temporal terms? Discrete temporal terms, like Theiler stages (Mus); 'life stages' in worm; template for others?
STATUS OF ANATOMIES
FlyBase... done. FlyBase also represents 'derived from' in organ-organ relationships.
TAIR.... pretty good, almost done, just checking...
SGD... anatomy...12 terms, done.
MGI... Martin met with folks in Edinburgh; fine for MGI to commit mouse anatomy to the web site. Adult anatomy is not complete, but we could contribute it. Defined by Edinburgh by anatomical space definitions... 3D... Embryological one is done.
Worm... Wen Chen...has been working on this. Life Stages for embryos is done. It has been converted into GO type form. Need to do refinement on definitions. Right now, Wen has built a structure with 55 terms....
Worm... can keep temporal aspects. In C. elegans, because of knowledge of every cell, can actually do anatomy in terms of the big picture of all cells. Raymond has done a pilot on the feeding apparatus, the pharynx of the worm. Working with David Hall of Einstein to work out anatomy (also Sylvia Martinelli at Sanger has some work to incorporate).

2. ACTION ITEM: need a tool to create cross-product terms
see further discussion of this point under BDGP report

3.
ACTION ITEM: Post new biological process that incorporates updating developmental processes. Ask for comments by Dec. 31. After that date, do the update and commit it. Post new Biological Process. 4. ACTION ITEM: come up with initial default GO slim, send around for comment, incorporate into database such that db changes could be flagged...then we could make that available as GO-SLIM. Ask that people who make variants and publish with that will post them...Midori and Michael will create default file... technically...flag in flatfile... Note, this is a carryover from Bar Harbor meeting. don't need 'static' one...??? but do need easy way to create one, need 'this one, that one, all the others' last time we agreed that we would a) archive any that are used b) people will use this feature a lot in the future, so how do we make this easier c) we decided to provide a default one.... when the database is implemented, this will be easier to manage... d) people that work with alternative GO-SLIMs will post them 5. ACTION ITEM: Agenda items for next meeting 1. Chris Mungall... intro to ontological formalisms 2. Michael... will report on the literature of relationships in ontologies, WordNet book report All terms need parents. The downside of using a quick upgrade of everything is that 'is-a' relationships may not be correct for all, particularly biological processes. Should we add an 'is-a' to the top level? Also, the 'part-of' relationship is complicated since we use 'part-of ' in different ways. Need to investigate the implications of this. We may decide that we don't want to get more complicated than we are. Do it as needed, well, why would we need it... 1) to use outside ontological tools 2) to resolve multiple uses of 'part-of' types 3) to facilitate doing queries 6. ACTION ITEM: standardization of database abbreviations Michael will finish off flatfile in one day of standard set of linking database abbreviations. He will post to CVS. done: go/doc/GO.xrf_abbs 7. 
ACTION ITEM: New Evidence Codes
1) Invent an evidence code for 'nothing is known biologically': ND. The ND evidence-reference would be a local database citation that abstracts to methodology. Done.
************
The next one came up during the Users' Meeting and didn't receive discussion from the full group; it may have to wait until the next meeting for consensus, or may be agreed to by email.
2) Proposal at the Users' Meeting to add an evidence code for 'inferred from curated orthology': ICO. The evidence-reference would be a database citation that abstracts to the methodology. This depends heavily on a shared understanding of the term 'curated orthology', but in this first instance refers to the case where RGD is transferring GO associations to rat genes from MGD via the curated orthology relationship provided by MGI. Further discussion reveals the complexity of this. Orthologies are most often defined at this time by sequence similarity. The transference of functional annotation therefore might be considered 'ISS', EXCEPT that this really depends on the method of assignment of function to the orthologous object to begin with. If the assignment, for example, of a function to a mouse gene was the result of a biochemical assay, then ISS for the rat protein might be appropriate. If the assignment of function to the mouse gene, however, was via an electronic assertion based on (perhaps) shared domains, then this would be an IEA assignment in mouse and, even with the orthologous relationship to a rat gene, the rat gene GO assignment should not be given an ISS evidence code.

8. ACTION ITEM: We will add a date field to the entire gene association file
We are adding a date column, mandatory for all annotations, in YYYYMMDD format. It will mean the date on which the association was made; it will not need to be broken down into "created" vs. "updated," because we "update" annotations by adding new lines to the association files (and deleting old lines if the situation calls for it).
The date field can be used in conjunction with the ND code, so that curators can tell when it was that nothing useful was found. This was the original motivation for including the date, but we quickly realized that dating annotations was good for other reasons as well. This has now been documented in the GO documentation at the Web site.

9. ACTION ITEM: Update documentation to explain new use of the TaxonID column in the gene association files.
Revised syntax for the TaxonID column: 1st ID = taxon encoding the gene product; 2nd ID refers to the context, i.e. the user organism of the gene product. Syntax will be taxon1|taxon2 (pipe-separated). Done.

10. ACTION ITEM: Change requirement for submission of sequence information for gene products.
Remove the sequence subdirectory and replace it with a subdirectory holding files of gpID: proteinSeqID from the participating db. BDGP will use these files to yank protein sequences and generate the appropriate def line according to current GO standards. Resulting sequence sets will be posted regularly. Peptides will be the sequence type.
New syntax: DBID:geneObjectID DBID:seqAcc#; DBID:seqAcc#;DBID: seqAcc#; where multiple seqIDs are only added to reflect alternative transcripts, not allelic variants.

11. ACTION ITEM: Update directory structure on GO web site.
Karen will transmit this directory restructuring to Mike...
gp2protein
  gp2protein.sgd
  gene2protein.mgi
  gene2protein.etc
/protein2FASTA created by se group, will go into monthly archive
Also get rid of species subdirectories; create a new one, 'gene-associations', and put all the MOD association files in there.
Rename 'monthly' dumps to an informative name: parent directory 'Data Snapshots', subdirectories 'Current' and 'Archived' ??
SO monthly_downloads (cvs tag on the 1st of every month)
subdirectory will be /current_yyyymmdd
  ontologies
  xml
  db
  gene-associations
  definitions
  database load
  sequence set
/archived
  under that, the monthly subdirectories moved from the previous monthly
All the other directories will be the most current; the monthly is the snapshot version.
remove 'abstract' directory
remove 'archive' directory
'docs' okay
'external to go' ok
'mail' ok
'note' can be deleted
'ontology' keep
'schema' goes
'sequence' delete
'software' delete
'xml' delete

12. ACTION ITEMS FOR JOHN RICHTER
*need to attach cross-product terms to existing ontologies
*need to be able to track identifiers to components of the cross_product
*will notify users 'there may be dangling references here'
*need to address 'merging' issues.
John will release 2.7 next week; 2.8 will allow dangling objects.
Other plug-ins that are being suggested: gene product fetcher (will get from database); also, just select the ISBN for some select set of references, including at least the Oxford Grid ISBN number.
John is working on an html toolkit that will show java trees on the web...a little servlet.

13. ACTION ITEM: User's guide for DAGedit
John will set up a WIKI page, and anyone can contribute.

14. ACTION ITEM: Transition into using database as primary repository
John will work on the history tracking mechanism while we test the db and flatfile saving issue. On a Friday, John will say 'we're going to start using the database'. John will populate the db on the weekend with the newest stuff on CVS. Curators won't do anything over that weekend. From then on, curators will save to the db and at the end of the session will also export to flatfiles. This will continue as long as we need it to. At some point, John will have us revert to the old system if needed. It's important during that time to check email for recent messages before using the database. John will try to give us notice.
Also, we won't do this before Nov. 2, but have tentatively scheduled this event (the email msg) for Nov. 2.

ACTION ITEM: Continue to consider the need for a DBA.

ACTION ITEM: Need a UK mirror of the GO site. Midori will talk to Pete. Check with Chris Richter--he talked with Pete (Petteri Jokinen, EBI systems).

**************************************
ACTION ITEMS FROM BAR HARBOR MEETING - July 2001

1. GO Slim
a). Consensus that a new GOSLIM needs to be developed. A small working group will select terms for GOSLIM.
b). A directory of the GOSLIM versions that have been used should be made available via the website.
c). Some considerations in using GOEDIT to make GOSLIM files: we will have to wait until the database is up to implement GOSLIM notations, as this is not accommodated by the flat files. Also, having everything in the database will make it easier to keep GOSLIM in synch with the current GO. The 'canonical' GOSLIM will be in the database and other versions (specific to certain projects) will be posted as flat files.
d). Chris Mungall has been working on software for mapping full GO to GOSLIM.
e). Midori Harris will take charge of the new GOSLIM.

2. Changes to syntax of gene association files
There is a need to define the object being annotated explicitly. Changes to be made are:
1) Add a column that defines the object being annotated (for the moment the options will be gene, transcript and protein).
2) The symbol used in the association file will be the symbol for that object (e.g. if annotation is to a gene object then symbol = gene symbol; if annotation is to a protein object then symbol = protein symbol). The same holds for synonyms: they should match the object being annotated.
3) Add a column for TaxonID (from NCBI) that defines the taxonomic node for the organism whose gene/protein/transcript is being annotated.
4) Midori will update the web pages with the new information/format for the association files. An XML format will be described to export the association files.

3.
"is this a function or a protein name?" Rex will look into identifying areas that need to be cleaned up as far as protein names and bring the suggestions back to the group. 4. Web site management 1) A validation step will be added to the XML dump to make sure it is correct. 2). Move old documentation out of CVS but leave on the FTP site. Things to be archived include past minutes from meetings, old XML DTD files, old ontologies. Move the archive directory out of CVS but leave on the FTP site. Note that the abstracts/ directory is empty and could be removed. Reports from Consortium Members 1. Rex Chisholm, DictyBase Warren Kibbe, getting the database set up, making a few GO associations for an interim collection of genes, 5600 full length cDNAs, ~ 70% of genes (between 8 and 10,000 genes). NIH grant pending which would provide some more curators. 2. Suzi Lewis, BDGP The next thing to accomplish with DAG-edit is to connect to the database, i.e., have the DB directly connected to the editor instead of the flatfiles. Once this is done, a secondary goal is to include plugins and to add a history viewer. Also, we want to switch to the database, and we want to have synchronized monthly updates. Every month (around 1st of month), take snapshot, export XML and to database (don't rewrite back to flatfiles). The XML and DB include gene associations. We have added another format of the data, RDF (an extension of XML) this will get us to DAML-OIL, Protege, etc). Chris has gotten sequence data...fly, yeast, Arabidopsis... BLAST search would bring back hyperlinked output with a small subgraph.... Sequence data is on the GO site only for SGD. Otherwise have to download... 3. Michael Ashburner, FlyBase Have appointed new curator, Rebecca ,will start Nov 5, FlyBase inherited a large number of electronic annotations. An editor has now looked at them all and there are no more IEAs. Several thousand proteins from Celera had no GO data. 
Michael has been through all of those and has been able to assign GO terms to about 1/4 of them. There are still ~380 genes with GO annotations but no references; Michael is working on that. Regular FlyBase curators added GO terms as well.
Re-annotation of release 3 has just started. Release 3 / Drosophila should be complete, with no gaps for euchromatin and no ambiguities. This is now true for 2 chromosome arms. Harvard and Berkeley trained 10 people to use Apollo to annotate fly sequences. Plan is to have reannotation done by April 1. 15 min per gene....
Improving Cross-Links:
1) with PIR, database of modified AA cross links with definition files
2) Minn. Biodegradation
3) MetaCyc...doing just pathways, working with Peter Karp following ISMB
4) comments back from MIPS...They are interested in making a MIPS2GO mapping available; Michael and Midori have both talked with Klaus Meyer.

4. Sue Rhee, TAIR
Manual annotations have started. Tools have been developed to do this (PubDB). PubDB stores matches of gene names and keywords (GO, Anatomy) to papers. Curators can validate the matches and update/insert new genes and gene aliases. Currently developing the web forms to validate, update and insert new annotations between genes and GO terms. Will package it and make the source code available in a couple of months. A lot of this work was supervised and carried out by Leonore. Leonore will be leaving TAIR; she is looking for more of an education/outreach kind of job. TAIR is actively looking for a replacement for her. J. has been working with Leonore. The rest of the curators at Carnegie are working with the GO, learning to do manual annotations.
There are about 4,000 annotations using GOFISH methods to 3890 genes. Added 50 hand annotations to GO. Using PubDB matches of known gene names and GO terms within papers to screen literature and to provide a set for curators to work with to annotate genes. They use GO Editor and other tools.
So, they take the whole set of GO terms, run them against abstracts, and make a file of matches for curators to manually validate using Web forms. Doing the same for gene names gives about an 80% validation rate when examined. Of the many gene matches to papers, over 90% have a GO term that matches as well. TIGR curators will use PubDB and share the literature curation efforts.
Two types of electronic annotations have been done. Nicky at InterPro has provided InterPro matches to Arabidopsis proteins. TAIR also ran InterProScan and used Nicky's InterPro to GO mapping. Nicky may have used the set from MIPS. So, slight differences, but will submit Nicky's annotations to the common protein set to GO. There are about 10,000 GO annotations out of about 20,000 proteins. For the remaining 5,000 proteins, will put in TAIR annotations. Currently, they are in the process of removing annotations that don't make sense for plants.
Working with Peter Karp at MetaCyc to add plant pathways to MetaCyc. We used the PathoLogic script to find pathways matching Arabidopsis enzymes. All plant proteins thought to be enzymes or enzyme-like (~6000) were passed through and ~1800 proteins were matched. 112 pathways have more than 50% of reactions matching and 64 pathways have less than 50% of reactions matching. We are waiting for Michael to finish going through the MetaCyc to GO mapping and will submit the GO annotations from this after a manual check (Lukas Mueller, mueller@acoma.stanford.edu). Lukas Mueller is one of the TAIR curators who is very interested in GO.
Anatomy development: 247 anatomy terms and 54 dev. stage terms. These will be submitted to GO once the new CVS architecture is set up. These have been submitted to the CVS repository for the Plant Ontology Consortium that Lincoln Stein has set up at CSHL. Will do comparisons to combine to create higher nodes once Gramene submits their ontologies.
They contacted Paradigm to develop collaborations, as they provide an extensive phenotyping service for plants.
TIGR will use PubDB to annotate genes to GO. TAIR has provided a login for them. We will discuss the process of operation. We decided to separate labor based on processes. TIGR has microbial systems in GO annotations. TIGR and TAIR are cleaning up genome annotations together.

5. Karen Christie, SGD
SGD has finished off the Oxford Dictionary. Marcel Mendoza, summer intern, went through the Word files from scanning the dictionary and moved them to RTF. Mike Cherry wrote a web interface to query it (password accessible). RTF files allow one to know the difference between the term definitions and string.
Have added 2000 terms since July and now have about 17% of them defined.
Internal progress reports include progress on GO.
New curator Rama Balakrishnan; she will do GO annotations.
Mike, Rama and Karen went and met with Russ Altman's group. He is moving into the genetics department and will have more interactions with SGD and GO. One of his grad students is working to put GO into Protege.
Text matches between literature and GO terms are being used to develop an interface for curators to guide annotations.
IEAs from Valerie Wood: they are comparing the new set with all other IEAs. For genes with no annotations, these may be good.

6. Matt Berriman, Sanger Institute
Just released version 1 of GeneDB (http://www.genedb.org), a genome database for parasitic groups, initially S. pombe, Trypanosoma brucei and Leishmania major. Every genome will be annotated to GO, and GO association files will be released as these are done. Have written some parasite-specific GO terms (http://www.sanger.ac.uk/Users/mb4/GO). Annotation of the malaria genome is still planned to happen before Christmas. TIGR has mentioned they would like to get involved in that too.

7. Harold Drabkin, MGI
The majority of annotations are still electronic, with over 6000 genes annotated.
There are over 1500 hand-annotated genes, and these are done both as new genes are entered into MGI and as curatorial review of gene families and other genes. There has been an increase in hand annotations with function unknown (discussed further in the 'annotation discussion'). MGI curators are focusing on sets of genes and on new genes with no GO annotations. The MGI gene association files are updated on a weekly basis and the new file is sent to the GO site and posted on our ftp site. We have modified the associations file to include taxonID, object type, 'non' option and syntaxes as agreed by the group. MGI curators are now adding over 100 annotations per week.

8. Midori Harris, GO Editor
Added a couple hundred GO terms. Most of these have been specific requests from SP annotators. Have identified areas that need work when time allows. Reorganizing documentation. Still need to hire Midori's assistant, but are now interviewing.

9. Evelyn Camon, SWISS-PROT
EBI: Wolfgang Fleischmann is doing automatic annotations to all species, all of GO. Of 100,000 SP annotations, 40% have GO assignments.
EBI open database to provide access to human gene products: have reached the stage where NCBI/Proteome, SP/InterPro, other GO translations (EC, keywords, etc.), SGD, MGI and FlyBase annotations are all together in one dataset, an Oracle DB of all associations with 1.5M entries (Protein to GO). Will also submit gene associations back to the GO. Each assignment has its own ProteinToGo ID; can extract information about GoAH.
28,727 eligible human proteins; 11,435 IEA proteins (9,636 InterPro true match, 4,000+ via SP keywords, 577 via EC codes); 9,662 hand annotated proteins in the human dataset: 6,864 done by Proteome Inc.; 2,830 by the EBI/SP curation team. There will be a new release of InterPro in the next couple of weeks. 3,000 human entries were identified by Paul; those done by Proteome were removed from the curation set. SP curators have been working hard to assign GO terms to the remaining...
This stage of the work is now complete. From now on, SP curators will annotate GO terms.
Report from Nicky about how InterPro annotations are done.
QuickGO browser has some new functionality. Now there are links to the microarray expression database.
Proteome analysis pages at EBI urgently need GO-SLIM.

10. Erich Schwarz, WormBase
IEA has been done for 1/3 of 19,000 genes. Creating new parsing of ontologies and expanding automatic annotation. Building anatomy and developmental timing ontologies. Erich has been trying to finish RNAi ... 52 phenotypic types. Currently limited by not having a full GO curator. Ideally by next meeting...
Proteome has asked WormBase to help them with 'GO-izing' the standard vocabulary that they use for all the phenotypes. Erich has been in contact with the WIT2 annotations group and has set up a collaboration with them to do that.

11. Physiological Presentation from Bernard de Bono
Bernard has been lecturing on human physiology for the past 6 years. At the MRC-LMB he is annotating protein repertoires from the human genome, and is therefore interested in bridging representations of physiological processes with gene products. During his talk he suggested a physiology model in which a biological process could be precisely defined in terms of a large-scale exchange between compartments. Four main types of compartment were defined: subcellular, cellular, extracellular and surface. He created fourteen tissue Cell Blocks: seven of them interface with the extracellular compartment only, while the other seven interface with the surface compartment as well. As Cell Blocks are the terminal leaves of a classification tree, it is not intended that a particular histological cell type should belong to more than one Cell Block.
The human Cell Blocks described are:
a) Cutaneous, Respiratory, Urological, Uterine, Testicular, Gastrointestinal, and Placental
b) Hematological, Endothelial, Endocrine, Muscular, Nervous, Skeletal and Connective Tissue

The whole technical discussion covered the following points, points 1. and 2. having been described above. Points 18 to 21 involved suggested extensions to this model to be discussed at a later stage of development.
1. A Process is the exchange between one compartment and another.
2. A series of major functional Cell Blocks was created and classified in terms of compartment contact.
3. An organ then becomes a Cell Block composite.
4. Cellular and subcellular compartments can then be addressed by the location of the Cell Block.
5. Extracellular and Surface compartments can be addressed by the Cell Blocks that are bound by them.
6. As organs are Cell Block composites, the anatomical location of the Cell Blocks can be tracked down.
7. A Process is an objective that can be depicted by a series of sub-objectives that may have to be temporally sorted.
8. A Process may occur only during a specific milestone in the organism's stage of development.
9. Separate time scales represent 7. and 8. to become the temporal ontology.
10. A spatial ontology represents 4., 5. and 6.
11. The Location Ontology captures 9. and 10.
12. A Process then becomes a cross product of Location Ontology and Function Ontology.
13. The Process Ontology editor creates paths in this co-ordinate matrix and assigns a Physiological alias to every path. From 7., a path may have subpaths that are sorted along the temporal ontology co-ordinate.
14. The organism database GO annotator generates ontology co-ordinates in terms of What (function ontology), Where (spatial ontology) & When (temporal ontology) for every sequence.
15. If a gene product's ontology annotations hit a path from 13., the gene product automatically inherits that Process.
16.
Cross product annotation is space efficient and robust to updates. Process definitions are more precise.
17. The creation of a 'Tool Box' of basic physiological objectives is therefore feasible.
18. Compartments can be mapped into partitions.
19. Process description can then extend to exchange between one partition and another.
20. An Enzyme can be seen as pulling Molecule A out and pushing Molecule B into the same partition.
21. Can embryological/developmental processes then be defined as a transition from one Cell Block to another?
22. Caveat: this cross product can be represented by a DAG, but needs more than just 'is a' and 'part of' type of edges.
Conclusion: As more and more complex organisms are annotated at gene level, it will become increasingly evident that gene products participate in more physiological processes than it is practicable to annotate directly, suggesting that curated paths using the spatial, temporal and function Ontologies as co-ordinates may be the solution to represent physiology.

Annotation Issues

1. Midori and Rex continue on search and destroy mission to get gene products out of ontologies.

2. DEFINITIONS: now 18% done. Tomorrow we will be looking at the Oxford Dictionary of Biochemistry and Molecular Biology that will be provided as part of the Editor. Keep Oxford definition and add local identifier. For edited references that come out of some other resource, add the original resource and the modifier resource. So, with more than one citation, it means that the definition is a composite of the two resources. Good progress. ... including as it does new terms that are supposed to be only entered with a definition....

3. Database abbreviations are not always consistent. We will make a little flatfile of these (database:identifier) and will post it to CVS.

4. Annotation Discussions
1) Annotating to 'unknown' is different than annotating to 'didn't look, don't know'.
So, when annotating to 'unknown' because the biology isn't known, MGI puts 'unknown' and references a paper: go through all the papers that are for this gene and use the most recent paper as a reference (the last paper about that gene that was looked at). SGD: if they look and there are papers available but they don't address the issue, then they use a generic SGD citation, so, like MGI, from modification date. TAIR associates to the last paper; 'NAS', give the last dated paper. Unknown is used as an annotation tool to help the annotation pipeline so that you know the effort was made to annotate to the GO.
We considered advantages and disadvantages of both approaches (MGI & SGD) before coming up with the ND solution.
SGD: cite generic SGD
Advantage: doesn't attribute statement to a paper that didn't actually contain it.
Disadvantage: no indication of when someone last looked for information.
MGI: cite most recent paper
Advantage: provides a date so that curators know to look only at more recent papers for additional information.
Disadvantage: implies that the paper actually stated that something was unknown.
Using ND in conjunction with the date added to all annotations captures the advantages of both previous approaches.
So the question arises, what do we put on the GO site? Have ontological term.
1) invent an evidence code for 'nothing is known biologically': ND
2) evidence-reference would be a local database citation
3) ALSO will add a date field to the entire table, at the end of each line in the gene association file, in yyyymmdd format, so we will know the last annotation date. This will mean the date on which the association was made; it will not need to be broken down into 'created' or 'modified' because we 'update' annotations by adding new lines to the association files (and deleting old lines if the situation calls for it).

VIRUS/PATHOGEN/PARASITE
Virus using host gene products will have associations to the virus genome.
Should use the gene reference to the model organism database. The issue is how to annotate, for example, a mouse protein that is functioning abnormally in the normal process of the viral genome. That is to say, the function of the mouse protein is 'normal' for the viral process, but abnormal for the mouse process. We need a way to identify the genome of the process being annotated. There was intense discussion reflecting that a curator of mouse proteins wouldn't be annotating to the viral process because that process is 'abnormal' for mouse biology. But the curator of viral proteins might want to indicate that a mouse protein was 'part of' the viral transcriptome, or something like that. So, it was concluded that in that kind of instance, the curator was annotating a normal process, and the association file needed to indicate both the taxon of origin of the gene product (which is the function of the taxon ID now) and, additionally, the taxon of the genome being annotated. So, if only one taxon ID is presented, it means that it is both the taxon of origin for the gene product and the taxon of the genome being annotated. If two taxon IDs are presented, then the 1st one is the taxon ID of origin for the gene product and the 2nd one is the taxon ID of the genome under annotation. The important point here is that what 'normal' means is relative to the organism being annotated, i.e., normal for the host vs. normal for the virus.
********************
CONCLUSION, TAXON ID Column: 1st ID = taxon encoding the gene product; 2nd ID refers to the context, i.e. the user organism of the gene product. Syntax will be taxon1|taxon2 (pipe-separated). Implicit here is that the user taxon indicates whose perspective of 'normal' applies.
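As an illustration of the TAXON ID conclusion above, here is a minimal Python sketch of how a tool might read that column. The names TaxonContext and parse_taxon_column are hypothetical, and the sketch assumes the column holds bare numeric NCBI taxon IDs; the minutes only fix the order of the two IDs and the pipe separator.

```python
from typing import NamedTuple


class TaxonContext(NamedTuple):
    product_taxon: int  # 1st ID: taxon encoding the gene product
    context_taxon: int  # 2nd ID: organism whose 'normal' applies


def parse_taxon_column(value: str) -> TaxonContext:
    """Split a 'taxon1' or 'taxon1|taxon2' field into its two roles.

    Assumption: IDs are bare integers (no prefix), as a simplification.
    """
    parts = [p.strip() for p in value.split("|")]
    if not 1 <= len(parts) <= 2 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected 'taxon1' or 'taxon1|taxon2', got {value!r}")
    product = int(parts[0])
    # A single ID means the gene product's own taxon is also the context.
    context = int(parts[1]) if len(parts) == 2 else product
    return TaxonContext(product, context)


# Example IDs: 10090 (mouse), 9606 (human), 11676 (HIV-1) in NCBI Taxonomy.
mouse_only = parse_taxon_column("10090")
host_in_virus = parse_taxon_column("9606|11676")
```

With a single ID both fields are equal, which matches the rule that one taxon ID covers both the origin of the gene product and the genome being annotated.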
**********************
Columns in SP annotation files
- db contributing
- identifier from the contributing database (SP, TrEMBL, international protein index)
- 3rd column would be the international protein identifier
- from other sources, 3rd column would be the gene symbol; would put any gene names in the synonym field
They wanted to be able to use the IPI, so our suggestion is that they use the gene name if they have it and use the IPI as a default; if they update the gene name, the IPI moves to the synonym field.
**********************
BDGP report on software and database development for the GO

John Richter, software: DAG-Editor, version 1.207
1) can load files directly from the GO site
2) new plugins load automatically, but can disable that feature
3) can add new relationships

Chris Mungall: GO database
1) monthly archives in database, XML and flatfiles
2) have been expanding the schema; can now have sequences in the database.
3) want not the FASTA, just the relevant protein seqID; these will be loaded into the GO database. We do have a sequence directory on CVS, which has the SGD file of sequences in it. This will be dumped and a new subdirectory created as noted below.
4) building tools to help in the proper use of the files by the community. One example would be to have triggers to prevent others from grabbing files and using them in analysis without understanding the IEA or levels and evidence.

Cross products make a new ontology (transcription occurs in the nucleus):
move nucleus to this new ontology
move transcription to this new ontology
'new term' nuclear transcription, has new relationship
Being able to create cross products should be high priority on the list of things to do. Will put the cross products where they belong. So, do we create a huge file of all, or do we create cross products with dangling unfound terms? Why don't we just permit loading of the entire directory? That would also need definitions. So the load would be huge, and what would be the point.
If we were loading 10 different files all the time, why not load them as one big file? Mode of operation is to allow cross product generation to users, so the real question is: if a new process term, a cross-product term, were in the process file, there should be access to all the term components of the term, including the anatomy terms. Have to support dependencies in the different files. So if you load the process ontology, you will be prompted to load the anatomy files. 'Verified' in the sense that a curator has looked at it.

GO DATABASE, Chris Mungall
Sequence blast results, multiple sequence viewer: width of sequence bar reflects the degree of similarity. Shows the multiple sequences on the top and then the blast results underneath.
GADFLY
So the question now is, do we move into the database? As in, do the curators edit into the database rather than into the flatfiles, and at the end of each day commit as well to the flatfile? So, need a group to manage this database, someone familiar with MySQL, Postgres....

BROWSERS - BRAD MARSHALL
AMIGO...www.godatabase.org very soon (on the internal LBL site)
Future directions
**** BLAST server so that you can do a BLAST search against the gene product sequences with links back to the tree
**** get a portion of the tree that you like, and get a FASTA dump of all the sequences associated
**** new browser gives number of terms annotated to term or terms under it
**** coming soon, will be able to select terms and download into a FASTA file
**** want to select more than one evidence code
**** want to filter by NOT for one or more evidence codes
**** curator approved is everything other than IEA
**** advanced search, can search by gene products or gene symbols, pick data sources, etc. Used to have the ability to paste in a list, but have taken that feature out. But now people are again asking for that, so can put back that capability.
**** have extensive docs for this

Jason Stewart, new to the group, may do some software development with GO.
He is also familiar with MGED development of structured vocabularies. MGED has been working with Rosetta, Affymetrix and others to have a data model for microarray gene expression. MAGE-ML is the mark-up language. They just had a jamboree in Toronto and put all the source code on SourceForge. They are building annotation tools to help build the XML files. The model is very large, has 146 different classes, is very interconnected, and has a lot of context. There are lots of parts of the model where 'terms' need to exist. Simple and complex: some are forms of restricted vocabularies, some of them will be real ontologies. What they did indicate in the model is that whenever they come upon one of these terms, they designate it as an 'ontology', e.g. this is Jason's ontology for describing spots on a glass slide. So when others use the file and find an ontology term that they don't know about, they will have a url to find the ontology. So the program that is taking the XML and putting it in the local database needs to go get that ontology and put it in the database that is being developed. So Jason is writing a perl script that will allow a researcher to go get the XML at the url link to load into their database. So, researchers can each submit their own ontologies, and everyone around the world can utilize the terms now. Note that this will not facilitate the development of a community standard.

MIRROR: need a UK mirror; Midori will talk with Pete.

VERSIONS, REVISIONS, RELEASES: Diff problems are due to genuine bugs in DAGedit that will be fixed in 2.7. Bo at AstraZeneca asked if we could include a diff along with the monthly. No, we decided not to do that; the new directory structure should help this issue, since all the files will be organized better.
Term Counts: Mike and Midori got the same numbers; no one knows why Bo got different numbers.

GOBO - global open biology ontologies
We, the GO consortium, have three common GO ontologies.
Other ontologies such as anatomy and temporal ontologies will be developed and 'owned' by MOD databases. At the Bar Harbor meeting last July we heard in particular from plant people about the need for phenotype ontologies. We also need an ontology for biological substances so that those terms can be taken out of the function ontology. SP crew may do this. So, we need an umbrella under which we can have a variety of ontologies. This would essentially be a web site, CVS, or ftp site onto which different communities would be encouraged to deposit their ontologies.
1) The ontologies would be open.
2) They should be instantiated in GO syntax, flatfiles, so that they could be used with GO tools.
3) They should be orthogonal with existing ontologies...this is the hardest to resolve...
4) They would share ID space.
5) Definition files should accompany ontologies.
The orthogonality issue is the biggest concern. There are reasons for this. For example, there are alternative, competing ontologies in the same domain; by definition they are not orthogonal. We would want to distinguish between competing and complementary. We would need to explain why non-orthogonalities are there, if there are any. So, it would be fine if GOBO was a web server site...communities would be invited to send us their URLs...There are other web sites for biological ontologies, but they won't adhere to the five principles above. Also, the anatomies and other ones that are used as part of the GO project would be handled somewhat differently. Most of these, however, will have some aspect of involvement with the GO project. Suzi will be writing adaptors for RDF / DAML+OIL, a data adaptor for GOedit. The beauty of using DAGedit is that you get something that can be instantiated in GO syntaxes.
ACTION ITEM...Michael proposing to publish a short editorial about this. ...

GO COMPLIANCE and JOURNALS. Michael reported that Nature has been doing the experiment of using GO terms as metadata to articles.
Michael had a discussion with Declan Butler about this. This might be a more interesting concept than 'compliance'. Nature editors would do this. It comes down to a 'keyword' for the article that is selected from the GO. Rex says maybe, if this is up and running with Nature, we should suggest the concept in a letter to other mainstream journals. GOSlim could be a keyword list, but for article keywords, it should be any GO term. Evelyn says...keywords...EMBL adds keywords to flatfiles. Curators at EMBL talk to 8,000 scientists a month. Authors add keywords; they could be encouraged to add GO terms as the keywords. GenBank might add as a dbXREF.
dbXREF ... GO has not been accepted as a dbXREF... but Michael is going to the next advisory meeting. GOdbXREFs with SP? Midori says they're going to be stored in SP-Oracle DB and will be visible to the public sometime.
ACTION ITEM... Michael will try to get GO accepted as a dbXREF for the sequence databases.

JOBS and FUNDING
Midori's assistant advertised
FlyBase curator of GO hired (Rebecca)
Evelyn hired
Lincoln's grant is going in.
GO jobs will be on a separate Web page
Judy will do Grant Progress Report due Nov 1.
This year TAIR and WormBase will get funding from this grant.

PUBLICATIONS
Michael is talking about GO at Novartis meeting in London that requires a publication. Michael will do.
Matt has TIG paper to add to the progress report
David and Michael will continue on cross-product paper.
Panther/GO comparative paper...evaluation on electronic annotation...David and Suzi (FANTOM)
Cathy Ball, SGD - 4-way comparative paper...did pull in all GO annotations that were available...still working on publication...
Matt has forthcoming book chapter with Midori to go in Current Protocols in Parasite Genomics
Han Xie...internal review of Compugen stuff...
Also... Berriman, M., Aslett, M., Hall, N., Ivens, A. (2001) Parasites are GO. Trends in Parasitology 17(10) 463-4.
PMID 11642257

WEB PAGES
Midori is working on various pages
Karen and Midori talking about totally redoing Web page
****Separate page for the job listings
****Page for getting in touch with GO, listing email lists and other contacts, participating groups would list particular contact for that group
****Another page of all members and former members
time on this? Karen will send out URL on personal space before posting publicly

NEXT MEETINGS
GO Users Meeting Feb 1, GO meeting on Feb 2 and 3 (O'Reilly meeting the 3 days before them). Academic staff get 25% discount...$600, Faculty get 50% $495. for O'Reilly meeting
How many people from us will be there...? 25...
Might be advertised more heavily...how many would be expected at Users Mtg? over 100 at least
Would have to pay for network access. Yes, it's critical, we decided.
Next beyond that...Michael will host in Hinxton....
Next beyond that...maybe hosted by Compugen in Princeton...
TIGR - contact Michelle and ask for a TIGR rep
===================================================================
GO meeting - Tucson, Arizona - Feb. 2-3, 2002

Introductions.
Progress reports.
GO central (Midori Harris).
* Adding terms, reorganizing ontology.
* Add Jane Lomax as editor.
* Gene product search and destroy, GO-slim etc still awaiting action.
* MOBY effort (Mark Wilkinson's brainchild). Provide registry for sequence retrieval, annotation of where genes are go terms from different sources. Biomoby.org site for more info. Get MOBY white paper from Midori.
FlyBase (Michael Ashburner).
* Becky Foulger has done a lot of clean up.
* Better representation of data. Release 3 of Dm sequence.
* Reannotation of sequence will be available soon.
* MA and SL have had a collaboration with Celera Proteomics to compare their assignment of GO terms to Drosophila proteins (made with their PANTHER system) to those made by FlyBase. This is now being prepared for publication.
SGD (Karen Christie and Mike Cherry).
* EC definitions downloaded EC datafile function ontology file to clean up. All EC definitions MUST be checked before loading to be sure they don't overwrite.
* A total of 1500 GO terms get a definition as a result of this EC to GO effort.
* Anonymous CVS server is running.
* Website reorganized since Chicago meeting. Check updates on people page of GO web site.
MGI (Harold Drabkin).
* Two full-time curators.
* 23% increase in annotated genes since October. 45% increase in SwissProt assignments.
* Increase in hand annotation; triage process picks papers that are GO-friendly.
* Backpopulating entries through interactions with rat database.
* Developing new web interface.
* Since GO is now in the database, searches will be much better. In addition, tracking of literature will be improved.
TAIR (Lukas Mueller).
* Manual GO curation is being done by all curators.
* Annotations to 10k genes.
* Linked to MetaCyc. AraCyc database established. Used MetaCyc to GO. Generated 1612 annotations.
* Secondary metabolism not yet well represented in GO. TAIR will add many new terms in this area.
* TAIR has developed anatomy (1000 terms) and development (120 terms) ontologies.
* Created tool called Pubsearch. Links literature, gene info and GO annotation. Contains abstracts and full text files.
WormBase (Paul Sternberg).
* Focused on large RNAi screens (1100 genes assigned to about 2000 GO terms). Should produce a large number of new curated terms.
* Life stages DAG for embryo and post embryonic stages and adults currently circulating through the community.
* Anatomy DAG (Raymond Lee). Cell lineage relationship included in DAG.
DictyBase (Rex Chisholm, Tricia Dyck).
* 7200 cDNAs
* chromosome 2 complete.
* Currently have about 2000 genes with annotations.
PSU (Matt Berriman)
* Malaria. Genome shrunk. 412 genes annotated.
* Trypanosoma brucei. Chr 1. 400 genes
* Leishmania. Interested in using GO
* Entamoeba. Matt will present.
* Life cycle using cycle function of DAG-Edit.
GRAMENE.
(Pankaj Jaiswal)
* 8000 gene products in SP-TrEMBL, about 50% have GO associations.
* Rice should be available for next meeting.
* 300 plant related terms.
* Manual curation of 4000 proteins in next few years.
GKB (Beth Nickerson).
* Waiting to hire.
EBI (Evelyn Camon)
* GO release in works
* Keyword to GO mapping improved.
* Single GO curator
* Receptor database at EBI
* Interaction database (INTACT)
TIGR (Michelle Gwinn)
* 10000 microbial genes annotated to GO
* Arabidopsis
Compugen (Han Xie)
* annotating protein new version in about 2 months. Based on 1M proteins, 90% annotated in some way.
* Used for oligo design, and in commercial database
* Algorithm development with GO clustering to generate primitive ontology from literature
AstraZeneca (Courtland Yockey)
* GO has an official home within AZ. Gene association files reconciled with annotation of internal databases.
* Feedback-good: used for microarray data analysis. Increasing the number of things that can be prioritized for understanding arrays.
* Need immunology, physiology and hematology areas added. GPCR has added value. Use function ontology as an organizing principle.
Incyte (Lisa Matthews).
* Incorporation of GO for YPD (6200 genes-46000 GO terms)
* Mycopath (1800 genes with 7900 GO terms).
* Improving tracking.

October Action Item updates (see list of action items at end).
1. mouse
2. 3. 4. no progress
5. on agenda for later
6. standardization complete
7. added ND evidence code
8. add date field completed
9. taxon ID done
10. change submission requirement
11. update data directory-done
12. action items for John R.
13. DAG edit user guide. In process
14. transition to DB in progress
15. continued need for DBA
16. UK mirror in the works.

Software and Database
* DAG-Edit
o Moved to SourceForge site.
o DBxrefs now more automated
o Type filtering
o Find improved
o Plugins launched in background
o Reduced size flat file format
o Dangling parent references now working-can link out to other references
o Cycles are supported
o Macros can be saved
o History plugin restored
o Term change tracker plugin added
o Can associate pictures with terms. Plugin.
o Future directions. Database beta test. Postgres problems.
* Gene products viewer
* Create servlets
* Move to DAML+OIL compatibility
* Interactive database mode
* Need to track obsolete terms somehow.
* Database beta trial to continue
o Will continue until a week has passed without problems. First priority to get save to work. Then get load to work. Then all editing will be on database.
* SourceForge repository reviewed. http://sourceforge.net/projects/geneontology
o Bugs should be submitted via tracker at SourceForge
o Requests for term changes need to be made through SourceForge
* Fasta/sequence status. Most sequences are available.
* Database schema. Changed to match DAG-edit. Currently running two systems, MySQL and Postgres
* AmiGO update
o Added peptide sequences derived from SP.
o Icons that go back to sources added
o Active link to ISS to show sequences.
* FAQ-O-matic.
o Chris will implement a basic FAQ and develop a system for allowing it to be updated by consortium members.
Content.
* Presence of non-coding RNAs in GO. But need ways of representing genes of various classes. MA proposed an ontology for sequence features--SO (sequence ontology).
* Removal of gene products will continue with MH, MA and RLC. They will be replaced with more appropriate descriptors.
* Remove cyclin as we have an appropriate replacement, but ensure that synonyms are present to allow searches.
* GO-slim still on the agenda for future.
GOBO.
* MA proposed adding GOBO to the GO repository. The group approved.
* Ontologies must be orthogonal
* Developers who own ontologies will be responsible for maintaining them.
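
As a rough illustration of the orthogonality requirement for GOBO ontologies, the sketch below flags term names that appear in more than one ontology. It is only a crude proxy (two ontologies can overlap in meaning without sharing any text string), and everything in it is an assumption made for illustration: the function name shared_term_names, the {term_id: term_name} input shape, and the toy IDs and names are hypothetical and do not come from any GO tool or file.

```python
from collections import defaultdict


def shared_term_names(ontologies):
    """Return term names that occur in more than one ontology.

    `ontologies` maps an ontology label to a dict of {term_id: term_name}.
    Names are compared case-insensitively after collapsing whitespace.
    """
    owners = defaultdict(set)
    for label, terms in ontologies.items():
        for name in terms.values():
            key = " ".join(name.lower().split())
            owners[key].add(label)
    # Keep only names claimed by at least two different ontologies.
    return {name: sorted(labels) for name, labels in owners.items() if len(labels) > 1}


# Hypothetical toy data; the IDs and names are made up for illustration.
toy = {
    "anatomy": {"AN:0001": "Leaf", "AN:0002": "root"},
    "development": {"DV:0001": "seedling stage", "DV:0002": "leaf"},
}
print(shared_term_names(toy))  # {'leaf': ['anatomy', 'development']}
```

A report like this would only be a starting point for curators; deciding whether a shared name is a true overlap or a homonym still needs a person.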
Cross product Generation.
* Discussed the production of cross products for anatomy; agreed to wait until next time to discuss, once tools are available.
Annotation issues.
* Have been using broad evidence codes. Discussed the need to expand evidence codes to more precisely capture the evidence supporting use of GO terms.
o Use of more detailed codes would be optional
o JR suggested an ontology of evidence codes. The group agreed.
o Important to be sure that ISS codes trace back to a high quality annotation, not just IEA.
o Agreed to make finer grained evidence codes available.
o Discussed what codes we should use. (see list-get electronic version from Midori). Agreed that each must have definitions to be agreed upon in the period before the next meeting.
o Where possible suffixes should be used consistently.
* David Hill raised the issue of cardinality induced by mutant phenotype evidence code. Midori suggested simply making a note that the field may be used this way in this case.
* Matt Berriman asked what citation should be used when there is no database. The suggestion was to point to a URL.
* Discussed situations where two terms from separate ontologies are often linked. Considered the possibility of developing tools to provide curators with options. This broadened to a discussion of tools for annotation including Talisman, Pubsearch from TAIR and sequence based tools. Sharing of these tools via GMOD was encouraged.
* Use of NOT was discussed. This is to be used in cases where something explicitly says something is not true.
Content issues:
* Discussed pathogenesis. How should we handle cases when this is a function of a pathogen? Agreed to add pathogenesis as a term with synonym of virulence.
Ontology Structure and Representation.
* The need to begin to develop more varied relationship types was discussed. It was agreed that this was needed, but could be delayed. MA presented a list of some possible relationship types.
He agreed to share his list with the group for consideration and possible expansion.
Next meeting.
12-14 May at CSH?? Hinxton in the fall, coordinate with ontology meeting (mid November). [Later changed to September, immediately after Genome Informatics]
Action Items.
* Submit electronic annotation methods and tools to Suzi et al.
* Suzi will generate a report on progress since the last meeting. To speed up organism reports.
* JR action item: spell checking method
* JR action item: Other people submit rules and use to check ontology
* Brad: grey out obsolete terms in AmiGO
* GO requests for terms through SourceForge.
* Add links to GO web site to submit or track requests.
* Brad: AmiGO. ISS add links out.
* Individual groups should look at GO.xrf_abbs to check that URLs are correct
* Brad: Add documentation for format of GO.xrf_abbs.
* Chris will set up GO FAQ.
* We will establish a process for updating the FAQ using a distributed.
* MA will send out SO (sequence ontology)
* JR will develop plug-in for producing cross-products
* Send GO slims that have been used and published on to the repository.
* MH and MA will circulate expanded evidence code vocabulary with definitions for review.
* Suzi et al will develop tools for annotation. Talisman tool.
* Pubsearch from TAIR should reside on GMOD.
* Beth will pursue May meeting at CSH, with Michelle Gwinn from TIGR as a backup.

Action Items from October 2001 Meeting (this is just the list, no details or results)
1. Get temporal and anatomical CVs into CVS from participating databases
2. Tool to create cross-product terms
3. Post new biological process that incorporates updating developmental processes
4. Come up with initial default GO slim
5. Ontological formalisms (for Feb meeting)
6. Standardization of database abbreviations
7. New evidence codes
8. Add date field to the entire gene association file
9. Update Documentation to explain new use of the TaxonID column
10.
Change requirement for submission of sequence information for gene products
11. Update directory structure on GO web site
12. Action items for John Richter
13. User's guide for DAG-Edit
14. Transition into using database as primary repository
15. Continue to consider the need for a DBA
16. UK mirror
===================================================================
Meeting of the Gene Ontology Consortium
Cold Spring Harbor Laboratory, Plimpton Room
May 12-13, 2002

Sunday May 12th, 2002

ATTENDING:
----------
- Suzanna Lewis        FlyBase          Berkeley, CA
- Michael Ashburner    FlyBase          Cambridge, UK
- Karen Christie       SGD              Stanford, CA
- Judith Blake         MGI              Bar Harbor, ME
- Elizabeth Nickerson  GK               CSHL, NY
- Janan Eppig          MGI              Bar Harbor, ME
- Courtland E. Yockey  AstraZeneca      Delaware
- Matt Berriman        PSU(Sanger)      Cambridge, UK
- Katya Mantrova       Incyte Genomics  Beverly, MA
- Han Xie              Compugen         Jamesburg, NJ
- Liat Mintz           Compugen         Jamesburg, NJ
- Bernard de Bono      MRC              Cambridge, UK
- Michelle Gwinn       TIGR             Rockville, MD
- Linda Hannick        TIGR             Rockville, MD
- Harold Drabkin       MGI              Bar Harbor, ME
- David Hill           MGI              Bar Harbor, ME
- Rex Chisholm         DictyBase        Northwestern
- John Richter         BDGP             Berkeley, CA
- Pankaj Jaiswal       Gramene          Cornell, NY
- Susan McCouch        Gramene          Cornell, NY
- Martin Ringwald      MGI              Bar Harbor, ME
- Midori Harris        EBI              Hinxton, UK
- Eurie Hong           SGD              Stanford, CA
- Chandra Theesfeld    SGD              Stanford, CA
- Mike Cherry          SGD              Stanford, CA
- Doreen Ware          Gramene          Cornell, NY
- Chris Mungall        BDGP             Berkeley, CA
- Eimear Kenny         WB               Caltech, CA
- Lukas Mueller        TAIR             Carnegie Inst., Stanford, CA
- Daniel Barrell       EBI              Hinxton, UK
- Evelyn Camon         EBI              Hinxton, UK
- Becky Foulger        FlyBase          Cambridge, UK
- Jane Lomax           EBI              Hinxton, UK
- Amelia Ireland       EBI              Hinxton, UK

GROUP REPORTS
-------------
Database summary - Suzanna Lewis
- presented table of all non-IEA annotations
- number of terms has increased 10% since last meeting (now 22,000 F; 21,000 P; 15,000 C)
- idea to show % genome covered, but gets into issue of estimating numbers
of genes - new ec2go mapping (thanks to Daniel) - pie charts on Amigo!!! can break down by group, by GO-slim like terms (chosen by numerical representation) - Michael Ashburner suggests being able to choose a specific GO-slim file - Matt suggested being able to keep top level pie chart when going to breakdown of a slice FlyBase - Becky Foulger - Fritz Roth data set added SGD - Karen Christie, Mike Cherry - EC definitions (~1500), curators nearly finished checking these; Rama will add when done - added ~300 Component annotations from a large scale analysis paper from Michael Snyder's lab (Kumar et al. 2002); added only when 2 different methods confirmed the localization - working on some new GO tools to incorporate into SGD: one maps sets of genes to GO-slim terms - still in the prototype stage - for annotations to the unknown terms, have changed over to use of the ND evidence code and have added the date column to our gene associations file - have changed our software to be able to display NOT annotations - 2 new curators, Eurie Hong and Chandra Theesfeld, both in attendance *** Action Item 1a (Brad Marshall): make display of NOT data possible/correct in AmiGO (e.g. 
FBP26 for SGD; FlyBase, others have more) SwissProt - Evelyn Camon - GOA file re-released - human data only - annotation for other species - 2.1 million associations , viewable in QuickGO - next priority - organisms not covered by a MOD - new SwissProt keyword file, on the website (74% of keywords mapped to GO) PSU (Sanger) - Matt Berriman - another organism for Plasmodium falciparum, 74% of genome annotated ~3000 annotations, all non IEA - tsetse: working on gene association file soon, based on BLAST hits, etc,; sequencing is done 21000 ESTs, ~8500 seqs - life cycle stage ontology - http://www.sanger.ac.uk/Users/mb4/PLO/ MGI - Harold Drabkin - GO now incorporated into the MGI database, so MGI browser now faster than it was, not using flat files, and reading data current to within a day, allows Boolean operators - Martin just put mouse anatomy file onto GOBO - David has been working on a scheme which could be used for GO-Slim (text of document distributed at meeting attached to the end of this file) - phenotype ontology - making it loadable into DAG-Edit TAIR - Lukas Mueller - added a GO annotation search to the website - literature curation tool (available via GMOD), have used the tool at TAIR, annotations haven't yet gotten to web site - added/rearranged terms in metabolism for plant specific pathways, - embryogenesis vs. 
morphogenesis, in plants morphogenesis is not an obligatory child of embryogenesis, Tanya's proposal for revisions should be available for discussion soon
- GO-slim version, plant version for TAIR
TIGR
- Michelle Gwinn (Comprehensive Microbial Resource)
- CMR: many associations, but can't release them until the genomes have been published, one paper has been submitted, as associations become available will be added to the GO annotations table
- Shewanella
- Bacillus anthracis
- Klebsiella
- auto-annotation tool to assign IEA annotations to non-TIGR genomes, may not work until more genomes are manually curated
- Linda Hannick (Arabidopsis)
- 2 new people, team of 5 now
- 20% of Arabidopsis genome is done, approaching by paralogous families of genes, tool to display similarities, stuff, speeds up annotation, everyone specializes in one area, rather than random
- trying to coordinate with TAIR people to avoid duplication of effort
- next genome - Trypanosoma brucei
Gramene - Pankaj Jaiswal
- 9000 annotations for rice, mostly IEA
- ontology browser on Gramene
- putting in rice anatomy and temporal files
- trait ontologies in rice, refining structures
- working with MaizeDB to develop resources for anatomy and trait, phenotype ontologies
- working with Michael Ashburner on chemical ontologies
WormBase - Eimear Kenny
- goal to have detailed descriptions for genes by mid-2003
- Andre P.
- Erich working on more extensive gene descriptions
- 2 new WB curators, 1 is in the process of moving to CA, already doing lit curation
- now 3 WB curators working on GO
- WB is developing ontologies: due for release soon
o cell lineage ontology (Raymond Lee)
o developmental ontology (Wen Chen) life stages
DictyBase - Rex Chisholm
- EST collection from Dictyostelium, 2000+ IEA annotations
- chromosome II about to be completed, mostly IEA annotations
- still working on final schema for database, still using a prototype
- ontology: anatomy, life cycle, have passed on to David, as simple test of crossproduct
- funding - everything looks good for DictyBase's funding to be approved
GO - Midori Harris
- got SourceForge suggestion tracking running, and it is working well; it also helps people making requests see that there is a line for requests
- Jane and Amelia making lots of progress
- definitions at 30% now!!
- new curators: Amelia Ireland just started, and Cath Brooksbank about to start
Compugen - Liat Mintz
- no non-IEA annotations
- continuing to work on gene associations, including annotations in different products
- just published their paper in Genome Research
- oligo-libraries arranged based on GO terms, genes that are not annotated are often low-expressers
- over 10% of human genome is transcribed from both strands
- hope to release a new version soon
Incyte - Katya Mantrova
- complete translation of protein properties into GO terms
- all databases now annotated with GO terms
- new term suggestions - 90% are getting accepted
- BioKnowledge library by subscription only from June 1st, free trial until then
AstraZeneca - Courtland Yockey (post-meeting addition)
- Incorporation of GO into AZ Bioinformatics Infrastructure
o Global "protein annotation pipeline" currently utilizes EBI's GOA and NCBI's Proteome annotations as primary public source material
o All derivative global molecular class databases being constructed utilize GO annotations
o A global target decision
support system (under construction) will utilize GO annotations as well o A number of internal groups continue to either use or inquire about GO annotations from the standpoints of microarray data analysis and text mining applications - Additions to GO Infrastructure in AstraZeneca o MySQL database mirror of GO MySQL database set up which includes both public and AZ internal GO annotations, and which will serve as reference set for derivative uses and views o Plans to port GO from MySQL to Oracle were not pursued in favor of a MySQL-only solution o Internal GO Annotation effort (GOAC Project) now spans >2000 genes and >13,000 annotations ACTIONS ON PREVIOUS ACTION ITEMS -------------------------------- - SourceForge suggestion tracker - working well - John Garavelli visiting EBI - helping with terms that have RESIDs - John Richter will come visit people if they ask nicely (for help, analysis of system-specific oddities/bugs with DAG-Edit/GOET) - Not done: Chris - have linkouts to sequence in cases such as ISS with ________) o ISS with SP:nnnnnn, click on ISS, get new report page _ with Literature: PUBMED; _ ISS with SWP:nnnnnn - GO.xrf_abbs file: stable way for database cross-reference (Ask Brad) - in progress o need to invent a metareference for linking for curator refs o need to talk about specific columns, e.g. gp2protein o clarify abbreviations - SO document, Michael Ashburner has submitted, people are commenting *** Action Item 1b - metareference for curator refs for AmiGO (BDGP and/or GO): create a metareference for linking for curator refs for definitions for AmiGO (e.g. 
GO:mah, SGD:krc, etc)
*** Action Item 1c - AmiGO (BDGP): linkouts in AmiGO to sequence in cases such as ISS with ________)
*** Action Item 2 - GO.xrf_abbs file (each group): examine the GO.xrf_abbs file with respect to those abbreviations used by your group, add or submit (to your favorite contact with CVS write permission)

CONTENT ISSUES
--------------
- Ligand, everyone has agreed upon a solution and Jane is about to implement it
- E&M (Embryogenesis and Morphogenesis, not Electricity and Magnetism, this is biology! ;) - Tanya sent draft out Friday, about ready to implement changes to accommodate fact that in plants embryogenesis is separable from morphogenesis
- have not been terribly consistent about "biosynthesis of ___" when we are talking about modifying a residue within a protein; some of these activities are grouped under "biosynthesis" while others are under "protein modification" [NB: not the exact term text strings]
o so for modification of bases or aa residues within the context of the RNA or protein, it will be only modification; biosynthesis applies when the substance is made as a free substance
o post-meeting addition (MAH): the case of selenocysteine, which is produced by modification of a serine residue attached to a tRNA. I think the 'not free so not biosynthesis' reasoning applies here too.
*** Action Item 3 - GO content: modification vs. biosynthesis (GO) - examine ontologies for consistency of term names in the area of modifications to nucleotides/amino acid residues within the context of an already synthesized nucleic acid/protein
- proliferation of sensu terms, (Suzi) do we have rules?
should only happen in the case of homonym terms, same text strings with unique meanings for each organism
*** Action Item 4 - GO content: sensu terms (GO) - evaluate sensu terms, and expand documentation
- Kirill - don't have a term for "group transfer", instances of it certainly are covered, won't do this (insert a high-level grouping term) yet
- use of "AND" in a term,
o Suzi is against it, because there is high probability of violations of the true path, should work to use "AND" as a grouping mechanism and attempt to use the structure to represent the grouping;
o "and/or" is acceptable ??? - will examine these on a case by case basis and see where they are appropriate
- annotation to two different terms with an OR [NB: this would be two lines in a gene_association.yfo file with an "OR"],
o DAML+OIL has a way of indicating disjunctions
o Chris and John are not in agreement on how to deal with this, will hash out the options for software solutions to this issue and report back
o following the above "conclusion" there were some additional group comments suggesting that the better way to approach this is to construct the ontology to have the appropriate grouping terms so as to avoid the need to have an "OR" join between two associations
*** Action Item 5a - GO syntax: use of "and" and "and/or" (GO) - evaluate use of "and" and "and/or" in GO terms, target for elimination when possible
*** Action Item 5b - possibility of ambiguous gene associations conjoined with "OR" (BDGP: Chris, John) - discuss possible software solutions to the question of joining two different associations (gene product to GO term) with an "OR", [NB: resolution of this item was unclear; first communicate with GO people on Action Item 4a and discuss whether there is any real desire/need to do this.]
*** Action Item 6 - expansion/clarification of GO documentation (GO: Cath B) - Cath will evaluate GO documentation and expand/modify to clarify
- Integrity checks
o Do we have any rules for integrity checking?
are we at a stage where we could?
o let's look for:
  - child terms lacking parentage that they should have
  - redundant relationships - still some of these
*** Action Item 7a - ontology integrity checking (John) - will create a SourceForge submission page for ontology errors DONE!!! 5/13/02
*** Action Item 7b - ontology integrity checking (each group) - curators should look for ontology errors, i.e. for items to consider for automated integrity checking and submit them to the SourceForge page that John will create

NATIONAL LIBRARY OF MEDICINE (NLM) AND GO (Judy Blake)
------------------------------------------------------
- she and Michael Ashburner will be working with NLM to bring GO into NLM and MeSH
- has some papers on the topic, "Lexical properties of the Gene Ontology", but in order to map GO terms to NLM terms (of any type), NLM requires definitions for all (GO) terms; when NLM brings in a new system, they are looking to incorporate the new system as synonyms to existing terms OR make new terms if no syn exists
- Michael Ashburner had meeting with Stuart Nelson (head of MeSH) and Betsy... (head of NLM), to establish seriousness of GO on this project
- Courtland ? incorporate GO into MeSH, or have a new UMLS ?
A: both, looks like very good progress

GO-SLIMS
--------
- TAIR and Amelia are working on some generic GO-slims, one for plants and one for animals, will be very similar, except for some things like no photosynthesis in animals
- have archived GO-slim which was used for Celera drosophila, will archive other GO-slims that have been used and which can be found, have written a document on GO-slims to be updated with a caveat about obsolete terms in archived slims
- David has a proposal (see attachment), with an example about how to get all membrane things, need to join membrane of cell fraction with membrane of cell, David chose the GO-slim from the DAG and selected bins as a biologist, rather than using a computational method to divide the annotations
- Michael Ashburner is against having "Other" terms in GO-slims, David's GO-Slim highlights some grouping terms that may be missing from the GO
- there was general agreement that the display software of our dreams would be able to generate an 'other' category (on the fly, maybe?) for pie-chart purposes - we definitely decided not to add 'other' grouping terms into the ontologies
- handling redundancy, when a gene product may be annotated to two terms, with different granularity, issues about collapsing redundancy
- each GO-slim should have a document attached to it that explains its rules
- ? from Courtland, about being able to use a GO-slim to map an annotation set, Chris suggested that this should be a script, would be nice to incorporate these scripts into AmiGO so that people can use the various GO-slims and use the one of your choice to map the association file(s) of your choice
- Evelyn ?
- naming convention for GO-slims - no problem to have as many as are needed/used, but we will put them into repository
*** Action Item 8 - submit GO-slim scripts/rules (each group, as relevant) - Submit scripts (Chris is fine with Python, or Perl) for using/calculating GO-slims to BDGP
*** Action Item 9 - GO-slim naming conventions (GO): - confirm/review naming conventions for GO-slims and expand documentation if needed (Michael Ashburner claims that there is a naming convention in the document that he has just written)
*** Action Item 1d : AmiGO (BDGP): Incorporate GO-Slim scripts into AmiGO

DAG-EDIT ISSUES
----------------
- John will come visit you to talk about, or help set up, DAG-Edit if you ask him nicely
- upcoming change of field in DAG-Edit, where the ID will not automatically be GOID; could DAG-Edit read the ID prefix from the root term? Action item for John
- spell checking is not done
- integrity checking - not started, no info to do it, will need to discuss what the rules are
- database
o occasional problems, still complicated
- capture semantics of transport? email from Chris Mungall to Midori...
  o does this mean the thing itself moving, or
  o Chris has a little thing he can display to talk about this
- relationship type choosing is now allowed
- John proposed:
  o determine new relationship types
  o inform everyone of what it is and the symbol to be used
  o then implement
- would have to modify true path rule

*** Action Item 10 - DAG-Edit/GOET (John Richter)
  - automatic recognition of ID prefix so that one doesn't have to
    manually change it all the time
*** Action Item 11 - division of "part-of" into multiple relationship types
    (Chris and Jane)
  - will look into new relationships deriving from the current
    multiplicity of the meaning of the "part of" relationship

- sure wish we could do cross-products
  o John will do a "macro" for this
  o Chris proposed being able to select a term in each ontology and have
    a table generated, where one could select rectangular blocks; David
    wanted to be able to see "part of" relationships...
  o John: "but that'll be huge...."
  o David: "Embrace the Explosion."

*** Action Item 12a - GO dictionary (GO, John Garavelli)
  - we need a dictionary for John to use for spell checking (John
    Garavelli wants to write a script for this anyway so he will generate
    the dictionary)
*** Action Item 12b - GO dictionary in editor (John Richter)
  - can write a spell checker for the editor once he has a dictionary
*** Action Item 13 - Cross-product tool (interested parties (David,
    Bernard, ?), Chris, and John Richter)
  - cross-product tool: further discussion will clarify what is actually
    wanted as well as feasible, so that John can write a plug-in for
    curators to use via the editor
*** Action Item 14 - New documentation for making cross products in
    DAG-Edit as currently exists (GO: Jane, Amelia)
  - create document on generating cross-products in DAG-Edit

- How do we handle IDs when we split terms?
o Currently, the old term and ID becomes obsolete, and both new terms get new IDs, with obsolete ID as a synonym to each new term *** Action Item 15 - comment field: obsoletes & syntax (GO) - move obsolete IDs from synonyms to comment field and institute a regular (as in parsable) syntax for this field *** Action Item 1e - display comment field in AmiGO (Brad) - display comment field in AmiGO BERNARD'S PRESENTATION : GOAL (GO Active Language) -------------------------------------------------- - Progress since Chicago - representation of physiology as a xproduct of anatomy, and multiple GO aspects - it is possible for terms to inherit processes from parent terms - activity - any GO Function (F) or Process (P) - compartment (CPR) - can link P and C terms when we know where a process occurs - A biological process can be formally described as a relationship between two CPRs using an activity. - CPRs - is a region of biological space that can be unequivocally addressed using a combination of nodes from the C, cellular, and anatomical ontologies - Bolus discoideum (Latin for disk-shaped round lump) will be the hypothetical model organism o 3 developmental milestones o 7 cell types - system models processes o exchange o stage o complexing Bernard's document is downloadable from the MRC-LMB ftp site; he's also got a power point doc there: ftp://ftp.mrc-lmb.cam.ac.uk/pub/bdb/GOAL_Framework.pdf ftp://ftp.mrc-lmb.cam.ac.uk/pub/bdb/GOAL_Presentation.ppt GOET editor (for GOAL) ---------------------- - John demos new software, which will probably replace DAG-Edit - this new program is going to use a DAML+OIL like format, which will allow John to make many things that we've wanted to do be possible much better o e.g. history saves, undo, simpler modules for changing a dbxref (currently programmatically difficult in DAG-Edit) o will allow editing of new types of data much easier o DAML+OIL will have advantages for some of the new data types, e.g. 
SO; also allows some intrinsic restrictions/rules for a given class - this is available through GMOD project, available on SourceForge ANNOTATION ISSUES ----------------- Concurrent assignments - Evelyn Camon (Correlations between terms often used together) - system in QuickGO which gives curator hints which terms usually turn up together, shows up in QuickGO, also will throw up exceptions "weirdness detector" for annotators *** Action Item 16a - concurrent assignment protocol/docs for QuickGO (Evelyn) - get documentation from Tom Oinn on how he did it for QuickGO; add to documentation, to explain how this is calculated *** Action Item 16b - concurrent assignments from database (Chris) - pull this calculation on concurrent assignments from manual annotations using Database [NB: Fritz Roth is doing some calculations along this line] *** Action Item 1f - AmiGO (Brad) - show concurrent assignments in AmiGO Evidence Codes -------------- - two concerns - issue 1 - evidence codes for annotations o categorization (currently), but does not imply confidence level, o in discussions between this meeting and the previous one in Tucson, and from the surveys done for Fritz Roth, it has become apparent that the evidence codes in use now do not provide an indication of confidence, that curators felt that they could not make judgements on experiment quality from evidence code alone - issue 2 - judgements of sequence believability o another practical ?, qualitatively, to develop a system that provides meaning to people running algorithms, which sequences do you believe in? what are rules/criteria for deciding which sequences to use, and also that the annotations are believable - want a good test set for which to test algorithms, that only contains the genes and those annotations which are deemed to be of good quality - Evelyn suggested using the QuickGO algorithm for correlated annotation with all the current IEA stripped out and compare to same algorithm run with the IEAs, i.e. 
  does it still predict the same correlations
- consensus opinion to not include IEA or ISS in training sets
- attempt to evaluate training set (no IEA or ISS) quality
- FlyBase Panther calculations - transitive errors, only 4% (ISS)
  o other error type = F errors (F#%*-up, also about 4%)
- to get the clustered set:
  o do within the group
  o use EBI clustering (TRIBE, InterPro, or SP clustering)
  o Liat/Compugen may be able to do the clustering
- use the training set to help develop a tool that helps with annotation

*** Action Item 17a - sequence clustering for sequences annotated with GO
    (Daniel? Liat?)
  - take sequences as they are now, run a clustering algorithm, generate
    trees, attach GO annotations and inspect by hand
*** Action Item 17b - very cool annotation tool (????, highly dependent on
    above)
  - use this to develop an annotation tool that utilizes homology
    clustering

Annotation Tools:
-----------------
- Talisman can be downloaded from EBI; semi-documented; a curation
  interface for GO in SwissProt; some discussion of transmitting
  annotations when appropriate to another MOD; currently no programmers to
  support this tool/program
- Lucas's tool (now on GMOD)
  o searches PubMed, creates linking table for specified info
  o preindexed papers against GO terms (Perl module for text string
    matching)
  o implemented as a Java servlet
  o right now only abstracts indexed
  o Sue Rhee recently found new software for PDF to text
  o MySQL database, trying to make it more generic, submitting a GMOD
    grant to expand applicability

*** Action Item 18 - IEA/ISS methods (each group, GO: Midori):
  Groups to submit to Midori short blurbs on procedures for large scale
  annotation methods (bulk assignments, particularly with IEA or ISS) with
  URLs to add to the annotations guide

Consistent term use:
--------------------
- Midori raised an issue about attempting to make sure that we use terms
  in consistent ways; Lisa Matthews has offered to send some notes about
  term use at Incyte
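
A rough sketch of the comparison Evelyn suggested under Evidence Codes above (correlated-annotation counts computed with and without IEA/ISS annotations) is given below in Python. This is not the QuickGO algorithm, which is not described in these minutes; it simply counts how often two GO terms are annotated to the same gene product. The column positions are an assumption based on the tab-delimited gene association file layout (object ID in column 2, GO ID in column 5, evidence code in column 7), and the sample rows, IDs, and function names are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Assumed 0-based column positions in a tab-delimited gene association file.
OBJECT_ID_COL = 1
GO_ID_COL = 4
EVIDENCE_COL = 6


def parse_associations(lines, excluded_evidence=frozenset()):
    """Return {gene product ID: set of GO IDs}, skipping excluded evidence codes."""
    terms_by_product = {}
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and "!" comment lines
        fields = line.rstrip("\n").split("\t")
        if fields[EVIDENCE_COL] in excluded_evidence:
            continue
        terms_by_product.setdefault(fields[OBJECT_ID_COL], set()).add(fields[GO_ID_COL])
    return terms_by_product


def cooccurrence(terms_by_product):
    """Count, for each pair of GO IDs, the gene products annotated to both."""
    counts = Counter()
    for terms in terms_by_product.values():
        for pair in combinations(sorted(terms), 2):
            counts[pair] += 1
    return counts


# Hypothetical sample rows (placeholder database, products, and references).
SAMPLE = [
    "!hypothetical sample file",
    "DB\tP1\tsym1\t\tGO:0000001\tREF:1\tIDA\t\tP",
    "DB\tP1\tsym1\t\tGO:0000002\tREF:1\tIEA\t\tF",
    "DB\tP2\tsym2\t\tGO:0000001\tREF:2\tIMP\t\tP",
    "DB\tP2\tsym2\t\tGO:0000002\tREF:2\tTAS\t\tF",
]

all_counts = cooccurrence(parse_associations(SAMPLE))
curated_counts = cooccurrence(parse_associations(SAMPLE, frozenset({"IEA", "ISS"})))

# Pairs whose counts differ between the two runs depend on IEA/ISS annotations.
changed = {pair for pair in all_counts if all_counts[pair] != curated_counts[pair]}
```

Comparing `all_counts` with `curated_counts` shows which correlations survive once the electronic and sequence-similarity annotations are stripped out, which is the question raised in the discussion.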
================================================================== Monday May 13th, 2002 GOBO and SO ----------- SO -- - Mike Cherry did some reorganization of the directory structure and put the GOBO stuff into the CVS repository, see http://www.geneontology.org/doc/gobo.html - Martin put up first bit of mouse on Friday - SO attempts to provide a controlled vocabulary for sequence features, and types of genes, e.g. whether primary transcript is edited or not, located sequence features and clones and ways to locate them on the sequence - Michael Ashburner and Suzi will write a supplemental NIH-grant off the GO grant to get a software person to do SO, since this will require DAML+OIL type slots to adequately describe the information types - Lincoln Stein, Owen White, Ewan Birney, test project , servers to provide data in the same way DAS server using the SO terminology, will help refine the quality of the SO - John and Chris made some comments about conversion from GO format to DAML+OIL, John suggested that it might be easier to dev in DAML+OIL from the early stages Biochemical Ontology -------------------- - Pankaj has been working on this - Michael Ashburner has restructured some of this and hopes to release it upon his return to UK next week - most restructurings to split out classifications by compound type and classifications by action - also removed "compound names" - Pankaj will parse in CAS #s - will help to do metabolism by cross-products - MESH is semantically mixed, and heavily biased by pharmaceutical compounds - CAS is not open to the public - about 1400 terms now Disease ontologies ------------------ - Rat people are very interested in these, DictyBase as well - UMLS has tied together a lot of this type of information, though there are some major issues with licensing and public access; but there are many classifications already so do some of these provide a good starting point (JB) - Michael Ashburner is unaware of anyone doing this as yet - where 
does NCI fit into this? doesn't seem to be much of a relationship... - SMD/MGED may already have some starts on this type of ontology - also need to make sure that we get definitions into this - with respect to tying information to human disease (key to much of our various groups funding), it is key that we make the relationships btw genes/phenotypes and relationships to human diseases - some overlap with phenotype ontologies and Bernard's attempts to describe physiology - can't easily use SnoMed, restrictions on its use Cell type ontologies -------------------- - Martin wants to do a mammalian one - already one for Drosophila Phenotype ontologies -------------------- - Michael Ashburner trying to get together a group for this, Gramene very interested GKB - Elizabeth Nickerson ------------------------- - integrative database for human biology o biological processes o biological pathways o collaboration with GO, EBI - top-down approach, from topics down, rather than gene by gene - ? how to store as a knowledge base, rather than as a database... 
  want to see when the output of one assertion is the input of another
  assertion
- tried lots of grammatical tagging, outputs were often unsatisfactory
- now: input and output tagging, and linking to GO terms
- and still want to link to references for every assertion
- using Protege for structuring the data
- but lack a good interface for authors to input data; currently using an
  Excel spreadsheet for authors so that each sentence is associated with
  metadata in a way that can be imported into Protege

GMOD: Generic Model Organism Database (Mike Cherry)
----------------------------------------------------
- www.gmod.org
- organization headed by Lincoln Stein
- idea is to create modules, small components that can be used
- more robust, shareable, documented software modules
- so that a new database starting up doesn't have to start de novo with
  writing their own software; GMOD proposal submissions close in a week
- what would a new database have to create within their first 6 months in
  order to get up and running
- initially asking everyone to make everything Open Source; for GMOD
  purposes, Open Source is defined by the definition on the SourceForge
  site
- GMOD site is an open repository of tools that are being made available
- 4 older MODs are to be given supplemental funds for GMOD efforts, with
  the understanding that if one group is developing a tool, they enquire
  of the others how they would use this tool, and make it usable for all
  the groups

Data Distribution (Chris Mungall)
---------------------------------
AmiGO
- pie charts
  o Matt's already suggested a slight modification (keeping original pie)
- Graph view (from term pages)
  o some modifications to clarify (GOID #s)
  o call for suggestions for using the network graphs

*** Action Item 1 continued - AmiGO (Brad Marshall) - additions to AmiGO
  - add a SourceForge site for AmiGO bugs/requests
  - gray out obsolete terms (post meeting addition)
  - link from treeview page to graph view
  - search function for the comments
  - don't automatically
    toggle to gene product when the search result comes up null
  - need to make sure that definition references go up with the def, not
    in the general dbxrefs
  - add ability to upload files for multigene search
  - GOST, request for it to accept a seqID
  - want to be able to search with SwissProt accession numbers (this
    requires a gp2protein file for every organism, nothing for TIGR,
    PomBase, etc.)
  - having a way of hiding/deselecting GO terms in BLAST report that you
    don't believe

Chris has some experience with dividing TrEMBL into reliable and
non-reliable; this may be helpful to others in generating gp2protein files

*** Action Item 19 - gp2protein file documentation (Chris??)
  - expand documentation for gp2protein files

Monthly Releases
----------------
- request for synchrony between flat file releases: definitions at the
  same time as ontology files
- ftp site is being updated hourly (15 after the hour)
- Courtland Yockey - Could we use the archives to track our understanding
  of biology
- Courtland Yockey - suggested a month-to-month diff file
  o is interested in this for corporate/pipeline people for being able to
    track and find differences, and that GO could provide this as a
    resource for others
- Courtland Yockey suggesting some sort of monthly summary of major
  changes in a place where it is easy to find for part-time users of GO,
  monthly release notes?
  this could also explain motivation for changes and clarify rationale

*** Action Item 20 - monthly release notes (GO)
  - take a look at doing monthly release notes
*** Action Item 21 - monthly diffs (Courtland Yockey)
  - will investigate DAG-Edit diffs, and communicate with John regarding
    proceeding further on utility of a plug-in for DAG-Edit that could do
    this

Database beta test is over for now
----------------------------------
- hopefully will resume at next meeting
- with the new GOET tool

Planning Ahead
--------------

Upcoming Meetings
-----------------
- Genome Informatics - John, talk about GOET and relation to databases
- ISMB - Michael Ashburner to give plenary
- November meeting (17-20) in Hinxton - MGED meets GO, w/ diseases and
  chemicals
  o ontologies, and tools for building ontologies
  o Suzi soliciting Bernard to submit abstracts to this (Michael
    Ashburner is in a position of power to select items of key interest,
    e.g. Bernard's results, SO by Suzi)
- Judy will do GO for course at Woods Hole in November
- Midori will be doing 3 meetings
  o E-biosci
  o NetTab meeting - agents in bioinformatics (www.nettab.org)
  o Ontologies for Biology (European Science Foundation) - Heidelberg,
    Germany
- FANTOM, part of why David generated rules for a GO-slim
- KDD Cup 2002, this year will have both a FlyBase corpus and also an SGD
  corpus; the attendees are often from corporations and the results are
  rarely available to mere mortals, but may be available to the AZ's of
  the world

Publications:
-------------
- Chris on GO database
- GO publication for Current Protocols - Judy and Midori
  Blake, J.A. and M. Harris (submitted) "The Gene Ontology (GO) Project:
  Structured vocabularies for molecular biology and their application to
  genome and expression analysis" in Current Protocols in Bioinformatics,
  Baxevanis, A., Davison, D., Page, R., Stein, L. and Stormo, G., eds.
  Wiley & Sons, NY
- Matt's in Current Protocols in Parasite Genomics
- Evelyn: 1 to Genome Research (interpro2GO mapping); Bioinformatics
  article on InterPro; SwissProt article to Genome Research
- Midori - 2 requests
  o Briefings in Bioinformatics
    - thought to be neither time nor cost effective to put such an
      article in such a small journal, so opinion against accepting
      either of these at this point in time to avoid saturating the
      market with the same thing again
  o Current Drug Discovery
    - sort out gene nomenclature mess...
    - trade journal for portion of pharma industry

Website stuff
-------------
*** Action Item 22 - update to current GO home page (Karen)
  - make links to Gavin's source
*** Action Item 23 - DAG-Edit user notes (Jane)
  - will post DAG-Edit user notes
*** Action Item 24 - GO FAQ (Rama and Cath)
  - populate FAQ with Q & A's

Amelia's website proposal
-------------------------
- good start on content reorganization
- suggestion to remove link to EP-GO browser
- statistics on hits:
  o 951 to AmiGO
  o 91 to MGI
  o 52 to EP-GO
- conclusion?

Hinxton GO meeting
------------------
- Genome Informatics (GI) meeting 4-8th September
- GO Users meeting September 9th
- GO Consortium meeting September 10-11
- users coming to GI can extend housing an extra night; users not coming
  to GI can refer to page of suggestions for housing, travel
- Consortium members: possible to extend housing for one night via
  registration page, may be possible to extend housing via another
  mechanism; Consortium meeting will be in Cambridge rather than in
  Hinxton
- Suzi strongly encouraging attendance at GI; deadline for abstracts is
  middle of June
- Structure of Users meeting
  o Midori's thoughts: thinking of still having talks, but also having
    poster sessions, panel discussions
  o workshops with John, Chris: software stuff
  o advertise such types of contents

*** Action Item 25 - Hinxton meeting (Michael Ashburner)
  - a: find venue for 10-11 meeting
  - b: get a Manchester person down to talk about DAML+OIL
*** Action Item 26a - Hinxton Users meeting (Midori and Karen)
  - will work out logistics of registration (Consortium members will
    probably also use the registration page)
*** Action Item 26b - Hinxton Users meeting (Midori)
  - add suggestion tick box to reg form for what would you like to see
*** Action Item 26c - Hinxton Users meeting (Midori)
  - mailing to go-friends list asking about desired content/attendance
    for Users meeting

GO meeting after Hinxton
------------------------
- proposal to have it in John's home town in St.
  Croix, Virgin Islands
- hotels not too expensive
- TIGR
- better in the spring, not winter
- late January
- arrive on Friday 24th, meeting 25th-26th, leave on 27th January 2003
  o no Users meeting
  o pending quote from John

*** Action Item 27 - quotes for Virgin Islands meeting proposal (John
    Richter)
  - will get quotes and send to list within the next week

==================================================================
MGI Gene Ontology Progress Report of May, 2002

A GO-slim from a biological perspective (page 1/2)
by David Hill

Cellular Component
 1.) non-structural extracellular: extracellular EXCLUDING extracellular
     matrix
 2.) extracellular matrix: extracellular matrix
 3.) plasma membrane: plasma membrane
 4.) other membranes: (membrane EXCLUDING plasma membrane) OR (membrane
     fraction NOT plasma membrane)
 5.) cytosol: cytosol OR (sarcoplasm EXCLUDING (sarcoplasmic reticulum OR
     junctional membrane complex))
 6.) cytoskeleton: cytoskeleton OR microtubule organizing center OR
     spindle OR muscle fiber OR cilia OR flagellum (sensu Eukarya)
 7.) mitochondrion: mitochondrion
 8.) ER/Golgi: endoplasmic reticulum OR ER-Golgi intermediate compartment
     OR Golgi apparatus OR transport vesicle OR Golgi vesicle
 9.) translational apparatus: eukaryotic 43S pre-initiation complex OR
     eukaryotic 48S initiation complex OR eukaryotic translation
     initiation factor 2B complex OR eukaryotic translation initiation
     factor 4F complex OR nascent polypeptide-associated complex OR
     signal sequence receptor complex OR ribosome
10.) nucleus: nucleus
11.) other cytoplasmic organelle: acidocalcisome OR cytoplasmic exosome
     OR endosome OR glyoxysome OR lysosome OR peroxisome OR vacuole
12.) other cell component: cellular component NOT (1-11)

Molecular Function
 1.) defense/immunity protein: defense/immunity protein
 2.) cytoskeletal protein: cytoskeletal regulator OR motor OR structural
     constituent of cytoskeleton OR structural constituent of eye lens OR
     structural constituent of muscle OR cytoskeletal binding protein
 3.) transcription regulator: transcription regulator
 4.) cell adhesion molecule: cell adhesion molecule
 5.) ligand binding or carrier: ligand binding or carrier
 6.) ligand: ligand
 7.) receptor: receptor
 8.) other signal transduction molecule: signal transducer EXCLUDING
     (ligand OR receptor)
 9.) enzyme: enzyme
10.) transporter: transporter
11.) enzyme regulator: enzyme regulator
12.) other molecular function: NOT (1-11)

Biological Process
 1.) cell adhesion: cell adhesion
 2.) cell-cell signaling: cell-cell signaling
 3.) cell cycle and proliferation: cell cycle OR cell proliferation
 4.) death: death
 5.) cell organization and biogenesis: cell organization and biogenesis
 6.) protein metabolism: protein metabolism
 7.) DNA metabolism: DNA metabolism
 8.) RNA metabolism: RNA metabolism OR transcription
 9.) other metabolic processes: metabolism EXCLUDING (DNA metabolism OR
     RNA metabolism)
10.) stress response: stress response
11.) transport: transport
12.) developmental processes: developmental processes
13.) signal transduction: signal transduction
14.) other biological processes: NOT (1-12)

===================================================================
Gene Ontology Consortium Meeting
Lucy Cavendish College, Cambridge, UK
September 10-11, 2002

Contents
  Participant list
  Progress Reports
  Action Items from last meeting
  Ontology Representation: Chris Wroe
  GOAL: Bernard de Bono
  Database & Software
  Content Issues
  Annotation Issues
  Documentation
  Other items
  Appendix 1: Collected action items from this meeting
  Appendix 2: Notes on C. Wroe and B. de Bono presentations
    A. Chris Wroe, DAML+OIL
    B. Bernard de Bono, GOAL
  Appendix 3: Handouts accompanying progress reports
    A. FlyBase
    B. GOA at EBI
    C. MGI
    D. SGD
    E. TIGR Eukaryotes
    F.
TIGR Microbes Appendix 4: Action items from CSH May 2002 Participants: Michael Ashburner FlyBase Cambridge, UK Rama Balakrishnan SGD Stanford, CA Daniel Barrell EBI Hinxton, UK Tanya Berardini TAIR Carnegie Inst., Stanford, CA Matt Berriman PSU(Sanger) Hinxton, UK Judith Blake MGI Bar Harbor, ME Cath Brooksbank EBI Hinxton, UK Evelyn Camon EBI Hinxton, UK Mike Cherry SGD Stanford, CA Rex Chisholm DictyBase Northwestern Univ., Chicago, IL Karen Christie SGD Stanford, CA Bernard de Bono MRC-LMB Cambridge, UK Becky Foulger FlyBase Cambridge, UK Linda Hannick TIGR Rockville, MD Midori Harris EBI Hinxton, UK David Hill MGI Bar Harbor, ME Eurie Hong SGD Stanford, CA Amelia Ireland EBI Hinxton, UK Suzanna Lewis BDGP Berkeley, CA Jane Lomax EBI Hinxton, UK Brad Marshall BDGP Berkeley, CA Lisa Matthews Incyte Genomics Beverly, MA Suparna Mundodi TAIR Carnegie Inst., Stanford, CA Chris Mungall BDGP Berkeley, CA Sue Rhee TAIR Carnegie Inst., Stanford, CA John Richter BDGP Berkeley, CA Erich Schwarz WB Caltech, CA Valerie Wood PomBase(Sanger) Hinxton, UK Han Xie Compugen Jamesbrook, NJ Visiting on Tuesday, Sept. 10, 2002: Robert Stevens University of Manchester Chris Wroe University of Manchester Progress Reports GO Curators at EBI - Jane to visit NLM (more below) - 60% of terms now defined - MIPS Funcat <--> GO mapping posted (go/external2go/mips2go) - other aspects of progress touched on in review of action items from CSH FlyBase - see handout; highlights: - recuration for release 3 of Drosophila sequence (gaps filled; new genes) - Eleanor Whitfield (SP) cross-checks FB & SP annotations for redundancy ** action item 1: FB to use PubMed IDs instead of [or in addition to?] 
FBrf IDs SGD - see handout; highlights: - new GO Term Mapper and GO Term Finder tools - GO Tutorial (some parts generic GO, others SGD-specific) - at least one annotation for every gene known to encode a product MGI - see handout; highlights: - areas where GO annotation is focused - cross-product manuscript accepted (Genome Research) - work on cellular and developmental processes TAIR - replacing IEA with literature-based annotations - nifty cell viewer - organize annotation effort by cellular component (using cell viewer) or by pathway - Pubsearch tool helps with literature mining (from Suparna's Users Meeting talk) WormBase - developmental stage ontology to be released soon (waiting for some data on aging to be made public) - anatomy ontology also in the works; has about 5900 terms! - working on tool, way to handle GO annotations in ACeDB - will update RNAi --> GO term mapping (used for some WB IEA annotations) DictyBase - NIH funding started August 1 - SGD tables loaded with Dicty data - manual curation getting started PSU - malaria genome manually annotated to GO (lots of ISS updated to IDA, especially for cellular component) - annotations will be released when genome paper is published - now working on T. brucei - life cycle ontology in progress (Matt & John Richter will try to speed up DAG-Edit -- was very slow because of many many relationships) - for S. 
pombe: Data now in GeneDB (replaces PomBase) EBI "GOA" - see handout; highlights: - annotation file releases since last meeting: - 5 gene_association.goa_human releases - 3 gene_association.goa_sptr releases - want GO annotations associated with EMBL-Bank records by end of 2002 - manuscript submitted to Genome Research - possibility of SIB-based SP curators using GO to be explored (SP/EMBL retreat coming up late Sept) - UniProt Consortium [SP (EBI and SIB) + PIR] grant funded; will allow more manual assignment of GO terms to TrEMBL entries TIGR - two handouts: 1 on eukaryotes, 1 on microbes; highlights: - sharing Arabidopsis annotations with TAIR - Manatee tool: interface for editing GO terms and evidence - have RefSeq gi number --> GO ID; GO group recommends using protein id instead (gi's not shared by 3 collaborating nucleotide sequence dbs) - 7 microbial genomes annotated to GO; Vibrio cholerae on GO site; others awaiting genome completion and/or publication - GO terms displayed on CMR ** action item 2: TIGR to provide protein id --> GO ID ** action item 3: TIGR to send IEA annotations to GO for genomes not sequenced at TIGR Compugen - GO annotations updated (August 2002) Incyte - academic subscriptions to *PD databases - Lisa seeking to offer financial support for GO meetings Action Items from CSH meeting (May 2002) (also see complete list in appendix 4) 1. Many AmiGO items -- see software section 2. Check over GO.xrf_abbs file -- essentially done, except for incremental updates 3. GO content: modification vs. biosynthesis -- done 4. GO content: evaluate sensu terms -- done; essentially all will be kept; more "sensu" terms will be added, as will more generic terms as parents for "sensu" terms 5. GO syntax: use of 'and' and 'and/or'; 'or' in gene associations? -- Jane is working on removing most terms with "and"; nothing done with gene associations yet 6. 
Expansion/clarification of GO documentation -- not done yet, but Cath presented a plan of action that sounded good ** action item 4: Cath will update documentation and circulate drafts 7. Ontology integrity checking -- in progress; one thing done so far is that Amelia has a script that checks for several errors; John has set up SourceForge tracker for suggesting checks 8. Submit GO-slim scripts/rules -- ongoing 9. GO-slim naming conventions -- was done even before it became an action item 10. DAG-Edit/GOET automatic recognition of ID prefix -- not discussed; probably not done yet 11. division of 'part-of' into multiple relationship types -- not even started 12. GO dictionary -- done, with procedure in place for incremental updates (didn't touch on whether it'll be implemented in DAG-Edit or GOET, or, if so, when) 13. Cross-product tool -- nothing beyond current DAG-Edit yet (so cross-products can be done but not as easily as we'd like) 14. New documentation for making cross products in DAG-Edit -- not done yet 15. comment field: obsoletes & syntax (GO) -- done 16. concurrent assignments: QuickGO, database -- documentation on QuickGO at http://golgi.ebi.ac.uk/ego/manual.html and http://golgi.ebi.ac.uk/ego/index_internal.html; Evelyn will try to track down more; nothing on GO database side yet ** action item 5: Evelyn to continue tracking down info on QuickGO concurrent assignments ** action item 6: consortium, especially Chris M, to revisit concurrent annotations in GO database 17. clustering sequences annotated with GO; tool -- nothing yet 18. Short descriptions of IEA/ISS methods -- in progress 19. gp2protein file documentation -- Amelia did a small amount a while back; no word on updating or expanding it 20. monthly release notes -- in progress; see Documentation section 21. monthly diffs -- in progress; see Documentation section 22. update to current GO home page make links to Gavin's source -- not done yet 23. post DAG-Edit user notes -- done 24. 
GO FAQ -- Cath and Rama will work on FAQ (not much done yet) (other action items were related to organizing meetings) Ontology Structure & Representation: guest presentation by Chris Wroe, with input from Robert Stevens I'm not going to try to reproduce Chris' talk (!) but here are some highlights: - ontologies for biology (such as GO) are best done by biologists for biologists - description logic systems such as DAML+OIL provide a mechanism for building and maintaining ontologies (easier to maintain consistency and completeness with "hand-crafted" ontologies) - examples from GONG: finding inconsistencies and missing relationships that would be really hard to find manually - used MeSH chemical terms - missing 'isa' relationships added - some 'isa' relationships made more specific - errors corrected (e.g. a 'catabolism' term under a 'biosynthesis' parent) - DAML+OIL can be used at any point along spectrum -- don't have to have formal structures already in place to convert - OilEd tool now available; previously tools for use with DAML+OIL underdeveloped - definitions (in the DAML+OIL sense): formal definitions for concepts are easy to create for some (e.g. metabolism) terms but much more complicated for others (e.g. enzymes) - conversion to DAML+OIL will mean a large increase in source code; difficult or impossible to do DAML+OIL diff; Michel Klein developing "virtual cvs" - case study (how apt!): medical vocabularies and the "exploding bicycle" -- highlights need for constraints on what can be combined in cross-products Linda Hannick took good notes on this talk, so I've included them as Appendix 2A. GOAL (GO Annotation Language) update: presentation from Bernard Once again I'm not going to reproduce the whole presentation. 
Highlights:
- concept of "structure," in this context referring to any physical
  entity, such as a gene product or a cellular component
- structure provides activity
- activity changes structure
- word count on GO terms:
  - most frequently used words are connectors ('of', 'and', 'sensu', etc.)
  - 50% occurred only once; of these 65% are "structure" words
- activity = change in structure over time, or

    A = (delta S)/(delta T)

- defining structure (S) and measuring S and time (T) provides information
  on the activity (A)

    A(r) ---> A(p)
        activity
    S(1) ---> S(2)

  where A(r) and S(1) are starting activity and structure, respectively,
  and A(p) is activity provided by new structure S(2)
- concept of "housing structure" S(H) -- the nearest common parent of S(1)
  and S(2), not affected by the activity that converts S(1) to S(2);
  relevant to measuring time (T) -- relative, not absolute, time is what's
  important

    A = [S(1)S(2)]/S(H)

  can compare different activities using function S[S(1)S(2)]

    A = S[S(1)S(2)]/[TS(H)]

- collaborating with Rex Chisholm to try this for Dicty; update later

Linda's notes are included in Appendix 2B.

Database & Software Issues

DAG-Edit & GOET
- John hopes not to do any more development on DAG-Edit. There are a few
  bug fixes outstanding, but he won't add new features. John would like
  someone else (a Java programmer) to take over DAG-Edit maintenance; Sue
  offered to ask Danny Yoo to do it.
- John will add an integrity check to the flat file helper to check for
  deletion of terms that were present in the files loaded.
** action item 7: add check for term deletion to flat file helper
** action item 8: Sue will ask Danny to take over DAG-Edit maintenance
** action item 9: Amelia will collect bug reports and feature requests
   from curators. If John can't act on feature suggestions, perhaps Danny
   can.
- John is developing GOET in the context of image annotation for
  Drosophila.
This takes his time away from GO in the short run, but he will be working on the infrastructure of GOET, which will eventually benefit GO. AmiGO Brad has made progress on most of the AmiGO-related action items from last time: a) make display of NOT data possible/correct in AmiGO (e.g. FBP26 for SGD; FlyBase, others have more) -- DONE b) metareference for curator refs for AmiGO (BDGP and/or GO): create a metareference for linking for curator refs for definitions for AmiGO (e.g. GO:mah, SGD:krc, etc) Not done yet because there was no way to distinguish a definition dbxref from any other dbxref (also relevant to item l), nor was there any way to tell a reference to a person apart from a reference to a database entry. We'll introduce a prefix to be used for references to curators (GOC:) and Brad will generate web pages to be used as the metareferences. ** action item 10: change prefixes to "GOC:" for definition references that represent an individual curator or group of curators ** action item 11: Brad will create a form where curators can enter info (e.g. name, affiliation, dbxref entered in definition reference field), and create and link a web page for each GOC:xyz entry c) linkouts in AmiGO to sequence in cases such as ISS with ________) -- DONE d) Incorporate GO-Slim scripts into AmiGO -- not done yet e) display comment field in AmiGO -- This requires comments to be stored in the GO database, and will be done as soon as they are. 
** action item 12: Chris to get comments into the database f) show concurrent assignments in AmiGO -- another one in the pipeline, pending addition to GO database g) add a SourceForge site for AmiGO bugs/requests -- DONE h) gray out obsolete terms (post meeting addition) -- DONE i) link from treeview page to graph view -- DONE j) search function for the comments -- again, depends on having comments in database k) don't automatically toggle to gene product when the search result comes up null -- DONE l) need to make sure that definition references go up with the def, not in the general dbxrefs -- can be done once definition references are distinguished from other dbxrefs in the database m) add ability to upload files for multigene search -- DONE n) GOST, request for it to accept a seqID -- programming done; will be "live" once new Linux cluster is installed (probably is by now) o) want to be able to search with SwissProt accession numbers (this requires a gp2protein file for every organism, nothing for TIGR, PomBase, etc.) -- notes aren't quite clear; doesn't seem to be done yet p) having a way of hiding/deselecting GO terms in BLAST report that you don't believe -- hard, and not done, but one can now choose a cutoff score GO-Slim Issues (overlap between software & annotation): Many users have asked the model organism DBs (especially SGD) to provide files with gene symbols and GO (or GO-Slim) terms that have been assigned to the gene product. After some discussion we decided to do so, and to include both annotations to the "unknown" terms and genes that have not yet been annotated (the latter will be listed as "unexamined"). Mike Cherry also suggested a table showing each GO term and a list of gene products annotated to it (originally suggested to Mike by Fritz Roth). No decision on this one. On a related note, Chris has devised, and Matt has tried, a clunky method for generating pie charts using a GO-Slim of one's choosing. 
The clunky bit is that associations between gene/gene product IDs and GO-Slim terms have to be reloaded into a new database. ** action item 13: Add a link to the GO-Slim directory to the home page. ** action item 14: DBs to send GO-Slims and lists of all genes to BDGP. ** action item 15: BDGP to generate tables of gene ID <--> GO-Slim term for each DB that submits a gene list and a GO-Slim. Genes lacking annotations will get "unexamined"; annotations to "unknown" will be preserved. ** action item 16: Add hyperlinks to the gp2protein files: link from web page and from each gene_association file. Content Issues How to coordinate work of several curators, geographically dispersed and having backgrounds in different areas of biology, and maintain the consensus-building approach that has worked so well for us? We agreed that dividing up work based on areas of interest/expertise is a good way to go. To facilitate it, we'll need to keep track of who's working on what. We'll set up "interest groups" for any areas within the ontologies that are likely to require extensive additions or revisions, or to have proposed changes crop up frequently. Curators can join or leave groups as they please. Proposed changes relevant to an interest group should be handled (or at least seen) by that group. Can we come up with a way to tell whether a given area within an ontology has been extensively reviewed? There was an unfortunate incident recently where a change was made to a bit of the newly revamped 'development' portion of the process ontology. We'd like to avoid this sort of fumble in the future, but it's impossible to tell just by looking at the ontology which bits have been reviewed thoroughly and which parts still look much as they did two or three years ago. There's a lot of information socked away in CVS log files, the email archive, and meeting notes, but it would be much more convenient for curators if the excavation of ontology content history could be streamlined. 
In the long run, it should be possible to flag terms as "reviewed" in the database, but there's no simple solution for the flat files. We'll just have to keep records as well as or better than in the past, and spend time and effort to keep each other informed. It's not hopeless, though; there are a couple of things we can do to facilitate communication and record-keeping.

To help with record-keeping, all ontology content changes will be put in the SourceForge curator request tracker from now on. Jane and Midori can add any GO curator to the list of possible assignees; every member database whose curators have GO CVS write access should have at least one curator on the SourceForge list. Note that putting things in the SourceForge tracker is not mutually exclusive with sending messages to the GO list. Any item that obviously is, or might be, involved or controversial should still go to the list. Err on the side of sending more things to the list if it's not clear.

To keep everyone informed, we'll run a script that extracts the summary lines from new SourceForge entries and emails the resulting list to the GO mailing list. (In theory anyone can join the mailing list specifically for the SourceForge tracker, but few will want to, because the volume of email is huge and most of it is administrative dross.) Anyone can then follow the discussion of any item that looks interesting (try the SourceForge "monitor" option -- it's cool!), and anyone can choose to take the discussion onto the GO mailing list.

We decided that there is no need for a "GO curators" mailing list: "interest groups" are likely to change over time, and anything relevant to more than the interest group should go to the main GO mailing list anyway.

** action item 17: Set up "interest groups" based on subject matter; maintain a list of groups and who's in them (on SourceForge if possible -- look into this).
** action item 18: All content changes, no matter how small, should go into the SourceForge tracker for archiving purposes. Summary entries should be nice and informative. ** action item 19: Set up script to email summaries from new (open) SourceForge tracker entries. On the specter of excessive granularity (a long involved discussion indeed): We reaffirmed that gene products should not appear as concepts (i.e. as ontology terms). But under some circumstances it is acceptable to mention gene products within ontology terms. The issue to be resolved is how fine-grained we should be in children of "protein biosynthesis," "protein binding," and some others. Many of the children of "protein binding" and of "protein biosynthesis" mention specific individual proteins; see the MGI handout for a list of terms that have come into question. There is an additional concern with protein biosynthesis terms: many of the too-specific ones added recently are actually intended to capture the results of experiments that measure levels of specific proteins, but do not distinguish effects on translation (the restricted definition of "protein biosynthesis," which is what we use in GO, and have implicitly decided to keep using) from effects on other steps in the overall process of making a protein (e.g. transcription, modification). We thought that adding terms for binding to (or biosynthesis of) any specific protein was reasonably consistent with the logic we apply when considering new terms, but we questioned the utility of having many many very specific terms. We agreed that we would keep or add terms that represent different mechanisms, such as "covalent protein binding" and "non-covalent protein binding" (hypothetical examples) or "viral protein biosynthesis." Michael came up with a two-part test; we can keep/add a "protein X biosynthesis" term if both criteria are met: 1. There is something specific about the biosynthesis of protein X, i.e. 
there are gene products involved in X biosynthesis but not general protein biosynthesis. 2. The proposed term is not redundant with any other process term. For example, we will make "glycoprotein biosynthesis" obsolete because it is redundant with "protein glycosylation." The same test can be applied to binding, transport, etc. But how to avoid losing information? Curators often want to capture what is known, as when an experiment detects binding to a particular protein substrate or altered levels of a specific gene product. The coffee break "Round Table" discussion led to a proposal: eventually make children of "protein binding" obsolete, and instead use annotation to indicate which protein is bound by the gene product of interest. The annotation would use the generic "protein binding" GO term, and a new column in the gene_association file where we can store an ID for the protein that is bound. Inevitably, though, there's a catch: the world is not yet ready for us to implement this in all situations. If the gene product being annotated binds a class of proteins -- the example was actin -- rather than a single protein, we're SOL for the present. In time there will be UniProt IDs representing protein families, but that could take months or even a year or two. There was some discussion of what to do in the meantime; the conclusion was to apply a couple more tests to identify terms that we should keep for now but make obsolete later. First, check over annotations that use the term; second, check whether the term has any children. Annotations will help us figure out whether the term meets the first criterion of the two-part test. A term that has children is most likely a useful grouping term. The same considerations, and possible future solution, apply to "protein X biosynthesis." 
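The triage discussed above can be sketched as a small decision function. This is only one reading of the minutes, not an agreed procedure: the `TermReview` fields, the `triage` function and the three outcome labels are hypothetical names introduced for illustration, the flags would have to be filled in by a curator after looking at annotations and child terms, and the rule that a redundant term is dropped even if it is a grouping term is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TermReview:
    """A curator's answers for one candidate term (all field names are hypothetical)."""
    name: str
    has_specific_gene_products: bool  # criterion 1 of the two-part test
    redundant_with: Optional[str]     # criterion 2: an overlapping term, or None
    names_protein_class: bool         # refers to a class/family (e.g. actin), not one protein
    has_children: bool                # a term with children is probably a grouping term


def triage(term: TermReview) -> str:
    """Return 'keep', 'keep for now', or 'obsolete' for a reviewed term."""
    if term.has_specific_gene_products and term.redundant_with is None:
        return "keep"  # passes both parts of the two-part test
    if term.redundant_with is not None:
        return "obsolete"  # assumption: redundancy outweighs the interim rules
    # Interim rule: class-level or grouping terms stay until IDs for
    # protein families are available for use in annotation.
    if term.names_protein_class or term.has_children:
        return "keep for now"
    return "obsolete"


if __name__ == "__main__":
    # Flag values below are illustrative, not the outcome of a real review.
    examples = [
        TermReview("glycoprotein biosynthesis", False, "protein glycosylation", False, False),
        TermReview("actin binding", False, None, True, True),
    ]
    for review in examples:
        print(review.name, "->", triage(review))
```

A function like this would only record the outcome of the review; deciding whether criterion 1 holds still requires reading the annotations that use the term.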
To address the issue of experiments that detect changes in levels of a particular protein, we have decided to consider adding terms for "gene expression" and regulation of same, but further discussion is required before we add them (I suspect that counter-arguments will be raised). If they are added, the new gene_association column could be used with them in the same way as proposed for protein binding. ** action item 20: Test all "protein biosynthesis" and "protein binding" terms. Apply the two-part test to all, and (for protein family or class ones) look at annotations and child terms. Circulate the list slated for obsolescence. Note: we are not going to make all "protein binding" terms obsolete yet. It would be good to determine which terms would pass the tests, though. ** action item 21: Circulate a proposal for incorporating "gene expression" and "regulation of gene expression" terms and definitions. ** action item 22: Discuss this again at the next meeting! "Cellular process" to distinguish from multicellular processes was generally well received. Examples where the distinction would be useful are cellular morphogenesis vs. organ or body morphogenesis, cellular respiration vs. breathing, etc. It will take some work to define "cellular process." ** action item 23: Propose definition for "cellular process" and discuss on mailing list. ** action item 24: Each model organism DB should review terms under "embryogenesis" and "morphogenesis" to check for correct parentage; also figure out which ones will go under "cellular process." "Cell surface" and related terms: these were added recently by TAIR curators, to capture information from experiments in plants that can narrow down localization to plasma membrane or cell wall but can't distinguish between the two (that's what's meant by "cell surface" in plant literature). The definitions and placement of the cell surface terms were discussed, and changes recommended. 
We also discussed other cellular component terms in the area of external or surface structures such as cell walls. The fairly generic term "external protective structure" will be changed because "protective" sounds too much like a process; we came up with "encapsulating." The revised term, "external encapsulating structure," will become a child of extracellular. The definition should mention that the structure lies outside the plasma membrane and surrounds the entire cell. We should also review the cell wall terms to make sure they're placed correctly -- apparently the plant cell wall term should be under extracellular. One thing that came up is that there are no cellular component terms that really reflect boundaries (as opposed to physical parts) such as that between inside and outside the cell. It will be interesting to look into boundary terms, considering how they might be defined and where they might fit relative to existing terms. ** action item 25: TAIR curators to improve definitions of "cell surface" and its children. ** action item 26: Change wording of GO:0030312 to "external encapsulating structure." Circulate new definition; make sure Michelle Gwinn has a chance to comment. ** action item 27: Review all "cell wall" terms to check parentage. Plant cell wall does need to be moved. ** action item 28: Start thinking about terms (and definitions, of course) to capture concept of boundary. Transport terms: Dianna Fisk (SGD) is collaborating with Can Tran, who works on TC. Function terms will thereby be kept consistent with what's in TC. Most transport process terms should be OK, but as always any problems should be noted and sent to the list. Transport terms that mention specific proteins should be put to the same test as binding and biosynthesis terms (see above), although we expect that the results will prompt us to keep more of the transport terms. 
Susceptibility/resistance: We decided to make all terms that say "X susceptibility/resistance" obsolete because they really represent traits. The biological processes that we were trying to represent can all be covered by "response to X" terms (many of which already exist; others can be added). IDs: should we encode F/P/C in the GOID? Although some users have asked for this (for convenience), the overwhelming consensus was that we will not add anything to current GOIDs to show whether the term is molecular function, biological process, or cellular component. We will eventually be in a position to build links between what are now the three separate ontologies, so it's better to use a single ID space for them. Annotation Issues We receive frequent requests for GO terms/IDs to be associated with UniGene IDs. One way it can be done is via a UniGene <--> LocusLink file available from NCBI. ** action item 29: Create UniGene <--> GO file (Daniel) Issue raised by TIGR (Linda Hannick): how to represent annotations made using multiple BLAST hits or similarity to a domain or family (rather than similarity to one other gene product) The problem: they feel that they're losing information about the annotation/curation procedure by putting only one accession number in the "with" column. For many of these comparisons, several sequences have to be included, and the similarities among them taken together, to get a believable conclusion about the annotations for the gene product of interest. Furthermore, many of these curated sequence sets are not yet published. Discussion centered mainly on whether the situation was best covered by using ISS or IC as the evidence code. The eventual decision was to continue to use ISS. 
Some key points that came up in the discussion (documented for posterity): - The argument in favor of IC was that considerable curator judgment is involved in making the determinations, which makes the procedure different from simply running BLAST and looking at the best hit. There was concern about "polluting" ISS by including cases where similarity is to a family rather than to a single gene product. - The counter-argument was two-fold. One point is that multiple sequence alignments are nevertheless still analyzing and comparing protein (or nucleic acid) sequences, and most curators have been mentally including these analyses under "ISS" all along, viewing them as consistent with the currently defined scope of ISS. - The second point was that IC is used in a well-defined set of circumstances, for a well-defined purpose. It would "pollute," or at least confuse, the scope of IC to use it for annotations that are based on sequence similarity; also, one could follow similar logic to broaden IC to include all curator evaluation of experimental results. We decided not to relax the current definition and scope of IC. Conclusion: allow >1 entry in "with" column for ISS Curators then enter any accession numbers available, and include an ID that allows a link to a page describing the entire set of sequences used. ** action item 30: add to documentation of "with" column use -- allow cardinality 0, 1, >1 for all evidence codes that use "with" at all; explain situations where cardinality 0 is allowed ** action item 31: annotations that use ISS, IPI, or IGI but have a blank "with" column should link to the annotation documentation (let people see the possible reasons why nothing's entered) Pseudogenes and other "doubtful" genes: If a gene is known to encode an RNA or protein product, there's no doubt that the product(s) can be annotated with GO terms (or the gene can be annotated in lieu of direct gene product annotation if necessary). 
Genes that look as though they encode a product (e.g. open reading frames with no stops) but haven't been individually studied tend to be annotated. If something is unmistakably a pseudogene -- lots of frameshifts, etc -- it's not annotated. But what about other cases that fall between the "obviously OK to annotate" and "obviously pseudogene" ends of the spectrum? From Michelle Gwinn: We have a class of genes which according to our sequence data have either a single frameshift or a single stop codon in their coding sequence. However, they also have screaming good hits to other characterized proteins and to HMMs that span the problem in the ORF. We reflect the presence of the defect with an addition to the common names of the proteins. The concern is that a single frameshift or stop may be read through, or could even reflect a sequencing error. To avoid losing information, we've decided that the best way to handle these cases is to use SO annotation to document the frameshift/stop/whatever anomaly, and GO annotations to capture what the product is thought to do if it is indeed expressed. Shared annotations: For some organisms, gene products are annotated by more than one group (e.g. MGI and SWISS-PROT do mouse; TIGR and TAIR do Arabidopsis). We must avoid circular annotations, especially those based on sequence similarity (ISS). Most (all?) of the groups that inherit annotations from another source tag them in the gene_association file some way. For example, MGI has a special reference used for annotations inherited from SWISS-PROT. This was regarded as a good way to handle shared annotations; any group that doesn't do something of the sort already should adopt the practice. ** action item 32: Each group that shares annotations should tag the ones that come from the other group(s). ** action item 33: Document this decision, and how to implement it. 
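The tagging convention for shared annotations can be illustrated with a short sketch. It assumes, purely for illustration, that each annotation is a dictionary with `gene`, `go_id`, `evidence` and `reference` keys and that inherited annotations carry a known reference tag. The tag `EXAMPLE_REF:inherited`, the placeholder IDs and the function name are hypothetical and are not taken from any group's actual files.

```python
def split_by_origin(annotations, inherited_refs):
    """Separate a group's own annotations from those inherited from another source.

    annotations    -- list of dicts, each with a "reference" key (assumed layout)
    inherited_refs -- set of reference tags that mark inherited annotations
    """
    own, inherited = [], []
    for ann in annotations:
        if ann["reference"] in inherited_refs:
            inherited.append(ann)
        else:
            own.append(ann)
    return own, inherited


# Placeholder rows, not real annotations.
annotations = [
    {"gene": "geneA", "go_id": "GO:0000001", "evidence": "IDA",
     "reference": "PMID:0000001"},
    {"gene": "geneB", "go_id": "GO:0000002", "evidence": "ISS",
     "reference": "EXAMPLE_REF:inherited"},
]

own, inherited = split_by_origin(annotations, {"EXAMPLE_REF:inherited"})
print(len(own), len(inherited))  # prints "1 1"
```

Only the `own` list would then be offered to another group as a basis for ISS annotations, which is one way to avoid the circularity described above.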
Documentation Issues Monthly logs: Amelia has been working on a script to detect differences between one version of GO (ontologies + definitions); she showed sample output that was very well received. There is still a bit of work to do to get it to prime-time quality, but it is in very good shape. We will run the script every month, when the flat files are archived and database releases made. In addition to running it regularly, we'll include it in the software repository on SourceForge, so that anyone can run it to compare any two versions of GO. ** action item 34: Amelia will continue polishing The Script. When it's ready for prime time, it will go in the software repository, and will be run every month to generate a log to accompany the flat file archives and database releases. Decide where to put the output. FAQ: Chris will help Cath and Rama set up a FAQ-o-matic page; thereafter, anyone can enter question and answers. Cath and Rama will do a bunch to get things started and make sure the FAQ covers questions that we already know crop up frequently. ** action item 35: set up new faq-o-matic page (Cath & Rama, with a bit of help from Chris); everyone to add faq's and answers, though Cath & Rama will probably do the most, at least at first. ** action item 36: EBI GO curators circulate a set of instructions for using CVS. Other Items of Interest GO <-> UMLS: Jane will visit NLM for about a month starting Sept. 15. She will learn all about UMLS, and help them incorporate GO into the "Metathesaurus." That is, GO will become one of the ontologies indexed in the metathesaurus. Jane and some NLM people have already done a test integration. MeSH terms will be reviewed and new ones added in light of indexing GO in UMLS. Jane will report on this work at the next meeting. Funding: For the NIH grant, there's a progress report due soon (December 1?). Judy will coordinate, and email anyone who should contribute material. 
We will apply for five years when we renew; the renewal is due March 1, 2003. Judy will also coordinate this. There will be four aims:
 1. Develop and support ontologies for molecular biology.
 2. Annotation using ontologies for informatics systems of consortium members; this will include support for meetings.
 3. Provide informatics resource; covers database instantiations, data repository and means of access, and software tools.
 4. Outreach: support for ways to provide training for new groups starting to use GO, perhaps by having them visit a "GO site". A "visiting scientist" sort of thing could also be a good way for GO curators to take advantage of domain experts' knowledge. Meeting support might also fall under this aim.
Aims 1, 2, and 3 are essentially the same as in the original grant, with the scope of Aim 1 expanded a bit. Aim 4 is modified from the original aim to have other database groups join the consortium. We would also like to support an effort to annotate bacterial genomes (i.e. those not already done or in the works at TIGR) using GO. E. coli and B. subtilis are the most obvious ones; genomes sequenced at Sanger would also be good.
** action item 37: Progress report for current grant.
** action item 38: Prepare renewal grant application.

GOBO: Covered in Michael's talk at the Users meeting. We have a supplement to the NIH grant to fund work on SO; Suzi will hire two people, one more biology-oriented, the other more techy, for a year.

SOFG: conference coming up in November.

Web pages: We'll keep the current appearance for the time being, but that shouldn't stop us from improving the organization. The home page can be split into a few shorter pages, based on the work Amelia did earlier.
** action item 39: Prepare a site with mock-ups of GO web pages derived by splitting up the current home page sensibly.

Next Meetings: The next Consortium meeting will be January 25-26, 2003 in St. Croix. Plan to arrive on Jan. 24 and leave on Jan. 27.
John will make a group reservation; when we get the email about it, we must act promptly because rooms will go fast. There won't be a Users meeting. After that, the next meeting will be hosted by TIGR in June 2003, with a Users meeting. Linda will check on available dates; our first choice is June 2-4 (users on Monday June 2, consortium Tues-Wed June 3-4). Alternate dates are June 18-20.

Appendix 1: Collected Action Items (numbered in the order they appear in the main document)
1. FB to use PubMed IDs instead of [or in addition to?] FBrf IDs.
2. TIGR to provide protein id --> GO ID.
3. TIGR to send IEA annotations to GO for genomes not sequenced at TIGR.
4. Cath will update documentation and circulate drafts.
5. Evelyn to continue tracking down info on QuickGO concurrent assignments.
6. Consortium, especially Chris M, to revisit concurrent annotations in GO database.
7. Add check for term deletion to flat file helper.
8. Sue will ask Danny to take over DAG-Edit maintenance.
9. Amelia will collect bug reports and feature requests for DAG-Edit from curators. If John can't act on feature suggestions, perhaps Danny can.
10. Change prefixes to "GOC:" for definition references that represent an individual curator or group of curators.
11. Brad will create a form where curators can enter info (e.g. name, affiliation, dbxref entered in definition reference field), and create and link a web page for each GOC:xyz entry.
12. Chris to get comments into the database.
13. Add a link to the GO-Slim directory to the home page.
14. DBs to send GO-Slims and lists of all genes to BDGP.
15. BDGP to generate tables of gene ID <--> GO-Slim term for each DB that submits a gene list and a GO-Slim. Genes lacking annotations will get "unexamined"; annotations to "unknown" will be preserved.
16. Add hyperlinks to the gp2protein files: link from web page and from each gene_association file.
17.
Set up "interest groups" based on subject matter; maintain a list of groups and who's in them (on SourceForge if possible -- look into this).
18. All content changes, no matter how small, should go into the SourceForge tracker for archiving purposes. Summary entries should be nice and informative.
19. Set up script to email summaries from new (open) SourceForge tracker entries.
20. Test all "protein biosynthesis" and "protein binding" terms. Apply the two-part test to all, and (for protein family or class ones) look at annotations and child terms. Circulate the list slated for obsolescence. Note: we are not going to make all "protein binding" terms obsolete yet. It would be good to determine which terms would pass the tests, though.
21. Circulate a proposal for incorporating "gene expression" and "regulation of gene expression" terms and definitions.
22. Discuss this [protein binding etc.] again at the next meeting!
23. Propose definition for "cellular process" and discuss on mailing list.
24. Each model organism DB should review terms under "embryogenesis" and "morphogenesis" to check for correct parentage; also figure out which ones will go under "cellular process."
25. TAIR curators to improve definitions of "cell surface" and its children.
26. Change wording of GO:0030312 to "external encapsulating structure." Circulate new definition; make sure Michelle Gwinn has a chance to comment.
27. Review all "cell wall" terms to check parentage. Plant cell wall does need to be moved.
28. Start thinking about terms (and definitions, of course) to capture concept of boundary.
29. Create UniGene <--> GO file (Daniel)
30. Add to documentation of "with" column use -- allow cardinality 0, 1, >1 for all evidence codes that use "with" at all; explain situations where cardinality 0 is allowed.
31. Annotations that use ISS, IPI, or IGI but have a blank "with" column should link to the annotation documentation (let people see the possible reasons why nothing's entered).
32.
Each group that shares annotations should tag the ones that come from the other group(s).
33. Document this decision [shared annotation], and how to implement it.
34. Amelia will continue polishing The Script. When it's ready for prime time, it will go in the software repository, and will be run every month to generate a log to accompany the flat file archives and database releases. Decide where to put the output.
35. Set up new faq-o-matic page (Cath & Rama, with a bit of help from Chris); everyone to add faq's and answers, though Cath & Rama will probably do the most, at least at first.
36. EBI GO curators circulate a set of instructions for using CVS.
37. Progress report for current grant.
38. Prepare renewal grant application.
39. Prepare a site with mock-ups of GO web pages derived by splitting up the current home page sensibly.

=======================================================================
Appendix 2: Linda Hannick's notes on presentations by Chris Wroe and Bernard de Bono at the GO Consortium meeting, 10 Sept. 2002

Note: the pdf looks better!

A. DAML+OIL
Chris Wroe / Robert Stevens
Experiments in how you can use hand-crafted text --> software-based technology
What we can do, not tutorial... Helen Parkinson

- What does the technology offer?
Process:
 * Electronically generate rather than add manually.
 * Pathway; not all or nothing; some benefit part way too...
 * Simple additions from yesterday
Making relationships to additional parents, etc
Finds biological content error(s), finding relationships that are problematic
Suggests additions, finds the missing relationships that are very hard to find by hand, suggests additional.
Inconsistencies reasoned out (e.g., a case of catabolism under biosynthesis.)

- What software is available?
GONG: what have we done so far?
Developing a stepwise methodology
Incremental migration path adding semantic content to the GO in situ
1. Syntax transformation to DAML+OIL
2. Reasoning over existing content
3.
Adding partial concept descriptions
4. Adding complete " "
5. Concept composition at the point of use
Allow the creation of new ontology terms at the point of use.
--> Isa has to be done manually; P is easy less hard work than doing it all by hand.

Migration path
Definitions/descriptions (carbohydrate metabolism) broken down to a DAML+OIL necessary and sufficient conditions.
Complete definition: biosynthesis of an amino acid
Natural language pulls out the essentials
Natural lang tool what you see is what you meant
Metabolism terms easy; enzyme terms very complex to describe
Absolutely explicit; lots of restrictions onProperty and has-class restrictions
Top-down approach would be easier (dehydrogenase defined before malate dehydrogenase)
Scripts were central
Used as much automation as possible
Many term phrases fit a stereotyped pattern
Metabolism for example
Hard coded
UMLS lexical normalization tools to match up concepts from different ontologies
May also help the parsing task
 * Additional DAML+OIL definitions represent a significant increase in the amount of 'source code'
 * Introduces large numbers of interdependencies (lots of these were missed in the hand-built ontology).
 * Michel Klein - conceptual cvs for DL
Can't just do a diff on DL; need a cvs. Meeting tomorrow. Will e-mail the group re this.
Software * DL datastructure with API * Editor gui OilEd * Ontology server Case study from forerunners in medicine (SNOMED) * Learn from their mistakes; already avoided mistakes of early medical terminologies * Similar to medicine; large, complex concepts o SNOMEDrt relational terminology o 200K-300K concepts at the present time o results 200K concepts dissected over 2-3 yrs by 9 half-time clinicians (double coverage) o ~20M investment o tried to use scripts and tools; propagation of concepts like we are discussing o major early benefit was a more complete taxonomy for accurate retrieval of records o not open source; there is a gathering force behind going open source (Richter, Chris Schut) (global technological project, GTP). o Formal def of terms useful resource in its own right irrespective of DL reasoning * But different; well specified use-annotation Relatively small group of people who are highly skilled Medical record keeping more for the accountants What additional software is necessary? BioOntologies people using DL? Not extensively 2 diff ways to use as a standard, because it works with ontologies with property-based descriptions Open source? OIL is; Java client pops it in (Robert) License for display is _36000 on Solaris. Problems: Scaling The combinatorial explosion Example Burns How expensive Read II grew from 20K to 250K terms in ~100 staff-years, but still too small to be useful But too big to use... (SNOMED 3.5) Beat the explosion by having ~12 separate taxonomies, the elements of which can be combined to form more complex concepts. Didn't work because the sensible options have to be defined in the user interface. o No grammar rules o Possible to make nonsense terms o Impossible to detect equivalent terms, or classify composition Need a reference terminology in the middle. ------------------------------------------------------------------ B. GOAL (Bernard) It is structure that provides activity. It is activity that changes structure. 
Organism bias is a cumulative effect of structure bias that has infiltrated GO.

Word counts in GO
Split into two sets, A and B.
Set A: of, and, etc.
Set B (90% of words used in GO terms): occur 10 times or less in the DAG.
>50% of set B occur only once throughout the DAGs
65% of set B are physical objects in our universe (glutamate)
The three DAGs are alike: most of the words are structures.

A = change in structure (ΔS); change in transition time (ΔT)
Is it possible to have any sort of value for ΔS and ΔT?
Having a handle on Structure will have a profound effect on Activity

Hypothetical structure classification:
  Small mol
  Gene prod
  Complexes
  Cells
  Anatomy
Map activity on graph of above. 2 structures more similar will be closer on graph.
Can impose different organisms on the graph.

GOAL Object Definition
Ar shifts S1 to S2. Ap is new activity.
Navigate the structural graph by activities. Distance along the tree will be significant.
Ar=s(S1,S2)/t(S1,S2)
r is "required"
p is "provided"
Extend to any level of complexity.
SH housing structure: the first part of the node that S1 and S2 have in common.
A = S1, S2 T(SH)

Profile comparisons of ACT objects
Now can compare Activities much in the way the BLOSUM matrix is used.
Compare whole-genome physiologies.
Show on the same structural graph what you mean by an activity.

=======================================================================
Appendix 3: Progress Reports

A. FlyBase Progress Report, Sept 2002. Cambridge Meeting.

1. GO terms added by continued literature curation of primary papers and personal communications by Cambridge FlyBase curators Rachel, Gillian and Chihiro.
2. Kerry Knight is currently assigning GO terms as part of her clean up of free text in FB; referencing to primary papers.
3. All outstanding SWISS-PROT records (~1000) that were attached to FlyBase genes have now been analyzed and GO terms added based on the summary comments. GO terms are referenced directly to the SWISS-PROT record.
In addition, Eleanor Whitfield at SWISS-PROT is assigning GO terms to new SWISS-PROT records, and to SWISS-PROT records updated from SpTrEMBL. These are referenced to papers listed in the SWISS-PROT record and are also incorporated into our files. We now periodically check that all relevant SWISS-PROT entries are curated.
4. Becky is currently curating recent reviews, mainly on processes (e.g. oogenesis, embryogenesis, organogenesis, signaling), to increase the number of process GO annotations in FlyBase.
5. Work is ongoing to increase the number of definitions for fly-specific GO terms, especially for embryogenesis terms.
6. We have received a file of predicted GO annotations from the FB/PANTHER collaboration. A paper describing this experiment has just been submitted to Genome Research. The predictions have not been parsed into FB, because this analysis will be redone on the new Release 3 sequence.
7. The next major task will be to re-annotate the Release 3 protein set for GO terms. That should keep us busy for some time.

Rebecca & Michael.

-------------------------------------------------------------------------
B. Gene Ontology Annotation @ EBI

The GOA Project is headed by Rolf Apweiler
GOA Annotation Coordinator: Evelyn Camon (camon@ebi.ac.uk)
GOA Electronic Coordinator: Daniel Barrell (dbarrell@ebi.ac.uk)
URL: http://www.ebi.ac.uk/GOA
Last Updated: 03-SEP-2002

Current Status:
We have made 5 releases of GOA Human and 3 releases of GOA SPTR (GOA-All) on the EBI and GO ftp sites. In SRS these releases are merged into one database called GOA. The recent release of all our GO annotation makes the SWISS-PROT group at EBI a considerable contributor to the GO consortium annotation effort, providing over 2.1 million GO associations across 507964 SWISS-PROT and TrEMBL entries covering 45407 species. GOA Human releases are in keeping with our Human Proteomics Initiative and the GO Consortium agreement to fast-track functional annotation of the human proteome.
We have not yet integrated GO data from other Consortium groups due to lack of manual annotation with PUBMED references in the association files. We are working particularly closely with the Mouse Genome Informatics (MGI) and FlyBase groups to resolve these matters. As IPI is now indexing Mouse data we will next work on releasing GOA Mouse.

Discussions have been initiated with EMBL-Bank on how to transfer GO annotations from GOA into EMBL flat files via its db_xref. It has been decided to add a link from EMBL-Bank flat files directly to QuickGO, e.g. db_xref="GOA:P22301". It is hoped that this will be achieved by the next EMBL release, which will be made public in a few weeks' time.

EBI maintains the SWISS-PROT keyword-to-GO and InterPro-to-GO mappings. These are updated on a regular basis and shared with the GO Consortium, where they have been used to enhance Consortium data sets as well as those of external GO users (microarray/mass spec). We are also working closely with PIR to help with their keyword mappings.

A GOA paper has been submitted to Genome Research. The GOA project is ahead of schedule on all its grant deliverables.

HOW IS GO ANNOTATED IN SWISS-PROT/TrEMBL/InterPro?

GOA is produced by electronic and manual efforts.

The large-scale assignment of GO terms to SWISS-PROT and TrEMBL entries involves electronic techniques. This strategy exploits existing properties within the entries, including the presence of keywords and Enzyme Commission (EC) numbers as well as the presence of cross-references to InterPro entries, which are manually mapped to GO. Electronically combining these mappings with a table of matching SWISS-PROT and TrEMBL entries generates a table of associations. SWISS-PROT keyword and InterPro to GO mappings are maintained in-house and shared on the GO home page for local database updates.

Manual assignment of GO terms by SWISS-PROT curators uses published literature and provides more reliable GO annotation.
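
The electronic step described above (combining the keyword and InterPro mappings with a table of entries to generate a table of associations) might be sketched in Python roughly as follows. This is an illustration only and not GOA's actual pipeline: the function name, the record layout, the "SP_KW:"/"InterPro:" source labels and every accession and mapping shown are invented placeholders.

```python
# Illustrative sketch only: mapping contents and entries are invented placeholders,
# not real GOA data, and the record layout is an assumption.
KEYWORD2GO = {"Kinase": ["GO:0016301"], "Nuclear protein": ["GO:0005634"]}
INTERPRO2GO = {"IPR999999": ["GO:0005515"]}

ENTRIES = [
    {"acc": "X00001", "keywords": ["Kinase"], "interpro": ["IPR999999"]},
    {"acc": "X00002", "keywords": ["Nuclear protein"], "interpro": []},
]


def electronic_associations(entries, keyword2go, interpro2go):
    """Join entry properties with the mappings; every resulting row is tagged IEA."""
    rows = []
    for entry in entries:
        for keyword in entry["keywords"]:
            for go_id in keyword2go.get(keyword, []):
                rows.append((entry["acc"], go_id, "IEA", "SP_KW:" + keyword))
        for ipr in entry["interpro"]:
            for go_id in interpro2go.get(ipr, []):
                rows.append((entry["acc"], go_id, "IEA", "InterPro:" + ipr))
    return rows


ASSOCIATIONS = electronic_associations(ENTRIES, KEYWORD2GO, INTERPRO2GO)
for row in ASSOCIATIONS:
    print("\t".join(row))  # one tab-delimited association per line
```

Keeping the source of each inference in the last column is a design choice of this sketch; it makes it easy to see which mapping produced a given electronic association.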
On each release of GOA, annotation with electronic evidence codes (IEA: 'inferred from electronic annotation') will be replaced with associations using codes that imply more experimental evidence.

RETRIEVING DATA FROM GOA

There are various ways of accessing and searching GOA project data, including several web-based browsers. The GOA files can also be downloaded.

Resources & Descriptions

Web-based tools

QuickGO
  A fast web-based browser with access to core GO data and up-to-date electronic and manual EBI GO annotations.
  URL: http://www.ebi.ac.uk/ego/index.html

SRS
  Search the GOA database or a mirror of the GO consortium repository (GO).
  URL: http://srs.ebi.ac.uk/

Proteome Analysis Pages
  GO annotations have been produced for classification of proteins belonging to each complete proteome. On the Proteome Analysis Pages a slimmed down version of GO (GO-slim), representing high-level GO terms, is displayed as a proteome overview.
  URL example: http://www.ebi.ac.uk/proteome/HUMAN/go/go.html
  EBI's GO-slim see: http://www.ebi.ac.uk/proteome/goslim_terms.html

InterPro
  GO annotations made by InterPro are visible directly in InterPro entries.
  URL example: http://www.ebi.ac.uk/interpro/Ientry?ac=IPR000402

AmiGO
  GO Consortium browser with access to core GO data and released GOA data.
  URL: http://www.godatabase.org/docs/docs.html

Downloads

GOA 'Association File'
This is a tab-delimited file of associations between gene products and GO terms and is the most common form of data transfer within the GO Consortium. For more information on our format read the GOA README file (http://www.ebi.ac.uk/proteome/goa/goaHelp.html).

Two separate GOA association files are currently produced.
Human GOA file access (contains GO annotations for all proteins in the nonredundant human proteome set):
  ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/gene_association.goa_human.gz
  http://www.geneontology.org/gene-associations/gene_association.goa_human

SPTR GOA file access (contains GO annotations for all proteins in SWISS-PROT and TrEMBL):
  ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/SPTR/gene_association.goa_sptr.gz
  http://www.geneontology.org/gene-associations/gene_association.goa_sptr

GOA Xref File
For each GOA release we also distribute a file of cross references that displays the relationship between the entries in the GOA data set and other databases, such as the EMBL/Genbank/DDBJ nucleotide sequence databases, HUGO, LocusLink and Refseq.
GOA xref file: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/

STATISTICS
Statistics for GOA-Human and GOA-SPTR association files are available from the GOA homepage (http://www.ebi.ac.uk/GOA).

GRANT SUPPORT:
GOA is supported by Grants QRLT-2001-00015 and QLRI-2000-00981 of the European Commission and a supplementary NIH grant, 1R01HGO2273-01.

CONTACTING GOA:
Post:   EMBL-European Bioinformatics Institute
        Wellcome Trust Genome Campus
        Cambridge CB10 1SD
        UK
Phone:  +44 (0) 1223 494444
Fax:    +44 (0) 1223 494468
E-mail: goa@ebi.ac.uk

CREDITS:
Daniel Barrell - GOA File updates
David Binns - QuickGO
Wolfgang Fleischmann - Automation Coordinator
John Maslen - Talisman
Paul Kersey - Xref file & data set generation
Michele Magrane & all curators - GO Annotation
Nicola Mulder, Alex Kanapin & Annotators - InterPro
Rodrigo Lopez, Nicola Harte - SRS
Midori Harris, Jane Lomax, Amelia Ireland, Cath Brooksbank - GO Curators
Rolf Apweiler - SWISS-PROT Coordinator
Peter Stoehr - Head of Database Operations (EMBL-Bank Issues)

GOA Consortium Report (03-SEP-2002) SWISS-PROT/TrEMBL/InterPro

--------------------------------------------------------------------------
C. MGI Gene Ontology Progress Report Sept.
2002

General:
We continue to focus on extending our goal to have annotation for all genes in the database. Our efforts have focused on three areas:
1. Adding annotation to genes currently without any annotation
2. Replacing annotations that were "fished" from text records with literature-based annotation
3. Annotating genes having no GO annotation but having rat orthology

We have constructed a dataset that might be used as a "gold standard" to judge the efficiency of various annotation algorithms. This dataset comprises genes that have been hand annotated with evidence codes derived from experimental evidence (IDA, IPI, IMP, IGI). A second dataset derived from this series has only those genes that have had the same GO ID applied more than once by any combination of these codes.

MGI GO STATS as of August 27, 2002.
[table converted to plain text]

Annotation Type                     30-Apr-02  27-Aug-02    Change  % Change
Total Genes annotated:[1]                7600       8576       976        13
Total Hand Annotation # of Genes         2125       2646       521        25
Orthology:                                 19         24         5        26
"IEA" SwissProt to GO                    4852       6123      1271        26
Interpro to GO                           3376       3529       153         5
EC to GO                                  662        658        -4      -0.6
MLC Scan                                   40         40         0         0
GO Fish                                  2337       2228   -109[2]        -5

[1] Number of genes with at least ONE GO term of any kind.
[2] This figure has decreased due to our ongoing efforts to replace these with literature-based annotation.

Beyond GO
The phenotype ontology continues to be developed with the aid of the DAG-Editor[3], which has facilitated term merging and increasing the complexity of the DAG structure.

[3] Cynthia Smith, Cathleen Lutz, Carroll Goldsmith, Teresa Chu, and Alan P. Davis

Too many unnecessary GO terms: On the issues of excess granularity
[note: some color coding and underlining lost in conversion to plain text]

The GO was originally set up as a vocabulary to describe the molecular function, process, and cellular location of a gene product that could be used across model organism databases.
However, the GO has recently been growing in areas that appear to reflect a crossover between product name and function and process. There are three example areas:
1. Protein Binding
2. Protein Biosynthesis
3. Immune Response: interleukin X biosynthesis.....

1. The function term "Protein Binding" coupled with the "with" statement is intended to describe the interaction of a gene product with another protein. The creation of dozens of children that specifically refer to a single gene product in a single type of organism (mammal), as in the cases of interleukin-X binding, where X is a specific molecule, unnecessarily increases the granularity of the GO in a species-specific manner.

2 and 3. Protein Biosynthesis was originally meant to describe the processes involved in the formation of a peptide bond, either on the ribosome or not. The creation of specific terms for single instances of proteins is unnecessary. If the term is NOT meant to describe processes involved in peptide bond formation, it should not be a child of this term. Using the term "XYZ protein biosynthesis" to describe any unknown process or combination of processes involved in altering the level of a particular gene product is ambiguous. If there is no evidence to pinpoint transcription, RNA processing, translation, post-translational processing, or RNA and/or protein degradation as the process or processes in which the gene product to be annotated is involved, then perhaps no annotation should be applied. If we proceed down this path, then XYZ biosynthesis will need to have specific children, XYZ biosynthesis, transcription, etc.

1.
The first issue begins in protein biosynthesis, where we currently have:

   protein biosynthesis [GO:0006412] amino acid activation + charged-tRNA modification +
   **glycoprotein biosynthesis + CD4 biosynthesis + FasL biosynthesis +
   protein amino acid glycosylation + *integrin biosynthesis + **lipoprotein biosynthesis +
   **mannoprotein biosynthesis + *MHC class I biosynthesis + *MHC class II biosynthesis +
   *neurotransmitter receptor biosynthesis non-ribosomal peptide biosynthesis
   regulation of protein biosynthesis + regulation of translation +
   *TRAIL receptor biosynthesis + translational elongation + translational initiation +
   translational termination + viral protein biosynthesis

*What we do not need is a separate term for each protein. As I understood from discussions on the GO-list, these terms were intended to encompass everything that goes into making the protein, from transcription, translation, and perhaps even degradation. They are intended to capture experiments that use protein/gene product A to influence the (levels) of protein/gene product B. There may be 100 steps between the two. This is making the GO terms experiment driven rather than the other way around. Such experiments are just NOT useful as evidence for any GO terms. They suggest experiments to be done.

**The second issue regarding protein biosynthesis is that adding lipids and carbohydrates to proteins is a post-translational modification and does not belong under protein biosynthesis. The term "protein biosynthesis" should be restricted to processes that form a peptide bond, either on the ribosome (mostly) or not (antibiotics).

2.
A second area is the growth of a separate term for each protein binding:

   protein binding [GO:0005515] alpha-catenin binding ARF binding beta-amyloid binding
   beta-catenin binding cadherin binding calmodulin binding + clathrin binding collagen binding
   cyclin binding cytokine binding + chemokine binding +
   granulocyte macrophage colony-stimulating factor complex binding interferon binding +
   interleukin binding + interleukin receptor + interleukin-1 binding + interleukin-10 binding +
   interleukin-11 binding + interleukin-12 binding + interleukin-13 binding +
   interleukin-14 binding + interleukin-15 binding + interleukin-16 binding +
   interleukin-17 binding + interleukin-18 binding + interleukin-19 binding +
   interleukin-2 binding + interleukin-20 binding + interleukin-21 binding +
   interleukin-22 binding + interleukin-23 binding + interleukin-24 binding +
   interleukin-25 binding + interleukin-26 binding + interleukin-27 binding +
   interleukin-3 binding + interleukin-4 binding + interleukin-5 binding +
   interleukin-6 binding + interleukin-7 binding + interleukin-8 binding +
   interleukin-9 binding + cytoskeletal protein binding + DNA topoisomerase I binding
   dynein binding + enzyme binding + eukaryotic initiation factor 4E binding
   gamma-catenin binding growth factor binding + hemoglobin binding histone binding
   HSP70 protein binding + immunoglobulin binding + importin-alpha export receptor
   intermediate filament binding ISG15 carrier KU70 binding lamin binding
   lipoprotein binding + metarhodopsin binding neurexin binding
   nuclear localization sequence binding peroxisome targeting sequence binding +
   poly-glutamine tract binding polypeptide hormone binding + profilin binding
   protein amino acid binding + protein C-terminus binding protein carrier
   protein domain specific binding + protein signal sequence binding RAN protein binding
   Rho binding + RPTP-like protein binding SNARE binding + snoRNP binding syndecan binding
   TATA-binding protein binding TRAIL binding transcription factor
binding + Wnt-protein binding

This loses the utility of the Protein Binding and With fields. Are we going to have a separate term for every single pair of proteins? The chemokines and interferons conceivably could be expanded in a like manner. This is not needed. The primary term plus the "with" field is sufficient. Algorithms could be written so that, if the pairs are annotated properly, one could search the "with" field to come back with all binding partners.

3. A third area is related to the "biosynthesis" issue again. Why are separate terms for the biosynthesis of each interleukin needed?

   immune response cytokine metabolism cytokine biosynthesis chemokine biosynthesis +
   connective tissue growth factor biosynthesis +
   granulocyte macrophage colony-stimulating factor biosynthesis +
   interferon type I biosynthesis + interferon-gamma biosynthesis +
   interleukin-1 biosynthesis [GO:0042222] regulation of interleukin-1 biosynthesis +
   interleukin-10 biosynthesis + interleukin-11 biosynthesis + interleukin-12 biosynthesis +
   interleukin-13 biosynthesis + interleukin-14 biosynthesis + interleukin-15 biosynthesis +
   interleukin-16 biosynthesis + interleukin-17 biosynthesis + interleukin-18 biosynthesis +
   interleukin-19 biosynthesis + interleukin-2 biosynthesis + Interleukin-20 biosynthesis +
   interleukin-21 biosynthesis + interleukin-22 biosynthesis + interleukin-23 biosynthesis +
   interleukin-24 biosynthesis + interleukin-25 biosynthesis + interleukin-26 biosynthesis +
   interleukin-27 biosynthesis + interleukin-3 biosynthesis + interleukin-4 biosynthesis +
   interleukin-5 biosynthesis + interleukin-6 biosynthesis + interleukin-7 biosynthesis +
   interleukin-8 biosynthesis + interleukin-9 biosynthesis +
   regulation of cytokine biosynthesis + TRAIL biosynthesis +

All of these could be easily described using GO terms for translation, protein processing, etc. Again, we do not need a term for each specific protein product.
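
As a rough illustration of the "with"-field search suggested under point 2, a minimal Python sketch follows. It assumes annotations are available as simple (gene product, GO ID, evidence code, "with") tuples; that layout, the function name and the gene names are hypothetical, and only the protein binding ID GO:0005515 is taken from the listing above.

```python
# Hypothetical sketch: the tuple layout and gene names are invented for illustration.
PROTEIN_BINDING = "GO:0005515"

# (gene product, GO ID, evidence code, "with" field)
ANNOTATIONS = [
    ("geneA", "GO:0005515", "IPI", "geneB"),
    ("geneA", "GO:0005515", "IPI", "geneC"),
    ("geneB", "GO:0005515", "IPI", "geneA"),
    ("geneD", "GO:0006412", "IDA", ""),
]


def binding_partners(annotations, gene_product):
    """Collect binding partners of gene_product from protein binding annotations."""
    partners = set()
    for product, go_id, _evidence, with_field in annotations:
        if go_id != PROTEIN_BINDING or not with_field:
            continue  # only protein binding rows with a filled "with" field count
        if product == gene_product:
            partners.add(with_field)
        elif with_field == gene_product:
            partners.add(product)  # the pair may be recorded from the other side
    return sorted(partners)


print(binding_partners(ANNOTATIONS, "geneA"))
```

The search looks at both the annotated product and the "with" column, so a pair recorded only once is still found from either partner.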
These too appear driven by the desire to use an experiment to create a GO term.

We need to decide how granular the GO needs to be.

Prepared by H. Drabkin 10/2/02

-------------------------------------------------------------------------
D. GO Report from SGD

Outline
- SGD Goals for GO Annotations
  - Definitions for GO Terms within SGD
  - Annotations
- GO Tutorial
- GO Tools
- Pathway Tools

SGD Goals for GO Annotations

Definitions for GO terms within SGD
SGD is making a big push to write definitions for all the terms that have been used to annotate SGD genes. There are about 68 component terms, 268 function terms and 287 process terms that need definitions. Each curator writes 2 definitions per month; in addition, a curator who needs to annotate to a term that doesn't have a definition writes the definition before making the annotation. We are making good progress towards this goal.

Annotations
Our goals for the near future are:
- Have at least one annotation for all the named genes. Of 4297 named ORFs, only 367 loci have no annotation.
- Fill in annotations for genes that have partial annotations.
- Polish all the annotations (work on the IEAs and the 'unknown' annotations).

GO Tutorial
SGD has created a tutorial to familiarize users with the Gene Ontology (GO) and how it is used at SGD. The tutorial gives an overview of GO and highlights pages and tools at SGD that use GO annotations, with some cool mouseovers. In addition, the tutorial provides links to other sites that may help users take advantage of the power of GO.
GO Tutorial: http://genome-www.stanford.edu/Saccharomyces/help/gotutorial.html

GO Tools
SGD has developed 2 tools to mine GO data: the GO Term Mapper and the GO Term Finder. The GO Term Mapper, or GO slim tool, maps the granular GO terms used to annotate a list of genes to their more general parent terms (i.e. GO Slim terms) from all three ontologies.
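
The kind of mapping the GO Term Mapper performs can be pictured with a small Python sketch. This is not SGD's implementation; it assumes the ontology is held as a child-to-parents dictionary and the slim as a set of term names, and the tiny DAG, the slim and the gene names below are invented for illustration.

```python
# Hypothetical sketch: the DAG fragment, slim set and gene names are invented.
PARENTS = {
    "translational initiation": ["protein biosynthesis"],
    "protein biosynthesis": ["metabolism"],
    "interleukin-1 binding": ["protein binding"],
    "protein binding": ["binding"],
    "metabolism": [],
    "binding": [],
}
SLIM = {"metabolism", "binding"}


def slim_terms(term, parents, slim):
    """Walk up the DAG from term and return every slim term reached."""
    found, stack, seen = set(), [term], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # a DAG can reach the same ancestor by several paths
        seen.add(current)
        if current in slim:
            found.add(current)
        stack.extend(parents.get(current, []))
    return found


def map_genes_to_slim(gene_terms, parents, slim):
    """Map each gene's granular annotations to the slim terms above them."""
    result = {}
    for gene, terms in gene_terms.items():
        mapped = set()
        for term in terms:
            mapped |= slim_terms(term, parents, slim)
        result[gene] = sorted(mapped)
    return result


GENE_TERMS = {
    "gene1": ["translational initiation", "interleukin-1 binding"],
    "gene2": ["protein binding"],
}
print(map_genes_to_slim(GENE_TERMS, PARENTS, SLIM))
```

Because a term can have more than one parent, the walk keeps a set of visited terms rather than assuming a tree.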
The GO Term Finder finds all the terms and their parents for a list of genes (the user's query). It gives a tree view, with the DAG relationships, of all the terms that the query set of genes has been annotated to. Both these tools can take a file of gene names or ORFs as input and can be very useful for analysis of expression data.
GO Term Mapper: http://genome-www4.stanford.edu/cgi-bin/SGD/GO/goTermMapper
GO Term Finder: http://genome-www4.stanford.edu/cgi-bin/SGD/GO/goTermFinder

Pathway Tools
SGD is in the process of incorporating biochemical pathways into the database using Peter Karp's (Stanford Research Institute, CA) Pathway Tools. A summer student mapped E.C. numbers to metabolic enzymes in SGD (approximately 1000) by using ec2go and searching the literature. In the first build using the Pathway Tools, 828 reactions were created in 163 pathways. We are in the process of refining the pathways. We will be using the E.C. numbers to increase the GO function annotations and hopefully add to the current ec2go file as new GO function terms are created.

-------------------------------------------------------------------------
E. TIGR eukaryotic GO update September 2002
Linda Hannick

Associations currently at GO:

Arabidopsis                     Aug-27-2002
# genes with GO assignments            5089
  since last release                    798
# terms assigned                      10833
  molecular function                   6564
  biological process                   2807
  cellular component                   1462

Associations not yet released to GO
Other euk GO annotations in progress and not yet released include chromosome 2 of T. brucei (manually curated) and O. sativa (IEA).

New developments
The Arabidopsis project is now sharing GO annotations with TAIR weekly. TAIR GO assignments will be stored in our database along with our own to prevent duplication of work. They will be displayed on our annotation interface. We have a new gi2ath association file from GenBank which will be uploaded to the GO ftp site after this meeting.
Software improvements are making it faster and easier to assign GO terms. The Manatee interface now allows editing of GO terms and evidence. A new GO search page allows an annotated search of a particular genome in our database, or a search of the entire DAG. TIGR annotators track new terms using temporary "TI:" IDs. The assignment of temporary terms is now enabled by a set of pages: [pictures]

Track TI IDs
The TI: IDs are intended as a tracking device for new terms as they are submitted to Sourceforge. They are replaced automatically in our database as we enter the newly assigned GO: ID, with the TI: ID becoming a synonym to the GO ID in our database.

-------------------------------------------------------------------------
F. TIGR microbial GO update - August 2002 - compiled by Michelle Gwinn

Associations currently at GO:

                        genes    terms
Vibrio cholerae          2924     6243

I just sent a new Vibrio file with more associations. You may have noticed that the number went down instead of up; this is due to the removal of GO terms from the plain "hypothetical proteins", as a result of discussion on the GO email list. I also sent a gp2protein file for Vibrio.
--------------------------------
Associations (manual) not yet at GO:

Genome                         genes    terms
Shewanella oneidensis           3769     8307
Bacillus anthracis              4555     9673
Coxiella burnetii               1467     2711
Methylococcus capsulatus*       2616     4554
Geobacter sulfurreducens*       1916     4078
Listeria monocytogenes*        >1465    >3342   (in progress now)
                               -----  -------
TOTAL                          15788    32665
GRAND TOTAL                    18712    38908   (with Vibrio)

Genomes pending publication (submitted manuscripts) and subsequent release to GO web page:
Shewanella oneidensis
Bacillus anthracis
(Total of 17980 GO terms)

* indicates annotation is incomplete for that genome; more genes from that organism still need to be assigned GO terms

-----------------------------------
Other news:
Our automatic annotation tool is now assigning GO terms to microbial genomes
  - TIGR genomes, preliminary assignment - followed by manual review prior to release
  - non-TIGR genomes (IEA) for display on our CMR website (should we send these to GO?)
Comprehensive Microbial Resource (CMR) displaying GO terms for genes that have them (should be functional by the time of the meeting)
Rough draft of prokaryotic GO Slim exists, work continues
db/software support of GO Slims is under construction

------------------------------------
If anyone has any questions or wants to chat about any of this, please don't hesitate to email me - mlgwinn@tigr.org
Hope you have a good meeting, see you in the winter,
Michelle

=======================================================================
Appendix 4: Action Items from May 2002 Meeting at CSH

ACTION ITEMS FROM MAY 12-13 GO MEETING

Action Item 1 - AmiGO (Brad Marshall, BDGP)
   a) make display of NOT data possible/correct in AmiGO (e.g. FBP26 for SGD; FlyBase, others have more)
   b) metareference for curator refs for AmiGO (BDGP and/or GO): create a metareference for linking for curator refs for definitions for AmiGO (e.g.
GO:mah, SGD:krc, etc)
   c) linkouts in AmiGO to sequence in cases such as ISS with ________)
   d) Incorporate GO-Slim scripts into AmiGO
   e) display comment field in AmiGO
   f) show concurrent assignments in AmiGO
   g) add a SourceForge site for AmiGO bugs/requests
   h) gray out obsolete terms (post meeting addition)
   i) link from treeview page to graph view
   j) search function for the comments
   k) don't automatically toggle to gene product when the search result comes up null
   l) need to make sure that definition references go up with the def, not in the general dbxrefs
   m) add ability to upload files for multigene search
   n) GOST, request for it to accept a seqID
   o) want to be able to search with SwissProt accession numbers (this requires a gp2protein file for every organism, nothing for TIGR, PomBase, etc.)
   p) having a way of hiding/deselecting GO terms in BLAST report that you don't believe

Action Item 2 - GO.xrf_abbs file (each group): examine the GO.xrf_abbs file with respect to those abbreviations used by your group, add or submit (to your favorite contact with CVS write permission)

Action Item 3 - GO content: modification vs. biosynthesis (GO) - examine ontologies for consistency of term names in the area of modifications to nucleotides/amino acid residues within the context of an already synthesized nucleic acid/protein
   **DONE, except for a few individual cases that aren't straightforward

Action Item 4 - GO content: sensu terms (GO) - evaluate sensu terms, and expand documentation
   **in progress

Action Item 5a - GO syntax: use of 'and' and 'and/or' (GO) - evaluate use of 'and' and 'and/or' in GO terms, target for elimination when possible
   **in progress

Action Item 5b - possibility of ambiguous gene associations conjoined with 'OR' (BDGP: Chris, John) - discuss possible software solutions to ?
of joining two different associations (gene product to GO term) with an 'OR', [NB: resolution of this item was unclear; first communicate with GO people on Action Item 5a and discuss whether there is any real desire/need to do this.]

Action Item 6 - expansion/clarification of GO documentation (GO: Cath B) - Cath will evaluate GO documentation and expand/modify to clarify

Action Item 7a - ontology integrity checking (John) - will create a SourceForge submission page for ontology errors
   **DONE!!! 5/13/02

Action Item 7b - ontology integrity checking (each group) - curators should look for ontology errors, and submit them to the SourceForge page that John will create
   **two whole entries so far

Action Item 8 - submit GO-slim scripts/rules (each group, as relevant) - Submit scripts (Chris is fine with Python, or Perl) for using/calculating GO-slims to BDGP

Action Item 9 - GO-slim naming conventions (GO): - confirm/review naming conventions for GO-slims and expand documentation if needed (Michael Ashburner claims that there is a naming convention in the document that he has just written)
   **was done already (see go/GO_slims/README)

Action Item 10 - DAG-Edit/GOET (John Richter) - automatic recognition of ID prefix so that one doesn't have to manually change it all the time

Action Item 11 - division of 'part-of' into multiple relationship types (Chris and Jane) - will look into new relationships deriving from the current multiplicity of the meaning of the 'part of' relationship

Action Item 12a - GO dictionary (GO, John Garavelli) - we need a dictionary for John to use for spell checking (John Garavelli wants to write a script for this anyway so he will generate the dictionary)
   **DONE; dictionary is updated frequently

Action Item 12b - GO dictionary in editor (John Richter) - can write a spell checker for the editor once he has a dictionary

Action Item 13 - Cross-product tool (interested parties (David, Bernard, ?), Chris, and John Richter) - cross-product tool: further
discussion will clarify what is actually wanted as well as feasible, so that John can write a plug-in for curators to use via the editor

Action Item 14 - New documentation for making cross products in DAG-Edit as currently exists (GO: Jane, Amelia) - create document on generating cross-products in DAG-Edit

Action Item 15 - comment field: obsoletes & syntax (GO) - move obsolete IDs from synonyms to comment field and institute a regular (as in parsable) syntax for this field
   **parsable syntax part is done - syntax established; only thing now is to make sure we use it

Action Item 16a - concurrent assignment protocol/docs for QuickGO (Evelyn) - get documentation from Tom Oinn on how he did it for QuickGO; add to documentation, to explain how this is calculated

Action Item 16b - concurrent assignments from database (Chris) - pull this calculation on concurrent assignments from manual annotations using Database [NB: Fritz Roth is doing some calculations along this line]

Action Item 17a - sequence clustering for sequences annotated with GO (Daniel? Liat?) - take sequences as they are now, run a clustering algorithm, generate trees, attach GO annotations and inspect by hand

Action Item 17b - very cool annotation tool (????, highly dependent on above) - use this to develop an annotation tool that utilizes homology clustering

Action Item 18 - IEA/ISS methods (each group, GO: Midori): Groups to submit to Midori short blurbs on procedures for large scale annotation methods (bulk assignments, particularly with IEA or ISS) with urls to add to the annotations guide
   **I've received ONE response (thanks to Harold Drabkin)

Action Item 19 - gp2protein file documentation (Chris??) - expand documentation for gp2protein files

Action Item 20 - monthly release notes (GO) - take a look at doing monthly release notes
   **in progress; item for Sept.
agenda

Action Item 21 - monthly diffs (Courtland Yockey) - will investigate DAG-Edit diffs, and communicate with John regarding proceeding further on utility of a plug-in for DAG-Edit that could do this
   **progress on item 20 is relevant

Action Item 22 - update to current GO home page (Karen) - make links to Gavin's source

Action Item 23 - DAG-Edit user notes (Jane) - will post DAG-Edit user notes
   **DONE!! (thanks, Jane!)

Action Item 24 - GO FAQ (Rama and Cath) - populate FAQ with Q & A's

Action Item 25 - Hinxton meeting (Michael Ashburner)
   - a: find venue for 10-11 meeting
     **DONE
   - b: get a Manchester person down to talk about DAML+OIL
     **I've asked

Action Item 26a - Hinxton Users meeting (Midori and Karen) - will work out logistics of registration (Consortium members will probably also use the registration page)
   **DONE

Action Item 26b - Hinxton Users meeting (Midori) - add suggestion tick box to reg form for what would you like to see
   **DONE

Action Item 26c - Hinxton Users meeting (Midori) - mailing to go-friends list asking about desired content/attendance for User's meeting
   **DONE (zero replies tho :( )

Action Item 27 - quotes for Virgin Islands meeting proposal (John Richter) - will get quotes and send to list within the next week
   **John sent one message and got several replies, so I assume this is in progress

===================================================================
Gene Ontology Consortium Meeting
Divi Carina Hotel, St Croix, US Virgin Islands
January 25-26, 2002

Contents
Participant list
Progress Reports
Action Items from last meeting
Presentation: GO in UMLS: Jane Lomax
Content Issues
Database & Software
Annotation Issues
Miscellaneous
Documentation:
   Appendix 1: Handouts accompanying progress reports [omitted from text version]
   Appendix 2: Action items from CSH May 2002
   Appendix 3: Notes on J. Lomax presentation [omitted from text version]
   Appendix 4: Assorted documents relevant to agenda items.
      A. Email from Tanya Berardini
      B. Email from Aubrey De Grey
      C. MGI Excessive granularity document
      D. MGI Negation document
      E. Documentation progress report (from Cath)
   Appendix 5: Collected action items from this meeting

Participants

Michael Ashburner   FlyBase                        Cambridge, UK
Daniel Barrell      EBI                            Hinxton, UK
Matt Berriman       PSU (Sanger)                   Hinxton, UK
Judith Blake        MGI                            Bar Harbor, ME
Cath Brooksbank     EBI                            Hinxton, UK
Evelyn Camon        EBI                            Hinxton, UK
Tricia Dyck         DictyBase                      Northwestern University, Chicago, IL
Kara Dollinski      SGD                            Stanford, CA
Harold Drabkin      MGI                            Bar Harbor, ME
Dianna Fisk         SGD                            Stanford, CA
Becky Foulger       FlyBase                        Cambridge, UK
Linda Hannick       TIGR                           Rockville, MD
Midori Harris       EBI                            Hinxton, UK
David Hill          MGI                            Bar Harbor, ME
Eurie Hong          SGD                            Stanford, CA
Amelia Ireland      EBI                            Hinxton, UK
Jane Lomax          EBI                            Hinxton, UK
Brad Marshall       BDGP                           Berkeley, CA
Suparna Mundodi     TAIR                           Carnegie Inst., Stanford, CA
Chris Mungall       BDGP                           Berkeley, CA
Sue Rhee            TAIR                           Carnegie Inst., Stanford, CA
John Richter        BDGP                           Berkeley, CA
Valerie Wood        GeneDB S. pombe (Sanger PSU)   Hinxton, UK

Progress Reports
For full reports, see Appendix 1.

GO Editorial Office, EBI
- over 600 new terms added; 70% of terms now have definitions
- every GO synonym examined and a relationship to the term name assigned (as part of UMLS project)
- comments added to all obsolete terms
- SourceForge item notification script now up and running

DictyBase
- public beta of annotations is viewable and will be added to the GO repository after checking
- medical ontology has been developed and should be available soon

FlyBase
- 27,056 GO annotations now in FlyBase
- Swiss-Prot GO annotations continuing - these include annotations for non-D. melanogaster genes.
- most recent re-annotation of the Drosophila genome (release 3) is almost complete - definitions added to a number of fly specific process terms GOA @ EBI - 6 GOA-SPTR releases, 8 GOA-Human releases - GOA dataset to be enhanced by mappings from Swiss Institute of Bioinformatics - GOA cross-referenced directly in the EMBL nucleotide sequence database - QuickGO browser updated MGI - annotations added at a steady rate; 41,000+ annotations to 9032 genes - continued development of phenotype ontology; expected to be made public by mid-February - RIKEN data has been loaded into the database GeneDB S. pombe (Sanger PSU) - total of 15,029 GO term assignments now made to process and component terms - extensive overhaul of configuration files to give constant refinement of associations PSU (Sanger) - full manually curated GO annotation of malaria finished - joint curation with TIGR of Trypanosoma brucei continues - annotation of Aspergillus fumigatus and Theileria annulata genomes to come SGD - two new software tools: GO Term Finder and GO Tree View - every ORF at SGD has a function and process term annotation - every named ORF has a complete set of GO annotations TAIR - GO terms being added to the ontologies with definitions - plant GO-slim developed and submitted - aim to annotate all studied Arabidopsis genes to all three GO ontologies TIGR - T. brucei chromosomes 4 and 6, rice and Aspergillus fumigatus are in the works - Shewanella association file recently submitted - several bacterial genomes awaiting publication ACTION ITEM: TAIR to update MetaCyc2GO mappings. Action Items from last meeting See Appendix 2 for full details. Action items arising from this were: ACTION ITEM: John. 7 from last time [add term deletion feature to DAG-Edit]. ACTION ITEM: Brad. 10 and 11 [adding more information about GO curators to website/database] outstanding. ACTION ITEM: Come up with system for notifying developers of format changes. ACTION ITEM: Add "contributed by" column. 
GO in UMLS: Jane Lomax

See Appendix 3 for the full presentation.

Progress report: GO has not yet been released with UMLS Metathesaurus, but substantial progress has been made. There has been a successful insertion of the molecular function ontology, with cellular component and biological process soon to follow. There are two major issues created for GO: how to handle GO 'synonyms', and ambiguity in GO term names. These issues are discussed later in the meeting.

Content Issues

Synonyms: distinguishing exact synonyms from related terms
- how many types to distinguish?
- how to store/represent (implications for tools)?

There was some discussion regarding the synonym types. In particular, whether a synonym with the "broader than" relationship to the main term reflects a missing parent or relationship in the tree, and also the number of relationships we need - do we need finer distinctions than true synonym vs related term? It was concluded that we would keep all the existing types of synonyms (exact, broader, narrower, related to, undefined) and the hierarchy of synonym types would be as follows:

related to
  [i] exact
  [i] broader
  [i] narrower
  [i] undefined

ACTION ITEM: Curators. When adding new synonyms, track which type they are. If they are 'broader than' or 'narrower than', consider whether it calls for a new term.
ACTION ITEM: Jane. Circulate synonym list again.
ACTION ITEM: BDGP. Look into rules that could be worked into DAG-Edit to make synonym maintenance easier.

GO/UMLS component term merge problems

The problem stems from ambiguity in term names. The term string "xxx complex" in GO refers to a cellular location, but the same string in UMLS usually refers to a protein entity and would be assigned the semantic type 'amino acid, peptide or protein'. The question is, does the GO cellular component term mean the same as the UMLS concept?
If it doesn't, and a new concept would have to be created, what semantic type should we assign it, and what relationship would need to be created between these new and existing concepts?

It was agreed that the GO 'xxx complex' cellular component terms were different in meaning to the existing 'xxx complex' concepts in UMLS, and GO term names should not be changed to fit with UMLS. It was decided that Jane should discuss possible solutions with UMLS people: possibly modify some GO term names in UMLS only (by adding 'location'?) or see whether UMLS can help come up with a solution in their system, and keep the consortium informed of progress. The consensus was that all cellular component terms should be in concepts with the semantic type 'cell component' (never part of a concept with the semantic type 'amino acid, peptide or protein') and that the relationship between the new (with GO term) and existing concepts should be something broad, like 'related to'.

ACTION ITEM: Jane. Discuss this with UMLS and fill us in on the results.

Cellular processes: questions to be resolved before the cellular process reorganization is committed

See Appendix 4A for the email from Tanya Berardini containing the questions.

- Cellular differentiation vs cell fate commitment and cell type development vs cell type differentiation

David Hill outlined a suggestion: cell differentiation can be broken down into the following steps: cell fate commitment, where a cell senses its location and begins to specialize but can still switch types; cell type determination, where a cell switches irreversibly to a specific type; and cell development, where a cell physiologically matures into its type. Should we use these divisions in GO? The group agreed that we should.
Conclusion: Cell differentiation and its children will have the following structure:

cellular process
  [i] cell differentiation
    [p] cell fate commitment (exact synonym: cell fate specification)
    [p] cell fate determination
    [p] cell development (exact synonyms: cell morphogenesis, cell maturation)

- Response to endogenous stimulus and response to exogenous stimulus

Cellular response and organismal responses are usually linked; we would like to capture the relationship but don't want to violate true paths (e.g. for unicellular organisms). This means being very careful with parentage. Cue a big discussion of where to put the unicellular/multicellular split. A working solution was proposed: make the split as far below 'physiological process ; GO:0007582' as possible, and as and when needed, rather than splitting right below physiological processes. We will revisit this to see how the solution has worked. Leaving the "response to xxx" terms under cell communication is fine. The group agreed that it was always important to keep annotation in mind when making these changes, and reaffirmed the need to keep GO process terms covering multicellular processes, as they are needed for annotation in many species and help in the development of orthogonal ontologies.

ACTION ITEM: David and Tanya. When splitting out multicellular vs unicellular processes, make the split as far below 'physiological process ; GO:0007582' as possible, and as and when needed, rather than splitting right below physiological processes.

Grouping terms in the function ontology

Prompted by Karen's email on G-nucleotide release factors and the related items RNA polymerase and hydrogen-translocating ATPases.

The function ontology contains grouping terms that reflect process or component info (e.g. DNA repair protein; membrane-associated functions). This cross-contamination is useful for helping curators find terms but is not consistent with the guidelines set out for function terms.
One approach would be to make relationships between the function and component or process ontologies and remove the grouping terms. This would require VERY careful curation as some functions act in many processes. A better solution would be to expand the toolset available to curators, eg. Fritz Roth's statistical links and concurrent assignment tools. The conclusions were that no hard-coded links will be made between the ontologies and instead research would continue into tools to make statistical links. ACTION ITEM: GO editorial team (and others). Start removing grouping terms slowly and carefully with all the usual communications. If obsoleting a term, ensure the corresponding process or component exists. Should functions (particularly enzyme functions) be differentiated on the basis of environment? 1. pH-specific enzymes: Example given was GO:0030230 and GO:0030231, differentiated on the basis of the pH at which they act. Conclusion: different EC numbers - keep both terms; same EC numbers - obsolete the pH-specific examples and use the parent term. 2. Hydrogenases: Example given was GO:0008901 and its children GO:0016948 - GO:0016951. They have the same EC number but different metal ions associated with them. This could be solved in the same way as protein binding - at the annotation stage, use a chemical ontology and use the extra column to note the metal. Alternatively, we could use multiple parents and/or annotate to separate terms (eg. hydrogenase, iron binding). The issue was not resolved after discussion and will probably be left until we have software to implement the new column. Should we add 'activity' to function term strings? - if so, do we change the main term string or add 'related terms'? Two main arguments for this: first, it reduces the ambiguity of the term name, therefore helping when GO is included in other systems (specifically UMLS), and second, it will reduce user confusion. All agreed this was a timely step. ACTION ITEM: Jane. 
Add activity to function term strings.

How to represent membrane proteins
- whether to have 'integral [to] membrane', what wording
- whether to add children (e.g. for type I, II, III, IV transmembrane)

In the component ontology, we used to have 'integral membrane protein' plus children, which was problematic because it didn't refer to a location but rather to a relationship between a membrane protein and a membrane. The wording was recently changed to 'integral to membrane'; did we want to keep this for the long term or find some other solution? The other issue, brought up by Evelyn, was whether to add more granular child terms for the different types of transmembrane protein, as this would help with Swiss-Prot/GO mappings. This idea was rejected because these are types of protein and not locations.

Conclusion: Keep the membrane terms as they are now (integral and peripheral); don't add the children as they don't reflect a location.

Should the 'host' term be used for viral cellular component terms?

The term 'host' was originally created for describing the cellular component of single-celled parasites infecting a host cell, so it was placed under 'extracellular'. A problem arose when trying to add the new viral terms, because viruses aren't cells, so the host cell environment is not extracellular. Various options were discussed, including moving 'host' out from under 'extracellular', but it was felt that the best option was to simply extend the definition of 'extracellular' so that it could be applied to organisms that aren't technically cells. A comment would also be added explaining why this was done.

ACTION ITEM: GO editorial team. Define extracellular to include outside a virus particle, then use host terms as parents for the appropriate virus cell component terms.

How should we handle component terms that can be both intracellular and extracellular?
Some complexes can be intra- or extracellular; the example given was 'immunoglobulin complex ; GO:0019814' which can be either membrane bound or circulating, so there are two is_a child terms, 'immunoglobulin, circulating ; GO:??' and 'immunoglobulin, membrane bound ; GO:00??'. The problem comes with the placement of the parent term, the generic 'immunoglobulin complex', which might be used when you know that a gene product is a component of an immunoglobulin molecule, but do not know whether it is membrane bound or circulating. At the moment the term is placed directly under 'cellular component', but it's going to end up a pretty long list! After some discussion, during which we considered whether we needed a generic term at all, it was felt that the most appropriate place for such terms is directly under cellular component where we currently have them.

ACTION ITEM: GO editorial team. Go through the enzyme complexes (see also SF entry 535294) and where applicable, make a general parent directly under 'cellular component' with children in specific locations.

Term grammar (for use in automated construction of sentences describing gene products)

See Appendix 4B for the email from Aubrey de Grey.

We are willing to alter the term grammar to suit Aubrey's needs as long as:
A: Aubrey sends terms so we don't have too much work to do!
B: we check carefully to make sure any changes won't wreck terms for biologists searching or curators annotating

ACTION ITEM: GO editorial team to get list from Aubrey and evaluate; adjust terms as needed.

Revisit 'catalyst' and 'regulator' part-of children of some enzymatic activity terms

Several enzymes are split into a catalyst and a regulator function. This item questioned the need for these terms as they sound like enzyme components rather than functions. After discussion, it was decided that they should be left as they are to allow maximal information about protein function to be captured.
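As an illustration of the term grammar item above, the following is a minimal sketch of how a summary sentence might be assembled from one term in each ontology, treating function terms as activities rather than molecules (the approach suggested in the Appendix 4B correspondence). The function name summarise_gene_product, its parameters, and the exact sentence wording are assumptions made for this example; this is not the FlyBase code.

```python
def summarise_gene_product(function_term, process_term, component_term,
                           product="a protein"):
    """Build a one-sentence summary from three GO term names.

    Hypothetical helper: the name, parameters and sentence wording are
    illustrative assumptions, not an existing GO or FlyBase API.
    """
    # Function terms are activities, not entities. Append "activity"
    # unless the term string already ends with it.
    if function_term.endswith(" activity"):
        function_phrase = function_term
    else:
        function_phrase = function_term + " activity"

    return (
        f"It encodes {product} with {function_phrase} "
        f"involved in {process_term} "
        f"which is localised to the {component_term}."
    )


if __name__ == "__main__":
    print(summarise_gene_product(
        "heme binding", "nutritional response", "extracellular space"))
```

The check for a trailing "activity" means the sketch reads the same whether or not 'activity' has been added to a given function term string.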
Revisit the "Round Table Discussion" on how to represent synthesis/binding/etc. of individual proteins

See Appendix 4C for the MGI excessive granularity document.

The problem is basically that GO cannot allow gene product names inside GO terms because of the rampant proliferation of terms that this generates; however, it is still useful to be able to annotate to this level of granularity. For instance, to be able to state that a gene product IL18_HUMAN is involved in 'interleukin-13 biosynthesis'.

The solution proposed by Chris was as follows: some GO terms would have 'slots', which would be filled in the gene_associations file. For instance, 'biosynthesis' would have a 'slot' named 'synthesizes'. The GO term 'interleukin-13 biosynthesis' would therefore not exist, and instead, the annotation for IL18_HUMAN would include an entry to GO term 'cytokine biosynthesis ; GO:0042089' or just plain 'biosynthesis ; GO:0009058'; this entry/line would also have a column for 'slot', which would read "synthesizes(interleukin-13)". Interleukin-13 could be replaced with an identifier from a product/family/physical-entity ontology. The proposition is described in more detail at http://www.fruitfly.org/~cjm/slots.html

The practical implications were discussed; there is a need for ontologies to cite in the slot values, for example, a chemical ontology and a protein family ontology. A few exist and more will be available in about a year. This will also require a rethink of annotation practice, and some new tools. Existing annotations would of course have to be retrofitted, but the bulk of this could be automated. Of great importance is considering our users; any changes need to be announced well in advance. In addition, would we change the front-end appearance of tools, e.g. AmiGO, or keep these changes behind the scenes?

One issue is that using the slots effectively creates GO terms that are cross-products, but do we instantiate these products - i.e. give them GO IDs?
For instance, if we were to instantiate all the terms generated by the cross product between 'synthesis' and a product/molecule/chemical ontology we would have actual GO IDs:

GO:9000001 IL-1 biosynthesis
GO:9000002 IL-2 biosynthesis
GO:9000003 IL-3 biosynthesis
GO:9000004 IL-4 biosynthesis
GO:9000005 IL-5 biosynthesis

The disadvantage is that any time the orthogonal ontology of products is changed, GO has to be changed (either manually or automatically) to reflect this. For example, if IL-1 was split into IL-1a, IL-1b we would need IL-1{a,b} {biosynthesis, receptor} etc in GO. With the 'slots' approach there would be no GO ID for "IL-8 biosynthesis". Curators could still annotate genes as "IL-8 biosynthesis" by dynamically combining the terms using slots, but the disadvantage is that there would not be a single GO ID they could quote in a paper etc.

ACTION ITEM: Announce on the website that we'll implement this solution at some future date (no date set but will be 6+ months from now). Assemble a group (MA, Chris, David) to work on the implementation.

Interest Groups

Interest groups and areas have been extensively examined or claimed already; the problem is how to ensure that the interest group is informed when changes are made to that part of the ontology. We could have interest groups listed e.g. in SourceForge, or on our webpage, perhaps with a list of GO_Slim terms defining the area of interest alongside. Anyone making changes to these areas would then have to inform these groups first, then the onus would be on these groups to pipe up if they had a problem!

ACTION ITEM: Midori to put up interest groups on web page. Everybody to send group ideas & which they volunteer for. See if it works or if we need further formalization by putting groups in SourceForge.

Annotation

Annotation of disease genes

Annotations of genes implicated in disease to be submitted by Nat Goodman. These should be fine as long as he doesn't annotate actual disease processes, i.e.
he must only annotate the normal functions of genes implicated in disease.

Consistency and quality control
- Suggestion from Evelyn: a set of "standard annotations" for common proteins.

Evelyn has seen different terms assigned to "common" proteins; is this a QC problem or does it differ between organisms and what has been studied and what experiments have been done? How do you define "common proteins"?

Conclusion: Annotations are the responsibility of individual databases. Differences often reflect the state of experimentation. Evelyn has a unique perspective for spotting inconsistencies, because SWISS-PROT includes annotations from all organisms. She should keep communicating problems to the individual databases.

Negation

See Appendix 4D for MGI's handout.

Conclusion: The best solution in the long term is to use Chris's slots model; in the meantime, muddle through somehow - each group can decide what works best for them.

Database and Software

DAG-Edit & GOET

One line of GOET work has stopped, but GOET overall goes on. John is back working on DAG-Edit. :-)

New DAG-Edit features (full list appears in the release notes of the latest version):
- Search tool remembers last 10 searches on each field
- Configuration plugin allows users to show undefined terms in gray
- Changed flat file format to support multi-character types
- The available relationship types are now defined per-session, instead of per-adapter
- Created a Relationship Type Manager plugin that allows a user to define which types are available in a session
- Dbxrefs now have an editable description (however, the flat file format cannot store these descriptions)
- An arbitrary number of files can now be read in at one time (instead of just 3)
- File read history now stores groups of files, not one file at a time

John would like to switch over to the new flat file format. This should be announced on the proposed webpage for forthcoming software/data format changes, as well as on the GO site in SourceForge.
Users should be given adequate time to switch over; John suggests allowing two months after the announcement has gone up. The new format allows relationship symbols & types defined in headers, and multi-character relationship types are possible, as well as dbxref comments in the flat file. It also has a reduced file size due to non-redundant display of parentage.

Other planned features for DAG-Edit include:
- multiple terms viewable in gene product plug-in (only one can be viewed at a time at the moment)
- option to have a "delete" button to move terms to obsolete
- plug-in for cross-products
- spellcheck function to use with the dictionary file

AmiGO

Brad reported that the AmiGO GOst BLAST server is now live. He also reported that an AmiGO software upgrade is coming soon. Brad is interested in feedback from the community that uses GO on what data and tools they use and how they use them.

ACTION ITEM: Construct and post a user survey covering tools, AmiGO, etc. Send question ideas to Amelia Ireland. It will be sent out to GO-Friends and data collected in time for the grant application.

Database

There isn't much change to report on the database. Chris and Dave Emmert (Harvard) are developing CHADO, a postgres database. It will be more capable of holding different ontologies; it is expected that FlyBase and GMOD will use it and it will probably subsume the GO database.

Database Updates (Chris Mungall)

Chris says that automated database updates are taking place approximately once a month. He has a script which creates 4 downloads: terms; terms and annotations; terms, annotations and sequences; terms, annotations without IEA and sequences (for AmiGO). The script takes approximately two days to run. It was suggested that there should be a daily update of the database terms and structure to prevent the lag seen between the addition of the new terms and their appearance in AmiGO.

ACTION ITEM: Chris.
Suggestion: a daily release of a separate database containing just terms without annotations. The whole database should be updated every month. AmiGO would have the option to view the up-to-date term set with no associations.

Chris has scripts to map gene association files to GO-slim terms; they use 'bucket' terms such as "other enzyme" which are given temporary GO-slim IDs.

ACTION ITEM: Chris. Make use of parents rather than bucket terms to avoid confusion due to transient IDs.

Brad clarified the AmiGO pie chart maker behaviour and accepted suggestions for new features.

ACTION ITEM: Brad. Investigate piping GO-Slim mapping results to the AmiGO pie chart maker.
ACTION ITEM: Brad. Add the ability to dump AmiGO pie chart data as a flat file containing GO ID, term name and the number of gene products.

Miscellaneous

GO.bib file

During the updating of the documentation, Cath discovered the GO.bib file and asked who uses and maintains it and whether some guidelines could be drawn up for its content and usage. It was concluded that no one uses this document (let alone maintains it!) and it could be removed from the GO documentation.

GOBO

After the success of the Standards and Ontologies for Functional Genomics (SOFG) conference, Helen Parkinson (EBI) has had requests for an ontology site hosted at the EBI or at sofg.org. Michael Ashburner will talk to Helen and Chris Stoeckert about this.

Documentation

See Appendix 4E for Cath's progress report.

Cath has made significant progress in her work on the documentation. Unfortunately, Cath is no longer part of the GO team at the EBI, but she was able to do the work in her new role as part of the Outreach team. She has reorganized, rewritten and updated the documentation to make it clearer and easier for users to find the information they are looking for; to this end, she has split the information into sections relating to different GO users.
There were several action items relating to the documentation: ACTION ITEM: Member databases. Each database should send annotation FAQs from their existing documentation to Cath for inclusion in GO FAQ. GO FAQ will have general annotation FAQs and then specific FAQs from each database and from the EBI. ACTION ITEM: Everyone . Read over the new documentation (especially the style guide) and send any suggestions to Cath. This is available at http://www.ebi.ac.uk/~cath/ ACTION ITEM: Cath. The changeover to the new documentation will occur on 15 March. ACTION ITEM: Cath. Update the synonym section of format guide to accommodate the decisions made at this meeting. ACTION ITEM: Chris. Provide some documentation on the mySQL database. ACTION ITEM: Jane and John. Update the DAG-Edit user guide. Grant Proposal Judy reviewed the schedule and plan for the upcoming competitive grant renewal for the GO Consortium. We will submit our proposal to the NHGRI on March 1. We will ask for continued support for the development of the ontologies, now including the Sequence Ontology for sequence features. We will ask for continued support for the annotation of genomes and gene products to the GO by the model organism databases and Swiss-Prot. We will ask for continued support for a community database resource which includes open access to the ontologies, the annotations to the GO, and other resources and tools. Some new aspects of the project are that we will continue to work to provide the ontologies in DAML+OIL, and will provide support for pilot projects that investigate or interact with the GO in new ways. Next Meeting Host: TIGR, June 3 - 4 (no users meeting). Minutes: BDGP Appendix 2. Action items from CSH May 2002 Action Items from Cambridge Sept 2002 meeting 1. FB to use PubMed IDs instead of [or in addition to?] FBrf IDs. - DONE 2. TIGR to provide protein id --> TIGR gene ID. - DONE 3. TIGR to send IEA annotations to GO for genomes not sequenced at TIGR. - NOT DONE. 
Michelle says some of the IEA associations were being made based on incorrect GO associations and is working to fix this.
4. Cath will update documentation and circulate drafts. - see report
5. Evelyn to continue tracking down info on QuickGO concurrent assignments. - has tried. contact David Binns.
6. Consortium, especially Chris M, to revisit concurrent annotations in GO database. ?????
7. Add check for term deletion to flat file helper. - will put option in configuration manager
8. Sue will ask Danny to take over DAG-Edit maintenance. - DONE. He said no.
9. Amelia will collect bug reports and feature requests for DAG-Edit from curators. If John can't act on feature suggestions, perhaps Danny can. - DONE. SourceForge list.
10. Change prefixes to "GOC:" for definition references that represent an individual curator or group of curators. - when Brad does 11.
11. Brad will create a form where curators can enter info (e.g. name, affiliation, dbxref entered in definition reference field), and create and link a web page for each GOC:xyz entry. - new action item covering this.
12. Chris to get comments into the database. - code working. will do.
13. Add a link to the GO-Slim directory to the home page. - NOT DONE.
14. DBs to send GO-Slims and lists of all genes to BDGP. - in directory.
15. BDGP to generate tables of gene ID <--> GO-Slim term for each DB that submits a gene list and a GO-Slim. Genes lacking annotations will get "unexamined"; annotations to "unknown" will be preserved. ?????
16. Add hyperlinks to the gp2protein files: link from web page and from each gene_association file. - use docs
17. Set up "interest groups" based on subject matter; maintain a list of groups and who's in them (on SourceForge if possible -- look into this). - sort of DONE.
18. All content changes, no matter how small, should go into the SourceForge tracker for archiving purposes. Summary entries should be nice and informative. - ongoing; DONE.
19.
Set up script to email summaries from new (open) SourceForge tracker entries. - DONE. 20. Test all "protein biosynthesis" and "protein binding" terms. Apply the two-part test to all, and (for protein family or class ones) look at annotations and child terms. Circulate the list slated for obsolescence. Note: we are not going to make all "protein binding" terms obsolete yet. It would be good to determine which terms would pass the tests, though. - in progress. 21. Circulate a proposal for incorporating "gene expression" and "regulation of gene expression" terms and definitions. - decided against "regulation of gene expression"; Jane will circulate the "gene expression" def. 22. Discuss this [protein binding etc.] again at the next meeting! - DONE. 23. Propose definition for "cellular process" and discuss on mailing list. - DONE. 24. Each model organism DB should review terms under "embryogenesis" and "morphogenesis" to check for correct parentage; also figure out which ones will go under "cellular process." - in progress; mouse done. 25. TAIR curators to improve definitions of "cell surface" and its children. - DONE. 26. Change wording of GO:0030312 to "external encapsulating structure." Circulate new definition; make sure Michelle Gwinn has a chance to comment. - DONE. 27. Review all "cell wall" terms to check parentage. Plant cell wall does need to be moved. - DONE. 28. Start thinking about terms (and definitions, of course) to capture concept of boundary. - ongoing. 29. Create UniGene <--> GO file (Daniel) - DONE. 30. Add to documentation of "with" column use -- allow cardinality 0, 1, >1 for all evidence codes that use "with" at all; explain situations where cardinality 0 is allowed. - NOT DONE. 31. Annotations that use ISS, IPI, or IGI but have a blank "with" column should link to the annotation documentation (let people see the possible reasons why nothing's entered). - NOT DONE. 32. 
Each group that shares annotations should tag the ones that come from the other group(s). - coming soon. 33. Document this decision [shared annotation], and how to implement it. - coming soon. 34. Amelia will continue polishing The Script. When it's ready for prime time, it will go in the software repository, and will be run every month to generate a log to accompany the flat file archives and database releases. Decide where to put the output. - script done; need to decide where output should go. 35. set up new faq-o-matic page (Cath & Rama, with a bit of help from Chris); everyone to add faq's and answers, though Cath & Rama will probably do the most, at least at first. - content collection 1st round done. 36. EBI GO curators circulate a set of instructions for using CVS. - DONE. 37. Progress report for current grant. - DONE. 38. Prepare renewal grant application. - in progress. 39. Prepare a site with mock-ups of GO web pages derived by splitting up the current home page sensibly. - NOT DONE. Appendix 4A. Email from Tanya Berardini Cellular process issues (from Tanya): Subject: Cellular process issues for St.Croix Hi everyone, Here are a few issues that I think would be good to address at the meeting. David will be attending, while I won't be able to make it. 1. cell differentiation vs. cell fate commitment right now, these terms are siblings cell differentiation: The process whereby relatively unspecialized cells, e.g. embryonic or regenerative cells, acquire specialized structural and/or functional features that characterize the cells, tissues, or organs of the mature organism or some other relatively stable phase of the organism's life history. ref:ISBN:0198506732 cell fate commitment: The commitment of cells to specific cell fates and their capacity to differentiate into particular kinds of cells. 
Positional information is established through protein signals that emanate from a localized source within a cell (the initial one-cell zygote) or within a developmental field. ref: ISBN:0716731185

2. response to endogenous stimulus and response to exogenous stimulus
Move to be children of physiological process/add physiological process as additional parent? Right now, they are children of cell communication.
response to endogenous stimulus: The change in state or activity of a cell or an organism as a result of the perception of an endogenous stimulus. ref: TAIR:sm
response to exogenous stimulus: The change in state or activity of an organism (in terms of movement, secretion, enzyme production, gene expression, etc.) as a result of the perception of an external stimulus. ref: FB:hb

3. cell_type development vs. cell_type differentiation
Do we need both terms? Are they meant to describe different things? (e.g. pole cell development vs. pole cell differentiation) Check out the children of cell differentiation for a sample.

Thanks,
Tanya

Appendix 4B. Email from Aubrey de Grey

Term grammar (from Aubrey de Grey):
Subject: GO grammar

Hi Midori,

Am I alone in feeling that the GO ontologies are grammatically challenged? It seems to me that the terms in each of them should be such that a sentence of the form:

  It encodes a[n] [function] involved in [process] which is localised to the [component]

should always read properly, but in fact one gets things like:

  It encodes a heme binding involved in nutritional response pathway which is a component of the extracellular.

as opposed to:

  It encodes a heme binding protein involved in nutritional response which is a component of the extracellular space.

I care about this more than most because I construct such sentences automatically from GO data in FlyBase as part of the summary paragraphs that appear in the gene records.
But I think it looks decidedly untidy even when the terms are presented in tabular form, and it would probably take only a couple of hours' work to correct the common ones. Becky saw my point and suggested I mention it to you. What do you think?

Cheers,
Aubrey

reply:

Hi Aubrey,

I'll put this issue on the agenda for the GO meeting, since it's coming up so soon anyway. I don't think there will be any objection to adjusting the 'pathway' process terms, or the cellular component terms, since it will help with sentence generation, and won't hurt for any other purpose.

We absolutely _cannot_ make alterations such as 'heme binding' --> 'heme binding protein' in the function ontology. None of the ontologies includes terms representing gene products; rather, we did and do put a lot of effort into keeping gene product names (whether specific, like 'actin', or generic, like 'protein') out of GO. GO terms also do not represent what a gene product is (or is made of), but what it does and where it is found. Function terms represent activities, not entities.

It seems to me that it would be straightforward to adjust the sentence generation to accommodate function terms as activities rather than molecules, e.g.

  It encodes [an RNA|a protein] with [function] activity involved in ...

We would then be willing to fix any function terms that caused this construction to go awry.

reply to above:

Very good point re function - and very nice suggestion for the sentence structure. That's what I'll do. On a quick browse, the only group of function terms that would be a bit broken by your sentence structure are ones that end in "factor" (guanyl-nucleotide exchange factor, etc), and for them I guess dropping "factor" would actually be in line with the policy you describe. Great if you can adjust the process and component ontologies.

Appendix 4C. MGI Excessive granularity document

Excessive granularity.

As we add and refine terms in the ontologies, we need to keep two things in mind.
First, the terms should be as organism non-specific as possible. Secondly, the terms should be as meaningful as possible. As put forth in the last GO meeting, there are several branches of the GO that seem to have expanded unnecessarily. These fall into three broad categories: Protein Binding, Biosynthesis, and Regulation.

1. Protein binding:
In using the term GO:0005515, Protein binding, we can make use of not only the GO ontology structure itself, but also the attributes/qualifiers used in linking a term in the ontologies with a gene product, which are included in each annotation line supplied in a gene_association.db file. A good example is the use of the term "protein binding". This term can be qualified with both an evidence code and the "with" field. The combination allows a curator to record that a gene product binds to a specific protein product. The "with" field is intended to house a sequence identifier or db identifier pointing to a specific protein. Therefore, there should be no need to populate the GO with specific children of protein binding. However, that is not to say that, in those instances where there may be ambiguity, we cannot have a child that describes binding to a product family. For example, actin binding, stat binding, etc. These can be used when the specific gene product is not identified.

Amel, amelogenin F GO:0005515 protein binding IPI SWP:Q9CRG8

In this example, amelogenin was shown to bind to Q9CRG8, the protein specified by Bat3, HLA-B-associated transcript 3.

Acrp30 adipocyte complement related protein of 30 kDa F GO:0005515 protein binding IPI SWP:Q60994

In the example above, the protein Acrp30 is shown to bind to SWP:Q60994, Acrp30; thus, the statement demonstrates that the protein oligomerizes.
Ablim1, actin-binding LIM protein F GO:0003779 actin binding IDA

In this example, the actin-binding LIM protein was shown to bind actin, but the actual gene product was not specified (in mouse, there are several actins: actin, alpha 1 (Acta1), actin alpha 2 (Acta2), actin beta (Actb), actin alpha, cardiac (Actc1), actin gamma (Actg), and actin gamma2 (Actg2)). The term GO:0005515 would not be sufficient, since the "with" field could not be specified. However, in this case the GO:0003779 term allowed sufficient granularity in the annotation.

Another example uses GO:0005518, collagen binding.

Gp6, glycoprotein 6 (platelet) F collagen binding IDA

In this example, glycoprotein 6 was shown to bind (a) collagen. However, in the case of

Mrc2, mannose receptor, C type 2 F collagen binding ISS EMBL:AF107292

a human ortholog of the murine mannose receptor was shown to bind collagen. In this instance, it is NOT the mouse protein that was assayed, so it would be inappropriate to use the human binding target. However, we infer that because the paper shows that AF107292 is the human ortholog of the mouse protein, we can assign the collagen binding function. Again, because a suitable child existed for the protein binding term, we can capture the protein binding function with more granularity than would otherwise be possible.

Therefore, in most cases, the use of the "with" field in combination with the IPI code is sufficient to annotate binding of one protein to another. It is therefore not necessary to consider creating protein-specific terms (eg, interleukin 1-15 binding) to capture the information.

2. Biosynthesis
As maintained before, the notion of Protein Biosynthesis should mean specifically the building up of a polypeptide by translation. Any other fate of the protein, such as post-translational modification, etc. is NOT part of "Protein Biosynthesis". The use of the term Biosynthesis to include other metabolic fates is misleading.
Protein biosynthesis is already itself a child of metabolism. Thus, adding terms such as "biosynthesis of protein X" as a term to mean anything affecting the appearance/level of protein X is not useful. If a gene product affects the translation of protein X, then the gene product's annotation should be to a specific term under protein biosynthesis (initiation, elongation, etc.). If the gene product affects/modifies a post-translational modification, etc., then it should be annotated to those processes.

3. Regulation
Additionally, terms are arising in several notes concerning the regulation, both positive and negative, of particular processes (biosynthesis, phosphorylation, etc.). Terms exist for the negative/positive regulation of phosphorylation/whatever of specific_protein_family_member X, X+1, etc. Is this granularity necessary? Would it be sufficient to have negative/positive regulation of phosphorylation/whatever, period, or of protein_family?

Protein Biosynthesis Example 1
protein biosynthesis [GO:0006412]
amino acid activation +
charged-tRNA modification +
**glycoprotein biosynthesis +
CD4 biosynthesis +
FasL biosynthesis +
protein amino acid glycosylation +
*integrin biosynthesis + and children
**lipoprotein biosynthesis and children +
**mannoprotein biosynthesis and children +
*MHC class I biosynthesis and children +
*MHC class II biosynthesis and children +
*neurotransmitter receptor biosynthesis
non-ribosomal peptide biosynthesis
regulation of protein biosynthesis +
regulation of translation +
TRAIL receptor biosynthesis + and children
translational elongation +
translational initiation +
translational termination +
viral protein biosynthesis

Biosynthesis Example 2
immune response
cytokine metabolism
cytokine biosynthesis
chemokine biosynthesis +
connective tissue growth factor biosynthesis +
granulocyte macrophage colony-stimulating factor biosynthesis +
interferon type I biosynthesis +
interferon-gamma biosynthesis +
interleukin-1 biosynthesis [GO:0042222]
regulation of interleukin-1 biosynthesis +
interleukin-10 biosynthesis +
interleukin-11 biosynthesis +
interleukin-12 biosynthesis +
interleukin-13 biosynthesis +
interleukin-14 biosynthesis +
interleukin-15 biosynthesis +
interleukin-16 biosynthesis +
interleukin-17 biosynthesis +
interleukin-18 biosynthesis +
interleukin-19 biosynthesis +
interleukin-2 biosynthesis +
interleukin-20 biosynthesis +
interleukin-21 biosynthesis +
interleukin-22 biosynthesis +
interleukin-23 biosynthesis +
interleukin-24 biosynthesis +
interleukin-25 biosynthesis +
interleukin-26 biosynthesis +
interleukin-27 biosynthesis +
interleukin-3 biosynthesis +
interleukin-4 biosynthesis +
interleukin-5 biosynthesis +
interleukin-6 biosynthesis +
interleukin-7 biosynthesis +
interleukin-8 biosynthesis +
interleukin-9 biosynthesis +
regulation of cytokine biosynthesis +
TRAIL biosynthesis +

Regulation Example
regulation of tyrosine phosphorylation of STAT protein
positive regulation of tyrosine phosphorylation of STAT protein
positive regulation of tyrosine phosphorylation of Stat1 protein
positive regulation of tyrosine phosphorylation of Stat2 protein
positive regulation of tyrosine phosphorylation of Stat3 protein
positive regulation of tyrosine phosphorylation of Stat4 protein
positive regulation of tyrosine phosphorylation of Stat5 protein
positive regulation of tyrosine phosphorylation of Stat6 protein
positive regulation of tyrosine phosphorylation of Stat7 protein

Appendix 4D. MGI Negation document

NOT Protein Binding

Background: The GO term "protein binding" (GO:0005515) is used in the function ontology to specify that a gene product binds to another protein. It is used with the IPI evidence code and the "with" field to indicate the specific protein that the annotated gene product binds to.
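The convention in the Background paragraph (term "protein binding", evidence code IPI, and the "with" field naming the partner) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Annotation record, its field names, and the helper functions binding_partner and describe are hypothetical and are not the layout of the real gene association files; the Arl6ip and Ablim1 values are taken from the examples in these appendices.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    """One simplified annotation record (hypothetical layout, not the real file format)."""
    symbol: str                    # gene symbol, e.g. "Arl6ip"
    aspect: str                    # "F" (function), "P" (process) or "C" (component)
    term: str                      # GO term name
    evidence: str                  # evidence code, e.g. "IPI"
    with_id: Optional[str] = None  # "with" field: identifier of the other protein


def binding_partner(a: Annotation) -> Optional[str]:
    """Return the partner identifier for an IPI "protein binding" record, else None."""
    if a.aspect == "F" and a.term == "protein binding" and a.evidence == "IPI":
        return a.with_id
    return None


def describe(a: Annotation) -> str:
    """Render the record as a short sentence for a curator to read."""
    partner = binding_partner(a)
    if partner is None:
        # No specific partner recorded: fall back to the term and evidence code.
        return f"{a.symbol}: {a.term} ({a.evidence})"
    return f"{a.symbol} binds {partner} ({a.evidence})"


arl6ip = Annotation("Arl6ip", "F", "protein binding", "IPI", "SP:O88848")
ablim1 = Annotation("Ablim1", "F", "actin binding", "IDA")
print(describe(arl6ip))
print(describe(ablim1))
```

The point of the sketch is that the partner is carried by the "with" field rather than by a protein-specific child term, so a family-level term such as actin binding needs no partner at all.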
In the examples below, Arl6ip has been shown to bind to Arl6 (SP:O88848), and Cdc42 has been shown to bind Cdc42ep5.

Arl6ip ADP-ribosylation-like factor 6 interacting protein F protein binding IPI SP:O88848
Cdc42, cell division cycle 42 homolog (S. cerevisiae) F protein binding IPI SWP:Q9QZT9

The "not" qualifier has been provided for documentation of experiments that were designed to test a hypothesized function, cellular localization, or proposed participation in a biological process. For example, a protein product has homology to chitinase; however, experiments performed on the isolated protein demonstrated that the protein did NOT have chitinase activity.

Chi3l3 chitinase 3-like 3 F NOT chitinase IDA

Dilemma: In some experiments, protein binding to a specific protein has been shown to not occur. In the example below, a publication demonstrated that the gene product of Akap9 specifically binds one protein, but not the other.

Akap9 A kinase (PRKA) anchor protein (yotiao) 9
C cytoplasm IDA
F NOT protein binding IPI SWP:Q62348
F protein binding IPI SWP:Q9QZE7

However, in this instance, the use of the "NOT" may be confusing, as the GO term "protein binding" is (probably) meant to be very broad (it has no definition), and does (may) not imply "binding to a specific protein". For example, immunoprecipitation experiments could demonstrate that a particular gene product is associating with other proteins, but the proteins have not been identified. In this case, the "with" field may have to be left null. The risk, however, is that the "not" could be misinterpreted to mean that this gene product does NOT have the function of binding to a protein.

Generally, when an assertion can use the "with" field, the annotation still makes sense if that field is blank. For example, when an ISS evidence code is used, but the accession number is not known, leaving the "with" field blank still means that the annotation was made based on sequence similarity.
Another example is when the IMP evidence code is used. If the assertion is based on a specific mutant allele, it is possible to add a database identifier to the "with" field, when known. However, if the assertion is based on an RNAi experiment, the "with" field is often left blank. In these cases, the annotation makes sense even if the "with" field is blank. A problem can arise, however, if the "not" qualifier is used with protein binding and IPI. If the "with" field is left blank, the assertion reads that the gene product does not bind protein. Note that this is not a problem when a gene product can be annotated to one of the children, such as "actin-binding" (does NOT bind actin).

Proposal
We would still like to be able to capture this type of experiment, as it can provide information about the properties of the gene product. Therefore, it might be useful to create a term, such as "specific protein binding", as a child of protein binding. All/most of the current children of "protein binding" would then be moved to be children of the new term. The "not" qualifier would never be used if the "with" field is left blank. For example, the entry for Kdr is shown below:

Kdr, kinase insert domain protein receptor F NOT specific protein binding IPI SP:P97946

The interpretation should be that Kdr does not bind specifically to Figf (c-fos induced growth factor). A second example is Akap9:

Akap9, A kinase (PRKA) anchor protein
F NOT specific protein binding IPI SWP:Q62348
F specific protein binding IPI SWP:Q9QZE7

This paper demonstrated that Akap9 did NOT bind to Tsn (translin), but DID bind to Tsnax (translin-associated factor X).

Appendix 4E. Documentation progress report (from Cath)

GO DOCUMENTATION: PROGRESS REPORT

OVERARCHING PRINCIPLES
* Make as much as poss.
comprehensible to broad audience
* Make it clear what audience each doc is aimed at
* Avoid redundancy wherever possible to make pages easier to update (FAQ is an exception to this principle: aim is to provide info with as few clicks as possible)

STUFF FOR THE GENERAL PUBLIC

An introduction to GO
* Purpose is to provide an overview that is clear and useful to first-time users. Links to more detailed documents make it useful for curators and annotators.
* Includes General documentation up to Data representation section: defines what's covered (and what isn't) in each ontology, plus the basics of DAG structure (will redo diagram so it looks better on the web)
* Replaced 'data representation' with a blurb about what file formats we produce and where you can download them from. Also includes a para on GO slims.
* New section that puts GO in the context of ontologies in general: discusses GOBO and the new list of ontologies on the MGED site; discusses cross products, and discusses mappings to other classification systems.
* 'Contributing to GO' points to the sourceforge site and to the mailing lists.
* Still needs a link to the FAQs.

FAQs
* Have kept html v. simple and not added comprehensive contents list yet because these will be pasted into FAQomatic and this might do some of the formatting for us.
* Where are the main gaps and who can provide material to fill them?
* What other sections do we need and what order should the sections be in?
* Do we need a section on annotations to each of the MODs or should these be dealt with by the MODs' own FAQs? If so, GOA FAQs could be moved to GOA web page and we could just provide links to each MOD's FAQ.
* Should the order of questions in any of the sections be swapped around?
* Need to install the faqomatic (http://faqomatic.sourceforge.net/fom-serve/cache/1.html). Can Chris do this?
* Do we need someone to be in charge of the FAQ or should it be a free for all?
* Can each of the people who provide questions check them: in some cases I've made them more general and added bits.

GO style guide (or should it be the GO content guide?)
* Revamp of GO usage guide. Purpose is to explain not only how the ontologies are created and edited, but also the rationale behind why we do it this way.
* I wrote it as a practical guide for curators but I don't think it's working.
* Starts off at the level of terms and what we can do with them; then moves up a level to relationships between terms; then deals with whole ontologies and the rules specific to each one.
* Should we split the purely philosophical (more in tune with the original purpose of this doc) from the purely practical? If we did this, people who aren't part of the consortium but want to know more about why we do it that way would have something deeper than the intro.
* If we did this, should we merge the purely practical stuff with the format guide? This would avoid the constant cross-referencing between docs.
* Things that have been added include:
  Much more comprehensive contents list
  link to full list of database cross-references
  Amelia's list of 'standard' definitions
  Clearer guidelines on sensu

GO format guide
* Main aim to help anyone who wants to parse the files; but also of use to curators because you always end up tweaking the flat files at some point.
* As with the style guide, it starts off at the level of terms, then moves up to relationships between terms and the structure of entire files.
* Not sure what to do with the stuff on the bibliography.
* Need someone to write up something on structure of mySQL files, or provide a link if this is already there on the godatabase site.
* Added:
  Jane's syntax for comments
  More on how to use sensu.

Stuff that I haven't done... (mention at this point that I'm now full time outreach)

Publications on/about GO
* Update; reverse order so most recent first.
STUFF FOR CURATORS

CVS user guide for curators
* new doc; I'm happy to edit this but I need someone to write it. Volunteers?

DAG-Edit User Guide (Jane's doc)
* Does this need to be more visible from the front page now that other ontologies are using it more and more?
* Jane to add some info on creating cross products
* Needs formatting in same style as rest of docs; I'm not going to do anything else with it.

Dummies' guides
* Each group to maintain their own local 'dummies' guides': SGD and EBI now have these.
* I'll turn the EBI one into HTML and leave it on my website (or the website of one of the other curators?). I'll have to pass over the responsibility of updating this to someone else.

STUFF FOR ANNOTATORS

GO Annotation Guide
* Computational annotation methods need updating.
  FlyBase (Becky Foulger)
  SGD (Karen Christie)
  MGI (Harold Drabkin - has already sent info to Midori)
  TAIR (Suparna has sent)
  WormBase ?
  PomBase (Val Wood)
  RGD ?
  DictyBase (Rex Chisholm)
  PSU (Matt Berriman)
  Gramene (Pankaj Jaiswal)
  GKB ?
  EBI (Daniel Barrell)
  TIGR (Michelle Gwinn/Linda Hannick)
  Compugen (Liat Mitz/Han Xie)
  AstraZeneca (Courtland Yockey)
  Incyte (Lisa Matthews)
* I'll add the ones I've got but then I'd like to hand over to someone else.
* Need to document standard operating procedures for shared annotations (tag annotations that come from other groups).

Appendix 5. Collected action items from this meeting

Action Items, St. Croix January 2003
1. TAIR. Update MetaCyc2GO mappings.
2. John. Action item 7 from last time [add term deletion feature to DAG-Edit].
3. Brad. Action items 10 and 11 [adding more information about GO curators to website/database] outstanding.
4. Come up with system for notifying developers of format changes.
5. Add "contributed by" column.
6. Curators. When adding new synonyms, track which type they are. If they are 'broader than' or 'narrower than', consider whether it calls for a new term.
7. Jane. Circulate synonym list again.
8. BDGP.
Look into rules that could be worked into DAG-Edit to make synonym maintenance easier. 9. Jane. Discuss this with UMLS and fill us in on the results. 10. David and Tanya. When splitting out multicellular v/s unicellular processes, make the split as far below 'physiological process ; GO:0007582' as possible, and as and when needed, rather than splitting right below physiological processes. 11. GO editorial team (and others). Start removing grouping terms slowly and carefully with all the usual communications. If obsoleting a term, ensure the corresponding process or component exists. 12. Jane. Add activity to function term strings. 13. GO editorial team. Define extracellular to include outside a virus particle, then use host terms as parents for the appropriate virus cell component terms. 14. GO editorial team. Go through the enzyme complexes (see also SF entry 535294) and where applicable, make a general parent directly under 'cellular component' with children in specific locations. 15. GO editorial team. Get a list from Aubrey of ill-fitting GO terms and evaluate; adjust terms as needed. 16. Announce on the website that we'll implement this solution at some future date (no date set but will be 6+ months from now). Assemble a group (MA, Chris, David) to work on the implementation. 17. Midori. Put up interest groups on web page. Everybody to send group ideas & which they volunteer for. See if it works or if we need further formalization by putting groups in SourceForge. 18. Construct and post a user survey covering tools, AmiGO, etc.. Send question ideas to Amelia Ireland. It will be sent out to GO-Friends and data collected in time for the grant application. 19. Chris. Suggestion: a daily release of a separate database containing just terms without annotations. The whole database should be updated every month. AmiGO would have the option to view the up-to-date term set with no associations. 20. Chris. 
Make use of parents rather than bucket terms to avoid confusion due to transient IDs.
21. Brad. Investigate piping GO-Slim mapping results to the AmiGO pie chart maker.
22. Brad. Add the ability to dump AmiGO pie chart data as a flat file containing GO ID, term name and the number of gene products.
23. Member databases. Each database should send annotation FAQs from their existing documentation to Cath for inclusion in GO FAQ. GO FAQ will have general annotation FAQs and then specific FAQs from each database and from the EBI.
24. Everyone. Read over the new documentation (especially the style guide) and send any suggestions to Cath. This is available at http://www.ebi.ac.uk/~cath/ .
25. Cath. The changeover to the new documentation will occur on 15 March.
26. Cath. Update the synonym section of format guide to accommodate the decisions made at this meeting.
27. Chris. Provide some documentation on the mySQL database.

1 Number of genes with at least ONE GO term of any kind.
2 Decreased due to movement to obsolete. This also holds for Interpro and EC to GO.
3 This figure has decreased due to our ongoing efforts to replace these with literature based annotation.
4 Cynthia Smith, Cathleen Lutz, Carroll Goldsmith, Teresa Chu, and Alan P. Davis

===================================================================

TIGR 20030603

Group reports

Most of these reports were provided as written material and more details from the individuals can be found there. We reviewed these quickly and so the comments here are simply those that I happened to catch in passing.

FB report (Becky)
In addition to existing GenBank/Swiss-Prot sequence curation and a paper-by-paper approach to literature curation, GO-annotations are being done on a gene-by-gene basis to fill in holes. GO data from gene models that were split or merged in the Release 3 genome reannotation have been mostly re-partitioned.
Chris Mungall has also given Becky a list of genes where the coding sequence has changed; GO data for these gene models has started to be assessed.

SGD report (Karen Christie)
Currently pushing to remove IEAs; all gone but those for about 60 Ty encoded ORFs. Microbial Structure Ontology was described (for fungal structures): Judy asks whether they are interested; Karen says yes, they are interested, and some groups (Neurospora, Aspergillus) are already participating. Mike: small community, not a lot of money to sustain it, applying for grants now. Aspergillus already has a database set up (in Manchester I think). Cross-reference to SGD for annotation: Candida, Aspergillus, Neurospora. Chandra, Eurie, and Maria have initiated this and it is now a part of OBO. In addition to Maria Costanzo and Jodi Hirschman, who are attending a GO meeting for the first time, SGD welcomes another curator, Rob Nash.

MGI (Harold)
RIKEN annotation increased total genes annotated by 37%, 3300 genes; this was done by inheriting GO annotation done on Riken clones. These came in mostly as ISS or TAS evidence codes. They are developing (or collaborating on) three ontologies: GO, anatomy, and phenotype, all with a common structure. This allows the use of common tools such as the DAG-editor and ontology browser. The MGI GO browser now displays the comment field (important for MGI annotators and users). Changes to software were implemented so that users don't get links to obsolete GO nodes. These include enhancements to the editorial interface, and automatic removal of obsolete terms being assigned via SP2GO and IP2GO translation tables. Editorial interface enhancements were needed to aid reannotation of genes mapping to obsolete terms, because they go live within 24 hours of any changes. Much of this is to assist with keeping the annotations up to date.
MGI can now track original source of a GO annotation, to help track when a curator has manually changed an annotation that was originally obtained from dataloads. An additional enhancement they have added to the interface is the inclusion of a GO marker notes field to supplement the notes field associated with each individual annotation. The new notes field is meant for notes pertaining to the state of annotation rather than notes about the marker itself. Other software development: A version of the GOTermFinder is being developed at MGI and is available at http://www.spatial.maine.edu/~mdolan/MGI_Term_Finder.html BDGP (Suzanna speaking for Chris Mungall) Chris gave a talk on slots and cross products at Genome Informatics; there is concern, which we share here that this will make things overly complex. That is why we will prototype first and Chris has started work on this. He is also looking at third party tools. Chris proposes that the BDGP software folk meet with MGI (aka David and Joel, etc.) prior to next GO meeting for first implementation of properties (also for other ontologies). Since the words 'slots' and 'properties' are synonymous GO will go with the word properties (and properties have values) DAG-Edit: next version (1.4) will support properties, which means we need a new flat file format (with tag-values). We will still have some backward compatibility with existing flat file format. Not using XML solely because it is not that easily readable by humans GO-Slim - needs to know which slim files go with which annotation files GO Database - monthly loads are now more regular and reliable with a new QC procedure, also daily loads of ontology terms with no QC, is now storing more data in GO database (not yet available in AmiGO). The 'with' column is now "fully normalized". He did note that not everyone is providing a gp2protein file - really need to have these from everyone who is providing an annotation file. This was added as an action item. 
Karen Eilbeck is joining the Berkeley group to work on SO.

TAIR (Suparna)
Annotation Update: The complete set of numbers is in the handout. The rate of annotations is about 150 genes a month or 2 genes a day per curator. To the GO ontology itself, they have added about 150 new terms since last meeting. They have updated their gp2protein file recently. The main TAIR database is at NCGR in Santa Fe. The GO associations from Carnegie at Stanford are updated weekly. TAIR held a very successful literature curation meeting at TAIR in March 2003.
Updating MetaCyc2GO file: The mapping to MetaCyc has problems (going to function instead of process). Approximately 80 new pathways have been added (50 have existing GO terms, need about 30 more terms to complete mapping). TAIR (as personified by Suparna) are now updating the mappings from MetaCyc pathways to GO functions and, once finished with this task, will pass the mappings on to GO central (Amelia) to check for errors. They have updated the web site: can now search for genes with GO terms as keywords; added an Evidence description to give more info about the experiment. They are also developing a new ontology browser. Along with this, they are also developing a GO awareness campaign for the Arabidopsis community. Lukas Mueller will be going to Cornell and running the Solanaceae database.

TIGR (Linda)
* Arabidopsis: At TIGR this project is going through and renaming gene products correctly. Funding will end in fall and TIGR will then turn over all A. thaliana data to TAIR.
* T. brucei: This annotation effort is still active and progressing. (Michelle)
* Bacillus anthracis and Coxiella burnetii: just released annotation files of the gene products to GO terms.
* Other prokaryotic genomes: These gene products have been annotated with GO terms and are awaiting publication to be released.
TIGR uses 'Manatee' to assist curators in GO annotation.
This adds new GO search capabilities that assist the curators in fully annotating prokaryotic and eukaryotic genomes. Manatee is available on SourceForge, but it depends on the TIGR database schema. BDGP and TIGR may add an Apollo connection to Manatee. Rex/Dictybase are interested in this as well. They don't have gp2protein files for all prokaryotes, in some cases because the data was not available when the annotation file was released.

Wormbase (?)
They provide biweekly updates for the public, including GO annotations. They also raised one question regarding the cardinality of evidence codes to annotation. There followed a discussion about whether multiple evidence codes belong on one row or in individual rows (one row per evidence code).
Resolved: The decision was to do the latter and make the cardinality one to one.
Final question was whether conference abstracts are legitimate references.
Resolved: Yes! Conference abstracts can serve as references.

Dictybase (Rex)
They are working for a late June official release of DictyBase (based on SGD's code and schema - special thanks to Mike and all). This release will include 1800 loci with 8949 GO annotations (all except 40 to IEA). They now have two full-time curators (Petra and Pascale), and one new programmer (who will start in July). This developer can help John out since he is experienced with Java and is partially funded by GO. Suzi to ask John to contact Rex (done).

Gramene (Pankaj)
They will be making a new release in late June, with 4500 new non-IEA gene associations. Most of their recent focus has been on curating mutants and phenotypes. They are working with other databases on mitochondria and chloroplasts. Likewise, they are working with the rice database on nomenclature issues. They are also now working with Maize people to try to get a gene association file for maize incorporated.

GO-editorial (Mostly Jane, with a soupcon of comments from Midori)
Amelia has created a nice digest that is available monthly.
This digest summarizes: new terms, obsoletes, new definitions, basic data on changes, and links to appropriate SourceForge entries. It is kept on the ftp site. Please send suggestions for improvements to Amelia. There will soon be a cron job that mails an announcement of each new digest to go-friends (AI). Component terms have increased quite a lot (with effort by BRENDA group to create complex terms for enzyme complexes). In addition, we now have definitions for 78% of terms (yeah!!). They have brought the GO synonyms file up-to-date. Molecular function terms now have the word 'activity' as part of the term name. They have generated a list of obsolete terms with suggestions for remapping for review. There are new web page drafts for review (presented later in meeting). Interest groups - not linked to anything. GOA (Evelyn) We can look on the GOA web site for latest statistics and news (http://www.ebi.ac.uk/GOA). They have produced three releases since February. They have now updated their associations file to include the source of the annotation, so credit/blame can be assigned appropriately. They have also now integrated the manual annotations from other sites (fly, MGI, and SGD). The HAMAP group at SIB, Geneva is working on a HAMAP2GO mapping and may be involved in manual GO annotation of Swiss-Prot microbial proteins. In total, the GOA project has released more than 3 million annotations to 600,000 proteins. They have also written two papers about GOA (and GO) and two more are planned for this year. Big news is that LocusLink is now using the GOA annotations. SwissProt will henceforth be responsible for updating the former Proteome annotations to GOA annotations. Evelyn is just back from an ontology workshop in Japan. Edgar W from Transfac was one of the chairs. Evelyn spoke about GO and GOA and got a favorable response. She found that many Japanese were aware of GO but were generating other ontologies (cell types and anatomy) that are up and coming on the OBO site.
Some groups had developed ontologies similar to GO, because they didn't seem to realize GO existed or didn't realize they could (and should) request new terms. Because of this, she raised the question of how we will decide which ontology will go into OBO (this is a good question). Who decides which cell-type ontology will be the standard? Answer: Michael will probably just decide by fiat. SwissProt is going to be employing two fulltime GO annotators. Interviewing begins in July. Daniel is thinking about doing a release of the GOA database (which is in Postgres) for the general public (from Suzi, he should get in touch with Chris in case we end up switching to Postgres here as well for GO). Incyte They are doing manual annotation (a la Proteome) using weekly updates from LocusLink and GenBank. The statistics for their annotations are in the handout. They have also restarted monthly term suggestions for GO terms. They are doing new product development - BioKnowledge Retriever. This will include two new ontologies (mammalian disease, mammalian expression) and they are interested in making these public and in working with other groups to develop them. This is something to consider for OBO. Maize MaizeDB will cease to exist in 3 months, now called maizeGDB. He is here to learn because they are just getting started with GO. The URL for the new maize database is http://www.maizegdb.org/ RGD They have generated about 3000 annotations (distributed equally at ~1K per GO ontology). They are working on using the GO terms (building their own GO browser) for gene search strategies.
Their browser will be utilizing GO terms as part of search strategies to identify genes, including genes annotated to a term's descendants. They have a disease specific orientation and want to utilize other ontologies to organize this type of data in RGD. Pathogen at Sanger Next time. Annotation Issues IEA TIGR is using an HMM scoring function for assignments and since this is more sophisticated than keyword matches they would like a means to add quality information to IEA. Someone pointed out that this is also true for multiple alignments. David says the appropriate thing to do is to use different references for different types of analysis. Suzi says that this argument can also be extended to all evidence types, as discussed before. David suggests extending filtering in AmiGO to also qualify the query to IEAs with certain references. (Brad, another issue is the bulk of IEAs. Too slow for web interface when IEAs are loaded.) TAIR solution is evidence description, but this is internal. MA: if a db wants an internal one then they can. David: we have reference. Midori: GO reference refers to GO pub. Added 3 action items below: BDGP needs to implement filter, group needs to establish a collection of references to methods, BDGP also needs to explore ways to deal with size explosion of associations other than omitting IEAs from AmiGO. Suspect annotations Rex: if inaccurate annotations that came from another site are discovered at one site, that site can't change/fix the annotation because it didn't arise there. I second this because maintaining high quality in the associations is one of the main utilities of GO; people use it as the default golden reference set. Rex noticed an actin with motor activity, easy to notice. How are we to do this? Judy: what can the group as a whole do to help? Midori: the owner has to make the correction. How can notification that a correction is required occur? MA: every MOD has a mechanism in place to receive and make corrections.
Question is, do we begin to build association quality assurance tools to detect these. Gp2protein could be used together with BLAST, using best-hit match and flagging discrepancies in associations. Suzi: GOST tool can be used for new annotation. Karen E: how many levels up the tree is acceptable-any number. David: incompleteness of annotation is also an issue. MA: it would help even more if this tool were available and used during the process of annotation. Another means of improving quality is by adding the ability to file error reports directly from AmiGO pages. Three action items added below. late addendum from Evelyn: Concurrent Assignments tool from EBI, Manatee has something similar; AI for AmiGO to be able to do this type of thing, (Amazon-like: others who annotated to this also annotated...). More 'rules' for annotation Midori: The current rules are broad and do not contain specific guidelines for handling of every situation. Just make suggestions, best practices. How do you identify common proteins. Evelyn: amigo needs concurrent assignments. Midori: oral tradition is now written down. The rule is that we are annotating to potential. Long discussion of potential. Amelia: slide show. Solution is to use the word intrinsic to distinguish regulator activity versus extrinsic regulator activity. Harold: function is not necessarily an attribute of gene product, it can also be applied to complexes. Jane: Is transient activity okay? Yes. MA: complex should have a defined stoichiometry. Karen: Is there an issue with counting # of subunits? No. Midori: the point is not to have a component term for every ImmunoPrecipitation-able agglomeration; "defined stoichiometry" doesn't imply identical subunit composition between species Resolved: use the word intrinsic to distinguish regulator activity (regulatory function that occurs when the gp is part of a complex) versus extrinsic regulator activity and to change the relationship type to is-a. 
- CDK-cyclin example - start including the word 'intrinsic' for the regulator activity to clearly indicate that it is part of a complex, without which the kinase activity of the CDK kinase subunit is not active either. Jane's item Resolved: Binding stands alone (not binding activity). Treemap demo (Eric Baehrecke) He is interested in steroid activated programmed cell death signaling, both fly and human apoptosis. Ben Shneiderman is a software person who is interested in information visualization (hyperlinks, Spotfire, Treemap) and has developed a strategy for analysis of genome data using GO and Treemap displays. The components of the tool include: a GO parser, a parser for genome data, and a view in Treemap. The visual variables that may be controlled are color hue, color intensity, and area of the rectangle representing the data. Eric B will look into what they need to do in order to enable us to link to Treemap from the GO Tools list. See http://www.cs.umd.edu/hcil/treemap/ for more information. Properties implementation Group seemed to feel that the most important priority is completion of software for direct saves to the database. Believe that this will assist implementation of properties. Did allow that John's proposal for new flat file format looks good and useful and since most of this work is already done, it will be good to have around. We also agreed that the existing flat file format would never go away, although property information may be lost in a direct conversion. Brad's report Brad described STAG, which is an SQL templating system. It returns SQL query results as XML dumps. A generic piece of Perl software uses the template to generate the query. Machete is a software package that sits on top of STAG. It is a lightweight Perl application that maps CGI parameters to the proper SQL, HTML, and XML templates. It uses a library of templates to replace the current Perl API.
This will result in all SQL queries, HTML pages and XML transformations being maintained as a library of templates. This will allow future generations of AmiGO to be flexible, expandable, customizable and portable. As the GO schema becomes more integrated with Chado it will allow more types of queries across a wider collection of data in the future. There was quite a bit of interest. Both Rex and Judy are interested in having Chris and John talk to their counterparts on the technical staff at DictyBase and JAX. Proteasome or part of relationships Different forms of part of: David: if it is always there then it is a child. Distinguish between those that never change and those that vary, where only the child will have that part. Midori: We all agree that there must be multiple is-a children for complexes of different composition, which is clear. The question we need to address is what to do when the composition is the same. David: If we don't know, it is safer to create two subtypes. Judy: is-always-found-there and is-it-the-same are two separate questions. Conclusion for the first question was to change the documentation to not require part-of to mean always. Different subunit composition implies different terms. If the composition is identical, then this is a single term and multiple parents are allowed (nay encouraged). In other words, complexes that have the same composition in all locations may receive multiple parentage; however, complexes that have varying compositions need separate terms with the specific localizations. In the future, we may need to add more sub-types for things like myristoylated or phosphorylated forms of the compound. Physiological processes David: This area needs revisions and continued discussions. We will create an interest group to handle the reorganization/structuring of the physiological process node of the biological process ontology.
Those interested should contact Tanya (tberardi@acoma.stanford.edu) or David (dph@jax.informatics.org). The group will meet and discuss via email and present a report at the next meeting. Proposed top nodes right under 'physiological process' are 'organismal physiological process' and 'cellular physiological process'. Behaviour Peter Midford in Arizona is already working on behaviour ontologies for loggerhead turtles and jumping spiders, and we feel this level of detail seems to be beyond the scope of GO. However, there still needs to be some descriptive capability for behaviour within GO, both for Drosophila and maybe for mouse, to be able to annotate certain genes. The essential question relates to what should be included in Process. It is clear in Drosophila that one can pin certain genes to behaviours like walking or circadian rhythms because these are hard-wired. Conversely, there is need for an auxiliary ontology developed specifically to deal with behaviours in mouse since much knowledge in this area is not tied directly to specific gene activity. Conclusion - we do want behaviour in GO, but there may be other ontologies, for groups like mouse, that will extend these. In these cases we'll recommend that these auxiliary ontologies be consistent with GO and include any necessary cross-references to GO terms. To support this the GO terms should be at a level that can be used for many organisms for behaviours that have a genetically defined component. Localization of viruses We had previously discussed (at earlier meetings) and considered expanding the definition of extracellular to include extraviral in order to be able to include viral host cells. There was an objection from virologists that it doesn't make sense to consider viral host cells as extracellular. Therefore, we have now decided to reverse this previous decision and will remove the viral reference from the definition of extracellular. Added action item. Purity vs.
pragmatism aka obsoletism Question: When do terms become obsolete? Two issues: when redefining the term and when removing gene product names. Word-smithing changes to a definition that do not impact the meaning, but only clarify the original meaning, do not require that the term be made obsolete (the criterion is that no annotations will ever be affected by the change to the definition). However, if the fundamental concept changes then the term needs a new ID and the older one must be made obsolete. Michelle found some terms that, when going from primary ID to secondary ID (these arise from merges only), were not strictly synonymous. This is a problem. David: don't just remove gene products; they need to be replaced. There followed a very lengthy discussion regarding the issue of function grouping. Much of the group wants to use synonyms (broader) to deal with these. For now put portmanteau terms into synonyms. Drive function to a purely subsumption hierarchy. (Function grouping ontology). Synonyms Distinguishing between exact synonyms and inexact synonyms. Work is done, just need incremental improvements. John is on it so that DAG-Edit makes this easier. Structure terms We are keeping structure terms until we have properties. Values can come from anatomy or cellular component, or cell type. Wording is wrong, but we can live with it. Disappearing GO ids Michelle had this problem with DAG-Edit. Midori: special case of terms from Michael and shouldn't happen again. TAIR: Sporadically disappearing definitions. Michael: if term is not in ontology then definition is not saved. Need DAG-Edit to warn if there are definitions without terms. Another reason for going to database. Amelia: most problems apparently due to CVS rather than DAG-Edit. Behaviour Resolved: These are to remain as they are. Viral component terms Two action items were added to change extracellular definition and move terms. Scope of Metabolism What does 'part of' mean within the context of metabolism?
The present definition is very broad and the question is should it include its own regulation. Currently it does. In general, this is an issue. Transport is not included as a 'part of' metabolism. Are regulation and transport equivalent (or analogous) concepts? On the other hand, is it more correctly called intermediate metabolism? Definition needs to be examined. Looking at a more sophisticated way to model, but in the meantime, regulation is an inherent part of process although strictly speaking the relationship is not the same PART OF as it is for the steps in a process. Because of this, we may need another relationship type for regulation. Midori will send some examples to Chris and BDGP for consideration. For now, transport will not be included in metabolism, but regulation will be. Synonyms Should gene products be included in synonyms? Yes, because people are going to be using these to look for them. Does this mean that gene products are permissible in term names then too? Yes, this is okay when the gene product is not the complete term, but indicates the substrate within the complete term. P53 is the common usage, but never is the name of a gene. Since the meaning is in the definition then the wording doesn't matter and it is okay to use the gene product as the string. However it is preferable to qualify this, that is, use something like 'p53-class' instead of just p53 in term names. However, if the gene product is used then it should be applicable across species and not restricted to a particular narrow group. Cyclin is another case, but it is more broadly used. We could possibly skirt the issue by using the string 'class' as a qualifier to gene product. Transporters (aka ATP synthase terms) Question was whether to create two separate terms for bi-directional reactions and then annotate to both terms. 
Resolved: Policy is that we will create a single term (that describes both directions of a bidirectional reaction) unless there is reason to believe that there is a biological justification to separate the two directions of the reaction into separate functions. Function Grouping Terms or Conglomerate functions The examples used in this discussion were 'T cell receptor' and 'myosin'. There was a lot of discussion about whether or not it is appropriate to create function terms that describe the sum of the parts. That is, a term to represent the single unary function that is created through the contributions of all the different individual functions that make up a complex, e.g. the function that something like a T cell receptor or myosin may have. One of the advantages of representing the various activities of 'T cell receptor' with multiple parentage was elucidated by David: it is a way to help annotators, who otherwise need to know that 'DNA helicase activity' includes ATPase activity, etc. Rex argued the other side, that this approach could lead to an unmanageable proliferation of terms to represent this sort of information. Karen brought up the cautionary example of 'GTPase activator activity', which currently has two parentage lines: one from 'enzyme regulator activity', which is fine, and a second line of descent from 'signal transducer activity', which is a problem because it makes 'receptor signaling protein activity' an ancestor of 'GTPase activator activity'. This is clearly wrong (there is a SourceForge entry already entered for this). 'GTPase activator activity' is an old term, so this may have come about because at the time the known GTPase(s) was/were all involved in receptor signaling. The eventual thought seemed to settle on the idea that to create this sort of 'grouping term' in the function ontology opens up the potential for true path violations of the type illustrated by the 'GTPase activator activity' example.
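The true path problem described for 'GTPase activator activity' can be illustrated with a small sketch. This is only a toy model under stated assumptions: the is-a edges below are a simplified, hypothetical fragment (not the actual GO graph of the time), and the helper names `ancestors` and `true_path_violations` are invented for illustration. The idea is that every ancestor reachable from a term must hold for anything annotated to that term, so a curator-supplied list of ancestors that should not hold exposes the bad line of descent.

```python
# Hypothetical, simplified is-a edges (child -> parents); NOT the real GO graph.
IS_A = {
    "GTPase activator activity": [
        "enzyme regulator activity",
        "receptor signaling protein activity",  # assumed stand-in for the problematic second parentage line
    ],
    "receptor signaling protein activity": ["signal transducer activity"],
    "enzyme regulator activity": ["molecular function"],
    "signal transducer activity": ["molecular function"],
    "molecular function": [],
}


def ancestors(term, graph):
    """Return every term reachable from `term` by following is-a edges upward."""
    seen = set()
    stack = list(graph.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(graph.get(parent, []))
    return seen


def true_path_violations(term, graph, not_implied):
    """Return the ancestors of `term` that a curator has judged do not always hold for it."""
    return sorted(ancestors(term, graph) & set(not_implied))


# A curator asserts that a GTPase activator is not necessarily a receptor signaling protein.
violations = true_path_violations(
    "GTPase activator activity",
    IS_A,
    not_implied=["receptor signaling protein activity"],
)
print(violations)
```

The check itself is mechanical; the biological judgment sits in the `not_implied` list, which is why such violations tend to be found by curators rather than by software alone.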
It was suggested to have some sort of Function Of Gene product (FOG) Ontology to make the correlations between individual functions and a specific gene product or class of gene products. The Function ontology itself will become more like a hierarchy than a DAG. The relationships in the FOG ontology will not be 'ISA' but will be a flavor of 'PART OF' to indicate their contribution to the conglomerate function. Web page 1. For credits have people use the SourceForge style link to logo, so we can count some of the usage statistics. 2. Home page is: about, what's new, downloads, credits. 3. Link to AmiGO and a search box all in the left panel. 4. Jennifer suggested that we use the Sanger style links: site links across the top and page links down the left (plus the standard search tools) and no one objected or offered a counter-proposal. She will implement the web site as demonstrated within the next few weeks. AI is to prototype the Sanger style page. (Comment added later by Jennifer: This action item was for me, as stated in the bottom of the final list of action items in the minutes.) Next meeting September 13-19: Working group on first implementation of properties in Bar Harbor (Chris, John, David...). September 24-25: phenotype meeting will immediately precede GO meeting in Bar Harbor. September 26-27: next GO meeting in Bar Harbor. January 16-17 at Stanford. Decision is still to be made regarding users' meeting in September. Action Items 1. ALL: update gp2protein on central CVS site. 2. Suparna & Amelia: update metacyc mappings (and check that no functions are mapped to) 3. Amelia: change monthly report file names so they'll sort by date. DONE! 4. Amelia: cron job that mails announcement of each new monthly digest to go-friends 5. BDGP, JAX: first prototype to be implemented for properties prior to JAX meeting 6. BDGP (SwissProt?): need to provide a tool for tentative assignment of GO terms. 7. one row, one term, one reference, one evidence code. DONE! 8.
(IEA) Midori: to assemble method references for IEAs 9. (IEA) BDGP to explore means of including larger number of associations in DB and AmiGO. 10. (IEA) BDGP to add filtering that is a combination of evidence code and reference. 11. (suspect annotations) Midori et al.: Add some things to documentation to describe procedure for error reporting, whether in terms or in associations. 12. (suspect annotations) GO-central to add links on main web site to report errors in annotation. 13. (suspect annotations) Brad to add button to AmiGO to mail error reports. 14. SUZI: write a tool to look at and report on consistency of annotation. 15. ALL: review annotation documentation and send in comments to GO-central (Midori to oversee). 16. BRAD: to add term based page. This would show all gene products and the other terms that had been used on each of those terms. A "other customers who used this term, also used these terms". 17. JOHN: Need DAG-Edit to warn if there are definitions without terms when saving so that the definitions are not lost. 18. GO central: for all part-of children in the function ontology, change the relationship to is-a and change wording to 'intrinsic regulator' or 'intrinsic catalyst'. 19. Jane: remove 'activity' from 'binding' terms; DONE! 20. Midori & Jane to dredge up what problems were at end of database save testing; send to John. DONE! 21. JOHN: Need DAG-Edit and central repository to work more seamlessly...DB or transparent CVS must be implemented. 22. GO-central improve documentation on synonyms 23. David organizing physiological process interest group 24. Physiological interest group is to report on progress next time 25. GO-central delete references to viruses in the definition of extracellular. 26. GO-central move viral component terms back into intracellular. 27. Midori to send examples of regulation to BDGP and Chris et al. to examine how to correctly indicate and model regulation. 28. 
Eurie: can now proceed to use gene products in terms with the addition of the suffix class and other situations will be handled in the same way. 29. GO-central: Update the documentation to reflect the decision on transporters 30. Amelia: Check on the terms in question and make sure they are consistent with the decision regarding transporters (and other bi-directional functions). 31. Michelle: Originally this AI was to send examples of messed up merges to GO-central for resolution. This was done. There are a few "sensu Eukarya" terms with secondary ids that did not have "sensu Eukarya" in them (Amelia generated a list of about 10). However, it turns out that it is ok that they are that way because, due to the placement of the old terms in the graph (as children of mitochondrial things for example), it is logically implied that they are Eukaryotic and therefore it is fine to make them secondary ids of Eukaryotic specific new terms. The problem for TIGR arose when those terms with mitochondrial parents were used to annotate some bacterial proteins (even though we knew about the path violations for bacteria) because at that point bacterial counterparts did not exist for those terms and they still wanted to capture the information. Therefore, the new Action Item is for TIGR to fix these annotations now that the bacterial counterpart terms are in GO. Thanks to Midori and Amelia for clarification of this. 32. transcription factor is wrong (mis-defined and mis-annotated). Interest group is going to fix this and report the solution. 33. All interest groups to provide short (one page more or less) reports for next meeting. 34. Jennifer: to provide a mock-up of the GO home page using Sanger style links. Minutes by Suzanna Lewis and Karen Christie. Thanks to everyone who found the time to review, comment and fill in the holes.
=================================================================== GO Consortium Meeting - Bar Harbor, ME - September 26-27, 2003 [Next Meeting: Stanford- SGD organizing: - GO Users Open Mtg; Jan. 15th. GO Consortium Mtg. Jan. 16-17.] Opening Comments: Meeting organization: We are a very cohesive group that works well together and we want this to continue. Therefore as we grow in size and in objectives we must continuously address the effectiveness of our organization in order to maintain 1) effective communication, 2) the quality of what the project produces and 3) the informality of the group, so that all feel welcome to contribute and comment. At this point the group has grown to the extent that we must adjust and strengthen the structure and organization of the GO Consortium. We recognize that there are four major sub-groups here: 1) Ontology Development, including Interest Groups; 2) Annotation; 3) Database and Software Development and 4) Production and Distribution. In this context, we need to discuss how to go about revising the structure of the GO Consortium meetings. For example, the 'whole' group meets less frequently and sub-groups meet more frequently. This topic was a thread through the meeting and there was further discussion at the end of the meeting. 1) Group Participant List EBI-Ontology group (Midori Harris, Jane Lomax, Jen Clark, Amelia Ireland) Berkeley DB group (Suzi Lewis, Chris Mungall) FlyBase (Michael Ashburner, Rebecca Foulger) SGD (Mike Cherry, Rama Balakrishnan, Maria Costanzo, Rob Nash) MGI (Judy Blake, David Hill, Harold Drabkin, Martin Ringwald, Mary Dolan, Li Ni, Joel Richardson, Janan Eppig, Alex Diehl) TAIR (Tanya Berardini, Suparna Mundodi) SWISS-PROT (Evelyn Camon, Daniel Barrell) Sanger Parasite Group (Matt Berriman) S.
pombe/Sanger (Val Wood) WormBase (Eimear Kenney, Kimberly Van Auken) DictyBase (Rex Chisholm, Pascale Gaudet, Warren Kibbe, Cathy Li) GKB (Lisa Matthews) RGD (Susan Bromberg, Norie De la Cruz, Victoria Petri, Mary Shimoyama, Lan Zhao) TIGR (Michelle Gwinn) Incyte (Allan Davis) ZFIN (Doug Howe, Sridhar Ramachandran) Gramene (Pankaj Jaiswal) 2) Updates on Action Items from St. Croix Meeting The full listing of Action Items from St. Croix Meeting is at end of report. Most Action Items are completed. Not_Done or In_Progress or Special_Notes items listed here. 1. Update all gp2protein files in CVS. Need to send reminders to some groups. 6. BDGP(SwissProt): Request for tool for tentative assignment of GO terms Not Done 8. Assemble 'methods' references for IEA. In progress - work done by Midori and Michelle. GO is going to maintain a set of generic references describing IEA techniques, for use by databases that do not themselves have reference collections to call on. These will then allow users of the data to distinguish between the different ways that GO terms have been assigned that fall under the IEA umbrella. [Action Item 34, BHmtg] 9. IEA- BDGP to explore means of including larger number of associations in DB and AmiGO. In progress [see Action Item 28 BHmtg for a related topic, that is, removing defunct associations]. 10. IEA - BDGP to add filtering that is combination of evidence code and reference Not Done (needs number 8 to be completed first.) 32. Transcription factor issue...Interest group is going to fix and report. Not Done 3) Reports from EBI-GO 3) Reports from Ontology Development Interest Groups Beyond the reports from the Interest Groups, there was considerable discussion about how to involve more experts in certain biological areas in the development of the ontologies. Lisa reported considerable success for GKB by going to specialty meetings and approaching individuals to discuss GKB and elicit their help.
Also, follow-up site visits to researchers' institutions might help. GKB uses a PowerPoint template to guide contributors. It was decided that GO should also take this proactive approach [Action Item 3]. While the use of ppt is not applicable to GO it is clear that a comparable user guide and standards are needed for newbie ontology contributors. [Action Item 47] a) Physiology Tanya and David provided a file with revisions for physiology section of Process. This will be implemented. Complete revision with terms and definitions available from SGD, MGI, others. b) Plants Interest Group We have revamped all the extracellular component terms and are now rearranging and expanding the children of the sexual reproduction terms. sensu Magnoliophyta -A problem was discovered with the 'sensu Magnoliophyta' terms. Many of these terms seem misleading because they actually refer to phenomena that also occur more broadly outside Magnoliophyta. However it was pointed out that 'sensu Magnoliophyta' just means 'in the sense of Magnoliophyta' and so does not exclude annotation of non-flowering plant gene products to such a term. -One alternative would be to replace the word 'Magnoliophyta' with a sensu word that could apply equally to all groups (that might be annotated with such a term). This would be quite time consuming because we would have to check each annotation case using the term, whether it applied to all plants and whether all green algae were included etc. -At the moment there are no non-flowering plant species being annotated and so there is not an urgent need for terms to be created for the annotation of non-flowering plants. -With these points in mind it was decided that we should concentrate on making the flowering plant terms exhaustive and stick to 'sensu Magnoliophyta'. We will create terms for non-flowering plants when non-flowering plants are being annotated.
4) Reports from Annotation Groups The following groups submitted progress reports of their activities since the last GO Consortium meeting. a) FlyBase - ok b) TAIR - ok c) MGI - ok d) SGD - ok e) GOA - ok f) WormBase - ok g) TIGR - ok h) Sanger Pathogen - ok i) Incyte - ok j) RGD - ok k) ZFIN - ok l) DictyBase - ok 5) Ontology Development Issues a) Logical consistency checks In the documentation there is an example of a logical relationship: If A is a part of B and C is an instance of B, then must A be a part of C? Then there is an example with "cytoplasm". Jane notes this logic isn't always true in the ontologies and asks if we can fix this. This led to a discussion of "part of" and how we use it in GO: Chris (and John) said there are 4 types of part of (letting A represent the 'larger' component and B represent the sub-component) 1. B is sometimes part of A 2. B is necessarily (always) a part of A (this is the one we almost always use) 3. A necessarily has part B 4. A necessarily has part B -and- B is necessarily a part of A (both directions of relationship) Chris: Technically what many ontologies do is to use the weakest relationship (#1) as the default because it assumes the least. These relationships can then be adjusted to become more restrictive (and precise) as more is known. In practice, we (GO) already are using the part-of relationship in the stricter sense of #2, most of the time. (As an aside, Chris met and discussed this with Stuart Aiken in Edinburgh. He is also thinking about this and doing a lot of work in this area.) Chris also described the distinction between 'part' and 'proper part'. A proper-part is a direct part and therefore is not transitive. E.g.: "a nail is a proper part of a finger and a finger is a proper part of a hand but a nail is not a proper part of a hand". There were several decisions made. First, we agreed to update documentation as it regards the use of 'part-of'.
Second, we agreed to henceforth only use 'part-of' in the sense of type #2. Third, we agreed to track down all cases that do not use 'part-of' in the sense of type #2 and restructure the ontology as needed. [Action Item 5, 16]. Fourth, we will consider adding all the different logically distinct 'part-of' relationships because these may prove to be needed in many cases in the future.

b) 'Signal Transducer Activity' term disagreements
The question is whether the current "signal transducer activity" term is appropriate for GO.
Harold/David think it is. They proposed a new definition: "the activity of converting one type of signal into another type of signal" (signals can be light, chemical, etc.). They say the process of signal transduction is more than one step but the function of "signal transducer activity" is the first step.
Amelia: There was an issue with "receptor binding" and "signal transducer activity" - not all signal transducers are receptors. If a receptor is under signal transducer activity it should be involved in signal transduction. If a change is made to the definition of "signal transducer activity" then it should be obsoleted, even though there are lots of annotations to it, especially since Amelia feels the term has been used incorrectly. Report is attached at the end of Meeting Notes.
Midori: there is the question of whether there is a molecular activity of "signal transducer activity".
Amelia: What about steroid receptors that move steroids in/out of cells?
Many: should we change the wording, add a comment?
RESOLUTION TO "signal transducer activity" question: [Action Item 4]
- need to obsolete the current term
- make a new term with the same name but a new definition
- create the new definition to everyone's satisfaction (to be ironed out later)
- add a warning to the comment on the appropriate use of this term
- clean up the children terms - some need to be moved to other areas of the ontology.
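
The four senses of 'part of' listed under 5a can be written down as a small data structure. This is only a sketch: the class, member, and function names are invented for illustration and are not part of any GO tool. It records, for each sense, which direction of the relationship is guaranteed, and uses that to answer the question raised in the documentation example (whether a subtype of the whole must also have the part).

```python
from enum import Enum


class PartOf(Enum):
    """Four senses of 'part of' (A = whole, B = part).

    Value is (part is always in the whole, whole always has the part).
    Names are hypothetical, chosen for this sketch only.
    """
    SOMETIMES = (False, False)         # 1. B is sometimes part of A
    PART_NEEDS_WHOLE = (True, False)   # 2. B is necessarily part of A (usual GO sense)
    WHOLE_NEEDS_PART = (False, True)   # 3. A necessarily has part B
    BOTH = (True, True)                # 4. both directions hold

    @property
    def part_always_in_whole(self) -> bool:
        return self.value[0]

    @property
    def whole_always_has_part(self) -> bool:
        return self.value[1]


def subtype_of_whole_has_part(sense: PartOf) -> bool:
    """If C is a kind of A, must C also have part B?

    This follows only when every A is guaranteed to have a B,
    i.e. for senses 3 and 4, not for the usual sense 2.
    """
    return sense.whole_always_has_part


# The weakest sense guarantees nothing in either direction.
WEAKEST = PartOf.SOMETIMES
```

Under this reading, the documentation's inference fails for sense #2, which is consistent with Jane's observation that it does not always hold in the ontologies.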
c) Presence/Absence of function grouping terms
Midori: A couple of meetings ago it was decided to remove from Function those terms that grouped things based on something other than activity - like Processes or Components. But having the grouping terms is useful for annotation, so people are in no hurry to remove them.
Example: "defense immunity protein activity". This term is a grouping term found in the function ontology that is solely based on Process and has many children terms where this is also true. We don't want function terms that represent a process because 1) it is a process, not a function and 2) any is-a relationship of child terms to this parent is illogical. While tempting, we don't want terms grouped in Function by nature of being in the same Process.
Judy: Maybe we're trying too hard to put a function on everything and are wanting these function terms when really we should just have a process and no function.
Suzi: Some problems come back to the fact that there is a relationship between function and process which we don't reflect.
Midori: Agreements at meetings don't always carry over into agreement in email after meetings.
Judy: If there is angst, then we need more discussion and to resolve things at meetings. General agreement at a meeting can break down in fuzzy specific instances in emails afterwards.
Judy/Midori: practicality versus purity of the function ontology.
Rex: Perhaps people don't realize there are analogous Process terms to use. Maybe state this more clearly in the emails.
[Action Item 6. RESOLUTION TO process grouping terms in Function. We will not use Process to group Function terms unless all of the terms being grouped share the same type of function. GO curators will continue to bring these to the attention of the group via email; if agreement is reached quickly - great. If not, it will be resolved at a meeting. Also, in the emails be sure to point out the Process term alternatives to the Function term.
Things in Function should have things grouped by function.]

d) Consistency of Parentage (catalysis and binding)
Amelia: catalysis and binding - sometimes an enzyme activity has parents of both the catalysis term and the binding term. Mostly there is only the catalysis parent. Which way should it be?
Consensus: enzyme activities should have only the catalysis parent.
[Action Item 17. Remove all binding parents to enzyme activities where appropriate. Document the fact that binding is not always a parent of enzyme. Binding only when stable binding occurs.]

e) Difference between activation of/positive regulation of/induction of/etc.
Evelyn: positive regulation does not equal activation.
Consensus: some redundancy, can't make synonyms in all cases - need some new definitions and comments.
[Action Item 18: curation team will go through and find these and try to resolve them, redefine them as needed and put notes in comments.]

f) Synonyms in ontology files (this was actually discussed after the old action items but seems to belong with this section).
Michael: following the experiment of integrating GO into UMLS it was clear that "synonym" was being used in many ways. Jane has made a synonym file with all of the relationships in the file. Format: GOid / GO term / synonym type id / synonym. This info should be in the db and in the GO ontology files, not just the synonym file.
Discussion on whether to stop using inexact synonyms in favor of entry words - answer was no.
5 types of synonyms (one parent, 4 children):
related (~)
  %exact (=)
  %broader than (<)
  %narrower than (>)
  %other related (!=)
For broader than and narrower than synonyms one must always ask if the synonym should be a GO term.
[ACTION ITEM 8: consensus and resolution: Put the 5 types into the database. Put the 5 types into the flat files. John will need to make DAG edit work with this. Jane needs to write documentation. Chris will add them to the db. A warning of the new file format will go out prior to implementation.]
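
The five synonym types and the four-field synonym row described under 5f can be sketched as follows. This is a minimal illustration under stated assumptions: the minutes give the field order (GO id, GO term, synonym type, synonym) and the type symbols, but not the delimiter, so a tab is assumed here; the class and function names and the sample row (including its placeholder id) are hypothetical.

```python
from enum import Enum
from typing import NamedTuple


class SynonymType(Enum):
    """The five synonym types from the minutes, keyed by their symbols."""
    RELATED = "~"          # the parent type
    EXACT = "="
    BROADER = "<"
    NARROWER = ">"
    OTHER_RELATED = "!="


# 'related' is the single parent; the other four types are its children.
PARENT = {t: SynonymType.RELATED
          for t in SynonymType if t is not SynonymType.RELATED}


class SynonymRow(NamedTuple):
    go_id: str
    term: str
    syn_type: SynonymType
    synonym: str


def parse_row(line: str, sep: str = "\t") -> SynonymRow:
    """Parse one row; the tab separator is an assumption, not the real format."""
    go_id, term, symbol, synonym = line.rstrip("\n").split(sep)
    return SynonymRow(go_id, term, SynonymType(symbol.strip()), synonym)


def should_review_as_term(row: SynonymRow) -> bool:
    """Broader/narrower synonyms raise the question of whether they should be GO terms."""
    return row.syn_type in (SynonymType.BROADER, SynonymType.NARROWER)


# Hypothetical sample row with a placeholder id.
SAMPLE = parse_row("GO:0000000\texample term\t<\texample broader synonym")
```

Keeping the type as a symbol in its own field means a flat-file reader can flag broader and narrower synonyms for curator review without inspecting the synonym text.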
6) Annotation Issues

a) Need for Annotation Consistency
We discussed the need for greater attention to consistency in annotations. Our users expect the annotations to be based on shared standards so that they can be compared and used in comparative genomics contexts. We agreed that we need to more formally identify a mechanism/team/process to ensure greater annotation consistency. This effort will include the development or employment of tools to evaluate annotations. [Action Items: 48]

b) ISS and sequence dissimilarity
When two sequences are similar but are missing some key piece of sequence similarity that tells you that your protein can't have the function in question, what do you do?
[Action Item 24; 29] Add to documentation - use the NOT field for ISS annotation with sequence dissimilarity.

c) Annotating to Complexes
This was a major and continuing topic at this meeting. Should we assign function to members of a complex when these members either do not engage in the (typically) catalytic activity, or we don't really know the function of the member? This was a very long discussion. There were two separate problems that were discussed simultaneously (see below). Discussion ranged over both of the problems throughout.
Two separate problems:
Problem 1. There is an ontology problem: when the function ontology has an enzyme activity term with children "regulation of activity" and "catalytic activity", there is a true path violation for the regulator, in that its path goes up to the catalytic activity when it does not have that activity. This could be solved by removing the "enzyme activity" parent from the regulatory subunit. The regulatory subunit would have as parent "enzyme activity regulator". People feared that this would remove a link between the regulatory subunit and the function it was regulating. Others said that the link would be preserved with the component ontology term assignments.
There were suggestions to rearrange the ontology - but nothing seemed to satisfy the needs. In the end the decision was to remove the "enzyme activity" parent from the regulator term.
[Action Item 10] "enzyme activity" terms will no longer have as children their regulatory subunits. The regulatory subunit will have as a parent "enzyme regulator". We recognize that this removes a link in function between the regulator and the enzyme activity. However, we feel this will be covered in the annotation of the gene to the complex in question.
Problem 2. What to do when annotating the function of a subunit of a complex when that subunit does not have a known activity on its own. Up to now we have been annotating to the potential of a subunit and therefore would annotate the function of the complex as the function of one of the subunits (this is in the documentation). This is not actually correct of course, since the individual subunits do not have the function of the whole complex. But to not do this would lose the relationship of the subunit to the function of the complex to which it contributes. Ideally we would be annotating the functions of complexes and assigning gene products as parts of complexes with those functions, but many databases don't have the ability to do that.
- Some suggested making relationships between GO's Function and Component ontologies.
- Some suggested not linking function to the subunits (if nothing is known about what they individually do) at all.
- Some suggested adding a qualifier in the association file - suggestions: direct/indirect, associated_with, etc.
- Some suggested modifying the association file format to include a way to indicate that gene products A plus B plus C are needed for a particular function.
[Action Item 11] Regarding the annotation of gene-products that are members of a complex:
1. The complex should appear in the component ontology.
2.
Gene products that are members of that complex should be annotated to that component term.
3. The complex itself (the instance of it in your DB) should be annotated to the appropriate function.
4. Gene products that are members of that complex should (if a more precise functional granularity is not known) be annotated to the function of the entire complex, but must have an additional qualifier added. This mandatory qualifier will be placed in the "NOT" column. The string we will use for this qualifier has not yet been finalized, but the candidates that we have discussed are "associated_with", "component_of", and "contributes_to". Whichever string is decided upon, the consequence is that there will now be two allowed values in the NOT column: "NOT" and the chosen qualifier ("associated_with" or "component_of" or "contributes_to"). If both NOT and the qualifier value are needed for the association then they will be separated with a pipe character '|'.
[Action Item 30] In a related topic - Mike will add "complex" as an allowed type of "DB_OBJECT_TYPE" in the gene association file for those groups who are able to store complexes in their dbs and assign terms to them.

d) Validation of Annotation
Up to now there has been no validation within the data sets or between the data sets. Can we use the test set? We want the consortium to check annotations.
Michael: Need a tool that takes association files, gets proteins, clusters them, and presents to annotators the GO terms attached to the clusters to view. Need to flag things that are ok but come up in the screen, so they don't have to be looked at again. Once something is found that needs attention - send a message to the contributing db to fix it. The first time these checks are run it will be a lot to go through, but once that's done it should be (hopefully) fairly easy to maintain. Maybe we should have GO school/camp for 2 weeks.
Suzi: 3 things:
1. take existing annotations and check for consistency
2.
have a given set of genes annotated by two methods and check for agreement
3. GO camp/school useful for a. resolving discrepancies, b. new people education
It's very important to check consistency between dbs.
Mike: consistency is a goal, and so is sharing; we must share the nitty gritty of methods to make this work.
Suzi: maybe we should all use the same tools.
Mike: that's what GMOD is for.
David: consistency with component and function will be easier than process; process will be different for different species. Will need to choose wisely what defines a shared process. [Action Item 49]

7) Resource Issues

a) Report from development group on instantiating GO in Prolog
The underlying structure of the ontologies is going to have a big shift into the logic programming language Prolog. This new paradigm will impact the development and storage of the ontologies, but the annotation processes will remain the same and most users won't see a difference. We will continue to provide the GO in various formats.
Chris Mungall gave a report from the working group that met in Bar Harbor prior to the GO meeting. This group included Chris Mungall, Suzi Lewis, David Hill, Harold Drabkin, Joel Richardson, Jim Kadin, and Alex Diehl.
- GO is a mix of 'stem' terms and 'composite' terms. For example: 'oxygen binding' is a composite term of the compound 'oxygen' and the term of the function 'binding'.
- A more complex term is "positive regulation of smooth muscle contraction" - it can be broken down into its component parts:
  the action in the term is "contraction"
  "muscle" is the thing being affected by the action
  "smooth" is a modifier for "muscle" and so a modifier for the thing being affected
  "regulation" is a modifier of the action
  "positive" is a modifier of "regulation"
(Aside to this discussion: What would be the term in an anatomy ontology, "muscle" or "smooth muscle"? Answer: "smooth muscle" would be a child of "muscle".)
We might want to think about GO as a language system.
GO terms are highly regular in their structure. They lend themselves to formatting: for regulation terms --> QUALIFIER, "regulation of" PROCESS, where PROCESS is "contraction" or "biosynthesis", etc. PROCESS can itself have modifiers. One can deconstruct the GO terms like this and build a grammar. There is a programming language called "Prolog" that breaks down terms into parts/classes.
Steps in using this for GO:
1. take all or part of GO and decompose.
2. Maintain this breakdown in GO itself.
3. make "oxygen binding" a cross product of a compound term (from a compound ontology) and the function "binding". Now many parallel hierarchies like transport and binding can be maintained more easily.
Question: if we have a way of generating a compound term, should we still maintain the compound terms in GO or just have them made as users need them? Answer and consensus - we should maintain them in GO. Phenotype Ontology will produce a massive cross product from the Anatomy ontology and Process. We will use the build-up process the first time a term is needed and then the term will get an id and be in the ontologies permanently. There will be a user mode for creating specific terms.
Discussion of Chris's talk:
Michael: mapping of component terms in PO to base GO terms.
Midori: will this help sorting parent/child relationships for new terms? Chris - it should.
Martin: will there be 1 rule or decomposition or several rules? Chris - it will create standard wording.
Judy: Will people who do ontology development need to use Prolog? Chris - No, they just need to make sure they add rules as necessary.
Rex: so with a new term from many places, will the tool make the term from the many places? Chris - you will put in the term, and the tool will suggest optional add-ons or alternative names, and parent terms for you to review.
Rex: will the tool read an anatomy file? Chris/Suzi - yes, it will. A set of developers will work on the core/primitive terms and annotators will work on derived terms.
Michael: primitive will come from anatomical ontology? Chris - yes Michael: mouse anatomy will have "head development" a compound term, but here the primitive is mouse head not just head and this term will be used for many types of heads - do we want only one GO term? David: we should have all head types as children of "head development" (children would be "mouse head development", "fly head development", etc.) Rex: how will the anatomies be used?, import them all and then sort it out.? - not sure of answer to this one. Prolog demo: -run deconstruction - get stem terms: regulation([regulation, qualifier(Q), regulates: P]) -Grammar: qualifier regulation of process [regulation,qualifier:positive,regulates:[contraction, affects:[muscle, qualifier:smooth]]] -first step: go through all GO and breakdown into stem terms. -Test parent/child relationships Chris showed amino acid test - term "glycine binding" has a parent "amino acid binding" but needs "serine family amino acid binding" as parent term since glycine has parent "serine family amino acid" in the compound ontology. It showed that GO was missing an intermediate term and suggests what to do - either add glycine as a direct child of amino acid in the compound ontology or make a new intermediate term so that "glycine binding" can be a child of "serine family amino acid binding" This tool should solve the interleukin problem from before. More discussion: Michael: this new tool is an easier route to maintainability and communication with other ontologies - what are the downsides - for the GO curation/editorial team there will be transitional pain - but not for the users. Once the transition is done will there be other downsides? Chris: We will need to maintain the other ontologies. Judy: It depends on stable contributing base of vocabularies, some are not so stable, but it will likely be approximately what it is now. David: GO curators need to now maintain all of the member ontologies. 
If "oxygen" doesn't exist in a compound db, who puts it in?
Michael: need a "buy-in" of base ontologies. If we can ensure that all of the ontologies we rely on are around this table or under institutional control, then we don't need GO developers to maintain them; we just need reliable people to maintain them and provide quick turnaround - except maybe compound and protein families.
Judy: What about UniProt and PIR? PIR/PANTHER families are a possibility.
Michael: chemical ontology -- EBI has a real chemist - so there will be work on that eventually at EBI. Cell type ontology is fairly mature; all anatomies must be in CVS.
Rex: We need a clearing house to tell people which anatomies their terms should be involved in - define lines of what each ontology encompasses.
Martin: who can write to files at OBO?
Michael: each file will have a person who does the updates?
Martin: but right now who can write to these files?
Suzi: there is a short list of people with write access.

c) Report on DAG_Edit
Suzi gave the presentation for John, highlighting new properties (see presentation for details). There was a question on whether DAG Edit can save changes between two versions - GO curation team says you can save the histories - need to check on this. [Action Item 42]. Question to group on when to shift to the new DAG Edit, which is ready to go. Will organize a testing period and John will visit users during this period [Action Item 43].

d) Report on AmiGO
Chris gave the presentation for Brad (see presentation for details). A test of the new underlying data structure was done for the GO term correlations/concurrent assignments tool. "Genes who liked that GO term also liked this one." - it worked fine. This was Action Item 16 from the St. Croix meeting.

e) Demonstrations
1. Joel Richardson: Viewing annotation vocabulary graphically, using GraphViz. Currently works on mouse data, has plans to make it generic; maybe it should work off the GO db. SGD has a similar tool. Will work on putting these out on GMOD.
2.
Eimear Kenney: Textpresso. This is a tool for mining the full text of publications for relevant sentences.
3. David Hill: Automated paragraph generation from GO annotations. An attempt to develop rules for making a nice text paragraph based on the annotation and GO terms assigned to a protein. Did this because granting agencies and users want text output. David did a test with Pax6. He developed simple sentence structure rules that allow the automatic fill-in of GO terms and annotation information and production of a text description of what is known about a protein. This text is generated from underlying data and is basically the reverse of the deconstruction described by Chris. Both are necessary for complete usefulness of the GO system.
Ultimately, it is hoped that GO data will be presented to the user with options on viewing - the normal GO term assignment tables, a graphical interface like Joel's, and a text entry like David's describing the sum total of what all of the GO terms and annotations tell us about the protein.

f) Slots = properties
Slots are being accomplished with the Prolog deconstruction work Chris presented. For slots we need to decompose ontologies, add relationship types, and have axiomatic ontologies (elemental, basic terms). Chris will start decomposing terms - needs volunteers to go through them - David/Amelia/plant person. Should be done with some testing by next meeting. [Action Item 50]

8) Lingering questions from this meeting
1. From TAIR: TAIR has a pathway to term map and SGD has a map to another term in the same tree but at a different level. How should this be handled? We didn't return to this question.
2. GO Slim: should 3 files be required when sending in a Slim: 1) GO Slim itself, 2) GO term mapping to GO Slim, 3) mapping of genes to GO Slim.

9) Action Items from this meeting

Ontology Development Action Items
1. Create SOPs for checking of ontology integrity
2. Document process for revision of subtrees
3.
Create SOP for getting people into interest groups and other interest group activities. 4. RESOLUTION TO "signal transducer activity" question: i. need to obsolete the current term ii. -make a new term with the same name but a new definition iii. -create the new definition to everyone's satisfaction (to be ironed out later) iv. -add a warning to the comment on the appropriate use of this term v. -clean up the children terms - some need to be moved to other areas of the ontology. 5. Document Logic Consistency issues in regards to 'Part-Of' designations. Following documentation, track down instances that are not always 'necessarily part-of', figure out what to do with them (known examples: proteasome and polarisome) 6. RESOLUTION TO process grouping terms in Function. We will not use Process to group Function terms unless all of the terms being grouped share the same type of function. GO curators will continue to bring these to the attention of the group via email, if agreement is reached quickly - great. If not, it will be resolved at a meeting. Also, in the emails be sure to point out the Process term alternatives to the Function term. Things in Function should have things grouped by function. 7. Alex will send in SF ticket on 'regulation of survival gene products' under "apoptosis" and GO team will check it out. 8. RESOLUTION on synonym types: Put the 5 types into the database. Put the 5 types into the flat files. John will need to make DAG edit work with this. Jane needs to write documentation. Chris will add them to the db. A warning of the new file format will go out prior to implementation. 9. As needed, add English terms as synonyms. 10. RESOLUTION of the regulator subunit of enzyme activity as child of activity question: "enzyme activity" terms will no longer have as children their regulatory subunits. The regulatory subunit will have as a parent "enzyme regulator". 
We recognize that this removes a link in function between the regulator and the enzyme activity. However, we feel this will be covered in the annotation of the gene to the complex in question.
11. RESOLUTION of "subunit of complex" annotation issue: It was decided to annotate the gene products of a complex to the complex with component terms, and to continue to annotate the individual subunits to the function of the entire complex but with a qualifier in the "NOT" column - the qualifier will be "associated_with". Therefore, there will be two allowed values in the NOT column: "NOT" and "associated_with". If you need to use both values at once, separate them with a pipe. [subsequent discussion as to whether 'associated_with' or 'component_of' would be the better tag.]

Action Items specifically for the GO Editorial Office in Hinxton.
12. Add two new curators to the web site 'people page'.
13. Commit the new web site with improved index. (jen)
14. Send URL of function ontology documentation round to group for discussion. (done)
15. Document the difference between a parent/grouping term in the function ontology and a single term in the process ontology.
16. Document the 5 different part_of terms and the fact that we mostly use just one of them (necessarily part of).
17. Document the fact that binding is not always a parent of enzyme. Binding is only a parent when stable binding occurs. Remove Binding as parent where appropriate.
18. Standardize use of 'activation', 'induction', 'positive regulation of'. GO curation team will go through and find instances of "positive regulation of"/"activation of"/"induction of" and try to resolve them, redefine them as needed and put notes in comments.
19. Keep an eye out for any standard operating procedure information coming from the Annotators meetings.
20. GO.evidence.html has a bad link. Fix this.
21. Two new tools were demonstrated. Add these to the tools page: Joel Richardson's Annotation
Browser and the Textpresso program that Eimear Kenney presented.
22. The folks in the GO office are to test the new DAG-Edit for a few weeks prior to release.
23. Jane will write documentation on Synonym Types. Need to send a warning of the new file format prior to implementation.
24. Add to documentation - use the NOT field for ISS annotation with sequence dissimilarity.
25. Change the documentation so that ISS can have cardinality >1. Add documentation that clarifies the section where it tells annotators that if you are unsure of the function/process of your gene to bump up to the next higher term. Add that if that bumping gets you to the root of the ontology you should then use the "unknown" term for that ontology.

Action Items for Annotation Groups and for Annotation Oversight
26. Formally identify an Annotation Oversight Team; they will a) assess quality, b) set standards, c) evaluate the annotations of contributing groups, d) alert those groups to annotations that may need attention.
27. RESOLUTION: If there are IEA sets of associations that have not been updated in one year, they will be removed from the front page and AmiGO if a call to the submitting group doesn't result in an updated file.
28. RESOLUTION: Use the NOT field for ISS annotation with sequence dissimilarity. Everyone keep the ISS consistency (how much similarity is enough for different groups) issue in mind and think about ways to improve it.
29. Mike will add "complex" as an allowed type of "DB_OBJECT_TYPE" in the gene association file for those groups who are able to store complexes in their dbs and assign terms to them.
30. Need a new tool that will check for situations where annotation of GO terms was made (for ex. to a mouse gene) based on terms added to another gene (for ex. from human) with ISS, but where the annotation of the match protein (in ex. human) has since changed. An email would be sent to an annotator to review the annotation for the mouse gene again.
31.
Add documentation that clarifies the section where it tells annotators that if you are unsure of the function/process of your gene to bump up to the next higher term. Add that if that bumping gets you to the root of the ontology you should then use the "unknown" term for that ontology (in EBI-GO List too).
32. Everyone should be using a script to check for formatting errors in the association files before submitting them to GO. SGD and others have such scripts to share.
33. Send comments on text sent out by Michelle for HMMs and pairwise matches IEA references. Send any other text for other types of IEA evidence around for comment.
34. Organize the quality control checking for annotations. Make a tool to do the comparisons. Organize the GO school/camp. We all must buy into the concept of annotation consistency.

Action Items for Software and Database Development and Production
35. Send reminders to groups who need to update gp2protein.
36. Mike's group will be establishing a production manager and will hire someone to do the job. This person will work with Brad on AmiGO, Suzi's group on database validations, and various Annotation Groups on standards for GO association files.
37. If there are IEA sets of associations that have not been updated in one year, they will be removed from the front page and AmiGO if a call to the submitting group doesn't result in an updated file. (This is the same as Action item #28, but is here because it affects both groups)
38. An AmiGO request: Add a species filter to AmiGO. This could be done either by using the identity of the contributing database or, independent of source database, by using the taxon id from GenBank available for the related sequence. Note the SF site should be used for this kind of AmiGO request (hence #40 below).
39. Provide a SF ticket for AmiGO improvements and suggestions; provide focus group for AmiGO improvements.
40. Software/db group send Midori db format requirements for the IEA references.
41.
There was a question on whether DAG Edit can save changes between two versions - GO curation team says you can save the histories - need to check on this.
42. Organize testing period for new DAG Edit. Approximately 6 weeks of testing. John will visit users of DAG Edit during the testing period.
43. New flat files in the new format should have a different file name format: "function.obo, process.obo, component.obo". These files will be terms plus definitions. Feel we still need all three, although with the new system they can be combined.

Other Action Items
44. Run a test-set on all the GO tools. Generate a test set of genes for tool validation. Get a responsible person to manage the test system. Post results so users can see the kind of analysis/visualizations provided by a Tool. Nobody made a clear commitment to organize this, but many were in favor of it.
45. Pankaj is in contact with a group that wants to translate GO into several European languages, Arabic and Chinese. Pankaj will talk to this group and learn the details of their plans. Need precise input as to how the group would deal with update issues.
46. Develop further documentation for ontology development guidelines so that when we get help from outside experts to develop specific branches of the ontology we have a way to introduce them to some of the basic tenets and standards that are needed in order to do this [EBI editorial staff].
47. Mike Cherry (primarily, but not all by himself of course) to propose and set up a mechanism/team/process to develop 1) manual methods, 2) automated assessment tools, and 3) documentation to ensure greater annotation consistency.
48. Pursue by all possible means methods for improving consistency of annotations: computationally, based on sequence; comparatively, between alternate methods carried out on the same gene sets; through training and documentation (camp?) [Suzi, Mike, Michael, and Judy]
49.
Chris will start decomposing terms and David/Amelia/plant person will work with him to help test the results and change the ontologies as needed.

10) Summary
Proposal for future organization:
- software will be broken into development group and production group
- the production group will be handled by the Stanford group
- annotation needs quality control oversight - for now Mike is checking into this.
Should we change the way we organize the GO Consortium meeting schedule?
- all will be the same for the next meeting (in Stanford)
- Maybe we should have breakout group meetings for the subgroups (GO ontology development, annotation, software) which report back to the big group. However, many people are vested in several of these areas.
- Maybe we should have breakout groups for the interest groups which report back.
Mike: there will be 1/2 day available for breakouts. They are expecting the meeting to take 2 full days.
Judy: maybe a series of small group meetings followed by the big group.
Suzi: then the big group would only meet 1-2 times a year.
General agreement on this - suggestion to schedule the big meeting following the Stanford meeting after the small meetings have been scheduled.

Addendum 1: Report of Action Items from St. Croix meeting.
1. ALL: update gp2protein on central CVS site. Still several need to update.
2. Suparna & Amelia: update metacyc mappings (and check that no functions are mapped to) DONE
3. Amelia: change monthly report file names so they'll sort by date. DONE!
4. Amelia: cron job that mails announcement of each new monthly digest to go-friends. This is DONE, in the sense that the auto-mailing works, but not done in the sense that the reporting can be improved (Judy did I get this right?)
5. BDGP, JAX: first prototype to be implemented for properties prior to JAX meeting. DONE (Chris reported at Bar Harbor meeting)
6. BDGP (SwissProt?): need to provide a tool for tentative assignment of GO terms. NOT DONE
7.
one row, one term, one reference, one evidence code. DONE!
8. (IEA) Midori: to assemble method references for IEAs... stuff to discuss at BH mtg
9. (IEA) BDGP to explore means of including larger number of associations in DB and AmiGO. IEA db tuning... also need expiring date... NOT DONE
10. (IEA) BDGP to add filtering that is a combination of evidence code and reference. For IEA... and TIGR, add filter for taxonID, query tool for AmiGO. NOT DONE
11. (suspect annotations) Midori et al.: Add some things to documentation to describe procedure for error reporting, whether in terms or in associations. DONE
12. (suspect annotations) GO-central to add links on main web site to report errors in annotation. DONE
13. (suspect annotations) Brad to add button to AmiGO to mail error reports. DONE
14. (suspect annotations) NOT DONE but this will be part of the new annotation oversight system.
15. ALL: review annotation documentation and send in comments to GO-central (Midori to oversee). Nothing sent... EBI-GO updating documentation as per BH meeting
16. BRAD: to add term based page. This would show all gene products and the other terms that had been used on each of those terms. An "other customers who used this term, also used these terms" Amazon.com approach. Not done in production AmiGO, but has been done in test of new AmiGO architecture.
17. JOHN: Need DAG-Edit to warn if there are definitions without terms when saving so that the definitions are not lost. DONE in beta version
18. GO central: for all part-of children in the function ontology, change the relationship to is-a and change wording to 'intrinsic regulator' or 'intrinsic catalyst'. DONE
19. Jane: remove 'activity' from 'binding' terms; DONE!
20. Midori & Jane to dredge up what problems were at end of database save testing; send to John. DONE
21. JOHN: Need DAG-Edit and central repository to work more seamlessly... DB or transparent CVS must be implemented. NOT DONE
22. GO-central improve documentation on synonyms.
DONE
23. David organizing physiological process interest group. DONE
24. Physiological interest group is to report on progress next time. DONE
25. GO-central delete references to viruses in the definition of extracellular. DONE
26. GO-central move viral component terms back into intracellular. DONE
27. Midori to send examples of regulation to BDGP and Chris et al. to examine how to correctly indicate and model regulation. DONE
28. Eurie: can now proceed to use gene products in terms with the addition of the suffix class and other situations will be handled in the same way. OK
29. GO-central: Update the documentation to reflect the decision on transporters. OK
30. Amelia: Check on the terms in question and make sure they are consistent with the decision regarding transporters (and other bi-directional functions). DONE
31. Michelle: Originally this AI was to send examples of messed up merges to GO-central for resolution. This was done. There are a few "sensu Eukarya" terms with secondary ids that did not have "sensu Eukarya" in them (Amelia generated a list of about 10). However, it turns out that it is ok that they are that way because, due to the placement of the old terms in the graph (as children of mitochondrial things for example), it is logically implied that they are Eukaryotic and therefore it is fine to make them secondary ids of Eukaryotic specific new terms. The problem for TIGR arose when those terms with mitochondrial parents were used to annotate some bacterial proteins (even though we knew about the path violations for bacteria) because at that point bacterial counterparts did not exist for those terms and they still wanted to capture the information. Therefore, the new Action Item is for TIGR to fix these annotations now that the bacterial counterpart terms are in GO. Thanks to Midori and Amelia for clarification of this. DONE
32. transcription factor is wrong (mis-defined and mis-annotated).
Interest group is going to fix this and report the solution. NOT DONE
33. All interest groups to provide short (one page more or less) reports for next meeting.
34. Jennifer: to provide a mock-up of the GO home page using Sanger style links. Tried this, but didn't work well.

Addendum 2: Signal Transducer Activity report

signal transducer activity: current def "Mediates the transfer of a signal from the outside to the inside of a cell [or cellular compartment] by means other than the introduction of the signal molecule itself into the cell."

The proposed definition of signal transducer is based on the concepts carried by both "signal" and "transducer". I've been looking over the definitions of the individual components of "signal transducer"; there are two components:
1. detect signal << what is this?
2. change signal into another activity <<>> does not have to be a molecule; it can be light (see further on).

Are all these proteins therefore signal transducing molecules? Certainly cytokines are accepted as signal transducing molecules with the ability to induce signal transduction via receptor binding.
>> No, the transducer is the thing that converts one type of signal to another. Cytokines are the signal, not the transducer.

(paraphrasing) "To me, signal transducer..." or "I see signal transducers..." - you both have a concept of what a signal transducer is, but I think that the current def and the new def fail to capture it. I think that the term 'signal transducer activity' has been used to describe the activity of anything involved in a signal transduction cascade, and by using the term thus you are not capturing any more information than you already have by annotating to the process term 'signal transduction'. If you want to have a term to represent conversion of one type of signal information into another, I think it should be a new term because I don't think that 'signal transducer activity' will have been used in this way.
A signal transducer would thus be a gene product that converts one type of signal into another." It seems possible that more than one of the proteins in a signal transduction pathway could be signal transducers, but not necessarily all of them, since they won't all change the signal to another form.

How is a signal transducer thus defined any different from a transporter? ("Enables the directed movement of substances (such as macromolecules, small molecules, ions) into, out of, within or between cells."). Substance and signal are not the same things. A substance is always a physical entity; a signal is not. Insulin binding its receptor is a signal, but so is heat, etc. Binding to a receptor does not mean a substance is then transported into the cell.

Reply to the part about transducer vs transporter: with a transporter, a substance goes in one end and out the other. The "signal" can be a substance (like a pheromone), but it doesn't have to be. The transducer converts one type of signal to another (a chemical signal (like a pheromone) to a conformational change, etc.).

...in the transducer vs. transporter debate - I understand the difference between transducers and transporters; however, we've got all the receptor activities lumped under 'signal transducer activity' and some receptors work by conveying the signal molecule into the cell. Then these should not be called transducers, they are transporters. We can no longer therefore broadly classify all receptor activities as signal transducers - each receptor activity will need to be assessed and recategorized.

Are receptors which transport a signal molecule into a cell therefore not signal transducers? Are they involved in signal transduction, though? Or would we say that there has been a change in the signal type, i.e. incoming signal is extracellular steroid molecules, and the outgoing signal is intracellular steroid molecules. In the above cases, there is no transducer.
===================================================================
GO Consortium Meeting - Stanford, CA - January 16-17, 2004
[Next Meeting: Chicago - Dictybase organizing - October 2004]

Group Participant List
SGD (Mike Cherry, Karen Christie, Kara Dolinski, Eurie Hong, Dianna Fisk, Rama Balakrishnan, Rob Nash, Stacia Engel)
TAIR (Sue Rhee, Tanya Berardini, Suparna Mundodi)
MGI (Judy Blake, Joel Richardson, Harold Drabkin, David Hill, Mary Dolan)
ZFIN (Doug Howe)
RGD (Victoria Petri)
Dictybase (Rex Chisholm, Petra Fey, Karen Pilcher)
EBI-Ontology Group: (Midori Harris, Jane Lomax, Jen Clark, Amelia Ireland)
GOA (Evelyn Camon, Daniel Barrell)
Wormbase (Kimberly Van Auken, Ranjana Kishore)
Incyte (Burk Braun)
Gramene (not present)
IRIS (Richard Bruskiewich)
Berkeley DB Group: (Suzi Lewis, Chris Mungall, John Day-Richter, Brad Marshall)
TIGR (Linda Hannick)
FlyBase (Michael Ashburner, Rebecca Foulger)
S. pombe/Sanger (Val Wood)
Pathogen/Sanger (not present)

TOC
1. Opening Comments: GO Grant Update
2. Annotation Groups: Progress Reports
3. Interest Group Reports
4. Ontology Development Issues
   4.1 Metabolism terms: divide into cellular and organismal metabolism
   4.2 Regulation of non-biological processes
   4.3 Transcription/translation factor activity
   4.4 Component ontology annotations
   4.5 Protein classification
   4.6 Use of 'sensu'
   4.7 Documentation of function ontology
   4.8 GO_Slims Development
   4.9 NameSpace Ontology
   4.10 GO email archive search
   4.11 Gene association file errors
   4.12 Date tracking for definitions
5. Software Report
   5a. Presentation of OBOL
   5b. Report on the DAG-Edit workshop
   5c. Update on changes to AmiGO
6. Annotation Issues
   6a. Problems with pathway information annotation
7. Future Meetings
8. Final Item - Incorporation of GO in WormBookIII
9. Summary of Action Items from this meeting.
10. Review of Action Items from past meeting [Bar Harbor]

1. Opening Comments: GO Grant Update (Judy)
Judy reported on the status of the GO funding from NHGRI.
In the competitive renewal, the GO funding mechanism was changed from an R01 to a P41 (research_resource), and significant new funds were requested. Current indications are that we will be funded for 3 years. However, several funding cuts have been proposed, including one requested software engineering position. Additionally, there will be no new group (sub-contract) funding. Small side projects are also not funded. There may be additional adjustments; we are awaiting official notifications. We hope that any further adjustments in funding can be shared across all of the groups receiving funding through this grant except for the European contracts (due to dismal exchange rate) and BDGP (which has already had one position cut).

2. Annotation Progress Reports:
Reports were issued from the following groups:
SGD - Report available
TAIR - Report available
MGI - Report available
TIGR - Report available
ZFIN - Report available
RGD - Report available
Dictybase - Report available
FlyBase - Report available
GOA - Report available
EBI-Ontology - Report available
Wormbase - Report available
Incyte - Report available
Sanger/Pombe - Report available
Sanger-pathogen - Report available
Gramene - Report available
IRIS - ok, no electronic report
BDGP Software - Report available

3. Interest Group Reports
a. Plant Interest Group
Plant interest group report is available at http://www.ebi.ac.uk/~jclark/GOwebsite/text%20in%20development/plants_folder/plants.htm
This is also available as a text file with other reports from this meeting. No other interest groups reporting.

4. Ontology Development Issues

4.1 Metabolism terms
It was decided that "metabolism" would be split into "cellular metabolism" and "organismal metabolism". This is similar to the division of "physiological process" into "organismal physiological process" and "cellular physiological process". Further discussion about this will continue in coming weeks.
4.2 Regulation of non-biological processes Example: regulation of water crystallization: water crystallization is not a biological process but it is regulated biologically (e.g., "regulat