This file contains the collected minutes of Gene Ontology Consortium meetings.

===================================================================

Ontology Meeting, Palo Alto, Jan 10 & 11, 1999

Participants:
SGD and Arabidopsis Databases: Mike Cherry, David Botstein, curators of SGD and Arabidopsis db.
FlyBase: Michael Ashburner & Suzi Lewis
MGD: Janan Eppig, Joel Richardson, Judith Blake
Astra: Ken Fasman

1. One or three? Answer is Three:
The first discussion reaffirmed that the Ontologies will be developed independently. In the future, we may explore edges between the independent ontologies, but not now. After stabilization and initial annotations of product to list terms, we will see whether to combine all together, what the levels of annotation are by species, and see about establishing edges between the three sets of terms.
- don't get into describing 'location' as part of 'function'
- three ontologies are a) function b) process c) subcellular location (but see below).

2. Gene Function vs. Gene Products?
Recognized confusion in gene product/complex listings in function list. Agreed to move protein complexes and their components to the 'cellular' list. Renamed 'cellular location' to 'cellular components'. Decided not to create a fourth category, but rather to think of 'cellular' syntax as 'is located in' with synonym 'is subcomponent of'. Recognized a subtle distinction between something that is 'part of' a macromolecule and something that is 'located in', i.e. 'part of', the nucleus. Still, decided all would be placed in this category. This is a system designed to deal with a state of incomplete knowledge. Cell is composed of the whole set of product complexes.

3. Define syntax:
a. function... 'is a', hierarchical...
b. process... 'is a subprocess of' but may also list 'is an instance of', a DAG
c. component... 'is located in', synonym: 'is subcomponent of', but may also list 'is an instance of', a DAG.

4.
Specific function syntax...
a) Not for... Drop the 'Not for Mus, Not for Drosophila' distinctions
b) Precursors... receive function process annotation of mature protein.
c) Facultative/Obligatory... 'maybe part of', 'sometimes part of', 'part of under certain conditions': some things are present at some times and not at others, or we don't know. Some things spend some of the time in one compartment and some in the other (cell cycle proteins). Decided not to annotate these distinctions. Decided function ontology will be a straight 'is a', no qualifiers.

5) Cell is a generic cell, will not divide by cell type. Generic gene products will be listed as needed. Example: Alpha tubulin... yeast has two, elegans has 6, some are there in mitosis, some are not. What are the leaves of the trees? Are they nodes representing genes from all species? So, we've thrown out individual gene products except as needed, but gene complexes are still here. Not creating a laundry list of gene products of each species, but adding component parts as necessary.
Example 1.
a) cellular component: gene product A, gene product B, gene product C
b) function: alpha tubulin
c) process: mitosis, axonal transport
Not going to put in the relatedness between complex, process, and function until we have some data and a better understanding of how this will work.

6) If you have process, function and localization, it's almost as good as having a small paragraph about the gene. (example of micro-array paper)

7) Implementation Plan
a) Michael does revision of current version
b) All participants edit list very carefully
c) Suzi assigns GO numbers 'for real'
d) Start alpha annotations
e) Future terms added through 'ontology manager', currently Michael.
8) Immediate future
a) requires prototype funding (possibly from Astra via Ken Fasman, other ideas too)
b) each database needs one curator for project
c) need overall curation manager (currently Michael, person would work with Michael)
d) each database needs capital money for computer 'kit'
e) need money for meetings/travel. two meetings/year, alternate coasts
f) need programmer, ultimately 2, one for database DBA, when db created, second for interface, tools.

Astra is buying our commitment to annotate our databases to the common vocabulary... looking for increased usability and searchability of each individually and collectively.

===================================================================

Minutes of the GO Project Meeting

The GO Meeting was held 17-19 May 1999 at the Banbury campus of CSH.

In attendance were:
FlyBase (Michael Ashburner, Suzi Lewis)
SGD (Mike Cherry, Midori Harris)
MGI (Janan Eppig, Judy Blake, Joel Richardson, Martin Ringwald, Allan Davis)

A Summary of the Meeting:

The outlined agenda for the meeting was:
1. semantics
2. species specificity
3. content
4. implementation
5. software
6. resources

1. The GO Project is recognized as a shared, pragmatic database resource involving three separate ontologies (Gene Function, Process, Cellular Component) that represent independent structured sets of terms for performing biological queries across the genomic databases of different species. It is not a definitive phylogenetic classification system of biology. The current GO Project is composed of three Organism Databases: FlyBase (Drosophila), SGD (yeast), and MGD (mouse). It is hoped that additional organism databases may subsequently join. Each Organism Database will annotate its genes to the three ontological categories and deposit these results in a Universal GO browser (currently being constructed by Suzi). Concurrently, each Organism Database will use its own annotations in whatever way it sees fit at its own web site.
The three database (DB) groups are anxious to start implementing the project. It is felt that once people actually start 'getting their hands dirty' by annotating and testing out the system, we will recognize potential pitfalls and successes and will see areas of the ontologies that need to be developed further. 2. The system will be structured as: a Master GO List with assigned GO# identifiers. This Master List will be made available to all DB. Any curator can add to/modify this list (see below). All additions or modifiers will pass through a Central Manager (a biology-trained individual) who will then update the Master List accordingly. The Master List is anticipated to go through a period of flux during the initial phases and when any new organism joins the group. It will be the responsibility of each DB to regularly check in with the Master List and read all communications between DBs so that everyone is 'on the same page'. 3. It was recognized that this project is a WORK IN PROGRESS: meaning, it is a dynamic system where things will be added and changed as curators see problems and concerns. With subsequent biological knowledge, a re-organization of the hierarchies may even become necessary. Provisions for this re-organization will be made by the following: each DB group will initially be assigned a block of 'free' GO identification numbers (GO#) which can be used to add or modify new terms to the structured GO list. (New terms should use unique words/phrases). Any such modification will be simultaneously submitted to a Central Manager and representative of each DB via email/XML working from the current GO List. We anticipate that these initial modifications will be at lower levels of the hierarchy and not include any major re-organization that could directly and immediately affect the annotations of other DB. 
If that happens to be the case, the curator is responsible for FIRST contacting the Central Manager for inquiry, discussion, and approval PRIOR to initiating or submitting any changes to the GO list. Since a Central Manager has yet to be hired, it is prudent for curators not to initiate any major renovations to the GO List for quite some time. 4. Common Sense & Individual Responsibility must rule here: that is, each group is responsible for reading the corresponding emails submitted by each DB; each curator should check in frequently with the modified GO list on a regular basis; if a curator anticipates to spend a long time modifying some structure of the GO List he should alert all other members and then work as quickly as possible to update the changes. Effective communication is key to the success of this project. 5. Each DB will annotate their genes/gene products to any 'depth' they see fit. This should include annotating genes to complexes and not subunits of those complexes, since the cross-species homology will not necessarily hold true to such detailed levels. Once the practice is put in motion, curators will develop a better feel for how 'deep' they should be annotating based upon the work of other curators from other DB (see below). 6. Currently, the GO List records some species-specific GO terms. With subsequent biological knowledge, it may be found that these are no longer species-specific but shared amongst different organisms, thus requiring a name change. Currently species-specific terms will only be updated when it is shown that the term is not species specific. If a species does not have a gene annotated to a specific GO term, then either it is not present in the species or it hasn't been found yet. 7. Structures will have to inherently change with the discovery of new genes that do not necessarily follow the currently established hierarchies. 
Curators are to be aware of the "True Path (Judy's) Rule": the pathway up the hierarchy must always be true. If a new gene is found to break this rule or species-specificity becomes a problem, the hierarchy should be restructured by adding more nodes and connecting terms that create a new path, so that the upward path is again true. When a term is added to the Master GO List, the curator needs to add all of the parents and children of the new term.

A suggestion was made to simplify the hierarchy: we might consider throwing out the "part of" GO items and instead use only the "is a" GO terms. After discussion, it was realized that too much information would be lost by eliminating the terms. The GO List will maintain both "is a" and "part of" terms.

The example used to work through this effort was that of the Process Ontology for the gene product 'chitin'. Chitin metabolism is a part of cuticle synthesis in the fly, and part of cell wall organization in yeast. As a result of the above discussion, the parent 'chitin metabolism' will now have daughters 'cuticle chitin metabolism' and 'cell wall chitin metabolism', with the appropriate catabolism and synthesis terms underneath them.

chitin metabolism
  chitin biosynthesis
  chitin catabolism
  cuticle chitin metabolism
    cuticle chitin biosynthesis
    cuticle chitin catabolism
  cell wall chitin metabolism
    cell wall chitin biosynthesis
    cell wall chitin catabolism

The procedure to add terms is made particularly difficult because the Process Ontology is a DAG and, given the current state of knowledge, it is volatile. We need an automated procedure to add all the arcs when necessary to expand the structure. A tool is needed that will:
1. not allow bad paths (solution: add extra nodes)
2. let curators see all paths
3. given a curator decision, automate a split

8. A CVS file system will be used to process changes made to the GO list.
This will automatically record the curator's name & date, and the curator must provide a succinct reason for implementing the change. The CVS client server will be set up by Mike Cherry: accounts will be set up for each individual curator for access and modifications. Additional curators may join by submitting a User name to Mike.

9. An idea was proposed that the GO list hierarchy encode next to the GO terms the number of genes specific to each organism DB listed underneath that particular GO term. For example:

%DNA metabolism (M-3, D-9, Y-5)
 %DNA replication (M-1, D-4, Y-3)
  %DNA dependent DNA replication (M-2, D-5, Y-2)

Meaning that there are 3 mouse (M) genes, 9 Drosophila (D) genes, and 5 yeast (Y) genes under the entire category of %DNA metabolism, and that they can subsequently be broken down further to each subordinate GO term for finer resolution. This should be helpful to curators in understanding to which level of 'depth' other curators are annotating.

10. EC#s should be kept in the GO term lists because they provide a searchable technique for curators during annotation.

11. The following ideas and agendas were proposed for goals within a 1-2 week (?) time frame from the end of this GO Meeting:
a) Stanford curators stop GO and send an updated file to Michael Ashburner.
b) Michael Ashburner parses v. 0.2a7 ---> v. 0.9
c) v. 0.9 ---> Suzi for syntax check ---> assign unique GO# to all terms (v. 1.0); and parse into XML
d) CVS established/tested by Mike Cherry; as well, Mike will try to register the web site www.gene.org if still available; if not, other suggested names: www.genestogo.org or some combination of the words GO and gene (GOgeneGO; GOgene; geneGO; etc....)
e) all GO again.
Independently, Suzi and Joel will determine necessary XML syntax and processes.
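The per-organism counts proposed in item 9 above could be produced roughly as follows. This is only a sketch under stated assumptions: the term names are borrowed from the example, but the parent map, the gene names, the annotations, and the function names (`ancestors_and_self`, `rollup_counts`, `format_term`) are all hypothetical, and it assumes that a count "under" a term means distinct genes annotated to that term or to any of its descendants.

```python
from collections import defaultdict

# Hypothetical child -> parents map. The ontology is a DAG,
# so a term may list more than one parent.
PARENTS = {
    "DNA metabolism": [],
    "DNA replication": ["DNA metabolism"],
    "DNA dependent DNA replication": ["DNA replication"],
}

# Hypothetical (database letter, gene, term) annotations; not real data.
ANNOTATIONS = [
    ("M", "geneA", "DNA dependent DNA replication"),
    ("M", "geneB", "DNA metabolism"),
    ("D", "geneC", "DNA replication"),
    ("Y", "geneD", "DNA dependent DNA replication"),
    ("Y", "geneD", "DNA replication"),
]


def ancestors_and_self(term, parents):
    """Return the term plus every term reachable by walking up the DAG."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def rollup_counts(annotations, parents):
    """Count distinct genes per (term, database), including descendants."""
    genes = defaultdict(set)
    for db, gene, term in annotations:
        for ancestor in ancestors_and_self(term, parents):
            # A set keeps a gene from being counted twice when it is
            # annotated to several terms under the same ancestor.
            genes[(ancestor, db)].add(gene)
    return {key: len(members) for key, members in genes.items()}


def format_term(term, counts, dbs=("M", "D", "Y")):
    """Render one line in the '%term (M-n, D-n, Y-n)' style of item 9."""
    parts = ", ".join(f"{db}-{counts.get((term, db), 0)}" for db in dbs)
    return f"%{term} ({parts})"


if __name__ == "__main__":
    counts = rollup_counts(ANNOTATIONS, PARENTS)
    for term in PARENTS:
        print(format_term(term, counts))
```

Collecting genes in sets rather than adding up child totals matters in a DAG, where a gene can reach the same ancestor along more than one path.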
Between June 1999 and the next meeting, two stages will be implemented: STAGE 1: CVS established initial curation: annotate as many genes as possible (FlyBase hopes to get 3000-4000 genes done; are other DB up to the challenge?) XML exports to Suzi established a Central Manager will be hired STAGE 2: a working database something of real functionality for the public: "genes to GO" 12. Making the database available to public users should coincide with a descriptive (promotional) write-up of the GO Project in a widely circulated genetically-oriented journal, such as Trends in Genetics. Members of the GO Project should be thinking of ideas for this paper and what they would like to see in it. 13. Janan will meet with Ken Fasman (Astra) and Lisa Brooks (NIH Program Officer) at the JAX MGI Advisory Board meeting in early June 1999 and discuss initiating a co-operative (MGI, SGD, FlyBase) grant for submission in November 1999. After this meeting, Janan will update other members of the GO Project on the ideas for the grant via email correspondence. 14. Resources: $200K from Astra '99 and a promise of $200K for '00. Michael Ashburner will have the money established in an EBI U.S. account in Cambridge Trust Company. 15. The next GO Meeting is scheduled for 6-9 October 1999 to be hosted by MGI in Bar Harbor, Maine. =================================================================== GO MEETING - The Jackson Labs. Oct 7-8 1999. PEOPLE MGD Judy Blake David Hill Joel Richardson Martin Ringwold Janan Eppig Charlie Ray Ben King - Mouse sequencing Jeff Davies Richard Balderelli Allan Davies SGD Andrew Kasarskis Mike Cherry Midori Harris FB Heather Butler Michael Ashburner Suzanna Lewis Astra Zeneca Michael Rebhan AGENDA 1. Current CVS/annotation of GO 2. Putting sets together for common query interface 3. Publications 4. WWW pages 5. Other collaborations 6. Funding & resources 7. People MINUTES 1. Progress FB/Berkeley. Nothing new on software; New versions imported into query tool. 
FB/Cambridge
Report on progress of attribution in FB. About 1700 done. Celera annotation plans were reported. It is hoped that they will use GO for functional inference. FB to get its reference CDS set of genes GO'd by November 7. (Ashburner/Heather).

SGD
Midori annotating yeast genes with GO, done about 300 plus tRNAs. Also doing gene summaries of each gene in SGD. Have about 3000 to do. GO query tool for internal use on www for curators; better diff files.

MGD
Allan and David Hill have been doing assignments. Detailed hand annotation with MLC and GXD - have to write detailed reports on genes and then add GO terms. At the same time do first pass "GO-FISH" - have mapped 3,000 genes with GO terms. Also mapped via EC numbers. Had not been using CVS but keeping a file of changes. Mapping SWP Keywords to GO terms - done to letter 'E'. 650 SWP Keywords that seem to be relevant to GO. 40-50% map directly to GO. David will [or could !] finish within a week ! dph@informatics.jax.org - David Hill
MGD now beginning to use CVS (Allan)

For CVS problems: mark@genome.stanford.edu
Use "update" rather than "checkout".

Agreed number series for new terms:
SGD 0000001-0001500.
MGD 0001501-0003000.
FB 0008001-0009500.

2. Putting sets together:
What we are using now:
FB tagged value format
SGD tabbed list
MGD Excel file

Evidence statements - MGD argue for "stated by author". Following agreed as valid values:
IMP inferred from mutant phenotype
IGI inferred from genetic interaction {with }
IPI inferred from physical interaction {with } **note we changed this from protein interaction
ISS inferred from sequence similarity {with }
IDA inferred from direct assay
ASS author said so
NA not available
Evidence must not be null, even if the record is "not available".

We now want to agree on a tab delimited format - which SL can parse into XML.

MEOW Core
database. [mandatory] cardinality 1 ; controlled: MGI, FB, SGD
gene symbol. [mandatory] cardinality 1
gene symbol synonym. cardinality 0, 1, >1 [white space allowed]
gene name. cardinality 0, 1 [white space allowed]
gene identifier. [mandatory] cardinality 1
chromosome. cardinality 0, 1
map position. cardinality 0, 1
short gene description. cardinality 1
db xref, NA, protein. cardinality 0, 1, >1

GO add-on
GO id. [mandatory] cardinality 1, >1
reference id. [mandatory] cardinality 1, >1 ; must be within domain of database identified in MEOW core
evidence. [mandatory] cardinality 1, >1 ; controlled, see above
aspect. cardinality 1 ; controlled F|P|C

DB,Gene_id,Gene_symbol,GOid,ref(|refs),evidence(|evidence),aspect,name,synonym(|synonym)
tab delimiter between fields (NOT commas)
within field delimiter is |
hard return at end of record
ascii

SGD_GO_files/gene_associations
MGD_GO_files/gene_associations
FB_GO_files/gene_associations

SGD & FB do a remove of old versions before committing new.
At this stage other data will not be dumped by contributing databases to GO.

2. Query/Editor tools/databases.
Private editorial tools
Local editorial interface to modify GO (i.e. to replace CVS) - but changes to go to editor for commitment. Stanford work on editor tool.
How do we compare for internal purposes between collab. d/bases ?
Public tools
At local sites [responsibility of collab d/bases]
Cross-genome
Data base
Servlet ? or other performance enhancement
Improved query database
GO query tool must have comment to GO email button (at first to all of GO list, so that we can all see what is going on).
Each database should implement its own query tool for GO. - all

3. WWW
Mike has registered: www.geneontology.org & www.genename.org
We agree to use geneontology.org as prime address and to close down the existing ebi and fruitfly sites (these then point to geneontology.org).
Need a top page - Cherry
Suzanna to check that the Query applet can run from this new web site. - Suzi
Suzanna will activate URL hyperlinks from query report.
- Suzi Needs url syntax for MGD (see MGD Tools for Developers on home page - or contact Joel) and for SGD (contact Mike Cherry). Tree will show number of gene_associations per node. The CVS can automatically update the text files and automatically write a new version and date at top of file - Cherry ftp - three ontologies in both hierarchical and xml (rename "compartment" as "cellular component" in CVS repository). - Cherry will xml files be automatically updated by a script when ontologies are updated ? - yes, but need to look into mechanism - Suzi. - GO.bib - GO.doc .. MA to re-write as an html document. Add GXD as collaborator indep of MGD - GO.defs - ISMB paper - geneassociations.fly - geneassociations.mouse - geneassociations.yeast GO query tool from Suzanna email button for contacts; go to entire list - Cherry Must change proofs of the SGD/FB/MGD NAR January issue papers, for new url. MA to write general introduction for web page Ashburner MA to update GO.doc Ashburner Suzi to give collaborators urls for definitions. (OUP acknowledgement) - Suzi 4. Publications Where - TIGS .. probably the best for this first paper. Alternatives: Genome Research NAR Nature Genetics Bioinformatics Talk to Roberts about paper for NAR Special Issue for 2001. Ashburner, but next year. Botstein & Cherry to do a draft then to Alan Davies at MGD - Botstein/Cherry/Alan 4. Other collaborators. C. elegans - Sternberg's NIH application for WormBase has been submitted - for summer 2000 funding. Arabidopsis: TAIR (The Arabidopsis Information Resource) - Carnegie-Stanford (science)/NCGR (computing). Started Sept 1, all of old AtDB curators moved over to Carnegie. Chris Town of TIGR is on TAIR grant. MA worried that could be more than one push - TIGR (NSF annotation grant); Mike Bevan at John Innes. Ashburner to follow up. Monica Riley/Gretta Serres - functional assigments for E. coli. Need to talk to TIGR about prokaryotes. Ashburner to follow up Look at TRANSFAC classification. 
Incyte collaboration, further discussions with Frank Russo. Ashburner/Suzi
Swiss-Prot. Ashburner

5. Grants.
Janan will lead on an NIH-NHGRI R01 grant - Lisa Brooks - for Feb 1 2000. - Janan
What should we ask for:
curator for MGD
curator for SGD
curator for WormBase ? as supplement
[curator for FB already on MRC grant]
Core:
GO manager/editor
software support
travel/kit
Funding cycle:
FB to 2003 NIH 2002 MRC
SGD 2001 NIH
MGD 2001 NIH
GXD 2000-2005 (NIH Institute of Child Health)
GO 8/00-8/03 ?
Astra-Zeneca: Would Ken be willing to write two cheques, one to EBI and one to UCB, since we are the only two who now need to draw on funds ? Contracts between EBI and Jax and EBI and Stanford are academic at the moment. Should we set up a non-profit GO Inc ? Ashburner for action

6. Content
MA to finish Style Manual, work on with Andrew - Ashburner/Andrew
Need to look again at %enzyme - split by EC - what would we lose ? - use classification of substrates imposed on EC ? - Ashburner

7. Next meeting
Feb 24-26 2000 - Boston / Harvard. Talk to Bill. Ashburner
Talk to FCK re: a meeting in Les Treilles. Ashburner
Friends of GO - activate and update - add Mike Rebhan. - Ashburner
bionet.announce when new pages up and data into query tool.

FINAL REMARKS

Substantial progress has been made by all three database groups in implementing GO over the summer. This is very encouraging. Although there have been some areas of GO content that have needed changing (and several that have needed adding, as expected), in general the three ontologies seem to be working rather well. A major message of this meeting is that we must get something substantial in the public view as soon as possible. To this end we have rationalised the web sites for GO and agreed an output format for gene associations to be sent to Suzanna to drive the Query Tool. We have also agreed on a paper about GO for TIGS to be done this year.
We hope that the new web pages with a Query Tool with content can be up in a matter of weeks, tho we know that until mid-November Suzanna and Ashburner are very busy with the fly annotation. =================================================================== Gene Ontology Meeting February 25-26, 2000 at Astra-Zeneca in Cambridge, MA Attendees: Michael Ashburner (FlyBase) Suzanna Lewis (FlyBase) Heather Butler (FlyBase) Judy Blake (MGI) Janan Eppig (MGI) David Hill (MGI) Joel Richardson (MGI) Martin Ringwald (MGI) Allan Peter Davis (MGI) Michael Rebhan (Astra Zeneca) Mike Cherry (SGD) Cathy Ball (SGD) Midori Harris (SGD) Andrew Kasarskis (SGD) AGENDA ITEMS Progress Reports Celera Report Papers Collaborators and Other Projects Ontology Issues Style and Work Practices Tools for GO Questions from Michael R. Plans for Next Meeting PROGRESS REPORTS Mouse folks: Judy Blake has submitted the GO grant. The mouse members have assigned approximately 4500 genes to GO terms. 100 by hand 650 by EC number 1270 using Swiss-Prot 2500 using mouse nomenclature Without counting the Swiss-Prot data, they used 474 Molecular Function terms, 50 Cellular Component terms and 80 Biological Process terms. Since they are using automated annotation, they have performed a variety of quality checks, such as looking for more than one annotation within an ontology. They have come close to exhausting the current automated assignments and are going to be doing more by hand in the future. Yeast Folks: SGD has 1524 genes in the gene association file. About 1000 of these are ORFs and the rest are tRNAs or snoRNAs. All SGD genes have been GO-annotated by hand. Fly Folks: FlyBase currently has about 3000 genes annotated mostly by hand. Heather has worked through the protein kinases and will next tackle the protein phosphatases. Annotation of new genes will be largely done by sequence similarity, while existing genes will be done by hand in related chunks. 
When the Drosophila sequence is released in March, there will be a large amount of sequences annotated to a high-level GO ID. These will be deepened to more specific GO nodes with time. CELERA REPORT GO was used in the annotation of the Drosophila genome at Celera. Suzanna made a dataset with all genes annotated to the molecular function GO and used it for BLAST searches. Usually, the level of GO node was quite high -- only one or two terms from the top. Where experts in a field were expected to be annotating genes, the specificity of the GO nodes used were increased (for example, olfactory receptors). Ultimately, there were 40 bins labelled by GO name (the 40th was "unknown"). Annotators were then able to have a pretty reasonable guess as to the function of the new fly gene. A second binning with biological process and cellular component showed a terrific correlation with the first. About half the genes from the Celera set are associated with a GO term. Since a given gene has a less than 50% chance of having been seen by a human, an association with a GO term is very valuable. FlyBase is still waiting to receive the sequence -- it will be released with the publication of the papers in March. FlyBase will be responsible for updating the sequence in GenBank. PAPERS We agreed to immediately pursue three publications: 1) Nature Genetics solicited a short (2500 word) article from David Botstein. It will be submitted March 10, with a short author list. 2) Genome Biology -- Michael Ashburner has been asked to write a short (1000 word) article for their premier issue. It will most likely have an authorship along the lines of "The GO Consortium". 3) Genome Research -- Judy Blake will adapt the grant to a "big" paper to be submitted to Genome Research. 4) NAR database issue -- We will submit a paper to NAR as a matter of course. The submission won't be until August or September. 
Since there are likely to be changes in the NAR policies, we will discuss the details of the NAR paper at the next meeting. COLLABORATORS AND OTHER PROJECTS There was a great deal of discussion about taking on other organisms and collaborators. The conclusion was that before we take on other organisms, we must first meet the following goals: --We need to be in a database (Suzanna Lewis will be working on this, with help from Joel Richardson). Hopefully, this will be accomplished by the next meeting. See "Plans for Next Meeting" for more detailed steps. --Documentation of philosophy, styles and practices needs to be written to record and communicate our current thinking. See "Plans for Next Meeting" for more detailed steps. --A "GO manager" to coordinate changes to the ontology, arrange training, communicate with all groups, etc needs to be hired. Midori Harris has volunteered to assume the responsibility. Michael Ashburner suggested we have two classes of partners -- the first with write permission and the second without it. These "second class" partners will have to funnel suggestions and comments through a full partner. Other organism groups that have expressed interest include worm, Arabidopsis, and S. pombe. We'll invite a representative from the worm and Arabidopsis database groups to the next GO meeting. Michael Ashburner has received a grant application for "BioBabel" -- a proposal to adopt GO terms within SwissProt, Enzyme Commission and Interpro. Representatives from this group can also be invited to the next meeting. ONTOLOGY ISSUES Methods and practices for editing and maintaining the ontology took up a large portion of the discussions. Conclusions will be listed, and in the cases where the discussion is particularly illuminating, the discarded options will be listed as well. 1) Changes to GO nodes that have multiple parents... When editing one of the ontologies, it is more convenient to add another node in only one position. 
For example, if we start with the structure shown below: a b d e f
If we want to add node 'c' as a parent of 'd' and a child of node 'a', do we need to edit all the appropriate lines, or just one? The group decided to make an "editable" non-redundant version of the ontologies:
Linear, redundant format (for viewing): a b d e f c d e f
Non-redundant format (for editing): a b d % c e f
The envisioned procedure is that a curator checks out the compressed, or non-redundant, version and then views an expanded version using a planned tool we're calling "The Validator." When an edit needs to be made to an ontology, it is made in the compressed version and tested with the Validator. The compressed version is then checked back into the cvs. The Validator will be written by Joel, using specifications mentioned later. The web will display the expanded, read-only format.

2) We will add GO id to parent terms. For example, we used to state: term1 ; GOID1 % term2
Now we will state: term1 ; GOID1 % term2 ; GOID2

3) GO nodes should aggressively avoid using species-specific definitions. We agreed to substitute "Yeast mating" with "Mating, sensu Saccharomyces." Using the "sensu" reference makes the node available to other species that use the same process/function/component. Each organism database will take care of their contributions to the species-specific language.

4) We will get rid of cellular component references in the function ontology. For example, "mitochondrial primase" needs only be "primase." There are many cases where component terms are appropriate in the process ontology, so those will remain. Michael A. will take care of this.

5) Joel pointed out these logical relationships that we need to make sure are true in the ontologies:
if A is part of B and C isa B, is A part of C? --- YES
if A isa B and B isa C, is A a C? --- YES
if A is part of B and B is part of C, is A part of C? --- YES
if A isa B and C is part of B, is C part of A?
--- NOT NECESSARILY Joel will send out a list of the logical inconsistencies that he has detected. 6) An example that got a lot of attention is the case of the mitotic chromosome's location in the cellular component ontology. While the mitotic chromosome resides in the nucleus in yeast, it is cytoplasmic at this stage of cell life in mouse or fly. In addition, many organisms have chromosomes that are NOT located in the nucleus. The solution arrived at was to remove chromosome from the nucleus in general and place the appropriate subsets of chromosomes in the correct place (nuclear, cytoplasmic, mitochondrial). 7) We need to track deleted GO ids. There are types of things that can happen to GO terms -- merging two (or more) nodes, splitting a node, deleting a term. a. When a term is deleted, we will cut the line out and paste it at the end of the file (or as a child of the parent "defunct", I don't recall the final decision), using the following format: and tags. 11) We currently cannot standardize rules for subdividing ontology terms, but instead will continue to make each decision on a case-by-case basis. 12) Gene products in themselves are not nodes of the function ontology, although doing something with or to a specific gene product can be one. For example, being hedgehog is not likely to be a function, but being a hedgehog receptor or hedgehog receptor ligand are functions. 13) We may eventually need a synonym table to facilitate queries. 14) Changes that need to made to the ontology to meet the current style include eliminating unnecessary hyphens, adjust grammar so that "transporters" become described as "transport" and "transporting," remove words like "protein" and "factor" where we can be more explicit. 15) Heather and Midori will write some documentation about the evidence codes. 16) We need to think about a "best practices" document that will state and explain good work habits for both current and future annotators. 
In the meantime, we will share any help documents, such as SGD's "Instructions for Annotating Genes Using GO."
TOOLS FOR GO
1) Database. Suzanna will get a handle on this. The major difficulty has been hiring a programmer. Michael R. offered some help on this from Astra-Zeneca. Suzanna is planning on using MySQL to create a version to distribute from the central site. The schema is not yet ready, but Suzanna and Joel will work on this together. The database will also need the ability to bulk load.
2) Validator - Joel will do this. We need a validator to check for:
a. cycles
b. deletion of nodes used in gene association files
c. syntactic correctness (refer to logical relationships described in the ONTOLOGY ISSUES section.)
d. unique IDs
e. warning message of the number of affected nodes
f. orphans
g. new nodes have IDs
Associated with the validator is the ability to compact and expand the ontologies for writing and reading. The validator will run on the central site, as well as locally for checking before an edited ontology is checked back in. Joel plans on writing this in Python, so each site will need to install it.
3) GO BLAST server - Mike C. will take care of this. The GO BLAST server will use a dataset of GO-annotated protein sequences. The results should show each GO node associated with a gene product, as well as a few generations of ancestors.
4) Annotation aids. It would be nice for curators to have a tool that, given a single node, displays all other gene products at that node (and nearby nodes) as well as all their other GO associations. This would assist curators in assigning a gene product to as many GO terms as needed, by showing them all other GO terms that might be related.
5) Suzanna's browser needs to be installed at Stanford, so we can all be using it from the same server.
6) Michael R. suggested we make a link to a DTD (document type definition) file. Suzanna will look into finding a tool that will read the XML and create a DTD file.
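As an illustration of the Validator checks listed above (cycles, unique IDs, orphans, new nodes have IDs), here is a minimal Python sketch. It is not Joel's Validator: the in-memory representation (a list of `(go_id, name)` pairs plus a list of `(child, parent)` edges), the function names, and the example ids are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def find_duplicate_ids(terms):
    """Return ids that appear on more than one term (check d)."""
    counts = Counter(go_id for go_id, _name in terms if go_id)
    return sorted(go_id for go_id, n in counts.items() if n > 1)


def find_missing_ids(terms):
    """Return names of terms that have no id yet (check g)."""
    return sorted(name for go_id, name in terms if not go_id)


def find_orphans(terms, edges, roots):
    """Return ids of non-root terms that have no parent (check f)."""
    has_parent = {child for child, _parent in edges}
    return sorted(go_id for go_id, _name in terms
                  if go_id and go_id not in has_parent and go_id not in roots)


def find_cycle(edges):
    """Return one cycle as a list of ids, or None if the graph is a DAG (check a)."""
    children = defaultdict(list)
    for child, parent in edges:
        children[parent].append(child)

    WHITE, GREY, BLACK = 0, 1, 2      # unvisited, on current path, finished
    colour = defaultdict(int)

    def visit(node, path):
        colour[node] = GREY
        path.append(node)
        for nxt in children[node]:
            if colour[nxt] == GREY:   # back edge: nxt is already on the path
                return path[path.index(nxt):] + [nxt]
            if colour[nxt] == WHITE:
                found = visit(nxt, path)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for start in list(children):
        if colour[start] == WHITE:
            found = visit(start, [])
            if found:
                return found
    return None


# Hypothetical ids and names, for illustration only.
terms = [("GO:1", "a"), ("GO:2", "b"), ("GO:2", "c"), ("", "d"), ("GO:5", "e")]
edges = [("GO:2", "GO:1")]            # (child, parent)
print(find_duplicate_ids(terms))
print(find_missing_ids(terms))
print(find_orphans(terms, edges, roots={"GO:1"}))
print(find_cycle([("b", "a"), ("a", "b")]))
```

Checks b and e (nodes used in gene association files, and the count of affected nodes) would need the association files as an extra input and are left out of this sketch.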
ANSWERS TO QUESTIONS FROM MICHAEL R. 1) GO ids will be stable. They may be "defuncted", but they will not go away. 2) "is a" and "part of" are likely to be used for quite some time. However, "part of" means "can be a part of", NOT "is always a part of." 3) Incyte still expresses interest, but that's all we've received from them. 4) Homepage recommendations -- Mike C. will add a bit from the grant to add more detail to the homepage. It might also benefit from the addition of statistics from the gene association files. 5) Should we have an ftp site that allows one to download the most recent version of GO? 6) Michael R. will create a FAQ to be linked from the home page. 7) Mike C. will put SGD's PowerPoint GO presentations to the GO site. PLANS FOR THE NEXT MEETING The next GO meeting will be in Cambridge, UK June 29 and 30. The plans are: 1) Have documentation ready a. GO philosophy document (Michael A., Judy and Midori) b. Rules for making changes to GO (Michael A. and Andrew) c. Rules for applying GO terms -- this is currently project-specific. Each project needs to think about this and bring something to the table next time. This should also include particularly illuminating examples, such as chitin synthesis, mitotic chromosomes. It should also emphasize how to avoid making GO nodes too species-specific, and mention the logical aspects of inserting or moving nodes. 2) Invite representatives from BioBabel, Arabidopsis, and C. elegans. 3) Have database in place 4) Create programs described above 5) Work HARD on adding more GO definitions. We have permission to use the Oxford Dictionary of Biochemistry and Molecular Biology. 6) Make the ontology edits mentioned above 7) Write three (!) 
papers
History
We need to establish a FAQ page
We need to arrange for introductory sessions for new groups
Status
The assumption built into this is that if a term is associated with a gene product, it must necessarily follow that all parent terms are also an accurate and truthful description of that gene. The structure of the ontology is tested and validated continuously as the curators ensure that the parents, parts and GO terms are all true.
* Yeast: 1800 genes associated to GO, representing half of the total number of yeast genes that have a name; evidence codes for almost all of them
* Fly: automated annotation at the jamboree, but not submitted to GO until curators validate them.
* Mouse: mostly done automatically until now; see the handout for the numbers. When conflicts arise they try not to change the ontology unless they have to. This is done by going up to a broader term. Moving to hand annotation, particularly for new genes.
Tools and Common resources
* Now available: John's web browser (www.informatics.jax.org/~jpc/GO), modeled after MESH, and Brad's browser (www.fruitfly.org/~bradmars/cgi-bin/go.cgi), which is running off the Informix database
* Database (Informix, MySQL, and Oracle?) implemented, and there are Perl object methods in the repository. We will be writing updates to the ontology in the database after the fall meeting.
* Ontology editor, first priority. Suzi (et al.) to do by next meeting
* Mike C. to use Ian's scripts to automatically perform regular validation for the text version until the editor is ready.
* Merge two html versions of browser (Brad and John)
* Suzi/John to fix Java browser and decide to either pull the plug or continue development. Add link to Java help page
* Each organism database to provide a fasta file of protein sequences for those gene products that have been annotated. Suzi (et al.)
will set up blast search services for GO
* API to be refined as applications are developed
* Steffan and Heather to work up prototype for next meeting of rules between the separate ontologies
* Definitions: Michael is to contact Julian Dow for definitions from the Dictionary of Cell Biology
* Mike to e-mail style manual to Michael, who will then check it into CVS
* Suzi/Brad to clean up XML version
Content
* Use part-of relationship to solve the 's/t/ protein kinase' (multipart protein) problem. E.g. %s/t protein kinase
20 groups worldwide to handle nomenclature for Candida. The Candida sequence is close to being completed. The pombe sequence is completed and the paper has been submitted for publication (Sanger group). A GO annotation set has been submitted with external references being Swissprot IDs.
Progress Report - MGI (David Hill)
GO Browser now publicly available for MGI records (created/implemented by John Corradi)
Gene-by-gene annotation of 15,000 genes
effort is concentrated on moving IEA evidence to ISS evidence (or greater)
updated annotations now being downloaded to the GO webpage every Friday
Major Revision of GO subsections
Process Ontology: apoptosis (with Flybase/Heather)
Progress Report - TAIR (Leonore Reiser)
Major Revisions
Working (with Ji Yoon) on the introduction of terms into the three Ontologies to support Arabidopsis in the first instance, Plantae in general as a followthrough
Plant terms - 162 total to date, 63 in the Component Ontology (50 with definitions)
Expressed strong support for visual-based ontology curation and browsing tools
Thesauri
Worked on EGAD plant <> GO associations
A contact has been initiated with Maize DB, but little feedback has been received to date (is this contact via Susan at Cornell?).
A contact has been established at the Carnegie Institute within the Rice community regarding working with GO and Cyanobase A plant-related mailing list has been initiated The notion of a 'plant clearinghouse' for GO annotation is embryonic and being organized by Lenore and Richard Burkerist (formerly of the Sanger). Progress Report - Wormbase (Erich Schwarz) Erich is Paul Sternberg's first employee Wormbase growth Wen Chen (will be adding expression data to Wormbase AceDB) Raymond Lee (will be porting Wormbase from AceDB to other platforms) to fill one more curator's slot by summer & one db programmer Erich will be the GO-responsible party AceDB is beautiful for assisting in positional cloning. What is the relationship with Lincoln Stein? Invaluable He is on the grant that currently underwrites Wormbase Essential at the level of asking Lincoln to do things .. Lincoln put together the Wormbase.Org site at Paul Sternberg's request Eventually Wormbase will move from Cold Spring Harbor to CalTech relationship to Proteome ... they are using GO ... ?in the human curation? General feeling is that the scientists have lost control of Proteome, the businessmen have smelled profit. Proteome is helping out Worm by providing Worm database definition lines (?). relationship with the Sanger ... John Hodgin (gene mapping db) ... Sanger/WashU sequence feed ... very complex interaction map with other smaller dbs ... Richard Durbin involved as a subsidiary PI, like Lincoln Stein goal ... weekly updates rather than once every 3 months Progress Report - Prokaryotes & Protozoa Charlie Hodgkin (Glaxo; used to write software for Michael) initially contacted Michael and expressed his intention to use the GO for annotating E. coli. This launched an e-mail flurry that eventually led to interaction between Michael and Heather, and Monica Riley and her postdoc Greta Serres (around ISMB '99). Charlie, with Monica's support, is quite interested in putting the E. 
coli GO annotations in the public domain, but there is no indication of when this might be completed. Michael and Heather have mapped Monica Riley's latest (non-GO) classification to the GO, but it cannot be publicly released. This mapping required the addition of many terms to the GO and has set up the GO for use for most enteric bacteria. Monica has a great deal (10 years of work) invested in her classification scheme and has a great deal of interest in seeing a proper mapping/merge between GO and her scheme. Michael and Heather have also obtained from Monica Riley the Genprotec enzymes list (a list of E. coli proteins), and this has been parsed into the Function Ontology.
The situation around EcoCyc is complicated and it is unclear when EcoCyc <> GO mapping might be done. EcoCyc ownership is being resolved between DoubleTwist, Pangea (now DoubleTwist), and SRI; an NIH grant (content unknown) is being held up pending resolution.
There is an interest at Stanford in annotating cyanobacteria with GO, but it is unclear where this is going in the near term.
A previously noted protist interaction is dead (Russ Altman?) : a Canadian group has recently become interested in annotating protists with the GO in conjunction with some large scale sequencing.
There is currently no activity on the horizon for annotating viruses with the GO.
The group has not looked at the 'minimal microbial genome' to see if it is fully represented in the GO.
Michael would like to see Pseudomonas included (for its xenobiotic metabolism) and Streptomyces (for its antibiotic biosynthesis)
Three groups have independently applied for funding to the Wellcome Trust for the sequencing of 3 different protozoa (Leishmania, Trypanosoma cruzi, and Trypanosoma brucei), in close collaboration with a new Plasmodium database at the University of Pennsylvania.
Al Ivans (Sanger), who I presume is attached to one or more of these, is interested in using the GO in their gene annotations, though nothing is firm to date General message: would like to identify communities, work to build a foundation for them, then invite them to build on that foundation Software Developments A relational database containing the GO terms is now available. It was unclear to me whether this is in Oracle in addition to MySQL, but that is not really relevant at this juncture. The time is imminent when the following will be available to curators: direct writes to a pre-production (development) database instance elimination of duplicate ID and basic syntax errors owing to data integrity functions (currently no data integrity ... everything done as flat textfiles). Curation interface with undo, DAG view, and commit functionality (see below) Rollback to previous version, owing to audit trail functions Java Browser (John Richter) Online at fruitfly.org Demonstration was very well received by participants A lot of effort has gone into making the browser platform independent All GO terms in all three Ontologies are found in a single tree (one window in Browser) Clicking any term brings up a minimal DAG in an accessory window Queries run as Perl5 regular expressions Query Window supports either Boolean or Perl5 regular expression queries After pulling a gene product, can click on a term block and thereby highlight the terms in the Ontology Tree Window; the minimal DAGs are also shown for these terms in an accessory window The number of gene products is no longer indicated in the interface Java Editor (John Richter) The editor will have the same look and feel as the Browser when completed Editor runs as an application rather than as an applet; it is available under CVS and must be installed to be used GO ID is fixed; new terms get automatic assignment (there was brief discussion around how to assign ID's which did not reach final resolution; currently, ID 
ranges are apparently assigned to curation groups)
Structural Edits (edits to the DAG structure)
simple select/click/drag to change structure
'infinite' undo
term-merge by click/drag; one term lives on while the other is obsolesced and becomes a synonym of the live term
term-splitting enabled (do not recall details)
term obsolescence is automatic if all children are obsolesced
cyclic graphs are not automatically disallowed; thus curators need to avoid descent along obsolesced terms
For searches, obsolesced terms should all be treated as leaves
Rollbacks: In theory, 'infinite' rollback is possible, but there is currently no 'History Viewer' and John suggested that design of one would be significantly more complex than either Browser or Editor.
Logic Validation is currently not within the remit of the Editor
Erich: would like to have a visual cue that indicates which terms have been changed recently
Provides for tagging a term as part of GO Slim, and capability of adding any number of additional segregation tags in the future.
Import & Export of subtrees will soon be available (a very useful feature)
A Gene Product Viewer will be added in the near future, the nature of which is currently unknown.
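The term-merge behaviour noted above for the Editor (one term lives on, the other is obsolesced and becomes a synonym of the live term) can be sketched as follows. This is only an illustration in Python, not the Java Editor's code: the `Term` class and `merge_terms` function are hypothetical, and re-pointing the children of the obsolesced term at the surviving term is an assumption that the minutes do not spell out.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    go_id: str
    name: str
    synonyms: set = field(default_factory=set)
    parents: set = field(default_factory=set)   # ids of parent terms
    obsolete: bool = False


def merge_terms(terms, keep_id, drop_id):
    """Merge drop_id into keep_id.

    The kept term lives on; the dropped term is marked obsolete (its id
    is retained, not deleted) and its name and synonyms become synonyms
    of the kept term.
    """
    keep, drop = terms[keep_id], terms[drop_id]
    keep.synonyms.add(drop.name)
    keep.synonyms |= drop.synonyms
    # Assumption: children of the dropped term are re-pointed at the
    # surviving term so that searches never descend through an
    # obsolesced term.
    for term in terms.values():
        if drop_id in term.parents:
            term.parents.discard(drop_id)
            if term.go_id != keep_id:
                term.parents.add(keep_id)
    drop.obsolete = True


# Hypothetical ids and names, for illustration only.
terms = {
    "ID:1": Term("ID:1", "primase"),
    "ID:2": Term("ID:2", "DNA primase", synonyms={"primase activity"}),
    "ID:3": Term("ID:3", "child term", parents={"ID:2"}),
}
merge_terms(terms, keep_id="ID:1", drop_id="ID:2")
print(sorted(terms["ID:1"].synonyms), terms["ID:2"].obsolete, terms["ID:3"].parents)
```

Keeping the obsolesced record in place, rather than deleting it, matches the earlier decision that GO ids are stable and may be "defuncted" but never go away.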
Technical Discussion : aiming for a Generic Ontology Builder
The current Editor is componentized
The DAG component is generic, not specific for GO
Using Java components has allowed fancy click/drag functionality, including a smooth table resizer
All components are available on the CVS repository : with DOCUMENTATION (fancy that)
The GO Schema is available at www.fruitfly.org/annot/go/database/index.html
People are encouraged to write ports to different database formats and submit them to the GO Database mailing list
HTML Browser (Brad Mars)
The HTML Browser is running off of the GO Relational Database (Chris Mungall's Perl API) rather than off of the GO Flatfiles, as the Java Browser does
currently found at http://www.fruitfly.org/~bradmars/cgi-bin/go.cgi?accession=3700
Similar : though generally weaker : functionality to the Java Browser. Plans are to increase functionality to be on par with the Java Browser
BLAST Server (Suzanna Lewis)
There is an obvious need for the ability to BLAST against organism sequence sets using sequences retrieved via GO searches.
An unasked question : why do they not rely on the archival databases to serve BLASTs?
There was some discussion around whether to have 'all 400 versions of ADH' included or not
The current thought is to have protein sequences in fasta format deposited on the GO website and available for BLAST searches
In the case of MGI, the plan is to submit a single protein sequence per gene. This single sequence will be the primary Swissprot sequence that represents each gene product. MGI annotation is to the level of the gene, not to the gene product; therefore, spliceforms and post-translational modifications do not matter that much to them at this point in time.
In the case of yeast, about 2000 yeast protein sequences have already been provided.
The fasta header line for the yeast sequences was discussed as a standard.
general syntax : key:value[space] (key order not constrained) 'provide everything that you can provide, the more the better' community name for the gene (that used locally within a particular domain) gene name (the generally accepted, or global name for the gene) tab delimited list of GO ID's one or more external database cross-references Timing, Resource, and Prioritization Issues The primary software developers associated with the GO Consortium are John Richter and Chris Mungall. John is the primary driver behind development of the Java Browser and Editor, which appear to be the currently favored for GO delivery to end users and curators. About 3 weeks of work remain (according to John) to reach a final stage on Java Editor & Browser John is currently under obligation for development of an open-source gene annotation platform, Apollo, and timing for completion of software development for the GO Consortium is currently critically dependent upon his obligations to Apollo. Also competing for resource is a Fly Genome Reannotation project that is on-going. There is significant concern (voiced by Michael) around the absence of rigorous syntactic control and internal checks. John indicated, though, that syntactic checking algorithms are 'the stuff of papers' and were going to devour the majority of the development time. (I am quite fuzzy on exactly what the issue is here). The concern was voiced that the GO DAG could be really screwed up if the Editor is used improperly ... powerful tools mean powerful edits. However, the potential for rollbacks really softens the potential impact of this problem. Suzanna voiced the opinion that the priority lay in getting out of flatfile mode and into a relational database environment. It seemed that most everyone agreed with this. A key untackled problem appears to be 'how to commit DAGs to a database,' which apparently both John and Judy are thinking hard about. (again, I'm not quite clear on what this issue is here.) 
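To make the FASTA header convention from the BLAST Server discussion above concrete (space-separated key:value pairs, key order not constrained, a tab-delimited list of GO IDs, one or more database cross-references), here is a minimal Python sketch of a parser. The key names `symbol`, `name`, `go`, and `xref`, and all values in the example header, are hypothetical; the minutes do not record the actual keys that were agreed on.

```python
def parse_header(line):
    """Parse '>key:value key:value ...' into a dict mapping key -> list of values.

    Fields are separated by single spaces only, because the GO id list
    uses tabs internally. Each field is split on its FIRST colon, so
    values such as 'GO:0000001' may themselves contain colons.
    """
    fields = {}
    for token in line.lstrip(">").strip(" \n").split(" "):
        if not token:
            continue
        key, sep, value = token.partition(":")
        if not sep:
            raise ValueError(f"not a key:value pair: {token!r}")
        if key == "go":                      # hypothetical key for the GO id list
            values = [v for v in value.split("\t") if v]
        else:
            values = [value]
        fields.setdefault(key, []).extend(values)   # repeated keys accumulate
    return fields


# Made-up header, for illustration only.
header = ">symbol:ABC1 name:abc1 go:GO:0000001\tGO:0000002 xref:DB:123 xref:DB2:456"
print(parse_header(header))
```

A consequence of this reading of the syntax is that values cannot contain spaces; a real convention would need a quoting or escaping rule for free-text names.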
Interim Solution during resource-limited period John will do enough so that the Java Editor is available and will produce flatfiles; the flatfiles will be fed to the GO via the currently implemented CVS system. Downloads from the 'outside' Data is currently provided to external parties on an ad hoc basis. The mySQL database can be downloaded with the following caveat: 'If you make any changes, do not publish this database'. Among the groups who have downloaded is AstraZeneca (Bo Servenius, Lund, Sweden). Presumably, people are writing applications on top of the database, so John is trying to give a lead time on large changes. A Perl Development Kit has been assembled to assist in the loading/updating of external representations. A defined update/error reporting process has not been set up to date. However, each of these is currently handled through the e-mail lists (GO-FRIENDS, GO-DATABASE, GO-DIFF). The contact points should probably be Chris Mungall and John Richter. Send format problems to John Richter (http://www.fruitflyorg/annot/go/) and suggestions to John at go-admin@bdgp.lbl.gov Right now, remote access is relatively slow, presumably because there is lots of SQL being done remotely. John and Chris have some ideas bouncing about regarding 'fast access to GO Objects'. Consult John on this matter. Discussion around the Editing Process (various) (see also Software Development: Java Editor) Judy: a database administrator would help to stem the proliferation of terms and provide some control over the process Still trying to resolve the issue of how to distribute editing to a database globally across groups and disciplines. A software tool is emerging (Java Editor, design by John Richter) but the issue of editing control is still being debated. Michael: Minor Changes - Just commit them; Major Changes - consult the group via e-mail and commit after period (currently ~10 days); Really Major Changes - talk in person. 
Michael does not really want to see the current process changed. There was some discussion between Michael and Judy around similarity between the GO and extant nomenclatures, where committees meet to commit changes, but Michael would really want to avoid the potential emergence of bureaucracy, favoring maintenance of the present 'sociology' around the GO Consortium. It was generally accepted that there should be both a Working Database (Development) and an Authoritative Database (Production). There was some split of opinion over whether the Working Database should be publicly visible, but it appears that it will be made so, the argument being that the more skilled eyes that are looking at the work in development, the better. Schema Discussion, with emphasis on DBxREF A discussion ensued around whether the DBxREF field should refer to Terms and Definitions in the aggregate or separately. This discussion gets to the issue of whether a particular piece of information derived from a Source should be traceable to the Source or not. The current record structure looks like this: GO ID Term Definition Definition Refs Synonyms Synonym Refs DBxREFS (+ Boolean; supports Definition, Term, etc.) Suzanna led the discussion around proposed changes to the Schema. suggest adding a 'subclass' table to support tags that allow quick segregation of the GO into subsets (such as GO SLIM). suggest addition of sequence relationships to support sequence accession number searches, with coincident links to Gene Products Suggest addition of a type flag to the term_dbxref table to allow Term vs. Definition (or other type) values to be distinguished. 
Also, suggest addition of DBxREF tables as needed around other value-sets (such as synonyms)
Trademark (Mike Cherry)
A copyright statement has been added to the GO Consortium Page
A logo [designed by John Matese] is available for the GO Consortium (found on http://www.geneontology.org/GO.usage.html; specifically http://genome-www.stanford.edu/images/GOthumbnail.gif)
The Trademarking/Copyrighting issue comes up now as more people become interested in using GO and the potential increases for product developers to incorporate GO without attribution or in an independent but apparently associated manner.
GO SLIM
The utility of GO SLIM was generally recognized. A referent/cutout/subclass table will be added to the GO Relational DB that will allow extraction of GO SLIM, GO FAT (full), Plant Terms, and other term subsets.
Utility of GO SLIM
Initially needed for high level Celera annotation
Subsequently needed for Riken Annotation Jamboree (FANTOM meeting)
this was apparently an effort to annotate by 'computational means' about 2000 previously undescribed mouse cDNAs
11 methods were used and a manuscript is apparently being assembled
one thing that the group rues is the absence of an upfront comparative metric between the 11 methods; they could tease out a comparison between the methods, but it would be tedious, though perhaps worth it
This effort involved the MGI (David) and took place around the time of the global genomics meeting in Japan
Problem: The GO SLIM used for Celera and Riken were not identical
Useful for easy to apprehend graphics, such as pie charts
Useful for high level classification, such as across genes in a microarray experiment
Potentially more stable than full GO (though this is not a sentiment shared by all participants). In particular, the 'lower half' of the Process Ontology is in flux.
Status of GO SLIM
Currently maintained by hand (by Michael)
One outcome of the meeting was a decision to add a GO SLIM entry to each Gene Product annotated.
A commitment to automating this process was discussed. This GO SLIM view would be included in Swissprot and NCBI representations alongside the GO FAT view.
The current criterion for a term being part of GO SLIM is 'someone says so'.
There is no intrinsic connotation to the node-levels in GO. This means that there is nothing inherently similar among terms at Level 3, for instance. Thus, the slice through the Ontologies that would define GO SLIM is necessarily ragged (occurring at different levels for different branches).
Currently no alerting system to highlight changes in GO SLIM; requires consultation of DIFF files and manual extraction.
Most recent version 100-150 terms; see Appendix A.
Funding
The GO Consortium expects to receive full funding for pending grants. Funding is for 3 years beginning 1 December 2000
There is currently an administrative (Congressional Budget) delay on the disbursement of funds.
Funding is initially to support 3 groups in the GO Consortium (Mouse, Yeast, Fly), with the addition of 2 groups after the 1st year (presumably Arabidopsis and C. elegans).
Other funding ... Incyte (no details) ... AstraZeneca (no details)
Publications
A Genome Research paper is in the process of being written (GO Consortium), which will provide more details on the GO and is a followthrough from the Nature Genetics paper recently published.
A Genomics paper is in the process of being written (MGI) which will detail some of the recent mouse gene annotations incorporating the GO
Next Meeting
3, 4, 5 March Palo Alto, CA hosted by Sue Rhee, Carnegie Institute, Stanford University
Immediate Actions
Doug:send THE LETTER to him to for DoubleTwist:.invite them to Dec. meeting:.CALL ANDREW AND ASK HIM WHO TO SEND THE LETTER TO:EXPLAIN THE SITUATION..
review John Richter's FAQ about GO page: WE NEED TO SEND FASTA/GO FILE TO GO-SLIM WE NEED TO POST AT mgi THE mgi/GO FILE THAT IS SENT TO www.geneontology.org Appendix A: Current GO SLIM From: Suzanna Lewis[SMTP:suzi@bdgp.lbl.gov] Sent: Wednesday, October 25, 2000 11:34 AM To: ma11@gen.cam.ac.uk; midori@genome.Stanford.EDU; suzi@bdgp.lbl.gov Cc: go@genome.Stanford.EDU; dph@titan.informatics.jax.org Subject: Re: go_slim also needed to add unlocalised to component slim. here is the updated version $Gene_Ontology ; GO:0003673 $cellular_component ; GO:0005575 %cell wall ; GO:0005618 %extracellular ; GO:0005576 20 groups worldwide to handle nomenclature for Candida. The Candida sequence is close to being completed. The pombe sequence is completed and the paper has been submitted for publication (Sanger group). A GO annotation set has been submitted with external references being Swissprot IDs. Progress Report - MGI (David Hill) GO Browser now publically available for MGI records (created/implemented by John Corradi) Gene-by-gene annotation of 15,000 genes effort is concentrated on moving IEA evidence to ISS evidence (or greater) updated annotations now being downloading to GO webpage every Friday Major Revision of GO subsections Process Ontology: apoptosis (with Flybase/Heather) Progress Report - TAIR (Leonore Reiser) Major Revisions Working (with J. Yoon) on the introduction of terms into the three Ontologies to support Arabidopsis in the first instance, Plantae in general as a followthrough Plant terms - 162 total to date, 63 in the Component Ontology (50 with definitions) Expressed strong support for visual-based ontology curation and browsing tools Thesauri Worked on EGAD plant <> GO associations A contact has been initiated with Maize DB, but little feedback has been received to date (is this contact via Susan at Cornell?). 
A contact has been established at IRRI within the Rice community regarding working with GO and Cyanobase A plant-related mailing list has been initiated The notion of a 'plant clearinghouse' for GO annotation is embryonic and being organized by Lenore and Richard Burkiewich (formerly of the Sanger). Progress Report - Wormbase (Erich Schwarz) Erich is Paul Sternberg's first WormBase employee Wormbase growth Wen Chen (will be adding expression data to Wormbase AceDB) Raymond Lee (will be porting Wormbase from AceDB to other platforms) to fill one more curator's slot by summer & one db programmer Erich will be the GO-responsible party AceDB is beautiful for assisting in positional cloning. What is the relationship with Lincoln Stein? Invaluable He is on the grant that currently underwrites Wormbase Essential at the level of asking Lincoln to do things .. Lincoln put together the Wormbase.Org site at Paul Sternberg's request Eventually Wormbase will move from Cold Spring Harbor to CalTech Proteome is helping out Worm by providing Worm sequence definition lines relationship with the Sanger ... John Hodgin (gene mapping db) ... Sanger/WashU sequence feed ... very complex interaction map with other smaller dbs ... Richard Durbin involved as a subsidiary PI, like Lincoln Stein Progress Report - Prokaryotes & Protozoa Charlie Hodgkin (Glaxo; used to write software for Michael) initially contacted Michael and expressed his intention to use the GO for annotating E. coli. This launched an e-mail flurry that eventually led to interaction between Michael and Heather, and Monica Riley and her postdoc Greta Serres (around ISMB '99). Charlie, with Monica's support, is quite interested in putting the E. coli GO annotations in the public domain, but there is not indication of when this might be completed. There is an interest at Carnegie in annotating cyanobacteria with GO, but it is unclear where this is going in the near term. 
A previously noted protist interaction is dead: a Canadian group has recently become interested in annotating protists with the GO in conjunction with some large scale sequencing. There is currently no activity on the horizon for annotating viruses with the GO. The group has not looked at the 'minimal microbial genome' to see if it is fully represented in the GO. Michael would like to see Pseudomonas included (for its xenobiotic metabolism) and Streptomyces (for its antibiotic biosynthesis) Three groups have independently applied for funding to the Wellcome Trust for the sequencing of 3 different protozoa (leishmani, trypansoma cruzi, and trypansoma bruceii), in close collaboration with a new plasmodium database at the University of Pennsylvania. Al Ivans (Sanger), who I presume is attached to one or more of these, is interested in using the GO in their gene annotations, though nothing is firm to date General message: would like to identify communities, work to build a foundation for them, then invite them to build on that foundation Software Developments A relational database containing the GO terms is now available. It was unclear to me whether this is in Oracle in addition to MySQL, but that is not really relevant at this juncture. The time is imminent when the following will be available to curators: direct writes to a pre-production (development) database instance elimination of duplicate ID and basic syntax errors owing to data integrity functions (currently no data integrity ... everything done as flat textfiles). 
Curation interface with undo, DAG view, and commit functionality (see below) Rollback to previous version, owing to audit trail functions Java Browser (John Richter) Online at fruitfly.org Demonstration was very well received by participants A lot of effort has gone into making the browser platform independent All GO terms in all three Ontologies are found in a single tree (one window in Browser) Clicking any term brings up a minimal DAG in an accessory window Queries run as Perl5 regular expressions Query Window supports either Boolean or Perl5 regular expression queries After pulling a gene product, can click on a term block and thereby highlight the terms in the Ontology Tree Window; the minimal DAGs are also shown for these terms in an accessory window The number of gene products is no longer indicated in the interface Java Editor (John Richter) The editor will have the same look and feel as the Browser when completed Editor runs as an application rather than as an applet; it is available under CVS and must be installed to be used GO ID is fixed; new terms get automatic assignment (there was brief discussion around how to assign ID's which did not reach final resolution; currently, ID ranges are apparently assigned to curation groups) Structrual Edits (edits to the DAG structure) simple select/click/drag to change structure 'infinite' undoes term-merge by click/drag; one term lives on while the other is obsolesced and becomes a synonym of the live term term-splitting enabled (do not recall details) term obsolescence is automatic if all children are obsolesced cyclic graphs are not automatically disallowed; thus curators need to avoid descent along obsolesced terms For searches, obsolesced terms should all be treated as leaves Rollbacks: In theory, 'infinite' rollback is possible, but there is currently no 'History Viewer' and John suggested that design of one would be significantly more complex than either Browser or Editor. 
- Logic Validation is currently not within the remit of the Editor
- Erich: would like to have a visual cue that indicates which terms have been changed recently
- Provides for tagging a term as part of GO Slim, and the capability of adding any number of additional segregation tags in the future
- Import & Export of subtrees will soon be available (a very useful feature)
- A Gene Product Viewer will be added in the near future, the nature of which is currently unknown

Technical Discussion: aiming for a Generic Ontology Builder
- The current Editor is componentized
- The DAG component is generic, not specific for GO
- Using Java components has allowed fancy click/drag functionality, including a smooth table resizer
- All components are available on the CVS repository, with DOCUMENTATION (fancy that)
- The GO Schema is available at http://www.godatabase.org/dev/database/database/
- People are encouraged to write ports to different database formats and submit them to the GO Database mailing list

HTML Browser (Brad Mars)
- The HTML Browser is running off of the GO Relational Database (Chris Mungall's Perl API) rather than off of the GO Flatfiles, as the Java Browser does
- Currently found at http://www.godatabase.org/cgi-bin/go.cgi
- Similar, though generally weaker, functionality to the Java Browser. Plans are to increase functionality to be on par with the Java Browser

BLAST Server (Suzanna Lewis)
- There is an obvious need for the ability to BLAST against organism sequence sets using sequences retrieved via GO searches. An unasked question: why do they not rely on the archival databases to serve BLASTs?
- There was some discussion around whether to have 'all 400 versions of ADH' included or not
- The current thought is to have protein sequences in FASTA format deposited on the GO website and available for BLAST searches
- In the case of MGI, the plan is to submit a single protein sequence per gene. This single sequence will be the primary Swissprot sequence that represents each gene product.
- MGI annotation is to the level of the gene, not to the gene product; therefore, spliceforms and post-translational modifications do not matter that much to them at this point in time.
- In the case of yeast, about 2000 yeast protein sequences have already been provided.
- The FASTA header line for the yeast sequences was discussed as a standard.
  - general syntax: key:value[space] (key order not constrained)
  - 'provide everything that you can provide, the more the better'
  - community name for the gene (that used locally within a particular domain)
  - gene name (the generally accepted, or global name for the gene)
  - tab delimited list of GO IDs
  - one or more external database cross-references

Timing, Resource, and Prioritization Issues

The primary software developers associated with the GO Consortium are John Richter and Chris Mungall. John is the primary driver behind development of the Java Browser and Editor, which appear to be the currently favored tools for GO delivery to end users and curators. About 3 weeks of work remain (according to John) to reach a final stage on the Java Editor & Browser. John is currently under obligation for development of an open-source gene annotation platform, Apollo, and timing for completion of software development for the GO Consortium is currently critically dependent upon his obligations to Apollo. Also competing for resource is an ongoing Fly Genome Reannotation project.

There is significant concern (voiced by Michael) around the absence of rigorous syntactic control and internal checks. John indicated, though, that syntactic checking algorithms are 'the stuff of papers' and were going to devour the majority of the development time. (I am quite fuzzy on exactly what the issue is here.) The concern was voiced that the GO DAG could be really screwed up if the Editor is used improperly ... powerful tools mean powerful edits. However, the potential for rollbacks really softens the potential impact of this problem.
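To make the concern about missing internal checks a little more concrete, here is a minimal sketch of one very basic structural check: detecting a cycle in a term graph. It is not the Editor's code and says nothing about the harder 'logic validation' John referred to. The function name find_cycle, the child-to-parents dictionary layout, and the term IDs T1..T3 are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def find_cycle(parents: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle as a list of term IDs, or None if the graph is acyclic.

    `parents` maps each term ID to the IDs of its direct parents
    (a hypothetical in-memory layout, not the GO flatfile format).
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(term: str) -> Optional[List[str]]:
        state[term] = GREY
        path.append(term)
        for parent in parents.get(term, []):
            seen = state.get(parent, WHITE)
            if seen == GREY:
                # The parent is already on the current path: a cycle.
                return path[path.index(parent):] + [parent]
            if seen == WHITE:
                cycle = visit(parent)
                if cycle:
                    return cycle
        path.pop()
        state[term] = BLACK
        return None

    for term in list(parents):
        if state.get(term, WHITE) == WHITE:
            cycle = visit(term)
            if cycle:
                return cycle
    return None


# Hypothetical term IDs, for illustration only.
acyclic = {"T2": ["T1"], "T3": ["T1", "T2"]}
cyclic = {"T1": ["T3"], "T2": ["T1"], "T3": ["T2"]}
print(find_cycle(acyclic))
print(find_cycle(cyclic))
```

A check of this kind could in principle be run before a commit, though whether and where such checks belong was not settled at the meeting.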
Suzanna voiced the opinion that the priority lay in getting out of flatfile mode and into a relational database environment. It seemed that most everyone agreed with this. A key untackled problem appears to be 'how to commit DAGs to a database,' which apparently both John and Judy are thinking hard about. (Again, I'm not quite clear on what the issue is here.)

Interim Solution during resource-limited period

John will do enough so that the Java Editor is available and will produce flatfiles; the flatfiles will be fed to the GO via the currently implemented CVS system.

Downloads from the 'outside'
- Data is currently provided to external parties on an ad hoc basis.
- The MySQL database can be downloaded with the following caveat: 'If you make any changes, do not publish this database'.
- Among the groups who have downloaded is AstraZeneca (Bo Servenius, Lund, Sweden).
- Presumably, people are writing applications on top of the database, so John is trying to give a lead time on large changes.
- A Perl Development Kit has been assembled to assist in the loading/updating of external representations.
- A defined update/error reporting process has not been set up to date. However, each of these is currently handled through the e-mail lists (GO-FRIENDS, GO-DATABASE, GO-DIFF). The contact points should probably be Chris Mungall and John Richter. Send format problems to John Richter (http://www.fruitfly.org/annot/go/) and suggestions to John at go-admin@bdgp.lbl.gov
- Right now, remote access is relatively slow, presumably because there is lots of SQL being done remotely. John and Chris have some ideas bouncing about regarding 'fast access to GO Objects'. Consult John on this matter.
Discussion around the Editing Process (various) (see also Software Development: Java Editor)
- Judy: a database administrator would help to stem the proliferation of terms and provide some control over the process.
- Still trying to resolve the issue of how to distribute editing to a database globally across groups and disciplines. A software tool is emerging (Java Editor, design by John Richter) but the issue of editing control is still being debated.
- Michael: Minor Changes - just commit them; Major Changes - consult the group via e-mail and commit after a period (currently ~10 days); Really Major Changes - talk in person. Michael does not really want to see the current process changed.
- There was some discussion between Michael and Judy around similarity between the GO and extant nomenclatures, where committees meet to commit changes, but Michael would really want to avoid the potential emergence of bureaucracy, favoring maintenance of the present 'sociology' around the GO Consortium.
- It was generally accepted that there should be both a Working Database (Development) and an Authoritative Database (Production). There was some split of opinion over whether the Working Database should be publicly visible, but it appears that it will be made so, the argument being that the more skilled eyes that are looking at the work in development, the better.

Schema Discussion, with emphasis on DBxREF

A discussion ensued around whether the DBxREF field should refer to Terms and Definitions in the aggregate or separately. This discussion gets to the issue of whether a particular piece of information derived from a Source should be traceable to the Source or not. The current record structure looks like this:
- GO ID
- Term
- Definition
- Definition Refs
- Synonyms
- Synonym Refs
- DBxREFS (+ Boolean; supports Definition, Term, etc.)

Suzanna led the discussion around proposed changes to the Schema.
- Suggest adding a 'subclass' table to support tags that allow quick segregation of the GO into subsets (such as GO SLIM).
- Suggest addition of sequence relationships to support sequence accession number searches, with coincident links to Gene Products.
- Suggest addition of a type flag to the term_dbxref table to allow Term vs. Definition (or other type) values to be distinguished. Also, suggest addition of DBxREF tables as needed around other value sets (such as synonyms).

Trademark (Mike Cherry)
- A copyright statement has been added to the GO Consortium Page.
- A logo [designed by John Matese] is available for the GO Consortium (found on http://www.geneontology.org/GO.usage.html; specifically http://genome-www.stanford.edu/images/GOthumbnail.gif).
- The Trademarking/Copyrighting issue comes up now as more people become interested in using GO, and the potential increases for product developers to incorporate GO without attribution or in an independent but apparently associated manner.

GO SLIM

The utility of GO SLIM was generally recognized. A referent/cutout/subclass table will be added to the GO Relational DB that will allow extraction of GO SLIM, GO FAT (full), Plant Terms, and other term subsets.
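As a rough illustration of how such a subset table might work, the sketch below tags terms with subset names and pulls out one subset with a join. It uses Python's built-in sqlite3 module in place of MySQL so that it is self-contained. The table name term_subset, its columns, the accessions X:1 to X:3, and the term names are assumptions made for the example and are not taken from the GO schema.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical subset-tagging table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE term (
            id   INTEGER PRIMARY KEY,
            acc  TEXT UNIQUE NOT NULL,   -- placeholder accession, not a real GO ID
            name TEXT NOT NULL
        );
        -- One row per (term, subset) tag; a term may carry several tags.
        CREATE TABLE term_subset (
            term_id INTEGER NOT NULL REFERENCES term(id),
            subset  TEXT NOT NULL,
            PRIMARY KEY (term_id, subset)
        );
        """
    )
    conn.executemany(
        "INSERT INTO term (id, acc, name) VALUES (?, ?, ?)",
        [(1, "X:1", "term one"), (2, "X:2", "term two"), (3, "X:3", "term three")],
    )
    conn.executemany(
        "INSERT INTO term_subset (term_id, subset) VALUES (?, ?)",
        [(1, "slim"), (3, "slim"), (3, "plant")],
    )
    conn.commit()
    return conn


def terms_in_subset(conn: sqlite3.Connection, subset: str):
    """Return (accession, name) pairs for every term tagged with `subset`."""
    rows = conn.execute(
        """
        SELECT t.acc, t.name
        FROM term AS t
        JOIN term_subset AS s ON s.term_id = t.id
        WHERE s.subset = ?
        ORDER BY t.acc
        """,
        (subset,),
    )
    return rows.fetchall()


conn = build_demo_db()
print(terms_in_subset(conn, "slim"))
print(terms_in_subset(conn, "plant"))
```

Keeping the tags in a separate table means a new subset can be introduced by adding rows rather than by changing the term table, which seems to match the stated goal of supporting any number of additional segregation tags.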
Utility of GO SLIM
- Initially needed for high level Celera annotation
- Subsequently needed for Riken Annotation Jamboree (FANTOM meeting)
  - this was apparently an effort to annotate by 'computational means' about 2000 previously undescribed mouse cDNAs
  - 11 methods were used and a manuscript is apparently being assembled
  - one thing that the group rues is the absence of an upfront comparative metric between the 11 methods; they could tease out a comparison between the methods, but it would be tedious, though perhaps worth it
  - this effort involved the MGI (David) and took place around the time of the global genomics meeting in Japan
- Problem: The GO SLIM versions used for Celera and Riken were not identical
- Useful for easy to apprehend graphics, such as pie charts
- Useful for high level classification, such as across genes in a microarray experiment
- Potentially more stable than full GO (though this is not a sentiment shared by all participants). In particular, the 'lower half' of the Process Ontology is in flux.

Status of GO SLIM
- Currently maintained by hand (by Michael)
- One outcome of the meeting was a decision to add a GO SLIM entry to each Gene Product annotated. A commitment to automating this process was discussed. This GO SLIM view would be included in Swissprot and NCBI representations alongside the GO FAT view.
- The current criterion for a term being part of GO SLIM is 'someone says so'. There is no intrinsic connotation to the node-levels in GO. This means that there is nothing inherently similar among terms at Level 3, for instance. Thus, the slice through the Ontologies that would define GO SLIM is necessarily ragged (occurring at different levels for different branches).
- Currently no alerting system to highlight changes in GO SLIM; requires consultation of DIFF files and manual extraction.
- The most recent version has 100-150 terms; see Appendix A.

Funding
- The GO Consortium expects to receive full funding for pending grants.
- Funding is for 3 years beginning 1 December 2000.
- There is currently an administrative (Congressional Budget) delay on the disbursement of funds.
- Funding is initially to support 3 groups in the GO Consortium (Mouse, Yeast, Fly), with the addition of 2 groups after the 1st year (presumably Arabidopsis and C. elegans).
- Other funding ... Incyte (no details) ... AstraZeneca (no details)

Publications
- A Genome Research paper is in the process of being written (GO Consor