!FlyBase readme file for gene_association.fb
!version: $Revision: 1.6 $
!date: $Date: 2008/07/04 12:51:16 $
!from: FlyBase
!saved-by: s.tweedie@gen.cam.ac.uk

1. TABLE OF CONTENTS
================================================================================

1. TABLE OF CONTENTS
2. INTRODUCTION
3. GENE_ASSOCIATION.FB FILE FORMAT
4. METHODS OF GO ANNOTATION
   4.1.1 PUBLISHED LITERATURE
   4.1.2 CONFERENCE ABSTRACTS
   4.1.3 GENBANK RECORDS
   4.1.4 UNIPROT/SWISS-PROT RECORDS
   4.1.5 GENOMIC SEQUENCE DATA
   4.1.6 PERSONAL COMMUNICATIONS
5. ELECTRONIC (IEA) GO ANNOTATION IN FLYBASE
6. USE OF THE ND EVIDENCE CODE IN FLYBASE
7. CONTACT INFORMATION

2. INTRODUCTION
================================================================================

This file provides a brief description of how GO data is captured in FlyBase
and how it is displayed in the gene_association.fb file.

FlyBase is a database of genetic and molecular data for Drosophila. FlyBase
includes data on all species from the family Drosophilidae; the primary
species represented is Drosophila melanogaster.

FlyBase is produced by a consortium of researchers funded by the National
Institutes of Health, U.S.A., and the Medical Research Council, London. This
consortium includes both Drosophila biologists and computer scientists at
Harvard University, University of Cambridge (UK), and Indiana University.

For additional information, please visit the FlyBase web site at
http://flybase.org/.

3. GENE_ASSOCIATION.FB FILE FORMAT
================================================================================

The gene_association.fb file contains GO annotations for Drosophila gene
products. The gene_association.fb file uses the standard file format for
gene_association files of the Gene Ontology (GO) Consortium. A more complete
description of the file format is found here:

http://www.geneontology.org/GO.format.annotation.shtml

The following provides a brief description of the columns in the
gene_association files.
Lines beginning 'FB File:' refer specifically to the format in
gene_association.fb.

1: DB
   The database contributing the gene_association file.
   FB File: always "FB" for gene_association.fb.

2: DB_Object_ID
   A unique identifier in the database for the item being annotated.
   FB File: This is always the primary FBgn (FlyBase gene identifier) for a
   Drosophila gene.
   Example: FBgn0000490

3: DB_Object_Symbol
   A (unique and valid) symbol to which the DB_Object_ID is matched.
   FB File: This is always the primary gene symbol for a Drosophila gene.
   Example: dpp

4: Qualifier (this field is optional)
   One or more of 'NOT', 'contributes_to' or 'colocalizes_with' as
   qualifier(s) for a GO annotation. Multiple qualifiers are separated by a
   pipe (|).

5: GO ID
   The unique GO identifier for the GO term attributed to the DB_Object_ID.
   Example: GO:0005160

6: DB:Reference
   The unique identifier for the reference to which the GO annotation is
   attributed.
   FB File: Each FlyBase reference, including published literature, conference
   abstracts, personal communications, sequence records and computer files,
   has a unique 7 digit identifier (an FBrf). Where this reference is a
   published paper with a PubMed identifier, the PubMed ID is also listed in
   column 6, separated from the FBrf with a pipe (|).
   Example: FB:FBrf0136863|PMID:11432817

7: Evidence
   The evidence code for the GO annotation; one of IMP, IGI, IPI, ISS, ISA,
   ISM, ISO, EXP, IDA, IEP, IEA, TAS, NAS, ND, IC, RCA.

8: With (or) From
   FB File: This column contains the identifier for annotations where the
   evidence code is IGI, IPI, ISS, ISA, ISM, ISO, IEA or IC. For IGI the
   database gene symbol and identifier are listed. For ISS and IPI the
   identifier can be a gene symbol and identifier, or a sequence (protein or
   nucleic acid) identifier. For IC, the GO identifier of the term used as the
   basis of a curator inference is given.
   IGI example: FLYBASE:rpr; FB:FBgn0011706
   ISS example: UniProtKB:P35569
   ISS example: EMBL:AF064523
   ISS example: SGD_LOCUS:COP1; SGD:S0002304
   IC example:  GO:0045298

9: Aspect
   Which ontology the GO term belongs to: Function (F), Process (P) or
   Component (C).
   Example: P

10: DB_Object_Name
   FB File: The full name of the FlyBase gene. Where a FlyBase gene has no
   full name, this field is left blank.
   Example: decapentaplegic

11: DB_Object_Synonym
   FB File: Alternative names and symbols by which the database object is
   known. Multiple synonyms of a FlyBase gene are separated by a pipe (|).
   Example:
   BMP|CG9885|DPP|DPP-C|Decapentaplegic|Decapentaplegic/Bone MorphogeneticProtein
   |Dm-DPP|Dpp|Haplo-insufficient|Hin-d|M(2)23AB|M(2)LS1|TGF-b|TGF-beta|TGFbeta|Tegula|Tg|
   blink|blk|bone morphogenetic protein|bone morphogenic protein|heldout|ho|l(2)10638|
   l(2)22Fa|l(2)k17036|shortvein|shv

12: DB_Object_Type
   The type of object being annotated.
   FB File: always "gene" for gene_association.fb.

13: taxon
   The taxonomic identifier of the species encoding the gene product.
   Example: taxon:7227

14: Date
   The date of last annotation update, in the format 'YYYYMMDD'.
   Example: 20070821
   FB File: FlyBase started to record annotation dates in 2006; only date
   stamps later than 20060803 are accurate.

15: Assigned_by
   The source of the GO annotation.
   FB File: One of either FB or UniProtKB (see section 4.1.4).

The gene_association.fb file is updated each time data is updated on the
FlyBase web sites. This is approximately every 4-6 weeks.

4. METHODS OF GO ANNOTATION
================================================================================

Database Objects
----------------

Currently all GO annotations in FlyBase are attributed to genes. The GO terms
describe the attributes of the products (both RNA and protein) encoded by
these Drosophila genes.
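
As an illustration of the fifteen-column layout described in section 3, the
following Python sketch splits one annotation line into named fields. It is
not a FlyBase or GO Consortium tool. It assumes that columns are separated by
tabs and that lines starting with '!' are comments, as in the GO Consortium
annotation format; the names Annotation, split_pipe and parse_line are
hypothetical. The sample line is assembled from the per-column examples above
(with an arbitrarily chosen evidence code and a shortened synonym list) and is
not a real record.

```python
from typing import List, NamedTuple, Optional


class Annotation(NamedTuple):
    """One line of gene_association.fb; field order follows columns 1-15."""
    db: str
    db_object_id: str
    db_object_symbol: str
    qualifiers: List[str]      # column 4, pipe-separated, may be empty
    go_id: str
    references: List[str]      # column 6, e.g. FBrf and PMID, pipe-separated
    evidence: str
    with_from: str
    aspect: str
    db_object_name: str
    synonyms: List[str]        # column 11, pipe-separated
    db_object_type: str
    taxon: str
    date: str
    assigned_by: str


def split_pipe(field: str) -> List[str]:
    """Split a pipe-separated field, returning [] when the field is blank."""
    return [part.strip() for part in field.split("|") if part.strip()]


def parse_line(line: str) -> Optional[Annotation]:
    """Parse one line; return None for '!' comment lines and blank lines."""
    if line.startswith("!") or not line.strip():
        return None
    cols = line.rstrip("\n").split("\t")  # assumption: tab-separated columns
    if len(cols) != 15:
        raise ValueError(f"expected 15 columns, found {len(cols)}")
    return Annotation(
        cols[0], cols[1], cols[2], split_pipe(cols[3]), cols[4],
        split_pipe(cols[5]), cols[6], cols[7], cols[8], cols[9],
        split_pipe(cols[10]), cols[11], cols[12], cols[13], cols[14],
    )


# Synthetic sample built from the per-column examples; not a real annotation.
SAMPLE = "\t".join([
    "FB", "FBgn0000490", "dpp", "", "GO:0005160",
    "FB:FBrf0136863|PMID:11432817", "IDA", "", "P",
    "decapentaplegic", "BMP|CG9885|DPP", "gene", "taxon:7227",
    "20070821", "FB",
])

ann = parse_line(SAMPLE)
print(ann.db_object_symbol, ann.go_id, ann.references)
```

Keeping the pipe-separated columns (4, 6 and 11) as lists makes it simple to
test, for example, whether a reference carries a PubMed ID or whether an
annotation has the 'NOT' qualifier.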

Redundancy in gene_association.fb
---------------------------------

Redundant GO annotations at FlyBase are captured; if two papers show the same
GO data, both sets of GO data will be captured and displayed. Multiple lines
of evidence for a single GO annotation are also captured, since multiple
annotations to the same GO term add to the confidence of the GO annotation.
In addition, if two or more papers show conflicting GO data, all sets of GO
data are recorded. If subsequent references resolve the conflict, GO terms
which are no longer correct will be removed or replaced.

FlyBase curates GO data mainly from published primary papers. Additional
sources include UniProtKB records and FlyBase sequence analysis. We no longer
curate GO data from conference abstracts or GenBank records, and we only
rarely curate new GO data from reviews and personal communications from
FlyBase users. This change in policy is partly due to our participation in
the GO Reference Genome Annotation Project, which is committed to capturing
data from the primary source and avoiding assigning terms based on author
statements (evidence codes TAS and NAS).

4.1.1 PUBLISHED LITERATURE

Literature curation at FlyBase is primarily done by a paper-by-paper
approach. GO curation is one part of this literature curation. FlyBase has a
list of 20 journals from which every Drosophila paper is curated for GO data.
These journals are:

   Cell
   Current Biology
   Development
   Developmental Biology
   Developmental Cell
   Developmental Neurobiology
   EMBO Journal
   Fly
   Genes and Development
   Genetics
   Journal of Biological Chemistry
   Journal of Cell Biology
   Journal of Cell Science
   Journal of Neuroscience
   Molecular Biology of the Cell
   Nature
   Nature Cell Biology
   Neuron
   PLoS Biology
   Science

Other journals are initially 'skim' curated in order to link each paper to
the appropriate gene in FlyBase and prioritise it for further curation.

Literature curators assign GO terms to genes based on the results sections of
primary papers.

The GO curator also curates literature on a gene-by-gene basis. This is
primarily carried out as part of the Reference Genome Annotation Project,
which is a collaboration of twelve different species databases that aims to
provide high quality GO annotation based on experimental evidence (see
http://www.geneontology.org/GO.refgenome.shtml). Each month we curate a set
of target genes from a cross-species ortholog set and attempt to find all
applicable GO terms based on the published literature.

4.1.2 CONFERENCE ABSTRACTS

FlyBase no longer curates new GO data based on abstracts from the Annual
Drosophila Research Conference. Existing GO annotations associated with
abstracts will gradually be removed and replaced with data from primary
papers where available.

4.1.3 GENBANK RECORDS

In the past, GO annotations were taken from GenBank records, where the record
lists the function or location of a gene product. These GO annotations were
supported by the NAS evidence code. No new annotations are being added using
the NAS evidence code, and the existing annotations based on GenBank records
will gradually be removed as experimental evidence becomes available.

4.1.4 UNIPROT/SWISS-PROT RECORDS

Swiss-Prot records created before early 2002 were curated for GO data based
on information in the 'Comments' field, supported by the NAS evidence code.
Recent UniProtKB records are not curated for GO data by FlyBase staff;
instead, manual GO annotations made by UniProtKB curators are incorporated
into FlyBase, attributed to the published literature, with UniProtKB entered
in column 15 of the gene_association.fb file.

4.1.5 GENOMIC SEQUENCE DATA

BLASTP searches are performed on known protein sequences and protein
sequences predicted based on the genomic sequence of the Drosophila
melanogaster genome.
Genes are GO-annotated based on sequence similarity to proteins of known
function in Drosophila and/or other organisms. The ID of the similar sequence
is entered in the 'With (or) From' column (column 8). This can be a GenBank
accession, a UniProt ID or a gene identifier from a model organism database.

4.1.6 PERSONAL COMMUNICATIONS

Personal communications to FlyBase from the database users are archived. GO
data is attributed to these communications where applicable.

5. ELECTRONIC (IEA) GO ANNOTATION IN FLYBASE
================================================================================

IEA-supported GO annotations in FlyBase are now based on a single source:

i. INTERPRO 2 GO MAPPINGS

GO terms were assigned to FlyBase genes through InterPro protein domain
assignments. InterPro protein domains are assigned to FlyBase genes as part
of an ongoing collaboration between the UniProt database and FlyBase. The
translation table of InterPro protein domains to GO terms (generated by
Nicola Mulder at the EBI) was used to associate GO terms with FlyBase genes.

Before addition to FlyBase, the InterPro-predicted GO annotations were
filtered to prevent redundant annotations:

   - 'molecular_function unknown ; GO:0005554' annotations were removed from
     the InterPro-predicted set of GO terms.
   - InterPro-predicted GO terms that were identical to an existing non-IEA
     GO annotation for a FlyBase gene were not added.
   - InterPro-predicted GO terms that are a parent of (i.e. less specialized
     than) an existing non-IEA GO annotation for a gene were also not added.

All remaining InterPro GO predictions (including predicted GO terms that are
identical to an existing IEA GO annotation for a given gene) were added into
FlyBase, supported by the 'inferred from electronic annotation' (IEA)
evidence code. For further information and corresponding annotations see
FBrf0174215.

InterPro now incorporates PANTHER domains, so separate GO annotations are no
longer added based on the PANTHER protein classification system.

6. USE OF THE ND EVIDENCE CODE IN FLYBASE
================================================================================

The Gene Ontology (GO) Consortium created the evidence code "ND" to indicate
"no biological data available". This code is used for annotations to the
three root terms 'molecular_function ; GO:0003674', 'biological_process ;
GO:0008150' or 'cellular_component ; GO:0008372'.

In FlyBase the use of any of these three GO terms, attributed to reference
FBrf0159398 and supported by the ND evidence code, signifies that a curator
has examined the available literature and sequence for this gene and that, as
of the date of the annotation to the unknown term, there is no information
supporting an annotation to any GO term in that ontology.

7. CONTACT INFORMATION
================================================================================

Questions or comments about this file should be sent to:
flybase-help@morgan.harvard.edu