the Gene Ontology

GO Annotation File Format Guide

The Gene Ontology Consortium represents annotation data using a tab-delimited format, where each line represents a single link between a gene product (protein, gene, transcript, etc.) and a GO term with a certain evidence code and the reference to support the link. This page documents the format of these files. For more general information on annotation, please see the GO annotation guide.

  • Annotation File Fields
  • Definitions and requirements for field contents
  • Annotation File Format Quality Control Script
  • Errors Checked
  • Taxon IDs
  • Script command line options

Annotation File Fields

The flat file format comprises 15 tab-delimited fields; the table below marks each field as required or optional. See also the annotation fields page, with a table showing the columns in order and example annotations.

Fields in the annotation file

Column  Content                       Required?  Example
1       DB                            required   SGD
2       DB_Object_ID                  required   S000000296
3       DB_Object_Symbol              required   PHO3
4       Qualifier                     optional   NOT
5       GO ID                         required   GO:0003993
6       DB:Reference (|DB:Reference)  required   SGD_REF:S000047763|PMID:2676709
7       Evidence code                 required   IMP
8       With (or) From                optional   GO:0000346
9       Aspect                        required   F
10      DB_Object_Name                optional   acid phosphatase
11      DB_Object_Synonym (|Synonym)  optional   YBR092C
12      DB_Object_Type                required   gene
13      taxon(|taxon)                 required   taxon:4932
14      Date                          required   20010118
15      Assigned_by                   required   SGD
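
As an illustration of the 15-column layout, the following is a minimal Python sketch that splits one annotation line into named fields. The class name Annotation, its attribute names, and the function parse_line are hypothetical choices made for this example; they are not part of any GO Consortium tool.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """One annotation line; attribute order follows columns 1 to 15."""
    db: str
    db_object_id: str
    db_object_symbol: str
    qualifier: str
    go_id: str
    reference: str
    evidence_code: str
    with_from: str
    aspect: str
    db_object_name: str
    db_object_synonym: str
    db_object_type: str
    taxon: str
    date: str
    assigned_by: str


def parse_line(line: str) -> Annotation:
    """Split a tab-delimited line and require exactly 15 fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 15:
        raise ValueError(f"expected 15 tab-delimited fields, got {len(fields)}")
    return Annotation(*fields)


# Example line assembled from the example values in the table above.
example = "\t".join([
    "SGD", "S000000296", "PHO3", "NOT", "GO:0003993",
    "SGD_REF:S000047763|PMID:2676709", "IMP", "GO:0000346", "F",
    "acid phosphatase", "YBR092C", "gene", "taxon:4932", "20010118", "SGD",
])
record = parse_line(example)
print(record.db_object_symbol, record.go_id, record.evidence_code)
```

Optional fields that are left empty still occupy a column, so an empty string between two tabs is kept as an empty field rather than dropped.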

Definitions and requirements for field contents

DB
refers to the database from which the identifier in DB_Object_ID (column 2) is drawn. This is not necessarily the group submitting the file. If a UniProt ID is in column 2 then column 1 should be just UniProt.
must be one of the values from the set of GO database cross-references
this field is mandatory, cardinality 1
DB_Object_ID
a unique identifier in DB for the item being annotated
this field is mandatory, cardinality 1
The entry in the DB_Object_ID field of the association file is the identifier for the database object, which may or may not correspond exactly to what is described in a paper. For example, a paper describing a protein may support annotations to the gene encoding the protein (gene ID in DB_Object_ID field) or annotations to a protein object (protein ID in DB_Object_ID field).
DB_Object_Symbol
a (unique and valid) symbol to which DB_Object_ID is matched
can use ORF name for otherwise unnamed gene or protein
if gene products are annotated, can use gene product symbol if available, or many gene product annotation entries can share a gene symbol
this field is mandatory, cardinality 1
The DB_Object_Symbol field should be a symbol that means something to a biologist, wherever possible (a gene symbol, for example). It is not an ID or an accession number (the second column, DB_Object_ID, provides the unique identifier), although IDs can be used in DB_Object_Symbol if there is no more biologically meaningful symbol available (e.g., when an unnamed gene is annotated).
Qualifier
flags that modify the interpretation of an annotation
one (or more) of NOT, contributes_to, colocalizes_with
this field is not mandatory; cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries (e.g. NOT|contributes_to)
See also the documentation on qualifiers in the GO annotation guide
GO ID
the GO identifier for the term attributed to the DB_Object_ID
this field is mandatory, cardinality 1
DB:Reference
one or more unique identifiers for a single source cited as an authority for the attribution of the GO ID to the DB_Object_ID. This may be a literature reference or a database record. The syntax is DB:accession_number.
Note that only one reference can be cited on a single line in the gene association file. If a reference has identifiers in more than one database, multiple identifiers for that reference can be included on a single line. For example, if the reference is a published paper that has a PubMed ID, we strongly recommend that the PubMed ID be included, as well as an identifier within a model organism database. Note that if the model organism database has an identifier for the reference, that identifier should always be included, even if a PubMed ID is also used.
this field is mandatory, cardinality 1, >1; for cardinality >1 use a pipe to separate entries (e.g. SGD_REF:S000047763|PMID:2676709).
Evidence Code
see the GO evidence code guide for the list of valid evidence codes for GO annotations
this field is mandatory, cardinality 1
With (or) From
one of:
  • DB:gene_symbol
  • DB:gene_symbol[allele_symbol]
  • DB:gene_id
  • DB:protein_name
  • DB:sequence_id
  • GO:GO_id

this field is not mandatory overall, but is required for some evidence codes (see below and the evidence documentation for details); cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries (e.g. CGSC:pabA|CGSC:pabB)

Note: This field is used to hold an additional identifier for annotations using certain evidence codes (IC, IEA, IGI, IPI, ISS). For example, it can identify another gene product to which the annotated gene product is similar (ISS) or interacts with (IPI). More information on the meaning of with or from column entries is available in the evidence documentation entries for the relevant codes.

Cardinality = 0 is not recommended, but is permitted because cases can be found in literature where no database identifier can be found (e.g. physical interaction or sequence similarity to a protein, but no ID provided). Cardinality = 0 is not allowed for ISS annotations made after October 1, 2006. Annotations where evidence is IGI, IPI, or ISS and with cardinality = 0 should link to an explanation of why there is no entry in with. Cardinality may be >1 for any of the evidence codes that use with; for IPI and IGI cardinality >1 has a special meaning (see evidence documentation for more information). For cardinality >1 use a pipe to separate entries (e.g. FB:FBgn1111111|FB:FBgn2222222).

Note that a gene ID may be used in the with column for an IPI annotation, or for an ISS annotation based on amino acid sequence or protein structure similarity, if the database does not have identifiers for individual gene products. A gene ID may also be used if the cited reference provides enough information to determine which gene ID should be used, but not enough to establish which protein ID is correct.

'GO:GO_id' is used only when the evidence code is IC, and refers to the GO term(s) used as the basis of a curator inference. In these cases the entry in the 'DB:Reference' column will be that used to assign the GO term(s) from which the inference is made. This field is mandatory for evidence code IC.

The ID is usually an identifier for an individual entry in a database (such as a sequence ID, gene ID, GO ID, etc.). Identifiers from the Center for Biological Sequence Analysis (CBS), however, represent tools used to find homology or sequence similarity; these identifiers can be used in the with column for ISS annotations.

The with column may not be used with the evidence codes IDA, TAS, NAS, or ND.
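
The rules above that tie the with column to the evidence code can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name check_with is hypothetical, the date is compared as a YYYYMMDD string, and "after October 1, 2006" is read as any date later than 20061001.

```python
from typing import List

# Evidence codes for which the with column may not be used (from the text above).
WITH_FORBIDDEN = {"IDA", "TAS", "NAS", "ND"}


def check_with(evidence_code: str, with_from: str, date: str) -> List[str]:
    """Return a list of problems for the with column of one annotation."""
    entries = with_from.split("|") if with_from else []
    problems = []
    if evidence_code in WITH_FORBIDDEN and entries:
        problems.append(f"with column may not be used with {evidence_code}")
    if evidence_code == "IC" and not entries:
        problems.append("with column is mandatory for IC")
    if evidence_code != "IC" and any(e.startswith("GO:") for e in entries):
        problems.append("GO:GO_id entries are used only with IC")
    # Assumption: YYYYMMDD strings sort chronologically, so a plain string
    # comparison stands in for a date comparison.
    if evidence_code == "ISS" and not entries and date > "20061001":
        problems.append("ISS annotations made after 2006-10-01 need a with entry")
    return problems


print(check_with("IPI", "FB:FBgn1111111|FB:FBgn2222222", "20070115"))
print(check_with("IC", "", "20070115"))
```

The sketch does not cover the recommendation that IGI, IPI, and ISS annotations without a with entry link to an explanation, since the text does not say where that link is recorded.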

Aspect
one of P (biological process), F (molecular function) or C (cellular component)
this field is mandatory; cardinality 1
DB_Object_Name
name of gene or gene product
this field is not mandatory, cardinality 0, 1 [white space allowed]
DB_Object_Synonym
Gene_symbol [or other text]
Note that we strongly recommend that gene synonyms are included in the gene association file, as this aids the searching of GO.
this field is not mandatory, cardinality 0, 1, >1 [white space allowed]; for cardinality >1 use a pipe to separate entries (e.g. YFL039C|ABY1|END7|actin gene)
DB_Object_Type
what kind of thing is being annotated
one of gene (SO:0000704), transcript (SO:0000673), protein (SO:0000358), protein_structure, complex
this field is mandatory, cardinality 1
The object type (gene, transcript, protein, protein_structure, or complex) listed in the DB_Object_Type field must match the database entry identified by DB_Object_ID. Note that DB_Object_Type refers to the database entry (i.e. does it represent a gene, protein, etc.); this column does not reflect anything about the GO term or the evidence on which the annotation is based. For example, if your database entry represents a gene, then 'gene' goes in the DB_Object_Type column, even if the annotation is to a component term relevant to the localization of a protein product of the gene. The text entered in the DB_Object_Name and DB_Object_Symbol can refer to the same database entry (recommended), or to a "broader" entity. For example, several alternative transcripts from one gene may be annotated separately, each with a unique transcript DB_Object_ID, but list the same gene symbol in the DB_Object_Symbol column.
Taxon
taxonomic identifier(s)
For cardinality 1, the ID of the species encoding the gene product.
Cardinality 2 is to be used only in conjunction with terms that have the term 'interaction between organisms' as an ancestor. The first taxon ID should be that of the organism encoding the gene or gene product, and the taxon ID after the pipe should be that of the other organism in the interaction.
this field is mandatory, cardinality 1, 2; for cardinality 2 use a pipe to separate entries (e.g. taxon:1|taxon:1000)
Date
Date on which the annotation was made; format is YYYYMMDD
this field is mandatory, cardinality 1
Assigned_by
The database which made the annotation
one of the values from the set of GO database cross-references
Used for tracking the source of an individual annotation.
Default value is value entered in column 1 (DB).
Value will differ from column 1 for any annotation that is made by one database and incorporated into another.
this field is mandatory, cardinality 1
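
The cardinalities listed for each field can be gathered into a single table and checked mechanically. The Python sketch below is one possible way to do this; the names CARDINALITY, count_entries, and check_cardinality are hypothetical, and the sketch assumes that a pipe always separates entries, including in fields with a maximum of one entry.

```python
from typing import List, Optional, Sequence, Tuple

# (field name, minimum entries, maximum entries); None means no upper limit.
CARDINALITY: List[Tuple[str, int, Optional[int]]] = [
    ("DB", 1, 1),
    ("DB_Object_ID", 1, 1),
    ("DB_Object_Symbol", 1, 1),
    ("Qualifier", 0, None),
    ("GO ID", 1, 1),
    ("DB:Reference", 1, None),
    ("Evidence code", 1, 1),
    ("With (or) From", 0, None),
    ("Aspect", 1, 1),
    ("DB_Object_Name", 0, 1),
    ("DB_Object_Synonym", 0, None),
    ("DB_Object_Type", 1, 1),
    ("Taxon", 1, 2),
    ("Date", 1, 1),
    ("Assigned_by", 1, 1),
]


def count_entries(value: str) -> int:
    """An empty field has no entries; otherwise entries are pipe-separated."""
    return len(value.split("|")) if value else 0


def check_cardinality(fields: Sequence[str]) -> List[str]:
    """Return one message per field whose entry count is out of range."""
    if len(fields) != len(CARDINALITY):
        raise ValueError(f"expected {len(CARDINALITY)} fields, got {len(fields)}")
    problems = []
    for (name, low, high), value in zip(CARDINALITY, fields):
        n = count_entries(value)
        if n < low or (high is not None and n > high):
            problems.append(f"{name}: {n} entries not allowed")
    return problems


EXAMPLE = [
    "SGD", "S000000296", "PHO3", "NOT|contributes_to", "GO:0003993",
    "SGD_REF:S000047763|PMID:2676709", "IMP", "", "F",
    "acid phosphatase", "YFL039C|ABY1|END7|actin gene", "gene",
    "taxon:1|taxon:1000", "20010118", "SGD",
]
print(check_cardinality(EXAMPLE))
```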

Note that several fields contain database cross-references (dbxrefs) in the format dbname:dbaccession. The fields are: GO ID (where dbname is always GO), DB:Reference, With, and Taxon (where dbname is always taxon). For GO IDs, do not repeat the 'GO:' prefix (i.e. always use GO:0000000, not GO:GO:0000000).
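
The dbname:dbaccession convention can be illustrated with a short Python sketch. The function names parse_dbxref and is_valid_go_id are hypothetical. The sketch splits on the first colon only, and it assumes that the accession part of a GO ID consists solely of digits, which is enough to reject a repeated 'GO:' prefix.

```python
from typing import Tuple


def parse_dbxref(value: str) -> Tuple[str, str]:
    """Split 'dbname:dbaccession' on the first colon."""
    dbname, sep, accession = value.partition(":")
    if not sep or not dbname or not accession:
        raise ValueError(f"not a dbname:dbaccession pair: {value!r}")
    return dbname, accession


def is_valid_go_id(value: str) -> bool:
    """True for 'GO:' followed by digits only; rejects 'GO:GO:0000000'."""
    try:
        dbname, accession = parse_dbxref(value)
    except ValueError:
        return False
    return dbname == "GO" and accession.isdigit()


print(parse_dbxref("PMID:2676709"))
print(is_valid_go_id("GO:0003993"), is_valid_go_id("GO:GO:0000000"))
```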


Annotation File Format Quality Control Script

This script is provided as a quality control check in an effort to validate the format and to partially check the data provided within the gene association files. This script is used on all gene association files before they are loaded into the GO database. The results of this filtering step are reported back to the submitting group.

This script is intended to be generic and to enforce the standards defined by the GO Consortium. Use this script to validate your gene association file before committing it to the archive. The checks provided define the minimum standard format for the repository. Suggestions are welcome for enhancements to this process. The Perl script is available from the GO CVS archive, and can be downloaded from the GO FTP site.

Submitted gene association files are committed to the GO CVS repository into the gene association file submissions directory. The checking and filtering script is run nightly on any newly deposited files by the GO Database staff at Stanford. The output of the script is placed in the gene association file directory and subsequently used to load the GO Database.

Errors Checked

The input file is checked for the following types of errors. If a row of the gene association file is found to contain an error it is removed from the final output file.

The script checks each line for the correct number of columns and the cardinality of each column, looks for leading or trailing whitespace, and performs a number of specific checks on the data in particular columns.

These specific checks include use of the defined terms for Qualifier, Evidence, Aspect, and DB Object type columns. The DB:Reference, Taxon and GO ID columns are checked for minimal form. The Date is also verified to match the YYYYMMDD format.
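
The Perl script itself is not reproduced here. The following Python sketch only illustrates what checks of this kind might look like, using the permitted values given earlier on this page for Qualifier, Aspect, and DB_Object_Type. The function names are hypothetical, and the evidence code check is omitted because the list of valid codes lives in the evidence code guide.

```python
from datetime import datetime
from typing import List

QUALIFIERS = {"NOT", "contributes_to", "colocalizes_with"}
ASPECTS = {"P", "F", "C"}
OBJECT_TYPES = {"gene", "transcript", "protein", "protein_structure", "complex"}


def is_valid_date(value: str) -> bool:
    """True if value is exactly eight digits forming a real YYYYMMDD date."""
    if len(value) != 8 or not value.isdigit():
        return False
    try:
        datetime.strptime(value, "%Y%m%d")
    except ValueError:
        return False
    return True


def check_line(line: str) -> List[str]:
    """Return the problems found on one annotation line."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 15:
        return [f"expected 15 columns, found {len(fields)}"]
    problems = []
    for number, value in enumerate(fields, start=1):
        if value != value.strip():
            problems.append(f"column {number}: leading or trailing whitespace")
    for qualifier in (fields[3].split("|") if fields[3] else []):
        if qualifier not in QUALIFIERS:
            problems.append(f"column 4: unknown qualifier {qualifier!r}")
    if fields[8] not in ASPECTS:
        problems.append(f"column 9: unknown aspect {fields[8]!r}")
    if fields[11] not in OBJECT_TYPES:
        problems.append(f"column 12: unknown object type {fields[11]!r}")
    if not is_valid_date(fields[13]):
        problems.append(f"column 14: date {fields[13]!r} is not YYYYMMDD")
    return problems


GOOD_LINE = "\t".join([
    "SGD", "S000000296", "PHO3", "NOT", "GO:0003993",
    "SGD_REF:S000047763|PMID:2676709", "IMP", "", "F",
    "acid phosphatase", "YBR092C", "gene", "taxon:4932", "20010118", "SGD",
])
print(check_line(GOOD_LINE))
```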

Column 1 and all other database abbreviations used within the gene association file are checked (case-insensitively) to confirm that each abbreviation is defined within the GO database cross-references.

The GO IDs mentioned in the file are checked, using the current gene_ontology.obo file. Rows with obsolete GO IDs are removed, as well as any row containing an invalid GO ID.

All IEA annotations that are over one year old are removed. This filtering step is completed using the date of annotation stated in column 14. Obviously, the validity of the information in the date column is thus very important.
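
The age filter for IEA annotations can be sketched in Python as below. This is an illustration, not the script's actual logic: the function name is_stale_iea is hypothetical, and "over one year old" is assumed here to mean more than 365 days before the date on which the filter runs.

```python
from datetime import date, datetime, timedelta

ONE_YEAR = timedelta(days=365)  # assumption: one year is taken as 365 days


def is_stale_iea(evidence_code: str, annotation_date: str, today: date) -> bool:
    """True if the annotation is IEA and its column 14 date is over a year old."""
    if evidence_code != "IEA":
        return False
    made = datetime.strptime(annotation_date, "%Y%m%d").date()
    return today - made > ONE_YEAR


run_date = date(2008, 11, 18)
print(is_stale_iea("IEA", "20070101", run_date))
print(is_stale_iea("IEA", "20080601", run_date))
print(is_stale_iea("IMP", "20010118", run_date))
```

Passing the run date in as a parameter, rather than reading the system clock inside the function, keeps the check reproducible.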

Taxon IDs

A major component of the filtering is the requirement that particular taxon IDs can only be included within the association files provided by specific projects. For example, the taxon ID for Mus musculus (taxon:10090) is limited to the file provided by the Mouse Genome Informatics project. The GO Consortium has defined a set of authoritative groups for the major model organisms in the table below.

Authoritative GO Consortium groups for model organisms
Project name / Species
Candida Genome Database
  • Candida albicans, taxon:5476
dictyBase
  • Dictyostelium, taxon:5782
  • Dictyostelium discoideum, taxon:44689
  • Dictyostelium discoideum AX2, taxon:366501
  • Dictyostelium discoideum AX4, taxon:352472
FlyBase
  • Drosophila melanogaster, taxon:7227 (fruit fly)
Leishmania major GeneDB
  • Leishmania major, taxon:5664
Plasmodium falciparum GeneDB
  • Plasmodium falciparum, taxon:5833 (malaria parasite P. falciparum)
Schizosaccharomyces pombe GeneDB
  • Schizosaccharomyces pombe, taxon:4896 (fission yeast)
Trypanosoma brucei GeneDB
  • Trypanosoma brucei TREU927, taxon:185431
Glossina morsitans GeneDB
  • Glossina morsitans morsitans, taxon:37546
goa_chicken, GO Annotation at EBI
  • Gallus gallus, taxon:9031 (chicken)
  • Gallus gallus bankiva, taxon:208525
  • Gallus gallus gallus, taxon:208526
  • Gallus gallus murghi, taxon:400035
  • Gallus gallus spadiceus, taxon:208524
goa_cow, GO Annotation at EBI
  • Bos taurus, taxon:9913 (cattle)
  • Bos taurus X Bison bison, taxon:297284 (beefalo)
  • Bos taurus x Bos indicus, taxon:30523
goa_human, GO Annotation at EBI
  • Homo sapiens, taxon:9606 (human)
gramene_oryza, Gramene
  • Oryza alta, taxon:52545
  • Oryza australiensis, taxon:4532
  • Oryza barthii, taxon:65489
  • Oryza brachyantha, taxon:4533
  • Oryza coarctata, taxon:77588
  • Oryza eichingeri, taxon:29689
  • Oryza glaberrima, taxon:4538 (African rice)
  • Oryza glumipatula, taxon:40148
  • Oryza grandiglumis, taxon:29690
  • Oryza granulata, taxon:110450
  • Oryza latifolia, taxon:4534
  • Oryza longiglumis, taxon:83309
  • Oryza longistaminata, taxon:4528
  • Oryza malampuzhaensis, taxon:127571
  • Oryza meridionalis, taxon:40149
  • Oryza meyeriana, taxon:83307
  • Oryza minuta, taxon:63629
  • Oryza nivara, taxon:4536
  • Oryza officinalis, taxon:4535
  • Oryza punctata, taxon:4537
  • Oryza rhizomatis, taxon:65491
  • Oryza ridleyi, taxon:83308
  • Oryza rufipogon, taxon:4529
  • Oryza sativa, taxon:4530 (rice)
  • Oryza sativa Indica Group, taxon:39946
  • Oryza sativa Japonica Group, taxon:39947
  • Oryza schlechteri, taxon:110451
  • Oryza sp. IRGC 105360, taxon:364100
  • Oryza sp. IRGC 81916, taxon:364099
  • Panicum, taxon:4539
Mouse Genome Informatics
  • Mus musculus, taxon:10090 (house mouse)
PAMGO_Atumefaciens
  • Agrobacterium tumefaciens str. C58, taxon:176299
Rat Genome Database
  • Rattus norvegicus, taxon:10116 (Norway rat)
Saccharomyces Genome Database
  • Saccharomyces cerevisiae, taxon:4932 (baker's yeast)
  • Saccharomyces cerevisiae RM11-1a, taxon:285006
  • Saccharomyces cerevisiae YJM789, taxon:307796
  • Saccharomyces cerevisiae var. diastaticus, taxon:41870
The Arabidopsis Information Resource
  • Arabidopsis thaliana, taxon:3702 (thale cress)
tigr_Aphagocytophilum
  • Anaplasma phagocytophilum HZ, taxon:212042
tigr_Banthracis
  • Bacillus anthracis str. Ames, taxon:198094
tigr_Cburnetii
  • Coxiella burnetii RSA 493, taxon:227377
tigr_Chydrogenoformans
  • Carboxydothermus hydrogenoformans Z-2901, taxon:246194
tigr_Cjejuni
  • Campylobacter jejuni RM1221, taxon:195099
tigr_Cperfringens
  • Clostridium perfringens ATCC 13124, taxon:195103
tigr_Cpsychrerythraea
  • Colwellia psychrerythraea 34H, taxon:167879
tigr_Dethenogenes
  • Dehalococcoides ethenogenes 195, taxon:243164
tigr_Echaffeensis
  • Ehrlichia chaffeensis str. Arkansas, taxon:205920
tigr_Gsulfurreducens
  • Geobacter sulfurreducens PCA, taxon:243231
tigr_Hneptunium
  • Hyphomonas neptunium ATCC 15444, taxon:228405
tigr_Lmonocytogenes
  • Listeria monocytogenes str. 4b F2365, taxon:265669
tigr_Mcapsulatus
  • Methylococcus capsulatus str. Bath, taxon:243233
tigr_Nsennetsu
  • Neorickettsia sennetsu str. Miyayama, taxon:222891
tigr_Pfluorescens
  • Pseudomonas fluorescens Pf-5, taxon:220664
tigr_Psyringae
  • Pseudomonas syringae pv. tomato str. DC3000, taxon:223283
tigr_Psyringae_phaseolicola
  • Pseudomonas syringae pv. phaseolicola 1448A, taxon:264730
tigr_Soneidensis
  • Shewanella oneidensis MR-1, taxon:211586
tigr_Spomeroyi
  • Silicibacter pomeroyi DSS-3, taxon:246200
tigr_Tbrucei_chr2
  • Trypanosoma brucei, taxon:5691
tigr_Vcholerae
  • Vibrio cholerae O1 biovar El tor, taxon:686
WormBase database of nematode biology
  • Caenorhabditis elegans, taxon:6239
Zebrafish Information Network
  • Danio rerio, taxon:7955 (zebrafish)
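
The taxon restriction can be pictured as a lookup from taxon ID to the authoritative project. The Python sketch below uses a handful of entries from the table above. The names AUTHORITY and taxon_allowed are hypothetical, and two points are assumptions rather than documented behaviour: that the caller knows which project submitted the file, and that only the first taxon in column 13 (the organism encoding the gene product) is checked.

```python
# A few taxon IDs and their authoritative projects, copied from the table above.
AUTHORITY = {
    "taxon:10090": "Mouse Genome Informatics",
    "taxon:7227": "FlyBase",
    "taxon:4932": "Saccharomyces Genome Database",
    "taxon:10116": "Rat Genome Database",
    "taxon:7955": "Zebrafish Information Network",
}


def taxon_allowed(taxon_column: str, submitting_project: str) -> bool:
    """True if the first taxon in column 13 may appear in this project's file."""
    first_taxon = taxon_column.split("|")[0]
    owner = AUTHORITY.get(first_taxon)
    # Taxon IDs with no authoritative project are not restricted.
    return owner is None or owner == submitting_project


print(taxon_allowed("taxon:10090", "Mouse Genome Informatics"))
print(taxon_allowed("taxon:10090", "FlyBase"))
```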

Script command line options

Usage help for the script is available with the -h option. The script is designed to be run from the go/gene-associations/submission directory within a GO CVS sandbox. By default the script needs the go/doc/GO.xrf_abbs and go/ontology/gene_ontology_edit.obo files. The input gene association file is read from STDIN by default, or from the specified file defined with the -i option.

Usage

A. check a file for any errors, obsolete GO IDs or old IEA annotations

filter-gene-association.pl -i gene_association.sgd.gz

B. filter any problems and output the validated lines, including headers

filter-gene-association.pl -i gene_association.fb.gz -w > filtered-output

C. check a file with taxon ID checking turned off, and write the bad lines to STDOUT

filter-gene-association.pl -i gene_association.fb.gz -p nocheck -e > bad-lines

System requirements

The script is written using basic Perl and should be portable to most systems. It has been tested on Mac OS X with Perl 5.8.1 and on Solaris with Perl 5.6.1 and later.

Submitted by Mike Cherry, 2005-10-19

Last modified Tuesday, 18-Nov-2008 17:40:14 PST
Copyright © 1999-2009 the Gene Ontology