The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products across databases. Founded in 1998, the project began as a collaboration between three model organism databases, FlyBase (Drosophila), the Saccharomyces Genome Database (SGD) and the Mouse Genome Database (MGD). The GO Consortium (GOC) has since grown to incorporate many databases, including several of the world's major repositories for plant, animal, and microbial genomes. The GO Contributors page lists all member organizations.
The GO project has developed three structured ontologies that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner. There are three separate aspects to this effort: first, the development and maintenance of the ontologies themselves; second, the annotation of gene products, which entails making associations between the ontologies and the genes and gene products in the collaborating databases; and third, the development of tools that facilitate the creation, maintenance and use of ontologies.
The use of GO terms by collaborating databases facilitates uniform queries across all of them. Controlled vocabularies are structured so they can be queried at different levels; for example, users may query GO to find all gene products in the mouse genome that are involved in signal transduction, or zoom in on all receptor tyrosine kinases that have been annotated. This structure also allows annotators to assign properties to genes or gene products at different levels, depending on the depth of knowledge about that entity.
Shared vocabularies are an important step towards unifying biological databases, but additional work is still necessary as knowledge changes, updates lag behind, and individual curators evaluate data differently. The GO aims to serve as a platform where curators can agree on stating how and why a specific term is used, and how to consistently apply it, for example, to establish relationships between gene products.
The following areas are outside the scope of GO, and terms in these domains will not appear in the ontologies:
Further information is available on the Frequently Asked Questions (FAQ) page.
The Gene Ontology project provides controlled vocabularies of defined terms representing gene product properties. These cover three domains: Cellular Component, the parts of a cell or its extracellular environment; Molecular Function, the elemental activities of a gene product at the molecular level, such as binding or catalysis; and Biological Process, operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms.
The GO ontology is structured as a directed acyclic graph where each term has defined relationships to one or more other terms in the same domain, and sometimes to other domains. The GO vocabulary is designed to be species-agnostic, and includes terms applicable to prokaryotes and eukaryotes, and single and multicellular organisms.
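As a rough illustration of the graph structure described above, the following minimal sketch stores a toy fragment as a mapping from each term to its parents and collects all ancestors of a term. The term names and edges in `PARENTS` are hypothetical and chosen only to show a term with more than one parent; they are not an excerpt of the real ontology.

```python
from collections import deque

# Hypothetical toy fragment: each term maps to its direct parent terms.
# A term may have several parents, which is what makes this a DAG
# rather than a tree.
PARENTS = {
    "receptor tyrosine kinase signaling": ["signal transduction"],
    "signal transduction": ["cellular process", "cell communication"],
    "cell communication": ["cellular process"],
    "cellular process": ["biological_process"],
    "biological_process": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    queue = deque(parents.get(term, []))
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(parents.get(current, []))
    return seen
```

Because the traversal records visited terms, a term reached through two different parents is only processed once.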
In an example of GO annotation, the gene product "cytochrome c" can be described by the Molecular Function term "oxidoreductase activity", the Biological Process terms "oxidative phosphorylation" and "induction of cell death", and the Cellular Component terms "mitochondrial matrix" and "mitochondrial inner membrane".
These terms describe a component of a cell that is part of a larger object, such as an anatomical structure (e.g. rough endoplasmic reticulum or nucleus) or a gene product group (e.g. ribosome, proteasome or a protein dimer).
A biological process term describes a series of events accomplished by one or more organized assemblies of molecular functions. Examples of broad biological process terms are "cellular physiological process" or "signal transduction". Examples of more specific terms are "pyrimidine metabolic process" or "alpha-glucoside transport". The general rule to assist in distinguishing between a biological process and a molecular function is that a process must have more than one distinct step.
A biological process is not equivalent to a pathway. At present, the GO does not try to represent the dynamics or dependencies that would be required to fully describe a pathway.
Molecular function terms describe activities that occur at the molecular level, such as "catalytic activity" or "binding activity". GO molecular function terms represent activities rather than the entities (molecules or complexes) that perform the actions, and do not specify where, when, or in what context the action takes place. Molecular functions generally correspond to activities that can be performed by individual gene products, but some activities are performed by assembled complexes of gene products. Examples of broad functional terms are "catalytic activity" and "transporter activity"; examples of narrower functional terms are "adenylate cyclase activity" or "Toll receptor binding".
It is easy to confuse a gene product name with its molecular function; for that reason GO molecular functions are often appended with the word "activity".
Annotation is the process of assigning GO terms to gene products. The annotation data in the GO database is contributed by members of the GO Consortium, and the Consortium continuously encourages new groups to start contributing their annotations. The links below offer details on the GO annotation policies and the annotation process, and direct users to other pages of interest on GO annotation conventions, the standard operating procedures used by some consortium members, and the GO annotation file format guide.
Each annotation in the Gene Ontology (GO) pairs a single gene product identifier with a single term from the ontology. This very powerful format can also restrict the descriptiveness of a specific instance of a function or a sub-cellular location, as there must be a pre-existing (pre-composed) term in the ontology that provides full details of the specific aspects of the function.
It is not always possible to create individual terms that precisely describe the context of each activity, for example, the cellular or anatomical location, the dependency on other processes, or specific protein targets. Annotation Extensions give the curator a less restrictive environment in which to combine additional terms in a single annotation and provide a more detailed functional description for an individual gene product. Extensions in GO annotations allow GO terms to be further specified, using gene product and chemical identifiers, or terms from GO and external OBO ontologies.
When curators choose to use an Annotation Extension, they are effectively creating on-the-fly cross-product terms (post-composition). The combinatorial term created by an Annotation Extension is not added to the ontology, but Ontology Editors may choose to create an appropriate GO term describing the information in the Extension at a later stage.
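To make post-composition more concrete, here is a minimal sketch of how an extension string might be broken into its parts. It assumes the `relation(IDENTIFIER)` notation, with commas joining extensions that apply together and pipes separating independent alternatives; that syntax, the function name `parse_extensions`, and the identifiers used in any example input are assumptions for illustration rather than something stated on this page.

```python
import re

# One extension looks like relation(IDENTIFIER), e.g. part_of(EXAMPLE:0001).
_EXTENSION = re.compile(r"^\s*([A-Za-z_]+)\(([^()]+)\)\s*$")


def parse_extensions(field):
    """Parse an extension field into groups of (relation, target) pairs.

    Assumed syntax: '|' separates independent groups, ',' joins the
    extensions that belong to the same group.
    """
    if not field.strip():
        return []
    groups = []
    for group in field.split("|"):
        pairs = []
        for item in group.split(","):
            match = _EXTENSION.match(item)
            if match is None:
                raise ValueError(f"malformed extension: {item!r}")
            pairs.append((match.group(1), match.group(2)))
        groups.append(pairs)
    return groups
```

The nested list mirrors the idea that each group refines the same annotated GO term without a new pre-composed term being added to the ontology.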
Examples of capturing information in annotation extensions, such as specific substrates, products or targets, contextual information (e.g., spatial and temporal information), and details of data that cannot be captured by the current annotation format, are available in the Annotations Extension Examples section of the GO Wiki.
Additional general information about annotation can be found in the GO Annotation Policies guide.
The GO database is a relational database comprising the GO ontologies as well as the annotations of genes and gene products to terms in those ontologies. Housing both the ontologies and the annotations in a single database allows powerful queries of the annotations using the ontology. The GO database is the source of all data available through the legacy AmiGO 1.8 browser and search engine.
The GO File Format Guide documents the structure and syntax of the files available on the GO website, to assist users who need to read, write parsers for, or create these files. The following file formats are documented separately:
OBO is fully supported via the OWL-API.
The GO Consortium uses the OBO flat file format to store the ontology data. The current version is OBO 1.4, although the ontology data is also available in the previous version, OBO 1.2. The GO Consortium no longer uses or supports files in the legacy GO format. Should you require a file in this format, the command-line script obo2flat can be used to interconvert between OBO format and the legacy GO format. obo2flat is a Java script and comes as part of the OBO-Edit package; instructions on usage are provided in the OBO-Edit User Guide.
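For readers who need to write a parser, the sketch below reads `[Term]` stanzas from OBO-style text into dictionaries. It is deliberately simplified: it assumes one `tag: value` pair per line, treats a trailing ` ! ` as the start of a comment, and ignores escaping, quoted strings and trailing modifiers, so it is not a complete implementation of the OBO specification. The `TOY:` identifiers in the sample are hypothetical.

```python
def parse_obo_terms(text):
    """Collect [Term] stanzas as dicts mapping each tag to a list of values."""
    terms = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts here; keep it only if it is a [Term].
            current = {} if line == "[Term]" else None
            if current is not None:
                terms.append(current)
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            # Drop a trailing "! comment" (simplification, see lead-in).
            value = value.split(" ! ", 1)[0].strip()
            current.setdefault(tag, []).append(value)
    return terms


# Hypothetical sample; identifiers and names are placeholders.
SAMPLE_OBO = """format-version: 1.2

[Term]
id: TOY:0000001
name: parent term
namespace: biological_process

[Term]
id: TOY:0000002
name: child term
namespace: biological_process
is_a: TOY:0000001 ! parent term

[Typedef]
id: part_of
name: part of
"""

terms = parse_obo_terms(SAMPLE_OBO)
```

Values are stored as lists because tags such as `is_a` can occur more than once in a stanza.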
OBO-XML is a direct XML serialization of the OBO 1.2 format specification. The schema is specified using RELAX-NG compact syntax: obo-xml.rnc. Currently, only the ontology is available as OBO-XML.
OWL is a standard for ontology languages, produced by the W3C. Details of the translation used for GO are available on the official OboInOwl page.
Sequence data for gene products in the GO database is available in standard FASTA format from the GO database archives.
Annotation is the practice of capturing the activities and localization of a gene product with GO terms, providing references and indicating what kind of evidence is available to support the annotations. More information on how this is done can be found in the Guide to GO Annotation Policies. Members of the GO Consortium make their annotation data freely available to the public as part of the data accessed by AmiGO 2, the GO browser and search engine. Annotation data sets from individual databases can be found on the GO annotations page.
In addition, the GO consortium has prepared GO slims, 'slimmed down' versions of the ontologies that allow you to annotate genomes or sets of gene products to gain a high-level view of gene functions. Using GO slims you can, for example, work out what proportion of a genome is involved in signal transduction, biosynthesis or reproduction. See the GO Slim Guide for more information.
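The kind of high-level summary a slim provides can be sketched as follows. Each gene product's annotations are mapped up to any slim terms among their ancestors, and each gene product is counted at most once per slim term. The term names, the precomputed `ANCESTORS` table, and the gene names are all hypothetical toy data, and real slim-mapping tools may handle details differently.

```python
from collections import Counter

# Hypothetical toy data: each term maps to itself plus all its ancestors.
ANCESTORS = {
    "toy:kinase_signaling": {"toy:kinase_signaling", "toy:signal_transduction"},
    "toy:amino_acid_biosynthesis": {"toy:amino_acid_biosynthesis", "toy:biosynthesis"},
    "toy:signal_transduction": {"toy:signal_transduction"},
}
SLIM = {"toy:signal_transduction", "toy:biosynthesis", "toy:reproduction"}
ANNOTATIONS = {
    "geneA": ["toy:kinase_signaling"],
    "geneB": ["toy:amino_acid_biosynthesis", "toy:kinase_signaling"],
    "geneC": ["toy:signal_transduction"],
    "geneD": [],
}


def slim_counts(annotations, ancestors, slim):
    """Count, per slim term, the gene products annotated to it or below it."""
    counts = Counter()
    for terms in annotations.values():
        hits = set()
        for term in terms:
            hits |= ancestors.get(term, {term}) & slim
        counts.update(hits)  # each gene counts once per slim term
    return counts


def slim_proportions(annotations, ancestors, slim):
    """Express the counts as a fraction of all gene products in the set."""
    counts = slim_counts(annotations, ancestors, slim)
    total = len(annotations)
    return {term: counts[term] / total for term in slim}
```

With the toy data above, three of the four gene products map to the signal transduction slim term, one to biosynthesis, and none to reproduction.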
All data from the GO project is freely available. Visit the 'Downloads' page to obtain the ontology data in a number of different formats, including XML and MySQL. The GO file format guide has more information on these formats.
If you need lists of the genes or gene products that have been associated with a particular GO term, the current Annotations table tracks the number of annotations and provides links to the gene association files for each of the collaborating databases.
The GO project is constantly evolving, and we welcome feedback from all users. Learn more about how you can contribute to the GO by visiting our instructions page.
GO allows us to annotate genes and their products with a limited set of attributes. For example, the GO does not allow for the description of genes in terms of which cells or tissues they're expressed in, which developmental stages they're expressed at, or their involvement in disease. It is not necessary for the GO to do these things because other ontologies are being developed for these purposes. The GO Consortium supports the development of other ontologies, and all the tools for editing and curating ontologies are freely available to the public. A list of freely available ontologies that are relevant to genomics and proteomics and are structured similarly to GO can be found at the Open Biomedical Ontologies website. A larger list, which includes the ontologies listed at OBO and also other controlled vocabularies that do not fulfill the OBO criteria, is available at the Ontology Working Group section of the Microarray Gene Expression Data (MGED) Network site.
The existence of several ontologies will also allow us to create 'cross-products' that maximize the utility of each ontology while avoiding redundancy. For example, by combining the developmental terms in the GO process ontology with a second ontology that describes Drosophila anatomical structures, we could create an ontology of fly development. We could repeat this process for other organisms without having to clutter up GO with large numbers of species-specific terms. Similarly, we could create an ontology of biosynthetic pathways by combining the biosynthesis terms in the GO process ontology with a chemical ontology.
Mappings to other classification systems:
GO is not the only attempt to build structured controlled vocabularies for genome annotation, nor is it the only such series of catalogs in current use. The GO project provides mappings between GO and these other systems, although we caution that these mappings are neither complete nor exact and should only be used as a guide.
Research groups may contribute to the Gene Ontology Consortium (GOC) by providing suggestions for updating the ontology (e.g. requests for new terms) or by providing annotations, that is, associations between genes or gene products and ontology terms. Suggested edits are reviewed by the ontology editors and implemented where appropriate.
The following pages explain how you can contribute to the project. Please begin by choosing whether you wish to contribute annotations or terms to the Gene Ontology.
This page documents the steps required when supplying Gene Ontology annotations to the GO Consortium (GOC). For general information on how to conduct GO annotations, please see the GO Annotation Policies Guide.
Please contact the GOC before carrying out the annotation work; this will ensure that GOC mentors and trainers can be of assistance in producing data sets in agreement with the GOC annotation policies and format requirements.
Research groups looking to supply Gene Ontology annotations to the Consortium must submit an appropriately formatted annotation file that conforms to the syntactic and semantic requirements of the Consortium. The primary GO annotation format is the Gene Association Format (GAF) 2.0, or GAF2.0. This page contains details on how to build and populate the GAF2.0 file.
Please ensure that:
!at the start of the line
Each research group must provide a database name, which will be used to acknowledge the annotation set and to appropriately credit your work. This name will be visible in the 'assigned_by' field (Column 15) of all the annotation lines that the group contributes. This name will also be added to the list of annotation providers.
Each annotation line must include the citation of a bibliographic reference, which details the methods and results from which the annotation was made. The reference should be either a PubMed identifier or an abstract (GO_REF) describing how the annotation was made. Please see the Gene Ontology Reference Collection for a list of all current GO references.
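The requirements above can be checked mechanically before submission. The sketch below splits a tab-delimited annotation line into named fields and tests that the reference column holds a PubMed or GO_REF identifier. Only the position of 'assigned_by' (Column 15) comes from this page; the other column names follow the commonly documented 17-column GAF 2.0 layout, and the `PMID:` prefix, the function names, and every value in the sample line are assumptions or placeholders.

```python
# Assumed GAF 2.0 column order (17 tab-separated fields).
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by", "annotation_extension",
    "gene_product_form_id",
]


def parse_gaf_line(line):
    """Return a dict for an annotation line, or None for comments and blanks."""
    if not line.strip() or line.startswith("!"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GAF_COLUMNS):
        raise ValueError(f"expected {len(GAF_COLUMNS)} columns, got {len(fields)}")
    return dict(zip(GAF_COLUMNS, fields))


def has_valid_reference(record):
    """Check that at least one reference is a PubMed or GO_REF identifier."""
    references = record["db_reference"].split("|")
    return any(ref.startswith(("PMID:", "GO_REF:")) for ref in references)


# Hypothetical annotation line; every value is a placeholder.
SAMPLE_LINE = "\t".join([
    "ExampleDB", "EX:0001", "abc1", "", "GO:0000000", "PMID:0000000",
    "IDA", "", "P", "", "", "protein", "taxon:0000", "20000101",
    "ExampleDB", "", "",
])

record = parse_gaf_line(SAMPLE_LINE)
```

Here `record["assigned_by"]` holds the contributing group's database name, which is the value used to credit the annotation set.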
For research groups conducting curation, it is not always necessary to commit to supplying regular updates for their annotations. When the research team chooses to enter 'Longer-term Annotation Contribution / Collaboration' as the submitting group, a primary point of contact must also be identified so that requests may be redirected and proper action on such requests may be taken in a timely manner. The GOC will take responsibility for corrections and updates to datasets included in non-recurring submissions or those from annotation groups that become 'inactive' annotation providers.
Providing complete identifier mapping files is necessary for:
Please be aware: when identifier mapping is carried out, due to different database release cycles, sequence identifiers that should correspond with each other may not always display the same data.