GO Annotation File Formats

This page documents the file formats used to store gene associations (annotations), data capturing the attributes of gene products using terms from the Gene Ontology. For more general information on annotation, please see the GO annotation guide.

Annotation File Format Guide

The Gene Ontology Consortium stores annotation data, the representation of gene product attributes using GO terms, in tab-delimited plain text files. Each line in the file represents a single association between a gene product and a GO term with a certain evidence code and the reference to support the link.

There are two annotation file formats:

  • Gene Association File (GAF)
    • GAF 2.1 is the latest version of the GAF format. Data has been released in this format since the summer of 2015.

      GAF 2.1 allows both pipes and commas in column 8 (the with/from column), whereas GAF 2.0 allows pipes only. A pipe indicates 'OR' and a comma indicates 'AND'.

    • GAF 2.0 is the primary format currently used by the GO Consortium.
    • GAF 1.0 is a deprecated format (as of June 2010), which captures slightly less information. The GO Consortium continues to provide files in this format for users who have not yet switched to GAF 2.0.
  • Gene Product Association Data (GPAD): The GPAD format is designed to be more normalized than GAF, and is intended to work in conjunction with a separate format for exchanging gene product information (GPI).
    • GPAD 1.1 contains the annotation data
    • GPI 1.2 contains the data about the gene products
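The GAF 2.1 with/from convention described in the list above (a pipe means OR, a comma means AND) can be illustrated with a minimal Python sketch. The function name `parse_with_from` is hypothetical and is not part of any GO library; the identifiers in the usage note are placeholders.

```python
def parse_with_from(value):
    """Split a GAF 2.1 with/from value into OR-groups of AND-ed identifiers."""
    value = value.strip()
    if not value:
        return []  # cardinality 0: the column is empty
    # Pipes separate alternatives (OR); commas join identifiers (AND).
    return [group.split(",") for group in value.split("|")]
```

For instance, `parse_with_from("FB:FBgn1111111,FB:FBgn2222222|CGSC:pabA")` yields two alternatives: the first requires both FB identifiers together, the second is the single CGSC identifier.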

    GO Annotation File (GAF) Format 1.0

    Annotation data is submitted to the GO Consortium in the form of gene association files, or GAFs. The following document lays out the format specification for GAF 1.0; for the newer GAF 2.0 file syntax, please see the GAF 2.0 file format guide.

    More general information on annotation can be found in the GO annotation guide.

    File header

    All annotation files must start with a single line denoting the file format. For GAF 1.0 it is as follows:

    !gaf-version: 1.0

    Other information, such as contact details for the submitter or database group, useful links, etc., can be included in an association file by prefixing the line with an exclamation mark (!); such lines will be ignored by parsers.
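A reader for the header and comment conventions above might look like the following sketch. It assumes the version line is the first line of the stream and treats every other line beginning with `!` as a comment; the function name `read_gaf_lines` and the sample comment text are hypothetical.

```python
import io

def read_gaf_lines(handle):
    """Return (version, data_lines) for a GAF stream.

    Assumes the first line is the '!gaf-version:' header; every other
    line starting with '!' is treated as a comment and skipped.
    """
    first = handle.readline().rstrip("\n")
    if not first.startswith("!gaf-version:"):
        raise ValueError("missing !gaf-version header")
    version = first.split(":", 1)[1].strip()
    data_lines = [
        line.rstrip("\n")
        for line in handle
        if line.strip() and not line.startswith("!")
    ]
    return version, data_lines

# Illustrative input only; the data row is truncated to three columns.
example = io.StringIO(
    "!gaf-version: 1.0\n"
    "!contact: hypothetical submitter note\n"
    "UniProtKB\tP12345\tPHO3\n"
)
version, rows = read_gaf_lines(example)
```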

    Annotation File Fields

    The flat file format comprises 15 tab-delimited fields; the Required? column in the table below marks the mandatory fields.
    Fields in the annotation file
    Column  Content                       Required?  Cardinality   Example
    1       DB                            required   1             UniProtKB
    2       DB Object ID                  required   1             P12345
    3       DB Object Symbol              required   1             PHO3
    4       Qualifier                     optional   0 or greater  NOT
    5       GO ID                         required   1             GO:0003993
    6       DB:Reference (|DB:Reference)  required   1 or greater  SGD_REF:S000047763|PMID:2676709
    7       Evidence Code                 required   1             IMP
    8       With (or) From                optional   0 or greater  GO:0000346
    9       Aspect                        required   1             F
    10      DB Object Name                optional   0 or 1        acid phosphatase
    11      DB Object Synonym (|Synonym)  optional   0 or greater  YBR092C
    12      DB Object Type                required   1             gene
    13      Taxon(|taxon)                 required   1 or 2        taxon:4932
    14      Date                          required   1             20010118
    15      Assigned By                   required   1             SGD
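As a rough illustration of the 15-column layout, the sketch below splits one GAF 1.0 line on tabs and breaks the multi-valued columns on pipes. The dictionary keys and the name `parse_gaf1_line` are hypothetical choices, and the sample row is simply assembled from the Example column of the table, so it is not a real annotation.

```python
GAF1_FIELDS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by",
]
# Columns that may hold several pipe-separated values.
MULTI_VALUED = {"qualifier", "db_reference", "with_from",
                "db_object_synonym", "taxon"}

def parse_gaf1_line(line):
    """Turn one tab-delimited GAF 1.0 line into a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(GAF1_FIELDS):
        raise ValueError(f"expected 15 columns, got {len(values)}")
    record = {}
    for name, value in zip(GAF1_FIELDS, values):
        if name in MULTI_VALUED:
            record[name] = value.split("|") if value else []
        else:
            record[name] = value
    return record

# Sample row built from the Example column; optional columns 4 and 8 left empty.
line = "\t".join([
    "UniProtKB", "P12345", "PHO3", "", "GO:0003993",
    "SGD_REF:S000047763|PMID:2676709", "IMP", "", "F",
    "acid phosphatase", "YBR092C", "gene", "taxon:4932",
    "20010118", "SGD",
])
record = parse_gaf1_line(line)
```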

    Definitions and requirements for field contents

    DB (column 1)

    Refers to the database from which the identifier in DB object ID (column 2) is drawn. This is not necessarily the group submitting the file. If a UniProtKB accession is the DB object ID (column 2), DB (column 1) should be UniProtKB.

    Must be one of the values from the set of GO database cross-references.

    This field is mandatory, cardinality 1.

    DB Object ID (column 2)

    A unique identifier from the database in DB (column 1) for the item being annotated.

    This field is mandatory, cardinality 1.

    The DB object ID (column 2) is the identifier for the database object, which may or may not correspond exactly to what is described in a paper. For example, a paper describing a protein may support annotations to the gene encoding the protein (gene ID in DB object ID field) or annotations to a protein object (protein ID in DB object ID field).

    DB Object Symbol (column 3)

    A (unique and valid) symbol to which DB object ID is matched.

    Can use ORF name for otherwise unnamed gene or protein.

    If gene products are annotated, can use gene product symbol if available, or many gene product annotation entries can share a gene symbol.

    This field is mandatory, cardinality 1.

    The DB Object Symbol field should be a symbol that means something to a biologist wherever possible (a gene symbol, for example). It is not an ID or an accession number (the second column, DB object ID, provides the unique identifier), although IDs can be used as a DB object symbol if there is no more biologically meaningful symbol available (e.g., when an unnamed gene is annotated).

    Qualifier (column 4)

    Flags that modify the interpretation of an annotation.

    One (or more) of NOT, contributes_to, colocalizes_with.

    This field is not mandatory; cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries (e.g. NOT|contributes_to).

    See also the documentation on qualifiers in the GO annotation conventions guide.

    GO ID (column 5)

    The GO identifier for the term attributed to the DB object ID.

    This field is mandatory, cardinality 1.

    DB:Reference (column 6)

    One or more unique identifiers for a single source cited as an authority for the attribution of the GO ID to the DB object ID. This may be a literature reference or a database record. The syntax is DB:accession_number.

    Note that only one reference can be cited on a single line in the gene association file. If a reference has identifiers in more than one database, multiple identifiers for that reference can be included on a single line. For example, if the reference is a published paper that has a PubMed ID, we strongly recommend that the PubMed ID be included, as well as an identifier within a model organism database. Note that if the model organism database has an identifier for the reference, that identifier should always be included, even if a PubMed ID is also used.

    This field is mandatory, cardinality 1, >1; for cardinality >1 use a pipe to separate entries (e.g. SGD_REF:S000047763|PMID:2676709).

    Evidence Code (column 7)

    See the GO evidence code guide for the list of valid evidence codes for GO annotations.

    This field is mandatory, cardinality 1.

    With [or] From (column 8)

    Also referred to as with, from or the with/from column.

    one of:
    • DB:gene_symbol
    • DB:gene_symbol[allele_symbol]
    • DB:gene_id
    • DB:protein_name
    • DB:sequence_id
    • GO:GO_id

    This field is not mandatory overall, but is required for some evidence codes (see below and the GO evidence code guide for details); cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries (e.g. CGSC:pabA|CGSC:pabB).

    Note: This field is used to hold an additional identifier for annotations using certain evidence codes (IC, IEA, IGI, IPI, ISS). For example, it can identify another gene product to which the annotated gene product is similar (ISS) or interacts with (IPI). More information on the meaning of with or from column entries is available in the evidence code documentation entries for the relevant codes.

    Cardinality = 0 is not recommended, but is permitted because cases can be found in literature where no database identifier can be found (e.g. physical interaction or sequence similarity to a protein, but no ID provided). Cardinality = 0 is not allowed for ISS annotations made after October 1, 2006. Annotations where evidence is IGI, IPI, or ISS and with cardinality = 0 should link to an explanation of why there is no entry in with. Cardinality may be >1 for any of the evidence codes that use with; for IPI and IGI cardinality >1 has a special meaning (see evidence documentation for more information). For cardinality >1 use a pipe to separate entries (e.g. FB:FBgn1111111|FB:FBgn2222222).

    Note that a gene ID may be used in the with column for an IPI annotation, or for an ISS annotation based on amino acid sequence or protein structure similarity, if the database does not have identifiers for individual gene products. A gene ID may also be used if the cited reference provides enough information to determine which gene ID should be used, but not enough to establish which protein ID is correct.

    'GO:GO_id' is used only when the evidence code is IC, and refers to the GO term(s) used as the basis of a curator inference. In these cases the entry in the 'DB:Reference' column will be that used to assign the GO term(s) from which the inference is made. This field is mandatory for evidence code IC.

    The ID is usually an identifier for an individual entry in a database (such as a sequence ID, gene ID, GO ID, etc.). Identifiers from the Center for Biological Sequence Analysis (CBS), however, represent tools used to find homology or sequence similarity; these identifiers can be used in the with column for ISS annotations.

    The with column may not be used with the evidence codes IDA, TAS, NAS, or ND.
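The rules above that tie the with column to the evidence code can be expressed as a small check. This is a sketch covering only the constraints stated in this section (IC requires an entry; IDA, TAS, NAS and ND forbid one; ISS requires one after October 1, 2006); the function name and message wording are hypothetical.

```python
from datetime import date

WITH_REQUIRED = {"IC"}                        # with/from is mandatory
WITH_FORBIDDEN = {"IDA", "TAS", "NAS", "ND"}  # with/from must be empty
ISS_CUTOFF = date(2006, 10, 1)

def check_with_from(evidence_code, with_from, annotation_date):
    """Return a list of problems found; an empty list means no issue."""
    problems = []
    has_with = bool(with_from.strip())
    if evidence_code in WITH_FORBIDDEN and has_with:
        problems.append(f"{evidence_code} must not use the with column")
    if evidence_code in WITH_REQUIRED and not has_with:
        problems.append(f"{evidence_code} requires a with entry")
    if evidence_code == "ISS" and not has_with and annotation_date > ISS_CUTOFF:
        problems.append("ISS after 2006-10-01 requires a with entry")
    return problems
```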

    Aspect (column 9)

    Refers to the namespace or ontology to which the GO ID (column 5) belongs; one of P (biological process), F (molecular function) or C (cellular component).

    This field is mandatory; cardinality 1.

    DB Object Name (column 10)

    Name of gene or gene product.

    This field is not mandatory, cardinality 0, 1 [white space allowed]

    DB Object Synonym (column 11)

    Gene symbol [or other text].

    Note that we strongly recommend that gene synonyms are included in the gene association file, as this aids the searching of GO.

    This field is not mandatory, cardinality 0, 1, >1 [white space allowed]; for cardinality >1 use a pipe to separate entries (e.g. YFL039C|ABY1|END7|actin gene).

    DB Object Type (column 12)

    The entity being annotated.

    One of gene (SO:0000704), transcript (SO:0000673), protein (SO:0000358), protein_structure, complex.

    This field is mandatory, cardinality 1.

    The object type (gene, transcript, protein, protein_structure, etc.) listed in the DB object type field must match the database entry identified by DB object ID. Note that DB object type refers to the database entry (i.e. it represents a protein, functional RNA, etc.); this column does not reflect anything about the GO term or the evidence on which the annotation is based. For example, if your database entry represents a gene, then gene goes in the DB object type column, even if the annotation is to a component term relevant to the localization of a protein product of the gene. The text entered in the DB object name and DB object symbol can refer to the same database entry (recommended), or to a "broader" entity. For example, several alternative transcripts from one gene may be annotated separately, each with a unique transcript DB object ID, but list the same gene symbol in the DB object symbol column.

    Taxon (column 13)

    Taxonomic identifier(s).

    For cardinality 1, the ID of the species encoding the gene product.

    For cardinality 2, to be used only in conjunction with terms that have the biological process term multi-organism process or the cellular component term host cell as an ancestor. The first taxon ID should be that of the organism encoding the gene or gene product, and the taxon ID after the pipe should be that of the other organism in the interaction.

    This field is mandatory, cardinality 1, 2; for cardinality 2 use a pipe to separate entries (e.g. taxon:1|taxon:1000).
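A minimal sketch of reading the taxon column follows. It returns the taxon ID of the organism encoding the gene product and, when a second entry is present, the ID of the other organism in the interaction. The function name `parse_taxon` is hypothetical.

```python
def parse_taxon(value):
    """Return (organism_taxon_id, interacting_taxon_id_or_None) as ints."""
    parts = value.split("|")
    if not 1 <= len(parts) <= 2:
        raise ValueError("taxon column must hold 1 or 2 entries")
    ids = []
    for part in parts:
        prefix, _, number = part.partition(":")
        if prefix != "taxon" or not number.isdigit():
            raise ValueError(f"malformed taxon entry: {part!r}")
        ids.append(int(number))
    return ids[0], (ids[1] if len(ids) == 2 else None)
```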

    See the GO annotation conventions guide for more information on multi-organism terms.

    Date (column 14)

    Date on which the annotation was made; format is YYYYMMDD.

    This field is mandatory, cardinality 1.
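The YYYYMMDD format can be converted with the standard library, as in this short sketch (the helper name `parse_gaf_date` is hypothetical):

```python
from datetime import datetime

def parse_gaf_date(value):
    """Convert a YYYYMMDD string such as '20010118' into a date object."""
    if len(value) != 8 or not value.isdigit():
        raise ValueError(f"expected YYYYMMDD, got {value!r}")
    return datetime.strptime(value, "%Y%m%d").date()
```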

    Assigned By (column 15)

    The database that made the annotation.

    One of the values from the set of GO database cross-references.

    Used for tracking the source of an individual annotation.

    The default value is the value entered in DB (column 1).

    Value will differ from column 1 for any annotation that is made by one database and incorporated into another.

    This field is mandatory, cardinality 1.

    Note that several fields contain database cross-references (dbxrefs) in the format dbname:dbaccession. The fields are: GO ID (column 5), where dbname is always GO; DB:Reference (column 6); With or From (column 8); and Taxon (column 13), where dbname is always taxon. For GO IDs, do not repeat the 'GO:' prefix (i.e. always use GO:0000000, not GO:GO:0000000).
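The dbname:dbaccession convention can be handled by splitting at the first colon, as in the sketch below. It also rejects the repeated 'GO:' prefix mentioned above. The function name `split_dbxref` is hypothetical.

```python
def split_dbxref(xref):
    """Split 'dbname:dbaccession' at the first colon."""
    dbname, sep, accession = xref.partition(":")
    if not sep or not dbname or not accession:
        raise ValueError(f"not a dbname:dbaccession pair: {xref!r}")
    if dbname == "GO" and accession.startswith("GO:"):
        raise ValueError("GO prefix repeated; use the form GO:0000000")
    return dbname, accession
```

Splitting only at the first colon keeps any later colons inside the accession intact.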

    GO Annotation File Format 2.0

    Annotation data is submitted to the GO Consortium in the form of gene association files, or GAFs. This guide lays out the format specifications for GAF 2.0; for the older GAF 1.0 file syntax, please see the GAF 1.0 file format guide.

    Please see the information on the changes in GAF 2.0.

    General information about annotation can be found in the GO annotation guide.

    Changes in GAF 2.0

    GAF 2.0 has two additional columns compared to GAF 1.0: annotation extension (column 16) and gene product form ID (column 17).

    The addition of gene product form ID (column 17) means that the usage of the DB object ID (column 2) and DB object type (column 12) fields differs from that in GAF 1.0. Please see the descriptions below for full details.

    File Header

    All gene association files must start with a single line denoting the file format, as follows:

    !gaf-version: 2.0

    Other information, such as contact details for the submitter or database group, useful links, etc., can be included in an association file by prefixing the line with an exclamation mark (!); such lines will be ignored by parsers.

    Annotation File Fields

    The annotation flat file format comprises 17 tab-delimited fields.

    Column  Content                       Required?  Cardinality   Example
    1       DB                            required   1             UniProtKB
    2       DB Object ID                  required   1             P12345
    3       DB Object Symbol              required   1             PHO3
    4       Qualifier                     optional   0 or greater  NOT
    5       GO ID                         required   1             GO:0003993
    6       DB:Reference (|DB:Reference)  required   1 or greater  SGD_REF:S000047763|PMID:2676709
    7       Evidence Code                 required   1             IMP
    8       With (or) From                optional   0 or greater  GO:0000346
    9       Aspect                        required   1             F
    10      DB Object Name                optional   0 or 1        Toll-like receptor 4
    11      DB Object Synonym (|Synonym)  optional   0 or greater  hToll|Tollbooth
    12      DB Object Type                required   1             protein
    13      Taxon(|taxon)                 required   1 or 2        taxon:9606
    14      Date                          required   1             20090118
    15      Assigned By                   required   1             SGD
    16      Annotation Extension          optional   0 or greater  part_of(CL:0000576)
    17      Gene Product Form ID          optional   0 or 1        UniProtKB:P12345-2

     

    Definitions and requirements for field contents

    DB (column 1)
    refers to the database from which the identifier in DB object ID (column 2) is drawn. This is not necessarily the group submitting the file. If a UniProtKB ID is the DB object ID (column 2), DB (column 1) should be UniProtKB.
    must be one of the values from the set of GO database cross-references
    this field is mandatory, cardinality 1

     

    DB Object ID (column 2)
    a unique identifier from the database in DB (column 1) for the item being annotated
    this field is mandatory, cardinality 1
    In GAF 2.0 format, the identifier must reference a top-level primary gene or gene product identifier: either a gene, or a protein that has a 1:1 correspondence to a gene. Identifiers referring to particular protein isoforms or post-translationally cleaved or modified proteins are not legal values in this field.
    The DB object ID (column 2) is the identifier for the database object, which may or may not correspond exactly to what is described in a paper. For example, a paper describing a protein may support annotations to the gene encoding the protein (gene ID in DB object ID field) or annotations to a protein object (protein ID in DB object ID field).

     

    DB Object Symbol (column 3)
    a (unique and valid) symbol to which DB object ID is matched
    can use ORF name for otherwise unnamed gene or protein
    if gene products are annotated, can use gene product symbol if available, or many gene product annotation entries can share a gene symbol
    this field is mandatory, cardinality 1
    The DB Object Symbol field should be a symbol that means something to a biologist wherever possible (a gene symbol, for example). It is not an ID or an accession number (DB object ID [column 2] provides the unique identifier), although IDs can be used as a DB object symbol if there is no more biologically meaningful symbol available (e.g., when an unnamed gene is annotated).

     

    Qualifier (column 4)
    flags that modify the interpretation of an annotation
    one (or more) of NOT, contributes_to, colocalizes_with
    this field is not mandatory; cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries (e.g. NOT|contributes_to)
    See also the documentation on qualifiers in the GO annotation guide

     

    GO ID (column 5)
    the GO identifier for the term attributed to the DB object ID
    this field is mandatory, cardinality 1

     

    DB:Reference (column 6)
    one or more unique identifiers for a single source cited as an authority for the attribution of the GO ID to the DB object ID. This may be a literature reference or a database record. The syntax is DB:accession_number.
    Note that only one reference can be cited on a single line in the gene association file. If a reference has identifiers in more than one database, multiple identifiers for that reference can be included on a single line. For example, if the reference is a published paper that has a PubMed ID, we strongly recommend that the PubMed ID be included, as well as an identifier within a model organism database. Note that if the model organism database has an identifier for the reference, that identifier should always be included, even if a PubMed ID is also used.
    This field is mandatory, cardinality 1, >1; for cardinality >1 use a pipe to separate entries (e.g. SGD_REF:S000047763|PMID:2676709).

     

    Evidence Code (column 7)
    see the GO evidence code guide for the list of valid evidence codes for GO annotations
    this field is mandatory, cardinality 1

     

    With [or] From (column 8)
    Also referred to as with, from or the with/from column
    one of:
    • DB:gene_symbol
    • DB:gene_symbol[allele_symbol]
    • DB:gene_id
    • DB:protein_name
    • DB:sequence_id
    • GO:GO_id
    • CHEBI:CHEBI_id
    this field is not mandatory overall, but is required for some evidence codes (see below and the evidence code documentation for details); cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries (e.g. CGSC:pabA|CGSC:pabB)
    Note: This field is used to hold an additional identifier for annotations using certain evidence codes (IC, IEA, IGI, IPI, ISS). For example, it can identify another gene product to which the annotated gene product is similar (ISS) or interacts with (IPI). More information on the meaning of with or from column entries is available in the evidence code documentation entries for the relevant codes.
    Cardinality = 0 is not recommended, but is permitted because cases can be found in literature where no database identifier can be found (e.g. physical interaction or sequence similarity to a protein, but no ID provided). Cardinality = 0 is not allowed for ISS annotations made after October 1, 2006. Annotations where evidence is IGI, IPI, or ISS and with cardinality = 0 should link to an explanation of why there is no entry in with. Cardinality may be >1 for any of the evidence codes that use with; for IPI and IGI cardinality >1 has a special meaning (see evidence documentation for more information). For cardinality >1 use a pipe to separate entries (e.g. FB:FBgn1111111|FB:FBgn2222222).
    Note that a gene ID may be used in the with column for an IPI annotation, or for an ISS annotation based on amino acid sequence or protein structure similarity, if the database does not have identifiers for individual gene products. A gene ID may also be used if the cited reference provides enough information to determine which gene ID should be used, but not enough to establish which protein ID is correct.
    'GO:GO_id' is used only when the evidence code is IC, and refers to the GO term(s) used as the basis of a curator inference. In these cases the entry in the 'DB:Reference' column will be that used to assign the GO term(s) from which the inference is made. This field is mandatory for evidence code IC.
    The ID is usually an identifier for an individual entry in a database (such as a sequence ID, gene ID, GO ID, etc.). Identifiers from the Center for Biological Sequence Analysis (CBS), however, represent tools used to find homology or sequence similarity; these identifiers can be used in the with column for ISS annotations.
    The with column may not be used with the evidence codes IDA, TAS, NAS, or ND.

     

    Aspect (column 9)
    refers to the namespace or ontology to which the GO ID (column 5) belongs; one of P (biological process), F (molecular function) or C (cellular component)
    this field is mandatory; cardinality 1

     

    DB Object Name (column 10)
    name of gene or gene product
    this field is not mandatory, cardinality 0, 1 [white space allowed]

     

    DB Object Synonym (column 11)
    Gene symbol [or other text]
    Note that we strongly recommend that gene synonyms are included in the gene association file, as this aids the searching of GO.
    this field is not mandatory, cardinality 0, 1, >1 [white space allowed]; for cardinality >1 use a pipe to separate entries (e.g. YFL039C|ABY1|END7|actin gene)

     

    DB Object Type (column 12)
    A description of the type of gene product being annotated. If a gene product form ID (column 17) is supplied, the DB object type will refer to that entity; if no gene product form ID is present, it will refer to the entity that the DB object ID (column 2) is believed to produce and which actively carries out the function or localization described.
    one of the following: protein_complex; protein; transcript; ncRNA; rRNA; tRNA; snRNA; snoRNA; any subtype of ncRNA in the Sequence Ontology. If the precise product type is unknown, gene_product should be used.
    this field is mandatory, cardinality 1
    The object type (gene_product, transcript, protein, protein_complex, etc.) listed in the DB object type field must match the database entry identified by the gene product form ID, or, if this is absent, the expected product of the DB object ID. Note that DB object type refers to the database entry (i.e. it represents a protein, functional RNA, etc.); this column does not reflect anything about the GO term or the evidence on which the annotation is based. For example, if your database entry represents a protein-encoding gene, then protein goes in the DB object type column. The text entered in the DB object name and DB object symbol should refer to the entity in DB object ID. For example, several alternative transcripts from one gene may be annotated separately, each with the same gene ID in DB object ID, and specific gene product identifiers in gene product form ID, but list the same gene symbol in the DB object symbol column.

     

    Taxon (column 13)
    taxonomic identifier(s)
    For cardinality 1, the ID of the species encoding the gene product.
    For cardinality 2, to be used only in conjunction with terms that have the biological process term multi-organism process or the cellular component term host cell as an ancestor. The first taxon ID should be that of the organism encoding the gene or gene product, and the taxon ID after the pipe should be that of the other organism in the interaction.
    this field is mandatory, cardinality 1, 2; for cardinality 2 use a pipe to separate entries (e.g. taxon:1|taxon:1000)
    See the GO annotation conventions for more information on multi-organism terms.

     

    Date (column 14)
    Date on which the annotation was made; format is YYYYMMDD
    this field is mandatory, cardinality 1

     

    Assigned By (column 15)
    The database which made the annotation
    one of the values from the set of GO database cross-references
    Used for tracking the source of an individual annotation. The default value is the value entered in DB (column 1).
    Value will differ from column 1 for any annotation that is made by one database and incorporated into another.
    this field is mandatory, cardinality 1

     

    Annotation Extension (column 16)
    one of:
    • DB:gene_id
    • DB:sequence_id
    • CHEBI:CHEBI_id
    • Cell Type Ontology:CL_id
    • GO:GO_id
    Contains cross-references to other ontologies that can be used to qualify or enhance the annotation. The cross-reference is prefaced by an appropriate GO relationship; references to multiple ontologies can be entered. For example, if a gene product is localized to the mitochondria of lymphocytes, the GO ID (column 5) would be mitochondrion (GO:0005739), and the annotation extension column would contain a cross-reference to the term lymphocyte from the Cell Type Ontology.
    Targets of certain processes or functions can also be included in this field to indicate the gene, gene product, or chemical involved; for example, if a gene product is annotated to protein kinase activity, the annotation extension column would contain the UniProtKB protein ID for the protein phosphorylated in the reaction.
    See the documentation on using the annotation extension column for details of practical usage; a wider discussion of the annotation extension column can be found on the GO wiki.
    this field is optional, cardinality 0 or greater
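The example values for this column take the form relation(DB:accession), such as part_of(CL:0000576). The sketch below parses a single expression of that shape. The regular expression is an assumption based only on those examples, how multiple expressions are separated is not covered, and the name `parse_extension` is hypothetical.

```python
import re

# One expression of the form relation(DB:accession), e.g. part_of(CL:0000576).
EXTENSION_RE = re.compile(
    r"^(?P<relation>\w+)\((?P<db>[^:()]+):(?P<acc>[^()]+)\)$"
)

def parse_extension(expr):
    """Return (relation, dbname, accession) for one extension expression."""
    match = EXTENSION_RE.match(expr.strip())
    if match is None:
        raise ValueError(f"not a relation(DB:accession) expression: {expr!r}")
    return match.group("relation"), match.group("db"), match.group("acc")
```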

     

    Gene Product Form ID (column 17)
    As the DB Object ID (column 2) entry must be a canonical entity (a gene or an abstract protein that has a 1:1 correspondence to a gene), this field allows the annotation of specific variants of that gene or gene product. Contents will frequently include protein sequence identifiers: for example, identifiers that specify distinct proteins produced by differential splicing, alternative translational starts, post-translational cleavage or post-translational modification. Identifiers for functional RNAs can also be included in this column.
    The identifier used must be a standard 2-part global identifier, e.g. UniProtKB:OK0206-2
    • When the gene product form ID (column 17) is filled with a protein identifier, the value in DB object type (column 12) must be protein. Protein identifiers can include UniProtKB accession numbers, NCBI NP identifiers or Protein Ontology (PRO) identifiers.
    • When the gene product form ID (column 17) is filled with a functional RNA identifier, the DB object type (column 12) must be either ncRNA, rRNA, tRNA, snRNA, or snoRNA.
    This column may be left blank; if so, the value in DB object type (column 12) will provide a description of the expected gene product.
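The consistency rules between column 17 and column 12 can be sketched as a check. Because the identifier alone does not reveal whether it denotes a protein or a functional RNA, this sketch assumes the caller supplies that as `form_kind`; that parameter and the function name `check_form_id` are hypothetical.

```python
RNA_TYPES = {"ncRNA", "rRNA", "tRNA", "snRNA", "snoRNA"}

def check_form_id(db_object_type, form_id, form_kind=None):
    """Return a list of problems; form_kind is 'protein', 'rna' or None."""
    problems = []
    if not form_id:
        return problems  # column 17 may be left blank
    db, sep, acc = form_id.partition(":")
    if not (db and sep and acc):
        problems.append("column 17 must be a two-part DB:accession identifier")
    if form_kind == "protein" and db_object_type != "protein":
        problems.append("column 12 must be 'protein' for a protein form ID")
    if form_kind == "rna" and db_object_type not in RNA_TYPES:
        problems.append("column 12 must be an RNA type for an RNA form ID")
    return problems
```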
    More information and examples are available from the GO wiki page on column 17.

    Note that several fields contain database cross-references (dbxrefs) in the format dbname:dbaccession. The fields are: GO ID (column 5), where dbname is always GO; DB:Reference (column 6); With or From (column 8); and Taxon (column 13), where dbname is always taxon. For GO IDs, do not repeat the 'GO:' prefix (i.e. always use GO:0000000, not GO:GO:0000000).

    Gene Product Association Data (GPAD) format

    The GPAD file is an alternative to the Gene Association File (GAF) as a means of exchanging annotations. The GPAD format is designed to be more normalized than GAF, and is intended to work in conjunction with a separate format for exchanging gene product information.

    All annotation files must start with a single line denoting the file format. For GPAD it is as follows:

    !gpa-version: 1.1

    Other information, such as contact details for the submitter or database group, useful links, etc., can be included in an association file by prefixing the line with an exclamation mark (!); such lines will be ignored by parsers.

    Annotation File Fields

    The file format comprises 12 tab-delimited fields; fields with multiple values (for example, gene product synonyms) should have these values separated by pipes.

    Column  Content                       Required?  Cardinality   Example
    1       DB                            required   1             SGD
    2       DB Object ID                  required   1             P12345
    3       Qualifier                     required   1 or greater  enables
    4       GO ID                         required   1             GO:0019104
    5       DB:Reference(s)               required   1 or greater  PMID:20727966
    6       Evidence code                 required   1             ECO:0000021
    7       With (or) From                optional   0 or greater  Ensembl:ENSRNOP00000010579
    8       Interacting taxon ID          optional   0 or 1        4896
    9       Date                          required   1             20130529
    10      Assigned by                   required   1             PomBase
    11      Annotation Extension          optional   0 or greater  occurs_in(GO:0005739)
    12      Annotation Properties         optional   0 or greater  annotation_identifier = 2113431320
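The 12-column GPAD layout maps naturally onto a named tuple, as in this sketch. The attribute names and `parse_gpad_line` are hypothetical, and the sample row is assembled from the Example column of the table rather than taken from a real file.

```python
from typing import NamedTuple

class GpadRecord(NamedTuple):
    db: str
    db_object_id: str
    qualifier: str
    go_id: str
    db_reference: str
    evidence_code: str
    with_from: str
    interacting_taxon_id: str
    date: str
    assigned_by: str
    annotation_extension: str
    annotation_properties: str

def parse_gpad_line(line):
    """Split one tab-delimited GPAD line into a GpadRecord of raw strings."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(GpadRecord._fields):
        raise ValueError(f"expected 12 columns, got {len(values)}")
    return GpadRecord(*values)

# Sample row built from the Example column of the table.
line = "\t".join([
    "SGD", "P12345", "enables", "GO:0019104", "PMID:20727966",
    "ECO:0000021", "Ensembl:ENSRNOP00000010579", "4896", "20130529",
    "PomBase", "occurs_in(GO:0005739)", "annotation_identifier = 2113431320",
])
record = parse_gpad_line(line)
```

Values are kept as raw strings here; pipe-separated columns would still need to be split by the caller.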

    Definitions and requirements for field contents

    DB
    refers to the database from which the identifier in DB object ID is drawn. This is not necessarily the group submitting the file. If a UniProtKB ID is the DB object ID, DB should be UniProtKB.
    must be one of the values from the set of GO database cross-references
    this field is mandatory, cardinality 1
    DB Object ID
    a unique identifier (from the database in DB) for the item being annotated
    this field is mandatory, cardinality 1
    In GPAD 1.0 format, the identifier may reference a top-level primary gene or gene product identifier, or an identified variant of a gene or gene product. Contents may include protein sequence identifiers: for example, identifiers that specify distinct proteins produced by differential splicing, alternative translational starts, post-translational cleavage or post-translational modification. Identifiers for functional RNAs can also be included in this column.
    If the gene product is not a top-level gene or gene product identifier, the Gene Product Information (GPI) file should contain information about the canonical form of the gene or gene product.
    The DB object ID is the identifier for the database object, which may or may not correspond exactly to what is described in a paper. For example, a paper describing a protein may support annotations to the gene encoding the protein (gene ID in DB object ID field) or annotations to a protein object (protein ID in DB object ID field).
    Qualifier
    the relationship between the gene product in DB:DB object ID and the GO ID, composed of up to three parts: an operator (optional), a modifier (optional) and an atomic relation (required)
    this field is mandatory, cardinality 1 or greater than 1, entries pipe-separated
    The operator may be one of two values, not or always. Operators are optional. Valid modifiers are contributes to and colocalizes with. In addition, annotations encompassing interactions with other organisms may use the modifiers host, other organism or symbiont. Modifiers are optional.

    The atomic relations depend upon the term namespace, and are as follows:

    • gene product enables molecular function
    • gene product involved in biological process
    • gene product part of cellular component

    An atomic relation must be used.

    See also the documentation on qualifiers in the GO annotation guide
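The dependency of the atomic relation on the term namespace can be captured in a lookup table, as sketched below. The underscore spellings (`involved_in`, `part_of`) and the namespace keys are assumptions; this section only gives the relations in prose and shows `enables` in the example table. The function name is hypothetical.

```python
# Assumed token spellings; only 'enables' appears verbatim as an example value.
ATOMIC_RELATIONS = {
    "molecular_function": "enables",
    "biological_process": "involved_in",
    "cellular_component": "part_of",
}

def has_required_relation(qualifier, namespace):
    """True if the pipe-separated qualifier contains the atomic relation
    that the term's namespace calls for."""
    return ATOMIC_RELATIONS[namespace] in qualifier.split("|")
```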
    GO ID
    the GO identifier for the term attributed to the DB object ID
    this field is mandatory, cardinality 1
    DB:Reference
    one or more unique identifiers for a single source cited as an authority for the attribution of the GO ID to the DB object ID. This may be a literature reference or a database record. The syntax is DB:accession.
    Note that only one reference can be cited on a single line in the gene association file. If a reference has identifiers in more than one database, multiple identifiers for that reference can be included on a single line. For example, if the reference is a published paper that has a PubMed ID, the PubMed ID must be included; if the model organism database has its own identifier for the reference, that can also be included.
    this field is mandatory, cardinality 1, >1; for cardinality >1 use a pipe to separate entries (e.g. PMID:2676709|SGD_REF:S000047763).
    Evidence Code
    One of the codes from the Evidence Code Ontology (ECO). This field is mandatory, cardinality 1.
    With [or] From
    Also referred to as with, from or the with/from column.
    This field is required for some evidence codes; cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries (e.g. UniProtKB:P10620|UniProtKB:P08011).

    Note: This field is used to hold an additional identifier for annotations using certain evidence codes: ECO:0000305 [IC]; ECO:0000203, ECO:0000256 and ECO:0000265 [all IEA]; ECO:0000316 [IGI]; ECO:0000021 [IPI]; ECO:0000031, ECO:0000250 and ECO:0000255 [all ISS].

    For example, it can identify another gene product to which the annotated gene product is similar (ECO:0000031, ECO:0000250 and ECO:0000255, ISS) or with which it interacts (ECO:0000021, IPI).

    More information on the meaning of with or from column entries is available in the evidence code documentation entries for the relevant codes.

    Cardinality = 0 is not recommended, but is permitted because cases can be found in the literature where no database identifier can be found (e.g. physical interaction or sequence similarity to a protein, but no ID provided). Cardinality = 0 is not allowed for ISS annotations (ECO:0000031, ECO:0000250 and ECO:0000255) made after October 1, 2006. Annotations where evidence is ECO:0000316 [IGI], ECO:0000021 [IPI], or ECO:0000031, ECO:0000250 or ECO:0000255 [all ISS] and with cardinality = 0 should link to an explanation of why there is no entry in with. Cardinality may be >1 for any of the evidence codes that use with; for ECO:0000021 [IPI] and ECO:0000316 [IGI], cardinality >1 has a special meaning (see evidence documentation for more information). For cardinality >1 use a pipe to separate entries (e.g. FB:FBgn1111111|FB:FBgn2222222).

    Note that a gene ID may be used in the with column for an ECO:0000021 [IPI] annotation, or for an ECO:0000031, ECO:0000250 or ECO:0000255 [all ISS] annotation based on amino acid sequence or protein structure similarity, if the database does not have identifiers for individual gene products. A gene ID may also be used if the cited reference provides enough information to determine which gene ID should be used, but not enough to establish which protein ID is correct.

    A GO:ID is used only when the evidence code is ECO:0000305 [IC], and refers to the GO term(s) used as the basis of a curator inference. In these cases the entry in the 'DB:Reference' column will be that used to assign the GO term(s) from which the inference is made. This field is mandatory for evidence code ECO:0000305 [IC].

    The ID is usually an identifier for an individual entry in a database (such as a sequence ID, gene ID, GO ID, etc.). Identifiers from the Center for Biological Sequence Analysis (CBS), however, represent tools used to find homology or sequence similarity; these identifiers can be used in the with column for ECO:0000031, ECO:0000250 or ECO:0000255 [ISS] annotations.

    The with column may not be used with the evidence codes ECO:0000314 [IDA], ECO:0000304 [TAS], ECO:0000303 [NAS], or ECO:0000307 [ND].
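
    The cardinality rules above lend themselves to a simple check. The Python sketch below covers only the rules stated in this section (with is mandatory for IC, must be empty for IDA, TAS, NAS and ND, and may not be empty for ISS annotations made after October 1, 2006). The function name check_with_field and its return format are assumptions for illustration.

```python
from datetime import date

ISS_CODES = {"ECO:0000031", "ECO:0000250", "ECO:0000255"}
WITH_REQUIRED = {"ECO:0000305"}  # IC
WITH_FORBIDDEN = {"ECO:0000314", "ECO:0000304", "ECO:0000303", "ECO:0000307"}  # IDA, TAS, NAS, ND
ISS_CUTOFF = date(2006, 10, 1)


def check_with_field(evidence_code, with_field, annotation_date):
    """Return a list of problems found in a With/From value (empty if none)."""
    # Cardinality is the number of non-empty pipe-separated entries.
    entries = [e for e in with_field.split("|") if e.strip()]
    problems = []
    if evidence_code in WITH_FORBIDDEN and entries:
        problems.append("with must be empty for this evidence code")
    if evidence_code in WITH_REQUIRED and not entries:
        problems.append("with is mandatory for IC")
    if evidence_code in ISS_CODES and not entries and annotation_date > ISS_CUTOFF:
        problems.append("with may not be empty for ISS annotations made after 2006-10-01")
    return problems
```

    An empty with column on an IGI or IPI annotation is deliberately not reported here, because the text permits it (while discouraging it).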

    Interacting taxon ID
    Taxonomic identifier for the interacting organism, to be used only in conjunction with terms that have the biological process term 'multi-organism process' or the cellular component term 'host' as an ancestor. This field is mandatory for terms with parentage under 'multi-organism process' or 'host', cardinality 1; annotations to other terms should leave this column blank. See the GO annotation conventions for more information on multi-organism terms.
    Date
    Date on which the annotation was made; format is YYYYMMDD. This field is mandatory, cardinality 1.
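
    A minimal Python sketch of checking the YYYYMMDD format follows. The function name parse_annotation_date is hypothetical; the explicit length and digit check is added because strptime alone tolerates some shorter inputs.

```python
from datetime import datetime


def parse_annotation_date(field):
    """Parse a YYYYMMDD string into a date; raise ValueError if it is malformed."""
    if len(field) != 8 or not field.isdigit():
        raise ValueError(f"expected 8 digits (YYYYMMDD), got {field!r}")
    # strptime rejects impossible dates such as month 13 or June 31.
    return datetime.strptime(field, "%Y%m%d").date()
```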
    Assigned By
    The database which made the annotation; one of the values from the set of GO database cross-references. Used for tracking the source of an individual annotation. The value will differ from the DB column for any annotation that is made by one database and incorporated into another. This field is mandatory, cardinality 1.
    Annotation Extension
    Contains cross-references to other ontologies that can be used to qualify or enhance the annotation. The cross-reference is prefaced by an appropriate GO relationship; references to multiple ontologies can be entered. For example, if a gene product is localized to the mitochondria of lymphocytes, the GO ID (column 5) would be mitochondrion ; GO:0005739, and the annotation extension column would contain a cross-reference to the term lymphocyte from the Cell Type Ontology. Targets of certain processes or functions can also be included in this field to indicate the gene, gene product, or chemical involved; for example, if a gene product is annotated to protein kinase activity, the annotation extension column would contain the UniProtKB protein ID for the protein phosphorylated in the reaction. See the documentation on using the annotation extension column for details of practical usage. This field is optional, cardinality 0 or greater.
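
    To make the "relationship plus cross-reference" idea concrete, the Python sketch below parses extension entries written as relation(DB:accession). That written form, the use of commas to link entries within one pipe-separated group, the relation names, and the ID CL:0000542 (assumed here to be the Cell Type Ontology term for lymphocyte) are assumptions of this example rather than statements from this section; consult the annotation extension documentation for the authoritative syntax.

```python
import re

# One entry: a relation name followed by a DB:accession in parentheses,
# e.g. part_of(CL:0000542). This pattern is an assumption of the sketch.
ENTRY = re.compile(r"^([A-Za-z_]+)\(([^()\s:]+:[^()\s]+)\)$")


def parse_extension(field):
    """Return a list of groups; each group is a list of (relation, target) pairs."""
    groups = []
    for group in field.split("|"):
        if not group.strip():
            continue  # an empty column yields no groups
        pairs = []
        for entry in group.split(","):
            match = ENTRY.match(entry.strip())
            if match is None:
                raise ValueError(f"expected relation(DB:accession), got {entry!r}")
            pairs.append((match.group(1), match.group(2)))
        groups.append(pairs)
    return groups
```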
    Note that several fields contain database cross-references (dbxrefs) in the format dbname:dbaccession. The fields are: GO ID; Reference; With or From; and Annotation Extension.
    Annotation Properties
    The Annotation Properties column can be filled with a pipe-separated list of entries in the form "property_name = property_value". There will be a fixed vocabulary for the property names, and this list can be extended when necessary. The initial supported properties would be curator_name and annotation_identifier*, but the list can be extended to include, e.g., curator_ID, modification_date, creation_date and annotation_notes.
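
    The property list format can be sketched in Python as follows. The function name parse_properties is hypothetical, and the sketch assumes each property name appears at most once per line; whitespace around the equals sign is tolerated.

```python
def parse_properties(field):
    """Turn 'name = value|name = value' into a dict of property names to values."""
    props = {}
    for entry in field.split("|"):
        if not entry.strip():
            continue  # an empty column yields no properties
        # Split on the first "=" only, so values may themselves contain "=".
        name, sep, value = entry.partition("=")
        if not sep or not name.strip():
            raise ValueError(f"expected property_name = property_value, got {entry!r}")
        props[name.strip()] = value.strip()
    return props
```

    For instance, a made-up value such as curator_name = Jane Doe|annotation_identifier = 12345 would yield a dictionary with two keys.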

    Gene Product Information (GPI) Format

    Gene Product Information (GPI) format is used to submit gene and gene product information to the GO Consortium. Please note that the GPI companion file for annotation information uses the GPAD file format.

    GPI format version

    All annotation files must start with a single line denoting the file format. For GPI it is as follows:

    !gpi-version: 1.2

    Other information, such as contact details for the submitter or database group, useful links, etc., can be included in an association file by prefixing the line with an exclamation mark (!); such lines will be ignored by parsers.
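
    A reader for these files might handle the version line and comment lines roughly as in the Python sketch below. The function name read_gpi_lines is hypothetical, and the sketch assumes the version line must match the string shown above exactly (apart from surrounding whitespace).

```python
def read_gpi_lines(lines):
    """Check the version line, then yield data lines, skipping '!' comment lines."""
    iterator = iter(lines)
    first = next(iterator, "").strip()
    if first != "!gpi-version: 1.2":
        raise ValueError(f"file must start with '!gpi-version: 1.2', got {first!r}")
    for line in iterator:
        line = line.rstrip("\r\n")
        # Lines prefixed with "!" carry free-form information and are ignored.
        if line and not line.startswith("!"):
            yield line
```

    Because this is a generator, the version check runs when the first line is requested, not when the function is called.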

    Annotation File Fields

    The file format comprises 10 tab-delimited fields. Fields with multiple values (for example, gene product synonyms) should separate values by pipes.

    Content               Required?  Cardinality   Example
    DB                    required   1             UniProtKB
    DB_Object_ID          required   1             Q4VCS5-1
    DB_Object_Symbol      required   1             AMOT
    DB_Object_Name        optional   0 or greater  Angiomotin
    DB_Object_Synonym(s)  optional   0 or greater  AMOT|KIAA1071
    DB_Object_Type        required   1             protein
    Taxon                 required   1             taxon:9606
    Parent_Object_ID      optional   0 or 1        UniProtKB:Q4VCS5
    DB_Xref(s)            optional   0 or greater
    Properties            optional   0 or greater  db_subset=Swiss-Prot
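
    The ten-field layout can be read with a few lines of Python. The sketch below is illustrative only: the names GpiRecord and parse_gpi_line are hypothetical, and the only validation performed is the field count and the presence of the five required fields listed in the table.

```python
from typing import List, NamedTuple


class GpiRecord(NamedTuple):
    db: str
    db_object_id: str
    db_object_symbol: str
    db_object_name: str
    db_object_synonyms: List[str]
    db_object_type: str
    taxon: str
    parent_object_id: str
    db_xrefs: List[str]
    properties: List[str]


REQUIRED = ("db", "db_object_id", "db_object_symbol", "db_object_type", "taxon")


def _multi(value):
    # Pipe-separated column -> list; an empty column gives an empty list.
    return [item for item in value.split("|") if item]


def parse_gpi_line(line):
    """Split one tab-delimited GPI data line into a GpiRecord."""
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 tab-delimited fields, found {len(cols)}")
    record = GpiRecord(
        cols[0], cols[1], cols[2], cols[3], _multi(cols[4]),
        cols[5], cols[6], cols[7], _multi(cols[8]), _multi(cols[9]),
    )
    missing = [name for name in REQUIRED if not getattr(record, name)]
    if missing:
        raise ValueError(f"required fields are empty: {', '.join(missing)}")
    return record
```

    Applied to a line built from the example column of the table, the synonyms field would come back as a two-item list and the empty DB_Xref(s) column as an empty list.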

    Definitions and requirements for field contents

    DB
    The database abbreviation (namespace) for the source of the DB_Object_ID. This field is mandatory; cardinality 1.
    DB_Object_ID
    A unique identifier (from the database in DB) for the item being annotated. This field is mandatory, cardinality 1.
    In GPI 1.0 format, the identifier may reference a top-level primary gene or gene product identifier, or an identified variant of a gene or gene product, for example identifiers that specify distinct proteins produced by differential splicing, alternative translational starts, post-translational cleavage, or post-translational modification. Identifiers for functional RNAs and protein complexes can also be included in this column.
    If the gene product is not a top-level gene or gene product identifier, the Parent_Object_ID field should contain the canonical form of the gene or gene product.
    Note that while the DB_Object_ID is the identifier for a database object that may be used for annotation, it may or may not correspond exactly to what is described in a paper. For example, a paper describing functional characterization of a protein may result in annotations to the gene encoding the protein (gene ID in DB_Object_ID field) or annotations to the protein (protein ID in DB_Object_ID field), depending on annotation practice of the contributing group.
    DB_Object_Symbol
    A (unique and valid) symbol to which the DB_Object_ID is matched. This field is mandatory, cardinality 1.
    The DB_Object_Symbol field should contain a symbol that is recognizable to a biologist wherever possible (an abbreviation widely used in the literature, for example). It is not a unique identifier or an accession number (unlike the DB_Object_ID), although IDs can be used as a DB_Object_Symbol if there is no more biologically meaningful symbol available (e.g., when an unnamed gene is annotated). ORF names can be used for otherwise unnamed genes or proteins. If gene products are annotated, the gene product symbol can be used if available. Many gene product annotation entries may share a gene symbol.
    The text entered in the DB_Object_Name and DB_Object_Symbol should refer to the entity in DB_Object_ID. For example, several alternative transcripts from one gene may be annotated separately, each with specific gene product identifiers in DB_Object_ID, but with the same gene symbol in the DB_Object_Symbol column.
    DB_Object_Name
    The name of the gene or gene product in DB_Object_ID. This field is not mandatory, cardinality 0, 1 [white space allowed].
    The text entered in the DB_Object_Name and DB_Object_Symbol should refer to the entity in DB_Object_ID.
    DB_Object_Synonym
    These entries may be a gene symbol or other text. Note that we strongly recommend that synonyms are included in the GPI file, as this aids the searching of GO. This field is not mandatory, cardinality 0, 1, >1 [white space allowed]; for cardinality >1 use a pipe to separate entries (e.g. YFL039C|ABY1|END7|actin gene).
    DB_Object_Type
    A description of the type of the gene or gene product being annotated. This field uses Sequence Ontology labels and may correspond to one of the following: gene; protein_complex; protein; transcript; ncRNA; rRNA; tRNA; snRNA; snoRNA; or any subtype of ncRNA in the Sequence Ontology. If the precise product type is unknown, gene_product should be used. This field is mandatory, cardinality 1.
    The object type (gene, transcript, protein, protein_complex, etc.) listed in the DB_Object_Type field must match the database entry identified by the DB_Object_ID. Note that DB_Object_Type refers to the database entry (i.e. it represents a protein, functional RNA, etc.); this column does not reflect anything about the GO term or the evidence on which the annotation is based.
    Taxon
    The NCBI taxon ID of the species encoding the gene product. This field is mandatory, cardinality 1. The taxon should be specified as a number with the prefix "taxon:" (e.g. taxon:9606).
    Parent_Object_ID
    If the DB_Object_ID refers to a variant of a gene product, this column will hold the identifier of the gene product from which it was derived. This field is mandatory, cardinality 1, when variant forms of a gene product (e.g. identifiers that specify distinct proteins produced by differential splicing, alternative translational starts, post-translational cleavage or post-translational modification) are represented in DB_Object_ID. If the DB_Object_ID refers to the canonical form of a gene product, this column should be blank.
    The identifier used must be a standard 2-part global identifier, e.g. UniProtKB:OK0206.
    The entity in the Parent_Object_ID column may not necessarily be the canonical form of the gene product; the canonical form would be identifiable as an entry for that gene product in the GPI file that would have the Parent_Object_ID blank.
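
    Since a parent entry may itself have a parent, finding the canonical form means following Parent_Object_ID links until an entry with a blank parent is reached. The Python sketch below shows one way to do this. The function name find_canonical and the parent_of mapping (keyed by 2-part identifiers built from DB and DB_Object_ID) are assumptions for illustration, as is the identifier EX:variant2 in the usage note.

```python
def find_canonical(object_id, parent_of):
    """Follow Parent_Object_ID links until an entry with a blank parent is found.

    parent_of maps a 2-part identifier ('DB:ID') to the value of its
    Parent_Object_ID column ('' when that column is blank).
    """
    seen = set()
    current = object_id
    while True:
        if current in seen:
            raise ValueError(f"cycle in parent links at {current}")
        seen.add(current)
        if current not in parent_of:
            raise KeyError(f"{current} has no entry in the GPI file")
        parent = parent_of[current]
        if not parent:
            return current  # blank parent: this entry is the canonical form
        current = parent
```

    With a mapping in which EX:variant2 points to UniProtKB:Q4VCS5-1, which in turn points to UniProtKB:Q4VCS5 (blank parent), the function returns UniProtKB:Q4VCS5 for any of the three identifiers.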
    DB_Xrefs
    Identifiers for the object in DB_Object_ID found in other databases. Optional, cardinality 0+; multiple identifiers should be pipe-separated.
    Identifiers used must be standard 2-part global identifiers, e.g. UniProtKB:OK0206.
    This column should be used to record IDs for this object in other databases; for gene products in model organism databases, this must include the UniProtKB ID, and may also include NCBI gene or protein IDs, etc.
    Properties
    Optional, cardinality 0+; multiple properties should be pipe-separated.
    The Properties column can be filled with a pipe-separated list of values in the format "property_name = property_value". There is a fixed vocabulary for the property names, and this list can be extended when necessary. Supported properties will include: 'GO annotation complete', 'Phenotype annotation complete' (the value for these two properties would be a date), 'Target set' (e.g. Reference Genome, Kidney etc.), 'Database subset' (e.g. Swiss-Prot, TrEMBL).