Annotation Quality Control Checks

These rules are an HTML encoding of the original XML encoding; the schema can be found at http://www.geneontology.org/quality_control/annotation_checks/annotation_qc.rng.

GO_AR:0000001 Basic GAF checks

Contact
cherry@genome.stanford.edu
Status
Implemented (as of 19 October 2005)
Implementation
Active implementation
Language: java, org.geneontology.gold.rules.BasicChecksRule
Input formats: gaf1.0, gaf2.0
Active implementation
Language: perl, http://www.geneontology.org/software/utilities/filter-gene-association.pl
Runs on file submission
Input formats: gaf1.0, gaf2.0
Output: annotations removed and warning sent to submitting database

The following basic checks ensure that submitted gene association files conform to the GAF spec, and come from the original GAF check script.

  • Each line of the GAF file is checked for the correct number of columns, the cardinality of the columns, and leading or trailing whitespace
  • Col 1 and all DB abbreviations must be in GO.xrf_abbs (case may be incorrect)
  • All GO IDs must be extant in the current ontology
  • Qualifier, evidence, aspect and DB object columns must be within the list of allowed values
  • DB:Reference, Taxon and GO ID columns are checked for minimal form
  • Date must be in YYYYMMDD format
  • All IEAs over a year old are removed
  • Taxa with a 'representative' group (e.g. MGI for Mus musculus, FlyBase for Drosophila) must be submitted by that group only
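
A few of the structural checks above can be sketched in Python. This is only an illustration and not the logic of filter-gene-association.pl: the function name basic_line_errors is hypothetical, and the column counts (15 for gaf1.0, 17 for gaf2.0) and the positions of the GO ID and date columns are assumptions taken from the GAF format specifications.

```python
import re
from datetime import datetime

# Assumed column counts per GAF version (from the GAF format specifications).
EXPECTED_COLUMNS = {"1.0": 15, "2.0": 17}


def basic_line_errors(line, gaf_version="2.0"):
    """Return a list of structural problems found in one GAF annotation line."""
    errors = []
    cols = line.rstrip("\n").split("\t")
    if len(cols) != EXPECTED_COLUMNS[gaf_version]:
        errors.append("wrong number of columns: %d" % len(cols))
        return errors  # the remaining checks rely on column positions
    for number, value in enumerate(cols, start=1):
        if value != value.strip():
            errors.append("column %d has leading or trailing whitespace" % number)
    go_id, date = cols[4], cols[13]  # columns 5 and 14
    if not re.fullmatch(r"GO:\d{7}", go_id):
        errors.append("GO ID is not of the form GO:nnnnnnn")
    if not re.fullmatch(r"\d{8}", date):
        errors.append("date is not in YYYYMMDD format")
    else:
        try:
            datetime.strptime(date, "%Y%m%d")  # rejects e.g. month 13
        except ValueError:
            errors.append("date is not in YYYYMMDD format")
    return errors
```

Checks that need external data (GO.xrf_abbs, the current ontology, the age of IEA annotations, representative groups per taxon) are left out of this sketch.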

GO_AR:0000002 No 'NOT' annotations to 'protein binding ; GO:0005515'

Contact
ramab@stanford.edu
Status
Implemented (as of 01 January 2011)
Implementation
Active implementation
Language: perl, http://www.geneontology.org/software/utilities/filter-gene-association.pl
Runs on file submission
Input formats: gaf1.0, gaf2.0
Output: annotations removed and warning sent to submitting database
Language: SQL
Input: GO database, LEAD schema
SELECT gene_product.symbol,
       CONCAT(gpx.xref_dbname, ':', gpx.xref_key) AS gpxref,
       IF(association.is_not=1,"NOT","") AS 'not',
       term.acc,
       term.name,
       evidence.code,
       db.name AS assigned_by
FROM   association
       INNER JOIN gene_product
         ON association.gene_product_id = gene_product.id
       INNER JOIN dbxref AS gpx
         ON gene_product.dbxref_id = gpx.id
       INNER JOIN term
         ON association.term_id = term.id
       INNER JOIN evidence
         ON association.id = evidence.association_id
       INNER JOIN db
         ON association.source_db_id=db.id
WHERE  association.is_not='1'
       AND term.acc = 'GO:0005515'
Language: regex
Input formats: gaf1.0, gaf2.0
/^([^\t]*\t){3}([^\t]*\|)?not(\|[^\t]*)?\tGO:0005515\t/i

Even if an identifier is available in the 'with' column, a qualifier only informs on the GO term; it cannot instruct users to restrict the annotation to just the protein identified in the 'with' column. An annotation applying protein binding ; GO:0005515 with the not qualifier therefore implies that the annotated protein cannot bind anything.

This is such a wide-reaching statement that few curators would want to make it.

This rule only applies to GO:0005515; children of this term can be qualified with not, as further information on the type of binding is then supplied in the GO term; e.g. not + NFAT4 protein binding ; GO:0051529 would be fine, as the negative binding statement only applies to the NFAT4 protein.
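
As a rough illustration of this rule, the check can be written against the qualifier (column 4) and GO ID (column 5) of a GAF line. The function name is hypothetical, and splitting the qualifier column on pipes is an assumption about how multiple qualifiers are encoded.

```python
def violates_not_protein_binding(line):
    """True if a GAF line applies a NOT qualifier to GO:0005515 itself."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 5:
        return False  # malformed line; left to the basic GAF checks
    # Assumption: multiple qualifiers in column 4 are separated by pipes.
    qualifiers = {q.strip().lower() for q in cols[3].split("|")}
    # Only the exact term is tested, so child terms are not flagged.
    return "not" in qualifiers and cols[4] == "GO:0005515"
```

Because the comparison is against the exact accession, an annotation such as not + GO:0051529 passes this check.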

For more information, see the binding guidelines on the GO wiki.

GO_AR:0000003 Annotations to 'binding ; GO:0005488' and 'protein binding ; GO:0005515' should be made with IPI and an interactor in the 'with' field

Contact
ramab@stanford.edu
Status
Implemented (as of 01 January 2011)
Implementation
Active implementation
Language: java, org.geneontology.rules.GO_AR_0000003
Input formats: gaf1.0, gaf2.0
Active implementation
Language: perl, http://www.geneontology.org/software/utilities/filter-gene-association.pl
Runs on file submission
Input formats: gaf1.0, gaf2.0
Output: annotations removed and warning sent to submitting database
Language: SQL
Input: GO database, LEAD schema
SELECT gene_product.symbol,
       CONCAT(gpx.xref_dbname, ':', gpx.xref_key) AS gpxref,
       IF(association.is_not=1,"NOT","") AS 'not',
       term.acc,
       term.name,
       evidence.code,
       db.name AS assigned_by
FROM   association
       INNER JOIN gene_product
         ON association.gene_product_id = gene_product.id
       INNER JOIN dbxref AS gpx
         ON gene_product.dbxref_id = gpx.id
       INNER JOIN term
         ON association.term_id = term.id
       INNER JOIN evidence
         ON association.id = evidence.association_id
       INNER JOIN db
         ON association.source_db_id=db.id
WHERE  evidence.code IN ('NAS','TAS','IDA','IMP','IGC','IEP','ND','IC','RCA','EXP', 'IGI')
       AND (term.acc = 'GO:0005515' OR term.acc = 'GO:0005488')

Annotations to binding ; GO:0005488 or protein binding ; GO:0005515 with the TAS, NAS, IC, IMP, IGI and IDA evidence codes are not informative as they do not allow the interacting partner to be specified. If the nature of the binding partner is known (protein or DNA for example), an appropriate child term of binding ; GO:0005488 should be chosen for the annotation. In the case of chemicals, ChEBI IDs can go in the 'with' column. Children of protein binding ; GO:0005515 where the type of protein is identified in the GO term name do not need further specification.
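
A minimal sketch of this check follows. The set of flagged evidence codes is copied from the SQL query for this rule; treating an IPI annotation with an empty 'with' column as a violation is an assumption based on the rule title, and the function name is hypothetical.

```python
BINDING_TERMS = {"GO:0005488", "GO:0005515"}

# Evidence codes flagged by the SQL query for this rule.
UNINFORMATIVE_CODES = {"NAS", "TAS", "IDA", "IMP", "IGC", "IEP",
                       "ND", "IC", "RCA", "EXP", "IGI"}


def violates_binding_rule(go_id, evidence, with_from):
    """True if an annotation to a generic binding term lacks a usable interactor."""
    if go_id not in BINDING_TERMS:
        return False  # child terms are outside the scope of this rule
    if evidence in UNINFORMATIVE_CODES:
        return True
    # Assumption: IPI is acceptable only when an interactor is supplied.
    return evidence == "IPI" and not with_from.strip()
```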

For more information, see the binding guidelines on the GO wiki.

GO_AR:0000006 IEP usage is restricted to terms from the Biological Process ontology

Contact
ramab@stanford.edu
Status
Implemented (as of 01 January 2011)
Implementation
Active implementation
Language: java, org.geneontology.rules.GO_AR_0000006
Input formats: gaf1.0, gaf2.0
Active implementation
Language: perl, http://www.geneontology.org/software/utilities/filter-gene-association.pl
Runs on file submission
Input formats: gaf1.0, gaf2.0
Output: annotations removed and warning sent to submitting database
Language: SQL
Input: GO database, LEAD schema
SELECT gene_product.symbol,
       CONCAT(gpx.xref_dbname, ':', gpx.xref_key) AS gpxref,
       IF(association.is_not=1,"NOT","") AS 'not',
       term.acc,
       term.name,
       term.term_type,
       evidence.code,
       db.name AS assigned_by
FROM   association
       INNER JOIN gene_product
         ON association.gene_product_id = gene_product.id
       INNER JOIN dbxref AS gpx
         ON gene_product.dbxref_id = gpx.id
       INNER JOIN term
         ON association.term_id = term.id
       INNER JOIN evidence
         ON association.id = evidence.association_id
       INNER JOIN db
         ON association.source_db_id=db.id
WHERE  evidence.code = 'IEP'
       AND term.term_type != 'biological_process'

The IEP evidence code is used where process involvement is inferred from the timing or location of expression of a gene, particularly when comparing a gene that is not yet characterized with the timing or location of expression of genes known to be involved in a particular process. This type of annotation is only suitable with terms from the Biological Process ontology.
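
On a GAF line the same restriction can be tested without the database by comparing the evidence code (column 7) with the aspect (column 9). This sketch assumes the aspect letter P denotes Biological Process, as in the GAF specification; the function name is hypothetical.

```python
def violates_iep_rule(evidence, aspect):
    """True if IEP is used with a term outside Biological Process.

    Assumption: aspect is the GAF column 9 value, one of P, F or C.
    """
    return evidence == "IEP" and aspect != "P"
```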

For more information, see the binding guidelines on the GO wiki.

GO_AR:0000019 Generic Reasoner Validation Check

Contact
hdietze@lbl.gov
Status
Implemented (as of 09 April 2011)
Implementation
Active implementation
Language: java, org.geneontology.gold.rules.GenericReasonerValidationCheck

The entire GAF is converted to OWL and combined with the main GO ontology and auxiliary constraint ontologies. The resulting ontology is checked for consistency and unsatisfiable classes using a complete DL reasoner such as HermiT.

GO_AR:0000004 Reciprocal annotations for 'protein binding ; GO:0005515'

Contact
ramab@stanford.edu
Status
Approved (as of 01 April 2010)
Implementation
Not yet implemented

When annotating to terms that are descendants of protein binding, and when the curator can supply the accession of the interacting protein, it is essential that reciprocal annotations are available; i.e. if you say protein A binds protein B, then you also need a second annotation stating that protein B binds protein A.

This will be a soft QC; a script will make these inferences and it is up to each MOD to evaluate and include the inferences in their GAF/DB.
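
Since no implementation exists yet, the following is only a sketch of how such an inference script might look. It assumes annotations have already been reduced to (gene product, GO ID, 'with' accession) tuples, and it reuses the same GO ID for the reciprocal, which is only clearly appropriate for GO:0005515 itself; the function name is hypothetical.

```python
def missing_reciprocals(annotations):
    """Return reciprocal binding annotations that are absent from the input.

    annotations: iterable of (subject_id, go_id, partner_id) tuples.
    Assumption: the reciprocal annotation uses the same GO ID.
    """
    present = set(annotations)
    missing = set()
    for subject, go_id, partner in present:
        reciprocal = (partner, go_id, subject)
        if reciprocal not in present:
            missing.add(reciprocal)
    return sorted(missing)  # sorted only to make the output stable
```

The result is a list of suggestions, in keeping with the soft nature of the check: nothing is added to or removed from the input.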

For more information, see the binding guidelines on the GO wiki.

GO_AR:0000005 No ISS or ISS-related annotations to 'protein binding ; GO:0005515'

Contact
ramab@stanford.edu
Status
Implemented
Implementation
Active implementation
Language: java, org.geneontology.rules.GO_AR_0000005
Input formats: gaf1.0, gaf2.0
Language: SQL
Input: GO database, LEAD schema
SELECT gene_product.symbol,
       CONCAT(gpx.xref_dbname, ':', gpx.xref_key) AS gpxref,
       IF(association.is_not=1,"NOT","") AS 'not',
       term.acc,
       term.name,
       evidence.code,
       db.name AS assigned_by
FROM   association
       INNER JOIN gene_product
         ON association.gene_product_id = gene_product.id
       INNER JOIN dbxref AS gpx
         ON gene_product.dbxref_id = gpx.id
       INNER JOIN term
         ON association.term_id = term.id
       INNER JOIN evidence
         ON association.id = evidence.association_id
       INNER JOIN db
         ON association.source_db_id=db.id
WHERE  evidence.code IN ('ISS','ISO','ISA','ISM')
       AND term.acc = 'GO:0005515'

If we take an example annotation:

gene product: protein A
GO term: protein binding ; GO:0005515
evidence: IPI
reference: PMID:123456
with/from: with protein X

this annotation line can be interpreted as: protein A was found to carry out the 'protein binding' activity in PMID:123456, and this function was Inferred from the results of a Physical Interaction (IPI) assay, which involved protein X.

However if we would like to transfer this annotation to protein A's ortholog 'protein B', the ISS annotation that would be created would be:

gene product: protein B
GO term: protein binding ; GO:0005515
evidence: ISS
reference: GO_REF:curator_judgement
with/from: with protein A

This is interpreted as 'it is inferred that protein B carries out protein binding activity due to its sequence similarity (curator determined) with protein A, which was experimentally shown to carry out protein binding'.

Therefore the ISS annotation will not display the interacting protein X accession. Such an annotation display can be confusing, as the value in the 'with' column just provides further information on why the ISS/IPI or IGI annotation was created. This means that an ISS projection from protein binding is not particularly useful, as you are only really telling the user that you think a homologous protein binds a protein, based on overall sequence similarity.

This rule only applies to GO:0005515, as descendant terms such as mitogen-activated protein kinase p38 binding ; GO:0048273 used as ISS annotations are informative as the GO term name contains far more specific information as to the identity of the interactor.

For more information, see the binding guidelines on the GO wiki.

GO_AR:0000016 IC annotations require a With/From GO ID

Contact
ramab@stanford.edu
Status
Implemented
Implementation
Active implementation
Language: java, org.geneontology.rules.GO_AR_0000016
Input formats: gaf1.0, gaf2.0
Active implementation
Language: perl, http://www.geneontology.org/software/utilities/filter-gene-association.pl
Runs on file submission
Input formats: gaf1.0, gaf2.0
Output: annotations removed and warning sent to submitting database

All IC annotations should include a GO ID in the "With/From" column; for more information, see the IC evidence code guidelines.

GO_AR:0000017 IDA annotations must not have a With/From entry

Contact
ramab@stanford.edu
Status
Approved (as of 01 February 2012)
Implementation
Active implementation
Language: java, org.geneontology.rules.GO_AR_0000017
Input formats: gaf1.0, gaf2.0
Active implementation
Language: perl, http://www.geneontology.org/software/utilities/filter-gene-association.pl
Runs on file submission
Input formats: gaf1.0, gaf2.0
Output: annotations removed and warning sent to submitting database

Use IDA only when no identifier can be placed in the "With/From" column. When there is an appropriate ID for the "With/From" column, use IPI.

GO_AR:0000018 IPI annotations require a With/From entry

Contact
ramab@stanford.edu
Status
Implemented
Implementation
Active implementation
Language: java, org.geneontology.rules.GO_AR_0000018
Input formats: gaf1.0, gaf2.0
Active implementation
Language: perl, http://www.geneontology.org/software/utilities/filter-gene-association.pl
Runs on file submission
Input formats: gaf1.0, gaf2.0
Output: annotations removed and warning sent to submitting database

All IPI annotations should include a nucleotide/protein/chemical identifier in the "With/From" column (column 8). From the description of IPI in the GO evidence code guide: "We strongly recommend making an entry in the with/from column when using this evidence code to include an identifier for the other protein or other macromolecule or other chemical involved in the interaction. When multiple entries are placed in the with/from field, they are separated by pipes. Consider using IDA when no identifier can be entered in the with/from column." All annotations made after January 1 2012 that break this rule will be removed.
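
The With/From requirements of this rule and of GO_AR:0000016 and GO_AR:0000017 can be sketched together. This is an illustration under stated assumptions rather than the code of either active implementation: the function name is hypothetical, entries are assumed to be pipe-separated as described above, and a GO ID is recognised simply by its GO: prefix.

```python
def with_from_errors(evidence, with_from):
    """Return With/From problems for one annotation (GAF columns 7 and 8)."""
    # Assumption: multiple With/From entries are separated by pipes.
    entries = [e.strip() for e in with_from.split("|") if e.strip()]
    errors = []
    if evidence == "IC" and not any(e.startswith("GO:") for e in entries):
        errors.append("IC requires a GO ID in With/From")
    if evidence == "IDA" and entries:
        errors.append("IDA must not have a With/From entry")
    if evidence == "IPI" and not entries:
        errors.append("IPI requires a With/From entry")
    return errors
```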

GO_AR:0000020 Automatic repair of annotations to merged or obsoleted terms

Contact
ramab@stanford.edu
Status
Approved (as of 06 February 2013)
Implementation
Not yet implemented

Ontology operations such as term merges and obsoletions may be out of sync with annotation releases. Each GO entry T in the GAF is checked to see if it corresponds to a valid (non-obsolete) term in the ontology. If not, metadata for other terms is checked. If the term has been merged into a term S (i.e. S has alt_id of T) then T is replaced by S in the GAF line. Optionally, if T is obsoleted and there is 1 or more replaced_by tags to S1, S2, ... Sn, then the GAF line is replaced by n lines with the GO entry changed to S1..Sn.
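
The repair procedure described above might be sketched as follows. The rule is not yet implemented, so every name here is hypothetical: the three lookup structures are assumed to have been extracted from the ontology file beforehand, and leaving a line unchanged when no repair is known is an assumption the rule text does not cover.

```python
def repair_go_id(go_id, valid_ids, alt_id_to_primary, replaced_by):
    """Return the GO IDs that should stand in for go_id.

    valid_ids:          set of non-obsolete term IDs
    alt_id_to_primary:  dict mapping a merged ID T to the surviving term S
    replaced_by:        dict mapping an obsolete ID T to a list [S1, ..., Sn]
    """
    if go_id in valid_ids:
        return [go_id]
    if go_id in alt_id_to_primary:  # T was merged into S
        return [alt_id_to_primary[go_id]]
    if replaced_by.get(go_id):  # T was obsoleted with replaced_by tags
        return list(replaced_by[go_id])
    return [go_id]  # assumption: no known repair, leave for other checks


def repair_line(line, valid_ids, alt_id_to_primary, replaced_by):
    """Return the GAF line(s) that replace the given line."""
    cols = line.rstrip("\n").split("\t")
    new_ids = repair_go_id(cols[4], valid_ids, alt_id_to_primary, replaced_by)
    return ["\t".join(cols[:4] + [new_id] + cols[5:]) for new_id in new_ids]
```

A line whose term has n replaced_by targets yields n output lines, one per replacement term.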

GO_AR:0000007 IPI should not be used with catalytic activity molecular function terms

Contact
ramab@stanford.edu
Status
Implemented
Implementation
Active implementation
Language: java, org.geneontology.rules.GO_AR_0000007
Input formats: gaf1.0, gaf2.0
Language: SQL
Input: GO database, LEAD schema
SELECT gene_product.symbol,
       CONCAT(gpx.xref_dbname, ':', gpx.xref_key) AS gpxref,
       IF(association.is_not=1,"NOT","") AS 'not',
       term.acc,
       term.name,
       evidence.code,
       db.name AS assigned_by
FROM   term
       INNER JOIN graph_path
         ON term.id = graph_path.term2_id
       INNER JOIN term AS term2
         ON graph_path.term1_id = term2.id
       INNER JOIN association
         ON graph_path.term2_id = association.term_id
       INNER JOIN evidence
         ON association.id = evidence.association_id
       INNER JOIN gene_product
         ON association.gene_product_id = gene_product.id
       INNER JOIN dbxref AS gpx
         ON gene_product.dbxref_id = gpx.id
       INNER JOIN db
         ON association.source_db_id=db.id
WHERE  term2.acc = 'GO:0003824'
       AND evidence.code = 'IPI'

The IPI (Inferred from Physical Interaction) evidence code is used where an annotation can be supported from interaction evidence between the gene product of interest and another molecule (see the evidence code documentation). While the IPI evidence code is frequently used to support annotations to terms that are children of binding ; GO:0005488, it is thought unlikely by the Binding working group that enough information can be obtained from a binding interaction to support an annotation to a term that is a child of catalytic activity ; GO:0003824. Such IPI annotations to child terms of catalytic activity ; GO:0003824 may need to be revisited and corrected.

For more information, see the catalytic activity annotation guide on the GO wiki.

GO_AR:0000008 No annotations should be made to uninformative high level terms

Contact
ramab@stanford.edu
Status
Implemented
Implementation
Active implementation
Language: java, org.geneontology.rules.GO_AR_0000008
Input formats: gaf1.0, gaf2.0
Language: SQL
Input: GO database, LEAD schema
SELECT gene_product.symbol,
       CONCAT(gpx.xref_dbname, ':', gpx.xref_key) AS gpxref,
       IF(association.is_not=1,"NOT","") AS 'not',
       term.acc,
       term.name,
       db.name AS assigned_by
FROM   association
       INNER JOIN gene_product
         ON association.gene_product_id = gene_product.id
       INNER JOIN term
         ON association.term_id = term.id
       INNER JOIN db
         ON association.source_db_id=db.id
       INNER JOIN dbxref AS gpx
         ON gene_product.dbxref_id = gpx.id
WHERE  term.acc IN ( 'GO:0050896', 'GO:0007610', 'GO:0051716', 'GO:0009628',
       'GO:0009607', 'GO:0042221', 'GO:0009719', 'GO:0009605', 'GO:0006950',
       'GO:0048585', 'GO:0048584', 'GO:0048583', 'GO:0001071', 'GO:0000988')

Some terms are too high-level to provide useful information when used for annotation, regardless of the evidence code used. These terms appear in the subset 'high_level_annotation_qc', and are listed below. Please consult the ontology file for the most up-to-date set of terms.

  • GO:0000988 : protein binding transcription factor activity
  • GO:0001071 : nucleic acid binding transcription factor activity
  • GO:0006950 : response to stress
  • GO:0007610 : behavior
  • GO:0009605 : response to external stimulus
  • GO:0009607 : response to biotic stimulus
  • GO:0009628 : response to abiotic stimulus
  • GO:0009719 : response to endogenous stimulus
  • GO:0042221 : response to chemical stimulus
  • GO:0048583 : regulation of response to stimulus
  • GO:0048584 : positive regulation of response to stimulus
  • GO:0048585 : negative regulation of response to stimulus
  • GO:0050896 : response to stimulus
  • GO:0051716 : cellular response to stimulus

GO_AR:0000009 Annotation Intersection Alerts

Contact
val@sanger.ac.uk
Status
Proposed (as of 01 April 2010)
Implementation
Not yet implemented

To be added

GO_AR:0000010 PubMed reference formatting must be correct

Contact
ramab@stanford.edu
Status
Proposed (as of 01 April 2010)
Implementation
Language: SQL
Input: GO database, LEAD schema
SELECT gene_product.symbol,
       CONCAT(gpx.xref_dbname, ':', gpx.xref_key) AS gpxref,
       IF(association.is_not=1,"NOT","") AS 'not',
       term.acc,
       term.name,
       evidence.code,
       CONCAT(dbxref.xref_dbname, ':', dbxref.xref_key) AS evxref,
       db.name AS assigned_by
FROM   association
       INNER JOIN evidence
         ON association.id = evidence.association_id
       INNER JOIN gene_product
         ON association.gene_product_id = gene_product.id
       INNER JOIN term
         ON association.term_id = term.id
       INNER JOIN dbxref
         ON evidence.dbxref_id = dbxref.id
       INNER JOIN dbxref AS gpx
         ON gene_product.dbxref_id = gpx.id
       INNER JOIN db
         ON association.source_db_id=db.id
WHERE  dbxref.xref_dbname = 'PMID'
       AND dbxref.xref_key NOT REGEXP '^[1-9][0-9]*$'
Language: regex
Input formats: gaf1.0, gaf2.0
/^([^\t]*\t){5}([^\t|]*\|)*PMID:(?![1-9]\d*[|\t])/

References in the GAF (Column 6) should be of the format db_name:db_key|PMID:12345678, e.g. SGD_REF:S000047763|PMID:2676709. No other format is acceptable for PubMed references; the following examples are invalid:

  • PMID:PMID:14561399
  • PMID:unpublished
  • PMID:.
  • PMID:0

This is proposed as a HARD QC check: incorrectly formatted references will be removed.
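
A small sketch of the reference check, applied to the contents of column 6. It assumes that a valid PubMed ID is a positive integer without leading zeros, which is consistent with the invalid examples listed above; the function name is hypothetical.

```python
import re

# Assumption: a valid PubMed reference is "PMID:" plus a positive integer.
PMID_RE = re.compile(r"PMID:[1-9][0-9]*")


def invalid_pmids(reference_column):
    """Return the malformed PubMed references found in GAF column 6."""
    bad = []
    for ref in reference_column.split("|"):
        if ref.startswith("PMID:") and not PMID_RE.fullmatch(ref):
            bad.append(ref)
    return bad
```

References from other databases, such as SGD_REF:S000047763, are ignored by this function.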

GO_AR:0000011 ND annotations to root nodes only

Contact
ramab@stanford.edu
Status
Proposed (as of 01 April 2010)
Implementation
Active implementation
Language: java, org.geneontology.rules.GO_AR_0000011
Input formats: gaf1.0, gaf2.0
Language: SQL
Input: GO database, LEAD schema
SELECT gene_product.symbol,
       CONCAT(gpx.xref_dbname, ':', gpx.xref_key) AS gpxref,
       IF(association.is_not = 1, "NOT", "") AS 'not',
       term.acc,
       term.name,
       evidence.code,
       CONCAT(dbxref.xref_dbname, ':', dbxref.xref_key) AS evxref,
       db.name AS assigned_by
FROM   association
       INNER JOIN evidence
         ON association.id = evidence.association_id
       INNER JOIN gene_product
         ON association.gene_product_id = gene_product.id
       INNER JOIN term
         ON association.term_id = term.id
       INNER JOIN dbxref
         ON evidence.dbxref_id = dbxref.id
       INNER JOIN dbxref AS gpx
         ON gpx.id = gene_product.dbxref_id
       INNER JOIN db
         ON association.source_db_id = db.id
WHERE  ( evidence.code = 'ND'
         AND term.acc NOT IN ( 'GO:0005575', 'GO:0003674', 'GO:0008150' ) )
        OR ( NOT(evidence.code = 'ND')
             AND term.acc IN ( 'GO:0005575', 'GO:0003674', 'GO:0008150' ) )
        OR ( evidence.code = 'ND'
             AND ( CONCAT(dbxref.xref_dbname, ':', dbxref.xref_key) NOT IN (
                         'GO_REF:0000015', 'FB:FBrf0159398',
                         'ZFIN:ZDB-PUB-031118-1',
                         'dictyBase_REF:9851',
                         'MGI:MGI:2156816',
                         'SGD_REF:S000069584', 'CGD_REF:CAL0125086',
                         'RGD:1598407',
                         'TAIR:Communication:1345790',
                         'AspGD_REF:ASPL0000111607' ) ) )

The No Data (ND) evidence code should be used for annotations to the root nodes only and should be accompanied by GO_REF:0000015 or an internal reference. PMIDs cannot be used for annotations made with ND.

  • If you are using an internal reference, that reference ID should be listed as an external accession for GO_REF:0000015. Please add (or email) your internal reference ID for GO_REF:0000015.
  • All ND annotations made with a reference other than GO_REF:0000015 (or an equivalent internal reference that is listed as external accession for GO_REF:0000015) should be filtered out of the GAF.

The SQL code identifies ND annotations to non-root terms, non-ND annotations to the root nodes, and ND annotations that do not use GO_REF:0000015 or one of the alternative internal references listed for it in the GO references file.
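
The three conditions in the WHERE clause can also be expressed per annotation. This sketch mirrors the SQL rather than the Java implementation: the root terms and accepted references are copied from the query, the function name is hypothetical, and a single reference per annotation is assumed.

```python
ROOT_TERMS = {"GO:0005575", "GO:0003674", "GO:0008150"}

# Accepted references, copied from the SQL query for this rule.
ND_REFERENCES = {
    "GO_REF:0000015", "FB:FBrf0159398", "ZFIN:ZDB-PUB-031118-1",
    "dictyBase_REF:9851", "MGI:MGI:2156816", "SGD_REF:S000069584",
    "CGD_REF:CAL0125086", "RGD:1598407", "TAIR:Communication:1345790",
    "AspGD_REF:ASPL0000111607",
}


def nd_errors(go_id, evidence, reference):
    """Return the ND-related problems for one annotation."""
    errors = []
    is_root = go_id in ROOT_TERMS
    if evidence == "ND":
        if not is_root:
            errors.append("ND annotation to a non-root term")
        if reference not in ND_REFERENCES:
            errors.append("ND annotation without an accepted reference")
    elif is_root:
        errors.append("non-ND annotation to a root term")
    return errors
```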

GO_AR:0000013 Taxon-appropriate annotation check

Contact
ramab@stanford.edu
Status
Implemented (as of 12 April 2011)
Implementation
Active implementation
Language: java, org.geneontology.gold.rules.AnnotationTaxonRule
Input formats: gaf1.0, gaf2.0

GO taxon constraints ensure that annotations are not made to inappropriate species or sets of species. See http://www.biomedcentral.com/1471-2105/11/530 for more details.

GO_AR:0000014 Valid GO term ID

Contact
ramab@stanford.edu
Status
Proposed (as of 12 April 2011)
Implementation
Active implementation
Language: java, org.geneontology.gold.rules.GoClassReferenceAnnotationRule
Input formats: gaf1.0, gaf2.0

This check ensures that the GO IDs used for annotations are valid IDs and are not obsolete.

GO_AR:0000015 Dual species taxon check

Contact
go-discuss@lists.stanford.edu
Status
Proposed (as of 25 August 2011)
Implementation
Not yet implemented

Dual species annotations are used to capture information about multi-organism interactions. The first taxon ID should be that of the species encoding the gene product, and the second should be the taxon of the other species in the interaction. Where the interaction is between organisms of the same species, both taxon IDs should be the same. These annotations should be used only in conjunction with terms that have the biological process term 'GO:0051704 : multi-organism process' or the cellular component term 'GO:0044215 : other organism' as an ancestor.
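
As this check has no implementation yet, the following is only one possible sketch. It assumes the taxon column holds pipe-separated taxon IDs and that the ancestors of each term are available as a precomputed mapping; both the ancestors argument and the function name are hypothetical.

```python
MULTI_ORGANISM_ANCESTORS = {"GO:0051704", "GO:0044215"}


def violates_dual_taxon_rule(go_id, taxon_column, ancestors):
    """True if a dual species annotation uses a term outside the allowed branches.

    ancestors: dict mapping a GO ID to the set of its ancestor GO IDs
               (hypothetical, assumed to be derived from the ontology).
    """
    # Assumption: two taxon IDs in the taxon column are separated by a pipe.
    taxa = [t for t in taxon_column.split("|") if t.strip()]
    if len(taxa) < 2:
        return False  # single-species annotation, rule does not apply
    return not (MULTI_ORGANISM_ANCESTORS & ancestors.get(go_id, set()))
```

Whether the two listed terms themselves should be accepted depends on whether the caller includes each term in its own ancestor set.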