The GO Consortium runs a number of automated checks on the quality of the annotations submitted to the GO database. These checks are detailed on the annotation quality control checks page.
Annotation QC script
A Perl script runs quality control checks, a subset of which are listed on the annotation quality control checks page, to validate the format of gene association files and to partially check the data they contain. The script is run on every gene association file before it is loaded into the GO database, and the results of this filtering step are reported back to the submitting group.
This script is intended to be generic and to enforce the standards defined by the GO Consortium. Use it to validate your gene association file before committing the file to the archive. Suggestions for enhancements to this process are welcome. The filter-gene-association.pl script is available from the GO SVN repository.
Submitted gene association files are committed to the GO SVN repository into the gene association file submissions directory. The checking and filtering script is run nightly on any newly deposited files by the GO Database staff at Stanford. The output of the script is placed in the gene association file directory and subsequently used to load the GO database.
The input file is checked for the following types of errors. Any row of the gene association file found to contain an error is removed from the final output file.
The script checks each line for the correct number of columns and the cardinality of each column, looks for leading or trailing whitespace, and runs a number of specific checks on the data in particular columns.
These specific checks include use of the defined terms in the Qualifier, Evidence, Aspect, and DB Object Type columns. The DB:Reference, Taxon and GO ID columns are checked for minimal form. The Date is also verified to match the YYYYMMDD format.
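The per-row checks described above can be sketched in Python as follows. This is an illustration only, not the Perl script's actual logic: the 15-column layout, the positions of the GO ID and Aspect columns, the Aspect values, and the `GO:` plus seven digits pattern are assumptions about the file format, and `check_row` and `is_valid_date` are hypothetical names. Only the date position (column 14) and the YYYYMMDD format come from this page.

```python
import re
from datetime import datetime

EXPECTED_COLUMNS = 15            # assumption: 15-column gene association layout
GO_ID_COL, ASPECT_COL = 5, 9     # assumption: 1-based column positions
DATE_COL = 14                    # date of annotation, as stated on this page
ASPECTS = {"P", "F", "C"}        # assumption: defined terms for the Aspect column


def is_valid_date(value):
    """True if value is exactly eight digits forming a real calendar date."""
    if not re.fullmatch(r"[0-9]{8}", value):
        return False
    try:
        datetime.strptime(value, "%Y%m%d")
    except ValueError:
        return False
    return True


def check_row(line):
    """Return a list of problems found in one tab-delimited annotation row."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != EXPECTED_COLUMNS:
        # Column positions are meaningless if the count is wrong, so stop here.
        return [f"expected {EXPECTED_COLUMNS} columns, found {len(fields)}"]
    problems = []
    for number, value in enumerate(fields, start=1):
        if value != value.strip():
            problems.append(f"column {number}: leading or trailing whitespace")
    if not re.fullmatch(r"GO:[0-9]{7}", fields[GO_ID_COL - 1]):
        problems.append("GO ID is not of the form GO:nnnnnnn")
    if fields[ASPECT_COL - 1] not in ASPECTS:
        problems.append("Aspect is not one of P, F, C")
    if not is_valid_date(fields[DATE_COL - 1]):
        problems.append("Date is not a valid YYYYMMDD value")
    return problems
```

A row with an empty problem list would be kept; any other row would be dropped from the output. Cardinality checks and the Qualifier, Evidence and DB Object Type vocabularies are left out of the sketch because their rules are not given here.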
The abbreviation in column 1, like every other database abbreviation used within the gene association file, is checked (case-insensitively) against the abbreviations defined within the GO database cross-references.
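A case-insensitive lookup of this kind might look like the sketch below. The function names are hypothetical, and the known abbreviations are passed in as a plain list; how the script actually reads the GO database cross-references file is not described here.

```python
def db_prefix(value):
    """Return the database abbreviation before the first colon, e.g. 'PMID' from 'PMID:123'."""
    return value.split(":", 1)[0]


def is_known_db(abbreviation, known_abbreviations):
    """Case-insensitive membership test against the defined abbreviations."""
    known = {name.lower() for name in known_abbreviations}
    return abbreviation.lower() in known
```

For example, `is_known_db(db_prefix("pmid:123"), ["PMID", "SGD"])` is true even though the case differs.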
The GO IDs mentioned in the annotation file are checked against the current Gene Ontology edit OBO file. Rows containing an obsolete or invalid GO ID are removed.
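As a rough illustration of this check, the following Python sketch collects current and obsolete term IDs from the lines of an OBO file and classifies a GO ID against them. It assumes the usual OBO stanza layout (`[Term]` headers followed by `id:` and an optional `is_obsolete: true` tag); `load_go_ids` and `classify_go_id` are hypothetical names, and the real script may handle more cases, such as alternate IDs.

```python
def load_go_ids(obo_lines):
    """Return (current_ids, obsolete_ids) collected from [Term] stanzas."""
    current, obsolete = set(), set()
    stanza, term_id, is_obsolete = None, None, False
    # A sentinel stanza header at the end flushes the final term.
    for raw in list(obo_lines) + ["[End]"]:
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            if stanza == "[Term]" and term_id:
                (obsolete if is_obsolete else current).add(term_id)
            stanza, term_id, is_obsolete = line, None, False
        elif line.startswith("id:"):
            term_id = line[len("id:"):].strip()
        elif line.startswith("is_obsolete:"):
            is_obsolete = line.split(":", 1)[1].strip() == "true"
    return current, obsolete


def classify_go_id(go_id, current, obsolete):
    """Return 'ok', 'obsolete', or 'invalid' for one GO ID."""
    if go_id in current:
        return "ok"
    if go_id in obsolete:
        return "obsolete"
    return "invalid"
```

Rows whose GO ID is classified as anything other than `ok` would be removed.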
All IEA annotations that are over one year old are removed. This filtering step uses the date of annotation stated in column 14, so the information in the date column must be accurate.
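The age filter can be pictured with the short Python sketch below. Treating "one year" as 365 days and reading the evidence code from column 7 are assumptions made for illustration, and `is_stale_iea` is a hypothetical name; the date position (column 14) is the one given above.

```python
from datetime import date, datetime, timedelta

EVIDENCE_COL = 7   # assumption: 1-based position of the evidence code
DATE_COL = 14      # date of annotation, as stated on this page


def is_stale_iea(fields, today=None, max_age_days=365):
    """True if the row is an IEA annotation older than max_age_days.

    Raises ValueError if the date field is not a valid YYYYMMDD value,
    so format checks should run first.
    """
    today = today or date.today()
    if fields[EVIDENCE_COL - 1] != "IEA":
        return False
    annotated = datetime.strptime(fields[DATE_COL - 1], "%Y%m%d").date()
    return today - annotated > timedelta(days=max_age_days)
```

Passing `today` explicitly keeps the check reproducible, which is useful when re-running the filter on an archived file.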
A major component of the filtering is the requirement that particular taxon IDs may appear only in the association files provided by specific projects. For example, the taxon ID for Mus musculus (taxon:10090) is limited to the file provided by the Mouse Genome Informatics project. Please see the list of species and relevant database groups for more details.
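A minimal sketch of the taxon restriction, in Python, is given below. The table holds only the Mus musculus example from the text; the project key `mgi`, the function name `taxon_allowed`, and the decision to look only at the first ID when the Taxon column holds several pipe-separated values are all assumptions made for illustration.

```python
# Assumption: a one-entry excerpt of the taxon-to-project table,
# keyed by taxon ID with a hypothetical lowercase project identifier.
RESTRICTED_TAXA = {"taxon:10090": "mgi"}


def taxon_allowed(taxon_field, project, restricted=RESTRICTED_TAXA):
    """True unless the row's primary taxon is reserved for a different project."""
    primary = taxon_field.split("|")[0]
    owner = restricted.get(primary)
    return owner is None or owner == project.lower()
```

With this table, `taxon_allowed("taxon:10090", "sgd")` is false, so such a row would be removed from a file submitted by any project other than the one that owns the taxon.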
Script command line options
Usage help for the script is available with the -h option. The script is designed to be run from the gene association file submissions directory within a GO SVN sandbox. By default the script needs the GO database cross-references and Gene Ontology edit OBO files. The input gene association file is read from STDIN by default, or from a file named on the command line (the -i option in the examples below).
A. check a file for any errors, obsolete GO IDs or old IEA annotations
filter-gene-association.pl -i gene_association.sgd.gz
B. filter any problems and output the validated lines, including headers
filter-gene-association.pl -i gene_association.fb.gz -w > filtered-output
C. check a file with taxon ID checking turned off, and write the bad lines to STDOUT
filter-gene-association.pl -i gene_association.fb.gz -p nocheck -e > bad-lines