LEAD Database Schema

GO Database Schema Autodoc

This is the GO LEAD Database Schema documentation, in Autodoc format. Everything on this page was automatically generated from the schema metadata. For more background on the schema, please read GO Database documentation.

note on modules: The tables are partitioned into modules for the purposes of Autodoc. These modules are invisible in the actual instantiations of the database, so the table "go_graph.term" is just simply "term".

This documentation was originally generated on 2010-02-08.

Schema go_associations

Table: go_associations.association

Annotation model: An association is a link between a gene product record and an ontology term, with one or more pieces of evidence *** IMPORTANT: NOT all associations are positive: some posit negative links. THESE SHOULD TYPICALLY BE FILTERED OUT FOR MOST ANALYSIS PURPOSES. See the is_not column ***

go_associations.association Structure
F-Key Name Type Description
id serial PRIMARY KEY
go_graph.term.id term_id integer NOT NULL the (GO) term to which the gene_product is associated
go_associations.gene_product.id gene_product_id integer NOT NULL the gene or gene_product to which the term is associated
is_not integer ** IMPORTANT ** when this field is non-zero, the meaning of the annotation is that the gene_product does NOT have the role defined by the GO term (column 4 = NOT in the gene-association file) See also: association_qualifier table
assocdate integer a date in YYYYMMDD format. This is the date the association was last checked the source db providers (column 14 in the gene-association file)
go_general.db.id source_db_id integer the source of the association; for instance, the association file may come from SwissProt, but the source of the association (Example: SGD) (Example: MGI) (column 15 = NOT in the gene-association file) (docs: http://www.geneontology.org/cgi-bin/xrefs.cgi)

Tables referencing this one via Foreign Key Constraints:

a1 term_id a2 gene_product_id a3 term_id, gene_product_id a4 id, term_id, gene_product_id a5 id, gene_product_id a6 is_not, term_id, gene_product_id a7 assocdate

Index - Schema go_associations

Table: go_associations.association_property

(column 16 in the gene-association file) (see http://wiki.geneontology.org/index.php/Annotation_Cross_Products)

go_associations.association_property Structure
F-Key Name Type Description
id serial PRIMARY KEY
go_associations.association.id association_id integer NOT NULL
go_graph.term.id relationship_type_id integer NOT NULL e.g. the term.id that has acc='part_of'
go_graph.term.id term_id integer NOT NULL e.g. the term.id that has acc='CL:0000017'

Index - Schema go_associations

Table: go_associations.association_qualifier

associations can have a number of qualifiers. These include, but are not limited to the NOT qualifier (which technically is not a qualifier at all as it fundamentally changes the semantics of an association) note that this table IS redundant with association.is_not this is intentional - negation is important enough to go directly in the association table. It also goes in this table for consistency with the gene association file (column 4 in the gene-association file) (docs: http://www.geneontology.org/GO.format.annotation.shtml)

go_associations.association_qualifier Structure
F-Key Name Type Description
id serial PRIMARY KEY
go_associations.association.id association_id integer NOT NULL
go_graph.term.id term_id integer NOT NULL qualifiers come from their own terminology
value character varying(255) qualifiers can potentialy be tag=value pairs. however, all qualifiers are currently boolean tags, so this column is always null
aq1 association_id, term_id

Index - Schema go_associations

Table: go_associations.association_species_qualifier

(see http://www.geneontology.org/GO.annotation.shtml#manySpp) (column 13 in gene_association file, ONLY WHEN card>1, this is the next entry) to be used only in conjunction with terms that have the term 'interaction between organisms' as an ancestor. gene_product.species_id is for the organism type encoding the gene or gene product, association_species_qualifier.species_id should be that of the other organism in the interaction. aka "dual taxon"

go_associations.association_species_qualifier Structure
F-Key Name Type Description
id serial PRIMARY KEY
go_associations.association.id association_id integer NOT NULL
go_associations.species.id species_id integer The species of the interacting organism (eg host) Example: in cytolysis of cells of another organism (GO:0051715) this would be the species playing the 'other organism' role

Index - Schema go_associations

Table: go_associations.evidence

each association can have one or more pieces of evidence attached to it (the schema actually allows zero or more, but with GO all annotation have at least one piece of evidence) (doc: http://www.geneontology.org/GO.evidence.shtml)

go_associations.evidence Structure
F-Key Name Type Description
id serial PRIMARY KEY
code character varying(8) UNIQUE#1 NOT NULL a string code (typically 3-letter) corresponding to a GO evidence code. (column 7 in the gene-association file) (Example: IEA - inferred from electronic annotation) (Example: IMP - inferred from mutant phenotype) evidence codes may eventually become "ontologized", allowing us to take full advantage of the OBO evidence ontology: http://www.obofoundry.org/cgi-bin/detail.cgi?evidence_code
go_associations.association.id association_id integer UNIQUE#1 NOT NULL
go_general.dbxref.id dbxref_id integer UNIQUE#1 NOT NULL A reference for the annotation. Typically a pubmed ID (column 6 in the gene-association file)
seq_acc character varying(255) a denormalised field containing a "|" separated list of accession supporting the call. for the normalised data, use evidence_dbxref (column 8 in the gene-association file, copied identically)

Tables referencing this one via Foreign Key Constraints:

ev1 association_id ev2 code ev3 dbxref_id ev4 association_id, code

Index - Schema go_associations

Table: go_associations.evidence_dbxref

each piece of evidence can have multiple dbxrefs associated with it; this is the *normalised* version of the "With" or "From" field of the evidence (column 8 in the gene-association file, normalized)

go_associations.evidence_dbxref Structure
F-Key Name Type Description
go_associations.evidence.id evidence_id integer NOT NULL
go_general.dbxref.id dbxref_id integer NOT NULL globally unique identifier for the evidence (Example: GO:0000346 - will be stored as DB=GO, Acc=0000346)
evx1 evidence_id evx2 dbxref_id evx3 evidence_id, dbxref_id

Index - Schema go_associations

Table: go_associations.gene_product

Represents a gene or gene_product, typically at the species level. GO allows for annotation of genes OR gene products. Annotation of a gene is understood to be "proxy" for annotation of the corresponding gene products. (docs: http://www.geneontology.org/GO.annotation.fields.shtml) (docs: http://www.geneontology.org/GO.annotation.shtml#file) (Example: human p53-gene) (Example: human p53-protein)

go_associations.gene_product Structure
F-Key Name Type Description
id serial PRIMARY KEY
symbol character varying(128) NOT NULL concise label for this gene product (Example: p53) (Example: BRCA) (Example: PHO3) typically unique within an originating database authority, but not guaranteed; (an example of such as authority is FlyBase or UniProt)
go_general.dbxref.id dbxref_id integer UNIQUE NOT NULL A globally unique identifier for this gene or gene product. All (non-GO) unique identifiers are stored as dbxrefs - they must consist of both a DB and a DB_Object_ID (Example: SGD:S000000296) (column 1 and 2 in gene_association file)
go_associations.species.id species_id integer The species or taxon to which this gene product belongs (Note: in future we reserve the option to use gene_product to represent families at higher levels in the taxonomic tree above species) (column 13 in gene_association file; if card>1, this is the first entry)
go_graph.term.id type_id integer gene_product type (eg gene, transcript, protein, complex) (column 13 in the gene-association file) the type term may correspond to a SO ID, but typically SO is not loaded an an ontology into the GO database
full_name text symbol is typically a concise label, full_name may be more textual (column 10 in the gene-association file)

Tables referencing this one via Foreign Key Constraints:

g1 symbol g2 dbxref_id g3 species_id g4 id, species_id g5 dbxref_id, species_id g6 id, dbxref_id g7 id, species_id g8 id, dbxref_id, species_id

Index - Schema go_associations

Table: go_associations.gene_product_subset

(aka goslims). Each subsetdef (slim) is stored as a gene_product in the database (with term_type = 'subset') The subset_id links to this term (OBO-Format: *subset* tag. term_id references a term housing the *subsetdef*)

go_associations.gene_product_subset Structure
F-Key Name Type Description
go_associations.gene_product.id gene_product_id integer NOT NULL
go_graph.term.id subset_id integer NOT NULL
gps1 gene_product_id gps2 subset_id

Index - Schema go_associations

Table: go_associations.gene_product_synonym

alternate label for the gene or gene product (column 11 in the gene-association file)

go_associations.gene_product_synonym Structure
F-Key Name Type Description
go_associations.gene_product.id gene_product_id integer UNIQUE#1 NOT NULL
product_synonym character varying(255) UNIQUE#1 NOT NULL alternate label. Typically NOT redundant with gene_product.symbol or gene_product.full_name, but this is not guaranteed (column 11 in the gene-association file)
gs1 gene_product_id gs2 product_synonym

Index - Schema go_associations

Table: go_associations.species

Linnaean taxonomic information for an organism type. Modeled after NCBI Taxonomy (Note: the name of the table is misleading, as it can model ANY node in Linnaen taxonomy) (The table should be better called "taxon")

go_associations.species Structure
F-Key Name Type Description
id serial PRIMARY KEY
ncbi_taxa_id integer UNIQUE identifier within the NCBI Taxonomy database. (Example: Dmel=7227) (Example: S Cerevisae=4932)
common_name character varying(255) Non-scientific name (Example: fruitfly)
lineage_string text denormalized list of taxon names as text. (Note: not currently populated)
genus character varying(55) If the row in the table is genuinely a species, this column is for storing the "genus" in the Linnaean system. If the row is a higher taxon, then this column is for the scientific name of that taxon. (Example: Drosophila -- for leaf node taxon) (Example: Homo -- for leaf node taxon) (Example: Metazoa -- for non-leaf node taxon) unfortunately the name of this column is misleading. However, it will be retained for backwards compatibility
species character varying(255) If the row in the table is genuinely a species, this column is for storing the "species" name in the Linnaean system. If the row is a higher taxon, this column is null (Example: sapiens) (Example: pombe) unfortunately the name of this column is misleading. However, it will be retained for backwards compatibility note that (genus,species) is not declared unique
parent_id integer parent taxon in hierarchy (direct parent - for indirect parents see left_value and right_value)
left_value integer left_value, right_value implement a nested set model see http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html or Joe Celko's "SQL for smarties" for more information.
right_value integer see left_value
taxonomic_rank character varying(255) eg species, family, phylum, ...

Tables referencing this one via Foreign Key Constraints:

sp1 ncbi_taxa_id sp10 right_value sp11 left_value, right_value sp12 id, left_value sp13 genus, left_value, right_value sp2 common_name sp3 genus sp4 species sp5 genus, species sp6 id, ncbi_taxa_id sp7 id, ncbi_taxa_id, genus, species sp8 parent_id sp9 left_value

Index - Schema go_associations

Schema go_audit

Table: go_audit.instance_data

metadata on this particular instance/build of the GO database

go_audit.instance_data Structure
F-Key Name Type Description
release_name character varying(255) UNIQUE Typically named by date/version in YYYY-MM-DD format
release_type character varying(255) One of: termdb (ontology only) assocdb (termdb + associations) seqdb (assocdb + sequences) seqdblite (seqdb - IEAs)
release_notes text notes specific to this release. Will typically be null unless this release is unusual in some way
ontology_data_version character varying(255) data-version tag from the header of the obo file (not yet implemented)

Index - Schema go_audit

Table: go_audit.source_audit

time of last modification of data source (usually type:file) source_path is a file path or name

go_audit.source_audit Structure
F-Key Name Type Description
source_id character varying(255)
source_fullpath character varying(255)
source_path character varying(255)
source_type character varying(255)
source_md5 character(32)
source_parsetime integer
source_mtime integer
fa1 source_path

Index - Schema go_audit

Table: go_audit.term_audit

not in use

go_audit.term_audit Structure
F-Key Name Type Description
go_graph.term.id term_id integer UNIQUE NOT NULL
term_loadtime integer
ta1 term_id

Index - Schema go_audit

Schema go_general

Table: go_general.db

metadata on the different database / accession granting authorities. the data should come from the GO.xref_abbs file. (docs: http://www.geneontology.org/cgi-bin/xrefs.cgi) the dbname is the abbreviation, and should match dbxref.xref_dbname however we have no foreign key so not every dbxref.xref_dbname will have an entry here most columns will not be populated in the short term - the other fields are for future expansion

go_general.db Structure
F-Key Name Type Description
id serial PRIMARY KEY
name character varying(55) UNIQUE
fullname character varying(255)
datatype character varying(255)
generic_url character varying(255)
url_syntax character varying(255)
url_example character varying(255)
uri_prefix character varying(255)

Tables referencing this one via Foreign Key Constraints:

db1 name db2 fullname db3 datatype

Index - Schema go_general

Table: go_general.dbxref

a unique bipartite identifier for a record typically housed in an external database. dbxrefs are of the form DB_ID:Local_ID. the combination of DB and local identifier is unique: it is up to the DB_ID authority to ensure Local_ID is unique within their ID space This table is referenced from a variety of other table: (1) as an xref for a sequence (*seq_dbxref* table) (2) as a primary identifier for a gene product (gene_product != seq except in the case of GOA) (*gene_product.dbxref_id* column) (3) as an xref for a term or term definition (eg swissprot keywords) (*term_dbxref* table) (4) as evidence for an association based on sequence evidence (*evidence* or *evidence_dbxref* table) (docs: http://www.geneontology.org/cgi-bin/xrefs.cgi)

go_general.dbxref Structure
F-Key Name Type Description
id serial PRIMARY KEY
xref_dbname character varying(55) UNIQUE#1 NOT NULL The name of the database or ID-granting authority from which information concerning this dbxref can be retrieved. It is recommended that this comes from the set here: http://www.geneontology.org/cgi-bin/xrefs.cgi; however, this is not enforced at the schema level. There *may* be a corresponding entry in the *db* table, housing metadata on this dbname but this is not enforced as a foreign key reference (Example: FB) (Example: SGD) (Example: UniProt) (Example: PubMed)
xref_key character varying(255) UNIQUE#1 NOT NULL The local identifier that is unique within xref_dbname (Example: FBgn0000001)
xref_keytype character varying(32) DEPRECATED. Was in principle used for what "type" the xref was - eg symbol vs ID
xref_desc character varying(255) DEPRECATED optional description of dbxref

Tables referencing this one via Foreign Key Constraints:

dx1 xref_dbname dx2 xref_key dx3 id, xref_dbname dx4 id, xref_key, xref_dbname dx5 id, xref_key

Index - Schema go_general

Schema go_graph

Table: go_graph.relation_composition

(See http://wiki.geneontology.org/index.php/Relation_composition) Stores rules of the form: r1 . r2 -> r i.e. IF [ X r1 Y ] AND [ Y r2 Z ] THEN [ X r Z ] Corresponds to "transitive_over" and "holds_over_chain" tags in obo-format.

go_graph.relation_composition Structure
F-Key Name Type Description
id serial PRIMARY KEY
go_graph.term.id relation1_id integer UNIQUE#1 NOT NULL The first relation in the pairwise composition. References an entry in the term table. (recall that the term table housed both the terms themselves, and the relations)
go_graph.term.id relation2_id integer UNIQUE#1 NOT NULL The second relation in the pairwise composition. References an entry in the term table. (recall that the term table housed both the terms themselves, and the relations)
go_graph.term.id inferred_relation_id integer UNIQUE#1 NOT NULL The inferred relation in the pairwise composition. References an entry in the term table. (recall that the term table housed both the terms themselves, and the relations)
rc1 relation1_id rc2 relation2_id rc3 inferred_relation_id rc4 relation1_id, relation2_id, inferred_relation_id

Index - Schema go_graph

Table: go_graph.relation_properties

go_graph.relation_properties Structure
F-Key Name Type Description
go_graph.term.id relationship_type_id integer UNIQUE NOT NULL The first relation in the pairwise composition. References an entry in the term table. (recall that the term table housed both the terms themselves, and the relations)
is_transitive integer equals 1 if this relation is transitive. IF [X R Y] AND [Y R Z] AND [R is_transitive] THEN [X R Z] (OBO-Format: *is_transitive* tag) valid values: 0 or 1
is_symmetric integer equals 1 if this relation is symmetric. IF [X R Y] AND [R is_symmetric] THEN [Y R X] (OBO-Format: *is_symmetric* tag) valid values: 0 or 1
is_anti_symmetric integer equals 1 if this relation is anti_symmetric. IF [X R Y] AND [Y R X] AND [R is_anti_symmetric] THEN [X=Y] (OBO-Format: *is_anti_symmetric* tag) valid values: 0 or 1
is_cyclic integer equals 1 if this relation is cyclic. (OBO-Format: *is_cyclic* tag) valid values: 0 or 1
is_reflexive integer equals 1 if this relation is reflexive. IF [R is_reflexive] THEN [X R X] (OBO-Format: *is_reflexive* tag) valid values: 0 or 1
is_metadata_tag integer equals 1 if this relation is metadata_tag. IF [X R Y] AND [R is_metadata_tag] THEN [Y R X] (OBO-Format: *is_metadata_tag* tag) valid values: 0 or 1

Index - Schema go_graph

Table: go_graph.term

Fundamental representational unit in a controlled vocabulary or ontology. *Terms* form the nodes in the ontology graph (structured using the *term2term* table). An example of a term in GO is GO:0009333 "cysteine synthase complex". Each entry in the term table corresponds to a distinct type (kind, class) of entity in reality. Note that the term table is also used for storing *relations*. The two fundamental relations in GO are "is_a" and "part_of". Relations comprise the "labels" of the edges of the ontology graph. Note the term "term" is misleading in that this table is *not* used for storing synonyms and alternate labels; only the *preferred* terms, corresponding to nodes in the ontology graph each representing a distinct type of entity. In OBO-Format, both Term and Typedef (relation) stanzas are housed in the term table. (doc: http://www.geneontology.org/GO.format.obo-1_2.shtml)

go_graph.term Structure
F-Key Name Type Description
id serial PRIMARY KEY
name character varying(255) NOT NULL DEFAULT ''::character varying A textual label for the term. Each term has a single such label (see *term_synonym* for alternate labels). The name should be unique within an ontology (in fact uniqueness is encourage across ontologies - the principle of univocity. However, this is not enforced at the database schema level). the uniqueness recommendation is relaxed in the case of obsolete terms, which are also housed in this table: there can be many "ex-terms" with the same name. In some alternate systems, *term.name* is also known as the "preferred term" (OBO-Format: *name* tag) (Example: "cysteine biosnthetic process")
term_type character varying(55) NOT NULL The ontology or namespace to which this term belongs (OBO-Format: *namespace* tag) (Example: biological_process) (Note: the column name is somewhat misleading, but is retained for historical reasons. It would be better named "namespace" or "ontology") The namespace for GO terms will always be molecular_function, biological_process or cellular_component. The relations defined in the main GO obo file (from which this table is populated) go into the gene_ontology namespace, with the exception of is_a, which has namespace "relationship" (taken from the obo relation ontology). is_a is a builtin relation as far as obo is concerned, so it does not go in the gene_ontology namespace
acc character varying(255) UNIQUE NOT NULL The unique identifier for this term. This should be in OBO bipartite ID format, and should be unique within OBO, but this is not enforced at the schema level. (Example: GO:0019344) (OBO-Format: *id* tag)
is_obsolete integer NOT NULL equals 1 if this row corresponds to an obsoleted "ex-term". Note that obsoletes are not terms in the true sense, but we house them in the same table as this is the most expedient for the kinds of queries people wish to perform. (OBO-Format: *is_obsolete* tag) valid values: 0 or 1
is_root integer NOT NULL equals 1 if this term is the root term in the ontology graph. Note that in some instantiations of the GO database "fake" root nodes are added (OBO-Format: No correspoding tag)
is_relation integer NOT NULL equals 1 if this term is a relation (relationship type) (OBO-Format: Typedef stanzas)

Tables referencing this one via Foreign Key Constraints:

t1 name t2 term_type t3 acc t4 id, acc t5 id, name t6 id, term_type t7 id, acc, name, term_type

Index - Schema go_graph

Table: go_graph.term2term

Each entry in this table corresponds to an arc/edge in the ontology graph, which represents a relationship that holds between two entities in reality. Graphs are often thought of in terms of parent-child links; with this conception term1_id is the parent and term2_id is the child. However, it may be better to think of each edge as a *statement*, each statement being about a subject and it's relationship to some other entity. For example, a part_of edge between terms "nucleus" and "cell" is a statement about cell nuclei in general, namely that all nuclei are part_of some cell. (the statement is *not* a general statement about cells: not all cells have a nucleus) Here the "subject" of the statement corresponds to term2_id. EXAMPLE: if term1_id points to nucleic acid binding AND term2_id points to DNA binding AND relationship_type points to "is_a" THEN we have a statement "DNA binding is_a nucleic acid binding" (OBO-Format: *is_a* tag or *relationship* tag)

go_graph.term2term Structure
F-Key Name Type Description
id serial PRIMARY KEY
go_graph.term.id relationship_type_id integer UNIQUE#1 NOT NULL References an entry in the term table corresponding to the relation that holds between term2 and term1 (recall that the term table housed both the terms themselves, and the relations) (Example: a reference to a row "part_of" in the term table)
go_graph.term.id term1_id integer UNIQUE#1 NOT NULL the "parent" node of the edge. For example, in the edge corresponding to "nucleus part_of cell", (all nuclei are part_of some cell) term1_id is "cell"
go_graph.term.id term2_id integer UNIQUE#1 NOT NULL the "child" node of the edge. For example, in the edge corresponding to "nucleus part_of cell", (all nuclei are part_of some cell) term2_id is "nucleus"
complete integer UNIQUE#1 NOT NULL equals 1 if this edges comprises an element of the *complete definition*, a set of necessary and sufficient conditions. Note that this field is always =0 for current publically deployed instantiations of the GO database, but is currently used in experimental instantiations for housing so called "cross-products". Example: the term "cysteine metabolism" is completely defined by the edges "is_a metabolism" and "has_participant cysteine" - which is to say anything that satisfies these two conditions is by definition an instance of cysteine metabolism. "complete" links/edges may be provided in addition to normal links/edges, even though these may be partially redundant. Formally: if complete=0, the edge states a NECESSARY CONDITION for term2_id. The set of all edges with complete=1 for any term2_id states the NECESSARY AND SUFFICIENT CONDITIONS for that term. Although currently unused in publically deployed GO DB instances, software using the schema may wish to explicitly query for complete=0 to ensure "normal" graph links are returned as results.
tt1 term1_id tt2 term2_id tt3 term1_id, term2_id tt4 relationship_type_id

Index - Schema go_graph

Schema go_homology

Table: go_homology.gene_product_ancestor

connects two nodes in a phylogenetic tree

go_homology.gene_product_ancestor Structure
F-Key Name Type Description
go_associations.gene_product.id gene_product_id integer UNIQUE#1 NOT NULL
go_associations.gene_product.id ancestor_id integer UNIQUE#1 NOT NULL
go_homology.phylotree.id phylotree_id integer UNIQUE#1 NOT NULL
branch_length double precision
is_transitive integer NOT NULL

Index - Schema go_homology

Table: go_homology.gene_product_homology

go_homology.gene_product_homology Structure
F-Key Name Type Description
go_associations.gene_product.id gene_product1_id integer NOT NULL
go_associations.gene_product.id gene_product2_id integer NOT NULL
go_graph.term.id relationship_type_id integer NOT NULL References an entry in the term table corresponding to the relation that holds between 1 and 2

Index - Schema go_homology

Table: go_homology.gene_product_homolset

a set-member relation between a gene product and the homolset to which it belongs. the relation should, where possible, be supported by individual homology-based relations [TODO] REQUIRED FOR REFERENCE GENOMES PROJECT

go_homology.gene_product_homolset Structure
F-Key Name Type Description
id serial PRIMARY KEY
go_associations.gene_product.id gene_product_id integer NOT NULL
go_homology.homolset.id homolset_id integer NOT NULL

Index - Schema go_homology

Table: go_homology.gene_product_phylotree

a set-member relation between a gene product and the phylogenetic tree to which it belongs. This is a new table - this will be populated from PANTHER datasets

go_homology.gene_product_phylotree Structure
F-Key Name Type Description
id serial PRIMARY KEY
go_associations.gene_product.id gene_product_id integer NOT NULL
go_homology.phylotree.id phylotree_id integer NOT NULL

Index - Schema go_homology

Table: go_homology.homolset

A collection of genes or gene products from a common evolutionary lineage. The purpose of this table is to group homologous sets of gene products to query for shared and divergent biological function. The table is neutral with respect to the method used to determine evolutionary relations; instead it represents the derived results of some kind of some analysis. A homolset may also be derived from a collection of analyses. The table is also neutral w.r.t questions of homology or orthology REQUIRED FOR REFERENCE GENOMES PROJECT

go_homology.homolset Structure
F-Key Name Type Description
id serial PRIMARY KEY
symbol character varying(128) A convenient human-assigned label for the homology set; this may be arbitarily chosen from one of the set members (frequently the target_gene_product); or it may be the name of the gene family. no guarantee is given to its usefulness
go_general.dbxref.id dbxref_id integer UNIQUE A globally unique identifier for this set, or some proxy for the set. For example, if the purpose of building homolsets is to examine disease genes, then an OMIM ID may be appropriate to use here, even though OMIM is not concerned with homology or orthology. If the homolset is derived
go_associations.gene_product.id target_gene_product_id integer A homolset may be constructed from a collection of pairwise homology assignments between individual gene products and a (possibly arbitrary) "target" gene_product. this field is optional: for example, a target is not required for sets that are derived from a tree-based analysis
go_associations.species.id taxon_id integer The least common ancestor of all members of the homolset. (may not be populated)
go_graph.term.id type_id integer homolsets may fall into different categories - this field identifies the category. May not be populated
description text for example: if the purpose of the homolset is to examine disease genes and model organism orthologs, the description could be a summary of the disease in human

Tables referencing this one via Foreign Key Constraints:

Index - Schema go_homology

Table: go_homology.phylotree

A phylogenetic tree. This is a new table - this will be populated from PANTHER datasets. Additional rows or property tables may be added at a future date; e.g. to add bootstrap values and so on.

go_homology.phylotree Structure
F-Key Name Type Description
id serial PRIMARY KEY
name character varying(255) NOT NULL DEFAULT ''::character varying
go_general.dbxref.id dbxref_id integer NOT NULL A globally unique identifier for this tree. e.g. PantherDB:PTHR11361 make sure dbname is registered in GO.xrf_abbs

Tables referencing this one via Foreign Key Constraints:

Index - Schema go_homology

Schema go_meta

Table: go_meta.term2term_metadata

a metadata link between two terms this is primarily to support the "consider" and "replaced_by" tags in OBO Format 1.2. It could also be used for other metadata links we may want to include in the future. The main difference between term2term and this table is that term2term is for encoding the relationships that hold between types of biological entity, whereas this table is for relationships between the units in the ontology. Different rules apply to both. Eg consider/replaced_by would never propagate over the is_a relation open question: do we also want to include disjointness axioms here?

go_meta.term2term_metadata Structure
F-Key Name Type Description
id serial PRIMARY KEY
go_graph.term.id relationship_type_id integer NOT NULL "consider" or "replaced_by" may in future be extended to other tags
go_graph.term.id term1_id integer UNIQUE#1 NOT NULL the "parent" node of the edge. For example, in the edge corresponding to for "GO:0005696 telomere" consider "GO:0000781 chromosome, telomeric region" term1_id is "GO:0005696"
go_graph.term.id term2_id integer UNIQUE#1 NOT NULL the "child" node of the edge. For example, in the edge corresponding to for "GO:0005696 telomere" consider "GO:0000781 chromosome, telomeric region" term2_id is "GO:0000781"

Index - Schema go_meta

Table: go_meta.term_dbxref

linking table between term and dbxref used where there is some other information entity of relevance to the term in question; it may be a dbxref for a publication defining the term, or it may be a reference to an entity in another database, terminology or ontology-like system with similar or identical semantics to the term.

go_meta.term_dbxref Structure
F-Key Name Type Description
go_graph.term.id term_id integer UNIQUE#1 NOT NULL
go_general.dbxref.id dbxref_id integer UNIQUE#1 NOT NULL
is_for_definition integer UNIQUE#1 NOT NULL equals 1 if this dbxref references the source of the definition of the term in question. This includes but is not limited to PubMed IDs. It may also be so-called "GO Curator" references, of the form GOC:<curator-id>
tx0 term_id tx1 dbxref_id tx2 term_id, dbxref_id

Index - Schema go_meta

Table: go_meta.term_definition

In OBO, a term should where possible be defined. A term can have only one definition. Note: due to this cardinality constraint, it would be possible to merge term_definition into the *term* table. The decision was made not to do this early on, and the schema has not been changed since. (doc: http://www.geneontology.org/GO.format.obo-1_2.shtml) (OBO-Format: *def* tag; and also the *comment* tag - see column documentation)

go_meta.term_definition Structure
F-Key Name Type Description
go_graph.term.id term_id integer UNIQUE NOT NULL
term_definition text NOT NULL A textual definition for the term referenced in term_definition.term_id. The attentive user will have two criticisms; the first is that it is poor practice for a column to share a name with a table. The second is for more egregious: although this column is declared non-null, in fact some entries are in fact empty strings. This is due to the fact that term comments are housed in this table, rather than in the term table or their own term_comment table! We accept that this was a poor design decision. However, we have no plans to rectify this modeling error in the short term as this would break many pieces of 3rd party software not under our control. No fixes are likely until after MySQL has good efficient support for SQL Views.
go_general.dbxref.id dbxref_id integer SEVERELY DEPRECATED: see term_dbxref.is_for_definition
term_comment text A free-text comment with non-definitional information that may be useful for end-users or curators. (OBO-Format: *comment* tag - each term has max 1 comment) Please see notes for term_definition column
reference character varying(255) SEVERELY DEPRECATED
td1 term_id

Index - Schema go_meta

Table: go_meta.term_subset

(aka goslims). Each subsetdef (slim) is stored as a term in the database (with term_type = 'subset') The subset_id links to this term (OBO-Format: *subset* tag. term_id references a term housing the *subsetdef*)

go_meta.term_subset Structure
F-Key Name Type Description
go_graph.term.id term_id integer NOT NULL
go_graph.term.id subset_id integer NOT NULL
tss1 term_id tss2 subset_id tss3 term_id, subset_id

Index - Schema go_meta

Table: go_meta.term_synonym

In OBO, each term can have 0 or more synonyms or alternate identifiers. A synonym is an alternate label for a preferred term, intended for humans. An alternate identifer is an OBO ID intended for unique identification of the term in information systems. Note: OBO Format has the concept of broad and narrow synonyms - these might better be called broad and narrow aliases or alternate labels, since technically the definition of synonym is such that synonyms can replace words without changing the meaning.

go_meta.term_synonym Structure
F-Key Name Type Description
go_graph.term.id term_id integer UNIQUE#1 NOT NULL
term_synonym character varying(996) UNIQUE#1 A textual label typically intended for humans. (Example: "cysteine biosynthesis" is a synonym for the term named "cysteine biosynthetic process") One wrinkle is that alternate identifiers are redundantly stored in this column, as well as in the acc_synonym column. This design decision can be rightfully criticised, but we retain this use for software compatibility reasons. alt_ids can ve discriminated from synonyms using synonym_type_id - this will be "alt_id" for alternate identifiers. Note: although we accept it is bad practice to name columns the same as tables, we retain this for software compatibility reasons (OBO-Format: *synonym* tag)
acc_synonym character varying(255) An alternate identifier. (OBO-Format: *alt_id* tag) See also docs for term_synonym column
go_graph.term.id synonym_type_id integer NOT NULL actually corresponds to a synonym "scope" in OBO-Format - one of exact, broad, narrow, related for alternate identifiers we use the term "alt_id" - see notes above
go_graph.term.id synonym_category_id integer category/class to which this synonym belongs Correspinds to type OBO-Format 1.2 Not currently used in GO Note: the synonym "scope" (eg exact, narrow) goes in synonym_type_id
ts1 term_id ts2 term_synonym ts3 term_id, term_synonym

Index - Schema go_meta

Schema go_optimisations

Table: go_optimisations.gene_product_count

this table for use in "data warehouse mode" (will typically be populated in publically available instantiations of the GO database). caches recursive gene product counts the number of DISTINCT gene product records at OR BELOW a term filtered by evidence code refers to the evidence code used to filter this particular count; if preceded by a ! it means exclude this evidence code; typically this will be "!IEA" (ie exclude IEA) speciesdbname corresponds to the dbname from gene_product.dbxref_id and represents the authority that contributed the annotated gene product being counted the product count is partitioned by the speciesdbname product_count is the number of DISTINCT gene product IDs owned by speciesdbname at or below term_id in the DAG note that the product_count is additive across speciesdbnames (because no two speciesdbnames may contribute the same ID), but is NOT additive across evidence codes (the same gene product can be associated more than once beneath a term with different evidence codes) this makes filtering by evidence code hard - we must include all combinations which is a factorial!! to get round this we typically only include counts for non-IEA associations Equivalent to the query: SELECT path.term1_id AS term_id, count(DISTINCT a.gene_product_id) AS total FROM association AS a INNER JOIN evidence AS e ON (e.association_id=a.id) INNER JOIN graph_path AS path ON (path.term2_id=ae.term_id) WHERE <<evidence constraint here>>

go_optimisations.gene_product_count Structure
F-Key Name Type Description
go_graph.term.id term_id integer NOT NULL the term for which gene products are counted; also includes terms below this term in the graph
code character varying(8) evidence code over which this count hold. can include negation; eg "!IEA" note that not every combination will be pre-computed. typically just !IEA is stored. Also note that counts over evidence codes are NON-additive; this is because the same gene_product can be double-counted if it is anntated with >1 evidence code
speciesdbname character varying(55) this should match the gene_product.dbxref (Examples: FlyBase, SGD, MGI, UniProt) (docs: http://www.geneontology.org/cgi-bin/xrefs.cgi) counts *ARE* additive across speciesdbnames - this is because each gene_product record belongs to a single speciesdbname, so double counting is not possible (although the same gene product in reality may be double annotated, for example by UniProt and by a Model Organism Database - these should be filtered out, but even in cases where this filtering fails we do not worry about double-counting when summing across speciesdbnames as we are technically counting distinct gene product *records*) (Note: there is no foreign key reference to the *db* table, but their could in principle be a nullable link here)
go_associations.species.id species_id integer The species or taxon the count pertains to (Note: in future we reserve the option to use gene_product to represent families at higher levels in the taxonomic tree above species)
product_count integer NOT NULL total number of DISTINCT genes/gene product records annotated directly to or via transitivity to term_id
gpc1 term_id gpc2 code gpc3 speciesdbname gpc4 term_id, code, speciesdbname gpc5 term_id, species_id

Index - Schema go_optimisations

Table: go_optimisations.graph_path

A transitive or implied link between two terms Example: if nuclear chromosome is_a chromosome, AND chromosome is_a organelle, then nuclear chromosome is_a organelle this table states whether there exists a path between a parent and a child, and the distance between them. multiple paths mean multiple entries in this table [this table only used in "data warehouse mode"] an entry also exists linking every term with itself of distance 0. This is known as "reflexive transitive closure". [See http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?query=transitive+closure] [See also http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?query=reflexive+transitive+... note: use of this table is optional. but as most relational dbs don't implement recursive queries, you will have to incrementally calculate the transitive closure via multiple SQL calls if you do not use it for graph based queries At this time, this table holds the general transitive closure across *all* relations. In the future this table may house transitive closure on a per-relation basis, taking into account the various rules that follow from the definitions of the relation in question. For example, if X is_a Y and Y part_of Z then X part_of Z.

go_optimisations.graph_path Structure
F-Key Name Type Description
id serial PRIMARY KEY
go_graph.term.id term1_id integer NOT NULL the object node See docs for *term.term1_id*
go_graph.term.id term2_id integer NOT NULL the subject node See docs for *term.term2_id*
go_graph.term.id relationship_type_id integer References an entry in the term table corresponding to the INFERRED relation that holds between term2 and term1. For future extension. See: http://wiki.geneontology.org/index.php/Transitive_closure
distance integer The distance in terms of the number of "hops" between nodes in the asserted graph (term2term). The relationship_type_id is ignored here. Example: if A part_of B is_a C part_of D, then distance=3 for A part_of D
relation_distance integer (added 2008-10-27) The distance in terms of the number of "hops" over relationship_type_id in the asserted graph (term2term). Example: if A part_of B is_a C part_of D, then relation_distance=2 for A part_of D
graph_path1 term1_id graph_path2 term2_id graph_path3 term1_id, term2_id graph_path4 term1_id, distance graph_path5 term1_id, term2_id, relationship_type_id graph_path6 term1_id, term2_id, relationship_type_id, distance, relation_distance graph_path7 term2_id, relationship_type_id graph_path8 term1_id, relationship_type_id

Index - Schema go_optimisations

Schema go_sequence

Table: go_sequence.gene_product_seq

relationship between gene_product and seq is potentially many to many, although in practice each gene product may only have one seq, depending on data population method. If this link is for the representatibe sequence for a gene product, is_primary_seq should be set to true (=1) (a gene_product should only have one seq that is marked is_primary_seq, although this is not- enforced. the idea is that there may be different seqs for a product - allelic variations, variant spliceforms, but only one is the "representative" seq) This table is typically sourced from the gp2protein contributed data file: http://www.geneontology.org/gp2protein

go_sequence.gene_product_seq Structure
F-Key Name Type Description
go_associations.gene_product.id gene_product_id integer NOT NULL
go_sequence.seq.id seq_id integer NOT NULL
is_primary_seq integer If this link is for the representatibe sequence for a gene product, is_primary_seq should be set to true (=1)
gpseq1 gene_product_id gpseq2 seq_id gpseq3 seq_id, gene_product_id

Index - Schema go_sequence

Table: go_sequence.seq

A representative DNA or amino acid sequence for a gene_product; will typically be amino acid. This table is modeled after the BioPerl Bio::PrimarySeq model

go_sequence.seq Structure
F-Key Name Type Description
id serial PRIMARY KEY
display_id character varying(64) UNIQUE#1 the primary label used for identifying the sequence for humans. Not guaranteed to be globally unique. typically corresponds to the first part of a FASTA header
description character varying(255) textual information for humans concerning this sequence. typically corresponds to the part after the ID in the FASTA header
seq text residue sequence: standard IUPAC alphabetic codes are used
seq_len integer number of residues in sequence. should always correspond to length(seq), where seq is populated
md5checksum character varying(32) UNIQUE#1 result of md5(seq), where md5 is the standard MD5 checksum algorithm. see GO::Model::Seq for calculation almost 100% guaranteed to be unique for any sequence of symbols representing the biopolymer
moltype character varying(25) DNA or AA
timestamp integer

Tables referencing this one via Foreign Key Constraints:

seq1 display_id seq2 md5checksum

Index - Schema go_sequence

Table: go_sequence.seq_dbxref

linking table for external identifiers for a sequence seq_dbxref is derived from the dbxrefs in the source sequence file. For example, if the source is a UniProt file, this table will reflect the DR lines if the source is a FASTA file from UniProt, the DB:ACC parts of the fasta header will be used to populate this table typically a seq entry will be something like a protein/ polypeptide - ie something that can be annotated with GO; the dbxref could be for entities in other databases that may only be tangentially related to the protein for example: mRNA records, genomic records, OMIM IDs, etc

go_sequence.seq_dbxref Structure
F-Key Name Type Description
go_sequence.seq.id seq_id integer UNIQUE#1 NOT NULL
go_general.dbxref.id dbxref_id integer UNIQUE#1 NOT NULL
seqx0 seq_id seqx1 dbxref_id seqx2 seq_id, dbxref_id

Index - Schema go_sequence

Table: go_sequence.seq_property

seq can have various properties attached to it. not currently used

go_sequence.seq_property Structure
F-Key Name Type Description
id serial PRIMARY KEY
go_sequence.seq.id seq_id integer UNIQUE#1 NOT NULL
property_key character varying(64) UNIQUE#1 NOT NULL
property_val character varying(255) UNIQUE#1 NOT NULL
seqp0 seq_id seqp1 property_key seqp2 property_val

Index - Schema go_sequence