ISA: Inferred from Sequence Alignment

  • Sequence similarity with experimentally characterized gene products, as determined by alignments, either pairwise or multiple (tools such as BLAST, ClustalW, MUSCLE).
  • An entry in the with field is mandatory.

The ISA code is a sub-category of the ISS code. It should be used whenever a sequence alignment is the basis for an annotation, but only when the alignment and the choice of GO term have been manually reviewed: by a curator, or, if the information comes from a published paper, by the authors of that paper. Such alignments may be pairwise alignments (the alignment of two sequences to one another) or multiple alignments (the alignment of three or more sequences to one another). BLAST produces pairwise alignments, and any annotation based solely on the evaluation of BLAST results should use this code.

GO policy states that in order to assert that a query protein has the same function as a match protein, the match protein MUST be experimentally characterized. This prevents transitive annotation errors. A transitive annotation error occurs when a protein gets its annotation by virtue of a match to an uncharacterized protein that may itself have gotten its annotation from yet another uncharacterized protein, and so on. With the high number of genome sequences currently in the public databases, the risk of transitive annotation errors is high. However, by requiring that every alignment used for a GO annotation contain an experimentally characterized protein, transitive annotation errors can be significantly reduced.
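The requirement that a match protein be experimentally characterized can be illustrated with a small filter. This is a minimal sketch, not a GO Consortium tool: the function name, the accessions, and the input mapping are hypothetical, and the set of experimental evidence codes is an assumed subset that should be checked against the current GO evidence code list.

```python
# Assumed subset of experimental evidence codes; verify against the
# current GO evidence code documentation before relying on it.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def characterized_matches(matches, evidence_by_accession):
    """Keep only matches that carry at least one experimental annotation.

    matches: accessions of proteins that aligned to the query.
    evidence_by_accession: accession -> evidence codes already attached
    to that protein (hypothetical input; it would come from an
    annotation file in practice).
    """
    return [
        acc
        for acc in matches
        if EXPERIMENTAL_CODES & set(evidence_by_accession.get(acc, ()))
    ]


# Hypothetical accessions: P2 is annotated only by similarity and P4 has
# no annotations, so neither may serve as the basis for an ISA annotation.
evidence = {"P1": ["IDA"], "P2": ["ISS"], "P3": ["IEA", "IMP"]}
usable = characterized_matches(["P1", "P2", "P3", "P4"], evidence)
print(usable)
```

Dropping P2 here is what breaks the chain: an annotation inferred from P2 would rest on another inference rather than on an experiment.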

The process of evaluating a sequence alignment involves checking that the length of the matching region and the percent identity with the matching sequence are sufficient to infer shared function. Residues or secondary structures that are important for function should be conserved. The guiding principle in making sequence similarity based annotations should be that there is a good reason to believe that the comparison is relevant. This evaluation may be carried out by the curator, when sequence analysis is performed by the curators, or by authors of a published paper, when the curator is making annotations based on literature. In literature-based annotation it is incumbent upon the curator to identify which of the proteins in the sequence analysis are experimentally characterized so as to populate the with field.
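The quantities a reviewer looks at can be computed directly from a gapped pairwise alignment. The sketch below is illustrative only: the function names and the example sequences are made up, coverage is measured against the residues present in the aligned strings (so for a local alignment the full sequence lengths would be needed instead), and no cutoff is applied because interpreting the numbers is left to the annotating group.

```python
def alignment_metrics(aligned_query, aligned_match, gap="-"):
    """Percent identity and coverage for two gapped, equal-length strings."""
    if len(aligned_query) != len(aligned_match):
        raise ValueError("aligned sequences must have equal length")
    aligned_cols = 0  # columns where neither sequence has a gap
    identical = 0     # aligned columns with the same residue
    for q, m in zip(aligned_query, aligned_match):
        if q == gap or m == gap:
            continue
        aligned_cols += 1
        if q.upper() == m.upper():
            identical += 1
    query_len = sum(1 for c in aligned_query if c != gap)
    match_len = sum(1 for c in aligned_match if c != gap)
    return {
        "percent_identity": 100.0 * identical / aligned_cols if aligned_cols else 0.0,
        "query_coverage": 100.0 * aligned_cols / query_len if query_len else 0.0,
        "match_coverage": 100.0 * aligned_cols / match_len if match_len else 0.0,
    }


def residues_conserved(aligned_query, aligned_match, match_positions, gap="-"):
    """True if the query has the same residue at each listed match position.

    match_positions are 1-based positions in the ungapped match sequence,
    e.g. catalytic or binding residues reported for the characterized protein.
    """
    wanted = set(match_positions)
    pos = 0
    for q, m in zip(aligned_query, aligned_match):
        if m == gap:
            continue
        pos += 1
        if pos in wanted and q.upper() != m.upper():
            return False
    return True


# Made-up sequences for illustration only.
query = "MKT-AYIAK"
match = "MKTWAYLAK"
metrics = alignment_metrics(query, match)
print(metrics)
print(residues_conserved(query, match, [1, 5]))
```

A query gap opposite a functionally important residue counts as not conserved, which matches the intent that such residues be present in the query.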

A note about when to use ISO (Inferred from Sequence Orthology) instead of ISA: if the experimentally characterized match protein is known to be the functional ortholog of the query protein, then the code ISO may be used (see the ISO section below). Orthologs are generally determined from phylogenetic analysis using algorithms such as maximum likelihood or neighbor joining. The presumption is that orthologs often have the same or similar biological function and/or engage in the same or similar biological processes. It can sometimes be difficult to determine whether proteins are orthologs of each other, but if one is confident of orthology, the orthology-specific code should be used.

Note that we have not set definitive numerical cutoffs for the extent or percentage identity of sequence similarity comparisons because groups annotating very different organisms from the current MODs / reference genomes may find that a given arbitrarily selected numerical cutoff does not work when applied to a new organism. It is up to each annotating group to use judgment as to what sequence similarity comparisons are relevant for the purpose of making GO annotations.

It is mandatory to make an entry in the with column when using ISA. The entry in with is the accession number of the experimentally characterized sequence(s) that match the query sequence. Multiple entries in the with field should be separated by pipes. Annotations made with ISA without an entry in the with field will be filtered out by the Annotation File Format Quality Control script, which is run monthly.
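The rule above can be checked before submission. The following is a minimal sketch and not the Consortium's Quality Control script: it assumes a tab-delimited GAF file in which the evidence code is column 7 and the with field is column 8, and the function names, the helper row, and the identifiers ExampleDB, X0001, geneX, GO:0000000, and UniProtKB:P00000 are placeholders.

```python
EVIDENCE_COL = 6  # GAF column 7 as a 0-based index (assumed layout)
WITH_COL = 7      # GAF column 8 as a 0-based index (assumed layout)


def split_with(with_field):
    """Split a pipe-separated with field into its non-empty entries."""
    return [entry for entry in with_field.split("|") if entry.strip()]


def isa_missing_with(gaf_line):
    """True if the line is an ISA annotation with an empty with field."""
    if not gaf_line.strip() or gaf_line.startswith("!"):
        return False  # blank lines and header comments are not annotations
    cols = gaf_line.rstrip("\n").split("\t")
    if len(cols) <= WITH_COL:
        return False  # too short to judge; leave to a format validator
    return cols[EVIDENCE_COL] == "ISA" and not split_with(cols[WITH_COL])


def make_row(evidence, with_field):
    """Build a truncated, hypothetical annotation row for the demo."""
    cols = ["ExampleDB", "X0001", "geneX", "", "GO:0000000",
            "GO_REF:0000012", evidence, with_field, "F"]
    return "\t".join(cols)


good = make_row("ISA", "UniProtKB:O00217|UniProtKB:P00000")
bad = make_row("ISA", "")
print(isa_missing_with(good), isa_missing_with(bad))
```

Running a check like this locally lets a group catch ISA lines that would otherwise be dropped by the monthly filter.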

If the generation and evaluation of the alignment was described in a published paper and then curated by a GO annotator, a reference to the paper should be placed in the reference column. However, if the same group that is doing the GO annotation performed the generation and evaluation of the alignment, then the reference column should cite a reference that describes the methodology used. If there is no publication for this methodology, a reference can be used from the GO Consortium's collection of GO references; if there is nothing appropriate in this set, the annotating group should write a description of the methods of data collection and evaluation used and submit it to the GO Consortium. This will be added to the reference collection and will receive a GO_REF accession number for use in annotations.

Examples of when to use ISA:

  • A curator generates a pairwise alignment between a query Haemophilus influenzae protein that they are trying to annotate and a Vibrio marinus protein. The curator sees that the Vibrio protein is experimentally characterized. The curator evaluates the alignment and sees that the two proteins match over nearly their entire lengths at 68% identity. Furthermore, after reading information on the characterized Vibrio protein, the curator looks for the important residues needed for catalysis and binding in the Vibrio protein and finds that they are conserved in the Haemophilus protein. The curator reads the available literature on the Vibrio protein to determine what is known about that protein. The curator can then assign GO terms to the Haemophilus protein based on what has been experimentally determined in the Vibrio protein. The code for this annotation is ISA, and the accession number of the Vibrio protein should be placed in the with field. If the process used by the curator for evaluation of the sequence alignments is not in a published paper, the curator should refer to a GO standard reference, for example GO_REF:0000012.
  • A curator performs sequence similarity analysis on a group of genes (e.g. alignments of the human NDUFS8 gene (UniProtKB accession: O00217) with several other genes) and identifies several genes with very high sequence identity to the experimentally characterized human NDUFS8 gene: orangutan and chimpanzee (both 100% sequence identity), crab-eating macaque (95% identity), and gorilla (92% identity). The curator judges that these high-identity matches to the human sequence mean that all of the proteins have a similar function. Annotations are therefore made for the related genes in orangutan (UniProt:Q5RC7), macaque (UniProt:Q60HE3), chimpanzee (UniProt:Q0MQI3), and gorilla (UniProt:Q0MQI2) by ISA with the experimentally characterized human NDUFS8 protein, and the accession number of the human NDUFS8 gene is included in the with column for each of these annotations. As there is no published paper describing this sequence analysis, the ID of the GO_REF (e.g. GO_REF:0000024) that describes the process the curator carried out to make this judgment is placed in the REF_DB_ID field.
  • PMID:2165073 identifies a new gene, AAC3, that is similar to two known genes of the same species (S. cerevisiae) based on Southern hybridization. Cloning and sequencing of the new AAC3 gene indicates that it is similar to the previously characterized ADP/ATP translocators AAC1 and PET9. For the AAC3 gene, an annotation may be made to the function term ATP:ADP antiporter activity using the evidence code ISA; the reference is the paper that performed the analysis, and the accession numbers of the experimentally characterized genes with which AAC3 was aligned (AAC1 and PET9) should be placed in the with field.
  • PMID:12507466 describes a set of proteins containing both experimentally confirmed and predicted N-terminal acetyltransferases (NATs) that were collected and assigned to orthologous groups based on phylogenetic analysis. Three of the groups, Ard1, Mak3, and Nat3, were each named after the well-characterized S. cerevisiae gene that is a member of the group. In addition, a previously unknown group with unknown substrate specificity was identified and called Nat5, after the S. cerevisiae member of the group. About the Nat5 family, the authors make this statement: "Nat5p represents a family of the putative NATs with orthlogous proteins identified in yeast, S. pombe, C. elegans, D. melanogaster, A. thaliana and H. sapiens. The finding of this new family is only based on sequence similarity of Nat5p (YOR253Wp) to other NATs. Our attempts to detect any Nat5p substrates in yeast by 2D-gel electrophoresis has been so far unsuccessful, but this may reflect the rarity of the substrates in vivo or that Nat5p is acting on the smaller polypetides with mobility parameters undetectable by our regular 2D-gel procedure." As a protein with sequence similarity to other NATs, NAT5 may be annotated to the function term peptide alpha-N-acetyltransferase activity. Although this paper clearly discusses orthology relationships, the evidence code for this NAT5 annotation is ISA, because the annotation is not based on the orthology relationship but merely on similarity with the other experimentally characterized NATs in yeast: MAK3, ARD1, and NAT3. The accession numbers of these three genes should be placed in the with field. The reference is the paper that performed the analysis. Note that this paper may also be used for annotations using the ISO code when the annotation is based on the orthology relationships described in the paper.