ISM: Inferred from Sequence Model

  • Prediction methods for non-coding RNA genes such as tRNAscan-SE, Snoscan, and Rfam
  • Predicted presence of recognized functional domains or membership in protein families, as determined by tools such as profile Hidden Markov Models (HMMs), including Pfam and TIGRFAM
  • Predicted protein features using tools such as TMHMM (transmembrane regions), SignalP (signal peptides on secreted proteins), and TargetP (subcellular localization)
  • Any other kind of domain modeling tool or collections of them such as SMART, PROSITE, PANTHER, InterPro, etc.
  • An entry in the with field is required when the model used is an object with an accession number (as found with Pfam, TIGRFAM, InterPro, PROSITE, Rfam, etc.). The with field may be left blank for tools such as tRNAscan and Snoscan, where there is no object with an accession to point to.

The ISM code is a sub-category of the ISS code. Use ISM whenever evidence from a statistical model of a sequence or group of sequences is used to predict the function of a protein or RNA. Searches with these modeling tools generally return statistical scores (such as E-values and cutoff scores) that help curators decide whether a result is significant enough to warrant an annotation. If an annotator manually checks these scores, confirms that the result makes sense in the context of other information known about the sequence, and decides that the evidence warrants a particular annotation, then the evidence code is ISM. However, if a tool that looks only at the scores makes annotations automatically and there is no manual review, the evidence code should be IEA.
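The ISM versus IEA decision described above can be sketched in Python. This is a minimal illustration, not part of any GO tool: the function name, its parameters, and the convention that a higher score is better (as with HMM bit scores compared against a trusted cutoff) are all assumptions made for the example.

```python
from typing import Optional


def choose_evidence_code(
    score: float,
    cutoff: float,
    manually_reviewed: bool,
    curator_accepts: bool = False,
) -> Optional[str]:
    """Return "ISM", "IEA", or None when no annotation should be made.

    Hypothetical helper. Assumes a bit-score-like value where higher is
    better; for E-values the comparison would be reversed.
    """
    if score < cutoff:
        # The hit is not significant enough to support any annotation.
        return None
    if manually_reviewed:
        # A curator checked the scores and the biological context.
        return "ISM" if curator_accepts else None
    # A tool annotated from the scores alone, with no manual review.
    return "IEA"


print(choose_evidence_code(310.5, 25.0, manually_reviewed=True, curator_accepts=True))
print(choose_evidence_code(310.5, 25.0, manually_reviewed=False))
```

The first call models a curator accepting a strong hit and the second models a fully automatic pipeline.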

It is important to note that some models are more functionally specific than others. This is seen particularly in the profile HMMs and, to a lesser extent, in PROSITE motifs. Some HMMs are built so that all of the proteins used to build the model, and all of the proteins that score well against it, have exactly the same function. These models can therefore be used to predict precise functions in matching proteins. Other models are built to reflect the sequence shared among members of superfamilies or subfamilies. These can be used to predict varying levels of functional specificity and often support only very general annotations, such as identifying a protein as an oxidoreductase. Finally, many models predict the presence of particular domains in a protein, which may or may not provide information on the protein's function. For example, the CUB domain is found in a functionally diverse set of proteins, so its presence alone does not support a function annotation. It is therefore very important during manual annotation to assess what can safely be concluded from a match to any given model.

Some of the sequence-based modeling techniques result in models specific to individual sequence families. The profile HMMs, PROSITE motifs, and InterPro are in this group. In such cases, the with field should be populated with the accession number of the model specific to the functional domain or protein in question. Other sequence-based modeling techniques, such as tRNAscan and Snoscan, predict a set of sequences within a particular class (e.g. tRNAs, snoRNAs), and there is no specific model to link to each ncRNA. In these cases the with field may be left blank.
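The with field rule can be expressed as a small check. The sketch below is illustrative only: the function and set names are hypothetical, and the "source:accession" format simply mirrors the Pfam:PF05426 form used elsewhere in this document. The database prefixes actually registered with GO may differ from the tool names used here.

```python
from typing import Optional

# Sources named in the text whose models carry accession numbers.
MODEL_SOURCES = {"Pfam", "TIGRFAM", "InterPro", "PROSITE", "Rfam"}
# Tools named in the text that predict a whole class of ncRNA genes.
CLASS_PREDICTORS = {"tRNAscan", "Snoscan"}


def build_with_field(source: str, accession: Optional[str] = None) -> str:
    """Return the with field value for an ISM annotation (hypothetical helper)."""
    if source in MODEL_SOURCES:
        if not accession:
            raise ValueError(f"{source} models have accessions; one is required")
        return f"{source}:{accession}"
    if source in CLASS_PREDICTORS:
        # No per-gene model exists to point to, so the field stays blank.
        return ""
    raise ValueError(f"unrecognized source: {source}")


print(build_with_field("Pfam", "PF05426"))
print(repr(build_with_field("tRNAscan")))
```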

If the search for, and evaluation of, the sequence-based model data was described in a published paper, a reference to the paper should be placed in the reference column. However, if the search for and evaluation of the data was performed by the same group that is doing the GO annotation, then a reference that describes the methodology used should be placed in the reference column. If there is no publication for this methodology, a reference can be used from the GO Consortium's collection of GO references. If nothing in this set is appropriate, the annotating group should write a description of the methods of data collection and evaluation used and submit it to the GO Consortium. This description will be added to the reference collection and will receive a GO_REF accession number for use in annotations.
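The order of preference for the reference column can be sketched as a fallback chain. The function and parameter names below are hypothetical, and the example identifiers are the ones that appear in this document.

```python
from typing import Optional


def choose_reference(
    analysis_paper: Optional[str] = None,
    methods_paper: Optional[str] = None,
    go_ref: Optional[str] = None,
) -> str:
    """Pick the reference for an ISM annotation (hypothetical helper).

    analysis_paper: paper that describes the search and its evaluation.
    methods_paper:  the annotating group's published methodology.
    go_ref:         a suitable entry from the GO reference collection.
    """
    for candidate in (analysis_paper, methods_paper, go_ref):
        if candidate:
            return candidate
    # Nothing suitable exists yet: a methods description has to be
    # submitted to the GO Consortium to obtain a new GO_REF accession.
    raise LookupError("no reference available; submit a methods description")


print(choose_reference(analysis_paper="PMID:10024243"))
print(choose_reference(go_ref="GO_REF:0000011"))
```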

Examples of when to use ISM

  • A curator performs an HMM search for a query protein. The query protein scores above the trusted cutoff against the HMM PF05426 alginate lyase, which describes a family of alginate lyases. The curator reviews all documentation associated with the HMM to determine its functional specificity, or lack thereof, and reviews the scores that the query protein received. If the curator is then confident that the query protein is indeed an alginate lyase, the appropriate annotations should be made using ISM as the evidence code and putting Pfam:PF05426 in the with column. Since this search and evaluation were performed by the curator, a GO standard reference should be used to describe the search and evaluation methods (e.g. GO_REF:0000011).
  • A paper describes using PROSITE searches with the protein of interest and concludes that the protein has a particular binding activity based on a match to a particular PROSITE motif. The curator would make the appropriate GO annotations, using ISM as the evidence code, putting the accession number of the PROSITE motif that provided the evidence in the with column, and the PMID of the paper that described the work in the reference column.
  • A curator runs the program tRNAscan (Lowe, T.M. and Eddy, S.R. NAR, 1997) on a newly sequenced bacterial genome to find the tRNAs. tRNAscan produces a list of the tRNA genes contained within that genome. The curator checks the results of the analysis to make sure that the predictions make sense and are consistent with what is known about the organism. Each of these genes is given appropriate annotations for a tRNA. The evidence code is ISM, and a reference describing the process the curator used (either a published paper or a GO standard reference) should be placed in the reference column. The with column may be left blank.
  • PMID:10024243 describes the use of a probabilistic model to predict snoRNA genes in yeast. Each of these genes may be given appropriate annotations for a snoRNA. The evidence code is ISM, and the reference is the paper describing the work. The with column may be left blank.
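For comparison, the four examples above can be written out as simple records. This is only an illustrative sketch: the IsmAnnotation class and its field names are hypothetical and are not a GO file format. A value of None marks a detail the example text does not give (such as the specific PROSITE accession), which the curator would need to supply. An empty string marks a with column that is intentionally left blank.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class IsmAnnotation:
    """Hypothetical record holding the columns discussed in the examples."""
    description: str
    with_field: Optional[str]  # "" = intentionally blank, None = not given
    reference: Optional[str]   # None = not given in the example text
    evidence: str = "ISM"


EXAMPLES: List[IsmAnnotation] = [
    IsmAnnotation("curator HMM search, alginate lyase", "Pfam:PF05426", "GO_REF:0000011"),
    IsmAnnotation("published PROSITE motif match", None, None),
    IsmAnnotation("curator-reviewed tRNAscan run", "", None),
    IsmAnnotation("published snoRNA gene model", "", "PMID:10024243"),
]


def missing_fields(annotation: IsmAnnotation) -> List[str]:
    """List the columns a curator still has to fill in."""
    return [
        name
        for name in ("with_field", "reference")
        if getattr(annotation, name) is None
    ]


for example in EXAMPLES:
    print(example.evidence, example.description, missing_fields(example))
```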