GO Enrichment Analysis

One of the main uses of the GO is to perform enrichment analysis on gene sets. For example, given a set of genes that are up-regulated under certain conditions, an enrichment analysis will find which GO terms are over-represented (or under-represented) using annotations for that gene set.

Enrichment analysis tool

Users can perform enrichment analyses directly from the home page of the GOC website. This service connects to the analysis tool from the PANTHER Classification System, which is maintained up to date with GO annotations. The PANTHER classification system is explained in great detail in Mi H et al, PMID: 23868073. The list of supported gene IDs is available from the PANTHER website.

Using the GO enrichment analysis tools

First, paste or type the names of the genes to be analyzed, one per row or separated by a comma. The tool can handle both MOD specific gene names and UniProt IDs (e.g. Rad54 or P38086). Second, select the name of the species (e.g. S. cerevisiae or H. sapiens) from the Species pull down and the ontology where you want to calculate the enrichment (biological process, molecular function, or cellular component). To minimize search time, the tool searches only one ontology at a time.

Interpreting the Results Table

The results page displays a table that lists significant shared GO terms (or parents of GO terms) used to describe the set of genes that users entered on the previous page, the background frequency, the sample frequency, Expected p-value, an indication of over/underrepresentation for each term, and p-value. In addition, the results page displays all the criteria used in the analysis. Any unresolved gene names will be listed on top of the table.

  • Background Frequency and Sample Frequency

Background frequency is the number of genes annotated to a GO term in the entire background set, while sample frequency is the number of genes annotated to that GO term in the input list. For example, if the input list contains 10 genes and the enrichment is done for biological process in S. cerevisiae whose background set contains 6442 genes, then if 5 out of the 10 input genes are annotated to the GO term: DNA repair, then the sample frequency for DNA repair will be 5/10. Whereas if there are 100 genes annotated to DNA repair in all of the S. cerevisiae genome, then the background frequency will be 100/6442.

  • Overrepresented or Underrepresented

The symbols + and - indicate over or underrepresentation of a term.

  • P-value

P-value is the probability or chance of seeing at least x number of genes out of the total n genes in the list annotated to a particular GO term, given the proportion of genes in the whole genome that are annotated to that GO Term. That is, the GO terms shared by the genes in the user's list are compared to the background distribution of annotation. The closer the p-value is to zero, the more significant the particular GO term associated with the group of genes is (i.e. the less likely the observed annotation of the particular GO term to a group of genes occurs by chance).

In other words, when searching the process ontology, if all of the genes in a group were associated with "DNA repair", this term would be significant. However, since all genes in the genome (with GO annotations) are indirectly associated with the top level term "biological_process", this would not be significant if all the genes in a group were associated with this very high level term.

External tools

There are a number of different tools that provide enrichment capabilities. Some of these are web-based, others may require the user download an application or install a local environment. Tools differ in the algorithms they use, and the statistical tests they perform.

Some other examples of enrichment tools are: