Elsevier

Methods in Enzymology

Volume 411, 2006, Pages 340-352
[18] Interpreting Experimental Results Using Gene Ontologies

https://doi.org/10.1016/S0076-6879(06)11018-6

Abstract

High‐throughput experimental techniques, such as microarrays, produce large amounts of data on gene expression levels. However, interpreting these data and turning them into biologically meaningful knowledge can be challenging. Frequently the output of such an analysis is a list of significant genes or a ranked list of genes. In the case of DNA microarray studies, data analysis often leads to lists of hundreds of differentially expressed genes. Likewise, clustering of gene expression data may lead to clusters of tens to hundreds of genes. These data are of little use unless the results can be interpreted in a biological context. The Gene Ontology Consortium provides a controlled vocabulary to annotate the biological knowledge that is known or predicted for a given gene. The Gene Ontologies (GOs) are organized as a hierarchy of annotation terms that facilitates analysis and interpretation at different levels. The top‐level ontologies are molecular function, biological process, and cellular component. Several annotation databases for genes of different organisms exist. This chapter describes how to use GO to help interpret biologically the lists of genes resulting from high‐throughput experiments. It describes some statistical methods for finding significantly over‐ or underrepresented GO terms within a list of genes and describes some tools for performing such an analysis. This chapter focuses primarily on the tool GOstat (http://gostat.wehi.edu.au). Other tools that enable similar analyses exist but are not described in detail here.

Introduction

DNA microarrays are a high‐throughput experimental technique used for gene expression analysis (Lockhart 1996, Schena 1995). Such high‐throughput techniques have greatly facilitated the discovery of new biological knowledge. However, this kind of knowledge is often difficult to grasp, and turning raw microarray data into biological understanding is by no means a simple task. Classical analysis of microarray data is based critically on comparing gene expression at the level of single genes and determining the significantly differentially expressed genes (Ayroles 2006, Beissbarth 2000, Downey 2006, Saeed 2006, Smyth 2004) or on clustering techniques used to determine genes with a similar expression pattern in a series of experiments (Eisen et al., 2000; Gollub 2006, Saeed 2006). In other popular techniques, such as serial analysis of gene expression (SAGE), the focus is often just to find the transcripts expressed in a given tissue (Beissbarth et al., 2004). In each case, results can be expressed as a list of genes or as a sorted list of all genes on the array ranked by a score (see Fig. 1).

In the more recent literature, many groups have proposed methods that use prior biological knowledge to help interpret these lists of genes (e.g., Al‐Shahrour 2004, Beissbarth 2004, Dennis 2003, Doniger 2003, Yue 2005). In connection with the application of clustering techniques to microarray data, the term “guilt by association” has been used, indicating that prior biological knowledge may make it possible to infer the function of coexpressed genes (Clare 2002, Quackenbush 2003). Furthermore, with a single experiment producing a list of hundreds of differentially expressed genes, automatic annotation and grouping of genes are necessary merely to make interpretation of the results possible (Beissbarth 2003, Boon 2004). In addition, grouping genes and then testing these groups for differential expression may help find significant biological processes even in cases where single gene analysis fails to produce significant results.

The main problem is how to incorporate prior biological knowledge and where to obtain it. The most obvious kind of knowledge to incorporate is that of biological pathways. However, the pathway information stored in databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa 2000, Mao 2005) is quite sparse. Other sources of information include Swiss‐Prot keywords, the presence of transcription factor‐binding sites, protein domains, etc. Further, it is important to organize this information in a structured way. Ontologies are a commonly used concept in computer science (Draghici et al., 2003). They define the terms of a controlled vocabulary as well as the relations between those terms. They can also be viewed as a directed acyclic graph, which defines a hierarchy of terms. The Gene Ontology Consortium (http://www.geneontology.org) is a worldwide consortium that defines an ontology that can be used to annotate genes (Ashburner et al., 2000). Furthermore, there are a number of Gene Ontology (GO) databases in which annotations for the genes of one or several organisms are stored (e.g., Blake 2003, Camon 2004). The hierarchical structure of GO annotations is illustrated in Fig. 2.
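The view of an ontology as a directed acyclic graph can be made concrete with a small sketch. If each gene annotated to a term is also counted for all of that term's ancestors (the usual convention for GO), every term defines a set of genes. The term names, gene names, and the helper functions `ancestors` and `gene_sets` below are hypothetical and chosen only for illustration; they are not taken from GO or from any particular tool.

```python
from collections import defaultdict

# Hypothetical toy ontology: each term maps to its parent terms.
# A term may have several parents, so this is a DAG rather than a tree.
PARENTS = {
    "root": [],
    "process_A": ["root"],
    "process_B": ["root"],
    "process_A1": ["process_A"],
    "process_AB": ["process_A", "process_B"],
}

# Hypothetical direct annotations: gene -> most specific terms assigned to it.
DIRECT = {
    "gene1": ["process_A1"],
    "gene2": ["process_AB"],
    "gene3": ["process_B"],
}


def ancestors(term, parents):
    """Return the term itself plus every term reachable via parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def gene_sets(direct, parents):
    """Propagate annotations upward so that each term gets one gene set."""
    sets = defaultdict(set)
    for gene, terms in direct.items():
        for term in terms:
            for t in ancestors(term, parents):
                sets[t].add(gene)
    return dict(sets)


if __name__ == "__main__":
    for term, genes in sorted(gene_sets(DIRECT, PARENTS).items()):
        print(term, sorted(genes))
```

General terms near the root collect many genes, while specific terms collect few, which is what makes analysis at different levels of the hierarchy possible.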

A biological annotation can be expressed as a gene set. Gene Ontology annotations can be cut at different levels, allowing the definition of a gene set for each GO term and leading to a hierarchy of gene sets. The subsequent statistical test for whether a certain biological function (or gene set) is associated with the experimental outcome of a microarray is often referred to as gene set enrichment analysis (Lamb 2003, Mootha 2003), and analysis with GO groups has become widely used for the analysis of microarray data (also see Gollub 2006, Hennetin 2006, Whetzel 2006). This analysis tests for gene sets (or functional groups) that are represented significantly more frequently in a list of genes that appears interesting in a microarray experiment than in a comparable list of genes selected randomly from all genes on the microarray. This then allows functions to be associated with a list of genes.
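One common way to make the comparison with a randomly selected list concrete is the upper tail of the hypergeometric distribution, which corresponds to a one‐sided Fisher's exact test. The following is a minimal sketch under that assumption, not the implementation used by any particular tool; the function name `enrichment_p_value` and the counts in the usage example are hypothetical. It requires Python 3.8 or later for `math.comb`.

```python
from math import comb


def enrichment_p_value(n_total, n_in_set, n_selected, n_overlap):
    """Probability of seeing at least n_overlap gene-set members when
    n_selected genes are drawn at random, without replacement, from
    n_total genes of which n_in_set belong to the gene set."""
    upper = min(n_in_set, n_selected)  # largest possible overlap
    tail = sum(
        comb(n_in_set, k) * comb(n_total - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_total, n_selected)


if __name__ == "__main__":
    # Hypothetical counts: 10 genes on the array, 3 of them in the gene set,
    # 4 genes selected as interesting, 2 of the selected genes in the set.
    print(enrichment_p_value(n_total=10, n_in_set=3, n_selected=4, n_overlap=2))
```

A small value indicates that the gene set is overrepresented in the selected list relative to chance. Because many GO terms are usually tested at once, the resulting p values are typically corrected for multiple testing before interpretation.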

Find Statistically Overrepresented GO Terms within a Group of Genes

The most commonly used methods to test for enrichment of a gene set in a list of genes selected from an experiment are based on the hypergeometric distribution and use either Fisher's exact test or the χ² test. These methods work in a similar way: a list of genes is first selected from a microarray, for example, by choosing all significantly differentially expressed genes using a cutoff value. The test for enrichment involves counting how many genes in the gene set occur in the list of selected

Description

The program GOstat helps in the analysis of lists of genes: it computes statistics about the GO terms contained in the data and sorts the GO annotations, giving the most representative GO terms first. The program also allows the computation of statistics for GO terms based on ranked gene lists from microarrays. It is available via the web site http://gostat.wehi.edu.au. Figure 4 shows a view of the GOstat input form and an example of a GOstat output.

Using GOstat

The program requires as input a list of gene

Visualization and Further Analysis

GOstat does not yet provide any tools for visualization of the results, because creating a good visualization requires interpretation of the results. Determining the significant GO terms in a list of genes is only a first step. The p values provide an indication of which processes or functions might be relevant. However, these results must be scrutinized further to determine whether they are reasonable. Careful evaluation of the results is necessary in order to

Discussion

There are many other tools besides GOstat that can help scrutinize the gene ontology structure and annotation to get more out of your microarray data. A more or less complete list of these is provided at http://www.geneontology.org/GO.tools.shtml. Tools that offer more options to display and analyze the graph structure of the annotations, as well as more automated use of such analyses, are the Bioconductor (Reimers and Carey, 2006) package GOstats (http://www.bioconductor.org) and the

Acknowledgments

Thanks to Terry Speed, Hamish Scott, and Annemarie Poustka for giving the author the opportunity to develop and maintain GOstat. Thanks to Nick Tan and Dirk Ledwinka for IT support. Thanks to James Wettenhall, Lavinia Hyde, and Matthew Ritchie for proofreading and help with the manuscript. Tim Beissbarth and GOstat were supported by the Deutsche Forschungsgemeinschaft, WEHI, NHMRC, and NGFN.

References (40)

  • F. Al‐Shahrour et al. FatiGO: A web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics (2004)
  • M. Ashburner et al. Gene ontology: Tool for the unification of biology. Nature Genet. (2000)
  • T. Beissbarth et al. Analysis of CREM‐dependent gene expression during mouse spermatogenesis. Mol. Cell. Endocrinol. (2003)
  • T. Beissbarth et al. Processing and quality control of DNA array hybridization data. Bioinformatics (2000)
  • T. Beissbarth et al. Statistical modeling of sequencing errors in SAGE libraries. Bioinformatics (2004)
  • T. Beissbarth et al. GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics (2004)
  • Y. Benjamini et al. Controlling the false discovery rate: A practical and powerful approach to multiple testing. JRSS‐B (1995)
  • Y. Benjamini et al. The control of the false discovery rate in multiple testing under dependency. Ann. Stat. (2001)
  • J.A. Blake et al. MGD: The mouse genome database. Nucleic Acids Res. (2003)
  • M. Boguski et al. Establishing a human transcript map. Nature Genet. (1995)