[18] Interpreting Experimental Results Using Gene Ontologies
Introduction
DNA microarrays are a high‐throughput experimental technique used for gene expression analysis (Lockhart 1996, Schena 1995). Such high‐throughput techniques have greatly facilitated the discovery of new biological knowledge. However, this kind of knowledge is often difficult to grasp, and turning raw microarray data into biological understanding is by no means a simple task. Classical analysis of microarray data relies either on comparing expression at the level of single genes to identify the significantly differentially expressed genes (Ayroles 2006, Beissbarth 2000, Downey 2006, Saeed 2006, Smyth 2004) or on clustering techniques that group genes with similar expression patterns across a series of experiments (Eisen et al., 2000; Gollub 2006, Saeed 2006). In other popular techniques, such as serial analysis of gene expression (SAGE), the focus is often simply on finding the transcripts expressed in a given tissue (Beissbarth et al., 2004). In each case, results can be expressed as a list of genes or as a sorted list of all genes on the array ranked by a score (see Fig. 1).
In the more recent literature, many groups have proposed methods that use prior biological knowledge to help interpret these lists of genes (e.g., Al‐Shahrour 2004, Beissbarth 2004, Dennis 2003, Doniger 2003, Yue 2005). In connection with the application of clustering techniques to microarray data, the term “guilt by association” has been used, indicating that prior biological knowledge may make it possible to infer the function of coexpressed genes (Clare 2002, Quackenbush 2003). Furthermore, because a single experiment can produce a list of hundreds of differentially expressed genes, automatic annotation and grouping of genes are necessary merely to make interpretation of the results possible (Beissbarth 2003, Boon 2004). In addition, grouping genes and then testing these groups for differential expression may reveal significant biological processes even in cases where single‐gene analysis fails to produce significant results.
The main problem is how to incorporate prior biological knowledge and where to obtain it. The type of knowledge that first comes to mind as useful is that of biological pathways. However, the information stored in pathway databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa 2000, Mao 2005) is quite sparse. Other sources of information include SWISS‐Prot keywords, the presence of transcription factor‐binding sites, protein domains, etc. Further, it is important to organize this information in a structured way. Ontologies are a commonly used concept in computer science (Draghici et al., 2003). They define terms in a controlled vocabulary as well as the relations between those terms. They can also be viewed as a directed acyclic graph, which defines a hierarchy of terms. The Gene Ontology Consortium (http://www.geneontology.org) is a worldwide consortium that defines an ontology that can be used to annotate genes (Ashburner et al., 2000). Furthermore, there are a number of Gene Ontology (GO) databases in which annotations for the genes of one or several organisms are stored (e.g., Blake 2003, Camon 2004). The hierarchical structure of GO annotations is illustrated in Fig. 2.
A biological annotation can be expressed as a gene set. GO annotations can be cut at different levels, allowing the definition of a gene set for each GO term and leading to a hierarchy of gene sets. The subsequent statistical test for whether a certain biological function (or gene set) is associated with the experimental outcome of a microarray is often referred to as gene set enrichment analysis (Lamb 2003, Mootha 2003), and analysis with GO groups has become widely used for microarray data (also see Gollub 2006, Hennetin 2006, Whetzel 2006). This analysis tests for gene sets (or functional groups) that are represented significantly more frequently in a list of genes that appears interesting in a microarray experiment than in a comparable list of genes selected randomly from all genes on the microarray. This in turn allows functions to be associated with a list of genes.
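The idea of a hierarchy of gene sets can be illustrated with a minimal sketch. It assumes a toy ontology stored as a child-to-parents mapping and a handful of direct annotations; all term identifiers, gene names, and function names below are hypothetical placeholders and are not taken from GO or from GOstat. Each gene is propagated from its most specific terms to every ancestor, so that a term's gene set contains the genes annotated to it or to any of its descendants.

```python
from collections import defaultdict

# Toy ontology: child term -> list of parent terms (a directed acyclic graph).
# Term IDs and gene names are hypothetical, not real GO data.
PARENTS = {
    "term:root": [],
    "term:A": ["term:root"],
    "term:B": ["term:root"],
    "term:A1": ["term:A"],
    "term:AB": ["term:A", "term:B"],  # a term may have more than one parent
}

# Direct annotations: gene -> most specific terms assigned to it.
DIRECT = {
    "geneA": ["term:A1"],
    "geneB": ["term:AB"],
    "geneC": ["term:B"],
}


def ancestors(term, parents):
    """Return the term itself plus every term reachable via parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def gene_sets(direct, parents):
    """Map each term to the genes annotated to it or to any descendant."""
    sets = defaultdict(set)
    for gene, terms in direct.items():
        for term in terms:
            for ancestor in ancestors(term, parents):
                sets[ancestor].add(gene)
    return dict(sets)


SETS = gene_sets(DIRECT, PARENTS)
```

In this toy case the root term collects all three genes, while `term:A1` contains only `geneA`. Cutting the hierarchy at a given level then amounts to choosing which of these sets to test.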
Find Statistically Overrepresented GO Terms within a Group of Genes
The most commonly used methods to test for enrichment of a gene set in a list of genes selected from an experiment are based on the hypergeometric distribution and use either Fisher's exact test or the χ² test. These methods work in a similar way: a list of genes is first selected from a microarray, for example, by choosing all significantly differentially expressed genes using a cutoff value. The test for enrichment involves counting how many genes in the gene set occur in the list of selected
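As an illustration of the counting approach described above, the following minimal sketch computes the upper tail of the hypergeometric distribution, which corresponds to a one-sided Fisher's exact test for overrepresentation. The function name, parameter names, and example counts are assumptions chosen for illustration; this is not the GOstat implementation.

```python
from math import comb


def enrichment_p_value(n_array, n_in_set, n_selected, n_overlap):
    """One-sided hypergeometric p-value P(X >= n_overlap).

    n_array    -- number of genes on the array (the background)
    n_in_set   -- number of background genes belonging to the gene set
    n_selected -- number of genes in the selected list
    n_overlap  -- number of selected genes that belong to the gene set
    """
    total = comb(n_array, n_selected)
    upper = min(n_in_set, n_selected)
    # Sum the probabilities of seeing n_overlap or more set members
    # among the selected genes under random selection.
    tail = sum(
        comb(n_in_set, k) * comb(n_array - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20 genes on the array, 5 in the gene set,
# 5 genes selected, 3 of which fall in the gene set.
p = enrichment_p_value(20, 5, 5, 3)
```

A small p-value indicates that the gene set occurs in the selected list more often than expected by chance. For realistic array sizes the same tail probability is usually obtained from a statistics library rather than from explicit binomial coefficients.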
Description
The program GOstat helps in the analysis of lists of genes: it computes statistics about the GO terms contained in the data and sorts the GO annotations so that the most representative GO terms come first. The program also allows the computation of statistics for GO terms based on ranked gene lists from microarrays. It is available via the web site http://gostat.wehi.edu.au. Figure 4 shows a view of the GOstat input form and an example of a GOstat output.
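Because a statistic is computed for every GO term, many tests are run at once, and the reference list cites the Benjamini and Hochberg approach to controlling the false discovery rate. The sketch below shows how such an adjustment can be computed and used to rank terms. The function name and the example p-values are made up for illustration, and the sketch is not meant to describe how GOstat itself is implemented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values for four GO terms.
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)

# Rank the terms by adjusted p-value, most significant first.
ranking = sorted(range(len(raw)), key=lambda i: adj[i])
```

Sorting by the adjusted values gives the kind of ordered output described above, with the most strongly overrepresented terms listed first.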
Using GOstat
The program requires as input a list of gene
Visualization and Further Analysis
GOstat does not yet provide any tools for visualizing the results, because creating a good visualization requires interpretation of the results. Determining the significant GO terms in a list of genes is only a first step. The p values provide an indication of what processes or functions might be relevant. However, these results must be scrutinized further to determine whether they are reasonable or not. Careful evaluation of the results is necessary in order to
Discussion
There are many other tools besides GOstat that can help scrutinize the gene ontology structure and annotation to get more out of your microarray data. A more or less complete list of these is provided at http://www.geneontology.org/GO.tools.shtml. Tools that allow more options to display and analyze the graph structure of the annotations, as well as a more automated usage of such analyses, are the Bioconductor (Reimers and Carey, 2006) package gostats (http://www.bioconductor.org) and the
Acknowledgments
Thanks to Terry Speed, Hamish Scott, and Annemarie Poustka for giving the author the opportunity to develop and maintain GOstat. Thanks to Nick Tan and Dirk Ledwinka for IT support. Thanks to James Wettenhall, Lavinia Hyde, and Matthew Ritchie for proofreading and help with the manuscript. Tim Beissbarth and GOstat were supported by the Deutsche Forschungsgemeinschaft, WEHI, NHMRC, and NGFN.
References (40)
- et al. Analysis of variance of microarray data. Methods Enzymol. (2006)
- et al. The Affymetrix GeneChip Platform: An overview. Methods Enzymol. (2006)
- Analysis of a multifactor microarray study using Partek genomics solution. Methods Enzymol. (2006)
- et al. Clustering microarray data. Methods Enzymol. (2006)
- et al. GeneNest: Automated generation and visualization of gene indices. Trends Genet. (2000)
- et al. Clustering methods for analyzing large data sets: Gonad development, a study case. Methods Enzymol. (2006)
- et al. A mechanism of cyclin D1 action encoded in the patterns of gene expression in human cancer. Cell (2003)
- et al. Bioconductor: An open source framework for bioinformatics and computational biology. Methods Enzymol. (2006)
- et al. TM4 microarray software suite. Methods Enzymol. (2006)
- et al. Using ontologies to annotate microarray experiments. Methods Enzymol. (2006)
- FatiGO: A web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics
- Gene ontology: Tool for the unification of biology. Nature Genet.
- Analysis of CREM‐dependent gene expression during mouse spermatogenesis. Mol. Cell. Endocrinol.
- Processing and quality control of DNA array hybridization data. Bioinformatics
- Statistical modeling of sequencing errors in SAGE libraries. Bioinformatics
- GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics
- Controlling the false discovery rate: A practical and powerful approach to multiple testing. JRSS‐B
- The control of the false discovery rate in multiple testing under dependencies. Ann. Stat.
- MGD: The mouse genome database. Nucleic Acids Res.
- Establishing a human transcript map. Nature Genet.