[18] Interpreting Experimental Results Using Gene Ontologies

doi:10.1016/S0076-6879(06)11018-6

Methods in Enzymology

Volume 411, 2006, Pages 340-352


Abstract

High‐throughput experimental techniques, such as microarrays, produce large amounts of data and knowledge about gene expression levels. However, interpreting these data and turning them into biologically meaningful knowledge can be challenging. Frequently, the output of such an analysis is a list of significant genes or a ranked list of genes. In DNA microarray studies, data analysis often yields lists of hundreds of differentially expressed genes, and clustering of gene expression data may produce clusters of tens to hundreds of genes. These data are of little use if the results cannot be interpreted in a biological context. The Gene Ontology Consortium provides a controlled vocabulary for annotating the biological knowledge that is known or predicted for a given gene. The Gene Ontologies (GOs) are organized as a hierarchy of annotation terms, which facilitates analysis and interpretation at different levels. The top‐level ontologies are molecular function, biological process, and cellular component. Several annotation databases exist for the genes of different organisms. This chapter describes how to use GO to help interpret, in biological terms, the lists of genes resulting from high‐throughput experiments. It describes statistical methods for finding significantly over‐ or underrepresented GO terms within a list of genes and presents tools for performing such an analysis. The chapter focuses primarily on the tool GOstat (http://gostat.wehi.edu.au). Other tools enable similar analyses but are not described in detail here.

Introduction

DNA microarrays are a high‐throughput experimental technique used for gene expression analysis (Lockhart 1996, Schena 1995). Such high‐throughput techniques have greatly facilitated the discovery of new biological knowledge. However, this kind of knowledge is often difficult to grasp, and turning raw microarray data into biological understanding is by no means a simple task. Classical analysis of microarray data relies either on comparing expression at the level of single genes to determine which genes are significantly differentially expressed (Ayroles 2006, Beissbarth 2000, Downey 2006, Saeed 2006, Smyth 2004) or on clustering techniques that identify genes with similar expression patterns across a series of experiments (Eisen et al., 2000; Gollub 2006, Saeed 2006). In other popular techniques, such as serial analysis of gene expression (SAGE), the focus is often simply on finding the transcripts expressed in a given tissue (Beissbarth et al., 2004). In each case, the results can be expressed as a list of genes or as a list of all genes on the array ranked by a score (see Fig. 1).

In the more recent literature, many groups have developed methods that use prior biological knowledge to help interpret these lists of genes (e.g., Al‐Shahrour 2004, Beissbarth 2004, Dennis 2003, Doniger 2003, Yue 2005). In connection with the application of clustering techniques to microarray data, the term “guilt by association” has been used, indicating that prior biological knowledge may make it possible to infer the function of coexpressed genes (Clare 2002, Quackenbush 2003). Furthermore, when a single experiment produces a list of hundreds of differentially expressed genes, automatic annotation and grouping of genes are necessary simply to make interpretation of the results possible (Beissbarth 2003, Boon 2004). In addition, grouping genes and then testing these groups for differential expression may help identify significant biological processes even in cases where single‐gene analysis fails to produce significant results.

The main problem is how to incorporate prior biological knowledge and where to obtain it. The first type of knowledge that comes to mind is that of biological pathways. However, the pathway information stored in databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa 2000, Mao 2005) is quite sparse. Other sources of information include SWISS‐Prot keywords, the presence of transcription factor‐binding sites, protein domains, etc. It is also important to organize this information in a structured way. Ontologies are a commonly used concept in computer science (Draghici et al., 2003). They define the terms of a controlled vocabulary as well as the relations between those terms. They can also be viewed as a directed acyclic graph, which defines a hierarchy of terms. The Gene Ontology Consortium (http://www.geneontology.org) is a worldwide consortium that defines an ontology that can be used to annotate genes (Ashburner et al., 2000). Furthermore, there are a number of Gene Ontology (GO) databases in which annotations for the genes of one or several organisms are stored (e.g., Blake 2003, Camon 2004). The hierarchical structure of GO annotations is illustrated in Fig. 2.
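To make the directed acyclic graph view concrete, the following minimal Python sketch stores a toy ontology as a mapping from each term to its direct parents and collects, for every term, the genes annotated to it or to any of its descendants. The term names (`root`, `A` to `D`), the gene names, and the function names are hypothetical and chosen only for illustration; they are not real GO identifiers and this is not how any particular GO database or tool stores its data.

```python
from collections import defaultdict

# Hypothetical toy ontology: each term maps to its direct parents.
# Term "D" has two parents, so the structure is a DAG rather than a tree.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["A", "B"],
}

# Hypothetical direct annotations: gene -> most specific terms assigned to it.
DIRECT_ANNOTATIONS = {
    "gene1": ["C"],
    "gene2": ["D"],
    "gene3": ["B"],
}


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def gene_sets(direct_annotations, parents):
    """Map each term to the genes annotated to it or to any descendant term."""
    sets = defaultdict(set)
    for gene, terms in direct_annotations.items():
        for term in terms:
            # A gene annotated to a term is implicitly annotated to all ancestors.
            for t in {term} | ancestors(term, parents):
                sets[t].add(gene)
    return dict(sets)


if __name__ == "__main__":
    for term, genes in sorted(gene_sets(DIRECT_ANNOTATIONS, PARENTS).items()):
        print(term, sorted(genes))
```

Propagating annotations upward in this way means that general terms near the root collect many genes, while specific terms deep in the hierarchy collect few.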

A biological annotation can be expressed as a gene set. Gene ontology annotations can be cut at different levels, allowing the definition of gene sets for each GO term, leading to a hierarchy of gene sets. The subsequent statistical test for whether a certain biological function (or gene set) is associated with the experimental outcome of a microarray is often referred to as gene set enrichment analysis (Lamb 2003, Mootha 2003), and analysis with GO groups has become widely used for the analysis of microarray data (also see Gollub 2006, Hennetin 2006, Whetzel 2006). This analysis allows testing for gene sets (or functional groups) that are represented significantly more frequently in a list of genes that appears interesting in a microarray experiment than in a comparable list of genes that would be selected randomly from all genes on the microarray. This then allows the association of functions to a list of genes.
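One common way to carry out such a test is a one‐sided test based on the hypergeometric distribution, which is equivalent to a one‐sided Fisher's exact test. The sketch below is a minimal illustration using only the Python standard library; the function name, parameter names, and the example counts are assumptions made for this illustration and do not describe the implementation of GOstat or any other tool.

```python
from math import comb


def enrichment_p_value(n_total, n_in_set, n_selected, n_overlap):
    """One-sided hypergeometric p-value for over-representation.

    Probability of seeing at least `n_overlap` gene-set members when
    `n_selected` genes are drawn at random, without replacement, from
    `n_total` genes on the array, of which `n_in_set` belong to the gene set.
    """
    max_overlap = min(n_in_set, n_selected)
    # Number of ways to draw n_selected genes containing exactly k set members,
    # summed over all k at least as large as the observed overlap.
    favourable = sum(
        comb(n_in_set, k) * comb(n_total - n_in_set, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return favourable / comb(n_total, n_selected)


if __name__ == "__main__":
    # Hypothetical counts: 10,000 genes on the array, 200 annotated with a
    # given GO term, 300 genes selected as interesting, 20 of them annotated.
    p = enrichment_p_value(n_total=10_000, n_in_set=200, n_selected=300, n_overlap=20)
    print(f"p = {p:.3g}")
```

In this hypothetical example, about 6 of the 300 selected genes would be expected to carry the annotation by chance (300 × 200 / 10,000), so an overlap of 20 yields a small p value. When many GO terms are tested, the resulting p values are usually corrected for multiple testing.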


Find Statistically Overrepresented GO Terms within a Group of Genes

The most commonly used methods to test for enrichment of a gene set in a list of genes selected from an experiment are based on the hypergeometric distribution and use either Fisher's exact test or the χ² test. These methods work in a similar way: first, a list of genes is selected from a microarray, for example, by choosing all significantly differentially expressed genes using a cutoff value. The test for enrichment involves counting how many genes in the gene set occur in the list of selected

Description

The program GOstat helps in the analysis of lists of genes. It computes statistics about the GO terms contained in the data and sorts the GO annotations so that the most representative GO terms are listed first. The program can also compute statistics for GO terms based on ranked gene lists from microarrays. It is available via the web site http://gostat.wehi.edu.au. Figure 4 shows a view of the GOstat input form and an example of GOstat output.

Using GOstat

The program requires as input a list of gene

Visualization and Further Analysis

GOstat does not yet provide any tools for visualizing the results, because creating a good visualization requires interpretation of the results. Determining the significant GO terms in a list of genes is only a first step. The p values provide an indication of which processes or functions might be relevant. However, these results must be scrutinized further to determine whether they are reasonable. Careful evaluation of the results is necessary in order to

Discussion

There are many other tools besides GOstat that can help scrutinize the gene ontology structure and annotation to get more out of your microarray data. A more or less complete list of these is provided at http://www.geneontology.org/GO.tools.shtml. Tools that offer more options for displaying and analyzing the graph structure of the annotations, as well as more automated use of such analyses, are the Bioconductor (Reimers and Carey, 2006) package GOstats (http://www.bioconductor.org) and the

Acknowledgments

Thanks to Terry Speed, Hamish Scott, and Annemarie Poustka for giving the author the opportunity to develop and maintain GOstat. Thanks to Nick Tan and Dirk Ledwinka for IT support. Thanks to James Wettenhall, Lavinia Hyde, and Matthew Ritchie for proofreading and help with the manuscript. Tim Beissbarth and GOstat were supported by the Deutsche Forschungsgemeinschaft, WEHI, NHMRC, and NGFN.

References (40)

J.F. Ayroles et al. Analysis of variance of microarray data. Methods Enzymol. (2006)
D.D. Dalma‐Weiszhausz et al. The Affymetrix GeneChip Platform: An overview. Methods Enzymol. (2006)
T. Downey. Analysis of a multifactor microarray study using Partek genomics solution. Methods Enzymol. (2006)
J. Gollub et al. Clustering microarray data. Methods Enzymol. (2006)
S. Haas et al. GeneNest: Automated generation and visualization of gene indices. Trends Genet. (2000)
J. Hennetin et al. Clustering methods for analyzing large data sets: Gonad development, a study case. Methods Enzymol. (2006)
J. Lamb et al. A mechanism of cyclin D1 action encoded in the patterns of gene expression in human cancer. Cell (2003)
M. Reimers et al. Bioconductor: An open source framework for bioinformatics and computational biology. Methods Enzymol. (2006)
A.I. Saeed et al. TM4 microarray software suite. Methods Enzymol. (2006)
P.L. Whetzel et al. Using ontologies to annotate microarray experiments. Methods Enzymol. (2006)
F. Al‐Shahrour et al. FatiGO: A web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics (2004)
M. Ashburner et al. Gene ontology: Tool for the unification of biology. Nature Genet. (2000)
T. Beissbarth et al. Analysis of CREM‐dependent gene expression during mouse spermatogenesis. Mol. Cell. Endocrinol. (2003)
T. Beissbarth et al. Processing and quality control of DNA array hybridization data. Bioinformatics (2000)
T. Beissbarth et al. Statistical modeling of sequencing errors in SAGE libraries. Bioinformatics (2004)
T. Beissbarth et al. GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics (2004)
Y. Benjamini et al. Controlling the false discovery rate: A practical and powerful approach to multiple testing. JRSS‐B (1995)
Y. Benjamini et al. The control of the false discovery rate in multiple testing under dependencies. Ann. Stat. (2001)
J.A. Blake et al. MGD: The mouse genome database. Nucleic Acids Res. (2003)
M. Boguski et al. Establishing a human transcript map. Nature Genet. (1995)

