Methods and approaches in the analysis of gene expression data

https://doi.org/10.1016/S0022-1759(01)00307-6

Abstract

The application of high-density DNA array technology to monitor gene transcription has been responsible for a real paradigm shift in biology. The majority of research groups now have the ability to measure the expression of a significant proportion of the human genome in a single experiment, resulting in an unprecedented volume of data being made available to the scientific community. As a consequence, the storage, analysis and interpretation of this information present a major challenge. In the field of immunology, the analysis of gene expression profiles has opened new areas of investigation. The study of cellular responses has revealed that cells respond to an activation signal with waves of co-ordinated gene expression and that the components of these responses are the key to understanding the specific mechanisms which lead to phenotypic differentiation. The discovery of ‘cell type specific’ gene expression signatures has also helped the interpretation of the mechanisms leading to disease progression. Here we review the principles behind the most commonly used data analysis methods and discuss the approaches that have been employed in immunological research.

Introduction

The rapid technological development in the field of genomics has created an unprecedented situation in biology. Large volumes of information, from the genome sequence to high-throughput functional data, have shifted the attention of biologists towards an understanding of the global mechanisms behind biological phenomena. For example, the Human Genome Project has revealed a considerable number of genes with no clear sequence homology to previously characterised genes; understanding how these new genes act in driving the physiology of a human cell will be a major challenge for many years to come. Genetics and functional genomics platforms such as gene expression profiling, proteomics, yeast two-hybrid, transgenic technology and functional screening clearly all have roles to play in deciphering this puzzle. High-density macro- and microarrays have acquired a special role in this challenging field; these consist of ordered collections of thousands of different DNA sequences that can be used to measure DNA and RNA variation in biological samples. They have many applications (Lipschultz et al., 1999) but are most commonly used in expression profiling (Bowtell, 1999). Because of their relatively low cost and high gene coverage (they can be used to measure the expression of thousands of genes in a single experiment), the use of these arrays is rapidly spreading in both academic and industrial institutions, contributing to the exponential increase in publicly available genome-wide transcription data.

The availability of so much data is a major challenge, both in terms of the infrastructure required for its storage and manipulation and in terms of the analytical tools required to extract meaningful information. These problems, originally limited to local institutions, need to be approached at a more global level. Large publicly accessible gene expression databases will be required to share and cross-mine different experimental data. Common standards in data quality and experimental procedures, however, may need to be established in order to make this a worthwhile effort. Recently, the European Bioinformatics Institute (EBI) has announced plans for a microarray database that could satisfy these criteria (http://www.ebi.ac.uk/arrayexpress/News/news.html), but the current funding situation may hamper rapid expansion.

The development of data analysis strategies and tools to cope with the complexity of the data is a sizeable task. Current methods for analysis are based on comparison rules and pattern recognition algorithms such as cluster analysis. These methods are very effective data exploration tools that have already revealed a great deal of information in many areas of immunological research. The new field of gene expression data analysis is rapidly moving towards more statistically robust methods (for example, to predict disease membership or to model biological variables by means of gene expression).
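As a rough illustration of what cluster analysis means for expression data, the sketch below groups genes whose profiles rise and fall together. It is not the method of any particular study cited here: the distance measure (one minus the Pearson correlation), the average-linkage merging rule, the function names and the example values are all assumptions chosen for brevity.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # flat profile: treated as uncorrelated (an assumption)
    return cov / (sx * sy)


def cluster_profiles(profiles, n_clusters):
    """Average-linkage agglomerative clustering with distance 1 - r.

    profiles: dict mapping gene name -> list of expression values.
    Returns a list of sets of gene names.
    """
    genes = list(profiles)
    # Precompute the pairwise distance between every pair of genes.
    dist = {
        frozenset((g, h)): 1.0 - pearson(profiles[g], profiles[h])
        for g, h in combinations(genes, 2)
    }
    clusters = [{g} for g in genes]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        total = sum(dist[frozenset((g, h))] for g in a for h in b)
        return total / (len(a) * len(b))

    # Repeatedly merge the two closest clusters until n_clusters remain.
    while len(clusters) > max(n_clusters, 1):
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Invented time-course values (arbitrary units), for illustration only.
example_profiles = {
    "gene_a": [1.0, 2.0, 4.0, 8.0],
    "gene_b": [0.5, 1.1, 2.0, 4.2],
    "gene_c": [8.0, 4.0, 2.0, 1.0],
    "gene_d": [6.0, 3.5, 1.5, 1.0],
}
example_clusters = cluster_profiles(example_profiles, n_clusters=2)
```

With these invented values, the two rising profiles end up in one cluster and the two falling profiles in the other. A correlation-based distance groups genes by the shape of their response rather than by absolute expression level, which is one common choice among several.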

In this review we present an overview of the methodologies currently used in the analysis of gene expression data. In the second section of the manuscript we introduce some principles of data analysis with particular reference to clustering and other classification methods. In the third section we describe some of the most relevant applications in the field of immunology.

Data acquisition and data manipulation

High-density array technology provides an assay for the simultaneous measurement of the expression level of thousands of genes in a single experiment. Each array consists of a solid support (usually nylon or glass) where cDNA or oligonucleotides are arrayed in a fixed pattern. Fluorescent or radioactive probes derived from messenger RNA are hybridised to the complementary DNA on the array. The radioactive or fluorescence emissions of specifically bound probe are detected using an appropriate

Comparison of independent and paired samples

The comparison of two independent samples (e.g. diseased versus normal tissue) is the simplest experimental situation. As discussed previously, although a number of statistical tests are available to assess the significance of the observed differences, most of the groups active in this field use filtering rules based on fold difference criteria. Here we will review two examples based on two different experimental designs.
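A filtering rule based on a fold difference criterion can be sketched in a few lines. The snippet below is only an illustration: the two-fold threshold, the intensity floor used to keep near-background signals from producing inflated ratios, the function names and the example values are assumptions, not rules taken from the studies reviewed.

```python
def fold_difference(test, control, floor=1.0):
    """Signed fold difference between two intensity measurements.

    Positive when the test sample is higher, negative when the control
    is higher. Values below `floor` are raised to `floor` first, so that
    near-background signals do not generate large spurious ratios.
    """
    t = max(test, floor)
    c = max(control, floor)
    return t / c if t >= c else -(c / t)


def filter_genes(test_values, control_values, min_fold=2.0, floor=1.0):
    """Keep genes whose absolute fold difference is at least min_fold.

    Both arguments are dicts mapping gene name -> intensity. Genes
    missing from the control sample are skipped.
    """
    selected = {}
    for gene, t in test_values.items():
        if gene not in control_values:
            continue
        fd = fold_difference(t, control_values[gene], floor)
        if abs(fd) >= min_fold:
            selected[gene] = fd
    return selected


# Invented intensities (arbitrary units), for illustration only.
diseased = {"gene_a": 400.0, "gene_b": 120.0, "gene_c": 50.0, "gene_d": 0.2}
normal = {"gene_a": 100.0, "gene_b": 100.0, "gene_c": 200.0, "gene_d": 0.5}
changed = filter_genes(diseased, normal, min_fold=2.0)
```

Here gene_a and gene_c pass the filter, with fold differences of 4.0 and -4.0. gene_b changes too little, and gene_d is rejected because both of its signals lie below the floor. A rule of this kind takes no account of measurement variability, which is why statistical tests are the natural alternative when replicate measurements are available.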

A filtering rule strategy applied to a problem of immunological interest

Conclusion and future perspectives

Virtually every laboratory can now access some form of microarray technology and monitor the expression of a significant percentage of the transcriptional capacity of the human genome. This unprecedented volume of information is changing the way experimental research is done in many areas of biology. The classical hypothesis-driven ‘single gene’ approach is now paralleled by a more global approach. Although experiments aiming to characterise the transcriptional response of a biological system

Acknowledgments

We acknowledge that Dragoni I., Zanders E., Gillian A. and Falciani F. generated and analysed the data on the rheumatoid arthritis populations described in Section 3.5 while working at the GlaxoWellcome Medicines Research Centre in Stevenage, UK. We would like to thank Dr. Brian Champion and Dr. Gareth Maslen (Lorantis Ltd.) for critical reading of the manuscript and for useful comments. We are also indebted to Dr. Dov Stekel (Oxford Gene Technology, Oxford, UK) for his encouragement and for

References (58)

  • M. Brown et al. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. USA (2000)
  • S. Chu et al. The transcriptional program of sporulation in budding yeast. Science (1998)
  • J.M. Claverie. Computational methods for the identification of differential and coordinated gene expression. Hum. Mol. Genet. (1999)
  • S.D. Der et al. Identification of genes differentially regulated by interferon α, β, or γ using oligonucleotide arrays. Proc. Natl. Acad. Sci. USA (1998)
  • J. Dopazo et al. Phylogenetic reconstruction using an unsupervised growing neural network that adopts the topology of a phylogenetic tree. J. Mol. Evol. (1997)
  • I. Dragoni et al. Analysis of synovial tissue and blood from patients with rheumatoid arthritis using differential gene expression technology and statistical analysis. J. Mol. Med. (2001)
  • M. Eisen et al. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA (1998)
  • B.S. Everitt et al. Applied Multivariate Data Analysis (1992)
  • B. Fritzke. Growing cell structures — a self-organizing network for unsupervised and supervised learning. Neural Networks (1994)
  • R. Glynne et al. How self tolerance and the immunosuppressive drug FK506 prevent B-cell mitogenesis. Nature (2000)
  • T.R. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science (1999)
  • J.A. Hartigan. Clustering Algorithms (1975)
  • R.A. Heller et al. Proc. Natl. Acad. Sci. USA (1997)
  • Herrero, J., Valencia, A., Dopazo, J., 2001. A hierarchical unsupervised growing neural network for clustering gene...
  • R. Herwig et al. Large-scale clustering of cDNA-fingerprinting data. Genome Res. (1999)
  • L.J. Heyer et al. Exploring expression data: identification and analysis of coexpressed genes. Genome Res. (1999)
  • N. Kaminski et al. Global analysis of gene expression in pulmonary fibrosis reveals distinct programs regulating lung inflammation and fibrosis. Proc. Natl. Acad. Sci. USA (2000)
  • D.B. Krizman et al. Construction of a representative cDNA library from prostatic intraepithelial neoplasia. Cancer Res. (1996)
  • T. Kohonen. The self-organizing map. Proc. IEEE (1990)