Introduction

We are using cDNA microarrays to study global patterns of gene expression in developing wheat caryopsis. The caryopsis is the fruit of grasses in which the pericarp is fused to the seed coat at maturity and it is interchangeably referred to as the grain in cereals. The caryopsis consists of three major structures: embryo, endosperm and seed coat with the endosperm being the tissue of economic value. The endosperm is the primary site of storage of starch and protein, which are both important for global agriculture (Aquino et al. 1999; Triboi and Triboi-Blondel 2002). The endosperm starch and storage proteins are critical to bread and pasta quality and are exclusive to wheat such that no other cereal can be used similarly (Bechtel et al. 1990; Shewry et al. 2002). Wheat milling, baking, cooking and eating qualities are largely determined by the biochemical composition of the seed (Stoddard 1999a) and processed wheat starch and protein are also finding use as biopolymers (Davies et al. 2003; Woerdeman et al. 2004). To enhance wheat seed quality and functionality with the intention of broadening existing markets, it will be necessary to dissect the underlying molecular mechanisms determining grain storage product accumulation (Davies et al. 2003; Francki and Appels 2002; Wilson et al. 2004).

The physiological, biochemical and morphological changes that culminate during grain development determine the nutritional profile of the seed at harvest. These fundamental processes are almost certainly regulated at the level of transcription. Wheat endosperm development can be generally divided into five phases (Simmonds and O’Brien 1981): Phase I [0 days post-anthesis (DPA)] fertilization, Phase II (1–5 DPA) “coenocytic” endosperm, Phase III (6–13 DPA) cellularization and early grain filling, Phase IV (14–24 DPA) maximum grain filling, and Phase V (25–38 DPA) desiccation. Phase II is also commonly known as Grain Growth stage, Phase III and Phase IV as the grain filling stage, and Phase V as dry-down stage—where each stage describes the relationship between cell development and storage product accumulation. During Grain Growth, free nuclear division in the embryo sac produces the watery coenocytic tissue. Final cell number of the endosperm is established and differentiation into the outer aleurone layer and inner endosperm cells occurs. Grain filling is dominated by the rapid synthesis of starch and protein in the endosperm (Bechtel et al. 1990; Shewry and Halford 2002; Stoddard 1999b) and to a lesser degree, lipids. Cell division ceases and cell enlargement takes place to accommodate reserve storage. Dry down is characterized by cessation of most metabolic processes. The organ reaches maximal dry weight and the cells undergo a type of programmed cell death in preparation for dormancy (Olsen 2001; Olsen et al. 1999; Young and Gallie 2000).

The changes in storage reserve accumulation during wheat grain maturation are well established; however, identifying key molecular determinants controlling carbon flux to the grain and the partitioning of carbon to starch and protein are more elusive. Much of our current knowledge is based on biochemical assays of protein and enzymatic activities of starch and protein biosynthesis during caryopsis development. Our understanding of how each pathway is controlled is complicated by the occurrence of multi-gene families encoding many of the enzymes in these biochemical pathways [reviewed in Morell and Myers (2005)], the interconnectedness of these different pathways (Giroux et al. 1994; Simmonds 1995; Triboi and Triboi-Blondel 2002) and the strong influence of the environment on the amount and nature of the starch and protein made (Dupont and Altenbach 2003; Triboi et al. 2003). Gene networks and transcriptional regulators synchronously expressed with the biochemical and physiological changes observed during grain development (Appels et al. 2003) may be informative. Classifying genes based on similarities or difference in transcript profile with phenotype can confirm existing knowledge, lead to the dissection and revelation of novel mechanisms determining nutrient partitioning, and generate new unbiased hypotheses (Leader 2005; Zhu et al. 2003).

cDNA microarrays are used to provide a comprehensive description of transcript level in an organism after perturbation or during development [reviewed in Schaffer et al. (2000)]. They have been valuable in understanding aspects of grain growth and filling in cereals (Close et al. 2004; Gregersen et al. 2005; Sreenivasulu et al. 2004; Zhu et al. 2003). However, because of its recalcitrant genome (Langridge et al. 2001), this type of transcriptional analysis has only just seen wider utility in wheat (Clarke and Rahman 2005; Gregersen et al. 2005; Kawaura et al. 2006; Leader 2005; Lu et al. 2005; Wilson et al. 2004). In spite of their increasing use and vast contribution to opening new paradigms in plant biology the limitations of cDNA microarrays have been well documented (Alba et al. 2004; Donson et al. 2002; Leader 2005; Meyers et al. 2004). Inherent problems include cross-hybridization of homologous DNA and the poor reproducibility of the PCR amplicons spotted on the array (Meyers et al. 2004). Superior resources for global expression profiling such as tag-based technologies are available [reviewed in Meyers et al. (2004)]. However, they are not widely disseminated because they are technically demanding and relatively expensive. cDNA microarrays in contrast are cheap and relatively portable. The biological relevance of results obtained by this method may be enhanced by rigorous biological and technical replication, validation of results by independent methods and attention to quality control (Brazma et al. 2001; Yang and Speed 2002).

We produced a cDNA microarray for wheat endosperm transcriptomics. Our aim was to establish an in-house platform suitable for robust gene expression profiling of wheat caryopsis development. To support our objectives we created a “boutique” cDNA microarray with high coverage of clones isolated from developing caryopsis in order to focus on this process. Recent wheat transcriptional profiling studies have provided analyses of the changes underpinning caryopses development (Clarke and Rahman 2005; Gregersen et al. 2005; Leader 2005; Lu et al. 2005; Wilson et al. 2004; Xue et al. 2006). We wish to describe these changes, and to use the information as the basis to further explore changes in the wheat caryopsis transcriptome after genetic perturbation.

In this paper we report, (1) the development and validation of the wheat cDNA microarray (2) the classification of the temporal patterns of gene expression during wheat grain filling, and (3) the correlation of those changes with storage product accumulation and caryopsis development. We show that the array is highly reproducible and validate our measurements by semi-quantitative reverse-transcription PCR. Finally we show that our array and methodology is stringent enough to further examine the mechanistic forces that underscore carbon partitioning in wheat.

Materials and methods

Plant material and total RNA extraction

Triticum aestivum cv. Bobwhite was grown to maturity in 10-inch pots (6 plants per pot) in the greenhouse with maximum daytime temperature of 24°C and minimum nighttime temperature of 17°C. Natural light was supplemented with 100 W high-pressure sodium lights to maintain a day length of 16 h. Plants were fertilized with Plantex 20-20-20 using an automatic drip irrigation system. Several measures were implemented to reduce the variability of the biological samples. Pots were rotated every two-weeks to reduce the effects of environmental variations. The primary spike of each plant was tagged at anthesis and only grains at the middle of each spike were harvested. Grains were collected at 3, 7, 10, 14, 18, 21, 25, 28, 31, 35, 42, 45 and 50 DPA and immediately frozen with liquid nitrogen. For microarray experiments, RNA was isolated from developing grains at 3, 7, 14, 21, 28 and 35 DPA using a modified TRIZOL (Invitrogen, Carlsbad, CA). Frozen grains from each spike were ground using a mortar and pestle and the powdered tissue resuspended and incubated in a starch pre-extraction buffer (50 mM Tris–HCl pH 9.0, 200 mM NaCl, 1% Sarcosyl, 20 mM EDTA and 5 mM DTT). Starch was separated from the aqueous phase by phenol/chloroform extraction before proceeding with the TRIZOL protocol. Total RNA was treated with RQ1 DNAse (Promega, Madison, WI) and was re-purified using RNA Easy columns (QIAGEN Inc, Valencia, CA). The quality and concentration of RNA was measured by spectroscopy. RNA integrity and assessment of DNA contamination was done by resolving an aliquot of RNA sample in a 2% agarose gel with ethidium bromide and then UV–visualized. Alternatively, RNA integrity was determined using the Agilent 2100 Bioanalyzer (Agilent, Palo Alto, CA).

Determination of physiological parameters during grain development

To determine the change in grain fresh and dry weight during development, grains from 9 to 20 single heads were collected, counted and weighed. Grains from a single head were freeze-dried to determine dry weight. Water content was calculated by subtracting dry weight from fresh weight. Values from individual heads were averaged for each time point.

To determine the starch and protein content, freeze-dried grains were ground to a powder using an UDY mill (UDY Corporation, Fort Collins, CO, USA). Total starch content per 100 mg sample was assayed using the Megazyme total starch determination assay kit (Megazyme International, County Wicklow, Ireland). Total protein content was determined on 30 mg samples by N combustion analysis with a Leco nitrogen analyzer (Leco Corporation, St Joseph, MI, USA) using a protein to N ratio of 5.7:1. For both starch and proteins assays, three determinations were done per biological sample, and 2–5 biological samples were used per time point. Values were averaged for each time point.

Wheat 8K cDNA array construction

Two sets of clones were used for the construction of the wheat 8K cDNA arrays: 2,304 unique clones from developing wheat grain ESTs and the set of approximately 6,000 expressed sequence tag (EST) clones mapped to the wheat hexaploid genome (Qi et al. 2004). The cDNA clones for the developing wheat grain ESTs were derived from two non-normalized cDNA libraries: TA001E1X, generated from RNA isolated from 5 to 30 DPA developing endosperm tissue of T. aestivum cv. Cheyenne, and TA059E1X, generated from developing grains of T. aestivum cv. Butte plants subjected to different abiotic stresses (http://wheat.pw.usda.gov/cgi-bin/nsf/nsf_library.cgi). The set of mapped EST clones were derived from 48 different cDNA libraries generated from different plant tissues of normally grown plants and those subjected to different biotic and abiotic stresses. The description of the cDNA libraries and sequences of the mapped probes are available at http://wheat.pw.usda.gov/NSF/progress_mapping.html. The GenBank accession number (when available) of a clone spotted on the array is used as its unique ID. Clones that are duplicated from the two sources are assigned probe names appended with a letter. Clones that fail to generate a PCR product either due to poor growth or give a double or multiple bands when amplified were not removed from the array. Instead the original clone was re-arrayed and new plasmid DNA isolated to generate a quality probe. The GenBank accession number of the re-arrayed clone was also appended with a letter. The new array contained 10,800 spots; aside from the control clones, 9,167 of the spots were wheat probes representing 7,835 unique genes.

Clone inserts were amplified using a modified pair of universal M13 forward and M13 reverse primers: GTTTTCCCAGTCACGACGTTG and TGAGCGGATAACAATTTC ACACAG, respectively. PCR amplifications were carried out in 100 μl volume in an MJR Tetrad Thermal Cycler (Bio-Rad Laboratories, Hercules, CA) for 30 cycles with 57°C annealing temperature and 2.5 min extension time. The reaction cocktail contained plasmid DNA, 1.5 mM MgCl2, 200 μM each of deoxynucleotides dATP, dCTP, dGTP and dTTP, 200 nM each of M13 forward and M13 reverse primers, 1.25 unit Taq polymerase (Promega Biosciences, Inc., San Luis Obispo, CA) and 1× reaction buffer (50 mM KCl, 10 mM Tris–HCl pH 9.0, 0.1% Triton X-100). Amplicon size, yield and integrity were determined by resolving 5 μl of the PCR product in a 1% agarose gel. Amplicons were purified using QIAquick 96 PCR purification kit (QIAGEN Inc, Valencia, CA), dried and resuspended in 50% DMSO as printing buffer.

Probe DNA (approximately 300 ng/μl) were spotted from 384-well microtiter plates onto Corning UltraGAPS slides (Corning Inc, NY) using an Omnigrid 100 arraying machine (Genomics Solutions, Ann Arbor, MI) fitted with 48 CMP3 printing pens from Telechem (Sunnyvale, CA). Each array has 48 subgrids and each subgrid had 15 × 15 spots. The same pen prints all the spots on a subgrid. The printed slides were UV-crosslinked at 300 mJ and dried for 2 h in an 80°C oven before use. Clone insert identities were verified by re-sequencing using the ABI Big Dye terminator mix from Perkin-Elmer and analyzed in a 3700 or 3730 ABI DNA analyzer.

Microarray controls

The DNA of several control clones were spotted on the array which could be used for normalization and data quality assessment: The control clones include the 18 Arabidopsis Functional Genomics Consortium (AFGC) microarray control set and 23 Lucidea Score Card (Amersham Biosciences, Palo Alto, CA) calibration, ratio, utility and negative controls. The AFGC control set includes 10 spiking controls derived from human genes which do not show significant cross hybridization with plant DNA and 8 plant transformation marker genes. The 23 Lucidea Score Card control DNA are artificial genes composed of sequences from yeast intergenic regions. Each of the 10 AFGC spiking controls was spotted on the first row of 44 of the 48 subgrids on the array. The 23 Lucidea Score Card control DNAs were spotted on 4 of the 48 subgrids on the array. The four subgrids (7, 18, 31 and 42) were evenly spaced along the entire length of the printed array. The mRNA for the 10 AFGC spiking controls were in vitro transcribed using Promega RiboProbe in vitro Transcription System-T3 (Promega, Madison, WI). The AFGC control clones are available at the European Arabidopsis Stock Center (http://seeds.nottingham.ac.uk/Nasc). The set of in vitro transcribed mRNA for the Lucidea Score Card controls were purchased from Amersham.

RNA labeling and microarray hybridization

RNA was indirectly labeled with Alexa 555 and Alexa 647 fluorophores (Molecular Probes, Inc, Eugene, OR) using the protocol recommended by the manufacturer. Briefly, 10 μg total RNA was annealed to 0.5 μg oligo-dT and reverse transcribed using the following reaction cocktail: 200 units Superscript II (Invitrogen, Carlsbad, CA), 500 μM each of dATP, dCTP and dGTP, 150 μM dTTP, 300 μM amino-allyl-dUTP, 10 mM dithiothreitol, and 1× Superscipt II Strand buffer. The fluorophores were coupled to the cDNA for at least 2 h, quenched and the unbound fluorophores removed using Microcon YM-30 filters (Millipore, Billerica, MA). Equal amounts of in vitro transcribed mRNA for 10 mammalian spiking controls (0.05 ng each/μg sample RNA) and 8 Lucidea calibration controls were added to each labeling reaction mix. Hybridization was carried out using the Pronto! hybridization kit reagents (Corning Inc, NY) following the supplied protocol.

Data collection and preprocessing

Hybridized slides were scanned using a Genepix 4000B microarray scanner (Axon Instruments, Union City, CA) and raw spot fluorescence intensities were collected using GenePix Pro version 5.0. A quality control filter is used to flag questionable spots on the array so they can be removed from analysis. The criteria for a fair array feature included intensity signal >55% of background +1 standard deviation (SD), less than 3% pixel saturation for both channels and feature shape or circularity. Array features annotated as “printing buffer”, “blank” or “empty” were flagged and excluded from analysis.

Although sources of variance can be included in the analysis of variance (ANOVA) gene model, the data set were pre-normalized before ANOVA analysis was performed to further reduce non-biologically significant variations. The logarithms (base 2) of the original intensity values, not the ratios, for each probe obtained from each fluorescent channel were used. To reduce any dye bias introduced during RNA labeling, the global median value of all the signals per channel on each individual array was subtracted from each spot raw intensity signal and divided by the standard deviation. The expression values per gene were centered based on the median value of expression of the gene across all time points.

Data analysis and visualization

Mixed model ANOVA was used to assess the significance of the difference in expression of each gene across the 6 time-points surveyed (Kerr et al. 2000; Wolfinger et al. 2001). With the multiple steps required to carry out a successful microarray experiment, it is not unusual to have “noisy” data. To extract reliable information from the data, non-biologically significant sources of signal variation were identified and their effects removed. The following gene model was used to identify genes that were differentially expressed:

$$ \eqalign{ Y_{ijklm} \, = \,\mu \, + \,{\text{Array}}_i \, + \,{\text{Dye}}_j \, + \,{\text{Treatment}}_k \, + \,{\text{BioRep}}_l \, + \,({\text{Dye}}\,*\,{\text{Treatment}})_{jk} \, + {\text{ }} \cr {\text{ }} & + \,{\text{ }}({\text{BioRep}}\,*\,{\text{Treatment}})_{lk} \, + \,({\text{BioRep}}\,*\,{\text{Dye}})_{jl} \, + \,({\text{Dye}}\,*\,{\text{BioRep}}\,*\,{\text{Treatment}})_{jkl} \cr & + \,\varepsilon _{ijklm} \cr} $$

Y ijkl denotes the transformed intensity for a gene, μ denotes the average intensity and ε ijklm captures the random errors. The variation due to microarray slide used (Array) was designated as random effect, whereas, variations due to RNA fluorescent labeling (Dye), biological sample RNA (BioRep) and caryopsis developmental stage (Treatment) were treated as fixed effects. Only the main effects interacting with Treatment were included in the model. The Statistical Analysis Software version 9.0 (SAS Institute Inc, Cary, NC) PROC MIXED protocol was used for analysis.

The data for the differentially expressed genes were visualized by implementing the hierarchical and non-hierarchical (k-means) clustering methods in the Genesis software (Sturn et al. 2002). Genespring GX software version 7.3 (Agilent Technologies, Santa Clara, CA) was used to identify genes with pattern of expression correlating with the accumulation and rate of accumulation of physiological markers.

RT-PCR

To confirm the expression patterns observed by microarray data analysis, one-step semi-quantitative reverse transcription polymerase chain reaction (RT-PCR) experiments were performed. Probes that exhibited different patterns of expression based on microarray data analysis were chosen. Primer pairs were selected and designed from sequences near the 3′ end of the gene using PrimeSelect software (DNASTAR Inc. Madison, WI). If a clone belonged to a contig, the primer pair was designed from the consensus sequence of the virtual gene. An 18S rRNA gene was selected as a control. Total RNA isolated from grains of appropriately staged spikes of a fourth biological sample (different from the three sets used for the microarray experiment) was used as templates for amplification. The QIAGEN One Step RT-PCR kit and MJR Peltier thermal cycler were used to carry out the amplifications. RT-PCR fragments were resolved on an agarose gel, stained with ethidium bromide and visualized with UV light.

Microarray probe annotation and GO functional categorization

The DNA sequence for each of the unique probes spotted on the array was searched against the non-redundant GenBank DNA and protein databases (release 144) using BLASTN and BLASTX (Altschul et al. 1990, 1997). The best match was extracted using an in-house Perl script and used as a basis for obtaining annotations for each probe based on sequence identity. The DNA sequence for each of the unique probes was also searched against the UniProt database (Release 1.5, TrEMBL, Swiss-Prot, and PIR, http://www.ebi.ac.uk/uniprot) resources using BLASTX, and best matches (E value < 10−10) were compared to terms of the Gene Ontology™ (GO) Consortium. Using GO/UniProt comparison tables, candidate GO assignments were predicted based on EST matches to the UniProt reference sequences. Categories were assigned based on biological, functional, and molecular annotations available from GO (http://www.geneontology.org/).

Results and discussion

Timing of developmental events in wheat caryopsis

The environmental conditions by which wheat is grown affect the development and quality of the grain (Altenbach et al. 2003; Bhullar and Jenner 1986; Brooks et al. 1982; Daniel and Triboi 2000; Dupont and Altenbach 2003). The timing of grain development in particular is dramatically influenced by temperature (Altenbach and Kothari 2004; Dupont and Altenbach 2003). Altenbach and Kothari showed that high temperature advance and compress the transcriptional programs in developing wheat grain. However, they also observed that when grains grown under two different temperature regimes were staged based on physiological markers rather than on chronological time, the pattern of transcript accumulation in grains grown in high temperature compared with those grown in moderate temperature were equivalent. Presumably this results because the network of the grain transcriptional program is dependent on environmental conditions. Thus, the physiological markers of plant growth may serve as a better reference for analyzing the timing of gene expression.

To associate the timing of caryopsis development with changes in gene expression, several key physiological markers of growth were determined for the materials used in the microarray experiments: caryopsis fresh weight, dry weight, water, protein and starch content. Fresh weight (FW) reflects the accumulation of both dry matter and uptake or loss of water. Dry weight (DW) is primarily influenced by the accumulation of the major seed storage molecules, proteins and starch, in the endosperm. Water content (WC), determined as the difference between FW and DW of each grain, can be used to mark grain cellular expansion and desiccation. Grains from appropriately staged spikes were collected, counted, weighed and dried. The dried caryopses in each spike were ground for protein and starch content determination.

The results of the physiological marker determinations for 14 different time points are shown in Fig. 1. Our data show that caryopsis FW increased linearly to 31 DPA reaching a maximum of 62 mg/caryopsis (Fig. 1A). The rate of FW accumulation was at its maximum rate at 14 DPA. Caryopsis DW increased steadily and plateau at 35–38 DPA at 37–39 mg/caryopsis. The rate of DW accumulation was maximal at 21 DPA. Although the caryopsis accumulated water up to 21 DPA, the maximum %WC per caryopsis was at 10 DPA. The rate of water accumulation was maximal between 7 DPA and 14 DPA. The caryopsis began to lose water at 28–35 DPA and reached 4–5% of FW at 45–50 DPA. The caryopsis began to turn brown at around 28 DPA and reached physiological maturity at 35–38 DPA, with water content between 35% and 40% FW (Calderini et al. 2000). Starch accumulated at a maximum rate between 10–14 DPA and peaked at 35–38 DPA. Total protein accumulated linearly between 7–28 DPA and reached its peak level at 38–42 DPA.

Fig. 1
figure 1

Physiological marker of developmental events in wheat caryopsis. (A) Fresh weight (circle), dry matter (square) and water content (triangle), shown as amount in mg per caryopsis, were determined at 14 different time points in wheat caryopsis development. (B) Accumulation of storage compounds, starch (circle) and protein (square), during caryopsis development. (C) Classes of expressed genes expected and predicted to be associated with the biological events at each stage in caryopsis development

Six time points in caryopsis development were used for the microarray experiment. These time-points (3, 7, 14, 21, 28 and 35 DPA) cover the different stages of development from coenocytic stage to desiccation. The time-points encompass the transition points in dry matter accumulation and water uptake and loss, which could reflect control points in the programming of gene expression in the developing caryopsis. The expected and predicted genes that may be expressed in these stages are shown in Fig. 1C. At 3 DPA, the caryopsis is at the coenocytic stage. It is small and has not begun accumulating grain storage molecules. At 7 DPA, the endosperm is actively undergoing cellularization and storage molecule accumulation is initiated. The cells are undergoing expansion as reflected by increasing rate of water uptake, which peaks at 14 DPA. At 14 DPA, the rate of dry matter increases and peaks at 21 DPA. At 21 DPA water content reaches its maximum and begins to decline at 28 DPA. At 35 DPA caryopsis attains maximum dry matter content and reaches physiological maturity.

RNA profiling experimental design and data analysis

We used an in-house generated wheat cDNA array containing probes for 7,835 genes to examine the gene expression dynamics in the developing caryopsis. The probe set is enriched for genes that are expressed in the endosperm and 75% of the probes (5,817 genes) have been mapped in the hexaploid wheat genome. Data analysis of the embedded spiking and hybridization controls on the arrays showed that we can reliably detect a signal from 0.3 pg of the calibration control RNA for every 1 μg of total RNA labeled for hybridization (see Supplemental Data I). Assuming that the amplification efficiency of the endogenous wheat RNA is similar to that of the spiked-in calibration controls and using the estimated 100–500 pg of total RNA in a plant cell (Goldberg et al. 1978; Kerk et al. 2003), and an average of 1,500 nucleotides/molecule of plant mRNA (Alexandrov et al. 2006; Goldberg et al. 1978; Ogihara et al. 2004), the cDNA microarrays could detect the expression of genes with as low as 36–181 RNA transcripts per cell. Using Goldberg and colleagues’ estimate of 17 transcripts, 340 transcripts and 4,500 transcripts per cell for low, moderate and highly expressed genes, respectively, the wheat 8K cDNA arrays can reliably detect moderately expressed genes and perhaps some of the low expressed genes.

A connected loop design (Kerr and Churchill 2001a, b) was used to compare the expression of genes between time points (Fig. 2A). To reduce the effect of dye-bias, dye swaps for each RNA sample were performed for each comparison The experiment was carried out using three biological replicates per time point, wherein RNA was independently isolated from spikes of three different plants at the same stage of development. Overall 54 slides were used for 108 hybridizations, which yielded 18 data-points per gene per time point. Raw signals from each slide were pre-processed to remove flagged data points and were pre-normalized using the spiking control signals. The logarithmic (base 2) transformed signals across all slides showed a skewed distribution that was fixed after normalization (Fig. 2B). The normalized median-centered log2 transformed signals for each probe were used as input for a mixed model analysis of variance (ANOVA) to identify genes that were differentially expressed during development.

Fig. 2
figure 2

Experimental design and data normalization. (A) Loop design: RNA level from a specific time-point, T, is directly compared with the RNA level from two adjacent time points and the time-point 3 times removed. The arrow represents independent RNA labeling and hybridization, wherein the RNA at the tail end of the arrow is labeled with Alexa 555 fluorophore and the RNA at the head of the arrow is labeled with Alexa 647 fluorophore. (B) Distribution of normalized data: The left panel shows the distribution of close to 1.17 million microarray spots raw intensity values after log (base 2) transformation. The right panel shows the distribution of all the data points after normalization (dye correction and median centering). Note that after data pre-processing all the data points were normally distributed and centered on zero

Dynamics of gene expression in developing caryopsis

We identified 2,295 validated probes on our array to be differentially expressed (at P ≤ 0.01) during caryopsis development (listed in Supplemental Data II). Examination of the distribution of the genes that are differentially expressed between two time-points (Fig. 3, solid bars) indicates that a significant number of genes showed differential expression throughout development and that major reprogramming occurs during the early stages of caryopsis development especially between 7 DPA and 14 DPA which coincides with the onset for grain filling. To further gain insight into the transcriptional control of biological events at specific stages of grain development, we examined the transition points for genes that are differentially expressed between two time points. Transcripts that show rapid up-regulation or down-regulation only during a particular time of the grain development could be characteristic of a specific developmental stage. Genes differentially expressed between two time points were queried for genes that were differentially expressed ONLY in the two consecutive time-points chosen (at P ≤ 0.01) and not in other stages of development (Fig. 3, hatched bars). A significant number of the genes expressed in this manner were observed in three stages: between 3–7, 7–14 and 21–28 DPA. These three stages correspond to distinct developmental and metabolic events occurring in the caryopsis. Caryopsis endosperm at 7–14 DPA is transitioning from a coenocyte to a multi-cellular tissue. The onset of grain filling occurs at 7–14 DPA with a maximal rate of accumulation of transcripts encoding storage proteins and enzymes involved in starch synthesis. At 21–28 DPA, the caryopsis is beginning to undergo maturation and desiccation.

Fig. 3
figure 3

Differentially expressed genes between two consecutive time points. Solid bars represent differentially expressed genes between two consecutive time points. The hatched bars represent genes that show significant differential expression (P-value = 0.01 or less) in two consecutive time-points only and not significantly expressed in other time points

A major proportion (48–55%) of the differentially expressed genes have not been ascribed a GO biological function annotation. For those genes with biological GO annotation, the majority falls into the cellular macromolecule metabolism category or generation of precursor metabolites and energy category. About 20% of these genes reach a maximum transcript level between 14 DPA and 21 DPA, when maximum grain filling occurs. The group of genes (69 transcripts) that were differentially expressed only between 14 DPA and 21 DPA contained three 14-3-3b interacting proteins, which were down regulated during that time-span. 14-3-3 are a family of proteins that play a regulatory role in several processes in plant development such as signal transduction, checkpoint control, apoptosis and nutrient-sensing pathways (Ferl 1996; Fulgosi et al. 2002). There are three genes of unknown function that closely follow the pattern of expression of 14-3-3. This co-expression can suggest their possible co-involvement with these regulatory proteins. It has been proposed that 14-3-3b may inhibit starch biosynthesis (Fulgosi et al. 2002; Sehnke et al. 2001). This is consistent with our observation that 14-3-3b transcript accumulation is inversely related to starch accumulation. Among the genes that are significantly up-regulated only at 14–21 DPA are genes involved in generation of metabolite precursors and energy and those that are possibly involved in defense mechanisms. About half of the genes in this group are significantly down-regulated. These include EF1-α, actin, α-tubulin, carbohydrate metabolism and unknown genes.

Global patterns of transcript accumulation in developing caryopsis

To assimilate the data generated, hierarchical and k-means clustering methods were used to discriminate and visualize patterns of gene expression during development. Hierarchical clustering (Eisen et al. 1998) allows for the analysis of the relationship of each gene to every other gene on the array, whereas, non-hierarchical clustering, like k-means clustering (Sturn et al. 2002), allows the separation of the genes (or the expression profiles of the transcripts) into distinct classes.

To classify the genes into groups with a similar pattern of expression, k-means clusters of 5, 8, 10, 12, 15, 20 and 25 were tested (data not shown). Clustering into 10 groups was selected (Fig. 4) since it gave enough resolution (after 50 iterations) of the different expression profiles without significant redundancy. The number of genes that grouped into the 10 different clusters ranged from 10 (Cluster 10) to 771 (Cluster 9).

Fig. 4
figure 4

Patterns of gene expression in developing wheat caryopsis The 2,295 differentially expressed genes across all 6 time points were grouped into 10 clusters using the k-means algorithm. The mean centered relative gene expression value (in log2 scale) for each gene is plotted on the y-axis and the time of development in days post-anthesis (DPA) is on the x-axis. The magenta curve represents the median of the gene expression values in each cluster. Tick marks on the x-axis represent the developmental time 3, 7, 14, 21, 28 and 35 DPA. Tick marks on the x-axis for 3 DPA and 35 DPA overlap with the sides of the cluster box

Based on trends of mRNA accumulation, the 10 k-means cluster of gene expression patterns can be sorted into four classes: up-regulated (Fig. 4, Clusters 2, 3, 6, 7 and 10), down-regulated (Fig. 4, Clusters 1 and 5), bell-shaped (Fig. 4, Cluster 8), and modulated (Fig. 4, Cluster 4 and 9). The range of gene expression based on fold change is shown in Table 1 and the distribution of the gene biological GO annotations for each cluster is in Table 2. Clusters 1, 4, 6 and 9 contain the majority of the genes and have a narrow range of change in gene expression (1.2–4.9) compared to the other clusters. Gene GO annotation shows a significant number involved in cellular and macromolecule synthesis; 41–67% of the genes in these clusters have unknown biological roles in the cell.

Table 1 Level of expression of the developmentally regulated genes within each cluster
Table 2 Functional annotation of genes in each cluster

The activity of the genes in Clusters 1 and 5, which show peak of expression at 3–7 DPA, are consistent with the cellular events occurring during Grain Growth stage. These genes are involved in cell division and nucleic acid metabolism (e.g. histones, tubulins), photosynthesis (i.e. chlorophyll a/b binding proteins, ferredoxin) and protein metabolism (i.e. ribosomal proteins, translation initiation and elongation factors, proteosomal proteins). A significant number of the genes in these clusters (35–41%) still have unknown biological functions in the cell.

The major metabolic activity occurring in the grain filling (14–24 DPA) stage of grain development is the synthesis and accumulation of storage molecules, starch and proteins. The main storage proteins are the prolamins, glutenins, and gliadins, which are the major components of the gluten polymer that determines the economic significance of wheat (Shewry and Halford 2002). Most of the storage protein genes are found in Clusters 2 and 7. Genes in these clusters show a dramatic rate of increase in transcript level between 7 DPA and 14 DPA, peak expression at 21 DPA and then plateau thereafter. The average fold change in gene expression in Cluster 2 is 12, those for Cluster 7 is 42. Cluster 7 contains the most highly differentially expressed genes in the endosperm, with a minimum fold change of 17 and the highest of 115. Most of which the genes in Cluster 7 are involved in grain filling: storage protein genes, grain softness protein, and α-amylase inhibitors (involved in metabolic and possibly defense functions). 9 of the 67 genes in Cluster 7 are unknown proteins. The maximum rate of storage protein transcript and protein accumulation both occur at the same time between 7 DPA and 14 DPA, which is consistent with results obtained by others (Altenbach and Kothari 2004; Clarke et al. 2000). This observation supports the hypothesis that storage protein gene transcription, poly (A)+ RNA processing, and translation are closely coupled in the developing wheat caryopsis, suggesting that a major control of storage protein accumulation is at the transcriptional level.

Cluster 8, which contains 34 genes, shows a bell-shaped pattern of expression with a peak in transcript level at 14 DPA. The cluster represents mainly metabolic enzymes involved in carbohydrate and protein metabolism: orthophosphate dikinase, aspartic protease, aspartate aminotransferase, alanine aminotransferase, sucrose synthase 2 and ADP-glucose pyrophosphorylase small subunit, which is a component of an enzyme involved in the first committed step in starch biosynthesis (Tetlow et al. 2004). The bell-shaped expression of the carbohydrate and protein metabolic genes correlates with the observed rate of starch and protein accumulation in the grain (Fig. 1), except that the maximal rate of accumulation of these storage products is observed between 14 DPA and 21 DPA, following the maximum of the transcript expression. This is expected since the mRNA is the template for protein synthesis. It should be noted, that the accumulation of transcripts involved in starch biosynthesis occurs as early as 3 DPA. This could be from the synthesis of starch in the pericarp tissue, which is still photosynthetic at this stage. However, the presence of starch biosynthesis genes at 3 DPA could also support the morphological observation that starch granules are present as early as the coenocytic stage of endosperm development (Hughes 1976; Simmonds and O’Brien 1981).

Cluster 10, which contains 10 genes, shows a maximum rate of transcript accumulation between 7 DPA and 21 DPA, and the transcripts continue to be up-regulated even after the desiccation program has begun late in grain development. The transcript level of 4 of the genes peaked at 21 DPA and the other 6 continue increasing thru the last data point at 35 DPA. Among the members of this cluster are gamma-purothionin, a well-characterized defense protein (Bruix et al. 1993; Colilla et al. 1990) and Ec-protein (‘early’ cysteine-labeled metalloprotein). Not much is known about the function of Ec in wheat (Hanley-Bowdoin and Lane 1983). A review of plant metalloproteins (Bilecen et al. 2005), suggested that these proteins may be involved in stress responses, programmed cell death, developmental regulation and heavy-metal metabolism.

The cluster analysis, which classified genes with similar expression profiles across development, revealed the enrichment of genes associated in similar metabolic or developmental pathways. The co-expression of genes in a pathway has been observed in other large-scale expression analyses (the most recent ones include Gachon et al. 2005; Wellmer et al. 2006; Zhang et al. 2005). This association within the cluster could provide insight into the function of genes with unknown function. Clustering of similar or paralogous/orthologous genes involved in the same pathway could also serve to improve approaches to manipulate these pathways by gene silencing. Finally, the genes in each cluster could provide a good source of promoters to drive the transcription of transgenes to a specific expression profile during caryopsis development.

Genes of unknown function

As described in the previous section, the majority of genes that are differentially expressed during grain development have unknown function. Many of these genes cluster together with annotated genes involved in defined biological processes such as cell division, carbohydrate and protein synthesis, and desiccation. Analysis of gene expression patterns combined with gene sequence analysis could provide testable hypotheses as to the function of these unknown genes. To test the assumption of “guilt-by-association” and gain insight into the character of unknown genes we looked more closely at the genes in Cluster 7 (Fig. 4). The majority of the 67 genes that grouped to Cluster 7 are storage proteins and defense proteins. Nine genes in this cluster could not be ascribed a function based on BLASTX and BLASTN best match hits. Using the TIGR wheat index EST annotator (http://www.tigr.org/tigr-scripts/tgi/est_ann.pl?db=wheatest), we checked whether the probe for the unknown genes in Cluster 7 assembled with other ESTs to form a contig. If so, then the contig sequence was used to compare against the non-redundant GenBank DNA and protein databases. If the unknown gene had no match to known genes, the protein domain database was searched for common protein motifs (Marchler-Bauer et al. 2005). With this strategy, the unknown gene encoded by BG262302 showed identity to a seed storage protein (GenBank accession Q7XYF0) and BE606942 is similar to a ferrochelatase protein (GenBank accession Q9FEK8). Interestingly, the derived translation protein of BQ804784 sequence showed similarity to an mRNA binding-like protein, whereas, BQ167790 sequence has 3 RRM domains, an RNA recognition motif (Marchler-Bauer et al. 2005). The other 5 genes contain no recognizable domains, therefore, remained unknown.

Expression of developmental stage-specific genes

To further examine gene expression at different stages of grain development, the genes that showed peaks of transcript level at each stage were selected (see Supplemental Data II). Genes highly expressed at a particular stage may indicate relevance for stage-specific developmental functions. Of the 2,295 genes analyzed, 652 showed peaks of expression at 3 DPA, 791 genes peaked at 7 DPA, 219 at 14 DPA, 502 at 21 DPA, 30 at 28 DPA and 101 genes at 35 DPA (see Supplemental Data II). Genes that peaked in expression in the same stage showed clustering of genes of defined function. For example, the maximally expressed genes at 35 DPA include dehydrins and late embryo abundant (LEA) genes, protein synthesis inhibitors, including tritin, proteinase inhibitors, serpin and thaumatin. Dehydrins and LEA proteins (Farrant et al. 2004) are produced in response to environmental stimuli with a dehydrative nature, such as drought, low temperature, salinity, and developmental stages such as seed and pollen maturation (Campbell and Close 1997; Danyluk et al. 1998). Their up-regulation was characteristic only during the last stages of the grain development studied and can signify entering of the grain into the dry-down stage. Tritin is a ribosome inactivating protein (Nielsen and Boston 2001) and has been proposed to have a defense role (Massiah and Hartley 1995) or a suggested role in the programmed senescence of the endosperm at maturity in barley (Leah et al. 1991). Its transcript abundance is significantly increasing between 28 DPA and 35 DPA, when protein biosynthesis is terminating and the seed is preparing for desiccation.

Graphical representation of the expression profiles of genes that peaked at different stages (Fig. 5) displays the continuous expression of these genes from one stage to another. For example, genes with a peak of expression at 14 DPA continue their expression thereafter, suggesting a continued functional requirement at later stages in grain development. Several genes involved in starch biosynthesis peak at 14 DPA (i.e. ADP-glucose pyrophosphorylase, starch synthase IIa-3 and starch branching enzyme IIb) and continue to be expressed up to 35 DPA.

Fig. 5
figure 5

Gene expression at different stages of grain development The left panel shows the morphology of the whole caryopsis at the different time points of development used on the microarray experiments. The yellow bar in the topmost panel is 2 mm. The middle panel shows the pattern of expression of genes that peak at each specific stage. The relative expression value (y-axis) is plotted against developmental time (x-axis). The right panel shows the major physiological and biological processes occurring in the endosperm at different developmental stage based on histological sections and literature

Fig. 6
figure 6

Physiological markers of development and co-expressed genes. Genes with transcript expression profile pattern similar to the (A) pattern of accumulation (or loss) FW, DW and WC per caryopsis and (B) the pattern of the rate of accumulation of each of the designated physiological markers. The numbers outside the circles corresponds to the number of genes that did not show similar pattern (at Pearson correlation >0.95) as the physiological markers. DW, Dry weight; FW, Fresh weight; WC, water content

Overall, the analysis of the temporal expression patterns of developmentally regulated genes showed gene activities that were well correlated with the developmental events at the respective stages.

Physiological markers of caryopsis development and associated differentially expressed genes

We examined the relationship of the 497 genes that have changed at least 2-fold or greater in expression across the 6 time-points studied with the physiological markers FW, WC and DW of a developing caryopsis (Fig. 6A). Caryopsis FW is a function of DW and WC accumulation. FW is primarily due to WC at the early stages of development. At 14 DPA the accumulation of WC tapers off whereas that of DW continuous to rise. The increase in FW at the later stage is due to the accumulation of the storage compounds specially starch. Our analysis showed that the pattern of expression of 234 genes correlated well (Pearson correlation = or greater than 0.95) with the pattern of accumulation of these physiological markers (Listed in Supplemental Data IV-a). A majority (196 genes) correlated with WC, 102 of which are shared with FW; 136 genes correlated with FW, 20 of which are unique to FW. Only 18 correlated with DW, 14 shared with FW. All of the transcripts from storage protein genes within this group of 497 genes showed similar pattern of accumulation to WC.

To identify genes that may potentially play a role in the control of the accumulation of the physiological markers, we examined the pattern of expression of genes that correlates with the RATE of DW, FW and WC accumulation (Fig. 6B). Our data showed 36 genes with significant correlation (Listed in Supplemental Data IV-b). The majority of the genes (35) correlated with DW rate of accumulation. The expression of one gene (encoding a hypothetical protein) correlated with the rate of accumulation of FW and none correlated with WC. The genes that correlated with the pattern of DW rate of accumulation include 11 hypothetical proteins and proteins without known function. Genes with annotations include those involve in the synthesis, accumulation and probable protection of the major storage compounds, starch and storage proteins, both of which substantially contribute to caryopsis DW. These include genes that code for starch synthesis enzymes, starch branching enzyme I and ADP-glucose pyrophosphorylase large subunit, a component of an enzyme complex, the activity of which catalyzes the first committed step in the biosynthesis of starch; inhibitors of protein degradation, protease inhibitor II, a putative Hageman factor (a serine protease inhibitor) and PUP88 protein (a member of the trypsin/α-amylase family of inhibitors); and probable defense proteins—puroindoline, spodomicin and purothionin.

Interestingly, the expression of several genes involved in protein degradation also correlated with the rate of DW accumulation. These include serine carboxypeptidase I and II, and a putative 26S proteosome regulatory subunit. Serine carboxypeptidases are proteolytic enzymes reported to be expressed in the aleurone layer and participate in the mobilization of endosperm storage proteins during cereal grain germination (Dal Degan et al. 1994). These genes are also expressed in the vascular tissue, where they might be involved in programmed cell death in germinating grains; and in the scutellum, where their function is still unclear (Dominguez et al. 2002). The 26S proteosome, which is a large ATP-dependent protease composed of two multiprotein complexes, a 19S regulatory complex and the 20S proteosome, degrades short-lived intracellular proteins marked by ubiquitin (Smalle and Vierstra 2004). Proteases are necessary for protein turnover and development—degradation of damaged, and misfolded proteins provides free amino acids required for the synthesis of new proteins. The breakdown of selected regulatory proteins by the ubiquitin/proteosome pathway could control key aspects of grain growth and development; and limited proteolysis at highly specific sites is required in the maturation and sub-cellular targeting of enzymes and storage proteins.

Grain weight, which is a major component of grain yield, is determined by the rate and duration at which the grain accumulates dry matter. It has been reported that wheat cultivars show significant variations in the rate of dry matter accumulation (Bruckner and Fruhberg 1987; Chanda et al.1999) and that increases in rate of caryopsis dry matter accumulation may be an important factor leading to improved yield (Hartung et al. 1989). One could test whether the genes with expression correlating with the pattern of DW rate of accumulation could be used as possible markers in the selection for increased rate of dry matter accumulation that may result in greater yield potentials in crops.

Transcription factors involved in grain development

In spite of the limitations in the sensitivity of the arrays for detecting low expressed genes the expression of a significant number of transcription factors were detected. Of the 175 genes on the whole array that showed similarity to transcription factors (BLASTX at E-value ≤ 10−5), 49 were differentially expressed across the 6 time-points examined. Cluster analysis identified 4 groups of transcription factors (A–D) each with distinct patterns of expression (Fig. 7). Clusters A and C show maximal expression during later anthesis but with different kinetics. Cluster A shows a low expression at 3 DPA, rapidly increasing between 7 DPA and 14 DPA, peaking at 21 DPA then slightly decline but still highly expressed thereafter. This cluster is composed of 7 genes including three no apical meristem (NAM) transcription factors, a WRKY transcription factor, a negative regulator of transcription and an F-box containing protein. NAM is a member of a family of transcription factors found only in plants (Ooka et al. 2003). NAM plays a role in the establishment of the shoot apical meristem. The bZIP transcription factors in this cluster show the highest change difference (5.4-fold) in expression from early anthesis to late anthesis. Their expression profile correlates with that of the storage proteins (gliadins) especially those involved in gluten formation. Previous studies have shown that gluten protein genes are regulated by bZIP domain containing transcription factors (Albani et al. 1997). Cluster B contains 4 genes; two MADS-box containing transcription factors, a putative transcription factor and a zinc-finger containing protein.Clusters C and D have a narrow range of change in gene expression from early to late anthesis. The average change of expression for these groups is 1.4-fold. These clusters include some of the general transcription factors like the TATA- and CCAAT-box binding proteins.

Fig. 7
figure 7

Transcription factors: A hierarchical clustering of the 49 transcription factors that are differentially expressed across all time-points examined using average linkage clustering and Euclidian distance. The hierarchical cluster color code: yellow bar represents the median value of gene expression during development; red bar represents higher expression than median value of gene expression and the green bar represents lower than median value of gene expression. k-Means clustering was used to group the developmentally regulated genes into 4 classes. The color of the k-means class letter ID corresponds to the color of the vertical bar between the GenBank accession number of each gene and its description; the vertical bar color marks which gene assembled to each of the k-means cluster

Fig. 8
figure 8

Data validation by RT-PCR Comparison of gene expression profile of two genes using microarray and RT-PCR: the top panels show the expression profile of MADS-box 9 transcription factor and tritin at the 6 time-points examined using cDNA microarrays. The middle panels show the PCR fragments amplified by RT-PCR from RNA isolated from developing grains using gene-specific primers. Bottom panel show the amplification product for the 18S rRNA control

Validation of wheat microarray data

Semi-quantitative RT-PCR was used to determine the expression of selected clones with distinct patterns of expression to confirm the trends in mRNA accumulation determined by cDNA arrays (Fig. 8). RNAs isolated from a fourth biological sample (different from the set of biological samples used for hybridization on the array) were used as templates for amplification. The pattern of the MADS-box and tritin gene transcript accumulation correlated well with that obtained by cDNA arrays. The good agreement of gene expression on a different biological sample other than what was used on the array provides further support to the reliability of the microarray results. The patterns of expression of 5 other clones examined by RT-PCR were consistent with the pattern of expression of the same clones obtained using the cDNA arrays (data not shown). Our results indicate that the trend of RNA accumulation obtained with the use of the cDNA microarrays we generated is consistent with the data obtained with other methods used for measuring gene expression.

Concluding remarks

This study describes the expression dynamics of global gene expression during wheat caryopsis development using DNA microarrays. We report the construction and validation of an 8K wheat cDNA microarray for transcriptome analysis. We provide a comprehensive list of developmentally regulated wheat caryopsis genes and their expression profiles, including the information on the temporal expression of a significant number of uncharacterized genes. Several files containing supplemental data are provided for more comprehensive information on the differentially expressed genes discussed in this paper (also available at http://wheat.pw.usda.gov/pubs/2007/Laudencia).