Introduction

Next-generation sequencing technologies have provided unprecedented opportunities for high-throughput functional genomic research, including gene expression profiling, genome annotation, small ncRNA discovery, and profiling and detection of aberrant transcription (Bentley 2006; Morozova and Marra 2008). Among these approaches, RNA sequencing (RNA-Seq) is a powerful new method for mapping and quantifying transcriptomes developed to analyze global gene expression in different tissues. Recently, this technique has also been used as an efficient and cost-effective method to systematically identify SNPs in transcribed regions in different species (Chepelev et al. 2009; Cirulli et al. 2010; Cloonan et al. 2008; Morin et al. 2008).

RNA-Seq generates sequences on a very large scale at a fraction of the cost required for traditional Sanger sequencing, allowing the application of sequencing approaches to biological questions that would not have been economically or logistically practical before (Marguerat et al. 2008). Taking this into account, we applied this novel approach to identify SNPs in the expressed coding regions of the bovine milk transcriptome.

The majority of gene expression analyses in the bovine mammary gland have been developed using a biopsy sample (Boutinaud and Jammes 2002; Finucane et al. 2008). An alternative sampling procedure has been proposed by isolating mRNA directly from somatic cells that are naturally released into milk during lactation (Boutinaud et al. 2002). Recently, Medrano et al. (2010), using the RNA-Seq technique, compared the milk and mammary gland transcriptomes and showed extensive similarities of gene expression in both tissues.

In the present study we performed a SNP discovery analysis in milk transcriptome using RNA-Seq technology. For this purpose, seven milk samples from Holstein cows at different stages of lactation were analyzed by sequencing cDNA libraries using an Illumina GAII analyzer (Illumina, San Diego, CA) system. To evaluate the accuracy of SNPs detected with RNA-Seq, a comparison was made with SNPs detected in a set of 42 candidate genes expressed in milk that had been resequenced previously using Sanger sequencing technology. SNPs that were observed with only one technique were validated by the KASPar SNP Genotyping System.

Materials and methods

RNA-Seq library preparation

Seven milk samples were obtained from Holstein cows at two stages of lactation (day 15 and day 250). Milk samples were collected in 50-ml tubes 3 h after milking, kept on ice, and processed immediately for RNA extraction. Samples were centrifuged at 2000×g for 10 min to obtain a pellet of cells. Total RNA was purified following a Trizol protocol (Invitrogen, Carlsbad, CA), and mRNA was isolated and purified using an RNA-Seq sample preparation kit (Illumina). mRNA was fragmented and first- and second-strand cDNA were synthesized. After adapters were ligated to the ends of double-stranded cDNA, a 300-bp fragment size was selected by gel excision and each sample was individually sequenced on an Illumina GAII analyzer.

RNA-Seq analysis and SNP detection

Short sequence reads (36-40 bp) were assembled and mapped to the annotated bovine reference genome Btau4.0 (http://www.ncbi.nlm.nih.gov/genome/guide/cow/index.html) using CLC Genomics Workbench software (CLC Bio, Aarhus, Denmark). Sequencing reads for each of the seven samples were pooled to perform the RNA-Seq and SNP discovery analyses. We applied stringent criteria in order to reduce the rate of detection of false-positive SNPs. For the assembly procedure, the sequences were mapped to the consensus genome accounting for a maximum of two gaps or mismatches in each sequence. Reads were then classified as uniquely mapped reads and nonspecifically mapped reads as shown in Table 1. SNP detection was performed using the following quality and significance filters: (1) the minimum average quality of surrounding bases and minimum quality of the central base were set as 15 and 20 quality score units, respectively; (2) minimum coverage was set at ten reads; (3) minimum variant frequency or count was set at 20% or two read counts per SNP; and (4) SNPs located in read ends (last three bases) were not considered in the analysis due to possible sequencing errors.

Table 1 Summary of mapping all the RNA-Seq reads to the reference genome (Btau4.0) obtained from seven pooled milk samples

Sanger resequencing of target genes and SNP detection

Resequencing was performed in a DNA resource population specifically developed for SNP discovery as described by Rincon et al. (2007). This population consisted of eight Holstein animals that were unrelated at least three generations back in their pedigrees. Genomic sequences for 42 candidate genes that were expressed in milk samples were obtained from the Btau4.0 assembly and resequenced using Sanger sequencing technology. Exons and conserved noncoding regions were identified using multiple-species genome alignments with Genome VISTA (Couronne et al. 2003). Coding regions and the conserved noncoding regions of each gene were resequenced at SeqWright DNA Technology Services (Houston, TX) using Sanger sequencing technology. SNPs were analyzed using CodonCode aligner software (http://www.codoncode.com); gene sequences and SNPs were assembled and annotated in Vector NTI advance 10.1.1 software (Invitrogen, Carlsbad, CA).

SNP validation by the KASPar SNP genotying system

The KASPar SNP Genotyping System (KBiociences, Herts, UK) was used to validate SNPs detected by RNA-Seq and not detected by Sanger sequencing. For this purpose, 15 bovine DNA samples (8 cows used for Sanger resequencing and 7 cows used for RNA-Seq) were selected. Genomic DNA was extracted from 5 ml of cow’s blood following the protocol of the Gentra Puregene blood kit (Qiagen, Valencia, CA). KASPar assay primers (Table 2) were designed using the Primer Picker software available at http://www.kbioscience.co.uk/primer-picker.htm (KBiociences). Genotyping assays were carried out with a 7500 Fast Real Time instrument (Applied Biosystems, Foster City, CA) in a final volume of 8 μl containing 4× Reaction Mix (KBiociences), 120 nM of each allele-specific primers and 300 nM of common primer, 2.2 mM of MgCl2, and 2 mM KTaq polymerase (KBiociences). The following thermal profile was used for all reactions: 15 min at 94°C; 20 cycles of 10 s at 94°C, 5 s at 57°C, and 10 s at 72°C; and 18 cycles of 10 s at 94°C, 20 s at 57°C, and 40 s at 72°C.

Table 2 KASPar primers used to validate SNP detected by RNA-Seq

Results and discussion

Detecting genetic variation in pooled milk transcriptome reads by RNA-Seq

RNA-Seq analysis included 118 million reads, ranging from 36 to 40 bp in size, that were assembled and mapped to the annotated NCBI bovine whole-genome assembly (27,368 genes). An average of 17 million short-sequence reads was obtained for each individual sample. The median coverage for the exons was 38×. The analysis revealed that 82.7 million reads (~70%) were categorized as mapped reads (68.9 million were uniquely mapped reads and 13.7 million were nonspecifically mapped reads), while 35 million were unmapped reads (Table 1). Most of the uniquely mapped reads corresponded to total exon reads (87.5%), whereas a small fraction corresponded to total intron reads (12.5%; Table 1). Intron reads are expressed regions that are not annotated as exons in Btau4.0.

RPKM (reads per kilobase per million mapped reads) (Mortazavi et al. 2008) values were used to identify the total number of genes that were expressed in the milk transcriptome. A RPKM threshold value of 0.3 was established in order to balance the number of false positives and false negatives as described in Bentley et al. (2008) and Ramskold et al. (2009). A total of 13,807 genes were selected with RPKM threshold values greater than 0.3. For those genes with RPKM < 0.3, a detailed analysis was performed to determine the number of unique reads falling outside exon regions that can be representing either annotation errors or new exons not included in the current Btau4.0 genome. Using this strategy, 5368 expressed genes/regions were found with more than ten unique reads. We detected 19,175 expressed genes in milk samples (70.06%) of the 27,368 total bovine annotated genes in Btau4.0 genome assembly. The SNP detection analysis revealed 100,734 SNPs in the seven Holstein samples. Of these SNPs, 67,689 (67.2%) were homozygous, corresponding to differences between Holsteins and the Hereford bovine whole-genome assembly Btau4.0. This is a large number of SNPs that are fixed in Holstein for a different allele to that found in the Hereford genome reference and requires further investigation. In some cases these SNPs may represent artifacts due to errors in the reference sequence or due to misalignment of the short reads to the reference (see subsection “Validation of unique SNP detected by RNA-Seq” below). It may also be possible that some of the Holstein fixed SNPs in fact correspond to variants with a very low frequency and a large number of cows will be needed to detect the common Hereford variant. A total of 33,045 (32.8%) SNPs were polymorphic within Holsteins. Allele frequencies for these heterozygous SNPs were obtained for the pooled samples by counting the number of reads representing each allele. In summary, 1,849 SNPs had an allele frequency of 80/20, 5,511 SNPs had an allele frequency of 70/30, 15,411 SNPs had an allele frequency of 60/40, and 10,274 SNPs had an allele frequency of 50/50. Figure 1 represents the total number of SNPs per gene mapped to the bovine Btau4.0 genome assembly. SNPs that are different between the Hereford consensus sequence and that of Holstein are shown in red, and SNPs that are polymorphic in Holstein samples are in blue. We observed that SNPs in expressed regions are distributed along the entire genome, but there are an increasing number of polymorphisms located in the extremes of the chromosomes’ centromeric and telomeric regions. The pattern of SNP distribution in each chromosome is very similar between those that differentiate Holstein and Hereford and those SNPs that are polymorphic in Holsteins, suggesting that there are genomic regions that tend to accumulate a large number of SNPs. Interestingly, the collagen family genes in BTA1, BTA4, BTA12, and BTA19 showed the highest SNP count difference between the Holstein and the Hereford consensus sequences. A large amount of data was generated in this study; a detailed description of the SNPs is available from the authors upon request.

Fig. 1
figure 1

Total number of SNPs per gene expressed in milk cells mapped to the Btau4.0 genome assembly. Red dots represent the number of SNPs per gene in coding regions that are different between Hereford and Holstein. Blue dots represent the SNPs per gene that are polymorphic in Holsteins

Accuracy of RNA-Seq technology for SNP detection

To analyze the accuracy of RNA-Seq technology for SNP detection, 42 genes highly expressed in milk and related to fatty acid synthesis and the growth hormone GH/IGF axis were resequenced using Sanger methodology. Nine genes did not show polymorphisms in exons by Sanger resequencing and were excluded from the SNP discovery and validation analyses: IGF1, IGFBP3, IGFBP4, MBTPS1, MBTPS2, NR3C1, PIAS1, STAT2, and STAT4. Eighty-six SNPs were detected in the remaining 33 candidate genes that exhibited variation in Holsteins. Seventy of 86 SNPs were also detected by RNA-Seq in 18 genes (Table 3). From the 16 SNPs that were not detected by RNA-Seq, 6 were located in exons that were not expressed in milk samples and therefore no sequencing reads were found.

Table 3 List of 70 SNPs in 18 genes validated in coding regions using RNA-Seq and Sanger sequencing

It is important to note that the samples used for RNA-Seq were different from those sequenced by the Sanger method, so we were not expecting a 100% concordance of the results. However, it is noteworthy that despite the difference in sample composition in the analysis, only ten SNPs observed in Sanger were not detected in RNA-Seq. On the other hand, five SNPs were observed in three genes, DDIT3 (DNA-damage-inducible transcript 3), INSIG2 (insulin-induced gene 2), and STAT5A (signal transducer and activator of transcription 5A), in RNA-Seq that were not detected by Sanger (Table 4). These SNPs were further validated using the KASPar SNP Genotyping System.

Table 4 Unique SNPs detected by RNA-Seq that were validated using the KASPar Genotyping System

Validation of unique SNPs detected by RNA-Seq

In order to confirm the presence of the five SNPs uniquely found by RNA-Seq, they were genotyped using the KASPar SNP Genotyping System. Three out the five SNPs were validated, as shown in Table 4. Two SNPs in the STAT5A gene that failed with the KASPar assay were further examined with a detailed analysis of the sequence reads containing the putative SNPs. We observed that the corresponding 40-bp sequence that mapped to a STAT5A region had a 99% homology with the STAT5B gene. In the SNP discovery analysis we set up a threshold of a maximum number of mismatches to two. With this mismatch rate, reads that correspond to a given gene, like STAT5B, can be assigned to STAT5A. This was not a common situation for most of the genes studied in this analysis, but it could represent a problem in gene families with highly conserved domains when using short sequence reads. In a similar study, Cirulli et al. (2010) observed that some false-positive SNPs identified in cDNA arose from alignment of a read to the wrong gene and that in these cases the correct gene and the gene chosen for the alignment always had very similar sequences. This situation has also been observed in regions associated with sequence repeats (Morozova and Marra 2008). Although the short-read structure of next-generation sequencers has some potential problems with respect to sequence assembly, the result is a system that generates accurate data and large coverage of consensus sequence and SNP calling at very high throughput and low cost (Thomas et al. 2006).

Conclusion

We have demonstrated that analyzing the transcriptome using RNA-Seq technology is an efficient and cost-effective method to identify SNPs in transcribed regions. Stringent criteria have to be applied to maximize the accuracy and prevent false-positive SNP detection. This study provides a valuable resource of more than 33,000 SNPs located in coding regions of genes expressed during lactation that can be used for further gene variation analysis and association studies in Holstein cattle.