Elsevier

Biochimie

Volume 84, Issue 9, September 2002, Pages 953-959

A survey of metazoan selenocysteine insertion sequences

https://doi.org/10.1016/S0300-9084(02)01441-4

Abstract

The computational detection of novel selenoproteins in genomic sequences is usually achieved through identification of SECIS, a conserved secondary structure element found in the 3′ UTR of animal selenoprotein mRNAs. Previous studies have used “descriptors” specifying the number of base pairs and the conserved nucleotides in SECIS to identify this element. A major drawback of the “descriptor” approach is that the number of detections in current genomic or transcript databases largely exceeds the number of true selenoproteins. In this study, we instead use the ERPIN program to detect SECIS elements. ERPIN is based on a lod-score profile algorithm that uses a training set of aligned RNA sequences as input. Starting from an initial alignment of 44 animal SECIS sequences, we performed a series of iterative searches in which the training set was progressively enriched up to 117 confirmed SECIS elements from a large collection of metazoan species. About 200 high-scoring candidates were also detected. We show that ERPIN scores for these candidates can be converted into expect values, thus enabling their statistical evaluation. The most interesting SECIS candidates are presented.

Introduction

In most living organisms, selenoprotein genes are interrupted by in-frame UGA codons (normally Stop codons) that are translated into the amino acid selenocysteine. A detailed review of the mechanisms in different species is presented by Krol in this issue 〚1〛. Selenoprotein mRNAs contain a conserved hairpin structure that is required to distinguish UGA Stop codons from UGA selenocysteine codons. This structure, called the Selenocysteine Insertion Sequence (SECIS), is sufficiently large and constrained that it can be used as a screen for the computational identification of selenoprotein genes. Identifying the RNA component of the gene is essential in this case, since the coding sequence is usually misinterpreted or incorrectly assigned by conventional gene annotation methods, which do not take in-frame UGA codons into account. In animal selenoprotein genes, SECIS occurs in the 3′ untranslated region (UTR) of the mRNA and its secondary structure has been characterized in detail 〚2〛, 〚3〛. Several bioinformatics studies have been conducted in recent years to identify SECIS elements in transcribed sequences, successfully identifying several new selenoproteins in mammalian 〚4〛, 〚5〛 and Drosophila 〚6〛, 〚7〛 genomes.

SECIS comprises two nested helical regions of about 5 and 14 base pairs (Fig. 1). The larger helix begins with four non-canonical base pairs comprising a central 5′GA3′/5′GA3′ tandem, often flanked by homopyrimidine pairs on each side. The apical loop is characterized by the presence of 2–4 adenosines on the 5′ side followed, in some cases, by an extra stem of about three base pairs. SECIS elements lacking this apical stem are termed “Form 1”, while the others are termed “Form 2” 〚8〛, 〚9〛. Although constraining, the SECIS structure alone does not permit an unambiguous identification of selenoprotein genes in large sequence databases. Using an RNAMOT descriptor representing the constraints in Fig. 1, Lescure et al. 〚4〛 estimated the frequency of false positive hits to be three per 10 Mb. Kryukov et al. 〚5〛 used looser constraints and obtained about 650 hits per 10 Mb, before applying a free energy screening procedure that lowered the number of hits to about 15 per 10 Mb. Subsequent studies also combined raw structure detection with additional criteria such as free energy or techniques for coding exon recognition 〚6〛. To date, the use of additional screening parameters besides SECIS has been a requirement in all selenoprotein gene identification studies.
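To give a sense of scale for the hit frequencies quoted above, the short sketch below extrapolates them linearly to a larger database. The 3000 Mb database size and the function name are assumptions chosen purely for illustration; only the per-10 Mb figures come from the text.

```python
def hits_in_database(hits_per_window, database_mb, window_mb=10):
    """Linearly scale a hit frequency (hits per window_mb) to a database size in Mb."""
    return hits_per_window * database_mb / window_mb


# Assumed database size, for illustration only (not a figure from the study).
DATABASE_MB = 3000

# Hit frequencies per 10 Mb quoted in the text.
strict_descriptor = hits_in_database(3, DATABASE_MB)    # RNAMOT descriptor estimate
loose_descriptor = hits_in_database(650, DATABASE_MB)   # looser constraints, no energy screen
energy_screened = hits_in_database(15, DATABASE_MB)     # after free energy screening

print(strict_descriptor, loose_descriptor, energy_screened)
```

Under this assumed size, even the strictest descriptor would be expected to return hundreds of hits, which illustrates why structure detection alone has needed additional screening criteria.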

We recently introduced a new computational tool for the identification of RNA motifs that could constitute a more selective means to detect SECIS elements. This program, called ERPIN 〚10〛, is based on a position weight matrix or “profile” model, specially adapted to handle base-paired structures. Each single-stranded and helical element in an RNA molecule is represented by a profile, and profiles are instantiated onto database sequences using a dynamic programming algorithm. This approach requires an initial “training set” of the RNA sequences under study and offers several advantages: it does not require writing any descriptor, it is usually more specific than descriptor-based programs, and it provides an objective scoring of solutions based on their similarity to training set sequences and structures.
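The profile idea described above can be sketched in a few lines of Python. This is a simplified illustration and not ERPIN's actual implementation: it assumes a uniform background, a fixed pseudocount, gap-free elements, and a candidate that is already aligned to the motif, whereas ERPIN places profiles on database sequences by dynamic programming. The toy training sequences and all function names are hypothetical.

```python
import math

BASES = "ACGU"
PAIRS = [a + b for a in BASES for b in BASES]  # the 16 possible base pairs


def lod_profile(columns, alphabet, pseudocount=1.0):
    """Build one dict of log-odds scores per alignment column.

    columns: list of columns, each a list of symbols (bases or base pairs).
    The background is assumed uniform over the alphabet (an assumption).
    """
    background = 1.0 / len(alphabet)
    profile = []
    for column in columns:
        total = len(column) + pseudocount * len(alphabet)
        profile.append({
            symbol: math.log2(((column.count(symbol) + pseudocount) / total) / background)
            for symbol in alphabet
        })
    return profile


def lod_score(symbols, profile):
    """Sum the per-column log-odds scores of an already aligned candidate."""
    return sum(scores[symbol] for symbol, scores in zip(symbols, profile))


# Hypothetical toy training set: a 4-nt single strand and a 2-bp helix.
strand_alignment = ["AAGA", "AAGA", "AUGA", "AAGA"]
strand_profile = lod_profile([list(c) for c in zip(*strand_alignment)], BASES)

helix_alignment = [["GC", "AU"], ["GC", "GC"], ["GU", "AU"]]
helix_profile = lod_profile([list(c) for c in zip(*helix_alignment)], PAIRS)

# A helix column is scored on the pair as a whole, not on each base separately.
candidate_score = lod_score("AAGA", strand_profile) + lod_score(["GC", "AU"], helix_profile)
decoy_score = lod_score("CCCC", strand_profile) + lod_score(["GG", "AA"], helix_profile)
print(candidate_score, decoy_score)
```

Treating each helix column as one of 16 pair symbols is what lets a profile of this kind reward covarying base pairs that a single-base weight matrix would miss.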

For this study, we built an initial alignment of 44 SECIS sequences and used this training set and the ERPIN program to scan a database of eukaryotic transcripts and genomic sequences. The training set was iteratively enriched with homologous SECIS structures collected during five successive rounds of database searches. The final collection of 117 aligned SECIS elements is the largest available to date and should be helpful in the assessment of base and base pair constraints for structural or functional studies. In addition, we can now use this enhanced collection to identify new selenoprotein gene candidates. Some high-scoring candidates are provided here, selected on purely statistical criteria and/or the presence of homologous SECIS sequences in different species.

Section snippets

The ERPIN program

The basic algorithm in ERPIN has been published elsewhere 〚10〛. We used version 2 of the program, which presents significant improvements in the handling of multi-helix motifs. ERPIN 1 used four basic elements for searches: helix, strand, hairpin and pairs of helices. Basic elements could be combined, but optimal matches were guaranteed only within each basic element. ERPIN 2.1 handles more complex elements by creating a set of “configurations” based on the gaps present in the training set. If

Confirmed SECIS

The initial training set contained representative sequences from all animal selenoprotein SECIS elements, except for the newly discovered SelM SECIS 〚16〛, which presents a significant deviation from other animal SECIS (CCC instead of AAA in the 5′ apical loop) that would greatly reduce search specificity. A specific training set would be more appropriate for the detection of this particular element. The first two rounds of iterative search, performed against the HGI databases, yield 46 SECIS

Conclusion

We have introduced a computational screen for SECIS elements based on the ERPIN program, differing from previously published protocols by a new search algorithm and the introduction of a statistical evaluation of candidates. Potential SECIS are scored based on their resemblance to SECIS elements in a training set and this score S is converted into an E-value expressing the number of expected hits with the same or a higher score in a random database. The mean score for SECIS elements in the training set
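The E-value definition given above (expected number of hits with the same or a higher score in a random database) can be illustrated with a simple empirical estimate. This excerpt does not describe how the conversion is actually performed, so the counting approach, the function name and the numbers below are assumptions made for illustration only.

```python
def expected_hits(score, random_scores, random_size_nt, database_size_nt):
    """Empirical E-value estimate.

    Counts hits scoring at least `score` among the hits obtained by scanning
    random_size_nt nucleotides of random sequence, then scales that count
    linearly to a database of database_size_nt nucleotides.
    """
    hits = sum(1 for s in random_scores if s >= score)
    return hits * database_size_nt / random_size_nt


# Hypothetical scores of hits found while scanning 1000 nt of random sequence.
random_scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

# Expected number of hits scoring >= 8.0 in a 10000-nt random database.
print(expected_hits(8.0, random_scores, 1000, 10000))
```

A candidate whose estimated E-value is well below 1 would be unlikely to arise by chance in a database of that size, which is the kind of statement a raw score alone cannot support.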

Acknowledgements

We thank Dr. Alain Krol for critical reading of the manuscript.

Cited by (18)

  • Coordination of deiodinase and thyroid hormone receptor expression during the larval to juvenile transition in sea bream (Sparus aurata, Linnaeus)

    2010, General and Comparative Endocrinology
    Citation Excerpt :

    Vertebrate deiodinases need a reducing co-factor for appropriate enzyme activity and contain a selenocysteine (Sec) residue in the active site that is fundamental for removal of iodine from THs (Bianco et al., 2002; Buettner et al., 2000; Köhrle, 2000; Kuiper et al., 2002, 2003, 2005). The Sec residue is encoded by UGA that in normal circumstances stops translation, but which in the context of a SElenoCysteine Insertion Sequence (SECIS), in the 3′UTR of deiodinase mRNAs, leads to insertion of a Sec residue in the transcribed deiodinase protein (Buettner et al., 1998; Fagegaltier et al., 2000; Kollmus et al., 1996; Lambert et al., 2002). In all vertebrates in which deiodinases have been studied three genes which encode three different enzymes have been found (Bres et al., 2006; Croteau et al., 1996, 1995; Davey et al., 1999; Hernandez et al., 1999; Klaren et al., 2005; Leonard et al., 2000; Orozco et al., 2002, 2003; Sanders et al., 1999; St. Germain et al., 1994; Sutija et al., 2003; Valverde et al., 1997).

  • A Method for Identification of Selenoprotein Genes in Archaeal Genomes

    2009, Genomics, Proteomics and Bioinformatics
    Citation Excerpt :

    Furthermore, other tools can also be freely integrated to Asec-Prediction if they enable it achieving better prediction. For example, Lambert et al. reported that ERPIN is effective to detect SECIS elements (30). Thus, Asec-Prediction can be updated timely with much higher prediction accuracy.

  • A Dedicated Computational Approach for the Identification of Archaeal H/ACA sRNAs

    2007, Methods in Enzymology
    Citation Excerpt :

    In the first step (Fig. 15.1, step 1), H/ACA‐like motifs are detected by use of the profile‐based ERPIN program (Gautheret and Lambert, 2001). This program has been applied to the search of a wide range of RNA motifs (Lambert et al., 2002, 2004; Legendre et al., 2005). Once H/ACA‐like motifs are identified, their putative target(s) in rRNAs are searched (Fig. 15.1, step 2) by use of the descriptor‐based RNAMOT program.
