Overview
Transcriptome sequencing encompasses a wide variety of applications, from simple mRNA profiling to discovery and analysis of the entire transcriptome, including both coding mRNA and non-coding RNA (e.g., miRNA, small RNAs, lincRNAs). These applications, collectively called RNA-Seq, are extremely popular on next-generation sequencing platforms because they require no prior knowledge of the transcript sequence and can therefore uncover information that array-based platforms may miss. RNA-Seq was initially used primarily for discovery applications (rare genes, splice junctions, gene fusions) and with novel or poorly studied organisms for which there weren’t good standard microarrays. However, as the technology becomes more available and costs come down, it is increasingly used for RNA profiling (sample comparison) as well. Because it is sequencing-based, it is also well suited to specialty applications such as RNA editing and allele-specific expression. While there are a variety of RNA-Seq applications and protocols, most follow the basic strategy of isolating RNA (for example, using oligo-dT to pull down polyadenylated mRNA), converting it to DNA, and then adding adaptor sequences to generate a library suitable for sequencing.
Details
mRNA-Seq
One of the most popular forms of transcriptome sequencing is mRNA-Seq, which targets all polyadenylated mRNA transcripts, i.e., the coding portion of the transcriptome. The depth offered by next-generation sequencing is leveraged to find new genes that were previously undetectable because of their low level of expression. By the same token, the increased depth and reduced cost of sequencing mean these projects can be used to profile gene expression while differentiating between isoforms of the same gene via paired-end reads.
Small RNA-Seq
Attempts to target small RNA for sequencing have been driven by interest in microRNAs, as they act as regulatory elements in the transcriptome. By binding complementary sequence in the 3′ region of mRNA transcripts, microRNAs mark these transcripts for degradation prior to translation. The depth offered by next-generation sequencing again lends itself to both novel microRNA gene discovery and expression profiling. The major problem in preparing these samples is the difficulty of separating the small RNA species from similarly sized adapter-dimers. Gel-based size selection, which recovers adapter-bound small RNAs while excluding adapter-dimers from the sample, is employed here.
Tag-based approaches
The original RNA sequencing method used with next-generation sequencing was ‘digital gene expression’ (DGE), a modified form of ‘serial analysis of gene expression’ (SAGE). In this method a single short tag (~23 bases) is generated for each transcript, defined by the position of a four-base restriction enzyme recognition site relative to the polyA tail. (In practice, additional tags can be generated by the presence of polyA stretches elsewhere in the transcript.) Because each transcript contributes only a single tag (or at most a few), DGE libraries are substantially less complex than standard RNA-Seq libraries, allowing useful results with fewer reads per sample (1-2 million compared with ~30 million). Alternatively, higher read depths allow for the detection of extremely rare transcripts, which can be useful when looking for single or low copy transcripts in a population of cells. The SuperSAGE method improves on the original DGE methods by generating larger tags, allowing for more precise alignment to the transcriptome. While these methods still have adherents, the rapidly falling cost of sequencing is diminishing the advantage of needing fewer reads, and other RNA-Seq methods, with their more comprehensive coverage of the transcriptome, are gaining in popularity. However, variants of these “tag profiling” methods, such as ‘cap analysis gene expression’ (CAGE), can be very useful for determining the precise 5′ ends of transcripts.
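The way a DGE tag is defined can be illustrated in silico. The following is a minimal sketch, not the published protocol: it assumes the recognition site is CATG (a hypothetical choice of four-base site), that the tag includes the site itself, and that the tag is exactly 23 bases long. The function name `extract_tag` and its parameters are made up for illustration.

```python
def extract_tag(transcript, site="CATG", tag_length=23):
    """Return the tag anchored at the 3'-most recognition site, or None.

    Assumptions (not from the source): the site is CATG, the tag starts
    at the site and includes it, and tags are a fixed 23 bases long.
    """
    # Drop the trailing polyA tail so the search covers the transcript body.
    # (Simplification: genuine A's at the very end of the body are dropped too.)
    body = transcript.upper().rstrip("A")

    # The tag is defined by the recognition site closest to the polyA tail.
    pos = body.rfind(site)
    if pos == -1:
        return None  # no recognition site, so this transcript yields no tag

    tag = body[pos:pos + tag_length]
    # A site too close to the 3' end gives a truncated tag; treat it as unusable.
    return tag if len(tag) == tag_length else None


# A toy transcript with two CATG sites followed by a polyA tail.
example = "GGCATGTTTT" + "CATGACGTTGCAACGTTGCAACG" + "TTC" + "A" * 20
print(extract_tag(example))
```

Only the site nearest the polyA tail produces a tag here, which is why each transcript contributes a single tag and the resulting library is far less complex than a standard RNA-Seq library.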
Whole Transcriptome Sequencing
A truly comprehensive view of the transcriptome requires the examination of all unique RNA transcripts for an organism, including both the coding and non-coding portions. This is achieved by combining the results of various RNA-Seq methods. The primary method is the generation of RNA-Seq libraries without polyA enrichment, allowing for the sequencing of a significant portion of the non-coding RNA. However, since rRNA accounts for 95-98% of the RNA in a sample, it is still necessary to eliminate this fraction prior to sequencing. This is often accomplished with commercially available kits that use oligos complementary to known rRNA sequences to sequester those transcripts. Alternatively, ‘duplex-specific nuclease’ (DSN) can be used to eliminate highly abundant sequences, greatly enriching the other portions of the transcriptome. The standard RNA-Seq methods generally don’t capture small RNAs well, so the results are combined with small RNA-Seq and possibly one or more of the CAGE methods to more accurately define the transcription start and stop sites.
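A quick calculation shows why rRNA removal matters. The sketch below assumes that reads are sampled in proportion to molecule abundance, which is a simplification, and the `depletion_efficiency` parameter (the fraction of rRNA removed) is a hypothetical value chosen for illustration rather than a figure from any particular kit.

```python
def informative_reads(total_reads, rrna_fraction, depletion_efficiency=0.0):
    """Estimate how many reads come from non-rRNA transcripts.

    rrna_fraction: share of the RNA that is rRNA before depletion (0 <= x < 1)
    depletion_efficiency: fraction of rRNA removed (hypothetical, 0 to 1)
    Assumes reads are drawn in proportion to molecule abundance.
    """
    if not 0.0 <= rrna_fraction < 1.0:
        raise ValueError("rrna_fraction must be in [0, 1)")
    if not 0.0 <= depletion_efficiency <= 1.0:
        raise ValueError("depletion_efficiency must be in [0, 1]")

    rrna_left = rrna_fraction * (1.0 - depletion_efficiency)
    other = 1.0 - rrna_fraction
    # After depletion, the reads are shared among whatever RNA remains.
    non_rrna_share = other / (other + rrna_left)
    return total_reads * non_rrna_share


reads = 30_000_000  # the ~30 million reads per sample quoted for standard RNA-Seq
for fraction in (0.95, 0.98):
    undepleted = informative_reads(reads, fraction)
    depleted = informative_reads(reads, fraction, depletion_efficiency=0.9)
    print(f"rRNA {fraction:.0%}: {undepleted:,.0f} informative reads "
          f"without depletion, {depleted:,.0f} with 90% removal")
```

Under these assumptions, an undepleted library spends all but roughly 0.6 to 1.5 million of its 30 million reads on rRNA, and even an imperfect depletion step recovers a large share of the run for the rest of the transcriptome.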
Key Trends
Conversion of projects from arrays to sequencing
Increasing sequencing depth
De novo projects
Smaller input amounts
Key Papers
RNA-Seq: Wang, Z. et al. (2009) RNA-Seq: A revolutionary tool for transcriptomics. Nature Reviews Genetics 10(1):57-63.
miRNA-Seq: Creighton, C. et al. (2009) Expression profiling of microRNAs by deep sequencing. Briefings in Bioinformatics 10(5):490-497.
DGE: Matsumura, H. et al. (2010) High throughput SuperSAGE for digital gene expression analysis of multiple samples using next generation sequencing. PLoS ONE 5(8):e12010.
Key Platform Characteristics
| Characteristic | Importance | Notes |
|---|---|---|
| # of reads | Critical | Highly important in studies that gauge expression levels. Also necessary for identifying sequence variants. |
| Read length | Nice to have | The average eukaryotic mRNA transcript is ~1000 bp long, and longer reads make it easier to discriminate between paralogous genes. By extension, they also help to discriminate between isoforms of the same gene. |
| Error rate | Nice to have | While it’s always beneficial to have low error rates, RNA applications are generally more tolerant unless specifically looking for SNPs, when the error rate should be as low as possible. |
| Paired-end reads | Nice to have | Paired-end reads can be useful in discovering and measuring various RNA isoforms and gene fusions. However, isoform analysis requires fairly precise size selection, so some researchers prefer to focus on longer single reads. |
| Mate-pair reads | Irrelevant | The large insert sizes of mate-pair libraries aren’t useful for transcripts as very few of them are large enough to require this. |
| Multiplexing | Irrelevant | While de novo eukaryotic mRNA-Seq samples are normally too complex to permit multiplexing, most RNA-Seq projects (e.g., resequencing, small RNA-Seq) can be accommodated in a multiplexed fashion. |
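To make the multiplexing note concrete, the per-sample read counts quoted earlier (~30 million for standard RNA-Seq, 1-2 million for DGE) can be turned into a rough samples-per-run estimate. This is a back-of-the-envelope sketch: the run yield of 200 million reads is a hypothetical figure, not a specification of any platform, and the calculation ignores barcode limits and uneven sample representation.

```python
def samples_per_run(run_yield_reads, reads_per_sample):
    """Whole number of samples that fit in one run at a given read depth."""
    if reads_per_sample <= 0:
        raise ValueError("reads_per_sample must be positive")
    return run_yield_reads // reads_per_sample


RUN_YIELD = 200_000_000  # hypothetical run yield, for illustration only

targets = {
    "standard RNA-Seq (~30M reads)": 30_000_000,
    "DGE (2M reads)": 2_000_000,
}
for label, reads_needed in targets.items():
    print(f"{label}: {samples_per_run(RUN_YIELD, reads_needed)} samples per run")
```

The lower the read requirement per sample, the more samples can share a run, which is where multiplexing pays off for tag-based and small RNA projects.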