Abstract
Precise and accurate identification and quantification of transcription start sites (TSSs) are key to understanding the control and regulation of transcription. The core promoter, which comprises the TSS and proximal non-coding sequences, serves as a binding site for the preinitiation complex and various regulatory factors. Because the location of transcription initiation indicates where these key components bind, accurate TSS identification is important for understanding the molecular regulation of transcription. Existing protocols for TSS identification are challenging and expensive, leaving high-quality data available for only a limited set of organisms. This sparsity of data impairs the study of TSSs across tissues or in an evolutionary context. These protocols can also have technical limitations, leaving room for modern techniques to add further dimensions of data analysis. To address these shortcomings, we developed Smart-Seq2 Rolling Circle to Concatemeric Consensus (Smar2C2), which identifies and quantifies TSSs and transcription termination sites. Smar2C2 incorporates unique molecular identifiers, which allowed for the identification of as many as 70 million sites, with no known upper limit, using RNA collected in bulk. We have also addressed the amount of input RNA required, generating TSS data sets from as little as 40 pg of total RNA. We have used Smar2C2 to identify TSSs in Glycine max (soybean), Oryza sativa (rice), Sorghum bicolor (sorghum), Triticum aestivum (wheat) and Zea mays (maize) across multiple tissues. This has allowed for the identification of evolutionarily conserved features, such as novel patterns in the initiator elements that flank the TSS or the nucleotide composition of well-known promoter motifs like the TATA box.
We have also attempted to extend Smar2C2 to single-cell RNA sequencing, though it is currently not economical compared with existing techniques from 10X Genomics. However, using single-cell TSS data, we have shown the importance of TSS information in single-cell RNA sequencing, both for improving technical data processing and for discovering novel biology currently undetectable in single-cell data sets. We hope that these developments may prove significant for our ability to understand and utilize the control of transcription initiation.