Introduction

As the critical determinant of the proteome and therefore cellular status, the transcriptome represents a key node of regulation for all life1. Transcriptional control is managed by a finely tuned network of transcription factors that integrate environmental and developmental cues to actuate the appropriate responses in gene expression2,3,4. Importantly, the transcriptomic state space is constrained. Pareto efficiency constraints suggest that no gene expression profile or phenotype can be optimal for all tasks, and consequently, that some expression profiles or phenotypes must come at the expense of others5,6. Furthermore, across all major studied kingdoms of life, cellular networks demonstrate remarkably conserved scale-free properties that are topologically characterized by a small minority of highly connected regulatory nodes that link the remaining majority of sparsely connected nodes to the network7,8,9. These theories suggest that the effective dimension of the transcriptome should be far less than the total number of genes it contains. If true to a large enough extent, it may be possible to faithfully compress and prospectively summarize entire transcriptomes by measuring only a small, carefully chosen subset of their genes.

Indeed, previous studies have exploited this reduced dimensionality to perform gene expression imputation for missing or corrupted values in microarray data10,11,12. Others have extended these intuitions to predict expression from probe sets containing a few hundred genes13,14. However, prediction accuracies have been modest and usually limited to 4,000 target probes/genes. Recently, several studies examined the transcriptomic information recoverable by shallow sequencing, especially as it applies to single-cell experiments15,16,17,18. Jaitin et al.18 and Pollen et al.16 demonstrated that only tens of thousands of reads are required per cell to learn and classify cell types ab initio16,18. Heimberg et al.15 extended these findings and demonstrated that the major principal components of a typically sequenced mouse bulk or single-cell expression data set may be estimated with little error at even 1% of the depth15. Though these approaches advance the notion of strategic transcriptome undersampling, they recover only broad transcriptional states and are restricted to measuring only the most abundant genes. During sample preparation—arguably the most expensive cost of a multiplexed-sequencing experiment—shallow-sequencing-based approaches still use protocols meant for sampling the entire transcriptome and therefore consume more resources than necessary. Furthermore, given that the expression of even the most abundant genes is highly skewed, sequencing effort is wastefully distributed compared with an approach that chooses more wisely which genes to measure. Finally, it is still not clear from the sample sizes and biological contexts previously studied whether the low dimensionality of the transcriptome can be leveraged unconditionally (or nearly so) across organisms and applications.

In this work, we introduce Tradict (transcriptome predict), a robust-to-noise and probabilistically sound algorithm for inferring gene abundances transcriptome-wide, and for predicting the expression of a transcriptomically comprehensive, but interpretable, list of transcriptional programs that represent the major biological processes and pathways of the cell. Tradict makes its predictions using only the expression measurements of a single, context-independent, machine-learned subset of 100 marker genes. Importantly, Tradict’s predictions are formulated as posterior distributions over unmeasured genes and programs, and therefore simultaneously provide point and credible-interval estimates of predicted expression. Using a representative sampling of over 23,000 publicly available, transcriptome-wide RNA-Seq data sets for Arabidopsis thaliana and Mus musculus, we show that Tradict prospectively models program expression with striking accuracy. Our work demonstrates the development and large-scale application of a probabilistically reasonable multivariate count/non-negative data model, and highlights the power of directly modelling the expression of a comprehensive list of transcriptional programs in a supervised manner. Consequently, we believe that Tradict, coupled with targeted RNA sequencing19,20,21,22,23,24, can rapidly illuminate biological mechanism and reduce the time and cost of performing large forward genetic, breeding or chemogenomic screens.

Results

Assembly of a deep training collection of transcriptomes

We downloaded all available Illumina sequenced publicly deposited RNA-Seq samples (transcriptomes) for A. thaliana and M. musculus from NCBI’s Sequence Read Archive (SRA). Among samples with at least 4 million reads, we successfully downloaded and quantified the raw sequence data of 3,621 and 27,450 transcriptomes for A. thaliana and M. musculus, respectively. After stringent quality filtering, we retained 2,597 (71.7%) and 20,847 (76.0%) transcriptomes comprising 225 and 732 unique SRA submissions for A. thaliana and M. musculus, respectively. An SRA ‘submission’ consists of multiple, experimentally linked samples submitted concurrently by an individual or lab. We defined 21,277 (A. thaliana) and 21,176 (M. musculus) measurable genes with reproducibly detectable expression in transcripts per million (t.p.m.) given our tolerated minimum sequencing depth and mapping rates (see Methods section for further information regarding data acquisition, transcript quantification, quality filtering and expression filtering). We hereafter refer to the collection of quality and expression filtered transcriptomes as our training transcriptome collection.

To assess the quality and comprehensiveness of our training collection, we performed a deep characterization of the expression space spanned by these transcriptomes. We found that the transcriptome of both organisms was highly compressible and that the primary drivers of variation were tissue and developmental stage (Fig. 1a,b, Supplementary Fig. 1), with many biologically realistic trends within each cluster (Supplementary Note 1). We additionally examined the distribution of submissions across the expression space, compared inter-submission variability within and between tissues, inspected expression correlations among genes with well-established regulatory relationships and assessed the evolution of the expression space across time. These investigations revealed that our training collection is of high and reproducible technical quality, reflective of known biology, stable, and increasing exponentially in size (Supplementary Note 1, Supplementary Figs 2–4). Given, additionally, the diversity of tissues, genetic perturbations and environmental stimuli represented in the SRA, these results, taken together, suggest that our training collection is an accurate and representative sampling of the transcriptomic state space that is of experimental interest for both organisms.

Figure 1: The primary drivers of transcriptomic variation are developmental stage and tissue.

(a) A. thaliana, (b) M. musculus. Also shown are plots of PC3 versus PC1 to provide additional perspective.

Tradict—algorithm overview

Given a training transcriptome collection, Tradict encodes the transcriptome into a single subset of globally representative marker genes and learns their predictive relationship to the expression of a comprehensive collection of transcriptional programs (for example, pathways, biological processes) and to the rest of the genes in the transcriptome. Tradict’s key innovation lies in using a Multivariate Normal Continuous-Poisson (MVN-CP) hierarchical model to model marker latent abundances—rather than their measured, noisy abundances—jointly with the expression of transcriptional programs and the abundances of the remaining non-marker genes in the transcriptome. In so doing, Tradict is able to (1) efficiently capture covariance structure within the non-negative, right-skewed output typical of sequencing experiments, and (2) perform robust inference of transcriptional program and non-marker expression even in the presence of significant noise.

Figure 2 illustrates Tradict’s general workflow. Estimates of expression are noisy, especially for low to moderately expressed genes. Given that samples are often sequenced to different depths and that the a priori abundance of each gene differs, the level of noise in a gene’s measured expression for a given sample varies, but it can be estimated. Therefore, during training, Tradict first learns the log-latent, denoised abundances for each gene in every sample in the training collection using the lag transformation (described under Methods).

Methods

Data acquisition and transcript quantification

Data acquisition and transcript quantification were managed using a custom script, srafish.pl. The srafish.pl algorithm and its dependencies are described below. Complete instructions for installing (including all dependencies) and using srafish.pl are available on our GitHub page:

https://github.com/surgebiswas/transcriptome_compression/tree/master/data_download.

Supplementary Figure 11 illustrates the workflow of srafish.pl. Briefly, after checking that an SRA file meets certain quality requirements, srafish.pl uses the ascp fasp transfer program to download the raw SRA (.sra file) for an SRA RNA-Seq sample. Transfers made using ascp are substantially faster than traditional FTP. The .sra file is then unpacked to FASTQ format using the fastq-dump program provided with the SRA Toolkit (NCBI)34. The raw FASTQ read data are then passed to Sailfish35, which uses a fast alignment-free algorithm to quantify transcript abundances. To conserve memory, files with more than 40 million reads for A. thaliana and 70 million reads for M. musculus were downsampled before running Sailfish. Samples with fewer than 4 million reads are not downloaded at all. This workflow is then iterated for each SRA RNA-Seq sample available for the organism of interest.
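For illustration, the per-sample loop can be sketched as follows. This is a minimal Python rendering of the workflow just described, not srafish.pl itself (which is written in Perl); remote_path, the single-end read assumption and the exact Sailfish flags are placeholders, and downsampling of very deep samples is omitted for brevity.

import subprocess

MIN_READS = 4_000_000  # samples below this depth are never downloaded

def quantify_sample(run_id, n_reads, remote_path, index_dir, out_dir, ssh_key):
    # Sketch of one iteration of the srafish.pl workflow; see the text above.
    if n_reads < MIN_READS:
        return
    sra = f"{out_dir}/{run_id}.sra"
    # 1. Aspera fasp transfer of the raw .sra file (much faster than FTP).
    subprocess.run(["ascp", "-i", ssh_key, remote_path, sra], check=True)
    # 2. Unpack the .sra file to FASTQ with the SRA Toolkit.
    subprocess.run(["fastq-dump", "--outdir", out_dir, sra], check=True)
    # 3. Alignment-free transcript quantification with Sailfish
    #    (library-type string appropriate to the Sailfish version assumed).
    subprocess.run(["sailfish", "quant", "-i", index_dir, "-l", "T=SE:S=U",
                    "-r", f"{out_dir}/{run_id}.fastq",
                    "-o", f"{out_dir}/{run_id}_quant"], check=True)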

The main inputs into srafish.pl are a query table, output directory, Sailfish index and ascp SSH key, which comes with each download of the aspera ascp client. srafish.pl depends on Perl (v5.8.9 for Linux x86-64), the aspera ascp client (v3.5.4 for Linux x86-64), SRA Toolkit (v2.5.0 for CentOS Linux x86-64) and Sailfish (v0.6.3 for Linux x86-64).

Query table construction

For each organism, using the following (Unix) commands, we first prepared a ‘query table’ that contained all SRA sample IDs as well as various metadata required for the download:

qt_name=<query_table_file_name>

sra_url='http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term='

organism=<organism_name>

wget -O "$qt_name" "${sra_url}${organism}[Organism] AND \"strategy rna seq\"[Properties]"

Fields between <> indicate input arguments. As an example:

qt_name=Athaliana_query_table.csv

sra_url='http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term='

organism='Arabidopsis thaliana'

wget -O "$qt_name" "${sra_url}${organism}[Organism] AND \"strategy rna seq\"[Properties]"

Reference transcriptomes and index construction

Sailfish requires a reference transcriptome—a FASTA file of cDNA sequences—from which it builds an index it can query during transcript quantification. For the A. thaliana transcriptome reference we used cDNA sequences of all isoforms from the TAIR10 reference. For the M. musculus transcriptome reference we used all protein-coding and long non-coding RNA transcript sequences from the Gencode vM5 reference.

Sailfish indices were created using the following command:

sailfish index -t <ref_transcriptome.fasta> -k 20 -p 6 -o .

Here, <ref_transcriptome.fasta> refers to the reference transcriptome FASTA file. Copies of the reference transcriptome FASTA files used in this study are available upon request.

Quality and expression filtering

In addition to the read count filtering mentioned above, we also removed samples with mapping rates below 0.7 and 0.75 for A. thaliana and M. musculus, respectively (Supplementary Fig. 12). The resulting isoform expression table was then collapsed into a gene expression table by setting a gene’s expression to be the sum of expression values for all isoforms of that gene. We next removed all non-protein-coding transcripts except for long non-coding RNAs, and removed samples with large amounts (>30%) of non-protein-coding contamination (for example, rRNA). The data set was then expression filtered by keeping only genes with expression greater than 1 t.p.m. in at least 5% of all samples. The latter requirement ensured that outlier or extreme expression in just a few samples was not enough to keep the gene for analysis.

We then removed samples with an abnormally large number of genes with expression values of zero. To do this we calculated the mean and s.d. of the number of genes with zero expression across all samples. Samples for which the number of zero-expression genes exceeded the mean plus two times the s.d. were removed. Finally, we removed outlier samples by first examining the proportion of zeros contained in each sample and by computing the pairwise PCC between the gene expression profiles of all samples. To reduce heteroscedasticity, raw t.p.m. values for each gene were converted to a log scale (log10(t.p.m.+0.1)) before calculating correlations. For A. thaliana, the majority of samples had an average correlation with other samples of greater than 0.45 and fewer than 20% zero values. Samples with lower correlation or a greater percentage of zeros were removed (Supplementary Fig. 12). By similar criteria, samples with an average correlation with other samples of less than 0.55 or with greater than 30% zeros were removed for M. musculus (Supplementary Fig. 12). Manual inspection of ∼100 of these samples revealed they were highly enriched for non-polyA-selected samples and samples made from low-input RNA (for example, single cells).
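These filters are straightforward to express as array operations. The sketch below (NumPy, using the A. thaliana thresholds) mirrors the procedure described above; the mapping-rate and contamination filters, which require sample metadata, are omitted, and variable names are illustrative rather than taken from our pipeline.

import numpy as np

def filter_collection(tpm, min_corr=0.45, max_zero_frac=0.20):
    """tpm: genes x samples t.p.m. matrix."""
    # Keep genes with > 1 t.p.m. in at least 5% of all samples.
    tpm = tpm[(tpm > 1.0).mean(axis=1) >= 0.05]
    # Drop samples whose number of zero-expression genes exceeds the
    # mean plus two standard deviations across samples.
    n_zero = (tpm == 0).sum(axis=0)
    keep = n_zero <= n_zero.mean() + 2 * n_zero.std()
    # Drop outliers: low mean correlation to other samples on the log scale,
    # or too high a proportion of zeros.
    logx = np.log10(tpm + 0.1)
    r = np.corrcoef(logx.T)                          # samples x samples PCC
    mean_r = (r.sum(axis=1) - 1) / (r.shape[0] - 1)  # exclude self-correlation
    keep &= (mean_r > min_corr) & ((tpm == 0).mean(axis=0) < max_zero_frac)
    return tpm[:, keep]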

Metadata annotation

RNA-Seq samples are submitted to the SRA with non-standardized metadata annotations. For example, for some samples tissue and developmental stage are clearly noted as separate fields, whereas for others such information can only be found in the associated paper’s abstract, or sometimes only in its main text. To maximize annotation accuracy, we annotated samples manually until the structure of the gene expression space represented by the first three principal components was clear. Annotation was accomplished by first finding those few submissions with samples in multiple clusters. These submissions revealed that the likely separating variables of interest were tissue and developmental context. For each major cluster in the PCA (determined visually), we then annotated samples in order of submission size until the tissue or developmental context of that cluster became qualitatively clear.

Tradict algorithm

Tradict’s usage can be broken down into two parts: (1) Training and (2) Prediction. Training is the process of learning, from training data, the marker panel and its predictive relationship to the expression of transcriptional programs and to the remaining genes in the transcriptome. In essence, during training we begin with full transcriptome data and collapse its information into a subset of marker genes. Prediction is the reverse process of predicting the expression of transcriptional programs and non-marker genes from the expression measurements of just the selected markers.

Our training algorithm can be broken down into several steps: (1) Computing the latent logarithm of the training transcriptome collection, (2) defining transcriptional programs, (3) marker selection via Simultaneous Orthogonal Matching Pursuit and (4) building a predictive MVN-CP hierarchical model.

Computing the latent logarithm of the transcriptome

Expression values in our training data set are stored as t.p.m., which are non-negative, variably scaled and strongly heteroscedastic, similar to read counts. For subsequent steps in our algorithm and analysis, it is important to transform these data to improve their scaling and heteroscedasticity.

Often, one log-transforms such data. However, to avoid undefined values where the data are zero, one also adds a pseudocount (for example, 1). This pseudocount considers neither the gene’s a priori abundance nor the confidence with which the measurement was made, making the practice convenient but statistically unfounded. In previous work, we introduced the latent logarithm, or ‘lag’25. lag assumes that each observed expression value is actually a noisy realization of an unmeasured latent abundance. By taking the logarithm of this latent abundance, which considers both sampling depth and the gene’s a priori abundance, lag provides a more nuanced and statistically principled alternative to the conventional ‘log(x+pseudocount)’. With increasing data, lag quickly converges to log; in the absence of data, it relies on both sampling depth and the gene’s a priori abundance to make a non-zero estimate of the gene’s latent abundance.
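To make this concrete, a per-observation MAP estimate under the assumptions lag makes (Poisson sampling noise around a latent abundance, with a log-normal prior on that abundance) can be sketched as below. This is an illustrative simplification; the actual lag implementation (ref. 25) may differ in its details.

import numpy as np
from scipy.optimize import minimize_scalar

def lag_map(x, depth, mu, sigma):
    """MAP estimate of the log latent abundance for one observation.

    x     : observed count for the gene in this sample
    depth : sample sequencing depth (Poisson exposure)
    mu    : prior mean of the gene's log abundance (its a priori abundance)
    sigma : prior s.d. of the gene's log abundance
    """
    def neg_log_posterior(z):
        # Poisson log-likelihood (up to a constant) plus log-normal prior.
        return -(x * z - depth * np.exp(z)) + 0.5 * ((z - mu) / sigma) ** 2

    return minimize_scalar(neg_log_posterior, bounds=(mu - 10.0, mu + 10.0),
                           method="bounded").x

For a well-measured gene (large x), the likelihood dominates and the estimate approaches log(x/depth); for x = 0, the estimate shrinks toward the prior mean rather than being undefined, which is exactly the behaviour described above.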

With these intuitions in mind, we applied the lag transformation to our entire training data set. The lag-transformed expression matrix demonstrated a Pearson correlation of 0.98 with the log(t.p.m.+0.1)-transformed expression matrix for both A. thaliana and M. musculus. As expected, however, especially for measurements of zero expression, lag made better estimates of true abundance in the log domain. Availability: https://github.com/surgebiswas/latent_log.git

Defining transcriptional programs

We define a transcriptional program to be the first principal component of the z-score standardized lag expression of the set of genes involved in a certain response or pathway26,27. This virtual program marker maximally captures (in one dimension) the information contained in the transcriptional program. We considered three criteria for defining a globally comprehensive, but interpretable list of transcriptional programs for A. thaliana and M. musculus:

  1. To capture as much information about the transcriptome as possible, we wanted to maximize the number of genes covered by the transcriptional programs.

  2. To improve interpretability, we wanted to minimize the total number of transcriptional programs.

  3. The number of genes in a transcriptional program should be neither too large nor too small—genes in a transcriptional program should be in the same pathway.

Rather than defining these transcriptional programs de novo, we took a knowledge-based approach and defined them using the Gene Ontology (GO). We also tried using KEGG pathways, but found these were less complete and nuanced than GO annotations. GO comprises three sub-ontologies, or aspects: molecular function, biological process and cellular component. Each of these ontologies contains terms that are arranged as a directed acyclic graph with the above three terms as roots. Terms higher in the graph are less specific than those near the leaves36,37. Thus, with respect to the three criteria above, we wanted to find GO terms of low-to-moderate height in the graph, such that they were neither too specific nor too general. Given that we were interested in monitoring the status of different processes in the organism, we focused on the Biological Process ontology.

We downloaded gene association files for A. thaliana and M. musculus from the Gene Ontology Consortium (http://geneontology.org/page/download-annotations). We then examined for each of several minimum and maximum GO term sizes (defined by the number of genes annotated with that GO term) the number of GO terms that fit this size criterion and the number of genes covered by these GO terms.
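The tabulation itself is simple. The sketch below assumes the standard tab-delimited GAF layout (gene identifier in column 2, GO ID in column 5), counts direct annotations only and does not propagate annotations up the ontology graph; it is illustrative rather than our exact analysis code.

from collections import defaultdict

def term_sizes(gaf_path):
    """Map each GO term to the set of genes directly annotated with it."""
    genes_per_term = defaultdict(set)
    with open(gaf_path) as fh:
        for line in fh:
            if line.startswith("!"):          # skip GAF comment/header lines
                continue
            cols = line.rstrip("\n").split("\t")
            genes_per_term[cols[4]].add(cols[1])
    return genes_per_term

def coverage(genes_per_term, min_size, max_size):
    """Number of GO terms within the size bounds, and genes they cover."""
    kept = {t: g for t, g in genes_per_term.items()
            if min_size <= len(g) <= max_size}
    covered = set().union(*kept.values()) if kept else set()
    return len(kept), len(covered)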

Supplementary Data Tables 1 and 2 contain the results of this analysis for A. thaliana and M. musculus, respectively. A. thaliana has 3,333 GO annotations for 27,671 genes. We noticed that when the minimum GO term size was as small as possible (1) and we moved from a maximum GO term size of 5,000 to one of 10,000, we jumped from covering 18,432 genes (67% of the transcriptome) to covering the full transcriptome (the two bold rows of Supplementary Data Table 1). This is due to the addition of a single GO term, the most general ‘Biological Process’ term. Thus, we concluded that 33% of the genes in the transcriptome had only ‘Biological Process’ as a GO annotation, and therefore that we did not need to capture these genes in our GO-term-derived gene sets. Though these genes are not informatively annotated, Tradict still models their expression all the same. We hereafter refer to the set of genes annotated with more than just the ‘Biological Process’ term as informatively annotated.

We reasoned that a minimum GO term size of 50 and a maximum size of 2,000 best met our aforementioned criteria for defining globally representative GO-term-derived gene sets. These size thresholds defined 150 GO terms, which in total covered 15,124 genes (82.1% of the informatively annotated genes, and 54.7% of the full transcriptome). These 150 GO-term-derived, globally comprehensive transcriptional programs covered the major pathways related to growth, development and response to the environment.

We performed a similar GO term size analysis for M. musculus (Supplementary Data Table 2). M. musculus has 10,990 GO annotations for 23,566 genes. Of these genes, 6,832 (29.0%) had only the ‘Biological Process’ term annotation and were considered not informatively annotated. As we did for A. thaliana, we selected a GO term size minimum of 50 and a maximum of 2,000. These size thresholds defined 368 GO terms, which in total covered 14,873 genes (88.9% of the informatively annotated genes, 63% of the full transcriptome). As we found for A. thaliana, these 368 GO-term-derived, globally comprehensive transcriptional programs covered the major pathways related to growth, development and response to the environment.

Supplementary Data Tables 3 and 4 contain the lists of the globally comprehensive transcriptional programs as defined by the criteria above. For each of these programs, we then computed its first principal component over all constituent genes.
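Concretely, each program’s expression vector can be computed as below. This is a minimal NumPy/scikit-learn sketch, not Tradict’s actual code; note that the sign of a principal component is arbitrary.

import numpy as np
from sklearn.decomposition import PCA

def program_expression(z_lag, program_gene_idx):
    """z_lag: samples x genes matrix of lag-transformed expression.
    Returns the program's expression across samples (its first PC)."""
    sub = z_lag[:, program_gene_idx]
    sub = (sub - sub.mean(axis=0)) / sub.std(axis=0)       # z-score each gene
    return PCA(n_components=1).fit_transform(sub).ravel()  # samples-long vector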

Marker selection via simultaneous orthogonal matching pursuit

After defining transcriptional programs, we have a #-training-samples × #-transcriptional-programs table of expression values. We decompose this matrix using an adapted version of the Simultaneous Orthogonal Matching Pursuit algorithm, using the #-training-samples × #-genes table as a dictionary28,29. Because transcriptional programs are often correlated with one another, we first cluster them using consensus clustering38,39, which produces a robust and stable clustering by taking the consensus of many clusterings performed by a base clustering algorithm. In total, 100 independent iterations of K-means are used as the base clusterings, and the number of clusters is determined using the Davies–Bouldin criterion40. The decomposition is greedy: in each iteration, the algorithm first finds the transcriptional program cluster with the largest unexplained variance. It then finds the gene with the maximum average absolute correlation to the expression of the transcriptional programs in this cluster. This gene is added to an ‘active set’, onto which the transcriptional program expression matrix is orthogonally projected. The fit is subtracted to produce a residual, on which the above steps are repeated until a predefined number of genes have been added to the active set or the residual variance of the transcriptional program expression matrix falls below a predefined threshold.
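The sketch below captures the shape of this greedy loop in simplified Python. Consensus clustering is assumed to have been run already and is passed in as a label vector, and gene correlations are computed against the residual program expression; this is not the exact Tradict implementation.

import numpy as np

def somp_select(Y, X, labels, n_markers):
    """Y: samples x programs, X: samples x genes (the dictionary),
    labels: NumPy array giving a consensus-cluster label per program."""
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)
    active, resid = [], Yc.copy()
    for _ in range(n_markers):
        # 1. Program cluster with the largest unexplained (residual) variance.
        c = max(set(labels),
                key=lambda k: resid[:, labels == k].var(axis=0).sum())
        # 2. Gene with the largest mean |correlation| to that cluster's programs.
        corr = np.corrcoef(Xc.T, resid[:, labels == c].T)
        corr = corr[:Xc.shape[1], Xc.shape[1]:]
        active.append(int(np.nanargmax(np.abs(corr).mean(axis=1))))
        # 3. Orthogonally project the programs onto the active set and
        #    subtract the fit to obtain the new residual.
        A = Xc[:, active]
        beta, *_ = np.linalg.lstsq(A, Yc, rcond=None)
        resid = Yc - A @ beta
    return active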

Building a predictive MVN-CP hierarchical model

Here we describe conceptually how we fit a predictive model that allows us to predict gene and transcriptional program expression from expression measurements of our selected markers. Readers interested in the full mathematical details of the MVN-CP hierarchical model are referred to Supplementary Note 6.

The MVN-CP distribution offers a way of modelling statistically coupled count-based or, more generally, non-negative random variables, such as the t.p.m. or count-based expression values of genes41,42,43,44. Here it is assumed that the t.p.m. expression of each gene in a given sample is a noisy, CP realization of some unmeasured latent abundance, the logarithm of which comes from an MVN distribution over the log-latent abundances of all genes in the transcriptome.
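In symbols, writing z for the vector of log-latent abundances of all genes in a sample and s for that sample’s sequencing depth, the hierarchy is, schematically (the exact model is given in Supplementary Note 6):

\mathbf{z} \sim \mathrm{MVN}(\boldsymbol{\mu}, \boldsymbol{\Sigma}), \qquad x_g \mid z_g \sim \mathrm{CP}\!\left(s\, e^{z_g}\right) \quad \text{independently for each gene } g.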

Given the marginalization properties of the MVN distribution, we need only learn the relationships between the selected markers and the non-marker genes. For the purposes of prediction, we need to estimate (1) the mean vector and (2) the covariance matrix over the log-latent t.p.m.’s of the markers, (3) the mean vector of the log-latent t.p.m.’s of the non-markers and (4) the cross-covariance matrix between the log-latent t.p.m.’s of markers and non-markers.

Note that before we can estimate these parameters, we must learn the log-latent t.p.m.’s of all genes. To do this we first lag-transform the entire training data set. We then learn the marker log-latent t.p.m.’s, and their associated mean vector and covariance matrix, using an iterative conditional modes algorithm. Specifically, we initialize our estimate of the marker log-latent t.p.m.’s to be the lag-transformed expression values, which by virtue of the lag’s probabilistic assumptions are also derived from a Normal-CP hierarchical model. We then iterate (1) estimation of the mean vector and the covariance matrix given the current estimate of the log-latent t.p.m.’s, and (2) maximum a posteriori estimation of the log-latent t.p.m.’s given the estimated mean vector, the covariance matrix and the measured t.p.m. values of the selected markers. A small regularization is added during estimation of the covariance matrix to ensure stability and to avoid infinite-data-likelihood singularities that arise from singular covariance matrices. This most often happens when a gene’s t.p.m. abundance is mostly zero (that is, there is little data for the gene), giving the MVN layer an opportunity to tightly couple this gene’s latent abundance to that of another gene, thereby producing a nearly singular covariance matrix.
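The alternation can be sketched as follows. Here map_update stands in for the per-sample maximum a posteriori step under the CP likelihood (a vector analogue of the lag_map sketch above), and the regularization strength eps is illustrative.

import numpy as np

def fit_marker_model(tpm_m, depths, map_update, n_iter=20, eps=1e-3):
    """tpm_m: samples x markers t.p.m.; depths: per-sample sequencing depths.
    map_update(x, depth, mu, sigma) -> MAP log-latent vector for one sample."""
    z = np.log(tpm_m + 0.1)      # stand-in init; Tradict uses the lag values
    for _ in range(n_iter):
        # (1) Moment estimates of the MVN layer given the current latents.
        mu = z.mean(axis=0)
        sigma = np.cov(z, rowvar=False) + eps * np.eye(z.shape[1])  # ridge
        # (2) Per-sample MAP update of the latents given (mu, sigma).
        for i in range(z.shape[0]):
            z[i] = map_update(tpm_m[i], depths[i], mu, sigma)
    return mu, sigma, z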

Learning the mean vector of the non-marker genes and the marker × non-marker cross-covariance matrix is considerably easier. For the mean vector, we simply take the sample mean of the lag-transformed t.p.m. values. For the cross-covariance matrix, we compute the sample cross-covariance between the learned log-latent marker t.p.m.’s and the log-latent non-marker t.p.m.’s obtained from the lag transformation. We find that these simple sample estimates are highly stable, given that our training collection includes thousands to tens of thousands of transcriptomes.
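In code, these moment estimates amount to a few lines (a NumPy sketch; both inputs are samples × features arrays on the latent log scale):

import numpy as np

def cross_params(z_markers, z_targets):
    """Sample mean of the targets and markers x targets cross-covariance."""
    mu_t = z_targets.mean(axis=0)
    zm = z_markers - z_markers.mean(axis=0)
    zt = z_targets - mu_t
    c_mt = zm.T @ zt / (zm.shape[0] - 1)   # markers x targets cross-covariance
    return mu_t, c_mt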

Using similar ideas, we can also encode the expression of the transcriptional programs. Recall that a principal component output by PCA is a linear combination of input features. Thus, by the central limit theorem, the expression of these transcriptional programs should behave like normal random variables. Indeed, after regressing out the first three principal components computed on the entire training samples × genes expression matrix from the expression values of the transcriptional programs (to remove the large effects of tissue and developmental stage), 85–90% of the transcriptional programs had expression that was consistent with a normal distribution (average P value=0.43, Pearson’s χ2 test). Consequently, as was done for the non-marker genes and as will be needed for decoding, we compute the mean vector of the transcriptional programs and the markers × transcriptional-programs cross-covariance matrix. These are given by the standard sample mean of the training transcriptional program expression values and the sample cross-covariance between the learned log-latent t.p.m.’s of the markers and the transcriptional program expression values.

Prediction

To perform prediction, we must translate newly obtained t.p.m. measurements of our marker genes into expression predictions for the transcriptional programs and the remaining non-marker genes. More specifically, we would like to formulate these predictions as conditional posterior distributions, which simultaneously provide an estimate of expression magnitude and our confidence in that estimate. To do this, we first sample the latent abundances of our markers from their posterior distribution using the measured t.p.m.’s and the 1 × markers mean vector and markers × markers covariance matrix previously learned from the training data. This is done using Metropolis–Hastings Markov chain Monte Carlo sampling (see Supplementary Note 6 for further details on tuning the proposal distribution, sample thinning, sampling depth and burn-in lengths). Using these sampled latent abundances and the previously estimated mean vectors and cross-covariance matrices, we can then use standard Gaussian conditioning to sample the log-latent expression of the transcriptional programs and the remaining genes in the transcriptome from their conditional distribution. These samples, in aggregate, are samples from the conditional posterior distribution of each gene and program and can be used to approximate properties of this distribution (for example, posterior mode (MAP) estimates and/or credible intervals).
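The Gaussian conditioning step is standard: given one sampled marker latent vector z_m, the conditional mean of the targets (non-marker genes or program scores) is mu_t + C_mt' Sigma_mm^{-1} (z_m - mu_m). A NumPy sketch follows; the Metropolis–Hastings sampling of z_m itself is described in Supplementary Note 6.

import numpy as np

def conditional_predict(z_m, mu_m, sigma_mm, mu_t, c_mt):
    """z_m: one sampled marker latent vector; c_mt: markers x targets."""
    w = np.linalg.solve(sigma_mm, z_m - mu_m)   # Sigma_mm^{-1} (z_m - mu_m)
    return mu_t + c_mt.T @ w                    # conditional mean of targets

Applying this to each MCMC draw of z_m and aggregating the results approximates the conditional posterior: the mode or mean of the draws gives a point estimate, and their quantiles give credible intervals.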

Code availability

Tradict is available at https://github.com/surgebiswas/tradict. All code for performing the data downloads and analysis and for generating figures is available at https://github.com/surgebiswas/transcriptome_compression.

Data availability

Raw or filtered transcript-quantified training transcriptomes, as well as any other processed forms of the data, are available upon request. Raw read data are directly accessible through the NCBI SRA.

Additional information

How to cite this article: Biswas, S. et al. Tradict enables accurate prediction of eukaryotic transcriptional states from 100 marker genes. Nat. Commun. 8, 15309 doi: 10.1038/ncomms15309 (2017).

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.