Introduction

As the critical determinant of the proteome and therefore cellular status, the transcriptome represents a key node of regulation for all life1. Transcriptional control is managed by a finely tuned network of transcription factors that integrate environmental and developmental cues to actuate the appropriate responses in gene expression2,3,4. Importantly, the transcriptomic state space is constrained. Pareto efficiency constraints suggest that no gene expression profile or phenotype can be optimal for all tasks, and consequently, that some expression profiles or phenotypes must come at the expense of others5,6. Furthermore, across all major studied kingdoms of life, cellular networks demonstrate remarkably conserved scale-free properties that are topologically characterized by a small minority of highly connected regulatory nodes that link the remaining majority of sparsely connected nodes to the network7,8,9. These theories suggest that the effective dimension of the transcriptome should be far less than the total number of genes it contains. If true to a large enough extent, it may be possible to faithfully compress and prospectively summarize entire transcriptomes by measuring only a small, carefully chosen subset of their genes.

Indeed, previous studies have exploited this reduced dimensionality to perform gene expression imputation for missing or corrupted values in microarray data10,11,12. Others have extended these intuitions to predict expression from probe sets containing a few hundred genes13,14. However, prediction accuracies have been modest and usually limited to 4,000 target probes/genes. Recently, several studies examined the transcriptomic information recoverable by shallow sequencing, especially as it applies to single-cell experiments15,16,17,18. Jaitin et al.18 and Pollen et al.16 demonstrated that only tens of thousands of reads are required per cell to learn and classify cell types ab initio16,18. Heimberg et al.15 extended these findings and demonstrated that the major principal components of a typically sequenced mouse bulk or single-cell expression data set may be estimated with little error at even 1% of the depth15. Though these approaches advance the notion of strategic transcriptome undersampling, they recover only broad transcriptional states and are restricted to measuring only the most abundant genes. During sample preparation—arguably the most expensive cost of a multiplexed-sequencing experiment—shallow-sequencing-based approaches still use protocols meant for sampling the entire transcriptome and therefore consume more resources than necessary. Furthermore, given that the expression of even the most abundant genes is highly skewed, sequencing effort is wastefully distributed compared with an approach that chooses more wisely which genes to measure. Finally, it is still not clear from the sample sizes and biological contexts previously studied whether the low dimensionality of the transcriptome can be leveraged unconditionally (or nearly so) across organisms and applications.

In this work, we introduce Tradict (transcriptome predict), a robust-to-noise and probabilistically sound algorithm for inferring gene abundances transcriptome-wide, and for predicting the expression of a transcriptomically comprehensive, but interpretable, list of transcriptional programs that represent the major biological processes and pathways of the cell. Tradict makes its predictions using only the expression measurements of a single, context-independent, machine-learned subset of 100 marker genes. Importantly, Tradict’s predictions are formulated as posterior distributions over unmeasured genes and programs, and therefore simultaneously provide point and credible-interval estimates of predicted expression. Using a representative sampling of over 23,000 publicly available, transcriptome-wide RNA-Seq data sets for Arabidopsis thaliana and Mus musculus, we show that Tradict prospectively models program expression with striking accuracy. Our work demonstrates the development and large-scale application of a probabilistically reasonable multivariate count/non-negative data model, and highlights the power of directly modelling the expression of a comprehensive list of transcriptional programs in a supervised manner. Consequently, we believe that Tradict, coupled with targeted RNA sequencing19,20,21,22,23,24, can rapidly illuminate biological mechanism and reduce the time and cost of performing large forward genetic, breeding or chemogenomic screens.

Results

Assembly of a deep training collection of transcriptomes

We downloaded all available Illumina sequenced publicly deposited RNA-Seq samples (transcriptomes) for A. thaliana and M. musculus from NCBI’s Sequence Read Archive (SRA). Among samples with at least 4 million reads, we successfully downloaded and quantified the raw sequence data of 3,621 and 27,450 transcriptomes for A. thaliana and M. musculus, respectively. After stringent quality filtering, we retained 2,597 (71.7%) and 20,847 (76.0%) transcriptomes comprising 225 and 732 unique SRA submissions for A. thaliana and M. musculus, respectively. An SRA ‘submission’ consists of multiple, experimentally linked samples submitted concurrently by an individual or lab. We defined 21,277 (A. thaliana) and 21,176 (M. musculus) measurable genes with reproducibly detectable expression in transcripts per million (t.p.m.) given our tolerated minimum sequencing depth and mapping rates (see Methods section for further information regarding data acquisition, transcript quantification, quality filtering and expression filtering). We hereafter refer to the collection of quality and expression filtered transcriptomes as our training transcriptome collection.

To assess the quality and comprehensiveness of our training collection, we performed a deep characterization of the expression space spanned by these transcriptomes. We found that the transcriptome of both organisms was highly compressible and that the primary drivers of variation were tissue and developmental stage (Fig. 1a,b, Supplementary Fig. 1), with many biologically realistic trends within each cluster (Supplementary Note 1). We additionally examined the distribution of submissions across the expression space, compared inter-submission variability within and between tissues, inspected expression correlations among genes with well-established regulatory relationships and assessed the evolution of the expression space across time. These investigations revealed that our training collection is of high and reproducible technical quality, reflective of known biology, stable, and increasing exponentially in size (Supplementary Note 1, Supplementary Figs 2–4). Given, additionally, the diversity of tissues, genetic perturbations and environmental stimuli represented in the SRA, these results, taken together, suggest that our training collection is an accurate and representative sampling of the transcriptomic state space that is of experimental interest for both organisms.

Figure 1: The primary drivers of transcriptomic variation are developmental stage and tissue.

(a) A. thaliana, (b) M. musculus. Also shown are plots of PC3 versus PC1 to provide additional perspective.

Tradict—algorithm overview

Given a training transcriptome collection, Tradict encodes the transcriptome into a single subset of globally representative marker genes and learns their predictive relationship to the expression of a comprehensive collection of transcriptional programs (for example, pathways, biological processes) and to the rest of the genes in the transcriptome. Tradict’s key innovation lies in using a Multivariate Normal Continuous-Poisson (MVN-CP) hierarchical model to model marker latent abundances—rather than their measured, noisy abundances—jointly with the expression of transcriptional programs and the abundances of the remaining non-marker genes in the transcriptome. In so doing, Tradict is able to (1) efficiently capture covariance structure within the non-negative, right-skewed output typical of sequencing experiments, and (2) perform robust inference of transcriptional program and non-marker expression even in the presence of significant noise.

Figure 2 illustrates Tradict’s general workflow. Estimates of expression are noisy, especially for low to moderately expressed genes. Given that samples are often sequenced to different depths and that the a priori abundance of each gene differs, the level of noise in a gene’s measured expression for a given sample varies, but it can be estimated. Therefore, during training, Tradict first learns the log-latent, denoised abundances for each gene in every sample in the training collection using the lag transformation (described under Methods).

Methods

Data acquisition and transcript quantification

Data acquisition and transcript quantification were managed using a custom script, srafish.pl. The srafish.pl algorithm and its dependencies are described below. Complete instructions for installing (including all dependencies) and using srafish.pl are available on our GitHub page:

https://github.com/surgebiswas/transcriptome_compression/tree/master/data_download.

Supplementary Figure 11 illustrates the workflow of srafish.pl. Briefly, after checking that an SRA file meets certain quality requirements, srafish.pl uses the ascp fasp transfer program to download the raw SRA (.sra file) for an SRA RNA-Seq sample. Transfers made using ascp are substantially faster than traditional FTP. The .sra file is then unpacked to FASTQ format using the fastq-dump program provided with the SRA Toolkit (NCBI)34. The raw FASTQ read data are then passed to Sailfish35, which uses a fast alignment-free algorithm to quantify transcript abundances. To conserve memory, files with more than 40 million reads for A. thaliana and 70 million reads for M. musculus were downsampled before running Sailfish. Samples with fewer than 4 million reads are not downloaded at all. This workflow is then iterated for each SRA RNA-Seq sample available for the organism of interest.
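For illustration, the per-sample loop can be sketched as follows. This is a minimal Python rendering of the workflow just described, not srafish.pl itself (which is written in Perl); remote_path, the single-end read assumption and the exact Sailfish flags are placeholders, and downsampling of very deep samples is omitted for brevity.

import subprocess

MIN_READS = 4_000_000  # samples below this depth are never downloaded

def quantify_sample(run_id, n_reads, remote_path, index_dir, out_dir, ssh_key):
    # Sketch of one iteration of the srafish.pl workflow; see the text above.
    if n_reads < MIN_READS:
        return
    sra = f"{out_dir}/{run_id}.sra"
    # 1. Aspera fasp transfer of the raw .sra file (much faster than FTP).
    subprocess.run(["ascp", "-i", ssh_key, remote_path, sra], check=True)
    # 2. Unpack the .sra file to FASTQ with the SRA Toolkit.
    subprocess.run(["fastq-dump", "--outdir", out_dir, sra], check=True)
    # 3. Alignment-free transcript quantification with Sailfish
    #    (library-type string appropriate to the Sailfish version assumed).
    subprocess.run(["sailfish", "quant", "-i", index_dir, "-l", "T=SE:S=U",
                    "-r", f"{out_dir}/{run_id}.fastq",
                    "-o", f"{out_dir}/{run_id}_quant"], check=True)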

The main inputs into srafish.pl are a query table, output directory, Sailfish index and ascp SSH key, which comes with each download of the aspera ascp client. srafish.pl depends on Perl (v5.8.9 for Linux x86-64), the aspera ascp client (v3.5.4 for Linux x86-64), SRA Toolkit (v2.5.0 for CentOS Linux x86-64) and Sailfish (v0.6.3 for Linux x86-64).

Query table construction

For each organism, using the following (Unix) commands, we first prepared a ‘query table’ that contained all SRA sample IDs as well as various metadata required for the download:

qt_name=<query_table_file_name>

sra_url='http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term='

organism=<organism_name>

wget -O "$qt_name" "${sra_url}${organism}[Organism] AND \"strategy rna seq\"[Properties]"

Fields between <> indicate input arguments. As an example:

qt_name=Athaliana_query_table.csv

sra_url='http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term='

organism='Arabidopsis thaliana'

wget -O "$qt_name" "${sra_url}${organism}[Organism] AND \"strategy rna seq\"[Properties]"

Reference transcriptomes and index construction

Sailfish requires a reference transcriptome—a FASTA file of cDNA sequences—from which it builds an index it can query during transcript quantification. For the A. thaliana transcriptome reference we used cDNA sequences of all isoforms from the TAIR10 reference. For the M. musculus transcriptome reference we used all protein-coding and long non-coding RNA transcript sequences from the Gencode vM5 reference.

Sailfish indices were created using the following command:

sailfish index -t <ref_transcriptome.fasta> -k 20 -p 6 -o .

Here, <ref_transcriptome.fasta> refers to the reference transcriptome FASTA file. Copies of the reference transcriptome FASTA files used in this study are available upon request.

Quality and expression filtering

In addition to the read count filtering mentioned above, we also removed samples with mapping rates below 0.7 and 0.75 for A. thaliana and M. musculus, respectively (Supplementary Fig. 12). The resulting isoform expression table was then collapsed into a gene expression table by setting a gene’s expression to be the sum of expression values for all isoforms of that gene. We next removed all non-protein-coding transcripts except for long non-coding RNAs, and removed samples with large amounts (>30%) of non-protein-coding contamination (for example, rRNA). The data set was then expression filtered by keeping only genes with expression greater than 1 t.p.m. in at least 5% of all samples. The latter requirement ensured that outlier or extreme expression in just a few samples was not enough to keep the gene for analysis.

We then removed samples with an abnormally large number of genes with expression values of zero. To do this we calculated the mean and s.d. of the number of genes with zero expression across all samples. Samples for which the number of zero-expression genes exceeded the mean plus two times the s.d. were removed. Finally, we removed outlier samples by first examining the proportion of zeros contained in each sample and by computing the pairwise PCC between the gene expression profiles of all samples. To reduce heteroscedasticity, raw t.p.m. values for each gene were converted to a log scale (log10(t.p.m.+0.1)) before calculating correlations. For A. thaliana, the majority of samples had an average correlation with other samples of greater than 0.45 and fewer than 20% zero values. Samples with lower correlation or a greater percentage of zeros were removed (Supplementary Fig. 12). By similar criteria, samples with an average correlation with other samples of less than 0.55 or with greater than 30% zeros were removed for M. musculus (Supplementary Fig. 12). Manual inspection of ∼100 of these samples revealed they were highly enriched for non-polyA-selected samples and samples made from low-input RNA (for example, single cells).
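These filters are straightforward to express as array operations. The sketch below (NumPy, using the A. thaliana thresholds) mirrors the procedure described above; the mapping-rate and contamination filters, which require sample metadata, are omitted, and variable names are illustrative rather than taken from our pipeline.

import numpy as np

def filter_collection(tpm, min_corr=0.45, max_zero_frac=0.20):
    """tpm: genes x samples t.p.m. matrix."""
    # Keep genes with > 1 t.p.m. in at least 5% of all samples.
    tpm = tpm[(tpm > 1.0).mean(axis=1) >= 0.05]
    # Drop samples whose number of zero-expression genes exceeds the
    # mean plus two standard deviations across samples.
    n_zero = (tpm == 0).sum(axis=0)
    keep = n_zero <= n_zero.mean() + 2 * n_zero.std()
    # Drop outliers: low mean correlation to other samples on the log scale,
    # or too high a proportion of zeros.
    logx = np.log10(tpm + 0.1)
    r = np.corrcoef(logx.T)                          # samples x samples PCC
    mean_r = (r.sum(axis=1) - 1) / (r.shape[0] - 1)  # exclude self-correlation
    keep &= (mean_r > min_corr) & ((tpm == 0).mean(axis=0) < max_zero_frac)
    return tpm[:, keep]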

Metadata annotation

RNA-Seq samples are submitted to the SRA with non-standardized metadata annotations. For example, for some samples tissue and developmental stage are clearly noted as separate fields, whereas for others such information can only be found in the associated paper’s abstract, or sometimes only in its main text. To maximize annotation accuracy, we annotated samples manually until the structure of the gene expression space represented by the first three principal components was clear. Annotation was accomplished by first finding those few submissions with samples in multiple clusters. These submissions revealed that the likely separating variables of interest were tissue and developmental context. For each major cluster in the PCA (determined visually), we then annotated samples in order of submission size until the tissue or developmental context of that cluster became qualitatively clear.

Tradict algorithm

Tradict’s usage can be broken down into two parts: (1) Training and (2) Prediction. Training is the process of learning, from training data, the marker panel and its predictive relationship to the expression of transcriptional programs and to the remaining genes in the transcriptome. In essence, during training we begin with full transcriptome data and collapse its information into a subset of marker genes. Prediction is the reverse process of predicting the expression of transcriptional programs and non-marker genes from the expression measurements of just the selected markers.

Our training algorithm can be broken down into several steps: (1) Computing the latent logarithm of the training transcriptome collection, (2) defining transcriptional programs, (3) marker selection via Simultaneous Orthogonal Matching Pursuit and (4) building a predictive MVN-CP hierarchical model.

Computing the latent logarithm of the transcriptome

Expression values in our training data set are stored as t.p.m., which are non-negative, variably scaled and strongly heteroscedastic, similar to read counts. For subsequent steps in our algorithm and analysis, it is important to transform these data to improve their scaling and heteroscedasticity.

Often, one log-transforms such data. However, to avoid undefined values where the data are zero, one also adds a pseudocount (for example, 1). This pseudocount considers neither the gene’s a priori abundance nor the confidence with which the measurement was made, making the practice convenient but statistically unfounded. In previous work, we introduced the latent logarithm, or ‘lag’25. lag assumes that each observed expression value is actually a noisy realization of an unmeasured latent abundance. By taking the logarithm of this latent abundance, which considers both sampling depth and the gene’s a priori abundance, lag provides a more nuanced and statistically principled alternative to the conventional ‘log(x+pseudocount)’. With increasing data, lag quickly converges to log; in the absence of data, it relies on both sampling depth and the gene’s a priori abundance to make a non-zero estimate of the gene’s latent abundance.
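To make this concrete, a per-observation MAP estimate under the assumptions lag makes (Poisson sampling noise around a latent abundance, with a log-normal prior on that abundance) can be sketched as below. This is an illustrative simplification; the actual lag implementation (ref. 25) may differ in its details.

import numpy as np
from scipy.optimize import minimize_scalar

def lag_map(x, depth, mu, sigma):
    """MAP estimate of the log latent abundance for one observation.

    x     : observed count for the gene in this sample
    depth : sample sequencing depth (Poisson exposure)
    mu    : prior mean of the gene's log abundance (its a priori abundance)
    sigma : prior s.d. of the gene's log abundance
    """
    def neg_log_posterior(z):
        # Poisson log-likelihood (up to a constant) plus log-normal prior.
        return -(x * z - depth * np.exp(z)) + 0.5 * ((z - mu) / sigma) ** 2

    return minimize_scalar(neg_log_posterior, bounds=(mu - 10.0, mu + 10.0),
                           method="bounded").x

For a well-measured gene (large x), the likelihood dominates and the estimate approaches log(x/depth); for x = 0, the estimate shrinks toward the prior mean rather than being undefined, which is exactly the behaviour described above.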

With these intuitions in mind, we applied the lag transformation to our entire training data set. The lag-transformed expression matrix demonstrated a Pearson correlation of 0.98 with the log(t.p.m.+0.1)-transformed expression matrix for both A. thaliana and M. musculus. As expected, however, especially for measurements of zero expression, lag made better estimates of true abundance in the log domain. Availability: https://github.com/surgebiswas/latent_log.git

Defining transcriptional programs

We define a transcriptional program to be the first principal component of the z-score standardized lag expression of the set of genes involved in a certain response or pathway26,27. This virtual program marker maximally captures (in one dimension) the information contained in the transcriptional program. We considered three criteria for defining a globally comprehensive, but interpretable list of transcriptional programs for A. thaliana and M. musculus:

  1. To capture as much information about the transcriptome as possible, we wanted to maximize the number of genes covered by the transcriptional programs.

  2. To improve interpretability, we wanted to minimize the total number of transcriptional programs.

  3. The number of genes in a transcriptional program should be neither too large nor too small—genes in a transcriptional program should be in the same pathway.

Rather than defining these transcriptional programs de novo, we took a knowledge-based approach and defined them using the Gene Ontology (GO). We also tried using KEGG pathways, but found these were less complete and nuanced than GO annotations. GO comprises three sub-ontologies, or aspects: molecular function, biological process and cellular component. Each of these ontologies contains terms that are arranged as a directed acyclic graph with the above three terms as roots. Terms higher in the graph are less specific than those near the leaves36,37. Thus, with respect to the three criteria above, we wanted to find GO terms of low-to-moderate height in the graph, such that they were neither too specific nor too general. Given that we were interested in monitoring the status of different processes in the organism, we focused on the Biological Process ontology.

We downloaded gene association files for A. thaliana and M. musculus from the Gene Ontology Consortium (http://geneontology.org/page/download-annotations). We then examined for each of several minimum and maximum GO term sizes (defined by the number of genes annotated with that GO term) the number of GO terms that fit this size criterion and the number of genes covered by these GO terms.
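The tabulation itself is simple. The sketch below assumes the standard tab-delimited GAF layout (gene identifier in column 2, GO ID in column 5), counts direct annotations only and does not propagate annotations up the ontology graph; it is illustrative rather than our exact analysis code.

from collections import defaultdict

def term_sizes(gaf_path):
    """Map each GO term to the set of genes directly annotated with it."""
    genes_per_term = defaultdict(set)
    with open(gaf_path) as fh:
        for line in fh:
            if line.startswith("!"):          # skip GAF comment/header lines
                continue
            cols = line.rstrip("\n").split("\t")
            genes_per_term[cols[4]].add(cols[1])
    return genes_per_term

def coverage(genes_per_term, min_size, max_size):
    """Number of GO terms within the size bounds, and genes they cover."""
    kept = {t: g for t, g in genes_per_term.items()
            if min_size <= len(g) <= max_size}
    covered = set().union(*kept.values()) if kept else set()
    return len(kept), len(covered)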

Supplementary Data Tables 1 and 2 contain the results of this analysis for A. thaliana and M. musculus, respectively. A. thaliana has 3,333 GO annotations for 27,671 genes. We noticed that when the minimum GO term size was as small as possible (1) and we moved from a maximum GO term size of 5,000 to one of 10,000, we jumped from covering 18,432 genes (67% of the transcriptome) to covering the full transcriptome (the two bold rows of Supplementary Data Table 1). This is due to the addition of a single GO term, the most general ‘Biological Process’ term. Thus, we concluded that 33% of the genes in the transcriptome had only ‘Biological Process’ as a GO annotation, and therefore that we did not need to capture these genes in our GO-term-derived gene sets. Though these genes are not informatively annotated, Tradict still models their expression all the same. We hereafter refer to the set of genes annotated with more than just the ‘Biological Process’ term as informatively annotated.

We reasoned that a minimum GO term size of 50 and a maximum size of 2,000 best met our aforementioned criteria for defining globally representative GO-term-derived gene sets. These size thresholds defined 150 GO terms, which in total covered 15,124 genes (82.1% of the informatively annotated genes, and 54.7% of the full transcriptome). These 150 GO-term-derived, globally comprehensive transcriptional programs covered the major pathways related to growth, development and response to the environment.

We performed a similar GO term size analysis for M. musculus (Supplementary Data Table 2). M. musculus has 10,990 GO annotations for 23,566 genes. Of these genes, 6,832 (29.0%) had only the ‘Biological Process’ term annotation and were considered not informatively annotated. As we did for A. thaliana, we selected a GO term size minimum of 50 and a maximum of 2,000. These size thresholds defined 368 GO terms, which in total covered 14,873 genes (88.9% of the informatively annotated genes, 63% of the full transcriptome). As we found for A. thaliana, these 368 GO-term-derived, globally comprehensive transcriptional programs covered the major pathways related to growth, development and response to the environment.

Supplementary Data Tables 3 and 4 contain the lists of the globally comprehensive transcriptional programs as defined by the criteria above. For each of these programs, we then computed its first principal component over all constituent genes.
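Concretely, each program’s expression vector can be computed as below. This is a minimal NumPy/scikit-learn sketch, not Tradict’s actual code; note that the sign of a principal component is arbitrary.

import numpy as np
from sklearn.decomposition import PCA

def program_expression(z_lag, program_gene_idx):
    """z_lag: samples x genes matrix of lag-transformed expression.
    Returns the program's expression across samples (its first PC)."""
    sub = z_lag[:, program_gene_idx]
    sub = (sub - sub.mean(axis=0)) / sub.std(axis=0)       # z-score each gene
    return PCA(n_components=1).fit_transform(sub).ravel()  # samples-long vector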

Marker selection via simultaneous orthogonal matching pursuit

After defining transcriptional programs, we have a #-training-samples × #-transcriptional-programs table of expression values. We decompose this matrix using an adapted version of the Simultaneous Orthogonal Matching Pursuit algorithm, using the #-training-samples × #-genes table as a dictionary28,29. Because transcriptional programs are often correlated with one another, we first cluster them using consensus clustering38,39, which produces a robust and stable clustering by taking the consensus of many clusterings performed by a base clustering algorithm. In total, 100 independent iterations of K-means are used as the base clusterings, and the number of clusters is determined using the Davies–Bouldin criterion40. The decomposition is greedy: in each iteration, the algorithm first finds the transcriptional program cluster with the largest unexplained variance. It then finds the gene with the maximum average absolute correlation to the expression of the transcriptional programs in this cluster. This gene is added to an ‘active set’, onto which the transcriptional program expression matrix is orthogonally projected. The fit is subtracted to produce a residual, on which the above steps are repeated until a predefined number of genes have been added to the active set or the residual variance of the transcriptional program expression matrix falls below a predefined threshold.
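The sketch below captures the shape of this greedy loop in simplified Python. Consensus clustering is assumed to have been run already and is passed in as a label vector, and gene correlations are computed against the residual program expression; this is not the exact Tradict implementation.

import numpy as np

def somp_select(Y, X, labels, n_markers):
    """Y: samples x programs, X: samples x genes (the dictionary),
    labels: NumPy array giving a consensus-cluster label per program."""
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)
    active, resid = [], Yc.copy()
    for _ in range(n_markers):
        # 1. Program cluster with the largest unexplained (residual) variance.
        c = max(set(labels),
                key=lambda k: resid[:, labels == k].var(axis=0).sum())
        # 2. Gene with the largest mean |correlation| to that cluster's programs.
        corr = np.corrcoef(Xc.T, resid[:, labels == c].T)
        corr = corr[:Xc.shape[1], Xc.shape[1]:]
        active.append(int(np.nanargmax(np.abs(corr).mean(axis=1))))
        # 3. Orthogonally project the programs onto the active set and
        #    subtract the fit to obtain the new residual.
        A = Xc[:, active]
        beta, *_ = np.linalg.lstsq(A, Yc, rcond=None)
        resid = Yc - A @ beta
    return active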

Building a predictive MVN-CP hierarchical model

Here we describe conceptually how we fit a predictive model that allows us to predict gene and transcriptional program expression from expression measurements of our selected markers. Readers interested in the full mathematical details of the MVN-CP hierarchical model are referred to Supplementary Note 6.

The MVN-CP distribution offers a way of modelling statistically coupled count-based or, more generally, non-negative random variables, such as the t.p.m. or count-based expression values of genes41,42,43,44. Here it is assumed that the t.p.m. expression of each gene in a given sample is a noisy, CP realization of some unmeasured latent abundance, the logarithm of which comes from an MVN distribution over the log-latent abundances of all genes in the transcriptome.
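In symbols, writing z for the vector of log-latent abundances of all genes in a sample and s for that sample’s sequencing depth, the hierarchy is, schematically (the exact model is given in Supplementary Note 6):

\mathbf{z} \sim \mathrm{MVN}(\boldsymbol{\mu}, \boldsymbol{\Sigma}), \qquad x_g \mid z_g \sim \mathrm{CP}\!\left(s\, e^{z_g}\right) \quad \text{independently for each gene } g.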

Given the marginalization properties of the MVN distribution, we need only learn the relationships between the selected markers and the non-marker genes. For the purposes of prediction, we need to estimate (1) the mean vector and (2) the covariance matrix over the log-latent t.p.m.’s of the markers, (3) the mean vector of the log-latent t.p.m.’s of the non-markers and (4) the cross-covariance matrix between the log-latent t.p.m.’s of markers and non-markers.

Note that before we can estimate these parameters, we must learn the log-latent t.p.m.’s of all genes. To do this we first lag-transform the entire training data set. We then learn the marker log-latent t.p.m.’s, and their associated mean vector and covariance matrix, using an iterative conditional modes algorithm. Specifically, we initialize our estimate of the marker log-latent t.p.m.’s to be the lag-transformed expression values, which by virtue of the lag’s probabilistic assumptions are also derived from a Normal-CP hierarchical model. We then iterate (1) estimation of the mean vector and the covariance matrix given the current estimate of the log-latent t.p.m.’s, and (2) maximum a posteriori estimation of the log-latent t.p.m.’s given the estimated mean vector, the covariance matrix and the measured t.p.m. values of the selected markers. A small regularization is added during estimation of the covariance matrix to ensure stability and to avoid infinite-data-likelihood singularities that arise from singular covariance matrices. This most often happens when a gene’s t.p.m. abundance is mostly zero (that is, there is little data for the gene), giving the MVN layer an opportunity to tightly couple this gene’s latent abundance to that of another gene, thereby producing a nearly singular covariance matrix.
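The alternation can be sketched as follows. Here map_update stands in for the per-sample maximum a posteriori step under the CP likelihood (a vector analogue of the lag_map sketch above), and the regularization strength eps is illustrative.

import numpy as np

def fit_marker_model(tpm_m, depths, map_update, n_iter=20, eps=1e-3):
    """tpm_m: samples x markers t.p.m.; depths: per-sample sequencing depths.
    map_update(x, depth, mu, sigma) -> MAP log-latent vector for one sample."""
    z = np.log(tpm_m + 0.1)      # stand-in init; Tradict uses the lag values
    for _ in range(n_iter):
        # (1) Moment estimates of the MVN layer given the current latents.
        mu = z.mean(axis=0)
        sigma = np.cov(z, rowvar=False) + eps * np.eye(z.shape[1])  # ridge
        # (2) Per-sample MAP update of the latents given (mu, sigma).
        for i in range(z.shape[0]):
            z[i] = map_update(tpm_m[i], depths[i], mu, sigma)
    return mu, sigma, z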

Learning the mean vector of the non-marker genes and the marker × non-marker cross-covariance matrix is considerably easier. For the mean vector, we simply take the sample mean of the lag-transformed t.p.m. values. For the cross-covariance matrix, we compute the sample cross-covariance between the learned log-latent marker t.p.m.’s and the log-latent non-marker t.p.m.’s obtained from the lag transformation. We find that these simple sample estimates are highly stable, given that our training collection includes thousands to tens of thousands of transcriptomes.
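In code, these moment estimates amount to a few lines (a NumPy sketch; both inputs are samples × features arrays on the latent log scale):

import numpy as np

def cross_params(z_markers, z_targets):
    """Sample mean of the targets and markers x targets cross-covariance."""
    mu_t = z_targets.mean(axis=0)
    zm = z_markers - z_markers.mean(axis=0)
    zt = z_targets - mu_t
    c_mt = zm.T @ zt / (zm.shape[0] - 1)   # markers x targets cross-covariance
    return mu_t, c_mt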

Using similar ideas, we can also encode the expression of the transcriptional programs. Recall that a principal component output by PCA is a linear combination of input features. Thus, by the central limit theorem, the expression of these transcriptional programs should behave like normal random variables. Indeed, after regressing out the first three principal components computed on the entire training samples × genes expression matrix from the expression values of the transcriptional programs (to remove the large effects of tissue and developmental stage), 85–90% of the transcriptional programs had expression that was consistent with a normal distribution (average P value=0.43, Pearson’s χ2 test). Consequently, as was done for the non-marker genes and as will be needed for decoding, we compute the mean vector of the transcriptional programs and the markers × transcriptional-programs cross-covariance matrix. These are given by the standard sample mean of the training transcriptional program expression values and the sample cross-covariance between the learned log-latent t.p.m.’s of the markers and the transcriptional program expression values.

Prediction

To perform prediction, we must translate newly obtained t.p.m. measurements of our marker genes into expression predictions for the transcriptional programs and the remaining non-marker genes. More specifically, we would like to formulate these predictions as conditional posterior distributions, which simultaneously provide an estimate of expression magnitude and our confidence in that estimate. To do this, we first sample the latent abundances of our markers from their posterior distribution using the measured t.p.m.’s and the 1 × markers mean vector and markers × markers covariance matrix previously learned from the training data. This is done using Metropolis–Hastings Markov chain Monte Carlo sampling (see Supplementary Note 6 for further details on tuning the proposal distribution, sample thinning, sampling depth and burn-in lengths). Using these sampled latent abundances and the previously estimated mean vectors and cross-covariance matrices, we can then use standard Gaussian conditioning to sample the log-latent expression of the transcriptional programs and the remaining genes in the transcriptome from their conditional distribution. These samples, in aggregate, are samples from the conditional posterior distribution of each gene and program and can be used to approximate properties of this distribution (for example, posterior mode (MAP) estimates and/or credible intervals).
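The Gaussian conditioning step is standard: given one sampled marker latent vector z_m, the conditional mean of the targets (non-marker genes or program scores) is mu_t + C_mt' Sigma_mm^{-1} (z_m - mu_m). A NumPy sketch follows; the Metropolis–Hastings sampling of z_m itself is described in Supplementary Note 6.

import numpy as np

def conditional_predict(z_m, mu_m, sigma_mm, mu_t, c_mt):
    """z_m: one sampled marker latent vector; c_mt: markers x targets."""
    w = np.linalg.solve(sigma_mm, z_m - mu_m)   # Sigma_mm^{-1} (z_m - mu_m)
    return mu_t + c_mt.T @ w                    # conditional mean of targets

Applying this to each MCMC draw of z_m and aggregating the results approximates the conditional posterior: the mode or mean of the draws gives a point estimate, and their quantiles give credible intervals.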

Code availability

Tradict is available at https://github.com/surgebiswas/tradict. All code for performing the data downloads and analysis and for generating figures is available at https://github.com/surgebiswas/transcriptome_compression.

Data availability

Raw or filtered transcript-quantified training transcriptomes, as well as any other processed forms of the data, are available upon request. Raw read data are directly accessible through the NCBI SRA.

Additional information

How to cite this article: Biswas, S. et al. Tradict enables accurate prediction of eukaryotic transcriptional states from 100 marker genes. Nat. Commun. 8, 15309 doi: 10.1038/ncomms15309 (2017).

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.