High-dimensional genomic data bias correction and data integration using MANCIE

Zang, Chongzhi; Wang, Tao; Deng, Ke; Li, Bo; Hu, Sheng’en; Qin, Qian; Xiao, Tengfei; Zhang, Shihua; Meyer, Clifford A.; He, Housheng Hansen; Brown, Myles; Liu, Jun S.; Xie, Yang; Liu, X. Shirley

doi:10.1038/ncomms11305

Article
Open access
Published: 13 April 2016

Volume 7, article number 11305, (2016)

Chongzhi Zang^1,2^na1,
Tao Wang ORCID: orcid.org/0000-0002-4355-149X^3,4^na1,
Ke Deng⁵,
Bo Li^1,2,6,
Sheng’en Hu⁷,
Qian Qin⁷,
Tengfei Xiao^1,2,8,
Shihua Zhang⁹,
Clifford A. Meyer^1,2,
Housheng Hansen He ORCID: orcid.org/0000-0003-2898-3363^1,2,8,10,
Myles Brown^2,8,
Jun S. Liu⁶,
Yang Xie^3,11,12 &
…
X. Shirley Liu^1,2

Abstract

High-dimensional genomic data analysis is challenging due to noise and biases in high-throughput experiments. We present a computational method, matrix analysis and normalization by concordant information enhancement (MANCIE), for bias correction and data integration of distinct genomic profiles on the same samples. MANCIE uses a Bayesian-supported principal component analysis-based approach to adjust the data so as to achieve better consistency between sample-wise distances in the different profiles. MANCIE can improve tissue-specific clustering in ENCODE data, prognostic prediction in Molecular Taxonomy of Breast Cancer International Consortium and The Cancer Genome Atlas data, and copy number and expression agreement in Cancer Cell Line Encyclopedia data, and it has broad applications in cross-platform, high-dimensional data integration.

Introduction

High-throughput genomic technologies have made it possible to generate massive data for studying biological mechanisms or disease aetiology. Such high-dimensional genomic data can usually be presented as a matrix, with each column representing a sample (for example, a patient, a cell type or an experimental condition) and each row representing a genomic feature (for example, a gene or a genomic locus). By computational analyses of these high-dimensional data matrices using dimension reduction (for example, principal component analysis, PCA) or clustering approaches, one can learn characteristic information within samples and identify key features between samples to interrogate biological functions. In many cases, multiple experimental platforms are applied to the same set of samples, generating more than one data matrix. For example, the ENCODE (Encyclopedia of DNA Elements) Consortium generated high-throughput data, including ChIP-seq, DNase-seq and exon array transcriptomes, on a designated panel of human cell lines¹; The Cancer Genome Atlas (TCGA) program² and the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC)³ generated mutation and gene-expression profiles of patient tumours; and the Cancer Cell Line Encyclopedia (CCLE) project⁴ provided copy number and gene-expression profiles for over a thousand cancer cell lines. Integrative analysis is critical for obtaining biological insights from these data sets, and a common challenge is identifying and correcting hidden biases in such high-dimensional data matrices.

In high-throughput data from different experimental platforms, it is not uncommon for a subset of samples in a data matrix from one platform to have technical biases^5,6. For example, in a cohort of dozens of samples, the expression and ChIP-seq profiling may be conducted in various batches, each with unique biases from sample collection and preparation, array hybridization, sequencing GC content⁷ or coverage differences that are challenging to identify and remove. Methods have been developed to remove batch effects within one data matrix from a single platform. For example, PCA has been used to solve such problems. As an extension of PCA, Sparse PCA⁵ uses a linear combination of a small subset of variables, instead of all of them, to generate the principal components; it still explains most of the variance present in the data while making the dimension reduction and bias removal clearer and easier to interpret⁸. Surrogate variable analysis (SVA)⁹ models the gene-expression heterogeneity bias as ‘surrogate variables’ and separates them from the primary variables that capture biologically meaningful information. These methods aim to normalize data within a single data matrix from a single platform. However, to our knowledge, methods that can normalize data from different matrices and borrow information between different platforms are still lacking.

Recently, Wang et al.¹⁰ proposed similarity network fusion (SNF), a method that first generates sample networks from each data platform separately, then uses network fusion to merge the platform-specific networks together with confidence weighting. SNF demonstrated good performance in separating TCGA glioblastoma samples into subtypes using transcriptome and DNA methylome profiles. However, SNF does not provide normalized data matrices that could be useful in downstream analysis. In addition, SNF is based on network construction, which could be sensitive to strong biases in a subset of samples that result in ‘high-weight’ edges in the network and are difficult to remove in the fusion step. In other words, if the networks generated from the individual data matrices are too dissimilar, they are difficult to ‘fuse’. A more generally applicable method is needed to simultaneously provide better sample clustering and generate normalized data matrices.

To overcome the above challenges, we propose MANCIE (matrix analysis and normalization by concordant information enhancement), an integrative computational method that can conduct data normalization and bias correction by borrowing information from a column-matched associated data matrix. Applied to ENCODE, METABRIC, TCGA and CCLE data, MANCIE showed effectiveness in improved identification of biologically meaningful patterns.

Results

Method overview

MANCIE takes in two data matrices and adjusts one (hereafter the ‘main matrix’) using the other (hereafter the ‘associated matrix’) by identifying and maintaining the concordant information and reducing the discordant information between them. The two data matrices contain profiles of the same set of samples generated using different experimental platforms (for example, copy number variation (CNV) and RNA-seq on the same collection of tumours), or generated independently (for example, expression profiles measured at different institutions on the same collection of cell lines). If the rows of the two matrices are unmatched (for example, genes versus ChIP-seq peaks), MANCIE first generates a summarized associated matrix whose rows match those of the main matrix using a biologically motivated matching process (Supplementary Fig. 1, see Methods for details). This matching step requires additional biological information to connect the rows of the two matrices; for example, each gene (a row vector in the main matrix) corresponds to a row vector summarized from a few nearby transcription factor (TF)-binding sites (a few rows in the associated matrix). MANCIE assumes that pairwise sample distances measured by different platforms should be similar, and that discordance in the pairwise distances largely arises from technical biases and/or noise. Therefore, the second and key step of MANCIE adjusts the main matrix row by row by borrowing information from the associated matrix. In each row, depending on whether the correlation between the main and associated data is high, moderate or low, MANCIE takes the first principal component (scenario 3), a correlation-weighted sum (scenario 2) or the original data (scenario 1) as the adjusted data, respectively (Fig. 1, see Methods for details). The correlation cutoffs are determined empirically, such that roughly one-third of the rows are adjusted under scenario 3.
Finally, the output of MANCIE is the normalized adjusted matrix, which has the same dimensions as the main matrix but incorporates information from the associated matrix. It is worth noting that one can swap the main and associated matrices, so that each data set can be improved using the other. We show that this approach is an appropriate approximation of the full Bayesian inference for reducing noise in such data sets (Supplementary Notes). In the following sections, we apply MANCIE to several data sets generated by large consortia, including ENCODE¹, METABRIC³, TCGA² and CCLE⁴, to demonstrate its utility in genomic data integration.
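The row-by-row adjustment described above can be illustrated with a minimal Python sketch. It is not the MANCIE R implementation: the function names `standardize`, `adjust_row` and `adjust_matrix` are hypothetical, the rows are assumed to be already matched, every row is standardized before use, and the correlation-weighted sum is assumed to take the form x + r·y. The default cutoffs of 0 and 0.5 follow the values reported in the Choice of parameters section.

```python
import numpy as np

def standardize(v):
    """Centre a vector and scale it to unit s.d. (constant vectors become zeros)."""
    v = np.asarray(v, dtype=float)
    centred = v - v.mean()
    sd = centred.std()
    return centred / sd if sd > 0 else centred

def adjust_row(main_row, assoc_row, cutoff1=0.0, cutoff2=0.5):
    """Return (adjusted_row, scenario) for one pair of matched rows."""
    x = standardize(main_row)
    y = standardize(assoc_row)
    # Pearson correlation of two standardized vectors (0 if either is constant).
    r = float(np.mean(x * y))
    if r < cutoff1:
        # Scenario 1 (low correlation): keep the (standardized) main data.
        return x, 1
    if r < cutoff2:
        # Scenario 2 (moderate correlation): correlation-weighted sum.
        # The exact weighting is an assumption of this sketch.
        return x + r * y, 2
    # Scenario 3 (high correlation): first principal component of the
    # two-variable data set formed by the main and associated rows.
    data = np.column_stack([x, y])              # samples x 2
    _, _, vt = np.linalg.svd(data, full_matrices=False)
    pc1 = data @ vt[0]
    if np.dot(pc1, x) < 0:                      # fix the arbitrary sign of the PC
        pc1 = -pc1
    return pc1, 3

def adjust_matrix(main, assoc, cutoff1=0.0, cutoff2=0.5):
    """Adjust every row of `main` using the matching row of `assoc`."""
    main = np.asarray(main, dtype=float)
    assoc = np.asarray(assoc, dtype=float)
    if main.shape != assoc.shape:
        raise ValueError("main and associated matrices must have matched rows and columns")
    adjusted = np.empty_like(main)
    scenarios = np.empty(main.shape[0], dtype=int)
    for i in range(main.shape[0]):
        adjusted[i], scenarios[i] = adjust_row(main[i], assoc[i], cutoff1, cutoff2)
    return adjusted, scenarios
```

Calling `adjust_matrix(main, assoc)` and then `adjust_matrix(assoc, main)` mirrors the swap of main and associated matrices mentioned above; the returned `scenarios` vector makes it easy to check what fraction of rows fell under each scenario for a given pair of cutoffs.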

ENCODE data

From the ENCODE consortium¹, we obtained data from 61 cell lines for which both DNase-seq data for chromatin accessibility profiling and Affymetrix exon array data for gene-expression profiling are available. These cell lines can be classified into seven groups by their tissues of origin (Fig. 2). DNase hypersensitive sites (DHS, measured by DNase-seq) mark open chromatin regions that can be considered a repertoire of all putative cis-regulatory elements in the genome^11,12 that regulate gene expression as measured by exon arrays. Although active gene promoters are usually DHS, most DHS are located in introns or intergenic regions, marking distal enhancers^1,13, which are often more dynamic across different cell types or conditions^14,15,16,17. Focusing on enhancers, we generated the DHS data matrix based on a union set of intronic and intergenic DHS peaks identified from all the DNase-seq data, and obtained the gene-expression data matrix based on the exon array data. We used MANCIE to adjust each data matrix using the other as the associated matrix. Since the rows of the two data matrices are not matched, we generated the summarized associated matrices using the genomic locations of genes (based on the transcription start site, TSS) and DHS (based on the DHS peak centre). For summarization of DHS data around genes, we used up to 50 nearby DHS located within 100 kb of the TSS of each gene as the local sub-matrix for that gene. For summarization of gene-expression data around DHS, we used a similar approach but with up to 20 nearby genes, considering that there are far fewer genes than DHS. The adjusted main matrix was then generated by integrating the main matrix with the summarized associated matrix. We conducted multi-dimensional scaling on the adjusted data as well as the raw data, and plotted two principal components with each data point representing a cell line (Fig. 2a,b).
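The window-based selection of nearby features can be sketched as below. This is a hypothetical helper rather than the MANCIE code: it assumes that all coordinates lie on one chromosome and that "up to 50" means the closest 50 features within the window. How the resulting local sub-matrix is collapsed into a single summarized row is described in the Methods and is not reproduced here.

```python
import numpy as np

def nearby_feature_indices(tss, peak_centres, window=100_000, max_features=50):
    """Indices of the features closest to `tss`, restricted to +/- `window` bp.

    `peak_centres` holds feature positions on the same chromosome as `tss`.
    At most `max_features` indices are returned, ordered from nearest to farthest.
    """
    peak_centres = np.asarray(peak_centres)
    distance = np.abs(peak_centres - tss)
    within = np.flatnonzero(distance <= window)
    # A stable sort keeps the original order among equally distant features.
    ordered = within[np.argsort(distance[within], kind="stable")]
    return ordered[:max_features]
```

For the reverse direction (genes around each DHS), the same helper would be called with the DHS peak centre as the anchor position, the TSS coordinates as features and `max_features=20`.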
We hypothesized that cell lines belonging to the same tissue type should be more similar to each other than cell lines from different tissue types, a pattern previously reported for these data types¹⁶. Indeed, although MANCIE only aims to make cell lines that are similar in one platform more similar in the other platform, the end result is that the adjusted data, both DHS and expression, show better clustering of cell lines according to their tissue types (Fig. 2a,b). To assess the improvement in clustering quantitatively, we performed K-means clustering with randomly sampled seeds 1,000 times on each data set and, for each run, calculated the adjusted Rand index¹⁸, which measures the similarity between the K-means clustering and the actual tissue-type grouping. The average adjusted Rand index is significantly higher for MANCIE-adjusted data than for the raw data, indicating that MANCIE improves the tissue-type clustering (Fig. 2c,d). In contrast, cell line clustering using SVA-adjusted data is actually worse than that using the raw data (Supplementary Fig. 2a,b).
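The clustering evaluation can be sketched in Python as follows. This is an illustrative sketch under assumptions, not the code used in the paper: `kmeans_labels` is a plain Lloyd's-algorithm K-means with randomly chosen starting samples, `adjusted_rand_index` implements the standard contingency-table formula for the adjusted Rand index, and `mean_ari` is a hypothetical wrapper that repeats the clustering and averages the index.

```python
import numpy as np
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same samples."""
    _, a_idx = np.unique(np.asarray(labels_a), return_inverse=True)
    _, b_idx = np.unique(np.asarray(labels_b), return_inverse=True)
    a_idx, b_idx = a_idx.ravel(), b_idx.ravel()
    table = np.zeros((a_idx.max() + 1, b_idx.max() + 1), dtype=int)
    np.add.at(table, (a_idx, b_idx), 1)            # contingency table
    n = int(table.sum())
    if n < 2:
        return 1.0
    sum_cells = sum(comb(int(v), 2) for v in table.ravel())
    sum_rows = sum(comb(int(v), 2) for v in table.sum(axis=1))
    sum_cols = sum(comb(int(v), 2) for v in table.sum(axis=0))
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:                      # degenerate case, e.g. one cluster each
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

def kmeans_labels(points, k, rng, n_iter=50):
    """Plain Lloyd's K-means; `points` has one row per sample."""
    points = np.asarray(points, dtype=float)
    centres = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):                # leave empty clusters where they are
                centres[j] = points[labels == j].mean(axis=0)
    return labels

def mean_ari(matrix, true_labels, n_runs=1000, seed=0):
    """Average ARI over repeated K-means runs on a features-by-samples matrix."""
    samples = np.asarray(matrix, dtype=float).T    # columns are samples
    k = len(set(true_labels))
    rng = np.random.default_rng(seed)
    scores = [adjusted_rand_index(true_labels, kmeans_labels(samples, k, rng))
              for _ in range(n_runs)]
    return float(np.mean(scores))
```

Comparing `mean_ari(raw_matrix, tissue_labels)` with `mean_ari(adjusted_matrix, tissue_labels)` gives the kind of raw-versus-adjusted comparison described above; here the number of clusters is assumed to equal the number of distinct tissue labels.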

We next investigated the implications of the MANCIE adjustment for the ENCODE data. As GC-content bias is one major source of bias in next-generation sequencing data, we first checked whether MANCIE can reduce the GC-content biases in the DNase-seq data. For each cell line, we calculated the distribution of the GC content of all sequence reads in the DNase-seq data set as well as the magnitude of MANCIE adjustment, measured by the Euclidean distance between the corresponding column vectors in the raw and the MANCIE-adjusted data matrices. Cell lines showing GC-content patterns that were farther from the average of all cell lines (Supplementary Fig. 2c) underwent a greater magnitude of MANCIE adjustment than the other cell lines (Fig. 2e and Supplementary Fig. 2d). This result indicates that MANCIE successfully corrected the GC-content biases in the DNase-seq data.
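The per-sample magnitude of adjustment is simple to compute. The sketch below uses a hypothetical function name and assumes that the raw and adjusted matrices have the same shape and are on a comparable scale (for example, both standardized row by row).

```python
import numpy as np

def adjustment_magnitude(raw, adjusted):
    """Euclidean distance between matching columns of two features-by-samples matrices."""
    raw = np.asarray(raw, dtype=float)
    adjusted = np.asarray(adjusted, dtype=float)
    if raw.shape != adjusted.shape:
        raise ValueError("raw and adjusted matrices must have the same shape")
    # One value per column, that is, one value per sample (cell line).
    return np.linalg.norm(adjusted - raw, axis=0)
```

Ranking the cell lines by this value and comparing the ranking with how far each cell line's GC-content distribution lies from the average is the comparison described in the paragraph above.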

To further evaluate MANCIE performance in adjusting the DNase-seq data, we selected the top 2,000 DHSs with the greatest increase after MANCIE adjustment in the cell lines with the biggest adjustment, and performed sequence motif analysis on these DHSs. We found that the sequence motifs enriched in these DHSs usually match cell-type-specific TFs (Fig. 2e). For example, the ETS motif is enriched in both TH1 and TH2 cell lines, and the ETS-family TFs ERM and PU.1 are specific to TH1 and TH2 cell lines, respectively^19,20. The motif of the megakaryocyte-specific TF NF-E2¹⁷ is enriched in the megakaryocyte cell line CMK. These results demonstrate that, when integrated with gene-expression data, MANCIE-adjusted DHSs show an increased pattern of cell-type specificity that is better correlated with the cell-type-specific gene-expression pattern. Taken together, MANCIE is able to integrate the genomic DHS data with gene expression, reduce potential biases and emphasize biologically meaningful signals.

METABRIC and TCGA data

We applied MANCIE to the METABRIC breast cancer data sets³ to demonstrate its effectiveness for noise reduction on another data platform. METABRIC has two independent cohorts of breast cancer patients. Each cohort has around 1,000 patients with gene-expression values, CNV values and survival information. We used the first cohort as the training set and the second cohort as the independent testing set to predict survival from gene-expression data. MANCIE was applied to adjust the gene-expression data based on CNV data. The underlying assumptions are: first, if gene-expression signatures can predict patient survival outcome, noise-reduced and bias-corrected gene-expression data should predict patient survival more accurately; second, genes with concordant copy number and expression are more likely to be reliably measured and are more informative for outcome prediction. Indeed, we found that the MANCIE-adjusted data predict survival better than the original expression data, by better distinguishing patients with lower and higher risk of death, with an example shown in Fig. 3a. To assess the improvement quantitatively, we compared the logrank P values obtained using the original training and testing expression matrices with those obtained using the adjusted training and testing expression matrices. We limited our analysis to the subset of genes whose adjusted expression values differ most from the original values, defined as those for which the correlation between the adjusted vector and the original vector is smaller than a threshold in either the training or the testing data set. Across a series of thresholds from 0.7 to 0.93, MANCIE consistently improved the prediction accuracy by generating smaller P values (Fig. 3b).
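The gene-selection rule just described can be sketched as follows. The function names are hypothetical, and the sketch assumes that the original and adjusted matrices have genes as rows in the same order in the training and testing sets; constant rows, for which the correlation is undefined, are never selected.

```python
import numpy as np

def row_correlations(original, adjusted):
    """Pearson correlation between matching rows of two matrices (NaN for constant rows)."""
    o = np.asarray(original, dtype=float)
    a = np.asarray(adjusted, dtype=float)
    o = o - o.mean(axis=1, keepdims=True)
    a = a - a.mean(axis=1, keepdims=True)
    denom = np.sqrt((o ** 2).sum(axis=1) * (a ** 2).sum(axis=1))
    with np.errstate(invalid="ignore", divide="ignore"):
        return (o * a).sum(axis=1) / denom

def select_changed_rows(train_orig, train_adj, test_orig, test_adj, threshold):
    """Indices of genes whose original/adjusted correlation is below `threshold`
    in either the training or the testing set."""
    r_train = row_correlations(train_orig, train_adj)
    r_test = row_correlations(test_orig, test_adj)
    # Comparisons with NaN are False, so constant rows are never selected.
    return np.flatnonzero((r_train < threshold) | (r_test < threshold))
```

Repeating the call for each threshold in the series (for example 0.7, 0.8, 0.9 and 0.93) yields the nested gene sets that are passed on to the survival analysis.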

Although TCGA also has breast cancer profiles, the death events are too few to provide meaningful survival separation. Therefore, we applied MANCIE on TCGA lung adenocarcinoma data² for survival prediction. A total of 10,704 genes for 417 tumours with complete expression, CNV and clinical information were used, and MANCIE was applied to adjust the gene-expression data based on CNV data. For comparison, the gene-expression data matrix was also adjusted by the SVA method. To test the effectiveness of MANCIE adjustment, we selected six prognostic gene signatures for non-small cell lung cancer from previous publications^{21,22,23,24,31}. The sequence motif scan analyses were performed using the MDSeqPos algorithm³² on the Cistrome analysis pipeline³³.

Survival analysis for METABRIC data set

We adjusted both the training set expression matrix and the testing set expression matrix with the corresponding CNV data matrix using default parameters. We then calculated the Pearson correlation between each row vector in the original training set expression matrix and the corresponding row vector in the adjusted training set expression matrix, and did the same for the testing set matrix before and after adjustment. We focused our analysis on the row vectors (gene expression across patients) whose Pearson correlations are below a certain threshold in either the training set or the testing set. To be objective, we chose a series of threshold values for the downstream analysis. We then used LASSO³⁴ with the Cox family on the genes selected in the previous step to further narrow them down to genes whose expression levels are significantly correlated with survival in the training set. Next, we fitted a multivariate Cox proportional hazards (Coxph) model³⁵ using these selected genes and applied it to the testing set, for which the Coxph model returned a risk score vector for the testing set patients. We dichotomized the risk scores at the median and calculated the logrank P value for the difference in overall survival between the low-risk and high-risk groups. Since the LASSO method is stochastic, generating slightly different results especially when the number of input features is large, we ran the same analysis 20 times and obtained a distribution of logrank P values.
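The last two steps, the median split and the logrank test, can be sketched in plain Python. This is a textbook two-group logrank test rather than the software used in the paper; the function names are hypothetical, patients with a risk score exactly equal to the median are assumed to fall in the low-risk group, and the LASSO and Cox model fitting that produce the risk scores are not shown.

```python
import numpy as np
from math import erfc, sqrt

def dichotomize_by_median(risk_scores):
    """True for high-risk patients (score strictly above the median)."""
    risk_scores = np.asarray(risk_scores, dtype=float)
    return risk_scores > np.median(risk_scores)

def logrank_p_value(time, event, high_risk):
    """Two-group logrank test; `event` is True for death, False for censoring."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    high = np.asarray(high_risk, dtype=bool)
    observed = expected = variance = 0.0
    for t in np.unique(time[event]):               # distinct event times
        at_risk = time >= t
        n = int(at_risk.sum())                     # patients at risk overall
        n1 = int((at_risk & high).sum())           # at risk in the high-risk group
        dying = event & (time == t)
        d = int(dying.sum())                       # deaths at time t overall
        d1 = int((dying & high).sum())             # deaths at time t, high-risk group
        observed += d1
        expected += d * n1 / n
        if n > 1:                                  # hypergeometric variance term
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0
    chi2 = (observed - expected) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return erfc(sqrt(chi2 / 2.0))
```

Running `logrank_p_value(time, event, dichotomize_by_median(risk_scores))` once per repetition of the stochastic LASSO step would produce the distribution of P values described above.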

Data analysis by SVA

For TCGA data, we used cancer stage and smoking status as the primary variables, no variable of known noise source and one surrogate variable to be estimated. For CCLE/GDSC data, we used no variable of known noise source and one surrogate variable to be estimated. For ENCODE data, we used tissue type as the known noise source and one surrogate variable to be estimated. Each row vector was regressed on the estimated surrogate variable and replaced by the regression residuals.
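The final residualization step can be sketched as below. Estimating the surrogate variable itself is the job of the SVA method⁹ and is not shown; the function name is hypothetical, and including an intercept in each row-wise regression is an assumption of this sketch.

```python
import numpy as np

def remove_surrogate_variable(matrix, surrogate):
    """Replace each row by its residuals after regression on one surrogate variable.

    `matrix` is features x samples; `surrogate` has one value per sample.
    """
    matrix = np.asarray(matrix, dtype=float)
    surrogate = np.asarray(surrogate, dtype=float)
    # Design matrix with an intercept column (assumed) and the surrogate variable.
    design = np.column_stack([np.ones_like(surrogate), surrogate])
    # Least-squares fit for all rows at once: coef has shape (2, n_features).
    coef, *_ = np.linalg.lstsq(design, matrix.T, rcond=None)
    return matrix - (design @ coef).T
```

A row that is an exact linear function of the surrogate variable is reduced to zeros, while a centred row that is uncorrelated with it is returned unchanged.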

Choice of parameters

For the METABRIC and TCGA data studied in this paper, Cutoff1 was set to 0 and Cutoff2 to 0.5. For CCLE data, the two parameters were 0 and 0.7, respectively. For ENCODE data, the two parameters were 0 and 0.5, respectively. We used the differences in negative log-transformed logrank P values as a metric to evaluate the effect of different combinations of the higher and lower cutoffs on the performance of MANCIE for the METABRIC data set.

Availability

MANCIE is available as an R-package (http://cran.r-project.org/web/packages/MANCIE/).

Additional information

How to cite this article: Zang, C. et al. High-dimensional genomic data bias correction and data integration using MANCIE. Nat. Commun. 7:11305 doi: 10.1038/ncomms11305 (2016).

References

Thurman, R. E. et al. The accessible chromatin landscape of the human genome. Nature 489, 75–82 (2012).
The Cancer Genome Atlas Research Network. Comprehensive molecular profiling of lung adenocarcinoma. Nature 511, 543–550 (2014).
Curtis, C. et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 486, 346–352 (2012).
Barretina, J. et al. The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603–607 (2012).
Leek, J. T. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 11, 733–739 (2010).
Meyer, C. A. & Liu, X. S. Identifying and mitigating bias in next-generation sequencing methods for chromatin biology. Nat. Rev. Genet. 15, 709–721 (2014).
Dohm, J. C., Lottaz, C., Borodina, T. & Himmelbauer, H. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 36, e105 (2008).
Zou, H., Hastie, T. & Tibshirani, R. Sparse principal component analysis. J. Comput. Graph. Stat. 15, 265–286 (2006).
Leek, J. T. & Storey, J. D. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 3, e161 (2007).
Wang, B. et al. Similarity network fusion for aggregating data types on a genomic scale. Nat. Methods 11, 333–337 (2014).
Gross, D. S. & Garrard, W. T. Nuclease hypersensitive sites in chromatin. Annu. Rev. Biochem. 57, 159–197 (1988).
Felsenfeld, G. & Groudine, M. Controlling the double helix. Nature 421, 448–453 (2003).
Sabo, P. J. et al. Genome-scale mapping of DNase I sensitivity in vivo using tiling DNA microarrays. Nat. Methods 3, 511–518 (2006).
Heintzman, N. D. et al. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat. Genet. 39, 311–318 (2007).
Boyle, A. P. et al. High-resolution mapping and characterization of open chromatin across the genome. Cell 132, 311–322 (2008).
Stergachis, A. B. et al. Developmental fate and cellular maturity encoded in human regulatory DNA landscapes. Cell 154, 888–903 (2013).
Luyten, A., Zang, C., Liu, X. S. & Shivdasani, R. A. Active enhancers are delineated de novo during hematopoiesis, with limited lineage fidelity among specified primary blood cells. Genes Dev. 28, 1827–1839 (2014).
Rand, W. M. Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66, 846–850 (1971).
Ouyang, W. et al. The Ets transcription factor ERM is Th1-specific and induced by IL-12 through a Stat4-dependent pathway. Proc. Natl Acad. Sci. USA 96, 3888–3893 (1999).
Chang, H.-C. et al. The transcription factor PU.1 is required for the development of IL-9-producing T cells and allergic inflammation. Nat. Immunol. 11, 527–534 (2010).
Beer, D. G. et al. Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat. Med. 8, 816–824 (2002).
Guo, L. Constructing molecular classifiers for the accurate prognosis of lung adenocarcinoma. Clin. Cancer Res. 12, 3344–3354 (2006).
Larsen, J. E. et al. Gene expression signature predicts recurrence in lung adenocarcinoma. Clin. Cancer Res. 13, 2946–2954 (2007).
Roepman, P. et al. An immune response enriched 72-gene prognostic profile for early-stage non-small-cell lung cancer. Clin. Cancer Res. 15, 284–290 (2009).
Xie, Y. et al. Robust gene expression signature from formalin-fixed paraffin-embedded samples predicts prognosis of non-small-cell lung cancer patients. Clin. Cancer Res. 17, 5705–5714 (2011).
Lu, Y. et al. A gene expression signature predicts survival of patients with stage I non-small cell lung cancer. PLoS Med. 3, 2229–2243 (2006).
Bair, E. & Tibshirani, R. Semi-supervised methods to predict patient survival from gene expression data. PLoS Biol. 2, e108 (2004).
Andersen, P. K. & Gill, R. D. Cox’s regression model for counting processes: a large sample study. Ann. Stat. 10, 1100–1120 (1982).
Yang, W. et al. Genomics of drug sensitivity in cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Res. 41, D955–D961 (2012).
Zhang, Y. et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 9, R137 (2008).
Seok, J., Xu, W., Gao, H., Davis, R. W. & Xiao, W. JETTA: junction and exon toolkits for transcriptome analysis. Bioinformatics 28, 1274–1275 (2012).
Lupien, M. et al. FoxA1 translates epigenetic signatures into enhancer-driven lineage-specific transcription. Cell 132, 958–970 (2008).
Liu, T. et al. Cistrome: an integrative platform for transcriptional regulation studies. Genome Biol. 12, R83 (2011).
Tibshirani, R. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 58, 267–288 (1996).
Cox, D. R. Regression models and life-tables. J. R. Stat. Soc. Ser. B Stat. Methodol. 34, 187–220 (1972).

Acknowledgements

This work was partially supported by the US National Institutes of Health (NIH) grants U41HG7000 (X.S.L.), 1R01GM099409 (X.S.L.), 5R01CA172211 (Y.X.), 1R01CA152301 (Y.X.), Leukemia and Lymphoma Society (LLS) fellow award (C.Z.), and the National Natural Science Foundation of China (NSFC) grant 11401338 (K.D.) and 61422309 (S.Z.).

Author information

Chongzhi Zang and Tao Wang: These authors contributed equally to this work.

Authors and Affiliations

Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute and Harvard T.H. Chan School of Public Health, Boston, 02215, Massachusetts, USA
Chongzhi Zang, Bo Li, Tengfei Xiao, Clifford A. Meyer, Housheng Hansen He & X. Shirley Liu
Center for Functional Cancer Epigenetics, Dana-Farber Cancer Institute, Boston, 02215, Massachusetts, USA
Chongzhi Zang, Bo Li, Tengfei Xiao, Clifford A. Meyer, Housheng Hansen He, Myles Brown & X. Shirley Liu
Department of Clinical Sciences, Quantitative Biomedical Research Center, University of Texas Southwestern Medical Center, Dallas, 75390, Texas, USA
Tao Wang & Yang Xie
Center for the Genetics of Host Defense, University of Texas Southwestern Medical Center, Dallas, 75390, Texas, USA
Tao Wang
Center for Statistical Science, Tsinghua University, Beijing, 100084, China
Ke Deng
Department of Statistics, Harvard University, Cambridge, 02138, Massachusetts, USA
Bo Li & Jun S. Liu
Department of Bioinformatics, School of Life Sciences, Tongji University, Shanghai, 200092, China
Sheng’en Hu & Qian Qin
Department of Medical Oncology, Dana-Farber Cancer Institute and Harvard Medical School, Boston, 02215, Massachusetts, USA
Tengfei Xiao, Housheng Hansen He & Myles Brown
National Center for Mathematics and Interdisciplinary Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China
Shihua Zhang
Department of Medical Biophysics, University of Toronto, Toronto, M5G 1L7, Ontario, Canada
Housheng Hansen He
Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, 75390, Texas, USA
Yang Xie
Simons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, 75390, Texas, USA
Yang Xie

Contributions

C.Z., T.W., C.A.M. and X.S.L. conceived the idea. C.Z. and T.W. developed the method with the help of K.D. and J.S.L. on the theoretical support. C.Z. and T.W. analysed the data with the help and inputs of B.L., S.H., Q.Q., T.X., S.Z., H.H.H. and M.B. C.Z., T.W., K.D., and X.S.L. wrote the paper. X.S.L. and Y.X. oversaw the research. All authors read and approved the manuscript.

Corresponding authors

Correspondence to Yang Xie or X. Shirley Liu.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Information

Supplementary Figures 1-4, Supplementary Note 1. (PDF 1469 kb)

Rights and permissions

This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

About this article

Cite this article

Zang, C., Wang, T., Deng, K. et al. High-dimensional genomic data bias correction and data integration using MANCIE. Nat Commun 7, 11305 (2016). https://doi.org/10.1038/ncomms11305

Received: 17 September 2015
Accepted: 11 March 2016
Published: 13 April 2016
DOI: https://doi.org/10.1038/ncomms11305
Springer Nature Limited
