Abstract
An annotation is a set of genomic intervals sharing a particular function or property. Examples include genes, conserved elements, and epigenetic modifications. A common task is to compare two annotations to determine if one is enriched or depleted in the regions covered by the other. We study the problem of assigning statistical significance to such a comparison based on a null model representing two random unrelated annotations. To incorporate more background information into such analyses and avoid biased results, we propose a new null model based on a Markov chain which differentiates among several genomic contexts. These contexts can capture various confounding factors, such as GC content or sequencing gaps. We then develop a new algorithm for estimating p-values by computing the exact expectation and variance of the test statistic and then estimating the p-value using a normal approximation. Compared to the previous algorithm by Gafurov et al., the new algorithm provides three advances: (1) the running time is improved from quadratic to linear or quasi-linear, (2) the algorithm can handle two different test statistics, and (3) the algorithm can handle both simple and context-dependent Markov chain null models.
We demonstrate the efficiency and accuracy of our algorithm on synthetic and real data sets, including the recent human telomere-to-telomere assembly. In particular, our algorithm computed p-values for 450 pairs of human genome annotations using 24 threads in under three hours. The use of genomic contexts to correct for GC-bias also resulted in the reversal of some previously published findings.
Availability. The software is freely available at https://github.com/fmfi-compbio/mcdp2 under the MIT licence. All data for reproducibility are available at https://github.com/fmfi-compbio/mcdp2-reproducibility.
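The final step described in the abstract, turning an exact expectation and variance into a p-value through a normal approximation, can be illustrated with a minimal sketch. It is not taken from the mcdp2 code base: the function name `normal_approx_p_value`, its parameters, and the `tail` argument are hypothetical, and the sketch assumes the mean and variance of the test statistic under the null model have already been computed by some other means.

```python
import math


def normal_approx_p_value(observed, mean, variance, tail="enrichment"):
    """Return (z, p) for a test statistic under a normal approximation.

    Hypothetical helper for illustration only. `mean` and `variance` are
    assumed to be the exact expectation and variance of the statistic
    under the null model, computed elsewhere.
    """
    if variance <= 0:
        raise ValueError("variance must be positive")

    # Standardize the observed statistic.
    z = (observed - mean) / math.sqrt(variance)

    if tail == "enrichment":
        # Upper tail: P(Z >= z) = 0.5 * erfc(z / sqrt(2)).
        p = 0.5 * math.erfc(z / math.sqrt(2))
    elif tail == "depletion":
        # Lower tail: P(Z <= z) = 0.5 * erfc(-z / sqrt(2)).
        p = 0.5 * math.erfc(-z / math.sqrt(2))
    else:
        raise ValueError("tail must be 'enrichment' or 'depletion'")
    return z, p


if __name__ == "__main__":
    # Made-up numbers: observed overlap 120, null mean 100, null variance 100.
    z, p = normal_approx_p_value(120, 100, 100, tail="enrichment")
    print(f"z = {z:.2f}, p = {p:.4g}")
```

Using `math.erfc` for the tail probability, rather than subtracting a cumulative distribution value from 1, keeps very small p-values from being rounded to zero, which matters when the observed statistic lies many standard deviations from the null mean.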
References
Burge, C., Karlin, S.: Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268(1), 78–94 (1997)
Domanska, D., Kanduri, C., Simovski, B., Sandve, G.K.: Mind the gaps: overlooking inaccessible regions confounds statistical testing in genome analysis. BMC Bioinf. 19(1), 481 (2018). https://doi.org/10.1186/s12859-018-2438-1
Dozmorov, M.G., Cara, L.R., Giles, C.B., Wren, J.D.: GenomeRunner web server: regulatory similarity and differences define the functional impact of SNP sets. Bioinformatics 32(15), 2256–2263 (2016)
Durbin, R., Eddy, S.R., Krogh, A., Mitchison, G.: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge (1998). https://doi.org/10.1017/CBO9780511790492
Gafurov, A., Brejová, B., Medvedev, P.: Markov chains improve the significance computation of overlapping genome annotations. Bioinformatics 38(Supplement-1), i203–i211 (2022). https://doi.org/10.1093/bioinformatics/btac255
Gel, B., Diez-Villanueva, A., Serra, E., Buschbeck, M., Peinado, M.A., Malinverni, R.: regioneR: an R/Bioconductor package for the association analysis of genomic regions based on permutation tests. Bioinformatics 32(2), 289–291 (2016)
Gelfman, S., Ast, G.: When epigenetics meets alternative splicing: the roles of DNA methylation and GC architecture. Epigenomics 5(4), 351–353 (2013)
Gershman, A., et al.: Epigenetic patterns in a complete human genome. Science 376(6588) (2022). https://doi.org/10.1126/science.abj5089
Goodman, S.: A dirty dozen: twelve p-value misconceptions. Semin. Hematol. 45(3), 135–140 (2008). https://doi.org/10.1053/j.seminhematol.2008.04.003
Guenther, M.G., Levine, S.S., Boyer, L.A., Jaenisch, R., Young, R.A.: A chromatin landmark and transcription initiation at most promoters in human cells. Cell 130(1), 77–88 (2007). https://doi.org/10.1016/j.cell.2007.05.042
Heger, A., Webber, C., Goodson, M., Ponting, C.P., Lunter, G.: GAT: a simulation framework for testing the association of genomic intervals. Bioinformatics 29(16), 2046–2048 (2013). https://doi.org/10.1093/bioinformatics/btt343
Kanduri, C., Bock, C., Gundersen, S., Hovig, E., Sandve, G.K.: Colocalization analyses of genomic elements: approaches, recommendations and challenges. Bioinformatics 35(9), 1615–1624 (2019). https://doi.org/10.1093/bioinformatics/bty835
Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT press, Cambridge (2009)
Nurk, S., et al.: The complete sequence of a human genome. Science 376(6588), 44–53 (2022). https://doi.org/10.1126/science.abj6987
Quinlan, A.R., Hall, I.M.: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26(6), 841–842 (2010). https://doi.org/10.1093/bioinformatics/btq033
Ross, N.: Fundamentals of Stein’s method. Probab. Surv. 8(1), 210–293 (2011). https://doi.org/10.1214/11-PS182
Sandve, G.K., et al.: The genomic hyperbrowser: inferential genomics at the sequence level. Genome Biol. 11(12) (2010). https://doi.org/10.1186/gb-2010-11-12-r121
Sarmashghi, S., Bafna, V.: Computing the statistical significance of overlap between genome annotations with ISTAT. Cell Syst. 8(6), 523–529 (2019). https://doi.org/10.1016/j.cels.2019.05.006
Sheffield, N.C., Bock, C.: LOLA: enrichment analysis for genomic region sets and regulatory elements in R and Bioconductor. Bioinformatics 32(4), 587–589 (2016). https://doi.org/10.1093/bioinformatics/btv612
Sullivan, G.M., Feinn, R.: Using effect size - or why the P value is not enough. J. Grad. Med. Educ. 4(3), 279–282 (2012)
Zarrei, M., MacDonald, J.R., Merico, D., Scherer, S.W.: A copy number variation map of the human genome. Nat. Rev. Genet. 16(3), 172–183 (2015). https://doi.org/10.1038/nrg3871
Funding
This material is based upon work supported by the National Science Foundation under Grant No. DBI-2138585. Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R01GM146462. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This work was also supported by a grant from the European Union Horizon 2020 research and innovation program No. 872539 (PANGAIA), and by grants from the Slovak Research and Development Agency APVV-22-0144 and the Scientific Grant Agency VEGA 1/0538/22.
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Gafurov, A., Vinař, T., Medvedev, P., Brejová, B. (2024). Efficient Analysis of Annotation Colocalization Accounting for Genomic Contexts. In: Ma, J. (eds) Research in Computational Molecular Biology. RECOMB 2024. Lecture Notes in Computer Science, vol 14758. Springer, Cham. https://doi.org/10.1007/978-1-0716-3989-4_3
DOI: https://doi.org/10.1007/978-1-0716-3989-4_3
Publisher Name: Springer, Cham
Print ISBN: 978-1-0716-3988-7
Online ISBN: 978-1-0716-3989-4
eBook Packages: Computer Science, Computer Science (R0)