Efficient Analysis of Annotation Colocalization Accounting for Genomic Contexts

Gafurov, Askar; Vinař, Tomáš; Medvedev, Paul; Brejová, Broňa

doi:10.1007/978-1-0716-3989-4_3

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14758)

Included in the following conference series:

International Conference on Research in Computational Molecular Biology

Abstract

An annotation is a set of genomic intervals sharing a particular function or property. Examples include genes, conserved elements, and epigenetic modifications. A common task is to compare two annotations to determine if one is enriched or depleted in the regions covered by the other. We study the problem of assigning statistical significance to such a comparison based on a null model representing two random unrelated annotations. To incorporate more background information into such analyses and avoid biased results, we propose a new null model based on a Markov chain which differentiates among several genomic contexts. These contexts can capture various confounding factors, such as GC content or sequencing gaps. We then develop a new algorithm for estimating p-values by computing the exact expectation and variance of the test statistic and then estimating the p-value using a normal approximation. Compared to the previous algorithm by Gafurov et al., the new algorithm provides three advances: (1) the running time is improved from quadratic to linear or quasi-linear, (2) the algorithm can handle two different test statistics, and (3) the algorithm can handle both simple and context-dependent Markov chain null models.
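The final step described above, turning an expectation and a variance into a p-value through a normal approximation, can be sketched as follows. This is a minimal illustration and not the mcdp2 implementation: the function name, its parameters, and the "enrichment"/"depletion" labels are assumptions introduced here, and the mean and variance are taken as given inputs rather than computed from a Markov chain null model.

```python
import math


def normal_approx_p_value(observed, mean, variance, alternative="enrichment"):
    """One-sided p-value of `observed` under a Normal(mean, variance) approximation.

    Hypothetical helper for illustration only; `mean` and `variance` stand in
    for the exact expectation and variance of the test statistic under the null.
    """
    if variance <= 0:
        raise ValueError("variance must be positive")
    # Standardize the observed test statistic.
    z = (observed - mean) / math.sqrt(variance)
    if alternative == "enrichment":
        # Upper tail: P(Z >= z) = erfc(z / sqrt(2)) / 2
        return 0.5 * math.erfc(z / math.sqrt(2))
    if alternative == "depletion":
        # Lower tail: P(Z <= z) = erfc(-z / sqrt(2)) / 2
        return 0.5 * math.erfc(-z / math.sqrt(2))
    raise ValueError("alternative must be 'enrichment' or 'depletion'")


# Example: an observed statistic two standard deviations above the null mean.
p = normal_approx_p_value(observed=12.0, mean=10.0, variance=1.0)
```

Using the complementary error function from the standard library keeps the sketch dependency-free and avoids the loss of precision that comes from computing one minus a cumulative probability in the far tail.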

We demonstrate the efficiency and accuracy of our algorithm on synthetic and real data sets, including the recent human telomere-to-telomere assembly. In particular, our algorithm computed p-values for 450 pairs of human genome annotations using 24 threads in under three hours. The use of genomic contexts to correct for GC-bias also resulted in the reversal of some previously published findings.

Availability. The software is freely available at https://github.com/fmfi-compbio/mcdp2 under the MIT licence. All data for reproducibility are available at https://github.com/fmfi-compbio/mcdp2-reproducibility.


References

Burge, C., Karlin, S.: Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268(1), 78–94 (1997)
Domanska, D., Kanduri, C., Simovski, B., Sandve, G.K.: Mind the gaps: overlooking inaccessible regions confounds statistical testing in genome analysis. BMC Bioinf. 19(1), 481 (2018). https://doi.org/10.1186/s12859-018-2438-1
Dozmorov, M.G., Cara, L.R., Giles, C.B., Wren, J.D.: GenomeRunner web server: regulatory similarity and differences define the functional impact of SNP sets. Bioinformatics 32(15), 2256–2263 (2016)
Durbin, R., Eddy, S.R., Krogh, A., Mitchison, G.: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge (1998). https://doi.org/10.1017/CBO9780511790492
Gafurov, A., Brejová, B., Medvedev, P.: Markov chains improve the significance computation of overlapping genome annotations. Bioinformatics 38(Supplement-1), i203–i211 (2022). https://doi.org/10.1093/bioinformatics/btac255
Gel, B., Diez-Villanueva, A., Serra, E., Buschbeck, M., Peinado, M.A., Malinverni, R.: regioneR: an R/Bioconductor package for the association analysis of genomic regions based on permutation tests. Bioinformatics 32(2), 289–291 (2016)
Gelfman, S., Ast, G.: When epigenetics meets alternative splicing: the roles of DNA methylation and GC architecture. Epigenomics 5(4), 351–353 (2013)
Gershman, A., et al.: Epigenetic patterns in a complete human genome. Science 376(6588) (2022). https://doi.org/10.1126/science.abj5089
Goodman, S.: A dirty dozen: twelve p-value misconceptions. Semin. Hematol. 45(3), 135–140 (2008). https://doi.org/10.1053/j.seminhematol.2008.04.003
Guenther, M.G., Levine, S.S., Boyer, L.A., Jaenisch, R., Young, R.A.: A chromatin landmark and transcription initiation at most promoters in human cells. Cell 130(1), 77–88 (2007). https://doi.org/10.1016/j.cell.2007.05.042
Heger, A., Webber, C., Goodson, M., Ponting, C.P., Lunter, G.: GAT: a simulation framework for testing the association of genomic intervals. Bioinformatics 29(16), 2046–2048 (2013). https://doi.org/10.1093/bioinformatics/btt343
Kanduri, C., Bock, C., Gundersen, S., Hovig, E., Sandve, G.K.: Colocalization analyses of genomic elements: approaches, recommendations and challenges. Bioinformatics 35(9), 1615–1624 (2019). https://doi.org/10.1093/bioinformatics/bty835
Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT press, Cambridge (2009)
Nurk, S., et al.: The complete sequence of a human genome. Science 376(6588), 44–53 (2022). https://doi.org/10.1126/science.abj6987
Quinlan, A.R., Hall, I.M.: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26(6), 841–842 (2010). https://doi.org/10.1093/bioinformatics/btq033
Ross, N.: Fundamentals of Stein’s method. Probab. Surv. 8(1), 210–293 (2011). https://doi.org/10.1214/11-PS182
Sandve, G.K., et al.: The genomic hyperbrowser: inferential genomics at the sequence level. Genome Biol. 11(12) (2010). https://doi.org/10.1186/gb-2010-11-12-r121
Sarmashghi, S., Bafna, V.: Computing the statistical significance of overlap between genome annotations with ISTAT. Cell Syst. 8(6), 523–529 (2019). https://doi.org/10.1016/j.cels.2019.05.006
Sheffield, N.C., Bock, C.: LOLA: enrichment analysis for genomic region sets and regulatory elements in R and Bioconductor. Bioinformatics 32(4), 587–589 (2016). https://doi.org/10.1093/bioinformatics/btv612
Sullivan, G.M., Feinn, R.: Using effect size-or why the P value is not enough. J. Grad. Med. Educ. 4(3), 279–282 (2012)
Zarrei, M., MacDonald, J.R., Merico, D., Scherer, S.W.: A copy number variation map of the human genome. Nat. Rev. Genet. 16(3), 172–183 (2015). https://doi.org/10.1038/nrg3871

Funding

This material is based upon work supported by the National Science Foundation under Grant No. DBI-2138585. Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R01GM146462. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This work was also supported by a grant from the European Union Horizon 2020 research and innovation program No. 872539 (PANGAIA); and grants from the Slovak Research and Development Agency APVV-22-0144 and the Scientific Grant Agency VEGA 1/0538/22.

Author information

Authors and Affiliations

Department of Computer Science, Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Bratislava, Slovakia
Askar Gafurov & Broňa Brejová
Department of Applied Informatics, Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Bratislava, Slovakia
Tomáš Vinař
Department of Computer Science and Engineering, The Pennsylvania State University, State College, USA
Paul Medvedev
Huck Institutes of the Life Sciences, The Pennsylvania State University, State College, USA
Paul Medvedev
Department of Biochemistry and Molecular Biology, The Pennsylvania State University, State College, USA
Paul Medvedev

Corresponding author

Correspondence to Broňa Brejová.

Editor information

Editors and Affiliations

Carnegie Mellon University, Pittsburgh, PA, USA
Jian Ma

About this paper

Cite this paper

Gafurov, A., Vinař, T., Medvedev, P., Brejová, B. (2024). Efficient Analysis of Annotation Colocalization Accounting for Genomic Contexts. In: Ma, J. (eds) Research in Computational Molecular Biology. RECOMB 2024. Lecture Notes in Computer Science, vol 14758. Springer, Cham. https://doi.org/10.1007/978-1-0716-3989-4_3

DOI: https://doi.org/10.1007/978-1-0716-3989-4_3
Published: 17 May 2024
Publisher Name: Springer, Cham
Print ISBN: 978-1-0716-3988-7
Online ISBN: 978-1-0716-3989-4
eBook Packages: Computer Science, Computer Science (R0)
