Efficient Analysis of Annotation Colocalization Accounting for Genomic Contexts

  • Conference paper
Research in Computational Molecular Biology (RECOMB 2024)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14758))

Abstract

An annotation is a set of genomic intervals sharing a particular function or property. Examples include genes, conserved elements, and epigenetic modifications. A common task is to compare two annotations to determine if one is enriched or depleted in the regions covered by the other. We study the problem of assigning statistical significance to such a comparison based on a null model representing two random unrelated annotations. To incorporate more background information into such analyses and avoid biased results, we propose a new null model based on a Markov chain which differentiates among several genomic contexts. These contexts can capture various confounding factors, such as GC content or sequencing gaps. We then develop a new algorithm for estimating p-values by computing the exact expectation and variance of the test statistic and then estimating the p-value using a normal approximation. Compared to the previous algorithm by Gafurov et al., the new algorithm provides three advances: (1) the running time is improved from quadratic to linear or quasi-linear, (2) the algorithm can handle two different test statistics, and (3) the algorithm can handle both simple and context-dependent Markov chain null models.
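As a rough illustration of the two ingredients named above, the following Python sketch samples a random annotation from a context-dependent two-state Markov chain and converts a given mean and variance into a p-value with a normal approximation. It is not the authors' implementation: the function names, the per-context parameter pairs `(p_enter, p_stay)`, the context labels, and the one-sided (enrichment) tail are all assumptions made for this example. The paper's algorithm computes the expectation and variance exactly rather than by sampling, and that computation is not reproduced here.

```python
import math
import random


def sample_context_markov_annotation(contexts, params, rng):
    """Sample a 0/1 sequence (1 = position inside an interval).

    contexts: one context label per genomic position (hypothetical labels).
    params:   maps each label to (p_enter, p_stay), the probabilities of
              being inside an interval given the previous state was
              outside / inside. These parameters are illustrative only.
    """
    state = 0
    out = []
    for ctx in contexts:
        p_enter, p_stay = params[ctx]
        p_one = p_stay if state == 1 else p_enter
        state = 1 if rng.random() < p_one else 0
        out.append(state)
    return out


def normal_approx_pvalue(observed, mean, variance):
    """One-sided p-value P(X >= observed) for X ~ N(mean, variance)."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    z = (observed - mean) / math.sqrt(variance)
    # Upper tail of the standard normal, expressed through erfc.
    return 0.5 * math.erfc(z / math.sqrt(2))


if __name__ == "__main__":
    rng = random.Random(0)
    # Two made-up contexts: an ordinary region followed by a sequencing gap
    # in which no interval may occur.
    contexts = ["regular"] * 50 + ["gap"] * 50
    params = {"regular": (0.1, 0.9), "gap": (0.0, 0.0)}
    annotation = sample_context_markov_annotation(contexts, params, rng)
    print(sum(annotation), "covered positions out of", len(annotation))
    print(normal_approx_pvalue(observed=12.0, mean=8.0, variance=4.0))
```

Setting both probabilities to zero in the `gap` context forces the chain out of any interval there, which is one simple way a context can encode a confounder such as an unsequenced region.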

We demonstrate the efficiency and accuracy of our algorithm on synthetic and real data sets, including the recent human telomere-to-telomere assembly. In particular, our algorithm computed p-values for 450 pairs of human genome annotations using 24 threads in under three hours. The use of genomic contexts to correct for GC-bias also resulted in the reversal of some previously published findings.

Availability. The software is freely available at https://github.com/fmfi-compbio/mcdp2 under the MIT licence. All data for reproducibility are available at https://github.com/fmfi-compbio/mcdp2-reproducibility.

References

  1. Burge, C., Karlin, S.: Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268(1), 78–94 (1997)

  2. Domanska, D., Kanduri, C., Simovski, B., Sandve, G.K.: Mind the gaps: overlooking inaccessible regions confounds statistical testing in genome analysis. BMC Bioinf. 19(1), 481 (2018). https://doi.org/10.1186/s12859-018-2438-1

  3. Dozmorov, M.G., Cara, L.R., Giles, C.B., Wren, J.D.: GenomeRunner web server: regulatory similarity and differences define the functional impact of SNP sets. Bioinformatics 32(15), 2256–2263 (2016)

  4. Durbin, R., Eddy, S.R., Krogh, A., Mitchison, G.: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge (1998). https://doi.org/10.1017/CBO9780511790492

  5. Gafurov, A., Brejová, B., Medvedev, P.: Markov chains improve the significance computation of overlapping genome annotations. Bioinformatics 38(Supplement-1), i203–i211 (2022). https://doi.org/10.1093/bioinformatics/btac255

  6. Gel, B., Diez-Villanueva, A., Serra, E., Buschbeck, M., Peinado, M.A., Malinverni, R.: regioneR: an R/Bioconductor package for the association analysis of genomic regions based on permutation tests. Bioinformatics 32(2), 289–291 (2016)

  7. Gelfman, S., Ast, G.: When epigenetics meets alternative splicing: the roles of DNA methylation and GC architecture. Epigenomics 5(4), 351–353 (2013)

  8. Gershman, A., et al.: Epigenetic patterns in a complete human genome. Science 376(6588) (2022). https://doi.org/10.1126/science.abj5089

  9. Goodman, S.: A dirty dozen: twelve p-value misconceptions. Semin. Hematol. 45(3), 135–140 (2008). https://doi.org/10.1053/j.seminhematol.2008.04.003

  10. Guenther, M.G., Levine, S.S., Boyer, L.A., Jaenisch, R., Young, R.A.: A chromatin landmark and transcription initiation at most promoters in human cells. Cell 130(1), 77–88 (2007). https://doi.org/10.1016/j.cell.2007.05.042

  11. Heger, A., Webber, C., Goodson, M., Ponting, C.P., Lunter, G.: GAT: a simulation framework for testing the association of genomic intervals. Bioinformatics 29(16), 2046–2048 (2013). https://doi.org/10.1093/bioinformatics/btt343

  12. Kanduri, C., Bock, C., Gundersen, S., Hovig, E., Sandve, G.K.: Colocalization analyses of genomic elements: approaches, recommendations and challenges. Bioinformatics 35(9), 1615–1624 (2019). https://doi.org/10.1093/bioinformatics/bty835

  13. Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge (2009)

  14. Nurk, S., et al.: The complete sequence of a human genome. Science 376(6588), 44–53 (2022). https://doi.org/10.1126/science.abj6987

  15. Quinlan, A.R., Hall, I.M.: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26(6), 841–842 (2010). https://doi.org/10.1093/bioinformatics/btq033

  16. Ross, N.: Fundamentals of Stein’s method. Probab. Surv. 8(1), 210–293 (2011). https://doi.org/10.1214/11-PS182

  17. Sandve, G.K., et al.: The genomic hyperbrowser: inferential genomics at the sequence level. Genome Biol. 11(12) (2010). https://doi.org/10.1186/gb-2010-11-12-r121

  18. Sarmashghi, S., Bafna, V.: Computing the statistical significance of overlap between genome annotations with ISTAT. Cell Syst. 8(6), 523–529 (2019). https://doi.org/10.1016/j.cels.2019.05.006

  19. Sheffield, N.C., Bock, C.: LOLA: enrichment analysis for genomic region sets and regulatory elements in R and Bioconductor. Bioinformatics 32(4), 587–589 (2016). https://doi.org/10.1093/bioinformatics/btv612

  20. Sullivan, G.M., Feinn, R.: Using effect size-or why the P value is not enough. J. Grad. Med. Educ. 4(3), 279–282 (2012)

  21. Zarrei, M., MacDonald, J.R., Merico, D., Scherer, S.W.: A copy number variation map of the human genome. Nat. Rev. Genet. 16(3), 172–183 (2015). https://doi.org/10.1038/nrg3871

Funding

This material is based upon work supported by the National Science Foundation under Grant No. DBI-2138585. Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R01GM146462. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This work was also supported by a grant from the European Union Horizon 2020 research and innovation program No. 872539 (PANGAIA), and by grants from the Slovak Research and Development Agency APVV-22-0144 and the Scientific Grant Agency VEGA 1/0538/22.

Author information

Corresponding author

Correspondence to Broňa Brejová.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Gafurov, A., Vinař, T., Medvedev, P., Brejová, B. (2024). Efficient Analysis of Annotation Colocalization Accounting for Genomic Contexts. In: Ma, J. (eds) Research in Computational Molecular Biology. RECOMB 2024. Lecture Notes in Computer Science, vol 14758. Springer, Cham. https://doi.org/10.1007/978-1-0716-3989-4_3

  • DOI: https://doi.org/10.1007/978-1-0716-3989-4_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-1-0716-3988-7

  • Online ISBN: 978-1-0716-3989-4

  • eBook Packages: Computer Science (R0)
