Abstract
We study the query complexity of exactly reconstructing a string from adaptive queries, such as substring, subsequence, and jumbled-index queries. Such problems have applications, e.g., in computational biology. We provide a number of new and improved bounds for exact string reconstruction for settings where either the string or the queries are “mixed-up”.
The full version of this paper is available in [5].
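To make the query model concrete, the following is a minimal Python sketch of an oracle that answers the three query types named above while counting how many queries are issued. The class name `StringOracle`, its method names, and the yes/no answer format are illustrative assumptions, not the paper's definitions; in particular, a jumbled-index query is modeled here as asking whether some substring of the hidden string is a permutation of the query.

```python
from collections import Counter


class StringOracle:
    """Hypothetical oracle over a hidden string; counts every query asked."""

    def __init__(self, hidden):
        self._hidden = hidden
        self.queries = 0  # query complexity is measured by this counter

    def substring(self, q):
        """Is q a contiguous substring of the hidden string?"""
        self.queries += 1
        return q in self._hidden

    def subsequence(self, q):
        """Is q a (not necessarily contiguous) subsequence?"""
        self.queries += 1
        it = iter(self._hidden)
        # Each `ch in it` consumes the iterator up to the first match,
        # so the characters of q must appear in order.
        return all(ch in it for ch in q)

    def jumbled(self, q):
        """Is some substring of the hidden string a permutation of q?"""
        self.queries += 1
        m, n = len(q), len(self._hidden)
        if m == 0:
            return True
        if m > n:
            return False
        target = Counter(q)
        window = Counter(self._hidden[:m])
        if window == target:
            return True
        # Slide a length-m window and compare character counts.
        for i in range(m, n):
            incoming, outgoing = self._hidden[i], self._hidden[i - m]
            window[incoming] += 1
            window[outgoing] -= 1
            if window[outgoing] == 0:
                del window[outgoing]  # keep comparison free of zero entries
            if window == target:
                return True
        return False
```

An adaptive reconstruction algorithm would choose each query based on the answers received so far, and its cost would be the final value of `queries`.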
Notes
- 1.
Our algorithms assume that S is periodic (\(k>1\)), while the Periodicity Lemma (1) only requires a string to have a period (\(k>0\)).
- 2.
A more sophisticated version of this procedure exists (see [17]) that actually improves the constant in the time complexity, but for simplicity, we use the traditional algorithm, which is asymptotically equivalent.
- 3.
Pseudo-code can be found in the full version of the paper [5], where the number of queries is also shown for each step involving queries.
- 4.
Pseudo-code can be found in the full version of the paper [5], where the number of queries is also shown for each step involving queries.
- 5.
Pseudo-code can be found in the full version of the paper [5], where the number of queries is also shown for each step involving queries.
References
Acharya, J., Das, H., Milenkovic, O., Orlitsky, A., Pan, S.: Quadratic-backtracking algorithm for string reconstruction from substring compositions. In: 2014 IEEE International Symposium on Information Theory, Honolulu, HI, USA, 29 June–4 July 2014, pp. 1296–1300. IEEE (2014). https://doi.org/10.1109/ISIT.2014.6875042
Acharya, J., Das, H., Milenkovic, O., Orlitsky, A., Pan, S.: String reconstruction from substring compositions. SIAM J. Discrete Math. 29(3), 1340–1371 (2015). https://doi.org/10.1137/140962486
Afshani, P., Agrawal, M., Doerr, B., Doerr, C., Larsen, K.G., Mehlhorn, K.: The query complexity of finding a hidden permutation. In: Brodnik, A., López-Ortiz, A., Raman, V., Viola, A. (eds.) Space-Efficient Data Structures, Streams, and Algorithms. LNCS, vol. 8066, pp. 1–11. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40273-9_1
Afshani, P., van Duijn, I., Killmann, R., Nielsen, J.S.: A lower bound for jumbled indexing. In: 2020 ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 592–606 (2020). https://doi.org/10.1137/1.9781611975994.36
Afshar, R., Amir, A., Goodrich, M.T., Matias, P.: Adaptive exact learning in a mixed-up world: dealing with periodicity, errors, and jumbled-index queries in string reconstruction. arXiv preprint arXiv:2007.08787 (2020). https://arxiv.org/abs/2007.08787
Amir, A., Eisenberg, E., Levy, A., Porat, E., Shapira, N.: Cycle detection and correction. ACM Trans. Alg. 9(1) (2012). Article no. 13
Amir, A., Apostolico, A., Hirst, T., Landau, G.M., Lewenstein, N., Rozenberg, L.: Algorithms for jumbled indexing, jumbled border and jumbled square on run-length encoded strings. Theor. Comput. Sci. 656, 146–159 (2016). https://doi.org/10.1016/j.tcs.2016.04.030. http://www.sciencedirect.com/science/article/pii/S030439751630069X
Amir, A., et al.: Pattern matching with address errors: rearrangement distances. J. Comput. Syst. Sci. 75(6), 359–370 (2009). https://doi.org/10.1016/j.jcss.2009.03.001
Amir, A., Butman, A., Porat, E.: On the relationship between histogram indexing and block-mass indexing. Philos. Trans. Roy. Soc. Math. Phys. Eng. Sci. 372(2016) (2014). https://doi.org/10.1098/rsta.2013.0132. https://royalsocietypublishing.org/doi/abs/10.1098/rsta.2013.0132
Amir, A., Chan, T.M., Lewenstein, M., Lewenstein, N.: On hardness of jumbled indexing. In: Esparza, J., Fraigniaud, P., Husfeldt, T., Koutsoupias, E. (eds.) ICALP 2014. LNCS, vol. 8572, pp. 114–125. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-662-43948-7_10
Amir, A., Hartman, T., Kapah, O., Levy, A., Porat, E.: On the cost of interchange rearrangement in strings. In: Arge, L., Hoffmann, M., Welzl, E. (eds.) ESA 2007. LNCS, vol. 4698, pp. 99–110. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-75520-3_11
Angluin, D.: Queries and concept learning. Mach. Learn. 2(4), 319–342 (1988). https://doi.org/10.1023/A:1022821128753
Arratia, R., Martin, D., Reinert, G., Waterman, M.S.: Poisson process approximation for sequence repeats and sequencing by hybridization. J. Comput. Biol. 3(3), 425–463 (1996). https://doi.org/10.1089/cmb.1996.3.425
Batu, T., Kannan, S., Khanna, S., McGregor, A.: Reconstructing strings from random traces. In: Munro, J.I. (ed.) Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2004, New Orleans, Louisiana, USA, 11–14 January 2004, pp. 910–918. SIAM (2004). http://dl.acm.org/citation.cfm?id=982792.982929
Benson, G.: Tandem repeats finder: a program to analyze DNA sequence. Nucleic Acids Res. 27(2), 573–580 (1999)
Benson, G., Waterman, M.: A method for fast database search for all k-nucleotide repeats. Nucleic Acids Res. 22, 4828–4836 (1994)
Bentley, J.L., Yao, A.C.: An almost optimal algorithm for unbounded searching. Inf. Process. Lett. 5(3), 82–87 (1976). https://doi.org/10.1016/0020-0190(76)90071-5
Bernasconi, A., Damm, C., Shparlinski, I.: Circuit and decision tree complexity of some number theoretic problems. Inf. Comput. 168(2), 113–124 (2001). https://doi.org/10.1006/inco.2000.3017. http://www.sciencedirect.com/science/article/pii/S0890540100930177
Bresler, G., Bresler, M., Tse, D.: Optimal assembly for high throughput shotgun sequencing. BMC Bioinform. 14 (2013). Article no. S18. https://doi.org/10.1186/1471-2105-14-S5-S18
Burcsi, P., Cicalese, F., Fici, G., Lipták, Z.: Algorithms for jumbled pattern matching in strings. Int. J. Found. Comput. Sci. 23(2), 357–374 (2012). https://doi.org/10.1142/S0129054112400175
Butman, A., Eres, R., Landau, G.M.: Scaled and permuted string matching. Inf. Process. Lett. 92(6), 293–297 (2004). https://doi.org/10.1016/j.ipl.2004.09.002
Carpi, A., de Luca, A.: Words and special factors. Theor. Comput. Sci. 259(1–2), 145–182 (2001). https://doi.org/10.1016/S0304-3975(99)00334-5
Cayley, A.: LXXVII. Note on the theory of permutations. Lond. Edinb. Dublin Philos. Mag. J. Sci. 34(232), 527–529 (1849)
Chang, Z., Chrisnata, J., Ezerman, M.F., Kiah, H.M.: Rates of DNA sequence profiles for practical values of read lengths. IEEE Trans. Inf. Theory 63(11), 7166–7177 (2017). https://doi.org/10.1109/TIT.2017.2747557
Choi, S.S., Kim, J.H.: Optimal query complexity bounds for finding graphs. Artif. Intell. 174(9), 551–569 (2010). https://doi.org/10.1016/j.artint.2010.02.003. http://www.sciencedirect.com/science/article/pii/S0004370210000251
Cicalese, F., Fici, G., Lipták, Z.: Searching for jumbled patterns in strings. In: Holub, J., Zdárek, J. (eds.) Proceedings of the Prague Stringology Conference 2009, Prague, Czech Republic, 31 August–2 September 2009, pp. 105–117. Prague Stringology Club, Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University in Prague (2009). http://www.stringology.org/event/2009/p10.html
Cieplinski, L.: MPEG-7 color descriptors and their applications. In: Skarbek, W. (ed.) CAIP 2001. LNCS, vol. 2124, pp. 11–20. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-44692-3_3
Cleve, R., et al.: Reconstructing strings from substrings with quantum queries. In: Fomin, F.V., Kaski, P. (eds.) SWAT 2012. LNCS, vol. 7357, pp. 388–397. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31155-0_34
Dakic, T.: On the turnpike problem. Simon Fraser University BC, Canada (2000)
Deininger, P.: SINEs: short interspersed repeated DNA elements in higher eukaryotes. In: Berg, D., Howe, M. (eds.) Mobile DNA, Chap. 27, pp. 619–636. American Society for Microbiology (1989)
Deselaers, T., Keysers, D., Ney, H.: Features for image retrieval: an experimental comparison. Inf. Retrieval 11(2), 77–107 (2008). https://doi.org/10.1007/s10791-007-9039-3
Dobzinski, S., Vondrak, J.: From query complexity to computational complexity. In: Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, STOC 2012, pp. 1107–1116. ACM, New York (2012). https://doi.org/10.1145/2213977.2214076
Domaniç, N.O., Preparata, F.P.: A novel approach to the detection of genomic approximate tandem repeats in the Levenshtein metric. J. Comput. Biol. 14(7), 873–891 (2007)
Dudík, M., Schulman, L.J.: Reconstruction from subsequences. J. Comb. Theory Ser. A 103(2), 337–348 (2003). https://doi.org/10.1016/S0097-3165(03)00103-1
Dudley, J., Lin, M.T., Le, D., Eshleman, J.R.: Microsatellite instability as a biomarker for PD-1 blockade. Clin. Cancer Res. 22(4), 813–820 (2016)
Elishco, O., Gabrys, R., Médard, M., Yaakobi, E.: Repeat-free codes. In: IEEE International Symposium on Information Theory, ISIT 2019, Paris, France, 7–12 July 2019, pp. 932–936. IEEE (2019). https://doi.org/10.1109/ISIT.2019.8849483
Eres, R., Landau, G.M., Parida, L.: Permutation pattern discovery in biosequences. J. Comput. Biol. 11(6), 1050–1060 (2004). https://doi.org/10.1089/cmb.2004.11.1050
Fici, G., Mignosi, F., Restivo, A., Sciortino, M.: Word assembly through minimal forbidden words. Theor. Comput. Sci. 359(1–3), 214–230 (2006). https://doi.org/10.1016/j.tcs.2006.03.006
Fine, N.J., Wilf, H.S.: Uniqueness theorems for periodic functions. Proc. Am. Math. Soc. 16(1), 109–114 (1965)
Gabrys, R., Milenkovic, O.: The hybrid k-Deck problem: reconstructing sequences from short and long traces. In: 2017 IEEE International Symposium on Information Theory, ISIT 2017, Aachen, Germany, 25–30 June 2017, pp. 1306–1310. IEEE (2017). https://doi.org/10.1109/ISIT.2017.8006740
Gabrys, R., Milenkovic, O.: Unique reconstruction of coded sequences from multiset substring spectra. In: 2018 IEEE International Symposium on Information Theory, ISIT 2018, Vail, CO, USA, 17–22 June 2018, pp. 2540–2544. IEEE (2018). https://doi.org/10.1109/ISIT.2018.8437909
Ganguly, S., Mossel, E., Rácz, M.Z.: Sequence assembly from corrupted shotgun reads. In: IEEE International Symposium on Information Theory, ISIT 2016, Barcelona, Spain, 10–15 July 2016, pp. 265–269. IEEE (2016). https://doi.org/10.1109/ISIT.2016.7541302
Holenstein, T., Mitzenmacher, M., Panigrahy, R., Wieder, U.: Trace reconstruction with constant deletion probability and related results. In: Teng, S. (ed.) Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2008, San Francisco, California, USA, 20–22 January 2008, pp. 389–398. SIAM (2008). http://dl.acm.org/citation.cfm?id=1347082.1347125
Iwama, K., Teruyama, J., Tsuyama, S.: Reconstructing strings from substrings: optimal randomized and average-case algorithms (2018)
Jeong, K., Bandeira, N., Kim, S., Pevzner, P.A.: Gapped spectral dictionaries and their applications for database searches of tandem mass spectra. Mol. Cell. Proteomics (2011). https://doi.org/10.1074/mcp.M110.002220
Jerrum, M.: The complexity of finding minimum-length generator sequences. Theor. Comput. Sci. 36, 265–289 (1985). https://doi.org/10.1016/0304-3975(85)90047-7
Kalashnik, L.: The reconstruction of a word from fragments. In: Numerical Mathematics and Computer Technology, pp. 56–57 (1973)
Kannan, S., McGregor, A.: More on reconstructing strings from random traces: insertions and deletions. In: Proceedings of the 2005 IEEE International Symposium on Information Theory, ISIT 2005, Adelaide, South Australia, Australia, 4–9 September 2005, pp. 297–301. IEEE (2005). https://doi.org/10.1109/ISIT.2005.1523342
Kiah, H.M., Puleo, G.J., Milenkovic, O.: Codes for DNA sequence profiles. IEEE Trans. Inf. Theory 62(6), 3125–3146 (2016). https://doi.org/10.1109/TIT.2016.2555321
Kim, S., Bandeira, N., Pevzner, P.A.: Spectral profiles: a novel representation of tandem mass spectra and its applications for de novo peptide sequencing and identification. Mol. Cell. Proteomics 8, 1391–1400 (2009)
Kim, S., Gupta, N., Bandeira, N., Pevzner, P.A.: Spectral dictionaries: integrating de novo peptide sequencing with database search of tandem mass spectra. Mol. Cell. Proteomics 8(1), 53–69 (2009)
Kociumaka, T., Radoszewski, J., Rytter, W.: Efficient indexes for jumbled pattern matching with constant-sized alphabet. In: Bodlaender, H.L., Italiano, G.F. (eds.) ESA 2013. LNCS, vol. 8125, pp. 625–636. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40450-4_53
Kolpakov, R., Kucherov, G.: mreps: efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Res. 31, 3672–3678 (2003). http://www.loria.fr/mreps/
Krasikov, I., Roditty, Y.: On a reconstruction problem for sequences. J. Comb. Theory Ser. A 77(2), 344–348 (1997). https://doi.org/10.1006/jcta.1997.2732
Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. Soviet Phys. Dokl. 10, 707–710 (1966)
Levenshtein, V.I.: Efficient reconstruction of sequences. IEEE Trans. Inf. Theory 47(1), 2–22 (2001). https://doi.org/10.1109/18.904499
Lowrance, R., Wagner, R.A.: An extension of the string-to-string correction problem. J. ACM 22(2), 177–183 (1975). https://doi.org/10.1145/321879.321880
Manvel, B., Meyerowitz, A., Schwenk, A.J., Smith, K., Stockmeyer, P.K.: Reconstruction of sequences. Discrete Math. 94(3), 209–219 (1991). https://doi.org/10.1016/0012-365X(91)90026-X
Marcovich, S., Yaakobi, E.: Reconstruction of strings from their substrings spectrum. CoRR abs/1912.11108 (2019). http://arxiv.org/abs/1912.11108
Margaritis, D., Skiena, S.S.: Reconstructing strings from substrings in rounds. In: IEEE 36th Symposium on Foundations of Computer Science (FOCS), pp. 613–620, October 1995. https://doi.org/10.1109/SFCS.1995.492591
Mitzenmacher, M., Upfal, E.: Probability and Computing: Randomized Algorithms and Probabilistic Analysis, 2nd edn. Cambridge University Press, Cambridge (2017)
Moosa, T.M., Rahman, M.S.: Indexing permutations for binary strings. Inf. Process. Lett. 110(18), 795–798 (2010). https://doi.org/10.1016/j.ipl.2010.06.012. http://www.sciencedirect.com/science/article/pii/S0020019010002012
Motahari, A.S., Bresler, G., Tse, D.N.C.: Information theory of DNA shotgun sequencing. IEEE Trans. Inf. Theory 59(10), 6273–6289 (2013). https://doi.org/10.1109/TIT.2013.2270273
Motahari, A.S., Ramchandran, K., Tse, D., Ma, N.: Optimal DNA shotgun sequencing: noisy reads are as good as noiseless reads. In: Proceedings of the 2013 IEEE International Symposium on Information Theory, Istanbul, Turkey, 7–12 July 2013, pp. 1640–1644. IEEE (2013). https://doi.org/10.1109/ISIT.2013.6620505
Parisi, V., Fonzo, V.D., Aluffi-Pentini, F.: STRING: finding tandem repeats in DNA sequences. Bioinformatics 19(14), 1733–1738 (2003)
Pellegrini, M., Renda, M.E., Vecchio, A.: TRStalker: an efficient heuristic for finding fuzzy tandem repeats. Bioinformatics [ISMB] 26(12), 358–366 (2010)
Sala, F., Gabrys, R., Schoeny, C., Mazooji, K., Dolecek, L.: Exact sequence reconstruction for insertion-correcting codes. In: IEEE International Symposium on Information Theory, ISIT 2016, Barcelona, Spain, 10–15 July 2016, pp. 615–619. IEEE (2016). https://doi.org/10.1109/ISIT.2016.7541372
Scott, A.D.: Reconstructing sequences. Discrete Math. 175(1–3), 231–238 (1997). https://doi.org/10.1016/S0012-365X(96)00153-7
Shomorony, I., Courtade, T.A., Tse, D.N.C.: Do read errors matter for genome assembly? In: IEEE International Symposium on Information Theory, ISIT 2015, Hong Kong, China, 14–19 June 2015, pp. 919–923. IEEE (2015). https://doi.org/10.1109/ISIT.2015.7282589
Shomorony, I., Kamath, G.M., Xia, F., Courtade, T.A., Tse, D.N.C.: Partial DNA assembly: a rate-distortion perspective. In: IEEE International Symposium on Information Theory, ISIT 2016, Barcelona, Spain, 10–15 July 2016, pp. 1799–1803. IEEE (2016). https://doi.org/10.1109/ISIT.2016.7541609
Simon, I.: Piecewise testable events. In: Brakhage, H. (ed.) GI-Fachtagung 1975. LNCS, vol. 33, pp. 214–222. Springer, Heidelberg (1975). https://doi.org/10.1007/3-540-07407-4_23
Skiena, S., Smith, W.D., Lemke, P.: Reconstructing sets from interpoint distances (extended abstract). In: Seidel, R. (ed.) Proceedings of the Sixth Annual Symposium on Computational Geometry, Berkeley, CA, USA, 6–8 June 1990, pp. 332–339. ACM (1990). https://doi.org/10.1145/98524.98598
Skiena, S., Sundaram, G.: Reconstructing strings from substrings. J. Comput. Biol. 2(2), 333–353 (1995). https://doi.org/10.1089/cmb.1995.2.333
Sokol, D.: TRedD - a database for tandem repeats over the edit distance. Database J. Biol. Databases Curation 2010(baq003) (2010). https://doi.org/10.1093/database/baq003
Tan, K., Ooi, B.C., Yee, C.Y.: An evaluation of color-spatial retrieval techniques for large image databases. Multimed. Tools Appl. 14(1), 55–78 (2001). https://doi.org/10.1023/A:1011359607594
Tardos, G.: Query complexity, or why is it difficult to separate \(NP^A\cap coNP^A\) from \(P^A\) by random oracles \(A\)? Combinatorica 9(4), 385–392 (1989). https://doi.org/10.1007/BF02125350
Tsur, D.: Tight bounds for string reconstruction using substring queries. In: Chekuri, C., Jansen, K., Rolim, J.D.P., Trevisan, L. (eds.) APPROX/RANDOM 2005. LNCS, vol. 3624, pp. 448–459. Springer, Heidelberg (2005). https://doi.org/10.1007/11538462_38
Ukkonen, E.: Approximate string matching with q-grams and maximal matches. Theor. Comput. Sci. 92(1), 191–211 (1992). https://doi.org/10.1016/0304-3975(92)90143-4
Viswanathan, K., Swaminathan, R.: Improved string reconstruction over insertion-deletion channels. In: Teng, S. (ed.) Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2008, San Francisco, California, USA, 20–22 January 2008, pp. 399–408. SIAM (2008). http://dl.acm.org/citation.cfm?id=1347082.1347126
Wagner, R.A.: On the complexity of the extended string-to-string correction problem. In: Rounds, W.C., Martin, N., Carlyle, J.W., Harrison, M.A. (eds.) Proceedings of the 7th Annual ACM Symposium on Theory of Computing, Albuquerque, New Mexico, USA, 5–7 May 1975, pp. 218–223. ACM (1975). https://doi.org/10.1145/800116.803771
Wang, J., Hua, X.: Interactive image search by color map. ACM Trans. Intell. Syst. Technol. 3(1), 12:1–12:23 (2011)
Wexler, Y., Yakhini, Z., Kashi, Y., Geiger, D.: Finding approximate tandem repeats in genomic sequences. In: RECOMB, pp. 223–232 (2004)
Yao, A.C.C.: Decision tree complexity and Betti numbers. In: Proceedings of the Twenty-Sixth Annual ACM Symposium on Theory of Computing, STOC 1994, pp. 615–624. ACM, New York (1994). https://doi.org/10.1145/195058.195414
Zenkin, A., Leont’ev, V.K.: On a non-classical recognition problem. USSR Comput. Math. Math. Phys. 24(3), 189–193 (1984)
Zhou, W., Li, H., Tian, Q.: Recent advance in content-based image retrieval: a literature survey. CoRR abs/1706.06064 (2017). http://arxiv.org/abs/1706.06064
Acknowledgments
This research was funded in part by the U.S. National Science Foundation under grant 1815073. Amihood Amir was partly supported by BSF grant 2018141 and ISF grant 1475-18.
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Afshar, R., Amir, A., Goodrich, M.T., Matias, P. (2020). Adaptive Exact Learning in a Mixed-Up World: Dealing with Periodicity, Errors and Jumbled-Index Queries in String Reconstruction. In: Boucher, C., Thankachan, S.V. (eds) String Processing and Information Retrieval. SPIRE 2020. Lecture Notes in Computer Science, vol 12303. Springer, Cham. https://doi.org/10.1007/978-3-030-59212-7_12
DOI: https://doi.org/10.1007/978-3-030-59212-7_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59211-0
Online ISBN: 978-3-030-59212-7
eBook Packages: Computer Science, Computer Science (R0)