About Leaks of Confidential Data in the Process of Indexing Sites by Search Crawlers

  • Conference paper
Perspectives of System Informatics (PSI 2019)

Abstract

A large number of sites serving very different purposes (online stores, ticketing systems, hotel reservation services, etc.) collect and store personal information about their users, as well as other confidential data such as the history and results of user interaction with these sites. Some of this data, although not intended for open access, nevertheless ends up in search engine output and may become available to unauthorized persons who issue specific queries. This article describes the reasons such incidents occur and gives basic recommendations for technical specialists (developers and administrators) that will help prevent these leaks.
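One of the standard defenses the paper's recommendations touch on is the robot exclusion mechanism (robots.txt). As a minimal sketch, not code from the paper itself, Python's standard `urllib.robotparser` can be used to check whether a well-behaved crawler would be permitted to fetch a given URL; the `example.com` domain and the `/private/` path below are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that excludes a private section from all crawlers.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Public pages remain crawlable; the excluded section is not.
print(rp.can_fetch("*", "https://example.com/index.html"))      # True
print(rp.can_fetch("*", "https://example.com/private/orders"))  # False
```

Note that robots.txt is purely advisory: it keeps compliant crawlers out of the listed paths but provides no access control, so genuinely confidential pages still require authentication rather than exclusion rules alone.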

The research has been supported by the ICMMG SB RAS budget project N 0315-2016-0006.



Author information


Corresponding author

Correspondence to Sergey Kratov.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Kratov, S. (2019). About Leaks of Confidential Data in the Process of Indexing Sites by Search Crawlers. In: Bjørner, N., Virbitskaite, I., Voronkov, A. (eds) Perspectives of System Informatics. PSI 2019. Lecture Notes in Computer Science, vol 11964. Springer, Cham. https://doi.org/10.1007/978-3-030-37487-7_16


  • DOI: https://doi.org/10.1007/978-3-030-37487-7_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-37486-0

  • Online ISBN: 978-3-030-37487-7

  • eBook Packages: Computer Science, Computer Science (R0)
