PrivaSeer: A Privacy Policy Search Engine

  • Conference paper
Web Engineering (ICWE 2021)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12706)

Abstract

Web privacy policies are used by organisations to disclose their privacy practices to users on the web. However, users often do not read privacy policies because they are too long, too time-consuming, or too complicated. Attempts to simplify privacy policies using natural language processing have achieved some success, but they face limitations of scalability and generalization. While this puts an onus on researchers and policy regulators to protect users against unfair privacy practices, they often lack a large-scale collection of policies to study the state of internet privacy. To remedy this bottleneck, we present PrivaSeer, the first privacy policy search engine. PrivaSeer indexes 1,400,318 English-language website privacy policies and can be used to search privacy policies based on text queries and several search facets. Results can be ranked by PageRank, query-based document relevance, and the probability that a document is a privacy policy. Results can also be filtered by readability, vagueness, industry, and mentions of tracking technology, self-regulatory bodies, or regulations and cross-border agreements in the policy text. PrivaSeer allows legal experts, researchers, and policy regulators to discover privacy trends and policy anomalies at scale. In this paper we present the search interface, ranking technique, and filtering techniques for PrivaSeer. We create two indexes of privacy policies: one that includes the supplementary non-policy content present in privacy policy web pages and one that does not. We evaluate the functionality of PrivaSeer by comparing ranking techniques on these two indexes.
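The ranking and filtering options named in the abstract can be illustrated with a minimal Python sketch. It is not PrivaSeer's implementation: the `PolicyResult` record, its field names, the `rank_and_filter` helper, and the convention that a higher readability score means easier text are all assumptions made for illustration. The sketch sorts results by one chosen signal (PageRank, query relevance, or policy probability) after applying two of the facets (industry and readability).

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class PolicyResult:
    """Hypothetical search hit; all field names are illustrative assumptions."""
    url: str
    pagerank: float            # link-based importance of the page
    relevance: float           # query-based document relevance score
    policy_probability: float  # probability the document is a privacy policy
    readability: float         # assumed convention: higher means easier to read
    industry: str


# The three ranking signals listed in the abstract.
RANKING_SIGNALS = {"pagerank", "relevance", "policy_probability"}


def rank_and_filter(
    results: Iterable[PolicyResult],
    rank_by: str = "relevance",
    industry: Optional[str] = None,
    min_readability: Optional[float] = None,
) -> List[PolicyResult]:
    """Apply the facet filters, then sort descending by the chosen signal."""
    if rank_by not in RANKING_SIGNALS:
        raise ValueError(f"unknown ranking signal: {rank_by!r}")
    kept = [
        r for r in results
        if (industry is None or r.industry == industry)
        and (min_readability is None or r.readability >= min_readability)
    ]
    return sorted(kept, key=lambda r: getattr(r, rank_by), reverse=True)
```

Keeping each signal as a separate sort key mirrors the abstract's wording, which lists the signals as alternative rankings; whether and how the real system combines them is not stated here.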

Notes

  1. https://www.privacyshield.gov/Program-Overview

  2. https://privaseer.ist.psu.edu/

  3. https://explore.usableprivacy.org/

  4. https://pribot.org/polisis

  5. We refer to the corpus as PrivaSeer Corpus and the search engine as simply PrivaSeer.

  6. https://commoncrawl.org/

  7. https://docs.peopledatalabs.com/docs/free-company-dataset

Acknowledgements

This work was partly supported by a seed grant from the College of Information Sciences and Technology at the Pennsylvania State University. We also acknowledge Adam McMillen for technical support and Ellen Poplavska for providing feedback.

Author information

Correspondence to Mukund Srinath.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Srinath, M., Sundareswara, S.N., Giles, C.L., Wilson, S. (2021). PrivaSeer: A Privacy Policy Search Engine. In: Brambilla, M., Chbeir, R., Frasincar, F., Manolescu, I. (eds) Web Engineering. ICWE 2021. Lecture Notes in Computer Science, vol 12706. Springer, Cham. https://doi.org/10.1007/978-3-030-74296-6_22

  • DOI: https://doi.org/10.1007/978-3-030-74296-6_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-74295-9

  • Online ISBN: 978-3-030-74296-6

  • eBook Packages: Computer Science, Computer Science (R0)
