Abstract
User-generated content (UGC) is an important source of information on products and services for consumers and firms. Although incentivizing high-quality UGC is an important business objective for any content platform, we show that it is also possible to identify anonymous posters by exploiting the characteristics of posted content. We present a novel two-stage authorship attribution methodology that combines structured and text data by identifying an author first by the amount and granularity of structured data (e.g., location, first name) posted with the UGC and second by the author’s writing style. As a case study, we show that 75% of the 1.3 million users in data publicly released by Yelp are uniquely identified by three structured variable combinations. For the remaining 25%, when the number of potential authors with (nearly) identically structured data ranges from 100 to 5 and sufficient training data exists for text analysis, the average probabilities of identification range from 40 to 81%. Our findings suggest that UGC platforms concerned with the potential negative effects of privacy-related incidents should limit or generalize their posters’ structured data when it is adjoined with textual content or mentioned in the text itself. We also show that although protection policies that focus on structured data remove the most predictive elements of authorship, they also have a small negative effect on the usefulness of content.
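The two stages described above can be sketched on toy data. This is a minimal illustration under stated assumptions, not the authors' implementation: the records, field names, and helper functions (`candidate_sets`, `share_unique`, `attribute`) are hypothetical, and cosine similarity over character trigrams stands in for the stylometric classifier used in the paper.

```python
import math
from collections import Counter, defaultdict


def candidate_sets(users, keys):
    """Stage 1: group user ids that share the same structured-variable values."""
    groups = defaultdict(list)
    for user in users:
        groups[tuple(user[k] for k in keys)].append(user["id"])
    return dict(groups)


def share_unique(users, keys):
    """Fraction of users whose structured-variable combination is unique."""
    groups = candidate_sets(users, keys)
    return sum(len(ids) == 1 for ids in groups.values()) / len(users)


def char_ngrams(text, n=3):
    """Count overlapping character n-grams as a crude writing-style profile."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(anonymous_text, known_texts, candidates):
    """Stage 2: among the stage-1 candidates, pick the most similar writer."""
    target = char_ngrams(anonymous_text)
    profiles = {c: char_ngrams(" ".join(known_texts[c])) for c in candidates}
    return max(candidates, key=lambda c: cosine(target, profiles[c]))


# Hypothetical toy records; the real data contain about 1.3 million users.
users = [
    {"id": "u1", "first_name": "Ana", "city": "Phoenix"},
    {"id": "u2", "first_name": "Ana", "city": "Phoenix"},
    {"id": "u3", "first_name": "Ben", "city": "Phoenix"},
    {"id": "u4", "first_name": "Ana", "city": "Tempe"},
]
known_texts = {
    "u1": ["OMG!!! best tacos ever!!! soooo good!!!"],
    "u2": ["The service was adequate; however, the entree arrived cold."],
}
anonymous_post = "OMG!!! the burgers here are soooo good!!!"

keys = ["first_name", "city"]
unique_share = share_unique(users, keys)
# The anonymous post is assumed to carry the structured values ("Ana", "Phoenix").
candidates = candidate_sets(users, keys)[("Ana", "Phoenix")]
guess = attribute(anonymous_post, known_texts, candidates)
```

In this sketch, stage 1 alone identifies any user whose structured combination is unique, and stage 2 is needed only for the users who share a combination, which mirrors the split between the 75% and the remaining 25% reported above.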
Notes
As reviewed in Shu et al. [57], this problem is known by a number of names, including User Identity Linkage, Social Identity Linkage, User Identity Resolution, Social Network Reconciliation, User Account Linkage Inference, Profile Linkage, Anchor Link Prediction, and Detecting me edges.
We explored several different kernels for SVM, including polynomial (2nd and 3rd order) and other nonlinear specifications. A linear kernel achieved the best results and is therefore presented throughout the paper.
References
Abbasi A, Chen H (2008) Writeprints: a stylometric approach to identity-level identification and similarity detection in cyberspace. ACM Transact Inform Syst (TOIS) 26(2):1–29
Abbasi A, Chen H, Nunamaker JF (2008) Stylometric identification in electronic markets: scalability and robustness. J Manag Inf Syst 25(1):49–78
Aggarwal CC, Yu PS (2008) A general survey of privacy-preserving data mining models and algorithms. In: Privacy-preserving data mining. Springer, Boston, pp 11–52
Ahn D-Y, Duan JA, Mela CF (2015) Managing user-generated content: a dynamic rational expectations equilibrium approach. Mark Sci 35(2):284–303
Almishari M, Tsudik G (2012) Exploring linkability of user reviews. In: European Symposium on Research in Computer Security. Springer, Berlin, pp 307–324
AMZ Tracker, 2018. How to deal with negative reviews. URL: https://www.amztracker.com/blog/deal-negative-reviews/. Accessed: July 24, 2020.
André Q, Carmon Z, Wertenbroch K, Crum A, Frank D, Goldstein W, Huber J, Van Boven L, Weber B, Yang H (2018) Consumer choice and autonomy in the age of artificial intelligence and big data. Cust Needs Solut 5(1):28–37
Bordes A, Ertekin S, Weston J, Bottou L (2005) Fast kernel classifiers with online and active learning. J Mach Learn Res 6(Sep):1579–1619
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Brennan M, Afroz S, Greenstadt R (2012) Adversarial stylometry: circumventing authorship recognition to preserve privacy and anonymity. ACM Transac Inform Syst Secur (TISSEC) 15(3):1–22
Brizan DG, Tansel AU (2006) A survey of entity resolution and record linkage methodologies. Commun IIMA 6(3):5
Büschken J, Allenby GM (2016) Sentence-based text analysis for customer reviews. Mark Sci 35(6):953–975
Campbell J, Goldfarb A, Tucker C (2015) Privacy regulation and market structure. J Econ Manag Strateg 24(1):47–73
Caselaw, (2017). ZL TECHNOLOGIES INC v. GLASSDOOR INC. Court of Appeal, First District, Division 4, California. URL: https://caselaw.findlaw.com/ca-court-of-appeal/1868279.html. Accessed July 24, 2020.
De Jong MG, Pieters R, Fox JP (2010) Reducing social desirability bias through item randomized response: an application to measure underreported desires. J Mark Res 47(1):14–27
De Montjoye YA, Hidalgo CA, Verleysen M, Blondel VD (2013) Unique in the crowd: the privacy bounds of human mobility. Sci Rep 3(1):1376
De Montjoye YA, Radaelli L, Singh VK (2015) Unique in the shopping mall: on the reidentifiability of credit card metadata. Science 347(6221):536–539
Douglas DM (2016) Doxing: a conceptual analysis. Ethics Inf Technol 18(3):199–210
Du Bay WH, (2004). The principles of readability. Accessed April 7, 2020. http://en.copian.ca/library/research/readab/readab.pdf.
Elmagarmid AK, Ipeirotis PG, Verykios VS (2006) Duplicate record detection: a survey. IEEE Trans Knowl Data Eng 19(1):1–16
Farr C, (2018). Facebook sent a doctor on a secret mission to ask hospitals to share data. CNBC. URL: https://www.cnbc.com/2018/04/05/facebook-building-8-explored-data-sharing-agreement-with-hospitals.html. Accessed: July 24, 2020.
Getoor L, Machanavajjhala A (2012) Entity resolution: theory, practice & open challenges. Proc VLDB Endowment 5(12):2018–2019
Ghose A, Ipeirotis PG (2010) Estimating the helpfulness and economic impact of product reviews: mining text and reviewer characteristics. IEEE Trans Knowl Data Eng 23(10):1498–1512
Goldfarb A, Tucker C (2013) Why managing consumer privacy can be an opportunity. MIT Sloan Manag Rev 54(3):10
Gravano L, Ipeirotis PG, Koudas N and Srivastava D, (2003). Text joins for data cleansing and integration in an rdbms. In Proceedings 19th International Conference on Data Engineering (Cat. No. 03CH37405) (pp. 729-731). IEEE.
Hewett K, Rand W, Rust RT, van Heerde HJ (2016) Brand buzz in the echoverse. J Mark 80(3):1–24
Hill S, Provost F (2003) The myth of the double-blind review? Author identification using only citations. Acm Sigkdd Explor Newslett 5(2):179–184
Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Trans Neural Netw 13(2):415–425
Hu M and Liu B, 2004. Mining opinion features in customer reviews. In AAAI (Vol. 4, No. 4, pp. 755-760).
Jones R, (2017). Court rules Yelp must identify anonymous user in defamation case. Gizmodo. URL: https://gizmodo.com/court-rules-yelp-must-identify-anonymous-user-in-defama-1820433103. Accessed: July 24, 2020.
Juola P (2012) Large-scale experiments in authorship attribution. Engl Stud 93(3):275–283
Juola P and Vescovi D, (2010). Empirical evaluation of authorship obfuscation using JGAAP. In Proceedings of the 3rd ACM workshop on Artificial Intelligence and Security (pp. 14-18).
Kincaid JP, Fishburne Jr RP, Rogers RL and Chissom BS, (1975). Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel. Naval Technical Training Command Millington TN Research Branch.
Klemko R (2021) A small group of sleuths has been identifying right-wing extremists long before the attack on the Capitol. URL: https://www.washingtonpost.com/national-security/antifa-far-right-doxing-identities/2021/01/10/41721de0-4dd7-11eb-bda4-615aaefd0555_story.html. Accessed January 2, 2021.
Koppel M, Schler J, Argamon S (2009) Computational methods in authorship attribution. J Am Soc Inf Sci Technol 60(1):9–26
Kourtis I, Stamatatos E (2011) Author identification using semi-supervised learning. In: CLEF 2011: Proceedings of the 2011 Conference on Multilingual and Multimodal Information Access Evaluation (Lab and Workshop Notebook Papers). The Netherlands, Amsterdam
Krishnamoorthy S (2015) Linguistic features for review helpfulness prediction. Expert Syst Appl 42(7):3751–3759
Kroft S, (2014). The data brokers: selling your personal information. 60 Minutes. URL: https://www.cbsnews.com/news/the-data-brokers-selling-your-personal-information/. Accessed: July 24, 2020.
Kumar V, Reinartz W (2018) Customer privacy concerns and privacy protective responses. In: Customer relationship management. Springer, Berlin, pp 285–309
Li XB, Qin J (2017) Anonymizing and sharing medical text records. Inf Syst Res 28(2):332–352
Li XB, Sarkar S (2006) Privacy protection in data mining: a perturbation approach for categorical data. Inf Syst Res 17(3):254–270
Mankad S, Han HS, Goh J, Gavirneni S (2016) Understanding online hotel reviews through automated text analysis. Serv Sci 8(2):124–138
Martin KD, Murphy PE (2017) The role of data privacy in marketing. J Acad Mark Sci 45(2):135–155
Menon S, Sarkar S (2016) Privacy and big data: scalable approaches to sanitize large transactional databases for sharing. MIS Q 40(4):963–981
Moe WW, Schweidel DA (2012) Online product opinions: incidence, evaluation, and evolution. Mark Sci 31(3):372–386
Narayanan A, Paskov H, Gong NZ, Bethencourt J, Stefanov E, Shin ECR and Song D, (2012). On the feasibility of internet-scale author identification. In 2012 IEEE Symposium on Security and Privacy (pp. 300-314). IEEE.
Narayanan A and Shmatikov V, 2008, May. Robust de-anonymization of large datasets. In Proceedings of the 2008 IEEE Symposium on Security and Privacy.
The Associated Press, (2017). Yelp says lawsuit might eliminate all negative reviews. New York Daily News. URL: https://www.nydailynews.com/news/national/yelp-lawsuit-eliminate-negative-reviews-article-1.2796087. Accessed July 24, 2020.
Payer M, Huang L, Gong NZ, Borgolte K, Frank M (2014) What you submit is who you are: a multimodal approach for deanonymizing scientific publications. IEEE Transact Inform Forensics Secur 10(1):200–212
Peer E, Vosgerau J, Acquisti A (2014) Reputation as a sufficient condition for data quality on Amazon Mechanical Turk. Behav Res Methods 46(4):1023–1031
Porter J, (2019). Fraudulent Yelp posting protected under the law, ridiculous. Tahoe Daily Tribune, May 20, 2019. URL: https://www.tahoedailytribune.com/news/jim-porter-fraudulent-yelp-posting-protected-under-the-law-ridiculous/. Accessed July 24, 2020.
Proserpio D, Zervas G (2017) Online reputation management: estimating the impact of management responses on consumer reviews. Mark Sci 36(5):645–665
Qian T, Liu B, Chen L and Peng, Z., (2014). Tri-training for authorship attribution with limited training data. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 345-351).
Rochet JC, Tirole J (2003) Platform competition in two-sided markets. J Eur Econ Assoc 1(4):990–1029
Schneider MJ, Jagpal S, Gupta S, Li S, Yu Y (2017) Protecting customer privacy when marketing with second-party data. Int J Res Mark 34(3):593–603
Schneider MJ, Jagpal S, Gupta S, Li S, Yu Y (2018) A flexible method for protecting marketing data: an application to point-of-sale data. Mark Sci. ePub ahead of print Jan 8 37:153–171. https://doi.org/10.1287/mksc.2017.1064
Shu K, Wang S, Tang J, Zafarani R, Liu H (2017) User identity linkage across online social networks: a review. Acm Sigkdd Explor Newslett 18(2):5–17
Singh JP, Irani S, Rana NP, Dwivedi YK, Saumya S, Roy PK (2017) Predicting the “helpfulness” of online consumer reviews. J Bus Res 70:346–355
Snyder P, Doerfler P, Kanich C and McCoy D, (2017). Fifteen minutes of unwanted fame: detecting and characterizing doxing. In Proceedings of the 2017 Internet Measurement Conference (pp. 432-444).
Stamatatos E (2009) A survey of modern authorship attribution methods. J Am Soc Inf Sci Technol 60(3):538–556
Steyvers M, Griffiths T (2007) Probabilistic topic models. Handb Latent Semantic Anal 427(7):424–440
Stone EF, Spool MD, Rabinowitz S (1977) Effects of anonymity and retaliatory potential on student evaluations of faculty performance. Res High Educ 6(4):313–325
Sweeney L (2000) Simple demographics often identify people uniquely. Health (San Francisco) 671(2000):1–34
Sweeney L (2002a) k-anonymity: a model for protecting privacy. Int J Uncertaint Fuzziness Knowl-Based Syst 10(05):557–570
Sweeney L (2002b) Achieving k-anonymity privacy protection using generalization and suppression. Int J Uncertaint Fuzziness Knowl-Based Syst 10(05):571–588
Tirunillai S, Tellis GJ (2014) Mining marketing meaning from online chatter: strategic brand analysis of big data using latent dirichlet allocation. J Mark Res 51(4):463–479
Turjeman D and Feinberg FM, (2019). When the data are out: measuring behavioral changes following a data breach. Available at SSRN 3427254.
Tweedie FJ, Baayen RH (1998) How variable may a constant be? Measures of lexical richness in perspective. Comput Hum 32(5):323–352
US Census Bureau, (2016). Decennial Census Surname Files (2010, 2000). URL: https://www.census.gov/data/developers/data-sets/surnames.html. Accessed July 24, 2020.
US Social Security Administration, (2019). Baby names from social security card applications - national data. Data.gov. URL: https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-level-data. Accessed: July 24, 2020.
Wedel M, Kannan PK (2016) Marketing analytics for data-rich environments. J Mark 80(6):97–121
Winkler WE, (1999). The state of record linkage and current research problems. In Statistical Research Division, US Census Bureau.
Xia D, Mankad S, Michailidis G (2016) Measuring influence of users in Twitter ecosystems using a counting process modeling framework. Technometrics 58(3):360–370
Xu J, Ding M (2019) Using the double transparency of autonomous vehicles to increase fairness and social welfare. Cust Needs Solut 6(1):26–35
Yelp, 2020. https://terms.yelp.com/privacy/en_us/20200101_en_us/#Controlling-Your-Personal-Data
Yule, G.U., 1944. The statistical study of literary vocabulary. In Mathematical Proceedings of the Cambridge Philosophical Society (Vol. 42, pp. b1-9).
Zhang Y, Moe WW, Schweidel DA (2017) Modeling the role of message content and influencers in social media rebroadcasting. Int J Res Mark 34(1):100–119
Zhao Y, Yang S, Narayan V, Zhao Y (2013) Modeling consumer learning from online product reviews. Mark Sci 32(1):153–169
Acknowledgements
We are thankful to Elea Feit, Sachin Gupta, Cameron Bale, and Sharan Jagpal for their helpful comments on earlier versions of this paper.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Appendix. Expanded results for the Yelp data
Figure 4 provides out-of-sample accuracy results for the Yelp data. Accuracy consistently improves as the sophistication of the data intruder and the size of the training data increase.
Schneider, M.J., Mankad, S. A Two-Stage Authorship Attribution Method Using Text and Structured Data for De-Anonymizing User-Generated Content. Cust. Need. and Solut. 8, 66–83 (2021). https://doi.org/10.1007/s40547-021-00116-x
DOI: https://doi.org/10.1007/s40547-021-00116-x