Abstract
In this paper, we develop a machine learning classifier that predicts perceived ethnicity from data on personal names for major ethnic groups populating Russia. We collect data from VK, the largest Russian social media website. Ethnicity was coded from languages spoken by users and their geographical location, with the data manually cleaned by crowd workers. The classifier achieves an accuracy of 0.82 for a scheme with 24 ethnic groups and 0.92 for 15 aggregated ethnic groups. It can be used for research on ethnicity and ethnic relations in Russia with data sets that contain personal names but not ethnicity.
Introduction
Over the past decades, social scientists gained access to many large-scale data sets thanks to the proliferation of digital traces [1]. The explosive growth in new data even raised hopes that social science was entering its golden age [2]. However, digital traces are typically not collected with a research purpose in mind and are framed by the needs of data providers. As a result, they often lack information on individuals that is important to researchers. One potential solution is to infer missing information using machine learning methods. For example, various socio-demographic characteristics were predicted from profile images [3], mobile phone metadata [4], Facebook likes [5], and images of street scenes [6].
One important characteristic that is of great interest to social scientists but rarely present in digital traces is ethnicity. Taking ethnicity into account is important for analysing social inequalities in health [7], political participation [8], the labour market and housing [9], among other areas.
While lacking information on ethnicity, some large-scale data sets have not been anonymised and include personal names. Examples of such data sets include US voter registration data [10] and Twitter data [11]. Personal names can be used as a signal of ethnicity for many ethnic groups. Experimental studies of discrimination in the labour market and in housing have used this feature [9]; it has also been applied to historical studies of social mobility [12]. The ability to infer ethnicity from personal names allows social scientists to use new administrative and social media data.
While several ethnicity classifiers are already available to researchers, most of them focus on a few immigration destination countries and are limited to a small number of ethnic groups [13]. In this paper, we address this gap by developing a machine learning approach to coding perceived ethnicity from personal names for ethnic groups populating Russia, using data from VK, the largest Russian social media website.
There are several approaches to classifying ethnicity with data on personal names. Early studies employed a dictionary-based method in which names were matched to a reference list of names already classified by ethnicity. [14] is perhaps the first example of automatic binary name classification, developed to separate Chinese from non-Chinese names in Canada. [15] offers a review of 13 studies published up to 2007 that use similar methodology. In a more recent application, [16, 17] use both matching and supervised learning to classify the ethnicity of Facebook users in the Netherlands on the basis of their first names, using the Dutch census as the reference list (also see [18]).
A successful application of the matching approach requires a large reference list that covers most ethnic first names and/or surnames. However, for many ethnic groups, the reference lists of names do not exist or are incomplete. These classifiers also rely on the assumption that name/ethnicity distributions are similar for the reference list and the target population. As reference lists are often compiled from census data, it is not clear how well they will work with social media data.
Another approach to ethnic name classification is based on supervised learning algorithms. The main advantage of this approach is that it allows researchers to classify previously unseen names. [19] develop a multiclass name classifier for 13 ethnic groups with the data from Wikipedia using hidden Markov models and decision trees. [20] use recurrent neural networks to predict ethnicity from names from the Olympic records data. In other recent studies, [1 shows the list of ethnic groups and their population and sample sizes.
Machine learning pipeline
To apply ML algorithms, text must be transformed into numerical vectors. We use three different vectorisation methods (Bag of Words, TFIDF, and fastText) and compare their performance with different ML algorithms.
The Bag of Words (BoW) method converts text into a vector whose dimensionality equals the size of the vocabulary, formed from the unique tokens (extracted n-grams in our case) in a corpus. An n-gram is a sequence of n consecutive characters from a name: for example, the 3-grams of the name «Alice» are «Ali», «lic», and «ice». Vectorisation is then performed according to token (n-gram) frequency. As an example, if a corpus contains only the tokens ‘A’, ‘T’, ‘G’, ‘C’, then «AATGA» would be converted to <3, 1, 1, 0>.
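The two examples in the preceding paragraph can be reproduced with a few lines of Python. This is a minimal sketch rather than the authors' implementation; the helper names `char_ngrams` and `bow_vector` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def char_ngrams(name, n):
    """Return all contiguous character n-grams of `name`, in order."""
    return [name[i:i + n] for i in range(len(name) - n + 1)]


def bow_vector(text, vocabulary, n=1):
    """Count how often each vocabulary token occurs among the n-grams of `text`."""
    counts = Counter(char_ngrams(text, n))
    return [counts[token] for token in vocabulary]


print(char_ngrams("Alice", 3))                    # ['Ali', 'lic', 'ice']
print(bow_vector("AATGA", ["A", "T", "G", "C"]))  # [3, 1, 1, 0]
```

A name shorter than n simply yields no n-grams, which is one reason a range of n-gram lengths (rather than a single n) is convenient for short names.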
TFIDF (term frequency-inverse document frequency) shares the same idea, but it uses the \(tf\text{-}idf\) function of a token, i.e., it normalises the token frequency by the share of all words that contain the token. The motivation for this transformation is to decrease the impact of frequent tokens that often provide little information and to increase the impact of rare tokens that are more informative [29].
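To make the weighting concrete, the sketch below computes a textbook form of tf-idf over character 2-grams, where each raw count is multiplied by log(N / df), with N the number of names and df the number of names containing the token. The exact variant used in the study is not stated here (library implementations often add smoothing and normalisation), so the formula, the function names, and the three example names are assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(name, n):
    """Return all contiguous character n-grams of `name`, in order."""
    return [name[i:i + n] for i in range(len(name) - n + 1)]


def tfidf(names, n=2):
    """Return one {token: weight} dict per name, using weight = tf * log(N / df)."""
    docs = [Counter(char_ngrams(name, n)) for name in names]
    n_docs = len(docs)
    # Document frequency: iterating a Counter yields each distinct token once per name.
    df = Counter(token for doc in docs for token in doc)
    return [
        {token: count * math.log(n_docs / df[token]) for token, count in doc.items()}
        for doc in docs
    ]


weights = tfidf(["anna", "hanna", "ivan"])
# "an" occurs in every name, so its weight is zero; "ha" occurs in one name only
# and therefore receives the largest weight within "hanna".
print(weights[1])
```

In this unsmoothed form, a token shared by all names carries no weight at all, which illustrates how the transformation downweights frequent, uninformative tokens.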
Finally, the fastText model (FT) [43]. The issue of ethnicity in Russia, very sensitive in Soviet times, remains significant today in interpersonal relations, as well as in the labour market and housing. Its significance increased after the Russian invasion of Ukraine in 2022 (please note that this study was started and largely concluded before February 2022). It is important to recognize that our classifier cannot, and is not intended to, identify ethnicity at the individual level. While this tool can produce reliable distributions of ethnicity for data sets with hundreds or thousands of names, for each individual name there remains a margin of error that does not allow the classifier to be used for individual profiling.
Data availability statement
The research data supporting this publication and the Python code are openly available on GitHub at: https://github.com/abessudnov/ruEthnicNamesPublic.
References
Lazer, D., & Radford, J. (2017). Data ex machina: introduction to big data. Annual Review of Sociology, 43, 19–39.
Buyalskaya, A., Gallo, M., & Camerer, C. F. (2021). The Golden Age of Social Science. Proceedings of the National Academy of Sciences, 118(5), e2002923118.
An, J., & Weber, I. (2016). #Greysanatomy vs. #Yankees: Demographics and Hashtag Use on Twitter. In Proceedings of the tenth international AAAI conference on web and social media (Vol. 10, pp. 523–6).
Blumenstock, J., Cadamuro, G., & On, R. (2015). Predicting poverty and wealth from mobile phone metadata. Science, 350(6264), 1073–6.
Kosinski, M., Stillwell, D., & Graepel, T. (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 110(15), 5802–5.
Gebru, T., Krause, J., Wang, Y., Chen, D., Deng, J., Aiden, E. L., et al. (2017). Using deep learning and google street view to estimate the demographic makeup of neighborhoods across the United States. Proceedings of the National Academy of Sciences, 114(50), 13108–13.
Khunti, K., Routen, A., Banerjee, A., & Pareek, M. (2021). The need for improved collection and coding of ethnicity in health research. Journal of Public Health, 43(2), e270-2.
Flesken, A., & Hartl, J. (2020). Ethnicity, inequality, and perceived electoral fairness. Social Science Research, 85, 102363.
Bertrand, M., & Duflo, E. (2017). Field experiments on discrimination. In A. Banerjee & E. Duflo (Eds.), Handbook of economic field experiments (Vol. 1, pp. 309–93). Elsevier.
Imai, K., & Khanna, K. (2016). Improving ecological inference by predicting individual ethnicity from voter registration records. Political Analysis, 24(2), 263–72.
Wood-Doughty, Z., Andrews, N., Marvin, R., & Dredze, M. (2018). Predicting Twitter User Demographics from Names Alone. In Proceedings of the second workshop on computational modeling of people’s opinions, personality, and emotions in social media (pp. 105–11).
Clark, G. (2014). The son also rises. Princeton: Princeton University Press.
Mateos, P. (2014). Classifying ethnicity through people’s names. Names, ethnicity and populations (pp. 117–144). Berlin: Springer.
Coldman, A. J., Braun, T., & Gallagher, R. P. (1988). The classification of ethnic status using name information. Journal of Epidemiology & Community Health, 42(4), 390–5.
Mateos, P. (2007). A review of name-based ethnicity classification methods and their potential in population studies. Population, Space and Place, 13(4), 243–63.
Hofstra, B., Corten, R., Van Tubergen, F., & Ellison, N. B. (2017). Sources of segregation in social networks: a novel approach using Facebook. American Sociological Review, 82(3), 625–56.
Hofstra, B., & de Schipper, N. C. (2018). Predicting ethnicity with first names in online social media networks. Big Data & Society, 5(1), 1–14.
Chang, J., Rosenn, I., Backstrom, L., & Marlow, C. (2010). ePluribus: Ethnicity on Social Networks. In Proceedings of the international AAAI conference on web and social media (Vol. 4, pp. 18–25).
Ambekar, A., Ward, C., Mohammed, J., Male, S., & Skiena, S. (2009). Name-ethnicity classification from open sources. In Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 49–58).
Lee, J., Kim, H., Ko, M., Choi, D., Choi, J., & Kang, J. (2017). Name nationality classification with recurrent neural networks. In Proceedings of the twenty-sixth international joint conference on artificial intelligence (pp. 2081–7).
Chaturvedi, R., & Chaturvedi, S. (2020). It’s All in the Name: A Character Based Approach To Infer Religion. arXiv:2010.14479. Available from: https://arxiv.org/abs/2010.14479.
Ye, J., Han, S., Hu, Y., Coskun, B., Liu, M., Qin, H., et al. (2017). Nationality classification using name embeddings. In Proceedings of the 2017 ACM on conference on information and knowledge management (pp. 1897–1906).
Cesare, N., Grant, C., Nguyen, Q., Lee, H., & Nsoesie, E. O. (2017). How Well Can Machine Learning Predict Demographics of Social Media Users? arXiv:1702.01807v2.
Unbegaun, B. O. (1972). Russian surnames. Oxford: Clarendon Press.
Karaulova, M., Gök, A., & Shapira, P. (2019). Identifying author heritage using surname data: an application for Russian surnames. Journal of the Association for Information Science and Technology, 70(5), 488–98.
Bessudnov, A. (2022). Ethnic and regional inequalities in the Russian military fatalities in the 2022 war in Ukraine. SocArXiv. Available from: https://osf.io/preprints/socarxiv/s43yf.
Sivak, E., & Smirnov, I. (2019). Parents mention sons more often than daughters on social media. Proceedings of the National Academy of Sciences, 116(6), 2039–41.
Smirnov, I. (2020). Estimating educational outcomes from students’ short texts on social media. EPJ Data Science, 9(1), 27.
Manning, C. D., Raghavan, P., & Schutze, H. (2009). An introduction to information retrieval. Cambridge: Cambridge University Press.
Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of Tricks for Efficient Text Classification. arXiv:1607.01759. Available from: https://arxiv.org/abs/1607.01759.
Zhang, T. (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the twenty-first international conference on machine learning (ICML) (p. 116). New York.
Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. The Annals of Statistics, 29(5), 1189–1232.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: machine learning in python. Journal of Machine Learning Research, 12(85), 2825–30.
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (KDD ’16) (pp. 785–94). New York.
Gorenburg, D. (1999). Identity change in Bashkortostan: Tatars into Bashkirs and back. Ethnic and Racial Studies, 22(3), 554–80.
Bessudnov, A., & Monden, C. (2021). Ethnic intermarriage in Russia: the tale of four cities. Post-Soviet Affairs, 37(4), 383–403.
Jenkins, R. (2008). Rethinking Ethnicity (2nd ed.). London: Sage.
Lamont, M., & Molnár, V. (2002). The study of boundaries in the social sciences. Annual Review of Sociology, 28(1), 167–95.
Wimmer, A. (2013). Ethnic boundary making: institutions, power, networks. New York: Oxford University Press.
Bessudnov, A., & Shcherbak, A. (2020). Ethnic discrimination in multi-ethnic societies: evidence from Russia. European Sociological Review, 36(1), 104–20.
Ghai, B., Liao, Q. V., Zhang, Y., & Mueller, K. (2020). Measuring social biases of crowd workers using counterfactual queries. arXiv:2004.02028.
La Barbera, D., Roitero, K., Demartini, G., Mizzaro, S., & Spina, D. (2020). Crowdsourcing truthfulness: the impact of judgment scale and assessor bias. Advances in Information Retrieval, 12036, 207–14.
Mittelstadt, B. D., Allo, P., Taddeo, M., Wachter, S., & Floridi, L. (2016). The ethics of algorithms: mapping the debate. Big Data & Society, 3(2), 1–21.
Acknowledgements
We are grateful to Ivan Bibilov (European University at St Petersburg; Yandex) and Alexey Shpilman (HSE University) for their advice on developing this paper.
Ethics declarations
Conflict of interest
The authors do not have competing financial or non-financial interests related to this work.
Appendices
A Hyperparameters’ choice
In this section, we provide the hyperparameter search grid for each algorithm; the best parameters are highlighted in bold. For fastText vectorisation, we used the default checkpoint from https://fasttext.cc/docs/en/crawl-vectors.html.
A.1 Complement Naive Bayes (CNB)
For CNB, we search only over the parameters of the vectorisation methods. N-gram ranges: {(1, 1), (1, 3), (1, 5), (1, 7)}, maximal vocabulary element frequencies: {5%, 10%, 20%}, minimal vocabulary element frequencies: {1, 10, 1%, 5%}, lowercasing: {True, False}.
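As an illustration of the size of this search space, the grid can be written as a Python dictionary and expanded into individual configurations. The keys `ngram_range`, `max_df`, `min_df`, and `lowercase` follow the parameter names of scikit-learn's text vectorisers, where an integer frequency is an absolute count and a float is a proportion of documents; mapping the grid onto these names is an assumption, and `expand_grid` is a hypothetical helper.

```python
from itertools import product

# Vectorisation grid for CNB, transcribed from the values listed above.
# Percentages are written as proportions (assumed scikit-learn convention).
cnb_grid = {
    "ngram_range": [(1, 1), (1, 3), (1, 5), (1, 7)],
    "max_df": [0.05, 0.10, 0.20],
    "min_df": [1, 10, 0.01, 0.05],
    "lowercase": [True, False],
}


def expand_grid(grid):
    """Return every combination of grid values as a list of {name: value} dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


configs = expand_grid(cnb_grid)
print(len(configs))  # 96
```

The grid therefore contains 4 × 3 × 4 × 2 = 96 vectoriser configurations for each vectorisation method that accepts these parameters.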
A.2 Logistic regression
A.2.1 Vectorisation parameters
N-grams ranges: {(1, 3), (1, 5), (1, 7)}, maximal vocabulary elements frequencies: {60%, 65%, 70%, 75%, 80%}, minimal vocabulary elements frequencies: {1, 5, 1%, 10%}, lowercasing: {True, False}.
A.2.2 Model parameters
Regularization method: elasticnet, regularization term: {0.000005, 0.00001, 0.000025, 0.00005, 0.0001}.
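For readers unfamiliar with the elastic-net penalty, the sketch below shows one common parameterisation, in which the regularisation term acts as an overall strength `alpha` applied to a mix of L1 and squared L2 norms. The mixing ratio is not reported in this appendix, so `l1_ratio` and its value are assumptions, and the function name is hypothetical.

```python
def elastic_net_penalty(weights, alpha, l1_ratio=0.15):
    """Elastic-net penalty: alpha * (l1_ratio * ||w||_1 + (1 - l1_ratio) * 0.5 * ||w||_2^2).

    `alpha` plays the role of the regularisation term in the grid above;
    `l1_ratio` (assumed, not taken from the paper) balances L1 against L2.
    """
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return alpha * (l1_ratio * l1 + (1.0 - l1_ratio) * 0.5 * l2_squared)


# Penalty for a toy weight vector at each regularisation strength in the grid.
for alpha in (0.000005, 0.00001, 0.000025, 0.00005, 0.0001):
    print(alpha, elastic_net_penalty([3.0, -4.0], alpha))
```

Setting `l1_ratio` to 1 recovers a pure L1 (lasso) penalty and setting it to 0 recovers a pure L2 (ridge) penalty.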
A.3 Support vector machine
A.3.1 Vectorisation parameters
N-grams ranges: {(1, 3), (1, 5), (1, 7)}, maximal vocabulary elements frequencies: {35%, 45%, 55%, 65%, 75%}, minimal vocabulary elements frequencies: {1, 5, 10, 50, 1%, 10%}, lowercasing: {True, False}.
A.3.2 Model parameters
Regularization method: elasticnet, regularization term: {0.000005, 0.00001, 0.000025, 0.00005, 0.0001}.
A.4 Quadratically smoothed support vector machine
A.4.1 Vectorisation parameters
N-grams ranges: {(1, 3), (1, 5), (1, 7)}, maximal vocabulary elements frequencies: {35%, 45%, 55%, 65%, 75%}, minimal vocabulary elements frequencies: {1, 5, 10, 50, 1%, 10%}, lowercasing: {True, False}.
A.4.2 Model parameters
Regularization method: elasticnet, regularization term: {0.000005, 0.00001, 0.000025, 0.00005, 0.0001}.
A.5 Gradient tree boosting
A.5.1 Model parameters
Number of estimators: {100, 500, 1000}.
A.6 Random Forest
A.6.1 Vectorisation parameters
N-grams ranges: {(1, 3), (1, 5), (1, 7)}, maximal vocabulary elements frequencies: {1%, 5%, 10%, 20%, 30%, 40%, 50%, 55%, 65%, 70%, 75%}, minimal vocabulary elements frequencies: {1, 5, 10, 10%}, lowercasing: {True, False}.
A.6.2 Model parameters
Number of estimators: {25, 50, 100, 200, 300}, maximal depth: {50, 100, 500, 1000, \(\infty\)}.
A.7 Multilayer perceptron
A.7.1 Model parameters
Hidden size: {100, 300}, number of hidden layers: {1, 2}, dropout probability: {0.1, 0.4, 0.5}, activation function: {ReLU, ELU}, learning rate: {0.0002, 0.0005}, number of epochs: {100}.
A.8 Long short-term memory
A.8.1 Model parameters
Hidden size: {100, 150, 200}, number of hidden layers: {2, 3}, dropout probability: {0, 0.1, 0.5}, bidirectional: {False, True}, activation function: {ReLU}, learning rate: {0.0005, 0.0007}, number of epochs: {30, 60}.
A.9 Convolutional neural network
A.9.1 Model parameters
Number of channels: {50, 100, 200}, number of hidden layers: {3}, dropout probability: {0.5}, activation function: {ELU}, learning rate: {0.0001}, number of epochs: {20}.
B Additional models’ results
(See Table 8)
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Bessudnov, A., Tarasov, D., Panasovets, V. et al. Predicting perceived ethnicity with data on personal names in Russia. J Comput Soc Sc 6, 589–608 (2023). https://doi.org/10.1007/s42001-023-00205-y
DOI: https://doi.org/10.1007/s42001-023-00205-y