Predicting perceived ethnicity with data on personal names in Russia

Bessudnov, Alexey; Tarasov, Denis; Panasovets, Viacheslav; Kostenko, Veronica; Smirnov, Ivan; Uspenskiy, Vladimir

doi:10.1007/s42001-023-00205-y


Research Article
Open access
Published: 04 April 2023

Volume 6, pages 589–608, (2023)

Journal of Computational Social Science


Alexey Bessudnov ORCID: orcid.org/0000-0002-2541-9794¹,
Denis Tarasov²,
Viacheslav Panasovets³,
Veronica Kostenko⁴,
Ivan Smirnov⁵ &
…
Vladimir Uspenskiy⁶


Abstract

In this paper, we develop a machine learning classifier that predicts perceived ethnicity from data on personal names for major ethnic groups populating Russia. We collect data from VK, the largest Russian social media website. Ethnicity was coded from languages spoken by users and their geographical location, with the data manually cleaned by crowd workers. The classifier achieves an accuracy of 0.82 for a scheme with 24 ethnic groups and 0.92 for 15 aggregated ethnic groups. It can be used for research on ethnicity and ethnic relations in Russia with data sets that contain personal names but not ethnicity.


Introduction

Over the past decades, social scientists gained access to many large-scale data sets thanks to the proliferation of digital traces [1]. The explosive growth in new data even raised hopes that social science was entering its golden age [2]. However, digital traces are typically not collected with a research purpose in mind and are framed by the needs of data providers. As a result, they often lack information on individuals that is important to researchers. One potential solution is to infer missing information using machine learning methods. For example, various socio-demographic characteristics were predicted from profile images [3], mobile phone metadata [4], Facebook likes [5], and images of street scenes [6].

One important characteristic that is of great interest to social scientists but rarely present in digital traces is ethnicity. Taking ethnicity into account is important for analysing social inequalities in health [7], political participation [8], the labour market and housing [9], among other areas.

While lacking information on ethnicity, some large-scale data sets have not been anonymised and include personal names. Examples of such data sets include US voter registration data [10] and Twitter data [11]. Personal names can be used as a signal for ethnicity for many ethnic groups. Experimental studies of discrimination in the labour market and in housing have been using this feature [9]; it was also applied to historical studies of social mobility [12]. The ability to infer ethnicity from personal names allows social scientists to use new administrative and social media data.

While several ethnicity classifiers are already available to researchers, most of them focus on a few immigration destination countries and are limited to a small number of ethnic groups [13]. In this paper, we address this gap by developing a machine learning approach to coding perceived ethnicity from personal names for ethnic groups populating Russia, using data from VK, the largest Russian social media website.

There are several approaches to classifying ethnicity with data on personal names. Early studies employed a dictionary-based method where names were matched to a reference list of names already classified by ethnicity. [14] is perhaps the first example of automatic binary name classification, developed to separate Chinese from non-Chinese names in Canada. [15] offers a review of 13 studies published up to 2007 that use similar methodology. In a more recent application, [16, 17] use both matching and supervised learning to classify the ethnicity of Facebook users in the Netherlands on the basis of their first names, using the Dutch census as the reference list (also see [18]).

A successful application of the matching approach requires a large reference list that covers most ethnic first names and/or surnames. However, for many ethnic groups, the reference lists of names do not exist or are incomplete. These classifiers also rely on the assumption that name/ethnicity distributions are similar for the reference list and the target population. As reference lists are often compiled from census data, it is not clear how well they will work with social media data.
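The matching approach and its coverage problem can be illustrated with a minimal sketch. The reference list, its labels, and the helper names `match_name` and `coverage` are hypothetical placeholders chosen for illustration; they are not taken from any of the cited studies.

```python
from typing import Dict, Iterable, Optional

# Hypothetical reference list: surname -> label. Placeholder entries only.
REFERENCE_LIST: Dict[str, str] = {
    "wang": "Chinese",
    "chen": "Chinese",
    "tremblay": "non-Chinese",
    "smith": "non-Chinese",
}


def match_name(surname: str, reference: Dict[str, str] = REFERENCE_LIST) -> Optional[str]:
    """Return the label of a surname, or None if it is not in the reference list."""
    return reference.get(surname.strip().lower())


def coverage(surnames: Iterable[str], reference: Dict[str, str] = REFERENCE_LIST) -> float:
    """Share of surnames that the reference list is able to classify."""
    surnames = list(surnames)
    if not surnames:
        return 0.0
    hits = sum(match_name(s, reference) is not None for s in surnames)
    return hits / len(surnames)


print(match_name("Wang"))                            # listed name
print(match_name("Nguyen"))                          # unlisted name -> None
print(coverage(["Wang", "Smith", "Nguyen", "Kim"]))  # half of the names are covered
```

Names missing from the list simply cannot be classified, which is why the size and representativeness of the reference list matter so much for this method.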

Another approach to ethnic name classification is based on supervised learning algorithms. The main advantage of this approach is that it allows researchers to classify previously unseen names. [19] develop a multiclass name classifier for 13 ethnic groups with the data from Wikipedia using hidden Markov models and decision trees. [20] use recurrent neural networks to predict ethnicity from names from the Olympic records data. In other recent studies, […]

Table 1 shows the list of ethnic groups and their population and sample sizes.

Table 1 Ethnic groups and their population and sample sizes


Machine learning pipeline

To apply ML algorithms, text must be transformed into numerical vectors. We use three different vectorisation methods (Bag of Words, TFIDF, and fastText) and compare their performance with different ML algorithms.

The Bag of Words (BoW) method converts text into a vector whose dimensionality equals the size of the vocabulary, which is formed from the unique tokens (extracted n-grams in our case) in a corpus. An n-gram is a sequence of n consecutive characters from a name: for example, the 3-grams of the name «Alice» are «Ali», «lic», and «ice». Vectorisation is then performed according to the token (n-gram) frequency. As an example, if a corpus contains only the tokens ‘A’, ‘T’, ‘G’, ‘C’, then «AATGA» would be converted to <3, 1, 1, 0>.
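The two examples in this paragraph can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function names `char_ngrams` and `bag_of_words` are illustrative and do not come from the authors' code.

```python
from collections import Counter
from typing import List, Sequence


def char_ngrams(name: str, n: int) -> List[str]:
    """All contiguous character n-grams of a name, in order of appearance."""
    return [name[i:i + n] for i in range(len(name) - n + 1)]


def bag_of_words(tokens: Sequence[str], vocabulary: Sequence[str]) -> List[int]:
    """Count how often each vocabulary token occurs among the given tokens."""
    counts = Counter(tokens)
    return [counts[token] for token in vocabulary]


print(char_ngrams("Alice", 3))
# ['Ali', 'lic', 'ice']

# Vocabulary of single-character tokens in a fixed order: A, T, G, C.
print(bag_of_words(char_ngrams("AATGA", 1), ["A", "T", "G", "C"]))
# [3, 1, 1, 0]
```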

TFIDF (term frequency-inverse document frequency) shares the same idea, but it uses the \(tf\text{-}idf\) function of a token, i.e., it normalises the token frequency by the share of all words that contain the token. The motivation for this transformation is to decrease the impact of frequent tokens that often provide little information and to increase the impact of rare tokens that are more informative [29].
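The effect of this weighting can be seen in a small sketch. It assumes the textbook form \(tf \cdot \log(N / df)\), where \(N\) is the number of names and \(df\) is the number of names containing the token; the exact variant used in the paper (for example, a smoothed one) is not specified in this section. The transliterated names below are toy inputs, not data from the study.

```python
import math
from collections import Counter
from typing import Dict, List


def char_ngrams(name: str, n_min: int = 1, n_max: int = 3) -> List[str]:
    """Character n-grams of a name for every n between n_min and n_max."""
    return [
        name[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(name) - n + 1)
    ]


def tfidf_vectors(names: List[str]) -> List[Dict[str, float]]:
    """Sparse tf-idf weights (token -> weight) for each name."""
    docs = [Counter(char_ngrams(name)) for name in names]
    n_docs = len(docs)
    # Document frequency: in how many names does each token appear?
    doc_freq = Counter(token for doc in docs for token in doc)
    idf = {token: math.log(n_docs / df) for token, df in doc_freq.items()}
    return [{token: tf * idf[token] for token, tf in doc.items()} for doc in docs]


names = ["Ivanov", "Petrov", "Sidorov", "Akhmetov"]  # toy examples
weights = tfidf_vectors(names)

# "ov" occurs in every name, so it carries no weight;
# "Akh" occurs in only one name, so it is weighted up.
print(weights[3]["ov"], weights[3]["Akh"])
```

In this toy corpus the shared suffix receives a weight of zero, while the rare prefix receives the largest possible idf, which is the behaviour the paragraph describes.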

Finally, the fastText model (FT) [43]. The issue of ethnicity in Russia, very sensitive in Soviet times, remains significant today in interpersonal relations, as well as in the labour market and housing. Its significance increased after the Russian invasion of Ukraine in 2022 (please note that this study was started and largely concluded before February 2022). It is important to recognize that our classifier cannot, and is not intended to, identify ethnicity at the individual level. While this tool can produce reliable distributions of ethnicity for data sets with hundreds and thousands of names, for each individual name there remains a margin of error that does not allow the classifier to be used for individual profiling.

Data availability statement

The research data supporting this publication and the Python code are openly available from GitHub at: https://github.com/abessudnov/ruEthnicNamesPublic.

References

1. Lazer, D., & Radford, J. (2017). Data ex machina: introduction to big data. Annual Review of Sociology, 43, 19–39.
2. Buyalskaya, A., Gallo, M., & Camerer, C. F. (2021). The Golden Age of Social Science. Proceedings of the National Academy of Sciences, 118(5), e2002923118.
3. An J, Weber I (2016). #Greysanatomy vs. #Yankees: Demographics and Hashtag Use on Twitter. In: Proceedings of the tenth international AAAI conference on web and social media, vol. 10, p. 523-6.
4. Blumenstock, J., Cadamuro, G., & On, R. (2015). Predicting poverty and wealth from mobile phone metadata. Science, 350(6264), 1073–6.
5. Kosinski, M., Stillwell, D., & Graepel, T. (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 110(15), 5802–5.
6. Gebru, T., Krause, J., Wang, Y., Chen, D., Deng, J., Aiden, E. L., et al. (2017). Using deep learning and google street view to estimate the demographic makeup of neighborhoods across the United States. Proceedings of the National Academy of Sciences, 114(50), 13108–13.
7. Khunti, K., Routen, A., Banerjee, A., & Pareek, M. (2021). The need for improved collection and coding of ethnicity in health research. Journal of Public Health, 43(2), e270-2.
8. Flesken, A., & Hartl, J. (2020). Ethnicity, inequality, and perceived electoral fairness. Social Science Research, 85, 102363.
9. Bertrand, M., & Duflo, E. (2017). Field experiments on discrimination. In A. Banerjee & E. Duflo (Eds.), Handbook of economic field experiments (Vol. 1, pp. 309–93). Elsevier.
10. Imai, K., & Khanna, K. (2016). Improving ecological inference by predicting individual ethnicity from voter registration records. Political Analysis, 24(2), 263–72.
11. Wood-Doughty Z, Andrews N, Marvin R, Dredze M (2018). Predicting Twitter User Demographics from Names Alone. In: Proceedings of the second workshop on computational modeling of people’s opinions, personality, and emotions in social media, p. 105-11.
12. Clark, G. (2014). The son also rises. Princeton: Princeton University Press.
13. Mateos, P. (2014). Classifying ethnicity through people’s names. Names, ethnicity and populations (pp. 117–144). Berlin: Springer.
14. Coldman, A. J., Braun, T., & Gallagher, R. P. (1988). The classification of ethnic status using name information. Journal of Epidemiology & Community Health, 42(4), 390–5.
15. Mateos, P. (2007). A review of name-based ethnicity classification methods and their potential in population studies. Population, Space and Place, 13(4), 243–63.
16. Hofstra, B., Corten, R., Van Tubergen, F., & Ellison, N. B. (2017). Sources of segregation in social networks: a novel approach using Facebook. American Sociological Review, 82(3), 625–56.
17. Hofstra, B., & de Schipper, N. C. (2018). Predicting ethnicity with first names in online social media networks. Big Data & Society, 5(1), 1–14.
18. Chang J, Rosenn I, Backstrom L, Marlow C (2010). ePluribus: Ethnicity on Social Networks. In: Proceedings of the international AAAI conference on web and social media, vol. 4, p. 18-25.
19. Ambekar A, Ward C, Mohammed J, Male S, Skiena S (2009). Name-ethnicity classification from open sources. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, p. 49-58.
20. Lee J, Kim H, Ko M, Choi D, Choi J, Kang J (2017). Name nationality classification with recurrent neural networks. Proceedings of the twenty-sixth international joint conference on artificial intelligence, p. 2081-7.
21. Chaturvedi R, Chaturvedi S (2020). It’s All in the Name: A Character Based Approach To Infer Religion. arXiv:2010.14479. Available from: https://arxiv.org/abs/2010.14479.
22. Ye J, Han S, Hu Y, Coskun B, Liu M, Qin H, et al (2017). Nationality classification using name embeddings. In: Proceedings of the 2017 ACM on conference on information and knowledge management, p. 1897-1906.
23. Cesare N, Grant C, Nguyen Q, Lee H, Nsoesie EO. How Well Can Machine Learning Predict Demographics of Social Media Users? arXiv:1702.01807v2. 2017.
24. Unbegaun, B. O. (1972). Russian surnames. Oxford: Clarendon Press.
25. Karaulova, M., Gök, A., & Shapira, P. (2019). Identifying author heritage using surname data: an application for Russian surnames. Journal of the Association for Information Science and Technology, 70(5), 488–98.
26. Bessudnov A (2022). Ethnic and regional inequalities in the Russian military fatalities in the 2022 war in Ukraine. SocArXiv. Available from: https://osf.io/preprints/socarxiv/s43yf.
27. Sivak, E., & Smirnov, I. (2019). Parents mention sons more often than daughters on social media. Proceedings of the National Academy of Sciences, 116(6), 2039–41.
28. Smirnov, I. (2020). Estimating educational outcomes from students’ short texts on social media. EPJ Data Science, 9(1), 27.
29. Manning, C. D., Raghavan, P., & Schutze, H. (2009). An introduction to information retrieval. Cambridge: Cambridge University Press.
30. Joulin A, Grave E, Bojanowski P, Mikolov T. Bag of Tricks for Efficient Text Classification. arXiv:1607.01759. 2016. Available from: https://arxiv.org/abs/1607.01759.
31. Zhang T (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: Proceedings of the twenty-first international conference on machine learning. ICML. New York; 2004. p. 116.
32. Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. The Annals of Statistics, 29(5), 1189–1232.
33. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12(85), 2825–30.
34. Chen T, Guestrin C (2016). XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. KDD ’16. New York. p. 785-94.
35. Gorenburg, D. (1999). Identity change in Bashkortostan: Tatars into Bashkirs and back. Ethnic and Racial Studies, 22(3), 554–80.
36. Bessudnov, A., & Monden, C. (2021). Ethnic intermarriage in Russia: the tale of four cities. Post-Soviet Affairs, 37(4), 383–403.
37. Jenkins, R. (2008). Rethinking Ethnicity (2nd ed.). London: Sage.
38. Lamont, M., & Molnár, V. (2002). The study of boundaries in the social sciences. Annual Review of Sociology, 28(1), 167–95.
39. Wimmer, A. (2013). Ethnic boundary making: institutions, power, networks. New York: Oxford University Press.
40. Bessudnov, A., & Shcherbak, A. (2020). Ethnic discrimination in multi-ethnic societies: evidence from Russia. European Sociological Review, 36(1), 104–20.
41. Ghai B, Liao QV, Zhang Y, Mueller K. Measuring social biases of crowd workers using counterfactual queries. arXiv:2004.02028. 2020.
42. La Barbera, D., Roitero, K., Demartini, G., Mizzaro, S., & Spina, D. (2020). Crowdsourcing truthfulness: the impact of judgment scale and assessor bias. Advances in Information Retrieval, 12036, 207–14.
43. Mittelstadt, B. D., Allo, P., Taddeo, M., Wachter, S., & Floridi, L. (2016). The ethics of algorithms: mapping the debate. Big Data & Society, 3(2), 1–21.


Acknowledgements

We are grateful to Ivan Bibilov (European University at St Petersburg; Yandex) and Alexey Shpilman (HSE University) for their advice on developing this paper.

Author information

Authors and Affiliations

Social and Political Sciences, Philosophy and Anthropology, University of Exeter, Exeter, UK
Alexey Bessudnov
Computer Science, Constructor University Bremen, Bremen, Germany
Denis Tarasov
Applied Mathematics and Control Processes, St Petersburg State University, St Petersburg, Russia
Viacheslav Panasovets
Sociology, European University at St Petersburg, St Petersburg, Russia
Veronica Kostenko
Computational Social Sciences and Humanities, RWTH Aachen University, Aachen, Germany
Ivan Smirnov
Digital Transformation, ITMO University, St Petersburg, Russia
Vladimir Uspenskiy


Corresponding author

Correspondence to Alexey Bessudnov.

Ethics declarations

Conflict of interest

The authors do not have competing financial or non-financial interests related to this work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Hyperparameters’ choice

In this section, we provide the hyperparameter search grid for each algorithm and highlight the best parameters in bold. For fastText vectorisation, we used the default checkpoint from https://fasttext.cc/docs/en/crawl-vectors.html.

A.1 Complement Naive Bayes (CNB)

For CNB, we search only over the parameters of the vectorisation methods. N-grams ranges: {(1, 1), (1, 3), (1, 5), (1, 7)}, maximal vocabulary elements frequencies: {5%, 10%, 20%}, minimal vocabulary elements frequencies: {1, 10, 1%, 5%}, lowercasing: {True, False}.
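A grid like this one can be written as a dictionary and expanded into individual configurations. The sketch below is an assumption about how the grid could be encoded, not the authors' code: the key names mirror the `ngram_range`, `max_df`, `min_df`, and `lowercase` arguments of scikit-learn's text vectorisers, where an integer frequency is an absolute count and a float is a proportion of documents. The helper `expand_grid` is hypothetical.

```python
from itertools import product
from typing import Any, Dict, List

# CNB vectorisation grid from this section; percentages are written as floats,
# absolute counts as integers.
cnb_grid: Dict[str, List[Any]] = {
    "ngram_range": [(1, 1), (1, 3), (1, 5), (1, 7)],
    "max_df": [0.05, 0.10, 0.20],
    "min_df": [1, 10, 0.01, 0.05],
    "lowercase": [True, False],
}


def expand_grid(grid: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Cartesian product of all parameter values, one dict per configuration."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


configs = expand_grid(cnb_grid)
print(len(configs))  # 4 * 3 * 4 * 2 = 96 candidate configurations
print(configs[0])
```

The same encoding works for the other vectorisation grids in this appendix; only the listed values change.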

A.2 Logistic regression

A.2.1 Vectorisation parameters

N-grams ranges: {(1, 3), (1, 5), (1, 7)}, maximal vocabulary elements frequencies: {60%, 65%, 70%, 75%, 80%}, minimal vocabulary elements frequencies: {1, 5, 1%, 10%}, lowercasing: {True, False}.

A.2.2 Model parameters

Regularization method: elasticnet, regularization term: {0.000005, 0.00001, 0.000025, 0.00005, 0.0001}.

A.3 Support vector machine

A.3.1 Vectorisation parameters

N-grams ranges: {(1, 3), (1, 5), (1, 7)}, maximal vocabulary elements frequencies: {35%, 45%, 55%, 65%, 75%}, minimal vocabulary elements frequencies: {1, 5, 10, 50, 1%, 10%}, lowercasing: {True, False}.

A.3.2 Model parameters

Regularization method: elasticnet, regularization term: {0.000005, 0.00001, 0.000025, 0.00005, 0.0001}.

A.4 Quadratically smoothed support vector machine

A.4.1 Vectorisation parameters

N-grams ranges: {(1, 3), (1, 5), (1, 7)}, maximal vocabulary elements frequencies: {35%, 45%, 55%, 65%, 75%}, minimal vocabulary elements frequencies: {1, 5, 10, 50, 1%, 10%}, lowercasing: {True, False}.

A.4.2 Model parameters

Regularization method: elasticnet, regularization term: {0.000005, 0.00001, 0.000025, 0.00005, 0.0001}.
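The three linear models in A.2 to A.4 share the same elastic-net regularisation and differ in their loss function. The sketch below writes these pieces out for a single observation with label \(y \in \{-1, +1\}\). It rests on assumptions that the appendix does not state: that the "regularization term" is a multiplier `alpha` on the penalty, that the L1/L2 mix is governed by a hypothetical `l1_ratio` parameter, and that the quadratically smoothed SVM corresponds to the modified Huber loss (a quadratically smoothed hinge with smoothing parameter 2).

```python
import math
from typing import Sequence


def log_loss(y: int, score: float) -> float:
    """Logistic regression loss for a label y in {-1, +1}."""
    return math.log1p(math.exp(-y * score))


def hinge(y: int, score: float) -> float:
    """Linear SVM (hinge) loss."""
    return max(0.0, 1.0 - y * score)


def modified_huber(y: int, score: float) -> float:
    """Quadratically smoothed hinge: squared near the margin, linear far below it."""
    margin = y * score
    if margin >= -1.0:
        return max(0.0, 1.0 - margin) ** 2
    return -4.0 * margin


def elastic_net_penalty(weights: Sequence[float], alpha: float, l1_ratio: float) -> float:
    """alpha * (l1_ratio * ||w||_1 + (1 - l1_ratio) / 2 * ||w||_2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return alpha * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2)


# A point exactly on the decision boundary (score 0) under each loss.
print(log_loss(1, 0.0), hinge(1, 0.0), modified_huber(1, 0.0))

# Penalty for a toy weight vector, using one of the grid values for alpha.
print(elastic_net_penalty([1.0, -2.0], alpha=0.0001, l1_ratio=0.5))
```

Smaller values of `alpha` from the grid weaken the penalty, so the fitted model keeps more n-gram weights away from zero.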

A.5 Gradient tree boosting

A.5.1 Model parameters

Number of estimators: {100, 500, 1000}.

A.6 Random Forest

A.6.1 Vectorisation parameters

N-grams ranges: {(1, 3), (1, 5), (1, 7)}, maximal vocabulary elements frequencies: {1%, 5%, 10%, 20%, 30%, 40%, 50%, 55%, 65%, 70%, 75%}, minimal vocabulary elements frequencies: {1, 5, 10, 10%}, lowercasing: {True, False}.

A.6.2 Model parameters

Number of estimators: {25, 50, 100, 200, 300}, maximal depth: {50, 100, 500, 1000, \(\infty\)}.

A.7 Multilayer perceptron

A.7.1 Model parameters

Hidden size: {100, 300}, number of hidden layers: {1, 2}, dropout probability: {0.1, 0.4, 0.5}, activation function: {ReLU, ELU}, learning rate: {0.0002, 0.0005}, number of epochs: {100}.

A.8 Long short-term memory

A.8.1 Model parameters

Hidden size: {100, 150, 200}, number of hidden layers: {2, 3}, dropout probability: {0, 0.1, 0.5}, bidirectional: {False, True}, activation function: {ReLU}, learning rate: {0.0005, 0.0007}, number of epochs: {30, 60}.

A.9 Convolutional neural network

A.9.1 Model parameters

Number of channels: {50, 100, 200}, number of hidden layers: {3}, dropout probability: {0.5}, activation function: {ELU}, learning rate: {0.0001}, number of epochs: {20}.

B Additional models’ results

(See Table 8)

Table 8 Models performance on the test set


Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Bessudnov, A., Tarasov, D., Panasovets, V. et al. Predicting perceived ethnicity with data on personal names in Russia. J Comput Soc Sc 6, 589–608 (2023). https://doi.org/10.1007/s42001-023-00205-y


Received: 20 December 2022
Accepted: 13 March 2023
Published: 04 April 2023
Issue Date: October 2023
DOI: https://doi.org/10.1007/s42001-023-00205-y
