Introduction

Digital social networks (DSN) have become popular as a means of spreading information and connecting like-minded people [1]. The capacity to spread opinions is a general phenomenon with relevant implications in the context of social influence [2].

The public accessibility of DSN, along with the ability to share and exchange opinions, thoughts, and feelings, among others, allows people to connect not only with friends and family, but also with any celebrity on the network [3]. This ability has been evident in the growth of DSN communication [4]. However, the success of such communication attempts depends on the level of trust that members have in each other [1, 5], considering that opinions have helped to influence the feelings and emotions of the public [6].

For this reason, interest in researching microblog communities on services such as Twitter is growing exponentially [7], due to the massive amount of written information that each user generates. This information includes personal data such as name, photograph, and location; quantitative data such as the number of followers and the number of people the user follows; and the timeline, which is the chronology of the user’s messages, both public and private. Likewise, a user can follow another user by agreeing to receive the messages that the other user posts [8].

On the other hand, language variation is permanent and evident in the new ways of writing in DSN. Such variation is not necessarily random, but is highly related to social factors [9]. In fact, linguistic ethnography holds that:

“to a considerable degree, language and the social world are mutually shaping, and that close analysis of situated language use can provide both fundamental and distinctive insight into the mechanisms and dynamics of social and cultural production in everyday activity” [10]

When people share their opinions in DSN (such as Twitter), they might also be revealing demographic, social, and/or psychosocial information about themselves. For example, Schwartz and colleagues [8] indicate that their research has been driven toward an integral exploration of the language that differentiates people, giving a new perspective on psychosocial processes. Their results show, for instance, how to identify the words most commonly used by people with self-esteem issues, or how the possessive words that men and women use to refer to their romantic partners may differ.

Rangel and colleagues [11] point out that due to the huge amount of information available on social networking platforms, it is possible to obtain information about different attributes such as gender, age, personality, native language, or political orientation from the analysis of an author’s profile.

Considering that celebrities frequently use DSN to communicate and connect with their followers [12], and understanding that a user’s behavioral profile is reflected in their messages through their writing patterns [13], it is essential to detect whether a user is a celebrity in order to determine the influence they may have on other users of social networks [14] and to estimate the impact of a comment made by this user. This provides information to measure the influence of celebrities on their followers by means of the corpus of their texts.

Our motivation for writing this paper is to explore the predictive and explanatory capacities of linguistic features with respect to the demographic and influence variables of celebrities using DSN. The main objective of the research is to understand how these linguistic features, which are found in the texts that celebrities publish on DSN, generate new information that allows celebrities to be classified according to their demographic and influence variables. Moreover, these new variables derived from the texts can indicate uses of language that reveal specific sociolects and idiolects, which are useful for analyzing a celebrity’s profile and for increasing the accuracy of the classification models.

Recognizing this opportunity, this article formally addresses the linguistic analysis of celebrities’ texts in DSN and proposes a model with 18 features that quantify the outcome of five types of analysis: lexical, syntactic, symbolic, participation, and complementary information. The lexical analysis covers the average use of words and lexical diversity. The syntactic analysis studies the personal pronouns most commonly used by celebrities. The symbolic analysis studies how symbolic contents such as emojis and hashtags are used. The participation analysis quantifies the features of participation in the network (mentions and retweets). Finally, the complementary information analysis covers the references that the celebrity makes to other media (URLs).

The difference between this paper and the one presented [12] at the Conference and Labs of the Evaluation Forum (CLEF) is that this paper proposes a new feature selection model and explains how this model helps to increase accuracy. The work presented at the Plagiarism analysis, Authorship identification, and Near-duplicate detection (PAN) lab at CLEF only reported several classification models and the accuracy obtained with different principles, without addressing feature selection.

This study is organized in eight sections. The second section summarizes the background, covering authors who have worked in areas related to the analysis of DSN, the identification of profiles in texts, the detection of demographic and social variables in texts, and the influence of celebrities. The third section presents the methodology, with the steps necessary to determine the features of the digital identity that describe celebrities’ characteristics. The fourth section describes the data preparation, which comprises the corpus description, the exclusion of redundant measures, and the application of the methodology. The fifth section shows the results of the explanatory models, with the significance of each of the features found in the digital identity. The sixth section shows the results of validating the celebrity classification model and its predictive ability with the selected features, in order to quantify the improvement in accuracy. Finally, the seventh and eighth sections present the conclusions and future work.

Background

Relevant background on celebrity detection has three elements: first, a basic background in social networks; second, a review of work related to author profiling, including the Machine Learning classification models tested and the demographic and social variables that have been found valuable for the author profiling task; finally, a review of work that specifically addresses the study and prediction of celebrities’ influence.

Social networks

According to Aggarwal [15] a social network is defined as:

“a network of interactions or relationships, where the nodes consist of participants and the edges consist of relationships or interactions between these participants.”

Social network analysis (SNA), therefore, seeks to discover different types of patterns in the relationships among the nodes of the network [16], making it possible to describe these communities. Thanks to the Internet, there is an interactive dialogue platform of digital relationships that emulates physical interactions and makes it possible to keep the participants of the network in contact [17, 18]. This creates not only new forms of sharing information, but also new forms of communication, one possible effect of which is a change in personal opinions or decisions due to the influence of new contacts [19]. Therefore, SNA is currently of great interest for determining how language can be used to describe communities and their collective subjectivity through sociolects.

With the vast exchange of information over the Internet, users of social networks leave a digital trail; for example, every day, Facebook members post 3.2 billion likes and comments, and 340 million tweets are sent out on Twitter [19]. This trail contains associated information in the form of texts, images, URLs, or audio, and thus generates a social structure programmed by each user in their own network, based on their connections with other users [20]. Therefore, the availability of large amounts of data on the web has given new motivation for the use of statistical and computational tools in the area of Social Network Analysis (SNA), because of their growing popularity [15], combined with Natural Language Processing (NLP). Fan et al. [21] apply that combination to reduce the harmful effects caused by the spread of rumors in a social network through the independent cascade (IC) model and the linear threshold (LT) model.

Consequently, the work oriented to computational linguistics has focused on the analysis of the corpus found in conversations shared in social networks to analyze opinions, feelings, emotions and in general, the expression of private status on certain individuals [2, 22, 23].

Author profiling

Author profiling has been approached from different perspectives that converge on the question of how to describe or profile an author. One of these perspectives studies the problem from a computational point of view, giving full weight to classification. Another takes a sociolinguistic point of view, in which language is understood as a process of social construction that develops over time and exhibits dialects, sociolects, or chronolects associated with the authors.

Some examples of the approaches mentioned are the methodologies that obtained first, third, and fifth place in the celebrity profiling task of the PAN at CLEF event. First, Radivchev and colleagues [24] vectorized the users’ tweets with Term Frequency-Inverse Document Frequency (TF-IDF), taking into account the top 10,000 features from word bigrams, and used a combination of logistic regression and Support Vector Machines (SVM). In contrast, Martinc and colleagues [25] selected a logistic regression classifier with word unigram and character tetragram features, where the classifier and its hyper-parameters were chosen with a grid search. Finally, Petrik and Chuda [26] extracted the text features with TF-IDF using bigrams and trigrams to capture word relationships, and then combined them with a random forest of 200 decision trees as the classification model.

Profile classification

Theoretical and empirical studies have demonstrated a strong relationship between social factors and linguistic attitudes, since language is perceived as a social activity that reflects and influences social reality [11, 27].

In fact, for Rangel and colleagues [11], the analysis of shared contents aims to:

“predict different attributes of the authors, such as gender, age, personality, native language, or political orientation. Therefore, social networks are playing a vital role in identifying what people think because they can reinforce political ideas or even influence the way of thinking.”

The relationship between personality traits and the usage of language has been widely studied in psycholinguistics, which analyzes the use of language and how it varies depending on personal characteristics. Early research on author profiling focused mainly on formal texts and blogs. However, at present, researchers mainly focus on digital social networks, where language is more spontaneous and less formal [11].

Then, there are connections which are not captured with traditional analysis because a common feature of social media communication is that this is delivered through short messages. These messages do not often use standard language variations [28], and the data itself drives an integral exploration of the language that differentiates people, finding connections that cannot be captured with traditional analysis such as word categorization of vocabulary [8].

Consequently, social activities represent a great challenge for the selection and identification of the user profile, which is caused mainly by the diversity of texts and complex social structures [11, 29, 30].

Demographic and social variables

Jadhav and Mhetre [20] and Simaki and colleagues [27] indicate a connection between social networks and personal behavior on the web, identifying the relationship and influence between social factors and a person’s language. In fact, Milroy and Milroy [31] point out that one of the most important contributions of Labov’s (1972) “quantitative paradigm” to the study of language has been the systematic examination of the relationship between language variation and “speaker” variables such as age, ethnicity, gender, social network, and social class.

Due to this growing interest, the extraction of demographic information from text has been studied, and important advances have been made by authors such as Przybyla and Teisseyre [32], who identified demographic characteristics such as education, party association, and year of birth. In turn, Simaki and colleagues [27] used texts to determine an author’s gender, moving from a qualitative to a quantitative analysis, while [33] explored the differences between male and female writing in a large subset of the British National Corpus.

Nguyen and colleagues [22] and Romaine [34] state that linguistic variations in a sociolect occur over long, non-immediate periods of time. This means that the corpus of each generation has its own linguistic characteristics, and people of different genders and ages tend to have different linguistic features. This is strongly related to the social influence and identity reflected in their use of language [27].

As for the characterization of “occupation”, authors such as Sloan and colleagues [35] used a search engine designed to identify the socioeconomic group of a tweet. The 2010 Standard Occupational Classification (SOC) system is used by U.S. federal statistical agencies to classify workers and jobs into occupational categories.

Celebrities’ influence

Celebrities are some of the most common users of DSN, which they use to promote their careers and gain followers [36]. Social networks have therefore been a revolutionary scenario for these individuals, because these platforms allow them to share any information with their fans [12]. This demonstrates that a minority influences an exceptional number of people, becoming an important factor in the creation of public opinion [37].

In order to know a celebrity’s influence on the network, it is necessary to specify who influences whom. However, evidence of such influence in real-world networks is limited, and only a few studies have attempted to gather it empirically [38].

To determine this influence, it is necessary to know that there are celebrities who use only one social network. For example, the word “YouTuber” refers to a person whose primary social network is YouTube, or to a person who uses only this social network in pursuit of a high reputation.

The development of micro-celebrities is more evident on Instagram, Facebook, Twitter, and other social platforms [39], which leads to different categories of celebrities on different social networks. Therefore, the database of this study presents the celebrity profiling by hierarchical levels.

It is well known that identifying profiles is not easy. Although there are exciting approaches, computational linguistics requires an integrated approach that provides elements for understanding patterns of linguistic variation [31] related to ethnographic and social factors. This study presents a model, and its validation, to detect celebrities from the variables identified and explained throughout its development.

When trying to identify a user as a celebrity on Twitter, authors such as Wang and Kraut [40] argued that a specific topic and its continued usage in the user’s tweets affect the number of followers through two mechanisms: homophily and network externalities. However, Hutto and colleagues [41] developed a theory based on forecasting models that, although it included the topic of tweets, did not find, unlike Wang and Kraut, that continued usage of a topic predicts more followers. Therefore, it is important to put forward new proposals for the prediction of celebrities, not only based on the number of followers, but also because more work is required to understand the importance of published content in engaging an audience [42].

Meanwhile, Li and colleagues [18] indicated that to detect opinion leaders in social networks, academic studies generally consider the semantic analysis of user’s comments or the emotional analysis of contents published by users based on positive or negative comments; also, by analyzing feelings to define the relevance of the connection between users and followers. However, the detection of opinion leaders with semantic analysis or analysis of emotions is not always suitable for complex social networks, so Wang [43] proposed a method of extracting community opinion leaders based on a hierarchical structure.

Deep learning applied to feature selection in social networks

Neural networks are based on the idea of representing the process of pattern recognition and classification that the human brain performs [44]. Several research fields have applied this basic idea and evolved the models to increase their classification performance. For example, Casas [45] mentions the phenomenon of neural networks replacing statistical and optimization models for understanding travel behavior and traffic management in geography.

Now, popularity is a critical issue in celebrities’ behavior since an increase in the degree of their fame is often the result of the implementation of marketing strategies in the networks. Thus, research with neural networks becomes more relevant when concluding the critical factors for popularity or the key active times of popularity for making posts on social networks. Hsu et al. [46] developed research to improve the performance of classifiers in social network popularity prediction tasks; they implemented a multimodal approach by integrating the images included in the post and their social information into a Convolutional Neural Network (CNN). Huang et al. [47] performed a deep neural network model (Long short-term memory (LSTM) and CNN) with embedding in the responsible factors to improve the predictions of long-time popularity in social networks.

In turn, the social network’s characteristics involve vital indicators for the promotion of popularity. Retweets or hashtags contain relevant information about the interests of the communities participating in a separate communication thread generating topics of interest for celebrities, which can influence and achieve higher popularity in the network. Zhang et al. [48] proposed a neural network to predict retweeting behavior by weighting a layer of different interests from a clustering process to identify the core tweets of the cluster. Li et al. [49] modeled a CNN and LSTM-RNNs by improving existing classifiers to make hashtag recommendations by tweet representations that included word embedding generation, sentence composition, tweet composition, and hashtag classification.

PAN at CLEF is an international initiative that has been promoting research within its network of excellence in the fields of Digital Text Forensics and Stylometry for ten years. As a result, the best research groups around the world in the fields of Natural Language Processing (NLP) and Information Retrieval (IR) meet annually to participate in the Author Profiling, Author Verification, Authorship Attribution, and Style Change Detection tasks. In the 2019 edition, which comprised the Bots and Gender Profiling, Celebrity Profiling, and Style Change Detection tasks, we participated in the Celebrity Profiling task and obtained second place (Footnote 1).

Table 1 presents proposals for profiling celebrities, including the characteristics used by authors who have worked on identifying profiles through DSN.

Table 1 Demographic and social variables for profile detection

Celebrity feature selection methodology

To achieve the objective of the study and to determine the attributes that describe the characteristics of celebrities by analyzing their texts in social networks, this methodology includes four phases (see Fig. 1):

  1. Modeling digital identity using lexical, syntactic, symbolic, participation, and complementary information features extracted from a user’s publications.

  2. Calculating the central tendency and dispersion measurements for each feature.

  3. Reducing the dimensionality considering the calculated measures.

  4. Constructing a model of significance analysis of each attribute of the digital identity over the person’s characteristics in the real world.

Fig. 1 Methodology

Modeling digital identity

Among the different contents created by people in DSN are the texts called “posts”, which form a corpus from which it is possible to extract information that may help to determine a person’s characteristics, such as occupation, age, gender, and degree of fame. Text mining methods make it possible to extract relevant information and new knowledge by analyzing vast amounts of unstructured data [52]. This phase proposes a model that relates these characteristics to groups of features that can be found in the posts available on digital social networks.

The proposed groups of linguistic features are classified as lexical, syntactic, symbolic, participation, and complementary information type (see Fig. 2). These characteristics represent the contents of the “posts” of the digital user identity. Alternatively, these features might be associated with the digital user identity, particularly with their demography and influence features, focusing on Gender, Birth year, Occupation and Fame. The study assumes that by analyzing these attributes, characteristics of the real user identity can be obtained.

Fig. 2 Real identity model using features extracted from a user’s publications

Although some features are standard in different digital social networks within the “post”, there may be others that can be specific to a particular social network.

For example, some features are shared by social networks such as Facebook, Twitter, and Instagram (see Fig. 3); a study of the Least Cost Influence (LCI) problem could use them to find a set of users with minimum cardinality that influences a certain fraction of users in multiple social networks [19]. However, these features may vary in their use and application. On Facebook, for instance, a “like” can be expressed with different emojis, which gives a higher degree of granularity when identifying the emotion that a shared “post” generates.

Fig. 3 Features of influence in the different social networks

Therefore for this phase, common features in digital social networks are analyzed (see Fig. 4).

Lexical features seek to estimate the size of words, the style, and the diversity of the text. Syntactic features correspond to the use of personal pronouns. Symbolic features refer to the inclusion of semiotics through symbols such as emojis or hashtags, which represent implicit content. Participatory features make it possible to link different participants or social dynamics, which may represent a confirmation or question message or the reinforcement of a common idea. Complementary information features make it possible to extend or support comments.

Fig. 4 Feature characterization

Calculating central tendency and dispersion measurements

To quantify each qualitative feature described in the previous phase, this study must calculate measurements allowing a statistical analysis of the usage and distribution of the variables, such as central tendency and the level of dispersion measurements.

The central tendency measurements include the mean, the mode, and the median, each of which summarizes in a single value the center around which a data set is located. In addition, it is necessary to determine the dispersion by means of statistical parameters such as the standard deviation, which indicates how far the data lie from the central measurements, and to compute the skewness and kurtosis, which identify the bias and sharpness of the data distribution, respectively.
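
As a minimal sketch of these measurements, the following Python snippet computes the mean, standard deviation, skewness, and kurtosis of one feature across the tweets of a profile. The function name `describe` and the sample values are illustrative assumptions. The study does not state which estimators it used, so the population (biased) moment estimators below may differ slightly from those of the statistical package actually employed.

```python
import math

def describe(values):
    """Return mean, standard deviation, skewness, and excess kurtosis.

    Assumption: population (biased) moment estimators are used, since the
    study does not specify which estimator it applied.
    """
    n = len(values)
    if n == 0:
        raise ValueError("values must not be empty")
    mean = sum(values) / n
    # Central moments of order 2, 3, and 4.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        # A constant feature has no spread, so shape measures are undefined.
        return {"mean": mean, "std": 0.0,
                "skewness": float("nan"), "kurtosis": float("nan")}
    return {
        "mean": mean,
        "std": math.sqrt(m2),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,  # excess kurtosis (0 for a normal law)
    }

# Hypothetical number of hashtags in each tweet of one profile.
hashtags_per_tweet = [0, 0, 1, 0, 2, 0, 0, 7]
stats = describe(hashtags_per_tweet)
```

A single tweet with many hashtags pulls the skewness above zero, which is the kind of asymmetry these shape measures are meant to expose.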

Reducing the dimensionality

Since all the measurements would otherwise enter the significance analysis model, it is necessary to remove those that may be redundant. To do this, a correlation analysis is proposed, using the “corrplot” package. Its purpose is to indicate whether there is a relationship between two variables and how strong that relationship is.
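
The idea behind this correlation screening can be sketched in a few lines of Python. This is not the R workflow used in the study: the Pearson coefficient, the 0.9 threshold, the helper names, and the toy values are all assumptions made for illustration.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def redundant_pairs(features, threshold=0.9):
    """List feature pairs whose absolute correlation reaches the threshold.

    The threshold of 0.9 is a hypothetical cut-off, not a value from the study.
    """
    pairs = []
    for (name_a, col_a), (name_b, col_b) in combinations(features.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs

# Toy values for three features over five profiles (not taken from the corpus).
features = {
    "V1": [4.1, 4.5, 3.9, 5.0, 4.7],
    "V2": [8.2, 9.1, 7.7, 10.1, 9.3],  # nearly a multiple of V1
    "V3": [0.60, 0.42, 0.55, 0.58, 0.40],
}
flagged = redundant_pairs(features)
```

Each flagged pair carries essentially the same information, so only one member of the pair needs to be kept for the later modeling step.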

Then, the explanatory variables are normalized. Using these variables, a principal component analysis (PCA) is performed (using the “ade4” package).

PCA and LDA are among the earliest data representation learning algorithms. PCA is an unsupervised method [53] that projects high-dimensional data into a low-dimensional space. It retains the data’s variance to the maximum extent in order to obtain a low-dimensional representation of the data from a global perspective [54] and to preserve the global information of the data in the learned feature space [53]. In addition, PCA has been widely used for dimensionality reduction [55], and authors continue to hold that “although it is one of the earliest multivariate techniques, it continues to be the subject of much research, ranging from new model-based approaches to algorithmic ideas from neural networks. It is incredibly versatile, with applications in many disciplines.” [56].
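
To make the variance-retention idea concrete, the sketch below standardizes a set of feature columns and extracts only the first principal component by power iteration on their correlation matrix. It is a simplified illustration in plain Python, not the “ade4” routine used in the study; the function names, the starting vector, and the iteration count are assumptions.

```python
import math

def standardize(column):
    """Scale a column to zero mean and unit population variance.

    Assumes the column is not constant (otherwise the variance is zero).
    """
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / std for v in column]

def first_component(columns, iterations=500):
    """Return (loadings, explained variance share, scores) of the first PC."""
    z = [standardize(c) for c in columns]
    p, n = len(z), len(z[0])
    # Correlation matrix of the standardized columns.
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / n for j in range(p)]
            for i in range(p)]
    # Asymmetric start, so it is unlikely to be orthogonal to the solution.
    vec = [1.0 / (i + 1) for i in range(p)]
    for _ in range(iterations):
        nxt = [sum(corr[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    # Rayleigh quotient gives the eigenvalue; the trace of corr equals p.
    eigenvalue = sum(vec[i] * sum(corr[i][j] * vec[j] for j in range(p))
                     for i in range(p))
    scores = [sum(vec[j] * z[j][k] for j in range(p)) for k in range(n)]
    return vec, eigenvalue / p, scores

# Two perfectly correlated toy features collapse onto a single component.
loadings, share, scores = first_component([[1, 2, 3, 4, 5], [2, 4, 6, 8, 10]])
```

In this toy case the first component explains all of the variance, which is the limiting situation of two redundant features.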

Nowadays, representation learning has evolved the dimensionality reduction task. Its techniques include neural network models and non-negative constraint matrix factorization models [57] that, with the use of incremental learning, automatically adjust feature selection according to what is learned whenever new examples or data sets emerge.

The recent unsupervised approaches based on clustering techniques produce a selected subset of the features that can reduce the computation cost and improve the clustering performance [58], which also benefits performance on classification problems. In addition, these models can consider the information discrepancy between the original feature space and the lower-dimensional subspace, which efficiently reduces the loss of information, and their structure-preserving term is based on a low-rank sparse graph, which acquires adequate discriminative information and avoids parameter selection problems [54]. Hence, further analysis can be carried out efficiently and with better performance, alleviating to some extent the scalability issue in terms of computational complexity. However, those unsupervised approaches are not standard models, because how to further reduce the computational complexity while keeping models powerful is still an issue worth studying [59]. Another disadvantage of these models is probably that sparse learning computes each feature’s score independently and neglects the possible correlation between different features, thus failing to produce an optimal feature subset [54].

Finally, the multiple correspondence analysis (MCA) is used to reveal the relationship between the different physical characteristics of the analyzed profiles (using “FactoMineR” package).

Constructing a model of significance analysis

To evaluate whether each variable contributes significantly to the user’s characteristics, a multinomial logit model was used, as described by Peña [60]. For the purpose of our analysis, Y denotes the characteristics of the real user and X denotes the measurements obtained from the dimensionality reduction phase.
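
For reference, a multinomial logit model for a characteristic Y with K categories is commonly written in its baseline-category form. The choice of category K as the reference and the coefficient notation \(\beta _{ki}\) are assumptions made here for illustration, since the text does not state its parameterization:

$$\begin{aligned} P(Y=k\mid X)=\frac{\exp \left( \beta _{k0}+\sum _{i}\beta _{ki}x_i\right) }{1+\sum _{j=1}^{K-1}\exp \left( \beta _{j0}+\sum _{i}\beta _{ji}x_i\right) },\quad k=1,\ldots ,K-1, \end{aligned}$$

where \(x_i\) are the measurements in X, and the reference category K receives the remaining probability, so that the K probabilities sum to one.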

According to Peña [60], to evaluate whether each variable contributes significantly to the model, the p-values of the Wald statistic (Eq. 1) are used:

$$\begin{aligned} W_{\beta _i}=\frac{\hat{\beta }_i}{\sqrt{Var(\hat{\beta }_i)}}, \end{aligned}$$
(1)

where the test hypotheses are:

$$\begin{aligned} H_0:\beta _i=0, \end{aligned}$$
$$\begin{aligned} H_a:\beta _i\not =0. \end{aligned}$$

The Wald test rejects the null hypothesis if the p-value is less than 5%; in that case, the coefficient is considered to be different from 0, and the corresponding variable is inferred to be statistically significant.
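
A minimal sketch of this decision rule follows, assuming the common z-form of the Wald statistic (the estimate divided by its standard error, that is, the square root of its variance) and a standard normal reference distribution. The function name `wald_test` and the example values are hypothetical; in practice the estimate and its variance come from the fitted multinomial logit model.

```python
import math

def wald_test(beta_hat, variance, alpha=0.05):
    """Two-sided Wald z-test of H0: beta = 0.

    Returns the statistic, its p-value under N(0, 1), and whether H0 is
    rejected at significance level alpha.
    """
    if variance <= 0:
        raise ValueError("variance must be positive")
    w = beta_hat / math.sqrt(variance)
    # P(|Z| > |w|) for Z ~ N(0, 1), via the complementary error function.
    p_value = math.erfc(abs(w) / math.sqrt(2.0))
    return w, p_value, p_value < alpha

# Hypothetical coefficient estimate and variance for one feature.
w, p_value, significant = wald_test(beta_hat=0.84, variance=0.09)
```

With these illustrative values the statistic is 2.8 and the null hypothesis is rejected at the 5% level, so the feature would be treated as significant.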

For example, Sluban et al. [61] distinguish some studies that search for the outstanding features on Twitter to measure influence, which is the principal measure for celebrities. Avnit (2009) [62] describes the million follower fallacy, whereby an account with more retweets attains a higher level of influence than one that merely pursues a large number of followers. In this regard, Suh et al. [63] state that URLs, hashtags, the number of followers and followees, and the age of the account affect the number of retweets.

Data preparation

Description of the corpus

The study used a corpus from the PAN@CLEF2019 data sets, consisting of English-language posts by celebrities on the social network Twitter. The first data set (Footnote 2) was the input for this paper, providing 68,583,577 tweets and 31,203 profiles, divided in a 60/40 proportion into training and test data as a mechanism to avoid overfitting. Later, TIRA implemented a blind evaluation, which refers to “an evaluation process where the authors of a to be-evaluated piece of software cannot access the test data and hence cannot (unwittingly) optimize their algorithm against it” [64], with a second and a third data set (Footnote 3), which contain approximately 6,000 and 60,000 profiles, respectively. The software itself is packaged within a virtual machine, and new performance results (Footnote 4) were achieved. Table 2 presents the performance results for the data sets mentioned above.

Table 2 Performance results of data sets

Four types of analysis were performed, which consisted of identifying occupation, gender, fame, and birth year, as shown in Tables 3 and 4. Each of these categorical variables was extracted from the information provided by the different profiles in the social network database. Specifically, for the variable “Fame”, a celebrity is a person who has a verified Twitter account and is notable according to Wikipedia’s notability criteria, a definition provided by PAN @ CLEF 2019.

Table 3 Characteristics database
Table 4 Continued from previous page

Application of central tendency and dispersion measurements to selected features

Central tendency and dispersion measurements were applied. However, for this study, the mean, skewness, and kurtosis were analyzed as the measures that contribute the most to the analysis (see Fig. 5). The standard deviation was not selected because it was too high compared to the average, due to the large amount of atypical data. Similarly, the mode and median were not considered, as they did not provide relevant information.

Fig. 5 Analytical measures

Dimensionality reduction

After selecting the measures to be used for the analysis, a model is built with 18 variables corresponding to the analysis of lexical features (V1 to V8); syntactic features (V14 to V18); symbolic features (V9 and V10); participation features (V11 and V13); and complementary information features (V12). For the variables associated with the lexical features shown in Table 5, for example, the average number of characters per word is calculated for each \(t_i\) (where \(t_i\) is each of the user’s tweets) in the profile.
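
The lexical calculation can be sketched as follows. The whitespace tokenization, the type-token ratio used for lexical diversity, and the helper names are assumptions for illustration, since the exact definitions are given only in Table 5.

```python
def average_word_length(tweet):
    """Average number of characters per whitespace-separated word in a tweet."""
    words = tweet.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0

def lexical_diversity(tweets):
    """Distinct words divided by total words over all tweets of a profile.

    Assumption: lexical diversity is taken as a simple type-token ratio.
    """
    words = [w.lower() for t in tweets for w in t.split()]
    return len(set(words)) / len(words) if words else 0.0

def profile_average(tweets, per_tweet_feature):
    """Average a per-tweet feature over all tweets t_i of a profile."""
    if not tweets:
        return 0.0
    return sum(per_tweet_feature(t) for t in tweets) / len(tweets)

# Hypothetical profile with two tweets.
tweets = ["good morning world", "good night"]
avg_chars = profile_average(tweets, average_word_length)
diversity = lexical_diversity(tweets)
```

Passing the per-tweet function as an argument keeps the averaging step identical for every lexical variable, so only the per-tweet measurement changes from one feature to the next.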

Table 5 Feature group of lexical analysis

For the variable V14, associated with the syntactic analysis of the corpus, the following calculation is made: for each \(t_i\), the number of times the first person singular is used in the post is counted, and the result is averaged over the total number of tweets in the profile. All the other variables described in Table 6 are calculated in the same way as V14.
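
A hedged sketch of the V14 calculation is shown below. The list of English first-person singular forms and the tokenization pattern are assumptions, because the study does not list the exact word forms it counted.

```python
import re

# Assumed English first-person singular forms; the study's list is not given.
FIRST_PERSON_SINGULAR = {"i", "me", "my", "mine", "myself"}
TOKEN = re.compile(r"[a-z]+(?:'[a-z]+)?")

def first_person_count(tweet):
    """Count first-person singular forms in a single tweet."""
    tokens = TOKEN.findall(tweet.lower())
    # Drop a contraction suffix so that "i'm" and "i've" count as "i".
    return sum(1 for t in tokens if t.split("'")[0] in FIRST_PERSON_SINGULAR)

def v14(tweets):
    """Average first-person singular count over all tweets of a profile."""
    if not tweets:
        return 0.0
    return sum(first_person_count(t) for t in tweets) / len(tweets)

# Hypothetical profile with three tweets.
tweets = ["I'm so proud of my team", "We won!", "This one is mine"]
value = v14(tweets)
```

The remaining syntactic variables could reuse the same two functions with a different set of pronoun forms.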

Table 6 Feature group related to syntactic analysis

To analyze the variables related to the symbolic, participation, and complementary information features, the average number of times that each emoji, hashtag, URL, mention, or retweet is used is calculated over each \(t_i\). Each of these language elements in the social network was taken as a variable, as shown in Tables 7, 8, and 9. In particular, separating the URL features into an individual feature group means recognizing that a tweet is more informative when accompanied by URLs [65], considering the length limit of a tweet. Therefore, the information acquired through a URL goes beyond that of a hashtag, which is a clickable link that facilitates searching for tweets with the same hashtag [63].

Table 7 Group feature related to symbolic analysis
Table 8 Group feature related to participation analysis
Table 9 Group feature related to the analysis of complementary information
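The per-tweet counting behind the symbolic, participation, and complementary information features can be sketched with regular expressions. This is only an illustration under stated assumptions: the patterns are simplified, the emoji ranges are a rough approximation rather than a complete Unicode list, a retweet is assumed to start with "RT @", and all names are hypothetical.

```python
import re

URL = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")
# Rough emoji ranges (an approximation, not a complete Unicode emoji list)
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def tweet_counts(tweet):
    """Count emojis, hashtags, mentions, retweets, and URLs in one tweet."""
    # Drop URLs first so that a '#fragment' inside a link is not read as a hashtag
    text = URL.sub(" ", tweet)
    return {
        "emoji": len(EMOJI.findall(text)),
        "hashtag": len(HASHTAG.findall(text)),
        "mention": len(MENTION.findall(text)),
        "retweet": 1 if tweet.startswith("RT @") else 0,
        "url": len(URL.findall(tweet)),
    }


def profile_averages(tweets):
    """Average each count over all tweets t_i of a profile."""
    totals = {}
    for tweet in tweets:
        for name, count in tweet_counts(tweet).items():
            totals[name] = totals.get(name, 0) + count
    return {name: total / len(tweets) for name, total in totals.items()}


tweets = [
    "RT @nasa: Launch day! \U0001F680 #space https://example.org/live",
    "Thanks @fans \u2764 #love #music",
]
averages = profile_averages(tweets)
```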

There are highly correlated variables (see Fig. 6), such as variable V2 “Kurtosis character” with V5 “Kurtosis avg character”. Likewise, variable V2 correlates positively with variable V7 “Skew avg character”. Similarly, variable V1 “Avg character” is positively related to variable V6 “Kurtosis label word”. In contrast, there is a negative relationship between variables V7 and V3 “Lexical diversity”.

Fig. 6

Corrplot of the explanatory variables, representing the degree of correlation over the entire correlation matrix of the analyzed features, shown by the intensity of the colors. Each cell contains the correlation between a pair of features on a color scale with extreme values of 1 and − 1. R version 3.6.1 was used for this purpose. A high correlation between a pair of features thus appears as a solid color tone close to the values of 1 or − 1 on the color scale shown on the right side. More information can be found in the Availability of data and materials section.

For the construction of the model, only variables V1, V3, and V4 are taken from the lexical group, because they reflect the same information as variables V2, V5, V6, V7, and V8. The latter are therefore not included in the model, since including them would generate collinearity problems and they are not explanatory from a sociolinguistic perspective.
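The screening for redundant variables can be illustrated with a small Pearson correlation check. This is a sketch only: the study used R 3.6.1 and a visual corrplot, the threshold of 0.9 is a hypothetical choice that the paper does not state, and the toy values below are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def redundant_pairs(features, threshold=0.9):
    """List feature pairs whose absolute correlation reaches the threshold."""
    names = sorted(features)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(features[first], features[second])
            if abs(r) >= threshold:
                pairs.append((first, second, round(r, 3)))
    return pairs


# Invented values for three variables over five profiles
toy_features = {
    "V2": [1.0, 2.0, 3.0, 4.0, 5.0],
    "V5": [2.0, 4.0, 6.0, 8.0, 10.5],
    "V3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
flagged = redundant_pairs(toy_features)
```

In this toy data only V2 and V5 are flagged, so one of the two would be dropped before fitting the model.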

There is also a strong correlation between the variables V16 “person 3 singular”, V17 “person 1 plural”, and V18 “person 3 plural” (see Fig. 6), which correspond to the syntactic analysis of the corpus and measure the use of different pronouns. Although these variables are highly related, they are retained because they represent sociolinguistic and idiolect variables, which become explanatory variables of interest in the study as they help in the task of characterizing the celebrities.

After this, the predictor variables (those corresponding to Tables 5, 6, 7, 8, and 9) were normalized. With these variables, a principal component analysis (PCA) was performed (see Fig. 7).

Fig. 7

PCA correlation circle, also known as a variable correlation plot. This figure was made in R version 3.6.1. Arrows grouped together indicate positively correlated variables, whereas arrows in opposite quadrants indicate negatively correlated variables. More information can be found in the Availability of data and materials section.

There is no relevant relationship between the variables (see Fig. 7). The only relationship shown is that the variable V15 “person 2 singular” is opposed to the variables V11 “label mention” and V3 “Lexical diversity”, which means that the use of the second person singular relates negatively to the mentions used in the tweets and to lexical diversity.

Although the birth year variable is discrete, it was grouped by decades. This grouping method changed the Birth year variable so that it is treated categorically, like the Fame, Occupation, and Gender variables.
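The grouping by decades amounts to an integer division. A minimal sketch follows; the function name and the sample years are hypothetical.

```python
from collections import Counter


def birth_decade(year):
    """Map a birth year to the first year of its decade, e.g. 1958 -> 1950."""
    return (year // 10) * 10


# Hypothetical birth years of five profiles
years = [1946, 1958, 1983, 1999, 2001]

# Store the decade as a string label so it is treated as a category
decades = [str(birth_decade(year)) for year in years]
category_sizes = Counter(decades)
```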

For the characteristics of the people, a multiple correspondence analysis (MCA) was used (see Fig. 8), which made it possible to reveal the relationships between these celebrities’ profiles.

Fig. 8

Multiple correspondence analysis (MCA) made in R version 3.6.1. The closeness between the characteristics indicates the degree of relationship. As a result, a relationship between the characteristics of “gender” and “fame” is evident, while “occupation” and “birth year” do not have a marked relationship. More information can be found in the Availability of data and materials section.

Results of celebrity feature selection

The significance models, along with the p-values described for each Wald test, are shown in Tables 10, 14, 17, 18, 22, and 23 for each of the variables previously selected as a result of the multivariate analysis.

Fame

Table 10 shows the degree of “Fame” and its relationship with the different groups of lexical, syntactic, participation, and complementary information features.

Table 10 Model of the person’s characteristic Fame

Table 10 presents the coefficients of the model posed in Eq. 2. Therefore, the expression (Footnote 5) of the model for the first category (star) of fame is:

$$\begin{aligned} logit[p(Star=1)]=\; & {} 0.99+0.01V1-2.49V3+0.4V4+0.48V9+0.53V10\nonumber \\&+0.82V11+0.56V12-1.41V13+1.07V14+1.1V15+3.75V16\nonumber \\&+3.23V17+0.22V18. \end{aligned}$$
(2)

Tables 11 and 12 summarize the results using as the reference group the category “Rising” (Footnote 6), which can be seen in Table 10. Since the estimators in Table 10 are relative to the reference group, for a unit change in a feature, the logit of the outcome relative to the Rising group is expected to change by the respective estimator, given that the other features in the model are held constant. For example, if a celebrity were to increase their lexical diversity by one point, the multinomial log-odds for a Star celebrity relative to a Rising celebrity would be expected to decrease by 2.49 units while holding all other features in the model constant.
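The interpretation of an estimator can be checked numerically with the coefficients of Eq. 2. The sketch below evaluates the log-odds for a hypothetical profile, raises V3 (lexical diversity) by one point, and converts the change into an odds ratio; the profile values are invented and the helper name `log_odds` is an assumption.

```python
import math

# Coefficients of Eq. 2 (Star relative to the reference category)
STAR_COEFFICIENTS = {
    "intercept": 0.99, "V1": 0.01, "V3": -2.49, "V4": 0.4, "V9": 0.48,
    "V10": 0.53, "V11": 0.82, "V12": 0.56, "V13": -1.41, "V14": 1.07,
    "V15": 1.1, "V16": 3.75, "V17": 3.23, "V18": 0.22,
}


def log_odds(coefficients, profile):
    """Evaluate the linear predictor of the logit model for one profile."""
    return coefficients["intercept"] + sum(
        coefficients[name] * value for name, value in profile.items()
    )


# Hypothetical profile: every feature at 0 except lexical diversity
profile = {name: 0.0 for name in STAR_COEFFICIENTS if name != "intercept"}
profile["V3"] = 0.5

base = log_odds(STAR_COEFFICIENTS, profile)
shifted = log_odds(STAR_COEFFICIENTS, {**profile, "V3": profile["V3"] + 1.0})

delta = shifted - base        # change in log-odds: the V3 coefficient
odds_ratio = math.exp(delta)  # multiplicative change in the odds
```

With all other features held constant, the log-odds change by the V3 coefficient (-2.49), which corresponds to the odds being multiplied by roughly 0.083.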

Table 11 Results for the person’s characteristic Fame
Table 12 Results for the person’s characteristic Fame

Celebrities use all the syntactic, complementary information, participation, and symbolic features across the different fame categories. We can suppose that a feature group that is helpful in both categories (a common helpful group) is the more helpful one, as Table 13 shows.

Table 13 Helpful features group for Fame

Gender

Table 14 Model of the characteristic of the person Gender

Table 14 presents the coefficients of the model posed in Eq. 3. Therefore, the expression (Footnote 7) of the model for the first category (Male) of gender is:

$$\begin{aligned} logit[p(Male=1)]=\; & {} 3.41+0.02V1+26.42V3-0.29V4-0.77V9-0.24V10\nonumber \\&-0.09V11-1.02V12-0.46V13-3.03V14-2.45V15+2.25V16\nonumber \\&-1.32V17+0.13V18. \end{aligned}$$
(3)

Table 15 shows the inference of the results using as reference the category “Female” (Footnote 8), which can be seen in Table 14. Thus, if a celebrity were to increase their lexical diversity by one point, the multinomial log-odds for a Male celebrity relative to a Female celebrity would be expected to increase by 26.42 units while holding all other features in the model constant (Footnote 9).

Table 15 Results for the characteristic of Gender

In summary, across all gender categories, celebrities stand out for their lexical diversity; they post in the first person singular and employ social network features such as hashtags and retweets. However, there is no common helpful group, which indicates the absence of a more useful feature group, as Table 16 shows.

Table 16 Helpful features group for Gender

Occupation

Tables 17 and 18 describe the characteristic occupation. It is evident that the highest level of significance of the variables is in the political category, while it is not as strong in the religious and managerial categories.

Table 17 Model of the person’s characteristic Occupation
Table 18 Model of the person’s characteristic Occupation

Table 17 presents the coefficients of the model posed in Eq. 4. Therefore, the expression (Footnote 10) of the model for the first category (Manager) of occupation is:

$$\begin{aligned} logit[p(Manager=1)]= \;& {} -4.7+0.02V1+1.85V3+0.88V4+0.22V9+0.63V10\nonumber \\&+0.12V11-0.91V12-0.16V13-0.9V14+0.88V15-3.89V16\nonumber \\&+4.43V17-0.12V18 \end{aligned}$$
(4)

The behavior of each category of the characteristic Occupation, using as reference the category “Creator” (Footnote 11), is described in Tables 19 and 20. Hence, if a celebrity were to increase their use of first-person singular pronouns by one point, the multinomial log-odds for a Manager celebrity relative to a Creator celebrity would be expected to decrease by 0.9 units while holding all other features in the model constant (Footnote 12).

Table 19 Results for the person’s characteristic Occupation
Table 20 Results for the person’s characteristic Occupation

A common feature across the various categories of celebrity occupations is the use of the third person singular. Within the group represented by Table 17, the number of words in the post and the use of hashtags are common features; on the other hand, the use of mentions, the first person plural, and the number of characters in the tweet are common features in Table 18. However, there is no common helpful group, which indicates the absence of a more useful feature group, as Table 21 shows.

Table 21 Helpful features group for Occupation

Birth year

Tables 22 and 23 show the behavior of the model according to the variables’ significance level of the decade to which the celebrity belongs.

Table 22 Model of the characteristic of the person Birth year
Table 23 Model of the characteristic of the person Birth year

Table 22 presents the coefficients of the model posed in Eq. 5. Therefore, the expression (Footnote 13) of the model for the first category (1950) of Birth year is:

$$\begin{aligned} logit[p(1950=1)]= & {} 1.05-0.01V1-2.85V3-0.09V4-0.32V9-0.02V10\nonumber \\&+0.27V11+0.2V12+0.01V13-0.06V14-0.07V15+0.02V16\nonumber \\&+0.59V17-0.04V18 \end{aligned}$$
(5)

The behavior of the characteristic Birth year, using as reference the category “1940” (Footnote 14), is described by decade in Tables 24 and 25. Consequently, if a celebrity were to increase the mentions in their posts by one point, the multinomial log-odds for celebrities born in the 1950s relative to celebrities born in the 1940s would be expected to increase by 0.27 units while holding all other features in the model constant (Footnote 15).

Table 24 Results for the person’s characteristic Birth year
Table 25 Results for the person’s characteristic Birth year

Celebrities of different ages do not present a common feature among the groups analyzed in this paper, i.e., there is no common helpful group, which indicates the absence of a more useful feature group, as Table 26 shows.

However, in Table 22, the use of mentions can be seen as a common feature for the decades analyzed; on the other hand, the common feature in Table 23 is the use of the first person plural. In addition, complementary information seems to be a more helpful feature group from the 1980s to the 2000s.

Table 26 Helpful features group for Birth year

Validation of the celebrity classification model with selected features

Classifier models with the selected features were created using the PAN CLEF 2019 celebrity analysis data set. The data were divided into a training subset with 60% of the samples and a test subset with 40% of the samples; with these subsets, we trained and tested each of the models.
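The 60/40 partition can be sketched as a seeded shuffle followed by a cut. This is a simplified illustration: whether the original split was stratified by class is not stated, so no stratification is assumed here, and the function name and seed are hypothetical.

```python
import random


def split_60_40(samples, seed=0):
    """Shuffle the samples reproducibly and cut them into 60% / 40%."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = (len(shuffled) * 6) // 10  # integer arithmetic avoids rounding issues
    return shuffled[:cut], shuffled[cut:]


# Hypothetical profile identifiers
profile_ids = list(range(10))
train_ids, test_ids = split_60_40(profile_ids)
```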

Different text classification models were programmed: classical classifiers, built with the scikit-learn library [66], namely multinomial Naive Bayes (NB), Gaussian Naive Bayes (GNB), Complement Naive Bayes (NBC), Logistic Regression (LR), and Random Forest (RF), and Deep Neural Networks (DNN). The model with the best performance on each variable (gender, birth year, occupation, or fame) was selected and replicated for the famous actor data set. Table 27 describes the configured parameters of the best-performing classifiers. Each classifier model was trained with a group of terms associated with each of the celebrity profiles.

Table 27 Description of the parameters used in the best classifier

The text was cleaned and homogenized in UTF-8 encoding. Additional characteristics were included and selected to finally obtain a group of features that comprised the new attributes in the classifier. In addition, an over-sampling technique [67] was applied to balance the data set between classes with small samples and the other classes.
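To illustrate what balancing by over-sampling does, the sketch below uses plain random over-sampling with replacement, which duplicates minority-class samples until every class matches the largest one. This is a stand-in chosen for simplicity and may differ from the technique of [67]; all names and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples until all classes have equal size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = max(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw the missing samples at random from the class itself
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy, imbalanced training labels
labels = ["male"] * 5 + ["female"] * 3 + ["nonbinary"] * 1
samples = [f"profile_{i}" for i in range(len(labels))]
balanced_samples, balanced_labels = random_oversample(samples, labels)
class_sizes = Counter(balanced_labels)
```

Over-sampling is usually applied to the training subset only, so that duplicated profiles do not leak into the test subset.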

This allowed the models to improve the average precision by up to 0.14 over a classifier model that only uses the group of words (Footnote 16) from each celebrity. In addition, a data set of famous actors was analyzed to see whether the results of the models are similar. To this end, we created a data set using the A list (the elite of acting circles) and the B list (quite famous, but not as famous as the A list) published on the IMDb website (Footnote 17). The celebrity listings were manually extracted from IMDb, but IMDb experts had already defined the ratings, creating three different fame level lists. The famous actor data set involves a label annotation of the data in which the quality of the manual labels is controlled by taking the same number of actors (100 actors) for each fame category. Therefore, the A list published on IMDb becomes the counterpart of the superstar category in the celebrities’ data set. Similarly, the B list is the counterpart of the star category in the celebrities’ data set. IMDb is the world’s most popular and authoritative source for movie, TV, and celebrity content. Tables 28 to 31 show the model results for each of the variables with the test data sets.

Table 28 Fame classification using multinomial logistic regression classifier

The fame variable with a multinomial logistic regression classifier obtained a final average F1-score of 0.65 for the PAN at CLEF data set and an F1-score of 0.44 for famous actors, considering that the lists come from different IMDb reviewers, as shown in Table 28.

Table 29 Gender classification using multinomial logistic regression classifier

The gender variable with a multinomial logistic regression classifier obtained a final average F1-score of 0.88 for the PAN at CLEF data set and an F1-score of 0.89 for famous actors, which stands out against the previous result, as shown in Table 29.

Table 30 Birth year classification using multinomial logistic regression classifier

The birth year variable with a multinomial logistic regression classifier obtained a final average F1-score of 0.37 for PAN at CLEF and 0.25 F1-score for famous actors, which is slightly lower, as shown in Table 30.

Table 31 Occupation classification using multinomial naive Bayes classifier

The occupation variable obtained a final average F1-score of 0.57 with a multinomial naive Bayes classifier, as shown in Table 31.

Table 32 Performance F1-score results of data sets

As shown in Table 32, the classifier maintains similar results in all classes except occupation; in terms of fame, the paper data set has a better result than the famous actors’ data set; in the gender and birth year variables, the results are similar in both data sets.
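For reference, an average F1-score over classes can be computed as in the sketch below. It assumes an unweighted (macro) average of per-class F1-scores; the paper does not specify the averaging scheme, and the labels and predictions are invented.

```python
def f1_per_class(y_true, y_pred):
    """Compute the F1-score of each class from true and predicted labels."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denominator = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denominator if denominator else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores (assumed averaging scheme)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Invented fame labels and predictions for six profiles
y_true = ["star", "star", "rising", "superstar", "rising", "star"]
y_pred = ["star", "rising", "rising", "superstar", "star", "star"]
per_class = f1_per_class(y_true, y_pred)
average = macro_f1(y_true, y_pred)
```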

Deep learning

A deep learning model was trained to compare the performance of this model with the model proposed in the paper and the baseline ones. Specifically, two models were generated under two approaches: the first one adds and analyzes the lexical features; the second one adds the new features proposed in this paper.

Table 33 describes the parameters of two deep learning models with neural networks, built using the Keras library with the TensorFlow framework in Python, to compare the performance of these models with the one proposed in the paper and the baseline models. The models use regular densely connected (Dense) layers in the Sequential model API, the Rectified Linear Unit (ReLU) activation function for the hidden layers, a Softmax function on the output layer, a sparse categorical cross-entropy loss function, and the Adam optimizer to update the weights in order to reduce the error on the target output.

Table 33 Deep Neural Network configuration
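To make the named components concrete, the following sketch runs one forward pass through a tiny dense network in plain Python: a ReLU hidden layer, a softmax output, and the sparse categorical cross-entropy loss for an integer class label. It does not use Keras and does not reproduce the configuration of Table 33; the weights and layer sizes are invented, and the Adam weight update is not shown.

```python
import math


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(probabilities, label):
    """Negative log-probability assigned to the integer class label."""
    return -math.log(probabilities[label])


# Invented input (two features) and weights: 2 hidden units, 3 output classes
x = [1.0, -2.0]
w_hidden, b_hidden = [[0.5, -0.5], [1.0, 1.0]], [0.0, 0.0]
w_output, b_output = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]], [0.0, 0.0, 0.0]

hidden = relu(dense(x, w_hidden, b_hidden))
probabilities = softmax(dense(hidden, w_output, b_output))
loss = sparse_categorical_crossentropy(probabilities, label=0)
```

During training, an optimizer such as Adam adjusts the weights to lower this loss over the training subset.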

Baselines

The baseline models were generated for the PAN at CLEF2019 contest. The details on how they obtained the lexical features are not explicitly published. Instead, they describe the following:

“baseline-uniform randomly draws from a uniform distribution of all classes and reflects the data-agnostic lower bound, baseline-rand randomly selects a class according to the prior likelihood of appearance in the test dataset, and baseline-mv always predicts the majority class of the test dataset.” [68]
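The three baselines described in the quotation can be sketched as follows. This is an illustrative reading of the quoted description rather than the organizers' code; the function names, the seed, and the toy labels are hypothetical.

```python
import random
from collections import Counter


def baseline_uniform(classes, n, rng):
    """Draw each prediction uniformly from the set of classes."""
    return [rng.choice(classes) for _ in range(n)]


def baseline_rand(test_labels, n, rng):
    """Draw each prediction according to the class priors of the test data."""
    counts = Counter(test_labels)
    classes = sorted(counts)
    weights = [counts[cls] for cls in classes]
    return rng.choices(classes, weights=weights, k=n)


def baseline_mv(test_labels, n):
    """Always predict the majority class of the test data."""
    majority = Counter(test_labels).most_common(1)[0][0]
    return [majority] * n


rng = random.Random(0)
test_labels = ["male"] * 7 + ["female"] * 3  # toy gender labels
classes = sorted(set(test_labels))

uniform_predictions = baseline_uniform(classes, len(test_labels), rng)
prior_predictions = baseline_rand(test_labels, len(test_labels), rng)
majority_predictions = baseline_mv(test_labels, len(test_labels))
```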

The models implemented in this research differ in the treatment of lexical features. The paper data set and famous actors’ models use n-grams to configure the vectorial representation of words, with a minimum frequency of 9 for gender, 6 for birth year, 3 for occupation, and none for fame. As a differential treatment, the texts were pre-processed to replace hashtags, mentions, URLs, and emojis with special tokens, and a lemmatization method was applied on the 1,000 most frequent words penalized with the TF-IDF process.
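The token replacement and minimum-frequency filtering can be sketched as follows. The token names (`<url>`, `<mention>`, `<hashtag>`, `<emoji>`), the simplified patterns, and the reading of "minimum frequency" as a total count over unigrams are all assumptions made for illustration; lemmatization and TF-IDF weighting are not shown.

```python
import re
from collections import Counter

# (pattern, replacement token) pairs; token names are hypothetical
REPLACEMENTS = [
    (re.compile(r"https?://\S+"), " <url> "),
    (re.compile(r"@\w+"), " <mention> "),
    (re.compile(r"#\w+"), " <hashtag> "),
    # Rough emoji ranges (an approximation, not a complete Unicode list)
    (re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]"), " <emoji> "),
]


def replace_special_tokens(tweet):
    """Replace URLs, mentions, hashtags, and emojis with special tokens."""
    for pattern, token in REPLACEMENTS:
        tweet = pattern.sub(token, tweet)
    return " ".join(tweet.split())  # collapse the extra whitespace


def vocabulary(tweets, min_freq):
    """Keep the unigrams that occur at least min_freq times overall."""
    counts = Counter(
        token
        for tweet in tweets
        for token in replace_special_tokens(tweet).lower().split()
    )
    return {token for token, count in counts.items() if count >= min_freq}


tweets = [
    "Thanks @fans \u2764 #love https://example.org/x",
    "thanks @team #win",
]
cleaned = replace_special_tokens(tweets[0])
kept = vocabulary(tweets, min_freq=2)
```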

Table 34 reports the results obtained using different classification techniques taking as input lexical features and the proposed features. Results presented in columns 1, 2, and 3 are baseline models using lexical features; the fourth column contains the best performance on each class using the paper dataset, classical classifiers and the proposed features. Finally, the fifth and sixth columns report the results using a deep neural network with lexical features only and the features proposed in this paper (i.e., syntactic, symbolic, participation and complementary information features), respectively.

Table 34 F1-score obtained with the models using the paper data set

The neural network with lexical features has three out of four best performances; gender and fame are the classes with the best F1-scores, 0.88 and 0.75, respectively. In contrast, birth year is the class with the lowest F1-score across all models, with 0.04. The best score for this class was obtained using the classical machine learning model (0.37).

The neural network with the proposed features shows gender as the best-ranked class with an F1-score of 0.80, followed by fame with 0.74. On the other hand, occupation has an F1-score of 0.39, reflecting a large gap with fame, i.e., the F1-score is much lower as the number of class values increases.

In general, the performance of the classical classifiers is considerably lower than that obtained with the neural network model, except on the gender class, where it is very similar. The neural network models that include the proposed features show a decrease in the F1-score, especially on the Occupation class. Finally, the classical classifiers that used the proposed features have one out of four of the best performances in the analyzed classes, making the classical models the best option to classify a class with a large number of possible values. Clearly, deep learning models offer new options to explore the problem of celebrity classification.

Conclusions

After a rigorous selection of features, a group of measurements was applied to the feature groups, determining the significance of each of them in order to identify a person’s characteristics such as fame, gender, occupation, and birth year. This paper presents a new approach that addresses the characterization of profiles using relations of lexical, syntactic, symbolic, participation, and complementary information variables, making it possible to identify significant features that help in the identification of celebrity profiles.

The use of these new features improves the initial classifications made with only the words for the characteristics of Gender, Birth year, Occupation, and Fame. These new features, derived from texts in DSN, achieve an increase in the average F1-score of up to 0.14 when the set of features is included in the classification task for each characteristic. We used the proposed features in several classifiers and found that they represent a greater contribution when using classical classifiers (e.g., logistic regression, multinomial naive Bayes). On the contrary, they decreased the performance of deep learning models.

As a result, the best-performing models are “Gender” and “Fame”, with residual deviances of 37703.85 and 41544.39 and AICs of 37763.8 and 41604.39, respectively. In contrast, the worst-performing models are “Age” and “Occupation”, with residual deviances of 92807.86 and 100703.9 and AICs of 92989.86 and 100913.9, respectively.

It is evident from the model that, regardless of fame, occupation, and gender, celebrities write recurrently in the third person singular. However, new generations (those born in the last five decades) make more use of the first person plural.

The occupation that uses all groups of lexical and syntactic features is politician. This is likely due to the nature of the occupation, in which being at the forefront of language and trends is vital to remaining in good standing.

The most recurrent gender characteristic is the use of the first person singular in both men and women, but this is not evident in nonbinary users.

It is also important to note that the most commonly used social network features across the decades are mentions, followed by emojis.

Discussion and future works

In the analysis of user profiles, and even more so in the analysis of celebrity profiles, multiple studies analyze a user’s comments to determine the attributes of demographic, sociological, or psychographic variables [11, 33, 69].

However, some models only use lexical characteristics and concentrate their efforts on pre-processing, while other studies look for other types of variables obtained from the text, as is the case with sociolinguistic studies. In this type of study, we can observe the analysis of variables in the use of some words that denote the social use of language as a “sociolect” or “idiolect” [70].

Therefore, it is essential to have a group of documents that can describe a user’s style, and not only one document per profile. Such groups of documents can be seen in studies such as that of Copland and colleagues [71], who analyzed the use of the first person singular or the first person plural in a group of students at a school. According to the results, it was inferred that this use denotes social status through the use of possessive pronouns (“my school”, “our school”), identifying qualities of ownership that indicate a higher socioeconomic level.

So, the challenge of DSN is broad, as they are designed for personal interactions, where people can share different types of information with different features, for example, text messages.

On Twitter, text messages are associated with the length in characters and also with the use of symbols, emojis, and expressions such as hashtags that can carry semiotic meaning. Texts are also used to make comments to other users or to themselves by creating mentions within the network, and finally to refer to external sources of information contained in the URLs that can guide or give context to the messages. These messages call for different measurements than lexical or syntactic characteristics alone.

By studying these and other additional characteristics, it is possible to improve the precision of the classification processes on demographic, sociological, psychographic, and behavioral variables of users in a social network. However, based on the specific analysis of celebrities, interesting information can be obtained due to the large number of followers they have, and this type of analysis is essential for celebrities due to the influential power they can have on their followers [72].

As future work, the collection and analysis of other language-related elements, such as sociolects and idiolects, is recommended. This collection will enable a more complete and accurate profile of social network users, as it will be possible to analyze the digital user’s text, in particular celebrity texts, in a more granular way. Also, the use of synonyms and antonyms, of more than one language, and of other typical elements of DSN may lead to higher classification performance.

On the other hand, it is also possible to use non-linguistic data available in social networks, such as the number of followers or the reciprocity of links in publications. This type of data from social network analysis, together with the social graph extracted from the networks, could help to predict demographic or influencing variables of digital users. It is also possible to explore new strategies in classification models and deep learning techniques with other types of architectures, such as LSTM or CNN with different configurations, to enhance the classification of celebrities on social networks.

Finally, identifying profiles is not yet an easy task. The proposal to build new models of celebrity profiles from their texts is an interesting approach, especially for obtaining new types of features that make it possible to increase the accuracy in this type of natural language processing task. The social phenomenon in which users use language to express their private states to other digital users has meeting points in the linguistic and social fields, and it makes it possible to study other phenomena such as homophily or to find new patterns of relationship.