1 Introduction

Over the past decades, new technologies and the development of online social platforms have made a large amount of text data available to researchers. Consequently, many studies have been conducted with the aim of exploiting the informative content of digital text. Currently, texts are used as data in a variety of applications yielding social and economic insights: authorship, sentiment, nowcasting, policy uncertainty, media slant, market definition and other topics, as witnessed by the review by Gentzkow et al. (2019). Further studies highlight the contribution of text data to different areas of human life, such as politics (Jentsch et al. 2020), public administration (Hollibaugh 2019), education (Ferreira-Mello et al. 2019) and several branches of the medical sciences (Luque et al. 2019).

With the focus on marketing and business, Reisenbichler and Reutterer (2018) have recently overviewed the wide range of theoretical and applied research based on text as data and have highlighted the major role played by topic modelling. The latter is a class of unsupervised learning methods developed in a probabilistic setting and capable of clustering text documents into a number of groups, precisely the topics. The most applied topic model is probably the latent Dirichlet allocation (LDA), referring to the Bayesian model developed by Blei et al. (2003), which, in essence, represents each document as a probability distribution over topics and, in its turn, each topic as a probability distribution over words. LDA is a model-based clustering method, related to finite mixture models. It is recognised to be a flexible and versatile tool to analyse text data, and as such, it has since been extended in multiple variants.

As a matter of fact, when individual texts are very short, say in the range of one to thirty words, as is the case for data prevalent on websites, such as titles, image captions and questions in Q&A webpages, LDA might generate topics which are not meaningful. For example, it is recognised that LDA does not perform well when applied to short text fragments, such as microblogging posts, tweets, headlines and product reviews. This is essentially due to sparsity, as in these cases LDA has too little word co-occurrence information. Several strategies have been proposed to alleviate the problem of data sparsity in short texts, either by combining short documents together, or by employing external resources, such as Wikipedia, to overcome the lack of information, or by using alternative models better suited for short texts; see the discussion in Cheng et al. (2014) and Jipeng et al. (2011).

3.1 Case study 1

We argue that our analysis can provide material for further insights on enhanced design versus fast fashion. Each record collects information on price, category and brand, which we have voluntarily excluded from the specification, and a description field. The analyses are carried out for the knitwear and dresses categories. A sample from the dataset is reported in Table 1.

Table 1 Sample from the dataset of case study 1

3.1.1 Pre-processing

Before addressing the issue of sparse modelling, we perform some preliminary steps to reduce the dimensionality and to map raw text into a numerical matrix, the document term matrix (DTM), whose ijth element indicates the count of the jth word, or token, in the ith document. Some text preparation operations are required in order to process the data and reduce meaningless dimensionality. First, we cancel out non-word elements (such as numbers, punctuation and proper names); then, words in a standard English stop-words list are automatically removed, and further contractions, such as don't or it's, and misspellings are excluded manually; finally, words are replaced with their roots through stemming. Eventually, stems displaying sparsity higher than 99% are deleted.
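As an illustration, this pipeline can be reproduced along the following lines; the paper does not name its pre-processing toolchain, so the choice of the tm and SnowballC packages here is an assumption, and texts is a hypothetical character vector with one raw description per document.

```r
library(tm)         # corpus handling and document term matrices
library(SnowballC)  # stemmer backing tm's stemDocument

corp <- VCorpus(VectorSource(texts))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeNumbers)                      # drop non-word elements
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeWords, stopwords("english"))  # standard stop-words list
corp <- tm_map(corp, stemDocument)                       # replace words with their roots

dtm <- DocumentTermMatrix(corp)       # ij-th cell: count of token j in document i
dtm <- removeSparseTerms(dtm, 0.99)   # delete stems with sparsity above 99%
X   <- as.matrix(dtm)
```

Contractions and misspellings that survive these automatic steps are removed manually, as described above.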

After the pre-processing step, the DTM for the knitwear dataset contains \(N=382\) rows and \(P=229\) words, that for the dresses dataset \(N=1110\) rows and \(P=402\) columns. Note that the DTM is high-dimensional in the column dimension even if not in the row dimension. In both the knitwear and the dresses datasets, each document is composed, in median, of only three words, and each word appears in less than 1% of documents. As a whole, the DTMs are highly sparse, with about 99% of empty cells. Some descriptive statistics are shown in Table 2. The response variable, the price, displays a high standard deviation as compared to the average level and shows an asymmetric distribution. In order to assess the robustness of the results, the analysis has been carried out both on the prices and on the log prices. As the results, in terms of average price variability explained, obtained using the log-transformed data do not significantly differ from the ones obtained based on the original prices, and as interpretation of the results is at the core of the hedonic evaluation, we choose to present the results of the analysis on the prices on their natural scale. It is true that the overall variability is smaller when the logarithmic transform of the data is taken. On the other hand, relying on the original prices allows us to express a priori judgements on those that may be relevant influential variables.

Table 2 Descriptive statistics for case study 1

3.2 Case study 2

The second case study is related to a section of the Tech Company Employee Reviews dataset, downloaded from www.kaggle.com. Data were scraped from www.glassdoor.com. Glassdoor is a website which allows current and former employees to anonymously review companies, to anonymously submit and view salaries, and to search and apply for jobs on the same platform. The analyses are carried out for the reviews about one anonymous worldwide tech company. Each record collects information on company, location, date of the review, job and position. Moreover, it collects evaluations on a 1 to 5 scale on the overall rating and other aspects concerning the job, along with two further open questions on pros and cons. Words deriving from the positive field, pros, are preceded by the prefix p, while words deriving from the negative field, cons, are preceded by the prefix c. Our aim is to explain the overall rating using only the text contained in the pros and cons open-question fields as data. A sample from the dataset is reported in Table 3.

Table 3 Sample from the dataset of case study 2

We shall treat ratings as a response variable in text regression. Though more specific methods can be applied to interval scales and ordinal variables (see, for instance, Hastie et al. 2009, ch. 14), for the sake of interpretation and comparison, and in order to effectively perform variable selection, we shall investigate the marginal contribution of attributes by estimating linear regression models. The motivation for a regression analysis on ratings is twofold. First, variable selection methods for categorical data are not as developed as methods for continuous variables and thus often do not allow for homogeneous comparisons. Second, ratings are most often analysed as quantitative variables, which makes the results of the analysis in the paper useful in several related applications.

After a text pre-processing similar to the one described in section 3.1.1, the DTM for the Employees dataset is an \(N=808\) by \(P=1135\) matrix. In this case, we also considered unordered pairs of words with sparsity lower than 99%. In fact, no pair resulted among the most frequently selected variables; see Table 4. Each document is composed, in median, of 15 words. Each word, on average, is present in 1.7% of documents, and the sparsity of the DTM reaches 98.3%.

Table 4 Descriptive statistics for case study 2

4 Design of the study

4.1 Bootstrap

Evaluating the extent to which relevant variables have been detected is a challenging task, as the true model is unknown except in simulations. In practical applications, only one dataset at a time is given and one does not know which variables are truly influential. To mimic the availability of several datasets, we resort to bootstrap replications and resample (five hundred times) from each single dataset. Indeed, the bootstrap may yield desirable perturbations similar to those of multiple data sets (Efron and Tibshirani 1998). The analyses have been carried out for each case study.

Bootstrap is performed using the classical approach of resampling with replacement. Each replication of the original dataset is divided into training and hold-out datasets. We expect that duplicated observations may somewhat inflate the performance evaluation of the specifications at hand, but in a manner which we expect to be uniform across methods. Indeed, we have verified that the ranking of specifications by performance does not change if simple cross-validation without repetition is used, which, in addition, reduces the size of the training and hold-out datasets. For each bootstrap dataset, we first select variables over the training dataset, by using the pool of models described in section 4.3. Then, a linear regression model with the selected variables as predictors is estimated over the hold-out dataset by the method of ordinary least squares.
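The scheme can be sketched as follows; select_vars is a hypothetical placeholder for any of the selection procedures of section 4.3, X is the DTM and y the numeric response (prices or ratings).

```r
set.seed(1)
M  <- 500                                   # bootstrap replications
N  <- nrow(X)
n1 <- floor(2 * N / 3)                      # training size (cf. section 4.4)
r2 <- numeric(M)

for (m in 1:M) {
  idx   <- sample(N, N, replace = TRUE)     # classical resampling with replacement
  train <- idx[1:n1]
  test  <- idx[(n1 + 1):N]

  sel <- select_vars(X[train, ], y[train])  # column indices of the selected words

  # post-selection OLS on the hold-out data
  fit   <- lm(y[test] ~ X[test, sel, drop = FALSE])
  r2[m] <- summary(fit)$r.squared           # predictive R^2 for this replication
}
mean(r2)                                    # averaged over replications
```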

4.2 Criteria

We assess the relative performance of alternative selection models on the basis of the following indicators: the predictive \(R^2\), the inclusion frequency and the model class reliance.

To weigh the models' complexity, i.e. the number of variables selected, against their explanatory power, we consider the predictive \(R^2\). We remark here that we compute the predictive \(R^2\) not because we aim to evaluate the usefulness of selected variables for out-of-sample forecasting, which is out of our scope, but rather to evaluate how relevant the selected words are in explaining the response variable when new, potentially similar datasets are considered. As a matter of fact, the datasets we analyse are renewed either every season (in case study 1, on e-commerce data) or at every further session of Human Resources evaluation (in case study 2, on employees' ratings data), and the goal of our study is to identify which methods allow us to retrieve the relevant drivers of either item prices or employees' satisfaction in any further dataset of this type.

The overall quality of each model is assessed through the inclusion frequency and the model class reliance, which are essentially indicators of variable importance, both computed at the variable level and averaged at the specification level. The inclusion frequency measures how often variables are selected over the bootstrap replications. Variables that present a high number of repetitions have been selected in several bootstrap samples, implying that the model is robust to perturbations of the data set.

The model class reliance (Fisher et al. 2018) measures the extent to which a well-performing model within a pre-specified class may rely on a variable of interest for its prediction accuracy. Within the class, model reliance is the core measure of variable importance, in that it tells how much an individual prediction model relies on the explanatory variables of interest for its accuracy. For each model m, the model reliance of each variable j, denoted as MR\(_{j,m}\), is computed as the ratio between the loss function associated with the specification evaluated on the DTM with permuted jth variable (numerator) and the same loss function evaluated on the original DTM (denominator). By permuting the elements of the jth variable, it is possible to assess the increase in the loss when the variable itself is rendered uninformative; see section 3 of Fisher et al. (2018) for further details. As loss function, we consider the residual standard error. At each bootstrap replication, the jth variable in the training set is permuted and the model reliance is computed. We then obtain the empirical bootstrap model reliance of each jth variable as the average over the bootstrap replications. Eventually, we retain the highest model class reliance (MCR), that is, the upper extreme of the interval which defines the MCR. Note that the MCR is a measure of variable importance in a given dataset.
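For a single variable, the computation reduces to a permuted-column loss ratio; a minimal sketch, using root mean squared residuals (the degrees-of-freedom constant of the residual standard error cancels in the ratio) and assuming a full-rank fit.

```r
# MR_j: loss with the j-th column permuted over loss on the original DTM
model_reliance <- function(X, y, j) {
  fit   <- lm(y ~ X)
  loss0 <- sqrt(mean(residuals(fit)^2))   # loss on the original data

  Xp      <- X
  Xp[, j] <- sample(Xp[, j])              # permute column j: render it uninformative
  pred    <- cbind(1, Xp) %*% coef(fit)   # same fitted coefficients, permuted data
  lossp   <- sqrt(mean((y - pred)^2))

  lossp / loss0                           # MR_j > 1: the model relies on variable j
}
```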

In summary, we evaluate the performance of alternative variable selection methods on the basis of: (1) their explanatory power out of the training sample, measured in terms of predictive \(R^2\); (2) the bootstrap inclusion frequency of selected variables; and (3) the ability to select important variables for the specific dataset.

4.3 Models

Several variants of lasso are considered, which we shall discuss in a few more details in section 4.4. Nevertheless, we prefer to introduce them all here to provide a full overview of the methods and models used in the analysis. First, to tune the lasso parameter, we resort to standard criteria used to optimise the predictive performance. We call lasso-min the model attained by minimising the cross-validation error, which is recognised to optimise predictive performance. Secondly, we optimise the tuning parameter by resorting to BIC variants: lasso-bic is attained by minimising the BIC, while lasso-ebic05 and lasso-ebic10 are attained by minimising the extended BIC, with moderate and high model complexity penalisation regulated by the tuning parameter \(\gamma = 0.5\) and \(\gamma = 1.0\), respectively. To evaluate the performance of randomisation, we choose the stability selection method in the variant proposed by Shah and Samworth (2013), imposing different thresholds for the selection probability. The results are very similar when the selection probability is changed, and here we present those attained with the threshold 0.7, named ssmb. As a screening method, we run the SIS algorithm and consider the performance produced by different numbers of selected predictors, from 1 to 25, labelled from sis-k1 to sis-k25. Finally, we evaluate the performance of the simple lasso obtained by imposing different numbers of selected predictors, from 1 to 25, labelled from lasso-k1 to lasso-k25.

The results obtained with lasso-based methods are compared with those attained through the standard LDA analysis, used here as a benchmark and, as such, applied as follows. Over the randomised training sets, we select solutions corresponding to an a priori fixed number of topics, from 1 to 25, labelled from lda-k1 to lda-k25, with hyperparameters set to default values, as discussed in section 4.5.

Lasso-based specifications with an average number of selected variables (over bootstrap replications) are compared to LDA specifications with the same number of components.

4.4 Methodological and computational aspects

We specify the linear regression model

$$\begin{aligned} Y = X\beta +\varepsilon \end{aligned}$$

where \(Y\in {\mathbb {R}}^N\) denotes the response variable, \(X\in {\mathbb {R}}^{N\times P}\) is the DTM matrix, \(\beta \in {\mathbb {R}}^P\) is the vector of regression coefficients and \(\varepsilon \in {\mathbb {R}}^N\) is a vector of independent and identically distributed error terms.

At each bootstrap replication, the DTM matrix is partitioned accordingly into \(X_1\in {\mathbb {R}}^{N_1\times P}\) and \(X_2\in {\mathbb {R}}^{N_2\times P}\), where \(N_1\) and \(N_2 = N-N_1\) denote the number of observations in the training set and in the test set, respectively. In our applications, we have fixed \(N_1 = 2N/3\). Note that we have M bootstrap replications of the pair \((Y_h,X_h)\), \(h=1,2\), i.e. \((Y_h^{(m)},X_h^{(m)})\), \(m = 1,\dots ,M\), with \(M=500\), but for ease of notation we drop the superscripts.

In the training set, lasso (Tibshirani 1996) solves the penalised least squares problem

$$\begin{aligned} {{\hat{\beta }}}_\lambda ^{{\tiny {lasso}}} = \arg \min _{\beta \in {\mathbb {R}}^P} \left\{ \parallel Y_1-X_1\beta \parallel ^2 + \lambda \sum _{j=1}^P|\beta _j|\right\} \end{aligned}$$
(1)

where \(\parallel \cdot \parallel\) denotes the Euclidean norm and \(\lambda >0\) is a tuning parameter which shrinks some coefficients to zero and, consequently, renders the corresponding variables irrelevant.

In our applications, the tuning parameter is selected by several methods. One criterion consists in minimising the K-fold cross-validation error (Stone 1974), with \(K=10\),

$$\begin{aligned} \text {CV}(\lambda ) = \frac{1}{K}\sum _{k=1}^K\sum _{i\in \text {fold } k}(Y_{1i}-X_{1(i)} {\hat{\beta }}_{\lambda , -k})^2 \end{aligned}$$

where \(Y_{1i}\) is the generic element of \(Y_1\in {\mathbb {R}}^{N_1}\), \(X_{1(i)}\in {\mathbb {R}}^{P}\) denotes the ith row of \(X_1\), the index \(k:\{1,\dots ,N_1\}\rightarrow \{1,\dots ,K\}\) indicates the partitions to which each ith observation is allocated by the randomisation in the kth fold, and \({\hat{\beta }}_{\lambda ,-k}\) is the estimate of \(\beta\) obtained by lasso, without the contribution of the observations in the kth fold.
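In practice this is what cv.glmnet in the glmnet package (used in the paper, see section 4.6) computes; a sketch of how lasso-min is obtained:

```r
library(glmnet)
cvfit <- cv.glmnet(X1, Y1, alpha = 1, nfolds = 10)   # lasso path with 10-fold CV
lam   <- cvfit$lambda.min                            # lasso-min tuning parameter
b     <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]
sel   <- which(b != 0)                               # indices of the selected words
```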

The Bayesian information criterion (BIC) by Schwarz (1978) is based on minimisation of the following objective function,

$$\begin{aligned} \text {BIC}(\lambda )= N_1\log {\hat{\sigma }}^2_\lambda + df(\lambda )\log (N_1) \end{aligned}$$

where \({{\hat{\sigma }}}^2_\lambda = \frac{1}{N_1}\sum _{i=1}^{N_1}(Y_{1i}-X_{1(i)}{\hat{\beta }}_\lambda )^2\) and \(df(\lambda )\) is the effective degrees-of-freedom parameter, for which an unbiased and consistent estimator is the number of nonzero coefficients (Zou et al. 2007).

The extended BIC (eBIC) by Chen and Chen (2008) adds an extra penalty, controlled by a parameter \(\gamma \in [0,1]\), that accounts for the model complexity, summarised by the term \(\tau _j= {P \atopwithdelims ()j}\), where j is the number of covariates considered in the model,

$$\begin{aligned} \text {eBIC}_\gamma (\lambda ) = N_1\log {\hat{\sigma }}^2_\lambda + df(\lambda )\log (N_1) + 2\gamma \log ( \tau _{df(\lambda )}). \end{aligned}$$
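Both criteria can be evaluated directly along a fitted glmnet path; a sketch under the formulas above, where lchoose(P, df) computes \(\log \tau _{df(\lambda )}\):

```r
fit  <- glmnet(X1, Y1, alpha = 1)          # full lasso path
pred <- predict(fit, newx = X1)            # N1 x nlambda matrix of fitted values
sig2 <- colMeans((Y1 - pred)^2)            # hat sigma^2_lambda for each lambda
df   <- fit$df                             # nonzero coefficients per lambda
N1   <- nrow(X1); P <- ncol(X1)

bic  <- N1 * log(sig2) + df * log(N1)
ebic <- function(g) bic + 2 * g * lchoose(P, df)   # extra eBIC penalty

lam_bic    <- fit$lambda[which.min(bic)]           # lasso-bic
lam_ebic05 <- fit$lambda[which.min(ebic(0.5))]     # lasso-ebic05
lam_ebic10 <- fit$lambda[which.min(ebic(1.0))]     # lasso-ebic10
```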

Stability selection based on lasso is discussed in detail in Meinshausen and Bühlmann (2010), section 2.2. The key concept is the stability path, given by the probability of each variable being selected when randomly resampling from the data, over all the values of the regularisation parameter. Specifically, lasso provides estimates of the set of nonzero coefficients as \({\hat{S}}_\lambda = \{j:{\hat{\beta }}_{\lambda ,j}\ne 0\}\), where \({\hat{\beta }}_{\lambda ,j}\) is an element of \({\hat{\beta }}_{\lambda }\) in equation (1). Let I be a random subsample of \(\{1, \dots , N\}\) drawn without replacement. For every set \(J \subseteq \{1, \dots , P\}\), the probability of being in the selected set \({\hat{S}}_\lambda (I)\) is \(\pi _J^\lambda = {\mathbb {P}} \{ J \subseteq {\hat{S}}_\lambda (I)\}\). For every variable \(j=1,\dots ,P\), the stability path is given by the selection probabilities \(\pi _j^\lambda\) across \(\lambda\). For a cut-off \(\pi _0\in (0,1)\) and a set of regularisation parameters \(\Lambda\), the set of stable variables is defined as \(S^{{\tiny {stable}}} =\{j:\max _{\lambda \in \Lambda }{\hat{\pi }}_j^\lambda \ge \pi _0\}\). Here we apply the complementary pairs version of stability selection by Shah and Samworth (2013), which improves error control.
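With the stabs package (section 4.6), the procedure corresponds to a call of the following kind; the cutoff matches the 0.7 threshold of section 4.3, while the PFER bound is an illustrative assumption, as the paper does not report it.

```r
library(stabs)
ss <- stabsel(x = X1, y = Y1,
              fitfun = glmnet.lasso,   # lasso as the base selection procedure
              cutoff = 0.7,            # selection probability threshold pi_0
              PFER = 1,                # bound on expected number of false selections
              sampling.type = "SS")    # Shah-Samworth complementary pairs subsampling
ss$selected                            # indices of the stable variables
```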

As a further criterion, we select the value of \(\lambda\) associated with at least k nonzero coefficients, i.e. \({\hat{\lambda }}_k = \max \{\lambda \in \Lambda : \text{ card }({\hat{S}}_\lambda ) \ge k\}\), the largest value of the tuning parameter for which at least k coefficients are nonzero, \(k=1,\dots ,25\).
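Along a glmnet path, this amounts to picking the largest \(\lambda\) whose active set contains at least k variables; a sketch, assuming such a \(\lambda\) exists on the computed path:

```r
lambda_k <- function(fit, k) max(fit$lambda[fit$df >= k])  # largest lambda with >= k nonzero
b_k5 <- as.numeric(coef(fit, s = lambda_k(fit, 5)))[-1]    # e.g. lasso-k5
which(b_k5 != 0)                                           # the (at least) 5 selected words
```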

Sure independence screening (SIS, Fan and Lv 2008) is based on correlation learning, which filters out the variables that have weak correlation with the response. Let \(S_* = \{j:\beta _{j}\ne 0\}\) denote the true model. SIS selects \(S_\xi = \{j:|\omega _j| \text { is among the first } [\xi N] \text { largest}\}\), where \([\xi N]\) denotes the integer part of \(\xi N\), \(\xi \in (0,1)\) and \(\omega = X'Y\). SIS enjoys the sure screening property, i.e. \({\mathbb {P}}(S_*\subset S_\xi )\rightarrow 1\) for \(N\rightarrow \infty\).
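The paper runs the SIS package (section 4.6); the screening step itself can be sketched directly from the definition, assuming standardised columns so that \(\omega = X'Y\) measures the marginal correlation with the response:

```r
sis_screen <- function(X, y, k) {
  Xs    <- scale(X)                        # standardise the columns of the DTM
  omega <- abs(crossprod(Xs, y - mean(y))) # |omega_j| = |x_j' y|, marginal signal
  order(omega, decreasing = TRUE)[1:k]     # keep the k strongest predictors
}
```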

Once the relevant variables have been selected, they form the new DTM matrix, with a slight abuse of notation \(X_2\in {\mathbb {R}}^{N_2\times P^*}\), where \(P^*\) denotes the number of relevant variables eventually selected by each procedure, and the regression coefficients are estimated by ordinary least squares as the solution of the system

$$\begin{aligned} X_2'X_2 {{\hat{\beta }}}^* = X_2'Y_2. \end{aligned}$$

For the case of the lasso, Belloni and Chernozhukov (2013) discuss some additional assumptions to show that the post-estimation OLS, also referred to as post-lasso, performs at least as well as the lasso itself.

4.5 Comparison with LDA

The results obtained by text regression are compared with those obtained by the unsupervised generative model LDA. We exploit neither the many variants of LDA for short texts nor the supervised LDA, as LDA is presented here only as a general benchmark. LDA assumes that each document in the corpus can be described as a probabilistic mixture of T topics, and as an output, LDA provides the probability of document d belonging to topic t, \({\mathbb {P}}(t|d)\), where \(d = 1,\dots ,D\) indexes the documents and \(t = 1,\dots ,T\) indexes the topics. In turn, each topic is defined by a probability distribution over the vocabulary of size P; for each topic, the word probability vector \({\mathbb {P}}(v|t)\), where \(v=1,\dots ,P\) indexes the words, describes how likely it is to observe a word conditional on a topic. LDA proceeds through posterior inference of the latent topics given the observed words and, as conjugate priors to the multinomial distributions, LDA uses Dirichlet priors.

In this analysis, the Dirichlet prior hyperparameters are set to the default values (\(\alpha = 0.1\), \(\beta = 0.05\)) and the model is estimated using collapsed Gibbs sampling, as described in Jones (2019).
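With textmineR (section 4.6), the fit corresponds to a call of the following kind; the number of Gibbs iterations is an illustrative assumption, and dtm is assumed to be a sparse document term matrix as produced, e.g., by textmineR::CreateDtm.

```r
library(textmineR)
lda <- FitLdaModel(dtm = dtm,        # sparse document term matrix
                   k = 10,           # number of topics T
                   iterations = 500, # collapsed Gibbs sampling iterations
                   alpha = 0.1,      # Dirichlet prior on topics per document
                   beta = 0.05)      # Dirichlet prior on words per topic
theta <- lda$theta                   # D x T matrix of P(t|d)
phi   <- lda$phi                     # T x P matrix of P(v|t)
```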

As a next step, in each comparison of LDA to lasso-based specifications, the number of LDA topics, T, is fixed equal to the average number of selected variables (over bootstrap replications), \(P^*\). We generate indicator variables taking value one in correspondence to each document where the topic displays a topic probability, \({\mathbb {P}}(t|d)\), larger than the average topic probability, \(\frac{1}{T}\sum _{t=1}^{T} {\mathbb {P}}(t|d)\), slightly modifying the procedure by Schwarz (2018), which is based on the largest topic probability. The number of indicator variables equals the number of topics, T, which, in its turn, in each comparison, equals the average number of selected variables, \(T=P^*\). In that case, \(X_2\) is replaced by \(X_2^*\in {\mathbb {R}}^{N_2\times P^*}\), which collects the \(P^*\) indicator variables. Eventually, the response variable (either ratings or prices) is regressed on the indicator variables derived from the topics and collected in \(X_2^*\), and the performance is evaluated as for the lasso-based methods.
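Note that, since the topic probabilities of each document sum to one, the average topic probability within a document equals 1/T, so the indicators take a very simple form; a sketch:

```r
Tn  <- ncol(theta)            # number of topics T
Z   <- 1 * (theta > 1 / Tn)   # indicator: one where P(t|d) exceeds the average 1/T
fit <- lm(y ~ Z)              # response regressed on the T topic indicators by OLS
```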

As far as the comparison of selected words is concerned, the most frequently selected words over bootstrap replications are considered for lasso-based specifications. As regards lda-k, the top \(P^*\) terms of each topic in each bootstrap replication are considered, after discarding duplicates. Indeed, the same word may appear within the top \(P^*\) terms of more than one component.

On the one side, we expect that in terms of predictive \(R^{2}\) the comparison a priori favours LDA specifications, as k components in LDA embed more information than k words alone. On the other side, the comparison of the selected words extracted by shrinkage methods to the top terms of LDA components, which admittedly sounds more artificial, may nonetheless provide useful information on the relative ability to detect relevant words.

4.6 Computational details

All the computations are carried out in the R software environment. In particular, lasso is implemented through the glmnet package by Friedman et al. (2010). Stability selection is run using the stabs package by Hofner and Hothorn (2017) and Hofner et al. (2015). Sure independence screening is carried out through the SIS package (Saldana and Feng 2018). The LDA topic model is fit using textmineR (Jones 2019).

5 Results

5.1 Case study 1: prices

5.1.1 Predictive \(R^2\)

The main results of case study 1, in terms of number of selected variables and predictive \(R^2\), are displayed in Table 5 and summarised in Fig. 1.

Table 5 Prices. Number of selected variables and predictive \(R^2\)

As expected, lasso-min always selects, on average, a very high number of predictors in both categories, reaching the highest adjusted and predictive \(R^2\) and explaining a share of 0.773 of price variation for knitwear and 0.694 for dresses over the replicated datasets. Indeed, as has been observed, it tends to be generous in selecting noisy variables. The number of predictors selected by lasso-min also displays the highest standard deviation.

Fig. 1 Prices. Number of selected variables and predictive \(R^2\). Averages over bootstrap replications

When lasso is optimised by minimising the eBIC, it employs few parameters; the higher the value of the tuning parameter \(\gamma\), the smaller, on average, the number of selected variables. The most parsimonious eBIC selects, on average, 14.2 predictors for knitwear and 19.4 predictors for dresses, explaining about 61% of the price variability in new datasets. Stability selection is quite parsimonious as well: it selects 6.3 predictors for knitwear and 9.0 for dresses, explaining a share of 0.499 and 0.404 of the prices' predictive variance within the respective categories. The predictive \(R^2\) values are just below the ones of lasso-ebic10, though based on considerably fewer predictors. Moreover, it has to be observed that the lasso models based on eBIC optimisation tend to select a limited number of coefficients on average, but with a certain degree of heterogeneity over the replications.

For the sake of brevity, having results for \(k=1,2,\dots ,25\), Table 5 presents results for SIS and lasso computed with the same (average) fixed number of predictors as ssmb and lasso-ebic10, i.e. \(\bar{P^*} =6\) (knitwear) and \(\bar{P^*}=9\) (dresses) for stability selection, and \(\bar{P^*}=14\) (knitwear) and \(\bar{P^*}=19\) (dresses) for lasso optimised with eBIC with \(\gamma =1\).

Focusing on the SIS method, we note that it produces results very similar to ssmb and lasso-ebic10 when a comparable number of predictors is imposed. The same pattern may be observed when lasso is used with a fixed number of selected variables. Figure 1 sketches an overview of the presented methods for the two categories. Note that, as expected, the performance in terms of predictive \(R^2\) increases, at a diminishing rate, as the number of predictors grows.

Fig. 2 Prices. Number of coefficients. Averages over bootstrap replications

Figure 2 displays the standard deviation versus the average number of coefficients for a selection of models, highlighting that ssmb consistently ensures parsimonious models.

A first finding is that the two methods producing the most parsimonious variable selection are lasso-ebic10 and ssmb, where the latter seems to be preferable in terms of robustness. A similar performance is reached by simple lasso or SIS with the same, fixed, number of predictors.

5.1.2 Inclusion frequency and model class reliance

We now come to the analysis of inclusion frequencies and model reliance. The analysis covers a selection of models: lasso optimised with eBIC with \(\gamma =1\); stability selection; SIS and lasso computed with the same (average) fixed number of predictors as ssmb and lasso-ebic10; and LDA with the number of topics equal to the same (average) fixed number of predictors as ssmb and lasso-ebic10. Tables 6 and 7 display the most frequently selected variables picked by lasso-ebic10, ssmb and by SIS or lasso with an a priori fixed number of coefficients, over the knitwear and dresses datasets, respectively. For each specification, the tables present the average inclusion frequency and the highest model class reliance (MCR), by word and on average.

The results display levels of mean inclusion frequency that are comparable across the variable selection methods, with some heterogeneity. A similar picture is found in terms of the highest MCR, i.e. comparable levels across lasso-based methods, always higher than the LDA results.

In the knitwear dataset (Table 6), the MCR shows that, when 6 variables are selected, the gain in the residual standard error amounts to 7.1 percentage points in mean, and to 3.5 to 3.8 percentage points when 14 variables are selected.

Table 6 Selected words and inclusion frequency for the Knitwear dataset
Table 7 Selected words and inclusion frequency for the Dresses dataset

Table 7 displays the selected words and bootstrap inclusion frequencies over the dresses dataset. Text regression methods provide similar levels of average inclusion frequencies. The highest MCR values confirm that lasso-based variable selection methods, more often than LDA, identify words indicating important variables. When 9 words are selected, the MCR indicates a gain of about 5 percentage points in mean; when 19 words are selected, of about 3.5 percentage points in mean.

Eventually, the analysis of the estimated coefficients and of their coefficients of variation, see Fig. 3, clearly shows that a few of them stand out of the bulk of the data, a negative note in terms of model robustness for sis, which has selected features whose coefficients exhibit a disproportionate variability.

5.1.3 Summary

We may conclude that the results attained through text regression methods always significantly outperform lda in terms of predictive \(R^2\). Mean inclusion frequencies of lda models, on the other hand, are comparable to the ones attained by the other methods. At the same time, the ability to select important variables favours text regressions. Overall, for the prices datasets, our findings recommend the use of text regression methods. Among the latter, the best performance in terms of the three measures considered throughout the analysis is provided by ssmb, i.e. stability selection, which is both parsimonious and stable, in the sense that it is not affected by high variability in the estimated coefficients.

Fig. 3 Prices. CV of the estimated coefficients

5.2 Case study 2: ratings

5.2.1 Predictive \(R^2\)

Table 8 displays the results related to the second case study, conducted with the aim of explaining ratings with tokens drawn from the open questions. In this example, lasso-bic tends to select the largest number of variables, capable of explaining around 78.4% of the predictive ratings variation. As in the previous example, lasso-min also produces, on average, a high number of predictors, reaching the highest predictive \(R^2\). Figure 4 (left panel) confirms that high predictive power may be reached only at the cost of selecting a very large number of predictors. Note that lasso-min and lasso-bic are not included in the figure, as we have maintained the same scale as Fig. 1 for the sake of comparison. More parsimonious models are selected by ssmb and lasso-ebic10, based on nine variables or fewer, even if lasso optimising the eBIC still displays a higher variability in the number of coefficients (see Fig. 4, right panel, on the same scale as Fig. 2). Both the lasso model which minimises the eBIC and stability selection guarantee an acceptable trade-off between explanatory power and parsimony. As in case study 1, lasso and SIS perform comparably well to ssmb and lasso-ebic10 when the number of predictors is fixed a priori, with an out-of-sample \(R^2\) oscillating around 30 to 35%.

Differently from the case of prices, with ratings our findings are slightly in favour of topic modelling, as the lda predictive \(R^2\) values, corresponding, respectively, to solutions with 7 and 9 topics to preserve comparability, are slightly higher than the ones reached through text regression methods.

Table 8 Ratings. Number of selected variables and predictive \(R^2\)
Fig. 4 Ratings. Number of selected variables and predictive \(R^2\) (left). Number of coefficients (right). Averages over bootstrap replications

5.2.2 Inclusion frequency and model class reliance

Table 9 compares the most frequently selected features over the five hundred bootstrap replications by the different methods, as well as the highest MCR of the selected variables. Words have been drawn from the two open questions asking to indicate, respectively, pros and cons. Words deriving from the positive field, pros, are preceded by the prefix p, while words deriving from the negative field, cons, are preceded by the prefix c. The more parsimonious the models, the higher the inclusion frequency of the selected variables. For both ssmb and lasso-ebic10, the most relevant variables are selected in more than 65% of replications. It is evident that, in terms of both persistence and relevant variables, ssmb and lasso-ebic10 outperform all the other methods. As far as lda is concerned, the top selected words display slightly lower mean inclusion frequency rates than the considered methods.

Table 9 Selected words and inclusion frequency for the ratings dataset

Concerning the ability to detect important variables, note that, in all cases, the highest MCR values only negligibly exceed unity, meaning that the selected variables, on average, are able to decrease the loss in predictive power by only about 1 percentage point.

5.2.3 Summary

The results discussed above are corroborated by Fig. 5, confirming that, except for the high coefficient of variation of lasso-ebic10, all the methods behave in a much more homogeneous way than in the previous case study, in terms of inclusion frequency and predictive \(R^2\).

Fig. 5 Ratings. CV of the estimated coefficients

All in all, in this second case study, among the text regression methods, stability selection is the preferred one, as it ensures an acceptable trade-off between explanatory power and parsimony. In addition, it outperforms the other methods in terms of persistence and relevant variables. However, differently from the case of prices, with ratings our findings are slightly in favour of topic modelling. The reasons can be related to the fact that, in this example, (a) sentences are not as short as in the case of prices and (b) the words have an emotional content, stronger in this case than in the case of attributes, such as freedom as opposed to wool.

On this, and with a focus on interpretability, it is worth remarking that the selected words have to be read jointly with their most co-occurrent words. The task of understanding the meaning to which each word refers is usually performed in lda by looking at the words with the highest probability of occurring within each topic. To follow a similar path in text regression, we consider the tokens that show the greatest co-occurrence, in a sort of topic reconstruction, displayed in Table 10. In this way, each selected word is accompanied by some further ones, among which it plays the pivotal role. As for standard topic models, such as lda, each group of words has to be interpreted: researchers familiar with the field of study, jointly looking at the top tokens, find the best meaning for the topic.

In summary, in this case study, our results do not strongly support the use of text regression methods but favour lda. The latter performs slightly better than text regression methods in explaining the ratings data. We argue that this can be related to the length of the texts, longer in this second case study than in the first one, and to the very nature of the case study itself, where the joint co-occurrence of words within the pros and cons fields, rather than single words alone, should explain the topic, as motivations underlying the ratings.

Table 10 Reconstructed topics (co-occurrence greater than 20%) from pivotal words

6 Concluding remarks

The paper has investigated the analytics that allow one to exploit the informative content and the explanatory power of unstructured, short texts on a response variable.

Interpretability of the results was a key issue for the scope of the present study; hence, we have restricted our focus to shrinkage methods, and within this class, we have favoured models that provide results of easier interpretation.

In this perspective, we have compared the explanatory power of variables selected through several variants of lasso, screening-based methods and randomisation-based models, namely sure independence screening and stability selection. A comparison has also been run with the widely applied topic model, i.e. LDA, used as a benchmark. The relative performance of the methods has been assessed based on the number and the importance of the selected variables.

We have considered two applications. The first application focused on explaining prices of goods within a product category, based on the captions provided by manufacturers on e-commerce platforms. In this case study, the nature of the texts is descriptive, as they characterise the goods for sale; after the text pre-processing phase, texts are very short, reduced in the median to only three words. The second application aimed to understand how to use open questions to obtain information on overall satisfaction within surveys. After the text pre-processing phase, texts are short, with 15 words in the median, but longer than in the previous case. Furthermore, here texts express opinions, and in particular satisfaction or dissatisfaction; thus, compared to the first case study, they are much more related to the emotional sphere.

The results of the study provide insights along two main directions concerning, on the one side, the performance of the analysed models within the class of text regression and, on the other side, the different ability of text regression versus topic modelling methods to extract information from short texts.

Along the first direction, our findings show that, in terms of explanatory power, both stability selection and the lasso optimising the eBIC criterion are able to improve on the lasso when the latter is tuned through standard criteria aimed at prediction. Nevertheless, by limiting the number of selected variables, both lasso and sure independence screening are capable of attaining comparable results. As far as the ability to select relevant explanatory predictors is concerned, stability selection slightly outperformed the other methods, which nonetheless exhibited good performance. In our opinion, a relevant finding is that lasso behaves as well as alternative, computationally more intensive methods when the number of selected variables is limited.

Concerning the comparison of text regression with LDA, the former outperforms LDA in terms of explanatory power in the prices case study while LDA outperforms text regressions in the ratings case study. This is likely to happen both because texts are longer in the ratings case study and because of their contents, which are naturally more connected to latent topics.

In terms of the quality of the selected words, text regression outperforms LDA in the case of prices, but not entirely in the case of ratings. However, the words selected by text regressions are always more robust than the LDA ones, so that, in both cases, text regressions appear highly suitable to pick up relevant words within a bag of words.

To conclude, we remark that the results of the paper describe how text regression and variable selection methods work over two specific applications and cannot be generalised without further extensive analyses. However, our findings favour variable selection in text regressions as a method that may provide valuable solutions when texts are short, and they open the way to further investigations.