1 Introduction

The cost of collecting and processing high-quality traditional data, such as surveys, is increasing, and the process of deriving statistical products from these data is demanding and time-consuming [1].

At the same time, the availability of new data has led to an expansion of data collection methods, moving beyond traditional primary data collection to the extraction of statistics from non-traditional sources. These sources, referred to as big data or digital trace/behavioral data, include, among others, social media posts, Google Trends, and mobile phone data (i.e., location, photos, and other sensor data), and are produced by human online/digital behaviors and interactions [2]. Digital trace data are not generated for statistical purposes but can serve as a convenient and timely source of information for understanding and measuring (new) complex socio-economic phenomena [3]. These new data sources provide a basis for the multi-purpose extraction of different statistical indicators, which complement the traditionally available statistical information and feed smart statistics [4,5,6]. The integration of traditional and digital trace data for producing innovative statistics and indicators is a promising approach. It can enhance timeliness, provide finer spatial and temporal resolution and a higher level of detail, and offer new perspectives and insights on phenomena, while also reducing the production cost of (official) statistics [7].

Research on indicators constructed using non-traditional sources only, particularly textual data from social media, is prevalent in the literature, especially with reference to social aspects [8,9,10,11]. Further, a number of experimental statistics have been developed by National Statistical Institutes (NSIs) using such textual data to study social tensions and consumer confidence in the economy (see, for example, Daas and Puts [12] and Istat's Social Mood on Economy Index). However, studies combining traditional and digital trace data-based indicators are scarce.

The use of innovative unstructured data, also in combination with traditional data, is relatively underdeveloped in the field of business statistics, despite the potential benefits it can offer. New data sources can be used in a variety of ways, including enhancing the information for a given unit [13]. For example, Statistics Canada used sensor data to augment administrative data and produce more efficient small area estimates for business statistics [14]. Similarly, Statistics Netherlands (CBS) is committed to enhancing business statistics, using web-scraped data from companies’ websites in order to detect innovative companies and improve the quality of the assigned NACE codes [15, 16]. The Italian National Statistical Institute (ISTAT) is also committed to developing experimental statistics based on businesses’ websites in order to identify their activities or to augment the information collected through the traditional survey on Information and Communication Technologies [17,18,23, 24].

Like survey data, unstructured textual data are susceptible to errors at the analysis stage. Efforts are being made to adapt the Total Survey Error (TSE) framework to such data, but there is currently no general framework to account for, measure, and evaluate errors and data quality [25,26,27]. Data sources have different characteristics, which require different quality frameworks. The importance of these aspects becomes especially evident when integrating data from different sources, where it is crucial to understand how errors arise, accumulate, and interact during the entire integration process [28]. These are all emerging topics in the literature.

While all these factors should be considered when combining data, our focus here is on proposing a procedure to develop composite indicators (CIs) based on the integration of different types of data, structured and unstructured, derived from traditional and non-traditional sources. An overview of quality evaluation is presented in Sect. 5.

3 Methodology

Data integration is becoming increasingly popular, as the combination of different sources (e.g., a probability sample survey with a non-probability one, or a traditional source with an innovative big data source) enables enhanced inference, reduced costs, and the measurement of new phenomena or previously unexplored aspects of existing ones [29]. However, when it comes to choosing the right methodology for data integration, a universal approach does not exist [5, 30, 31]. The choice of the methodology depends on various factors, including the research objective (such as finite population or analytic inference, or the measurement of multidimensional phenomena), the availability of variables of interest across sources, the similarity or dissimilarity of the constructs measured by the sources, and other relevant considerations. Thus, data integration is statistic- and purpose-specific.

In Sect. 3.1, we propose a modular general framework for combining traditional and innovative data. The proposed framework is very general and applicable to a variety of scenarios. Next, we address the issue of data integration from the perspective of composite indicators derived from different sources and measuring different aspects of a phenomenon. Thus, instead of having a finite population quantity or model parameters to estimate, we consider the task of measuring multi-dimensional phenomena by combining indicators from various sources. Section 3.2 shows how to generate smart business CIs combining structured and unstructured data (e.g., textual data from social media and websites, or other innovative data sources), introducing an adaptation of the general framework.

3.1 A modular general framework for the construction of smart business statistics

To produce smart business statistics using unstructured textual data, we develop a modular general framework in three layers. This is an adaptation of the modular organization into three layers introduced by Ricciato et al. [7].

In the first layer, the data are collected and transformed into structured data. Such data and the related metadata then need to be interpreted by statisticians and serve as input for the second layer. It is worth noting that the processing of metadata to complement the analysis of unstructured digital data has been examined in only a limited number of studies. Indeed, it is an emerging topic, and applications relate to user/account profiling [32, 33] and geo-spatial analysis [34, 35]. As an original contribution, we propose to use social media metadata for the construction of CIs, as shown in the prototype application (Sect. 4).

In the second layer, innovative statistical information is extracted, and indicators are computed. Together, the first and second layers augment statistical information through the creation of new indicators generated using textual unstructured data.

In the third layer innovative statistics and indicators are used to augment the already available traditional data. Depending on the specific use-case, this can be achieved through methods such as linkage, statistical integration, or by combining indicators. As a result, Smart Business Statistics are produced. Figure 1 summarizes the framework described above.

The modular approach is particularly useful when dealing with new and complex data sources and their integration with traditional ones. Modularity also allows researchers and practitioners to explore other methodological variants (instances) within the same methodological architecture, and possibly propose improvements to specific modules or test sensitivity of the obtained results.

Fig. 1 Modular general framework for producing smart business statistics

3.2 Building smart composite indicators

3.2.1 Background concepts

Before describing the proposed methodology, we briefly recall that constructing CIs requires decisions on several aspects [36].

First of all, the theoretical framework of the substantive research topic has to be defined. This is crucial for the choice of the data and the definition of the variables. It also guides the researcher through the methodological decisions involved in constructing the CI, namely the normalization of the indicators and the aggregation strategy. Normalization is performed in order to ensure comparability. Depending on the variable type (e.g., continuous, categorical, or ordinal) and the aggregation strategy, it can be accomplished in a variety of ways. Common methods are standardization (z-score), min–max transformation (or re-scaling), and transformation to index numbers [37].
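The three normalization methods mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of the population standard deviation, and the choice of the mean as base for index numbers are all assumptions made for the example.

```python
import numpy as np

def z_score(x):
    """Standardize to mean 0 and unit variance (population SD, an assumption)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def min_max(x):
    """Re-scale to the [0, 1] interval."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def index_number(x, base=None):
    """Express values relative to a base (here, by assumption, the mean) set to 100."""
    x = np.asarray(x, dtype=float)
    base = x.mean() if base is None else base
    return 100.0 * x / base

# Hypothetical values of one elementary indicator for four units
values = np.array([2.0, 4.0, 6.0, 8.0])
z = z_score(values)
mm = min_max(values)
idx = index_number(values)
```

All three transformations preserve the ordering of the units; they differ in the scale and in how strongly extreme values affect the result.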

Aggregation refers to the combination of the individual indicators in order to create a CI. This phase entails considerations on the polarity and the importance of each elementary indicator, and the identification of the technique used to synthesize the elementary indicators. To properly insert the original indicators into the aggregation procedure, the polarity of indicators should be carefully considered. Polarity refers to the direction of the relationship between the indicator and the phenomenon to be measured: it is positive (negative) if the indicator is positively (negatively) associated with the phenomenon. Further, the selection of the aggregation technique depends on the level of compensability of the individual indicators, that is, the possibility of balancing a disadvantage on some indicators with a sufficiently large advantage on others. This should be based on theoretical evaluations. In this respect, there are three types of aggregation approaches depending on the degree of compensability: compensatory, partially compensatory, and non-compensatory. For example, fully compensatory aggregation is obtained with the arithmetic mean. For individual indicators derived from unstructured data, this may apply to the topic proportions resulting from a topic model. Partially compensatory approaches relate, for example, to the computation of geometric, harmonic, or quadratic means, or to specific methods like the Mazziotta–Pareto procedure [38]. For example, one could consider the social media dimension related to communication aspects of a certain phenomenon to be partially replaceable with traditional measurements of the same phenomenon. Non-compensatory aggregation is usually performed following multi-criteria approaches.
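To make the notion of compensability concrete, the sketch below compares the arithmetic, geometric, and harmonic means on a balanced and an unbalanced profile with the same arithmetic mean. The two profiles are hypothetical values chosen for illustration, and the means assume strictly positive indicators.

```python
import numpy as np

def arithmetic_mean(v):
    """Fully compensatory: a low value is offset one-for-one by a high one."""
    return float(np.mean(v))

def geometric_mean(v):
    """Partially compensatory: unbalanced profiles are penalized."""
    return float(np.exp(np.mean(np.log(v))))

def harmonic_mean(v):
    """Partially compensatory, with a stronger penalty for low values."""
    v = np.asarray(v, dtype=float)
    return float(len(v) / np.sum(1.0 / v))

# Two hypothetical units with the same arithmetic mean
balanced = np.array([50.0, 50.0])
unbalanced = np.array([90.0, 10.0])

for name, f in [("arithmetic", arithmetic_mean),
                ("geometric", geometric_mean),
                ("harmonic", harmonic_mean)]:
    print(name, f(balanced), f(unbalanced))
```

The arithmetic mean cannot distinguish the two profiles, whereas the geometric and harmonic means assign a lower score to the unbalanced one.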

Aggregation also involves the identification of weights associated with the individual indicators. Weights reflect the relative importance of the indicators to be combined. When no weights are specified, all indicators are implicitly weighted equally. Alternatively, weights can be determined according to subjective and expert evaluations, or statistical methods, such as Principal Component Analysis. However, weights should only be specified when there is a strong theoretical basis for doing so; otherwise a no-weighting strategy should be adopted [39, 40]. Attention should be paid to the implicit importance associated with the original elementary indicators in the case of subsequent aggregations. For a complete overview of CI construction, please refer to Mazziotta and Pareto [37], OECD [41] and Booysen [39].

When developing CIs, it is important to evaluate the quality of the results taking into consideration the impact of the different methodological decisions that have been made [41]. This topic is further discussed in Sect. 5 also in relation to the multi-source nature of the integration process.

3.2.2 Procedures of the approach

As regards our original contribution, we present a methodology for constructing (a) simple and composite indexes that measure new aspects of phenomena using new data sources and (b) a CI that integrates traditional and non-traditional indexes. We do so by adapting the modular general framework introduced in Sect. 3.1 (see Fig. 1) to this setting.

In empirical studies, it is common to consider several dimensions to represent complex phenomena and to proceed through many levels of aggregation. Our approach is general and offers a flexible solution that can be applied to different cases. In Fig. 2, we present a visual representation of our proposed modular layer approach, emphasizing its practical application by showing an example with two dimensions, one innovative indicator, and three levels of aggregation.

The first layer includes the identification and extraction of the elementary indicators. By way of example and focusing on the innovative data source, assume that, according to the theoretical framework, there are two relevant dimensions that can be measured by the innovative data source, namely \(D_1\) and \(D_2\) and let \(I_{D_1,1},\ldots ,I_{D_1,i},\ldots ,I_{D_1,n}\) be the n individual elementary indicators related to dimension \(D_1\) and \(I_{D_2,1},\ldots ,I_{D_2,j},\ldots ,I_{D_2,m}\) be the m individual elementary indicators related to dimension \(D_2\). Such indicators and dimensions must be identified based on theoretical, empirical, pragmatic, or intuitive considerations [39]. The second layer entails the construction of the innovative index (INN-INDEX). Depending on the specific situation, this can be done as one aggregation or as subsequent aggregation steps. Generally, in the presence of, say, two pillars, the elementary indicators are first combined in order to generate two sub-indicators measuring each dimension of interest, \(CI_{D_1}\) and \(CI_{D_2}\) respectively. The approach may be extended to more dimensions depending on the characteristics of the phenomenon and the innovative source being studied. Next, these sub-indicators are further aggregated to create the innovative index (INN-INDEX). This is the second level of aggregation. We assume that the traditional index (TRAD-INDEX) is already available. Otherwise, the same methodology can be applied to obtain a traditional indicator if one does not already exist.

Moving to the third layer, the third level of aggregation relates to the construction of the innovative smart CI (SMART-INDEX). In the second and third levels, attention should be paid to avoid double normalization.
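The three levels of aggregation can be sketched as below. This is only an illustration under stated assumptions: the data are hypothetical, the unweighted arithmetic mean stands in for whatever aggregation technique the theoretical framework suggests, and min–max re-scaling stands in for the chosen normalization. The point is that each input is normalized only once.

```python
import numpy as np

def min_max(x):
    """Re-scale one indicator to [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def aggregate(columns):
    """Placeholder aggregator: unweighted arithmetic mean across indicators."""
    return np.mean(np.column_stack(columns), axis=1)

# Hypothetical data for 4 units: n = 2 indicators for D1, m = 2 for D2
I_D1 = [np.array([1.0, 2.0, 3.0, 4.0]), np.array([10.0, 20.0, 30.0, 40.0])]
I_D2 = [np.array([4.0, 3.0, 2.0, 1.0]), np.array([0.0, 5.0, 5.0, 10.0])]
trad_index = np.array([60.0, 70.0, 80.0, 90.0])  # assumed already available

# First level: normalize the elementary indicators once, build CI_D1 and CI_D2
CI_D1 = aggregate([min_max(x) for x in I_D1])
CI_D2 = aggregate([min_max(x) for x in I_D2])

# Second level: INN-INDEX from the sub-indicators (already on [0, 1], not re-normalized)
inn_index = aggregate([CI_D1, CI_D2])

# Third level: SMART-INDEX; only the traditional index still needs normalizing
smart_index = aggregate([inn_index, min_max(trad_index)])
```

Re-normalizing `CI_D1`, `CI_D2`, or `inn_index` before the next aggregation would be an instance of the double normalization the text warns against.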

It is important to note that the theoretical framework of the phenomenon being measured plays a crucial role in the construction of the index. All decisions taken at the various steps of the three layers and of the CI construction must align with this framework.

Moreover, an advantage of the proposed procedure is that it can be easily adapted to different situations. For example, one might proceed across the whole set of three layers or only compute the CI going through the second and third layer in case the elementary indicators have already been computed.

Fig. 2 CI construction strategy on three layers: an example

We illustrate how to apply the proposed methodology through a practical exercise that shows how to construct a prototype CI for measuring Corporate Social Responsibility (CSR) in the next section.

4 Construction of a prototype

4.1 Context and theoretical framework

This application focuses on the construction of a CI in the field of business statistics and sustainability. Socially responsible behaviors of businesses are linked to the concept of CSR. Given its multi-faceted nature, measuring CSR activities is naturally related to the use of CIs, which make it possible to summarize complex or multi-dimensional phenomena [43].

The aim of this practical exercise is to measure CSR commitment based on a comprehensive view, including both effective commitment (as traditionally considered) and online communication of CSR-related activities. Online communication is characterised by its content (the topic discussed) and modality (the way it is conducted), leading thus to two pillars. The first one refers to the communication content, i.e., to the text which refers to the communication of CSR activities. The second one refers to communication modality (media richness). This is an important aspect for the communication to be effective and to engage with customers and stakeholders. We expect that the higher the media richness, the more effective the communication will be [44].

Our contribution is to demonstrate how the proposed modular framework can be applied in practice. We show the various steps that should be undertaken for the technical construction of a smart indicator to measure CSR. By providing a step-by-step guide for the technical construction of the indicator, we aim to show how to effectively use social media data in conjunction with (already available) traditional data to create a comprehensive indicator that accounts for various aspects of CSR (augmenting information).

The application should be regarded as an exercise. Hence, we do not provide a comprehensive examination of the CSR theoretical framework nor fully evaluate the meaning of the computed indicators from a substantive point of view. Similarly, we do not discuss in depth the selection of elementary indicators, which is ad hoc and driven by the information available in the innovative data source, or the evaluation of the indicators’ quality (see Sect. 5 for an overview of quality issues).

4.2 The application of the modular framework

For the sake of illustration, the units considered are the firms included in the Dow Jones Industrial Average index, i.e., a stock market index that measures the performance of the 30 largest US listed companies, as of its composition in August 2020. The data were collected as part of a previous study and refer to the year 2019 [45]. We retrieved the full list of firms, together with the corresponding activity sector, from Bloomberg. With respect to sector classification, Bloomberg adopts the Global Industry Classification Standard (GICS) developed by MSCI and S&P Dow Jones. The final number of firms included in the analysis is reduced to 26, as we only consider the firms for which data are available in both the traditional and the digital data source.

For the traditional indicator, we consider the Environmental, Social and Corporate Governance (ESG) database provided by Refinitiv, one of the world’s largest providers of financial markets data and infrastructure (commercial data). Data for listed companies refer to their sustainability performance considering various aspects, including emission reductions, social programs, and economic performance. The database collects publicly reported data, checked for quality, and provides a CSR-Strategy Score. This reflects a company’s practices to integrate economic (financial), social and environmental dimensions into its day-to-day decision-making process and it ranges between 0 and 100. In the subsequent analyses, the CSR-Strategy Score corresponds to the traditional indicator (TRAD-INDEX), which is therefore already available.

As the innovative data source, we consider Twitter, which is one of the main communication channels for companies [44]. Since here the INN-INDEX is based on social media data, it is renamed SM-INDEX. For the construction of the social media-based index (SM-INDEX), we follow the modular methodology proposed in Sect. 3. This is discussed in detail layer by layer in the following subsections. Figure 3 provides a representation of the process described.

Fig. 3 Modular methodological framework applied to the specific empirical exercise

4.2.1 The first layer: elementary indicators

Following the tasks in the first layer, we identified and retrieved the data from the official Twitter accounts of the companies. Given that companies may have several Twitter accounts, we focused primarily on CSR accounts and, where these are not available, on the news or multi-purpose ones. The objective is to reduce the noise (non-CSR tweets) in the data. We use the same data retrieved by Salvatore et al. [45]. However, we restrict the analysis to the 26 firms for which information is available in both the traditional and the innovative data source. This results in the inclusion of 39 different accounts (5 news, 18 multi-purpose and 16 CSR-related) for a total of 21,919 messages posted in 2019.

The selection of the elementary indicators for each pillar is based on the theoretical framework outlined in Sect. 4.1. To identify elementary indicators for the content pillar, we applied a Structural Topic Model (STM) to discover the topics discussed in the tweets. Next, we grouped those detailed topics into the main CSR dimensions, namely economic, social, environmental and general (or mixed). These essentially correspond to the proportion of text devoted to each CSR dimension for each tweet and represent the elementary indicators with respect to the content pillar, namely social (SOCIAL), economic (ECO), environmental (ENV), and general CSR (MIX).

For the modality pillar, we consider tweets’ metadata. In this respect, each tweet can contain hashtags (defining the topic of posts and allowing users to associate the tweet with all other tweets using the same identifying hashtags), mentions (engaging with other users), media (e.g., photos), and links (to external web pages). We thus define four elementary indicators for the modality pillar, corresponding to the number of hashtags (#), mentions (@), media (MEDIA), and links (LINK) contained in each tweet, respectively.
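As a minimal sketch of how the four modality indicators could be derived from tweet metadata, the function below counts the entities attached to a single tweet. The dictionary layout (`entities` with `hashtags`, `mentions`, `media`, and `links` lists) and the example tweet are hypothetical and do not reproduce any specific Twitter API response format.

```python
def modality_counts(tweet):
    """Count hashtags, mentions, media and links attached to one tweet.

    `tweet` is a dict in a hypothetical layout; missing fields count as zero.
    """
    entities = tweet.get("entities", {})
    return {
        "#": len(entities.get("hashtags", [])),
        "@": len(entities.get("mentions", [])),
        "MEDIA": len(entities.get("media", [])),
        "LINK": len(entities.get("links", [])),
    }

# Hypothetical tweet record
example_tweet = {
    "text": "Our 2019 sustainability report is out #CSR #climate @UN https://example.org/report",
    "entities": {
        "hashtags": ["CSR", "climate"],
        "mentions": ["UN"],
        "media": [],
        "links": ["https://example.org/report"],
    },
}

counts = modality_counts(example_tweet)
print(counts)
```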

These elementary indicators represent the output of the first layer, which is the base for the construction of intermediate CIs in the second layer. Elementary indicators are measured at the tweet level and then aggregated at the firm level (the unit of our analyses).

4.2.2 The second layer: development of the social media-based indicator

The CI for the content dimension is constructed by considering the elementary indicators SOCIAL, ECO, ENV, MIX, corresponding to the proportion of text devoted to each CSR dimension for each tweet. We assume that these proportions are substitutes (compensatory aggregation) with the same importance (no weight). To obtain the CI, we take the sum of these proportions at the tweet level and then aggregate them at the firm level by taking the arithmetic mean (content indicator).
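The two-step computation of the content indicator (sum at the tweet level, arithmetic mean at the firm level) can be sketched as follows. The tweet records and proportions are hypothetical; the example assumes that the four CSR proportions do not exhaust the text of a tweet, so that their sum varies across tweets.

```python
from collections import defaultdict
from statistics import mean

CSR_DIMS = ("SOCIAL", "ECO", "ENV", "MIX")

def content_indicator(tweets):
    """Sum the CSR topic proportions per tweet, then average per firm."""
    per_firm = defaultdict(list)
    for t in tweets:
        per_firm[t["firm"]].append(sum(t[d] for d in CSR_DIMS))
    return {firm: mean(scores) for firm, scores in per_firm.items()}

# Hypothetical tweet-level topic proportions (remaining share: non-CSR topics)
tweets = [
    {"firm": "A", "SOCIAL": 0.2, "ECO": 0.1, "ENV": 0.3, "MIX": 0.1},
    {"firm": "A", "SOCIAL": 0.0, "ECO": 0.1, "ENV": 0.0, "MIX": 0.1},
    {"firm": "B", "SOCIAL": 0.4, "ECO": 0.2, "ENV": 0.2, "MIX": 0.1},
]

content = content_indicator(tweets)
```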

As for the modality pillar, we consider as elementary indicators the presence of hashtags, mentions, media, and links (binary variables), assuming them to be substitutes (compensatory aggregation) with the same importance (no weight). For each tweet, we sum these individual indicators, obtaining a score between 0 and 4. We then aggregate these scores at the firm level by computing the arithmetic mean (modality indicator).
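A minimal sketch of the modality indicator follows: each count is turned into a presence flag, the flags are summed per tweet (0 to 4), and the scores are averaged per firm. The tweet records are hypothetical.

```python
from collections import defaultdict
from statistics import mean

MODALITY = ("#", "@", "MEDIA", "LINK")

def modality_indicator(tweets):
    """Per-tweet media richness score (0-4), averaged at the firm level."""
    per_firm = defaultdict(list)
    for t in tweets:
        # Presence (binary) of each element, regardless of how many times it occurs
        score = sum(1 for k in MODALITY if t[k] > 0)
        per_firm[t["firm"]].append(score)
    return {firm: mean(scores) for firm, scores in per_firm.items()}

# Hypothetical tweet-level counts of hashtags, mentions, media and links
tweets_modality = [
    {"firm": "A", "#": 3, "@": 0, "MEDIA": 1, "LINK": 1},  # score 3
    {"firm": "A", "#": 0, "@": 0, "MEDIA": 0, "LINK": 1},  # score 1
    {"firm": "B", "#": 1, "@": 2, "MEDIA": 1, "LINK": 1},  # score 4
]

modality = modality_indicator(tweets_modality)
```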

Once the modality and the content indexes are constructed at the firm level, it is necessary to combine them to obtain the SM-INDEX. In this case we propose to apply the Mazziotta–Pareto index (MPI), which is partially compensatory, recognizing that the two dimensions are equally important but only partially substitutable in achieving efficient CSR communication. Indeed, a deficiency in the content can be partially compensated by effective communication (and vice versa). The MPI is based on a non-linear function that, starting from the arithmetic mean of the normalized indicators, introduces a penalty for units with unbalanced indicators [38]. Denoting by i the firm and j the pillar (content and modality) and given the data matrix \(X = \{x_{ij}\}\), to compute the MPI, we proceed first with standardization as follows

$$\begin{aligned} z_{ij}=100+\frac{x_{ij}-M_{xj}}{S_{xj}}\cdot 10 \quad i=1,\ldots ,26 \quad j=1,2 \end{aligned} \tag{1}$$

where \(M_{xj}\) and \(S_{xj}\) denote the mean and standard deviation, computed across firms, of pillar j (the content and modality indexes). Next, given the positive polarity of the content and modality indicators, we compute the MPI as

$$\begin{aligned} MPI_i=M_{z_i}-S_{z_i}\cdot cv_{z_i} \quad i=1,\ldots ,26 \end{aligned} \tag{2}$$

where z refers to the standardized data as in (1) and \(M_{z_i}\), \(S_{z_i}\), \(cv_{z_i}\) denote the mean, standard deviation, and coefficient of variation of the normalized values for company i, respectively. Figure 4 summarizes the aggregation approach described above.
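Equations (1) and (2) can be sketched in a few lines. This is an illustrative implementation rather than the authors' code; it assumes the population standard deviation (division by the number of observations), positive polarity for all indicators, and a small hypothetical data matrix.

```python
import numpy as np

def mpi(X):
    """Mazziotta-Pareto index with negative penalty (positive-polarity indicators).

    X: units in rows, indicators in columns.
    """
    X = np.asarray(X, dtype=float)
    # Eq. (1): standardize each indicator to mean 100 and SD 10 across units
    Z = 100 + 10 * (X - X.mean(axis=0)) / X.std(axis=0)  # population SD: an assumption
    # Eq. (2): row-wise mean minus a penalty that grows with the imbalance
    M_z = Z.mean(axis=1)
    S_z = Z.std(axis=1)
    cv_z = S_z / M_z
    return M_z - S_z * cv_z

# Hypothetical content and modality indexes for three firms
X = [[1.0, 30.0],   # low content, high modality (unbalanced)
     [2.0, 20.0],   # average on both (balanced)
     [3.0, 10.0]]   # high content, low modality (unbalanced)

print(mpi(X))
```

In this example all three firms have the same mean of standardized values (100), but the two unbalanced firms are penalized and score 98.5, while the balanced firm keeps 100.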

Fig. 4 Composite Indicator aggregation strategy (second layer)

4.2.3 The third layer: development of an augmented information composite indicator

Considering the SM-INDEX, developed in the second layer, and the TRAD-INDEX, already available, we build a combined innovative smart indicator (SMART-INDEX). The TRAD-INDEX is standardized before the combination, while the SM-INDEX is not, being the aggregation output of previously standardized indicators. For the aggregation of SM-INDEX and TRAD-INDEX, we propose to apply the MPI, considering the positive polarity of the indicators (Fig. 5). Indeed, we assume that the two dimensions are partially compensatory, i.e., efficient communication might partially compensate low effective commitment and high effective commitment might compensate scarce communication.
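The third-layer aggregation can be sketched as below, under the same assumptions as before (population standard deviation, positive polarity). Only the TRAD-INDEX is standardized; the SM-INDEX enters as is, because it is already expressed on the standardized scale. All values are hypothetical.

```python
import numpy as np

def standardize(x):
    """Standardize one indicator to mean 100 and SD 10 (population SD: an assumption)."""
    x = np.asarray(x, dtype=float)
    return 100 + 10 * (x - x.mean()) / x.std()

def mpi_from_standardized(Z):
    """Penalized mean of already standardized indicators (units in rows)."""
    Z = np.asarray(Z, dtype=float)
    M_z = Z.mean(axis=1)
    S_z = Z.std(axis=1)
    return M_z - S_z * (S_z / M_z)

# Hypothetical inputs for three firms
sm_index = np.array([98.5, 100.0, 98.5])   # output of the second layer
trad_raw = np.array([40.0, 60.0, 80.0])    # raw score on a 0-100 scale

trad_index = standardize(trad_raw)          # standardized once, here
smart_index = mpi_from_standardized(np.column_stack([sm_index, trad_index]))
```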

The SMART-INDEX measures the commitment of companies towards CSR in a more comprehensive way, considering not only the effective commitment (traditional indicator) but also the effort in online CSR communication (social media indicator).

Fig. 5 Composite Indicator aggregation strategy (third layer)

4.3 Empirical exercise: results

Figure 6 shows the values of the social media-based, traditional and combined indicators for each company. The table with detailed results is available in Appendix C. For the TRAD-INDEX, we report the values standardized according to (1), which are used as input for the Mazziotta–Pareto SMART-INDEX.

The TRAD-INDEX is very similar across all companies, except for Boeing, Chevron, Honeywell, and McDonald's, for which it is particularly low and below 100, indicating a low level of effective commitment. A rationale behind this similarity is that the index is constructed considering mainly compliance with laws and regulations on CSR reporting, which is nowadays a common practice for most companies. The SM-INDEX discriminates better among firms in terms of their communication about CSR commitment.

The combination of the two indicators provides an innovative measure of CSR commitment and communication effectiveness, giving additional insights to researchers. Table 2 in Appendix C provides the ranking of firms based on the SM-INDEX, TRAD-INDEX, and the SMART-INDEX, respectively. Generally, firms that rank highly on the SM-INDEX place low on the TRAD-INDEX (and vice versa). Companies in the services sector (e.g., Technology and Health Care) have a higher position on the SM-INDEX and a lower position on the TRAD-INDEX. A possible explanation could be that firms in the services sector have a high need for communication via their websites, whereas firms in other sectors do not. This may be because other methods of communicating sustainability are possible when offering a consumer product (such as information on the package).
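A comparison of rankings of this kind can be sketched as follows. The index values are hypothetical and are not taken from Appendix C; the rank correlation is computed as the Pearson correlation of the ranks, which equals Spearman's coefficient only in the absence of ties (an assumption of this example).

```python
import numpy as np

def ranks_desc(x):
    """Rank units from 1 (highest value) to n (lowest); ties are not handled."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(-x)
    ranks = np.empty(len(x), dtype=int)
    ranks[order] = np.arange(1, len(x) + 1)
    return ranks

def rank_correlation(x, y):
    """Spearman-type correlation: Pearson correlation between the two rankings."""
    return float(np.corrcoef(ranks_desc(x), ranks_desc(y))[0, 1])

# Hypothetical index values for four firms
sm = np.array([110.0, 95.0, 102.0, 90.0])
trad = np.array([92.0, 104.0, 99.0, 105.0])

print(ranks_desc(sm), ranks_desc(trad), rank_correlation(sm, trad))
```

A coefficient close to -1, as in this constructed example, would correspond to firms ranking high on one index and low on the other.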

Because the SM-INDEX and the TRAD-INDEX are weighted equally, the SMART-INDEX provides a middle ground between the two. Nevertheless, researchers may decide to use a different weighting strategy according to their practical and theoretical evaluations [40].

Fig. 6 Social media-based (SM-INDEX), traditional (TRAD-INDEX) and smart indexes

The quality of the resulting innovative CIs (SM-INDEX and SMART-INDEX) can be difficult to assess, as there is no benchmark to compare them to. Further analyses, such as uncertainty and sensitivity analyses, can help understand how methodological choices in the construction of the indices affect the results [46]. However, such approaches should be extended in order to take into account emerging aspects of novel data sources (such as the selection of social media accounts, data pre-processing, and the analytical methods used to transform unstructured data into structured data) and the multi-source nature of the process [47]. An overview of these issues is presented in the following section.

5 Quality considerations

The evaluation of the quality of new socio-economic indicators is of extreme importance to allow their use. This is especially crucial in the case of smart indicators that use innovative data sources.

With specific reference to CIs, the concept of quality is strictly related to the robustness of the CIs with respect to the decisions made at each analytical step. Traditionally, this refers to normalization methods, weighting approaches, and the evaluation of uncertainty in the weights of sub-indicators [48,49,50]. Robustness is evaluated based on the ability of the indicator to generate accurate and consistent measures, as well as to effectively differentiate between units in terms of their scores or rankings [49]. In this respect, the literature presents various possible procedures for evaluating robustness, mainly uncertainty analysis (UA) and sensitivity analysis (SA). UA focuses on how uncertainty in the input factors propagates through the structure of the CI and influences its value. SA studies how much each individual source of uncertainty contributes to the output variance. For a general discussion of the procedures, please refer to Saisana et al. [46].
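As an illustration of the idea behind UA, the sketch below propagates uncertainty in the weights through a weighted arithmetic mean and records the range of ranks each unit obtains. It is a simplified Monte Carlo exercise, not the procedure of Saisana et al. [46]: the Dirichlet distribution for the weights, the number of draws, the aggregation function, and the data are all assumptions of the example.

```python
import numpy as np

def rank_uncertainty(Z, n_draws=1000, seed=0):
    """Monte Carlo over random weights; returns min, median and max rank per unit."""
    rng = np.random.default_rng(seed)
    Z = np.asarray(Z, dtype=float)
    n_units, n_indicators = Z.shape
    ranks = np.empty((n_draws, n_units), dtype=int)
    for d in range(n_draws):
        w = rng.dirichlet(np.ones(n_indicators))  # random weights summing to 1
        scores = Z @ w                            # weighted arithmetic mean
        order = np.argsort(-scores)
        ranks[d, order] = np.arange(1, n_units + 1)
    return ranks.min(axis=0), np.median(ranks, axis=0), ranks.max(axis=0)

# Hypothetical standardized indicators: unit 0 dominates, units 1 and 2 trade places
Z = [[110.0, 110.0],
     [100.0, 90.0],
     [90.0, 100.0]]

lowest, median, highest = rank_uncertainty(Z)
print(lowest, median, highest)
```

A unit whose rank interval is narrow is robust to the weighting choice; wide intervals signal that the ranking depends heavily on it.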

In addition to these traditional quality aspects, when working with unstructured data or non-traditional data sources, new quality considerations arise. For example, results may be affected by data extraction techniques (e.g. selection of social media accounts, scraped web-pages), pre-processing (e.g. data cleaning) and analytical choices (e.g. machine learning methods to extract the information).

Thus, when evaluating smart CIs, there are two key aspects to consider: the quality of the CI itself and the multi-source nature of the integration process. The quality of CIs depends on three factors: the quality of (1) the basic data, (2) the procedures used to compute the indicators, and (3) the procedures used to disseminate them [51]. According to [51], poor CIs result from inaccurate or non-credible data sources, wrong choices of individual indicators (lack of a theoretical background on the phenomena of interest), inconsistent approaches at the various construction steps (e.g., standardization, aggregation, weighting), lack of robustness analysis, poor description of the indicator construction, and incorrect presentation of the results.

Furthermore, when dealing with multi-source statistics, further examinations must be conducted to ensure a comprehensive assessment. In the literature, various frameworks have been proposed for the assessment of quality in multi-source statistics [28, 47, 52,53,54]. It is evident that when integrating heterogeneous sources, a critical aspect is the assessment of the input and output quality throughout the integration process [52]. To this end, Reid et al. [54] propose a three-phase approach where quality is evaluated in relation to: (1) the single data source, (2) the integrated data-set and (3) the output.

Given these premises, we propose to adopt a life-cycle perspective that considers quality evaluation across all analysis steps [55] and to integrate the quality evaluation of the smart index (based on multi-source data-sets) into the general framework presented in the paper. The above-mentioned aspects (quality and multi-source nature) can be easily allocated to the three-layer structure of the modular framework that we propose. The following paragraphs briefly discuss how to incorporate them in each of the three layers. In a future study, we aim to provide a more comprehensive discussion, incorporating a detailed worked example on quality evaluation.

5.1 Preliminary evaluations

Before engaging in the construction of smart composite indicators, it is important to define the theoretical framework that describes the multi-dimensional phenomenon under investigation. Subsequently, an evaluation of the suitable data sources becomes necessary, considering their characteristics and ability to measure specific aspects of the phenomenon. For instance, researchers can take into account dimensions such as relevance, credibility, accessibility, and timeliness as quality criteria to justify the selection of these sources.

5.2 Layer 1: From unstructured to structured data

Following the setting of the paper and focusing on the innovative data source, this step involves evaluating the quality of both the input data (unstructured) and the output (structured-elementary indicators). The definition of quality depends on the specific data sources, as discussed in Sect. 2. For instance, when analyzing Twitter data, it is possible to refer to Salvatore et al. [25]. Generally, aspects related to the data retrieval strategy (e.g., search query, selection of social media accounts or web pages to scrape) as well as the completeness, timeliness, and coverage of the data source should be assessed and well-documented.

When dealing with unstructured textual data, various steps need to be taken to transform it into structured information, which serves as the elementary indicators (output of the first layer). These steps involve data pre-processing (cleaning and dimensionality reduction) and the implementation of machine learning algorithms such as sentiment analysis and topic modeling to extract the relevant information. Every decision made during the pre-processing and analysis phase might have an impact on the value of the resulting elementary indicators. Therefore, it is highly recommended to conduct a sensitivity analysis to assess the stability of the outcomes.

To summarize, in the first layer of analysis, researchers should provide quality indicators or evaluations related to the data source, data selection, data pre-processing, and analyses.
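
As an illustration of the sensitivity analysis recommended for the first layer, the following minimal sketch recomputes a toy elementary indicator under every combination of two pre-processing choices and reports how much the result moves. The lexicon, stop-word list, example posts, and all function names are hypothetical and introduced only for illustration; they are not the procedure used in the paper, and a real analysis would vary the actual retrieval, cleaning, and modeling options.

```python
from itertools import product

# Hypothetical toy lexicon and stop-word list (illustrative assumptions only).
LEXICON = {"good": 1, "great": 1, "bad": -1, "poor": -1}
STOPWORDS = {"the", "is", "a", "and", "but"}


def preprocess(text, lowercase, drop_stopwords):
    """Tokenize on whitespace and apply the selected cleaning options."""
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if drop_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return tokens


def sentiment_score(tokens):
    """Mean lexicon polarity over all tokens (0.0 for an empty document)."""
    if not tokens:
        return 0.0
    return sum(LEXICON.get(t, 0) for t in tokens) / len(tokens)


def elementary_indicator(texts, lowercase, drop_stopwords):
    """Average sentiment of a unit's posts under one pre-processing variant."""
    scores = [
        sentiment_score(preprocess(t, lowercase, drop_stopwords)) for t in texts
    ]
    return sum(scores) / len(scores)


def sensitivity(texts):
    """Compute the indicator under every variant and report its range."""
    results = {
        (lc, sw): elementary_indicator(texts, lc, sw)
        for lc, sw in product([False, True], repeat=2)
    }
    spread = max(results.values()) - min(results.values())
    return results, spread


# Hypothetical posts for a single unit.
posts = [
    "The service is Good",
    "a bad and poor product",
    "GREAT support but bad price",
]
results, spread = sensitivity(posts)
for (lc, sw), value in sorted(results.items()):
    print(f"lowercase={lc!s:5} drop_stopwords={sw!s:5} indicator={value:+.3f}")
print(f"range across variants: {spread:.3f}")
```

A wide range across variants would signal that the elementary indicator depends strongly on pre-processing decisions, which should then be documented alongside the indicator.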

5.3 Layer 2: Statistical information and indicators

The second layer focuses on the construction of the innovative indicators. Traditional UA and SA can be applied to evaluate the robustness of the resulting indicators. However, in addition to classical aspects (standardization, aggregation, weighting, inclusion/exclusion of elementary indicators), it is crucial to incorporate the elements identified in the first layer (e.g., by comparing the results for different combinations of data retrieval, cleaning, pre-processing and analytical strategies).

Consequently, as part of the quality evaluation, researchers should provide robustness analyses for both sub-indicators and intermediate indicators, considering the tasks performed in both the first and second layers.
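
The kind of robustness analysis described for the second layer can be sketched as follows: a composite score is recomputed under alternative normalization and weighting choices, and the range of ranks obtained by each unit is reported. This is a simplified illustration under stated assumptions, not the paper's procedure: the data values, the two weighting schemes, and all names are hypothetical, aggregation is a weighted arithmetic mean only, and each elementary indicator is assumed to vary across units (otherwise the normalizations below would divide by zero).

```python
from itertools import product
from statistics import mean, pstdev

# Hypothetical elementary indicators for four units (illustrative values only).
DATA = {
    "A": [0.9, 0.2, 0.6],
    "B": [0.5, 0.5, 0.5],
    "C": [0.1, 0.9, 0.7],
    "D": [0.3, 0.4, 0.2],
}


def min_max(column):
    """Rescale a column to [0, 1]; assumes the column is not constant."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]


def z_score(column):
    """Standardize a column; assumes a non-zero standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(x - mu) / sigma for x in column]


# Hypothetical alternatives to compare.
NORMALIZERS = {"min_max": min_max, "z_score": z_score}
WEIGHTS = {"equal": [1 / 3, 1 / 3, 1 / 3], "first_heavy": [0.5, 0.25, 0.25]}


def composite(data, normalizer, weights):
    """Weighted arithmetic mean of the normalized elementary indicators."""
    units = list(data)
    columns = zip(*(data[u] for u in units))
    normalized = [normalizer(list(col)) for col in columns]
    return {
        u: sum(w * col[i] for w, col in zip(weights, normalized))
        for i, u in enumerate(units)
    }


def ranks(scores):
    """Rank units from best (1) to worst by composite score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {u: r for r, u in enumerate(ordered, start=1)}


def rank_ranges(data):
    """Best and worst rank of each unit across all scenario combinations."""
    all_ranks = [
        ranks(composite(data, NORMALIZERS[n], WEIGHTS[w]))
        for n, w in product(NORMALIZERS, WEIGHTS)
    ]
    return {
        u: (min(r[u] for r in all_ranks), max(r[u] for r in all_ranks))
        for u in data
    }


rr = rank_ranges(DATA)
for unit, (best, worst) in rr.items():
    print(f"unit {unit}: best rank {best}, worst rank {worst}")
```

In a multi-source setting, the scenario grid would also include the first-layer choices (e.g., retrieval and pre-processing variants), so that the rank ranges reflect decisions made in both layers.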

5.4 Layer 3: Data augmentation

This step involves the calculation of the final smart composite indicator. As part of the quality evaluation, researchers should provide a comprehensive robustness analysis, considering not only the tasks performed in the third layer but also those carried out in the preceding layers. By considering the entire process, a comprehensive assessment of the indicator’s reliability and robustness can be obtained.

We leave the development of a comprehensive framework for quality evaluation to a future study.

6 Conclusions

The availability of new sources of data, such as social media, provides an excellent opportunity for augmenting business statistics and examining new aspects of phenomena of interest. As a means of augmenting the data, we propose a modular general framework organized in three layers that defines the tasks and the outputs of each block. In this study, we focus on the case of the construction of CIs based on the combination of traditional and digital textual data to derive smart augmented statistical indicators.

The second part of the paper shows how the proposed methodology can be applied to real data. The specific empirical exercise of measuring CSR showed that traditional and social media-based indicators measure different aspects of the phenomenon, and that enriched information is derived through data augmentation. The resulting smart index provides an innovative measure of CSR commitment and communication effectiveness.

This application can serve as a prototype. A similar modular approach and CI methodological framework can be applied to other contexts. As an innovative aspect, we also use Twitter metadata to enhance the information and construct the SM-INDEX. Our paper shows how useful it can be to include such metadata in the construction of a statistical composite indicator. Metadata usage is an emerging topic and more research is required to better understand the opportunities and statistical challenges resulting from their use.

A single digital data source was considered to augment traditional data in this paper. However, the proposed framework allows the consideration of multiple data sources. For example, researchers may supplement traditional data with website information, social media posts, and newspaper articles. Further research will be conducted in this area in the future.

It is worth noting that the proposed approach relies on the possibility of identifying the units under investigation in each data source. This is comparatively easy for business surveys and very difficult when the units are individuals. For example, for businesses it is possible to identify their social media accounts or websites. In scenarios where identifying the individual units is not feasible, but aggregated data are available (for example by sector or other characteristics), the modular approach in layers can be adapted and implemented. This direction of research would require specific attention and could be the topic of further investigations.

A key aspect of the modular general framework in layers is its flexibility, enabling researchers to explore various methodological variations, propose enhancements to specific modules, and assess the sensitivity of the results at each stage.

The paper also outlines and discusses the statistical challenges and errors arising throughout the entire production process, from identification of the units of interest in the digital data source to data collection, pre-processing, analysis, and data augmentation. Further, it highlights the importance of evaluating the quality of the innovative indicators. In fact, in addition to traditional quality dimensions and techniques, this necessitates the identification of specific quality dimensions that are relevant to the data source and use case.

We consider the quality of CIs from a wider perspective and we discuss how the proposed modular structure, organized in layers, facilitates its evaluation by allowing for the assessment of both input and output data/indicators at each layer. This design enables a comprehensive evaluation process throughout the various stages of the analysis. Traditional robustness analyses do not take into account the multi-source nature of the integration process or the use of unstructured data as the basis for constructing the indicator. When assessing the quality of the output, it becomes crucial to take into account the multi-source nature of the process, as new aspects related to data quality and the impact of analytical choices emerge [47, 52]. The paper provides an overview of these aspects, while an ongoing study will delve into them in more detail, presenting a quality framework for the layered approach. In contrast to existing studies, which mainly focus on registers and administrative data, our approach considers innovative sources that provide unstructured data.