Introduction

Approximately half of the studies appearing in JIBS between 1970 and February 2022 include some form of quantitative methodology, such as a generalized linear model, an ordinary least squares (OLS) model, or other analytical tools to process either primary survey or secondary quantitative data. Most of these (close to 1500 out of 2000 quantitative studies) use regression analysis, and their empirical rigor has increased substantially over time. JIBS has also published several editorials, perspective pieces, and research articles aimed at improving quantitative empirical analysis. These have dealt with the types of data used (Cerar, Nell, & Reiche, 2021); the potential benefits of qualitative comparative analysis (Fainshmidt, Witt, Aguilera, & Verbeke, 2020); research study design and the reporting of statistical results, including p-hacking (Meyer, van Witteloostuijn, & Beugelsdijk, 2017); multilevel models (Peterson, Arregle, & Martin, 2012); and collinear variables (Lindner, Puck, & Verbeke, 2020).

JIBS has not been the only scholarly journal in business and management to provide advice on how to navigate the complexities of collinear variables. Kalnins (2018) did so in the Strategic Management Journal, and went on in 2022 to build on that analysis, formulating guidance specifically for IB research that challenges the advice of Lindner et al. (2020). Kalnins (2022) contends that the best way to handle collinearity in IB research critically depends on how the underlying data are generated. IB researchers face a broad range of data generation processes, and Kalnins (2022) argues that, in many cases, they must deal with what he calls a common factor data generation process, which differs from the canonical data generation process typically assumed in the analysis of regression models in econometrics textbooks. If researchers face a common factor data generation process, collinearity can reinforce bias. This occurs, for instance, when the values of the collinear variables are driven by an unobserved third variable.

Collinearity has received substantial attention in IB research because the many dimensions of IB phenomena, and the multiple interdependent variables affecting outcomes, make a clean experimental treatment with a split between treatment group and control group almost impossible. The complexity of the global business environment means that, in almost every IB study, several interrelated dimensions will individually and jointly influence the outcome variables of interest. Lindner et al. (2020) provide guidance on how to navigate the complexities of collinearity in regression analysis. They recommend not paying too much attention to variance inflation factors when specifying empirical models. They also suggest that, except in extreme cases, researchers should ignore pairwise correlations and include a high number of credible control variables in their regression analysis.

How can the conflicting recommendations of Kalnins (2018, 2022) and Lindner et al. (2020) be reconciled? Kalnins (2018, 2022) cautions researchers against including variables with high partial correlation, whereas, for Lindner et al. (2020), nonzero pairwise correlation usually suggests including rather than excluding variables. The difference between the two arises from different assumptions about the underlying data generation process. Lindner et al. (2020) assume that researchers are dealing with a canonical data generation process. In contrast, Kalnins (2022) claims that, in IB research, a non-canonical common factor data generation process typically prevails.1 In the latter case, the Gauss–Markov assumptions underlying regression analysis are not met, and regression results are likely biased. Collinearity can further reinforce the bias. We agree with Kalnins (2022) that most IB research cannot be characterized by a ‘clean’ canonical data generation process, but we further believe that IB research does not usually face only the complications caused by the common factor data generation process. Rather, we argue that it is difficult, if not impossible, for IB researchers to know the true underlying generation process of the data with which they are working, because IB research questions typically involve simultaneous, multilevel influences (Peterson et al., 2012) and dynamic dependencies (Li, Ding, Hu, & Wang, 2021), as well as other complexities. The statistical solution to such complexities typically entails reducing regressions to a canonical data generation process, which is why we focused on standard textbook solutions in our earlier analysis (Lindner et al., 2020). Nevertheless, Kalnins’ (2022) perspective serves as an impetus to provide more comprehensive guidance on how to avoid collinearity bias across different (and often unknown) data generation processes.

Eliminating bias, or at a minimum addressing it, is an important goal in IB research. That can be done in two ways. First, one can rigorously apply the existing methodological apparatus. A standard, almost trivial, recommendation is for researchers to be particularly careful when selecting variables to include in a regression (Aguinis, Cascio, & Ramani, 2017). More important, a comprehensive, wide-ranging literature review helps in selecting the best control variables (Nielsen & Raswant, 2018), and, following practice in finance and economics, robustness checks should go beyond selecting variables and testing their influence on regression results.

Second, IB researchers can broaden their palette of research methods and include, for instance, cross-validation based on machine learning. Machine learning and artificial intelligence are increasingly being adopted in both business practice and research (e.g., Krakowski, Luger, & Raisch, 2022; Raisch & Krakowski, 2021), and are likely also to be used more in IB scholarly work. Assessing how well a statistical model explains a validation set of data can also help avoid substantial bias in model coefficients. The main goal is to improve the power of the methodology by using in-sample and out-of-sample data.

In the following, we suggest how researchers should proceed when they do not have ex ante knowledge of the generation process of their data.

The discussion of multicollinearity in IB research

Kalnins (2018, 2022) refers to the data generation process as the critical contingency determining whether including or excluding a collinear variable introduces bias in an OLS regression. The data generation process reflects the underlying mechanism assumed to drive the observed dependent variable and other relevant parameters. The dependent variable is seen as a function of a number of observed independent variables and, potentially, of unobserved ones. Quantitative research in business and management is often performed without much insight into how exactly the datasets used were generated. Researchers tend to think about the data generation process mostly in terms of the underlying theoretical mechanisms that relate independent to dependent variables. For instance, in IB research, we might think of it as establishing the real-world relationship between inflation rates and exchange rates, as predicted by the theory of purchasing power parity, or the relationship between behavioral uncertainty and asset specificity, as predicted by internalization theory.

Ultimately, we use the notion ‘data generation process’ as a generic term for any process that creates data. Ideally, it should allow for investigating mathematically the statistical properties of well-defined regression equations. However, IB researchers do not generally understand in a precise mathematical way how their data were generated. Rather, they only observe and try to understand the structure of a dataset once it has been collected. Typically, the data structure observed cannot be cleanly represented by any single data generation process. Given this, we argue below that quantitative empirical IB research must go beyond addressing concerns that result from specific, narrow assumptions about the data generation process. Nevertheless, we align our more practice-oriented guidance with Kalnins’ (2018, 2022) recommendations about regressions under the canonical and common factor data generation processes. The effects of collinearity on bias in regression results will differ, depending on whether the regression fulfills the Gauss–Markov assumptions, as is the case with canonical data generation processes and as assumed in most econometrics textbooks.

If the regression equation fulfills the Gauss–Markov assumptions, standard statistical software will compute unbiased regression results, as outlined in Lindner et al. (2020). The Gauss–Markov assumptions require that the regression equation be fully specified: the model includes all relevant independent variables, and their influence on the dependent variable is linear. As noted above, Kalnins (2022) refers to this as the canonical data generation process. In that case, the inclusion of additional variables, regardless of whether they affect the dependent variable, does not introduce bias in the regression results, even if they are collinear with the other variables entered in the regression.
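To make this concrete, consider the following minimal simulation sketch (our illustration, with hypothetical variable names, not code from any of the cited papers). The data follow a canonical process in which only x1 affects y; adding the collinear but irrelevant x2 leaves the coefficient estimates essentially unbiased, although the standard errors inflate:

```python
# Canonical data generation process: y depends only on x1, and x2 is a
# collinear but irrelevant variable. Illustration with simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 10_000

x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=n)  # highly collinear with x1
y = 2.0 * x1 + rng.normal(size=n)              # true model: only x1 matters

# Regress y on both variables, including the irrelevant collinear x2.
X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()
print(fit.params)  # estimate on x1 stays close to 2, on x2 close to 0
print(fit.bse)     # standard errors are larger than in a model without x2
```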

However, data can be generated in many ways, leading to different data structures. Kalnins (2018, 2022) focuses on one special case where the Gauss–Markov assumptions are not met: the common factor data generation process. In that case, an unobserved but relevant independent variable affects the other independent variables included in the regression equation. The regression equation is therefore incompletely specified, because an important determinant of the dependent variable is excluded.2 Incomplete specification introduces bias in the regression results. The central point Kalnins (2018, 2022) makes is that, under these circumstances, results may be spurious. More specifically, two collinear variables can both be driven by an unobserved third variable which is excluded from the regression equation. This unobserved common factor can, under the particular circumstances shown by Kalnins (2018, 2022), reinforce bias in the coefficient estimates of the collinear variables.
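The following sketch (again our own illustration rather than Kalnins’ code) simulates the common factor process described in note 1: an unobserved factor drives both observed regressors, and omitting it from the regression biases their estimated coefficients:

```python
# Common factor data generation process: an unobserved factor x_n drives
# both observed regressors x1 and x2 as well as the outcome y.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 10_000

x_n = rng.normal(size=n)                   # unobserved common factor
x1 = x_n + rng.normal(scale=0.5, size=n)   # both observed variables load
x2 = x_n + rng.normal(scale=0.5, size=n)   # on the common factor
y = 1.0 * x_n + 0.5 * x1 + 0.5 * x2 + rng.normal(size=n)

# The estimable model omits x_n; its effect loads onto the collinear x1
# and x2, so their coefficients are biased away from the true value of 0.5.
fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(fit.params)
```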

In much applied IB research, it is difficult to ascertain which type of generation process underlies a dataset. It is likely that the process will not be fully canonical: the relationships will not be linear, the model will not include all relevant variables, and some unobserved variables will be correlated with the observed ones. Rather, the data may show characteristics of several types of data generation processes. For example, answering IB research questions often requires including influences at several levels, making it necessary to consider the hierarchical structure of the data (Peterson et al., 2012). Using Kalnins’ (2018, 2022) terminology, we could call these hierarchical dependencies a hierarchical data generation process. Similarly, we can think of dynamic panel models as estimation models to analyze data based on a time-dynamic data generation process (Li et al., 2021). The statistical treatments outlined in Peterson et al. (2012) and Li et al. (2021), for hierarchical and dynamic data generation processes, respectively, aim to deal with such deviations from the canonical data generation process. A hierarchical model, for instance, allows us to interpret the coefficients of a regression analysis in the way prescribed by standard statistics textbooks.

The complexity of IB phenomena makes it difficult to specify regression equations that can accommodate all potential deviations from a canonical data generation process. One would prefer to work with data based unambiguously on a canonical data generation process, but this is not always possible in IB research. Even more complex data generation processes, such as the common factor or the hierarchical one, are unlikely to accurately represent the data used in IB research. Consequently, rather than focusing on specific data generation processes, we suggest that IB researchers use a broader methodological palette.

We summarize below the approaches that can be used to obtain unbiased results, considering the different data generation processes as well as other challenges frequently adding a layer of complexity to quantitative data analysis. What matters most is being aware of the drawbacks of standard statistical analysis and adopting a broad approach that includes insight from the academic literature and from practitioner accounts.

Rigorously applying the established methodological palette

Three well-established approaches can contribute to achieving unbiased data analysis: (1) identifying comprehensive sets of control variables; (2) considering all relevant layers of data; and (3) carefully checking the robustness of empirical results.

A systematic review of the prior literature and of practitioner accounts allows researchers to specify a nearly complete empirical model. Consideration should be given to all variables affecting a dependent variable that have been identified in prior research by credible authors and other experts on the subject matter. Including all relevant control variables minimizes contamination between independent and dependent variables (Bernerth & Aguinis, 2016), and reduces the risk of making type-II errors (Nielsen & Raswant, 2018). The more fully specified an empirical model, the more closely its statistical results will correspond to the textbook standard. Ideally, researchers will include documentation of how relevant control variables were identified (Banks, Rogelberg, Woznyj, Landis, & Rupp, 2016). Looking at pairwise correlations in prior research also makes it possible to gauge whether one is likely to face a common factor data generation process (Kalnins, 2018). In addition, including all variables identified as relevant in prior research can provide insight into which earlier findings can be replicated (Aguinis et al., 2017), and where new research could extend such findings (Cuervo-Cazurra, Andersson, Brannen, Nielsen, & Reuber, 2016).

For many IB research questions, the challenge is to address interactions across several levels. For instance, internalization is driven by country-level factors, such as environmental uncertainty, expropriation risks, and institutional voids, as well as transaction-level factors, such as asset specificity and partner reliability (e.g., Rugman & Verbeke, 2003; Verbeke, 2003). IB researchers may therefore want to consider relevant findings from other disciplines, such as political science and economic geography, and include variables from adjacent social sciences. They also need to consider the fact that some variables influence the dependent variable on different levels, and hence that their statistical model must have the capacity to handle such complexity. Variables on different levels can interact, highlighting again the need for careful treatment of cross-level collinearity. Hierarchical models have received some attention in IB research (Peterson et al., 2012), but, so far, most applications have been limited to capturing unobserved level-2 variation (typically the country-level). Random coefficients can exploit variation in effects across countries rather than just control for unobserved country-level variation (e.g., Alcácer, Chung, Hawk, & Pacheco-de-Almeida, 2018; Lindner, Puck, & Doh, 2021).
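As a hedged illustration of such a specification, the following Python sketch estimates a random-intercept, random-slope model on simulated placeholder data; Stata’s mixed command and R’s lme4 package provide equivalent estimators:

```python
# Random-coefficients (multilevel) model: the effect of x is allowed to
# vary across countries instead of being fixed at one pooled value.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated placeholder data: 30 countries with 50 firms each, and a
# country-specific slope for x.
rng = np.random.default_rng(2)
countries = np.repeat(np.arange(30), 50)
slopes = rng.normal(1.0, 0.3, size=30)[countries]
x = rng.normal(size=countries.size)
y = slopes * x + rng.normal(size=countries.size)
df = pd.DataFrame({"y": y, "x": x, "country": countries})

# Random intercept and random slope for x across countries.
model = smf.mixedlm("y ~ x", df, groups=df["country"], re_formula="~x")
print(model.fit().summary())
```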

Robustness checks are now commonplace, but paying even more attention to them could further contribute to reducing bias and increasing reliability. Trying alternative model specifications and different measures for central constructs can reduce potential bias from collinear variables. Kalnins (2018) has suggested including or excluding particular variables from the analysis. We would also suggest trying out specifications that capture unobserved variation on different levels, using different measures of collinear independent variables, and investigating the predictive power of different models for out-of-sample projections. Finally, to ensure the credibility and legitimacy of such robustness checks, we encourage IB scholars to share their data, as outlined, for instance, in the JIBS policy on data access and research transparency, to permit the reproducibility of results (Beugelsdijk, van Witteloostuijn, & Meyer, 2020).

The next step: broadening the methodological palette

Regression-based methodologies will always be used with complex data structures that do not correspond to any single data generation process. Given this, we suggest complementing them with more flexible machine-learning-based models of data analysis. Such models have attracted the attention of practitioners because of their surprising success against humans in carefully defined games (e.g., Silver, Hubert, Schrittwieser, Antonoglou, Lai, Guez, Lanctot, Sifre, Kumaran, & Graepel, 2018). They have been used by IB practitioners to support decision-making and to analyze structured and unstructured data (e.g., Kerr & Moloney, 2018), but rarely in IB research (Veiga, Lubatkin, Calori, Very, & Tung, 2000, is a notable exception).3 They could help address the challenges faced when using regression-based methodologies in at least two ways. First, by using cross-validation, IB researchers could distinguish between cases where including collinear variables can reduce or even eliminate bias (Lindner et al., 2020), and instances where it will reinforce bias (Kalnins, 2018, 2022). Second, machine-learning-based methodologies such as random forests and neural networks are largely immune to biases caused by collinearity.

Using Machine Learning to Cross-validate Statistical Models

Cross-validation (Stone, 1974) is one of the defining elements of machine learning (Cawley & Talbot, 2010), although its importance was widely acknowledged long before the development of modern machine-learning methodologies. Cross-validation consists of developing models explaining relationships among variables based on a subset of data, called the training data (typically 60–70% of a dataset), and then testing the model on the testing data (the remaining data points). In K-fold cross-validation, the process splits the data several times into training and testing data, and then identifies the model that performs best in the aggregate (Cawley & Talbot, 2010). An even more refined approach reserves another 20% of the data for a separate validation stage, which is not part of model development and testing, but is used instead for out-of-sample validation of the model obtained on the basis of the training and testing datasets. In cross-validation, the training and testing data are separated, and the validation data are used only when a best-fitting model has emerged (James, Witten, Hastie, & Tibshirani, 2013, p. 176).
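A minimal K-fold sketch in Python (using scikit-learn on simulated placeholder data) illustrates the mechanics:

```python
# 5-fold cross-validation: fit on four folds, score (here R^2) on the
# held-out fold, and aggregate across the five splits.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.0, 0.5, 0.0, -0.3]) + rng.normal(size=500)

cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LinearRegression(), X, y, cv=cv)
print(scores.mean(), scores.std())  # average out-of-sample fit and its spread
```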

This approach to model development contrasts with the usual practice in regression modeling, whereby researchers typically fit models based on the full data sample. Let us assume that a researcher wants to model how firms make foreign direct investments. Following a comprehensive literature review, and considering the hierarchical data structure, the researcher could use 70% of the observations to build an empirical model to this end. The remaining 30% (testing data) could then be used to test how well the regression coefficients obtained from the training data predict the firm-level choices when considering the testing data only. If the need were to arise (because of a poor fit with the testing data), the model for the training data could be adjusted: the goal would be for the model still to represent a good fit for the training data, while having an equivalent, good predictive capacity when applied to the testing data.
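A sketch of this 70/30 procedure might look as follows; the simulated variables are hypothetical stand-ins for actual firm-level FDI data:

```python
# Fit a model of binary FDI choices on 70% of the observations and
# evaluate its predictions on the held-out 30%.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 3))   # e.g., uncertainty, experience, distance
p = 1 / (1 + np.exp(-(X @ np.array([0.8, 0.5, -0.6]))))
invest = rng.binomial(1, p)      # binary investment decision

X_train, X_test, y_train, y_test = train_test_split(
    X, invest, test_size=0.30, random_state=7)

model = LogisticRegression().fit(X_train, y_train)    # training data (70%)
print(accuracy_score(y_test, model.predict(X_test)))  # testing data (30%)
```

If the out-of-sample accuracy is poor, the specification can be adjusted and re-evaluated, as described above.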

More generally, cross-validating empirical models to maximize explanatory power represents a different approach to data analysis than the conventional hypothesis testing approach used by researchers in IB and management more generally. Breiman (2001) has argued that statisticians view data analysis differently than IB and management scholars, with the former tending to focus on building models that perform well in predicting out-of-sample occurrences, whereas the latter typically want to determine whether an independent variable influences a dependent variable in their sample. We think that there is a third approach to data analysis that could close the gap between the two primary perspectives. On the one hand, IB research can use theory-driven reasoning and comprehensive literature reviews to build complete models that isolate specific effects and test these. On the other hand, cross-validating empirical models in terms of model fit can help identify variables that only add marginally to the explanatory power of a model, for instance because they are highly collinear with other variables in the model. If model fit for the set of validation data is substantially lower than that for the training data, coefficients may be biased (in particular because of high collinearity). As a result, researchers can distinguish between cases where adding a variable with high collinearity reduces bias (as outlined in Lindner et al., 2020) and those where adding a collinear variable increases it, as suggested by Kalnins (2018, 2022). Cross-validation can therefore help researchers reduce biases due to collinearity because it tells them whether to include or exclude collinear variables.
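The following sketch (our illustration on simulated common-factor data) implements this decision logic: two specifications, with and without the collinear candidate variable, are compared on held-out validation data:

```python
# Compare the validation fit of a specification that includes a collinear
# variable (x2) against one that excludes it.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 1_000
x_n = rng.normal(size=n)  # unobserved common factor
X = np.column_stack([x_n + rng.normal(scale=0.5, size=n),
                     x_n + rng.normal(scale=0.5, size=n)])
y = x_n + 0.5 * X[:, 0] + rng.normal(size=n)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=3)

full = LinearRegression().fit(X_tr, y_tr)            # includes collinear x2
reduced = LinearRegression().fit(X_tr[:, :1], y_tr)  # excludes x2

# If the full model generalizes better, the collinear variable arguably
# belongs in the specification (Lindner et al., 2020); if it fits the
# training data but validates poorly, inclusion may reinforce bias
# (Kalnins, 2018, 2022).
print("full:", full.score(X_val, y_val))
print("reduced:", reduced.score(X_val[:, :1], y_val))
```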

Machine-learning-based Methods of Data Analysis for IB Research

In addition to cross-validation, other estimation methodologies are available. Machine-learning-based models are particularly interesting, as they do not suffer from collinearity biases. For instance, a random forest model, which consists of a set of decision trees, is a flexible and straightforward estimation method for both continuous and binary dependent variables. In such tree-based models, a sequence of binary sub-decisions explains an outcome. Tree-based models are somewhat similar to qualitative comparative analysis (QCA) (Crilly, 2011; Fiss, 2011; Loane, Bell, & McNaughton, 2007): they use Boolean logic to categorize situations in which combinations of “and” and “or” lead to a given outcome. Although based on a similar logic, a key distinction between the two is that QCA focuses on parsimony and on maximizing the predictive power of a branch of a tree (Fainshmidt et al., 2020), while tree-based models emphasize the simplicity of an explanation (Bauer & Kohavi, 1999). Random forest models are easy to use with standard statistical software, such as the rforest command in Stata or the randomForest package in R. Following Wager and Athey (2018), Miric and Jeppesen (2020), for example, use a random forest model to identify which measures of product popularity influence innovation in products affected by software piracy.
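The same estimator is also available in Python via scikit-learn; the sketch below uses simulated placeholder data:

```python
# Random forest regression: an ensemble of decision trees, each built from
# binary splits, whose predictions are averaged.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 5))
# Outcome with a threshold effect and an interaction, which trees capture
# without these terms being specified in advance.
y = np.where(X[:, 0] > 0, 2.0, -1.0) + X[:, 1] * X[:, 2] + rng.normal(size=1000)

forest = RandomForestRegressor(n_estimators=500, random_state=5).fit(X, y)
print(forest.feature_importances_)  # which variables drive the outcome
```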

Neural networks have received substantial attention in statistical research in recent years. Neural network-based reinforcement learning has been used to train computers to play games (such as Atari 2600 games, chess, Go, and shogi) at the same level as, or even a higher level than, humans (e.g., Mnih, Kavukcuoglu, Silver, Rusu, Veness, Bellemare, Graves, Riedmiller, Fidjeland, & Ostrovski, 2015; Silver et al., 2018; Silver, Schrittwieser, Simonyan, Antonoglou, Huang, Guez, Hubert, Baker, Lai, & Bolton, 2017; Tesauro, 1994). In contrast to regression models or random forest models, neural network-based models do not attempt to solve a specific maximization problem. After completing a learning process, the connections established in the neural network are interpreted as representing the connections in the data (Haykin, 1994, p. 2). Neural networks are somewhat complex to master, but standard solutions for some cases exist in Stata (the mlp2 command) and R (the neuralnet package).
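A hedged sketch of a small feed-forward network, using scikit-learn’s MLPRegressor on simulated placeholder data, is shown below:

```python
# A small feed-forward neural network with two hidden layers. Inputs are
# standardized, since neural networks are sensitive to regressor scale.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)
X = rng.normal(size=(2000, 4))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=2000)

scaler = StandardScaler().fit(X)
net = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=8)
net.fit(scaler.transform(X), y)
print(net.score(scaler.transform(X), y))  # in-sample R^2
```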

Veiga et al. (2000) were the first to introduce neural networks in IB research. They used this methodology to understand the cultural characteristics of French and British firms. Wu, Chen, Yang, and Tindall (2021) used a neural network to analyze which investment funds are particularly profitable. Because neural networks do not suffer from bias when models are incompletely specified, these authors were able to improve the ability of their model to distinguish between successful and less successful investment firms.

How Machine Learning Can Help IB Researchers Craft Better Empirical Models

Both random forest and neural network models can be applied to the types of research questions addressed by regression models. The assumptions behind them are substantially less rigid than the Gauss–Markov assumptions underlying regression models. As a result, predictions derived from these models can be used to check the plausibility of predictions based on regression models with potentially biased coefficients (inter alia because of collinearity). IB researchers can therefore use them for robustness checks, similar to those highlighted above. Doing so combines the strengths of both. For example, researchers can fit a random forest or neural network model to the data they investigate and predict outcomes based on it. At the same time, they can run a regression model to test hypotheses and to predict the dependent variable, and then compare the predictions derived from both. If they roughly match, it is unlikely that individual coefficients in the regression model suffer from biases introduced by collinearity or other complexities.
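The sketch below (our illustration on simulated data) implements this plausibility check by comparing out-of-sample predictions from a regression and a random forest fitted to the same training data:

```python
# If regression and random forest predictions roughly match out of sample,
# collinearity-driven bias in the regression coefficients is less of a concern.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(11)
X = rng.normal(size=(1500, 4))
y = X @ np.array([1.0, -0.5, 0.3, 0.0]) + rng.normal(size=1500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=11)

ols = LinearRegression().fit(X_tr, y_tr)
rf = RandomForestRegressor(n_estimators=300, random_state=11).fit(X_tr, y_tr)

# Correlation between the two prediction vectors on the held-out data.
print(np.corrcoef(ols.predict(X_te), rf.predict(X_te))[0, 1])
```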

Machine-learning-based data analysis can help in conducting quantitative empirical IB research in two ways. First, by using cross-validation, IB researchers may be able to distinguish between cases where including collinear variables resolves bias (Lindner et al., 2020) and those where including them reinforces bias (Kalnins, 2018, 2022). By assessing the predictive power of statistical models using a validation dataset, researchers can identify which model has a better fit. A better-fitting model, when applied to a dataset that was not used for model development, is less likely to suffer from bias, everything else being equal.

Second, data analysis methodologies such as random forests and neural networks are more flexible, because they can handle higher-order effects, including interaction terms between variables, U-shaped relationships, and higher-order combinations. Statistical packages, such as R, Python, and Stata, include pre-packaged code for data analysis using machine learning and are relatively easy to use. Our view is that adopting a broader palette of empirical methods will strengthen the robustness and validity of research results and will help address the challenges posed by collinearity.

Conclusions

The scholarly dialogue between Kalnins (2018, 2022) and Lindner et al. (2020) on the challenges of multicollinearity in IB research shows that the specification of empirical models is critical to supporting or rejecting hypotheses in quantitative empirical research. As Kalnins (2022) points out, the relative severity of the statistical bias caused by including collinear variables in a regression model, or excluding them from it, depends on the extent to which the assumptions underlying regression modeling are met. With a canonical data generation process, as described in most textbook analyses of regression, adding collinear variables will at worst inflate standard errors. In contrast, with a common factor data generation process, adding a collinear variable to a regression model can substantially increase bias in coefficients. However, in most cases, which of the two applies is not obvious. In addition, in most IB research contexts, the data structure is more complex than either of these two data generation processes.

While there are no perfect solutions, IB researchers should make full use of the existing palette of methods to avoid collinearity bias. This includes identifying relevant constructs and controls by conducting a comprehensive literature review and studying practitioner accounts, considering different layers of analysis, and employing a variety of robustness checks to corroborate empirical results. In particular, IB researchers can broaden their command of the palette of empirical methodologies by including machine-learning-based methods, such as random forests and neural networks.

We have spelled out how IB researchers can make sense of the seemingly contradictory recommendations of Kalnins (2018, 2022) and Lindner et al. (2020). The more integrative perspective we outline suggests pathways to building robust and consistent models for hypothesis testing. We also suggest broadening the empirical palette used in quantitative empirical research by linking regression-based research with machine learning.

Notes

1. In the common factor data generation process, an unobserved variable (xn) is related to the observable independent variables (x1 and x2) in the equation y = γxn + δ1x1 + δ2x2 + δ3x3 + e.

2. Kalnins (2018) also pointed to measurement error as a commonly omitted source of variation between independent variables, with a similar effect on bias in regression results.

3. Somewhat surprisingly, in this early analysis, Veiga et al. (2000) use a rather advanced method of machine learning, namely neural networks.