Introduction

The simulation of multivariate, non-normal data has become fairly common in psychology, education, and the life and social sciences in general, particularly when evaluating the robustness properties of statistical methods or when offering recommendations for best practices in data analysis (Astivia, 2020; Morris et al., 2019; Sigal & Chalmers, 2016). Of the panoply of approaches by which multivariate, non-normal data can be generated, perhaps the most popular one involves the simulation of lower-dimensional marginal distributions together with the specification of a correlation or covariance structure to govern their relationships, commonly referred to as the random vector generation (RVG) method (Chen, 2001). In general, the RVG family of algorithms follows a simple yet effective series of steps:

  • Simulate from a multivariate (usually standard) normal distribution with a known covariance (or correlation) matrix.

  • Alter each individual marginal distribution to obey the specifications of the researcher.

  • Sample from this newly derived joint distribution.

A few popular algorithms in the literature that follow this structure include the Vale & Maurelli (1983) generalization of the Fleishman (1978) third-order polynomial approach, Headrick's (2002) fifth-order polynomial method, multivariate g-and-h distributions (Kowalchuk & Headrick, 2010), the generalized lambda distribution (Headrick & Mugdadi, 2006), and the NORmal To Anything (NORTA) method (Cario & Nelson, 1997), which is perhaps one of the oldest and most popular data-generating algorithms for simulating multivariate, non-normal data. Throughout the rest of this article the focus will be on the NORTA method, but empirical extensions will help connect the methodology explicated in the present article with some of the previously mentioned methods.

The NORTA method

Cario & Nelson (1997) are usually credited with the development the NORTA method, but reference to the ideas substantiating it can be found in Li & Hammond (1975), Mardia (1970), and Schmeiser (1990). It has been used to inform a wide variety of simulation research in areas such as structural equation modeling (Jobst et al., 2022), social network analysis (Schweimer et al., 2022), linking and equating within item response theory (Lim & Lee, 2020), factor analysis (Auerswald & Moshagen, 2019), discriminant analysis (Brobbey et al., 2022), interrater reliability (Cohen et al., 2009), and confidence intervals for test scores (Kim & Lee, 2018) among others. NORTA is versatile enough that it has even played a role in modeling the detection of COVID (Jain et al., 2022), gene sequencing (Specht & Li, 2015), and the prediction of tsunamis (Fukutani et al., 2015).

As developed in Cario & Nelson (1997), the NORTA algorithm proceeds as follows (a minimal code sketch follows the list):

  1. Simulate from \(\varvec{Z} = (Z_{1}, Z_{2}, \ldots, Z_{p})^{\prime }\), where \(\varvec{Z} \sim MVN(\varvec{0}, \varvec{\Sigma }_{Z})\) and \(i = 1, 2, \ldots, p\).

  2. Apply the probability-integral transformation to each simulated vector, such that \(U_{i} = \Phi (Z_{i})\), where \(\Phi (\cdot )\) is the (univariate) standard normal CDF and \(U_{i} \sim \mathcal {U}[0,1]\).

  3. Define \(X_{i} = F^{-1}_{i}(U_{i})\), where \(F^{-1}_{i}(\cdot )\) is the inverse-CDF (or quantile) function that creates the (non-normal) random variable \(X_{i}\).

  4. Now \(\textbf{X}= (X_{1}, X_{2}, \ldots, X_{p})^{\prime }\) follows a non-normal distribution with marginal densities specified by \(F^{-1}_{i}(\cdot )\) and correlation matrix \(\varvec{\Sigma }_{X}\).
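To make these steps concrete, the following minimal sketch (in Python, with NumPy and SciPy assumed available; the exponential marginals are an arbitrary illustrative choice, not one used later in this article) walks through steps 1-4 for a given \(\varvec{\Sigma }_{Z}\):

# A minimal NORTA sketch following steps 1-4 (exponential marginals are an
# arbitrary illustrative choice).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)

sigma_z = np.array([[1.0, 0.5, 0.3],
                    [0.5, 1.0, 0.2],
                    [0.3, 0.2, 1.0]])

z = rng.multivariate_normal(np.zeros(3), sigma_z, size=100_000)  # step 1
u = stats.norm.cdf(z)                                            # step 2
x = stats.expon.ppf(u)                                           # step 3

# Step 4: X has exponential marginals, but its correlation matrix no longer
# equals sigma_z -- this is the intermediate correlation problem below.
print(np.corrcoef(x, rowvar=False).round(3))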

The generation of multivariate, standard normal random variables has a long, established history within the simulation literature in the social sciences. The Kaiser–Dickman approach, based on a principal component analysis (PCA) decomposition of the covariance matrix, is perhaps one of the earliest available (Kaiser & Dickman, 1962), although research in statistics had already implemented a slight variation of this approach in Moonan (1957). The only technical difference between these approaches is the use of a Cholesky decomposition of the covariance matrix as opposed to PCA. The probability-integral transformation is also well known and, provided a quantile function is available, it makes the process of transforming a variable from 'normal' to 'non-normal' fairly efficient. The only remaining problem then becomes the difference between \(\varvec{\Sigma }_{Z}\), henceforth referred to as the intermediate correlation matrix, and \(\varvec{\Sigma }_{X}\), the final correlation matrix.
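Both factorizations are easy to verify numerically. In the sketch below (NumPy assumed), the same standard normal deviates are passed through a Cholesky factor and through a PCA/eigendecomposition factor, and both reproduce the target covariance:

# Contrasting the two decompositions mentioned above: both factorizations
# reproduce the same covariance in the multivariate normal draw.
import numpy as np

rng = np.random.default_rng(0)
sigma = np.array([[1.0, 0.4], [0.4, 1.0]])

L = np.linalg.cholesky(sigma)          # Cholesky factor (Moonan, 1957)
vals, vecs = np.linalg.eigh(sigma)     # eigendecomposition
P = vecs @ np.diag(np.sqrt(vals))      # PCA factor (Kaiser & Dickman, 1962)

z = rng.standard_normal((100_000, 2))
print(np.cov(z @ L.T, rowvar=False))   # both near sigma
print(np.cov(z @ P.T, rowvar=False))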

The problem of the intermediate correlation matrix

For \(i \ne j\), the correlation between two arbitrary NORTA-transformed random variables \((X_{i}, X_{j})\) can be found in Cario & Nelson (1997) to be:

$$\begin{aligned} \mathbb {E}[X_{i}X_{j}] = \rho _{X}(i,j) = \int _{-\infty }^{\infty }\int _{-\infty }^{\infty }F^{-1}_{X_{i}}[\Phi (z_{i})]\,F^{-1}_{X_{j}}[\Phi (z_{j})]\,\phi (z_{i},z_{j},\rho _{Z}(i,j))\,dz_{i}\,dz_{j} \end{aligned}$$
(1)

where \(F^{-1}_{X_{i}}(\cdot )\) and \(F^{-1}_{X_{j}}(\cdot )\) are the inverse CDFs of the random variables \(X_i\) and \(X_j\), \(\Phi (\cdot )\) is the standard normal CDF, \(z_i, z_j\) are arbitrary standard normal random variables, and \(\phi (\cdot )\) is the standard bivariate normal PDF with correlation \(\rho _{Z}(i,j)\). Notice how in Eq. (1) above, the correlation between the NORTA-transformed random variables, \(\rho _{X}(i,j)\), depends exclusively on the value of \(\rho _{Z}(i,j)\), since all the other arguments in the equation are either defined by the specifications of NORTA or by the marginal distributions selected by the user. There exists, however, a non-trivial complication in the implementation of the NORTA algorithm. Because the probability-integral transformation is non-linear, the Pearson correlation is not invariant to it, so that, in general, \(|\rho _{X}(i,j)| \le |\rho _{Z}(i,j)|\), altering the covariance structure intended by the user. The role of the intermediate correlation is therefore to find a value of \(\rho _{Z}(i,j)\) such that, once the marginal densities are NORTA-transformed, the resulting correlation matches what a researcher intends to use as an effect size for the simulated population.

The nature of the new marginal densities determines how severe the downward bias will be. For instance, consider \((X,Y)'\) to be jointly normally distributed with parameters \(\mu _{X}=\mu _{Y}=0\), \(\sigma ^{2}_{X}=\sigma ^{2}_{Y}=1\), and an arbitrary correlation coefficient \(\rho >0\). Define two new random variables \((A,B)'\) such that \(A=e^{X}\) and \(B=e^{Y}\). Then both A and B are (standard) log-normally distributed with some correlation \(\rho ^{*}\). The effect that this transformation has on the correlation coefficient is exemplified in Fig. 1 below. For extreme correlation values (either close to 0 or close to 1), exponentiation has very little effect on the value of the correlation itself. Nevertheless, for more intermediate values of \(\rho \), the downward bias of \(\rho ^{*}\) can be quite extreme, sometimes even greater than 0.1.

Fig. 1 Relationship between the correlation (\(\rho \)) of two standard normal random variables and that of their transformed, log-normal counterparts (\(\rho ^{*}\)). The bold black line describes the equation \(x=y\) and is added for reference
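The attenuation displayed in Fig. 1 can be reproduced in a few lines. The sketch below (NumPy assumed) simulates one value of \(\rho \) and compares the result against the closed form for standard log-normal marginals, \(\rho ^{*}=(e^{\rho }-1)/(e-1)\), implied by the expression from Astivia & Zumbo (2017) used later in this article:

# Exponentiating correlated standard normals shrinks |rho|, as in Fig. 1.
import numpy as np

rng = np.random.default_rng(1)
rho = 0.5
cov = np.array([[1.0, rho], [rho, 1.0]])
z = rng.multivariate_normal([0.0, 0.0], cov, size=1_000_000)
a, b = np.exp(z[:, 0]), np.exp(z[:, 1])      # standard log-normal marginals

rho_star_mc = np.corrcoef(a, b)[0, 1]        # simulated correlation
rho_star_an = (np.exp(rho) - 1) / (np.e - 1) # closed form for log-normals

print(rho_star_mc, rho_star_an)  # both near 0.378, well below rho = 0.5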

To account for this downward bias, Cario & Nelson (1997) note that the bivariate correlation of \((X_i, X_j)\) depends exclusively on the correlation of \((z_i, z_j)\) within the bivariate normal PDF, \(\phi (\cdot )\). They begin by defining the function \(\rho _{X}(i,j)=c_{i,j}[\rho _{Z}(i,j)]\) and study several of its properties. A general form of the function \(c_{i,j}[\cdot ]\) does not exist, and it is only available in limited, special cases (cf. Cario & Nelson, 1997, p. 5). An example of a special case for log-normal random variables will be discussed. Still, certain properties of \(c_{i,j}[\cdot ]\) can be derived to further understand under what conditions it is feasible to find an intermediate correlation that satisfies Eq. (1) above and allows potential users to specify the population correlation structure of their choice. Three important properties of \(c_{i,j}[\rho _{Z}(i,j)]\) were discovered and proven in Cario & Nelson (1997) as Proposition 1 (p. 5), Theorem 1 (p. 6), and Theorem 2 (p. 6). Since these results are relevant to the theoretical developments offered in the present article, they are reproduced here for the sake of completeness:

  • Proposition 1: For any distributions \(F_{X_{i}}\) and \(F_{X_{j}}\), \(c_{i,j}(0)=0\), and \(\rho _{Z}(i,j) \ge 0\) (or \(\le 0\)) implies that \(c_{i,j}[\rho _{Z}(i,j)] \ge 0\) (or \(\le 0\)).

  • Theorem 1: The function \(c_{i,j}[\rho _{Z}(i,j)]\) is nondecreasing for \(-1 \le \rho _{Z}(i,j) \le 1\).

  • Theorem 2: If there exists \(\epsilon > 0\) such that \(\mathbb {E}[|X_{i}X_{j}|^{1 + \epsilon }]\) \( < \infty \) for all values of \(-1 \le \rho _{Z}(i,j) \le 1\), where \(X_{i}, X_{j}\) are defined by a NORTA transformation, then the function \(c_{i,j}[\rho _{Z}(i,j)]\) is continuous for \(-1 \le \rho _{Z}(i,j) \le 1\).

Methods to obtain the intermediate correlations needed to initialize the NORTA algorithm are therefore of utmost importance for its practical implementation. Cario & Nelson (1998) proposed one of the earliest root-finding algorithms to obtain feasible values for the intermediate correlation matrix in their software ARTAFACTS. Another approach can be found in Chen (2001), through the use of retrospective approximation.

An important common theme of the available approaches for solving for the intermediate correlation is that they treat the estimation of each correlation as an independent problem. In other words, the final correlation matrix of the NORTA-transformed marginal densities, \(\varvec{\Sigma }_{X}\) of dimensions \(p \times p\), requires the solution of \(p(p-1)/2\) independent equations of the form shown in Eq. (1) above. When \(p=2\), a solution for \(c_{i,j}[\rho _{Z}(i,j)]\) is guaranteed as long as the final correlation \(\rho _{X}(i,j)\) respects the Fréchet–Hoeffding bounds imposed by the NORTA-transformed marginal densities (see Footnote 1). When \(p>2\), solutions for \(c_{i,j}[\rho _{Z}(i,j)]\) may still exist, but solving for each correlation pair independently and assembling the intermediate correlation matrix \(\varvec{\Sigma }_{Z}\) bivariately, entry by entry, may result in \(\varvec{\Sigma }_{Z}\) not being positive definite.

For instance, consider the following correlation matrix:

$$\begin{aligned} \varvec{\Sigma }_{X} = \left[ \begin{array}{ccc} 1 & -0.101 & -0.367 \\ -0.101 & 1 & -0.367 \\ -0.367 & -0.367 & 1 \end{array}\right] \end{aligned}$$

as the final population correlation matrix for a trivariate density where each marginal follows a (standard) log-normal distribution. From Astivia & Zumbo (2017), one can easily derive a closed-form expression for the elements of the intermediate correlation matrix, \(\rho _{Z}(i,j) = \ln [\rho _{X}(i,j)(e-1)+1]\). If one initializes the NORTA algorithm by solving for each intermediate correlation individually, as is commonly done, the intermediate correlation matrix would be:

$$\begin{aligned} \varvec{\Sigma }_{Z} = \left[ \begin{array}{ccc} 1 & -0.191 & -0.999 \\ -0.191 & 1 & -0.999 \\ -0.999 & -0.999 & 1 \end{array}\right] \end{aligned}$$

Although at first glance \(\varvec{\Sigma }_{Z}\) may seem like any other correlation matrix, upon inspecting its eigenvalues it becomes apparent that it is, in fact, not positive definite and therefore not a true correlation matrix. Li & Hammond (1975) were among the first to notice this property of the NORTA method, but Ghosh & Henderson (2003) offer the most detailed account of the issue. They show that the problem is quickly aggravated by the dimensionality of the multivariate distribution one wishes to simulate from, so that for \(p>10\) one is almost guaranteed to encounter a non-positive definite intermediate correlation matrix \(\varvec{\Sigma }_{Z}\). It may seem intuitively appealing to tackle this problem by estimating the nearest correlation matrix to \(\varvec{\Sigma }_{Z}\), as described in Higham (2002), for instance. However, Stanhope (2005) offers a word of caution related to this practice because “random vectors generated with the resulting NORTA transform will not have the desired correlation structure” (p. 71). Indeed, if one were to calculate the nearest correlation matrix to \(\varvec{\Sigma }_{Z}\) using the Higham (2002) procedure, the correlations would be biased (some upwards and some downwards) as follows:

$$\begin{aligned} \varvec{\Sigma }_{Z} = \left[ \begin{array}{ccc} 1 & 0.011 & -0.711 \\ 0.011 & 1 & -0.711 \\ -0.711 & -0.711 & 1 \end{array}\right] \end{aligned}$$
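The non-positive definiteness of \(\varvec{\Sigma }_{Z}\) is easy to verify numerically. A short sketch (NumPy assumed) using the closed form \(\rho _{Z}(i,j)=\ln [\rho _{X}(i,j)(e-1)+1]\):

# Building Sigma_Z entrywise for the example above yields a matrix with a
# negative eigenvalue, i.e., not positive definite.
import numpy as np

sigma_x = np.array([[ 1.0,   -0.101, -0.367],
                    [-0.101,  1.0,   -0.367],
                    [-0.367, -0.367,  1.0  ]])

sigma_z = np.log(sigma_x * (np.e - 1) + 1)  # closed form, log-normal marginals
                                            # (diagonal maps to ln(e) = 1)

print(np.round(sigma_z, 3))         # off-diagonals near -0.191 and -1
print(np.linalg.eigvalsh(sigma_z))  # smallest eigenvalue < 0: not a valid
                                    # correlation matrix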

Ghosh & Henderson (2003) refer to matrices of this type as NORTA defective (p. 279) and propose a semidefinite program aimed at obtaining the closest feasible matrix to the originally intended one. To the authors’ knowledge, there are very limited alternatives for handling matrices that, in principle, cannot be obtained through the independent solution of \(p(p-1)/2\) equations of the form shown in Eq. (1). The purpose of this article is therefore to tackle the problem of deriving the intermediate correlation matrix such that all the elements of the matrix are obtained simultaneously. To accomplish this, the stochastic approximation method known as the Robbins–Monro algorithm (cf. Robbins & Monro, 1951) is applied to the full matrix defined by the elements in Eq. (1). A small simulation study replicating the impact that non-normality has on the chi-square test of fit within a structural equation modeling (SEM) context is presented to evaluate the properties of this method, and potential applications are discussed for the simulation of multivariate, non-normal data.

Theoretical background

Stochastic approximation through the Robbins–Monro algorithm

Assume one wishes to find the root \(x_{0}\) of a function \(h(\cdot ):\mathbb {R}\rightarrow \mathbb {R}\). An initial approach could be Newton's method, which generates a sequence of n iterates of the form:

$$\begin{aligned} x_{n+1}=x_{n}-\frac{h(x_{n})}{h^{\prime }(x_{n})} \end{aligned}$$
(2)

Now, suppose one also knew a neighborhood of \(x_{0}\) where \(h(x_{i})<0\) for \(x_{i}<x_{0}\) and \(h(x_{i})>0\) for \(x_{i}>x_{0}\), with \(h(\cdot )\) non-decreasing in this neighborhood. Then, if one begins at an arbitrary point \(x_{i}\) sufficiently close to \(x_{0}\), Eq. (3) below also converges to \(x_{0}\), and it does not require the derivative of \(h(\cdot )\) (Bharath & Borkar, 1999).

$$\begin{aligned} x_{n+1}=x_{n}-\kappa _n h(x_{n}) \end{aligned}$$
(3)

where \(\kappa _n\) is a sequence of positive, sufficiently small constants. For instance, setting \(\kappa _n=\frac{x_n-x_{n-1}}{h(x_n)-h(x_{n-1})}\) yields the secant method.

It is common not to have access to \(h(\cdot )\) in closed form, but one can conduct simulations to evaluate the function at particular values \(x_{i}\). The procedure discussed in Robbins & Monro (1951) follows this approach by using a noisy version of \(h(\cdot )\) in a slightly modified version of Eq. (3) above:

$$\begin{aligned} x_{n+1}=x_{n}-\gamma _{n}y_{n}(x_{n}) \end{aligned}$$
(4)

where \(\gamma _{n}\) is a sequence of sufficiently small positive constants converging to 0 such that \(\sum _{n=0}^{\infty }\gamma _{n}=\infty \) and \(\sum _{n=0}^{\infty }\gamma _{n}^{2}<\infty \), and \(y_{n}\) is a random variable of the form \(y_{n}(x_{n})=h(x_{n})+\epsilon _{n}\), with \(\epsilon _{n}\) normally distributed such that \(\mathbb {E}[\epsilon _{n}]=0\) and, therefore, \(\mathbb {E}[y_{n}(x_n)]=h(x_n)\).
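As a toy illustration of Eq. (4) (the choice \(h(x)=\tanh (x-1)\), with root \(x_{0}=1\), and the noise scale are assumptions of this sketch, not part of the original procedure):

# A toy Robbins-Monro run for Eq. (4): noisy evaluations of a non-decreasing
# function h with root x0 = 1, step sequence gamma_n = 1/n.
import numpy as np

rng = np.random.default_rng(0)

x = 0.0  # arbitrary starting point
for n in range(1, 20_001):
    y_n = np.tanh(x - 1.0) + rng.normal(scale=0.5)  # y_n = h(x_n) + eps_n
    x = x - (1.0 / n) * y_n

print(x)  # approaches the root x0 = 1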

Following the NORTA procedure, consider an arbitrary element \(\rho _{Z}(i,j)\) of the intermediate correlation matrix \(\varvec{\Sigma }_{Z}\). One needs to find a \(\rho _{Z}(i,j)\) such that \(c_{ij}[\rho _{Z}(i,j)]\approx \rho _{X}(i,j)\), where \(c_{ij}[\cdot ]\) is a continuous, non-decreasing function as described in Section “The problem of the intermediate correlation matrix”. A crucial assumption is that a solution for \(c_{ij}[\cdot ]\) exists (cf. Cario & Nelson, 1997, p. 4). For said function \(c_{ij}[\cdot ]\), one can use the Robbins–Monro technique as follows:

$$\begin{aligned} \rho _{Z}^{k+1}(i,j)=\rho _{Z}^{k}(i,j)-a_{k}(c_{ij}[\rho _{Z}^{k}(i,j)]-\rho _{X}(i,j)) \end{aligned}$$
(5)

where \(a_{k}\) is a sequence of positive constants such that \(\sum _{k=0}^{\infty }a_{k}=\infty \) and \(\sum _{k=0}^{\infty }a_{k}^{2}<\infty \). A particular sequence that satisfies these conditions is \(a_{k}=\frac{a}{k}\) for \(a>0\) (Robbins & Monro, 1951). In Eq. (5) above, at each iteration k, the function \(c_{ij}[\cdot ]\) can be estimated by taking N simulated samples of the bivariate normal random vector \((Z_{i},Z_{j})\) with correlation \(\rho _{Z}^{k}(i,j)\), such that:

$$\begin{aligned} c_{ij}[\rho _{Z}^{k}(i,j)] = \frac{\frac{1}{N}\sum _{t=1}^{N}F_{X_{i}}^{-1}(\Phi (Z_{i}^{t}))\,F_{X_{j}}^{-1}(\Phi (Z_{j}^{t})) - \mu _{i}\mu _{j}}{\sigma _{i}\sigma _{j}}, \end{aligned}$$

where \(\mu _{i}\), \(\mu _{j}\), \(\sigma _{i}\), and \(\sigma _{j}\) are the means and standard deviations of the random variables \(X_{i}\) and \(X_{j}\), respectively.
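This inner Monte Carlo step is straightforward to implement. A sketch (SciPy assumed; the standard log-normal example from earlier supplies \(F^{-1}\), \(\mu \), and \(\sigma \)):

# Monte Carlo estimate of c_ij at a given intermediate correlation rho_z,
# as in the displayed equation above (N bivariate normal draws per call).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def c_ij(rho_z, Fi_inv, Fj_inv, mu, sd, N=200_000):
    cov = np.array([[1.0, rho_z], [rho_z, 1.0]])
    z = rng.multivariate_normal([0.0, 0.0], cov, size=N)
    xi = Fi_inv(stats.norm.cdf(z[:, 0]))
    xj = Fj_inv(stats.norm.cdf(z[:, 1]))
    return (np.mean(xi * xj) - mu[0] * mu[1]) / (sd[0] * sd[1])

# Standard log-normal marginals: known mean and standard deviation.
q = stats.lognorm(s=1).ppf
mu = (np.exp(0.5), np.exp(0.5))
sd = (np.sqrt((np.e - 1) * np.e),) * 2
print(c_ij(0.5, q, q, mu, sd))  # near (e**0.5 - 1)/(e - 1), about 0.378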

To implement the Robbins–Monro method, the function \(c_{i,j}[\cdot ]\) must satisfy the following conditions:

  1. \(c_{i,j}[\cdot ]\) is non-decreasing.

  2. \(h[\rho _{Z}(i,j)]=c_{i,j}[\rho _{Z}(i,j)]-\rho _{X}(i,j)\) behaves in such a way that \(h[\rho _{Z}(i,j)] \le 0\) for \(\rho _{Z}(i,j) \le \rho _{0Z}(i,j)\) and \(h[\rho _{Z}(i,j)] \ge 0\) for \(\rho _{Z}(i,j) \ge \rho _{0Z}(i,j)\), where \(\rho _{0Z}(i,j)\) is a true root of \(h[\cdot ]\).

As described in the Introduction, the first condition is satisfied by Theorem 1 in Cario & Nelson (1997). The second condition can easily be demonstrated as follows:

$$\begin{aligned} \rho _{Z}(i,j) \ge \rho _{0Z}(i,j)&\Rightarrow c_{i,j}[\rho _{Z}(i,j)] \ge c_{i,j}[\rho _{0Z}(i,j)]\\&\Rightarrow c_{i,j}[\rho _{Z}(i,j)]- \rho _{X}(i,j) \ge c_{i,j}[\rho _{0Z}(i,j)]-\rho _{X}(i,j)\\&\Rightarrow h[\rho _{Z}(i,j)] \ge h[\rho _{0Z}(i,j)]\\&\Rightarrow h[\rho _{Z}(i,j)] \ge 0 \end{aligned}$$

A similar argument can be used to show that \(\rho _{Z}(i,j) \le \rho _{0Z}(i,j) \Rightarrow h[\rho _{Z}(i,j)] \le 0\). Satisfying these conditions guarantees the convergence of each \(\rho _{Z}^{k}(i,j)\) to \(\rho _{0Z}(i,j)\) (Robbins & Monro, 1951).

The NORTA method still requires a full correlation matrix \(\varvec{\Sigma }_{Z}\) to convert into \(\varvec{\Sigma }_{X}\). Denote this mapping by \(G_{N}[\cdot ]\), such that \(G_{N}[\varvec{\Sigma }_{Z}] \approx \varvec{\Sigma }_{X}\). One can compute \(\varvec{\Sigma }_{Z}\) by using the iterative process:

$$\begin{aligned} \varvec{\Sigma }_{Z}^{k+1}=\varvec{\Sigma }_{Z}^{k}-a_{k}(G_{N}[\varvec{\Sigma }_{Z}^{k}]-\varvec{\Sigma }_{X}) \end{aligned}$$
(6)

Each entry (i, j) of the matrix \(G_{N}[\varvec{\Sigma }_{Z}^{k}]\) is approximated through the function \(c_{ij}[\cdot ]\), so the convergence \(\varvec{\Sigma }_{Z}^{k}\rightarrow \varvec{\Sigma }_{Z}\) is guaranteed by the convergence of \(\rho _{Z}^{k}(i,j)\rightarrow \rho _{Z}(i,j)\).
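Putting the pieces together, the following condensed sketch implements the matrix iteration in Eq. (6), reusing the c_ij estimator sketched earlier (the helper names and the clipping step are illustrative choices, not the authors' implementation):

# One Robbins-Monro update of the full intermediate matrix per iteration,
# with step sequence a_k = a/k as in Eq. (6).
import numpy as np

def estimate_sigma_z(sigma_x, quantiles, mu, sd, iters=500, a=1.0):
    p = sigma_x.shape[0]
    sigma_z = sigma_x.copy()          # initialize at the target matrix
    for k in range(1, iters + 1):
        g = np.eye(p)                 # G_N[sigma_z] estimated entrywise
        for i in range(p):
            for j in range(i + 1, p):
                g[i, j] = g[j, i] = c_ij(sigma_z[i, j],
                                         quantiles[i], quantiles[j],
                                         (mu[i], mu[j]), (sd[i], sd[j]))
        sigma_z = sigma_z - (a / k) * (g - sigma_x)   # Eq. (6) step
        sigma_z = np.clip(sigma_z, -1.0, 1.0)         # keep entries feasible
        np.fill_diagonal(sigma_z, 1.0)
    return sigma_z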

Accelerating convergence and the nearest correlation problem

In order to optimize the efficiency of the stochastic approximation scheme described previously, two important algorithmic issues need to be addressed. The first pertains to speeding up the convergence of the Robbins–Monro method by averaging the iterates while taking longer steps (Polyak–Ruppert averaging). In this case, a slightly modified version of Eq. (5) is used. For \( 0 \le t \le k-1\) iterations:

$$\begin{aligned} \rho _{Z}^{k+1}(i,j)&= \rho _{Z}^{k}(i,j)-a_{k}(c_{ij}[\rho _{Z}^{k}(i,j)]-\rho _{X}(i,j)) \nonumber \\ \bar{\rho }_{Z}^{\,k}(i,j)&= \frac{1}{k}\sum \limits _{t=0}^{k-1}\rho _{Z}^{t}(i,j) \end{aligned}$$
(7)

The convergence of the averaged iterate \(\bar{\rho }_{Z}^{\,k}(i,j)\) to the root \(\rho _{0Z}(i,j)\) relies on the condition that the step sequence \(a_{k}\) decreases sufficiently slowly. That is:

$$\begin{aligned} a_{k}\rightarrow 0, \frac{a_{k}-a_{k+1}}{a_{k}}=o(a_{k}) \end{aligned}$$

The sequence \(a_{k}=k^{-\alpha }\) with \(\frac{1}{2}<\alpha <1\) therefore satisfies this condition, but \(\alpha =1\) does not, hence the need for longer steps. For a more in-depth explanation, please refer to Polyak & Juditsky (1992).
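A small demonstration of iterate averaging with a longer step sequence (the same toy \(h\) as before; \(\alpha =0.75\) is an arbitrary admissible choice):

# Polyak-Ruppert averaging with a_k = k**(-0.75): the running average of the
# Robbins-Monro iterates is reported instead of the last iterate.
import numpy as np

rng = np.random.default_rng(1)

x, iterates = 0.0, []
for k in range(1, 5_001):
    noisy_h = np.tanh(x - 1.0) + rng.normal(scale=0.5)
    x = x - k**(-0.75) * noisy_h     # longer steps than 1/k
    iterates.append(x)

print(np.mean(iterates))  # averaged iterate, close to the root x0 = 1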

The second issue concerns preserving positive definiteness throughout the stochastic process itself. The Robbins–Monro algorithm is initialized with a positive definite correlation matrix, \(\varvec{R}_{0}\). For this particular problem, \(\varvec{R}_{0}\) is simply set to be the desired final correlation matrix, \(\varvec{\Sigma }_{X}\). The next matrix in the iterative process, \(\varvec{R}_{1}\), may not be positive definite because the error matrix, \(G_{N}[\varvec{\Sigma }_{Z}^{k}]-\varvec{\Sigma }_{X}\), could be negative definite (i.e., the difference between two positive definite matrices need not be positive definite itself). To alleviate this problem, a positive definite matrix near \(\varvec{R}_{1}\) (in this example, or \(\varvec{R}_{i}\) in general) can be found through the bisection method, searching for the smallest constant b for which \(f(b)=b\varvec{M}+(1-b)\varvec{R}_{0}\) is positive definite (\(\varvec{M}\) is a positive definite matrix; in this case, \(\varvec{\Sigma }_{Z}\) is used).
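A sketch of this repair step (assuming, as the bisection search requires, that positive definiteness switches monotonically in b along the segment between the two matrices):

# Bisection for the smallest b in [0, 1] such that b*M + (1 - b)*R is
# positive definite, where M is a known positive definite matrix.
import numpy as np

def is_pd(mat, tol=1e-10):
    return np.linalg.eigvalsh(mat)[0] > tol

def pd_blend(R, M, iters=60):
    if is_pd(R):
        return R                 # nothing to repair
    lo, hi = 0.0, 1.0            # b = 1 returns M, PD by assumption
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if is_pd(mid * M + (1 - mid) * R):
            hi = mid             # feasible: try a smaller b
        else:
            lo = mid
    return hi * M + (1 - hi) * R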

Fig. 2 Average error of the estimated correlations (vertical axis) by iteration (horizontal axis)

Empirical demonstration

Accuracy of proposed algorithm

For any data-generating algorithm to be used effectively in simulation research, it is of utmost importance to validate the quality of the data it generates, to ensure that its technical aspects do not interfere with the actual simulation results (Lohmann et al., 2022). A series of examples is presented below to ascertain some of the properties of the methodology proposed herein.

A relevant test case for evaluating this approach is the matrix \(\varvec{\Sigma }_{X}\) presented in Section “The problem of the intermediate correlation matrix” of the present article. Since the solution to the \(p(p-1)/2\) individual equations would yield a non-positive definite matrix, as shown previously, a natural query is whether the joint estimation of all the elements is able to bypass this issue. A small simulation study was conducted and contrasted with the nearest-correlation matrix approach described here and explicated in Ghosh & Henderson (2003). As a reminder to the reader, the marginals are simulated from (standard) log-normal distributions and the desired final correlation matrix is:

$$\begin{aligned} \varvec{\Sigma }_{X} = \left[ \begin{array}{ccc} 1 & -0.101 & -0.367 \\ -0.101 & 1 & -0.367 \\ -0.367 & -0.367 & 1 \end{array}\right] \end{aligned}$$

Figure 2 shows the average error of the estimated correlations in \(\varvec{\Sigma }_{X}\) as a function of the number of iterations. Around the 750th iteration mark, the average error of estimation begins to decrease so that, towards the final iterations, it oscillates within \((-.02, .05)\).

The estimated correlation matrix, \(\hat{\varvec{\Sigma }}_{X_{1}}\), based on a large sample size of \(n=1{,}000{,}000\), is:

$$\begin{aligned} \hat{\varvec{\Sigma }}_{X_{1}} = \left[ \begin{array}{ccc} 1 & -0.010 & -0.341 \\ -0.010 & 1 & -0.341 \\ -0.341 & -0.341 & 1 \end{array}\right] \end{aligned}$$

Using the univariate approach paired with the nearest-correlation matrix solution reported in Section “The problem of the intermediate correlation matrix” yields the following estimated final correlation matrix, also based on a large sample size of \(n=1{,}000{,}000\):

$$\begin{aligned} \hat{\varvec{\Sigma }}_{X_{2}} = \left[ \begin{array}{ccc} 1 & 0.0156 & -0.2769 \\ 0.0156 & 1 & -0.2769 \\ -0.2769 & -0.2769 & 1 \end{array}\right] \end{aligned}$$

Although both final correlation matrices show bias, the correlations obtained through the joint estimation method presented in this article show less bias than those obtained by pairing a univariate approach with nearest-correlation estimation, validating the claim made in Stanhope (2005). To further explore the ability of both approaches to reproduce the final correlation matrix, a small simulation was run comparing the estimated \(\hat{\varvec{\Sigma }}_{X_{1}}\) and \(\hat{\varvec{\Sigma }}_{X_{2}}\) to \(\varvec{\Sigma }_{X}\) using the asymptotic distribution-free (ADF) estimator within structural equation modeling (see Footnote 2). The ADF estimator relaxes the traditional assumption of multivariate normality, so it can be safely used in this context where the marginal densities are log-normal. Nevertheless, the ADF estimator relies on large samples to properly estimate the matrix of corrections needed to obtain the proper standard errors and \(\chi ^{2}\) test of fit (Browne, 1984). To address this, the same sample size of \(n=1{,}000{,}000\) was used, with a reduced number of replications (100) due to the increased computational demands of the sample size. Results of this small simulation are shown below in Table 1.

Table 1 Empirical type I error rates for the test comparing both the multivariate method and the univariate method paired with nearest-correlation estimation

Similar to the previous section, an increase in the empirical type I error rate is observed for both approaches, but the increase is more pronounced for the univariate method paired with the nearest-correlation approach. For the multivariate method presented herein, the inflation falls within the limits recommended by Bradley (1978), indicating that the increase in the empirical \(\alpha \) is within acceptable levels.

Non-normality and \(\chi ^{2}\) test of fit in SEM

To evaluate the use of the methodology in a proper simulation study, a section of the simulation studies found in Curran et al. (1996), documenting the impact that non-normality has on the type I error rate of the chi-square test of fit in SEM, is replicated here. In the original study, a three-factor model with three indicators per factor was used as the population model. All factors loaded equally on their respective indicators with \(\lambda =0.7\) and error variances of \(\sigma ^{2}_{\epsilon }=0.51\). All three factors were correlated equally at \(\phi =0.3\). Sample sizes of \(n=100, 200, 500, 1000\) were used. In the original study, the multivariate generalization of the Fleishman (1978) method proposed by Vale & Maurelli (1983) was used to simulate the non-normal data, with population skewness and (excess) kurtosis pairs of 2 and 7 (for “moderate” non-normality) and 3 and 21 (for “extreme” non-normality). The type I error rates of the normal-theory likelihood chi-square test of fit, the Satorra–Bentler (SB) correction (Satorra & Bentler, 1994), and the asymptotic distribution-free (ADF) chi-square (Browne, 1984) were investigated.

Two important deviations from the original simulation design were implemented here. First, the number of replications was increased from the original 200 to 10,000 to improve the precision of the estimates. Second, rather than trying to match or mimic the original population densities (i.e., Fleishman distributions) from which the data were generated, the present study opts to show the versatility of the NORTA method by using a mix of continuous and discrete distributions. The approach described in the previous section should be able to obtain the intermediate correlation matrix irrespective of whether the variables being correlated are continuous, continuous but bounded, discrete, or a mix of all of them, as in this case. Table 2 summarizes the distributions used, their parameters, and their skewness and (excess) kurtosis values to better convey the properties of the multivariate distribution used in this simulation.

The results of this simulation study can be found in Table 3. The empirical rejection rates (i.e., type I error rates) from the original Curran et al. (1996, p. 22) are reproduced as a reference for the “severely non-normal” condition. As expected, the type I error rate for the normal-theory chi-square test is inflated due to non-normality, while the Satorra–Bentler (SB) correction and the asymptotic distribution-free (ADF) chi-square perform better in maintaining their empirical rates close to the nominal 5%.

Table 2 Distribution type and parameter values implying population skewness and excess kurtosis for marginal densities

There appears to be a discrepancy regarding the severity of the inflation, which can be explained in part by the simulation conditions themselves. In the original study, the “severely non-normal” condition simulates data with population skewness values of 3 and excess kurtosis of 21. A close inspection of Table 2 shows that none of the parameters chosen for the distributions initializing the NORTA method are close to those values, particularly in the case of excess kurtosis. Through simulated pseudo-populations of 10,000,000 participants, Mardia’s kurtosis (in the z-score metric) was calculated for each data-generating process. For the original densities used in Curran et al. (1996), the multivariate kurtosis is approximately \(z_{1} \approx 10{,}383.91\), whereas the NORTA-implied one is \(z_{2} \approx 911.96\). It should not come as a surprise, therefore, that convergence to the nominal .05 type I error rate happened much faster for the SB correction and the ADF estimator under the NORTA approach. Moreover, Yuan et al. (2005) make explicit in their Eq. (9) (p. 247) that negative kurtosis deflates the value of the \(\chi ^{2}\) test of fit, whereas positive kurtosis inflates it. The NORTA method used here offers a combination of positive and negative population kurtosis values, which could explain why the inflation of type I error rates is much lower for the NORTA case than for the third-order polynomial transformation used to simulate the data in the original Curran et al. (1996) article.

Table 3 Empirical type I error rates for the Curran, West and Finch (CWF) article and the NORTA transformation

Inducing a correlational structure in empirical data

A NORTA-generated vector is defined, in part, by \(X_{i}=F_{X_i}^{-1}[\Phi (Z_i)]\). If a collected (not simulated) dataset is available and \(F_{X_i}^{-1}\) is not known, it is natural to estimate \(F_{X_i}^{-1}\) with the empirical quantile function \(Q(\cdot )\), the inverse of the empirical CDF. Given a set of values \(x_{1},\ldots ,x_{n}\), the quantiles for an arbitrary fraction \(p_{i}\) can be defined as follows. First, sort the values in order \(x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}\). Then take the order statistic \(x_{(i)}\) to be the quantile corresponding to the fraction:

$$\begin{aligned} p_{i}=\frac{i-1}{n-1}, (i=1,\ldots ,n). \end{aligned}$$

In general, to define the quantile for an arbitrary fraction p, linear interpolation can be used. If p lies a fraction f of the way from \(p_i\) to \(p_{i+1}\), then the \(p\)th quantile is defined as:

$$Q(p)=(1-f)Q(p_{i})+fQ(p_{i+1}).$$

Now the function \(G_{N}[\cdot ]\) described previously can be used, but with the quantile function \(Q_{i}\) in place of \(F_{X_i}^{-1}\).
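A sketch of this empirical variant (NumPy and SciPy assumed): the default linear interpolation in np.quantile matches the \(p_{i}=(i-1)/(n-1)\) scheme described above.

# NORTA with an empirical marginal: the sample quantile function Q(.) stands
# in for the unknown inverse CDF of the collected variable.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

data = rng.exponential(size=500)  # stand-in for one collected variable
z = rng.standard_normal(10_000)   # one column of the multivariate normal draw
u = stats.norm.cdf(z)             # probability-integral transformation
x = np.quantile(data, u)          # Q(u): linear interpolation between order
                                  # statistics with p_i = (i - 1)/(n - 1)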

Concluding remarks

Within the class of random-vector generation algorithms, the NORTA approach is one of the oldest, most established methods for simulating correlated, non-normal data. Although the alterations of the univariate marginal densities are straightforward applications of the probability-integral transformation and its inverse (whenever it is properly defined), preserving the correlation structure intended by the researcher is a much more complex issue. Cario & Nelson (1997) recognized this issue and proposed one of the first solutions in their ARTAFACTS software (Cario & Nelson, 1998), which combines numerical integration with root-searching algorithms. Newer, more efficient approaches have been proposed to help find the intermediate correlation matrix (e.g., Chen, 2001) but, as shown in Ghosh & Henderson (2003), solving for each entry bivariately can result in a non-positive definite matrix as the dimension of the simulated vector increases. The purpose of the current article was therefore to leverage advances in stochastic optimization (more specifically, the Robbins–Monro approach) to estimate all the elements of this matrix at once.

There are some limitations that need further discussion. The first is computational time. Currently available approaches enhance their efficiency and computational time by sacrificing the multivariate nature of the problem. Instead of jointly estimating all the \(p(p-1)/2\) elements of the intermediate correlation matrix (for a \(p \times p\) correlation matrix), \(p(p-1)/2\) univariate equations are solved individually and the intermediate correlation matrix is assembled from them, entry by entry. The reduction in computational time, however, comes at the increased risk of obtaining a non-positive definite matrix; depending on the matrix, the user may need further adjustments, or the matrix, albeit feasible, may be impossible to assemble entry by entry. The current approach does tend to require additional computational time, but this is to be expected, as the framework presented herein preserves the multivariate nature of the problem, i.e., estimating all \(p(p-1)/2\) entries jointly. It sacrifices computational efficiency in favor of mathematical precision. Since the problem is more complex, the amount of time required to solve it also increases. Nevertheless, this increase in computational time allows users to specify a wider variety of correlational structures, not limited to the ones that can be achieved by solving for each entry individually. We encourage researchers to become familiar with the details of the methods they use in their simulations to understand the advantages and shortcomings of each approach.

A second issue at play, inherent to every method where the correlation matrix is specified separately from the marginal distributions one intends to simulate from, is that if the marginal distributions chosen by the user are not normally distributed, the full theoretical [-1, +1] range of correlations is no longer guaranteed to be attainable. These limits are known as the Fréchet–Hoeffding bounds (for a thorough introduction please see Nelsen, 2007), and the issue is present in all potential methods, be they analytical or computational. For the algorithm presented herein, selecting a combination of marginal densities and correlations outside of the Fréchet–Hoeffding bounds yields a warning that the algorithm has not converged. Preliminary exploration of currently available algorithms that approach this problem univariately shows that they also either fail to settle on a solution or settle on correlations outside of the [-1, +1] range. Overall, for simulation purposes, this is not an issue that can be approached from a purely mathematical standpoint, since doing so would be unfeasible. It is an issue of simulation design, where the user of this (or any other) algorithm should become aware of which correlation structures are possible given the simulation conditions being selected. Ultimately, the majority of simulation algorithms used to generate multivariate, non-normal data in the social sciences belong to the family of Gaussian copulæ, which take a covariance (or correlation) matrix as the parameter governing their dependencies. It is important, therefore, to understand the interplay between this type of multivariate density and the mathematical restrictions that its use can imply. For a more thorough introduction, please see Astivia (2020).

The simulation of correlated, non-normal data where the user has some control over the marginal non-normality and the correlational structure is one of the most common approaches to simulation in the social sciences. The NORTA method has served as a conceptual basis for several of these approaches, but efficient methods to calculate its intermediate correlation matrix, although available, are limited. The stochastic optimization scheme presented here offers an alternative that bypasses one of the most difficult aspects of implementing this methodology, while allowing for a theoretically justified computational approach. We hope that, with this obstacle removed, the NORTA method may become a more appealing alternative for researchers interested in understanding the effects of multivariate non-normality on their statistical methods.

Open Practices Statement

There are no data or materials to share. Sample code implementing the algorithm and the simulated example can be found in the Appendix.