Main

The advancement of modern technologies, including single-cell1 and multiomics approaches2, wearable devices3, and integrated electronic health records4,5, has enabled an exciting era of precision medicine. These technologies regularly produce datasets with hundreds of thousands of variables (here referred to as features), allowing for unprecedented profiling of complex biological processes such as disease, pregnancy or healing2,6,7. Correlation analysis is typically the first step to gain insights into a complex system (56% of all papers on the preprint server bioRxiv contain the word ‘correlation’). While erroneous data may skew correlation analyses8 and correlation does not imply causation, correlation analysis can highlight coordinated subprocesses in complex systems for further investigation. Consequently, a broad range of algorithms have been developed for analyzing large-scale correlation networks from the perspective of topology, connectivity patterns or community structures9,10. In addition, extensive gene graphs and cell-to-cell relations derived from large-scale correlation networks are integrated in modern deep learning and graph neural network applications11,12.

Despite these diverse applications, the construction of correlation networks for large datasets remains a major computational challenge (for example, for only n = 1,000 features, at least 499,500 feature pairs need to be examined). As such, computation time and memory requirements for constructing correlation networks grow rapidly and quickly exceed computational resources as the dimensionality of the datasets increases. Current approaches for constructing correlation networks (Supplementary Section 5) either rely on specialized parallel processing and high-performance-computing frameworks (for example, graphics processing units, MapReduce and so on)13,14,15,16,17 or focus on specialized (for example, robust) correlation measures18,43 whose computational cost makes their application to high-dimensional data impractical. In this context, CorALS can be used either to efficiently sample correlations using full correlation matrix calculation or to first select top-k correlations to which such robust methods can then be applied selectively. Similarly, CorALS does not account for confounding or causation. However, more advanced approaches that account for these effects, such as partial correlation or Bayesian networks44,45, are often restricted to small datasets and do not scale to high-dimensional data. In this context, CorALS can be used to efficiently suggest highly correlated components of the data for further investigation with such methods. Thus, overall, investigating correlation networks can be broadly applied to gain insight into the underlying functional structures, which then may provide input for downstream analyses, including more advanced methods such as graph neural networks11,12.

Finally, as the number of features increases with advancing technologies, it may become necessary to introduce more sophisticated methods that find correlated compound structures, for example, based on existing domain knowledge, rather than individual correlations; here, CorALS can lay the computational foundation.

Overall, owing to its wide range and scope, we anticipate that CorALS will be a catalyst that enables a multitude of downstream applications of large-scale correlation networks. For example, in ‘Correlated functional changes across immune cells’, the efficiency characteristics of CorALS’s top correlation network estimation allowed us to derive an innovative sampling-based approach for analyzing the interaction of hundreds of thousands of cells simultaneously. In future work, CorALS may also support advanced tensor and network analysis or deep learning and graph neural network modeling (for example, for gene-interaction graphs and cell-to-cell relationships11,12). Thus, it will lay the analytical foundations and provide the computational tools to unravel the intricate interactions of biological systems as developing computational approaches become able to analyze increasingly complex network structures.

Methods

Derivation of efficient feature representations by CorALS

The different components of CorALS rely on transforming features into specific vector representations that connect the scalar product of these vectors to efficient correlation computations. In the following, we outline the derivation of these transformations for correlation projections (used for efficient correlation matrix calculation, top correlation network approximation and correlation embeddings) and for differential projections (used for top differential correlation search). Note that the following feature representations are derived for the Pearson correlation coefficient; however, without loss of generality, the derivations hold for Spearman’s rank correlation coefficient after replacing individual feature values with per-feature ranks, which CorALS’s implementation supports.

Correlation projections

By transforming feature representations appropriately, correlation computation can be formulated as a scalar product of two pre-processed vectors46. We refer to this pre-processing step as correlation projection. In particular, the Pearson correlation cor(x, y) between two features x and y with respective sample vectors x = (x1, ..., xm) and y = (y1, ..., ym), can be rewritten as follows:

$$\begin{array}{lll}{\mathrm{cor}}({{{\bf{x}}}},{{{\bf{y}}}})&=&\frac{\mathop{\sum }\limits_{i=1}^{m}({x}_{i}-{\mu }_{{{{\bf{x}}}}})({y}_{i}-{\mu }_{{{{\bf{y}}}}})}{\sqrt{\mathop{\sum }\limits_{j=1}^{m}{({x}_{j}-{\mu }_{{{{\bf{x}}}}})}^{2}\mathop{\sum }\limits_{j=1}^{m}{({y}_{j}-{\mu }_{{{{\bf{y}}}}})}^{2}}}\\ &=&\mathop{\sum }\limits_{i=1}^{m}\frac{{x}_{i}-{\mu }_{{{{\bf{x}}}}}}{\sqrt{\mathop{\sum }\limits_{j=1}^{m}{({x}_{j}-{\mu }_{{{{\bf{x}}}}})}^{2}}}\cdot\frac{{y}_{i}-{\mu }_{{{{\bf{y}}}}}}{\sqrt{\mathop{\sum }\limits_{j=1}^{m}{({y}_{j}-{\mu }_{{{{\bf{y}}}}})}^{2}}}\\ &=&\left\langle \frac{{{{\bf{x}}}}-{\mu }_{{{{\bf{x}}}}}}{\parallel {{{\bf{x}}}}-{\mu }_{{{{\bf{x}}}}}\parallel },\frac{{{{\bf{y}}}}-{\mu }_{{{{\bf{y}}}}}}{\parallel {{{\bf{y}}}}-{\mu }_{{{{\bf{y}}}}}\parallel }\right\rangle \\ &=&\left\langle \hat{{{{\bf{x}}}}},\hat{{{{\bf{y}}}}}\right\rangle \quad{{{\rm{with}}}}\ \hat{{{{\bf{z}}}}}=\frac{{{{\bf{z}}}}-{\mu }_{{{{\bf{z}}}}}}{\parallel {{{\bf{z}}}}-{\mu }_{{{{\bf{z}}}}}\parallel }\end{array}$$
(1)

where \({\mu}_{\mathbf z}\) is the mean of vector z. Thus, the \(\hat{\,\cdot\,}\) operator corresponds to the correlation projection, which transforms the original sample vectors so that their scalar product is equal to their correlation. CorALS exploits this vector representation to formulate correlation matrix computation as an efficient matrix product.

This transformation makes it possible to derive a direct relationship between the correlation cor(x, y) of any two vectors and the Euclidean distance \({d}_{{\mathrm{e}}}(\hat{{{{\bf{x}}}}},\hat{{{{\bf{y}}}}})\) of their correlation projections46. In particular, cor(x, y) and \(-{d}_{{\mathrm{e}}}(\hat{{{{\bf{x}}}}},\hat{{{{\bf{y}}}}})\) are order-equivalent and it holds that:

$${\mathrm{cor}}({{{\bf{x}}}},{{{\bf{y}}}})=1-\frac{{d}_{{\mathrm{e}}}{(\hat{{{{\bf{x}}}}},\hat{{{{\bf{y}}}}})}^{2}}{2}$$
(2)

CorALS exploits this relationship between correlation and Euclidean distance, for example, in top correlation approximation and correlation-based embeddings. For more details and corresponding proofs, see Supplementary Section 6.1.
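
As an illustration, the following minimal NumPy sketch (ours, not the CorALS API) numerically verifies equations (1) and (2) for a random pair of sample vectors:

```python
import numpy as np

def proj(z):
    """Correlation projection z_hat = (z - mu_z) / ||z - mu_z|| from equation (1)."""
    zc = z - z.mean()
    return zc / np.linalg.norm(zc)

rng = np.random.default_rng(0)
x, y = rng.normal(size=(2, 50))            # two features, 50 samples each

r = np.corrcoef(x, y)[0, 1]
# Equation (1): the scalar product of the projections is the Pearson correlation.
assert np.allclose(r, proj(x) @ proj(y))
# Equation (2): correlation relates directly to the Euclidean distance of projections.
d = np.linalg.norm(proj(x) - proj(y))
assert np.allclose(r, 1 - d**2 / 2)
```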

Differential projections

CorALS further introduces a dual feature representation in a differential space that makes it possible to calculate correlation differences across two conditions or timepoints using a single scalar product. In particular, for two features x and y, let \({{{{\bf{x}}}}}_{1}=({x}_{1,1},...,{x}_{1,{m}_{1}})\) and \({{{{\bf{y}}}}}_{1}=({y}_{1,1},...,{y}_{1,{m}_{1}})\) denote the respective sample vectors in the first condition/timepoint and \({{{{\bf{x}}}}}_{2}=({x}_{2,1},...,{x}_{2,{m}_{2}})\) and \({{{{\bf{y}}}}}_{2}=({y}_{2,1},...,{y}_{2,{m}_{2}})\) those in the second condition/timepoint. Then, the goal is to find vector transformations δ(x1, x2), κ(y1, y2) that represent information from both conditions/timepoints simultaneously so that

$${\mathrm{cor}}({{{{\bf{x}}}}}_{1},{{{{\bf{y}}}}}_{1})-{\mathrm{cor}}({{{{\bf{x}}}}}_{2},{{{{\bf{y}}}}}_{2})=\langle \delta ({{{{\bf{x}}}}}_{1},{{{{\bf{x}}}}}_{2}),\kappa ({{{{\bf{y}}}}}_{1},{{{{\bf{y}}}}}_{2})\rangle$$
(3)

Given the correlation projection from ‘Correlation projections’, the following definitions for \(\delta\) and κ provide such a dual vector representation.

$$\begin{array}{rl}\delta :\ {{\mathbb{R}}}^{{m}_{1}}\times {{\mathbb{R}}}^{{m}_{2}}&\to {{\mathbb{R}}}^{{m}_{1}+{m}_{2}}\\ ({{{{\bf{z}}}}}_{1},{{{{\bf{z}}}}}_{2})&\mapsto \left(\begin{array}{c}{\hat{{{{\bf{z}}}}}}_{1}\\ {\hat{{{{\bf{z}}}}}}_{2}\end{array}\right)\\ \kappa :\ {{\mathbb{R}}}^{{m}_{1}}\times {{\mathbb{R}}}^{{m}_{2}}&\to {{\mathbb{R}}}^{{m}_{1}+{m}_{2}}\\ ({{{{\bf{z}}}}}_{1},{{{{\bf{z}}}}}_{2})&\mapsto \left(\begin{array}{c}{\hat{{{{\bf{z}}}}}}_{1}\\ -{\hat{{{{\bf{z}}}}}}_{2}\end{array}\right)\end{array}$$
(4)

We call the vector space containing the codomain of these functions the differential space.

Similar to the connection of Euclidean distance and basic correlation (see above), the dual feature representations in the differential space exhibit a connection between Euclidean distance and correlation difference across conditions or timepoints. In particular, for two features x and y with sample vectors x1, x2 and y1, y2 across two conditions or timepoints, cor(x1, y1) − cor(x2, y2) and −de(δ(x1, x2), κ(y1, y2)) are order-equivalent and it holds that:

$${\mathrm{cor}}({{{{\bf{x}}}}}_{1},{{{{\bf{y}}}}}_{1})-{\mathrm{cor}}({{{{\bf{x}}}}}_{2},{{{{\bf{y}}}}}_{2})=2-\frac{{d}_{{\mathrm{e}}}{(\delta ({{{{\bf{x}}}}}_{1},{{{{\bf{x}}}}}_{2}),\kappa ({{{{\bf{y}}}}}_{1},{{{{\bf{y}}}}}_{2}))}^{2}}{2}$$
(5)

Thus, analogously to correlation projections, CorALS exploits this order equivalence of Euclidean distance and correlation differences for top differential correlation approximation. For more details and corresponding proofs, see Supplementary Section 6.2.
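
As a concrete check, the sketch below (ours; the function names simply mirror equation (4)) verifies equation (3) on random data with different sample sizes per condition:

```python
import numpy as np

def proj(z):
    zc = z - z.mean()
    return zc / np.linalg.norm(zc)

def delta(z1, z2):
    """Differential projection delta from equation (4)."""
    return np.concatenate([proj(z1), proj(z2)])

def kappa(z1, z2):
    """Differential projection kappa from equation (4)."""
    return np.concatenate([proj(z1), -proj(z2)])

rng = np.random.default_rng(1)
x1, y1 = rng.normal(size=(2, 30))   # condition/timepoint 1 (m1 = 30 samples)
x2, y2 = rng.normal(size=(2, 40))   # condition/timepoint 2 (m2 = 40 samples)

diff = np.corrcoef(x1, y1)[0, 1] - np.corrcoef(x2, y2)[0, 1]
# Equation (3): the correlation difference is a single scalar product.
assert np.allclose(diff, delta(x1, x2) @ kappa(y1, y2))
```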

Efficient calculation of full correlation matrices

Efficiently calculating full correlation matrices is achieved by recognizing that the inner product formulation in equation (1) condenses the correlation calculation between all possible feature pairs in a dataset into a single matrix product \({\hat{{{{X}}}}}^{\top }\hat{{{{X}}}}\). Here, \(\hat{{{{X}}}}\in {{\mathbb{R}}}^{m\times n}\) is the sample-feature matrix representing the corresponding dataset with m samples and n features, where each column is the correlation-projected sample vector of the corresponding feature (see ‘Correlation projections’). This approach can be formulated directly in any recent programming language without requiring additional software packages, and it takes advantage of built-in efficient linear algebra routines such as BLAS and LAPACK47,48, which inherently support parallelization, as showcased in Supplementary Data 1 and Supplementary Section 3. This approach outperforms many other implementations employing similar concepts, as demonstrated in Supplementary Table 2.
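
For illustration, a minimal NumPy sketch of this matrix-product formulation (not the CorALS implementation itself) is:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 1000))          # m = 100 samples, n = 1,000 features

# Correlation projection of each column (feature).
Xc = X - X.mean(axis=0)
X_hat = Xc / np.linalg.norm(Xc, axis=0)

# The full n x n Pearson correlation matrix as one BLAS-backed matrix product.
C = X_hat.T @ X_hat
assert np.allclose(C, np.corrcoef(X, rowvar=False))
```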

Efficient approximation of correlation networks

Top correlation computation as a query search problem

By default, correlation networks are fully connected. However, it is often more valuable to study only the most interesting interactions, that is, the strongest correlations. For this, it is common to either define a fixed threshold or concentrate the analysis on the top-k correlations. A straightforward approach is to calculate the full correlation network and then keep only those correlations that are sufficiently strong according to either criterion. However, for high-dimensional data, calculating the full correlation matrix between features is often not feasible owing to memory restrictions, and in the top-k case, the subsequent sorting operation has superquadratic complexity in the number of features n (\({{{\mathcal{O}}}}({n}^{2}\log\,n)\)). Even partial sorting techniques based on selection algorithms for top-k search may result in impractical runtimes (\({{{\mathcal{O}}}}({n}^{2}+k\log\,k)\))31,49.

To address this, we first observe that, owing to the symmetry property of correlation measures, a single feature can never be strongly correlated with all other features (except in cases where all features are highly correlated). Thus, we assume that the top global correlations can be approximated by finding and merging the top correlations locally, for example, for each feature separately, given an appropriate local margin (coined the approximation factor, as introduced below). This allows CorALS to reinterpret the task of top correlation computation as a query search problem50 in which an indexed set of elements is efficiently queried based on a set of query vectors and a given distance measure. In particular, CorALS constructs an efficient index structure TX over a set of features X and then interprets another (often the same) set of features as queries Y to find the top correlated feature pairs. This approach avoids the construction of the complete correlation matrix, and the corresponding implementation is inherently parallelizable, resulting in substantially reduced runtimes and memory requirements.

In the following, we describe the individual steps to enable this approach. This includes (1) the construction of an optimized indexing and query method that circumvents limitations of the previously derived relation between Euclidean distance and correlation (‘Joint ball trees for local top correlation discovery’), (2) the description of an approximation scheme to generalize single-query-based search to return global top-k correlations (‘Approximate search for global top-k correlations’), and (3) a discussion on the implementation of threshold-based search (‘Threshold-based correlation filtering’).

Joint ball trees for local top correlation discovery

While, in principle, any metric-based k-nearest-neighbor algorithm can be used with CorALS, we focus on space-partitioning algorithms that allow for efficient top-k as well as threshold-based queries in high-dimensional settings. Ball trees (or metric trees), in particular, automatically adjust their structure to the represented data, provide good average-case performance and cope well with high-dimensional entities50,51. While such indexing structures are mostly optimized for metrics such as the Euclidean distance, CorALS takes advantage of the correlation projection introduced in ‘Correlation projections’ and its properties to enable top correlation and differential correlation search.

In particular, CorALS first represents each feature as a correlation vector by applying the correlation projection introduced in ‘Correlation projections’ to its sample vector. These correlation vectors X are then indexed using ball tree space partitioning, resulting in the index TX. On the basis of the relation between Euclidean distance and correlation derived in ‘Correlation projections’, this index allows searching for the top-k positively correlated features, search(TX, y, k), for a given query feature y ∈ Y. It also allows searching for the set of features, search(TX, y, t), passing a positive correlation threshold t with respect to the query feature y.

Note that this set-up has two specific limitations that we address in the following. First, ball trees generally only support searching for top correlations relative to a single reference feature y. The algorithm that generalizes this to a set of features is described in ‘Approximate search for global top-k correlations’ and ‘Threshold-based correlation filtering’. Second, by default, only feature pairs with positive correlations are returned because only positive correlations correspond to small Euclidean distances, while negative correlations result in large distances (see equation (2) and the corollary in Supplementary Section 6.1).

To address the latter, CorALS takes advantage of the fact that negating a sample vector flips the sign of the correlation (as it does for the scalar product):

$${\mathrm{cor}}(-\bf{x},\bf{y})=-{\mathrm{cor}}(\bf{x},\bf{y})={\mathrm{cor}}(\bf{x},-\bf{y})$$
(6)

Without loss of generality, we focus on top-k search in the following derivation. Assuming that at least k features with positive correlations to a query feature y exist in X, all correlations returned by search(TX, y, k) are positive. Similarly, assuming that at least k negative correlations exist, switching the sign of all features in the dictionary X, that is, search(T−X, y, k), or switching the sign of the query, that is, search(TX, −y, k), allows the extraction of the strongest negative correlations (see equation (6)). Thus, a simple solution to find the features with the top positive and negative correlations is to run the search twice, once to extract positive and once to extract negative correlations, followed by a merging step.

However, for top-k search, this merging step involves retrieving the top-k correlations twice, resulting in a sorting step that orders 2k elements, which can double memory requirements. This can be prevented by building the ball tree on the positive and negative dictionary features simultaneously, that is, search(TX∪−X, y, k). This search returns only k elements and thus can reduce runtime and memory requirements. See Supplementary Table 1 for a comparison of top-0.1% search on real-world datasets (Table 1). The corresponding experiments are based on CorALS’s Python implementation and were repeated ten times; reported medians had no substantial fluctuations between runs. While the runtime improvements are marginal, the memory consumption can be reduced by half. Also note that, for multiple queries, ball trees support pre-processing the set of queries, resulting in a dual-tree approach52 that speeds up the search. Supplementary Table 1 also demonstrates the effectiveness of this approach. For the final implementation of CorALS, we jointly build the ball tree structure on negative and positive features and employ the dual-tree search whenever provided by the underlying software library.
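
The following scikit-learn-based sketch illustrates the joint indexing idea (CorALS’s own implementation differs in details such as approximation and batching; names here are ours):

```python
import numpy as np
from sklearn.neighbors import BallTree

def proj_cols(X):
    """Correlation projection of each column (feature); rows of the result
    are the projected feature vectors."""
    Xc = X - X.mean(axis=0)
    return (Xc / np.linalg.norm(Xc, axis=0)).T

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 200))              # 50 samples, 200 features
F = proj_cols(X)
n = F.shape[0]

tree = BallTree(np.vstack([F, -F]))         # joint index over X and -X
dist, idx = tree.query(F, k=6)              # k + 1 = 6; each query also matches itself

sign = np.where(idx >= n, -1.0, 1.0)        # hits in the -X half carry negative correlations
corr = sign * (1 - dist**2 / 2)             # invert equation (2)
partner = idx % n                           # map back to original feature indices
```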

Approximate search for global top-k correlations

Focusing on the top-k correlations can be an effective way to construct interpretable visualizations of correlation matrices without having to explicitly specify a threshold. For this, k is often large, defined either as a multiple of the number of features (for example, 100n or 1,000n) or as a percentage (for example, 0.1% of all correlations, \(\lceil 0.001{n}^{2}\rceil\)). However, the ball tree algorithm (see ‘Joint ball trees for local top correlation discovery’) returns only the top correlations for each feature rather than the overall top-k correlations between all features. To address this, CorALS employs an approximation scheme.

In particular, for each query feature y ∈ Y, CorALS heuristically sets the number k′ of top correlated features to retrieve and then merges the results to approximate the global set of top-k features. Selecting k′ presents a trade-off. On the one hand, if k′ is greater than or equal to the number of features n, all feature pairs will be considered, thus allowing for an exact determination of the top-k features but no gain in runtime. On the other hand, if k′ < n, there is no guarantee that the exact top-k features are retrieved; however, the runtime can be substantially improved as only a subset of candidates is returned and processed. To balance these considerations, CorALS uniformly draws top correlation candidates across all query features with a sufficient margin that accounts for biases in the correlation structure. That is, we choose k′ to be dependent on k with \({k}^{{\prime} }=a \lceil \frac{k}{n}\rceil\) as a middle ground between drawing the exact number of required candidates from each query (\({k}^{{\prime} }=\lceil \frac{k}{n}\rceil\)) and considering all candidates from each query (k′ = n). Here, a is called the approximation factor and regulates how many correlations are inspected per feature. The approximation factor can be selected so that CorALS returns results up to a specific sensitivity s. In particular, for a desired sensitivity s ≤ 0.75, the approximation factor can be chosen as \(a=s\frac{n}{\sqrt{k}}\); for a desired sensitivity s ≥ 0.75, it can be chosen as \(a=\frac{s n}{2\sqrt{k}\sqrt{1-s}}\). When formulating k in terms of the overall number of correlations, that is, \(k=r{n}^{2}\), the approximation factor for a sensitivity s ≤ 0.75 can be calculated via \(a=\frac{s}{\sqrt{r}}\), and for s ≥ 0.75 via \(a=\frac{s}{2\sqrt{r}\sqrt{1-s}}\). In practice, however, the number of missed correlations can be substantially smaller, as correlations are usually not distributed according to the worst case (Supplementary Fig. 5). The derivation of these sensitivity estimates as well as a study of the effects of a itself can be found in Supplementary Section 5. Supplementary Algorithm 1 summarizes the overall approach.
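
A minimal sketch of this heuristic (ours, not the CorALS API) computes the approximation factor for a desired worst-case sensitivity and derives the per-feature candidate count k′:

```python
import math

def approximation_factor(s, n, k):
    """Approximation factor a for a desired worst-case sensitivity s,
    following the formulas above (a sketch, not the CorALS API)."""
    if s <= 0.75:
        return s * n / math.sqrt(k)
    return s * n / (2 * math.sqrt(k) * math.sqrt(1 - s))

n, k = 10_000, 100_000                      # number of features and requested top-k
a = approximation_factor(0.9, n, k)
k_prime = math.ceil(a * math.ceil(k / n))   # candidates retrieved per query feature
```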

Threshold-based correlation filtering

To calculate all correlations greater than a threshold t, for each feature y ∈ Y, we can also employ the ball tree data structure (see ‘Joint ball trees for local top correlation discovery’) by issuing radius queries. For this, the correlation threshold needs to be converted into a Euclidean radius using equation (2). Each query then returns all indexed features with correlations greater than the threshold with respect to the query feature. The results of each query are merged to retrieve the final list of filtered feature pairs. This approach is more memory efficient than calculating correlations for all possible feature pairs, for example, using the methodology introduced in ‘Efficient calculation of full correlation matrices’. However, it can also result in substantially increased runtimes compared with calculating the complete correlation matrix. The corresponding algorithm is implemented analogously to the top-k search in Supplementary Algorithm 1 but replaces k with a correlation threshold that is converted into the corresponding Euclidean radius via equation (2) for use by the ball tree index structure.
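
Rearranging equation (2), a correlation threshold t maps to the Euclidean radius \(\sqrt{2(1-t)}\) on the projected vectors, which can be used directly in radius queries; for example (a sketch building on the ball tree above):

```python
import numpy as np

def radius_for_threshold(t):
    """Euclidean search radius for a correlation threshold t, from
    equation (2): cor >= t  <=>  d_e <= sqrt(2 * (1 - t))."""
    return np.sqrt(2 * (1 - t))

# With the joint ball tree from the previous sketch:
# idx = tree.query_radius(F, r=radius_for_threshold(0.8))
```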

Top correlation difference search

The naive approach to finding the top differences in correlation between pairs of features across two timepoints or conditions is to calculate the full correlation matrices for both conditions or timepoints, subtract them and then extract the top differences, for example, through thresholding or by identifying the top-k candidates. As previously shown for top-k correlation search, this is runtime and memory intensive if implemented naively and thus can easily exceed computational resources (Table 2).

To address this, CorALS builds on the dual feature representation introduced in ‘Differential projections’. In particular, it exploits the connection between correlation difference and the Euclidean distance of the dual feature representations in differential space and then applies the same query search approach as for top correlation search (see ‘Efficient approximation of correlation networks’).

Thus, this first requires representing all features x ∈ X by their dual representations δ(x) ∈ δ(X) and κ(x) ∈ κ(X). Then, analogously to ‘Joint ball trees for local top correlation discovery’, a combined ball tree Tδ(X)∪−δ(X) is constructed to cover negative as well as positive differences. This ball tree can then be used to query the top-k (or thresholded) correlation differences, search(Tδ(X)∪−δ(X), κ(x), k), by querying with the feature representations κ(x) ∈ κ(X). This already includes positive and negative correlation differences because we index the positive and negative projections δ(X) ∪ −δ(X), whereas indexing only δ(X) would return solely the top positive correlation differences (see equation (2) and the corollary in Supplementary Section 6.2). After the construction of Tδ(X)∪−δ(X), the same approximation approach as laid out in ‘Approximate search for global top-k correlations’ and ‘Threshold-based correlation filtering’ is employed to query the top correlation differences across all query features κ(X).
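
Combining the differential projections of equation (4) with the joint ball tree idea gives the following sketch (ours; the CorALS implementation differs in details):

```python
import numpy as np
from sklearn.neighbors import BallTree

def proj_cols(X):
    Xc = X - X.mean(axis=0)
    return (Xc / np.linalg.norm(Xc, axis=0)).T

rng = np.random.default_rng(4)
X1 = rng.normal(size=(30, 100))                 # condition/timepoint 1
X2 = rng.normal(size=(40, 100))                 # condition/timepoint 2

D = np.hstack([proj_cols(X1), proj_cols(X2)])   # rows = delta(x), equation (4)
K = np.hstack([proj_cols(X1), -proj_cols(X2)])  # rows = kappa(x), equation (4)

tree = BallTree(np.vstack([D, -D]))             # joint index over delta(X) and -delta(X)
dist, idx = tree.query(K, k=5)                  # query with kappa(x)

# Equation (5): correlation difference = 2 - d^2 / 2 for hits in delta(X);
# the sign flips for hits in -delta(X).
sign = np.where(idx >= D.shape[0], -1.0, 1.0)
cor_diff = sign * (2 - dist**2 / 2)
```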

Correlation embeddings

t-SNE40 is widely used to embed high-dimensional data points into low-dimensional spaces, for example, for visualization. In this work, we employ t-SNE to embed features based on their correlation structure across samples. However, t-SNE is based on Euclidean distance and thus does not directly represent the correlation structure of features.

In particular, t-SNE reduces the dimensionality of data by minimizing the Kullback–Leibler divergence between a probability distribution, P, in the high-dimensional space and a probability distribution, Q, in the low-dimensional space40:

$$C={\mathrm{KL}}(P\,\Vert\,Q)=\mathop{\sum}\limits_{i}\mathop{\sum}\limits_{j}{p}_{ij}\log \frac{{p}_{ij}}{{q}_{ij}}$$
(7)

where the probabilities pij and qij represent probabilities for features j to belong to the neighborhood of feature i based on Euclidean distance in the corresponding space:

$$\begin{array}{lll}{p}_{ij}&=&\frac{\exp (-\parallel {{{{\bf{z}}}}}_{i}-{{{{\bf{z}}}}}_{j}{\parallel }^{2}/2{\sigma }^{2})}{{\sum }_{k\ne l}\exp (-\parallel {{{{\bf{z}}}}}_{k}-{{{{\bf{z}}}}}_{l}{\parallel }^{2}/2{\sigma }^{2})}\\ {q}_{ij}&=&\frac{{(1+\parallel {\tilde{{{{\bf{z}}}}}}_{i}-{\tilde{{{{\bf{z}}}}}}_{j}{\parallel }^{2})}^{-1}}{{\sum }_{k\ne l}{(1+\parallel {\tilde{{{{\bf{z}}}}}}_{k}-{\tilde{{{{\bf{z}}}}}}_{l}{\parallel }^{2})}^{-1}}\end{array}$$
(8)

with \(\parallel {{{{\bf{z}}}}}_{i}-{{{{\bf{z}}}}}_{j}{\parallel }^{2}\) and \(\parallel {\tilde{{{{\bf{z}}}}}}_{i}-{\tilde{{{{\bf{z}}}}}}_{j}{\parallel }^{2}\) representing pairwise squared Euclidean distances between features i and j for the high-dimensional feature representations z and low-dimensional feature representations \(\tilde{{{{\bf{z}}}}}\), respectively.

Now, by projecting features onto correlation vectors, CorALS establishes an order equivalence between Euclidean distance and correlation, as introduced in ‘Correlation projections’. This makes it possible to directly employ distance-based embedding methods such as t-SNE on the projected features without adding substantial computational overhead or requiring implementations that support customized distance information. A performance example is given in Supplementary Section 7.
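
A minimal sketch of this usage (ours) embeds projected features with a stock Euclidean t-SNE from scikit-learn:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 500))              # 60 samples, 500 features

# Correlation projection: rows of F are projected features, so Euclidean
# proximity reflects positive correlation (equation (2)).
Xc = X - X.mean(axis=0)
F = (Xc / np.linalg.norm(Xc, axis=0)).T

emb = TSNE(n_components=2, init="pca").fit_transform(F)   # one 2D point per feature
```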

Correlation coefficient classes

The underlying computation of CorALS is based on the Pearson correlation coefficient, as discussed in the previous sections. On this basis, CorALS also supports any class of correlation coefficients that can be reduced to the Pearson calculation scheme. In particular, Spearman correlation can be calculated using the Pearson formula by replacing individual feature values with feature-local ranks, which may help to account for outliers or certain error types8,43. CorALS provides the corresponding capabilities to switch between Pearson and Spearman. Similarly, the Phi coefficient for binary variables can be calculated using the Pearson formula53. Finally, other correlation coefficient classes may be supported by future versions of CorALS by finding a mapping between the corresponding coefficient and Euclidean distance, as derived in the previous sections for the Pearson correlation coefficient.

P-value calculation and multiple testing correction

P values for a Pearson correlation coefficient r can be derived from the correlation coefficient together with the number of samples n. First, the t-statistic is derived using \(t=r\frac{\sqrt{n-2}}{\sqrt{1-{r}^{2}}}\). Then, the P value is calculated by examining the tail probability p of the t-distribution: P = 2 ⋅ p(T > ∣t∣), where T follows a t-distribution with n − 2 degrees of freedom. This approach is implemented in CorALS as derive_pvalues and can be applied as a post-processing step.
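
A sketch of this computation using SciPy (CorALS ships it as derive_pvalues; this standalone version is ours):

```python
import numpy as np
from scipy import stats

def derive_pvalues_sketch(r, n_samples):
    """Two-sided P values for Pearson correlation coefficients r
    (assumes |r| < 1); a sketch of the procedure described above."""
    r = np.asarray(r, dtype=float)
    t = r * np.sqrt(n_samples - 2) / np.sqrt(1 - r**2)
    return 2 * stats.t.sf(np.abs(t), df=n_samples - 2)
```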

Note that, owing to the large number of correlations calculated, multiple-testing correction is necessary when working with P values. The most straightforward approach is to control the family-wise error rate using Bonferroni correction, which multiplies the corresponding P values by the number of compared correlation coefficients \(\frac{{n}^{2}-n}{2}\). Other approaches, such as the false-discovery-rate-controlling Benjamini–Hochberg procedure, generally require the full P value distribution, which is not available when applying top-k correlation discovery. In these cases, padding the calculated P values with 1s for the unknown P values can provide an upper bound for the adjusted P values. However, this generally requires instantiating the full number of P values, which causes memory issues as in the full correlation matrix case (Supplementary Table 1). To address this, we provide a truncated version of the Benjamini–Hochberg procedure that avoids this issue.

The Benjamini–Hochberg (BH) procedure yields adjusted P values54 through

$${P}_{(i)}^{{\mathrm{BH}}}={\mathrm{min}}\left\{{\mathrm{min}}_{j\ge i}\left\{\frac{m\cdot {P}_{j}}{j}\right\},1\right\}$$
(9)

with \({P}_{(i)}^{{\mathrm{BH}}}\) representing the BH-corrected P value at rank (i) for ascendingly ranked P values, m being the overall number of P values, for example, \(m=\frac{({n}^{2}-n)}{2}\), and j representing the rank of the P value Pj. On the basis of this formula, a truncated upper-bound version of BH calculates the adjusted P values for all top-k P values. Then, an upper-bound adjusted value is calculated as \(u=\frac{m\cdot 1}{k+1}\). If Pk > u, all adjusted P values P with P = Pk are replaced by u. This yields a minimally invasive truncated BH procedure for adjusted P values that does not instantiate the full distribution of P values. The approach is implemented in CorALS as multiple_test_correction and can be applied as a post-processing step.
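
Because P values padded with 1s only contribute adjusted values of at least 1, the BH adjustment of the k smallest of m P values can be computed from the top-k alone; the following sketch (ours, not the shipped multiple_test_correction) implements equation (9) restricted to the top-k:

```python
import numpy as np

def bh_adjust_topk(p_topk, m):
    """BH-adjusted P values (equation (9)) for the k smallest of m total
    P values, with the unseen P values implicitly padded with 1s."""
    p = np.sort(np.asarray(p_topk, dtype=float))               # ranks 1..k
    ranks = np.arange(1, p.size + 1)
    adj = np.minimum.accumulate((m * p / ranks)[::-1])[::-1]   # min over j >= i
    return np.minimum(adj, 1.0)                                # padded terms are >= 1
```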

Extensible framework for large-scale correlation analysis

The computational framework of CorALS is based on three steps (Fig. 1b): a feature projection step, a dynamic batching step and a reduction step. As such, the general structure is compatible with the big data computation model MapReduce41.

The feature projection step (Fig. 1b, left) prepares the data so that it can be split and processed independently in an efficient manner. In this paper, we specifically focus on deriving an indexing structure based on space partitioning that allows for efficiently querying top-k correlations.

The dynamic batching step (Fig. 1b, middle) then splits the data matrix into multiple batches. The prepared data (and indexing structures) are used to locally extract the relevant values in each batch independently of the other batches. Batches can be processed sequentially, in parallel or even in a distributed manner. Thus, the smaller the batches and the smaller the number of batches that run simultaneously, the less memory is required. This fine-grained control over batches introduces an effective mechanism to manage and trade off memory requirements and runtime based on the available resources. Furthermore, batches may store their results on disk rather than in memory, further reducing memory requirements. In this paper, for each batch of features, we focus on utilizing the previously mentioned indexing structure to extract the local top-k correlations in line with the corresponding approximation factor (see ‘Approximate search for global top-k correlations’). We also provide a thresholding feature that can reduce the memory requirements of the batch results.

Finally, the batch results are reduced into the final result by merging batches. Depending on the batch implementation and the local results, this can be done directly in memory for the fastest runtimes, sequentially by merging one batch result at a time, or even mostly on disk, which can further reduce memory requirements in favor of computation time. In the implementation of the final join analyzed in this paper, the results from the batches consist of individual correlations, which are merged, partitioned and then sorted to return the final top-k values.
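
The following sketch outlines this three-step structure for top-k search (a hypothetical helper built on the ball tree sketches above; positive correlations only, for brevity):

```python
import heapq
import numpy as np

def topk_batched(tree, F, k, k_prime, batch_size=1_000):
    """Project / batch / reduce sketch: query a prebuilt index per batch
    of projected query features F, then merge local results globally."""
    results = []                                    # (correlation, query, partner)
    for start in range(0, F.shape[0], batch_size):
        # Local extraction: k' candidates per query feature in this batch.
        dist, idx = tree.query(F[start:start + batch_size], k=k_prime)
        corr = 1 - dist**2 / 2                      # invert equation (2)
        for b, (cs, js) in enumerate(zip(corr, idx)):
            results.extend((c, start + b, j) for c, j in zip(cs, js))
    return heapq.nlargest(k, results)               # reduction step
```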

Feature projections

Note that the implementation provided by CorALS is highly extensible, and nearly all aspects can be replaced by custom implementations to optimize for particular application scenarios. For example, during the feature projection step, the index structures employed in the current implementation are based on ball trees, which optimize for high-dimensional datasets with small sample sizes by employing correlation and differential spaces (Fig. 1a). However, this index structure can easily be replaced by implementations with different computational characteristics. For example, it may make sense to consider approximate nearest-neighbor methods55 to replace the current index, which may reduce runtimes at a cost in sensitivity. Similarly, particularly for larger sample sizes, instead of using indexing structures, it may be advantageous to directly calculate correlations for smaller batches via the efficient matrix multiplication scheme introduced in ‘Efficient calculation of full correlation matrices’. While this direct calculation and partitioning of correlations increases time complexity from \({{{\mathcal{O}}}}(n\log n)\) to \({{{\mathcal{O}}}}({n}^{2})\), it may be faster than the currently employed ball tree indexing structure, as the corresponding search performance of \({{{\mathcal{O}}}}(\log n)\) may deteriorate to \({{{\mathcal{O}}}}(n)\) with increasing dimensionality (in our case, sample size). Here, it is important to appropriately select the number of simultaneous batches to limit the memory requirements of this approach (for example, if only one batch is used, the complete correlation matrix will be instantiated). A corresponding implementation is provided by CorALS. A detailed comparison with in-depth parameter optimization and the corresponding relation to more efficient approximate nearest-neighbor schemes is left for future studies.

Distributed computation

The methods in this paper focus on in-memory computations. However, as mentioned earlier, the computational framework of CorALS allows for sequential computation of batch results, which can be cached on disk, circumventing potential memory limitations and allowing for the calculation of correlations for massive datasets. Furthermore, CorALS also supports distributed computation of correlation and differential matrices through the joblib backend (https://github.com/joblib/joblib). This directly enables Spark (https://github.com/joblib/joblib-spark), Dask (https://ml.dask.org/joblib.html) and Ray (https://docs.ray.io/en/latest/joblib.html).
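
For example, with the joblib-spark backend, distributed execution can be enabled roughly as follows (a sketch; see the linked projects for the Dask and Ray equivalents):

```python
from joblibspark import register_spark
from joblib import parallel_backend

register_spark()                        # registers the 'spark' joblib backend
with parallel_backend("spark", n_jobs=8):
    ...                                 # run the joblib-parallelized CorALS computation here
```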

In principle, the batch-based design of CorALS also allows for more specialized implementations based on the MapReduce paradigm41. Thus, overall, CorALS provides a very flexible algorithmic framework for large-scale correlation analysis that can be easily extended and adjusted to the application at hand.

Practical considerations

Full correlation matrix calculation

On the basis of the results in Table 2 and Supplementary Table 3, where CorALS substantially outperforms all other methods, we recommend generally using CorALS for full correlation matrix calculation. As the number of features grows, however, the full correlation matrix will no longer fit into memory. For example, at n = 32,000 features, the full matrix uses more than 8 GB of memory; at n = 64,000 features, it already requires more than 32 GB. This can be estimated roughly by assuming 64-bit float values (the default in Python) and the formula \({\mathrm{memory}}(n)=\frac{64 {n}^{2}}{8\times 1{0}^{9}}\) GB. Thus, we recommend switching to top-k correlation analysis beyond n = 32,000 features.
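
A quick worked check of this formula (ours) for the two feature counts mentioned above:

```python
def full_matrix_gb(n, bytes_per_value=8):
    """Approximate size of an n x n float64 correlation matrix in GB."""
    return bytes_per_value * n**2 / 1e9

print(full_matrix_gb(32_000))   # ~8.2 GB
print(full_matrix_gb(64_000))   # ~32.8 GB
```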

Top-k correlation search

For top-k correlation search, we recommend using the basic CorALS implementation (referred to as matrix in Table 2) as long as the full correlation matrix fits into memory, independent of the number of samples. As the number of features increases, however, memory limitations will render this approach infeasible. When this is the case, switching to the index-based CorALS implementation is the best option. With increasing sample numbers, CorALS becomes slower, which may warrant additional heuristics such as dimensionality reduction via locality-sensitive hashing or random projections (see ‘Discussion’).

Note that, by default, the top-k approximation approach does not guarantee symmetric results; that is, even if cor(x, y) is returned, cor(y, x) may be missing. This can be addressed by various post-processing steps, for example, by adding the missing symmetric counterparts. CorALS provides an option to enable this feature. In our experiments, it is not enabled, as symmetric results are redundant for practical purposes.

Correlation structure visualization

For practical purposes, there are two properties of the proposed correlation structure visualization to consider. First, by design, CorALS visualizes strongly positively correlated features close to each other, while the distance to strongly negatively correlated features will be large (see the corollary in ‘Correlation projections’). In some settings, it may be desirable to simultaneously visualize negatively correlated features close to each other, which is currently not supported by CorALS. Second, the relationship between Euclidean distance and correlation established in ‘Correlation projections’ is not linear, which may bias the embedding toward tightly clustering highly correlated features. See Supplementary Fig. 1 for an illustration of the relation between correlation and the corresponding Euclidean distance.

Investigating the coordination of single-cell functions

For the analysis in ‘Correlated functional changes across immune cells’ and Fig. 3, we first divide cells into 20 individual non-overlapping cell types based on manual gating1. We then repeatedly sample 10,000 cells from each cell type across all patients using a dual bootstrapping scheme to ensure appropriate variation in cell types for which fewer than 10,000 cells are present. The dual bootstrapping scheme first samples n cells from each cell type with replacement, where n is the number of available cells for that cell type. From this intermediate sample, we draw the final 10,000 cells for that cell type, again with replacement.
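
A sketch of this dual bootstrapping scheme (ours; cells is a hypothetical per-cell-type array):

```python
import numpy as np

def dual_bootstrap(cells, size=10_000, rng=None):
    """Dual bootstrap: resample all n available cells with replacement,
    then draw the final sample of the requested size from that."""
    rng = rng or np.random.default_rng()
    n = cells.shape[0]
    intermediate = cells[rng.integers(0, n, size=n)]
    return intermediate[rng.integers(0, n, size=size)]
```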

On the resulting sample of 200,000 cells across cell types, we calculate the top-0.01% Spearman correlations across all sampled cells based on their functional markers. We then count the number of top correlations between each pair of cell types, which measures the relative correlation strength between cell types.

By generating pairs of samples in each repetition, one from third-trimester cells and one from postpartum cells, we calculate the effect size (Cliff’s δ) of the top-k frequency differences between each pair of cell types. Supplementary Fig. 6 depicts a single instance of such a pair. We repeat this sampling 1,000 times. Very large effect sizes, defined by a corresponding effect size threshold (t = 0.622), are visualized in Fig. 3. This threshold was derived from analogous interpretation intervals proposed for Cohen’s d (refs. 56,57).

As described above, this procedure requires repeated sampling and top-k correlation calculations across millions of individual cells, making CorALS an essential component of this pipeline, enabling this analysis on our available servers by substantially reducing runtime and particularly memory requirements.

Datasets

The four real-world datasets we use for runtime and memory evaluation stem from biological applications in the context of pre-eclampsia, healthy pregnancy and cancer. All previously reported feature counts are subject to the following pre-processing procedure. We set negative values to 0, remove features that have only a single value and drop duplicate features (features are considered duplicates if all their sample values are the same). Dataset statistics are summarized in Table 1. For dataset availability, see ‘Data availability’.

The pre-eclampsia dataset24,26 contains aligned measurements from the immunome, transcriptome, microbiome, lipidome, proteome and metabolome from 23 pregnant women with and without pre-eclampsia across the three trimesters of pregnancy. In brief, women of at least 18 years of age in their first trimester of a singleton pregnancy were recruited to the study after providing their informed consent and under institutional review board (IRB)-approved protocols. Whole blood, plasma and urine samples, and vaginal swabs were collected throughout pregnancy and processed to generate immunome, transcriptome, microbiome, lipidome, proteome and metabolome datasets. After aligning omics and dropping features with missing or only homogeneous values, 32 samples with 16,897 features were obtained.

The pregnancy dataset6 contains 68 samples from 17 pregnancies, with four samples per woman in the first, second and third trimesters as well as postpartum. Each sample contains immunome, transcriptome, microbiome, proteome and metabolome measurements obtained simultaneously. In brief, women of at least 18 years of age in their first trimester of a singleton pregnancy were recruited to the study after providing their informed consent and under IRB-approved protocols. Whole blood, plasma and serum samples, and vaginal, stool, saliva and gum swabs were collected throughout pregnancy and processed to generate immunome, transcriptome, microbiome, proteome and metabolome datasets. After aligning omics and dropping features with missing or only homogeneous values, 32,211 features were obtained.

The cancer dataset contains samples from 443 patients with gastric adenocarcinoma58 and 185 patients with esophageal carcinoma59, for a total of 628 samples obtained via the LinkedOmics platform25. In brief, fresh frozen tumor samples and accompanying healthy tissue were collected from patients after they provided their informed consent and under IRB-approved protocols. Samples were used to generate DNA methylation profiling at the CpG-site and gene levels (methylation CpG-site level, HM450K; methylation gene level, HM450K), whole-exome sequencing (mutation gene level), messenger RNA sequencing (HiSeq, gene level), reverse-phase protein array (analyte level) and somatic copy number variation (gene level, log-ratio) datasets. After aligning omics and dropping features with missing or only homogeneous values, the dataset consisted of samples from 258 patients. For our runtime and memory experiments, we sample increasing numbers of features (25%, 50% and 100%).

The single-cell dataset1 contains 68 mass cytometry samples from 17 pregnancies, with four samples per woman in the first, second and third trimesters as well as postpartum. In brief, women of at least 18 years of age in their first trimester of a singleton pregnancy were recruited to the study after providing their informed consent and under IRB-approved protocols. Whole blood samples were collected throughout pregnancy and processed to generate an immunome dataset. For the benchmark experiments, samples from the third trimester were used. We process the data by sampling 10,000/30,000 cells from each of the 20 cell types, resulting in a dataset with 200,000/600,000 cells and 10 functional markers per cell.

We also add a simulated dataset (sim) with 400,000 features and 500 samples to test larger sample sizes. The data are generated randomly.

Experimental settings for runtime and memory analysis

Experiments were repeated 3 to 10 times depending on their runtime; the first run was always dropped (to account for burn-in, for example, for Julia’s JIT compiler), and the respective medians are reported. No substantial runtime or memory fluctuations were observed. The experiments were run on a bare-metal server with two AMD EPYC 7452 32-core processors and hyper-threading enabled, amounting to 128 processing units. The machine provided 314 GB of memory and ran Ubuntu 20.04.1 LTS. We used Python 3.9.1 and R 4.0.3 with current packages installed from conda-forge and Bioconductor. The Julia version was 1.5.3. Multi-threading was disabled explicitly if not otherwise specified.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.