Introduction

When investigating real-world network datasets we often do not have access to the complete network: the dataset may be too large, or storage and resources during the data collection phase may be limited. Nevertheless, this should not prevent practitioners from analyzing an available network sample. In fact, evaluating network properties while accessing only a smaller sample is a relevant problem in various fields, ranging from modelling dynamical processes (De Choudhury et al. 2010; Sadikov et al. 2011) and estimating network statistics (Leskovec and Faloutsos 2006) to data compression (Adler and Mitzenmacher 2001) and survey design (Frank 2005). When one can design the sampling scheme used for data collection, this should be done wisely, as the scheme biases the estimates of the network properties one aims to investigate (Han et al. 2005; Lee et al. 2006; Kossinets 2006). The goal should be a sampling protocol that not only preserves the relevant network properties of the entire topology inside the sample, but can also be implemented efficiently. Most sampling strategies found in the literature (Leskovec and Faloutsos 2006) are empirically driven and lack theoretical grounding. Recently, TCEC (Ruggeri and De Bacco 2020) was proposed: a sampling algorithm for approximating in-sample eigenvector centrality (Bonacich 1972) whose main features are being theoretically grounded and computationally scalable. TCEC aims at preserving, inside the sample, the relative eigenvector centrality ranking of nodes, a centrality measure used in many disciplines to characterize the importance of nodes. However, this might not be the only property of interest when studying a network. The question is then how a sampling method optimized to retrieve one particular property performs in estimating other network-related measures.

In this work we address this question by performing an extensive analysis of the behavior of TCEC in recovering several relevant network properties, by means of empirical results on real and synthetic networks. In particular, we focus on estimating various centrality measures that are characterized very differently from eigenvector centrality and do not derive from spectral methods. We then investigate how community structure and covariate information are affected by the sampling. In addition, we analyze how sampling strategies behave in recovering relevant network statistics specific to various network generative models, and we compare performance with other sampling strategies. Finally, we discuss the challenges preventing a trivial extension of TCEC to the PageRank (Brin and Page 1998) score.

Related work

A large part of the scientific literature investigating sampling strategies on networks is based on empirical approaches (Blagus et al. 2017; Costenbader and Valente 2003) and focuses on recovering standard topological properties like the degree distribution, diameter or clustering coefficient (Leskovec and Faloutsos 2006; Morstatter et al. 2013; Stutzbach et al. 2009; Hübler et al. 2008; Stumpf and Wiuf 2005; Ganguly and Kolaczyk 2018; Antunes et al. 2018). To the best of our knowledge, TCEC sampling (Ruggeri and De Bacco 2020) is one of the first theoretical attempts at estimating eigenvector centrality that goes beyond heuristics or empirical reasoning. A closely related problem is that of estimating eigenvector centrality without observing any edges, but only signals on nodes (Roddenberry and Segarra 2019; He and Wai 2020). A different but related research direction questions the stability of centrality measures under perturbations (Segarra and Ribeiro 2015; Han and Lee 2016; Murai and Yoshida 2019). For the PageRank score, and more recently for Katz centrality as well (Lin et al. 2019), similar lines of research pursue the different objective of estimating single nodes' scores or approximating the external information missing for reliable within-sample estimation (Sakakura et al. 2014; Chen et al. 2004; Davis and Dhillon 2006), rather than estimating the relative ranking of nodes within a sample as we do here. Finally, focusing on temporal networks, Shao et al. (2017) propose a centrality measure suitable for this case and a method for its estimation using the network dynamics.

TCEC: sampling for eigenvector centrality estimation

In this section we introduce the formalism and explain the main ideas behind the Theoretical Criterion for Eigenvector Centrality (TCEC) sampling algorithm (Ruggeri and De Bacco 2020). The method uses the mathematical formalism of spectral approximation theory to approximate the eigenvector centralities of nodes in a subsample with their values in the whole graph. Consider a graph \(G=(\mathcal {V}, \mathcal {E})\), where \(\mathcal {V}\) is the set of nodes and \(\mathcal {E}\) the set of edges, and denote by A its adjacency matrix, whose entry \(A_{ij} \in \mathbb {R}_{\ge 0}\) is the weight of an edge from i to j. Sampling a network can be defined as the problem of selecting a principal submatrix \(A_{m}'\) of size \(m\le |\mathcal {V}|\) induced by a subset of nodes \(\mathcal {I}\subseteq \mathcal {V}\). The subsampled network is denoted as \(G_{m}=(\mathcal {I}, \mathcal {E}_{m})\), where \(\mathcal {E}_{m} \subseteq \mathcal {E}\) is the set of edges in the subsample. In general, there are several possible choices for selecting \(G_{m}\); they should depend on the quantities one aims to preserve when sampling.

TCEC selects \(G_{m}\) so as to minimize the sine distance \(\sin (\mu _{m},{\tilde{\mu }})\) between the eigenvector centrality \({\tilde{\mu }} \in \mathbb {R}^{m}\) computed in the subsample and the centrality of the same nodes calculated from the whole graph, \(\mu _{m} \in \mathbb {R}^{m}\); here \(\mu _m\) is built from the whole-graph eigenvector centrality \(\mu \in \mathbb {R}^{|\mathcal {V}|}\) by selecting only the m entries corresponding to nodes in the subsample. Accessing \(\sin (\mu _{m},{\tilde{\mu }})\) without knowledge of the whole graph is not possible. However, since eigenvector centrality is a spectral method, i.e. it is based on evaluating eigenvectors and eigenvalues, TCEC uses projection methods for spectral approximation to derive a bound on that distance and relate it to network-related quantities. This results in an algorithmic implementation of a sampling procedure that aims at minimizing that bound. Referring to Ruggeri and De Bacco (2020) for details, the algorithm briefly works as follows. Starting from an initial small random sample, it selects nodes in an online fashion: it adds to the current sample \(\mathcal {I}\) of size \(k-1\) one node at a time, selecting the best node from the set of non-sampled nodes \(j\in \mathcal {V} {\setminus } \mathcal {I}\). The best candidate node j is the one maximizing the following quantity, made of network-related terms:

$$\begin{aligned} \, (1 -\alpha ) \left( || b_1 ||_2^2 + || b_1^T U ||_2^2 - || b_3 ||_2^2 \right) + \alpha \,d_{in}^{G_k}(j), \end{aligned}$$
(1)

where \(b_1\in \mathbb {R}^{k-1}\) contains the edges pointing from j to the nodes already in the subsample, \(b_2 \in \mathbb {R}\) is the entry corresponding to j itself, \(b_3 \in \mathbb {R}^{n-k}\) contains the edges from nodes outside the sample towards j, and \(U \in \mathbb {R}^{(k-1) \times (n-k)}\) contains the edges from nodes outside the sample (j excluded) towards nodes in it; \(d_{in}^{G_{k}}(j)\) is the (weighted) in-degree of node j calculated considering only the incoming edges from nodes that are in the sample; \(\alpha \in [0,1]\) is a hyperparameter that can be tuned empirically. We present a diagram of the quantities involved in Fig. 1.
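To make Eq. (1) concrete, here is a minimal sketch of the criterion that scores a candidate node j from a dense adjacency matrix. This is our own reading of the quantities defined above (see also Fig. 1), not the reference implementation, and indexing conventions are ours:

```python
import numpy as np

def tcec_score(A, sample, j, alpha=0.1):
    """Eq. (1) for a candidate node j; `sample` lists the already-sampled nodes."""
    in_sample = set(sample)
    outside = [v for v in range(A.shape[0]) if v not in in_sample and v != j]
    b1 = A[j, sample]                     # edges from j into the sample
    b3 = A[outside, j]                    # edges from outside the sample towards j
    U = A[np.ix_(outside, sample)].T      # edges from outside into the sample
    d_in = A[sample, j].sum()             # weighted in-degree of j from the sample
    spectral = b1 @ b1 + (b1 @ U) @ (b1 @ U) - b3 @ b3
    return (1 - alpha) * spectral + alpha * d_in

# one greedy step: add the highest-scoring candidate to the current sample
# best = max(candidates, key=lambda j: tcec_score(A, sample, j))
```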

Fig. 1

TCEC sampling visual representation. Consider a candidate node j to be added to the current sample \(G_m\). The algorithm considers: the outgoing connections \(b_1\) towards the sample, the incoming connections \(b_3\) from the non-sampled nodes and U, the remaining edges incoming towards the sample
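When the whole graph is available, the objective this criterion serves, i.e. the sine distance between the in-sample centrality and the whole-graph centrality restricted to the sampled nodes, can be computed directly, which is useful as a sanity check. A self-contained sketch with an illustrative toy graph and a plain random sample (our own code, not the paper's setup):

```python
import numpy as np
import networkx as nx

def sin_distance(u, v):
    """Sine of the angle between two vectors."""
    cos = abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.sqrt(max(0.0, 1.0 - cos**2))

G = nx.erdos_renyi_graph(300, 0.05, seed=1)          # toy graph
mu = nx.eigenvector_centrality_numpy(G)              # whole-graph centrality

rng = np.random.default_rng(1)
sample = sorted(rng.choice(list(G), size=60, replace=False))  # toy sample
G_m = G.subgraph(sample)
mu_tilde = nx.eigenvector_centrality_numpy(G_m)      # in-sample centrality

a = np.array([mu[x] for x in sample])                # whole-graph values on sample
b = np.array([mu_tilde[x] for x in sample])          # in-sample values
print(sin_distance(a, b))
```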

Empirical studies

We study the impact of sampling a network with TCEC on several relevant network properties different from eigenvector centrality. Namely, we investigate: (1) the distribution of the sampled nodes in terms of non-spectral centrality measures such as in-degree, betweenness centrality and SpringRank (De Bacco et al. 2018); (2) the relationship between community structure and sampled nodes; (3) the preservation of the distribution of node attributes in the sampled network; (4) the impact of sampling on other model-specific statistics. For all these tasks, we compare with uniform random walk sampling (RW), as this is the mainstream choice in many sampling scenarios, owing to its favorable statistical and computational properties (Gjoka et al. 2010); it has also shown better performance in recovering eigenvector centrality than all other state-of-the-art algorithms analyzed against TCEC (Ruggeri and De Bacco 2020). In addition, in the absence of a single best sampling protocol that works for all applications, we further show comparisons with various other algorithms; for the sake of visualization, we move some of the results to the corresponding Appendices. In the following experiments we use the Kendall-\(\tau\) correlation (Kendall 1990) to assess similarity between score vectors, as done in Ruggeri and De Bacco (2020).
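Concretely, the evaluations below compare a score computed on the whole graph, restricted to the sampled nodes, with the same score recomputed inside the sample. A minimal helper using scipy (our own glue code, not part of TCEC):

```python
from scipy.stats import kendalltau

def rank_agreement(full_scores, sample_scores):
    """Kendall-tau between whole-graph and in-sample scores on the sampled nodes.

    Both arguments are dicts mapping node -> score; `sample_scores` contains
    only the sampled nodes.
    """
    nodes = sorted(sample_scores)
    tau, _ = kendalltau([full_scores[v] for v in nodes],
                        [sample_scores[v] for v in nodes])
    return tau
```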

Implementation details

While we refer to Ruggeri and De Bacco (2020) for the detailed definitions of the parameters needed in the algorithmic implementation, we provide a summary of the values used in our experiments in “Appendix 1”; we use the open-source implementation of TCEC available online.Footnote 1 In all the following experiments we sample \(10\%\) of the total nodes. We comment on the stability and effects of different sample sizes in “Appendix 2” by means of an empirical comparison.

In addition to RW, we compare for performance with several other commonly employed sampling algorithms: uniform sampling of nodes (RN) (Leskovec and Faloutsos 2006; Ahmed et al. 2012; Wagner et al. 2017), uniform sampling of edges (RE), node2vec (Grover and Leskovec 2016) (with exploration parameters \(p=2\), \(q=0.5\), i.e. depth-first oriented search) and snowball expansion sampling (EXP) (Maiya and Berger-Wolf 2010).
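For concreteness, minimal versions of the three simplest baselines are sketched below. These are our own illustrative implementations; details such as restart behavior or how RE induces its node set vary across the cited works.

```python
import random
import networkx as nx

def sample_rn(G, m):
    """RN: m nodes chosen uniformly at random."""
    return set(random.sample(list(G.nodes()), m))

def sample_re(G, m):
    """RE: accumulate endpoints of uniformly random edges until m nodes."""
    nodes, edges = set(), list(G.edges())
    random.shuffle(edges)
    for u, v in edges:
        nodes.update((u, v))
        if len(nodes) >= m:
            break
    return nodes

def sample_rw(G, m, start=None):
    """RW: uniform random walk, restarting uniformly when stuck."""
    current = start if start is not None else random.choice(list(G.nodes()))
    nodes = {current}
    while len(nodes) < m:
        nbrs = list(G.neighbors(current))
        current = random.choice(nbrs) if nbrs else random.choice(list(G.nodes()))
        nodes.add(current)
    return nodes
```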

Non-spectral centrality measures behavior

We analyzed the performance of TCEC in estimating non-spectral centrality measures on various real-world datasets (see “Appendix 1” for more details): C. Elegans, the neural network of the nematode worm C. elegans (Watts and Strogatz 1998); US Power grid, an electrical power grid of the western US (Watts and Strogatz 1998); Epinions, a who-trusts-whom network based on the review site Epinions.com (Takac and Zabovsky 2012); Slashdot, a social network based on the community of the reviews website Slashdot.org (Leskovec et al. 2009); Stanford, a network of hyperlinks of the stanford.edu domain (Leskovec et al. 2009); Adolescent Health, a network of friendships between schoolmates (Moody 2001). Together these networks cover different domains (transportation, social, biological, communication), directed and undirected topologies, sizes (from the order of \(10^{2}\) to \(10^5\) nodes) and sparsity levels (from an average degree of 2.67 to 20.29). We use all the algorithms listed above with the exception of EXP sampling, since it has already proven to perform poorly in eigenvector centrality approximation (Ruggeri and De Bacco 2020) and is computationally too slow to be deployed on networks of the sizes considered here; in “Appendix 3” we show the computational efficiency of the various algorithms.

We consider three different centrality measures: (1) in-degree centrality, which corresponds to the in-degree of a node; (2) betweenness centrality, which captures the importance of a node in terms of the number of shortest paths that must pass through it in order to traverse the network; (3) SpringRank (De Bacco et al. 2018), a physics-inspired probabilistic method that ranks nodes from directed interactions and yields rank distributions rather different from those of spectral measures like eigenvector centrality. Together, these three provide a diverse set of methods to characterize a node's importance. Notably, none of them is based on spectral methods, as opposed to the theoretical grounding behind TCEC.
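The first two measures are readily available in networkx; SpringRank requires a separate implementation, so we only indicate it here. A short sketch on a toy directed graph (graph and parameters are illustrative):

```python
import networkx as nx

G = nx.gnp_random_graph(200, 0.05, directed=True, seed=0)  # toy directed graph

in_deg = nx.in_degree_centrality(G)     # (1) in-degree centrality
btw = nx.betweenness_centrality(G)      # (2) shortest-path betweenness
# (3) SpringRank is not part of networkx; see De Bacco et al. (2018)
#     for the reference implementation.
```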

As we show in Fig. 2, all the considered non-spectral measures are well approximated by RW-like algorithms and TCEC on all datasets, with TCEC performing slightly better on average. A big gap can be observed instead with respect to uniform random sampling (RN and RE). We argue that this might be caused by the loss of discriminative edges when the topology is not taken into account at sampling time, resulting in poor performance in recovering any edge-based centrality measure. For the sake of completeness, we include similar plots for the estimation of two spectral centrality measures, PageRank (Brin and Page 1998) and Katz centrality (Katz 1953), in “Appendix 4”. Finally, while it is difficult to model analytically how these results depend on the sample size, empirically we see that performance increases as the sample grows, similarly to what was observed for recovering eigenvector centrality in Ruggeri and De Bacco (2020). The magnitude of this increment depends on the measure being tracked and the specific dataset; however, the relative performance of the various sampling strategies is consistent with the results reported here in Fig. 2, see “Appendix 2”.

As a final consideration, one may wonder how much of the approximation capability of TCEC with respect to non-spectral centralities, such as in-degree, could be explained by correlations with the eigenvector centrality itself. In fact, one would expect RN sampling to perform best at least in the approximation of nodes' degrees. We remark on the following facts. First, in the datasets considered above, the correlation between eigenvector centrality and in-degree varies from high values, 0.84 and 0.80 for the Epinions and Slashdot datasets respectively, to values as low as 0.03 for the Stanford dataset (see “Appendix 5”), while the Kendall correlation is quite high in both these extreme cases. It is therefore not possible to directly attribute the good approximation of degree centrality to the correlation between the latter and eigenvector centrality. We conjecture that it is rather due to the character of sampling methods based on moving between adjacent nodes, like RW, TCEC and node2vec, which tend to preserve connections between nodes and therefore degrees. In fact, the uniform sampling strategies RN and RE (which do not move between adjacent nodes) perform significantly worse, consistently across datasets and measures. In addition, in the presence of tight-knit communities, the former algorithms are more likely to sample the majority of a node's neighborhood, without getting lost in other regions of the graph, as we show in the section on community structure preservation below.

Fig. 2

Approximation of non-spectral centrality measures with various sampling algorithms for the real datasets Epinions (upper left), Slashdot (upper right), Stanford (center left), Adolescent Health (center right), C. Elegans (bottom left) and US Power grid (bottom right). Scatterplots represent average results over 10 runs of sampling, with standard deviations indicated by vertical bars. SpringRank is missing for undirected networks

Community structure preservation

We investigate how the sampling algorithms impact a network's underlying community structure. To this end, we study the distribution of the community memberships of sampled nodes in synthetic networks generated with the Stochastic Block Model (SBM) (Holland et al. 1983), with \(N=10^{4}\) nodes divided into 3 communities. Sampling protocols can be sensitive to the topological structure of the network (assortative or homophilic, disassortative or heterophilic) and to the balance of group sizes (Wagner et al. 2017). These can all impact how the different groups are represented in the sample, along with other factors such as individuals' perception biases (Lee et al. 2019). We thus run tests on both types of structure and with various levels of community balance. Specifically, we consider: (1) balanced assortative networks: two groups of 3000 nodes and one of 4000, within-block connection probability \(p_{in}=0.05\) and between-block probability \(p_{out}=0.005\); (2) unbalanced assortative networks: groups of sizes 1000, 3000 and 6000 respectively, with the same \(p_{in}\) and \(p_{out}\) as in (1); (3) balanced disassortative networks: the same group division as in (1), but within-block connection probability \(p_{in}=0.005\) and between-block probability \(p_{out}=0.05\). We compare TCEC with RW, which was shown to be robust in representing groups in the sample (Wagner et al. 2017), and with EXP sampling, since the latter has been explicitly built to sample community structure. All algorithms start sampling from a node belonging to the group of smallest size.

We observe two qualitatively different trends in the way nodes are chosen. RW yields samples of nodes more homogeneously distributed across communities, in all network structures. TCEC, instead, tends to select nodes within the block where it has been initialized. A possible explanation for this behavior is given by the peculiar form of the TCEC score of Eq. (1). In general, the algorithm tends to select nodes with a large \(|| b_1 ||_2\) and a small \(|| b_3 ||_2\), i.e. many connections towards the sample and few connections from outside the sample. A likely outcome is then to select nodes within the same community, where this combination holds. However, for the standard SBM case one can show (see “Appendix 9”) that the contribution of these two terms cancels out, so the term that matters is \(b_1^{T}\, U\). This represents the number of common in-sample neighbors between j and any other \(\ell \in G' {\setminus } \{ j \}\). Assuming, for simplicity, homogeneous communities, in the assortative case it is reasonable to assume that the initial random walk biases the sample to contain more nodes of the same color as the initial seed, say r. With this initial bias, one can show that, on average, nodes j of color r have a higher score, because they have higher values of \(|| b_1^{T}\, U ||_2\) than other \(\ell \in G'\) not of color r. For the disassortative case we have the opposite result: nodes of a color different from r have higher scores. In this case, the sampling dynamics keep jumping to nodes of different colors, hence the sample is more balanced. Similar arguments can be generalized to more complex topologies, for instance the degree-corrected SBM, where node degrees also contribute to the scores (not only in the case \(\alpha >0\)); however, the theoretical generalization becomes more cumbersome as we add more details to the model. With the same reasoning we recover the Erdős–Rényi case by assuming a connection probability equal for all edges, i.e. independent of the nodes' colors. In this case, one can show that the only term that matters is the (weighted) in-degree \(d_{in}^{G_{k}}(j)\), provided \(\alpha >0\). This matches the intuition that, given the uniform edge probability, all nodes in an Erdős–Rényi network are statistically identical, hence there should be no best candidate to distinguish, unless we look at a specific realization where random fluctuations dominate. Such graphs may not be appropriate for evaluating the performance of a centrality measure, as all nodes concentrate around the same centrality score. Finally, expansion sampling remains confined to a single block, as it is a deterministic algorithm.

Results are presented in Fig. 3, where we also report the KL-divergence (Kullback and Leibler 1951) between the community distribution in the sample and in the whole network, for all sampling algorithms (the communities are the known ground truth used to generate the SBM networks). The KL-divergence is a measure of discrepancy between probability distributions: it is 0 if they perfectly overlap and grows as the difference between them increases. Thus, higher values signal a higher discrepancy between the in-sample block distribution and the one calculated on the entire network. To account for the different numbers of nodes in the whole network and in the sample, we consider the frequency of each group as the community distribution. This can be observed graphically in Fig. 3 (left) for the assortative structure (1). Here the higher KL-divergence is due to a more pronounced clustering of sampled nodes in one single block: the nodes selected by RW are more scattered around different blocks, while TCEC tends to select nodes within a single block and expansion sampling is completely confined to the initial one. Similar results hold for case (2), as defined above, and are presented in “Appendix 6”. For the disassortative structure (3), however, results differ. In this case, TCEC and RW tend to explore the network in a similar manner. A lower KL-divergence from the ground truth signals that blocks are sampled more uniformly. While for RW this phenomenon is explained by the stochasticity of the neighborhood exploration, for TCEC it is caused by the way the algorithm selects candidate nodes with high out-degrees towards the sample but small in-degrees from outside of it, as shown in Fig. 1. In disassortative networks these likely candidates belong to different communities, hence the more homogeneous exploration. Expansion sampling is still confined inside the starting block, as in the previous case.
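As an illustration of this measurement, here is a self-contained sketch of setting (1), scaled down for speed, that computes the KL-divergence between block frequencies in the whole network and in a sample. The random-walk sampler is a stand-in; any of the algorithms above can be plugged in:

```python
import numpy as np
import networkx as nx

sizes = [300, 300, 400]                    # setting (1), scaled down for speed
probs = [[0.05 if i == j else 0.005 for j in range(3)] for i in range(3)]
G = nx.stochastic_block_model(sizes, probs, seed=0)
block = nx.get_node_attributes(G, "block")           # ground-truth communities

def rw_sample(G, m, start=0):
    """Plain random-walk sampler, purely for illustration."""
    rng, nodes, cur = np.random.default_rng(0), {start}, start
    while len(nodes) < m:
        cur = rng.choice(list(G.neighbors(cur)))
        nodes.add(cur)
    return nodes

def block_freq(nodes):
    counts = np.bincount([block[v] for v in nodes], minlength=len(sizes))
    return counts / counts.sum()

p = block_freq(G.nodes())                        # whole-network distribution
q = block_freq(rw_sample(G, int(0.1 * len(G))))  # in-sample distribution
print(np.sum(p * np.log(p / q)))                 # KL(p || q); assumes q > 0
```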

Fig. 3

Community structure and sampling. We show an example of sampling two synthetic SBM networks, one assortative (left) and one disassortative (right). Sampled nodes and edges are colored in blue for TCEC (left), red for uniform random walk (center) and green for snowball expansion sampling (right). The KL-divergence averages and standard deviations are computed over 10 different rounds of sampling 10% of all the nodes

Node attribute preservation

Another relevant question is whether node attributes are affected by the way the network is sampled. This is particularly important in cases where extra information is known along with the network's topological structure. For instance, in relational classification, network information is exploited to label individuals (e.g. recovering node attributes); classification performance can change significantly based on the sampling protocol adopted (Ahmed et al. 2012; Espín-Noboa et al. 2018). Here we assume that attribute information is used a posteriori to analyze the results, but is not taken as input to bias the sampling. In general, when performing statistical tests on the covariates of sampled networks, we work under the assumption that their distribution is similar to that of the original network. However, this assumption is not necessarily fulfilled under arbitrary sampling. Notice that this is a related but different problem from the community structure preservation studied above: there, we explicitly imposed that communities were correlated with the network structure. If attributes correlate well with communities, we should see a pattern similar to the results of the previous section. We test this on synthetic networks with both communities and node attributes, where a parameter tunes their correlation. Namely, we use a 2-block SBM generative model with varying levels of correlation between the community membership and a binary covariate (Contisciani et al. 2020). We then measure the correlation between community and covariate after sampling. Results are presented in Fig. 4. As can be observed, the original correlation is preserved in the samples by most algorithms, as the relative deviation from the real value (y-axis) is low with respect to the true correlation (x-axis), suggesting a regular behavior of covariates in the samples, despite their not being taken into account as input by the sampling algorithms. For TCEC, the performance seems to increase with the correlation value, while for the other algorithms we cannot distinguish a clear monotonic pattern.
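Since both the community membership and the covariate are binary in this setting, a simple way to read the quantity on the y-axis of Fig. 4 is the Pearson (phi) correlation between the two binary vectors, before and after sampling. The paper does not specify the estimator, so this sketch is our own assumption:

```python
import numpy as np

def phi_correlation(x, y):
    """Pearson correlation between two binary vectors (community, covariate)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.corrcoef(x, y)[0, 1]

# corr_full = phi_correlation(community_all, covariate_all)
# corr_sample = phi_correlation(community_sampled, covariate_sampled)
# deviation = corr_sample - corr_full   # our reading of the y-axis of Fig. 4
```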

Fig. 4

Results from sampling SBM networks with varying levels of correlation between a binary covariate and the community membership. The x-axis shows the ground-truth correlation (the one used to generate the synthetic network). The y-axis shows the difference between this correlation and the one calculated after sampling. Standard deviations are computed over 10 different rounds of sampling 10% of all the nodes. Notice how the y-axis is restricted to values around 0, denoting small deviations from the full network's value. Notice also that snowball sampling, being a deterministic method when starting from the same seed node (as in this experiment), presents no variance

But if the community-attribute correlation is absent or non-trivial, we might see something different when looking only at attributes (e.g. if a practitioner is not interested in community detection). In this case, we can only assume correlation with the network structure, which may or may not hold depending on the real dataset at hand. We test this behavior by studying the Pokec dataset (Takac and Zabovsky 2012). This is a social network representing connections between people in the form of online friendships. In addition, the dataset contains extra covariate information on nodes, i.e. attributes of the individuals. We focus on one of them, the geo-localization of users in one of ten regions (the eight Slovakian regions, the Czech Republic and one label for all other foreign countries) where the social network is based. We compare the distribution of this covariate in the full network with that on the nodes sampled by RW, TCEC and node2vec, with exploration parameters \(p=2\), \(q=0.5\), i.e. depth-first oriented search. We omit results from other sampling algorithms, to focus on the interpretation of these relevant cases. The choice of node2vec is motivated by its frequent use for node embedding tasks.Footnote 2 As node embeddings are often used for regression or classification tasks along with network covariates, it is thus relevant to our task here. We run the algorithms starting from seed nodes within different regions, as the choice of the initial sample of labeled seed nodes can impact the final in-sample attribute distribution (Wagner et al. 2017). As before, we measure the KL-divergence between the empirical attribute distribution on the entire network and the one found within the sample. A graphical representation of one example of the results is given in Fig. 5.

We notice different behaviors for the various sampling methods. While all algorithms recover a covariate distribution close to the ground truth, slightly better performance is achieved, in order, by RW, TCEC and node2vec, with average KL values ranging from 0.01 to 0.04 respectively. However, a peculiar trend can be observed in relation to the starting region: the final sample is biased towards over-representing the seed region for node2vec, as opposed to the comparable homogeneity obtained by TCEC and RW. This is a subtle result, as the over-representation is not revealed by the KL values. Instead, it can be measured by the entropy ratio \(H_{G_{m}}(s)/H_{G}(s)\), where \(H_{G_{m}}(s)=-p_{G_{m}}(s)\log p_{G_{m}}(s)-(1-p_{G_{m}}(s))\log (1-p_{G_{m}}(s))\) is the entropy of a binary random variable representing whether a node in the sample belongs to the seed region s or not, and \(H_{G}(s)\) is the same quantity calculated over all nodes in the graph. In words, this measures the discrepancy between the in-sample and whole-network frequencies of the attribute corresponding to the seed region. Assuming that all the frequencies, in-sample and whole-network, are less than 0.5 (which is the case in our experiments), values close to 1 denote high similarity, values greater than 1 over-representation and values less than 1 under-representation of the attribute. In all but two starting regions, node2vec has a significantly high entropy ratio: for various seed regions it exceeds 1.19, whereas the maximum values obtained by TCEC and RW are both below 1.12. Quantitatively, this shows the magnitude of the over-representation induced in the sample by node2vec; TCEC and RW, instead, do not exhibit any significant bias towards the starting region. An example of this behavior is plotted in Fig. 5; all the other starting regions are given in “Appendix 7”.
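A small helper implementing this diagnostic (our own code; `region` is a hypothetical dict mapping nodes to their attribute values and s is the seed-region label):

```python
import numpy as np

def binary_entropy(p):
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

def entropy_ratio(region, s, sample_nodes, all_nodes):
    """H_{G_m}(s) / H_G(s): values above 1 flag over-representation of the
    seed region s in the sample (assuming both frequencies are below 0.5)."""
    p_sample = np.mean([region[v] == s for v in sample_nodes])
    p_full = np.mean([region[v] == s for v in all_nodes])
    return binary_entropy(p_sample) / binary_entropy(p_full)
```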

Fig. 5

Covariate distributions and sampling. We consider the Pokec social network dataset (\(\approx 1.3 \cdot 10^6\) nodes and \(\approx 2.9 \cdot 10^7\) edges) with a sample fraction of 10%. The columns with bold colors represent the region from which the sampling started. The numbers in the legend are averages and standard deviations over 10 runs of \(KL(p_{G}||p_{G_{m}})\), where \(p_{G}\) is the empirical frequency of the ten regions in the original graph and \(p_{G_{m}}\) the same in the sample. H represents the mean and standard deviation over the same runs of the entropy ratio \(H_{G_m}(s) / H_G (s)\). We plot here an example of the relative frequency of nodes for the ten regions into which nodes are divided. The distribution on the whole network is the black vertical line. Vertical lines on top of bars represent standard deviations across 10 runs of sampling. Notice three different behaviors: RW obtains an in-sample attribute distribution similar to the one on the whole graph; TCEC has a higher KL difference, followed by node2vec. On average, the former two are not biased by the starting region, as is instead the case for node2vec. This can also be observed quantitatively in its higher \(H_{G_m}(s) / H_G (s)\) ratio

Further statistics

Another interesting question is whether other relevant network statistics are preserved in a controlled setting, where we generate synthetic data with a particular underlying structure. We explore this on two types of synthetic networks where we have control over the generative process. First, we generate a network from an Exponential Random Graph Model (ERGM), a probabilistic model popular in the social sciences in which a set of sufficient statistics has predefined expected values. We use one of the few ERGMs with an analytical formulation,Footnote 3 the reciprocity model of Holland and Leinhardt (Holland and Leinhardt 1981); here the network has a fixed number n of nodes and two parameters: p, regulating the probability of directed connections, and \(\alpha\), regulating the expected reciprocity, i.e. the fraction of node pairs connected by edges in both directions (Park and Newman 2004). We set \(n=10^{4}\), \(p=1.51 \cdot 10^{-4}\), \(\alpha =10.1\). The values of p and \(\alpha\) were chosen as estimates based on the social network Slashdot, since for this type of social network reciprocity and density are particularly relevant and descriptive measures. After generating the synthetic network, we sample \(10\%\) of its nodes and re-estimate the parameters of the two sufficient statistics on the sample: p, being the probability of a directed edge, is estimated as \(E / [N(N-1)]\), where E and N are the numbers of edges and nodes in the sample; \(\alpha\) is estimated by inverting formula (53) of Park and Newman (2004), using the observed reciprocity of the sample and the estimate of p. Results are presented in Fig. 6. We observe that the only sampling strategy successfully recovering the original parameters is RN, while all the others achieve similarly biased results. A possible interpretation lies in the way neighbors are chosen at sampling time by all algorithms but random node sampling: they all incorporate a bias towards incoming (e.g. TCEC and node2vec) or outgoing connections (e.g. RW). This does not allow a balanced selection of edges in both directions, thus biasing the estimate of \(\alpha\). For similar reasons, by choosing neighboring nodes, these algorithms retrieve denser samples, which naturally yield a higher estimate of p.
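A sketch of the re-estimation step on a sampled subgraph. The density estimator is the one stated above; for \(\alpha\) we only compute the observed reciprocity, since inverting formula (53) of Park and Newman (2004) requires the full expression, which we do not reproduce here:

```python
import networkx as nx

def estimate_ergm_stats(G_sample):
    """Re-estimate the ERGM sufficient statistics on a sampled directed graph."""
    N = G_sample.number_of_nodes()
    E = G_sample.number_of_edges()
    p_hat = E / (N * (N - 1))                 # directed-edge probability
    r_hat = nx.overall_reciprocity(G_sample)  # fraction of reciprocated edges
    return p_hat, r_hat                       # alpha then follows by inverting (53)
```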

Fig. 6

Estimated parameters \(p, \alpha\) for the ERGM network with reciprocity (Holland and Leinhardt 1981; Park and Newman 2004) after sampling with different algorithms. The red horizontal line represents the real network parameter, while the blue one its estimate re-fitted on the full generated network, to account for variability in generating the network from this probabilistic model. Box plots represent results across 10 rounds of sampling

The second synthetic model we consider is the SpringRank generative model. In short, this model defines a network with a built-in hierarchical structure: nodes have a real-valued score parameter \(s_{i}\) denoting their strength, or prestige, and this determines the likelihood of observing a directed, weighted tie between two nodes (the model is only valid for directed networks). Formally, the adjacency matrix A of a graph with n nodes is drawn from the following probabilistic model:

$$\begin{aligned} s_i&\sim N\left( 1, \frac{1}{\alpha \beta }\right) \\ A_{ij} |s_i, s_j&\sim Poisson\left\{ c\, \exp \left[ -\frac{\beta }{2} (s_i - s_j - 1)^2 \right] \right\} , \end{aligned}$$
(2)

for \(i, j= 1, \ldots , n\), with parameters \(\alpha , \beta , c\) tuning, respectively, the variance of the scores, the strength of the hierarchy and the sparsity of the network. In particular, \(\beta\) can be seen as an inverse temperature: the larger its value, the stronger the hierarchical relationship between nodes described by the scores, and hence its impact in determining the observed adjacency matrix. We investigate how well the various sampling algorithms retrieve samples of the graph for which the inferred scores respect the ground-truth ones. We generate a graph and its scores according to (2) and sample it with the methods listed above. We then infer the scores as described in De Bacco et al. (2018). The parameters are fixed to \(n=10^5, \alpha =0.1, c=0.01\). We vary \(\beta\) in \(\{0.1, 1, 10\}\) to check whether a stronger signal in the edges, obtained by increasing \(\beta\), reflects in a better recovery of the nodes' scores, regardless of the sampling algorithm. Results are presented in Fig. 7. As in the ERGM case, we can compare against two reference values: in this synthetic case we know the real underlying scores the network has been generated with, but these determine the adjacency matrix of observations in (2) only stochastically. For this reason we compare the scores computed on the samples against the ones computed on the whole graph, rather than against the ground-truth ones. As can be observed, TCEC, RW and node2vec have similar recovery capacity in terms of scores computed on the yielded samples, with TCEC showing the best performance for stronger hierarchical structures (\(\beta =1\)); RN and EXP sampling, instead, achieve comparable results only for high signal strength, i.e. high \(\beta\). RE never achieves good performance compared to the other methods. For completeness, Kendall-\(\tau\) measures with respect to the ground-truth scores are represented in shaded gray in the plots.
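Sampling from Eq. (2) is straightforward; a scaled-down sketch (n reduced from \(10^5\) for speed, other parameters as in the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, beta, c = 1000, 0.1, 10.0, 0.01   # n scaled down from 1e5

s = rng.normal(1.0, np.sqrt(1.0 / (alpha * beta)), size=n)   # node scores
diff = s[:, None] - s[None, :] - 1.0        # s_i - s_j - 1 for every pair
A = rng.poisson(c * np.exp(-0.5 * beta * diff**2))  # weighted directed edges
np.fill_diagonal(A, 0)                      # no self-loops
```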

Fig. 7

Kendall-\(\tau\) correlation between SpringRank scores inferred on samples and on the full graph, generated with the SpringRank generative model for varying levels of the inverse temperature \(\beta\). Scores with respect to the ground-truth SpringRank scores are represented in shaded gray. Standard deviations are computed across 10 rounds of sampling. Notice that here EXP has variance, as in these experiments we varied the initial seed node

Sampling for PageRank estimation

In this section we discuss the challenges preventing an effective extension of the theoretical framework behind TCEC to the PageRank score (PR) (Brin and Page 1998), i.e. a method for sampling networks theoretically grounded on the same ideas, but aiming at better approximating PageRank rather than eigenvector centrality. In fact, perhaps counterintuitively, there is no trivial generalization of TCEC to PageRank. Instead, it is necessary to make further assumptions, which result in an algorithmic scheme that, from our empirical observations, is equivalent to TCEC in practice. Here we explain the main challenges and refer to “Appendix 8” for detailed derivations of how to address them. PageRank considers a different adjacency matrix \(A_{PR}\), built from the original A, which is strongly connected (as the corresponding network is complete) and stochastic (its rows are normalized to 1). Both these features, absent in the eigenvector centrality case, are the cause of the additional complexity of sampling for PageRank. The PR score is defined as the eigenvector centrality computed on \(A_{PR}\). At first glance, this may suggest a straightforward generalization of TCEC sampling, obtained by simply applying the algorithm to \(A_{PR}\). However, this simple scheme in fact hides one main challenge, which makes the generalization theoretically non-trivial. TCEC yields the matrix \(A_{G_{m}}\) (the adjacency matrix of the sampled network \(G_m\)), which is a submatrix of the original A; being a submatrix is a requirement for the validity of the sine distance bound at the core of TCEC. In the case of PageRank, instead, the matrix of the sampled network \(A_{PR, G_{m}}\) is not a submatrix of \(A_{PR}\); this is because \(A_{PR}\) is a stochastic matrix, whose rows can only be normalized by knowing the degree of each node in advance. This information is in general not known a priori. We fix this problem by introducing an approximation (see “Appendix 8”) that allows us to use the theoretical criterion of Eq. (1) in this case as well. However, we still face a computational challenge. Due to the nature of PageRank, which allows jumps to non-neighboring nodes, albeit with low probability, the networks behind \(A_{PR}\) and \(A_{PR, G_m}\) are both complete. This results in a much higher computational cost of the sampling algorithm. Even though we propose ways to fix this issue as well (see “Appendix 8”), and thus combine these considerations into an efficient algorithmic implementation (which we refer to as TCPR) analogous to TCEC, the empirical results are poor: in practice, TCEC performs better in recovering the PR scores of nodes in the sample. As noted above for non-spectral centrality measures, this approximation cannot simply be attributed to the correlation between PageRank and eigenvector centrality; indeed, this correlation is high for Epinions, Internet Topology and Slashdot (respectively 0.580, 0.524, 0.715) but \(-0.001\) for Stanford, see “Appendix 5”. Rather, it has to be attributed to the structural properties of the networks recovered by the various sampling procedures.
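To make the normalization issue concrete, here is the standard construction of the dense PageRank matrix from A, which matches the description of \(A_{PR}\) above; note how every row requires the node's full out-degree, which is unknown during sampling. This is our own sketch, with the conventional damping value 0.85:

```python
import numpy as np

def pagerank_matrix(A, damping=0.85):
    """Dense PageRank matrix: damping * row-normalized A + uniform teleport."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    out_deg = A.sum(axis=1, keepdims=True)
    P = np.zeros_like(A)
    np.divide(A, out_deg, out=P, where=out_deg > 0)  # needs each full out-degree
    P[out_deg.ravel() == 0] = 1.0 / n                # dangling nodes: uniform row
    return damping * P + (1 - damping) / n           # every entry > 0: complete
```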

TCEC versus TCPR for PageRank approximation

We compare the approximation of the PageRank score obtained on samples from RW, TCEC and TCPR, via the Kendall-\(\tau\) correlation with the true scores, which are assumed to be available in these experiments. A higher correlation signals a better recovery of the relative ranks between nodes. We do so on the Epinions, Internet Topology, Slashdot and Stanford networks. Internet Topology (Zhang et al. 2005) represents the (undirected) Internet Autonomous Systems level topology.

For these experiments we set the TCEC randomization probability to 0.5, to achieve better approximation scores. Figure 8 shows a noticeable improvement of TCEC on most of the networks, both as a function of the sampling ratio and compared to RW, for in-sample PR ranking recovery. However, we do not observe such a pattern for TCPR, which performs better than TCEC only for a few dataset and sample-ratio combinations. As the theoretical groundings behind the two are similar, we argue that the use of the \(L_{1}\)-norm in TCPR (see “Appendix 8”), which is inherently less discriminative than the \(L_{2}\)-norm behind TCEC, may explain this difference in performance. Another possible cause is the extra assumption that in-sample node degrees scale linearly with the sample size: large deviations from this assumption could significantly impact the quality of the goodness criterion at hand.

Fig. 8

Results on the four datasets for PR score approximation: Epinions (upper left), Internet Topology (upper right), Slashdot (lower left) and Stanford (lower right). While, as a general trend, TCEC and TCPR seem to perform on average better than random walk for PR score estimation, there is no clear separation between the former two. Standard deviations are computed over 10 runs of sampling

Conclusions

Designing a sampling protocol when the whole-network information is not accessible is a task that has to be performed wisely. In fact, the choice of the sampling algorithm biases the analysis of relevant network quantities performed on the sample. We investigated here the impact that sampling techniques have on various centrality measures, community structure, node attribute distributions and further statistics relevant to specific instances of network generative models. In particular, we studied the performance of TCEC, a theoretically grounded sampling method aimed at recovering eigenvector centrality, on such in-sample network properties, and compared it with other sampling approaches. The goal was to understand whether a sampling algorithm optimized to preserve a specific global, spectral network measure indirectly preserves other network quantities as well. We empirically found that, on various networks, the performance of TCEC, as well as that of other algorithms, varies with the task, further suggesting to an end user that the choice of the sampling strategy should be made carefully and according to the goal. In particular, in some tasks TCEC performs very similarly to other routines that sample by moving between adjacent nodes, like RW and node2vec, while their performances all differ significantly from strategies based on uniform random sampling. For tasks like recovering PageRank, a spectral measure, or SpringRank values in strongly hierarchical structures, it performs better than uniform random walk. In addition, while RW yields community structure homogeneously distributed across blocks, TCEC tends to select nodes inside the starting community, while partially reaching out to other blocks. Finally, on a large online social network, it recovers in-sample attribute distributions close to those of the whole graph, and shows no significant bias towards the seed region, as is instead the case for node2vec, which over-represents the starting region. We discussed possibilities for extending TCEC to the case of PageRank and showed the challenges associated with this task and possible remedies; however, the resulting algorithm only performs comparably to TCEC in recovering PageRank values. We focused here on showcasing the impact of sampling on three relevant tasks of broad relevance in network datasets and presented examples of further statistics covering more specific scenarios (i.e. networks with reciprocity or hierarchical structure). We have not considered the case where networks change in time; it would be interesting to measure the robustness of sampling strategies against the dynamics of network structure.