Abstract
The traditional setup of link prediction in networks assumes that a test set of node pairs, which is usually balanced, is available over which to predict the presence of links. However, in practice, there is no test set: the ground truth is not known, so the number of possible pairs to predict over is quadratic in the number of nodes in the graph. Moreover, because graphs are sparse, most of these possible pairs will not be links. Thus, link prediction methods, which often rely on proximity-preserving embeddings or heuristic notions of node similarity, face a vast search space, with many pairs that are in close proximity, but that should not be linked. To mitigate this issue, we introduce LinkWaldo, a framework for choosing from this quadratic, massively skewed search space of node pairs, a concise set of candidate pairs that, in addition to being in close proximity, also structurally resemble the observed edges. This allows it to ignore some high-proximity but low-resemblance pairs, and also identify high-resemblance, lower-proximity pairs. Our framework is built on a model that theoretically combines stochastic block models (SBMs) with node proximity models. The block structure of the SBM maps out where in the search space new links are expected to fall, and the proximity identifies the most plausible links within these blocks, using locality sensitive hashing to avoid expensive exhaustive search. LinkWaldo can use any node representation learning or heuristic definition of proximity and can generate candidate pairs for any link prediction method, allowing the representation power of current and future methods to be realized for link prediction in practice. We evaluate LinkWaldo on 13 networks across multiple domains and show that on average it returns candidate sets containing 7–33% more missing and future links than both embedding-based and heuristic baselines’ sets. Our code is available at https://github.com/GemsLab/LinkWaldo.
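The role of locality sensitive hashing here can be illustrated with a minimal sketch. It is not the LinkWaldo implementation: it assumes node embeddings are already available as a NumPy array, uses random-hyperplane signatures in the style of Charikar (2002), and ignores the block structure entirely. The function name `lsh_candidate_pairs` and its parameters are hypothetical.

```python
import numpy as np
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(embeddings, n_bits=8, seed=0):
    """Return pairs (i, j), i < j, whose embeddings share a hash bucket.

    Hypothetical helper for illustration only. Each node is hashed by the
    sign pattern of its projections onto n_bits random hyperplanes, so
    nodes pointing in similar directions tend to collide.
    """
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((embeddings.shape[1], n_bits))
    bits = (embeddings @ planes) >= 0  # boolean array, shape (n_nodes, n_bits)

    # Group nodes by their bit signature.
    buckets = defaultdict(list)
    for node, signature in enumerate(bits):
        buckets[signature.tobytes()].append(node)

    # Only pairs inside the same bucket become candidates, so the
    # quadratic set of all pairs is never enumerated.
    pairs = set()
    for nodes in buckets.values():
        pairs.update(combinations(nodes, 2))
    return pairs


# Nodes 0 and 1 point in the same direction; node 2 points the opposite way.
emb = np.array([[1.0, 0.0], [2.0, 0.0], [-1.0, 0.0]])
candidates = lsh_candidate_pairs(emb)
```

Fewer bits give larger buckets and more candidates; more bits give a tighter but smaller set. In practice several hash tables are usually combined to trade recall against cost.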
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10115-021-01632-x/MediaObjects/10115_2021_1632_Fig1_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10115-021-01632-x/MediaObjects/10115_2021_1632_Fig2_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10115-021-01632-x/MediaObjects/10115_2021_1632_Fig3_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10115-021-01632-x/MediaObjects/10115_2021_1632_Fig4_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10115-021-01632-x/MediaObjects/10115_2021_1632_Fig5_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10115-021-01632-x/MediaObjects/10115_2021_1632_Fig6_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10115-021-01632-x/MediaObjects/10115_2021_1632_Fig7_HTML.png)
References
Adamic LA, Adar E (2003) Friends and neighbors on the web. Soc Netw 25(3):211–230
Alivisatos AP, Chun M, Church GM, Greenspan RJ, Roukes ML, Yuste R (2012) The brain activity map project and the challenge of functional connectomics. Neuron 74(6):970–974
Bawa M, Condie T, Ganesan P (2005) LSH forest: self-tuning indexes for similarity search. In: WWW, pp 651–660
Belth C, Büyükçakır A, Koutra D (2020) A hidden challenge of link prediction: Which pairs to check? In: ICDM, pp 831–840. IEEE
Belth C, Zheng X, Koutra D (2020) Mining persistent activity in continually evolving networks. In: KDD
Charikar MS (2002) Similarity estimation techniques from rounding algorithms. In: STOC
Donnat C, Zitnik M, Hallac D, Leskovec J (2018) Learning structural node embeddings via diffusion wavelets. In: KDD, pp 1320–1329. ACM
Duan L, Ma S, Aggarwal C, Ma T, Huai J (2017) An ensemble approach to link prediction. In: IEEE TKDE 29(11)
Gao M, Chen L, He X, Zhou A (2018) BiNE: bipartite network embedding. In: SIGIR
Grover A, Leskovec J (2016) node2vec: Scalable feature learning for networks. In: KDD, pp 855–864. ACM
Hamilton WL, Ying R, Leskovec J (2017) Representation learning on graphs: methods and applications. IEEE Data Eng Bull 40(3):52–74
Heimann M, Shen H, Safavi T, Koutra D (2018) REGAL: representation learning-based graph alignment. In: CIKM
Jin D, Heimann M, Safavi T, Wang M, Lee W, Snider L, Koutra D (2019) Smart roles: inferring professional roles in email networks. In: KDD, pp 2923–2933. ACM
Joshi U, Urbani J (2020) Searching for embeddings in a haystack: link prediction on knowledge graphs with subgraph pruning. In: WebConf
Kipf TN, Welling M (2016) Variational graph auto-encoders. In: NeurIPS workshop on Bayesian deep learning
Kunegis J (2013) KONECT: the Koblenz network collection. In: WWW
Latouche P, Birmelé E, Ambroise C et al (2011) Overlapping stochastic block models with application to the French political blogosphere. Ann Appl Stat 5(1):309–336
Leskovec J, Krevl A (2014) SNAP datasets: Stanford large network dataset collection. http://snap.stanford.edu/data
Levin DA, Peres Y (2017) Markov chains and mixing times, volume 107. American Mathematical Soc
Liben-Nowell D, Kleinberg J (2007) The link-prediction problem for social networks. ASIS&T 58(7):1019–1031
Martínez V, Berzal F, Cubero J-C (2016) A survey of link prediction in complex networks. CSUR 49(4):1–33
Mehta N, Carin L, Rai P (2019) Stochastic blockmodels meet graph neural networks. In: ICML
Miller K, Jordan MI, Griffiths TL (2009) Nonparametric latent feature models for link prediction. In: NeurIPS
Newman MEJ (2003) Mixing patterns in networks. Phys Rev E 67(2)
Nowicki K, Snijders TAB (2001) Estimation and prediction for stochastic blockstructures. J Am Stat Assoc 96(455):1077–1087
Pachev B, Webb B (2018) Fast link prediction for large networks using spectral embedding. J Complex Netw 6(1):79–94
Perozzi B, Al-Rfou R, Skiena S (2014) Deepwalk: online learning of social representations. In: KDD, pp 701–710. ACM
Qiu J, Dong Y, Ma H, Li J, Wang K, Tang J (2018) Network embedding as matrix factorization: unifying deepwalk, line, pte, and node2vec. In: WSDM, pp 459–467. ACM
Ribeiro LFR, Saverese PHP, Figueiredo DR (2017) struc2vec: learning node representations from structural identity. In: KDD, pp 385–394. ACM
Rossi R, Ahmed N (2015) The network data repository with interactive graph analytics and visualization. In: AAAI
Rossi RA, Jin D, Kim S, Ahmed NK, Koutra D, Lee JB (2020) On proximity and structural role-based embeddings in networks: Misconceptions, techniques, and applications. TKDD
Safavi T, Koutra D, Meij E (2020) Evaluating the calibration of knowledge graph embeddings for trustworthy link prediction. In: EMNLP
Song D, Meyer DA, Tao D (2015) Top-k link recommendation in social networks. In: ICDM, pp 389–398. IEEE
Sporns O, Tononi G, Kötter R (2005) The human connectome: a structural description of the human brain. PLoS Comput Biol 1(4):e42
Tang J, Qu M, Mei Q (2015) Pte: predictive text embedding through large-scale heterogeneous text networks. In: KDD, pp 1165–1174. ACM
Tang J, Qu M, Wang M, Zhang M, Yan J, Mei Q (2015) Line: large-scale information network embedding. In: WWW, pp 1067–1077. ACM
Tsybakov AB (2008) Introduction to nonparametric estimation. Springer Science & Business Media, Berlin
Varshney LR, Chen BL, Paniagua E, Hall DH, Chklovskii DB (2011) Structural properties of the Caenorhabditis elegans neuronal network. PLoS Comput Biol 7(2):e1001066
Wang J, Shen HT, Song J, Ji J (2014) Hashing for similarity search: a survey. arXiv preprint arXiv:1408.2927
Zhang M, Chen Y (2018) Link prediction based on graph neural networks. In: NeurIPS, pp 5165–5175
Zhu J, Lu X, Heimann M, Koutra D (2021) Node proximity is all you need: Unified structural and positional node and graph embedding. In: SDM, SIAM
Acknowledgements
This work is an invited extension of a paper accepted at ICDM 2020 [4] and is supported by an NSF Graduate Research Fellowship, NSF CAREER Grant No. IIS 1845491, Army Young Investigator Award No. W911NF1810397, and an Amazon faculty award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or other funding parties.
Proximity models
Here we discuss the proximity models used in Sect. 5.1. Table 6 lists the proximity model for each method, along with the parameters for the NMF+Bag baseline. The Adamic-Adar (AA) heuristic performed well on several datasets. During development, we tried over two dozen embedding methods (including GNNs) and found NetMF to be the most consistently strong for both LaPM and LinkWaldo. Together with the strong performance of AA, this suggests that proximity-preserving embedding methods have substantial room for improvement. Improving their performance on low-degree nodes is particularly important, as discussed in Sect. 5.4.
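For readers unfamiliar with the Adamic-Adar (AA) heuristic mentioned above, the following is a minimal sketch of the standard score: each common neighbor of a pair contributes the inverse log of its degree. The adjacency-dictionary representation and the function name `adamic_adar` are assumptions made for illustration, not the code used in the experiments.

```python
import math


def adamic_adar(adj, u, v):
    """Adamic-Adar proximity of nodes u and v in an undirected graph.

    adj is assumed to map each node to the set of its neighbors.
    Common neighbors with low degree contribute more to the score.
    """
    common = adj[u] & adj[v]
    # A common neighbor of two distinct nodes has degree >= 2, so the log
    # is positive; the guard only matters for degenerate inputs.
    return sum(1.0 / math.log(len(adj[w])) for w in common if len(adj[w]) > 1)


# Toy graph with edges a-c, b-c, a-d, b-d, d-e.
adj = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"d"},
}
score_ab = adamic_adar(adj, "a", "b")  # common neighbors: c (degree 2), d (degree 3)
```

Pairs with no common neighbor score zero under AA, so a low-degree node can only ever be matched to nodes within two hops.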
Cite this article
Belth, C., Büyükçakır, A. & Koutra, D. A hidden challenge of link prediction: which pairs to check? Knowl Inf Syst 64, 743–771 (2022). https://doi.org/10.1007/s10115-021-01632-x
DOI: https://doi.org/10.1007/s10115-021-01632-x