A hidden challenge of link prediction: which pairs to check?

Belth, Caleb; Büyükçakır, Alican; Koutra, Danai

doi:10.1007/s10115-021-01632-x

Regular Paper
Published: 18 February 2022

Knowledge and Information Systems, Volume 64, pages 743–771 (2022)

Abstract

The traditional setup of link prediction in networks assumes that a test set of node pairs, which is usually balanced, is available over which to predict the presence of links. However, in practice, there is no test set: the ground truth is not known, so the number of possible pairs to predict over is quadratic in the number of nodes in the graph. Moreover, because graphs are sparse, most of these possible pairs will not be links. Thus, link prediction methods, which often rely on proximity-preserving embeddings or heuristic notions of node similarity, face a vast search space, with many pairs that are in close proximity, but that should not be linked. To mitigate this issue, we introduce LinkWaldo, a framework for choosing from this quadratic, massively skewed search space of node pairs, a concise set of candidate pairs that, in addition to being in close proximity, also structurally resemble the observed edges. This allows it to ignore some high-proximity but low-resemblance pairs, and also identify high-resemblance, lower-proximity pairs. Our framework is built on a model that theoretically combines stochastic block models (SBMs) with node proximity models. The block structure of the SBM maps out where in the search space new links are expected to fall, and the proximity identifies the most plausible links within these blocks, using locality sensitive hashing to avoid expensive exhaustive search. LinkWaldo can use any node representation learning or heuristic definition of proximity and can generate candidate pairs for any link prediction method, allowing the representation power of current and future methods to be realized for link prediction in practice. We evaluate LinkWaldo on 13 networks across multiple domains and show that on average it returns candidate sets containing 7–33% more missing and future links than both embedding-based and heuristic baselines’ sets. Our code is available at https://github.com/GemsLab/LinkWaldo.
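The candidate-selection idea in the abstract can be illustrated with a small sketch. This is not the LinkWaldo implementation (see the repository linked above for that); it is a simplified illustration under stated assumptions. The function name `select_candidates`, the `groups` partition, the proportional budget rule, and the dot-product proximity score are all hypothetical choices made for this example. It splits a pair budget across blocks in proportion to the observed edges in each block, then uses random-hyperplane hashing (in the style of Charikar 2002) so that only pairs sharing a hash bucket are scored, rather than all quadratically many pairs.

```python
from collections import defaultdict
from itertools import combinations

import numpy as np


def select_candidates(emb, groups, edges, budget, n_bits=2, seed=0):
    """Pick up to roughly `budget` unlinked pairs (illustrative sketch only).

    emb    : (n_nodes, dim) array of node embeddings (any proximity model)
    groups : groups[i] is the group label of node i (hypothetical partition)
    edges  : iterable of observed undirected edges (u, v)
    """
    rng = np.random.default_rng(seed)
    edge_set = {tuple(sorted(e)) for e in edges}

    # 1) Block structure: count observed edges per pair of groups and
    #    split the budget proportionally (rounding may shift the total).
    block_counts = defaultdict(int)
    for u, v in edge_set:
        block_counts[tuple(sorted((groups[u], groups[v])))] += 1
    total = sum(block_counts.values())
    if total == 0:
        return []
    block_budget = {b: round(budget * c / total) for b, c in block_counts.items()}

    # 2) Random-hyperplane hashing: nodes whose embeddings lie on the same
    #    side of every hyperplane land in the same bucket.
    planes = rng.standard_normal((emb.shape[1], n_bits))
    signatures = (emb @ planes) >= 0
    buckets = defaultdict(list)
    for node, sig in enumerate(signatures):
        buckets[sig.tobytes()].append(node)

    # 3) Score only unlinked pairs that share a bucket, grouped by block.
    per_block = defaultdict(list)
    for nodes in buckets.values():
        for u, v in combinations(nodes, 2):  # nodes are ascending, so u < v
            if (u, v) in edge_set:
                continue
            block = tuple(sorted((groups[u], groups[v])))
            per_block[block].append((float(emb[u] @ emb[v]), u, v))

    # 4) Keep the closest pairs in each block, up to that block's budget.
    selected = []
    for block, scored in per_block.items():
        scored.sort(reverse=True)
        selected.extend((u, v) for _, u, v in scored[: block_budget.get(block, 0)])
    return selected
```

Blocks with no observed edges receive no budget in this sketch, so a high-proximity pair in such a block is skipped, while a block rich in observed edges can contribute pairs even when their proximity is lower.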

References

Adamic LA, Adar E (2003) Friends and neighbors on the web. Soc Netw 25(3):211–230
Alivisatos AP, Chun M, Church GM, Greenspan RJ, Roukes ML, Yuste R (2012) The brain activity map project and the challenge of functional connectomics. Neuron 74(6):970–974
Bawa M, Condie T, Ganesan P (2005) LSH forest: self-tuning indexes for similarity search. In: WWW, pp 651–660
Belth C, Büyükçakır A, Koutra D (2020) A hidden challenge of link prediction: Which pairs to check? In: ICDM, pp 831–840. IEEE
Belth C, Zheng X, Koutra D (2020) Mining persistent activity in continually evolving networks. In: KDD
Charikar MS (2002) Similarity estimation techniques from rounding algorithms. In: STOC
Donnat C, Zitnik M, Hallac D, Leskovec J (2018) Learning structural node embeddings via diffusion wavelets. In: KDD, pp 1320–1329. ACM
Duan L, Ma S, Aggarwal C, Ma T, Huai J (2017) An ensemble approach to link prediction. In: IEEE TKDE 29(11)
Gao M, Chen L, He X, Zhou A (2018) BiNE: bipartite network embedding. In: SIGIR
Grover A, Leskovec J (2016) node2vec: Scalable feature learning for networks. In: KDD, pp 855–864. ACM
Hamilton WL, Ying R, Leskovec J (2017) Representation learning on graphs: methods and applications. IEEE Data Eng Bull 40(3):52–74
Heimann M, Shen H, Safavi T, Koutra D (2018) REGAL: representation learning-based graph alignment. In: CIKM
Jin D, Heimann M, Safavi T, Wang M, Lee W, Snider L, Koutra D (2019) Smart roles: inferring professional roles in email networks. In: KDD, pp 2923–2933. ACM
Joshi U, Urbani J (2020) Searching for embeddings in a haystack: link prediction on knowledge graphs with subgraph pruning. In: WebConf
Kipf TN, Welling M (2016) Variational graph auto-encoders. In: NeurIPS workshop on Bayesian deep learning
Kunegis J (2013) KONECT: the Koblenz network collection. In: WWW
Latouche P, Birmelé E, Ambroise C et al (2011) Overlapping stochastic block models with application to the French political blogosphere. Ann Appl Stat 5(1):309–336
Leskovec J, Krevl A (2014) SNAP datasets: Stanford large network dataset collection. http://snap.stanford.edu/data
Levin DA, Peres Y (2017) Markov chains and mixing times, vol 107. American Mathematical Society
Liben-Nowell D, Kleinberg J (2007) The link-prediction problem for social networks. J Am Soc Inf Sci Technol 58(7):1019–1031
Martínez V, Berzal F, Cubero J-C (2016) A survey of link prediction in complex networks. ACM Comput Surv 49(4):1–33
Mehta N, Carin L, Rai P (2019) Stochastic blockmodels meet graph neural networks. In: ICML
Miller K, Jordan MI, Griffiths TL (2009) Nonparametric latent feature models for link prediction. In: NeurIPS
Newman MEJ (2003) Mixing patterns in networks. Phys Rev E 67(2)
Nowicki K, Snijders TAB (2001) Estimation and prediction for stochastic blockstructures. J Am Stat Assoc 96(455):1077–1087
Pachev B, Webb B (2018) Fast link prediction for large networks using spectral embedding. J Complex Netw 6(1):79–94
Perozzi B, Al-Rfou R, Skiena S (2014) DeepWalk: online learning of social representations. In: KDD, pp 701–710. ACM
Qiu J, Dong Y, Ma H, Li J, Wang K, Tang J (2018) Network embedding as matrix factorization: unifying DeepWalk, LINE, PTE, and node2vec. In: WSDM, pp 459–467. ACM
Ribeiro LFR, Saverese PHP, Figueiredo DR (2017) struc2vec: learning node representations from structural identity. In: KDD, pp 385–394. ACM
Rossi R, Ahmed N (2015) The network data repository with interactive graph analytics and visualization. In: AAAI
Rossi RA, Jin D, Kim S, Ahmed NK, Koutra D, Lee JB (2020) On proximity and structural role-based embeddings in networks: misconceptions, techniques, and applications. TKDD
Safavi T, Koutra D, Meij E (2020) Evaluating the calibration of knowledge graph embeddings for trustworthy link prediction. In: EMNLP
Song D, Meyer DA, Tao D (2015) Top-k link recommendation in social networks. In: ICDM, pp 389–398. IEEE
Sporns O, Tononi G, Kötter R (2005) The human connectome: a structural description of the human brain. PLoS Comput Biol 1(4):e42
Tang J, Qu M, Mei Q (2015) PTE: predictive text embedding through large-scale heterogeneous text networks. In: KDD, pp 1165–1174. ACM
Tang J, Qu M, Wang M, Zhang M, Yan J, Mei Q (2015) LINE: large-scale information network embedding. In: WWW, pp 1067–1077. ACM
Tsybakov AB (2008) Introduction to nonparametric estimation. Springer Science & Business Media, Berlin
Varshney LR, Chen BL, Paniagua E, Hall DH, Chklovskii DB (2011) Structural properties of the Caenorhabditis elegans neuronal network. PLoS Comput Biol 7(2):e1001066
Wang J, Shen HT, Song J, Ji J (2014) Hashing for similarity search: a survey. arXiv preprint arXiv:1408.2927
Zhang M, Chen Y (2018) Link prediction based on graph neural networks. In: NeurIPS, pp 5165–5175
Zhu J, Lu X, Heimann M, Koutra D (2021) Node proximity is all you need: unified structural and positional node and graph embedding. In: SDM. SIAM

Acknowledgements

This work is an invited extension of a paper accepted at ICDM 2020 [4] and is supported by an NSF Graduate Research Fellowship, NSF CAREER Grant No. IIS 1845491, Army Young Investigator Award No. W911NF1810397, and an Amazon faculty award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or other funding parties.

Author information

Authors and Affiliations

Computer Science & Engineering, University of Michigan, Ann Arbor, USA
Caleb Belth, Alican Büyükçakır & Danai Koutra

Corresponding author

Correspondence to Caleb Belth.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 2976 KB)

Proximity models

Here we discuss the proximity models used in Sect. 5.1. The proximity model for each method (along with parameters for the NMF+Bag baseline) is given in Table 6. We observe that the AA heuristic performed well on several datasets. During development, we tried over two dozen embedding methods (including GNNs) and found NetMF to be the most consistently strong for both LaPM and LinkWaldo. This, combined with the strong performance of AA, suggests that there is much room for improving proximity-preserving embedding methods. Improving their performance for low-degree nodes is of particular importance, as discussed in Sect. 5.4.

Table 6 The best input proximity model for LaPM and LinkWaldo (used in Sect. 5.1) and parameter deviations from default for NMF+Bag
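For readers unfamiliar with the AA heuristic mentioned above, the following is a minimal sketch of the Adamic-Adar score (Adamic and Adar 2003, in the reference list), which sums 1/log(degree) over the common neighbors of a pair so that low-degree shared neighbors count for more. The adjacency-dictionary representation and the function name `adamic_adar` are assumptions made for this illustration and are not taken from the paper's code.

```python
import math


def adamic_adar(adj, u, v):
    """Adamic-Adar score of the pair (u, v) in an undirected graph.

    adj maps each node to the set of its neighbors (illustrative format).
    """
    score = 0.0
    for w in adj[u] & adj[v]:          # common neighbors of u and v
        degree = len(adj[w])
        if degree > 1:                 # defensive: log(1) = 0 would divide by zero
            score += 1.0 / math.log(degree)
    return score
```

In a candidate-selection setting such a score would play the role of the proximity model, ranking unlinked pairs by how many, and how selective, their shared neighbors are.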


About this article

Cite this article

Belth, C., Büyükçakır, A. & Koutra, D. A hidden challenge of link prediction: which pairs to check?. Knowl Inf Syst 64, 743–771 (2022). https://doi.org/10.1007/s10115-021-01632-x

Received: 10 March 2021
Revised: 15 November 2021
Accepted: 20 November 2021
Published: 18 February 2022
Issue Date: March 2022
DOI: https://doi.org/10.1007/s10115-021-01632-x
