A hidden challenge of link prediction: which pairs to check?

  • Regular Paper
Knowledge and Information Systems

Abstract

The traditional setup of link prediction in networks assumes that a test set of node pairs, which is usually balanced, is available over which to predict the presence of links. However, in practice, there is no test set: the ground truth is not known, so the number of possible pairs to predict over is quadratic in the number of nodes in the graph. Moreover, because graphs are sparse, most of these possible pairs will not be links. Thus, link prediction methods, which often rely on proximity-preserving embeddings or heuristic notions of node similarity, face a vast search space, with many pairs that are in close proximity, but that should not be linked. To mitigate this issue, we introduce LinkWaldo, a framework for choosing from this quadratic, massively skewed search space of node pairs, a concise set of candidate pairs that, in addition to being in close proximity, also structurally resemble the observed edges. This allows it to ignore some high-proximity but low-resemblance pairs, and also identify high-resemblance, lower-proximity pairs. Our framework is built on a model that theoretically combines stochastic block models (SBMs) with node proximity models. The block structure of the SBM maps out where in the search space new links are expected to fall, and the proximity identifies the most plausible links within these blocks, using locality sensitive hashing to avoid expensive exhaustive search. LinkWaldo can use any node representation learning or heuristic definition of proximity and can generate candidate pairs for any link prediction method, allowing the representation power of current and future methods to be realized for link prediction in practice. We evaluate LinkWaldo on 13 networks across multiple domains and show that on average it returns candidate sets containing 7–33% more missing and future links than both embedding-based and heuristic baselines’ sets. Our code is available at https://github.com/GemsLab/LinkWaldo.
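The idea of using block structure to narrow the search space and hashing to avoid enumerating all pairs can be illustrated with a small sketch. This is not the LinkWaldo algorithm or the authors' code: the function name `lsh_candidate_pairs`, its parameters, and the simplification that only pairs inside the same group are considered are all assumptions made for illustration. It uses random-hyperplane signatures in the style of [6] and ranks the surviving pairs by embedding dot product.

```python
import itertools
import random


def lsh_candidate_pairs(embeddings, groups, observed_edges, budget, n_bits=4, seed=0):
    """Return up to `budget` unlinked pairs that share a group and an LSH bucket.

    Hypothetical helper for illustration only; `groups` stands in for a
    block assignment and `embeddings` for any proximity-preserving embedding.
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    # One random hyperplane per signature bit.
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Bucket nodes by (group, sign pattern of the projections).
    buckets = {}
    for node, (group, vec) in enumerate(zip(groups, embeddings)):
        signature = tuple(dot(vec, plane) > 0 for plane in planes)
        buckets.setdefault((group, signature), []).append(node)

    observed = {frozenset(edge) for edge in observed_edges}
    scored = []
    # Only pairs inside a bucket are compared, never all O(n^2) pairs.
    for members in buckets.values():
        for u, v in itertools.combinations(members, 2):
            if frozenset((u, v)) not in observed:
                scored.append((dot(embeddings[u], embeddings[v]), u, v))

    scored.sort(reverse=True)
    return [(u, v) for _, u, v in scored[:budget]]


# Toy usage: nodes 0 and 1 are identical and unlinked; 2 and 3 are already linked.
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
toy_candidates = lsh_candidate_pairs(toy_embeddings, [0, 0, 1, 1], [(2, 3)], budget=5)
```

More signature bits give smaller buckets and fewer comparisons, at the cost of missing some nearby pairs. The framework described in the abstract additionally decides how many candidates to draw from each block, which this sketch omits.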

References

  1. Adamic LA, Adar E (2003) Friends and neighbors on the web. Soc Netw 25(3):211–230

  2. Alivisatos AP, Chun M, Church GM, Greenspan RJ, Roukes ML, Yuste R (2012) The brain activity map project and the challenge of functional connectomics. Neuron 74(6):970–974

  3. Bawa M, Condie T, Ganesan P (2005) LSH forest: self-tuning indexes for similarity search. In: WWW, pp 651–660

  4. Belth C, Büyükçakır A, Koutra D (2020) A hidden challenge of link prediction: Which pairs to check? In: ICDM, pp 831–840. IEEE

  5. Belth C, Zheng X, Koutra D (2020) Mining persistent activity in continually evolving networks. In: KDD

  6. Charikar MS (2002) Similarity estimation techniques from rounding algorithms. In: STOC

  7. Donnat C, Zitnik M, Hallac D, Leskovec J (2018) Learning structural node embeddings via diffusion wavelets. In: KDD, pp 1320–1329. ACM

  8. Duan L, Ma S, Aggarwal C, Ma T, Huai J (2017) An ensemble approach to link prediction. IEEE TKDE 29(11)

  9. Gao M, Chen L, He X, Zhou A (2018) BiNE: bipartite network embedding. In: SIGIR

  10. Grover A, Leskovec J (2016) node2vec: Scalable feature learning for networks. In: KDD, pp 855–864. ACM

  11. Hamilton WL, Ying R, Leskovec J (2017) Representation learning on graphs: methods and applications. IEEE Data Eng Bull 40(3):52–74

  12. Heimann M, Shen H, Safavi T, Koutra D (2018) REGAL: representation learning-based graph alignment. In: CIKM

  13. Jin D, Heimann M, Safavi T, Wang M, Lee W, Snider L, Koutra D (2019) Smart roles: inferring professional roles in email networks. In: KDD, pp 2923–2933. ACM

  14. Joshi U, Urbani J (2020) Searching for embeddings in a haystack: link prediction on knowledge graphs with subgraph pruning. In: WebConf

  15. Kipf TN, Welling M (2016) Variational graph auto-encoders. In: NeurIPS workshop on Bayesian deep learning

  16. Kunegis J (2013) Konect: the koblenz network collection. In: WWW

  17. Latouche P, Birmelé E, Ambroise C et al (2011) Overlapping stochastic block models with application to the French political blogosphere. Ann Appl Stat 5(1):309–336

  18. Leskovec J, Krevl A (2014) SNAP datasets: stanford large network dataset collection. http://snap.stanford.edu/data

  19. Levin DA, Peres Y (2017) Markov chains and mixing times, volume 107. American Mathematical Soc

  20. Liben-Nowell D, Kleinberg J (2007) The link-prediction problem for social networks. J Am Soc Inf Sci Technol 58(7):1019–1031

  21. Martínez V, Berzal F, Cubero J-C (2016) A survey of link prediction in complex networks. CSUR 49(4):1–33

  22. Mehta N, Carin L, Rai P (2019) Stochastic blockmodels meet graph neural networks. In: ICML

  23. Miller K, Jordan MI, Griffiths TL (2009) Nonparametric latent feature models for link prediction. In: NeurIPS

  24. Newman MEJ (2003) Mixing patterns in networks. Phys Rev E 67(2)

  25. Nowicki K, Snijders TAB (2001) Estimation and prediction for stochastic blockstructures. J Am Stat Assoc 96(455):1077–1087

  26. Pachev B, Webb B (2018) Fast link prediction for large networks using spectral embedding. J Complex Netw 6(1):79–94

  27. Perozzi B, Al-Rfou R, Skiena S (2014) Deepwalk: online learning of social representations. In: KDD, pp 701–710. ACM

  28. Qiu J, Dong Y, Ma H, Li J, Wang K, Tang J (2018) Network embedding as matrix factorization: unifying deepwalk, line, pte, and node2vec. In: WSDM, pp 459–467. ACM

  29. Ribeiro LFR, Saverese PHP, Figueiredo DR (2017) struc2vec: learning node representations from structural identity. In: KDD, pp 385–394. ACM

  30. Rossi R, Ahmed N (2015) The network data repository with interactive graph analytics and visualization. In: AAAI

  31. Rossi RA, Jin D, Kim S, Ahmed S, Koutra D, Lee JB (2020) On proximity and structural role-based embeddings in networks: misconceptions, techniques, and applications. TKDD

  32. Safavi T, Koutra D, Meij E (2020) Evaluating the calibration of knowledge graph embeddings for trustworthy link prediction. In: EMNLP

  33. Song D, Meyer DA, Tao D (2015) Top-k link recommendation in social networks. In: ICDM, pp 389–398. IEEE

  34. Sporns O, Tononi G, Kötter R (2005) The human connectome: a structural description of the human brain. PLoS Comput Biol 1(4):e42

  35. Tang J, Qu M, Mei Q (2015) Pte: predictive text embedding through large-scale heterogeneous text networks. In: KDD, pp 1165–1174. ACM

  36. Tang J, Qu M, Wang M, Zhang M, Yan J, Mei Q (2015) Line: large-scale information network embedding. In: WWW, pp 1067–1077. ACM

  37. Tsybakov AB (2008) Introduction to nonparametric estimation. Springer Science & Business Media, Berlin

  38. Varshney LR, Chen BL, Paniagua E, Hall DH, Chklovskii DB (2011) Structural properties of the caenorhabditis elegans neuronal network. PLoS Comput Biol 7(2):e1001066

  39. Wang J, Shen HT, Song J, Ji J (2014) Hashing for similarity search: a survey. arXiv preprint arXiv:1408.2927

  40. Zhang M, Chen Y (2018) Link prediction based on graph neural networks. In: NeurIPS, pp 5165–5175

  41. Zhu J, Lu X, Heimann M, Koutra D (2021) Node proximity is all you need: unified structural and positional node and graph embedding. In: SDM, SIAM

Acknowledgements

This work is an invited extension of a paper accepted at ICDM 2020 [4] and is supported by an NSF Graduate Research Fellowship, NSF CAREER Grant No. IIS 1845491, Army Young Investigator Award No. W911NF1810397, and an Amazon faculty award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or other funding parties.

Author information

Corresponding author

Correspondence to Caleb Belth.

Proximity models

Here we discuss the proximity models used in Sect. 5.1. The proximity model for each method (along with parameters for the NMF+Bag baseline) is given in Table 6. We observe that the AA heuristic performed well on several datasets. During development, we tried over two dozen embedding methods (including GNNs) and found NetMF to be the most consistently strong for both LaPM and LinkWaldo. This, combined with the strong performance of AA, suggests that there is much room for improving proximity-preserving embedding methods. Improving their performance for low-degree nodes is of particular importance, as discussed in Sect. 5.4.
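As a concrete example of a heuristic proximity model, the sketch below computes a neighborhood-based score, assuming that AA denotes the Adamic/Adar index of [1]: the sum, over the common neighbors z of two nodes, of 1 / log(deg(z)). The function name `adamic_adar` and the adjacency-dictionary input format are illustrative choices, not taken from the paper's code.

```python
import math


def adamic_adar(adj, u, v):
    """Adamic/Adar score of the pair (u, v).

    `adj` maps each node to the set of its neighbors in an undirected,
    simple graph. Common neighbors with low degree contribute more.
    """
    common = adj[u] & adj[v]
    # A common neighbor of two distinct nodes has degree >= 2, so log > 0;
    # the guard only protects against malformed input.
    return sum(1.0 / math.log(len(adj[z])) for z in common if len(adj[z]) > 1)


# Toy graph: nodes 0 and 1 share neighbors 2 (degree 2) and 3 (degree 3).
toy_adj = {0: {2, 3}, 1: {2, 3}, 2: {0, 1}, 3: {0, 1, 4}, 4: {3}}
toy_score = adamic_adar(toy_adj, 0, 1)  # 1/log(2) + 1/log(3)
```

Because the score depends only on shared neighbors, pairs with no common neighbor score zero, so a heuristic like this cannot rank them against each other.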

Table 6 The best input proximity model for LaPM and LinkWaldo (used in Sect. 5.1) and parameter deviations from default for NMF+Bag

Cite this article

Belth, C., Büyükçakır, A. & Koutra, D. A hidden challenge of link prediction: which pairs to check? Knowl Inf Syst 64, 743–771 (2022). https://doi.org/10.1007/s10115-021-01632-x
