Calculating a Distributional Similarity Kernel using the Nyström Extension

Arndt, Markus; Arndt, Ulrich

doi:10.1007/978-3-642-24466-7_34

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Abstract

The analysis of distributional similarities induced by word co-occurrences is an established tool for extracting semantically related words from a large text corpus. Based on the co-occurrence matrix \(C\), the basic kernel matrix \(K = CC^{T}\) reflects word–word similarities. In order to considerably improve the results, a similarity kernel matrix is expressed as \(G = U_{k}U_{k}^{T}\), where \(U_{k}\) are the first \(k\) eigenvectors of the eigendecomposition \(K = U\Sigma U^{T}\). Clearly, the bottleneck of this technique is the high computational demand for calculating the eigendecomposition. In our study we speed up the calculation of the low-rank similarity kernel by means of the Nyström extension. We address in detail the inherent challenge of the Nyström method, namely selecting appropriate kernel matrix columns in such a way that the fast approximation process yields satisfactory results. To illustrate the effectiveness of our method, we have built a thesaurus containing 32,000 entries based on 0.5 billion corpus words (nouns, verbs, adjectives and adverbs) extracted from the Project Gutenberg text collection.
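
The low-rank kernel computation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `nystrom_similarity_factor`, the uniform random choice of columns, and the final QR orthonormalization are assumptions made for illustration only, and the column-selection strategy that the chapter discusses in detail is not reproduced here. The sketch samples m columns of K = CC^T without ever forming K, eigendecomposes the small m x m block, and extends its leading k eigenvectors to all n words.

```python
import numpy as np


def nystrom_similarity_factor(C, k, m, seed=0):
    """Approximate U_k, the top-k eigenvectors of K = C C^T, via Nystrom.

    C    : (n, d) co-occurrence matrix (dense here for simplicity)
    k    : rank of the similarity kernel G = U_k U_k^T
    m    : number of sampled kernel columns, with k <= m <= n
    seed : seed for the (assumed) uniform column sampling
    Returns an (n, k) matrix with orthonormal columns.
    """
    rng = np.random.default_rng(seed)
    n = C.shape[0]

    # Assumption: uniform sampling without replacement.
    idx = rng.choice(n, size=m, replace=False)

    # Sampled columns of K, computed directly from C: K[:, idx] = C C[idx]^T.
    K_cols = C @ C[idx].T            # shape (n, m)
    W = K_cols[idx]                  # shape (m, m), the block K[idx, idx]

    # Eigendecomposition of the small block; eigh returns ascending order.
    evals, evecs = np.linalg.eigh(W)
    top = np.argsort(evals)[::-1][:k]
    evals, evecs = evals[top], evecs[:, top]

    # Nystrom extension: map the m-dimensional eigenvectors to all n rows.
    # This requires the k selected eigenvalues of W to be strictly positive.
    U_ext = K_cols @ (evecs / evals)  # shape (n, k)

    # Extended vectors are not orthonormal in general; QR fixes that.
    Q, _ = np.linalg.qr(U_ext)
    return Q


if __name__ == "__main__":
    # Toy data standing in for a co-occurrence matrix.
    C = np.random.default_rng(42).random((500, 40))
    U_k = nystrom_similarity_factor(C, k=10, m=80)
    # Similarity of word 0 to every word, i.e. row 0 of G = U_k U_k^T.
    sims = U_k @ U_k[0]
    print(sims.shape)
```

Keeping the result in factored form as U_k, and computing rows of G on demand, avoids storing the full n x n similarity matrix; whether the chapter does the same is not stated in the abstract.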

Notes

1. www.gutenberg.org

Author information

Authors and Affiliations

European Patent Office, Erhardt Str. 27, 80649, Munich, Germany
Markus Arndt
data2knowledge GmbH, Fahrenheitstr. 1, 28359, Bremen, Germany
Ulrich Arndt

Corresponding author

Correspondence to Markus Arndt.

Editor information

Editors and Affiliations

Fak. Wirtschaftswissenschaften, Inst. Entscheidungstheorieund, Universität Karlsruhe (TH), Kaiserstr. 12, Karlsruhe, 76128, Germany
Wolfgang A. Gaul
Institute for Information Systems and Management (IISM), Karlsruhe Institute of Technology (KIT), Kaiserstr. 12, Karlsruhe, 76131, Baden-Württemberg, Germany
Andreas Geyer-Schulz
Information Systems, University of Hildesheim, Marienburger Platz 22, Hildesheim, 31141, Germany
Lars Schmidt-Thieme
Institute for Information Systems and Management (IISM), Karlsruhe Institute of Technology (KIT), Kaiserstraße 12, Karlsruhe, 76128, Germany
Jonas Kunze

About this paper

Cite this paper

Arndt, M., Arndt, U. (2012). Calculating a Distributional Similarity Kernel using the Nyström Extension. In: Gaul, W., Geyer-Schulz, A., Schmidt-Thieme, L., Kunze, J. (eds) Challenges at the Interface of Data Analysis, Computer Science, and Optimization. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24466-7_34

DOI: https://doi.org/10.1007/978-3-642-24466-7_34
Published: 05 January 2012
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24465-0
Online ISBN: 978-3-642-24466-7
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
