Creating Semantic Representations

Chapter in: Statistical Semantics (Springer, Cham)

Abstract

In this chapter, we present the vector space model and some ways to further process such a representation: with feature hashing, random indexing, latent semantic analysis, non-negative matrix factorization, explicit semantic analysis, and word embeddings, a word or a text may be associated with a distributed semantic representation. Deep learning, explicit semantic networks, and auxiliary non-linguistic information provide further means for creating distributed representations from linguistic data. We point to a few of the methods and datasets used to evaluate the many algorithms that create semantic representations, and we also point to some of the problems associated with distributed representations.
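The vector space model underlying these representations can be sketched in a few lines of plain Python: documents become term-count vectors over a shared vocabulary, and similarity is the cosine of the angle between vectors. (A minimal illustration with a toy corpus of our own invention, not the chapter's data.)

```python
from collections import Counter
from math import sqrt

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "statistics of word meaning",
]

# Build a shared vocabulary and one term-count vector per document.
vocab = sorted({w for d in docs for w in d.split()})
vectors = [[Counter(d.split())[w] for w in vocab] for d in docs]

def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# The first two documents share several words; the third shares none.
print(cosine(vectors[0], vectors[1]))  # relatively high
print(cosine(vectors[0], vectors[2]))  # zero
```

The methods listed in the abstract (feature hashing, random indexing, LSA, NMF, word embeddings) can all be viewed as ways of compressing or re-deriving such term-document or term-context matrices.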



Notes

  1. http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html

  2. https://radimrehurek.com/gensim/corpora/hashdictionary.html

  3. https://nlp.stanford.edu/projects/glove/

  4. Note that it is not always clear how the window size is counted: one may count it as the total number of context words, or as the number of words on each side of the word of interest.

  5. http://visualgenome.org/

  6. http://cocodataset.org

  7. https://aclweb.org/aclwiki/WordSimilarity-353_Test_Collection_(State_of_the_art)

  8. The dataset is available at https://sites.google.com/site/semeval2012task2/

  9. https://fasttext.cc/

  10. https://github.com/fnielsen/dasem/blob/master/dasem/data/four_words.csv

  11. https://github.com/fnielsen/afinn/blob/master/afinn/data/AFINN-en-165.txt

  12. http://babelfy.org/

  13. Dasem is a Python package for Danish semantics, available at https://github.com/fnielsen/dasem

  14. We can place the super concept at A = (0, 0, 0) and the subconcepts at (1, 1, 1), (−1, −1, 1), (−1, 1, −1) and (1, −1, −1). All subconcepts then lie at the same distance from the super concept and at the same distance from each other.
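The two window-size conventions mentioned in note 4 can be sketched as follows, using a hypothetical token sequence (the function names are our own, for illustration):

```python
words = ["a", "b", "c", "d", "e", "f", "g"]

def context_per_side(tokens, i, size):
    """Window counted per side: `size` words left and right of tokens[i]."""
    return tokens[max(0, i - size):i] + tokens[i + 1:i + 1 + size]

def context_total(tokens, i, size):
    """Window counted as a total: `size` words split across both sides."""
    half = size // 2
    return context_per_side(tokens, i, half)

# For the target word "d" (index 3), "window size 2" means either
# four context words or two, depending on the convention in use.
print(context_per_side(words, 3, 2))  # ['b', 'c', 'e', 'f']
print(context_total(words, 3, 2))     # ['c', 'e']
```

When comparing reported hyperparameters across papers, it is therefore worth checking which of the two conventions is in effect.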
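The construction in note 14 can be verified numerically: the four subconcept points are alternating vertices of a cube centred on the super concept, so they form a regular tetrahedron, each at distance √3 from the super concept and √8 from one another.

```python
from itertools import combinations
from math import dist, sqrt

super_concept = (0, 0, 0)
subconcepts = [(1, 1, 1), (-1, -1, 1), (-1, 1, -1), (1, -1, -1)]

# Distance from each subconcept to the super concept.
to_super = [dist(super_concept, p) for p in subconcepts]
# Pairwise distances between distinct subconcepts.
pairwise = [dist(p, q) for p, q in combinations(subconcepts, 2)]

assert all(abs(d - sqrt(3)) < 1e-12 for d in to_super)
assert all(abs(d - sqrt(8)) < 1e-12 for d in pairwise)
print("all four sub-to-super and six sub-to-sub distances are equal")
```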


Acknowledgments

We would like to thank Innovation Fund Denmark for funding through the DABAI project.

Author information

Corresponding author: Finn Årup Nielsen.


Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter


Cite this chapter

Nielsen, F.Å., Hansen, L.K. (2020). Creating Semantic Representations. In: Sikström, S., Garcia, D. (eds) Statistical Semantics. Springer, Cham. https://doi.org/10.1007/978-3-030-37250-7_2
