Introduction

One of the most debated topics in cognitive science concerns the way conceptual representations are acquired and organized. This issue has become even more central over the last two decades due to the influence of the grounded cognition framework, which claims that concepts are represented at a sensorimotor level (Barsalou, 1999; Fischer, 2012; Glenberg, 2015; Zwaan & Madden, 2005). According to this view, our semantic memory cannot be a self-contained system in which all representations are abstract, amodal symbols defined exclusively by their relations to one another (see for example Collins & Quillian, 1969; Kintsch, 1988). The best-known argument against this conceptualization is provided by Harnad’s (1990) adaptation of Searle’s (1980) Chinese room argument: If a monolingual English speaker suddenly finds herself in China, equipped only with a monolingual Chinese-Chinese dictionary, she will never be able to understand anything. Whenever she looks up a symbol, it is only ever linked to other symbols that have no meaning for her. At some point, she needs the symbols to be grounded in a format she is able to understand (in this concrete example, English words or pictures). This argument directly translates to our semantic memory: At some point, semantic representations need to be grounded in a primary format of our cognitive system. This is the core assumption of theories of grounded cognition (e.g. Barsalou, 1999; Glenberg & Kaschak, 2002; Glenberg & Robertson, 2000; Zwaan & Madden, 2005), which postulate that perceptual and motor systems take on this vital grounding role. Simply speaking, in order to understand a word such as horse, the cognitive system re-activates sensorimotor experience with the word’s referent, such as visual experience with a horse standing in a stable or running on a field. This sensorimotor experience can be linked to linguistic experience through systematic patterns of co-occurrence (for example, by hearing the word horse while seeing a horse on a field; Zwaan & Madden, 2005), with several studies suggesting that this connection is established through Hebbian learning at the brain level (see also Pulvermüller, 2005; Hoenig et al., 2011; Kiefer et al., 2007; Trumpp & Kiefer, 2018).

Such a co-occurrence-based grounding mechanism appears to be straightforward for concrete words, which refer to clearly identifiable objects that can be perceived with our senses. However, it is far less obvious how grounding would be achieved when this is not the case (Barsalou, 2016; Borghi et al., 2017). Prime examples are abstract words such as libertarianism, jealousy, or childhood, which by definition do not refer to a distinct class of physical objects (for an overview, see Borghi et al., 2017). However, it should be noted that these issues already arise for concrete words whose referents one has never experienced directly, such as Atlantis or supernova (Günther, Dudschig, & Kaup, 2018; Günther, Nguyen, et al., 2020). The question of how we can achieve grounding in the absence of any direct sensorimotor experience is of central importance for theories of grounded cognition (Borghi et al., 2017): if these theories can account for only the fraction of words whose referents are directly experienced, their usefulness and adequacy as general-level cognitive theories stands in question.

In recent years, many different proposals have been made to address this issue (see Barsalou, Santos, Simmons, & Wilson, 2008; Borghi & Binkofski, 2014; Glenberg, Sato, & Cattaneo, 2008; Harpaintner, Trumpp, & Kiefer, 2018; Harpaintner, Sim, Trumpp, Ulrich, & Kiefer, 2020; Hoffman, McClelland, & Lambon Ralph, 2018; Kousta, Vigliocco, Vinson, Andrews, & Del Campo, 2011; Lakoff & Johnson, 2008; Wilson-Mendenhall, Simmons, Martin, & Barsalou, 2013, for different theoretical approaches). One possible mechanism by which grounding can be established in the absence of experience (referred to as acquired embodiment by Hoffman et al., 2018, and indirect grounding by Günther, Nguyen, et al., 2020) is best illustrated by an example: Assume a friend tells you, “On my way here, I saw a little wibby chirping in a tree!”. You have never heard the word wibby before, but it does not seem difficult to imagine what it would look like: A small animal with feathers, wings, and a beak. Thus, due to the way wibby was used in language (similar to bird, robin, or sparrow), its semantic representation is similar to those of words for which visual experience is available (Landauer & Dumais, 1997; Lenci, 2008), and you can draw on this information to predict a likely visual representation. In other words, one can map a semantic representation formed through linguistic experience onto perceptual experience by exploiting systematic language-to-vision relations learned before. Note that this is not restricted to simply substituting a word with an already grounded one and retrieving the associated experience: If your friend adds the sentence “They used to build these wibbies from white steel, but nowadays it’s just aluminium.”, the visual representation will probably change to some kind of robotic bird, something you most likely have never seen before. Thus, from this purely linguistic input, one can extrapolate from available experience, draw inferences about what a wibby would most likely look like, and simulate the corresponding visual experience.

In the present study, we investigate whether such a mapping can be reliably achieved for word meanings learned from language alone (i.e., without any accompanying direct experience, be it sensorimotor or emotional), for concrete words and, crucially, also for abstract words (see Hoffman et al., 2018). More specifically, we test whether language-based representations in our semantic memory, along with their relations to associated vision-based representations, provide the necessary structural information to reliably map the remaining language-based representations (i.e., those for which no vision-based representation is available) onto the visual domain. This can be achieved by exploiting (a) systematic relations between language-based semantic representations and visual representations (for example, birds usually have wings), and (b) the structure of similarity among language-based representations themselves (in the example above, wibby is used in a similar way as words denoting birds). To test this, we implement a data-driven, computational model in which both language-based and vision-based representations are conceptualized in a high-dimensional vector format. In the following section, we will first describe the model in detail; we will then discuss the perspective it provides on the grounding problem for abstract words, before putting it to empirical test.

The mapping model

In the model presented here, we employ a distributional semantics framework to model language-based semantic representations, a deep neural network computer-vision approach to model visual representations, and train a simple linear function to establish a mapping from the former to the latter.

Language-based semantic representations

Language-based representations were obtained via the distributional semantics framework (Günther, Rinaldi, & Marelli, 2019; Landauer & Dumais, 1997; Turney & Pantel, 2010). These models are based on the distributional hypothesis that words with similar meanings are used in a similar manner (Wittgenstein, 1953) and thus occur in similar (linguistic) contexts (Harris, 1954; Lenci, 2008). Consequently, distributional semantic models estimate a word meaning from its distribution over linguistic contexts in large corpora of natural language, resulting in a representation of word meanings as high-dimensional numerical vectors. Over the past decades, these models have received strong empirical (e.g. Baroni, Dinu, & Kruszewski, 2014; Jones, Kintsch, & Mewhort, 2006; Mandera, Keuleers, & Brysbaert, 2017; Pereira, Gershman, Ritter, & Botvinick, 2016) and theoretical support (Günther et al., 2019; Jones et al., 2015; Westbury, 2016) as models of human semantic memory.

There are many different possible parametrizations for distributional semantic models (Jones et al., 2015). In the present study, we employed the model with the overall best performance in a systematic evaluation by Baroni et al. (2014): a system trained using the cbow algorithm of the word2vec model (Mikolov et al., 2013), with 400-dimensional vectors, negative sampling with \(k = 10\), and subsampling with \(t = 10^{-5}\) as parameter settings.

This model was trained on a \(\sim\)2.8-billion-word English source corpus (a concatenation of the ukWaC corpus, Baroni, Bernardini, Ferraresi, & Zanchetta, 2009; an English Wikipedia dump; and the British National Corpus, BNC Consortium, 2007), while considering the 300,000 most frequent words in the corpus as target and context words.
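For illustration, a cbow model with the parameter settings listed above could be trained with the gensim library. This is only a sketch, not the pipeline used in the study (which followed the setup of Baroni et al., 2014); the toy corpus, the window size, and the minimum-frequency threshold are assumptions made here for the example.

```python
from gensim.models import Word2Vec

# Toy corpus stand-in: in the study, this would be an iterable over the
# tokenized sentences of the ~2.8-billion-word corpus (ukWaC + Wikipedia + BNC).
corpus = [["the", "horse", "runs", "on", "a", "field"],
          ["a", "horse", "stands", "in", "a", "stable"]]

model = Word2Vec(
    sentences=corpus,
    sg=0,                      # cbow architecture
    vector_size=400,           # 400-dimensional vectors
    negative=10,               # negative sampling with k = 10
    sample=1e-5,               # subsampling threshold t = 1e-5
    max_final_vocab=300_000,   # keep (at most) the 300,000 most frequent words
    window=5,                  # assumption: window size not stated above
    min_count=1,               # assumption: keep all words in this toy corpus
)

horse_vector = model.wv["horse"]   # 400-dimensional language-based vector
```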

Vision-based representations

Vision-based representations were obtained via a state-of-the-art computer vision model (Krizhevsky, Sutskever, & Hinton, 2012). This model employs a deep neural network that is trained to predict (human-generated) image labels from a vector representation encoding the pixel-based RGB values of the respective image (see the upper-right part of Fig. 1). Besides excelling at image labeling (Chatfield, Simonyan, Vedaldi, & Zisserman, 2014), such networks produce, in their deeper layers, activation vectors that capture the visual properties of the input image. In the present study, we used the activation vectors provided by the VGG-F network (Chatfield et al., 2014) as vision-based representations, with one 4096-dimensional vector for each individual image.
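As an illustration of how such vision-based representations can be extracted, the sketch below uses a pretrained VGG-16 network from torchvision as a stand-in: the VGG-F network used in the study is not bundled with torchvision, but VGG-16's penultimate fully connected layer is likewise 4096-dimensional. The preprocessing choices shown are standard torchvision defaults, not necessarily those of the original study.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Pretrained VGG-16 as a stand-in for VGG-F (illustrative only).
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
model.eval()

# Drop the final classification layer so that the forward pass returns the
# 4096-dimensional activation of the penultimate fully connected layer.
model.classifier = torch.nn.Sequential(*list(model.classifier.children())[:-1])

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def image_vector(path: str) -> torch.Tensor:
    """Return a 4096-dimensional vision-based representation for one image."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return model(img).squeeze(0)   # shape: (4096,)
```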

The mapping function

To link the two systems, we estimate a function that maps language-based representations onto their corresponding vision-based representations. Such mapping functions between distributional vectors and sensorimotor information have previously been successfully implemented for participant ratings on various sensorimotor features (Sommerauer & Fokkens, 2018; Utsumi, 2020) or emotional features (Martínez-Huertas, Jorge-Botana, Luzón, & Olmos, in press) associated with words.

In the model presented here, we implemented one of the simplest possible mapping functions: a linear function \(m: \text{language} \rightarrow \text{vision}\), so that \({\hat{v}} = m(l) = M \cdot l\) for any \(l \in \text{language}\), with \({\hat{v}} \in \text{vision}\) being the predicted vision-based representation. M is estimated as a matrix whose cell entries \(M_{ij}\) specify how much each element \(l_i\) of the input vector influences each element \({\hat{v}}_j\) of the predicted output vector (with \({\hat{v}}_j = \sum_{i} M_{ij} \cdot l_i\)). Note that such a linear function is equivalent to a linear regression with multiple predictors and dependent variables. With this approach, we follow up on similar previous developments in the field of natural language processing and computer vision (Lazaridou et al., 2015).

As a training set for this regression, we employed the complete set of 7801 words for which both language-based and vision-based vectors were available. The weights were then estimated so that, on average, the difference between the predicted visual representation \({\hat{v}}\) and the observed visual representation \(v\) is minimized, with respect to a least-squares criterion. We used the DISSECT toolkit to train the function (Dinu et al., 2013).
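Since the mapping is a multivariate least-squares regression, an equivalent estimate of M can be obtained with a few lines of NumPy. This is a minimal sketch with random stand-ins for the actual cbow and (dimensionality-reduced, see below) VGG-F vectors; the study itself used the DISSECT toolkit.

```python
import numpy as np

# Toy stand-ins for the real data: rows of L are language-based (cbow) vectors,
# rows of V are the paired vision-based vectors for the same 7801 words.
rng = np.random.default_rng(0)
n_words, d_lang, d_vis = 7801, 400, 300
L = rng.normal(size=(n_words, d_lang))
V = rng.normal(size=(n_words, d_vis))

# Least-squares estimate of the mapping matrix M (400 x 300, matching the
# text), so that the predicted vision vector for a word is v_hat = l @ M.
M, *_ = np.linalg.lstsq(L, V, rcond=None)

# Predict a vision-based representation for a word outside the training set,
# given its language-based vector.
l_new = rng.normal(size=d_lang)
v_hat = l_new @ M
```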

On an intuitive level, this training set can be thought of as the repeated presentation of a visual scene paired with a corresponding linguistic stimulus, in which both the visual representation of the scene and the language-based representation of the word meaning are activated and associated (Zwaan and Madden 2005). The training process then captures the learning of the systematic relation between the two, if it exists.

While the cbow model provides one single vector representing the meaning of each word, the VGG-F model provides an individual vector for each single image. Thus, as a consequence of our image selection procedure, we have between 100 and 200 VGG-F vectors for each label. We can hence set up two different frameworks to estimate the mapping function: A prototype-based approach and an exemplar-based approach (following a classical distinction in concept research; Smith & Medin, 1981). In the prototype-based approach, we averaged all the 100-to-200 vectors for images with the same label to obtain a single visual representation (and thus a single training pair) for each word (Günther et al., 2020; Petilli et al., 2021). In the exemplar-based approach, we instead selected 20 images per label and trained the mapping function m directly on these individual images. Thus, we had 20 training items per word, resulting in a training set of 156,020 pairs. The language part of the model was identical for both approaches (cbow vectors corresponding to the image labels).
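The construction of the two training sets can be sketched as follows. The image vectors and cbow vectors below are random stand-ins, and taking the first 20 image vectors per label is only a placeholder for whatever selection criterion was actually applied.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: for each image label, a set of 4096-dimensional image vectors
# and one 400-dimensional cbow vector (real data replaced by random noise).
image_vecs = {"horse": rng.normal(size=(150, 4096)),
              "robin": rng.normal(size=(120, 4096))}
lang_vecs = {word: rng.normal(size=400) for word in image_vecs}

# Prototype-based training set: one averaged vision vector (and thus one
# training pair) per word.
prototype_pairs = [(lang_vecs[w], image_vecs[w].mean(axis=0))
                   for w in image_vecs]

# Exemplar-based training set: 20 individual image vectors per word
# (selection of the 20 images is a placeholder here).
exemplar_pairs = [(lang_vecs[w], v)
                  for w in image_vecs
                  for v in image_vecs[w][:20]]
```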

Since the dimensionality of the vectors will directly influence the number of free parameters for our mapping function, we reduced the original dimensionality of the visual representations in both training sets from \(d = 4096\) to \(d' = 300\) using Singular Value Decomposition (SVD; Martin & Berry, 2007), as implemented in the DISSECT toolkit (Dinu et al., 2013). This was done separately for the prototype-based and the exemplar-based approach. As indicated by a pre-test on the similarity structure within the visual representations, this has a negligible effect on the informativity of these vectors (see Günther, Petilli, & Marelli, 2020). As a result, m is estimated as a 400\(\times\)300-dimensional matrix, instead of a 400\(\times\)4096-dimensional matrix.
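A minimal sketch of this reduction step is given below; the study used the SVD implementation of the DISSECT toolkit, but any standard SVD routine yields equivalent reduced representations (the data matrix here is a random stand-in).

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for the matrix of vision-based representations
# (one 4096-dimensional vector per row of the training set).
X = rng.normal(size=(1000, 4096))

# Project the vectors onto the 300 leading singular dimensions.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_reduced = X @ Vt[:300].T        # shape: (1000, 300)
```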

Once the mapping function is estimated, it can take any cbow vector as input (including, and especially, vectors outside the training set) to predict a visual representation for the corresponding concept (see the bottom part of Fig. 1).

Identifying predictors of model performance

In the previous section, we described the language-to-vision mapping system implemented for this study. Notably, “concreteness” is not explicitly encoded in this model and is therefore not implemented as an a priori component: the model has no “concreteness feature” that is assigned to some concepts and not to others. The model only knows whether a word has any visual experience associated with it (in technical terms, whether a word serves as a label for a set of images).

However, not all language-based vectors are the same: By definition, they have different dimensional values, and populate different neighborhoods of the induced semantic space (see Martínez-Huertas, Jorge-Botana, Luzón, & Olmos, in press, for a conceptually similar distinction between a specific dimensionality hypothesis and a semantic neighborhood hypothesis for the mapping between language and grounded information). Based on these properties, we can identify factors that potentially influence how well a vision-based representation can be predicted from a language-based representation. On the one hand, there could just be inherent, fundamental differences between language-based representations for concrete and abstract concepts which emerge naturally during the training of the language-based model. Initial evidence for this assumption is provided by Hollis and Westbury (2016), who demonstrate that the dimensions of language-based distributional vectors contain concreteness information that can be extracted using adequate mathematical methods.

On the other hand, language-based representations for different concepts could inhabit fundamentally different areas of the semantic system. For example, assume that a speaker has newly learned the words stallion and jealousy without accompanying visual experience. Due to the way these words are used, stallion will have a language-based representation that is very similar to horse, steed, and pony. For all these neighbors of stallion, direct visual experience is available, which makes it easy to estimate what a stallion looks like (namely, similar to a horse). The language-based representation for jealousy, on the other hand, will be similar to envy, hatred, and resentment, and thus to concepts for which no visual experience is available. Irrespective of the concreteness of the word itself, it might be easier to extrapolate a visual representation for a word if its linguistic neighborhood contains more visually grounded concepts that can provide “a bridge to visual experience” and an orientation on what the concept probably looks like. The presence of such visually grounded neighbors in a word’s linguistic neighborhood (which we will from now on simply refer to as visual neighbors, for brevity) might thus influence model performance.
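One simple way to operationalize this notion is to count, for a given word, how many of its nearest linguistic neighbors are visually grounded (i.e., appear as image labels in the training set). The sketch below is purely illustrative: the vectors are random stand-ins, and the neighborhood size k = 20 is an arbitrary choice, not the criterion used in the experiments reported below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: cbow vectors for a vocabulary, plus a boolean flag marking
# which vocabulary entries are visually grounded (i.e., serve as image labels).
vocab_vecs = rng.normal(size=(5000, 400))
is_grounded = rng.random(5000) < 0.1

def n_visual_neighbors(word_vec, k=20):
    """Count how many of the k nearest linguistic neighbors of `word_vec`
    (by cosine similarity) are visually grounded."""
    sims = (vocab_vecs @ word_vec) / (
        np.linalg.norm(vocab_vecs, axis=1) * np.linalg.norm(word_vec))
    nearest = np.argsort(sims)[::-1][:k]
    return int(is_grounded[nearest].sum())

print(n_visual_neighbors(rng.normal(size=400)))
```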

In the following empirical studies, we test our model by deriving model-predicted images for words which are all outside the model’s training set (i.e., for which the model has no visual experience available). These model-predicted images are then paired with random control images. If our model matches human intuitions about which image better fits the word meaning (i.e., if participants systematically prefer the model prediction over the control image), this will demonstrate that our linguistic and perceptual experience provides the necessary information to establish a link between the two (and that our model provides one possible, simple account of how this can be achieved). The model will be tested in different conditions that we expect to influence model performance: On the one hand, we test the model on both concrete and abstract words; on the other hand, we test it on words that do or do not have training items (i.e., words for which visual experience is available) in their immediate neighborhood. Since these two variables (concreteness and visual neighbors) are typically highly correlated, we apply item selection procedures to disentangle them (see the Methods sections of Experiments 1, 2, and 3). This will allow us (a) to evaluate whether the model generally succeeds in predicting visual representations from language-based representations, (b) to test which factors influence its ability to do so, and (c) to examine potential limits of our approach and identify conditions it cannot handle.