Introduction

Normative research in psychology typically pursues the effects of specific stimuli; that is, it attempts to quantify how words differ from one another in meaningful ways that might affect cognition (Paivio, 1971). For example, imageable, concrete words are more memorable than abstract words (Paivio, 1969; Rubin & Friendly, 1986). Most word attributes are derived from Paivio’s dual coding theory of the mental lexicon, which posits that concepts are not coded in a single, uniform common code but are instead coded both verbally and visually (see Paivio, 2010, for a review). More recently, functional analyses of episodic memory have indicated that semantic categories within the mental lexicon may be privileged in cognition as well. Specifically, evidence now indicates animate, social stimuli such as people and animals are more memorable than inanimate, asocial stimuli such as objects and place names (see Nairne et al., 2017, for a review). Animates have been demonstrated to be more memorable than inanimates in free recall (e.g., Bonin et al., 2013, 2015; Félix et al., 2019; Gelin et al., 2017, 2019; Leding, 2018, 2019a, 2019b; Li et al., 2016; Meinhardt et al., 2018, 2020; Nairne et al., 2013; Popp & Serra, 2016, 2018; VanArsdall et al., 2017), recognition tasks (Bonin et al., 2015; Leding, 2020; VanArsdall et al., 2013), and some forms of paired-associate learning (e.g., DeYoung & Serra, 2021; Kazanas et al., 2020; Popp & Serra, 2016; VanArsdall et al., 2015). Contextual information (e.g., spatial or temporal location) associated with animate concepts is also better remembered (Gelin et al., 2018).

Further, current research on the animacy advantage in episodic memory indicates it is likely due to intrinsic semantic qualities of the words themselves (indeed, the number of semantic features appears to partially mediate the animate recall advantage; see Rawlinson & Kelley, 2021). A categorical explanation (the idea that animate words form stronger, more memorable categories that are used as a cue to structure recall) has been discredited (Gelin et al., 2017; VanArsdall et al., 2017). Further, the animacy effect appears insensitive to a variety of encoding conditions, including whether memory tasks are intentional or incidental (Bonin et al., 2015; Leding, 2018, 2019a, 2019b); whether a concurrent memory load task is present (Bonin et al., 2015); and whether highly imageable, scenario-based, or temporally regularized encoding environments are used (Bonin et al., 2015; Gelin et al., 2017; Blunt & VanArsdall, 2021). The effect may be related to attentional capture, but these data are mixed. Animates appear resistant to inattentional blindness under a variety of conditions (a finding known as the animate monitoring hypothesis; see Altman et al., 2016; Calvillo & Hawkins, 2016; Calvillo & Jackson, 2014; New et al., 2007), and it also takes longer to process the ink color of animate words in a Stroop task (Bugaiska et al., 2019), suggesting attentional capture with words as well. However, little evidence has made a direct connection between these attentional effects and memory. Recent work by ** into the same general construct. Cue set sizes were available for just over 5,000 words. Thus, while availability and meaningfulness metrics were not as restrictive as they could have been using traditional sets of normative data that contain far fewer observations, they did restrict word selection to a degree (with meaningfulness in particular being somewhat restrictive, as the current word set of 1,200 makes up roughly a quarter of observed cases in the Nelson et al. (1998) database).

Concreteness, imagery, and familiarity

Finally, available normative data for concreteness, imagery, and familiarity were the most restrictive in how words were chosen for the current 1,200-word set. The MRC Psycholinguistic Database (Coltheart, 1981) is the most comprehensive database for these measures, drawing from multiple sources that use the same rating task for each metric. For example, concreteness is measured on a seven-point scale, with 1 referring to words that are highly abstract and 7 referring to words that are highly concrete. Imagery and familiarity are likewise measured on seven-point scales, with 1 referring to words that are “highly unfamiliar”/“low imagery” and 7 referring to words that are “highly familiar”/“high imagery.” While data exist for these variables, many of the extant datasets were not usable because of the current study’s focus on relatively concrete nouns – most animate words are relatively concrete, and it would be unfair to pit them against inanimate abstract concepts. As such, normative data for these variables were compiled from multiple sources, each of which used the same rating task for each variable (Clark & Paivio, 2004; Coltheart, 1981; Cortese & Fugett, 2004; Friendly et al., 1982; Schock et al., 2012; Stadthagen-Gonzalez & Davis, 2006). Even after combining multiple datasets, ratings did not exist for a sizable number of words of interest (e.g., computer, robot, and a number of plants, animals, vehicles, and words referring to people). These gaps (between 100 and 200 words per measure, detailed below) motivated Part 1, which collected the missing normative data on concreteness, imagery, and familiarity.

Part 1: Collection of missing normative data

The purpose of Part 1 was to collect normative data for words in the selected set of 1,200 that were missing values for concreteness, familiarity, and/or imagery. In total, 209 of the 1,200 selected words were missing at least one of these values. Of those, 67 words were missing one value, 74 words were missing two values, and 68 words were missing values for all three metrics. Amazon Mechanical Turk (MTurk) was used to collect the missing data. Twenty-five workers were recruited for each scale, as at least 20 ratings per scale per word is typical in word variable research (e.g., Clark & Paivio, 2004, among others).

Method

Materials and instructions are available on the OSF (https://osf.io/4t3cu).

Materials

Materials for concreteness, familiarity, and imagery consisted of 123, 188, and 108 words, respectively, lacking values in extant databases. Words were randomly divided into four sets of 30 to 31 (for concreteness), six sets of 30 to 31 (for familiarity), and four sets of 27 (for imagery), and were presented to participants in a randomly selected order.

Procedure

The procedure for each task was identical, with any exceptions noted. Instructions for the rating scales were adapted from Paivio et al. (1968), and are provided for each scale in S1 of the OSF. Words were presented in groups of approximately 30, with a reminder of the scale they were to use in making their rating decisions presented at the top of the web page. Participants made ratings on a scale from 1 to 7, with appropriate anchors at either end. Participants were forced to make a rating decision for each word before moving on to the next page. At the halfway point and conclusion of the rating task, participants’ attention to the task was assessed with a question that had a single correct answer (“Have you ever walked on the surface of Mars?” then “What is the fifth word in this sentence?”) to increase the reliability of data (Rouse, 2015). All participants passed the attention check, likely because of the strict barrier to entry (95% approval rate over at least 1,000 HITs for the current sample).

After the rating task, participants provided demographic information and completed an “honesty” affirmation (Rouse, 2015). All participants indicated they answered honestly. Participants then had an opportunity to provide feedback on the study, were debriefed, and received a code to receive payment for their participation.

Participants

In total, 75 participants (35 female, 40 male) were recruited via the Amazon Mechanical Turk website. Of the 75 participants, two were eliminated from consideration because they reported a native language other than English or chose not to report a native language. Participant demographic details for each rating scale are listed in Table S1 (OSF). All workers were paid $0.05 per estimated minute of the task’s duration.

Results and discussion

Observed reliabilities (Cronbach’s alpha, a measure of internal consistency) for the newly collected concreteness, familiarity, and imagery metrics were α = 0.833, 0.824, and 0.936, respectively. As all alpha values were above 0.8 (a common rule of thumb for reliability data), the newly collected data were considered internally consistent. Means and standard deviations were multiplied by 100 to match the 100–700 scale common for these metrics and are presented in Table 1. Because the newly collected dataset contains words that did not already have concreteness, familiarity, or imagery values, consistency unfortunately cannot be compared between the current data and previous normative datasets; this is a limitation of these newly collected data.
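For readers wishing to recompute this reliability check from the posted ratings, a minimal Python sketch follows, assuming a complete words × raters matrix; the function name and the array contents below are ours, fabricated purely for illustration.

```python
import numpy as np

def cronbach_alpha(ratings):
    """Cronbach's alpha for a words x raters matrix of 1-7 ratings.

    Raters are treated as the "items" whose internal consistency is
    assessed; rows are the rated words.
    """
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                           # number of raters
    rater_var = ratings.var(axis=0, ddof=1).sum()  # sum of per-rater variances
    total_var = ratings.sum(axis=1).var(ddof=1)    # variance of per-word totals
    return (k / (k - 1)) * (1 - rater_var / total_var)

# Fabricated example: 5 words rated by 4 raters
demo = np.array([[7, 6, 7, 7],
                 [2, 1, 2, 1],
                 [5, 5, 4, 6],
                 [3, 2, 3, 3],
                 [6, 7, 6, 5]])
alpha = cronbach_alpha(demo)

# Rescale 1-7 means onto the conventional 100-700 range reported in Table 1
means_100_700 = demo.mean(axis=1) * 100
```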

Table 1 Descriptive statistics for extant normative data

With these newly collected data on concreteness, familiarity, and imagery, the initial dataset of 1,200 nouns was now complete. Table 1 shows descriptive statistics for all measures of interest, with concreteness, familiarity, and imagery values inclusive of the newly collected normative data. Further, Table 1 depicts descriptive statistics for all measures of interest broken down by “initially assigned” word type (that is, our selections of “clearly” animate, “clearly” inanimate, and “clearly” ambiguous words). As a result of Part 1, we now had a complete normative dataset for all 1,200 selected words. In Part 2, we sought to add additional animacy normative data to the dataset.

Part 2: Collection of normative data for animacy scales

The purpose of Part 2 was to collect normative data for six scales thought to capture various aspects of the animacy construct. The scales included two related to the physical capabilities of animate things: (1) likelihood of movement [Move] and (2) ability to reproduce [Repro]; two related to the mental capabilities of animate things: (3) degree of goal-directedness [Goals] and (4) ability to think [Thought]; and two thought to be “general” markers of whether something is animate or inanimate: (5) a rating of how similar the thing is to a person [Person] and (6) a basic living/non-living rating [Living]. Part 2 used the same general format for data collection as Part 1, with a few important exceptions described in the procedure section below.

Method

Materials

Materials consisted of 1,200 relatively concrete nouns with selection processes described above. Regardless of rating scale, each participant received a random assortment of 120 words to rate. These 120 words were further divided into lists of 30 items each; participants rated words one list at a time before moving on. Although word selection for any given participant was random without replacement, some MTurk workers started but did not finish the rating task (a common occurrence on Amazon Mechanical Turk). Therefore, words were not all rated an equal number of times.

Procedure

The procedure for each rating task was identical to Part 1, except words were always presented in groups of exactly 30.

Participants

In total, 1,500 participants (685 female, 811 male, and four who chose not to respond or self-identified outside these choices) were recruited via the Amazon Mechanical Turk website. A total of 250 participants completed each rating scale. An additional 62 participants were recruited and paid but eliminated from consideration because they reported a native language other than English (55) or responded “no, delete my data” when asked if they were paying attention and providing honest answers (7). Participant demographic details for each rating scale are listed in Table S2 (OSF). All workers were paid $0.05 per estimated minute of the task’s duration.

Results and discussion

Table 2 displays various metrics for each seven-point rating scale. For all rating tasks, each word was rated by at least 18 different participants, and words were rated an average of 25 times on each measure. Words were placed into three bins based on their average ratings to give a rough estimate of the number of “inanimate” (ratings ≤ 3), “ambiguous” (ratings between 3 and 5), and “animate” (ratings ≥ 5) items for each scale. For the most part, ratings were fairly evenly distributed across each scale, with a small trend toward lower ratings, especially for the “mental capacities” scales (Goals, Thought, Person). A notable exception was the Move scale – this was due to otherwise inanimate words like tornado, jet, and car receiving high ratings.
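This binning step is simple to reproduce; below is a sketch assuming per-word means are held in a pandas Series (the words, values, and the name mean_rating are fabricated for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical per-word mean ratings on the 1-7 scale
mean_rating = pd.Series({"tornado": 5.8, "trout": 6.1,
                         "anchor": 1.9, "pitcher": 4.2})

# Cutoffs from the text: <= 3 inanimate, >= 5 animate, otherwise ambiguous
labels = np.select(
    [mean_rating <= 3, mean_rating >= 5],
    ["inanimate", "animate"],
    default="ambiguous",
)
counts = pd.Series(labels, index=mean_rating.index).value_counts()
```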

Table 2 Descriptive statistics and split-half reliabilities for animacy properties

Additionally, the primary trend was for items initially thought to be ambiguous to be reclassified as inanimate, and for some items initially thought to be animate to be reclassified as ambiguous. This was expected to some degree – the point of collecting normative data is to verify (and disconfirm where appropriate) our intuitive assumptions. This “reclassification” trend occurred primarily on the “mental capacities” scales: For example, words like gazelle, hare, and trout were given low-to-middling ratings on these scales compared to words that referred to people. The Living scale corresponded most closely to the initial assignments, at least for animate and inanimate words. Of the items initially assigned as animate, only 14 (3.3%) received ratings below 5, and of these, only two (0.5%) received ratings below 3 (these words were relation and nag). Similarly, of items initially thought to be inanimate, only 15 (3.5%) received a rating on the Living scale above 3, and none received ratings above 5.

Reliability and validity of the animacy scales

As reported by Madan (2020), 957 of the 1,200 words in the dataset were also present in the PEERS study. The Living scale correlated highly with participants’ binary living/non-living judgments (after rescaling; the Living scale ranges from 100 to 700, while the PEERS measure ranged from zero to one): Pearson’s r(955) = .97, p < .001; Spearman’s ρ(955) = .91, p < .001. These data help to show that, for at least the Living scale, participants treated the task similarly to the commonly used Living/Non-Living task.
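A sketch of this comparison in Python follows; the arrays are fabricated stand-ins (note that Pearson’s r is unaffected by the linear rescaling, which is shown only for clarity).

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Fabricated example values for illustration only
living = np.array([665, 138, 335, 650, 120])       # Living scale, 100-700
peers = np.array([0.98, 0.02, 0.40, 0.95, 0.05])   # proportion judged "living", 0-1

living_01 = (living - 100) / 600.0  # linear rescaling of 100-700 onto 0-1
r, p_r = pearsonr(living_01, peers)
rho, p_rho = spearmanr(living_01, peers)
```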

Because no two participants saw the same exact list, standard measures of interrater reliability (e.g., Cronbach’s alpha) could not be calculated. Instead, estimates of reliability were calculated using the split-half method: For each word, participants’ ratings were randomly split into two subgroups of equal size (unless the word received an odd number of ratings, in which case one subgroup had an additional member). Means for each word were then calculated for each subgroup, and the two means for each word were correlated across the word set. All correlations were r > 0.9. Split-half reliability was then calculated using the Spearman-Brown formula, 2r/(1 + r). All split-half reliability measures were high (above 0.95) and are included in Table 2. These data suggest participants were consistent in their rating of the words along each scale, which implies they understood the rating tasks in the same manner and could consistently apply the ratings to the words. Participant feedback indicated this as well; two excerpts are included below. While not shown, similar anecdotes exist for the other scales.
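A minimal Python sketch of this split-half procedure, assuming each word’s ratings are stored in a dictionary (the structure and example values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2021)

def split_half_reliability(ratings_by_word):
    """ratings_by_word: dict mapping each word to an array of its 1-7 ratings."""
    half_a, half_b = [], []
    for ratings in ratings_by_word.values():
        shuffled = rng.permutation(ratings)
        mid = len(shuffled) // 2       # an odd count leaves one half an extra rating
        half_a.append(shuffled[:mid].mean())
        half_b.append(shuffled[mid:].mean())
    r = np.corrcoef(half_a, half_b)[0, 1]  # correlate subgroup means across words
    return 2 * r / (1 + r)                 # Spearman-Brown correction

# Fabricated example ratings
demo = {
    "tulip": np.array([7, 6, 7, 7, 6]),
    "anchor": np.array([1, 2, 1, 1]),
    "pitcher": np.array([4, 5, 3, 4, 4, 5]),
    "eagle": np.array([6, 6, 7, 5]),
}
reliability = split_half_reliability(demo)
```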

From the Living task:

This task made me stop and say to myself "huh." When I started this task I honestly thought it would be easy but several of the words I had to think about. Dinosaurs were once alive but they are no longer so technically they are not living things but at one time they were. A few of the vegetable words are alive while they are growing but are no longer when they make it to produce. However, a potato will continue to grow if left to its own device[sic] and kept in soil. Does that count as being alive? Our hands and body are alive when attached to a live body but when we die everything dies. I wasn't really sure how to answer a few of those. This was a thought-provoking study. Thank you for allowing me to participate.

From the Goals task:

This was an interesting study. My answers may have changed slightly as I became more familiar with the task and viewed objects or things that are not alive as low goal-directedness. I did assign higher to goal-directedness to hurricane as it seems like a living changing entity.

Table 3 illustrates how various categories of words were rated on each of the six measures (means for each scale were multiplied by 100 to match the 100–700 range of scales such as imagery and familiarity). Note that while category norms exist (e.g., Van Overschelde et al., 2004), they are less useful for this kind of grouping. Each word was individually assigned to one of the listed categories by hand; another rater might make slightly different choices in some cases, but overall, the category assignments seem reasonable. Examples of each category are provided in the table to aid in understanding. While not empirically verifiable, each of the scales seems to pass the “eye test” of face validity – living things like people and animals group appropriately on the Living and Repro scales, and words for people appropriately separate out from other words on the Person, Thought, and Goals scales. Further, animal words were given appropriate ratings on these mental scales as well: Animals were not rated as high as people, but they were rated higher than other living categories, such as plants. The scales related to mental abilities (Goals, Thought, and Person) had lower average scores than the others (Living, Repro, and Move), as mental abilities are not only related to whether something is alive, but specifically to higher cognition (scales ranged from 100 to 700: MGoals = 314.6, MThought = 285.3, MPerson = 300.7; versus MLiving = 382.9, MRepro = 346.5, MMove = 407.5; p values for all comparisons < .0001).

Table 3 Mean ratings on animacy scales by category

A closer look at specific cases serves to highlight participants’ sensitivity to the complex nature of animacy and the specific subscales, indicating their comprehension of the differences among the measures. For example, the Thought ratings seem to reflect common perceptions of the intelligence of different birds (Thought: chicken – 300, dove – 340, eagle – 419, and owl – 496) and mammals (Thought: lamb – 316, pig – 358, cow – 396, dog – 441, cat – 463, ape – 493, and dolphin – 519). The Goals ratings were highest for words referring to specific professions (Goals: president – 692, governor – 675, doctor – 665), and collective nouns, which were largely groups of people who work together for a purpose, also received high ratings (Goals: congress – 623, team – 596, orchestra – 527). Participants appeared to acknowledge that these collective words refer to groups of animals or people, yet are not really living, biological things themselves (being merely made up of them). For example, these same words received lower Living ratings (Living: congress – 454, team – 550, orchestra – 493). Perhaps these groups tend to be seen as possessing agency, but little else (Knobe & Prinz, 2008). Finally, tools (e.g., brush, spoon; MGoals = 217) are seen as more goal-oriented than are natural objects (e.g., diamond, mud; MGoals = 160), t(68) = 5.53, p < .0001, d = 1.30, likely a reflection of the way people tend to reason teleologically (and do so correctly when it comes to artifacts such as tools; Kelemen et al., 2013).
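The tools-versus-natural-objects contrast can be computed as follows; this is a sketch with fabricated per-word Goals means standing in for the two hand-assigned categories.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical per-word mean Goals ratings on the 100-700 scale
goals_tools = np.array([217.0, 230.0, 205.0, 240.0])
goals_natural = np.array([160.0, 150.0, 170.0, 155.0])

t, p = ttest_ind(goals_tools, goals_natural)

# Cohen's d from the pooled standard deviation
n1, n2 = len(goals_tools), len(goals_natural)
pooled_sd = np.sqrt(((n1 - 1) * goals_tools.var(ddof=1)
                     + (n2 - 1) * goals_natural.var(ddof=1)) / (n1 + n2 - 2))
d = (goals_tools.mean() - goals_natural.mean()) / pooled_sd
```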

It appears participants correctly interpreted the Move scale as likelihood of movement, and not simply whether the thing can move on its own. For example, subcategories like vehicles and weather phenomena received higher ratings than the subcategories for buildings and areas (Move: vehicles – 576, weather – 486, buildings – 165, areas – 175; relevant comparisons p < .0001). In addition, ratings for landscape words were inflated by moving bodies of water (Move: puddle – 146 compared to lake – 330 or ocean – 543). The ratings suggest participants were even sensitive to the rate at which water moves through bodies of flowing water (Move: brook – 438, creek – 538, stream – 608, and river – 627).

Other interesting examples across scales include words for celestial bodies, which, while ostensibly inanimate, are often thought to be somewhat goal-directed (Goals: world – 452, sun – 312), thought of as pseudo-living things (Living: world – 408, sun – 344), and definitely move (Move: world – 432, sun – 407). In addition, the Living average for the reptiles subcategory was reduced by dinosaurs, likely due to their status as extinct (Living: dinosaur – 419, reptile category average – 648; also see the participant quote above). Participants appeared sensitive to the Repro dimension as well, with highly reproducing concepts rated higher than lower-reproducing concepts, even within the domains of people (child – 289, monk – 404, grandma – 408, mother – 674) and animals (puppy – 476, stallion – 580, rabbit – 630).

Finally, a few representative examples of concepts that vary with respect to each dimension help demonstrate participants’ sensitivity to the different dimensions. Words for plants and ambiguous words are good examples here. The ratings for tulip, for example, demonstrate appropriate sensitivity to each of the dimensions: Tulips are living things (Living: 665), they reproduce (Repro: 548), they do not move (or move only when growing; Move: 232), they do not have goals (Goals: 173) or thoughts (Thought: 138), and they are very dissimilar to people (Person: 163). The ratings for pitcher (ambiguously referring to either an object or a person) reflect this ambiguity of meaning (Living: 335, Repro: 361, Move: 542, Goals: 338, Thought: 348, Person: 336). The ratings for robot likewise track intuition: Robots are not alive and do not reproduce, yet they move and are somewhat goal-directed (Living: 138, Repro: 181, Move: 500, Goals: 393, Thought: 294, Person: 279). Overall, there is substantial evidence the scales were assessing the intended concepts.

Factor analysis of animacy normative data

Given that six different potential measures of animacy were collected, it is of practical and theoretical interest to know whether any of these scales result from the same underlying construct. Therefore, factor analysis was used to determine whether this was the case. Factor analysis was chosen (as opposed to principal component analysis) due to the underlying theoretical assumption that the variables result from some common but not necessarily obvious factor or factors (indeed, all scales correlated with each other at r > .63). In the case of the new animacy scales, there is likely an underlying factor such as “mental capacity” or “agency” that predicts variables like Goals, Thought, or Person. For example, Gray et al. (2007) divide the perception of minds into two components: “agency” and “experience.” Agency generally represents a thing’s capacity to act on the world and possess intent, while experience represents a thing’s capacity to experience the world; people and animals (and other agents) vary with respect to these dimensions. Babies, for example, are high in experience but low in agency, while robots are often thought of as high in agency but low in experience. Similarly, Living and Repro are likely related because the rating decisions for both scales are judgments based on whether the target is alive or not (either directly or indirectly). Further, factor analysis minimizes the amount of unique variance and error variance that is analyzed, and considers only the shared variance of the variables of interest (Tabachnick & Fidell, 2013).

A one-factor solution was extracted, with an eigenvalue of 4.87; this solution explained 77.62% of the variance. As shown in Table 4, all six variables load highly onto this single factor, which could easily be termed a “General Animacy” factor. Note this is not a rotated solution, as rotation can only be performed when two or more factors exist. These results may be overly simple, however. Indeed, Wood et al. (1996), based on extensive simulations of factor analysis, argued that underextraction of factors produces more bias than overextraction. In particular, solutions are typically quite robust to overextraction when so-called singleton constructs are involved; that is, constructs for which only one (or very few) variables are present in the dataset. This is likely true of the current data, which have only six variables, two of which are more “general” (Living and Person).

Table 4 Animacy factor analysis results for 1,200 nouns

Several two- and three-factor solutions with other forms of rotation (varimax and promax) were therefore explored, with the most appropriate presented in Table 4. A two-factor promax solution was chosen because it best meets Thurstone’s (1947) criteria for a factor solution with simple structure; the varimax solution that was extracted did not. In this final preferred solution, two factors with eigenvalues of 4.28 and 4.21 were extracted, and a total of 86.07% of the variability is explained. A simple test often used to see whether an oblique/promax rotation is preferred is to correlate the extracted factors; Factors one and two correlate at r = 0.77. A general rule is that if factor correlations exceed 0.32, an oblique rotation is appropriate, because correlations in excess of 0.32 imply 10% or more of variance overlaps among factors (Tabachnick & Fidell, 2013). Clearly, the present data meet this criterion. For all of these reasons, the two-factor solution with promax rotation was accepted as the most appropriate factor analysis of the animacy measures; its factor loadings are presented in Table 4.
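As a sketch of how such a solution could be obtained, assuming the Python factor_analyzer package and a hypothetical CSV export of the six scale means (the file and column names are ours, not taken from the OSF archive):

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer  # pip install factor_analyzer

# Hypothetical file: one row per word, six animacy scale means as columns
df = pd.read_csv("animacy_norms.csv")
scales = df[["Move", "Repro", "Goals", "Thought", "Person", "Living"]]

fa = FactorAnalyzer(n_factors=2, rotation="promax")
fa.fit(scales)

# Factor labels follow the text's interpretation of the two factors
loadings = pd.DataFrame(fa.loadings_, index=scales.columns,
                        columns=["Mental", "Physical"])
factor_corr = fa.phi_                # factor correlations under the oblique rotation
variance = fa.get_factor_variance()  # variance, proportion, cumulative per factor
```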

The two factors that result from analysis of the animacy measures appear to roughly correspond to the “Mental” attributes of animate things and to the “Physical” attributes of living things (rescaled factor scores for these composite factors are also available on the OSF: https://osf.io/4t3cu). The fact that these two factors separate out from one another normatively is not only interesting, but important. As previously discussed, Gray et al. (2007) have identified two primary dimensions of what they call “mind perception”: experience, something’s ability to experience the world (e.g., hunger, pain, joy), and agency, something’s ability to act on the world (e.g., its ability to control itself, act on others, and have intent). It is possible the two factors extracted above correspond to these dimensions. In particular, it is likely the “Mental” factor relates to agency, while the “Physical” factor relates to experience. This idea is explored in detail later in the general discussion.

Principal component analysis of entire dataset

In addition to investigating the degree to which the six newly collected animacy measures are related to one another, it is of practical and theoretical interest to know whether these concepts are redundant with extant normative data. For example, Popp and Serra (2016) have posited that animacy advantages in free recall may result from animate things being more mentally arousing, perhaps because they attract attention or cause fear. While they tested this hypothesis using word lists equated for arousal and rejected it (see Popp & Serra, 2018), principal component analysis can lend additional credence to these sorts of arguments by showing whether animacy is conceptually redundant with other norms. If the animacy measures and the arousal measure load onto the same component, then they are likely related; if not, they are probably not related (at least not so similar as to be redundant). In this way, proximate mechanisms of animacy can be further understood through the normative data.

All 21 variables were subjected to a principal component analysis with varimax (orthogonal) rotation. Principal component analysis was more appropriate than factor analysis for analyzing the normative data as a whole, because rather than asking whether underlying factors produced the current variables, the primary focus was how the measured variables overlap and correspond with one another. There were six components with eigenvalues greater than 1.00, and the seventh and eighth components had eigenvalues relatively close to 1.00 (0.846 and 0.765, respectively). While seven- and eight-component extractions were explored, they did not appear to add much to the interpretability of the component structure.
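An SPSS-style principal component extraction with varimax rotation can be approximated with the same package’s “principal” method; again, the file name and column set below are hypothetical stand-ins for the combined norms.

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer

# Hypothetical file: all 21 numeric normative variables, one row per word
norms = pd.read_csv("combined_norms.csv")

pca = FactorAnalyzer(n_factors=6, rotation="varimax", method="principal")
pca.fit(norms)

loadings = pd.DataFrame(pca.loadings_, index=norms.columns)
bold = loadings.abs() > 0.300           # flag loadings shown in bold in Table 5
eigenvalues, _ = pca.get_eigenvalues()  # inspect against the > 1.00 criterion
```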

Because of these considerations, the six-component solution was deemed most appropriate. The component loadings are shown in Table 5; component loadings greater than .300 are shown in bold. These six components accounted for over 76% of the variability in the combined normative data. Two of the observed components (two and five) showed noticeably “clean” results, meaning their constituent variables did not have substantial loadings on other components and other variables did not load on these components. Because they were relatively clear of misleading variables, Component two was identified as pertaining to a word’s lexical features, while Component five was related to how emotionally laden a word was.

Table 5 Principal component analysis results for 1,200 nouns

Component one was identified as a measure of animacy. Importantly, none of the extant word variables loaded significantly onto this animacy component: only ARO and FAM had loadings nearing 0.30. It appears arousal is not much related to animacy after all, as Popp and Serra (2018) demonstrated through other means. However, two of the animacy variables (Person and Goals) did load onto Component four, with loadings of -.307 and -.308, respectively. Component four consisted primarily of IMG, CNC, and AoA – while Person and Goals loaded onto this component, they did so only barely, and did so negatively.

Due to this constellation of variables, Component four was named the “Simple Words” component, meaning words high on this dimension largely consist of highly imageable, highly concrete concepts that are learned early in life, are not very complex, and are generally naturally occurring. That is, words high on this dimension primarily refer to a single, exact concept with little room for error in interpretation (they are “simple”). Interestingly, the words that rated most highly on this Simple Words component included virtually all animals, a number of edible fruits and vegetables, and words like parent, airplane, and finger. On the opposite end of the Simple Words component are vague, ill-specified words like soul, thing, mind, region, and expert. Notably, while there were equal numbers of animate words (that is, words scoring at 500 or above on the Living scale) in each half of this component (243 words in the upper half and 243 words in the bottom half), the types of animate words in each half were vastly different – 102 of the 111 animals in the list were in the upper half of the Simple Words component. The remaining nine in the other half consisted of relatively obscure animal words like fawn, mare, mole, and oyster.

Of the remaining two components, Component three was a measure of contextual variables, including word frequency, contextual diversity, familiarity, and availability. AoA loaded negatively onto this component as well, as more familiar, available, and frequent words are learned earlier in life. Finally, Component six appeared to relate primarily to word meaningfulness, that is, how readily a given word makes a person think of other words (which is how meaningfulness is operationalized). The related variables FAM and AVAIL make sense in this context, as they are highly related to meaningfulness as a concept.

Three variables in these data load rather evenly across two or more of the six components, indicating a multidimensional underlying structure for these variables: AVAIL, AoA, and FAM. The AVAIL measure of availability is based on the occurrence of words as free associations in response to other words, and is known to tap other measures, including FAM, LEN, and CNC (Clark & Paivio, 2004); these relationships are all reflected in the current data. AoA is similarly multidimensional: low values for AoA typically represent words that are familiar, short, concrete, and frequent (Kuperman et al., 2012). These patterns are also reflected in the current data. Finally, FAM loads across multiple components as well – a pattern most likely due to the aforementioned relationships with both AoA and AVAIL.

General discussion

Our primary goal was to create an open-source set of normative data related to the animacy dimension for 1,200 words. In addition to providing data for six new animacy scales, we also included rescaled factor scores for the mental and physical components of the animacy dimension. These data are a valuable resource for future research, as no comprehensive ratings-based normative data for the animacy dimension exist at present. With these norms, future researchers can easily select word pools matched along the animacy dimension, whether they wish to study recall or something else entirely. For example, one possible future direction with these data is to find words matched along one aspect of animacy but not others, to see how various components of the animacy dimension impact recall when under experimental control.

Second, the data produced here are a complete, comprehensive set. Very few studies on normative components (with the possible exception of Clark & Paivio, 2004) gather and present all of the data in a single place: most are concerned with a particular aspect, such as age of acquisition or emotionality. Because the present norms are unified, future researchers can easily consult a single dataset to find normative values for a wide range of words – even if they are not concerned with animacy. Additionally, if further normative data are collected in the future (say, to better specify the mental component of animacy), they can easily be added to this set. Conversely, if an outcome measure is gathered for words in this sample, these norms can be used as predictors in a new regression model.

A few limitations exist for the present data set. First, our newly collected norms for concreteness, imagery, and familiarity are not directly comparable to other, older measures: recent work has demonstrated that categorical information can shift over time (e.g., Castro et al., 2021), as can general knowledge (e.g., Coane & Umanath, 2021). Second, the six new animacy dimensions were chosen based on the logic described above, but may not be comprehensive – other ways of considering what defines animacy exist. For example, while we used “can move” in the present study, “can move on its own” is another plausible way to have framed the “movement likelihood” dimension. This difference may explain why the dimension loads less cleanly than others in Table 4, for example.

This project’s third contribution is that word animacy is placed in a larger context with other word variables. The principal component analysis described above is the first attempt to see whether word animacy can be conceptually collapsed with other word factors – and it clearly cannot. Animacy is not directly related to any other variables in this project, with the possible exception of a somewhat negative relationship with imagery and concreteness for items particularly high on the mental component of animacy (the “Simple Words” component described in the principal component analysis, above). The “Physical” dimension of animacy also seems somewhat under-specified in the current data: it did help clarify the factor structure of animacy, but it did not explain much additional variance (only 7.02%, compared to the “Mental” dimension’s much larger 79.05%). Therefore, exploring the “Physical” component of animacy provides another opportunity for future research. One limitation of these data is that the amount of variability accounted for by any given component or factor (as presented in the bottom rows of Tables 4 and 5) does not relate to the “importance” of that factor in any absolute sense. Instead, the amount of variability accounted for reflects the number and similarity of measures relevant to the particular construct in the current dataset – therefore, these data do not support any conclusion regarding which components of the animacy dimension are more important than others. As a second example, because there are six constructs related to animacy, the animacy component (as shown in Table 5) naturally accounts for the largest proportion of the variance; this does not imply the animacy component is more important.

As discussed earlier, the two dimensions of what Gray et al. (2007) call “mind perception” are experience and agency. That is, the ability to experience the world, and act on it. Gray et al. (2007) further measured these dimensions using factor analysis for a small set of 13 “minds”: Different kinds of people (yourself, men, women, children, babies, the dead, and those in a persistent vegetative state), animals (frogs, chimps, and dogs), as well as God, robots, and fetuses. Some of these minds are high on experience but not agency (animals, babies, and people in persistent vegetative states), some are high in agency but not experience (God, robots), and some are high in both or neither (adult humans and the dead, respectively).

The present data mimic these results somewhat. Animals were clearly living and could reproduce but were not clearly goal-directed, capable of thought, or similar to people. Thus, the extracted “Mental” factor may correspond to the agency dimension, while the extracted “Physical” factor may correspond to the experience dimension. However, it is interesting to note that judgments about whether something is a living thing or can reproduce are not quite about experience per se. In Gray et al. (2007), experience included the ability to feel hunger, pain, or various emotional states – these may be indirectly tapped by making a judgment as simple as whether something is alive. Another possibility is that the extracted “Physical” factor is less about the ability to experience per se and more about whether physical markers or features of animacy exist: the ability to feel hunger, fear, pain, or pleasure (the measures that loaded highest onto the experience component in Gray et al.’s (2007) work) relates to questions about reproduction when both are thought of simply as markers of being alive. Perhaps in this way the Living scale serves to quickly gauge all of these components.

Comparatively, the agency construct of Gray et al. (2007) included measures of perceived self-control, morality, memory, emotion recognition, planning, communication, and thinking. These measures align with the extracted “Mental” factor of animacy, made up of goal-directedness, ability to think, and similarity to a person. Person-similarity may be an indicator measure for these factors, much as the Living scale seems to be for the “Physical” factor. Additional research could better specify the physical and mental dimensions of animacy, perhaps using the metrics employed by Gray et al. (2007) for a more direct comparison to this prior work. For example, future scale development might include an explicitly experience-oriented measure, as none of the present measures directly taps it.

Conclusions

Altogether, this project provides insight into the makeup of the animacy dimension in general and how it relates to other major word variables (in that it largely does not). While animacy is not the first normative factor related to a semantic dimension, animacy as a word factor has remained unquantified until now. Perhaps it has been ignored by many word-variable researchers because it began as a functional-evolutionary hypothesis, whereas much of the research on semantic features of words is rooted in dual-coding theory (Paivio, 2010). It was predicted that word animacy might be important for recall because animates were likely to be important over the course of evolution (Nairne et al., 2013; VanArsdall et al., 2013). Many domains of cognitive psychology support this hypothesis, from the ways in which animates capture visual attention (Johansson, 1973; Pratt et al., 2010), to language research that claims animacy as a linguistic universal (Comrie, 1989), to research in neuroscience that implies a critical role of animacy in how semantic knowledge is stored (Capitani et al., 2003), to the rapidity with which the animate-inanimate distinction emerges in development (Opfer & Gelman, 2011). There is even a name for the evolutionary account of why animates are likely to play a key role in human cognition: the “animate monitoring hypothesis” (New et al., 2007). While the search for a proximate mechanism that explains the animacy effect in episodic memory is still ongoing, we hope these normative data will aid researchers in that search.

Thus, while the hypothesis that led to the current project is somewhat intuitive and well supported by prior work in cognition, it had not yet been fully explored. In sum, we hope (1) this project has successfully demonstrated that animacy is a separable and important word-related factor, and (2) the resources it contributes to the literature will allow future researchers to easily and consistently integrate animacy ratings into their work.