Observing an action activates the motor patterns used to perform the same action (Buccino et al., 2004; Hari et al., 1998), demonstrating an imitative capacity that allows the observer to map observed actions rapidly onto his or her own motor repertoire. The discovery of mirror neurons in macaque monkeys (Gallese, Fadiga, Fogassi, & Rizzolatti, 1996) and humans (Mukamel, Ekstrom, Kaplan, Iacoboni, & Fried, 2010) demonstrated a direct neural observation-execution link, and neuroimaging studies have suggested that a human mirror neuron system (MNS) responds when participants execute and observe the same actions (Molenberghs, Cunnington, & Mattingley, 2012). The MNS has been proposed as the underlying neural structure subserving imitation (Buccino et al., 2004; Catmur, Walsh, & Heyes, 2009).

Behaviorally, observation-induced motor activation has been demonstrated in studies showing automatic imitation, measured using the stimulus–response compatibility (SRC) task (Heyes, Bird, Johnson, & Haggard, 2005; Stürmer, Aschersleben, & Prinz, 2000). In Stürmer et al., observing a compatible movement (e.g., an opening hand) facilitated participants’ responses (e.g., opening their own hands) relative to observing an incompatible movement (e.g., a closing hand). Automatic imitation is defined as the response time (RT) difference between the two compatibility conditions, with a larger effect indicating greater observation-induced motor activation (Heyes, 2011). Automatic imitation is thought to occur because action observation activates the corresponding motor patterns, which interact with the participant’s response: performance is facilitated when observation activates the compatible action, and delayed when observation activates the incompatible action.
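
Expressed as a formula (our restatement of the definition above, not notation from the original), the automatic-imitation effect is simply the difference in mean RTs between the two conditions,

$$\mathrm{AI} = \overline{\mathrm{RT}}_{\text{incompatible}} - \overline{\mathrm{RT}}_{\text{compatible}},$$

with larger positive values indicating greater observation-induced motor activation.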

The associative sequence learning (ASL) model proposes that the imitative capacity is a product of associative sensorimotor learning that involves correlated experiences of observing and executing the same actions (Heyes, 2005, 2010). Notably, the associative mechanism is suggested to be the same domain-general process that also produces Pavlovian and instrumental conditioning, and that is therefore sensitive to experienced stimulus–response pairs. Previous studies supporting ASL have demonstrated that sensorimotor training modulates automatic imitation of manual movements (see Catmur, 2013, for review). In Heyes et al. (2005), automatic imitation of hand opening/closing movements was eliminated following countermirror training that associated different observed and executed hand movements, but not following mirror training that associated the same observed and executed movements. Because both groups received the same amount of sensory and motor practice during training, the authors concluded that the relationship between observed and executed actions was what modulated automatic imitation, hence supporting ASL’s hypothesis that observation-execution links depend on sensorimotor learning.

Past training studies have exclusively examined perceptually transparent actions, such as manual gestures, whose sensorimotor links could be built through self-observation. One dispute that remains unresolved concerns the flexibility of the links underlying perceptually opaque orofacial gestures, which actors cannot see themselves perform (Heyes, 2005). On the basis of interpretations of infant imitation research, “innate observation-execution links” have been suggested to enable newborns to imitate observed orofacial actions (Meltzoff, 2002, p. 23). Different developmental trajectories have also been postulated for the manual and orofacial MNS. Specifically, Casile, Caggiano, and Ferrari (2011) have suggested that the orofacial MNS is “prewired and already present at birth” (p. 532), whereas the manual MNS is acquired after birth through learning. In contrast, ASL suggests that the observation-execution links underlying both manual and orofacial actions do not depend solely on the visual guidance of self-generated movements. Rather, imitative sensorimotor experience mostly originates from sociocultural sources during development (e.g., being imitated by others, or through a common stimulus; Ray & Heyes, 2011).

In this study, we aimed to determine the role of sensorimotor learning in establishing the observation-execution links underlying perceptually opaque orofacial gestures, focusing on the automatic imitation of visual speech. Speech actions are communicative orofacial gestures seen in face-to-face conversations, but not by the talkers themselves. Watching and/or hearing other people speak activates articulatory motor regions (Skipper, Devlin, & Lametti, 2017), suggesting a close perception–production link. We aimed to elucidate whether the flexibility of the observation-execution links underlying manual gestures extends to speech perception–production links.

Studies using speech SRC tasks have demonstrated that perceiving compatible articulations produced by a speaker facilitates participants’ responses, relative to perceiving incompatible articulations (Adank, Nuttall, Bekkering, & Maegherman, 2018; Kerzel & Bekkering, 2000). Here we adopted the speech SRC task to establish participants’ initial automatic imitation, before assigning them to either a countermirror (say /ba/ when the speaker says /da/, and vice versa) or a mirror (say /ba/ when the speaker says /ba/, and likewise for /da/) training group. Automatic imitation was measured again using the same task 24 h after training. We predicted that if sensorimotor experience is critical in establishing speech perception–production links, automatic imitation would be impaired following countermirror training, but not following mirror training; however, if sensorimotor experience is not critical, we predicted no difference in automatic imitation between the two groups after training. Additionally, because automatic imitation of speech has been shown to vary when prompts are presented at different time points relative to distractor onsets (i.e., stimulus onset asynchronies [SOAs]; Adank et al., 2018), we included different SOAs in the SRC task in order to examine whether training effects would interact with automatic imitation at different SOAs.

Method

Participants

An a priori power analysis was conducted using G*Power 3.1 (Faul, Erdfelder, Buchner, & Lang, 2009) with an effect size of ηp² = .108, obtained from a pilot study. Sixty-two participants were needed in order to detect significant sensorimotor training effects on automatic imitation with a power of .80 and an alpha of .05. Sixty-eight participants were recruited, but one was excluded for not being a native British English speaker, one for having dyslexia, one for not attending the posttraining session, and three for performing at chance level during training. The final analysis included 31 participants in the mirror group (23 female, 8 male; Mage = 21.71 years, SDage = 4.89, rangeage = 17–34) and 31 in the countermirror group (19 female, 12 male; Mage = 21.61 years, SDage = 3.25, rangeage = 18–30). All were native British English speakers with self-reported normal or corrected-to-normal vision, normal hearing, and no speech, language, or neurological disorders. Participants received £20 or course credit. The University Research Ethics Committee approved the procedures, and all participants gave written informed consent.
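
For reference, G*Power’s F-test routines take Cohen’s f rather than ηp²; under the standard conversion (our addition, not reported in the original),

$$ f = \sqrt{\frac{\eta_p^2}{1 - \eta_p^2}} = \sqrt{\frac{.108}{.892}} \approx .35, $$

a medium-to-large effect by Cohen’s conventions.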

Stimuli and procedure

The stimuli (Fig. 1) included silent videos of a speaker saying /ba/ and /da/ and the syllable prompts ba and da. The videos (25 fps) were filmed with a Canon Legria HF G30 video camera, edited in iMovie, and scaled down to a resolution of 1,280 × 720 in AVI format. A female native British English speaker was shown in color videos from her neckline upward. Both videos started and ended with the speaker’s mouth closed in a resting configuration. At 552 ms, the speaker’s mouth began to move in the da video, and the lips began moving toward each other in the ba video. The consonant bursts in both videos occurred around 736 ms, and the vowel articulation commenced around 920 ms. The speaker was still articulating the vowel at 1,104 ms, and her facial expression returned to its resting position at 1,400 ms. The prompts ba and da (300-dpi JPEG images) were printed in white boldface Arial font on a black background and were positioned so that they extended from the speaker’s bottom lip to her top lip. The prompt was presented at one of four SOAs (552, 736, 920, or 1,104 ms) in each trial. The experiment was run using Presentation (version 18.0, Neurobehavioral Systems).

Fig. 1

Schematic timeline of a compatible trial presented in the testing sessions, with the speaker saying /ba/ in the distractor video and the prompt ba appearing in front of it. The video lasted 1,840 ms, with the prompt presented at 736 ms. In the actual experiment, participants’ viewing was unrestrained at a distance of 60 cm from the screen, the speaker’s face was shown at 14.34° × 11.14° of visual angle, and the prompt measured 0.38° × 0.38° (see the supplementary material for detailed measures of the speaker’s mouth configurations at different time points). The prompt is enlarged here for clarity

The experiment included two testing sessions (pre- and posttraining) and one training session and took place in a soundproofed, light-controlled booth. Written instructions were presented on the PC monitor. In the pretraining session, participants were instructed to say the syllable (ba or da) aloud as soon as they saw the prompt and to ignore the speaker’s articulation (ba or da) in the distractor video. In the compatible condition, the speaker’s articulation matched the prompted response; in the incompatible condition, the speaker’s articulation differed from the prompted response. Each trial started with a 200-ms, 500-Hz tone at 70 dB SPL, played through Sennheiser HD25-SP II headphones. The screen then remained black for one of three jittered durations (1,500, 1,750, or 2,000 ms), included to reduce the predictability of trial onset. The prompt was presented at one of the four SOAs for 200 ms, and the screen went black at the end of the video. The pretraining session comprised six blocks of 40 trials each (240 trials in total), in which 48 trial types (2 prompts × 2 distractors × 4 SOAs × 3 jitters) were each repeated five times in a randomized order. Ten practice trials were given before the first block, and the session lasted about 20 min. The posttraining session was identical, except that it was conducted the day after the pretraining session and participants completed five practice trials.

Training took place immediately after the pretraining session, and participants were randomly assigned to a training group. Participants in the countermirror group said /ba/ as soon as they saw the speaker mouth /da/, and said /da/ when she mouthed /ba/; participants in the mirror group repeated each syllable as the speaker mouthed it. The same jittered intertrial intervals were used. There were 12 blocks of 80 trials each (960 trials in total), in which the six trial types (2 videos × 3 jitters) were presented in a randomized order. After the first six blocks, a short animated film with sound was played before participants continued with the second half of the session. Five practice trials were given, and the training session lasted about 90 min.
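
To make the factorial structure of the testing and training sessions concrete, the Python sketch below generates equivalent randomized trial lists from the counts reported above. It is a minimal reconstruction under our own naming; the experiment itself was scripted in Presentation.

```python
import itertools
import random

def make_trials(levels, n_reps, block_size, seed=None):
    """Fully cross the factor levels, repeat the set n_reps times,
    shuffle, and chunk the result into equally sized blocks."""
    rng = random.Random(seed)
    trial_types = [dict(zip(levels, combo))
                   for combo in itertools.product(*levels.values())]
    trials = [dict(t) for t in trial_types * n_reps]
    rng.shuffle(trials)
    return [trials[i:i + block_size] for i in range(0, len(trials), block_size)]

# Testing sessions: 2 prompts x 2 distractors x 4 SOAs x 3 jitters = 48 types,
# each repeated 5 times = 240 trials, in 6 blocks of 40.
test_blocks = make_trials(
    {"prompt": ["ba", "da"], "distractor": ["ba", "da"],
     "soa_ms": [552, 736, 920, 1104], "jitter_ms": [1500, 1750, 2000]},
    n_reps=5, block_size=40, seed=1)

# Training session: 2 videos x 3 jitters = 6 types, each repeated 160 times
# = 960 trials, in 12 blocks of 80.
train_blocks = make_trials(
    {"video": ["ba", "da"], "jitter_ms": [1500, 1750, 2000]},
    n_reps=160, block_size=80, seed=2)
```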

Data recording and analysis

Responses were recorded via a voice key using a RØDE NT1-A condenser microphone and a Focusrite Scarlett 2i4 USB audio interface preamplifier plugged into the sound card input of a Dell PC, at 44.1 kHz with 16-bit resolution. Audio recording started at video onset and lasted 3,000 ms. The voice key was triggered when the system detected an audio input at .2 of Presentation’s total range. RTs were measured relative to the prompt onset. For missed trials, a warning stating “No response given” was presented for 500 ms; a warning stating “Response too early” appeared for RTs < 200 ms. The responses were checked manually using Praat (Boersma & Weenink, 2018). Errors included incorrect responses and missed trials. For the testing sessions, trials with RTs < 100 or > 1,200 ms were also defined as errors, because they were likely to be anticipatory or neglected responses (Kerzel & Bekkering, 2000). For the training sessions, outliers were trials with RTs more than three standard deviations from the mean. A natural log-transformation was applied to the RTs for the statistical analyses, but the figures present untransformed RTs.
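
A minimal sketch of these rejection and transformation steps for the testing sessions, assuming a tidy trial table with illustrative column names of our own ('rt' in ms relative to prompt onset, boolean 'correct' and 'responded'); this is a reconstruction, not the authors’ analysis code:

```python
import numpy as np
import pandas as pd

def clean_test_rts(df: pd.DataFrame) -> pd.DataFrame:
    """Flag error trials and log-transform the RTs that remain."""
    df = df.copy()
    # Errors: incorrect responses, missed trials, and likely anticipatory or
    # neglected responses (RT < 100 ms or > 1,200 ms).
    df["error"] = (~df["correct"]) | (~df["responded"]) \
        | (df["rt"] < 100) | (df["rt"] > 1200)
    valid = df.loc[~df["error"]].copy()
    valid["log_rt"] = np.log(valid["rt"])  # natural log-transform
    return valid
```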

Error rates and RTs from the testing sessions were subjected to repeated measures analyses of variance (ANOVAs) with test (pre- vs. posttraining), compatibility (compatible vs. incompatible), and SOA (552, 736, 920, or 1,104 ms) as within-subjects variables, and training (mirror vs. countermirror) as a between-subjects variable. Error rates and RTs from the training sessions were analyzed in separate ANOVAs, with block as a within-subjects variable and training as a between-subjects variable. The significance level was set to p < .05. Greenhouse–Geisser corrections for nonsphericity and Bonferroni corrections for multiple comparisons were applied whenever appropriate.
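
The follow-up contrasts reported below operate on per-subject automatic-imitation effects. A sketch of that computation and of one paired contrast, continuing the assumed table from the previous sketch (illustrative columns 'subject', 'session' taking 'pre'/'post', 'soa_ms', and 'compatibility' taking 'compatible'/'incompatible'):

```python
import pandas as pd
from scipy import stats

def automatic_imitation(valid: pd.DataFrame) -> pd.DataFrame:
    """Per-subject automatic imitation (incompatible - compatible log RT)
    for each session and SOA; rows (subject, soa_ms), columns pre/post."""
    cell = (valid.groupby(["subject", "session", "soa_ms", "compatibility"])
                 ["log_rt"].mean()
                 .unstack("compatibility"))
    ai = cell["incompatible"] - cell["compatible"]
    return ai.unstack("session")

def training_change(ai: pd.DataFrame, soa_ms: int):
    """Paired t test of post- vs. pretraining automatic imitation at one
    SOA; call separately on each training group's subjects."""
    sub = ai.xs(soa_ms, level="soa_ms")
    return stats.ttest_rel(sub["post"], sub["pre"])
```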

Results

On average, participants made errors on 7% of trials in the testing sessions. RT analyses for the testing sessions are reported here; the other analyses are included in the supplementary materials. After errors were excluded, RT analyses (Fig. 2, Table 1) revealed a compatibility main effect, with faster log-transformed RTs for compatible trials (M = 6.226, SE = 0.015) than for incompatible trials (M = 6.303, SE = 0.014). We also found a test main effect, with slower log-transformed RTs in the pretraining session (M = 6.302, SE = 0.019) than in the posttraining session (M = 6.227, SE = 0.013). Follow-up t tests for the main effect of SOA revealed faster RTs at longer SOAs (all ps < .001). For the interaction between test and SOA, follow-up t tests revealed a greater RT reduction after training at the three longer SOAs than at the first SOA (all ps < .001). Follow-up t tests for the interaction between compatibility and SOA revealed larger compatibility effects at the two longer SOAs than at the two shorter SOAs (all ps < .003).

Fig. 2

a Mean response times ± standard errors in each experimental condition. The four panels represent the pretraining (top left) and posttraining (top right) sessions in the mirror group and the pretraining (bottom left) and posttraining (bottom right) sessions in the countermirror group. b Automatic-imitation effects (i.e., incompatible − compatible) ± standard errors for the pretraining (gray) and posttraining (black) sessions at each stimulus onset asynchrony in each group. *Significant changes in automatic-imitation effects after training (p < .05)

Table 1 Four-way ANOVA summary for log-transformed response times, as a function of training, test, compatibility, and stimulus onset asynchrony (SOA)

A significant three-way interaction emerged between training, test, and compatibility; follow-up t tests revealed that compatibility effects increased after mirror training (p = .002) but did not change after countermirror training (p = .177). This three-way interaction was further modulated by SOA, as indicated by the significant four-way interaction between training, test, compatibility, and SOA. Follow-up t tests of the four-way interaction (Fig. 2, Table 2) revealed that automatic imitation increased by 19 ms (from 42 ms) at each of the two longer SOAs after mirror training [920 ms, t(30) = 3.25; 1,104 ms, t(30) = 3.03; all ps < .006], and decreased by 16 ms (from 49 ms) at the longest SOA after countermirror training [1,104 ms, t(30) = 2.07, p = .047].

Table 2 Mean log-transformed response times, back-transformed response times in milliseconds, standard errors (SEs), and 95% confidence intervals (CIs) for each experimental condition in the testing sessions

Discussion

In this study, we investigated sensorimotor training effects on the automatic imitation of visual speech. We found that automatic imitation increased after mirror training and decreased after countermirror training. Moreover, mirror training had stronger effects than countermirror training. Our findings are largely consistent with the ASL hypothesis that the observation-execution links underlying orofacial movements such as speech can be modulated through sensorimotor learning, therefore suggesting similar developmental trajectories for perceptually opaque and transparent actions.

ASL proposes that sensorimotor experience of observing and executing the same action establishes and strengthens excitatory matching links between sensory and motor representations of that action (Press, Gillmeister, & Heyes, 2007). Accordingly, mirror training in this study strengthened excitatory matching links, which consequently enhanced observation-induced motor activation, leading to more facilitation in the compatible than in the incompatible condition (i.e., increased automatic imitation). ASL also proposes that sensorimotor experience of observing and executing different actions leads to excitatory nonmatching links between the sensory and motor representations of different actions, and also establishes inhibitory matching links between the sensory and motor representations of the same actions (Heyes et al., 2005). Accordingly, countermirror training in this study established inhibitory matching links that consequently reduced observation-induced motor activation, leading to less facilitation in the compatible than in the incompatible condition (i.e., decreased automatic imitation).
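
To make these proposed dynamics concrete, the toy sketch below (our illustration, not a model taken from the ASL literature or from this article) uses a simple delta rule to show how mirror pairings push the matching sensory-to-motor weight up, whereas countermirror pairings push it down while strengthening the nonmatching link:

```python
def train_links(w_match, w_nonmatch, mirror, n_trials=960, lr=0.01):
    """Toy delta-rule updates of the links from seeing /ba/ to producing
    /ba/ (matching) and /da/ (nonmatching). Mirror trials pair the seen
    action with the same response; countermirror trials with the other."""
    for _ in range(n_trials):
        t_match, t_nonmatch = (1.0, 0.0) if mirror else (0.0, 1.0)
        w_match += lr * (t_match - w_match)          # move toward target
        w_nonmatch += lr * (t_nonmatch - w_nonmatch)
    return w_match, w_nonmatch

# Starting from moderately strong matching links (prior imitative experience):
print(train_links(0.6, 0.1, mirror=True))    # matching link driven toward 1
print(train_links(0.6, 0.1, mirror=False))   # matching link driven toward 0
```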

Training effects were found only at the longer SOAs, where automatic imitation was greater. Speech actions consist of sequences of movements, and SOAs have been included in speech SRC tasks in order to reveal the time course of automatic imitation as it is influenced by the different movement components of perceived speech. Potentially, the greater automatic imitation at longer SOAs was elicited by perceptually more salient components of the perceived actions, to which participants also paid more attention during training. Consequently, the automatic imitation elicited by these components was more susceptible to training. Future research could manipulate participants’ attention to different action components during training and thus examine whether such manipulations influence training effects.

Importantly, automatic imitation of speech seems more resilient to countermirror training than is automatic imitation of manual actions. In Heyes et al. (2005), automatic imitation of manual movements was eliminated following countermirror training. However, the automatic imitation of speech actions in our study was only reduced, not eliminated, after countermirror training (960 trials in total) that was considerably longer than the training implemented by Heyes et al. (432 trials). Following ASL, this result may be accounted for by sociocultural imitative experience; it is possible that sensorimotor experience of observing and executing the same orofacial movements is gained mostly through social interactions. In monkeys, mouth mirror neurons have been found to be connected to brain regions involved in emotion/reward processing, which plays a role in social activities (Ferrari, Gerbella, Coudé, & Rozzi, 2017). Comparable mirror activation has also been found in homologous regions in humans during the perception and production of emotional facial expressions (Carr, Iacoboni, Dubeau, Mazziotta, & Lenzi, 2003). Hence, whereas both manual and orofacial observation-execution links are likely to result from sensorimotor learning, the extent of social influence may differ between the two, with the latter requiring more social engagement. Future studies could investigate how social manipulations modulate sensorimotor training effects on the automatic imitation of manual and orofacial movements.

Mirror training was more effective than countermirror training in our study, a result opposite to those previously reported (Cracco et al., 2018). The ASL hypothesis proposes that observation-execution links can also be built through a common stimulus (e.g., hearing people say /ba/) that co-occurs both with the experience of performing a movement (e.g., saying /ba/) and with the experience of seeing others perform the same movement (e.g., seeing people say /ba/; Heyes, 2005). Such sensorimotor experience may create indirect observation-execution links, distinct from the direct links learned through the training provided in our study. Thus, direct and indirect links might exist that are acquired through different experiences, and the indirect links could initially have been stronger for speech actions, which are inherently multimodal. Our finding was therefore likely due to initially weak direct links that were more susceptible to mirror than to countermirror training in modulating the automatic imitation of speech.

Our results do not allow us to completely exclude the possibility of an innate mechanism favoring imitative, rather than counter- or nonimitative, sensorimotor associations underlying the imitation of orofacial movements. Heyes (2011) also acknowledged that results from training studies in principle “do not exclude a role for genetic prespecification in establishing the long-term sensorimotor connections that generate automatic imitation” (p. 478). Nevertheless, though the present study does not conclusively support ASL, our results are in line with the core ASL hypothesis that it is sensorimotor experience, not sensory or motor experience alone, that configures observation-execution links (Heyes, 2010), since the only difference between the two groups in our study was the relationship between the observed and executed movements on each trial during training. Hence, our findings suggest that the ASL mechanism can also be applied to communicative orofacial movements that infants learn to perceive and produce in the first few years of life. The precise mechanisms responsible for forging the sensorimotor associations underlying speech actions could be further explored by providing extended countermirror training: if extended training leads to a reduction or reversal of automatic imitation, this would support the notion that learning of these associations is not necessarily constrained by innate factors.

The simulation theory of speech perception proposes that observation-induced motor activation facilitates prediction of the incoming signal, supporting speech comprehension (Pickering & Garrod, 2013). Critically, greater motor involvement is suggested when observers have more experience with the perceived speech. Applying transcranial magnetic stimulation (TMS) to the lip motor cortex (lip M1), Swaminathan et al. (2013) found facilitated lip M1 excitability while participants viewed sentences spoken in a known, as compared to an unknown, language. Following ASL, Swaminathan et al. suggested that the difference between the two conditions was due to the different strengths of the perception–production links underlying known and unknown languages, hence supporting the simulation hypothesis that more experience leads to greater observation-induced motor activation. Our results further suggest that it was imitative sensorimotor learning that facilitated the observation-induced motor activation. Additionally, overt imitation of accented speech improves subsequent speech perception, indicating that imitative learning leads to enhanced observation-induced motor activation that facilitates speech comprehension (Adank, Hagoort, & Bekkering, 2010). Schmitz et al. (2018) stimulated lip M1 with TMS and found that listening to nonnative vowels elicited higher articulatory excitability than did native-like vowels, opposite to what Swaminathan et al. found with sentence articulations presented visually. Future research that controls linguistic levels and stimulus modalities will be required in order to investigate this inconsistency. Moreover, follow-up research could extend our findings by examining whether sensorimotor training modulates the audio–motor links underlying speech. Behavioral research could also examine automatic imitation of nonnative speech and investigate the role of sensorimotor learning in establishing the perception–production links underlying second-language processing.

In conclusion, the present study showed that sensorimotor training modulated the automatic imitation of visual speech. As such, our results speak to questions concerning the flexibility of imitative mechanisms and add to the growing body of evidence on perception–production links in speech processing.