The close interaction of speech perception and production is undeniable. Perception of one’s own speech influences speech production (e.g., Bohland et al., 2010; Guenther, 1994). For example, altering the acoustics of a talker’s speech and feeding it back with minimal delay results in rapid compensatory adjustments to production that are predictable, replicable, and well accounted for by neurobiologically plausible models of speech production (e.g., Guenther, 2016; Houde & Jordan, 1998).

Similarly, perception of another talker’s speech can influence production. Talkers imitate sublexical aspects of perceived speech in speech shadowing tasks (Fowler et al., 2003; Goldinger, 1998; Shockley et al., 2004) and phonetically converge to become more similar to a conversation partner (Pardo et al., 2017). However, results are variable and hard to predict. Shadowers imitate lengthened voice onset times (VOTs), but not shortened VOTs (Lindsay et al., 2022; Nielsen, 2011; but see also Schertz & Paquette-Smith, 2023). Phonetic convergence occurs only for some utterances or some acoustic dimensions but not others (Pardo et al., 2013). Talkers may converge along some dimensions but diverge along others (Bourhis & Giles, 1977; Earnshaw, 2021; Heath, 2015), making it difficult to predict which articulatory-phonetic dimensions will be influenced (Ostrand & Chodroff, 2021). Phonetic convergence also varies with talker sex (Pardo et al., 2017), with some studies reporting greater convergence among female participants (Namy et al., 2002), others among male participants (Pardo, 2006; Pardo et al., 2010), and still others more complicated male–female patterns of convergence (Miller et al., 2010; Pardo et al., 2017). In sum, the direction and magnitude of changes in speech production driven by perceived speech depend on multiple contributors (Babel, 2010; Pardo, 2006), likely including social and contextual factors (Bourhis & Giles, 1977; Giles et al., 1991; Pardo, 2006). This has made it challenging to characterize production–perception interactions fully.

Some have argued that a better understanding of the cognitive mechanisms linking speech perception and production will meet this challenge (Babel, 2012; Pardo et al., 2022). Here, we propose an approach that is novel in two ways: (1) Statistical learning. Instead of investigating phonetic convergence at the level of individual words, we manipulate the statistical relationship between two acoustic dimensions, fundamental frequency (F0) and voice onset time (VOT), and study the effect of perceptual statistical learning across these dimensions on listeners’ own speech. (2) Subtlety and implicitness. The acoustic manipulation of the statistical regularities of the speech input is barely perceptible and devoid of socially discriminating information, since it is carried by the same voice. This allows us to investigate basic perception–production transfer without the influence of additional (important, but potentially complicating) sociolinguistic factors.

Our approach builds on the well-studied role of statistical learning in speech perception. Dimension-based statistical learning tracks how the effectiveness of acoustic speech dimensions in signaling phonetic categories varies as a function of short-term statistical regularities in speech input (Idemaru & Holt, 2011, 2014, 2020; Idemaru & Vaughn, 2020; Lehet & Holt, 2017; Liu & Holt, 2015; Schertz et al., 2015; Schertz & Clare, 2020; Zhang & Holt, 2018; Zhang et al., 2021). This simple paradigm parametrically manipulates acoustic dimensions, for example VOT and F0, across a two-dimensional acoustic space to create speech stimuli varying across a minimal pair (beer–pier). The paradigm selectively samples stimuli to manipulate short-term speech regularities, mimicking common communication challenges like encountering a talker with an accent that deviates from local norms. Across Exposure stimuli (Fig. 1A–B, red), the short-term input statistics either match the typical F0 × VOT correlation in English (canonical condition, e.g., higher F0s and longer VOTs for pier) or introduce a subtle and barely detectable “accent” with a short-term F0 × VOT correlation opposite to that typically experienced in English (reverse condition, e.g., lower F0s with longer VOTs for pier).
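To make the distributional manipulation concrete, the sketch below illustrates how canonical and reverse exposure distributions could be sampled from a VOT × F0 grid. The specific VOT and F0 values, the 15-ms split, and the variable names are illustrative assumptions only, not the actual stimulus parameters (which are depicted in Fig. 1A–B).

```r
# Illustrative sketch of the exposure-distribution manipulation.
# All values below are placeholders, not the actual stimulus parameters.
vot_ms  <- c(0, 5, 10, 15, 35, 40, 45, 50)   # short VOTs pattern with "beer", long VOTs with "pier"
f0_low  <- c(200, 210)                       # low-F0 levels (Hz)
f0_high <- c(270, 280)                       # high-F0 levels (Hz)

# Canonical condition: longer VOTs co-occur with higher F0s, as in typical English.
canonical <- rbind(
  expand.grid(vot = vot_ms[vot_ms <= 15], f0 = f0_low),
  expand.grid(vot = vot_ms[vot_ms  > 15], f0 = f0_high)
)

# Reverse condition: the short-term correlation is flipped
# (longer VOTs now co-occur with lower F0s), creating the subtle "accent".
reverse <- rbind(
  expand.grid(vot = vot_ms[vot_ms <= 15], f0 = f0_high),
  expand.grid(vot = vot_ms[vot_ms  > 15], f0 = f0_low)
)
```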

Fig. 1 Stimulus and trial structure. A Canonical distribution. B Reverse distribution. The test stimuli (blue) have ambiguous VOT and are identical across the canonical and reverse conditions. C Trial structure. Exposure phase: participants listened passively to eight exposure stimuli, each paired with a visual stimulus. Perceptual categorization phase: after 600 ms, participants heard one of two test stimuli with low or high F0 and categorized it as beer or pier. Repetition phase: participants heard the same test stimulus again and repeated it aloud. (Color figure online)

Test stimuli are constant across conditions (Fig. 1A–B, blue). They have a neutral, perceptually ambiguous VOT, thereby removing this dominant acoustic dimension from adjudicating category identity, but F0 varies across test stimuli. Therefore, the proportion of test stimuli categorized as beer versus pier provides a metric of the extent to which F0 is perceptually weighted in categorization as a function of experienced short-term speech input regularities (Wu & Holt, 2022).
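As an illustration of how this metric can be computed, the sketch below summarizes the proportion of pier responses at each test stimulus F0 within each exposure condition. The data-frame and column names (percept_data, condition, testF0, response) are assumptions for illustration, not the actual variable names in the analysis code.

```r
library(dplyr)

# Proportion of "pier" responses by exposure condition and test-stimulus F0.
# A larger high-F0 vs. low-F0 difference indicates a greater perceptual weight
# on F0; down-weighting in the reverse condition shrinks that difference.
f0_weight <- percept_data %>%
  group_by(condition, testF0) %>%
  summarise(prop_pier = mean(response == "pier"), .groups = "drop")
```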

Although the manipulation of short-term input statistics is subtle and goes unnoticed by listeners, the exposure regularity rapidly shifts the perceptual weight of F0 in beer–pier test stimulus categorization (Idemaru & Holt, 2011): listeners down-weight their reliance on F0 upon introduction of the accent. This effect is fast and robust against well-documented individual differences in how listeners perceptually weight different acoustic dimensions (Kong & Edwards, 2011, 2016; Schertz et al., 2015, 2016). In all, this well-replicated finding (1) demonstrates reliable changes in the perceptual system as a function of brief exposure to subtle changes in the statistical properties of the acoustic input and (2) establishes this statistical learning paradigm as an ideal tool for examining the impact of these changes on speech production.

In the current study, we used dimension-based statistical learning to investigate whether adjustments to the perceptual space influence speech production in systematic ways. Following Hodson et al. (2013), we strove to include maximal random-effects structures in the models. Most models, however, did not support the maximal random-effects structure. For consistency, we report models with random intercepts for both subjects and items, a structure supported by all models. The former captures variability among subjects; the latter, variability among exposure sequences, which changed from trial to trial. To ensure that excluding random slopes did not radically alter any of the main conclusions, we also report in Appendix 2 the output of the models with the largest random-effects structure that each model supported.
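In lme4 formula notation, the contrast between the attempted and reported random-effects structures is roughly as follows. The slope terms shown are schematic (the structures actually supported by each model are reported in Appendix 2), and the variable names are placeholders.

```r
# Maximal random-effects structure attempted (schematic slope terms shown):
maximal_formula <- dv ~ condition * testF0 * sex +
  (1 + condition | subject) + (1 + condition | item)

# Intercepts-only structure supported by all models and reported in the text:
intercepts_formula <- dv ~ condition * testF0 * sex +
  (1 | subject) + (1 | item)
```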

For the perceptual categorization data, a mixed-effects logistic regression model included the binary response (beer, pier) as the dependent variable. The model included condition (canonical, reverse), test stimulus F0 (low F0, high F0), and participant sex (male, female), and their two- and three-way interactions, as fixed effects, with by-subject and by-item random intercepts. For the speech production data, the dependent measure was continuous, z-score-normalized F0, allowing for a standard (non-logit) linear mixed-effects model. Here, too, fixed effects of condition, test stimulus F0, and sex and their interactions were modeled, with by-subject and by-item random intercepts. Categorical predictors were center coded (1 vs. −1). P values were based on Satterthwaite approximations using the lmerTest package (Version 3.1-3; Kuznetsova et al., 2017). Analyses collapsed data from the three canonical blocks and, separately, from the three reverse blocks.
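A minimal sketch of the categorization model in lme4/lmerTest syntax follows, assuming hypothetical data-frame and column names (percept_data, response, condition, testF0, sex, subject, item); the actual analysis code is available in the OSF repository linked below.

```r
library(lme4)      # glmer() for the logistic mixed-effects model
library(lmerTest)  # lmer() with Satterthwaite p values, used for the production models

# Center code the categorical predictors as -1/1 (coding scheme per the text;
# the level-to-sign mapping here is an assumption).
percept_data$condition_c <- ifelse(percept_data$condition == "reverse", 1, -1)
percept_data$testF0_c    <- ifelse(percept_data$testF0 == "high", 1, -1)
percept_data$sex_c       <- ifelse(percept_data$sex == "female", 1, -1)

# Mixed-effects logistic regression: binary beer/pier response predicted by
# condition, test-stimulus F0, participant sex, and their two- and three-way
# interactions, with by-subject and by-item random intercepts.
percept_data$pier <- as.integer(percept_data$response == "pier")
percept_model <- glmer(
  pier ~ condition_c * testF0_c * sex_c + (1 | subject) + (1 | item),
  data = percept_data, family = binomial
)
summary(percept_model)
```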

We conducted the production analyses in two steps. (1) Our first analysis used test stimulus F0 to predict production F0. This analysis is parallel to the perceptual analysis and captures the whole process, including the change to perception as well as changes to production. (2) Our second analysis used perceptual responses as the main predictor of production F0. This analysis partials out the contribution of perceptual changes as a function of exposure to the canonical and reverse distributions, allowing us to isolate the production component of transfer. The data, analysis code, and full tables of the results are available at https://osf.io/cwg4d/.
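Under the same assumptions, the two-step production analysis can be sketched as follows, with f0_z standing in for the z-score-normalized produced F0, production_data for the production data frame, and percept_response_c for the participant’s own center-coded beer/pier categorization; the exact fixed-effects specification of the second model is given in the full analysis code at https://osf.io/cwg4d/.

```r
# Step 1: test-stimulus F0 predicts produced F0 (parallel to the perceptual
# analysis; captures perception-to-production transfer as a whole).
prod_model_1 <- lmer(
  f0_z ~ condition_c * testF0_c * sex_c + (1 | subject) + (1 | item),
  data = production_data
)

# Step 2: the participant's own perceptual categorization replaces test-stimulus
# F0 as the main predictor, partialling out perceptual changes to isolate the
# production component of transfer.
prod_model_2 <- lmer(
  f0_z ~ condition_c * percept_response_c * sex_c + (1 | subject) + (1 | item),
  data = production_data
)

summary(prod_model_1)  # Satterthwaite-approximated p values via lmerTest
summary(prod_model_2)
```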