Log in

Contribution of speech rhythm to understanding speech in noisy conditions: Further test of a selective entrainment hypothesis

  • Published:
Attention, Perception, & Psychophysics Aims and scope Submit manuscript

Abstract

Previous work by McAuley et al. Attention, Perception, & Psychophysics, 82, 3222–3233, (2020), Attention, Perception & Psychophysics, 83, 2229–2240, (2021) showed that disruption of the natural rhythm of target (attended) speech worsens speech recognition in the presence of competing background speech or noise (a target-rhythm effect), while disruption of background speech rhythm improves target recognition (a background-rhythm effect). While these results were interpreted as support for the role of rhythmic regularities in facilitating target-speech recognition amidst competing backgrounds (in line with a selective entrainment hypothesis), questions remain about the factors that contribute to the target-rhythm effect. Experiment 1 ruled out the possibility that the target-rhythm effect relies on a decrease in intelligibility of the rhythm-altered keywords. Sentences from the Coordinate Response Measure (CRM) paradigm were presented with a background of speech-shaped noise, and the rhythm of the initial portion of these target sentences (the target rhythmic context) was altered while critically leaving the target Color and Number keywords intact. Results showed a target-rhythm effect, evidenced by poorer keyword recognition when the target rhythmic context was altered, despite the absence of rhythmic manipulation of the keywords. Experiment 2 examined the influence of the relative onset asynchrony between target and background keywords. Results showed a significant target-rhythm effect that was independent of the effect of target-background keyword onset asynchrony. Experiment 3 provided additional support for the selective entrainment hypothesis by replicating the target-rhythm effect with a set of speech materials that were less rhythmically constrained than the CRM sentences.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
EUR 32.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or Ebook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6

Similar content being viewed by others

Data availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Code availability

Not applicable.

References

  • Allen, K., Carlile, S., & Alais, D. (2008). Contributions of talker characteristics and spatial location to auditory streaming. The Journal of the Acoustical Society of America, 123(3), 1562–1570.

    Article  PubMed  Google Scholar 

  • Assmann, P. F., & Summerfield, Q. (1989). Modeling the perception of concurrent vowels: Vowels with the same fundamental frequency. The Journal of the Acoustical Society of America, 85(1), 327–338.

    Article  PubMed  Google Scholar 

  • Assmann, P. F., & Summerfield, Q. (1990). Modeling the perception of concurrent vowels: Vowels with different fundamental frequencies. The Journal of the Acoustical Society of America, 88(2), 680–697.

    Article  PubMed  Google Scholar 

  • Aubanel, V., Davis, C., & Kim, J. (2016). Exploring the role of brain oscillations in speech perception in noise: intelligibility of isochronously retimed speech. Frontiers in Human Neuroscience, 10, 430.

    Article  PubMed  PubMed Central  Google Scholar 

  • Auditech. (2015). Multitalker Noise—20 Talkers (Frank Version) [Audio recording]. https://auditec.com/2015/08/04/multitalker-noise-20-talkers-frank-version/

  • Baese-Berk, M. M., Dilley, L. C., Henry, M. J., Vinke, L., & Banzina, E. (2019). Not just a function of function words: Distal speech rate influences perception of prosodically weak syllables. Attention, Perception, & Psychophysics, 81(2), 571–589.

    Article  Google Scholar 

  • Barnes, R., & Jones, M. R. (2000). Expectancy, attention, and time. Cognitive Psychology, 41, 254–311.

    Article  PubMed  Google Scholar 

  • Bolia, R. S., Nelson, W. T., Ericson, M. A., & Simpson, B. D. (2000). A speech corpus for multitalker communications research. Journal of the Acoustical Society of America, 107, 1065–1066.

    Article  PubMed  Google Scholar 

  • Bregman, A. S. (1990). Auditory scene analysis. MIT Press.

    Book  Google Scholar 

  • Brokx, J. P. L., & Nooteboom, S. G. (1982). Intonation and the perceptual separation of simultaneous voices. Journal of Phonetics, 10(1), 23–36.

    Article  Google Scholar 

  • Darwin, C. J. (1981). Perceptual grou** of speech components differing in fundamental frequency and onset-time. The Quarterly Journal of Experimental Psychology Section A, 33(2), 185–207.

    Article  Google Scholar 

  • Darwin, C. J., & Ciocca, V. (1992). Grou** in pitch perception: Effects of onset asynchrony and ear of presentation of a mistuned component. The Journal of the Acoustical Society of America, 91(6), 3381–3390.

    Article  PubMed  Google Scholar 

  • Dauer, R. M. (1983). Stress-timing and syllable-timing reanalyzed. Journal of Phonetics, 11, 51–62.

    Article  Google Scholar 

  • Desjardins, J. L., & Doherty, K. A. (2013). Age-related changes in listening effort for various types of masker noises. Ear and Hearing, 34(3), 261–272.

    Article  PubMed  Google Scholar 

  • Dilley, L. C., & McAuley, J. D. (2008). Distal prosodic context affects word segmentation and lexical processing. Journal of Memory and Language, 59, 294–311.

    Article  Google Scholar 

  • Ding, N., Melloni, L., Zhang, H., Tian, X., & Poeppel, D. (2016). Cortical tracking of hierarchical linguistic structures in connected speech. Nature Neuroscience, 19, 158.

    Article  PubMed  Google Scholar 

  • Ding, N., & Simon, J. Z. (2012). Emergence of neural encoding of auditory objects while listening to competing speakers. Proceedings of the National Academy of Sciences, 109(29), 11854–11859.

    Article  Google Scholar 

  • Ding, N., & Simon, J. Z. (2014). Cortical entrainment to continuous speech: functional roles and interpretations. Frontiers in Human Neuroscience, 8, 311.

    Article  PubMed  PubMed Central  Google Scholar 

  • Friston, K. (2005). A theory of cortical responses. Philosophical transactions of the Royal Society B: Biological sciences, 360(1456), 815–836.

    Article  Google Scholar 

  • Friston, K. (2018). Does predictive coding have a future? Nature neuroscience, 21(8), 1019–1021.

    Article  PubMed  Google Scholar 

  • Ghitza, O. (2011). Linking speech perception and neurophysiology: Speech decoding guided by cascaded oscillators locked to the input rhythm. Frontiers in Psychology, 2, 130.

    Article  PubMed  PubMed Central  Google Scholar 

  • Giraud, A. L., & Poeppel, D. (2012). Cortical oscillations and speech processing: Emerging computational principles and operations. Nature Neuroscience, 15, 511.

    Article  PubMed  PubMed Central  Google Scholar 

  • Golumbic, E. M. Z., Ding, N., Bickel, S., Lakatos, P., Schevon, C. A., McKhann, G. M., Simon, J. Z., Poeppel, D., & Schroeder, C. (2013). Mechanisms underlying selective neuronal tracking of attended speech at a “cocktail party.” Neuron, 77, 980–991.

    Article  Google Scholar 

  • Goswami, U. (2019). Speech rhythm and language acquisition: An amplitude modulation phase hierarchy perspective. Annals of the New York Academy of Sciences, 1453(1), 67–8.

    Article  PubMed  Google Scholar 

  • Henry, M. J., & Herrmann, B. (2014). Low-frequency neural oscillations support dynamic attending in temporal context. Timing & Time Perception, 2(1), 62–86.

    Article  Google Scholar 

  • Humes, L. E., Kidd, G. R., & Fogerty, D. (2017). Exploring use of the coordinate response measure in a multitalker babble paradigm. Journal of Speech, Language, and Hearing Research, 60(3), 741–754.

    Article  PubMed  PubMed Central  Google Scholar 

  • Johnson, T. A., Cooper, S., Stamper, G. C., & Chertoff, M. (2017). Noise exposure questionnaire: A tool for quantifying annual noise exposure. Journal of the American Academy of Audiology, 28(1), 14–35.

    Article  PubMed  PubMed Central  Google Scholar 

  • Jones, M. R. (1976). Time, our lost dimension: Toward a new theory of perception, attention, and memory. Psychological Review, 83, 323–355.

    Article  PubMed  Google Scholar 

  • Jones, M. R., & Boltz, M. (1989). Dynamic attending and responses to time. Psychological Review, 96, 459–491.

    Article  PubMed  Google Scholar 

  • Jones, M. R., Moynihan, H., MacKenzie, N., & Puente, J. (2002). Temporal aspects of stimulus-driven attending in dynamic arrays. Psychological Science, 13, 313–319.

    Article  PubMed  Google Scholar 

  • Kollmeier, B., Warzybok, A., Hochmuth, S., Zokoll, M. A., Uslar, V., Brand, T., & Wagener, K. C. (2015). The multilingual matrix test: Principles, applications, and comparison across languages: A review. International Journal of Audiology, 54(Suppl. 2), 3–16.

    Article  PubMed  Google Scholar 

  • Large, E. W., & Jones, M. R. (1999). The dynamics of attending: How people track time-varying events. Psychological Review, 106, 119–159.

    Article  Google Scholar 

  • Lehiste, I. (1977). Isochrony reconsidered. Journal of phonetics, 5(3), 253–263.

    Article  Google Scholar 

  • McAuley, J. D., & Jones, M. R. (2003). Modeling effects of rhythmic context on perceived duration: A comparison of interval and entrainment approaches to short-interval timing. Journal of Experimental Psychology: Human Perception and Performance, 29, 1102–1125.

    PubMed  Google Scholar 

  • McAuley, J. D., Jones, M. R., Holub, S., Johnston, H. M., & Miller, N. S. (2006). The time of our lives: Life span development of timing and event tracking. Journal of Experimental Psychology: General, 135, 348–367.

    Article  PubMed  Google Scholar 

  • McAuley, J. D., Shen, Y., Dec, S., & Kidd, G. (2020). Altering the rhythm of target and background talkers differentially affects speech understanding: Support for a selective-entrainment hypothesis. Attention, Perception, & Psychophysics, 82, 3222–3233.

    Article  Google Scholar 

  • McAuley, J. D., Shen, Y., Smith, T., & Kidd, G. R. (2021). Effects of speech-rhythm disruption on selective listening with a single background talker. Attention, Perception & Psychophysics, 83(5), 2229–2240. https://doi.org/10.3758/s13414-021-02298-x

    Article  Google Scholar 

  • Miller, J. E., Carlson, L. A., & McAuley, J. D. (2013). When what you hear influences when you see: Listening to an auditory rhythm influences the temporal allocation of visual attention. Psychological Science, 24(1), 11–18.

    Article  PubMed  Google Scholar 

  • Milne, A. E., Bianco, R., Poole, K. C., Zhao, S., Oxenham, A. J., Billig, A. J., & Chait, M. (2021). An online headphone screening test based on dichotic pitch. Behavior Research Methods, 53(4), 1551–1562.

    Article  PubMed  Google Scholar 

  • Morrill, T. H., Dilley, L. C., McAuley, J. D., & Pitt, M. A. (2014). Distal rhythm influences whether or not listeners hear a word in continuous speech: Support for a perceptual grou** hypothesis. Cognition, 131, 69–74.

    Article  PubMed  Google Scholar 

  • Noble, W., Jensen, N. S., Naylor, G., Bhullar, N., & Akeroyd, M. A. (2013). A short form of the Speech, Spatial and Qualities of Hearing scale suitable for clinical use: The SSQ12. International Journal of Audiology, 52(6), 409–412.

    Article  PubMed  PubMed Central  Google Scholar 

  • Peng, Z. E., Waz, S., Buss, E., Shen, Y., Richards, V., Bharadwaj, H.,..., Venezia, J. H. (2022). Remote testing for psychological and physiological acoustics. The Journal of the Acoustical Society of America151(5), 3116–3128.

  • Rao, R. P., & Ballard, D. H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1), 79–87.

    Article  PubMed  Google Scholar 

  • Reips, U. D. (2002). Standards for Internet-based experimenting. Experimental Psychology, 49(4), 243.

    PubMed  Google Scholar 

  • Riecke, L., Formisano, E., Sorger, B., Baskent, D., & Gaudrain, E. (2018). Neural entrainment to speech modulates speech intelligibility. Current Biology, 28, 161–169.

    Article  PubMed  Google Scholar 

  • Rosen, S., Souza, P., Ekelund, C., & Majeed, A. A. (2013). Listening to speech in a background of other talkers: Effects of talker number and noise vocoding. The Journal of the Acoustical Society of America, 133(4), 2431–2443.

    Article  PubMed  PubMed Central  Google Scholar 

  • Schütt, H. H., Harmeling, S., Macke, J. H., & Wichmann, F. A. (2016). Painfree and accurate Bayesian estimation of psychometric functions for (potentially) overdispersed data. Vision Research, 122, 105–123.

    Article  PubMed  Google Scholar 

  • Shen, Y., & Richards, V. M. (2012). A maximum-likelihood procedure for estimating psychometric functions: Thresholds, slopes, and lapses of attention. The Journal of the Acoustical Society of America, 132(2), 957–967.

    Article  PubMed  PubMed Central  Google Scholar 

  • Tilsen, S., & Arvaniti, A. (2013). Speech rhythm analysis with decomposition of the amplitude envelope: characterizing rhythmic patterns within and across languages. The Journal of the Acoustical Society of America, 134(1), 628–639.

    Article  PubMed  Google Scholar 

  • Turgeon, M., Bregman, A. S., & Roberts, B. (2005). Rhythmic masking release: Effects of asynchrony, temporal overlap, harmonic relations, and source separation on cross-spectral grou**. Journal of Experimental Psychology: Human Perception and Performance, 31(5), 939.

    PubMed  Google Scholar 

  • Vuust, P., & Witek, M. A. (2014). Rhythmic complexity and predictive coding: A novel approach to modeling rhythm and meter perception in music. Frontiers in Psychology, 5, 1111.

    Article  PubMed  PubMed Central  Google Scholar 

  • Wang, M., Kong, L., Zhang, C., Wu, X., & Li, L. (2018). Speaking rhythmically improves speech recognition under “cocktail-party” conditions. The Journal of the Acoustical Society of America, 143, EL255–EL259.

    Article  PubMed  Google Scholar 

Download references

Acknowledgments

The authors thank Frank Dolecki and Kyle Oliver for their assistance with data collection and helpful insights, Dylan V. Pearson at Indiana University for assistance with stimulus generation, and members of the Timing, Attention and Perception Lab at Michigan State University for their helpful suggestions and comments at various stages of this project. NIH Grant R01DC013538 (PIs: Gary R. Kidd and J. Devin McAuley), NIH Grant R01017988 (PI: Yi Shen), and the University of Washington Mary Gates Undergraduate Research Scholarship (Awardee: Christina N. Williams; Mentor: Yi Shen) supported this research.

Funding

This work was supported by NIH Grant R01DC013538 (PIs: Gary R. Kidd and J. Devin McAuley), NIH Grant R01017988 (PI: Yi Shen), and the University of Washington Mary Gates Undergraduate Research Scholarship (Awardee: Christina N. Williams; Mentor: Yi Shen)

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Toni M. Smith.

Ethics declarations

Conflicts of interest

The authors have no relevant financial or nonfinancial interests to disclose

Ethics approval

This study was approved by the institutional review boards at Michigan State University and the University of Washington. The procedures of this study adhere to the principles of the Declaration of Helsinki.

Consent to participate

Informed consent was obtained from all individual participants included in the study.

Consent for publication

Not applicable.

Additional information

Open practices statement

The data and materials for all experiments are available upon request. None of the experiments was preregistered.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1

Experiment 1: Supplementary results

Learning analysis

At the suggestion of a reviewer, we investigated whether there were any learning effects on performance. Since intact and rhythm-altered conditions were alternated from one block to the next and there was no interaction between rhythm condition and block, F(7, 98) = 1.88, p = .082, η2 = 0.118, we averaged proportion correct data over pairs of blocks. There was a small but significant learning effect, F(7, 98) = 2.86, p = .009, η2 = 0.170. Follow-up Bonferroni-corrected paired t tests showed that the only comparison that approached significance was between the first (M = 0.39, SD = 0.09) and last (M = 0.49, SD = 0.12) blocks, t(14) = −3.62, p = .077, 95% CI [−0.22, 0.01]. No other differences approached significance (all p > .10).

NEQ and SSQ

Supplementary analyses considered the possible relation between individual scores on the NEQ and SSQ and overall speech-in-noise performance (as well as any relation to the magnitude of the target-rhythm effect) in Experiment 1. Estimates of annual noise exposure (ANE) were calculated for each participant from their NEQ responses, based on the procedure outlined in Johnson et al. (2017). The overall SSQ score was calculated by averaging responses across all items. Separate scores were also calculated for the speech, spatial, and qualities subscales of the SSQ by averaging individual responses across only items within one of the given subscales. The correlation between ANE and SSQ was significantly positive, r(13) = 0.56, p = .029. We found no relationship between estimates of ANE and overall speech-in-noise performance (averaged across target rhythm intact and target rhythm altered conditions), r(13) = −0.22, p = 0.44, nor between overall speech-in-noise performance and overall SSQ scores, r(13) = 0.33, p = 0.24. There also was no reliable correlation between overall speech-in-noise performance and any of the SSQ subscales—speech: r(13) = 0.33, p = .24; spatial: r(13) = 0.16, p = .58; qualities: r(13) = 0.30, p = .28.

In terms of the magnitude of the target-rhythm effect (the performance in the rhythm intact condition minus performance in the rhythm altered condition), there was similarly no relationship between ANE and the magnitude of the target-rhythm effect, r(13) = −0.08, p = .77, nor between SSQ scores and the magnitude of the target-rhythm effect, r(13) = 0.28, p = .31. There was also no correlation between the magnitude of the rhythm effect and any of the three SSQ subscales—speech: r(13) = 0.26, p = .35; spatial: r(13) = 0.27, p = .33; qualities: r(13) = 0.18, p = .53.

Finally, there was no evidence of a relationship between formal music training and speech-in-noise performance when overall performance was compared between participants who did not report having any formal music training (M = 0.45, SD = 0.08) and those who did (M = 0.46, SD = 0.08), t(13) = −0.42, p = .68, Cohen’s d = −0.22, 95% CI [−0.11, 0.07]. Similarly, the magnitude of the target-rhythm effect was not significantly different for participants with formal music training (M = 0.11, SD = 0.06) and without formal music training (M = 0.06, SD = 0.04), t(13) = −1.86, p = .09, Cohen’s d = −0.96, 95% CI [−0.10, 0.01].

Experiment 2: Supplementary results

Supplementary analyses considered the potential relations between estimated ANE from the NEQ, SSQ scores, formal music training and overall speech-in-noise performance. To create a composite speech-in-noise score, proportion correct data was averaged for each participant across OA and rhythm intact/altered conditions. Since there were different SNR groups, we conducted partial correlations between ANE, overall SSQ score, years of formal music training, and speech-in-noise performance, controlling for SNR. Three participants were excluded from these analyses due to data quality issues with their survey responses (e.g., responding “10” to all SSQ items, or having multiple inconsistent responses on the NEQ that may have led to underestimates of ANE). There was no correlation between overall performance and ANE, r(33) = 0.244, p = .16, between overall performance and overall SSQ score, r(33) = -0.116, p = .51, or between overall performance and years of formal music training, r(33) = −0.04, p = .80. When SSQ responses were broken down into three subscales, there was still no correlation with overall proportion correct scores—speech: r(33) = 0.038, p = .83; spatial: r(33) = −0.08, p = .67; qualities: r(33) = −0.193, p = .27.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Smith, T.M., Shen, Y., Williams, C.N. et al. Contribution of speech rhythm to understanding speech in noisy conditions: Further test of a selective entrainment hypothesis. Atten Percept Psychophys 86, 627–642 (2024). https://doi.org/10.3758/s13414-023-02815-0

Download citation

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.3758/s13414-023-02815-0

Keywords

Navigation