Abstract
Recent years have witnessed a proliferation of reading corpora built by means of eye tracking. This article presents the Hong Kong Corpus of Chinese Sentence and Passage Reading (HKC for brevity), which features natural reading of a logographic script with unspaced words. It releases 28 eye-movement measures from 98 native speakers reading simplified Chinese in two scenarios: 300 one-line single sentences and 7 multiline passages, of 5,250 and 4,967 word tokens, respectively. To verify its validity and reusability, we carried out (generalised) linear mixed-effects modelling of the capacity of visual complexity, word frequency, and reading scenario to predict eye-movement measures. The results show significant effects of these typical (sub)lexical factors on eye movements, replicating previous findings and yielding novel ones. The HKC provides a valuable resource for exploring eye movement control; the study contrasts the scenarios of single-sentence and passage reading in the hope of shedding new light on both the universal nature of reading and the unique characteristics of Chinese reading.
Background & Summary
Over the past two decades, researchers have paid increasing attention to reading behaviours and conducted in-depth investigations into when and where the underlying cognitive mechanisms of reading concurrently function, using recordings of physiological signals from human organs (e.g., lung, heart, eye, and brain)1. As one of the most prominent types of empirical data, eye movements offer unique advantages: accurately sliced time segments (e.g., first fixation duration, second go-past duration, and total reading time), flexibly segmented interest areas (e.g., local words and phrases or global sentences and paragraphs), and high ecological validity that allows for previewing and reviewing texts. In line with this trend, a growing number of eye-tracking datasets have been developed in recent years2,3,4,5,6,7,8,9,10,11,12,13,14,15 (see details in Table 1). However, it is noteworthy that the few Chinese reading corpora, such as GECO-CN12, BSC13 and CEMD14, were not published until last year.
The rapid growth of eye-movement corpora has spurred a variety of empirical studies that address new challenges arising from reading. In reading research on alphabetic languages, the Dundee Corpus has promoted discussion of word processing in parafoveal and foveal vision2,16, while the PSC has been employed to examine word surprisal effects3,17, the Provo Corpus to investigate undersweep fixations in multiline contexts8,18, the GECO to explore the age-of-acquisition effect on fixations regardless of word length and frequency6,19, and the ZuCo to train machine learning models to predict human reading behaviours7,20,
Methods
Participants
This study was approved in advance of implementation by the Human Subjects Ethics Subcommittee of the corresponding college. We recruited 98 university students (89 females, age = 26 ± 3.64) as participants, all native speakers of Mandarin, skilled in reading simplified Chinese, with normal (or corrected-to-normal) eyesight and no illness affecting cognitive abilities. Each signed a consent form before the experiment and received monetary remuneration upon completion. For privacy protection, no other information is disclosed.
Apparatus
Both experiments used the following hardware: (1) a tower-mounted EyeLink 1000 series eye tracker (SR Research, Canada) with a sampling rate of up to 1000 Hz and a spatial resolution of 0.01° of visual angle; (2) an 18-inch ViewSonic CRT monitor (resolution, 1024 × 768 pixels; refresh rate, 85 Hz); and (3) an adjustable chin rest.
Materials
The top 300 sentences that were 30 characters long (including punctuation marks) were selected from the XIN subcorpus of the Chinese Gigaword Corpus27,28. Sentences were first sorted in ascending order of average entropy per character (i.e., the overall information per language signal29), with entropy estimated by a simple unigram model using character frequencies in the corpus. We then filtered out entries suspected of any lexical, syntactic or semantic inclinations (e.g., long numbers of many digits and repeated expressions) or other bias (e.g., religion, racism, sexuality, and violence). We chose the sentences most likely to be unbiased in this way in the hope of smoothing the natural reading process to the greatest extent possible. The sentence length was chosen to make full use of the screen width, except for a necessary margin on four sides (left and right, 110 pixels; top and bottom, 180 pixels).
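The entropy-based ranking described above can be sketched as follows. This is a minimal illustration and not the authors' selection script: it assumes that "average entropy per character" means the mean of -log2 p(c) over the characters of a sentence, with p(c) taken from unigram character frequencies in the corpus. The function names are hypothetical, and the subsequent filtering for biased content is a manual step that is not modelled here.

```python
import math
from collections import Counter


def char_surprisal_table(corpus_sentences):
    """Map each character to -log2 of its unigram probability in the corpus."""
    counts = Counter(ch for sentence in corpus_sentences for ch in sentence)
    total = sum(counts.values())
    return {ch: -math.log2(n / total) for ch, n in counts.items()}


def average_entropy_per_char(sentence, surprisal):
    """Mean per-character information of a sentence (assumed definition)."""
    # Every character is in the table because candidates come from the corpus.
    return sum(surprisal[ch] for ch in sentence) / len(sentence)


def rank_candidates(corpus_sentences, length=30, top_n=300):
    """Keep sentences of the target length, sorted by ascending average entropy."""
    surprisal = char_surprisal_table(corpus_sentences)
    candidates = [s for s in corpus_sentences if len(s) == length]
    candidates.sort(key=lambda s: average_entropy_per_char(s, surprisal))
    return candidates[:top_n]
```

With the default arguments, `rank_candidates` returns up to 300 sentences of exactly 30 characters, the lowest-entropy ones first; those candidates would then still need the manual screening described in the text.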
Following a similar procedure with the same criteria except for text length, 7 passages were selected from the same corpus, with no overlap with any of the selected single-line sentences (see Table 2 for sample materials). With a total text length of 8742 characters, these passages cover a variety of topics: 1 on celebrity news, 1 on city development, 2 on education, 1 on employment, and 2 on sports. A small number of uncommon words, such as technical terms and long numbers, were pruned or replaced with easier or shorter ones without altering the meanings of the original sentences. For the best fit to the monitor, we divided each passage into a title page and several content pages (6, 4, 5, 6, 5, 10, and 5 for the seven passages, 41 in total) according to the text length (in number of characters), and configured each page with 9 lines (except the last page of a passage) and each line with 38 characters (except the last line of a paragraph). There are 1078 ± 275 characters per passage, 36 ± 21 sentences per passage, and 40 ± 21 characters per sentence.
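The page layout described above (38 characters per line, 9 lines per page) can be illustrated with a short sketch. It is only an approximation under stated assumptions: paragraphs are wrapped independently, lines then flow continuously across pages, and no punctuation line-breaking rules, indentation, or blank lines between paragraphs are applied. The function names are hypothetical and do not come from the authors' materials.

```python
def wrap_paragraph(paragraph, chars_per_line=38):
    """Cut one paragraph into full lines; only its last line may be shorter."""
    return [
        paragraph[i:i + chars_per_line]
        for i in range(0, len(paragraph), chars_per_line)
    ]


def paginate(paragraphs, chars_per_line=38, lines_per_page=9):
    """Group the wrapped lines into pages; only the last page may be shorter."""
    lines = [
        line
        for paragraph in paragraphs
        for line in wrap_paragraph(paragraph, chars_per_line)
    ]
    return [
        lines[i:i + lines_per_page]
        for i in range(0, len(lines), lines_per_page)
    ]


# Content pages per passage as reported in the text; they sum to 41.
CONTENT_PAGES = [6, 4, 5, 6, 5, 10, 5]
TOTAL_CONTENT_PAGES = sum(CONTENT_PAGES)
```

Calling `paginate` on a list of paragraph strings returns a list of pages, each a list of line strings, which makes it easy to check how many content pages a passage of a given length would occupy.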
Given the convention of no interword spacing in Chinese texts, we performed word segmentation following the national standard GB/T 13715-92 (1992) and inserted delimiters (*) between words in the materials to facilitate word-based analyses after data collection (see Fig. 2). We manually checked the segmentation results word by word and resolved ambiguous and controversial cases according to our best understanding of the standard. Experiment Builder software (SR Research, Canada) automatically specifies an interest area (IA) between two delimiters and keeps all delimiters invisible to participants during reading. Each punctuation mark was also treated as an IA.
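To make the delimiter scheme concrete, the sketch below derives word-level interest areas from a `*`-delimited string. It is an illustration only and does not reproduce Experiment Builder's behaviour or API: the `InterestArea` class and function names are hypothetical, character offsets stand in for the pixel regions a real IA would have, and punctuation marks are assumed to be delimited like words.

```python
from dataclasses import dataclass


@dataclass
class InterestArea:
    index: int  # 1-based position of the IA in the text
    text: str   # the word or punctuation mark
    start: int  # character offset in the displayed text (delimiters removed)
    end: int    # exclusive end offset


def displayed_text(delimited, delimiter="*"):
    """The text as participants see it, with delimiters hidden."""
    return delimited.replace(delimiter, "")


def interest_areas(delimited, delimiter="*"):
    """Create one interest area per delimited token."""
    areas = []
    offset = 0
    for token in delimited.split(delimiter):
        if not token:
            continue  # skip empty tokens from leading or doubled delimiters
        areas.append(InterestArea(len(areas) + 1, token, offset, offset + len(token)))
        offset += len(token)
    return areas
```

For an invented example such as `"我们*喜欢*阅读*。"`, this yields four interest areas (three words and one punctuation mark) over the displayed string `我们喜欢阅读。`.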
Code availability
Two R scripts (“preprocessing.R” and “lmeModelling.R”), which contain the step-by-step code for our data preprocessing and technical validation, respectively, are released in the OSF repository32. Also released is the source code (mergeChineseInfo.java) of a Java program that integrates lexical property information from the CLD for the words in the HKC by means of word matching, on the premise of a standardised format (word-based, UTF-8, comma-delimited data).
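The idea of integrating lexical properties by word matching can be sketched in a few lines. The example below is written in Python for illustration and is not a port of mergeChineseInfo.java: the column names (`word`, `ffd`, `frequency`, `strokes`) and all values are invented, and the real HKC and CLD files may be organised differently.

```python
import csv
import io


def read_csv_text(text):
    """Parse comma-delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def merge_lexical_info(corpus_rows, lexicon_rows, key="word"):
    """Attach lexicon properties to each corpus row by exact word match."""
    lexicon = {row[key]: row for row in lexicon_rows}
    first = lexicon_rows[0] if lexicon_rows else {}
    property_names = [name for name in first if name != key]
    merged = []
    for row in corpus_rows:
        match = lexicon.get(row[key])
        # Unmatched words keep empty property fields so all rows share columns.
        extra = {name: (match[name] if match else "") for name in property_names}
        merged.append({**row, **extra})
    return merged


# Invented sample data, for illustration only.
corpus = read_csv_text("word,ffd\n我们,210\n阅读,245\n香港,230\n")
lexicon = read_csv_text("word,frequency,strokes\n我们,1000,12\n阅读,50,20\n")
merged = merge_lexical_info(corpus, lexicon)
```

When reading real files from disk rather than in-memory strings, opening them with `encoding="utf-8"` matches the UTF-8, comma-delimited format that the text names as a prerequisite.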
References
Ayres, P., Lee, J. Y., Paas, F. & van Merriënboer, J. J. G. The validity of physiological measures to identify differences in intrinsic cognitive load. Front. Psychol. 12, 702538 (2021).
Kennedy, A. The Dundee Corpus (University of Dundee, 2003).
Kliegl, R., Grabner, E., Rolfs, M. & Engbert, R. Length, frequency, and predictability effects of words on eye movements in reading. Eur. J. Cogn. Psychol. 16, 262–284 (2004).
Kuperman, V., Dambacher, M., Nuthmann, A. & Kliegl, R. The effect of word position on eye-movements in sentence and paragraph reading. Q. J. Exp. Psychol. 63, 1838–1857 (2010).
Asahara, M., Ono, H. & Tadashi, M. E. BCCWJ-EyeTrack: Reading time annotation on the ‘Balanced Corpus of Contemporary Written Japanese’. IEICE Tech. Rep. 116, 7–12 (2016).
Cop, U., Dirix, N., Drieghe, D. & Duyck, W. Presenting GECO: An eyetracking corpus of monolingual and bilingual sentence reading. Behav. Res. Methods 49, 602–615 (2016).
Hollenstein, N. et al. ZuCo, a simultaneous EEG and eye-tracking resource for natural sentence reading. Sci. Data 5, 180291 (2018).
Luke, S. G. & Christianson, K. The Provo Corpus: A large eye-tracking corpus with predictability norms. Behav. Res. Methods 50, 826–833 (2017).
Laurinavichyute, A. K., Sekerina, I. A., Alexeeva, S., Bagdasaryan, K. & Kliegl, R. Russian Sentence Corpus: Benchmark measures of eye movements in reading in Russian. Behav. Res. Methods 51, 1161–1178 (2018).
Hollenstein, N., Barrett, M. & Björnsdóttir, M. The Copenhagen Corpus of eye tracking recordings from natural reading of Danish texts. In Proceedings of the Thirteenth Language Resources and Evaluation Conference 1712–1720 (2022).
Siegelman, N. et al. Expanding horizons of cross-linguistic research on reading: The Multilingual Eye-movement Corpus (MECO). Behav. Res. Methods 54, 2843–2863 (2022).
Sui, L., Dirix, N., Woumans, E. & Duyck, W. GECO-CN: Ghent eye-tracking corpus of sentence reading for Chinese-English bilinguals. Behav. Res. Methods 1–21, https://doi.org/10.3758/s13428-022-01931-3 (2022).
Pan, J., Yan, M., Richter, E. M., Shu, H. & Kliegl, R. The Beijing Sentence Corpus: A Chinese sentence corpus with eye movement data and predictability norms. Behav. Res. Methods 1–12, https://doi.org/10.3758/s13428-021-01730-2 (2021).
Zhang, G. et al. The database of eye-movement measures on words in Chinese reading. Sci. Data 9, 411 (2022).
Acartürk, C., Özkan, A., Pekçetin, T. N., Ormanoğlu, Z. & Kırkıcı, B. TURead: An eye movement dataset of Turkish reading. Behav. Res. Methods 1–24, https://doi.org/10.3758/s13428-023-02120-6 (2023).
Kennedy, A., Pynte, J., Murray, W. S. & Paul, S.-A. Frequency and predictability effects in the Dundee Corpus: An eye movement analysis. Q. J. Exp. Psychol. 66, 601–618 (2013).
Boston, M. F., Hale, J., Kliegl, R., Patil, U. & Vasishth, S. Parsing costs as predictors of reading difficulty: An evaluation using the Potsdam Sentence Corpus. J. Eye Mov. Res. 2, 1–36 (2008).
Slattery, T. J. & Parker, A. J. Return sweeps in reading: Processing implications of undersweep-fixations. Psychon. Bull. Rev. 26, 1948–1957 (2019).
Dirix, N. & Duyck, W. An eye movement corpus study of the age-of-acquisition effect. Psychon. Bull. Rev. 24, 1915–1921 (2017).
Hollenstein, N. & Zhang, C. Entity recognition at first sight: Improving NER with eye movement information. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 1–10 (2019).
Hollenstein, N., Troendle, M., Zhang, C. & Langer, N. ZuCo 2.0: A dataset of physiological recordings during natural reading and annotation. Preprint at arXiv:1912.00903 (2019).
Hollenstein, N., Pirovano, F., Zhang, C., Jäger, L. & Beinborn, L. Multilingual language models predict human reading behavior. Preprint at arXiv:2104.05433 (2019).
Just, M. A. & Carpenter, P. A. A theory of reading: From eye fixations to comprehension. Psychol. Rev. 87, 329–354 (1980).
Asahara, M. Between reading time and clause boundaries in Japanese-wrap-up effect in a head-final language. In Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation (PACLIC 32) 19–27.
Rayner, K. Eye guidance in reading: Fixation locations within words. Perception 8, 21–30 (1979).
Yang, H.-M. & McConkie, G. W. Reading Chinese: Some basic eye-movement characteristics. In Reading Chinese Script: A Cognitive Analysis (eds. Wang, J., Inhoff, A. W. & Chen, H.-C.) 207–222 (Erlbaum, 1999).
Ma, W.-Y. & Chen, K.-J. Design of CKIP Chinese word segmentation system. Chin. Orient. Lang. Inf. Process. Soc. 14, 235–249 (2005).
Ma, W. Y. & Huang, C. R. Uniform and effective tagging of a heterogeneous giga-word corpus. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06). L06–1163 (European Language Resources Association (ELRA), 2006).
Sun, C. C., Hendrix, P., Ma, J. & Baayen, R. H. Chinese lexical database (CLD). Behav. Res. Methods 50, 2606–2629 (2018).
Sun, F., Morita, M. & Stark, L. W. Comparative patterns of reading eye movement in Chinese and English. Percept. Psychophys. 37, 502–506 (1985).
Andrews, S. & Veldre, A. Wrapping up sentence comprehension: The role of task demands and individual differences. Sci. Stud. Read. 25, 123–140 (2020).
Wu, Y. & Kit, C. Hong Kong Corpus of Chinese Sentence and Passage Reading. OSF https://doi.org/10.17605/OSF.IO/7UQ3J (2022).
Sereno, S. Measuring word recognition in reading: Eye movements and event-related potentials. Trends Cogn. Sci. 7, 489–493 (2003).
Graff, D. & Chen, K. Chinese Gigaword LDC2003T09 (Linguistic Data Consortium, 2003).
Cai, Q. & Brysbaert, M. SUBTLEX-CH: Chinese word and character frequencies based on film subtitles. PLoS One 5, e10729 (2010).
Van Esch, D. Leiden Weibo Corpus (Leiden University, 2012).
R Core Team. R: A language and environment for statistical computing (R Foundation for Statistical Computing, 2020).
RStudio Team. RStudio: Integrated development environment for R (RStudio, Inc., 2019).
Tullo, C. & Hurford, J. Modelling Zipfian distributions in language. In Proceedings of Language Evolution and Computation Workshop/Course 62–75 (ESSLLI, 2003).
Zang, C. New perspectives on serialism and parallelism in oculomotor control during reading: The multi-constituent unit hypothesis. Vision 3, 50 (2019).
Just, M. A. & Carpenter, P. A. The Psychology of Reading and Language Comprehension (Allyn & Bacon, 1987).
Zang, C., Liversedge, S. P., Bai, X. & Yan, G. Eye Movements during Chinese Reading (Oxford University Press, 2011).
Sun, F. & Feng, D. Eye movements in reading Chinese and English text. In Reading Chinese Script: A cognitive analysis (eds. Wang, J., Inhoff, A. W. & Chen, H.-C.) 201–218 (Psychology Press, 1999).
Warren, T., White, S. J. & Reichle, E. D. Investigating the causes of wrap-up effects: Evidence from eye movements and E-Z Reader. Cognition 111, 132–137 (2009).
Rayner, K., Sereno, S. C. & Raney, G. E. Eye movement control in reading: A comparison of two types of models. J. Exp. Psychol. Hum. Percept. Perform. 22, 1188–1200 (1996).
Wickham, H., François, R., Henry, L. & Müller, K. dplyr: A grammar of data manipulation. R package version 0.7.6. (2018).
Wickham, H. ggplot2: Elegant graphics for data analysis (Springer-Verlag, 2016).
Sjoberg, D. D., Whiting, K., Curry, M., Lavery, J. A. & Larmarange, J. Reproducible summary tables with the gtsummary package. R J. 13, 570 (2021).
Baayen, R. H., Davidson, D. J. & Bates, D. M. Mixed-effects modeling with crossed random effects for subjects and items. J. Mem. Lang. 59, 390–412 (2008).
Bates, D. M. lme4: Mixed-effects modeling with R (Springer, 2010).
Lüdecke, D., Ben-Shachar, M., Patil, I., Waggoner, P. & Makowski, D. performance: An R package for assessment, comparison and testing of statistical models. J. Open Source Softw. 6, 3139 (2021).
Lüdecke, D. sjPlot: Data visualization for statistics in social science. R package version 2.8.11 (2022).
Wickham, H. et al. Welcome to the tidyverse. J. Open Source Softw. 4, 1686 (2019).
Acknowledgements
This study was supported by grants from the RGC General Research Fund (GRF) 9042276 and CityU Strategic Research Grant (SRG-Fd) 7005411 and 7005708. Thanks go to all our participants and a group of PhD students and assistants (Yanmengnan Cui, Dandan Huang, **anhe Li, Di **ong, Bo Zhang, and Nannan Zhou) who helped with this project.
Author information
Author notes
Co-first author: Chunyu Kit.
Contributions
C. Kit conceived the experimental design and supervised material selection and data collection; Y. Wu conducted data retrieval and analysis and drafted the manuscript; C. Kit revised and finalised the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wu, Y., Kit, C. Hong Kong Corpus of Chinese Sentence and Passage Reading. Sci Data 10, 899 (2023). https://doi.org/10.1038/s41597-023-02813-9