Background & Summary

Over the past two decades, researchers have paid increasing attention to reading behaviours and have investigated in depth when and where the cognitive mechanisms underlying reading operate concurrently, using recordings of physiological signals from human organs (e.g., lung, heart, eye, and brain)1. As one of the most prominent types of empirical data, eye movements offer unique advantages: precisely delimited time segments (e.g., first fixation duration, second go-past duration, and total reading time), flexibly segmented interest areas (e.g., local words and phrases or global sentences and paragraphs), and high ecological validity, since readers remain free to preview and review the text. In line with this trend, a growing number of eye-tracking datasets have been developed in recent years2,3,4,5,6,7,8,9,10,11,12,13,14,15 (see details in Table 1). However, it is noteworthy that the few Chinese reading corpora, such as GECO-CN12, BSC13 and CEMD14, were not published until last year.

Table 1 Introduction of eye-tracking datasets across different languages.

The rapid growth of eye-movement corpora has stimulated a variety of empirical studies that address new challenges in reading research. For alphabetic languages, the Dundee Corpus has promoted discussion of word processing in parafoveal and foveal vision2,16, the PSC has been used to examine word surprisal effects3,17, the Provo Corpus to investigate undersweep fixations in multiline contexts8,18, the GECO to explore the age-of-acquisition effect on fixations regardless of word length and frequency6,19, and the ZuCo to train machine learning models to predict human reading behaviours7,20.

Fig. 1 Schematic overview of HKC development.