Introduction

Individuals with reading and writing difficulties often identify spelling as their most significant (e.g., Hatcher et al., 2002; Sumner & Connelly, 2020) and enduring challenge (Kemp et al., 2009). They frequently leave more spelling errors in their texts than writers without such difficulties (e.g., Connelly et al., 2006), resulting in lower quality ratings compared to their peers’ work (Bogdanowicz et al., 2014; Connelly et al., 2006; Sterling et al., 1998). To some extent, these quality differences seem to persist even after texts have been corrected for spelling (Galbraith et al., 2012; Gregg et al., 2007; Tops et al., 2013). It has been suggested that variables such as low fluency causing cognitive overload (e.g., Berninger et al., 2008) or avoidance strategies (O’Rourke, 2020; Wengelin, 2007) resulting in lower lexical diversity (Sumner et al., 2016; Wengelin, 2007) could account for some of these outcomes, indicating that spelling difficulty is a complex phenomenon that requires understanding not only of the errors that meet the eye but also of the underlying processes. This raises questions about the need to take spelling processes more into account in the assessment and instruction of writing.

One might argue that spelling is not crucial since most texts can still be comprehensible despite a few errors. However, research suggests that spelling mistakes negatively impact the reading speed of fluent adult readers, even if they are not consciously noticed (Melin, 2007), and that children find texts with errors less engaging, memorable, well-crafted, and comprehensible (Varnhagen, 2000). Moreover, some readers use spelling errors to infer authors’ intelligence (Figueredo & Varnhagen, 2005), credibility (Schloneger, 2016), or trustworthiness (Melin, 2007). Therefore, it is not surprising that adults, including educators and parents, commonly consider spelling fundamental to good writing (Rankin et al., 1993), or that children and adolescents perceive spelling proficiency as crucial for academic success and future career prospects (Rankin et al., 1994). Spelling ability is typically assumed to reflect the effort invested in writing, and authors making spelling mistakes are often stigmatized as careless and lazy thinkers (Varnhagen, 2000). This is likely one reason why many individuals with reading and writing difficulties continue to struggle with writing well into their later years (e.g., Wengelin, 2007).

Let us consider an illustrative case of struggling writing, drawn from the experiences of 15-year-old Philip, a participant in one of our previous studies (Wengelin et al., 2014). Despite initially scoring at stanine 1 on a spelling test during the screening process, he managed to compose a short argumentative text (57 words long) with just two misspellings. A segment from the keystroke log (Example 1) sheds light on his approach to the task, which he needed a total of 37 min to complete. We zoom in on the third word, “engagera” (engage). Numbers within angle brackets (e.g., <2.86>) show pause lengths in seconds, and BACKSPACEX within angle brackets (e.g., <BACKSPACE4>) indicates Philip’s use of the backspace key, with X giving the number of characters deleted.

Example 1

L<2.86>ärarna <42.58>sk<36.79><BACKSPACE2>bör eng<4.88>a<9.27>

<BACKSPACE2>ga<22.96><BACKSPACE4>äng<2.82>agera<2.22><BACKSPACE8>eng<4.75>agera<2.60><135.59>sig.
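
To make the notation concrete, here is a minimal sketch (ours, purely illustrative and not part of ScriptLog’s tooling) that parses a log fragment in this bracketed format into typing, pause, and deletion events; the regular expression and function name are assumptions based on the description above.

```python
import re

# <2.86> = pause in seconds, <BACKSPACE4> = four characters deleted, anything else = typed text.
TOKEN = re.compile(r"<BACKSPACE(\d+)>|<(\d+(?:\.\d+)?)>|([^<]+)")

def parse_log(log: str) -> list:
    """Split a bracketed keystroke-log string into (event_type, value) tuples."""
    events = []
    for backspaces, pause, text in TOKEN.findall(log):
        if backspaces:
            events.append(("delete", int(backspaces)))
        elif pause:
            events.append(("pause", float(pause)))
        else:
            events.append(("type", text))
    return events

for event in parse_log("eng<4.88>a<9.27><BACKSPACE2>ga"):
    print(event)  # e.g. ('type', 'eng'), ('pause', 4.88), ('delete', 2), ...
```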

Philip clearly hesitated over, reread, and revised the word several times before his efforts resulted in a correct spelling. The first phoneme, /ɛ/, which he identified correctly, seemed to pose a major challenge, as shown by the way he exchanged the letter “e” for “ä”, both of which can represent the phoneme, and then changed it back to “e”.

One potential explanation for his laborious and slow word processing could be that Philip has fuzzy representations of words. According to this view (Sénéchal et al., 2016), mental representations of words are initially constructed as a frame in which consistent graphemes are clearly specified, whereas inconsistent graphemes are more likely to be underspecified, if represented at all. “Fuzziness” may arise from variability in pronunciation, complex morphological structures, or visual similarities. Consequently, uncertainty regarding the sequence or presence of specific letters may arise, leading to slow word production, particularly among children facing reading and writing challenges who struggle with recalling the correct letter combinations.

In combination with the stigma surrounding texts that include spelling errors, difficulties in retrieving the correct spelling from memory may also—with increasing age and awareness—lead to conscious worry about spelling, including increased hesitation, rereading, and revision. Philip’s example is, of course, anecdotal and stems from a relatively dated dataset. Thus, its generalizability to other writers grappling with reading and writing difficulties in today’s digital communication landscape remains unclear. However, a more recent study by Reynolds and Wu (2018), in which dyslexic young adults self-reported their Facebook usage, suggests that many writers encountering writing difficulties contend with significant stigma regarding their spelling. These writers commented, for example, on how spelling errors were used to discredit arguments or divert debates, causing strong emotional responses toward writing and prompting them to meticulously edit and revise their content until deemed error-free.

There is also evidence from experimental and quasi-experimental research suggesting that those with spelling difficulties often exhibit hesitation and struggle. This line of research typically operationalizes hesitation and struggle in terms of word production speed/fluency, word-level pausing, and revision. While some early studies failed to find significant differences in writing speed or pausing between children with and without dyslexia (Martlew, 1992), or yielded inconclusive results (Søvik et al., 1987), more recent research has reported that both young children with dyslexia (English 9-year-olds by Sumner et al., 2013, 2016; Spanish 8–12-year-olds by Suárez-Coalla et al., 2020, and Afonso et al., 2020; French 11-year-olds by Alamargot et al., 2020) and adolescents with reading and writing difficulties (Norwegian 17-year-olds by Torrance et al., 2016; Swedish 15-year-olds by Wengelin et al., 2014) require more time to transition from one letter to another and/or make more word-internal pauses, i.e., hesitate more within words, than comparable control/reference groups. Similar results have been demonstrated for Swedish (Wengelin, 2002, 2007) and Croatian (Tomazin et al., 2023) adults, the majority of whom did not have a university education, and for university students (Afonso et al., 2015; O’Rourke, 2020; Sumner & Connelly, 2020). However, Galbraith et al. (2012) found no differences in temporal processing between university students with and without dyslexia.

Regarding spelling-related revision, Wengelin’s (2002, 2007) adult dyslexic participants revised their spelling more frequently than the reference group, and their spelling-related revisions constituted a higher proportion of their total number of revisions. Notably, only 50% of their spelling revisions were successful. In Sumner and Connelly’s (2020) research, university students with dyslexia did not revise spelling more frequently than those in the control group, but their spelling revisions did comprise a higher proportion of their overall revisions. Interestingly, in contrast, the adolescents in Torrance et al.’s (2016) study made slightly fewer word-level revisions than their peers. This could perhaps be attributed either to a lack of concern about errors or to difficulty in error detection, possibly due to reading challenges. However, in a masked condition where the participants could not see their texts, they revised neither more nor less. Like Wengelin (2002, 2007), Torrance et al. analysed all word-level revisions, including, for example, typos, and we have found only one study focusing specifically on spelling error detection by individuals with reading and writing difficulties. O’Rourke (2020) conducted an experimental sentence-level task for this purpose and found that university students with dyslexia detected and correctly revised fewer spelling errors than students without dyslexia.

Although it is well established that texts by writers with spelling difficulties typically receive lower quality assessments, the relationship between their writing processes and the resulting texts is not yet clear. Contemporary models of writing and writing development, such as that of Hayes and Berninger (2014), the Simple View of Writing (Berninger et al., 2002), and the Revised Writer(s) Within Communities (WWC) model by Graham (2018), suggest that dysfluent and hesitant word processing can create cognitive overload, potentially impeding higher-level processes and thus affecting the content or structure of a text. Interestingly, a recent study by Rønneberg et al. (2022) that aimed to investigate this in Norwegian 6th-graders without known difficulties found no effects of fluency measures on the quality of the completed text, thus not supporting what they termed the “process-disruption hypothesis” for typically developing children writing in a shallow orthography. The authors did note, though, that their participants may have had sufficient fluency in spelling and typing for this not to be disruptive. They also acknowledged that their results do not negate the relevance of spelling ability to higher-level text production in individuals with writing difficulties.

For texts composed by writers with dyslexia, both Sumner et al. (2016) and Wengelin (2007) identified associations between spelling-related dysfluencies and lexical diversity. Sumner and Connelly (2020) also found such associations, albeit in a different manner. Their participants with dyslexia produced texts with lexical diversity equal to that of their peers, but seemingly at the expense of spelling accuracy. The authors suggested that these writers had the ability to overlook spelling errors, possibly due to familiarity with spell check. This notion was supported by O’Rourke (2020) and O’Rourke et al. (2020), who found that, although it interrupted writing, the use of spell check by writers with dyslexia to alleviate spelling demands increased lexical diversity. O’Rourke’s result is in line with some previous research showing that dictating by means of speech recognition—thus not having to think about spelling—can facilitate writing for children with various language difficulties (Higgins & Raskind, 1995; Kraft, 2023; MacArthur & Cavalier, 2004; Quinlan, 2004). The suggestion that some writers had the ability to overlook spelling errors could potentially also apply to Torrance et al.’s (2016) participants, as Norwegian students are accustomed to typing their work and exams on a computer from an early age. Finally, Galbraith et al. (2012) reported a more complex relationship. Although their research indicated no discrepancies in the proportions of time that dyslexic and non-dyslexic undergraduates dedicated to various writing processes, their results did indicate that these processes correlated with the final text’s quality in distinct ways for the two groups.

In summary: On an individual level, there appears to be little doubt that qualitative analysis of process data can offer valuable insights to both researchers and educators regarding the types of spelling challenges described above. It illuminates moments of hesitation, worry, error detection, attempted revisions, the success of these revisions, and avoidance strategies. Additionally, process data can reveal which orthographic patterns writers are aware of and find challenging. For instance, Philip demonstrated clear awareness that the phoneme /ɛ/ could be spelled in multiple ways, which caused him difficulty. On the group level, while taking into account that the studies reviewed above defined and operationalized fluency and/or dysfluency in different ways, we note that they indicate a certain consistency regarding temporal aspects of word-level writing by individuals with spelling difficulties. The results for spelling-related revision and for the relation between process variables and text characteristics are less conclusive. Disparities can probably, to a large extent, be explained by differences in demographics, input methods, tasks, and languages employed across the research. The studies reviewed include different age groups, levels of education, input modalities, and languages—including orthographies of varying degrees of transparency. Thus, diverging results are not surprising. It is, for example, conceivable that, due to fuzzy representations (e.g., Sénéchal et al., 2016), dysfluency in the writing of young children does indeed hinder their higher-level processing only until they reach a certain level of automatization, in accordance with the Simple View of Writing (Berninger et al., 2002). However, those who develop reading and writing difficulties may, with age, become increasingly aware of their limited proficiency and thus become more hesitant, as highlighted by Reynolds and Wu (2018). This raises questions about the extent to which schoolchildren are at risk of developing such writing behaviour, when and how this happens, and how classroom teachers can identify struggling writing in time to prevent it.

Three of the studies referred to in the literature review (Afonso et al., 2020; Alamargot et al., 2020; Suárez-Coalla et al., 2020) investigated children with reading and writing difficulties at upper elementary/middle school ages—all with a focus on handwriting skills. In the Swedish context, these are the grades (4–6, ages 10–12) in which children are expected to move from basic writing skills into more complex spelling and composition. We speculate that these may be the ages at which differences between children with and without spelling difficulties start to become apparent, and thus where their processes may start to deviate from those of their peers. Therefore, the first research question of this paper focuses on fluency and dysfluency in Swedish children in grades 4–6.

Another important aspect of the previous research is how the participants with spelling difficulties are conceptualized. Some studies have recruited participants with a known dyslexia diagnosis, while others have used the broader concept of reading and writing difficulties, and one has specifically focused on poor decoders. All these groups have also been shown to be poor spellers, but, differences in dyslexia definitions and diagnoses across time and location aside, this means that all of these studies, including our own research, assume reading difficulty as the default problem. Notably, Torrance et al. (2016), who focused specifically on the writing of poor decoders, concluded that word-level hesitation in their sample appeared to be linked ‘solely to production rather than reading’ (Torrance et al., 2016, p. 385), as differences in temporal patterns between writers with and without reading and writing difficulties persisted even in a condition where participants were prevented from seeing what they had written. However, inhibiting hesitation and lookbacks may not be easier for anxious writers than inhibiting other types of behaviour that have developed over a long period, and more research is needed to disentangle the influences of spelling and reading abilities on word processing. For example, the results of the studies included in the literature review demonstrate the importance of distinguishing between word-level revision caused by hesitation, which, as argued by Torrance et al., could be merely a question of production, and spelling revision induced by unambiguous error detection, which clearly requires reading skill. As indicated by an eye-tracking study of typically developing adolescents by Beers et al. (2010, p. 768), ‘reading at the inscription point’ was associated with text quality, possibly to ‘review their most recently composed words’. Paradoxically, error detection could be just as important when introducing compensatory tools to facilitate writing, such as spell check or speech recognition. Although experienced as useful by many (but under-researched), these systems are not infallible. In view of the above, we conclude that distinguishing reading and spelling skills may add to our understanding of spelling difficulties, and thus our second research question focuses on the detection and revision of spelling errors. To understand the contribution of spelling difficulty per se, our main inclusion criterion was based on spelling skill, after which we assigned the participants to either a group with mainly spelling difficulties or a group with both reading and spelling difficulties.

The overarching aim of our paper is to explore the processes of 10–13-year-old (grades 4–6) Swedish children with and without spelling difficulties using keystroke logging, and to discuss to what extent the knowledge gained can be of use for the assessment of spelling difficulties. Based on the results of Rønneberg et al. (2022), we assume that at least the older children without reading and writing difficulties in our sample will have reached a certain ‘ceiling level’ of automatization. Younger children and children with reading and writing difficulties may not have done so to the same extent. Therefore, our main approach is comparative rather than correlational, as indicated by the two research questions outlined above. However, for the sake of comparability with other studies and because we are interested in whether and how process data can add to existing assessments, our third question focuses on the relationships between process data, text characteristics, and the outcomes of standardized spelling and word decoding tests. Our specific research questions are:

  1. To what extent do Swedish children with and without spelling difficulties in grades 4–6 demonstrate dysfluency in writing in terms of word-internal pausing and word-level revision?

  2. Can distinctions be identified between children characterized by both reading and spelling difficulties and children facing challenges specifically in spelling, particularly regarding the detection and revision of spelling errors?

  3. Are there any identifiable correlations between data on writing processes, text characteristics, and the outcomes of standardized spelling and word decoding tests?

We expect children with spelling difficulties to be less fluent and produce texts with more spelling errors and of lower quality than typically developing children. We also expect children with both reading and spelling difficulties to detect and revise spelling errors to a lesser extent than the others. The results will be discussed in terms of implications for assessment.

Method

Participants and corpus

Three groups of children aged 10–13 (grades 4–6) participated: one primarily with spelling difficulties (n = 16), one with both reading and spelling difficulties (n = 15), and one without difficulties (n = 16). All of them had received their entire education in mainstream Swedish schools and used laptops or tablets with physical keyboards daily. The participants were drawn from a project aimed at understanding writing difficulties in children with decoding and spelling challenges and at exploring ways to facilitate writing for them through writing technologies and individual interventions. The project included 40 participants with reading and writing difficulties, but only 31 had complete data sets for the present study. A complete data set comprises a spelling test, a decoding test in two parts, and a writing task carried out on a computer without compensatory writing technologies. These participants were recruited through special educational experts, who were invited to suggest children with reading and spelling difficulties, focusing specifically on those who would benefit from compensatory support with spelling. Spelling difficulties, as highlighted by Dockrell (2009), can be identified in a wide range of children, and we adopted an inclusive approach to participation. Autism and/or intellectual disability were exclusion criteria, but beyond that, we welcomed all children (Footnote 1). To confirm group membership, we conducted a standardized spelling test, DLS (Johansson, 1992). Only children who scored at stanine 3 or below were included.

Because error detection is an important aspect of this paper, we then used the LäSt (word and non-word) decoding test (Elwér et al., 2011) to divide this group of children into two subgroups post hoc: one with clear reading and spelling difficulties and one with mainly spelling difficulties. The inclusion criterion for the group with both reading and spelling difficulties was a score of stanine 3 or lower on both the word-decoding part and the non-word-decoding part of the decoding test. Throughout the paper, we will refer to the two groups with spelling difficulties as follows:

  1. Children with mainly spelling difficulties.

  2. Children with both reading and spelling difficulties.

We also initiated the recruitment of typically developing children for a reference group. Unfortunately, this data collection was interrupted by Covid-19, and therefore the reference group in this paper comprises only 16 children who, on average, are slightly younger than the children with reading and writing difficulties.

The group with both reading and spelling difficulties comprised the oldest participants and exhibited the poorest performance in both reading and spelling, not only in comparison to their similarly aged peers but also in absolute terms. A Bayesian ANCOVA confirmed that the typically developing children, as expected, demonstrated stronger spelling abilities than participants in the two groups with spelling difficulties—despite being younger. However, we found no evidence for a difference between the two groups with spelling difficulties. That is, they were at similar spelling levels. The majority of children in these two groups performed at or below the expected level of a 4th grader—all of them at least one grade-level below expectations, with most of them falling even further behind.

Regarding non-word decoding, we observed very strong evidence for a group effect (BF10 = 9410.36). Post hoc tests indicated moderate evidence for a difference between typically developing children and those with primarily spelling difficulties (BF10 = 3.45). In contrast, there was very strong evidence that participants with both reading and spelling difficulties were weaker readers compared to both of the other two groups (BF10 = 4071 for the comparison with typically developing children and BF10 = 110.15 for the comparison with children mainly experiencing spelling difficulties).

For word decoding, the results were similar but demonstrated even higher probabilities (BF10 = 137302.41 for the model based on Group). In summary, acknowledging substantial individual variation, the two groups with spelling difficulties exhibited similar levels of spelling proficiency but differed significantly in reading skills. Descriptive statistics are shown in Table 1.

Table 1 Age, spelling test results and word decoding results for the typically developing children, the children with mainly spelling difficulties and the children with both reading and spelling difficulties

The corpus for the present study consists of 47 text samples containing both product and process data. The finally edited texts comprise a total of 4166 words (1884 by the typically developing children, 1495 by the children with mainly spelling difficulties, and 787 by the children with both spelling and reading difficulties), out of which 356 are misspelled: 64 by the children without difficulties and 292 by the children in the two groups with spelling difficulties. As will be shown later, more errors were made during the writing process, but these were detected and revised accordingly.

Elicitation material, instruments, and procedure

Individual sessions were conducted with a certified speech pathologist, either Author 2 or Author 3 of this paper, for all testing and text production. The DLS spelling test (Johansson, 1992) served as a standardized diagnostic tool, involving a list of 36 dictated words for the child to spell. Each correctly spelled word earned the child one point, and the raw score was converted into stanine scores based on the child’s grade level (4–6). The test has reported reliabilities of 0.90 for grade 4, 0.88 for grade 5, and 0.96 for grade 6. The test was administered with pen and paper.

The LäSt decoding test (Elwér et al., 2011) functioned as a standardized reading assessment tool, featuring two lists of words: one with non-words and the other with real words. Both lists required the child to read individual units one at a time. The non-word list measured alphabetic reading skills, while the real-word list also evaluated the child’s orthographic reading abilities. The reported reliabilities for the two lists are 0.74 and 0.91, respectively.

The keyboarded texts were created on a MacBook computer equipped with ScriptLog, a keystroke logging program designed for recording typing processes (Wengelin et al., 2019). ScriptLog records each keystroke and mouse click with a timestamp, enabling playback of the writing process and analysis of temporal patterns, pauses, and revisions. The texts were composed in ScriptLog’s simple editor, which is similar to Notepad in Windows or TextEdit in macOS, and thus does not include spell check. Participants were prompted by short film clips (Berman & Verhoeven, 2002) presenting moral dilemmas related to cheating or stealing. They were instructed to present the problem and discuss what their favourite superhero would do in response to the incidents in the films. Participants were asked to write for a maximum of 30 min. The time spent on the task varied widely: one child gave up after 1.82 min, and another (the only one who exceeded the given time) kept writing for 65 min.

Measures

We report both product measures, that is, characteristics of the final compositions, and process measures, that is, measures related to temporal aspects of the writing process and to error detection and revision.

Product measures

We assessed text length, text quality, vocabulary diversity, and the proportion of misspelled words in the finally edited texts. Prior to quality rating and the calculation of text length and vocabulary diversity, all texts were corrected for spelling. Some words included more than one error, but for the analyses in this study we only used the number of misspelled words. We operationalized text length as the number of words in the final product, which we calculated using LIX.se, a tool similar to textinspector.com but specifically designed for Swedish. To measure vocabulary diversity, we used VocD, a measure that, unlike the type/token ratio, is not sensitive to text length (McCarthy & Jarvis, 2010). It does, however, require a minimum number of words (50), which is why not all texts could be included in this analysis. Text quality (holistic) was assessed by means of comparative judgment (using nomoremarking.com), which has been argued to have better reliability than criterion-based marking (Steedle & Ferrara, 2016; Verhavert et al., 2019). The basic idea is that judges compare two randomly selected texts and decide which is better—again and again, until each text has been compared with a large variety of other texts. Based on the recommendations in a meta-analysis by Verhavert et al. (2019), we let each text constitute the basis for comparison 20 times. The system calculates a quality score between 0 and 100. To ensure acceptable reliability, we used four judges who had not participated in the data collection. Before starting the assessment process, the judges agreed on time slots and on how many assessments to complete during each slot, to prevent fatigue from influencing their judging. They also conducted a pilot study assessing texts that were similar to those in the main study. The reliability measure reported by the system, Scale separation reliability (Footnote 2), was 0.92.
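
As an illustration of how the simpler product measures can be operationalized, the sketch below (ours, purely illustrative; VocD and the holistic quality scores came from the external tools named above and are not re-implemented here) computes text length, the proportion of misspelled words, and the 50-word eligibility check for the vocabulary-diversity analysis. The naive tokenization and the way misspelled tokens are supplied are assumptions.

```python
def product_measures(text: str, misspelled_tokens: set, vocd_minimum: int = 50) -> dict:
    """Illustrative computation of text length and spelling-error proportion.

    `misspelled_tokens` is assumed to hold the word forms judged as misspelled;
    vocabulary diversity (VocD) and holistic quality are not computed here.
    """
    words = text.split()  # naive whitespace tokenization, for illustration only
    n_words = len(words)
    n_misspelled = sum(1 for w in words if w in misspelled_tokens)
    return {
        "text_length": n_words,
        "proportion_misspelled": n_misspelled / n_words if n_words else 0.0,
        "eligible_for_vocd": n_words >= vocd_minimum,  # VocD requires at least 50 words
    }

print(product_measures("Lärarna bör enagera sig i eleverna", {"enagera"}))
```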

Process measures

We examined two types of process measures: (1) those concerning temporal aspects, and (2) those related to error detection and revision. Regarding temporal processing, we report word-internal mean interkey intervals (IKIs), as well as those IKIs that could be considered interruptions, here referred to as pauses. The latter necessitates establishing a pause threshold. This is somewhat problematic because pause thresholds set on the group level are by definition arbitrary. While it is essential to choose a pause threshold relevant to the research question, a two-second threshold has frequently been used in earlier keystroke logging research to mark the point at which an IKI is considered a pause (Strömqvist et al., 2006). This threshold was, for example, utilized in both Wengelin (2007) and Wengelin et al. (2014), and several subsequent studies recording typing or handwriting have adopted it, either by convention or for comparability. Although this practice remains relatively common, variation in threshold choice between studies has become more apparent today. For instance, while Sumner et al. (2014) used a two-second criterion, Rønneberg et al. (2022) employed a one-second threshold within words and a two-second threshold before words. In this article, we focused on word-internal pauses and explored two different thresholds. First, we calculated the number of IKIs within words that lasted 2 s or longer. Notably, two-second pauses are highly unlikely to occur within words during regular typing, even among children (Wengelin & Strömqvist, 2004). Employing a two-second threshold might therefore result in lower recall but higher precision: it may not capture all instances of word-level hesitation, but the pauses it does capture undoubtedly disrupt the general word production flow. Additionally, for comparison with the dataset of Rønneberg et al., which closely resembles our study in terms of recently collected data, keystroke logging usage, a reasonably similar age group, and a transparent orthography, we also present pauses based on a one-second criterion. We will revisit the rationale behind these thresholds in the Discussion section.

For revision behaviour, we focused on the extent to which children with and without reading and writing difficulties detected and corrected word-level errors during written composition, and how frequently this occurred. Two annotators coded all revisions as either word-level revisions (typographical errors and spelling errors) or other revisions (formulation, content, punctuation, or other) and calculated the proportion of word-level revisions for each group. We counted revisions of word-level errors as detected errors, regardless of whether the revision was successful; detected errors plus errors left in the final text comprise the total number of errors. To calculate the detection rate, the number of detected errors was divided by the total number of errors.
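
To make these definitions concrete, the following minimal sketch (ours, not ScriptLog output; the (timestamp, character) input format and all function names are assumptions for illustration) computes word-internal IKIs, counts pauses at the one- and two-second thresholds, and calculates the error detection rate as described above.

```python
from statistics import mean

def word_internal_ikis(keystrokes):
    """Interkey intervals (seconds) between consecutive keystrokes within words.

    `keystrokes` is assumed to be a list of (timestamp_in_seconds, character)
    tuples; intervals bordering a space or newline are excluded as word boundaries.
    """
    ikis = []
    for (t_prev, ch_prev), (t_curr, ch_curr) in zip(keystrokes, keystrokes[1:]):
        if ch_prev.isspace() or ch_curr.isspace():
            continue  # interval crosses a word boundary, so it is not word-internal
        ikis.append(t_curr - t_prev)
    return ikis

def pause_counts(ikis, thresholds=(1.0, 2.0)):
    """Number of word-internal IKIs at or above each pause threshold (in seconds)."""
    return {t: sum(iki >= t for iki in ikis) for t in thresholds}

def detection_rate(detected_errors: int, errors_left_in_text: int) -> float:
    """Detected (revised) word-level errors divided by the total number of errors."""
    total_errors = detected_errors + errors_left_in_text
    return detected_errors / total_errors if total_errors else 0.0

# Toy example: four keystrokes for one word, then the summary measures.
ikis = word_internal_ikis([(0.0, "e"), (0.3, "n"), (2.8, "g"), (3.0, "a")])
print(mean(ikis), pause_counts(ikis), detection_rate(detected_errors=6, errors_left_in_text=2))
```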

Ethical considerations

As already mentioned, the data analysed in the current study were obtained from a larger dataset collected as part of a research project funded by the Marcus and Amalia Wallenberg Foundation (Ref. No. 2014-0122). The data collection was conducted in eight schools in southern and mid-Sweden by the second and third authors. Testing normally took place over two sessions to avoid exhausting the participants, but if a participant still showed signs of fatigue, we interrupted and divided the rest of that person’s data collection into shorter sessions. Ethical approval for the study was granted by the Swedish Ethical Review Authority in Gothenburg (Ref. No. 702-17).

Written assent/consent was obtained from both the participants and their caregivers. Participants were informed that they could withdraw from the study at any time without providing a reason.

Statistical analyses

We used Bayesian methods to analyse the data. These methods allow for evidence in favour of both the presence and the absence of group differences and correlations, because they represent uncertainty about the true value of a parameter using a probability distribution, which can assign non-zero probabilities to both the null hypothesis and alternative hypotheses. Furthermore, Bayesian analyses have been argued to achieve better type I error control than traditional frequentist analyses, because Bayesian methods allow for the incorporation of prior knowledge into the analysis, which can reduce the impact of random noise in the data. For all analyses, we report the Bayes factor for the best model in comparison to the null model (BF10), which quantifies the evidence for the alternative hypothesis over the null hypothesis. For example, if BF10 = 10, the data are regarded as ten times more likely under the alternative hypothesis than under the null hypothesis. In most of the literature, a Bayes factor below 3 is regarded as negligible. In such cases, we only state a lack of evidence for our models. Values between 3 and 10 indicate moderate evidence, and values above 10 indicate strong evidence (Jeffreys, 1961).
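
The following sketch (ours, purely illustrative) simply encodes the interpretation thresholds stated above, so that the Bayes factors reported later can be read against the same labels.

```python
def interpret_bf10(bf10: float) -> str:
    """Qualitative label for BF10, following the thresholds used in the text (Jeffreys, 1961)."""
    if bf10 < 3:
        return "negligible evidence"
    if bf10 <= 10:
        return "moderate evidence"
    return "strong evidence"

print(interpret_bf10(3.45), interpret_bf10(43.37))  # moderate evidence, strong evidence
```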

Because age is not evenly distributed across our three groups, for group comparisons, we performed Bayesian analysis of covariance, ANCOVA (Rouder et al., 2012) in Jamovi, version 2.3 (The Jamovi Project, 2022), using its default priors (r scale fixed effects = 0.5, r scale random effects = 1, r scale covariates = 0.354), on the proportion of word-level errors, error detection rate, frequency of word-internal pausing, vocabulary diversity, and text quality as dependent variables, including group as a fixed factor and age as a covariate for each comparison. The Bayesian ANCOVA works by comparing four models with various predictors of the dependent variable, in our case:

  1. A null model, P(M) = 0.250.

  2. A model containing only group as a predictor, P(M) = 0.250.

  3. A model containing only age as a predictor, P(M) = 0.250.

  4. A model containing both group and age as predictors, P(M) = 0.250.

Note that all four models were set to be equally likely a priori (P(M) = 0.250). Post hoc comparisons between the individual groups are based, in Jamovi, on t-tests with a Cauchy prior of r = 0.707. To account for model uncertainty, we performed Bayesian model averaging to test the effects of both predictors. For relations between different variables, we conducted Bayesian Pearson correlations between the different spelling-related variables—spelling test score, proportion of misspellings in the final text, frequency of word-internal pausing, and frequency of word-level revision—and between each of those and text quality or vocabulary diversity. Under the null hypothesis, we would expect a correlation of 0 between the two variables in any pair in the correlation matrix. The alternative hypothesis is two-sided, and we assigned a uniform prior probability to all correlations between −1 and +1.
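
Our analyses were run in Jamovi; as a rough equivalent for readers who prefer scripting, the sketch below uses the Python package pingouin (an assumption about tooling, not the authors’ workflow) to obtain Bayes factors for a Pearson correlation with a uniform prior over (−1, 1) and for a two-sample t-test with a Cauchy prior of r = 0.707. The t value is a placeholder for illustration; the group sizes match those reported above.

```python
import pingouin as pg

# BF10 for a Pearson correlation, given an observed r and sample size.
# pingouin's default kappa = 1 corresponds to a uniform prior on the correlation over (-1, 1).
bf_corr = pg.bayesfactor_pearson(r=0.64, n=47)

# BF10 for a two-sample t-test (post hoc group comparison) with the default
# Cauchy prior scale r = 0.707; t = 2.5 is a placeholder value, not a reported result.
bf_ttest = pg.bayesfactor_ttest(t=2.5, nx=16, ny=15)

print("BF10 (correlation) =", bf_corr, "| BF10 (t-test) =", bf_ttest)
```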

Results

Since we are interested in whether and how process data can contribute to the assessment of spelling difficulties, we will first compare the product data, that is, the characteristics of the texts produced by the participants, and then the process data. Finally, we will present correlational analysis as follows: (a) correlations between the test results and the product data, (b) correlations between the process data and the product data, and (c) correlations between word-processing variables: spelling test, word-decoding test, proportion of misspelled words in the texts, word-internal pausing, error detection rate, and spelling revision.

Comparing the text characteristics of the three groups

The results for the text characteristics are detailed in Table 2. For vocabulary diversity, the evidence for any effect was negligible; however, as already mentioned, only a subset of the texts was included, and the results must be interpreted with great care. The typically developing children outperformed the other two groups on the other three variables. For text length and text quality, the Bayesian analysis of covariance indicated strong evidence for the model based on Group only. For the proportion of spelling errors, we found the strongest evidence for the model based on Group + Age (BF10 = 43.37), but we also found strong evidence for the model based on Group only (BF10 = 38.29). The averaging analysis of the two predictors indicated that the data were more likely under models containing Group as a predictor than under models including Age. In all three cases, post hoc tests indicated that typically developing children were likely to produce better results than both of the other groups. Only for text length were children with mainly spelling difficulties likely to perform better than the group with both reading and spelling difficulties.

Table 2 Text characteristics, typically developing children, children with mainly spelling difficulties and children with reading and spelling difficulties. The best models are shown in bold font

Analyses of the process variables

Results for the process variables are presented in Table 3. For the variables related to temporal processes, the Bayesian analysis of covariance indicated the strongest evidence for the model based on Group + Age, but the evidence for pauses longer than one second was negligible.

For the mean duration of word-internal interkey intervals (IKIs), there was strong evidence for the model based on Group only, but the averaging analysis of the two predictors indicated that the data were more likely under models containing Age as a predictor than under models including Group. However, post hoc tests for Group indicated moderate evidence (BF10 = 8.863) for a difference between the typically developing children and the children with both reading and spelling difficulties, but not between either of these two groups and the group with only spelling difficulties. For word-internal pauses longer than 2 s, the evidence for models other than the one based on Group + Age was negligible, and once again, the averaging analysis of the two predictors indicated that the data were more likely under models containing Age as a predictor than under models including Group. Moreover, post hoc tests for Group indicated only negligible effects.

Regarding error detection and revision, the Bayesian analysis of covariance indicated strong evidence for both the model based on Group only and the model based on Group + Age for error detection, but the averaging analysis of the two predictors indicated that the data were more likely under models containing Group as a predictor than under models including Age. Post hoc tests indicated that the odds of detecting errors were much better for the typically developing children than for those in the two groups with spelling difficulties. With that in mind, it was somewhat surprising to find that the probabilities of any effects of Group or Age on revision frequency or the proportion of word-level revision were more or less negligible. Instead, we found moderate evidence for the null hypothesis.

Table 3 Process characteristics, typically developing children, children with mainly spelling difficulties and children with reading and spelling difficulties. The best models are shown in bold font

In sum, the probabilities of differences between the groups are small in most cases. Although the typically developing children appear to process words faster than the children with both reading and spelling difficulties, age appears to be a more important predictor of both these variables and of word-internal pausing. However, visual inspection of the related scatterplots (Fig. 1) may add some food for thought. The leftmost scatterplot illustrates the mean durations of the word-internal IKIs and indicates that while word-processing speed appears to increase with age for all three groups, the distance between them is relatively stable. The scatterplot for word-internal pausing (> 2 s), on the other hand (the one in the middle), indicates that as the children become less dysfluent with age, the three groups move closer to each other. But although the differences between them decrease accordingly, the participants with both reading and spelling difficulties—who are also the poorest spellers in relation to their ages—stand out as the most dysfluent. For most age levels, this group demonstrates the highest number of word-internal pauses > 2 s. We will return to this in the Discussion section. The probability that children detect and correctly revise their errors is, on the other hand, much higher for the typically developing children. Interestingly, the children with only spelling difficulties seem unlikely to outperform their peers with both reading and spelling difficulties in speed (mean duration of word-internal IKIs), (word-internal) pausing, and revision. However, for error detection rate, the patterns look completely different. While typically developing children and the children with mainly spelling difficulties appear to improve their ability to detect and revise errors, the children with both reading and spelling difficulties demonstrate no such tendencies.

Fig. 1 Scatterplots: mean duration of word-internal IKIs, word-internal pausing, and error detection rate for typically developing children (0), children with mainly spelling difficulties (1), and children with both reading and spelling difficulties (2)

Bayesian Pearson correlations

Our sample is too small for reliable reports of correlations for each group, and therefore, we will focus mainly on correlations for the whole sample. We did, however, also run groupwise correlations, to explore whether some correlations for the whole group may have been driven by group differences rather than by a linear relation, and we will touch upon a couple of these.

We first report correlations between the spelling test and the word decoding test on the one hand and the text characteristic variables on the other. Bayesian correlation pairs are shown in Table 4. We found no evidence for any correlation with vocabulary diversity and have excluded that variable from the table. Not surprisingly, the evidence for a strong correlation (r = −0.735) between the tests and the proportion of misspelled words is very high (BF10 = 3.43e+6). Furthermore, we found very strong evidence, with Bayes factors around 1000, for moderate to strong correlations (r ≈ 0.5) between both decoding and spelling skills on the one hand and text quality on the other. Finally, we also found strong evidence for correlations between decoding skills and text length.

Table 4 Correlations between test variables and product variables

However, individual variation within the groups is large, and when we looked at text length and text quality for the three groups separately, only one of these correlations held, and only for a single group. The evidence was relatively strong (BF10 = 13.62) for a correlation (r = 0.68) between spelling test results and text quality for the group with mainly spelling difficulties.

We then analysed correlations between spelling process related variables and the two higher-level text characteristics for the whole sample and found only negligible results.

Finally, we carried out correlational analyses between the different word-level variables: test results, proportion of misspelled words in the texts, error detection rate, word-internal pausing, and duration of word-internal IKIs. Since no effects were found in the comparative analyses for revision frequency or the proportion of word-level revision, these variables were excluded from the correlational analyses.

For word-internal pausing, we chose to include the variable with the strongest results from the comparisons, which was word-internal pausing > 2 s. Numerous correlation models with very large Bayes factors were found for correlations involving test results and word-processing variables from the writing data. Table 5 shows the Bayesian correlation matrix, and Fig. 2 shows the correlation plots. Not surprisingly, the spelling test results appear to correlate with both the word decoding test (r = 0.69) and the proportion of misspelled words in the text (r = −0.74), indicating that poor spelling and poor word decoding go hand in hand. Furthermore, those who receive low results on a spelling test also produce many spelling errors during composition. Error detection rate appears to correlate with all three of these variables (r = 0.64 for the spelling test, 0.41 for the word decoding test, and −0.80 for the proportion of misspelled words in the finally edited texts). This suggests that one of the reasons why poor spellers leave many errors in their texts is that they do not detect them or, alternatively, do not know how to revise them. However, although relatively strong, the evidence for the correlation between error detection rate and word decoding is considerably lower than for the other models (BF10 = 10.018).

Fig. 2 Correlation plots for word-processing related variables

Regarding the temporal process variables, there is strong evidence for correlations between the mean duration of word-internal IKIs and all the other variables, while for word-internal pausing there is evidence only for a correlation with the proportion of misspelled words (r = 0.444). In other words, those who hesitate more within words also tend to make many spelling errors. As with the correlation between word decoding skill and error detection rate, the evidence for this model is lower than that for the others, but still with a Bayes factor of 20.7, meaning that the data are about 20 times more likely under this model than under the null model. However, as shown in the correlation plot for word-internal pausing, the participants are scattered around one end of the regression line. Moreover, when we conducted the Bayesian correlation analyses with this variable for each group separately, we found no evidence for such a correlation, so the result should be interpreted with care.

Table 5 Bayesian Pearson correlations for word-processing related variables

Discussion

In this study, we set out to deepen our understanding of how spelling data derived from children’s composition processes can add to what traditional spelling test results and observations of free writing samples reveal about spelling difficulty in children with reading and writing difficulties. We were interested in whether such analyses could contribute to the assessment of spelling. Our primary focus was on (a) spelling difficulties in general and (b) the role of reading difficulties in combination with spelling difficulties.

Our first question aimed to determine whether 10–13-year-old Swedish children with spelling difficulties exhibit comparable hesitation and avoidance behaviours to those observed in adults and adolescents in previous studies. The succinct response to this query appears to be negative, or at least not to a significant extent. While typically developing children outperformed others for all product variables except vocabulary diversity, we found no indications that the groups with spelling difficulties displayed distinct patterns of revision or pauses exceeding one second. For both mean word-internal IKIs and word-internal pauses > 2 s, the best model was the one based on Group + Age, and the averaging analysis of the two predictors indicated that data were most likely under models containing Age as a predictor. These results contrast with those of Sumner et al. (2013), who showed that children with dyslexia made more word-internal pauses during handwriting. Since our group of children was based on spelling difficulty in a broad sense, these results are not too surprising. A very tentative and premature interpretation of these results could be that it is not any spelling difficulty that drives dysfluency in writing, but rather an underlying disorder, such as dyslexia. However, more research and larger samples would be needed to support that notion.

The effect of age was also clearly visible in the scatterplots in Fig. 1. Within each group, the speed of word-level processing, as shown by the word-internal IKIs, increased, and the number of long word-internal interruptions, in terms of pauses exceeding 2 s, decreased. This could, to a certain extent, be taken to support the claim by Rønneberg et al. (2022) that word-internal pausing did not distinguish weaker spellers from stronger ones among typically developing children of the same age. It does, however, also raise the question of what constitutes a dysfluency, since they only measured pauses > 1 s within words. Like these authors, we found no evidence for either age or group effects on 1-s pauses. The suggested explanation for the results of Rønneberg et al. was that perhaps the participants in their study had already automatized typing and spelling. This interpretation seems reasonable in the light of the behaviour of the typically developing children in our data. However, for the children with spelling difficulties, automatization appears to happen considerably later. While the gap between the groups narrowed considerably with age for pauses > 2 s, the gaps between the groups for mean word-internal IKIs appeared to remain more or less consistent. These findings have the potential to support the idea that fuzzy representations of words (Perfetti, 2007; Sénéchal et al., 2016) diminish with age in typically developing children but persist, or at least decrease at a slower rate, for children with spelling difficulties.

Our second inquiry aimed to investigate the extent to which reading difficulty influenced the observed patterns. Interestingly, we found evidence for only one clear distinction between the two groups with spelling difficulties, and that related to text length. Children with mainly spelling difficulties produced considerably more text than those with both spelling and reading difficulties. In fact, the subcorpus based on texts by children with mainly spelling difficulties was about twice the size of that by children with both reading and spelling difficulties. The explanation for this is unclear. Since the children with both reading and spelling difficulties were older than those with mainly spelling difficulties, we certainly did not anticipate this difference. A potential explanation could have been a high degree of dysfluency, but as already discussed, our analyses provided only moderate evidence for this, primarily in terms of longer word-internal IKIs, and very little effect for pausing or revision. Another possibility is that the children who demonstrated both reading and spelling difficulties, and thus were more “dyslexic-like” than those with mainly spelling difficulties, had other underlying language difficulties and/or, as suggested by Torrance et al. (2016), less prior knowledge due to limited print exposure.

Of particular interest for the distinction between children with mainly spelling difficulties and those with both reading and spelling difficulties was of course the question of error detection. Surprisingly, while the evidence was strong for the typically developing children outperforming the children with spelling difficulties, we found no statistical evidence that reading difficulty played a role on the group level. However, it may be worth directing some attention to the rightmost scatterplot in Fig. 1. Although visual observations of scatterplots should never substantiate any scientific claim, this plot does seem to tell a story that merits further research. While error detection seems to improve with age for both typically developing children and the group with mainly spelling difficulties, this improvement does not appear to be evident for the group with both reading and spelling difficulties. To our knowledge, this has not been investigated before.

Our third question related to the correlations between variables. As already mentioned, and in accordance with previous research, we found strong evidence for a correlation between spelling test results and text quality. Furthermore, despite not finding any group differences in error detection between the groups with spelling difficulties, we did observe some evidence for a correlation between the word decoding test and error detection. Therefore, we are not yet prepared to rule out the influence of reading difficulty on error detection, but should rather re-evaluate our inclusion and exclusion criteria. Is there also a relation between process and text characteristics? Whereas we did find strong evidence for a relation between test data and text characteristics, we found no correlations between the process data and the product data for these young writers, except for the correlation between the percentage of spelling errors and the duration of word-internal IKIs. This indicates that spelling does indeed influence the process but, at least for the children in this study, without consequences for the text characteristics. Hence, once again our results support the claim by Rønneberg et al. (2022) that word-level dysfluency does not necessarily impede higher-level processes of writing in young writers—with or without reading and writing difficulties. We found this result slightly surprising, yet hopeful. It could mean that, on a general level, these children have not (yet?) developed the types of writing behaviour demonstrated by Philip and described by the participants of Reynolds and Wu (2018), and thus that this type of stigma can be prevented.

Finally, we return to our final question and the title of this paper. What can writing-process data add to the assessment of spelling difficulties? At first sight: perhaps very little—at least for these young children. We have not been able to show that children with spelling difficulties consistently demonstrate process patterns that predict their difficulties better than a traditional spelling test. Is the problem of cognitive overload, caused by difficulties with spelling and other lower-order processes, then overstated, as indicated by Torrance et al. (2016)? Perhaps, and that is encouraging. It could be that writing instruction has improved, that better writing tools have become more accessible, or that something else has changed the conditions for writing development. However, as already mentioned, the longer mean durations of the word-internal IKIs do indeed suggest that children with reading and spelling difficulties process words more slowly than the typically developing children, even if they do not make long word-internal pauses, and this could support the theory of fuzzy representations (cf. Perfetti, 2007; Sénéchal et al., 2016) rather than conscious hesitations and coping strategies. On the other hand, the fact that not all the gaps between the groups appear to narrow, and that the group with both reading and spelling difficulties shows different profiles for some variables while simultaneously being older than the others, indicates that there could still be a definable group out there to identify in order to prevent stigmas related to writing, even if we did not manage to capture that in our current analyses. More research is needed on larger groups, other age spans, more orthographies, and the effects of writing technologies to support struggling writers. One of several limitations of our study is the limited number of participants in each group.

In the Swedish context, the transition between grades 6 and 7 (ages 12–13, incidentally the age of many of the poorest writers in our sample) should most likely be the focus of our next study. In this transition, children leave upper elementary school and move to secondary school, which not only constitutes a different school form but is also frequently located at a different site, with new teachers—who typically expect that the elementary school teachers have “fixed” their pupils’ reading and writing skills.

Moreover, during our explorations and analyses of the data, we noticed that there are individuals with weak spelling test results who manage to produce texts with very few errors, and vice versa (see Fig. 2). In fact, a couple of students with spelling and decoding difficulties, as assessed by the tests, produced texts that were completely free from spelling errors. During the analyses of our data, we also noticed how some children were very successful in their spelling revisions while revisions by others rather made the text worse, and how some children succeeded in revising certain types of spelling errors but not others. While process data may not yet be the best screening tool for these ages, slow word writing may be a first warning, and a signal to educators to look out for and attempt to prevent more dysfluent writing processes. As shown by Afonso et al. (2020), Spanish children with dyslexia in the same age span as our participants displayed similar patterns of slow word processing during handwriting. For a review of the relation between processes in handwriting and typing, see Feng et al. (2019). These types of measures can relatively easily be detected by means of keystroke logging or handwriting recordings. We suggest that researchers and educators alike embrace qualitative analyses of individual cases and use these types of information to gain valuable insights into the formative assessment (cf. Skar et al., 2022) of the writing processes of struggling students, to acquire knowledge of where and when bottlenecks can occur, and to foster more effective and targeted interventions. While these suggestions are language independent, we also encourage continued research on spelling processes and their relation to higher-level processes in different orthographies.