Selective attention is the process of directing awareness toward a subset of relevant information in the environment while ignoring or suppressing irrelevant information (Johnston & Dark, 1986). There are a variety of tasks used across the literature to measure selective attention, with each task targeting a specific aspect or type of selective attention, such as the selective attention of the visual system (Theeuwes, 1993). One common task used to measure visual selective attention is the flanker task (B. A. Eriksen & Eriksen, 1974). The original flanker task was designed to measure the influence of visual similarity and competition on response selection: participants responded to a central target stimulus with flanking stimuli that varied in their similarity to the target. Reaction time was used to estimate the influence of the flanking distractors depending on whether they were mapped to the same or different response as the target (B. A. Eriksen & Eriksen, 1974). In the decades since its inception, the task has become one of three major paradigms used to measure selective attention in vision, along with the Simon and Stroop tasks (Ridderinkhof et al., 2021). Most studies using the flanker task analyze congruency effects, in which responses are slower and more error-prone for incongruent flankers than congruent or no/neutral flankers, as indexes of selective attention (e.g., B. A. Eriksen & Eriksen, 1974; Hedge et al., 2018; Mullane et al., 2009). Beyond its use in basic research to develop and test theories of attention, this task has been used to investigate behavioral and/or neural differences in selective attention in a wide range of samples and/or under different conditions, as shown in Table 1.

Table 1 Summary of diagnoses and experiential contexts studied with flanker paradigms in humans, derived from a search for “flanker” in Web of Science (April 2021)

The broad implementation of the flanker task to assess selective attention across these samples and conditions has led to numerous modifications of the task details, including the types of visual stimuli, the timing of events, the modality of response, and the inclusion of additional stimuli or tasks. For example, the Attentional Network Task (Fan et al., 2002) used arrows for stimuli instead of the original letter stimuli (B. A. Eriksen & Eriksen, 1974) and incorporated visual pre-cues that might or might not signal the location in which the stimuli would appear (above or below central fixation). Most variations of attentional tasks were developed and used primarily with young adults in university settings, but have since become popular for use with children as well, especially following the adaptation of Fan et al.’s (2002) Attentional Network Task for 6- to 10-year-olds (Rueda, Fan, et al., 2004a), and further adaptation for inclusion in the National Institutes of Health (NIH) toolbox for 3- to 15-year-olds (Zelazo et al., 2013). For this paper, we were interested in comparing young children’s performance on a flanker paradigm to other measures of attention, leading us to focus on tasks used with children aged 7 years or younger. However, in developing and evaluating our task, we found inconsistencies in methods and analyses across prior studies that led us to further explore the flanker paradigm. Although we focus primarily on tasks modified for different age groups, the implications of our investigation are relevant for changes in tasks that are motivated by any reason—for example, reducing total time on task, using the apparatus most readily available in the research setting, or adjusting timing or response modality to accommodate the inclusion of neural measures.

Modifying cognitive tasks from their original design for adults to be used with children may seem straightforward; however, these modifications present significant challenges. Age-related changes in how children understand and follow instructions, perceive and process stimuli, and plan and execute responses play a major role in task design. When modifying tasks for use with younger ages than originally designed, researchers typically construct a playful “cover story” to help explain the task's goals and keep children motivated. For example, Rueda, Fan, et al. (2004a) changed arrow stimuli to fish (with arrow-like markings pointing toward the head) and asked children to feed the center fish by pressing a button corresponding to the direction the fish was swimming rather than asking them to identify the direction the central fish was pointing. They also slightly increased the duration of the pre-cue presentation and inter-stimulus interval and added feedback following the participant’s response. To ensure that their modified task yielded results comparable to the original, Rueda, Fan, et al. (2004a) included an experiment in which 10-year-olds and adults completed both the modified task and a version with arrow stimuli (similar to the original but retaining the modified timing and feedback from their modified task). Results with both age groups showed the expected pattern of congruency effects (as well as orienting and alerting effects from the pre-cues, which are not part of traditional flanker tasks) in both versions, indicating its similarity to the task from Fan et al. (2002).

Rueda, Fan, et al. (2004a) is the most-cited version of the flanker paradigm used with children, with over 800 citations in Web of Science as of January 2022. However, 24 other papers with more than 100 citations included original data collected with children using a flanker paradigm, 14 of which sampled children 7 years or younger. We reviewed the methods of these 14 papers to evaluate whether they modified their methods from prior papers, and if so, whether they tested congruency effects in older children or adults to verify that the task remained comparable to others in the literature. Table 2 summarizes our findings from this comparison. We found that each of these studies changed their method in various ways from their cited source for the method (detailed further in Table S1 in the supplement), but only four analyzed data from a group of older children or adults directly to evaluate whether these samples showed traditional congruency effects.

Table 2 Comparison of methods and analyses from top-cited papers testing young children in flanker tasks

One assumption underlying these child-friendly adaptations of cognitive tasks is that, if the adapted tasks show the expected congruency effects in young children, then the relations between the behaviors and cognitive processes being studied show developmental continuity. Our review of these papers shows that this assumption could be evaluated directly by including groups of older children or adults to compare tasks, but this is rarely done. Here we present two studies that show that this assumption may not be valid and highlight the need to ensure that tasks function similarly across participants of different ages. In particular, we show that tasks using different stimulus types (direction versus color) and response modalities (pressing a key on a keyboard versus touching a screen) do not similarly elicit congruency effects in early versus middle childhood and adulthood. Differences between early and middle childhood are especially important to consider because the transition between these developmental periods (age 6–7 years) typically coincides with changes in school structure (e.g., from kindergarten to first grade in the United States), which may equip older children with more standardized academic knowledge and strategies to apply to lab tasks, and may lead children to expect that there is a “right” or “wrong” way to approach a task. This is also an age with increasing diagnoses of learning disabilities that show corresponding differences in visual attention (e.g., attention-deficit/hyperactivity disorder; Kieling et al., 2010). Diagnoses that correlate with visual attention motivate efforts to provide early assessments to identify performance patterns as risk factors before the typical age of diagnosis, but this requires equating tasks between early and middle childhood.

Our original goal in this study was to derive a sufficiently engaging method for young children (2–6 years of age) to complete enough trials for inclusion with other tasks in a study mapping developmental trajectories among multiple cognitive constructs. However, the exploratory nature of our approach resulted in task type being confounded with children’s age and prior experience. To eliminate this confound, we next investigated how our tasks compared when tested within a single session, in counterbalanced order, in groups of older children (6–10 years of age) and young adults (19–23 years of age). In addition to addressing congruency effects in variations of the flanker task in different age groups, our studies included Bayesian analyses and explored a new measure for combining response correctness and reaction time (RT) as an alternative to traditional analyses that use RT on correct-response trials only. This traditional analysis led to excluding a large number of trials and participants due to overall poor performance by young children, leading us to use a measure that incorporated both correctness and RT into a single score per trial, a modified version of signed residual time (Maris & van der Maas, 2012; see also Tijmstra & Bolsinova, 2021). As we further describe below, using this measure allowed for the inclusion of more trials and participants, as it did not require excluding incorrect trials from analyses (as is typical in analyses of RT; e.g., Rueda, Fan, et al., 2004a). This measure also allows us to take into account trial-level relations between accuracy and RT; furthermore, modifying the measure to use RT deciles within each task and participant group reduces the influence of group- or task-based differences in RT distributions, providing a clearer test of congruency effects within each group and task.

Across all measures, we report traditional frequentist analyses to provide a straightforward comparison with prior studies that reported frequentist analyses, but also include Bayesian analyses to evaluate the strength of evidence for or against the null hypothesis. We chose to include Bayesian analyses because they can be especially informative for developmental studies when variability between participants is high relative to the size of the sample (van de Schoot et al., 2014), which can lead to nonsignificant frequentist results that are difficult to interpret (Dienes, 2014). Additionally, the relative strength of evidence for the null versus alternative hypotheses inherent in Bayesian analyses can provide further insight into analyses of noisy data and more straightforward reporting of uncertainty (e.g., Marsman & Wagenmakers, 2017; van de Schoot et al., 2014). We found the quantification of evidence to be especially valuable across our studies when we were more concerned with identifying the presence or absence of congruency effects rather than characterizing the magnitude of those effects (as could be done using effect sizes in frequentist analyses).

Our first experiment tested young children in a brief flanker task using direction (fish, as in Rueda, Fan, et al., 2004a) as the stimulus dimension. Approximately one year later, we tested a subset of these children in a new flanker task using color stimuli on a touch screen. Our design was partly motivated by a study showing similar congruency effects in young children (3–6 years of age) across tasks with direction, shape, or color stimuli (McDermott et al., 2007). The primary goal of this study was to find a flanker task that elicited congruency effects (i.e., incongruent flankers leading to slower responding and higher error rates than neutral or congruent flankers) in young children, while being brief enough for inclusion in a larger battery of cognitive tasks.

Experiment 1a

Method

Participants

Sixty-nine children participated in this experiment. Children were recruited through participant databases at a small liberal arts college (Ripon College) and a large state university (University of Wisconsin [UW]–Madison) in the Midwestern United States. Of the children who participated, 21 were excluded from analyses: three ended early (two voluntarily, one due to equipment failure), four did not meet the criterion for understanding the task (see Method of analysis), and 14 did not contribute enough trials in each type for analysis (see Method of analysis). The final sample included 48 children (age range: 2.65–5.91 years, M = 4.95, SD = 0.81; 21 girls, 27 boys); 14 participated at Ripon College and 34 at UW–Madison. Individual demographic data were not collected for all participants; the communities from which we recruited were primarily White, monolingual English-speaking, and middle- to upper-middle-class. Caregivers who provided individual demographic data (50 participants from the full sample) reported their children identifying as follows: White/non-Hispanic (96%) or White/Hispanic (4%) for race/ethnicity; only English spoken in the home (90%) or English plus one or more additional languages (10%) for language background; one or more parents holding a 4-year degree or higher (90%), one or more parents completing a 2-year degree or some college (8%), or both parents having completed high school (2%) for parental education.

Prior to participation, parents gave written consent, and children gave oral assent. Families received a small prize (Ripon College) or $10 monetary compensation (UW–Madison) following their participation; the differential compensation was due to the UW–Madison participants completing two visits to the lab instead of one (described further in the Supplement). The respective institutional review boards reviewed and approved all procedures prior to participant recruitment.

Materials and procedure

After obtaining consent and assent, an experimenter led the participant to the testing space at a standard office desk within a windowless laboratory room. If the child requested that a caregiver accompany them, the caregiver sat in a chair along the back wall of the room and was asked not to prompt their child during participation. Children sat in an adjustable chair to allow approximately the same viewing distance and angle for children of different sizes. The approximate height from the floor to the child’s head was 114 cm. The task was administered on a 15.6″ (39.62 cm) widescreen Dell Latitude 5580 laptop (Ripon College) or a 15.4″ (39.12 cm) widescreen Dell Latitude E6500 laptop computer (UW–Madison), with children seated approximately 45 cm from the screen (see Fig. 1). Participants entered their responses using the left and right arrow keys on the laptop at UW–Madison (see Fig. 1a); at Ripon College, the laptop was equipped with an external numeric keypad on which the “4” and “6” keys were marked with white tape showing left and right arrows, respectively, used for responses in the task (see Fig. 1b).

Fig. 1
figure 1

Experiment testing room setup at (a) University of Wisconsin–Madison and (b) Ripon College. Note that the white stickers on the laptop (a) and keypad (b) show arrows pointing left and right for the task response entry

Children completed a variety of tasks designed to assess visual selective attention and other cognitive processes (Andrews et al., 2018); here, we focus only on the flanker paradigm (see Supplement for list of tasks at each site). The flanker task was adapted from no-cue trials in Rueda, Fan, et al. (2004a) for presentation in Python (version 2.7.5) using the PsychoPy library (version 2-1.90.2; Peirce et al., 2019). Figure 2 shows a schematic of the stimuli across trial types. On each trial, a fixation cross appeared in the center of the screen for 100 ms, followed by a delay of 100–300 ms (randomly selected from a uniform distribution across trials), then one of the three fish arrays shown in Fig. 2, which remained on the screen until the participant pressed a response key. The center fish (target) was centered vertically and horizontally on the screen; each fish or horizontal line (on neutral trials) subtended an approximately 1.9° visual angle and was separated by 0.25°. Following the keypress, written feedback (“correct” or “incorrect”) appeared on the screen for 750 ms. The experimenter provided verbal encouragement regardless of the feedback (most children in this sample could not read the feedback). Following the feedback was a 100 ms delay, and then the fixation cross appeared to begin the next trial.

Fig. 2
figure 2

Schematic of the stimuli across trial types for the direction flanker task. A fixation cross appeared in the position of the center fish at the beginning of each trial to signal the location of the target (see text for details of timing and spacing)

Before beginning the task, the experimenter first modeled the task using a PowerPoint presentation, allowing the experimenter to control the timing and reverse through trials if needed to facilitate children’s understanding. The presentation demonstrated the three trial types (neutral, congruent, incongruent, shown in Fig. 2), and the experimenter modeled responses on the keyboard or keypad. The experimenter explained the task as follows (with accompanying gestures in brackets): “Now we are going to play a game with fish. In this game, we need to look at the fish in the middle [pointing to the middle fish] and see which way he is swimming. When you know which way the fish is swimming, you can press one of these two keys to tell the computer which way the fish is swimming! [gesture to the two marked keys] If the fish in the middle is swimming this way [gesture to the middle fish], you press this key [gesture to the correct key]. If the fish in the middle is swimming this way [gesture to the middle fish], you press this key [gesture to the correct key]. See how there are other fish here too? It’s your job to only look at the middle one and press the key that is pointing the same way that the fish is swimming. OK? Sometimes, you will see only one fish. The directions are the same; just tell the computer which way the fish is swimming.” Note that we avoided using the terms “left” and “right” to describe the fish directions and response keys because young children may not yet reliably produce and understand these spatial words (e.g., Dessalegn & Landau, 2008), and we did not want performance to be dependent upon spatial language vocabulary.

Following this explanation, children began the computerized task presented using PsychoPy on the laptop. At the beginning of the computer task, a fish target appeared in the center of the screen, and the experimenter said, “Ok, are you ready to play the game? Remember, as soon as you know which way the middle fish is swimming, press the key that is pointing the same way as fast as you can! Which way is this fish swimming?” After pausing for the child to respond, the experimenter said, “Press the key that is pointing the same way,” and encouraged the child to use the keys if needed. This response was not recorded, as it was intended just to ensure the child had identified the keys to press and would respond. Next, the experimenter pressed the enter key to begin the test trials, and successive trials began automatically following each response from the participant as described above. Children completed 18 test trials, evenly divided among the three types (i.e., six congruent, six incongruent, and six neutral) presented in random order, with the direction of the target randomized within each type. Although this number of trials is lower than typically used with adults, it is similar to the number used for analysis by Zelazo et al. (2013; 20 trials total, with 13 congruent and 7 incongruent); we used this as a starting point to see whether it would be sufficient to detect congruency effects in this age group. Participation in the flanker task, including explanation and test trials, took 3–5 min for most participants. Following the flanker task, children completed additional cognitive tasks as listed in the Supplement but not reported here. After the session, the experimenter recorded a qualitative report if the participant’s understanding or engagement in each task was atypical or unexpected (see Supplement for examples). The experimenter also recorded general observations from the session that may have impacted the participant’s experience during the session (e.g., “construction noise outside”). Sessions were video-recorded by a camera placed in the corner of the room behind the participant, out of their view.

Method of analysis

Following the most common method of analysis in prior studies (e.g., Rueda, Fan, et al., 2004a), we analyzed accuracy and mean RT (in seconds) on correct-response trials for evidence of conflict between the target and incongruent flankers (i.e., more errors and slower RT relative to congruent and/or neutral trials). We first screened for participants’ understanding of and compliance with the task instructions by considering their number of correct responses on neutral trials. We reasoned that neutral trials should be easiest for children to complete correctly due to the absence of flankers. We chose three correct trials (50%) as the minimum to meet the criterion of understanding the task, which led to the exclusion of four participants (see Participants section). Next, to ensure sufficient data per participant for the analyses of RT on correct-response trials, we used a minimum of three trials per type as a criterion for inclusion in these analyses, which led to the exclusion of 18 participants (see Participants section). As noted in the introduction, we conducted both traditional frequentist analyses for ease of comparison with prior studies and Bayesian analyses to provide a more nuanced evaluation of evidence for or against each factor’s inclusion in the models that fit the data best.
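
To make these criteria concrete, the following is a minimal sketch of the screening in R, assuming a hypothetical trial-level data frame `trials` with columns `subject`, `trial_type` ("congruent", "incongruent", "neutral"), and `correct` (0/1); the column names and structure are illustrative, not the actual analysis code.

```r
library(dplyr)

# Criterion 1 (task understanding): at least 3 of 6 neutral trials correct
understood <- trials %>%
  filter(trial_type == "neutral") %>%
  group_by(subject) %>%
  summarise(n_correct = sum(correct)) %>%
  filter(n_correct >= 3) %>%
  pull(subject)

# Criterion 2 (RT analyses): at least 3 correct trials of every type
enough_trials <- trials %>%
  group_by(subject, trial_type) %>%
  summarise(n_correct = sum(correct), .groups = "drop") %>%
  group_by(subject) %>%
  summarise(min_correct = min(n_correct)) %>%
  filter(min_correct >= 3) %>%
  pull(subject)

# Sample retained for the RT analyses
rt_sample <- trials %>%
  filter(subject %in% understood, subject %in% enough_trials)
```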

Our first research question was whether accuracy was lower on incongruent trials relative to neutral trials and whether that effect varied by age, which we addressed with logistic regression analyses. We conducted logistic regression analyses on accuracy in the RStudio software environment (Version 1.4.1106; RStudio Team, 2021). For the frequentist analyses of accuracy, we used the glmer function in the lme4 package (Bates et al., 2015). We fit a generalized linear mixed-effects model using response correctness per trial as the outcome variable, with age (mean-centered), trial type (neutral as the reference level), and the interaction between age and trial type as predictors, and with a by-subject random intercept. For the Bayesian analyses of accuracy, we fit a Bayesian generalized linear mixed-effects model with a logit link function predicting accuracy using the stan_glmer function in the rstanarm package (Goodrich et al., 2019). We fit the model using Markov chain Monte Carlo sampling with response correctness per trial as the outcome variable, with age (mean-centered), trial type (neutral as the reference level), and the interaction between age and trial type as predictors. A by-subject random intercept and a by-subject random slope for the effect of trial type were also included, with weakly informative Student’s t priors specified. We used four chains with 5000 iterations each (the first half of each chain was used as warm-up); the posterior distributions are therefore based on 10,000 iterations. The initial models did not converge, so we followed the recommendations outlined in Brauer and Curtin (2018) to achieve convergence for each analysis. The first model to converge included only the by-subject random intercept.
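
A minimal sketch of these two models in R, assuming the same hypothetical `trials` data frame with an added mean-centered age column `age_c`; the priors shown are illustrative stand-ins for the weakly informative Student’s t priors described above.

```r
library(lme4)
library(rstanarm)

# Neutral trials as the reference level for the trial type factor
trials$trial_type <- relevel(factor(trials$trial_type), ref = "neutral")

# Frequentist GLMM: by-subject random intercept only
m_freq <- glmer(correct ~ age_c * trial_type + (1 | subject),
                data = trials, family = binomial)

# Bayesian GLMM via MCMC; the first model to converge likewise retained
# only the by-subject random intercept
m_bayes <- stan_glmer(correct ~ age_c * trial_type + (1 | subject),
                      data = trials, family = binomial(link = "logit"),
                      prior = student_t(3, 0, 2.5),      # illustrative scale
                      prior_intercept = student_t(3, 0, 2.5),
                      chains = 4, iter = 5000)           # first half = warm-up
```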

Our second research question was whether RT on correct-response trials was slower on incongruent trials than other types and whether this effect varied by age, which we addressed with repeated-measures analyses of variance (ANOVAs). We analyzed mean RT on correct-response trials in JASP (version 0.16.1; JASP Team, 2022) first using a frequentist repeated-measures ANOVA (for which JASP cites Conover, 1999; Field et al., 2012; Loftus & Masson, 1994; Maxwell, 1980; Morey, 2008) with trial type (congruent, incongruent, neutral) as a within-subjects factor and age as a continuous covariate. We checked the sphericity assumption using Mauchly’s test and, if necessary, applied Greenhouse–Geisser correction. When relevant, post hoc comparisons on trial type effects were conducted using the pooled error term and Bonferroni correction for multiple comparisons. This analysis used the following R packages in JASP: afex, boot, emmeans, ggplot2, plyr, and stats (JASP Team, 2022).
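
Because JASP’s repeated-measures ANOVA builds on the afex package (listed above), an approximately equivalent R sketch is possible; it assumes a hypothetical `rt_means` data frame with one mean RT per subject and trial type (columns `subject`, `trial_type`, `age`, `mean_rt`).

```r
library(afex)
library(emmeans)

# Repeated-measures ANOVA with trial type within subjects and age as a
# continuous covariate
rt_anova <- aov_ez(id = "subject", dv = "mean_rt", data = rt_means,
                   within = "trial_type", covariate = "age",
                   factorize = FALSE)  # keep age continuous
summary(rt_anova)  # reports Mauchly's test and Greenhouse-Geisser correction

# Post hoc comparisons on trial type, Bonferroni-corrected
pairs(emmeans(rt_anova, "trial_type"), adjust = "bonferroni")
```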

We also analyzed mean RT on correct-response trials in JASP using a Bayesian repeated-measures ANOVA (for which JASP cites Rouder et al., 2016; Rouder et al., 2012; Rouder et al., 2017; van den Bergh et al., 2020; Wagenmakers et al., 2018; Wetzels et al., 2012) using the default multivariate Cauchy prior on the effects (specified in JASP as r scaled fixed effects = 0.5, r scaled random effects = 1, r scale covariates = 0.354; Rouder et al., 2012; van den Bergh et al., 2020) with age and trial type (congruent, incongruent, neutral) as model components, and model terms including both main effects and the interaction. We report the Bayes factor for the model with the strongest evidence compared with the null model, as well as Bayes factors for inclusion (BFincl), which were calculated across matched models, comparing models that contained the effect to the equivalent models without the effect or higher-order interactions (see van den Bergh et al., 2020, for discussion). The magnitude of BFincl provides a quantification of evidence for (>1) or against (<1) inclusion of each factor in the model by comparing prior inclusion odds (equal across predictors before considering the data) to posterior inclusion odds (across all models after considering the data; van den Bergh et al., 2020). Bayesian analyses also include posterior summaries for the parameter estimates from each factor, which are not of central interest for our analyses; however, these can be found in the supplementary JASP files from each analysis for interested readers. When relevant, post hoc comparisons on trial type effects were corrected for multiple testing by fixing the prior probability that the null holds to 0.5 across all comparisons (Westfall et al., 1997). This analysis used the following R packages in JASP: BayesFactor, colorspace, ggplot2, KernSmooth, matrixStats, plyr, stats, stringi, stringr, and utils (JASP Team, 2022).
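
The corresponding Bayesian model comparison can be approximated with the BayesFactor package (which JASP uses internally); the call below is a hedged sketch using the same hypothetical `rt_means` data frame, with r scales set to mirror the JASP defaults quoted above.

```r
library(BayesFactor)

rt_means$subject <- factor(rt_means$subject)
rt_means$trial_type <- factor(rt_means$trial_type)

# All models combining age and trial type, always keeping subject as a
# random (nuisance) factor; r scales mirror the JASP defaults noted above
bf_all <- generalTestBF(mean_rt ~ age * trial_type + subject,
                        data = rt_means,
                        whichRandom = "subject", neverExclude = "subject",
                        rscaleFixed = 0.5, rscaleCont = 0.354)

# Express each model against the subject-only null model
bf_null <- lmBF(mean_rt ~ subject, data = rt_means, whichRandom = "subject")
head(bf_all / bf_null)
```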

Results

Accuracy

Our first research question was whether accuracy was lower on incongruent trials relative to neutral trials and whether that effect varied by age, which we addressed with logistic regression analyses. Figure 3a, b shows children’s accuracy (proportion of trials with correct responses) across trial types and the relation between accuracy and age per trial type. The frequentist logistic regression showed that the model with age, trial type, and their interaction as predictors fit better than the null model, χ2(5, n = 48) = 17.64, p = .003. Table 3 shows the coefficients from this model, which indicated significant effects of age and incongruent trial type. Similarly, the Bayesian logistic regression results showed evidence for an effect of age and incongruent trial type (see Table 4 for model posterior summary; note that credible intervals that do not include 0 are interpreted as evidence for an effect). As shown in Fig. 3a, b, results from both analyses indicate increasing accuracy with age and lower accuracy on incongruent trials than neutral trials.

Fig. 3
figure 3

Young children’s proportion of trials with correct responses across trial types (a, c) and the relation between accuracy and age per trial type (b, d) in Experiments 1a (a, b; n = 48, 2.65–5.91 years old) and 1b (c, d; n = 34, 3.46–7.01 years old). In a and c, boxes show second and third quartiles in light and dark gray (respectively) above and below the median, and whiskers show 1.5 times the interquartile range. Note that some boxes and whiskers are not visible when the distribution is restricted. In b and d, lines show the linear fit to the observed values

Table 3 Coefficients from the frequentist logistic regression on accuracy in Experiment 1
Table 4 Posterior summaries from the Bayesian logistic regression on accuracy in Experiment 1

Reaction time on correct-response trials

Our second research question was whether RT on correct-response trials was slower on incongruent trials than other types and whether this effect varied by age, which we addressed with repeated-measures ANOVAs. Figure 4a, b shows children’s mean RT on correct-response trials across trial types and the relation between RT and age per trial type. The frequentist ANOVA revealed a significant main effect only of age, F(1, 46) = 13.85, p < .001, ηp² = .231. The age effect reflects generally faster RT with age (see Fig. 4b). For the Bayesian ANOVA, model comparisons indicated that the best-fitting model (BF10 = 1.45 × 10⁴) included main effects of age (BFincl = 54.64) and trial type (BFincl = 265.04). Post hoc comparisons yielded strong evidence that mean RT was slower on incongruent trials than on neutral trials (posterior odds = 250.28; see Fig. 4a, b), in contrast to the frequentist analysis, which yielded no significant effect of trial type.

Fig. 4
figure 4

Young children’s mean reaction time (RT) on correct-response trials across trial types (a, c) and the relation between mean reaction time and age per trial type (b, d) in Experiments 1a (a, b; n = 48, 2.65–5.91 years old) and 1b (c, d; n = 34, 3.46–7.01 years old). In a and c, boxes show second and third quartiles in light and dark gray (respectively) above and below the median, and whiskers show 1.5 times the interquartile range. In b and d, lines show the linear fit to the observed values. Note that the y-axes differ between tasks to increase visibility of the range observed

Discussion

The results of Experiment 1a indicated the expected effect for incongruent trials on accuracy in both logistic regressions, but only the Bayesian ANOVA provided evidence for the correct-response trial RT difference, presumably due to the high variability within each trial type. These analyses were limited by young children’s poor performance when including only correct-response trials with a minimum of three per type: summed with the children excluded for ending early or not meeting the criterion for understanding on neutral trials, these exclusions comprised 29% of the full sample (20 out of 69 participants). Qualitative reports from experimenters further indicated that many young children had difficulty remaining engaged enough to complete the task or possibly misunderstood the task (19% of participants across this sample and pilot participants, including some who were excluded based on their data; see Supplement for details). The combination of poor task performance (high error rates and slow RTs) and perceived difficulty of the task indicated there might be a disconnect between children’s abilities and their task performance. To overcome limitations from the high error rates, overall slow reaction times, and difficulty in engagement or understanding, we next developed a variation of the flanker task that we hoped would be easier for young children to perform. To achieve this goal, we changed from direction to color as the stimulus dimension and implemented the task on a touchscreen. We tested this version approximately one year after Experiment 1a in a subset of these participants, allowing for longitudinal comparison of children who completed both tasks.

Experiment 1b

We made several changes from the direction flanker task to try to make the color task as accessible as possible for young children to increase their understanding, completion, and compliance in the task relative to the direction version. Many of these changes were based on prior experience developing a visual change detection task for young children (Simmering, 2012), and multiple changes were introduced simultaneously because the original goal was not a controlled comparison between the direction and color tasks, but rather developing an engaging, functional task for use in future studies. We chose to change the stimulus dimension from direction to color due to young children’s more limited vocabulary of spatial words. As noted in the procedure for Experiment 1a, we avoided relying on the terms “left” and “right” to explain the task because young children may not yet know these words well (e.g., Dessalegn & Landau, 2008). We reasoned that using stimulus features that children already know words for, specifically “red” and “blue” (e.g., Sandhofer & Smith, 1999; see also item trajectories from Frank et al., 2017), might facilitate children’s understanding of the task. We also made the task experimenter-paced rather than having trials advance automatically, to ensure that children attended to the task space when each trial began (cf. Rueda, Posner, et al., 2004b; Simmering, 2012). We implemented the task on a tablet rather than a laptop because young children are generally more familiar with touchscreen tablets than keypads on laptops (Frank et al., 2016; Kostyrka-Allchorne et al., 2017). The touchscreen also allowed us to present the response buttons on the screen (cf. Zelazo et al., 2013), which reduced the need for children to remember which key mapped to each color or look down to the keypad to see the mapping (cf. Fig. 1 for the direction task). Lastly, as described further in the Materials and procedure section, we built in a contingency that would add trials to the end of the session when participants responded incorrectly (similar to B. A. Eriksen & Eriksen, 1974) to increase the number of correct-response trials available for RT analyses.

Method

Participants

Thirty-seven children participated in this experiment; 33 were recruited from the Ripon College participants in Experiment 1a, and the remaining four were recruited through the participant database at Ripon College. Of the children who participated, one was excluded from all analyses for not meeting the criterion for understanding the task (50% correct on neutral trials), and two were excluded for contributing fewer than three correct trials per type (as described in Experiment 1a Method of analysis). The final sample included 34 children (age range: 3.46–7.01 years, M = 5.25, SD = 0.98; 16 girls, 18 boys). Caregivers who provided individual demographic data (34 participants from the full sample) reported their children identifying as follows: White/non-Hispanic (100%) for race/ethnicity; only English spoken in the home (100%) for language background; one or more parents holding a 4-year degree or higher (88%), one or more parents completing a 2-year degree or some college (9%), or both parents having completed high school (3%) for parental education. Written parental consent and oral child assent were obtained prior to participation, and at the completion of the session, children were compensated with a small prize. All procedures were reviewed and approved by the Ripon College Institutional Review Board.

For the 33 who participated in Experiment 1a during the prior summer, we report a longitudinal comparison for those who were included in analyses at both time points (n = 11; Time 1 age range: 2.65–5.79 years, M = 4.40, SD = 1.06; Time 2 age range: 3.46–6.83 years, M = 5.41, SD = 1.08; testing lag range: 9.69–14.09 months, M = 12.21, SD = 1.12; 6 girls, 5 boys). The remaining 22 children who participated at both time points were excluded for the following reasons: ten were pilot participants in Experiment 1a and therefore not reported; one was a pilot participant in the current experiment; one ended early in Experiment 1a; two did not meet the criterion for understanding the task (50% correct on neutral trials) in either Experiment 1a or 1b; eight did not contribute at least three correct trials per type in Experiment 1a.

Materials and procedure

All details of the materials and procedure were identical to Experiment 1a except as noted here. The task was administered with a custom JavaScript application on a Samsung Galaxy Tab 2 10.1″ (25.65 cm) tablet placed approximately 45 cm from the child on a tablet stand on the desk, behind a set of handprints on which children rested their hands between trials (see Fig. 5a). The task structure and trial types are shown in Fig. 5b–d, with red and blue stimuli as targets and flankers. The experimenter explained the task as follows before starting the task on the tablet: “Now we are going to play a game with colors. In this game, you are going to see five circles that are red, white, or blue, and it’s your job to tell the computer what color the circle in the middle is! To help you remember what circle is in the middle, a green circle will pop up in the place where you need to look first! Then, when the green circle goes away, you need to tell the computer what color the circle is that takes its place. When you know what color the circle in the middle is, you can press the square button at the bottom that has the same color as fast as you can—either red or blue! Remember, even though you will see five circles, it’s your job to look at the one in the middle and tell the computer if it is red or blue. Are you ready to play our game? Great! Remember, as soon as you know what color the circle in the middle is, press the button that is the same color as fast as you can!” Once the experimenter started the tablet task, she would reiterate directions as needed if the child seemed uncertain or asked questions. Based on pilot testing, we anticipated that some children might attempt to respond by touching the target itself rather than the button at the bottom of the screen. Experimenters were told that if a child made this type of error, describing the task as matching the target to the button at the bottom might help them better understand.

Fig. 5
figure 5

Color flanker task setup: (a) tablet and handprint placement; (b) tablet screen between trials as the experimenter waited for the child to be ready; (c) the green circle indicating the upcoming location of the target; (d) a sample neutral trial with response buttons along the bottom of the screen. On congruent trials, the flanking circles all matched the color of the target, and on incongruent trials, the flanking colors matched the opposite response (i.e., blue flankers for red targets and red flankers for blue targets). See text for details of timing and spacing

Before each trial began in the tablet task, a black circle appeared in the middle of the screen (see Fig. 5b); children were instructed to place their hands on the handprints and look at the circle when they were ready to proceed. At the top center of the screen, a white box served as a virtual button for the experimenter to tap when the child was ready. Participants were told that each trial would begin with a green circle showing which circle needed a match (see Fig. 5c); this circle appeared for 1 s in the location previously occupied by the white box. We chose this location to maximize the distance between the stimuli and the response buttons (see Fig. 5d) and because the experimenter’s finger initiating the trial would likely direct the child’s attention to that location. The green circle then disappeared, and a tone sounded while the screen was white for 667 ms. Then an array of five circles appeared across the top of the screen (subtending approximately 2.16° visual angle with adjacent circles separated by 1.27°), with the red (left) and blue (right) response boxes at the bottom (each subtending 2.16°, approximately 6.35° below the circles and separated by 6.87° horizontally; see Fig. 5d). These circles remained visible until the participant tapped one of the buttons at the bottom as their response or the response window of 10 s ended; if no response was entered within the response window, a null response was recorded. Following the response or end of the response window, the screen returned to the display of the black circle in the center (Fig. 5b) until the experimenter initiated the next trial.

Trials were presented in random order, initialized at eight trials per type (minimum 24 trials total), with the color of the target chosen randomly within type; to increase the number of correct-response trials available for analysis (cf. B. A. Eriksen & Eriksen, 1974), we programmed a contingency into the task that would present additional trials of a given type following null or incorrect responses (e.g., if seven of eight neutral trials had correct responses, a ninth neutral trial would be added to the end of the task for that participant; if that trial response was incorrect or null, a tenth trial would be added, but if the response was correct, no additional neutral trials would be presented). The application continued to present trials until the participant either completed eight correct-response trials of each type or any trial type with fewer than eight correct responses reached 18 trials total. This feature was intended to ensure that all children could contribute enough trials for analyses that used only correct-response trial RTs. This resulted in children completing different total numbers of trials: sixteen children completed 24 trials, six children completed 25 trials, four children completed 26 trials, two children each completed 28 or 32 trials, and one child each completed 23, 35, 36, and 37 trials.
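
The contingency reduces to a simple scheduling rule: each incorrect or null response appends one more trial of that type to the end of the session, capped at 18 trials per type. Below is a small R simulation of this logic (the actual task was a custom JavaScript application; the per-type accuracy rates are arbitrary placeholders) illustrating how total trial counts varied across children.

```r
# Simulate the adaptive trial scheduling; p_correct values are arbitrary
simulate_color_task <- function(p_correct = c(congruent = .95,
                                              incongruent = .80,
                                              neutral = .95)) {
  types <- names(p_correct)
  queue <- sample(rep(types, each = 8))          # initial 8 trials per type
  scheduled <- setNames(rep(8L, length(types)), types)
  results <- data.frame()
  i <- 1
  while (i <= length(queue)) {
    t <- queue[i]
    ok <- runif(1) < p_correct[[t]]              # stand-in for the child's response
    results <- rbind(results, data.frame(trial_type = t, correct = ok))
    if (!ok && scheduled[[t]] < 18) {            # incorrect/null: add a make-up trial
      queue <- c(queue, t)
      scheduled[[t]] <- scheduled[[t]] + 1
    }
    i <- i + 1
  }
  results  # ends when each type has 8 correct responses or 18 scheduled trials
}

nrow(simulate_color_task())  # e.g., 24 if no errors occur
```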

As in Experiment 1a, children completed additional cognitive tasks during their session following the flanker task (see Supplement). Participation in the color flanker task, including explanation and test trials, took 2–3 min to complete for most participants. We analyzed accuracy and mean RT on correct-response trials in the same way as in Experiment 1a.

Method of analysis

The method of analysis was identical to Experiment 1a except that any non-response trials (which occurred in the color task if participants did not respond within 10 s but were not possible in the direction task) were excluded before calculating accuracy (i.e., accuracy = total correct / [total correct + total incorrect]). For the longitudinal comparisons, we used the same models as in the individual tasks, with the addition of stimulus type as a within-subjects factor, as well as its interactions with other factors, to the models. The repeated-measures ANOVAs also included age at both time points as separate covariates.

Results

Color task

Accuracy

Our first research question was whether accuracy was lower on incongruent trials relative to neutral trials and whether that effect varied by age, which we addressed with logistic regression analyses. Figure 3c, d shows children’s accuracy by trial type and the relation between accuracy and age per trial type in the color task. For the frequentist regression, the model with age, trial type, and their interaction fit better than the null model, χ2(5, n = 34) = 26.67, p < .001. Table 3 shows the coefficients from this model, which indicated a significant effect of incongruent trial type. The Bayesian regression also indicated a main effect of incongruent trial type, plus an age by incongruent trial type interaction (see Table 4 for model posterior summary). These trial type results parallel those of Experiment 1a, with lower accuracy on incongruent trials than neutral trials. The age by incongruent trial type interaction evident in the Bayesian regression reflects a greater change in accuracy with age, driven by lower accuracy on incongruent trials in the younger children in the sample (see Fig. 3d).

Reaction time on correct-response trials

Our second research question was whether RT on correct-response trials was slower on incongruent trials than on other types and whether this effect varied by age, which we addressed with repeated-measures ANOVAs. Figure 4c, d shows children’s mean RT on correct-response trials across trial types and the relation between RT and age per trial type in the color task. The frequentist ANOVA showed a main effect only of age, F(1, 32) = 17.87, p < .001, ηp² = .358. The Bayesian ANOVA model comparisons indicated the best fit (BF10 = 121.85) from the model including a main effect only of age (BFincl = 117.21). As in the direction task, RT generally decreased with age (see Fig. 4d). Together these results show the expected congruency effect on accuracy but not correct-trial RT.

Longitudinal comparison

We next compared the direction and color tasks longitudinally for the subset of children who completed both (n = 11; see Participants section for exclusions). Due to the exclusions across experiments, this sample is small, and results should be interpreted with caution.

Accuracy

For the accuracy analyses, we used the same logistic regression models as before, with the addition of stimulus type as a within-subjects factor and age at both time points as predictors. Figure 6 shows this sample’s accuracy by trial type and the relation between accuracy and age per trial type, separately for the direction and color tasks. The frequentist model with age, trial type, stimulus type, and their interactions failed to converge using the default optimizer (nloptwrap); following Clark (2020), we successfully ran the model with bobyqa optimization. This model fit significantly better than the null model, χ2(11, n = 11) = 23.29, p = .016. Table 3 shows the coefficients from this model, which indicated a significant effect of stimulus type and a significant age by stimulus type interaction.

Fig. 6
figure 6

Longitudinal sample of young children’s (n = 11) proportion of trials with correct responses across trial types (a, c) and the relation between accuracy and age per trial type (b, d) in Experiment 1 direction task (a, b; 2.65–5.79 years old) and color task (c, d; 3.46–6.83 years old). In a and c, boxes show second and third quartiles in light and dark gray (respectively) above and below the median, and whiskers show 1.5 times the interquartile range. Note that some boxes and whiskers are not visible when the distribution is restricted. In b and d, lines show the linear fit to the observed values

The Bayesian regression also indicated evidence for a main effect of stimulus type and an age by stimulus type interaction, plus an age by incongruent trial type interaction (see Table 4 for posterior summaries). The stimulus type effect reflects generally lower accuracy in the color task (M = .85, SD = .19, 95% CI [.74, .97]; Fig. 6c) than in the direction task (M = .90, SD = .13, 95% CI [.82, .98]; Fig. 6a); this difference is driven in part by the overall different number of trials per task (18 in direction, 24 in color), which means that our criterion of three correct trials per type excludes children who performed below .50 in the direction task but below .375 in the color task (i.e., mean accuracy in the color task includes four children who performed between .375 and .5, which was not possible in the direction task). The age by stimulus type interaction was driven by a steeper positive effect of age in the direction task (Fig. 6b) than in the color task (Fig. 6d). The interaction between age and incongruent trial type reflected a steeper positive effect of age on incongruent trial accuracy relative to neutral when collapsing across tasks. Note, however, that our design conflated age with stimulus type, as all children completed the direction task at the first time point and the color task at the second time point approximately one year later (see Participants section). Thus, our age distribution was different across tasks.

Reaction time on correct-response trials

Figure 7 shows the longitudinal sample’s mean RT on correct-response trials by trial type and the relation between mean RT and age per trial type, separately for the direction and color tasks. For these analyses, we note that the different response modalities with different temporal characteristics (pressing a key versus touching the screen) across tasks are conflated with stimulus type. The influence of response modality should affect all trial types similarly, though, so the inclusion of a neutral trial type in both tasks can help account for this difference and would contribute to stimulus type effects (in addition to the overlapping contributions of age and stimulus type effects noted in the accuracy analysis).

Fig. 7
figure 7

Longitudinal sample of young children’s (n = 11) mean reaction time on correct-response trials across trial types (a, c) and the relation between mean reaction time and age per trial type (b, d) in Experiment 1 direction task (a, b; 2.65–5.79 years old) and color (c, d; 3.46–6.83 years old). In a and c, boxes show second and third quartiles in light and dark gray (respectively) above and below the median, and whiskers show 1.5 times the interquartile range. In b and d, lines show the linear fit to the observed values. Note that the y-axes differ between tasks to increase visibility of the range observed

The frequentist ANOVA on mean RT on correct-response trials in the longitudinal sample indicated significant main effects of age at each time point, direction/Time 1 age, F(1, 8) = 10.41, p = .012, ηp² = .337, color/Time 2 age, F(1, 8) = 12.45, p = .008, ηp² = .404, and stimulus type, F(1, 8) = 12.19, p = .008, ηp² = .604, as well as significant interactions between age at each time point and stimulus type, direction/Time 1 age by stimulus type, F(1, 8) = 6.95, p = .030, ηp² = .456, color/Time 2 age by stimulus type, F(1, 8) = 7.73, p = .024, ηp² = .491. Similarly, the Bayesian ANOVA model comparison indicated the best fit (BF10 = 1.17 × 10¹⁰) from a model that included main effects of age at each time point (direction/Time 1 age BFincl = 4.35, color/Time 2 age BFincl = 4.85), trial type (BFincl = 1.01), and stimulus type (BFincl = 5.68 × 10⁴), as well as interactions between age and stimulus type at each time point (direction/Time 1 age by stimulus type BFincl = 4,603.04, color/Time 2 age by stimulus type BFincl = 9,954.57). Evidence for the trial type effect was quite weak (i.e., < 3; Goss-Sampson et al., 2020), so we did not conduct a post hoc comparison. Most relevant to our research question was whether there was evidence for the stimulus type by trial type interaction, which would indicate that trial type effects differed across tasks: evidence was stronger for the null than for this effect (BFincl = 0.43). Across these analyses, the age effects indicate generally faster responding with age (see slopes in Fig. 7b and d) and the stimulus type effects reflect faster responding in the color task (M = 2.22 s, SD = 0.76, 95% CI [1.77, 2.67]) than in the direction task (M = 4.63 s, SD = 3.25, 95% CI [2.71, 6.55]). The interaction between stimulus type and age reflected the overall steeper negative effect of age in the direction task (see slopes in Fig. 7b) relative to the color task (see Fig. 7d).

Discussion

The analyses of the color task alone and longitudinal cross-task comparisons yielded mixed results: analyses of accuracy in the color task indicated congruency effects, but not analyses of RT on correct-response trials; the longitudinal comparison yielded no evidence for congruency effects, either overall or interacting with stimulus types. Bayesian analyses of accuracy in both the color task and longitudinal comparison indicated evidence for an interaction between age and trial type, which suggests possible congruency effects in the younger part of the age range. The longitudinal comparison also indicated that age effects were larger in the direction task, which could be because the participants were slightly older when they completed the color task and/or because the direction task was more challenging for younger children. Our interpretation of these effects was limited by restricting RT analyses to only correct-response trials with a minimum of three trials per type, which reduced our sample size by nearly half for the longitudinal comparison. In the next section, we consider alternative analyses that avoid these trial-number limitations and more directly normalize for different base RTs across tasks.

Experiment 1: Alternative analyses

Young children in the traditional (directional) version of the flanker task showed high error rates, especially on incongruent trials, which led to excluding many trials and, ultimately, many participants from the longitudinal comparison. Some prior studies with children have avoided this problem by retaining incorrect-trial RTs and either analyzing all RTs together (e.g., Rueda, Posner, et al., 2004b; Scheres et al., 2003) or grouping trials by response correctness (e.g., Davies et al., 2004; McDermott et al., 2007). The latter approach is limited because it results in different numbers of trials per participant in the groupings and will still have missing data for children who had the same response on all trials of a given type. The former approach loses some information by not taking the correspondence between response and RT per trial into account. Therefore, it cannot differentiate individuals who respond more quickly when producing errors from those who respond more slowly when producing errors.

A number of approaches to integrating correctness and RT have been proposed in studies with adults; these calculations are used for each participant separately for each trial type or condition as relevant. One common approach is the inverse efficiency score (mean RT on correct-response trials / proportion correct; Townsend & Ashby, 1983; cited by Mullane et al., 2009) or related efficiency score (proportion correct / mean RT on correct-response trials; Akhtar & Enns, 1989), neither of which can be used if accuracy is 0 on any trial type, as was the case with our data. More recent alternatives, rate-correct scores (number of correct responses / sum of all RTs; Woltz & Was, 2006) and linear-integrated speed–accuracy scores (mean RT + [proportion of errors * (standard deviation of RTs / standard deviation of proportion correct)]; Vandierendonck, 2017) can be used with an accuracy of 0, but then information about the RT distribution is lost since the resulting score is 0. Liesefeld and Janczyk (2019) proposed balanced integration scores (standardized proportion correct − standardized mean RT; see also Liesefeld et al., 2015) as an alternative to these three methods that was less affected by speed–accuracy trade-offs. This method would be better suited to our data but cannot account for whether incorrect responses were faster or slower than correct responses. In a task-switching paradigm, Hughes and colleagues compared inverse efficiency scores and rate-correct scores to bin scores in which correct-response trial RTs are binned by deciles, and then incorrect-response trials are assigned to a “worst” bin (i.e., higher than the slowest correct-response RT decile) regardless of speed (Hughes et al., 2014). Similarly, in the conditional accuracy function, observed reaction times are binned, and the proportion of correct responses is calculated within each bin and compared across trial types (e.g., Gratton et al., 1992). However, these two approaches are not well suited to our study because they require larger numbers of trials per bin for analysis than we can derive from our short task (see van Maanen et al., 2019, for additional critiques).
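
For reference, the four summary scores discussed above can be written directly from the formulas as quoted; the R sketch below is a hedged transcription for one participant’s trials of a single type (hypothetical vectors `rt`, in seconds, and `correct`, 0/1), and it makes the limitation concrete: the first score is undefined when no responses are correct.

```r
# Inverse efficiency score (Townsend & Ashby, 1983): undefined if accuracy = 0
ies <- function(rt, correct) mean(rt[correct == 1]) / mean(correct)

# Rate-correct score (Woltz & Was, 2006): collapses to 0 if accuracy = 0
rcs <- function(rt, correct) sum(correct) / sum(rt)

# Linear-integrated speed-accuracy score (Vandierendonck, 2017),
# transcribed from the formula quoted in the text
lisas <- function(rt, correct) {
  pe <- 1 - mean(correct)  # proportion of errors
  mean(rt[correct == 1]) + pe * (sd(rt) / sd(correct))
}

# Balanced integration score (Liesefeld & Janczyk, 2019): standardize
# proportion correct and mean RT across participants, then difference
bis <- function(prop_correct, mean_rt) {
  as.numeric(scale(prop_correct)) - as.numeric(scale(mean_rt))
}
```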

Our goal was to find an analysis that could be used when accuracy was 0 in any trial type, retaining information about the RT distribution on incorrect-response trials. This led us to seek a method to combine response correctness and RT at the trial level. We chose to combine correctness and RT into a single variable per trial using a modified version of signed residual time (SRT; Maris & van der Maas, 2012). SRT was developed for assessments with fixed response windows, with the goal of using response speed to estimate ability levels of individuals relative to one another (in conjunction with response correctness and item difficulty, which are the typical parameters in item response theory; e.g., Rasch, 1966). For SRT, residual time is calculated as the maximum allowed time minus response time per trial; some variations also dichotomize response times as “fast” or “slow” to eliminate some variability from individual differences in base response speed as opposed to individual ability (see Tijmstra & Bolsinova, 2021, for discussion). Residual times are then multiplied by 1 for correct responses and −1 for incorrect responses, resulting in scores ranging from the negative to positive maximum response time, with higher positive scores indicating greater ability. Our goal here is not to estimate individual participants’ abilities relative to one another but rather to construct a single metric of flanker performance that includes correctness and RT on each trial as an index of congruency effects. Prior research has established that incongruent flankers result in more errors and slower reaction times (Hedge et al., 2018; Mullane et al., 2009; Ridderinkhof et al., 2021). In SRT, this would result in more negative scores (from errors) and scores near zero (from slow, correct responses) on incongruent trials relative to congruent trials. Using mean SRT across trials for each type, then, we would expect overall lower scores on incongruent trials than congruent trials.
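
In equation form, SRT for a trial with response time t, correctness x (1 = correct, 0 = incorrect), and maximum allowed time d can be written as:

```latex
\mathrm{SRT} = (2x - 1)\,(d - t)
```

A fast correct response thus approaches +d, a fast error approaches −d, and a response at the deadline scores 0 regardless of correctness.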

Participants

By combining correctness and RT into a single variable per trial, we can eliminate our criterion of three correct trials per trial type for these analyses and include participants from Experiments 1a, 1b, and the longitudinal comparison who had been excluded for failing to meet this criterion. However, to ensure that unusually low performance did not exert undue influence on our results, we screened our modified-SRT measure for potential influential points based on deviation from the regression line predicting modified-SRT from age (described in the Method of analysis section). Our resulting sample sizes for these analyses were as follows: Experiment 1a included 61 children (age range: 2.64–5.91 years, M = 4.76, SD = 0.90; 26 girls, 35 boys; 25 children participated at Ripon College and 36 at UW–Madison), with one additional participant excluded. Experiment 1b included 35 children (age range: 3.46–7.01 years, M = 5.29, SD = 0.98; 17 girls, 18 boys), with one additional participant excluded. The longitudinal sample included 18 children (Time 1 age range: 2.65–5.79 years, M = 4.24, SD = 0.97; Time 2 age range: 3.46–6.83 years, M = 5.24, SD = 0.98; testing lag range: 9.69–14.09 months, M = 11.98, SD = 1.12; 8 girls, 10 boys), with one additional participant excluded (the same participant who was excluded from Experiment 1a).

Method of analysis

Calculating SRT as in Maris and van der Maas (2012) requires a maximum allowed time for the response window so that individuals’ abilities can be estimated relative to one another under the same circumstances and on the same scale (see also Tijmstra & Bolsinova, 2021). Our application of SRT was not intended to compare individuals with each other, but rather to compare trial types within each task, meaning we could use different maximum response times across tasks as long as we retained the same maximum across trial types within a task. Our color task was designed with an enforced response window (maximum 10 s), but our direction task was not. Our tasks also presented the challenge that the response modalities differed (keypress in the direction task, screen touch in the color task) and thus had different timing due to the different motor demands. To account for the different RT distributions across tasks and to accommodate the lack of a response window in the direction task, we chose to bin response times into deciles for a modified-SRT calculation. To do this, we used the distribution across all trials, separately for each task, to determine deciles; separating by task but combining across trial types minimizes the effects of base RT differences between tasks while retaining differentiation of the distributions across trial types within a task (i.e., if RT is slower on incongruent trials than congruent trials in the same task, more incongruent trials will fall in higher deciles and more congruent trials in lower deciles). Note that all children classified as understanding the task in Experiments 1a (n = 62) and 1b (n = 36) were included for decile calculations. Then, to calculate modified-SRT, we subtracted the observed RT decile (1 through 10) from the maximum decile (10) and multiplied the difference by 1 for correct responses and −1 for incorrect responses, resulting in scores ranging from −9 to 9.
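
The sketch below illustrates this decile-based calculation in R, assuming a hypothetical trial-level data frame with columns task, rt, and correct (0/1); the column names and structure are ours for illustration, not from the original studies.

```r
# Modified-SRT: bin RTs into deciles per task (collapsing across trial
# types), then score (10 - decile), signed by response correctness.
modified_srt <- function(trials) {
  scored <- lapply(split(trials, trials$task), function(d) {
    breaks <- quantile(d$rt, probs = seq(0, 1, by = 0.1))
    decile <- cut(d$rt, breaks, include.lowest = TRUE, labels = FALSE)  # 1-10
    d$mod_srt <- (10 - decile) * ifelse(d$correct == 1, 1, -1)          # -9 to 9
    d
  })
  do.call(rbind, scored)
}
```

Mean mod_srt per participant and trial type then serves as the dependent measure in the analyses below.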

The relation between observed RTs and modified-SRT scores is shown in Fig. 8 separately for young children’s performance on the direction (left column) and color (right column) tasks; note the difference in the x-axes due to the different ranges observed across tasks. As this figure shows, RTs were densely distributed near 0 (i.e., fast RTs) and sparsely distributed near the maximum observed values (i.e., slow RTs). Incorrect responses are shown as orange X marks and correct responses as gray O marks; as was apparent in Fig. 3, correct responses were more frequent than incorrect responses. Converting binned RTs into modified-SRT scores according to response correctness resulted in scores near −9 indicating a fast incorrect response, scores near 9 reflecting a fast correct response, scores of 0 resulting from RTs in the tenth decile (regardless of correctness), and intermediate negative and positive scores corresponding to moderate speed on incorrect and correct responses, respectively. Overall, then, lower mean scores reflect slower and more error-prone responding, as would be expected from greater interference from incongruent flankers.

Fig. 8

Correspondence between young children’s raw reaction times and modified-SRT scores separated by trial type in the direction task (n = 62; 2.64–5.91 years old) and the color task (n = 37; 3.46–7.01 years old). Note the different span of x-axes across panels for improved visibility of different ranges. To increase visibility within the displayed range of RTs, 18 additional observations in the direction task (six congruent, nine incongruent, three neutral) greater than 21 s are not shown; these correspond to modified-SRT scores of 0

We see two advantages to using modified-SRT scores compared with traditional approaches that analyze accuracy and RT separately or combine them at the aggregate level. First, modified-SRT retains data from all trials without obscuring relations between correctness and RT. Second, binning RTs into deciles per task (and, below, per experiment) reduces the influence of age group and task differences in the observed range of RTs (see Fig. 8); this facilitates comparisons of congruency effects independent of overall RT, which is expected to differ across age groups and response modalities (i.e., the time required to press a key versus touch a screen).

After calculating modified-SRT scores for all participants in Experiments 1a (n = 62) and 1b (n = 37) who met the criterion for understanding described in the Experiment 1a Method of analysis section, we evaluated the distribution of scores relative to a linear regression line with age, trial type, and their interaction as predictors, separately for each stimulus type. In JASP, we calculated a linear regression for modified-SRT with these predictors, then used the casewise diagnostics to generate tables with each participant’s modified-SRT scores, standard residual values, predicted values, and residual values per trial type (see Supplementary file). We then excluded participants with any scores that had a standard residual beyond ±3.3, following the recommendation of Tabachnick and Fidell (2007). This led to the exclusion of one participant in Experiment 1a, who was also part of the longitudinal sample, and one participant in Experiment 1b (see the Participants section for final sample sizes). Following these exclusions, we analyzed mean modified-SRT in JASP using the same repeated-measures ANOVAs described in the above analyses of correct-response trial RT.
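
This screening step can be reproduced outside JASP as well; a minimal sketch, assuming a hypothetical data frame scores with one row per participant and trial type and columns id, age, trial_type, and mod_srt (mean modified-SRT):

```r
# Regress mean modified-SRT on age, trial type, and their interaction,
# then flag cases with standardized residuals beyond +/-3.3
# (following Tabachnick & Fidell, 2007).
fit <- lm(mod_srt ~ age * trial_type, data = scores)
scores$std_resid <- rstandard(fit)
exclude_ids <- unique(scores$id[abs(scores$std_resid) > 3.3])
```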

Results

Our third research question across Experiments 1a and 1b was whether modified-SRT was lower on incongruent trials than other trial types and whether that effect varied by age, which we addressed with repeated-measures ANOVAs. Figure 9a, b shows children’s mean modified-SRT across trial types and the relation between modified-SRT and age per trial type from the direction task in Experiment 1a. The frequentist ANOVA yielded a significant main effect only of age, F(1, 59) = 47.93, p < .001, ηp² = .448. The Bayesian ANOVA indicated that the best-fitting model (BF₁₀ = 3.26 × 10¹⁵) included main effects of age (BFincl = 2.67 × 10⁶) and trial type (BFincl = 1.57 × 10⁹). Post hoc comparisons showed strong evidence that mean modified-SRT on incongruent trials was lower than on congruent and neutral trials (posterior odds = 5,495.70 and 9.78 × 10⁶, respectively), as can be seen in Fig. 9a. The age effect reflects generally increasing modified-SRT scores with age, as shown in Fig. 9b.

Fig. 9

Young children’s mean modified-SRT across trial types (a, c) and the relation between modified-SRT and age per trial type (b, d) in Experiments 1a (a, b; n = 61; 2.64–5.91 years old) and 1b (c, d; n = 35; 3.46–7.01 years old). In a and c, boxes show second and third quartiles in light and dark gray (respectively) above and below the median, and whiskers show 1.5 times the interquartile range

Figure 9c, d shows children’s mean modified-SRT by trial type and the relation between modified-SRT and age per trial type from the color task in Experiment 1b. The frequentist ANOVA yielded significant main effects of age, F(1, 33) = 16.27, p < .001, ηp² = .330, and trial type, F(2, 66) = 4.67, p = .013, ηp² = .124. Post hoc comparisons indicated that mean modified-SRT was lower on incongruent trials compared with both neutral (p < .001) and congruent trials (p = .004), as shown in Fig. 9c. The age effect again reflects higher modified-SRT with age (see Fig. 9d). The Bayesian ANOVA on modified-SRT showed that the best-fitting model (BF₁₀ = 1.23 × 10⁴) included main effects of both age (BFincl = 76.79) and trial type (BFincl = 156.67). Post hoc comparisons showed moderate evidence that modified-SRT was lower on incongruent trials than on congruent trials (posterior odds = 2.93) and strong evidence that it was lower than on neutral trials (posterior odds = 125.18; see Fig. 9c).

Figure 10 shows the longitudinal sample’s mean modified-SRT by trial type and the relation between modified-SRT and age per trial type, separately for the direction and color tasks. The frequentist ANOVA on this sample yielded no significant effects. The Bayesian ANOVA, however, showed that the best-fitting model (BF₁₀ = 55.73) included main effects of trial type (BFincl = 27.72) and stimulus type (BFincl = 2.40), although the latter effect was weak (i.e., BFincl < 3). Post hoc comparisons of the trial type effect indicated moderate evidence that modified-SRT was lower on incongruent trials (M = 1.36, SD = 2.22, 95% CI [0.63, 2.63]) than on congruent trials (M = 2.93, SD = 1.92, 95% CI [2.06, 3.79]; posterior odds = 2.15) and strong evidence that it was lower than on neutral trials (M = 3.34, SD = 1.91, 95% CI [2.48, 4.20]; posterior odds = 131.61).

Fig. 10

Longitudinal sample of young children’s (n = 18) mean modified-SRT across trial types (a, c) and the relation between modified-SRT and age per trial type (b, d) in Experiment 1 direction task (a, b; 2.64–5.79 years old) and color task (c, d; 3.46–6.83 years old). In a and c, boxes show second and third quartiles in light and dark gray (respectively) above and below the median, and whiskers show 1.5 times the interquartile range

Discussion

Children in Experiment 1a showed the expected pattern in their modified-SRT in the direction task, with age-related improvements in performance overall and lower scores on incongruent trials reflecting more errors and slower responses relative to congruent and neutral trials. Using modified-SRT to combine accuracy and RT into a single metric confirmed that this effect was statistically robust: the best-fitting Bayesian model showed that congruency effects can be detected with this measure. The analyses of modified-SRT in Experiment 1b (both frequentist and Bayesian) showed evidence for the expected congruency effect in the color task, which we designed to be easier for children to understand and complete. The lack of a similar effect in the frequentist analysis of the direction task is likely due to the higher variability in this measure in the direction task than in the color task (cf. Fig. 9a and c).

Based on children’s overall performance and the lower number of exclusions (three out of 37 participants, or 8%), as well as qualitative reports from experimenters indicating little to no difficulty with the task during pilot testing and data collection (see Supplement for details), it seems that the color task was indeed easier for children. The Bayesian analysis of our longitudinal sample indicated strong evidence for an overall trial type congruency effect, with weak evidence for a difference between stimulus types and no evidence for an interaction between trial type and stimulus type. Together, these analyses suggest that both tasks can elicit congruency effects in young children, perhaps more robustly in the color task (given that the trial type effect was not significant in the frequentist analysis of the direction task). However, due to our design, we cannot attribute this difference solely to the task differences: children were older when performing the color task, and most had previously participated in the direction task, which may have supported better performance in the color task. Our next experiment investigated whether task differences would be found when comparing the tasks concurrently within subjects. To ensure that participants would be able to complete both tasks within the same session, we sampled a group of older children (aged 6 to 10 years), and also included a group of young adults for comparison. This also allowed us to investigate continuity across age groups in the differences between tasks.

Experiment 2a

Method

Participants

Thirty children participated in this experiment, recruited through the same participant database at Ripon College as in Experiment 1. One child’s data were excluded from all analyses for not completing the color flanker task; additionally, three children were excluded from the accuracy and correct-response RT analyses for not contributing enough trials per type, one of whom was also excluded from the modified-SRT analyses based on standard residuals from the regression (described in the Method of analysis section below). This resulted in a sample of 26 children for the first two sets of analyses (age range: 6.11–10.96 years, M = 8.29, SD = 1.43; 11 girls, 15 boys) and 28 children for the third set of analyses (age range: 6.11–10.96 years, M = 8.28, SD = 1.39; 13 girls, 15 boys). Caregivers who provided individual demographic data (28 participants from the full sample) reported their children identifying as follows: White/non-Hispanic (93%) or biracial (White and Black/non-Hispanic; 7%) for race/ethnicity; only English spoken in the home (100%) for language background; and one or more parents holding a 4-year degree or higher (90%), one or more parents completing a 2-year degree or some college (7%), or both parents having completed high school (3%) for parental education. Written parental consent and oral child assent were obtained prior to participation, and at the completion of the session, children were compensated with a small prize.

Materials and procedure

All details of the materials and procedure were identical to those reported for Ripon College in Experiment 1 except in two ways. First, less elaborate instructions were provided for both tasks, without modeling the direction task using PowerPoint. For the direction task, the experimenter said (gestures shown in brackets), “For this next game, you will be identifying the direction a cartoon fish is swimming. On the screen right now [gesturing to the target fish shown at the beginning of the task], you see an example of the fish you will be identifying. Once the game starts, you will see a row of fish. I want you to look at the fish in the middle of the screen; a small cross will appear that shows you where the middle fish will be. Once you figure out if the fish is swimming left or right, press the arrow key [gesturing to the keys] that matches the way the fish is swimming. Are you ready to begin? Press the key that shows the way this fish is swimming.” For the color task, the experimenter said, “For this game, you will be identifying the color of a circle. On the screen right now, you see a dark circle and a white square. Please start with your hands on the mat for each trial. I will tap the white square and the trial will start; when the trial starts, you will see a green circle and then a row of circles. I want you to look at the circle in the middle—this will be the circle that appears in place of the green circle. The circle in the middle will either be red or blue. Once you figure out the color of the middle circle, tap the square at the bottom that has the matching color.”

Second, participants completed both tasks within a single session, running through the direction task twice in succession, resulting in double the number of direction trials relative to Experiment 1a (36 trials total). Trial repetitions following incorrect responses in the color task made the total number of trials higher than 24 for eight older children (five completed 25 trials, two completed 26 trials, and one completed 27 trials). The two flanker tasks were interleaved with other tasks (described in the Supplement) and presented in counterbalanced order. Including explanations of the tasks, most older children took 2–4 min per task to complete the direction and color tasks.

Method of analysis

We first conducted the accuracy and correct-trial RT analyses as described in the Experiment 1 longitudinal comparison, except that each participant had only one age value due to the design; these were followed by analyses separated by stimulus type. For these analyses, we adjusted our criterion for inclusion to require more correct-response trials per type than in Experiment 1, since these participants were older on average (around 8 years old, compared with a mean age of 5 years in Experiment 1) and completed more trials overall. Most older children had at least eight correct trials per type, leading to the exclusion of three participants with two or three correct trials in one trial type (all on the direction task, but these participants were excluded from analyses for both tasks since they completed both in the same session).

As in Experiment 1, we also calculated modified-SRT, again using deciles derived from the observed RT distributions (including all participants who completed the task regardless of number of correct trials per type, n = 29) separately by task but collapsing across trial types (see Fig. 11 for distributions), then used linear regression in JASP to identify scores with standard residuals beyond ±3.3 (cf. Tabachnick & Fidell, 2007; see Supplementary file). This resulted in one participant being excluded from these analyses. We analyzed modified-SRT as in the longitudinal comparison in Experiment 1, followed by planned analyses separated by stimulus type.

Fig. 11

Correspondence between older children’s (n = 29; 6.11–10.96 years old) raw reaction times and modified-SRT scores separated by trial types in the direction task and the color task reported in Experiment 2a. To increase visibility within the displayed range of RTs, 11 additional observations, nine in the direction task (one congruent, six incongruent, two neutral) and two in the color task (one congruent, one incongruent), greater than 6 s are not shown; these correspond to modified-SRT scores of 0

Results

Accuracy

Our first research question was whether accuracy was lower on incongruent trials relative to neutral trials and whether that effect varied by age, which we addressed with logistic regression analyses. Figure 12 shows older children’s accuracy by trial type and the relation between accuracy and age per trial type, separately for the direction and color tasks. For the frequentist regression, the model with age, trial type, stimulus type, and the resulting interactions failed to converge using the default optimizer (nloptwrap) but did converge using bobyqa optimization. This model fit better than the null model, χ²(11, n = 26) = 46.68, p < .001. Table 5 shows the coefficients from this model, which indicated no significant effects, although the main effect of age approached the p < .05 threshold. The Bayesian regression indicated a main effect only of age (see Table 6 for model posterior summary). The age effect was driven by an overall small increase in accuracy with age (see Fig. 12b and d).
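
For readers who encounter the same convergence issue, the optimizer names in the text correspond to those in the R package lme4; the sketch below shows the switch under the assumption that trial-level accuracy was modeled as a mixed logistic regression with by-participant random intercepts (column and data frame names are hypothetical).

```r
library(lme4)

# Refit the accuracy model with the bobyqa optimizer after the default
# optimizer fails to converge.
m <- glmer(correct ~ age * trial_type * stim_type + (1 | id),
           data = trials, family = binomial,
           control = glmerControl(optimizer = "bobyqa"))
```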

Fig. 12

Older children’s (n = 26; 6.11–10.96 years old) proportion of trials with correct responses across trial types (a, c) and the relation between accuracy and age per trial type (b, d) in the direction task (a, b) and color task (c, d) in Experiment 2a. In a and c, boxes show second and third quartiles in light and dark gray (respectively) above and below the median, and whiskers show 1.5 times the interquartile range. Note that some boxes and whiskers are not visible when the distribution is restricted. In b and d, lines show the linear fit to the observed values

Table 5 Coefficients from the frequentist logistic regression on older children’s accuracy in Experiment 2a
Table 6 Posterior summaries from the Bayesian logistic regression on older children’s accuracy in Experiment 2a

Splitting by task, we first analyzed older children’s accuracy in the direction task. The frequentist regression indicated that the model with age, trial type, and their interaction fit better than the null model, χ²(5, n = 26) = 18.59, p = .002. Table 5 shows the coefficients from this model, which included only a significant main effect of incongruent trial type. The Bayesian regression also indicated a main effect of incongruent trial type (see Table 6 for model posterior summary). These results indicate overall lower accuracy on incongruent than neutral trials in the direction task (see Fig. 12a), similar to younger children’s performance in Experiment 1a.

We next analyzed older children’s accuracy in the color task. The frequentist regression indicated that the model with age, trial type, and their interaction fit better than the null model, χ²(5, n = 26) = 15.05, p = .010. Table 5 shows the coefficients from this model, which included no significant effects. The Bayesian regression indicated a main effect only of age (see Table 6 for model posterior summary). As Fig. 12d shows, this reflects a slight increase in accuracy with age in the color task. In contrast to Experiment 1b and the direction task in Experiments 1a and 2a, we see no evidence for an effect of trial type on older children’s accuracy in the color task (see Fig. 12c).

Reaction time on correct-response trials

Our second research question was whether RT on correct-response trials was slower on incongruent trials than on other types and whether this effect varied by age, which we addressed with repeated-measures ANOVAs. Figure 13 shows older children’s mean RT on correct-response trials across trial types and the relation between RT and age per trial type, separately for the direction and color tasks. The frequentist ANOVA showed significant main effects of age, F(1, 24) = 23.50, p < .001, ηp² = .495, and trial type, F(2, 48) = 9.24, p < .001, ηp² = .278, as well as an age by trial type interaction, F(2, 48) = 5.79, p = .006, ηp² = .194, and a trial type by stimulus type interaction, F(2, 48) = 4.77, p = .013, ηp² = .166. The main effect of age reflects generally faster RT with age (see Fig. 13b and d). Post hoc comparisons of the trial type main effect indicated significantly slower RT on incongruent trials (M = 1.45 s, SD = 0.70, 95% CI [0.62, 1.16]) than on congruent (M = 1.28 s, SD = 0.60, 95% CI [0.48, 0.95]; p = .002) or neutral trials (M = 1.19 s, SD = 0.53, 95% CI [0.61, 1.01]; p < .001). The age by trial type interaction was driven primarily by the direction task (compare Fig. 13b to d), in which the age effect was larger on incongruent trials (as shown in analyses separated by task below).

Fig. 13

Older children’s (n = 26; 6.11–10.96 years old) mean reaction time (RT) on correct-response trials across trial types (a, c) and the relation between mean reaction time and age per trial type (b, d) in the direction task (a, b) and color task (c, d) in Experiment 2a. In a and c, boxes show second and third quartiles in light and dark gray (respectively) above and below the median, and whiskers show 1.5 times the interquartile range. In b and d, lines show the linear fit to the observed values

The Bayesian ANOVA on older children’s mean RT on correct-response trials across tasks showed that the best-fitting model (BF₁₀ = 1.40 × 10¹¹) included main effects of age (BFincl = 303.36), trial type (BFincl = 442.08), and stimulus type (BFincl = 1.28 × 10⁵), as well as an age by stimulus type interaction (BFincl = 3.38) and a trial type by stimulus type interaction (BFincl = 18.31). Post hoc comparisons on the trial type effect indicated strong evidence that mean RT on incongruent trials was slower than that on neutral trials (posterior odds = 747.80; see means above) and moderate evidence that mean RT on incongruent trials was slower than that on congruent trials (posterior odds = 7.17). The stimulus type effect reflects overall slower responding in the color task (M = 1.44 s, SD = 0.57, 95% CI [0.73, 1.17]) than in the direction task (M = 1.18 s, SD = 0.66, 95% CI [0.40, 0.91]). The age by stimulus type interaction is driven by a generally steeper slope for the effect of age in the direction task versus color task (compare slopes in Fig. 13b and d). The trial type by stimulus type effect reflects the differentiation by trial type in the direction task but not in the color task, as indicated by the analyses separated by stimulus type below.

Splitting by task, we first analyzed older children’s correct-response trial RT in the direction task. The frequentist ANOVA showed significant main effects of age, F(1, 24) = 27.27, p < .001, ηp² = .532, and trial type, F(2, 48) = 15.08, p < .001, ηp² = .386, as well as the age by trial type interaction, F(2, 48) = 9.19, p < .001, ηp² = .277. The main effect of age reflects faster RT with age (see Fig. 13b). Post hoc comparisons of the trial type main effect indicated significantly slower RT on incongruent trials than on congruent or neutral trials (ps < .001; see Fig. 13a). Similarly, the Bayesian ANOVA showed that the best-fitting model (BF₁₀ = 8.31 × 10⁸) included main effects of age (BFincl = 567.16) and trial type (BFincl = 1.03 × 10⁵) as well as their interaction (BFincl = 10.92). Post hoc comparisons on the trial type effect indicated strong evidence that mean RT was slower on incongruent trials than on congruent and neutral trials (posterior odds = 249.49 and 2,275.56, respectively; see Fig. 13a). The age by trial type interaction reflects the steeper slope by age on incongruent trials (see Fig. 13b), driven by slower mean RT in younger children on that trial type.

We then analyzed older children’s correct-response trial RT in the color task. The frequentist ANOVA showed a significant main effect only of age, F(1, 24) = 15.14, p < .001, ηp² = .387. The Bayesian ANOVA similarly showed that the best-fitting model (BF₁₀ = 30.40) included only a main effect of age (BFincl = 29.80), with no evidence for effects of trial type (BFincl = 0.26) in this task. These analyses indicate that mean RT was generally faster with age (see Fig. 13d) and was similar across trial types (see Fig. 13c), but that the color task did not show the same congruency effect as the direction task in older children.

Modified-SRT scores

Our third research question was whether modified-SRT scores were lower on incongruent trials than other types and whether this effect varied by age, which we addressed with repeated-measures ANOVAs. Figure 14 shows older children’s mean modified-SRT across trial types and the relation between modified-SRT and age per trial type, separately for the direction and color tasks. The frequentist ANOVA on older children’s modified-SRT across tasks yielded a significant main effect only of age, F(1, 26) = 72.60, p < .001, ηp² = .736, indicating a general increase in modified-SRT scores with age (see Fig. 14b and d). The Bayesian ANOVA on modified-SRT indicated that the best-fitting model (BF₁₀ = 7.14 × 10¹⁴) included main effects of age (BFincl = 2.08 × 10⁶), trial type (BFincl = 1.49 × 10⁵), and stimulus type (BFincl = 11.40), plus a trial type by stimulus type interaction (BFincl = 475.06). Post hoc comparisons of the trial type effect showed strong evidence that modified-SRT scores on incongruent trials (M = 3.20, SD = 2.02, 95% CI [2.45, 3.95]) were lower than on congruent (M = 4.28, SD = 2.20, 95% CI [3.46, 5.09]; posterior odds = 3,103.18) and neutral trials (M = 4.43, SD = 2.14, 95% CI [3.63, 5.22]; posterior odds = 9,435.34). The stimulus type effect reflects the slightly higher modified-SRT scores in the color task (M = 4.25, SD = 2.21, 95% CI [3.43, 5.07]) than the direction task (M = 3.69, SD = 2.02, 95% CI [2.94, 4.44]). The trial type by stimulus type interaction is driven by lower modified-SRT on incongruent trials compared with congruent and neutral trials in the direction task (see Fig. 14a) but not the color task (see Fig. 14c), as evidenced by the Bayesian analyses for each stimulus type below.

Fig. 14

Older children’s (n = 28; 6.11–10.96 years old) mean modified-SRT across trial types (a, c) and the relation between modified-SRT and age per trial type (b, d) in the direction task (a, b) and color task (c, d) in Experiment 2a. In a and c, boxes show second and third quartiles in light and dark gray (respectively) above and below the median, and whiskers show 1.5 times the interquartile range. In b and d, lines show the linear fit to the observed values

We next conducted planned analyses separated by stimulus type. For older children’s modified-SRT in the direction task, the frequentist ANOVA yielded a significant main effect only of age, F(1, 26) = 57.24, p < .001, ηp² = .688. The Bayesian ANOVA indicated that the best-fitting model for the direction task (BF₁₀ = 4.46 × 10¹³) included main effects of age (BFincl = 2.38 × 10⁵) and trial type (BFincl = 2.72 × 10⁸). Post hoc comparisons showed strong evidence that modified-SRT scores on incongruent trials were lower than on congruent and neutral trials (posterior odds = 1.05 × 10⁴ and 3.50 × 10⁵, respectively; see Fig. 14a). For older children’s modified-SRT in the color task, the frequentist ANOVA resulted in a significant main effect only of age, F(1, 26) = 49.20, p < .001, ηp² = .654. Similarly, the Bayesian ANOVA indicated that the best-fitting model (BF₁₀ = 8.91 × 10⁴) included only an age main effect (BFincl = 8.17 × 10⁴). Together these analyses indicate the expected congruency effect in the direction task (see Fig. 14a) but not in the color task (see Fig. 14c). Across both tasks, modified-SRT generally increased with age (see Fig. 14b and d).

Discussion

Our main goals in this experiment were to test whether performance was generally better in the color task than in the direction task in older children (aged 6–10 years) and to see whether they showed the expected congruency effects in the color task. For these older children, the Bayesian analyses showed effects of stimulus type on correct-response trial RT and modified-SRT, but not accuracy. In general, RT was slower in the color task than in the direction task, presumably because the response modality required a longer movement (touchscreen versus keypress). Modified-SRT was slightly higher in the color task than in the direction task due to better performance on incongruent trials, which also resulted in a trial type by stimulus type interaction. Older children showed the expected effects of trial type in the direction task in nearly all analyses (the exception being the frequentist ANOVA on modified-SRT). In the color task, however, there was no evidence for differences across trial types on any measure, raising concerns that congruency effects in tasks designed for young children may not be comparable in older children.

Experiment 2b

Method

Participants

Sixteen adults, recruited from Ripon College through voluntary participant sign-up among students and community members, participated in this experiment. All participants were included in all analyses (age range: 19–23 years, M = 20.44, SD = 1.26; 12 women, 4 men). No individual demographic data were collected from adults, but the communities from which we recruited are primarily White, monolingual English-speaking, and middle- to upper-middle-class. Adults provided written consent and either volunteered for no compensation (n = 6) or entered a drawing for a prize (an insulated drink tumbler with the college’s logo; n = 10). All procedures were reviewed and approved by the Ripon College Institutional Review Board.

Materials and procedure

All details of the materials and procedure were identical to those reported for Ripon College in Experiment 2a except as follows: adults completed both tasks while seated at a standard office desk (at a viewing distance of approximately 45 cm), and adults ran through both tasks twice (72 trials total for direction; 48 trials total for color). Including explanations of the tasks, most adults took 1–3 min in the direction task and 1–2 min in the color task.

Method of analysis

We first conducted the accuracy and correct-trial RT analyses as described in Experiment 2a. As in Experiment 2a, we adjusted our criterion for inclusion to require more correct-response trials per type than in Experiment 1, since adults made more correct responses and completed more trials overall. All adults had at least 10 correct trials per type, leading to no exclusions. We also calculated modified-SRT as in Experiments 1 and 2a (see Fig. 15), then used linear regression in JASP to identify scores with standard residuals beyond ±3.3 (cf. Tabachnick & Fidell, 2007; see Supplementary file). No data points were identified as potentially influential. We analyzed modified-SRT for adults as described in Experiment 2a.

Fig. 15

Correspondence between adults’ (n = 16; 19–23 years old) raw reaction times and modified-SRT scores separated by trial type in the direction task and the color task reported in Experiment 2b. To increase visibility within the displayed range of RTs, five additional observations, two in the direction task (both incongruent) and three in the color task (all congruent), greater than 1.5 s are not shown; these correspond to modified-SRT scores of 0

Results

Accuracy

Our first research question was whether accuracy was lower on incongruent trials relative to neutral trials and whether that effect varied by age, which we addressed with logistic regression analyses. Figure 16 shows adults’ accuracy by trial type and the relation between accuracy and age per trial type, separately for the direction and color tasks. For the frequentist regression, all models (null and with predictors) failed to converge, regardless of the optimizer used. Results of the Bayesian logistic regression indicated main effects of incongruent trial type and stimulus type plus an age by incongruent trial type interaction (see Table 7 for model posterior summary). The stimulus type effect reflects higher performance on the color task (at ceiling) than the direction task (M = .96, SD = .04, 95% CI [.94, .98]; see Fig. 16c). The incongruent trial type effect is driven by lower performance on incongruent trials (M = .94, SD = .05, 95% CI [.92, .97]) than neutral trials (M = .99, SD = .02, 95% CI [.98, 1.00]); this appears to be driven by direction task performance (see Fig. 16a and c), even though an interaction between stimulus type and incongruent trial type was not indicated. The age by incongruent trial type interaction reflects increased accuracy with age for that trial type relative to neutral trials, again only in the direction task, since performance in the color task was at ceiling (see Fig. 16b and d).

Fig. 16

Adults’ (n = 16; 19–23 years old) proportion of trials with correct responses across trial types (a, c) and the relation between accuracy and age per trial type (b, d) in the direction task (a, b) and color task (c, d) in Experiment 2b. In a and c, boxes show second and third quartiles in light and dark gray (respectively) above and below the median, and whiskers show 1.5 times the interquartile range. Note that some boxes and whiskers are not visible when the distribution is restricted. In b and d, lines show the linear fit to the observed values

Table 7 Posterior summaries from the Bayesian logistic regression on adults’ accuracy in Experiment 2b

Separating by task, we analyzed adults’ accuracy only in the direction task, since performance was at ceiling in the color task. Again, the frequentist models all failed to converge regardless of the optimizer used. Results of the Bayesian logistic regression indicated a main effect of incongruent trial type, plus an age by incongruent trial type interaction (see Table 7 for model posterior summary). The trial type effect reflects lower accuracy on incongruent than neutral trials (see Fig. 16a). The age by incongruent trial type interaction was driven by a slight increase in accuracy with age on incongruent trials compared to a slight decrease in accuracy with age on neutral trials (see Fig. 16b).

Reaction time on correct-response trials

Our second research question was whether RT on correct-response trials was slower on incongruent trials than on other types and whether this effect varied by age, which we addressed with repeated-measures ANOVAs. Figure 17 shows adults’ mean RT on correct-response trials across trial types and the relation between RT and age per trial type, separately for the direction and color tasks. The frequentist ANOVA indicated no significant effects. The Bayesian ANOVA showed that the best-fitting model (BF₁₀ = 1.35 × 10³⁹) included main effects of age (BFincl = .65), trial type (BFincl = 2.83), and stimulus type (BFincl = 3.21 × 10³⁸), plus age by stimulus type (BFincl = 1.84) and trial type by stimulus type (BFincl = 5.70) interactions. Note that the effects related to age and trial type were weak (i.e., BFincl < 3), so we did not conduct post hoc comparisons for the trial type effect. The stimulus type effect reflects overall slower correct-response trial RT in the color task (M = 0.78 s, SD = 0.11, 95% CI [0.49, 0.60]) than in the direction task (M = 0.46 s, SD = 0.05, 95% CI [0.34, 0.40]), presumably due to the different response modalities. The trial type by stimulus type interaction is driven by an effect of trial type in only the direction task, as shown by the Bayesian ANOVAs below.

Fig. 17

Adults’ (n = 16; 19–23 years old) mean reaction time (RT) on correct-response trials across trial types (a, c) and the relation between mean reaction time and age per trial type (b, d) in the direction task (a, b) and color task (c, d) in Experiment 2b. In a and c, boxes show second and third quartiles in light and dark gray (respectively) above and below the median, and whiskers show 1.5 times the interquartile range. Note that some boxes and whiskers are not visible when the distribution is restricted. In b and d, lines show the linear fit to the observed values

Separating by task, we first analyzed adults’ mean RT on correct-response trials in the direction task. The frequentist ANOVA indicated no significant effects. The Bayesian ANOVA showed that the best-fitting model (BF₁₀ = 1.50 × 10⁵) included a main effect only of trial type (BFincl = 1.64 × 10⁵). Post hoc comparisons for this analysis yielded strong evidence that mean RT was slower on incongruent trials than on congruent and neutral trials (posterior odds = 2,415.33 and 176.60, respectively; see Fig. 17a). For the color task, the frequentist ANOVA on adults’ mean RT on correct-response trials yielded no significant effects. The Bayesian ANOVA showed that no models fit better than the null (BF₁₀ < .67). These analyses indicate the expected congruency effect in the direction task but not in the color task.

Modified-SRT scores

Our third research question was whether modified-SRT scores were lower on incongruent trials than other types and whether this effect varied by age, which we addressed with repeated-measures ANOVAs. Figure 18 shows adults’ mean modified-SRT across trial types and the relation between modified-SRT and age per trial type, separately for the direction and color tasks. The frequentist ANOVA yielded no significant effects. The Bayesian ANOVA indicated that the best-fitting model (BF₁₀ = 2.92 × 10⁷) included main effects of age (BFincl = 1.07), trial type (BFincl = 326.01), and stimulus type (BFincl = 1.01), plus the trial type by stimulus type interaction (BFincl = 1.08 × 10⁵), although the age and stimulus type effects were weak. Post hoc comparisons of the trial type effect yielded strong evidence that modified-SRT scores were lower on incongruent trials (M = 3.27, SD = 1.43, 95% CI [2.57, 3.97]) than on congruent (M = 4.54, SD = 1.67, 95% CI [3.72, 5.36]; posterior odds = 36.05) and neutral trials (M = 4.57, SD = 1.48, 95% CI [3.84, 5.29]; posterior odds = 133.62). The trial type by stimulus type interaction reflects different trial type effects across tasks (compare Fig. 18a and c), as shown by the Bayesian ANOVAs below.

Fig. 18

Adults’ (n = 16; 19–23 years old) mean modified-SRT across trial types (a, c) and the relation between modified-SRT and age per trial type (b, d) in the direction task (a, b) and color task (c, d) in Experiment 2b. In a and c, boxes show second and third quartiles in light and dark gray (respectively) above and below the median, and whiskers show 1.5 times the interquartile range. Note that some boxes and whiskers are not visible when the distribution is restricted. In b and d, lines show the linear fit to the observed values

Separating by task, we first analyzed adults’ mean modified-SRT in the direction task. The frequentist ANOVA yielded no significant effects. The Bayesian ANOVA showed that the best-fitting model (BF₁₀ = 3.32 × 10⁸) included a main effect only of trial type (BFincl = 3.59 × 10⁸). Post hoc comparisons indicated strong evidence that modified-SRT scores were lower on incongruent trials than on congruent and neutral trials (posterior odds = 1.11 × 10⁴ and 4.56 × 10⁴, respectively), as can be seen in Fig. 18a. We next analyzed adults’ mean modified-SRT in the color task. The frequentist ANOVA yielded no significant effects. The Bayesian ANOVA showed that the best-fitting model was barely better than the null (BF₁₀ = 1.03; a Bayes factor of 1 would indicate equal evidence for the null and alternative models), leading us to retain the null. Thus, both types of analyses indicate no evidence for a congruency effect in modified-SRT in the color task.

Discussion

Our main goals in this experiment were to test whether performance was generally better in the color task than the direction task in adults and to see whether adults showed the expected congruency effects in the color task. Adults’ performance was high overall and showed little variability, leading to no significant effects in the frequentist analyses. The Bayesian analyses indicated higher accuracy in the color task (at ceiling) than the direction task, as well as slower RT in the color task, presumably due to the difference in response modality. Once we accounted for baseline differences in RT by using deciles to calculate modified-SRT, there was only weak evidence for an overall difference across the tasks, but strong evidence for different trial type effects across tasks. Adults showed the expected congruency effects in the Bayesian analyses for all measures in the direction task, but not the color task. These results show that the color task did not produce congruency effects, again raising concerns that congruency effects in tasks designed for young children may not be comparable in adults.

General discussion

The flanker paradigm is frequently used as an index of developmental, individual, and group differences in visual selective attention (e.g., Rueda, Fan, et al., 2004; Zelazo et al., 2013; see Table 1). To make the task more appropriate for young children, experimenters may modify details of the stimuli and task interface (Anwyl-Irvine et al., 2020; Best, 2012; Bialystok et al., 2010; Chang et al., 2013; Davies et al., 2004; Di Martino et al., 2008; Fjell et al., 2012; Levin et al., 2008; McDermott et al., 2007; Ridderinkhof & van der Molen, 1995; Ridderinkhof et al., 1997; Rueda, Fan, et al., 2004; Rueda, Posner, et al., 2004; Scheres et al., 2003; Vaidya et al., 2005; Zelazo et al., 2013), but whether such changes affect how the task functions in older children and adults is often untested (see Table 2). Our original goal was to find a flanker task that was sufficiently engaging for young children (ages 2–6 years) to complete an adequate number of trials for comparison with other attention tasks. We first tested a group of young children in a traditional flanker task using (directional) fish stimuli on a laptop computer and found high error rates and slow reaction times, along with qualitative reports from research assistants that many children struggled to understand or complete the task. This led us to develop a color version on a tablet to provide young children with a simpler, more engaging task. This task was easier and more enjoyable for young children and showed congruency effects in most analyses. A longitudinal comparison of children who completed both tasks (approximately one year apart) showed better performance on the color task but no evidence for a difference in the congruency effect across tasks. However, the exploratory nature of our approach in Experiment 1 resulted in task type being confounded with children’s age and prior experience. To eliminate this confound, in Experiment 2 we tested the same two tasks, in counterbalanced order within a single session, in middle childhood (ages 6–10 years) and young adulthood (ages 19–23 years). These older age groups showed the expected congruency effect in the direction task but not in the color task. These findings offer a cautionary tale for other researchers in three major areas, providing both insight and forewarning; we discuss these in turn in this section, beginning with the narrowest application and ending with the broadest.

During our investigation of the congruency effect across these tasks and age groups, we encountered some challenges for analysis, which we addressed by adapting an alternative analysis, a modified version of signed residual time (Maris & van der Maas, 2012; Tijmstra & Bolsinova, 2021), to incorporate both response correctness and RT at the individual trial level. We felt this analysis was more appropriate for young children’s data and might provide insights into older children’s and adults’ performance as well, which we discuss further in the following subsection. We also included both frequentist and Bayesian analyses for each of our measures, the former for a more straightforward comparison with other studies in the literature and the latter for a more nuanced evaluation of evidence for or against the null hypothesis in each age group. We discuss potential benefits of Bayesian analysis in the subsection following the discussion of the modified-SRT scores. Then we return to our primary theoretical focus on testing the same cognitive construct over a wide range of ages and/or abilities. Finally, we conclude with the broader implications for behavioral studies.

Consideration #1: Measurement of accuracy and reaction time

The most common method of analysis in the existing literature examines accuracy and RT separately, with RT drawn only from correct-response trials (e.g., B. A. Eriksen & Eriksen, 1974; Fan et al., 2002; Rueda, Fan, et al., 2004a), leading us to begin with this analytical approach. However, we discovered that, unlike adults, young children made large numbers of errors when participating in the directional version of the flanker task. When using only correct-response trials, more than 15% of trials from young children needed to be excluded, resulting in 14 participants (23% of the sample) not contributing at least three trials per type to our analyses. We felt this analysis would miss meaningful aspects of children’s performance by excluding the lowest-performing children and the RTs from all incorrect-response trials. In addition, because response modality differed across tasks (keypress versus touching a screen), we sought a measurement that could also accommodate baseline reaction time differences between our tasks.

With these goals in mind, we opted to use a modified version of SRT (Maris & van der Maas, 2012; Tijmstra & Bolsinova, 2021) to incorporate both response correctness and reaction time into a single value per trial. We preferred this measure over others in the literature because it combines correctness and RT information on each trial. Other analyses that attempt to account for potential trade-offs between speed and accuracy (e.g., inverse efficiency scores, rate-correct scores, linear-integrated speed–accuracy scores, balanced integration scores; Liesefeld & Janczyk, 2019) combine this information at the level of the trial type. If participants contribute few or no observations in any trial type, as we found with young children on incongruent trials, those measures still lead to the exclusion of low-performing participants. To include the most data possible, a measure must take both into account at the trial level. Another analysis that combines accuracy and RT is the conditional accuracy function, in which RTs are binned and accuracy is calculated within each bin to then be compared across trial types (e.g., Gratton et al., 1992). This approach requires more trials per bin than we had in our task, making it inappropriate for our use. Thus, we felt that a variation of SRT could best capture the important features of our datasets.

Two related challenges we found when trying to apply SRT were the lack of a time limit on responding in our direction task and, for older children and adults, the contracted range of observed RT values relative to the task response window in the color task. To make this measure more comparable across tasks and age groups, we calculated deciles from raw RTs separately for each task and age group, then used the deciles for calculation rather than raw RTs subtracted from the limit of the response window. This modified-SRT calculation can therefore be used across tasks with different ranges of observed RTs and response windows, at least for tasks like ours in which the primary interest is a within-task manipulation (i.e., trial type) rather than differences between tasks and age groups. For application to assessments designed to provide an index of each individual’s ability relative to others, as intended by Maris and van der Maas (2012), this modification may be inappropriate (see Tijmstra & Bolsinova, 2021, for related discussion). We also found two drawbacks of this modified measure with our data. First, it does not provide an intuitive interpretation of scores. Unlike accuracy and RT, which can be understood in terms of trials and time, respectively, the meaning of a change in modified-SRT is not immediately apparent. Second, multiplying residual scores on incorrect trials by −1 increases the variance in scores relative to the use of RT alone (see Figs. 8, 11, and 15). This can make it more difficult to obtain significant effects using frequentist analyses, although our results showed that Bayesian analyses addressed this drawback (discussed further in the next section). Despite these challenges, we believe this method holds promise for future investigations of tasks in which both correctness and RT are expected to be affected by task details.

Regardless of the preferred measure, our experiments did not address the psychometric properties of either task: our intent to keep the tasks brief meant there were not enough trials for analyses like split-half comparisons. Psychometric properties are typically not reported (and therefore presumably not measured) for flanker tasks developed for use with young children (for example, among all of the studies we reviewed for Table 2, only Zelazo et al., 2013, included such an analysis). Paap and Sawi (2016) investigated test–retest reliability of common executive function tasks with adults and found high cross-block (r = .89–.93) and cross-session (Spearman–Brown prophecy = .94–.98) reliability for RT on incongruent and congruent trials in a directional flanker task (neutral trials were not included). These values were lower than one comparison from the literature they cited, but more work is needed to evaluate additional task designs with other stimulus details, as well as other versions and groups of interest (cf. Tables 1 and 2). Such investigations will also be useful in determining whether brief tasks, as are typically desirable in larger batteries or with young or atypical populations, can measure individual differences in these types of cognitive processes with such small numbers of trials. Without evidence for robust reliability of cognitive tasks in all groups of interest, researchers and clinicians should be cautious in interpreting group or individual differences in performance.
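
Where tasks do include enough trials, a split-half estimate with the Spearman–Brown correction mentioned above is straightforward to compute; a minimal sketch, assuming hypothetical vectors of per-participant mean RTs computed separately on odd- and even-numbered trials:

```r
# Split-half reliability with the Spearman-Brown prophecy correction,
# which projects the half-test correlation to full test length.
split_half_sb <- function(odd_means, even_means) {
  r <- cor(odd_means, even_means)
  2 * r / (1 + r)
}
```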

Consideration #2: Traditional versus modern statistical approaches

Our findings of somewhat inconsistent results between frequentist and Bayesian analyses across multiple measures (accuracy, mean RT on correct-response trials, and modified-SRT) also raise questions about the most appropriate analyses for data from flanker tasks. We included Bayesian analyses as a more sensitive approach to weighing evidence for the null hypothesis against the alternative, and for the built-in model comparison and quantification of evidence for specific effects within the best-fitting model. This statistical approach allows for a more nuanced interpretation of results than simply rejecting or retaining the null hypothesis based on a p-value threshold (e.g., Dienes, 2014; Marsman & Wagenmakers, 2017; van de Schoot et al., 2014). In the case of the accuracy data, we were also able to use Bayesian logistic regression when the frequentist models failed to converge on our low-variance data sets (i.e., the Experiment 2a color task and all logistic regressions in Experiment 2b). When the frequentist and Bayesian results diverged, the Bayesian analyses indicated evidence for effects that were not significant in the frequentist analysis, except for one effect in the cross-task analysis of correct-response trial RT in Experiment 2a. This divergence is likely driven by the high variability among participants (relative to variance between factors like trial type) on many of our measures (Dienes, 2014).

The potential advantages and limitations of applying Bayesian statistical analyses to psychology experiments have been discussed at length elsewhere (e.g., Dienes, 2014; Marsman & Wagenmakers, 2017; van de Schoot et al., 2014), and our inclusion of both Bayesian and frequentist results provides an illustrative example of their utility for this type of task. The specific benefits we saw were the increased utility with relatively small samples that had high variability in some measures and more straightforward reporting of certainty for effects (van de Schoot et al., 2014), as well as quantification of the evidence for or against specific factors compared across models (van den Bergh et al., 2020). Across our analyses, we found the Bayesian approach to be more robust and interpretable and recommend it for future investigations using flanker tasks as well as data from young children.

Consideration #3: Modifications of task design

Often in developmental research, tasks designed for adults are scaled down or simplified for use with children, a practice commonly assumed to be appropriate without further investigation, as noted in our review of other papers in Table 2 in the introduction. Our original task was designed to test young children (2–6 years old) in a directional flanker paradigm similar to Rueda, Fan, et al. (2004a) for comparison with other cognitive measures. However, we found very high error rates and long RTs in our sample of young children. This overall poor performance, along with qualitative reports from experimenters that children struggled with the task, led us to develop a new tablet-based color variant of the flanker paradigm for use with young children (3–7 years old, tested approximately one year after participating in the direction task), who generally found this version easier to complete. Longitudinal comparisons of the tasks suggested that the color task was easier for children to complete, but our analyses were limited by the high number of children excluded from the longitudinal sample due to poor performance.

To test the possibility that the higher performance in the color versus direction task resulted from the longitudinal nature of our investigation with young children, in which the color version had been completed approximately a year after the direction version, we next sought to compare both versions of the task within the same participants. In experiments with older children (6–10 years old) and young adults (19–23 years old), we tested both direction and color versions within participants in the same session. Both age groups showed similar results, with congruency effects in the direction task but not in the color task, which we did not expect. Performance in the color task was near ceiling across trial types for both older children and adults, with little variation in reaction time, despite the fact that prior studies with adults have demonstrated congruency effects using color stimuli (e.g., Brosowsky & Crump, 2018; Davranche et al., 2009; Diedrichsen et al., 2000; Rafal et al., 1996; Verbruggen et al., 2006). Our results suggest that the tablet-based color flanker task we developed would not be appropriate for all samples, but further investigation is needed to determine which features of the task led to differences across age groups. By extension from our age group comparison, it may also be the case that different populations or other types of groups, like those listed in Table 1, do not show congruency effects in some variants of the flanker task, warranting further study.

Our two tasks differed in multiple ways, all of them changes that are routinely implemented in the literature (see Table 2 and Table S1): the stimulus dimensions; presentation on a laptop versus tablet, with responses by keypress versus virtual button touch; the number of trials; automatic versus experimenter-controlled trial presentation; and the manner of feedback. We briefly consider how each of these might affect performance. As noted above, prior studies have found congruency effects using color as the stimulus type with adults, but to our knowledge, no other studies have tested color stimuli in children older than 7 years. There is thus no reason to expect that color stimuli cannot produce congruency effects, but perhaps only a narrower set of task parameters elicits such effects with color. Most flanker paradigms rely on shape as the relevant dimension, most commonly letters (B. A. Eriksen & Eriksen, 1974) or arrows (Ridderinkhof et al., 2021). One challenge when using different stimulus dimensions is equating the magnitude of difference between stimuli within a dimension (e.g., left- and right-facing arrows may be more similar to each other than red and green circles are). The effect of similarity among distractors was part of the original flanker study design, which showed that dissimilar flankers affect performance less than similar ones (B. A. Eriksen & Eriksen, 1974; see Enns & Akhtar, 1989, for a similar investigation including children). By extension, our task differences in older children and adults could be due to less similarity between red and blue circles than between left- and right-facing fish, although this raises the further question of why the same was not true for young children.

Presenting our stimuli on a laptop versus tablet would not necessarily change the task demands on visual attention, especially when the apparent sizes of the stimuli are comparable, but the different response modalities clearly elicit different motor demands (pressing a key versus touching the screen), which affect RT and possibly accuracy. Participants could rest their fingers on the keys in the direction task, allowing quick responses that may be more error-prone than touchscreen responses, whose longer execution provides more opportunity for correction. Changes in response modality appeared in half of the top-cited flanker studies with young children that we reviewed (see Table 2). Of the three that used touchscreen buttons similar to ours, one included only young children (Bialystok et al., 2010), and the other two did not specifically test whether children older than 7 years and adults showed reliable congruency effects, although their reported results suggest such effects were present (Fjell et al., 2012; Zelazo et al., 2013). In our analyses comparing tasks, we attempted to reduce the effects of baseline RT differences by modifying SRT to use RT deciles, but this cannot eliminate contributions from differences in the range and variability of RTs across trials, and it does not address potential influences on accuracy.
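To illustrate how a decile-based transformation removes baseline RT differences, the sketch below converts each trial's raw RT to a within-participant, within-task decile rank. This is a hypothetical Python illustration rather than our analysis code, and the signed combination with accuracy in the final line is our assumption about one way such data could be scored, not the exact modified-SRT formula.

```python
# Minimal sketch (hypothetical column names): re-express RTs as deciles so
# "fast" and "slow" are defined relative to each participant's own RT
# distribution, not the group's.
import pandas as pd

trials = pd.read_csv("flanker_trials.csv")  # one row per trial (hypothetical)

# Bin each trial's RT into deciles (1 = fastest tenth, 10 = slowest tenth)
# separately for each participant and task version
trials["rt_decile"] = (
    trials.groupby(["participant", "task"])["rt"]
    .transform(lambda x: pd.qcut(x, 10, labels=False, duplicates="drop") + 1)
)

# One possible signed score (an assumption, not the paper's exact formula):
# fast correct responses score high, fast errors score strongly negative
trials["signed_score"] = (2 * trials["correct"] - 1) * (11 - trials["rt_decile"])
```

Because deciles are computed within participant and task, a child whose RTs span 800–3000 ms and an adult whose RTs span 300–600 ms contribute scores on the same scale, although, as noted above, differences in the shape of those RT distributions are not fully removed.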

We also increased the total number of trials in the color task and added a contingency whereby additional trials were presented following errors. Our motivation for this change was to reduce the number of children who had to be excluded for contributing too few correct-response trials, although this was no longer an issue once we analyzed modified-SRT instead. Studies with adults have shown that congruency effects can be modulated over trials, but those tasks included many more trials than we tested here (e.g., Cochrane et al., 2021, compared the first 50 trials to the last 50 in a block of 400), and congruency effects were evident on the early trials and increased with more trials. As such, it seems unlikely that including more trials in the color task would reduce the congruency effect for older children and adults. Additionally, we implemented an experimenter-controlled inter-trial interval in our color task, which can help young children stay engaged with a task (cf. Simmering, 2012) and reduce the influence of their time to orient to each trial on RT. To our knowledge, the effect of such a change on attentional demands has not been tested directly (note that early instantiations of the flanker task used self-initiated trial onsets, e.g., B. A. Eriksen & Eriksen, 1974; C. W. Eriksen et al., 1985, but most subsequent studies used computer-controlled inter-trial intervals, e.g., those listed in Table 2).
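As an illustration of the error-contingent trial rule described above, the sketch below appends one extra trial to the queue after each error, up to a cap, so that error-prone participants still contribute a minimum number of correct-response trials. The function names, the cap, and the callback are hypothetical; our experiment software may have implemented this rule differently.

```python
import random

# Hypothetical base trial list: condition labels for one block
BASE_TRIALS = ["congruent", "incongruent", "neutral"] * 8

def run_block(present_trial, base_trials=BASE_TRIALS, max_extra=8):
    """Present trials, re-queueing one extra trial per error (up to
    max_extra) so enough correct-response trials are collected."""
    queue = random.sample(base_trials, k=len(base_trials))  # shuffled copy
    extras = 0
    results = []
    i = 0
    while i < len(queue):
        condition = queue[i]
        correct = present_trial(condition)  # callback returns True/False
        results.append((condition, correct))
        if not correct and extras < max_extra:
            queue.append(condition)  # add another trial of this condition
            extras += 1
        i += 1
    return results

# Usage with a stand-in "participant" who errs on roughly 20% of trials
demo = run_block(lambda cond: random.random() > 0.2)
```

Note that a rule like this makes total trial count, and therefore time on task, depend on accuracy, which is one reason the change could interact with group differences in performance.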

Lastly, our color task included no feedback on response correctness, whereas our direction task showed “correct” or “incorrect” after each trial; however, most young children could not read this feedback, and experimenters gave them general verbal encouragement regardless of correctness. Cochrane et al. (2021) compared congruency effects in tasks with versus without feedback on both response correctness and RT (with RTs longer than 500 ms identified as “slow”) and found only that RTs were slightly faster with feedback, with no effects on accuracy or on congruency effects. Furthermore, many prior flanker tasks have found congruency effects without feedback (e.g., Davranche et al., 2009; Diedrichsen et al., 2000; B. A. Eriksen & Eriksen, 1974; C. W. Eriksen & Schultz, 1979; Fan et al., 2002). Thus, it seems unlikely that the absence of feedback in the color task explains why we found no congruency effects.

Taken together, prior research provides no clear indication of which features of our color task design would lead to a congruency effect in young children but not older children or adults, and further testing would be needed to tease apart these possibilities. Regardless of the source, however, researchers need to be aware that modifying the details of a well-established cognitive task can lead to unexpected performance profiles across age groups. It remains an open question whether modifications like ours could also eliminate congruency effects in other populations, groups, or contexts of interest, such as those listed in Table 1. Understanding the precise sources of differences between task variants can shed light on whether and when we should expect tasks to function differently across age groups, but this will require careful testing and comparison before a task is used to assess group or individual differences.

Implications for future research

Our investigation was a practical comparison of two versions of the flanker task, motivated originally by the goal of creating a task that young children could complete reliably and quickly; it was not designed to evaluate or extend any particular theoretical explanation of congruency effects. Nevertheless, our results have implications for theories of attention and development that could explain why congruency effects are found in some instances (i.e., ages, groups, tasks, response modalities) but not in others. Although our design cannot differentiate potential explanations, candidates for the lack of congruency effects in older children and adults include inadequate time pressure and insufficient perceptual and/or response competition, both of which could be addressed through further task manipulations. Studies designed to adjudicate between theories or further develop explanations should consider whether and how these types of changes between our tasks would be expected to affect performance. Investigations that use congruency effects as a metric of comparison (cf. Table 1) must also consider that results may be influenced by individual, developmental, and group differences in congruency effects that are unrelated to their central research question. A deeper theoretical understanding of how methodological details affect performance across participant groups can help guide future task development.

Differences in selective attention have been investigated across a wide range of samples and under various conditions of interest, with the flanker task serving as one of the most common tools for such investigations (Ridderinkhof et al., 2021). Working with age groups and populations with different behavioral capabilities and needs often requires adapting cognitive tasks to be more engaging and accessible for the target sample (e.g., McDermott et al., 2007; Rueda, Fan, et al., 2004a; Zelazo et al., 2013). A common assumption underlying these adaptations is that a task showing the expected pattern of performance (i.e., more errors and slower reaction times on trials with incongruent flankers, compared to neutral or congruent flankers) will measure the same constructs as the original task with sufficient sensitivity to yield robust effects. Additional analytical approaches may detect congruency effects more reliably than traditional approaches, as we saw with modified-SRT and Bayesian analyses. However, it remains possible that typical modifications to canonical tasks will fail to elicit the effects of interest. Our results highlight the need for more systematic development and study of methods that can be used broadly, not only across age groups but also with special populations or under constraining contexts like neuroimaging, to ensure that tasks are sufficiently sensitive and robust to answer a variety of research questions.