In the years since their first introduction (ca. 1950s), videogames have only increased in popularity. In education, videogames are already widely applied as tools to support students in learning (cf. Boyle et al., 2016; Ifenthaler et al., 2012; Young et al., 2012). In contrast, less research has been done on the use of videogames as summative assessment environments, even though administering (high-stakes) summative assessments through games has several advantages.

First, videogames can be used to administer standardized assessments that provide richer data about candidate ability in comparison to traditional standardized assessments (e.g., multiple-choice tests; Schwartz & Arena, 2013; Shaffer & Gee, 2012; Shute & Rahimi, 2021). Second, assessment through videogames gives considerable freedom in recreating real-life criterion situations, which allows for authentic, situated assessment even when this is not feasible in the real working environment (Bell et al., 2008; Dörner et al., 2016; Fonteneau et al., 2020; Harteveld, 2011; Kirriemuir & McFarlane, 2004; Michael & Chen, 2006). Third, videogames can offer candidates a more enjoyable test experience by providing an engaging environment where they are given a high degree of autonomy (Boyle et al., 2012; Jones, 1998; Mavridis & Tsiatsos, 2017). Finally, videogames allow for assessment through in-game behaviors (i.e., stealth assessment), which aims to make assessment less salient to candidates so that they retain their engagement (Shute & Ke, 2012; Shute et al., 2009).

The benefits above highlight why videogames are viable assessment environments, irrespective of the specific level of cognitive achievement (e.g., those depicted in Bloom’s revised taxonomy; Krathwohl, 2002). Moreover, the possibility of immersing candidates in complex, situated contexts makes them especially interesting for higher-order learning outcomes such as problem solving and critical thinking (Dede, 2009; Shute & Ke, 2012). Therefore, videogames may provide a solution to the validity threats associated with traditional high-stakes performance assessments: an assessment type used to evaluate competencies through a construct-relevant task in the context for which it is intended (Lane & Stone, 2006; Messick, 1994; Stecher, 2010), often for the purpose of vocational certification.

The first validity threat associated with high-stakes performance assessments is the prevalence of test anxiety among candidates (Lane & Stone, 2006; Messick, 1994; Stecher, 2010), which has been shown to be negatively correlated with test performance (von der Embse et al., 2018; von der Embse & Witmer, 2014). Although some debate exists about the causal relationship between the two (Jerrim, 2022; von der Embse et al., 2018), it is apparent that candidates who experience test anxiety are unfairly disadvantaged in high-stakes assessment contexts.

The second threat is caused by the need for high-stakes performance assessments to be both standardized, to ensure objectivity and fairness (AERA et al., 2014; Kane, 2006), and to include a construct-relevant task (e.g., writing an essay, participating in a roleplay; Lane & Stone, 2006; Messick, 1994). While neither requirement rules out adaptivity (e.g., adaptive testing and open-ended assessments), the combination often restricts assessments to a linear performance task that is not adapted to candidate ability level. The potential mismatch between task difficulty and the ability level of candidates poses two disadvantages. First, the mismatch can frustrate candidates, which negatively affects their test performance (Wainer, 2000). Second, candidates likely receive fewer tasks that align with their ability level, which negatively affects test reliability and efficiency (Burr et al., 2023). High-stakes performance assessments would thus benefit from adaptive testing that is personalized and appropriately difficult, allowing candidates to be challenged enough to retain engagement (Burr et al., 2023; Malone & Lepper, 1987; Van Eck, 2006) while enabling assessors to determine efficiently and reliably whether the candidate is at the required level (Burr et al., 2023; Davey, 2011). Additionally, adaptive testing allows for more personalized (end-of-assessment) feedback that could further boost candidate performance (Burr et al., 2023; Martin & Lazendic, 2018).

The third threat identified in high-stakes performance assessment is a lack of assessment authenticity. Logically, assessment would best be administered in the authentic context (i.e., the workplace in the case of professional competencies). This leads to a high degree of fidelity: how closely the assessment environment mirrors reality (Alessi, 1988, as cited in Gulikers et al., 2004). Unfortunately, this is not attainable for competencies that are dangerous or unethical to carry out (Bell et al., 2008; Williams-Bell et al., 2015). Another concern is that workplace assessments are largely dependent on the specific workplace in which they are carried out. This would lead to considerable variation between candidates in testing conditions, as well as in the construct relevance of the tasks on which they are evaluated (Baartman & Gulikers, 2017). Because authenticity of the physical context and of the task are two dimensions required for mobilizing the competencies of interest (Gulikers et al., 2004), there is a need to achieve authenticity in other ways. Authenticity is also related to transfer: applying what is learned to new contexts. The higher the alignment between assessment and reality, the more likely it is that competence transfers to professional practice.

The fourth threat identified is inconsistency between raters in scoring candidate performance. Traditional high-stakes performance assessments are often accompanied by rubrics to evaluate candidate performance; however, inconsistencies in how rubrics are interpreted and used lead to construct-irrelevant variance (Lane & Stone, 2006; Wools et al., 2010). In this study, the aim is to investigate whether ‘serious games’ (SGs)—those “used for purposes other than mere entertainment” (Susi et al., 2007, p. 1)—provide a viable solution to this and the other limitations posed by traditional high-stakes performance assessments.

The most important characteristic of games is that they are played with a clear goal in mind. Many games have a predetermined goal, but other games allow players to define their own objectives (Charsky, 2010; Prensky, 2001). Goals are given structure by the provision of rules, choices, and feedback (Lameras et al., 2017). First, rules direct players towards the goal by placing restrictions on gameplay (Charsky, 2010). Second, choices enable players to make decisions, for example to choose between different strategies to attain the goal (Charsky, 2010). The extent to which rules restrict gameplay is also closely related to the choices players have in the game (Charsky, 2010). Thus, rules and choices seem to lie at two ends of a continuum that determines the linearity of a game. Linearity is defined as the extent to which players are given freedom of gameplay (Kim & Shute, 2015; Rouse, 2004). The third characteristic, feedback, is a well-studied topic in the field of education. In education, the main purpose of feedback is to help students gain insight into their learning and bring their understanding to the level of the learning goals (Hattie & Timperley, 2007; Shute, 2008; van der Kleij et al., 2012). In games, feedback is used in a similar way to guide players towards the goal, as well as to facilitate interactivity (Prensky, 2001). Feedback in games is provided in many modalities and gives players information about how they are progressing and where they stand with regard to the goal, for instance whether their actions have brought them closer to the goal or moved them further away. Games are made up of a collection of game mechanics that define the game and determine how it is played (Rouse, 2004; Schell, 2015). In other words, game mechanics are how the defining features of games are translated into gameplay. To illustrate, game mechanics that provide feedback to players can include hints, gaining or losing lives, progress bars, dashboards, currencies, and/or progress trees (Lameras et al., 2017).

When designing a game-based performance assessment, determining the information that should be collected about candidates to inform competence, and designing the tasks that fulfill this information need, are matters that should be considered carefully for each professional competency. One way to do so is through the use of the evidence-centered design (ECD) framework (cf. Mislevy & Riconscente, 2006). The ECD framework is a systematic approach to test development that relies on evidentiary arguments to move from a candidate's behavior on a task to inferences about candidate ability. It is beyond the scope of the current study to examine the design of game content in relation to the target professional competencies. In this systematic literature review, the aim is to determine which game mechanics could help overcome the validity threats associated with high-stakes performance assessments and are suitable for use in such assessments.

Previous research on game design has been conducted for instructional SGs (e.g., dos Santos & Fraternali, 2016; Gunter et al., 2008). For SGs used in high-stakes performance assessments, the potential effect of game mechanics on the validity of inferences should be considered in particular. For instance, choices in game design can affect correlations between in-game behavior and player ability (Kim & Shute, 2015). Moreover, game mechanics exist that are likely to introduce construct-irrelevant variance when used in high-stakes performance assessments. To illustrate, when direct feedback about performance (e.g., points, lives, feedback messages) is given to players, at least part of the variance in test scores would be explained by the type and amount of feedback a candidate has received.

Establishing design principles for SGs for high-stakes performance assessment is important for several reasons. First, such an overview allows future developers of such assessments to make more informed choices regarding game design. Second, combining and organizing the insights gained from the available empirical evidence advances the knowledge framework around the implementation of high-stakes performance assessment through games. Reviews on the use of games exist for learning (e.g., Boyle et al., 2016; Connolly et al., 2012; Young et al., 2012) or are targeted at specific professional domains (e.g., Gao et al., 2019; Gorbanev et al., 2018; Graafland et al., 2012; Wang et al., 2016). Nevertheless, a research gap remains, as no systematic literature review is known that addresses the high-stakes performance assessment of professional competencies. To this end, this study begins with identifying the available literature on SGs targeted at professional competencies; then extracts the implemented game mechanics that could help to overcome the validity threats associated with high-stakes performance assessment; and finally synthesizes game design principles for game-based performance assessment in high-stakes contexts.

The scope of the current review is limited to professional competencies specifically catered to a vocation (e.g., construction hazard recognition). More generic professional competencies (e.g., programming) are not taken into consideration, as the context in which they are used can also fall outside of secondary vocational and higher education. Additionally, there is a growing body of literature that recognizes the potential of in-game behavior as a source of information about ability level in the context of game-based learning (e.g., Chen et al., 2020; Kim & Shute, 2015; Shute et al., 2009; Wang et al., 2015; Westera et al., 2014). As the relationship between in-game behavior and candidate ability is of equal importance in assessment, the scope of the current review includes SGs that focus not only on assessment, but also teaching and training of professional competencies.

Method

The following section describes the procedure followed in conducting the current systematic literature review. First, a description of the inclusion criteria and search terms is given. This is followed by a description of the selection process and data extraction, together with an evaluation of the objectivity of the inclusion and quality criteria. Then, the search and selection results are presented, and two further categorizations of the included studies are operationalized: the type of competency and how a successful SG is defined.

Procedure

Following the guidelines described in Systematic Reviews in the Social Sciences (Petticrew & Roberts, 2005), the protocol below gives a description of and the rationale behind the review, along with a description of how studies were identified, analyzed, and synthesized.

Databases and search terms

The databases that include most publications from the field of educational measurement (Education Resources Information Center (ERIC), PsycInfo, Scopus, and Web of Science) were consulted for the literature search using the following search terms:

  • Serious game: (serious gam* or game-based assess* or game-based learn* or game-based train*) and

  • Quality measure: (perform* or valid* or effect* or affect*)

Inclusion criteria and selection process

The initial search results were narrowed down by selecting only publications that were published in English and in a scientific, peer-reviewed journal. To be included, studies were required to report on the empirical research results of a study that (1) focused on a digital SG used for teaching, training, or assessment of one or more professional competencies specific to a work setting, (2) was conducted in secondary vocational education, higher education, or vocational settings, and (3) included a measure to assess the dependent variable related to the quality of the SG. Studies were excluded when the focus was on simulations; while simulations have an overlapping role with SGs in the acquisition of professional competencies, the two represent distinct types of digital environments.

All results from the databases were exported to Endnote X9 (The EndNote Team, 2013) for screening. The selection process was conducted in three rounds. First, duplicates and alternative document types (e.g., editorials, conference proceedings, letters) were removed, and the publications were screened based on titles and abstracts; publications were removed when the title or abstract mentioned features of the study that were mutually exclusive with the inclusion criteria (e.g., primary school, rehabilitation, systematic literature review). Second, the titles and abstracts of the remaining results were screened again; when the title or abstract lacked information, the full article was inspected. To illustrate, some titles and abstracts did not mention the target population, whether the game was digital, or whether the professional competency was specific to a work setting. Finally, full-text articles were screened for full compliance with the inclusion criteria, and data was extracted from those publications.

The objectivity of the inclusion criteria was determined by blinded double classification on two occasions. On the first occasion, after the removal of duplicates and alternative document types, 30 randomly selected publications were independently double-classified by an expert in the field of educational measurement based on the title and abstract. An agreement rate of 93% with a Cohen’s Kappa coefficient of .81 translated to near-perfect inter-rater reliability (Landis & Koch, 1977). On the second occasion, a random selection of 32 publications considered for data extraction was blindly double-classified based on the full text by a master's student in educational measurement, which resulted in an agreement rate of 97% with a near-perfect Cohen’s Kappa coefficient (.94; Landis & Koch, 1977).
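To make the reliability computation concrete, the sketch below shows how percent agreement and Cohen’s Kappa can be computed for two raters; the include/exclude labels are hypothetical placeholders, not the actual screening decisions from this review.

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    # Proportion of publications on which the two raters give the same label
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    # Observed agreement corrected for the agreement expected by chance,
    # based on each rater's marginal label frequencies
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum((counts_a[label] / n) * (counts_b[label] / n) for label in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical screening decisions for 30 double-classified publications
rater_a = ["include"] * 10 + ["exclude"] * 20
rater_b = ["include"] * 9 + ["exclude"] * 21
print(percent_agreement(rater_a, rater_b), cohens_kappa(rater_a, rater_b))
```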

To assess the comprehensiveness of the systematic review and identify additional relevant studies, snowballing was conducted by backward and forward reference searching in Web of Science. For publications not available on Web of Science, snowballing was done in Scopus.

Data extraction

For the publications included, data was extracted systematically by means of a data extraction form (Supplementary Information SI1). The data extraction form includes: (1) general information, (2) details on the professional competency and research design, (3) serious game (SG) specifics, and (4) a quality checklist.

The quality checklist contains 12 closed questions with three response options: the criterion is met (1), the criterion is met partly (.5), and the criterion is not met (0). Studies that scored 7 or below were considered to be of poor quality and were excluded. Studies that scored between 7.5 and 9.5 were considered to be of medium quality, while studies with scores of 10 or above were considered to be of good quality (denoted with an asterisk in the data selection table; Supplementary Information SI2). These categories were determined by piloting the study quality checklist on two publications that met the inclusion criteria: one considered to be of poor quality and one considered to be of good quality. The scores obtained by those studies were set as the lower and upper threshold, respectively.
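As a minimal illustration of this scoring rule, the sketch below sums hypothetical checklist scores and maps the total onto the three quality categories; only the 0/.5/1 response options and the category thresholds follow the description above.

```python
def classify_study_quality(item_scores):
    # 12 closed questions, each scored 1 (met), .5 (partly met), or 0 (not met)
    assert len(item_scores) == 12
    assert all(score in (0, 0.5, 1) for score in item_scores)
    total = sum(item_scores)
    if total <= 7:
        return total, "poor (excluded)"
    if total <= 9.5:
        return total, "medium"
    return total, "good"  # scores of 10 or above

# Hypothetical checklist scores for one publication
print(classify_study_quality([1, 1, 0.5, 1, 1, 0.5, 1, 1, 1, 0.5, 1, 1]))  # (10.5, 'good')
```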

As this systematic literature review is focused on the extraction of game mechanics to inform game design principles, all articles included in the review needed to obtain a score of at least .5 on the criterion that the game is described in sufficient detail. When publications explicitly referred to external sources for additional information, information from those sources was included in the data extraction form as well.

Blinded double coding to determine the reliability of the quality criteria for inclusion was done by the same raters described above. A total of 24 randomly selected publications from the final review were included, with varying overlap between the three raters. The assigned scores were translated to the corresponding class (i.e., poor, medium, and good) to calculate the agreement rate. The rates ranged between 82% and 93%, corresponding to Cohen’s Kappa coefficients ranging from substantial to near perfect (.66–.88; Landis & Koch, 1977; Table 1).

Table 1 Results of reliability assessment of quality criteria by blinded double coding

Search and selection results

In the PRISMA flow diagram of the publication selection process (Fig. 1; Moher et al., 2009), the two rounds in which titles and abstracts were screened for eligibility are combined. The databases were consulted on the 21st of December 2020 and yielded a total of 6,128 publications. After the removal of duplicates, 3,160 publications were left. On the basis of the inclusion criteria, another 2,981 publications were excluded from the review, leaving 179 publications for full-text examination and data extraction. During the examination of the full-text articles, 129 studies were excluded due to insufficient quality (n = 42), lack of a detailed game description (n = 6), unavailability of the article (n = 5), not classifying the application as a game (n = 10), and an overall mismatch with the inclusion criteria (n = 66). In total, 50 publications were included. Snowballing was conducted in November of 2021 and resulted in the inclusion of six additional studies. In total, 56 publications were included in the final review.

Fig. 1

PRISMA flow diagram of inclusion of the systematic literature review. PRISMA preferred reporting items for systematic reviews and meta-analyses
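The selection counts reported above can be reconciled arithmetically; the short sketch below simply re-derives the totals from the numbers given in the text.

```python
# Reconciling the selection counts reported in the text
identified = 6128                      # records yielded by the database search
after_duplicates = 3160                # left after removing duplicates
excluded_on_title_abstract = 2981      # excluded based on the inclusion criteria
full_text_assessed = after_duplicates - excluded_on_title_abstract   # 179

full_text_exclusions = {
    "insufficient quality": 42,
    "no detailed game description": 6,
    "article unavailable": 5,
    "not classified as a game": 10,
    "mismatch with inclusion criteria": 66,
}                                      # 129 in total
included_from_search = full_text_assessed - sum(full_text_exclusions.values())  # 50
included_total = included_from_search + 6   # plus six studies found via snowballing
print(full_text_assessed, included_from_search, included_total)  # 179 50 56
```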

Categorization of selected studies

Competency types

Professional competencies are acquired and assessed in different ways. Given the variety of professional competencies, there is no universal game design that is likely to be beneficial across the board (Wouters et al., 2009). Other researchers (e.g., Young et al., 2012) even suggest that game design principles should not be generalized across games, contexts or competencies. While more content-related game design principles likely need to be defined per context, this review is conducted with the idea that generic game design principles exist that can be successfully used in multiple contexts. In that sense, the aim is to provide a starting point from where more context-specific SGs can be designed, for example through the use of ECD.

The review is organized according to the type of professional competency that is evaluated rather than the content of the SG under investigation, as this provides an idea of what researchers expect to train or assess within the SG. Different distinctions between competencies can be made. For example, Wouters et al. (2009) distinguish between cognitive, motor, affective, and communicative competencies. Moreover, Harteveld (2011) distinguishes between knowledge, skills, and attitudes. These taxonomies served as a basis to inductively categorize the targeted professional competencies into knowledge, motor skills, and cognitive skills.

The knowledge category includes studies that focus on, for instance, declarative knowledge (i.e., fact-based knowledge) or procedural knowledge (i.e., knowledge of how to do something), such as the procedural steps involved in cardiopulmonary resuscitation (CPR). The motor skills category refers to motor behaviors (i.e., movements); for CPR, an example would be compression depth. The cognitive skills category encompasses skills such as reasoning, planning, and decision making, for example studies that focus on the recognition of situations that require CPR.

Successful SGs

The scope of this systematic literature review is limited to SGs that are shown to be successful in the teaching, training, or assessment of professional competencies. As research methodologies differ between studies, there is a need to define what characterizes a successful SG. When an SG was used for teaching or training, it was deemed successful when a significant improvement in the targeted professional competency was found (e.g., through an external, validated measure of the competency). Some studies compared an active control group and an experimental group that additionally received an SG (e.g., Boada et al., 2015; Dankbaar et al., 2016; Graafland et al., 2017; see Supplementary Information SI2 for a full account): an SG was not deemed successful in the current results when these two groups showed comparable results. When an SG was used for assessment, it was deemed successful when (1) research results showed a significant relationship between the SG and a validated measure of the targeted competency, or (2) the SG was shown to accurately distinguish between different competency levels.

Results

The studies included in the review are discussed in two ways. First, descriptives of the included studies are given in terms of the degree to which games were successful in teaching, training, or assessment of professional competencies, the professional domains, and the competency types. Then, the game mechanics associated with the potential solutions to the validity threats in traditional performance assessment are presented.

Descriptives of the included studies

The final review includes 56 studies, published between 2006 and 2020 (consult Supplementary Information SI2 for a more detailed overview). No noteworthy differences were found between the SGs that aimed to teach, train, and assess professional competencies. Therefore, the results for the SGs included in the review are presented collectively.

Serious games with successful results

Divided by the type of professional competency evaluated, 84%, 83%, and 100% of studies reported research results showing the SG was successful for cognitive skills, knowledge, and motor skills, respectively (Table 2). Of the studies included in the systematic review, three found mixed effects of the SG under investigation between competency types (i.e., Luu et al., 2020; Phungoen et al., 2020; Tan et al., 2017).

Table 2 The proportion of studies that reported on serious games successful in teaching, training, or assessing professional competencies per competency type

Professional domains and competency types

The studies included in the review can be divided over seven professional domains (Table 3). These are further separated into professional competencies (see Supplementary Information SI2 for a full account). Examples include history taking (Alyami et al., 2019), crisis management (Steinrücke et al., 2020) and cultural understanding (Brown et al., 2018). Furthermore, the studies included in the review can be divided into three competency types: cognitive skills (n = 21), knowledge (n = 31), and motor skills (n = 4). An important note is that some studies evaluate the SG on more than one competency type, thus the sum of these categories is greater than the total number of studies included.

Table 3 Studies included in the review divided over type of competency and professional domain

Game mechanics

The following section discusses the inclusion of game mechanics—all design choices within the game—for the SGs discussed in the studies included in the review. Following the aim of the current paper, the game mechanics discussed are selected for having the potential to (1) mitigate the validity threats associated with traditional performance assessments, and (2) be appropriate for implementation in a game-based performance assessment.

Authenticity

Authenticity in the SGs is divided into two dimensions: authenticity of the physical context and of the task. First, an example of a physical context that was not representative of the real working environment was found for each of the three competency types (Table 4). Regarding the SGs targeted at cognitive skills, this was the case for Effic’ Asthme (Fonteneau et al., 2020). In this SG, the target population—medical students—would normally manage pediatric asthma exacerbations in a hospital setting. The game environment used is, however, the virtual bedroom of a child. Regarding the SGs targeted at knowledge, Alyami et al. (2019) implemented the game Metaphoria to teach history taking content to medical students. Here, the game environment is inside a pyramid within a fantasy world. The final SG using a game environment that does not resemble the real working environment, within the motor skills competency type, was studied by Jalink et al. (2014). In this SG, laparoscopic skills are trained by having players perform tasks in an underground mining environment.

Table 4 The degree of authenticity of the physical context

Second, of the studies for which task authenticity could be determined, all but four included an authentic task for the professional competency targeted (Table 5). Examples of a task that was not authentic were found for all three competency types. Two SGs that targeted cognitive skills did not include an authentic task (Brown et al., 2018; Chee et al., 2019) as a result of implementing role reversals. Within these SGs, players took on a reversed role, and thus the task was not authentic to the task in the real working environment. One SG targeting knowledge did not include an authentic task (Alyami et al., 2019). In Metaphoria, the task for players is to interpret visual metaphors in relation to symptoms, whereas the target professional competency was history taking content. Finally, in the SG studied by Drummond et al. (2017), targeting motor skills, the professional competency under investigation was not represented authentically within the game, as navigation was done through point-and-click.

Table 5 The degree of task authenticity in the serious game

Unobtrusive data collection

For all three competency types, studies were found that use in-game data to make inferences about player ability (Table 6). While other studies did mention the collection of in-game behaviors, the results were limited to those that assessed the appropriateness of using the data in the assessment of competencies.

Table 6 Unobtrusive data collection

Different measures of in-game behaviors were found. First, 12 SGs determine competency by comparing player performance to some predetermined target, sometimes also translated into a score. In the game VERITAS (Veracity Education and Reactance Instruction through Technology and Applied Skills; Miller et al., 2019), for instance, players are assessed on whether they accurately judge whether the statement given by a character in the game is true or false. Second, seven SGs use time spent (i.e., completion time or playing time) as a measure of performance. For example, in the SG Wii Laparoscopy (Jalink et al., 2014), completion time is used to assess performance. This in-game performance metric showed a high correlation with performance on a validated measure of laparoscopic skills, but it should be noted that time penalties were included for mistakes made during the task. Finally, the use of log data was found in one SG targeted at cognitive skills (Steinrücke et al., 2020). In the Dilemma Game, in-game measures collected during gameplay were found to have promising relationships with competency levels.
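To illustrate the two most common types of in-game measures described above, the sketch below shows a target-based accuracy score and a penalized completion time; the penalty of 10 seconds per mistake and the example values are hypothetical and not taken from the cited studies.

```python
def target_based_score(observed, target):
    # Share of in-game responses matching a predetermined target,
    # e.g. correct true/false judgements in a VERITAS-style task
    return sum(o == t for o, t in zip(observed, target)) / len(target)

def penalized_completion_time(raw_time_s, n_mistakes, penalty_s=10.0):
    # Completion time plus a fixed penalty per mistake (lower is better);
    # the penalty size is a hypothetical illustration
    return raw_time_s + n_mistakes * penalty_s

# Hypothetical gameplay data for one candidate
print(target_based_score([True, False, True, True], [True, False, False, True]))  # 0.75
print(penalized_completion_time(raw_time_s=312.0, n_mistakes=3))                  # 342.0
```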

Adaptivity

In SGs, the difficulty level can be adapted in two ways: independent of the actions of players or dependent on the actions of players (Table 7). Whereas SGs that varied in difficulty level were found for professional competencies related to both knowledge and motor skills, none were found for professional competencies related to cognitive skills. Three SGs were found that adjusted the difficulty level based on player actions; however, none of them adjusted the difficulty level downward based on player actions. Three studies evaluated SGs where the difficulty level was varied independently of player actions. Regarding the SGs targeted at knowledge, players either received fixed assignments (Boada et al., 2015) or were able to set the difficulty level prior to gameplay (Taillandier & Adam, 2018). The SG studied by Asadipour et al. (2017), targeting motor skills, increased challenge by building up the flying speed during the game as well as through the random generation of coins, but this was independent of player ability. Two SGs targeted at knowledge did mention difficulty levels, but not how they were adjusted. The SG Metaphoria (Alyami et al., 2019) included three difficulty levels. The SG Sustainability Challenge (Dib & Adamo-Villani, 2014) became more challenging as players progressed to higher levels, but it is not clear when or how this was done.

Table 7 Adaptivity incorporated within the serious games
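As a minimal sketch of action-dependent difficulty adjustment of the kind found here, the snippet below raises the difficulty level after a streak of successful tasks; in line with the reviewed SGs it never adjusts downward, and the streak length and number of levels are hypothetical choices, not taken from any of the cited games.

```python
def adjust_difficulty(level, streak, max_level=3, streak_needed=3):
    # Raise the difficulty level after a streak of successful tasks;
    # the level is never lowered, mirroring the reviewed SGs
    if streak >= streak_needed and level < max_level:
        return level + 1
    return level

level, streak = 1, 0
for success in [True, True, True, False, True, True, True]:  # hypothetical task outcomes
    streak = streak + 1 if success else 0
    level = adjust_difficulty(level, streak)
print(level)  # ends at level 3 after two streaks of three successes
```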

Test anxiety

As described earlier, games are able to provide a more enjoyable testing experience by providing an engaging environment with a high degree of autonomy. Therefore, the ways in which the game characteristics (feedback, rules, and choices) are expressed in the studies included in the review are discussed below. To avoid confusion with the linearity of assessment, the expression freedom of gameplay is used to describe the interaction between rules and choices.

First, seven examples were found where players are given feedback unrelated to performance (Table 8). Some ways feedback was given included a dashboard (Perini et al., 2018), remaining resources (Calderón et al., 2018; Taillandier & Adam, 2018), remaining time (Calderón et al., 2018; Dankbaar et al., 2017a, 2017b; Mohan et al., 2014), or remaining tasks (Jalink et al., 2014).

Table 8 Feedback unrelated to performance given in the serious games

Second, all but two of the studies included in the review involve game mechanics that give some freedom of gameplay (Table 9). For cognitive skills and knowledge, game mechanics included the choice between multiple options (n = 14 for both), the inclusion of interactive elements (n = 8 for both), and the possibility of free exploration (n = 5 and n = 8, respectively). Two examples of customization were found: Dib and Adamo-Villani (2014) gave players the choice of avatar, whereas Alyami et al. (2019) allowed for a custom name. For the SGs that target motor skills, freedom of gameplay was given through control over the movements. For three out of four SGs in this category, special controllers were developed to give players authentic control over the movements in the game. This was not the case for Drummond et al. (2017), as their game did not explicitly train CPR; however, the researchers did assess its effect on motor skills.

Table 9 Freedom of gameplay

Discussion

Included studies

The final review included 56 studies. Of these, many reported positive results. This suggests that SGs are often successful in teaching, training, or assessing professional competencies, but it could also point to a publication bias towards positive results. As reviews similar to the current one (e.g., Connolly et al., 2012; Randel et al., 1992; Vansickle, 1986; Wouters et al., 2009) draw on similar databases, it is difficult to establish which explanation holds. Some studies found mixed results for different competency types, suggesting that different approaches are warranted. Therefore, game mechanics in SGs for different competency types are discussed separately.

The review included few studies on SGs targeting motor skills compared to those targeting cognitive skills and knowledge. The low number of SGs for motor skills could be due to the need for specialized equipment to create an SG targeting motor skills. For example, Wii Laparoscopy (Jalink et al., 2014) is played using controllers that are specifically designed for the game. Not only does this require an extra investment, it also affects the ease of large-scale implementation. There is no indication that motor skills cannot be assessed through SGs: four out of five studies have shown positive effects, both in learning effectiveness and in assessment accuracy. Despite this, the benefits may only outweigh the added costs in situations where it is unfeasible to perform the professional competency in the real working environment.

Authenticity

Focusing on game mechanics for the authenticity of the physical context and the task, the results indicate that SGs are able to provide both. It should be noted that, while SGs are able to simulate the physical context and task with high fidelity, authenticity remains a matter of perception (Gulikers et al., 2008). The review focused only on those SGs that were successful when compared to validated measures of the targeted professional competency. Since these measures are considered to be accurate proxies for workplace performance, the transfer to the real working environment is likely to have been made. For all three competency types, examples were found for SGs that did not include an authentic physical context or authentic task, while still mobilizing competencies of interest. Even though the number of SGs in these categories is quite small, it does indicate that it is possible to assess professional competencies without an authentic environment or task.

Unobtrusive data collection

The in-game measures most often used in the included SGs are those that indicate how well a player did in comparison to some standard or target. This suggests that SGs are able to elicit behavior in players that is dependent on their ability level in the target professional competency. Since the accuracy measures varied depending on the professional competency, an investigation is warranted to determine which in-game measures are indicative of ability per situation. Evidentiary frameworks such as the ECD framework can provide guidance in determining which data could be used to make inferences about candidate ability. Despite the promising results, more research should be done on the informational value of log data before claims can be made.

Adaptivity

Some examples of studies were found where the difficulty of the SG was adaptive. In particular, some promising relationships between in-game behaviors and ability level were found. In traditional (high-stakes) testing, adaptivity has already been implemented successfully (Martin & Lazendic, 2018; Straetmans & Eggen, 2007). There are, however, professional competencies for which ability levels cannot be differentiated: one is either able to perform them or not. For such competencies, adaptivity does not have an added benefit. In contrast, for professional competencies where it is possible to differentiate ability levels, adaptivity should be considered.

Feedback

Considering the appropriateness of game mechanics for high-stakes assessment, the feedback considered in the current review was limited to progress feedback. This adds a fourth type of feedback to those already recognized for assessment: knowledge of correct response, elaborated feedback, and delayed knowledge of results (van der Kleij et al., 2012). Although the small number of SGs that incorporated progress feedback affects the generalizability of this finding, it does indicate that feedback about progress may be the most appropriate solution.

Freedom of gameplay

A variety of game mechanics implemented in the SGs included in the review provide freedom of gameplay. While some studies did not elaborate on the choices given in the game, common ways players are given freedom are through choice options, interactive elements, and freedom to explore. These game mechanics were found in various studies, which raises the possibility that these findings can be generalized to new SGs targeted at assessing professional competencies. Other game mechanics related to freedom of gameplay were also found, albeit less frequently; further research should shed light on their generalizability. Moreover, the freedom of gameplay provided to the player plays a substantial role in shaping overall player experience and behavior (Kim & Shute, 2015; Kirginas & Gouscos, 2017). Therefore, future research should shed further light on whether different game mechanics influence players in different ways.

Limitations

Although the current systematic literature review provides a useful overview of the game design principles for game-based performance assessment of professional competencies, some limitations are identified.

First, the review covered a substantial number of studies from the healthcare domain. This may be because the medical field consists of many higher-order, standardized tasks that may be particularly suitable for SGs. Although the large contribution of studies from the healthcare domain could limit the generalizability to other domains, the results of this systematic review were quite uniform; no indication was found that SGs in healthcare employed different game mechanics. Moreover, there is a growing popularity of SGs in healthcare education (Wang et al., 2016), resulting in a higher number of available studies compared to other professional domains. It is advisable to regard the current results as a starting point for game design principles for game-based performance assessment. Further research into the generalizability of game design principles across professional domains is warranted.

The second limitation is true for all systematic literature reviews: it is a cross section of the literature and may not present the full picture. The inclusion of studies is dependent on what is available in the search databases, what is accessible, and what keywords are included in the literature. Likely due to this limitation, only studies published from 2006 are included in the review, while the use of SGs dates back much further (Randel et al., 1992; Vansickle, 1986). To minimize the omission of relevant literature, snowballing was conducted on the final selection of studies. This method allowed for including related and potentially relevant studies. In total, six additional publications were included through this method out of the 2,370 considered.

After snowballing, an assessment of why these additionally included studies were not found through the original search resulted in various insights. First, three studies used the term (educational) video game in their publication on SGs (Duque et al., 2008; Jalink et al., 2014; Mohan et al., 2017). Including this term in the original search would have resulted in too many hits outside the scope of the current review. Second, Moreno-Ger et al. (2010) used the term simulation to describe the application, but refer to the application as game-like. As simulations fall outside the scope of the current review, the absence of this study in the initial search cannot be attributed to a gap in the search terms. Third, the publication by Blanié et al. (2020) was probably not found due to a mismatch in search terms related to the quality measure. Additional search terms such as impact or improve could have been included. As only one additional study was found that presented this issue, it is unlikely to have had a great effect on the outcome of the review. Finally, it is unclear why the study by Fonteneau et al. (2020) was not found through the initial search, as it matched the search terms used in the current review. Perhaps this omission can be ascribed to the search databases queried.

Finally, many of the studies included in the review compare SGs to other, non-digital or digital, alternatives in terms of learning. These types of studies often include many confounding variables (Cook, 2005), because the compared interventions differ in more ways than one. These differences can affect the results in different ways: positively, negatively, or through an interaction with other features.

Suggestions for future research

Besides providing interesting insights, the current review also has implications for research. First, the review identified SGs successful in teaching, training, or assessment that did not authentically represent the physical context or task, although too few examples were found in this review to generalize this finding. Second, while some studies were found in which the difficulty of the SG was adaptive, more studies should be conducted on the implementation of adaptivity within SGs; in particular, on how in-game behavior can be used to match the difficulty level to the ability level of candidates. Third, fantasy is included in many games (Charsky, 2010; Prensky, 2001) and is regarded as one of the reasons for playing them (Boyle et al., 2016). By including fantasy elements in game-based performance assessments, assessment can become even more engaging and enjoyable, and candidates can become even less aware of being assessed. For learning, it has been suggested that fantasy should be closely connected to the learning content (Gunter et al., 2008; Malone, 1981), but further research might explore whether this holds for SGs used for the (high-stakes) assessment of professional competencies. Furthermore, while fantasy elements may blur the direct link between the SG and the professional practice, in-game behavior may still have a clear relationship with professional competencies (Kim & Shute, 2015; Simons et al., 2021). More research into the effect of authenticity on the measurement validity of SGs in assessing professional competencies is warranted.

Implications for practice

Based on the results of the review, four recommendations can be made for practice. First, regardless of the competency type: design the SG in such a way that both the task and the context are authentic. The results have shown that SGs are able to provide a representation of the physical context and task that is authentic to the professional competency under investigation. Thus, in situations where the physical context or assessment task is difficult to represent in a traditional performance assessment, SGs can provide a solution. At the same time, implementing non-authentic (fantasy) contexts and tasks should be investigated further before being applied in high-stakes performance assessment.

Second, ensure that in-game behavior within the SG is collected. This review has synthesized additional evidence for the potential of in-game behavior as a source of information about ability level. That being said, the in-game behavior that can be used to inform ability level is dependent on both the professional competency of interest and the game design. While no generalized design principles regarding the collection of gameplay data can be given, evidentiary frameworks (e.g., ECD) can be used to determine which in-game behavior can be used to infer ability level. This is ultimately connected to the implementation of adaptivity. While a limited number of SGs were found that implemented adaptivity, the potential to unobtrusively collect data about ability level underscores a missed opportunity for the wider implementation of adaptivity in SGs. Taken together with the successful implementation of adaptive testing in traditional high-stakes assessments (Martin & Lazendic, 2018; Straetmans & Eggen, 2007), a third recommendation would be to implement adaptivity where appropriate.

Finally, this review gives an overview of the game mechanics for high-stakes game-based performance assessment with little risk of affecting validity. To provide freedom of gameplay in SGs targeted at cognitive skills and knowledge, include free exploration, interactive elements, and choice options. For motor skills, giving control over movements is a perhaps straightforward game design principle. Furthermore, feedback in SGs for high-stakes performance assessments can be given through progress feedback, which differs from traditional types of feedback in education (van der Kleij et al., 2012) but has the potential to satisfy feedback as a game mechanic. These recommendations, intended for game developers, may prove useful in designing future SGs for the (high-stakes) assessment of professional competencies.