Introduction

Assessment is an essential part of teaching and learning. Teachers can verify the learning progress of their students, both in formative and summative settings. Learners, on the other hand, are informed about their progress, either through external or self-assessment. Based on the assessment results, teaching or learning can then be adapted in order to improve the learning experience and the acquired competences.

Models like Bloom’s taxonomy [1] or Miller’s pyramid [2] present different levels of learning objectives and competences. For each level, a different set of assessment tools is adequate. The choice of evaluation tool depends on the target objective and level of complexity. For the lower levels of both models, where learners need to remember and understand factual knowledge, assessment often makes use of multiple choice questions (MCQ, including true/false questions), cloze (fill-in-the-blanks) questions, matching and ordering activities or open-ended questions (e.g., essays).

Each tool has its advantages and disadvantages [3]: While MCQs can be automatically graded, e.g., by a Learning Management System (LMS) like Moodle [Footnote 1], and can cover a broad range of learning items through a large set of questions within a single test [4], they are more difficult to create than open-ended questions, as the answer possibilities (also known as options) need to be carefully chosen. For essays, there is a reduced risk of guessing: As there are no given options, the knowledge actually has to be recalled. Essays also tend to better evaluate reasoning capabilities. However, different graders might assign different marks to the same essay answer, which results in lower reliability compared to MCQs.

The creation of high-quality MCQs is a time-consuming task [5, 6], even more so considering that changes in the curriculum or students collecting questions from previous years require teachers to create new sets of questions every once in a while [7]. Nevertheless, MCQs are the most common type of knowledge assessment [5], e.g., in medical education [7].

Given the heavy usage of MCQs in assessment and, at the same time, their complex and time-consuming creation, automatic question generation (AQG) solutions have been developed. AQG encompasses Question Generation (QG), Question Answering (QA) and, in the case of MCQs, Distractor Generation (DG) [8, 9]. With the recent advent of Large Language Models (LLM), such as the Generative Pre-trained Transformer (GPT), there has been new momentum for AQG. LLM-generated content is often evaluated through metrics like BLEU (BiLingual Evaluation Understudy), ROUGE (Recall-Oriented Understudy for Gisting Evaluation) or METEOR (Metric for Evaluation of Translation with Explicit ORdering), which are based on precision and recall. These metrics evaluate quality from a natural language processing (NLP) point of view. However, in the case of LLM-generated (multiple choice) questions, human-based evaluation is still needed to ensure semantic correctness and determine the relevance of the generated questions. For instance, an LLM could create specific questions on aspects of a topic that were not covered in detail in class, which would hamper instructional alignment. A docimological evaluation of generated questions, keys and distractors, according to the best practices for MCQ creation [10,11,12], is thus necessary.

In this article, we propose a docimological analysis of the quality of LLM-generated MCQs. We employ zero-shot approaches in two domains, namely computer science and medicine. In the former, we make use of 3 GPT-based services to generate MCQs. In the latter, we develop a Moodle plugin that leverages the recent Assistants API of OpenAI to generate MCQs based on learning material. Based on common multiple-choice item writing guidelines, we check the generated MCQs for docimological flaws. Our main contributions are thus a docimological analysis of LLM-generated MCQs in two domains as well as a Moodle plugin that enables teachers to generate questions without leaving the LMS.

The remainder of this article is organized as follows: In Sect. “Multiple Choice Questions”, common item writing guidelines and basics of docimology are explained. AQG approaches, including LLM-based ones, are presented in Sect. “Automatic Question Generation”. Two case studies present our results for computer science and medicine in Sects. “Case Study 1: Computer Science” and “Case Study 2: Medicine”, respectively. We discuss common flaws, their solutions and the necessary steps towards higher-order activities in Sect. “Discussion”. We summarize our findings in Sect. “Conclusion”, while mentioning directions for future work.

Multiple Choice Questions

As shown in Fig. 1, an MCQ typically consists of a stem, which can be more or less elaborate. For instance, a clinical case in a medicine-related quiz may need to give some context, such as the patient’s medical record. The stem is followed by a set of options, among which there is one right answer, sometimes several, called key(s), and one or several wrong answers called distractors. The distractors need to be plausible while being unambiguously wrong: An implausible distractor can be easily discarded by the testee and thus has little to no value. The number of options does not necessarily influence the difficulty of the question, and an item with 3 options may remain valid and reliable: In fact, two plausible distractors are better than multiple implausible ones [10].

Note that neither the stem nor the options need to be limited to text, as multimedia items such as images, sounds or videos could also be used. For instance, in a cardiology course, the options could present different electrocardiograms among which the learner needs to choose the right answer.
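For illustration only, this anatomy can be captured in a small data structure. The following Python sketch uses field names of our own choosing and does not correspond to any particular LMS schema:

```python
from dataclasses import dataclass

@dataclass
class Option:
    text: str        # option text; could equally reference an image, sound or video
    is_key: bool     # True for a key, False for a distractor

@dataclass
class MCQ:
    stem: str                # question text, possibly with context (e.g., a clinical vignette)
    options: list[Option]    # keys and distractors

item = MCQ(
    stem="Which organ primarily produces insulin?",
    options=[
        Option("Pancreas", is_key=True),
        Option("Liver", is_key=False),        # plausible, yet unambiguously wrong
        Option("Spleen", is_key=False),
        Option("Gallbladder", is_key=False),
    ],
)
```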

Fig. 1 Example of a multiple choice question

Common Multiple-Choice Item Writing Guidelines

There is a consensus in the literature regarding the good and bad practices when it comes to creating assessment items such as MCQs, often reported in item writing guidelines [10,11,12]. Identifying common flaws and pitfalls before a question is included in an assessment is of utmost importance, even more so if the assessment is summative or used for certification. While the list of documented bad practices is long, in this work we considered the following 8 most common ones when evaluating the docimological quality of generated questions (a minimal heuristic check for some of them is sketched after the list):

  • Key too long: A key can be more easily identified if it is more detailed than the distractors.

  • Implausible distractors: An implausible distractor can be easily discarded, augmenting the chances of guessing the right answer.

  • Grammar hint: If the stem gives a grammar hint that would exclude several distractors, the key can be more easily guessed. For instance, a French stem ending in the feminine article la would exclude masculine distractors.

  • All of the above: Usage of this option is generally discouraged [11]. Instead, all correct options should be keyed.

  • None of the above: There have been mixed results reported in literature concerning this option [11, 13]. It is advised to use it with caution.

  • No key: An MCQ needs at least one key, otherwise it will be impossible to distinguish it from a non-answered question.

  • Ambiguous key: The key(s) should be unambiguously correct and all distractors should be unambiguously wrong. No margin for interpretation should be left to the knowledgeable testee.

  • Complex questions: In these questions, also known as K-type questions, the options consist of combinations, either of statements listed within the stem itself, or of other neighboring options. Such questions evaluate logical skills rather than content knowledge.
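Some of these flaws can be flagged mechanically. Building on the illustrative MCQ structure sketched above, a minimal heuristic check could look as follows; the length threshold is arbitrary, and such a script is no substitute for expert review (our own evaluation relied exclusively on domain experts):

```python
def heuristic_flaws(q: MCQ) -> list[str]:
    """Flag mechanically detectable flaws; plausibility and ambiguity of
    options still require a human domain expert."""
    flaws = []
    keys = [o.text for o in q.options if o.is_key]
    distractors = [o.text for o in q.options if not o.is_key]
    texts = [o.text.strip().lower() for o in q.options]

    if not keys:
        flaws.append("no key")
    if keys and distractors and max(map(len, keys)) > 1.5 * max(map(len, distractors)):
        flaws.append("key too long")            # 1.5 is an arbitrary illustrative threshold
    if "all of the above" in texts:
        flaws.append("all of the above")
    if "none of the above" in texts:
        flaws.append("none of the above")
    return flaws

print(heuristic_flaws(item))   # -> [] for the example item above
```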

Basics of Docimology

Docimology (ancient Greek dokimé: test, logos: science), or the art of testing, comprises, among other things, the verification of pedagogical alignment between learning objectives, course content and assessment items, as well as post-exam item analysis through psychometric measures from Classical Test Theory (CTT) such as the difficulty index or the discrimination index. The difficulty index is a number between 0 and 1 that indicates how difficult a question was, with 1 being easy and 0 being difficult. It is typically calculated by dividing the number of respondents answering correctly by the total number of respondents. The discrimination index indicates how well a question is able to differentiate between high-performing and low-performing testees. It is a number between -1 and 1, and the higher the value, the better the question discriminates. It is typically calculated by considering the top 27% and bottom 27% of testees as per their overall test score. A more advanced tool for determining the discriminative power of an item is the point-biserial correlation coefficient [14].
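As a minimal sketch of these CTT measures, assume a binary vector of item scores (1 = correct, 0 = incorrect) and a vector of total test scores; the function names and the 27% split follow the description above:

```python
import numpy as np

def difficulty_index(item_scores):
    """Proportion of respondents answering the item correctly (1 = easy, 0 = difficult)."""
    return np.asarray(item_scores).mean()

def discrimination_index(item_scores, total_scores, fraction=0.27):
    """Upper-lower discrimination index based on the top and bottom 27% of testees."""
    item_scores = np.asarray(item_scores)
    order = np.argsort(total_scores)
    n = max(1, int(round(fraction * len(order))))
    low, high = order[:n], order[-n:]
    return item_scores[high].mean() - item_scores[low].mean()

def point_biserial(item_scores, total_scores):
    """Point-biserial correlation: Pearson correlation between a dichotomous
    item score and the total test score."""
    return np.corrcoef(item_scores, total_scores)[0, 1]
```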

A posteriori, the difficulty and discrimination indices can provide insights with respect to flaws [12]. If the question was too easy, it could comprise implausible distractors or present cues. If the content addressed by an item was not covered in the course, indicating poor instructional alignment, or if the item was written in an ambiguous way or was miskeyed, the item could turn out harder than intended (a low difficulty index) and the discrimination index could be too low.

Automatic Question Generation

In some domains, such as mathematics or physics, questions have often been dynamically generated based on parameters that could be customized, either through random values or values coming from a datasheet [15, 16]. AQG techniques can be classified into rule-based and neural network-based approaches [9]. To take into account non-numerical contexts, such as reading comprehension, AQG leverages recent advances in Natural Language Processing (NLP) and Artificial Intelligence (AI), such as transformers.

Mulla and Gharpure presented a review of AQG methodologies, datasets, evaluation metrics and applications [9]. In domain-specific applications, semantic web technologies such as ontologies are used to enrich the context. Kumar et al. use both semantic and machine learning techniques for MCQ stem generation [17]. Gilal et al. propose Question Guru, an NLP-based AQG for MCQs [18].

Several articles focused on the generation of multiple question types from a text through an encoder-decoder architecture-based text-to-text transfer transformer (T5) [19,20,21,22,23]. With the advent of large language models (LLM) such as the Generative Pre-trained Transformer (GPT), Dijkstra et al. developed EduQuiz, a GPT-3-based MCQ generator for reading comprehension tasks [8]. Doughty et al. employed GPT-4 to generate MCQs for programming courses by indicating the learning objectives (LOs) [5]. The authors created a pipeline that makes use of a considerable system prompt. The resulting questions presented clear language, a single correct choice and high-quality distractors, and were well-aligned with the LOs. Zuckerman et al. used ChatGPT [Footnote 2] to create USMLE-style [Footnote 3] MCQ items with a clinical vignette, vital signs and exam findings [7]. The generated items were used as formative assessment in a reproductive system course. No negative effect with respect to psychometric measures such as the discrimination index could be determined. The distractors produced by ChatGPT came from the same content area and were plausible and of similar length compared to the key. Interestingly, however, the content of the distractors was not always covered in the course. No hallucinations regarding factual information were detected. Cheung et al. compared 50 MCQs generated by ChatGPT against 50 questions created by human teachers in medical education [4]. The questions were generated and created, respectively, based on texts from 2 reference textbooks. A randomized assessment was conducted on appropriateness, clarity and specificity, relevance, discriminative power of alternatives and suitability for medical graduate exams. The only significant difference was found for relevance, where LLM-generated questions scored lower than human-created ones. Laupichler et al. compared student performance on 25 MCQs generated by ChatGPT (GPT-3.5) against 25 MCQs created by human medical educators, each with 1 key and 4 distractors [24]. The questions were tested by 161 students in a formative assessment setting. No significant difference was found in item difficulty, but there was one in item discrimination: LLM-generated questions discriminated less well than human-created ones. This shows that generated questions, just like human-created questions, should ideally be tested before usage in a summative assessment. Interestingly, students were also asked whether they thought a question stemmed from an LLM or a human. They correctly identified only 57% of the question sources.

A whole market of (commercial) services offering AI-based AQG has appeared. A comprehensive overview is given in [25].

When using LLMs, there are different levels of additional information developers can add to the base model for more specific contexts. Zero-shot prompting relies purely on the pre-trained knowledge the LLM comes with. In few-shot prompting, the prompt includes some information not necessarily known to the base model, e.g., by giving examples. For domains where specific knowledge is essential, fine-tuning an LLM adjusts the weights of the pre-trained model by training it on an additional dataset. Alternatively, Retrieval-Augmented Generation (RAG) [26] retrieves information from a separate dataset and provides it to the generator, e.g., an LLM. Finally, the recent Assistants API [Footnote 4] by OpenAI, at the time of writing still in Beta, enables developers to create custom assistants, maintain conversations in threads and retrieve additional knowledge from files.
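To make the difference concrete, the naive zero-shot prompt used for ChatGPT in Case Study 1 can be contrasted with a few-shot variant that prepends an example question. The Python sketch below uses OpenAI's Chat Completions API; the model name is an assumption and any chat-capable model would do:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

topic = "Processes and threads in operating systems"

# Zero-shot: rely purely on the model's pre-trained knowledge.
zero_shot = [{"role": "user", "content": f"Write 10 MCQs about: {topic}"}]

# Few-shot: prepend an example so the model imitates its format and style.
example = ("Example:\n"
           "Stem: Which scheduling policy can lead to starvation?\n"
           "A) Round robin  B) First come, first served  C) Priority scheduling*  D) Lottery scheduling")
few_shot = [{"role": "user", "content": f"{example}\n\nWrite 10 MCQs in the same format about: {topic}"}]

response = client.chat.completions.create(model="gpt-3.5-turbo", messages=zero_shot)
print(response.choices[0].message.content)
```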

Case Study 1: Computer Science

In this first case study, questions were generated for a first-semester English-language operating systems course from an undergraduate program in computer science [25]. The course consisted of 5 chapters, for each of which 10 questions were generated. Three different LLM-based AQG services were used for this purpose, resulting in a total of 150 questions. Each question was expected to comprise 1 key and 3 distractors. The questions were then analyzed by a domain expert as per the common item writing guidelines presented in Sect. “Common Multiple-Choice Item Writing Guidelines”.

Methods

The 3 LLM-based AQG services that were used are Quiz Wizard (QW), a companion tool of Wooclap [Footnote 5]; Quizicist (QC), an open-source tool developed at Brown University; and ChatGPT (CGPT), a general-purpose chatbot by OpenAI that is not limited to AQG. For CGPT, the model used in this experiment was GPT-3.5, while QC used GPT-4. The GPT version employed by QW is unknown.

QW, supporting file upload, was fed with PDF handouts of Keynote presentations including presenter notes. QC and CGPT were given the prompts shown in Table 1 (for CGPT, they were preceded by the indication “Write 10 MCQs about: ”). Without further indication, both QC and CGPT produced questions with 4 options.

Table 1 Prompts given to Quizicist and ChatGPT

Keeping the number of questions fixed (10 per chapter), while the chapters given to QW and the prompts given to QC and CGPT were of variable length, was a deliberate choice to see whether relevant aspects were selected. In keeping with the zero-shot approach, the prompts were also deliberately left naive.

All generated questions were imported into Moodle and tagged by a domain expert according to whether they presented one or several of the previously discussed flaws. The resulting question bank in Moodle XML format is available on figshare [Footnote 6].

Results

Among the 150 questions, 55 (37%) presented at least one of the flaws described above. The distribution of flaws is shown in Fig. 2. The most common issue was an ambiguous key, followed by keys that were too long and implausible distractors. Neither the None of the above option nor K-type questions were present in the dataset.

Fig. 2 Distribution of item flaws (global) in Case Study 1

Service-wise, the total number of flaws is comparable, with 19 out of 50 (38%) for both Quiz Wizard and Quizicist, and 17 (34%) for ChatGPT. Figure 3 shows the distribution of the 6 occurring flaws among the 3 services. Quiz Wizard presented more issues with respect to a too long or ambiguous key, but it produced fewer implausible distractors than Quizicist or ChatGPT. Grammar hints, All of the above and No key occurred in very few cases.

Fig. 3 Comparison of AQG services with respect to item flaws in Case Study 1

Case Study 2: Medicine

In the second case study, questions were generated in the realm of medicine and biology. A Moodle plugin was developed that enables teachers to generate MCQs directly from learning material available in a Moodle course and have them added to the corresponding question bank. One domain expert per subfield evaluated the generated questions.

Moodle Plugin

We developed a Moodle plugin (requiring at least Moodle version 4.1) that lets users enrolled with the Teacher role in a given course create MCQs based on File resources previously uploaded to the course. Screenshots are shown in Appendix A.

Fig. 4 Functioning of the Moodle plugin

The inner workings of this plugin are showcased in Fig. 4. First, the teacher needs to select the files for which questions should be generated. An asynchronous task is then launched that interacts with the OpenAI API in the background. Using this API requires an OpenAI API key, which can be configured in the plugin settings in the administration area of Moodle. The task then creates an Assistant, specifying the instructions, the model to be used and the tools. In the case of our plugin, the instructions were the prompt “You create multiple-choice questions about the files that you will receive.” In our experiment, we used the model gpt-4-1106-preview. The tool to be used was Retrieval, which enables the inclusion of knowledge coming from files. Note that the Assistants API automatically performs the steps that RAG approaches typically need to implement: content chunking, creation of embeddings and vector search [Footnote 7]. The creation of an assistant is only required once, as its ID is stored in the plugin settings.
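The plugin itself is written in PHP; purely for illustration, the equivalent calls to the (beta) Assistants API look as follows in Python, where the API key placeholder stands for the key configured in the Moodle settings:

```python
from openai import OpenAI

client = OpenAI(api_key="sk-...")  # in the plugin, the key comes from the Moodle admin settings

# The assistant is created only once; its ID is stored in the plugin settings and reused.
assistant = client.beta.assistants.create(
    name="MCQ generator",
    instructions="You create multiple-choice questions about the files that you will receive.",
    model="gpt-4-1106-preview",
    tools=[{"type": "retrieval"}],  # retrieval: chunking, embeddings and vector search are handled by the API
)
```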

Next, a separate question bank category is created. Moodle organizes questions in a question bank, which in turn can be organized in separate categories. To allow teachers to easily recognize which questions relate to which generation task, each category lists the involved file resources in its description. Each of the selected file resources is then uploaded to the assistant. Messages in the Assistants API are organized in Threads. The only message that is added to the newly created thread is the prompt

“Create 10 multiple choice questions for the provided file. Each question shall have 4 answers and only 1 correct answer. The output shall be in JSON format, i.e., an array of objects where each object contains the stem, an array for the answers and the index of the correct answer. Name the keys ‘stem‘, ‘answers‘, ‘correctAnswerIndex‘. The output shall only contain the JSON, nothing else.”

together with the ID of the uploaded file. Note that this is also a zero-shot approach. A Run is then created. At the time of writing, streaming results from the Assistants API is not yet supported [Footnote 8], so the plugin periodically polls for a result. The response in JSON format is then parsed and the questions are added to the Moodle database. Finally, the previously uploaded file is removed from the assistant again. This is mostly done for cost reasons, as the storage of files in an assistant is charged at a rate of $0.20/GB per assistant per day [Footnote 7].
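Continuing the illustrative Python sketch above, the per-file generation flow could look roughly as follows; the file name and polling interval are assumptions, and the prompt is the one quoted above:

```python
import json, time

PROMPT = (
    "Create 10 multiple choice questions for the provided file. "
    "Each question shall have 4 answers and only 1 correct answer. "
    "The output shall be in JSON format, i.e., an array of objects where each object contains "
    "the stem, an array for the answers and the index of the correct answer. "
    "Name the keys 'stem', 'answers', 'correctAnswerIndex'. "
    "The output shall only contain the JSON, nothing else."
)

# Upload one selected course file and attach it to the generation request.
uploaded = client.files.create(file=open("chapter1.pdf", "rb"), purpose="assistants")

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content=PROMPT, file_ids=[uploaded.id],
)

run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
while run.status not in ("completed", "failed", "expired"):
    time.sleep(5)  # no streaming in the beta API: poll periodically for the result
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

reply = client.beta.threads.messages.list(thread_id=thread.id).data[0]
questions = json.loads(reply.content[0].text.value)
# expected: [{"stem": ..., "answers": [...], "correctAnswerIndex": ...}, ...] -> inserted into the question bank

client.files.delete(uploaded.id)  # remove the file again to avoid ongoing storage costs
```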

Similar GPT-based Moodle plugins have been developed, such as the AI Text to questions generator [Footnote 9] or the OpenAI Question Generator [Footnote 10]. However, these require the user to paste the text for which questions should be generated. Our plugin enables users to directly use the learning material already available in a course.

Note that the plugin currently assumes a fixed number of questions (10), options (4) and keys (1). However, this does not take into account the content length of a given file or the teacher’s requirements (e.g., multiple keys, fewer options). In future versions, these parameters will be customizable by the user.

Methods

The learning material provided to the Moodle plugin stemmed from two fourth-semester undergraduate medicine courses. From an endocrinology course, 8 slidedecks were used, with content mostly written in English. In a neurology course, 2 slidedecks served as a basis for AQG. For each slidedeck, the plugin produced 10 questions, resulting in a total of 100 MCQs. In addition to identifying the flaws listed in Sect. “Common Multiple-Choice Item Writing Guidelines”, the 2 domain experts (an endocrinologist and a neurobiologist, respectively) were asked to evaluate the questions on the following criteria, each on a 5-point Likert scale:

  • Pertinence: How relevant is the question to the topic? [1: Not relevant at all, 5: Perfectly relevant]

  • Difficulty: How difficult do you perceive the question to be, compared to the course level? [1: Very easy, 5: Very difficult]

  • Level of specificity: How general or specific is the question, i.e., to what degree does it require detailed knowledge? [1: Very general, 5: Very specific]

  • Ambiguity: To what degree is the question ambiguous? [1: Not at all ambiguous, 5: Very ambiguous]

  • Instructional alignment: To what degree is the question aligned to the content of the course? [1: Not at all aligned, 5: Perfectly aligned]

Pertinence and instructional alignment might seem redundant, but there is a slight nuance: Pertinence targets the general relevance of the question with regard to the topic, while instructional alignment is narrowed down to the content actually covered in the course. The evaluators could also provide further remarks.

The generation of these 100 MCQs using the aforementioned GPT-4 Turbo (gpt-4-1106-preview) cost $0.88.

Results

Among the 100 questions of endocrinology and neurology combined, 13 questions presented 1 flaw and 1 question presented 2 flaws (Fig. 5). Only 1 flawed question was among the 20 neurology MCQs (5%); the other 13 were among the 80 endocrinology questions (16%). The most frequent issue was implausible distractors, followed by questions with either no key, a wrong key or an ambiguous key.

Fig. 5 Comparison of item flaws across subjects in Case Study 2

Regarding the 5 additional quality criteria, the picture is quite different in endocrinology (Fig. 6) as opposed to neurology (Fig. 7). In endocrinology, the questions were overall considered pertinent to the topic and well-aligned with the content of the course, were of balanced difficulty and tended to be quite specific. The majority of questions were not ambiguous.

Fig. 6 Additional quality criteria for the 80 endocrinology questions

For the neurology-related questions, the domain expert did not perceive them as pertinent to the topic or well-aligned with the course content, and oftentimes found them too easy and too general. This aligns with the previously mentioned finding of LLM-generated questions being less relevant than human-created ones [4]. On the positive side, ambiguity was again not perceived as an issue.

Fig. 7 Additional quality criteria for the 20 neurology questions

Discussion

The automatic generation of questions does not only scaffold teachers in preparing assessments. Students can also benefit from AQG in self-directed learning, reinforcing their understanding of concepts and fostering critical thinking skills [27]. Hence, it is of utmost importance that the resulting questions are relevant with respect to the given topics, syntactically and semantically correct, and follow common item writing guidelines. In fact, a knowledgeable learner should not be misled into choosing a distractor, while a less knowledgeable learner should not be cued towards the right answer [3].

In both case studies, the generated questions showcased an excellent linguistic quality. Grammar hints or other cues were almost never an issue.

The controversial None of the above option was neither generated by any of the 3 services in the first case study, nor proposed by the GPT model used in our Moodle plugin in the second case study. K-type questions were never generated either: We believe that directing an LLM towards this question type would require few-shot learning. The All of the above option was generated rarely. As its usage is generally discouraged, its occasional occurrence can be easily avoided through a prompt instruction.

The key being too long was an issue only once in the second case study, as opposed to 13 questions in the first, even without extra instructions in the prompt. As the underlying GPT models were comparable, we estimate that the medical domain could be less prone to this issue, as distractors could be symptoms, diseases, molecules or dosages similar to the key and hence of comparable length. Further experiments are necessary to verify this intuition. As mentioned in [25], resolving this issue is non-trivial. On the one hand, enlarging distractors to the length of the key might make them implausible. On the other hand, splitting the lengthy key into multiple key options could pose a challenge with respect to semantic coherence.

The major issues in both case studies were having either no, a wrong or an ambiguous key, or implausible distractors.

Questions without a key or with a wrong key are the result of an issue in the QA phase, which can be related to inaccurate or obsolete data in the underlying model. This issue cannot be easily solved by the end-user who, directly or indirectly, can only prompt the model, unless an approach like RAG is added. There were multiple reasons for an ambiguous key, often resulting from the stem not being detailed enough or from there simply being multiple correct answers. Here, the stem could be adapted to ask for the single best answer. Tran et al. reported that GPT-3 performed poorly at generating correct answers [28]. Where possible, GPT-4 should be preferred [6].

Finally, implausible distractors were the biggest issue in the second case study. If distractors can be easily discarded even by learners with limited prior knowledge, guessing is enabled, which negatively impacts the difficulty of the question. This flaw is more difficult to fix, as the “generation of high-quality distractors is more challenging than question and answer generation” [8]. Distractors can be generated using generative adversarial networks (GAN) [9] or by relying on semantic similarity [29]. Knowledge graph-based approaches may help generate distractors that stay closer to the realm of the key.
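As an illustration of the semantic-similarity route, candidate terms can be embedded and ranked by their proximity to the key, keeping the closest ones as distractors; candidates that are themselves correct must of course be excluded. The sketch below uses OpenAI embeddings with an assumed model name and a hand-picked candidate list, and is not part of our pipeline:

```python
from openai import OpenAI
import numpy as np

client = OpenAI()

key = "Priority scheduling"
candidates = ["Round robin", "First come, first served", "Lottery scheduling",
              "Paging", "TCP congestion control", "Huffman coding"]

emb = client.embeddings.create(model="text-embedding-3-small", input=[key] + candidates)
vectors = np.array([d.embedding for d in emb.data])
key_vec, cand_vecs = vectors[0], vectors[1:]

# Cosine similarity to the key: semantically closer candidates make more plausible distractors.
sims = cand_vecs @ key_vec / (np.linalg.norm(cand_vecs, axis=1) * np.linalg.norm(key_vec))
ranked = sorted(zip(candidates, sims), key=lambda p: p[1], reverse=True)
print([term for term, _ in ranked[:3]])  # the three nearest candidates
```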

Approaches based on file upload instead of prompts add an additional layer of complexity to AQG services, as relevant concepts need to be identified first, e.g., through a knowledge-graph-based approach [30]. In case study 1, the questions were pertinent and well-aligned with respect to the chapter topics presented in Table 1, independently of the service. This shows that the concept identification from the provided learning material in the case of Quiz Wizard was successful and that the limited prompts for Quizicist and ChatGPT were sufficient for generating relevant questions.

For case study 2, OpenAI’s Assistants API takes care of the retrieval of information in the uploaded files. For endocrinology, the questions were considered as pertinent and well-aligned, quite specific regarding the content of the slidedecks and of balanced difficulty. This indicates that the identification of core concepts was successful. One question that was too content-specific asked for the authors of a study mentioned in a slidedeck. This is certainly not something that students should remember by heart.

However, for neurology, the majority of questions were considered neither pertinent nor well-aligned, and oftentimes too broad and hence too easy. Here, the identification of core concepts was thus partially unsuccessful. Certain questions focused too much on knowledge recall, on aspects that the domain expert would not expect students to learn by heart. Such questions do not illustrate the critical thinking capacities of learners. At least for the medical domain, this could be improved through prompt engineering, e.g., to produce clinical cases as presented in [7]. A similar approach in computer science regarding more practical cases could be considered. Higher levels of Bloom’s taxonomy could thereby be covered in the assessment. Incorporating the target level of Bloom’s taxonomy in the prompt was recommended by Indran et al. [6].

We agree with the authors of similar attempts who state that LLM-generated MCQs can assist human teachers by providing a starting point. Content experts remain responsible for checking the accuracy of the generated questions, and while editing is certainly necessary, it takes less time to edit generated questions than to create them from scratch [6, 7]. Cheung et al. reported that manual creation took 10 times longer than automatic generation [4]. Comparable quality can thus be reached in a shorter time. In our view, a hybrid partnership between human teachers and AI can and should be developed in this matter.

Our study presents a few limitations. First, the questions were only evaluated by a single domain expert each time. There were also relatively few questions in neurology compared to endocrinology or computer science. Still, the comments given by the domain expert provide valuable insights for upcoming endeavours.

Conclusion

In this article, we proposed a docimological analysis of the quality of LLM-generated MCQs in both medicine and computer science. For this purpose, we also developed a Moodle plugin that enables teachers to generate questions based on learning material. The employed zero-shot approaches were sufficient to follow many item writing guidelines. However, issues in both domains regarding question answering and distractor generation remained, and they might require more advanced techniques. Similarly, items addressing higher-order thinking skills are unlikely to be produced through zero-shot approaches. Finally, psychometric measures such as the discrimination index should be carefully monitored, just like for human-created items.

For future work, we have a high interest in further developing the Moodle plugin by making it more customizable and applying the insights from this and related work. Integrating more advanced techniques such as RAG could overcome the current limitations. As proposed in [6], we will also consider generating alternative versions of questions already existing in the question bank of Moodle. We would also like to address higher-order thinking skills, e.g., by generating clinical cases. Other domains, such as the human sciences, other input languages, non-textual content such as diagrams, or other LLMs such as Gemini [Footnote 11] could also be considered. Finally, we would like to extend our analysis to open questions by comparing LLM-based and human grading.