Experiences and Lessons Learned Creating and Validating Concept Inventories for Cybersecurity

Conference paper in National Cyber Summit (NCS) Research Track 2020 (NCS 2020)

Abstract

We reflect on our ongoing journey in the educational Cybersecurity Assessment Tools (CATS) Project to create two concept inventories for cybersecurity. We identify key steps in this journey and important questions we faced. We explain the decisions we made and discuss the consequences of those decisions, highlighting what worked well and what might have gone better.

The CATS Project is creating and validating two concept inventories—conceptual tests of understanding—that can be used to measure the effectiveness of various approaches to teaching and learning cybersecurity. The Cybersecurity Concept Inventory (CCI) is for students who have recently completed any first course in cybersecurity; the Cybersecurity Curriculum Assessment (CCA) is for students who have recently completed an undergraduate major or track in cybersecurity. Each assessment tool comprises 25 multiple-choice questions (MCQs) of various difficulties that target the same five core concepts, but the CCA assumes greater technical background.

Key steps include defining project scope, identifying the core concepts, uncovering student misconceptions, creating scenarios, drafting question stems, developing distractor answer choices, generating educational materials, performing expert reviews, recruiting student subjects, organizing workshops, building community acceptance, forming a team and nurturing collaboration, adopting tools, and obtaining and using funding.

Creating effective MCQs is difficult and time-consuming, and cybersecurity presents special challenges. Because cybersecurity issues are often subtle, where the adversarial model and details matter greatly, it is challenging to construct MCQs for which there is exactly one best but non-obvious answer. We hope that our experiences and lessons learned may help others create more effective concept inventories and assessments in STEM.


Notes

  1. Science, Technology, Engineering, and Mathematics (STEM).

  2. Personal correspondence with Melissa Dark (Purdue).

  3. As an experiment, select answers to your favorite cybersecurity certification exam looking only at the answer choices and not at the question stems. If you can score significantly better than random guessing, the exam is defective.

  4. Contact Alan Sherman (sherman@umbc.edu).

References

  1. Bechger, T.M., Maris, G., Verstralen, H., Beguin, A.A.: Using classical test theory in combination with item response theory. Appl. Psychol. Meas. 27, 319–334 (2003)

  2. Brame, C.J.: Writing good multiple choice test questions (2019). https://cft.vanderbilt.edu/guides-sub-pages/writing-good-multiple-choice-test-questions/. Accessed 19 Jan 2019

  3. Brown, B.: Delphi process: a methodology used for the elicitation of opinions of experts. Rand Corporation, Santa Monica, CA, USA, September 1968

  4. U.S. Department of Labor, Bureau of Labor Statistics: Information security analysts. Occupational Outlook Handbook, September 2019

  5. George Washington University Arlington Center: Cybersecurity education workshop, February 2014. https://research.gwu.edu/sites/g/files/zaxdzs2176/f/downloads/CEW_FinalReport_040714.pdf

  6. Chi, M.T.H.: Methods to assess the representations of experts’ and novices’ knowledge. In: Cambridge Handbook of Expertise and Expert Performance, pp. 167–184 (2006)

  7. Ciampa, M.: CompTIA Security+ Guide to Network Security Fundamentals, Loose-Leaf Version, 6th edn. Course Technology Press, Boston (2017)

  8. CompTIA: CASP (CAS-003) certification study guide. CompTIA IT certifications

  9. International Information System Security Certification Consortium: Certified information systems security professional. https://www.isc2.org/cissp/default.aspx. Accessed 14 Mar 2017

  10. Global Information Assurance Certification (GIAC): GIAC certifications: the highest standard in cyber security certifications. https://www.giac.org/

  11. Cybersecurity and Infrastructure Security Agency: The National Initiative for Cybersecurity Careers & Studies. https://niccs.us-cert.gov/featured-stories/take-cybersecurity-certification-prep-course

  12. Epstein, J.: The calculus concept inventory: measurement of the effect of teaching methodology in mathematics. Not. Am. Math. Soc. 60(8), 1018–1025 (2013)

  13. Ericsson, K.A., Simon, H.A.: Protocol Analysis. MIT Press, Cambridge (1993)

  14. Evans, D.L., Gray, G.L., Krause, S., Martin, J., Midkiff, C., Notaros, B.M., Pavelich, M., Rancour, D., Reed, T., Steif, P., Streveler, R., Wage, K.: Progress on concept inventory assessment tools. In: 33rd Annual Frontiers in Education Conference (FIE 2003), pp. T4G-1, December 2003

  15. CSEC2017 Joint Task Force: Cybersecurity curricula 2017. Technical report, CSEC2017 Joint Task Force, December 2017

  16. Gibson, D., Anand, V., Dehlinger, J., Dierbach, C., Emmersen, T., Phillips, A.: Accredited undergraduate cybersecurity degrees: four approaches. Computer 52(3), 38–47 (2019)

  17. Glaser, B.G., Strauss, A.L., Strutzel, E.: The discovery of grounded theory: strategies for qualitative research. Nurs. Res. 17(4), 364 (1968)

  18. Goldman, K., Gross, P., Heeren, C., Herman, G.L., Kaczmarczyk, L., Loui, M.C., Zilles, C.: Setting the scope of concept inventories for introductory computing subjects. ACM Trans. Comput. Educ. 10(2), 1–29 (2010)

  19. Hake, R.: Interactive-engagement versus traditional methods: a six-thousand-student survey of mechanics test data for introductory physics courses. Am. J. Phys. 66(1), 64–74 (1998)

  20. Hake, R.: Lessons from the physics-education reform effort. Conserv. Ecol. 5, 07 (2001)

  21. Hambleton, R.K., Jones, R.J.: Comparison of classical test theory and item response theory and their applications to test development. Educ. Meas. Issues Pract. 12, 253–262 (1993)

  22. Herman, G.L., Loui, M.C., Zilles, C.: Creating the digital logic concept inventory. In: Proceedings of the 41st ACM Technical Symposium on Computer Science Education, pp. 102–106, January 2010

  23. Hestenes, D., Wells, M., Swackhamer, G.: Force concept inventory. Phys. Teach. 30, 141–166 (1992)

  24. **deel, M.: I just don’t get it: common security misconceptions. Master’s thesis, University of Minnesota, June 2019

  25. Association for Computing Machinery (ACM) Joint Task Force on Computing Curricula and IEEE Computer Society: Computer Science Curricula: Curriculum Guidelines for Undergraduate Degree Programs in Computer Science. Association for Computing Machinery, New York (2013)

  26. Jorion, N., Gane, B., James, K., Schroeder, L., DiBello, L., Pellegrino, J.: An analytic framework for evaluating the validity of concept inventory claims. J. Eng. Educ. 104(4), 454–496 (2015)

  27. Kittur, A., Chi, E.H., Suh, B.: Crowdsourcing user studies with Mechanical Turk. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 453–456, April 2008

  28. Litzinger, T.A., Van Meter, P., Firetto, C.M., Passmore, L.J., Masters, C.B., Turns, S.R., Gray, G.L., Costanzo, F., Zappe, S.E.: A cognitive study of problem solving in statics. J. Eng. Educ. 99(4), 337–353 (2010)

  29. Mestre, J.P., Dufresne, R.J., Gerace, W.J., Hardiman, P.T., Touger, J.S.: Promoting skilled problem-solving behavior among beginning physics students. J. Res. Sci. Teach. 30(3), 303–317 (1993)

  30. NIST: NICE framework. http://csrc.nist.gov/nice/framework/. Accessed 8 Oct 2016

  31. Offenberger, S., Herman, G.L., Peterson, P., Sherman, A.T., Golaszewski, E., Scheponik, T., Oliva, L.: Initial validation of the cybersecurity concept inventory: pilot testing and expert review. In: Proceedings of the Frontiers in Education Conference, October 2019

  32. Olds, B.M., Moskal, B.M., Miller, R.L.: Assessment in engineering education: evolution, approaches and future collaborations. J. Eng. Educ. 94(1), 13–25 (2005)

  33. Parekh, G., DeLatte, D., Herman, G.L., Oliva, L., Phatak, D., Scheponik, T., Sherman, A.T.: Identifying core concepts of cybersecurity: results of two Delphi processes. IEEE Trans. Educ. 61(11), 11–20 (2016)

  34. DiBello, L.V., James, K., Jorion, N., Schroeder, L., Pellegrino, J.W.: Concept inventories as aids for instruction: a validity framework with examples of application. In: Proceedings of the Research in Engineering Education Symposium, Madrid, Spain (2011)

  35. Pellegrino, J., Chudowsky, N., Glaser, R.: Knowing What Students Know: The Science and Design of Educational Assessment. National Academy Press, Washington, DC (2001)

  36. Peterson, P.A.H., **deel, M., Straumann, A., Smith, J., Pederson, A., Geraci, B., Nowaczek, J., Powers, C., Kuutti, K.: The security misconceptions project, October 2019. https://secmisco.blogspot.com/

  37. Porter, L., Zingaro, D., Liao, S.N., Taylor, C., Webb, K.C., Lee, C., Clancy, M.: BDSI: a validated concept inventory for basic data structures. In: Proceedings of the 2019 ACM Conference on International Computing Education Research (ICER 2019), pp. 111–119. Association for Computing Machinery, New York (2019)

  38. Scheponik, T., Golaszewski, E., Herman, G., Offenberger, S., Oliva, L., Peterson, P.A.H., Sherman, A.T.: Investigating crowdsourcing to generate distractors for multiple-choice assessments (2019). https://arxiv.org/pdf/1909.04230.pdf

  39. Scheponik, T., Golaszewski, E., Herman, G., Offenberger, S., Oliva, L., Peterson, P.A.H., Sherman, A.T.: Investigating crowdsourcing to generate distractors for multiple-choice assessments. In: Choo, K.K.R., Morris, T.H., Peterson, G.L. (eds.) National Cyber Summit (NCS) Research Track, pp. 185–201. Springer, Cham (2020)

  40. Scheponik, T., Sherman, A.T., DeLatte, D., Phatak, D., Oliva, L., Thompson, J., Herman, G.L.: How students reason about cybersecurity concepts. In: IEEE Frontiers in Education Conference (FIE), pp. 1–5, October 2016

  41. Offensive Security: Penetration Testing with Kali Linux (PWK). https://www.offensive-security.com/pwk-oscp/

  42. Sherman, A.T., DeLatte, D., Herman, G.L., Neary, M., Oliva, L., Phatak, D., Scheponik, T., Thompson, J.: Cybersecurity: exploring core concepts through six scenarios. Cryptologia 42(4), 337–377 (2018)

  43. Sherman, A.T., Oliva, L., Golaszewski, E., Phatak, D., Scheponik, T., Herman, G.L., Choi, D.S., Offenberger, S.E., Peterson, P., Dykstra, J., Bard, G.V., Chattopadhyay, A., Sharevski, F., Verma, R., Vrecenar, R.: The CATS hackathon: creating and refining test items for cybersecurity concept inventories. IEEE Secur. Priv. 17(6), 77–83 (2019)

  44. Sherman, A.T., Oliva, L., DeLatte, D., Golaszewski, E., Neary, M., Patsourakos, K., Phatak, D., Scheponik, T., Herman, G.L., Thompson, J.: Creating a cybersecurity concept inventory: a status report on the CATS Project. In: 2017 National Cyber Summit, June 2017

  45. Thompson, J.D., Herman, G.L., Scheponik, T., Oliva, L., Sherman, A.T., Golaszewski, E., Phatak, D., Patsourakos, K.: Student misconceptions about cybersecurity concepts: analysis of think-aloud interviews. J. Cybersecurity Educ. Res. Pract. 2018(1), 5 (2018)

  46. Walker, M.: CEH Certified Ethical Hacker All-in-One Exam Guide, 1st edn. McGraw-Hill Osborne Media, USA (2011)

  47. Wallace, C.S., Bailey, J.M.: Do concept inventories actually measure anything? Astron. Educ. Rev. 9(1), 010116 (2010)

  48. West, M., Herman, G.L., Zilles, C.: PrairieLearn: mastery-based online problem solving with adaptive scoring and recommendations driven by machine learning. In: 2015 ASEE Annual Conference and Exposition, Seattle, Washington, June 2015


Acknowledgments

We thank the many people who contributed to the CATS project as Delphi experts, interview subjects, Hackathon participants, expert reviewers, student subjects, and former team members, including Michael Neary, Spencer Offenberger, Geet Parekh, Konstantinos Patsourakos, Dhananjay Phatak, and Julia Thompson. Support for this research was provided in part by the U.S. Department of Defense under CAE-R grants H98230-15-1-0294, H98230-15-1-0273, H98230-17-1-0349, H98230-17-1-0347; and by the National Science Foundation under UMBC SFS grants DGE-1241576, 1753681, and SFS Capacity Grants DGE-1819521, 1820531.

Author information

Correspondence to Alan T. Sherman.

Appendix A: A Crowdsourcing Experiment

To investigate how subjects respond to the initial and final versions of the CCA switchbox question (see Sects. 3.5–3.7), we polled 100 workers on Amazon Mechanical Turk (AMT) [27]. We also explored the strategy of generating distractors through crowdsourcing, continuing an idea from our previous research [38, 39]. Specifically, we sought responses from another 200 workers to the initial and final stems without providing any alternatives.

On March 29–30, 2020, at separate times, we posted four separate tasks on AMT. Tasks 1 and 2 presented the CCA switchbox test item with alternatives, for the initial and final versions of the test item, respectively. Tasks 3 and 4 presented the CCA switchbox stem with no alternatives, for the initial and final versions of the stem, respectively.

For Tasks 1–2, we solicited 50 workers each; for Tasks 3–4, we solicited 100 workers each. For each task, we sought human workers who had graduated from college with a major in computer science or a related field. We offered a reward of $0.25 per completed valid response and set a time limit of 10 min per task.
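For readers who want to set up a similar task, the sketch below shows how one such task might be posted programmatically with boto3's MTurk client. The paper does not describe the authors' tooling, so the survey URL, the title and description text, and the use of an ExternalQuestion are illustrative assumptions; the college-major requirement appears to have relied on worker self-report rather than a built-in qualification.

```python
# Hypothetical sketch of posting one of the four tasks as an MTurk HIT via boto3.
# The survey URL and wording are placeholders, not the authors' actual setup.
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

external_question = """
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://www.surveymonkey.com/r/EXAMPLE-switchbox-task1</ExternalURL>
  <FrameHeight>800</FrameHeight>
</ExternalQuestion>
"""

hit = mturk.create_hit(
    Title="Answer a short network-security question (college CS background)",
    Description="Task 1: CCA switchbox test item, initial version, with alternatives.",
    Keywords="survey, computer science, security",
    Reward="0.25",                     # $0.25 per completed valid response
    MaxAssignments=50,                 # 50 workers for Tasks 1-2 (100 for Tasks 3-4)
    AssignmentDurationInSeconds=600,   # 10-minute time limit per task
    LifetimeInSeconds=86400,           # keep the HIT open for one day
    Question=external_question,
)
print("HIT ID:", hit["HIT"]["HITId"])
```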

Expecting computer bots and human cheaters (people who do not make a genuine effort to answer the question), we included two control questions (see Fig. 4) in addition to the main test item. Because we expected many humans to lie about their college major, we constructed the first control question to detect subjects who were not human or who knew nothing about computer science. All computer science majors should be very familiar with the binary number system, and we expected that most bots would be unable to perform the visual, linguistic, and cognitive processing required to answer Question 1. Because 11 + 62 = 73 = 64 + 8 + 1, the answer to Question 1 is 1001001.
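As a quick sanity check on this arithmetic, the snippet below confirms that 73 renders as 1001001 in binary.

```python
# Control-question arithmetic: 11 + 62 = 73, and 73 in binary is 1001001 (64 + 8 + 1).
total = 11 + 62
print(total, format(total, "b"))       # prints: 73 1001001
assert format(total, "b") == "1001001"
```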

We received a total of 547 responses, of which we deemed 425 (78%) valid. As often happens with AMT, we received more responses than we had solicited, because some workers responded directly without payment to our SurveyMonkey form, bypassing AMT. More specifically, we received 40, 108, 194, 205 responses for Tasks 1–4, respectively, of which 40 (100%), 108 (100%), 120 (62%), 157 (77%) were valid, respectively. In this filtering, we considered a response valid if and only if it was non-blank and appeared to answer the question in a meaningful way. We excluded responses that did not pertain to the subject matter (e.g., song lyrics), or that did not appear to reflect genuine effort (e.g., “OPEN ENDED RESPONSE”).
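The validity percentages above follow directly from the raw counts; the short script below, written for this discussion, recomputes them.

```python
# Recompute the reported validity rates from the raw counts.
received = {"Task 1": 40, "Task 2": 108, "Task 3": 194, "Task 4": 205}
valid    = {"Task 1": 40, "Task 2": 108, "Task 3": 120, "Task 4": 157}

for task in received:
    pct = 100 * valid[task] / received[task]
    print(f"{task}: {valid[task]}/{received[task]} valid ({pct:.0f}%)")

total_received = sum(received.values())   # 547
total_valid = sum(valid.values())         # 425
print(f"Total: {total_valid}/{total_received} valid "
      f"({100 * total_valid / total_received:.0f}%)")   # 78%
```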

Fig. 4.
figure 4

Control questions for the crowdsourcing experiment on AMT. Question 1 aims to exclude bots and subjects who do not know any computer science.

Only ten (3%) of the valid responses included a correct answer to Question 1. Only 27 (7%) of the workers with valid responses stated that they majored in computer science or a related field, of whom only two (7%) answered Question 1 correctly. Thus, the workers who responded to our tasks are very different from the intended population for the CCA.

Originally, we had planned to deem a response valid if and only if it included answers to all three questions, with correct answers to both control questions. Instead, we analyzed our data using the extremely lenient definition described above, understanding that the results would not be relevant to the CCA.
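To make the contrast between the two rules concrete, here is a hedged sketch of both validity predicates. The response fields and the answer key for the second control question are hypothetical stand-ins; the actual filtering appears to have been done by hand on the survey exports.

```python
# Sketch of the planned (strict) and actually used (lenient) validity rules.
# Field names and the second control answer key are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Response:
    control1: str      # answer to control Question 1 (binary arithmetic)
    control2: str      # answer to the second control question
    main_answer: str   # answer to the switchbox test item

def is_valid_strict(r: Response, control2_key: str) -> bool:
    """Originally planned rule: all three questions answered,
    with correct answers to both control questions."""
    all_answered = all(s.strip() for s in (r.control1, r.control2, r.main_answer))
    return (all_answered
            and r.control1.strip() == "1001001"
            and r.control2.strip().lower() == control2_key.lower())

def is_valid_lenient(r: Response) -> bool:
    """Rule actually used: the main answer is non-blank; off-topic or
    non-genuine text (e.g., song lyrics) is excluded by manual review,
    which is not modeled here."""
    return bool(r.main_answer.strip())
```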

Fig. 5.
figure 5

Histogram of responses to the CCA switchbox test item, from AMT workers, for the initial and final versions of the test item. There were 148 valid responses total, 40 for the initial version, and 108 for the final version.

We processed results from Tasks 3–4 as follows, separately for each task. First, we discarded the invalid responses. Second, we identified which valid responses matched existing alternatives. Third, we grouped the remaining (non-matching) valid new responses into equivalency classes. Fourth, we refined a canonical response for each of these equivalency classes (see Fig. 8). The most time-consuming steps were filtering the responses for validity and refining the canonical responses.
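A minimal sketch of this four-step pipeline appears below, assuming a crude text-normalization helper. In the actual study, the matching against existing alternatives and the grouping into equivalency classes were judgment calls made by the team, not the output of a script.

```python
# Minimal sketch of the Task 3-4 processing pipeline described above.
# The normalization is only a crude stand-in for the team's manual judgment.
import re
from collections import defaultdict

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text.lower())).strip()

def process(valid_responses, existing_alternatives):
    """Split valid free-text responses into those matching existing
    alternatives (A-E) and equivalency classes of new responses."""
    alt_index = {normalize(text): label for label, text in existing_alternatives.items()}
    matched = defaultdict(list)      # alternative label -> matching responses
    new_groups = defaultdict(list)   # normalized text -> responses (equivalency classes)
    for resp in valid_responses:
        key = normalize(resp)
        if key in alt_index:
            matched[alt_index[key]].append(resp)
        else:
            new_groups[key].append(resp)
    return matched, new_groups
```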

Especially considering that the AMT worker population differs from the target CCA audience, it is important to recognize that the best distractors for the CCA are not necessarily the most popular responses from AMT workers. For this reason, we examined all responses. Figure 8 includes two examples of interesting unpopular responses: t* for the initial version (0.7%) and t* for the final version (1.2%). Also, the alternatives must satisfy various constraints, including being non-overlapping (see Sect. 3.7), so one should not simply choose the four most popular distractors automatically.

Figures 5, 6, 7 and 8 summarize our findings. Figure 5 does not reveal any dramatic change in responses between the initial and final versions of the test item. It is striking how strongly the workers prefer Distractor E over the correct choice A. Perhaps the workers, who do not know network security, find Distractor E the most understandable and logical choice. Also, its broad wording may be appealing. In Sect. 3.7, we explain how we reworded Distractor E to ensure that Alternative A is now unarguably better than E.

Fig. 6.
figure 6

Histogram for equivalency classes of selected open-ended responses generated from the CCA switchbox stem. These responses are from AMT workers, for the initial and final versions of the stem, presented without any alternatives. There were 120 valid responses for the initial version, and 157 for the final version. A–E are the original alternatives, and t1–t4 are the four most frequent new responses generated by the workers. These responses include two alternate phrasings of the correct answer: t2 for the initial version, and t1 for the final version. Percents are with respect to all valid responses and hence do not add up to 100%.

Fig. 7.
figure 7

Histogram for equivalency classes of selected open-ended responses generated from the CCA switchbox stem. These responses are from AMT workers, for the initial and final versions of the stem, presented without any alternatives. There were 120 valid responses for the initial version, and 157 for the final version. A–E are the original alternatives, and t1–t5 are the five most frequent new distractors generated by the workers. The alternate correct responses t2 (initial version) and t1 (final version) are grouped with Alternative A. Percents are with respect to all valid responses and hence do not add up to 100%.

Figure 6 shows the popularity of worker-generated responses to the stem, when workers were not given any alternatives. After making Fig. 6, we realized it contains a mistake: responses t2 (initial version) and t1 (final version) are correct answers, which should have been matched and grouped with Alternative A. We nevertheless include this figure because it is informative. Especially for the initial version of the stem, it is notable how much more popular the two worker-generated responses t2 (initial version) and t1 (final version) are than our distractors. This finding is not unexpected because, in the final version, we had intentionally chosen Alternative A over t1 to make the correct answer less obvious (see Sect. 3.7). These data support our belief that subjects would find t1 more popular than Alternative A.

Fig. 8.
figure 8

Existing alternatives (A–E), and refinements of the five most frequent new responses (t1–t5) generated for the CCA switchbox stem, from AMT workers, for the initial and final versions of the stem. There were 120 valid responses for the initial version, and 157 for the final version. Distractors t* for the initial version (0.7%) and t* for the final version (1.2%) are examples of interesting unpopular responses. Percents are with respect to all valid responses.

Figure 7 is the corrected version of Fig. 6, with responses t2 (initial version) and t1 (final version) grouped with the correct answer A. Although new distractor t1 (initial version) was very popular, its broad nebulous form makes it unlikely to contribute useful information about student conceptual understanding. For the first version of the stem, we identified seven equivalency classes of new distractors (excluding the new alternate correct answers); for the final version we identified 12.

Because the population of these workers differs greatly from the intended audience for the CCA, these results should not be used to make any inferences about the switchbox test item for the CCA’s intended audience. Nevertheless, our experiment illustrates some notable facts about crowdsourcing with AMT. (1) Collecting data is fast and inexpensive: we collected all of our responses within 24 hours, paying a total of less than $40 in worker fees (we did not pay for any invalid responses). (2) We had virtually no control over the selection of workers, and almost all of them seemed ill-suited for the task (they did not answer Question 1 correctly); nevertheless, several of the open-ended responses reflected a thoughtful understanding of the scenario. (3) Tasks 3–4 did produce distractors (new and old) of note for the worker population, illustrating the potential of crowdsourcing to generate distractors if the selected population could be adequately controlled. (4) Even when the worker population differs from the desired target population, their responses can be useful if they inspire test developers to improve test items and to think of effective distractors.

Despite our disappointment that most workers answered Question 1 incorrectly, the experience helped us refine the switchbox test item. Reflecting on the data, the problem-development team met and made improvements to the scenario and distractors (see the discussions near the ends of Sects. 3.5 and 3.7). Even though the AMT workers represent a different population than our CCA target audience, their responses helped direct our attention to potential ways to improve the test item.

We strongly believe in the potential of using crowdsourcing to help generate distractors and improve test items. Being able to verify the credentials of workers assuredly (e.g., cryptographically) would greatly enhance the value of AMT.


Copyright information

© 2021 Springer Nature Switzerland AG


Cite this paper

Sherman, A.T., et al. (2021). Experiences and Lessons Learned Creating and Validating Concept Inventories for Cybersecurity. In: Choo, K.K.R., Morris, T., Peterson, G.L., Imsand, E. (eds.) National Cyber Summit (NCS) Research Track 2020 (NCS 2020). Advances in Intelligent Systems and Computing, vol. 1271. Springer, Cham. https://doi.org/10.1007/978-3-030-58703-1_1
