1 Introduction

Middle school science experiences play a pivotal role in forming students’ STEM identities as they progress through academic and career trajectories. Middle school science teachers are unique in that they require broader content knowledge (CK) across topics than their high school counterparts, who primarily focus on one discipline. In addition, they must be able to make abstract science concepts relatable to students with diverse learning needs [1]. At the center of science teacher research are efforts to improve student learning. For this to happen, teachers must understand the content they teach. Teacher licensure testing is designed to ensure a baseline quality of teacher content knowledge [2, 3].

Due to a national shortage of certified science teachers in the United States, many states have been forced to adopt policies that allow teaching assignments across disciplines without requiring teachers to demonstrate mastery in a specific field. Although the majority of states require discipline-specific endorsements, many continue to allow secondary science teachers to teach across disciplines with a general science certification [4]. An area of particular concern is that new science teachers are assigned to teach out-of-field more frequently than those with more experience, which ultimately hinders their development and leads to attrition [5, 6]. Because teachers are primarily responsible for student learning outcomes, it is imperative that they are supported in CK development. In this study, we focus on the CK of prospective general science teachers through analysis of the Praxis General Science Content Knowledge Test (GSCKT), with the objective of informing professional learning (PL) experiences.

2 Purpose of the study

Licensure testing is common; nearly all states include it among the requirements for teaching in public schools. Across the United States, the Praxis II content knowledge tests are most commonly administered to assess the knowledge and competencies of an entry-level teacher [3]. The GSCKT is a 2.5-hour computer-delivered examination of 135 selected-response questions assessing the integration of basic topics in chemistry, physics, life science, and Earth science, in alignment with the National Science Education Standards and National Science Teachers Association standards. Although content domains are determined by practitioners in each field [7], the overall GS content categories are (1) Science Methodology, Techniques, and History; (2) Physical Science; (3) Life Science; (4) Earth and Space Science; and (5) Science, Technology, and Society [8]. The original assessments are not publicly available; however, sample questions and content topics are available in the GSCKT Study Companion [8]. While ETS conducts a multi-state standard-setting study for each assessment, each state determines its own passing score. This means that although the scaled score earned by each test-taker is considered equivalent across states, the passing score may differ. Multiple test forms are offered throughout the year, and questions vary between forms [7].

As part of the commitment to offer high quality tests with minimal bias, ETS evaluates the assessment using differential item functioning (DIF). DIF allows test developers to determine whether people in different groups (typically gender or race) perform differently on test items. Groups of people are matched using the content and skills scores from the test or a section of the test. DIF occurs when people within matched groups perform differently on test items. Each question is assigned a category: Category A questions show the least difference between matched groups, Category B questions show small to moderate differences, and Category C questions show the greatest differences. Test developers select questions from Category A whenever possible. Category B questions are used only if there are not enough Category A questions, with preference given to those with the smallest DIF values. Developers use Category C questions only if they are considered essential and must document the reasons those questions are selected [9]. Similar to the methodology of ETS [9], this study included a DIF analysis. A False Discovery Rate procedure [10] was applied during hypothesis testing to control false positives (Type I error) while preserving statistical power. Subjects were grouped into quartiles, and a logistic regression was run that included gender and race. Our analysis does not account for sample sizes. Results from these analyses are reported in Supplemental Materials Tables B and C.
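ETS's A/B/C classification is conventionally derived from the Mantel-Haenszel D-DIF statistic computed across matched ability strata. The sketch below illustrates the mechanics on hypothetical item counts; it is a simplified illustration, not the logistic-regression procedure used in this study, and it omits the significance tests that accompany the category cutoffs in practice.

```python
from math import log

def mh_d_dif(strata):
    """Mantel-Haenszel D-DIF for one item.

    strata: per matched-ability group (e.g., score quartile), a tuple of
    (reference right, reference wrong, focal right, focal wrong) counts."""
    num = den = 0.0
    for rr, rw, fr, fw in strata:
        n = rr + rw + fr + fw
        if n == 0:
            continue
        num += rr * fw / n  # reference correct, focal incorrect
        den += fr * rw / n  # focal correct, reference incorrect
    alpha = num / den          # common odds ratio across strata
    return -2.35 * log(alpha)  # delta scale; negative disfavors the focal group

def dif_category(d):
    """Simplified A/B/C rule on |D-DIF| (significance tests omitted)."""
    if abs(d) < 1.0:
        return "A"  # negligible difference between matched groups
    if abs(d) >= 1.5:
        return "C"  # large difference; used only if essential
    return "B"      # small to moderate difference

# Hypothetical item counts per score quartile
item = [(40, 10, 30, 20), (45, 5, 35, 15), (48, 2, 40, 10), (50, 0, 45, 5)]
print(dif_category(mh_d_dif(item)))
```

A negative D-DIF here indicates that examinees in the matched focal group answered the item correctly less often than their reference-group counterparts.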

Teacher learning experiences are diverse and include everything from formal topic-specific seminars to informal collegial conversations in school buildings. They begin in teacher preparation programs and continue once teachers enter the classroom [11]. A focus on content, paired with how students learn that content, is considered among the top characteristics of effective teacher learning experiences influencing student achievement [2, 11]. Science instruction includes learning new ideas while unlearning old ones; therefore, knowledge of, and the ability to reveal, common misconceptions is central to designing quality learning experiences. Because middle school science teachers are frequently required to teach across disciplines, it is important to leverage teacher training and PL opportunities that emphasize gaps in teachers’ content knowledge [2].

3 Research questions

This study investigates the following research questions: (1) What are the correlations between personal and professional characteristics of Praxis General Science: Content Knowledge Test examinees and scaled score performance in the last decade? (2) How have examinees performed as a whole in each category on the Praxis General Science: Content Knowledge Test? (3) What are the correlations between personal and professional characteristics of Praxis General Science: Content Knowledge Test examinees and category performance in the last decade, and what have been the relative category performances of examinees of varying characteristics?

4 Conceptual framework

It is our assertion that strong foundational CK has a positive impact on licensure examination performance. Teacher knowledge has a direct impact on instructional design and is considered among the most influential factors contributing to student achievement [2, 3, 12]. Figure 1 presents a model depicting the relationship between science teacher professional knowledge and skills and student learning. Quality instruction is influenced by science teacher identity as well as science teacher CK. Targeted professional learning experiences designed for the needs of GS teachers improve instructional quality in order to maximize student learning.

Fig. 1
figure 1

Model of science teacher professional knowledge and skills. Science teacher knowledge is influenced by topic-specific knowledge as measured by the Praxis General Science Content Knowledge Test. For professional learning to improve the quality of science instruction, it must be sustained and intensive, variable inclusive, and include collective participation across the building and/or district. Adapted from Opfer & Pedder [15], Menter & McLaughlin [16], and Ha et al. [17].

4.1 Science teacher identity

Teachers’ self-efficacy and multiple identities (personal and professional characteristics as presented within the dataset) associated with unique backgrounds influence development as they grow in their practice [13, 14]. Because of this, science instruction is impacted by the beliefs teachers hold about science teaching [14]. Strong content knowledge, in-field or out-of-field placement, access to mentors and learning communities are among factors contributing to recruitment and retention of STEM teachers [13].

Science teacher identity is dynamic and complex. It is a socially constructed ongoing process that describes the teacher within a personal and/or professional context [18, 19]. Student populations are becoming increasingly diverse, yet science learning environments tend to be Eurocentric [20, 21]. This results in overrepresentation of White middle-class candidates with monocultural perspectives within the STEM fields. Understanding that preservice teachers of color experience burdens including subliminal racism or impostor syndrome, it is thereby important to consider recruitment, retention, and professional learning efforts that foster development of underrepresented science teacher candidates [20, 21].

4.2 Quality instruction: teacher content knowledge & student learning

Quality science instruction includes learning experiences that challenge students to think deeply about phenomena and processes while critiquing and evaluating claims, constructing scientific explanations, and supporting arguments with evidence [22]. Relevant to student experiences, quality instruction supports authentic participation in science practices and exploration of student interests [23]. Instructional quality, when thought of as a continuum, is influenced by licensure policies and practices. Learning to teach depends upon both CK and pedagogical content knowledge (PCK). CK differs from PCK in that PCK is the knowledge teachers use to transfer content to students and incorporates an understanding of how students learn the content and skills specific to the discipline [24,25,26]. Teacher CK impacts instructional design and is central to PCK. It is considered among the most influential factors contributing to student achievement [3, 12]. Much like their students, science teachers also enter the classroom with misconceptions about the content they teach [27].

4.3 Differentiated professional learning

High quality, differentiated PL is central to improving instruction, organizing curriculum, facilitating clear communication of ideas, and creating 3D science learning experiences. Within the context of the Next Generation Science Standards, this incorporates science & engineering practices, disciplinary core ideas, and crosscutting concepts [22]. An integral challenge for a PL theory of action is facilitating the transfer of new ideas into systems of practice inside the classroom, given that the knowledge and skills gained through PL are commonly acquired outside of it [28]. Understanding what teachers need to know, how they have to know it, and how to help them learn it [24] has the potential to increase CK, PCK, and overall confidence in teaching. Because of this, teachers’ initial CK must be taken into consideration when planning instructional support [2, 25].

5 Methodology

5.1 Research paradigm

This study was part of a project investigating personal and professional characteristics associated with outcomes on the Praxis Content Knowledge Tests [29,30,31,32,33]. Here we present findings about examinee performance on the Praxis GSCKT as a whole and at the category level. To gain insight into examinee performance on the assessment, we followed a methodology similar to Ndembera et al. [30], which is reiterated below: (1) regression on the examination as a whole; (2) categorical percent correct; (3) regression at the category level; (4) ANOVA at the category level; and (5) scaled points lost per category. As deidentified human data are used in this study, all methods were carried out in accordance with relevant guidelines and regulations. This study was approved by the Institutional Review Board.

5.2 Data sample & collection procedure

The data analyzed in this study included all examinees who sat for the Praxis GSCKT from 2006 to 2016 and were obtained from ETS as part of a National Science Foundation project to assess the subject matter knowledge of beginning STEM teachers in the US. Because test-takers may take the exam more than once, the data were restricted to the highest score recorded for each examinee, resulting in a study population of 28,688. Examinee data included self-reported demographic characteristics collected through survey questions about personal and professional characteristics administered as part of the examination. Selected demographics are presented in Table 1; full descriptive statistics are found in Supplemental Table A. Examinees selected from male/female options when reporting their biological sex; non-binary options were not available within the survey. Understanding that these terms reference biological sex rather than gender, we use terms such as man and woman when not discussing results directly. To maintain consistency with reporting on the Praxis ESS CKT, we use the male/female terminology when discussing our results.

Table 1 Descriptive Statistics: Detailed list of personal and professional characteristics of General Science: Content Knowledge test-takers from 2006–2016

After exclusions were applied, the majority of test-takers were female (60.1% of the testing population) compared to 39.5% male. Of those who responded, White test-takers comprised 77.7% of the study population; Black and Hispanic test-takers represented 8.3% and 2.5%, respectively. Biology undergraduate majors represented the largest testing population at 29.2%, followed by other non-STEM (24.8%), other STEM (11.8%), physical science (9.4%), and Earth & space science (3.8%). Of the testing population, 77% held undergraduate grade point averages (GPA) above 3.0. A majority (59.3%) reported that they had not yet entered the teaching field, 17.4% had completed more than 3 years of teaching, and 17.3% had 1–3 years of teaching. It can be inferred that these experienced teachers registered for the assessment as part of an additional certification.

5.3 Data treatment

5.3.1 Regression model selection: scaled score

Over the decade studied, 13 test forms were administered, each varying in the number and difficulty of exam items. To adjust for exam difficulty when comparing relative performance between candidates and across years, ETS converts raw scores to scaled scores ranging from 100 to 200 [7]. A stepwise linear regression utilizing tenfold cross validation was constructed to predict the effect of self-reported variables (Supplemental Materials Table A) on GSCKT performance. The regression model combined repeated cross-sectional data and was limited to two-way interactions. This facilitated the disaggregation of population subgroups and the identification of associations between self-reported demographic variables and scaled score on the GSCKT [34]. A descriptive analysis was then performed on the characteristics most strongly associated with GSCKT performance for comparison within groups.
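As an illustration of this model-selection step, the sketch below implements forward stepwise selection scored by tenfold cross-validated mean squared error. The study itself used SAS; the data here are synthetic and the predictor names hypothetical.

```python
import numpy as np

def cv_mse(X, y, k=10, seed=0):
    """Mean squared prediction error of an OLS fit under k-fold CV."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        Xt = np.column_stack([np.ones(len(train)), X[train]])  # add intercept
        beta, *_ = np.linalg.lstsq(Xt, y[train], rcond=None)
        Xv = np.column_stack([np.ones(len(fold)), X[fold]])
        errs.append(np.mean((y[fold] - Xv @ beta) ** 2))
    return float(np.mean(errs))

def forward_stepwise(X, y, names, k=10):
    """Greedily add the predictor that most lowers CV MSE; stop when none helps."""
    selected = []
    best = cv_mse(np.empty((len(y), 0)), y, k)  # intercept-only baseline
    improved = True
    while improved:
        improved = False
        for j in range(X.shape[1]):
            if j in selected:
                continue
            mse = cv_mse(X[:, selected + [j]], y, k)
            if mse < best:
                best, choice, improved = mse, j, True
        if improved:
            selected.append(choice)
    return [names[j] for j in selected], best

# Synthetic scaled scores driven by two hypothetical dummy-coded predictors
rng = np.random.default_rng(1)
n = 400
major = rng.integers(0, 2, n).astype(float)   # e.g., STEM-major indicator
gender = rng.integers(0, 2, n).astype(float)
noise = rng.normal(size=n)                    # irrelevant candidate predictor
y = 150 + 15 * major + 5 * gender + rng.normal(0, 5, n)
X = np.column_stack([major, gender, noise])
chosen, _ = forward_stepwise(X, y, ["major", "gender", "noise"])
print(chosen)
```

With these effect sizes the procedure picks up the informative predictors first; the irrelevant one rarely survives the cross-validation criterion.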

5.3.2 Estimation of categorical percent correct

Information on the highest number of points each test-taker earned per category was provided in the ETS dataset. Because the dataset did not include the number of test items in each category, we used the highest number of points reported for each category to represent the total number of questions. The following equation was used to estimate the categorical percentage score for each examinee and was repeated for each of the 13 versions of the assessment included in the study. Estimated percent correct per category was based on a weighted average of the percent correct per test form, taking into consideration the number of examinees per form.

Percent Correct = (number of correctly answered questions)/(highest number of points reported) × 100 [30, 33].
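The estimation can be sketched as follows, using hypothetical per-form counts of correct responses; on each form the maximum observed count stands in for the unreported item total, and the per-form means are pooled weighted by examinee counts.

```python
def form_percent_correct(correct_counts):
    """Percent correct for each examinee on one test form; the highest
    observed count proxies the (unreported) number of items."""
    total = max(correct_counts)
    return [100.0 * c / total for c in correct_counts]

def pooled_percent_correct(forms):
    """Average percent correct across forms, weighted by the number of
    examinees on each form."""
    total_examinees = sum(len(f) for f in forms)
    weighted = sum(sum(form_percent_correct(f)) for f in forms)
    return weighted / total_examinees

# Hypothetical: two forms with different (unknown) item totals
forms = [[8, 10, 6], [5, 5, 4, 3]]
print(round(pooled_percent_correct(forms), 1))  # → 82.9
```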

5.3.3 Regression model selection: category analysis

Associations between self-reported test-taker characteristics and category performance on the Praxis GSCKT were identified through a stepwise linear regression using a tenfold cross validation procedure. Results from the regression model informed the ANOVA model selection.

5.3.4 ANOVA model selection: category analysis

The regression model was extended through Analysis of Variance (ANOVA) calculations using SAS software, Version 9.4, to determine which variables were most strongly associated with variance (η2) in category-level performance. For each category, the three variables explaining the greatest η2 were analyzed to estimate scaled points lost.
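The η² effect size reported by the ANOVA is the ratio of between-group to total sum of squares. A minimal stand-alone computation with hypothetical category scores (the study's calculations were run in SAS):

```python
def eta_squared(groups):
    """eta^2 = SS_between / SS_total: the share of score variance
    explained by a factor. groups maps factor level -> list of scores."""
    scores = [s for g in groups.values() for s in g]
    grand = sum(scores) / len(scores)
    ss_total = sum((s - grand) ** 2 for s in scores)
    ss_between = sum(
        len(g) * ((sum(g) / len(g)) - grand) ** 2 for g in groups.values()
    )
    return ss_between / ss_total

# Hypothetical category scores grouped by undergraduate major
scores_by_major = {
    "physical science": [18, 20, 19],
    "non-STEM": [12, 14, 13],
}
print(round(eta_squared(scores_by_major), 3))  # → 0.931
```

Values near 1 mean group membership explains nearly all of the variance; values near 0 mean within-group scatter dominates.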

5.3.5 Estimation of scaled points lost: category analysis

Scaled points lost were calculated to determine examinees’ relative performance at the category level. Exam difficulty is accounted for in ETS reporting on performance on the assessment as a whole, but not at the category level. Here, scaled points lost for each category are reported using the equation:

Scaled points lost C1 = m(total number of questions C1) − m(number of correctly answered questions C1), where m is the slope between scaled score and total questions correct on the exam [30, 33]. Demographic comparisons were made within categories (e.g., undergraduate major in Physical Science) but not across categories (e.g., undergraduate major in Physical Science vs. Earth and Space Science).
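Under these definitions the calculation reduces to multiplying missed questions by the slope m, which can itself be estimated by least squares from examinees' raw and scaled scores. A minimal sketch with hypothetical values:

```python
def slope(raw, scaled):
    """Least-squares slope of scaled score against total questions correct."""
    n = len(raw)
    mx, my = sum(raw) / n, sum(scaled) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(raw, scaled))
    var = sum((x - mx) ** 2 for x in raw)
    return cov / var

def scaled_points_lost(m, total_questions, correct):
    """m * (questions in the category) - m * (questions answered correctly)."""
    return m * total_questions - m * correct

# Hypothetical raw-to-scaled score pairs and one examinee's category result
m = slope([60, 80, 100, 120], [130, 150, 170, 190])
print(scaled_points_lost(m, 27, 18))  # 27-item category, 18 correct
```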

6 Results

6.1 Stepwise model

The stepwise linear regression yielded several statistically significant relationships between reported personal and professional characteristics and performance on the GSCKT. Undergraduate major, ethnicity, and gender were identified by the regression model (Table 2) as the top demographic variables associated with performance. The F values and associated p < 0.0001 values indicate that the independent variables in the model are significant predictors of test-taker performance. Reported R2 and η2 values can be expressed as percentages and describe the proportion of variance in the scaled score accounted for in the sample. Performance was relatively consistent across the decade studied; therefore, results are presented as an average. Variability of the mean scaled scores is indicated by whiskers, and points outside the whiskers represent outliers (see Fig. 2).

Table 2 Model selection from stepwise linear regression
Fig. 2
figure 2

Source: Derived from data provided by Educational Testing Service

Praxis General Science Content Knowledge Test assessment scaled score by examinee’s reported grouped undergraduate major, ethnicity, and gender. Means are indicated by X. Outliers are represented as points outside the whiskers.

6.2 Scaled score

Table 2 and Fig. 2 present results from the analysis of the Praxis GSCKT as a whole, represented by average scaled score, and offer insight into research question 1. Undergraduate major (Fig. 2) explained 11% of the overall variance (Table 2) in the General Science CKT. Test-takers with physical science degrees demonstrated the highest performance on the assessment, followed by Earth & space science, biology, and other STEM majors, with average scaled scores of 175, 171, 167, and 166 respectively. Other STEM included majors such as engineering, mathematics, and computer science. Non-STEM majors demonstrated the lowest performance on the assessment. Ethnicity (Fig. 2) explained 7% of the overall variance (Table 2) in the assessment; there are differences in achievement between White and Black or Hispanic test-takers. The greatest variability was found among Black and Hispanic test-takers, whose average scaled scores were 144 and 158 respectively. Over the decade studied, White examinees outperformed Black examinees by an average of 20 scaled points and Hispanic examinees by 5 scaled points. To determine the extent to which the assessment serves as a barrier to the teaching field, additional information, including other interacting factors, is needed about the states in which Black examinees are likely to test. Gender explained 6% of the overall variance (Table 2) in the GSCKT, with males earning an average of 8 scaled points more than females within the study sample.

6.3 Category analysis

Table 3 presents results from the category analysis of the Praxis GSCKT. To provide additional context for research question 2, our results and analysis focus on the Physical Science, Life Science, and Earth & Space Science categories because these most closely align with the undergraduate majors represented. Life Science and Earth & Space Science topics each comprised 20% of the exam; estimated percent correct in these categories was 75% and 67%, respectively. Physical Science consists of questions assessing chemistry and physics topics. While it makes up the largest portion of the exam at 38%, it had the lowest estimated percent correct (64%).

The ANOVA model presented in Table 3 was developed as part of research question 3. Our correlational analysis of the stepwise linear regression at the category level revealed several statistically significant relationships. Table 3 presents the examinee characteristics most strongly correlated with category performance on the Praxis GSCKT. For the three major categories assessed (Table 3), the reported F values and accompanying p < 0.0001 values indicate a significant relationship between the demographic variables represented within the model and score variability at the category level. The η2 effect sizes presented in Table 3 indicate the strength of the relationships between reported demographic variables and category-level performance. These data were further analyzed at the category level to make comparisons in scaled points lost and are presented graphically in Fig. 3.

Table 3 Results from category analysis of Praxis General Science: Content Knowledge Test
Fig. 3
figure 3

Source: Derived from data provided by Educational Testing Service

Praxis General Science Content Knowledge Test relative performance in physical science, life science, and Earth science categories by demographic characteristic.

Undergraduate major, ethnicity, and gender were the three characteristics most strongly associated with performance in Physical Science (Fig. 3), explaining 12.1%, 4.4%, and 3.3% of the total variance (Table 3, bold values) at the category level. The fewest scaled points were lost by physical science majors (11.9), who outperformed non-STEM majors by 12 scaled points. White, Hispanic, and Black test-takers lost an average of 18.6, 19.9, and 25.8 scaled points respectively. Male test-takers lost 3 fewer scaled points than female test-takers.

Undergraduate major, ethnicity, and undergraduate GPA were most strongly associated with performance in the Life Science category of the assessment (Fig. 3) explaining 11.7%, 3.5%, and 1.7% of the total variance (Table 3, bold values). Biology (4.8) and physical science (7.0) majors lost the fewest scaled points. White, Hispanic, and Black test-takers lost an average of 6.2, 6.5, and 9.4 scaled points respectively. Test-takers with undergraduate GPAs of 3.5–4.0 lost an average of 5.8 scaled points, outperforming those with undergraduate GPAs below 2.99 (7.5 scaled points).

In the category assessing Earth & Space Science (Fig. 3), undergraduate major, ethnicity, and gender together explained 13.8% of the overall variance (Table 3, bold values), with undergraduate major accounting for 4.7%. ESS (4.9) and physical science (7.9) majors lost the fewest scaled points. In alignment with the other categories and the test as a whole, non-STEM majors lost the most points (10.0). White test-takers lost an average of 8.3 scaled points, Hispanic test-takers 10.1, and Black test-takers 12.8. Males and females lost an average of 7.6 and 9.6 scaled points respectively.

7 Limitations

While this study presents one of the largest-scale analyses of general science teacher CK, it should be noted that findings are limited to the states that administer the Praxis GSCKT rather than generalizable to a demographic group overall. The deidentified nature of the survey data does not allow for investigation of factors that impact preservice and early career teachers. Examinees’ self-reported personal and professional characteristics may also have contributed to limitations within the study. Data presented in this study are limited to 2006–2016; as a result, there may have been changes in test-taker populations and performance on the assessment since that time.

8 Discussion & implications for practice

While general science teachers often earn degrees specializing in one science content area, they are responsible for demonstrating foundational knowledge across disciplines. Understanding how science teachers' knowledge progresses over time is essential as professional developers design and facilitate targeted learning experiences [35]. Our findings revealed differences in performance on the assessment as a whole and within sub-discipline categories, most commonly associated with professional characteristics (undergraduate major and undergraduate GPA) and personal characteristics (gender and ethnicity). The estimated percent correct (Table 3) was lowest for the Physical Science category. This category combines chemistry and physics topics, thus warranting details about the questions and the test-takers themselves in order to contextualize performance.

Ethnic representation of the testing population did not match the overall makeup of the United States during the testing window: according to the US Census Bureau (2018), those who identify as Black or Hispanic make up 13.4% and 18.3% of the US population, respectively, but only 8.3% and 2.5% of the population studied.

Across Physical Science, Life Science, and Earth & Space Science categories, test-takers demonstrated strongest performance in the category that best aligned with their undergraduate major (Fig. 3). Examinees across disciplines lost the fewest scaled points and performed most similarly in the category assessing Life Science topics.

As seen in previous research [29,30,31,32,33], males outperformed females on the assessment as a whole and across categories. Although the number of Category C questions per test form was relatively low in the Black-White and Hispanic-White ethnicity DIF analyses (Supplemental Table C), similar trends were identified with regard to ethnicity, with White test-takers scoring above their Black and Hispanic counterparts.

8.1 Recruitment & retention

With the growing science teacher shortage and high teacher turnover, preservice and inservice teachers must be supported in order to promote confidence in teaching. Although many undergraduates enrolled in STEM majors may not have considered education as a career option, it is critical to expose them to the field of teaching early in their post-secondary program. Programs are encouraged to offer inquiry-based learning through early field experiences [36, 37]. Identifying students with an affinity toward STEM disciplines as early as high school can help strengthen these efforts. Placing preservice teachers with strong mentors during field experiences will strengthen recruitment efforts and facilitate the development of effective science educators [38]. In this way, prospective teachers will be more likely to include coursework that aligns with state certification requirements as they progress in their studies [38]. We assert that these early exposures will also facilitate diversification of the field.

8.2 Professional learning

Middle school science teachers with CK across disciplines are able to achieve larger gains in student achievement than those with gaps in CK and accompanying understanding of common misconceptions [2]. Understandably, test takers demonstrated strongest performance in categories aligned with their academic backgrounds. Coordination between education reformers including researchers, teacher educators, policymakers, and administrators is encouraged with a focus on supporting CK as part of PCK [2, 39, 40]. We present a call to action for teacher preparation program faculty to collaborate with associated science departments at their institutions to ensure standardization of coursework for licensure. Building and district administrators are encouraged to support professional learning communities (PLCs) whereby teachers in the field have agency in the direction of collaboration focused on student outcomes through improved teacher practice. Central to this work is science CK and its impact on PCK [1].