Despite over 30 years of reported institutional intent to diversify the medical workforce, racial disparity in medical training persists [1, 2]. Attempts to address overt discrimination toward individual applicants have fallen short in impact: only prior to 1978 were there fewer Black men in medical training than today [3]. Similarly, interventions to promote workforce diversity for the potential health benefit to marginalized groups [4, 5] have not achieved their goal: at current rates of graduation, five centuries will pass before Latinos are proportionally represented in California’s physician workforce [6]. These findings call into question the efficacy of existing frameworks intended to promote diversity as well as the admissions procedures that account for trends in racial/ethnic representation of medical trainees.

A growing body of literature focused on trainee evaluation and promotion in medical education argues that existing definitions of “excellence” [7], as well as the metrics that serve as their evidence, reflect presumed racial hierarchies, obscure an uneven playing field, and do not reliably predict long-term clinical contribution [8,9,10,11,12,13]. These processes thereby foster an ongoing segregation of marginalized groups to the detriment of medicine [14, 15]. In an effort to correct the systemic bias that propagates disparity, the Association of American Medical Colleges (AAMC) encourages the use of Holistic Review (HR), “a flexible, individualized way of assessing an applicant’s capabilities by which balanced consideration is given to experiences, attributes, and academic metrics...” [16]. Although HR is widely implemented in medical schools nationwide, few studies have examined its impact on residency admissions [12].

This manuscript examines the potential of a holistic review screening process to dismantle bias and promote equitable representation relative to non-holistic processes in graduate medical education (GME) interview selection at one residency program [17]. In turn, it examines and quantifies how bias and privilege, both embedded in traditional GME selection procedures, influence patterns of representation.

Methods

Study Design and Population

The studied psychiatry residency program received a total of 806 applications in the 2018–2019 cycle. To maximize the internal validity of the study and increase consistency across application review, the study included graduates of allopathic US medical schools and excluded applicants with felony convictions or unexplained misdemeanors, as well as graduates of schools without comparative data for overall performance or clinical rotations. Due to restrictions from the sponsoring institution, non-US citizens with a Green Card and students applying for a J-1 visa were excluded. Representation in medicine (RM) status for study participants, either underrepresented in medicine (URM) or not underrepresented in medicine (nURM), was based on self-identified race/ethnicity in the Electronic Residency Application Service (ERAS) application, using frameworks provided by the AAMC and in-state partners [16, 18].

This study qualified for exemption by the Institutional Review Boards at the University of California, Los Angeles.

Interventions

The Holistic Review (HR) tool was subjected to comparative analysis relative to two non-holistic models: the “Traditional” (TR) approach, which approximated rubrics previously used at the study site and reflects standards of practice in GME screening nationwide, and the “Traditional Modified” (TM), which expanded the Traditional approach by leveraging ERAS filters to attempt a more equity-enhancing approach without full holistic review.

Holistic Review

Residency program stakeholders developed a mission-driven and domains-based approach to applicant evaluation. This process included (1) identifying and devaluing metrics with known bias and limited predictive value for long-term clinical strength (AOA induction, USMLE scores) [8, 9, 19], (2) reimagining and prioritizing personal qualities and professional characteristics that reflect program values, and (3) actively considering applicants in a broader social context—including acknowledgment of how institutional racism, poverty, and family educational achievement can impact applicant trajectory through medical school.

The Holistic Review approach ultimately resulted in a strengths-based rubric composed of eight domains. These domains were assessed according to criteria gleaned from multiple application elements, including the curriculum vitae, personal statement, and Medical Student Performance Evaluation (MSPE). Six categorical domains, (i) Leadership, (ii) Community Service, (iii) Clinical Performance, (iv) Research, (v) Reference Letters, and (vi) Professionalism, were scored according to duration, intensity, and degree of achievement (e.g. for leadership: Which position was held? For how long? What was the impact of the experience?). With regard to professionalism, instances of unprofessional behavior generated a negative point score. Two measures of lived experience were included as binary outcomes, each with its own rubric: Resilience (achievement in enduring adversity, e.g. personal setback, illness, discrimination) and Distance Travelled (trajectory relative to family or community-level barriers reflecting marginalization at a population or structural level, e.g. first-generation college graduate, raised in a community with high poverty/low educational resources). Self-identified race/ethnicity was not provided to reviewers as part of ERAS application material, and race/ethnicity was neither explicitly considered in any Holistic Review domain nor sufficient to earn de facto recognition for Resilience or Distance Travelled. The pool of Holistic reviewers included faculty and residents trained in application of the Holistic Review rubric.

To generate a preliminary Holistic Review score, categorical domains were summed, with a weighted emphasis given to the Clinical domain. Where applicants demonstrated notable resilience and/or distance travelled, the preliminary score was multiplied by 1.1 (or 1.2 if both were noted) for the final Holistic Review composite score (see Table 1).
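The scoring arithmetic described above can be sketched as follows. This is a minimal illustration rather than the program's actual rubric: the dictionary keys, the example point values, and the clinical weight of 2 are hypothetical placeholders (the real point values belong to Table 1), while the 1.1 and 1.2 multipliers are taken from the text.

```python
# Sketch of the Holistic Review composite score described above.
# ASSUMPTIONS: the domain keys, the example point values, and CLINICAL_WEIGHT
# are hypothetical. Only the 1.1 / 1.2 multipliers come from the text.

CATEGORICAL_DOMAINS = (
    "leadership",
    "community_service",
    "clinical_performance",
    "research",
    "reference_letters",
    "professionalism",
)

# Hypothetical weight: the text states that the Clinical domain is emphasized
# but does not give the factor in this section.
CLINICAL_WEIGHT = 2.0

# Multiplier by number of lived-experience measures noted (0, 1, or 2).
LIVED_EXPERIENCE_MULTIPLIER = {0: 1.0, 1: 1.1, 2: 1.2}


def preliminary_score(domain_points):
    """Sum the six categorical domains, weighting clinical performance."""
    total = 0.0
    for domain in CATEGORICAL_DOMAINS:
        weight = CLINICAL_WEIGHT if domain == "clinical_performance" else 1.0
        total += weight * domain_points.get(domain, 0)
    return total


def composite_score(domain_points, resilience=False, distance_travelled=False):
    """Scale the preliminary score by the lived-experience multiplier."""
    n_noted = int(bool(resilience)) + int(bool(distance_travelled))
    return preliminary_score(domain_points) * LIVED_EXPERIENCE_MULTIPLIER[n_noted]


# Hypothetical applicant; the negative professionalism value mirrors the
# negative point score for unprofessional behavior described in the text.
example_applicant = {
    "leadership": 2,
    "community_service": 3,
    "clinical_performance": 2,
    "research": 1,
    "reference_letters": 2,
    "professionalism": -1,
}

print(preliminary_score(example_applicant))
print(composite_score(example_applicant, resilience=True, distance_travelled=True))
```

Because the lived-experience adjustment is multiplicative, its absolute contribution grows with the preliminary score instead of adding a fixed number of points.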

Table 1 Holistic review rubric overview

In light of mounting evidence of racial bias in STEP1 scores and their limited utility in predicting clinical performance, standardized scores were not included in the Holistic Review rubric [9, 20,21,22]. Additional elements, including “yellow” and “red” flags (e.g. significant difficulty in or repeated failure of clinical rotations), were included in the Holistic Review evaluation form but were not included in the composite score nor were they essential to this analysis.

Traditional Review

The Traditional Review rubric was based on screening rubrics previously used at the study site and parallels rubrics currently in use at residency programs nationwide. It relied on manipulation of discrete elements of the ERAS application, including graduating medical school, additional graduate degrees, STEP 1 scores, Alpha Omega Alpha Honor Medical Society (AOA) membership, Gold Humanism Honor Society (GHHS) membership, peer-reviewed publications, and poster presentations. Following their transformation into either binary variables (e.g. AOA: yes/no) or, for continuous measures, into discrete categories (e.g. STEP1 score), these metrics were combined to provide one composite Traditional Review score (see Table 2).

Table 2 Traditional Review and Traditional Modified rubric overview

Traditional Modified Review

The Traditional Modified rubric represented an intermediary step between Traditional Review and Holistic Review and was designed to examine whether more elaborate ERAS filters can be used to increase URM applicant representation through piecemeal modification of the Traditional Review approach without full holistic review. Through this method, “keywords” identified by the working group (e.g. “equity”) could be used as search filters within ERAS and potentially identify applicants with experience in areas related to equity, diversity, and/or inclusion. This would allow for enhancement of the Traditional Review score with only brief review of application materials to ensure relevant utilization of keywords (see Table 2).

Measures and Outcomes

Baseline applicant measures were obtained via AAMC ERAS. Each applicant was assigned three scores, one score reflecting their performance according to each of the three rubrics. Applicants selected for interview were identified by ranking in the top 100 applicants through a given screening approach—Traditional Review, Traditional Modified, or Holistic Review. The primary outcome was odds of selection for interview by URM status according to each screening tool. Secondary outcomes included odds of interview selection by individual elements of each rubric. Predicted probabilities of interview selection according to RM status were examined as alternative means to compare the relative contribution of each rubric to interview selection in light of differences in the number of applicants in each group.
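The selection rule above, in which each rubric independently picks its own top 100, can be sketched as below. The function names and the toy scores are hypothetical, and the tie-breaking behavior (ties keep their input order) is an assumption, since the text does not say how ties at the cutoff were handled.

```python
# Sketch of the interview-selection outcome: under each rubric, an applicant
# is "selected" if their score ranks within the top n_slots for that rubric.
# ASSUMPTION: ties are broken by input order; the source does not specify this.


def select_for_interview(scores, n_slots=100):
    """Return the set of applicant IDs with the n_slots highest scores."""
    ranked = sorted(scores, key=lambda applicant_id: scores[applicant_id], reverse=True)
    return set(ranked[:n_slots])


def interview_flags(scores_by_rubric, n_slots=100):
    """Map each rubric name to the set of applicants it would select."""
    return {
        rubric: select_for_interview(scores, n_slots)
        for rubric, scores in scores_by_rubric.items()
    }


# Toy data with hypothetical scores for four applicants under the three rubrics.
toy_scores = {
    "TR": {"a": 5, "b": 9, "c": 7, "d": 1},
    "TM": {"a": 6, "b": 9, "c": 7, "d": 3},
    "HR": {"a": 8, "b": 4, "c": 7, "d": 6},
}

selected = interview_flags(toy_scores, n_slots=2)
for rubric in ("TR", "TM", "HR"):
    print(rubric, sorted(selected[rubric]))
```

Each applicant therefore contributes one binary outcome per rubric, which is the outcome modeled in the statistical analysis.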

Statistical Analysis

Statistical difference in the distribution of selection criteria by RM status was evaluated by chi-squared testing. This approach has been previously supported in the literature despite a value of zero for some categories of analysis [23].

We used logistic regression to model the binary outcome of “interview selection” according to exposure by review approach and as a function of each applicant dimension (e.g. URM, STEP1). In this manner, review paradigm served as the independent variable and was operationalized as a three-level nominal variable: Traditional Review, Traditional Modified, and Holistic Review. In each regression, we included the interaction between review paradigm (e.g. Holistic Review) and a specified applicant dimension (e.g. URM) to quantify the association between review approach, applicant dimension, and interview selection. This approach generated an interaction term reflecting the relative impact of each review paradigm in determining interview selection according to applicant criteria. For example, a positive interaction term between Holistic Review and URM implies that Holistic Review strengthens the association between URM status and interview selection. To display the impact of review approach on interview selection by applicant dimension more intuitively, we show the odds ratio and confidence interval for each dimension separately under each review paradigm.
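For the URM dimension, one way to write the model described above is sketched here. The notation is introduced for illustration only and is not taken from the source: i indexes applicants, j indexes rubrics, TM and HR are indicator variables, and Traditional Review is assumed to be the reference level.

```latex
\begin{align*}
\operatorname{logit}\,\Pr(\text{interview}_{ij} = 1)
  &= \beta_0 + \beta_1\,\mathrm{TM}_j + \beta_2\,\mathrm{HR}_j + \beta_3\,\mathrm{URM}_i \\
  &\quad + \beta_4\,(\mathrm{TM}_j \times \mathrm{URM}_i)
         + \beta_5\,(\mathrm{HR}_j \times \mathrm{URM}_i) \\[4pt]
\mathrm{OR}_{\mathrm{URM}\mid\mathrm{TR}} &= e^{\beta_3}, \qquad
\mathrm{OR}_{\mathrm{URM}\mid\mathrm{TM}} = e^{\beta_3 + \beta_4}, \qquad
\mathrm{OR}_{\mathrm{URM}\mid\mathrm{HR}} = e^{\beta_3 + \beta_5}
\end{align*}
```

Under this parameterization, a positive interaction coefficient \beta_5 means the URM odds ratio is larger under Holistic Review than under Traditional Review, which matches the interpretation given in the text.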

We generated marginal predicted probabilities of selection for interview. These communicate the extent to which review paradigm influences the chance that an applicant would be selected for interview according to a given dimension of their application. Because nURM applicants greatly outnumber URM applicants in our data, odds ratios obscure the absolute probability of interview selection. We therefore included predicted probabilities to convey a more intuitive measure of absolute effects that accounts for the overwhelming predominance of nURM applicants in our data.

Analyses were conducted in Stata 15.1 (StataCorp, College Station, TX) and R version 3.5.2 (R Core Team, Vienna, Austria).

Results

Representation of URM and Non-URM Applicants by Scoring Domain

Of 806 total applications, 574 applicants met inclusion criteria, including 154 (27%) URM applicants. Baseline characteristics, including distribution according to variables of interest by RM status, are shown in Table 3.

Table 3 Summary statistics of residency applicants by representation in medicine (RM) status (n = 574)

Regarding Traditional Review domains, nURM applicants were more likely than URM applicants to have STEP 1 scores above 240 (33% vs 12%; p < 0.001) and be AOA inductees (9.5% vs 2%; p = 0.004). There were no significant differences between groups in relation to GHHS membership, school in the highest ranked tier, distribution of additional graduate degrees (PhD or master’s), or number of applicants in the highest rank tier of either posters or peer reviewed publications.
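As a rough plausibility check of the STEP 1 comparison above, the sketch below computes a Pearson chi-squared statistic for a 2x2 table. The counts are not reported in the text: they are approximations reconstructed here from the rounded percentages (33% of the 574 - 154 = 420 nURM applicants and 12% of the 154 URM applicants). Collapsing STEP 1 into a single above/below-240 split is also a simplifying assumption and may differ from the tiered categories the authors actually tested.

```python
import math

# Sketch: Pearson chi-squared test (no continuity correction) on a 2x2 table.
# ASSUMPTION: the cell counts below are approximate reconstructions from the
# rounded percentages in the text, not the study's actual data.


def chi2_2x2(a, b, c, d):
    """Chi-squared statistic for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def p_value_df1(chi2):
    """Upper-tail p-value for a chi-squared statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Rows: nURM, URM. Columns: STEP 1 above 240, STEP 1 at or below 240.
nurm_high, nurm_other = 139, 281  # about 33% of 420
urm_high, urm_other = 18, 136     # about 12% of 154

statistic = chi2_2x2(nurm_high, nurm_other, urm_high, urm_other)
print(round(statistic, 2), p_value_df1(statistic))
```

With these approximate counts the statistic comes out near 26 on one degree of freedom, which is consistent with the reported p < 0.001.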

Under the Traditional Modified rubric, URM applicants were more likely than nURM to have 2 or more relevant keywords in their applications (27% vs 15%; p = 0.002).

Using the Holistic rubric, URM applicants were more likely to be scored in the highest tier for resilience (37% vs 20%; p < 0.01) and distance travelled (31% vs 12%; p < 0.01) and less likely to be scored in the highest tier for clinical (8% vs 21%; p < 0.01) or research (13% vs 23%; p < 0.01) domains. There were no significant differences between groups in likelihood of being scored in the highest tiers for community service, leadership, reference letters, or professionalism.

Relative Impact of Screening Rubric on Interview Selection

As shown in Table 4, relative to Traditional Review, Holistic Review significantly increased the odds that a URM applicant would be selected for interview (OR 0.35 under Traditional Review vs 0.84 under Holistic Review; p < 0.05), while no statistically significant change was noted under Traditional Modified (OR 0.54). Under the Holistic Review rubric, high STEP1 scores had a less pronounced impact on the odds of receiving an interview than under Traditional Review (for scores > 260, OR 2.03 vs 24, respectively; p < 0.01), noting that only 18 applicants met criteria for the highest tier of STEP1 scores. Membership in GHHS increased the odds of interview selection under Holistic Review relative to Traditional Review (OR 6.5 vs 1.59; p < 0.001), while applicants with a PhD or scoring in the highest tier for posters and peer-reviewed publications had significantly decreased odds of interview selection under Holistic Review. Use of the Traditional Modified rubric, compared with Traditional Review, did not significantly change the odds of selection for interview for applicants scoring in the highest tier of any Traditional Review domain. Applicants ranked in the highest tiers of the Holistic Review domains of Community Service, Leadership, Research, Reference Letters, and Professionalism, or screening positive for Resilience, had greater odds of selection for interview under Holistic Review than under Traditional Review.

Table 4 Interaction effect of traditional rubric relative to traditional modified and holistic review including odds ratios (95% confidence interval) of interview selection

As shown in Fig. 1, relative to Traditional Review, the predicted probability of interview selection for URM applicants doubled under the Holistic Review approach (0.08 vs 0.16), with little change for nURM applicants across rubrics (0.21 vs 0.18 for Traditional Review and Holistic Review, respectively).
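To relate these predicted probabilities to the odds ratios reported earlier, the following sketch converts each probability to odds and takes the URM-to-nURM ratio under each rubric. It is a back-of-the-envelope check only: the inputs are the two-decimal probabilities quoted above, and ratios derived from rounded marginal probabilities are not expected to reproduce the model-based odds ratios in Table 4 exactly.

```python
# Sketch: convert the reported marginal predicted probabilities into odds and
# compare URM with nURM applicants under each rubric.
# ASSUMPTION: the inputs are the rounded probabilities quoted in the text, so
# the results only approximate the model-based odds ratios.


def odds(probability):
    """Odds corresponding to a probability strictly between 0 and 1."""
    return probability / (1.0 - probability)


def odds_ratio(p_group, p_reference):
    """Odds of the group divided by odds of the reference group."""
    return odds(p_group) / odds(p_reference)


predicted = {
    "TR": {"URM": 0.08, "nURM": 0.21},
    "HR": {"URM": 0.16, "nURM": 0.18},
}

for rubric, p in predicted.items():
    print(rubric, round(odds_ratio(p["URM"], p["nURM"]), 2))
```

The ratios come out at roughly 0.33 for Traditional Review and 0.87 for Holistic Review, in the same range as the reported odds ratios of 0.35 and 0.84. The example also shows why both measures are reported: the URM probability doubles while the odds ratio moves most of the way toward 1.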

Fig. 1 Contrasts of the marginal predicted probabilities of interview selection (Y-axis) for applicants underrepresented in medicine (URM) relative to applicants who are not underrepresented in medicine (nURM) according to each screening and selection rubric examined (X-axis): Traditional Review (TR), Traditional Modified (TM), and Holistic Review (HR)

Discussion

This manuscript describes the development and implementation of an equity-minded and mission-driven Holistic Review process that critically examines and reconstructs the standards defining residency selection processes. In this single-site analysis, we note the efficacy of a Holistic Review approach in significantly increasing the odds of URM interview selection. Given the baseline distribution of residency applicants by RM status, the increased odds of interview selection for URM applicants come without meaningful change in the predicted probability of interview selection for nURM applicants.

The reduction in disparity demonstrated in this study, without direct consideration of applicant race/ethnicity, reflects the extent to which bias-producing metrics have themselves perpetuated racialized notions of excellence [9, 24]. For example, the distribution of STEP1 scores by RM status and the highly significant variation in the pattern of odds for interview selection by STEP1 score across rubrics underscore the role of USMLE as a potent barrier to equitable representation.

This study replicated known trends in honors society representation, including the exclusion of URM trainees from AOA selection and more equitable representation in GHHS [11]. While Holistic Review scoring did not prohibit consideration of AOA, its decreasing association to interview selection relative to Traditional Review (while not significant) likely reflects the repositioning of AOA election as but one example of achievement as opposed to the ultimate proxy for clinical excellence. Similarly, the defining of Holistic Review domains to allow multiple forms of justifying evidence likely explains the equity-enhancing impact of community service, leadership, and professionalism in conjunction with GHHS.

In combination with a divestment from traditional screening elements, study outcomes are additionally explained by associations between URM status and lived experience. Rather than giving additive point contribution in recognition of resilience, the magnifying effect defined in this approach (≥ 110% of the preliminary score) gives due consideration to the experience of applicants who face structural barriers in a pervasive way, as opposed to overcoming discrete challenges without lasting repercussion. While distance travelled was also more frequently observed in the life of URM applicants and while its relative contribution to interview selection increased in Holistic Review relative to Traditional Review, its ultimate impact fell short of significance.

The structure of the Traditional Modified rubric, in which designated activities or experiences were given a minor additive point value in an otherwise largely “traditional” approach, likely represents a prevalent compromise to Holistic Review in modern GME. This compromise is presumably justified by time constraints: residency applications in psychiatry and across most specialties have increased substantially over the last decade [25, 26], and many programs nonetheless rely on the same small groups of administrators and faculty to carry out burdensome and time-sensitive screening and selection procedures. Although review time was not formally assessed in this study, Holistic Review screening typically took ~ 10–15 min per applicant, Traditional Modified required only ~ 2–3 min per applicant, and the Traditional Review process was entirely automated. Despite the fact that URM applicants were more likely to describe meaningful participation in endeavors aligning with program priorities, the additional point contribution defined by Traditional Modified was not sufficient to produce a significant change in the odds of interview selection for URM applicants.

Several study outcomes require consideration of baseline sample characteristics. In the case of interview selection by RM status, the minimal change in predicted probability observed for nURM applicants is explained by the notably larger group of nURM applicants to GME. Similarly, while the magnitude of change in interview selection associated with PhD was unanticipated (from OR = 266 to OR = 0.4 in Traditional Review and Holistic Review, respectively), the trend can be understood when considering the relatively large point contribution defined by the Traditional Review rubric relative to the number of applicants with a PhD (~ 6% of total applicant pool).

In contemporary GME, diversity initiatives characterized by expedited and piecemeal interventions to “improve the numbers” by mitigating downstream bias are commonplace (e.g. unconscious/implicit bias trainings for individual reviewers or strategic advocacy for individual URM applicants) [27]. Although prevalent, these efforts have failed to shift population-level representation in medical training despite corresponding demographic shifts in the broader population [28]. By limiting their scope to downstream outcomes, these interventions leave intact the frameworks and selection processes that overlook the circumstances, accomplishments, and value-added of applicants from marginalized groups. Despite its purported intention to seek out and support diversity, academic medicine, by continuing to rely on processes devoid of social and structural contextualization, perpetuates institutional racism and thereby manufactures the continued underrepresentation of marginalized groups [29,30,31,32].

The method of evaluation described in this study moves upstream and represents a shift in focus away from quota-like recruitment of individual URM applicants (“how many were selected?”) and toward a realignment of values that promotes holistic excellence across groups by asking “under what conditions should applicants be selected?”

Study limitations include its small sample size, location at one study institution, lack of demonstrated inter-rater reliability for holistic review scoring, and the inability to account for the potentially compounding impacts of race- and gender-based bias in the applicant review process due to insufficient power for subgroup analyses. The study sample was limited to US allopathic graduates, and future exploration may shed additional light on the contribution of Holistic Review to the consideration of osteopathic graduates as well as applicants trained internationally. Further, while metrics with extensive literature demonstrating embedded racial bias were excluded from the Holistic Review tool (e.g. STEP1), emerging literature implicates some of the included materials, such as clinical grades [33] and MSPE letters [10], as sources of race- and gender-based disparity; future iterations of the Holistic Review tool must therefore reflect this dynamic evidence base.

To the best of our knowledge, this manuscript represents the first comparative analysis of a systematic holistic approach to applicant screening relative to non-holistic or traditional methods in addressing racial disparity in GME applicant selection. By identifying a representative cadre of trainees who embody diverse forms of excellence using a single, standardized procedure for all applicants, the study results affirm the efficacy of holistic tools to build a medical workforce with an increased capacity to promote equity within a residency training program [34].

In the context of new ACGME requirements to “engage in practices that focus on mission-driven, ongoing, systematic recruitment and retention of a diverse and inclusive workforce” [35], we share this intervention in order to advance existing knowledge regarding the viability and effectiveness of applied holistic review in residency recruitment [12, 36, 37]. We additionally hope that the application of tools for holistic recruitment will be accompanied by meaningful interventions to address the toxic climate of daily experiences with racism that characterizes the lived experience of many GME trainees [38, 39]. Beyond effective recruitment, meaningful interventions to foster the retention and promotion of URM trainees and faculty will be essential to eradicate the inequity that remains ever-present in medical training and clinical care [40, 41].