Introduction

Sprains are the most common type of ankle injury [1, 2], and persistent symptoms after ankle sprains are common [3,4,5]. Approximately 55% of individuals do not seek treatment for an ankle sprain [6], and even when treatment is sought, treatment strategies are often insufficient for rehabilitation and the prevention of recurrences [7]. Consequently, ankle sprains may be underreported in certain populations, such as athletes [7]. The first step towards improving patient outcomes is to diagnose ankle sprains correctly. Clinicians rely on certain physical examination tests to diagnose and potentially grade ankle sprains and ankle instability. Diagnostic error and inaccurate prognosis may have important repercussions for clinical decision-making and patient outcomes [8]. Therefore, it is important to recognize the diagnostic value of orthopaedic tests through understanding the reliability and validity of these tests.

Reliability is the consistency of the results obtained when a measurement using a test is repeated [9]. Inter-rater reliability is the agreement between two or more raters, and intra-rater reliability is the agreement of the same rater assessing the same patient on repeated occasions. Validity is the degree to which a test measures what it is intended to measure [9]. Determining the reliability and validity of a test or an examination technique is essential and provides credibility to the results obtained with the test or examination technique [10].

Several previous reviews have considered the diagnostic accuracy of tests for particular ankle injuries. Schwieterman et al. [11] focussed their review on ankle and foot special tests, including tests for ligament stability, neurological issues, and tendon dysfunction. Schneiders et al. [12] and Netterström-Wedin et al. [13] specifically reviewed the diagnostic accuracy of clinical tests for low ankle sprain and included the drawer and talar tilt tests, while Sman et al. [14] assessed the accuracy of tests for syndesmosis injuries, specifically the squeeze test and the dorsiflexion-external rotation stress test. Finally, Delahunt et al. [15] published a consensus statement and recommendations focussing on developing a structured clinical assessment of acute lateral ankle sprain. This Delphi study included experts from the “International Ankle Consortium” executive committee [15]. Key recommendations included establishing the mechanism of injury and assessing ankle joint bones and ligaments. This group also established an “International Ankle Consortium Rehabilitation-Oriented Assessment” (ROAST), intended to help clinicians identify mechanical and sensorimotor impairments often found with chronic ankle instability [15]. They advocated that lateral ankle integrity, including the syndesmosis, must be assessed, reporting that the most utilised clinical tests were the anterior drawer test, the talar tilt test, direct palpation of the syndesmosis, and the squeeze test [15]. However, many primary studies do not clearly define or distinguish between the types of ankle sprains and often only consider overall ankle injuries or ankle instability [16,17,18,19]. Therefore, focusing on only one component or considering only one type of ankle sprain in isolation may mean that relevant studies are missed.

Our objective was to systematically review and report evidence on the reliability and validity of physical examination (orthopaedic) tests for the diagnosis of ankle sprains and/or ankle instability.

Methods

This review was prospectively registered with PROSPERO (CRD42019124090). This systematic review adheres to the Preferred Reporting Items for Systematic reviews and Meta-Analysis of Diagnostic Test Accuracy Studies (PRISMA-DTA) guidelines [20].

Eligibility criteria

Studies regarding either the reliability or validity of manual physical examination or orthopaedic tests for the diagnosis of ankle instability or ankle sprains, including but not limited to anterior drawer test, talar tilt test, and external rotation test were included. We included original peer-reviewed studies written in English or French that included human participants of any age, gender, or ethnicity. Studies assessing validity had to include relevant statistical values such as odds ratios, predictive value, likelihood ratios, receiver operator curves, sensitivity, or specificity. Studies assessing reliability had to include relevant statistical values such as Kappa, intra-class correlation coefficient, or percent agreement.
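To illustrate two of the reliability statistics named above, the following minimal sketch computes percent agreement and Cohen's Kappa for two raters scoring the same patients. It is not taken from any included study: the function names and the example ratings are hypothetical, and the two-rater, unweighted form of Kappa is an assumption (included studies may have used weighted or multi-rater variants).

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of patients on which both raters gave the same rating."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's Kappa: observed agreement corrected for chance."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    # Chance agreement: product of each rater's marginal proportions per category.
    p_expected = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)
    if p_expected == 1.0:
        raise ValueError("Kappa is undefined when chance agreement equals 1")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical anterior drawer test results from two raters on ten patients.
rater_a = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "neg", "pos", "neg"]
rater_b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]

agreement = percent_agreement(rater_a, rater_b)
kappa = cohen_kappa(rater_a, rater_b)
print(f"percent agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

In this hypothetical example the raters agree on 8 of 10 patients, yet Kappa is noticeably lower than the raw agreement because part of that agreement would be expected by chance alone.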

Search strategies

Searches were conducted in PubMed, CINAHL, Scopus, and Cochrane Database from inception to December 2021. In addition, reference lists of included studies, located systematic reviews, and important textbooks on orthopaedic evaluation/musculoskeletal diagnosis were searched for other possible studies [21,22,23].

The keywords used in combination were: “reproducibility of results”, “sensitivity and specificity”, joint instability, ligament, ankle, ankle joint, physical examination, validity, predictive value, accuracy, instability, laxity, injury, alignment, clinical assessment, palpation, orthopaedic, anterior drawer test, talar tilt, and external rotation test. The full search strategy for each database is included in Additional file 1. Search results were imported into bibliographic management software (EndNote X9.2) and duplicates discarded. Results of the search were reported as per the PRISMA flow diagram (see Fig. 1).

Fig. 1 Study flow diagram

Study selection and data extraction

Titles and abstracts were screened independently by two review authors (A.B and J.T) according to the eligibility criteria. The full texts of possibly relevant papers were obtained and again screened against the same criteria (A.B and J.T). Any disagreements were resolved through discussions and consensus between the reviewers.

Data from included studies were extracted independently by two reviewers (A.B and J.T) and then collated, using data collection forms based on the Quality Appraisal for Reliability Studies (QAREL) checklist [24] for reliability studies and the Standards for Reporting Diagnostic Accuracy Studies (STARD) [25] for validity studies. Any disagreements were resolved through discussions and consensus between the reviewers. We extracted study characteristics, including purpose of study, sample size, study population, examiners, orthopaedic tests used, reference standards, and study results.

Methodological quality assessment

The quality of included articles was assessed by two review authors. Methodological quality of the reliability studies was assessed with the QAREL checklist [24], which has 11 items covering seven domains: spectrum of subjects, spectrum of raters, rater blinding, order of examinations, suitable time intervals among repeated measures, test applied and interpreted correctly, and appropriate statistical analysis. Each item is rated as ‘Yes’, ‘No’, ‘Unclear’, or ‘Not applicable’. An item rated as ‘Yes’ indicates a good quality aspect of the study, while an item rated as ‘No’ indicates a poor quality aspect [24]. As recommended, each quality item on the QAREL is considered separately rather than given an overall numerical quality score [24, 26]. Studies that were rated as ‘Yes’ on all items have an overall judgement of ‘high quality’. However, if a study is rated as ‘No’ or ‘Unclear’ on one or more items, then it has an overall rating of ‘at risk of bias’.
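The overall QAREL judgement described above can be sketched as a simple decision rule. This is an illustrative sketch only: the function name is hypothetical, and treating ‘Not applicable’ items as neutral (neither counting for nor against a study) is our assumption, because the rule stated above does not say how such items were handled.

```python
ALLOWED_RATINGS = {"Yes", "No", "Unclear", "Not applicable"}


def qarel_overall(item_ratings):
    """Derive an overall judgement from a study's QAREL item ratings.

    Assumption: 'Not applicable' items are ignored when forming the judgement.
    """
    unknown = set(item_ratings) - ALLOWED_RATINGS
    if unknown:
        raise ValueError(f"unexpected ratings: {sorted(unknown)}")
    # Any 'No' or 'Unclear' item places the study at risk of bias.
    if any(rating in {"No", "Unclear"} for rating in item_ratings):
        return "at risk of bias"
    return "high quality"


# Hypothetical ratings for the 11 QAREL items of two studies.
study_a = ["Yes"] * 11
study_b = ["Yes"] * 8 + ["Unclear", "No", "Not applicable"]

print(qarel_overall(study_a))
print(qarel_overall(study_b))
```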

Methodological quality of the validity studies was assessed using the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) [27]. The QUADAS-2 consists of four key domains covering patient selection, index test, reference standard, and flow and timing. Each domain is assessed for risk of bias, and three of the domains are also assessed for applicability. As recommended, each domain on the QUADAS-2 is considered separately rather than giving an overall numerical quality score [24, 26, 27]. Studies that were rated as low risk on all domains regarding risk of bias or applicability have an overall judgement of ‘low risk of bias’ or ‘low concern regarding applicability’. However, if a study is rated as ‘high’ or ‘unclear’ in one or more domains, then it has an overall evaluation of ‘at risk of bias’ or ‘concerns regarding applicability’ [27].
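As an illustration of how the two overall QUADAS-2 judgements are derived independently of each other, the rule above can be sketched as follows. The function and variable names are hypothetical; the three domains assessed for applicability (patient selection, index test, reference standard) follow the QUADAS-2 tool itself.

```python
BIAS_DOMAINS = (
    "patient selection",
    "index test",
    "reference standard",
    "flow and timing",
)
# Flow and timing is assessed for risk of bias only, not for applicability.
APPLICABILITY_DOMAINS = BIAS_DOMAINS[:3]


def quadas2_overall(risk_of_bias, applicability):
    """Return (bias judgement, applicability judgement) for one study.

    Each argument maps a domain name to 'low', 'high', or 'unclear'.
    """
    def all_low(ratings, domains):
        return all(ratings[domain] == "low" for domain in domains)

    bias = (
        "low risk of bias"
        if all_low(risk_of_bias, BIAS_DOMAINS)
        else "at risk of bias"
    )
    concern = (
        "low concern regarding applicability"
        if all_low(applicability, APPLICABILITY_DOMAINS)
        else "concerns regarding applicability"
    )
    return bias, concern


# Hypothetical study: one unclear bias domain, no applicability concerns.
example_bias = {
    "patient selection": "unclear",
    "index test": "low",
    "reference standard": "low",
    "flow and timing": "low",
}
example_applicability = {domain: "low" for domain in APPLICABILITY_DOMAINS}

print(quadas2_overall(example_bias, example_applicability))
```

Keeping the two judgements separate mirrors the reporting above, where a study can be at risk of bias while still raising low concern regarding applicability.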

Summary of findings

The characteristics of the included studies were tabulated for comparison, identifying the number of times each orthopaedic test was investigated and the validity and/or reliability of each test. Where possible and appropriate (if studies included appropriate statistics), we have included a summary of the validity results by test. Where possible, further validity results were calculated from results provided within the included studies. Likelihood ratios were calculated if sensitivity and specificity were reported, using the equations: positive likelihood ratio = sensitivity/(1 − specificity) and negative likelihood ratio = (1 − sensitivity)/specificity [9]. Predictive values and diagnostic accuracy were calculated if the true positive, true negative, false positive, and false negative values were reported [9]. The interpretation of Kappa values was based on the Landis and Koch reliability classification scale: below chance agreement < 0.00, slight agreement 0.00–0.20, fair agreement 0.21–0.40, moderate agreement 0.41–0.60, substantial agreement 0.61–0.80, and almost perfect agreement 0.81–1.00 [28]. Intra-class correlation coefficients (ICCs) were interpreted as poor if < 0.40, good if 0.40–0.75, and excellent if > 0.75 [29].
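The derived statistics and interpretation bands described in this section can be sketched in a few lines. The sketch uses the standard definitions of the likelihood ratios, LR+ = sensitivity/(1 − specificity) and LR− = (1 − sensitivity)/specificity. The function names and the 2 × 2 counts are hypothetical and not drawn from any included study, and assigning values that fall between two published bands (for example, a Kappa of 0.205) to the lower band is our assumption.

```python
import math


def diagnostic_metrics(tp, fp, fn, tn):
    """Validity statistics from a 2 x 2 table of index test vs reference standard."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),  # positive predictive value
        "npv": tn / (tn + fn),  # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        # Likelihood ratios are unbounded when the denominator is zero.
        "lr_positive": sensitivity / (1 - specificity) if specificity < 1 else math.inf,
        "lr_negative": (1 - sensitivity) / specificity if specificity > 0 else math.inf,
    }


def landis_koch(kappa):
    """Landis and Koch agreement band for a Kappa value."""
    if kappa < 0.0:
        return "below chance"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


def icc_band(icc):
    """Interpretation band for an intra-class correlation coefficient."""
    if icc < 0.40:
        return "poor"
    return "good" if icc <= 0.75 else "excellent"


# Hypothetical counts: 40 true positives, 10 false positives,
# 10 false negatives, 40 true negatives.
metrics = diagnostic_metrics(tp=40, fp=10, fn=10, tn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
print(landis_koch(0.58), icc_band(0.80))
```

With these hypothetical counts, sensitivity and specificity are both 0.80, giving a positive likelihood ratio of about 4 and a negative likelihood ratio of about 0.25.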

We assessed whether results could be included in a meta-analysis. Studies were assessed for statistical heterogeneity using I² [30, 31]. Although there is no agreement on the interpretation of I², we applied the following criteria: 0–40% represented low heterogeneity, 30–60% represented moderate heterogeneity, 50–90% represented substantial heterogeneity, and 75–100% represented considerable heterogeneity [30]. When considering whether a meta-analysis was potentially suitable, we considered both the I² and the methodological/clinical heterogeneity, such as the population under study, the interpretation of index tests, and the reference standards used.
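For readers unfamiliar with the heterogeneity statistic, the following sketch shows one common way of obtaining it: Cochran's Q from inverse-variance weights, then I² = max(0, (Q − df)/Q) × 100 with df equal to the number of studies minus one. This is an assumption for illustration; the exact computation and effect scale used in this review are not described here, and all names and inputs are hypothetical. Because the interpretation bands above overlap, the helper returns every band that contains a given value rather than a single label.

```python
def cochran_q(estimates, variances):
    """Cochran's Q using inverse-variance weights around the pooled estimate."""
    weights = [1 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))


def i_squared(q, n_studies):
    """Percentage of variability attributed to heterogeneity rather than chance."""
    df = n_studies - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100


# Overlapping interpretation bands (percent), as listed in the text.
I2_BANDS = [
    (0, 40, "low"),
    (30, 60, "moderate"),
    (50, 90, "substantial"),
    (75, 100, "considerable"),
]


def i2_labels(i2):
    """All bands containing the value; overlapping bands can yield two labels."""
    return [label for lower, upper, label in I2_BANDS if lower <= i2 <= upper]


# Hypothetical example: Q = 20 across 5 studies.
i2 = i_squared(q=20.0, n_studies=5)
print(f"I2 = {i2:.0f}% -> {i2_labels(i2)}")
```

In this hypothetical case the value falls in both the substantial and the considerable band, which shows why the bands are a rough guide and why methodological and clinical heterogeneity were also considered.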

Results

Study selection

We identified 6798 articles through database searches and 26 additional records through other sources. After duplicates were removed, 6007 articles remained. Title and abstract screening reduced the number of potentially eligible articles to 27 for full-text review. Eleven articles were excluded at full-text review [32,33,34,35,36,37,38,39,40,41,42]. After the full-text review, 16 articles met the eligibility criteria (N = 935 participants) and are included in this review. Figure 1 outlines the screening and selection process.

Study characteristics

Of the 16 included studies, three studies assessed reliability [17, 19, 43], eight studies assessed validity [16, 18, 44,45,46,47,48,49], and five studies assessed both reliability and validity [50,51,52,53,54]. Two studies were cadaveric studies [46, 51]. The characteristics of all included studies are reported in Table 1.

Table 1 Characteristics of included studies

Methodological quality

Quality assessment of included reliability studies using QAREL is presented in Table 2. Only one study was rated ‘yes’ on all 11 items, yielding an overall judgement of ‘high quality’ [19]. The other seven studies that assessed reliability had at least one item rated as ‘no’ or ‘unclear’, giving an overall judgement of ‘at risk of bias’ [17, 43, 50,51,52,53,54]. Common sources of bias included not enough information regarding blinding of the raters to the findings of other raters [17, 50,51,52,53], to their own prior findings [17, 43], to other clinical information [17, 43, 50, 53, 54], and to additional cues [17, 43, 50, 52,53,54]. All included studies used appropriate statistical tests.

Table 2 Quality assessment of included reliability studies using QAREL

Quality assessment of included validity studies using QUADAS-2 is presented in Table 3. Four studies assessing validity had an overall judgement of ‘low risk of bias’ [46,47,48, 51], and seven studies had an overall judgement of ‘low concern regarding applicability’ [16, 18, 44, 45, 47,48,49]. Only two studies were rated as both ‘low risk of bias’ and ‘low concern regarding applicability’ [47, 48]. The other eleven studies had at least one domain within risk of bias and/or applicability with a rating of ‘high’ or ‘unclear’ [16, 18, 44,45,46, 49,50,51,52,53,54]. Common sources of bias included not enough information on how the sample was enrolled [16, 44, 45, 52], how the index test was interpreted, such as whether a pre-specified threshold was used [16, 50, 52,53,54], whether the reference standard was interpreted without knowledge of the index test [44], whether the reference standard was likely to correctly classify the condition [18], and whether all cases received the reference standard [16]. The two cadaveric studies posed concerns regarding the applicability of patient selection and the use of the reference standard [46, 51]; therefore, the results from these studies are reported separately.

Table 3 Quality assessment of included validity studies using QUADAS-2

Summary of findings

Six studies assessed the reliability of the anterior drawer test [17, 19, 50,51,52,53]. Three studies assessed the reliability of the external rotation test [43, 50, 53], and the squeeze test [43, 50, 53]. Two studies assessed the reliability of the anterolateral drawer test [51, 52], and the inversion tilt test [19, 53]. Only one study assessed the reliability of syndesmosis ligament palpation [43], the dorsiflexion compression test [43], tenderness of anterior inferior tibiofibular ligament, proximal fibular, deltoid ligament, anterior talofibular ligament and calcaneo-fibular ligament [50], the cotton test [50], the crossed-leg test [50], distal fibular position [17], the reverse anterolateral drawer test [52], talar tilt [19], and the eversion tilt test [53]. Table 4 reports an overview of the results from studies assessing reliability. Additional file 2 presents a description of all included tests based upon the provided reviewed literature.

Table 4 Results from studies assessing reliability

Nine studies assessed the validity of the anterior drawer test [16, 44, 46, 48, 50,51,52,53,54]. Four studies assessed the validity of the external rotation test [45, 47, 50, 53], and the squeeze test [45, 47, 50, 53]. Three studies assessed the validity of the anterolateral drawer test [46, 51, 52], and the tenderness of the anterior talofibular ligament and calcaneofibular ligament [49, 50, 54]. Two studies assessed the validity of a talar tilt test [18, 48], and tenderness of the syndesmosis [47, 54]. Only one study assessed the validity of dorsiflexion lunge with compression [47], tenderness of anterior inferior tibiofibular ligament [50], proximal fibular [50], deltoid ligament [50], medial ankle [54], talocrural joint [54], peroneal tendon [54], lateral malleolus [54], diffusely lateral [54], supination line [54], the cotton test [50], the crossed-leg test [50], the reverse anterolateral drawer test [52], the inversion stress test [53], and the eversion stress test [53]. Table 5 reports an overview of the results from studies assessing validity.

Table 5 Results from studies assessing validity

Due to the methodological and statistical heterogeneity of the included studies, a meta-analysis was not possible. When combining results, the I² value was 75–100%, representing considerable heterogeneity, for all considered meta-analyses. Additionally, there was major methodological and clinical heterogeneity among the included studies. For example, nine included studies assessed the validity of the anterior drawer test; however, two of these were cadaveric studies [46, 51]. A range of different reference standards was used within these studies, including ultrasound [44, 48, 52], MRI [16, 50, 53], arthrography [54], and direct anatomical measurement after cutting the ligaments [46, 51]. There were also differences in how the anterior drawer test was conducted and how scores were interpreted.

Only three tests had results reported regarding intra-rater reliability: the anterior drawer [17, 51], distal fibular position [17], and anterolateral drawer tests [51]. These tests were all reported to have excellent intra-rater reliability [17, 51]. However, these results are based on at most two studies [17, 51], one of which used cadavers [51]. The two tests with the highest reported inter-rater reliability were the external rotation and the anterior drawer tests, rated as substantial [43] and good [17] agreement respectively. However, other studies have rated the inter-rater reliability of the anterior drawer test as slight [52] and poor [19], and that of the external rotation test as fair [50, 53], demonstrating inconsistent results. The only test to show some consistent results based on more than one included study was the squeeze test, which was rated as having moderate inter-rater reliability based on results from two studies [43, 50].

Overall, the test with the highest reported diagnostic accuracy (91.3%) was the anterolateral talar palpation test; however, this was based on the results of only one study [16]. The tests with the highest reported sensitivity were the anterior drawer test [44, 51, 53], the anterolateral talar palpation test [16], the reverse anterolateral drawer test [52], and palpation of the anterior talofibular ligament [49, 54]. However, results were inconsistent, with lower sensitivity reported for the anterior drawer test depending on the grade of ankle sprain used to indicate a positive test result [44]. The anterior drawer test also had the lowest reported negative likelihood ratio (0.24) of the tests assessed for validity for ankle sprains [53]. The tests with the highest reported specificity were the anterior drawer test [16, 48, 52, 53], the anterolateral drawer test [46, 52], the reverse anterolateral drawer test [52], tenderness on palpation of the proximal fibular [50] and diffusely lateral [54], the squeeze test [45, 47, 53], the talar tilt test [48], and the eversion stress test [53]. Again, results were inconsistent, with lower specificity reported for the anterior drawer test in other studies [44, 46, 50, 51]. The squeeze test had the highest reported positive likelihood ratio (35) of all reported tests [53]. The reverse anterolateral drawer test had both a very high sensitivity and a very high specificity, but this was reported within only one study [52].

Consideration of type of ankle sprain

In the diagnosis of an ankle injury, the mechanism of injury should be considered, such as by using the Lauge-Hansen classification [55]. While many included studies included a mixture of participants with different types of ankle sprains, some did specify which tests should be used for which type of ankle injury. Orthopaedic tests to assess for a potential syndesmosis injury include: tenderness on direct palpation of the ligaments [43, 47, 50], squeeze test [43, 47, 50], external rotation stress test [43, 50, 53], dorsiflexion compression test [43, 47], cotton test [50], and crossed-leg test [50]. Orthopaedic tests to assess for a potential lateral ligament injury include: anterior drawer test [44, 46, 51,52,53], anterolateral drawer test [46, 51, 52], anterolateral talar palpation, reverse anterolateral drawer test [52], tenderness on direct palpation of the ligaments [50], inversion stress test [53], and talar tilt test. Orthopaedic tests to assess for a potential medial ligament injury include: tenderness on direct palpation of the ligaments [50], and eversion stress test [53]. Additional file 3 reports orthopaedic tests for different types of ankle sprains. Additional file 4 reports a summary of the sensitivity and specificity values by orthopaedic test.

Discussion

The tests reviewed included the anterior drawer, anterolateral drawer, reverse anterolateral drawer test, external rotation, dorsiflexion external rotation, squeeze, palpation and tenderness, cotton, crossed-leg, dorsiflexion compression, distal fibular position, talar tilt, inversion tilt, eversion stress, and dorsiflexion lunge with compression tests. Overall, none of these tests have shown robust reliability and validity scores. Even the studies that used a combination of tests did not show high diagnostic accuracy [47]. However, one study did find that the overall validity of physical examination for the ankle did drastically increase if conducted five days after the injury rather than within 48 h of injury [54]. The orthopaedic tests should be used in combination with the clinical history.

Many of the included studies had different or unclear definitions of ankle sprains. These could include a mixture of participants with a history of lateral, medial and/or syndesmotic ankle sprains [16,17,18,19, 49, 54]. Many studies had a mixture of acute and chronic ankle sprains [16, 17, 43, 44] or no information regarding how long the injury was ongoing [17, 19]. The clinical usefulness of certain tests could differ among acute or chronic conditions. Also, some studies did not consider the grade of the ankle sprain required to indicate a positive test [16, 17]. One study that did consider the grade of the ankle sprain showed that when a higher grade (grade 3 or above) was used to consider a positive result, they observed a higher specificity but a lower sensitivity compared to values when using a grade 2 or above [44].

There were other differences in how the studies were conducted, which hindered the interpretation of this systematic review’s results. A range of different reference tests was used, including ultrasound [44, 48, 52], MRI [16, 45, 47, 49, 50, 53], the Cumberland ankle instability tool [18], arthrography [54], and cutting the ligaments to directly measure anatomical movements [46, 51]. Additionally, there were differences in how tests were conducted and how scores were interpreted. For instance, authors used either subjective or objective interpretations to assess the drawer test, such as feeling for any laxity [19, 44] compared to using a goniometer [17]. Other studies did not provide enough detail about how the index test was interpreted, such as whether a pre-specified threshold was used [16, 50, 52, 53]. Furthermore, many studies had a mixture of examiners with varying degrees of experience, from students or clinicians with minimal clinical experience to highly experienced clinicians [19, 43, 46, 47, 50,51,52]. When studies compared the results of students or junior examiners with those of more senior or experienced examiners, there were mixed results. On some occasions the less experienced examiners yielded higher results, and on other occasions the more experienced examiners yielded higher results [19, 52]. Moreover, the two studies using cadaveric specimens [46, 51] posed concerns regarding applicability to a clinical population, as there would be differences between using living participants and using cadaveric specimens. The advantage of using cadaveric specimens over live patients is the ease of distinguishing between a true positive and a true negative, as the ligaments were cut; however, this approach lacks important feedback such as patient cues and tenderness.

This systematic review differs from previous reviews. Two previous reviews on ankle injuries were published six [12] and nine [11] years ago. While both reviews investigated the diagnostic accuracy of special ankle tests, Schwieterman et al. [11] included special tests of ankle and foot musculoskeletal pathologies, and Schneiders et al. [12] reviewed publications that included only the two most widely used clinical tests to assess lateral ankle sprains, namely the anterior drawer and the talar tilt tests. Neither of these reviews [11, 12] accounted for the reliability of the index tests. A more recent review [13] looked at the accuracy of clinical tests assessing ligamentous injury of the talocrural and subtalar joint. Netterström-Wedin et al. [13] focussed on lower lateral ankle stability assessment and did not review ankle stability integrity in its entirety, including the medial side of the ankle and its higher aspect (syndesmosis), which we have considered in our systematic review. We also evaluated the reliability of those tests. Considering our review objectives, we included studies [17, 18, 43, 45,46,47, 50, 51, 53, 56] that were not included by Netterström-Wedin et al. [13].

Considering the risk of bias assessment of studies that were also included in the most recent previous systematic review [13], our interpretation of the QUADAS-2 tool differed for some studies. For example, Netterström-Wedin et al. [13] reported that Li et al. [52] was at low risk of bias and low applicability concerns on all items. We rated this same article as ‘unclear’ risk of bias for patient selection and index test, and as ‘unclear’ concerns regarding the applicability of the index test, because the study did not include enough details. These bias assessment discrepancies probably relate to the subjective interpretation of the tool, which has been reported with other measurement tools [57, 58]; agreement appears to be lowest on highly subjective items. Reliability may vary according to reviewers’ familiarity with the tool, their expertise, their interpretation of the items, or whether reviewers have worked together before [57]. What is important is to apply the risk of bias tool consistently within the systematic review. Considering this subjectivity, comparing similar systematic reviews becomes challenging.

Despite the concerns raised by our systematic review on the diagnostic value of the included ankle physical tests, clinicians should not dismiss the significance of a thorough physical examination. The argument supporting technology as a substitute remains notably debatable: technology is often associated with false-positive results [59], imparting a false sense of confidence that can sometimes delay care and increase its burden. Similar to rheumatology, which lacks a specific organ or system constraint [60], musculoskeletal complaints involve multiple tissues and remain a common reason for patients visiting their primary health practitioners [61]. Although the physical examination, including its orthopaedic component, remains a neglected field of research [62], this component should not be abandoned but instead better understood and refined [63].

Strengths and limitations of this review

This systematic review endeavoured to include all relevant articles that assessed the reliability and/or validity of any type of ankle sprain and/or ankle instability and included a wide initial search strategy. The methodological quality of all included studies was assessed by using the QAREL and/or the QUADAS-2. Due to the methodological heterogeneity of the included studies no meta-analysis could be conducted. The results from this review highlight the heterogeneity within the current literature. Additionally, results are only based on a few studies at most for each test, frequently with limited sample sizes. This systematic review was limited to studies written in English and French.

Recommendations for future research

Appropriate reference standards should be used when determining the diagnostic accuracy of physical examination tests. More high-quality research is needed to truly determine the reliability and validity of physical examination tests for the diagnosis of ankle sprains. Future research should clearly define the type of ankle injury and report the time elapsed since the injury. Furthermore, to truly consider the use of physical examination tests in a clinical and pragmatic way, future studies should use a combination of clinical tests along with the patient’s history.

Clinical implications

Although individual orthopaedic tests may not yield high reliability and validity, they should not be discarded entirely. When examining a patient with an ankle injury, fractures of the ankle and mid-foot should first be excluded, such as by using the Ottawa ankle rules [64], and a range of orthopaedic tests should then be considered to assess for an ankle sprain. Physical examination tests should not be used in isolation; instead, they should be used in combination with the clinical history to diagnose an ankle sprain. Careful consideration should be given to the most appropriate time to conduct the physical examination.

Conclusion

The diagnostic accuracy, reliability, and validity of physical examination tests for the assessment of ankle instability were limited. Physical examination tests should not be used in isolation to diagnose an ankle sprain. Rather, clinicians should use a combination of physical examination tests along with the clinical history. Future studies should ensure appropriate reference standards are used, such as MRI or arthroscopy, and use a combination of clinical tests with the patient’s history to determine the diagnostic accuracy in a clinical and pragmatic way.