1 Introduction

Healthcare costs in most developed countries have grown dramatically over the last few decades. According to records from the Organisation for Economic Co-operation and Development (OECD), the average proportion of Gross Domestic Product (GDP) devoted to the healthcare systems of member countries increased from 4.6% in 1970 to 6.5% in 1990, and reached 8.7% in 2010 (Eckermann and Coelli, 2013; OECD, 2023). The share of healthcare expenditure in GDP is generally even higher among developed member countries, whose average surpassed 10% in 2021. In Australia, the share shows a persistent upward trend, estimated at about 10.6% in 2021, placing the country at around the 40th percentile of all OECD members (OECD, 2023). Throughout the past half-century, the trajectory of Australia’s healthcare share of GDP has closely resembled that of other member countries with similar healthcare systems, such as the United Kingdom (UK), Canada, and New Zealand.

From a domestic perspective, in the Australian government budget for the financial year (FY) 2022/2023, expenditure on the “Health” sector ($108 billion, 16.70% of total expenses, 4.24% of GDP) is second only to the “Social security and welfare” sector ($226 billion, 35.11% of total expenses, 8.90% of GDP), followed by “Education” and “Defense”. In recent years, the health budget has even exceeded the combined budgets for education and defense (Commonwealth of Australia, 2023). Moreover, once the substantial spending from the consumer side is taken into account, total healthcare expenditure greatly exceeds the budgeted expenses. For example, according to the Australian Institute of Health and Welfare (AIHW), nationwide spending on health goods and services in FY 2021/2022 was $241.3 billion, equating to about $9365 per capita and accounting for about 10.5% of overall economic activity (Australian Institute of Health and Welfare, 2022).

Encouraged by both public pressure and governments’ interest in cost containment, a substantial amount of research has been published to inform policymakers on how to address these tremendous and still rapidly increasing costs (O’Neill et al. 2008). It is widely believed that the inefficiency of healthcare institutions has, to some extent, contributed to the ongoing global increase in healthcare costs (Worthington, 2004). In addition, healthcare resources are inherently limited, and therefore their more efficient utilization is important for any type of healthcare system (Kumbhakar, 2010). Efficiency measurement may be the first step in the evaluation of a coordinated healthcare system (O’Neill et al. 2008), whether for better healthcare service outcomes or a more controllable budget. Hospitals, as the core institutions of the healthcare system, account for the lion’s share of the expenditure (about 39.8% of Australian healthcare spending in FY 2021/2022) and have become the most popular research target (Australian Institute of Health and Welfare, 2022).Footnote 1

Given this continuously expanding body of research, what insights have we gained about hospital efficiency in Australia, and what promising directions could be pursued in the future? To obtain a broader scope, we extend the topic from Australia to several reference countries and regions that share a similar healthcare system structure with high-quality healthcare services, similar traditions of the British management system, and a similar level of economic development, i.e., the UK, Canada, New Zealand, and Hong Kong.Footnote 2

Several comprehensive reviews of comparable topics can be found in the literature. For example, Hollingsworth (2003, 2008); Hollingsworth et al. (1999) broadly reviewed studies of healthcare delivery efficiency, focusing on the application and development of efficiency measures, the main findings, the indicators of output and quality, etc. The measurement of efficiency, including the indicators of input and output of a healthcare facility and the approaches used to evaluate resource utilization, was also among the main concerns of the reviews by Hussey et al. (2009); O’Neill et al. (2008); Worthington (2004).

The novelty of our review lies primarily in two aspects. Firstly, to reduce selection bias in the journal articles, we conducted a systematic review instead of the commonly employed conventional approach. Indeed, articles chosen by authors, even when they are field experts, might not be representative of the existing knowledge (Linnenluecke et al. 2020). A systematic review, by contrast, collects a more comprehensive set of available research and selects journal articles for further analysis using precisely predetermined criteria (Linnenluecke et al. 2020; Tranfield et al. 2003). Secondly, we deployed bibliometric techniques for visualization and network analysis, which have been widely used in surveys of other topics (e.g., Choi and Oh, 2019; Nepomuceno et al. 2022), specifically for the review of hospital efficiency studies. In turn, this helped reveal the dynamic patterns of the most researched topics and the most productive authors on these topics.

The remainder of this article is organized as follows. After the introduction, selected previous reviews on related topics are summarized in Section 2. The methodologies of article collection, the processing procedure, and the bibliometric analysis are discussed in Section 3. Section 4 presents a detailed analysis of the articles and methodologies. Qualitative and critical analyses of the findings in the selected articles are further discussed in Section 5. Section 6 provides reflections on the interpretation of efficiency results, and the concluding remarks are summarized in Section 7.

2 Key related works

Several key review studies have been conducted in the last three decades. Hollingsworth et al. (1999) conducted a seminal review of the global applications of nonparametric methods to healthcare efficiency published before 1997, and found that more than two-thirds of the applications used data envelopment analysis (DEA) or DEA-related techniques, and more than two-thirds of the research concerned hospitals and nursing homes in the US. A few years later, Hollingsworth (2003) reinvestigated the study of healthcare efficiency with a broader spectrum of methods. Although different measures were used, the studies of hospitals in the EU typically reported a higher mean efficiency than that estimated for hospitals in the US. Parametric methods, especially within the stochastic frontier analysis (SFA) paradigm, had been employed more widely than in the earlier survey, although the dominant methods were still DEA and DEA-related techniques. From an overall perspective on frontier techniques, about three-quarters of the 188 reviewed articles were based on DEA, SFA, and their variants.

In another review of hospital efficiency studies with DEA-based methods, O’Neill et al. (2008) reviewed 79 studies from 1984 to 2004, which also mainly focused on samples from the US and Europe. Besides the selection of inputs and outputs, this cross-national comparison also revealed differences in preferences regarding research topics and model selection. For example, the authors concluded that European researchers paid more attention to allocative efficiency than to technical efficiency, compared to those in the US.

Specifically for SFA models, Rosko and Mutter (2008) reviewed 20 US hospital efficiency studies utilizing SFA. Varying assumptions were employed regarding the distributions of the composed error terms across different model specifications, whereas the choice of input and output variables was relatively consistent. SFA also exhibits a high level of flexibility, including the adaptation to panel data settings and the utilization of a two-stage framework to examine the influence of environmental variables on (in)efficiency. In a subsequent review of SFA applied to US hospitals, Rosko and Mutter (2011) focused more on the empirical findings regarding the factors influencing hospital efficiency and the potential policy implications.Footnote 3

In a nutshell, three primary types of efficiency measures have been developed to satisfy the requirements of researchers and policymakers: technical efficiency, allocative efficiency, and a combination of both (often referred to as overall efficiency). In turn, these focus on maximizing output from a given input, minimizing input for an expected output, or optimizing over inputs and outputs jointly (Färe et al. 2019; Worthington, 2004). Other efficiency measures, such as cost, revenue, profit, and scale efficiency, may also be utilized in the context of hospitals (e.g., see Sickles and Zelenyuk (2019, Chap. 3, 8) for more details).

Worthington (2004) reviewed the efficiency measures applied to healthcare topics, focusing on frontier techniques. The author considered the main approaches, such as DEA, SFA, the Malmquist index (MI), and their combinations; their implementation, including the input and output indicators; and the approaches used to explain differences in efficiency. The author concluded that, although efficiency measurement was attracting increasing attention in the early 2000s, the applications of advanced frontier techniques were still at a rudimentary stage (Worthington, 2004).

Subsequently, Hollingsworth (2008) reviewed a broader collection of 317 published studies. The most widely preferred techniques remained essentially unchanged from prior years. Moreover, sharing the viewpoint of other discussions and reviews on this topic, the author found that output indicators mostly capture physical activity, such as inpatient days, without considering the quality of treatment. Only 9% of studies included outcome measures, such as the mortality rate and changes in health status. Another weakness is that only a few studies supported their methods with statistical inference or sensitivity analysis. Furthermore, technical efficiency was the primary focus of most studies, while only a few focused on allocative efficiency.

Hussey et al. (2009) expressed similar concerns about the deficiencies of the existing research, especially regarding outputs and statistical testing. By reviewing articles in Medline and EconLit from 1990 to 2008, the authors identified 265 measures in peer-reviewed articles and 8 in so-called ‘gray literature’. As defined in Paez (2017), gray literature is research produced by academics, government, industry, etc., that is not controlled by traditional publishers. Their systematic review focused on efficiency measures and tried to create a mutual understanding of the adequacy of these approaches. Following McGlynn et al. (2008), the measures were classified into three branches: “perspective”, “inputs”, and “outputs”. Among the 265 measures abstracted from the 172 reviewed articles, the production of hospital services, such as length of stay and cost per discharge, was the most commonly used indicator. Half of the measures used physical resources to reflect the input, while one-third used costs and one-quarter used both as input indicators. For outputs, most studies counted healthcare services, for example, discharges, procedures, and physician visits. Only in rare cases was quality integrated into the output, which was the most concerning issue in the review and remains an enduring focus of discussion in the field. Another empirical issue is that only about 2.3% of the articles included tests of reliability or validity, while sensitivity analysis was considered in about one-quarter of the articles, even though it is commonly used in multivariate statistical models.

More recently, Hadji et al. (2014) conducted a systematic review of studies of both the productivity and the financial outputs of hospitals from 1990 to 2013. From the 38 articles reviewed, they summarized a taxonomy of the most commonly considered input/output categories, which are mainly applied with the prevalent DEA and SFA methods. Nepomuceno et al. (2022) systematically reviewed 65 hospital efficiency studies from 1996 to 2022, highlighting the effectiveness of bibliometric techniques in paper selection and visualization, and notably in bolstering the subsequent critical and qualitative analysis.Footnote 4

In summary, the reviews by field experts indicate that healthcare efficiency research is flourishing, especially in the US and the EU. However, reviews of similar topics for particular countries other than the US are lacking, and we try to bridge this gap by providing such an analysis for Australia and some of its peers. One may wonder why Australia would be interesting for a general audience of this journal. In short, it is because the Australian healthcare system is one of the best in the world (albeit still with room for improvement), and knowledge about it can help other countries. Indeed, in their recent report, Schneider et al. (2021, p.14) concluded that

"International comparisons allow the public, policymakers, and health care leaders to see alternative approaches to delivering health care, ones that might be borrowed to build better health systems that yield better health outcomes. Lessons from the three top performers we highlight in this report—Norway, the Netherlands, and Australia—can inform the United States and other countries seeking to improve.”

For the same reasons, we hope that the insights presented in this paper about studies on the performance of hospitals in Australia and its peers, which have similar healthcare system structures and comparable levels of economic development, will also help researchers in other countries.

3 Methodologies and data

3.1 Collection of articles

In line with the article collection process in Hussey et al. (2009), we employed a systematic collection of published articles and gray literature on hospital efficiency studies conducted in Australia and its peer countries and regions. Following the procedure of bibliometric data collection in the literature (e.g., Choi and Oh, 2019 and Linnenluecke et al. 2020), our collection process aims to capture, as far as possible, all related published articles and gray literature.

Firstly, we chose Scopus as the main database for its comprehensive coverage of healthcare research and its export formats, which suit mainstream bibliometric analysis techniques. Meanwhile, the Web of Science and more specialized platforms, such as Medline and PubMed, serve as complements.

Subsequently, for the Boolean search, our review topic was decomposed into three categories, i.e., “location”, “topic”, and “object”. The “location” category restricts the search to the countries or regions of interest. Besides the country names, sub-national district names were also included. For Australia, for example, apart from “Australia” and “Australian”, the keywords “Queensland”, “Victoria”, and the names of all the other States and Territories were included as well. Moreover, commonly used abbreviations, such as “UK” for “United Kingdom”, and synonyms, such as “British” and “Britain”, were listed simultaneously. Since there are five countries and regions of interest, we compiled five keyword groups, for Australia, the UK, Canada, New Zealand, and Hong Kong, respectively.

The “topic” category limits the search to research on “efficiency”, “inefficiency”, “productivity”, and “performance analysis”, and is the same for all targeted countries and regions. Finally, the research “object” is constrained to “hospital”, “healthcare” (or “health care”), and “health services”. Whilst our focus is “hospital”, we opted to include these slightly more general phrases in order not to overlook any crucial material for further analysis.

The result of the Boolean search is the intersection of the three categories, obtained with the logical operator “AND”. Within each category, the selected terms are searched in every specified field (e.g., title, keyword (including author keys and index terms), or affiliationFootnote 5) and combined with the logical operator “OR”. Another configuration worth noting is that the wildcard “*” is added at the beginning and end of one-word keywords, such as “efficiency”, to allow similarly spelled terms, such as “inefficiency”. An illustration of the Boolean search employed is presented in Fig. 1.
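To make the construction concrete, the following is a minimal sketch (in Python) of how such a query string could be assembled from the three keyword groups. The field code and the terms shown here are only a small, illustrative subset, not the exact query used in our search.

```python
# Minimal sketch of assembling the Boolean query from the three categories;
# the field code "TITLE-ABS-KEY" and the terms below are illustrative only.
location = ['Australia*', 'Queensland', 'Victoria', '"New South Wales"']
topic = ['*efficiency*', 'productivity', '"performance analysis"']
research_object = ['hospital*', 'healthcare', '"health care"', '"health services"']

def or_group(terms, field="TITLE-ABS-KEY"):
    # terms of one category are combined with OR within the chosen search field
    return f'{field}({" OR ".join(terms)})'

# the three categories are then intersected with AND
query = " AND ".join(or_group(g) for g in (location, topic, research_object))
print(query)
```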

Fig. 1 Illustration of the Boolean search logic

As for gray literature, we searched similar keywords in Google Scholar and RePEc. For example, for Australia we found 6 papers that were not captured by the previous search but were very closely related to our review topic. We reviewed these studies together with those collected from the published platforms; however, as a limitation discussed in later sections, the gray literature could not be included in the subsequent bibliometric analysis because it lacks the required bibliometric information.

The collection at this stage is usually too broad for effective review or analysis. Similar to the procedure used in the literature, we dropped unrelated articles by manually reviewing them against a set of predetermined criteria. The selection criteria were informed by field experience and adjusted to the composition of the existing paper pool, with the aim of retaining every article directly relevant to the theme of the review.

Accordingly, we acquired five sets of published articles, one for each country and region of interest, via the Scopus database, containing 2153 studies for Australia, 5398 for the UK, 6872 for Canada, 337 for New Zealand, and 192 for Hong Kong from 1970 to 2023.Footnote 6 Following the process shown in Fig. 2, after excluding articles containing disease or department terms in the title, we reviewed the contents of the remaining articles and identified those closely relevant to our review topic. Consequently, we selected 12 articles for Australia, 17 for the UK, 10 for Canada, as well as 9 and 5 for New Zealand and Hong Kong, respectively.
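As a schematic illustration of the title-based screening step (the final relevance check remains a manual review), the exclusion can be expressed along the following lines; the file name, column name, and exclusion terms are hypothetical.

```python
import pandas as pd

# Schematic title screening: drop records whose titles contain disease or
# department terms; the file, column, and term list below are hypothetical.
records = pd.read_csv("scopus_australia_export.csv")
exclusion_terms = ["cancer", "diabetes", "cardiolog", "emergency department"]

pattern = "|".join(exclusion_terms)
kept = records[~records["Title"].str.contains(pattern, case=False, na=False)]
print(f"{len(records)} records -> {len(kept)} kept for manual content review")
```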

Fig. 2 Literature collection flow

3.2 Bibliometric analysis techniques

Academic knowledge is expanding so dramatically that it is becoming increasingly challenging for researchers to comprehensively review, analyze, and understand a field relying only on manual reading (Linnenluecke et al. 2020). Beyond the efficiency concern, the sheer volume of material also pushes researchers to choose “high quality” sources rather than consider a broader range of evidence (Tranfield et al. 2003). Visualization and mapping techniques, owing to their remarkable capability to facilitate the comprehension of a substantial volume of knowledge along a variety of dimensions, have attracted widespread interest from researchers in various fields in recent years. In fact, they are more than a beneficial tool; to some degree they are a necessary route to a comprehensive systematic review.

Bibliometric analysis, such as co-citation analysis (Small, 1973) and co-word analysis (Callon et al. 1983), has long been applied in mapping the literature (Nieminen et al. 2013). With the development of big data methods, scientometric researchers have developed functional tools to meet the demand for analyzing comprehensive data sources.

A number of well-developed tools are based on the Java platform, such as VOSviewer (Van Eck and Waltman, 2010), Sci2 (Science of Science Tool) (Sci2 Team, 2009), and SciMAT (Cobo et al. 2012), which are powerful in mapping visualization and network analysis. These Java-based tools are usually also equipped with text-mining functions to handle data preparation prior to mapping. Other tools possess similar capabilities, such as CiteSpace II (Chen, 2006), whose main functions are similar to those of Sci2, and the Network Workbench Tool (NWB Team, 2006), which also offers open access to data modification. Another family of tools is developed in the statistical programming language R (R Core Team, 2023). One advantage of these open-source packages is the flexible and extensible working environment, in which researchers and practitioners can continuously provide updates for the functions (Linnenluecke et al. 2020). A representative package developed in R is Bibliometrix (Aria and Cuccurullo, 2017), which supports both descriptive analysis and network analysis. Additionally, other tools, such as Shiny (Chang et al. 1998) and Gephi (Bastian et al. 2009), provide functions that can be combined with applications such as Bibliometrix and VOSviewer. Another pioneering software package, HistCite (Garfield, 2009), was developed for network analysis of key authors and articles, but is no longer under development.

Most of these popular tools allow importing bibliometric data from Scopus, the Web of Science, etc., which are usually stored as BibTeX (bib), plain text (txt), or RIS (ris) files. However, the export formats of the different platforms are not mutually compatible. Taking into account the functional requirements and the characteristics of our bibliometric data set, we mainly use Bibliometrix and VOSviewer in the subsequent analysis.
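The kind of pre-processing such tools perform can be illustrated with a short sketch that reads a Scopus export and tabulates author keywords; the column name and separator reflect common Scopus CSV exports but should be verified against the actual file, and the file name is hypothetical.

```python
import pandas as pd
from collections import Counter

# Sketch of keyword tabulation from a Scopus CSV export; the column name
# "Author Keywords" and the ";" separator are assumptions to be checked.
records = pd.read_csv("scopus_australia_export.csv")
keyword_counts = Counter(
    kw.strip().lower()
    for cell in records["Author Keywords"].dropna()
    for kw in cell.split(";")
)
print(keyword_counts.most_common(10))
```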

4 Analysis

We start our analysis by employing a series of bibliometric techniques to obtain a panoramic perspective on the research regarding the productivity and efficiency of hospitals within the selected countries and regions. This overview serves as a prerequisite for those seeking a deeper understanding of the research landscape. It provides fundamental yet pivotal insights into the dynamics of prevalent research topics and methodologies, the identification of principal contributors to the field, and the interconnection between local research endeavors and the broader global literature. We then delve deeper into each of the selected papers to identify the common inputs, outputs, and quality indicators, as well as major techniques utilized in the analysis of the productivity and efficiency of hospitals within these countries and regions.

4.1 Overview of the research landscape

4.1.1 The dynamics of research topics and methodologies

To depict the prominent research topics and methodologies, a word cloud of keywords was generated for each country and region, where general terms (such as country names) were omitted to emphasize the methodologies and research topics. In Fig. 3, the size of each term corresponds to its frequency relative to other terms within the same country or region. As can be seen from Fig. 3, in addition to output assessment terms, such as “mortality” and “length of stay”, researchers in Australia preferred using “risk assessment” and “cost-benefit analysis” in measuring “organizational efficiency”. Meanwhile, research in Canada focused more on methodology terms, such as “DEA”, “bootstrapping”, “Monte Carlo method”, and “regression analysis”. In New Zealand, researchers exhibited a greater focus on “public health” and “primary health care”, while frequently employing “cost-benefit analysis”, “DEA”, and “Monte Carlo method”. Researchers in the UK devoted more attention to the “National Health Service” (NHS), “state medicine”, and “quality”, as well as the novel technique of “machine learning”. In Hong Kong, research is primarily concentrated on “health care delivery”, which is also a popular topic in Canada and New Zealand. Hong Kong’s research has also placed particular emphasis on “population density”, distinguishing it from its peers.
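For readers wishing to reproduce this style of figure, a minimal sketch of a frequency-sized word cloud is given below, using made-up keyword frequencies in place of the actual counts and an illustrative list of general terms already removed.

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Toy keyword frequencies (made up) standing in for the counts of one country,
# after removing general terms such as "australia", "article", or "humans".
freqs = {"data envelopment analysis": 12, "technical efficiency": 9,
         "risk assessment": 7, "cost-benefit analysis": 6, "mortality": 5,
         "length of stay": 4, "stochastic frontier analysis": 3}

cloud = WordCloud(width=800, height=400, background_color="white")
cloud.generate_from_frequencies(freqs)   # term size proportional to frequency
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```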

Fig. 3 Word clouds of keywords by country and region

For a dynamic perspective on the key topics and methods, as illustrated in Fig. 4, the keyword records from the countries and regions of interest were pooled and divided into several periods: prior to 2005, 2006 to 2010, 2011 to 2015, and 2016 to 2023. Using the same generating algorithm, the word clouds for later years appear denser and more informative, primarily because the majority of research was conducted after 2010. Research in earlier years focused more on qualitative discussion of the effects of cost, policy, and reform on efficiency, while the focus later shifted to empirical methods, with “cost benefit analysis”, “DEA”, and “machine learning” gradually becoming more prevalent. Notably, “quality” was extensively discussed in the years from 2006 to 2015 but less so in more recent years.

Fig. 4 Word clouds of keywords by period

Another dynamic trend worth exploring is the evolution of the keywords. The results of the co-occurrence analysis of the modified keywords using VOSviewer (Van Eck and Waltman, 2010) are depicted in Figs. 5 and 6. The connections between the word frames denote co-occurrences between pairs of terms. The size of each frame reflects the number of occurrences of the respective keyword, while the colors encode different dimensions in the two figures.

Fig. 5 Co-occurrence network of keywords over time

Fig. 6 Co-occurrence network of keywords by number of citations

The frames in Fig. 5 are color-coded based on the average publication year. First, the time span is derived from the publication years of all the published journal articles in the sample, and each year in this span is scored chronologically. For a given keyword, the publication-year score \({S}_{i}\) is the average of the scores \({S}_{i}^{j}\) of all the studies that have used it, i.e.,

$${S}_{i}={m}^{-1}\sum\limits_{j=1}^{m}{S}_{i}^{j},$$
(1)

where i represents the ith keyword in the sample, m is the number of studies that mentioned this keyword, and j denotes the jth article among these m articles. Finally, the color gradient from warm to cold hues is assigned to each keyword based on the scores \(S={\{{S}_{i}\}}_{i = 1}^{w}\), where w represents the number of keywords included in the analysis. As a result, the more frequently a keyword appeared in recently published journal articles, the warmer the color of the frame would be. Analogously, a frame with a colder hue indicates that the term was more frequently employed in earlier publications.

Meanwhile, the color of the frame in Fig. 6 serves as an indicator of the average number of total citations of the journal articles containing the particular term as a keyword. The color gradient is determined by the citation scores \(C={\{{C}_{l}\}}_{l = 1}^{q}\), where q is the number of keywords and l indicates the lth keyword in the sample. The citation score for a certain keyword is

$${C}_{l}={p}^{-1}\sum\limits_{k=1}^{p}{C}_{l}^{k},$$
(2)

where p is the number of relevant studies that used this keyword, k represents the kth study among these p studies, and \({C}_{l}^{k}\) is the number of citations of the kth study containing the lth keyword. Therefore in Fig. 6, the published studies utilizing the warm-colored terms as keywords receive a higher citation count on average than those using the cold-colored keywords.
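Both scores are simple per-keyword averages and can be computed directly from a long-format table with one row per (article, keyword) pair, as in the following sketch; the column names and values are hypothetical, and the raw publication year is used as the chronological score.

```python
import pandas as pd

# One row per (article, keyword) pair; values are made up for illustration.
records = pd.DataFrame({
    "keyword":   ["dea", "dea", "sfa", "bootstrap", "bootstrap"],
    "year":      [2011, 2016, 2009, 2018, 2021],
    "citations": [120, 45, 60, 15, 5],
})

# Eq. (1): average publication-year score S_i per keyword;
# Eq. (2): average number of citations C_l per keyword.
scores = records.groupby("keyword").agg(S=("year", "mean"), C=("citations", "mean"))
print(scores)
```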

Consequently, the evolution of keywords reveals three main periods with different research interests across the regions of interest. In the early years, around 2005, “cost benefit analysis”, “health care organization”, and “teaching hospital” were the predominant research subjects. During the subsequent period, from around 2010 to 2015, more advanced methods and indexes emerged within the field, i.e., a shift from “organization management” and “healthcare delivery” to “length of stay”, “hospital quality”, “technical efficiency”, and “DEA”. Finally, new terms were introduced in recent years, such as “efficiency frontier estimation”, “bootstrap”, “factor analysis”, and “cost efficiency”. This keyword shift over time also corresponds to the pattern unveiled in the previous word clouds.Footnote 7

The network in Fig. 6 implies that research focusing on “DEA”, “diagnosis-related group”, and the “Malmquist productivity index” (MPI) has received the highest number of citations, indicating the popularity of these methods for this topic. Cross-checking with Fig. 5, most of the terms that emerged in recent years have received relatively few citations, which may be due to the limited time they have been in circulation. Nevertheless, methodology terms such as “bootstrap”, “DEA”, and “MPI” were not prevalent in the early stage but have received a considerable amount of attention in more recent years.

4.1.2 Top contributors

To gain an overview of the top contributors to the research on hospital performance across all the countries and regions of interest, we plotted the number of publications by the most productive authors (in descending order) in a time series chart. As shown in Fig. 7, the articles authored by these researchers span from 1995 to 2023. The connecting lines link the works of each researcher in our paper pool, with the size of the blue dots representing the number of articles published in each year, ranging from one to three. Additionally, the depth of color at each point is determined by the average number of citations per article per year, ranging from approximately 0.2 to around 5.4.

Fig. 7 Production of top productive authors over time

Andrews has been the most productive author, publishing five articles on New Zealand hospitals in recent decades. In Australia, Yong stands out as the most prolific author over the past decades and is also the most active local collaborator. Research in Canada is concentrated in the period from 2008 to 2016, during which Chowdhury and Laporte, along with their co-authors, conducted the majority of the studies. Research in Hong Kong is relatively scarce compared to the other peers and is therefore not presented in the plot. In general, most of the articles were published after 2011, aligning with the conclusion drawn from the time series word cloud plots.

In addition to studying productive authors individually, we aimed to uncover connections among them by providing an overview of the relationships between the productive authors, prevalent keywords, and frequently used publication sources with a three-field plot (also referred to as a Sankey diagram) for each country. Sankey diagrams visualize flows in networks and processes through arrows and their widths (Froehlich, 2005). The three fields utilized are keywords (both author keywords and index terms), authors, and publication sources, arranged sequentially from left to right. The central field is “authors”, where the most productive authors were identified for each country. The width of a flow between an author and a keyword, or between an author and a source, represents the degree of relevance. Not all keywords or sources of each author are included in the analysis; only those most commonly shared among the various authors are depicted.
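As an illustration of how a three-field plot encodes these relationships, the following sketch draws a tiny keyword–author–source Sankey diagram with made-up link weights; the names are borrowed from the discussion below purely for illustration and do not reflect the actual flow values.

```python
import plotly.graph_objects as go

# Schematic keyword -> author -> source Sankey diagram; link weights are made up.
labels = ["risk assessment", "mortality",          # keywords (left field)
          "Yong", "Chua",                          # authors  (middle field)
          "Economic Record", "Health Economics"]   # sources  (right field)
fig = go.Figure(go.Sankey(
    node=dict(label=labels, pad=20),
    link=dict(source=[0, 0, 1, 2, 3],              # indices into `labels`
              target=[2, 3, 2, 5, 4],
              value=[2, 1, 1, 2, 1]),              # flow widths (degree of relevance)
))
fig.show()
```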

As shown in Fig. 8 for Australia, the productive authors in the middle field form groups of highly productive collaborators. For example, Chua, Palangkaraya, and Yong, whose works share similar keywords (author keys and index terms) on the left side, have articles published in Economic Record and Health Economics, as displayed on the right side. Moreover, if we focus on research methods and subjects in the keywords on the left side and disregard general terms such as “Australia”, “article”, and “humans”, some tendencies in local research can be discerned; specifically, “risk assessment” and “mortality” are the most frequently mentioned subjects.

Fig. 8 Sankey diagram of Australia

Analogous diagrams were generated for the other peer countries for comparison.Footnote 8 As shown in Figs. 8–11, similar to the case of Australia, productive researchers on this topic in the UK also display a pattern of collaboration, as seen in the co-authorship among Bojke, Castelli, Street, Laudicella, and Ward. Furthermore, researchers in Canada and New Zealand place greater emphasis on the DEA method, e.g., the Rouse, Harrison, and Turner group in New Zealand. The keywords on the left side of the diagram also show that researchers in New Zealand exhibit a greater interest in assessing private healthcare.

Fig. 9 Sankey diagram of the UK

Fig. 10 Sankey diagram of Canada

Fig. 11 Sankey diagram of New Zealand

4.1.3 Connection to the global literature

To understand the connection between the local and global literature on hospital performance, we first examined the sources and authors most cited by local researchers in each region. For Australia, as summarized in Table 1, Health Economics is not only a journal in which local research is frequently published, but also the journal most frequently cited by local researchers. Turning to the most cited authors, as listed in Table 2, researchers with global reputations in efficiency analysis and the healthcare sector, including Simar, Wilson, Zelenyuk, Braithwaite, Grosskopf, and Färe, are among those most commonly cited by Australian researchers on this topic.

Table 1 Most cited sources by region
Table 2 Most cited authors by region

Another noteworthy phenomenon is the overlap in the most cited authors across researchers in different countries and regions. Most of the top cited researchers in studies of Canadian hospitals are based in the US, such as Grosskopf, Färe, and Valdmanis, who are also among the most cited in the studies on Australia, which indicates that Australian research closely tracks the work of North America. In contrast, the top cited researchers in the studies on the UK are local experts, such as Street, Castelli, and Gravelle. The higher level of independence of research in the UK may be due to its special topics of interest, such as the performance of the NHS. The situation in New Zealand is similar to that in Australia, in that local researchers tend to follow the globally prominent researchers in the field as well as the leading regional authors.

Finally, we collected the global citations of the local studies, and the top-cited ones within each country and region are listed in Table 3. Comparing the total citations of these top-cited journal articles, studies in the UK and Canada have been more influential in the field than those in Australia, New Zealand, and Hong Kong.

Table 3 Influential journal articles in each region

4.2 Input, output, and quality indicators

Healthcare services can be considered as a production process in which inputs are transformed into outputs by the production decision-making units (DMUs), in our case, the hospitals. Among the reviewed studies, as well as the literature (and reviews) covering a wider range of countries on this topic,Footnote 9 the most commonly applied inputs include different types of labor (e.g., hours of medical and surgical staff, nurses, administrative staff, etc.), supply expenditure (e.g., medical and surgical supplies and the cost of other consumables), equipment expenditure (e.g., physical investment and the cost of equipment), and capital investment (usually represented by the number of staffed beds). Two categories of outputs are generally evaluated for hospital performance: outpatient services (e.g., number of outpatient visits, number of ambulance visits, etc.) and inpatient care (e.g., number of admitted/discharged episodes, number of bed days, number of case-mix weighted procedures, etc.).

The production process of a hospital, and of the healthcare sector in general, differs from that of other sectors, as the final ‘product’ or outcome of the service should be the improvement of patients’ health, rather than the efficient admission and discharge of patients. Hence, when feasible, it would be more desirable to evaluate the performance of hospitals by outcomes. In practice, however, the aforementioned output categories are the most frequently considered dimensions in the evaluation of hospital performance. The main reasons are usually the lack of data on outcomes and the difficulty of quantifying the health improvement from a procedure.

Much effort has been devoted to remedying this inadequacy (though not many studies within our review scope address this aspect). Some indicators designed for specific procedures or for the entire hospital can be recorded to represent the quality of the services (e.g., the mortality rate during admission, the post-surgery infection rate, and the readmission rate), which in turn may indicate the outcome. In practice, these indicators can be incorporated into the output as a weight or as an independent category, or considered as exogenous variables to explain the variation in efficiency (e.g., Chua et al. 2011; Deng et al. 2019; Ferrier and Valdmanis, 1996). As proxies of the outcome, the currently developed quality indicators, as reviewed by Breyer et al. (2019); De Vos et al. (2009), are primarily focused on undesirable outcomes and aim at controlling the lower limit of the service. For instance, among our reviewed studies, Andrews et al. (2022) evaluated the trade-off between technical efficiency and quality of hospitals in New Zealand, where quality is represented by an undesirable output, the readmission of inpatients. Nevertheless, incorporating quality indicators into the performance analysis of hospitals is a step closer to an evaluation based on the fundamental objective of healthcare.

As indicated in the reviews by Hollingsworth (2008); Hussey et al. (2009); O’Neill et al. (2008), applications examining the quality of care are sparse in the literature, which is also the case in our review. As with the issue of evaluating outcomes directly, the difficulty of data collection is usually one of the main reasons for this.

4.3 Major techniques

Various performance indicators are used for measuring hospital efficiency. For example, the length of stay (LOS) was analyzed by Bogomolov et al. (2017); Hanning (2007). Ratio-type indicators, e.g., total factor productivity (TFP) between periods or hospital groups, are another vein of computationally simple and easily interpretable measures (e.g., Aragón et al. 2019; Bojke et al. 2013; Cheng et al. 2020; Andrews and Emvalomatis, 2023).Footnote 10 The prevalence of the two most widely adopted techniques in the efficiency analysis field is also observed among our reviewed articles, as in other reviews on hospital efficiency (e.g., Hollingsworth and Wildman, 2003; Hollingsworth et al. 1999; Worthington, 2004). Both approaches belong to the vein of “frontier analysis”.Footnote 11 One is SFA, which was first proposed by Aigner et al. (1977) and Meeusen and van den Broeck (1977) and has been further advanced in many studies; the other is DEA, which originated with Farrell (1957) and was advanced by Charnes et al. (1978) and many others.

DEA is a nonparametric method derived from production theory axioms, and it is relatively easy to apply when both inputs and outputs are multi-dimensional. In the healthcare sector, it also has advantages in flexibility and versatility (O’Neill et al. 2008), and hence it is widely applied in our reviewed articles, e.g., Chua et al. (2011); Nghiem et al. (2011); Nguyen and Zelenyuk (2021a); Nguyen and O’Donnell (2023) for Australia, Giuffrida (1999); Hollingsworth et al. (1999); Parkin and Hollingsworth (1997) for the UK, Chowdhury and Zelenyuk (2016); Fixler et al. (2014); Wang et al. (2018) for Canada, Andrews (2021); Andrews et al. (2022) for New Zealand, and Li et al. (2019) for Hong Kong. At the same time, DEA can be more sensitive to outliers (which must be dealt with, if present, prior to the deployment of DEA), and it requires a relatively large sample for statistical inference when the dimension of inputs/outputs is high. Moreover, as the estimated production (cost) frontier of the envelopment-type estimator is downward (upward) biased relative to the boundary of the true technology set, the estimated efficiency is also biased. Hence, to obtain more accurate and reliable efficiency estimates, a bias correction procedure, as well as the estimation of confidence intervals, is necessary following the initial estimation.Footnote 12
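To make the mechanics of DEA concrete, the following is a minimal sketch of input-oriented DEA under VRS, solving one linear program per hospital; the data are made up, and the sketch omits the outlier screening, bias correction, and inference discussed above.

```python
import numpy as np
from scipy.optimize import linprog

def dea_input_vrs(X, Y):
    """Input-oriented VRS DEA scores: X is (n, p) inputs, Y is (n, q) outputs."""
    n, p = X.shape
    q = Y.shape[1]
    scores = np.empty(n)
    for o in range(n):                                  # one LP per hospital o
        c = np.r_[1.0, np.zeros(n)]                     # minimize theta
        A_in = np.hstack([-X[o].reshape(p, 1), X.T])    # sum_j lam_j x_ij <= theta x_io
        A_out = np.hstack([np.zeros((q, 1)), -Y.T])     # sum_j lam_j y_rj >= y_ro
        A_ub = np.vstack([A_in, A_out])
        b_ub = np.r_[np.zeros(p), -Y[o]]
        A_eq = np.r_[0.0, np.ones(n)].reshape(1, -1)    # VRS: sum_j lam_j = 1
        bounds = [(None, None)] + [(0, None)] * n       # theta free, lam_j >= 0
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                      bounds=bounds, method="highs")
        scores[o] = res.x[0]
    return scores

# Made-up data: 5 hospitals, inputs = (staffed beds, FTE staff), output = episodes.
X = np.array([[100, 250], [80, 200], [120, 400], [60, 150], [90, 300]], dtype=float)
Y = np.array([[5000], [4500], [5200], [3000], [4000]], dtype=float)
print(np.round(dea_input_vrs(X, Y), 3))
```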

The SFA method, on the other hand, decomposes the regression-type overall error term into inefficiency and statistical noise, which makes it less sensitive to extreme values. It also provides more interpretable results (e.g., estimated coefficients of the assumed production relationship, etc.). SFA appears to be more frequently applied in Australia, e.g., Gabbitas and Jeffs (2009); Nghiem et al. (2011); O’Donnell and Nguyen (2013); Productivity Commission (2009, 2010); Wang and Zelenyuk (2024). It is also deployed in Street (2003) for the UK and Jiang and Andrews (2020) for New Zealand. However, SFA is usually implemented in a parametric framework, and therefore requires additional assumptions on the frontier and error terms, which may lead to potentially large estimation errors and could possibly influence the conclusions.
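As a companion to the DEA sketch above, the following is a minimal sketch of the classic normal/half-normal SFA production model of Aigner et al. (1977), estimated by maximum likelihood on simulated data; it illustrates the model class only, not any specification used in the reviewed studies.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_loglik(params, y, X):
    """Normal/half-normal production SFA: y = X beta + v - u,
    v ~ N(0, sv^2), u ~ |N(0, su^2)| (Aigner et al., 1977)."""
    k = X.shape[1]
    beta, lnsv, lnsu = params[:k], params[k], params[k + 1]
    sv, su = np.exp(lnsv), np.exp(lnsu)        # log-parameterized to stay positive
    sigma, lam = np.sqrt(sv**2 + su**2), su / sv
    eps = y - X @ beta
    ll = np.log(2.0 / sigma) + norm.logpdf(eps / sigma) + norm.logcdf(-eps * lam / sigma)
    return -ll.sum()

# Simulated Cobb-Douglas-style data (illustration only): log inputs and log output.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.uniform(1, 5, size=(n, 2))])
y = X @ np.array([1.0, 0.5, 0.4]) + rng.normal(0, 0.1, n) - np.abs(rng.normal(0, 0.3, n))

start = np.r_[np.linalg.lstsq(X, y, rcond=None)[0], np.log(0.1), np.log(0.1)]
fit = minimize(neg_loglik, start, args=(y, X), method="BFGS")
print(np.round(fit.x[:3], 3))                   # estimated frontier coefficients
```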

The two mainstream approaches have been widely compared, and are also frequently applied jointly in the literature (e.g., Jacobs, 2001; Linna, 1998; Varabyova and Schreyögg, 2013 in the context of healthcare). In practice, the choice is often driven by the aims of the analysis rather than solely by the pros and cons of each method. Moreover, a hybrid approach may attenuate the limitations of both approaches while leveraging their respective virtues; examples include semi-parametric and nonparametric SFA (e.g., see the review by Parmeter and Zelenyuk, 2019), stochastic DEA (Simar and Zelenyuk, 2011), and StoNED (Kuosmanen and Johnson, 2017).

In either vein of techniques, explaining the estimated efficiency is a critical task. After all, it is not the level of efficiency or the ranking, but the potential policy implications for improving performance, that matter. After obtaining the DEA estimates, researchers have applied a variety of regression-type methods to explain efficiency with environmental variables, ranging from ordinary least squares (OLS) (Street (2003) for the UK) to Tobit regression (Guo et al. 2017 for Hong Kong), and truncated regression in the most widely applied two-stage DEA approach based on Simar and Wilson (2007) (e.g., Chua et al. 2011 for Australia, Chowdhury and Zelenyuk, 2016; Wang et al. 2018 for Canada, and Andrews, 2020a for New Zealand). In the SFA stream, a number of models have also been developed to explain the inefficiency term by incorporating environmental determinants into the assumed distribution. The idea can be traced back to Kumbhakar et al. (1991) and was further popularized by Battese and Coelli (1995); it is also widely applied in the case of hospitals, e.g., in the US and the EU by Herr (2008); Rosko (2001, 2004); Vitikainen et al. (2010), as well as by Sickles et al. (2024); Wang and Zelenyuk (2024) for Australian hospitals.
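The second-stage model of Simar and Wilson (2007) is a truncated normal regression of the (Shephard-type) efficiency scores on environmental variables; the following sketch shows only this regression step by maximum likelihood, with made-up data and without the parametric bootstrap that their algorithms add for valid inference.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_loglik(params, delta, Z):
    """Left-truncated (at 1) normal regression: delta = Z beta + eps, delta >= 1."""
    k = Z.shape[1]
    beta, lnsig = params[:k], params[k]
    sig = np.exp(lnsig)
    mu = Z @ beta
    ll = norm.logpdf((delta - mu) / sig) - np.log(sig) - norm.logsf((1.0 - mu) / sig)
    return -ll.sum()

# Made-up data: delta are Shephard-type DEA scores (>= 1, e.g., 1/theta), and Z holds
# a constant plus environmental variables (e.g., teaching status, remoteness).
rng = np.random.default_rng(1)
n = 100
Z = np.column_stack([np.ones(n), rng.integers(0, 2, n), rng.uniform(0, 1, n)])
delta = np.maximum(1.0, Z @ np.array([1.2, -0.1, 0.3]) + rng.normal(0, 0.15, n))

start = np.r_[np.linalg.lstsq(Z, delta, rcond=None)[0], np.log(0.1)]
fit = minimize(neg_loglik, start, args=(delta, Z), method="BFGS")
print(np.round(fit.x, 3))      # regression coefficients and log(sigma)
```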

Concluding this section, a summary of the key findings of the bibliometric analysis is presented in Table 4. It is worth emphasizing that all approaches have their advantages, as well as caveats and limitations, which may be more or less relevant depending on the context, data, or aims of the analysis; we discuss this in more detail in the next section.

Table 4 Summary of key findings from the bibliometric analysis

5 Summary of findings of reviewed research

The results and conclusions drawn from the reviewed studies across the selected countries and regions exhibit some degree of consistency in a number of aspects, indicate contrasting views in certain cases, and are sometimes simply not comparable. In Table 5, we provide a succinct summary of selected articles with representative findings from our review and present a more detailed discussion in the following sub-sections.

Table 5 Key findings of selected research

5.1 Levels of inefficiency

The level of efficiency and/or productivity is one of the fundamental aspects of analysis for this topic. Such estimates are usually reported as an aggregated measure for the studied area. For example, Gabbitas and Jeffs (2009) suggested that improvements of up to 10% in aggregate productivity for Australian public hospitals are possible, as indicated by their SFA models. In the report of the Productivity Commission (2009), the average technical efficiency estimated with SFA was found to be similar between the public (386 hospitals) and private (122 hospitals) sectors in Australia (2006/2007), at about 20% below the hypothesized best practice. Regarding the other countries of interest, the national technical efficiency of New Zealand public hospitals from 2011 to 2017 was evaluated at 86% on average (i.e., 14% inefficiency) by SFA and at 93% (i.e., 7% inefficiency) by DEA under the variable returns to scale (VRS) assumption (Jiang and Andrews, 2020). From a dynamic perspective, research in the UK and Canada utilizing the MPI and its decomposition with DEA (under constant returns to scale (CRS) and VRS) generally indicated an improvement in productivity and/or efficiency during the studied periods (e.g., Chowdhury et al. 2011; Giuffrida, 1999; McCallion et al. 2000; Valdmanis et al. 2017).

In the decade from 1999 to 2009,Footnote 13 SFA was among the most popular methods in the efficiency analysis of Australian hospitals.Footnote 14 It was widely applied in studies of public hospitals (e.g., Victoria, 1994/1995, Yong and Harris, 1999; New South Wales, 1995/1996, Paul, 2002), public acute hospitals (e.g., nationwide, 1996 to 2006, Gabbitas and Jeffs, 2009; New South Wales, 1997/1998, Wang et al. 2006), and both public and private hospitals (e.g., nationwide, 2006/2007, Productivity Commission, 2009). The results indicated an overall efficiency level of approximately 75% (i.e., 25% inefficiency) for public hospitals in New South Wales in 1995/1996, a 10% potential improvement in productivity for nationwide acute hospitals from 1996 to 2006, and 80% mean technical efficiency for Australian hospitals in 2006/2007. From a cost perspective, SFA models revealed 3% mean cost inefficiency in Victoria in 1994/1995 and about 9% inefficiency in total cost in New South Wales in 1997/1998. During a similar period, research on hospitals in the UK reached conclusions with a more diverse set of approaches. In an analysis of acute hospitals in Scotland (1991 to 1994), DEA models under both the CRS and VRS assumptions were estimated, reporting average production efficiency scores of around 85% to 90% in different financial years, where DEA-VRS generally yielded a higher level of efficiency (Parkin and Hollingsworth, 1997). In a comparison of DEA (under the VRS assumption) and SFA, the benchmark NHS hospitals (1995/1996) were analyzed with both methods using several specifications. The mean level of technical efficiency estimated with the DEA models ranged from around 65% to 94%, while the SFA models estimated a mean efficiency level of around 85% (Jacobs, 2001).

In the last decade, more studies of Australian hospitals applied DEA and DEA-related methods, such as the analysis of admitted episodes between 2003 and 2005 in Victorian hospitals (Chua et al. 2011). Following the two-stage semi-parametric approach (bootstrapped bias-corrected DEA and truncated regression) proposed by Simar and Wilson (2007), the researchers obtained input-oriented technical efficiency scores of around 72% to 82% on average. In research on acute hospitals in Canada (Ontario, 2003 and 2006), the two-stage double-bootstrap DEA-CRS approach was applied to evaluate the main factors associated with output-oriented technical efficiency, where an average efficiency score between 70% and 75% was estimated for most types of hospitals, and a score of 90% was obtained for the teaching hospitals (Chowdhury and Zelenyuk, 2016). Meanwhile, a similar two-stage algorithm with DEA under the VRS assumption was deployed by Wang et al. (2018) for Canadian acute hospitals in 2012/2013, which reported an average technical efficiency level of about 75%.

In more recent studies of Australian hospitals (Queensland, FY 2016/2017), Nguyen and Zelenyuk (2021b) employed DEA and Free Disposal Hull (FDH) estimators, as well as partial quantile frontier methods, to estimate individual and aggregate efficiency. In particular, the estimated efficiency levels were aggregated by local Hospital and Health Services (HHSs) by averaging individual efficiency scores weighted by individual output. The aggregate efficiency from the FDH estimators was within a range of 82% to 100%, while the estimates of DEA under the VRS and CRS assumptions ranged from 57% to 99% and from 48% to 84%, respectively. In comparison, the results of the partial-quantile frontier model were between 82% and 136%, where scores above 100% indicate the phenomenon of so-called ‘super-efficiency’ observed for some hospitals, some of which can be considered outliers relative to the rest of the sample (Nguyen and Zelenyuk, 2021b). In another study, the authors also utilized the bootstrapping method proposed by Simar and Zelenyuk (2007) and the central limit theorem (CLT) method of Simar and Zelenyuk (2018) to aggregate the DEA (VRS or CRS) estimates of the efficiency of Queensland hospitals (Nguyen and Zelenyuk, 2021a). Under the VRS assumption, the aggregate efficiency, bias-corrected with the generalized jackknife estimator proposed by Kneip et al. (2015); Simar and Zelenyuk (2018), was about 83% and 80% using the bootstrapping method and the CLT method, respectively. Under the CRS assumption, the bias-corrected aggregate efficiency from the two aggregation approaches was around 50% and 46%, respectively. Further discussion regarding the influence of the returns to scale assumptions on the efficiency estimates is presented in the next section.

5.2 Determinants of inefficiency

An important issue for further analysis is to identify the factors within a hospital that determine, impact, or are associated with the efficiency scores. The size of a hospital is one of the most frequently investigated factors in the studies of Australian hospitals, where larger hospitals were typically found to be more efficient than smaller ones (e.g., Bogomolov et al. 2017; Cheng et al. 2020; Chua et al. 2010; Nguyen and Zelenyuk, 2021b; Paul, 2002; Wang and Zelenyuk, 2024). An alternative perspective from a stochastic frontier cost model, however, indicates that smaller hospitals tend to be more labor-intensive and perform better in terms of scale economies (Wang et al. 2006). In research on Northern Ireland hospitals, the decomposed MPI using DEA (with VRS and CRS) suggested that smaller hospitals, starting from a lower base, tended to achieve greater productivity improvements from 1986 to 1992 (McCallion et al. 2000).

Chua et al. (2010) deployed a two-stage regression on Victorian hospitals and found that teaching hospitals demonstrated higher efficiency than non-teaching hospitals, which was also related to their larger size. For hospitals in Queensland, Nguyen and Zelenyuk (2021a) aggregated the DEA-estimated efficiency with the bootstrap and the CLT approaches, where the superior efficiency of teaching hospitals was observed under the VRS assumption. In contrast, under the CRS assumption, the aggregate efficiency of non-teaching hospitals was significantly higher. Meanwhile, a greater teaching role usually leads to higher costs for a hospital (Yong and Harris, 1999), and, according to an SFA-based study, a higher level of education of healthcare professionals (especially doctors and nurses, as well as managers at all levels) was found to contribute to the technical inefficiency of the hospital (Paul, 2002). Nevertheless, it is also important to acknowledge that more highly educated employees (especially medical staff and nurses) may produce better quality outcomes (i.e., improved healthcare, with lower mortality and readmission rates, etc.), even if technical efficiency is comparatively lower due to the provision of fewer services.

The location of a hospital is another significant factor associated with efficiency (Chowdhury and Zelenyuk, 2016; Lavers and Whynes, 1978). The relatively low efficiency of certain hospitals in Queensland, Australia, as demonstrated by DEA (CRS and VRS), FDH, and partial-quantile frontier estimators, was found to be partially explained by their remoteness (Nguyen and Zelenyuk, 2021b). Meanwhile, using the SFA model of Battese and Coelli (1995), no discernible relationship was observed between the location and the efficiency level of hospitals in New South Wales, Australia (Paul, 2002). In the context of Hong Kong, public hospitals in more affluent districts showed a lower level of efficiency in a DEA studyFootnote 15 and the corresponding regressions, which may also reflect the preference of people in better economic conditions for services from private hospitals (Guo et al. 2017). Hospitals in remote areas are important for ensuring equitable access to healthcare services for the local population, who are more likely to be under-insured, in poorer health, and facing higher expenses. Rural hospitals also foster community cohesion and contribute to the local economy. Beyond technical efficiency, these are also important factors for policymakers to consider; see Valdmanis et al. (2024) for more detailed discussion.

Some other factors or indicators have also been found to be significantly associated with hospital efficiency. For example, a longer LOS can negatively impact the efficiency of a hospital, as suggested by a regression tree model in Ali et al. (2019), the two-stage DEA (under the VRS assumption) in Andrews (2020b), and the SFA model with determinants (i.e., the Kumbhakar et al. (1991) model) in Wang and Zelenyuk (2024). A lower occupancy rate was associated with a higher level of efficiency in Paul (2002) and Wang and Zelenyuk (2024), while the occupancy rate was inversely related to the efficiency estimated by SFA models with different specifications in Yong and Harris (1999). Regarding the effect of case-mix adjustment, the MPIs (computed with DEA) and their decompositions differed significantly between specifications with and without case-mix weighted output (Chowdhury et al. 2014).Footnote 16

The type of hospital was also considered influential in some cases. In a comparative study of the operation and performance of public and private hospitals in Australia by the Productivity Commission (2009) and its supplement (Productivity Commission, 2010), the authors, inter alia, concluded that private hospitals showed higher partial productivityFootnote 17 in admitted-patient care than public hospitals through several specifications of the classic SFA model (Aigner et al. 1977; Meeusen and van den Broeck, 1977). Importantly, they also concluded that public hospitals provided a higher proportion of non-admitted patient care (including outpatient and emergency department visits) in the sample while private hospitals preferred to treat the least morbid patients (Productivity Commission, 2009). As for technical efficiency, the public contract hospitalsFootnote 18 performed the most efficiently in both the output-oriented and input-oriented conventional SFA models (Productivity Commission, 2010).

It is worth noting the potential caveats inherent in the analysis of the determinants of inefficiency. For example, the widely applied two-stage DEA proposed by Simar and Wilson (2007) relies on certain assumptions. Among others, the ‘separability’ assumption requires that the environmental variables only influence the inefficiency scores, but not the frontier, which may not always be supported by the data (Simar and Wilson, 2011). The ‘separability’ assumption is satisfied for the unconditional (or marginal) frontier by construction, yet such a frontier may or may not be the relevant reference for some observations in the sample. If the sample size is large enough for accurate inference, one may conduct statistical tests in advance to ascertain the validity of the ‘separability’ assumption.Footnote 19

Moreover, it is important to acknowledge the observed lack of application of causal inference methods in explaining the inefficiency estimates, whether within DEA, SFA, or other approaches (with or without the “separability” assumption). For example, even though it is typically impossible to conduct randomized controlled trials, methods such as the regression discontinuity design (RDD) (Angrist and Lavy, 1999; Black, 1999), difference in differences (DID) (Card, 1990; Card and Krueger, 1994), and synthetic controls (Abadie and Gardeazabal, 2003), among others, could be adapted to a two-stage DEA framework to further explain changes in the efficiency level.Footnote 20

The adaptation of advanced machine learning methods also exhibits considerable potential for further research within our reviewed topic. For example, the least absolute shrinkage and selection operator (LASSO), introduced in Tibshirani (1996), can be combined with DEA (Chen et al. 2021; Lee and Cai, 2020) to enhance performance in high-dimensional scenarios. Regression or classification using random forests (RF) (Breiman, 1996a, b; Breiman et al. 1984), artificial neural networks (ANN) (Rosenblatt, 1958; Rumelhart et al. 1986), and other machine learning algorithms can also be adapted to explain the efficiency estimates of DMUs.Footnote 21
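As one possible illustration, the sketch below (in Python, with simulated data and hypothetical variable names) regresses first-stage efficiency estimates on a large set of candidate hospital characteristics, letting LASSO select the relevant variables and a random forest capture potential non-linear effects; this is only one of many ways such learners could be embedded in an efficiency analysis.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)

# Illustrative data: 400 hospitals, 30 candidate explanatory variables (most irrelevant),
# with 'eff' standing in for first-stage efficiency estimates from DEA or SFA.
n, p = 400, 30
X = rng.normal(size=(n, p))
eff = 0.75 + 0.05 * X[:, 0] - 0.04 * X[:, 1] + 0.03 * np.tanh(X[:, 2]) + 0.03 * rng.normal(size=n)
X_tr, X_te, y_tr, y_te = train_test_split(X, eff, test_size=0.25, random_state=0)

# LASSO with a cross-validated penalty: a sparse, interpretable selection of determinants.
lasso = LassoCV(cv=5, random_state=0).fit(X_tr, y_tr)
print("LASSO-selected variables:", np.flatnonzero(lasso.coef_),
      "test R2:", round(lasso.score(X_te, y_te), 3))

# Random forest: a flexible non-linear fit with a variable-importance ranking.
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_tr, y_tr)
print("RF top-5 variables by importance:", np.argsort(rf.feature_importances_)[::-1][:5],
      "test R2:", round(rf.score(X_te, y_te), 3))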

5.3 Policy implications

The objective of this subsection is to briefly summarize the key policy implications that we found in the studies mentioned above regarding Australia and its peers. For Australia, while many questions were tackled and various policy implications were drawn explicitly or implicitly in different studies, the primary policy implication appears to concern the size of hospitals. For example, some studies advised considering the merger of small hospitals that are situated close to one another in Victoria, Australia, in order to leverage the substantial advantage of large hospitals in both the level and the growth rate of TFP (Cheng et al. 2020). Similar recommendations were reached for hospitals in Northern Ireland, the UK (McCallion et al. 2000), and in Ontario, Canada (Chowdhury et al. 2011), where concentrating services in large hospitals was suggested to contribute to the improvement of technical efficiency.

In addition, small hospitals are usually located in remote areas, whereas teaching hospitals are typically larger and situated in urban areas. However, when considering merging small hospitals in remote areas of Australia, it is also necessary to emphasize the crucial contribution of rural hospitals in fulfilling special clinical functions and guaranteeing equitable access to healthcare services for local residents (Bogomolov et al. 2017; Cheng et al. 2020; Nguyen and Zelenyuk, 2021b). Another concern was raised regarding the consolidation of small hospitals in Canada: although technical efficiency improved, the decline in scale efficiency indicated that larger facilities face greater difficulties in managing and coordinating resources to absorb technical advancements, due to decreasing returns to scale (Chowdhury et al. 2011). Alternatively, for Australia, a possible policy implication could be to promote remote healthcare delivery models, such as Telehealth (Nguyen and Zelenyuk, 2021b), or to provide bonded medical training programs (Nguyen and O’Donnell, 2023). Similarly, in the context of New Zealand, video conferencing is also believed to play a vital role in addressing staff shortages (Andrews, 2020a).

For the Canadian healthcare system, it was concluded that the process of discharging patients to post-acute care is a key driver of efficiency for large and medium-sized non-teaching hospitals. Accordingly, it was suggested that provinces increase the capacity of post-acute care and improve the smoothness of transition procedures (Wang et al. 2018). Meanwhile, a reassessment of the funding policy in New Zealand was recommended to remove the incentives for secondary care facilities to keep patients longer than required, which in turn increases the length of stay and the level of inefficiency (Andrews, 2020a).

In Scotland, UK, it was also found that technical change in a given period exhibited a negative impact on the change in the subsequent period, which may be due to the time required for hospitals to assimilate novel treatment technologies. Consequently, another policy implication was to moderate the escalation of the same type of input in consecutive periods to prevent the over-concentration of investments and constrain the increase in costs (Valdmanis et al. 2017).

The findings for countries with similar healthcare systems suggest feasible pathways for further investigation of Australian hospital efficiency and potential reform strategies. These include exploring the incentive mechanisms and the influence of funding policies on the efficiency of healthcare facilities, the role of post-acute care in improving hospital efficiency, and the potentially decreasing returns to scale resulting from the consolidation of hospital functions.

To conclude this section, it is worth reiterating that the aforementioned policy implications were derived (explicitly or implicitly) from particular approaches/models that rest on certain assumptions (e.g., orientation, choice of variables and of reference, etc.), some of which might be critical for the conclusions reached. Performing extensive robustness analyses of such conclusions with respect to the various assumptions is paramount in such circumstances before deriving policy implications.Footnote 22

Table 6 Selected policy implications from reviewed studies

Overall, the reviewed articles utilized a spectrum of approaches, predominantly DEA, SFA, their variants, and various productivity indices, to gauge the efficiency levels of local hospitals and their changes over time. Further analysis, e.g., regression (OLS, bootstrap truncated regression, etc.) and decomposition of productivity indices, among others, was deployed in many studies to investigate the possible determinants of the estimated efficiency. A myriad of hospital features and exogenous factors were taken into account, and some determinants were found to have similar effects across different countries. For example, hospitals that are small in size, located in remote areas, or without teaching functions often obtained a lower level of technical efficiency. Ultimately, various policy recommendations were proposed based on the efficiency levels, the identified determinants, and the regional sociocultural situations. Interestingly, divergent implications stemming from different considerations were sometimes suggested even for the same or similar findings. Last, yet not least, besides technical efficiency and cost efficiency, the accessibility and quality of healthcare are also very important to take into account when making policy decisions, yet they sometimes appear to be overlooked in the literature.Footnote 23

6 Reflections on the interpretations of efficiency results

Before concluding the paper, a closer look at how to compare and interpret the different efficiency scores, and the related caveats, is in order. First of all, it is worth recalling that the DEA estimate of efficiency under the VRS assumption can be (and often is) higher than that estimated under the CRS assumption. As a hypothetical example, consider five hospitals {A, B, C, D, E} in a one-input-one-output scenario, and suppose the frontiers estimated by DEA from a large sample with CRS and VRS are depicted in Fig. 12 with the solid and dashed lines, respectively. Under the VRS assumption, hospital C lies on the estimated frontier, and hence its estimated efficiency level would be higher than that of hospital B. On the other hand, while hospital C is technically efficient under VRS, it is much less efficient than hospital B if the CRS assumption is employed.

Fig. 12 Illustration of assumptions of returns to scale

Furthermore, consider hospital D: it is efficient under VRS for the output-oriented approach, yet very inefficient under VRS for the input-oriented approach. Similarly, hospital E is efficient under VRS with the input-oriented approach, yet very inefficient under VRS for the output-oriented approach. It is worth noting that such ambiguity does not happen under CRS: theoretically, input and output orientations are equivalent with respect to any CRS technology.
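To make these comparisons concrete, the following is a minimal sketch, in Python, of input- and output-oriented DEA under both CRS and VRS, written as standard linear programs and solved with a generic solver; the one-input-one-output data are purely illustrative and are not the coordinates of the hospitals in Fig. 12. For the output orientation the sketch returns the maximal proportional expansion of outputs (≥ 1), whose reciprocal is the Farrell output efficiency; under CRS the input- and output-oriented scores are reciprocals of each other, consistent with the equivalence noted above.

import numpy as np
from scipy.optimize import linprog

def dea_efficiency(X, Y, rts="crs", orientation="input"):
    # X: (n, p) inputs, Y: (n, q) outputs. Returns, per DMU, theta (<= 1) for the
    # input orientation or eta (>= 1, the maximal proportional expansion of outputs)
    # for the output orientation. Decision variables: [theta or eta, lambda_1..lambda_n].
    X, Y = np.atleast_2d(X), np.atleast_2d(Y)
    n, p = X.shape
    q = Y.shape[1]
    A_eq = np.r_[0.0, np.ones(n)].reshape(1, -1) if rts == "vrs" else None
    b_eq = [1.0] if rts == "vrs" else None
    scores = np.empty(n)
    for o in range(n):
        if orientation == "input":
            # min theta  s.t.  sum_j lam_j x_j <= theta x_o,  sum_j lam_j y_j >= y_o
            c = np.r_[1.0, np.zeros(n)]
            A_ub = np.vstack([np.hstack([-X[o][:, None], X.T]),
                              np.hstack([np.zeros((q, 1)), -Y.T])])
            b_ub = np.r_[np.zeros(p), -Y[o]]
        else:
            # max eta  s.t.  sum_j lam_j x_j <= x_o,  sum_j lam_j y_j >= eta y_o
            c = np.r_[-1.0, np.zeros(n)]
            A_ub = np.vstack([np.hstack([np.zeros((p, 1)), X.T]),
                              np.hstack([Y[o][:, None], -Y.T])])
            b_ub = np.r_[X[o], np.zeros(q)]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=[(0, None)] * (n + 1), method="highs")
        scores[o] = res.x[0]
    return scores

# Purely illustrative one-input-one-output data (not the coordinates in Fig. 12).
X = np.array([[2.0], [3.0], [5.0], [8.0], [2.0]])
Y = np.array([[4.0], [5.0], [6.0], [6.5], [1.5]])
for rts in ("crs", "vrs"):
    for orientation in ("input", "output"):
        print(rts.upper(), orientation, np.round(dea_efficiency(X, Y, rts, orientation), 3))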

An important question, therefore, is which reference (CRS, VRS, etc.) to choose, and which orientation is more appropriate to take. By and large, it depends on the research questions.Footnote 24 For example, if the goal is to benchmark efficiency relative to the individual frontier of a DMU, then it would be important to conduct appropriate statistical inference (tests) to identify the most likely true individual technology (whether it is CRS, VRS, or perhaps non-convex). A caveat here, however, is that such tests may require underlying assumptions that are too strong for the tests to be valid. One such assumption is usually that the technology is identical for all the DMUs in the relevant sample, which can be far from the truth and, in fact, can be one reason for some DMUs to look less efficient than others. This reasoning sometimes motivates an alternative approach that allows the frontier to be conditional on some variables. While appealing, a caveat of this approach is that one may potentially find yet another conditional variable (and come up with a nice economic story for it), due to which a DMU of interest may get a very different (e.g., perfect) efficiency score than it would otherwise. Such a hunt for “a significant” conditional variable (which may depend on the researcher’s imagination and persistence) may lead to many or even all DMUs looking perfectly efficient, being near or on their individual frontiers. In other words, allowing individual frontiers to depend on conditional variables (whether in DEA, FDH, SFA, or other methods), while it sounds appealing, might be likened to opening a ‘Pandora’s box’ that may defeat the meaningfulness of an efficiency analysis. Moreover, the estimated efficiency scores in such an approach may not be comparable if they are measured with respect to different conditional frontiers.Footnote 25

Another related caveat in this framework is that, often, the currently used approaches that allow for conditional variables in frontier analysis are not based on causal inference. For example, they may not account for possible endogeneity or reverse causality, or they might identify spurious relationships between the conditional variables and the dependent variable, which is also important to keep in mind when deriving policy implications.Footnote 26

An alternative look at the problem is to admit that each firm may indeed have its own technology, which may be unique relative to others and potentially very different from CRS, VRS, or FDH-type technologies. The goal may then not be to estimate those individual technologies (which are hard to identify without overly strong assumptions), but rather to estimate the unconditional best-practice aggregate technology for the observed DMUs, and then measure the individual efficiency relative to it. How would such an aggregate technology look? Interestingly, even if the individual technologies are non-CRS, non-convex, and very different across individuals, the aggregate technology (defined as the summation of the individual technology sets) may still be approximately convex (due to the Shapley-Folkman lemma) and, under fairly mild conditions, also approximately CRS.Footnote 27 Importantly, the use of such a reference is coherent with benchmarking relative to a socially optimal level, i.e., the level of optimal scale or the highest average productivity (e.g., hospital A in Fig. 12).

To illustrate this point vividly, suppose a rigorous statistical test rejects DEA-CRS in favor of DEA-VRS. A researcher then proceeds with DEA-VRS as the preferred approach and reports that hospital C is 100% efficient. A policymaker may rely on such results, conclude that other hospitals should be like the “100%-efficient” hospital C, and approve policies that reward and encourage hospitals like C while discouraging (and perhaps penalizing) others, including hospital B. However, a closer look at the matter reveals that the amount of inputs that hospital C is actually using could provide about twice as much output if two hospitals like B (even though inefficient relative to both the CRS and VRS frontiers) were deployed instead of a “100%-efficient” hospital like C! How important is such a difference in the interpretation of efficiency results? Ultimately, producing twice as much output may lead to a substantial reduction in the pain and suffering of many more patients, or it may simply mean saving many more lives of patients in urgent need, who would receive health services due to the greater output from the same inputs.

Now, suppose a researcher is able to obtain two more data points, hospitals F and G in Fig. 12, which were not available before. If the researcher re-estimates DEA with the updated data, the DEA-VRS results change substantially (e.g., the output-oriented technical efficiency of hospitals C and D decreases from 100% to 92% and 80%, respectively), while the DEA-CRS results do not. Furthermore, suppose another data point, hospital H in Fig. 12, becomes available. In this case, the results also change for DEA-CRS, yet much less than those for DEA-VRS. This illustrates the relatively higher robustness (lower sensitivity) of DEA-CRS compared with DEA-VRS, which is often the case, although the opposite can happen too, emphasizing the importance of robustness checks and sensitivity analysis in practice.

Finally, it is also worth remembering that while the DEA and FDH estimators are (under certain assumptions) consistent, they are biased. The bias converges to zero asymptotically, yet the speed of convergence decreases with the dimension of the DEA model (and depends on the assumptions about returns to scale and convexity), which is the price one pays for going nonparametric. It is therefore desirable (yet so far appears to be rare in the analyzed literature) to correct the bias, for both individual and aggregate efficiency scores, as well as to present standard errors or estimated confidence intervals.Footnote 28 The parametric SFA approaches avoid some of these problems, yet at the expense of potentially high estimation errors due to misspecification of the imposed parametric assumptions.
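As a rough illustration of one possible route to bias correction, the sketch below (in Python) applies an m-out-of-n subsampling correction to CRS efficiency scores in a simulated one-input-one-output example; the subsample size, the number of replications, and especially the convergence-rate exponent used for the rescaling are assumptions made purely for illustration, and in practice one should follow the dedicated inference procedures developed in the literature.

import numpy as np

rng = np.random.default_rng(4)

# Simulated one-input-one-output example: the true frontier is y = 2x (CRS),
# so the true input-oriented efficiency equals (y/x)/2. Data are illustrative only.
n = 200
x = rng.uniform(1.0, 10.0, n)
true_eff = rng.beta(4.0, 2.0, n)
y = 2.0 * x * true_eff

prod = y / x
theta_hat = prod / prod.max()    # CRS-DEA with one input and one output: ratio to best observed productivity

# m-out-of-n subsampling bias correction (rough sketch). kappa is the assumed
# convergence-rate exponent of the estimator (taken here as 2/(p+q) = 1 for the
# CRS case with p = q = 1 -- treat this as an assumption to verify for your setting).
m, B, kappa = 50, 1000, 1.0
bias = np.zeros(n)
for b in range(B):
    idx = rng.choice(n, size=m, replace=False)     # subsample without replacement
    bias += prod / prod[idx].max() - theta_hat     # scores against the subsample frontier
bias *= (m / n) ** kappa / B                       # rescale the subsample bias to the full sample
theta_bc = theta_hat - bias                        # bias-corrected scores

print("mean estimate:", round(theta_hat.mean(), 3),
      "mean bias-corrected:", round(theta_bc.mean(), 3),
      "mean true efficiency:", round(true_eff.mean(), 3))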

All in all, this hypothetical example illustrates that the choice of a reference relative to which one shall benchmark efficiency should not be just about running a statistical test of whether VRS or CRS or FDH or any other frontier fits the data tighter. Rather, it should be a combination of positive and normative judgments that is incorporated into various modeling assumptions (orientation, variables, types of measures, etc.), some of which are chosen to be tested or checked for sensitivity/robustness of results, while others are chosen to stay as maintained assumptions. What is paramount is to make the judgment choices coherent with policy goals (e.g., striving for social optimum or individual profits?) and transparent to the readers.Footnote 29 Of course, we acknowledge and respect that there could be different views on the mentioned issues and we welcome an open discussion on them.

7 Concluding remarks

In this article, we systematically reviewed the published research about hospital-level efficiency with a focus on Australia and its peers, the UK, Canada, New Zealand, and Hong Kong from 1970 to 2023. In particular, we conducted a series of Boolean searches in Scopus to select a pool of published journal articles and then manually reviewed the selected articles to construct a pool for each country and region. Prior to a detailed review of the selected research, bibliometric analytic techniques were deployed to explore the field.

For each country, we identified distinct groups of productive authors, as depicted in the Sankey plots. As illustrated by the word clouds, a noticeable divergence was revealed in the most prominent keywords for each country and region, for instance, “cost benefit analysis” and “length of stay” in Australian research, “DEA” in Canadian studies, and “public health” in New Zealand studies. According to the analysis of local and global citations, the European Journal of Operational Research and Health Economics are the sources most cited by researchers in all the regions of interest. Regarding the most cited authors, researchers in Australia, New Zealand, and Canada show comparable preferences for the globally leading researchers (mostly in the US) and the locally prominent authors of efficiency analysis in the healthcare sector. In contrast, UK researchers tend to focus more on domestic studies than on global research. We also employed both co-occurrence network analysis and dynamic word cloud analysis and identified a shift in popular methods and research topics. We found that the most frequently applied methods in the reviewed regions changed from “cost benefit analysis” in the early years to “bootstrap” and “MPI” in the middle period, and ultimately to “efficiency frontier estimation”, “factor analysis”, and “total factor productivity” in more recent years. Moreover, we found that DEA has been applied over an extensive period, as reflected by its inclusion as a keyword in some highly cited journal articles. We also pointed out some caveats in the interpretation of results and policy implications from efficiency studies, which may depend on the assumptions and methods chosen by the researchers. Hence, we emphasized the importance of carefully choosing the assumptions of the models, which must be coherent with the goals of measurement and transparently explained to the readers, and then carefully interpreting the obtained results and the corresponding policy implications. We also pointed out a relative lack of adaptation and development of causal inference methods and modern machine learning techniques in the context of efficiency and productivity analysis; in our view, such endeavors promise to pave a fruitful avenue for future research. We also advocate the wider use of Bayesian methodologies within the realm of hospital efficiency analysis, with a particular emphasis on addressing inherent challenges such as endogeneity and the stringent assumptions commonly faced by the prevalent DEA and SFA estimators. For example, a recent Bayesian-type solution advanced by Tsionas et al. (2023) consolidates the benefits of DEA and SFA, providing considerable flexibility in both functional forms and distributional assumptions. Based on Bayesian artificial neural networks, the approach can also effectively tackle issues such as endogeneity and determinants of inefficiency. Table A1 provides a summary and conclusions of the reviewed studies.

One limitation of our review is that the quantity of the target research in the paper pool is still relatively small for more advanced machine learning analysis and visualization, although it is relatively large for manual analysis. In contrast to the large and increasing number of publications regarding the efficiency analysis of medical techniques and therapies, the efficiency analysis of hospitals is relatively sparse. Another constraint is that the gray literature, though potentially presenting some fruitful conclusions (albeit possibly of lower quality than articles published in refereed journals), is not included in the deeper analysis due to a lack of bibliometric information. Moreover, a possible improvement for future studies could be to examine citations in greater depth when conducting the co-occurrence analysis of keywords or investigating the reasons for high citation rates. All in all, we hope this study will provide a valuable stepping stone for studies and practices aiming to improve the performance of hospitals in Australia and its peers, as well as other countries.