Background

Four decades after its discovery, human immunodeficiency virus (HIV) continues to impact millions of people worldwide, remains one of the leading causes of morbidity and mortality globally [1, 2] and incurs billions of dollars annually in direct health care costs and indirect socioeconomic costs [3]. In sub-Saharan Africa (SSA) in 2019, an estimated 26 million people were living with HIV [2]. In recent years, international bodies have set goals to end the HIV epidemic: in 2014, the Joint United Nations Programme on HIV/AIDS (UNAIDS) introduced the “95-95-95” targets—that by 2030, 95% of people living with HIV globally would know their status, 95% of all people with diagnosed HIV infection would receive sustained antiretroviral therapy, and 95% of people living with HIV receiving antiretroviral therapy (ART) would be virally suppressed [4, 5]. The United Nations Sustainable Development Goals also call for an end to the AIDS epidemic by 2030 [6]. Unfortunately, despite a significant increase in ART coverage over the last 20 years and major progress in terms of reductions in HIV incidence and mortality [1], the latest estimates and projections indicate that the world is not on track to meet these goals [2, 7, 8], and progress may stall further as a consequence of the COVID-19 pandemic [9].

Differences in HIV prevalence both within and between nations in SSA have been well-documented [10,11,12,13,14], as have differences between sexes [2, 12,13,14] and age groups [2]. These differences have also changed over time [1, 10], impacted in part by the onset, duration, location, and demographic targeting of different prevention and treatment interventions [15,16,17]. Epidemiologically targeted interventions are understood to be more effective compared to homogeneous interventions [18] and are increasingly important at a time when the future of funding for HIV prevention and treatment is both uncertain and highly variable [19, 20], particularly in the wake of disruptions related to the COVID-19 pandemic [21]. Evidence suggests that interventions are most effective when tailored to account for differences in the intensity of the epidemic by geographic location [14, 22], sex [23], and age [24]. Locally and demographically precise HIV prevalence information, however, is necessary in order to maximize the benefit of such methods; at present, such information in SSA is lacking.

HIV prevalence estimates stratified by age and sex are available at the national level through the Global Burden of Disease (GBD) [2] and from UNAIDS [25]. Both sources also provide subnational estimates at the first administrative level (e.g., province, state) in select countries. Recently, Dwyer-Lindgren et al. [10] presented aggregated adult HIV prevalence estimates for the years 2000–2017 at local scales in SSA, generalizing estimates for males and females combined, and across ages 15–49 years. Some studies have gone further to present subnational prevalence estimates separated by sex [26,27,28,29] or age [30]; however, these studies focused on single countries, and/or presented estimates for only one point in time, without describing any temporal trajectories in prevalence. To our knowledge, no previous studies have presented age- and sex-specific HIV prevalence estimates across SSA at local scales over time.

We built upon the HIV prevalence model from Dwyer-Lindgren et al. [10] to produce HIV prevalence estimates for 43 countries in SSA for males and females ages 15–59 years, stratified into nine 5-year age groups, for the years spanning 2000 to 2018. Countries, age groups, and time period were selected according to data availability. We expanded upon existing Bayesian spatiotemporal methods to model these estimates at a 5 × 5-km resolution and present them here aggregated to the second administrative level (which varies by country but typically corresponds to districts or municipalities), the level generally considered most relevant to policymakers and stakeholders. Prevalence estimates for all demographic groups at all levels of geographic aggregation, as well as number of people living with HIV (count estimates), are publicly available from the Global Health Data Exchange (https://ghdx.healthdata.org/record/ihme-data/sub-saharan-africa-hiv-prevalence-geospatial-estimates-2000-2018) and through a user-friendly data visualization tool (http://vizhub.healthdata.org/lbd/hiv-prev-disagg).

Methods

Overview

This ecological study follows the Guidelines for Accurate and Transparent Health Estimates Reporting (GATHER) [31] (Additional file 1: Section 1). This analysis relies on secondary data sources to provide estimates of HIV prevalence on a 5 × 5-km grid in 43 countries in SSA for males and females ages 15–59 years residing at each location, stratified into five-year age bins (i.e., ages 15–19, 20–24, 25–29, 30–34, 35–39, 40–44, 45–49, 50–54, 55–59), with annual resolution from year 2000 to 2018 inclusive, calibrated to national estimates from the GBD [2]. The period of 2000–2018 and the age range of 15–59 years were selected to optimize the contemporaneousness of the estimates and to account for data availability: there were relatively few large-scale seroprevalence surveys conducted before 2000, and most seroprevalence surveys focus on adults, with little reporting outside the 15–59 years age range. We produced estimates for sex rather than gender binaries because sex is more predominantly reported in the available data sources. Due to data availability limitations, we were unable to produce prevalence estimates for sex minority individuals outside the male/female binary. The 43 countries analyzed were also selected according to data availability; Mauritania was excluded as there were no HIV prevalence data available. We included six countries (Djibouti, Guinea-Bissau, Madagascar, Somalia, South Sudan, and Sudan) where no seroprevalence survey data were available, but where sentinel surveillance data collected from antenatal care clinic (ANC) attendees (described below) were available. The implications of these and other limitations are expanded upon in the “Methodological advantages and limitations” section in the “Discussion” section.

The methodology used here largely parallels that previously used to map adult HIV prevalence in SSA [10], with the incorporation of modifications necessary to model by age and sex, and improvements related to the inclusion of spatially aggregated data and ANC data (Fig. 1). We used a 5 × 5-km grid for consistency with this previous analysis; to align with the resolution available for pre-existing covariates incorporated in this analysis; and for flexibility in aggregating these estimates to other levels of interest (e.g., first- and second-level administrative subdivisions, such as states or districts, respectively, or more aggregated age groups such as reproductive ages [commonly 15–49]) using grid-cell-level estimates of age- and sex-specific population from WorldPop [32]. These population estimates were also used to estimate the number of people living with HIV in each demographic group. All analyses were conducted in R version 3.6.1 [33]. Figure 2 provides an overview of the analytic process, described in more depth below. Additional details are available in Additional file 1.

Fig. 1

HIV prevalence data by region and country. a HIV seroprevalence survey data and b ANC sentinel surveillance data used in this analysis, by region and country. Color indicates the data source. AIS, AIDS Indicator Survey; DHS, Demographic and Health Survey; MICS, Multiple Indicator Cluster Survey; PHIA, Population-based HIV Impact Assessment Survey. Shape type indicates whether a data source is age-specific and has point (GPS) or polygon location information. Size indicates the relative effective sample size for each source. A full list of data sources with additional details about data type (such as survey microdata and survey reports) and geographical details are provided in Additional file 2: Tables S1-S5

Fig. 2

Analytical process overview. The process used to produce age- and sex-specific HIV prevalence estimates in sub-Saharan Africa involved three main parts. In the data-processing steps (green), data were identified, extracted, and prepared for use in the HIV prevalence model and in covariate models. In the modeling phase (orange), we used these data and covariates in a stacked generalization ensemble model and spatiotemporal Gaussian process model. In the post-processing phase (blue), we calibrated the prevalence estimation to match GBD 2019 estimates at the national level, aggregated prevalence estimates to the first- and second-level administrative subdivisions in each country, and calculated the number of people living with HIV (PLHIV)

Data

HIV data

We compiled a geolocated dataset of 304,672 observations from 91 seroprevalence surveys from 37 countries and 10,351 observations from sentinel surveillance among antenatal care clinic attendees (ANC data) in 43 countries (Additional file 2: Tables S1-S2; Fig. 1). Data from seroprevalence surveys were originally in the form of survey microdata (that is, individual-level survey responses) or survey reports (Additional file 2: Table S1). For surveys with available microdata, we extracted variables related to age, sex, HIV blood test result, location, and year, as well as survey weights, where available. We excluded rows with missing information on any of these variables, and subset the data to ages 15–59 years. For data coded by gender rather than sex, we treated these data as if they were sex-specific rather than gender-specific. We recognize that sex and gender are not interchangeable: sex is a biological variable, while gender is a fluid social construct. In the absence of quality data, however, we could not disaggregate estimates by gender at this time. After subsetting by age, we collapsed the age-specific data into 5-year age bins (hereafter referred to as “ages”) by sex. We did this by calculating the weighted age- and sex-specific HIV prevalence at the finest spatial resolution available. Ideally, this was at the level of global positioning system (GPS) coordinates that represent the location of a survey cluster. In most surveys, GPS coordinates are randomly displaced (typically by 2–5 km depending on the setting and the survey series [34]) in order to protect respondents’ confidentiality. In instances where GPS coordinates were not available, the smallest possible areal unit (termed a “polygon”) was used instead. These polygons typically represented an administrative subdivision.
For surveys without microdata but for which estimates with some subnational resolution were provided in a report, we extracted these estimates with information about the sample size and location. GPS coordinates were not available for these reports, so these data were exclusively matched to polygons. In most reports, age ranges larger than 5 years were reported. Among these, we retained data reported for age ranges that corresponded exactly to one or more of the 5-year age bins used in this model; for example, we included surveys covering age ranges 15–49 years, or 15–24 years, but excluded those covering age ranges such as 18–24 years. For age-aggregated data, we retained information regarding the age range covered, to be used in our modeling process as described below. We also only included sex-specific data. For more information on excluded surveys see Additional file 2: Table S3.

Data that were spatially aggregated (i.e., polygon data) and/or age-aggregated required additional processing. Although we ultimately modeled HIV prevalence at the level of the observation, be it point or polygon, age-specific or age-aggregated, our modeling process initially specified HIV prevalence at the point-, time-, age-, and sex-specific level. Because of this, it was necessary that we disaggregate the age-aggregated and polygon survey data to be location- and age-specific. We did this by distributing polygon data to pixels proportional to population. Specifically, for each polygon, we generated points at the centroid of each 5 × 5-km pixel falling within that polygon and replicated that observation’s HIV prevalence and sample size at the location of each of those centroids. Age-aggregated point data were similarly disaggregated by replicating the HIV prevalence and sample size once for each 5-year age group covered in the overall age range. In the cases of age-aggregated polygon data, these two processes were combined. Next, each of the disaggregated, location- and age-specific rows of data associated with a given aggregated observation were assigned weights proportional to the age- and sex-specific population residing at that location for the given year, derived from WorldPop [32]. Weights per observation all summed to one. This process substantially increased the size of the dataset. To reduce the associated computational burden when fitting the model, in cases where at least one row within an observation was given a weight of less than half of one divided by the number of locations and/or ages in that observation, we successively dropped the lowest-weighted locations and/or ages until reaching a maximum of 1% of the observation’s weight dropped. Remaining locations and/or ages within that observation were then reweighted to maintain a total weight of one.
Data that were not aggregated (i.e., age-specific point observations) were each assigned a weight of one.
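The weighting and pruning rule described above can be illustrated with a minimal Python sketch. The function name, the dictionary output, and the stopping rule (never letting the cumulative dropped weight exceed 1%) are our reading of the text rather than the authors' implementation, and the population counts in the usage note are invented.

```python
def disaggregation_weights(populations, max_dropped=0.01):
    """Population-proportional weights for the rows of one aggregated observation.

    `populations` holds the age- and sex-specific population at each
    disaggregated row (location and/or age). Returns {row_index: weight}.
    Hypothetical helper; the pruning rule is one reading of the text.
    """
    total = sum(populations)
    weights = {k: pop / total for k, pop in enumerate(populations)}

    # Prune only if some row falls below half of the equal-share weight.
    threshold = 0.5 / len(populations)
    if min(weights.values()) < threshold:
        dropped = 0.0
        # Drop rows from lowest weight upward while staying within the 1% cap.
        for k in sorted(weights, key=weights.get):
            if dropped + weights[k] > max_dropped:
                break
            dropped += weights.pop(k)
        # Rescale the remaining rows so the weights again sum to one.
        kept = sum(weights.values())
        weights = {k: w / kept for k, w in weights.items()}
    return weights
```

For example, `disaggregation_weights([5000, 3000, 1990, 6, 4])` drops the two sparsely populated rows (together 0.1% of the weight) and rescales the remaining three.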

ANC data were primarily derived from national HIV estimate files developed by national teams and compiled and shared via UNAIDS [35] and supplemented with data derived from sentinel surveillance country reports (Additional file 2: Table S2). We extracted information from these sources on HIV prevalence and sample size by site and year. Sites were geolocated to specific GPS coordinates where possible and otherwise to a polygon that represents an administrative subdivision. The ANC data available for this analysis were not age-specific. Because ANC data included only pregnant females, we assumed the age range of these data to be that of females with non-zero fertility rates in SSA according to GBD 2019 [36], that is, females ages 15–54 years. We disaggregated ANC data to the age and location level as we did for age-aggregated or polygon survey data. However, specific locations and ages were weighted by number of births rather than population size. The number of births for a given age and location was estimated as the product of the location-, age-, and sex-specific population, again derived from WorldPop [32], and the national fertility rate, derived from GBD 2019 estimates [36].
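As a minimal sketch of the birth-based weighting just described, the Python function below multiplies population by fertility rate and normalizes the result. The function name and inputs are illustrative assumptions, not the authors' code.

```python
def birth_weights(female_population, fertility_rate):
    """Weights proportional to estimated births for each age/location row.

    Births are approximated as population times fertility rate, following
    the description in the text. Both arguments are equal-length sequences;
    the names are hypothetical.
    """
    births = [pop * rate for pop, rate in zip(female_population, fertility_rate)]
    total = sum(births)
    return [b / total for b in births]
```

With equal populations, a row whose fertility rate is three times higher receives three times the weight.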

Covariates

This analysis included the same covariates as the previous analysis [10]. This included five pre-existing covariates: (1) travel time to the nearest settlement of more than 50,000 inhabitants; (2) total population; (3) night-time lights; (4) urbanicity; and (5) malaria incidence (Additional file 2: Table S4). In addition, eight covariates were constructed explicitly for this analysis owing to their known association with HIV prevalence and data availability: (1) prevalence of male circumcision (all forms); (2) prevalence of self-reported sexually transmitted infection (STI) symptoms; (3) prevalence of marriage or living with a partner as married; (4) prevalence of one’s current partner living elsewhere among females; (5) prevalence of condom use at last sexual encounter; (6) prevalence of reporting ever having had intercourse among young females; and (7) and (8) prevalence of multiple partners in the past year for males and for females, respectively. We updated the covariates constructed for this analysis to incorporate newly available data but utilized the original statistical methods (Additional file 1: Section 3.2; Additional file 2: Table S5; Additional file 3: Figs. S1-S8).

Model and estimation

Covariate stacking

An ensemble covariate modeling approach (“stacking”) was implemented to capture possible nonlinear interactions among the covariates across space and time [37]. In this approach, three sub-models were fitted to the HIV survey data with the covariates as explanatory predictors: generalized additive models [38], boosted regression trees [39], and lasso regression [40]. Each sub-model was fitted using fivefold cross-validation to avoid overfitting, and the out-of-sample predictions from across the five folds were compiled into a single set of predictions that were used to fit the geostatistical model described below. In addition, each sub-model was also fitted to the full dataset to generate a complete set of in-sample predictions that were subsequently used when generating predictions from the geostatistical model (Additional file 3: Figs. S9-S11). Because the covariates used here were neither age-specific nor (for most) sex-specific, we fit these sub-models at the same age- and sex-aggregated level as the HIV-specific covariates, modeling HIV prevalence data aggregated across ages 15–49 and males and females. The age range 15–49 years was used in this case because of its more common usage in seroprevalence surveys compared to the 15–59 years range, allowing us to retain more data for the stacking model. Polygon data were excluded from stacking models due to their incongruity with the configurations needed for the different sub-models. The ANC data were also excluded due to known sampling biases, which are described in Additional file 1: Section 4.2.
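The out-of-sample step in stacking can be sketched as follows. This is a minimal Python illustration: a one-covariate least-squares line stands in for the actual sub-models (generalized additive models, boosted regression trees, and lasso regression), and the fold assignment by row index is an assumption, since the text does not describe how folds were formed.

```python
def fit_line(xs, ys):
    """Least-squares line; a stand-in for a real sub-model (assumption)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: mean_y + slope * (x - mean_x)


def out_of_fold_predictions(xs, ys, n_folds=5):
    """Predict each row from a model fitted without that row's fold."""
    preds = [None] * len(xs)
    for fold in range(n_folds):
        # Rows are assigned to folds by index (illustrative choice).
        train = [k for k in range(len(xs)) if k % n_folds != fold]
        model = fit_line([xs[k] for k in train], [ys[k] for k in train])
        for k in range(len(xs)):
            if k % n_folds == fold:
                preds[k] = model(xs[k])
    return preds
```

The compiled out-of-fold predictions play the role of the covariate values passed to the geostatistical model, while a separate fit to the full dataset supplies the in-sample predictions used at the prediction stage.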

Geostatistical model

This model was fit in Template Model Builder (TMB) [41]. Owing to computational constraints, and to allow for regional differences in the relationships between covariates and HIV prevalence, as well as differences in the temporal, spatial, and demographic autocorrelation in HIV prevalence, separate models were fitted for four regions (Additional file 3: Fig. S12). We modeled HIV prevalence stratified by space, time, age, and sex using a generalized linear mixed-effects model. To simultaneously model point- and polygon-level observations, as well as both age-specific and age-aggregated observations, we specified the data likelihood at the observation level (i), which accommodated all of these. We modeled the number of HIV-positive individuals (Yi) among a sample (Ni) for a given observation as a binomial variable:

$${Y}_i\sim \textrm{Binomial}\left({N}_i,{p}_i\right)$$

Logit-transformed prevalence was, however, first specified at the space-, time-, age-, and sex-disaggregated level (j):

$${\displaystyle \begin{array}{c}\textrm{logit}\left({p}_j\right)={\beta}_0+{\boldsymbol{\beta}}_1{\boldsymbol{X}}_j+{Z}_{1,j}+{Z}_{2,j}+{Z}_{3,c\left[j\right]}\\ {}{Z}_{1,j}\sim \textrm{GP}\left(0,{\varSigma}_{1, space}\otimes {\varSigma}_{1, time}\right)\\ {}\begin{array}{c}{Z}_{2,j}\sim \textrm{GMRF}\left(0,{\varSigma}_{2, time}\otimes {\varSigma}_{2, age}\otimes {\varSigma}_{2, sex}\right)\\ {}{Z}_{3,c\left[j\right]}\sim \textrm{GMRF}\left(0,{\varSigma}_{3,c}\right)\end{array}\end{array}}$$

We specified logit-transformed prevalence at the disaggregated level (pj) as a linear combination of:

  • A regional intercept (β0);

  • Covariates and associated regression parameters (β1Xj);

  • Random effects correlated across space and time, (Z1, j);

  • Random effects correlated across time, age, and sex, (Z2, j);

  • Country-specific (c) random effects correlated across age, (Z3, c[j]).

The random effects capturing correlations between space, time, age, and sex included:

  • Z1, j: a Gaussian process with mean 0 and a covariance matrix given by the Kronecker product of a spatial Matérn covariance function [42] (Σ1, space) and a temporal first-order autoregressive covariance function (Σ1, time);

  • Z2, j: a Gaussian Markov Random Field with mean 0 and a covariance matrix given by the Kronecker product of first-order autoregressive covariance functions for time (Σ2, time), age (Σ2, age), and sex (Σ2, sex);

  • Z3, c[j]: a Gaussian Markov Random Field with mean 0 and a covariance matrix given by country-specific first-order autoregressive covariance functions for age (Σ3, c).
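To make the Kronecker structure of the Z2 covariance concrete, the Python sketch below builds stationary first-order autoregressive (AR1) covariance matrices and combines them. The parameterization (correlation `rho` and innovation variance `sigma2`) and the function names are assumptions for illustration; the fitted model works with sparse precision matrices in TMB rather than dense covariance matrices.

```python
def ar1_covariance(n, rho, sigma2=1.0):
    """Stationary AR1 covariance: sigma2 / (1 - rho^2) * rho^|a - b|."""
    scale = sigma2 / (1.0 - rho ** 2)
    return [[scale * rho ** abs(a - b) for b in range(n)] for a in range(n)]


def kronecker(A, B):
    """Kronecker product of two dense matrices given as lists of rows."""
    return [
        [a * b for a in row_a for b in row_b]
        for row_a in A
        for row_b in B
    ]


# Time (19 years) x age (9 bins) x sex (2) gives a 342 x 342 covariance.
# The correlation values here are arbitrary placeholders.
sigma_time = ar1_covariance(19, rho=0.9)
sigma_age = ar1_covariance(9, rho=0.8)
sigma_sex = ar1_covariance(2, rho=0.5)
sigma_z2 = kronecker(kronecker(sigma_time, sigma_age), sigma_sex)
```

Under this structure, the covariance between any two time-age-sex cells is the product of their time, age, and sex covariances.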

We used the stochastic partial differential equation [43] approach to approximate the continuous spatiotemporal Gaussian random field (Z1, j). Sensitivity analyses were carried out to compare this model configuration to others with differing pj specifications, as well as to several other model and data specifications, and are described in detail in Additional file 1: Section 4.3, Additional file 3: Figs. S13-S15, and the “Discussion” section. We then specified observation-level (i) prevalence:

$${p}_i={\textrm{logit}}^{-1}\left(\textrm{logit}\left(\sum \left({p}_{transformed,j}\cdot {w}_j\right)\right)+\left({\beta}_2+{U}_{s\left[i\right]}\right)\cdot {I}_{ANC}+{\epsilon}_i\right)$$

pi was calculated as the sum of disaggregated prevalence (ptransformed, j) estimates multiplied by their respective population (or in the case of ANC data, birth) weights (wj), plus the incorporation of additional ANC-related transformations and bias corrections (β2, Us[i], and IANC described below), and an observation-level uncorrelated error term (ϵi):

$${\epsilon}_i\sim \textrm{Normal}\left(0,{\sigma}_i^2\right)$$

In cases where data were already disaggregated spatially and by age, wj = 1.

HIV prevalence as measured by sentinel surveillance of ANC clinic attendees is known to be biased as a measure of HIV prevalence in the general adult female population [44], because it only covers pregnant females who attend ANC, compared to all adult females [45, 46]. Additionally, fertility rates differ between HIV+ and HIV- females, with the exact relationship varying by age [47], thereby impacting age-specific ANC clinic visitation rates. To address this, for ANC data we transformed prevalence among pregnant females based on the underlying prevalence among all females and the age-specific fertility-rate ratio (HIV+ fertility/HIV- fertility). For ANC data,

$${p}_{transformed,j}=\frac{\left({p}_j\cdot {FRR}_j\right)}{\left({p}_j\cdot {FRR}_j\right)+1-{p}_j}$$

Fertility rate ratios (FRRj) were derived from GBD 2019 fertility estimates [36], taken at the national level except in cases where subnational estimates were available (in Ethiopia, Nigeria, and South Africa). For survey data,

$${p}_{transformed,j}={p}_j$$
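The fertility-rate-ratio transformation for ANC data is a one-line calculation. The Python sketch below uses a hypothetical function name and invented input values.

```python
def anc_transformed_prevalence(p, frr):
    """Prevalence among pregnant females implied by prevalence `p` among
    all females and fertility rate ratio `frr` (HIV+ / HIV- fertility)."""
    return (p * frr) / (p * frr + 1.0 - p)
```

When `frr` equals 1 the transformation leaves prevalence unchanged. When HIV+ females have lower fertility (`frr` below 1), prevalence among ANC attendees is lower than in the general female population; for instance, `p = 0.2` with `frr = 0.5` gives roughly 0.11.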

To allow for additional ANC-related bias at the observation level (i), in instances where data in our model were derived from ANC sentinel surveillance (where IANC = 1 for ANC data, and IANC = 0 for all other data) our model incorporated a fixed term (β2) that captured overall mean bias in the ANC data, and a random effect (Us[i]) for a given ANC site s that captured spatial differences in the extent of this bias:

$${U}_{s\left[i\right]}\sim \textrm{Normal}\left(0,{\sigma}_{site\left[i\right]}^2\right)$$
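Putting the pieces together, observation-level prevalence can be sketched in Python as below. The argument names are illustrative, and the parameter values would come from the fitted model rather than being supplied by hand.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def observation_prevalence(p_transformed, weights, is_anc=False,
                           beta2=0.0, site_effect=0.0, epsilon=0.0):
    """Observation-level prevalence p_i from disaggregated prevalences.

    p_transformed: disaggregated prevalences p_transformed,j
    weights:       population (or birth) weights w_j, summing to one
    is_anc:        indicator I_ANC
    beta2:         mean ANC bias; site_effect: U_s[i]; epsilon: error term
    """
    aggregated = sum(p * w for p, w in zip(p_transformed, weights))
    anc_term = (beta2 + site_effect) if is_anc else 0.0
    return inv_logit(logit(aggregated) + anc_term + epsilon)
```

For a fully disaggregated survey observation (a single row with weight one and no ANC terms), this reduces to the disaggregated prevalence itself, apart from the error term.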

Fitted model parameters are detailed in Additional file 2: Table S6. From each fitted model, we generated 1000 draws from the approximated joint posterior distribution of all model parameters and used these to construct 1000 draws of pj, setting IANC to 0. Fivefold cross-validation was used to assess model performance and to compare a number of alternative models (Additional file 3: Figs. S13-S15). We also compared the re-aggregated adult-level estimates from our final model to those from the results of an age- and sex-aggregated counterpart (Additional file 3: Fig. S16).

Post-estimation

To take advantage of the more structured modeling approach and additional national-level data used by GBD 2019 [2], we performed post hoc calibration of our estimates to the corresponding national-level GBD estimates. For each country, year, age bin, and sex in our analysis, we defined a “raking factor” equal to the ratio of the GBD estimate for this country-year-age-sex to the population-weighted posterior mean HIV prevalence in all corresponding grid cells (Additional file 3: Figs. S17-S18). These raking factors were then used to scale each draw of HIV prevalence for each grid cell within that GBD geography, year, age, and sex. Point estimates for each grid cell were calculated as the mean of the scaled draws, and 95% uncertainty intervals were calculated as the 2.5th and 97.5th percentiles of the scaled draws. Grid cells that crossed international borders within modeling regions were fractionally allocated to multiple countries in proportion to the covered area during this process. In cases where subnational (i.e., first administrative level) estimates were available from the GBD, that is, for Ethiopia, Nigeria and South Africa, we calibrated to those estimates rather than those at the national level. Uncertainty in GBD estimates was not accounted for in this calibration.
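A minimal Python sketch of the raking step, for one country-year-age-sex stratum, is given below. The function names and data layout (one list of draws per grid cell) are assumptions made for illustration, and fractional allocation of border cells is omitted.

```python
def raking_factor(gbd_prevalence, cell_mean_prevalence, cell_population):
    """Ratio of the GBD estimate to the population-weighted posterior mean."""
    weighted_mean = (
        sum(p * n for p, n in zip(cell_mean_prevalence, cell_population))
        / sum(cell_population)
    )
    return gbd_prevalence / weighted_mean


def rake_draws(cell_draws, factor):
    """Scale every posterior draw in every grid cell by the raking factor."""
    return [[draw * factor for draw in draws] for draws in cell_draws]
```

For example, two cells with posterior means 0.10 and 0.20 and populations 100 and 300 have a population-weighted mean of 0.175; a GBD estimate of 0.21 then implies a raking factor of 1.2.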

In addition to estimates of HIV prevalence on a 5 × 5-km grid, we constructed estimates of HIV prevalence for first- and second-level administrative subdivisions. We did this by calculating age- and sex-specific population-weighted averages of prevalence for all grid cells within a given area. This process was carried out for each of the 1000 posterior draws (after calibration to GBD), with final point estimates derived from the mean of these draws and uncertainty intervals from the 2.5th and 97.5th percentiles. Additionally, estimates of the number of people living with HIV for a given age and sex in each grid cell were derived by multiplying estimated prevalence in each grid cell by the corresponding population estimate from WorldPop [32], which was also calibrated to match GBD 2019 [36] (Additional file 1: Section 4.4; complete estimates of people living with HIV are available along with all prevalence estimates at (https://ghdx.healthdata.org/record/ihme-data/sub-saharan-africa-hiv-prevalence-geospatial-estimates-2000-2018)).
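The draw-wise aggregation can be sketched in Python as follows. The function name and data layout are hypothetical, and the percentile interpolation rule (`method="inclusive"` in the standard-library `statistics.quantiles`) is an assumption, since the text does not specify one.

```python
import statistics


def aggregate_admin_unit(cell_draws, cell_population):
    """Population-weighted prevalence for one administrative unit.

    cell_draws:      one list of calibrated posterior draws per grid cell
    cell_population: age- and sex-specific population of each grid cell
    Returns (mean, lower, upper), where lower and upper are the 2.5th and
    97.5th percentiles of the unit-level draws.
    """
    total = sum(cell_population)
    n_draws = len(cell_draws[0])
    # Aggregate within each draw first, then summarize across draws.
    unit_draws = [
        sum(draws[d] * pop for draws, pop in zip(cell_draws, cell_population)) / total
        for d in range(n_draws)
    ]
    # With n=40, the first and last cut points are the 2.5th and 97.5th percentiles.
    cuts = statistics.quantiles(unit_draws, n=40, method="inclusive")
    return statistics.fmean(unit_draws), cuts[0], cuts[-1]
```

Aggregating within each draw before summarizing preserves the correlation between grid cells, which averaging the cell-level interval bounds would not.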

Although the model makes predictions for all locations covered by available covariates, all final model outputs for which land cover was classified as barren or sparsely vegetated according to European Space Agency Climate Change Initiative satellite data [48] and for which total population density was less than 10 individuals per 1 × 1-km in 2015 were masked for improved clarity when communicating with data specialists and policymakers. Maps were generated in R using the ggplot2 [49] package version 3.3.0.

Results

Geographic variation

We found large differences in the spatial and demographic distribution of estimated HIV prevalence in SSA that were masked in demographically aggregated estimates (Figs. 3 and 4; Additional file 3: Figs. S19-S34). This was particularly striking among middle and older age groups. For example, in the year 2018, the maximum estimated HIV prevalence in any second-level administrative unit for adults ages 15–59 years was 35.4% in Umgungundlovu in the KwaZulu-Natal province, South Africa (95% uncertainty interval [UI], 22.3–46.3%). However, estimated prevalence reached up to 59.4% [46.5–71.2%], almost 1.7 times higher, for females ages 35–39 years within that same location. Across all second-level administrative units, age groups, and sexes, females ages 35–39 in Nkilongo in Lubombo, Eswatini, had the highest estimated HIV prevalence in the year 2018, at 62.5% [50.1–74.5%].

Fig. 3

HIV prevalence in sub-Saharan Africa in 2018 at the second administrative level for a subset of modeled demographic groups from the lower, middle, and upper age ranges: a all adults, ages 15–59 years; b males and c females ages 15–19 years; d males and e females ages 35–39 years; and f males and g females ages 55–59 years. Maps reflect national boundaries, land cover, lakes, and population; areas with fewer than ten people per 1 × 1 km, and classified as barren or sparsely vegetated, are colored light gray. Countries colored in dark gray were not included in the analysis

Fig. 4

Relative uncertainty in HIV prevalence, 2018. Overlapping population-weighted quartiles of HIV prevalence (constructed separately for each demographic group) and relative 95% uncertainty in 2018 at the 5 × 5-km grid cell level for select demographic groups: a all adults, ages 15–59 years; b males and c females ages 15–19 years; d males and e females ages 35–39 years; and f males and g females ages 55–59 years. Relative uncertainty is defined as the ratio of the width of the 95% uncertainty interval to the mean estimate. Maps reflect national boundaries, land cover, lakes, and population; areas with fewer than ten people per 1 × 1 km, and classified as barren or sparsely vegetated, are colored light gray. Countries colored in dark gray were not included in the analysis

Geographic variation within countries was also more dramatic in our demographically disaggregated results. Across SSA countries, the median absolute difference between second-level administrative units with the lowest and highest estimated prevalence within a given country in 2018 was 3.5 times greater when considered across ages and sexes, than when estimated for all adults combined (11.2 percentage points versus 3.2 percentage points). This difference in within-country prevalence range between demographically aggregated versus disaggregated estimates varied greatly between countries. For example, in Mozambique, this range across second-level administrative units was 30.1 percentage points [16.7–46.3] for combined adults and 56.9 percentage points [37.4–78.2] (or 1.9 times larger) for estimates across ages and sexes. In Lesotho, on the other hand, this range was 8.2 times larger for estimates across ages and sexes compared to adults combined (51.6 percentage points [40.1–63.5] versus 6.3 percentage points [1.4–11.5]). Overall, countries in Eastern SSA tended to see greater such discrepancies compared to other regions; here, the median absolute difference between second-level administrative units was 4.4 times greater when considered across ages and sexes than for all adults combined (14.0 versus 3.2 percentage points). For complete geographic variation comparisons within each country, including uncertainty estimates, see Additional file 4.

Variation between males and females

Across SSA and across the years 2000–2018, estimated HIV prevalence was generally higher among females than males (Fig. 5). In 2018, for prevalence aggregated across ages 15–59 years, in no second-level administrative units was estimated prevalence higher among males compared to females. The absolute difference in estimated prevalence in 2018 between females and males reached a maximum of 15.0 percentage points (in Umkhanyakude, in KwaZulu-Natal, South Africa, with 36.3% [24.7–46.8%] estimated prevalence in females compared to 21.3% [13.1–28.7%] estimated prevalence in males), for a female to male prevalence ratio of 1.7 [1.5–1.9]. Countries in Central SSA, where overall prevalence was lower than in other SSA regions, tended to see the largest disparity between females and males in terms of relative differences. Estimated prevalence among females in Central SSA ranged up to a maximum of 2.7 [1.84–4.2] times greater than estimated prevalence in males in 2018 (in San Antonio de Palé, in Annobón, Equatorial Guinea, with 8.3% [2.1–21.4%] prevalence in females compared to 3.1% [0.8–8.1%] prevalence in males). Across Central SSA second-level administrative units, the median ratio between female and male estimated prevalence was 2.2, compared to the all-SSA median ratio of 1.6. The greatest absolute differences were seen in Eastern SSA, where the median absolute difference between female and male estimated prevalence was 1.9 percentage points in 2018, compared to the all-SSA median absolute difference of 0.9 percentage points. These differences between female and male prevalence in 2018 were less than those observed in the year 2000, when the median ratio between female and male estimated prevalence was 1.5, and the median absolute difference was 1.5 percentage points. We did not note substantial differences in within-country variations in prevalence between females and males in either 2000 or 2018 in any region. For complete comparisons between sexes by second-level administrative unit, including uncertainty estimates, see Additional file 4.
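The two disparity measures used in this section can be sketched as follows. This is an illustrative calculation on the point estimates quoted above, not the study's code; `sex_disparity` is a hypothetical helper name. The intervals reported for the ratios cannot be reproduced this way, since they would require joint posterior draws for both sexes rather than the two marginal intervals.

```python
def sex_disparity(female_pct, male_pct):
    """Return (absolute difference in percentage points, female-to-male ratio)."""
    if male_pct <= 0:
        raise ValueError("male prevalence must be positive to form a ratio")
    return female_pct - male_pct, female_pct / male_pct

# Point estimates (%) quoted in the text for 2018: (females, males)
examples = {
    "Umkhanyakude": (36.3, 21.3),
    "San Antonio de Palé": (8.3, 3.1),
}

disparities = {}
for unit, (female, male) in examples.items():
    diff, ratio = sex_disparity(female, male)
    disparities[unit] = (round(diff, 1), round(ratio, 1))
    print(f"{unit}: difference {diff:.1f} pp, ratio {ratio:.1f}")
```

The second unit shows how a lower-prevalence setting can have the larger relative disparity while its absolute difference is much smaller.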

Fig. 5

Differences in estimated prevalence between males and females ages 15–59 years at the second administrative level in 2018, calculated as (a) the ratio of estimated prevalence among females to prevalence among males and (b) the absolute difference in estimated prevalence between females and males. Maps reflect national boundaries, land cover, lakes, and population; areas with fewer than ten people per 1 × 1 km, and classified as barren or sparsely vegetated, are colored light gray. Countries colored in dark gray were not included in the analysis.

Variation between age groups

Prevalence within second-level administrative units was also highly variable across age groups (Fig. 6), and relative variation in prevalence between age groups in 2018 tended to be higher in males. Comparing estimated prevalence across age groups within a given second-level administrative unit in 2018, the ratio between highest and lowest prevalence among age groups tended to be larger among males compared to females (median ratio across all SSA second-level administrative units of 14.4 for males, and 9.3 for females). For males, this ratio between highest and lowest estimated prevalence among age groups was smaller in Central SSA compared to other regions (median ratio of 8.3) and was largest in Western SSA (median ratio of 21.7). There was little regional difference for females. The sexes also differed in changes in this ratio between years, where it decreased over time for males (with a median ratio in 2000 of 52.7) but increased over time for females (median ratio in 2000 of 5.6). For complete age variation comparisons by second-level administrative unit, including uncertainty estimates, see Additional file 4.
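As a rough illustration of the age-variation measure described above, the sketch below computes the ratio between the highest and lowest age-specific prevalence, and the age group with the highest prevalence, for a single administrative unit. All prevalence values are hypothetical and chosen only to show the calculation; `age_profile_summary` is not a function from the published analysis.

```python
def age_profile_summary(prevalence_by_age):
    """Return (max-to-min prevalence ratio, age group with highest prevalence)."""
    lowest = min(prevalence_by_age.values())
    if lowest <= 0:
        raise ValueError("ratio is undefined when the lowest prevalence is zero")
    peak_age = max(prevalence_by_age, key=prevalence_by_age.get)
    return prevalence_by_age[peak_age] / lowest, peak_age

# Hypothetical prevalence (%) for one administrative unit; not taken from the study
males = {"15-19": 0.5, "25-29": 3.0, "35-39": 6.5, "45-49": 7.0, "55-59": 4.0}
females = {"15-19": 1.2, "25-29": 7.5, "35-39": 10.2, "45-49": 9.0, "55-59": 5.1}

male_ratio, male_peak = age_profile_summary(males)
female_ratio, female_peak = age_profile_summary(females)
print(f"males: ratio {male_ratio:.1f}, peak at ages {male_peak}")
print(f"females: ratio {female_ratio:.1f}, peak at ages {female_peak}")
```

In this toy example the male ratio is larger even though female prevalence is higher in every age group, because the ratio is sensitive to very low prevalence in the youngest group.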

Fig. 6

Differences in prevalence between age groups in the year 2018 at the second administrative level, calculated as the ratio of estimated prevalence between the age groups with highest and lowest prevalence, for (a) males and (b) females; and the age groups with highest prevalence for (c) males and (d) females in 2018. Maps reflect national boundaries, land cover, lakes, and population; areas with fewer than ten people per 1 × 1 km, and classified as barren or sparsely vegetated, are colored light gray. Countries colored in dark gray were not included in the analysis.

Across SSA, the age group with the highest estimated prevalence in any given second-level administrative unit in 2018 was always between ages 35 and 54 years for males and between 30 and 49 years for females (Fig. 6). In 2018, males ages 45–49 years most commonly had the highest estimated prevalence across all age groups in a given second-level administrative unit, at 46.8% of second-level administrative units (1894 of 4043) from within 23 of 43 countries. Females ages 40–44 years had the highest estimated prevalence across age groups in 63.8% of second-level administrative units (2581 of 4043) in 31 of 43 countries. For both males and females, the age group with the highest estimated prevalence tended to vary more across Eastern SSA compared to other regions.

Within-country variation between second-level administrative units was relatively consistent across age groups. The ratio of maximum to minimum estimated prevalence among districts within each country was lowest for ages 35–39 years (median ratio of 4.3 across countries) and highest for ages 15–19 years (median ratio of 4.8 across countries) in 2018. Slightly larger differences were seen between age groups in Eastern and Southern SSA, with lower within-country variation in middle-age groups and greater within-country variation in younger age groups. The maximum-to-minimum within-country prevalence ratio in Eastern SSA was lowest for adults ages 40–44 years (median ratio of 5.4 across Eastern SSA countries) and highest for adults ages 15–19 years (median ratio of 6.7 across Eastern SSA countries). These same age groups also represented the lowest and highest ratios in Southern SSA countries, with median values of 2.0 in adults ages 40–44 years and 2.8 in adults ages 15–19 years.

Variation over time

Estimated change in prevalence over time among all adults masked broad differences between specific age and sex groups (Fig. 7; Additional file 3: Figs. S35-S40). Large temporal changes were much more common when considering sexes and age groups, compared to all adults combined. Between the years 2000 and 2018, among all adults ages 15–59 years, estimated HIV prevalence increased by more than 5.0 percentage points in only 3.7% (151 out of 4043) of second-level administrative units across SSA and decreased by more than 5.0 percentage points in 7.9% (321 of 4043) of second-level administrative units. On the other hand, 37.7% (1523 of 4043) of second-level administrative units experienced an increase in estimated HIV prevalence greater than 5.0 percentage points in that timeframe in at least one sex and age group, and 70.9% (2867 of 4043) of second-level administrative units saw a decrease greater than 5.0 percentage points in at least one sex and age group.
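The masking effect described here can be illustrated with a small tabulation. The sketch below uses three hypothetical administrative units (the values are invented for illustration and are not study estimates) and counts how many exceed the 5.0 percentage point threshold for all adults combined versus for at least one sex and age group.

```python
THRESHOLD_PP = 5.0  # threshold used in the text, in percentage points

# Hypothetical prevalence estimates (%): {unit: {group: (value in 2000, value in 2018)}}
estimates = {
    "unit_A": {"all_adults": (10.0, 11.0), "F 15-19": (8.0, 2.5), "F 50-54": (4.0, 12.0)},
    "unit_B": {"all_adults": (20.0, 13.0), "F 15-19": (15.0, 6.0), "F 50-54": (9.0, 10.0)},
    "unit_C": {"all_adults": (2.0, 2.5), "F 15-19": (1.0, 0.8), "F 50-54": (1.5, 2.0)},
}

def change(pair):
    """Change in prevalence from 2000 to 2018, in percentage points."""
    p2000, p2018 = pair
    return p2018 - p2000

def summarize(estimates, threshold=THRESHOLD_PP):
    """Percentage of units exceeding the threshold, by type of comparison."""
    counts = {"all_up": 0, "all_down": 0, "any_group_up": 0, "any_group_down": 0}
    for groups in estimates.values():
        all_change = change(groups["all_adults"])
        group_changes = [change(v) for k, v in groups.items() if k != "all_adults"]
        counts["all_up"] += int(all_change > threshold)
        counts["all_down"] += int(all_change < -threshold)
        counts["any_group_up"] += int(any(c > threshold for c in group_changes))
        counts["any_group_down"] += int(any(c < -threshold for c in group_changes))
    return {key: 100.0 * value / len(estimates) for key, value in counts.items()}

shares = summarize(estimates)
print({key: round(value, 1) for key, value in shares.items()})
```

Here unit_A changes by only one percentage point for all adults combined, yet it contains one group with a large decrease and another with a large increase, which is the pattern the aggregate estimate hides.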

Fig. 7

Change in HIV prevalence at the second administrative level between 2000 and 2018 for a subset of modeled demographic groups from the lower, middle, and upper age ranges: (a) all adults, ages 15–59 years; (b) males and (c) females ages 15–19 years; (d) males and (e) females ages 35–39 years; and (f) males and (g) females ages 55–59 years. Maps reflect national boundaries, land cover, lakes, and population; areas with fewer than ten people per 1 × 1 km, and classified as barren or sparsely vegetated, are colored light gray. Countries colored in dark gray were not included in the analysis.

The distribution of districts with large increases or decreases in prevalence over time also varied greatly by region. All regions saw a decrease of greater than 5.0 percentage points in estimated prevalence for at least one sex and age group in a majority of second-level administrative units between 2000 and 2018: 61.2% (393 out of 642) for Central SSA, 70.9% (1160 out of 1635) for Western SSA, 71.0% (1031 out of 1452) for Eastern SSA, and 90.1% (283 out of 314) for Southern SSA. However, Southern SSA also had a very high proportion of second-level administrative units seeing an increase of greater than 5.0 percentage points in that same time frame, at 92.0% (289 of 314), while only a minority of second-level administrative units saw similar increases in the other regions.

We found diverging overall trends between age groups over time, with greater decreases over time among younger age groups, and greater increases among older age groups. For example, for females ages 25–29 years, we found that estimated prevalence decreased by at least 1.0 percentage point between 2000 and 2018 in 73.3% of second-level administrative units in SSA (2965 of 4043) and increased by at least 1.0 percentage point in only 2.4% (99 of 4043) of all second-level administrative units. Conversely, among females ages 50–54 years, estimated prevalence decreased between 2000 and 2018 by at least 1.0 percentage point in just 11.8% (477 of 4043) of second-level administrative units but increased by at least 1.0 percentage point in 40.1% (1622 of 4043) of second-level administrative units. We found this trend to be similar across regions. For complete comparisons of prevalence over time for each second-level administrative unit, age, and sex, including uncertainty estimates, see Additional file 4.

Discussion

The results of this study, the first to present age- and sex-specific HIV prevalence estimates across sub-Saharan Africa at local scales, emphasize the interactions of geographic and demographic differences in HIV prevalence, going beyond previous research focused on either aspect individually. Just as previous work demonstrated how much geographic variability is masked in national prevalence estimates [10], we show here that demographically aggregated estimates mask important variation in the age and sex distributions of HIV prevalence at a local level, which in turn provide much clearer insights into the evolution of the HIV epidemic in SSA.

Many intervention methods are commonly used in the fight against the HIV epidemic, and variation in their efficacy and implementation has likely contributed to the prevalence trends presented here. Cost-efficiency is a consistent priority and is generally maximized by using targeted, integrated interventions [50]. For example, HIV prevention via behavioral and biomedical interventions based on local prevalence rates, HIV testing, and treatment initiation may be priorities for some age groups [51], while long-term ART retention and comorbidity care may require more emphasis for others [52]. Barriers to access to care often differ between geographic and demographic groups, where in some cases barriers may be logistical (e.g., geographic isolation and programmatic fragmentation [53]) or social (e.g., lack of information, stigmatization, homophobia [54]), and require different intervention methods. Males and females are also often targeted using different points of contact. For example, HIV testing has been recommended for all females attending antenatal care clinics [55], whereas for males the provision of self-, home-based, and mobile testing compared to facility-based testing may be more useful for testing and subsequent uptake of care [56,57,58]. Effective targeting of these interventions requires local, demographically specific HIV burden information, such as provided in the estimates presented here. Countries may similarly use this burden information to prioritize subnational and demographically specific treatment needs. This resource may also be useful in program evaluation efforts and thus aid the development of more successfully tailored interventions.

Variation in the social determinants driving HIV incidence and mortality, and thus HIV prevalence, is also an important consideration when assessing inequalities in HIV prevalence between locations and demographic groups. While prevalence among females is consistently higher than prevalence among males, for example, these differences can be attributed to different exposure to risk factors (such as age at first sex between males and females, marital status) in different countries [59]. In addition to understanding local patterns in HIV prevalence, effective interventions also need to consider, if not focus directly on, locally important risk factors and determinants of HIV infection and mortality [60, 61].

Our estimates point to many local shifts in HIV prevalence over time. A multitude of factors can affect HIV prevalence trends at the local level over time, from local changes in prevention interventions to shifts in the overall demographics of an area, but one particularly important factor is local scale-up of ART [62, 63]. Increases in ART coverage and reduced treatment costs have repeatedly been associated with large demographic shifts among people living with HIV [64]: because ART has been successful in reducing HIV mortality, the number of people living with HIV over the age of 50 years has increased greatly, and our results reflect this trend. Given evidence pointing to differences between younger and older ART patients in rates of CD4 cell count decline [65], immune reconstitution rates [66], and risk of associated non-communicable diseases [67, 68], among other health metrics [69], it is necessary that treatment plans for older patients be specifically tailored for their age group. Our results highlight those locations with large existing populations of people living with HIV for ages 50–59 years, and those seeing rapid growth of HIV prevalence in that demographic group. At the same time, the minimal change in estimated prevalence over time among the youngest age groups suggests that HIV prevention efforts for adolescents and young adults need to be continued, and even expanded, as a priority across the continent.

Despite the significant progress made through this analysis in describing HIV burden in SSA, prevalence estimates mask complex and varied relationships between HIV incidence and mortality, as well as migration and seasonal mobility. It is difficult to determine, for example, if a dramatic decrease in HIV prevalence in an area is due to reduced incidence, increased mortality, or differences in the immigration and emigration rates of HIV+ and HIV- individuals. Primary data for all three of these metrics are not widely available for SSA, adding further complexity to the interpretation of our estimates. Importantly, no estimates of these indicators are consistently available at local scales for specific demographic groups. Furthermore, local data related to diagnosis, treatment, and viral suppression rates are also limited, despite these metrics lying at the heart of the UNAIDS 95-95-95 goals [4]. While HIV prevalence estimates are very informative, difficulties can still arise when intervention decisions are built around them alone, without an understanding of their underlying drivers. Improved surveillance of HIV prevalence, incidence, and mortality, combined with reliable population and migration estimates and information on local programs, is necessary to fully understand the complexities of the region’s HIV epidemic. Even with the development of more comprehensive burden information, any modeled estimates should only be used for intervention purposes in conjunction with local program knowledge.

Methodological advantages and limitations

The methods used in this analysis build upon those previously used by Dwyer-Lindgren et al. to model adult HIV prevalence [10]. While this analysis improves upon the previous methods in some ways, it faces some of the same limitations, as well as some new ones. As with the previous study, and as with all modeling studies, the quality of our estimates is highly dependent on the quality and coverage of our input data. Despite constructing a large database of HIV prevalence data, coverage gaps and small sample sizes in some locations can be associated with imprecision and/or large uncertainty intervals in some of our prevalence estimates (Additional file 3: Figs. S27-S34). Additionally, the location information associated with the data compiled for this analysis is subject to some error. In order to protect respondent confidentiality, most surveys that collect GPS coordinates perform some type of random displacement on those coordinates prior to releasing data for secondary analysis: for example, GPS coordinates for Demographic and Health Surveys (DHS) are displaced by up to 2 km for urban clusters, up to 5 km for most rural clusters, and up to 10 km in a random 1% of rural clusters [34]. Past research has found that displacement can degrade the predictive power of a geostatistical model; however, this effect was found to be modest, and researchers concluded that relatively accurate mapping can be undertaken at a 5 × 5-km resolution even with GPS displacement [70].
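To give a sense of the scale of this location error, the following sketch simulates the displacement limits described above (2 km urban, 5 km for most rural clusters, 10 km for a random 1% of rural clusters). It is a simplified illustration under stated assumptions: distance and direction are drawn uniformly, coordinates are planar and in kilometres, and `displace_cluster` is a hypothetical name. It is not the procedure used by DHS, which may apply further constraints.

```python
import math
import random

def displace_cluster(x_km, y_km, urban, rng):
    """Randomly displace one cluster location given in planar kilometre coordinates."""
    if urban:
        max_km = 2.0
    elif rng.random() < 0.01:   # a random 1% of rural clusters
        max_km = 10.0
    else:
        max_km = 5.0
    distance = rng.uniform(0.0, max_km)        # assumption: uniform in distance
    angle = rng.uniform(0.0, 2.0 * math.pi)    # assumption: uniform in direction
    return x_km + distance * math.cos(angle), y_km + distance * math.sin(angle)

rng = random.Random(42)  # fixed seed so the sketch is reproducible
urban_shifts = [math.hypot(*displace_cluster(0.0, 0.0, True, rng)) for _ in range(1000)]
rural_shifts = [math.hypot(*displace_cluster(0.0, 0.0, False, rng)) for _ in range(1000)]
print(f"largest urban shift: {max(urban_shifts):.2f} km")
print(f"largest rural shift: {max(rural_shifts):.2f} km")
```

Because nearly all simulated shifts stay within 5 km, a displaced cluster typically lands in the same or an adjacent 5 × 5-km cell, which is consistent with the modest effect reported in the cited work.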

The approximate integration method we use in this analysis better handles uncertainty estimation and easily accommodates not only polygon data but age-aggregated data as well, compared to the polygon resampling method that has been used elsewhere [10, 71, 72]. At the same time, given the large number of dimensions being modeled, as well as the high data input count produced by our data disaggregation technique, we found that current matrix packages, as well as our computational facilities, could not accommodate a Gaussian process that accounted for the covariance of a complete space-time-age-sex Kronecker product. We therefore focused on the interactions between space, time, age, and sex that we believed would be most relevant in terms of capturing important variability in these dimensions, within our computational abilities. Our modeling strategy also assumed no difference in the probability that an HIV+ versus an HIV- pregnant woman would access antenatal care and therefore be included in ANC surveillance.
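The computational constraint mentioned above follows from how quickly the dimension of a full Kronecker-structured covariance grows. The sketch below is a back-of-the-envelope illustration only: the number of spatial nodes is a hypothetical placeholder, the count of nine five-year age groups is an assumed grouping for ages 15–59, and dense storage is used as a simple upper bound, whereas practical implementations usually rely on sparse or structured matrices.

```python
def kronecker_dimension(n_space, n_time, n_age, n_sex):
    """Number of rows (and columns) of a full space-time-age-sex Kronecker covariance."""
    return n_space * n_time * n_age * n_sex

def dense_matrix_gib(dimension, bytes_per_value=8):
    """Memory needed to store a dense square matrix of 8-byte floats, in GiB."""
    return dimension ** 2 * bytes_per_value / 2 ** 30

N_SPACE = 1000   # hypothetical number of spatial mesh nodes (assumption)
N_TIME = 19      # years 2000-2018
N_AGE = 9        # five-year age groups from 15-19 to 55-59 (assumed grouping)
N_SEX = 2        # males and females

dim = kronecker_dimension(N_SPACE, N_TIME, N_AGE, N_SEX)
print(f"dimension: {dim}")
print(f"dense storage: {dense_matrix_gib(dim):.0f} GiB")
```

Even with a modest spatial mesh, the combined dimension is several hundred thousand, which helps explain why the model was restricted to a subset of interactions.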

Due to limited data availability, we delineated estimates in this analysis using a male/female binary. We recognize that this approach does not allow for investigation of HIV prevalence among gender and sex diverse people, despite the disproportionate burden of HIV commonly seen among these populations [73]. Further, we recognize that many data sources do not provide the option to select a sex other than “male” or “female,” or gender options beyond “man” or “woman,” and often conflate gender with sex. In the future, we hope that high-quality data on HIV prevalence for gender and sex diverse people will be more widely available, so we can produce estimates beyond females and males.

We note that our results include unprecedentedly high prevalence estimates for certain population subsets. In most cases, we do not believe these estimates are implausible. For example, we estimated prevalence among middle- and older-aged females to be up to 59.2% [45.9–73.0%] in Umgungundlovu in KwaZulu-Natal, South Africa in 2018. Previous research has estimated prevalence for female adults of all ages combined in Umgungundlovu in 2017 to be 46.6% [43.8–49.5%] [74]. As we have shown that prevalence in middle- and older-aged females tended to be higher than all-ages prevalence, we believe our estimates for middle- and older-aged females during this time period in this location to be reasonable, especially with uncertainty intervals taken into consideration. In rare cases, however, our methods yielded estimates which we were unable to support through the literature. For example, for males ages 35–39 and 40–44 years in Nyatike in Migori, Kenya, we estimated prevalence in the year 2000 to be 77.8% [50.2–100.0%] and 78.7% [50.0–100.0%], respectively. It is unlikely that true prevalence in that area and year was this high (though given the large uncertainty intervals associated with these values, it is probable that true prevalence does fall within those ranges). We note, however, that the high estimates in this area and surrounding second-level administrative units were predominantly associated with the earlier years in our time series; we believe the more recent estimates in Nyatike to be more realistic [75]. In these locations, decreases in prevalence over time may therefore also be overestimated. These instances were rare.

A combination of data limitations and model complexity ultimately led to large uncertainty intervals around our estimates. Our 95% coverage estimates in model validation were consistently higher than expected (Additional file 3: Figs. S14-S16), which indicates that these uncertainty intervals may be larger than appropriate. Wide uncertainty can limit the utility of our estimates in terms of informing HIV policies, and reducing this uncertainty through improved data coverage will be an important consideration in future iterations of this model. We were also unable to account for all sources of uncertainty, such as uncertainty in the WorldPop estimates used in many stages of our modeling and estimation processes and uncertainty in covariates.
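The coverage diagnostic referred to here can be sketched in a few lines. If 95% uncertainty intervals are well calibrated, roughly 95% of held-out observations should fall inside them; a noticeably higher share suggests the intervals are wider than necessary. The example below uses hypothetical values and a generic calculation; it is not the validation code used in this study, and `interval_coverage` is an illustrative name.

```python
def interval_coverage(observed, lower, upper):
    """Share of held-out observations that fall inside their uncertainty intervals."""
    if not observed or not (len(observed) == len(lower) == len(upper)):
        raise ValueError("inputs must be non-empty and of equal length")
    inside = sum(lo <= y <= hi for y, lo, hi in zip(observed, lower, upper))
    return inside / len(observed)

# Hypothetical held-out prevalence values (%) and predicted 95% interval bounds
observed = [4.1, 12.0, 0.8, 7.5, 22.3]
lower = [1.0, 6.0, 0.1, 2.0, 10.0]
upper = [9.0, 20.0, 3.0, 15.0, 35.0]

coverage = interval_coverage(observed, lower, upper)
print(f"empirical coverage of nominal 95% intervals: {coverage:.0%}")
```

With only five observations this toy result is not meaningful on its own; in practice the share would be computed over many held-out data points and compared against the nominal level.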

Conclusions

HIV continues to impose enormous human and financial costs [3] on SSA, decades after its emergence. Financial and logistical disruptions and discontinuities due to the impacts of COVID-19, as well as changes in ART adherence, are likely to present new barriers [21, 76] to the UNAIDS 95-95-95 goals [4]. This analysis provides important insight into the nuances of HIV burden in SSA, offering information that is critical to the development of targeted interventions.