Introduction

A major goal of precision medicine is to optimize medical care for subgroups of patients based on genetic and/or molecular profiling1. A challenge in widespread adoption of genetic profiling is the genome variability among different population groups2. One example is the identification of pathogenic variants in (Mendelian) single gene disorders (SGDs). While the same genes are responsible, there is considerable variability across populations in the specific causative pathogenic variants3. For example, while all pathogenic variants causing cystic fibrosis affect the CFTR gene, the common pathogenic variant observed in Puerto Rico4 is different from the variant observed in Qatar5 and both are different from the pathogenic variants common in European populations6. A recent analysis of ClinVar, the main NCBI database of pathogenic variants causative of SGDs, shows a significant bias towards pathogenic variants observed in European ancestry individuals2. As is the case for Hispanics, Blacks, and other non-European groups, SGD pathogenic variants found in Greater Middle Eastern populations are under-reported. Since screening technologies depend on public resources such as ClinVar7, OMIM8, and 1000 Genomes Project9 for source data, there are limited screening platforms to assess SGD pathogenic variants in the Greater Middle East10.

A striking example of this is the Qatari population11,12. The inhabitants of Qatar include approximately 300 thousand Qataris and 2.5 million expatriates13. The Qataris comprise distinct genetic subgroups11,14. The proportion of consanguineous marriage among Qataris is high15, leading to longer runs of homozygosity16. In addition, the tribal nature of marriages, where individuals select a mate from a limited gene pool of members of the same tribe, contributes to a higher chance of homozygosity for a pathogenic founder variant derived from a common ancestor, such as the well-known p.Arg336Cys CBS variant linked to homocystinuria17.

In prior studies, we and others have identified SGD pathogenic variants that are common in the Qatari population3 and in other Greater Middle East populations18, including many pathogenic variants that are only observed in Qatari genomes or are at an enriched (higher) risk allele frequency compared to populations outside of the Greater Middle East14. At present, there is limited screening of the Qatari population for inherited pathogenic variants19.

The focus of this study is to develop “QChip1,” a genotyping microarray designed as a research and screening tool capable of enabling precision medicine for Qataris. The aim for QChip1 was to enable accurate and comprehensive screening for SGD pathogenic variants in Qatari newborns, premarital couples and patients presenting to the clinic. First, we analyzed genetic data from 8445 Qataris, including whole-genome sequence (WGS), whole-exome sequence (WES), and clinical pathology case reports from affected families. Using these data, a Qatari Genome Knowledgebase was constructed, containing known and predicted pathogenic variants in SGDs. Second, with this knowledgebase, QChip1 was designed to assess the Qatari genome for SGD pathogenic variants in the knowledgebase. Third, QChip1 accuracy was confirmed by comparison of QChip1 genotypes to WGS data for a batch of Qatari genomes. Fourth, genomes from Qataris and residents of New York City (NYC) and Puerto Rico (PR) were genotyped on QChip1 to determine the prevalence of SGD pathogenic variants in Qataris and to compare this to other populations. The analysis demonstrated that QChip1 is highly accurate in identifying deleterious variants in Qataris, and that the majority of pathogenic variants among Qataris are Qatari-specific or Qatari-enriched. Overall, this study demonstrates the value of a custom genotyping array for precision medicine identification of pathogenic variants that cause single-gene disorders in human populations absent from or underrepresented by common knowledgebases used for pathogenic variant screening assay design7,8,9,20,21.
In the interest of the advancement of science and open data sharing, a list of variants on the array, the genes and disorders with a known or potential link to the variants, and the prevalence of these variants in Qatar, Kuwait, NYC, and PR will be made available to the public through the QChip Browser (http://qchip.biohpc.cornell.edu), as well as through our 3rd party data sharing repositories at FigShare (https://figshare.com/projects/QChip1/120108) and NCBI BioProject (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA774497).

Results

Construction of the Qatari Genome Knowledgebase

The Qatari Genome Knowledgebase of single gene coding sequence pathogenic and potentially pathogenic variants was based on sequence data from 8416 Qataris, including 6218 whole-genome sequences of Qataris recruited by the Qatar BioBank (QBB)22,23 and sequenced by the Qatar Genome Program (QGP)24,25, 180 whole-genome sequences12,26 and 1297 exome sequences11 of Qataris recruited by Weill Cornell Medicine Qatar and sequenced by Illumina, Beijing Genomics Institute (BGI) or the New York Genome Center (NYGC), and 721 clinical reports from Hamad Medical Corporation (Supplementary Table 1). After removing duplicate variants observed in multiple cohorts, the analysis yielded 104,473,390 total variants in 20,069 genes in the Qatari population, including 87,813,560 single nucleotide variants (SNVs) and 16,659,829 indels (Table 1); below we refer to this dataset as the Qatar Genome Knowledgebase (QGK). Assessment of QGK for ClinVar pathogenic variants and genes yielded a list of 10,490,820 variants in 3770 genes known to ClinVar. Parallel assessment of QGK for moderate or high impact variants in protein coding genes using SnpEff identified 805,649 variants in 19,770 genes (Table 1, Supplementary Table 2). The SnpEff list of moderate/high impact predicted variants was intersected with the ClinVar list of known variants and known genes to generate a final list of 207,370 pathogenic variants in 3770 genes, including 196,855 SNVs in 3769 genes and 10,515 indels in 1897 genes. This final list of variants included 13,891 (7%) predicted high impact (e.g., nonsense, frameshift and other loss of function) and 193,479 (93%) predicted moderate impact (e.g., missense) variants.
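The intersection step described above can be illustrated with a minimal sketch. It assumes one plausible reading of the text, namely that a variant is kept when its predicted impact is moderate or high and its gene appears in a ClinVar-derived gene list. The `Variant` record, the `select_candidates` and `summarize` helpers, and the toy data are hypothetical and are not taken from the study's pipeline; only the impact labels (HIGH, MODERATE, LOW, MODIFIER) follow SnpEff's categories.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical minimal variant record (not the study's data model)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    gene: str
    impact: str  # SnpEff-style impact: "HIGH", "MODERATE", "LOW", "MODIFIER"


def select_candidates(variants, clinvar_genes):
    """Keep moderate/high impact variants located in ClinVar-listed genes."""
    keep = {"HIGH", "MODERATE"}
    return [v for v in variants if v.impact in keep and v.gene in clinvar_genes]


def summarize(selected):
    """Count variants by impact class and count distinct genes."""
    high = sum(v.impact == "HIGH" for v in selected)
    return {
        "total": len(selected),
        "high": high,
        "moderate": len(selected) - high,
        "genes": len({v.gene for v in selected}),
    }


# Toy data for illustration only; positions and alleles are made up.
toy_variants = [
    Variant("21", 100, "C", "T", "CBS", "MODERATE"),
    Variant("3", 200, "G", "A", "BTD", "HIGH"),
    Variant("3", 300, "G", "C", "BTD", "LOW"),            # dropped: low impact
    Variant("1", 400, "A", "G", "GENE_X", "HIGH"),        # dropped: gene not listed
]
toy_clinvar_genes = {"CBS", "BTD"}

selected = select_candidates(toy_variants, toy_clinvar_genes)
summary = summarize(selected)
print(summary)
```

Applied to the real lists, the same kind of summary would correspond to the totals reported above (207,370 variants, of which 13,891 are high impact and 193,479 moderate impact).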

Table 1 Step 1: Identification of pathogenic variants and genes in the Qatari Genome.

Design of QChip1

For each variant in the Axiom QChip design, one or more probesets were added to the design, depending on the computationally predicted difficulty of obtaining a high-quality genotype, the priority of the variant, and available space on the array. The initial design, QChip0, consisted of a total of 184,713 probes organized in 159,377 probesets for genotyping 91,942 variants in 3540 genes (Table 2). The additional probesets represent variants not previously genotyped by Thermo Fisher (formerly Affymetrix) arrays; for these novel variants (67,435 or 73.3% of 91,942), 2 or more probes were included in the probeset, while for known variants (24,507 or 26.7%) a single probe was included in the probeset.

Table 2 Step 2: Design of QChip1 based on the predicted pathogenic variants in the Qatari Genome.

QChip0 was then tested on 26 Qatari genomes for which WGS was available. Concordance was 99.7% ± 0.002 for n = 61,592 of n = 91,942 variant sites with non-missing genotypes in both WGS and QChip0 for all n = 26 samples. This high-confidence dataset consisted of 70,715 probes in 61,592 probesets for genotyping of 61,592 variants in 3438 genes (61,195 SNV probesets for 61,195 variants in 3476 genes, and 397 indel probesets for 397 variants in 300 genes), resulting in the final design of QChip1 (Table 2). Of these probesets, 61,565 were autosomal and a small proportion (n = 27; 0.04%) were non-autosomal (located in ChrX, ChrY, or MtDNA).

Testing of QChip1

The single nucleotide variants and indels represented on QChip1 were tested with an additional 473 Qatari genomes for which whole-genome sequencing was available24. After selection of the top performing probeset for each variant, probesets that were consistently top-performing across batches were compared to WGS genotypes. A total of 27,850 ± 0.75 variant sites where a high-confidence genotype was obtained for both QChip1 and WGS were compared, and concordance was 99.1% ± 0.00034 (Table 3). Concordance was higher for SNVs (99.2% ± 0.00034) than for indels (92.4% ± 0.0057).
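As an illustration of how array-versus-WGS concordance of this kind can be computed, the following is a minimal sketch rather than the study's actual pipeline. It assumes genotypes are coded as alternate-allele counts (0, 1, 2) with `None` for a missing call, and that only sites called on both platforms are compared. The `site_concordance` function, sample names, and toy calls are hypothetical.

```python
from statistics import mean, pstdev


def site_concordance(array_calls, wgs_calls):
    """Return (concordance, n_compared) over sites called on both platforms.

    Genotypes are alternate-allele counts (0, 1, 2); None marks a missing call.
    """
    shared = [
        site for site, call in array_calls.items()
        if call is not None and wgs_calls.get(site) is not None
    ]
    if not shared:
        return None, 0
    matches = sum(array_calls[site] == wgs_calls[site] for site in shared)
    return matches / len(shared), len(shared)


# Toy genotypes for two hypothetical samples.
array_by_sample = {
    "S1": {"v1": 0, "v2": 1, "v3": None, "v4": 2},
    "S2": {"v1": 0, "v2": 0, "v3": 1, "v4": 0},
}
wgs_by_sample = {
    "S1": {"v1": 0, "v2": 1, "v3": 1, "v4": 1},
    "S2": {"v1": 0, "v2": 0, "v3": 1, "v4": None},
}

rates = [
    site_concordance(array_by_sample[s], wgs_by_sample[s])[0]
    for s in array_by_sample
]
print(f"mean concordance = {mean(rates):.3f}, SD = {pstdev(rates):.3f}")
```

Summarizing per-sample rates by their mean and standard deviation is one way to arrive at a "value ± spread" figure; whether the reported ± values were derived this way is an assumption.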

Table 3 Step 3: Concordance of QChip1 compared to whole-genome sequencing.

QChip1 was then used to determine the prevalence in the Qatari population and in non-Qatari populations for variants of interest for SGD pathogenicity research and screening in Qatar. Genotyping of n = 2708 Qatari, n = 226 European-American, South Asian American and African-American New York City (NYC) residents and n = 51 European and Afro-Caribbean Puerto Rico (PR) residents was conducted and analyzed as a single batch, including data from the first two (QChip0/QChip1) batches described above and a third batch with the rest of the samples. Probesets were again filtered based on performance, and variants were retained if they had a missing genotype rate below 10%, concordance with WGS in batches 1 or 2 above 90%, and a minor allele frequency below 5%. The final set of variants for analysis included n = 32,674 SNVs. In order to assess the utility of QChip1 for use in other populations of the Greater Middle East (GME), the allele frequency of these variants was obtained for n = 540 Kuwaiti exomes and each variant was checked for presence in the Centre for Arab Genomic Studies (CAGS) database (http://cags.org.ae).
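The variant-level filtering can be sketched as a simple predicate. This sketch assumes the three thresholds are retention criteria (missing genotype rate below 10%, WGS concordance above 90%, minor allele frequency below 5%); the `keep_variant` function, its parameter names, and the toy values are hypothetical.

```python
def keep_variant(missing_rate, wgs_concordance, maf,
                 max_missing=0.10, min_concordance=0.90, max_maf=0.05):
    """Return True when a variant satisfies all three retention criteria."""
    return (
        missing_rate < max_missing
        and wgs_concordance > min_concordance
        and maf < max_maf
    )


# Toy per-variant statistics: (missing rate, WGS concordance, MAF).
toy_variants = {
    "var_a": (0.01, 0.995, 0.002),  # passes all filters
    "var_b": (0.25, 0.990, 0.002),  # too many missing genotypes
    "var_c": (0.01, 0.850, 0.002),  # low concordance with WGS
    "var_d": (0.01, 0.995, 0.200),  # too common to be of interest
}

retained = sorted(
    name for name, stats in toy_variants.items() if keep_variant(*stats)
)
print(retained)
```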

Use of QChip1

Among the 2,708 Qatari genomes tested, QChip1 identified a median of 2 homozygotes and 130 heterozygotes for SNVs of interest for SGD pathogenicity research and screening (Table 4). When assessed by Qatari subpopulations25, the highest median number (n = 205) of SNVs were identified in the Peninsular Arab subpopulation, 1.6-fold greater than the average median for the General Arab (109), Arabs of Western Eurasia and Persia (132), South Asian Arabs (137) and African Arab (129) subpopulations.

Table 4 Step 4: Use QChip1 to assess average number of single nucleotide variants per genome of interest for SGD research and screening in Qataris and other populations.

To help validate that QChip1 accurately detects known Qatari pathogenic variants, n = 140 variants identified as pathogenic either by the Hamad Medical Corporation (HMC) or by ClinVar were assessed in 2708 Qatari genomes by QChip1 (Table 5). All n = 140 (100%) QChip1 pathogenic variants were present in ClinVar, n = 25 (18%) were present in HMC, and n = 27 (19%) were present in CAGS. Among these n = 140, n = 94 were only present in ClinVar, n = 19 were present in HMC and ClinVar but not CAGS, n = 21 were present in ClinVar and CAGS but not HMC, and n = 6 were present in all three pathogenic variant databases (ClinVar, HMC, CAGS). Among the n = 140 pathogenic variants, n = 3 were classified as “suspicious” based on high allele frequency (greater than 0.005)27. The three variants were previously reported in CAGS, HMC, or both, and appear to be truly pathogenic variants that are enriched in the Qatari population due to founder effects, tribalism, consanguinity or a combination of these factors. One of these, NM_000071.2(CBS):c.1006C > T (p.Arg336Cys) linked to homocystinuria, is a well-documented founder variant in Qatar that was experimentally validated and is a priority for screening in the population17,28.
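The database overlap counts in this paragraph can be checked with a few lines of arithmetic. The sketch below assumes the four reported categories are mutually exclusive, with the 19 HMC variants read as absent from CAGS; the variable names are illustrative only.

```python
# Mutually exclusive categories (assumed) among the pathogenic variants.
only_clinvar = 94
hmc_and_clinvar_only = 19    # assumed to exclude CAGS
cags_and_clinvar_only = 21
all_three = 6

total = only_clinvar + hmc_and_clinvar_only + cags_and_clinvar_only + all_three
hmc_total = hmc_and_clinvar_only + all_three
cags_total = cags_and_clinvar_only + all_three

print(f"total = {total}")
print(f"HMC   = {hmc_total} ({hmc_total / total:.0%})")
print(f"CAGS  = {cags_total} ({cags_total / total:.0%})")
```

Under that reading the categories sum to 140, and the per-database totals match the 25 (18%) and 27 (19%) reported for HMC and CAGS.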

Table 5 Step 3: Known pathogenic variants of interest for Mendelian (single gene) disorder screening in Qatar using QChip1.

A major question for the future of QChip is the applicability of the variant list in other GME populations. In order to begin to answer this question, the QChip1 variant list was looked up in four datasets, including sequencing data from CAGS, Kuwait, Iran, and a collection across the GME (GME Variome)29,30,31,32. Out of the n = 140 pathogenic variants in Qatar genotyped by QChip1, 50% (n = 70) were observed in one or more of the 4 GME datasets, including n = 28 (20%) in Kuwait, n = 32 (23%) in Iran, and n = 37 (26%) in the GME Variome. As expected, only n = 8 (6%) were observed in Puerto Rico and n = 16 (13%) were observed in NYC (Table 6). Based on these data, the utility of QChip1 was higher in GME than in the Americas; however, half the variants were unique to Qatar, and thus each GME nation (such as Kuwait and Iran) could benefit from a custom design.

Table 6 QChip1 pathogenic variants in genomics knowledgebases.

All 140 of the pathogenic variants were accurately detected by QChip1 and are described in Table 5; for additional variants of interest for SGD research on QChip1 assessed on 2708 Qatari genomes, see Supplementary Table 3. As shown in Table 5, pathogenic variants were identified in CBS, a gene linked to homocystinuria (rs398123151 and rs121964972, 1 homozygote and 32 heterozygotes combined, 0.62% genomes), and in genes linked to nemaline myopathy (rs886041851, 16 heterozygotes, 0.3% genomes) and factor XI deficiency (rs121965063, 0.13% genomes). Relevant to these observations, all 2708 genomes tested were from the general medical clinic and general population, not from referrals to genetic disease clinics, and hence these data were interpreted as representative of the general population of Qatar.

Examination of the distribution of types of functional variants identified by QChip1 in the Qatari genome showed that the majority of variants of interest for research that were computationally predicted to have “high impact” were involved in structural interaction, which currently would be considered “benign” or “uncertain significance” by ACMG standards and ClinVar. The most common class of variants of interest for research that were computationally predicted to have “moderate” impact were missense variants (Supplementary Table 4). In some cases, the SnpEff annotation was different from the ClinVar annotation for a pathogenic variant, typically in situations where multiple transcripts lead to multiple alternative annotations for a variant and SnpEff is not aware of the “canonical” annotation in the literature. One example is NM_000071.2(CBS):c.1006C > T (p.Arg336Cys), which SnpEff correctly annotated on the transcript as c.1006C > T but, rather than providing the amino-acid change, annotated as “structural_interaction_variant”.

The applicability of QChip1 was assessed across populations, including those directly genotyped using the array and others not genotyped on the array but of relevant Greater Middle Eastern ancestry. Of the 32,674 variants of interest for SGD research and screening that were observed by QChip1 in at least 1 Qatari, 77% were at a frequency higher than in any of the non-Qatari populations genotyped on the array (Fig. 1A). Among the Qatari genomes, the highest proportion of SGD risk alleles were in the Arabs of Western Eurasia and Persia, and African Arab subpopulations (Fig. 1A). As predicted, the majority (76%) of the Qatari genome pathogenic variants were not present in non-Qatari populations (Fig. 1B). QChip1 assessment of NYC and Puerto Rico residents demonstrated only rare detection of Qatari pathogenic variants in populations that included (based on genetic analysis of population clusters, Supplementary Fig. 1) European-American, South Asian-American, and African-American populations (Table 5, Supplementary Table 3).
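The enrichment comparison can be sketched as follows: for each variant observed in Qatar, the Qatari risk allele frequency is compared to the maximum frequency across the comparison populations. This is a minimal illustration under assumed data structures (dictionaries of allele frequencies keyed by variant); the `proportion_enriched` function, population labels, and toy frequencies are hypothetical.

```python
def proportion_enriched(qatar_af, other_af):
    """Fraction of variants observed in Qatar whose allele frequency exceeds
    the maximum frequency seen in any comparison population.

    qatar_af: dict mapping variant -> allele frequency in Qatar
    other_af: dict mapping population -> dict of variant -> allele frequency
    A variant absent from a comparison population is treated as frequency 0.
    """
    observed = [v for v, af in qatar_af.items() if af > 0]
    if not observed:
        return None
    enriched = 0
    for variant in observed:
        max_other = max(
            (pop.get(variant, 0.0) for pop in other_af.values()),
            default=0.0,
        )
        if qatar_af[variant] > max_other:
            enriched += 1
    return enriched / len(observed)


# Toy allele frequencies for illustration only.
qatar = {"v1": 0.010, "v2": 0.002, "v3": 0.0, "v4": 0.005}
others = {
    "NYC": {"v1": 0.0, "v2": 0.004},
    "PR": {"v4": 0.001},
}

fraction = proportion_enriched(qatar, others)
print(f"{fraction:.0%} of observed variants are enriched in Qatar")
```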

Fig. 1: Population distribution of QChip1 variants observed in Qatar.

In order to demonstrate the population-specific value of QChip1, the risk alleles that were discovered by genome/exome sequencing, prioritized in the knowledgebase, included in the array design, successfully genotyped, and observed in array data for at least one of n = 2,708 Qataris are provided for download in Supplementary Table 1 and online at the Qatar Genome Browser (http://qchip.biohpc.cornell.edu). Shown is a summary of the population enrichment of these variants. A Enrichment of potentially pathogenic variants on QChip1 in Qatari subpopulations. In order to determine if Mendelian disease risk alleles were enriched in single Qatari subpopulations, a cross-population allele frequency comparison was conducted for five ancestries observed in Qatar (k1, QGP_PAR, Peninsular Arabs; k2, QGP_GAR, General Arabs; k4, QGP_WEP, Arabs of Western Eurasia and Persia; k5, QGP_SAS, South Asian Arabs; and k3, QGP_AFR, African Arabs). Not shown: QGP_ADM, Admixed Arabs. For each subpopulation, the risk allele frequency was compared to the maximum of the other four subpopulations. Shown is the proportion that was highest in the subpopulation for (left-to-right) QGP_PAR, QGP_GAR, QGP_WEP, QGP_SAS, and QGP_AFR. B Enrichment of potentially pathogenic variants on QChip1 in the Qatari genome relative to non-Qatari. The non-Qatari genomes were residents of New York City (total n = 226) and Puerto Rico (n = 51). The ancestry proportions of these 226 non-Qatari genomes in 5 clusters (k1 to k5) were calculated as described in Fig. 2 (combined analysis of non-Qataris and Qataris using ADMIXTURE68). The lowest cross-validation error was for k = 5, with the non-Qataris falling in 3 clusters (African-Americans from NYC, n = 60, k3; European-Americans from NYC, n = 153, k4; South Asian-Americans from NYC, n = 13, k5; Puerto Ricans of European Ancestry, k4; and Puerto Ricans of Afro-Caribbean Ancestry, k3). More details of the population structure were made available in Fig. 2 (Qataris) and Supplementary Fig. 1 (non-Qataris). Shown is the percentage of n = 32,674 potentially pathogenic variants in Mendelian (single gene) disorder genes that were observed in at least one Qatari and have a risk (minor) allele frequency in Qatar higher than in non-Qatari populations. The proportion of variants was calculated that were at elevated minor allele frequency (enriched) in the Qatari genome relative to the genomes of the 5 non-Qatari population clusters tested: USA African-American (k3), USA European-American (k4), USA South-Asian American (k5), PR Afro-Caribbean (k3), PR European (k4). Shown from left-to-right is the proportion that are enriched in Qatar relative to the maximum of all 5 populations, followed by the proportion enriched relative to each individual population.

Within the subset of the variants that are known pathogenic and of interest for screening (n = 140), similar results were observed for Western populations, with only 6% of QChip1 pathogenic variants observed in Puerto Rico and only 13% found in NYC. Within Arab populations, the results were better but still not sufficient to justify the use of the array, with only 24% of QChip1 pathogenic variants observed in Kuwait and 15% reported in the Centre for Arab Genomic Studies (CAGS) database.

Array performance

Using NGS data as the gold standard, the authors calculated the analytical sensitivity, specificity, accuracy, positive predictive value, and negative predictive value of QChip1. Using data from WGS and QChip1 for n = 140 (mostly rare) pathogenic variants in n = 473 Qataris, comparison was conducted for n = 66,220 genotypes. Of these, n = 39,286 could not be compared due to a missing genotype in one of the two platforms (99.8% were missing in WGS only), and among the remaining n = 26,934 there were n = 26,781 true negatives, n = 132 true positives, n = 21 false negatives, and n = 0 false positives. Based on these data, the sensitivity was 86.3%, the specificity was 100%, the accuracy was 99.9%, the positive predictive value was 100%, and the negative predictive value was 99.9%. This performance is very high relative to recently published evaluations of SNP chip performance on rare pathogenic variants33.
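The reported performance figures follow from the standard confusion-matrix definitions, as the short sketch below illustrates using the counts given in the text. The `diagnostic_metrics` function name is illustrative and is not part of the study's software.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard diagnostic metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "accuracy": (tp + tn) / total,
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
    }


# Counts reported in the text, with WGS as the reference standard.
compared = 66_220 - 39_286  # genotypes with a call on both platforms
metrics = diagnostic_metrics(tp=132, fp=0, tn=26_781, fn=21)

print(f"genotypes compared: {compared}")
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Because specificity and negative predictive value are dominated by the large number of true negatives at rare variant sites, sensitivity (132 of 153 WGS-positive genotypes detected) is the most informative of the five figures here.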

Discussion

This report described the design, testing, and application of QChip1, the first genotyping microarray specifically designed for precision medicine in the Greater Middle Eastern population. QChip1 was designed for and determined to be suitable for SGD research, clinical screening of newborns or couples planning children, and for genetic diagnosis of SGD patients in the country and in the region.

The main hypothesis of this project was confirmed, that variants of interest for SGD pathogenicity research and screening within known genes vary considerably across populations, as the majority of the QChip1 variants observed in Qatar were either Qatar-private or Qatar-enriched, and were absent from other GME populations and databases of SGD pathogenic variants specific to GME populations. In addition, the majority of QChip1 variants were absent from the Thermo Fisher database, one of the largest knowledgebases in the world of genetic disease variants used in clinical genetics and research genetics. Given the low cost (<$100 per array) and ease of use of QChip1, it provides an accessible and sustainable alternative to extensive sequencing and interpretation of variants of unknown significance34 for the implementation of precision medicine in countries such as Qatar.

The development of QChip1 included the following steps: (1) assessment of the Qatari population to identify Qatari variants and genes of interest for SGD pathogenicity research and screening; (2) design and manufacture of genotyping probesets for inclusion in the QChip1 microarray; (3) refinement and testing of QChip1 by analysis of data from 469 Qataris also sequenced using WGS; and (4) use of the refined QChip1 for quantification of variants of interest for SGD pathogenicity research and screening in 2708 Qatari genomes, with a focus on (a) variants specific-to or enriched-in Qatar relative to non-Qatari DNA samples also genotyped using QChip1 and (b) variants known to be pathogenic.

The key findings of this study were that out of over 104 million variants in Qatar, extensive analysis both in silico and in vitro identified with over 99% accuracy over 32 thousand variants in the Qatari population that are known or predicted to alter the function of genes with a known role in SGDs. The majority of these 32 thousand variants were only observed in Qatar, including 103 of 140 (64%) known pathogenic variants previously observed in Qatari clinical case reports and in ClinVar. Of those variants also observed in Kuwait, the CAGS database of GME variants, NYC or Puerto Rico, the majority were enriched in Qatar, at a higher risk allele frequency. These observations confirm the hypothesis that a considerable proportion of SGD risk variants are population-private founder variants or population-enriched variants that drifted to elevated allele frequency in Qatar. Surprisingly, this hypothesis holds even when compared to neighboring GME populations. This observation justifies the effort invested by this research team in developing QChip1 and in producing a framework for the development of similar SGD clinical and research arrays for other understudied populations in the GME, the Americas, and beyond. The population genetic analysis presented here suggests that the high diversity of the Qatari population demonstrates the limited applicability of this array in the Greater Middle East region, which from a genetic perspective spans from Africa to Southern Europe, the Near East, Central Asia, and South Asia. The population-specificity of the variants on the array is a confirmation of the uniqueness and genetic isolation of the Qatari population as previously described by this research team.

The majority of genotyping arrays in use today were designed for coverage of the whole genome, and provide limited coverage of rare variants in genes known and potentially pathogenic in genetic disorders35. Screening arrays do exist, including arrays designed for detection of cytogenetic defects in newborns36, arrays designed for pre-natal screening37, and exome arrays designed for exome-wide association studies (ExWAS)38. Exome sequencing is growing in popularity for the detection of risk variants, and a number of companies offer it as a service, including variant interpretation39. The challenge with exome sequencing for clinical use is how to deal with the identification of variants of unknown significance40. In contrast, the concept of the QChip1 array is that all variants in the array were annotated prior to genotyping, hence circumventing the issue of variants of unknown significance while still covering rare variants. In this sense, the QChip1 knowledgebase is of great value, as it can be used to aid the interpretation of genetic data produced by targeted sequencing or genotyping of a panel of variants of interest for carrier screening, similar to the Plain Insight Panel41.

The challenge for array design is the selection of variants. There are over 7 million known missense and loss of function variants42, and no array can fit all. Unlike arrays designed for ExWAS, genome-wide association study (GWAS) and population genetics, limiting the array to common variants is not useful for screening for pathogenic variants, as common variants are less likely to be pathogenic, and rare variants are difficult to impute using reference panels and common variant genotype data43. In order to focus on pathogenic rare variants, arrays custom-tailored to a population are a better fit for individuals sampled from that population, as rare variants are more likely to be population-specific44.

This study provides advances in both knowledge and technology for the field of genomic medicine for a specific genetic population. On the knowledge front, it contains the largest knowledgebase of variants of interest for genetic disease research and screening in a Greater Middle Eastern population. While the consequences of many of the variants on QChip1 are unknown, the array provides a paradigm for clinical screening of this population and a platform for future genetic disease research in the Greater Middle Eastern populations. The variants included in the design and validated in a batch of n = 2708 Qataris were as rare as 1 in 5000 (minor allele frequency of 0.0002), and future whole-genome sequencing of Qataris is expected to yield thousands of additional variants of interest. Confidence in the true existence of such rare SGD risk variants in the Qatari population was boosted by this study, as the variants were discovered by WGS and verified by QChip genotyping.

The QChip1 array did not include short tandem repeats, other repetitive variants, copy number variants, or structural variants. A small proportion of probes on QChip1 were designed for indel detection, but the concordance with whole-genome sequencing for the indels was inadequate. This may be due to inadequate probeset design and should be a focus for future QChip designs. The main limitation of arrays is the space for probes, and in this case the majority of variants were novel to the Axiom platform and hence required multiple probesets. In future iterations, the highest performing probesets identified in this study can be used, and poor performing probesets can be eliminated, thus making additional space on the array for additional variants. Thus, multiple iterations of QChip are needed to produce a high-quality design that genotypes a variety of variants. Another strategy that is frequently used by genotyping array manufacturers is to spread a design across multiple arrays that are genotyped together; i.e., although the manufacturers can advertise an array with up to 5 million variants, in reality the “array” consists of 4 or more individual arrays45.

Another limitation of this study is the inability to determine the cis/trans phase of variants, which is also a challenge for exome sequencing. For example, multiple pathogenic variants in BTD can occur in the same genome, and hence screening for these variants includes a second step to determine phase46. In the case of this study, there were three pathogenic variants in BTD (rs397514369, rs13078881, rs138818907). Among those individuals with a BTD pathogenic variant, there were n = 5 heterozygotes for rs397514369, n = 4 homozygotes and n = 135 heterozygotes for rs13078881, and n = 5 heterozygotes for rs138818907. Zero individuals were positive for more than one BTD pathogenic variant, which rules out the possibility of two pathogenic variants in trans. However, were it the case that multiple BTD variants were observed in the same genome, follow-up validation of phase by Sanger sequencing would be needed. This is a disadvantage of exome sequencing and exome-focused array genotyping, as insufficient coverage of intergenic regions is available for phase inference. Follow-up sequencing is needed until genome-wide technologies, such as WGS, are widely available. Plans for QChip2 include broad coverage of sufficient variants for phase inference.
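The check for individuals carrying more than one pathogenic variant in the same gene can be sketched as below. This is a hypothetical illustration, not the study's code: it assumes genotypes are coded as alternate-allele counts per rsID, and flags a sample for follow-up phasing when it is heterozygous at two or more sites in the gene. The sample identifiers and calls are made up; only the three BTD rsIDs come from the text.

```python
BTD_VARIANTS = ("rs397514369", "rs13078881", "rs138818907")


def samples_needing_phase(genotypes, variants=BTD_VARIANTS):
    """Return samples heterozygous at two or more of the given variants.

    genotypes: dict mapping sample -> dict of rsID -> alternate-allele count.
    Such samples could carry the variants in cis or in trans, so phase would
    need to be resolved by follow-up sequencing.
    """
    flagged = []
    for sample, calls in genotypes.items():
        het_sites = sorted(rs for rs in variants if calls.get(rs) == 1)
        if len(het_sites) >= 2:
            flagged.append((sample, het_sites))
    return flagged


# Toy genotypes for three hypothetical individuals.
toy_genotypes = {
    "Q1": {"rs13078881": 1},                       # single heterozygote
    "Q2": {"rs13078881": 2},                       # homozygote, phase not at issue
    "Q3": {"rs397514369": 1, "rs138818907": 1},    # two het sites: phase unknown
}

flagged = samples_needing_phase(toy_genotypes)
print(flagged)
```

In the cohort described above this list would be empty, since no individual was positive for more than one BTD pathogenic variant.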

QChip1 was designed to be competitive relative to sequencing and existing arrays, hence there was a focus on achieving a platform that could provide data for under $100 per DNA sample, including reagents and labor. This is a price point that should remain competitive compared to alternative options for up to a decade, and remains the objective of major manufacturers of sequencing instruments47. A major saving is the small data footprint of QChip1 relative to exome or genome sequencing, where orders of magnitude more data storage are needed. In particular, if the objective is to apply QChip1 on a national scale, the infrastructure investment is considerably more manageable for the prospect of running hundreds of thousands of arrays relative to sequencing hundreds of thousands of genomes or exomes. To put this in perspective, the total Qatari population is approximately 300,000, so the entire Qatari population could be screened for all known and potentially pathogenic variants for approximately $30 million. As presented by the chair of the Qatar Foundation, HH Sheikha Moza bint Nasser, at the WISH 2018 summit in Doha, such a precision medicine objective is under consideration for the next decade48.

Assessment of 2708 Qatari genomes shed novel insight into the Qatari population. As predicted from our prior assessments of the Qatari population3,11, the majority of the pathogenic and predicted pathogenic variants were Qatari-specific and underrepresented in non-Greater Middle Eastern genomes. The most common classes among known pathogenic variants and variants of high predicted severity were structural interaction variants and stop-gain loss-of-function variants. The most pathogenic variants per genome were observed in the General Arab population, a finding that has implications for other Greater Middle East populations such as Kuwait, United Arab Emirates, and Saudi Arabia that share considerable ancestry with Qatar18,49,50,51. The median Qatari genome had 134 known or computationally predicted pathogenic alleles of interest for SGD research or screening. Of the known pathogenic alleles that were both previously observed in Qatar and known to the ClinVar database, the most common known pathogenic variants were causative of biotinidase deficiency, Stargardt disease, and homocystinuria. Among these 3 variants with risk allele frequency above 0.5% in Qatar, one was not previously known to either the CAGS or HMC databases: NM_000060.2(BTD):c.[470G > A;1330G > C], linked to biotinidase deficiency. This is unusual, given the high frequency of the pathogenic variant at 0.0265, and could be an indication that either biotinidase deficiency is under-diagnosed in Qatar, or that the variant should be re-classified as “uncertain significance”. Of the other two variants with elevated risk allele frequency, one was reported in the CAGS but not the HMC database: NM_000350.2(ABCA4):c.[5512C > G;5882G > A], linked to Stargardt disease, with a risk allele frequency of 0.0207. Again, it is unusual that the variant was not previously observed in the HMC database, although it is a known pathogenic variant in Arabs and quite possibly enriched in a subset of the Qatari population due to drift.
The NM_000071.2(CBS):c.1006C > T (p.Arg336Cys) variant linked to homocystinuria is a well-known variant that is present in both the HMC and CAGS databases, and is known to be an enriched founder variant in the population. It was notable that this variant was incorrectly annotated by SnpEff as “structural interaction”, and only manual review based on the rsID identified the known function (Arg336Cys). This is an issue with annotation software that is not exclusive to SnpEff, where multiple transcripts overlap a variant (4 in the case of CBS), and the annotation for the “canonical” experimentally validated function of the variant in disease is buried among other annotations. This is a general problem in variant annotation, and computationally predicted annotations are to be considered an estimate that needs to be validated both by manual review of the literature and experimental validation in vitro. Other known pathogenic variants found using QChip1 included a Factor XI deficiency variant that was previously observed in both Arabs and in ancestral Jewish populations52.

QChip1 was designed to assess pathogenic variants in SGDs, with the aim of genomic medicine for Qatari newborns, premarital couples and clinical genetics patients. A likely future strategy for QChip2 and beyond will be to produce multiple arrays for different purposes, including (1) genome-wide association array designed for genotyping of common variants and calculation of polygenic risk scores for multifactorial disorders53; (2) imputation of rare variants based on a Qatari genome imputation reference; (3) population-specific variants that influence drug kinetics and adverse effects; (4) structural variants and repeats; (5) expansion of the QChip1 SGD variants based on a larger sample of Qatari genomes; and (6) variants relevant to autoimmune disease and infectious disease in HLA54 and non-autosomal chromosomes, such as ChrX variants in the ACE2 receptor used by the SARS-CoV-2 virus to infect human cells55.

In addition to future versions of the array, the QChip knowledgebase and browser (Qatar Genome Browser) will continue to expand and be updated as more public data from Qatar and literature data on known SGD variants and genes become available. The knowledgebase, array, and browser produced by this project were intended as a first and enabling step towards advancing the state of the art of genomic medicine in Qatar and in populations that share ancestry with Qatar, as demonstrated in the population genetics analysis presented in this study. The intent is to demonstrate this approach as a framework for the development of precision medicine in populations of countries in continents such as Africa56, where a per-sample genome analysis cost beyond $100 is out of reach. Given the low cost of sequencing data production, the availability of cloud-based genome analysis infrastructure that does not require large capital investment, and the ease of rapid array design using the Axiom platform, a nation or population that currently has no prior knowledge of genetic variation could take the approach presented here and produce a genetic disease screening program in under a year, potentially saving thousands of lives at risk of unknowingly being affected by a genetic disorder.

The applicability of the QChip1 technology in the Qatari national population is clear, as all of the variants genotyped were previously observed in Qatari nationals, and we know from current and prior studies that the Qatari population sample used as the source of genetic variation for the QChip is also very diverse, with contributions of ancestry from Africa, Europe, and Asia11,12. The applicability to expatriates both living within Qatar and those outside of Qatar will depend on shared ancestry between the expatriate individual and the Qatari population. An expatriate coming from one of the populations that contribute to Qatari ancestry will be more likely to have one or more pathogenic variants in QChip. More distantly related individuals would see less benefit from QChip for screening. Confirming that hypothesis, only 6% of the known pathogenic variants were observed in Puerto Ricans, hence an expatriate from Puerto Rico in Qatar would not benefit as much from QChip1 screening as an expatriate from Kuwait, where 20% of QChip1 pathogenic variants were observed. Across the Greater Middle East region, a total of 50% of the QChip1 variants were observed. This study provides a strong argument for ancestry inference as a standard part of precision medicine, to determine the appropriate screening tool and allele frequency reference database for SGDs.

Methods

Subject recruitment and sample collection

All research participants were recruited using IRB-approved protocols and informed consent. Recruitment sites included Doha, Qatar (Weill Cornell Medicine – Qatar Institutional Review Board); New York, New York, USA (Weill Cornell Medicine Institutional Review Board); and Mayaguez, Puerto Rico, USA (Institutional Review Board, University of Puerto Rico at Mayagüez). Every research participant received and understood the information in the consent document and other written materials, and gave permission to take part in the research by signing the informed consent form. No plan was put in place for recontacting participants with information on actionable findings. DNA extracted from whole blood57 was tested by RUCDR Infinite Biologics (Piscataway, New Jersey) to confirm sufficient quality for array genotyping58.

Strategy to design and assess QChip1

QChip1 was developed in steps (Fig. 2). Step 1. Pathogenic variants (known and predicted) in the coding regions of single genes in the Qatari genome were cataloged. Step 2. Using these data, QChip0 (the precursor of QChip1) was designed on the Axiom platform, tested using Qatari genomes and refined with optimal probes, variants and genes to create QChip1. Step 3. QChip1 was tested for concordance with whole-genome sequencing. Step 4. QChip1 was used to evaluate pathogenic variant Qatari prevalence and specificity by assessing genomes from Qataris and non-Qatari populations.

Fig. 2: Strategy to design and assess QChip1.

Step 1. Qatari Genome Knowledgebase. Single gene (Mendelian) pathogenic variants and genes in protein coding regions of the Qatari genome were identified using whole-genome sequencing, exome sequencing and clinical reports (see Table 1). After cataloging all variants and respective genes, the pathogenic variants and genes were identified using ClinVar and SnpEff. Step 2. Using this list, QChip0 (the precursor of QChip1) was designed on the Axiom platform and then tested with 25 Qatari DNA samples for which whole-genome sequencing was available. Step 3. Elimination of poorly performing probes and variants led to the final design of QChip1, which was tested for concordance with genome sequencing using DNA samples from Qataris. Step 4. Use of QChip1 to assess the prevalence of pathogenic variants and genes among Qataris, New York City residents and Puerto Ricans.

Step 1: Identification of variants of interest for research or screening in the Qatari Genome

The knowledgebase of pathogenic variants in the Qatari genome was established from several sources, including (1) Qatar Genome Program whole-genome sequencing of 6218 Qatari genomes sequenced on the Illumina HiSeq (Illumina, San Diego, CA) at Sidra Medicine (Doha, Qatar); (2) Department of Genetic Medicine, Weill Cornell Medicine whole-genome sequencing of n = 180 Qatari genomes sequenced on the HiSeq at Illumina (n = 108)12 and the New York Genome Center (n = 72)26; (3) exome sequencing of n = 1297 Qatari genomes sequenced on the HiSeq at Beijing Genomics Institute (n = 100)3 or New York Genome Center (n = 1197)11; and (4) n = 594 variants from n = 721 case reports of hereditary disorders identified by the Clinical Genetics Laboratory at Hamad Medical Corporation (HMC; Doha, Qatar; Supplementary Table 1). The HMC variants were collected in the period between 2002 and 2017, and all probands were Qatari nationals. Details of the number of variants in each cohort were tabulated. The final knowledgebase without duplicates consisted of n = 104,473,390 variants, including single nucleotide variants (SNVs) and indels (short insertions and deletions; Table 1).

The identification of variants of interest for SGD research and screening in the Qatari genome was carried out in a three-step process: (1) establishing a list of genes with a known link to Mendelian SGDs described in the ClinVar (version 7/21/20) database; (2) identification of Qatari variants computationally predicted to alter the function of SGD genes in a pathogenic manner, which are primarily of interest for SGD pathogenicity research; and (3) identification of Qatari variants known to be pathogenic in SGDs, based on being classified as such by the ClinVar database or by the HMC case reports.

Establishing a list of genes

A list of genes was compiled from ClinVar with the following criteria: (i) protein coding gene in the human genome that (ii) has a known link to an SGD and (iii) contains one or more variants in ClinVar that are classified with a “clinical significance” value of “pathogenic” (Supplementary Table 2), the classification recommended by the American College of Medical Genetics (ACMG) for variants interpreted for Mendelian disorders59.

Identification of variants of interest for SGD pathogenicity research in Qataris

Single nucleotide variants (SNVs) and indel variants in the Qatar Genome Knowledgebase were annotated using data from public and private sources. First, the allele frequency for each variant in Qataris and non-Qataris was calculated. Variants with a minor allele frequency above 5% in either Qataris or non-Qataris were excluded, per ACMG guidelines59. Second, variants were annotated with respect to impact on protein-coding genes in the ENSEMBL database60 using SnpEff61. Variants that did not affect the function of an SGD gene from ClinVar, identified as described above, were excluded. Third, variants that were predicted to produce missense or loss-of-function (LoF) changes were kept: these variants are classified by SnpEff as having “High” or “Moderate” potential impact on protein function. The resulting collection spans several classes, including known pathogenic variants, variants of unknown significance, and benign variants.
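The three filters described above can be illustrated with a minimal Python sketch. This is not the pipeline used in the study; the `Variant` record, its field names, and the function names are hypothetical, and the sketch assumes that minor allele frequencies and SnpEff impact labels have already been computed for each variant.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set

MAF_CUTOFF = 0.05                    # 5% cutoff stated in the text
KEPT_IMPACTS = {"HIGH", "MODERATE"}  # SnpEff impact categories that are kept


@dataclass
class Variant:
    variant_id: str        # hypothetical identifier
    gene: str              # gene symbol from the annotation
    maf_qatari: float      # minor allele frequency in Qataris
    maf_non_qatari: float  # minor allele frequency in non-Qataris
    impact: str            # SnpEff putative impact label


def is_of_interest(v: Variant, sgd_genes: Set[str]) -> bool:
    """Apply the three filters in the order given in the text."""
    if max(v.maf_qatari, v.maf_non_qatari) > MAF_CUTOFF:
        return False  # too common in Qataris or in non-Qataris
    if v.gene not in sgd_genes:
        return False  # does not affect a ClinVar SGD gene
    return v.impact.upper() in KEPT_IMPACTS


def select_variants(variants: Iterable[Variant],
                    sgd_genes: Set[str]) -> List[Variant]:
    """Return the variants that pass all three filters."""
    return [v for v in variants if is_of_interest(v, sgd_genes)]
```

Taking the maximum of the two frequencies implements the "in either Qataris or non-Qataris" condition with a single comparison.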

Identification of pathogenic variants for SGD screening

Among the variants defined in step 1.2, a subset is known pathogenic variants, including those classified by ClinVar as pathogenic or those previously observed in HMC case reports of SGDs. These variants can be used for screening of Qataris in a Precision Medicine setting.

Step 2: Design of QChip1

The microarray platform for the QChip was based on the Axiom custom array platform capable of accommodating 1.3 × 10⁶ probe features, each consisting of DNA probes covalently linked to a silicon wafer designed to hybridize DNA from the genomic sample. Multiple probes designed to hybridize to a genomic segment can be included in a single “probeset”, and one or more probesets designed to genotype a single variant can be included in the design, such that the performance of probesets can be compared. The initial design was named “QChip0” and the final (post-quality-filtering) version “QChip1”. The array design contained 693,652 probes in 597,049 probesets. A subset of n = 184,713 of the probes (27%), the focus of this report, were designed to assess variants of interest for SGD pathogenicity research and screening. These variants are computationally predicted or are known to affect the function of ClinVar SGD genes found in the variant knowledgebase. The remaining 73% of probes on QChip0, not the subject of this report, were designed for research purposes focused on population genetics, pharmacogenomics, and multifactorial disease research, and will be described in future publications based on future versions of QChip.

The probesets included probes complementary to reference and variant alleles, plus flanking sequence of 35 bases in both 5′ and 3′ directions. Note that this manuscript refers to reference GRCh38 and variant alleles from a genome sequencing perspective. However, in microarray genotyping, there is no “reference” allele, as both alleles are treated as equal by the technology, potentially reducing false genotype calls attributable to reference bias62. Variants that were already present in the ThermoFisher (previously Affymetrix) knowledgebase, and thus previously validated to provide accurate genotypes for an SNV or indel, were assessed using a single probeset, while novel variants were assayed using two or more probesets.

Once the array was manufactured, it was tested on an initial batch of genomic DNA samples, including n = 26 Qataris from the Weill Cornell Medicine cohort WGS data. Genotypes were generated from the WGS data for these n = 26 using GATK Haplotype Caller 3.863,64, configured to output genotypes for all sites on the QChip list, including homozygous reference calls. Comparison of QChip and WGS genotypes was conducted for sites where both WGS and QChip produced a non-missing (sufficient quality) genotype.

In order to exclude poorly performing probesets, two rounds of filtering were applied: a primary filter to select the highest performing probeset for each variant with multiple probesets, and a secondary filter to exclude variants with a high rate (>10%) of missing genotypes or a high rate of discordant genotypes. Excluding poorly performing probes and variants led to the final design of QChip1 with 166,695 probes designed to detect 83,542 variants of 3438 genes. Concordance and filtering analyses were performed using Python65 scripts. The concordance analysis script takes as input two single-sample VCF files66, one with QChip1 genotypes and a second with WGS genotypes for all QChip1 sites (including reference and variant genotypes) produced by GATK 3.864.
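The concordance comparison and the secondary filter can be sketched as follows. This is an illustrative sketch rather than the study's scripts: it assumes genotypes have already been read from the VCF files into dictionaries or lists of alternate-allele counts (0, 1, 2, or `None` for a missing call), and all function and parameter names are hypothetical. The text does not give a numeric cutoff for the discordance rate, so `max_discordant` is left as a required argument.

```python
from typing import Dict, List, Optional, Tuple

Genotype = Optional[int]  # alternate-allele count (0, 1, 2); None = missing


def concordance(array_gt: Dict[str, Genotype],
                wgs_gt: Dict[str, Genotype]) -> Tuple[int, int]:
    """Return (n_concordant, n_compared) for one sample.

    Only sites with a non-missing genotype in both datasets are compared.
    """
    compared = 0
    concordant = 0
    for site, a in array_gt.items():
        w = wgs_gt.get(site)
        if a is None or w is None:
            continue  # skip sites missing in either dataset
        compared += 1
        if a == w:
            concordant += 1
    return concordant, compared


def passes_secondary_filter(array_calls: List[Genotype],
                            wgs_calls: List[Genotype],
                            *,
                            max_discordant: float,
                            max_missing: float = 0.10) -> bool:
    """Decide whether one variant is kept, given its calls across samples.

    max_missing follows the >10% rule in the text; max_discordant must be
    supplied by the caller because the text does not state a value.
    """
    n = len(array_calls)
    if n == 0:
        return False
    missing = sum(g is None for g in array_calls)
    if missing / n > max_missing:
        return False
    pairs = [(a, w) for a, w in zip(array_calls, wgs_calls)
             if a is not None and w is not None]
    if not pairs:
        return False
    discordant = sum(a != w for a, w in pairs)
    return discordant / len(pairs) <= max_discordant
```

For example, `passes_secondary_filter(array_calls, wgs_calls, max_discordant=0.05)` would keep a variant with at most 10% missing array calls and at most 5% discordant calls, where 0.05 is purely an illustrative value.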

Step 3: Test of QChip1

The concordance of genes and variants of QChip1 with whole-genome sequencing data was calculated for a second array genotyping batch of n = 443 Qatari genomic DNA samples previously sequenced using WGS by the Qatar Genome Program. Concordance was calculated using the same method as described above for the first batch of n = 26.

Step 4: Use of QChip1

QChip1 was then used to determine the prevalence of variants of interest for SGD research and screening in the Qatari population (n = 2708) compared to genomes for European-American, South Asian-American and African-American New York City (NYC) residents (n = 226) and European and Afro-Caribbean residents of Puerto Rico (PR; n = 51). In addition to assessment of variant prevalence in Qataris as a single population, the population structure of Qataris was quantified as described previously67, and the prevalence of each variant was quantified for each known Qatari population cluster [Peninsular Arab (QGP_PAR), General Arab (QGP_GAR), Admixed Arab (QGP_ADM), Arabs of Western Eurasia and Persia (QGP_WEP), South Asian Arabs (QGP_SAS) and African Arabs (QGP_AFR); this nomenclature has replaced our prior nomenclature for these subgroups of Q1a, Q1b, Admixed, Q2a, Q2b and Q3, respectively, used in prior publications; Fig. 3]11. The population structure was quantified using ADMIXTURE68 for both Qataris and non-Qataris (Supplementary Fig. 1) using QChip1 data that was filtered to exclude indels, singletons, and variants in linkage disequilibrium (window 1000, step 25, maximum r² 0.1). Each genome was assigned to an inferred population cluster based on the k value with lowest cross-validation error (k = 5). Rather than classify individuals as admixed/non-admixed, each individual genome was assigned to the cluster (k) with the highest proportion of ancestry69. The results were visualized in a plot of principal components (PCs) calculated using PLINK70, with visualization in R71. Outliers, defined as genomes more than 2 standard deviations from the median PC value for PCs 1 to 5, were excluded. Each genome was color-coded by the inferred ancestry (1–5) and the country of origin (Qatar, US, PR).
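Two of the steps above, assigning each genome to its dominant ancestry cluster and flagging PC outliers, can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the inputs are taken to be per-genome lists of ADMIXTURE proportions and PC coordinates already loaded into memory, the function names are hypothetical, a genome is flagged if it exceeds the threshold on any one of the first five PCs, and the sample standard deviation is used. The text does not specify the last two choices.

```python
from statistics import median, stdev
from typing import List, Sequence


def dominant_cluster(proportions: Sequence[float]) -> int:
    """Return the 0-based index of the largest ancestry proportion.

    Ties are resolved in favour of the lowest index.
    """
    return max(range(len(proportions)), key=lambda k: proportions[k])


def pc_outliers(pcs: List[Sequence[float]],
                n_pcs: int = 5,
                n_sd: float = 2.0) -> List[int]:
    """Return indices of genomes lying more than n_sd standard deviations
    from the median on any of the first n_pcs principal components.
    """
    flagged = set()
    for j in range(n_pcs):
        column = [row[j] for row in pcs]
        centre = median(column)
        spread = stdev(column)  # sample SD; an assumption of this sketch
        for i, value in enumerate(column):
            if abs(value - centre) > n_sd * spread:
                flagged.add(i)
    return sorted(flagged)
```

With k = 5, `dominant_cluster([0.05, 0.70, 0.10, 0.10, 0.05])` returns 1, that is, the second ancestry component.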

Fig. 3: Population structure and principal component analysis of ancestry assessed by QChip1.

Sites and samples that failed QC based on variant batch effects or PC outliers were excluded. After QC, ADMIXTURE analysis was conducted on the remaining n = 37,674 variants and n = 2985 samples of Qataris (n = 2708) and non-Qataris (n = 277) for a range of k from 3 to 12. The lowest cross-validation error was observed for k = 5 for the full dataset. After analysis, the Qatari and non-Qatari samples were plotted separately; the panels here show the Qatari samples from the joint analysis. A Admixture (k = 5) proportions. Shown is a plot of the admixture proportions (% k from 0 to 100%, y axis), with each column representing one genome, sorted from left-to-right by dominant (highest %) k, and decreasing % k1 to k5. Genomes are color-coded by the dominant (largest %) ancestry (QGP_PAR, Peninsular Arabs, red; QGP_GAR, General Arabs, orange; QGP_WEP, Arabs of West Eurasia and Persia, bright green; QGP_SAS, South Asian Arabs, olive green; and QGP_AFR, African Arabs, light blue). Samples from prior studies of Qatar population structure (Qatar Genome public samples from Fakhro et al.11 and Rodriguez-Flores et al.12) genotyped on QChip1 were included in the clustering analysis and were used to assign the clusters. B Principal components analysis of Qataris. Shown is a PC1 × PC2 plot of Qatari genomes in squares color-coded by cluster of largest proportion of inferred ancestry. Not shown, QGP_ADM, Admixed Arabs.

Data analysis

The final set of QChip1 data included SNV variants with high-quality genotypes and genomes with known ancestry that are of interest for research and screening of SGDs in Qataris. Analysis of these data included quantification and comparison across populations of the following parameters: (1) individual burden of variants; (2) prevalence of variants; (3) enrichment of variants among Qatari subpopulations; and (4) enrichment of variants in Qataris compared to non-Qatari populations.

Performance

Once a final set of pathogenic variants screened using QChip1 was identified, the performance of the array was quantified. Data for QChip1 and WGS were compared on n = 140 pathogenic variants for n = 472 genomes. Using WGS as a “gold standard”, genotype calls were counted as true negative (TN, both WGS and QChip1 call wild type genotype), true positive (TP, both WGS and QChip1 call heterozygote or homozygote for risk allele), false negative (FN, WGS calls positive but QChip1 calls negative), or false positive (FP, WGS calls negative and QChip1 calls positive). Based on these four numbers, the sensitivity [TP/(TP + FN)], specificity [TN/(TN + FP)], accuracy [(TP + TN)/(TN + TP + FN + FP)], positive predictive value [TP/(TP + FP)], and negative predictive value [TN/(TN + FN)] were calculated.
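The performance calculation can be sketched as follows, treating WGS as the gold standard. This is a minimal sketch with hypothetical function names; it assumes each comparison is supplied as a pair of alternate-allele counts (WGS, QChip1) with missing genotypes already removed, and it takes accuracy as the standard (TP + TN)/(TP + TN + FP + FN).

```python
from typing import Dict, Iterable, Tuple


def count_outcomes(pairs: Iterable[Tuple[int, int]]) -> Dict[str, int]:
    """Tally TP/TN/FP/FN from (wgs, array) risk-allele counts.

    A call is positive when it carries one or two risk alleles.
    """
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for wgs, array in pairs:
        wgs_pos, array_pos = wgs > 0, array > 0
        if wgs_pos and array_pos:
            counts["TP"] += 1
        elif not wgs_pos and not array_pos:
            counts["TN"] += 1
        elif wgs_pos:
            counts["FN"] += 1  # WGS positive, array negative
        else:
            counts["FP"] += 1  # WGS negative, array positive
    return counts


def _ratio(numerator: int, denominator: int) -> float:
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def performance(c: Dict[str, int]) -> Dict[str, float]:
    """Compute the five summary statistics from the four counts."""
    tp, tn, fp, fn = c["TP"], c["TN"], c["FP"], c["FN"]
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }
```

Returning NaN for an empty denominator keeps the summary well defined when, for example, no positive genotypes are present in a small test batch.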

Utility beyond Qatar

In order to assess the potential utility of QChip1 beyond Qatar, the number of QChip1 pathogenic variants was quantified in internal and external knowledgebases. The internal knowledgebases included the QChip1 data for Qatar, NYC, Puerto Rico, and the Hamad Medical Corporation (https://www.hamad.qa/EN/Pages/default.aspx) list of pathogenic variants. The external knowledgebases included ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/), the Center for Arab Genetics Studies (https://www.cags.org.ae/en), the Iranome (http://www.iranome.ir/), the GME Variome (http://igm.ucsd.edu/gme/), and a set of exomes sequenced by the Dasman Diabetes Institute in Kuwait (https://www.dasmaninstitute.org/). Among the external databases, allele frequency was available for Iran (n = 800), GME (n = 886), and Kuwait (n = 540). The subset of variants present in one or more of the knowledgebases, as well as the subset present in one or more external knowledgebase focusing on the Greater Middle East region (CAGS, Iran, GME, Kuwait) was also quantified.
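The overlap counts described above amount to set intersections between the QChip1 pathogenic variants and each knowledgebase. A minimal sketch follows, assuming every resource has been reduced to a set of variant identifiers normalized to a common representation; the function name, dictionary keys, and identifier format are hypothetical.

```python
from typing import Dict, Iterable, Set


def summarize_overlap(qchip: Set[str],
                      knowledgebases: Dict[str, Set[str]],
                      regional: Iterable[str]) -> Dict[str, object]:
    """Count QChip1 variants found in each knowledgebase, in any
    knowledgebase, and in any of the named regional knowledgebases.
    """
    per_kb = {name: len(qchip & kb) for name, kb in knowledgebases.items()}
    # Union of all knowledgebases, restricted to QChip1 variants.
    in_any = qchip & set().union(*knowledgebases.values())
    # Union of the regional subset only (e.g. CAGS, Iran, GME, Kuwait).
    regional_union = set().union(*(knowledgebases[n] for n in regional))
    in_regional = qchip & regional_union
    return {
        "per_knowledgebase": per_kb,
        "in_any": len(in_any),
        "in_regional": len(in_regional),
    }
```

Dividing `in_regional` by `len(qchip)` would give the fraction of QChip1 variants observed in the regional resources.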

QChip genome browser

In order to provide researchers and clinicians access to annotation and allele frequency data in Qatar and the USA for the QChip1 Qatar SGD pathogenicity research and screening variants and genes, a web browser was constructed. The Qatar Genome Browser architecture consisted of a searchable table with a user interface implemented in a Shiny RStudio72 application frontend, running within a Docker (docker.com) container instance installed on a CentOS Linux (centos.org) server backend. The server was custom built by Red Barn (thinkredbarn.com) and configured by Cornell BioHPC73. In order to maintain security, the development version was accessible only within the Cornell campus network or via Cornell VPN, with plans for a public release after publication of this report. Testing of the server was conducted to confirm that the URL (http://qchip.biohpc.cornell.edu) was accessible from both Weill Cornell Medicine New York and Weill Cornell Medicine Qatar.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.