Abstract
Background
The absence of heterozygosity (AOH) is a kind of genomic change characterized by a long contiguous region of homozygous alleles in a chromosome, which may cause human genetic disorders. However, no low-pass whole genome sequencing (LP-WGS) method has been reported for detecting AOH at sequencing depths below onefold. We developed a method, termed CNVseq-AOH, for predicting the absence of heterozygosity from ultra-low-depth LP-WGS data. It overcomes the sparse nature of typical LP-WGS data by combining population-based haplotype information, adjustable sliding windows, and a recurrent neural network (RNN). We tested the feasibility of CNVseq-AOH for the detection of AOH in 409 cases (11 AOH regions for model training and 863 AOH regions for validation) from the 1000 Genomes Project (1KGP). AOH detection using CNVseq-AOH was also performed on 6 clinical cases with AOHs previously ascertained by whole exome sequencing (WES).
Results
SNP-based microarray results were used as the reference: an AOH detected by CNVseq-AOH was considered concordant if it overlapped an AOH detected by chromosomal microarray analysis (CMA) by at least 50%. A total of 409 samples (863 AOH regions) in the 1KGP were used for concordance analysis. For the 784 AOHs on autosomes and the 79 AOHs on the X chromosome, CNVseq-AOH predicted AOHs with concordance rates of 96.23% and 59.49%, respectively, using 0.1-fold LP-WGS data, a depth far lower than the current standard in the field. Using 0.1-fold LP-WGS data, CNVseq-AOH revealed 5 additional AOHs larger than 10 Mb in the 409 samples. We further analyzed AOHs larger than 10 Mb, the size recommended for reporting possible uniparental disomy (UPD). For the 291 AOH regions larger than 10 Mb, CNVseq-AOH predicted AOHs with a concordance rate of 99.66% with only 0.1-fold LP-WGS data. In the 6 clinical cases, CNVseq-AOH revealed all 15 known AOH regions.
Conclusions
We report a method for analyzing LP-WGS data that accurately identifies regions of AOH and has great potential to improve genetic testing for AOH.
Background
The absence of heterozygosity (AOH) is a kind of genomic change characterized by a long contiguous region of homozygous alleles in a chromosome [1]. Several underlying mechanisms of AOH have been reported, such as meiotic segregation errors [2], parental consanguinity [3], or complex chromosomal rearrangements [4]. AOHs do not necessarily have clinical consequences, but they may cause serious pathogenic effects when they are related to imprinting effects [5] or autosomal recessive disease mechanisms [3]. For example, more than 25% of Prader–Willi syndrome cases are caused by isodisomy (the inheritance of both homologs from a single parent, with only one homolog of that parent present) or heterodisomy (the inheritance of both homologs from a single parent, with both homologs of that parent present) [6]. Sahoo et al. found that whole-genome uniparental isodisomy (UPD) caused pregnancy loss in ~1% of cases [7]. In a study of rare autosomal trisomy by genome-wide noninvasive prenatal testing, the authors found that 4.16% of cases with rare autosomal trisomies originate from uniparental disomy [15], while variants on the X chromosome were phased by Eagle2 (without the pedigree-based correction) [16]. This inconsistency in variant phasing may influence the probability calculation of CNVseq-AOH. We therefore calculated the concordance rate separately for autosomes and the X chromosome.
For the 784 AOHs on autosomes, the prediction sensitivity of CNVseq-AOH generally increased with depth (Fig. 2a). As expected, the sensitivity of CNVseq-AOH was 100% (784/784) when the depth was ≥ onefold (Supplementary Table 1). With a depth of 0.5-fold, the sensitivity reached 99.9%; only one AOH, with an overlap of 47%, was missed by CNVseq-AOH (Supplementary Table 1). The sensitivity of CNVseq-AOH reached 96.23% even with a depth of 0.1-fold, far lower than the 4- to 5-fold depth required in current studies [11, 17]. For the 79 AOHs on the X chromosome, the sensitivity of CNVseq-AOH was 59.49% (47/79) with a depth of 0.1-fold. The prediction sensitivity of CNVseq-AOH also increased with depth (Fig. 2b). However, even with a depth of threefold, the prediction sensitivity did not reach 100%: 6 AOHs were missed by CNVseq-AOH, with overlaps ranging from 18% to 44%. These 6 AOHs were located in similar regions on the X chromosome (Supplementary Table 1) and were also missed by CNVseq-AOH at 0.5-fold and onefold depth. We further calculated the number of SNPs per 1 Mb on all chromosomes in the 1KGP. The number of SNPs per 1 Mb on the X chromosome (mean of 18,439.9) was significantly lower than that on autosomes (mean of 26,839.9) (t-test, P = 2.94E-12). One reasonable explanation for the relatively low sensitivity for AOHs on the X chromosome is that, compared with autosomes, the variant information in the phasing results of the X chromosome in the 1KGP was insufficient to calculate the probabilities for resampled reads.
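The SNP-density comparison described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function names, the toy counts, and the choice of Welch's variant of the t-test are assumptions, since the text only says "t-test". Bins that contain no SNPs are simply omitted here.

```python
from collections import Counter
from math import sqrt
from statistics import mean, variance

BIN_SIZE = 1_000_000  # 1 Mb bins, as in the text


def snps_per_mb(positions, bin_size=BIN_SIZE):
    """Count SNPs in each non-empty bin; positions are 1-based coordinates."""
    counts = Counter((pos - 1) // bin_size for pos in positions)
    return [counts[b] for b in sorted(counts)]


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    va, vb = variance(sample_a), variance(sample_b)
    se = sqrt(va / len(sample_a) + vb / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / se


# Toy data (hypothetical): SNP counts per 1 Mb bin.
autosome_bins = [120, 131, 118, 125, 140]
chrx_bins = [80, 92, 75, 88, 79]
t_stat = welch_t(chrx_bins, autosome_bins)  # negative when chrX is sparser
```

A P-value would additionally require the t distribution, which the Python standard library does not provide; a statistics package would be needed for that step.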
For SNP-based microarrays, a threshold of ≥ 10 Mb has been suggested for reporting AOH [18]. In the clinical setting, AOH larger than 10 Mb in one chromosome is recommended for reporting the possibility of UPD [19, 20]. There were 291 AOH regions larger than 10 Mb in the 1KGP. For these AOHs, CNVseq-AOH predicted AOHs with a sensitivity of 100% (291/291) when the depth was ≥ 0.5-fold (Supplementary Table 1). With a depth of 0.1-fold, the sensitivity reached 99.66% (290/291). CNVseq-AOH provided a prediction sensitivity of 94.5% (275/291) even with a depth of 0.05-fold.
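The concordance criterion (at least 50% overlap with a CMA-detected AOH) and the 10 Mb size filter can be sketched as below. The `Region` class and function names are hypothetical, and measuring the overlap as a fraction of the reference (CMA) region is an assumption; the text does not state which region serves as the denominator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # inclusive

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def overlap_fraction(ref: Region, call: Region) -> float:
    """Fraction of the reference region covered by one call."""
    if ref.chrom != call.chrom:
        return 0.0
    shared = min(ref.end, call.end) - max(ref.start, call.start) + 1
    return max(shared, 0) / ref.length


def sensitivity(refs, calls, min_overlap=0.5, min_size=0):
    """Share of reference regions (>= min_size bp) matched by some call."""
    kept = [r for r in refs if r.length >= min_size]
    hits = sum(
        any(overlap_fraction(r, c) >= min_overlap for c in calls) for r in kept
    )
    return hits / len(kept) if kept else float("nan")


# Toy example (hypothetical coordinates): one matched and one missed region.
refs = [Region("chr6", 1, 20_000_000), Region("chrX", 5_000_001, 8_000_000)]
calls = [Region("chr6", 2_000_001, 19_000_000), Region("chrX", 7_000_001, 9_000_000)]
all_sizes = sensitivity(refs, calls)                       # 1 of 2 matched
large_only = sensitivity(refs, calls, min_size=10_000_000)  # only chr6 kept
```

Restricting to regions of at least 10 Mb drops the short chrX region from the denominator, which is why the two figures differ in this toy case.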
For 0.1-fold LP-WGS data, it takes an average of 11 min to process a single sample using an 8-core CPU with 8 GB of RAM (from data alignment to reporting), including an average of 10 min for alignment, 25 s for feature learning, and 10 s for AOH prediction and reporting.
Additional AOHs detected by CNVseq-AOH
CNVseq-AOH detected AOHs in addition to those detected by CMA. We analyzed these additional AOHs at a depth of 0.1-fold. A total of 267 additional AOHs were detected in the 409 samples by CNVseq-AOH, approximately 0.65 AOHs per sample. The number of additionally detected AOHs decreased with AOH length (Supplementary Fig. 1). We further validated these AOHs against high-coverage data by visualization using an in-house script. The results showed that 50.56% (135/267) of the additional AOHs were true positives (Supplementary Table 2; Supplementary Fig. 2). In the clinical setting, a threshold of > 10 Mb is recommended for reporting the possibility of UPD [19, 20]. Using a threshold of > 10 Mb, only 5 additional AOHs were detected by CNVseq-AOH in the 409 samples at 0.1-fold depth (Supplementary Table 2).
Interestingly, CNVseq-AOH detected an AOH region (seq[GRCh38] hmz(6)(p12.3q12) chr6:g. 47568317_64568317hmz) that spans the centromere of chromosome 6 in sample HG01980 (Fig. 3c, d). Although the microarray had sufficient markers in this region (Fig. 3a), CMA reported no AOH there, which indirectly reflects the detection performance of CNVseq-AOH for regions that cross the centromere. This AOH was further validated using high-coverage data, which also showed positive signals in this region (Fig. 3b).
Fig. 3 AOH region (seq[GRCh38] hmz(6)(p12.3q12) chr6:g. 47568317_64568317hmz) detected by CNVseq-AOH in chromosome 6 of HG01980. a Markers of the HumanOmni2.5–4 (the SNP-based microarray kit used in the 1KGP) in chromosome 6. The marker density shows that the chr6:g. 47568317_64568317 region is sufficiently covered. However, no AOH was reported in this region by CMA, which indirectly reflects the detection performance of CNVseq-AOH for regions that cross the centromere; b The number of heterozygous SNPs (yellow line) and of all SNPs (green line) in this region, calculated using high-coverage WGS data. The number of heterozygous SNPs (yellow line) in the chr6:g. 47568317_64568317 region was close to 0, which indicated a potential AOH event in this region; c Log-likelihood ratio for haploid and diploid in each bin. Each dot represents the mean log-likelihood ratio in each bin. For potential AOH regions, the log-likelihood ratio tends to approach 0, with relatively sparse blue dots above 0; d AOH prediction likelihoods of CNVseq-AOH using 0.1-fold depth. The closer the value is to 1, the higher the confidence that the predicted region is an AOH region; e Chromosome 6
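The per-bin mean log-likelihood ratio plotted in panel c can be illustrated with a short sketch. The record layout, the 1 Mb bin size, and the sign convention (haploid minus diploid log-likelihood) are assumptions made for illustration; the caption does not specify them.

```python
from collections import defaultdict
from statistics import mean


def mean_llr_per_bin(records, bin_size=1_000_000):
    """records: iterable of (position, loglik_haploid, loglik_diploid).

    Returns {bin_index: mean LLR}, with LLR = loglik_haploid - loglik_diploid.
    """
    by_bin = defaultdict(list)
    for pos, ll_hap, ll_dip in records:
        by_bin[(pos - 1) // bin_size].append(ll_hap - ll_dip)
    return {b: mean(v) for b, v in sorted(by_bin.items())}


# Toy records (hypothetical values): two fall in bin 0, one in bin 1.
records = [
    (100_000, -10.0, -12.0),
    (900_000, -8.0, -9.0),
    (1_200_000, -11.0, -10.5),
]
bin_means = mean_llr_per_bin(records)
```

Each entry of `bin_means` corresponds to one dot in a plot of this kind.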
RNN vs. hidden Markov model
RNNs and hidden Markov models (HMMs) are both widely used for processing sequential data. The HMM, a probabilistic model, is particularly effective for problems involving time series data. To our knowledge, no published study has used an HMM for the detection of AOH, so no existing HMM-based method can be cited. In this study, an HMM with Gaussian emissions (the “hmmlearn.hmm.GaussianHMM” class in Python) was used for AOH prediction. We built an HMM with 5 hidden states and a full covariance matrix and compared it with CNVseq-AOH for AOH prediction. The prediction sensitivity of CNVseq-AOH was better than that of the HMM-based method at all depths tested (Fig. 4).
Validation of CNVseq-AOH with 6 clinical cases
We further applied CNVseq-AOH to 6 clinical cases with previously detected AOHs (Table 1). A mean depth of 0.573-fold (raw reads) was obtained for each sample. Uniquely aligned high-quality reads (UAHRs) were used for the detection of AOH. A UAHR was defined as a read that was uniquely aligned to the human reference genome with a quality value of more than 20 per base, contained no partial adapter sequences, and had no more than 5% undetermined bases over the read length.
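The UAHR definition can be expressed as a simple filter. This is a sketch under stated assumptions: the `AlignedRead` record and its fields are hypothetical, "more than 20 per base" is read as requiring every base to exceed a Phred score of 20, and undetermined bases are taken to be those called as N.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlignedRead:
    sequence: str
    base_qualities: List[int]  # Phred score per base
    n_alignments: int          # number of reported alignment locations
    has_adapter: bool          # True if any partial adapter sequence remains


def is_uahr(read: AlignedRead, min_qual: int = 20, max_n_frac: float = 0.05) -> bool:
    """Apply the UAHR criteria from the text to one read."""
    if not read.sequence or read.n_alignments != 1 or read.has_adapter:
        return False
    if any(q <= min_qual for q in read.base_qualities):
        return False  # every base must exceed the quality threshold
    n_frac = read.sequence.upper().count("N") / len(read.sequence)
    return n_frac <= max_n_frac


# Toy reads (hypothetical): a clean 40-base read and one with too many Ns.
clean = AlignedRead("ACGT" * 10, [30] * 40, 1, False)
too_many_n = AlignedRead("N" * 3 + "A" * 37, [30] * 40, 1, False)
kept = [r for r in (clean, too_many_n) if is_uahr(r)]
```

In practice these fields would be derived from the aligner's output rather than set by hand.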
CNVseq-AOH detected all 15 known AOH regions (Table 1). In some cases with multiple known AOHs (Case 1, Case 2, and Case 5), CMA reported a greater number of AOH regions, probably because it split several AOH regions into sub-regions.
Discussion
The recurrent neural network (RNN) is a very popular class of neural network that is especially useful for sequential data. A neuron in an RNN uses its internal state to remember previous inputs and combines that memory with the current input to determine the next output state. RNNs are widely used in natural language processing (NLP) [21]. However, the application of RNNs in human genomic research is still rare. In this study, we described an RNN-based method, CNVseq-AOH, for predicting the absence of heterozygosity using LP-WGS. To the best of our knowledge, CNVseq-AOH is the first application combining population-based haplotype information, adjustable sliding windows, and an RNN in genetic testing. CNVseq-AOH shows the feasibility of using ultra-low sequencing depth for the detection of clinically significant AOHs and demonstrates its potential in genetic testing.
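The way a recurrent unit combines its internal state with the current input can be shown with a generic single-unit vanilla RNN. This is a textbook illustration, not the CNVseq-AOH network; the weights and inputs are arbitrary values chosen for the example.

```python
import math


def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a single-unit vanilla RNN over a sequence of scalars.

    h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)
    """
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)  # current input + previous state
        states.append(h)
    return states


# Same input value (1.0) at steps 1 and 3, but different hidden states,
# because the state at step 3 still carries information from earlier steps.
states = rnn_forward([1.0, 0.0, 1.0], w_x=0.5, w_h=0.8, b=0.0)
```

The difference between `states[0]` and `states[2]` is the "memory" effect described above: identical inputs yield different outputs depending on what came before.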
One of the key innovations of CNVseq-AOH is the use of population-based haplotype information. In our testing, the choice of population-based haplotype information greatly influenced the performance of CNVseq-AOH. For the 409 samples in the current study, ancestry-matched populations (or genetically similar populations) were used for analysis. We further compared, at 0.1-fold depth, the sensitivity obtained when using ancestry-matched populations for feature learning with that obtained when using all available haplotype information from multiple ethnicities. Using a threshold of 50% overlap with the AOHs detected by CMA, the sensitivity of CNVseq-AOH reached 93.40% (806/863) with ancestry-matched populations, but only 60.95% (526/863) with all available haplotype information from multiple ethnicities. The accuracy was also significantly reduced under the latter strategy (Supplementary Fig. 3). Not all populations are captured in the 1KGP, and the number of samples varies considerably between populations. This may influence the accuracy of our method and impede the wide application of CNVseq-AOH. Expanding the data collection to include new populations and samples may solve the problem.
One limitation of CNVseq-AOH is that it currently cannot be used for the detection of mosaic AOH. We therefore did not include the 4 cases with mosaic AOH in the testing. Based on the signals for these 4 cases (Supplementary Fig. 4), CNVseq-AOH has the potential to predict mosaic AOH. This may require a different model and a large number of ascertained positive cases with mosaic AOHs for model training, which is an interesting topic but beyond the scope of this study. Another limitation of the current study is the performance of CNVseq-AOH for the detection of AOHs on the X chromosome. With a depth of 0.1-fold, a detection sensitivity of only 59.49% was achieved for the 79 AOHs on the X chromosome in the 1KGP. In phase three of the 1KGP, variants on the X chromosome were phased without the pedigree-based correction using Eagle2 (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_phased/README_SNV_INDEL_phasing_111822.pdf), resulting in fewer SNPs per 1 Mb. As a result, the information on biallelic SNPs in the VCF file for the X chromosome is insufficient to calculate the probabilities for resampled reads. This is a limitation of the reference panel rather than of CNVseq-AOH itself: with sufficient information in the reference panel, CNVseq-AOH also has the potential to provide high prediction sensitivity for AOHs located on the X chromosome. Next, we plan to reanalyze these samples to optimize the performance of CNVseq-AOH for the detection of AOHs on the X chromosome.
In this study, we investigated the effect of sequencing depth on model performance in the 1KGP. In general, the prediction sensitivity of CNVseq-AOH increased with sequencing depth. However, data in the 1KGP were generated using various sequencing parameters (different sample types, library construction protocols, sequencing platforms, etc.), so the evaluation of sequencing depth may be biased. For clinical laboratories, depth evaluation using real clinical samples and uniform sequencing parameters is necessary before clinical application.
Conclusions
In summary, we developed a method for predicting the absence of heterozygosity using LP-WGS data, which overcomes the sparse nature of typical LP-WGS data by combining population-based haplotype information, adjustable sliding windows, and an RNN. Next, we plan to apply our method to pregnant women undergoing prenatal diagnosis to further evaluate the performance and potential utility of CNVseq-AOH in realistic clinical scenarios.
Availability of data and materials
The raw data of the cases with previously identified AOH events based on SNP-based microarrays and the high-coverage WGS data from the 1KGP are available in the European Nucleotide Archive (https://www.ebi.ac.uk/ena/browser/view/PRJEB31736?show=reads) under the accession number PRJEB31736. The data of the 6 clinical cases generated and analyzed during the current study are not publicly available because they are derived from patient samples and sharing them could compromise research participant privacy.
References
Liu J, He Z, Lin S, Wang Y, Huang L, Huang X, Luo Y. Absence of heterozygosity detected by single-nucleotide polymorphism array in prenatal diagnosis. Ultrasound Obstet Gynecol. 2021;57(2):314–23.
Potapova T, Gorbsky GJ. The consequences of chromosome segregation errors in mitosis and meiosis. Biology (Basel). 2017;6(1):12.
Rehder CW, David KL, Hirsch B, Toriello HV, Wilson CM, Kearney HM. American College of Medical Genetics and Genomics: standards and guidelines for documenting suspected consanguinity as an incidental finding of genomic testing. Genet Med. 2013;15(2):150–2.
Carvalho CM, Pfundt R, King DA, Lindsay SJ, Zuccherato LW, Macville MV, Liu P, Johnson D, Stankiewicz P, Brown CW, et al. Absence of heterozygosity due to template switching during replicative rearrangements. Am J Hum Genet. 2015;96(4):555–64.
Yauy K, de Leeuw N, Yntema HG, Pfundt R, Gilissen C. Accurate detection of clinically relevant uniparental disomy from exome sequencing data. Genet Med. 2020;22(4):803–8.
Dong Z, Zhang J, Hu P, Chen H, Xu J, Tian Q, Meng L, Ye Y, Wang J, Zhang M, et al. Low-pass whole-genome sequencing in clinical cytogenetics: a validated approach. Genet Med. 2016;18(9):940–8.
Sahoo T, Dzidic N, Strecker MN, Commander S, Travis MK, Doherty C, Tyson RW, Mendoza AE, Stephenson M, Dise CA, et al. Comprehensive genetic analysis of pregnancy loss by chromosomal microarrays: outcomes, benefits, and challenges. Genet Med. 2017;19(1):83–9.
Xiang J, Li R, He J, Wang X, Yao L, Song N, Fu F, Zhou S, Wang J, Gao X, et al. Clinical impacts of genome-wide noninvasive prenatal testing for rare autosomal trisomy. Am J Obstet Gynecol MFM. 2023;5(1):100790.
Wang H, Dong Z, Zhang R, Chau MHK, Yang Z, Tsang KYC, Wong HK, Gui B, Meng Z, Xiao K, et al. Low-pass genome sequencing versus chromosomal microarray analysis: implementation in prenatal diagnosis. Genet Med. 2020;22(3):500–10.
Chau MHK, Wang H, Lai Y, Zhang Y, Xu F, Tang Y, Wang Y, Chen Z, Leung TY, Chung JPW, et al. Low-pass genome sequencing: a validated method in clinical cytogenetics. Hum Genet. 2020;139(11):1403–15.
Dong Z, Chau MHK, Zhang Y, Yang Z, Shi M, Wah YM, Kwok YK, Leung TY, Morton CC, Choy KW. Low-pass genome sequencing-based detection of absence of heterozygosity: validation in clinical cytogenetics. Genet Med. 2021;23(7):1225–33.
Qian Y, Sun Y, Guo X, Song L, Sun Y, Gao X, Liu B, Xu Y, Chen N, Chen M, et al. Validation and depth evaluation of low-pass genome sequencing in prenatal diagnosis using 387 amniotic fluid samples. J Med Genet. 2023;60(10):933–8.
Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754–60.
Ariad D, Yan SM, Victor AR, Barnes FL, Zouves CG, Viotti M, McCoy RC. Haplotype-aware inference of human chromosome abnormalities. Proc Natl Acad Sci USA. 2021;118(46):e2109307118.
Delaneau O, Marchini J, Zagury JF. A linear complexity phasing method for thousands of genomes. Nat Methods. 2011;9(2):179–81.
Loh PR, Danecek P, Palamara PF, Fuchsberger C, Reshef YA, Finucane HK, Schoenherr S, Forer L, McCarthy S, Abecasis GR, et al. Reference-based phasing using the Haplotype reference consortium panel. Nat Genet. 2016;48(11):1443–8.
Lu Y, Jiang Y, Zhou X, Hao N, Lu G, Guo X, Guo R, Liu W, Xu C, Chang J, et al. Evaluation and analysis of Absence of Homozygosity (AOH) using chromosome analysis by medium coverage whole genome sequencing (CMA-seq) in prenatal diagnosis. Diagnostics (Basel). 2023;13(3):560.
Papenhausen P, Schwartz S, Risheg H, Keitges E, Gadi I, Burnside RD, Jaswaney V, Pappas J, Pasion R, Friedman K, et al. UPD detection using homozygosity profiling with a SNP genotyping microarray. Am J Med Genet A. 2011;155A(4):757–68.
Armour CM, Dougan SD, Brock JA, Chari R, Chodirker BN, DeBie I, Evans JA, Gibson WT, Kolomietz E, Nelson TN, et al. Practice guideline: joint CCMG-SOGC recommendations for the use of chromosomal microarray analysis for prenatal diagnosis and assessment of fetal loss in Canada. J Med Genet. 2018;55(4):215–21.
Liu W, Lu J, Zhang J, Li R, Lin S, Zhang Y, Wang Y, Yin A. A consensus recommendation for the interpretation and reporting of copy number variation and regions of homozygosity in prenatal genetic diagnosis. Zhonghua Yi Xue Yi Chuan Xue Za Zhi. 2020;37(7):701–8.
Rezaeenour J, Ahmadi M, Jelodar H, Shahrooei R. Systematic review of content analysis algorithms based on deep neural networks. Multimed Tools Appl. 2023;82(12):17879–903.
Acknowledgements
Not applicable.
Funding
This study was supported by the National Key R&D Program of China (No. 2023YFC2705600). This program is a non-profit government research project and had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Contributions
Conceptualization: Lijie Song, Xiaoyuan Xie, Zhonghua Wang, Yan Sun. Data Curation: Linlin Fan, Yun Yang, Xueqin Guo, Zhihong Qiao, Yun Li, Ting Jiang, Xiaoli Wang. Formal Analysis: Fei Tang, Zhonghua Wang, Yan Sun, Yaoshen Wang, Saiying Yan, Jianfen Man, Lina Wang. Funding Acquisition: Yan Sun, Yun Yang. Investigation: Fei Tang, Zhonghua Wang, Yan Sun, Yaoshen Wang, Saiying Yan, Jianfen Man, Lina Wang, Shunyao Wang. Methodology: Fei Tang, Zhonghua Wang, Yan Sun, Yaoshen Wang, Saiying Yan, Jianfen Man, Lina Wang. Project Administration: Lijie Song, Xiaoyuan Xie. Resources: Lijie Song, Xiaoyuan Xie, Zhonghua Wang, Yan Sun, Zhiyu Peng, Huanhuan Peng. Software: Fei Tang, Zhonghua Wang. Supervision: Lijie Song, Xiaoyuan Xie. Validation: Fei Tang, Zhonghua Wang, Yan Sun, Yaoshen Wang, Saiying Yan, Jianfen Man, Lina Wang. Visualization: Fei Tang, Zhonghua Wang, Yan Sun. Writing – Original Draft Preparation: Yan Sun. Writing – Review & Editing: All authors.
Ethics declarations
Ethics approval and consent to participate
This study and all the protocols were approved by the Institutional Review Board of BGI (No. BGI-IRB 22062). Informed consent for the anonymous usage of remaining samples and data for scientific research and possible publication was obtained from all participants. This study was performed in accordance with the principles of the Helsinki Declaration.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Tang, F., Wang, Z., Sun, Y. et al. Recurrent neural network for predicting absence of heterozygosity from low pass WGS with ultra-low depth. BMC Genomics 25, 470 (2024). https://doi.org/10.1186/s12864-024-10400-4
DOI: https://doi.org/10.1186/s12864-024-10400-4