Background

Sequencing of DNA (or RNA) can be achieved by translocating nucleic acids through a protein nanopore. By passing an electric current through the nanopore, a signal is measured that is representative of the chemical nature of the different nucleotides inside the pore. Capturing this current therefore yields a signal that can be translated into a DNA sequence. In 2014, Oxford Nanopore Technologies (ONT) released the first commercial sequencing devices based on this principle.

Basecalling is the process of translating the raw current signal into a DNA sequence [1]. It is a fundamental step because almost all downstream applications depend on it [2]. Basecalling is challenging for several reasons. First, the current signal level does not correspond to a single base but is dominated by the several nucleotides that reside inside the pore at any given time. Second, DNA molecules do not translocate at a constant speed, so the number of signal measurements is not a good estimate of sequence length; instead, changes in the signal must be detected to determine that the next base has entered the pore.

To address the basecalling challenge, a wide array of basecallers has been developed both by ONT and the wider scientific community. Basecallers evolved from statistical tests, to hidden Markov models (HMMs), and finally to neural networks [3, 4]. Wick et al. (2019) benchmarked Chiron [5] and four other ONT basecallers that were being developed at the time: Albacore, Guppy, Scrappie, and Flappie. Chiron built on developments in the speech-to-text field: it applied a convolutional neural network (CNN) to extract features from the raw signal, a recurrent neural network (RNN) to relate those features in a temporal manner, and a connectionist temporal classification (CTC) decoder [6] to avoid having to segment the signal. Since then, many other basecallers have been developed and published by the community (Fig. 1a), including Mincall.
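
This CNN-RNN-CTC design remains the template for many of the basecallers benchmarked here. A minimal PyTorch sketch of the idea (layer sizes and activations are illustrative and not taken from any particular model):

```python
import torch
import torch.nn as nn

class CTCBasecaller(nn.Module):
    """Minimal CNN -> RNN -> CTC basecaller skeleton (illustrative sizes)."""
    def __init__(self, n_classes=5):  # 4 bases + CTC blank
        super().__init__()
        # Convolutions extract local features from the raw current signal
        self.conv = nn.Sequential(
            nn.Conv1d(1, 256, kernel_size=19, stride=5, padding=9),
            nn.SiLU(),
        )
        # A bidirectional RNN relates the features across time
        self.rnn = nn.LSTM(256, 256, num_layers=3,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, n_classes)

    def forward(self, signal):          # signal: (batch, 1, time)
        x = self.conv(signal)           # (batch, 256, time/5)
        x = x.transpose(1, 2)           # (batch, time/5, 256)
        x, _ = self.rnn(x)
        # Per-timestep log-probabilities over {A, C, G, T, blank},
        # decoded with CTC so the signal never needs to be segmented
        return self.fc(x).log_softmax(-1)

# Training would pair this output with nn.CTCLoss on the reference sequence.
```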

Basecalling failure is a major determinant of reported performance

We first noted that the number of reads that fail basecalling varies substantially. Notably, Causalcall, Halcyon, and Mincall only managed to properly basecall 66%, 79%, and 87% of the reads, respectively, while the rest of the models managed > 90% (Fig. 2a). It is therefore critical to include such a metric in model evaluation, since a model could be skipping difficult-to-basecall reads, which would skew results towards a falsely higher performance.

Different methods prevail at different measures

We evaluated the performance of the different architectures using the alignment event rates (Fig. 2b). Bonito performed best in three of the four metrics, with the highest median match rate (90%) and the lowest median mismatch (2.5%) and deletion (4.3%) rates. Causalcall achieved the lowest median insertion rate (1.7%); however, it performed worst in the other three metrics, with the lowest median match rate (77.6%) and the highest mismatch (6%) and deletion (14.4%) rates. Halcyon showed the highest variation in performance rates, demonstrating that, in addition to the median, the distribution across reads is important to consider when comparing basecallers. It is therefore critical to report not only the error rates but also their distributions.

Homopolymer error rates are correlated with alignment event rates

Homopolymers are especially difficult to basecall: for long stretches of the same base the signal does not change, and since the DNA translocation speed is not constant, the number of measurements is not a good indicator of homopolymer length. For such stretches of DNA, Bonito performed best, with the lowest median error rate (14.9% averaged across all four bases) and the lowest median error rate for each base individually. Causalcall performed significantly worse than the rest of the models, with the highest median error rate (44.5% averaged across bases) (Fig. 2c). We observed a correlation between alignment event rates and homopolymer error rates; however, the latter are significantly higher, likely due to the inherent difficulty of basecalling such stretches.

Utility of PhredQ scores varies across methods

To evaluate the relationship between the predicted bases and their PhredQ scores, we first considered the distributions of the scores of correct and incorrect bases (Fig. 2d). For all model architectures, correct bases have higher scores than incorrect bases; Causalcall has the smallest overlap between the two distributions (0.7%), followed by the rest of the models with similar overlaps (6–8%), except Halcyon (12%) and Bonito (32%). Second, we calculated AUCs by sorting the reads based on their average quality scores and determining the area under the normalized cumulative score (Fig. 2e). All architectures showed a correlation between read quality and average match rate. Not surprisingly, Bonito performed best with an AUC of 0.91; CATCaller and SACall are, however, close contenders, both with an AUC of 0.886. Importantly, each model has its own PhredQ score offset that determines how its quality scores are calibrated. As a result, quality scores across models, even when compared in a standardized benchmark, are not directly numerically comparable.

The signatures of basecalling errors

Finally, we evaluated the different types of mistakes in the context of the two neighboring bases in the basecalls (Additional file 1: Fig. S2). In general, these “error signatures” reveal that the accuracy of the models differs depending on the predicted base context. Across all models, cytosine has low error rates (\(\approx\)10%) when predicted in a CCT or TCT context; however, it can have significantly higher error rates (> 30%) when predicted in the context of the 3-mers TCC or TCG. We noticed that many of the contexts with higher error rates contain a CG motif, suggesting the increased error might be due to the potential methylation status of cytosine. To evaluate whether specific models have particular error biases, we performed hierarchical clustering on the pairwise Jensen-Shannon divergences between the error signatures (Additional file 1: Fig. S3). This revealed that Causalcall and Halcyon are the two most different models in terms of “error signatures”; the rest of the models have similar error profiles (lowest Pearson correlation coefficient between them: 0.95). We conclude that basecalling errors are biased, since they are not uniformly distributed across the 3-mer contexts. However, the error profiles are very similar between basecallers, suggesting that training data may play a stronger role in defining these error biases than the architecture of the model itself.

Architecture analysis

The benchmarking setup also allows straightforward investigation of which components of the neural networks provide the main performance gains. To this end, we created novel architectures by combining the convolution, encoder, and decoder modules from existing basecallers, as well as some additional modules. We again used the human task to evaluate the different models. In total, ninety different models were evaluated and ranked based on the sum of rankings across all metrics (Fig. 3a, Additional file 1: Fig. S4). Of the original models, Bonito again performed best but reached only 9th place in the overall ranking; our grid search thus revealed eight new model architectures that perform better in general. However, the improvements in performance made by these models are small; for example, in comparison to the Bonito model, the improvements in alignment event rates and homopolymer error rates are smaller than 1%, suggesting that we may be reaching the performance limit obtainable with the training data used.

Fig. 3

Benchmark of architecture components. a Top 25 best performing model combinations. b Comparison of CTC to CRF decoders. c Comparison of simple (Bonito, CATCaller, SACall) to complex (Causalcall, Mincall, URNano) convolutions. d Comparison of bidirectional LSTM depth (1, 3, or 5 layers). e Comparison of RNN to Transformer encoders

CRF decoder is vastly superior to CTC

We observed that most of the high-performing models used the CRF decoder module. We therefore compared the change in performance between pairs of models whose only difference was the decoder (Fig. 3b, Additional file 1: Fig. S5). For almost all models, using a CRF decoder leads to a general improvement in performance, with a mean increase in match rate of 4% and mean decreases in mismatch, insertion, and deletion rates of 1%, 1%, and 2%, respectively. The exceptions are models that used the Mincall or URNano convolution modules, which show a mean increase in insertion rates of 1%, although their other alignment rate metrics still improve significantly. This is in concordance with our previous results, where Causalcall and URNano demonstrated the lowest insertion rates of all models, indicating that it is their convolutional architectures that boost performance on this type of metric. Notably, a decrease in homopolymer error rates is also observed for the models with Mincall or URNano convolution modules that include a CRF decoder; results for the other models are more varied and depend on the base. Consistent with these improvements, we observe an average improvement of 3% in the AUCs. However, the PhredQ overlap between correct and incorrect predictions worsened, with a median increase of 30%.

Complex convolutions are most robust, but simple convolutions are still very competitive

Another main architectural difference is the complexity and depth of the convolutional layers, ranging from two or three simple convolutional layers, as in Bonito, CATCaller, and SACall, to more elaborate convolutional modules, as in Causalcall, URNano, or Mincall. We find that the top four ranked models use the URNano or Causalcall convolution architecture (Fig. 3a), while the six models that follow all use one of the simpler CNNs. More complex convolutional architectures, specifically Causalcall and URNano, perform better in general (Fig. 3c, Additional file 1: Fig. S6). Simple convolutional architectures can perform as well or even better; however, they are more dependent on the encoder architecture that follows.

RNNs are superior to transformers and are depth dependent

Transformer layers have gained popularity in other fields due to increased performance and speed.

Models

Model architectures were recreated using PyTorch (v1.9.0) based on their descriptions in the corresponding publications and GitHub repositories. If the original implementation was in PyTorch, code was reused as much as possible (Additional file 1: Fig. S13-27).

Model training

Non-recurrent models (all except Halcyon) were trained for 5 epochs with a batch size of 64. All models were trained on the same task data, which was also given as input in the same order. Models' initial random parameters were initialized via a uniform distribution with values ranging from − 0.08 to 0.08. Reads were sliced into non-overlapping chunks of 2000 data points. Models were trained using an Adam optimizer (initial learning rate = \(1e^{-3}\), \(\beta _1\) = 0.9, \(\beta _2\) = 0.999, weight decay = 0). As a warm-up, the learning rate was first increased linearly from 0 to the initial learning rate of the optimizer over 5000 training steps; it was then decreased using a cosine function until the last training step, to a minimum of \(1e^{-5}\). To improve model stability, gradients were clipped between − 2 and 2. Halcyon was trained similarly to the non-recurrent models with the following differences: the model was trained first for 1 epoch with non-overlapping chunks of 400 data points, then for 2 epochs with chunks of 1000 data points, and finally for 2 epochs with chunks of 2000 data points. This was necessary because training directly on chunks of 2000 data points led to unstable model training; this phenomenon is also described in the original Halcyon publication [17], requiring this transfer learning approach to ameliorate the issue. Recurrent models were also trained without warm-up and with a scheduled sampling rate of 0.75. During training, 5% of the training data was used for validation, from which accuracy and loss were calculated without gradients. Validation data was the same for all models. The state of the model was saved every 20,000 training steps, and the model state with the best validation accuracy was chosen. Models were evaluated on held-out test data from the task being evaluated.
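
A sketch of how this optimization setup can be expressed in PyTorch (the model and total step count are placeholders; this mirrors the description above rather than the exact training code):

```python
import math
import torch

model = torch.nn.Linear(1, 1)   # placeholder for a basecaller model
total_steps = 100_000           # placeholder for the real training length
base_lr, min_lr, warmup_steps = 1e-3, 1e-5, 5000

optimizer = torch.optim.Adam(model.parameters(), lr=base_lr,
                             betas=(0.9, 0.999), weight_decay=0)

def lr_factor(step):
    if step < warmup_steps:            # linear warm-up from 0 to base_lr
        return step / warmup_steps
    # cosine decay from base_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return (min_lr + (base_lr - min_lr) * cosine) / base_lr

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)

# Per training step:
#   loss.backward()
#   torch.nn.utils.clip_grad_value_(model.parameters(), 2)  # clip to [-2, 2]
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```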

Original model recreation and benchmark

URNano used cross-entropy as its loss; however, since the objective of the benchmark was basecalling and not signal segmentation, we used a CTC decoder instead. All other models were recreated as described in their respective publications; when in doubt, their GitHub implementations were used as reference.

Comparison of original models to model recreations

Since Causalcall and Halcyon performed worse than the rest of the models, we evaluated the original Causalcall, Halcyon, and Guppy models published by their authors and compared them against our PyTorch implementations (Additional file 1: Fig. S28, Additional file 1: Table S5). When evaluating the original Halcyon, we were unable to basecall all 25k reads in the test set due to memory limitations; we therefore compared our recreation based only on the \(\approx\)13k reads from the human task test set that were basecalled by the original Halcyon model. We used Guppy (v5.0.11, the latest version) for the comparison between the original models and our PyTorch implementations. In terms of reads that we consider evaluable, we see small differences (less than 3%) between the original versions and our implementations of Causalcall and Guppy. However, we see some differences in the types of failed reads: the original Causalcall has 7% more reads that failed mapping, whereas our recreation had 3% more reads with short alignments. Surprisingly, basecalls from the original Halcyon produce only 46% of reads suitable for evaluation (30% less than our recreation): a substantial 34% of reads failed mapping to the reference (30% more than our recreation), and there is also a smaller increase in reads with short alignments to the reference (19%, 5% more than our recreation) (Additional file 1: Fig. S28a). Regarding our implementation of Guppy, we find that differences are small, with at most a 2% difference. We then looked at the alignment event rates (Additional file 1: Fig. S28b). Differences between the two Guppy models were very small, the largest being a 1.3% higher match rate for the original version. The original Causalcall showed improved match performance, with an increased match rate (2%) and a decreased deletion rate (6%); however, it also showed a slight increase in mismatch (1%) and insertion (1%) rates as well as higher variability across reads. Finally, the original Halcyon performed worse in all metrics except deletion rate; however, its performance is less variable across reads. Homopolymer error rates show a similar trend (Additional file 1: Fig. S28c): the original Causalcall performs significantly better, with an average error rate (25.6%) closer to that of the other models, while the other two models show very similar performances. We finally compared the models regarding their PhredQ scores. When comparing Bonito to Guppy, we saw a large difference in the scale of the scores (Additional file 1: Fig. S28d); however, Guppy still had an overlap between the distributions of 28%. On the other hand, the original Causalcall showed a significant increase in the overlap between distributions (48%) (Additional file 1: Fig. S28e). In line with the event rate results, the original versions of Causalcall and Guppy performed slightly better than our recreated counterparts, with AUCs of 0.837 and 0.937, respectively. The original Halcyon does not report any PhredQ scores. From these results, we concluded that although there are some differences between the original models and our recreations, these are minor and can be attributed to training strategies and the data used.

Architecture analysis

Most models contain a convolutional module that feeds directly into an encoder (recurrent/transformer) module. To be able to combine modules from different models without changing the original number of channels, we included a linear layer between the convolution and encoder modules to up-scale or down-scale the number of channels. After this additional linear layer, we applied the last activation function of the preceding convolutional module. Contrary to the other models, the convolution modules from URNano and Causalcall do not reduce the number of input timepoints. For those modules, we also included an extra convolution layer with the same configuration as the last convolution layer in Bonito (kernel size = 19, stride = 5, padding = 9) and the same number of channels as the last convolutional layer of URNano or Causalcall. This convolution layer was necessary in order to use transformer encoders and/or a CRF decoder due to memory requirements. We also included three encoder architectures not used by any of the original models: one, three, or five stacked bidirectional LSTM layers with 256 channels each.
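
As an illustration, a hedged sketch of such an adapter in PyTorch (the function and parameter names are ours; a 1×1 convolution serves as the per-timestep linear layer):

```python
import torch.nn as nn

def make_adapter(conv_channels, encoder_channels,
                 activation=nn.SiLU, downsample=False):
    """Bridge a convolution module to an encoder module.

    A 1x1 convolution (equivalent to a per-timestep linear layer) rescales
    the channel count, followed by the last activation of the preceding
    convolutional module. For modules that keep the original number of
    timepoints (URNano, Causalcall), an extra convolution mirroring
    Bonito's last layer downsamples the signal by a factor of 5 first.
    """
    layers = []
    if downsample:
        layers.append(nn.Conv1d(conv_channels, conv_channels,
                                kernel_size=19, stride=5, padding=9))
    layers += [nn.Conv1d(conv_channels, encoder_channels, kernel_size=1),
               activation()]
    return nn.Sequential(*layers)
```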

Evaluation metrics

Evaluation metrics are based on the alignment between the predicted sequence and the reference sequence. Alignment is done using Minimap2 (v2.21) [29] with the ONT configuration for all metrics except accuracy. Accuracy is based on the Needleman-Wunsch global alignment algorithm implemented in Parasail (v1.2.4) [30]. The global alignment is configured with a match score of 2, a mismatch penalty of 1, a gap opening penalty of 8, and a gap extension penalty of 4. Accuracy is used during training to select the best-performing state of each model based on the validation fraction of the data. During training, only short sequences have to be aligned; during testing, however, complete reads have to be aligned, for which Minimap2 is necessary.
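
Assuming Parasail's Python bindings, the described global alignment configuration might look as follows (the sequences are placeholders):

```python
import parasail

# Needleman-Wunsch global alignment: match +2, mismatch -1,
# gap open 8, gap extend 4 (gap penalties are positive in Parasail)
matrix = parasail.matrix_create("ACGT", 2, -1)

basecall, reference = "ACGTACGTAC", "ACGTACGAAC"  # placeholder sequences
result = parasail.nw_trace(basecall, reference, 8, 4, matrix)

print(result.score)
print(result.cigar.decode)  # alignment trace used to count matched bases
```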

Accuracy

Accuracy is defined as the number of matched bases in the alignment divided by the total number of bases in the reference sequence.
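
In formula form, with \(N_{match}\) the number of matched bases in the alignment and \(L_{ref}\) the length of the reference sequence:

\[ \mathrm{accuracy} = \frac{N_{match}}{L_{ref}} \]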

Alignment rates

Match, mismatch, insertion, and deletion rates are calculated as the number of events of each case divided by the length of the reference unless otherwise stated.
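
Equivalently, for each event type \(e \in \{\mathrm{match}, \mathrm{mismatch}, \mathrm{insertion}, \mathrm{deletion}\}\), with \(N_{e}\) the number of such events in the alignment:

\[ r_{e} = \frac{N_{e}}{L_{ref}} \]

Because matches, mismatches, and deletions each consume a reference base while insertions do not, the four rates need not sum to exactly one.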

Homopolymer error rates

Homopolymer regions are defined as consecutive sequences of the same base of length 5 or longer. Error rates on homopolymer regions are calculated by counting the number of homopolymers with errors (one or more mismatches, insertions, or deletions) and dividing it by the number of homopolymer bases.
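
As an illustration, homopolymer regions under this definition can be located with a regular expression (a sketch, not the benchmark's actual implementation):

```python
import re

def homopolymer_regions(sequence, min_len=5):
    """Return (base, start, end) for runs of one base of length >= min_len."""
    pattern = "|".join(f"{b}{{{min_len},}}" for b in "ACGT")
    return [(m.group()[0], m.start(), m.end())
            for m in re.finditer(pattern, sequence)]

# Example: homopolymer_regions("ACGTTTTTTAC") -> [('T', 3, 9)]
```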

PhredQ scoring

PhredQ scores are calculated using the fast_ctc_decode library from ONT. Average quality scores are calculated for the correct and for the incorrect bases of each read, and the differences between the mean scores of correct and incorrect bases are reported. AUCs are calculated by sorting the basecalled reads according to their mean Phred quality score and calculating the average match rate for cumulative fractions of reads in steps of 50.
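
A sketch of how such an AUC might be computed, assuming per-read arrays of mean quality scores and match rates (the function and array names are ours):

```python
import numpy as np

def phredq_auc(mean_q, match_rate, step=50):
    """Sort reads by mean Phred quality (best first), compute the average
    match rate of the top-k reads for k growing in steps of `step`, and
    return the area under that curve with the read axis normalized to 1."""
    order = np.argsort(mean_q)[::-1]             # highest quality first
    sorted_match = np.asarray(match_rate)[order]
    ks = np.arange(step, len(order) + 1, step)   # cumulative read counts
    cum_match = np.array([sorted_match[:k].mean() for k in ks])
    x = ks / len(order)                          # cumulative fraction of reads
    return np.trapz(cum_match, x)                # area under the curve
```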

Error profiles

Error profiles are calculated for all 3-mers by counting the number of events (mismatches for each base, insertions, and deletions) in the context of the two neighboring bases of the event itself, according to the basecalls. Rates for each event are calculated by dividing each event count by the total number of occurrences of that 3-mer in the read. Error profiles are also calculated for each base independently of context. Randomness of error is defined as the Jensen-Shannon divergence between each 3-mer error profile and its corresponding base error profile.
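
A minimal sketch of the divergence used for these comparisons, assuming two error profiles given as rate vectors over the same set of 3-mers:

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) is the KL divergence

def js_divergence(p, q):
    """Jensen-Shannon divergence between two error profiles."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()   # normalize to proper distributions
    m = 0.5 * (p + q)                 # mixture distribution
    return 0.5 * entropy(p, m) + 0.5 * entropy(q, m)
```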

Software and hardware requirements

Packages and their versions used for training and evaluation can be found in Additional file 1: Table S6. All analyses were run on Python 3.7.8 and CUDA version 10.2. We used the following hardware: 32 CPU cores and 64 GB of RAM for data processing and model performance evaluation (these requirements can be reduced at the expense of longer compute time); and 4 CPU cores, 128 GB of RAM, and 1 NVIDIA RTX 6000 GPU for model training and basecalling.