Introduction

Cryptosporidiosis is a neglected disease caused by apicomplexan parasites in the genus Cryptosporidium. It has devastating impacts on the most vulnerable, especially infants and the immunosuppressed [1]. As the scope and significance of this infectious disease have become apparent [2, 3••], the research community has responded. Over the last decade, there have been significant advances in genetics, genomics, the ability to culture the parasite and perform high-throughput screening, host–pathogen interactions, surveillance, and therapeutics. The advances in these areas have yielded large quantities of associated genomic data (Table 1) that are fueling advances in our understanding of Cryptosporidium, its life cycle [4, 5], its evolution [6], transmission [7••], and host–pathogen interactions [8, 9, 10••]. These data will also facilitate the design of better surveillance tools for local, regional, and hopefully, global use.

Table 1 Currently available Cryptosporidium genomic and transcriptomic data

Genomic data, however, are not without their challenges. Historically, Cryptosporidium genome sequences have been very hard to generate due to a lack of pure parasite material. This challenge has recently been overcome for genome sequence generation, but not for transcriptomics of post-infection Cryptosporidium life cycle stages. Another challenge arises from the fact that most available data were created with short-read sequencing approaches. While Illumina sequencing is highly accurate, it cannot yield complete genome assemblies and poses significant challenges for the analysis of gene families and repetitive sequences. Currently, cloning is also impossible for Cryptosporidium. Thus, nearly all genomic and transcriptomic data have been generated using populations of parasites rather than purified isogenic clones. This fact creates considerable challenges for genome assembly, data analysis, and interpretation. Here we highlight key advances, remaining challenges and future prospects.

Technological Advances Have Facilitated Cryptosporidium Genomics

Whole Genome Sequencing

With the advent of high-accuracy second-generation sequencing (Illumina short reads) and the large-fragment sequencing capabilities of third-generation sequencing (Pacific Biosciences and Oxford Nanopore Technologies, ONT), Cryptosporidium genome sequences are being generated at an increased rate [11]. Currently, there are 74 Cryptosporidium genome sequence assemblies in NCBI GenBank, and more than half have been submitted since 2018. Whole-genome sequences are needed to facilitate the research community’s ability to design and interpret their experiments. As additional genome sequences become available for new species and strains, a framework for a more holistic comparative genomic analysis is being constructed. The power of comparative insights is significant [12, 13••, 14••, 15]. For example, the addition of a small number of genome sequences provided considerable insight into the diversity and evolution of C. parvum, revealing the existence of an anthroponotic subclade that was likely shaped via introgression of DNA from other Cryptosporidium species and subtypes [6].

Single-Oocyst Sequencing

Genome sequence generation for Cryptosporidium has historically been quite difficult due to the large number of oocysts required for DNA preparation. Oocysts, which contain four haploid sporozoites, have ~ 40 fg of genomic DNA. Most clinical samples do not contain a sufficient number of oocysts to reach the minimum DNA requirements for sequencing library preparation. Thus, important isolates have historically been propagated in immunosuppressed mice or gnotobiotic pigs. This process is difficult, expensive, and time-consuming. However, with the advent of single-oocyst sequencing, the possibility of generating genomic sequences from a single oocyst is a reality [16]. The protocol involves oocyst sorting, lysis, genome amplification with multiple displacement amplification (MDA), and sequencing with short-read Illumina [16]. This technique has recently been modified to utilize long-read ONT sequencing [17••, 18]. Given the obligately sexual nature of Cryptosporidium and the existence of four related haploid sporozoites within an oocyst, single-oocyst sequencing is also a promising technique for studies of diversity within a single infection [16]. Single-oocyst sequencing also creates an avenue for studying diversity and recombination events within a single oocyst [17••].
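The scale of the input problem can be made concrete with a little arithmetic. The sketch below uses only the ~40 fg of genomic DNA per oocyst given above; the DNA input amounts, the recovery fraction, and the function name oocysts_needed are hypothetical values chosen for illustration and do not describe any particular library preparation kit.

```python
import math

# ~40 fg of genomic DNA per oocyst (four haploid sporozoites), as stated in the text.
FG_PER_OOCYST = 40.0
FG_PER_NG = 1_000_000  # 1 ng = 10^6 fg


def oocysts_needed(required_dna_ng, recovery_fraction=1.0):
    """Minimum number of oocysts needed to supply `required_dna_ng` of DNA.

    `recovery_fraction` is a hypothetical extraction efficiency (1.0 = no loss).
    """
    if required_dna_ng <= 0:
        raise ValueError("required_dna_ng must be positive")
    if not 0 < recovery_fraction <= 1:
        raise ValueError("recovery_fraction must be in (0, 1]")
    required_fg = required_dna_ng * FG_PER_NG
    return math.ceil(required_fg / (FG_PER_OOCYST * recovery_fraction))


if __name__ == "__main__":
    # Hypothetical input amounts, for illustration only.
    for ng in (1.0, 100.0):
        print(f"{ng:>6.1f} ng of DNA -> at least {oocysts_needed(ng):,} oocysts")
```

Under these assumptions, even 1 ng of DNA corresponds to roughly 25,000 oocysts with perfect recovery, which illustrates why amplification from a single oocyst changes what can be sequenced.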

Hybrid Capture from Fecal DNA Samples

Hybrid capture, i.e., the selective enrichment of particular DNA sequences via hybridization to long, single-stranded RNA probes representing the target genome sequence of interest [19], is an ideal approach for isolating Cryptosporidium genomic DNA from fecal DNA samples. Most clinical Cryptosporidium fecal DNA samples contain abundant microbial, food, and human DNA content. Recently, hybrid capture has proven tractable for fecal DNA samples with a Cryptosporidium qPCR Ct value of < 20 and for much higher values if a double enrichment is performed (Bayona et al., in prep). The use of hybridization capture has made fecal DNA samples accessible for genome sequence generation. This development will permit samples from numerous studies, currently sitting in freezers, to be analyzed. Importantly, the hybrid capture baits can be customized to have a wider sequence divergence range to facilitate detection of less common human-infecting species. Smaller subsets of probes can be tailored for specific regions of the genome to provide a multi-locus approach to quickly screen large numbers of samples and facilitate outbreak investigations.

Genomic and Transcriptomic Data Are Abundant yet Incomplete

As we can see in Table 1, there are over 1000 genomic data sets for C. parvum and C. hominis; yet, there are only nine assembled and annotated genome sequences. Outside of these two prominent human-infecting species, the situation is bleak. A few dozen genome sequence data sets exist for all other species, and more than half of all named species have no genomic sequence data and are missing from Table 1.

Most existing Cryptosporidium genome sequences also present several challenges for the community. Most were generated using only short-read technologies that produce assemblies that contain gaps and compressed sequence regions (Fig. 1). A complete telomere-to-telomere (T2T) chromosomal assembly would contain eight chromosomes. Although the karyotype is unknown for most species, current assemblies contain dozens to hundreds of contigs, few telomere sequences, and many unassembled reads. Genome sequence assembly gaps most often arise in genome regions that contain repetitive sequences, making short reads difficult to place. Long stretches of repetitive sequence also generate gaps, as does the merger of recent gene duplications that reside in multiple locations in the genome (Fig. 1). Diversity within the population of parasites being sequenced can also create gaps because some parasites may possess structural variants like indels and inversions or differences in gene family or repeat copy numbers. When genome sequences contain gaps, it is difficult to know if genes are actually missing, thus posing significant challenges for comparative genomics.
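A rough completeness check along these lines can be sketched in a few lines of code: count the contigs and ask how many begin and end with a telomeric repeat. This is a minimal illustration, not a replacement for dedicated assembly-evaluation tools. The default motif TTTAGG, the window size, the minimum repeat count, and all function names are assumptions introduced here; the motif in particular should be verified for the species under study. Only the expected count of eight chromosomes is taken from the text.

```python
def parse_fasta(text):
    """Return {contig_name: sequence} from FASTA-formatted text."""
    contigs, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            contigs[name] = []
        elif name is not None:
            contigs[name].append(line.upper())
    return {n: "".join(parts) for n, parts in contigs.items()}


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def has_telomere(end_seq, motif, min_repeats):
    """True if a tandem run of the motif (either strand) occurs in end_seq."""
    return (motif * min_repeats in end_seq
            or revcomp(motif) * min_repeats in end_seq)


def summarize_assembly(contigs, motif="TTTAGG", min_repeats=3,
                       window=60, expected_chromosomes=8):
    """Count contigs, telomere-bearing ends, and telomere-to-telomere contigs.

    The motif and thresholds are assumed defaults, not validated parameters.
    """
    t2t, telomeric_ends = [], 0
    for name, seq in contigs.items():
        left = has_telomere(seq[:window], motif, min_repeats)
        right = has_telomere(seq[-window:], motif, min_repeats)
        telomeric_ends += int(left) + int(right)
        if left and right:
            t2t.append(name)
    return {
        "n_contigs": len(contigs),
        "telomeric_ends": telomeric_ends,
        "t2t_contigs": t2t,
        "looks_complete": (len(contigs) == expected_chromosomes
                           and len(t2t) == expected_chromosomes),
    }


if __name__ == "__main__":
    # Toy FASTA: one contig capped by repeats at both ends, one without any.
    toy = (">chr1 toy\n" + "CCTAAA" * 4 + "ACGT" * 20 + "TTTAGG" * 4 + "\n"
           ">ctg2\n" + "ACGT" * 30 + "\n")
    print(summarize_assembly(parse_fasta(toy)))
```

A fragmented short-read assembly would show many contigs and few telomeric ends in such a summary, whereas a T2T assembly would show eight contigs with telomeric repeats at all sixteen ends.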

Fig. 1

Genome assembly impacts annotation quality, gene family member estimates, and genetic variation analyses. A Long-read assembly can clearly identify all three copies of gene 1, and the average read depth is uniform, suggesting the assembly does not contain compressed, i.e., merged, assembled sequence in this area. B Short-read assembly cannot separate the three closely related gene 1 family members, as evidenced by the read pile-up. The phenomenon is called a compression since three genes are merged and annotated as only one gene. Also, since the ends of the gene 1 reads are different for each gene copy, the contig cannot be extended, and a gap in the assembly is generated. Compressions do not only affect the determination of gene number; they also affect estimates of genetic variation. Reads that were generated from different family members are all mapped to one locus; thus, the estimate of variation is artificially high. This image was created with BioRender.com
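The read pile-up described in the legend suggests a simple screen for compressions: compare the read depth of each window with the typical depth across the assembly. The sketch below is a minimal illustration of that idea, assuming per-window depths have already been computed; the function name flag_compressions, the 1.5-fold threshold, and the toy depths are hypothetical, and real data would also need to account for coverage bias and mixed parasite populations.

```python
from statistics import median


def flag_compressions(window_depths, min_ratio=1.5):
    """Flag windows whose read depth is elevated relative to the median.

    Returns a list of (window_index, depth_ratio, estimated_copies).
    `min_ratio` is an assumed threshold, not a validated cutoff.
    """
    baseline = median(window_depths)
    if baseline <= 0:
        raise ValueError("median depth must be positive")
    flagged = []
    for index, depth in enumerate(window_depths):
        ratio = depth / baseline
        if ratio >= min_ratio:
            # A depth near N times the baseline is consistent with N merged copies.
            flagged.append((index, ratio, round(ratio)))
    return flagged


if __name__ == "__main__":
    # Toy depths: window 3 carries about three times the typical coverage.
    depths = [50, 48, 52, 150, 49, 51, 50]
    for index, ratio, copies in flag_compressions(depths):
        print(f"window {index}: {ratio:.1f}x median depth, ~{copies} merged copies")
```

In this toy case the single elevated window is consistent with three gene copies collapsed into one locus, mirroring panel B of Fig. 1.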

The inclusion of long-read sequencing approaches and hybrid genome assemblies utilizing both long- and short-read approaches is the answer. Long reads, which can reach 100 + kb in length, can cover large genomic regions permitting an exact determination of repeat or gene copy numbers (Fig. 1) and provide proof of genome rearrangements. Depending on population numbers, contigs for differing genotypes within a population can be obtained, i.e., evidence for parasites that have two vs three copies of a particular gene in the same isolate.

When looking at RNA sequence data, the landscape is barren (Table 1), and most existing data sets are a combination of host and parasite transcripts since purification of post-infection parasites remains tenuous at best [20]. There are 145 RNA data sets for C. parvum, four for C. hominis, and one for C. baileyi. This paucity of data has significant consequences for the community. It means that the few genome sequences that have annotation have had to rely on orthology and de novo gene prediction alone. As a result, species-specific genes are very difficult to discover, ncRNA genes will be missed, and untranslated regions (UTRs) will be unannotated. Without annotated UTRs, transcription initiation sites are unknown, making it difficult to know where promoters are located, and studies of post-transcriptional regulation, which often involve sequences in the 5′ and 3′ UTRs, are impossible.

Comparative Genomics of Cryptosporidium Species Yields Informative Insights

Many Human-Infecting Species Are Closely Related

Genome sequences for the species most often observed in humans are highly similar and highly syntenic [6, 21, 22]. Several recent papers have also demonstrated the complex population genomic structure of C. parvum and C. hominis in natural infections [6, 7••, 15, 23] and highlighted the role that recombination and introgression have played during evolution [6, 14••, 15]. These and other works also show the impact that recombination can have with respect to the generation of novelty with proven impacts on transmission [7••]. Comparative genomics has also revealed the strikingly close relationship between the genome sequences of C. cuniculus, which infects rabbits and humans, and C. hominis [6]. Interestingly, an even closer genomic relationship was observed between the genome sequences of C. parvum and C. tyzzeri, yet C. tyzzeri only infects mice [24].

Recently, a genetic cross was reported not only within C. parvum but also between C. parvum and C. tyzzeri [17••]. In addition to being a major genetics breakthrough for Cryptosporidium research, this finding, combined with the recombination and introgression observations above, raises questions regarding the definition of what constitutes a species in Cryptosporidium. For many reasons, we are not advocating changes, only the recognition of just how similar some subclades and species are to each other at the genomic sequence level and how little we know about their host range [25]. C. parvum shares 96.8% identity with C. hominis, 97.2% identity with C. tyzzeri, 97% identity with C. cuniculus, and 91.3% identity with C. meleagridis (% identity is average nucleotide identity). These species also share almost complete synteny (gene order and orientation) and appear to differ by only a few sub-telomeric genes, if any [24, 26••], with the exception of C. meleagridis, which appears, on the basis of long reads, to have a few dozen small intra- and inter-chromosomal rearrangements relative to the other species [27]. Thus, the genetic basis of host preference and pathogenicity may extend beyond gene content differences to include single nucleotide variants, small indels, and possible differences in gene regulation. It is worth noting that significant differences in gene content between these species are found with short-read analyses, including one in which some of us have participated [22, 28], highlighting the impact of technology and assembly quality on downstream analyses. Notably, synteny with species outside of this group, for which we have genome sequences, no longer extends for the full length of the chromosome and instead is broken down into smaller units of recognizable synteny [11, 29].
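For readers unfamiliar with the metric, average nucleotide identity can be thought of as the percentage of matching bases across the aligned portions of two genomes. The sketch below shows one simple way to compute such a value, as a length-weighted identity over pre-aligned blocks with gap columns ignored. It is an illustration under stated assumptions rather than the method used to obtain the percentages above: dedicated tools apply their own fragmenting and filtering rules, and the function names and toy alignments here are hypothetical.

```python
def block_identity(seq_a, seq_b):
    """Return (matches, compared_columns) for one pairwise-aligned block.

    Columns with a gap ("-") in either sequence are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
               if a != "-" and b != "-"]
    matches = sum(a == b for a, b in columns)
    return matches, len(columns)


def average_nucleotide_identity(aligned_blocks):
    """Length-weighted percent identity over all aligned blocks."""
    total_matches = total_columns = 0
    for seq_a, seq_b in aligned_blocks:
        matches, columns = block_identity(seq_a, seq_b)
        total_matches += matches
        total_columns += columns
    if total_columns == 0:
        raise ValueError("no aligned columns to compare")
    return 100.0 * total_matches / total_columns


if __name__ == "__main__":
    # Toy alignment blocks: 19 matching bases out of 20 compared columns.
    blocks = [("ACGTACGTAC", "ACGTACGTAC"),
              ("ACGTACGTAC", "ACGTTCGTAC")]
    print(f"ANI = {average_nucleotide_identity(blocks):.1f}%")
```

Because only aligned regions contribute, a high value of this kind says nothing about sequence that is present in one genome and absent, or unassembled, in the other.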

Subtelomeric Chromosomal Regions Contain Gene Families and Appear to be Highly Dynamic

In general, the subtelomeric regions of eukaryotic chromosomes are more dynamic in terms of gene copy numbers and levels of observed variation than the rest of the chromosome, and this is particularly the case in pathogenic organisms [30, 31]. Genes that encode proteins involved in host–pathogen interactions and environmental responses are often, but not uniquely, located in sub-telomeric locations [30, 31]. As a result, these regions of the genome are notoriously difficult to assemble. They also represent some of the fastest evolving regions of the genome and, thus, are interesting from the perspectives of host–pathogen biology, evolution, and diagnostics/surveillance.

The first T2T genome sequences for C. parvum revealed surprises regarding higher than expected (based on previous short-read assemblies) gene copy number for a number of genes located in subtelomeric regions, e.g., MEDLE genes, tryptophan synthase beta, and rRNA genes among others [24, 26••]. They also revealed that three different chromosomes shared a total of four highly similar subtelomeric chromosome ends, indicating that replication had occurred between chromosomes [24, 26••]. Better assembly and identification of genes in subtelomeric regions are likely to be crucial for our understanding of important aspects of Cryptosporidium biology. For example, MEDLE proteins, most of which are encoded in subtelomeric regions, are important secreted pathogenesis determinants [9, 32, 33] that appear to be differentially present across a number of species [29].

Population Genomic Studies Provide Insights into Variation, Evolution, and Transmission

Short-read sequencing technology permitted the generation of nearly 700 genomic data sets for C. parvum and nearly 400 for C. hominis. These highly accurate reads have been used to detect variants that exist among and between the different populations of Cryptosporidium parasites that have been sequenced [6, 7••, 13••, 14••, 15, 34, 35•]. The results have been illuminating. They have revealed a discordance in some cases between gp60 single locus typing and genome ancestry, mixed infections with the same or different species, recombination events within species and hybridization between species, discovery of novel subclades, and in general demonstrated the role that admixture has had on shaping population structure [13••, 15].

These studies have also revealed how little we know about the global population structure of Cryptosporidium species and the forces driving their evolution in differing environments and outbreak scenarios [6, 7••, 13••, 14••, 15, 34, 35•]. These studies also reveal the critical role that the reference genome has in the determination of differences in gene content and polymorphisms. Figure 1 highlights the theoretical outcome of determining single-nucleotide variants (SNVs) in two different scenarios: uncompressed, gap-free genome sequences and compressed, gapped genome sequences. Thus, a degree of caution is warranted for the interpretation of variant calling until the community has more complete reference genome sequences. The community would greatly benefit from a more diverse set of reference genome sequences and methods for capturing novel genomic content that may not be present in any given reference genome sequence.
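The effect of a compressed reference on SNV estimates can be illustrated with a toy pile-up. In the sketch below, error-free reads from three hypothetical gene copies are compared either with a single collapsed locus or with their own copy in an uncompressed reference. The sequences, the read counts, the 20% alternate-allele threshold, and the function name apparent_variant_sites are all invented for illustration; a real variant caller involves alignment, base qualities, and statistical models that are not represented here.

```python
from collections import Counter


def apparent_variant_sites(reference, reads, min_alt_fraction=0.2):
    """Positions where the pile-up shows non-reference bases.

    Reads are assumed to be full-length, ungapped, and already aligned
    to `reference`. `min_alt_fraction` is an assumed reporting threshold.
    """
    sites = []
    for pos, ref_base in enumerate(reference):
        column = Counter(read[pos] for read in reads)
        alt_reads = sum(n for base, n in column.items() if base != ref_base)
        if alt_reads / len(reads) >= min_alt_fraction:
            sites.append(pos)
    return sites


if __name__ == "__main__":
    # Three hypothetical paralogs that differ at a few fixed positions.
    copy_a = "ACGTACGTACGT"
    copy_b = "ACGTTCGTACGT"  # differs from copy_a at position 4
    copy_c = "ACGAACGTACGA"  # differs from copy_a at positions 3 and 11
    reads = {"a": [copy_a] * 10, "b": [copy_b] * 10, "c": [copy_c] * 10}

    # Compressed reference: all reads pile up on one merged locus.
    all_reads = reads["a"] + reads["b"] + reads["c"]
    print("compressed:", apparent_variant_sites(copy_a, all_reads))

    # Uncompressed reference: each read set maps to its own gene copy.
    for name, ref in (("a", copy_a), ("b", copy_b), ("c", copy_c)):
        print(f"copy {name}:", apparent_variant_sites(ref, reads[name]))
```

In this toy case the collapsed locus reports three apparent variant sites even though no copy varies within the population, which is the artificial inflation of variation described for Fig. 1.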

Transcriptomics in Cryptosporidium

Annotation, Antisense, and ncRNA Transcripts

Utilization of small RNA-seq and PacBio long-read Iso-seq and ONT Direct RNAseq has significantly advanced our understanding of the Cryptosporidium transcriptome. These technologies have enabled the identification of untranslated regions (UTRs), as well as a variety of long and short non-coding RNAs (ncRNAs), including anti-sense transcripts of unknown function [26••, 36, 37]. Furthermore, single-molecule long-read RNAseq has been instrumental in demonstrating that approximately 10% of C. parvum genes have polycistronic transcripts, offering new insights into gene expression biology and regulation in this important pathogen (** is a more ideal approach to typing [67]. Although there is no consensus on the specific markers to use, there is agreement that a multi-locus approach is needed [68] to better inform on Cryptosporidium epidemiology. Also, multi-locus genotyping is better because more data are available for identification and it adds the possibility of species subtype identification [69].

The increasing availability of full genome sequence data from increasing numbers of isolates should make the determination of appropriate loci for typing easier, but challenges remain. First, the community still lacks genomic sequence data from isolates circulating in many important regions of the world with a high incidence of Cryptosporidium infection. Just compare the burden reported in Gilbert et al. [3••] with the source of available genome sequences in Fan et al. [11]. This situation is beginning to change, and sequences from isolates in other countries are emerging [15]. Hopefully, the genomic advances described above will facilitate this process, unlock the potential of existing samples, and lead to the strategic collection of others. Additional genome sequences from new geographic locations and environments will allow the community to survey the extent of the genomic diversity that exists globally and design markers to account for it. One can also imagine the need for specialized markers targeting very rapidly evolving regions of the genome that can be utilized in outbreak scenarios to detect variants as they arise.

Second, in order to appropriately assess genomic variation and rapidly evolving genomic regions, complete, T2T reference genomes for the species most commonly infecting humans should be established and adopted.

Finally, markers for routine surveillance are also needed. RT-PCR tests for the Cryptosporidium dsRNA virus are very sensitive due to viral abundance [70]. The method has been used to successfully identify Cryptosporidium infection in calves, lambs, goats, and environmental water samples across the world [71, 72, 73]. However, we do not yet know the full extent to which the dsRNA virus is present in different species [56].

Conclusions

Advances in genomics and transcriptomics are impacting all arenas of Cryptosporidium research [74••], from evolution to the life cycle, to host–pathogen interactions and surveillance. The Cryptosporidium research community has come far, very quickly, with many new technologies, approaches, and data sets. Many of these new data are available for use and mining in NCBI GenBank [75] and CryptoDB.org [76•].

The Cryptosporidium community is also struggling a bit with the difficult challenges posed by this important pathogen and the state of genomics technology. The lack of transcriptomic data for species other than C. parvum and the lack of genome sequences for more than half of the named species are real challenges. Complete genome sequences are still too hard to generate and even harder to consistently annotate, especially in the absence of RNA data. This reality impacts their utility and application to important needs like global surveillance and determination of complete gene repertoires. Complete, annotated reference genome sequences greatly facilitate experimental design, e.g., gene knock-outs, and pathway analyses. Likewise, analyses of data that require a reference genome sequence for interpretation, e.g., transcriptome and proteome data analyses and comparative genomics and evolution studies, will also benefit.

The community is also struggling with appropriate geographic representation of genomic data sets from many of the countries most affected by this pathogen. This lack of representation impacts the development of more representative, multi-locus diagnostics and impacts our knowledge base for epidemiological studies and outbreak investigations. Given how difficult Cryptosporidium is to work with, genomics advances have come far, but more is needed.