LUSTR: a new customizable tool for calling genome-wide germline and somatic short tandem repeat variants

Lu, Jinfeng; Toro, Camilo; Adams, David R.; Moreno, Cristiane Araujo Martins; Lee, Wan-Ping; Leung, Yuk Yee; Harms, Mathew B.; Vardarajan, Badri; Heinzen, Erin L.

doi:10.1186/s12864-023-09935-9

Software
Open access
Published: 26 January 2024

Volume 25, article number 115, (2024)

BMC Genomics

Jinfeng Lu¹,⁶,
Camilo Toro²,
David R. Adams²,
Undiagnosed Diseases Network,
Cristiane Araujo Martins Moreno³,
Wan-Ping Lee⁴,
Yuk Yee Leung⁴,
Mathew B. Harms⁵,
Badri Vardarajan⁶ &
…
Erin L. Heinzen¹,⁷

Abstract

Background

Short tandem repeats (STRs) are widely distributed across the human genome and are associated with numerous neurological disorders. However, the extent to which STRs contribute to disease is likely underestimated because of the challenges of calling these variants in short-read next-generation sequencing data. Several computational tools have been developed for STR variant calling, but none fully addresses all of the complexities associated with this variant class.

Results

Here we introduce LUSTR, which is designed to address some of the challenges associated with STR variant calling by enabling more flexibility in defining STR loci, allowing for customizable modules to tailor analyses, and expanding the capability to call somatic and multiallelic STR variants. LUSTR is a user-friendly and easily customizable tool for targeted or unbiased genome-wide STR variant screening that can use either predefined or novel genome builds. Using both simulated and real data sets, we demonstrated that LUSTR accurately infers germline and somatic STR expansions in individuals with and without diseases.

Conclusions

LUSTR offers a powerful and user-friendly approach that allows for the identification of STR variants and can facilitate more comprehensive studies evaluating the role of pathogenic STR variants across human diseases.

Background

Short tandem repeats (STRs), also known as microsatellites, are DNA sequences composed of either identical (perfect) or highly similar (imperfect) short repetitive units (Supplementary Fig. 1) [1]. By definition, the length of the repeated unit is usually shorter than 6 bp [2]. STRs are typically flanked by patternless sequences. Since their first characterization in vivo, STRs have been found throughout the genomes of both prokaryotes and eukaryotes [3,4,5]. Under the common definition of STR, more than 3% of the human reference genome consists of STR sequences, and about 90% of known human genes contain at least one STR locus within their protein-coding regions [2, 6].

STR variants include both nucleotide and length changes, resulting in both mismatches and repeat insertion/deletions (rINDELs). The slippage model first proposed by Kornberg is one widely accepted mechanistic model explaining the high mutation rate at STRs compared to non-STR regions [2, 7, 8]. This model posits that the length of the STR repeat sequence can either expand (increase repeat number) or contract (decrease repeat number) due to a mispairing of the repetitive sequence in the nascent strand to the template strand during DNA replication. This mispairing creates a loop in either the nascent or template strand thus leading to a larger or smaller tandem repeat number in the newly formed DNA strand. In most cases STRs vary by only a single repeat addition or subtraction, but in some cases the STR loci can expand or contract by several thousand repeats [9, 10]. Such length variations may cause structural disruption and result in altered gene expression when they happen within protein coding or non-coding regulatory regions [11,12,13]. The majority of research into the biological relevance of STRs focuses on the impact of the size of STRs, or the total number of repeated DNA units on each allele at the STR locus [9, 10]. Pathogenic STR expansions cause multiple severe human neurological disorders, including Huntington disease, amyotrophic lateral sclerosis (ALS), fragile X syndrome, and Friedreich ataxia [14,15,16,17,18]. Interestingly, the length of the expansion has been shown to vary in different tissues and cells within the same individual which gives rise to mosaicism [18,19,20]. In fact, mosaicism has been reported in both clinical cases and mouse models for multiple disease associated STR loci [21,22,23,24,25,26,27,29,30,31].

The unique properties of STRs make the genotyping of these sites extremely challenging. Historically, STR genotyping was done using repeat-primed polymerase chain reaction (RP-PCR) and Southern blotting; however, these approaches are inefficient and require advance knowledge of the target site [32,33,34]. Genome sequencing technologies offer the potential for a more efficient and more cost-effective way to genotype STRs genome-wide and without bias. Short-read sequencing has been adopted more widely because application of the emerging long-read sequencing technologies is still limited by cost and high sequencing error rates [35]. Although small STR expansions or contractions can be identified via standard variant calling pipelines as small insertion-deletion variants, the robustness and accuracy of the genotype can be significantly affected by the structural complexity of the STRs, especially when the variant size exceeds the sequenced read lengths [36]. Efforts have been made to develop computational tools specifically for STR realignment and variant calling [37,38,39,40,41,42,43,44,45,46], but significant challenges still exist. Many of the STR calling pipelines require the user to provide target STR loci with inflexible input requirements. A recently developed tool, ExpansionHunter Denovo, does not require information on STR loci and allows for an unbiased screen. ExpansionHunter Denovo uses only read pairs composed of one read mapping to the flanking region and one read mapping only to the repeated sequence to detect signals of expansions. This approach applies only to long expansions, limiting the ability to genotype specified STR loci when they have no or only small size variations [47]. Furthermore, to our knowledge there are very limited options to detect mosaicism at STR loci, which has been observed in some individuals [20].

While the link between somatic mutations and cancer and neurological disorders has been well established, the full contribution of somatic STR variants to disease is yet to be revealed [20, 48, 49]. Given the high mutability of STR variants, post-zygotically acquired pathogenic STR expansions and contractions, which would give rise to mosaicism, may be more involved in disease risk than currently appreciated [25, …]

Finder module

The purpose of this module is to identify the genomic coordinates to extract the repeat and flanking sequences for the STRs the user seeks to genotype. There is no limit to the number of STR sites that can be interrogated. Since the exact sequence of an STR may vary due to the presence of mismatches in some of the repetitive sequences or incomplete repeats (Supplementary Fig. 1), providing exact STR boundaries can be difficult and imprecise. Therefore, in addition to the repeat unit, LUSTR requires only the approximate position of the targeted STR, which can merely include sufficient repeats as seeds to initiate the search. Using this information LUSTR searches the reference sequence for both perfect and imperfect repeats around the given positions, periodically extends the repeats, and automatically determines the boundaries between flanking and repeat sequences using default or user-defined parameters that specify how permissive the user wants to be regarding the extent of mismatch and gaps (Supplementary Fig. 2). The LUSTR-defined genomic coordinates, sequences associated with the targeted STRs, and the parameters used to generate the list will then be carried to the following modules.
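The boundary search described above can be illustrated with a small sketch. This is not LUSTR's implementation: it only extends rightward from a seed, ignores gaps, and assumes that the "stop" parameter acts as a drop-off threshold below the best score seen so far. The scoring values are the Finder defaults quoted later in this article (match/mis/stop = +2/-5/-30); the function name `extend_repeat` is hypothetical.

```python
def extend_repeat(seq, unit, start, match=2, mismatch=-5, stop=-30):
    """Extend a repeat rightward from `start` and return its end (exclusive).

    Each base is compared with the base expected from cyclic repetition of
    `unit`. Extension ends once the running score has fallen `stop` below
    the best score seen (assumed meaning of the "stop" parameter).
    Gaps are not modelled in this simplified sketch.
    """
    score = best = 0
    best_end = start
    for i in range(start, len(seq)):
        expected = unit[(i - start) % len(unit)]
        score += match if seq[i] == expected else mismatch
        if score > best:
            best, best_end = score, i + 1   # new best boundary
        elif score - best <= stop:
            break                           # too far below the best: give up
    return best_end


left_flank = "ATATATAT"
right_flank = "TATATATATATA"
imperfect = "GGCCCC" * 2 + "GGCTCC" + "GGCCCC" * 2   # one mismatch inside
seq = left_flank + imperfect + right_flank
end = extend_repeat(seq, "GGCCCC", start=len(left_flank))
print(seq[len(left_flank):end])  # the imperfect repeat, flank excluded
```

A single mismatch inside the repeat lowers the score temporarily but does not end the extension, which is why imperfect repeats are kept while the patternless flank is excluded. A full implementation would also extend leftward and allow gaps.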

RefCreator module and Extractor module

Given the unique requirements for the alignment of sequencing reads at STR loci, LUSTR requires de novo mapping of raw reads to STR loci. Based on the sequences determined by the “Finder” module using the user-defined parameters (Supplementary Fig. 1), the “RefCreator” module creates separate references from the flanking and the repeat sequences, as well as artificial references composed of perfect repetitive units of the target STRs. If the original raw reads (.fastq) are unavailable, LUSTR provides the “Extractor” module to pull all of the raw reads from bam files using a single command, regardless of how the bam files are sorted. Alternatively, users can choose samtools or other existing tools to prepare the raw reads after the bam files are sorted by read ID. The mapping of the raw reads to STR references can then be done by existing tools such as bwa with appropriate parameters for STRs (defined in the user manual), to provide primary alignments as sam or bam files for the following LUSTR modules. Quality control can be applied either before or after the mapping to reduce false signals in the subsequent steps. Note that this de novo mapping step, as well as the “Finder” module, is unique to LUSTR and is intended to increase calling accuracy.

Realigner module

LUSTR then uses the “Realigner” module to map any unmapped reads and to map the unmapped portions of partially mapped reads from the previous step. Specifically, when the majority of the read is from a flanking sequence, the “Realigner” module will try to align the remaining part to the repeat sequence using the periodic Smith-Waterman algorithm. When the majority of the read is from a repeat sequence, the “Realigner” module will try to align the remaining part to the flanking sequence using the regular Smith-Waterman algorithm. Reads with non-contiguous realignment will be presented as split portions of the read belonging to the upstream flanking, repeat, and downstream flanking regions of an STR. To analyze each STR in the subsequent step, all realigned reads are categorized according to the STR regions they map to, allowing single reads to map to multiple locations if homologous sequences exist. Read pairs that cannot be mapped to the same target STR(s) are discarded.
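To make the idea of aligning a read fragment to a repeat concrete, the sketch below runs a plain Smith-Waterman local alignment against the repeat unit tiled enough times to cover the fragment. This tiling is a stand-in for a periodic alignment and is not taken from LUSTR's code; the scoring values are assumed (borrowed from the Finder defaults quoted in this article), and the function names are hypothetical.

```python
def smith_waterman(query, target, match=2, mismatch=-5, gap=-7):
    """Return (best_score, query_end, target_end) of the best local alignment.

    Linear gap penalty; only two rows of the score matrix are kept.
    """
    prev = [0] * (len(target) + 1)
    best = (0, 0, 0)
    for i, q in enumerate(query, 1):
        curr = [0]
        for j, t in enumerate(target, 1):
            s = max(
                0,                                            # start a new alignment
                prev[j - 1] + (match if q == t else mismatch),  # (mis)match
                prev[j] + gap,                                # gap in target
                curr[j - 1] + gap,                            # gap in query
            )
            curr.append(s)
            if s > best[0]:
                best = (s, i, j)
        prev = curr
    return best


def align_to_repeat(query, unit, **scores):
    """Align `query` to `unit` tiled often enough to cover any phase."""
    copies = len(query) // len(unit) + 2
    return smith_waterman(query, unit * copies, **scores)


# A fragment that starts mid-unit still aligns perfectly to the repeat.
score, query_end, _ = align_to_repeat("CCGGCCCCGGCCCCGG", "GGCCCC")
print(score, query_end)
```

Because the target is periodic, the fragment does not need to start at a unit boundary. A fragment from a patternless flank, by contrast, scores close to zero, which is what allows the two parts of a read to be told apart.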

Caller module

In the last step, the “Caller” module collects the information from the alignment procedures described above and lists each potential repeat size at the STR locus that is supported by at least one read. Alleles with repeat sizes short enough to be supported by spanning reads are determined directly, while the size of long repeats (those exceeding read length) is estimated from the ratio of the number of reads realigned to the flanking regions and to the repeat region. The quality of the calls can then be determined by inspecting the number of realigned reads and the randomness of their distribution at the STR loci, following default or user-provided thresholds. By categorizing the pairs supporting each of the potential alleles, the “Caller” module estimates the fraction of each allele, allowing for the possibility of somatic STR variants. Considering the complexity of STRs, the “Caller” module returns the genotyping results in plain text format, which can be easily converted to VCF or other file formats if needed. Furthermore, the “Caller” module integrates an option to narrow down the STR candidates by generating a list of alleles meeting user-customized thresholds for several features, such as the expansion size, call quality, and allele fraction. Additionally, when bias is detected between upstream and downstream flanking sequences, the “Caller” module provides a warning message so that users can investigate potential off-targets or complex mutations close by.
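The ratio-based size estimate and the allele fraction estimate can be sketched as follows. The exact estimator used by LUSTR is not given in this article, so the sketch assumes uniform coverage, under which the number of reads realigned to a region is proportional to the region's length. The function names and the example counts are hypothetical.

```python
def estimate_repeat_length(repeat_reads, flank_reads, flank_length):
    """Estimate repeat length (bp) from read counts under uniform coverage.

    Assumption: reads land on a region in proportion to its length, so
    repeat_length / flank_length ~ repeat_reads / flank_reads.
    """
    if flank_reads <= 0:
        raise ValueError("need at least one flanking read")
    return flank_length * repeat_reads / flank_reads


def allele_fractions(pair_counts):
    """Turn per-allele supporting pair counts into fractions summing to 1."""
    total = sum(pair_counts.values())
    if total == 0:
        raise ValueError("no supporting pairs")
    return {allele: n / total for allele, n in pair_counts.items()}


# Illustrative numbers only: 2 x 1000 bp of flank with 2000 reads,
# and 300 reads falling in the repeat region.
print(estimate_repeat_length(300, 2000, 2000))   # estimated repeat length in bp
print(allele_fractions({0: 45, 100: 5}))         # reference vs. expanded allele
```

The estimate becomes noisy when few reads are available, which is consistent with the dependence on sequencing depth reported in the Results.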

Results

Application of LUSTR in simulated short reads sequencing datasets

We first tested how well LUSTR performs the local realignment using the “Realigner” module, as this step is critical for accurate genotyping and for estimating the number of variant alleles present. Simulated reads were generated from the STR locus in the human C9ORF72 gene (Table 1). The C9ORF72 STR contains tandemly repeated GGGGCC sequences (or GGCCCC on the forward strand), whose expansion is well studied and known to be associated with ALS (Supplementary Fig. 1). We simulated individual libraries of C9ORF72 STR alleles with different repeat sizes as follows: (Library 1) an allele with the original repeat size (62 bp by the default parameters of the LUSTR “Finder” module), (Library 2) an expanded allele with twice the original number of repeats, (Library 3) an expanded allele with four times the original number of repeats, which exceeds standard short read lengths, (Library 4) a contracted allele with half the original number of repeats, and (Library 5) an allele missing the repeats entirely. Twenty thousand raw reads of 150 nucleotides were generated in pairs for each library, randomly from the 2 × 1000 bp flanking sequences and the repeat regions. Note that under these settings, the repeat region of the C9ORF72 STR in Library 3 could not be fully spanned by any read because of the read length limit. To simulate sequencing errors, we allowed mismatches, insertions, and deletions (indels) at each nucleotide position at a rate of 0.5%. Raw simulated pairs were then processed following the LUSTR pipeline. The flanking and repeat lengths annotated by the LUSTR “Realigner” module were compared to the records made during the generation of the raw reads, and the repeat size estimations by the “Caller” module were then compared to the expectation (Table 1). Notably, LUSTR showed high specificity in all libraries and successfully excluded all pairs that were not generated in the forward-reverse pattern (true negatives) without incorrectly calling any positive signals (false positives). Among the remaining pairs, LUSTR also exhibited high sensitivity (> 99%) by successfully retrieving most of the positive pairs (true positives) and missing only a few pairs in certain libraries (false negatives). The false negative calls arose because mismatches or INDELs occasionally occurred within correlated reads, which pushed the realignment scores below the threshold and caused the reads to be discarded. Moreover, LUSTR annotated > 99% of the true positive pairs identically to the way they were generated, with only a few pairs annotated imperfectly. We found that most of the misannotated pairs were due to simulated sequencing errors at the exact boundary between the flanking and repeat regions, which resulted in one-nucleotide shifts in the annotation results. These results show that LUSTR was both sensitive and specific in realigning raw reads to the STR loci.
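The simulated sequencing errors can be pictured with a short sketch. The simulator used for this work is not described in detail, so the code below is only one plausible reading: each position is altered with probability 0.5%, and an altered position is equally likely to become a substitution, an insertion, or a deletion. The function name `add_errors` and the equal split between error types are assumptions.

```python
import random


def add_errors(read, rate=0.005, rng=random):
    """Return `read` with random substitutions, insertions and deletions.

    Each position is altered with probability `rate`; the three error
    types are chosen with equal probability (an assumption of this sketch).
    """
    out = []
    for base in read:
        if rng.random() >= rate:
            out.append(base)                 # position left untouched
            continue
        kind = rng.choice(("sub", "ins", "del"))
        if kind == "sub":
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        elif kind == "ins":
            out.append(base)
            out.append(rng.choice("ACGT"))   # extra base after this one
        # "del": the base is simply dropped
    return "".join(out)


rng = random.Random(0)                       # fixed seed for reproducibility
clean = "GGCCCC" * 25                        # a 150-nt repeat-only read
noisy = add_errors(clean, rate=0.005, rng=rng)
print(len(clean), len(noisy))
```

An error that falls exactly on the boundary between flank and repeat can shift the annotated boundary by one nucleotide, which matches the source of the misannotated pairs noted above.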

Table 1 Performance of LUSTR in genotyping simulated short reads sequencing libraries

We next tested the ability of LUSTR to estimate the size of STR from short reads (Fig. 2a). We simulated homozygous C9ORF72 STR references with different repeat sizes along with 2X1000bp flankings, and randomly generated forward-reverse 150 nucleotide pairs from each of them. Mismatches or INDELs were allowed at each nucleotide position at a rate of 0.5% to imitate expected sequencing errors. To test the robustness of LUSTR under low sequencing depth, we generated the libraries under different average coverages varying from 1 to 100X. Each condition was repeated 10 times independently, and the raw pairs in each simulated library were processed by LUSTR up through the “caller” module. Individual size variation estimation by LUSTR for each library was shown in Fig. 2a, and the average of each condition was compared to the expectation. We also calculated the square of the correlation coefficient (r²) to summarize the ability of LUSTR to call expected sizes under different coverage conditions. LUSTR successfully estimated the STR size variation in libraries with sequencing depth as low as 5X (r² = 0.74), and performed more accurate estimations by the increase of sequencing depth (r² = 0.97 at 30X, Fig. 2a). LUSTR was even able to make an accurate estimation when the STR repeat sizes were close to the simulated read lengths (150bp, variation + 15). These results indicated that LUSTR robustly estimates STR sizes with high accuracy.
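For readers who want to reproduce the summary statistic, the squared correlation coefficient between expected and estimated size variations can be computed as sketched below. The function name and the example values are illustrative and are not data from this study.

```python
def r_squared(expected, observed):
    """Squared Pearson correlation coefficient between two equal-length lists."""
    if len(expected) != len(observed) or len(expected) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(expected)
    mean_x = sum(expected) / n
    mean_y = sum(observed) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(expected, observed))
    sxx = sum((x - mean_x) ** 2 for x in expected)
    syy = sum((y - mean_y) ** 2 for y in observed)
    return sxy ** 2 / (sxx * syy)


# Hypothetical example: expected size variations (repeat units) and
# made-up estimates from one coverage condition.
expected = [-5, 0, 5, 10, 15]
estimated = [-4, 1, 4, 11, 14]
print(round(r_squared(expected, estimated), 3))
```

Values close to 1 indicate that the estimates track the expected sizes almost linearly, while lower values at low coverage reflect the larger scatter of individual libraries.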

The estimation of STR allele fraction has not been explored to any great extent with existing STR calling tools but is essential for somatic variant analysis. Therefore, we further tested the ability of LUSTR to accurately determine STR allele fraction (Fig. 2b). We simulated heterozygous C9ORF72 STR references composed of two alleles along with 2X1000bp flankings: one with a normal C9ORF72 STR repeat size (62bp including 18bp perfect repeat units), and the other with a very large expansion in the range commonly found in humans with ALS or FTD (about 100 repeat units longer than reference). The normal C9ORF72 STR allele fraction was then varied from 10 to 90%. Raw pairs of 150 nucleotides were randomly generated in a forward-reverse pattern under different average coverages varying from 1 to 100X. Randomly generated substitutions or INDELs at a rate of 0.5% were incorporated to account for expected sequencing errors. Simulations for each allelic fraction evaluated were repeated 10 times and independently processed by LUSTR. The estimated allelic fraction of the original C9ORF72 STR allele in each library is shown in Fig. 2b. The average of each condition was compared to the expectation. We found that the estimation of STR allele fraction required higher sequencing depth compared to that required for non-mosaic STR sizing. Although LUSTR exhibited a correlation between the estimation averages and the expectations starting from 10X coverage (r² = 0.56), it did not return a reliable estimation for individual libraries until 30X (r² = 0.77) or 50X (r² = 0.88) coverage (Fig. 2b). These results indicate that LUSTR is able to successfully estimate the fractions of STR alleles in deep sequenced short reads libraries, although the performance, as expected, could be affected by insufficient realigned reads when sequencing depth was low.

Identification of known STR variants from publicly-available sequence data using LUSTR

We next tested the ability of LUSTR to correctly identify STR variants in a database with benchmarking variant calls defined by the Genome in a Bottle Consortium (GIAB). GIAB integrates multiple short and linked read sequencing datasets to provide benchmark calls for human genomes and provides a valuable source for the optimization and validation of bioinformatics tools [62]. We downloaded the MGISEQ (150 nucleotide read length) and the BGISEQ (100 nucleotide read length) sequenced pair-ended short reads libraries by their availability for the Ashkenazim trio and the Chinese trio from GIAB. In addition to the analysis for each individual library, we also generated and analyzed merged libraries when the same individual was sequenced multiple times or across multiple sequencing lanes (Tables 2 and 3, Supplementary Table 1). We then selected 13 STR loci that were known to be associated with neurological disorders and thus have been used to validate existing STR calling software [42]. The raw pairs of each merged and individual library were processed by LUSTR using the default settings, and the genotype calls for the listed STR loci were compared with variant calling files (VCFs) provided by GIAB.

Table 2 Performance of LUSTR and ExpansionHunter in identifying STR variants reported in the GIAB database (Ashkenazim Trio)

Table 3 Performance of LUSTR and ExpansionHunter in identifying STR variants reported in the GIAB database (Chinese Trio)

Across all the Ashkenazim and Chinese trio libraries and the 13 loci, there were a total of 54 opportunities to compare the genotype provided by GIAB to that called by LUSTR (Tables 2 and 3). For 48 out of the 54 comparisons (88.9%), the predominant allele(s) identified by LUSTR matched that of the benchmark GIAB calls. Among the concordant calls, LUSTR also detected two contracted STR variants at low levels at the ATN1 and HTT loci for the son of the Ashkenazim trio, one with a contraction of five repeat units at 5% allele fraction and the other with a contraction of nine repeat units at 4% allele fraction (indicated as -5 and -9 in Tables 2 and 3, respectively). Although these small-fraction alleles were not called by GIAB, they were supported by some reads realigned to the loci (Supplementary Fig. 4). This could be due to sequencing errors that generated a small fraction of reads artificially revealing the variants, or it could indicate the real presence of somatic STR variants at these loci. A minor fraction of reads supporting an allele of +12.7 was also detected at the ATXN3 locus in the Ashkenazim trio, compared to the expected +13. This minor discrepancy was likely caused by sequencing errors or slight interpretation differences. The six discordant calls (11.1%) were either small differences in repeat count unlikely to alter the interpretation (e.g., for the Ashkenazim trio LUSTR called -1/+1 repeats for the two ATXN1 alleles whereas GIAB reported 0/+1) or due to reads supporting variant alleles being absent in specific libraries (e.g., ATXN7 in the mother of the Ashkenazim trio and in the child of the Chinese trio) (Supplementary Table 1a and b and Supplementary Fig. 3).

For the 50 instances without GIAB calls (NAs), LUSTR called 34 genotypes identical or close to the reference (68%), which likely explains the absence of calls in GIAB VCFs. In addition to observing a high rate of calling concordance, there were several cases where LUSTR detected a genotype that was not called by GIAB. For example, at the DMPK locus, LUSTR called a genotype of -15, -9 in two sequencing runs for the mother of the Ashkenazim trio that was not reported by GIAB (Tables 2 and 3). The reason this was not called in GIAB is unclear. However, in all cases, there was clear sequence read evidence supporting the presence of these alleles (Supplementary Fig. 5). The LUSTR genotype calls in the child also followed a Mendelian inheritance pattern which further supports the accuracy of the calling (Tables 2 and 3).

Since ExpansionHunter provides curated information and input format for these 13 STR loci, we also ran ExpansionHunter (ver 4.0, default settings) and compared the results in all of the eight merged libraries to further evaluate the performance of LUSTR. Among all the 104 comparisons from both Ashkenazim and Chinese trios, LUSTR showed 94 calls (90.4%) identical or equivalent to ExpansionHunter results, including those loci referred above where LUSTR called genotypes significantly different from GIAB database (Tables 2 and 3, Supplementary Fig. 5). Among the 10 calls that were discordant between LUSTR and ExpansionHunter, there were instances where ExpansionHunter was able to reveal the hidden allele missed by LUSTR (e.g. ATXN7 in the mother from Ashkenazim trio), but also instances where LUSTR showed more convincing results by raw reads inspection (e.g. HTT in the father from Ashkenazim trio) (Tables 2 and 3). These results collectively support that LUSTR can accurately genotype STR variant alleles in short reads sequencing libraries.

LUSTR was accurate and robust to call mosaic STR variants in the in silico mixture libraries

We next tested the ability of LUSTR to call mosaic STR variants by mixing short reads from real data libraries in silico. We selected two MGISEQ sequenced libraries with equal sequencing lengths, one from the father of the Ashkenazim trio and the other from the child of the Chinese trio. We generated an in silico mixture by randomly selecting varying proportions of reads (Table 4) from the two libraries. The mixed libraries were then processed by LUSTR for the 13 tested STR loci shown in Tables 2 and 3. To validate the performance, we first assumed the STR genotypes of the two samples by integrating both GIAB calls and reliable LUSTR calls in the previous tests. We then estimated the expected STR allele fractions in the mixture libraries by assuming that both original samples had homozygous or heterozygous germline genotypes at these loci (i.e., 100% or 50% variant allele frequency, Table 4). The expected calls were then compared with the calls by LUSTR. In the mixed library consisting of a 1:2 ratio of the two genomes, LUSTR successfully called the alleles with fractions very close (< 10%) to expected for six out of the 13 STRs (ATN1, ATXN3, C9ORF72, CBL, DMPK, and HTT) (46.2%, Table 4). For 5 STRs (ATXN2, ATXN10, CACNA1A, JPH3, and PPP2R2B), LUSTR called allelic fractions deviating greater than 10% from expectation. This could be due to read bias in the original samples or sampling error (Table 4, Supplementary Table 1a and b). LUSTR missed STR alleles for ATXN1 and ATXN7, but these were due to missing or low-quality reads supporting the non-dominant alleles in the original libraries (Table 4, Supplementary Table 1).
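The expected allele fractions in a mixture follow directly from the mixing ratio and the germline genotypes. The sketch below works this out with exact fractions, assuming (as in the text) that each sample is diploid, so each of its two alleles contributes half of that sample's reads. The function name and the example genotypes are hypothetical and are not taken from Table 4.

```python
from collections import defaultdict
from fractions import Fraction


def expected_fractions(samples):
    """Expected allele fractions in an in silico mixture.

    `samples` is a list of (weight, genotype) tuples, where `weight` is the
    integer share of reads drawn from that sample and `genotype` is a tuple
    of its alleles (repeat-unit differences from the reference).
    """
    total = sum(weight for weight, _ in samples)
    fractions = defaultdict(Fraction)
    for weight, genotype in samples:
        share = Fraction(weight, total) / len(genotype)  # per-allele share
        for allele in genotype:
            fractions[allele] += share
    return dict(fractions)


# 1:2 mixture: minor sample heterozygous (0/+2), major sample homozygous (0/0).
print(expected_fractions([(1, (0, 2)), (2, (0, 0))]))
# 1:10 mixture of the same genotypes.
print(expected_fractions([(1, (0, 2)), (10, (0, 0))]))
```

With a 1:10 ratio, a heterozygous allele from the minor sample is expected at 1/22 (about 5%) and a homozygous one at 1/11 (about 9%), which is the low-fraction regime examined in the more extreme mixtures.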

Table 4 Ability of LUSTR to estimate allele fraction by in silico mixture of samples

To further test the ability of LUSTR under more extreme conditions, we then mixed the two samples by an approximate ratio of 1:10 to mimic mosaic STR alleles with fractions as low as 5 or 10% (Table 4). Considering that such low fractions were made by selecting only a few reads from the sample, we performed three replicates to reduce the impact of sampling error that can occur during the mixing (Table 4). LUSTR successfully called the alleles with expected fractions in at least one of the three replicates for 6 out of the 13 STRs (ATN1, ATXN3, C9ORF72, CBL, DMPK, and PPP2R2B) (46.2%, Table 4). Notably, LUSTR was able to call the minor alleles with low fractions (5 or 10%) for ATXN3, CBL, DMPK, and PPP2R2B with very close estimations (< 10%) (Table 4). At the HTT locus, LUSTR called the correct fraction in one of three mixtures but flagged the call as not being reliable. This suggests that allowing more permissive calling may be needed to capture mosaic STRs. At the JPH3 locus, LUSTR estimated allelic fractions that did not align well with expectation (> 10% difference) (Table 4). The reason for this is unclear but is likely due to allelic bias from the Chinese trio Son library as shown by original LUSTR calls (Supplementary Table 1a and b). LUSTR consistently missed the minor alleles for ATXN1, ATXN2, ATXN7, ATXN10, and CACNA1A (Table 4) due to the loss of all reads supporting that allele when randomly sampling from the non-dominant genome.

While noisier than germline calling, these results collectively support the ability of LUSTR to accurately call mosaic STR variant alleles with variant allele fractions as low as 5%. We note however that the accuracy will be greatly influenced by read depth at the locus, as is the case for calling of any allele with low representation in a genome.

Identification of undiagnosed STR expansions in subjects by unbiased whole genome scan using LUSTR

We next tested the ability of LUSTR to identify clinically significant STR expansions using an unbiased whole genome scan in samples harboring known pathogenic STRs. We collected raw whole genome sequence data (short-read paired-end sequencing) from three individuals with presumed genetic disorders sequenced as part of the Undiagnosed Diseases Network (UDN). These subjects were genetically undiagnosed, but all had STR expansion variants that may explain their phenotype (Table 5). We were blind to the specific phenotypes and genotypes while performing the scans so as not to bias the analyses. Two libraries were sequenced for subject 1 and subject 2, and four libraries were sequenced for subject 3 (Table 5). We also collected the libraries from the unaffected parents and siblings for subject 1 and subject 2 to determine inheritance (Table 5). To prepare for the whole genome scan, we used Tandem Repeats Finder [37] to obtain the basic information of STRs across the whole human genome reference (build 37). We ran Tandem Repeats Finder using the recommended settings (match/mis/gap/PM/PI/minscore = 2/-5/-7/80/10/50), selected those STRs located within 1000 bp of known genes as defined by the UCSC genome annotation database (https://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/), and retrieved a customized list of 162,840 STRs. We then applied the LUSTR “Finder” module to retrieve the standardized STR sequences for these 162,840 loci using default settings (match/mis/gap/stop = +2/-5/-7/-30) and generated reference sequences using the “RefCreator” module.

Table 5 Unbiased whole genome scan by LUSTR for known STR expansions in undiagnosed subjects

Raw read libraries of the three subjects were mapped to the references generated by LUSTR using bwa mem. All bam files from each individual library as well as merged bam files for each subject were then processed blindly by the LUSTR “Realigner” and “Caller” modules against the customized STR list. The parallel processing function provided by LUSTR was applied to reduce the processing time for calling. We set thresholds for the “Caller” module to call all STR loci with alleles expanded larger than 100 bp compared to the human genome reference, allelic fractions larger than 5%, and variant sites called by more than 15 realigned pairs without repeat-only pairs in at least medium calling quality determined by LUSTR “Caller” module. Note that such settings can be relaxed to reduce the risk of false negatives and to capture mosaicism. The STR expansions fulfilling the quality control metrics were then checked to assess whether they were detected in both individual libraries of subject 1 or subject 2, or were detected in at least three individual libraries of subject 3. Following these steps, we identified 86 candidate STR expansions for subject 1, 78 candidates for subject 2, and 33 candidates for subject 3 (Table 5).
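The filtering logic described above can be sketched as a small post-processing step. The record layout, field names, and quality labels below are hypothetical; the LUSTR “Caller” module applies its own thresholds through its options, and its plain-text output format is not reproduced here. Only the threshold values (more than 100 bp expansion, more than 5% allele fraction, more than 15 realigned pairs, at least medium quality) come from the text.

```python
from dataclasses import dataclass

QUALITY_RANK = {"low": 0, "medium": 1, "high": 2}   # assumed quality labels


@dataclass(frozen=True)
class Call:
    locus: str
    expansion_bp: float      # size increase relative to the reference
    allele_fraction: float   # 0..1
    realigned_pairs: int     # supporting pairs, repeat-only pairs excluded
    quality: str             # "low", "medium" or "high"


def passes(call, min_expansion_bp=100, min_fraction=0.05,
           min_pairs=15, min_quality="medium"):
    """Apply the scan thresholds to a single call."""
    return (call.expansion_bp > min_expansion_bp
            and call.allele_fraction > min_fraction
            and call.realigned_pairs > min_pairs
            and QUALITY_RANK[call.quality] >= QUALITY_RANK[min_quality])


def shared_candidates(libraries, min_libraries):
    """Loci with a passing call in at least `min_libraries` libraries."""
    counts = {}
    for calls in libraries:
        for locus in {c.locus for c in calls if passes(c)}:
            counts[locus] = counts.get(locus, 0) + 1
    return sorted(locus for locus, n in counts.items() if n >= min_libraries)


# Made-up calls for two libraries of one subject.
lib1 = [Call("locus_A", 150, 0.50, 20, "high"), Call("locus_B", 50, 0.50, 20, "high")]
lib2 = [Call("locus_A", 140, 0.40, 18, "medium"), Call("locus_C", 200, 0.50, 10, "high")]
print(shared_candidates([lib1, lib2], min_libraries=2))
```

For subjects with two libraries the replication requirement corresponds to `min_libraries=2`, and for the subject with four libraries to `min_libraries=3`. Relaxing the thresholds trades more candidates for a lower risk of missing mosaic alleles.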

Among the 86 candidates for subject 1, 49 STR expansions were also detected with similar or larger sizes in subjects 2 and 3 and assumed to be either benign polymorphisms or sequencing artifacts. Among the 37 remaining we focused on the 20 candidates with high calling quality for primary investigation (Tables 5 and 6). We next looked into the detailed features of these 20 candidates to decide their priority ranking based on the likelihood they may contribute to the individual’s phenotype. Distinct from the previously excluded 49 candidates that passed the threshold and were also called in subjects 2 and 3, many of these 20 candidates were either called in only one of subjects 2 or 3, or called with a smaller expansion or a low allele fraction that didn’t pass the threshold for subjects 2 and 3. This may indicate non-specificity, but could also indicate potential genetic penetrance. We decided to keep them on the list, but took this into consideration when making priority determination (Table 6). Another important feature being considered for the 20 candidates was the reference size of each candidate STR, since the estimation for STR expansions with reference sizes longer than read length was more likely to be affected by sequencing randomness and off-target repeats, compared to those with relatively shorter sizes (Table 6). We also investigated other features such as the locations of the candidates to the affected genes, the potential for off-target alignment or the presence of mutations in the flanking sequences, and the number of called alleles which could indicate complex situations requiring further examination (Table 6). Among all these candidates, the STR expansion at the GLS gene, a known pathogenic STR, was deemed the most likely candidate in subject 1 (Table 6). We also identified STR expansions at ARHGAP28 and other loci with high priorities that may also be worthy of further consideration (Table 6). 
Once unblinded, we found that the GLS expansion was indeed the suspected pathogenic variant identified for subject 1.

Table 6 Evaluation of candidate STR expansions by LUSTR unbiased whole genome scan for subject 1


We applied a similar procedure to subjects 2 and 3 and narrowed down the candidate lists to 21 and one high quality STR calls, respectively (Supplementary Table 2). However, we could only deem the TCF4 STR expansion a possible candidate for subject 2, and no possible candidates were identified for subject 3. After the cases were unblinded, both were found to harbor likely pathogenic RFC1 STRs. The RFC1 STR variants in the two subjects included a replacement of the repetitive “AAAAG” with “AAGGG”, a 1-bp shift, and the expansion (AA + AAAAG × 11 + AAAAAG → AAA + AAGGG × 10 + AAGAAAAAG → AAA + AAGGG × n + AAGAAAAAG). This explains why LUSTR, when searching for “AAAAG” repeats under the default settings, gave expansion signals at the RFC1 locus for the two subjects with only very low realignment coverage and low calling quality, which did not happen for the parents and sibling (Supplementary Table 3). To evaluate whether LUSTR is flexible enough to detect this complex RFC1 expansion, we first tried reducing the mismatch penalty. More pairs were realigned, but the calling qualities did not improve enough for successful detection, because changing the penalty alone did not help retrieve repeat-dominant reads (Supplementary Table 3). However, by applying a customized alternative RFC1 STR reference with “AAGGG” repeats, the RFC1 expansions were successfully detected with high coverage and quality for both subjects 2 and 3 (Supplementary Table 3). Moreover, by combining the results from the two RFC1 STR references, LUSTR genotyped an “AAGGG” expansion allele in subject 1 inherited from the mother, as well as four individuals carrying “AAAAG” expansion alleles in the families of subject 1 and subject 2 (Supplementary Fig. 6). These cases exemplify the challenges of STR calling but also demonstrate the flexibility of LUSTR for customization through user-specified settings.
Developing LUSTR to call non-reference STR sequences de novo is an area for future development of the software.
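To make the RFC1 motif problem concrete, the following Python sketch compares a repeat-only read against a repeating motif at every phase and reports the lowest mismatch fraction. It is an illustration only, not part of LUSTR: the function name is hypothetical, and the comparison is a plain gap-free (Hamming-style) match rather than the realignment LUSTR performs.

```python
def best_phase_mismatch_fraction(read, motif):
    """Lowest fraction of mismatching bases over all phases of a repeating motif.

    The read is compared base by base (no gaps) against the motif repeated
    indefinitely, starting at each possible offset within the motif.
    """
    if not read or not motif:
        raise ValueError("read and motif must be non-empty")
    m = len(motif)
    best = 1.0
    for phase in range(m):
        mismatches = sum(
            1 for i, base in enumerate(read) if base != motif[(i + phase) % m]
        )
        best = min(best, mismatches / len(read))
    return best


# A repeat-only read from an "AAGGG" expansion (four repeat units).
read = "AAGGG" * 4
against_reference_motif = best_phase_mismatch_fraction(read, "AAAAG")
against_custom_motif = best_phase_mismatch_fraction(read, "AAGGG")
```

In this toy case, two of every five bases mismatch the reference “AAAAG” unit at the best phase, while the customized “AAGGG” unit matches perfectly. This is consistent with the observation that relaxing the mismatch penalty alone recovered too few repeat-dominant reads, whereas supplying an alternative STR reference did.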

Discussion

Besides the utility of STRs in kinship determination and identity verification, STRs have attracted significant attention for their role in human neurological disorders. Genome-wide sequencing offers tremendous potential to identify STRs that may contribute to disease. Despite the recent progress made in calling STR variants in short read sequence data, there is an on-going need for improvements to make calling more user friendly and interpretable [20, 35, 63].

The LUSTR pipeline described here builds on the advantages of several existing STR variant calling tools [37,38,39,40,41,42,43,44,45]. LUSTR specifically aims to provide an alternative for users with varied conditions or in need of more flexible input requirements (Supplementary Table 4). LUSTR realigns as many reads as possible to each STR locus in order to make STR calling as sensitive and accurate as possible. It also enables the detection of deviations in allele frequencies that may indicate mosaicism, which has hardly been addressed to date in existing STR callers (Supplementary Table 4). LUSTR follows the classic pipeline of mapping, local realignment, and then STR calling. However, distinct from other existing tools, LUSTR requires a de novo mapping from the raw reads to STR-specific references generated in the pipeline, rather than directly processing bams from whole genome mapping. Although this may increase running time and storage requirements, the design aims to improve the sensitivity of STR calling, and reflects the idea that STR mutation should be considered a unique type of variation that requires a pipeline distinct from that designed to call SNVs and INDELs. In our tests running LUSTR along with the existing STR variant calling tools, LUSTR and ExpansionHunter showed consistent calls in most cases (> 90%, Tables 2 and 3). For the discordant loci, neither LUSTR nor ExpansionHunter showed a significant overall advantage over the other, indicating that each tool has pros and cons under different conditions. As for the running speed, a single process for whole genome STR genotyping by LUSTR takes days to finish, varying within about a seven day range depending on several factors including sequencing depth, list size of target STRs, and running platform conditions.
This running speed, mostly dictated by the Realigner module, is slower than ExpansionHunter or GangSTR when simple target inputs are supplied, but comparable when off-target information is provided [42, 45]. Moreover, LUSTR allows for parallel processing, which will greatly increase the running speed (Supplementary Table 4). In the local realignment step, LUSTR uses the periodic Smith-Waterman algorithm to solve the challenges of imperfect repeats and sequencing errors that happen within STR repetitive regions. While this approach increases sensitivity for long expansions with an expected trade-off in specificity, we note that parameters in the Finder module and subsequent calling step can be altered to favor specificity over sensitivity. New optional modules are under development to further reduce noise and enhance specificity to benefit certain situations such as cohort-level association analyses.
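The idea behind periodic Smith-Waterman alignment can be sketched in Python as a wrap-around dynamic program that scores a read against a motif repeated indefinitely. This is a generic textbook-style illustration, not the modified algorithm implemented in the LUSTR Realigner: the function name and the scoring values (match 1, mismatch -1, linear gap -2) are assumptions chosen for the example.

```python
def periodic_sw_score(read, motif, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of `read` against `motif` repeated cyclically.

    Column j of the matrix stands for motif position j; column 0 follows
    column m-1, so an alignment can run through any number of repeat units.
    """
    m = len(motif)
    prev = [0] * m  # scores for the previous read base (local: start at 0)
    best = 0
    for base in read:
        cur = [0] * m
        # Diagonal (match/mismatch) and vertical (extra base in read) moves.
        for j in range(m):
            score = match if base == motif[j] else mismatch
            diag = prev[(j - 1) % m] + score
            up = prev[j] + gap
            cur[j] = max(0, diag, up)
        # Horizontal moves (motif base missing from the read). The row is
        # cyclic, so two sweeps let a gap run cross the m-1 -> 0 boundary.
        for _ in range(2):
            for j in range(m):
                left = cur[(j - 1) % m] + gap
                if left > cur[j]:
                    cur[j] = left
        best = max(best, max(cur))
        prev = cur
    return best


perfect = periodic_sw_score("CAGCAGCAG", "CAG")       # in-phase perfect repeat
shifted = periodic_sw_score("AGCAGCAGC", "CAG")       # same repeat, other phase
substituted = periodic_sw_score("CAGCTGCAG", "CAG")   # one substitution
deleted = periodic_sw_score("CAGCGCAG", "CAG")        # one base missing
```

With these assumed scores, a perfect repeat scores its full length regardless of phase, while a substitution or a missing base lowers the score without breaking the alignment. This tolerance is what lets imperfect repeats and sequencing errors be absorbed, and the penalty values are the kind of parameter that trades sensitivity against specificity.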

Long read sequencing technologies that have recently emerged will likely improve STR variant calling. LUSTR is designed for short read sequencing data, which remain much more commonly used due to the cost and accuracy limitations of current long read sequencing technologies. Even when long read sequencing becomes more economical and accurate, there will still be large numbers of genomes sequenced with short reads, for which short read STR variant callers will still be needed. Newer tools have been developed to incorporate algorithms compatible with long sequenced reads to address this emerging need [46]. Another future development of LUSTR will focus on ensuring compatibility of the caller with long read sequencing data.

Both the local realignment and the variant calling steps are widely acknowledged as critical factors required for accurate STR variant calling [38,39,40,41,42,43,44,45,46]. However, the importance of STR sequence definition is often underestimated when STR target list customization is required, which is another important feature where LUSTR will provide an improved experience compared to other existing tools (Supplementary Table 4). The repeat regions of STRs often contain partial or imperfect repetitive sequences, natural SNVs and short INDELs, as well as sequencing errors introduced during the establishment of the reference. Therefore, the boundaries of STRs may vary widely under different definition rules, making it difficult for users to precisely define the STR regions of the genome. Furthermore, inconsistent rules applied to STR boundary definition and local realignment may lead to aberrant calls. One solution is to provide an STR list with the optimal format [42, 64]. Although the list can be updated and expanded following newly emerging clinical discoveries, this approach would limit the ability of the user to add new STR loci of interest or loci that arise in new releases of the reference genome. Beyond the widely used GRCh37/38 genomes, there have been several new genome references, such as Telomere-to-Telomere genome reference (T2T), Han Chinese genome reference (HG00514), and Japanese genome reference (JG2) [65,66,67,78, 79]. It was equivalent to a novel STR expansion and thereby escaped the detection of LUSTR when the reference “AAAAG” repetitive unit was expected, with extremely low numbers of reads able to be realigned. Such a coverage warning can alert users to the potential existence of this type of STR mutation and can be scheduled for future updates of LUSTR.
However, by simply applying a customized “AAGGG” RFC1 STR reference or modifying the running with a lower mismatch penalty to allow the realignment of “AAGGG” to “AAAAG”, LUSTR was able to detect the expansion. Furthermore, with such modifications LUSTR identified the inheritance of an “AAAAG” expansion allele and the carriers of heterozygous “AAGGG” expansion allele in the families of two UDN subjects (Supplementary Fig. 6), allowing for further investigation into the potential unrevealed contributions to the phenotype, which so far are suggested to be benign [78,79,80]. This case indicated the flexibility of LUSTR when applied to complex situations encountered with novel STRs.

Conclusions

In summary, LUSTR is a reliable and powerful tool for both germline and somatic STR variant calling, and we expect its application to contribute to studies evaluating the role of STR mutations in disease.

Software availability and requirements

Project name: LUSTR

Project home page: https://github.com/JLuGithub/LUSTR

Operating system: Linux

Programming language: Perl

Other requirements: samtools, mapping software such as bwa or bowtie

License: GNU GPL

Any restrictions to use by non-academics: licence needed

Method

LUSTR script

The code for each of the LUSTR modules was written in Perl. The regular Smith-Waterman algorithm was applied to the local realignment of short sequences to STR flanking regions. A periodic Smith-Waterman algorithm with modifications was applied to the recognition of STR repeat sequences. The sizes of STR repeats and allele fractions were estimated by calculating the ratios between the counts of reads with and without flanking sequences. The core concept equations are listed below, with modifications applied in practice to allow for random sequencing bias. Equations 1 and 2 are first applied to judge the existence of an allele with repeat length longer than the sequencing read length. Upon the detection of a signal, Equation 3 is used to estimate the size of the repeat region for the allele. The fraction of each allele is then determined by the combination of Equations 4, 5, 6 and 7. The calling reliability was determined by the counts of reads categorized into different patterns and by the flanking-repeat length distributions under the parameters provided by users. Future updates of the LUSTR script will include the application of probabilistic methods to repeat size estimation, statistical methods to reliability determination, and functions to incorporate de novo STR variants and long read sequencing libraries.

$$E_{n+1}=1\quad\text{if }\left(O_{n+1}>O_n\right)\ \&\ \left(O_n\geq\sum_{i=1}^{n}\frac{2S_iC_i}{L-S_i}\right)$$

(1)

$$E_{n+1}=0\quad\text{otherwise}$$

(2)

$$\frac{2LR_1}{O_{n+1}-\sum_{i=1}^{n}\frac{2S_iC_i}{L-S_i}}+L\leq S_{n+1}\leq\frac{2L(R_1+R_2)}{O_{n+1}-\sum_{i=1}^{n}\frac{2S_iC_i}{L-S_i}}+L\quad(\text{if }E_{n+1}=1)$$

(3)

$$\sum_{i=1}^{n+1}F_i=1$$

(4)

$$\frac{F_i}{F_j}=\frac{C_i(L-S_j)}{C_j(L-S_i)}\quad(1\leq i,j\leq n)$$

(5)

$$\frac{F_{n+1}}{F_i}=\left(O_{n+1}-\sum_{k=1}^{n}\frac{2S_kC_k}{L-S_k}\right)\cdot\frac{L-S_i}{2C_iL}\quad(1\leq i\leq n,\ \text{if }E_{n+1}=1)$$

(6)

$$\frac{F_{n+1}}{F_i}=0\quad(1\leq i\leq n,\ \text{if }E_{n+1}=0)$$

(7)

L indicates the sequencing read length (bp).

n indicates the number of alleles with repeat sizes (bp) that can be directly detected by reads containing sequences from both flanking regions.

E_n+1 indicates the existence of the allele (allele n + 1) with repeat size longer than the sequencing read length.

S_i (1 ≤ i ≤ n) indicates the repeat size of allele i directly detected by reads; S_n+1 indicates the repeat size of allele n + 1 that is longer than the sequencing read length thus needs to be estimated.

C_i (1 ≤ i ≤ n) indicates the number of reads containing sequences from both flanking regions and a repeat region with size of S_i, thus belonging to allele i.

O_n indicates the number of reads containing sequences from only one flanking region, and from the pairs with any repeat length ≤ the maximum from S₁ to S_n; O_n+1 indicates the number of all of the reads containing sequences from only one flanking region.

F_i (1 ≤ i ≤ n) indicates the fraction of allele i; F_n+1 indicates the fraction of allele n + 1 whose repeat size is longer than the sequencing read length.

R₁ indicates the number of reads containing only repeat sequences but not from repeat-only pairs; R₂ indicates the number of reads containing only repeat sequences and from repeat-only pairs.
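As a worked illustration of Equations 1 to 7, the Python sketch below evaluates the core equations for one locus using the symbols defined above. It is a direct transcription of the core concept only and omits the practical modifications for random sequencing bias that the authors mention. The function and argument names are hypothetical, the summation in Equation 6 is read as running over all directly detected alleles, and the input counts are made up.

```python
def estimate_long_allele(L, S, C, O_n, O_n1, R1, R2):
    """Evaluate the core Equations 1-7 for one STR locus.

    L: read length (bp); S, C: repeat sizes and spanning-read counts of the
    n directly detected alleles; O_n, O_n1: one-flank read counts (O_n and
    O_{n+1}); R1, R2: repeat-only read counts as defined in the text.
    Returns (E, (S_low, S_high) or None, [F_1, ..., F_{n+1}]).
    """
    if len(S) != len(C) or not S:
        raise ValueError("S and C must be non-empty and of equal length")
    if any(s >= L for s in S):
        raise ValueError("directly detected repeat sizes must be shorter than L")

    # One-flank reads expected from the short alleles alone.
    expected = sum(2 * s * c / (L - s) for s, c in zip(S, C))

    # Equations 1 and 2: does an allele longer than the read length exist?
    exists = int(O_n1 > O_n and O_n >= expected)

    # Equation 5: F_i is proportional to C_i / (L - S_i) for the short alleles.
    weights = [c / (L - s) for s, c in zip(S, C)]

    bounds = None
    if exists:
        excess = O_n1 - expected  # positive whenever Equation 1 holds
        # Equation 3: lower and upper bounds for S_{n+1}.
        bounds = (2 * L * R1 / excess + L, 2 * L * (R1 + R2) / excess + L)
        # Equation 6 combined with Equation 5: F_{n+1} ~ excess / (2L).
        weights.append(excess / (2 * L))
    else:
        weights.append(0.0)  # Equation 7

    total = sum(weights)
    if total == 0:
        raise ValueError("no supporting reads for any allele")
    fractions = [w / total for w in weights]  # Equation 4: fractions sum to 1
    return exists, bounds, fractions


# Toy input: 150 bp reads, one 30 bp allele seen in 60 spanning reads.
E, S_bounds, F = estimate_long_allele(
    L=150, S=[30], C=[60], O_n=40, O_n1=90, R1=20, R2=10
)
```

For this toy input the short allele accounts for 30 expected one-flank reads, leaving an excess of 60 that is attributed to a long allele estimated at 250 to 300 bp with an allele fraction of 2/7.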

Data processing

The running of LUSTR and the processing of the short read sequencing libraries were done in Linux, with SAMTOOLS 1.14 pre-installed. The mapping of reads to STR references was done by BWA MEM version 0.7.

Data generation

The simulated data in the performance test of LUSTR were generated by an in-house Perl script. STR references with expected repeat sizes were prepared, and read pairs were then generated in random directions from the STR references. The pattern of each read was recorded to evaluate the performance of LUSTR calling. With a given probability, each nucleotide of a read was altered, deleted, or inserted to imitate sequencing errors.
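The simulation procedure can be sketched as follows. This is a Python illustration written for this description, not the authors' in-house Perl script: all function names, the fragment model, the equal split between error types, and the treatment of an insertion as an extra random base after the current one are assumptions.

```python
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_str_reference(left_flank, motif, copies, right_flank):
    """STR reference with an expected repeat size of len(motif) * copies."""
    return left_flank + motif * copies + right_flank


def add_errors(seq, rate, rng):
    """With probability `rate` per base, substitute, delete, or insert."""
    out = []
    for base in seq:
        if rng.random() < rate:
            kind = rng.choice(("sub", "del", "ins"))
            if kind == "sub":
                out.append(rng.choice([b for b in "ACGT" if b != base]))
            elif kind == "ins":
                out.append(base)
                out.append(rng.choice("ACGT"))
            # "del": the base is simply dropped
        else:
            out.append(base)
    return "".join(out)


def simulate_pair(reference, read_len, fragment_len, rate, rng):
    """Draw one read pair from a random fragment in a random direction."""
    start = rng.randrange(len(reference) - fragment_len + 1)
    fragment = reference[start:start + fragment_len]
    if rng.random() < 0.5:
        fragment = revcomp(fragment)
    read1 = add_errors(fragment[:read_len], rate, rng)
    read2 = add_errors(revcomp(fragment)[:read_len], rate, rng)
    return read1, read2


rng = random.Random(0)  # fixed seed so the toy run is repeatable
reference = build_str_reference("ACGTTGCAAC" * 10, "CAG", 40, "TTGACCGTAA" * 10)
clean_pair = simulate_pair(reference, read_len=50, fragment_len=150, rate=0.0, rng=rng)
noisy_pair = simulate_pair(reference, read_len=50, fragment_len=150, rate=0.01, rng=rng)
```

Recording the start position and direction of each fragment alongside the reads would reproduce the step in which the pattern of each read was tracked for evaluating the calls.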

The mixed library data was generated by an in-house Perl script.

Availability of data and materials

The raw read libraries and variant calling result files from the GIAB project were downloaded from ftp://ftp-trace.ncbi.nih.gov/giab/ftp/. The raw read libraries of the UDN project were obtained directly from the NIH Undiagnosed Diseases Program through collaboration.

Abbreviations

STRs: Short tandem repeats
INDELs: Insertions or deletions
ALS: Amyotrophic lateral sclerosis
PCR: Polymerase chain reaction
GIAB: Genome in a Bottle Consortium
VCFs: Variant calling files
UDN: Undiagnosed Disease Network
T2T: Telomere-to-Telomere genome reference

References

Tautz D, Schlötterer C. Simple sequences. Curr Opin Genet Dev. 1994;4(6):832–7. https://doi.org/10.1016/0959-437x(94)90067-1. PMID: 7888752.
Fan H, Chu JY. A brief review of short tandem repeat mutation. Genomics Proteomics Bioinformatics. 2007;5(1):7–14. https://doi.org/10.1016/S1672-0229(07)60009-6. PMID:17572359;PMCID:PMC5054066.
Hamada H, Petrino MG, Kakunaga T. A novel repeated element with Z-DNA-forming potential is widely found in evolutionarily diverse eukaryotic genomes. Proc Natl Acad Sci U S A. 1982;79(21):6465–9. https://doi.org/10.1073/pnas.79.21.6465. PMID:6755470;PMCID:PMC347147.
Tautz D, Renz M. Simple sequences are ubiquitous repetitive components of eukaryotic genomes. Nucleic Acids Res. 1984;12(10):4127–38. https://doi.org/10.1093/nar/12.10.4127. PMID:6328411;PMCID:PMC318821.
van Belkum A, Scherer S, van Alphen L, Verbrugh H. Short-sequence DNA repeats in prokaryotic genomes. Microbiol Mol Biol Rev. 1998;62(2):275–93.
Madsen BE, Villesen P, Wiuf C. Short tandem repeats in human exons: a target for disease mutations. BMC Genomics. 2008;12(9):410. https://doi.org/10.1186/1471-2164-9-410. PMID:18789129;PMCID:PMC2543027.
Kornberg A, Bertsch LL, Jackson JF, Khorana HG. Enzymatic synthesis of deoxyribonucleic acid, XVI. Oligonucleotides as templates and the mechanism of their replication. Proc Natl Acad Sci U S A. 1964;51(2):315–23. https://doi.org/10.1073/pnas.51.2.315. PMID: 14124330; PMCID: PMC300067.
Strand M, Prolla TA, Liskay RM, Petes TD. Destabilization of tracts of simple repetitive DNA in yeast by mutations affecting DNA mismatch repair. Nature. 1993;365(6443):274–6. https://doi.org/10.1038/365274a0. Erratum in: Nature. 1994 Apr 7;368(6471):569. PMID: 8371783.
Weber JL, Wong C. Mutation of human short tandem repeats. Hum Mol Genet. 1993;2(8):1123–8. https://doi.org/10.1093/hmg/2.8.1123. PMID: 8401493.
Ellegren H. Heterogeneous mutation processes in human microsatellite DNA sequences. Nat Genet. 2000;24(4):400–2. https://doi.org/10.1038/74249. PMID: 10742106.
Gymrek M, Willems T, Guilmatre A, Zeng H, Markus B, Georgiev S, Daly MJ, Price AL, Pritchard JK, Sharp AJ, Erlich Y. Abundant contribution of short tandem repeats to gene expression variation in humans. Nat Genet. 2016;48(1):22–9. https://doi.org/10.1038/ng.3461. Epub 2015 Dec 7. PMID: 26642241; PMCID: PMC4909355.
Sun JH, Zhou L, Emerson DJ, Phyo SA, Titus KR, Gong W, Gilgenast TG, Beagan JA, Davidson BL, Tassone F, Phillips-Cremins JE. Disease-associated short tandem repeats co-localize with chromatin domain boundaries. Cell. 2018;175(1):224-238.e15. https://doi.org/10.1016/j.cell.2018.08.005. Epub 2018 Aug 30. PMID: 30173918; PMCID: PMC6175607.
Hannan A. Tandem repeats mediating genetic plasticity in health and disease. Nat Rev Genet. 2018;19:286–98. https://doi.org/10.1038/nrg.2017.115.
Fu YH, Kuhl DP, Pizzuti A, Pieretti M, Sutcliffe JS, Richards S, Verkerk AJ, Holden JJ, Fenwick RG Jr, Warren ST, et al. Variation of the CGG repeat at the fragile X site results in genetic instability: resolution of the Sherman paradox. Cell. 1991;67(6):1047–58. https://doi.org/10.1016/0092-8674(91)90283-5. PMID: 1760838.
Kremer B, Almqvist E, Theilmann J, Spence N, Telenius H, Goldberg YP, Hayden MR. Sex-dependent mechanisms for expansions and contractions of the CAG repeat on affected Huntington disease chromosomes. Am J Hum Genet. 1995;57(2):343–50. PMID: 7668260; PMCID: PMC1801544.
Mirkin SM. Expandable DNA repeats and human disease. Nature. 2007;447(7147):932–40. https://doi.org/10.1038/nature05977. PMID: 17581576.
La Spada AR, Taylor JP. Repeat expansion disease: progress and puzzles in disease pathogenesis. Nat Rev Genet. 2010;11(4):247–58. https://doi.org/10.1038/nrg2748. PMID:20177426;PMCID:PMC4704680.
McMurray CT. Mechanisms of trinucleotide repeat instability during human development. Nat Rev Genet. 2010;11(11):786–99. https://doi.org/10.1038/nrg2828. Erratum in: Nat Rev Genet. 2010 Dec;11(12):886. PMID: 20953213; PMCID: PMC3175376.
Pearson CE, Nichol Edamura K, Cleary JD. Repeat instability: mechanisms of dynamic mutations. Nat Rev Genet. 2005;6(10):729–42. https://doi.org/10.1038/nrg1689. PMID: 16205713.
Depienne C, Mandel JL. 30 years of repeat expansion disorders: what have we learned and what are the remaining challenges? Am J Hum Genet. 2021;108(5):764–85. https://doi.org/10.1016/j.ajhg.2021.03.011. Epub 2021 Apr 2 PMID: 33811808.
Lavedan C, Hofmann-Radvanyi H, Shelbourne P, Rabes JP, Duros C, Savoy D, Dehaupas I, Luce S, Johnson K, Junien C. Myotonic dystrophy: size- and sex-dependent dynamics of CTG meiotic instability, and somatic mosaicism. Am J Hum Genet. 1993;52(5):875–83. PMID: 8098180; PMCID: PMC1682032.
Anvret M, Ahlberg G, Grandell U, Hedberg B, Johnson K, Edström L. Larger expansions of the CTG repeat in muscle compared to lymphocytes from patients with myotonic dystrophy. Hum Mol Genet. 1993;2(9):1397–400. https://doi.org/10.1093/hmg/2.9.1397. PMID: 8242063.
Ashizawa T, Dubel JR, Harati Y. Somatic instability of CTG repeat in myotonic dystrophy. Neurology. 1993;43(12):2674–8. https://doi.org/10.1212/wnl.43.12.2674. PMID: 8255475.
Telenius H, Kremer B, Goldberg YP, Theilmann J, Andrew SE, Zeisler J, Adam S, Greenberg C, Ives EJ, Clarke LA, et al. Somatic and gonadal mosaicism of the Huntington disease gene CAG repeat in brain and sperm. Nat Genet. 1994;6(4):409–14. https://doi.org/10.1038/ng0494-409. Erratum in: Nat Genet. 1994 May;7(1):113. PMID: 8054984.
Helderman-van den Enden AT, Maaswinkel-Mooij PD, Hoogendoorn E, Willemsen R, Maat-Kievit JA, Losekoot M, Oostra BA. Monozygotic twin brothers with the fragile X syndrome: different CGG repeats and different mental capacities. J Med Genet. 1999;36(3):253–7. PMID: 10204857; PMCID: PMC1734321.
Fortune MT, Vassilopoulos C, Coolbaugh MI, Siciliano MJ, Monckton DG. Dramatic, expansion-biased, age-dependent, tissue-specific somatic mosaicism in a transgenic mouse model of triplet repeat instability. Hum Mol Genet. 2000;9(3):439–45. https://doi.org/10.1093/hmg/9.3.439. PMID: 10655554.
Gonitel R, Moffitt H, Sathasivam K, Woodman B, Detloff PJ, Faull RL, Bates GP. DNA instability in postmitotic neurons. Proc Natl Acad Sci U S A. 2008;105(9):3467–72. https://doi.org/10.1073/pnas.0800048105. Epub 2008 Feb 25. PMID: 18299573; PMCID: PMC2265187.
McGoldrick P, Zhang M, van Blitterswijk M, Sato C, Moreno D, Xiao S, Zhang AB, McKeever PM, Weichert A, Schneider R, Keith J, Petrucelli L, Rademakers R, Zinman L, Robertson J, Rogaeva E. Unaffected mosaic C9ORF72 case: RNA foci, dipeptide proteins, but upregulated C9ORF72 expression. Neurology. 2018;90(4):e323–31. https://doi.org/10.1212/WNL.0000000000004865. Epub 2017 Dec 27. PMID: 29282338; PMCID: PMC5798652.
Hearne CM, Ghosh S, Todd JA. Microsatellites for linkage analysis of genetic traits. Trends Genet. 1992;8(8):288–94. https://doi.org/10.1016/0168-9525(92)90256-4. PMID: 1509520.
Bruford MW, Wayne RK. Microsatellites and their application to population genetic studies. Curr Opin Genet Dev. 1993;3(6):939–43. https://doi.org/10.1016/0959-437x(93)90017-j. PMID: 8118220.
Butler JM. Genetics and genomics of core short tandem repeat loci used in human identity testing. J Forensic Sci. 2006;51(2):253–65. https://doi.org/10.1111/j.1556-4029.2006.00046.x. PMID: 16566758.
Warner JP, Barron LH, Goudie D, Kelly K, Dow D, Fitzpatrick DR, Brock DJ. A general method for the detection of large CAG repeat expansions by fluorescent PCR. J Med Genet. 1996;33(12):1022–6. https://doi.org/10.1136/jmg.33.12.1022. PMID:9004136;PMCID:PMC1050815.
Buchman VL, Cooper-Knock J, Connor-Robson N, Higginbottom A, Kirby J, Razinskaya OD, Ninkina N, Shaw PJ. Simultaneous and independent detection of C9ORF72 alleles with low and high number of GGGGCC repeats using an optimised protocol of Southern blot hybridisation. Mol Neurodegener. 2013;8(8):12. https://doi.org/10.1186/1750-1326-8-12. PMID:23566336;PMCID:PMC3626718.
Akimoto C, Volk AE, van Blitterswijk M, Van den Broeck M, Leblond CS, Lumbroso S, Camu W, Neitzel B, Onodera O, van Rheenen W, Pinto S, Weber M, Smith B, Proven M, Talbot K, Keagle P, Chesi A, Ratti A, van der Zee J, Alstermark H, Birve A, Calini D, Nordin A, Tradowsky DC, Just W, Daoud H, Angerbauer S, DeJesus-Hernandez M, Konno T, Lloyd-Jani A, de Carvalho M, Mouzat K, Landers JE, Veldink JH, Silani V, Gitler AD, Shaw CE, Rouleau GA, van den Berg LH, Van Broeckhoven C, Rademakers R, Andersen PM, Kubisch C. A blinded international study on the reliability of genetic testing for GGGGCC-repeat expansions in C9ORF72 reveals marked differences in results among 14 laboratories. J Med Genet. 2014;51(6):419–24. https://doi.org/10.1136/jmedgenet-2014-102360. Epub 2014 Apr 4. PMID: 24706941; PMCID: PMC4033024.
Amarasinghe SL, Su S, Dong X, Zappia L, Ritchie ME, Gouil Q. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 2020;21(1):30. https://doi.org/10.1186/s13059-020-1935-5. PMID:32033565;PMCID:PMC7006217.
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297–303. https://doi.org/10.1101/gr.107524.110. PMID: 20644199; PMCID: PMC2928508.
Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999;27(2):573–80. https://doi.org/10.1093/nar/27.2.573. PMID:9862982;PMCID:PMC148217.
Gymrek M, Golan D, Rosset S, Erlich Y. lobSTR: a short tandem repeat profiler for personal genomes. Genome Res. 2012;22(6):1154–62. https://doi.org/10.1101/gr.135780.111. Epub 2012 Apr 20. PMID: 22522390; PMCID: PMC3371701.
Cao MD, Tasker E, Willadsen K, Imelfort M, Vishwanathan S, Sureshkumar S, Balasubramanian S, Bodén M. Inferring short tandem repeat variation from paired-end short reads. Nucleic Acids Res. 2014;42(3):e16. https://doi.org/10.1093/nar/gkt1313.
Kojima K, Kawai Y, Misawa K, Mimori T, Nagasaki M. STR-realigner: a realignment method for short tandem repeat regions. BMC Genomics. 2016;17(1):991. https://doi.org/10.1186/s12864-016-3294-x. PMID:27912743;PMCID:PMC5135796.
Willems T, Zielinski D, Yuan J, Gordon A, Gymrek M, Erlich Y. Genome-wide profiling of heritable and de novo STR variations. Nat Methods. 2017;14(6):590–2. https://doi.org/10.1038/nmeth.4267. Epub 2017 Apr 24. PMID: 28436466; PMCID: PMC5482724.
Dolzhenko E, van Vugt JJFA, Shaw RJ, Bekritsky MA, van Blitterswijk M, Narzisi G, Ajay SS, Rajan V, Lajoie BR, Johnson NH, Kingsbury Z, Humphray SJ, Schellevis RD, Brands WJ, Baker M, Rademakers R, Kooyman M, Tazelaar GHP, van Es MA, McLaughlin R, Sproviero W, Shatunov A, Jones A, Al Khleifat A, Pittman A, Morgan S, Hardiman O, Al-Chalabi A, Shaw C, Smith B, Neo EJ, Morrison K, Shaw PJ, Reeves C, Winterkorn L, Wexler NS, US–Venezuela Collaborative Research Group, Housman DE, Ng CW, Li AL, Taft RJ, van den Berg LH, Bentley DR, Veldink JH, Eberle MA. Detection of long repeat expansions from PCR-free whole-genome sequence data. Genome Res. 2017;27(11):1895–903. https://doi.org/10.1101/gr.225672.117. Epub 2017 Sep 8. PMID: 28887402; PMCID: PMC5668946.
Tang H, Kirkness EF, Lippert C, Biggs WH, Fabani M, Guzman E, Ramakrishnan S, Lavrenko V, Kakaradov B, Hou C, Hicks B, Heckerman D, Och FJ, Caskey CT, Venter JC, Telenti A. Profiling of short-tandem-repeat disease alleles in 12,632 human whole genomes. Am J Hum Genet. 2017;101(5):700–15. https://doi.org/10.1016/j.ajhg.2017.09.013. PMID:29100084;PMCID:PMC5673627.
Dashnow H, Lek M, Phipson B, Halman A, Sadedin S, Lonsdale A, Davis M, Lamont P, Clayton JS, Laing NG, MacArthur DG, Oshlack A. STRetch: detecting and discovering pathogenic short tandem repeat expansions. Genome Biol. 2018;19(1):121. https://doi.org/10.1186/s13059-018-1505-2. PMID:30129428;PMCID:PMC6102892.
Mousavi N, Shleizer-Burko S, Yanicky R, Gymrek M. Profiling the genome-wide landscape of tandem repeat expansions. Nucleic Acids Res. 2019;47(15):e90. https://doi.org/10.1093/nar/gkz501. PMID:31194863;PMCID:PMC6735967.
Wang X, Huang M, Budowle B, Ge J. TRcaller: a novel tool for precise and ultrafast tandem repeat variant genotyping in massively parallel sequencing reads. Front Genet. 2023;18(14):1227176. https://doi.org/10.3389/fgene.2023.1227176. PMID: 37533432; PMCID: PMC10390829.
Dolzhenko E, Bennett MF, Richmond PA, Trost B, Chen S, van Vugt JJFA, Nguyen C, Narzisi G, Gainullin VG, Gross AM, Lajoie BR, Taft RJ, Wasserman WW, Scherer SW, Veldink JH, Bentley DR, Yuen RKC, Bahlo M, Eberle MA. ExpansionHunter Denovo: a computational method for locating known and novel repeat expansions in short-read sequencing data. Genome Biol. 2020;21(1):102. https://doi.org/10.1186/s13059-020-02017-z. PMID:32345345;PMCID:PMC7187524.
Martincorena I, Campbell PJ. Somatic mutation in cancer and normal cells. Science. 2015;349(6255):1483–9. https://doi.org/10.1126/science.aab4082. Epub 2015 Sep 24. Erratum in: Science. 2016 Mar 4;351(6277). pii: aaf5401. doi: 10.1126/science.aaf5401. PMID: 26404825.
Benjamin D, Sato T, Cibulskis K, Getz G, Stewart C, Lichtenstein L. Calling somatic SNVs and Indels with Mutect2. bioRxiv. 2019. https://doi.org/10.1101/861054.
Manley K, Shirley TL, Flaherty L, Messer A. Msh2 deficiency prevents in vivo somatic instability of the CAG repeat in Huntington disease transgenic mice. Nat Genet. 1999;23(4):471–3. https://doi.org/10.1038/70598. PMID: 10581038.
Matsuura T, Sasaki H, Yabe I, Hamada K, Hamada T, Shitara M, Tashiro K. Mosaicism of unstable CAG repeats in the brain of spinocerebellar ataxia type 2. J Neurol. 1999;246(9):835–9. https://doi.org/10.1007/s004150050464. PMID: 10525984.
van den Broek WJ, Nelen MR, Wansink DG, Coerwinkel MM, te Riele H, Groenen PJ, Wieringa B. Somatic expansion behaviour of the (CTG)n repeat in myotonic dystrophy knock-in mice is differentially affected by Msh3 and Msh6 mismatch-repair proteins. Hum Mol Genet. 2002;11(2):191–8. https://doi.org/10.1093/hmg/11.2.191. PMID: 11809728.
Kennedy L, Evans E, Chen CM, Craven L, Detloff PJ, Ennis M, Shelbourne PF. Dramatic tissue-specific mutation length increases are an early molecular event in Huntington disease pathogenesis. Hum Mol Genet. 2003;12(24):3359–67. https://doi.org/10.1093/hmg/ddg352. Epub 2003 Oct 21 PMID: 14570710.
Gomes-Pereira M, Fortune MT, Ingram L, McAbney JP, Monckton DG. Pms2 is a genetic enhancer of trinucleotide CAG.CTG repeat somatic mosaicism: implications for the mechanism of triplet repeat expansion. Hum Mol Genet. 2004;13(16):1815–25. Epub 2004 Jun 15. PMID: 15198993.
Kovtun IV, Thornhill AR, McMurray CT. Somatic deletion events occur during early embryonic development and modify the extent of CAG expansion in subsequent generations. Hum Mol Genet. 2004;13(24):3057–68. https://doi.org/10.1093/hmg/ddh325. Epub 2004 Oct 20 PMID: 15496421.
Matsuura T, Fang P, Lin X, Khajavi M, Tsuji K, Rasmussen A, Grewal RP, Achari M, Alonso ME, Pulst SM, Zoghbi HY, Nelson DL, Roa BB, Ashizawa T. Somatic and germline instability of the ATTCT repeat in spinocerebellar ataxia type 10. Am J Hum Genet. 2004;74(6):1216–24. https://doi.org/10.1086/421526. Epub 2004 May 4. PMID: 15127363; PMCID: PMC1182085.
Rindler PM, Clark RM, Pollard LM, De Biase I, Bidichandani SI. Replication in mammalian cells recapitulates the locus-specific differences in somatic instability of genomic GAA triplet-repeats. Nucleic Acids Res. 2006;34(21):6352–61. https://doi.org/10.1093/nar/gkl846. Epub 2006 Nov 16. PMID: 17142224; PMCID: PMC1669776.
Kovtun IV, Liu Y, Bjoras M, Klungland A, Wilson SH, McMurray CT. OGG1 initiates age-dependent CAG trinucleotide expansion in somatic cells. Nature. 2007;447(7143):447–52. https://doi.org/10.1038/nature05778. Epub 2007 Apr 22. PMID: 17450122; PMCID: PMC2681094.
Shelbourne PF, Keller-McGandy C, Bi WL, Yoon SR, Dubeau L, Veitch NJ, Vonsattel JP, Wexler NS, US-Venezuela Collaborative Research Group, Arnheim N, Augood SJ. Triplet repeat mutation length gains correlate with cell-type specific vulnerability in Huntington disease brain. Hum Mol Genet. 2007;16(10):1133–42. https://doi.org/10.1093/hmg/ddm054. Epub 2007 Apr 4. PMID: 17409200.
Libby RT, Hagerman KA, Pineda VV, Lau R, Cho DH, Baccam SL, Axford MM, Cleary JD, Moore JM, Sopher BL, Tapscott SJ, Filippova GN, Pearson CE, La Spada AR. CTCF cis-regulates trinucleotide repeat instability in an epigenetic manner: a novel basis for mutational hot spot determination. PLoS Genet. 2008;4(11):e1000257. https://doi.org/10.1371/journal.pgen.1000257. Epub 2008 Nov 14. PMID: 19008940; PMCID: PMC2573955.
Goula AV, Berquist BR, Wilson DM 3rd, Wheeler VC, Trottier Y, Merienne K. Stoichiometry of base excision repair proteins correlates with increased somatic CAG instability in striatum over cerebellum in Huntington’s disease transgenic mice. PLoS Genet. 2009;5(12):e1000749. https://doi.org/10.1371/journal.pgen.1000749. Epub 2009 Dec 4. PMID: 19997493; PMCID: PMC2778875.
Zook JM, McDaniel J, Olson ND, Wagner J, Parikh H, Heaton H, Irvine SA, Trigg L, Truty R, McLean CY, De La Vega FM, Xiao C, Sherry S, Salit M. An open resource for accurately benchmarking small variant and reference calls. Nat Biotechnol. 2019;37(5):561–6. https://doi.org/10.1038/s41587-019-0074-6. Epub 2019 Apr 1. PMID: 30936564; PMCID: PMC6500473.
Cao MD, Balasubramanian S, Bodén M. Sequencing technologies and tools for short tandem repeat variation detection. Brief Bioinform. 2015;16(2):193–204. https://doi.org/10.1093/bib/bbu001. Epub 2014 Feb 6. PMID: 24504770.
Halman A, Dolzhenko E, Oshlack A. STRipy: a graphical application for enhanced genotyping of pathogenic short tandem repeats in sequencing data. Hum Mutat. 2022;43(7):859–68. https://doi.org/10.1002/humu.24382. Epub 2022 Apr 21. PMID: 35395114; PMCID: PMC9541159.
Via M, Gignoux C, Burchard EG. The 1000 Genomes Project: new opportunities for research and social challenges. Genome Med. 2010;2(1):3. https://doi.org/10.1186/gm124. PMID: 20193048; PMCID: PMC2829928.
Hickey G, Heller D, Monlong J, Sibbesen JA, Sirén J, Eizenga J, Dawson ET, Garrison E, Novak AM, Paten B. Genotyping structural variants in pangenome graphs using the vg toolkit. Genome Biol. 2020;21(1):35. https://doi.org/10.1186/s13059-020-1941-7. PMID: 32051000; PMCID: PMC7017486.
Takayama J, Tadaka S, Yano K, Katsuoka F, Gocho C, Funayama T, Makino S, Okamura Y, Kikuchi A, Sugimoto S, Kawashima J, Otsuki A, Sakurai-Yageta M, Yasuda J, Kure S, Kinoshita K, Yamamoto M, Tamiya G. Construction and integration of three de novo Japanese human genome assemblies toward a population-specific reference. Nat Commun. 2021;12(1):226. https://doi.org/10.1038/s41467-020-20146-8. PMID: 33431880; PMCID: PMC7801658.
Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, Vollger MR, Altemose N, Uralsky L, Gershman A, Aganezov S, Hoyt SJ, Diekhans M, Logsdon GA, Alonge M, Antonarakis SE, Borchers M, Bouffard GG, Brooks SY, Caldas GV, Chen NC, Cheng H, Chin CS, Chow W, de Lima LG, Dishuck PC, Durbin R, Dvorkina T, Fiddes IT, Formenti G, Fulton RS, Fungtammasan A, Garrison E, Grady PGS, Graves-Lindsay TA, Hall IM, Hansen NF, Hartley GA, Haukness M, Howe K, Hunkapiller MW, Jain C, Jain M, Jarvis ED, Kerpedjiev P, Kirsche M, Kolmogorov M, Korlach J, Kremitzki M, Li H, Maduro VV, Marschall T, McCartney AM, McDaniel J, Miller DE, Mullikin JC, Myers EW, Olson ND, Paten B, Peluso P, Pevzner PA, Porubsky D, Potapova T, Rogaev EI, Rosenfeld JA, Salzberg SL, Schneider VA, Sedlazeck FJ, Shafin K, Shew CJ, Shumate A, Sims Y, Smit AFA, Soto DC, Sović I, Storer JM, Streets A, Sullivan BA, Thibaud-Nissen F, Torrance J, Wagner J, Walenz BP, Wenger A, Wood JMD, Xiao C, Yan SM, Young AC, Zarate S, Surti U, McCoy RC, Dennis MY, Alexandrov IA, Gerton JL, O’Neill RJ, Timp W, Zook JM, Schatz MC, Eichler EE, Miga KH, Phillippy AM. The complete sequence of a human genome. Science. 2022;376(6588):44–53. https://doi.org/10.1126/science.abj6987. Epub 2022 Mar 31. PMID: 35357919; PMCID: PMC9186530.
Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv: Genomics. 2013. https://doi.org/10.48550/arXiv.1303.3997.
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25. https://doi.org/10.1186/gb-2009-10-3-r25. Epub 2009 Mar 4. PMID: 19261174; PMCID: PMC2690996.
Oliva A, Tobler R, Llamas B, Souilmi Y. Additional evaluations show that specific BWA-aln settings still outperform BWA-mem for ancient DNA data alignment. Ecol Evol. 2021;11(24):18743–8. https://doi.org/10.1002/ece3.8297. PMID: 35003706; PMCID: PMC8717315.
Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science. 1996;273(5281):1516–7. https://doi.org/10.1126/science.273.5281.1516. PMID: 8801636.
Altmüller J, Palmer LJ, Fischer G, Scherb H, Wjst M. Genomewide scans of complex human diseases: true linkage is hard to find. Am J Hum Genet. 2001;69(5):936–50. https://doi.org/10.1086/324069. PMID: 11565063; PMCID: PMC1274370.
Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, Cho JH, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Whittemore AS, Boehnke M, Clark AG, Eichler EE, Gibson G, Haines JL, Mackay TF, McCarroll SA, Visscher PM. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747–53. https://doi.org/10.1038/nature08494. PMID: 19812666; PMCID: PMC2831613.
Ibanez L, Farias FHG, Dube U, Mihindukulasuriya KA, Harari O. Polygenic risk scores in neurodegenerative diseases: a review. Curr Genet Med Rep. 2019;7:22–9. https://doi.org/10.1007/s40142-019-0158-0.
Dashnow H, Pedersen BS, Hiatt L, Brown J, Beecroft SJ, Ravenscroft G, LaCroix AJ, Lamont P, Roxburgh RH, Rodrigues MJ, Davis M, Mefford HC, Laing NG, Quinlan AR. STRling: a k-mer counting approach that detects short tandem repeat expansions at known and novel loci. bioRxiv. 2021.11.18.469113. https://doi.org/10.1101/2021.11.18.469113.
Fearnley LG, Bennett MF, Bahlo M. Detection of repeat expansions in large next generation DNA and RNA sequencing data without alignment. Sci Rep. 2022;12(1):13124. https://doi.org/10.1038/s41598-022-17267-z. PMID: 35907931; PMCID: PMC9338934.
Cortese A, Simone R, Sullivan R, Vandrovcova J, Tariq H, Yau WY, Humphrey J, Jaunmuktane Z, Sivakumar P, Polke J, Ilyas M, Tribollet E, Tomaselli PJ, Devigili G, Callegari I, Versino M, Salpietro V, Efthymiou S, Kaski D, Wood NW, Andrade NS, Buglo E, Rebelo A, Rossor AM, Bronstein A, Fratta P, Marques WJ, Züchner S, Reilly MM, Houlden H. Biallelic expansion of an intronic repeat in RFC1 is a common cause of late-onset ataxia. Nat Genet. 2019;51(4):649–58. https://doi.org/10.1038/s41588-019-0372-4.
Rafehi H, Szmulewicz DJ, Bennett MF, Sobreira NLM, Pope K, Smith KR, Gillies G, Diakumis P, Dolzhenko E, Eberle MA, Barcina MG, Breen DP, Chancellor AM, Cremer PD, Delatycki MB, Fogel BL, Hackett A, Halmagyi GM, Kapetanovic S, Lang A, Mossman S, Mu W, Patrikios P, Perlman SL, Rosemergy I, Storey E, Watson SRD, Wilson MA, Zee DS, Valle D, Amor DJ, Bahlo M, Lockhart PJ. Bioinformatics-based identification of expanded repeats: a non-reference intronic pentamer expansion in RFC1 causes CANVAS. Am J Hum Genet. 2019;105(1):151–65. https://doi.org/10.1016/j.ajhg.2019.05.016. Epub 2019 Jun 20. PMID: 31230722; PMCID: PMC6612533.
Currò R, Salvalaggio A, Tozza S, Gemelli C, Dominik N, Galassi Deforie V, Magrinelli F, Castellani F, Vegezzi E, Businaro P, Callegari I, Pichiecchio A, Cosentino G, Alfonsi E, Marchioni E, Colnaghi S, Gana S, Valente EM, Tassorelli C, Efthymiou S, Facchini S, Carr A, Laura M, Rossor AM, Manji H, Lunn MP, Pegoraro E, Santoro L, Grandis M, Bellone E, Beauchamp NJ, Hadjivassiliou M, Kaski D, Bronstein AM, Houlden H, Reilly MM, Mandich P, Schenone A, Manganelli F, Briani C, Cortese A. RFC1 expansions are a common cause of idiopathic sensory neuropathy. Brain. 2021;144(5):1542–50. https://doi.org/10.1093/brain/awab072. PMID: 33969391; PMCID: PMC8262986.


Acknowledgements

We thank Michael Guo from the University of Pennsylvania for assistance in generating data that allowed LUSTR to be compared with other currently available STR callers. We thank the Undiagnosed Diseases Network for providing sequenced libraries and the information required to perform blinded whole genome screening. A full list of Undiagnosed Diseases Network members can be found in Additional file 8.

Undiagnosed Diseases Network²

Maria T. Acosta², Margaret Adam², David R. Adams², Raquel L. Alvarez², Justin Alvey², Laura Amendola², Ashley Andrews², Euan A. Ashley², Carlos A. Bacino², Guney Bademci², Ashok Balasubramanyam², Dustin Baldridge², Jim Bale², Michael Bamshad², Deborah Barbouth², Pinar Bayrak-Toydemir², Anita Beck², Alan H. Beggs², Edward Behrens², Gill Bejerano², Hugo J. Bellen², Jimmy Bennett², Beverly Berg-Rood², Jonathan A. Bernstein², Gerard T. Berry², Anna Bican², Stephanie Bivona², Elizabeth Blue², John Bohnsack², Devon Bonner², Lorenzo Botto², Brenna Boyd², Lauren C. Briere², Gabrielle Brown², Elizabeth A. Burke², Lindsay C. Burrage², Manish J. Butte², Peter Byers², William E. Byrd², John Carey², Olveen Carrasquillo², Thomas Cassini², Ta Chen Peter Chang², Sirisak Chanprasert², Hsiao-Tuan Chao², Ivan Chinn², Gary D. Clark², Terra R. Coakley², Laurel A. Cobban², Joy D. Cogan², Matthew Coggins², F. Sessions Cole², Heather A. Colley², Heidi Cope², Rosario Corona², William J. Craigen², Andrew B. Crouse², Michael Cunningham², Precilla D’Souza², Hongzheng Dai², Surendra Dasari², Joie Davis², Jyoti G. Dayal², Esteban C. Dell’Angelica², Patricia Dickson², Katrina Dipple², Daniel Doherty², Naghmeh Dorrani², Argenia L. Doss², Emilie D. Douine², Dawn Earl², David J. Eckstein², Lisa T. Emrick², Christine M. Eng², Marni Falk², Elizabeth L. Fieg², Paul G. Fisher², Brent L. Fogel², Irman Forghani², William A. Gahl², Ian Glass², Bernadette Gochuico², Page C. Goddard², Rena A. Godfrey², Katie Golden-Grant², Alana Grajewski², Don Hadley², Sihoun Hahn², Meghan C. Halley², Rizwan Hamid², Kelly Hassey², Nichole Hayes², Frances High², Anne Hing², Fuki M. Hisama², Ingrid A. Holm², Jason Hom², Martha Horike-Pyne², Alden Huang², Sarah Hutchison², Wendy Introne², Rosario Isasi², Kosuke Izumi², Fariha Jamal², Gail P. Jarvik², Jeffrey Jarvik², Suman Jayadev², Orpa Jean-Marie², Vaidehi Jobanputra², Lefkothea Karaviti², Shamika Ketkar², Dana Kiley², Gonench Kilich², Shilpa N. Kobren², Isaac S. 
Kohane², Jennefer N. Kohler², Susan Korrick², Mary Kozuira², Deborah Krakow², Donna M. Krasnewich², Elijah Kravets², Seema R. Lalani², Byron Lam², Christina Lam², Brendan C. Lanpher², Ian R. Lanza², Kimberly LeBlanc², Brendan H. Lee², Roy Levitt², Richard A. Lewis², Pengfei Liu², Xue Zhong Liu², Nicola Longo², Sandra K. Loo², Joseph Loscalzo², Richard L. Maas², Ellen F. Macnamara², Calum A. MacRae², Valerie V. Maduro², AudreyStephannie Maghiro², Rachel Mahoney², May Christine V. Malicdan², Laura A. Mamounas², Teri A. Manolio², Rong Mao², Kenneth Maravilla², Ronit Marom², Gabor Marth², Beth A. Martin², Martin G. Martin², Julian A. Martínez-Agosto², Shruti Marwaha², Jacob McCauley², Allyn McConkie-Rosell², Alexa T. McCray², Elisabeth McGee², Heather Mefford², J. Lawrence Merritt², Matthew Might², Ghayda Mirzaa², Eva Morava², Paolo Moretti², John Mulvihill², Mariko Nakano-Okuno², Stanley F. Nelson², John H. Newman², Sarah K. Nicholas², Deborah Nickerson², Shirley Nieves-Rodriguez², Donna Novacic², Devin Oglesbee², James P. Orengo², Laura Pace², Stephen Pak², J. Carl Pallais², Christina G.S. Palmer², Jeanette C. Papp², Neil H. Parker², John A. Phillips III², Jennifer E. Posey², Lorraine Potocki², Barbara N. Pusey Swerdzewski², Aaron Quinlan², Deepak A. Rao², Anna Raper², Wendy Raskind², Genecee Renteria², Chloe M. Reuter², Lynette Rives², Amy K. Robertson², Lance H. Rodan², Jill A. Rosenfeld², Natalie Rosenwasser², Francis Rossignol², Maura Ruzhnikov², Ralph Sacco², Jacinda B. Sampson², Mario Saporta², Judy Schaechter², Timothy Schedl², Kelly Schoch², Daryl A. Scott², C. Ron Scott², Elaine Seto², Vandana Shashi², Jimann Shin², Edwin K. Silverman², Janet S. Sinsheimer², Kathy Sisco², Edward C. Smith², Kevin S. Smith², Lilianna Solnica-Krezel², Ben Solomon², Rebecca C. Spillmann², Joan M. Stoler², Kathleen Sullivan², Jennifer A. Sullivan², Angela Sun², Shirley Sutton², David A. Sweetser², Virginia Sybert², Holly K. Tabor², Queenie K.-G. Tan², Amelia L. M. 
Tan², Arjun Tarakad², Mustafa Tekin², Fred Telischi², Willa Thorson², Cynthia J. Tifft², Camilo Toro², Alyssa A. Tran², Rachel A. Ungar², Tiina K. Urv², Adeline Vanderver², Matt Velinder², Dave Viskochil², Tiphanie P. Vogel², Colleen E. Wahl², Melissa Walker², Stephanie Wallace², Nicole M. Walley², Jennifer Wambach², Jijun Wan², Lee-kai Wang², Michael F. Wangler², Patricia A. Ward², Daniel Wegner², Monika Weisz Hubshman², Mark Wener², Tara Wenger², Monte Westerfield², Matthew T. Wheeler², Jordan Whitlock², Lynne A. Wolfe², Kim Worley², Changrui Xiao², Shinya Yamamoto², John Yang², Zhe Zhang², Stephan Zuchner²

Funding

This work was funded by R01-NS094596.

Author information

Authors and Affiliations

Division of Pharmacotherapy and Experimental Therapeutics, Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
Jinfeng Lu & Erin L. Heinzen
NIH Undiagnosed Diseases Program, National Human Genome Research Institute (NHGRI), National Institutes of Health, Bethesda, MD, 20892, USA
Camilo Toro & David R. Adams
Neurology Department, Universidade de São Paulo, São Paulo, SP, 05508-010, Brazil
Cristiane Araujo Martins Moreno
Penn Neurodegeneration Genomics Center, Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
Wan-Ping Lee & Yuk Yee Leung
Department of Neurology, Division of Neuromuscular Medicine, Columbia University Irving Medical Center, New York, NY, 10032, USA
Mathew B. Harms
The Taub Institute for Research on Alzheimer’s Disease and the Aging Brain, Gertrude H. Sergievsky Center, Department of Neurology, College of Physicians and Surgeons, Columbia University, The New York Presbyterian Hospital, New York, NY, 10032, USA
Jinfeng Lu & Badri Vardarajan
Department of Genetics, School of Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
Erin L. Heinzen

Authors

Jinfeng Lu
Camilo Toro
David R. Adams
Cristiane Araujo Martins Moreno
Wan-Ping Lee
Yuk Yee Leung
Mathew B. Harms
Badri Vardarajan
Erin L. Heinzen

Contributions

JL wrote the scripts of LUSTR, performed the tests on simulated and real datasets, and analyzed and interpreted the results of the simulations and analyses. CT, DRA and the UDN provided the genomic sequence data from individuals with genetic diseases and collaborated in the blinded whole genome analyses. WL, YL and BV collaborated with the optimization and application of LUSTR. CAMM and MBH were involved in the design and building of LUSTR scripts. ELH oversaw all aspects of the work and wrote the manuscript with JL. All authors read, edited, and approved the final manuscript.

Corresponding authors

Correspondence to Jinfeng Lu or Erin L. Heinzen.

Ethics declarations

Ethics approval and consent to participate

Genomic sequence data used in this study were either publicly available or obtained from individuals who consented, under the guidance of local institutional review boards, to the use of their de-identified data for developing tools to analyze and interpret genomic data.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Supplementary Figure 1.

Structure of C9orf72 STR. We show here the reference sequence surrounding an STR within C9orf72 as a typical example of the complexities of STR structure. This STR has been reported to be associated with amyotrophic lateral sclerosis (ALS) and contains GGCCCC repeats. It is located on chromosome 9, and the genomic location (build 37) is shown in the figure. The approximate boundaries between the repeat and flanking regions are indicated. This figure shows how allowing incomplete repeats and tolerating repeat mismatches can greatly influence how one defines the repeat region that will be interrogated in the downstream models to infer genotype. *Note that the algorithm is agnostic to strand. For this C9orf72 STR, inputting CCGGGG from the reverse strand will be treated as equivalent to indicating CCCCGG from the forward strand.
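
The strand-agnostic behaviour noted above can be illustrated with a minimal Python sketch. It is not taken from the LUSTR Perl scripts: the function names (`reverse_complement`, `canonical_motif`, `same_repeat`) are hypothetical, and treating rotations of a motif (GGCCCC versus CCCCGG) as the same repeat is an assumption made here for illustration.

```python
# Hypothetical helpers; LUSTR itself is written in Perl and may differ.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def rotations(seq):
    """Return every cyclic rotation of a repeat unit as a set."""
    return {seq[i:] + seq[:i] for i in range(len(seq))}


def canonical_motif(motif):
    """Pick one representative for a repeat unit, ignoring strand and phase.

    Assumption: rotations and the reverse complement describe the same STR,
    so the alphabetically smallest of all of them serves as the key.
    """
    motif = motif.upper()
    return min(rotations(motif) | rotations(reverse_complement(motif)))


def same_repeat(motif_a, motif_b):
    """True when two repeat units describe the same STR on either strand."""
    return canonical_motif(motif_a) == canonical_motif(motif_b)
```

With these helpers, `same_repeat("CCGGGG", "CCCCGG")` holds because CCCCGG is the reverse complement of CCGGGG, which matches the equivalence described for the C9orf72 STR.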

Additional file 2: Supplementary Figure 2.

Determination of the repeat sequence of C9orf72 STR by LUSTR applying periodic Smith-Waterman algorithm. We show here as an example how the LUSTR finder module determines the repeat sequence of C9orf72 STR by applying the periodic Smith-Waterman algorithm, searching for GGCCCC repetitive sequences using the default settings as follows: match/mismatch/gap/stop = 2/-5/-7/-30. Starting from the seed sequence (two GGCCCC repeats, highlighted in yellow), the finder module aligns the reference periodically to GGCCCC in both upstream and downstream directions and records the best score at each nucleotide. Scores above 0 will be reset to 0, and routines with a score below the stop limit will be blocked for further extension. In this case, the extension stops when the best score is below -30 (highlighted in orange), and the repeat sequence is determined by the farthest nucleotides with a score of 0 (highlighted in green).
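
The extension step described above can be sketched in Python as follows. This is a simplified re-implementation under stated assumptions, not the LUSTR finder code: the function name `extend_repeat` is hypothetical, only the downstream direction is shown (the upstream direction would run the same routine on the reversed reference and reversed motif), and the way gaps are handled is one plausible reading of a periodic Smith-Waterman recurrence. The scoring defaults (2/-5/-7/-30), the reset of positive scores to 0, the blocking of routes below the stop limit, and the use of the farthest zero-score nucleotide as the boundary follow the caption.

```python
NEG_INF = float("-inf")


def extend_repeat(ref, motif, match=2, mismatch=-5, gap=-7, stop=-30):
    """Return how many leading nucleotides of `ref` belong to the repeat.

    `ref` is the reference sequence immediately downstream of the seed and
    `motif` is the repeat unit, with the seed assumed to end on a complete
    unit. Simplified sketch; parameter names are illustrative.
    """
    k = len(motif)
    # prev[j]: best score of a route whose last aligned motif position is j.
    # The seed ends in phase k-1 with score 0, so motif[0] is expected next.
    prev = [NEG_INF] * k
    prev[k - 1] = 0
    repeat_len = 0
    for i, base in enumerate(ref):
        cur = [NEG_INF] * k
        for j in range(k):
            step = match if base == motif[j] else mismatch
            diag = prev[(j - 1) % k] + step  # align base to motif[j]
            ins = prev[j] + gap              # extra reference base
            cur[j] = min(0, max(diag, ins))  # scores above 0 reset to 0
        # A motif base may be skipped without consuming a reference base.
        # Two sweeps let a skip propagate across the cyclic wrap-around.
        for _ in range(2):
            for j in range(k):
                cur[j] = max(cur[j], cur[(j - 1) % k] + gap)
        # Routes below the stop limit are blocked from further extension.
        cur = [NEG_INF if s < stop else s for s in cur]
        best = max(cur)
        if best == NEG_INF:
            break                            # every route is blocked
        if best == 0:
            repeat_len = i + 1               # farthest nucleotide scoring 0
        prev = cur
    return repeat_len
```

For example, three perfect GGCCCC units followed by unrelated sequence give a repeat length of 18, and a single mismatch inside the tract is tolerated because subsequent matches bring the score back to 0 before the stop limit is reached.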

Additional file 3: Supplementary Figure 3.

Average read coverage of 13 STR loci in GIAB trios. Average read coverage by GIAB trio libraries for the 13 STR loci tested in this study. Reads from each individual or merged library were first mapped to the whole human genome by bwa mem. Coverage of each nucleotide within the STR loci region (repeat region plus 2 x 50 bp flanking sequence at both sides) was calculated by SAMTOOLS depth, and the average coverage of each STR locus was calculated. STRs with failed or allele-missing calls in certain libraries are indicated by red color.
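
The per-locus averaging described above can be sketched as below, assuming the three-column, tab-separated text output of `samtools depth` (chromosome, 1-based position, depth). The function name `mean_depth` and its parameters are hypothetical, and a 50 bp flank on each side is one reading of the caption. Positions absent from the input are counted as zero depth, which matters because `samtools depth` omits zero-coverage positions unless run with `-a`.

```python
def mean_depth(depth_lines, chrom, repeat_start, repeat_end, flank=50):
    """Average depth over a repeat region plus `flank` bp on each side.

    `depth_lines` is any iterable of `samtools depth` text lines, for
    example an open file handle. Coordinates are 1-based and inclusive.
    """
    start = max(1, repeat_start - flank)
    end = repeat_end + flank
    total = 0
    for line in depth_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] != chrom:
            continue  # skip malformed lines and other chromosomes
        pos = int(fields[1])
        if start <= pos <= end:
            total += int(fields[2])
    # Divide by the region length so unreported positions count as zero.
    return total / (end - start + 1)
```

A typical call would pass an open file containing the depth output for one library, together with the locus coordinates taken from the STR definition.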

Additional file 4: Supplementary Figure 4.

Reads realigned to ATN1 and HTT STR loci from the son of GIAB Ashkenazim trio. Raw sequences of the reads realigned to the two loci were collected from the libraries sequenced for the Ashkenazim son. Gaps are indicated, and mismatched nucleotides are marked in red. Reads are categorized according to their repeat sizes. Besides the dominant alleles, LUSTR identified one read directly supporting the -5 allele at the ATN1 STR locus and one read directly supporting the -9 allele at the HTT STR locus. These reads might indicate low-fraction somatic STR variants, but further confirmation is needed to exclude random sequencing error.

Additional file 5: Supplementary Figure 5.

Reads supporting the STR alleles called by LUSTR but not revealed in GIAB database. Raw sequences of the reads realigned to (a) ATXN3 STR locus in father and son from the Ashkenazim trio, (b) DMPK STR locus in mother and son from the Ashkenazim trio, (c) DMPK STR locus in mother from the Chinese trio, and (d) PPP2R2B STR locus in father and mother from the Chinese trio. Gaps are indicated, and mismatched nucleotides are marked in red. Reads are categorized according to their repeat sizes.

Additional file 6: Supplementary Figure 6.

Potential inheritance of RFC1 STR alleles in the families of UDN subject 1 and 2. The genotypes of RFC1 STR alleles identified by LUSTR are shown for the pedigrees of UDN families of subject 1 and subject 2, for whom nuclear family members were available. The reference RFC1 STR allele (AAAAG wt, marked in blue) has two mutant types, AAAAG expansion (marked in orange and not known to be associated with disease) and AAGGG expansion (marked in red). The alleles were confirmed by checking the raw reads in sequenced libraries.

Additional file 7: Supplementary Table 1.

a Performance of LUSTR in identification of STR variants in GIAB database (Ashkenazim Trio). b Performance of LUSTR in identification of STR variants in GIAB database (Chinese Trio).
Supplementary Table 2. a Evaluation of candidate STR expansions by LUSTR unbiased whole genome scan for subject 2. b Evaluation of candidate STR expansions by LUSTR unbiased whole genome scan for subject 3.
Supplementary Table 3. a RFC1 expansion calls by LUSTR with alternative references for subject 2. b RFC1 expansion calls by LUSTR with alternative references for subject 3.
Supplementary Table 4. Comparison among LUSTR, ExpansionHunter, and GangSTR.

Additional file 8.

Full list of Undiagnosed Diseases Network members.

Additional file 9.

A .zip file containing the following LUSTR scripts and accompanying files, which can also be downloaded from https://github.com/JLuGithub/LUSTR: LUSTR_Finder.pl, LUSTR_RefCreator.pl, LUSTR_Extractor.pl, LUSTR_Realigner.pl, LUSTR_Caller.pl, README.txt, README.md, README_detail.txt, QuickGuide.txt, LICENSE.txt, testdata/test_genome_hg19_chr9_27571483_27575544.fa, testdata/test_pairendreads_C9orf72_ref70exp30.fastq, testdata/test_STRinfo.txt.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Lu, J., Toro, C., Adams, D.R. et al. LUSTR: a new customizable tool for calling genome-wide germline and somatic short tandem repeat variants. BMC Genomics 25, 115 (2024). https://doi.org/10.1186/s12864-023-09935-9

Received: 18 May 2023
Accepted: 21 December 2023
Published: 26 January 2024
DOI: https://doi.org/10.1186/s12864-023-09935-9
