Introduction

Autosomal recessive mutations in the NPC1 gene account for 95% of the cases with Niemann-Pick syndrome type C (NP-C, MIM #607623), a cholesterol storage and glycosphingolipid trafficking deficiency, with wide variability in the age of onset and in the number and severity of manifestations [1, 2]. The patients manifest progressive neurological symptoms including cerebellar atrophy, loss of Purkinje neurons, and neurofibrillary tangles, which could eventually cause ataxia, loss of speech capabilities, psychiatric problems and learning difficulties [1, 3]. Despite the clear autosomal mode of inheritance, in up to 30% of the patients, only one deleterious NPC1 allele has been identified [47]. In addition, ablation of NPC1 expression is shown to lead to age-progressive retinal degeneration in mice and drosophila [8, 9].

The NPC1 gene is located on chromosome 18q11.2 and is composed of twenty-five exons. According to the Human Gene Database (http://www.hgmd.cf.ac.uk/ac/index.php, [7]) 345 pathogenic NPC1 mutations have been described, consisting of missense and nonsense substitutions (69%), small insertions and deletions affecting the open reading frame (20%), intron-positioned splice variants (8%), and gross alterations (3%) (Additional file 1: Table S1). Rs1140458 is a common synonymous substitution (N931N, chr18:21119777 C > T, NM_000271.4) in NPC1, with allele frequency close to 0.5. It is located three nucleotides from the exon boundary inside of exon 18 and has not previously been associated with pathogenicity. Herein, we describe a link between the rs1140458 and a prematurely terminated NPC1 isoform lacking exons 19–25, discovered through a novel approach for assessment of co-occurrence of RNA features.

Materials and methods

Human RNA-seq transcriptomes

We analyzed 75 transcriptomes derived from different human tissues and cell lines as follows: 10 in-house RPE primary cell lines from the eyes of organ donors, and 65 RNA-seq datasets downloaded from online sources: 7 glioblastoma tumor tissues and 8 low-grade brain tumor cell lines, 25 breast cancer tumor tissues and 6 breast cancer cell lines, 9 colon cancer tissues, and 10 colon cancer cell lines. The web-sources used were ArrayExpress (https://www.ebi.ac.uk/arrayexpress/) [10], and Cancer Genomic Hub (https://cghub.ucsc.edu/) [11]. The choice of samples is based on relevance to the cholesterol/glycosphingolipid metabolism, and availability of concordantly processed RNA-seq datasets from -different tissues for assessment of tissue-specificity of our observation.

RNA-seq analysis pipeline

All the RNA-seq datasets were aligned using Tophat 2 version 2.0.11 [12]. Variants calls were obtained using Mpileup utility of SAMTools (http://samtools.sourceforge.net/mpileup.shtml) [13]. Base Alignment Quality was used to score the variant calls. Consensus calling was one using bcftools. The variant calls were quality filtered as described previously [14], and annotated using SeattleSeq Annotation Tools version 8.01, dbSNP build 138 [15]. Parallel run of the 10 RPE samples through Varscan (http://varscan.sourceforge.net/) produced 100% identical calls on chr18:21119777 C > T in NPC1. Transcript abundance level of each sample was quantified using Cufflinks version 2.1.1 [16] as we have previously described [14].

Assessment for variants associated with intron-containing RNA-molecules

To identify potential splice-modulating variants, we have performed a transcriptome-wide search for nucleotides that selectively reside in unspliced molecules, using aligned RNA-seq datasets. To do this we applied SNPlice - a Python-based computational approach developed in our lab (https://code.google.com/p/snplice/, v.1.7.1) [17]. SNPlice finds sequencing reads spanning SNP loci and nearby exon-intron boundaries, and classifies them based on the nucleotide at the SNP site and the presence or absence of the canonical exon-exon splice junction [17]. Following analysis of raw reads using Tophat, and Samtools, the resulting aligned reads (.bam format), junctions (.bed format), and variants (.vcf format) are read directly by SNPlice. Four categories of reads are counted at a given SNP–junction pair, and the statistical significance of non-independence of the SNP and splicing is assessed using Fisher-exact test: N VARee indicates reads containing the variant base and exon-exon junction, NVARei, - variant base and exon-intron boundary, N REFee , − reference base and exon-exon junction, and N REFei , − reference base and exon-intron boundary. High N VARei with low N REFei indicate that the variant nucleotide is preferentially observed in intron-retaining molecules, and, may have splice-modulating potential. In addition, log odds-ratio (log2,LOD) of intron-retaining reads among variant bearing reads versus intron-retaining reads among all reads is computed [17]:

lo g 2 N VARei / N VARee + N VARei N VARei + N REFei / N VARee + N VARei + N REFee + N REFei

The log odds-ratio increases when the intronic reads are enriched in the variant allele. A pooled ratio is used in the denominator since the number of reference intronic reads is often zero. For analyses across samples, SNPlice will accept multiple reads, junctions, and variant files, taking the semantic union of the input reads, junctions, and variants.

Allele-specific Sanger sequencing

To illustrate the co-occurrence of variant nucleotide and exon-intron boundary, and the coherence with the alignment display through IGV, we designed and performed allele-specific reverse transcription PCR on cDNA derived from heterozygote samples from the same RPE primary cell lines, followed by Sanger sequencing. Three primers were designed: a common forward exonic primer positioned further inside the exon from rs1140458, and two reverse primers hybridizing in the downstream exon or intron, respectively (Additional file 2: Table S2, and Additional file 3: Figure S1A). The PCR products were gel-purified and subjected to bi-directional Sanger sequencing using the forward and reverse primers used for the amplification.

Results and discussion

The distribution of the categories of reads across the 75 samples is presented in Additional file 4: Table S3. In the nine heterozygote samples of the RPE subset, the variant nucleotide of rs1140458 was found more often on reads remaining unspliced at the exon 18 boundary, with seven of the samples’ read counts meeting a p-value threshold of 5% (p-values between 0.04 and 0.009). When reads from all RPE samples, including the homozygote ones, were pooled, the p-value of the observed counts was 3.9 × 10−12 (See Additional file 4: Table S3). In the rs1140458 positive samples, between 8 and 21% of the sequencing reads harboring the variant nucleotide retained the intron, while this percent was on average 1.7 for the reads carrying the reference nucleotide. Out of all intron-retaining molecules, overall 87.7% carried the variant nucleotide.

Next, we assessed the three other assemblies of non-neural cell lines and tissues: glioma/glioblastoma, breast cancer, and colon cancer (See Additional file 4: Table S3). In contrast to the neuroepithelial cells, when examined individually, fewer heterozygote samples presented with significantly higher number of intron-retaining reads harboring the variant nucleotide. For example, none of the heterozygote samples in the glioma/glioblastoma subset reached significance in the read distribution at p <0.05. Nevertheless, a tendency was clearly seen, and the proportion of the intron-retaining reads with variant nucleotide was significant for the pooled reads in both cell lines (p <0.0001), and tissues (p = 0.03). Similarly, few individual samples reached p <0.05 for the read distribution in the breast and colon cancer subsets: one from colon cancer tissue, and two from the colon and breast cancer cell lines each. A total of 4 more heterozygote samples showed a tendency (p <0.1). Again, when the pooled number of reads across all samples was analyzed, the difference in the distribution was significant (p <0.0001) for all 4 datasets. The proportion of variant-bearing reads that retained the intron (out of all variant bearing reads) ranged between 3 and 8.3% as compared to between 0.2 and 1.6% for the intron-retaining reference reads. Of note, the overall proportion of the intron-retaining reads bearing the variant (out of all intron-retaining reads) in the non-neuroepithelial samples was the same as the one observed in the neuroepithelial subset – 87.6%. To assess if the observed strong association in the RPE cells is related to overall increased alternative splicing, we analyzed the variety and the abundance of transcripts across the studied samples. Higher expression fraction of complete length NPC1 transcripts (over all NPC1 transcripts) was estimated in the RPE samples, as compared to the cancer samples (0.88 vs 0.55, p <0.001).

All the samples were examined at the rs1140458 locus using Integrative Genomics Viewer (IGV) [18], and the correspondence to the SNPlice assessment was visually confirmed (Figure 1 top). Furthermore, Cufflinks assembly (.gtf) of the local sequencing reads in samples bearing the variant showed the presence of an isoform retaining the intron 18 of the NPC1 gene (Figure 1 bottom).

Figure 1
figure 1

IGV visualization of the read alignment and assembly of rs1140458 region for three illustrative RPE samples. Top. Sequencing reads with variant (A, in green) and reference nucleotide at the SNP position in the proximity of an exon-intron boundary are shown. Variant-harboring reads often continue in the intron, indicating association with potential junction alteration. Bottom. Cufflinks assembly (.gtf) showing the presence of an isoform retaining the intron 18 of the NPC1 gene (ENST00000540608 represents spliced isoform, and ENST0000269228 represents intron-retaining isoform.

We performed allele-specific PCR followed by Sanger sequencing on two heterozygote samples from the RPE primary cell lines; resulting chromatograms are shown on Additional file 3: Figure S1. The chromatograms were specifically examined for relative signal (peak) of the variant versus the reference nucleotide in the spliced and unspliced amplicons. Consistent with the assessment, Sanger sequencing of the region flanking rs1140458 showed that the molecules bearing the variant nucleotide predominate in the PCR product amplifying the exon-intron boundary. In contrast, the amplicon of the canonically spliced exon-exon region showed an equal proportion of the variant and the reference allele (See Additional file 3: Figure S1).

The potential of rs1140458 to modulate the splicing was further assessed through three independent splice modeling tools: SplicePort, Skippy and SpliceAid [1921]. SplicePort estimated that the variant diminishes the donor strength in comparison to the reference (False Discovery Rate, FDR <0.01). Skippy predicted loss of one exonic splice enhancer, and SpliceAid predicted the loss of the binding site for YB1 [22]. Overall, all three tools predicted a role for the variant in splice modulation.

Incorporation of the intronic sequence downstream of exon 18 leads to replacement of the third base of codon 932, which alters it from tyrosine-encoding to a stop codon, thus generating a prematurely terminated NPC1 isoform lacking the seven downstream exons. Such an isoform has been previously described (uc010xba, UCSC genes). To our knowledge, this is the first study to link it to the presence of the variant nucleotide of the upstream located rs1140458.

The highest proportion of intron-retaining reads in an individual sample is 21% (RPE Sample #5, see Additional file 4: Table S3). Based on that, most of the samples homozygote for rs1140458 are expected to retain more than half of the active NPC1. If, however, rs1140458 is in compound heterozygosity with a strong deleterious NPC1 allele, the level of the functional protein may decrease below 40%. Currently, it is unclear whether this level of NPC1 activity is sufficient for proper cellular functioning, or if it would lead to a cellular phenotype resulting from aberrant cholesterol trafficking [23]. Nevertheless, the study illustrates an important example of partial mutational effect – a phenomenon of particular interest for studying of incomplete penetrance of genetic changes.

Rs1140458 is positioned three nucleotides inside exon 18, thus residing on the border of a sequence traditionally annotated as “splice-site”, which consists of the two immediate nucleotides on each side of the exon-intron boundary. However, only a small percentage of exonic splice-site nucleotides have been experimentally verified to affect splicing. Our results suggest that the motif encompassing rs1140458 may be important for recognizing the donor splice sequence by the components of the splicing machinery.

The interesting observation in this study is the strong effect of the rs1140458 variant observed specifically in the retinal cells. This effect was present, but weaker in the compared cancer datasets, known to have increased general splice deregulation, as also shown by their increased fraction of incomplete NPC1 transcripts [24]. This may indicate cell-specific splicing control in the neuroepithelial cells, making this variant selectively essential for tissue-specific phenotype formation. Given the important role of the cholesterol for the retinal physiology [8, 9, 25], rs1140458 variant in NPC1 is worth studying for potential implication in cholesterol-related retinal pathology.

Despite the clear tendency for the variant to reside in intron-retaining reads in non-neuronal tissue types, the significance of the read distribution there was lower, mainly due to the overall lower proportion of intron-retaining reads (See Additional file 4: Table S3). This observation shows that even in cells where intron retention occurs less frequently, it is still selective for the variant-bearing molecules, which is also demonstrated by the similar proportion of variant intron-retaining reads (out of all intron-retaining reads) across the entire set of samples.

In summary, we demonstrate that substantial proportion of the RNA molecules bearing the variant nucleotide of rs1140458 in the NPC1 gene remain unspliced at the nearby exon-intron boundary and incorporate an early stop codon; the effect being stronger in RPE cells. The implication of rs1140458 in NP-C phenotype variation, or in the cholesterol metabolism in retinal cells, is a subject for further investigation. However, our study illustrates an important mechanism of allele-specific splicing modulation, and reveals partial penetrance of the effect of a genetic variant in a cell-specific mode – a phenomenon of crucial importance for understanding complex genotype-phenotype relationships and multigene variance effects.