Introduction

Epilepsy is a neurological disorder that is characterized by spontaneous and reoccurring seizures that are mainly generated in the areas of the hippocampus or cerebral cortex1,2. Epilepsy remains the most common serious chronic disorder of the brain with more than 68 million people affected worldwide3. Active epilepsy is a devastating disorder that requires continued care, thus disrupting everyday aspects of life and imposing a physical, psychological, and social burden on patients and families4.

The pathophysiology of epilepsy remains poorly understood. While there are some studies in animal models showing a contribution of certain neuronal subtypes to seizure generation and propagation5, the corresponding data from human epilepsy patients are scarce. This can be explained by the complexity of neuronal networks involved in epileptogenesis. Recent studies suggest the presence of >60 neuronal subtypes in a single functional cortical area both in rodents and in human6,7, and the same might be true for each area of the hippocampus, based on the number of GABAergic interneurons in the CA1 region in mice8. Importantly, not all neuronal subtypes will be similarly affected in epilepsy, and there is an indication from the literature that some subtypes are affected or contribute more to epileptogenesis than others. For instance, excitatory neurons of the CA1 region of the hippocampus are more affected than those from other hippocampal regions9,10. In addition, there is a decrease in the number of particular subpopulations of somatostatin (SST)- and neuropeptide Y (NPY)-positive GABAergic interneurons in the hippocampi of patients with temporal lobe epilepsy (TLE)10. In the cortex, selective impairment in gene expression of parvalbumin (PV)-positive GABAergic interneurons has been shown in epileptic tissue from focal cortical dysplasia type I/III, but not type II11. In a mouse model of seizure activation, different populations of GABAergic interneurons contribute to distinct stages of epileptogenesis5. These and other data5,12 clearly show that depending on the disease phenotype, different neuronal assemblies and subtypes of neurons might be affected in epilepsy.

So far, gene-expression changes in epileptic brains have been studied in resected pieces of brain tissue to assess averaged changes across all types of neurons and glia as well as nonneural cells (blood vessels, ependymal cells, etc.). Although such studies provided some important information about large-scale changes in gene expression, only relatively minor transcriptomic changes have been identified, even when comparing to highly sclerotic tissue13,14,15,16,17,13,14,15,17,61.

The CNR1 signal was multiplied by a correction factor based on the VIP signal (see Eq. (1)). This normalization was necessary because the signal intensity of FISH can vary due to several factors, including tissue integrity and tissue processing. To avoid infinite values, we used Raw Integrated Density instead of Integrated Density. A one-sided Welch two-sample t test was used because the variances of the distributions were unequal.

$${\mathrm{cnr}}1_{{\mathrm{normalized}}} = {\mathrm{cnr}}1 \ast \left( {{\mathrm{vip}}_{{\mathrm{normalized}}}/{\mathrm{vip}}} \right).$$
(1)

In total, the following numbers of samples (N) and cells (n) were used: CNR1/VIP (Ncontrol = 3; ncontrol = 22; NTLE = 5; nTLE = 43); CKAMP44/RORB (Ncontrol = 2; ncontrol = 98; NTLE = 3; nTLE = 215); GRIA1 in layer 2–3 (Ncontrol = 3; ncontrol = 136; NTLE = 3; nTLE = 113); GRIA1 in layer 5–6 (Ncontrol = 3; ncontrol = 145; NTLE = 3; nTLE = 117); GRIN3A (Ncontrol = 2; ncontrol = 130; NTLE = 3; nTLE = 143).

Processing of Smart-seq2 single-nucleus transcriptomes

Fastq files were trimmed with Trimmomatic 0.3662 with parameters HEADCROP:12, LEADING:3, TRAILING:3, SLIDINGWINDOW:4:15, and MINLEN:25. Trimmed reads were aligned to the GRCh38.p12_genomic.fna genome assembly (accessible from https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.38) with STAR 2.5.3a63. On average, approximately 90% of reads were uniquely mapped to the reference genome. Genes were counted with featureCounts 1.5.164 using GRCh38.p2 updated by removing duplicate Entrez gene entries from the gtf reference. Cells with fewer than 2000 genes were excluded. Genes present in fewer than five cells were excluded. Cells with more than one million reads or more than 5% mitochondrial genes were removed. In addition, based on overlapping expression of cardinal markers for neuronal cells, astrocytes, and oligodendrocytes, 150 nuclei (constituting 15% of the total dataset) were removed from further analysis, as they could not be assigned to one of the main neural cell types and thus likely represent technical artifacts such as doublets.
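The cell and gene thresholds above can be illustrated with a minimal sketch in plain Python. It is not the pipeline used in the study: the function name `filter_counts`, the dictionary layout, the order in which the filters are applied, and the reading of "more than 5% mitochondrial genes" as a fraction of reads are all assumptions made for illustration.

```python
def filter_counts(counts, mito_genes, min_genes=2000, min_cells=5,
                  max_reads=1_000_000, max_mito_frac=0.05):
    """Apply the QC thresholds described in the text (hypothetical helper).

    counts: dict mapping cell id -> dict mapping gene -> read count.
    mito_genes: set of gene names treated as mitochondrial (assumption).
    """
    kept_cells = {}
    for cell, genes in counts.items():
        detected = {g: c for g, c in genes.items() if c > 0}
        total = sum(detected.values())
        mito = sum(c for g, c in detected.items() if g in mito_genes)
        if len(detected) < min_genes:
            continue  # too few detected genes
        if total > max_reads:
            continue  # too many reads
        if total > 0 and mito / total > max_mito_frac:
            continue  # mitochondrial fraction too high
        kept_cells[cell] = detected

    # Count in how many retained cells each gene is detected.
    cells_per_gene = {}
    for genes in kept_cells.values():
        for g in genes:
            cells_per_gene[g] = cells_per_gene.get(g, 0) + 1
    kept_genes = {g for g, n in cells_per_gene.items() if n >= min_cells}

    return {cell: {g: c for g, c in genes.items() if g in kept_genes}
            for cell, genes in kept_cells.items()}
```

The default arguments mirror the thresholds stated in the text; smaller values are convenient for trying the function on toy data.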

Nuclei that remained after filtering were normalized, scaled, and embedded in two-dimensional space with the Pagoda 2 package65 (https://github.com/hms-dbmi/pagoda2) using 1000 overdispersed genes and 100 principal components.

Genome alignment and quality control of 10× Chromium single-nuclei transcriptomes

Raw data were demultiplexed, aligned, and quantified using Cell Ranger version 3.1 with a custom version of the GRCh38-1.2.0 human reference genome provided by Cell Ranger, in which “transcript” rows were changed to “exon” in order to allow for the counting of introns (awk 'BEGIN{FS="\t"; OFS="\t"} $3 == "transcript"{ $3="exon"; print }' refdata-cellranger-GRCh38-1.2.0/genes/genes.gtf > GRCh38-1.2.0.premrna.gtf). To filter cells, we used “#UMIs * #Cells” versus “log10(#UMIs)” plots, provided by the dropestr package66. Afterward, cells for which the total mitochondrial gene expression exceeded 8% were removed. Doublets were filtered using the Scrublet package67, followed by removal of clusters with double-cell-fate signatures (Source Data Table 2).
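For readers less familiar with awk, the GTF rewrite quoted above can be sketched in Python. This is an illustrative equivalent only; the function name `transcripts_to_exons` is hypothetical, and, like the awk command, it keeps only the rows whose third column is "transcript" and relabels them as "exon".

```python
def transcripts_to_exons(gtf_lines):
    """Return the transcript rows of a GTF, relabeled as exon rows.

    gtf_lines: iterable of tab-separated GTF lines (strings).
    """
    out = []
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        # Column 3 of a GTF row holds the feature type.
        if len(fields) > 2 and fields[2] == "transcript":
            fields[2] = "exon"
            out.append("\t".join(fields))
    return out
```

Because every transcript then spans its introns as a single "exon", reads mapping to intronic sequence are counted toward the gene, which is the stated purpose of the modification.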

Joint analysis of 10× Chromium-processed single-nuclei datasets

After the initial filtration, each of the 19 NeuN+ datasets (ten healthy and nine epilepsy) was processed with Pagoda 2 using 1000 overdispersed genes and 100 principal components. Afterward, the Pagoda objects were aligned using Conos with parameters k = 15, k.self = 5, and k.self.weight = 0.1. The resulting graph was embedded with UMAP and clustered using the Leiden method with resolution = 15. Cells from the whole dataset were assigned to one of the major cell types based on the expressed marker genes (see Source Data Table 3). After that, clusters expressing a combination of nonneuronal markers, as well as clusters with mixed identities, were removed, and the remaining cells were annotated to the final depth. For better alignment, we reran Conos on the filtered data using balancing across conditions (parameter balancing.factor.per.sample) and setting parameters k = 50, k.self = 5, k.self.weight = 0.1, k.same.factor = 5, and same.factor.downweight = 0.25. The final UMAP embedding was generated with parameters n_epochs = 1000, spread = 5, and min_dist = 1.0.

The NeuN dataset was processed in the same way, but without Conos alignment.

To estimate the effect of epilepsy on the cell types and for Gene Ontology analyses, we additionally filtered out samples with a high fraction of missing cell types. We removed all samples in which >5 subtypes had <5 nuclei, because without this filtration these samples biased subsequent analyses for the subtypes with a low number of cells. A low number of cells in some samples can compromise the analysis through possible imprecisions in the annotation or higher variance of the centroid estimation for certain subtypes. Thus, we removed samples with a low number of nuclei per sample (as in samples C3 and E5) or with low-quality sample preparation (sample C5). After filtering, we ran Conos using the same parameters as above.

Joint analysis of Smart-seq2 and 10× data

We used Conos label transfer routines to transfer Smart-Seq2 cell-type annotations to our 10× data. We used the preprocessed Pagoda object containing the Smart-Seq2 data and aligned it with the preprocessed 10× Pagoda objects using the same parameters as above (k = 15, k.self = 5, and k.self.weight = 0.1). The UMAP embedding was estimated with parameters spread = 1.5 and min.dist = 1, and the propagateLabels function was run with max.iters = 50.

To map our annotation to the Allen Brain Institute (ABI) subtypes, we also preprocessed and aligned their Smart-Seq2 dataset. Preprocessing was done using Pagoda 2 in the same manner as for 10×. For the alignment, we used the three largest autopsy control datasets (C6, C7, and C8) and ran them with parameters k = 20, k.self = 5, k.self.weight = 0.1, space = “CCA”, same.factor.downweight=0.1, and balancing.factor.per.sample corresponding to the protocol.

To produce alluvial diagrams for matching of cell types, we estimated the number of cells \(n_{i,j}\) for each of our cell types \(i \in types(ours)\) getting labeled by a specific ABI cell type \(j \in types(ABI)\). All occurrences with \(n_{i,j} = 1\) were filtered out, as well as

$$n_{i,j}:\frac{{n_{i,j}}}{{\mathop {\sum }\nolimits_{k \in types(ABI)} n_{i,k}}} \,<\, 0.05.$$
(2)
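The two filtering rules for the alluvial diagrams (singleton matches, and matches below the 5% fraction of Eq. (2)) can be sketched as follows. The function name `filter_label_matches` and the input layout are hypothetical, and computing the per-type totals before any match is removed is an assumption.

```python
def filter_label_matches(counts, min_fraction=0.05):
    """Keep only well-supported (our type, ABI type) matches.

    counts: dict mapping (our_type, abi_type) -> number of cells n_ij.
    """
    # Total number of labeled cells per one of our cell types (denominator of Eq. 2).
    totals = {}
    for (our_type, _abi_type), n in counts.items():
        totals[our_type] = totals.get(our_type, 0) + n

    kept = {}
    for (our_type, abi_type), n in counts.items():
        if n == 1:
            continue  # singleton match
        if n / totals[our_type] < min_fraction:
            continue  # fraction below the Eq. (2) threshold
        kept[(our_type, abi_type)] = n
    return kept
```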

Estimation of expression-similarity score across conditions

To estimate how severely each of the subtypes was affected by epilepsy, we developed a metric for expression similarity across conditions. We generated a joint count matrix by row-binding individual total-count normalized count matrices. On the joint matrix, we estimated PCA reduction using 100 principal components. Next, we found the cell-type centroids \(\bar v_{t,s}\) in this PCA space for each sample s and cell type t. For each cell type t, we estimated all pairwise Pearson correlations between samples

$$c_{t,s_i,s_j} = cor(\bar v_{t,s_i},\bar v_{t,s_j}).$$
(3)

As a similarity score, we used the measure of how far the epilepsy samples were from the control samples, accounting for the cross-sample variation within control samples. To do so, we estimated average pairwise correlations between control samples using the 40% trimmed mean

$$m_t = TM_{0.4}(\{ c_{t,s_i,s_j}:s_i,s_j \in control\} ),$$
(4)

and the deviation from the mean using median absolute deviation measure

$$\sigma _t = MAD(\{ c_{t,s_i,s_j}:s_i,s_j \in control\} ).$$
(5)

For each pair of an epilepsy and a control centroid \(c_{t,s_i,s_j}:(s_i \in control,s_j \in epilepsy)\), we estimated their difference from the control mean, normalized by deviation within the control datasets using a Z-score-like approach:

$$z_{t,s_i,s_j} = \frac{{c_{t,s_i,s_j} - m_t}}{{\sigma _t}}.$$
(6)

The obtained measure is <0 for all transcriptomes that are divergent between conditions, and it is around 0 for the cases where the transcriptional profile is not affected by condition. Consequently, as the epilepsy cell types become more similar to the controls, \(z_{t,s_i,s_j}\) grows. As a measure of how much a cell type is affected by epilepsy, we used the distribution of scores \(z_{t,s_i,s_j}\) over samples \(s_i,s_j:(s_i \in control,s_j \in epilepsy)\) for each of the types t.
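Equations (3) to (6) can be followed step by step in a minimal sketch for a single cell type. All names here are hypothetical. Two details are assumptions: the trimmed mean removes 40% of the values from each end (the convention of R's `mean(x, trim = 0.4)`), and the median absolute deviation is left unscaled, whereas R's `mad` applies a constant factor by default.

```python
import math
from statistics import median


def pearson(u, v):
    """Pearson correlation between two equal-length vectors (Eq. 3)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def trimmed_mean(values, trim=0.4):
    """Mean after dropping a fraction `trim` of values from each end (Eq. 4)."""
    xs = sorted(values)
    k = int(len(xs) * trim)
    core = xs[k:len(xs) - k]
    return sum(core) / len(core)


def mad(values):
    """Unscaled median absolute deviation (Eq. 5)."""
    values = list(values)
    m = median(values)
    return median(abs(x - m) for x in values)


def similarity_scores(control, epilepsy):
    """Z-like scores for all control/epilepsy centroid pairs (Eq. 6).

    control, epilepsy: dicts mapping sample id -> centroid vector in PCA space.
    """
    ctrl = list(control.values())
    within = [pearson(ctrl[a], ctrl[b])
              for a in range(len(ctrl)) for b in range(a + 1, len(ctrl))]
    m_t = trimmed_mean(within)
    sigma_t = mad(within)  # must be non-zero for the score to be defined
    return [(pearson(c, e) - m_t) / sigma_t
            for c in ctrl for e in epilepsy.values()]
```

With this construction, epilepsy centroids that correlate with the controls less well than the controls correlate with each other yield negative scores, in line with the interpretation given above.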

Differential expression and Gene Ontology enrichment testing

To inspect functional differences in the data, we performed Differential Expression (DE) and GO analyses using the following procedures. First, on the Conos object with filtered samples, we found DE genes between conditions for each of the cell types using the Conos wrapper of the DESeq2 package25,68. Among the found genes, we picked those with an absolute Z score >3. Then, for each cell type, we kept only the genes with raw expression >1 UMI in at least 5% of the cells.

We validated the relevance of the found DE genes according to existing knowledge. For each cell type, we tested the enrichment of its DE genes among (i) genes found in GWAS data, and (ii) published epilepsy-related genes. To calculate the enrichment, we used Fisher exact test, using the union of all expressed genes (see above) across all cell types as background.

Next, we performed GO enrichment analysis with the enrichGO function from the clusterProfiler69 package using Benjamini–Hochberg false-discovery rate adjustment with a P-value threshold of 0.05. To avoid autopsy-related pathways, we filtered out all terms for which >20% of the enriched genes belong to the list of autopsy-associated genes published by the Allen Brain Institute7. To aggregate the terms that were identified based on the same genes, we clustered the terms by genes, collapsing those with highly similar genes. Thus, we first identified clusters of individual pathways for each subtype using Jaccard distance on the sets of enriched genes (R functions hclust and cutree with parameter h = 0.66). Then, for each pair of pathways (P1, P2), we found all cell types that had both P1 and P2 enriched and estimated the fraction of cell types that assigned P1 and P2 to different clusters:

$$f_{P_1,P_2} = \frac{{\left| {\left\{ {i \in enriched\left( {P_1,P_2} \right):cluster_i\left( {P_1} \right) \ne cluster_i\left( {P_2} \right)} \right\}} \right|}}{{\left| {enriched\left( {P_1,P_2} \right)} \right|}}.$$
(7)

This fraction was used as a distance metric for hierarchical clustering (R functions hclust with parameter method=“average” and cutree with parameter h = 0.66). This reduced the number of pathways from 446 to 186, which improved visualization and simplified analysis.
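The pairwise distance between two pathways, defined as the fraction of cell types that place them in different clusters, can be sketched as below. The function name `cluster_disagreement` and the input layout are hypothetical; returning `None` when no cell type has both pathways enriched is an assumption, since the text does not say how such pairs were handled.

```python
def cluster_disagreement(cluster_by_type, p1, p2):
    """Fraction of cell types assigning pathways p1 and p2 to different clusters.

    cluster_by_type: dict mapping cell type -> dict mapping enriched
    pathway -> per-cell-type cluster id.
    """
    # Cell types in which both pathways are enriched.
    shared = [clusters for clusters in cluster_by_type.values()
              if p1 in clusters and p2 in clusters]
    if not shared:
        return None  # distance undefined for this pair (assumption)
    split = sum(1 for clusters in shared if clusters[p1] != clusters[p2])
    return split / len(shared)
```

A value of 0 means every relevant cell type groups the two pathways together, and a value of 1 means none does.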

GO visualization

We used a heatmap of log P values to visualize the pathway clusters. First, we determined the name of each cluster by picking the name of the pathway with the lowest mean log P value across cell types among all pathways from this cluster. Then, we built a matrix of minimal log P values for each of the cell types (columns) and each of the pathway clusters (rows). This matrix was clustered by pathway clusters using hierarchical clustering with L1 distance over row-normalized log P values (R functions hclust with parameter method=“ward.D” and cutree with parameter h = 2.5). In this way, pathway clusters that were enriched in similar cell types were grouped together, and groups of pathway clusters with a size of at least five were picked for further analysis.

To visualize relationships between pairs “cell-type: pathway cluster”, we embedded these pairs in 2D space using UMAP. For each such pair, we picked all genes enriched in the pathways of the given cluster within the given cell type and used them to characterize the pair. We estimated the pairwise Jaccard distances on these gene sets and used the resulting distances to pick k = 10 nearest neighbors for each pair, which were passed to the UMAP embedding (parameters spread = 1.5, min_dist = 0.2).

To understand which cell types are affected by epilepsy in a similar way, we represented each cell type as a set of pathways, which are enriched in this type. We then estimated weighted Jaccard distances and showed them on a clustered heatmap. The weighting was used to account for those pathways that were detected based on similar sets of genes within one cell type. The weight of the pathway i within the cell type k was estimated as

$$w_{k,i} = \left( {\mathop {\sum }\limits_{j \in pathways(k)} JS(genes_{k,i},genes_{k,j})} \right)^{ - 1},$$
(8)

where JS is the nonweighted Jaccard similarity, \(genes_{k,j}\) is the set of genes for pathway j enriched within cell type k, and \(pathways(k)\) are all the pathways enriched for cell type k. For the visualization, we clustered these weighted Jaccard distances using hierarchical clustering (h = 1.2 for the R cutree function).
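The pathway weights of Eq. (8) can be computed for one cell type with a short sketch. The function names are hypothetical. The sum is taken over all pathways of the cell type, including the pathway itself (which contributes a similarity of 1), following the equation as written; gene sets are assumed to be non-empty.

```python
def jaccard(a, b):
    """Nonweighted Jaccard similarity between two gene sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pathway_weights(genes_by_pathway):
    """Weights w_{k,i} for the pathways of a single cell type k (Eq. 8).

    genes_by_pathway: dict mapping pathway -> set of enriched genes.
    """
    return {
        i: 1.0 / sum(jaccard(genes_i, genes_j)
                     for genes_j in genes_by_pathway.values())
        for i, genes_i in genes_by_pathway.items()
    }
```

Two pathways supported by identical gene sets each receive a weight of 0.5, so together they count as one pathway, while a pathway sharing no genes with any other keeps a weight of 1.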

Summary score for the degree at which cell types are affected

For the results summary, we used six different metrics (expression similarity, cell-type composition, number of changed GO terms, enrichment in GWAS genes, enrichment in epilepsy genes, and number of DE genes, see below), aggregated into a single score. We replaced continuous values with ordinal ones to make the metrics comparable to each other. Thus, for each metric, we classified each cell type into one of the following categories: “not affected”, “affected”, “highly affected”, and “top-1 affected cell type”. The last, which by definition includes only one cell type, is assigned separately to Excitatory and Inhibitory neurons. The final score is assigned based on a weighted sum of ranks of the categories, where ranks are integers from 0 to 3. Weights were determined a priori according to our trust in specific metrics (see explanations below).

Expression-similarity score

The expression-similarity score is a direct measure of transcriptional change between conditions (Fig. 2e), where a value of zero means that transcriptional profiles are highly similar across conditions, and a more negative value indicates a lower similarity. For each subtype, we calculated a distribution of scores across all pairs of datasets. Cell types whose upper quartile of the score is <0 are labeled as “affected”. The “affected” types with a median score below the median of all medians across the cell types (separately for Excitatory and Inhibitory) are labeled as “highly affected”. The type with the lowest median value is the “top-1 affected cell type”. As a direct measure, the expression-similarity score has weight 1.0.

Changes in cell-type composition

The proportion of cells of a specific cell type (Fig. 2d) varies between datasets, as cells are subsampled from the total pool. In addition, we expect that the abundance of some cell types is affected by epilepsy. To measure differences in cell-type proportions between conditions while accounting for variance within a cell type, we used a permutation-test P value. Here we do not perform binary hypothesis testing, but use P values as a continuous measure instead. In particular, as we have relatively few samples with large variance, the power of the permutation test is low and the P values are relatively large. Thus, we labeled a cell type as “affected” if it had a P value <0.2, “highly affected” if P value <0.05, and the “top-1 affected cell type” is the one with the smallest P value. As a change in the abundance of cell types should have a large effect on the system, this metric has weight 0.66.
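A permutation test on cell-type proportions can be sketched as follows. The text does not state the test statistic or the number of permutations, so this example makes two assumptions: the statistic is the absolute difference in mean proportion between conditions, and all relabelings are enumerated exactly (feasible for the sample sizes involved). The function names are hypothetical.

```python
from itertools import combinations


def permutation_p_value(control, epilepsy):
    """Exact two-sided permutation P value for a difference in means.

    control, epilepsy: per-sample proportions of one cell type.
    """
    pooled = list(control) + list(epilepsy)
    n_c = len(control)
    n_e = len(pooled) - n_c
    total = sum(pooled)

    def stat(indices):
        s = sum(pooled[i] for i in indices)
        return abs(s / n_c - (total - s) / n_e)

    observed = stat(range(n_c))
    hits = count = 0
    # Enumerate every way of choosing which samples are called "control".
    for indices in combinations(range(len(pooled)), n_c):
        count += 1
        if stat(indices) >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    return hits / count


def composition_label(p_value):
    """Map a P value to the categories described in the text."""
    if p_value < 0.05:
        return "highly affected"
    if p_value < 0.2:
        return "affected"
    return "not affected"
```

With three samples per condition, the smallest attainable two-sided P value under this scheme is 2/20 = 0.1, which illustrates why the text treats P values as a continuous measure rather than applying a conventional significance cutoff.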

Enrichment of genes identified by GWAS

To evaluate the enrichment of genes identified by the largest GWAS study of epilepsy patients28 (Fig. 2g), we used the Fisher test statistic h as the metric. We cross-compared the DE gene lists for each pairwise comparison (i.e., subtype \(x_{epilepsy}\) vs. subtype \(x_{healthy}\)) with the GWAS list. Cell types with h > 1 were labeled as “affected”, and those with a lower confidence bound for h > 1 were labeled as “highly affected”. The subtype with the largest h value is designated as the “top-1 affected cell type”. Gene mutations are likely to cause phenotypical changes, but since the size of the patient cohort in the epilepsy GWAS is not large enough and there is significant variability in patient diagnosis in the GWAS cohort, it is expected that many epilepsy-related genes are still missed in the GWAS data, and thus we assigned a lower weight of 0.66 to this metric.

Enrichment of epilepsy-related genes

For this metric (Fig. 2f), we used the same definitions as in the GWAS enrichment score. Since the number of genes that have been associated with epilepsy is influenced by prior knowledge in the literature, it does not provide us with a reliable measure of the real level at which a cell type is affected; thus, we used a weight of 0.33 for this score.

Number of highly expressed DE genes, adjusted by the number of cells per cell type

The number of expressed DE genes linearly depends on the number of cells (Supplementary Fig. 11c). Thus, we performed robust linear regression (R function MASS::rlm) of the number of DE genes \(y_{DE}\) on the number of cells \(x_{cells}\):

$$y_{DE} = a \ast x_{cells} + b.$$
(9)

Then we used residuals of the regression (\(y_{DE}^{observed} - y_{DE}\)) as a measure of how affected a cell type was. Cell types with positive residuals (i.e., above the regression line) are marked as “affected”, and those with values above the 75th percentile among the affected types are marked as “highly affected”. The cell type with the largest residual value is the “top-1 affected cell type”. The number of DE genes is an important factor, but the linear dependency makes it weakly reliable, as the residuals can be explained by noise. Thus, a weight of 0.66 was assigned to this metric.
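The residual-based labeling can be illustrated with a small sketch. Note the substitution: the study used robust regression (MASS::rlm), whereas this example fits ordinary least squares, which is simpler but more sensitive to outliers. The function names are hypothetical, and the intermediate “highly affected” tier (75th percentile) is omitted because the quantile definition is not specified in the text.

```python
def regression_residuals(x, y):
    """Residuals of an ordinary least-squares fit y = a * x + b.

    Stand-in for the robust fit used in the study (assumption).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]


def label_by_residual(names, residuals):
    """Label cell types above the regression line as affected."""
    labels = {name: ("affected" if r > 0 else "not affected")
              for name, r in zip(names, residuals)}
    top_residual, top_name = max(zip(residuals, names))
    if top_residual > 0:
        labels[top_name] = "top-1 affected cell type"
    return labels
```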

Number of enriched GO pathways, adjusted by the number of highly expressed DE genes

The number of enriched GO pathways was utilized as another metric of functional changes in cell types, although it linearly depends on the number of DE genes (Supplementary Fig. 11a). We used a similar procedure for this metric calculation as we used for the number of highly expressed DE genes (above): residuals of the robust linear regression of the number of enriched GO pathways by the number of highly expressed DE genes. The weight was similarly set to 0.66.
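Putting the six metrics together, the summary score is a weighted sum of category ranks. The sketch below uses the weights stated in the preceding subsections; the metric keys, the function name `summary_score`, and the assignment of ranks 0 to 3 to the four categories in increasing order of severity are assumptions made for illustration.

```python
# Weights as stated in the text; the dictionary keys are hypothetical names.
WEIGHTS = {
    "expression_similarity": 1.0,
    "cell_type_composition": 0.66,
    "gwas_enrichment": 0.66,
    "epilepsy_gene_enrichment": 0.33,
    "n_de_genes": 0.66,
    "n_go_terms": 0.66,
}

# Ordinal ranks for the four categories (integers from 0 to 3).
RANKS = {
    "not affected": 0,
    "affected": 1,
    "highly affected": 2,
    "top-1 affected cell type": 3,
}


def summary_score(categories):
    """Weighted sum of category ranks for one cell type.

    categories: dict mapping metric name -> category label.
    """
    return sum(WEIGHTS[metric] * RANKS[label]
               for metric, label in categories.items())
```

For example, a cell type labeled “highly affected” on every metric would score 2 × (1.0 + 4 × 0.66 + 0.33) = 7.94 under these assumptions.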

Gene filtering for rWGCNA

The Seurat R package (version 3.1.2)70 was used for preparing the expression data for co-expression analysis. The cells were split by the second level of annotation (Supplementary Fig. 5a) in order to ensure sufficient cell numbers and expression variation within each subset for the detection of gene co-expression. Genes expressed in fewer than 20 cells in a cluster were removed. Principal component analysis was carried out using the RunPCA function after centering and scaling the data with the ScaleData function, to find 120 principal components (PCs). Genes were then ranked by their highest absolute loading value on any given PC, and the top 5000 genes within each cell cluster were selected for co-expression analysis. Subsequent analyses were performed on the entire dataset, except for CNT9 and CNT10.

rWGCNA adjacency and topological overlap matrix computation

Robust Weighted Gene Co-expression Analysis was carried out using the WGCNA R package (version 138). The pickSoftThreshold function was used to identify soft thresholding powers as follows: powers corresponding to the top 95th percentile of network connectivity or above were discarded, and the lowest soft-threshold power between 1 and 30 to achieve a scale-free topology R-squared fit of 0.93 was selected; if none did so, the thresholding power with the highest R squared was used.

In order to identify gene networks robust to outlier cells, the expression data were resampled using a previously published approach37, drawing two-thirds of the cells at random without replacement 100 times. The consensusTOM command was then run with a consensusQuantile of 0.5, “pearson” correlation coefficient, and “signed hybrid” networkType, to compute a signed consensus Topological Overlap Matrix (TOM). Genes were subsequently filtered with the goodGenesMS output produced by consensusTOM.

Clustering and intramodular connectivity

The consensusTOM matrices were converted to distance matrices, and the hclust function was used with the “average” method to cluster genes hierarchically. The cutreeHybrid command was used with a deepSplit of 2 and pamStage set to TRUE to cut the dendrogram into discrete modules, each containing a minimum of 15 genes. We next computed the intramodular connectivity, or “kIM”, of every gene with respect to each module, to serve as a continuous and weighted measure of module membership. Modules whose gene kIM scores exhibited a Pearson correlation of 0.85 or higher were merged. Inspired by ref. 71, kIM scores were then used as distance measures in a subsequent iterative k-means clustering, in which genes were reassigned if their kIM with regard to another module was 1.25 times greater than the kIM to their current module. kIM scores were recomputed for the new modules, and the algorithm was repeated until no further genes were reassigned. Finally, a t test was performed to prune genes whose kIM with respect to their allocated module was not statistically significant (using the Benjamini–Hochberg false-discovery rate adjustment).

Post-rWGCNA gene module filtering

Of the original 140 rWGCNA modules detected, those for which 75% or more of the constituent genes were also found with another larger module, with at least a weighted Pearson correlation (using the WGCNA::cor function) between gene kIM scores of 0.75 or higher, were removed. The remaining 129 modules were subsequently filtered down to 117 by removing modules associated with genes differentially expressed in neurosurgery and postmortem interval conditions7. The test was carried out by computing the dot product of module membership scores (kIMs) with –log10-transformed differential expression p values and evaluating significance against a null distribution produced by permuting kIM gene labels (10e3 replicates). A module expression matrix was produced by scaling kIM gene weights to sum to 1 and computing module expression as the weighted sum of module-normalized expression. To address bias from sample-specific expression profiles, fixed-effect linear models with either sample or epilepsy condition as covariates and module as the outcome were used to remove modules for which the R squared of any sample covariate was higher than that of the epilepsy condition covariate model within the cell cluster from which the module originated, narrowing the field to 38 candidate modules. Confidence intervals and P values were then computed for the epilepsy condition coefficients, resulting in 12 modules with a significant epilepsy status coefficient. Information on all of the original 140 gene modules detected across 8 cell subsets can be found in Source Data Table 14. Regression coefficients, confidence intervals, and P values for simple linear models with epilepsy condition as covariate and module expression as outcome can be found in Source Data Table 15.

Gene set enrichment testing

To address potential confounding by common co-expression structures within the gene modules and the curated epilepsy gene set, the correlation of the epilepsy gene expression profile within our single-cell expression data was used to compute a variance inflation factor (VIF)72. The VIF was then passed to the rankSumTestWithCorrelation command from the limma R package (version 3.38.3)73 to carry out a nonparametric Wilcoxon rank-sum test of whether genes from the curated list were ranked near the top of module gene membership scores (kIMs), highlighting seven gene modules after adjusting P values for multiple testing (Bonferroni, 12 tests). Wilcoxon rank-sum test enrichment results for 12 modules with uncorrected P values (one-sided) can be found in Source Data Table 18.

Functional module annotation

To investigate associations of the 7 prioritized gene modules with biological pathways, the gprofiler2 R package (version 0.1.8)74 was used to query the Gene Ontology Biological Process, Molecular Function, and Cellular Component databases, using Bonferroni correction for multiple testing (Source Data Table 19).

Module preservation in cell-level-4 subtypes

Having identified gene modules within subsets of the cells corresponding to the second level of annotation, we used the WGCNA::modulePreservation function to evaluate module preservation within the level-4 subtypes (Supplementary Fig. 5a), using as reference the level-2 cell types, in which the modules were originally detected75.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.