Introduction

Transcriptional regulation and chromatin accessibility have been shown to play a crucial role in various neurological disorders1. Among other epigenetic techniques, the Assay for Transposase Accessible Chromatin by sequencing (ATAC-seq) provides an accurate way to depict the chromatin landscape of the brain and how it is altered by neurodegenerative diseases2,3,4. ATAC-seq is notably relevant to nominate candidates involved in disease, especially non-coding regions that disrupt gene transcription. While bulk ATAC-seq promises to determine open chromatin regions (OCR) and gene regulatory changes in a direct and efficient way, it entangles data from different cell types and obscures cell type-specific information5,6,7. Although single nucleus (sn) ATAC-seq can overcome this barrier7, it is labor-intensive, expensive, and vulnerable to technical dropout impacting data analysis and interpretation8.

Deconvoluting bulk sequencing data has been widely investigated, especially for RNA-sequencing9,10,11,12,13,14 and recently adapted for bulk ATAC-seq as OCR is better at capturing cell type-specificity than gene expression12,15,16,17. These computational approaches rely on a well-designed signature matrix, using only limited and most distinguishing cell type-specific features to estimate the cellular composition of tissue samples. While this matrix approach can help to resolve spatial single-cell gene expression from bulk RNA-seq18, the definition of cell type-specific marker remains challenging14,19. Recently, a deep learning approach has been proposed to bypass this limitation and directly predict cellular abundance from bulk RNA and microarray expression with high accuracy13. Cellular abundance change is a milestone in bulk analysis and has led to new insight into biological mechanisms13,20. Yet, it prevents a comprehensive understanding of the chromatin accessibility heterogeneity across cell populations and cell-specific OCR variation in disease, limiting bulk sequencing analysis.

Source separation21 is a widely studied signal processing paradigm that retrieves the set of individual sources from a mixed signal. One classical application is in audio processing, where one microphone is recording multiple instruments that are playing simultaneously. Source separation consists in retrieving the sound made by each type of instrument individually from one recorded audio signal. In this study, using a similar paradigm, we develop and test a deep learning-based algorithm, Cellformer, that separates the expression of 6 main brain cell types from bulk samples: 4 glial cell types including astrocytes (AST), microglia (MIC), oligodendrocytes (OLD) and oligodendrocyte progenitor cells (OPCs) and 2 major classes of neurons, excitatory (EXC) and inhibitory (INH). Unlike previous studies, Cellformer not only estimates cellular composition but also deconvolutes cell type-specific ATAC-seq OCR along the whole genome.

As we age past 65 years, the majority of the population resides on the Alzheimer’s disease (AD) continuum, meaning that approximately four out of five older adults have latent, prodromal, or full expression of AD dementia (ADD)22,23. These stages within the AD continuum typically have a progression of functional decline matched with increasing disease burden as measured during life by histopathology, neuroimaging, or biomarkers24,25,26. Standing apart from the AD continuum is a relatively small subset of older individuals who have mismatched normal cognitive function and a high disease burden sufficient to cause dementia; these individuals, called resilient to AD (RAD), are especially important because their existence demonstrates that even advanced AD burden does not necessarily lead to dementia. What combination of inherited factors, life choices, and experiences incurred or avoided combine in this “natural protection” that can be fully effective even in centenarians? Recently, several putative genetic loci involved in RAD have been found using genome-wide association studies (GWAS)27,28, yet the underlying gene regulatory machinery that mediates gene expression in RAD remains to be elucidated.

In this work, we leverage 191 well-curated tissue samples from sex and age range matched normal control (NC, n = 5), RAD (n = 12), and ADD (n = 19) individuals, without neurological comorbidities, and use Cellformer to predict cell type-specific ATAC-seq data from three brain regions and provide unique insights into the cellular and molecular mechanisms underpinning RAD (Fig. 1, Supplementary Fig. 1).

Fig. 1: Study overview.
figure 1

Cellformer was fed data from comorbidity-free bulk samples from individuals with clinicopathologic characterization as normal control (NC), Resilient to Alzheimer’s disease (RAD), and Alzheimer’s disease dementia (ADD). Three brain regions were used per individual to gain insight into the regional and cellular epigenetic profile of RAD. Cellformer generated cell type-specific expression for 6 main cell types across the whole genome, leading to an unprecedented chromatin profiling of RAD (Created with Biorender.com).

Results

Cellformer: from bulk to cell type-specific OCR

The rich diversity of cell type-specific changes can be obscured in bulk tissue transcriptomic and epigenomic analyses by mixing across heterogeneous cell populations. Hence, we hypothesize that deep learning algorithms, developed to separate mixed source signals21, could help resolve cell type-specific expression. However, a major pitfall of deep learning is the requirement of large and annotated datasets to train the model without overfitting, yet bulk ATAC-seq datasets with corresponding known cell type-specific expression compositions are very scarce. To bypass this limitation, we leveraged single-nucleus ATAC-seq collected from the brains of NC individuals7 and an in-silico dataset generation strategy to create synthetic bulk samples with established cell type-specific expression13 (Fig. 2a). More precisely, Cellformer was trained using synthetic subject-specific synthetic bulk samples. These samples were generated by first sampling and merging a random number of single nuclei per cell type, to create cell type-specific ground truth. Then combining cell type-specific expression produces synthetic bulk samples, Cellformer’s input (Methods).

Fig. 2: Cellformer: model training, design, and evaluation.
figure 2

a A synthetic dataset of simulated bulk samples was generated from previously published single-cell ATAC-seq from 13 normal controls7. Cell type-specific pseudo-bulk samples were generated by aggregating snATAC-seq data, revealing the ground truth cell type-specific composition. Simulated cell-specific pseudo-bulk samples were further aggregated to generate pseudo-bulk samples, which are Cellformer’s input. This dataset was used to train Cellformer to minimize the reconstruction error between predicted and ground truth cell type-specific ATAC-seq (Created with Biorender.com). b Cellformer leverages a dual-path strategy to process both intra and inter-chromosome interaction, enabling full genome deconvolution. P values were derived using a two-sided Wilcoxon’s test after multi-testing correction. c Cellformer was evaluated using the leave-one-subject-out strategy. It outperformed other multi-output regression models, notably linear regression, KNN and an unsupervised approach (NMF) used previously to estimate cellular composition across the (n = 6) different cell types. P values were derived using a two-sided Wilcoxon’s test after multi-testing correction. d Cellformer successfully deconvoluted leave-one-out cross-validated PBMC in-silico bulk ATAC-seq data from different datasets (n = 18 samples), predicting cell type-specific expression of five main cell types (B cell, T cell-CD4+ (CD4), T cell-CD8+ (CD8), Myeloid and NK cells). P values were derived using a two-sided Wilcoxon’s test after multi-testing correction. e Quality of the Cellformer’s predictions was assessed by comparing technical replicate cell type-specific expression (n = 36 samples, see Fig. 1). Cellformer generated outputs that are highly consistent between true technical replicates, exhibiting a correlation coefficient (>0.9) significantly higher than with random replicates. (Two-sided Wilcoxon’s test after multi-testing correction) f Cellformer output preserves cell type signature across 6 cell types: astrocytes (AST), microglia (MIC), oligodendrocytes (OLD), and oligodendrocyte progenitor cells (OPCs), and 2 major classes of neurons, excitatory (EXC) and inhibitory (INH). An external cell classifier trained on single-cell data from NC samples was used to assess the cell type-specific ATAC-seq quality. The confusion matrix computed between the cell classifier and Cellformer predictions showed almost perfect agreement, highlighting its capacity to preserve the cell type signature. g Cellformer validation was performed by comparing RAD cell type-specific expression from SMTG with RAD single-cell ATAC-seq expression from SEA-AD using a two-sided Spearman correlation. Significant high correlations were obtained within the same cell type between the two datasets. Spearman correlation coefficient order between cell types was consistent with biological knowledge: a high correlation was found between neuron types and between OLD and OPCs. All box plots show the median (middle line), interquartile range (bottom and upper edges), and the minimum and maximum values of the distribution (whiskers). *P value < 0.05, **P value < 0.01, ***P value < 0.001, ****P value < 0.0001.

Processing DNA-based sequencing has inherent challenges including handling the extremely large number of sequential features, which can lead to both memory and computational challenges. Deep learning offers the promise of dealing with high dimensional data and showing successful applications in diverse tasks using ATAC-seq data29,30. To handle the ATAC-seq high-dimensionality, Cellformer combines attention mechanisms and an effective method, named dual-path. The attention mechanisms create connections between distantly related elements demonstrating high performances in long-sequence modeling with the development of Transformer models8f).

To complement our analysis and nominate RAD gene regulatory elements, an activity-by-contact (ABC) algorithm was applied to the set of predictable OCR and HiChip from different brain regions7 to predict regional gene enhancer interactions47. ABC model determined 16,320 hippocampal enhancer OCR with 15% showing physical evidence only in this region (Supplementary Fig. 8g). By intersecting the set of predicted enhancers with RAD-specific OCR, we found that <50% of RAD-specific OCR were linked to cis-regulated elements while the rest were found in non-coding regions (Fig. 4e). GO and pathway analysis applied to OCR predicted to be localized in both genic and intergenic RAD-specific enhancers revealed significant enrichment of genes related to chemical synaptic transmission in excitatory neurons, inhibitory neurons, and microglia notably Amyloid Beta Precursor Protein Binding Family A Member 2 (APBA2), that modulates AD amyloid precursor protein, and BDNF signaling pathway. These findings corroborate previous analyses performed with microarray protein analysis and animal models48,49.

Discussion

Bulk ATAC-seq is an effective and efficient method to measure open chromatin accessibility2,4,35. In human brains, ATAC-seq may be favored over RNA-based methods for technical reasons, including greater stability of DNA in post-mortem brains and more comprehensive assessment than in single-nucleus assays50. While snATAC-seq offers the attractive opportunity to detect cell type-specific open chromatin accessibility, it is highly impacted by dropout events, making snATAC-seq analysis more challenging and vulnerable to missing low-expressed genes51. To remedy this, we developed Cellformer, a new approach to deconvolute bulk ATAC-seq data and thereby computationally enhance resolution to the cell-type level. Using Cellformer, we illustrated the power of deep learning to enhance biological data analysis and advance our understanding of disease mechanisms at the epigenetic level.

The cellular composition of tissue is a crucial component of sequencing analysis40,52. In contrast to previous methods to deconvolute bulk sequencing data9,12,53, Cellformer does not aim to predict cellular abundance in bulk tissue but instead “fully” deconvolutes bulk ATAC-seq, generating OCR data at the cell type level across the whole genome. Besides, Cellformer does not rely on cell type signature matrix definition9,18, a key ingredient of previous bulk deconvolution that strongly correlates with prediction accuracy14,19 and deters application to samples where no single-cell data are available, such as the RAD and ADD samples investigated here. Defining an accurate signature remains an open computational problem that has been poorly investigated for ATAC-seq data12. Cellformer bypasses this issue by leveraging the power of deep learning to automatically extract and predict cell type-specific OCR13.

While ATAC-seq can be performed on experimentally isolated single cells or single nucleus, Cellformer can resolve bulk expression at the cell type level, not at the single-cell level. However, creating cell type-specific mixtures is a popular strategy to overcome the low count and sparsity of snATAC-seq and improves the statistical power of single-cell analysis54. Notably, cell type-specific data are used to strengthen the signal and improve statistical significance for high-confidence, differential analysis55, TF footprint, or disease gene regulatory signatures identification56,57,58. Furthermore, similar to single-cell sequencing, Cellformer is limited to the most predictable and highly expressed cell type-specific OCR.

Cellformer belongs to the reference-based method category: supervision of the model requires snATAC-seq to learn cell type-specific expression. Therefore, Cellformer predictions are limited to the cell type-specific open chromatin patterns detected in the single-nucleus samples and strongly depend on the quality of snATAC-seq samples.

In this study, we focused on the 6 major brain cell classes. Although we demonstrate Cellformer’s ability to deconvolute at a lower resolution (Supplementary Fig. 4b), we notice that increasing the number of output cell types results in a significant rise in computational complexity. This limitation restricts the number of deconvoluted cell types that can be effectively handled. To overcome this challenge, we plan to implement and validate strategies such as hierarchical training or cell-type prioritization functions, which will expand the capabilities of Cellformer. The model generalization and robustness will be also improved in the future by leveraging the ever-growing amount of available single-cell/nucleus ATAC-seq data or by adapting transfer learning approaches to improve prediction performances on bulk samples from new tissues. Another limitation is the time of training which varies between two hours and a few days depending on the model configuration, the computing power used for the training, and the number of samples. Yet, it remains faster and less expensive than a single-cell sequencing protocol. Additional hyperparameter optimization and gradient acceleration strategy will be implemented in the future to improve training efficiency59.

Applied to large cohort bulk ATAC-seq data from multiple human brain regions, Cellformer provided new insights into RAD, an unusual group of individuals who do not succumb to the high burden of AD and who likely hold important clues to treat this highly prevalent disease. Cellformer predicted that OCR differences between RAD and the other two groups that are on the AD continuum were very strongly localized to HIPP, which subserves declarative memory formation and is the primary target of AD. At the cellular level, most predicted RAD-specific OCR were characterized by changes in both inhibitory and excitatory neurons followed by microglia60. From the perspective of cellular processes, GO analysis of RAD-specific OCR highlighted neuronal development, inflammatory response, and synaptic transmission processes. These pathways were highlighted in previous studies using proteomics and mouse models of AD48,61,62,63. Overall, these highly plausible predictions suggest that individuals with RAD are distinguished from the AD continuum by epigenetic upregulation in support of hippocampal neuronal processes and synapses. This regulation change might confer RAD the ability to preserve the number of neuronal projections and synapses that have been observed through histopathological studies64.

Cellformer offers new perspectives to gain insight into bulk sequencing and identify cell-specific gene regulatory changes in disease progression in a cost-effective way. Complementing cellular abundance prediction, Cellformer provides additional information to advance bulk ATAC-seq analysis. We expect that Cellformer may help to unveil cell-specific transcriptional regulation and advance our understanding of disease epigenetic mechanisms in other biological settings.

Methods

Data overview

This study drew on snATAC-seq and bulk ATAC-seq data previously collected7. Primary brain samples were obtained from Stanford University, the University of Washington, or Banner Health from post-mortem tissue following informed consent and Institutional Review Boards approval7 (Supplementary Fig. 1). In addition, five recently collected control snATAC-seq samples from SMTG (collected using the same ATAC-seq protocol7) were used to train the model. Data on sex was collected but not gender. Validation was performed using Seattle Alzheimer’s Disease Brain Cell Atlas (SEA-AD)43 single-cell (sc-) ATAC-seq data generated using a 10xMultiome preparation, snATAC-seq from human prefrontal cortex42 and genetic variants from the most recent GWAS study for late-onset Alzheimer’s45.

Individuals’ brain samples from both ATAC-seq datasets were carefully filtered according to clinical diagnosis of cognitive status proximate to death and assessment of AD neuropathologic change and other neuropathologic comorbidities (see Supplementary Table 1) using current consensus guidelines24,65,66,67,68,69. Resilient cases were defined as individuals without dementia at their most recent clinical research evaluation within 2 years of death, and neuropathologic findings of B score >2 and C score >1 but without vascular brain injury or Lewy body disease, and LATE neuropathologic change stage of 0 or 1. To ensure a valid comparison between cell-type-specific ATAC-seq and single-nucleus ATAC-seq from SEA-AD cohort, we also removed samples from patients with known Lewy Body disease (brainstem, limbic, neocortical, olfactory) and LATE neuropathologic change stage above 1 in the SEA-AD dataset.

Bulk ATAC-seq and scATAC-seq processing

We leveraged annotated single-cell ATAC-seq from 12 NC subjects to identify cell type-specific OCR (or peak)7. “Peak calling” was performed on regional and cell type-specific replicates to improve statistical significance using ArchR workflow56 and its MACS2 implementation70. Chromatin accessibility varies largely per brain region and cell type1. Therefore, to ensure capturing OCR that is significantly expressed in the 6 main cell types of interest (astrocyte, microglia, oligodendrocyte, OPCs, excitatory, and inhibitory neurons), we removed single cells that do not belong to one of these six categories as well as single cells identified as doublets. Only the significant OCR marker56, unique to an individual or a small number of cell type groups was conserved for downstream analysis (FDR < 0.001 and FC > 1). In total, we defined a set of 41,954 OCR. The count normalized matrix, combining OCR from all the bulk samples, was then derived using featureCount71.

Previously preprocessed and annotated PBMC scATAC-seq was downloaded from (https://github.com/GreenleafLab/ArchR_2020)56. Peak calling, peak filtering, and in-silico bulk generation were performed using the same workflow as described above.

FeatureCount (version 2.0.3) also was applied to 10xMultiple single-nucleus ATAC-seq fragment files from SEA-AD cohort to extract the same set of OCR. For an accurate comparison, we only considered predictable cell type-specific OCR for comparison, used in this study. Only cells passing the quality control and annotated using the paired snRNA were used for downstream analysis43.

Annotated raw OCR count matrix provided by Morabito et al42. was used and intersected with our set of predictable OCR to compare the prefrontal cortex from ADD and NC single nucleus with deconvoluted cell type expression using “intersect” function from bedtools72. A set of 20060 overlap** OCR was then used to compute the Spearman correlation matrices between cell type expression (Supplementary Fig. 8).

Transformer-based cell-specific ATAC-seq separator

The “Cocktail party” or source separation problem is a widely studied question consisting of extracting individual source signals from a mixed one. Inspired by this paradigm, we leveraged a state-of-the-art source separation method to deconvolute bulk ATAC-seq and extract individual cell type-specific expression along the whole genome. Comprehensive processing of the gene regulatory elements linkage, spanning the whole genome, was achieved using a long-sequence friendly neural network developed for speech separation34. Based on the inner/outer transformer-based architecture, our neural network can extract both within and cross-chromosome epigenetic dependencies along the whole genome using a reduced number of trainable parameters. More precisely, this “dual-path recurrent neural network” strategy73,74 decomposes long sequences into smaller chunks of size 250 and, extracts high-level representations within chunks, which are then concatenated and permuted for inter-chunk interaction processing. Dual path-based models have shown to be effective at modeling very long sequences, leading to superior predictive performances in various audio processing tasks. We adapted the published model to predict the ATAC-seq profile of 6 major brain cells including astrocytes, microglia, oligodendrocytes, OPCS, and two subclasses of neurons, excitatory and inhibitory neurons. An advantage of this architecture is that it can extract both within and cross-chromosome cell type-specific epigenetic dependencies all along the genome.

To improve our prediction confidence and enhance the robustness of our approach, we stacked on top of our trained network a filtering module removing predicted OCR with a relatively high training error. For each cell type i, we computed the normalized mean absolute error across all the N samples in the training set as: \({{{{{\rm{NMEA}}}}}}_{p}=\frac{{{{{\rm{MEA}}}}}({\hat{{{{{{\bf{X}}}}}}}}_{i},{{{{{{\bf{X}}}}}}}_{i})}{\frac{1}{N}{\sum }_{p=1}^{N}|{{{{{{\bf{X}}}}}}}_{i}|}\), with \({{{{{{\bf{X}}}}}}}_{i}\) and \({\hat{{{{{{\bf{X}}}}}}}}_{i}\) the ground truth and the predicted, OCR values, respectively for the cell type i. We preserved cell type-specific OCR for downstream analysis with a mean error lower than a threshold defined as: \({{{{{\rm{NMEA}}}}}}_{p} \, < \, {M}_{{{{{\rm{NMEA}}}}}}+\tau*{\sigma }_{{{{{\rm{NMEA}}}}}}\), with \({M}_{{{{{\rm{NMEA}}}}}}\) and \({\sigma }_{{{{{\rm{NMEA}}}}}}\) the mean and standard deviation of NMEA across all the OCR and \(\tau\) the threshold ranging from −0.5 to 1. \(\tau\) was optimized using samples from the training to maximize the correlation between the cell type-specific ground truth and predicted ATAC-seq expression. Once the filtering module was trained, it was applied to the model outputs to preserve only the most predictable cell type-specific OCR.

The pythonic implementation from Asteroid librarySynthetic dataset generation

Model training was achieved by creating a synthetic dataset of paired bulk and corresponding cell type-specific samples (ground truth), leveraging available single cells samples from NC12,13 Each pair in the synthetic dataset was created by first sampling and aggregating a random number of the same type of single nuclei from a sample’s snATAC-seq results in order to create synthetic cell type-specific pseudo-bulk samples that preserve regional and individual diversity in our synthetic dataset. Then, corresponding synthetic bulk ATAC-seq data were created by aggregating the generated cell type-specific pseudo-bulk samples from the six cell types. 3000 pairs of synthetic bulk and cell type-specific bulk were generated from each subject, composed of a random number of cells ranging from 100 to 800. Then, both synthetic cell type-specific and bulk samples were normalized by the total number of cells and the maximum OCR value. Harmony batch normalization was applied on snATAC-seq to ensure robust peak calling7,56. No batch normalization was further applied on the input of the model, since additional analysis suggests that Cellformer removes batch effects while preserving biological variations, i.e., kBet = 0.79 ± 0.14, NMI = 0, ARI = −0.02 ± 0.001, cLISI = 1, iLISI = 0.58 ± 0.02 across the brain regions77.

Model validation

Model testing through leave-one-subject-out cross-validation

Model generalizability was assessed using the leave-one-subject-out-strategy. More precisely, at each iteration, the dataset was partitioned into a training and testing set, such that samples from all brain regions from one individual were left out while the rest of the samples were gathered to create the training set. From the training set, 20% of the samples were used to tune the model hyperparameters at each iteration while the remaining samples helped to optimize the weights of the model. Once trained, we fed the model with synthetic bulk samples created by aggregating single cells per snATAC-seq donor from the test set (never seen by the model) and validated its performances at predicting an accurate and consistent cell-specific signal using the Spearman correlation. We also assessed the ability of the model to predict non-zero OCR by computing the AUROC and AUPRC after binarization of the ATAC-seq expression29,30. Mean errors with quartile error bars across iterations were reported and compared with other models. The model with the highest performance on the whole dataset and among the top three models with the lowest test error was used for downstream analysis.

Model’s output consistency and plausibility

Model output consistency was validated by computing the Spearman correlation between technical replicates and predicted cell type-specific signals (Fig. 2e). To assess the significance of the observed mean correlation between technical, random replicate permutation tests were performed. More precisely, for each bulk sample, Spearman correlation was computed between the model’s output of this sample and a random replicate, arbitrarily selected from the same brain region, from the phenotype group, or both the same brain region and phenotype group (Supplementary Fig. 3). P value was derived by comparing the mean correlation between true replicates and random replicates using Bonferroni corrected two-sided Wilcoxon test.

Cell signature preservation

We ensured the cell type signatures model’s preservation in ADD and RAD samples using an external cell classifier (Supplementary Fig. 6). To better capture the cell type-specific signature and be more robust to dropout, an XGBoost classifier was trained to classify single-nucleus ATAC-seq from Control cases into cell class. The model was trained to minimize a softmax loss between the predicted label from synthetic cell type-specific ATAC and the corresponding ground truth (see Synthetic Dataset Generation). Stratified K-fold nested cross-validation was exploited to validate the model and performances of the model were quantified using AUROC and AUPRC between the ground truth label and the predicted probability and optimized the model hyperparameters. The XGBoost package in Python was exploited to implement the cell classifier, parameterized with a learning rate equal to 0.1, a maximum depth of tree set to 10 and 100 estimators. The model achieved a mean AUROC = 1.000, Precision = 0.994, and Recall = 0.993 over cross-validation iterations when tested on the held-out single-nucleus ATAC-seq datasets.

Once trained and validated to accurately predict the cell type, the model was applied to deconvoluted cell type-specific expression from AD and RAD. The classifier achieved an AUROC of 1.000, a Precision of 0.993, and a Recall of 0.994 when tested on the deconvoluted expression. Confusion matrices showed almost a perfect classification across cell types, brain regions, and conditions.

Model comparison

We compared our model with supervised (Linear regression), nonparametric (KNN) machine learning, and unsupervised (Non-Negative Matrix Factorization) models. These algorithms were implemented using the default parameterized functions from Scikit-learn78. Using 100 synthetic bulk samples, the multi-out models (Linear regression and KNN) were trained to predict cell type-specific OCR by minimizing the MSE loss. The same leave-one-subject-out strategy as for Cellformer was used to assess models’ generalization and avoid overfitting. NMF model was trained using a synthetic bulk matrix created by aggregating all the single nuclei per replicate. Then, the predicted OCR-specific expression was computed through row-wise multiplications between the feature matrix and the coefficient vectors.

OCR annotation

Chipseeker79 was used to identify OCR-gene association and genomic OCR annotation using default parameters following ATAC-seq data processing guidelines and Harvard bioinformatics recommendations35,80. One of the main issues in epigenetic analysis is the lack of consensus between annotating tools81. We, therefore, compared Chipseeker to the annotations given by ArchR, developed for ATAC-seq data analysis. Overall, Chipseeker and ArchR agreed on 60% of peaks, including complete (100%) agreement between peak-to-gene annotations of RAD-specific promoters, from which our biological insights were drawn. In particular, similar gene ontology enrichment is observed using both tools.

Candidate enhancer OCRs per brain region were computed using Active-by-Contact (ABC) model47. Candidate enhancer regions were derived for each brain region independently using 10 ATAC-seq replicates. Region-specific ABC scores were computed by combining the OCR activity and the genomic spatial information extracted from the HiChip-seq data provided by Corces et al7., using the suggested parameters80.

Cell type-specific ATAC-seq analysis

Differentially expressed OCR were identified using FDR corrected two-sided Wilcoxon test using FDR 5% and absolute log fold-change superior to 0.5 from Scanpy library and MultiPy82,83,84. Gene ontology and pathway analysis were performed on OCR-related genes with the GO and BioPlanet databases from 2021 and 2019, respectively, using GSEAPY85 (version 1.0.3).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.