Introduction

The objective of neural encoding is to predict the brain’s response to external stimuli, providing an effective means to explore the brain’s mechanism for processing sensory information and serving as the foundation for brain–computer interface (BCI) systems. Visual perception, being one of the primary ways in which we receive external information, has been a major focus of neural encoding research. With the advancement of non-invasive brain imaging techniques, such as functional magnetic resonance imaging (fMRI), scientists have made remarkable progress in vision-based neural encoding1,2,3,4 over the past two decades, making it a hot topic in neuroscience.

The process of vision-based encoding typically involves two main steps: feature extraction and response prediction5. Feature extraction aims to produce visual features of the stimuli by simulating the visual cortex; an accurate feature extractor that approximates real visual mechanisms is crucial for successful encoding. Response prediction aims to predict voxel-wise fMRI responses from the extracted visual features. Linear regression6 is commonly used for this step, as the mapping between features and responses should be kept as simple as possible. Previous studies have shown that the early visual cortex processes information in a manner similar to Gabor wavelets7,8,9. Building on this finding, Gabor filter-based encoding models have been proposed and successfully applied to tasks such as image identification and movie reconstruction1,3. In recent years, convolutional neural networks (CNNs) have garnered significant attention due to their impressive accomplishments in computer vision. Several studies10,11 have utilized representational similarity analysis12 to compare the dissimilarity patterns of CNN and fMRI representations, revealing that the human visual cortex shares hierarchical representations similar to those of CNNs. As a result, CNN-based encoding models have become widely used and have demonstrated excellent performance2,4,13,14. However, despite the success of CNNs in encoding applications, the differences between CNNs and the brain in processing visual information cannot be overlooked15.

In terms of computational mechanisms, a fundamental distinction exists between the artificial neurons in CNNs and biological neurons: the former propagate continuous digital values, whereas the latter propagate action potentials (spikes). The introduction of spiking neural networks (SNNs), considered the third generation of neural networks16, has substantially narrowed this gap. Unlike traditional artificial neural networks (ANNs), SNNs transmit information through spike timing. In an SNN, each neuron integrates spikes from the previous layer and emits a spike to the next layer when its membrane voltage surpasses a threshold. Spike-timing-dependent plasticity (STDP)17,18, an unsupervised weight-update rule that has been observed in the mammalian visual cortex19,20,21, is the most commonly used learning algorithm for SNNs. Recent studies have applied STDP-based SNNs to object recognition and achieved competitive performance22,23,24. The biological plausibility of SNNs gives them an advantage in neural encoding.

In this paper, a spiking convolutional neural network (SCNN)-based encoding framework was proposed to bridge the gap between CNNs and the real visual system. The encoding procedure comprised three steps. First, an SCNN was trained using the STDP algorithm to extract the visual features of the images. Second, the coordinates of each voxel’s receptive field in the SCNN feature maps were identified based on the retinotopic properties of the visual cortex, such that each voxel receives visual input from only one fixed location of the feature map. Third, linear regression models were built for each voxel to predict its response from the corresponding SCNN features. The framework was evaluated using four publicly available image-fMRI datasets, including handwritten character25, handwritten digit26, grayscale natural image1, and colorful natural image27 datasets. Additionally, two downstream decoding tasks, namely image reconstruction and image identification, were performed based on the encoding models. The encoding and decoding performance of the proposed method was compared with that of previous methods.

Results

Encoding performance on handwritten character dataset

We built SCNN-based encoding models (see Fig. 1a) on four image-fMRI datasets and performed image reconstruction and image identification tasks based on the pre-trained encoding models (see Fig. 1b, c). Table 1 provides the basic information about these datasets, and details can be found in Methods. To predict the fMRI responses evoked by handwritten characters, the SCNN was first trained on images from the TICH dataset (excluding the test-set images; 14,854 images of the 6 characters in total) to maximize its representation ability. Subsequently, voxel-wise linear regression models were trained with the fMRI data in the train set for each participant. Encoding performance was measured using Pearson’s correlation coefficient (PCC) between the predicted and measured responses to the test-set images. Moreover, the proposed model was compared with a CNN-based encoding model whose network architecture was constrained to be consistent with that of the SCNN (Supplementary Table 1); the CNN was trained using the Adam optimizer, as in previous CNN-based encoding studies2,4,13,14, whereas our model extracts features using the computational rules of SNNs, which are more biologically realistic. To extract meaningful visual features, we employed an SCNN consisting of a DoG layer and a convolutional layer, which simulate information processing in the retina and visual cortex, respectively. Our model outperformed the benchmark methods (Gabor- and CNN-based encoding models) in terms of encoding performance on the experimental data, highlighting the advantage of the SCNN in visual perception encoding.

Despite its biological plausibility, the SCNN simulates information processing at the level of individual neurons, whereas fMRI measures large-scale brain activity, with each voxel’s signal reflecting the joint activity of a large number of neurons. Regression models are therefore crucial for voxel-level encoding, as they map the activations of multiple SCNN neurons to the response of a single voxel. Previous studies have demonstrated the population receptive field properties35,36 of fMRI data, indicating that each voxel in the visual cortex (especially in V1–V3) receives visual input only from a fixed region of the visual field. Based on this property, we employed a feature selection algorithm that matches a receptive field location to each voxel, which is more consistent with the real visual mechanism and reduces the risk of overfitting.

Whether the brain learns in a supervised or unsupervised manner has long been debated. Instead of a supervised CNN, we employed an unsupervised SCNN trained via STDP in our model. The findings of this study suggest that the early areas of the visual cortex are more inclined to acquire visual representations in an unsupervised manner. In addition, the STDP-based SCNN offers several advantages for neural encoding. First, it is biologically plausible, as STDP is a bioinspired learning rule. Second, it can handle both labeled and unlabeled data. Finally, it is well suited to small-sample datasets, such as those obtained with fMRI.

The realization of neural decoding tasks serves as the foundation for numerous brain-reading applications, such as BCI37. Two types of decoding models exist: those derived from encoding models and those constructed directly in an end-to-end manner. The former offers voxel-level functional descriptions while completing decoding tasks5. However, recent breakthroughs in decoding have primarily been achieved using the latter models33,38,39. In this study, we successfully completed downstream decoding tasks, including image reconstruction and identification, based on the encoding model. The results demonstrate that our approach outperformed other end-to-end models in both decoding tasks. This finding further confirms the effectiveness of our encoding model and suggests that encoding-based approaches hold significant potential for solving decoding tasks.

Despite the progress made in neural encoding with the SCNN, several limitations remain. First, the architectures of SNNs are typically shallower than those of deep-learning networks, which restricts their ability to extract complex, hierarchical visual features; recent studies have begun to address this issue23,24,40. Incorporating a deeper SCNN into our model could further improve encoding performance and enable investigation of the hierarchical structure of the visual cortex. Second, the Integrate-and-Fire neuron used in our study is a simplification of biological neurons; more realistic neuron models, such as leaky Integrate-and-Fire and Hodgkin–Huxley neurons41, could further enhance the biological plausibility of our encoding model. Third, the parameters of STDP and the network architecture were adopted from previous works23,24, and the impact of different parameter settings on encoding performance requires further exploration.

In conclusion, this work presents a powerful tool for neural encoding. On the one hand, we combined the structure of CNNs with the computational rules of SNNs to model the visual system and constructed voxel-wise encoding models based on the receptive field mechanism. On the other hand, we demonstrated that our model can be used for practical decoding tasks, such as image reconstruction and identification. We anticipate that SCNN-based encoding models will provide valuable insights into visual mechanisms and contribute to solving BCI and computer vision tasks. Furthermore, we plan to extend SNNs to encoding tasks for other cognitive functions (e.g., imagination and memory) in the future.

Methods

SCNN-based encoding model

An SCNN-based encoding model was proposed in this study to predict the fMRI activity elicited by input visual stimuli. The encoding model comprised an SCNN feature extractor and voxel-wise regression models. First, the unsupervised SCNN was used to extract stimulus features for each input image. Then, linear regression models were constructed to map the SCNN features to fMRI responses. The architecture of the encoding model is depicted in Fig. 1a.

SCNN feature extractor

To extract stimulus features, a simple two-layer SCNN was employed in this study. The first layer, the Difference of Gaussians (DoG) layer, was designed to emulate neural processing in retinal ganglion cells42,43; its parameter settings were based on previous research23,24. For handwritten characters and natural images, each input image was convolved with six DoG filters with zero padding: ON- and OFF-center DoG filters with sizes of \(3\times 3\), \(7\times 7\), and \(13\times 13\) and standard deviations of \((3/9,\,6/9)\), \((7/9,\,14/9)\), and \((13/9,\,26/9)\), respectively, with a padding size of 6. For handwritten digits, each input image was convolved with two DoG filters with zero padding: ON- and OFF-center DoG filters with a size of \(7\times 7\) and standard deviations of \((1,\,2)\), with a padding size of 3. The DoG features were then transformed into spike waves using intensity-to-latency encoding44 with 30 time steps; specifically, DoG feature values greater than 50 were sorted in descending order and equally distributed into 30 bins to generate the spike waves. Before being passed to the next layer, the output spikes underwent max pooling with a window size of \(2\times 2\) and a stride of 2.
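For concreteness, the following sketch illustrates DoG filtering and intensity-to-latency coding in plain PyTorch, using the handwritten-digit settings above (two \(7\times 7\) ON/OFF kernels with standard deviations (1, 2), padding 3, 30 time steps, threshold 50). It is not the SpykeTorch code used in the paper; the zero-mean kernel normalization, the 8-bit intensity scaling of the placeholder image, and the cumulative spike representation are assumptions made here for illustration.

```python
import torch
import torch.nn.functional as F

def dog_kernel(size, sigma1, sigma2):
    """Difference-of-Gaussians kernel (ON-center when sigma1 < sigma2)."""
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    xx, yy = torch.meshgrid(ax, ax, indexing="ij")
    d2 = xx ** 2 + yy ** 2
    g = lambda s: torch.exp(-d2 / (2 * s ** 2)) / (2 * torch.pi * s ** 2)
    k = g(sigma1) - g(sigma2)
    return k - k.mean()  # zero mean, so uniform regions give no response (an assumption)

def intensity_to_latency(feat, time_steps=30, threshold=50.0):
    """Rank-order coding: stronger DoG responses fire in earlier time bins.
    Returns a cumulative spike tensor of shape (time_steps, *feat.shape)."""
    flat = feat.flatten()
    active = (flat > threshold).nonzero(as_tuple=True)[0]
    spikes = torch.zeros(time_steps, *feat.shape)
    if active.numel() > 0:
        order = active[torch.argsort(flat[active], descending=True)]
        bins = torch.linspace(0, time_steps - 1e-3, order.numel()).long()
        for t in range(time_steps):
            spikes[t].view(-1)[order[bins <= t]] = 1.0  # a spike persists once emitted
    return spikes

# Example on one 28x28 digit-like image with 8-bit intensities (placeholder values)
img = torch.rand(1, 1, 28, 28) * 255
on = dog_kernel(7, 1.0, 2.0)
kernels = torch.stack([on, -on]).unsqueeze(1)                # ON- and OFF-center filters
dog_maps = F.conv2d(img, kernels, padding=3)                 # (1, 2, 28, 28)
spike_waves = intensity_to_latency(dog_maps[0])              # (30, 2, 28, 28)
pooled = F.max_pool2d(spike_waves, kernel_size=2, stride=2)  # (30, 2, 14, 14)
```

With the random placeholder image, few units may exceed the threshold of 50; on real images the threshold follows the paper's setting.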

The second layer of the SCNN is the convolutional layer, which was designed to emulate the information-integration mechanism of the visual cortex. In this layer, 64 convolutional kernels of Integrate-and-Fire (IF) neurons were used to process the input spikes. The window size of the convolutional kernels was \(5\times 5\), and the padding size was 2. Each IF neuron gathered input spikes from its receptive field and emitted a spike when its voltage reached the threshold. This can be expressed mathematically as follows:

$$v_{i}\left(t\right)=v_{i}\left(t-1\right)+\mathop{\sum}\limits_{j}w_{ij}\times s_{j}\left(t-1\right),$$
(1)
$$s_{i}\left(t\right)=1\ \text{and}\ v_{i}\left(t\right)=0,\quad \text{if}\ v_{i}\left(t\right)\ge v_{th},$$
(2)

where \(v_{i}(t)\) represents the voltage of the \(i\)-th IF neuron at time step \(t\), and \(w_{ij}\) denotes the synaptic weight between the \(i\)-th neuron and the \(j\)-th input spike within the neuron’s receptive field. The firing threshold, denoted by \(v_{th}\), was set to 10. For each image, each neuron is permitted to fire at most once. An inhibition mechanism is employed in the convolutional layer, allowing only the neuron with the earliest spike time to fire at each position in the feature maps. Synaptic weights are updated through spike-timing-dependent plasticity (STDP), which can be expressed as:

$$\Delta w_{ij}=\begin{cases}a^{+}\times w_{ij}\times \left(1-w_{ij}\right), & \text{if}\ t_{j}-t_{i}\le 0,\\ a^{-}\times w_{ij}\times \left(1-w_{ij}\right), & \text{if}\ t_{j}-t_{i}>0,\end{cases}$$
(3)

where \(\Delta w_{ij}\) denotes the weight modification, \(a^{+}\) and \(a^{-}\) represent the learning rates (set to 0.004 and −0.003, respectively)23, and \(t_{i}\) and \(t_{j}\) indicate the spike times of the \(i\)-th neuron and the \(j\)-th input spike, respectively. The learning convergence \(C\), as defined by Kheradpisheh et al.23, is calculated using the following equation:

$$C=\mathop{\sum}\limits_{i}\mathop{\sum}\limits_{j}w_{ij}\times \left(1-w_{ij}\right)/N,$$
(4)

where \(N\) represents the total number of synaptic weights. Training of the convolutional layer stops when \(C\) falls below 0.01. The SCNN was implemented on the SpykeTorch platform45. After training, the firing threshold \(v_{th}\) is set to infinity, and the voltage of each neuron at the final time step is taken as the SCNN feature of the visual stimulus. Because the voltages of the convolutional neurons accumulate over time and are never reset when \(v_{th}\) is infinite, the final voltage values (the SCNN features) reflect the SCNN’s activation in response to the visual stimuli.
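To make Eqs. (1)–(4) concrete, a minimal PyTorch sketch of the IF dynamics, the STDP update, and the convergence measure is given below. The winner-take-all competition, lateral inhibition, and other SpykeTorch-specific details are omitted, and all tensor shapes and function names are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def if_first_spikes(spike_wave, weights, padding=2, v_th=10.0):
    """Eqs. (1)-(2): integrate cumulative input spikes through the kernels and
    find, per output neuron, the first time step its potential reaches threshold.
    spike_wave: (T, C_in, H, W); weights: (C_out, C_in, 5, 5)."""
    T = spike_wave.shape[0]
    potentials = F.conv2d(spike_wave, weights, padding=padding)  # (T, C_out, H, W)
    fired = potentials >= v_th
    t_idx = torch.arange(T).view(T, 1, 1, 1)
    first_spike = torch.where(fired, t_idx, T).amin(dim=0)       # T means "never fired"
    return first_spike, potentials[-1]                           # voltages at last time step

def stdp_update(w, pre_t, post_t, a_plus=0.004, a_minus=-0.003):
    """Eq. (3) for one winning kernel: potentiate synapses whose input spike time
    is no later than the output spike time, depress the rest (soft-bound update)."""
    a = torch.where(pre_t <= post_t, torch.tensor(a_plus), torch.tensor(a_minus))
    return torch.clamp(w + a * w * (1 - w), 0.0, 1.0)

def convergence(weights):
    """Eq. (4): mean of w * (1 - w); training stops when this falls below 0.01."""
    return (weights * (1 - weights)).mean()
```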

Response prediction algorithm

With the obtained SCNN feature \(F\in {\mathscr{R}}^{64\times h\times w}\), a linear regression model is constructed for each voxel to predict its fMRI response \(Y\). To avoid overfitting, the receptive field mechanism is introduced into the regression models, such that each voxel receives input only from a specific location of the SCNN feature map. To identify the optimal receptive field location for each voxel (different voxels may share the same preferred location), every location of the SCNN feature maps is examined by fitting the regression model with threefold cross-validation on the training data. The regression model and its objective function are defined as:

$$y_{v}=w\times f_{ij}+\epsilon,$$
(5)
$$\mathop{\min}\limits_{w}\ \left\|w\times f_{ij}-y_{v}\right\|_{2}^{2},$$
(6)

where \(y_{v}\) represents the fMRI response of voxel \(v\), \(w\) denotes the weight parameters of the regression model, and \(f_{ij}\in {\mathscr{R}}^{64\times 1}\) \((i=1,2,\ldots ,h;\ j=1,2,\ldots ,w)\) denotes the feature vector at location \((i,j)\) of the SCNN feature maps. Regression accuracy is quantified using the coefficient of determination (\(R^{2}\)) between the predicted and observed responses, and the feature location with the highest \(R^{2}\) is chosen as the receptive field location of each voxel. Finally, the regression model for each voxel is retrained on the entire training set using the determined receptive field location.
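One possible realization of this voxel-wise receptive-field search, using scikit-learn, is sketched below; the data layout and function name are assumptions for illustration, not the authors' code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def fit_voxel_model(F_train, y_train):
    """Receptive-field search for one voxel (sketch of Eqs. (5)-(6)).
    F_train: (n_samples, 64, h, w) SCNN features; y_train: (n_samples,) responses.
    Returns the best (i, j) location and a model refit on all training data."""
    n, c, h, w = F_train.shape
    best_r2, best_loc = -np.inf, None
    for i in range(h):
        for j in range(w):
            X = F_train[:, :, i, j]                      # 64-dim feature vector per sample
            r2 = cross_val_score(LinearRegression(), X, y_train,
                                 cv=3, scoring="r2").mean()   # threefold CV, R^2
            if r2 > best_r2:
                best_r2, best_loc = r2, (i, j)
    i, j = best_loc
    model = LinearRegression().fit(F_train[:, :, i, j], y_train)  # refit on all data
    return best_loc, model
```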

Downstream decoding tasks

Two downstream decoding tasks were performed based on the encoding models, namely image reconstruction and image identification. The objective of the image reconstruction task is to reconstruct the perceived image from the observed fMRI response, while the image identification task aims to determine which image was viewed. The specific methodologies for these tasks are described below.

Image reconstruction

As depicted in Fig. 1b, the image reconstruction task was performed using a large prior image set. First, the encoding model was used to generate the predicted fMRI responses for all images in the prior image set. Then, the likelihood of the observed fMRI response r given a prior image s was estimated, modeled as a multivariate Gaussian distribution:

$$p\left(r\mid s\right)\propto \exp\left\{-\left(r-\hat{r}\left(s\right)\right)\Sigma^{-1}\left(r-\hat{r}\left(s\right)\right)^{\prime}\right\}$$
(7)
$$\Sigma=\mathrm{cov}\left(r-\hat{r}\left(s\right)\right)$$
(8)

where \(\hat{r}(s)\) represents the predicted fMRI response to \(s\), and \(\Sigma\) denotes the noise covariance matrix estimated from the training samples. Finally, the prior images with the highest likelihood of evoking the observed fMRI response were averaged to obtain the reconstruction result.
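As an illustration, one way to implement this likelihood ranking in NumPy is sketched below; the number of averaged priors (`top_k`) and all variable names are hypothetical, since the text does not specify these implementation details.

```python
import numpy as np

def reconstruct(r_obs, prior_images, r_pred_prior, Sigma, top_k=10):
    """Rank prior images by the Gaussian likelihood of the observed response
    (Eqs. (7)-(8)) and average the most likely candidates.
    r_obs: (n_voxels,); prior_images: (n_prior, H, W);
    r_pred_prior: (n_prior, n_voxels) encoding-model predictions for the priors;
    Sigma: (n_voxels, n_voxels) covariance of the training residuals."""
    Sigma_inv = np.linalg.pinv(Sigma)
    resid = r_obs[None, :] - r_pred_prior                  # r - r_hat(s) for each prior
    log_lik = -np.einsum("nv,vu,nu->n", resid, Sigma_inv, resid)
    best = np.argsort(log_lik)[::-1][:top_k]               # most likely prior images
    return prior_images[best].mean(axis=0)

# Sigma can be estimated from training residuals (Eq. (8)), e.g.:
# Sigma = np.cov(R_train - R_train_pred, rowvar=False)
```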

Image identification

Figure 1c illustrates the methodology employed for the image identification task. The test set images were fed into the encoding model to generate the predicted fMRI responses. Subsequently, the Pearson’s correlation coefficients (PCCs) between the predicted fMRI responses and the observed fMRI response were computed. The image that exhibited the highest correlation between its predicted fMRI response and the observed response was deemed to be the image viewed by the subject.
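For clarity, the identification step reduces to an argmax over Pearson correlations, as in the following sketch; array shapes and the function name are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def identify(r_obs, r_pred_candidates):
    """Return the index of the candidate image whose predicted response has the
    highest Pearson correlation with the observed response.
    r_obs: (n_voxels,); r_pred_candidates: (n_images, n_voxels)."""
    r_obs_c = r_obs - r_obs.mean()
    preds_c = r_pred_candidates - r_pred_candidates.mean(axis=1, keepdims=True)
    pcc = (preds_c @ r_obs_c) / (np.linalg.norm(preds_c, axis=1) * np.linalg.norm(r_obs_c))
    return int(np.argmax(pcc))
```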

fMRI datasets

To validate the encoding model, four publicly available datasets that have been extensively used in prior research1,25,26,27,33,38,46 were adopted, namely the handwritten character, handwritten digit, grayscale natural image, and colorful natural image datasets. The fundamental characteristics of these datasets are presented in Table 1, and a brief overview of each dataset is provided below.

Handwritten character dataset

This dataset comprises fMRI data obtained from three participants as they viewed handwritten character images. A total of 360 images depicting 6 characters (B, R, A, I, N, and S) with a size of \(56\times 56\), sourced from the TICH character dataset47, were presented to each participant. A white square was added to each image as a fixation point. During the experiment, each image was displayed for 1 s (flashed at 2.5 Hz), followed by a 3-s black background, and 3 T fMRI data were simultaneously collected (TR = 1.74 s, voxel size = \(2\times 2\times 2\,{\rm{mm}}^{3}\)). The voxel-level fMRI responses of visual areas V1 and V2 to each visual stimulus were estimated using general linear models48. The same train/test split as in the original work25 was adopted, comprising 270 and 90 class-balanced examples, respectively.

Handwritten digit dataset

This dataset comprises fMRI data obtained from one participant while viewing handwritten digit images26. During the experiment, 100 handwritten images of the digits 6 and 9, with a size of \(28\times 28\), were presented to the participant; each image was displayed for 12.5 s and flashed at 6 Hz. The fMRI responses of V1, V2, and V3 were acquired using a Siemens 3 T MRI system (TR = 2.5 s, voxel size = \(2\times 2\times 2\,{\rm{mm}}^{3}\)). The train and test sets comprised 90 and 10 examples, respectively. Additionally, this dataset provides 2000 prior handwritten images of 6 and 9 that were not used in the fMRI experiment and were employed for the image reconstruction task.

Grayscale natural image dataset

This dataset comprises fMRI data obtained from two participants as they viewed grayscale natural images1. The experiment was divided into train and test stages. During the training stage, the participants were presented with 1750 images, each displayed for 1 s (flashed at 2 Hz) and followed by a 3-s gray background. In the test stage, the participants were shown 120 images distinct from those used in the training stage. The fMRI data were acquired simultaneously in both stages of the experiment using a 3 T scanner (TR = 1 s, voxel size = \(2\times 2\times 2.5\,{\rm{mm}}^{3}\)). The voxel-level fMRI responses of visual areas V1–V3 were estimated for each visual stimulus. To reduce computational complexity, the natural images were downsampled from \(500\times 500\) to \(128\times 128\) pixels.

Colorful natural image dataset

This dataset comprises fMRI data obtained from five participants as they viewed colorful natural images27. The experiment consisted of two sessions, namely the training image session and the test image session. During the training image session, each participant was presented with 1200 images from 150 categories, with each image being displayed only once (flashed at 2 Hz for 9 s). In the test image session, each participant was shown 50 images from 50 categories, with each image being presented 35 times. The fMRI responses of multiple visual areas on the ventral visual pathway were collected using a 3 T Siemens scanner (TR = 3 s, voxel size = \(3\times 3\times 3\,{{{{{{\rm{mm}}}}}}}^{3}\)), and V1, V2, and V3 were selected as regions of interest for this study. Prior to being fed into the SCNN, the natural images were converted from RGB format to grayscale format and downsampled from \(500\times 500\) to \(128\times 128\) pixels.

Noise ceiling estimation

The encoding accuracies of the colorful natural image dataset were compared with noise ceilings, which represent the upper limit of the accuracies in the presence of noise. To calculate the noise ceiling for each voxel, we employed a method that has been commonly used in previous studies13,49,50,51. This method assumes that the noise follows a Gaussian distribution with a mean of zero and that the observed fMRI signal is equal to the response plus noise. Initially, we estimated the standard deviation of the noise \({\hat{\sigma }}_{N}\) using the following formula:

$$\hat{\sigma}_{N}=\sqrt{\mathrm{mean}\left(\sigma_{R}^{2}\right)},$$
(9)

where \(\sigma_{R}^{2}\) represents the variance of the responses across the 35 repeated presentations of each test image. Subsequently, we calculated the variance of the response by subtracting the variance of the noise from the variance of the mean response:

$$\hat{\sigma}_{R}^{2}=\mathrm{var}\left(\mu_{R}\right)-\hat{\sigma}_{N}^{2},$$
(10)

where \(\mu_{R}\) represents the mean response across the repeated presentations of each test image. Finally, we drew samples from the response and noise distributions and summed the simulated response and noise to generate the simulated signal. We conducted 1000 simulations, calculated the PCC between the simulated signal and the simulated response in each simulation, and took the mean PCC value as the noise ceiling.
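A minimal Monte-Carlo sketch of this noise-ceiling estimate for a single voxel is given below, assuming a response matrix of 50 test images × 35 repeats; clamping the response variance at zero is a practical safeguard added here, not part of Eqs. (9) and (10), and the function name is illustrative.

```python
import numpy as np

def noise_ceiling(test_responses, n_sim=1000, seed=0):
    """test_responses: (n_images, n_repeats) responses of one voxel,
    e.g. 50 test images x 35 repeated presentations."""
    rng = np.random.default_rng(seed)
    n_images = test_responses.shape[0]
    sigma_n = np.sqrt(test_responses.var(axis=1, ddof=1).mean())            # Eq. (9)
    var_r = max(test_responses.mean(axis=1).var(ddof=1) - sigma_n ** 2, 0)  # Eq. (10)
    pccs = []
    for _ in range(n_sim):
        resp = rng.normal(0.0, np.sqrt(var_r), n_images)    # simulated responses
        signal = resp + rng.normal(0.0, sigma_n, n_images)  # simulated response + noise
        pccs.append(np.corrcoef(signal, resp)[0, 1])
    return float(np.mean(pccs))                             # mean PCC = noise ceiling
```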

Statistics and reproducibility

In Fig. 2, we performed one-tailed two-sample t-tests to compare the encoding accuracies of different methods on each dataset; the sample sizes are described in the figure captions. In the reproducibility analysis, we conducted a two-tailed two-sample t-test to assess whether the encoding accuracies (n = 500) of SCNNs with different initial values exhibited statistically significant differences; the corresponding p-values are reported in the “Results” section.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.