Introduction

The objective of neural encoding is to predict the brain’s response to external stimuli, providing an effective means to explore the brain’s mechanism for processing sensory information and serving as the foundation for brain–computer interface (BCI) systems. Visual perception, being one of the primary ways in which we receive external information, has been a major focus of neural encoding research. With the advancement of non-invasive brain imaging techniques, such as functional magnetic resonance imaging (fMRI), scientists have made remarkable progress in vision-based neural encoding1,2,3,4 over the past two decades, making it a hot topic in neuroscience.

The process of vision-based encoding typically involves two main steps: feature extraction and response prediction5. Feature extraction aims to produce visual features of the stimuli by simulating the visual cortex; an accurate feature extractor that approximates real visual mechanisms is crucial for successful encoding. Response prediction aims to predict voxel-wise fMRI responses from the extracted visual features. Linear regression6 is commonly used for this step, as the mapping between features and responses should be kept as simple as possible. Previous studies have shown that the early visual cortex processes information in a manner similar to Gabor wavelets7,8,9. Building on this finding, Gabor filter-based encoding models have been proposed and successfully applied to tasks such as image identification and movie reconstruction1,3. In recent years, convolutional neural networks (CNNs) have garnered significant attention due to their impressive accomplishments in computer vision. Several studies10,11 have utilized representational similarity analysis12 to compare the dissimilarity patterns of CNN and fMRI representations, revealing that the human visual cortex shares hierarchical representations similar to those of CNNs. As a result, CNN-based encoding models have become widely used and have demonstrated excellent performance2,4,13,14. However, despite the success of CNNs in encoding applications, the differences between CNNs and the brain in processing visual information cannot be overlooked15.

In terms of computational mechanisms, a fundamental distinction exists between the artificial neurons in CNNs and biological neurons: the former propagate continuous digital values, whereas the latter propagate action potentials (spikes). The introduction of spiking neural networks (SNNs), considered the third generation of neural networks16, has substantially narrowed this gap. Unlike traditional artificial neural networks (ANNs), SNNs transmit information through spike timing. In an SNN, each neuron integrates spikes from the previous layer and emits a spike to the next layer when its membrane voltage surpasses a threshold. Spike-timing-dependent plasticity (STDP)17,18, an unsupervised weight-update rule that has been observed in the mammalian visual cortex19,20,21, is the most commonly used learning algorithm for SNNs. Recent studies have applied STDP-based SNNs to object recognition and achieved competitive performance22,23,24. The biological plausibility of SNNs gives them an advantage in neural encoding.

In this paper, a spiking convolutional neural network (SCNN)-based encoding framework was proposed to bridge the gap between CNNs and the real visual system. The encoding procedure comprised three steps. First, an SCNN was trained using the STDP algorithm to extract the visual features of the images. Second, the coordinates of each voxel’s receptive field in the SCNN feature maps were identified based on the retinotopic properties of the visual cortex, such that each voxel receives visual input from only one fixed location of the feature map. Third, linear regression models were built for each voxel to predict its response from the corresponding SCNN features. The framework was evaluated using four publicly available image-fMRI datasets, including handwritten character25, handwritten digit26, grayscale natural image1, and colorful natural image27 datasets. Additionally, two downstream decoding tasks, namely image reconstruction and image identification, were performed based on the encoding models. The encoding and decoding performance of the proposed method was compared with that of previous methods.

Results

Encoding performance on handwritten character dataset

We built SCNN-based encoding models (see Fig. 1a) on four image-fMRI datasets and performed image reconstruction and image identification tasks based on the pre-trained encoding models (see Fig. 1b, c). Table 1 provides the basic information about these datasets, and details can be found in Methods. To predict the fMRI responses evoked by handwritten characters, the SCNN was first trained on images from the TICH dataset (excluding the test-set images; 14,854 images of the 6 characters in total) to maximize its representation ability. Subsequently, voxel-wise linear regression models were trained with the fMRI data in the train set for each participant. Encoding performance was measured using Pearson’s correlation coefficient (PCC) between the predicted and measured responses to the test-set images. Moreover, the proposed model was compared with a CNN-based encoding model whose network architecture was constrained to be consistent with that of the SCNN (Supplementary Table 1); the CNN was trained using the Adam optimizer, as in previous CNN-based encoding studies2,4,13,14, whereas our model extracts features using the computational rules of SNNs, which are more biologically realistic. To extract meaningful visual features, we employed an SCNN consisting of a DoG layer and a convolutional layer, which simulate information processing in the retina and visual cortex, respectively. Our model outperformed the benchmark methods (Gabor- and CNN-based encoding models) in terms of encoding performance on the experimental data, highlighting the advantage of the SCNN in visual perception encoding.

Despite its biological plausibility, the SCNN simulates information processing at the level of individual neurons, whereas fMRI measures large-scale brain activity, with each voxel’s signal reflecting the joint activity of a large number of neurons. Regression models are therefore crucial for voxel-level encoding, as they map the activations of multiple SCNN neurons to the response of a single voxel. Previous studies have demonstrated the population receptive field properties35,36 of fMRI data, indicating that each voxel in the visual cortex (especially in V1–V3) receives visual input only from a fixed region of the visual field. Based on this property, we employed a feature selection algorithm that matches a receptive field location to each voxel, which is more consistent with the real visual mechanism and reduces the risk of overfitting.

Whether the brain learns in a supervised or unsupervised manner has long been debated. Instead of a supervised CNN, we employed an unsupervised SCNN trained via STDP in our model. The findings of this study suggest that the early areas of the visual cortex are more inclined to acquire visual representations in an unsupervised manner. In addition, the STDP-based SCNN offers several advantages for neural encoding. First, it is biologically plausible, as STDP is a bioinspired learning rule. Second, it can handle both labeled and unlabeled data. Finally, it is well suited to small-sample datasets, such as those obtained with fMRI.

The realization of neural decoding tasks serves as the foundation for numerous brain-reading applications, such as BCI37. Two types of decoding models exist: those derived from encoding models and those constructed directly in an end-to-end manner. The former offers voxel-level functional descriptions while completing decoding tasks5. However, recent breakthroughs in decoding have primarily been achieved using the latter models33,38,39. In this study, we successfully completed downstream decoding tasks, including image reconstruction and identification, based on the encoding model. The results demonstrate that our approach outperformed other end-to-end models in both decoding tasks. This finding further confirms the effectiveness of our encoding model and suggests that encoding-based approaches hold significant potential for solving decoding tasks.

Despite the progress made in neural encoding with the SCNN, several limitations remain. First, the architectures of SNNs are typically shallower than those of deep-learning networks, which restricts their ability to extract complex, hierarchical visual features; recent studies have begun to address this issue23,24,40. Incorporating a deeper SCNN into our model could further improve encoding performance and enable investigation of the hierarchical structure of the visual cortex. Second, the Integrate-and-Fire neuron used in our study is a simplification of biological neurons; more realistic neuron models, such as leaky Integrate-and-Fire and Hodgkin–Huxley neurons41, could further enhance the biological plausibility of our encoding model. Third, the parameters of STDP and the network architecture were adopted from previous works23,24, and the impact of different parameter settings on encoding performance requires further exploration.

In conclusion, this work presents a powerful tool for neural encoding. On the one hand, we combined the structure of CNNs with the computational rules of SNNs to model the visual system and constructed voxel-wise encoding models based on the receptive field mechanism. On the other hand, we demonstrated that our model can be used for practical decoding tasks, such as image reconstruction and identification. We anticipate that SCNN-based encoding models will provide valuable insights into visual mechanisms and contribute to solving BCI and computer vision tasks. Furthermore, we plan to extend SNNs to encoding tasks for other cognitive functions (e.g., imagination and memory) in the future.

Methods

SCNN-based encoding model

An SCNN-based encoding model was proposed in this study to predict the fMRI activity elicited by input visual stimuli. The encoding model comprised an SCNN feature extractor and voxel-wise regression models. First, the unsupervised SCNN was used to extract stimulus features for each input image. Then, linear regression models were constructed to map the SCNN features to fMRI responses. The architecture of the encoding model is depicted in Fig. 1a.

SCNN feature extractor

To extract stimulus features, a simple two-layer SCNN was employed in this study. The first layer, the Difference of Gaussians (DoG) layer, was designed to emulate neural processing in retinal ganglion cells42,43; its parameter settings were based on previous research23,24. For handwritten characters and natural images, each input image was convolved with six DoG filters with zero padding: ON- and OFF-center DoG filters with sizes of \(3\times 3\), \(7\times 7\), and \(13\times 13\) and standard deviations of \((3/9,\,6/9)\), \((7/9,\,14/9)\), and \((13/9,\,26/9)\), respectively, with a padding size of 6. For handwritten digits, each input image was convolved with two DoG filters with zero padding: ON- and OFF-center DoG filters with a size of \(7\times 7\) and standard deviations of \((1,\,2)\), with a padding size of 3. The DoG features were then transformed into spike waves using intensity-to-latency encoding44 with 30 time steps; specifically, DoG feature values greater than 50 were sorted in descending order and equally distributed into 30 bins to generate the spike waves. Before being passed to the next layer, the output spikes underwent max pooling with a window size of \(2\times 2\) and a stride of 2.
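For concreteness, the following sketch illustrates DoG filtering and intensity-to-latency coding in plain PyTorch, using the handwritten-digit settings above (two \(7\times 7\) ON/OFF kernels with standard deviations (1, 2), padding 3, 30 time steps, threshold 50). It is not the SpykeTorch code used in the paper; the zero-mean kernel normalization, the 8-bit intensity scaling of the placeholder image, and the cumulative spike representation are assumptions made here for illustration.

```python
import torch
import torch.nn.functional as F

def dog_kernel(size, sigma1, sigma2):
    """Difference-of-Gaussians kernel (ON-center when sigma1 < sigma2)."""
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    xx, yy = torch.meshgrid(ax, ax, indexing="ij")
    d2 = xx ** 2 + yy ** 2
    g = lambda s: torch.exp(-d2 / (2 * s ** 2)) / (2 * torch.pi * s ** 2)
    k = g(sigma1) - g(sigma2)
    return k - k.mean()  # zero mean, so uniform regions give no response (an assumption)

def intensity_to_latency(feat, time_steps=30, threshold=50.0):
    """Rank-order coding: stronger DoG responses fire in earlier time bins.
    Returns a cumulative spike tensor of shape (time_steps, *feat.shape)."""
    flat = feat.flatten()
    active = (flat > threshold).nonzero(as_tuple=True)[0]
    spikes = torch.zeros(time_steps, *feat.shape)
    if active.numel() > 0:
        order = active[torch.argsort(flat[active], descending=True)]
        bins = torch.linspace(0, time_steps - 1e-3, order.numel()).long()
        for t in range(time_steps):
            spikes[t].view(-1)[order[bins <= t]] = 1.0  # a spike persists once emitted
    return spikes

# Example on one 28x28 digit-like image with 8-bit intensities (placeholder values)
img = torch.rand(1, 1, 28, 28) * 255
on = dog_kernel(7, 1.0, 2.0)
kernels = torch.stack([on, -on]).unsqueeze(1)                # ON- and OFF-center filters
dog_maps = F.conv2d(img, kernels, padding=3)                 # (1, 2, 28, 28)
spike_waves = intensity_to_latency(dog_maps[0])              # (30, 2, 28, 28)
pooled = F.max_pool2d(spike_waves, kernel_size=2, stride=2)  # (30, 2, 14, 14)
```

With the random placeholder image, few units may exceed the threshold of 50; on real images the threshold follows the paper's setting.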

The second layer of the SCNN is the convolutional layer, which was designed to emulate the information-integration mechanism of the visual cortex. In this layer, 64 convolutional kernels of Integrate-and-Fire (IF) neurons were used to process the input spikes. The window size of the convolutional kernels was \(5\times 5\), and the padding size was 2. Each IF neuron gathered input spikes from its receptive field and emitted a spike when its voltage reached the threshold. This can be expressed mathematically as follows:

$$v_{i}\left(t\right)=v_{i}\left(t-1\right)+\mathop{\sum}\limits_{j}w_{ij}\times s_{j}\left(t-1\right),$$
(1)
$$s_{i}\left(t\right)=1\ \text{and}\ v_{i}\left(t\right)=0,\quad \text{if}\ v_{i}\left(t\right)\ge v_{th},$$
(2)

where \(v_{i}(t)\) represents the voltage of the \(i\)-th IF neuron at time step \(t\), and \(w_{ij}\) denotes the synaptic weight between the \(i\)-th neuron and the \(j\)-th input spike within the neuron’s receptive field. The firing threshold, denoted by \(v_{th}\), was set to 10. For each image, each neuron is permitted to fire at most once. An inhibition mechanism is employed in the convolutional layer, allowing only the neuron with the earliest spike time to fire at each position in the feature maps. Synaptic weights are updated through spike-timing-dependent plasticity (STDP), which can be expressed as:

$$\Delta w_{ij}=\begin{cases}a^{+}\times w_{ij}\times \left(1-w_{ij}\right), & \text{if}\ t_{j}-t_{i}\le 0,\\ a^{-}\times w_{ij}\times \left(1-w_{ij}\right), & \text{if}\ t_{j}-t_{i}>0,\end{cases}$$
(3)

where \(\Delta w_{ij}\) denotes the weight modification, \(a^{+}\) and \(a^{-}\) represent the learning rates (set to 0.004 and −0.003, respectively)23, and \(t_{i}\) and \(t_{j}\) indicate the spike times of the \(i\)-th neuron and the \(j\)-th input spike, respectively. The learning convergence \(C\), as defined by Kheradpisheh et al.23, is calculated using the following equation:

$$C=\mathop{\sum}\limits_{i}\mathop{\sum}\limits_{j}w_{ij}\times \left(1-w_{ij}\right)/N,$$
(4)

where \(N\) represents the total number of synaptic weights. Training of the convolutional layer stops when \(C\) falls below 0.01. The SCNN was implemented on the SpykeTorch platform45. After training, the firing threshold \(v_{th}\) is set to infinity, and the voltage of each neuron at the final time step is taken as the SCNN feature of the visual stimulus. Because the voltages of the convolutional neurons accumulate over time and are never reset when \(v_{th}\) is infinite, the final voltage values (the SCNN features) reflect the SCNN’s activation in response to the visual stimuli.
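To make Eqs. (1)–(4) concrete, a minimal PyTorch sketch of the IF dynamics, the STDP update, and the convergence measure is given below. The winner-take-all competition, lateral inhibition, and other SpykeTorch-specific details are omitted, and all tensor shapes and function names are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def if_first_spikes(spike_wave, weights, padding=2, v_th=10.0):
    """Eqs. (1)-(2): integrate cumulative input spikes through the kernels and
    find, per output neuron, the first time step its potential reaches threshold.
    spike_wave: (T, C_in, H, W); weights: (C_out, C_in, 5, 5)."""
    T = spike_wave.shape[0]
    potentials = F.conv2d(spike_wave, weights, padding=padding)  # (T, C_out, H, W)
    fired = potentials >= v_th
    t_idx = torch.arange(T).view(T, 1, 1, 1)
    first_spike = torch.where(fired, t_idx, T).amin(dim=0)       # T means "never fired"
    return first_spike, potentials[-1]                           # voltages at last time step

def stdp_update(w, pre_t, post_t, a_plus=0.004, a_minus=-0.003):
    """Eq. (3) for one winning kernel: potentiate synapses whose input spike time
    is no later than the output spike time, depress the rest (soft-bound update)."""
    a = torch.where(pre_t <= post_t, torch.tensor(a_plus), torch.tensor(a_minus))
    return torch.clamp(w + a * w * (1 - w), 0.0, 1.0)

def convergence(weights):
    """Eq. (4): mean of w * (1 - w); training stops when this falls below 0.01."""
    return (weights * (1 - weights)).mean()
```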

Response prediction algorithm

With the obtained SCNN feature \(F\in {\mathscr{R}}^{64\times h\times w}\), a linear regression model is constructed for each voxel to predict its fMRI response \(Y\). To avoid overfitting, the receptive field mechanism is introduced into the regression models, such that each voxel receives input only from a specific location of the SCNN feature map. To identify the optimal receptive field location for each voxel (different voxels may share the same preferred location), every location of the SCNN feature maps is examined by fitting the regression model with threefold cross-validation on the training data. The regression model and its objective function are defined as:

$$y_{v}=w\times f_{ij}+\epsilon,$$
(5)
$$\mathop{\min}\limits_{w}\ \left\|w\times f_{ij}-y_{v}\right\|_{2}^{2},$$
(6)

where \(y_{v}\) represents the fMRI response of voxel \(v\), \(w\) denotes the weight parameters of the regression model, and \(f_{ij}\in {\mathscr{R}}^{64\times 1}\) \((i=1,2,\ldots ,h;\ j=1,2,\ldots ,w)\) denotes the feature vector at location \((i,j)\) of the SCNN feature maps. Regression accuracy is quantified using the coefficient of determination (\(R^{2}\)) between the predicted and observed responses, and the feature location with the highest \(R^{2}\) is chosen as the receptive field location of each voxel. Finally, the regression model for each voxel is retrained on the entire training set using the determined receptive field location.
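One possible realization of this voxel-wise receptive-field search, using scikit-learn, is sketched below; the data layout and function name are assumptions for illustration, not the authors' code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def fit_voxel_model(F_train, y_train):
    """Receptive-field search for one voxel (sketch of Eqs. (5)-(6)).
    F_train: (n_samples, 64, h, w) SCNN features; y_train: (n_samples,) responses.
    Returns the best (i, j) location and a model refit on all training data."""
    n, c, h, w = F_train.shape
    best_r2, best_loc = -np.inf, None
    for i in range(h):
        for j in range(w):
            X = F_train[:, :, i, j]                      # 64-dim feature vector per sample
            r2 = cross_val_score(LinearRegression(), X, y_train,
                                 cv=3, scoring="r2").mean()   # threefold CV, R^2
            if r2 > best_r2:
                best_r2, best_loc = r2, (i, j)
    i, j = best_loc
    model = LinearRegression().fit(F_train[:, :, i, j], y_train)  # refit on all data
    return best_loc, model
```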

Downstream decoding tasks

Two downstream decoding tasks were performed based on the encoding models, namely image reconstruction and image identification. The objective of the image reconstruction task is to reconstruct the perceived image from the observed fMRI response, while the image identification task aims to determine which image was viewed. The specific methodologies for these tasks are described below.

Image reconstruction

As depicted in Fig. 1b, the image reconstruction task was performed using a large prior image set. First, the encoding model was used to generate the predicted fMRI responses for all images in the prior image set. Then, the likelihood of the observed fMRI response r given a prior image s was estimated, modeled as a multivariate Gaussian distribution:

$$p\left(r\mid s\right)\propto \exp\left\{-\left(r-\hat{r}\left(s\right)\right)\Sigma^{-1}\left(r-\hat{r}\left(s\right)\right)^{\prime}\right\}$$
(7)
$$\Sigma=\mathrm{cov}\left(r-\hat{r}\left(s\right)\right)$$
(8)

where \(\hat{r}(s)\) represents the predicted fMRI response to \(s\), and \(\Sigma\) denotes the noise covariance matrix estimated from the training samples. Finally, the prior images with the highest likelihood of evoking the observed fMRI response were averaged to obtain the reconstruction result.
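As an illustration, one way to implement this likelihood ranking in NumPy is sketched below; the number of averaged priors (`top_k`) and all variable names are hypothetical, since the text does not specify these implementation details.

```python
import numpy as np

def reconstruct(r_obs, prior_images, r_pred_prior, Sigma, top_k=10):
    """Rank prior images by the Gaussian likelihood of the observed response
    (Eqs. (7)-(8)) and average the most likely candidates.
    r_obs: (n_voxels,); prior_images: (n_prior, H, W);
    r_pred_prior: (n_prior, n_voxels) encoding-model predictions for the priors;
    Sigma: (n_voxels, n_voxels) covariance of the training residuals."""
    Sigma_inv = np.linalg.pinv(Sigma)
    resid = r_obs[None, :] - r_pred_prior                  # r - r_hat(s) for each prior
    log_lik = -np.einsum("nv,vu,nu->n", resid, Sigma_inv, resid)
    best = np.argsort(log_lik)[::-1][:top_k]               # most likely prior images
    return prior_images[best].mean(axis=0)

# Sigma can be estimated from training residuals (Eq. (8)), e.g.:
# Sigma = np.cov(R_train - R_train_pred, rowvar=False)
```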

Image identification

Figure 1c illustrates the methodology employed for the image identification task. The test set images were fed into the encoding model to generate the predicted fMRI responses. Subsequently, the Pearson’s correlation coefficients (PCCs) between the predicted fMRI responses and the observed fMRI response were computed. The image that exhibited the highest correlation between its predicted fMRI response and the observed response was deemed to be the image viewed by the subject.
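For clarity, the identification step reduces to an argmax over Pearson correlations, as in the following sketch; array shapes and the function name are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def identify(r_obs, r_pred_candidates):
    """Return the index of the candidate image whose predicted response has the
    highest Pearson correlation with the observed response.
    r_obs: (n_voxels,); r_pred_candidates: (n_images, n_voxels)."""
    r_obs_c = r_obs - r_obs.mean()
    preds_c = r_pred_candidates - r_pred_candidates.mean(axis=1, keepdims=True)
    pcc = (preds_c @ r_obs_c) / (np.linalg.norm(preds_c, axis=1) * np.linalg.norm(r_obs_c))
    return int(np.argmax(pcc))
```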

fMRI datasets

To validate the encoding model, four publicly available datasets that have been extensively used in prior research1,25,26,27,33,38,46 were adopted, namely the handwritten character, handwritten digit, grayscale natural image, and colorful natural image datasets. The fundamental characteristics of these datasets are presented in Table 1, and a brief overview of each dataset is provided below.

Handwritten character dataset

This dataset comprises fMRI data obtained from three participants as they viewed handwritten character images. A total of 360 images depicting 6 characters (B, R, A, I, N, and S) with a size of \(56\times 56\), sourced from the TICH character dataset47, were presented to each participant. A white square was added to each image as a fixation point. During the experiment, each image was displayed for 1 s (flashed at 2.5 Hz), followed by a 3-s black background, and 3 T fMRI data were simultaneously collected (TR = 1.74 s, voxel size = \(2\times 2\times 2\,{\rm{mm}}^{3}\)). The voxel-level fMRI responses of visual areas V1 and V2 to each visual stimulus were estimated using general linear models48. The same train/test split as in the original work25 was adopted, comprising 270 and 90 class-balanced examples, respectively.

Handwritten digit dataset

This dataset comprises fMRI data obtained from one participant while viewing handwritten digit images26. During the experiment, 100 handwritten images of the digits 6 and 9, with a size of \(28\times 28\), were presented to the participant; each image was displayed for 12.5 s and flashed at 6 Hz. The fMRI responses of V1, V2, and V3 were acquired using a Siemens 3 T MRI system (TR = 2.5 s, voxel size = \(2\times 2\times 2\,{\rm{mm}}^{3}\)). The train and test sets comprised 90 and 10 examples, respectively. Additionally, this dataset provides 2000 prior handwritten images of 6 and 9 that were not used in the fMRI experiment and were employed for the image reconstruction task.

Grayscale natural image dataset

This dataset comprises fMRI data obtained from two participants as they viewed grayscale natural images1. The experiment was divided into train and test stages. During the training stage, the participants were presented with 1750 images, each displayed for 1 s (flashed at 2 Hz) and followed by a 3-s gray background. In the test stage, the participants were shown 120 images distinct from those used in the training stage. The fMRI data were acquired simultaneously in both stages of the experiment using a 3 T scanner (TR = 1 s, voxel size = \(2\times 2\times 2.5\,{\rm{mm}}^{3}\)). The voxel-level fMRI responses of visual areas V1–V3 were estimated for each visual stimulus. To reduce computational complexity, the natural images were downsampled from \(500\times 500\) to \(128\times 128\) pixels.

Colorful natural image dataset

This dataset comprises fMRI data obtained from five participants as they viewed colorful natural images27. The experiment consisted of two sessions, namely the training image session and the test image session. During the training image session, each participant was presented with 1200 images from 150 categories, with each image being displayed only once (flashed at 2 Hz for 9 s). In the test image session, each participant was shown 50 images from 50 categories, with each image being presented 35 times. The fMRI responses of multiple visual areas on the ventral visual pathway were collected using a 3 T Siemens scanner (TR = 3 s, voxel size = \(3\times 3\times 3\,{{{{{{\rm{mm}}}}}}}^{3}\)), and V1, V2, and V3 were selected as regions of interest for this study. Prior to being fed into the SCNN, the natural images were converted from RGB format to grayscale format and downsampled from \(500\times 500\) to \(128\times 128\) pixels.

Noise ceiling estimation

The encoding accuracies of the colorful natural image dataset were compared with noise ceilings, which represent the upper limit of the accuracies in the presence of noise. To calculate the noise ceiling for each voxel, we employed a method that has been commonly used in previous studies13,49,50,51. This method assumes that the noise follows a Gaussian distribution with a mean of zero and that the observed fMRI signal is equal to the response plus noise. Initially, we estimated the standard deviation of the noise \({\hat{\sigma }}_{N}\) using the following formula:

$$\hat{\sigma}_{N}=\sqrt{\mathrm{mean}\left(\sigma_{R}^{2}\right)},$$
(9)

where \(\sigma_{R}^{2}\) represents the variance of the responses across the 35 repeated presentations of each test image. Subsequently, we calculated the variance of the response by subtracting the variance of the noise from the variance of the mean response:

$$\hat{\sigma}_{R}^{2}=\mathrm{var}\left(\mu_{R}\right)-\hat{\sigma}_{N}^{2},$$
(10)

where \(\mu_{R}\) represents the mean response across the repeated presentations of each test image. Finally, we drew samples from the response and noise distributions and summed the simulated response and noise to generate the simulated signal. We conducted 1000 simulations, calculated the PCC between the simulated signal and the simulated response in each simulation, and took the mean PCC value as the noise ceiling.
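A minimal Monte-Carlo sketch of this noise-ceiling estimate for a single voxel is given below, assuming a response matrix of 50 test images × 35 repeats; clamping the response variance at zero is a practical safeguard added here, not part of Eqs. (9) and (10), and the function name is illustrative.

```python
import numpy as np

def noise_ceiling(test_responses, n_sim=1000, seed=0):
    """test_responses: (n_images, n_repeats) responses of one voxel,
    e.g. 50 test images x 35 repeated presentations."""
    rng = np.random.default_rng(seed)
    n_images = test_responses.shape[0]
    sigma_n = np.sqrt(test_responses.var(axis=1, ddof=1).mean())            # Eq. (9)
    var_r = max(test_responses.mean(axis=1).var(ddof=1) - sigma_n ** 2, 0)  # Eq. (10)
    pccs = []
    for _ in range(n_sim):
        resp = rng.normal(0.0, np.sqrt(var_r), n_images)    # simulated responses
        signal = resp + rng.normal(0.0, sigma_n, n_images)  # simulated response + noise
        pccs.append(np.corrcoef(signal, resp)[0, 1])
    return float(np.mean(pccs))                             # mean PCC = noise ceiling
```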

Statistics and reproducibility

In Fig. 2, we performed one-tailed two-sample t-tests to compare the encoding accuracies of different methods on each dataset; the sample sizes are described in the figure captions. In the reproducibility analysis, we conducted a two-tailed two-sample t-test to assess whether the encoding accuracies (n = 500) of SCNNs with different initial values exhibited statistically significant differences; the corresponding p-values are reported in the “Results” section.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.