Mapping gene expression in the brain

The mammalian brain is a complex system consisting of billions of neuronal and glial cells that can be categorized into hundreds of different subtypes. Understanding how these cells are organized, throughout development, into functional circuits that carry out sophisticated cognitive tasks can help us better characterize disease-associated changes. Advances in technology and automation of laboratory procedures have facilitated high-throughput characterization of functional neuronal circuits and connections at different scales (Pollock et al. 2014). For example, the Human Connectome Project maps the complete wiring of the brain using magnetic resonance imaging (Van Essen and Ugurbil 2012). Despite the importance of these imaging modalities in characterizing brain pathologies and development, it is imperative to analyze the molecular structure to gain a better mechanistic understanding of how the brain works. However, studying the molecular mechanisms of the brain has proved very challenging due to the large and still unknown number of cell types (Sunkin 2006).

The complexity of the brain is largely reflected in the underlying patterns of gene expression that define neuronal identities, neuroanatomy, and patterns of connectivity. With 80% of the 20,000 genes in the mammalian genome expressed in the brain (Lein et al. 2007), characterizing spatial and temporal gene expression patterns can provide valuable insights into the relationship between genes and brain function and their role throughout neurodevelopment. Brain transcriptome atlases have proven to be instrumental for this task.

Following earlier progress in other model organisms (Kim et al. 2001; Spencer et al. 2011; Milyaev et al. 2012), several projects have assessed gene expression in the mouse brain with various degrees of coverage for genes, anatomical regions, and developmental time-points (Sunkin 2006; Pollock et al. 2014). In rodents, the Gene Expression Nervous System Atlas (GENSAT) (Gong et al. 2003; Heintz 2004) and GenePaint (Visel et al. 2004) mapped gene expression in both the adult and developing mouse brain, while the EurExpress (Diez-Roux et al. 2011) and the e-Mouse Atlas of Gene Expression (EMAGE) (Richardson et al. 2014) focused on the developing mouse brain. Comparable atlases of gene expression in the human brain are far less abundant due to the challenges posed by the difference in size between the human and mouse brain as well as the scarcity of post-mortem tissue. However, several studies have profiled the human brain transcriptome to analyze expression variation across the brain (Lonsdale 2013), expression developmental dynamics (Oldham et al. 2008; Colantuoni et al. 2011; Kang et al. 2011), and differential expression in the autistic brain (Voineagu et al. 2011), albeit in a limited number of coarse brain regions.

The Allen Institute for Brain Science provides the most comprehensive maps of gene expression in the mouse and human brain in terms of the number of genes, the spatial resolution, and the developmental stages covered (Pollock et al. 2014). Several atlases have been released which map gene expression in the adult and developing mouse brain (Lein et al. 2007; Thompson et al. 2014), the adult and developing human brain (Hawrylycz et al. 2012; Miller et al. 2014a), and the adult and developing non-human primate (NHP) brain (Bernard et al. 2012; Bakken et al. 2016); see Fig. 1. Sunkin et al. (2013) provide a complete review of the Allen Brain Atlas resources.

Fig. 1

Spatially mapped gene expression in the mammalian brain. To map gene expression across the human and mouse brains, the Allen Institute for Brain Science followed two different strategies. In the human brain, samples covering all brain regions are extracted (a) and gene expression is measured using either microarray or RNA-sequencing (Hawrylycz et al. 2012; Miller et al. 2014b) (b). Accompanying histology sections and MRI scans are acquired to localize samples. Manual delineation of anatomical regions on the histology sections allows for accurate sample annotation (c). In the mouse brain, gene expression is measured in coronal and sagittal sections using in situ hybridization (Lein et al. 2007) (d). Several slices covering the mouse brain are extracted per gene. Image registration methods are used to align the set of sections acquired for each gene to a common reference atlas (e). Anatomical regions are delineated on the reference atlas, allowing for sample annotation (f). Data from the mouse and human atlases can be represented in a data matrix of three dimensions representing genes, brain regions, and developmental stages (in the case of the developmental atlases) (g)

The availability of genome-wide spatially mapped gene expression data provides a great opportunity to understand the complexity of the mammalian brain. It provides the necessary data to decode the molecular functions of different cell populations and brain nuclei. However, the diversity of cell types and their molecular signatures and the effect of mutations on the brain remain poorly understood. For example, de novo loss-of-function mutations in autistic children have been shown to converge on three distinct pathways: synaptic function, Wnt signaling, and chromatin remodeling (Krumm et al. 2014; De Rubeis et al. 2014). Except for the synaptic role of autism-related genes, it is not clear how alterations in basic cell functions, such as Wnt signaling and chromatin remodeling, can result in the complex phenotype of autism spectrum disorders (ASD). A recent effort to map somatic mutations in cortical neurons using single-cell sequencing has shown that neurons have on average ~1500 transcription-associated mutations (Lodato et al. 2015). The significant association of these single-neuron mutations and genes with cortical expression indicates the vulnerability of genes active in human neurons to somatic mutations, even in normal individuals. The difference between these patterns in the normal and diseased brain remains unclear. Efforts to understand genotype-phenotype relationships in the brain face several challenges, including the complexity of the underlying molecular mechanisms and the poor definition of clinically based neurological disorders. In addition, the high dimensionality of the data makes most studies underpowered to detect any associations. This is especially true in the case of testing genetic associations with phenotype markers, such as imaging measurements (Medland et al. 2014). A combination of efforts to map the genomic landscape of the brain and data-driven approaches can add to our understanding of the underlying genetic etiology of neurological processes and how they are altered in neurological disorders.

Several review articles provide extensive insights into the gene expression maps of the brain. French and Pavlidis (2007) provide a global overview of neuroinformatics, including ontology, semantics, databases, connectivity, electrophysiology, and computational neuroscience. Jones et al. (2009) give an overview of developing the mouse atlas, the challenges faced, the community reaction, limitations, and atlas usage examples, as well as the data mining tools provided by the Allen Institute. Pollock et al. (2014) provide a detailed review of the technology and tools which are currently advancing the field of molecular neuroanatomy. Recently, Parikshak et al. (2015) illustrated the power of using network approaches to leverage our understanding of the genetic etiology of neurological disorders. Yet, a global overview of the computational methodologies applied to brain transcriptome atlases to increase our understanding of neurological processes and disorders remains missing.

In this review, we provide an overview of the computational approaches used to expand our understanding of the relationship between gene expression on the one hand and the anatomical and functional organization of the mammalian brain on the other. We focus our discussion on spatial and temporal brain transcriptomes mapped by the Allen Institute for Brain Science. Nevertheless, we also discuss how the methods can be extended to epigenomes and proteomes of the brain and other human tissues. We describe the different computational approaches taken to analyze the high-dimensional data and how they have contributed to our understanding of the functional role of genes in the brain, molecular neuroanatomy, and genetic etiology of neurological disorders. Finally, we discuss how these methods can help solve some of the data-specific challenges, and how the integration of several data types can further our understanding of the brain at different scales, ranging from molecular to behavioral.

Computational analysis of spatial and temporal gene expression data in the brain

Spatio-temporal transcriptomes of the brain pose several challenges due to their high-dimensionality. In this section, we identify the different types of approaches taken to analyze the spatially mapped gene expression data. We show the strengths of each approach and demonstrate how it has enriched neuroscience research. We divide the different methods into two categories. First, we describe a class of methods used to analyze the expression profile of gene(s) across different brain regions, cell types, and developmental stages. Second, we discuss methods focusing on the molecular organization and the genetic signature of the brain.

Analyzing the expression patterns of genes in the brain

Mapping gene expression across the brain is very helpful in determining the neural function of a gene of interest by associating it with a specific brain region and/or developmental stage, or in identifying genetic markers of those brain regions and developmental stages. Brain transcriptome atlases, such as the Allen Brain Atlases, provide useful information about the expression of a gene under “normal” conditions. Such information can be used to direct in-depth studies about a specific gene in biologically/clinically relevant cohorts. With the increasing number of genes implicated in neurological diseases as well as the realization that complex phenotypes of the brain likely result from the combined activity of several genes, a number of studies analyze gene sets rather than individual candidate genes. By studying the expression of a gene set rather than a single gene, neuroscientists face the challenge of how to summarize these data to understand the relationship between genes and neuronal phenotypes.

Gene expression visualization

High-throughput data visualization approaches can facilitate the exploration of complex patterns in multivariate high-dimensional gene expression data sets (Pavlopoulos et al. 2015). For example, heatmaps are commonly used to visualize gene expression levels across a set of samples using a two-dimensional false-color image (Fig. 2d). However, heatmaps are not ideal for representing brain transcriptomes, because they fail to capture the multivariate nature of the data (genes, samples, and time-points) and to represent the inherent spatial and temporal relationships between different brain regions and developmental stages, respectively. To acquire high-resolution gene expression maps, the Allen atlases of the developing and adult mouse brain rely on in situ hybridization (ISH) images (Fig. 2a). The Brain Explorer 3D viewer (Lau et al. 2008) is an interactive desktop application that allows the visualization of the 3D expression of one or more genes with the possibility to link them back to the high-resolution ISH images (Sunkin et al. 2013) (Fig. 2b). ISH images can be synchronized between different genes and also with the anatomical atlas of the mouse brain (Fig. 2c), facilitating the analysis of a group of genes. For the adult and developing human atlases, the gene expression data (microarray or RNA-seq) are mainly visualized using heatmaps (Fig. 2d). In the adult human atlas, the expression data can also be visualized on top of the magnetic resonance images (Fig. 2e). The Brain Explorer 3D viewer is also used to visualize gene expression from cortical samples using an inflated cortical surface, a surface-based representation of the cortex that allows better representation of the relative locations of laminar, columnar, and areal features (Fig. 2f). In addition, gene expression can be mapped to an anatomical representation of the brain to facilitate interpretation (Fig. 2g). Ng et al. (2010) developed a method to construct surface-based flatmaps of the mouse cortex that enables mapping of gene expression data from the Allen Mouse Brain Atlas. Similarly, French (2015) developed a pipeline to map the expression of any gene from the Allen Human Brain Atlas to the cortical atlas built into the FreeSurfer software, which should facilitate integration with medical imaging studies.

Fig. 2

Gene expression visualization. Gene expression of spatially mapped samples can be visualized using several approaches. a Mouse gene expression data of the gene Man1a can be investigated using the original ISH sections. b BrainExplorer software allows visualization of the 3D expression volume with an overlay of the anatomical atlas and the ability to go back to the original high-resolution ISH section. c Simultaneously, viewing the ISH section and the corresponding atlas section helps in localizing gene expression to brain regions. d Heatmaps are commonly used to visualize gene expression. Expression of the two exons of the NEUROD6 gene from the BrainSpan Atlas is visualized using a heatmap in which samples are ordered according to the age of the donor. e Samples from the Allen Human Brain Atlas are associated with coordinates of their location in the corresponding brain MRI. f Using the BrainExplorer, expression values of Mecp2 can be mapped to an inflated white matter surface for better visualization of the cortex. g Alternatively, expression values can be mapped on an anatomical atlas of the human brain

Summary statistics and visualization-based methods

The early studies employing the Allen Brain Atlases used a variety of visualization and qualitative measurements to analyze the expression of gene sets associated with dopamine neurotransmission (Björklund and Dunnett 2007), consummatory behavior in the mouse brain (Olszewski et al. 2008), midbrain dopaminergic neurons (Alavian and Simon 2009), and changes in locomotor activity in the mouse brain (Mignogna and Viggiano 2010). Kondapalli et al. (2014) used a similar qualitative approach to analyze the expression of Na+/H+ exchangers (NHE6 and NHE9), which are linked to several neuropsychiatric disorders, in the adult and developing mouse brain atlases.

To provide better quantitative representations of the expression of gene sets, several studies relied on basic summary statistics, such as the mean and standard deviation. Zaldivar and Krichmar (2013) used summations to summarize the expression of cholinergic, dopaminergic, noradrenergic, and serotonergic receptors in the amygdala, and in neuromodulatory areas. By plotting the average expression of genes harboring de novo loss-of-function mutations identified by means of exome sequencing across human brain development, Ben-David and Shifman (2012a) identified two clusters with antagonistic expression patterns across development. In addition, spatio-temporal exonic expression in the BrainSpan atlas correlates inversely with the burden of deleterious de novo mutations identified by exome sequencing in autism, schizophrenia, or intellectual disability (Uddin et al. 2014). For genes mutated in autism, the inverse relationship was found to be strongest in prenatal orbital frontal cortex, highlighting the value of the BrainSpan atlas to associate genetic variation with specific brain regions and developmental stages. Dahlin et al. (2009) developed a custom score (expression factor) of gene expression in the mouse brain based on the ISH images of the Allen Mouse Brain Atlas. They computed the mean and the standard deviation of the expression factor to assess the global expression and heterogeneity of solute carrier genes, respectively. To deal with the qualitative ISH-based expression data from the Allen Mouse Brain Atlas, Roth et al. (2013) used a non-parametric representation of the data (using ranks instead of the raw expression values) to study the relationship between genes associated with grooming behavior in mice and 12 major brain structures.
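The summary statistics described above can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the gene names, region names, and expression values are hypothetical toy data, and the function names are illustrative rather than taken from any of the cited studies. It computes the mean and standard deviation of a gene set within each region, and converts one gene's regional profile to ranks, in the spirit of a non-parametric representation.

```python
from statistics import mean, stdev

# Hypothetical toy data: expression[gene][region]; all values are made up.
expression = {
    "geneA": {"cortex": 2.0, "striatum": 8.0, "cerebellum": 5.0},
    "geneB": {"cortex": 4.0, "striatum": 6.0, "cerebellum": 1.0},
    "geneC": {"cortex": 3.0, "striatum": 7.0, "cerebellum": 3.0},
}


def summarize_gene_set(expr, genes):
    """Return {region: (mean, sample standard deviation)} over a gene set."""
    regions = list(expr[genes[0]])
    summary = {}
    for region in regions:
        values = [expr[gene][region] for gene in genes]
        summary[region] = (mean(values), stdev(values))
    return summary


def rank_across_regions(profile):
    """Replace raw values by ranks (1 = lowest); ties share the average rank."""
    ordered = sorted(profile, key=profile.get)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over all regions tied with the region at position i.
        while j + 1 < len(ordered) and profile[ordered[j + 1]] == profile[ordered[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[ordered[k]] = average_rank
        i = j + 1
    return ranks


summary = summarize_gene_set(expression, ["geneA", "geneB", "geneC"])
ranks_a = rank_across_regions(expression["geneA"])
ranks_c = rank_across_regions(expression["geneC"])
```

The mean captures the overall expression level of the set in a region, while the standard deviation indicates how heterogeneous the set is there; ranks discard the magnitude of the raw values, which may be preferable when the underlying measurements are only semi-quantitative.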

Most of the studies analyzing gene expression in the brain focused on scores describing the expression of a gene or a gene set within each brain region of interest. Liu et al. (2014) proposed a characterization of the stratified expression pattern of sonic hedgehog (Shh), a classical signal molecule required for pattern formation along the dorsal–ventral axis, and its receptor Ptch1. Using a combination of differential expression, transcription factor motif analysis, and ChIP-seq, they identified the role of Gata3, Fox2, and their downstream targets in pattern formation in the early mouse brain. These results illustrate the power of characterizing complex expression patterns across the brain rather than solely summarizing the expression of each gene within individual brain regions.

Box 1 | Gene Sets

Complex biological functions and disorders usually involve several genes rather than a single gene. Gene sets are groups of genes that share common biological functions and that can be defined either based on prior knowledge (e.g. about biochemical pathways or diseases) or experimental data (e.g. transcription factor targets identified using ChIP-seq). Gene set databases organize existing knowledge about these groups of genes by arranging them in sets that are associated with a functional term, such as a pathway name or a transcription factor that regulates the genes. Gene sets can be classified into five types:

Gene Ontology (GO)

The Gene Ontology project (Ashburner et al. 2000) developed three hierarchically structured vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions. Genes annotated with the same GO term(s) constitute a gene set.

Biological Pathways

Biological pathways are networks of molecular interactions underlying biological processes. Pathway databases, such as Kyoto Encyclopedia of Genes and Genomes (KEGG) (Ogata et al. 1999) and REACTOME (Croft et al. 2014), catalog physical entities (proteins and other macromolecules, small molecules, complexes of these entities and post-translationally modified forms of them), their subcellular locations and the transformations they can undergo (biochemical reaction, association to form a complex and translocation from one cellular compartment to another).

Transcription

Transcription databases include information on regulation of genes by transcription factors (TFs) binding to the DNA, or post-transcriptional regulation by microRNA binding to the mRNA. Determining these physical interactions can be done either in silico using computational inference (motif enrichment analysis) or using experimental data (such as ChIP-seq and microRNA binding data). For the motif enrichment analysis, position weight matrices (PWMs) from databases such as TRANSFAC (Matys et al. 2006) and JASPAR (Portales-Casamar et al. 2010) can be used to scan the promoters of genes in the region around the transcription start site (TSS). ChIP-seq data, such as the large collection of experiments from the Encyclopedia of DNA Elements (ENCODE) project (Bernstein et al. 2012b) and the Roadmap Epigenomics consortium (Consortium 2015a), are used to identify genes targeted by the TFs. Similarly, microRNA targets can be extracted from databases such as TargetScan (Lewis et al. 2003).

Cell-type markers

Cell type-specific transcriptional data provide a very rich source of cell type marker genes. Genes are identified as a cell type marker if they are up-regulated in one cell population compared to other cell populations. Several studies have used microarrays and RNA-seq to profile the transcriptome of a number of neuronal cell types (Cahoy et al. 2008; Zhang et al. 2014). Recently, studies are using single-cell sequencing to precisely capture the transcriptome of individual neuronal cells (Darmanis et al. 2015; Zeisel et al. 2015).

Disease

Genes can be grouped into sets based on their association with the same diseases. Public databases, such as OMIM (2015a) and DisGeNet (Pinero et al. 2015), contain curated information from literature and public sources on gene-disease associations. Disease-related gene sets can also be obtained by identifying genes harboring variants identified using GWAS (Simón-Sánchez and Singleton 2008; Welter et al. 2014), exome-sequencing (2015b), or whole-genome sequencing.

Identifying genes with localized expression patterns

The complexity of the brain implies that genes are involved in more than one function and that their function is region- or cell-type-specific. Neuronal cell types have been classically defined using cell morphology, electrophysiological and connectivity properties. Similarly, classical neuroanatomy identifies regions based on their cyto-, myelo-, or chemo-architecture. Genomic transcriptome measurements provide an alternative route to define functional cell types and brain regions based on their genetic makeup.

Several studies have analyzed the ISH-based gene expression images of the Allen Mouse Brain Atlas to identify cell-type-specific genes and genes with localized gene expression. Loerch et al. (2008) studied the localization of age-related gene expression changes in different neuronal cell types in the mouse and human brains. At the brain region level, David and Eddy (2009) developed ALLENMINER, a tool that searches the Allen Mouse Brain Atlas for genes with a specific expression pattern in a user-defined brain region. At a finer scale, Kirsch et al. (2012) described an approach to identify genes with a localized expression pattern in a specific layer of the mouse cerebellum. They represented each ISH image (gene) using a histogram of local binary patterns (LBP) at multiple scales. Predicting the localization of gene activity to each of the four cerebellar layers is done using two-level classification. First, they used a support vector machine (SVM) classifier to assign a cerebellar layer to each image and then used multiple-instance learning (MIL) to combine the resulting image classification into gene classification. Similarly, to identify cell-type-specific genes, Li et al. (2014) used scale-invariant feature transform (SIFT) features of the ISH images. They further classified genes, using a supervised learning approach (regularized learning), based on their expression in different brain cell types. Zeng et al. (2015) compared two models to extract features from the ISH images of the developing mouse brain atlas to train a classification model to annotate gene expression patterns in brain structures. In one approach, they used SIFT features and the bag-of-words approach to represent the expression of each gene across the entire brain. In the other, they used a transfer learning approach by training a deep convolutional neural network on natural images to extract useful features from the ISH images. Their results show a superior performance for the deep convolutional neural network, indicating the applicability of transfer learning from natural to biological images (Zeng et al. 2015).

Ramsden et al. (2015) studied the molecular components underlying the neural circuits encoding spatial positioning and orientation in the medial entorhinal cortex (MEC). They developed a computational pipeline for automated registration and analysis of ISH images of the Allen Mouse Brain Atlas at laminar resolution. They showed that while very few genes are uniquely expressed in the MEC, differential gene expression defines its borders with neighboring brain structures, and its laminar and dorso-ventral organization. Their analysis identifies ion channel-, cell adhesion- and synapse- related genes as candidates for functional differentiation of MEC layers and for encoding of spatial information at different scales along the dorso-ventral axis of the MEC. Finally, they reveal laminar organization of genes related to disease pathology and suggest that a high metabolic demand predisposes layer II to neurodegenerative pathology.

Spatial and temporal gene co-expression

Genes with similar expression patterns over a set of samples are said to be co-expressed and are more likely to be involved in the same biological processes (guilt by association) (Stuart et al. 2003). Applying the same approach to brain transcriptomes can identify co-expressed genes based on their spatial and/or temporal expression across the brain. This can serve as a powerful tool to characterize genes with respect to their context-specific functions. In addition, co-expression has been used to assess the quality of RNA-seq data, such as the BrainSpan atlas, by modeling the effects of noise within observed co-expression (Ballouz and Gillis 2016a).
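As a minimal sketch of the guilt-by-association idea, the Python example below ranks genes by the Pearson correlation of their expression profiles with that of a seed gene. This is an illustration under assumptions, not the implementation of any tool discussed in this review: the gene names and profiles are hypothetical toy data, each profile is assumed to be a list of expression values over the same ordered set of samples (e.g. voxels or regions), and a profile with zero variance is assigned a correlation of 0 by convention.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention chosen for this sketch
    return cov / sqrt(var_x * var_y)


def rank_coexpressed(profiles, seed):
    """Return (gene, correlation) pairs sorted from most to least correlated with seed."""
    scores = {
        gene: pearson(profiles[seed], profile)
        for gene, profile in profiles.items()
        if gene != seed
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical expression profiles over four samples.
profiles = {
    "seed_gene": [1.0, 2.0, 3.0, 4.0],
    "gene_up": [2.0, 4.0, 6.0, 8.0],
    "gene_down": [4.0, 3.0, 2.0, 1.0],
    "gene_flat": [5.0, 5.0, 5.0, 5.0],
}
ranking = rank_coexpressed(profiles, "seed_gene")
```

In this toy example the gene whose profile rises with the seed gene is ranked first and the gene with the opposite trend is ranked last. Replacing the raw values by ranks before computing the correlation would give a Spearman-type measure instead.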

Box 2 | Dimensionality reduction

The high dimensionality of transcriptomes, and other biological data (e.g. proteomes, epigenomes, etc.), provides a challenge for visualization as well as for selecting informative features for clustering and classification. Dimensionality-reduction approaches aim at finding a smaller number of features that can adequately represent the original high-dimensional data in a lower-dimensional space. The conventional principal component analysis (PCA) is the most commonly used dimensionality reduction method. Despite its utility, PCA can only capture linear rather than non-linear relationships, which are inherent in many biological applications. Several non-linear dimensionality reduction techniques have been proposed (e.g. Isomap (Tenenbaum et al. 2000)); see Lee and Verleysen (2005) for an extensive review. The t-distributed stochastic neighbor embedding (t-SNE) method (Maaten and Hinton 2008) has been widely used to visualize biological data in two dimensions by preserving both the global and local relationships between the data points in the high-dimensional space (Saadatpour et al. 2015).
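To make the linear projection performed by PCA concrete, the following Python sketch estimates the first principal component of a small data matrix by power iteration on the sample covariance matrix and projects the samples onto it. It is a simplified illustration under assumptions: only the first component is computed, the data are hypothetical toy values, and power iteration is used here for brevity (standard implementations typically rely on an eigendecomposition or singular value decomposition). The iteration would not converge to the correct direction if the starting vector happened to be orthogonal to the first component.

```python
from math import sqrt


def first_principal_component(samples, iterations=200):
    """Return (unit direction, projected scores) for the first principal component."""
    n, d = len(samples), len(samples[0])
    # Center each feature (e.g. gene) at zero mean.
    means = [sum(row[j] for row in samples) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in samples]
    # Sample covariance matrix between features.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeatedly apply the covariance matrix and renormalize.
    direction = [1.0 / sqrt(d)] * d
    for _ in range(iterations):
        product = [sum(cov[i][j] * direction[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(value * value for value in product))
        if norm == 0:
            break  # all features are constant; no direction of variance
        direction = [value / norm for value in product]
    scores = [sum(r[j] * direction[j] for j in range(d)) for r in centered]
    return direction, scores


# Hypothetical samples described by two perfectly correlated features.
samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, scores = first_principal_component(samples)
```

Because the two toy features are perfectly correlated, a single component captures all the variance, and the one-dimensional scores preserve the ordering of the samples along that line.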

Several similarity/distance measurements have been used to characterize the similarity in spatial/temporal expression patterns between a pair of genes. Of these, correlation-based measures are the most commonly used to assess gene co-expression patterns across the brain. NeuroBlast is a search tool developed by the Allen Institute for Brain Science to identify genes with a similar 3D spatial expression to that of a gene of interest in a given anatomical region, based on Pearson correlation (Hawrylycz et al. 2011). Figure 3a shows an example of the obtained correlations of estrogen receptor alpha (Esr1) in the mouse hypothalamus. The ISH sections in Fig. 3b show that correlation can effectively be used to identify genes functionally associated with Esr1. For example, the top correlated gene to Esr1 in the hypothalamus is insulin receptor substrate 4 (Irs4), a target gene of Esr1 associated with sex-specific behavior (Xu et al. 2012). NeuroBlast was used to identify genes with a similar expression profile to Wnt3a, a ligand in the Wnt signaling pathway, in the developing mouse brain and identified eight Wnt signaling genes among the top correlated genes (Thompson et al. 2014). Using the Spearman correlation coefficient, French et al. (2011) analyzed gene-pairs with positive and negative co-expression in the mouse brain. By focusing on genes with a strong negative correlation, they showed that variation in gene expression in the adult normal mouse brain can be explained as reflecting regional variation in glia-to-neuron ratios, and is correlated with degree of connectivity and location in the brain along the anterior–posterior axis. Tan et al. (2013) extended the analysis to the adult human brain and identified conserved co-expression patterns between the mouse and the human brain.
To characterize the role of SNCA, a gene harboring a causative mutation for Parkinson’s disease, Liscovitch and French (2014) analyzed the co-expression relationships of SNCA in the adult and developing human brain. They identified a negative spatial co-expression between SNCA and interferon-gamma signaling genes in the normal brain and a positive co-expression in post-mortem samples from Parkinson’s patients, suggesting an immune-modulatory role of SNCA that may provide insight into neurodegeneration. Another example is given by Bernier et al. (

Box 3 | Clustering

Clustering is the unsupervised learning process of identifying distinct groups of objects (clusters) in a dataset (Duda et al. 2000). There are two main types of clustering: hierarchical and partitional. Hierarchical clustering algorithms start by calculating all the pair-wise similarities between samples and then building a dendrogram by iteratively grouping the most similar sample pairs. By cutting the tree at an appropriate height, the samples are grouped into clusters. On the other hand, partitional clustering optimizes the number of simple models to fit the data. Examples of partitional clustering include k-means, Gaussian mixture models (GMMs), density-based clustering, and graph-based methods.

In order to cluster the samples hierarchically, all the pair-wise similarities between samples Si and Sj are calculated. Samples are then grouped iteratively based on the calculated similarities (grouping the most similar first). Once the full dendrogram is built, a cut-off (dashed line) is used to divide the samples into groups. For k-means, we set the number of clusters based on the data heatmap. K-means groups samples by minimizing the within-cluster sum of squared distances between each point in the cluster and the cluster center.
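The k-means procedure described in this box can be sketched as follows. This is a minimal illustration under assumptions rather than a production implementation: the points are hypothetical toy data, and the cluster centers are initialized deterministically from the first k points to keep the example reproducible, whereas practical implementations usually use random or k-means++ initialization with several restarts.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100):
    """Alternate assignment and center-update steps until the labels stop changing."""
    # Simplifying assumption: use the first k points as initial centers.
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: assignments are stable
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(column) / len(members) for column in zip(*members)]
    return labels, centers


# Hypothetical samples forming two well-separated groups.
points = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels, centers = kmeans(points, k=2)
```

Each iteration cannot increase the within-cluster sum of squared distances, which is why the procedure converges, although only to a local optimum that depends on the initialization.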

Fig. 3

Spatial gene co-expression in the mouse brain. a Expression energy profiles of voxels in the hypothalamus region of the mouse brain using the same linear ordering. The estrogen receptor alpha (Esr1) gene shows high expression in the hypothalamus. The expression patterns of Irs4 and Ngb are highly correlated with that of Esr1 (R = 0.79 and R = 0.64, respectively). On the other hand, the expression pattern of Ltb is not correlated with that of Esr1 (R = 8.01 × 10^−4). Correlation is calculated using Pearson correlation. b Esr1 and its highly correlated genes (Irs4 and Ngb) are highly expressed in the hypothalamus (red arrow), while Ltb is not

Gene co-expression can serve as a very powerful tool for in silico prediction and prioritization of disease genes, by identifying genes with similar expression pattern to known disease genes. Piro et al. (2010) described a candidate gene prioritization method using the Allen Mouse Brain Atlas. They showed that the spatial gene-expression patterns can be successfully exploited for the prediction of gene–phenotype associations by applying their method to the case of X-linked mental retardation. By extending their methods to the human brain atlas, they showed that spatially mapped gene expression data from the human brain can be employed to predict candidate genes for Febrile seizures (FEB) and genetic epilepsy with febrile seizures plus (GEFS+) (Piro et al. 2011). Both examples illustrate the power of using computational approaches to prioritize disease genes before carrying out empirical analysis in the lab.

In measuring gene co-expression, correlation-based methods are not specific to spatially mapped expression data and do not fully model the complexity of the brain transcriptomes. To identify gene-pairs with similar expression patterns in the adult mouse brain based on the ISH images, Liu et al. (2007) compared three image similarity metrics: a naïve pixel-wise metric, an adjusted pixel-wise metric, and a histogram-row-column (HRC) metric. They showed that HRC performs better than voxel-based methods, indicating the superiority of methods that capture the local structure in spatially mapped data. Miazaki and Costa (2012) used Voronoi diagrams to measure the similarity of the density distribution between gene expressions in the adult mouse brain. Inspired by computer vision algorithms, Liscovitch et al. (2013) used the similarity of scale-invariant feature transform (SIFT) descriptors of the ISH images of the mouse brain to predict the gene ontology (GO) labels of genes.

Box 4 | Classification

Classification is a supervised learning process of labeling unseen objects (test set) given a set of labeled objects (training set) (Duda et al. 2000). Classification approaches can be divided into Bayesian methods and prediction error minimization methods. The former group is based on Bayesian decision theory and uses statistical inference to find the best class for a given object. Bayesian methods can be further divided into parametric classifiers (e.g. nearest-mean classifier and Hidden Markov Model) and non-parametric classifiers (e.g. Parzen window or k-nearest neighbor classifier). Alternatively, classifiers can be designed to minimize a measure of the prediction error. Well-known classifiers in this category include regression classifiers (e.g. Lasso regression), support vector machines, decision trees and artificial neural networks. Neural networks, in particular deep learning, have become very successful in solving problems in a wide range of applications, including bioinformatics (**ong et al. 2014; Alipanahi et al. 2015; Engelhardt and Brown 2015).

A low-dimensional embedding of the samples is generated using two features (genes). A Bayesian classifier assigns each sample to one of the two classes (Diseased or Healthy) based on statistical inference. A prediction error-minimization classifier updates the classification boundary (dashed line) based on the prediction error and terminates when a certain criterion is met.
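As a concrete instance of the parametric classifiers mentioned in this box, the following minimal sketch implements a nearest-mean classifier on two features (genes). Assigning a sample to the closest class centroid corresponds to the Bayes decision rule under the assumption of equal class priors and a shared spherical Gaussian covariance. The training values and class labels are hypothetical.

```python
from math import dist  # available in Python 3.8+

def fit_nearest_mean(samples, labels):
    """Compute one centroid (mean feature vector) per class."""
    grouped = {}
    for features, label in zip(samples, labels):
        grouped.setdefault(label, []).append(features)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }

def predict(centroids, features):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

# Hypothetical expression of two genes in six training samples.
train_x = [(1.0, 1.2), (0.8, 1.0), (1.2, 0.9), (3.0, 3.2), (2.8, 3.1), (3.3, 2.9)]
train_y = ["healthy"] * 3 + ["disease"] * 3

centroids = fit_nearest_mean(train_x, train_y)
predictions = [predict(centroids, s) for s in [(1.1, 1.0), (2.9, 3.0)]]
print(predictions)
```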

Gene co-expression networks

As we have shown, the guilt by association paradigm has been successfully employed to identify pairs of spatially co-expressed genes sharing a neuronal function, based on various similarity measures. To extend the co-expression analysis of gene-pairs, clustering and network-based approaches can be used to identify molecular interaction networks of a group of genes that signal through similar pathways, share common regulatory elements, or are involved in the same biological process. Co-expression networks avoid the problem of relying on prior knowledge, such as protein–protein interactions and pathway information, which are valuable but incomplete. Gene co-expression networks have been used extensively to identify disrupted molecular mechanisms in cancer (Chuang et al. 2007; Yang et al. 2014) and aging (van den Akker et al. 2014).

Hierarchical clustering is a widely used unsupervised approach to identify groups of co-expressed genes across a set of samples. Using hierarchical clustering, Gofflot et al. (2007) identified the functional networks of nuclear receptors based on their global expression across different regions of the mouse brain. By focusing on subsets of brain structures involved in specialized behavioral functions, such as feeding and memory, they elucidated links between nuclear receptors and these specialized brain functions that were initially undetected in a global analysis. Dahlin et al. (2009) used hierarchical clustering to explore potential functional relatedness of the solute carrier genes and anatomic association with brain microstructures.

Another approach to unsupervised clustering is to use gene co-expression relationships to construct a co-expression network where nodes are genes and edges represent the similarity of the expression profiles of those genes. Weighted gene co-expression network analysis (WGCNA) (Zhang and Horvath 2005) is a commonly used method to construct modules of co-regulated genes based on the topological overlap between genes in a weighted co-expression network. WGCNA has been widely used to identify transcription networks in the mammalian brain. Oldham et al. (2006) provided the first demonstration of the utility of WGCNA by examining the conservation of co-expression networks between the human and chimpanzee brains. They found that module conservation in cerebral cortex is significantly weaker than module conservation in sub-cortical brain regions, which is in line with evolutionary hierarchies. WGCNA has been applied to identify modules of co-regulated genes in the developing and adult human brain transcriptomes (Kang et al. 2011; Hawrylycz et al. 2012), the developing rhesus monkey brain (Miller et al. 2013), the developing mouse brain (Thompson et al. 2014), and the prenatal human cortex (Miller et al. 2014a), see Fig. 3b. These methods provide valuable insight into the molecular organization of the brain by identifying modules reflecting primary neural cell types and molecular functions. For example, modules constructed based on the prenatal human cortex correspond to cortical layers and age, while no areal patterning was observed (Miller et al. 2014a). In addition, WGCNA was used to identify a set of 32 functionally and anatomically distinct modules of genes with highly reproducible gene expression patterns across six human brains (Hawrylycz et al. 2015). There are numerous technical considerations to take into account while constructing co-expression networks that go beyond the scope of this review (Allen et al. 2012; Ballouz et al. 2015). To analyze regional specificity of co-expression networks in the adult human brain, Myers et al. (2015) analyzed the modularity of a given gene set in region-specific co-expression networks. The developed method was used to compare networks constructed using expression data from a data set with a large sample size but coarse neuroanatomical resolution (Gibbs et al. 2010) to region-specific networks derived from the Allen Human Brain Atlas.

Box 5 | Co-expression Measurements

Gene co-expression is widely used for functional annotation, pathway analysis, and the reconstruction of gene regulatory networks. Co-expression measurements assess the similarity between a pair of gene expression profiles by detecting bivariate associations between them. These co-expression measurements can be summarized in five categories (Kumari et al. 2012; Allen et al. 2012; Song et al. 2012; Wang et al. 2014):

Correlation

The most widely used co-expression measure is Pearson correlation, due to its straightforward conceptual interpretation and computational efficiency. However, Pearson correlation can only capture linear relationships between variables. Alternatively, Spearman correlation is a nonparametric, rank-based measure that also captures monotonic non-linear associations. Other correlation-based methods include Rényi correlation, Kendall rank correlation, and bi-weight mid-correlation.
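The difference between the two measures can be seen in a small sketch. Spearman correlation is computed here as the Pearson correlation of the ranks, with ties receiving average ranks. The two expression profiles are hypothetical and constructed so that one is a monotonic but non-linear function of the other.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

gene_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # hypothetical profile
gene_b = [v ** 3 for v in gene_a]         # monotonic, non-linear in gene_a

r_pearson = pearson(gene_a, gene_b)
r_spearman = spearman(gene_a, gene_b)
print(round(r_pearson, 3), round(r_spearman, 3))
```

Because the relationship is perfectly monotonic, the Spearman value reaches 1, whereas the Pearson value stays below 1 since the relationship is not linear.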

Partial correlation

Partial correlation is used to measure direct relationships between a pair of variables, excluding indirect relationships. Based on Gaussian graphical models, partial correlations infer conditional dependency as the non-zero entries in the precision matrix (the inverse of the covariance matrix).
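For the special case of a single conditioning variable, the partial correlation can be computed directly from the three pairwise Pearson correlations, without inverting a covariance matrix. The sketch below uses this first-order formula. It is an illustration under that restriction, not a Gaussian graphical model estimator. The profiles are hypothetical and constructed so that two genes are correlated only through a shared regulator.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def partial_corr(x, y, z):
    """First-order partial correlation of x and y given z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical regulator profile and two independent perturbation patterns.
regulator = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
noise_x = [1, -1, -1, 1, 1, -1, -1, 1]
noise_y = [1, -1, 1, -1, -1, 1, -1, 1]
gene_x = [z + e for z, e in zip(regulator, noise_x)]
gene_y = [z + e for z, e in zip(regulator, noise_y)]

r_marginal = pearson(gene_x, gene_y)
r_partial = partial_corr(gene_x, gene_y, regulator)
print(round(r_marginal, 3), round(r_partial, 3))
```

The marginal correlation between the two genes is high, but it vanishes once the regulator is conditioned on, which is the kind of indirect relationship partial correlation is meant to exclude.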

Mutual information

Mutual information-based methods measure general statistical dependence between two variables. Based on information theory, mutual information does not assume monotonic relationships and hence can capture non-linear dependencies.
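A minimal sketch of a histogram-based mutual information estimate is given below. The equal-width binning and the number of bins are assumptions made for illustration; such estimates are known to depend on the binning and the sample size. The profiles are hypothetical and chosen so that one gene is a symmetric, non-monotonic function of the other.

```python
from collections import Counter
from math import log2, sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def discretize(values, bins):
    """Map each value to an equal-width bin index in [0, bins - 1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against constant profiles
    return [min(int((v - lo) / width), bins - 1) for v in values]

def mutual_information(x, y, bins=4):
    """Histogram estimate of mutual information in bits."""
    bx, by = discretize(x, bins), discretize(y, bins)
    n = len(x)
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    return sum(
        (c / n) * log2((c / n) / ((px[i] / n) * (py[j] / n)))
        for (i, j), c in pxy.items()
    )

gene_a = [i / 2 for i in range(-8, 9)]  # hypothetical centered profile
gene_b = [v ** 2 for v in gene_a]       # depends on gene_a, but not monotonically

r = pearson(gene_a, gene_b)
mi = mutual_information(gene_a, gene_b)
print(round(r, 3), round(mi, 3))
```

In this example the Pearson correlation is essentially zero although the second profile is fully determined by the first, while the mutual information estimate is clearly positive.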

Other measures

Euclidean distance, cosine similarity, Kullback-Leibler divergence, Hoeffding’s D, distance covariance, and probabilistic measures (as used in Bayesian networks).

Co-expression of disease-related genes

Complex neuropsychiatric and neurological disorders involve dysregulation of multiple genes, each conferring a small but incremental risk, which potentially converge in deregulated biological pathways or cellular functions. Using genome-wide association studies (GWAS), exome sequencing, and whole-genome sequencing (WGS), hundreds of variants have been linked to complex neurological disorders, such as autism (Iossifov et al. 2012; Neale et al. 2012; O’Roak et al. 2012; Sanders et al. 2012; Dong et al. 2014; De Rubeis et al. 2014), schizophrenia (Fromer et al. 2014; Ripke et al. 2014), migraine (Freilinger et al. 2012), and Alzheimer’s disease (Bettens et al. 2013; Zhang et al. 2013). With the increasing numbers of samples included in these studies, the number of variants associated with each disease is set to increase (Krumm et al. 2014). Gene co-expression networks provide a framework to identify the underlying molecular mechanisms on which these variants converge. Ben-David and Shifman (2012b) analyzed co-expression networks of genes affected by common and rare variants in autism using WGCNA. Menashe et al. (2013) used the cosine similarity of expression profiles to build a co-expression network of autism-related genes in the mouse brain. Both studies provide an important link between gene networks associated with autism and specific brain regions. However, for neurodevelopmental disorders, such as autism and schizophrenia, it is more beneficial to study when and where implicated genes are expressed during brain development. Gulsuner et al. (2013) studied the transcriptional co-expression of genes harboring de novo mutations in schizophrenia patients using the BrainSpan Atlas of the Developing Human Brain. Parikshak et al. (2013) used WGCNA to identify modules of co-expressed genes during human brain development using the BrainSpan atlas. They identified modules with significant enrichment in autism-related genes (Fig. 4). Willsey et al. (2013) used the BrainSpan atlas to generate co-expression networks around nine genes harboring recurrent de novo loss-of-function mutations in autism probands. Mahfouz et al. (2015b) used a combination of differential expression and genome-wide co-expression analysis to identify shared pathways among autism-related genes. To assess the functional convergence of distinct sets of genetic variants, Ballouz and Gillis (2016b) analyzed the connectivity of autism-candidate genes within a co-expression network constructed from the BrainSpan atlas. Their results show that gene sets with a higher proportion of burden genes exhibit higher interconnectivity, indicating stronger functional associations.

Fig. 4

Gene co-expression networks. a Module M13 of co-expressed genes from Parikshak et al. (2013) (reprinted from Parikshak et al. 2013, Copyright (2016), with permission from Elsevier). The shown module is significantly enriched in autism-related genes. The shown network comprises the top 200 connected genes (highest correlation) and their top 1000 connections in the subnetwork (also ordered on correlation). Genes are labeled if they are members of relevant gene sets. b Pattern of gene expression of genes in the shown module is summarized using the first principal component (eigengene). The red line indicates birth. c Gene Ontology terms enriched in the shown module. The blue bars indicate relative enrichment compared to all cortex-expressed genes in terms of Z score. The red line indicates Z = 2

Using gene co-expression networks to study relationships between disease-related genes is a valuable approach to understand disease mechanisms. In addition, using networks facilitates the integration of different types of interactions between genes, including but not limited to co-expression, protein–protein interactions (PPI), and literature-based interactions. This can be very useful for understanding the etiologies of complex neurological diseases at different levels. In a recent study, Hormozdiari et al. (2015) integrated gene co-expression based on the BrainSpan atlas and PPI networks to identify networks of genes related to autism and intellectual disability. For a review on using gene networks to investigate the molecular mechanisms underlying neurological disorders, we refer the reader to Gaiteri et al. (2014) and Parikshak et al. (2015).

Box 6 | Co-expression Networks

Gene co-expression networks provide a framework to uncover the molecular mechanisms underlying biological processes based on gene expression data. A co-expression network consists of nodes to represent genes and edges to encode the co-expression between two genes. A weighted network is a network in which the edges have continuous values to indicate the strength of co-expression. Networks with binary edges (an edge either exists or not) are termed binary networks. Analysis of co-expression networks can be summarized in four main steps:

Network Construction

The first step in building a co-expression network is to construct a similarity matrix, by quantifying the similarity between the expression profiles of each pair of genes (i.e. co-expression). Several methods to measure gene co-expression are discussed in Box 5. For non-regularized estimations of co-expression, all off-diagonal elements of this similarity matrix will be nonzero. We can take these similarities as edge weights in the network, but that will give a fully connected network (each gene is connected to every other gene). An additional step can be to threshold the similarity matrix, either to prune edges, or to binarize (absent/present) the similarities to obtain an adjacency matrix. In the latter case, pairs of genes with co-expression values above a threshold will be connected in a binary network. In the weighted gene co-expression network analysis (WGCNA) framework, the similarity matrix undergoes a power transformation and a weight diffusion step, to optimize the topological properties and stability of the network (Zhang and Horvath 2005).
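The construction step can be sketched as follows, assuming absolute Pearson correlation as the similarity measure (which yields an unsigned network). The threshold of 0.8 and the power of 6 are illustrative choices, and the expression profiles are hypothetical. The weight diffusion step is not included.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def similarity_matrix(profiles):
    """Absolute correlation for every gene pair; diagonal left at 0 (no self-loops)."""
    n = len(profiles)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = abs(pearson(profiles[i], profiles[j]))
    return sim

def hard_threshold(sim, tau):
    """Binary adjacency: an edge exists when the similarity reaches tau."""
    return [[1 if s >= tau else 0 for s in row] for row in sim]

def soft_threshold(sim, beta):
    """Weighted adjacency: similarities raised to the power beta."""
    return [[s ** beta for s in row] for row in sim]

# Hypothetical profiles of four genes across five samples.
genes = ["g1", "g2", "g3", "g4"]
profiles = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.1, 2.3, 2.9, 4.2, 4.8],
    [5.0, 4.0, 3.2, 2.1, 1.0],
    [2.0, 5.0, 1.0, 4.0, 3.0],
]

sim = similarity_matrix(profiles)
binary = hard_threshold(sim, tau=0.8)
weighted = soft_threshold(sim, beta=6)
print(binary)
```

The power transformation keeps all edges but pushes weak similarities towards zero, so strong co-expression is emphasized without committing to a single cut-off.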

Network Characterization

The obtained networks can be analyzed in a number of ways. Topological measures characterize the structure of the network, and quantify the importance of genes in their network context. These measures have been extended to weighted networks (Zhang and Horvath 2005), and can capture topology on different levels of scale (Hulsman et al. 2014). Sets of networks can also be aligned and compared (Przulj 2007; Hayashida and Akutsu 2010; Fionda 2011). Network comparison can be used either to assess changes between different conditions, or to replicate a network in an independent dataset for validity assessment.

Module Identification

To interpret a network, it can be divided into sub-networks, or gene modules. To do this, the network edges are often treated as similarities in a clustering approach (see Box 3). Alternatively, graph properties, such as topological overlap or modularity, can be used to divide a network into modules (Blondel et al. 2008).
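As an illustration of a graph property used for module detection, the sketch below computes the topological overlap of Zhang and Horvath (2005) from a weighted adjacency matrix with zero diagonal: for genes i and j, the shared-neighbor weight plus the direct edge weight, divided by the smaller of the two connectivities plus one minus the direct edge weight. Grouping genes into connected components after cutting the overlap at a fixed value is a deliberate simplification used here for brevity; it is not the hierarchical clustering procedure used in WGCNA. The adjacency values are hypothetical.

```python
def topological_overlap(adj):
    """Topological overlap matrix for a weighted adjacency with zero diagonal."""
    n = len(adj)
    k = [sum(row) for row in adj]  # connectivity of each gene
    tom = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = sum(adj[i][u] * adj[u][j] for u in range(n) if u != i and u != j)
            value = (shared + adj[i][j]) / (min(k[i], k[j]) + 1 - adj[i][j])
            tom[i][j] = tom[j][i] = value
    return tom

def modules_from_overlap(tom, cut):
    """Connected components of the graph with edges where overlap >= cut."""
    unvisited, modules = set(range(len(tom))), []
    while unvisited:
        stack = [unvisited.pop()]
        module = set(stack)
        while stack:
            i = stack.pop()
            for j in list(unvisited):
                if tom[i][j] >= cut:
                    unvisited.remove(j)
                    module.add(j)
                    stack.append(j)
        modules.append(sorted(module))
    return sorted(modules)

# Hypothetical weighted adjacency: genes 0-2 and genes 3-4 form two tight groups.
adj = [
    [0.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 0.0, 0.85, 0.0, 0.1],
    [0.8, 0.85, 0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1, 0.0, 0.9],
    [0.0, 0.1, 0.0, 0.9, 0.0],
]

tom = topological_overlap(adj)
modules = modules_from_overlap(tom, cut=0.5)
print(modules)
```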

Module Characterization

Finally, modules can be characterized using a wide range of approaches. The expression profile of genes within the same module can be summarized using the average or the first principal component (also called eigengene (Oldham et al. 2006)). Alternatively, one can characterize a module according to its hub genes: the genes with the largest number of connections within the module. Another option is to assess the association of a module to external data by testing statistical enrichment in various gene sets (see Box 1 for different types of gene sets). In addition, modules can be characterized based on changes between conditions (e.g. health and disease) in their summary statistics (average expression profile), their topological measures (inter-connectivity), or the number of differentially-expressed genes they include.
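Two of these summaries can be sketched in a few lines: the average expression profile of a module and its hub gene. For a weighted network, the sketch assumes that the "number of connections" is replaced by the sum of edge weights to the other module members. Gene names, weights, and expression values are hypothetical. The eigengene is not computed here, since it requires a principal component analysis.

```python
def intramodular_connectivity(adj, genes):
    """Sum of edge weights from each gene to the other genes in the module."""
    return {g: sum(adj[g][h] for h in genes if h != g) for g in genes}

def mean_profile(expr, genes):
    """Average expression of the module genes in each sample."""
    return [sum(values) / len(genes) for values in zip(*(expr[g] for g in genes))]

# Hypothetical module of three genes with symmetric edge weights.
module = ["gA", "gB", "gC"]
adj = {
    "gA": {"gB": 0.9, "gC": 0.8},
    "gB": {"gA": 0.9, "gC": 0.3},
    "gC": {"gA": 0.8, "gB": 0.3},
}
expr = {
    "gA": [1.0, 2.0, 3.0],
    "gB": [2.0, 3.0, 4.0],
    "gC": [3.0, 4.0, 5.0],
}

connectivity = intramodular_connectivity(adj, module)
hub = max(connectivity, key=connectivity.get)
summary = mean_profile(expr, module)
print(hub, summary)
```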