Background

A gene regulatory network (GRN), which represents interactions or causal relationships between genes, describes the developmental or regulatory processes of a cellular system [1]. GRN inference is a focal point of systems biology for understanding biological systems [2]. Traditional knock-out or perturbation experiments have been widely used to discover regulations among genes and have had some success in explaining biological systems [3]. However, the interactions discovered by these expensive and time-consuming experiments are 'just the tip of the iceberg' in a complex GRN. In contrast, genome-wide inference of GRNs from high-throughput data by computational methods promises an economical way to uncover complex regulatory mechanisms [4, 5]. The challenge for computational methods is to build reasonable models that precisely predict the interactions between regulators and targets from gene expression data [6]. Distinguishing direct interactions from indirect ones remains an important challenge in the reconstruction of GRNs, because inference methods are notoriously prone to retaining indirect interactions in the inferred network [7, 8].

In recent years, various approaches have been developed to address these challenges in GRN inference, and some of them have achieved a degree of success [9]. According to the techniques involved, these approaches can be divided into two types, i.e., dependence-based and equation-based methods [10]. In dependence-based methods, the gene network is predicted by measuring the dependences among genes with measures such as the Pearson correlation coefficient [11,12,13], mutual information [14, 15], and the Granger method [16, 17]. This type of method can measure linear or nonlinear correlations independently, but the results contain many redundant edges such as indirect regulations [18,19,20]. In equation-based methods, the regulations and regulatory strengths among genes are described as equations [21]. Representative equation-based methods include multiple linear regression [22], nonnegative matrix factorization [23], network component analysis [24, 25], linear programming [26], and random forest [27, 28]. Equation-based methods can capture interactions based on the dynamic mechanism, but the optimization technique sometimes limits their capability for parameter estimation because of the high dimensionality of candidate regulators [29, 30].

Despite recent advances in GRN inference methods, most of them cannot distinguish direct correlations from indirect ones [31]. Some dependence-based methods have been developed to discriminate direct and indirect connections in GRNs, such as the partial correlation coefficient (PCC) [32], conditional mutual information (CMI) [33], part mutual information (PMI) [34], and conditional mutual inclusive information (CMI2) [35]. Equation-based methods are popular for their advantages of sparseness control and optimal estimation [36,37,38]. However, these methods are sensitive to the data and have two limitations that seriously affect the performance of GRN inference [39, 40]. Firstly, noise in the data, the high dimensionality of genes, and the small number of samples affect parameter estimation in the optimization. Secondly, indirect interactions are included in the results [41, 42]. The challenge in improving the accuracy of regression-based methods is to address these limitations [43, 44].

We previously proposed a noise and redundancy reduction strategy, namely NARROMI, based on recursive optimization, which improved the performance of gene network inference [45]. In this strategy, the network was updated by recursive optimization to remove the indirect interactions. The limitation of the strategy is that some direct interactions identified in one step were not recognized in the next step. In other words, along with the elevated true positive rate (TPR), recursive optimization (RO) also raises the false positive rate (FPR). In an algorithm for network inference, the balance between TPR and FPR is the key to improving its performance. Some techniques incorporating existing network information into the optimization problem have been proposed to improve network inference [46, …

… http://geneontology.org/) was achieved. With the above GO items, the web tool WEGO2.0 (http://wego.genomics.org.cn/) was used for visualization. Figure 6 shows the result of the GO analysis for the identified genes. Out of 313 core genes, 147 genes were annotated and divided into the three basic categories of GO first-level items (Additional file 3: Table S3). There are 98 items in the biological process category, 30 items in the cellular component category, and 128 items in the molecular function category (Fig. 6a). To show the hierarchical relationship for the gene set, the second and third levels of GO items are provided separately (Fig. 6b, c). Listed in first and third places of the columns, the two items catalytic activity (GO:0003824) and binding (GO:0005488) reveal that these genes are involved in catalytic reactions and molecular activities, such as redox reactions, hydrolysis reactions, ion binding, and organic cyclic compound binding. Another two items, metabolic process (GO:0008152) and cellular process (GO:0009987), listed in second and fourth places, indicate that the genes regulate metabolism-related biological processes. All items above confirm that the gene set identified by the RSNET method is highly correlated with fruit development.

Fig. 6

GO analysis confirmed that the predicted genes correlate with fruit development. a Table of the GO analysis results, including the number of genes involved in different GO terms. b Hierarchical relationship of the gene set at the second level of GO items. c Hierarchical relationship of the gene set at the third level of GO items

To explore whether the genes identified by the RSNET method correlate with fruit development, we analyzed the dynamical changes of their expression during the stages from floral bud to ripe fruit. We clustered the 313 genes into seven sub-clusters using a clustering tool. Among them, six sub-clusters match the four plant physiological processes, i.e. floral bud/bloom (FB), early fruit development (EDF), mid-development (MD), and ripening (R) (Fig. 7a). Sub-cluster 4 matched FB, sub-cluster 5 matched EDF, sub-clusters 1 and 7 matched MD, and sub-clusters 2 and 3 matched R exactly (Fig. 7b). Our analysis provides a list of genes that are significant for fruit development. Among the genes in the list, 30 are highly related and 283 are related. Compared with a previous analysis by the ANOVA method, which selected 1955 genes, the RSNET method captured similar dynamical changes during fruit development with a much smaller gene set. The result shows two advantages of the RSNET method in network inference. Firstly, the RSNET method can identify the direct causal genes by filtering out indirect and noisy genes. Secondly, the RSNET method identifies significant genes rather than a random selection from all genes.

Fig. 7

Clustering analysis of dynamical gene expression confirmed that the predicted genes correlate with fruit development. a Seven sub-clusters of genes with dynamical changes during eight fruit developmental stages. b Heat map of the gene clusters in four different fruit developmental stages

Methods

Mutual information between gene pairs

The dependency between a gene pair can be measured by computing the mutual information (MI) of the two gene expression vectors. Because it can capture nonlinear relationships, mutual information has been widely used. For a gene pair A and B, the mutual information can be described as [33]

$$MI(A,B) = \sum\limits_{a \in A,\;b \in B} {f(a,b)\log \frac{f(a,b)}{{f(a)f(b)}}} .$$
(1)

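Here f(a,b) is the joint probability distribution of A and B, and f(a) and f(b) are the marginal distributions. Under the assumption that the expression data follow a Gaussian distribution, the above formula can be computed as [33]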

$$MI(A,B) = \frac{1}{2}\log \frac{|M(A)| \cdot |M(B)|}{{|M(A,B)|}},$$
(2)

where M is the covariance matrix of the indicated variables and |M| is the determinant of M. In particular, MI(A,B) = 0 indicates that genes A and B are independent.

In the first step of the proposed method, mutual information is used to select the putative regulators of a given target gene from the global set of candidate genes.
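The covariance-based formula in Eq. (2) and the selection step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `gaussian_mi` and `select_regulators`, the synthetic expression vectors, and the threshold value are all assumptions made for the example.

```python
import numpy as np


def gaussian_mi(a, b):
    """Mutual information of two expression vectors via Eq. (2).

    Relies on the Gaussian assumption behind Eq. (2): MI is computed from
    the variances of a and b and the determinant of their joint covariance.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cov = np.cov(np.vstack([a, b]))  # 2 x 2 joint covariance matrix M(A, B)
    return 0.5 * np.log(cov[0, 0] * cov[1, 1] / np.linalg.det(cov))


def select_regulators(target, candidates, threshold):
    """Keep candidate genes whose MI with the target exceeds a threshold.

    `candidates` maps a gene name to its expression vector; `threshold`
    is a hypothetical cut-off chosen by the user.
    """
    return [name for name, expr in candidates.items()
            if gaussian_mi(target, expr) > threshold]


# Synthetic example: the target depends on tf1 but not on tf2.
rng = np.random.default_rng(0)
n_samples = 200
tf1 = rng.normal(size=n_samples)
tf2 = rng.normal(size=n_samples)
target = 0.9 * tf1 + 0.1 * rng.normal(size=n_samples)

candidates = {"tf1": tf1, "tf2": tf2}
selected = select_regulators(target, candidates, threshold=0.05)
```

For a single pair of genes, Eq. (2) reduces to \(-\frac{1}{2}\log (1 - r^{2})\), where r is the Pearson correlation coefficient, so under the Gaussian assumption the MI grows without bound as the two expression vectors become perfectly correlated.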

Redundancy silencing and network enhancement technique

To quantitatively describe a gene regulatory network for the transcription process from DNA to RNA, a mathematical model involving the transcription factors (TFs) and the target gene should be built [45, 54]. Among the reasonable models, the regression model is the most popular for its dynamic description of transcription. In this work, we provide an updating model to silence the redundant regulations and enhance the high-confidence edges. The redundancy silencing is implemented by repeating the following optimization, each time starting from the updated result, until the result no longer changes.

$$\tilde{\beta } = \mathop {\arg \min }\limits_{\beta } \left| {y - \beta X} \right| + \lambda \left| \beta \right| + \gamma \left| {\hat{\beta } \otimes \beta } \right|,$$
(3)

where \(y,X\) and \(\beta\) represent the target gene, the TFs, and the regulatory strengths, respectively. \(\hat{\beta }\) is the vector of network enhancement items, each taking the value 0 or 1. \(\lambda\) and \(\gamma\) are parameters that balance the error and ensure the network sparseness, respectively. The operator \(\otimes\) is the Hadamard product. The parameter \(\hat{\beta }\) is first estimated by mutual information and then updated by the optimizations [
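One way to read Eq. (3) is as a least-squares fit with a per-regulator L1 penalty. The sketch below solves a single such optimization under stated assumptions: a squared L2 loss with a factor of 1/2, L1 norms for both penalty terms, and an expression matrix laid out as samples by regulators (the transpose of the \(\beta X\) layout in Eq. (3)). The solver is iterative soft-thresholding (ISTA), chosen for brevity and not taken from the paper. The function name `weighted_l1_regression` and all parameter values are hypothetical, and the rule for updating \(\hat{\beta }\) between recursions is not reproduced.

```python
import numpy as np


def weighted_l1_regression(X, y, beta_hat, lam, gamma, n_iter=2000):
    """Minimise 0.5 * ||y - X @ beta||^2 + sum_j w_j * |beta_j| with ISTA.

    The per-regulator weights are w_j = lam + gamma * beta_hat_j, which
    equals lam * |beta|_1 + gamma * |beta_hat (Hadamard) beta|_1 when
    beta_hat holds only 0s and 1s. The norm choices are assumptions.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    weights = lam + gamma * np.asarray(beta_hat, dtype=float)
    # Step size 1/L, where L is the largest eigenvalue of X^T X.
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        gradient = X.T @ (X @ beta - y)
        z = beta - step * gradient
        # Soft-thresholding: proximal operator of the weighted L1 penalty.
        beta = np.sign(z) * np.maximum(np.abs(z) - step * weights, 0.0)
    return beta


# Synthetic example: 100 samples, 5 candidate TFs, 2 true regulators.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
true_beta = np.array([2.0, 0.0, 0.0, -1.5, 0.0])
y = X @ true_beta + 0.1 * rng.normal(size=100)

# gamma = 0 gives an ordinary sparse regression.
beta_plain = weighted_l1_regression(X, y, np.zeros(5), lam=5.0, gamma=0.0)

# With the formula as printed, a large gamma adds shrinkage where beta_hat is 1.
beta_masked = weighted_l1_regression(
    X, y, np.array([1.0, 0.0, 0.0, 0.0, 0.0]), lam=5.0, gamma=1000.0
)
```

In a recursive scheme of the kind described above, a solve like this would be repeated on the regulators that keep non-zero strengths until the selected set stops changing.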

Conclusion

In the reconstruction of GRNs, distinguishing direct interactions from indirect ones is an important challenge, because inference methods are notoriously prone to retaining indirect interactions in the inferred network. In this study, we present a network inference method named RSNET, based on a redundancy silencing and network enhancement technique. In the proposed method, the redundant interactions, including weak and indirect connections, are adaptively silenced by recursive optimization, while the highly confident correlated regulators are constrained to improve the true positive rate of prediction. The results on gold-standard networks, including a simulation study, the DREAM challenge dataset, and the Escherichia coli network, show the good performance of the RSNET method. The case study on constructing the apple fruit ripening GRN shows that the RSNET method can construct function-specific GRNs. This study provides a useful bioinformatics tool for inferring clean GRNs from gene expression data.