Incorporating kernelized multi-omics data improves the accuracy of genomic prediction

Liang, Mang; An, Bingxing; Chang, Tianpeng; Deng, Tianyu; Du, Lili; Li, Keanning; Cao, Sheng; Du, Yueying; Xu, Lingyang; Zhang, Lupei; Gao, Xue; Li, Junya; Gao, Huijiang

doi:10.1186/s40104-022-00756-6

Incorporating kernelized multi-omics data improves the accuracy of genomic prediction

Research
Open access
Published: 20 September 2022

Volume 13, article number 103, (2022)
Cite this article

Download PDF

You have full access to this open access article

Journal of Animal Science and Biotechnology Aims and scope Submit manuscript

Incorporating kernelized multi-omics data improves the accuracy of genomic prediction

Download PDF

Mang Liang¹^na1,
Bingxing An¹^na1,
Tianpeng Chang¹,
Tianyu Deng¹,
Lili Du¹,
Keanning Li¹,
Sheng Cao¹,
Yueying Du¹,
Lingyang Xu¹,
Lupei Zhang¹,
Xue Gao¹,
Junya Li¹ &
…
Huijiang Gao ORCID: orcid.org/0000-0002-5502-2528¹

2869 Accesses
4 Altmetric
Explore all metrics

Abstract

Background

Genomic selection (GS) has revolutionized animal and plant breeding after the first implementation via early selection before measuring phenotypes. Besides genome, transcriptome and metabolome information are increasingly considered new sources for GS. Difficulties in building the model with multi-omics data for GS and the limit of specimen availability have both delayed the progress of investigating multi-omics.

Results

We utilized the Cosine kernel to map genomic and transcriptomic data as ${n}\times {n}$ symmetric matrix (G matrix and T matrix), combined with the best linear unbiased prediction (BLUP) for GS. Here, we defined five kernel-based prediction models: genomic BLUP (GBLUP), transcriptome-BLUP (TBLUP), multi-omics BLUP (MBLUP, $\boldsymbol M=\mathrm{ratio}\times\boldsymbol G+(1-\mathrm{ratio})\times\boldsymbol T$), multi-omics single-step BLUP (mssBLUP), and weighted multi-omics single-step BLUP (wmssBLUP) to integrate transcribed individuals and genotyped resource population. The predictive accuracy evaluations in four traits of the Chinese Simmental beef cattle population showed that (1) MBLUP was far preferred to GBLUP (ratio = 1.0), (2) the prediction accuracy of wmssBLUP and mssBLUP had 4.18% and 3.37% average improvement over GBLUP, (3) We also found the accuracy of wmssBLUP increased with the growing proportion of transcribed cattle in the whole resource population.

Conclusions

We concluded that the inclusion of transcriptome data in GS had the potential to improve accuracy. Moreover, wmssBLUP is accepted to be a promising alternative for the present situation in which plenty of individuals are genotyped when fewer are transcribed.

RNA-Seq Data Analysis in Galaxy

Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2

Article Open access 05 December 2014

RefFinder: a web-based tool for comprehensively analyzing and identifying reference genes

Article 15 April 2023

Background

A significant objective of genetics is to examine the connection between genotypes and phenotypes. Although genome-wide association studies (GWASs) have mapped thousands of common genetic variants for complex traits the causal variants and genes at these loci remain unknown [1]. This is because the map** resolution is limited by the complicated linkage disequilibrium (LD) structure of the genome (i.e., the top associated variant at a locus is often not the causal variant) [2]. Especially for polygenic traits regulated by many interacting genes with minor effects [3], these detected significant loci could only explain a small proportion of phenotype variances, resulting in lower prediction accuracy.

Alternatively, genomic prediction (GP) is an ensemble of methods to estimate the breeding values with higher reliability earlier in life by combining DNA variants jointly using existing identification, pedigree, and phenotype databases for individuals [4, 5]. Over the past decades, this technology has revolutionized animal and plant breeding after its first implementation because of its excellent performance in reducing generation intervals and generating more genetic gain [6]. Currently, several methods have been made to develop more efficient statistical approaches to estimating genomic breeding values (GEBVs), such as genomic best linear unbiased prediction (GBLUP) which has been the most widely used in GP [7], single-step BLUP (ssGBLUP) [8], ridge regression methods [9], Bayesian Alphabet regression [10, 11], and emerging machine learning (ML) strategies including support vector regression (SVR) [12], random forest (RF) [13], reproducing kernel Hilbert spaces regression (RKHS) [10], kernel ridge regression [14], etc. Briefly, the predictive accuracy of Bayesian methods outperforms BLUP-based models in the majority of cases, while the Markov Chain Monte Carlo (MCMC) procedure also suffered a substantial computational burden [15]. In several simulation and actual studies, nonparametric ML methods behaved better, primarily due to their superior prediction ability [16, 17]. Therefore, there was a clear trend that increasingly breeders were trying to combine the advantages of multiple methods to estimate GEBVs in GP.

Afterward, one critical issue for phenotypic prediction is how to model non-additive effects (dominance or epistatic effects). Several research confirmed that using non-additive relationships that could improve the prediction of phenotypes [18, 19]. Incorporating additional layers of omics data into the prediction machine may partially solve this problem. For instance, many genetic variants affect complex traits by modulating gene expression, thus altering the abundance of relevant proteins [20]. The advanced next-generation sequencing technologies made it possible, and Li et al. discussed the concept of “omics-augmented broad sense heritability” that accounts for SNP-based effects and effects of downstream biological regulation captured by gene interactions [21]. However, the genetic links between phenotype and genome variants are too complex to determine directly at the genome level. Another concern was that the gene expression levels were affected by several factors such as tissue specificity, time of sampling and experimental conditions. Guo et al. found that using only transcriptome data to predict phenotypes is not satisfying, especially for high heritability traits; the genomic data remain the most efficient predictors [22]. Therefore, integrating multi-omics information could be a promising option in GP. In a human acute myeloid leukemia (AML) dataset including cytogenetics, gene mutations and expression variables, a priority-Lasso was presented and showed better predictive accuracy than the independent validation dataset [23]. Xu et al. found that the predictability of hybrid yield of rice can be further increased using multi-omics data, in particular, when used metabolomics data, the predictability was almost doubled compared with the genomic prediction [24].

Based on our previous studies using the Cosine kernel transformed the SNP matrix of the population into an ${n}\times {n}$ (n is the size of the population) symmetric matrix, which is similar to the G matrix. So, we tried the Cosine kernel to map genomic and transcriptomic data as ${n}\times {n}$ symmetric matrix (G matrix and T matrix). In this study, we firstly defined three prediction models: GBLUP (genomic data), TBLUP (transcriptomic data), and MBLUP (combining genomic and transcriptomic data, where the $\boldsymbol M=\mathrm{ratio}\times\boldsymbol G+(1-\mathrm{ratio})\times\boldsymbol T$) in experiment population (120 cattle both genotyped and transcribed). Additionally, large-scale studies systematically measuring the relationship between gene expression and a trait in individuals have been hampered because of the specimen availability and cost. We secondly construction of multi-omics single-step BLUP (mssBLUP) and weighted multi-omics single-step BLUP (wmssBLUP) to integrate transcribed individuals and genotyped resource population (Fig. 1) [5] inspired by $\boldsymbol H_{\boldsymbol w}$ matrix construction of single-step BLUP (ssBLUP) strategy. Genomic best linear unbiased prediction (GBLUP) using SNP array data was set to be a benchmark model (assessed only in resource population, 1478 cattle). Essentially, our research proposed an alternative strategy for integrating multi-omics data for genomic prediction, that is, to build a linear regression utilizing kernel trick mapped the original high-dimension data as a relationship matrix.

Methods

All animals used in the study were treated following the Council of China Animal Welfare guidelines. Protocols of the experiments were approved by the Science Research Department of the Institute of Animal Sciences, Chinese Academy of Agricultural Sciences (CAAS), Bei**g, China (approval number: RNL09/07).

Data collection

Resource population

The Huaxi cattle population with an average age of 26 months and an average pre-slaughter weight of 700 kg were from Ulgai, **

HISAT2 (v2.2.1) was used to compare clean reads to reference the genome Bos taurus ARS-UCD1.2 [32]. The cattle reference genome annotation supplied the genomic position information was used to calculate effective reads aligned to the gene region. File generated by the HISAT2 were sorted through SAMtools (v1.11) [33, 34]. FeatureCounts (v1.5.2) was used to estimate read counts generated from RNA sequencing experiments [35]. After the above process, we obtained the original reads matrix (120 cattle, 27,607 genes), 7546 genes with 0 reads in all individuals and 1007 genes located on sex chromosomes were removed. Then, using STRINGTIE software (v.1.3.4 with default settings), the expression levels of genes (fragments per kilobase of transcript per million mapped reads, FPKM) were calculated. Genes with FPKM < 0.1 in more than 95% of samples were discarded (n = 3729) [36]. After this quality control step, a total of 15,325 gene expression transcripts from 120 cattle remained for further analysis.

Genomic prediction comparison models

To remove the fixed effects in prediction, we used adjusted phenotypic values of phenotypes in subsequent analysis:

$$\boldsymbol y=\boldsymbol X\boldsymbol \beta +{y}^{*}$$

where y is a vector of phenotypic, $\boldsymbol \beta$ is a vector of fixed effects (year of birth, birth weight, fattening duration, and slaughtered batch as a covariate), X is the design matrix of relevant observations, and ${y}^{*}$ is the random residual, which was subsequently used in the prediction models.

Genomic best linear unbiased prediction (GBLUP)

GBLUP assumes that all SNPs contribute to the genetic variance and follow the same normal distribution [7]. GEBVs were calculated based on the following equation:

$$\boldsymbol y^\ast=\boldsymbol Z\gamma+\boldsymbol e\;with\;\boldsymbol \gamma\sim N\left(0,\boldsymbol G\sigma_g^2\right)\;and\;\boldsymbol e\sim N\left(0,\boldsymbol I\sigma_e^2\right)$$

where $\boldsymbol{y}^{*}$ is the vector of the corrected phenotype and Z is an incidence matrix for individual effects. $\boldsymbol\gamma$ is a vector of breeding values and ${\sigma }_{g}^{2}$ is genetic variance. $\boldsymbol e$ is a vector of residual error, where $\boldsymbol I$ is an identity matrix and ${\sigma }_{e}^{2}$ is the residual variance. Hence, the narrow sense of heritability was estimated by the formula: ${h}^{2}=\frac{{\sigma }_{g}^{2}}{{\sigma }_{g}^{2}+{\sigma }_{e}^{2}}$. The G matrix was calculated as $\boldsymbol G=\frac{Z{Z}^{^{\prime}}}{2\sum {p}_{i}(1-{p}_{i})}$, and ${p}_{i}$ is the MAF of the i-th marker [7].

Kernel trick

Based on our previous study, the kernel matrix transformed by Cosine kernel was analogous to the numerator relationship G matrix, which had a well-matched performance for the G matrix, with time consumption reduced by 20 times [37]. In this research, we choose the Cosine kernel to transform original genomic and transcriptomic data:

$$K\left({x}_{i},{x}_{j}\right)=\frac{\langle {x}_{i},{x}_{j}\rangle }{\Vert {x}_{i}\Vert *\Vert {x}_{j}\Vert }=\frac{{\sum }_{1}^{n}{x}_{i}{x}_{j}}{\sqrt{\sum_{1}^{n}{x}_{i}^{2}}\sqrt{\sum_{1}^{n}{x}_{j}^{2}}}$$

For the G matrix, ${x}_{i}$ and ${x}_{j}$ were the feature vectors of individual i and j in an m-dimensional feature space, respectively, where m is the number of SNPs. For the T matrix, ${x}_{i}$ and ${x}_{j}$ were the $1\times\boldsymbol n$ vector and was kernelized in an n-dimensional feature space, where n is the number of genes. The G matrix and T matrix were measured by the cosine of the angle between two vectors, and the regularization factor C was determined by gird search as 0.05 in our previous study.

Multi-omics best linear unbiased prediction (MBLUP)

For experimental populations who have both genomic and transcriptomic data (120 individuals), we defined an MBLUP, where the M matrix was used to replace the G matrix in traditional GBLUP equations. Here, the $\boldsymbol M=\mathrm{ratio}\times\boldsymbol G+(1-\mathrm{ratio})\times\boldsymbol T$) and the gradient of weight parameter ratio were set as 0.01–0.99 to gain the optimum predictive accuracies for each trait. The MBLUP was equivalent to TBLUP when the ratio was 0 and was comparable to GBLUP when the ratio was 1.

Multi-omics single-step best linear unbiased prediction (mssBLUP)

Inspired by the single-step best linear unbiased prediction (ssBLUP) [5]:

$$\boldsymbol{H}=\left[\begin{array}{cc}\boldsymbol{H}_{\boldsymbol{11}}& \boldsymbol{H}_{\boldsymbol{12}}\\ \boldsymbol{H}_{\boldsymbol{21}}& \boldsymbol{H}_{\boldsymbol{22}}\end{array}\right]=\left[\begin{array}{cc}\boldsymbol{A}_{\boldsymbol{11}}+\boldsymbol{A}_{\boldsymbol{12}}\boldsymbol{A}_{\boldsymbol{22}}^{\boldsymbol{-1}}\boldsymbol{A}_{\boldsymbol{21}}+\boldsymbol{A}_{\boldsymbol{12}}\boldsymbol{A}_{\boldsymbol{22}}^{\boldsymbol{-1}}\boldsymbol{G}\boldsymbol{A}_{\boldsymbol{22}}^{\boldsymbol{-1}}\boldsymbol{A}_{\boldsymbol{21}}& \boldsymbol{A}_{\boldsymbol{12}}\boldsymbol{A}_{\boldsymbol{22}}^{\boldsymbol{-1}}\boldsymbol{G}\\ \boldsymbol{G}\boldsymbol{A}_{\boldsymbol{22}}^{\boldsymbol{-1}}\boldsymbol{A}_{\boldsymbol{21}}& \boldsymbol{G}\end{array}\right]$$

the inverse of $\boldsymbol{H}^{-1}$:

$$\boldsymbol{H}^{\boldsymbol{-1}}=\boldsymbol{A}^{\boldsymbol{-1}}+\left[\begin{array}{cc}0& 0\\ 0& \boldsymbol{G}^{-1}-\boldsymbol{A}_{\boldsymbol{22}}^{\boldsymbol{-1}}\end{array}\right]$$

To address the issue that large numbers of individuals were genotyped but fewer are transcribed, here, we built multi-omics single-step best linear unbiased prediction (mssBLUP):

$$\boldsymbol{Hm}=\left[\begin{array}{cc}\boldsymbol{Hm}_{\boldsymbol{11}}& \boldsymbol{Hm}_{\boldsymbol{12}}\\ \boldsymbol{Hm}_{\boldsymbol{21}}& \boldsymbol{Hm}_{\boldsymbol{22}}\end{array}\right]=\left[\begin{array}{cc}\boldsymbol{G}_{\boldsymbol{11}}+\boldsymbol{G}_{\boldsymbol{12}}\boldsymbol{G}_{\boldsymbol{22}}^{\boldsymbol{-1}}\boldsymbol{G}_{\boldsymbol{21}}+\boldsymbol{G}_{\boldsymbol{12}}\boldsymbol{G}_{\boldsymbol{22}}^{\boldsymbol{-1}}\boldsymbol{T}\boldsymbol{G}_{\boldsymbol{22}}^{\boldsymbol{-1}}\boldsymbol{G}_{\boldsymbol{21}}& \boldsymbol{G}_{\boldsymbol{12}}\boldsymbol{G}_{\boldsymbol{22}}^{\boldsymbol{-1}}\boldsymbol{T}\\ \boldsymbol{T}\boldsymbol{G}_{\boldsymbol{22}}^{\boldsymbol{-1}}\boldsymbol{G}_{\boldsymbol{21}}& \boldsymbol{T}\end{array}\right]$$

where $T$ was as mentioned above. The ${\boldsymbol G}_{11}$ was the submatrix of G for transcribed animals. The ${\boldsymbol G}_{22}$ was the submatrix of G for non-transcribed animals. The ${\boldsymbol G}_{12}$ (or ${\boldsymbol G}_{21}$) was the submatrix of G describing the relationships between transcribed and non-transcribed animals.

Weighted multi-omics single-step best linear unbiased prediction (wmssBLUP)

For higher accuracy, several research studies weighted the G matrix in ssBLUP, in which the ${\boldsymbol G}_w=\left(1-w\right)\boldsymbol G+w{\boldsymbol A}_{22}$, the $w$ is a weighted parameter, which indicated the proportion of genetic relationships that were not explained by SNPs [38,39,40]. Afterward, we built the weighted multi-omics single-step linear unbiased prediction (wmssBLUP):

$$\boldsymbol{Hm}_{\boldsymbol{w}}=\left[\begin{array}{cc}\boldsymbol{Hm}_{\boldsymbol{11}}& \boldsymbol{Hm}_{\boldsymbol{12}}\\ \boldsymbol{Hm}_{\boldsymbol{21}}& \boldsymbol{Hm}_{\boldsymbol{22}}\end{array}\right]=\left[\begin{array}{cc}\boldsymbol{G}_{\boldsymbol{11}}+\boldsymbol{G}_{\boldsymbol{12}}\boldsymbol{G}_{\boldsymbol{22}}^{\boldsymbol{-1}}\boldsymbol{G}_{\boldsymbol{21}}+\boldsymbol{G}_{\boldsymbol{12}}\boldsymbol{G}_{\boldsymbol{22}}^{\boldsymbol{-1}}\boldsymbol{T}_{\boldsymbol{w}}\boldsymbol{G}_{\boldsymbol{22}}^{\boldsymbol{-1}}\boldsymbol{G}_{\boldsymbol{21}}& \boldsymbol{G}_{\boldsymbol{12}}\boldsymbol{G}_{\boldsymbol{22}}^{\boldsymbol{-1}}\boldsymbol{T}_{\boldsymbol{w}}\\ \boldsymbol{T}_{\boldsymbol{w}}\boldsymbol{G}_{\boldsymbol{22}}^{\boldsymbol{-1}}\boldsymbol{G}_{\boldsymbol{21}}& \boldsymbol{T}_{\boldsymbol{w}}\end{array}\right]$$

where the ${\boldsymbol T}_w=\left(1-w\right)\boldsymbol T+w{\boldsymbol G}_{22}$, the $w$ is a weighted parameter between 0 and 1, which indicated the proportion of genetic relationships that were not explained by gene expression levels.

The inverse of $\boldsymbol{{Hm}_{w}}$:

$$\boldsymbol{{Hm}_{w}^{-1}}=\boldsymbol{{G}^{-1}}+\left[\begin{array}{cc}0& 0\\ 0& \boldsymbol{{T}_{w}^{-1}}-\boldsymbol{{G}_{22}^{-1}}\end{array}\right]$$

Here, the gradient of parameter $w$ was set as 0–1 to gain the optimum predictive accuracies for Chinese Simmental beef cattle.

Assessing prediction performance

For experimental populations with transcriptome data (120 individuals), because of the limited group size, we adopt Leave One Out (LOO) to assess the predictive performance of GBLUP, MBLUP, and TBLUP. LOO is simple cross-validation (CV), which generated lower generalization errors than cross-validation in the small-scale population [41]. Each learning set is created by taking all the samples except one, the test set being the sample left out. Thus, for 120 samples, 119 individuals are used as the training set to train the model, and the remaining one is used as the test which was predicted using the trained model. Repeat the above process 120 to ensure that each individual is used as the training set once. And then, predictive accuracies were expressed Pearson’s correlation between 120 GEBVs and ${y}^{*}$. The formula was as follows: $r({y}^{*}, GEBV)=\frac{cov({y}^{*}, GEBV)}{\sqrt{var({y}^{*})}var\sqrt{GEBV}}$, where ${y}^{*}$ was the corrected phenotype.

For the resource population (1478 individuals), the predictive accuracies of GBLUP (based SNP data, assessed only in resource population), mssBLUP, and wmssBLUP were quantified with Pearson’s correlation based on five replicates of fivefold CV. The predictive accuracy performance of each method was the average Pearson correlation of 5 replicates in the validation subset.

Results

Compared performance between GBLUP, MBLUP, and TBLUP

The weighting parameter ratio in MBLUP was set with a gradient of 0.01–0.99 for determining the optimal coefficient for the M matrix. The predictive accuracies of MBLUP for four traits in the experiment population are shown in Fig. 2. The predictive accuracy of LDM, HWC, and pH traits has the same changing trend, with gradually decreased along with the increase of ratio value. When the ratio was 0, the MBLUP (or TBLUP) was far preferred to GBLUP. The predictive trend of the SF trait first increased and then decreased, reaching the maximum when the ratio was 0.41.

Predictive performance of GBLUP, mssBLUP, and wmssBLUP

For wmssBLUP, the gradient of weight parameter w was set as 0–1 to explore the optimum predictive accuracies for each trait. When $w$ was 0.1, 0.6, 0.5, and 0.4, respectively, the predictive ability of wmssBLUP reached its peak (Fig. S1). So, we used this group weighted value to assess the predictive performance of GBLUP, mssBLUP, and wmssBLUP for four traits (Fig. 3). The details of 25 Pearson’s correlation coefficients of GBLUP, mssBLUP, and wmssBLUP were listed in Table S1. The mssBLUP outperformed GBLUP for all four evaluated traits, and the average improvement was 3.37%, of which the accuracy improvement of LDM traits was the most obvious, reaching 7.50%. The wmssBLUP performed best among three models for all traits, while the average improvement of which over GBLUP was 4.18%, and over mssBLUP was 0.79%.

The proportion of transcribed population influenced the wmssBLUP

Additionally, except for selecting the optimum weighted parameter w, we also considered the influence of the proportion of transcribed data in the whole resource population on the statistical power of the wmssBLUP. Based on the weighted value w for each trait mentioned above, firstly, 600, 720, 840, 960, 1080, 1200, and 1320 individuals from the resource population (1358 cattle with genotype only) were randomly extracted, respectively. And then combined the selected groups with the experimental population (120 cattle with both genotyped and transcribed) to build the ${\boldsymbol H\boldsymbol m}_{\boldsymbol w}$ matrix and evaluated the prediction accuracy of wmssBLUP, respectively. The details of accuracies were listed in Table S2 which demonstrated the trend of the accuracy level of wmssBLUP with the group size. The accuracy of wmssBLUP and GBLUP showed an improved trend with increasing population scale (Fig. S2). However, the wmssBLUP consistently outperformed GBLUP in all cases, and the average improvement was 13.14%, 4.78%, 2.47%, and 8.67% for four traits, respectively. As seen in Fig. 4, the improved accuracy of the wmssBLUP over the GBLUP was getting smaller with decreasing proportion of transcribed cattle in the whole resource population (from 15.16% to 3.04%).

Discussion

Genomic selection (GS) is a promising method in modern molecular breeding practice because of its demonstrated effectiveness in predicting accuracies and shortening generation intervals. As a significant advance of next-generation sequencing, transcriptome and other omics data provide new information sources for phenotypic prediction. Meanwhile, the current progress of GS using genomic data alone has reached a bottleneck and integrating multi-omics as a novel predictive factor analysis may be a promising way to improve accuracies [42]. Based on our previous study, An et al. [37] defined a Cosine similarity matrix (CS matrix), which was generated in the process of kernelizing the term-frequency of the minor allele (0, 1, and 2) in high-dimensional feature space. The results showed the CS matrix had a well-matched performance for the G matrix (calculated by VanRaden [7]) with a computational efficiency increase by 20 times. In this research, we used the Cosine trick to kernelize the genome and transcriptome data to an ${n}\times {n}$ relationship matrix, named G and T matrix (distinct from the traditional G matrix of VanRaden). Afterward, the proposition of integrating multiple omics is transformed into the conventional calculation problem of matrixes. Therefore, we evaluated three kernel-based investigating methods with the above foundation: MBLUP, mssBLUP, and wmssBLUP.

For MBLUP in this study, we set a weighted parameter, valued between 0 and 1, aimed to explore an optimal ratio for each trait. The reported studies had proven that the resource population scale significantly determined the reliability of genomic predictions [43, 44]. Due to the reason that the population size of MBLUP was 120, the accuracy of HWC and SF appeared in some unusual values, actually the accuracy of SF was 0.28 in general [45]. Here, we mainly emphasized the effects of modeling transcriptome data into the prediction model, that was, comparing the trend of MBLUP, GBLUP (ratio = 1), and TBLUP (ratio = 0). This matched those observed in earlier studies [46, 47]. Typically, in this small experiment population with limited condition and cost, participation of gene expression data in GS had the potential to improve genomic predictions.

Before the genoty** technology matured a decade ago, GP was limited by the number of animals for which both genotypes and phenotypes were available. A single-step BLUP (ssBLUP) was proposed to predict GEBV using information from genotype and pedigree simultaneously [48]. The weighted single-step BLUP (wssBLUP) was derived from ssBLUP. This model was more accurate by using a weighted scaled and properly augmented relationship matrix ($\boldsymbol H$ matrix) [49, 50]. Similarly, integrating multi-omics for GP faced the dilemma that large-scale studies systematically measuring the association between omics data and traits have been hampered because of the specimen availability and cost. Therefore, we developed mssBLUP and wmssBLUP, in which the $\boldsymbol H\boldsymbol m$ and ${\boldsymbol H\boldsymbol m}_w$ matrix both were constructed by kernel-based G and T matrix. In this study, compared with traditional GBLUP, wmssBLUP and mssBLUP had 4.18% and 3.37% average improvement of predictive accuracy. And the weighted coefficient w changed with the data structure. This is also in agreement with our earlier observations, which showed that predictive ability can be improved when combining transcripts with SNPs, but it depends on the traits [51]. For another, the mssBLUP and wmssBLUP appeared to be more reliable and robust than MBLUP. While the degree of improvement of wmssBLUP decreased with the lessening proportion of transcribed cattle in the whole resource population (Fig. S2). These results corroborate the findings of a great deal of the previous work in GBLUP and ssBLUP, that was modeling increasing predictive factors, including pedigree information, genome, transcriptome, and metabolic data, even intergenic, gene, exon, protein-coding sequences et al. showed a common tendency that could partly contribute to the improvements of phenotype prediction [6, 21]. 2) The complex interaction of multi-omics should be modelled, while consider the overfitting problem. Xu et al. observed that the predictive ability decreased when combining transcriptome and metabolic data into GP for six yield-related traits in maize [54]. 3) In addition, a rising concern was that the gene expression levels were affected by several factors, such as tissue specificity, time of sampling, and experimental conditions. It is necessary to balance the costs and genetic gains of using transcriptomic in-formation in genomic predictions. Essentially, we emphasize the feasibility of this opinion, that is, the kernel algorithm is taken to map the original data into an ${n}\times {n}$ relationship matrix and then build linear regression with the phenotypes.

Conclusions

In the present study, we proposed three Cosine kernel-based methods to investigate multi-omics data: MBLUP, mssBLUP, and wmssBLUP. Our results showed MBLUP was far preferred to GBLUP (ratio = 1) in four traits. While, wmssBLUP and mssBLUP outperformed GBLUP, and the average improvement was 4.18% and 3.37%. We also found the transcriptome data has the potential to improve genomic predictions if they can be included on a larger scale. Our research proposes reliable and robust alternatives for the present situation in that large numbers of individuals were genotyped, but fewer were transcribed.

Availability of data and materials

The datasets used during the current study are available from the corresponding authors on reasonable request. The genotype datasets are available from the Dryad Digital Repository https://datadryad.org/stash/dataset/doi:10.5061/dryad.4qc06.

Abbreviations

BLUP:: Best linear unbiased prediction
CV:: Cross-validation
GBLUP:: Genomic BLUP
GEBVs:: Genomic breeding values
GP:: Genomic prediction
GS:: Genomic selection
GWASs:: Genome-wide association studies
LD:: Linkage disequilibrium
LDM:: Longissimus dorsi muscle
mssBLUP:: Multi-omics single-step BLUP
MAF:: Minor allele frequency
MBLUP:: Multi-omics BLUP
MCMC:: Markov Chain Monte Carlo
ML:: Machine learning
QC:: Quality control
RF:: Random Forest
RKHS:: Reproducing kernel Hilbert spaces regression
SVR:: Support vector regression
ssGBLUP:: Single-step BLUP
SF:: Shear force
wmssBLUP:: Weighted multi-omics single-step BLUP
WHC:: Water holding capacity

References

Boyle EA, Li YI, Pritchard JK. An expanded view of complex traits: from polygenic to omnigenic. Cell. 2017;169(7):1177–86.
Article CAS PubMed PubMed Central Google Scholar
Wu Y, Zheng Z, Visscher PM, Yang J. Quantifying the map** precision of genome-wide association studies using whole-genome sequencing data. Genome Biol. 2017;18(1):86.
Article PubMed PubMed Central Google Scholar
De los Campos G, Gianola D, Allison DB. Predicting genetic predisposition in humans: the promise of whole-genome markers. Nat Rev Genet. 2010;11(12):880–6.
Article PubMed CAS Google Scholar
Meuwissen TH, Hayes BJ, Goddard ME. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001;157(4):1819–29.
Article CAS PubMed PubMed Central Google Scholar
Legarra A, Christensen OF, Aguilar I, Misztal I. Single step, a general approach for genomic selection. Livest Sci. 2014;166:54–65.
Article Google Scholar
Georges M, Charlier C, Hayes B. Harnessing genomic information for livestock improvement. Nat Rev Genet. 2019;20(3):135–56.
Article CAS PubMed Google Scholar
VanRaden PM. Efficient methods to compute genomic predictions. J Dairy Sci. 2008;91(11):4414–23.
Article CAS PubMed Google Scholar
Legarra A, Aguilar I, Misztal I. A relationship matrix including full pedigree and genomic information. J Dairy Sci. 2009;92(9):4656–63.
Article CAS PubMed Google Scholar
Whittaker JC, Thompson R, Denham MC. Marker-assisted selection using ridge regression. Genet Res. 2000;75(2):249–52.
Article CAS PubMed Google Scholar
de los Campos G, Naya H, Gianola D, Crossa J, Legarra A, Manfredi E, et al. Predicting quantitative traits with regression models for dense molecular markers and pedigree. Genetics. 2009;182(1):375–85.
Article PubMed PubMed Central CAS Google Scholar
Gianola D, Okut H, Weigel KA, Rosa GJM. Predicting complex quantitative traits with Bayesian neural networks: a case study with Jersey cows and wheat. BMC Genet. 2011;12(1):87.
Article CAS PubMed PubMed Central Google Scholar
Maenhout S, De Baets B, Haesaert G, Van Bockstaele E. Support vector machine regression for the prediction of maize hybrid performance. Theor Appl Genet. 2007;115(7):1003–13.
Article CAS PubMed Google Scholar
Svetnik V, Liaw A, Tong C, Culberson JC, Sheridan RP, Feuston BP. Random forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inf Comput Sci. 2003;43(6):1947–58.
Article CAS PubMed Google Scholar
Kung SY. Kernel methods and machine learning. Cambridge: Cambridge University Press; 2014.
Book Google Scholar
Habier D, Fernando RL, Kizilkaya K, Garrick DJ. Extension of the bayesian alphabet for genomic selection. BMC Bioinformatics. 2011;12(1):186.
Article PubMed PubMed Central Google Scholar
González-Camacho JM, Ornella L, Pérez-Rodríguez P, Gianola D, Dreisigacker S, Crossa J. Applications of machine learning methods to genomic selection in breeding wheat for rust resistance. Plant Genome. 2018;11(2):170104.
Article Google Scholar
Howard R, Carriquiry AL, Beavis WD. Parametric and nonparametric statistical methods for genomic selection of traits with additive and epistatic genetic architectures. G3. 2014;4(6):1027–46.
Article PubMed PubMed Central Google Scholar
Martini JW, Wimmer V, Erbe M, Simianer H. Epistasis and covariance: how gene interaction translates into genomic relationship. Theor Appl Genet. 2016;129(5):963–76.
Article CAS PubMed Google Scholar
Forsberg SK, Bloom JS, Sadhu MJ, Kruglyak L, Carlborg Ö. Accounting for genetic interactions improves modeling of individual quantitative trait phenotypes in yeast. Nat Genet. 2017;49(4):497–503.
Article CAS PubMed PubMed Central Google Scholar
Albert FW, Kruglyak L. The role of regulatory variation in complex traits and disease. Nat Rev Genet. 2015;16(4):197–212.
Article CAS PubMed Google Scholar
Li Z, Gao N, Martini JWR, Simianer H. Integrating gene expression data into genomic prediction. Front Genet. 2019;10:126.
Article CAS PubMed PubMed Central Google Scholar
Guo Z, Magwire MM, Basten CJ, Xu Z, Wang D. Evaluation of the utility of gene expression and metabolic information for genomic prediction in maize. Theor Appl Genet. 2016;129(12):2413–27.
Article CAS PubMed Google Scholar
Klau S, Jurinovic V, Hornung R, Herold T, Boulesteix AL. Priority-Lasso: a simple hierarchical approach to the prediction of clinical outcome using multi-omics data. BMC Bioinformatics. 2018;19(1):322.
Article PubMed PubMed Central Google Scholar
Xu S, Xu Y, Gong L, Zhang Q. Metabolomic prediction of yield in hybrid rice. Plant J. 2016;88(2):219–27.
Article CAS PubMed Google Scholar
National Academies of Sciences, Engineering, and Medicine. Nutrient requirements of beef cattle: eighth revised edition. Washington, DC: The National Academies Press; 2016. https://doi.org/10.17226/19014.
NY/T 1333-2007. Determination of meat quality of livestock and poultry. Bei**g: Ministry of Agricultre of the PRC; 2007.
Google Scholar
NY/T 1180-2006. Determination of meat tenderness and shear force method. Bei**g: Ministry of Agricultre of the PRC; 2006.
Google Scholar
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–75.
Article CAS PubMed PubMed Central Google Scholar
S A. FastQC: a quality control tool for high throughput sequence data. In. http://www.bioinformatics.babraham.ac.uk/projects/fastqc; 2010.
Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32(19):3047–8.
Article CAS PubMed PubMed Central Google Scholar
Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30(15):2114–20.
Article CAS PubMed PubMed Central Google Scholar
Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015;12(4):357–60.
Article CAS PubMed PubMed Central Google Scholar
Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. Graph-based genome alignment and genoty** with HISAT2 and HISAT-genotype. Nat Biotechnol. 2019;37(8):907–15.
Article CAS PubMed PubMed Central Google Scholar
Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, et al. Twelve years of SAMtools and BCFtools. Gigascience. 2021;10(2):giab008.
Article PubMed PubMed Central CAS Google Scholar
Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30(7):923–30.
Article CAS PubMed Google Scholar
Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT, Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol. 2015;33(3):290–5.
Article CAS PubMed PubMed Central Google Scholar
An B, Liang M, Chang T, Duan X, Du L, Xu L, et al. KCRR: a nonlinear machine learning with a modified genomic similarity matrix improved the genomic prediction efficiency. Brief Bioinform. 2021;22(6):bbab132.
Article PubMed Google Scholar
Marques DBD, Bastiaansen JWM, Broekhuijse M, Lopes MS, Knol EF, Harlizius B, et al. Weighted single-step GWAS and gene network analysis reveal new candidate genes for semen traits in pigs. Genet Sel Evol. 2018;50(1):40.
Article PubMed PubMed Central CAS Google Scholar
Terakado APN, Costa RB, de Camargo GMF, Irano N, Bresolin T, Takada L, et al. Genome-wide association study for growth traits in Nelore cattle. Animal. 2018;12(7):1358–62.
Article CAS PubMed Google Scholar
Zhou C, Li C, Cai W, Liu S, Yin H, Shi S, et al. Genome-wide association study for milk protein composition traits in a Chinese holstein population using a single-step approach. Front Genet. 2019;10:72.
Article CAS PubMed PubMed Central Google Scholar
Rao RB, Fung G, Rosales R. On the dangers of cross-validation. an experimental evaluation. in: proceedings of the 2008 siam international conference on data mining (SDM). Society for Industrial and Applied Mathematics; 2008: 588–96.
Huang S, Chaudhary K, Garmire LX. More is better: recent progress in multi-omics data integration methods. Front Genet. 2017;8:84.
Article CAS PubMed PubMed Central Google Scholar
Morgante F, Huang W, Maltecca C, Mackay TFC. Effect of genetic architecture on the prediction accuracy of quantitative traits in samples of unrelated individuals. Heredity (Edinb). 2018;120(6):500–14.
Article Google Scholar
VanRaden PM, Van Tassell CP, Wiggans GR, Sonstegard TS, Schnabel RD, Taylor JF, et al. Invited review: reliability of genomic predictions for North American Holstein bulls. J Dairy Sci. 2009;92(1):16–24.
Article CAS PubMed Google Scholar
Zhu B, Guo P, Wang Z, Zhang W, Chen Y, Zhang L, et al. Accuracies of genomic prediction for twenty economically important traits in Chinese Simmental beef cattle. Anim Genet. 2019;50(6):634–43.
Article CAS PubMed PubMed Central Google Scholar
Acharjee A, Kloosterman B, Visser RG, Maliepaard C. Integration of multi-omics data for prediction of phenotypic traits using random forest. BMC Bioinformatics. 2016;17 Suppl 5(Suppl 5):180.
Article PubMed CAS Google Scholar
Schrag TA, Westhues M, Schipprack W, Seifert F, Thiemann A, Scholten S, et al. Beyond genomic prediction: combining different types of omics data can improve prediction of hybrid performance in maize. Genetics. 2018;208(4):1373–85.
Article CAS PubMed PubMed Central Google Scholar
Su G, Lund MS, Sorensen D. Selection for litter size at day five to improve litter size at weaning and piglet survival rate. J Anim Sci. 2007;85(6):1385–92.
Article CAS PubMed Google Scholar
Christensen OF, Lund MS. Genomic prediction when some animals are not genotyped. Genet Sel Evol. 2010;42(1):2.
Article PubMed PubMed Central Google Scholar
Forni S, Aguilar I, Misztal I. Different genomic relationship matrices for single-step analysis using phenotypic, pedigree and genomic information. Genet Sel Evol. 2011;43(1):1.
Article PubMed PubMed Central Google Scholar
Morgante F, Huang W, Sørensen P, Maltecca C, Mackay TFC. Leveraging multiple layers of data to predict drosophila complex traits. G3 (Bethesda, Md). 2020;10(12):4599–613.
Article CAS Google Scholar
Hu X, **e W, Wu C, Xu S. A directed learning strategy integrating multiple omic data improves genomic prediction. Plant Biotechnol J. 2019;17(10):2011–20.
Article CAS PubMed PubMed Central Google Scholar
Gao N, Martini JWR, Zhang Z, Yuan X, Zhang H, Simianer H, et al. Incorporating gene annotation into genomic prediction of complex phenotypes. Genetics. 2017;207(2):489–501.
Article CAS PubMed PubMed Central Google Scholar
Xu Y, Xu C, Xu S. Prediction and association map** of agronomic traits in maize using multiple omic data. Heredity (Edinb). 2017;119(3):174–84.
Article CAS Google Scholar

Download references

Acknowledgements

This work was supported by funds from the National Natural Science Foundations of China (32172693) and the Program of National Beef Cattle and Yak Industrial Technology System (CARS-37). China Agriculture Research System of MOF and MARA supported statistical analysis and writing of the paper.

Author information

Mang Liang and Bingxing An contributed equally to this work.

Authors and Affiliations

Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Bei**g, 100193, People’s Republic of China
Mang Liang, Bingxing An, Tianpeng Chang, Tianyu Deng, Lili Du, Keanning Li, Sheng Cao, Yueying Du, Lingyang Xu, Lupei Zhang, Xue Gao, Junya Li & Huijiang Gao

Authors

Mang Liang
View author publications
You can also search for this author in PubMed Google Scholar
Bingxing An
View author publications
You can also search for this author in PubMed Google Scholar
Tianpeng Chang
View author publications
You can also search for this author in PubMed Google Scholar
Tianyu Deng
View author publications
You can also search for this author in PubMed Google Scholar
Lili Du
View author publications
You can also search for this author in PubMed Google Scholar
Keanning Li
View author publications
You can also search for this author in PubMed Google Scholar
Sheng Cao
View author publications
You can also search for this author in PubMed Google Scholar
Yueying Du
View author publications
You can also search for this author in PubMed Google Scholar
Lingyang Xu
View author publications
You can also search for this author in PubMed Google Scholar
Lupei Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Xue Gao
View author publications
You can also search for this author in PubMed Google Scholar
Junya Li
View author publications
You can also search for this author in PubMed Google Scholar
Huijiang Gao
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

ML wrote, and JL and HG revised the paper. BA and LD performed experiments. TD, KL, SC, and YD collected the data. LX, XG, and LZ participated in the design of the study and contributed to the acquisition of data. All authors reviewed and approved the final manuscript.

Corresponding author

Correspondence to Huijiang Gao.

Ethics declarations

Consent for publication

Not applicable.

Competing of interests

The authors declare no conflict of interest.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1:

Fig. S1. The determination of the weight parameter w in wmssBLUP.

Additional file 2:

Fig. S2. The comparison of the accuracy of wmssBLUP and GBLUP in different population scales.

Additional file 3:

Table S1. The details of 25 Pearson’s correlation coefficients of GBLUP, mssBLUP, and wmssBLUP for LDM, HWC, SF, and pH.

Additional file 4:

Table S2. The summary of the accuracy of wmssBLUP and GBLUP for increasing population scale.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Liang, M., An, B., Chang, T. et al. Incorporating kernelized multi-omics data improves the accuracy of genomic prediction. J Animal Sci Biotechnol 13, 103 (2022). https://doi.org/10.1186/s40104-022-00756-6

Download citation

Received: 23 March 2022
Accepted: 08 July 2022
Published: 20 September 2022
DOI: https://doi.org/10.1186/s40104-022-00756-6

Incorporating kernelized multi-omics data improves the accuracy of genomic prediction

Abstract

Background

Results

Conclusions

Similar content being viewed by others

RNA-Seq Data Analysis in Galaxy

Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2

RefFinder: a web-based tool for comprehensively analyzing and identifying reference genes

Background

Methods

Data collection

Resource population

Genomic prediction comparison models

Genomic best linear unbiased prediction (GBLUP)

Kernel trick

Multi-omics best linear unbiased prediction (MBLUP)

Multi-omics single-step best linear unbiased prediction (mssBLUP)

Weighted multi-omics single-step best linear unbiased prediction (wmssBLUP)

Assessing prediction performance

Results

Compared performance between GBLUP, MBLUP, and TBLUP

Predictive performance of GBLUP, mssBLUP, and wmssBLUP

The proportion of transcribed population influenced the wmssBLUP

Discussion

Conclusions

Availability of data and materials

Abbreviations

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Consent for publication

Competing of interests

Additional information

Publisher’s Note

Supplementary Information

Additional file 1:

Additional file 2:

Additional file 3:

Additional file 4:

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation