Introduction

Recent advances in genomics have bolstered our capacity to delve deeper into the complex etiologies of various diseases. Among these approaches, Genome-Wide Association Studies (GWAS) have been a major driver of progress and have greatly improved our understanding of the genetic factors that underpin many important complex traits and diseases. However, this research still faces many limitations that demand innovative solutions to unlock its full potential. Many of these problems arise from the insufficient ability of current modeling methods to capture the full degree of underlying biological complexity. The number of genetic variants is vast, and their individual effects are often small and context-specific. Fully mapping the mechanisms that translate genetic variation into phenotypes requires fine-grained integration of genetic analysis with other types of knowledge and omics domains. By their nature, these aspirations closely align with problems currently being encountered and solved in other data-intensive domains, both within biology and more widely. In this review we outline promising synergies and solutions from recent advances in deep learning (DL) and explain their potential applications in biomedical research.

Extensive datasets offer insights into the molecular mechanisms behind various diseases and create novel opportunities for predictive modeling in precision medicine, particularly in the realm of drug efficacy prediction. However, the complexity and volume of omics data pose significant challenges for extracting meaningful information and deciphering essential biological context. Within high-throughput genomics, as in other omics domains, technological advances have brought with them both the typical challenges associated with an overabundance of data and unique problems, such as those arising from the methodological specifics of GWAS. Meaningful interpretation of such data is complicated, as most contemporary profiling strategies capture an outwardly extensive yet functionally narrow view of the entire biological system. Some key challenges currently affecting genomics research within the context of these overarching themes are briefly outlined below.

One often-encountered criticism of the GWAS approach is its focus on common genetic variants with modest effects, which often do not explain a significant part of the genetic contribution to complex traits [1]. This issue arises due to the presence of rare variants with very large contributions, complex epistatic effects, and interactions between genes and the environment. Even once a GWAS has led to the identification of potentially relevant variants, it is still essential to determine their functional consequences and causality, which can be complicated and often requires additional targeted experiments [2]. Distinguishing causal variants from other markers in linkage disequilibrium blocks and understanding their mechanistic effects on observable phenotypes requires some form of additional information. This type of research will therefore greatly benefit from reliable, automated integration of functional genomics [3], epigenomics [4], and transcriptomics [5].

Attempting to explain ever more complex natural processes inevitably requires comparably complex models. This necessitates the use of advanced algorithms to automate model construction and fitting, which results in 'black-box' models that, although very accurate, usually lack human interpretability. Increasing complexity can ultimately limit the usefulness of these methods for biological discovery (as it can obscure hints about potential mechanisms) and undermine trust in such tools for clinical applications, one example being polygenic risk scores for complex diseases [6]. Additionally, many genetic studies have to deal with suboptimally small numbers of samples, especially in the case of rare diseases, variants with subtle effects, or those confined to specific population subgroups. Artificial intelligence (AI) and machine learning (ML) techniques [7, 8] offer multiple new strategies to address these limitations and, most importantly, improve our ability to integrate knowledge and data across multiple layers of biological organization.

ML algorithms encompass a broad range of computational methods that enable computers to construct predictive models leading to actionable insights. In medical genomics, ML techniques have found diverse applications, including, but not limited to, predictive disease classification, biomarker discovery, drug response prediction, and identification of disease-causing genetic variants. Classical ML techniques, such as Random Forest [9], Support Vector Machines [10], and logistic regression [11], have been instrumental in omics data analysis. These methods excel at handling high-dimensional data and have demonstrated success in various genomics research endeavors. For a comprehensive survey of the typical applications of standard ML techniques in biomedical research, readers are referred to existing review articles [12,13,14,15,16,17]. However, as the dimensions and layers of data grow, so do the limitations of these classical methods. One significant drawback is their insensitivity to the relationships underlying the data, such as those among genes, which are often crucial in omics; these methods mostly operate on 'tabular' data, in which variables are represented independently of one another. Capturing the potential relationships among genes or other elements can offer a wealth of information, as we describe in detail below.
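
As a point of reference, the sketch below shows a minimal classical ML baseline of the kind described above. The expression matrix X (samples by genes) and binary phenotype labels y are hypothetical placeholders filled with random values, not real data.

```python
# Minimal sketch of a classical ML baseline on tabular omics data.
# X and y are hypothetical placeholders for an expression matrix and
# binary phenotype labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5000))    # 200 samples x 5000 gene-level features
y = rng.integers(0, 2, size=200)    # binary phenotype labels

# Each column enters the model as an independent variable: the forest is
# never told which genes are functionally related, which is precisely the
# 'tabular' limitation discussed above.
clf = RandomForestClassifier(n_estimators=500, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())
```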

The emergence of DL has revolutionized the field of AI and transformed the landscape of data analysis. The ability of DL to automatically learn hierarchical representations from raw data has proven invaluable for predictive modeling, capturing intricate dependencies within datasets even when dealing with noisy and high-dimensional data. One particularly important development in DL is the creation of new methods to deal with small sample sizes, a recurring concern in genetic profiling of small populations and rare diseases. While the ideal application of DL aims to facilitate comprehensive representation learning of the underlying structures within omics data, practical challenges emerge when too few samples are available. In certain applications this can hinder adequate representation learning and complicate the direct processing of omics data within DL frameworks. This important limitation can be mitigated using a technique called transfer learning. Unlike classical statistical models, neural networks are fitted iteratively, meaning that they can be continuously updated with new data. This opens up the possibility of 'pre-training' a model on a large but only weakly or partially relevant dataset and then finalizing the training on a more valuable but smaller one. Patterns that persist between the two datasets benefit from access to all of the combined observations, whereas irrelevant patterns are simply overwritten by the new data. Transfer learning can thus facilitate the reuse of knowledge from larger datasets to substantially improve accuracy in smaller cohorts. In principle, pre-trained models initialized on large-scale genetic data can be re-specialized for new tasks, reducing the need for data collection while also improving performance. The potential and limitations of this strategy are discussed in detail in this review, and its benefits are demonstrated with several examples [18,19,20,21,22,23,24].

DL models offer a wide range of additional analysis capabilities that can improve many kinds of high-throughput biomedical analysis. Firstly, artificial neural networks can easily identify large numbers of interactions and model non-linear effects, while also offering very effective regularization to mitigate the risk of overfitting [25]. Secondly, modern neural networks can utilize very large input sizes, incorporate methods for imputation of missing values [26], and accommodate very diverse types of structured information. All of these capabilities offer additional ways to increase power to detect rare single-nucleotide polymorphisms (SNPs) and epistasis, and to model the full range of possible association patterns more accurately. DL models can jointly analyze and integrate heterogeneous data sources, allowing for a more comprehensive view of genetic contributions, and some recently introduced methods offer ways to integrate different omics data types within the same model and perform inference across them simultaneously. Of particular note is the DeepInsight family of approaches [18,19,20,21], where several studies have demonstrated successful cross-omics analysis that included cancer somatic variation as one of the input types. As this type of data is akin to germline variation, similar strategies can potentially be used in the future to enhance functional and causal annotation of SNPs.

Lastly, the increasing use of AI across all areas of life has brought to the fore the need for such systems to become more transparent and interpretable. This problem is a focus of increasing research interest, and explainable AI (XAI) is now emerging as a sub-discipline of AI research in its own right. As the vast majority of this work was done with DL-based models, most of these methods can be readily used with the most typical neural network architectures. Techniques such as attention and gradient-based attribution can in principle help to quantify the contribution of individual biological factors to disease risk and drug response, making the results more interpretable for biologists and clinicians. To demonstrate these potential benefits, this review uses the DeepFeature method as an illustrative example; it implements a gradient-based attribution approach to discover potential mechanisms involved in cancer drug efficacy [20, 21].
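
As a simplified illustration of the gradient-based attribution idea (not the published DeepFeature implementation), the sketch below computes input saliency for a small, untrained placeholder network; the model architecture and the 100-feature input are assumptions made purely for the example.

```python
# Hedged sketch of gradient-based attribution (saliency); the network and
# input are illustrative placeholders, not the DeepFeature implementation.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(100, 32), nn.ReLU(), nn.Linear(32, 1))
model.eval()

x = torch.randn(1, 100, requires_grad=True)  # one sample, 100 features
model(x).sum().backward()                    # d(output)/d(input)

# A large absolute gradient marks features whose perturbation most changes
# the prediction, yielding a per-feature importance score.
attribution = x.grad.abs().squeeze()
print(attribution.topk(5).indices)           # five most influential features
```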

Here we focus on biologically inspired convolutional neural networks (CNNs) [27], one of the fundamental architectures of the computer vision domain, where their adoption has led to unprecedented improvements in performance. Two-dimensional (2D) CNNs are extensively used for image data analysis, as they extract spatial features hierarchically from raw image data, beginning with low-level patterns such as edges and culminating in object prediction. While 2D CNNs have traditionally thrived in image analysis, recent interest has arisen in their application to omics data. Researchers have contemplated the prospect of harnessing the power of 2D CNNs for tabular or omics data analysis, which requires revealing the latent (sometimes called 'spatial') information inherent among the genes (or elements) within a sample (or feature vector) [28,29,30,31]. Zhou et al. [32] underscored the significance of DL, including CNNs, in predictive tasks such as determining the sequence specificity of DNA- and RNA-binding proteins and pinpointing cis-regulatory regions, among other applications. Notably, CNNs and recurrent neural networks have become the architectures of choice for modeling these regulatory elements from sequence patterns, illustrating the wide-ranging utility of DL in genomics. Talukder et al. further explore the intricacies of deep neural network (DNN) interpretation methods, particularly their applications in genomics and epigenomics [33]. This breadth of application also extends to synthetic biology, emphasizing its promise in plant and animal breeding [34]. Nonetheless, existing reviews have not extensively addressed how to effectively handle tabular data, such as omics data without explicit spatial patterns, by converting it into representations adequate for CNNs.
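
For readers less familiar with the architecture, the following minimal sketch shows the hierarchical feature extraction described above; the layer sizes and the single-channel 64x64 input are illustrative assumptions rather than any published design.

```python
# Minimal sketch of a 2D CNN; layer sizes and input shape are illustrative.
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # low-level features (edges)
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # higher-level compositions
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(32, 2),                             # e.g. case/control output
)

x = torch.randn(8, 1, 64, 64)   # a batch of 8 image-like samples
print(cnn(x).shape)             # torch.Size([8, 2])
```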

With the emergence of converter techniques such as DeepInsight [18], a groundbreaking development has transpired: the conversion of tabular data, such as omics data, into image-like representations. This transformative conversion empowers the effective harnessing of CNNs for analysis. DeepInsight, a pioneering technique, revolutionizes data preprocessing by recovering the latent information among genes or elements within a feature vector. This reimagining of the data arranges elements with similar characteristics as close neighbors, while dissimilar elements remain distant. The resulting spatial context creates a rich environment in which CNNs can operate not only feasibly but also insightfully. Unlike traditional ML techniques, which handle variables independently and sometimes select only representative ones, this approach gathers similar variables together and treats them as a group, reflecting the structure underlying the omics data.
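
To make the idea concrete, the sketch below is a deliberately simplified reconstruction of such a conversion: it embeds the genes (columns of a hypothetical matrix X) in 2D with t-SNE so that similar genes land near each other, then rasterizes each sample onto a pixel grid. The published DeepInsight pipeline includes further steps (such as convex-hull alignment of the embedding) that are omitted here.

```python
# Simplified sketch of a DeepInsight-style tabular-to-image conversion;
# this conveys only the core idea, not the published implementation.
import numpy as np
from sklearn.manifold import TSNE

def tabular_to_images(X, size=32):
    """X: (n_samples, n_genes) matrix -> (n_samples, size, size) images."""
    # Embed the *genes* (columns) in 2D so that similar genes land nearby.
    coords = TSNE(n_components=2, random_state=0).fit_transform(X.T)
    # Rescale coordinates to integer pixel positions on a size x size grid.
    coords -= coords.min(axis=0)
    coords = (coords / coords.max(axis=0) * (size - 1)).astype(int)
    images = np.zeros((X.shape[0], size, size))
    counts = np.zeros((size, size))
    for g, (r, c) in enumerate(coords):
        images[:, r, c] += X[:, g]   # genes sharing a pixel are averaged
        counts[r, c] += 1
    return images / np.maximum(counts, 1)

X = np.random.rand(100, 500)   # hypothetical 100 samples x 500 genes
print(tabular_to_images(X).shape)   # (100, 32, 32)
```

Each pixel of the resulting image thus aggregates genes with similar profiles, which is why neighboring pixels carry related biological signals that convolutional filters can exploit.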

To clarify further, when biological data are transformed into an image format, the latent relationships between biological entities, such as genes, are encoded as spatial proximities within the image. Applying a CNN to these images then allows for a substantial reduction in the number of model parameters. This reduction is achieved through the architectural design of convolutional layers, which are adept at sharing parameters among appropriate inputs, specifically in cases characterized by partial linear or even non-linear correlations among features. Given the prevalence of such correlations in biological data, the resulting models usually generalize better, while preserving the neural networks' innate ability to discover and model more complex features if and when needed. Moreover, these images facilitate the interpretation of results by explicitly showing the potential relationships between the biological entities that the model identifies as important, as will be explained in detail further on.
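
The scale of this parameter reduction is easy to verify. The back-of-the-envelope comparison below contrasts a fully connected layer with a single shared 3x3 convolutional filter on the same 64x64 input; the sizes are arbitrary choices for illustration.

```python
# Parameter sharing in convolutions vs. a dense layer on a 64x64 input.
import torch.nn as nn

dense = nn.Linear(64 * 64, 64 * 64)               # one weight per input-output pair
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)  # one 3x3 filter shared everywhere

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dense))  # 16,781,312 parameters
print(count(conv))   # 10 parameters (9 shared weights + 1 bias)
```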

One additional notable advantage is the ability to utilize transfer learning, negating the need to create a network from scratch. This attribute allows for all-encompassing learning across a diverse spectrum of omics data, unlocking novel avenues for comprehensive analysis. Through transfer learning, models can be initialized with weights from a pre-trained model [35], typically developed using extensive and diverse image datasets such as ImageNet. These pre-trained models have already learned essential patterns from millions of natural images, hierarchically capturing universal features that are surprisingly effective when repurposed for distinct tasks, even in seemingly unrelated domains such as genomics. This approach allows researchers to capitalize on the foundational knowledge embedded in these pre-trained models, drastically reducing the computational effort and time required for training, and often enhancing performance.
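
In practice this often amounts to only a few lines of code; the hedged sketch below loads ImageNet weights and swaps in a task-specific output head. The choice of ResNet-18 and the two-class head are illustrative assumptions, not a recommendation drawn from the cited works.

```python
# Sketch of ImageNet transfer learning; ResNet-18 and the 2-class head
# are illustrative choices.
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pre-trained feature extractor; only the new head will train.
for p in model.parameters():
    p.requires_grad = False

# Replace the final classifier for, e.g., case/control prediction on
# image-like omics representations; fine-tuning proceeds as usual from here.
model.fc = nn.Linear(model.fc.in_features, 2)
```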

Utilizing transfer learning with pre-trained models offers a unique advantage for omics data analysis [18, 19]. Genomic datasets, unlike publicly available image datasets, are often limited in size. Leveraging the patterns learned by models from vast image datasets through transfer learning can provide a robust foundation, enabling researchers to fine-tune these models for the specifics of omics data without the need for large training sets. Additionally, transfer learning allows for the extraction of intricate and nuanced patterns from omics data that might be overlooked or unattainable when training a model from scratch. The power of transfer learning with CNNs is vividly showcased in various applications beyond image processing, demonstrating its potential to revolutionize data analysis across fields [18, 22, 36, 37].

The adaptability of DeepInsight is evident through its applications across various domains, including its pivotal role in shaping the winning model ('Hungry for gold') of the Kaggle.com competition [19, 20, 22, 23, 38].