INTRODUCTION

Although human genomes are 99.9% identical, it is precisely the remaining 0.1% of genetic variants that underlies phenotypic differences, including susceptibility to diseases [1]. These genetic variations comprise Single Nucleotide Variations (SNVs) or Single Nucleotide Polymorphisms (SNPs), insertions/deletions (indels), and Structural Variations (SVs) of more than 50 bp in length [2]. The most widespread type of genetic variation is the SNP, i.e., a DNA sequence variation (a variant allele) of one nucleotide in size among members of the same species that occurs within a population at a frequency of at least 1% [3]. SNPs occur every 200-300 bp in the genome and are localized in both its coding and regulatory parts (promoters, enhancers, introns, and untranslated regions) [4, 5]. The importance of studying SNPs lies in the fact that such genetic variants are often associated with various diseases, as has been shown by numerous Genome-Wide Association Studies (GWAS). About 95% of the clinically significant SNPs are localized in non-coding genome regions [6], and their functional significance is probably associated with changes in the regulatory characteristics of the regions surrounding the polymorphism [7]. Such regulatory regions of the eukaryotic genome may be promoters, enhancers, 5′- and 3′-untranslated regions (UTRs) of protein-coding genes, gene regions of non-coding RNA (ncRNA), and splicing regulatory elements (SREs) [5, 8]. Promoters initiate gene transcription, and enhancer elements increase the rate of this initiation [9]. Promoters are the preferred sites for binding of transcription factors (TFs) and RNA polymerase II to DNA and include the region of the first transcribed nucleotide of the transcript (transcription start site, TSS) [10]. 
Enhancers, which were first identified by reporter analysis as elements capable of enhancing reporter gene expression [11], are platforms for TF binding that can act irrespective of orientation, distance, and localization relative to the target gene [12]. The 5′- and 3′-UTRs are part of the mature coding mRNA and play an important role in post-transcriptional regulation of gene expression. For example, 5′-UTRs contain various regulatory components that influence translation initiation, and 3′-UTRs comprise sequences that bind microRNA and lead to transcript degradation [5]. In addition, non-coding polymorphisms within UTRs could also be involved in transcription regulation, because the 5′-UTR sequence usually overlaps with the promoter region of the gene, while the 3′-UTR sequence could overlap with other regulatory elements, e.g., enhancers [13]. Non-coding polymorphisms are also localized in ncRNA; in recent years, much information has been obtained about their effects on RNA maturation, transcription regulation, chromatin remodeling, and post-transcriptional modifications of RNA [14].

Being the most frequently occurring class of genetic variants, SNPs are the major genetic markers for Quantitative Trait Loci (QTL) mapping; they can be conditionally divided into those regulating gene expression directly at the transcriptional and chromatin levels and thereby affecting the mRNA level (eQTL – expression QTL regulating gene expression at the transcriptional level), and those influencing post-transcriptional processes (sQTL – splicing QTL regulating alternative splicing of pre-mRNA; pQTL – protein QTL regulating protein expression) [15]. The following mechanism of functional effects of polymorphisms at the genomic level could be suggested: the functions of regulatory elements are impaired due to a change in the sequence of the sites for TF–DNA interaction (either a decrease or an increase in binding efficiency) [16]. At the post-transcriptional level, non-coding polymorphisms could affect the activity of the 5′- and 3′-UTRs of mRNA, which play a key role in translation regulation and mRNA stability, including through changes in regulatory microRNA binding [46], for analysis of the sequence overlap and assessment of the effects of particular nucleotides on activity of these sequences.

High-throughput reporter assays of polymorphic variants include the Massively Parallel Splicing Assay (MaPSY) [47], which was used to study impaired splicing in autism spectrum disorders. The screening results were used to characterize genetic variants in the TNRC6C, MAPK8IP1, and USP45 genes, and it has been shown that proteins of the TNRC6 family could increase the risk of autism development [48]. Recently, an in vivo Cre-dependent MPRA method has been proposed for functional analysis of a library of 3′-UTRs carrying genetic variants associated with autism. Transcripts, whose levels depend on the activity of the regulatory element, were quantified in particular types of neurons after transduction of the libraries into the brain tissues of mice with tissue-specific expression of Cre recombinase. This method makes it possible to study the regulatory effect in a more relevant cellular context, because neurons have a markedly different expression profile of trans-acting factors (e.g., TFs and microRNA) compared to other cell lines [49].

The main limitation of the methods based on reporter assays is the absence of the relevant chromatin context that accompanies the regulatory element in the native genome. This limitation is partially overcome in the lentiMPRA technique, in which the library of regulatory elements under study is created in a lentiviral vector that integrates into the genome, facilitating analysis of transcription within the chromatin context [50].
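The readout of the massively parallel reporter assays discussed above can be illustrated with a minimal sketch. It assumes that reporter activity of each library element is summarized as log2(RNA/DNA) from barcode counts and that the allelic effect is the difference between the alternative and reference alleles; the function names, the pseudocount, and the counts are hypothetical, and library-size normalization and replicate statistics are deliberately omitted.

```python
import math


def reporter_activity(rna_count, dna_count, pseudocount=1.0):
    """Return log2(RNA/DNA) for one library element.

    The pseudocount (an assumption of this sketch) avoids division by zero
    for elements with no reads.
    """
    return math.log2((rna_count + pseudocount) / (dna_count + pseudocount))


def allelic_skew(ref_counts, alt_counts, pseudocount=1.0):
    """Return activity(alt) - activity(ref); each argument is an (rna, dna) pair."""
    alt_activity = reporter_activity(alt_counts[0], alt_counts[1], pseudocount)
    ref_activity = reporter_activity(ref_counts[0], ref_counts[1], pseudocount)
    return alt_activity - ref_activity


# Hypothetical barcode counts for the two alleles of one SNP.
ref_counts = (799, 399)  # (RNA reads, DNA reads) for the reference allele
alt_counts = (399, 399)  # (RNA reads, DNA reads) for the alternative allele

skew = allelic_skew(ref_counts, alt_counts)
print(f"allelic skew (alt - ref), log2 units: {skew:.2f}")
```

A negative value indicates lower reporter activity for the alternative allele; in practice such differences would be tested across replicates and barcodes before a variant is considered functional.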

FUNCTIONAL ANALYSIS OF GENETIC POLYMORPHISMS IN THE NATIVE GENOMIC CONTEXT

With regard to the effects of genetic variants on the pathogenesis of a disease, it is important to take into account the chromatin context which, in turn, varies between different types and functional states of cells. The eQTL mapping per se makes it possible to relate a particular genotype to changes in the mRNA levels of potential target genes in the native genomic context, including tissue specificity [51, 52]. Functional relationships between genes and distant regulatory loci can be found by determining the 3D chromatin organization using methods such as Hi-C (high-throughput chromosome conformation capture), ChIA-PET (chromatin interaction analysis with paired-end tag sequencing), and their modifications [53, 54]. Comparison of tissue-specific 3D genomic maps with disease-associated regulatory SNPs makes it possible to identify the genes most probably involved in pathogenesis. Hence, the most accurate method for verifying the resulting hypotheses is genome editing and production of cells with the desired combinations of variants. Precise and efficient editing of particular nucleotides in the human genome has become a challenging but feasible task owing to the RNA-programmable bacterial nucleases found in the CRISPR (clustered regularly interspaced short palindromic repeats)-Cas system [55]. The double-strand break (DSB) in DNA induced at the target site by the Cas9 nuclease from Streptococcus pyogenes (currently the most popular genome editor) triggers cellular mechanisms of DNA repair, including homology-directed repair (HDR) [56], which is exploited in the CRISPR-HDR methods, in which the target region is repaired in the presence of a homologous DNA sequence containing the required allele variant.
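Choosing a Cas9 target site close to the SNP to be edited can be sketched as follows. The sketch assumes the canonical S. pyogenes Cas9 rules: a 20-nt protospacer followed by an NGG PAM, with the DSB about 3 bp upstream of the PAM. The function names, the 10-bp distance threshold, and the demo sequence are illustrative assumptions rather than recommendations from the cited studies, and off-target or efficiency scoring is not attempted.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_cas9_sites(seq, snp_pos, max_dist=10, guide_len=20):
    """List SpCas9 (NGG PAM) sites whose cut lies within max_dist bp of snp_pos.

    Coordinates are 0-based on the forward strand. 'cut' is the index of the
    first base to the right of the predicted double-strand break.
    Each hit is (strand, protospacer 5'->3', PAM 5'->3', cut).
    """
    seq = seq.upper()
    sites = []
    for i in range(len(seq) - 2):
        # Forward strand: protospacer seq[i-guide_len:i], PAM seq[i:i+3] = NGG.
        if seq[i + 1:i + 3] == "GG" and i >= guide_len:
            cut = i - 3  # break 3 bp upstream of the PAM
            if abs(cut - snp_pos) <= max_dist:
                sites.append(("+", seq[i - guide_len:i], seq[i:i + 3], cut))
        # Reverse strand: the PAM reads CCN on the forward strand at seq[i:i+3],
        # and the protospacer occupies seq[i+3:i+3+guide_len].
        if seq[i:i + 2] == "CC" and i + 3 + guide_len <= len(seq):
            cut = i + 6  # break 3 bp into the protospacer, counted from the PAM
            if abs(cut - snp_pos) <= max_dist:
                protospacer = reverse_complement(seq[i + 3:i + 3 + guide_len])
                pam = reverse_complement(seq[i:i + 3])
                sites.append(("-", protospacer, pam, cut))
    return sites


# Hypothetical locus: the SNP sits at index 15 of this made-up sequence.
demo_seq = "ACGT" * 5 + "AGG" + "ACGTACGTAC"
demo_sites = find_cas9_sites(demo_seq, snp_pos=15)
for strand, protospacer, pam, cut in demo_sites:
    print(strand, protospacer, pam, cut)
```

Sites with the cut nearest to the SNP are usually examined first, since the donor template has to introduce the desired allele close to the break; the threshold used here is only a placeholder.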

This method is used in many polymorphism studies [74]. Genome sequences associated with specific proteins in their native chromatin context are identified by the ChIP-Seq technique, which combines chromatin immunoprecipitation with subsequent high-throughput DNA sequencing [75]. Sequences optimal for binding of a particular TF (probably not existing in nature) are found using SELEX methods, in which libraries of randomly generated oligonucleotides are enriched with specific sequences exhibiting high affinity to a given TF [76]. There are well-known databases of position weight matrix (PWM) motifs, including TRANSFAC [77], HOCOMOCO [78], JASPAR [79], HOMER [80], iRegulon [81], etc. Application of bioinformatics makes it possible to assess potential changes in the strength of TF binding depending on the variant of the polymorphism. The efficiency of allele-specific TF binding can be estimated directly from ChIP-Seq data, if the sequencing depth allows detection of statistically significant deviations in the frequencies of alternative SNP alleles in the binding site [82, 83]. A combination of ChIP with quantification of alleles, ChIP-AS-qPCR (ChIP-based allele-specific quantitative PCR), makes it possible to measure the effects of allele variants on the efficiency of TF binding in a living cell [57]. A high-throughput variant of the analysis of TF binding to polymorphisms in regulatory regions, SNP-SELEX, which is based on HT-SELEX, has been proposed. This method allows analysis of the effects of about 100,000 allele variants of potentially regulatory (GWAS-annotated) SNPs on binding of several hundred TFs [84]. The classical method for analysis of DNA–protein interactions based on shifts in electrophoretic mobility (electrophoretic mobility shift assay, EMSA) can also be considered an experimental approach to TF identification. 
During EMSA, the proteins under study specifically bind to labeled oligonucleotide probes, after which the mobility of such fragments is analyzed by electrophoresis in polyacrylamide gel under native conditions; the relative strength of binding can be assessed from the amount of the complex formed [85]. Specific identification of the protein components in the complexes is achieved by adding antibodies against a specific protein to the reaction (EMSA–supershift) [86]. There are also high-throughput methods for analysis of large numbers of SNPs that reveal the effects of allele variants on TF binding; they are based on incubation of SNP-containing oligonucleotides with a nuclear extract from a particular cell type, followed by sequencing of the enriched libraries. Such methods are SNPs-Seq [57] and Reel-Seq [87]. Neither of these methods per se makes it possible to establish which TF binds to a particular allele variant; however, such information can be obtained by mass spectrometry and/or by using a purified TF instead of the nuclear extract [24, 88].
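The in silico assessment of how an SNP changes a PWM match, mentioned above, can be sketched in a few lines. This is a toy example under stated assumptions: the 4-bp frequency matrix is invented (it is not taken from HOCOMOCO, JASPAR, or any other database), the background is uniform, only the forward strand is scanned, and input sequences contain only A, C, G, and T. Dedicated tools additionally scan both strands and convert scores to P-values.

```python
import math

# Hypothetical position frequency matrix for a 4-bp motif (one dict per position).
TOY_PFM = [
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
]


def to_log_odds(pfm, background=0.25):
    """Convert frequencies to log2-odds weights against a uniform background."""
    return [{base: math.log2(p / background) for base, p in column.items()}
            for column in pfm]


def best_window_score(seq, snp_pos, pwm):
    """Return the best PWM score among all windows that overlap snp_pos."""
    width = len(pwm)
    first = max(0, snp_pos - width + 1)
    last = min(snp_pos, len(seq) - width)
    best = float("-inf")
    for start in range(first, last + 1):
        score = sum(pwm[k][seq[start + k]] for k in range(width))
        best = max(best, score)
    return best


def allele_scores(flank5, ref, alt, flank3, pwm):
    """Score both alleles of an SNP placed between the two flanking sequences."""
    snp_pos = len(flank5)
    return {allele: best_window_score(flank5 + allele + flank3, snp_pos, pwm)
            for allele in (ref, alt)}


toy_pwm = to_log_odds(TOY_PFM)
scores = allele_scores("TTC", "G", "C", "GAAGT", toy_pwm)
print(scores, "difference (ref - alt):", scores["G"] - scores["C"])
```

Here the reference allele completes the toy motif and scores higher than the alternative allele; a difference of this kind is what motif-based annotation reports as a predicted change in binding strength, and it still requires experimental confirmation.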

Bioinformatics resources suitable for analysis of an SNP of interest include the on-line resource PERFECTOS-APE https://opera.autosome.org/perfectosape [76], where the predicted TF binding motifs are collected from various databases: HOCOMOCO [78], JASPAR [79], HT-SELEX [89], etc. Another bioinformatics resource, ADASTRA [82], which provides comprehensive data on allele-specific TF binding to allele variants in different types of cells, is based on the HOCOMOCO and SPRy-SARUS data [90], as well as on allele-specific data of the DNase footprinting assay [91]. The ANANASTRA resource [92], based on the systematic analysis of allelic imbalance in ChIP-Seq experiments, makes it possible to annotate a great number of genetic variants in parallel.
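The idea of allelic imbalance in ChIP-Seq reads at a heterozygous SNP can be illustrated with an exact binomial test. This is a simplified sketch, not the statistical model of ADASTRA or ANANASTRA: it assumes that, without allele-specific binding, each read carries either allele with probability 0.5, and it ignores reference mapping bias and copy-number differences. The function names and read counts are hypothetical, and the direct summation is intended for read depths up to a few hundred.

```python
from math import comb


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial P-value for k successes in n trials.

    Sums the probabilities of all outcomes that are no more likely than
    the observed one.
    """
    probs = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    observed = probs[k]
    tail = sum(pr for pr in probs if pr <= observed * (1 + 1e-9))
    return min(1.0, tail)


def allelic_imbalance_p(ref_reads, alt_reads):
    """P-value for a deviation of the allele read counts from a 1:1 ratio."""
    return binomial_two_sided_p(ref_reads, ref_reads + alt_reads)


# Hypothetical read counts covering one heterozygous SNP in a TF binding site.
p_skewed = allelic_imbalance_p(ref_reads=30, alt_reads=10)
p_balanced = allelic_imbalance_p(ref_reads=20, alt_reads=20)
print(f"30 vs 10 reads: P = {p_skewed:.4f}")
print(f"20 vs 20 reads: P = {p_balanced:.4f}")
```

When many SNPs are tested in parallel, the resulting P-values would additionally need correction for multiple testing.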

One example of using such annotation is the functional characterization of the SNPs rs7873784 and rs71327024 localized in the regulatory regions of the TLR4 and CXCR6 genes, respectively [13, 31]. According to the results of GWAS, both SNPs are disease-associated: the minor C allele of rs7873784 is associated with rheumatoid arthritis, and the minor T allele of rs71327024 is associated with severe COVID-19. The reporter assays showed that both SNPs are raQTL; therefore, bioinformatics analysis was used to find TFs relevant for the respective types of cells and characterized by allele-dependent binding to the SNP-containing sites: PU.1 (rs7873784) and c-Myb (rs71327024). This hypothesis was verified by knockdown of the TFs using small interfering RNA (siRNA), as well as by the DNA pull-down immunoprecipitation technique [93]. The latter includes incubation of oligonucleotides containing alternative SNP variants with a nuclear extract from the relevant cells and immunoprecipitation with specific antibodies against the predicted TF, followed by quantification of the enriched oligonucleotides. The described methods for identification of transcription factors whose binding efficiency depends on the allele of the polymorphism are shown in Fig. 2.

Fig. 2. Methods for identification of functional transcription factors with allele-specific binding to the region of polymorphism (the image was produced using BioRender.com).

Owing to the continuously increasing amounts of data and modern machine learning models, bioinformatic computations provide increasingly precise annotation of candidate TFs with allele-specific binding to the SNP region [94-96]. However, clinical validation and, a fortiori, application of these data in diagnostics and possibly treatment of diseases are possible only after experimental validation in different types of cells in the relevant functional context.

CONCLUSIONS

To date, meta-analysis of large amounts of experimental data makes it possible to develop bioinformatics tools for finding the most probable functional genetic variants, as well as for predicting particular mechanisms of their effects on the pathogenesis of diseases. The overwhelming majority of genetic variants are localized in the non-coding regions of the genome; they affect the functions of genes by regulating their expression. Such regulation could vary widely depending on the type and functional state of cells, which is not always taken into consideration by in silico methods involving statistical generalizations. In view of the above, versatile experimental techniques remain relevant for characterization of particular genetic variants. The most informative approach to studying the effects of genetic variants on phenotype is the development of precise genetic models using genome editing techniques. However, because precise genome editing is a difficult procedure, preliminary characterization of the allele variants under study by reporter assays remains relevant.