Introduction

Traditionally, techniques that introduce specific mutations or foreign DNA at the site of a targeted gene, either to inactivate it or to correct a faulty copy, have been among the most widely used approaches in modern biology for the functional elucidation of genes. Even today, they are routinely used as standard methods to investigate a range of model organisms, such as mouse, plant, zebrafish, Drosophila, nematode, and bacteria. In general, gene function is studied using dominant-negative, knock-in, and complete, partial, tissue-specific, or conditional knockout approaches, chosen according to the needs of the individual investigation. Moreover, recent advances in techniques involving CRISPR/Cas9 have not only expedited transgenesis but also rejuvenated the field of therapeutics as a potential tool for treating diseases such as lung cancer as well as the ongoing pandemic, COVID-19 [5, 6, 36, 48]. Indeed, these techniques have proven powerful in understanding the minutiae of gene function, such as how a specifically located amino acid residue in a particular peptide, and its corresponding DNA sequence in the gene, plays a crucial role in determining its function. For example, in knock-in mouse models, the p53 gene is engineered to harbor mutations that are generally found in human sporadic cancer cases having either a mutant or a non-functional p53 gene [22]. Unsurprisingly, these mutations cause different syndromes and cancers in humans. Additionally, each mutation presents a distinct phenotype in mice, suggesting diversity in the mechanisms of p53 regulation in different microenvironments, tissues, and genetic backgrounds. However, the difference in phenotypes produced by the same p53 mutation in the two organisms cannot be completely explained by differences in genes, species, and microenvironment alone.

Currently, we understand that the central dogma alone cannot fully explain the behavior of the cell, and that complexity supersedes quantity. We now know that only a very small percentage (~ 2%) of our genome codes for functional proteins and that most of the genome is still beyond our limited understanding. The conventional view of the mammalian genome is that ~ 25,000 protein-coding genes are dispersed within a highly repetitive and largely non-transcribed sequence. Over the past decade, this view has been challenged by the discovery of several different and essential RNA species in mammalian cells, termed non-coding RNAs. This non-coding genome is interspersed with the coding genome in such an intricate manner that discriminating between the two is an extremely daunting task [51]. For instance, for functional proteins, coding regions tend to be much longer, and the presence of an ORF (open reading frame) of at least 300 nucleotides (100 aa) is commonly used to define a transcript as “coding,” whereas many long transcripts with known non-coding functions also contain multiple ORFs. These ORFs may give rise to proteins, might be translated inefficiently, or may even produce a non-functional protein that is rapidly degraded by proteasomes. These gray areas in defining coding and non-coding elements remain unexplored and may open new avenues of research. Although we have begun to understand the signatures and properties of this tessellated non-coding entity, it is still too early to anticipate or understand its full complexity.
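The ORF-length criterion mentioned above can be illustrated with a minimal sketch. It assumes a simplified definition (forward-strand frames only, ATG start codon, standard stop codons, length counted from the start codon up to but excluding the stop codon); the function names `longest_orf_length` and `looks_coding` are hypothetical and are not taken from any annotation pipeline.

```python
# Illustrative sketch only: a simplified ORF scan, not a substitute for a
# real annotation tool. Assumes forward-strand frames and an ATG start codon.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(seq: str) -> int:
    """Return the length (nt, stop codon excluded) of the longest
    ATG-initiated, stop-terminated ORF in the three forward frames."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA input
    best = 0
    for frame in range(3):
        start = None  # position of the first ATG of the currently open ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i - start)
                start = None
    return best


def looks_coding(seq: str, min_orf_nt: int = 300) -> bool:
    """Apply the >= 300 nt (100 aa) rule of thumb described in the text."""
    return longest_orf_length(seq) >= min_orf_nt
```

As the text notes, passing this threshold does not establish that a transcript is protein-coding: long non-coding transcripts can contain ORFs that satisfy it, which is exactly the gray area being described.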

The problem

The biology and engineering of “knocking out” genes become more complex because important regulatory elements in the form of non-coding RNAs, such as miRNAs, lncRNAs, and natural antisense transcripts (NATs), are present inside and outside of the traditionally defined coding sequence (Fig. 1). Hence, it would be incorrect to state that knocking out a gene by the available traditional approaches will produce a phenotype that can be attributed precisely to the loss of that gene alone. Until the end of the last century, and even currently, scientists have engineered numerous knockouts by deleting or modifying exon(s), e.g., by inserting reporter genes, by trapping the promoters and coding sequences, and by truncating a large part of the protein through insertion of a stop signal. However, the effect of unintentionally altering non-coding genes present within or outside the introns, and sometimes within exons, has not been taken into account in the process of knockout mouse generation. Moreover, the unintentional disruption of NATs present on the non-coding strand of DNA during knockout generation further complicates the matter, as they participate in various cellular regulatory processes via cis or trans mechanisms. Consider, for instance, the Cftr gene knockout mouse (Cftr−/−), which was generated by inserting an in-frame mutation in exon 10 to produce a truncated protein [47]. These Cftr knockout mice displayed a very strong phenotype, limiting their viability to a maximum of 40 days. The mouse Cftr gene has 28 exons, and there are several long intronic regions in the gene. Interestingly, a report published by Hill et al. on introns from CFTR demonstrated that introns alone are capable of coordinating the expression of functionally related genes [20]. They overexpressed three long intronic sequences (6a, 14b, and 23) from the CFTR gene in epithelial cells (HeLa), in which CFTR is not normally expressed.
They observed that the expression of the CFTR introns caused extensive, specific, and highly reproducible transcriptional changes, affecting genes linked to CFTR function. The authors posited that, since these transfected cells do not express the CFTR protein-coding transcript, the observed effects were caused by the intronic sequences. Because none of the three intronic sequences includes any known miRNAs or predicted stem-loop structures, they seem to act in trans as long ncRNA regulatory elements [20]. Similarly, constructs containing common selection markers/reporter genes like GFP, EGFP, Neor, LacZ, and DsRed are often left within the target genome post-selection [9, 21, 29, 37, 62]. However, these genes themselves can become potential targets of miRNAs of host origin, e.g., Mus musculus, as discussed later. Therefore, it is reasonable to assume that the resulting phenotype can be attributed to the combined effect of “altering the specific coding gene” and the “other non-coding genes” that are affected inadvertently by the genetic engineering method used to generate the knockout organism. This work attempts to highlight the presence and/or disruption of these non-coding elements.

Fig. 1

Probable mechanisms of inadvertent sequence changes in transgenic mice. Genes of foreign origin such as those from humans and marker genes become potential targets of murine miRNAs once expressed within the cells of knockin/transgenic mice. Non-coding elements such as NATs and lncRNAs may get co-disrupted along with the target gene and contribute to the resulting phenotype of the mice

Analysis

Coding region or mRNA sequences of the transgenes were retrieved from the NCBI nucleotide database and used as target sequences for analysis. The custom miRNA prediction tool available at miRDB, an online database for miRNA target prediction [7], was utilized to search for Mus musculus miRNAs potentially targeting the mRNAs generated from commonly used reporter genes, Cre recombinase (Table 1), and human genes expressed in transgenic mouse models (Table 2). An arbitrary minimum cutoff value of 60 was selected for the target SCORE for selection of miRNAs in cases where several miRNAs with a wide range of scores were retrieved.
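The score cutoff described above can be sketched as a simple filtering step. This is a minimal illustration, not the authors' actual workflow: it assumes the prediction results have already been exported into (miRNA, target, score) records, and the `Prediction` type, the `filter_by_score` function, and the record layout are all hypothetical rather than the miRDB output format.

```python
# Illustrative sketch only: the record layout below is an assumption and
# does not reflect the actual export format of miRDB.
from typing import Iterable, List, NamedTuple


class Prediction(NamedTuple):
    mirna: str    # miRNA identifier as reported by the prediction tool
    target: str   # name of the transgene or reporter whose mRNA was queried
    score: float  # target score assigned by the prediction tool


def filter_by_score(predictions: Iterable[Prediction],
                    min_score: float = 60) -> List[Prediction]:
    """Keep predictions at or above the cutoff, highest score first."""
    kept = [p for p in predictions if p.score >= min_score]
    return sorted(kept, key=lambda p: p.score, reverse=True)
```

The default of 60 mirrors the arbitrary minimum cutoff stated in the text; any other threshold can be passed through `min_score`.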

Table 1 Mus musculus miRNAs against inserted foreign genes and corresponding transgenic mice. Reporter genes like Neor, LacZ, Cre recombinase, DsRed, and TdTomato have been widely used to generate transgenic mice. However, due to their non-mammalian origin (e.g., DsRed from Discosoma sp., GFP from Aequorea victoria, and LacZ from Escherichia coli K12), most of these genes may potentially be targeted by Mus musculus miRNAs once expressed in transgenic mice
Table 2 Human genes expressed in transgenic mice are targeted by murine miRNAs in corresponding transgenic mice. Human genes expressed in transgenic mice become potential targets of Mus musculus miRNAs due to their foreign nature. This miRNA-target mRNA interaction may often lead to interference with their expression in mice. The potentially targeting miRNAs were retrieved from miRDB using their custom target prediction tool

A search of previously published literature was performed for knockout/mutant mice in which the introduction of specific mutations/foreign DNA at the site of the targeted gene had also inadvertently disrupted lncRNAs or NATs. The affected genes and the co-disrupted non-coding elements were analyzed and compiled together with the publications that used the mice (Table 3).

Table 3 Genes, NATs, lncRNAs, and corresponding mice. List of mouse strains showing probable inadvertent partial/complete disruption of overlapping sequence of NATs on the antisense strand along with intended target genes

Results

Commonly used foreign genes targeted by Mus musculus miRNAs

The neomycin resistance gene (Neor) is one of the most widely used selection markers for correctly targeted cells, and the neomycin cassette itself is normally left within the genome post-selection, on the assumption that it has no adverse effect on eukaryotic cell biology [21, 50, 62]. On careful observation, however, the Neor gene construct itself is a potential target of several miRNAs of eukaryotic origin or, more specifically, of the miRNAs within the cells of transgenic mice carrying the neomycin cassette (Table 1). Similarly, lacZ is another widely used reporter, and its gene is often used in generating transgenic mice. A simple analysis revealed a similar fate for the lacZ gene as another strong target of several murine microRNAs (Table 1). Several other reporter genes that are widely used in mouse transgenic technology, such as GFP, EGFP, TdTomato, and DsRed, were also found to be potential targets of murine microRNAs (Table 1). Hence, it is reasonable to assume that any gene containing Neor, lacZ, GFP, EGFP, TdTomato, or DsRed variants can also be considered a de novo target of microRNAs of murine origin. Interestingly, one of the most widely used recombinase enzymes, Cre, which is used in mouse studies for fate mapping, stem cell homing, and gene deletion, is also a potential target of several murine microRNAs (Table 1). Using the miRDB custom prediction tool [7], we searched for potential Mus musculus miRNAs that could target the abovementioned foreign genes that are frequently used in the generation of transgenic mouse strains (Table 1). Based on the analyzed data, we propose that the phenotype produced by interfering with the gene of interest may not be due solely to the disruption of that particular gene but to the combined interference with the gene of interest and the associated non-coding elements.
Additionally, these reporter genes or other elements of a targeting vector that are deliberately left in the mouse may very well act as sponges/sinks for the miRNAs or other non-coding RNAs, thus interfering with the normal physiology of the cell.

Human genes in transgenic mice targeted by Mus musculus miRNAs

Over the last three decades, transgenic mice expressing human genes have proven to be an efficient tool to model human diseases. These murine models have successfully accelerated the drug discovery process and contributed to the knowledge base of the underlying molecular mechanisms of those diseases [23, 28]. However, because the human genes expressed in these mouse models are foreign, they may often become targets of murine miRNAs, which may interfere with their expression in mice. For example, the CETP gene-containing mouse, or APOE*3-Leiden.CETP mouse, is widely used in atherosclerosis research and has been very useful in understanding lipid metabolism and in drug discovery [24, 52]. Mice naturally lack the cetp gene, but these transgenic mice express the human CETP gene. Interestingly, our analysis indicates that both the 3’ UTR and the coding sequence of this gene are potential targets of several murine microRNAs (e.g., miR-149-3p) (Table 2). Similarly, in the three strains of transgenic mice expressing the human ACE2 gene that are currently being used to model the effect of SARS-CoV-2, the CDS of the inserted hACE2 gene is also a potential target of murine miRNAs (Table 2) [34, 35, 49, 57]. Mice expressing human TNF-α, IL-8, APOA1, APOA5, and HD5 are further examples in which murine microRNAs are predicted to target the inserted human genes (Table 2). Using the miRDB custom prediction tool [7], we retrieved murine miRNAs that can potentially target the aforementioned human genes commonly expressed in transgenic mice to model human disease. Our data suggest that the observed phenotype in these mice may not be a result of the inserted transgene alone, but rather a combined effect of the inserted transgene and the endogenous microRNAs acting on the foreign gene.

Co-disruption of natural antisense transcripts (NATs) and long non-coding RNAs (lncRNAs) with the gene of interest in knockout mice

Recent years have seen a rising number of studies investigating the role of natural antisense transcripts (NATs) in eukaryotes. These studies have shed light on their cis- as well as trans-activity in gene regulation at various levels, and NATs have been shown to play a crucial regulatory role in eukaryotic gene expression [3, 55, 64]. Generally, these are non-protein-coding, fully processed mRNAs that are transcribed from the opposite strand of protein-coding sense transcripts [4]. In currently used transgenic techniques, while introducing mutations at the target site of the gene of interest, we often disrupt not only the sequence of the target gene but also the partially/completely overlapping sequence of genes for NATs on the antisense strand. Although the disruption of NATs may be inadvertent, it interferes with their cis-/trans-activity. Hence, the resulting knockout phenotype would have to be attributed to the disruption of both the target gene and the corresponding overlapping NAT sequence. This should make us reconsider assigning the status of “bona fide mutant for the target gene only” to the transgenic mice generated in such cases. We performed a literature search for mice with co-disruption of target genes and overlapping NATs and found several such cases (Table 3). For instance, Hoxd-3 knockout mice have been created by insertion of the pD3Neo2TK vector carrying 11.7 kb of Hoxd-3 sequence, with disruption of Hoxd-3 at nucleotide 82 of exon 1 by an MC1neo poly-A cassette [12]. Murine Hoxd-3 has 3 exons and 2 introns and has a 5’ end overlap (4137 bp) with its antisense regulatory element “hoxd3os1,” and the disruption of exon 1 (size 324 bp) also results in the disruption of intron 2 in “hoxd3os1” due to the overlap. Hence, the resulting phenotype should be attributed to the disruption of both of these elements.
Similarly, double-mutant mice were created with a targeted disruption in hoxa-3 and hoxd-3, in which the hoxd-3 disruption is of a similar nature and would likewise contribute to the resulting phenotype [13]. Another example of NAT disruption in genetically engineered mice is “Airn” in the Igf2r mutant mouse. Igf2r has 48 exons and has a 28,395 bp overlap with its natural antisense transcript “Airn,” a long non-coding RNA. Airn is responsible for silencing the insulin-like growth factor 2 receptor gene and flanking genes in mice. The overlap spans exon 1, exon 2, intron 1, and a major portion of intron 2. Igf2r knockout mice were created by replacing 0.33 kb of 5’ flanking sequence and 38 codons of exon 1 with a neomycin resistance gene (Neor) cassette [33]. This would also replace a portion of intron 1 of Airn and hence contribute to the phenotype originally attributed to the disruption of Igf2r alone. Similarly, in Dlx-1/2 floxed conditional knockout mice, Dlx-1 has a 3343 bp overlap with its natural antisense transcript “Dlx-1as,” spanning exons 2 and 3 and intron 2 completely and a portion of intron 1. These mice have been generated by introducing loxP sites located between exons 1 and 2 of both the Dlx-1 and Dlx-2 genes (found in the opposite orientation on chromosome 2, 9427 bp apart from each other) [45]. Dlx-1/2 floxed mice were crossed with Olig1-Cre knockin mice, which completely excised exons 2 and 3 and intron 2 of each gene and the intervening ~ 10 kbp sequence (which contains Dlx-1as on the complementary strand in that region). Therefore, the deletion of the entire Dlx-1as would also contribute to the resulting phenotype along with the deletion of Dlx-1 and Dlx-2. In Msx-1 conditional KO mice, Msx-1, a 4059 bp long homeobox gene, has a 2187 bp overlap with its natural antisense transcript “Msx1os,” spanning portions of exon 2 and the single intron of Msx-1.
Conditional KO mice of Msx-1 and Msx-2 have been generated by introducing 2 loxP sites flanking exon 2 of each gene and crossing them with Msx-2Cre mice to obtain a global knockout of Msx-1 and Msx-2 [17]. Although the loxP sites present within the intron and in the sequence downstream of exon 2 had no apparent effect on Msx-1 and Msx-2 gene functions, no account has been reported of the disruption of “Msx1os” in the region opposite the intron (on the complementary strand) due to loxP site insertion. After global deletion of exon 2 via Cre excision, the change in phenotype would have to be attributed to the disruption/knockout of both Msx-1 and Msx1os. Several other mammalian gene-NAT pairs have been reported elsewhere [43]. These analyses demonstrate that the inadvertent disruption of NATs has been overlooked in the otherwise rigorous scheme of transgenesis and warrants a fresh look at the biology that is affected by it.
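The reasoning applied to each example above, namely checking whether an engineered deletion or insertion also falls within an annotated antisense transcript, amounts to an interval-overlap test. The sketch below is a minimal illustration under stated assumptions: coordinates are 1-based and inclusive, all features lie on the same chromosome, and the `Interval` type and the functions `overlap_bp` and `co_disrupted` are hypothetical names. It is not the procedure used to compile Table 3, which was based on a literature search.

```python
# Illustrative sketch only: assumes 1-based inclusive coordinates on a
# single chromosome. Strand is stored for reporting but ignored in the
# overlap test, because a deletion or insertion alters both DNA strands.
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Interval:
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def overlap_bp(a: Interval, b: Interval) -> int:
    """Number of base pairs shared by two intervals (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def co_disrupted(modification: Interval,
                 annotations: Iterable[Interval]) -> Dict[str, int]:
    """Map each annotated feature touched by the modification to the
    number of overlapping base pairs."""
    hits = {}
    for feature in annotations:
        shared = overlap_bp(modification, feature)
        if shared > 0:
            hits[feature.name] = shared
    return hits
```

Given coordinates for a targeting event and for the sense gene, its NAT, and nearby lncRNAs, `co_disrupted` would list every feature that shares at least one base pair with the modified region, which is the situation in which the phenotype cannot be assigned to the target gene alone.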

Conclusion

Technically, this is a limitation of the biological system itself that we may never be able to overcome. In most cases, man-made mutations introduced into the mouse genome ultimately affect both strands of DNA and hence the non-coding genes, whereas a natural mutation in the form of a point mutation may not affect the other strand. However, when a natural mutation/deletion affects a large part of a chromosome, we must acknowledge the phenotype as a collective representation of both coding and non-coding gene disruptions. This can also be seen in mice where unknown modifiers from different genetic backgrounds interact with the same targeted gene and contribute to anomalous differences in phenotype. For example, in the first documented case describing the influence of genetic background on gene expression, the diabetes (db) and obese (ob) mutations on a B6 background were shown to cause only obesity and transient diabetes, whereas on a C57BLKS/J (BKS) background they caused obesity and severe diabetes [10, 11]. However, in addition to the modifier genes, we might also be seeing the effects of these non-coding genes, which play essential roles in cellular processes and are affected by genetic deletions. Conversely, we often observe that knocking out a gene does not produce the expected results. Commonly, this is explained as the gene not being crucial for either development or maintenance. However, one can argue that the effect of altering the coding gene at a locus is compensated by the simultaneous loss of non-coding gene(s) at the same position. The African proverb “When elephants fight, it is the grass that suffers” aptly describes the fate of “non-coding genes.” Because of our incomplete understanding of the complexity of non-coding entities in the past, there is a strong possibility that these components of the genome were inadvertently affected while engineering knockout mice.
Hence, it becomes critical to revisit the old methods of generating knockouts in light of our current understanding and to examine the transgenic strategy and the affected gene functions more carefully. Nevertheless, developing strategies to single out a particular gene function without affecting other associated non-coding elements will be a highly complex task. It should be noted, however, that this may not necessarily be true for all the knockouts created to date. Our work warrants the use of already established mouse lines in further research.