Introduction

Lung cancer, a genetically heterogeneous disease, is one of the leading causes of cancer incidence and mortality. It accounts for ~2.1 million new cases and 1.8 million deaths worldwide1. In India, lung cancer is the chief cause of cancer-related mortality in both men and women2, and its incidence is rising at an alarming rate, accounting for 11.3% of all new cancers and 13.7% of cancer-associated deaths3,4,5. Within North India, the Union Territory of Jammu and Kashmir (J&K) is at greater risk of mortality from various cancers. A recent study reported that lung cancer and breast cancer have the highest incidence in the Jammu region of J&K, followed by esophageal cancer. A study of the Kashmir region found gastric carcinoma to be the most commonly occurring cancer overall, followed by lung carcinoma (9%)6. Despite several efforts to improve the 5-year survival rate of lung cancer patients, it remains at 15–20%, the lowest of all cancers7. The candidate gene approach (CGA) and genome-wide association studies (GWAS) have proved to be significant tools for interpreting the genetic complexity and heterogeneity of such disorders through association studies. Through successive GWAS, more than 60 genetic loci have been linked with non-small cell lung cancer (NSCLC) risk in recent years. Genetic variants have attracted significant attention as potential biomarkers for predicting disease susceptibility and as therapeutic targets8.

With this background, we targeted variants in genes that are critically important in biological pathways such as DNA damage and repair, invasion, metastasis, autophagy, circadian rhythm, apoptosis and signaling processes, namely TCF21, ERCC1, BRIP1, ARNTL, ERCC5, REV1, PIK3CA, CASC16, DDC and BCL2. This is the first genomic study from the region to target critical genes involved in the pathophysiology of non-small cell lung cancer. Such studies can provide a holistic view of the genetic landscape of non-small cell lung cancer in the population of Jammu and Kashmir, North India. With this perspective, we evaluated twelve genetic variants of ten genes that are critically important and were previously associated with various cancers, including lung cancer.

Results and discussion

Lung cancer is a major global health burden, contributing to more than a million deaths worldwide. Before the GWAS era, the identification and characterization of lung cancer loci was quite limited. GWAS, transcriptome-wide association studies (TWAS) and CGA have proved to be significant approaches for understanding the genetic complexity and heterogeneity of multifactorial disorders through association studies. Worldwide, more than 60 loci have so far been linked with lung cancer by GWAS and the candidate gene approach. Nevertheless, these genes are linked with multiple lung cancer pathways9. Susceptibility genes encoding enzymes involved in activation, cell-cycle pathways, circadian rhythm pathways and the repair of smoke-induced DNA damage, as well as genes involved in inflammatory and apoptotic processes, have been studied extensively. Insight into the genetic and molecular mechanisms is a precondition for improving clinical management and for progress towards novel therapeutic interventions. In the present study, we evaluated twelve genetic variants of ten genes that are critically important and were previously associated with various cancers, including non-small cell lung cancer. These genetic variants are associated with many biological pathways, such as DNA damage and repair, signaling processes, cell cycle, autophagy, circadian rhythm and apoptosis. Clinical and epidemiological parameters are listed in Table 1. The population enrolled in this study was genotyped for twelve genetic variants of ten genes: TCF21 (rs12190287), ERCC1 (rs2298881, rs11615), ERCC5 (rs751402), ARNTL (rs4757151, rs1026071), BRIP1 (rs4986764), REV1 (rs3792152), PIK3CA (rs2699887), CASC16 (rs3803662), DDC (rs2229080) and BCL2 (rs1801018), as listed in Supplementary Table 1.
Following quality control (QC) checks, the final data set comprised twelve genetic variants that passed the QC analyses and followed Hardy–Weinberg equilibrium (HWE); these were then tested for association with NSCLC. Of the twelve variants, six were significantly associated with non-small cell lung cancer (Table 2), whereas six showed no association with lung cancer risk in the population of J&K, North India (Table 3). Moreover, these genetic variations may interfere with epigenomic regulation and transcription factor binding sites10,11,12. The possible functional role of the variants was assessed using the GTEx v.7, UCSC, HaploReg v4.1, HSF (v.3.1) and ESE v.3 databases13,14. The findings for each variant are summarized below and described in Table 4 and Fig. 3.

Table 1 Distribution of clinical parameters between non-small cell lung cancer patients and healthy controls from the Jammu and Kashmir population.
Table 2 Allelic and genotypic distribution and logistic regression analysis of the significantly associated variants in our study.
Table 3 Allelic and genotypic distribution of the variants that did not show significant association with NSCLC in the population of J&K, North India.
Table 4 Putative role of the variants associated with NSCLC in the J&K population, North India, based on information from online databases including GTEx, the UCSC Genome Browser and HSF.

Genetic variants which showed significant association with non-small cell lung cancer in this study

Genetic variation in key genes that maintain genomic stability has been documented as a major factor in an individual's risk of developing cancer. The ERCC1 and ERCC5 genes are critically important factors in the nucleotide excision repair (NER) pathway. Excision repair cross-complementation group 1 (ERCC1) typically binds the XPF endonuclease (ERCC4) to form a heterodimeric endonuclease (XPF-ERCC1) during the excision step at the damaged site, as shown in Fig. 1. This dimeric complex is also important in interstrand crosslink repair and the homologous repair machinery, which activates RPC, PCNA and DNA polymerase δ/ε, followed by a ligation step that completes the repair process. Functional variation in ERCC1/ERCC5, which determines the DNA repair capacity of the cell and hence its genomic stability, may therefore be a potential risk factor in the early process of oncogenesis (Fig. 1). Various studies have been conducted in the recent past to demonstrate the association between these genetic polymorphisms and lung cancer risk15,16,17.

Figure 1

The DNA repair process, which includes identification of DNA damage by DNA damage-associated proteins, transduction of damage signals to the cellular machinery and, lastly, cell cycle arrest. A functional polymorphism in DNA repair genes can impair the repair capability and thus drive the cell towards oncogenesis and ultimately lung cancer (ChemBioDraw Ultra v.14.0.0.117).

rs11615

This case-control association study examined variants in DNA repair genes and NSCLC risk in the population of J&K, North India. The variant rs11615 is a synonymous variant of the ERCC1 gene. In this study, the major allele (A) of rs11615 (A/G) was associated with a significantly increased risk of non-small cell lung cancer, with an odds ratio (OR) of 1.96 (95% CI 1.23–3.11) and a p value of 0.006 (Table 2). These findings are consistent with a previous meta-analysis of pooled Asian and Caucasian populations16. Our study indicates that rs11615 of ERCC1 is a risk factor for NSCLC in the Jammu and Kashmir population.
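
As a rough consistency check on how an odds ratio, its 95% confidence interval and its p value relate, the interval can be converted back into an approximate standard error on the log-odds scale and then into a two-sided p value. The sketch below is not the analysis used in this study (which relied on adjusted logistic regression); it assumes a symmetric Wald interval on the log scale, and the function name `p_from_ci` is purely illustrative.

```python
import math


def p_from_ci(odds_ratio, lower, upper, z_crit=1.96):
    """Approximate z statistic and two-sided p value from an OR and its 95% CI.

    Assumes the interval is a Wald interval, symmetric on the log scale.
    """
    # Width of the log-scale interval equals 2 * z_crit * SE.
    se = (math.log(upper) - math.log(lower)) / (2.0 * z_crit)
    z = math.log(odds_ratio) / se
    # Two-sided tail probability of the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Values reported for rs11615 in the text above.
z, p = p_from_ci(1.96, 1.23, 3.11)
print(f"z = {z:.2f}, approximate p = {p:.4f}")
```

Because the published figures are rounded and come from a covariate-adjusted model, the back-calculated value is only expected to fall near the reported p value, not to reproduce it exactly.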

Furthermore, in the cis-eQTL analysis, the risk allele (A) was linked with downregulation of the gene in lung tissue (p value = 0.1; normalized effect size (NES) = −0.05). Since the gene is critical in the DNA repair process17, its downregulation might affect repair efficiency. Moreover, the locus carries histone marks (H3K4me1_Enh/H3K4me3_Pro/H3K27ac_Enh/H3K9ac_Pro) indicative of promoter and transcriptional regulatory activity, including active transcription start site (TSS) promoter activity. The consequence of this variant for the ERCC1 gene was also examined in silico. The widely used HSF algorithms for predicting enhancer/silencer motifs indicated that rs11615 breaks sites for SF2/ASF (IgM/BRCA1) and SF2/ASF and creates new sites. The observed variation in splicing factor binding at the exonic splicing enhancer (ESE)/intronic site suggests a role in epigenomic regulation (Table 4 and Fig. 3).

rs2298881

Another variant, rs2298881, is an intronic variant of ERCC1 and was significantly associated with non-small cell lung cancer; however, the major allele (A) of rs2298881 (C/A) was protective against NSCLC, with an odds ratio (OR) of 0.66 (95% CI 0.48–0.91) and a p value of 0.012 (Table 2). The results are consistent with a previous meta-analysis suggesting that rs2298881 is not a risk-associated polymorphism in lung cancer16.

Moreover, in the cis-eQTL analysis, the associated allele (A) was related to downregulation of the gene in lung tissue (p value = 2.4E−15; normalized effect size (NES) = −0.36). Since the gene is vital to the DNA repair process, its downregulation might affect repair capacity. Furthermore, the locus carries histone marks (H3K4me1_Enh/H3K4me3_Pro/H3K27ac_Enh/H3K9ac_Pro/DNase hypersensitive) suggesting promoter and transcriptional regulatory activity, including active transcription start site (TSS) promoter activity. The influence of this variant on ERCC1 was also examined in silico. The prediction tools suggested that rs2298881 breaks a site for SRp40. The observed alteration in splicing factor binding at the exonic splicing enhancer (ESE)/intronic site indicates a possible effect on epigenetic processes (Table 4 and Fig. 3).

rs751402

Variant rs751402 is a 5′ UTR variant of ERCC5. In the present study, the major allele (A) of rs751402 (A/G) showed significant association with non-small cell lung cancer risk, with an odds ratio (OR) of 1.46 (95% CI 1.00–2.13) and a p value of 0.02 (Table 2). This variant has been extensively studied in different cancers (gastric, breast and salivary gland tumours) in different population groups18,19,20,21, including lung cancer22. The present study also indicates that rs751402 is a risk factor for NSCLC in the J&K population.

Cis-eQTL analysis showed that the risk allele (A) is significantly related to upregulation of the gene in lung tissue (p value = 6.2E−4; normalized effect size (NES) = 0.14). Since the gene is essential for the DNA repair process, its upregulation might affect the nucleotide excision repair pathway. Moreover, the region of interest carries histone marks (H3K4me3_Pro/H3K27ac_Enh/H3K9ac_Pro/H3K4me1_Enh/DNase), signifying a role in epigenetic regulation. The in silico analysis also indicated that rs751402 creates a new site for Tra2-β and breaks a site for SRp40. This change in splicing factor binding at the exonic splicing enhancer (ESE)/intronic site may influence the physiology of the gene (Table 4 and Fig. 3).

Genomic instability is associated with the early process of oncogenesis. Many essential genes maintain genome stability and complexity by participating in the DNA damage response and repair machinery23. One such important gene is BRIP1 (BRCA1 Interacting Protein C-Terminal Helicase 1), which encodes a member of the RecQ DEAH helicase family that interacts with the BRCT repeats of breast cancer type 1 protein (BRCA1). The resulting complex is critical in normal double-strand break repair. BRIP1 encodes a 1249-amino-acid protein that colocalizes with BRCA1 at DNA damage sites and contributes to its DNA repair function24. During DNA double-strand break repair, BRCA2 interacts with RAD51 to form the BRCA2/RAD51 complex. The complex colocalizes to damage-induced foci, where the actual DNA repair process takes place25. BRIP1 is critically important in maintaining genomic stability by regulating the GM1/2 checkpoints and CHK1 activation, as shown in Fig. 2.

Figure 2

Biological interactions and role of BRIP1. BRIP1 (BRCA1 Interacting Protein C-Terminal Helicase 1) encodes a member of the RecQ DEAH helicase family that interacts with the BRCT repeats of breast cancer type 1 protein (BRCA1). The resulting complex is critical in normal double-strand break repair (ChemBioDraw Ultra v.14.0.0.117).

rs4986764

Variant rs4986764 is a missense variant of the BRIP1 gene. The study evaluated the genetic association of rs4986764 with NSCLC risk in the population of Jammu and Kashmir, North India. The major allele (A) of rs4986764 (A/G) showed significant association with non-small cell lung cancer risk, with an odds ratio (OR) of 1.47 (95% CI 1.12–1.94) and a p value of 0.006 (Table 2). Various studies have demonstrated an effect of rs4986764 in BRIP1 in multiple cancers, including non-small cell lung cancer26,27,28. Some studies have shown that genetic variation in any of the associated genes results in reduced repair efficiency, which drives the cell towards oncogenesis26. Thus, the present study indicates that rs4986764 (BRIP1) is a risk factor for non-small cell lung cancer in the Jammu and Kashmir population, North India.

Cis-eQTL analysis suggested that the risk allele (G) is significantly related to downregulation of the gene in lung tissue (p value = 3.8E−3; normalized effect size (NES) = −0.09). The gene is a key component of the DNA repair process24, so its downregulation might critically affect the DNA repair pathway. The influence of this variant on BRIP1 was also examined in silico, which indicated that rs4986764 breaks a site for SRp40. This alteration in splicing factor binding at the exonic splicing enhancer (ESE)/intronic site might disturb regulation of the gene (Table 4 and Fig. 3).

Figure 3

Effect of genetic variation on exonic splicing enhancers (ESEs) according to the ESE prediction tool. ESE finder identifies potential ESE sites. The height of the colored bars represents the motif scores and the width of the bars indicates the length of the motif. Bars in red, yellow, blue, purple and green indicate potential binding sites for the serine-arginine (SR) proteins SF2/ASF, SRp55, SC35, SF2/ASF (IgM-BRCA1) and SRp40, respectively. Panel I shows the ESE sequence with the non-risk allele and panel II shows the ESE sequence with the risk allele in the studied population. The change in the bars between panels indicates a change in the potential ESE sites (potential splicing sites) that might increase disease susceptibility (Human Splicing Finder (HSF) 3.1 and ESE finder 3.0).

Transcription factor 21 (TCF21) belongs to the helix-loop-helix (HLH) family of transcription factors, which have a critical role in the development of lung, heart and kidney tissue. It harbors 3 exons associated with CpG islands (CpG1, CpG2 and CpG3). High rates of TCF21 promoter hypermethylation have been observed in cancers of different origins, including lung cancer. TCF21 is activated by the long non-coding RNA TARID (TCF21 antisense RNA inducing demethylation) through induction of promoter demethylation. A promoter in the third CpG island of TCF21 drives TARID transcription; TARID interacts with the TCF21 promoter and recruits GADD45A/TDG for base excision repair (BER)-mediated, TET protein-dependent DNA demethylation, resulting in transcriptional activation of TCF21 (ref. 29). A recent study showed that TCF21 is expressed in normal lung airways but is aberrantly methylated and silenced in the majority of non-small cell lung carcinomas30. Genetic variation at rs12190287 can regulate TCF21 expression and may serve as a potential biomarker of genetic susceptibility to lung cancer.

rs12190287

Genetic variant rs12190287 is a 3′ UTR variant of TCF21. The major allele (C) of rs12190287 (C/G) showed significant association with non-small cell lung cancer risk, with an odds ratio (OR) of 1.85 (95% CI 1.14–2.99) and a p value of 0.012 (Table 2). The same variant has been examined in Chinese GWAS as a risk factor for several cancers, including breast cancer, osteosarcoma and renal cell carcinoma31,32,33. Various studies have also demonstrated downregulation of TCF21 in breast cancer, bladder cancer and non-small cell lung cancer30. This variant has not previously been evaluated for lung cancer risk in any population group in India; this is the first study to evaluate rs12190287 for non-small cell lung cancer risk. The findings indicate that rs12190287 of TCF21 is a risk factor for NSCLC in the J&K population, North India (p = 0.012).

Cis-eQTL analysis suggests that the risk allele (C) is significantly linked with upregulation of the gene in lung tissue (p value = 1.9E−17; normalized effect size (NES) = 0.29). Since the gene is important in many biological processes, its upregulation can affect these processes. Moreover, the locus carries histone marks (H3K4me1_Enh/H3K4me3_Pro/H3K27ac_Enh/H3K9ac_Pro/23_PromBiv), suggesting an important role in epigenetic regulation. The in silico analysis also indicated that rs12190287 breaks a site for 9G8 and creates a new site. A change was also observed in splicing factor binding at the exonic splicing enhancer (ESE)/intronic site (Table 4 and Fig. 3).

rs4757151 and rs1026071

Circadian rhythm pathways have been characterized in almost all living species and are controlled by circadian rhythm genes34. Disruption of either the genes or the pathways has been associated with many ailments, such as mood-related disorders, depression, cardiovascular disease and cancer. The feedback loop that regulates circadian rhythm consists of critical genes such as ARNTL, PER and CLOCK, which function as important regulators of transcription and translation.

Genetic variant rs4757151 is an intronic variant of ARNTL. The major allele (C) of rs4757151 (C/G) showed significant association with NSCLC risk, with an odds ratio (OR) of 2.12 (95% CI 1.32–3.47) and a p value of 0.002 (Table 2). This variant has not been evaluated for non-small cell lung cancer risk in any Indian population group, and our results indicate that rs4757151 of ARNTL is a risk factor for NSCLC in the J&K population, North India. The effect of this variant on ARNTL was further examined in silico using Human Splicing Finder (HSF) and exonic splicing enhancer (ESE) analysis. The majority of the HSF algorithms for predicting enhancer/silencer motifs indicated that rs4757151 breaks a site for SC35 and creates a new site (Table 4 and Fig. 3). The other variant of the same gene, rs1026071, did not show any genetic association with NSCLC risk, with an odds ratio (OR) of 0.99 (95% CI 0.75–1.31) and a p value of 0.985 (Table 3).

Genetic variants which showed no significant association with non-small cell lung cancer in this study

Various studies have linked DDC expression with multiple cancers35. The genetic variant rs2229080 of DDC showed no association with gastric and esophageal cancer risk in the J&K population36. We evaluated the same variant for lung cancer risk in the population of Jammu and Kashmir and likewise found no genetic association, with an odds ratio (OR) of 0.98 (95% CI 0.75–1.28) and a p value of 0.925 (Table 3).

Genetic polymorphism in PIK3CA has been observed in several types of cancer, including non-small cell lung cancer. Moreover, the variant rs2699887 in PIK3CA has been associated with brain metastasis in non-small cell lung cancer patients: NSCLC patients carrying one variant allele at rs2699887 had double the risk of brain metastasis compared with those without the variant37. We targeted the same variant for lung cancer risk in the population of Jammu and Kashmir but found no genetic association, with an odds ratio (OR) of 0.74 (95% CI 0.52–1.05) and a p value of 0.095 (Table 3).

Genetic variant rs3803662 lies in the Cancer Susceptibility Candidate 16 gene (CASC16), an RNA gene located at 16q12.1. The variant did not show any genetic association with NSCLC risk in the population of Jammu and Kashmir, with an odds ratio (OR) of 1.15 (95% CI 0.85–1.54) and a p value of 0.36 (Table 3). This polymorphism has been extensively associated with breast cancer risk in Iranian, Caucasian and Asian population groups38.

The REV1 DNA Directed Polymerase (REV1) gene shares homology with Y-family DNA polymerases and acts as a scaffold protein involved in translesion synthesis (TLS) of damaged DNA39. Genetic variant rs3792152 is an intronic variant of the REV1 gene. The variant did not show genetic association with NSCLC risk in the population of Jammu and Kashmir, with an odds ratio (OR) of 1.24 (95% CI 0.96–1.59) and a p value of 0.092 (Table 3).

Various studies have demonstrated the role of BCL-2 in oncogenesis, neurological disorders, ischemia and autoimmune diseases. BCL-2 overexpression is associated with various cancers, such as NSCLC, esophageal cancer, endometrial cancer, breast cancer, CLL and diffuse large B-cell lymphoma40,41. Genetic variant rs1801018 is a coding sequence variant of BCL-2. The variant did not show any significant association with NSCLC risk in the population of Jammu and Kashmir, with an odds ratio (OR) of 1.02 (95% CI 0.79–1.31) and a p value of 0.872 (Table 3). This is consistent with studies in a male Chinese population42 and Asian43 population groups, which failed to find an association of rs1801018 with NSCLC risk.

Furthermore, the interaction between the genetic variants was evaluated using the multifactor dimensionality reduction (MDR) software v3.0.2. The variants (attributes) connected by the shortest lines show the strongest synergistic effect. The results indicated that BRIP1, ERCC5 and ERCC1 are linked by red lines, suggesting strong interaction and a maximum synergistic effect among these genes, as shown in Supplementary Fig. 2a,b. The best-fit model shown in Supplementary Fig. 3a,b suggests an interaction effect for the variants associated with NSCLC in the studied population and revealed strong interaction among the BRIP1, ERCC5 and ERCC1 genes.

Conclusion

Despite recent advances in high-throughput techniques, the molecular characterization of cancer-related single nucleotide variants for improving therapeutic interventions remains a challenging task for scientists and clinicians. Case-control association studies identifying the role of these genetic variants have proved fruitful in this arena.

The present study explored the association of twelve critical genetic variants involved in diverse biological processes and their plausible regulatory role. Of the twelve variants, after applying the QC and HWE analyses, six variants, TCF21 (rs12190287), ERCC1 (rs2298881, rs11615), ERCC5 (rs751402), ARNTL (rs4757151) and BRIP1 (rs4986764), showed significant association with non-small cell lung cancer in the population of Jammu and Kashmir, North India (OR = 0.66–2.12; p value = 0.002–0.02; Table 2), while six variants, REV1 (rs3792152), PIK3CA (rs2699887), CASC16 (rs3803662), DDC (rs2229080), ARNTL (rs1026071) and BCL2 (rs1801018), did not show any significant association with NSCLC risk. Our results reveal a complex genetic mechanism and highlight the critical role of various genetic variants in the pathogenesis of non-small cell lung cancer. Moreover, all the statistically significant variants showed a role in epigenetic regulation and have a potential effect on the expression of their own or a neighboring gene, which might contribute to the underlying etiology of non-small cell lung cancer. This is the first study from the northern region to target important cancer-related genetic variants; the Union Territory of J&K is genetically less explored, and such studies are lacking in the region.

The findings of this preliminary study, including the variants that have been related to other cancers but were not associated with non-small cell lung cancer here and the variants that deviated from HWE, warrant replication in larger cohorts. Our findings will improve the understanding of inter-population differences in non-small cell lung cancer etiology and strengthen GWAS outcomes. Furthermore, such association studies, if conducted on large sample sizes, would help to fill the gap of remaining unexplained heritability of non-small cell lung cancer. The genetic variants targeted in the present study also warrant functional analysis in future studies.

Materials and methods

Ethical statement

The study was designed in accordance with the Declaration of Helsinki and was approved by the Institutional Ethics Review Board (IERB) of Shri Mata Vaishno Devi University (SMVDU) vide IERB Serial No: SMVDU/IERB/16/41. The participants were informed about the research objectives, and written informed consent in three local languages was obtained from all subjects enrolled in the present study. All methods were performed in accordance with the relevant guidelines and regulations.

Sampling

A total of 723 subjects (162 NSCLC cases and 561 healthy controls) were enrolled in the study after providing informed consent. All cancer cases were histopathologically confirmed. Two milliliters of venous blood was collected from each participant in an EDTA vial. Epidemiological features are summarized in Table 1.

DNA isolation

Genomic DNA was isolated from the blood samples using the Qiagen DNA Isolation kit (Catalogue No. 51206). The quantity and quality of the genomic DNA were assessed by UV spectrophotometry (Eppendorf Biospectrometer®, Hamburg, Germany) and gel electrophoresis, respectively.

Selection of variants and genotyping

In this study, we selected genetic variants that have been associated with non-small cell lung cancer through GWAS and replication studies using the CGA. A total of twelve genetic variants of ten genes were shortlisted. The details of the genetic variants are given in Supplementary Table 1. Genotyping was performed at the Central MassARRAY facility at SMVDU on a high-throughput Agena MassARRAY platform (The MassARRAY® System by Agena Bioscience™, San Diego, CA)44. The list of primers is provided in Supplementary Table 2.

Sequenom Typer 4.0 software was used to analyse genotype calls, as shown in Supplementary Fig. 1. All genotype calls were cross-checked against the spectrograms to exclude calling errors. Subjects were excluded from the study if more than 10% of their genotypes were missing. Variants that deviated from Hardy–Weinberg equilibrium (HWE) (p value < 0.05) were also omitted from the study. The genotyping results were replicated in a random 10% of samples, and the concordance rate was 98.5%. In each 384-well plate, one positive and one negative control were included as a quality check.
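
The sample-level filter and the replicate concordance described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the data layout (a dictionary mapping a sample ID to a list of genotype calls, with `None` for a missing call) and the function names are assumptions made for the example.

```python
from typing import Dict, List, Optional

# Hypothetical layout: sample ID -> genotype calls, None marks a missing call.
Genotypes = Dict[str, List[Optional[str]]]


def missing_rate(calls: List[Optional[str]]) -> float:
    """Fraction of variants with no genotype call for one sample."""
    return sum(c is None for c in calls) / len(calls)


def filter_samples(data: Genotypes, max_missing: float = 0.10) -> Genotypes:
    """Drop samples whose missing-genotype fraction is higher than max_missing."""
    return {sid: calls for sid, calls in data.items()
            if missing_rate(calls) <= max_missing}


def concordance_rate(original: Genotypes, repeat: Genotypes) -> float:
    """Fraction of matching calls among positions called in both runs."""
    agree = 0
    total = 0
    for sid, repeat_calls in repeat.items():
        for first, second in zip(original[sid], repeat_calls):
            if first is None or second is None:
                continue  # skip positions missing in either run
            total += 1
            agree += first == second
    return agree / total if total else float("nan")


# Toy data: with twelve variants, one missing call (8.3%) is tolerated,
# two missing calls (16.7%) exceed the 10% threshold.
toy = {
    "s1": ["AA"] * 12,
    "s2": ["AA"] * 10 + [None, None],
    "s3": ["AG"] * 11 + [None],
}
print(sorted(filter_samples(toy)))
```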

Genotyping quality control and criteria

The following criteria were used for validation and acceptance of genotyping. Genetic variants (SNPs) with a call rate > 90% were included in the statistical analysis45. Hardy–Weinberg equilibrium (HWE) among cases and controls was used to assess genotype quality after analysing the data sets. Variants that did not follow HWE (p value < 0.05) were omitted from the study.
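
The two variant-level criteria above (call rate and HWE) can be illustrated with a short sketch. It assumes a one-degree-of-freedom chi-square goodness-of-fit test on genotype counts for a biallelic variant; the function names are illustrative, and the study itself carried out these tests in PLINK and SPSS rather than with this code.

```python
import math


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int):
    """Chi-square test (1 df) of Hardy-Weinberg proportions for a biallelic variant.

    Returns (chi-square statistic, p value).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def passes_qc(call_rate: float, genotype_counts, min_call_rate: float = 0.90,
              hwe_alpha: float = 0.05) -> bool:
    """Apply the criteria stated in the text: call rate > 90% and HWE p >= 0.05."""
    _, p_value = hwe_chi_square(*genotype_counts)
    return call_rate > min_call_rate and p_value >= hwe_alpha


# Toy counts: 25/50/25 sits exactly at HWE; 50/0/50 has no heterozygotes at all.
print(hwe_chi_square(25, 50, 25))
print(passes_qc(0.95, (50, 0, 50)))
```

For small genotype counts an exact HWE test is often preferred over the chi-square approximation; the sketch uses the chi-square form only because that is the test named in the statistical analysis.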

Statistical analysis

A t-test was used to compare the clinical characteristics of cases and controls. Genotype data were analysed using PLINK v. 1.07 (ref. 46) and IBM SPSS Statistics 20 software47. All genetic variants were tested for Hardy–Weinberg equilibrium using the chi-square test. The association of variants with non-small cell lung cancer risk was evaluated by binary logistic regression analysis adjusted for confounding factors such as age, gender and body mass index (BMI). Odds ratios (ORs) were calculated with respect to the risk allele observed in this study. One-way ANOVA was used to compare the clinical characteristics of the different genotypes of each variant, adjusted for age and gender (Supplementary Table 3).
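
For orientation, the form of an allelic odds ratio and its 95% confidence interval can be sketched from a 2 × 2 table of allele counts. This is an unadjusted estimate with a Woolf (log-scale) interval, so it is a simplification of the covariate-adjusted logistic regression described above; the function name and the toy counts are hypothetical and are not data from this study.

```python
import math


def allelic_odds_ratio(case_risk: int, case_other: int,
                       ctrl_risk: int, ctrl_other: int, z: float = 1.96):
    """Unadjusted allelic odds ratio with a Woolf (log-scale) confidence interval.

    Arguments are allele counts: risk and non-risk alleles in cases and controls.
    Returns (OR, lower bound, upper bound).
    """
    odds_ratio = (case_risk * ctrl_other) / (case_other * ctrl_risk)
    # Standard error of log(OR) from the four cell counts.
    se = math.sqrt(1 / case_risk + 1 / case_other + 1 / ctrl_risk + 1 / ctrl_other)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Hypothetical allele counts, for illustration only.
or_, lo, hi = allelic_odds_ratio(20, 10, 10, 20)
print(f"OR = {or_:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

An interval that excludes 1 corresponds to a nominally significant association at the 5% level; adjusting for age, gender and BMI, as done in the study, can shift both the estimate and the interval.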

Potential role of the variants

The University of California Santa Cruz (UCSC) Genome Browser (https://genome.ucsc.edu) and the GTEx portal (https://www.gtexportal.org) were used together for expression quantitative trait loci (eQTL) analysis of the variants. Furthermore, the UCSC Genome Browser, the Encyclopedia of DNA Elements (ENCODE) (V3) and the HaploReg v4.1 database13,48 were used to analyse transcriptional regulatory features such as histone modifications, DNase hypersensitivity and transcription factor binding sites. In addition, the effect of each variant on splicing was evaluated using the web tools Human Splicing Finder (HSF) 3.1 and ESE finder (3.0)14,49.