Introduction

The concept of the human exposome was first proposed almost two decades ago as a framework to guide research that explores the etiological complexities of health and disease (Wild, 2005). Because the relationship between multifactorial exposure patterns and health outcomes is complex, there is a need for studies that incorporate information from multiple exposures. Approaches that include a single environmental exposure may not fully or accurately describe the risk of disease because mixing factors may alter the effects of a single exposure (Wild, 2005; Zhang et al., 1), Poisson regression and Poisson RF models were employed to model the relationship between the cancer-related factors and the lung/bronchus cancer incidence.

A t-test comparing Poisson regression with Poisson RF showed statistically significant differences in MAPE for both males (t(48) = 12.86, p < 0.01) and females (t(48) = 6.40, p < 0.01). Because each case has 25 samples, the test has 48 degrees of freedom. RMSE also showed significant differences for both males (t(48) = 8.85, p < 0.01) and females (t(48) = 6.57, p < 0.01) (Table 2).
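The degrees of freedom reported above (25 + 25 - 2 = 48) are consistent with a pooled-variance two-sample t-test, although the text does not name the exact variant. The following is a minimal sketch under that assumption; the function name `pooled_t_test` and the MAPE values are hypothetical and are not the study's data.

```python
import math
import statistics


def pooled_t_test(a, b):
    """Return (t statistic, degrees of freedom) for a pooled-variance two-sample t-test."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    # The pooled variance weights each sample variance by its own degrees of freedom.
    pooled_var = ((n_a - 1) * statistics.variance(a)
                  + (n_b - 1) * statistics.variance(b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t_stat = (statistics.mean(a) - statistics.mean(b)) / std_err
    return t_stat, df


# Hypothetical MAPE values (percent), 25 per model; NOT the study's data.
mape_regression = [12.0 + 0.1 * (i % 5) for i in range(25)]
mape_forest = [10.0 + 0.1 * (i % 5) for i in range(25)]

t_stat, df = pooled_t_test(mape_regression, mape_forest)
print(f"t({df}) = {t_stat:.2f}")
```

A positive statistic here means the first sample (Poisson regression in this illustration) has the larger mean error.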

Table 2 Mean absolute percentage errors (MAPEs), root mean square errors (RMSEs), and their standard deviation from the test set with Poisson regression and Poisson random forest
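The two error metrics named in the table captions can be sketched as follows. This is an illustration of the standard definitions of MAPE and RMSE, not the study's code; the observed and predicted incidence values are hypothetical.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Assumes no actual value is zero."""
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def rmse(actual, predicted):
    """Root mean square error, in the same units as the inputs."""
    total = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(total / len(actual))


# Hypothetical county-level incidence rates; NOT the study's data.
observed = [50.0, 80.0, 100.0]
predicted = [45.0, 88.0, 100.0]

print(f"MAPE = {mape(observed, predicted):.2f}%")
print(f"RMSE = {rmse(observed, predicted):.2f}")
```

MAPE is scale-free, so it allows comparison across datasets with different baseline rates, whereas RMSE penalizes large absolute errors more heavily.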

Tables 3 and 4 summarize the results for the various datasets obtained with Poisson regression (Table 3) and Poisson RF (Table 4). Smoking, radiation exposure, and PM2.5, which are thought to be related to radon exposure (Matthaios et al., 2021; Trassierra et al., 2016), and sociodemographic and behavioral factors were combined in various models. The analysis of the relationship between variables and model accuracy revealed an interesting trend in the error from the Poisson RF, as shown in Table 4. The VIM was obtained by averaging the model weights across folds over the entire dataset, using the default function of the fpechon/rfCountData package (Liaw & Wiener, 2002; Pechon, 2019). Tables 5 and 6 show the VIMs of the variables from the full-model Poisson random forest regression, which includes all variables, socioeconomic variables among them.

Table 3 Mean absolute percentage errors (MAPEs), root mean square errors (RMSEs), and their standard deviation of each data set with Poisson regression
Table 4 Mean absolute percentage errors (MAPEs), root mean square errors (RMSEs), and their standard deviation of each data set with Random Forest
Table 5 Variable importance measures (VIM) of variables from male dataset with Random Forest
Table 6 Variable importance measures (VIM) of variables from female dataset with Random Forest

Table 7 summarizes the IRRs from the full-model Poisson regression. The unit of increase used for each IRR is proportional to the range of the corresponding variable, which makes the comparison more intuitive. In both cases, smoking had the greatest effect on lung cancer incidence rates. For indoor radon, the association was negative [Male: 0.99 (0.98, 0.99), Female: 0.99 (0.98, 0.99)]. In addition, background gamma count (RadNet) [Male: 0.97 (0.97, 0.98), Female: 0.98 (0.98, 0.99)] and three-year average PM2.5 for females [0.99 (0.98, 1.00), p = 0.09] showed negative associations at higher concentrations, which somewhat contradicts results from previous studies (Ghazipura et al., 2019; Raaschou-Nielsen et al., 2013; Turner et al., 2011a, 2011b).
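As an illustration of how an IRR can be expressed per range-scaled unit, the sketch below exponentiates a log-link Poisson coefficient multiplied by the chosen increment and builds a Wald 95% confidence interval. The coefficient, standard error, variable range, and the 10% scaling fraction are all hypothetical; the source states only that the unit is proportional to each variable's range.

```python
import math


def irr_with_ci(beta, std_err, unit=1.0, z=1.96):
    """IRR and Wald 95% CI for a `unit` increase in a covariate (unit > 0 assumed)."""
    point = math.exp(beta * unit)
    lower = math.exp((beta - z * std_err) * unit)
    upper = math.exp((beta + z * std_err) * unit)
    return point, lower, upper


# Hypothetical fitted coefficient and standard error; NOT the study's estimates.
beta, std_err = -0.002, 0.001

# Hypothetical covariate range; the increment is taken as 10% of that range.
var_min, var_max = 0.0, 50.0
unit = 0.1 * (var_max - var_min)

irr, lo, hi = irr_with_ci(beta, std_err, unit)
print(f"IRR per {unit:g}-unit increase: {irr:.3f} ({lo:.3f}, {hi:.3f})")
```

An IRR below 1 with an interval that excludes 1 corresponds to a negative association of the kind reported for indoor radon.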

Table 7 Incidence rate ratios (IRRs) and 95% confidence intervals of each factor of interest with Poisson regression

To understand the differences by EPA Radon Zone, separate models were fitted for each zone using the full-model Poisson regression (Table 8). In Radon Zone 1, an area with high radon concentration, the effect of PM2.5 exposure was the greatest. Conversely, in Radon Zone 3, an area with low radon concentration, higher PM2.5 concentrations were associated with lower incidence rates. The effect of smoking was consistent across all radon zones.

Table 8 Incidence rate ratios (IRRs) and 95% confidence intervals of PM2.5 and smoking by radon zone

Discussion

The effects of environmental exposure on health outcomes are complex. In this study, the results (Table 8) suggest that the association between PM2.5 and lung cancer incidence may vary with levels of indoor radon exposure. Despite potential synergistic effects of exposure, many radiation epidemiological studies include a limited number of environmental exposure measures (Haylock et al., 2018; Richardson et al., 2015; Stanley et al., 2019; Tomasek, 2013). Belloni et al. (2020) have noted that few studies (Klebe et al., 2019; Leuraud et al., 2011) have attempted to address multifactorial exposures from environmental stressors. In the study of radiation-related disease, estimating the risk associated with radiation-related lung cancer has been a focal point in resolving the dose-risk response relationship (United Nations Scientific Committee on the Effects of Atomic Radiation [UNSCEAR], 2018). Furthermore, because the baseline cancer risk is high compared with the incremental risk from low-dose radiation exposure, the population size required to detect low-dose radiation risk with statistical significance increases exponentially as the target dose decreases (Ozasa, 2016; Ozasa et al., 2019; UNSCEAR, 2008; Valentin, 2006). To address some of these challenges, studies that use a wider range of data, such as the Million Person Study (Boice et al., 2022), are being conducted (Calabrese, 2015; Ricci & Tharmalingam, 2019; Tubiana et al., 2009; Valentin, 2008; Weber & Zanzonico, 2017). The population-level exposure variables and health outcomes data used in this study can serve as a valuable resource for future research. Population-level data make it easier to include multiple variables and to analyze diverse health outcomes. Furthermore, ML techniques are particularly well suited to modeling the complex relationships between environmental exposure and health outcomes. By leveraging ML, it is possible to capture the complex interplay between environmental exposures and health, which offers a promising avenue for future research in this field.

The results suggest that PM2.5 should be included in future analysis of radon-induced lung cancer incidence, as there may be an interaction with radon exposure. The observed patterns, where changes in radon concentration result in significant differences (p < 0.001 for all cases) in the effects of PM2.5, corroborate findings from other research that explores the combined impacts of PM2.5 and radon exposure (Dlugosz-Lisiecka, 2016). PM2.5 or other particulate matter could be one of the possible transport mechanisms that allow radon gas to permeate lung tissue. This is further supported by two experimental studies that assess the speciation of PM2.5 particles in the presence of radon progeny. The first study shows that the alpha activity of PM2.5 tends to increase as the concentration of radon increases (Matthaios et al., 2021). The second study shows that in a radon chamber, the presence of particulate matter will increase the attached fraction of radon progeny, thereby implying that the radiation exposure from particulate matter will increase (Trassierra et al., 2016). PM2.5 and radon seem to have synergistic effects and are thought to affect various health outcomes, including incidences of lung cancer. Given the possible synergistic effect between PM2.5 and radon, future epidemiological studies should investigate this further.

This study harnessed ML to consider the non-linear effects of radon exposure within the context of other environmental factors. The lower errors of the ML models indicate that ML is effective at analyzing complex relationships in environmental exposure studies and should be considered in future studies that investigate the relationship between radon exposure and cancer outcomes. One current limitation is the lack of variety in ML algorithm packages that can be applied to count data. However, this problem is expected to be resolved as ML develops and becomes more widely used in regression analysis.

Large-scale data pose challenges for analyses that depend on individual characteristics. For example, they are limited in their ability to reflect the interaction of environmental and genomic factors, which is important in the exposome approach (Zhang et al., 2021). Furthermore, individual exposure history, which is similarly essential to exposome analysis, is difficult to reflect in the analysis (Zhang et al., 2021). Thus, population-level studies of incidence rates, such as this one, are susceptible to the ecological fallacy. This limits the ability to establish causal relationships between variables and health outcomes. Despite these limitations, population-level studies can still provide valuable reference points for guiding individual-level studies.

The World Health Organization (2009) reported that radon is the second major contributor to lung cancer incidence. In addition, a study by Turner et al. (2011a, 2011b), which, like this study, analyzed county-level radon concentrations and residents' lung cancer risk, showed a positive association between residential radon and lung cancer risk. However, our results showed a negative association between radon and lung cancer incidence rates [IRR of male: 0.99 (0.98, 0.99), IRR of female: 0.99 (0.98, 0.99)]. There are several reasons our findings may differ from those of occupational cohort studies, which show a strong association among individuals exposed to high levels of radon (Kreuzer et al., 2015; Leuraud et al., 2011; Richardson et al., 2021, 2022). First, as mentioned above, this study may suffer from the ecological fallacy. Second, indoor radon exposure risk is measured at the county level and radon exposure varies widely across counties (Li et al., 2021). Third, the effect sizes at low levels of exposure are likely small, making the signal difficult to detect in an ecological analysis. Our results, in which the association between residential radon exposure and lung cancer is difficult to distinguish, are more aligned with those of a recently published study of residential exposure and lung cancer (Li et al., 2020). The study on residential radon exposure and lung cancer risk in Connecticut and Utah (Sandler et al., 2006) could not provide evidence of an increased risk of lung cancer at the exposure levels observed. Unlike in miner studies, residential radon exposure is so low that statistically significant results are difficult to obtain.

Furthermore, the difference in findings across studies may arise from discrepancies between individual-level and population-level approaches in their methodologies and analysis. The results for the interaction between smoking and radon also differed from those of previous studies. According to BEIR VI, a comprehensive analysis of the relationship between smoking habits, radon exposure, and lung cancer risk of uranium miners from several studies showed a submultiplicative effect, which means that the risk in the population exposed to both smoking and radon is greater than the sum of the individual risks expected from either smoking or radon exposure and less than the product (NRC, 1999). The results of a case-control study in Spain after BEIR VI indicated that there is a strong synergistic effect between smoking and radon exposure, and the case-control miner study showed evidence of submultiplicative interaction between radon and smoking (Barros-Dios et al., 2012; Leuraud et al., 2011). However, the association between smoking and radon concentration did not appear to be significant in the results presented herein. These inconsistent results may again be attributed to certain limitations of this study, including sparse measurement of radon concentrations. Using the median limits the effect of outliers, but the estimate is still subject to error when the number of tests is insufficient. This problem could skew the results toward non-significant associations or even contradict established knowledge.
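The definition of a submultiplicative effect given above can be made concrete with a small worked example. The sketch below assumes risks are expressed as relative risks against an unexposed baseline of 1, so the additive expectation for joint exposure is RR_a + RR_b - 1 and the multiplicative expectation is RR_a x RR_b. The function name and the numerical values are hypothetical and are not taken from BEIR VI or from this study.

```python
def classify_joint_effect(rr_a, rr_b, rr_joint):
    """Place a joint relative risk between the additive and multiplicative expectations.

    Assumes rr_a > 1 and rr_b > 1, so the additive expectation is always
    smaller than the multiplicative one.
    """
    additive = rr_a + rr_b - 1.0   # both excess risks added to the baseline of 1
    multiplicative = rr_a * rr_b   # each exposure scales the other's risk
    if rr_joint <= additive:
        return "additive or less"
    if rr_joint < multiplicative:
        return "submultiplicative"
    return "multiplicative or more"


# Hypothetical relative risks; NOT values from BEIR VI or this study.
rr_smoking, rr_radon, rr_both = 10.0, 2.0, 15.0

# Additive expectation is 11 and multiplicative expectation is 20,
# so a joint relative risk of 15 falls between them.
print(classify_joint_effect(rr_smoking, rr_radon, rr_both))
```

With these illustrative numbers the joint risk exceeds the sum of the individual excess risks but stays below their product, which is the pattern the text describes for miners.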

Possible confounding factors that were not properly reflected include the level of stress that people experience and the quality of medical care, both of which vary considerably by county or state, even though some socioeconomic factors were included. This may also explain why the trend for PM2.5 and lung cancer incidence in this analysis is opposite to previously known results. These problems could be mitigated by conducting research on specific regions with very high-resolution data, or by improving our measures of radon concentrations. Another limitation of this study is the lack of residential history data, which made it impossible to create a model that adequately considers different exposures across a life span and the associated latency periods. Other lung cancer models have considered an incubation period of 5 years (National Research Council [NRC], 2006; UNSCEAR, 2008; Valentin, 2008). Future studies should use residential history to assess the effects of indoor radon exposure across a life span.

If future studies address these limitations, then highly accurate ML techniques, combined with the advantages and applicability of population-level data in radiation epidemiology, could be harnessed to analyze a more diverse set of health outcomes. This may also provide valuable insights into the interplay between variables.

Conclusion

Traditional statistical methods and ML models can be used in parallel to fully understand the complex relationship between environmental exposures and health. To investigate the applicability of multivariable and ML methods in environmental exposure studies, county-level lung/bronchus cancer risk was assessed with various exposures (airborne gamma counts, radon concentration, air quality), lifestyle (smoking), and socioeconomic factors through Poisson regression and Poisson RF regression. The study found that the risk of lung cancer from PM2.5 varied by radon concentration with larger effect sizes in areas with high indoor radon exposure. In summary, the results of this study demonstrate how (1) including multiple environmental exposures has advantages over single exposure studies when the relationship between the environment and lung cancer risk is considered, thereby making an exposomics framework an important consideration, and (2) employing ML models enhances the utility of analysis in identifying complex relationships, as in the case of environmental radiation exposure and lung cancer incidence. Consequently, this study proposes a new paradigm for studying environmental radiation combined with other environmental exposures.