Computational intelligence modeling of hyoscine drug solubility and solvent density in supercritical processing: gradient boosting, extra trees, and random forest models

Ghazwani, Mohammed; Begum, M. Yasmin

doi:10.1038/s41598-023-37232-8

Article
Open access
Published: 21 June 2023

Volume 13, article number 10046, (2023)
Scientific Reports

Mohammed Ghazwani¹ &
M. Yasmin Begum²

Abstract

This work presents the results of using tree-based models, namely Gradient Boosting, Extra Trees, and Random Forest, to model the solubility of the drug hyoscine and the solvent density, with pressure and temperature as inputs. The models were trained on a dataset of hyoscine with known solubility and density values and optimized with the Water Cycle Algorithm (WCA), and their accuracy was evaluated using the R², MSE, MAPE, and Max Error metrics. The Gradient Boosting and Extra Trees models showed high accuracy, with R² values above 0.96 and low MAPE and Max Error values for both the solubility and density outputs. The Random Forest model was less accurate than the other two models. These findings demonstrate the effectiveness of tree-based models for predicting the solubility and density of chemical compounds. They have potential application in determining drug solubility prior to process design, by correlating solubility and density to input parameters including pressure and temperature.

Introduction

The poor water solubility of newly discovered medicines is a major issue for the pharmaceutical industry, and various techniques have been explored and developed to enhance the solubility of drugs in aqueous solutions¹. Either physical or chemical methods can be used to increase drug solubility in aqueous media; among the physical methods, nanonization has recently attracted much attention for the preparation of drug nanoparticles. One physical method for drug nanonization is supercritical processing, which can be used to prepare nano-sized drug particles with enhanced aqueous solubility². To develop this technique, the drug solubility in the supercritical solvent must be known prior to process design and development.

Pharmaceutical solubility in supercritical solvents such as CO₂ has been estimated by different methods, including thermodynamic and data-driven models³. Pressure and temperature are taken as the main model inputs because these factors have shown the most important effects on drug solubility^2,4,5,6,7. Measuring and correlating drug solubility is a crucial step in preparing nanosized drugs with better bioavailability. Supercritical processing of solid-dosage drugs is also considered a green technology because CO₂ is usually employed for the drug treatment and no organic solvent is used in the process^8,9,10.

Other approaches have been studied for enhancing drug solubility in water; however, nanonization is a facile and effective process, specifically the mechanical approaches, which do not use chemical agents for the preparation of nanomedicines¹⁰. Supercritical processing can also be developed for continuous operation, so that a hybrid process can be built on this technology. Two main approaches are used to estimate the solubility of pharmaceuticals: thermodynamic models and data-driven models. Thermodynamic methods estimate drug solubility based on solid–liquid equilibrium, and the computations find the amount of drug (solid phase) dissolved in the solvent as a function of pressure and temperature^11,12,13. Data-driven models, on the other hand, estimate solubility from available measured data by training appropriate algorithms. Although thermodynamic models are well accepted for pharmaceutical solubility, they are not straightforward to develop for a variety of drug substances. Machine learning, which is a data-driven approach, has shown greater fitting accuracy for estimating the solubility of different drugs in supercritical solvents^14.

Tree-based ensembles

This subsection introduces the three decision-tree ensemble methods employed in this study. Random Forest (RF) is a widely used ensemble model designed to overcome the shortcomings of the conventional decision tree algorithm. The RF technique trains numerous decision tree learners in parallel to reduce model bias and variance. To construct a random forest, N bootstrap samples are drawn at random from the original dataset, and an unpruned regression tree is trained on each sample. Instead of considering every possible predictor, K randomly selected predictors are used as candidate splits²⁶. The process is repeated until C trees are formed, and new data are then estimated by averaging the predictions made by the C trees. By employing bagging to grow trees from different training datasets, RF increases the diversity of the trees and decreases the total variance of the method²⁴. An RF model for regression can be expressed mathematically as^24,27:

$$\hat{f}_{RF}^{C} \left( {\mathbf{x}} \right) = \frac{1}{C}\sum\limits_{i = 1}^{C} {T_{i} \left( {\mathbf{x}} \right)}$$

The random forest regression predictor takes an input vector x and produces an output by combining the predictions of C decision trees, where T_i(x) represents a single regression tree generated using a subset of input variables and bootstrapped samples²⁴. A further benefit of the RF method is that it can estimate the out-of-bag error during forest construction by reusing the training instances that were not used to build individual trees. The out-of-bag subset is a random subset of samples used to estimate the generalization error without consulting an external validation dataset^18,24.
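The averaging formula and the out-of-bag estimate can be illustrated with a short scikit-learn sketch. The data below are synthetic placeholders rather than the study's measurements, and the input ranges, noise level, and the choice of C = 200 trees are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic placeholder data (NOT the paper's measurements):
# two inputs and one smooth target with a little noise.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(308.0, 338.0, size=45),   # hypothetical temperature range, K
    rng.uniform(120.0, 270.0, size=45),   # hypothetical pressure range, bar
])
y = 0.01 * X[:, 1] + 0.005 * (X[:, 0] - 308.0) + rng.normal(0.0, 0.01, size=45)

# C = 200 bootstrapped trees; oob_score=True requests the out-of-bag R^2,
# computed for each sample from the trees that did not see it in training.
rf = RandomForestRegressor(n_estimators=200, bootstrap=True,
                           oob_score=True, random_state=0)
rf.fit(X, y)

# Forest prediction = (1/C) * sum_i T_i(x), reproduced by hand.
tree_preds = np.stack([tree.predict(X) for tree in rf.estimators_])
manual_average = tree_preds.mean(axis=0)

print("max |forest - manual average|:",
      np.abs(rf.predict(X) - manual_average).max())
print("out-of-bag R^2:", rf.oob_score_)
```

The first printed value should be zero up to floating-point rounding, which confirms that the forest output is the plain mean of its trees.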

RF can determine the importance of input features, helping to enhance model performance on high-dimensional datasets. This is done by measuring the mean decrease in prediction accuracy when one input variable is changed (for example, randomly permuted) while the others are kept constant. The procedure assigns a relative importance score to each variable and guides the selection of the most influential features for the final model^28,29.
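A minimal sketch of this idea, assuming permutation of one column at a time as the way of "changing" a variable, is given below using scikit-learn's `permutation_importance`. The two-feature target is hypothetical and is built so that the first feature dominates.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(200, 2))
# Hypothetical target: column 0 dominates, column 1 barely matters.
y = 10.0 * X[:, 0] + 0.1 * X[:, 1]

model = RandomForestRegressor(n_estimators=100, random_state=1).fit(X, y)

# Shuffle one column at a time and record the mean drop in R^2.
result = permutation_importance(model, X, y, n_repeats=10, random_state=1)
importances = result.importances_mean
for name, score in zip(["feature_0", "feature_1"], importances):
    print(f"{name}: mean drop in R^2 = {score:.3f}")
```

Here the importance is evaluated on the training data for brevity; scoring on held-out data is usually a better indicator of which inputs matter for generalization.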

Another tree-based ensemble similar to Random Forest is Extremely Randomized Trees, or Extra Trees²⁰. This method can be seen as an extension of the widely used random forest algorithm and is designed to be less prone to overfitting²⁰. Like random forest, the Extra Trees (ET) algorithm trains each base estimator with a random subset of features. In contrast to random forest, it chooses the split value for a candidate feature at random instead of searching for the optimal value in node splitting³⁰.

Random Forest (RF) and Extremely Randomized Trees (ET) are both ensemble learning algorithms that combine multiple decision trees to create a more robust model. The main difference between them lies in how node splits are selected. In RF, a random subset of features is considered, and the best feature and threshold are chosen for each node split. In ET, a random subset of features is also considered, but a random threshold value is drawn for each candidate feature to split the node. This makes ET more random than RF, because the threshold is no longer optimized on the training data. ET is therefore less likely to overfit a dataset, but it may have slightly higher bias than RF. Overall, both algorithms are effective for high-dimensional datasets with many features, and the choice between them depends on the specific characteristics of the data and the trade-off between bias and variance.
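The difference in threshold selection can be sketched for a single feature as follows. The helper names (`sse_after_split`, `best_threshold`, `random_threshold`) and the step-shaped toy data are hypothetical. In the full ET algorithm one random threshold is drawn for each of several candidate features and the best of those candidates is kept, a step this sketch leaves out.

```python
import numpy as np

def sse_after_split(x, y, threshold):
    """Sum of squared errors when each side is predicted by its own mean."""
    left, right = y[x <= threshold], y[x > threshold]
    return sum(((part - part.mean()) ** 2).sum()
               for part in (left, right) if part.size)

def best_threshold(x, y):
    """RF-style: scan midpoints between sorted unique values, keep the best."""
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2.0
    errors = [sse_after_split(x, y, t) for t in candidates]
    return candidates[int(np.argmin(errors))]

def random_threshold(x, rng):
    """ET-style: draw one cut-point uniformly between the feature's min and max."""
    return rng.uniform(x.min(), x.max())

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=50)
y = np.where(x > 0.6, 2.0, 0.0) + rng.normal(0.0, 0.1, size=50)

t_best = best_threshold(x, y)
t_rand = random_threshold(x, rng)
print("optimized threshold:", t_best, "SSE:", sse_after_split(x, y, t_best))
print("random threshold:   ", t_rand, "SSE:", sse_after_split(x, y, t_rand))
```

The optimized split can never have a larger squared error than the random one on the training data, which is the source of both its lower bias and its greater tendency to fit noise.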

Finally, Gradient Boosting Regression (GBR) is a regression technique that combines a set of simple decision trees to form a strong predictor. Decision trees are added to the model iteratively in order to correct the errors made by the previous trees. At each iteration, the model learns the difference between the previous model's predictions and the actual values of the target variable³¹.

The GBR algorithm uses a loss function to examine the accuracy of the model at each iteration. This loss function measures the discrepancy between the predicted and actual values of the target variable. In GBR, the most widely adopted loss function is the mean squared error (MSE).

The GBR model is formulated as follows³¹:

$$f\left( x \right) = \sum\limits_{m = 1}^{M} \beta_{m} h_{m} \left( x \right)$$

Here, f(x) is the predicted target variable, $\beta_{m}$ is the weight assigned to the m-th decision tree, $h_{m} \left( x \right)$ is the prediction of the m-th decision tree for input x, and M is the number of trees in the model.

The decision trees used in GBR are typically shallow, with only a few levels of branching. In order to define the tree structure, the input space is partitioned into regions according to the values of the input features. The principle for selecting splits is to maximize the reduction in the MSE of the target variable.

The GBR algorithm uses gradient descent to update the model at each iteration. The gradient of the loss function with respect to the predicted target variable is calculated, and the new decision tree is trained to predict the negative gradient. The weight of the tree is then set to minimize the loss function.
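The boosting loop described above can be sketched from scratch with depth-1 trees (stumps) on a single feature. This is an illustration under simplifying assumptions and not the implementation used in the study: the weight of every tree is a constant learning rate rather than an individually optimized value, the model starts from the mean of the targets, and the sine-shaped data are synthetic.

```python
import numpy as np

def fit_stump(x, residual):
    """Depth-1 regression tree on one feature: best threshold by squared error."""
    values = np.unique(x)
    best = None
    for t in (values[:-1] + values[1:]) / 2.0:
        left, right = residual[x <= t], residual[x > t]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return lambda q: np.where(q <= t, left_value, right_value)

def gradient_boost(x, y, n_trees=50, learning_rate=0.1):
    """f(x) = f_0 + sum_m beta * h_m(x), with a constant beta = learning_rate."""
    f0 = y.mean()
    prediction = np.full_like(y, f0)
    trees = []
    for _ in range(n_trees):
        residual = y - prediction      # negative gradient of 0.5 * (y - f)^2
        h = fit_stump(x, residual)     # new tree is trained on that gradient
        prediction = prediction + learning_rate * h(x)
        trees.append(h)
    return lambda q: f0 + learning_rate * sum(h(q) for h in trees)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, size=60))
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.1, size=60)

model = gradient_boost(x, y)
mse_constant = np.mean((y - y.mean()) ** 2)
mse_boosted = np.mean((y - model(x)) ** 2)
print("MSE of constant model:", mse_constant)
print("MSE after boosting:   ", mse_boosted)
```

For the squared-error loss the negative gradient is simply the residual, which is why each tree is fitted to the difference between the targets and the current prediction.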

Water cycle algorithm (WCA)

The Water Cycle Algorithm (WCA) is a population-based optimization algorithm inspired by the natural water cycle, in which water evaporates from the earth's surface, forms clouds, and precipitates back onto the surface. The WCA follows a similar pattern in its search for optimal solutions. Its steps are initialization, evaporation, precipitation, and river formation³².

During the initialization step, a random population of candidate solutions is generated. Each solution is characterized by a set of parameters that describe the problem at hand. In a function optimization problem, for example, the parameters could be the values of the input variables³³.

The fitness values of the solutions are evaluated during the evaporation step. A solution's fitness is a measure of how good it is, with higher fitness values indicating better solutions. The fitness values are used to calculate the evaporation rate, which is used to determine how much water evaporates from each solution^33,34.

The evaporated water is transformed into clouds in the precipitation step, which are then randomly distributed across the population of solutions. Each cloud represents a potential solution improvement. The cloud fitness values are compared, and the best one is chosen³⁵.

The selected cloud is used in the river formation step to create a river that flows from the current solution to the selected cloud. The river is represented as a set of solution parameter changes. The differences between the current solution and the chosen cloud determine the changes.

Evaporation, precipitation, and river formation are all repeated until a stopping criterion is reached. A maximum number of iterations, a minimum fitness value, or a maximum computational time could be used as the stopping criterion³⁶.
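The loop above can be sketched in a heavily simplified form. The sketch below is not the published WCA: it keeps a single best solution (the "sea"), moves every other solution (a "stream") toward it by a random fraction, and re-creates streams that come too close to the sea as a stand-in for evaporation and raining. It minimizes a cost instead of maximizing a fitness. The function name `water_cycle_sketch`, the evaporation radius, the step range, and the sphere test function are all assumptions made for illustration.

```python
import numpy as np

def water_cycle_sketch(objective, lower, upper, n_pop=20, n_iter=100, seed=0):
    """Simplified single-sea variant: streams flow toward the best solution."""
    rng = np.random.default_rng(seed)
    dim = lower.size
    # Initialization: random candidate solutions inside the bounds.
    population = rng.uniform(lower, upper, size=(n_pop, dim))
    cost = np.array([objective(p) for p in population])
    d_max = 1e-2 * np.linalg.norm(upper - lower)   # hypothetical evaporation radius
    history = []
    for _ in range(n_iter):                        # stopping rule: iteration budget
        sea = int(np.argmin(cost))                 # best solution so far
        for i in range(n_pop):
            if i == sea:
                continue
            # Flow step: move the stream toward the sea by a random fraction.
            step = rng.uniform(0.0, 2.0, size=dim) * (population[sea] - population[i])
            candidate = np.clip(population[i] + step, lower, upper)
            population[i], cost[i] = candidate, objective(candidate)
            # Evaporation and raining: a stream that reaches the sea without
            # improving on it is re-created at a random position.
            close = np.linalg.norm(population[i] - population[sea]) < d_max
            if close and cost[i] >= cost[sea]:
                population[i] = rng.uniform(lower, upper, size=dim)
                cost[i] = objective(population[i])
        history.append(float(cost.min()))
    best = int(np.argmin(cost))
    return population[best], history

def sphere(v):
    """Toy cost function with its minimum at the origin."""
    return float(np.sum(v ** 2))

lower, upper = np.array([-5.0, -5.0]), np.array([5.0, 5.0])
best_x, history = water_cycle_sketch(sphere, lower, upper)
print("best point:", best_x, "best cost:", history[-1])
```

Because the sea is never moved within an iteration, the best cost recorded in `history` cannot increase from one iteration to the next. For hyperparameter tuning, the objective would be a validation error evaluated for a given hyperparameter vector.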

One of the WCA's strengths is its ability to handle multiple objectives. The goal of multi-objective optimization is to determine a set of solutions that are optimal with respect to several competing objectives. The WCA can be extended to handle multiple objectives by employing the concept of dominance. A solution is said to dominate another solution if it outperforms it in at least one objective and is no worse in any of the others. The WCA can then be used to find a set of solutions that are not dominated by any other solution³⁶.
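The dominance test and the resulting non-dominated set can be written in a few lines. The sketch assumes that all objectives are minimized, and the four candidate points are made-up values.

```python
import numpy as np

def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return bool(np.all(a <= b) and np.any(a < b))

def non_dominated(points):
    """Return the points that are not dominated by any other point."""
    return [p for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]

# Each tuple holds two objective values to be minimized (hypothetical numbers).
candidates = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]
front = non_dominated(candidates)
print(front)
```

The point (3.0, 3.0) is dominated by (2.0, 2.0) and is therefore dropped, while the other three points trade one objective against the other and remain in the set.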

Modeling framework

In this work, we aimed to predict the solubility of the drug hyoscine as well as the density of the solvent (supercritical CO₂) at different combinations of temperature and pressure using machine learning models. We used three models, Random Forest (RF), Extra Trees (ET), and Gradient Boosting (GB), and tuned their hyperparameters with the Water Cycle Algorithm (WCA). The input features were normalized with a Min–Max scaler. The methodology is summarized in the flowchart in Fig. 1. All models have two inputs and two outputs.

Data description

The dataset comprises 45 instances that represent the solubility of hyoscine at distinct combinations of temperature and pressure. The input variables are temperature in Kelvin and pressure in bar, and the output variables are density and solubility³. The entire dataset, obtained from ref. 37, is displayed in Table 1, where ρsc_CO2 denotes the density of the solvent and y the solubility. Fig. 2 shows scatter plots of the input parameters against the outputs. In this research, 80% of the data were selected at random for the training phase and the remaining 20% were kept for the testing phase.
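The preprocessing and data split described here can be sketched with scikit-learn as follows. The arrays are synthetic placeholders with the same shape as the dataset (45 rows, two inputs, two outputs) and are not the measured values. The input ranges, the made-up trends, the fixed random seed, and the choice to fit the scaler on the training inputs only are assumptions of this sketch, and WCA-based hyperparameter tuning is omitted.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic placeholder data (NOT the measured values from Table 1).
rng = np.random.default_rng(0)
temperature = rng.uniform(308.0, 338.0, size=45)   # K, hypothetical range
pressure = rng.uniform(120.0, 270.0, size=45)      # bar, hypothetical range
X = np.column_stack([temperature, pressure])
density = 300.0 + 2.0 * pressure - 3.0 * (temperature - 308.0)   # made-up trend
solubility = 1e-5 * pressure + 2e-5 * (temperature - 308.0)      # made-up trend
Y = np.column_stack([density, solubility])

# Random 80/20 split of the 45 instances.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0)

# The pipeline fits the Min-Max scaler on the training inputs and reuses it
# for the test inputs; the regressor predicts both outputs at once.
model = make_pipeline(MinMaxScaler(),
                      ExtraTreesRegressor(n_estimators=100, random_state=0))
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)
print(X_train.shape, X_test.shape, Y_pred.shape)
```

With 45 instances, the split leaves 36 rows for training and 9 for testing, so every test-set metric rests on only nine points per output.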

Table 1 Entire values of drug solubility³⁷.

Results and discussions

The models in this study were implemented in Python 3.9 with several libraries for machine learning and data analysis, including NumPy, Pandas, Scikit-learn, and Matplotlib. The results of the tree-based models for the solubility and density outputs are summarized in Table 2.

Table 2 Modeling performance.

As shown in the table, the Gradient Boosting and Extra Trees models achieved high accuracy for both the solubility and density outputs, with R² values above 0.96. The Random Forest model was less accurate than the other two models. The MAPE values of all models were below 0.04, indicating a low average percentage error. Max Error is the maximum deviation from the true value, and it was relatively low for both the solubility and density outputs. The estimated and observed values of solubility and density are compared in Figs. 3 and 4. On this basis, Gradient Boosting was selected as the most appropriate model for solubility and Extra Trees for density.
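For reference, the four metrics can be computed directly from their definitions. The sketch below uses a three-point toy example instead of the study's predictions, and it reports MAPE as a fraction (0.04 corresponds to 4%), which is assumed to be the convention behind the values quoted above.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """R^2, MSE, MAPE (as a fraction) and maximum absolute error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MSE": np.mean(residual ** 2),
        "MAPE": np.mean(np.abs(residual / y_true)),
        "MaxError": np.max(np.abs(residual)),
    }

# Toy values: only the last prediction is wrong, by one unit.
metrics = regression_metrics([1.0, 2.0, 4.0], [1.0, 2.0, 5.0])
print(metrics)
```

In this toy case the single error of one unit gives MSE = 1/3, MAPE = 0.25/3, a Max Error of 1, and R² = 1 − 1/(42/9) ≈ 0.786.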

Variations of both responses, i.e., drug solubility and solvent density, are shown as 3D and 2D representations in Figs. 5, 6, 7, 8, 9, 10. The results reveal that the solubility of hyoscine increases with both pressure and temperature, whereas the density increases with pressure and decreases with temperature. Pressure has a pronounced influence on these physical parameters because the solvent is a compressed gas with high compressibility and is therefore strongly affected by pressure. A more strongly compressed solvent is favorable because it can accommodate more drug molecules, which increases the drug solubility at high pressure. However, the processing cost should be taken into account as the pressure and temperature go up.

Conclusion

In this work, we investigated the effectiveness of tree-based models in predicting the solubility of the drug hyoscine and the density of the solvent in supercritical processing of drugs. We used Gradient Boosting, Extra Trees, and Random Forest models, together with WCA as the model optimizer, and evaluated their accuracy using the R², MSE, MAPE, and Max Error metrics. Our results demonstrated that both the Gradient Boosting and Extra Trees models were highly accurate in predicting the solubility and density values. These models had R² values above 0.96, and their MAPE and Max Error values were relatively low, indicating a low average percentage error and a low maximum deviation from the true value. These findings suggest that tree-based models, particularly Gradient Boosting and Extra Trees, could be effective in predicting the solubility of hyoscine and the density of the solvent. This could have significant implications in drug discovery and other chemical industries, where the ability to accurately predict solubility and density values could aid in the development of new drugs or chemical products.

Data availability

All data generated or analyzed during this study are included in this published article.

References

1. Kaur, G. et al. Exploring the aggregation behaviour and antibiotic binding ability of thiazolium-based surface-active ionic liquids; Understanding transportation of poorly water-soluble drug. Colloids Surf. A 664, 131195 (2023).
2. Abdelbasset, W. K. et al. Development a novel robust method to enhance the solubility of Oxaprozin as nonsteroidal anti-inflammatory drug based on machine-learning. Sci. Rep. 12(1), 13138 (2022).
3. Begum, M. Y. Advanced modeling based on machine learning for evaluation of drug nanoparticle preparation via green technology: Theoretical assessment of solubility variations. Case Stud. Therm. Eng. 45, 103029 (2023).
4. Abdelbasset, W. K. et al. Development of GBRT model as a novel and robust mathematical model to predict and optimize the solubility of decitabine as an anti-cancer drug. Molecules 27(17), 5676 (2022).
5. Abourehab, M. A. S. et al. Enhancing drugs bioavailability using nanomedicine approach: Predicting solubility of Tolmetin in supercritical solvent via advanced computational techniques. J. Mol. Liq. 365, 120103 (2022).
6. Abuzar, S. M. et al. Enhancing the solubility and bioavailability of poorly water-soluble drugs using supercritical antisolvent (SAS) process. Int. J. Pharm. 538(1), 1–13 (2018).
7. Blokhina, S. V. et al. Solubility and lipophilicity of antiarrhythmic drug Dofetilide in modeling physiological media. J. Chem. Thermodyn. 161, 106512 (2021).
8. Alqarni, M. et al. Solubility optimization of loxoprofen as a nonsteroidal anti-inflammatory drug: Statistical modeling and optimization. Molecules 27(14), 4357 (2022).
9. Chinh Nguyen, H. et al. Computational prediction of drug solubility in supercritical carbon dioxide: Thermodynamic and artificial intelligence modeling. J. Mol. Liq. 354, 118888 (2022).
10. An, F. et al. Machine learning model for prediction of drug solubility in supercritical solvent: Modeling and experimental validation. J. Mol. Liq. 363, 119901 (2022).
11. Abourehab, M. A. S. et al. Experimental evaluation and thermodynamic analysis of Febuxostat solubility in supercritical solvent. J. Mol. Liq. 364, 120040 (2022).
12. Abourehab, M. A. S. et al. Laboratory determination and thermodynamic analysis of alendronate solubility in supercritical carbon dioxide. J. Mol. Liq. 367, 120242 (2022).
13. Faraz, O. et al. Thermodynamic modeling of pharmaceuticals solubility in pure, mixed and supercritical solvents. J. Mol. Liq. 353, 118809 (2022).
14. Kostyrin, E. V., Ponkratov, V. V. & Salah Al-Shati, A. Development of machine learning model and analysis study of drug solubility in supercritical solvent for green technology development. Arab. J. Chem. 15(12), 104346 (2022).
15. Xia, S. & Wang, Y. Preparation of solid-dosage nanomedicine via green chemistry route: Advanced computational simulation of nanodrug solubility prediction using machine learning models. J. Mol. Liq. 375, 121319 (2023).
16. Zhu, H. et al. Machine learning based simulation of an anti-cancer drug (busulfan) solubility in supercritical carbon dioxide: ANFIS model and experimental validation. J. Mol. Liq. 338, 116731 (2021).
17. Jovel, J. & Greiner, R. An introduction to machine learning approaches for biomedical research. Front. Med. 8, 2534 (2021).
18. Goel, E. et al. Random forest: A review. Int. J. Adv. Res. Comput. Sci. Softw. Eng. 7(1), 251–257 (2017).
19. Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
20. Geurts, P., Ernst, D. & Wehenkel, L. Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006).
21. Acosta, M. R. C. et al. Extremely randomized trees-based scheme for stealthy cyber-attack detection in smart grid networks. IEEE Access 8, 19921–19933 (2020).
22. Natekin, A. & Knoll, A. Gradient boosting machines, a tutorial. Front. Neurorobot. 7, 21 (2013).
23. Xu, M. et al. Decision tree regression for soft classification of remote sensing data. Remote Sens. Environ. 97(3), 322–336 (2005).
24. Ahmad, M. W., Reynolds, J. & Rezgui, Y. Predictive modelling for solar thermal energy systems: A comparison of support vector regression, random forest, extra trees and regression trees. J. Clean. Prod. 203, 810–821 (2018).
25. Breiman, L. et al. Classification and Regression Trees (Routledge, 2017).
26. Breiman, L. Random forests. Mach. Learn. 45(1), 5–32 (2001).
27. Cutler, A., Cutler, D. R. & Stevens, J. R. Random forests. Ensemble Mach. Learn. Methods Appl. 157–175 (2012).
28. Biau, G. & Scornet, E. A random forest guided tour. TEST 25, 197–227 (2016).
29. Sathyadevan, S. & Nair, R. R. Comparative analysis of decision tree algorithms: ID3, C4.5 and random forest. In Computational Intelligence in Data Mining-Volume 1: Proceedings of the International Conference on CIDM, 20–21 December 2014 (Springer, 2015).
30. Wehenkel, L., Ernst, D. & Geurts, P. Ensembles of extremely randomized trees and some generic applications. In Proceedings of Robust Methods for Power System State Estimation and Load Forecasting (2006).
31. Friedman, J. H. Greedy function approximation: a gradient boosting machine. Ann. Stat. 1189–1232 (2001).
32. Abou El-Ela, A. A., El-Sehiemy, R. A. & Abbas, A. S. Optimal placement and sizing of distributed generation and capacitor banks in distribution systems using water cycle algorithm. IEEE Syst. J. 12(4), 3629–3636 (2018).
33. Eskandar, H. et al. Water cycle algorithm–A novel metaheuristic optimization method for solving constrained engineering optimization problems. Comput. Struct. 110, 151–166 (2012).
34. Sadollah, A., Eskandar, H. & Kim, J. H. Water cycle algorithm for solving constrained multi-objective optimization problems. Appl. Soft Comput. 27, 279–298 (2015).
35. Razmjooy, N., Khalilpour, M. & Ramezani, M. A new meta-heuristic optimization algorithm inspired by FIFA world cup competitions: Theory and its application in PID designing for AVR system. J. Control Autom. Electr. Syst. 27, 419–440 (2016).
36. Jafar, R. M. S. et al. A comprehensive evaluation: water cycle algorithm and its applications. In Bio-Inspired Computing: Theories and Applications: 13th International Conference, BIC-TA 2018, Beijing, China, November 2–4, 2018, Proceedings, Part II 13 (Springer, 2018).
37. Hani, U. et al. Study of hyoscine solubility in scCO2: Experimental measurement and thermodynamic modeling. J. Mol. Liq. 381, 121821 (2023).

Acknowledgements

The authors extend their appreciation to the Deanship of Scientific Research at King Khalid University for funding this work through Small Groups Project under Grant number (RGP 1/399/44).

Author information

Authors and Affiliations

Department of Pharmaceutics, College of Pharmacy, King Khalid University, P.O. Box 1882, 61441, Abha, Saudi Arabia
Mohammed Ghazwani
Department of Pharmaceutics, College of Pharmacy, King Khalid University, Guraiger, 62529, Abha, Saudi Arabia
M. Yasmin Begum

Contributions

M.G.: Conceptualization, Writing, Methodology, Validation. M.Y.B.: Writing, Supervision, Formal analysis, Validation, Resources.

Corresponding author

Correspondence to M. Yasmin Begum.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Ghazwani, M., Begum, M.Y. Computational intelligence modeling of hyoscine drug solubility and solvent density in supercritical processing: gradient boosting, extra trees, and random forest models. Sci Rep 13, 10046 (2023). https://doi.org/10.1038/s41598-023-37232-8

Received: 29 April 2023
Accepted: 18 June 2023
Published: 21 June 2023
DOI: https://doi.org/10.1038/s41598-023-37232-8
Springer Nature Limited
