Abstract
Advances in high-throughput screening (HTS) have revolutionized the data landscape of the environmental and health sciences. However, new compounds must still be synthesized and tested experimentally to obtain HTS data, which remains costly and time-consuming when a large set of new compounds needs to be studied in many assays. Quantitative structure–activity relationship (QSAR) modeling is a standard method for filling data gaps for new compounds. The major challenge for many toxicologists, especially those with limited computational backgrounds, is efficiently developing optimized QSAR models for each assay that is missing data for certain test compounds. This chapter introduces a freely available and user-friendly QSAR modeling workflow that trains and optimizes models using five algorithms without requiring a programming background.
Copyright information
© 2022 This is a U.S. government work and not under copyright protection in the U.S.; foreign copyright protection may apply
About this protocol
Cite this protocol
Ciallella, H.L., Chung, E., Russo, D.P., Zhu, H. (2022). Automatic Quantitative Structure–Activity Relationship Modeling to Fill Data Gaps in High-Throughput Screening. In: Zhu, H., Xia, M. (eds) High-Throughput Screening Assays in Toxicology. Methods in Molecular Biology, vol 2474. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-2213-1_16
DOI: https://doi.org/10.1007/978-1-0716-2213-1_16
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-2212-4
Online ISBN: 978-1-0716-2213-1
eBook Packages: Springer Protocols