Abstract
Landslides cause considerable destruction and loss of life in hilly areas. Reducing the damage in these vulnerable regions depends on predicting landslide incidents accurately, which remains a key challenge. Over the years, machine learning models have been used to improve the accuracy and precision of landslide predictions. These models are sensitive to the data to which they are applied. Feature selection is therefore a crucial step, because carefully selected features can significantly improve model performance, shorten training time and make the model easier to interpret. In this paper, we consider three feature selection methods: chi-squared, extra tree classifier and heat map. The paper demonstrates that feature selection can significantly improve the performance of the model. The study was carried out on landslide data from the Kullu to Rohtang Pass transport corridor in Himachal Pradesh, India. The classification score and receiver operating characteristic (ROC) curves were used to evaluate model performance. The results show that eliminating one or more features using the different feature selection methods increased the comprehensibility of the model by reducing the dimensionality of the dataset. The model achieved an accuracy of 90.74% and an area under the ROC curve (AUROC) of 0.979. Furthermore, with a reduced number of features, the model learns faster without affecting the result.
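The workflow summarised in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes scikit-learn and NumPy, uses a synthetic dataset in place of the Kullu to Rohtang Pass landslide data, and pairs each feature selection method with a decision tree classifier, the model named in the article title. The "heat map" method is treated here as a pairwise correlation filter, which is an assumption, since the abstract does not describe how the heat map was used. The value of `K`, the 0.8 correlation threshold, the tree depth and all helper names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): NOT the landslide dataset of the study.
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           n_redundant=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# chi2 requires non-negative inputs, so scale to [0, 1] using the training set.
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

K = 6  # hypothetical number of features to keep


def top_k_chi2(X, y, k):
    """Indices of the k features with the highest chi-squared score."""
    selector = SelectKBest(chi2, k=k).fit(X, y)
    return np.flatnonzero(selector.get_support())


def top_k_extra_trees(X, y, k):
    """Indices of the k features ranked highest by extra-trees importance."""
    forest = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X, y)
    return np.sort(np.argsort(forest.feature_importances_)[-k:])


def drop_correlated(X, threshold=0.8):
    """Indices kept after dropping one feature of each highly correlated pair.

    This is a numeric version of reading a correlation heat map by eye.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, i] < threshold for i in keep):
            keep.append(j)
    return np.array(keep)


def evaluate(columns):
    """Fit a decision tree on the given columns; return (accuracy, AUROC)."""
    tree = DecisionTreeClassifier(max_depth=5, random_state=0)
    tree.fit(X_train[:, columns], y_train)
    pred = tree.predict(X_test[:, columns])
    proba = tree.predict_proba(X_test[:, columns])[:, 1]
    return accuracy_score(y_test, pred), roc_auc_score(y_test, proba)


subsets = {
    "all features": np.arange(X_train.shape[1]),
    "chi-squared": top_k_chi2(X_train, y_train, K),
    "extra trees": top_k_extra_trees(X_train, y_train, K),
    "correlation": drop_correlated(X_train),
}
results = {name: evaluate(cols) for name, cols in subsets.items()}
for name, (acc, auc) in results.items():
    print(f"{name:13s} n={len(subsets[name]):2d} acc={acc:.3f} AUROC={auc:.3f}")
```

Comparing the rows for the reduced subsets against the "all features" row shows, on this synthetic data only, how dimensionality reduction affects the classification score and AUROC. The numbers it prints have no bearing on the 90.74% accuracy and 0.979 AUROC reported for the study.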
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12524-022-01645-1/MediaObjects/12524_2022_1645_Fig1_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12524-022-01645-1/MediaObjects/12524_2022_1645_Fig2_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12524-022-01645-1/MediaObjects/12524_2022_1645_Fig3_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12524-022-01645-1/MediaObjects/12524_2022_1645_Fig4_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12524-022-01645-1/MediaObjects/12524_2022_1645_Fig5_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12524-022-01645-1/MediaObjects/12524_2022_1645_Fig6_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12524-022-01645-1/MediaObjects/12524_2022_1645_Fig7_HTML.png)
Acknowledgements
This study is part of my Ph.D. research in the Department of Geography, Delhi School of Economics, University of Delhi, India. We are obliged to the University Grants Commission (UGC) for granting a fellowship for the research. We are also grateful to the National Disaster Management Authority (NDMA), Government of India; the Border Roads Organisation (BRO), Manali; and the Public Works Department (PWD), Kullu, for providing landslide data. We also thank O P Gupta, Network Administrator, Central Library, University of Delhi, for his contribution to improving the manuscript.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Nirbhav, Malik, A., Maheshwar et al. Landslide Susceptibility Prediction based on Decision Tree and Feature Selection Methods. J Indian Soc Remote Sens 51, 771–786 (2023). https://doi.org/10.1007/s12524-022-01645-1
DOI: https://doi.org/10.1007/s12524-022-01645-1