1 Introduction

Road traffic accidents (RTAs) kill thousands of people and destroy property around the world every day, without discrimination, yet mitigating their severity has received little attention. RTAs are among the leading causes of death and property damage worldwide. Identifying the primary road traffic accident factors helps to provide appropriate solutions that minimize the adverse effects of severity on human life and property. Accident severity does not occur by chance: it has patterns and can be predicted and avoided. Accidents are therefore "events which can be examined, analyzed, and prevented" [20]. According to the workers' health organization, "fatalities are not fated; accidents do not just happen; illness is not random; they are caused" [33]. Traffic accidents occur daily in Addis Ababa, the capital city of Ethiopia, taking human lives and damaging property in a fraction of a second, and they are one of the leading causes of death in the country.

RTA severity has been an active road safety research area over the past two decades. Researchers have applied a variety of methods to road accident severity classification, traditionally building models with statistical approaches. These techniques help to gain insight into and identify the underlying causes of vehicle accidents and the related road safety factors. Nowadays, given the massive volume of available data, machine learning surpasses conventional statistical approaches in predictive modeling [41]. Much of the literature explains the causes of road traffic accident severity in different countries [7, 9, 36, 37, 40, 43, 45], but road traffic accident severity prediction is still under development. Previous studies leave room for a hybrid machine learning approach to improve classification accuracy, and to fill this gap we work on a hybrid machine learning approach for road accident classification that improves prediction accuracy. Earlier work mainly examined the performance of individual machine-learning-based classifiers; there is a dearth of comparisons among state-of-the-art algorithms, hybrid machine learning, and deep learning algorithms. Choosing a suitable approach makes the prediction more informative, and obtaining the best paradigm helps to identify the most determinant road accident factors. Furthermore, target-specific contributing factors were not considered or identified previously. This study uses hybrid clustering and classification algorithms to predict road accident severity: a new hybrid K-means and random forest algorithm is proposed to predict target-specific road accident severity. The proposed approach is compared with individual classifiers to measure the performance of the developed model, using accuracy, precision, specificity, and recall against the conventional techniques (SVM, KNN, LR, and RF). The new approach is composed of the following phases: (I) removing disturbing noise and filling missing data using the mean for numeric variables and the mode for categorical variables, (II) splitting the dataset into training and test sets, (III) creating a new feature using clustering, (IV) training the classifiers, and (V) finally evaluating the performance of the individual classifiers. Moreover, the proposed approach is compared with a deep neural network to evaluate it against another state-of-the-art classification technique. The evaluation outcome shows that the proposed approach performs better than the other classifiers on the classification and performance metrics.

The rest of the paper is organized as follows. Section 2 discusses existing research on road accident classification with machine learning approaches. Section 3 presents the new hybrid machine learning method based on k-means and random forest. The experiment, evaluation, and discussion are summarized in Sect. 4. Section 5 explains the results and analysis derived from the experiment, and the conclusion is presented in Sect. 6.

2 Literature review

In the area of road safety, traditional statistical model-based techniques were long used to predict accident fatality and severity. The mixed logit modeling approach [23, 26], the ordered probit model [54], and the logit model [11] are a few of the adopted conventional statistical studies. Some studies argue that conventional statistical models better identify dependent and independent accident factors [31], but the conventional statistical approach lacks the capability to deal with multidimensional datasets [16]. To overcome the limitations of traditional statistical models, many studies nowadays use the ML approach for its predictive supremacy, time savings, and informative dimension. In this decade the ML approach has been employed in the construction industry [48], occupational accidents [41], agriculture [22], educational classification [53], sentiment classification [50], and banking and insurance [46].

In road accident prediction specifically, many studies have been performed using data mining, machine learning, and deep learning algorithms. Among clustering and classification algorithms, K-means, Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Decision Tree (DT), Artificial Neural Network (ANN), Convolutional Neural Network (CNN), and Logistic Regression (LR) are at the forefront of building accident severity models. Kwon et al. [28] applied Naive Bayes (NB) and Decision Tree (DT) to a California dataset collected from 2004 to 2010. The authors used binary regression to compare the performance of the developed models; Naive Bayes was more sensitive to risk factors than the Decision Tree model.

Sharma et al. [44] analyzed road accident data using SVM and MLP on a limited dataset (300 records). Moreover, the authors used only two independent variables (alcohol and speed) as the key factors. SVM with an RBF kernel gave better accuracy (94%) than MLP (64%), and the study showed that driving at high speed after drinking was the main reason for accident occurrence.

Wahab and Jiang [51] studied crash accidents on a Ghana dataset using MLP, PART, and SimpleCART, intending to evaluate classifiers and identify the major factors in motorcycle crashes. The authors used the Weka tool to compare and analyze the dataset, and InfoGainAttributeEval was applied to find the most influential variables for motorcycle crashes in Ghana. As a result, the SimpleCART model showed better accuracy than the other classification models.

Kumar et al. [27] implemented k-means and association rule mining to identify high-frequency accident severity locations and to extract hidden information. Of the total 158 locations, 87 were selected after removing those with an accident frequency count of less than 20. K-means was then applied to cluster them into three groups, with the number of clusters determined by gap statistics, and rules were extracted using a minimum support of 5 percent. As a result, curved and sloped sections of hilly roads were revealed as accident-prone locations. Another study worked on the FARS dataset using data mining techniques to combat deaths and injury severity during 2007; after preprocessing, clustering, association rules, and Naive Bayes were applied to find trends in fatal accidents in the USA, and the study identified human factors and collision types as the main causes of the fatality rate [34]. Other work in Iran used clustering and classification techniques to build an accurate model, mainly focusing on combining k-means clustering with self-organizing maps to obtain better classification accuracy than ANN and ANFIS; the authors' preferred model performed better than the single classifiers [5]. AlMamlook et al. [6] used AdaBoost, Naive Bayes, Logistic Regression, and Random Forest to find determinant factors and identify high-risk highways for Michigan traffic agencies. ROC, AUC, precision, recall, and F1-score were applied to evaluate the models, and the study showed that random forest outperformed the other classifiers with an accuracy of 75.5%. Tiwari et al. [47] used a data mining approach to analyze casualty-class traffic accidents, implementing clustering techniques such as k-modes and SOM and classification techniques such as NB, DT, and SVM; better accuracy was obtained on the clustered dataset than with classification alone.

For existing studies on road accident severity in Ethiopia, see [2, 8, 10, 17, 25, 49]. These works are mostly concerned with road accident analysis and pedestrian severity in Ethiopia using statistical methods. On the other hand, some studies employed data mining techniques (Decision Tree and MLP) in the Weka tool, focusing mainly on driver responsibility [39]. Another study employed the J48 and PART data mining algorithms in Weka on driver and vehicle information, considered major risks in accident severity [19]. In other related work in the country, Beshah [12] studied the key roadway-related variables for accident severity in Ethiopia; the authors used a data mining approach (Decision Tree, Naive Bayes, and KNN) to develop decision rules for improving road safety. Their focus was on analyzing driver and pedestrian crashes without giving much attention to how machine learning accuracy can better identify the major risks influencing road accidents in Ethiopia. Given the growth of crashes, there is a great need for road safety prevention studies, and there is still room to improve the prediction accuracy of RTA severity in Ethiopia. Therefore, we develop a new hybrid approach to classify road accident severity by combining clustering with classification, which yields remarkable classification results in road accident prediction. In this cooperation, clustering reduces the heterogeneity of the sample dataset while classification predicts road traffic severity; clustering thus provides indirect support for classification by extracting hidden information from the training set to improve classifier performance.

3 Methodology

The study mainly concerns variable-based classification of road accident severity. It combines the K-means clustering and classification techniques to obtain better results than an individual classifier: K-means is employed to create a new feature, and random forest is used for classification. The workflow of the proposed approach is shown in Fig. 1. The major components of the flow chart are as follows:

Fig. 1 Flowchart of proposed model framework for predicting road traffic accident—case of Ethiopia

3.1 Road accident dataset manipulation

3.1.1 Raw traffic accident dataset

The dataset in this study comprises 5000 road traffic accident records collected from federal traffic police agency reports from 2011 to 2018 in Addis Ababa, Ethiopia. One of the challenging parts of this research was collecting the sample data from the organization. The original dataset obtained from the authority was handled manually: most of the fields are incomplete, some fields are useless, and several vital fields were not included in the manual documents. Many documents were barely legible, written either heedlessly or in a rush, so we had to transcribe the records into Excel format to ease analysis and prediction. For each accident, 14 variables were recorded (ten categorical and four numeric). Among these variables, severity is the target variable with three values (fatal, serious injury, and light injury). The full dataset is described as follows:

Accident time This variable gives the time of day at which the accident occurred (24-hour format).

Driver age This variable shows the age of the driver. Driver ages mainly range from 18 to 80 years.

Sex This variable indicates the driver’s sex. The driver’s sex (observed from the data collection) is either male or female.

Drivers experience This variable indicates the driver's experience, i.e., how long the driver has been driving.

Type of vehicle This variable indicates the type of vehicle involved, namely: ambulance, car, automobile, Isuzu, taxi, truck, motorcycle, pickup, bus, and minibus.

Service year This variable indicates the number of years the vehicle has been in service.

Location This variable indicates where the accident occurred, namely: canteen area, public area, organization, government office, hospital, college, vehicle station, market, and living area.

Road condition This variable shows the condition of the road during the accident, namely: dry, muddy, and wet.

Light condition This variable indicates the lighting conditions at the time of the accident.

Weather condition This variable indicates the weather during the accident, namely: rainy, sunny, cold, and windy.

Casualty class This variable indicates the class of the casualty, namely: driver, passenger, pedestrian, cycle driver, and resident.

Casualty age This variable indicates the age of the casualty.

Casualty sex This variable indicates the sex of the casualty, either male or female.

Severity This is the target variable and represents three classes, namely: fatal, serious injury, and light injury.

3.1.2 Preprocessing

The raw dataset was dirty, not in a format that computing machines can process, and gives incomplete information if used as it is. Using such data would reduce the efficiency of the accident severity prediction model, so irrelevant records need to be removed to obtain quality data. In the study, before building the model, intensive data preprocessing was employed to obtain meaningful and determinant risk factors: data cleaning, missing value handling, outlier treatment, categorical value encoding, and normalization were carefully applied to purify the data before use.

3.1.3 Splitting dataset

The raw dataset, together with the k-means-created feature, is split into a training set and a testing set. The training set is used to learn the newly proposed method, while the testing set is used to measure the performance of the proposed model. In the study, a 70:30 ratio is used to split the dataset: 70% is used to train the prediction model, whereas 30% is used to evaluate its classification accuracy.
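For illustration, a minimal scikit-learn sketch of this 70:30 split is shown below; the placeholder data, variable names, and the stratification choice are assumptions, not the paper's exact code.

```python
# Hedged sketch of the 70:30 split described above (placeholder data).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((5000, 14))              # stand-in for the 14 encoded RTA variables
y = rng.integers(0, 3, size=5000)       # stand-in severity labels (fatal / serious / light)

# 70% training, 30% testing; stratify keeps the class ratio similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)
print(X_train.shape, X_test.shape)
```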

3.1.4 Prediction model

A prediction model is used in machine learning to forecast future behavior by analyzing current and historical data.

3.2 K-means techniques

The K-means technique [35] is an unsupervised machine learning technique mainly used in statistical data analysis, image processing, signal processing, and information retrieval. The presence of heterogeneity in road accident data may lead to incorrect model building and prediction; unobserved heterogeneity is the presence of important unseen features that are correlated with the observed features during model building. To overcome this problem, we apply clustering to our accident dataset: splitting the data by similarity makes the records homogeneous within clusters and heterogeneous between clusters. Besides, clustering in collaboration with classification lets the classifier train a model in less time, with more accuracy, and with less computational memory when dealing with a massive dataset [29]. The K-means technique takes M data points in N dimensions and an initial set of k cluster centroids, where k is user-defined and determines the total number of clusters; after calculating the distance of each data point from each centroid, every point is assigned to its nearest cluster [24]. All points within a cluster are closer to their own centroid than to any other centroid. The primary goal of K-means is to minimize the Euclidean distance \(D(X_i, C_j)\) between each point and its centroid, which reduces the intra-cluster variance and increases the separation between clusters. The squared error function is given in Eq. 1.

$$\begin{aligned} f(x)=\sum \limits _{j=1}^{k}\sum \limits _{i=1}^{n}\Vert X_i-C_j\Vert ^2 \end{aligned}$$
(1)

where k is the number of clusters, n is the number of cases, \(C_j\) is the j-th centroid, and \(X_i\) is a data point whose Euclidean distance from the centroid is calculated. The K-means algorithm has an initialization phase and an iteration phase: in the first phase, data points are assigned randomly to the k clusters; in the iteration phase, the algorithm computes the distance between each data point and each cluster center, and it converges when every road accident data point is assigned to its nearest cluster [24]. The K-means algorithm works as follows (a minimal code sketch appears after the list):

1. Randomly initialize and select the \(C_j\) centroids.

2. Calculate the distance between each instance and each centroid \(C_j\).

3. Compute the mean of the data points in each cluster to obtain its new centroid.

4. Repeat the above steps until every point is assigned to its nearest cluster.
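The following NumPy sketch mirrors these four steps; it is illustrative only (random placeholder data, simplified empty-cluster handling) and is not the paper's implementation.

```python
# Minimal NumPy sketch of the K-means loop described above (placeholder data).
import numpy as np

def kmeans(X, k=3, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k data points as the initial centroids C_j
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: distance of every instance to every centroid
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)          # assign each point to its nearest cluster
        # Step 3: the new centroid is the mean of the points in each cluster
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids (hence the assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.random.default_rng(1).random((200, 5))   # stand-in for the encoded accident features
labels, centroids = kmeans(X, k=3)
print(np.bincount(labels))
```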

3.3 Random forest

Random forest is an ensemble classification technique proposed by Breiman and Adele Cutler that builds multiple trees and keeps the individual decision trees uncorrelated [13]. It is one of the most robust algorithms for prediction on large datasets. A single decision tree is prone to overfitting, but random forest uses multiple trees to reduce overfitting [13]: it creates many shallow trees on random subsets of the data and then aggregates them. It also gives accurate predictions on large datasets and does not lose much accuracy when facing missing data. Random forest combines multiple decision trees during training and then aggregates their outputs to build the model, so weak estimators improve when they are combined: even if some decision trees are weak, the overall output tends to be accurate. Figure 2 illustrates a sample random forest.

Fig. 2 Sample random forest (n_estimators = 5)
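A scikit-learn sketch matching the figure (five trees) is shown below; the data, depth limit, and other parameters are assumptions for illustration only.

```python
# Hedged sketch: a five-tree random forest like the one sketched in Fig. 2 (placeholder data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 14))
y = rng.integers(0, 3, size=1000)

# n_estimators=5 keeps the forest small, as in the figure; real models usually use many more trees
rf = RandomForestClassifier(n_estimators=5, max_depth=4, random_state=0)
rf.fit(X, y)

# each tree votes, and the forest aggregates the votes into the final class
print("individual tree predictions:", [int(t.predict(X[:1])[0]) for t in rf.estimators_])
print("forest prediction:", rf.predict(X[:1]))
```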

3.4 Proposed approach

These days, road accident data are stored in vast database repositories. A large dataset makes the training and testing phases more complicated and reduces prediction efficiency, so a powerful model is needed to overcome or minimize this complexity. We developed a hybrid K-means and random forest model to enhance the efficiency and accuracy of the prediction model. K-means is an unsupervised machine learning algorithm mainly used to find similar groups within a dataset; even though it is unsupervised, it can create a new feature for the training set that improves the performance of the classifier. Clustering creates a cluster feature and adds it to the training set, and random forest is then employed on the clustered training data to classify the severity of the RTA. Their combination produces a powerful prediction model in terms of generalization performance and predictive accuracy, as the sketch below illustrates.
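A minimal sketch of the hybrid idea, assuming scikit-learn and placeholder data; the paper's actual feature set and parameters may differ.

```python
# Hedged sketch of the hybrid K-means + random forest approach (placeholder data).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.random((5000, 14))                     # stand-in for the encoded RTA variables
y = rng.integers(0, 3, size=5000)              # stand-in severity labels

# Step 1: K-means creates a new cluster-membership feature (k = 3, as chosen in Sect. 5.2)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_feature = kmeans.fit_predict(X).reshape(-1, 1)
X_aug = np.hstack([X, cluster_feature])

# Step 2: random forest is trained on the cluster-augmented feature set
X_train, X_test, y_train, y_test = train_test_split(X_aug, y, test_size=0.30, random_state=42)
rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("hybrid accuracy:", accuracy_score(y_test, rf.predict(X_test)))
```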

4 Experiment, evaluation, and discussion

In this section, the preprocessing applied to the road accident dataset, the evaluation metrics, and the experimental result analysis are presented.

4.1 Dataset manipulation

The dataset collected from Addis Ababa City is not entirely clean and organized: the raw data were recorded manually and are prone to damage. However, the data must be in a machine-understandable format to extract meaningful information and develop an efficient intelligent system, and the road accident severity prediction model depends on the quality of the dataset. We used several data preprocessing techniques to clean it.

  • Missing value handling Missing value treatment is a mandatory task in data preprocessing: before building a model, missing values need to be filled using some strategy. In our dataset, some attribute values are missing, and building a prediction model on incomplete data will not give a decisive output; missing values ought to be handled wisely, either ignored or filled using an appropriate method [43]. Dropping records is one way to handle missing values, but it may discard valuable information, so in this study missing attributes were not dropped. Figure 3 presents the number of missing values and their percentage of the total dataset. Since the missing values amount to less than 50 percent of the total population, we substituted the feature mean for numeric variables and the most frequent (mode) value for categorical variables [3]. Our previous work gives detailed information on the preprocessing, see Ref. [43].

  • Categorical value encoding The raw traffic accident dataset consists of categorical and numeric values, but many machine learning algorithms require numeric values to build a model, and applying them directly to categorical values is a challenge. Therefore, categorical values should either be converted into numeric values or removed [32]. In our dataset most of the variables are categorical; among the 14 variables, 10 are categorical and needed to be transformed into a numeric format. The predictive variables and the target variable were converted into numeric form using one-hot encoding and label encoding, respectively (a combined sketch of the imputation and encoding steps follows this list).
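A minimal pandas/scikit-learn sketch of these two preprocessing steps, assuming hypothetical column names; it is not the paper's actual pipeline.

```python
# Hedged sketch of mean/mode imputation and one-hot / label encoding as described above.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny hypothetical excerpt of the RTA table (column names are assumptions)
df = pd.DataFrame({
    "driver_age":     [25, None, 40, 33],
    "road_condition": ["dry", "wet", None, "dry"],
    "severity":       ["fatal", "light injury", "serious injury", "light injury"],
})

# Missing values: mean for numeric columns, mode (most frequent value) for categorical columns
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])

# Predictive categorical variables -> one-hot encoding; target variable -> label encoding
X = pd.get_dummies(df.drop(columns="severity"))
y = LabelEncoder().fit_transform(df["severity"])
print(X.head(), y)
```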

Fig. 3 The number of missing values and their percentage in the RTA dataset

4.2 Experimental system set up

The study was implemented using Python 3.7 in a Jupyter notebook as the IDE, on a system with an Intel Core i7 1.80 GHz CPU, 8 GB RAM, and a 1 TB hard disk. This section presents the experiments: choosing an optimal value of k, evaluating the proposed approach, and finally comparing the conventional algorithms with the new approach.

4.3 Evaluation metrics

In the study, different evaluation metrics are used to measure the performance of the proposed approach on the road accident data, as indicated in Eqs. 2 to 6: accuracy, specificity, precision, recall, and F1 score [38].

$$\begin{aligned} Accuracy= \frac{TP +TN}{TP+FP+FN+TN} \end{aligned}$$
(2)
$$\begin{aligned} Recall= \frac{TP}{TP+FN}\end{aligned}$$
(3)
$$\begin{aligned} Specificity= \frac{TN}{FP+TN}\end{aligned}$$
(4)
$$\begin{aligned} Precision= \frac{TP}{TP+FP}\end{aligned}$$
(5)
$$\begin{aligned} F1 Score= 2 * \frac{(Precision * Recall)}{(Precision + Recall)} \end{aligned}$$
(6)
TP (true positive): the model predicts positive and the prediction is true (correct).

TN (true negative): the model predicts negative and the prediction is true.

FP (false positive): the model predicts positive but the prediction is false (incorrect).

FN (false negative): the model predicts negative but the prediction is false.

Here positive and negative refer to the predicted class, while true and false indicate whether the prediction matches the actual value.
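As a hedged illustration, the following scikit-learn snippet computes these metrics for a toy multi-class example; the labels are placeholders, and macro averaging over the three severity classes is an assumption since the paper does not state its averaging scheme.

```python
# Hedged sketch: computing the metrics of Eqs. 2-6 with scikit-learn (toy labels).
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

y_test = np.array([0, 1, 2, 1, 0, 2, 1])   # assumed 3-class severity labels
y_pred = np.array([0, 1, 2, 0, 0, 2, 1])

print("accuracy :", accuracy_score(y_test, y_pred))
# macro averaging treats the three severity classes equally (an assumption)
print("precision:", precision_score(y_test, y_pred, average="macro"))
print("recall   :", recall_score(y_test, y_pred, average="macro"))
print("f1 score :", f1_score(y_test, y_pred, average="macro"))

# Specificity (Eq. 4) has no direct scikit-learn helper for the multi-class case;
# per class it is TN / (TN + FP), computed here from the confusion matrix.
cm = confusion_matrix(y_test, y_pred)
for c in range(cm.shape[0]):
    tn = cm.sum() - cm[c, :].sum() - cm[:, c].sum() + cm[c, c]
    fp = cm[:, c].sum() - cm[c, c]
    print(f"specificity class {c}:", tn / (tn + fp))
```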

5 Experimental result analysis and discussion

5.1 Train-test split

Once the dataset is prepared, it is split into a training set and a testing set: the former is used to learn the classifier, whereas the latter is used to test the performance of the predictive model. A 70:30 ratio is applied, so 70% of the data is used to train the model and 30% is used to evaluate the trained model.

5.2 Choosing k

Fig. 4 a The average distance within clusters (SSD) and b the percentage of variance between clusters

Fig. 5 a Elbow technique for optimal k, b lines created from \(\hbox {k}=1\) to \(\hbox {k}=9\)

There is no single rule for finding the exact value of k for partitioning the training dataset. For each candidate k we initialize k-means and use its inertia attribute, the sum of squared distances of the training points to their nearest cluster center. As k increases, the sum of squared distances tends towards zero and the percentage of explained variance increases, as shown in Fig. 4a, b; if k were set to the number of training points M, each point would form its own cluster. Figure 5a plots the sum of squared distances against k. If the plot looks like an arm, the elbow of the arm is the optimal k; however, in our graph the elbow is not clear enough to determine the optimal value. We therefore drew a line joining the first and last points (\(\hbox {k}=1\) and \(\hbox {k}=9\); Fig. 5b) and calculated the distance between each point of the curve and this line to find the maximum. Figure 6a, b shows the distance of each k from the line: the maximum distance occurs at index 2 (3.63), so the optimal value of k is three, and the road accident dataset is clustered into three groups.

Fig. 6 a, b Calculated distance values from each k or cluster to the line (value of \(\hbox {k} = 3\))
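A sketch of this k-selection procedure under assumed placeholder data is shown below; it reproduces the idea (inertia curve plus distance-to-line rule), not the paper's exact computation.

```python
# Hedged sketch: inertia for k=1..9, then the distance of each (k, inertia) point from the
# line joining the first and last points; the largest distance marks the elbow.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((500, 10))   # stand-in for the encoded training set

ks = range(1, 10)
ssd = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

p1 = np.array([ks[0], ssd[0]])
p2 = np.array([ks[-1], ssd[-1]])
line = (p2 - p1) / np.linalg.norm(p2 - p1)

distances = []
for k, s in zip(ks, ssd):
    v = np.array([k, s]) - p1
    # distance to the line = norm of the component of v orthogonal to the line direction
    distances.append(np.linalg.norm(v - (v @ line) * line))

best_k = ks[int(np.argmax(distances))]
print("optimal k:", best_k)
```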

5.3 Model performance evaluation

This section briefly discusses the evaluation of the performance and reliability of the model and the comparison of the proposed approach with the conventional models.

The study employed the k-means algorithm on the raw accident dataset to cluster it into three groups based on the chosen k. The newly created cluster label was used as a new feature and added to the training set, and experiments were performed on both the raw dataset and the training set with the added feature. Applying the unsupervised k-means algorithm directly to the raw dataset scored an accuracy of 42.25%, whereas the supervised algorithms, logistic regression, random forest, support vector machine, and k-nearest neighbors, scored 86.83%, 87.77%, 68.45%, and 64.97%, respectively. On the training set with the new feature, k-means, logistic regression, random forest, support vector machine, and k-nearest neighbors scored 35.83%, 99.13%, 99.86%, 73.13%, and 68.58%, respectively. The experiment showed that all classification algorithms performed well on all evaluation metrics on both datasets, except the k-means classifier. The best performer on both datasets is random forest; on the cluster-augmented dataset in particular, its efficiency improved dramatically, reaching 99.86% accuracy. Every supervised classifier achieved a promising result, and the classification techniques showed impressive gains: random forest accuracy rose from 87.77% to 99.86%, whereas the unsupervised k-means classifier performed somewhat better on the raw dataset. Table 1 presents the performance of each model before and after adding the new feature to the dataset; the results show that the proposed model outperforms the other models. Table 2 presents the execution time of each model; notably, the KNN model had the lowest execution time.
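The before/after comparison can be reproduced in outline as follows; the data are random placeholders, so the printed accuracies will not match Table 1, and the classifier hyperparameters are assumptions.

```python
# Hedged sketch: the same four classifiers trained on the raw features and on the
# features augmented with the k-means cluster label (placeholder data).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.random((1000, 14))                 # stand-in for the 14 encoded RTA variables
y = rng.integers(0, 3, size=1000)          # stand-in severity labels

X_aug = np.column_stack([X, KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)])

models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
}

for name, features in [("raw", X), ("raw + cluster feature", X_aug)]:
    X_tr, X_te, y_tr, y_te = train_test_split(features, y, test_size=0.3, random_state=0)
    scores = {m: accuracy_score(y_te, clf.fit(X_tr, y_tr).predict(X_te)) for m, clf in models.items()}
    print(name, scores)
```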

Table 1 Performance evaluation of classifiers and proposed approach
Table 2 The execution time of models (ms)

In the study, the proposed approach was also compared with other related studies. Table 3 lists previous papers on road accident severity prediction with their methodologies. Our hybrid approach uses k-means for clustering and random forest for classification to improve the accuracy of the severity model and, more importantly, is designed to identify target-specific contributing factors for road accident severity. This kind of approach is rarely used in the related literature, and target-specific classification has not been addressed, which makes the study unique and significant for Ethiopia.

Table 3 Performance comparison of related work models

5.4 ANN experiment analysis

In this paper, we created baseline ANN road accident prediction models with different numbers of dense layers. The rectifier (ReLU) activation was used in the input/hidden layers and the sigmoid activation in the output layer. Table 4 presents the performance of the Artificial Neural Network (ANN) with different numbers of dense layers. The results show that the best test accuracy was obtained with two and three dense layers: both models (Model 1 = 88.77% and Model 2 = 88.77%) achieved better results than the third model (Model 3 = 88.03%). However, as shown in Fig. 7, Model 2 has a relatively lower test loss (0.3622) than Model 1 (0.3819) and Model 3 (0.3686), although test loss is less informative for multi-class classification. Figure 8 presents the AUC values of each neural network model with different numbers of dense layers; all models give similar results.
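A hedged Keras sketch of such a baseline ANN is given below; the layer widths, optimizer, epochs, and data are assumptions, and only the ReLU dense layers and sigmoid output follow the paper's description (softmax would be the more conventional choice for a three-class output).

```python
# Hedged Keras sketch of a baseline ANN with ReLU dense layers and a sigmoid output (toy data).
import numpy as np
from tensorflow import keras

num_features, num_classes = 14, 3
X = np.random.default_rng(0).random((1000, num_features)).astype("float32")
y = keras.utils.to_categorical(np.random.default_rng(1).integers(0, num_classes, 1000), num_classes)

model = keras.Sequential([
    keras.layers.Input(shape=(num_features,)),
    keras.layers.Dense(32, activation="relu"),              # first dense (ReLU) layer
    keras.layers.Dense(32, activation="relu"),              # second dense layer; a third would give "Model 2"
    keras.layers.Dense(num_classes, activation="sigmoid"),  # sigmoid output as reported in the paper
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy", keras.metrics.AUC(name="auc")])
model.fit(X, y, validation_split=0.3, epochs=10, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))
```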

Table 4 Test accuracy, loss, and ROC curve value of ANN model with multiple dense layers
Fig. 7 The validation accuracy and loss of the different ANN models

Fig. 8 ROC curve of different ANN models

5.5 Comparative of neural network and proposed models

In this study, the ANN model was compared with the proposed hybrid model. The comparison showed that the proposed model (hybrid K-means and random forest) performed better than the ANN model in terms of precision, recall, F1 score, and accuracy. Table 5 presents the performance comparison of the proposed and ANN models; the proposed model achieved a better result than the deep neural network.

Table 5 Comparison of ANN and proposed model performance with different metrics (%)

5.6 Random forest interpretation

Random forest builds numerous decision trees over several subsets of the RTA variables. It is commonly called a black box: it is difficult to know how the model processes its inputs internally because it comprises many decision trees, and examining every decision path of every deep tree is troublesome and impractical. Each individual tree, however, learns on bagged data with randomly selected features [13], so we can gain insight into the random forest by computing feature contributions. Before looking at the random forest, consider how a single decision tree works: it follows a series of decision paths from the root node to a leaf, each guarded by a sub-feature, and the prediction is the sum of the individual feature contributions and a bias (the mean value of the training set at the topmost region, the root). The decision tree prediction function is defined as:

$$\begin{aligned} f(x)=C_{\mathrm {full}} + \sum \limits _{k=1}^{K} contrib(x,k) \end{aligned}$$
(7)

where K is the number of features, \(C_{\mathrm {full}}\) is the value at the root node (the bias), and contrib(x, k) is the contribution of the kth feature in the feature vector x. The random forest prediction, as discussed in Sect. 3.3, is the average of its trees' predictions, so the random forest prediction function is defined as follows:

$$\begin{aligned} f(x)=\frac{1}{J}\sum \limits _{j=1}^J f_j(x) \end{aligned}$$
(8)

It is clear that the random forest prediction is the average of the tree biases plus the average contribution of each feature across the trees, which can be written as

$$\begin{aligned} f(x)=\frac{1}{J}\sum \limits _{j=1}^{J} C_{j,\mathrm {full}} + \sum \limits _{k=1}^{K}\left( \frac{1}{J}\sum \limits _{j=1}^{J} contrib_j(x,k)\right) \end{aligned}$$
(9)

The expression above shows how the random forest black box can be opened: by following the decision routes through each tree and computing the contributions of individual features. Knowing whether a predictive variable is related positively or negatively to the prediction helps to understand the model in detail and reveals the influence of each variable on the outcome.
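One way to compute this decomposition in practice is the third-party treeinterpreter package, sketched below on placeholder data; the feature names and label coding are hypothetical, and the paper does not state which tool it used.

```python
# Hedged sketch of the contribution decomposition in Eqs. 7-9 using `treeinterpreter`
# (pip install treeinterpreter); dataset and feature names are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from treeinterpreter import treeinterpreter as ti

rng = np.random.default_rng(0)
X_train = rng.random((500, 5))
y_train = rng.integers(0, 3, size=500)          # assumed coding: 0=fatal, 1=serious injury, 2=light injury
feature_names = ["driver_age", "service_year", "weather", "light_condition", "casualty_age"]

rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# prediction = bias (average C_full over trees) + sum of per-feature contributions
prediction, bias, contributions = ti.predict(rf, X_train[:1])
for name, contrib in zip(feature_names, contributions[0]):
    print(f"{name}: {contrib}")                  # one contribution per class for this instance
print("bias:", bias[0], "prediction:", prediction[0])
```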

In the experiment, the default parameter settings were used to implement the random forest algorithm. We found that day, driver experience, type of vehicle, location, light condition, casualty age, and casualty sex are strong contributors to serious injury; light condition, casualty class, casualty age, and casualty sex are contributors to light injury; whereas driver age, service year, weather condition, and casualty class are strong contributors to fatal accident severity.

6 Conclusion

In this study, a hybrid approach was developed to predict the severity of road traffic accidents from the RTA dataset. The approach is competitive with and better than traditional machine learning algorithms. K-means was used to create a new cluster feature that was added to the training set, and it showed convincing results when combined with classification algorithms: K-means groups the road accident data by similarity, and random forest classifies the accident factors into the severity variable. The combination of K-means with random forest outperforms the other conventional models, namely logistic regression, k-nearest neighbors, and support vector machine. Adding the cluster feature improved the classification accuracy of logistic regression, random forest, support vector machine, and k-nearest neighbors by 12.3, 12.09, 4.8, and 3.61 points, respectively; on the contrary, the accuracy of k-means decreased by 6.42 points. The experiment revealed that adding the new cluster feature to the training set has a strong impact on classification accuracy, with random forest achieving the best accuracy (99.86%). Before clustering and classification, data preprocessing was performed to purify the raw dataset: missing value treatment and conversion of categorical values were done to obtain a better result. The optimal value of k was discovered by calculating the maximum distance from each candidate k to the line joining k = 1 and k = n. Also, to build trust in the prediction model, an interpretation was made of how the model arrives at its predictions: a prediction is the sum of a bias and feature contributions, and the experiment presented insights into the variables contributing to the prediction model. Knowing the influence of individual variables on the prediction makes the model more trustworthy. Moreover, the study explains target-specific variable contributions. Overall, the paper showed the effect of combining clustering and classification to improve model accuracy and identified the major class-specific contributing factors from the collected road traffic accident data. Additional datasets would further strengthen our model.