1 Introduction

Concrete, known for its strength, durability, and versatility, is extensively used in civil construction [1] and plays a critical role in structures of all sizes, from small buildings to large infrastructure projects. Many parameters must therefore be assessed to ensure the safety and stability of concrete structures. Among them, the compressive strength of concrete is a fundamental parameter in civil engineering, with direct implications for safety and durability, and it is widely applied as a key safety indicator [1,2,3]. Compressive strength must therefore be assessed to verify that a given concrete meets its safety and resistance requirements. However, the traditional methods used to test compressive strength are destructive, expensive, and limited, especially considering the complexity of concrete composition, which is influenced by ingredient proportions and curing time [4].

Because concrete properties are influenced by a wide range of variables, such as the quantities of cement, water, aggregates, and additives [5,6,7], it can be difficult to find a formulation with the desired strength and resistance for a specific application. In the search for high-performance concrete, many novel formulations have been developed in recent years, including fiber-reinforced concrete [6, 8, 9], nanoparticle-enriched concrete [10], self-healing concrete [11,12,13], and structural high-performance concrete enriched with fly ash and other “low-value” additives such as slag powder and silica fume [14, 15]. It is known that the proportions and interactions of such additives and aggregates with cement and water affect the compressive strength of concrete [2]. New formulations must therefore be subjected to assays and tests to verify how the proportions and interactions of these components affect the mechanical properties of concrete, which can be costly and time-consuming.

In this context, Machine Learning (ML) techniques offer an innovative approach that can account for a wide range of complex variables and interactions, and they have been increasingly applied in civil engineering [16,17,18,19,20,21,22,23]. ML enables concrete strength to be modeled and predicted with greater accuracy and efficiency, which can significantly reduce costs and contribute to the development of more robust and durable concrete [16, 24, 25]. Although ML has been successfully applied to predict structural features of concrete [26,27,28], predicting compressive strength, which is pivotal for ensuring the ability of concrete to withstand applied loads, remains challenging and has led to the development of various ML models over the years [2, 28,29,30,31]. The advantage of ML models is that they consider multiple variables and can identify complex patterns in the data [32, 33]; ML can therefore be a useful tool to aid the design and development of highly resistant concrete. Despite the increasing development of ML models, it is difficult to predict compressive strength accurately because of the non-linear relationships between concrete components, and several works report distinct performances in predicting the compressive strength of concrete [16, 24, 25]. Given this context, this work sought to scrutinize the application and performance of ML techniques, including multiple linear regression (MLR), support vector machines for regression (SVR), gradient boosting (GB), random forest (RF), and artificial neural networks (ANN), to predict compressive strength and to explore the advantages and challenges associated with these techniques. Moreover, this work conducted a comprehensive analysis of the interactions between predictive variables in concrete composition and their impact on compressive strength to gain in-depth insights into the relationships between these variables.

2 Materials and methods

2.1 Dataset

An initial dataset [34] was supplemented with information extracted from the literature to enhance the diversity and robustness of the learning process [16]. The final dataset consists of 1234 records of concrete compressive strength, which were used to train the ML algorithms. This dataset provided comprehensive and representative information on concrete properties and their corresponding compressive strength values. An existing dataset was chosen to ensure that the research was based on a representative and diverse sample, thereby making the results more robust and generalizable. Another advantage of this approach is that it allows external validation of the model and results, which serves as an independent check that strengthens the reliability of the study.

Furthermore, utilizing a pre-existing dataset was crucial for time and resource efficiency, as collecting data can be time-consuming and costly. By using this dataset to train the ML algorithms, it was possible to explore the patterns and relationships between the eight input variables (water, cement, fly ash, blast furnace slag, superplasticizer, coarse aggregate, fine aggregate, and curing time) and the output variable (compressive strength). The roles of these concrete variables, which have interdependent relationships, are described in Table 1.

Table 1 Meaning of the variables that make up concrete

2.2 Attribute selection

To identify the best attributes from the dataset for training the models and predicting compressive strength, we employed the “SelectKBest” feature selection method [35]. The selection criterion was the F-regression score, which measures the linear association between each attribute and the target variable. Based on this analysis, ML models were trained in two different scenarios by selecting k independent (predictor) variables: one with eight predictor variables (k = 8), including water, cement, fly ash, blast furnace slag, superplasticizer, coarse aggregate, fine aggregate, and curing time; and another with six predictor variables (k = 6), in which fly ash and fine aggregate were removed. This comparison allowed us to assess how the concrete components influence the compressive strength of concrete.
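A minimal sketch of this attribute selection step is given below, assuming the dataset is loaded into a pandas DataFrame; the file name and column names are illustrative and may differ from those in the actual dataset.

```python
# Hypothetical sketch of attribute selection with SelectKBest and f_regression.
# File name and column names are assumed for illustration only.
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

data = pd.read_csv("concrete.csv")                # assumed file name
X = data.drop(columns=["compressive_strength"])   # eight input variables
y = data["compressive_strength"]                  # target variable

# k = 6 keeps the six attributes with the highest F-regression scores
selector = SelectKBest(score_func=f_regression, k=6)
X_selected = selector.fit_transform(X, y)

# Inspect the score of each attribute and which ones were kept
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores)
print("Selected attributes:", list(X.columns[selector.get_support()]))
```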

2.3 Machine learning models

Data were divided into training and test sets in an 8:2 ratio. An initial exploratory analysis was then conducted to evaluate data quality using descriptive statistics, which is crucial for understanding the nature of the data and identifying potential obstacles that may impact the effectiveness of model training. Five methods were selected to develop the prediction models: multiple linear regression (MLR), support vector regression (SVR), random forest (RF), artificial neural networks (ANN), and gradient boosting (GB), and their performances were compared with each other. The quality of each model’s fit was assessed using two metrics: the coefficient of determination (R2), which measures how much of the variability in the actual values is explained by the model’s predictions, and the root mean square error (RMSE), which quantifies the average deviation between the model’s predictions and the actual values. The training results were recorded for evaluation purposes.
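The shared evaluation protocol (8:2 split, R2, and RMSE) can be outlined as follows; this is an illustrative sketch, reusing X and y from the sketch above, with an assumed random seed.

```python
# Illustrative 8:2 train/test split and the two evaluation metrics used for all models.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # seed assumed, for reproducibility
)

def evaluate(model, X_test, y_test):
    """Return R2 and RMSE of a fitted model on the held-out test set."""
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    return r2, rmse
```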

2.3.1 Exploratory analysis

An initial exploratory analysis was conducted to obtain descriptive statistics and insights into data quality and variable correlations, including the number of completed fields in the database, the standard deviation, and the maximum (max) and minimum (min) values of each variable.
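Such descriptive statistics can be obtained, for example, with pandas; the sketch below assumes the same hypothetical file and column names used earlier.

```python
# Descriptive statistics of the dataset: completed fields, mean, std, min, quartiles, max.
import pandas as pd

data = pd.read_csv("concrete.csv")  # assumed file name

print(data.describe())    # count, mean, std, min, 25%, 50%, 75%, max for each variable
print(data.isna().sum())  # number of missing entries per variable
```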

2.3.2 Multiple linear regression

Multiple linear regression (MLR) is employed to model the relationship between a set of independent variables and a dependent variable. In this study, the MLR model was used to obtain baseline predictions that served as a reference for assessing the performance of the other ML models.
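A minimal sketch of this MLR baseline, reusing the split and the evaluate helper from the evaluation sketch in Sect. 2.3:

```python
# Baseline multiple linear regression (sketch).
# Assumes X_train, X_test, y_train, y_test and evaluate() from the evaluation sketch.
from sklearn.linear_model import LinearRegression

mlr = LinearRegression()
mlr.fit(X_train, y_train)

r2, rmse = evaluate(mlr, X_test, y_test)
print(f"MLR: R2 = {r2:.3f}, RMSE = {rmse:.3f} MPa")
```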

2.3.3 Support vector machine for regression

This technique is an extension of support vector machine (SVM) algorithms [36]. Unlike SVM, which provides a binary output, SVR estimates a real-valued function and is better suited for solving regression problems [37, 38].

To train SVR models, several hyperparameters must be set. The following parameters and their functions, as described in the literature, were considered: the kernel, a mathematical function responsible for transforming the data; C, a regularization parameter that controls the trade-off between maximizing the hyperplane margin and minimizing the training error; epsilon (regression precision), which establishes the maximum acceptable error within which no penalty is applied; gamma, the kernel coefficient, which controls how far the influence of a single training example reaches; and degree, which controls the complexity of the transformed space for polynomial kernels [37, 39, 40]. The parameter values were selected using the GridSearchCV method, which operates through an iterative process in which each iteration evaluates a combination of parameter values and computes a score; the combination with the highest score is identified as the best one.
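A hedged sketch of this grid search is shown below; the parameter grid values are illustrative and do not necessarily match the exact grid searched in this work.

```python
# Grid search over SVR hyperparameters (illustrative grid; actual ranges may differ).
# Assumes X_train and y_train from the evaluation sketch.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param_grid = {
    "kernel": ["linear", "poly", "rbf"],
    "C": [1, 10, 100],
    "gamma": [0.01, 0.1, 1],
    "epsilon": [0.1, 0.5, 1.0],
    "degree": [1, 2, 3],
}

grid = GridSearchCV(SVR(max_iter=1000), param_grid, scoring="r2", cv=5, n_jobs=-1)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best cross-validated R2:", grid.best_score_)
```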

2.3.4 Gradient boosting

Another method used in this study was Gradient Boosting (GB), an ML technique designed for regression and classification problems. This approach builds a prediction model as an ensemble of weak prediction models, typically decision trees [19, 41]. The parameters employed and their respective functions are as follows (see the code sketch after this list):

  • n_estimators: This parameter represents the number of boosting stages. Since GB is fairly robust to overfitting, a larger number generally leads to better performance. In this work, this parameter was set to 1000;

  • max_depth: This parameter determines the maximum depth of the individual regression estimators. The maximum depth restricts the number of nodes in the tree, and adjusting this parameter aims to optimize performance, with the ideal value depending on the interaction of the input variables. In this work, this parameter was set to 10;

  • min_samples_split: This parameter specifies the minimum number of samples required to split an internal node. In this work, this parameter was set to 20;

  • learning_rate: This parameter is the learning rate, which shrinks the contribution of each tree by the given factor. There is a trade-off between learning_rate and n_estimators. In this work, this parameter was set to 0.01;

  • loss: This parameter denotes the loss function to be optimized, where ‘squared_error’ refers to the squared-error (mean squared error) loss.
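The parameter list above maps directly onto scikit-learn's GradientBoostingRegressor; a minimal sketch is given below (the random seed is an assumption, not a value reported in this work).

```python
# Gradient boosting model with the parameter values listed above (sketch).
# Assumes X_train, X_test, y_train, y_test and evaluate() from the evaluation sketch.
from sklearn.ensemble import GradientBoostingRegressor

gb = GradientBoostingRegressor(
    n_estimators=1000,
    max_depth=10,
    min_samples_split=20,
    learning_rate=0.01,
    loss="squared_error",
    random_state=42,  # assumed seed for reproducibility
)
gb.fit(X_train, y_train)

r2, rmse = evaluate(gb, X_test, y_test)
print(f"GB: R2 = {r2:.3f}, RMSE = {rmse:.3f} MPa")
```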

2.3.5 Artificial neural networks

To apply Artificial Neural Networks (ANN), the architecture of the neural network was tuned for optimization by considering the following parameters (a code sketch is given after this list):

  • solver = ‘adam’: This refers to the weight optimization solver. ‘adam’ is a gradient-based stochastic optimization algorithm particularly well-suited for large data sets.

  • hidden_layer_sizes = (32, 64, 32): This represents the number of neurons in the hidden layers. The model will have three hidden layers, with the first layer containing 32 neurons, the second layer containing 64 neurons, and the third layer containing 32 neurons.

  • n_iter_no_change = 200: This indicates the maximum number of epochs to iterate without observing an improvement in the training process.

  • random_state = 1: This is the seed of the random number generator, which is used to initialize the weights randomly.

  • max_iter = 5000: This represents the maximum number of iterations for the solver.

  • learning_rate_init = 0.0001: This signifies the initial learning rate for the ‘adam’ solver.

  • verbose = True: This setting enables the printing of the training progress.

Once the training is complete, the ANN model is ready to make predictions.
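The parameter names above match scikit-learn's MLPRegressor, so a minimal sketch assuming that implementation is:

```python
# ANN (multilayer perceptron) with the architecture and settings listed above (sketch).
# Assumes X_train, X_test, y_train, y_test and evaluate() from the evaluation sketch.
from sklearn.neural_network import MLPRegressor

ann = MLPRegressor(
    solver="adam",
    hidden_layer_sizes=(32, 64, 32),
    n_iter_no_change=200,
    random_state=1,
    max_iter=5000,
    learning_rate_init=0.0001,
    verbose=True,
)
ann.fit(X_train, y_train)

r2, rmse = evaluate(ann, X_test, y_test)
print(f"ANN: R2 = {r2:.3f}, RMSE = {rmse:.3f} MPa")
```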

2.3.6 Random forest regressor

Random forest regression is an ensemble machine learning technique that combines decision trees to predict the value of a target response [42]. Random forest algorithms use bootstrap sampling to generate random subsets of samples for training each base tree, meaning that, instead of being trained on all observations, each tree of the RF is trained on a subset of the observations. The predictive ability of random forest models can be tuned by adjusting a few parameters, which are discussed below. The first parameter is “n_estimators”, which represents the number of trees in the forest; in this work, it was set to 1000. The second parameter is “random_state”, which controls the bootstrap randomness; to ensure reproducibility, a fixed integer value of 42 was adopted. The third parameter is “criterion”, which controls how the algorithm measures the quality of a split; in this work, it was set to the mean absolute error to allow comparison with the other machine learning models developed in this work.
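A minimal sketch of this configuration is given below; note that recent scikit-learn versions name the mean-absolute-error criterion "absolute_error".

```python
# Random forest regressor with the settings described above (sketch).
# Assumes X_train, X_test, y_train, y_test and evaluate() from the evaluation sketch.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=1000,
    random_state=42,
    criterion="absolute_error",  # mean absolute error split criterion
)
rf.fit(X_train, y_train)

r2, rmse = evaluate(rf, X_test, y_test)
print(f"RF: R2 = {r2:.3f}, RMSE = {rmse:.3f} MPa")
```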

2.4 Influence of independent variable sensitivity

ML algorithms are often referred to as black boxes, meaning that the relationship between the input (independent) and output (dependent) variables cannot be directly inferred, unlike in statistical modeling. However, it is a myth that the relationships between variables and the workings of ML models cannot be explained [43, 44]. Hence, to determine the influence of each concrete component (variable) on the compressive strength, we used the permutation_importance function from the sklearn.inspection package. Details of the methodology are described elsewhere [22].
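A minimal sketch of this step, assuming a fitted model (here the gradient boosting model from the earlier sketch) and the held-out test set:

```python
# Permutation importance of each concrete component for a fitted model (sketch).
# Assumes gb, X, X_test and y_test from the earlier sketches.
from sklearn.inspection import permutation_importance

result = permutation_importance(
    gb, X_test, y_test, n_repeats=10, random_state=42, scoring="r2"
)

# Rank the variables by their mean importance
for name, mean_imp, std_imp in sorted(
    zip(X.columns, result.importances_mean, result.importances_std),
    key=lambda item: item[1],
    reverse=True,
):
    print(f"{name}: {mean_imp:.3f} +/- {std_imp:.3f}")
```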

3 Results and discussion

3.1 Exploratory analysis

Table 2 presents the findings of the preliminary exploratory analysis conducted on the database under investigation. This analysis yielded pertinent information, such as the number of fields completed in the database, the standard deviation, and the maximum (max) and minimum (min) values for each variable.

Table 2 Input variables in the compression analysis process

The mean compressive strength of the concrete was 35.1 MPa, indicating a relatively high strength suitable for applications requiring this property. However, there was considerable variability in the compressive strength, as evidenced by the standard deviation of 16.2 MPa. This variability may be attributed to factors such as the quality of the materials, mixing and curing procedures, and testing methods.

In contrast, the minimum compressive strength of 2.33 MPa is very low, making the concrete associated with this value unsuitable for most applications; this low strength may have been due to issues with the concrete mix or the testing procedures. Additionally, the compressive strength at the 25th percentile was 23.1 MPa, meaning that 25% of the concrete samples had a compressive strength below 23.1 MPa. At the 50th percentile, the compressive strength was 33.7 MPa, and at the 75th percentile it was 44.4 MPa. These findings indicate that the median compressive strength of the concrete was 33.7 MPa and that 75% of the samples had a strength below 44.4 MPa. Furthermore, the maximum compressive strength was 82.6 MPa, a very high value suitable for demanding applications.

Figure 1 shows the frequency distribution of the dependent variable (compressive strength) and the independent variables. The figure shows that the compressive strength values are diverse and sufficiently well distributed for training ML models, reducing the risk of under- or overfitting.

Fig. 1 Frequency distribution of variables in the dataset for training ML models

The data suggest that the concrete generally has a high resistance to compression, although there is significant variation in this property. ML techniques, known for their flexibility and adaptability, can effectively capture both linear and non-linear variations. By adjusting models to identify complex patterns and relationships between variables, these algorithms can predict and explain substantial variations in the data, including anomalies and outliers. Training on diverse datasets and optimizing hyperparameters allow these techniques to create models that can identify anomalous behavior or outliers, thereby enhancing the robustness and accuracy of the analysis [17, 33].

3.2 Correlation analysis

A correlation analysis was conducted to identify the concrete components with the strongest relationship with compressive strength. The results showed that concrete compressive strength has a moderate positive correlation with cement content (0.676) and a moderate positive correlation with the curing time (test age) of the concrete, while the correlations with the other materials are weaker. Figure 2 shows a heatmap of all variable correlations.
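A sketch of how such a correlation matrix and heatmap can be produced is shown below; seaborn is assumed here purely for plotting, and the column name is the hypothetical one used in the earlier sketches.

```python
# Pearson correlation matrix and heatmap of all variables (sketch).
# Assumes the DataFrame 'data' from the exploratory-analysis sketch.
import matplotlib.pyplot as plt
import seaborn as sns

corr = data.corr()
print(corr["compressive_strength"].sort_values(ascending=False))

sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.tight_layout()
plt.show()
```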

Fig. 2 Heatmap of feature (variable) correlations

From Fig. 2, we can conclude that most independent variables have a weak correlation with the compressive strength of concrete. These correlations provide valuable insights for developing predictive models of concrete strength, particularly when using non-linear ML techniques such as support vector regression, boosting, and ANN [16, 25, 45].

3.3 Machine learning models

3.3.1 Multiple linear regression

Figure 3 shows the results of the linear regression model. With eight predictor variables (k = 8), the model showed moderate performance (R2 = 0.594 and RMSE = 10.319 MPa). However, when only the six best predictor variables (k = 6) selected by the SelectKBest method were used, the coefficient of determination decreased (R2 = 0.450) and the RMSE increased (RMSE = 18.097 MPa).

Fig. 3 Multiple Linear Regression results for eight predictive variables (k = 8) and six predictive variables (k = 6)

The MLR model falls short in terms of R2, which is too low in both scenarios. With eight variables, the MLR analysis displayed only moderate performance, indicating its limitation in handling the complexity of the underlying relationships. Reducing the number of predictor variables to six led to a further decline in performance, emphasizing the model's sensitivity to the selection of these variables.

This suggests that the regression line does not adequately fit the data (Fig. 3) and that linear statistical regression models are not suitable for accurately predicting the compressive strength of concrete due to their low precision. Despite the low precision of MLR in predicting compressive strength, the attribute selection results (Fig. 4) make it evident that the superplasticizer and blast furnace slag contents significantly affect the resulting compressive strength. This observation corroborates previous findings [18, 34].

Fig. 4 Importance of predictor variables by score

3.3.2 Support vector regression

The best SVR estimator found via grid search had C = 100, degree = 1, gamma = 0.1, and max_iter = 1000. A comparative analysis of the different k settings provides relevant information about the SVR model’s performance (Fig. 5). When k was set to 8, including all predictive variables, R2 reached 0.836, indicating the model’s ability to explain the variability in concrete compressive strength, and the RMSE was 6.535 MPa, indicating high accuracy in the estimates. Interestingly, when k was reduced to 6, there was a slight increase in the performance indicators: R2 reached 0.840, suggesting that the model retained a robust ability to capture the underlying relationships in the data, and the corresponding RMSE was 6.442 MPa, indicating an acceptable level of accuracy in the predictions. This suggests that fine aggregate and fly ash do not strongly influence concrete compressive strength, which corroborates the attribute selection scores (Fig. 4) and previous studies [18, 24, 34].

Fig. 5 Support vector regression for six predictive variables (k = 6) and eight predictive variables (k = 8)

3.3.3 Gradient boosting

Gradient boosting was used to predict concrete strength in the models with six and eight predictive variables (Fig. 6). The models showed an excellent fit to the data, achieving R2 = 0.886 and RMSE = 5.434 MPa for k = 8 and R2 = 0.867 and RMSE = 5.842 MPa for k = 6. The analysis indicated that, while both configurations produced good results, the configuration with k = 8 exhibited slightly higher accuracy, with a lower RMSE and a higher R2. Moreover, gradient boosting outperforms MLR and is slightly better than SVR, indicating better reliability than these models.

Fig. 6 Gradient boosting for six predictive variables (k = 6) and eight predictive variables (k = 8)

3.3.4 Artificial neural networks

A comparison of the results obtained from the different ANN configurations (Fig. 7) reveals noticeable variations in the predictive performance for concrete compressive strength. In the scenario with k = 8, the neural model incorporated water, cement, fly ash, blast furnace slag, superplasticizer, coarse aggregate, fine aggregate, and curing time. The results revealed a strong coefficient of determination (R2 = 0.895), indicating that the model can explain most of the variability in compressive strength, and the RMSE (5.192 MPa) indicated a relatively high level of accuracy in the predictions.

Fig. 7 Artificial Neural Networks for six predictive variables (k = 6) and eight predictive variables (k = 8)

With k = 6, the model’s performance decreased slightly (R2 = 0.855), implying that the ANN can predict compressive strength with either k = 8 or k = 6 with only a small loss in performance. Furthermore, with k = 6 the RMSE increased to 6.098 MPa, indicating a greater spread of the predictions around the actual values.

3.3.5 Random forest regression

The Random Forest Regressor was the last model used to predict concrete strength with six and eight predictive variables (Fig. 8). The models showed an excellent fit to the data, achieving R2 = 0.868 and RMSE = 5.859 MPa for k = 8 and R2 = 0.855 and RMSE = 6.145 MPa for k = 6. The analysis indicated that, while both configurations produced good results, the configuration with k = 8 exhibited slightly higher accuracy, with a lower RMSE and a higher R2. Moreover, the RF performance is comparable to that of the other ML models, such as gradient boosting and support vector regression.

Fig. 8 Random Forest Regressor for six predictive variables (k = 6) and eight predictive variables (k = 8)

3.4 Overall performance of machine learning models

Our findings showed that, for most of the ML algorithms used, including more independent variables improved the results (Table 3). This improvement is evident in both the coefficient of determination, which measures the model’s explanatory capacity, and the RMSE, which reflects the accuracy of the predictions. This finding emphasizes the importance of comprehensive and relevant predictor characteristics in predictive analysis.

Table 3 Performance comparison of ML models

Several studies demonstrate the potential of ML models to predict the compressive strength of materials with considerable accuracy [18, 19, 23,24,25, 31, 34, 46, 47]. Through training on comprehensive datasets, these models learn complex patterns and relationships between material characteristics and their respective strengths [25]. The scientific literature documents cases where ML models achieve coefficients of determination (R2) exceeding 0.90 [23, 30, 31], mostly using neural networks, which represents a level of accuracy suitable for practical applications. In this work, the ANN also outperformed the other ML techniques, especially with eight predictive variables.

When six predictive variables were used instead of eight, we observed numerically lower performance values, highlighting the significance of the physical characteristics of the fly ash and fine aggregate excluded from the analysis. Fly ash plays a crucial role in concrete by enhancing cohesion, reducing exudation and segregation, and extending the setting time of fresh concrete [14]. In the hardened state, fly ash contributes to reducing the temperature rise from hydration reactions, resulting in more resilient concrete. Fine aggregates (e.g., sand) are incorporated into concrete to fill voids and improve the workability of the fresh material, playing a decisive role in concrete strength.

The comparison between the models with six and eight variables demonstrates that including more variables positively impacts the accuracy of concrete strength prediction (Table 3). This is because the additional variables provide more information about the behavior of the concrete, allowing the models to make more precise estimates.

However, the number of variables included in a model should be chosen carefully: an excessive number of variables can lead to overfitting, which impairs generalization, whereas an insufficient number can lead to underfitting, which reduces accuracy.

A comparative analysis of the models, including SVR, GB, RF, and ANN, revealed that GB exhibited the highest coefficient of determination (R2) with six predictive variables, while the ANN exhibited the highest performance when trained and tested with eight predictive variables. Interestingly, the differences between the ANN and GB are too small to be significant from a practical point of view; both models demonstrated a superior ability to explain the variation in the data compared to the other models (MLR, RF, and SVR). Moreover, the ANN excelled in minimizing the RMSE, indicating more accurate predictions and less scattering relative to the real values, leading to a more precise estimate of the concrete’s compressive strength.

3.5 Influence of independent variable sensitivity

A plot of the importance of each variable (Fig. 9) shows that, besides the components present in every concrete (cement, water, and aggregates), additives such as blast furnace slag, fly ash, and superplasticizers have a direct influence on compressive strength.

Fig. 9 Relative importance of the independent variables in the final ML models

Thus, this analysis corroborates the previous observation that such additives affect the compressive strength of concrete [14, 21, 47]. Moreover, previous studies on the compressive strength of concrete, involving both machine learning and experimental assays, found the same influence of the concentration and type of additives on compressive strength [16, 21, 23, 47].

4 Conclusions

This research utilized several modeling approaches to predict the compressive strength of concrete by considering relevant predictor variables. Our findings led us to conclude that the influence of multiple factors on the compressive strength of concrete makes simple modelling techniques, such as multiple linear regression, unsuitable for the regression problem addressed herein. More sophisticated techniques, such as artificial neural networks and ensemble methods (gradient boosting and random forest), are more appropriate for this problem. Moreover, these techniques are particularly good at mitigating problems that could arise during model training due to limited data availability, as ensemble techniques are known for remaining accurate when data availability is limited. Thus, the choice between these approaches depends on the project’s specific requirements, the data size, and the emphasis placed on interpretability versus accuracy.