Blog

An efficient machine learning approach for predicting concrete chloride resistance using a comprehensive dataset | Scientific Reports

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

Scientific Reports volume  13, Article number: 15024 (2023 ) Cite this article Tile Glue Blender

An efficient machine learning approach for predicting concrete chloride resistance using a comprehensive dataset | Scientific Reports

By conducting an analysis of chloride migration in concrete, it is possible to enhance the durability of concrete structures and mitigate the risk of corrosion. In addition, the utilization of machine learning techniques that can effectively forecast the chloride migration coefficient of concrete shows potential as a financially viable and less complex substitute for labour-intensive experimental evaluations. The existing models for predicting chloride resistance encounter two primary challenges: the constraints imposed by a limited dataset and the absence of certain input variables. These factors collectively contribute to a decrease in the overall effectiveness of these models. Therefore, this study aims to propose an advanced approach for dataset cleaning, utilizing a comprehensive experimental dataset comprising 1073 pre-existing experimental outcomes. The proposed model for predicting the chloride diffusion coefficient incorporates various input variables, such as water content, cement content, slag content, fly ash content, silica fume content, fine aggregate content, coarse aggregate content, superplasticizer content, fresh density, compressive strength, age of compressive strength test, and age of migration test. The utilization of the artificial neural network (ANN) technique is also employed for the processing of missing data. The current supervised learning incorporates both regression and classification tasks. The efficacy of the proposed models for accurately predicting the chloride diffusion coefficient has been effectively validated. The findings indicate that the XGBoost and SVM algorithms exhibit superior performance compared to other regression prediction algorithms, as evidenced by their high R2 scores of 0.94 and 0.91, respectively. In relation to classification algorithms, the findings demonstrate that the Random Forest, LightGBM, and XGBoost models exhibit the highest levels of accuracy, specifically 0.93, 0.96, and 0.97, respectively. Furthermore, a website has been developed that is capable of predicting the chloride migration coefficient and chloride penetration resistance of concrete.

The concrete composition considerably impacts concrete's mechanical and durability characteristics1,2. Despite the mechanical properties, understanding the effect of concrete composition on durability is still challenging for researchers in this field3. The emergence of a new generation of concretes (including recycled, innovative, and sustainable materials) has increased this concern about the durability of concrete structures. The durability of reinforced concrete (RC) structures is considerably influenced by chloride penetration, especially in coastal and chloride-rich environments, resulting in a severe corrosion issue for steel reinforcement or steel fibers4. This corrosion phenomenon causes considerable economic loss and environmental influence in the construction industry. Various resources cause chloride attacks, such as de-icing salt, seawater, and groundwater. Different parameters substantially affect the chloride resistance of RC members, including concrete composition, temperature, relative humidity, and carbonation. Among these variables, concrete composition plays a major role in controlling internal damage of chloride attack. Previous studies strongly emphasized that concrete composition directly affects the fresh, mechanical, and durability characteristics2,5. Concrete composition means the water-to-binder ratio (W/B), water content, cement content, cement type, aggregate content, aggregate type, fillers, nanomaterials, mineral additives, and chemical admixtures. Each of these ingredients affects the concrete porosity. Achieving a reliable concrete mixture with the lowest permeability to prevent the penetration of the chloride ions is one of the main practical solutions to mitigate internal chloride attack damages. For instance, different supplementary cementitious materials (SCMs), fillers, and nanomaterials can be used to obtain an efficient concrete mixture resistant to chloride attack. However, as innovative and new concrete generations have been introduced in recent years, standards recommended experimental rapid tests to determine the performance of these new cementitious composites against chloride attack, including salt ponding, bulk diffusion, and rapid chloride permeability tests. These tests are time-consuming (more than 28 days) and require an adequate material, researchers, and testing equipment budget. Hence, a practical solution and an extensive experimental program are necessary to determine the efficiency of a proposed mixture for concrete members exposed to chloride attack.

Experimental tests considered different parameters to determine the chloride resistance of mixtures. As mentioned in Eq. (1), Fick’s second law and its analytical solution have usually been used to explain the chloride diffusion of concrete for measuring the service life design of RC members6,7, where \(C(x. t)\) is the chloride ions concentration, \(t\) and \(x\) denote the time and position from the exposed concrete surface, respectively, \({D}_{C}\) is the chloride diffusion coefficient, C0 is the initial chloride concentration in concrete, and Cs is the apparent surface chloride concentration.

Among these empirical parameters, the chloride diffusion coefficient plays a critical role in showing the chloride resistance of concrete mixtures. Different parameters affect \({D}_{C}\) as a time-dependent material characteristic, including concrete composition, curing conditions, chloride exposure time, and exposure location8. Many experimental test methods were used to determine the chloride diffusion coefficient. As recommended by ASTM C1556–119 and NT Build 44310, bulk diffusion tests are time-consuming tests where concrete samples are exposed to a chloride solution for a long period. In this context, Nordic standard NT Build 49211 proposed an accelerated test method, where chloride ions infuse the concrete at high rates caused by the applied electric field. The chloride diffusion coefficient measured by this method is denoted as the “non-steady-state migration coefficient (\({D}_{nssm}\) )”. Standard criteria for classification of chloride penetration resistance of concrete based on this NT Build 49211 method compared to ASTM C120212 is mentioned in Table 1. ASTM C120212 considers five categories for chloride ion penetrability based on the passed electric charge (PEC), while \({D}_{nssm}\) was used in NT Build 49211. Although the literature severally confirmed the accuracy of this method, the test procedure needs a practiced operator and accurate resources and facilities. Although this method's experimental results are precise, conducting this test for each concrete mixture for RC projects is not feasible and practical due to the long-term immersion period and costs. Hence, an efficient and practical solution is needed to determine the chloride resistance of concrete mixture for each project. Accordingly, it is crucial to advance a feasible and accurate predicting model that specify the chloride diffusion coefficients using concrete composition.

Recently, in the last few years, the use of artificial intelligence (AI) to predict cementitious composites' mechanical and durability characteristics has gained significant speed. However, specific types of concrete were considered for each study, whereas a comprehensive dataset was required in a unified AI model. Additionally, the number of datasets is another critical issue of these investigations. For instance, Amin et al. (2023)14 presented an effective predicting method for the strength of sustainable concrete utilizing rice husk ash (RHA) as supplementary cementitious materials using machine learning (ML) algorithms. Hosseinzadeh et al. (2023)15 used ML techniques (both regression and classification algorithms), to forecast the mechanical properties of fly ash-based recycled aggregate concrete (FARAC). Jiao et al. (2023)16 employed ML to predict the compressive strength of concrete with carbon nanotubes as nanomaterials. In this context, Zheng et al. (2023)17 developed an ML model to predict the compressive strength of silica fume concrete. Ullah et al. (2022)18,19 used the ML method to predict dry density and compressive strength of lightweight foamed concrete. In this field, Khan et al. (2021)20 used the ML method to formulate the depth of wear of fly-ash concrete. Song et al. (2021)21 worked on presenting an ML model to predict the compressive strength of concrete with fly ash admixture. Farooq et al. (2021)22 used the ML method to formulate the strength of self-compacting concrete (SCC). Ahmad et al. (2021)23 employed the ML method to forecast the compressive strength of concrete at high temperatures considering 207 experimental datasets. Ahmad et al. (2021)24 presented an ML model along with gene expression programming (GEP) and artificial neural network (ANN) to predict the compressive strength of recycled coarse aggregate (RCA)-based concrete. Farooq et al. (2021)25 employed ML algorithms with individual learners and ensemble learners (bagging, boosting) to forecast the mechanical properties of high-performance concrete using waste materials. Ahmad et al. (2021)26 used ML to predict the compressive strength of fly ash-based concrete. In this field, Khan et al. (2021)27 used soft-computing techniques, including random forest regression and gene expression programming to present a prediction model for the compressive strength of fly ash-based geopolymer concrete. Other studies in the literature also used AI methods in predicting the mechanical characteristics of cementitious composites28,29.

Accordingly, there has also been a growing interest in using AI techniques for predicting chloride diffusion coefficients, such as ANN, fuzzy inference systems, and ML methods. Table 2 comprehensively gathered all available AI models introduced by researchers for chloride diffusion coefficient. Thirty-four predicting models were proposed by the literature, considering different input variables, output factors, dataset numbers, and concrete types. Water-to-binder ratio (W/B), water content, cement content, fly ash (FA) content, silica fume (SF) content, ground granulated blast-furnace slag (GGBFS) content, superplasticizer dosage (SP), fine aggregate content (sand), coarse aggregate content, submerged duration time, curing age, air-entraining admixture, viscosity modifying admixtures dosage (VMA), testing age, concrete compressive strength (\({f}_{c}\) ), specific surfaces of FA, environmental conditions, diffusion rate, etc., are the input variables considered in the current predicting models reported by the literature. Also, different output variables were considered in these studies, such as non-steady-state migration coefficient and surface chloride concentration (Table 2). Along with obtaining a predicting model for the chloride resistance of concrete, previous AI studies reported some critical and conflicting findings based on their soft computing models. Generally, two different types of AL models were presented by the literature, including (a) a model for only a specific type of concrete mixture; and (b) an integrated model with no limitation for concrete type. Although most of the models developed by previous studies concentrate on the first group, the literature recently introduced some valuable models for the second category. Regarding the first model type, Parichatprecha & Nimityongskul (2009)30 used the ANN method to predict chloride ions permeability using 86 datasets and 8 input variables for high-performance concrete (HPC) mixtures with compressive strength ranging from 30 to 120 MPa. They reported that the optimum cement content for chloride-resistant HPC is 450–500 kg/m3. Also, based on their model, the chloride resistance of HPC can be significantly improved using both SF and FA (around more than 20% cement replacement). Song & Kwon (2009)31 slightly increased the dataset (120 number) for HPC specimens and found that they added GGBFS along with other SCMs to the input variables to predict the diffusion coefficient of chloride ion using the ANN method. They also used the parameter “duration time in submerged condition” instead of “superplasticizer” for the input parameters. As the precision of the soft computing methods in AI is extremely dependent on the number of database, Hodhod & Ahmed (2013)32 used 300 datasets of HPC mixture for the ANN method. They considered 5 input variables of W/B, cement, FA, GGBFS, and curing age. Regarding self-consolidating concrete (SCC) mixtures, extensive AI models were also presented in the literature33,34,35. In this field, Ghafoori et al. (2013)36 used statistical and ANN models on their limited experimental database (only 24 SCC mixtures) to present a predicting model for the rapid chloride penetration test (RCPT) value choosing 6 input features of cementitious materials, W/B, coarse aggregate, fine aggregates, air-entraining admixture, and high range water reducer (HRWR). Although a general trend cannot be extracted based on their limited database, they reported that three independent parameters of cementitious materials content, W/B ratio, and coarse or fine aggregate are essential to predict the RCPT value of SCC. Accordingly, Mohamed et al. (2018)37 used 86 datasets of SCC mixture to present a predicting ANN model for chloride penetration level using comprehensive 11 input variables of W/C ratio, cement, GGBFS, FA, SF, water, superplasticizer, coarse aggregate, fine aggregate, age, and charge. Najimi et al. (2019)38 collected 72 datasets of SCC mixtures and used feed-forward ANN combined with an artificial bee colony algorithm to develop a chloride penetration model. They added air-entraining admixture and HRWR to their input variables. Kumar et al. (2020)39 gathered 360 datasets of SCC mixtures to predict chloride penetration using MARS and minimax probability machine regression (MPMR). They only considered three input variables of FA, SF, and temperature. In this context, the model presented by Yuan et al. (2022)34 revealed that temperature and SF considerably affect the model performance. In the field of high-strength concrete (HSC), Inthata et al. (2013)40 used 270 datasets of mixtures containing ground pozzolans to predict chloride penetration and chloride penetration depth. They added for the first time the actual concrete compressive strength to the input parameters in ANN models. Their model revealed that 40% of cement replacement by ground pozzolans shows the highest resistance to chloride penetration, especially for rice husk ash-contained concrete mixtures. In line with this research, Boğa et al. (2013)41 used 162 datasets of concrete containing GGBFS and calcium nitrite-based corrosion inhibitors to develop a predicting model for chloride ion permeability using a four-layered ANN and adaptive neuro-fuzzy inference system. They considered Cure type, curing period, GGBFS ratio, and calcium nitrite-based corrosion inhibitor ratio independent variables. Based on the dataset's limited range of concrete types, no general report can be extracted by their study. Regarding concrete containing high calcium FA, Marks et al. (2015)42 gathered 56 dataset and used ML method to predict the concrete resistance to chloride penetration. Their input parameters were water, cement, high calcium FA, and the specific surface of fly ash. Their model showed that to have a reliable chloride-resistance FA concrete, the water content should be controlled in the w ≤ 158 L/m3 range. However, they emphasize that the experimental mode dataset should be checked due to their limited dataset number. In the context of the limited dataset, Slika & Saad (2016)43 used Ensemble Kalman Filter (EnKF) on the limited dataset of NC to predict Chloride concentration. Regarding SF concrete, 162 datasets were collected by Asghshahr et al. (2016)44, and ANN, along with classification and regression tree (CART) methods, were utilized to predict the Chloride concentration. Only 4 input variables were considered in their models, including environmental condition, penetration depth, W/B ratio, and SF. Their models showed that the penetration depth is the most vital factor impacting the chloride concentration. However, they could not find the critical role of SF in their model. The literature also used AI models to predict chloride ion diffusion. For instance, Hoang et al. (2017)45 considered the ML model with the multi-gene Genetic programming (MGGP) and multivariate adaptive regression splines (MARS) to predict the chloride ion diffusion of cement mortars using mortar age, depth of measured point, diffusion dimension, the reinforcement presence as input variables. However, along with ML methods, valuable soft computing models have still been used using the ANN method on different types of concrete. In the field of AI techniques, genetic programming (GP) was also used by Gao et al. (2019)46 on 25 experimental databases of NC to predict chloride-ion diffusion. They found that the GP method was more efficient than the ANN model. However, based on their limited dataset, more studies should check their finding. Some previous studies also concentrated on predicting AI models of chloride resistance for recycled aggregate concrete (RAC). Ahmad et al. (2021)47 employed an ML model to predict the surface chloride concentration in concrete containing waste material, including gene expression programming (GEP), the decision tree (DT), and an ANN. In this field, Liu et al. (2022)48 used the ensemble ML method to predict the PEC gathering 226 dataset. Their soft computing technique found that water content is the most vital parameter controlling the chloride ion permeability of RAC. In this context, Jin et al. (2022)49 developed an ANN model to predict the chloride diffusion coefficient using 112 dataset. They considered 9 input variables consisting of cement, water, coarse aggregate, sand, water reducing agent, recycled aggregate replacement rate, W/C ratio, water absorption rate of coarse aggregate, and apparent density of coarse aggregate. However, they ignored two important parameters of recycled fines and mineral admixtures due to the lack of an appropriate dataset. Also, unnoticed the effect of cement type. Their model demonstrated that the natural fine aggregate content, along with recycled coarse aggregate, has the most substantial influence on the penetrability of RAC chloride ions. Regarding metakaolin-contained concrete, Alabdullah et al. (2022)50 gathered 201 dataset to develop ML method chloride migration coefficient. They used concrete aging, binder content, W/B ratio, metakaolin, sand, coarse aggregate, and concrete compressive strength as input features. Their model revealed that concrete compressive strength is the most critical parameter in predicting the chloride migration coefficient. However, they found that aggregate content showed the minimum importance. Also, it was deduced from their ML model that using metakaolin has a vital influence on chloride penetration resistance with the optimum dosage of 15%. In this regard, Amin et al. (2022)51 used the same dataset and input features of metakaolin-based concrete to develop a GEP model of predicting RCPT. Their model showed that the concrete age is the most noteworthy factor, along with aggregate content.

Regarding the second category (models with no limitations), many studies gathered many types of concrete data in a specific database47,52,53,54. For instance, Delgado et al. (2020)52 used 243 datasets without concrete type limitations to develop ANN models of chloride depth penetration and diffusion coefficient. They considered cement type as an additional input variable for the first time. Based on their model, curing time is the most important input feature for the chloride penetration ANN model. In this field, Cai et al. (2020)53 collected 642 datasets for different types of concrete to predict surface chloride concentration using the ensemble ML method. Twelve input variables were considered in their model, including cement, water, FA, GGBFS, SF, Superplasticizer, fine aggregate, coarse aggregate, exposure time, Annual mean temperature, chloride content in seawater, and exposure type. Their predicting model showed that the exposure condition (i.e., tidal, splash and submerged zones) and W/B ratio seem to be the greatest important factors affecting the surface chloride concentration. Ahmad et al. (2021)47 gathered a dataset of concrete containing waste material to predict the surface chloride concentrations using gene expression programming (GEP). According to these studies, Liu et al. (2021)4 collected 653 datasets of different concrete types to develop an ANN model for predicting the chloride diffusion coefficient. They also considered the concrete compressive strength, curing mechanism, and common input variables. Tran et al. (2022)57 used an ensemble decision tree boosted (EDT Boosted) model to predict the surface chloride concentration considering the 386 datasets. Their model indicated that the fine aggregate content is the key parameter influencing the Cs. Tran (2022)58 concentrated on 127 concrete data-containing SCMs to develop an ML model for the chloride diffusion coefficient. They ignored using temperature and curing time as input parameters. It can be deduced from their model that water content and FA dosage have the most and least effect on chloride diffusion coefficient precision, respectively. They reported that GGBFS content has also a low impact on the output. Based on their model, water content and W/B ratio has negative relationship with the chloride diffusion coefficient. Guo et al. (2022)59,60 developed ML and Fuzzy logic system methods on two datasets (366 and 495 numbers) to predict surface chloride concentration without considering a specific type of concrete. Their models indicated that by increasing the W/B ratio, Cs increases. Moreover, mineral admixture and the W/B ratio affect the surface chloride concentration. Based on their model, FA and GGBFS cause higher and lower surface chloride concentration, respectively. Finally, the most recently studied in this field was conducted by Taffese & Espinosa-Leal (2022)55, where the ML method was used to predict the migration coefficient of different types of concrete in a unique dataset. Although they gathered an extensive experimental database, four separate models were used for different input variable groups due to some missing input variables. Each group has less than 200 datasets, including (1) first group with 134 dataset containing W/B ratio, Cement, Slag, FA, SF, fine aggregate, coarse aggregate, superplasticizer, migration test age, and cement types; (2) the second group uses 131 datasets with the same as group first input variables along with fresh density; (3) third group has 176 datasets considering the same input variables of the first group and additional parameters of compressive strength test age and compressive strength; and finally, (4) fourth group having all input variables with 91 datasets, showing that a limited number of the dataset mentioned both concrete compressive strength and fresh density in their studies. Hence, they couldn’t use the ML method on all 834 datasets due to the huge amount of missing input data. Similarly, Taffese & Espinosa-Leal (2022)61 only used the final 204 datasets for the ML method, considering input variables of W/B ratio, cement, water, slag, FA, SF, fine aggregate, coarse aggregate, total aggregate, superplasticizer, and air-entraining agent, and migration test age. Their finding from the proposed model showed that binder content and aggregate content considerably affect the predicting model. They also found that cement type has no impact, which is considered a weak predictor for classifying concrete chloride resistance and can be removed from the model. In this filed, Tran et al. (2022)56 used the ML technique to predict chloride content based on 404 dataset. Based on their model, exposure condition and depth of measurement are the most important parameters for the prediction of chloride content. In this context, Golafshani et al. (2022)8 developed a marine creatures-based metaheuristic artificial intelligence model to predict the apparent chloride diffusion using 216 dataset. Their model indicated that the exposure time and curing conditions considerably control the performance of the predicting model.

One of the main criteria for evaluating the performance of an AI model is simplicity. Accordingly, the number of input variables and accessibility should be considered in a predicting model. Hence, based on the experience of other proposed models and SHapley Additive exPlanations (SHAP) results, some input variables can be ignored to predict the chloride diffusion coefficient. Predominant input variables reported for each literature on AI methods are summarized in Table 3. It is clearly shown that based on the number of datasets considered, different parameters found by the AI methods have the most influence on predicting models of chloride resistance. Water, cement, W/B ratio, SCMs, chemical admixtures, aggregate content, and exposure conditions were severally considered by the literature as input variables in AI methods. However, there are differences of opinion about whether or not to consider some parameters. For instance, regarding cement type, although Delgado et al. (2020)52 reported the critical role as input illustrative in predicting chloride penetration in concrete, Taffese et al. (2022)55,61 showed its lack of influence as a predictor.

Moreover, the greatest numbers of the dataset did not precisely mention the cement type in the experimental program49. Accordingly, most developed AI models ignored this variable as an input feature. Also, only a few studies used concrete compressive strength as an input variable in AI methods4,40,50,51. Similarly, only Taffese et al. (2022)55 considered concrete density as one of the input variables to predict chloride resistance. Most studies also ignored the temperature, while many models considered it an input feature34,35,39,46,47,53,57,60. It may be due to the missing data in the experimental research. As mentioned in Table 3, there is no clear trend for the most important predictor in the models presented by the literature, so different input variables were reported. However, it can be deduced from Table 3 that parameters of cement content, SCMs content (FA and SF), curing time, and aggregate content (sand and coarse) should be considered for predicting models of concrete chloride resistance. Although other parameters of temperature, penetration depth, exposure time, and exposure condition were also found as vital factors affecting the predicting models, they cannot be considered definite input parameters due to the lack of data in the experimental dataset. The conflicting trend obtained by the previous predicting AI models (Table 3) can be attributed to some critical reasons, including (1) the number of the dataset collected in each model; (2) the types of concrete considered; (3) input variables selected for each model; (4) AI methods used to obtain a predicting model; and (5) existence of missing input data in some experimental program. Hence, although the literature presented valuable predicting models, they may not be well-suited to be considered reliable chloride resistance predicting models for all experimental databases considering different types of concrete.

As reviewed in this section, introducing a unique model for each concrete type is impractical. Accordingly, a significant research gap should be filled by developing a novel AI method using all available datasets. Additionally, due to the lack of consistency and exitance of missing data in the details of mixtures tested by the literature, most of the previous predicting AI models ignored these valuable experimental datasets in their model, causing a significant reduction in the accuracy and comprehensiveness of the existing predicting models for all types of concrete. Accordingly, the present study uses the ANN method as a dataset arranging technique to practically predict missing details in datasets for the ML models. In other words, the ANN method helps to prepare an accurate and efficient dataset for ML input and prevents the deletion of valuable data. Additionally, prior studies have examined the forecasting of concrete's chloride migration coefficient, but there has been a lack of research on the integration of regression and classification tasks. Moreover, there is a scarcity of scholarly investigations pertaining to the advancement of artificial neural networks and machine learning techniques for the purpose of constructing a robust predictive model using a comprehensive dataset comprising over 1000 data points. Therefore, it is imperative to address a notable deficiency in research by devising an innovative artificial intelligence (AI) approach that effectively utilizes all accessible datasets. Hence, the present study intends to address this issue by following the objectives:

Developing a unique ML method to predict the chloride diffusion coefficient for all types of concrete.

Introducing a developed dataset-cleaning technique (DCT) to cover the missed database of all experimental works using ANN.

Compare classification algorithms with regression ones in the supervised learning method.

Finding the most and least influencing parameters controlling the predicting model for chloride diffusion coefficient.

To achieve these objectives, a comprehensive experimental database containing 1037 datasets was gathered in the present study to predict the \({D}_{nssm}\) using 12 various features of water content, cement, slag, FA, SF, fine aggregate, coarse aggregate, superplasticizer, fresh density, compressive strength test age, compressive strength, and migration test age (Fig. 1). Due to the missing data for fresh density and compressive strength, previous studies couldn’t practically use all dataset in a unique model, so that ML method applied on four separate groups. However, the present study intends to solve this issue using the ANN method to predict the missing data through a precise approach. Linear Regression (such as elastic net, lasso, Ridge), decision tree, random forest, boosting algorithms, support vector machine, and k-nearest neighbors algorithm were considered in the present study for regression method. Regarding the classification method, different algorithms were also selected for predicting \({D}_{nssm}\) , including support vector machine, random forest, lightGBM, XGBoost, Logistic Regression, k-nearest neighbors (KNN), and decision tree.

Number of datasets used in the literature as compared to the present study.

Generally, the ML method is the branch of AI that deals with the application of algorithms that allow computers to advance patterns using the experimental database. The main intention of the ML method is to automatically learn to identify complicated patterns (relations between variables) and then make ingenious judgments based on datasets. The dataset is a set of logical (laboratory) records with unique characteristics called machine input data or features. Each of these input data is also called a predictor. Moreover, the intelligent decision that the machine should make after learning this data is considered a prediction model for a particular output (or target). In other words, the machine has reached intellectual maturity by finding a logical connection between the input data and the results. It provides a logical model of a real physical phenomenon. The main problem exists where the set of all possible behaviors given all potential inputs is too enormous to be covered by the set of observed examples (training data). Accordingly, the learner should generalize from the training data to produce expedient target data for new conditions out of the datasets. Pattern recognition, commonly accompanied by classification, is the most popular use case for the ML method. Although the quantity and quality of datasets play a significant role in the training process, selecting appropriate features or input variables (unique dataset characteristics) affects the ML model's performance considerably.

The present study comprehensively assesses the potential of using regression and classification algorithms in the ML method to predict the non-steady-state migration coefficients (\({D}_{nssm}\) ) without any limitation for the concrete type or missing input data. Using such a substantial experimental database (1073 datasets) in the ML method is critical to achieving a unified model. A practical DCT is used in the present study to compensate for the missing data in the literature. The figures and machine learning models utilized in this study were implemented through the utilization of the Python programming language62. The ANN model in Figure S1 was generated using MATLAB (2019b)63 in this study. The flowchart of the ML methodology proposed in the present study is shown in Fig. 2. The ML method proposed in the present study consists of three main steps, including (1) data cleaning technique; (2) data visualization; and (3) ML models. Each of these steps is explained in the following subsections.

Flowchart presenting the methodology of the machine learning (ML) technique proposed in the present study.

Experimental datasets used in the present novel ML models after considering the outlier’s removal are summarized in Table 4. Totally 24 research papers containing 1073 datasets were collected in the present study. As shown in Fig. 3a, experimental databases gathered have three missing input features for some datasets, including fresh density, compressive strength test age, and concrete compressive strength. To address this issue, Taffese et al. (2022)55 divided the datasets into four separate groups with an experimental database lower than 200 datasets. However, this method cannot be efficient as the ML method is considerably affected by the quality and quantity of the dataset. Moreover, achieving a reliable unified model using all datasets is required. Accordingly, the feed-forward back propagation network and the Levenberg–Marquardt training function were used in the present study to predict the missing data using different hidden layers for each dataset group based on the missing parameter type (Fig. 3b). The accuracy of the ANN prediction is illustrated in the supplementary file (Figure S1). For this ANN method, water-to-binder (W/B) ratio, cement content, aggregate content, mineral admixtures, aggregate content, chemical admixtures, and \({D}_{nssm}\) were considered as input layers to predict the missing values for the concrete compressive strength and fresh density. Regarding this high and adequate accuracy of the ANN method, the problem of the missing dataset was entirely solved, and accordingly, the complete datasets were used in the ML method.

Dataset-cleaning technique (DCT) procedure: (a) different categories of literature regarding missing data; (b) Using the feed-forward back propagation network and the Levenberg–Marquardt training function for reproducing missing data.

Generally, outliers are a few parts of datasets that are meaningfully different from the rest of the database. They are usually anomalous observations that deviate from the data distribution and are commonly caused by inconsistent data entry or inaccurate observations. The reason behind removing each outlier is necessary. One of the main methods to consider the removal of outlier datasets is to analyze with and without these skewed datasets and explain the differences. Descriptive statistics of the datasets considered in the present study before removing the outliers are summarized in Table 5. Regarding the output parameter (\({D}_{nssm}\) ), a range of 0.22 to 133.6 (× 10−12) m2/s was gathered in the present study. However, only a few contents of datasets are in the range of \({D}_{nssm}>50\) , and accordingly considered as outliers. Another analysis of the outlier’s detection is shown in Fig. 4 for each input variable. For superplasticizer (SP) dosage, SP values higher than 6 kg/m3 were removed from the datasets. A range of 200–1500 kg/m3 was kept as the main database of fine aggregates, while experimental databases out of this range were considered outliers (Fig. 4). Concrete compressive strength of higher than 67 MPa was removed due to the high concentration of datasets for \({f}_{c}\le 67\) MPa. Statistical analysis confirmed that the content of SF higher than 40 kg/m3 should be removed as outliers. Most of the datasets were in the range of 2000 days for migration test age, so that higher than this period are considered outliers (Fig. 4). Only 1 dataset of all experimental databases has slag content higher than 400 kg/m3, and accordingly removed from the ML analysis as an outlier data. Similarly, only one dataset has cement content higher than 510 kg/m3, which was removed from the ML method. Few datasets were selected as outliers for FA higher than 400 kg/m3. Although most experimental databases tested concrete compressive strength at a test age lower than 100 days, three tested the sample’s compressive strength more than 175 days’ measurement and were removed as outliers. Experimental datasets of coarse aggregate out of the 200–1300 kg/m3 range were selected as outliers (Fig. 4). For water content, this inliers range is about 50–250 kg/m3. However, no dataset was removed as an outlier data for fresh density of concrete samples. Analysis of the dataset for each input parameter before and after removing outliers is also illustrated in Fig. 4. Finally, a total amount of 965 datasets were obtained after removing outliers. As mentioned in Table 1, after checking the data preprocessing, the output parameter of \({D}_{nssm}\) was divided into five categories with data encoding ranging from 0 to 4.0 based on the recommendation of NT Build 49213 so that a higher amount of data encoding shows the higher chloride penetration resistance of concrete mixture with negligible chloride ion penetrability.

The procedure of outlier’s detection (left and middle ones show the dataset with outliers, right one represents the dataset without outliers).

To visually represent the datasets, this section utilized several plots including distplot, heatmap, and joint plot kernel density estimation (KDE), employing Seaborn, a Python library developed specifically for data visualization. Distplotm or also distribution plot is used to characterize data in histogram form. Distplotm represents a univariant set of gathered data, describing the data distribution of each variable compared to another one. As shown in Fig. 5, the distribution plot of water content shows that most of the datasets have water content in the 160–180 kg/m3 range. However, this domain is larger for cement content ranging from 200 kg/m3 to 500 kg/m3. The common dosages of slag, FA, and SF used in the literature to study the \({D}_{nssm}\) were 100–150 kg/m3, 50–100 kg/m3, and 20–30 kg/m3, respectively (Fig. 5). Regarding fine aggregates content, the literature in the field of \({D}_{nssm}\) measurement mostly used 600–1000 kg/m3, with the highest consumption of around 800 kg/m3 in concrete mixtures, while no specific range was followed for the coarse aggregate content. Most literature used 2–4 kg/m3 SP dosage for chemical admixture mixtures. Commonly, the datasets used in the present study are divided into two separate groups with different densities, including (1) high number, so datasets belong to the first group where concrete samples have a density of 220–2400 kg/m3, and (2) some portion of samples has a density ranging from 1800–2000 kg/m3 (Fig. 5). Also, very few samples used are lightweight concrete samples (less than 25 datasets) with fresh density lower than 1600 kg/m3. Most of the datasets measured the 28-day compressive strength accompanied with \({f}_{c}\) ranging from 35 to 60 MPa in the majority of samples. Finally, the distplot curve confirms that most of the concrete samples tested in the literature has \({D}_{nssm}<50\) .

To illustrate the relations between variables within a dataset and reveal valuable details from the datasets, the pairplot is shown in Fig. 6. This plot provides an immaculate conception to recognize the data. The scatter plot and Kernel Density Estimation (KDE) function are shown in the pairplot so that diagonal rows are the KDEs distribution curve, representing the distribution of each parameter. However, the scatter plots presented in other cells show the relationship between variables. Different colors were also used to distinguish the output parameter classifications (\({D}_{nssm}\) ). For instance, the water content distribution plot for four classifications (based on Table 1 data encoding) of \({D}_{nssm}\) depicted in the first row and column of the pairplot, showing that the water content distribution is near the normal distribution for \({D}_{nssm}\) classifications of 3 and 4. Regarding concrete compressive strength, all \({D}_{nssm}\) classifications show the normal distribution. Although finding appropriate justification for each parameter in contact with other variables is complicated, valuable findings can be revealed by the pairplot (Fig. 6). For instance; it can be deduced from this plot that for higher water content, samples with higher fresh density can have high chloride resistance. Also, this plot shows the appropriate relations between different powders (SCMs) so that samples containing both FA and slag show data encoding of 3 and 4 as conditions of high chloride resistance situations. A similar justification for concrete containing both FA and slag (or SF) was found. Data visualization from the pairplot also indicates that FA has a higher impact on the chloride resistance of concrete samples with high \({f}_{c}>40\) MPa. Also, FA is more efficient in reducing chloride permeability for samples containing a high content of water and a low content of cement (Fig. 6). Moreover, pairplot shows that a good relationship exists between the fine aggregate content and fresh density, so using a high content of sand along with having high density causes better chloride resistance concrete. The Pairplot curve shows that cement and SCMs contents affect the effect of coarse aggregate on the chloride resistance of concrete. Based on the data visualization results, fresh density also controls the influence of coarse aggregate on the \({D}_{nssm}\) . As depicted in Fig. 6, SP content has a meaningful impact on the chloride resistance of concrete with a high W/B ratio. Coarse aggregate content, fresh density, and compressive strength similarly improve the effect of SP on \({D}_{nssm}\) . Generally, results indicate that the ternary relation of SP-fresh density-compressive strength should be considered in the concrete chloride resistance. The direct effect of fresh density on the concrete compressive strength is also clearly highlighted in the pairplot, which was notably ignored by the literature regarding AI methods.

To illustrate the relationships between two variables, the heatmap plot is shown in Fig. 7. The Pearson correlation coefficient (r) is a statistical measure used to quantify the linear relationship between two continuous variables, X and Y. It provides a numerical value that indicates the strength and direction of the linear association between the variables. The calculation of the Pearson correlation coefficient involves several steps.

For each pair of values (Xi, Yi) in the dataset, the formula subtracts the mean of X from each Xi, and the mean of Y from each Yi. This step represents the deviation of each data point from its respective mean. Then, the formula multiplies the deviations of X and Y for each data point and sums up these products across the dataset. This step captures the covariation between X and Y. The resulting sum of products is divided by the product of the standard deviations of X and Y. The standard deviation of X is calculated by summing the squared deviations of X from its mean and taking the square root. Similarly, the standard deviation of Y is obtained by summing the squared deviations of Y from its mean and taking the square root. Changing cell’s color for each axis shows the patterns in value for one or both variables ranging from -1.0, as a perfect negative linear correlation between two variables, to 1.0 representing a perfect positive linear correlation. The value of 0 designates no linear correlation between the two features. On the other hand, the heatmap plot demonstrates the independence of the variables. For instance, the first row of this plot shows that water content seems to be positively correlated with the cement content, while it is negatively correlated with the slag content. For the second raw, it can be deduced from the plot that the cement content has a strong positive correlation with the water content (+ 0.57), along with negative correlations with the slag content (− 0.68) and the fresh density (− 0.43). Also, analysis of the heatmaps plot shows that coarse aggregate content has negatively correlated with the cement content (− 0.41) and significantly has a positive correlation with the fresh density value of mixtures. This is a significant finding for the coarse aggregate content in predicting \({D}_{nssm}\) . As shown in Fig. 7, SP dosage has a positive + 0.51 dependency on the concrete compressive strength. SF content, fine aggregate, and SP dosage were also found to have a notable impact on the dependency of concrete compressive strength variable.

Kernel Density Estimation (KDE) jointplot for all variables against the \({D}_{nssm}\) is shown in Fig. 8. The KDE is a non-parametric approach showing the probability of the density of an independent variable. This plot contains two plots, including (1) a bivariate figure indicating how the dependent parameter (\({D}_{nssm}\) ) changes with the variation of independent variable features; and (2) the scattering plot located at the top of the bivariate graph to display the distribution of the independent factors. KDE distribution jointplot shows that most chloride-resistant concrete mixtures with \({D}_{nssm}\) belongs to the classifications higher than 3 and has cement content ranging from 300 kg/m3 to 400 kg/m3 and a water content domain of 150–180 kg/m3 (Fig. 8). Among SCMs, slag content (maximum 50 kg/m3) shows better correlation as compared to FA and SF. Moreover, it can be deduced from the KDE jointplot that highly resistant concrete mixtures have fine aggregate and coarse range of 700–850 kg/m3 and 900–1200 kg/m3, respectively. However, no clear trend was obtained by the KDE jointplot between SP content and \({D}_{nssm}\) classifications. As illustrated in Fig. 8, concrete mixtures should have a fresh density range of 2100–2500 kg/m3 to resist the chloride attack. Finally, the KDE distribution jointplot revealed that high chloride-resistant concrete samples have a concrete compressive strength of 45–60 MPa.

Bivariate KDE jointplot of variables (darker colors represent higher densities).

Data visualization plots revealed that the number of datasets for \({D}_{nssm}\) plays a significant role in the efficiency of the analysis. As shown in Fig. 9, category number 2 has the highest number of datasets \({D}_{nssm}\) . Using this dataset to train the machine causes a misleading prediction model. Accordingly, the synthetic minority oversampling technique (SMOTE) is used in the present study to increase the number of datasets for training the machine. This technique is one of the most frequently used oversampling approaches to achieve a balanced training dataset. This technique balances the class distribution by erratically increasing minority class instances by duplicating them. On the other hand, using SMOTE causes the generation of 1750 training datasets by having an equal number of 350 datasets for each \({D}_{nssm}\) classification. Only 80 percent of all datasets were considered for training. It is worth mentioning that SMOTE was only used for training machines, while real experimental datasets were used for testing.

Synthetic Minority Oversampling Technique (SMOTE) on training data.

Prior to commencing the machine learning model in data visualization, it is imperative to conduct a thorough examination of the feature importance plots for both regression and classification methodologies, as depicted in Fig. 10a,b correspondingly. Feature importance score represents the significance of each input feature for a \({D}_{nssm}\) model. A specific feature with a higher score has a more significant influence on the predicting model of \({D}_{nssm}\) . Based on the regression approach, results of feature important show that superplasticizer dosage, fresh density, and water content have the highest impacts on the predicting regression model of \({D}_{nssm}\) . However, compressive strength test age and FA have the lowest effects on the regression model of \({D}_{nssm}\) (Fig. 10a). Fresh density, coarse aggregate content, and fine aggregate content are the most crucial variables that affect the predicting model in the classification approach. Similar to the regression approach, FA and compressive strength test age have no effective impact on the predicting model of the classification approach. The importance of fresh density was confirmed in both regression and classification approaches.

Feature importance techniques for: (a) Regression; (b) Classification.

As shown in Fig. 11, the ML method contains three types of models, including (1) supervised learning, (2) unsupervised learning, and (3) reinforcement learning. Each type has various algorithms. Supervised learning is used in the present study to introduce a unified predicting model of \({D}_{nssm}\) . Supervised techniques adjust the model to reproduce the target variable known from a training dataset. This type of ML has two main models, including (a) Regression and (b) Classification. The regression technique intends to reproduce the output or target value, while classification provides a model to produce the class assignments or data encoding. It can predict the target value, and the data is divided into different categories denoted as “classes.” Elastic Net, Lasso, Linear Regression, Ridge, Random Forest, KNN, Decision Tree, SVR, and XGB are different regression models used in the present study. Decision Tree, KNN, Logistic Regression, XGBoost, LightGBM, Random Forest, and Support Vector Machine are classification models considered in the present study. Schematic illustrations of some regression and classification models are shown in Fig. 12. Generally, linear Regression consists of four different linear regression models, including Simple, Ridge, Lasso, and Elastic Net. Linear Regression aims to predict a dependent output (or target) through different independent variables. Accordingly, this regression technique determines a linear relationship between a response variable and the other predictor variables (Fig. 12a). Support Vector Machines (SVM) as a robust classification and regression model convert the data into a linear decision space to classify datasets, even when the data are not linearly distinguishable. Support vectors are datasets nearer to the line (or hyperplane) and affect the location and direction of the hyperplane. A separator between the categories is created, and then the data are converted to draw the separator as a hyperplane (Fig. 12b). K-Nearest Neighbors (KNN) is a non-parametric and non-linear data classification method using proximity to perform classifications or estimations about the grouping of an individual data point (Fig. 12c). Logistic Regression is essentially a non-linear extension of linear Regression, allowing us to handle classes in a classification problem. This is achieved by categorizing estimates into a given class based on a likelihood threshold (Fig. 12d). Logistic Regression is a statistical analysis method for calculating the probability of a binary outcome, such as yes or no, based on former findings of a dataset. Logistic Regression affords discreet output. Finally, Decision Tree is a hierarchical demonstration of the outcome space where each node signifies a selection or an act. This model is used to classify or estimate how a former set of queries were responded to (Fig. 12e).

Schematic illustrations of regression and classification models: (a) Linear regression; (b) Support vector machines (SVM); (c) K-nearest neighbors (KNN); (d) Logistic regression; (e) Decision tree.

Different statistical criteria are used in the present study to determine the efficiency of regression and classification models. For the regression models, different coefficients are used in the present study, as follows:

Root Mean Square Error (RMSE) is the square root of the error function and should be reduced by the regression models to fit the datasets appropriately. Based on Eq. (3), N is the number of non-missing data points, \({x}_{i}\) is the actual observations time series, and \({\widehat{x}}_{i}\) is the estimated time series. As mentioned in Eq. (4), Mean Absolute Error (MAE) is used as the sum of the error values. However, as the absolute value is used instead of squaring it, it is more tolerant of large estimate errors. R Square (\({R}^{2})\) calculates how much the regression model can describe changeability in a dependent variable. Although It is a respectable measurement to check the fit of dependent variables, no overfitting is considered in \({R}^{2}\) . Based on Eq. (5), RSS and TSS are the sum of squares of residuals and the total sum of squares, respectively. As stated in Eq. (6), MSE is Mean Squared Error, which determines how close is a fitted line to datasets. In this equation, \(n\) is the number of data points, \({Y}_{i}\) is observed values, and \({\widehat{Y}}_{i}\) is the predicted value. Along with these criteria, the accuracy of the regression models is also improved by the K-fold cross-validation (CV) and hyper-parameter tuning techniques (Fig. 13). Generally, the K-fold CV process is used to assess the efficiency of ML models when making estimates on datasets not used during training. CV is a resampling method used to assess ML models on limited data points. This method has a parameter called k that denotes the number of groups into which a given data sample should be divided. The procedure for using this method is mentioned in the flowchart shown in Fig. 2. Regarding the classification models as performance measurement (Fig. 14). Different performance metrics are used based on the confusion matrix to determine the efficiency of the classification models, including precision, recall, F1, accuracy, and specificity.

Confusion matrix as a performance measurement for classification models.

The performance of the regression model in the ML method is summarized in Fig. 15. Figure 16 displays both the error graph and the predicted versus actual graph. The evaluation of the relationship between the actual and predicted results, as well as the error graph, was conducted for three algorithms with higher accuracy: Decision Tree, Random Forest, and XGBoost. The results are depicted in Fig. 16a,b,c respectively. Results indicate that among regression models, XGBoost (\({R}^{2}=\) 0.94) and SVR (\({R}^{2}=\) 0.94) show the highest accuracy. Also, ML analysis revealed that Elastic Net, Lasso, Linear regression, and Ridge could not precisely predict the \({D}_{nssm}\) . Moreover, accepted \({R}^{2}\) score was found for Random Forest, KNN, and decision tree. It is worth mentioning that achieving a reliable model (XGBoost) with this high \({R}^{2}\) score for this content of datasets (1037 numbers) showing the efficiency of the proposed unified method followed by the present study.

Performance of regression models in ML method for predicting \({D}_{nssm}\) .

The relationship between the actual and predicted results and error graph: (a) decision tree; (b) Random Forest; (c); XGBoost.

The performance of the classification models by the confusion matrix is shown in Fig. 17. Regarding the SVM model, results show an accuracy of 0.89 with the highest efficiency for Class 0 (based on the classification of Table 1). As shown in Fig. 17a, the first row shows all datasets of the first category, where all of them (16 datasets) were successfully predicted, without any sample out of accuracy. For Category 3 (Predict High), the SVM model appropriately predicted 68 datasets (from 73 numbers), showing high accuracy. The average accuracy of 0.93 was found for the Random Forest model with the highest performance for the categories Low & Extremely High with accuracy factors of 1.0 and 0.96, respectively (Fig. 17b). The LightGBM model shows an accuracy of 0.96 (Fig. 17c). Almost all categories for this model have F1 scores higher than 0.90. The XGBoost model also has a high accuracy of 0.97, the maximum accuracy among other classification models (Fig. 17d). The lowest accuracy was found for the Logistic Regression model with an accuracy of 0.68, where none of the categories reached 0.90 Accuracy (Fig. 17e). Results also showed that the accuracy of the KNN model (0.79) is lower than that of the Decision-Tree model (0.88), as shown in Fig. 17f,g.

Performance of classification models using confusion matrix: (a) Support vector machine (SVM); (b) random forest; (c) LightGBM; (d) XGBoost; (e) Logistic regression; (f) KNN; (g) Decision tree.

After checking the performance of each predicting model, the SHAP (SHapley Additive exPlanations) approach is used in the present study to determine the sensitivity of the predicting model considering different variables. SHAP is a game hypothetical and mathematical method to clarify the ML model's output by measuring each feature's influence on the prediction. As shown in Fig. 18a, the SHAP value indicates that high content of fresh density has a considerably positive impact on the predicting model, while high water contents have an adverse influence. Moreover, a high dosage of SP increases the possible effect on the predicted model. SHAP results also revealed that high contents of SF and migration test age positively impact the predicting model. Based on SHAP analysis, lower concrete compressive strength has a negative effect on the predicted \({D}_{nssm}\) . Furthermore, it can be deduced from the SHAP value that the lower content of fine aggregate positively affects the predicted model, while the lower content of coarse aggregate negatively influences the model. The high content of cement and slag was also found to adversely affect the \({D}_{nssm}\) model. However, another trend was found for FA so that high content causes a positive influence on the model. No trend was found for the compressive strength test age. Another valuable finding of the SHAP approach is finding the most critical relationship between the features. For instance, as shown in Fig. 18b, fresh density affects the influence of water on the SHAP value. Also, SP content controls the effect of cement on the SHAP value, so that for cement lower than 350 kg/m3, the high content of the SP has a higher impact on the predicting model, while this trend is vice versa for a higher content of cement (Fig. 18c). SHAP analysis found that SP dosage affects the interaction between slag content and the predicting model (Fig. 18d). As shown in Fig. 18e, higher coarse aggregate content results in a better relationship between FA and predicted \({D}_{nssm}\) . In this field, SHAP analysis showed that coarse aggregate affects the SF influence on the predicting model so lower coarse aggregate content causes a more proper relation between SF and the predicting model (Fig. 18f). In this context, Fig. 18g demonstrates that for fine aggregate content lower than 800 kg/m3, low SP dosage causes a more positive impact of fine aggregate on the model, while this trend changes for fine aggregate higher than 800 kg/m3. SHAP analysis depicts that water content controls the effect of coarse aggregate on the predicting model (Fig. 18h), so that high content of water causes a higher impact of coarse aggregate on the ML model for coarse aggregate content lower than 800 kg/m3. Finally, SHAP analysis showed that fresh density considerably affects the compressive strength-predicting model relationship, so for compressive strength lower than 50 MPa, high fresh density results in a more devastating effect on the predicting model. However, this trend changes for \({f}_{c}\ge\) 50 MPa where high fresh density improved the effect of compressive strength on the predicting model (Fig. 18i).

Results of SHapley Additive exPlanations (SHAP) approach.

The present study incorporated an additional validation process to assess the efficacy of the current research. The outcomes of this validation procedure are concisely presented in Table 6. In the current study, a total of 886 datasets were utilized, following the data cleaning process. These datasets were employed for the regression algorithm, as depicted in Fig. 15, and the classification algorithm, as illustrated in Fig. 17. The model outputs were found to be associated with all of the datasets.

After conducting an analysis of feature importance and data visualization, as well as evaluating various models, a new dataset consisting of a 28-day test age was chosen as the input dataset for the website. This test age was selected due to its prevalence among civil engineers. In order to validate and confirm that certain data points were not observed by the algorithm, a subset of 12 numbers from the datasets were completely isolated from the dataset utilized for the final validation procedure.

As indicated in Table 6, the regression model yielded a deviation below 9%. Furthermore, the classification model exhibits a high level of precision in its predictions, displaying minimal deviation. As a result, the proposed machine learning classification model successfully forecasts precise categories.

The present study has developed a free-access machine learning predictive model for concrete durability checking, as depicted in Fig. 19. This model can be accessed through the productive link "http://materialai.ir". Researchers and engineers involved in the development of various concrete mixtures have the ability to verify the durability of their mixtures through online platforms. The utilization of extensive datasets and the inclusion of multiple validation stages in the current study make this resource applicable to a wide range of concrete types.

Free access ML predictive model for concrete durability checking designed by the present study (Link of this model https://materialai.ir/).

To present a reliable and efficient prediction model for \({D}_{nssm}\) , an extensive experimental database containing 1037 datasets was gathered in the present study containing different types of concrete mixtures. As existing predicting models for chloride resistance have two main problems, the limited dataset considered and missing input variables causing a reduction in their efficiency, a unified data-cleaning technique was used to improve the consistency of the datasets. Different ML methods of Regression and classification were used. Simple linear Regression, Ridge, Lasso, Elastic Net, Support Vector Machine (SVM), Random Forest, LightGBM, XGBoost, Logistic Regression, KNN, and Decision-Tree are the models used to predict the \({D}_{nssm}\) . Water, cement, slag, fly ash, silica fume, fine aggregate, coarse aggregate, superplasticizer, fresh density, compressive strength, compressive strength test age, and migration test age are the variables considered as independent predictors. Also, the SHAP approach was used to check the importance of each variable in the predicting model. Generally, results showed that XGBoost and SVR (as regression models) could appropriately predict the \({D}_{nssm}\) with \({R}^{2}\) score higher than 0.90. Moreover, among the classification models, LightGBM and XGBoost showed the highest accuracy factor. SHAP analysis also indicates that there is a controlling effect between impacts of cement–water, slag-SP, FA-coarse aggregate, SF-fine aggregate, compressive strength-fine aggregate, fresh density-coarse aggregate, and fresh density-compressive strength on the predicting \({D}_{nssm}\) model.

The authors confirm that the data supporting the findings of this study are available within the article and its Supplementary material. Raw data supporting this study's findings are available from the corresponding author upon reasonable request.

Ground granulated blast furnace slag

Concrete containing high calcium fly ash

Position from the concrete surface

Mousavi, S. S., Mousavi, A. & Bhojaraju, C. A critical review of the effect of concrete composition on rebar–concrete interface (RCI) bond strength: A case study of nanoparticles. SN Appl. Sci. 2(5), 893 (2020).

Mousavi, S.S. and M. Dehestani. Influence of Mixture Composition on the Structural Behaviour of Reinforced Concrete Beam-column Joints: A Review. In Structures. (Elsevier, 2022).

Bhojaraju, C., Mousavi, S. S. & Ouellet-Plamondon, C. M. Influence of GGBFS on corrosion resistance of cementitious composites containing graphene and graphene oxide. Cem. Concr. Compos. 135, 104836 (2023).

Liu, Q.-F. et al. Prediction of chloride diffusivity in concrete using artificial neural network: Modelling and performance evaluation. Constr. Build. Mater. 268, 121082 (2021).

Mousavi, S. S., Mousavi, A. & Bhojaraju, C. A critical review of the effect of concrete composition on rebar–concrete interface (RCI) bond strength: A case study of nanoparticles. SN Appl. Sci. 2(5), 1–23 (2020).

Shafikhani, M. & Chidiac, S. Quantification of concrete chloride diffusion coefficient—A critical review. Cem. Concr. Compos. 99, 225–250 (2019).

Collepardi, M., Penetration of chloride ions into cement pastes and concretes. (1972).

Golafshani, E. M. et al. Concrete chloride diffusion modelling using marine creatures-based metaheuristic artificial intelligence. J. Clean. Prod. 374, 134021 (2022).

ASTM C1556- 11a, Standard Test Method for Determining the Apparent Chloride Diffusion Coefficient of Cementitious Mixtures by Bulk Diffusion. ASTM (2016).

NORDTEST, N.T. Build, 443, Concrete, Hardened: Accelerated Chloride penetration (2010).

NORDTEST, N.T. Build, 492, Concrete, Mortar and Cement-Based Repair Materials: Chloride Migration Coefficient from Non-Steady-State Migration Experiments. (1999).

Materials, A. S. ASTM-C1202: Standard Test Method for Electrical Indication of Concrete’s Ability to Resist Chloride Ion Penetration (West Conshohocken, 2012).

NORDTEST, Concrete, Mortar and Cement-based Repair Materials: Chloride Migration Coefficient from Non-Steady-State Migration Experiments. (1999).

Amin, M. N. et al. Prediction of sustainable concrete utilizing rice husk ash (RHA) as supplementary cementitious material (SCM): Optimization and Hyper-tuning. J. Mater. Res. Technol. https://doi.org/10.1016/j.jmrt.2023.06.006 (2023).

Hosseinzadeh, M., Dehestani, M. & Hosseinzadeh, A. Prediction of mechanical properties of recycled aggregate fly ash concrete employing machine learning algorithms. J. Build. Eng. 2023, 107006 (2023).

Jiao, H. et al. A novel approach in forecasting compressive strength of concrete with carbon nanotubes as nanomaterials. Mater. Today Commun. 2023, 106335 (2023).

Zheng, W. et al. Sustainable predictive model of concrete utilizing waste ingredient: Individual alogrithms with optimized ensemble approaches. Mater. Today Commun. 35, 105901 (2023).

Ullah, H. S. et al. Predictive modelling of sustainable lightweight foamed concrete using machine learning novel approach. J. Build. Eng. 56, 104746 (2022).

Ullah, H. S. et al. Prediction of compressive strength of sustainable foam concrete using individual and ensemble machine learning approaches. Materials 15(9), 3166 (2022).

Article  ADS  CAS  PubMed  PubMed Central  Google Scholar 

Khan, M. A. et al. Simulation of depth of wear of eco-friendly concrete using machine learning based computational approaches. Materials 15(1), 58 (2021).

Article  ADS  MathSciNet  PubMed  PubMed Central  Google Scholar 

Song, H. et al. Predicting the compressive strength of concrete with fly ash admixture using machine learning algorithms. Constr. Build. Mater. 308, 125021 (2021).

Farooq, F. et al. A comparative study for the prediction of the compressive strength of self-compacting concrete modified with fly ash. Materials 14(17), 4934 (2021).

Article  ADS  CAS  PubMed  PubMed Central  Google Scholar 

Ahmad, A. et al. Comparative study of supervised machine learning algorithms for predicting the compressive strength of concrete at high temperature. Materials 14(15), 4222 (2021).

Article  ADS  CAS  PubMed  PubMed Central  Google Scholar 

Ahmad, A. et al. Compressive strength prediction via gene expression programming (GEP) and artificial neural network (ANN) for concrete containing RCA. Buildings 11(8), 324 (2021).

Farooq, F. et al. Predictive modeling for sustainable high-performance concrete from industrial wastes: A comparison and optimization of models using ensemble learners. J. Clean. Prod. 292, 126032 (2021).

Ahmad, A. et al. Prediction of compressive strength of fly ash based concrete using individual and ensemble algorithm. Materials 14(4), 794 (2021).

Article  ADS  CAS  PubMed  PubMed Central  Google Scholar 

Khan, M. A. et al. Compressive strength of fly-ash-based geopolymer concrete by gene expression programming and random forest. Adv. Civ. Eng. 2021, 1–17 (2021).

Nafees, A. et al. Predictive modeling of mechanical properties of silica fume-based green concrete using artificial intelligence approaches: MLPNN, ANFIS, and GEP. Materials 14(24), 7531 (2021).

Article  ADS  CAS  PubMed  PubMed Central  Google Scholar 

Dudziak, L. & V. Mechtcherine. Reducing the cracking potential of ultra-high performance concrete by using super absorbent polymers (SAP). in Proc. of the International Conf. on Advanced Concrete Materials. 2009.

Parichatprecha, R. & Nimityongskul, P. Analysis of durability of high performance concrete using artificial neural networks. Constr. Build. Mater. 23(2), 910–917 (2009).

Song, H.-W. & Kwon, S.-J. Evaluation of chloride penetration in high performance concrete using neural network algorithm and micro pore structure. Cem. Concr. Res. 39(9), 814–824 (2009).

Hodhod, O. & Ahmed, H. Developing an artificial neural network model to evaluate chloride diffusivity in high performance concrete. HBRC J. 9(1), 15–21 (2013).

Mohamed, O. et al. Application of ANN for prediction of chloride penetration resistance and concrete compressive strength. Materialia 17, 101123 (2021).

Yuan, J., Zhao, M. & Esmaeili-Falak, M. A comparative study on predicting the rapid chloride permeability of self-compacting concrete using meta-heuristic algorithm and artificial intelligence techniques. Struct. Concr. 23(2), 753–774 (2022).

Ge, D.-M., Zhao, L.-C. & Esmaeili-Falak, M. Estimation of rapid chloride permeability of SCC using hyperparameters optimized random forest models. J. Sustain. Cem. Mater. 2022, 1–19 (2022).

Ghafoori, N. et al. Predicting rapid chloride permeability of self-consolidating concrete: A comparative study on statistical and neural network models. Constr. Build. Mater. 44, 381–390 (2013).

Mohamed, O. A., Ati, M. & Al Hawat, W. Implementation of artificial neural networks for prediction of chloride penetration in concrete. Int. J. Eng. Technol 7, 47–52 (2018).

Najimi, M., Ghafoori, N. & Nikoo, M. Modeling chloride penetration in self-consolidating concrete using artificial neural network combined with artificial bee colony algorithm. J. Build. Eng. 22, 216–226 (2019).

Kumar, S. et al. Prediction of rapid chloride permeability of self-compacting concrete using multivariate adaptive regression spline and minimax probability machine regression. J. Build. Eng. 32, 101490 (2020).

Inthata, S., Kowtanapanich, W. & Cheerarot, R. Prediction of chloride permeability of concretes containing ground pozzolans by artificial neural networks. Mater. Struct. 46(10), 1707–1721 (2013).

Boğa, A. R., Öztürk, M. & Topcu, I. B. Using ANN and ANFIS to predict the mechanical and chloride permeability properties of concrete containing GGBFS and CNI. Compos. B Eng. 45(1), 688–696 (2013).

Marks, M., Glinicki, M. A. & Gibas, K. Prediction of the chloride resistance of concrete modified with high calcium fly ash using machine learning. Materials 8(12), 8714–8727 (2015).

Article  ADS  PubMed  PubMed Central  Google Scholar 

Slika, W. & Saad, G. An ensemble kalman filter approach for service life prediction of reinforced concrete structures subject to chloride-induced corrosion. Constr. Build. Mater. 115, 132–142 (2016).

Asghshahr, M. S., Rahai, A. & Ashrafi, H. Prediction of chloride content in concrete using ANN and CART. Mag. Concr. Res. 68(21), 1085–1098 (2016).

Hoang, N.-D., Chen, C.-T. & Liao, K.-W. Prediction of chloride diffusion in cement mortar using multi-gene genetic programming and multivariate adaptive regression splines. Measurement 112, 141–149 (2017).

Gao, W., Chen, X. & Chen, D. Genetic programming approach for predicting service life of tunnel structures subject to chloride-induced corrosion. J. Adv. Res. 20, 141–152 (2019).

Article  CAS  PubMed  PubMed Central  Google Scholar 

Ahmad, A. et al. Application of novel machine learning techniques for predicting the surface chloride concentration in concrete containing waste material. Materials 14(9), 2297 (2021).

Article  ADS  CAS  PubMed  PubMed Central  Google Scholar 

Liu, K.-H. et al. Innovative modeling framework of chloride resistance of recycled aggregate concrete using ensemble-machine-learning methods. Constr. Build. Mater. 337, 127613 (2022).

Jin, L. et al. Prediction of the chloride diffusivity of recycled aggregate concrete using artificial neural network. Mater. Today Commun. 32, 104137 (2022).

Alabdullah, A. A. et al. Prediction of rapid chloride penetration resistance of metakaolin based high strength concrete using light GBM and XGBoost models by incorporating SHAP analysis. Constr. Build. Mater. 345, 128296 (2022).

Amin, M. N. et al. Prediction of rapid chloride penetration resistance to assess the influence of affecting variables on metakaolin-based concrete using gene expression programming. Materials 15(19), 6959 (2022).

Article  ADS  CAS  PubMed  PubMed Central  Google Scholar 

Delgado, J. et al. Artificial neural networks to assess the useful life of reinforced concrete elements deteriorated by accelerated chloride tests. J. Build. Eng. 31, 101445 (2020).

Cai, R. et al. Prediction of surface chloride concentration of marine concrete using ensemble machine learning. Cem. Concr. Res. 136, 106164 (2020).

Yao, L., L. Ren, and G. Gong. Evaluation of chloride diffusion in concrete using PSO-BP and BP neural network. In IOP Conf. Series: Earth and Environmental Science. IOP Publishing (2021).

Taffese, W. Z. & Espinosa-Leal, L. A machine learning method for predicting the chloride migration coefficient of concrete. Constr. Build. Mater. 348, 128566 (2022).

Tran, V. Q. et al. Application of machine learning technique for predicting and evaluating chloride ingress in concrete. Front. Struct. Civ. Eng 2022, 1–17 (2022).

Tran, A.-T., Le, T.-H. & Nguyen, H. M. Forecast of surface chloride concentration of concrete utilizing ensemble decision tree boosted. J. Sci. Transp. Technol. 2(1), 44–56 (2022).

Tran, V. Q. Machine learning approach for investigating chloride diffusion coefficient of concrete containing supplementary cementitious materials. Constr. Build. Mater. 328, 127103 (2022).

Guo, Z., Guo, R. & Lin, S. Multi-factor fuzzy prediction model of concrete surface chloride concentration with trained samples expanded by random forest algorithm. Mar. Struct. 86, 103311 (2022).

Guo, Z., Guo, R. & Yao, G. Multi-factor model to predict surface chloride concentration of concrete based on fuzzy logic system. Case Stud Constr Mater. 17, e01305 (2022).

Taffese, W. Z. & Espinosa-Leal, L. Prediction of chloride resistance level of concrete using machine learning for durability and service life assessment of building structures. J. Build. Eng. 60, 105146 (2022).

Van Rossum, G. Python programming language. In USENIX Annual Technical Conference. Santa Clara (2007).

Gilat, A. MATLAB: An introduction with Applications (John Wiley & Sons, 2004).

Costa, A. & Appleton, J. Chloride penetration into concrete in marine environment—Part I: Main parameters affecting chloride penetration. Mater. Struct. 32, 252–259 (1999).

Thomas, M. D. & Bamforth, P. B. Modelling chloride diffusion in concrete: Effect of fly ash and slag. Cem. Concr. Res. 29(4), 487–495 (1999).

Hao-bo, H. & Guo-zhi, Z. Assessment on chloride contaminated resistance of concrete with non-steady-state migration method. J. Wuhan Univ. Technol Mater. Sci. 19(4), 6–8 (2004).

Alizadeh, R. et al. Effect of curing conditions on the service life design of RC structures in the Persian Gulf region. J. Mater. Civ. Eng. 20(1), 2–8 (2008).

Shekarchi, M., Rafiee, A. & Layssi, H. Long-term chloride diffusion in silica fume concrete in harsh marine climates. Cem. Concr. Compos. 31(10), 769–775 (2009).

Audenaert, K., Yuan, Q. & De Schutter, G. On the time dependency of the chloride migration coefficient in concrete. Constr. Build. Mater. 24(3), 396–402 (2010).

Jain, J. & Neithalath, N. Electrical impedance analysis based quantification of microstructural changes in concretes due to non-steady state chloride migration. Mater. Chem. Phys. 129(1–2), 569–579 (2011).

Liu, X., Chia, K. S. & Zhang, M.-H. Water absorption, permeability, and resistance to chloride-ion penetration of lightweight aggregate concrete. Constr. Build. Mater. 25(1), 335–343 (2011).

Maes, M., Gruyaert, E. & De Belie, N. Resistance of concrete with blast-furnace slag against chlorides, investigated by comparing chloride profiles after migration and diffusion. Mater. Struct. 46, 89–103 (2013).

Elfmarkova, V., Spiesz, P. & Brouwers, H. Determination of the chloride diffusion coefficient in blended cement mortars. Cem. Concr. Res. 78, 190–199 (2015).

Real, S., Bogas, J. A. & Pontes, J. Chloride migration in structural lightweight aggregate concrete produced with different binders. Constr. Build. Mater. 98, 425–436 (2015).

Bogas, J. A. & Gomes, A. Non-steady-state accelerated chloride penetration resistance of structural lightweight aggregate concrete. Cem. Concr. Compos. 60, 111–122 (2015).

Liu, X., Du, H. & Zhang, M.-H. A model to estimate the durability performance of both normal and light-weight concrete. Constr. Build. Mater. 80, 255–261 (2015).

Farahani, A., Taghaddos, H. & Shekarchi, M. Prediction of long-term chloride diffusion in silica fume concrete in a marine environment. Cem. Concr. Compos. 59, 10–17 (2015).

Park, J.-I. et al. Diffusion decay coefficient for chloride ions of concrete containing mineral admixtures. Adv. Mater. Sci. Eng. https://doi.org/10.1155/2016/2042918 (2016).

Ferreira, R. et al. Effect of metakaolin on the chloride ingress properties of concrete. KSCE J. Civ. Eng. 20, 1375–1384 (2016).

Pilvar, A. et al. Practical evaluation of rapid tests for assessing the chloride resistance of concretes containing silica fume. Comput. Concr. Int. J. 18(4), 793–806 (2016).

Choi, Y. C. et al. Modelling of chloride diffusivity in concrete considering effect of aggregates. Constr. Build. Mater. 136, 81–87 (2017).

Liu, J. et al. Understanding the effect of curing age on the chloride resistance of fly ash blended concrete by rapid chloride migration test. Mater. Chem. Phys. 196, 315–323 (2017).

Article  ADS  CAS  Google Scholar 

Shiu, R.-W. & Yang, C.-C. Evaluation of migration characteristics of opc and slag concrete from the rapid chloride migration test. J. Mar. Sci. Technol. 28(2), 1 (2020).

Naito, C. et al. Chloride migration characteristics and reliability of reinforced concrete highway structures in Pennsylvania. Constr. Build. Mater. 231, 117045 (2020).

Sell Junior, F. K. et al. Experimental assessment of accelerated test methods for determining chloride diffusion coefficient in concrete. Rev. Ibracon. Estrut. Mater. 14, 14407 (2021).

Pontes, J. et al. The rapid chloride migration test in assessing the chloride penetration resistance of normal and lightweight concrete. Appl. Sci. 11(16), 7251 (2021).

The authors would like to thank the Concrete Technology Laboratory of the faculty of Civil Engineering at Babol Noshirvani University of Technology (BNUT) in Iran.

Faculty of Civil Engineering, Babol Noshirvani University of Technology, 484, Babol, 47148-71167, Iran

Maedeh Hosseinzadeh, Seyed Sina Mousavi, Alireza Hosseinzadeh & Mehdi Dehestani

You can also search for this author in PubMed  Google Scholar

You can also search for this author in PubMed  Google Scholar

You can also search for this author in PubMed  Google Scholar

You can also search for this author in PubMed  Google Scholar

M.H.: Conceptualization, idea, investigation, validation, formal analysis, methodology, software, visualization, writing—original draft, writing—review & editing. S.S.M.: Investigation, methodology, data curation, validation, writing—original draft, writing—review & editing. A.H.: Software, validation, data curation, writing—review & editing. M.D.: Conceptualization, supervision, methodology, data curation, validation, writing—review & editing.

The authors declare no competing interests.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Hosseinzadeh, M., Mousavi, S.S., Hosseinzadeh, A. et al. An efficient machine learning approach for predicting concrete chloride resistance using a comprehensive dataset. Sci Rep 13, 15024 (2023). https://doi.org/10.1038/s41598-023-42270-3

DOI: https://doi.org/10.1038/s41598-023-42270-3

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Scientific Reports (Sci Rep) ISSN 2045-2322 (online)

An efficient machine learning approach for predicting concrete chloride resistance using a comprehensive dataset | Scientific Reports

Tile Adhesive Mixing Equipment Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.