Data modeling analysis on removal efficiency of hexavalent chromium

Chromium and its compounds are widely used in many industries in China and play a very important role in the national economy. At the same time, heavy metal chromium pollution poses a great threat to the ecological environment and human health. Therefore, it is necessary to remove chromium from pollutants safely and effectively. In practice, many factors influence the removal efficiency of chromium. However, few studies so far have investigated the relationship between multiple factors and the removal efficiency. To this end, this paper uses green synthesized iron nanoparticles to remove chromium and investigates the impacts of multiple factors on the removal efficiency. A novel model that maps multiple given factors to the removal efficiency of chromium is built with advanced machine learning methods, i.e., XGBoost and random forest (RF). Experiments demonstrate that the proposed method can precisely predict the removal efficiency of chromium from the given influencing factors, which is very helpful for finding the optimal conditions for removing chromium from pollutants.


1. Introduction. Chromium (Cr) lies in the fourth period and group VIB of the periodic table. Its atomic number is 24, its density is 7.20 g/cm$^3$, and its content in the earth's crust is 0.01%. Chromium is a silver-white lustrous metal. It is hard and corrosion resistant, insoluble in water but soluble in strong alkali solution. Chromium is widely found in soil, the atmosphere, water, and the bodies of plants and animals. At normal temperature, chromium remains chemically stable in air and water. Chromium exists in a variety of oxidation states (from 0 to 6). In nature, it occurs mainly as Cr(III) and Cr(VI). Cr(III), which is relatively stable and less toxic, is an essential micro-element for the human body. However, it exhibits biotoxicity at higher concentrations [9]. The main existing forms of Cr(III) are chromium oxide and chromium hydroxide. By comparison, the existing forms of Cr(VI) include $\mathrm{HCrO_4^-}$ and $\mathrm{CrO_4^{2-}}$. Fig. 1 illustrates the pH-Eh relationship of Cr(VI). Cr(VI) is usually not readily present in soil but can be deposited in water. In the absence of oxygen, Cr(VI) can be reduced to Cr(III) by organic matter and by ions such as $\mathrm{S^{2-}}$ and $\mathrm{Fe^{2+}}$. Cr(VI) is 100 times more toxic than Cr(III) and has carcinogenic, teratogenic and mutagenic effects on humans. Cr(III) and Cr(VI) can be transformed into each other. In addition, studies have shown that Cr(VI) is easily absorbed by the human body. Cr(VI) may cause skin eczema, ulcers, dermatitis or allergic reactions. Inhalation of Cr(VI) at higher concentrations can cause sneezing, runny nose, pruritus and nosebleed. Ingesting about 1.5-1.6 g of a Cr(VI) compound can lead to nausea and vomiting, dizziness and headache, restlessness, thickening of the oral mucosa, severe abdominal pain and other conditions, and ultimately to death. Chromium compounds can also damage the eyes, leading to foreign body sensation, painful tearing, conjunctival congestion, vision loss and loss of corneal epithelium.
Figure 1. The pH-Eh relationship of Cr(VI).
With the rapid development of industry, Cr(VI) pollution has become increasingly serious. Among heavy metal pollutants, Cr(VI) ranks second, behind only lead. It has been reported that about 600,000 tons of chromium slag are produced in China every year and that the total amount of chromium slag is 6 million tons, of which over 4 million tons have not been detoxified or comprehensively utilized [12] [18]. Therefore, removing Cr(VI) is of great importance and has attracted much attention in recent years. Many removal methods have been proposed, which can generally be classified into physical, chemical and biological methods.
Firstly, the physical methods mainly comprise adsorption, ion exchange and membrane separation. Adsorption methods remove Cr(VI) with adsorbents that have a highly porous structure or a large specific surface area, such as activated carbon, peanut skin, microbial cells, chestnut shell, chitosan, etc. [2] [14] [8]. The ion exchange method lets Cr(VI) ions exchange with an ion exchanger to separate Cr(VI) from wastewater, which is suitable for treating wastewater with a high concentration of Cr(VI). The common ion exchangers are exchange resins. The membrane separation method relies on the selective permeability of a specific semi-permeable membrane: some substances in the solution pass through the membrane while others are retained, so that different substances are separated. After the solution passes through the semi-permeable membrane, Cr(VI) is enriched on the retained side, and the permeated material is then either processed further or discharged directly [17] [6].
Secondly, the chemical methods for removing Cr(VI) include chemical precipitation, the ferrite method and electrolysis. Chemical precipitation methods remove Cr(VI) by converting it into a Cr(III) precipitate. The common chemical precipitation methods include barium salt precipitation and sulfur dioxide precipitation. The ferrite method adds excess $\mathrm{FeSO_4}$ to acidic (pH 2-3) Cr(VI) wastewater to reduce Cr(VI) to Cr(III) while oxidizing Fe(II) to Fe(III). Electrolysis is a method in which Cr(VI) is precipitated as hydroxide through redox reactions between iron and Cr(VI). In this process, iron serves as the anode.
Thirdly, biochemical methods convert soluble Cr(VI) ions into insoluble compounds and remove them from polluted water through catalytic transformations of microbial enzymes, reduction of metabolites, flocculation and precipitation [7] [25] [1].
The above-mentioned methods of removing Cr(VI) suffer from similar drawbacks, such as high cost, high energy consumption and secondary pollution of the environment. Therefore, researchers in recent years have investigated green synthesized nano-iron for removing Cr(VI) [13] [11] [10] [20]. Compared with traditional methods, the green synthesis of nano-iron offers simple operation, low synthesis cost and fast reaction speed, avoids the use of toxic and harmful reducing agents, and places no additional burden on the environment. It has been widely applied in removing environmental pollutants [24] [16] [21]. Many factors can influence the removal efficiency of Cr(VI) [23] [22] [19]. However, few studies have investigated their joint effect. Therefore, in this paper, we investigate in depth the impacts of multiple factors on the removal efficiency of Cr(VI). Specifically, we resort to advanced machine learning methods, i.e., XGBoost and random forest (RF), to establish a mapping model that precisely predicts the removal efficiency of Cr(VI) under given conditions.
The rest of this paper is organized as follows. In Section 2, we formulate the problem mathematically. In Section 3, we introduce the machine learning methods, XGBoost and RF in detail. In Section 4, we introduce the process and settings of the experiments to collect data. In Section 5, we analyze the obtained experimental results and evaluate the effectiveness of the proposed model. At last, we conclude this paper in Section 6.
2. Problem description. In real applications, multiple factors that influence the removal efficiency of Cr(VI) should be investigated. We use green tea synthesized iron nanoparticles in this experiment, and the factors comprise the pH value of the reaction solution, the dosage of the green tea synthesized iron nanoparticles, the initial concentration of the Cr(VI) solution, the reaction temperature, the content of green tea extract, the ratio of green tea extract to $\mathrm{Fe^{2+}}$, the preparation temperature of the green tea extract, and the synthesizing temperature of the green tea iron nanoparticles. We denote the above eight factors as
$$X = \{x_1, x_2, \ldots, x_8\}.$$
In experiments, it is impossible to sample or traverse each dimension of such high-dimensional data densely. Therefore, we introduce statistical learning methods, such as random forest [3] and XGBoost [4], to investigate the relationship between the influencing vector $X$ and the removal efficiency of Cr(VI). Denoting the removal efficiency of Cr(VI) as $Y$, we learn the mapping model from $X$ to $Y$ through the statistical learning methods, written as:
$$Y = M(X),$$
where $M(\cdot)$ refers to the mapping function.
3. Random forest and XGBoost. In machine learning, the objective function of the regression problem is defined as:
$$\mathrm{Obj}(\theta) = L(\theta) + \Omega(\theta),$$
where $\theta$ is the model parameter and $L(\theta)$ is the training error, representing the fitting accuracy of the model on the training set. In regression, the Mean Square Error (MSE) is used as the loss function. $\Omega(\theta)$ represents the regularization term, indicating the complexity of the model, which is usually defined by the $l_1$ or $l_2$ norm. Random Forest (RF), a statistical learning method, was proposed by Leo Breiman to approximate nonlinear relations among multiple variables [3]. The basic idea of RF is to construct a number of regression trees and to assemble them according to a predefined principle. Due to randomness, the constructed regression trees differ each time. Fig. 2 illustrates the structure of an RF. Suppose the RF contains $K$ regression trees; then the final prediction result is:
$$\hat{Y}_i = \sum_{k=1}^{K} f_k(X_i), \quad f_k \in F,$$
where $F$ refers to the RF and $f_k(X_i)$ is the weight of the $i$-th sample on the leaf of the $k$-th tree. In the training of the RF, we only consider the training error, so the training objective function can be denoted as:
$$\mathrm{Obj}(\theta) = L(\theta).$$
On the basis of RF, we introduce the regularization term $\Omega(\theta)$ to constrain the complexity of the system and construct the regression trees successively as follows: ...
where the complexity of the regression tree is further defined as a function of the number of leaves $T$ and the leaf weights $\omega$:
$$\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} \omega_j^2,$$
where $\gamma$ and $\lambda$ are parameters that adjust the significance of the two terms and $f_t$ refers to the $t$-th tree. We then transform the regression problem into an optimization problem and construct the regression trees by a gradient-based solving method. Specifically, constructing the regression trees successively makes better use of the training samples. The regularization term of the regression trees reduces the risk of over-fitting. The gradient-based solving method helps the learned regression model reach good performance. XGBoost is a machine learning library focused on the gradient boosting algorithm, which was proposed by Chen and Guestrin [4]. This library has attracted wide attention due to its excellent learning performance and efficient training speed. Of the 29 winning solutions in Kaggle contests in 2015, 17 used the XGBoost library, compared with 11 that used the recently booming deep neural network methods [5]. XGBoost is highly efficient, flexible and portable, and it also supports distributed training on platforms such as Amazon Web Services (AWS), Google Compute Engine and Azure.

After preparing the experimental data, we randomly divided the data into a training set and a test set at a ratio of 3:2, following common machine learning practice. We learned the prediction model on the training set and tested its prediction performance on the test set.
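The random 3:2 division described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used in the experiments: the function name `split_3_to_2`, the fixed seed and the placeholder data are all hypothetical.

```python
import random


def split_3_to_2(samples, seed=0):
    """Randomly split samples into a training set and a test set at a 3:2 ratio."""
    shuffled = list(samples)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    n_train = len(shuffled) * 3 // 5       # three fifths go to the training set
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical placeholder data: ten (X, Y) pairs with eight factors each.
data = [([float(i)] * 8, float(i)) for i in range(10)]
train, test = split_3_to_2(data)
print(len(train), len(test))  # 6 4
```

Fixing the seed keeps the split reproducible, so different models can be compared on the same training and test samples.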

5. Experimental results and analysis. In XGBoost, we adjusted the number of regression trees, the depth of the regression trees and the parameter of the $l_2$-norm regularization on the training set. We started from 1 and added 10 regression trees each time. The experimental results are shown in Fig. 3, where the horizontal axis represents the number of regression trees and the vertical axis represents the coefficient of determination $R^2$. The closer $R^2$ is to 1, the better the fitting performance. From Fig. 3, we can observe that the regression performance tends to converge as the number of regression trees increases. The maximum value of $R^2$ appears at around 321 trees. Therefore, we refined the search around 321; the results are shown in Fig. 4. From Fig. 4, we determined that the best number of regression trees is 306. Then we investigated the effect of the maximum depth of the regression trees; the results are shown in Fig. 5. As can be seen, the performance is best when the depth is 5. Then we investigated the $l_2$-norm parameter. We started from 0 and increased it in steps of 0.1; the results are shown in Fig. 6. From this figure, we can see that the regularization parameter $\lambda$ has little effect on the performance. The final prediction results of XGBoost are shown in Fig. 7. From Fig. 7, we can observe that the prediction performance is excellent on the training set. The prediction performance on the test set is also satisfactory, with an $R^2$ of 0.963 on the test set.
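The coarse search in steps of 10 followed by a refined search around the best coarse value can be sketched as below, together with the definition of $R^2$. This is an illustrative sketch only: `coarse_to_fine`, the search range and `toy_score` are hypothetical stand-ins, and in practice `score` would train a model with the given number of trees and return its $R^2$.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def coarse_to_fine(score, start=1, stop=500, coarse_step=10):
    """Scan a parameter coarsely, then search every value around the coarse optimum."""
    best_coarse = max(range(start, stop + 1, coarse_step), key=score)
    lo = max(start, best_coarse - coarse_step)
    hi = min(stop, best_coarse + coarse_step)
    return max(range(lo, hi + 1), key=score)


def toy_score(n_trees):
    """Hypothetical score curve with a single peak at 137 trees (not measured data)."""
    return 1.0 - ((n_trees - 137) / 500.0) ** 2


print(r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0
print(coarse_to_fine(toy_score))                   # 137
```

The coarse pass costs far fewer model fits than scanning every value, at the risk of missing a narrow peak that lies between two coarse grid points and outside the refined window.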
Similarly, we also employed RF to learn the prediction model. We investigated the impact of the number and the depth of the regression trees on the fitting performance. We started from 0 and increased the number of regression trees in steps of 1. The experimental results are shown in Fig. 8. From this figure, we find that the optimal number of trees is 9, at which $R^2$ attains its maximum value. Then we started from 1 and increased the maximum depth of the trees in steps of 1. The experimental results are shown in Fig. 9. From this figure, we observe that the optimal depth is 13. The prediction results for the removal efficiency of Cr(VI) are shown in Fig. 10. The prediction performance of RF is worse than that of XGBoost on the training set. However, RF performs better than XGBoost on the test set, where its final $R^2$ is 0.981.
For two conditions that are identical except for the removal temperature $x_8$, we find that a higher removal temperature improves the removal efficiency of Cr(VI). In addition, the Cr(VI) removal rates are 57.76% and 87.26% when the feature X = {60, 1/3, 80, 25, 3.8, 0.02, 80, 25} and X = {60, 1/3, 80, 25, 3.8, 0.06, 80, 45}. These two conditions are all the same except the dosage of the green tea synthesized iron nanoparticles, $x_6$. Therefore, we conclude that a higher dosage of the green tea synthesized iron nanoparticles improves the removal efficiency of Cr(VI).
In general, with the mapping model learned by RF or XGBoost, we are able to analyze how strongly each factor influences the removal efficiency of Cr(VI). Accordingly, we can obtain the optimal condition, under which the removal efficiency of Cr(VI) reaches its maximum. Such a condition cannot be found by exhaustively traversing all conditions in experiments.
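Once a mapping model is available, searching for the optimal condition reduces to evaluating the model on candidate factor vectors and keeping the best one. The sketch below illustrates this under loud assumptions: `best_condition` is a hypothetical helper, `toy_model` is an invented two-factor stand-in for the learned mapping $M$ (the real model takes eight factors), and the candidate values are chosen only for illustration.

```python
from itertools import product


def best_condition(predict, candidate_values):
    """Return the candidate factor vector with the highest predicted removal efficiency."""
    best_x, best_y = None, float("-inf")
    for x in product(*candidate_values):  # every combination of candidate values
        y = predict(x)
        if y > best_y:
            best_x, best_y = x, y
    return best_x, best_y


def toy_model(x):
    """Hypothetical stand-in for the learned mapping M; not fitted to any data."""
    dosage, temperature = x
    return 50.0 + 400.0 * dosage + 0.5 * temperature


# Hypothetical candidate values for two of the factors.
grid = [
    [0.02, 0.04, 0.06],  # dosage candidates
    [25, 35, 45],        # removal temperature candidates
]
x_best, y_best = best_condition(toy_model, grid)
print(x_best)  # (0.06, 45)
```

Evaluating the model on a grid is cheap compared with running one experiment per condition, which is what makes this search feasible when the experiments themselves cannot cover every combination.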
In this experiment, XGBoost performs better than RF on the training set, while RF outperforms XGBoost on the test set. Such observations are mainly limited by the experimental conditions. It can be expected that the results will be refined as the number of samples increases. It should also be noted that we present two-dimensional plots, whereas the prediction model works on eight-dimensional data, which cannot be displayed intuitively.
6. Conclusion. The removal of Cr(VI) is of great importance to the ecological environment and human health. In this paper, we have focused on the influence of multiple factors on the removal efficiency of Cr(VI). Specifically, we collected data through carefully designed experiments and constructed a dataset. Using modern machine learning methods, namely the XGBoost and RF algorithms, we established a model that can accurately predict the removal efficiency of Cr(VI) from the given factors. This work can help to find the best conditions for removing Cr(VI) and thus guide the Cr(VI) removal process effectively.