Automated machine learning structure-composition-property relationships of perovskite materials for energy conversion and storage

Perovskite materials are central to the fields of energy conversion and storage, especially for fuel cells. However, their compositional complexity is a challenge, compounded by the need to discover new materials quickly and accurately. Herein, we propose a new approach that combines extreme feature engineering with automated machine learning to adaptively learn the structure-composition-property relationships of perovskite oxide materials for energy conversion and storage. Structure-composition-property relationships between stability and other features of perovskites are investigated. Extreme feature engineering is used to construct a large number of new descriptors, and a crucial subset of 23 descriptors is obtained with the sequential forward selection algorithm. The best descriptor for the stability of perovskites is determined with linear regression. The results demonstrate a highly efficient investigation of structure-composition-property relationships for perovskite materials that requires no prior knowledge, providing a new route to the discovery of advanced energy materials.


INTRODUCTION
To discover new materials, it is essential to investigate the structure-composition-property relationships of inorganic materials, yet the huge number of possible material compositions makes these hidden relationships hard to uncover [1,2]. Such investigations are usually supported by extensive laboratory experiments that are demanding in terms of both time and technology. Accordingly, exploring structure-composition-property relationships is very difficult [3][4][5]. Machine learning has been applied intensively to the exploration and discovery of advanced materials for almost a decade [6][7][8][9], and has become a highly efficient approach to investigating inorganic materials [5]. However, the complexity of machine-learning processes and the poor interpretability of the resulting models make it hard to obtain good rules that describe the connections between the structure, composition and properties of materials, which impedes deeper understanding [10]. Consequently, it is particularly important to improve the approaches used to explore the structure-composition-property relationships of inorganic materials [7,11]. So far, several approaches for discovering important descriptors have been published, such as the symbolic regression algorithm [12], the least absolute shrinkage and selection operator algorithm [13], and the sure independence screening and sparsifying operator (SISSO) algorithm [14]. The purpose of these approaches is to find, within the given feature space, vital descriptors or hidden mathematical formulas that describe the target variables, so that they can be used to predict those variables [4,9,10]. Although these methods have achieved good results, they rely on many conditions, such as a large amount of data and suitable algorithms, which are demanding for materials scientists who are not familiar with computer algorithms [15]. Therefore, these algorithms are inefficient in practice [14][15][16].
Perovskite materials are essential for energy storage and conversion because of their excellent electrocatalytic properties [11]. The stability of perovskite compounds is a central and challenging issue in perovskite-based fuel cells, and is a key material property whose value may determine whether a perovskite oxide can be used [17]. Considering the numerous possible A- and B-site elements [Figure 1], as well as the various conventional doping ratios and combinations, the number of possible perovskite compositions is huge. This compositional flexibility gives the perovskite structure a complex set of functional properties, and it also poses a big challenge for predicting stability [18]. A recent study obtained a subset of nine important descriptors by constructing a large number of new descriptors and applying recursive feature elimination. The optimal descriptor of the lattice constant was then obtained with a linear regression algorithm, yielding a simple linear expression for the lattice constant [19]. That approach helps to explore the structure-composition relationships of materials without prior knowledge. In this work, the approach is further improved.
In this paper, the structure-composition-property connections between stability and other features of perovskite compounds were investigated via a highly effective approach combining extreme feature engineering and automated machine learning [19][20][21][22][23][24][25][26][27]. The feature engineering approach was used to remove redundant features while generating many new descriptors [28]. A subset of significant descriptors was obtained with the sequential forward selection algorithm, and the best descriptor was then obtained via linear regression analysis to give an expression for stability. Instead of modeling all feature combinations, the sequential forward selection algorithm stops the search as soon as a suitable solution is found, which greatly reduces the computational effort. This new approach, which combines feature engineering with linear regression, does not require researchers to have an in-depth understanding of computer algorithms and does not depend on advanced prior knowledge or models [29]. Compared with the symbolic regression and SISSO algorithms, this approach has clear advantages [9]. The acquired structure-composition-property relationships will speed up the design and optimization of perovskite materials, and offer a new way to explore and study inorganic materials.

EXPERIMENTAL
The whole process of adaptively learning the structure-composition-property connections of ABO3 perovskite compounds is shown in Figure 2 and comprises the following steps:
Step 1: Collect the material data set from different sources;
Step 2: Pretreat the material data set;
Step 3: Use extreme feature engineering to generate a large number of new descriptors;
Step 4: Apply feature selection to the new descriptors to find the subset of important descriptors, and then apply regression fitting to this subset.
This enables the discovery of the optimal descriptor and of the related structure-composition-property relationships.

Dataset of perovskite materials
The data set of ABO3 perovskite oxides used in this paper originates from high-throughput DFT calculations and includes the ionic radius of the A-site, the ionic radius of the B-site, formation energy, crystal volume, band gap, lattice parameters (a, b, c, α, β, γ), oxygen vacancy formation energy, stability, etc. [30]. Ions of different radii were used at the A and B sites, including those of A ∈ [Al, As, Ag, Be, B, Bi, Ba, Ca, Co, Cu, Cr, Cd, Ce, Zr, Zn, Dy, Er, Fe, Ge, Gd, Ga, Hf, Ho, In, Ir, K, La, Lu ABO3 compounds. After a preliminary examination of the data, a total of 4912 sets of high-throughput data for ABO3 perovskite compounds were chosen. The stability values lie in the range of -0.729 to 3.927 eV/atom. The features are indexed by abbreviations to make the analysis easier: $r_A$ for the ionic radius at the A-site, $r_B$ for the ionic radius at the B-site, $\Delta H_f$ for formation energy, $V$ for crystal volume, $\Delta E$ for band gap, $a$, $b$ and $c$ for the lattice lengths, $\alpha$, $\beta$ and $\gamma$ for the lattice angles, $\Delta E_{OV}$ for oxygen vacancy formation energy and $\Delta H_s$ for stability. More detailed descriptions of the ABO3 perovskite oxide features are given in Table 1.

Data pretreatment
Data pretreatment handles the missing and repeated values in the data, which improves data quality and helps to raise the precision and efficiency of the subsequent learning procedure. Common pretreatment steps include missing value processing, attribute coding, feature selection, etc. [31,32]. There are three common ways to deal with missing values: use the feature that contains the missing values directly, delete the feature that contains the missing values (appropriate only if the feature contains a large number of blank values), or fill in the missing values [33,34]. Because the raw data set contains only a small number of blank values, the features containing blank values are used directly in this paper.
Feature selection refers to the procedure of picking a subset of relevant features from a given feature collection [35]. Although a variety of factors influence the target properties of perovskite oxides, the number of features must be appropriate, the features must be informative for the target of interest, and non-essential information must be removed [36]. Correlation describes the degree and direction of the relationship between two measurable features, and Pearson correlation analysis is often adopted to analyze it [37]. In this paper, the Pearson correlation coefficient was used to measure the linear correlation among composition, structure and property [38]. The Pearson correlation coefficient is defined as follows:

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{\sigma_x}\right)\left(\frac{y_i-\bar{y}}{\sigma_y}\right)$$

where $x_i$ represents the value of feature x, $y_i$ represents the value of feature y, $\bar{x}$ represents the average value of feature x, $\bar{y}$ represents the average value of feature y, $\sigma_x$ represents the standard deviation of feature x, $\sigma_y$ represents the standard deviation of feature y and $n$ represents the sample size [38]. A correlation coefficient of 1 indicates a perfect positive linear correlation, whereas a value of -1 indicates a perfect negative linear correlation. A coefficient near 0 implies that there is no linear association [37,38].
Pearson correlation coefficients were used to screen the features of the raw data set in this paper. Table 2 lists the degree of correlation between the 12 features of ABO3 perovskite oxides and their stability. Figure 3 shows the Pearson correlation map for the different features.
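The Pearson coefficient used for this screening can be computed in a few lines. The following is a minimal sketch, assuming sample standard deviations (an n - 1 denominator); the function name `pearson_r` and the toy values are hypothetical and are not taken from the data set.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (sample form)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample standard deviations (divide by n - 1).
    sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    cov_sum = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cov_sum / ((n - 1) * sd_x * sd_y)

# Hypothetical toy values, for illustration only.
r_b = [0.60, 0.65, 0.70, 0.75, 0.80]
stability = [0.10, 0.22, 0.29, 0.41, 0.50]
r_pos = pearson_r(r_b, stability)                 # close to +1
r_neg = pearson_r(r_b, [-s for s in stability])   # same magnitude, opposite sign
```

Flipping the sign of one variable flips the sign of the coefficient without changing its magnitude, which is why screening is usually done on the absolute value.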

Extreme feature engineering
To rapidly discover the connections between structure, composition and properties, we plotted the feature distributions of the data set. Figure 4 depicts the distributions of the raw features and stability. The distributions of the raw features and of the target variable, stability, are positively skewed, and the range of the data is quite broad. Therefore, data transformation methods must be employed to generate new descriptors through feature engineering [39]. Essentially, the data provided to the algorithm should be compatible with the structure or characteristics that the algorithm requires. Feature engineering is the process of turning data attributes into data features, extracting as much information as possible from the raw data for use by algorithms and models [40]. The feature engineering approach can therefore generate a large number of new descriptors, whose performance is then assessed on subsets of them.
In machine learning, feature engineering is a critical data preparation activity that creates suitable descriptors from given features to improve prediction performance [41]. It applies transformation functions, such as arithmetic and aggregation operators, to the given attributes to create a huge number of new descriptors [42]. The transformation functions help to increase the dimensionality of the features or to turn the nonlinear relationship between features and stability into a more understandable linear one [40,43]. Feature combination is an important method in feature engineering that integrates features from several categories into a single feature [44]. It is beneficial when a combination of features outperforms a single feature. Mathematically, feature combination is the cross multiplication of all possible feature values, and each combined feature captures the synergy of the information in its components.
A huge number of brand-new descriptors were obtained through extreme feature engineering, which also expanded the dimensionality of the features. Figure 5 shows how the descriptors are constructed by extreme feature engineering. In Figure 5, $x_i$ (i = 1, 2, …, n) indicates the selected features, and the terms following the yellow arrows are the newly produced descriptors. Nine functions, $x$, $x^{-1}$, $\sqrt{x}$, $x^2$, $x^3$, $e^x$, $\ln|x|$, $\ln(1+|x|)$ and $\log|x|$, are used for the nonlinear transformation of features. To generate additional descriptors, the transformed descriptors are combined nonlinearly [45]. The primary descriptors were generated as follows:
Step 1: Substitute the chosen vital features into the nine functions listed above, where x is one of the vital features chosen from the raw features; this directly generates brand-new descriptors;
Step 2: Combine the brand-new descriptors from Step 1 by feature combination, increasing the number of descriptors by multiplying two or more of them together to form a brand-new one;
Step 3: Substitute the brand-new descriptors obtained in Step 2 into the function $x^{-1}$ for nonlinear conversion, which further increases the number of descriptors.
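The three construction steps can be sketched for a single sample as follows. This is an illustration under stated assumptions rather than the authors' implementation: the square-root transform stands in for the one function that is not legible in the list, `log` is read as base 10, and the helper `build_descriptors`, its `order` parameter and the feature values are hypothetical.

```python
import itertools
import math

# Nine element-wise transforms. sqrt|x| is an ASSUMED stand-in for the one
# function that is not legible in the source list; log is taken as base 10.
TRANSFORMS = {
    "x": lambda x: x,
    "x^-1": lambda x: 1.0 / x,
    "sqrt|x|": lambda x: math.sqrt(abs(x)),   # assumption
    "x^2": lambda x: x ** 2,
    "x^3": lambda x: x ** 3,
    "exp(x)": lambda x: math.exp(x),
    "ln|x|": lambda x: math.log(abs(x)),
    "ln(1+|x|)": lambda x: math.log1p(abs(x)),
    "log10|x|": lambda x: math.log10(abs(x)),
}

def build_descriptors(sample, order=2):
    """Return {name: value} for one sample given as {feature_name: value}.

    Feature values must be nonzero because of the reciprocal and logarithms.
    """
    # Step 1: transform every feature with every function.
    primary = {
        f"{fname}({feat})": func(value)
        for feat, value in sample.items()
        for fname, func in TRANSFORMS.items()
    }
    # Step 2: multiply `order` distinct primary descriptors together.
    combined = {}
    for names in itertools.combinations(sorted(primary), order):
        combined["*".join(names)] = math.prod(primary[n] for n in names)
    # Step 3: pass each product through x^-1 (skipping exact zeros).
    inverted = {f"({k})^-1": 1.0 / v for k, v in combined.items() if v != 0.0}
    return {**primary, **combined, **inverted}

# Hypothetical sample with two features (values are illustrative only).
descriptors = build_descriptors({"r_A": 1.44, "r_B": 0.605})
```

Even two raw features yield 18 primary descriptors, 153 pairwise products and 153 reciprocals, which shows how quickly the descriptor space grows and why a subsequent selection step is needed.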

Regression
The term "regression" refers to determining the quantitative relationship between two or more variables from a set of data, establishing a mathematical model, and estimating its unknown parameters [46]. Machine learning is an efficient way of performing regression. Linear regression describes the relationship between data with a straight line, which makes it well suited to fitting an explicit expression [47]. Linear regression models are fast to build, need no sophisticated calculation, and run quickly even on large amounts of data [48]. The resulting linear expression can be understood and interpreted through the coefficient of each variable: the influence of each feature on the result can be read directly from its weight, which is much easier to grasp [43,49]. Nonlinear expressions and other machine-learning methods are more complex, and the related process is difficult to interpret [48]. Linear regression is therefore appropriate for selecting the best descriptor. In this paper, an optimized 55%/45% split of the data set was used, which balanced accuracy against overfitting of the machine-learning model. In the end, the important descriptor was obtained by comparing the performance of the models for the various descriptors.
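As an illustration of why a single-descriptor linear fit is easy to interpret, the sketch below fits a straight line by ordinary least squares on synthetic data. It reads the 55%/45% ratio as a training/test split, which is an assumption; the helper `fit_line` and all values are hypothetical.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y ~ slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic descriptor/target pairs: y = 0.5 * d - 0.1 plus small noise.
rng = random.Random(0)
d = [rng.uniform(0.0, 10.0) for _ in range(200)]
y = [0.5 * v - 0.1 + rng.gauss(0.0, 0.05) for v in d]

# 55%/45% split after shuffling the sample indices (assumed train/test).
idx = list(range(len(d)))
rng.shuffle(idx)
cut = int(0.55 * len(idx))
train, test = idx[:cut], idx[cut:]

slope, intercept = fit_line([d[i] for i in train], [y[i] for i in train])
# Error on the held-out 45% indicates how well the line generalizes.
test_mse = sum((y[i] - (slope * d[i] + intercept)) ** 2 for i in test) / len(test)
```

The fitted slope is the change in the target per unit change in the descriptor, which is the interpretability argument made in the paragraph above.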

Performance evaluation
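To assess the prediction accuracy and model performance, we employed the mean absolute error (MAE), mean square error (MSE) and coefficient of determination ($R^2$). The closer the MAE and MSE values are to 0 and the closer the $R^2$ value is to 1, the higher the prediction accuracy and the better the model performance. The corresponding equations are:

$$\mathrm{MAE} = \frac{1}{n}\sum_{j=1}^{n}\left|y_j^{\mathrm{exp}} - y_j\right|$$

$$\mathrm{MSE} = \frac{1}{n}\sum_{j=1}^{n}\left(y_j^{\mathrm{exp}} - y_j\right)^2$$

$$R^2 = 1 - \frac{\sum_{j=1}^{n}\left(y_j^{\mathrm{exp}} - y_j\right)^2}{\sum_{j=1}^{n}\left(y_j^{\mathrm{exp}} - \bar{y}\right)^2}$$

where $n$ indicates the sample size, $y_j^{\mathrm{exp}}$ indicates the experimental value, $y_j$ indicates the predicted value, and $\bar{y}$ is the average of the experimental values.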
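A small worked example makes the three metrics concrete. The functions below are a plain-Python sketch with hypothetical names and toy numbers; `y_true` plays the role of the experimental (reference) values and `y_pred` the predicted values.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean square error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy values: two predictions are exact, two are off by 0.5.
y_true = [0.0, 1.0, 2.0, 3.0]
y_pred = [0.5, 1.0, 1.5, 3.0]
scores = (mae(y_true, y_pred), mse(y_true, y_pred), r2(y_true, y_pred))
# By hand: MAE = 1.0 / 4 = 0.25, MSE = 0.5 / 4 = 0.125, R^2 = 1 - 0.5 / 5 = 0.9
```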

Extreme feature engineering
Following data pretreatment and feature transformation, the quantity and quality of the descriptor data set must be checked further. Feature processing plays an important role in feature engineering and is also the most time-consuming aspect of data analysis. Unlike algorithms and models, feature processing has no fixed procedure; it relies on technical experience and trade-offs, so there is no unified way of performing it. Fortunately, scikit-learn offers a comprehensive set of feature processing tools, covering data preparation, feature selection, dimension reduction, and so on [50]. Scikit-learn is a free and open-source machine learning library distributed under the Berkeley Software Distribution license [51]. Thus, in this paper, the Python package scikit-learn was used for data pretreatment, feature transformation, feature processing, machine-learning model training and model performance evaluation [51][52][53]. Feature selection is the process of eliminating duplicate and unnecessary features from a data set, determining the important features, and eventually obtaining the feature subset [51]. Wrapper methods are common methods for feature selection [54,55]. They proceed as follows:
Step 1: A subset of features is chosen to train the model. The model here usually refers to a machine learning algorithm, also called an objective function;
Step 2: Evaluate the model with a validation data set;
Step 3: Repeat the above operations on different feature subsets according to some search algorithm;
Step 4: Based on the evaluation results, select the best feature subset.
The method for finding the optimal descriptor subset belongs to the family of greedy search algorithms. Wrapper methods include three common selection methods: sequential feature selection (SFS) [56], exhaustive feature selection [57] and recursive feature elimination [58]. SFS comprises two algorithms, sequential forward feature selection and sequential backward feature selection. The sequential forward selection algorithm starts from an empty set and repeatedly adds the feature that most improves the model, in order to find the K most appropriate features out of N. Instead of modeling all feature combinations, it stops the search as soon as a suitable solution is found, which greatly reduces the computational effort [56]. Therefore, we adopted the sequential forward feature selection algorithm to perform feature selection. In this work, gradient boosting regression (GBR) was used as the objective function.
Extreme feature engineering created many descriptors, which then underwent a preliminary screening. By analyzing the Pearson correlation coefficients, the 50 descriptors with the highest coefficients were selected. Figure 6A shows the Pearson correlation coefficients for the newly constructed descriptors. Figure 6B shows the Pearson correlation coefficients for the 50 selected descriptors. Figure 6C shows the Pearson correlation map of the 50 selected descriptors and stability. Figure 6D shows the trend of the prediction performance of the GBR models with the number of descriptors. GBR is an enhancement of the Boosting algorithm [59]. Boosting is a type of ensemble machine-learning algorithm that combines weak learners into a strong learner. In the Boosting algorithm, each sample is initially allocated an equal weight [60,61]. Because each training round changes how well the individual data points are fitted, the weights are updated at the end of each step by increasing those of the misclassified points, and N iterations are carried out to obtain N simple base classifiers [62,63]. Finally, the N base classifiers are combined in a weighted sum to form the final model.
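A wrapper search of this kind can be sketched with scikit-learn's `SequentialFeatureSelector` wrapped around a `GradientBoostingRegressor`. The data below are synthetic, and the hyperparameters (30 trees, depth 2, 3-fold cross-validation, two features to select) are assumptions chosen only so that the example runs quickly; they are not the settings used in this work.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SequentialFeatureSelector

# Synthetic data: 6 candidate descriptors, only columns 0 and 3 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.1, size=120)

# GBR plays the role of the objective function; sizes are kept small
# so the sketch runs quickly.
gbr = GradientBoostingRegressor(n_estimators=30, max_depth=2, random_state=0)
selector = SequentialFeatureSelector(
    gbr, n_features_to_select=2, direction="forward", scoring="r2", cv=3
)
selector.fit(X, y)

# Indices of the descriptors kept by the greedy forward search.
selected = np.flatnonzero(selector.get_support())
```

Each round fits one model per remaining candidate, so selecting K of N features costs on the order of K × N model fits rather than one fit per possible subset, which is the computational saving described above.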
The distinction between GBR and Boosting is that each GBR iteration is designed to reduce the residual of the previous iteration. To do so, a new model is built in the direction of the gradient along which the residual decreases. As a result, in GBR, each new model aims to reduce the residuals of the previous model in the gradient direction [64]: the negative gradient of the loss function is used as the estimate of the residual, and a regression tree is then fitted to it [65,66]. As its weak learner, GBR typically employs a regression tree of fixed size. The capacity to handle mixed data and to model complicated functions are two properties of the regression tree that make it more accurate in the boosting process [64]. The GBR model is as follows:

$$F(x) = \sum_{m=1}^{M} r_m\,T(x;\theta_m)$$

Here, $r_m$ is the weight, $T(x;\theta_m)$ is the regression tree, $\theta_m$ is the parameter of the regression tree, $m$ indexes the trees and $M$ is the number of trees [67]. The GBR models were built iteratively in this paper. The best descriptors were chosen based on the descriptor importance coefficients, and this process was repeated for the other descriptors until all descriptors had been explored. Finally, the optimal descriptor subset consists of the 23 most important descriptors, as shown in Table 3. Figure 6D depicts the relationship between the prediction performance of the GBR models and the number of descriptors. As the number of descriptors was raised, the prediction performance of the GBR models improved and eventually stabilized [68,69]. The best performance of the GBR models was obtained when the optimum subset of 23 descriptors was employed. Figure 6B depicts the Pearson correlation coefficients between the 50 chosen descriptors and the stability, which range from 0.839595 to 0.845766 with minor variations. These descriptors are strongly correlated with the stability of perovskite compounds.
The key to understanding the structure-composition-property relationship is to choose the most relevant descriptor.
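The residual-fitting idea behind GBR can be illustrated from scratch. The sketch below assumes squared-error loss (so the negative gradient is simply the residual), one-split regression trees on a single input, and a constant weight `rate` for every tree; these are simplifications for illustration, and the function names are hypothetical.

```python
def fit_stump(x, residual):
    """Best single-split regression tree (stump) under squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for v, r in zip(x, residual) if v <= threshold]
        right = [r for v, r in zip(x, residual) if v > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda v: left_mean if v <= threshold else right_mean

def boost(x, y, n_trees=50, rate=0.1):
    """Additive model F(x) = F0 + sum_m rate * T_m(x)."""
    f0 = sum(y) / len(y)
    trees, pred, history = [], [f0] * len(y), []
    for _ in range(n_trees):
        # For squared loss the negative gradient equals the residual.
        residual = [t - p for t, p in zip(y, pred)]
        tree = fit_stump(x, residual)
        trees.append(tree)
        pred = [p + rate * tree(v) for p, v in zip(pred, x)]
        # Track the training MSE after each added tree.
        history.append(sum((t - p) ** 2 for t, p in zip(y, pred)) / len(y))
    return (lambda v: f0 + rate * sum(t(v) for t in trees)), history

x = [i / 10 for i in range(40)]
y = [v ** 2 for v in x]          # toy nonlinear target
model, mse_history = boost(x, y)
```

Because every stump is a least-squares fit to the current residuals, the training error recorded in `mse_history` never increases from one tree to the next.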
Following the selection of 23 key descriptors through SFS, the ideal subset of descriptors was chosen based on the three evaluation indices of the GBR model and used to train the linear regression model, as shown in Table 3. After a large number of experiments, the results showed that these fluctuations were within the range of 3%. It is apparent that the descriptor
Due to the limited capabilities of experimental and theoretical tools, traditional material discovery has always been a process of trial and error. The widely used tolerance factor (t for short) for measuring the stability of perovskites was proposed by Goldschmidt in 1926. It has become a popular descriptor of stability and has accelerated the stability screening of perovskites over the past century. The Goldschmidt tolerance factor t predicts the stability of perovskite structures from the universal formula ABX3 alone, based on matching the ionic sizes of the A-site, B-site and X-site [70]. Its expression is:

$$t = \frac{r_A + r_X}{\sqrt{2}\,(r_B + r_X)}$$

Here, $r_A$ is the A-site ionic radius, $r_B$ is the B-site ionic radius, and $r_X$ is the X-site ionic radius. This is a semi-empirical formula with an accuracy of only 70% that gives a rough indication of the stability of perovskite materials. The descriptors constructed in this work are related not only to the A-site and B-site ionic radii but also to the lattice parameters, which are considered to be key features for the stability of perovskite materials.
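For reference, the Goldschmidt tolerance factor is straightforward to evaluate. The sketch below uses SrTiO3 as a hypothetical example with commonly tabulated Shannon radii as assumed inputs; neither the compound nor the radii are taken from the data set used in this work.

```python
import math

def tolerance_factor(r_a, r_b, r_x):
    """Goldschmidt tolerance factor t = (r_A + r_X) / (sqrt(2) * (r_B + r_X))."""
    return (r_a + r_x) / (math.sqrt(2.0) * (r_b + r_x))

# Illustrative Shannon radii in angstroms for SrTiO3 (assumed inputs):
# Sr2+ (12-coordinate) 1.44, Ti4+ (6-coordinate) 0.605, O2- 1.40.
t_srtio3 = tolerance_factor(1.44, 0.605, 1.40)   # approximately 1.00
```

The factor depends on ionic radii only, whereas the descriptors built in this work also draw on lattice parameters.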

Automated machine learning
By automated machine learning, we discovered the quantitative relationships between various variables based on a collection of data, which resulted in the construction of a mathematical model and the estimation of unknown parameters. The linear regression algorithm, as an effective machine-learning algorithm, describes the relationship between data with a straight line and was well suited to fitting expressions in this paper [71]. Table 4 shows the 7 descriptors and the corresponding evaluation indexes selected by the linear regression model. The last descriptor achieved the greatest $R^2$ value of 0.735529, the lowest MAE value of 0.224526 and the lowest MSE value of 0.102889. As a result, this descriptor was chosen as the best descriptor for investigating structure-composition-property relationships in perovskite compounds (ABO3); it is related only to the A-site ion radius, the B-site ion radius, the lattice constant b, and the α angle of the crystal structure. After the optimal descriptor had been obtained with the linear regression model, a straightforward linear equation was used to represent the structure-composition-property relationships of ABO3 perovskite compounds. The general formula is:

$$F = f(d_1, d_2, \ldots, d_n)$$

where $d_i$ are the final descriptors, $F$ is the stability, and $f(d_1, d_2, \ldots, d_n)$ is a linear representation of the structure-composition-property connection. A simple linear expression, Equation (8), was produced using linear regression analysis, where k = 0.1485 and z = -0.0380 are the coefficient values. The reliability of the automated-machine-learning stability expression was validated by comparing the linear regression fit with the DFT-calculated values, as shown in Figure 7. The results showed that the effects of the A-site ion radius, the B-site ion radius, the lattice constant b, and the α angle of the crystal structure are more significant than those of the other variables.
The resulting equation shows the relationship between structure, composition and property in perovskite oxides. Our technique produces a more accurate expression than the semi-empirical formula. In short, the new approach can be used to investigate the structure-composition-property relationships of ABO3 perovskite oxides.
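The descriptor comparison described in this section can be sketched as a loop that fits one straight line per candidate and keeps the candidate with the highest R². The example below uses NumPy on synthetic data; the candidate names are hypothetical, and treating k as the slope and z as the intercept is an assumption about how the two coefficients are used.

```python
import numpy as np

def score_descriptor(d, target):
    """Fit target ~ k * d + z and return (k, z, R^2, MAE, MSE)."""
    k, z = np.polyfit(d, target, deg=1)
    pred = k * d + z
    err = target - pred
    r2 = 1.0 - np.sum(err ** 2) / np.sum((target - target.mean()) ** 2)
    return k, z, r2, np.mean(np.abs(err)), np.mean(err ** 2)

rng = np.random.default_rng(1)
target = rng.uniform(-0.7, 3.9, size=300)          # synthetic "stability"
candidates = {
    "d_good": 5.0 * target + rng.normal(scale=0.5, size=300),
    "d_noisy": 5.0 * target + rng.normal(scale=5.0, size=300),
    "d_random": rng.normal(size=300),
}

# One linear fit per candidate; the best descriptor maximizes R^2.
scores = {name: score_descriptor(d, target) for name, d in candidates.items()}
best = max(scores, key=lambda name: scores[name][2])
```

In practice the MAE and MSE columns would be compared alongside R², as is done in Table 4.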

CONCLUSIONS
To address the huge complexity of structure-composition-property relationships in ABO3 perovskite materials for energy conversion and storage, we presented a new way of combining extreme feature engineering and automated machine learning to investigate structure-composition-property connections in perovskite oxides. A great number of brand-new descriptors were generated via extreme feature engineering, and a subset of 23 significant descriptors was obtained via SFS. Furthermore, the optimal descriptor was found with the linear regression algorithm, and a straightforward linear equation for the stability was obtained. The results show that the radius of the A-site ions, the radius of the B-site ions, the lattice constant b, and the α angle of the crystal structure influence the stability of ABO3 perovskites more significantly than the other features. In this way, an expression with higher accuracy than the semi-empirical formula can be obtained. The results demonstrate a highly efficient investigation of structure-composition-property relationships for perovskite materials that requires no prior knowledge, providing a new route to the discovery of advanced energy materials.

Authors' contributions
Made substantial contributions to conception and design of the study and performed data analysis and interpretation: Deng Q, Lin B
Performed data acquisition, as well as provided administrative, technical, and material support: Deng Q, Lin B
Wrote and reviewed the manuscript: Deng Q, Lin B