INCREASE STATISTICAL RELIABILITY WITHOUT LOSING PREDICTIVE POWER BY MERGING CLASSES AND ADDING VARIABLES

It is usually true that adding explanatory variables to a probability model increases the degree of association, yet risks losing statistical reliability. In this article, we propose an approach that merges classes within the categorical explanatory variables before each addition, so as to keep the statistical reliability while increasing the predictive power step by step.


1. Introduction. In any application where feature selection or dimension reduction is required, a key question to be answered is how many variables or features are enough. More variables may increase the data-based degree of association, but may also reduce the reliability of the explanatory information or lead to model overfitting. It is particularly important for a stepwise forward feature selection procedure [8] to decide when to stop the variable aggregation; it can be stopped when the maximum joint association or a predefined maximum number of variables is reached. More discussion of this subject can be found in [2].
Prediction accuracy naturally attracts most of the attention and has been studied for hundreds of years. Categorical data analysis alone has rates of point-hit accuracy, of distribution bias, and of a balance between the two [9]. Huang, Shi and Wang [12] suggested that the measure of association is fundamental to obtaining the prediction accuracy rate, and that this measure increases as more explanatory variables are added to the probabilistic model. The risk of model failure, or the model's reliability, is usually related to the average number of categories in a categorical predictive model. Guttman [7] presented methods to estimate the upper and lower bounds of a categorical data set's reliability; these estimates are functions of the number of categories available and the proportion of instances from which the model response is chosen. Probably the most generally applicable and widely used method for estimating the reliability of ratings or judgments is the intra-class correlation, or some variation of it [3]. However, none of these methods reflects the response variable's distribution.
We hence introduce a new measure, denoted E(Gini(X|Y)), to measure the reliability. It is based on classical measurement theory and the Gini coefficient [13]. E(Gini(X|Y)) measures the degree of concentration of the independent variable X given the dependent variable Y. This measure ensures that the reliability always increases when two categories of X are merged, meaning these two categories are treated as one.
We also prove that the association between the merged independent variable and the target variable remains exactly the same after the merge if the merged independent classes have the same conditional probabilities. Thus, we believe that the solution to the dilemma of increasing association and decreasing reliability along the feature selection process is to merge categories with similar conditional probabilities before adding new variables.
This article is organized as follows. Section 2 presents the definitions of the association and reliability measures; section 3 discusses how and why the independent classes are merged; two supporting experiments are analyzed in section 4; the last section gives a brief summary and discusses future work.
2. Association, reliability and the comparison matrix Φ.
2.1. The association measures. Given a nominal categorical data set with one independent variable X and one dependent variable Y, the following two association measures are of interest in this article for addressing the prediction accuracy issue. Both measures were further discussed in [6]. The first one is a measure based on modal (or optimal) prediction, the Goodman-Kruskal λ (denoted as λ hereafter):
\[ \lambda_{Y|X} = \frac{\sum_x \max_y p(x, y) - \max_y p(y)}{1 - \max_y p(y)}. \]
Please note that p(·) is the probability of a statistical event. One can see that λ is the relative decrease in the rate of prediction errors as we go from predicting Y without X to predicting Y with X. The other association measure is the Goodman-Kruskal τ (denoted as τ hereafter). It is based on proportional prediction and defined as follows:
\[ \tau_{Y|X} = \frac{\sum_x \sum_y p(x, y)^2 / p(x) - \sum_y p(y)^2}{1 - \sum_y p(y)^2}. \]
τ calculates the relative decrease in the long-run proportion of prediction errors as we go from predicting Y without X to predicting Y with X. Both measures are used in many applications, including supervised discretization [11, 10]. Both can be used to measure prediction errors: the first aims to maximize point-to-point accuracy, while the second aims to keep the distribution of the predicted target variable the same as that of the real one.
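Under the standard definitions of the two Goodman-Kruskal measures, both can be computed directly from a joint probability table; a minimal NumPy sketch:

```python
import numpy as np

def goodman_kruskal_lambda(p):
    """Goodman-Kruskal lambda for a joint probability table p[x, y]
    (modal prediction): relative reduction in error when predicting
    Y with X rather than without X."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()                       # normalize to probabilities
    py = p.sum(axis=0)                    # marginal distribution of Y
    err_without = 1.0 - py.max()          # modal prediction without X
    err_with = 1.0 - p.max(axis=1).sum()  # modal prediction within each x
    return (err_without - err_with) / err_without

def goodman_kruskal_tau(p):
    """Goodman-Kruskal tau for p[x, y] (proportional prediction)."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    px = p.sum(axis=1)
    py = p.sum(axis=0)
    err_without = 1.0 - np.sum(py ** 2)
    err_with = 1.0 - np.sum((p ** 2).sum(axis=1) / px)
    return (err_without - err_with) / err_without
```

Both measures are 0 when X and Y are independent and 1 when X predicts Y perfectly.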
2.2. Reliability measure. Reliability, which goes by "precision" in some publications, may be ambiguous in certain cases [17]. In our context, however, it refers to how likely a probability model built on a given nominal categorical data set is to fail in predicting the unknowns; hence the number of classes of the independent variable, or its expected value, approximately indicates the model's reliability. The expected number of classes in a variable X is a variation of the well-known Gini index, defined as follows:
\[ Gini(X) = 1 - \sum_x p(x)^2. \]
Roughly speaking, the more independent classes a predictive model has, the less support each conditional probability has in a data set of limited size, and hence the less reliable the constructed model is. However, this measure does not adequately account for the target variable's distribution. We believe it is more appropriate to construct one that considers the concentration of the independent values in each dependent class. This leads to our proposed measure of the reliability of explanatory information, which is nothing but the average concentration of independent classes within each dependent class:
\[ E(Gini(X|Y)) = \sum_y p(y) \, Gini(X \mid Y = y) = \sum_y p(y) \Big( 1 - \sum_x p(x|y)^2 \Big). \]
It is easy to see that the value of E(Gini(X|Y)) lies within [0, 1]; it reaches the minimum if and only if there is one independent category in each dependent class, and it reaches the maximum if and only if all independent classes are equally distributed within each dependent class. Given Dmn(X) = {1, 2, ..., n_x}, we can further conclude that E(Gini(X|Y)) lies within [0, 1 − 1/n_x]; the smaller E(Gini(X|Y)) is, the more reliable the predictor information is.
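Reading E(Gini(X|Y)) as the p(y)-weighted average of the within-class Gini index, a small NumPy sketch makes the bounds above easy to check:

```python
import numpy as np

def e_gini_x_given_y(p):
    """E(Gini(X|Y)) for a joint probability table p[x, y]:
    the p(y)-weighted average of Gini(X | Y = y)."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    py = p.sum(axis=0)                    # marginal distribution of Y
    p_x_given_y = p / py                  # column y holds p(x | y)
    gini_per_y = 1.0 - np.sum(p_x_given_y ** 2, axis=0)
    return float(np.sum(py * gini_per_y))
```

With one independent category per dependent class the measure is 0 (maximum reliability); with n_x classes uniformly distributed within each dependent class it is 1 − 1/n_x.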
3. Comparison matrix Φ and the merging process. To decide which independent classes should be merged, a category-to-variable measure is required to estimate each independent class's overall predictive power for the target variable. This new measure over all element pairs in X, denoted Φ(Y|X), is a matrix whose (s, t)-entry, φ_st(Y|X), is the weighted difference between the conditional probabilities of Y given X = s and given X = t. This comparison matrix has the following properties: (1) the diagonal entries are zero; (2) the smaller φ_st(Y|X) is, the more similar the two conditional distributions are; when φ_st(Y|X) = 0, X = s and X = t have exactly the same conditional distributions.
Thus, the comparison matrix Φ can be applied to multi-dimensional or even high-dimensional models.
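A sketch of the comparison matrix follows; the exact weighting in the paper's definition is not reproduced here, so this version uses the Y-marginal p(y) as the weight, which is an assumption of this sketch:

```python
import numpy as np

def phi_matrix(p):
    """Pairwise comparison matrix for a joint probability table p[x, y].
    Entry (s, t) is a weighted difference between the conditional
    distributions p(y | X = s) and p(y | X = t); the p(y) weighting
    here is an assumption, standing in for the paper's definition."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0)
    cond = p / px                          # row x holds p(y | X = x)
    diff = np.abs(cond[:, None, :] - cond[None, :, :])
    return np.sum(py * diff, axis=2)       # weighted L1 distance per pair
```

Whatever the weighting, the stated properties hold: the diagonal is zero, and the entry vanishes exactly when the two conditional distributions coincide.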

3.1. Merging the classes and enhancing the measures of association. The proportional prediction based association measure τ can also be rewritten in terms of a quantity ω_{Y|X}; ω_{Y|X} is equivalent to τ when it comes to evaluating the associations in a given data set before and after the independent classes are merged.
We also have the following theorem to explain why merging the nominal classes works.
Theorem 3.1. If the conditional probabilities of X = s and X = t are all equal, i.e., \( \rho_{sy}/\rho_{s\cdot} = \rho_{ty}/\rho_{t\cdot} = a_y \) for y = 1, 2, ..., n_y, then merging the classes X = s and X = t, and labelling the merged variable X′, gives us \( \tau(Y|X') = \tau(Y|X) \).

Proof. Let m denote the merged class of s and t, so that \( \rho_{m\cdot} = \rho_{s\cdot} + \rho_{t\cdot} \) and \( \rho_{my} = \rho_{sy} + \rho_{ty} = a_y(\rho_{s\cdot} + \rho_{t\cdot}) \). Because \( \rho_{sy} = a_y \rho_{s\cdot} \) and \( \rho_{ty} = a_y \rho_{t\cdot} \), for every y we have
\[ \frac{\rho_{my}^2}{\rho_{m\cdot}} = a_y^2 (\rho_{s\cdot} + \rho_{t\cdot}) = \frac{\rho_{sy}^2}{\rho_{s\cdot}} + \frac{\rho_{ty}^2}{\rho_{t\cdot}}, \]
so that
\[ \sum_{x \neq s,t} \sum_y \frac{\rho_{xy}^2}{\rho_{x\cdot}} + \sum_y \frac{\rho_{my}^2}{\rho_{m\cdot}} = \sum_x \sum_y \frac{\rho_{xy}^2}{\rho_{x\cdot}}, \]
where X′ represents the variable X with the classes X = s and X = t merged. Since the marginal distribution of Y is unchanged by the merge, \( \tau(Y|X') = \tau(Y|X) \) when the conditional probabilities are the same for X = s and X = t.
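The theorem can also be checked numerically. A sketch assuming the standard Goodman-Kruskal τ formula, with a toy table in which rows 0 and 1 share the same conditional distribution p(y|x) = (0.25, 0.75):

```python
import numpy as np

def tau(p):
    """Goodman-Kruskal tau for a joint probability table p[x, y]."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    px = p.sum(axis=1)
    py = p.sum(axis=0)
    num = np.sum((p ** 2).sum(axis=1) / px) - np.sum(py ** 2)
    return num / (1.0 - np.sum(py ** 2))

def merge_rows(p, s, t):
    """Merge independent classes s and t of p[x, y] into a single row."""
    p = np.asarray(p, dtype=float)
    keep = [i for i in range(p.shape[0]) if i not in (s, t)]
    return np.vstack([p[keep], p[s] + p[t]])

# Rows 0 and 1 have proportional rows, i.e. identical p(y|x).
p = np.array([[0.05, 0.15],
              [0.10, 0.30],
              [0.30, 0.10]])
```

Merging rows 0 and 1 leaves τ exactly unchanged, as the theorem asserts.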
On the other hand, when the conditional probabilities of X = s and X = t are very similar to each other, i.e., φ_st(Y|X) is very small, τ(Y|X′) should be very close to τ(Y|X). It is then practically quite possible, in a typical high-dimensional data set, to find another variable Z, or a merged Z, such that τ(Y|X′, Z) > τ(Y|X), since it is almost certain that the added variable Z satisfies τ(Y|X, Z) > τ(Y|X, S), where S is any other predictor besides X and Z. Meanwhile, a smaller E(Gini(X|Y)) ensures better reliability. Thus, ideally, the added variable Z also satisfies E(Gini(X′, Z|Y)) ≤ E(Gini(X|Y)). In the next section, we show two examples supporting these statements. Please note that the nominal classes to be merged need not have exactly the same conditional probabilities, only sufficiently close ones. τ_{Y|X}, λ_{Y|X} and E(Gini(X|Y)) are all investigated to evaluate the goodness of the merge.

4. Experiments. Both experiments use the 1996 Survey of Family Expenditure administered by Statistics Canada [16]. It has 10,417 rows with over 200 continuous and categorical variables, of which we use only a few as supporting evidence.

4.1. Occupation, sex, age group and education. The first result shows how the reliability and the degree of association change when Sex is added to Age group, with Occupation as the target variable. The result briefly demonstrates how a regular feature selection process without merging works; it will also serve as the baseline for evaluating performance after the merge. As discussed above, the added variable Sex increases the association, measured by τ or λ, but reduces the reliability.
Knowing that Age group has 13 categories and that E(Gini(X|Y)) is 0.8773, we choose φ_st(Y|X) ≤ 0.003 as the criterion to merge class 2 to class 7 and class 11 to class 13. Treating the merged Age group, denoted Age group′, and Sex as a single variable, we can merge again using φ_ijkl(Y|X₁, X₂) and the same threshold. Table 2 shows the computation result. Table 2 tells us that the merged Age group, combined with Sex, is better in reliability, given the smaller E(Gini(X|Y)), but worse in association in both τ and λ, compared with the unmerged result in Table 1. However, if we merge the merged variable again and then add Education to the variable list, we obtain both better association AND better reliability, which was impossible in the old feature selection process without merging.
It is clear that the merging threshold determines how many classes are merged and therefore affects the quality of the result. Table 3 shows some simple analysis.

Table 3. Comparing different merging thresholds: Occupation

As Table 3 suggests, the bigger the merging threshold, the more classes are merged, hence the higher the reliability and the lower the association. One can tune this parameter to achieve the desired result given certain trade-off considerations. The thresholds chosen in this article come from practical considerations rather than theoretical optima.
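The threshold trade-off can be sketched with a toy greedy procedure. The pairwise distance used here (a p(y)-weighted L1 difference of conditional distributions) is an assumption standing in for the paper's φ_st(Y|X):

```python
import numpy as np

def e_gini(p):
    """E(Gini(X|Y)): p(y)-weighted average of the within-class Gini."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    py = p.sum(axis=0)
    return float(np.sum(py * (1.0 - np.sum((p / py) ** 2, axis=0))))

def merge_below_threshold(p, threshold):
    """Greedily merge pairs of X-classes whose conditional distributions
    differ by at most the threshold (p(y)-weighted L1, an assumption of
    this sketch); larger thresholds merge more classes."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    merged = True
    while merged and p.shape[0] > 1:
        merged = False
        py = p.sum(axis=0)
        cond = p / p.sum(axis=1, keepdims=True)   # rows are p(y | x)
        for s in range(p.shape[0]):
            for t in range(s + 1, p.shape[0]):
                if np.sum(py * np.abs(cond[s] - cond[t])) <= threshold:
                    rows = [i for i in range(p.shape[0]) if i not in (s, t)]
                    p = np.vstack([p[rows], p[s] + p[t]])
                    merged = True
                    break
            if merged:
                break
    return p
```

A larger threshold merges more classes, and any merge can only decrease E(Gini(X|Y)), i.e., improve reliability.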

4.2. House type, rooms, bedroom and tenure. Following steps similar to those in the previous section, with a different variable set, we consider House type as the target variable and investigate the effect of merging. Please note that the threshold is still φ_st(Y|X) ≤ 0.003. Table 4 also shows an example of not only better reliability but also higher association after two merged variables are combined.

5. Conclusion. Based on the theory of association measures and the Gini coefficient, we take E(Gini(X|Y)) as a measure of statistical reliability. A category-to-variable comparison matrix Φ(Y|X) is proposed to represent the conditional probability differences between the explanatory variable's classes. We are going to implement both E(Gini(X|Y)) and Φ(Y|X) in an improved feature selection process in the future. Generally, this improved process will merge classes in the candidate variables before adding one of them to the selected independent variable list. By doing so, the selected features keep reliability high while the association increases step by step, as shown in the above experiments.