PROPORTIONAL ASSOCIATION BASED ROI MODEL

Abstract. Based on a local-to-global proportional association measure proposed by Huang, Shi and Wang [9], and with cost and revenue information known, an association measure is proposed to maximize the expected RoI. A descriptive experiment with a synthetic data set is presented.


1. Introduction. Multi-nominal data are common in scientific and engineering research, such as biomedical research, customer behavior analysis, network analysis, search engine marketing optimization, and web mining. When the response variable has more than two levels, the principle of mode-based or distribution-based proportional prediction can be used to construct nonparametric nominal association measures. For example, Goodman and Kruskal [3,4] and others proposed local-to-global association measures towards optimal predictions. Both Monte Carlo and discrete Markov chain methods are conceptually based on proportional associations. The association matrix, association vector and association measure of [9] were likewise built on the idea of proportional association. If there is no ordering to the response variable's categories, or the ordering is not of interest, the categories are regarded as nominal in the proportional prediction model and in the related association statistics.
In reality, however, different categories of the same response variable often carry different values, sometimes very different ones. When selecting a model or selecting explanatory variables, we want to choose the ones that can enhance the total revenue, not just the accuracy rate. Similarly, when the explanatory variables carry a cost weight vector, those costs should be considered in the model too. The association measure in [9], ω_{Y|X}, considers neither the revenue weight vector of the response variable nor the cost weights of the explanatory variables, which may lead to less total profit. Thus certain adjustments must be made for better decision making.
To implement these adjustments, we need the following assumptions:
• X and Y are both multi-categorical variables, where X is the explanatory variable with domain {1, 2, ..., α} and Y is the response variable with domain {1, 2, ..., β};
• the amount of data collected in this article is large enough to represent the real distribution;
• the model in this article is mainly based on proportional prediction;
• the relationship between X and Y is asymmetric.
It needs to be addressed that the second assumption may not always hold. The law of large numbers suggests that the larger the sample size, the closer the empirical distribution is to the real one. How large a sample must be to represent the real distribution has been studied for a long time, but it is not the subject of this article. The purpose of this assumption is simply to avoid a more complicated discussion.
The article is organized as follows. Section 2 discusses the adjustment to the association measure when the response variable has a revenue weight vector; section 3 considers the case where both the explanatory and the response variables have weights; section 4 presents how the adjusted measure changes the existing feature selection framework. Conclusions and future work are briefly discussed in the last section.

2. Response variable with revenue weight vector. Let us first recall the association matrix {γ_{s,t}(Y|X)} and the association measures ω_{Y|X} and τ_{Y|X} from [9]. Writing p_{is} = p(X = i, Y = s), p_{i·} = p(X = i) and p_{·s} = p(Y = s),

γ_{s,t}(Y|X) = Σ_{i=1}^{α} p(X = i | Y = s) p(Y = t | X = i) = Σ_{i=1}^{α} p_{is} p_{it} / (p_{i·} p_{·s})   (1)

is the (s,t)-entry of the association matrix γ(Y|X), representing the probability of predicting Y = t while the true value is in fact Y = s. Given a representative training set, the diagonal entries γ_{ss} are the expected accuracy rates, while the off-diagonal entries of each row are the expected first-type error rates. ω_{Y|X} = Σ_s p_{·s} γ_{ss} is the association measure from the explanatory variable X to the response variable Y without standardization. Further discussion of these metrics can be found in [9].
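The quantities above can be computed directly from a contingency table. The following is a minimal sketch (not from [9]; the function names are ours), assuming a table of raw counts with rows indexed by X and columns by Y, and all marginals positive:

```python
import numpy as np

def association_matrix(counts):
    """gamma[s, t]: probability of predicting Y = t under proportional
    prediction when the true class is Y = s (rows of `counts` are X levels,
    columns are Y levels; all marginals are assumed positive)."""
    p = counts / counts.sum()              # joint distribution p_{is}
    px = p.sum(axis=1, keepdims=True)      # p_{i.}
    py = p.sum(axis=0)                     # p_{.s}
    p_x_given_y = p / py                   # column-normalised: p(X=i | Y=s)
    p_y_given_x = p / px                   # row-normalised:    p(Y=t | X=i)
    # gamma_{s,t} = sum_i p(X=i | Y=s) * p(Y=t | X=i)
    return p_x_given_y.T @ p_y_given_x

def omega(counts):
    """omega_{Y|X} = sum_s p_{.s} gamma_{ss} = sum_{i,s} p_{is}^2 / p_{i.}"""
    p = counts / counts.sum()
    px = p.sum(axis=1, keepdims=True)
    return float((p ** 2 / px).sum())
```

For a perfectly diagnostic table the association matrix is the identity and ω_{Y|X} = 1; for an independent table, ω_{Y|X} collapses to Σ_s p_{·s}².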
Our discussion begins with one response variable carrying a revenue weight vector and one explanatory variable without cost weights. Let R = (r_1, r_2, ..., r_β) be the revenue weight vector, where r_s is the revenue realized when Y = s is correctly predicted. The ideal model in practice is then the one with the highest total revenue, not merely the highest accuracy. This leads to the extended form of ω_{Y|X} with weights in Y, as in (2):

ω̂_{Y|X} = Σ_{s=1}^{β} r_s Σ_{i=1}^{α} p_{is}² / p_{i·}.   (2)

Please note that ω_{Y|X} is equivalent to τ_{Y|X} for given X and Y in a given data set; thus the statistics of τ_{Y|X} will not be discussed further in this article.
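Equation (2) only reweights the per-class terms of ω_{Y|X}, so the implementation is a one-line change. A sketch under the same assumptions as before (counts with rows = X, columns = Y; `omega_hat` is our name, not the paper's):

```python
import numpy as np

def omega_hat(counts, revenue):
    """Revenue-weighted association, eq. (2):
    sum_s r_s * sum_i p_{is}^2 / p_{i.} -- the expected per-sample revenue
    of proportional prediction. `revenue` is the vector (r_1, ..., r_beta)."""
    p = counts / counts.sum()
    px = p.sum(axis=1, keepdims=True)
    return float(((p ** 2 / px) * np.asarray(revenue, float)).sum())
```

With all revenues equal to 1, ω̂_{Y|X} reduces to the unweighted ω_{Y|X}.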
It is easy to see that ω̂_{Y|X} is the expected per-sample revenue from correctly predicting Y. Therefore the explanatory variable with the larger ω̂_{Y|X} is the one to prefer when revenue matters.

Example. Consider simulated data motivated by a real situation. Suppose Y is the response variable indicating the computer brand a customer bought; X_1, one explanatory variable, records the customer's career, and X_2, another explanatory variable, records the customer's age group. We want to find the explanatory variable that generates the higher revenue by correctly predicting the purchased computer's brand. We further assume that X_1 and X_2 each contain 5 categories, Y has 4 brands, and the total number of rows is 9150. The contingency table is presented in Table 1. Let us first consider the association matrix γ(Y|X). Predicting Y with the information of X_1, or X_2, is described by the association matrix γ(Y|X_1), or γ(Y|X_2), as in Table 2.
Please note that Y contains the true values while Ŷ is the predicted one. One can see from this table that the accuracy rates of predicting y_1 and y_2 by X_1 (on the left) are larger than those by X_2 (on the right); the cases of y_3 and y_4, on the other hand, are the opposite.
The correct-prediction contingency tables of X_1 and Y, denoted W_1, and of X_2 and Y, denoted W_2, can be simulated through Monte Carlo simulation as in Table 3. The total number of correct predictions by X_1 is 3142, while it is 3092 by X_2, meaning the model with X_1 is better than that with X_2 in terms of prediction accuracy. But this may not be the case if the target classes carry different revenues. Assuming the revenue weight vector of Y is R = (1, 1, 2, 2), we have the association measures ω_{Y|X} and ω̂_{Y|X} as in Table 4. Given that revenue = Σ_{i,s} W^k_{i,s} r_s, i = 1, 2, ..., α, s = 1, 2, ..., β, k = 1, 2, the revenue for W_1 is 4313 and that for W_2 is 5178. Dividing each by the total sample size, 9150, we obtain 0.4714 and 0.5659 respectively. These agree closely with ω̂_{Y|X_1} and ω̂_{Y|X_2} above, confirming that ω̂_{Y|X} is indeed the expected revenue.
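The Monte Carlo check described above can be sketched as follows, under our own assumptions (the paper's Tables 1-4 are not reproduced here, so a small illustrative table is used instead): draw (X, Y) pairs from the empirical joint distribution, predict Ŷ proportionally from p(Y | X), and collect revenue r_s whenever Ŷ = Y = s. The per-draw average should approach ω̂_{Y|X}:

```python
import numpy as np

def simulate_revenue(counts, revenue, n_draws=200_000, seed=0):
    """Monte Carlo estimate of the expected per-sample revenue of
    proportional prediction; should converge to omega_hat = eq. (2)."""
    rng = np.random.default_rng(seed)
    r = np.asarray(revenue, float)
    p = counts / counts.sum()
    a, b = p.shape
    p_y_given_x = p / p.sum(axis=1, keepdims=True)
    # draw (X, Y) pairs from the joint distribution (row-major flattening)
    flat = rng.choice(a * b, size=n_draws, p=p.ravel())
    xs, ys = np.divmod(flat, b)
    total = 0.0
    for i in range(a):                       # predict per X-level
        mask = xs == i
        preds = rng.choice(b, size=int(mask.sum()), p=p_y_given_x[i])
        hits = preds == ys[mask]
        total += r @ np.bincount(ys[mask][hits], minlength=b)
    return total / n_draws
```

For the toy table below, the analytic value of ω̂_{Y|X} is 0.9375, and the simulated revenue per draw lands close to it, mirroring how 4313/9150 and 5178/9150 matched ω̂ in the example.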
In summary, it is possible for an explanatory variable X to have the larger ω̂_{Y|X} (i.e., the larger revenue) and yet the smaller ω_{Y|X} (i.e., the smaller association). When total revenue is of interest, it is this variable that should be selected, not the one with the larger association.
3. Explanatory variable with cost weight and response variable with revenue weight. Let us further discuss the case with a cost weight vector on the predictor in addition to the revenue weight vector on the dependent variable. The goal is to find the predictor with the larger total profit. We hence define the new association measure as in (3):

ω̃_{Y|X} = Σ_{i=1}^{α} Σ_{s=1}^{β} (r_s / c_i) p_{is}² / p_{i·}.   (3)
Here c_i indicates the cost weight of the ith category of the predictor, and r_s means the same as in the previous section. ω̃_{Y|X} is then the expected ratio of revenue to cost, namely the RoI; a larger ω̃_{Y|X} thus means a larger total profit, and the better variable to select is the one with the larger ω̃_{Y|X}. We can see that ω̃_{Y|X} is an asymmetric measure, meaning ω̃_{Y|X} ≠ ω̃_{X|Y}. When c_1 = c_2 = ... = c_α = 1, ω̃_{Y|X} reduces to ω̂_{Y|X}. Assuming cost weight vectors for X_1 and X_2 (with R = (1, 1, 2, 2) as before), we have the associations in Table 5. Given that profit = Σ_{i,s} W^k_{i,s} r_s / c_i, i = 1, 2, ..., α; s = 1, 2, ..., β and k = 1, 2, where W_k is the corresponding correct-prediction contingency table, the profit for X_1 is 12016.17 and that for X_2 is 17072.17. Dividing both by the total sample size, 9150, gives 1.3132 and 1.8658, similar to ω̃_{Y|X_1} and ω̃_{Y|X_2}. This indicates that ω̃_{Y|X} is the expected RoI. In this example, X_2 is the better variable given the cost and revenue vectors of interest.
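Equation (3) divides each X-category's contribution by its cost, so it is again a small change to the earlier computation. A sketch under the same conventions (rows = X, columns = Y; `omega_tilde` is our name):

```python
import numpy as np

def omega_tilde(counts, revenue, cost):
    """Cost- and revenue-weighted association, eq. (3):
    sum_{i,s} (r_s / c_i) * p_{is}^2 / p_{i.} -- the expected per-sample RoI.
    `revenue` has one entry per Y-category, `cost` one entry per X-category."""
    p = counts / counts.sum()
    px = p.sum(axis=1, keepdims=True)
    r = np.asarray(revenue, float)               # broadcast over columns
    c = np.asarray(cost, float).reshape(-1, 1)   # broadcast over rows
    return float(((p ** 2 / px) * r / c).sum())
```

With a unit cost vector it coincides with ω̂_{Y|X}; doubling every cost halves the measure, as expected for a revenue-to-cost ratio.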
We then change the revenue and cost weight vectors and recompute the associations, as in Table 6. Now ω̃_{Y|X_1} > ω̃_{Y|X_2}, contrary to the example with the old weight vectors. Thus the choice of weights is critical in determining the better variable with respect to total profit.
4. The impact on feature selection. Using the updated association defined in the previous section, we present in this section the feature selection result for a given data set S with explanatory categorical variables V_1, V_2, ..., V_n and a response variable Y. The feature selection steps can be found in [9].
First, consider a synthetic data set simulating the factors contributing to the sales of a certain commodity. In general, many factors can contribute differently to commodity sales: age, career, time, income, personal preference, credit, etc. Each factor can have a different cost vector, and each class within a variable can have a different cost as well. For example, collecting income information might be more difficult than learning a customer's career, and determining a dinner waitress's purchase preference is easier than determining that of a high-income lawyer. We therefore assume four potential predictors, V_1, V_2, V_3, V_4, within a data set of sample size 10000, and obtain a feature selection result by Monte Carlo simulation in Table 7. The first variable selected using ω_{Y|X} as the criterion, following [9], is V_1; but, as previously discussed, it is V_3 that should be selected if the total profit is of interest. Further, we assume that the two-variable combinations satisfy the numbers in Table 8, again by Monte Carlo simulation. As we can see, every ω_{Y|(X_1,X_2)} ≥ ω_{Y|X_1}, but this is not the case for ω̃_{Y|(X_1,X_2)}, since the cost grows with two variables and the profit therefore drops. As in the one-variable scenario, the better two-variable combination with respect to total profit is the one with the larger ω̃, not the larger ω.

In summary, the updated association with cost and revenue vectors not only changes the feature selection result through different profit expectations; it also reflects the practical reality that collecting information for more variables costs more and thus reduces the overall profit, meaning that more variables are not necessarily better on a Return-on-Investment basis.
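Since the numbers behind Tables 7 and 8 are not reproduced here, the selection step itself can be sketched with toy tables of our own: score each candidate predictor by ω̃_{Y|X} and pick the maximizer. All names and data below are illustrative, not the paper's.

```python
import numpy as np

def omega_tilde(counts, revenue, cost):
    # expected per-sample RoI: sum_{i,s} (r_s / c_i) p_{is}^2 / p_{i.}
    p = counts / counts.sum()
    px = p.sum(axis=1, keepdims=True)
    return float(((p ** 2 / px) * np.asarray(revenue, float)
                  / np.asarray(cost, float).reshape(-1, 1)).sum())

def select_feature(candidates, revenue):
    """Rank candidate predictors by omega_tilde and return the best one.
    `candidates` maps a variable name to a pair
    (contingency table of that variable vs. Y, per-category cost vector)."""
    scores = {name: omega_tilde(tab, revenue, cost)
              for name, (tab, cost) in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

A greedy extension to variable pairs would score cross-classified tables of (V_i, V_j) against Y the same way; as the section notes, the extra collection cost can make a pair score below its best single variable on an RoI basis.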

5. Conclusions and remarks. We propose a new metric, ω̃_{Y|X}, in this article to improve the proportional prediction based association measure ω_{Y|X} by incorporating the cost and revenue factors of categorical data. It provides a description of the global-to-global association with practical RoI concerns, especially when the response variable is multi-categorical.
The presented framework can also be applied to high-dimensional cases, as in national surveys, and to misclassification costs via the association matrix and association vector of [9]. It should be all the more helpful in identifying the quality of predictors for various response variables.
Given the distinct character of this new statistic, we believe it opens further opportunities for studying better decision making with categorical data. We are currently investigating the asymptotic properties of the proposed measures, which can also be extended to the symmetric situation. Of course, the synthetic nature of the experiments in this article raises the question of how the measures behave on a real data set or application. It is also arguable that the improvements introduced by the new measures could come from randomness; we plan to use k-fold cross-validation to better support our argument in future work.