# American Institute of Mathematical Sciences

October  2016, 1(4): 341-347. doi: 10.3934/bdia.2016014

## Increase statistical reliability without losing predictive power by merging classes and adding variables

1 School of Mathematics and Information Sciences, Guangzhou University, Guangzhou, 510006, China

2 Clearpier Inc., 1300-121 Richmond St. W., Toronto, Ontario, Canada M5H 2K1

* Corresponding authors: Wenxue Huang and Xiaofeng Li

Revised  April 2017 Published  April 2017

Adding explanatory variables to a probability model usually increases the degree of association but risks losing statistical reliability. In this article, we propose an approach that merges classes within the categorical explanatory variables before each addition, so as to preserve statistical reliability while increasing predictive power step by step.
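To make the idea of merging classes concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the $\lambda^{(Y|X)}$ and $\tau^{(Y|X)}$ reported for this article are the standard Goodman-Kruskal lambda and tau computed from a contingency table. The function names (`gk_lambda`, `gk_tau`, `merge_rows`), the choice of which classes to merge, and the toy counts are all hypothetical; the article's own merging criterion and its $E(\mbox{Gini}(X|Y))$ measure are not reproduced here.

```python
from typing import List

# Contingency table: rows are classes of X, columns are classes of Y.
Table = List[List[float]]


def gk_lambda(table: Table) -> float:
    """Goodman-Kruskal lambda(Y|X): relative reduction in modal-guess errors."""
    n = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]
    best_without_x = max(col_totals)              # hits when always guessing the modal Y
    best_with_x = sum(max(row) for row in table)  # hits when guessing the modal Y per X class
    if n == best_without_x:                       # Y is constant: nothing left to explain
        return 0.0
    return (best_with_x - best_without_x) / (n - best_without_x)


def gk_tau(table: Table) -> float:
    """Goodman-Kruskal tau(Y|X): relative reduction in proportional-guess errors."""
    n = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]
    marginal = sum((c / n) ** 2 for c in col_totals)
    conditional = sum(
        sum(v * v for v in row) / sum(row) for row in table if sum(row) > 0
    ) / n
    if marginal == 1.0:                           # Y is constant
        return 0.0
    return (conditional - marginal) / (1.0 - marginal)


def merge_rows(table: Table, i: int, j: int) -> Table:
    """Merge classes i and j of X by summing their counts (hypothetical helper)."""
    merged = [a + b for a, b in zip(table[i], table[j])]
    return [row for k, row in enumerate(table) if k not in (i, j)] + [merged]


# Toy counts (made up): the third X class is sparse, so it is merged into the second.
original = [[30, 10], [10, 30], [2, 1]]
merged = merge_rows(original, 1, 2)
print("tau before merging:", gk_tau(original))
print("tau after merging: ", gk_tau(merged))
```

Merging classes of $X$ can never increase $\tau^{(Y|X)}$, so in this sketch the merged table gives up a little association in exchange for fewer, better-populated classes, which is the trade-off the abstract describes.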

Citation: Wenxue Huang, Xiaofeng Li, Yuanyi Pan. Increase statistical reliability without losing predictive power by merging classes and adding variables. Big Data & Information Analytics, 2016, 1 (4) : 341-347. doi: 10.3934/bdia.2016014

Feature selection with merging: Occupation

| $X$ | $\tau_b^{(Y|X)}$ | $\lambda^{(Y|X)}$ | $E(\mbox{Gini}(X|Y))$ |
| --- | --- | --- | --- |
| Age group' + Sex | 0.1484 | 0.0375 | 0.6688 |
| (Age group' + Sex)' + Education' | 0.1542 | 0.0447 | 0.6620 |

Feature selection without merging: Occupation

| $X$ | $\tau^{Y|X}$ | $\lambda^{Y|X}$ | $E(\mbox{Gini}(X|Y))$ |
| --- | --- | --- | --- |
| Age group | 0.1344 | 0.0311 | 0.8773 |
| Age group + Sex | 0.1511 | 0.0476 | 0.9228 |

Compare different merging threshold: Occupation

| $X$ | $\phi^{st}(Y|X)$ | $\lambda^{(Y|X)}$ | $\tau^{(Y|X)}$ | $E(Gini(X,Y))$ |
| --- | --- | --- | --- | --- |
| Age group | - | 0.0311 | 0.1344 | 0.8773 |
| Age group' + Sex | 0.0005 | 0.0414 | 0.1493 | 0.9222 |
| Age group' + Sex | 0.0030 | 0.0375 | 0.1484 | 0.6688 |
| Age group' + Sex | 0.0100 | 0.0000 | 0.0209 | 0.2710 |

Compare different merging threshold

| $X$ | $\lambda^{(Y|X)}$ | $\tau^{(Y|X)}$ | $E(\mbox{Gini}(X|Y))$ |
| --- | --- | --- | --- |
| Rooms | 0.3443598 | 0.3004656 | 0.8200656 |
| Rooms' + Tenure' | 0.4255117 | 0.3583277 | 0.7911177 |
| (Rooms' + Tenure')' + bedroom' | 0.4381247 | 0.3901767 | 0.7165204 |
