October  2016, 1(4): 341-347. doi: 10.3934/bdia.2016014

Increase statistical reliability without losing predictive power by merging classes and adding variables

1. School of Mathematics and Information Sciences, Guangzhou University, Guangzhou, 510006, China

2. Clearpier Inc., 1300-121 Richmond St. W., Toronto, Ontario M5H 2K1, Canada

* Corresponding authors: Wenxue Huang and Xiaofeng Li

Revised  April 2017 Published  April 2017

Adding explanatory variables to a probability model usually increases the degree of association but risks losing statistical reliability. In this article, we propose an approach that merges classes within the categorical explanatory variables before each addition, so as to preserve statistical reliability while increasing predictive power step by step.

Citation: Wenxue Huang, Xiaofeng Li, Yuanyi Pan. Increase statistical reliability without losing predictive power by merging classes and adding variables. Big Data & Information Analytics, 2016, 1 (4) : 341-347. doi: 10.3934/bdia.2016014
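
For reference, the association measures reported in the tables of this article can be written out as follows. This is a sketch under stated assumptions: $p_{ij}$ denotes the joint probability of $X = i$ and $Y = j$, with marginals $p_{i\cdot}$ and $p_{\cdot j}$ (notation introduced here, not taken from the article). The first two formulas are the standard Goodman-Kruskal $\tau$ and $\lambda$; the third is an assumed reading of $E(\mbox{Gini}(X|Y))$ as the expected conditional Gini index of $X$ given $Y$, which the article itself may define differently.

$$
\tau^{(Y \mid X)} = \frac{\sum_i \sum_j p_{ij}^2 / p_{i\cdot} - \sum_j p_{\cdot j}^2}{1 - \sum_j p_{\cdot j}^2},
\qquad
\lambda^{(Y \mid X)} = \frac{\sum_i \max_j p_{ij} - \max_j p_{\cdot j}}{1 - \max_j p_{\cdot j}},
$$

$$
E(\mbox{Gini}(X \mid Y)) = 1 - \sum_j \sum_i \frac{p_{ij}^2}{p_{\cdot j}}.
$$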

Table 1.  Feature selection without merging: Occupation

| $X$ | $\tau^{(Y \mid X)}$ | $\lambda^{(Y \mid X)}$ | $E(\mbox{Gini}(X \mid Y))$ |
|---|---|---|---|
| Age group | 0.1344 | 0.0311 | 0.8773 |
| Age group + Sex | 0.1511 | 0.0476 | 0.9228 |

Table 2.  Feature selection with merging: Occupation

| $X$ | $\tau_b^{(Y \mid X)}$ | $\lambda^{(Y \mid X)}$ | $E(\mbox{Gini}(X \mid Y))$ |
|---|---|---|---|
| Age group' + Sex | 0.1484 | 0.0375 | 0.6688 |
| (Age group' + Sex)' + Education' | 0.1542 | 0.0447 | 0.6620 |

Table 3.  Compare different merging thresholds: Occupation

| $X$ | $\phi^{st}(Y \mid X)$ | $\lambda^{(Y \mid X)}$ | $\tau^{(Y \mid X)}$ | $E(\mbox{Gini}(X,Y))$ |
|---|---|---|---|---|
| Age group | - | 0.0311 | 0.1344 | 0.8773 |
| Age group' + Sex | 0.0005 | 0.0414 | 0.1493 | 0.9222 |
| Age group' + Sex | 0.0030 | 0.0375 | 0.1484 | 0.6688 |
| Age group' + Sex | 0.0100 | 0.0000 | 0.0209 | 0.2710 |

Table 4.  Compare different merging thresholds

| $X$ | $\lambda^{(Y \mid X)}$ | $\tau^{(Y \mid X)}$ | $E(\mbox{Gini}(X \mid Y))$ |
|---|---|---|---|
| Rooms | 0.3443598 | 0.3004656 | 0.8200656 |
| Rooms' + Tenure' | 0.4255117 | 0.3583277 | 0.7911177 |
| (Rooms' + Tenure')' + bedroom' | 0.4381247 | 0.3901767 | 0.7165204 |
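
The quantities in the tables can be illustrated with a minimal Python sketch. It assumes the standard Goodman-Kruskal definitions of $\tau$ and $\lambda$ and reads $E(\mbox{Gini}(X|Y))$ as the expected conditional Gini index of $X$ given $Y$. The function names, the toy contingency table, and the choice of which classes to merge are all hypothetical. In particular, the article's threshold-based merging rule is not reproduced here: the groups are picked by hand.

```python
import numpy as np

def _joint(counts):
    """Turn a contingency table (rows = X classes, cols = Y classes) into joint probabilities."""
    p = np.asarray(counts, dtype=float)
    return p / p.sum()

def gk_tau(counts):
    """Goodman-Kruskal tau for predicting Y from X (standard definition)."""
    p = _joint(counts)
    px, py = p.sum(axis=1), p.sum(axis=0)
    rows = px > 0                      # ignore empty X classes
    explained = (p[rows] ** 2 / px[rows, None]).sum()
    baseline = (py ** 2).sum()
    return (explained - baseline) / (1.0 - baseline)

def gk_lambda(counts):
    """Goodman-Kruskal lambda for predicting Y from X (standard definition)."""
    p = _joint(counts)
    py_max = p.sum(axis=0).max()
    return (p.max(axis=1).sum() - py_max) / (1.0 - py_max)

def expected_gini_x_given_y(counts):
    """Assumed reading of E(Gini(X|Y)): 1 - sum_ij p_ij^2 / p_.j"""
    p = _joint(counts)
    py = p.sum(axis=0)
    cols = py > 0                      # ignore empty Y classes
    return 1.0 - (p[:, cols] ** 2 / py[cols]).sum()

def merge_rows(counts, groups):
    """Merge X classes; `groups` is a list of lists of row indices (hypothetical helper)."""
    c = np.asarray(counts, dtype=float)
    return np.vstack([c[g].sum(axis=0) for g in groups])

# Toy table (made-up counts): 4 classes of X, 2 classes of Y.
counts = np.array([[30, 10],
                   [28, 12],
                   [ 5, 35],
                   [10, 30]])

# Hand-picked merge: X classes with similar conditional distributions of Y.
merged = merge_rows(counts, [[0, 1], [2, 3]])

for name, table in [("original", counts), ("merged", merged)]:
    print(name,
          round(gk_tau(table), 4),
          round(gk_lambda(table), 4),
          round(expected_gini_x_given_y(table), 4))
```

On this toy table, merging the two pairs of similar classes lowers $\tau$ only slightly and leaves $\lambda$ unchanged, while the assumed $E(\mbox{Gini}(X|Y))$ drops substantially. This matches the qualitative pattern between Table 1 and Table 2, but the numbers themselves are illustrative only.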