# American Institute of Mathematical Sciences

doi: 10.3934/bdia.2017020

## A category-based probabilistic approach to feature selection

1. School of Mathematics and Information Sciences, Guangzhou University, Guangzhou 510006, China
2. Clearpier Inc., 1300-121 Richmond St. W., Toronto, Ontario M5H 2K1, Canada

Published August 2018

A high-dimensional, large-sample categorical data set with a response variable may have many noninformative or redundant categories in its explanatory variables. Identifying and removing these categories usually not only improves the association with the response but also gives rise to significantly higher statistical reliability of the selected features. A category-based probabilistic approach is proposed to achieve this goal. Supportive experiments are presented.

Citation: Jianguo Dai, Wenxue Huang, Yuanyi Pan. A category-based probabilistic approach to feature selection. Big Data & Information Analytics, doi: 10.3934/bdia.2017020

Feature selection by the original variables

| Original Features | $\lvert\mbox{Domain}(X_2, Y)\rvert$ | $\tau(Y \mid X_2)$ | $\lambda(Y \mid X_2)$ | $EG$ |
|---|---|---|---|---|
| 1 | 18 | 0.9429 | 0.9693 | 0.4797 |
| 2 | 46 | 0.9782 | 0.9877 | 0.7718 |
| 3 | 108 | 0.9907 | 0.9939 | 0.9076 |
| 4 | 192 | 1 | 1 | 0.9490 |

Feature selection by the dummy variables

| Merged Features | $\lvert\mbox{Domain}(X'_2, Y)\rvert$ | $\tau(Y \mid X'_2)$ | $\lambda(Y \mid X'_2)$ | $EG$ |
|---|---|---|---|---|
| 4 | 16 | 0.9445 | 0.9693 | 0.2098 |
| 4 | 24 | 0.9908 | 0.9939 | 0.2143 |
| 5 | 30 | 0.9962 | 0.9979 | 0.4669 |
| 6 | 38 | 1 | 1 | 0.6638 |

Feature selection by the original variables

| Original Features | $\lvert\mbox{Domain}(X_2, Y)\rvert$ | $\tau(Y \mid X_2)$ | $\lambda(Y \mid X_2)$ | $EG$ |
|---|---|---|---|---|
| 1 | 66 | 0.3005 | 0.3444 | 0.8201 |
| 2 | 252 | 0.3948 | 0.4391 | 0.9046 |
| 3 | 1830 | 0.4383 | 0.4648 | 0.9833 |

Feature selection by the dummy variables

| Merged Features | $\lvert\mbox{Domain}(X'_2, Y)\rvert$ | $\tau(Y \mid X'_2)$ | $\lambda(Y \mid X'_2)$ | $EG$ |
|---|---|---|---|---|
| 2 | 24 | 0.3242 | 0.3934 | 0.5491 |
| 2 | 36 | 0.3573 | 0.4165 | 0.6242 |
| 2 | 48 | 0.3751 | 0.4234 | 0.6388 |
| 3 | 96 | 0.3901 | 0.4234 | 0.7035 |
| 4 | 186 | 0.4017 | 0.4269 | 0.7774 |
| 4 | 282 | 0.4121 | 0.4317 | 0.8066 |
| 5 | 558 | 0.4221 | 0.4548 | 0.8782 |
| 6 | 966 | 0.4314 | 0.4768 | 0.8968 |
| 7 | 1716 | 0.4436 | 0.4856 | 0.9135 |

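
The $\tau$ and $\lambda$ columns in the tables above are not defined on this page. Reading them as the Goodman-Kruskal tau and lambda association measures of a response $Y$ given explanatory variables $X$ is an assumption made here, not something the page confirms. Under that assumption, the following minimal Python sketch shows how both measures could be computed from paired categorical observations. The function names `gk_tau` and `gk_lambda` and the toy data in the usage note are hypothetical, and the $EG$ column is not covered.

```python
from collections import Counter


def gk_tau(pairs):
    """Goodman-Kruskal tau(Y|X) from a list of (x, y) observations.

    Assumed reading of the tau column; undefined (division by zero)
    when Y takes a single value.
    """
    n = len(pairs)
    joint = Counter(pairs)                  # counts of each (x, y) cell
    px = Counter(x for x, _ in pairs)       # marginal counts of x
    py = Counter(y for _, y in pairs)       # marginal counts of y
    # sum over y of p(y)^2
    sum_py2 = sum((c / n) ** 2 for c in py.values())
    # sum over (x, y) of p(x, y)^2 / p(x)
    cond = sum((c / n) ** 2 / (px[x] / n) for (x, _), c in joint.items())
    return (cond - sum_py2) / (1 - sum_py2)


def gk_lambda(pairs):
    """Goodman-Kruskal lambda(Y|X) from a list of (x, y) observations.

    Assumed reading of the lambda column; undefined when Y takes
    a single value.
    """
    n = len(pairs)
    joint = Counter(pairs)
    py = Counter(y for _, y in pairs)
    max_py = max(py.values()) / n           # probability of the modal y
    best = {}                               # largest cell count for each x
    for (x, _), c in joint.items():
        best[x] = max(best.get(x, 0), c)
    sum_max = sum(best.values()) / n        # sum over x of max_y p(x, y)
    return (sum_max - max_py) / (1 - max_py)
```

For example, with the toy data `[("a", 0), ("a", 0), ("b", 1), ("b", 1)]`, where $X$ fully determines $Y$, both functions return 1; with `[("a", 0), ("a", 1), ("b", 0), ("b", 1)]`, where $X$ carries no information about $Y$, both return 0.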