# American Institute of Mathematical Sciences

doi: 10.3934/bdia.2017020

## A category-based probabilistic approach to feature selection

1. School of Mathematics and Information Sciences, Guangzhou University, Guangzhou 510006, China
2. Clearpier Inc., 1300-121 Richmond St. W., Toronto, Ontario M5H 2K1, Canada

Early access August 2018

A high-dimensional, large-sample categorical data set with a response variable may contain many noninformative or redundant categories in its explanatory variables. Identifying and removing these categories usually not only improves the association with the response but also yields significantly higher statistical reliability of the selected features. A category-based probabilistic approach is proposed to achieve this goal. Supporting experiments are presented.

Citation: Jianguo Dai, Wenxue Huang, Yuanyi Pan. A category-based probabilistic approach to feature selection. Big Data & Information Analytics, doi: 10.3934/bdia.2017020
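The selection criteria $\tau(Y\mid X)$ and $\lambda(Y\mid X)$ reported in the tables below appear, from the notation, to be the Goodman-Kruskal tau and lambda association measures; this is an assumption, and the paper defines its criteria precisely. Under that assumption, a minimal sketch of both measures for categorical data is:

```python
from collections import Counter

def goodman_kruskal_tau(xs, ys):
    """Proportional reduction in expected classification error for Y
    when predicting Y proportionally within each category of X."""
    n = len(ys)
    err_y = 1 - sum((c / n) ** 2 for c in Counter(ys).values())
    nx = Counter(xs)
    err_y_given_x = 1 - sum(
        c * c / (nx[x] * n) for (x, _), c in Counter(zip(xs, ys)).items()
    )
    return (err_y - err_y_given_x) / err_y

def goodman_kruskal_lambda(xs, ys):
    """Proportional reduction in error for the modal (optimal)
    prediction of Y, given the category of X."""
    n = len(ys)
    modal_y = max(Counter(ys).values())
    by_x = {}
    for x, y in zip(xs, ys):
        by_x.setdefault(x, Counter())[y] += 1
    modal_given_x = sum(max(c.values()) for c in by_x.values())
    return (modal_given_x - modal_y) / (n - modal_y)

# X determines Y exactly -> both measures equal 1
x = [0, 0, 1, 1]
print(goodman_kruskal_tau(x, ["a", "a", "b", "b"]))     # 1.0
print(goodman_kruskal_lambda(x, ["a", "a", "b", "b"]))  # 1.0
```

Both measures are 0 under independence and 1 when $X$ determines $Y$, which matches the pattern in the tables: the values rise toward 1 as features are added.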
Feature selection by the original variables

| Original Features | $\lvert\mbox{Domain}(X_2, Y)\rvert$ | $\tau(Y\mid X_2)$ | $\lambda(Y\mid X_2)$ | $EG$ |
|---|---|---|---|---|
| 1 | 18 | 0.9429 | 0.9693 | 0.4797 |
| 2 | 46 | 0.9782 | 0.9877 | 0.7718 |
| 3 | 108 | 0.9907 | 0.9939 | 0.9076 |
| 4 | 192 | 1 | 1 | 0.9490 |
Feature selection by the dummy variables

| Merged Features | $\lvert\mbox{Domain}(X'_2, Y)\rvert$ | $\tau(Y\mid X'_2)$ | $\lambda(Y\mid X'_2)$ | $EG$ |
|---|---|---|---|---|
| 4 | 16 | 0.9445 | 0.9693 | 0.2098 |
| 4 | 24 | 0.9908 | 0.9939 | 0.2143 |
| 5 | 30 | 0.9962 | 0.9979 | 0.4669 |
| 6 | 38 | 1 | 1 | 0.6638 |
Feature selection by the original variables

| Original Features | $\lvert\mbox{Domain}(X_2, Y)\rvert$ | $\tau(Y\mid X_2)$ | $\lambda(Y\mid X_2)$ | $EG$ |
|---|---|---|---|---|
| 1 | 66 | 0.3005 | 0.3444 | 0.8201 |
| 2 | 252 | 0.3948 | 0.4391 | 0.9046 |
| 3 | 1830 | 0.4383 | 0.4648 | 0.9833 |
Feature selection by the dummy variables

| Merged Features | $\lvert\mbox{Domain}(X'_2, Y)\rvert$ | $\tau(Y\mid X'_2)$ | $\lambda(Y\mid X'_2)$ | $EG$ |
|---|---|---|---|---|
| 2 | 24 | 0.3242 | 0.3934 | 0.5491 |
| 2 | 36 | 0.3573 | 0.4165 | 0.6242 |
| 2 | 48 | 0.3751 | 0.4234 | 0.6388 |
| 3 | 96 | 0.3901 | 0.4234 | 0.7035 |
| 4 | 186 | 0.4017 | 0.4269 | 0.7774 |
| 4 | 282 | 0.4121 | 0.4317 | 0.8066 |
| 5 | 558 | 0.4221 | 0.4548 | 0.8782 |
| 6 | 966 | 0.4314 | 0.4768 | 0.8968 |
| 7 | 1716 | 0.4436 | 0.4856 | 0.9135 |
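The row counts in these tables suggest a stepwise procedure that adds one feature (or merged dummy variable) at a time and tracks the growth of the joint domain alongside the association measures. As an illustration only, and not the paper's exact algorithm (its merging step and the $EG$ reliability measure are not reproduced here), a generic greedy forward selection driven by an association score such as Goodman-Kruskal tau might look like:

```python
from collections import Counter

def tau(xs, ys):
    # Goodman-Kruskal tau, used as an assumed stand-in for the
    # paper's association criterion tau(Y|X).
    n = len(ys)
    err_y = 1 - sum((c / n) ** 2 for c in Counter(ys).values())
    nx = Counter(xs)
    err_yx = 1 - sum(
        c * c / (nx[x] * n) for (x, _), c in Counter(zip(xs, ys)).items()
    )
    return (err_y - err_yx) / err_y

def forward_select(features, y, k, score=tau):
    """Greedily add the feature whose joint domain with the already
    selected ones maximizes the association score with y."""
    chosen = []
    for _ in range(k):
        best = max(
            (f for f in features if f not in chosen),
            key=lambda f: score(
                list(zip(*(features[g] for g in chosen + [f]))), y
            ),
        )
        chosen.append(best)
    return chosen

# x1 and x2 jointly determine y; "noise" carries no information
feats = {
    "x1":    [0, 0, 1, 1, 0, 0, 1, 1],
    "x2":    [0, 1, 0, 1, 0, 1, 0, 1],
    "noise": [0, 0, 0, 0, 1, 1, 1, 1],
}
y = ["00", "01", "10", "11", "00", "01", "10", "11"]
print(forward_select(feats, y, 2))  # ['x1', 'x2']
```

Each step scores a candidate on the joint domain of the features chosen so far, which is why the domain size column in the tables grows multiplicatively with each added feature; working with merged dummy variables keeps that domain far smaller for comparable association values.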
