# American Institute of Mathematical Sciences

## A category-based probabilistic approach to feature selection

1. School of Mathematics and Information Sciences, Guangzhou University, Guangzhou 510006, China
2. Clearpier Inc., 1300-121 Richmond St. W., Toronto, Ontario M5H 2K1, Canada

Published August 2018

A high-dimensional, large-sample categorical data set with a response variable may have many noninformative or redundant categories in its explanatory variables. Identifying and removing these categories usually not only improves the association but also gives rise to significantly higher statistical reliability of the selected features. A category-based probabilistic approach is proposed to achieve this goal. Supporting experiments are presented.

Citation: Jianguo Dai, Wenxue Huang, Yuanyi Pan. A category-based probabilistic approach to feature selection. Big Data & Information Analytics, doi: 10.3934/bdia.2017020

Feature selection by the original variables

| Original Features | $\lvert\mbox{Domain}(X_2, Y)\rvert$ | $\tau(Y \mid X_2)$ | $\lambda(Y \mid X_2)$ | $EG$ |
|---|---|---|---|---|
| 1 | 18 | 0.9429 | 0.9693 | 0.4797 |
| 2 | 46 | 0.9782 | 0.9877 | 0.7718 |
| 3 | 108 | 0.9907 | 0.9939 | 0.9076 |
| 4 | 192 | 1 | 1 | 0.9490 |

Feature selection by the dummy variables

| Merged Features | $\lvert\mbox{Domain}(X'_2, Y)\rvert$ | $\tau(Y \mid X'_2)$ | $\lambda(Y \mid X'_2)$ | $EG$ |
|---|---|---|---|---|
| 4 | 16 | 0.9445 | 0.9693 | 0.2098 |
| 4 | 24 | 0.9908 | 0.9939 | 0.2143 |
| 5 | 30 | 0.9962 | 0.9979 | 0.4669 |
| 6 | 38 | 1 | 1 | 0.6638 |

Feature selection by the original variables

| Original Features | $\lvert\mbox{Domain}(X_2, Y)\rvert$ | $\tau(Y \mid X_2)$ | $\lambda(Y \mid X_2)$ | $EG$ |
|---|---|---|---|---|
| 1 | 66 | 0.3005 | 0.3444 | 0.8201 |
| 2 | 252 | 0.3948 | 0.4391 | 0.9046 |
| 3 | 1830 | 0.4383 | 0.4648 | 0.9833 |

Feature selection by the dummy variables

| Merged Features | $\lvert\mbox{Domain}(X'_2, Y)\rvert$ | $\tau(Y \mid X'_2)$ | $\lambda(Y \mid X'_2)$ | $EG$ |
|---|---|---|---|---|
| 2 | 24 | 0.3242 | 0.3934 | 0.5491 |
| 2 | 36 | 0.3573 | 0.4165 | 0.6242 |
| 2 | 48 | 0.3751 | 0.4234 | 0.6388 |
| 3 | 96 | 0.3901 | 0.4234 | 0.7035 |
| 4 | 186 | 0.4017 | 0.4269 | 0.7774 |
| 4 | 282 | 0.4121 | 0.4317 | 0.8066 |
| 5 | 558 | 0.4221 | 0.4548 | 0.8782 |
| 6 | 966 | 0.4314 | 0.4768 | 0.8968 |
| 7 | 1716 | 0.4436 | 0.4856 | 0.9135 |
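
The $\tau$ and $\lambda$ columns in the tables above are not defined on this page. Assuming they denote the Goodman-Kruskal tau and lambda association measures for a categorical response $Y$ given a categorical predictor $X$, the following Python sketch shows how such values could be computed from raw observations, and how merging categories of $X$ can shrink its domain without lowering the association. The function names (`gk_tau`, `gk_lambda`, `merge_categories`) and the toy data are hypothetical and are not taken from the paper; this is not the authors' selection algorithm, and the $EG$ column is not reproduced.

```python
from collections import Counter


def gk_tau(pairs):
    """Goodman-Kruskal tau(Y|X) for a list of (x, y) observations.

    Assumes Y takes at least two distinct values (otherwise the
    denominator is zero).
    """
    n = len(pairs)
    x_counts = Counter(x for x, _ in pairs)
    y_counts = Counter(y for _, y in pairs)
    xy_counts = Counter(pairs)
    # Sum over y of p(y)^2
    sum_py2 = sum((c / n) ** 2 for c in y_counts.values())
    # Sum over (x, y) of p(x, y)^2 / p(x)
    cond = sum((c / n) ** 2 / (x_counts[x] / n)
               for (x, _), c in xy_counts.items())
    return (cond - sum_py2) / (1 - sum_py2)


def gk_lambda(pairs):
    """Goodman-Kruskal lambda(Y|X) for a list of (x, y) observations.

    Assumes Y takes at least two distinct values.
    """
    n = len(pairs)
    y_counts = Counter(y for _, y in pairs)
    xy_counts = Counter(pairs)
    # Largest cell count within each category of X
    best_per_x = {}
    for (x, _), c in xy_counts.items():
        best_per_x[x] = max(best_per_x.get(x, 0), c)
    max_y = max(y_counts.values())
    return (sum(best_per_x.values()) - max_y) / (n - max_y)


def merge_categories(pairs, mapping):
    """Relabel categories of X; unmapped categories are kept as-is."""
    return [(mapping.get(x, x), y) for x, y in pairs]


# Toy data (hypothetical): categories "a" and "b" carry the same
# information about Y, so merging them loses nothing.
data = [("a", 0), ("b", 0), ("c", 1), ("c", 1)]
merged = merge_categories(data, {"a": "ab", "b": "ab"})

tau_before, tau_after = gk_tau(data), gk_tau(merged)
lam_before, lam_after = gk_lambda(data), gk_lambda(merged)
n_categories_before = len({x for x, _ in data})
n_categories_after = len({x for x, _ in merged})
```

In this toy case the number of categories of $X$ drops from 3 to 2 while both measures stay unchanged, which is the kind of trade-off between domain size and association that the tables report.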
