# American Institute of Mathematical Sciences

August 2020, 3(3): 185-193. doi: 10.3934/mfc.2020017

## Averaging versus voting: A comparative study of strategies for distributed classification

 Department of Mathematical Sciences, Middle Tennessee State University, 1301 E Main Street, Murfreesboro, TN 37132, USA

* Corresponding author: Qiang Wu

Received: March 2020. Revised: May 2020. Published: June 2020.

In this paper we propose two strategies, averaging and voting, for implementing distributed classification via the divide-and-conquer approach. When a data set is too big to be processed by one processor or is naturally stored in different locations, the method partitions the data into multiple subsets, either randomly or according to their locations. A base classification algorithm is then applied to each subset to produce a local classification model. Finally, averaging or voting is used to combine the local models into the final classification model. We performed thorough empirical studies to compare the two strategies. The results show that averaging is more effective in most scenarios.
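The divide-and-conquer pipeline described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: it assumes labels in {-1, +1}, assumes that "averaging" means averaging the real-valued decision scores of the local models before thresholding, and assumes that "voting" means a majority vote over the local models' predicted labels. The function names and the least-squares base learner are hypothetical stand-ins for whatever base classifier is used (logistic regression or SVM in the experiments).

```python
import numpy as np

def partition_indices(n_samples, n_subsets, seed=0):
    """Randomly split sample indices into n_subsets nearly equal parts."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), n_subsets)

def fit_local_model(X, y):
    """Hypothetical base learner: least-squares linear classifier.

    Assumes labels y are in {-1, +1}. Returns weights with the bias last.
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def decision_scores(w, X):
    """Real-valued decision scores of one local model on the rows of X."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w

def combine_by_averaging(scores):
    """Assumed averaging rule: average local scores, then threshold at 0.

    scores has shape (n_models, n_points).
    """
    return np.where(scores.mean(axis=0) >= 0, 1, -1)

def combine_by_voting(scores):
    """Assumed voting rule: each local model casts its own label.

    The majority label wins; ties are broken in favor of +1.
    """
    votes = np.where(scores >= 0, 1, -1)
    return np.where(votes.sum(axis=0) >= 0, 1, -1)
```

Under these assumptions the two rules can disagree: a single local model with a large, confident score can outweigh several weakly opposed models under averaging, while voting counts every local model equally regardless of its confidence.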

Citation: Donglin Wang, Honglan Xu, Qiang Wu. Averaging versus voting: A comparative study of strategies for distributed classification. Mathematical Foundations of Computing, 2020, 3(3): 185-193. doi: 10.3934/mfc.2020017

Description of Data Sets and Classification Tasks

| Classification Task | Number of Observations | Number of Features |
| --- | --- | --- |
| Default of Credit Card Clients | 30,000 | 23 |
| Wilt Diseased Tree Detection | 4,889 | 5 |
| APS Failure | 60,000 | 170 |
| MAGIC Gamma Telescope | 19,020 | 10 |
| Spam Email Detection | 4,601 | 57 |
| Epileptic Seizures | 9,200 | 178 |
| Wireless Localization {1, 2} vs {3, 4} | 2,000 | 7 |
| Student Evaluation {1, 2} vs {3, 4, 5} | 5,046 | 32 |
| Handwritten Digits 5 vs 8 | 12,017 | 786 |

Classification accuracy (in percentage) of distributed logistic regression and p-values of hypothesis tests on the difference between voting and averaging strategies

| Classification Task | Voting | Averaging | p-value |
| --- | --- | --- | --- |
| Default of Credit Card Clients | 73.71 | 80.15 | <2.2e-16 |
| Wilt Diseased Tree Detection | 95.42 | 96.94 | <2.2e-16 |
| APS Failure | 98.39 | 98.75 | <2.2e-16 |
| MAGIC Gamma Telescope | 79.18 | 79.18 | 0.9845 |
| Spam Email Detection | 61.52 | 92.83 | <2.2e-16 |
| Epileptic Seizure | 50.10 | 66.11 | <2.2e-16 |
| Wireless Localization {1, 2} vs {3, 4} | 91.77 | 95.15 | <2.2e-16 |
| Student Evaluation {1, 2} vs {3, 4, 5} | 91.81 | 95.17 | <2.2e-16 |
| Handwritten Digits 5 vs 8 | 84.46 | 95.84 | <2.2e-16 |

Classification accuracy (in percentage) of distributed SVM and p-values of hypothesis tests on the difference between voting and averaging strategies

| Classification Task | Voting | Averaging | p-value |
| --- | --- | --- | --- |
| Default of Credit Card Clients | 79.29 | 79.48 | 9.2e-05 |
| Wilt Diseased Tree Detection | 96.83 | 97.19 | 4.6e-08 |
| APS Failure | 98.52 | 98.60 | <2.2e-16 |
| MAGIC Gamma Telescope | 86.59 | 86.64 | 0.2107 |
| Spam Email Detection | 93.20 | 93.47 | 0.0001 |
| Epileptic Seizure | 89.16 | 89.46 | 0.0008 |
| Wireless Localization {1, 2} vs {3, 4} | 95.42 | 95.47 | 0.3773 |
| Student Evaluation {1, 2} vs {3, 4, 5} | 95.31 | 95.34 | 0.6433 |
| Handwritten Digits 5 vs 8 | 99.50 | 99.54 | 2.3e-05 |
