HYBRID BINARY DRAGONFLY ENHANCED PARTICLE SWARM OPTIMIZATION ALGORITHM FOR SOLVING FEATURE SELECTION PROBLEMS

Abstract. In this paper, we present a new hybrid binary version of the dragonfly and enhanced particle swarm optimization algorithms for solving feature selection problems. The proposed algorithm is called the Hybrid Binary Dragonfly Enhanced Particle Swarm Optimization (HBDESPO) algorithm. In the proposed HBDESPO algorithm, we combine the dragonfly algorithm, whose formation of static swarms encourages diverse solutions, with the enhanced particle swarm optimization, which exploits the data through its ability to converge to the best global solution in the search space. To investigate the general performance of the proposed HBDESPO algorithm, we compare it with the original optimizers and with other optimizers that have been used for feature selection in the past. Further, we use a set of assessment indicators to evaluate and compare the different optimizers over 20 standard datasets obtained from the UCI repository. The results prove the ability of the proposed HBDESPO algorithm to search the feature space for optimal feature combinations.


1. Introduction. Feature selection is a method for identifying the independent features and removing redundant ones from a dataset [6]. The objectives of feature selection are dimensionality reduction of the data, improving prediction accuracy, and understanding the data for different machine learning applications [5]. In the real world, data representation often uses too many features, some of them redundant, which means certain independent features can fill in for others and the dependent (redundant) features can be removed. Moreover, the output is influenced by the relevant features because they contain important information about the data, and the results will be obscured if any of them is excluded [3]. Classical optimization techniques have some limitations in solving feature selection problems [16]; evolutionary computation (EC) algorithms are the alternative for overcoming these limitations and searching for the optimum solution [4]. Evolutionary computation algorithms are inspired by nature, group dynamics, social behavior, and the interaction of biological organisms in a group. The binary versions of these algorithms allow us to investigate problems like feature selection and arrive at superior results.
Many heuristic algorithms have been used in an attempt to solve the feature selection problem. A survey on evolutionary computation approaches to feature selection is delineated in [4]. The authors in [2] used a firefly algorithm to solve the feature selection problem. A binary bat algorithm for solving the feature selection problem is presented in [20]. [12] presents a feature subset selection approach based on grey wolf optimization. Hybrid algorithms have also been used to solve feature selection problems; a hybrid genetic algorithm based on mutual information is presented in [14].
The dragonfly algorithm (DA) is a novel swarm intelligence optimization technique [19]. The main inspiration of the DA algorithm originates from the static and dynamic swarming behaviors of dragonflies in nature. Two essential phases of optimization, exploration and exploitation, are designed by modelling the social interaction of dragonflies in navigating, foraging for food, and evading enemies when swarming statically or dynamically. The DA algorithm has been applied to discrete as well as multi-objective problems [19], but has not been applied to feature selection problems.
Particle swarm optimization (PSO) is a population-based stochastic optimization technique developed by Eberhart and Kennedy in 1995 [11], inspired by the social behavior of bird flocking and fish schooling. In the past several years, even though PSO has been successfully applied in many research and application areas, such as constrained nonlinear optimization problems [8], the optimal design of combinational logic circuits [9], and real-world hydraulic problems [17], there is little work in the domain of feature selection [1]. It has been demonstrated that PSO gets better results in a faster, cheaper way compared with other methods. PSO is also attractive because there are few parameters to tweak. One version, with slight variations, works well in a wide variety of real-world applications. Here, an enhanced version of the standard PSO [18] is used to solve the feature selection problem.
Hybridization of different algorithmic concepts is a method for obtaining better-performing systems and is believed to benefit from synergy, i.e., it usually exploits and unites the advantages of the individual pure strategies. It is mostly due to the no-free-lunch theorems [23] that the generalized view of metaheuristics changed and people recognized that there cannot exist a general optimization strategy that is globally better than any other. In fact, solving a problem at hand most effectively almost always requires a specialized algorithm composed of adequate parts. Hybridization is classified into many categories [10], [22]. Hybridizing one metaheuristic with another is a popular method for enhancing the performance of both algorithms.
The aim of this work is to propose a new hybrid binary version of the dragonfly and enhanced particle swarm optimization algorithms to solve feature selection problems effectively. The hybridization allows us to combine the best aspects of both algorithms and obtain better performance. In this paper, we propose a new hybrid algorithm, called the HBDESPO algorithm, that combines the dragonfly algorithm with the enhanced particle swarm optimization algorithm to obtain superior results compared with the respective individual algorithms. We test the binary HBDESPO algorithm on 20 standard datasets obtained from the UCI repository [13]. We use a set of assessment indicators to evaluate and compare the different optimizers. We also compare the algorithm with HBEPSOD, where the particle swarm optimization is carried out first and its result is given to the dragonfly algorithm. The experimental results show the ability of the proposed HBDESPO algorithm to search the feature space for optimal feature combinations.
The remainder of this paper is organized as follows. In Section 2, we present the definition of the feature selection problem. We summarize the main concepts of the dragonfly algorithm in Section 3. We present the main concepts of the enhanced particle swarm optimization algorithm in Section 4. In Section 5, we describe the main structure of the proposed HBDESPO algorithm. Section 6 provides details about the feature selection problem, the evaluation criteria, and an insight into the classifier used. In Section 7, we report the experimental results, and finally, we give the conclusion in Section 8.

2. Definition of the feature selection problem. In this section, we present the definition of the feature selection problem. The feature selection problem can be defined as the selection of a certain number of features out of the total number of available features in such a way that the classification performance is maximized and the number of selected features is minimized.
The fitness function used in the maximization mode is given in Eq. 1.

Fitness = α γ_R(D) + β (|C − R| / |C|)    (1)

where γ_R(D) is the classification quality of set R relative to decision D, |R| is the length of the selected feature subset, |C| is the total number of features, and α and β are two parameters corresponding to the importance of classification quality and subset length, with α ∈ [0, 1] and β = 1 − α. The fitness function maximizes the classification quality, γ_R(D), and the ratio of the unselected features to the total number of features, |C − R|/|C|. The above equation can easily be converted into a minimization problem by using the error rate rather than the classification quality and the selected feature ratio rather than the unselected one. The minimization problem can be formulated as in Eq. 2.

Fitness = α E_R(D) + β (|R| / |C|)    (2)
where E_R(D) is the error rate of the classifier, |R| is the length of the selected feature subset, and |C| is the total number of features. α ∈ [0, 1] and β = 1 − α are constants used to control the weights of classification accuracy and feature reduction. The intuition behind the fitness function is that the algorithm should choose features in such a way that classification accuracy is not compromised while the number of selected features is kept to a minimum, as reflected in Eq. 2.
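As a concrete illustration, the minimization fitness of Eq. 2 can be sketched in a few lines; the default α = 0.99 below is an illustrative choice, not a setting prescribed by this paper.

```python
def fitness(error_rate, n_selected, n_total, alpha=0.99):
    """Minimization fitness of Eq. 2: alpha * E_R(D) + beta * |R|/|C|.

    error_rate: classifier error E_R(D) on the validation data,
    n_selected: |R|, the number of selected features,
    n_total: |C|, the total number of features,
    alpha: illustrative weight; beta = 1 - alpha as in the paper.
    """
    beta = 1.0 - alpha
    return alpha * error_rate + beta * (n_selected / n_total)
```

A subset with a lower error rate and fewer selected features always scores lower (better), so candidate subsets can be compared directly by the optimizers.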
3. Overview of the binary dragonfly algorithm. In the following subsections, we give an overview of the main concepts and structure of the binary dragonfly algorithm.
3.1. Main concepts and inspiration. The dragonfly algorithm derives its inspiration from the static and dynamic swarming behaviors of dragonflies. These two behaviors are very similar to the two main phases of metaheuristic algorithms: exploration and exploitation. In a static swarm, dragonflies create sub-swarms and fly over different areas, which matches the main objective of the exploration phase. In a dynamic swarm, however, dragonflies fly in bigger swarms and along one particular direction, which is favorable in the exploitation phase. The five concepts of separation, alignment, cohesion, attraction to food, and distraction from enemy are used to simulate the behavior of dragonflies in both static and dynamic swarms [19].

3.2. Definition of concepts.
1. Separation: This parameter controls the separation between the different available solutions and helps to explore the search space initially. The separation weight (s) controls this behavior.

S_i = − Σ_{j=1}^{N} (x_i − x_j)

where x_j is the j-th neighboring solution of x_i, taken from the matrix Neighbors_{x_i} consisting of the neighbors of x_i, and N indicates the number of neighbors.
2. Alignment: This parameter dictates the alignment of a solution with the neighboring solutions. It is given the alignment weight (a), and the arrival at the final solution is controlled through this weight.

A_i = (Σ_{j=1}^{N} Δx_j) / N

where Δx_j is taken from the matrix Neighbors_{Δx_i} consisting of the Δx values of the neighbors of i; Δx is the matrix that keeps track of the change in position of the solutions.
3. Cohesion: This parameter represents the convergence of the solutions toward a particular food source. It is defined as the distance between the mean of the neighboring solutions and the current solution and is controlled by the cohesion weight (c).

C_i = (Σ_{j=1}^{N} x_j) / N − x_i

4. Attraction to food: As the name suggests, this parameter determines the distance of the current solution from the best solution in the group (the food source). The food attraction weight (f) controls this parameter.

F_i = Food_pos − x_i

where Food_pos is the best solution in the iteration.
5. Distraction from enemy: This parameter makes sure that the current solution stays away from bad solutions in the space. It is essentially the separation from the worst solution in the space and is controlled by the enemy distraction weight (e).

E_i = Enemy_pos + x_i

where Enemy_pos is the worst solution in the group.
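The five operators above, combined into a single step vector, can be sketched with NumPy as follows; the default weight values are illustrative only, and the weighted sum follows the standard step-vector update of the dragonfly algorithm [19].

```python
import numpy as np

def dragonfly_step(X, dX, i, neighbors, food, enemy,
                   weights=(0.1, 0.1, 0.7, 1.0, 1.0)):
    """Compute the new step vector for dragonfly i.

    X: (N, D) positions, dX: (N, D) current step vectors,
    neighbors: indices of the neighbors of dragonfly i,
    food/enemy: best and worst solutions found so far,
    weights: (s, a, c, f, e), the five behavior weights (illustrative values).
    """
    s, a, c, f, e = weights
    Xn, dXn = X[neighbors], dX[neighbors]
    S = -np.sum(X[i] - Xn, axis=0)   # separation from neighbors
    A = dXn.mean(axis=0)             # alignment with neighbor step vectors
    C = Xn.mean(axis=0) - X[i]       # cohesion toward the neighbor centroid
    F = food - X[i]                  # attraction toward the food source
    E = enemy + X[i]                 # distraction away from the enemy
    return s * S + a * A + c * C + f * F + e * E
```

The weighted combination is what moves each dragonfly; tuning the five weights shifts the balance between exploration (high s) and exploitation (high c and f).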
3.3. Binary dragonfly algorithm. In this section, we present in detail the main steps of the binary dragonfly algorithm, as shown in Algorithm 1. The notation introduced mainly follows [19].

4. Overview of binary enhanced particle swarm optimization. In the following subsections, we give an overview of the main concepts and structure of the binary enhanced particle swarm optimization algorithm.

4.1. Main concepts and inspiration. Particle swarm optimization is a population-based search method inspired by the swarm behavior (information interchange) of birds [15]. In PSO, a random population of particles is initialized, and these particles move with a certain velocity based on their interaction with the other particles in the population. At each iteration, the personal best achieved by each particle and the global best of all particles are tracked, and the velocities of all particles are updated based on this information. Certain parameters are used to weight the global and personal bests. In the enhanced version of binary PSO [18], a special type of S-shaped transfer function is used to convert a continuous value to a binary value instead of a simple hyperbolic tangent function.

4.2. Movement of particles. The notation introduced in this section mainly follows [18]. Each particle is represented by a D-dimensional vector and is randomly initialized with each individual value being binary, as in Eq. 9.

x_i = (x_{i1}, x_{i2}, . . . , x_{iD}) ∈ S,  x_{ij} ∈ {0, 1}    (9)

where S is the available search space. The velocity is represented by a D-dimensional vector and is initialized to zero, as in Eq. 10.

v_i = (v_{i1}, v_{i2}, . . . , v_{iD}) = (0, 0, . . . , 0)    (10)

The best personal (local) position recorded by each particle is maintained as in Eq. 11.

Pbest_i = (Pbest_{i1}, Pbest_{i2}, . . . , Pbest_{iD})    (11)

At each iteration, each particle changes its position according to its personal best (Pbest) and the global best (gbest), as in Eq. 12.

v_{ij}(t + 1) = w v_{ij}(t) + c_1 r_1 (Pbest_{ij} − x_{ij}(t)) + c_2 r_2 (gbest_j − x_{ij}(t))    (12)

where c_1 and c_2 are acceleration constants, called the cognitive and social parameters, respectively, and r_1 and r_2 are random values ∈ [0, 1]. w is called the inertia weight; it determines how the previous velocity of the particle influences the velocity in the next iteration. The value of w is determined by Eq. 13.

w = w_max − ((w_max − w_min) / Max_iteration) t    (13)

where w_max and w_min are constants and Max_iteration is the maximum number of iterations to be run.
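Eqs. 12 and 13 translate directly into code; the default values for c_1, c_2, and v_max below are common PSO settings, not values taken from this paper.

```python
import random

def inertia(t, w_max=0.9, w_min=0.4, max_iteration=100):
    """Linearly decreasing inertia weight of Eq. 13."""
    return w_max - (w_max - w_min) * t / max_iteration

def update_velocity(v, x, pbest, gbest, w, c1=2.0, c2=2.0, v_max=6.0):
    """Per-dimension velocity update of Eq. 12, clamped to [-v_max, v_max]."""
    new_v = []
    for j in range(len(v)):
        r1, r2 = random.random(), random.random()  # r1, r2 in [0, 1]
        vj = (w * v[j]
              + c1 * r1 * (pbest[j] - x[j])   # cognitive pull toward Pbest
              + c2 * r2 * (gbest[j] - x[j]))  # social pull toward gbest
        new_v.append(max(-v_max, min(v_max, vj)))
    return new_v
```

Clamping the velocity keeps the transfer function of the next subsection in a useful range, so bits retain a nonzero probability of flipping.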

4.3. The continuous-to-binary map. The position of each particle is determined by an S-shaped transfer function that maps the continuous velocity value to the position of the particle. This is a special sigmoid function that enhances the PSO [18].
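A sketch of the continuous-to-binary map: here a plain sigmoid stands in for the special S-shaped transfer function of [18], whose exact form we do not reproduce; a bit is set when a uniform random draw falls below the transfer value.

```python
import math
import random

def s_shaped(v):
    """Sigmoid transfer function; a stand-in for the enhanced S-shaped
    curve of [18], which reshapes this basic form."""
    return 1.0 / (1.0 + math.exp(-v))

def binarize(velocity, rng=random.random):
    """Map a continuous velocity vector to a binary position: bit j is 1
    when rand < T(v_j), so large positive velocities favor selecting
    the corresponding feature."""
    return [1 if rng() < s_shaped(vj) else 0 for vj in velocity]
```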
4.4. Enhanced particle swarm optimization algorithm. In this section, we present in detail the main steps of the binary enhanced particle swarm optimization algorithm, as shown in Algorithm 2.
• Step 1. Initialize the values of the swarm size SS(N), the acceleration constants c_1 and c_2, w_max, w_min, v_max, and max_iter.
• Step 2. The population is randomly initialized as in Eq. 9 and the velocity vectors are initialized to zeros as in Eq. 10.
• Step 3. The following steps are repeated until the terminating criterion is met.
  - Step 1. Update the value of the inertia weight w according to Eq. 13.
  - Step 2. The fitness value of each solution is calculated using f(x_i).
  - Step 3. The personal best solutions Pbest and the global best solution gbest are assigned.
  - Step 4. At each iteration t, the velocity of each particle is calculated according to Eq. 12.
• Step 4. Produce the global best as the best found solution.
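The steps above can be assembled into a minimal runnable loop; the sigmoid transfer function and the parameter defaults below are illustrative stand-ins, and `fitness` is any function mapping a binary vector to a value to be minimized.

```python
import math
import random

def epso(fitness, D, swarm_size=10, max_iter=30, c1=2.0, c2=2.0,
         w_max=0.9, w_min=0.4, v_max=6.0, seed=0):
    """Minimal binary enhanced-PSO loop following Steps 1-4 above.

    A plain sigmoid stands in for the enhanced transfer function of [18].
    Returns (gbest, gbest_fitness) for the minimization problem `fitness`.
    """
    rnd = random.Random(seed)
    # Step 2: random binary population, zero velocities (Eqs. 9 and 10)
    X = [[rnd.randint(0, 1) for _ in range(D)] for _ in range(swarm_size)]
    V = [[0.0] * D for _ in range(swarm_size)]
    pbest = [row[:] for row in X]
    pfit = [fitness(row) for row in X]
    g = min(range(swarm_size), key=lambda i: pfit[i])
    gbest, gfit = pbest[g][:], pfit[g]
    for t in range(max_iter):                        # Step 3
        w = w_max - (w_max - w_min) * t / max_iter   # Eq. 13
        for i in range(swarm_size):
            for j in range(D):
                r1, r2 = rnd.random(), rnd.random()
                v = (w * V[i][j]
                     + c1 * r1 * (pbest[i][j] - X[i][j])
                     + c2 * r2 * (gbest[j] - X[i][j]))  # Eq. 12
                V[i][j] = max(-v_max, min(v_max, v))
                T = 1.0 / (1.0 + math.exp(-V[i][j]))    # transfer function
                X[i][j] = 1 if rnd.random() < T else 0
            f = fitness(X[i])
            if f < pfit[i]:
                pbest[i], pfit[i] = X[i][:], f
                if f < gfit:
                    gbest, gfit = X[i][:], f
    return gbest, gfit   # Step 4
```

For example, minimizing `sum` (the number of set bits) drives the swarm toward the all-zeros vector.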
5. Hybrid binary dragonfly enhanced particle swarm optimization (HBDESPO) algorithm. We show the main steps of the proposed HBDESPO algorithm for feature selection in Algorithm 3 and summarize them in this section.
• Step 1. Split the given dataset into three equal-sized training, validation, and testing sets.

The concepts of the binary dragonfly and the binary enhanced particle swarm optimization algorithms described in the previous sections are combined here to derive an algorithm that benefits from their coexistence. The dragonfly algorithm's ability to obtain diverse solutions through its formation of static swarms and the enhanced PSO's convergence to the global best solution create synergy in the hybrid algorithm, which results in increased performance. The hybrid algorithm avoids the excessive exploitation of the individual algorithms and uses a dual exploration technique to reach the solution. In the HBDESPO algorithm, the decoupling of the velocity vectors of the dragonflies and the particles leads to an interesting formulation. The velocity vectors are updated independently: for the particles, according to the weighted combination of the personal and global best solutions, and for the dragonflies, whose step vectors are instantaneous in nature. One algorithm is not directed by the results obtained from the other; rather, both explore the search space alternately. This form of decoupling and alternation is also why the personal and global best solutions are updated only once per full iteration (after the enhanced particle swarm update) and not after the binary dragonfly update.
Some interesting insights derive from the fact that the velocities are decoupled. The two velocity updates work toward the same goal in different ways, which benefits the hybrid algorithm because it increases the diversity of the solutions in each iteration; this is also the main philosophy behind hybridizing algorithms. It should be kept in mind that choosing the hyperparameters is very important for obtaining good solutions and can be accomplished by a simple grid search or a random search over the hyperparameter space.
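The decoupling and the once-per-iteration best update described above can be made concrete in a small schematic; all function arguments here are illustrative placeholders rather than the paper's exact interfaces.

```python
def hybrid_iteration(positions, da_steps, pso_velocities,
                     da_update, pso_update, evaluate, pbest, gbest):
    """One HBDESPO-style iteration sketch.

    The dragonfly step vectors (da_steps) and the PSO velocities
    (pso_velocities) are kept decoupled, and pbest/gbest are refreshed
    only once, after the PSO half of the iteration.
    """
    # Dragonfly half: explore using its own (instantaneous) step vectors.
    positions, da_steps = da_update(positions, da_steps)
    # Enhanced-PSO half: exploit using its own velocities and current bests.
    positions, pso_velocities = pso_update(positions, pso_velocities, pbest, gbest)
    # Bests are updated once per full iteration, after the PSO half only.
    pbest, gbest = evaluate(positions, pbest, gbest)
    return positions, da_steps, pso_velocities, pbest, gbest
```

Because neither update reads the other's velocity state, each half keeps its own search dynamics, which is the dual-exploration property claimed above.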
6. Feature selection. The feature selection problem is defined in Section 2. For a feature vector of size N, the number of different feature combinations is 2^N, which is a huge space to search exhaustively. The proposed hybrid metaheuristic algorithm is therefore used to adaptively search the feature space and produce the best feature combination. The fitness function used is the one given in Eq. 2, where E_R(D) is the error rate of the classifier, |R| is the length of the selected feature subset, and |C| is the total number of features. α ∈ [0, 1] and β = 1 − α are constants used to control the weights of classification accuracy and feature reduction.
6.1. Classifier. K-nearest neighbor (KNN) [7] is a common, simple method used for classification. KNN is a supervised learning algorithm that classifies an unknown sample instance based on the majority vote of its K nearest neighbors. Here, a wrapper approach to feature selection is used, with the KNN classifier as its guide. KNN builds no model; classification is determined solely by the minimum distance from the current query instance to the neighboring training samples. In the proposed system, KNN is used as the classifier to ensure robustness to noisy training data and to obtain the best feature combinations. A single dimension in the search space represents an individual feature, and hence the position of a particle represents a single feature combination, or solution.

7. Experimental results. We test the proposed algorithm on the 20 datasets listed in Table 1, taken from the UCI machine learning repository [13], and compare it with other algorithms: the binary versions of the dragonfly, enhanced particle swarm optimization, GA, bat, and grey wolf algorithms. We also compare the algorithm with HBEPSOD, where the order of implementation of the two algorithms is reversed. We select the datasets to have variety in the number of instances and features so as to test on varied data. We divide each dataset into three sets: training, validation, and testing. We select the value of K as 5 based on trial and error. We use the training set to train the KNN, evaluate it on the validation set, and thereby guide the feature selection process. We use only the test data for the final evaluation of the best selected feature combination. We report the global and optimizer-specific parameter settings in Table 2. We set the parameters according to either domain-specific knowledge or trial and error. We explain the evaluation criteria in Subsection 7.1.

7.1. Evaluation criteria. We divide the datasets into three sets of training, validation, and testing. We run the algorithm M = 10 times for statistical significance of the results.
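The wrapper evaluation just described — train a KNN on the training set restricted to the selected features and score it on the validation set — can be sketched with a small, self-contained KNN; a library classifier would normally be used instead, and the helper below is purely illustrative.

```python
import numpy as np

def knn_error(mask, X_train, y_train, X_val, y_val, k=5):
    """Wrapper error rate E_R(D): train a KNN on the training rows
    restricted to the selected features (binary mask) and measure the
    error on the validation rows."""
    sel = np.flatnonzero(mask)
    if sel.size == 0:                 # empty subsets get the worst error
        return 1.0
    A, B = X_train[:, sel], X_val[:, sel]
    errors = 0
    for b, label in zip(B, y_val):
        d = np.linalg.norm(A - b, axis=1)          # Euclidean distances
        nearest = y_train[np.argsort(d)[:k]]       # labels of k nearest
        vals, counts = np.unique(nearest, return_counts=True)
        if vals[np.argmax(counts)] != label:       # majority vote
            errors += 1
    return errors / len(y_val)
```

The returned error feeds E_R(D) in Eq. 2, so the optimizer only ever sees validation performance, never the test set.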
The following measures [12] are recorded from the validation data:
1. Mean fitness is the average of the fitness function values obtained from running the algorithm M times, calculated as shown in Eq. 16.

Mean = (1/M) Σ_{i=1}^{M} g*_i    (16)

where g*_i is the best fitness value obtained at run i.
2. Best fitness is the minimum of the fitness function values obtained from running the algorithm M times, calculated as shown in Eq. 17.

Best = min_{1≤i≤M} g*_i    (17)

3. Worst fitness is the maximum of the fitness function values obtained from running the algorithm M times, calculated as shown in Eq. 18.

Worst = max_{1≤i≤M} g*_i    (18)

4. Standard deviation measures the variation of the best fitness values over the M runs, calculated as shown in Eq. 19.

Std = sqrt( (1/(M − 1)) Σ_{i=1}^{M} (g*_i − Mean)^2 )    (19)

5. Mean feature selection ratio (FSR) is the mean of the ratio of the number of selected features to the total number of features over the M runs, calculated as shown in Eq. 20.

FSR = (1/M) Σ_{i=1}^{M} size(g*_i)/D    (20)

where size(g*_i) gives the number of features selected at run i and D is the total number of features.
6. Average F-score is a measure that evaluates the performance of a chosen feature subset. It requires that, in the data spanned by the feature combination, the distance between data points in different classes be large and that between those in the same class be as small as possible. The Fischer index for a given feature j is calculated as in Eq. 21 [21].

F_j = ( Σ_k n_k (μ_k^j − μ^j)^2 ) / (σ^j)^2    (21)

where F_j is the Fischer index for feature j, μ^j is the mean of the entire data for feature j, (σ^j)^2 is defined as in Eq. 22, n_k is the size of class k, μ_k^j is the mean of class k for feature j, and (σ_k^j)^2 is the variance of class k for feature j.

(σ^j)^2 = Σ_k n_k (σ_k^j)^2    (22)

The average F-score is calculated by taking the average of the values obtained from the M runs over only the selected features.
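The fitness indicators and the Fisher score can be computed directly from the per-run results; the sample standard deviation and the pooled-within-class-variance Fisher form below are standard choices, assumed here rather than taken verbatim from the paper.

```python
import statistics

def summarize_runs(g):
    """Aggregate the per-run best fitness values g*_i into the
    mean/best/worst/standard-deviation indicators."""
    return {
        "mean": sum(g) / len(g),       # mean fitness over M runs
        "best": min(g),                # best (minimum) fitness
        "worst": max(g),               # worst (maximum) fitness
        "std": statistics.stdev(g),    # sample standard deviation
    }

def fisher_index(X, y, j):
    """Fisher index of feature j: between-class scatter of the class means
    over the pooled within-class variance (a standard form of the score)."""
    col = [row[j] for row in X]
    mu = sum(col) / len(col)                       # overall mean of feature j
    num = den = 0.0
    for k in sorted(set(y)):
        vals = [v for v, label in zip(col, y) if label == k]
        mu_k = sum(vals) / len(vals)               # class-k mean of feature j
        var_k = sum((v - mu_k) ** 2 for v in vals) / len(vals)
        num += len(vals) * (mu_k - mu) ** 2
        den += len(vals) * var_k
    return num / den if den else float("inf")
```

A larger Fisher index means the feature spreads the classes further apart relative to their internal scatter, which is exactly the separability property demanded of a good subset.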

7.2. Results. We compare the proposed binary version of the HBDESPO algorithm with the binary dragonfly algorithm, binarized from [19], and the enhanced particle swarm optimization [18], which are the individual algorithms used in this paper, as well as with other optimizers that have previously been used for feature selection, such as the binary bat [20] and binary grey wolf [12] algorithms. The results are tabulated as follows. Table 3 outlines the performance of the algorithms using the fitness function of Eq. 2 in the minimization mode. The table shows the average fitness obtained over M runs, calculated using Eq. 16. The best performance is achieved by the proposed binary HBDESPO algorithm, proving its ability to search the feature space effectively. This indicator tells us that our algorithm consistently attains good fitness values compared with the other algorithms.
Similar results are seen in Tables 4 and 5, which outline the best and worst fitness values obtained over M runs, calculated using Eqs. 17 and 18, respectively. These support the observations from the previous table, indicating that our algorithm not only achieves good performance on average but also attains the minimum best and worst fitness values amongst all the algorithms.
For testing the stability, robustness, and repeatability of convergence of these stochastic algorithms, the standard deviation of the fitness values over M runs is recorded as per Eq. 19 in Table 6. The table shows that the HBDESPO algorithm converges repeatedly irrespective of the random initialization.
The best feature combinations selected by the algorithms are also run on the test data, and the average classification accuracy and the average feature selection ratio (Eq. 20) over M runs are recorded in Tables 7 and 8. As can be seen from these tables, the HBDESPO algorithm selects the minimum number of features while maintaining the classification accuracy. This shows the capability of the HBDESPO algorithm to satisfy both objectives of the optimization, i.e., to achieve high classification accuracy while choosing as few features as possible.
The consolidated performance can be seen in Figs. 1 and 2. It can be observed that the proposed algorithm achieves high classification accuracy while keeping the feature selection ratio to a minimum. Low fitness values are also obtained, which do not vary much across different runs.

To analyze the separability and closeness of the selected features, the Fischer score of these features is calculated as shown in Eq. 21. The average over M runs is recorded in Table 9. As shown in the table, the HBDESPO algorithm achieves superior data compactness in comparison with the other algorithms.

These tables show that the HBDESPO algorithm outperforms the other algorithms with respect to all of the assessment indicators. It also performs much better than its switched version, the HBEPSOD algorithm. This leads us to believe that the dragonfly algorithm is powerful in exploring the search space, while the enhanced particle swarm optimization algorithm aids in exploiting the reduced feature space. The algorithm also benefits greatly from the dual exploration afforded by the decoupled velocities. It is clear that the individual binary algorithms (EPSO and binary dragonfly) do not perform as well as the hybrid version, which benefits from synergy. The binary bat and binary grey wolf algorithms also lag behind the proposed algorithm on the performance indicators, which can be attributed to the formulation proposed in this paper.

8. Conclusion. In this paper, we propose a new hybrid binary metaheuristic combining the dragonfly algorithm and the enhanced particle swarm optimization algorithm in order to solve feature selection problems. We call the proposed algorithm the hybrid binary dragonfly enhanced particle swarm optimization (HBDESPO) algorithm. The two algorithms come together to give better solutions than each of them individually.
In order to verify the robustness and effectiveness of the proposed algorithm, we apply it to 20 feature selection problems. We evaluate the proposed algorithm using a set of evaluation criteria to assess different aspects of the proposed system. The experimental results show that the proposed algorithm is promising, with its ability to search the feature space effectively. The algorithm was also run on the test data, and the observations show higher performance of the selected features when compared with the other optimizers. The Fischer index table reveals better separability. It is also noted from the standard deviation values that the algorithm robustly and repeatedly converges to similar solutions, and therefore has a powerful ability to solve feature selection problems better than the other algorithms in most cases.