TABU SEARCH GUIDED BY REINFORCEMENT LEARNING FOR THE MAX-MEAN DISPERSION PROBLEM

Abstract. We present an effective hybrid metaheuristic that integrates reinforcement learning with tabu search (RLTS) for solving the max-mean dispersion problem. The innovative element is a knowledge-based strategy, built on the Q-learning mechanism, that locates promising regions when the tabu search is stuck in a local optimum. Computational experiments on extensive benchmarks show that RLTS performs much better than state-of-the-art algorithms in the literature. On the 60 benchmark instances with 500 to 1,000 elements, out of 100 in total, our proposed algorithm matched the currently best known lower bounds; on the remaining 40 large-scale instances, it matched or outperformed them. Furthermore, an additional analysis was conducted that sheds light on the effectiveness of the incorporated RL technique.


1. Introduction. Suppose a set V of n elements and a distance matrix [d_ij]_{n×n}, where d_ij gives the distance from element i to element j; distances may take both positive and negative values. The max-mean dispersion problem consists of picking a subset M of elements from V (|M| is not fixed) so that the average distance between selected elements, i.e., \sum_{i,j \in M; i<j} d_ij / |M|, is maximized. Let variable x_i equal 1 when element i is picked into M, and 0 otherwise; the max-mean dispersion problem can then be formally stated as the following fractional binary quadratic program:

max f(x) = \sum_{i<j} d_ij x_i x_j / \sum_{i=1}^{n} x_i,   (1)
s.t. \sum_{i=1}^{n} x_i \geq 2,   (2)
x_i \in {0, 1}, i = 1, 2, ..., n.   (3)

Constraints (2) and (3) state that at least two variables are selected. Dispersion problems are distinguished in the literature in terms of equity-based and efficiency-based distance functions [22]; the objective function of the max-mean dispersion problem is accordingly equity-based. Several other specific dispersion problems defined in the literature include: the max-min diversity problem, which maximizes the minimum dispersion [10,21]; the minimum differential dispersion problem, which minimizes the difference between the maximum and minimum aggregate dispersion [1,26]; the balanced quadratic optimization problem, which minimizes the difference between the maximum and minimum dispersion [23]; and the maximum min-sum dispersion problem, for which a solution-based tabu search was proposed, characterized by the joint use of hash functions to determine the tabu status of candidate solutions and a parametric constrained swap neighborhood to enhance computational efficiency [14]. An efficiency-based example is the max-sum dispersion problem, which maximizes the aggregate dispersion [2,27].
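As a concrete illustration of the objective above, the following sketch (our own, not taken from any of the reference implementations) evaluates the max-mean objective for a 0/1 selection vector over a symmetric distance matrix:

```python
def max_mean_objective(d, x):
    """Return sum of d[i][j] over selected pairs i<j, divided by |M|.

    d is a symmetric n x n distance matrix (entries may be negative),
    x is a 0/1 selection vector; returns 0.0 if fewer than 2 elements
    are selected, since the problem requires |M| >= 2.
    """
    selected = [i for i, v in enumerate(x) if v == 1]
    m = len(selected)
    if m < 2:
        return 0.0
    total = sum(d[i][j]
                for idx, i in enumerate(selected)
                for j in selected[idx + 1:])
    return total / m
```

Note that because |M| appears in the denominator, adding an element with positive distances to the set can still decrease the objective, which is what makes the flexible cardinality nontrivial.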
Furthermore, one of the well-known equity-based diversity problems is the max-mean dispersion problem (max-mean DP), along with its weighted version [15,20], in which the selected subset M has flexible cardinality, i.e., |M| can range from 2 to n. This paper focuses on this max-mean DP and aims at designing and implementing effective metaheuristic algorithms for it. In addition to its theoretical significance as an NP-hard problem [22], the max-mean dispersion problem has practical applications in various fields, such as architectural-space planning, social-relations mining, web-page ranking, pollution control, and capital investment.
Several algorithms have been presented in the literature to solve this difficult optimization problem. In 2009, Prokopyev et al. [22] proposed a linear mixed binary programming formulation for the max-mean dispersion problem and proved that the problem is NP-hard when the distances between elements are not restricted in sign. Xu et al. [28] applied learning adaptation to solve constraint-satisfaction problems. A hybrid metaheuristic approach based on reinforcement learning, applied to the traveling-salesman problem, was proposed by Francisco et al. [7] in 2010. In that same year, Tao and Zhen [29] introduced a multistep Q() learning approach for power-system stabilizers. In 2013, Martí and Sandoya proposed the Greedy Randomized Adaptive Search Procedure integrated with path relinking (GRASP-PR) [18]; it uses a randomized greedy mechanism to build elite solutions (ES) and a variable neighborhood descent procedure for improvement. In 2014, Della Croce et al. [8] developed a two-stage hybrid heuristic method that combines a mixed integer nonlinear solver and a local branching program; in 2016, the same authors improved this algorithm by adding a path-relinking phase to enhance solution quality [9]. In 2015, Carrasco et al. [6] designed an efficient two-stage tabu-search algorithm whose strategy combines various short-term and long-term tabu searches to improve performance. Brimberg et al. [4] introduced a variable neighborhood search method that examines multiple neighborhood structures based on add, drop, and swap moves and picks one of them probabilistically to perform a shaking procedure. In 2016, Lai and Hao [15] developed the first tabu-search-based memetic algorithm (MAMMDP); it first uses a tabu search to detect promising local optima and then applies a random crossover operator to create better diversified offspring solutions.
After reviewing the above algorithms, it is worth noting that those proposed after 2013 are the most competitive. Another important component of our algorithm is the reinforcement-learning mechanism, an unsupervised learning technique: it does not need an experienced agent to choose the correct action; instead, it selects its future actions based on the feedback it obtains from the environment [13].
The three most distinctive elements of an RL agent are states, actions, and rewards, and RL operates in much the same way as solving a heuristic optimization problem. Miagkikh and Punch [19] proposed a population of RL agents to deal with combinatorial optimization problems. The state of cellular automata applied to the graph-coloring problem was updated using RL [25]. An interesting application of RL is found in its combination with metaheuristics [3]: for example, RL can be applied to learn a new evaluation function over multiple search trajectories of the same problem instance, alternating between the learned and the original evaluation function. Along these lines, Xu et al. [28] formulated constraint-satisfaction problems as an RL task. RL can also be used to choose among the many available heuristics and to decide when to use each one. Another use of RL arises in hyperheuristics, where it combines a selection strategy with an acceptance strategy to choose suitable low-level heuristics and decide when to accept a move. For example, Burke et al. [5] proposed hyperheuristics in which the selection of low-level heuristics makes use of basic reinforcement-learning principles combined with a tabu-search mechanism. More recently, RL has been used to schedule several search operators under genetic and multiagent-based optimization frameworks; an example can be found in the work of Sghir et al. [24]. The mechanisms above rely on RL approaches instead of a random component to obtain a more knowledgeable decision strategy. In this paper, our reinforcement-learning tabu search (RLTS) uses an RL mechanism, iterated Q-learning, to build promising initial solutions that direct a tabu search. Thus, we present the RLTS approach for tackling the max-mean DP, which combines reinforcement-learning techniques with a dynamic tabu-search procedure (Section 2).
Our proposed RLTS algorithm belongs to a multistart search framework in which each start comprises two main components: an initial solution-building mechanism and a local-search mechanism. For the initial solution-generation phase, we were inspired by the Q-learning strategy, which uses an informed decision strategy learned from agents interacting with the environment to build high-quality and well-diversified solutions. Then, for the local phase, we use a one-flip tabu-search algorithm to process the initial solution, improve solution quality, and obtain a good approximate optimal solution. Computational experiments on extensive benchmarks show that RLTS performs much better than state-of-the-art algorithms in the literature; an additional analysis further demonstrates the merit of the incorporated RL technique. The rest of the paper is organized as follows. Section 2 describes the general procedure and the important components of the proposed RLTS algorithm. Section 3 presents our computational results, compares them with state-of-the-art algorithms, and analyzes the contribution of the RL component to the algorithm's performance. Conclusions based on the computational outcomes are given in Section 4.

2. Algorithm. In this section, we present the general scheme of the proposed RLTS algorithm, the feasible-set initialization mechanism, the RL-guided solution-generation technique, the tabu-search strategy, and the feasible-set updating and reconstruction methods.
2.1. General scheme. The main scheme of our RLTS algorithm is depicted in Algorithm 1. The algorithm starts by constructing and initializing the relevant feasible set (Section 2.2), then enters the main loop (Lines 5-18). In the first cycle, an initial solution S_0 is randomly constructed (Section 2.2.1). Then, S_0 is processed into a current solution S_c using a tabu-search algorithm (Section 2.3). According to whether each element of the initial solution S_0 is still present in the current solution S_c, and whether S_c is better than the current optimal solution S*, we update the reward matrix, increasing the reward of elements that are still present and reducing the reward of elements that no longer exist (Section 2.4). The next step is to check whether the current solution S_c is better than the current optimal solution S*; if so, it replaces S* (Lines 14-15), and the algorithm enters the next cycle. From the second cycle onward, the feasible set is re-initialized, but the Q matrix, the reward matrix, and the current optimal solution S* are carried over (Section 2.2.1). Apart from using the reinforcement-learning procedure, instead of random construction, to generate the initial solution passed to the tabu search (Section 2.2.2), the other steps remain unchanged. At the end of each loop, the algorithm checks whether the running time has exceeded the time limit; if it has, the loop stops and the current optimal solution S* is returned (Lines 3 and 17-19).
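The loop described above can be sketched as follows. This is our own illustration, not the authors' code; the helper callables build_initial, tabu_search, and update_rewards are placeholders for the components detailed in Sections 2.2-2.4:

```python
import random
import time

def rlts(n, time_limit, build_initial, tabu_search, update_rewards):
    """Multistart skeleton: a random initial solution in the first cycle,
    RL-constructed initial solutions afterwards; each initial solution is
    refined by tabu search before the reward matrix is updated."""
    best, best_f = None, float("-inf")
    first_cycle = True
    start = time.time()
    while time.time() - start < time_limit:
        if first_cycle:                      # Section 2.2.1: random construction
            s0 = [random.randint(0, 1) for _ in range(n)]
            first_cycle = False
        else:                                # Section 2.2.2: RL construction
            s0 = build_initial()
        sc, fc = tabu_search(s0)             # Section 2.3: one-flip tabu search
        update_rewards(s0, sc, fc > best_f)  # Section 2.4: reward update
        if fc > best_f:
            best, best_f = sc, fc
    return best, best_f
```

The only state shared between cycles is the best solution and the reward/Q information held by the callables, matching the carry-over described above.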

2.2. Feasible-set initialization. To construct the RL model, we start by initializing a feasible set (FS) used to generate a set of good-quality solutions. To build each solution in FS, we first use the distance matrix D to build a reward matrix, and then construct the Q-learning matrix (QLM) on the basis of the reward-matrix information. The FS, initialized in every loop by FeasibleInitializationSet(), records three components: the initial solution S_0, the current solution S_c, and the tabu table.
2.2.1. Randomly constructed initial solution. At the beginning, the reward matrix R is the zero matrix. If the reinforcement-learning procedure were used to construct the very first initial solution, no action selection would yet be rewarded, so the resulting experience could not be propagated into the Q matrix; hence the Q matrix is likewise the zero matrix, and RL construction would simply try all currently available actions, i.e., select every element. Such an initial solution would be fixed, and so would the recorded initial solution S_0, the current solution S_c, the updated current optimal solution S*, and the first update of the reward matrix R. Therefore, in order to diversify the search moves of the algorithm, the initial solution of the first cycle is constructed by Random(): the algorithm assigns each variable the value 1 or 0 with equal probability 0.5. A tabu search applying the one-flip move operator is then used for improvement.

2.2.2. RL initial-solution construction. After the first cycle, by comparing the initial solution S_0 with the current solution S_c, it is already possible to define a reward for selecting certain elements. The main idea of the rewards in this algorithm is that if an element of the initial solution S_0 still appears in the current solution S_c after the tabu-search refinement, it is considered a superior element: selecting it is more likely to increase the objective value and to shorten the tabu-search time. Accordingly, a positive return is defined for all actions that select this kind of element. On the other hand, if an element of S_0 no longer exists in S_c after the tabu improvement, then all actions selecting this element record a negative return of the same magnitude. Because the reinforcement-learning procedure selects the action with the highest return with higher probability, the former elements become more likely to be selected and the latter less likely.
At the same time, if the objective value of the current solution S_c is greater than that of the current optimal solution S*, the elements of S_c are considered even better, and the reward value is increased accordingly (Section 2.2.3). In this way, the elements of the current optimal solution S* become more likely to be selected into the initial solution. Although this can save the tabu search a great deal of work and shorten the run time, the initial solutions become very similar, and the search easily becomes trapped in a local optimum. Therefore, the action-selection strategy does not always choose the action with the highest reward; with some probability it chooses a random action to improve search diversity (Section 2.2.4). In this way, we ensure that the initial solutions generated by the reinforcement-learning procedure are of good quality while remaining diverse. The specific steps for building an initial solution with reinforcement learning are detailed in Algorithm 3.

2.2.3. Optional-action list update. The reinforcement-learning procedure starts from an empty set and then continuously selects actions that put elements into the set, without ever removing elements from it. To avoid selecting actions that have already been selected, the algorithm maintains an array of optional actions Sa[] that stores the currently selectable actions. At the beginning, the size of the array equals the size of the problem, that is, the number of elements n; for example, Sa[0] = 1, Sa[1] = 2, ..., Sa[n-1] = n. A counter SaNum records the number of currently optional actions; its initial value is n. After the random selection of the initial state, and after each action selection, the array Sa is updated: the element selected by the action is removed from Sa[], all entries after it are shifted forward by one position, and the counter SaNum is decremented by one. In this way, time is not wasted on actions that have already been selected, which to a certain extent improves the search efficiency of the algorithm.
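A minimal sketch of such an optional-action array follows. We use an O(1) swap-with-last removal rather than shifting every later entry forward, a common equivalent variant when the order of the remaining actions does not matter; both keep already-selected actions out of future draws:

```python
class OptionalActions:
    """Array Sa of still-selectable actions plus the counter SaNum."""

    def __init__(self, n):
        self.sa = list(range(1, n + 1))  # actions numbered 1..n
        self.num = n                     # SaNum: count of optional actions

    def pick(self, index):
        """Remove and return the action stored at position `index`
        (0 <= index < self.num) by swapping in the last active entry."""
        a = self.sa[index]
        self.num -= 1
        self.sa[index] = self.sa[self.num]  # O(1) swap-with-last removal
        return a
```

Only the first self.num entries of sa are ever sampled, so a removed action can never be drawn again.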

2.2.4. Action-selection procedure. At each step, the reinforcement-learning procedure selects actions from the optional-action array according to an action-selection strategy. Four action-selection strategies are commonly distinguished, as detailed by Zhou et al. [30]: random selection; greedy selection; roulette selection based on the values of the experience matrix Q; and hybrid selection, which mixes the random and greedy strategies. The choice of action-selection strategy is very important for a reinforcement-learning algorithm. Our proposed algorithm therefore uses a hybrid selection strategy that combines randomness with greediness. When the greedy strategy is adopted, with probability ε, the seemingly best element is selected, which improves the quality of the initial solution, speeds up the tabu search toward the optimal solution, and at the same time exploits the action to make the model more accurate. When the random strategy is used, with probability 1 − ε, an element is selected at random, so that the algorithm occasionally avoids greed, reduces the similarity of the initial solutions, and reduces the probability of becoming trapped in local optima. At the same time, it can explore actions not yet performed and dynamically develop more possibilities to make the strategy more comprehensive.
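The hybrid strategy can be sketched as an ε-greedy rule over the Q values of the currently optional actions (our own illustration under the notation above; q_row holds the Q values of the current state, indexed by action):

```python
import random

def hybrid_select(q_row, active, epsilon=0.7):
    """Hybrid (epsilon-greedy) selection: with probability epsilon pick the
    active action with the largest Q value, otherwise pick uniformly at
    random among the active actions. `active` is the list of currently
    selectable action indices (the optional-action array of Section 2.2.3)."""
    if random.random() < epsilon:
        return max(active, key=lambda a: q_row[a])  # greedy exploitation
    return random.choice(active)                    # random exploration
```

The default epsilon = 0.7 reflects the greedy-factor setting reported in the parameter-settings section.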

2.2.5. Termination-criterion conditions. In the reinforcement-learning procedure, the goal is to average the results of state-action pairs, so we keep a running average and update it model-free: when the agent moves from state s_t to state s_{t+1} by performing action a_t, the state-action function Q(s, a) is updated using the following iteration formula [6]:

Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α (r + γ max_a Q(s_{t+1}, a)),

where s_t is the current state, a_t is the current action, s_{t+1} is the next state, α is the learning factor with 0 < α < 1, r is the reward for choosing the current action, γ is the discount factor with 0 ≤ γ < 1, and max_a Q(s_{t+1}, a) is the maximum reward available from the next state. Thus r can be regarded as the immediate reward, γ max_a Q(s_{t+1}, a) as the maximum discounted future reward, r + γ max_a Q(s_{t+1}, a) as the largest reward for the current action, and (1 − α) Q(s_t, a_t) + α (r + γ max_a Q(s_{t+1}, a)) as the comprehensive maximum reward combining past experience and the current situation. The first termination condition of the construction procedure holds when performing the selected action would reduce this comprehensive maximum reward; since the purpose of reinforcement learning is to maximize it, the construction stops when the two are inconsistent. The second termination condition holds when the trained Q matrix already has experience of selecting this action in this state; this effectively reduces the risk that an excellent element, which no longer appears in some local optimum, is only selected with very low probability later on. The reinforcement-learning construction terminates only when both conditions are met.
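The iteration formula above is the standard one-step Q-learning update; a minimal sketch, with the Q matrix stored as a list of per-state rows:

```python
def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.5):
    """One-step Q-learning update matching the formula above:
    Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma * max_a' Q(s_next, a')).
    Returns the updated Q(s,a) value."""
    target = r + gamma * max(Q[s_next])      # immediate + discounted future reward
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
    return Q[s][a]
```

The default alpha = gamma = 0.5 mirrors the parameter values reported later in the parameter-settings section.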
2.3. Tabu-search procedure. For the max-mean dispersion problem, tabu search has been used with great success in the literature, as shown by Lai and Hao [15] and by Carrasco et al. [6]. In particular, the MAMMDP algorithm of Lai and Hao [15] is still the best-performing algorithm to date. The tabu search designed in this paper is therefore the same as the one they used [15]; the main difference is that the initial solution is constructed using reinforcement learning instead of the genetic operators. The tabu search improves the initial solution through the interaction of its main components: the move operator, objective-function evaluation, the solution-selection mechanism, and the tabu and aspiration criteria. To transform the current solution into a neighboring solution, a one-flip move operator is used; this operator has been widely used for various binary quadratic programming problems [17]. Solution quality is measured by the objective function, and at each iteration the best-improvement rule is applied: among the neighborhood solutions, the one with the best evaluation value is preferred.
Given the problem size (n variables), the neighborhood contains O(n) solutions, so identifying the best neighbor becomes time-consuming if each one is evaluated from scratch using Equation (1). To improve the search efficiency of the tabu search, a fast strategy for evaluating neighborhood solutions, described in [15], is used.
Practically, we maintain a one-dimensional array W = {p_1, p_2, ..., p_n} of potentials, where p_i represents the total distance between element i and the other selected elements of the current solution S:

p_i = \sum_{j \in M, j \neq i} d_ij,

where M is the set of selected elements. In this way, when a one-flip move flips element x_i to 1 − x_i, the move gain Δ_i can be computed immediately as

Δ_i = (p_i − f(s)) / (|M| + 1) if x_i = 0, and Δ_i = (f(s) − p_i) / (|M| − 1) if x_i = 1,

where f(s) is the objective value of solution s and |M| is the number of selected elements of s, that is, the number of variables with value 1.
After element x_u is flipped, the one-dimensional array W is updated as

p_i ← p_i + d_iu if x_u is flipped from 0 to 1, and p_i ← p_i − d_iu if x_u is flipped from 1 to 0, for each i ≠ u.

The array W is initialized at the beginning of the tabu search; the computational complexity of each iteration then becomes O(n), which greatly reduces the computational cost of the tabu search and improves search efficiency. The tabu search uses a parameter called the tabu tenure to forbid the reverse of a performed move for a determined number of subsequent iterations. The iterations are subdivided into different intervals, each associated with a specified tabu tenure, in a given order. This dynamic tenure-management technique, used instead of the common static tabu-list strategy, was first introduced by Galinier et al. [11] for the graph-partitioning problem: the actual tabu tenure is generated by a periodic transition function over the intervals into which the iterations are divided. The strategy also proved effective in the max-mean dispersion algorithm of Lai and Hao [15], and our experiments show that it helps the tabu search achieve a good balance between diversification and intensification. Furthermore, the tabu search runs until the best solution found cannot be improved for a certain number of consecutive iterations λ, called the search depth. Specifically, a search-depth counter d is set to zero at the beginning of the tabu search; if the best neighborhood solution found in an iteration improves on the current optimal solution, the iteration is considered meaningful and the counter d is cleared. Otherwise, d is incremented by one, and the search terminates when d reaches the specified number of iterations λ.
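The incremental evaluation can be sketched as follows (our own illustration of the scheme from [15]; p is the potential array W, f_s the current objective value, and m = |M|):

```python
def flip_gain(p, f_s, m, i, x_i):
    """O(1) gain of flipping x_i, given the current objective f_s and |M| = m,
    using the two cases derived above."""
    if x_i == 0:                       # inserting element i into M
        return (p[i] - f_s) / (m + 1)
    return (f_s - p[i]) / (m - 1)      # dropping element i from M

def update_potentials(p, d, u, new_value):
    """O(n) update of the potential array after flipping x_u to new_value:
    every p[i] (i != u) gains or loses d[i][u]."""
    sign = 1 if new_value == 1 else -1
    for i in range(len(p)):
        if i != u:
            p[i] += sign * d[i][u]
```

For example, with M = {0, 1} and f(s) = 1, inserting an element whose potential is 3 yields a gain of (3 − 1)/3 = 2/3, matching a direct recomputation of the new objective (2 + 3)/3 = 5/3.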
In doing so, at each iteration either the non-tabu move with the best objective value or a tabu move satisfying the aspiration rule is chosen and performed to move to the next solution.
2.4. Reward-matrix updating. After the initial solution has been constructed and processed by the tabu search, the reward matrix is updated by comparing the initial solution with the current solution, so that a better initial solution can be constructed in the next iteration. The idea is as follows: if an element of the initial solution remains in the current solution after the tabu-search refinement, the element is considered promising, since selecting it helps approach the global optimum; consequently, it is granted a reward when selected. If an element of the initial solution does not remain in the current solution, it is considered poor and is punished when selected. The reward-and-punishment value is initially set to 1; if the current solution is better than the current optimal solution, the value is increased with respect to the highest historical reward-and-punishment value, so that elements remaining in the current solution are favored even more strongly; otherwise, the value is reset to 1.
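A sketch of this update follows. The exact growth rule for the bonus is our simplification (we increase it by one on improvement), since the text only states that it grows with respect to the highest historical value:

```python
def update_reward(R, s0, sc, improved, bonus):
    """Reward-matrix update sketch: elements of the initial solution s0 that
    survive in the tabu-refined solution sc earn `bonus`; dropped elements
    are penalized by it. `bonus` grows when the current solution beats the
    best found so far, and resets to 1 otherwise. Returns the new bonus."""
    bonus = bonus + 1 if improved else 1
    for i, (v0, vc) in enumerate(zip(s0, sc)):
        if v0 == 1:                         # element i was selected initially
            R[i] += bonus if vc == 1 else -bonus
    return bonus
```

Here R is a per-element reward vector; the full reward matrix of the paper additionally indexes by state, but the survive/drop logic is the same.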

3. Computational results and comparisons.
3.1. Experiment protocols and benchmark instances. In this section, we report the experimental results of the RLTS algorithm on 100 benchmark instances and compare them with other algorithms to evaluate its performance.

Our RLTS algorithm was programmed in C++ and compiled using GNU g++. The experiments were run on the taurus2 server of our cluster platform, whose CPU is an Intel Xeon Processor E5-2670 (2.5 GHz, 2 × 10 cores) and whose GPU is an Nvidia Tesla K20m. The software used for the parameter tuning reported later was IBM SPSS Statistics version 22. The max-mean dispersion instances used in our paper are of two types. In Type I instances, the distances between elements are drawn uniformly at random from [−10, 10]; in Type II instances, they are drawn uniformly at random from [−10, −5] ∪ [5, 10]. The instances can also be divided into small- and medium-sized instances and large-scale instances. The small- and medium-sized instances have n = 500, 750, 1,000, for a total of 60 instances: each size has 20 instances, 10 of each type. These instances are the same as those used in [15,18,8,9,6] and can be downloaded at http://grafo.etsii.urjc.es/optsicom/edp/. The large-scale instances have n > 1,000 elements; we used 40 instances with 3,000 and 5,000 variables, 20 instances per size, 10 of each type. They are the same as those used by Lai and Hao [15] and can be downloaded from http://www.info.univ-angers.fr/~hao/maxmeandp.html.

3.2. Parameter settings. Our algorithm relies on six parameters: the greedy factor ε, the learning factor α, the discount factor γ, the search depth λ, the maximum tabu tenure T_max, and the time limit t_limit. For ε, α, and γ, we set ε = 0.7, α = 0.5, and γ = 0.5 after tuning. For the search depth and tabu tenure, we followed the literature [15] and set λ = 50,000 and T_max = 120. For t_limit, the algorithm terminates when the running time exceeds the time limit, which is set according to the instance size: for 500 ≤ n ≤ 1,000, t_limit = 100 s; for n = 3,000, t_limit = 1,000 s; and for n = 5,000, t_limit = 2,000 s. These time limits follow the current best reference [15].

3.3. Experimental results and comparisons on small- and medium-sized instances. We ran each test 20 times and report both the maximum objective value f_best and the average objective value f_avg over the 20 runs. Table 1 shows the results for the small- and medium-sized instances (20 instances per size). In the first row, Column 1 is the instance name, Column 2 the instance size, and Columns 3 to 9 the chosen state-of-the-art algorithms and our proposed RLTS algorithm. In the second row, the two subcolumns under Columns 7 to 9 give the maximum objective value f_best and the average objective value f_avg (for example, for the MAMMDP algorithm, which is the current best algorithm [15], and for RLTS). The results in Table 1 show that the maximum objective value of RLTS is essentially greater than or equal to those of the algorithms in [18,8,9,6] (the results in Columns 3-5 are reported to two decimal places, which can make them appear larger). RLTS's results match those of the current best algorithm MAMMDP, and for each instance the average objective value coincides with the maximum objective value; that is, the algorithm attains the current best objective value on every run, showing that RLTS is very stable. In summary, on the small- and medium-sized instances, RLTS performed better than the algorithms in [18,8,9,6] and matched the current best algorithms, MAMMDP and EDA, which demonstrates its strength in tackling the max-mean dispersion problem.

4.3.
Results and comparisons on large-scale instances. Table 2 shows the results for large-scale instances: 20 instances each of size 3,000 and 5,000. Since the algorithms in [18,8,9] cannot handle large-scale instances, the table lists only the TP-TS algorithm [6], the MAMMDP algorithm [15], the EDA algorithm [16], and our RLTS algorithm. Bold entries indicate results better than MAMMDP, and underlined entries indicate results worse than MAMMDP. According to the results in Table 2, the maximum objective-function values computed by our RLTS algorithm are larger than those of the TP-TS algorithm [6]. Compared with the EDA algorithm, over these 40 instances RLTS obtains two maximum objective-function values worse than EDA and one better; in terms of stability, however, 19 of its average objective-function values are larger than EDA's and only 6 are smaller, so the stability of our proposed algorithm is stronger than that of EDA. Compared with MAMMDP, the current best algorithm, the results on the instances with 3,000 variables are identical; on the instances with 5,000 variables, the maximum objective-function values of 3 instances are greater than MAMMDP's and 1 is smaller. In terms of average objective-function values, on the 3,000-scale instances RLTS is greater than MAMMDP on 5 instances and smaller on one; on the 5,000-scale instances, RLTS is greater on 5 instances and smaller on the rest. Therefore, the RLTS algorithm is essentially as capable as MAMMDP of reaching the maximum objective-function value. In terms of stability, RLTS is more stable on the 3,000-scale instances, while MAMMDP is more stable on the 5,000-scale instances.
In addition, Table 2 shows that the average computational time required by RLTS to reach its best solutions is shorter than that of both MAMMDP and EDA, the strongest of the state-of-the-art algorithms considered in this work. Although RLTS misses the current best solution on one instance, it also improves the best-known solutions of three instances. This shows that the RLTS algorithm designed in this paper remains superior when tackling the max-mean dispersion problem.
Furthermore, we study the components of our RLTS algorithm to analyze their influence on its performance, including a significance analysis of parameters ε, α, and γ, and the role of reinforcement learning.

4.4.
Significance analysis of parameters ε, α, γ. Learning factor α and discount factor γ mainly affect how experience matrix Q is updated (Section 2.4). The larger learning factor α is, the less of the original value is retained each time experience matrix Q is updated, and the more the Q-value reflects the overall reward obtained by the action; that is, the update is more aggressive. The smaller α is, the more of the original value is retained when the Q-value is updated, and the less the reward brought by the action is incorporated. The larger discount factor γ is, the more weight future rewards receive when the Q-value is updated, that is, the more action selection takes future rewards into account. The smaller γ is, the less weight future rewards receive, that is, action selection relies more on the current reward, and future rewards account for only a small portion. Next, we describe how the tuning data were obtained: keeping all other parameters fixed, the parameter under study is varied over its tuning interval. Because differences after a parameter change are more visible on large-scale instances, the 20 instances of size 5,000 are used. Each parameter value is run 20 times, and the average objective-function value is taken. Since the objective-function values of the various instances differ greatly, the average value reported for each instance in the literature [15] is subtracted from the computed average, and the difference is used as the tuning data. Figures 1, 2, and 3 below present the tuning results for ε, α, and γ, respectively, and Table 4 shows the tuning results for ε. The resulting data were analyzed with the Friedman test.
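The Q-value update that α and γ govern, as discussed above, can be sketched as follows. This is our illustrative restatement of the standard tabular Q-learning rule, with variable names of our choosing; it is not the paper's code:

```python
# Sketch of a tabular Q-learning update: alpha controls how much of the old
# value is overwritten, gamma controls how heavily future rewards are weighted.
def q_update(q_old: float, reward: float, best_future_q: float,
             alpha: float = 0.5, gamma: float = 0.5) -> float:
    """Blend the retained old value with the newly observed discounted return."""
    return (1 - alpha) * q_old + alpha * (reward + gamma * best_future_q)

# With alpha = 0 the old value is kept unchanged; with alpha = 1 it is
# replaced entirely by the new estimate reward + gamma * best_future_q.
print(q_update(10.0, 4.0, 6.0, alpha=0.0))             # -> 10.0
print(q_update(10.0, 4.0, 6.0, alpha=1.0, gamma=0.5))  # -> 7.0
```

The paper's settings α = 0.5, γ = 0.5 thus retain half of the old Q-value at each update and discount the best future reward by half.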
The Friedman test is a nonparametric test that uses ranks to detect significant differences among multiple population distributions. The null hypothesis is that there is no significant difference among the distributions of the multiple paired samples. We used IBM SPSS Statistics 22 to carry out the Friedman test on the above data and compute the corresponding p-value. If the p-value was less than the given significance level of 0.05, the null hypothesis was rejected and the sample ranks were considered significantly different; otherwise, the null hypothesis could not be rejected, and the sample ranks in each group were considered not significantly different. Using the Friedman test, we could tune the most suitable values of greedy factor ε, learning factor α, and discount factor γ, and at the same time test whether these three parameters are significant, that is, whether changing them has a large impact on our algorithm. Table 5 shows the Friedman test results for these three parameters. In addition to the Friedman test, box plots of the tuning data for each group were produced in SPSS to display intuitively how strongly each parameter's value affects the results. Circles and stars mark outliers; the remaining features are, from top to bottom, the upper whisker, upper quartile, median, lower quartile, and lower whisker. Figures 1, 2, and 3 below are the box plots of the tuning data for greedy factor ε, learning factor α, and discount factor γ. According to the Friedman test results and the corresponding box plots, changing greedy factor ε had the greatest influence on the algorithm's results: its p-value of 0.034 is less than 0.05, so the null hypothesis is rejected and the parameter is significant.
However, changing learning factor α had little influence on the algorithm: its p-value of 0.085 is greater than 0.05, so the parameter is not significant. Changing discount factor γ affected the algorithm, but not strongly: its p-value of 0.046 is below 0.05 but very close to it, so the parameter is significant, though less so than greedy factor ε. Therefore, learning factor α is a noncritical parameter of the algorithm, while greedy factor ε and discount factor γ are key parameters; of these, greedy factor ε is the most critical and has the greatest influence on the algorithm.
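The Friedman statistic underlying the p-values above can be computed from scratch as follows. This is a hedged sketch of the standard chi-square approximation (assuming no ties), not the SPSS procedure itself; rows are blocks (instances) and columns are treatments (candidate parameter values):

```python
# Friedman chi-square statistic: rank each block's values, sum ranks per
# treatment, then apply the standard formula (no tie correction).
def friedman_statistic(data: list) -> float:
    n = len(data)        # number of blocks (instances)
    k = len(data[0])     # number of treatments (parameter values)
    rank_sums = [0.0] * k
    for row in data:
        # Rank values within the block from 1 (smallest) to k (largest).
        order = sorted(range(k), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) \
        - 3.0 * n * (k + 1)

# Four blocks that rank three treatments identically give the maximum
# possible statistic for k = 3, n = 4 (perfect agreement across blocks):
print(friedman_statistic([[0.1, 0.5, 0.9]] * 4))  # -> 8.0
```

The statistic is then compared against a chi-square distribution with k − 1 degrees of freedom to obtain the p-value reported by SPSS.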

4.5.
Role of reinforcement learning. In this section, the effect of the reinforcement-learning component is analyzed. Specifically, we compare the performance of the RLTS algorithm with that of a multistart tabu-search (MTS) algorithm obtained by removing the reinforcement-learning component, and analyze the effect of reinforcement learning on the basis of the results. Only the 40 large-scale instances of sizes 3,000 and 5,000 were considered, since on small- and medium-sized instances the results were essentially identical. Each instance was solved 20 times by each algorithm, and the comparison is based on the maximum and average objective-function values. Table 6 shows the results of the MTS algorithm and the proposed RLTS algorithm on the large-scale instances, with the larger of the two values in bold. From this table, the maximum objective-function values of the two algorithms were almost the same on the 3,000-scale instances, while the average objective-function value of RLTS was larger than that of MTS on five instances, which shows that RLTS was more stable on the 3,000-scale instances. On the 5,000-scale instances, RLTS outperformed MTS in terms of both the maximum and the average objective-function values (greater than MTS on 11 and 16 instances, respectively). Therefore, RLTS performed better than MTS: it not only found better-quality solutions but was also more stable, demonstrating the effectiveness of the reinforcement-learning component.

4.6.
Effectiveness of the RL component for search exploration. In this section, in a slightly different way from Section 4.2, we again demonstrate the importance of the RL component in the exploration of the solution space.
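The contrast between a plain multistart restart and an RL-guided restart can be sketched as follows. This is purely an illustration under our own assumptions (binary inclusion variables, one Q-value per element, and ε interpreted as the probability of acting greedily); it is not the paper's implementation:

```python
import random

# Plain multistart: restart from a uniformly random 0/1 solution.
def random_restart(n: int) -> list:
    return [random.randint(0, 1) for _ in range(n)]

# RL-guided restart: with probability epsilon follow the learned experience
# (include element i if its Q-value is positive), otherwise choose randomly.
def rl_guided_restart(n: int, q: list, epsilon: float = 0.7) -> list:
    sol = []
    for i in range(n):
        if random.random() < epsilon:          # exploit learned Q-values
            sol.append(1 if q[i] > 0 else 0)
        else:                                  # explore randomly
            sol.append(random.randint(0, 1))
    return sol

# With epsilon = 1.0 the construction is fully greedy w.r.t. Q:
print(rl_guided_restart(3, [1.0, -1.0, 2.0], epsilon=1.0))  # -> [1, 0, 1]
```

The guided variant biases new starting solutions toward elements that previously led to good local optima, which is the diversification mechanism the following comparison isolates.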
For this, we produced two RLTS variants that replace the RL-guided construction with popular exploration mechanisms from the literature. The first, randomized greedy tabu search (RG-TS), builds the solution gradually from a restricted candidate set of attributes yielding a high objective gain; an attribute is then selected from this set and added to the partial solution, and the procedure continues until no objective gain is positive. The second, randomized perturbation tabu search (RP-TS), perturbs the best solution found so far; i.e., a given number of attributes are modified randomly to generate a new solution. Since we observed a significant decrease in performance on large instances compared with small- and medium-sized instances, we conducted our experiments on the 20 instances of size 3,000. Figures 4(a) and 4(b) show the best percent deviations of RG-TS and RP-TS, respectively, from RLTS for each of the 20 selected instances; in both cases the best percent deviations are non-positive for all tested instances. These results reveal that RLTS found best solution values that were better than or equal to those found by its two variants. Figures 4(c) and 4(d) show the average percent deviations of RG-TS and RP-TS, respectively, from RLTS; the average percent deviations of RG-TS are negative for all selected instances, and those of RP-TS are negative for all 20 selected instances except one.
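As a hedged illustration, the percent-deviation measure plotted in Fig. 4 can be computed as follows. The exact formula is our assumption; on this maximization problem, a negative deviation means the variant performed worse than RLTS:

```python
# Assumed percent-deviation formula: relative gap of a variant's objective
# value from the RLTS value, expressed as a percentage.
def percent_deviation(f_variant: float, f_rlts: float) -> float:
    return 100.0 * (f_variant - f_rlts) / abs(f_rlts)

print(percent_deviation(98.0, 100.0))   # -> -2.0 (variant worse than RLTS)
print(percent_deviation(100.0, 100.0))  # -> 0.0 (variant matches RLTS)
```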
These results reveal that RLTS found better average objective values than its variants. This study thus shows that incorporating the RL component benefits the effectiveness of the algorithm.

5.
Conclusion. In this paper, we proposed for the first time a strong hybrid metaheuristic integrating reinforcement learning and tabu search (RLTS) for tackling the max-mean dispersion problem. The key innovative feature of our algorithm is the use of an RL-guided solution-building strategy to diversify the search, combined with a tabu-search improvement mechanism to intensify it. At each step, the algorithm gradually adjusts the reward mechanism on the basis of the contrast between the initial solution and the resulting local optimum, encouraging the selection of better elements when constructing the next initial solution. Extensive experiments and the final computational results demonstrated that our RLTS algorithm competes with MAMMDP, the current best algorithm, on small- and medium-scale instances (sizes below 3,000), finding the current best solutions, and performs better than the algorithms in [18,8,9,6]. On large-scale instances of sizes 3,000-5,000, the RLTS algorithm also finds the current best solutions, and its stability is higher than that of MAMMDP. Moreover, our RLTS algorithm improved the current best-known solutions of three instances, which proves its superiority to the other algorithms for solving the max-mean dispersion problem. In addition, we analyzed parameter sensitivity and established the important role and influence of the RL component in the strong performance of the RLTS algorithm. Our findings could inspire the investigation of other hybrids that integrate learning strategies with local-search mechanisms for solving NP-hard optimization problems, particularly with respect to improving stability.
Specific directions can be attempted, such as designing more appropriate alternatives for the three components: the reward mechanism, the action-selection strategy, and the updating of matrix Q. The RLTS algorithm could also be applied to other NP-hard dispersion problems, or even to non-dispersion problems such as those developed in the literature [18,8,9,6] or the knapsack problem, by adapting its optimization mechanism within a hybrid metaheuristic.
Funding. This work was supported by the National Natural Science Foundation of China [grant number 71971170].