FINITE-HORIZON OPTIMAL CONTROL OF DISCRETE-TIME LINEAR SYSTEMS WITH COMPLETELY UNKNOWN DYNAMICS USING Q-LEARNING

Abstract. This paper investigates the finite-horizon optimal control problem of completely unknown discrete-time linear systems, where "completely unknown" means that the system dynamics are unknown. Compared with infinite-horizon optimal control, the Riccati equation (RE) of finite-horizon optimal control is time-dependent and must meet certain terminal boundary constraints, which brings greater challenges. Meanwhile, the completely unknown system dynamics cause additional challenges. The main innovation of this paper is the developed cyclic fixed-finite-horizon-based Q-learning algorithm, which approximates the optimal control input without requiring the system dynamics. The developed algorithm mainly consists of two phases: a data collection phase over a fixed finite horizon and a parameter update phase. A least-squares method correlates the two phases, and the optimal parameters are obtained cyclically. Finally, simulation results are given to verify the effectiveness of the proposed cyclic fixed-finite-horizon-based Q-learning algorithm.

1. Introduction. Considerable research effort has been devoted to optimal control due to its importance from both theoretical and practical perspectives. It has been widely used in many fields such as industrial processes, investment, aerospace, robotics, vehicles and networked control systems, and is also an important part of control theory [16,4,5,37]. Optimal control can maximize benefits and minimize costs and resource consumption. From a mathematical perspective, finding the optimal controller is equivalent to solving the Hamilton-Jacobi-Bellman (HJB) equation for nonlinear systems and the Riccati equation (RE) for linear systems. The dynamic programming (DP) method is an effective tool for solving optimal control problems.
However, the computational complexity of this method increases sharply as the dimensions of the system state and input increase. This is the so-called "curse of dimensionality" problem [29].
During the last decades, many researchers have made great efforts to overcome the "curse of dimensionality". Based on the dynamic programming method, a large number of reinforcement learning (RL) methods, such as approximate dynamic programming (ADP), neuro-dynamic programming (NDP) and adaptive dynamic programming (ADP), have thus been proposed to deal with the "curse of dimensionality" problem [36,11,8,30,41,45,40,23,38,2,3]. As various RL methods emerged, optimal control also realized the transition from model-based reinforcement learning to model-free reinforcement learning. For model-based reinforcement learning, interested readers may consult [39,20,13] and the references therein; for model-free reinforcement learning, see [7,1,44,23,22,10,28,14,31,32] and the references therein. Q-learning is one of the most popular and powerful reinforcement learning methods, and it has achieved many research results in both theory and application [19,35]. In [27], an off-policy actor-critic neural-network-structured Q-learning method is developed to tackle the optimal output regulation problem of discrete-time systems. The authors in [24,23] develop a critic-only Q-learning method. In [26], the authors develop a multistep Q-learning method to solve the optimal output regulation problem for a 2-degree-of-freedom helicopter. In [25], a Q-function-based policy gradient adaptive dynamic programming algorithm is proposed to tackle the optimal control problem of general discrete-time nonlinear systems. In [18], the authors present a novel off-policy Q-learning method to learn the optimal solution for rougher flotation operational processes. A novel off-policy interleaved Q-learning algorithm is presented for solving the optimal control problem of affine nonlinear discrete-time systems in [17].
However, all the aforementioned results concern infinite-horizon optimal control problems; in other words, all the above reinforcement learning methods are based on an infinite horizon. In practical applications, the completion of each task is generally time-limited. Therefore, finite-horizon reinforcement learning methods are more practical. Nevertheless, the development of finite-horizon reinforcement learning methods remains an open issue so far, which is one motivation of this study.
Different from the infinite-horizon optimal control problem, the HJB equation or RE for the finite-horizon optimal control problem is inherently time-dependent, which makes it more difficult to obtain its solution. In [21], the authors propose a new form of the finite-horizon discrete-time Riccati equation and prove its equivalence to the old finite-horizon discrete-time Riccati equation. In [6], a neural network feedback controller with time-varying coefficients is found by prior offline tuning. A finite-horizon single-network adaptive critic is proposed in [9]. An iterative heuristic dynamic programming (HDP) technique is developed to solve the finite-horizon optimal tracking problem in [33]. In [12], inspired by recent advances in machine learning, a model-based globalized dual heuristic programming (GDHP) algorithm is proposed to solve the finite-horizon optimal tracking control problem, and a deep neural network (DNN) structure and training algorithm are introduced to train the GDHP algorithm. In [34], an iterative adaptive dynamic programming algorithm is developed to study the finite-horizon optimal control of discrete-time nonlinear systems. However, the aforementioned studies of the finite-horizon optimal control problem require prior model identification and are thus model-based. There are few studies on finite-horizon optimal control of completely unknown discrete-time linear systems, and the application of reinforcement learning to finite-horizon optimal control is still in its infancy. All of these are the motivations of this paper.
The main innovations of this paper are summarized in the following two aspects.
(1). A novel cyclic fixed-finite-horizon-based Q-learning algorithm is proposed. The terminal boundary condition is incorporated into the proposed algorithm. The collected learning data traverse the overall finite-horizon step [0, N], and the robustness of the algorithm is thus improved.
(2). The key to the finite-horizon optimal control problem for discrete-time linear systems is to solve the time-varying RE, which is inherently challenging. In addition, the completely unknown system dynamics bring additional challenges. Both challenges are handled in this paper.
The structure of this paper is described as follows. In Section 2, we formulate the finite-horizon optimal control problem of discrete-time linear systems with completely unknown dynamics. In Section 3, the Q-learning formulation of finite-horizon optimal control for discrete-time linear systems is given. In Section 4, we introduce the cyclic fixed-finite-horizon-based Q-learning algorithm. In Section 5, two simulation examples are given to verify the effectiveness of our proposed cyclic fixed-finite-horizon-based Q-learning algorithm. Section 6 concludes this paper and gives directions for future research.
2. Finite-horizon optimal control problem of discrete-time linear systems. In this section, the finite-horizon optimal control problem of discrete-time linear systems is formulated. This paper considers the following time-invariant discrete-time linear system

$$x(k+1) = A x(k) + B u(k), \qquad (1)$$

where x(k) ∈ Ω ⊂ R^n denotes the system state and u(k) ∈ Ω_u ⊂ R^m denotes the system control input at time step k. A ∈ R^{n×n} and B ∈ R^{n×m} are constant matrices of suitable dimensions, and both are assumed to be unknown in this study. For simplicity, x(k) and u(k) are abbreviated as x_k and u_k, respectively. For the finite-horizon optimal control problem, the goal is to find an optimal control input sequence u_k, k ∈ {0, 1, 2, …, N−1}, which minimizes the following performance index function

$$J(u_k, x_k) = \sum_{k=0}^{N-1} \left( x_k^T Q x_k + u_k^T R u_k \right) + x_N^T Q_N x_N, \qquad (2)$$

where Q = Q^T ≥ 0, R = R^T > 0 and Q_N = Q_N^T ≥ 0 are weighting matrices. A control input u is defined to be finite-horizon admissible with respect to the performance index function (2) if u is continuous on Ω and J(u_k, x_k) is finite for all x_k ∈ Ω [25].

Assumption 1.
Assumption 2. All the state variables of the discrete-time linear system (1) are assumed to be available.
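To fix ideas, the following minimal Python sketch simulates system (1) and evaluates the performance index (2) for a given control sequence. All names are illustrative; the matrices A and B appear here only for simulation purposes, since in the setting of this paper they are unknown to the controller.

```python
import numpy as np

def rollout_cost(A, B, Q, R, QN, x0, u_seq):
    """Simulate x_{k+1} = A x_k + B u_k and evaluate the performance
    index (2): J = sum_{k=0}^{N-1} (x_k' Q x_k + u_k' R u_k) + x_N' QN x_N."""
    x, J = x0, 0.0
    for u in u_seq:                       # k = 0, ..., N-1
        J += x @ Q @ x + u @ R @ u        # stage cost
        x = A @ x + B @ u                 # system dynamics (1)
    return J + x @ QN @ x                 # terminal cost weighted by Q_N
```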
As is well known, according to traditional optimal control theory [16], the optimal control input u_k^* can be found as

$$u_k^* = -\left(R + B^T P_{k+1} B\right)^{-1} B^T P_{k+1} A x_k, \qquad (3)$$

where P_{k+1} is the solution to the following Riccati equation

$$P_k = Q + A^T P_{k+1} A - A^T P_{k+1} B \left(R + B^T P_{k+1} B\right)^{-1} B^T P_{k+1} A, \qquad (4)$$

with the terminal boundary condition P_N = Q_N. Equation (4) is difficult to solve directly, and solving it requires that A and B be known. To deal with this dilemma, a Q-learning-based algorithm is introduced next.
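When A and B are known, equations (3) and (4) can be evaluated by a simple backward-in-time recursion. The sketch below (illustrative names) is a model-based baseline against which the model-free algorithm of Section 4 can be checked; it is not part of the proposed method.

```python
import numpy as np

def finite_horizon_lqr(A, B, Q, R, QN, N):
    """Backward recursion for the time-dependent Riccati equation (4)
    with terminal condition P_N = Q_N; returns gains K_0, ..., K_{N-1}
    such that u_k* = -K_k x_k, as in (3)."""
    P = QN
    gains = [None] * N
    for k in range(N - 1, -1, -1):
        S = R + B.T @ P @ B
        K = np.linalg.solve(S, B.T @ P @ A)    # (R + B' P B)^{-1} B' P A
        gains[k] = K
        P = Q + A.T @ P @ A - A.T @ P @ B @ K  # Riccati update (4)
    return gains
```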
3. Q-learning for finite-horizon optimal control of discrete-time linear systems. This section mainly presents the Q-learning formulation for finite-horizon optimal control of discrete-time linear systems.
Based on the defined performance index function (2), we define the value function (or cost function) as

$$V(x_k) = \sum_{i=k}^{N-1} \left( x_i^T Q x_i + u_i^T R u_i \right) + x_N^T Q_N x_N. \qquad (5)$$

In line with [16], the optimal value function can also be represented in the quadratic form

$$V^*(x_{k,N-k}) = x_k^T P_k x_k, \qquad (6)$$

where the subscript N−k indicates that N−k steps remain in the horizon. Combining (5) and (6), we can obtain the following Bellman equation

$$V^*(x_{k,N-k}) = \min_{u_k} \left[ x_k^T Q x_k + u_k^T R u_k + V^*(x_{k+1,N-(k+1)}) \right], \qquad (7)$$

where V^*(x_{k,N−k}) and V^*(x_{k+1,N−(k+1)}) are abbreviated as V_k^* and V_{k+1}^*, respectively. Based on equation (6) and the Bellman equation (7), we can define the Q-function associated with u_k as [15]

$$Q(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k + V_{k+1}^*. \qquad (8)$$

Substituting system (1) into (8), we obtain

$$Q(x_k, u_k) = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} Q + A^T P_{k+1} A & A^T P_{k+1} B \\ B^T P_{k+1} A & R + B^T P_{k+1} B \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} = \zeta_k^T \bar{Q}_k \zeta_k, \qquad (9)$$

where ζ_k = [x_k^T, u_k^T]^T and Q̄_k denotes the above kernel matrix.

Lemma 3.1. If u_k in the Q-function (9) is given by (3), the Q-function (9) has the same value as the Bellman equation (7), that is,

$$Q(x_k, u_k^*) = V_k^*. \qquad (10)$$

A detailed proof of Lemma 3.1 can be found in references [31] and [15]. Then we can obtain the optimal control input sequence u_k^* by minimizing the Q-function, without using the system dynamics A and B,

$$u_k^* = \arg\min_{u_k} Q(x_k, u_k) = -\left(\bar{Q}_k^{uu}\right)^{-1} \bar{Q}_k^{ux} x_k,$$

where Q̄_k^{uu} = R + B^T P_{k+1} B and Q̄_k^{ux} = B^T P_{k+1} A denote the corresponding blocks of Q̄_k, and the following equation is satisfied

$$Q(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k + Q(x_{k+1}, u_{k+1}^*). \qquad (11)$$

4. Cyclic fixed-finite-horizon-based Q-learning algorithm. This section mainly develops the cyclic fixed-finite-horizon-based Q-learning algorithm for finite-horizon optimal control of discrete-time linear systems. For convenience of representation, we introduce the Kronecker product, with which the Q-function (9) can be rewritten as

$$Q(x_k, u_k) = \zeta_k^T \bar{Q}_k \zeta_k = \left(\zeta^T \otimes \zeta^T\right)_k \mathrm{vec}(\bar{Q}_k). \qquad (12)$$

Since Q̄_k is symmetric, we can represent (ζ^T ⊗ ζ^T)_k in the compact form

$$\bar{\zeta}_k = \left[ \zeta_{k,1}^2, \zeta_{k,1}\zeta_{k,2}, \ldots, \zeta_{k,1}\zeta_{k,m+n}, \zeta_{k,2}^2, \ldots, \zeta_{k,m+n}^2 \right]^T,$$

where the corresponding vector vec(Q̄_k) ∈ R^{(n+m)(n+m+1)/2} stacks the diagonal entries Q̄_{kii} and the summed off-diagonal entries Q̄_{kij} + Q̄_{kji}, in which Q̄_{kij} (i, j = 1, …, m+n) denotes the element in the i-th row and j-th column of the matrix Q̄_k, and we define Q̄_k = vec^{-1}[vec(Q̄_k)]. Then (12) can be represented as

$$Q(x_k, u_k) = \mathrm{vec}(\bar{Q}_k)^T \bar{\zeta}_k. \qquad (13)$$

Since Q̄_k is a time-dependent matrix, vec(Q̄_k)^T is also time-varying. Then, we can rewrite vec(Q̄_k)^T as

$$\mathrm{vec}(\bar{Q}_k)^T = W^T \varphi(N, k), \qquad (14)$$

where W = [w_1, w_2, …, w_{(n+m)(n+m+1)/2}]^T ∈ R^{(n+m)(n+m+1)/2 × 1} is the ideal parameter vector and φ(N, k) ∈ R^{(n+m)(n+m+1)/2 × (n+m)(n+m+1)/2} is a time-dependent function matrix. The ideal W is unknown, so we use Ŵ as its estimate. Then (13) can be estimated as

$$\hat{Q}(x_k, u_k) = \hat{W}^T \varphi(N, k) \bar{\zeta}_k. \qquad (15)$$

According to (15), (11) can be rewritten as

$$e = \hat{W}^T \varphi(N, k) \bar{\zeta}_k - x_k^T Q x_k - u_k^T R u_k - \hat{W}^T \varphi(N, k+1) \bar{\zeta}_{k+1}, \qquad (16)$$

where e is the Bellman estimation error.
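As a concrete illustration of the compact form above, the following sketch builds ζ_k and its quadratic basis ζ̄_k. The helper name quad_basis is hypothetical, and the convention that off-diagonal entries of Q̄_k are absorbed pairwise into the parameters follows the construction described in the text.

```python
import numpy as np

def quad_basis(x, u):
    """Build zeta_k = [x_k', u_k']' and the compact quadratic basis
    zeta_bar_k: the (n+m)(n+m+1)/2 distinct monomials z_i * z_j, i <= j.
    Each off-diagonal pair Q_bar_kij + Q_bar_kji maps to one parameter."""
    z = np.concatenate([x, u])
    d = z.size
    return np.array([z[i] * z[j] for i in range(d) for j in range(i, d)])

# n = 2 states, m = 1 input  ->  basis dimension (2+1)(2+2)/2 = 6
zb = quad_basis(np.array([0.3, -0.1]), np.array([0.5]))
assert zb.size == 6
```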
Assumption 3. It is assumed that the terminal boundary condition W^T φ(N, N) = G_N ∈ R^{1×(n+m)(n+m+1)/2}, which is equivalent to the terminal weighting matrix Q_N ∈ R^{n×n}, is given in advance.
In finite-horizon optimal control, based on equation (16), over the finite-time step [0, N] we have

$$e_1 = \hat{W}^T \varphi(N,0)\bar{\zeta}_0 - x_0^T Q x_0 - u_0^T R u_0 - \hat{W}^T \varphi(N,1)\bar{\zeta}_1, \qquad (17)$$
$$e_2 = \hat{W}^T \varphi(N,1)\bar{\zeta}_1 - x_1^T Q x_1 - u_1^T R u_1 - \hat{W}^T \varphi(N,2)\bar{\zeta}_2, \qquad (18)$$
$$\vdots$$
$$e_N = \hat{W}^T \varphi(N,N-1)\bar{\zeta}_{N-1} - x_{N-1}^T Q x_{N-1} - u_{N-1}^T R u_{N-1} - \hat{W}^T \varphi(N,N)\bar{\zeta}_N, \qquad (19)$$
$$e_{\mathrm{terminal}} = \hat{W}^T \varphi(N,N)\bar{\zeta}_N - x_N^T Q_N x_N, \qquad (20)$$

where e_1, e_2, …, e_N are the Bellman approximation errors and e_terminal is the terminal boundary error. In order to minimize the Bellman approximation errors e_1, e_2, …, e_N and the boundary error e_terminal, that is, to guarantee that e_1, e_2, …, e_N, e_terminal → 0 and Ŵ → W, a least-squares method is adopted. To this end, we define the following matrices

$$\Phi_j = \left[\psi_1, \psi_2, \ldots, \psi_N, \varphi(N,N)\bar{\zeta}_N\right], \qquad \Theta_j = \left[\rho_1, \rho_2, \ldots, \rho_N, x_N^T Q_N x_N\right]^T,$$

where ψ_l = φ(N, l−1)ζ̄_{l−1} − φ(N, l)ζ̄_l and ρ_l = x_{l−1}^T Q x_{l−1} + u_{l−1}^T R u_{l−1}, l = 1, 2, …, N; when the data are collected over L runs, Φ_j and Θ_j stack the corresponding columns and entries from all runs. Setting the errors (17)–(20) to zero yields the linear equation

$$\Phi_j^T \hat{W} = \Theta_j, \qquad (21)$$

whose least-squares solution is

$$\hat{W} = \left(\Phi_j \Phi_j^T\right)^{-1} \Phi_j \Theta_j. \qquad (22)$$
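Numerically, the update (22) is an ordinary least-squares solve. Below is a minimal Python sketch, under the assumption that each row of the regressor matrix is one transposed column of Φ_j (one Bellman or terminal equation per row); names are illustrative.

```python
import numpy as np

def ls_update(Phi_rows, Theta):
    """Least-squares parameter update, cf. (21)-(22). Each row of
    `Phi_rows` is one regressor (a Bellman-error row psi_l' or the
    terminal row (phi(N,N) zeta_bar_N)'); each entry of `Theta` is the
    matching target (a one-step cost or the terminal cost x_N' Q_N x_N)."""
    W_hat, *_ = np.linalg.lstsq(np.asarray(Phi_rows),
                                np.asarray(Theta), rcond=None)
    return W_hat
```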
The flow chart of Algorithm 1 is depicted in Fig.1.
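Since Algorithm 1 itself is given as a flow chart in Fig.1, the following Python skeleton (all function names hypothetical) indicates only the cyclic two-phase structure described in the text: a data collection phase over the fixed finite horizon followed by a least-squares parameter update, repeated until Ŵ converges.

```python
import numpy as np

def cyclic_q_learning(collect_data, ls_update, max_cycles, tol=1e-6):
    """Two-phase cyclic skeleton of the algorithm described in the text.
    Phase 1 collects data over the fixed finite horizon [0, N];
    Phase 2 updates the parameters by least squares; repeat until
    the estimate W-hat stops changing."""
    W_hat = None
    for _ in range(max_cycles):
        Phi_rows, Theta = collect_data(W_hat)  # Phase 1: data collection
        W_new = ls_update(Phi_rows, Theta)     # Phase 2: parameter update
        if W_hat is not None and np.linalg.norm(W_new - W_hat) < tol:
            return W_new                       # converged
        W_hat = W_new
    return W_hat
```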

Remark 1. It is easy to see that Algorithm 1 does not require the system dynamics A and B to be known.
Remark 2. In the initial admissible control step of Phase 1, the initial system state x_0 can be randomly taken from a compact set Ω in each iteration. When each iteration randomly selects the initial state for learning, it is necessary to ensure that enough points in the compact set are traversed.

To guarantee the convergence of Algorithm 1, for each data collection in Step 2, the following condition must be satisfied [10]:

$$\mathrm{rank}(\Phi_j) = \frac{(n+m)(n+m+1)}{2}, \qquad (23)$$

and L must satisfy

$$L \geq \frac{(n+m)(n+m+1)}{2(N+1)}.$$

Proof. Since the number of independent elements in Ŵ is (n+m)(n+m+1)/2, in order to ensure that the solution to equation (21) is unique, according to Lemma 6 in [10], equation (23) must be satisfied. According to data collection Step 2, the data are collected at N+1 points over the finite-time step [0, N] for L times; hence, based on equation (23), we must have

$$L(N+1) \geq \frac{(n+m)(n+m+1)}{2}.$$

From the above inequality, it is easy to obtain

$$L \geq \frac{(n+m)(n+m+1)}{2(N+1)}.$$

The proof is thus complete.
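The conditions above are straightforward to check numerically. A minimal sketch (illustrative names):

```python
import numpy as np
from math import ceil

def rank_ok(Phi_rows, n, m):
    """Check the convergence condition (23): the stacked regressor
    matrix must have full column rank (n+m)(n+m+1)/2."""
    target = (n + m) * (n + m + 1) // 2
    return np.linalg.matrix_rank(np.asarray(Phi_rows)) == target

def min_runs(n, m, N):
    """Smallest L with L(N+1) >= (n+m)(n+m+1)/2, i.e. the bound on L
    stated above."""
    return ceil((n + m) * (n + m + 1) / (2 * (N + 1)))

# e.g. n = 2, m = 1, N = 5: L * 6 >= 6, so a single run can suffice
assert min_runs(2, 1, 5) == 1
```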
Remark 5. In applications, the condition (23) is easily satisfied by properly adjusting L.

5. Simulation results. In this section, we present two simulation examples to verify the effectiveness of our proposed cyclic fixed-finite-horizon-based Q-learning algorithm.

Example 1: In this example, we consider a discrete-time linear system taken from reference [42], with two states and a single control input; the system matrices A and B are as given in that reference.
Then, the simulation results are presented in the following. The initial system states x_0, randomly selected from Ω := {−1 ≤ x_1, x_2 ≤ 1}, are presented in Fig.2. Fig.3 shows the convergence process of Ŵ. After the second iteration, Ŵ has converged to Ŵ = [−0.5567, 2.6900, −2.7724, −0.7135, 1.5957, −0.2423]^T. The system state trajectories are displayed in Fig.4. The optimal control input is depicted in Fig.5.

Example 2: In this example, we consider the discretized F-16 aircraft plant from reference [43], with three states; the system matrices are as given in that reference. The time-dependent function matrix φ(N, k) takes the same form as in Example 1. In this example, we use the fixed initial state x_0 = [1, −1, 0.5]^T to train Algorithm 1.
The simulation results are presented in the following. Fig.6 shows the convergence process of Ŵ. After the second iteration, Ŵ has converged. The system state trajectories are displayed in Fig.7. The optimal control input is depicted in Fig.8. Therefore, the simulation results verify the effectiveness of our designed cyclic fixed-finite-horizon-based Q-learning algorithm.

6. Conclusion. This paper has investigated the finite-horizon optimal control problem of discrete-time linear systems with completely unknown dynamics. Dealing with this problem is equivalent to solving a time-dependent Riccati equation. To relax the dependence on the system dynamics, the Q-learning technique is introduced in this study. A cyclic fixed-finite-horizon-based Q-learning algorithm is thus developed to cope with the optimal control problem. Finally, the effectiveness of the developed algorithm is verified by two simulation examples. In future work, we will extend the results of this study to finite-horizon optimal tracking control problems.