AN RKHS APPROACH TO ESTIMATE INDIVIDUALIZED TREATMENT RULES BASED ON FUNCTIONAL PREDICTORS

Abstract. In recent years there has been massive interest in precision medicine, which aims to tailor treatment plans to the individual characteristics of each patient. This paper studies the estimation of individualized treatment rules (ITR) based on functional predictors such as images or spectra. We consider a reproducing kernel Hilbert space (RKHS) approach to learn the optimal ITR which maximizes the expected clinical outcome. The algorithm can be conveniently implemented even though it involves infinite-dimensional functional data. We provide a convergence rate for prediction under mild conditions, which is jointly determined by both the covariance kernel and the reproducing kernel.


1. Introduction. In recent years precision medicine has attracted growing attention in both the statistics and machine learning communities. One goal of precision medicine is to identify the optimal individualized treatment rule (ITR) by incorporating patients' individual-level clinical information. Different approaches to estimating the optimal ITR via maximizing the expected clinical outcome have been considered, mainly including Q-learning, A-learning, D-learning and outcome weighted learning; see [9,10,11,14,17] and the references therein.
Denote the treatment assignment by A ∈ A := {−1, 1}, i.e., we only consider two treatment options labeled −1 and 1, and denote the clinical outcome by Y ∈ Y ⊂ R_+ := [0, +∞). Let L^2(T) be the Hilbert space of square-integrable functions from T to R, where T is a compact subset of a Euclidean space. We are interested in maximizing the expected clinical outcome given a functional predictor X ∈ X ⊂ L^2(T), such as a brain image [12], which serves as the observation of a patient's symptoms; such predictors and their importance in applications are discussed in [2,8]. In clinical trials, patients are randomly assigned to one of the two treatment options. Therefore, we assume that the observation X is independent of A and that Prob(A = 1) = Prob(A = −1) = 1/2. The goal is to learn an ITR f : X → A based on a sample {(x_i, a_i, y_i)}_{i=1}^n such that the expected clinical outcome E[Y | A = f(X)] is maximized, where the sample data {(x_i, a_i, y_i)}_{i=1}^n are observations of independent and identically distributed (i.i.d.) random copies of (X, A, Y), all with the same distribution ρ on X × A × Y. Note that maximizing the expected clinical outcome E[Y | A = f(X)] is equivalent to minimizing the risk

R(f) := E[Y I(A ≠ f(X))],

where I(·) is the indicator function. Then the optimal ITR can be characterized by

f*(X) = argmin_{f : X → A} R(f) = sign(η_0(X)) := 1 if η_0(X) ≥ 0 and −1 otherwise,

where η_0(X) measures the relative difference in outcome between the two treatment assignments.
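To make the estimation target concrete, the following sketch (ours, not from the paper) shows how the value E[Y | A = f(X)] of a candidate rule f could be estimated from randomized-trial data by inverse probability weighting, using Prob(A = 1) = Prob(A = −1) = 1/2; the data-generating choices and function names are illustrative assumptions.

```python
import numpy as np

def estimate_value(f, x, a, y, p=0.5):
    """Inverse-probability-weighted estimate of E[Y | A = f(X)] for a candidate
    rule f, assuming randomized treatment with Prob(A = 1) = Prob(A = -1) = p."""
    agree = (f(x) == a).astype(float)      # indicator I(A = f(X))
    return np.mean(y * agree / p)          # empirical value of the rule

# toy illustration with scalar surrogate predictors (purely hypothetical)
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)                     # stand-in for functional predictors
a = rng.choice([-1, 1], size=n)            # randomized treatment assignment
y = 1.0 + 0.5 * a * np.sign(x) + 0.1 * rng.normal(size=n)

print(estimate_value(np.sign, x, a, y))                 # value of the rule f(x) = sign(x)
print(estimate_value(lambda x: -np.sign(x), x, a, y))   # value of the opposite rule
```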
In this paper, we propose to adopt the D-learning framework [11] to directly learn the optimal ITR f* based on functional predictors. Denote by ‖·‖_{L^2(T)} the norm of L^2(T) induced by the inner product ⟨f, g⟩_{L^2(T)} := ∫_T f(t)g(t)dt. For brevity we simply denote the norm and the inner product by ‖·‖ and ⟨·, ·⟩ respectively. For β ∈ L^2(T), the induced linear functional is given by

L_β(X) := ⟨β, X⟩ = ∫_T β(t)X(t)dt.

As in [11], we assume that the clinical outcome Y can be characterized by the model

Y = L_{µ_0}(X) + A L_{β_0}(X) + ε,    (1)

where µ_0, β_0 ∈ L^2(T) are unknown slope functions and ε is the random noise, independent of X and A. Without loss of generality, we will assume E(X) = 0 throughout the paper. By a simple calculation, using that A is independent of X with E(A) = 0 and A² = 1, we have E[AY | X] = L_{β_0}(X) = η_0(X), and it is easy to check that β_0 = argmin_{β : T → R} E(β), where E(β) := E(AY − L_β(X))². The goal of prediction is to recover the linear functional L_{β_0} based on a training sample consisting of n independent copies of (X, A, Y).

Let K : T × T → R be a continuous, symmetric and positive semi-definite kernel function. Here positive semi-definite means that for every s ∈ N, the s × s kernel matrix [K(t_i, t_j)]_{i,j=1}^s evaluated at any finite set of distinct points {t_1, ..., t_s} ⊂ T is positive semi-definite. The kernel function K uniquely induces a reproducing kernel Hilbert space (RKHS) H_K, defined as the closure of the linear span of the set of functions {K_t := K(t, ·) : t ∈ T} with respect to the inner product ⟨·, ·⟩_K satisfying ⟨K_s, K_t⟩_K = K(s, t) for all s, t ∈ T, and equipped with the induced norm ‖·‖_K. In this paper we consider the regularized estimator of η_0 = L_{β_0} given by η̂ = L_{β̂} with

β̂ = argmin_{β ∈ H_K} { (1/n) Σ_{i=1}^n (a_i y_i − L_β(x_i))² + λ ‖β‖_K² },    (2)

where λ > 0 is a regularization parameter that controls overfitting and can be selected via generalized cross validation. By virtue of the Representer Theorem, the above optimization problem is conveniently implementable [1,15]. More specifically, we have

β̂(t) = Σ_{i=1}^n ĉ_i ∫_T x_i(s) K(s, t) ds,

where ĉ = (ĉ_1, ..., ĉ_n)^T ∈ R^n is the solution of

min_{c ∈ R^n} { (1/n) ‖y − Wc‖_2² + λ c^T W c },

with y = (a_1 y_1, ..., a_n y_n)^T and W = (W_{ij}) being the n × n matrix defined by

W_{ij} := ∫_T ∫_T x_i(s) K(s, t) x_j(t) ds dt.

Note that the prediction accuracy can be measured by the excess risk

E(β̂) − E(β_0) = E[⟨β̂ − β_0, X⟩²],    (3)

which depends on the distribution of X only through the covariance kernel C : T × T → R, defined as

C(s, t) := Cov(X(s), X(t)) = E[X(s)X(t)].

Define an integral operator L_C : L^2(T) → L^2(T) as

(L_C f)(t) := ∫_T C(s, t) f(s) ds, f ∈ L^2(T).    (4)

Throughout the paper we assume E‖X‖² < ∞, which ensures that L_C is a compact operator on L^2(T). We can similarly define an integral operator associated with the kernel function K, which is given by

(L_K f)(t) := ∫_T K(s, t) f(s) ds, f ∈ L^2(T).    (5)

Note that both L_C and L_K are self-adjoint, positive and compact operators on L^2(T). Thus the r-th powers L_C^r and L_K^r can be defined by the spectral theorem for any r > 0. For notational simplicity, we denote

T_0 := L_K^{1/2} L_C L_K^{1/2}  and  T_1 := L_C^{1/2} L_K L_C^{1/2}.    (6)

From (3) and the definition (4) of L_C, the excess risk can be rewritten as

E(β̂) − E(β_0) = ⟨L_C(β̂ − β_0), β̂ − β_0⟩ = ‖L_C^{1/2}(β̂ − β_0)‖².

Thus we can evaluate the performance of algorithm (2) by estimating the norm ‖L_C^{1/2}(β̂ − β_0)‖, which will be discussed in the next section.

We first state the following generalized margin condition for our theoretical analysis. Recall that in our setting, the distribution ρ on X × A × Y can be factorized as ρ(x, a, y) = ρ(y | x, a) ρ_A(a) ρ_X(x), where ρ_A and ρ_X are the marginal distributions of ρ on A and X respectively.

Assumption 1. There exist constants C_0 > 0 and γ ≥ 0 such that

ρ_X({x ∈ X : |η_0(x)| ≤ C_0 t}) ≤ t^γ, ∀ t > 0,

where ρ_X is the marginal distribution of X.
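As a concrete illustration of the finite-dimensional computation implied by the Representer Theorem, the following sketch (ours, not from the paper) discretizes the curves on a uniform grid, forms W by simple quadrature, solves the regularized least-squares problem for ĉ, and evaluates η̂ on new curves; the Gaussian kernel, the grid quadrature and all parameter values are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(s, t, bw=0.2):
    # reproducing kernel K(s, t); the Gaussian choice is only illustrative
    return np.exp(-(s[:, None] - t[None, :]) ** 2 / (2.0 * bw ** 2))

def fit_itr(xs, a, y, grid, lam=1e-2, bw=0.2):
    """Regularized D-learning fit via the Representer Theorem.

    xs   : (n, m) array, the i-th row is the curve x_i evaluated on `grid`
    a, y : length-n treatment labels (+-1) and nonnegative outcomes
    Returns the coefficient vector c-hat and a function eta_hat(new_curves).
    """
    n, m = xs.shape
    dt = grid[1] - grid[0]                  # uniform-grid quadrature weight
    Kmat = gaussian_kernel(grid, grid, bw)  # K(s_k, t_l) on the grid
    W = dt ** 2 * xs @ Kmat @ xs.T          # W_ij ~ double integral x_i(s) K(s,t) x_j(t) ds dt
    ytil = a * y                            # the vector (a_1 y_1, ..., a_n y_n)
    # c = (W + n*lam*I)^{-1} ytil satisfies the first-order condition of
    # (1/n)||ytil - W c||^2 + lam * c^T W c
    c = np.linalg.solve(W + n * lam * np.eye(n), ytil)

    def eta_hat(new_xs):
        # eta_hat(x) = <beta_hat, x> with beta_hat(t) = sum_i c_i int x_i(s) K(s, t) ds
        return dt ** 2 * new_xs @ Kmat @ xs.T @ c

    return c, eta_hat

# usage: the recommended treatment for a new patient is sign(eta_hat(new_curve))
```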
Theorem 1.1. For any distribution ρ of (X, A, Y) satisfying Assumption 1 and any η = L_β with β ∈ L^2(T), we have

R(sign(η)) − R(f*) ≤ c_γ ‖L_C^{1/2}(β − β_0)‖^{2(γ+1)/(γ+2)},

where c_γ > 0 is a constant depending only on C_0 and γ.

Proof. A direct calculation shows that R(sign(η)) − R(f*) = E[ |η_0(X)| I(sign(η(X)) ≠ sign(η_0(X))) ]. Denote X_η := {x ∈ X : sign(η(x)) ≠ sign(η_0(x))}. We separate the set X_η into two parts, one with |η_0(x)| ≤ C_0 t and the other with |η_0(x)| > C_0 t, for a parameter t > 0 to be chosen. On the first part, Assumption 1 bounds the contribution by C_0 t · ρ_X({x : |η_0(x)| ≤ C_0 t}) ≤ C_0 t^{γ+1}. On the second part, sign(η(x)) ≠ sign(η_0(x)) implies |η(x) − η_0(x)| ≥ |η_0(x)| > C_0 t, so the contribution is at most E[(η(X) − η_0(X))²] / (C_0 t) = ‖L_C^{1/2}(β − β_0)‖² / (C_0 t). Consequently,

R(sign(η)) − R(f*) ≤ C_0 t^{γ+1} + ‖L_C^{1/2}(β − β_0)‖² / (C_0 t) ≤ c_γ ‖L_C^{1/2}(β − β_0)‖^{2(γ+1)/(γ+2)},

where the last inequality follows by taking t = ( ‖L_C^{1/2}(β − β_0)‖² / C_0² )^{1/(γ+2)}.

Theorem 1.1 indicates that we can learn the optimal ITR by minimizing the excess risk ‖L_C^{1/2}(β̂ − β_0)‖². Before establishing error bounds for the excess risk, we state some assumptions, which are common conditions in statistical learning [3,6] and functional data analysis [1,15].
The quantity N(λ) := Tr((λI + T_0)^{−1} T_0) is known as the effective dimension in the learning theory literature [5,7,16].

Theorem 1.2. Suppose Assumptions 2, 3 and 4 hold. Then, by taking λ = n^{−6α/((4θ+4)α+3)}, with high probability the prediction error satisfies ‖L_C^{1/2}(β̂ − β_0)‖ = O(λ^θ).
If Assumption 1 also holds, then the corresponding bound on the excess risk R(sign(η̂)) − R(f*) follows from Theorem 1.1, which achieves the minimax optimal learning rate [1,4,15].

We first give an explicit expression for f̂. Given the input data x = (x_1, ..., x_n), define the sample operators

T_x := (1/n) Σ_{i=1}^n L_K^{1/2} (x_i ⊗ x_i) L_K^{1/2},    (12)

(1/n) S_x y := (1/n) Σ_{i=1}^n a_i y_i L_K^{1/2} x_i.    (13)

Here, for any f, g ∈ L^2(T), f ⊗ g : L^2(T) → L^2(T) denotes the rank-one operator defined by (f ⊗ g)h := ⟨g, h⟩ f for h ∈ L^2(T). Then the estimator f̂ can be expressed as

f̂ = (λI + T_x)^{−1} (1/n) S_x y,    (11)

where y = (a_1 y_1, ..., a_n y_n)^T is the vector consisting of the output data. To provide some intuition behind f̂, for any f ∈ L^2(T), consider the random variable ξ(x) := L_K^{1/2}(x ⊗ x) L_K^{1/2} f. Since E[X ⊗ X] = L_C, we have E ξ(X) = L_K^{1/2} L_C L_K^{1/2} f = T_0 f, which shows that T_x is a finite sample approximation of T_0. Analogously, if we consider the random variable ξ(z) := a y L_K^{1/2} x, then E ξ(Z) = L_K^{1/2} L_C β_0, meaning that (1/n) S_x y is a finite sample approximation of L_K^{1/2} L_C β_0. Based on these observations, define the intermediate function

f_λ := (λI + T_0)^{−1} L_K^{1/2} L_C β_0,    (14)

which is a population version of f̂. By introducing f_λ, we can decompose L_C^{1/2}(β̂ − β_0) into two terms, namely a sample error and an approximation error. In the following, the adjoint of an operator T is denoted by T*.

Proposition 1. Let λ > 0 and let f_λ be defined by (14). Then a solution of (2) is β̂ = L_K^{1/2} f̂ with f̂ given by (11), and the following error decomposition holds:

‖L_C^{1/2}(β̂ − β_0)‖ ≤ S(z, λ) + A(λ),    (15)

where S(z, λ) := ‖L_C^{1/2} L_K^{1/2}(f̂ − f_λ)‖ and A(λ) := ‖L_C^{1/2}(L_K^{1/2} f_λ − β_0)‖.

Proof. Observe that T_0 = (L_C^{1/2} L_K^{1/2})* (L_C^{1/2} L_K^{1/2}); hence, by the polar decomposition of compact operators, there holds ‖L_C^{1/2} L_K^{1/2} g‖ = ‖T_0^{1/2} g‖ for all g ∈ L^2(T), so that S(z, λ) = ‖T_0^{1/2}(f̂ − f_λ)‖. Recall that β̂ = L_K^{1/2} f̂, so that

L_C^{1/2}(β̂ − β_0) = L_C^{1/2} L_K^{1/2}(f̂ − f_λ) + L_C^{1/2}(L_K^{1/2} f_λ − β_0).

Combining this decomposition with the triangle inequality gives the desired bound. This completes the proof of Proposition 1.
The first term S(z, λ) in (15) is called the sample error, which is caused by random sampling. Since the intermediate function f_λ is a population version of the estimator f̂, one would expect S(z, λ) to tend to zero as the sample size n becomes large. A detailed error analysis of S(z, λ) will be carried out in the next section by characterizing how well (λI + T_x)^{−1} approximates (λI + T_0)^{−1} and how well (1/n) S_x y approximates L_K^{1/2} L_C β_0 in a suitable sense. The second term A(λ) in (15) is called the approximation error, which depends on the regularity of the slope function β_0, the kernel K and the covariance kernel C. Although A(λ) does not depend on the random sample, its decay is crucial for bounding the first term S(z, λ). According to the error decomposition in Proposition 1, to bound the prediction error of algorithm (2), it suffices to bound S(z, λ) and A(λ) respectively.
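To illustrate Proposition 1 numerically, the following self-contained simulation (ours, not from the paper) discretizes L_K, L_C, T_x, T_0, f̂ and f_λ on a uniform grid and checks that the prediction error ‖L_C^{1/2}(β̂ − β_0)‖ is indeed bounded by S(z, λ) + A(λ); the kernels, the Karhunen–Loève construction of X, and all parameter values are illustrative assumptions.

```python
import numpy as np

def psd_sqrt(M):
    # symmetric square root of a (numerically) positive semi-definite matrix
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

rng = np.random.default_rng(1)
m, n, lam = 60, 200, 1e-2
t = np.linspace(0.0, 1.0, m)
dt = t[1] - t[0]

Kmat = np.exp(-np.abs(t[:, None] - t[None, :]))            # reproducing kernel K on the grid
phi = np.vstack([np.sin((j + 1) * np.pi * t) for j in range(10)])
scores = rng.normal(size=(n, 10)) / np.arange(1, 11)        # decaying Karhunen-Loeve scores
X = scores @ phi                                            # centered functional predictors
Cmat = (phi.T * (1.0 / np.arange(1, 11) ** 2)) @ phi        # covariance kernel C(s, t)

mu0 = np.cos(np.pi * t)                                     # baseline slope function
beta0 = np.sin(2 * np.pi * t)                               # true slope function beta_0
a = rng.choice([-1.0, 1.0], size=n)
y = 1.0 + dt * (X @ mu0) + a * (dt * (X @ beta0)) + 0.05 * rng.normal(size=n)

Khalf = psd_sqrt(dt * Kmat)                                 # grid version of L_K^{1/2}
Chalf = psd_sqrt(dt * Cmat)                                 # grid version of L_C^{1/2}
Tx = Khalf @ ((dt / n) * X.T @ X) @ Khalf                   # sample operator T_x
T0 = Khalf @ (dt * Cmat) @ Khalf                            # population operator T_0
g = Khalf @ (X.T @ (a * y)) / n                             # (1/n) S_x y
fhat = np.linalg.solve(lam * np.eye(m) + Tx, g)             # f-hat = (lam I + T_x)^{-1} (1/n) S_x y
flam = np.linalg.solve(lam * np.eye(m) + T0, Khalf @ (dt * Cmat @ beta0))  # f_lambda

sample_err = np.sqrt(dt) * np.linalg.norm(Chalf @ Khalf @ (fhat - flam))   # S(z, lam)
approx_err = np.sqrt(dt) * np.linalg.norm(Chalf @ (Khalf @ flam - beta0))  # A(lam)
total_err = np.sqrt(dt) * np.linalg.norm(Chalf @ (Khalf @ fhat - beta0))   # ||L_C^{1/2}(beta-hat - beta_0)||
print(total_err, "<=", sample_err + approx_err)             # error decomposition (15)
```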

3. Error analysis. In this section, we estimate S(z, λ) and A(λ) respectively, and then derive an upper bound for ‖L_C^{1/2}(β̂ − β_0)‖. Recall the definition (6) of T_0 and T_1. Both T_0 and T_1 are positive, self-adjoint and compact. It should be pointed out that these two operators share the same nonzero eigenvalues, which we denote by (λ_j)_{j≥1}, sorted in descending order. The spectral theorem for compact operators then yields

T_0 = Σ_{j≥1} λ_j φ_j ⊗ φ_j  and  T_1 = Σ_{j≥1} λ_j ψ_j ⊗ ψ_j,    (19)

where {φ_j}_{j≥1} and {ψ_j}_{j≥1} are orthonormal bases of L^2(T). The spectral decompositions of T_0 and T_1 are central to our error analysis. We first estimate the approximation error A(λ) based on the spectral decomposition of T_1.
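As a quick finite-dimensional sanity check (ours, not from the paper) of the claim that T_0 = L_K^{1/2} L_C L_K^{1/2} and T_1 = L_C^{1/2} L_K L_C^{1/2} share the same nonzero eigenvalues, note that they equal M M* and M* M respectively with M = L_K^{1/2} L_C^{1/2}; the sketch below verifies this with random positive semi-definite matrices standing in for L_K and L_C.

```python
import numpy as np

def psd_sqrt(M):
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

rng = np.random.default_rng(2)
d = 6
A = rng.normal(size=(d, d)); A = A @ A.T       # finite-dimensional analogue of L_K
B = rng.normal(size=(d, d)); B = B @ B.T       # finite-dimensional analogue of L_C

Ah, Bh = psd_sqrt(A), psd_sqrt(B)
T0 = Ah @ B @ Ah                               # analogue of L_K^{1/2} L_C L_K^{1/2}
T1 = Bh @ A @ Bh                               # analogue of L_C^{1/2} L_K L_C^{1/2}

# Both equal M M^T and M^T M for M = A^{1/2} B^{1/2}, hence share nonzero eigenvalues.
print(np.allclose(np.sort(np.linalg.eigvalsh(T0)),
                  np.sort(np.linalg.eigvalsh(T1))))   # True (up to rounding)
```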
Proof. According to the spectral decomposition of T_1 in (19), there exists an orthonormal basis {ψ_j}_{j≥1} of L^2(T) consisting of eigenfunctions of T_1 associated with the eigenvalues (λ_j)_{j≥1}. Since L_C^{1/2} β_0 = T_1^θ γ_0 for some γ_0 ∈ L^2(T) by the regularity assumption on β_0, we may expand γ_0 = Σ_{j≥1} d_j ψ_j with d_j := ⟨γ_0, ψ_j⟩. Moreover, by the definition (14) of f_λ, a direct computation gives L_C^{1/2}(L_K^{1/2} f_λ − β_0) = −λ (λI + T_1)^{−1} L_C^{1/2} β_0.

Thus we have

A(λ) = λ ‖(λI + T_1)^{−1} L_C^{1/2} β_0‖ = λ ‖(λI + T_1)^{−1} T_1^θ γ_0‖ = ( Σ_{j≥1} ( λ λ_j^θ / (λ + λ_j) )² d_j² )^{1/2}.

Simple calculations show that for 0 < θ ≤ 1/2,

λ λ_j^θ / (λ + λ_j) ≤ λ^θ, ∀ j ≥ 1.

This inequality together with ‖γ_0‖² = Σ_{j≥1} d_j² leads to the bound (20) for A(λ). This completes the proof of Proposition 2.
Next we focus on estimating the sample error S(z, λ). Our analysis involves approximation of operators, so we introduce some notation from operator theory. Let A : L^2(T) → L^2(T) be a bounded linear operator and (e_j)_{j≥1} an orthonormal basis of L^2(T). The operator norm is defined as ‖A‖ := sup_{‖f‖≤1} ‖Af‖. The operator A is said to be Hilbert–Schmidt if Σ_{j≥1} ‖Ae_j‖² < ∞; this quantity does not depend on the choice of the orthonormal basis. The space of Hilbert–Schmidt operators is a Hilbert space with respect to the inner product ⟨A, B⟩_HS = Σ_{j≥1} ⟨Ae_j, Be_j⟩, and we denote the corresponding norm by ‖·‖_HS. If A is Hilbert–Schmidt, then ‖A‖ ≤ ‖A‖_HS, and for any bounded operator B, there holds ‖BA‖_HS ≤ ‖B‖ ‖A‖_HS and ‖AB‖_HS ≤ ‖B‖ ‖A‖_HS. We denote by Tr(A) the trace of A, which is defined as Tr(A) = Σ_{j≥1} ⟨(A*A)^{1/2} e_j, e_j⟩. In particular, if A is positive, self-adjoint and compact, we can choose the orthonormal basis (e_j)_{j≥1} to be the eigenfunctions of A with corresponding eigenvalues {γ_j}_{j≥1}; then Tr(A) = Σ_{j≥1} ⟨Ae_j, e_j⟩ = Σ_{j≥1} γ_j. Now we are in a position to bound the sample error S(z, λ).
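In finite dimensions the Hilbert–Schmidt norm is the Frobenius norm and the operator norm is the spectral norm, so the facts above can be checked directly; the small sketch below (ours, not from the paper) does so with random matrices.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(5, 5))
B = rng.normal(size=(5, 5))

op = np.linalg.norm(A, 2)          # operator (spectral) norm ||A||
hs = np.linalg.norm(A, 'fro')      # Hilbert-Schmidt norm ||A||_HS (Frobenius norm)
print(op <= hs)                                                            # ||A|| <= ||A||_HS
print(np.linalg.norm(B @ A, 'fro') <= np.linalg.norm(B, 2) * hs + 1e-12)   # ||BA||_HS <= ||B|| ||A||_HS

P = A @ A.T                        # a positive, self-adjoint (symmetric PSD) matrix
print(np.isclose(np.trace(P), np.sum(np.linalg.eigvalsh(P))))              # Tr(P) = sum of eigenvalues
```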
Proof. We split the proof into several steps.

Step 2. Bound on S_3(z, λ). Recall that the sample z = (x, a, y) is generated from model (1). From the definitions (12) and (13) of T_x and (1/n) S_x y, together with the definition (14) of f_λ, we further define the random variable ξ̃_λ(z) as in (24). One can then see that

(λI + T_0)^{−1} (1/n) S_x y − f_λ = (1/n) Σ_{i=1}^n ξ̃_λ(z_i) − E ξ̃_λ,

so that S_3(z, λ) = ‖(1/n) Σ_{i=1}^n ξ̃_λ(z_i) − E ξ̃_λ‖. Chebyshev's inequality for Hilbert space valued random variables gives that, for all ε > 0,

Prob( ‖(1/n) Σ_{i=1}^n ξ̃_λ(z_i) − E ξ̃_λ‖ ≥ ε ) ≤ E‖ξ̃_λ‖² / (n ε²).

For any 0 < δ < 1, taking ε = ( E‖ξ̃_λ‖² / (nδ) )^{1/2} in the inequality above, it follows that with confidence 1 − δ,

S_3(z, λ) ≤ ( E‖ξ̃_λ‖² / (nδ) )^{1/2}.

So in order to bound S_3(z, λ), we need to estimate E‖ξ̃_λ‖². Recall again that the sample z = (x, a, y) is generated from model (1); this, together with the definition (24) of ξ̃_λ(z), yields an expression for E‖ξ̃_λ‖² involving the slope functions µ_0 and β_0. Now we use condition (22) to obtain E⟨µ_0, x⟩⁴ ≤ c² ‖L_C^{1/2} µ_0‖⁴ and a similar estimate for the remaining terms, in which the last equality follows by the same argument as in the proof of Proposition 1.
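The Chebyshev step above is the standard second-moment bound for averages of i.i.d. Hilbert space valued random variables. The following small Monte Carlo check (ours, not from the paper) illustrates that with ε = (E‖ξ‖²/(nδ))^{1/2} the deviation bound holds with empirical frequency at least 1 − δ; the Gaussian toy distribution is an illustrative assumption.

```python
import numpy as np

# Check: Prob( ||mean(xi) - E xi|| >= eps ) <= E||xi||^2 / (n eps^2),
# so eps = sqrt(E||xi||^2 / (n delta)) gives confidence at least 1 - delta.
rng = np.random.default_rng(4)
d, n, delta, reps = 20, 50, 0.1, 2000

mean_vec = np.linspace(-1.0, 1.0, d)            # E xi (illustrative)
second_moment = np.sum(mean_vec ** 2) + d       # E||xi||^2 for xi = mean_vec + N(0, I)
eps = np.sqrt(second_moment / (n * delta))

deviations = np.array([
    np.linalg.norm(np.mean(mean_vec + rng.normal(size=(n, d)), axis=0) - mean_vec)
    for _ in range(reps)
])
print(np.mean(deviations < eps))                # should be >= 1 - delta = 0.9
```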

To derive the above bound, we define a random variable η_λ(x) that takes values in the space of Hilbert–Schmidt operators on L^2(T). From (12), one can express both the empirical average and the expectation of η_λ in terms of T_x and T_0. Following the same argument as in Step 2, we see that, with confidence 1 − δ, the deviation of the empirical average of η_λ from its expectation is at most (E‖η_λ‖²_HS / (nδ))^{1/2}. Next we use the spectral decomposition of T_0 in (19) to derive a bound on E‖η_λ‖²_HS; by (21), the required estimate holds, where the last inequality follows from the Cauchy–Schwarz inequality and condition (22). Combining the estimates above, we derive bound (32). The estimation in Step 2 gives that, with confidence 1 − δ/2, the bound on S_3(z, λ) holds with δ replaced by δ/2. Using this inequality and (32), we obtain, with confidence 1 − δ,

S_1(z, λ) ≤ c_5 δ^{−1} (1 + A(λ))^{1/2} N(λ) λ^{−1/2} n^{−1},

where c_5 := (4 c² Tr(T_0))^{1/2} c_4.
Proof of Theorem 1.2. Combining Propositions 1, 2 and 3, we obtain the stated bound, where the second inequality follows from N(λ) = O(λ^{−1/(2α)}) and the third inequality follows by taking λ = n^{−6α/((4θ+4)α+3)}. This together with Theorem 1.1 leads to the desired conclusion.

4. Discussions. In this paper, we consider the estimation of individualized treatment rules (ITR) based on functional predictors within the framework of D-learning. We adopt an RKHS approach to learn the optimal ITR which maximizes the expected clinical outcome, and derive fast convergence rates for prediction. There are several possible extensions of this work. First, it would be interesting to extend our approach to settings involving multiple treatments instead of just two options. Second, to take into account the various types of patient information, it would be worthwhile to consider scalar and functional predictors in the model simultaneously. Another interesting question is how to employ distributed learning schemes [5,7,13] to handle large-scale datasets efficiently.