Learning rates for partially linear functional models with high dimensional scalar covariates

This paper is concerned with learning rates for partially linear functional models (PLFM) within reproducing kernel Hilbert spaces (RKHS), where the covariates consist of two parts: functional covariates and scalar ones. As opposed to the functional principal component analysis frequently used for functional models, the finite number of basis functions in the proposed approach is generated automatically by exploiting the reproducing property of the RKHS. This avoids the additional computational cost of the PCA decomposition and of choosing the number of principal components. Moreover, the coefficient estimators with bounded covariates converge to the true coefficients at linear rates, as if the functional term in the PLFM had no effect on the linear part. In contrast, the prediction error for the functional estimator is significantly affected by the ambient dimension of the scalar covariates. Finally, we develop a numerical algorithm for the proposed penalized approach, and simulated experiments are implemented to support our theoretical results.

1. Introduction. Learning with functional data has received considerable attention over the past years due to its broad real-world applications, including medicine, linguistics, chemometrics, and signal processing; see, for example, [10] and [16] for case studies. More concretely, see [22] for applications to electroencephalogram (EEG) signals in brain-computer interfacing, and [11] for an analysis of music and speech via integration of audio content and functional brain responses. For an introduction to the concepts and applications of functional data analysis, we refer the interested reader to the book by [17].
In the problem of functional linear regression, the predictor X(·) is a random function defined on an interval T, and the classical functional linear model between the response Y and X is

Y = ∫_T X(t) f_0(t) dt + ε,

where f_0 is called the slope function and is assumed to be square-integrable, and ε is a random error independent of X(·). Let {(Y_i, X_i) : i = 1, ..., n} denote independent and identically distributed realizations from the population (Y, X). There is much practical interest in estimating the slope function itself, or in applying it for the purpose of predicting ∫_T X(t) f_0(t) dt. It is worth noting that the problem of slope-function estimation is significantly different from that of prediction: the former is intrinsically nonparametric, whereas the latter can be either nonparametric or semiparametric. In particular, the optimal mean-square convergence rate for prediction can be attained at n^{-1}. A more detailed discussion can be found in [6]. Functional linear regression has been widely studied over the last twenty years in statistics and machine learning; see [8, 26, 18].
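As a minimal numerical illustration of this model, the discretized data-generating process can be sketched as follows; all concrete choices below (the slope function, the cosine basis, the score decay, and the noise level) are illustrative assumptions, not those of the paper:

```python
# Sketch: generate data from Y = ∫_T X(t) f_0(t) dt + eps on T = [0, 1],
# approximating the integral by a Riemann sum on a fine grid.
import numpy as np

rng = np.random.default_rng(3)
m, n = 500, 100                                   # grid size, sample size
t = np.linspace(0.0, 1.0, m)
h = t[1] - t[0]                                    # quadrature weight
f0 = np.sin(2 * np.pi * t)                         # assumed slope function

# Karhunen-Loeve style paths: X_i(t) = sum_k u_ik k^{-1} sqrt(2) cos(k pi t)
k = np.arange(1, 21)
Phi = np.sqrt(2.0) * np.cos(np.pi * np.outer(k, t))   # 20 x m basis matrix
U = rng.normal(size=(n, 20)) / k                       # decaying random scores
X = U @ Phi                                            # n x m sample paths

# Y_i = zeta_{X_i}(f0) + eps_i, with the integral as a Riemann sum
Y = h * X @ f0 + rng.normal(scale=0.1, size=n)
```

The Riemann sum h * X @ f0 plays the role of ζ_{X_i}(f_0) = ∫_T X_i(t) f_0(t) dt; finer grids give a better approximation.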
In practice, it is often the case that a response is affected both by a finite-dimensional vector and by a function-valued random variable as predictors. For instance, in growth data analysis, a child's adult height is related to his or her past height curve and to a vector of variables such as sex and the heights of the parents. These scenarios partially motivate us to study the PLFM in the high dimensional setting. To simplify notation, we suppose that Y and X are centered. This paper considers functional linear regression with a single functional predictor and scalar predictors Z = (z_1, ..., z_p)^T, a p-dimensional vector of scalar covariates. The partially functional linear regression between Y and (X, Z) is of the form

Y = ∫_T X(t) f_0(t) dt + Z^T γ^0 + ε,   (1.1)

where γ^0 = (γ_1^0, ..., γ_p^0)^T is referred to as the vector of regression coefficients of the non-functional covariates, and ε is a standard normal variable independent of X(·) and Z. In the sparse high dimensional setting, a standard assumption is that the cardinality of the active set S_0 := S(γ^0) := {j : γ_j^0 ≠ 0, j ∈ {1, ..., p}} is far smaller than p. Note that p and s_0 := |S_0| are allowed to diverge as n increases. A primary purpose of (1.1) is to extract useful information from both Z and X(·), whereas the classical functional linear model focuses on a single functional covariate alone.
Recent years have witnessed a great deal of work on high dimensional problems for linear or additive models with scalar covariates only [9, 25, 5], while only a small body of work has studied partially linear functional models with high dimensional scalar covariates. For example, [19] considered two different approaches to the PLFM in the fixed dimensional setting, which are proved to achieve the same convergence rate for the mean squared prediction error under several regularity assumptions. [21] investigated the asymptotic behavior of their proposed estimators for partial linear quantile regression via the truncation tool of functional principal component analysis (FPCA), and the associated application to neuroimaging data was considered in [28]. However, the above-mentioned work focuses only on the PLFM with fixed dimensional scalar covariates, in which case one cannot see how the ambient dimension or the sparsity structure affects the estimation performance induced by the slope function. Although [13] recently considered the PLFM with high dimensional scalar covariates, they only provided weak learning rates for their estimators of the parametric part and the slope function. Precisely, the convergence rate of the parametric part is the same as that of the slope function, of order (s_0 + s_n)/n in Theorem 1 of [13]. Note that s_n there represents the number of retained FPCA components, which diverges at a polynomial order of n, so their derived rates are suboptimal in the minimax sense. In addition, their proposed method is based on the classical FPCA, which involves computationally expensive numerical optimization. Moreover, this standard truncation via FPCA leads to a difficult choice of the number of principal components.
In this paper, we analyze the high dimensional PLFM. Our main contribution is to combine the methods used in [7] for functional linear regression over an RKHS with the Lasso [23] for linear models, thus obtaining an estimator that captures both the sparse structure and the functional information.
We establish error bounds for both the prediction risk of the functional covariate and the estimation error of the scalar covariates. Our derived rates are sharper than the existing ones in [13, 12], and our kernel-based penalized method enjoys specific advantages in comparison with the classical FPCA technique. Precisely, the convergence rate of the excess prediction risk is of order s_0^2 log p / n, determined by the ambient dimension of the scalar covariates and the sparsity parameter s_0. Surprisingly, this rate for the prediction risk is independent of the functional complexity of the RKHS, which coincides with the rates discussed in [6] and [7]. Intuitively, the prediction error for the functional covariate is parameterized by the integral operator when X(·) is sufficiently smooth. In our case of a bounded process X, it is reasonable that the learning rate for the functional prediction problem matches that of the linear part. On the other hand, the oracle rate of the linear part is the same as that for the pure linear model up to the sparsity s_0 (see [2]), as if the functional component in the PLFM had no effect on the linear part. We conjecture that our derived rates may be optimal in the minimax sense up to the sparsity s_0, which is of independent interest for future work.
Finally, we develop a numerical algorithm for the proposed penalized approach, and simulated experiments are implemented to support our theoretical results.

Notation and definitions.
Let u, v be two general random variables, denote by Q the joint distribution of (u, v), and by Q_u (resp. Q_v) the marginal distribution of u (resp. v). For a measurable function f on the product space, we define the squared L_2-norm by ‖f‖^2 := E_Q f^2(u, v), and the squared empirical norm by ‖f‖_n^2 := n^{-1} Σ_{i=1}^n f^2(u_i, v_i), where {(u_i, v_i)}_{i=1}^n are i.i.d. copies of (u, v). Note that Q may differ from line to line. For a vector γ ∈ R^p, the ℓ_1-norm and ℓ_2-norm are given by ‖γ‖_1 := Σ_{j=1}^p |γ_j| and ‖γ‖_2 := (Σ_{j=1}^p γ_j^2)^{1/2}, respectively. With a slight abuse of notation, we write ‖f‖_{L_2} := √⟨f, f⟩_{L_2} with ⟨f, g⟩_{L_2} = ∫_T f(t) g(t) dt. For two sequences {a_k : k ≥ 1} and {b_k : k ≥ 1}, we write a_k ≍ b_k if there are positive constants c and c' such that c ≤ a_k/b_k ≤ c' for all k ≥ 1. Also, a_k = O(b_k) means that there exists some constant c such that a_k ≤ c b_k for all k ≥ 1.
Kernel methods are among the most powerful learning schemes in machine learning, often taking the form of regularization schemes in a reproducing kernel Hilbert space (RKHS) associated with a Mercer kernel [1]. Recall that a kernel K is a continuous, symmetric, and positive semi-definite function on T × T. Let H := H_K denote the closure of the linear span of the set of functions {K_t := K(t, ·) : t ∈ T}. An important property of H is the so-called reproducing property:

f(t) = ⟨f, K_t⟩_H for all f ∈ H and t ∈ T.

The reproducing property ensures that any kernel-based approach defined on the infinite dimensional Hilbert space H can be equivalently transformed into a finite dimensional parametric optimization. Thus, standard finite truncation schemes such as functional PCA or spline-based approaches are avoided, and moreover the associated estimation bias can be reduced significantly. In practice, kernel-based methods are less sensitive to the dimension of the covariates than locally approximated methods, for instance local linear (or local constant) methods in nonparametric statistics.
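The Mercer-kernel properties recalled above (symmetry and positive semi-definiteness) can be checked numerically on a grid; the Gaussian kernel, bandwidth, and grid size below are illustrative choices, not the paper's:

```python
# Sketch: a Mercer kernel on T = [0, 1] evaluated on a grid, illustrating
# symmetry and positive semi-definiteness of its Gram matrix.
import numpy as np

def gaussian_kernel(s, t, gamma=10.0):
    """Gaussian (RBF) kernel K(s, t) = exp(-gamma * (s - t)^2)."""
    return np.exp(-gamma * (s - t) ** 2)

grid = np.linspace(0.0, 1.0, 50)
# Gram matrix K[i, j] = K(t_i, t_j) via broadcasting
K = gaussian_kernel(grid[:, None], grid[None, :])

assert np.allclose(K, K.T)             # symmetry
eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() > -1e-8           # positive semi-definite up to round-off
```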
Let X(·) be a square integrable centered stochastic process defined over T. The covariance function of X is the real, symmetric, and nonnegative definite function

C(s, t) := E[X(s)X(t)], s, t ∈ T.

For any given reproducing kernel R on T × T, let L_R : L_2 → L_2 denote the integral operator defined by

(L_R f)(t) := ∫_T R(t, s) f(s) ds.

By the reproducing property of the RKHS, L_R can equivalently be characterized through the inner product in H. The spectral theorem implies that there exist a family of orthonormalized eigenfunctions {φ_ℓ^R : ℓ ≥ 1} and a nonincreasing sequence of eigenvalues {θ_ℓ^R : ℓ ≥ 1} such that L_R φ_ℓ^R = θ_ℓ^R φ_ℓ^R for all ℓ ≥ 1. It is obvious that L_{R^{1/2}} = L_R^{1/2}. Moreover, define (R_1 R_2)(s, t) := ∫_T R_1(s, u) R_2(u, t) du; then we have L_{R_1 R_2} = L_{R_1} ∘ L_{R_2}. For a given reproducing kernel K and a covariance function C, define the linear operator L_{K^{1/2}CK^{1/2}} by L_{K^{1/2}CK^{1/2}}(f) = L_{K^{1/2}}(L_C(L_{K^{1/2}}(f))). If both L_{K^{1/2}} and L_C are bounded linear operators, so is L_{K^{1/2}CK^{1/2}}. By the spectral theorem, there exist a sequence of positive eigenvalues s_1 ≥ s_2 ≥ ... > 0 and a set of orthonormalized eigenfunctions {φ_ℓ : ℓ ≥ 1} such that L_{K^{1/2}CK^{1/2}} φ_ℓ = s_ℓ φ_ℓ for all ℓ ≥ 1. It is seen that the eigenvalues {s_ℓ : ℓ ≥ 1} of the linear operator L_{K^{1/2}CK^{1/2}} depend on the eigenvalues of both the reproducing kernel K and the covariance function C. We shall show in Section 2 that the minimax rate of convergence of the excess prediction risk is determined by the decay rate of the eigenvalues {s_ℓ : ℓ ≥ 1}.
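The operator L_{K^{1/2}CK^{1/2}} and its eigenvalues {s_ℓ} can be approximated by discretization; the Brownian-motion covariance C(s, t) = min(s, t) and the Gaussian kernel K below are illustrative assumptions, not the paper's setup:

```python
# Sketch: discretize L_{K^{1/2} C K^{1/2}} on a grid and compute its
# (nonincreasing, nonnegative) eigenvalues s_1 >= s_2 >= ...
import numpy as np

n_grid = 200
t = np.linspace(1e-3, 1.0, n_grid)
h = t[1] - t[0]                            # rectangle-rule quadrature weight

C = np.minimum.outer(t, t)                 # covariance of Brownian motion
K = np.exp(-10.0 * (t[:, None] - t[None, :]) ** 2)   # illustrative kernel

# Matrix square root of the (PSD) discretized kernel operator h*K
w, V = np.linalg.eigh(h * K)
w = np.clip(w, 0.0, None)                  # clip tiny negative round-off
K_half = V @ np.diag(np.sqrt(w)) @ V.T

M = K_half @ (h * C) @ K_half              # discretized L_{K^{1/2} C K^{1/2}}
s = np.sort(np.linalg.eigvalsh(M))[::-1]   # eigenvalues, descending
assert s[0] > 0 and np.all(s >= -1e-10)    # PSD spectrum
```

Plotting s against ℓ would show the joint decay induced by the spectra of K and C.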
Given samples {(Y_i, X_i, Z_i)}_{i=1}^n which are independently drawn from model (1.1), the least squares regularization scheme for the PLFM (1.1) is proposed as

(f̂, γ̂) := argmin_{f ∈ H, γ ∈ R^p} { (2n)^{-1} Σ_{i=1}^n (Y_i − ∫_T X_i(t) f(t) dt − Z_i^T γ)^2 + λ‖f‖_K^2 + ρ‖γ‖_1 }.   (2.1)

Here the tuning parameter λ > 0 controls the functional complexity of the nonparametric component, while ρ > 0, associated with the Lasso penalty [23], generates sparse coefficients for the scalar covariates. In our theory, the choices of λ and ρ are given explicitly.
Even though the proposed approach (2.1) is directly defined on an infinite-dimensional Hilbert space, the following lemma shows that this optimization is equivalent to a finite-dimensional minimization problem.
Lemma 2.1. Our proposed algorithm (2.1) defined on H × R p is equivalent to a finite-dimensional parametric convex optimization.
The proof of Lemma 2.1 is deferred to the Appendix. The current article focuses on the prediction problem for the functional covariate and on the oracle rate of the high dimensional parametric estimator. The goal of prediction is to recover the functional ζ_X(f_0), where ζ_X(f) := ∫_T X(t) f(t) dt. Before establishing our main theoretical results for the functional and parametric estimation, we introduce several pieces of notation and assumptions used in the proofs. We first define the mixed function class

G := {g(X, Z) = ζ_X(f) + Z^T γ : f ∈ H, γ ∈ R^p},

and write g_0(X, Z) := ζ_X(f_0) + Z^T γ^0. Define a sequence of projections of Z = (z_1, ..., z_p) onto the integral-type functionals of X(·) in H by

π_j(·) := argmin_{f ∈ H} E[z_j − ζ_X(f)]^2, j = 1, ..., p.

We denote Π := (ζ_X(π_1), ..., ζ_X(π_p))^T ∈ R^p and Π̃ := Z − Π. For every g ∈ G, this together with the definition of the projections yields a decomposition of g − g_0 into its functional and linear parts. For some fixed constant δ_0, we define the norm τ(·) on G (its explicit form is given in the proofs), where R is a small positive quantity to be determined later (depending on n). Under some regularity conditions, we will show that τ(ĝ − g_0) ≤ R with high probability. Then, conditional on the event {(X, Z) : τ(ĝ − g_0) ≤ R}, consistency and optimality of our estimators (f̂, γ̂) can be proved accordingly. Next, we present several technical assumptions used in our theory. Condition A (Eigenvalue condition). The smallest eigenvalue Λ^2_{min,Π̃} of E[Π̃ Π̃^T] is uniformly bounded away from zero.
Condition A is standard in the semiparametric literature (see [15]), ensuring sufficient information for estimating γ.
Parts (1) and (2) of Condition B may seem to be strong assumptions, but one can often approximate an unbounded distribution or process by its truncated version.
Part (3) of Condition B is mild, and the commonly used Gaussian kernel satisfies it.
Condition C (Positiveness condition). For some constant C_K, the stated positiveness bound holds. Condition C looks a bit strong, but it holds whenever E[X(t)X(t')] > c > 0 for all t, t' ∈ T; for example, for any centered independent-increment process with E[X^2(t)] > 0 uniformly in t ∈ T. In the case of a standard Gaussian process and T = [1, 2], we have E[X(t)X(t')] = min{t, t'} ≥ 1. This condition can also be viewed as an identification condition in the square-integral sense.
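The Brownian-motion example above can be checked by a quick Monte Carlo simulation; this is an illustrative sketch, not part of the paper's experiments:

```python
# Sketch: check E[X(s)X(t)] = min(s, t) >= 1 for standard Brownian motion
# restricted to T = [1, 2], via simulated paths.
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_steps = 20000, 100
t = np.linspace(0.0, 2.0, n_steps + 1)
dt = t[1] - t[0]

# Brownian paths: cumulative sums of N(0, dt) increments, starting at 0
W = np.concatenate(
    [np.zeros((n_paths, 1)),
     np.cumsum(rng.normal(0.0, np.sqrt(dt), (n_paths, n_steps)), axis=1)],
    axis=1,
)

i, j = np.searchsorted(t, 1.2), np.searchsorted(t, 1.7)   # s = 1.2, t = 1.7
emp = np.mean(W[:, i] * W[:, j])          # empirical E[X(s)X(t)]
assert abs(emp - min(t[i], t[j])) < 0.1   # close to min(s, t) = 1.2
assert emp > 1.0                          # bounded below by 1 on [1, 2]
```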
In order to present our general theoretical results, we introduce, for any given R > 0, events T(δ_0, R), T_1(δ_0, R) and T_2(δ_0, R) under which our results are stated explicitly.

Theorem 2.2. Let R > 0 and 0 < δ_0 ≤ 1/700 be given, and suppose Condition A holds. Then on the event T(δ_0, R), we have τ(ĝ − g_0) ≤ R.
Note that δ_0 is just a technical quantity, and we do not pursue optimality of the constants. Besides, Theorem 2.2 cannot by itself provide the optimal choices of the parameters ρ, λ and R, since we shall show that T(δ_0, R) holds only under some additional constraints on these three parameters.
It is seen from Theorem 2.2 that the learning rates for the prediction error and the linear part can be obtained easily once T(δ_0, R) holds with high probability. In the empirical process literature, T_1(δ_0, R) is related to a uniform law of large numbers over the constrained set G_R, which is often characterized by the so-called Rademacher complexity defined in Section 3. By contrast, T_2(δ_0, R) is closely related to Gaussian complexity, which has been well studied in [14]. Then, with probability at least 1 − δ_p, the stated error bounds hold. Remark that the convergence rate of the excess prediction risk is of order s_0^2 log p / n, determined by the ambient dimension of the scalar covariates and the sparsity parameter s_0. However, this rate for the prediction risk is independent of the functional complexity of the RKHS, which coincides with the rates discussed in [6] and [7]. More precisely, the prediction error for the functional covariate is parameterized by the integral operator when X(·) is sufficiently smooth. In our case of a bounded process X, it is reasonable that the learning rate for the functional prediction problem is the same as that of the linear part. On the other hand, the oracle rate of the linear part is the same as that for the pure linear model (see [2]), as if the functional component in the PLFM had no effect on the linear part.
Note also that the choice of the nonparametric tuning parameter λ is of order s_0 log p / n, which is strictly smaller than the usual choice in the nonparametric literature (i.e., n^{−2r/(2r+1)} for r-times differentiable functions). This under-smoothing requirement is often encountered in two-step estimation in statistics. Due to the smoothing effect of integration and the boundedness of the process X, we allow more complex functions to enter our algorithm, which corresponds to a smaller value of λ.
3. Computing algorithm. The proposed framework aims to solve the optimization problem

min_{f ∈ H, γ ∈ R^p} (2n)^{-1} Σ_{i=1}^n (Y_i − ζ_{X_i}(f) − Z_i^T γ)^2 + λ‖f‖_K^2 + ρ‖γ‖_1.   (3.1)

By Lemma 2.1, the minimizer of (3.1) must have the finite form f̂(t) = Σ_{j=1}^n θ_j ∫_T X_j(s) K(s, t) ds, so that ζ_{X_i}(f̂) = B(X_i)^T θ, where B(X_i) = (B_1(X_i), ..., B_n(X_i))^T ∈ R^n with B_j(X_i) = ∫_T ∫_T X_j(s) K(s, t) X_i(t) ds dt, and Σ ∈ R^{n×n} has components Σ_{ij} = ∫_T ∫_T X_i(t) K(t, s) X_j(s) dt ds. Hence the optimization problem (3.1), originally over a possibly infinite dimensional function space, becomes the optimization problem in an n-dimensional vector space

min_{θ ∈ R^n, γ ∈ R^p} (2n)^{-1} Σ_{i=1}^n (Y_i − B(X_i)^T θ − Z_i^T γ)^2 + λ θ^T Σ θ + ρ‖γ‖_1.   (3.3)

To solve (3.3), we consider a splitting algorithm with a proximal operator, which updates θ and γ sequentially. For simplicity, we denote the smooth part of (3.3) as R_n(θ, γ) and the last term as Ω(γ). Then, at the t-th iteration with current solution (θ^t, γ^t), θ is updated by a gradient step on R_n with step size 1/D, and γ by the proximal step Prox_{ρ/D} applied to its gradient step, where Prox is the proximal operator of the ℓ_1-norm, explicitly expressed coordinate-wise by the soft-thresholding rule Prox_ρ(v)_j = sign(v_j) max(|v_j| − ρ, 0). Note that in practice the exact value of the Lipschitz upper bound D is often difficult to determine in large-scale problems, and thus we recommend a backtracking scheme [4] as a more efficient alternative to approximate the value of D.
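A minimal sketch of this splitting scheme on synthetic data follows. Here Σ is a synthetic symmetric PSD stand-in for the kernel matrix (so B(X_i)^T θ = (Σθ)_i, as in the paper), and the Lipschitz bound D is computed directly from the Hessian of the smooth part rather than by the backtracking of [4]:

```python
# Sketch of the proximal splitting updates for (3.3) on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 20
A = rng.normal(size=(n, n))
Sigma = A @ A.T / n + 1e-3 * np.eye(n)   # stand-in for Sigma_ij = ∫∫ X_i K X_j
Z = rng.normal(size=(n, p))
y = rng.normal(size=n)
lam, rho = 1e-2, 1e-1

def soft_threshold(v, kappa):
    """Proximal operator of kappa * ||.||_1 (coordinate-wise soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def obj(th, ga):
    r = y - Sigma @ th - Z @ ga
    return 0.5 * np.dot(r, r) / n + lam * th @ Sigma @ th + rho * np.abs(ga).sum()

# Lipschitz bound D: largest eigenvalue of the Hessian of the smooth part R_n
H = np.block([[Sigma @ Sigma / n + 2 * lam * Sigma, Sigma @ Z / n],
              [Z.T @ Sigma / n, Z.T @ Z / n]])
D = np.linalg.eigvalsh((H + H.T) / 2).max()

theta, gamma = np.zeros(n), np.zeros(p)
o0 = obj(theta, gamma)
for _ in range(300):
    # theta step: plain gradient descent on the smooth part
    r = y - Sigma @ theta - Z @ gamma
    theta = theta - (-Sigma @ r / n + 2 * lam * Sigma @ theta) / D
    # gamma step: proximal (soft-thresholding) update for the Lasso part
    r = y - Sigma @ theta - Z @ gamma
    gamma = soft_threshold(gamma - (-Z.T @ r / n) / D, rho / D)

assert obj(theta, gamma) <= o0 + 1e-12   # monotone decrease with step 1/D
```

With step size 1/D and D bounding the Lipschitz constant of the gradient, each block update cannot increase the objective, which the final assertion checks.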
4. Numerical analysis. In this section, we illustrate the performance of the proposed method in simulated cases, using the simulation setting of [27]. Specifically, we take T = [0, 1]; the true slope function and the random function X(·) are generated as in [27], where the coefficients ξ_k are pre-defined as ξ_k = (−1)^{k+1} k^{−v/2} with some pre-specified positive constant v. For the linear part, the true regression coefficients are γ^0 = (−5, 5, 0, ..., 0)^T, and the design matrix Z = (Z_1, ..., Z_n)^T ∈ R^{n×p}, with rows Z_i = (Z_{i1}, ..., Z_{ip})^T, is generated as Z_{ij} ∼ N(0, 1). The response is then generated from model (1.1), where ε_i ∼ N(0, 1) denotes the random noise. The proper choice of the tuning pair (λ, ρ) plays a crucial role in achieving the desired performance of the proposed method. Here, we apply the ABIC criterion used in [13]. Specifically, given a fixed pair (λ, ρ), we consider the criterion

ABIC(λ, ρ) = log RSS(λ, ρ) + Card(γ̂) log n / n,

where RSS(λ, ρ) is the residual sum of squares of the fitted model and Card(γ̂) denotes the number of nonzero components of γ̂. We then choose the pair (λ, ρ) that minimizes this criterion via grid search, with the grid set to {10^{−2+0.1s} : s = 0, ..., 40}.
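The ABIC grid search can be sketched as follows. The Lasso fit here is a crude soft-thresholded OLS surrogate for the actual penalized estimator (and only ρ is searched, without the functional term), used purely to illustrate the criterion and the grid:

```python
# Sketch: ABIC(rho) = log RSS + Card(gamma_hat) * log(n) / n, minimized
# over the paper's grid {10^{-2+0.1s} : s = 0, ..., 40}.
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 10
Z = rng.normal(size=(n, p))
gamma0 = np.zeros(p)
gamma0[:2] = [-5.0, 5.0]                 # matches gamma^0 = (-5, 5, 0, ..., 0)
y = Z @ gamma0 + rng.normal(size=n)

def abic(rss, card, n):
    return np.log(rss) + card * np.log(n) / n

ols = np.linalg.lstsq(Z, y, rcond=None)[0]
grid = [10 ** (-2 + 0.1 * s) for s in range(41)]
best = (np.inf, None, None)
for rho in grid:
    # crude Lasso surrogate: soft-threshold the OLS fit (near-orthogonal design)
    gam = np.sign(ols) * np.maximum(np.abs(ols) - rho, 0.0)
    rss = np.sum((y - Z @ gam) ** 2)
    score = abic(rss, np.count_nonzero(gam), n)
    if score < best[0]:
        best = (score, rho, gam)

score_star, rho_star, gam_star = best
assert np.count_nonzero(gam_star[:2]) == 2   # the two true signals survive
```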
For the numerical example, we use a kernel constructed from Bernoulli polynomials, where B_m(·) denotes the m-th Bernoulli polynomial. The scenarios with n = 100, 200 and v = 1.1, 2 or 4 are considered, and each scenario is replicated 50 times. The averaged performance measures are summarized in Table 1, including ‖γ̂ − γ^0‖_2, E‖f̂ − f_0‖ and ‖ŷ − y‖_2. Note that the expectation and the prediction error are evaluated on 1000 independently and identically generated testing samples; similar measures have been adopted by [27] and [13].
Table 1. The averaged performance measures of the proposed method in the simulated example.

Note that the integrals in the calculation of Σ, B(X_i) and f̂(t) can be approximated by summations for practical purposes. Specifically, in our method, we generate 1000 equally spaced points in [0, 1] and evaluate the integrals on these points. Clearly, we conclude that as n and v increase, the performance of the proposed method improves, in the sense that all the errors decrease significantly.
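The Riemann-sum approximation of the double integrals can be sketched as follows; the sample curves and the kernel are illustrative assumptions, not the paper's choices:

```python
# Sketch: approximate Sigma_ij = ∫∫ X_i(t) K(t, s) X_j(s) dt ds by a
# Riemann sum on 1000 equally spaced points in [0, 1].
import numpy as np

m = 1000
t = np.linspace(0.0, 1.0, m)
h = t[1] - t[0]

# illustrative sample paths and kernel
X = np.vstack([np.sin(np.pi * t), np.cos(np.pi * t)])   # two curves X_1, X_2
K = np.exp(-(t[:, None] - t[None, :]) ** 2)

Sigma = h * h * (X @ K @ X.T)        # Sigma_ij ≈ ∫∫ X_i K X_j
assert Sigma.shape == (2, 2)
assert np.allclose(Sigma, Sigma.T)   # symmetric, since K is symmetric
```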

5. Proofs. In order to prove Theorem 2.2, we need the classical Talagrand concentration inequality, which can be found in [3] or [20].

Lemma 5.1 (Talagrand). With B := sup_{g∈G} ‖g‖ and U := sup_{g∈G} ‖g‖_∞, there exists a universal constant N_0 such that the concentration bound of [3, 20] holds.

Lemma 5.2. Choose the tuning parameters such that √n ρ ≥ √(2 log p)/δ_1, R ≤ 1/√s_0, R/ρ ≤ c_1√s_0 and R^2 ≤ c_2λ, where c_1, c_2 are fixed constants and δ_1, δ_2 are sufficiently small constants depending mainly on δ_0. Then, with probability at least 1 − exp(−δ'nρ^2) for a small enough constant δ' > 0, the event T_1(δ_0, R) holds.

Proof. In order to apply Talagrand's concentration inequality, it suffices to bound the three key quantities E[Z], B and U. To this end, we introduce a Rademacher sequence {σ_i}_{i=1}^n independent of {(X_i, Z_i)}_{i=1}^n. By symmetrization, it is shown in Corollary 3.4 of [24] that E sup_{g∈G_R} |‖g‖_n^2 − ‖g‖^2| can be upper bounded by a sum of three Rademacher averages, where we write ζ_i(f) := ζ_{X_i}(f) for ease of notation. We consider these three terms respectively. Note that for ζ_X(f) + Z^Tγ ∈ G_R, Condition B yields a uniform bound, where we used the fact that R^2/(c_1ρ) ≤ 1, due to R ≤ 1/√s_0 and R/ρ ≤ c_1√s_0. By the contraction property of Rademacher complexity, the first term is controlled, where we used the assumption √n ρ ≥ √(2 log p)/δ_1; thus we obtain (5.1). We now consider the second term: by the contraction property again, and by Jensen's inequality, a corresponding bound follows, where we used Condition C. Note now that for any f satisfying ζ_X(f) + Z^Tγ ∈ G_R, we have |ζ_X(f) + Z^Tγ| ≤ R and |Z^Tγ| ≤ C_z‖γ‖_1 ≤ C_z√(7/δ_0) R^2/ρ. The triangle inequality is then applied, since R/ρ ≤ c_1√s_0 by assumption. Hence it follows from (5.2) that the third term is bounded, where the second inequality is based on the contraction property of Rademacher complexity, and the third one follows from the results derived in (5.2)–(5.3). Thus, combining (5.1), (5.3) and (5.4), we conclude the bound on E[Z]. Besides, we also obtain the bound on B, where we used the assumption R^2 ≤ c_2λ.
By the definition of G_R, the envelope U is bounded accordingly. Then, taking r := nρ^2δ', an application of Talagrand's concentration inequality yields the claimed bound with probability at least 1 − exp(−nρ^2δ'). Here δ'_1 := 8c_1δ_0δ_1C_z^2/7 + 8κ(1 + c_1)C_x^2δ_2 + 2(1 + c_1)C_zC_x^2δ_1√(7/δ_0).
So we obtain the desired result with probability at least 1 − exp(−δ'nρ^2).
To handle the term T_2(δ_0, R), we introduce the following lemma. Besides the Rademacher processes, the Gaussian concentration inequality plays an important role in our analysis as well; it is given in Theorem 7.1 of [14].

Lemma 5.3. Let G = {G_t}_{t∈T} be a centered Gaussian process indexed by a countable set T such that sup_{t∈T} G_t < ∞ almost surely. Then

P( sup_{t∈T} G_t ≥ E sup_{t∈T} G_t + r ) ≤ exp(−r^2/(2σ̄^2)),

where σ̄^2 := sup_{t∈T} E[G_t^2] < ∞.

Lemma 5.4. Suppose the stated bounds hold with a small enough δ_3 > 0. Then, with probability at least 1 − exp(−δ'nρ^2), the event T_2(δ_0, R) holds.

Proof. For any g ∈ G_R, we use the stated decomposition. First, by Condition C and the same arguments as in Lemma 5.2, one gets the corresponding bound, as well as ‖γ‖_1 ≤ R^2√(7/δ_0)/ρ. Since the ε_i's are i.i.d. sub-Gaussian random variables, Jensen's inequality yields the required moment bound, conditionally on any fixed {z_ij}. Combining all the preceding results and taking r := √δ' ρR, Lemma 5.3 implies that, with probability at least 1 − exp(−δ'nρ^2), the desired bound holds, provided δ_3, δ_1 and δ' are sufficiently small with respect to δ_0.