HIERARCHICAL APPROXIMATIONS FOR DATA REDUCTION AND LEARNING AT MULTIPLE SCALES

This paper describes a hierarchical learning strategy for generating sparse representations of multivariate datasets. The hierarchy arises from approximation spaces considered at successively finer scales. A detailed analysis of stability, convergence and the behavior of error functionals associated with the approximations is presented, along with a well-chosen set of applications. Results show the performance of the approach as a data reduction mechanism on both synthetic datasets (univariate and multivariate) and a real (geospatial) dataset. The sparse representation generated is shown to efficiently reconstruct data and minimize prediction error. The approach is also shown to generalize well to unseen samples, extending its prospective application to statistical learning problems.


1. Introduction. Hierarchical and multiscale modeling is widely used in both data-driven and physics-based models [49,30,34,42]. In this paper, we focus on a strategy which analyzes data at multiple scales s by constructing corresponding basis representations B_s ∈ R^{n×l_s} and inferring whether these scale- and data-dependent bases are able to approximate the observed data f|_X ∈ R^n (X ∈ R^{n×d}) to satisfy

||f|_X − B_s C_s||_2 ≤ TOL,    (1)

where l_s ∈ N is the number of basis functions chosen by the algorithm at scale s, TOL is a user-defined tolerance, and n is the number of observations in a d-dimensional space (x_i ∈ R^d, 1 ≤ i ≤ n). C_s in (1) is computed as the optimal coordinate of projection on B_s. Besides constructing these bases at each scale, the algorithm identifies 'representative' data points which form the corresponding sparse data X_s with X_s ⊆ X. The triple {s, X_s, C_s} allows reconstruction of the original dataset and prediction (a model) at any new point in the domain of interest, resulting in both a sparse representation [37,16,45] and an efficient model [28,12]. The code for the proposed approach is available online.

Data sparsification seeks to find a subset X_s ⊆ X = {x_1, x_2, ..., x_n} ⊂ Ω (x_i ∈ R^d for d-dimensional inputs) for which we have data f|_X = (f(x_1), f(x_2), ..., f(x_n))^T ∈ R^n, such that we may obtain an acceptable reconstruction or approximation to f as in equation (1). This subset, together with the corresponding projection coordinate C_s with respect to the basis set B_s, is regarded as our obtained sparse representation (D_s^sparse = [X_s, C_s]). Thus, this compressed representation of the dataset acts as a model for new prediction points X* ⊂ Ω and for reconstruction of the original dataset f|_X at X ⊂ Ω as needed. The approximation problem [14,41] is a more traditional one, seeking to construct an approximation (Af): R^d → R of the underlying function f which was discretely observed (f|_X), such that it minimizes the normed error measure ||(Af) − f||.
Following standard notation, for a normed space V and any approximation f̃ to f with f, f̃ ∈ V, the quality of f̃ is usually quantified by the error norm ||f − f̃||_V. Here f̃ is usually constructed by first fixing a finite-dimensional approximation space Q ⊆ V and then computing the optimal approximation (Af) ∈ Q by minimizing the normed distance between f and Q:

(Af) = argmin_{q ∈ Q} ||f − q||_V.    (2)

We now motivate our development of data-dependent bases (B_s ≡ B(X)_s) [22]. The Mairhuber-Curtis theorem states that for d ≥ 2 and n ≥ 2 there is no Chebyshev system B = {s_1, ..., s_n} on R^d; therefore, for some data X = {x_1, x_2, ..., x_n} ⊂ R^d, the Vandermonde matrix approaches singular behavior, affecting the quality of approximation. As a remedy, we choose a data-dependent continuous kernel function K: R^d × R^d → R which is positive definite and symmetric on Ω. Thus, for data points X = {x_1, x_2, ..., x_n}, the kernel function yields the matrix (K(x_i, x_j))_{1≤i,j≤n}. From the literature on these functions [3,11,50], such kernels are reproducing in the native Hilbert space H of functions on Ω in the sense

f(x) = ⟨f, K(x, ·)⟩_H for all f ∈ H, x ∈ Ω.    (3)

The functions K(x, ·) are the Riesz representers of the linear point functionals δ_x: f → f(x) in the dual space H* of H. Therefore we have the relationship (assuming the traditional definition of the norm in the dual space)

K(x, y) = ⟨K(x, ·), K(y, ·)⟩_H = ⟨δ_x, δ_y⟩_{H*},  x, y ∈ Ω.    (4)

For observed data at X = {x_1, x_2, ..., x_n}, one can consider the space of trial functions D_X = span{K(·, x_j): x_j ∈ X} as a choice for the approximation space. However, from work such as [22,24] we know that the bases formed by translates of the kernel functions are highly dependent on the distribution of data points in X and hence numerically ill-conditioned for many problems.
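This ill-conditioning is easy to observe numerically. The sketch below (hypothetical equispaced points and length scales, not the paper's experimental setup) compares the condition number of the Gaussian Gram matrix for strongly overlapping versus nearly orthogonal kernel translates:

```python
import numpy as np

def gaussian_gram(X, ell):
    """Gram matrix of the Gaussian kernel K(x_i, x_j) = exp(-||x_i - x_j||^2 / ell)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / ell)

X = np.linspace(-1.0, 1.0, 40)[:, None]   # 40 equispaced points on [-1, 1]
G_wide = gaussian_gram(X, ell=2.0)        # strongly overlapping translates
G_narrow = gaussian_gram(X, ell=0.005)    # nearly orthogonal translates

# Strong overlap drives the Gram matrix towards numerical singularity.
print(np.linalg.cond(G_wide))             # enormous (numerically singular)
print(np.linalg.cond(G_narrow))           # modest
```

The wide-support Gram matrix is numerically singular even though it is positive definite in exact arithmetic, which is precisely the situation the pivoted QR sampling of the next section is designed to handle.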
To deal with this issue, at each scale s our approach considers a subset of the original trial functions as bases such that

Γ_s = span{K_s(·, x_i): x_i ∈ X_s} ≈ span{K_s(·, x_j): x_j ∈ X},  X_s ⊆ X.    (5)

These linearly independent, data-derived basis functions are sampled using a pivoted QR strategy, as detailed in the following section (Algorithm 1). It should be noted that at each scale we have a different kernel function K_s. Hence, effectively, we are working in a different RKHS at every scale (from the uniqueness property [3] of kernel functions). Let this scale-dependent RKHS be represented as H_s. Therefore, at any scale s, the function to be approximated (f) is assumed to be in the RKHS H_s. However, our approach aims to find an approximation A_s f ∈ Γ_s to f as in (5). Hence, in the problem formulation (2), H_s is the space V and Γ_s is our approximation space Q. While our overall approach is somewhat related to statistical emulators [32], the notion of scale is rarely articulated in that domain.
The research problem we handle here is fairly general and has close relationships with many other research areas. Multiresolution Analysis (MRA) implementing wavelet decomposition is among the earliest literature on multiscale processing of datasets ([39], [21]). Multiscale modeling also has close relationships with geometric analysis ([18], [19], [20], [38], [48]) involving diffusion maps to analyze datasets at different scales. The inherent hierarchy in our approximation space is also motivated by a wide literature on Hierarchical Radial Basis Functions (HRBF) ([26], [8], [4], [25]). The combination of basis functions at different scales to produce a desired approximation forms the foundation of the majority of this literature. This is very similar to the idea of multiscale analysis explored in [5] and [6]. However, in the current work we show that this may be an overkill (in terms of computation and generalization) in many situations, and instead we focus on one particular optimal scale which is able to model the dataset efficiently and also allows data reduction. Hierarchical approximations also find applications in data visualization [51]. Since our approach generates data-driven hierarchical basis functions belonging to an RKHS, the proposed approach is also related to works such as [13] and [2], where the authors consider data-dependent multiscale dictionaries that generalize wavelets in a geometric sense, with the capability to analyze non-linear geometries. The literature on modeling large physical systems through partial differential equations has utilized the idea of multilevel analysis through multigrid methods ([10], [46]). Following this idea, many research works have extended multigrid methods to the general field of machine learning and data analysis ([29], [36], [44]), closely relating them to our approach.
Since our approach generates data-driven basis functions centered at the produced sparse representations (essentially producing data-driven unstructured grids), it is also related to methods implementing sparse grids ([7], [31]) for dimensionality reduction of datasets. The hierarchical approach which we discuss here samples the sparse basis set using a pivoted QR decomposition. However, there is also a rich body of literature on the Column Subset Selection (CSS) problem ([17], [9], [40], [23]) that can be connected with this work, leading to approximate (less accurate than pivoted QR) but faster methods for generating the sparse representation.
Overall, the contributions of this paper can be summarized as follows.
• Firstly, we introduce an approach which computes a scale-dependent (s) sparse representation D_s^sparse of a large dataset. Besides data reduction, the algorithm also provides a model for approximation at new data points in the observation space using just D_s^sparse, enabling prediction from a compressed version of the dataset.
• We then provide a detailed analysis of convergence, establishing the optimality of the solution obtained by the algorithm, along with the computation of point-wise error functionals and their behavior, which governs the performance of the approach.
• At each scale, besides providing confidence bounds which quantify our belief in the estimation, we also provide prediction intervals which estimate the bounds within which the algorithm expects any observation not included in the sparse representation, as well as any new sample made in the future, to lie. This also makes the sparse representation more useful.
• Finally, we provide stability estimates to further establish the dependability of the approach.
2. The hierarchical approach. Developing further on the work of [5], we introduce a methodology for data reduction through efficient basis construction, exploiting the multilevel nature of the correlation structure in the data. The basic steps involved in the approach are presented as Algorithm 1.
Algorithm 1 Hierarchical approach
1: Input: dataset D = [X, f|_X], hyperparameters [T, P, TOL], prediction points X*
2: s ← 0
3: repeat
4:   Compute covariance kernel: [G_s on X with ℓ_s = T/P^s]
5:   Remove sampling bias: [W = A G_s with A ∈ R^{k×n} and a_{i,j} ∼ N(0, 1)]
6:   Column-pivoted QR on W; subset the sparse representation in X_s
7:   Solve B_s C_s = f|_X for the projection coordinate C_s
8:   Generate approximation at s: (A_s f)|_X = B_s C_s
9:   s ← s + 1
10: until ||f|_X − (A_s f)|_X||_2 ≤ TOL
11: Output: sparse representation D_s^sparse = [X_s, C_s]; predictions at X*

Suppose function values f|_X = {f_1, ..., f_n} are obtained at data points X = {x_1, x_2, ..., x_n} (f|_X ∈ R^n and X ∈ R^{n×d}). The scalars [T, P, TOL] ∈ R^3 are the algorithmic hyperparameters defined by the user. TOL is the simple 2-norm error tolerance, and P is assumed to be 2 (based on [5]). This choice of P reduces the length scale of the kernel (K) by a factor of 0.5 at each scale increment, providing an intuitive understanding of how the support of the basis functions adapts to scale variation. If we take the diameter of the dataset to be the distance between the most distant pair of data points, then T is given by

T = max_{x_i, x_j ∈ X} ||x_i − x_j||_2.    (6)

Besides these hyperparameters, the algorithm also accepts the prediction points X* ⊂ Ω ⊂ R^d, which represent the data points at which the user wants to approximate the underlying function. We also need to choose the positive definite kernel function. For this paper we use the squared exponential (also referred to as Gaussian) kernel

K_s(x, y) = e^{−||x−y||_2^2 / ℓ_s},  x, y ∈ Ω,    (7)

for mapping the covariance structure and generating the space of trial functions Γ_s (5) at each scale s.
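As a concrete illustration, the diameter T and the resulting length-scale schedule ℓ_s = T/P^s can be computed as follows (a sketch with hypothetical data; `pdist` supplies the pairwise distances):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(100, 2))   # hypothetical 2-D observations

T = pdist(X).max()                          # diameter: largest pairwise distance
P = 2.0
ells = [T / P ** s for s in range(6)]       # ell_s = T / P^s, halved at each scale
```

Each scale increment halves ℓ_s, so the kernel support shrinks geometrically as the hierarchy is refined.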
Here ℓ_s (the scale-dependent length-scale parameter) determines the width of the basis support at a particular scale. This squared exponential kernel is very widely used in the Gaussian process literature [43] and is a popular choice for most spatial statistics problems. The learning phase (generating D_s^sparse) of the proposed algorithm is explained in STEP-1 and STEP-2 below. STEP-3 constitutes the prediction phase, which uses the produced sparse representation of the dataset. It should be noted that it is not required to wait until the convergence of the algorithm to go to the prediction phase; in fact, it is possible to predict from the sparse representation at each scale (D_s^sparse). We show the prediction phase after convergence for the purpose of clarity.

STEP-1 (Getting sparse data X_s). Given the dataset D = [X, f|_X], the algorithm begins with the computation of the covariance operator G_s (7). However, based on [22,24], the distribution of the dataset might lead to ill-conditioning of this covariance kernel. Therefore we carry out a column-pivoted QR decomposition to identify the space Γ_s (at each scale) which represents the span of the trial functions K_s(·, x_j), 1 ≤ j ≤ n, at scale s. The QR decomposition is carried out on W to obtain the permutation matrix P_R. W is produced as the product of a random normal matrix A with G_s. Here A ∈ R^{k×n} with l_s = rank(G_s) (l_s ≤ k ≤ n) and k = l_s + 8 (as in [5]), giving us 8 additional rows to account for numerical round-offs during the QR decomposition. However, this is a conservative step, and even without any additional sampled rows the algorithm was found to perform well. The permutation matrix P_R produced by the decomposition captures the information content of each column of W. P_R is then used to extract independent columns with the biggest norm contributions, along with the observation points these columns correspond to in the covariance kernel (G_s).
This set of sampled observations from the original dataset is termed the corresponding sparse data (X_s). The number of columns sampled from G_s comes from its numerical rank, estimated by strategies such as a rank-revealing QR or an SVD decomposition.

STEP-2 (Getting the projection coordinate C_s). Once we have the relevant columns of G_s (denoted B_s), which also represent the approximation subspace Γ_s spanning D_X, the algorithm proceeds to solve the over-constrained system B_s C_s = f|_X. We can think of this as an orthogonal projection problem: f|_X needs to be projected onto the column space of B_s, and we must compute the specific weighting of the vectors in the basis matrix B_s which produces this projection. The algorithm then computes the orthogonal projection (A_s f)|_X given the required coordinates and basis vectors in B_s. Once we reach the scale s = S_a at which the algorithm satisfies the 2-norm condition, it produces the sparse representation D_{S_a}^sparse = [X_{S_a}, C_{S_a}] and proceeds to the prediction stage if required by the user. Here S_a is called the convergence scale, as detailed in the following subsection.

STEP-3 (Prediction at X* from D_{S_a}^sparse). For computing functional values at unobserved locations X*, basis functions are constructed by computing G*_{S_a} (the Gaussian kernel between the prediction points and the sparse data X_{S_a}). The prediction step weighs the constructed bases with the coordinates of orthogonal projection C_s (C_{S_a}, if the prediction is being made at the convergence scale) and linearly combines them to produce the required approximation.
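The three steps can be sketched in a few lines. This is a simplified single-scale illustration, not the authors' released code; the data, length scale `ell`, and rank precision `delta` are hypothetical:

```python
import numpy as np
from scipy.linalg import qr

def sparse_fit_predict(X, f, ell, X_star, extra_rows=8, delta=1e-10, seed=0):
    """One scale of STEP-1/2/3 (a sketch): pivoted-QR selection of the sparse
    set X_s, least-squares projection coordinate C_s, and prediction at X_star."""
    n = len(X)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    G = np.exp(-sq / ell)                                   # covariance kernel G_s
    sv = np.linalg.svd(G, compute_uv=False)
    l_s = int(np.sum(sv >= delta * sv[0]))                  # numerical rank
    A = np.random.default_rng(seed).standard_normal((min(l_s + extra_rows, n), n))
    W = A @ G                                               # remove sampling bias
    _, _, piv = qr(W, pivoting=True)                        # column-pivoted QR
    idx = np.sort(piv[:l_s])                                # indices of sparse data X_s
    B = G[:, idx]                                           # basis matrix B_s
    C, *_ = np.linalg.lstsq(B, f, rcond=None)               # solve B_s C_s = f|_X
    sq_star = np.sum((X_star[:, None, :] - X[idx][None, :, :]) ** 2, axis=-1)
    return idx, C, np.exp(-sq_star / ell) @ C               # prediction at X_star

X = np.linspace(0.0, 1.0, 60)[:, None]
f = np.sin(2 * np.pi * X[:, 0])
idx, C, recon = sparse_fit_predict(X, f, ell=0.1, X_star=X)
```

With a suitable length scale, a strict subset of the points already reconstructs the smooth test signal to high accuracy, which is exactly the compression behavior exploited later in the experiments.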
Finally, before moving forward, it is worth mentioning that the 2-norm criterion is just one of many possible norms which can be used to measure the scale-dependent approximation error.

2.2. Critical scale (S_c) and convergence scale (S_a). In our work, the scale at which the kernel matrix becomes numerically full rank is referred to as the critical scale. Working with Proposition 3.7 in [5], if δ represents the precision of the rank for the Gaussian kernel matrix, then we can define the numerical rank of the kernel as

rank_δ(G_s) = #{j : σ_j(G_s) ≥ δ · σ_1(G_s)},    (8)

where σ_j(G_s) is the j-th largest singular value of G_s. Also let |I_i| represent the length of the bounding box of the data in the i-th dimension (i ∈ [1, d]). Then, given the length-scale parameter ℓ_s, the rank of the Gaussian kernel can be bounded above as

rank_δ(G_s) ≤ C · |I_1| |I_2| ··· |I_d| · ℓ_s^{−d/2},    (9)

for a constant C independent of ℓ_s. Proposition 3.7 in [5] states that the numerical rank of the Gaussian kernel matrix is proportional to the volume of the minimum bounding box B = I_1 × I_2 × ... × I_d and to ℓ_s^{−d/2}. Therefore, for a fixed data distribution, the following relation holds:

rank_δ(G_s) ∝ ℓ_s^{−d/2} = (P^s / T)^{d/2}.    (10)

Hence (for P = 2 and a fixed d) the numerical rank of the Gaussian covariance kernel increases exponentially with the scale of study. Therefore, there exists a minimum scale s = S_c at which the kernel becomes full rank and continues to stay full rank as the scale is further increased. This scale is referred to as the critical scale (S_c). Hence, based on the overall Algorithm 1, if X_1, X_2, ..., X_s, ... are the sampled sparse data at successive scales, they satisfy |X_1| ≤ |X_2| ≤ |X_3| ≤ ... ≤ |X_{S_c}|, where |·| denotes the cardinality of a set. Therefore, with increasing scale more data points are added, leading to better approximations.
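The exponential growth of the numerical rank with scale, and its saturation at the critical scale, can be checked numerically. The sketch below uses hypothetical 1-D data with P = 2 and precision δ = 10^{-10}:

```python
import numpy as np

def numerical_rank(G, delta=1e-10):
    """Number of singular values above delta times the largest one, as in (8)."""
    sv = np.linalg.svd(G, compute_uv=False)
    return int(np.sum(sv >= delta * sv[0]))

X = np.linspace(-1.0, 1.0, 80)[:, None]
sq = (X - X.T) ** 2
T = 2.0                                     # diameter of the data
ranks = [numerical_rank(np.exp(-sq / (T / 2.0 ** s))) for s in range(16)]
# ranks grow with s until they saturate at full rank n = 80 (the critical scale)
```

The first scale at which the rank reaches n = 80 plays the role of S_c for this toy configuration.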
Thus, for any scales j and i satisfying j ≥ i, the following relation holds:

||f|_X − (A_j f)|_X||_2 ≤ ||f|_X − (A_i f)|_X||_2.    (11)

From approximation theory [15], we can now state that for the kernel at the critical scale and a finite point set X ⊂ Ω: (a) G_{S_c} is positive definite; (b) if (A_{S_c} f) ∈ Γ_{S_c} vanishes on X, then (A_{S_c} f) ≡ 0; and (c) the approximation (A_{S_c} f) satisfying the interpolation conditions

(A_{S_c} f)(x_j) = f(x_j),  1 ≤ j ≤ n,    (12)

is unique. We can now define the convergence scale (S_a) as the minimum scale s which satisfies

||f|_X − (A_s f)|_X||_2 ≤ TOL.    (13)

Therefore, S_a is the scale at which Algorithm 1 stops. It is worth noting that, based on the definitions of S_c and S_a, we have S_a ≤ S_c and |X_{S_a}| ≤ |X_{S_c}|.

3. Convergence properties. Following the notation introduced in the previous sections, again let f be the function to be approximated and A_s f the approximation produced by Algorithm 1 at scale s. Correspondingly, let E_s = f − A_s f denote the error, and let R_s = B_s B_s^† be the projection operator so that (A_s f)|_X = R_s f|_X.

Lemma 3.1. For any choice of TOL ≥ 0 and any norm ||·|| on R^n chosen by the user, the proposed algorithm (Algorithm 1) converges in the sense

||E_s|_X|| = ||f|_X − (A_s f)|_X|| ≤ TOL    (14)

for some finite s ≤ S_c and the corresponding produced approximation A_s f.

Proof. From Equation (10), we know that the numerical rank of the bases B_s increases monotonically with scale s. Hence, as s approaches S_c, R_s approaches I_n. Thus ||I_n − R_s||_2 → 0 and hence ||E_s|_X||_2 → 0. This establishes the convergence of the proposed approach in the 2-norm at some finite scale S_a ≤ S_c. From the equivalence of norms on a finite-dimensional vector space [35], this generalizes the convergence guarantee to all possible norms on R^n.
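The scale loop and the stopping criterion of Lemma 3.1 can be sketched as follows. This is a simplified illustration with hypothetical data; for brevity the randomized sketching step of Algorithm 1 is omitted and the pivoted QR is applied to G_s directly:

```python
import numpy as np
from scipy.linalg import qr
from scipy.spatial.distance import pdist

def run_hierarchy(X, f, TOL=1e-2, P=2.0, max_scale=15, delta=1e-10):
    """Scale loop of Algorithm 1 (a sketch): refine until the 2-norm
    reconstruction error drops below TOL, returning the convergence scale S_a."""
    T = pdist(X).max()                                  # dataset diameter
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    errors = []
    for s in range(max_scale + 1):
        G = np.exp(-sq / (T / P ** s))                  # G_s with ell_s = T / P^s
        sv = np.linalg.svd(G, compute_uv=False)
        l_s = int(np.sum(sv >= delta * sv[0]))          # numerical rank
        _, _, piv = qr(G, pivoting=True)                # pivoted QR column selection
        B = G[:, np.sort(piv[:l_s])]                    # basis matrix B_s
        C, *_ = np.linalg.lstsq(B, f, rcond=None)       # projection coordinate C_s
        errors.append(np.linalg.norm(f - B @ C))        # ||E_s|_X||_2
        if errors[-1] <= TOL:
            break
    return s, errors

X = np.linspace(0.0, 1.0, 100)[:, None]
f = np.sin(8 * np.pi * X[:, 0])
S_a, errors = run_hierarchy(X, f)
```

The loop stops at the first scale satisfying (13); since the rank saturates at the critical scale, termination at some finite S_a ≤ S_c is guaranteed for any TOL > 0.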
In the following theorem we analyze a particular hierarchical updating scheme.
Theorem 3.2. If the projection update of Algorithm 1 is written in the iterative form

(A_{s+1} f)|_X = (A_s f)|_X + α_s E_s|_X,    (15)

where E_s = f − A_s f, then α_s follows the bounds 0 < α_s < 2, and the convergence rate ρ_s satisfies ρ_s = |1 − α_s|.

Proof. Since at each scale s, A_s f is generated as the projection of f onto the space Γ_s, it follows from the definition of E_s|_X that any convergent algorithm requires α_s ≥ 0. Rearranging (15), taking an inner product with respect to E_s|_X, and using the Cauchy-Schwarz and triangle inequalities, we get

0 ≤ α_s ≤ 1 + ||E_{s+1}|_X|| / ||E_s|_X||.

Now, for convergent approximations, ||E_{s+1}|_X|| ≤ ||E_s|_X||. Therefore, for all scales, α_s ∈ [0, 2]. The end members α_s ∈ {0, 2} imply the iterative scheme has converged and no improvement is needed (as reflected in the stopping criterion in Algorithm 1). Hence we remove these end members, obtaining the desired bounds. Rewriting equation (15) as E_{s+1}|_X = (1 − α_s) E_s|_X yields the convergence rate ρ_s = |1 − α_s|.

We now analyze the inner product of the error of the projection (E_s = f − A_s f) with the projection at scale s in the native RKHS. The major power of the following theorem lies in the fact that, as the proposed algorithm converges, we are able to upper bound this inner product as a function of the user-defined tolerance (TOL). This provides the user direct control over the approximation properties of the algorithm even in the native RKHS. It should be noted again that at each scale s the function f to be approximated is assumed to belong to the Hilbert space H_s and is approximated in Γ_s, justifying the use of the reproducing property.
Theorem 3.3. The approximation A_{S_a} f and its corresponding error E_{S_a} at the convergence scale (s = S_a) satisfy

⟨E_{S_a}, A_{S_a} f⟩_{H_{S_a}} ≤ ||C_{S_a}||_1 · TOL,    (16)

where C_{S_a} = (|c_1|, |c_2|, ..., |c_{l_{S_a}}|)^T ∈ R^{l_{S_a}} contains the moduli of the coordinates of projection at S_a, and TOL is the 2-norm convergence error tolerance in Algorithm 1.

Proof. Beginning with the inner product in the RKHS H_{S_a} and using the reproducing property,

⟨E_{S_a}, A_{S_a} f⟩_{H_{S_a}} = Σ_{x_j ∈ X_{S_a}} c_j ⟨E_{S_a}, K_{S_a}(·, x_j)⟩_{H_{S_a}} = Σ_{x_j ∈ X_{S_a}} c_j E_{S_a}(x_j) ≤ ||C_{S_a}||_1 max_j |E_{S_a}(x_j)| ≤ ||C_{S_a}||_1 ||E_{S_a}|_X||_2.

However, from the convergence criterion in the proposed Algorithm 1, we know ||E_s|_X||_2 ≤ TOL at s = S_a. Hence the theorem follows.

4. Approximation properties and confidence intervals. This section provides estimates quantifying the quality of the approximations generated by the proposed algorithm at each scale. The first result is a direct application of Theorem 3.3.

4.1. Approximation properties.
the approximation (A Sa f ) produced by Algorithm 1 at the convergence scale (S a ), follows the Pythagoras Theorem in the limit T OL → 0. i.e, Proof. Let f be the function to be approximated. As stated earlier, it was discretely observed at X ∈ Ω leading to the restricted function f | X . Starting with the norm of the function in the Hilbert Space Now, we will provide results related to uniqueness and quality of solution.
the approximation (A s f ) ∈ Γ s in the limit T OL → 0 produced by the proposed algorithm at the convergence scale s = S a is 1. the unique orthogonal projection to f 2. the unique best approximation to f with respect to || · || Hs Proof. (1): Again using the reproducing property of RKHS xj ∈Xs The rest of the proof is similar to steps in the proof of theorem 3.3 showing the considered native Hilbert space inner product vanishes as T OL → 0 (2): Given A s f as the unique orthogonal projection to f , result in part 2 follows from property of projections covered in standard approximation theory [1]. Now, we will analyze the error functional and pointwise error bounds. At scale s, we are searching for a solution in the space Γ s . Our approximation is of the form Any A s f ∈ Γ s can also be expressed as and λ y is application of linear functional λ on y. Thus, we have the dual space H * s = xj ∈Xs c j δ xj .

PRASHANT SHEKHAR AND ABANI PATRA
Next, we present a result which provides an inner product representation for the approximation at a point x ∈ Ω. The objective here is to show the optimality of the approximation A_s f by requiring a vanishing gradient for the norm of the error functional in H_s.

Theorem 4.3. The approximation at any x ∈ Ω admits the inner product representation

(A_s f)(x) = ⟨M̂_s(x), f|_X⟩,    (19)

where M̂_s(x) ∈ R^n minimizes the norm of the error functional ε_{s,x} = δ_x − Σ_{j=1}^n (M_s(x))_j δ_{x_j}, establishing the optimality of the approximation (A_s f) at each scale s.

Proof. We begin by expressing the error functional norm using (4):

||ε_{s,x}||²_{H*_s} = K_s(x, x) − 2 M_s(x)^T R_s(x) + M_s(x)^T G_s M_s(x),    (20)

where R_s(x) = (K_s(x, x_1), ..., K_s(x, x_n))^T. Differentiation with respect to M_s(x) yields the first-order optimality condition for the M̂_s(x) that minimizes ||ε_{s,x}||²_{H*_s}:

G_s M_s(x) = R_s(x).    (21)

From Algorithm 1, we know G_s need not be full rank. Using the matrix A (as in Algorithm 1), constructing W (= A G_s) and a column-pivoted QR decomposition W P_s = QR, and applying the permutation operator P_s to equation (21) yields

P_s^T G_s M_s(x) = P_s^T R_s(x).    (22)

Since G_s is a symmetric operator, sampling the first l_s rows of P_s^T G_s, where l_s represents the numerical rank of G_s, produces B_s^T, the transpose of the bases considered at scale s. Correspondingly, sampling the respective values in R_s(x) produces the restriction of R_s(x) to the set X_s (represented as R_s(x)|_{X_s}). Therefore, restricting system (22) to the equations corresponding only to X_s,

B_s^T M_s(x) = R_s(x)|_{X_s},    (23)

where we use the Moore-Penrose inverse to get the optimal projection of R_s(x)|_{X_s}. Thus the optimal solution for M_s(x) is given by

M̂_s(x) = (B_s^T)^† R_s(x)|_{X_s}.    (24)

Now, computing the inner product ⟨M̂_s(x), f|_X⟩ recovers (A_s f)(x), which completes the proof.

We now provide approximation results for the optimal point evaluation functional. For the approximation at each scale, we can obtain the variance associated with that prediction by minimizing the squared error of evaluation at x in the native reconstruction space. However, since there is no noise or uncertainty associated with the data, this variance will be zero.
Theorem 4.4. The optimal point evaluation functional ψ_{s,x} at x ∈ Ω has the representation

ψ_{s,x} = Σ_{j=1}^n (M̂_s(x))_j δ_{x_j},    (25)

and the associated minimal error functional norm satisfies

min ||ε_{s,x}||²_{H*_s} = K_s(x, x) − M̂_s(x)^T R_s(x).    (26)

Proof. We begin with the expression for the norm of the error functional in equation (20), whose minimizer M̂_s(x) is given by equation (24). Writing the optimal approximation functional at x as ψ_{s,x}(f) = Σ_j (M̂_s(x))_j f(x_j) = ⟨M̂_s(x), f|_X⟩ completes the proof of equation (25). Now, considering the error functional at x and putting the optimal M̂_s(x) into equation (20), we get

||ε_{s,x}||²_{H*_s} = K_s(x, x) − 2 M̂_s(x)^T R_s(x) + M̂_s(x)^T G_s M̂_s(x).

Recognizing ψ_{s,x} from equation (25) and using the optimality condition (21), the last two terms combine to −M̂_s(x)^T R_s(x), which completes the proof of equation (26).
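In the full-rank case, the minimized error-functional norm (26) reduces to the classical power function K_s(x, x) − R_s(x)^T G_s^{-1} R_s(x), which vanishes at the data sites and is positive in between. A minimal numerical check (hypothetical sites and length scale):

```python
import numpy as np

def power_function(x, X, ell):
    """Minimal error-functional norm at x in the full-rank case (a sketch):
    ||eps_x||^2 = K(x, x) - R(x)^T G^{-1} R(x), with K(x, x) = 1."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    G = np.exp(-sq / ell)                                  # full-rank Gram matrix
    R = np.exp(-np.sum((X - x) ** 2, axis=1) / ell)        # R(x) = (K(x, x_j))_j
    return 1.0 - R @ np.linalg.solve(G, R)

X = np.linspace(0.0, 1.0, 6)[:, None]                      # 6 data sites
at_site = power_function(X[2], X, ell=0.03)                # ~0 at a data site
midpoint = power_function(np.array([0.5]), X, ell=0.03)    # positive between sites
```

The near-zero value at a data site reflects the interpolation property at the critical scale, while the positive value between sites quantifies the pointwise uncertainty of the reconstruction.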
Before moving forward, we also provide a result for bounding the error functional.

4.2. Confidence and prediction intervals. Confidence intervals are, in general, a measure of our belief in the estimated approximation. Prediction intervals, on the other hand, refer to the bounds around the mean fit where a future data point is expected to fall. This is crucial information in conjunction with the sparse representation: even when we are not able to capture the function accurately at initial scales, we have an estimate of the expected behavior of the observations. Algorithm 1 makes predictions at each scale based on the corresponding sparse representation (D_s^sparse). Here, if the error in approximation is greater than the user-defined tolerance, i.e. ||f|_X − (A_s f)|_X||_2 > TOL, then the approximation A_s f needs additional degrees of freedom to capture the underlying data generation process (as, in principle, it is the best possible approximation given a basis set). For modeling this error value, we consider a model of the form (where f|_X is considered a noisy realization of a true underlying process at a fixed and finite X)

f|_X = B_s C_s + ε,  ε ∼ N(0, σ² I_n).    (27)

Therefore, the sampling distribution for f|_X is given as

f|_X ∼ N(B_s C_s, σ² I_n).    (28)

Since we know that the projection coordinate is given as C_s = B_s^† f|_X, we have

E[C_s] = B_s^† E[f|_X] = B_s^† B_s C_s = C_s,    (29)

which shows an unbiased estimator. It should be noted that, since we are modeling f|_X (given a fixed X) as a stochastic process, E[C_s] = C_s is true in the asymptotic sense, with the expectation taken over infinite realizations of f|_X given a fixed X. Hence C_s is the best unbiased coordinate approximation at scale s given the bases B_s. Now, for computing the confidence and prediction intervals, we start by computing the prediction basis (B_s*) at x* ∈ Ω, which is expressed as a set of Gaussians from equation (7) centered at the sparse set X_s and evaluated at x* (resulting in a row vector of length l_s).

Then we obtain the mean and variance of the underlying process as

E[f(x*)] = B_s* C_s,  Var[f(x*)] = σ² B_s* (B_s^T B_s)^{-1} B_s*^T.    (30)

For an estimated value σ̂ of σ, we can write the corresponding standard deviation as

sd(x*) = σ̂ (B_s* (B_s^T B_s)^{-1} B_s*^T)^{1/2}.    (31)

Therefore, if we use the t-distribution for the confidence bounds, we get the following 100(1 − α)% confidence interval:

B_s* C_s ± t_{α/2, n−l_s} σ̂ (B_s* (B_s^T B_s)^{-1} B_s*^T)^{1/2}.    (32)

Here l_s is again the numerical rank of G_s and n is the original number of observations made. For the prediction interval, the variance of a new observation adds a further σ². Hence the 100(1 − α)% prediction intervals are given as

B_s* C_s ± t_{α/2, n−l_s} σ̂ (1 + B_s* (B_s^T B_s)^{-1} B_s*^T)^{1/2}.    (33)

For the estimated value of σ², we use its unbiased estimate at scale s given as

σ̂² = ||f|_X − B_s C_s||²_2 / (n − l_s).    (34)

5. Stability properties. In this section we provide bounds related to the stability of the approximations obtained by the proposed algorithm. The first result bounds the approximation at scale s with respect to the L∞ topology for a compact domain Ω ⊂ R^d.
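These interval formulas are the standard ordinary-least-squares ones applied to the basis matrix B_s. A sketch (with hypothetical Gaussian bases, centers, and noise level):

```python
import numpy as np
from scipy.stats import t as student_t

def intervals(B, f, B_star, alpha=0.05):
    """t-based confidence and prediction half-widths for the model
    f|_X = B_s C_s + eps, eps ~ N(0, sigma^2 I)  (a sketch)."""
    n, l_s = B.shape
    C, *_ = np.linalg.lstsq(B, f, rcond=None)          # C_s = B^+ f|_X
    resid = f - B @ C
    sigma2_hat = resid @ resid / (n - l_s)             # unbiased variance estimate
    BtB_inv = np.linalg.inv(B.T @ B)
    h = np.einsum('ij,jk,ik->i', B_star, BtB_inv, B_star)  # b* (B^T B)^{-1} b*^T
    q = student_t.ppf(1.0 - alpha / 2.0, df=n - l_s)
    ci = q * np.sqrt(sigma2_hat * h)                   # confidence half-width
    pi = q * np.sqrt(sigma2_hat * (1.0 + h))           # prediction half-width
    return B_star @ C, ci, pi

rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 50)[:, None]
centers = X[::10]                                      # 5 hypothetical basis centres
B = np.exp(-(X - centers.T) ** 2 / 0.05)               # Gaussian bases, ell = 0.05
f = np.sin(2 * np.pi * X[:, 0]) + 0.05 * rng.standard_normal(50)
mean, ci, pi = intervals(B, f, B)
```

By construction, the prediction band is strictly wider than the confidence band at every point, since it accounts for the additional variance of a new observation.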
Theorem 5.1. The approximation A_s f at any scale s is bounded in the L∞(Ω) norm as

||A_s f||_{L∞(Ω)} ≤ sup_{x ∈ Ω} (σ_{s_max}(D_s))^{1/2} ||f|_X||_2,    (35)

where σ_{s_max} denotes the largest singular value and D_s is obtained by applying the extension operator B_s^{†T} to G_{X_s} = R_s(x)|_{X_s} R_s(x)|_{X_s}^T, i.e. D_s = B_s^{†T} G_{X_s} B_s^†.

Proof. Expressing (A_s f) in terms of the inner product as in equation (19) and applying the Cauchy-Schwarz inequality,

|(A_s f)(x)| = |⟨M̂_s(x), f|_X⟩| ≤ ||M̂_s(x)||_2 ||f|_X||_2.    (36)

Since M̂_s(x) M̂_s(x)^T = B_s^{†T} R_s(x)|_{X_s} R_s(x)|_{X_s}^T B_s^† = D_s, a rank-one matrix, we have

||M̂_s(x)||²_2 = tr(D_s) = σ_{s_max}(D_s).    (37)

Writing P_s^∞ = sup_{x ∈ Ω} ||M̂_s(x)||_2, combining equations (36) and (37) and taking the supremum over x ∈ Ω gives

||A_s f||_{L∞(Ω)} ≤ P_s^∞ ||f|_X||_2 ≤ sup_{x ∈ Ω} (σ_{s_max}(D_s))^{1/2} ||f|_X||_2,    (38)

which proves (35). (A lower bound on P_s^∞ follows from the pseudoinverse of the tall thin matrix B_s, for which ||B_s^†||_2 = 1/σ_{s_min}(B_s), with σ_{s_min} the smallest nonzero singular value.)

We conclude this section by providing a bound on the approximation at any scale s at some point x ∈ Ω.

Theorem 5.2. For observed data f|_X = (f(x_1), f(x_2), ..., f(x_n))^T ∈ R^n on X ⊂ Ω, the absolute value of the approximation produced by the hierarchical algorithm (A_s f) at scale s ≤ S_a < S_c at any point x ∈ Ω is bounded as

|(A_s f)(x)| ≤ Σ_{j=1}^n (D_s(j, j))^{1/2} ||f||_{H_s},    (39)

where D_s is defined as in Theorem 5.1. Also, if the convergence happens at the critical scale (S_c), then at convergence the bound (39) can be simplified as

|(A_{S_c} f)(x)| ≤ n · σ_max(G_{S_c}^{-1}) ||f||_{H_s},    (40)

where σ_max is the maximum singular value operator.

Proof. Here again we begin with the absolute value of the approximation produced by the proposed algorithm at scale s, bounding it through the representer M̂_s(x) and the native-space norm of f; the last step uses equation (35). Now, if the convergence happens at s = S_c = S_a, then D_{S_c} = G_{S_c}^{-1} and therefore D_{S_c}(j, j) = G_{S_c}^{-1}(j, j). Writing g(j, j) = G_{S_c}^{-1}(j, j), we get

|(A_{S_c} f)(x)| ≤ Σ_{j=1}^n g(j, j) ||f||_{H_s} ≤ n · σ_max(G_{S_c}^{-1}) ||f||_{H_s},

since each diagonal entry of a symmetric positive definite matrix is bounded by its largest singular value.

Remark 1. We note that the bounds here are conservative and depend on the dataset size n. The assumed global overlap of the basis functions leads to the loose upper bound. However, as Figure 4 (in the next section) shows, the basis functions have a rapid decay, and the bounds attained in practice are much smaller.

6. Results and analysis. This section analyzes the behavior of the proposed approach on a variety of datasets under different conditions. The first subsection studies the performance on synthetic datasets. This is important, as here we know the truth, and hence quantification of performance (both in terms of reconstruction and out-of-sample prediction) becomes feasible. The following subsection analyzes the performance of Algorithm 1 on a real spatial dataset.
Since, in reality, estimating the true generalization error is not feasible (as the sampling distribution is not known), following standard practice we empirically estimate this error by the performance on an unseen test set.
6.1. Analysis on synthetic datasets. Here we have chosen a set of 4 test functions (Figure 1) from the literature [47], providing our proposed algorithm the sampled data from which to learn the underlying function. These test functions are shown in Figure 1 and include:
• Test Function 1 (TF1): the Gramacy and Lee test function.
After sampling the data from these test functions, the values along each axis (X and Y for TF1 and TF2, and X, Y and Z for TF3 and TF4, respectively) are normalized between −1 and 1 by dividing the measurements by the corresponding absolute maximum value along that axis. It should be noted that for most of the analysis presented here, we have sampled 200 equidistant points for TF1 and TF2, while for TF3 and TF4 points are sampled on a 50 × 50 grid. The motivation is to sample data points from the test functions and then reconstruct the functions from the sampled data using the generated sparse representation (D_s^sparse). These functions were specifically chosen as they have many curvature changes and multiple local minima and maxima, which makes learning the functional form difficult. As mentioned before, besides reconstruction we also test the generalization (prediction at new data locations outside the training sample) ability of our approach. To give a sense of the quality of performance, we also compare the results of our analysis with the performance of Algorithm 4 from [5] on the same experimental setup.
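As an illustration of the sampling and normalization protocol, the sketch below assumes the standard Gramacy & Lee (2012) form for TF1 (the paper's exact definition is not reproduced here, so this functional form is an assumption):

```python
import numpy as np

def gramacy_lee(x):
    """Gramacy & Lee (2012) benchmark (assumed form), usually posed on [0.5, 2.5]."""
    return np.sin(10.0 * np.pi * x) / (2.0 * x) + (x - 1.0) ** 4

x = np.linspace(0.5, 2.5, 200)     # 200 equidistant samples, as in the experiments
y = gramacy_lee(x)

# Normalize each axis to [-1, 1] by dividing by its absolute maximum.
x_n = x / np.abs(x).max()
y_n = y / np.abs(y).max()
```

After this per-axis rescaling, both coordinates lie in [−1, 1], so a single length-scale schedule can be used across all test functions.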
We begin with the analysis of the convergence of the algorithm on the test functions.

6.1.1. Convergence behavior. Figure 2 shows the convergence behavior of Algorithm 1 by studying the 2-norm error of the prediction with respect to the original observations. TOL = 10^-2 was used in Algorithm 1 for generating these results. Following the notation used earlier, S_a here represents the convergence scale, with S_c being the critical scale. Figure 2 also illustrates the proportion of the dataset used at each scale s for generating the approximation A_s f, which is essentially the proportion of the dataset used as the sparse representation. From figure 2 we can therefore make inferences such as: for test function 1, at scale 6, with 23% of the data points, the proposed algorithm generated an approximation A_6 f with a 2-norm error below 10^-5. Note that, based on their individual structure, the error measure decays in a distinct manner for each of the 4 test functions, which have different convergence scales. Figure 3 shows the convergence bounds from Theorem 3.3; the results are shown in 1-norm, and figure 3 plots the right-hand side of the bound at each scale s in the native Hilbert space. The sharp drop in this bound for all 4 test functions demonstrates the capacity of the algorithm to produce good approximations. Precisely, at higher scales in the native Hilbert space, the error in prediction approaches orthogonality with respect to the approximation A_s f.

6.1.2. Scale dependent D^s_sparse and corresponding reconstructions. Figures 4 and 5 show the behavior of the selected sparse representation and the approximation A_s f produced with increasing scales. In this subsection we present results only for TF2 and TF3, analyzing the performance of the algorithm in the univariate and multivariate settings respectively.

Figure 6. Confidence and prediction intervals for reconstruction from D^s_sparse, from scale 0 to scale 5, for TF2. Along with the approximation produced at these scales (solid curve), the plots also show the selected sparse representation (star-marked points) with the 95% t-confidence interval (thinner, dark shaded region) and the 95% t-prediction interval (broader, light shaded region).

Starting with figure 4, the plots in the left column show that with increasing scale, the support of the basis functions becomes narrower. This corresponds to the increasing numerical rank of the kernel matrix G_s with increasing s. The wider support of the basis functions at the initial scales also explains the corresponding over-smoothed approximations; this is similar to the behavior of approximation strategies with global bases (for example, polynomial-based approximation). In the right column, for every scale we state the number of points chosen as the sparse representation (out of 200 points). Note that the star-marked points represent the D^s_sparse sampled from the smaller dotted data points. Our motivation here is not just to show that the algorithm can learn the underlying function with few points, but also that the algorithm has an inbuilt capacity to choose a small set of representative points which appropriately capture the function structure. Figure 5 shows the corresponding result for the dropwave function. Since proper visualization of the basis functions for a surface is challenging, here we show only the locations (in the X-Y plane) of the points chosen in the sparse representation; the respective bases would be centered at these locations. One additional thing to note is the higher density of sampling near the edges of the domain. This corresponds directly to the fact that, to match the curvature appropriately at the edges, the algorithm needs more points there, as there is no scope for learning beyond the edges.
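The scale loop that produces these results can be sketched in a few lines. The following is a minimal, illustrative implementation only: it assumes a Gaussian kernel whose bandwidth halves with each scale, a pivoted QR decomposition to select well-conditioned basis columns, and a simple rank cutoff; the paper's Algorithm 1 differs in its exact kernel, bandwidth schedule and stopping details.

```python
import numpy as np
from scipy.linalg import qr

def multiscale_fit(X, f, tol=1e-2, max_scale=12):
    # Sketch of the scale loop (assumptions: Gaussian bases, bandwidth
    # halved per scale, diagonal-of-R rank cutoff). At each scale s we
    # select basis columns via pivoted QR, project f onto them, and stop
    # once the 2-norm error drops below tol.
    n = len(X)
    for s in range(max_scale + 1):
        eps = 1.0 / (2 ** s)                        # scale-dependent bandwidth
        G = np.exp(-((X[:, None] - X[None, :]) / eps) ** 2)
        _, R, piv = qr(G, pivoting=True)
        # keep columns up to the numerical rank of G at this scale
        rank = int(np.sum(np.abs(np.diag(R)) > 1e-10 * abs(R[0, 0])))
        Xs = piv[:rank]                             # indices of sparse points
        Bs = G[:, Xs]
        Cs, *_ = np.linalg.lstsq(Bs, f, rcond=None)
        err = np.linalg.norm(f - Bs @ Cs)
        if err < tol:
            return s, Xs, Cs, err
    return s, Xs, Cs, err
```

Here `multiscale_fit` returns the convergence scale, the indices of the sparse points, the projection coordinates, and the final 2-norm error.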
The corresponding reconstructions also show how the specific features are learned over the scales. Again, since the points were sampled on a 50 × 50 grid, D^s_sparse consists of data points sampled from 2500 design points.

6.1.3. Confidence and prediction intervals.
The results for this section are shown in figure 6, for scale 0 to scale 5. The smaller dotted points show the original data and the star-marked points represent those sampled for D^s_sparse. The thinner bands show the 95% confidence interval on the estimated approximation at each scale. Note that we have not carried out a full Bayesian analysis here; instead, we use the fitting variance (31) as a proxy for the variance and use it to scale our intervals. Specifically, we use t-confidence bands, which are suitable for smaller datasets (compared to Gaussian bounds) and tend to the normal bounds in the limit of larger datasets. The thicker bands show the prediction interval. The main idea is that if, at a scale s, we have our sparse representation D^s_sparse, then these prediction intervals let us estimate where all the deleted points, and data points to be sampled in the future, would lie (bounds on generalization, in some sense). Equivalently, at a particular scale s, if we fix a location x ∈ X, then we can be 95% confident that the mean of the y values observed at that x lies within the thinner confidence bounds. Similarly, for a fixed x, the broader prediction interval shows the range in which any new observation made in the future will lie. The results also confirm that with increasing scale both bands become very tight, showing confidence in the produced approximation.
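The interval construction described above can be sketched as follows, using the fitting variance as a proxy (as in the paper, this is not a full Bayesian analysis). The helper name `t_bands` and the simplified band formulas, which ignore leverage at individual locations, are illustrative assumptions.

```python
import numpy as np
from scipy.stats import t as t_dist

def t_bands(f, f_hat, n_basis, level=0.95):
    # 95% t-intervals scaled by the fitting variance (a proxy, not a full
    # Bayesian treatment). n_basis is the number of basis functions used
    # at this scale; leverage terms are omitted for simplicity.
    n = len(f)
    dof = max(n - n_basis, 1)
    sigma2 = np.sum((np.asarray(f) - np.asarray(f_hat)) ** 2) / dof
    tq = t_dist.ppf(0.5 + level / 2, dof)
    half_conf = tq * np.sqrt(sigma2 / n)            # band for the mean response
    half_pred = tq * np.sqrt(sigma2 * (1 + 1 / n))  # band for a new observation
    return half_conf, half_pred
```

As expected, the prediction half-width is always wider than the confidence half-width, matching the thicker and thinner shaded regions in figure 6.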
6.1.4. Importance metric for design points. This section further explores applications of the proposed algorithm; the results of this study are presented in figure 7. The idea is based on the requirement that, besides obtaining the sparse representation at each scale, we sometimes also need an ordering of the data points in D^s_sparse in decreasing order of importance with respect to efficient reconstruction of f. This matters because, if a measurement at a design point is of very high priority, more resources can be engaged to measure that particular observation accurately; it also tells us, if the data needs to be compressed further, in what sequence the data points can be removed. A test case is shown in figure 7(I). It contains the importance ranking computed from the order in which the points were sampled while generating the sparse representation. It shows that the most important point (rank 1) at scale 0 is near the center of the function (and hence should rank last in the ordering for deletion). This confirms our belief that an observation near the center is crucial for capturing the behavior of the function to be approximated. The second and third most important points (ranks 2 and 3 respectively) are found at the very ends of the curve, which again is logical given that the algorithm needs precise information to capture the function at the edges. These results also make sense because of the inherent symmetry of the 1-D Schwefel function under consideration. Note that the location of these important points depends on the nature of the function under consideration, so for other functions the locations might be very different from those obtained in figure 7(I). Also, this was the result of one random run of the approach; a different run might produce a different ordering.
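Since the sparse points are sampled via a pivoted QR decomposition, the pivot order itself provides the importance ranking discussed above: the first pivot is the most important design point. A minimal sketch (the kernel choice and bandwidth here are illustrative assumptions):

```python
import numpy as np
from scipy.linalg import qr

# The pivot order of a pivoted QR decomposition of the kernel matrix gives
# a natural importance ranking: earlier pivots contribute most to the
# approximation. Gaussian kernel with eps = 0.5 is an assumed example.
X = np.linspace(-1, 1, 50)
G = np.exp(-((X[:, None] - X[None, :]) / 0.5) ** 2)
_, _, piv = qr(G, pivoting=True)
ranking = piv          # piv[0] is the most important design point, etc.
```

For this symmetric example the first pivot lands near the center of the domain, mirroring the rank-1 behavior seen in figure 7(I).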
Now, to get a more detailed picture of the importance metric, we ran 1000 simulations of Algorithm 1 and present the distribution (of location) of the top 3 most important points. The results are shown in figure 7(II). Here the X-axis varies from 0 to 200 because we considered 200 sampled points from TF2 as our learning set. The analysis was run for scales 0, 1, 2, 3, 5 and 10 to study the behavior of the distribution. It can be clearly seen that all the weight of the distribution for the most important point (I:1, top row) is concentrated at the center of the domain for the initial scales (leftmost columns). For the same scales, the second (middle row) and third (bottom row) most important points have all their mass concentrated at the edges (again most obvious in the leftmost columns). However, as we move towards the critical scale (right columns, where all points are included in the sparse representation), all points have an approximately uniform importance density distribution. This is expected: when all points are sampled, no single point is more important than any other.

6.1.5. Comparison with Algorithm 4 in Bermanis et al. As mentioned earlier, [5] was one of the major motivators for the current work. In that paper, algorithm 4 for multiscale data sampling and function extension comes close in behavior to our hierarchical approach (though the approximation strategy is different). Briefly, the idea there can be summarized as follows. Suppose the approximation at scale s is represented as H_s f; starting with scale 0, H_0 f is the approximation to f produced at scale 0, so we can write f ≈ H_0 f. However, when we move further, the error orthogonal to the search space at the previous scales becomes the target function for the next scale.
Therefore, denoting the residual at scale s by e_s (with e_0 = f), each scale fits its residual and passes on e_{s+1} = e_s - H_s e_s. With a user-defined error tolerance (err), the authors define the convergence scale (s*) as the first scale satisfying ||e_{s*+1}|| ≤ err, so the final approximation to f is of the form Hf = Σ_{s=0}^{s*} H_s e_s. Hence algorithm 4 in [5] uses bases from all scales for the final approximation. If we take a closer look at our algorithm (Algorithm 1), we see that although our approach also samples bases at each scale, the final approximation uses only the bases at the convergence scale. This enables our algorithm to have a uniform length scale, with basis functions centered at data points spread 'nicely' throughout the approximation domain. This nature of the basis functions (obtained as a result of the pivoted QR decomposition) makes our basis matrix (B_s) well conditioned. The mixing of scales in algorithm 4 of [5] makes conditioning an issue, which is problematic when computing the coordinates of projection. This provides a well-founded motivation for our work. Furthermore, mixing basis functions at multiple scales (as in [5]) ties the approximation very closely to the training data and hence leads to relatively worse generalization, as will be seen in the next subsection.
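A Berm-style residual cascade can be sketched as follows, with a least-squares Gaussian-kernel fit standing in for the exact construction of [5] (an assumption for illustration). At each scale the residual, i.e. the error from all previous scales, becomes the new target, and the final approximation sums the per-scale fits.

```python
import numpy as np

def berm_style_fit(X, f, err_tol=1e-6, max_scale=15):
    # Sketch of the multiscale residual scheme of [5]: at each scale, fit
    # the current residual with a kernel fit at a finer bandwidth and
    # accumulate the per-scale fits; the final approximation mixes all
    # scales. (Least-squares kernel fit is a stand-in, not the paper's
    # exact construction.)
    residual = f.copy()
    fits = []                               # (eps, coeffs) per scale
    approx = np.zeros_like(f)
    for s in range(max_scale + 1):
        eps = 1.0 / (2 ** s)
        G = np.exp(-((X[:, None] - X[None, :]) / eps) ** 2)
        c, *_ = np.linalg.lstsq(G, residual, rcond=None)
        Hs = G @ c
        approx += Hs
        residual = residual - Hs            # error becomes the next target
        fits.append((eps, c))
        if np.linalg.norm(residual) <= err_tol:
            break
    return approx, fits
```

Note that reconstructing f at a new point under this scheme requires evaluating and summing the kernel fits stored in `fits` for every scale, which is the latency issue discussed below.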
The behavior (in terms of approximation quality) of algorithm 4 from [5] (referred to as 'Berm' henceforth) is compared with Algorithm 1 (correspondingly referred to as 'Shek') in figure 8. The decay of error shows similar behavior for both algorithms. For TF1 (in (a)), Shek converges somewhat faster. However, for TF3 and TF4 (in (c) and (d)), Berm reduces the error to below machine precision faster than Shek. Although impressive, this does not make Berm any more useful in practice, because both methods are already at error levels of 10^-13 at such high scales.
With comparable approximation (reconstruction) capabilities, we now move to the analysis of training and prediction latency. A detailed analysis of these results is shown in table 1. Starting with training, both Shek and Berm obtain approximation bases at each scale using a pivoted QR decomposition; Shek models the original target function, while Berm models the error orthogonal to the search space up to the previous scale. The training times for the two approaches are therefore comparable: table 1 shows that in some cases Shek takes less time, with Berm performing more efficiently in others. The differences can be attributed to the fact that Berm must keep track of the points sampled at each scale from s = 0 to s = s*, along with computing the error (which serves as the target at the next scale). These factors increase the training latency for Berm in most cases; however, depending on the function structure, Berm can converge relatively quickly in a few cases, leading to less training time.
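The structural source of the prediction-latency difference can be sketched directly: a Shek-style model predicts with a single kernel block at the convergence scale, while a Berm-style model must evaluate and sum kernel fits at every scale. Function names and the Gaussian kernel here are illustrative assumptions.

```python
import numpy as np

def predict_shek(x_new, X_sparse, C, eps):
    # One kernel block against the sparse points at the convergence
    # scale, followed by a single matrix-vector product.
    B = np.exp(-((x_new[:, None] - X_sparse[None, :]) / eps) ** 2)
    return B @ C

def predict_berm(x_new, X_train, coeffs_per_scale):
    # Must loop over every scale up to s*, re-evaluating the kernel
    # against the full training set each time.
    y = np.zeros(len(x_new))
    for eps, c in coeffs_per_scale:
        y += np.exp(-((x_new[:, None] - X_train[None, :]) / eps) ** 2) @ c
    return y
```

With a single scale the two coincide; the cost gap grows linearly with the number of scales Berm has to mix, which is consistent with the prediction timings in table 1.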
Now, if we consider prediction at a new design point, Berm must iteratively compute the corresponding approximation at each considered scale, as in equation (46). This is where Shek outperforms Berm, by requiring only the linear combination of bases formulated with respect to the sparse representation at the convergence scale. This characteristic of Shek also allows us to talk about a sparse representation of the dataset, which is not possible under the definition of Berm. Table 1 shows this behavior clearly. Here we measured the time each algorithm takes for prediction alone, with all learning assumed to be performed beforehand; |X_pred| denotes the number of design points at which the prediction is to be made. Note that, unlike the training time, Shek consistently performs better at prediction for all the considered cases. [27] presented an approach similar to Berm from [5]. They, however, used Delaunay triangulations to thin the dataset at each scale, effectively producing a nested set of data points leading finally to the original set X. At each scale, they project the error onto the current approximation space (similar to Berm). Since their approach is based on a similar combination of multiple scales (with Delaunay triangulations instead of pivoted QR decomposition), we have chosen to compare only to Berm (the algorithm in [27] is expected to exhibit similar issues relative to our algorithm). Also, since the pivoted QR decomposition samples data points based on the approximation power of the functions centered at them (coupling the thinning and approximation problems), it is expected to perform better than triangulation strategies, which aim only at thinning the dataset, with the approximation carried out separately.

6.1.6. Generalization analysis.
In the previous experiments, we analyzed the capability of our hierarchical approach for reconstructing the analytical test functions considered in this study (evaluating performance on the same data). However, another important aspect of learning approaches which we must also study is the ability to generalize (predict) well to unseen samples (not available to the algorithm during the learning phase). Additionally, here we also study the performance of our approach when the training dataset has non-uniform sampling (the samples do not lie on well-defined grids). For analyzing these two aspects, in this section we divide the data for TF1, TF2, TF3 and TF4 (200 samples for TF1 and TF2, and 2500 samples for TF3 and TF4) into training and testing sets, using 50% for training and the remaining 50% for testing. The training samples are randomly selected from the approximation domain, providing a training dataset with non-uniform sampling. Also, since we test on the remaining 50% of the dataset, this quantifies the ability of the approach to predict at new data points.
Here we denote the training and testing datasets as D_train and D_test respectively. Figure 9 shows the analysis of one random run of the entire experiment. The left column ((a), (c), (e) and (g)) shows the training and testing samples from the 4 test functions. The corresponding plot on the right shows the reconstruction obtained using just the training data (in the previous experiments we used the full dataset for reconstruction). The headers of plots (b), (d), (f) and (h) show the testing root mean square error. Because of the strong 'structure' in the data for (b) and (f), the testing error there is especially small. We then compared these results to the performance of Berm (shown in figure 10) on the same training and testing data for the 4 test functions. Note that Shek gives (approximately) an order of magnitude better performance than Berm when predicting on unseen samples. The reason can again be attributed to the way Berm fits very closely to the data, by projecting the prediction error onto the approximation space at the next scale and then mixing the scales to solve the 'data fitting' problem. This is specifically evident from the performance on TF2. In figure 9 (c), we can see a large cluster of testing data points (with very few training samples) around 0.75 to 0.9. This leads to a somewhat non-smooth reconstruction by Shek in that region (figure 9 (d)); however, Berm performs considerably worse there (figure 10 (b)), producing undesired spurious fluctuations. These fluctuations explain Berm's poor behavior on unseen samples. To show the consistency of the better performance of our approach (Shek) compared to Berm, we conducted an additional experiment, shown in figure 11. Here we repeated the experiment of figures 9 and 10 100 times (100 different random splits into training and testing) and computed the RMSE of prediction on the testing dataset (D_test).
Bar plots in figure 11 (a) and (b) show the mean and standard deviation of this RMSE for both Shek and Berm. The important thing to note is the consistently lower mean RMSE of Shek across the 100 different train/test splits, compared to Berm, on all 4 test functions. This shows superior generalization ability irrespective of the sampling distribution. Similarly, the lower standard deviation of the RMSE for all 4 test functions in (b) shows the relatively stable nature of our algorithm with respect to how the samples were generated.
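The repeated-split experiment can be sketched generically as follows; `fit_predict` stands for any learner (Shek, Berm, or a baseline), and the interpolation learner used in the test is purely a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

def rmse(a, b):
    return np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def split_rmse(X, f, fit_predict, n_runs=100, train_frac=0.5):
    # Repeat random 50/50 train/test splits and collect the test RMSE, as
    # in the stability experiment; fit_predict(Xtr, ftr, Xte) is any
    # learner returning predictions at the test locations.
    out = []
    n = len(X)
    for _ in range(n_runs):
        idx = rng.permutation(n)
        k = int(train_frac * n)
        tr, te = idx[:k], idx[k:]
        pred = fit_predict(X[tr], f[tr], X[te])
        out.append(rmse(pred, f[te]))
    out = np.array(out)
    return out.mean(), out.std()
```

The mean and standard deviation returned here correspond directly to the bar heights in figure 11 (a) and (b).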
6.2. Application on real data. In this section we analyze the performance of Algorithm 1 (Shek) on a real observed dataset (where the true underlying function is not known). Specifically, we consider a particular type of spatial dataset known as a Digital Elevation Model (DEM): a topographical map of a particular region. The data are arranged on a grid, with each grid node (x, y) associated with a height measurement; here X and Y are projection coordinates on a horizontal plane (usually transformed from latitudes and longitudes). The DEM dataset we consider is shown in figure 12 (a). As in the previous sections, we compare the performance of Shek with that of Berm by sampling the DEM (figure 12 (a)) at even-indexed nodes (along both the X and Y dimensions), creating a lower-resolution dataset (120 m) shown in figure 12 (b). We use this as our training data for prediction at the deleted nodes (the testing dataset).
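The construction of the 120 m training DEM from the 60 m grid amounts to strided slicing, sketched below; the height field used here is synthetic, standing in for the real DEM of figure 12.

```python
import numpy as np

# Resolution halving for the DEM experiment: keep only the even-indexed
# grid nodes along both axes as the coarser training set; the deleted
# nodes form the test set. The surface below is a synthetic stand-in.
gx, gy = np.meshgrid(np.linspace(0, 3, 60), np.linspace(0, 3, 60))
dem = np.sin(gx) + np.cos(gy)              # stand-in height field

train = dem[::2, ::2]                      # even-indexed nodes (coarser grid)
test_mask = np.ones(dem.shape, dtype=bool)
test_mask[::2, ::2] = False
test_heights = dem[test_mask]              # heights at the deleted nodes
```

Training on `train` and predicting at the masked nodes reproduces the setup used for the generalization comparison in figure 14.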
The first experiment is shown in figure 13. Here we show the reconstruction results for scales 0, 2 and 6, using the full DEM dataset (figure 12 (a)) for training. These specific scales were chosen to give an idea of how the surface evolves from the starting scale towards the final one. The important thing to note is that at scale 6, with D^6_sparse consisting of only 763 out of 4350 points, the algorithm generated a reconstruction of the DEM whose prediction error in the ∞-norm is just 6.74 (compared to the range of variation of ∼[250, 500] in the shadebar of figure 12 (a)). The compression metric (763/4350 ≈ 17.5%) clearly shows the success of the algorithm in generating a sparse representation of the dataset. The ∞-norm was chosen as the error measure because it upper-bounds the error at any individual point and gives an intuitive understanding of the performance of the algorithm. Figure 13 also shows how specific topographic features become sharper with increasing scale.
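The error and compression metrics quoted above are straightforward to compute; a small sketch (the function names are illustrative):

```python
import numpy as np

def inf_norm_error(f, f_hat):
    # The ∞-norm upper-bounds the error at every individual point.
    return np.max(np.abs(np.asarray(f) - np.asarray(f_hat)))

def compression_percent(n_sparse, n_total):
    # Fraction of the original dataset retained in the sparse
    # representation, as a percentage.
    return 100.0 * n_sparse / n_total

# The DEM experiment keeps 763 of 4350 points:
ratio = compression_percent(763, 4350)     # roughly 17.5%
```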
The next analysis compares the generalization ability of Shek and Berm, training on the DEM shown in figure 12 (b) and testing on the grid points that were deleted to obtain the training data. Figure 14 shows the reconstruction of the original-resolution DEM (60 m) from the training DEM with 120 m resolution: (a) and (b) show the reconstructions obtained by Shek and Berm respectively, using the 120 m DEM training data shown in figure 12 (b). The header shows the RMSE for prediction of height at the testing locations (testing RMSE for Shek: 2.74; testing RMSE for Berm: 10.7). Similar to the case of the test functions, the performance on the testing data is again an order of magnitude better for Shek than for Berm.

7. Conclusion. In this paper, we have introduced a hierarchical method for learning a sequence of sparse representations for a large dataset. The hierarchy comes from approximation spaces defined with respect to a scale parameter. Principally, the proposed approach has been shown to be useful for data reduction applications, coupled with learning a model for representing the data. The paper provides analysis that explains and studies the theoretical properties of the proposed approach (with the code available online): we derive bounds for stability, convergence and the behavior of error functionals. In the results section, we have shown the performance of the approach as a data reduction mechanism on both synthetic datasets and a real (geo-spatial) dataset. The sparse model generated by the presented approach is also shown to efficiently reconstruct the data while minimizing error in prediction. Besides reconstruction, we have also analyzed the ability of the proposed approach to generalize to unseen samples, comparing its performance with algorithm 4 from [5]. Our approach is shown to consistently provide approximately an order of magnitude better performance on the testing datasets (both real and synthetic), with considerably better stability behavior.
Though the results shown in this paper depict the efficiency of the approach on a variety of datasets and settings, there are several areas in which the presented algorithm can be improved. Firstly, the implementation of the algorithm can be made more efficient, either by optimizing the operations in the algorithm or by handling chunks of data at a time. Secondly, generation of sparse representations for noisy datasets poses another set of unique challenges, which will be addressed in a companion paper under preparation; for that case, properly capturing the uncertainty in the generated sparse approximations becomes crucial. Finally, hyperparameters like P in (Algorithm 1) can be explored and studied further to improve the behavior of the approach, possibly by utilizing additional information which is not directly observed but is inherently known (like the physics of a system). Hence, in essence, the work presented addresses the need for efficient learning methods for large datasets and opens up interesting new approaches.