Function approximation by deep networks

We show that deep networks are better than shallow networks at approximating functions that can be expressed as a composition of functions described by a directed acyclic graph, because the deep networks can be designed to have the same compositional structure, while a shallow network cannot exploit this knowledge. Thus, the blessing of compositionality mitigates the curse of dimensionality. On the other hand, a theorem called good propagation of errors allows us to “lift” theorems about shallow networks to those about deep networks with an appropriate choice of norms, smoothness, etc. We illustrate this in three contexts where each channel in the deep network calculates a spherical polynomial, a non-smooth ReLU network


Introduction
As is well known, deep networks are playing an increasingly important role in artificial intelligence, industry, and many aspects of modern life ranging from homeland security to automated cars. A topic of great recent interest is to examine the expressive power of deep networks to explain their remarkable success in comparison with classical shallow networks. There are many efforts in this direction, depending upon what one defines to be the expressive power [13,17,18,19,5,12].
The fundamental problem of machine learning is the following. Given an integer $q \ge 1$, and data of the form $\{(x_i, y_i)\}_{i=1}^M \subset \mathbb{R}^q \times \mathbb{R}$, drawn randomly from a probability distribution $\mu$, find a model $P$ such that $P(x_i) \approx y_i$. In theory, one assumes an underlying function $f$ on the unknown support of the distribution $\mu^*$ from which the $x_i$'s are sampled, so that $y_i = f(x_i) + \epsilon_i$, $i = 1, \ldots, M$, where the $\epsilon_i$ are zero mean random variables. Equivalently, $f(x) = E_\mu(y|x)$. An important aspect of the problem of machine learning is thus viewed as a problem of function approximation. A goal of this paper is to standardize the notion of expressive power in terms of the ability of the network to approximate functions, measured in a manner utilized in approximation theory for more than 100 years. Our main thesis is that the ability of deep networks to achieve a better approximation than shallow networks stems from their ability to mimic any compositional structure inherent in the target function, an ability that shallow networks cannot have. On the other hand, a theorem called "good propagation of errors" allows us to lift results from shallow networks to those for deep networks, highlighting the importance of compositionality. It will be pointed out that there is no natural way to define a probability measure that can take advantage of the very important compositionality and with respect to which one can define generalization error as in classical machine learning. In particular, the bias-variance split does not hold, and a new theory is required. This paper summarizes some of our recent results in this direction, in particular, for deep non-smooth ReLU networks.

arXiv:1905.12882v1 [cs.LG] 30 May 2019
We will describe the central problems of approximation theory in Section 2 and illustrate them using the example of approximation of a function on the Euclidean (hyper-)sphere by spherical polynomials. In Section 3, we will establish the terminology for describing deep networks. A theorem called good propagation of errors is proved and discussed in Section 4. Applications to approximation by non-smooth ReLU networks and networks with another related activation function are discussed in Section 5. The relationship of our results with some others in the literature is discussed in Section 6.

Basic ideas in approximation theory
A central problem in approximation theory is to investigate the quality of approximation of an unknown function given a finite amount of information about the function. In order to do so, one assumes that the target function $f$ is in some Banach space $X$ with norm $\|\cdot\|$. The function needs to be approximated by models coming from a nested sequence of sets $\cdots \subset V_n \subset V_{n+1} \subset \cdots$. One of the most important quantities in approximation theory is the degree of approximation, defined by

$\mathrm{dist}(X; f, V_n) = \inf_{P \in V_n} \|f - P\|$. (1)

The assumption that $\bigcup_n V_n$ is dense in $X$ is equivalent to the statement that $\mathrm{dist}(X; f, V_n) \to 0$ as $n \to \infty$ for every $f \in X$. The rate of this convergence clearly depends upon further assumptions on $f$, called a prior in machine learning parlance, and a smoothness class in approximation theory. Typically, this class is defined in terms of a smoothness parameter $\gamma$ as a subspace $W_\gamma$ of $X$.
Constructing the minimizer in (1) is generally not of any interest. Such a minimizer can be hard to obtain computationally, and does not have many desirable properties; e.g., it is generally not sensitive to the local properties of $f$. Instead, the central themes of approximation theory are:

Direct theorem: Prove that $f \in W_\gamma$ implies $\mathrm{dist}(X; f, V_n) = O(n^{-s})$ (2) for some $s$ depending upon $\gamma$ and other parameters, e.g., the number of input variables to $f$.

Construction, aka training: Give a method to construct $P \in V_n$ from the given information on $f$ such that $\|f - P\| = O(n^{-s})$, and study the connection between the amount of information available and the $n$ for which such a construction is possible.

Width theorem: This states that if we can only assume that $f \in K \subset W_\gamma$ for a compact subset $K$, and $n$ pieces of information are allowed on $f$ (in the form of a continuous mapping $S : K \to \mathbb{R}^n$), then no matter how one constructs an approximation to $f$ from this information, i.e., $A(S(f)) \in X$, the worst case error under the assumption that $f \in K$ is $\Omega(n^{-s})$. This asserts merely the existence of some $f \in K$ for which the lower estimate holds.

Converse theorem: This states that the estimate (2) implies that $f \in W_\gamma$. This is a statement about individual functions, not about the whole class of functions.
We discuss an example in connection with approximation on the Euclidean unit sphere of $\mathbb{R}^{q+1}$ for some integer $q \ge 1$: $S^q = \{x \in \mathbb{R}^{q+1} : |x| = 1\}$. We will be interested in approximating continuous functions on $S^q$, so that the Banach space is $C(S^q)$ equipped with the uniform norm $\|\cdot\|_{S^q}$. The restriction of an algebraic polynomial in $q + 1$ real variables of total degree $n$ to $S^q$ is called a spherical polynomial of degree $n$. The space of all spherical polynomials of degree $< n$ is denoted by $\Pi_n^q$. Thus, $V_n = \Pi_n^q$. We will denote $\mathrm{dist}(C(S^q); f, \Pi_n^q)$ by $E_{q;n}(f)$.
The smoothness class is defined as follows. If $\Delta$ is the negative Laplace-Beltrami operator on $S^q$, a K-functional on the space $C(S^q)$ is defined by where $r$ is an even integer, and the infimum is taken over all $g$ for which $(I + \Delta)^{r/2} g \in C(S^q)$. The class $W_{q;\gamma}$ is defined by for an even integer $r > 2\gamma$. The estimate (5) shows that the class $W_{q;\gamma}$ (although not the norm $\|f\|_{W_{q;\gamma}}$) is independent of the choice of $r$.
It is proved in [14,8] that there exist positive constants $c_1$, $c_2$ depending only on $q$, $\gamma$, $r$ such that The second inequality gives an estimate on the degree of approximation in terms of the smoothness class, and represents the direct theorem. The first inequality asserts that the rate at which the degree of approximation converges to 0 determines the smoothness class to which the target function belongs; i.e., a converse theorem. The converse theorem in particular is stronger than the width theorem.
A construction of a polynomial approximation that yields the bounds is given in [8] in the case when spectral information is available, and in [7] in the case when noisy values of the function are given at arbitrary points on the sphere.
We note that the dimension of $\Pi_n^q$ is $\sim n^q$. Therefore, in terms of the number of parameters $M$ involved in the approximation, the rate in (5) is $\sim M^{-\gamma/q}$. This exponential dependence on $q$ is called the curse of dimensionality; the quantity $q/\gamma$ is called the effective dimension of $W_{q;\gamma}$.
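To make the dependence on $q$ concrete, the following minimal sketch inverts the rate $M^{-\gamma/q}$ to estimate how many parameters are needed for a target accuracy. It ignores all constants, and the function name `params_needed` and the sample values of `eps`, `q`, and `gamma` are illustrative assumptions, not quantities from the text.

```python
def params_needed(eps: float, q: int, gamma: float) -> float:
    """Solve M**(-gamma/q) = eps for M, i.e. M = eps**(-q/gamma).

    Constants hidden in the '~' of the rate are ignored (assumption).
    """
    return eps ** (-q / gamma)


if __name__ == "__main__":
    # Same accuracy and smoothness, growing input dimension q.
    for q in (2, 8, 32):
        m = params_needed(eps=0.1, q=q, gamma=2.0)
        print(f"q = {q:2d}: effective dimension q/gamma = {q / 2.0:4.1f}, M ~ {m:.3g}")
```

The parameter count grows like $\epsilon^{-q/\gamma}$, so doubling the effective dimension squares the number of parameters required.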

Deep networks and compositional functions
A commonly used definition of a deep network is the following. Let $\phi : \mathbb{R} \to \mathbb{R}$ be an activation function; applied to a vector, $\phi$ acts componentwise. For each layer index $\ell$, let $q_\ell \ge 1$ be an integer ($q_0 = q$), and $T_\ell : \mathbb{R}^{q_\ell} \to \mathbb{R}^{q_{\ell+1}}$ be an affine transform, where $q_{L+1} = 1$. A deep network with $L - 1$ hidden layers is defined as the compositional function obtained by alternately applying the affine transforms $T_\ell$ and the activation function $\phi$. There are several shortcomings of this definition. First, a function may have more than one compositional representation, so that the affine transforms and $L$ are not determined uniquely by the function itself. Second, this notion does not capture the connection between the nature of the target function and its approximation. Third, the affine transforms $T_\ell$ define a special directed acyclic graph (DAG). It is difficult to describe notions of weight sharing, convolutions, sparsity, etc. in terms of these transforms.

Therefore, we follow [12] and fix a DAG to represent both the target function and its approximation. Let $G$ be a DAG, with the set of nodes $V \cup S$, where $S$ is the set of source nodes, and $V$ that of non-source nodes. We assume that there is only one sink node, $v^*$. A G-function is defined as follows. Each in-edge to a node in $V$ represents an input real variable. For each node $v \in V \cup S$, we denote its in-degree by $d(v)$. A node $v \in V \cup S$ itself represents the evaluation of a real valued function $f_v$ of the $d(v)$ inputs. The out-edges fan out the result of this evaluation. Each source node obtains an input from some Euclidean space. Other nodes can also obtain such an input, but by introducing dummy nodes, it is convenient to assume that only the source nodes obtain an input from the Euclidean space. In summary, a G-function is actually a set of functions $\{f_v : v \in V \cup S\}$, each of which will be called a constituent function.
For example, the DAG $G$ in Figure 1 ([12]) represents the compositional function $f^*$ given in (6). If $v \in S$, the (vector of) variables seen by $v$ are those which are input to $v$. For other $v \in V$, the variables seen by $v$ are defined recursively as the vector of variables obtained by concatenating the variables seen by each of the children of $v$ in order. In particular, there is a notation overload. The function $f_v$ is a function of the $d(v)$ variables input to the vertex $v$. It is also a function of the variables seen by $v$. For example, in the DAG of Figure 1, $h_{11}$ sees the variables $(x_4, x_5)$, and $h_{13}$ is a function of two variables, namely, the outputs of $h_{10}$ and $h_{11}$, but it is also a function of the variables which are seen by $h_{13}$. We will explain which meaning is intended wherever we find it warranted.
In the remainder of this paper, we will assume $G$ to be a fixed DAG.

Figure 1: This figure from [12] shows an example of a G-function ($f^*$ given in (6)). The vertices $V \cup S$ of the DAG $G$ are denoted by red dots. The black dots represent the inputs; the inputs to the various nodes are indicated by the in-edges of the red nodes. The blue dot indicates the output value of the G-function, $f^*$ in this example.
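The definitions of a G-function and of the variables seen by a node can be sketched in a few lines of code. The example below is only an illustration under stated assumptions: the three-node DAG, the node names (`left`, `right`, `root`), and the constituent functions are hypothetical and are not the DAG of Figure 1.

```python
from typing import Callable, Dict, List, Optional, Sequence


def evaluate(node: str,
             children: Dict[str, List[str]],
             funcs: Dict[str, Callable[..., float]],
             inputs: Dict[str, Sequence[float]],
             cache: Optional[Dict[str, float]] = None) -> float:
    """Evaluate the constituent function at `node`, recursing over its children."""
    if cache is None:
        cache = {}
    if node not in cache:
        if node in inputs:
            # Source node: its arguments come from the Euclidean input.
            args = list(inputs[node])
        else:
            # Non-source node: its arguments are the outputs of its children.
            args = [evaluate(c, children, funcs, inputs, cache)
                    for c in children[node]]
        cache[node] = funcs[node](*args)
    return cache[node]


def variables_seen(node: str,
                   children: Dict[str, List[str]],
                   inputs: Dict[str, Sequence[float]]) -> List[float]:
    """Concatenate, in order, the variables seen by the children of `node`."""
    if node in inputs:
        return list(inputs[node])
    seen: List[float] = []
    for c in children[node]:
        seen.extend(variables_seen(c, children, inputs))
    return seen


# Hypothetical G-function on a small binary tree: two source nodes, one sink.
children = {"root": ["left", "right"]}
inputs = {"left": (1.0, 2.0), "right": (3.0, 4.0)}
funcs = {
    "left": lambda a, b: a * b,    # f_left(x1, x2)
    "right": lambda a, b: a + b,   # f_right(x3, x4)
    "root": lambda u, v: u - v,    # f_root(f_left, f_right)
}

value = evaluate("root", children, funcs, inputs)
seen_by_root = variables_seen("root", children, inputs)
```

Here `root` is a function of two variables (the outputs of its children), but it sees all four input variables, which is the notation overload described above.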

Good propagation of errors
The following Theorem 4.1 is the main technical tool that allows us to reduce the problem of approximation by deep networks to a series of approximations by shallow networks. In this theorem, for each integer $d \ge 1$, let $\rho_d$ be a metric on $\mathbb{R}^d$.
Theorem 4.1 Let $\{f_v\}$ be a G-function satisfying the following Lipschitz condition: there exists a constant $L > 0$ such that for $v \in V$ and $x, y \in \mathbb{R}^{d(v)}$,

$|f_v(x) - f_v(y)| \le L \rho_{d(v)}(x, y)$. (7)

Let $\{g_v\}$ be a G-function. Let $w \in V$, $\{u_1, \ldots, u_s\} \subset V$ be the children of $w$, and $x_{u_1}, \ldots, x_{u_s}$ be the variables seen by $u_1, \ldots, u_s$ respectively. Then

PROOF. By the triangle inequality followed by (7), we get

We illustrate Theorem 4.1 using the example of approximation by spherical polynomials as in Section 2. We note first that the transformation is a one-to-one correspondence between $\mathbb{R}^d$ and the open upper hemisphere $S^d_+$. For a function $f : \mathbb{R}^d \to \mathbb{R}$ vanishing at infinity, one can therefore associate in a one-to-one manner an even function on $S^d$ which shares all the smoothness properties of $f$. In the notation of Theorem 4.1, if we assume that all the G-functions involved are continuous, the points such as $(f_{u_1}(x_{u_1}), \ldots, f_{u_s}(x_{u_s}))$ may thus be thought of as points on a compact subset of $S^s_+$. Therefore, with some simple modifications, we may assume that the inputs to all the constituent functions are from the appropriate spheres. Moreover, restricted to compact subsets of $\mathbb{R}^d$, the usual Euclidean metric on $\mathbb{R}^d$ is equivalent to the metric $\rho_d$ on $\mathbb{R}^d$ induced by the geodesic distance on $S^d$. Therefore, we may write (8) in the form

Motivated by Theorem 4.1, we define the following notion. Let $W_d$ be a class of functions of $d$ variables with norm (or semi-norm) $\|\cdot\|_{W_d}$.
We define i.e., we use the tensor product norm on the product of the spaces $W_{d(v)}$, $v \in V$. For example, $G\Pi_n$ is the class of all G-functions We note that the fact that for each $v \in V$. Together with (5), Theorem 4.1 leads to the following

Theorem 4.2 Let $\{f_v\}$ be a G-function such that (7) is satisfied with $\rho_d$ induced by the geodesic metric on $S^d$. Then there exist positive constants $c_3$, $c_4$ independent of the functions $\{f_v\}$ or $n$ such that

We end this section by pointing out another important feature of Theorem 4.1. It is customary in machine learning to measure the generalization error between a function and its approximation using an appropriate $L^2$ norm. In (8), the argument of $f_v$ is different (and in particular, differently distributed) from that of $g_v$. Thus, there is no natural measure with respect to which one can take the $L^2$ norm while preserving the advantages of compositionality. Therefore, in the theory of function approximation by deep networks, one has to use the uniform norm. In turn, this means that the usual bias-variance split does not work anymore, and one has to develop an entirely new paradigm.
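The mechanism behind good propagation of errors can be checked numerically on a toy example. The sketch below assumes that the bound obtained from the triangle inequality and the Lipschitz condition (7) has the form $|f_w(F) - g_w(G)| \le L \rho_s(F, G) + \|f_w - g_w\|$, where $F$ and $G$ collect the outputs of the children under $\{f_v\}$ and $\{g_v\}$; this form is our reading of the proof, not a quotation of the theorem. All functions, the choice of the $\ell^1$ metric for $\rho_2$, and the perturbation sizes are hypothetical.

```python
import itertools
import math

LIPSCHITZ = 1.0   # Lipschitz constant of f_w with respect to the l1 metric
EPS_W = 0.01      # upper bound on the uniform norm of f_w - g_w


def f_w(a, b):
    return math.sin(a) + math.cos(b)


def g_w(a, b):
    # Perturbation of f_w whose uniform-norm distance to f_w is at most EPS_W.
    return f_w(a, b) + EPS_W * math.sin(3 * a)


def f_u1(x, y):
    return x * y


def g_u1(x, y):
    return x * y + 0.02 * math.cos(x)    # child error at most 0.02


def f_u2(x, y):
    return x - y


def g_u2(x, y):
    return x - y + 0.03 * math.sin(y)    # child error at most 0.03


def propagate_check(points):
    """Return (largest observed error, largest violation of the assumed bound)."""
    max_error, max_excess = 0.0, float("-inf")
    for x1, x2, x3, x4 in points:
        F = (f_u1(x1, x2), f_u2(x3, x4))
        G = (g_u1(x1, x2), g_u2(x3, x4))
        lhs = abs(f_w(*F) - g_w(*G))
        rhs = LIPSCHITZ * (abs(F[0] - G[0]) + abs(F[1] - G[1])) + EPS_W
        max_error = max(max_error, lhs)
        max_excess = max(max_excess, lhs - rhs)
    return max_error, max_excess


grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
points = list(itertools.product(grid, repeat=4))
max_error, max_excess = propagate_check(points)
```

In this toy setting the error at the parent node is controlled by the child errors plus the parent's own approximation error, so the total cannot exceed $1 \cdot (0.02 + 0.03) + 0.01 = 0.06$. Note that $f_w$ and $g_w$ are evaluated at different arguments, which is the feature discussed at the end of this section.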

Approximation by ReLU networks
) and recalling the transformation between $\mathbb{R}^q$ and $S^q$, the problem of approximation of functions on $\mathbb{R}^q$ by networks of this form is equivalent to that of approximation of functions on $S^q$ by zonal function networks of the form $x \mapsto$ Next, we define a smoothness class for approximation by such networks [11,12]. In this section, we denote the dimension of the space of the restrictions to the sphere of all homogeneous harmonic polynomials of degree $\ell$ by $d^q_\ell$, $\ell = 0, 1, \ldots$, and the set of orthonormalized spherical harmonics on $S^q$ by $\{Y_{\ell,k}\}_{k=1}^{d^q_\ell}$. We recall the addition formula where $p_\ell$ is the degree $\ell$ ultraspherical polynomial with positive leading coefficient, with the set $\{p_\ell\}$ satisfying The function $t \mapsto |t|$ can be expressed in an expansion with the series converging on compact subsets of $(-1, 1)$.
If $f \in C(S^q)$, then we define We note that if $f$ is an even function, then $\hat{f}(2\ell + 1, k) = 0$ for $\ell = 0, 1, \ldots$. In this context, the place of the operator $(I + \Delta)^{1/2}$ is taken by the operator $D_{q;|\cdot|}$ defined formally by and It is proved in [11] that if $f \in Y_q$, then there exists a network of the form The class of all networks of the form $G$ is denoted $R_{q;N}$. Theorem 4.1 allows us to "lift" this upper bound to the following corresponding bound for deep ReLU networks.
Theorem 5.1 Let $\{f_v\}$ be a G-function such that each $f_v$ satisfies (7) with $\rho_{d(v)}$ induced by the geodesic metric on $S^{d(v)}$. In addition, let each Then there exists a deep network in $GR_N$; i.e., a G-function $\{g_v\}$ such that every

For example, if $G$ is a binary tree with 1024 leaves, then a shallow network as in (18) with $N$ neurons yields a degree of approximation $O(N^{-1/512})$, while a deep network as in (19) yields a degree of approximation $O(N^{-1})$; a substantial improvement.
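The arithmetic behind the binary tree example can be sketched as follows. The sketch assumes that the rate for a constituent function of $d$ variables is $N^{-2/d}$, which matches both exponents quoted above ($d = 1024$ for the shallow network, $d = 2$ for each node of the binary tree); the helper names are hypothetical and all constants are ignored.

```python
def rate_exponent(num_variables: int) -> float:
    """Exponent s in the assumed rate N**(-s), with s = 2 / num_variables."""
    return 2.0 / num_variables


def neurons_for_accuracy(eps: float, exponent: float) -> float:
    """Solve N**(-exponent) = eps for N, ignoring constants (assumption)."""
    return eps ** (-1.0 / exponent)


shallow = rate_exponent(1024)   # the shallow network sees all 1024 variables
deep = rate_exponent(2)         # each node of the binary tree sees 2 inputs

if __name__ == "__main__":
    print(f"shallow exponent: {shallow}  (= 1/512)")
    print(f"deep exponent:    {deep}")
    # Neurons needed to halve the error bound, per the assumed rates.
    print(f"shallow N for eps = 0.5: {neurons_for_accuracy(0.5, shallow):.3e}")
    print(f"deep N for eps = 0.5:    {neurons_for_accuracy(0.5, deep):.3e}")
```

Under these assumptions, halving the error bound requires on the order of $2^{512}$ neurons in the shallow case, against a factor of 2 per constituent function in the deep case.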
The "derivative" D |•| is very unusual in that instead of being a local function, it is supported on equators perpendicular to the point in question.This is illustrated by Figure 2 from [11].This behavior makes it very difficult to obtain a converse theorem.
On the other hand, if we consider the spherical convolution function then a complete theory emerges by combining the results in [10] with Theorem 4.1. An interesting feature of this theory is that the complexity of the network is not measured in terms of the number of neurons but the minimal separation among the neurons. If $C \subset S^q$ is a finite subset, we define the minimal separation $\eta(C)$ and mesh norm $\delta(C)$ of $C$ by

$\eta(C) = \min_{x, y \in C,\, x \ne y} \rho_q(x, y)$, $\delta(C) = \max_{x \in S^q} \min_{y \in C} \rho_q(x, y)$,

where $\rho_q$ is the geodesic distance on $S^q$. By replacing $C$ by a suitable subset, we may assume that For a finite subset $C \subset S^q$, the set $N(q; C)$ comprises networks of the form $x \mapsto \sum_{y \in C} a_y \phi(x \cdot y)$.
We note that the number of neurons in a network in $N(q; C)$ is $O(\eta(C)^{-q})$, but given $N$, it is easy to construct $C$ with $N$ elements for which $\eta(C)^{-q} \sim N$.
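The two quantities $\eta(C)$ and $\delta(C)$ are straightforward to compute for a concrete point set. The following sketch does so for equally spaced points on the circle $S^1$; the helper names are hypothetical, and the mesh norm is only estimated by maximizing over a finite set of probe points, which in general gives a lower bound for the true maximum over the whole sphere.

```python
import math


def geodesic(x, y):
    """Geodesic distance on the unit sphere: the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    return math.acos(max(-1.0, min(1.0, dot)))  # clamp against rounding


def minimal_separation(C):
    """eta(C): smallest geodesic distance between distinct points of C."""
    return min(geodesic(x, y) for i, x in enumerate(C) for y in C[i + 1:])


def mesh_norm_estimate(C, probes):
    """Estimate delta(C) by maximizing over finitely many probe points only."""
    return max(min(geodesic(x, y) for y in C) for x in probes)


def circle_points(n):
    """n equally spaced points on S^1."""
    return [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
            for k in range(n)]


C = circle_points(8)
eta = minimal_separation(C)                         # 2*pi/8 for this set
delta = mesh_norm_estimate(C, circle_points(1600))  # close to pi/8
```

For this set, $\eta(C)^{-q}$ with $q = 1$ is $8/(2\pi)$, which is proportional to the number of points, consistent with the remark above.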
Omitting many nuances and using a different notation, [10, Theorem 3.3] (applied to the sphere) can be restated in the following form:

Theorem 5.2 Let $0 < \gamma < 3$ and $f \in W_{q;\gamma}$. For any set $C$ satisfying (22), there exists $G \in N(q; C)$

Using Theorem 4.1, this theorem can be lifted as before to the following theorem for deep networks.

Theorem 5.3 (a) For each $v \in V$, let $C_v \subset S^q$ be finite subsets satisfying (22). Let $\eta = \min \eta(C_v)$. Let $0 < \gamma < 3$, and $\{f_v\} \in GW_\gamma$. In addition, we assume that each $f_v$ satisfies (7) with $\rho_{d(v)}$ induced by the geodesic metric on $S^{d(v)}$. Then there exists a G-function $\{G_v\}$ such that each v) ) and there exists a sequence of

Related works
There is a deluge of papers on the expressive power of deep networks and their superiority over shallow networks. We cite a few of these. The papers [13,17] measure the expressive power by the number of linear pieces into which the network partitions the domain space. This measurement overlooks the fact that the optimal number of pieces ought to depend upon the function being approximated. It is shown in [18] that deep networks are better when the complexity is measured in terms of the rank of certain tensors. It is not clear how this criterion relates to the problem of function approximation. The papers [19,5] establish the existence of functions which cannot be approximated well by neural networks with a given graph structure. This anticipates the compositionality of the networks being represented by a DAG structure, but does not address the compositional nature of the target function itself. The papers [16,3,4] show that specific functions such as the characteristic functions of balls and radial functions cannot be approximated well by shallow ReLU networks. In [9], it is shown that by using the function $t \mapsto (t_+)^2$ as the activation function, one can synthesize any spline or polynomial exactly with a network with sufficient depth. In particular, one can synthesize any given partition of the Euclidean space into linear regions arbitrarily closely. In [6], estimates on the degree of uniform approximation are given in terms of the modulus of continuity for networks of fixed width, where the number of neurons in each layer is fixed at $2q + 1$, but the number of layers is inversely proportional to the modulus of continuity. The paper [1] obtains bounds on the degree of approximation of Lipschitz continuous functions by ReLU networks. The idea of transforming the problem from the Euclidean space to that on the sphere is used in [1] as well. That paper also considers approximation by spherical convolutions as in (20). Our estimates are under different assumptions, and are better. Lower bounds for universal approximation of Lipschitz functions by ReLU networks are given in [20,21], and for twice differentiable functions in [15]. In particular, [21] gives a detailed analysis, showing that the order of magnitude of the degree of approximation of Lipschitz continuous functions cannot be better than $N^{-2/q}$, where $N$ is the number of neurons. The bound (18) clearly achieves this as an upper bound, but with a different class of functions. We conjecture that the class of functions introduced in this paper is the best possible, in the sense that the estimate (18) cannot be improved in terms of nonlinear widths. However, a converse theorem is probably not true. Finally, we note that explicit expressions for the kernels $\phi$ defined in (20) are easy to deduce from those given in [2], where the function $t \mapsto \max(t, 0)$ is used in place of $|\cdot|$.

Conclusions
We have demonstrated several concepts in this paper. First, we have shown that deep networks have a better approximation power than shallow networks because they are capable of reflecting any compositional structure in the target function, while shallow networks cannot. Second, we have pointed out an important tool in this theory called good propagation of errors, which enables us to lift theorems on the approximation power of shallow networks to those of deep networks if all the constituent functions are Lipschitz continuous. Third, we have argued that in order to use this tool, there is no natural measure at each step with respect to which the error can be measured in the $L^2$-norm as is customary in machine learning. In particular, the usual bias-variance split does not work anymore, and a new paradigm is necessary. Fourth, we obtained converse theorems for approximation by certain kernels obtained from the ReLU functions, which enable us to verify from the observed degree of approximation the prior smoothness condition which the target function must satisfy.
We note that the question of whether or not a given target function is compositional is meaningless; e.g., However, the direct and converse theorems show that if we know in advance that the target function is not as smooth as the degree of approximation by the networks indicates, then the blessing of compositionality must be playing some role.
) Conversely, let C m be a nested sequence of sets satisfying (22), and for each integer m ≥ 1, η