Higher Order Concentration of Measure


arXiv:1709.06838v1 [math.PR] 20 Sep 2017

S. G. BOBKOV, F. GÖTZE, AND H. SAMBALE

Abstract. We study sharpened forms of the concentration of measure phenomenon typically centered at stochastic expansions of order $d - 1$ for any $d \in \mathbb{N}$. The bounds are based on $d$-th order derivatives or difference operators. In particular, we consider deviations of functions of independent random variables and differentiable functions over probability measures satisfying a logarithmic Sobolev inequality, and functions on the unit sphere. Applications include concentration inequalities for $U$-statistics as well as for classes of symmetric functions via polynomial approximations on the sphere (Edgeworth-type expansions).

Date: September 21, 2017.
1991 Mathematics Subject Classification. Primary 60E.
Key words and phrases. Concentration of measure phenomenon, logarithmic Sobolev inequalities, Hoeffding decomposition, functions on the discrete cube, Efron–Stein inequality.
This research was supported by CRC 1283.

1. Introduction

In this article, we study higher order versions of the concentration of measure phenomenon. Referring to the use of derivatives or difference operators of higher order, say $d$, the notion of higher order concentration has several aspects. In particular, instead of the classical problem about deviations of $f$ around the mean $\mathbb{E}f$, one may consider potentially smaller fluctuations of $f - \mathbb{E}f - f_1 - \ldots - f_d$, where $f_1, \ldots, f_d$ are “lower order terms” of $f$ with respect to a suitable decomposition, such as a Taylor-type decomposition or the Hoeffding decomposition of $f$.

Starting with the works of Milman in the local theory of Banach spaces, and of Borell, Sudakov, and Tsirelson within the framework of Gaussian processes, the concentration of measure phenomenon has been intensely studied during the past decades. This study includes important contributions due to Talagrand and other researchers in the 1990s, cf. Ledoux [L1], [L2], [L3]; a more recent survey is authored by Boucheron, Lugosi and Massart [B-L-M].

As another previous work let us mention Adamczak and Wolff [A-W], who exploited certain Sobolev-type inequalities or subgaussian tail conditions to derive exponential tail inequalities for functions with bounded higher-order derivatives (evaluated in terms of some tensor-product matrix norms). While in [A-W], concentration around the mean is studied, the idea of sharpening concentration inequalities for Gaussian and related measures by requiring orthogonality to linear functions also appears in Wolff [W] as well as in Cordero-Erausquin, Fradelizi and Maurey [CE-F-M].

Our research started with second order results for functions on the $n$-sphere orthogonal to linear functions [B-C-G], with an approach which was continued in [G-S] in the presence of logarithmic Sobolev inequalities. This includes discrete models as well as differentiable functions on open subsets of $\mathbb{R}^n$. Here, we adapt in particular Sobolev type inequalities introduced by Boucheron, Bousquet, Lugosi and Massart [B-B-L-M], and thus extend some of the results from [G-S] to arbitrary higher orders. Developing the algebra of higher order difference operators, we moreover came across a higher order extension of the well-known Efron–Stein inequality.

1.1. Functions of independent random variables. Let $X = (X_1, \ldots, X_n)$ be a random vector in $\mathbb{R}^n$ with independent components, defined on some probability space $(\Omega, \mathcal{A}, \mathbb{P})$. First, we state higher order exponential inequalities in terms of the difference operator which is frequently used in the method of bounded differences. Let $(\bar{X}_1, \ldots, \bar{X}_n)$ be an independent copy of $X$. Given $f(X) \in L^\infty(\mathbb{P})$, define

\[
T_i f(X) = T_i f = f(X_1, \ldots, X_{i-1}, \bar{X}_i, X_{i+1}, \ldots, X_n),
\]

\[
h_i f(X) = \frac{1}{2} \, \|f(X) - T_i f(X)\|_{i,\infty}, \quad i = 1, \ldots, n, \qquad hf = (h_1 f, \ldots, h_n f), \tag{1.1}
\]

where $\|\cdot\|_{i,\infty}$ denotes the $L^\infty$-norm with respect to $(X_i, \bar{X}_i)$. Depending on the random variables $X_j$, $j \neq i$, $h_i f$ thus provides a uniform upper bound on the differences with respect to the $i$-th coordinate (up to a constant). Based on $h$, it is possible to define higher order difference operators $h_{i_1 \ldots i_d}$ ($d \in \mathbb{N}$) by setting

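\[
h_{i_1 \ldots i_d} f(X) = \frac{1}{2^d} \, \Big\| \prod_{s=1}^{d} (\mathrm{Id} - T_{i_s}) f(X) \Big\|_{i_1, \ldots, i_d, \infty} = \frac{1}{2^d} \, \Big\| f(X) + \sum_{k=1}^{d} (-1)^k \sum_{1 \le s \, [\ldots]} [\ldots] \tag{1.2}
\]

[...] $d - 3$ and is orthogonal in $L^2(\sigma_{n-1})$ to all polynomials of total degree at most $d - 1$. Moreover, assume that

\[
\|f^{(d)}\|_{HS,2} \le 1 \quad \text{and} \quad \|f^{(d)}\|_{Op,\infty} \le 1. \tag{1.21}
\]

Then, there exists some universal constant $c > 0$ such that

\[
\int_G \exp\Big( \frac{c}{\sigma^2} \, |f|^{2/d} \Big) \, d\mu \le 2.
\]

A possible choice is $c = 1/(8e)$. The same holds for $p \le d - 3$, if $n > d - p - 1$.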

Recall that the Hilbert space $L^2(S^{n-1})$ can be decomposed into a sum of orthogonal subspaces $H_d$, $d = 0, 1, 2, \ldots$, consisting of all $d$-homogeneous harmonic polynomials (in fact, restrictions of such polynomials to the sphere). This fact is mirrored in the orthogonality assumptions from Theorem 1.9. If $f$ is not a homogeneous function, the bounds from Theorem 1.9 remain valid assuming (1.21), but instead of orthogonality to polynomials of lower degree we have to require that $f$ and all its partial derivatives of order up to $d - 1$ are centered with respect to $\sigma_{n-1}$.

In Theorem 1.8, we have chosen the usual (Euclidean) derivatives of functions defined in an open neighbourhood of the unit sphere. In applications, this is usually sufficient. There is also a notion of intrinsic (spherical) derivatives (cf. Section 5), and it is possible to obtain an analogue of Theorem 1.8 for these derivatives as well.

To fix some notation, denote by $\nabla_S f$ the spherical gradient of a differentiable function $f : S^{n-1} \to \mathbb{R}$ and write $D_i f = \langle \nabla_S f, e_i \rangle$, $i = 1, \ldots, n$, for the spherical partial derivatives of $f$. Here, $e_i$ denotes the $i$-th standard unit vector in $\mathbb{R}^n$. Higher order spherical partial derivatives are defined by iteration, e.g. $D_{ij} f = \langle \nabla_S \langle \nabla_S f, e_j \rangle, e_i \rangle$ for any $1 \le i, j \le n$. Note that in general, $D_{ij} f \neq D_{ji} f$. If $f \in C^d(S^{n-1})$, we denote by $D^{(d)} f(\theta)$ the hyper-matrix of the spherical partial derivatives of order $d$, i.e.

\[
(D^{(d)} f(\theta))_{i_1 \ldots i_d} = D_{i_1 \ldots i_d} f(\theta), \qquad \theta \in S^{n-1}. \tag{1.22}
\]

Similarly to (1.17), let $|D^{(d)} f(\theta)|_{Op}$ be the operator norm of $D^{(d)} f(\theta)$. Finally, write

\[
\|D^{(d)} f\|_{Op,p} = \Big( \int_{S^{n-1}} |D^{(d)} f|_{Op}^p \, d\sigma_{n-1} \Big)^{1/p}, \qquad p \in (0, \infty]. \tag{1.23}
\]

We have the following “intrinsic” version of Theorem 1.8.

Theorem 1.10. Let $f$ be a $C^d$-smooth function on $S^{n-1}$ such that $\int_{S^{n-1}} f \, d\sigma_{n-1} = 0$. Assume that $\|D^{(k)} f\|_{Op,2} \le n^{-(d-k)/2}$, $k = 1, \ldots, d - 1$, and $|D^{(d)} f(\theta)|_{Op} \le 1$, $\theta \in S^{n-1}$. Then

\[
\int_{S^{n-1}} \exp\big( (n - 1) \, |f|^{2/d} / (8e) \big) \, d\sigma_{n-1} \le 2.
\]

1.5. Outline. In Section 2, we give the proofs of Theorem 1.1, Proposition 1.2 and Corollary 1.3. We briefly discuss the notion of difference operators. The main tool is a recursion inequality for the $L^p$-norms of the function $f$ and the Hilbert–Schmidt norms of $|h^{(k)} f|$. In Section 3, Theorem 1.5 is proven. This includes a number of relations between the difference operators introduced in Definition 1.4. In Section 4, we prove Theorems 1.6 and 1.7 by adapting the main steps of the proof of Theorem 1.1. In Section 5, the proofs of Theorems 1.8, 1.9 and 1.10 are given; in particular, we introduce some facts about spherical calculus which allow us to proceed in a similar way as in the case of functions on $\mathbb{R}^n$. Finally, in Section 6, we illustrate Theorem 1.8 on the example of polynomials and the problem of Edgeworth approximations for symmetric functions on the sphere. For additional applications we refer to [G-S].

2. Functions of independent random variables: Proofs

Let $X = (X_1, \ldots, X_n)$ be a vector of independent random variables on the probability space $(\Omega, \mathcal{A}, \mathbb{P})$. By a “difference operator” we mean an $\mathbb{R}^n$-valued functional $\Gamma$ defined on $L^\infty(\mathbb{P})$ such that the following two conditions hold:

Conditions 2.1.
(i) $\Gamma f(X) = (\Gamma_1 f(X), \ldots, \Gamma_n f(X))$, where $f : \mathbb{R}^n \to \mathbb{R}$ may be any Borel measurable function such that $f(X) \in L^\infty(\mathbb{P})$.
(ii) $|\Gamma_i (a f(X) + b)| = a \, |\Gamma_i f(X)|$ for all $a > 0$, $b \in \mathbb{R}$ and $i = 1, \ldots, n$.

We also call $\Gamma$ a gradient operator or simply gradient. We do not suppose $\Gamma$ to satisfy any sort of “Leibniz rule”. Clearly, the difference operator $h$ from (1.1) and any of the difference operators introduced in Definition 1.4 satisfy Conditions 2.1.

For the proof of Theorem 1.1 we will need several lemmas. As before, let $T_i f = f(X_1, \ldots, X_{i-1}, \bar{X}_i, X_{i+1}, \ldots, X_n)$ with $(\bar{X}_1, \ldots, \bar{X}_n)$ an independent copy of $X$. As a first step, the Hilbert–Schmidt norms of the derivatives of consecutive orders are related in the following way:

Lemma 2.2. For any $d \ge 2$,

\[
\big| h |h^{(d-1)} f(X)|_{HS} \big| \le |h^{(d)} f(X)|_{HS}. \tag{2.1}
\]
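As an illustration of Lemma 2.2 in the case $d = 2$, the following minimal sketch evaluates both sides of (2.1) pointwise on the discrete cube $\{-1, 1\}^n$. It rests on several assumptions that are not taken from the text above: the $X_i$ are uniform on $\{-1, 1\}$, so that the norms $\|\cdot\|_{i,\infty}$ become maxima over the two values of a coordinate; the hyper-matrix $h^{(2)} f$ is taken to have zero diagonal (matching the sums over $j \neq i$ in the proof); and `example_f` is a hypothetical test function chosen only for the demonstration.

```python
import itertools
import math


def with_values(x, updates):
    """Return a copy of the point x with coordinates replaced per {index: value}."""
    y = list(x)
    for index, value in updates.items():
        y[index] = value
    return tuple(y)


def h_i(f, x, i):
    """h_i f(x) = (1/2) * max over (x_i, xbar_i) of |f - T_i f| on the cube."""
    return 0.5 * abs(f(with_values(x, {i: 1})) - f(with_values(x, {i: -1})))


def h_norm(f, x):
    """Euclidean norm |hf(x)| of the vector (h_1 f(x), ..., h_n f(x))."""
    return math.sqrt(sum(h_i(f, x, i) ** 2 for i in range(len(x))))


def h_ij(f, x, i, j):
    """h_ij f(x) = (1/4) * max of |f - T_i f - T_j f + T_ij f| over both coordinate pairs."""
    best = 0.0
    for a, a_bar, b, b_bar in itertools.product((-1, 1), repeat=4):
        value = (f(with_values(x, {i: a, j: b}))
                 - f(with_values(x, {i: a_bar, j: b}))
                 - f(with_values(x, {i: a, j: b_bar}))
                 + f(with_values(x, {i: a_bar, j: b_bar})))
        best = max(best, abs(value))
    return 0.25 * best


def hs_norm_order_two(f, x):
    """Hilbert-Schmidt norm of h^(2) f(x); diagonal entries are assumed to be zero."""
    n = len(x)
    return math.sqrt(sum(h_ij(f, x, i, j) ** 2
                         for i in range(n) for j in range(n) if i != j))


def example_f(x):
    """Hypothetical test function on the cube (not from the paper)."""
    return x[0] * x[1] + 0.5 * x[1] * x[2] * x[3] + 0.3 * x[0]


n = 4
cube = list(itertools.product((-1, 1), repeat=n))


def norm_of_hf(y):
    """The function x -> |h example_f(x)|, to which h is applied once more."""
    return h_norm(example_f, y)


# Left side of (2.1): |h|hf|(x)|; right side: |h^(2) f(x)|_HS, for every x.
lemma_holds = all(h_norm(norm_of_hf, x) <= hs_norm_order_two(example_f, x) + 1e-12
                  for x in cube)
print("Lemma 2.2 (d = 2) holds on all points:", lemma_holds)
```

Exhaustive enumeration is only feasible for small $n$; the sketch is meant as a sanity check of the definitions rather than as a proof device.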

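Proof. First let $d = 2$. Using $T_i |hf| = |T_i hf|$ and the triangle inequality, we have

\[
(h_i |hf|)^2 = \frac{1}{4} \, \big\| |hf| - |T_i hf| \big\|_{i,\infty}^2 \le \frac{1}{4} \, \big\| |hf - T_i hf| \big\|_{i,\infty}^2 = \frac{1}{4} \, \big\| |hf - T_i hf|^2 \big\|_{i,\infty}. \tag{2.2}
\]

Here, $hf - T_i hf$ is defined componentwise. Since $T_i h_i f = h_i f$, we obtain

\[
|hf - T_i hf|^2 = \frac{1}{4} \sum_{j \neq i} \big( \|f - T_j f\|_{j,\infty} - \|T_i f - T_{ij} f\|_{j,\infty} \big)^2 \le \frac{1}{4} \sum_{j \neq i} \|f - T_j f - T_i f + T_{ij} f\|_{j,\infty}^2, \tag{2.3}
\]

where the last inequality follows from the reverse triangle inequality again (for the pseudo-norm $\|\cdot\|_{j,\infty}$). Combining (2.2) and (2.3) yields

\[
(h_i |hf|)^2 \le \frac{1}{16} \, \Big\| \sum_{j \neq i} \|f - T_j f - T_i f + T_{ij} f\|_{j,\infty}^2 \Big\|_{i,\infty} \le \frac{1}{16} \sum_{j \neq i} \|f - T_j f - T_i f + T_{ij} f\|_{i,j,\infty}^2.
\]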

Summing over $i = 1, \ldots, n$ we arrive at the result in the case $d = 2$. For $d \ge 3$, note that $T_i h_{i_1 \ldots i_{d-1}} f = h_{i_1 \ldots i_{d-1}} f$ whenever $i \in \{i_1, \ldots, i_{d-1}\}$. The claim then follows in the same way as above. □

Corresponding results in the setting of differentiable functions (see Lemma 4.1) suggest replacing the Hilbert–Schmidt norms in Lemma 2.2 by operator type norms (1.17). In Boucheron, Bousquet, Lugosi and Massart [B-B-L-M], Theorem 14, iterations of (2.6) are sketched to study applications for Rademacher chaos type functions. Unfortunately, working out the arguments in the proof of Theorem 14, we seemed to need Hilbert–Schmidt instead of operator norms. Already in second order statistics of Rademacher variables like $\sum_{i=1}^{n} X_i X_{i+1}$ (setting $X_{n+1} = X_1$), an analogue of (2.1) for operator norms cannot be true (consider $d = 2$ and zeros of $hf$). Similar remarks hold for any of the difference operators introduced in Definition 1.4 (cf. Remark 3.1).

Our results will follow from certain moment inequalities for functions of independent random variables. In [B-B-L-M], cf. Theorem 2, the following moment bounds are shown:

\[
\|(f - \mathbb{E}f)_+\|_p \le \sqrt{2 \kappa p} \, \big\| V^+(f)^{1/2} \big\|_p, \qquad \|(f - \mathbb{E}f)_-\|_p \le \sqrt{2 \kappa p} \, \big\| V^-(f)^{1/2} \big\|_p,
\]

in terms of the conditional expectations

\[
V^+(f) = \mathbb{E}\Big[ \sum_{i=1}^{n} (f - T_i f)_+^2 \,\Big|\, X \Big], \qquad V^-(f) = \mathbb{E}\Big[ \sum_{i=1}^{n} (f - T_i f)_-^2 \,\Big|\, X \Big],
\]

where $\kappa = \frac{\sqrt{e}}{2 (\sqrt{e} - 1)} < 1.271$. Note that, in our notations according to Definition 1.4,

\[
V^+(f) = 2 \, |d^+ f|^2 \quad \text{and} \quad V^-(f) = 2 \, |d^- f|^2.
\]


For iterating these inequalities however, we had to bypass the problem that $d_{ii} = d_i$ respectively $d^+_{ii} = d^+_i$ (up to constant), which would introduce additional lower order differences on the right-hand side of (2.1). This motivated us to introduce the following related quantities adapted to the framework of $L^\infty$-bounds. For $i = 1, \ldots, n$, introduce

\[
h_i^+ f(X) = \frac{1}{2} \, \|(f(X) - T_i f(X))_+\|_{i,\infty}, \qquad h^+ f = (h_1^+ f, \ldots, h_n^+ f), \tag{2.4}
\]

\[
h_i^- f(X) = \frac{1}{2} \, \|(f(X) - T_i f(X))_-\|_{i,\infty}, \qquad h^- f = (h_1^- f, \ldots, h_n^- f), \tag{2.5}
\]

which are clearly difference operators in the sense of Conditions 2.1. Using the relations $V^+(f) \le 4 \, |h^+ f|^2$ and $V^-(f) \le 4 \, |h^- f|^2$, we get from the BBLM-result the following somewhat weaker bounds in terms of the $L^p$-norms as in (1.4).



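Theorem 2.3. With the same constant $\kappa = \frac{\sqrt{e}}{2 (\sqrt{e} - 1)}$, for any real $p \ge 2$,

\[
\|(f - \mathbb{E}f)_+\|_p \le \sqrt{8 \kappa p} \, \|h^+ f\|_p, \tag{2.6}
\]

\[
\|(f - \mathbb{E}f)_-\|_p \le \sqrt{8 \kappa p} \, \|h^- f\|_p. \tag{2.7}
\]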
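The bound (2.6) can be checked by exhaustive enumeration in a small case. The sketch below assumes, for illustration only, that the $X_i$ are independent and uniform on $\{-1, 1\}$, so that $h_i^+ f(x)$ reduces to half the spread of $f$ over the two values of the $i$-th coordinate; `pair_sum` is a hypothetical test function, and the chosen values of $n$ and $p$ are arbitrary.

```python
import itertools
import math

# The constant kappa = sqrt(e) / (2 (sqrt(e) - 1)) from Theorem 2.3.
KAPPA = math.sqrt(math.e) / (2.0 * (math.sqrt(math.e) - 1.0))


def with_coordinate(x, i, value):
    """Return the point x with its i-th coordinate replaced by value."""
    return x[:i] + (value,) + x[i + 1:]


def h_plus_norm(f, x):
    """|h^+ f(x)|: on the cube, h_i^+ f(x) = (max - min)/2 over the i-th coordinate."""
    total = 0.0
    for i in range(len(x)):
        values = [f(with_coordinate(x, i, v)) for v in (-1, 1)]
        total += (0.5 * (max(values) - min(values))) ** 2
    return math.sqrt(total)


def lp_norm(values, p):
    """L^p norm with respect to the uniform measure on the listed points."""
    return (sum(abs(v) ** p for v in values) / len(values)) ** (1.0 / p)


def theorem_2_3_sides(f, n, p):
    """Return (left side, right side) of (2.6) for f on {-1, 1}^n."""
    cube = list(itertools.product((-1, 1), repeat=n))
    mean = sum(f(x) for x in cube) / len(cube)
    lhs = lp_norm([max(f(x) - mean, 0.0) for x in cube], p)
    rhs = math.sqrt(8.0 * KAPPA * p) * lp_norm([h_plus_norm(f, x) for x in cube], p)
    return lhs, rhs


def pair_sum(x):
    """Hypothetical test function: sum of x_i x_j over pairs i < j."""
    n = len(x)
    return float(sum(x[i] * x[j] for i in range(n) for j in range(i + 1, n)))


results = {p: theorem_2_3_sides(pair_sum, 5, p) for p in (2.0, 3.0, 4.5)}
for p, (lhs, rhs) in results.items():
    print(f"p = {p}: ||(f - Ef)_+||_p = {lhs:.4f} <= {rhs:.4f}")
```

Applying the same routine to $-f$ gives the corresponding check of (2.7).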

For the reader’s convenience, let us give a self-contained proof of Theorem 2.3. It is sufficient to derive (2.6), since (2.7) follows from (2.6) by considering $-f$. The key steps are the following two lemmas.

Lemma 2.4. Assume $\mathbb{E}f = 0$. Then,

\[
\|f\|_2 \le \sqrt{2} \, \|hf\|_2, \tag{2.8}
\]

\[
\|f_+\|_2 \le 2 \, \|h^+ f\|_2. \tag{2.9}
\]

Proof. By the Efron–Stein inequality (1.9), $\mathbb{E}f^2 \le \mathbb{E}|df|^2$ and $\mathbb{E}f^2 \le 2 \, \mathbb{E}|d^+ f|^2$, while $|d^+ f|^2 \le 2 \, |h^+ f|^2$ and $|df|^2 \le 2 \, |hf|^2$ (cf. Remark 3.1 (v)). □

The next lemma provides a moment recursion similarly to [B-B-L-M], Lemma 3.

Lemma 2.5. For any real $p \ge 2$,

\[
\|f_+\|_p^p \le \|f_+\|_{p-1}^p + 4 (p - 1) \, \|h^+ f\|_p^2 \, \|f_+\|_p^{p-2}. \tag{2.10}
\]

Proof. First assume $n = 1$, i.e. $f = f(X)$ for a random variable $X$ and $Tf = f(\bar{X})$, where $\bar{X}$ is an independent copy of $X$. Using the notation $f_+^{p-1} \equiv (f_+)^{p-1}$, we have

\[
\frac{1}{2} \, \mathbb{E}\big[ (f_+^{p-1} - T f_+^{p-1})(f_+ - T f_+) \big] - \|f_+\|_p^p = -\|f_+\|_{p-1}^{p-1} \, \|f_+\|_1 \ge -\|f_+\|_{p-1}^p.
\]

Thus, by symmetry in $X$ and $\bar{X}$,

\[
\|f_+\|_p^p \le \|f_+\|_{p-1}^p + \mathbb{E}\big[ (f_+^{p-1} - T f_+^{p-1})_+ (f_+ - T f_+) \big] \le \|f_+\|_{p-1}^p + (p - 1) \, \mathbb{E}\big[ (f - Tf)_+^2 \, f_+^{p-2} \big].
\]

Since $(f - Tf)_+ \le 2 \, h^+ f$, and using Hölder’s inequality, the last expectation may be bounded by

\[
4 \, \mathbb{E}\big[ |h^+ f|^2 f_+^{p-2} \big] \le 4 \, \big\| |h^+ f|^2 \big\|_{p/2} \, \|f_+\|_p^{p-2} = 4 \, \|h^+ f\|_p^2 \, \|f_+\|_p^{p-2}. \tag{2.11}
\]

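This completes the proof in case $n = 1$. For $n \ge 1$, we use a tensorization argument: For any $g \in L^q$, $q \in (1, 2]$,

\[
\mathbb{E}|g|^q - (\mathbb{E}|g|)^q \le \mathbb{E} \sum_{i=1}^{n} \big( \mathbb{E}_i |g|^q - (\mathbb{E}_i |g|)^q \big), \tag{2.12}
\]

where $\mathbb{E}_i$ denotes expectation with respect to $X_i$. Applying this inequality to $g = f_+^{p-1}$ with $q = p/(p - 1)$, similarly to the case of $n = 1$ we obtain

\[
\begin{aligned}
\|f_+\|_p^p - \|f_+\|_{p-1}^p &\le \mathbb{E} \sum_{i=1}^{n} \big( \mathbb{E}_i f_+^p - (\mathbb{E}_i f_+^{p-1})^{p/(p-1)} \big) \\
&\le (p - 1) \sum_{i=1}^{n} \mathbb{E} \, \mathbb{E}_i \bar{\mathbb{E}}_i \big[ (f - T_i f)_+^2 \, f_+^{p-2} \big] \\
&\le 4 (p - 1) \sum_{i=1}^{n} \mathbb{E}\big[ |h_i^+ f|^2 f_+^{p-2} \big] = 4 (p - 1) \, \mathbb{E}\big[ |h^+ f|^2 f_+^{p-2} \big].
\end{aligned}
\]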

As in (2.11), by Hölder’s inequality, the last expectation is bounded by $\|h^+ f\|_p^2 \, \|f_+\|_p^{p-2}$, which gives the desired result.

It remains to prove (2.12). Let us mention that the tensorization of functionals $L(g) = \mathbb{E}\Psi(g) - \Psi(\mathbb{E}g)$ was proposed in the mid 1990s by Bobkov, as explained in [L1], Proposition 4.1. This property is actually equivalent to the convexity of $L$ in $g$, and can be explicitly expressed in terms of $\Psi$ (convexity of $\Psi$ and $-1/\Psi''$; see also [L-O]). For completeness of exposition let us include here a direct argument for the power functions $\Psi(x) = x^q$. By induction, it suffices to consider $n = 2$; we use the representation

\[
L(|g|) = \sup_{h \in L^q} \Big\{ q \, \mathbb{E}\big[ |g| \big( |h|^{q-1} - (\mathbb{E}|h|)^{q-1} \big) \big] - (q - 1) \big( \mathbb{E}|h|^q - (\mathbb{E}|h|)^q \big) \Big\}. \tag{2.13}
\]

Indeed, by the arithmetic-geometric inequality,

\[
\frac{1}{q} \, \mathbb{E}|g|^q + \frac{q - 1}{q} \, \mathbb{E}|h|^q \le \mathbb{E}|g| \, |h|^{q-1},
\]

which we rewrite as

\[
\mathbb{E}|g|^q \le q \, \mathbb{E}|g| \, |h|^{q-1} - (q - 1) \, \mathbb{E}|h|^q.
\]

We may assume $\mathbb{E}|g| = 1$; therefore, subtracting $(\mathbb{E}|g|)^q = 1$ on both sides,

\[
L(|g|) \le q \, \mathbb{E}\big[ |g| \big( |h|^{q-1} - (\mathbb{E}|h|)^{q-1} \big) \big] - (q - 1) \big( \mathbb{E}|h|^q - (\mathbb{E}|h|)^q \big) + R(\mathbb{E}|h|)
\]

with $R(x) = q x^{q-1} - (q - 1) x^q - 1$ for $x \ge 0$. Since $R(x) \le 0$, while equality holds if $h = g$, we arrive at (2.13). By Fubini’s theorem and applying (2.13), we now get

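\[
\begin{aligned}
\mathbb{E}_2 (\mathbb{E}_1 |g|)^q - (\mathbb{E}|g|)^q &= \mathbb{E}_2 (\mathbb{E}_1 |g|)^q - (\mathbb{E}_2 \mathbb{E}_1 |g|)^q \\
&= \sup_{h(X_2) \in L^q} \Big\{ q \, \mathbb{E}_2 \big[ (\mathbb{E}_1 |g|) \big( |h|^{q-1} - (\mathbb{E}_2 |h|)^{q-1} \big) \big] - (q - 1) \big( \mathbb{E}_2 |h|^q - (\mathbb{E}_2 |h|)^q \big) \Big\} \\
&= \sup_{h(X_2) \in L^q} \mathbb{E}_1 \Big\{ q \, \mathbb{E}_2 \big[ |g| \big( |h|^{q-1} - (\mathbb{E}_2 |h|)^{q-1} \big) \big] - (q - 1) \big( \mathbb{E}_2 |h|^q - (\mathbb{E}_2 |h|)^q \big) \Big\} \\
&\le \mathbb{E}_1 \sup_{h(X_2) \in L^q} \Big\{ q \, \mathbb{E}_2 \big[ |g| \big( |h|^{q-1} - (\mathbb{E}_2 |h|)^{q-1} \big) \big] - (q - 1) \big( \mathbb{E}_2 |h|^q - (\mathbb{E}_2 |h|)^q \big) \Big\} \\
&= \mathbb{E}_1 \big( \mathbb{E}_2 |g|^q - (\mathbb{E}_2 |g|)^q \big).
\end{aligned}
\]

As a consequence, by Fubini’s theorem again,

\[
\begin{aligned}
\mathbb{E}|g|^q - (\mathbb{E}|g|)^q &= \mathbb{E}_2 \big( \mathbb{E}_1 |g|^q - (\mathbb{E}_1 |g|)^q \big) + \mathbb{E}_2 (\mathbb{E}_1 |g|)^q - (\mathbb{E}|g|)^q \\
&\le \mathbb{E}_2 \big( \mathbb{E}_1 |g|^q - (\mathbb{E}_1 |g|)^q \big) + \mathbb{E}_1 \big( \mathbb{E}_2 |g|^q - (\mathbb{E}_2 |g|)^q \big).
\end{aligned}
\]

□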
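The tensorization inequality (2.12) for $n = 2$ can be examined numerically on a finite product space. The following sketch is only an illustration under assumed data: the table `G`, the marginal distributions `P1`, `P2` and the exponents $q$ are hypothetical choices, and the function name `tensorization_sides` is introduced here for the example.

```python
def tensorization_sides(g, p1, p2, q):
    """Return (left, right) sides of (2.12) for n = 2.

    g[a][b] is the value of the function at (X1, X2) = (a, b); p1 and p2 are
    the marginal distributions of the independent coordinates X1 and X2.
    """
    k1, k2 = len(p1), len(p2)
    abs_g = [[abs(g[a][b]) for b in range(k2)] for a in range(k1)]

    # Left side: E|g|^q - (E|g|)^q under the product measure.
    mean_q = sum(p1[a] * p2[b] * abs_g[a][b] ** q for a in range(k1) for b in range(k2))
    mean_1 = sum(p1[a] * p2[b] * abs_g[a][b] for a in range(k1) for b in range(k2))
    left = mean_q - mean_1 ** q

    # E[ E_1|g|^q - (E_1|g|)^q ]: integrate out X1 for each fixed value of X2.
    term_1 = sum(
        p2[b] * (sum(p1[a] * abs_g[a][b] ** q for a in range(k1))
                 - sum(p1[a] * abs_g[a][b] for a in range(k1)) ** q)
        for b in range(k2))

    # E[ E_2|g|^q - (E_2|g|)^q ]: integrate out X2 for each fixed value of X1.
    term_2 = sum(
        p1[a] * (sum(p2[b] * abs_g[a][b] ** q for b in range(k2))
                 - sum(p2[b] * abs_g[a][b] for b in range(k2)) ** q)
        for a in range(k1))

    return left, term_1 + term_2


# Hypothetical example data: X1 takes 2 values, X2 takes 3 values.
G = [[0.2, 1.5, 3.0],
     [2.0, 0.1, 0.7]]
P1 = [0.3, 0.7]
P2 = [0.2, 0.5, 0.3]

for q in (1.2, 1.5, 2.0):
    left, right = tensorization_sides(G, P1, P2, q)
    print(f"q = {q}: {left:.6f} <= {right:.6f}")
```

For $q = 2$ the left side is the variance of $|g|$, and the inequality reduces to the Efron–Stein inequality in two variables.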



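Following the arguments in [B-B-L-M], we may now prove Theorem 2.3:

Proof of Theorem 2.3. It suffices to prove (2.6) assuming $\mathbb{E}f = 0$. To this end, by induction on $k$, we show that for all $k \in \mathbb{N}$ and all $p \in (k, k + 1]$,

\[
\|f_+\|_p \le \sqrt{8 \kappa_p \, p} \; \|h^+ f\|_{p \vee 2} \quad \text{with} \quad \kappa_p = \frac{1}{2} \Big( 1 - \Big( 1 - \frac{1}{p} \Big)^{p/2} \Big)^{-1}. \tag{2.14}
\]

These constants are strictly increasing in $p$, $\kappa_1 = 1/2$ and $\lim_{p \to \infty} \kappa_p = \kappa = \frac{\sqrt{e}}{2 (\sqrt{e} - 1)}$. For $k = 1$ and $p \in (1, 2]$, by (2.9) and the fact that $\kappa_p \, p \ge 1/2$, we have

\[
\|f_+\|_p \le \|f_+\|_2 \le 2 \, \|h^+ f\|_2 \le \sqrt{8 \kappa_p \, p} \; \|h^+ f\|_2.
\]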
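The stated properties of the constants $\kappa_p$ are easy to confirm numerically. The short sketch below evaluates $\kappa_p$ on an arbitrarily chosen grid of values of $p$ (the grid and the function name `kappa_p` are assumptions made for this illustration) and compares with the limiting constant $\kappa$.

```python
import math


def kappa_p(p):
    """kappa_p = (1/2) * (1 - (1 - 1/p)^(p/2))^(-1), as in (2.14)."""
    return 0.5 / (1.0 - (1.0 - 1.0 / p) ** (p / 2.0))


# Limiting constant kappa = sqrt(e) / (2 (sqrt(e) - 1)).
KAPPA = math.sqrt(math.e) / (2.0 * (math.sqrt(math.e) - 1.0))

grid = [1.0, 1.5, 2.0, 3.0, 5.0, 10.0, 100.0, 1e4]
values = [kappa_p(p) for p in grid]

for p, value in zip(grid, values):
    print(f"p = {p:>8g}   kappa_p = {value:.6f}")
print(f"limit kappa = {KAPPA:.6f}")
```

On this grid the values increase from $\kappa_1 = 1/2$ towards $\kappa$, in line with the monotonicity used in the induction step below.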

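To make an induction step, fix an integer $k > 1$ and assume that (2.14) holds for all real $p \in [1, k]$. Now, consider the values $p \in (k, k + 1]$. Set

\[
x_p = \|f_+\|_p^p \; 8^{-p/2} \, \kappa_p^{-p/2} \, p^{-p/2} \, \|h^+ f\|_{p \vee 2}^{-p},
\]

so that it suffices to prove that $x_p \le 1$. In terms of $x_p$, (2.10) implies that

\[
\begin{aligned}
x_p \, 8^{p/2} \kappa_p^{p/2} p^{p/2} \, \|h^+ f\|_p^p &\le x_{p-1}^{p/(p-1)} \, 8^{p/2} \kappa_{p-1}^{p/2} (p - 1)^{p/2} \, \|h^+ f\|_{(p-1) \vee 2}^p \\
&\quad + 4 (p - 1) \, \|h^+ f\|_p^2 \; x_p^{1 - 2/p} \, 8^{p/2 - 1} \kappa_p^{p/2 - 1} p^{p/2 - 1} \, \|h^+ f\|_p^{p-2} \\
&\le x_{p-1}^{p/(p-1)} \, 8^{p/2} \kappa_p^{p/2} (p - 1)^{p/2} \, \|h^+ f\|_p^p + \frac{1}{2} \, x_p^{1 - 2/p} \, 8^{p/2} \kappa_p^{p/2 - 1} p^{p/2} \, \|h^+ f\|_p^p.
\end{aligned}
\]

Here we have used the fact that $\kappa_{p-1} \le \kappa_p$. Simplifying and using that by induction, $x_{p-1} \le 1$, it follows that

\[
x_p \le x_{p-1}^{p/(p-1)} \Big( 1 - \frac{1}{p} \Big)^{p/2} + \frac{1}{2 \kappa_p} \, x_p^{1 - 2/p} \le \Big( 1 - \frac{1}{p} \Big)^{p/2} + \frac{1}{2 \kappa_p} \, x_p^{1 - 2/p}.
\]

Now note that the function

\[
u_p(x) = \Big( 1 - \frac{1}{p} \Big)^{p/2} + \frac{1}{2 \kappa_p} \, x^{1 - 2/p} - x
\]

is concave on $\mathbb{R}_+$ and positive at $x = 0$. Since $u_p(1) = 0$ and $u_p(x_p) \ge 0$, we may conclude that $x_p \le 1$. □

Corollary 2.6. Given $f = f(X_1, \ldots, X_n)$ in $L^\infty(\mathbb{P})$, for all $p \ge 2$,

\[
\|f\|_p \le \|f\|_2 + \sqrt{32 \kappa p} \; \|hf\|_p. \tag{2.15}
\]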

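Proof. By Theorem 2.3,

\[
\|f - \mathbb{E}f\|_p \le \|(f - \mathbb{E}f)_+\|_p + \|(f - \mathbb{E}f)_-\|_p \le \sqrt{8 \kappa p} \, \|h^+ f\|_p + \sqrt{8 \kappa p} \, \|h^- f\|_p \le 2 \sqrt{8 \kappa p} \, \|hf\|_p. \tag{2.16}
\]

On the other hand, by the triangle inequality,

\[
\|f - \mathbb{E}f\|_p \ge \|f\|_p - |\mathbb{E}f| \ge \|f\|_p - \|f\|_2. \tag{2.17}
\]

It remains to combine (2.16) and (2.17). □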
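Corollary 2.6 can likewise be checked by enumeration for a function that is not centered. The sketch below again assumes independent coordinates uniform on $\{-1, 1\}$, for which $h_i f(x)$ is half the absolute difference of $f$ over the two values of the $i$-th coordinate; `shifted_chaos` is a hypothetical example function.

```python
import itertools
import math

KAPPA = math.sqrt(math.e) / (2.0 * (math.sqrt(math.e) - 1.0))


def with_coordinate(x, i, value):
    """Return the point x with its i-th coordinate replaced by value."""
    return x[:i] + (value,) + x[i + 1:]


def h_norm(f, x):
    """|hf(x)| on the cube: h_i f(x) = |f(x_i = 1) - f(x_i = -1)| / 2."""
    return math.sqrt(sum(
        (0.5 * abs(f(with_coordinate(x, i, 1)) - f(with_coordinate(x, i, -1)))) ** 2
        for i in range(len(x))))


def lp_norm(values, p):
    """L^p norm with respect to the uniform measure on the listed points."""
    return (sum(abs(v) ** p for v in values) / len(values)) ** (1.0 / p)


def corollary_2_6_sides(f, n, p):
    """Return (left side, right side) of (2.15) for f on {-1, 1}^n."""
    cube = list(itertools.product((-1, 1), repeat=n))
    f_values = [f(x) for x in cube]
    h_values = [h_norm(f, x) for x in cube]
    lhs = lp_norm(f_values, p)
    rhs = lp_norm(f_values, 2.0) + math.sqrt(32.0 * KAPPA * p) * lp_norm(h_values, p)
    return lhs, rhs


def shifted_chaos(x):
    """Hypothetical test function with non-zero mean."""
    return 1.0 + x[0] * x[1] + x[2] * x[3] * x[4]


sides = {p: corollary_2_6_sides(shifted_chaos, 5, p) for p in (2.0, 4.0, 8.0)}
for p, (lhs, rhs) in sides.items():
    print(f"p = {p}: ||f||_p = {lhs:.4f} <= {rhs:.4f}")
```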



We shall now prove Theorem 1.1. Recall that if a relation of the form

\[
\|f\|_k \le \gamma k \qquad (k \in \mathbb{N}) \tag{2.18}
\]

holds true with some constant $\gamma > 0$, then $f$ has sub-exponential tails, i.e. $\mathbb{E} e^{c |f|} \le 2$ for some constant $c = c(\gamma) > 0$, e.g. $c = \frac{1}{2 \gamma e}$. Indeed, using $k! \ge (\frac{k}{e})^k$, we have

\[
\mathbb{E} \exp(c |f|) = 1 + \sum_{k=1}^{\infty} c^k \, \frac{\mathbb{E}|f|^k}{k!} \le 1 + \sum_{k=1}^{\infty} (c \gamma)^k \, \frac{k^k}{k!} \le 1 + \sum_{k=1}^{\infty} (c \gamma e)^k = 2.
\]

Proof of Theorem 1.1. Put $A = \sqrt{32 \kappa p}$. Using (2.15) with $f$ replaced by $|h^{(k-1)} f|_{HS}$ for $k = 1, \ldots, d$, and applying Lemma 2.2, we get

\[
\|h^{(k-1)} f\|_{HS,p} \le \|h^{(k-1)} f\|_{HS,2} + A \, \big\| h |h^{(k-1)} f|_{HS} \big\|_p \le \|h^{(k-1)} f\|_{HS,2} + A \, \|h^{(k)} f\|_{HS,p}.
\]

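Consequently, by iteration, and applying (2.8), we arrive at

\[
\begin{aligned}
\|f\|_p &\le \|f\|_2 + \sum_{k=1}^{d-1} A^k \, \|h^{(k)} f\|_{HS,2} + A^d \, \|h^{(d)} f\|_{HS,p} \\
&\le \sqrt{2} \, \|hf\|_2 + \sum_{k=1}^{d-1} A^k \, \|h^{(k)} f\|_{HS,2} + A^d \, \|h^{(d)} f\|_{HS,p}.
\end{aligned} \tag{2.19}
\]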

Now, since $\|h^{(k)} f\|_{HS,2} \le 1$ for $k \le d - 1$ and $\|h^{(d)} f\|_{HS,\infty} \le 1$ by assumption, we get

\[
\|f\|_p \le \sqrt{2} + \sum_{k=1}^{d} A^k = \sqrt{2} - 1 + \frac{A^{d+1} - 1}{A - 1} \le B^d, \qquad B = A + \sqrt{2}, \tag{2.20}
\]

for all $p \ge 2$. For this region, $B \le \sqrt{C p}$, where the best constant corresponds to $p = 2$, and then we find that $C < 55$. Hence, we obtain the bound

\[
\|f\|_p \le (55 \, p)^{d/2}, \qquad p \ge 2.
\]

As for $0 < p < 2$, one may use $\|f\|_p \le \|f\|_2 \le (110)^{d/2}$, and thus, for all $k \ge 1$,

\[
\big\| |f|^{2/d} \big\|_k = \|f\|_{2k/d}^{2/d} \le \gamma k,
\]

as in (2.18), with constant $\gamma = 110$. □
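The numerical constants appearing in this proof can be reproduced with a few lines of code. The sketch below computes $B^2 / p$ at $p = 2$ (the value behind $C < 55$), checks the geometric sum bound in (2.20) for a few orders $d$, and evaluates the series from the sub-exponential moment argument with $c = 1/(2 \gamma e)$ and $\gamma = 110$. The function names and the truncation of the series at 60 terms are choices made for this illustration.

```python
import math

KAPPA = math.sqrt(math.e) / (2.0 * (math.sqrt(math.e) - 1.0))


def b_squared_over_p(p):
    """B^2 / p with A = sqrt(32 kappa p) and B = A + sqrt(2)."""
    a = math.sqrt(32.0 * KAPPA * p)
    b = a + math.sqrt(2.0)
    return b * b / p


def geometric_bound_holds(p, d):
    """Check sqrt(2) + sum_{k=1}^d A^k <= B^d, as used in (2.20)."""
    a = math.sqrt(32.0 * KAPPA * p)
    lhs = math.sqrt(2.0) + sum(a ** k for k in range(1, d + 1))
    return lhs <= (a + math.sqrt(2.0)) ** d


def exponential_moment_bound(gamma, terms=60):
    """1 + sum_k (c gamma)^k k^k / k! with c = 1 / (2 gamma e), truncated."""
    c = 1.0 / (2.0 * gamma * math.e)
    total = 1.0
    for k in range(1, terms + 1):
        # Work with logarithms to avoid overflow in k^k and k!.
        log_term = k * math.log(c * gamma) + k * math.log(k) - math.lgamma(k + 1)
        total += math.exp(log_term)
    return total


C = b_squared_over_p(2.0)
print(f"B^2 / p at p = 2: {C:.4f}")
print("(2.20) holds for d = 1..6 at p = 2:",
      all(geometric_bound_holds(2.0, d) for d in range(1, 7)))
print(f"series bound for gamma = 110: {exponential_moment_bound(110.0):.6f}")
```

Since $B^2 / p = (\sqrt{32 \kappa} + \sqrt{2/p})^2$ is decreasing in $p$, the value at $p = 2$ is indeed the relevant one for the range $p \ge 2$.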




Proof of Proposition 1.2. First note that since the $X_i$ are centered, we have $\alpha_0 = 0$, and the Hoeffding decomposition of $f$ is given by the polynomials

\[
h_{i_1 \ldots i_d}(X_{i_1}, \ldots, X_{i_d}) = \alpha_{i_1 \ldots i_d} \, X_{i_1} \cdots X_{i_d}
\]

for all $d = 1, \ldots, n$ and $i_1 < \ldots < i_d$. It is now easily seen that for any $1 \le k \le d$ and $1 \le j_1 \neq \ldots \neq j_k \le n$,

\[
h_{j_1 \ldots j_k} f(X) = hX_{j_1} \cdots hX_{j_k} \sum_{i_1 \, [\ldots]} \alpha_{i_1 \ldots i_d} \, [\ldots]
\]