Adaptive Multinomial Matrix Completion

0 downloads 0 Views 334KB Size Report
Aug 26, 2014 - sampling scheme and give theoretical guarantees on the performance of a constrained, nuclear norm penalized maximum likelihood estimator.
Vol. 0 (0000) DOI: 10.1214/154957804100000000

arXiv:1408.6218v1 [math.ST] 26 Aug 2014

Adaptive Multinomial Matrix Completion Olga Klopp CREST and MODAL’X, Universit´ e Paris Ouest e-mail: [email protected]

Jean Lafond Institut Mines-T´ el´ ecom, T´ el´ ecom ParisTech, CNRS LTCI e-mail: [email protected]

´ Eric Moulines Institut Mines-T´ el´ ecom, T´ el´ ecom ParisTech, CNRS LTCI e-mail: [email protected]

Joseph Salmon Institut Mines-T´ el´ ecom, T´ el´ ecom ParisTech, CNRS LTCI e-mail: [email protected] Abstract: The task of estimating a matrix given a sample of observed entries is known as the matrix completion problem. Most works on matrix completion have focused on recovering an unknown real-valued low-rank matrix from a random sample of its entries. Here, we investigate the case of highly quantized observations when the measurements can take only a small number of values. These quantized outputs are generated according to a probability distribution parametrized by the unknown matrix of interest. This model corresponds, for example, to ratings in recommender systems or labels in multi-class classification. We consider a general, non-uniform, sampling scheme and give theoretical guarantees on the performance of a constrained, nuclear norm penalized maximum likelihood estimator. One important advantage of this estimator is that it does not require knowledge of the rank or an upper bound on the nuclear norm of the unknown matrix and, thus, it is adaptive. We provide lower bounds showing that our estimator is minimax optimal. An efficient algorithm based on lifted coordinate gradient descent is proposed to compute the estimator. A limited Monte-Carlo experiment, using both simulated and real data is provided to support our claims. MSC 2010 subject classifications: Primary 62J02, 62J99; secondary 62H12,60B20. Keywords and phrases: Low rank matrix estimation; matrix completion; multinomial model.

1. Introduction The matrix completion problem arises in a wide range of applications such as image processing [14, 15, 27], quantum state tomography [12], seismic data 0

/Adaptive Multinomial Matrix Completion

1

reconstruction [28] or recommender systems [20, 2]. It consists in recovering all the entries of an unknown matrix, based on partial, random and, possibly, noisy observations of its entries. Of course, since only a small proportion of entries is observed, the problem of matrix completion is, in general, ill-posed and requires a penalization favoring low rank solutions. In the classical setting, the entries are assumed to be real valued and observed in presence of additive, homoscedastic Gaussian or sub-Gaussian noise. In this framework, the matrix completion problem can be solved provided that the unknown matrix is low rank, either exactly or approximately; see [6, 16, 19, 24, 4, 18] and the references therein. Most commonly used methods amount to solve a least square program under a rank constraint or its convex relaxation provided by the nuclear (or trace) norm [9]. In this paper, we consider a statistical model where instead of observing a real-valued entry of an unknown matrix we are now able to see only highly quantized outputs. These discrete observations are generated according to a probability distribution which is parameterized by the corresponding entry of the unknown low-rank matrix. This model is well suited to the analysis of voting patterns, preference ratings, or recovery of incomplete survey data, where typical survey responses are of the form “true/false”, “yes/no” or “agree/disagree/no opinion” for instance. The problem of matrix completion over a finite alphabet has received much less attention than the traditional unquantized matrix completion. One-bit matrix completion, corresponding to the case of binary, i.e. yes/no, observations, was first introduced by [7]. In this paper, the first theoretical guarantees on the performance of a nuclear-norm constrained maximum likelihood estimator are given. The sampling model considered in [7] assumes that the entries are sampled uniformly at random. Unfortunately, this condition is unrealistic for recommender system applications: in such a context some users are more active than others and popular items are rated more frequently. Another important issue is that the method of [7] requires the knowledge of an upper bound on the nuclear norm or on the rank of the unknown matrix. Such information is usually not available in applications. On the other hand, our estimator yields a faster rate of convergence than those obtained in [7]. One-bit matrix completion was further considered by [5] where a max-norm constrained maximum likelihood estimate is considered. This method allows more general non-uniform sampling schemes but still requires an upper bound on the max-norm of the unknown matrix. Here again, the rates of convergence obtained in [5] are slower than the rate of convergence of our estimator. Recently, [13] consider general exponential family distributions, which cover some distributions over finite sets. Their method, unlike our estimator, requires the knowledge of the “spikiness ratio” (usually unknown) and the uniform sampling scheme. In the present paper, we consider a maximum likelihood estimator with nuclear-norm penalization. Our method allows us to consider general sampling scheme and only requires the knowledge of an upper bound on the maximum absolute value of the entries of the unknown matrix. All the previous works on this

/Adaptive Multinomial Matrix Completion

2

model also require the knowledge of this bound together with some additional (and more difficult to obtain) information on the unknown matrix. The paper is organized as follows. In Section 2.1, the one-bit matrix completion is first discussed and our estimator is introduced. We establish upper bounds both on the Frobenius norm between the unknown true matrix and the proposed estimator and on the associated Kullback-Leibler divergence. In Section 2.2 lower bounds are established, showing that our upper bounds are minimax optimal up to logarithmic factors. Then, the one-bit matrix completion problem is extended to the case of a more general finite alphabet. In Section 3 an implementation based on the lifted coordinate descent algorithm recently introduced in [8] is proposed. A limited Monte Carlo experiment supporting our claims is then presented in Section 4. Notations For any integers n, m1 , m2 > 0, [n] := {1, . . . , n}, m1 ∨ m2 := max(m1 , m2 ) and m1 ∧ m2 := min(m1 , m2 ). We equip the set of m1 × m2 matrices with real entries (denoted Rm1 ×m2 ) with the scalar product hX|X ′ i := tr(X ⊤ X ′ ). For a given matrix X ∈ Rm1 ×m2 we write kXk∞ := maxi,j |Xi,j | and for any ρ ≥ 1, we denote its Schatten ρ-norm (see [1]) by kXkσ,ρ :=

mX 1 ∧m2

!1/ρ

σiρ (X)

i=1

,

with σi (X) the singular values of X ordered in decreasing order. The operator norm of X is kXkσ,∞ := σ1 (X). For any integer q > 0, we denote by Rm1 ×m2 ×q the set of m1 × m2 × q (3-way) tensors. A tensor X is of the form X = (X l )ql=1 where X l ∈ Rm1 ×m2 for any l ∈ [q]. For any integer p > 0, a function f : Rq → Sp is called a p-link function, where Sp is the p−dimensional probability simplex. Given a p-link function f and X , X ′ ∈ Rm1 ×m2 ×q , we define the squared Hellinger distance

d2H

X 1 (f (X ), f (X )) := m1 m2 ′

X

" 2 # q X q ′ j j f (Xk,k′ ) − f (Xk,k′ ) ,

k∈[m1 ] k′ ∈[m2 ] j∈[p]

j q where Xk,k′ denotes the vector (Xk,k ′ )j=1 . The Kullback-Leibler divergence is

X 1 KL (f (X ), f (X )) := m1 m2 ′

X

X

k∈[m1 ] k′ ∈[m2 ] j∈[p]

"

j

f (Xk,k′ ) log

f j (Xk,k′ ) ′ f j (Xk,k ′)

!#

.

For any tensor X ∈ Rm1 ×m2 ×q we define rk(X ) := maxl∈[q] rk(X l ), where rk(X l ) is the rank of the matrix X l and its sup-norm by kX k∞ := maxl∈[q] kX l k∞ .

/Adaptive Multinomial Matrix Completion

3

2. Main results 2.1. One-bit matrix completion Assume that the observations follow a binomial distribution parametrized by a ¯ ∈ Rm1 ×m2 . Assume in addition that an i.i.d. sequence of coefficients matrix X n (ωi )i=1 ∈ ([m1 ] × [m2 ])n is revealed and denote by Π their distribution. The observations associated to these coefficients are denoted by (Yi )ni=1 ∈ {1, 2}n and distributed as follows ¯ ωi ), P(Yi = j) = f j (X

j ∈ {1, 2} ,

(1)

¯i where f = (f j )2j=1 is a 2−link function. For ease of notation, we often write X ¯ instead of Xωi . Denote by ΦY the (normalized) negative log-likelihood of the observations:   n 2  1 X X 1{Yi =j} log f j (Xi )  . (2) ΦY (X) = − n i=1 j=1

¯ ∞ . We consider the following estimator Let γ > 0 be an upper bound of kXk ¯ of X: ˆ= X

arg min X∈Rm1 ×m2 ,kXk∞ ≤γ

ΦλY (X) ,

where

ΦλY (X) = ΦY (X) + λkXkσ,1 , (3)

with λ > 0 being a regularization parameter. Consider the following assumptions. H1. The functions x 7→ − ln(f j (x)), j = 1, 2 are convex. In addition, There exist positive constants Hγ , Lγ and Kγ such that: Hγ ≥2 sup (| log(f 1 (x))| ∨ | log(f 2 (x))|) ,

(4)

|x|≤γ

|(f 2 )′ (x)| |(f 1 )′ (x)| , sup Lγ ≥ max sup f 1 (x) |x|≤γ f 2 (x) |x|≤γ

!

,

(5)

(f 1 )′ (x)2 . 8f 1 (x)(1 − f 1 (x))

(6)

Remark 1. As shown in [7, Lemma 2], Kγ satisfies   2 p 2 X p Kγ ≤ inf  f j (x) − f j (y) /(x − y)2  .

(7)

Kγ = inf g(x) , |x|≤γ

x,y∈R |x|≤γ |y|≤γ

where g(x) =

j=1

Our framework allows a general distribution Π. We assume that Π satisfies the following assumptions introduced in [18] in the classical setting of unquantized matrix completion:

/Adaptive Multinomial Matrix Completion

4

H2. There exists a constant µ > 0 such that, for any m1 > 0 and m2 > 0 min

k∈[m1 ],k′ ∈[m2 ]

πk,k′ ≥ µ/(m1 m2 ) ,

where πk,k′ = P(ω1 = (k, k ′ )) .

(8)

P m1 Pm πk,k′ the probability of revealing Denote by Rk = k′ 2=1 πk,k′ and Ck′ = k=1 a coefficient from row k and column k ′ , respectively. H3. There exists a constant ν ≥ 1 such that, for all m1 , m2 , max(Rk , Cl ) ≤ k,l

ν , m1 ∧ m2

The first assumption ensures that every coefficient has a nonzero probability of being observed, whereas the second assumption requires that no column nor row is sampled with too high probability (see also [10, 18] for more details on these conditions). For instance, the uniform distribution yields µ = ν = 1. Define d = m1 + m2 ,

M = m1 ∨ m2 ,

m = m1 ∧ m2 .

(9)

¯ ∞ ≤ γ. Assume in addition that Theorem 1. Assume 1, 2, 3 and that kXk n ≥ 2m log(d)/(9ν). Take r 2ν log(d) . λ = 6Lγ mn Then, with probability at least 1−3d−1 the Kullback-Leibler divergence is bounded by ! r   ¯ M log(d) L2γ rk(X) log(d) ¯ ˆ KL f (X), f (X) ≤ µ max c¯µν , , eHγ Kγ n n with c¯ a universal constant whose value is specified in the proof. Proof. See Section 5.1. ˆ This result immediately gives an upper bound on the estimation error of X, measured in Frobenius norm: Corollary 2. Under the same assumptions and notations of Theorem 1 we have with probability at least 1 − 3d−1 ! r ¯ − Xk ˆ 2 ¯ M log(d) eHγ log(d) kX L2γ rk(X) σ,2 . ≤ µ max c¯µν , m1 m2 Kγ2 n Kγ n Proof. Using Lemma 9 and Theorem 1, the result follows. Remark 2. Note that, up to the factor L2γ /Kγ2 , the rate of convergence given by Corollary 2, is the same as in the case of usual unquantized matrix completion, see, for example, [18] and [19]. For this usual matrix completion setting, it has

/Adaptive Multinomial Matrix Completion

5

been shown in [19, Theorem 3] that this rate is minimax optimal up to a logarithmic factor. Let us compare this rate of convergence with those obtained in ¯ is estimated previous works on 1-bit matrix completion. In [7], the parameter X by minimizing the negative log-likelihood under the constraints kXk∞ ≤ γ and √ ¯ ≤ r, kXkσ,1 ≤ γ rm1 m2 for some r > 0. Under the assumption that rk(X) they could prove that r ¯ − Xk ˆ 2 kX rd σ,2 ≤ Cγ , m1 m2 n where Cγ is a constant depending on γ (see [7, Theorem 1]). This rate of convergence is slower than the rate of convergence given by Corollary 2. [5] studied a max-norm constrained maximum likelihood estimate and obtain a rate of convergence similar to [7]. In [13], matrix completion was considered for a likelihood belonging to the exponential family. Note, for instance, that the logit distribution belongs to such a family. The following upper bound on the estimation error is provided (see [13, Theorem 1])   ¯ − Xk ˆ 2 ¯ kX σ,2 2 rk(X)M log(M ) . ≤ Cγ α∗ m1 m2 n

(10)

Comparing with Corollary 2, (10) contains an additional term α2∗ where α∗ is √ ¯ ∞. an upper bound of m1 m2 kXk 2.2. Minimax lower bounds for one-bit matrix completion Corollary 2 insures that our estimator achieves certain Frobenius norm errors. We now discuss the extent to which this result is optimal. A classical way to address this question is by determining minimax rates of convergence. For any integer 0 ≤ r ≤ min(m1 , m2 ) and any γ > 0, we consider the following family of matrices  ¯ ≤ r, kXk ¯ ∈ Rm1 ×m2 : rank(X) ¯ ∞≤γ . F (r, γ) = X

ˆ that are functions of We will denote by inf Xˆ the infimum over all estimators X n m1 ×m2 , let PX denote the probability distrithe data (ωi , Yi )i=1 . For any X ∈ R bution of the observations (ωi , Yi )ni=1 for a given 2−link function f and sampling distribution Π. We establish a lower bound under an additional assumption on the function f 1 : H4. (f 1 )′ is decreasing on R+ and Kγ = g(γ) where g and Kγ are defined in (6). In particular, 4 is satisfied in the case of logit or probit models. The following theorem establishes a lower bound on the minimax risk in squared Frobenius norm:

/Adaptive Multinomial Matrix Completion

6

Theorem 3. Assume 4. Let α ∈ (0, 1/8) Then there exists a constant c > 0 such that, for all m1 , m2 ≥ 2, 1 ≤ r ≤ m, and γ > 0, !  ˆ − Xk ¯ 22 kX 2 Mr ≥ δ(α, M ) , > c min γ , inf sup PX¯ ˆ X∈ m1 m2 n K0 ¯ F (r,γ) X where δ(α, M ) =

1 1 + 2−rM/16



1 − 2α −

1 2

r

α log(2)(rM )



.

(11)

Proof. See Section 5.3. Note that the lower bound given by Theorem 3 is proportional to the rank ¯ and inversely proportional the multiplied by the maximum dimension of X sample size n. Therefore the lower bound matches the upper bound given by Corollary 2 up to a constant and a logarithmic factor. The lower bound does not capture the dependance on γ, note however that the upper and lower bound only differ by a factor L2γ /Kγ . 2.3. Extension to multi-class problems Let us now consider a more general setting where the observations follow a distribution over a finite set {1, . . . , p}, parameterized by a tensor X¯ ∈ Rm1 ×m2 ×q . The distribution of the observations (Yi )ni=1 ∈ [p]n is P(Yi = j) = f j (X¯ωi ),

j ∈ [p] ,

¯ l )q . where f = (f j )pj=1 is now a p-link function and X¯ωi denotes the vector (X ωi l=1 The negative log-likelihood of the observations is now given by:   p n  1 X X ΦY (X ) = − 1{Yi =j} log f j (Xi )  . (12) n i=1 j=1

where we use the notation Xi = Xωi . Our proposed the estimator is defined as: Xˆ =

arg min X ∈Rm1 ×m2 ×q kX k∞ ≤γ

ΦλY

(X ) ,

where

ΦλY

(X ) = ΦY (X ) + λ

q X j=1

kX j kσ,1 , (13)

In order to extend the results of the previous sections we make an additional assumption which allows to split the log-likelihood as a sum. H5. There exist functions (glj )(l,j)∈[p]×[q] such that the p-link function f can be factorized as follows j

f (x1 , . . . , xq ) =

q Y l=1

glj (xl )

for j ∈ [p] .

/Adaptive Multinomial Matrix Completion

7

The model considered above covers many finite distributions including among others logistic binomial (see Section 2.1) and conditional logistic multinomial (see Section 3). Assumptions on constants depending on the link function are extended by H6. There exist positive constant Hγ , Lγ and Kγ such that: Hγ ≥

max

sup 2| log(glj (x))| ,

(14)

(j,l)∈[p]×[q] |x|≤γ

(g j )′ (x) l Lγ ≥ max sup , (j,l)∈[p]×[q] |x|≤γ g j (x) l   p  2 X p p Kγ ≤ inf q  f j (x) − f j (y) /kx − yk22  . x,y∈R kxk∞ ≤γ kyk∞ ≤γ

(15)

(16)

j=1

¯ := ∇ ΦY (X¯ ) ∈ Rm1 ×m2 ×q . We also For any tensor X ∈ Rm1 ×m2 ×q , we write Σ n define the sequence of matrices (Ei )i=1 associated to the revealed coefficients ′ m2 1 (ωi )ni=1 by Ei := eki (e′li )⊤ where (ki , li ) = ωi and with (ek )m k=1 (resp. ((el )l=1 ) m2 m1 (resp. R ). Furthermore, if (εi )1≤i≤n is a being the canonical basis of R Rademacher sequence independent from (ωi )ni=1 and (Yi )1≤i≤n we define the matrix ΣR as follow n 1X εi Ei . ΣR := n i=1 We can now state the main results of this paper.

¯ l kσ,∞ and Theorem 4. Assume 2, 5 and 6 hold, λ > 2 maxl∈[q] kΣ

−1

X¯ ≤ γ. Then, with probability at least 1 − 2d , the Kullback-Leibler diver∞ gence is bounded by   KL f (X¯ ), f (Xˆ ) ≤

 m1 m2 rk(X¯ )  2 λ + 256e(qLγ E kΣR kσ,∞ )2 , eHγ µ max 4µ Kγ

r

log(d) n

!

.

with d defined in (9). Proof. See Section 5.1. Note that the lower bound of λ is stochastic and the expectation EkΣR kσ,∞ is unknown. However, these quantities can be controlled using 3. Theorem 5. Assume 2, 3, 5 and 6 hold and that kX¯ k∞ ≤ γ. Assume in addition that n ≥ 2m log(d)/(9ν). Take r 2ν log(d) . λ = 6Lγ mn

/Adaptive Multinomial Matrix Completion

8

Then, with probability at least 1 − (2 + q)d−1 , the Kullback-Leibler divergence is bounded by   q 2 L2γ rk(X¯ ) M log(d) , eHγ KL f (X¯ ), f (Xˆ ) ≤ µ max c¯µν Kγ n

r

log(d) n

!

,

with c¯ a universal constant , d, m and M defined in (9). Proof. See Section 5.2. 3. Implementation In this section an implementation for the following p-class link function is given:  !−1 j  Y   l j  (1 + exp(x )) if j ∈ [p − 1] ,  exp(x ) j 1 p−1 l=1 f (x , . . . , x ) = !−1 p−1  Y   l  (1 + exp(x )) if j = p .   l=1

This p-class link function boils down to parameterizing the distribution of the observation as follows: P(Yi = 1) =

¯ 1) exp(X i ¯ 1) , 1 + exp(X i

P(Yi = j|Yi > j − 1) =

¯ j) exp(X i ¯ j) 1 + exp(X i

for j ∈ {2, . . . , p − 1} .

Assumption 5 is satisfied and the problem (13) is separable w.r.t. each matrix X l . Following [7], we solve (13) without taking into account the constraint γ; as reported in [7] and confirmed by our experiments, the impact of this projection is negligible, whereas it increases significantly the computation burden. Because the problem is separable, it suffices to solve in parallel each subproblem Xˆ l = arg min Φlλ (X) , X∈R

m1 ×m2

where

Φlλ (X) = Φl (X) + λkXkσ,1 .

(17)

This can be achieved by using the coordinate gradient descent algorithm introduced by [8]. To describe the algorithm, consider first the set of normalized rank one matrices  M := M ∈ Rm1 ×m2 |M = uv ⊤ | kuk2 = kvk2 = 1, .

Define Θ the linear space of real-valued functions on M with finite support, i.e., any θ ∈ Θ satisfies θ(M ) = 0 except forPa finite number of M ∈ M. This space is equipped with the ℓ1 -norm kθk1 = M∈M |θ(M )|. Define by Θ+ the

/Adaptive Multinomial Matrix Completion

9

positive orthant, i.e., the cone of functions θ ∈ Θ such that θ(M ) ≥ 0 for all M ∈ M. Any matrix X ∈ Rm1 ×m2 can be associated to an element θ ∈ Θ+ satisfying X X= θ(M )M . (18) M∈M

Pm Such function is not unique. Consider an SVD of X i.e., X = i=1 λi ui vi⊤ , m where (λi )m values and (ui )m i=1 are the singular i=1 , (vi )i=1 are left and right Pm singular vectors, then θX = i=1 λi δui vi⊤ satisfies (18), with δM ∈ Θ is the function on M satisfying δM (M ) = 1 and δM (M ′ ) = 0 if M ′ 6= M . As seen below, the function θX plays a key role. Conversely, for any θ ∈ Θ+ , define X W : θ → Wθ := θ(M )M . M∈M

and the auxiliary objective function: ˜ lλ (θ) := λ ˜ lλ : θ → Φ Φ

X

θ(M ) + Φl (Wθ ) .

(19)

M∈M

The triangular inequality implies that for all θ ∈ Θ+ , kWθ kσ,1 ≤ kθk1 . For θ ∈ Θ we denote by supp(θ) the support of θ i.e., the subset of M such that θ(M ) 6= 0 ⇐⇒ M ∈ supp(θ). If for any M, M ′ ∈ supp(θ), M 6= M ′ , P ′ hM | M i = 1 , then kθk1 = kWθ kσ,1 . Indeed in such case M∈M θ(M )M defines a SVD of Wθ . Therefore the minimization of (19) is actually equivalent to the minimization of (17); see [8, Theorem 3.2]. The minimization (19) can be implemented using a coordinate gradient descent algorithm which updates at each iteration the nonnegative finite support function θ. The algorithm is summarized in Algorithm 1. Compared to the Soft-Impute [23] or the SVT [3] algorithms, this algorithm does not require the computation of a full SVD at each step of the main loop of an iterative (proximal) algorithm (recall that the proximal operator associated to the nuclear norm is the softthresholding operator of the singular values). The proposed algorithm requires only to compute the largest singular values and associated singular vectors. Another interest of this algorithm is that it only requires to evaluate the coordinate of the gradient for the entries which have been actually observed. It is therefore memory efficient when the number of observations is smaller than the total number of coefficients m1 m2 , which is the typical setting in which matrix completion is used. Moreover, we use Arnoldi iterations to compute the top singular values and vector pairs (see [11, Section 10.5] for instance) which allows us to take full advantage of sparse structures, the minimizations in the inner loop are carried out using the L-BFGS-B algorithm. Table 1 provides the execution time one-bit matrix completion (on a 3.07Ghz w3550 Xeon CPU with RAM 1.66 Go, Cache 8 Mo, C implementation).

/Adaptive Multinomial Matrix Completion

10

Algorithm 1: Lifted coordinate gradient descent Initialization: initial parameter θ0 , precision ǫ Loop: Compute the top singular vector pair of −∇ Φl (Wθk ): uk , vk gk ← λ + h∇ Φl (Wθk ) | uk vk⊤ i if gk ≤ −ǫ/2 then   ˜ l θ + bδ βk ← arg min Φ ⊤) b∈R+

λ

θk+1 ← θk + βk δu else gkmax ←

max

uk vk

⊤, k vk

M ∈supp(θk )

if gkmax ≤ ǫ then Break else θk+1 ←

|λ + h∇ Φl (Wθk ) | M i|

arg min

θ ′ ∈Θ+ ,supp(θ ′ )⊂supp(θk )

˜ l (θ ′ ) Φ λ

Parameter Size Observations Execution Time (s.)

1000 × 1000 3000 × 3000 10000 × 10000 100 · 103 1 · 106 10 · 106 4.5 52 730 Table 1 Execution time of the proposed algorithm for the binary case.

4. Numerical Experiments We have performed numerical experiments on both simulated and real data provided by the MovieLens project (http://grouplens.org). Both the onebit matrix completion - p = 2, q = 1 - and the extended multi-class setting -p = 5, q = 4 - are considered; comparisons are also provided with the classical Gaussian matrix completion algorithm to assess the potential gain achieved by explicitly taking into account the facts that the observations belong to a finite alphabet. Only a limited part of the experiments are reported in this article; a more extensive assessment can be obtained upon authors request. ¯ l we sampled uniformly five unitary (for the Euclidean For each matrix X ¯ l is then defined as norm) vector pairs (ulk , vkl )5k=1 . The matrix X ¯ l = Γ√m1 m2 X

5 X

αk ulk (vkl )⊤ + η l Im1 ×m2 ,

k=1

with (α1 , . . . , α5 ) = (2, 1, 0.5, 0.25, 0.1), Γ a scaling factor and Im1 ×m2 the m1 × m2 matrix of ones. The term ηl has been fixed so that each class has the ¯ l ])p−1 ) = 1/p for j ∈ [p]. Note that the same average probability i.e., f j ((E[X l=1 √ ¯ l coefficients does not depend on factor m1 m2 implies that the variance of X m1 and m2 . The sizes investigated are (m1 , m2 ) ∈ {(500, 300), (1000, 600)}. The observations are sampled to the conditional multinomial logistic model ˆN, introduced in Section 3. For comparison purposes we have also computed X

/Adaptive Multinomial Matrix Completion

11

the classical Gaussian version (i.e., using a squared Frobenius norm in (13)). Contrary to the logit version, the Gaussian matrix completion does not directly recover the distribution of the observations (Yi )ni=1 . However, we can estimate P(Yi = j) by the following quantity:  if j = 1 ,  0 ˆN j−0.5−X i FN (0,1) (pj+1 ) − FN (0,1) (pj ) with pj = if 0 < j < p σ ˆ   1 if j = p ,

where FN (0,1) is the cdf of a zero-mean standard Gaussian random variable. The choice of the regularization parameter λ has been solved for all methods by performing 5-fold cross-validation on a geometric grid of size 0.6 log(n) (note that the estimators are null for λ greater than k∇ ΦY (0)kσ,∞ ). As evidenced in Figure 1, the Kullback-Leibler divergence for the logistic estimator is significantly lower than for the Gaussian estimator, for both the p = 2 and p = 5 cases. This was expected because the Gaussian model assume implicitly symmetric distributions with the same variance for all the ratings, These assumptions are of course avoided by the logistic modem. Regarding the prediction error, Table 2 and Table 3 summarize the results obtained for a 1000 × 600 matrix. The logistic model outperforms the Gaussian model (slightly for p = 2 and significantly for p = 5). 0.16

Normalized KL divergence for logistic (plain), Gaussian (dashed)

Normalized KL divergence for logistic (plain), Gaussian (dashed)

size: 500x300 size: 1000x600

0.14

size: 500x300 size: 1000x600

0.35

0.30 0.12

Normalized KL

Normalized KL

0.25 0.10

0.08

0.06

0.15

0.10

0.04

0.05

0.02

0.00

0.20

100000

200000 300000 Number of observations

400000

500000

0.00

100000

200000 300000 Number of observations

400000

500000

Fig 1. Kullback-Leibler divergence between the estimated and the true model for different matrices sizes and sampling fraction, normalized by number of classes. Right figure: binomial and the Gaussian models ; left figure: multinomial with five classes and Gaussian model.

10 · 103 50 · 103 250 · 103 500 · 103 0.50 0.38 0.32 0.32 0.46 0.33 0.31 0.31 Table 2 Prediction errors for a binomial (2 classes) underlying model, for a 1000 × 600 matrix. Number of observations Gaussian prediction error Logistic prediction error

We have also run the same estimators on the MovieLens 100k dataset. In this case, the Kullback-Leibler divergence cannot be computed. Therefore, to assess

/Adaptive Multinomial Matrix Completion

12

10 · 103 50 · 103 250 · 103 500 · 103 0.75 0.75 0.72 0.71 0.75 0.67 0.58 0.57 Table 3 Prediction Error for a multinomial (5 classes) distribution against a 1000 × 600 matrix. Number of observations Gaussian prediction error Logistic prediction error

the prediction errors, we randomly select 20% of the entries as a test set, and the remaining entries are split between a training set (80%) and a validation set (20%). For this dataset, ratings range from 1 to 5. To consider the benefit of a binomial model, we have tested each rating against the others (e.g., ratings 5 are set to 0 and all others are set to 1). These results are summarized in Table 4. For the multinomial case, we find a prediction error of 0.59 for the logistic model against a 0.63 for the Gaussian one. Rating against the others 1 2 3 4 5 Gaussian prediction error 0.12 0.20 0.39 0.46 0.30 Logistic prediction error 0.06 0.11 0.27 0.34 0.20 Table 4 Binomial prediction error when performing one versus the others procedure on the MovieLens 100k dataset.

5. Proofs of main results 5.1. Proof of Theorem 1 and Theorem 4 Proof. Since Theorem 1 is an application of Theorem 4 for p = 2 and q = 1 it suffices to prove Theorem 4. We consider a tensor X which satisfies ΦλY (X ) ≤ ΦλY (X¯ ), (e.g., X = Xˆ ). We get from Lemma 6 √ q  ΦY (X ) − ΦY (X¯ ) ≤ λ r¯ KL f (X¯ ), f (X ) , (20)

where

r¯ = Let us define

2m1 m2 rk(X˜ ) . Kγ

   D f (X¯ ), f (X ) := E ΦY (X ) − ΦY (X¯ ) ,

(21)

(22)

where the expectation is taken both over the  (Ei )1≤i≤n and (Yi )1≤i≤n . As stated in Lemma 11, 2 implies µ D f (X¯ ), f (X ) ≥ KL f (X¯ ), f (X ) . We now need to control the left hand side of (20) uniformly over X with high probability. Since ¯ l kσ,∞ applying Lemma 10 (30) and then Lemma 11 we assume λ > 2 maxl∈[q] kΣ

/Adaptive Multinomial Matrix Completion

13

yields q X l=1

q √ q   ¯ l kσ,1 ≤ 4 r¯ KL f (X¯ ), f (X ) ≤ 4√µ¯ r D f (X¯ ), f (X ) , (23) kX l − X

Consequently, if we define C(r) as ) ( q q X  l l m1 ×m2 ×q ¯ kσ,1 ≤ r D f (X¯ ), f (X ) , kX − X C(r) := X ∈ R : l=1

we need to control (ΦY (X ) − ΦY (X¯ )) for X ∈ C(16µ¯ r). We have to ensure that  D f (X¯ ), f (X ) is greater than a given threshold β > 0 and therefore we define the following set   Cβ (r) = X ∈ C(r), D f (X¯ ), f (X ) ≥ β . (24)

We then consider the two following cases. Case 1. If D f (X¯ ), f (X ) > β, (23) gives X ∈ Cβ (16µ¯ r). Plugging Lemma 12 p p in (20) with β = 2Mγ log(d)/(η n log(α)) , α = e and η = 1/(4α) then it holds with probability at least 1 − 2d−1 /(1 − d−1 ) ≥ 1 − 2/d  √ q  D f (X¯ ), f (X ) − ǫ(16µ¯ r, α, η) ≤ λ r¯ KL f (X¯ ), f (X ) , 2 where ǫ is defined in Lemma 12. Recalling Lemma 11 we get  √ q  KL f (X¯ ), f (X ) r, α, η) ≤ 0 . − λ r¯ KL f (X¯ , f (X ) − ǫ(16µ¯ 2µ

An analysis of this second order polynomial and the relation ǫ(16µ¯ r, α, η)/µ = ǫ(16¯ r, α, η) lead to q  √  p  KL f (X¯ ), f (X ) ≤ µ λ r¯ + λ2 r¯ + 2ǫ(16¯ r, α, η) . (25)

Applying the inequality (a + b)2 ≤ 2(a2 + b2 ) gives the bound of Theorem 4. Case 2. If D f (X¯ ), f (X ) ≤ β then Lemma 11 yields  KL f (X¯ ), f (X ) ≤ µβ . (26) Combining (25) and (26) concludes the proof.

For X ∈ Rm1 ×m2 , denote by S1 (X) ⊂ Rm1 (resp. S2 (X) ⊂ Rm2 ) the linear spans generated by left (resp. right) singular vectors of X. PS1⊥ (X) (resp. PS2⊥ (X) ) denotes the orthogonal projections on S1⊥ (X) (resp. S2⊥ (X)). We then define the following orthogonal projections on Rm1 ×m2 ⊥ ˜ ˜ ˜ ˜ ˜ P⊥ X : X 7→ PS1⊥ (X) XPS2⊥ (X) and P X : X 7→ X − P X (X) .

/Adaptive Multinomial Matrix Completion

14

Lemma 6. Let X , X˜ ∈ Rm1 ×m2 ×q satisfying ΦλY (X ) ≤ ΦλY (X˜ ), then r   ΦY (X ) − ΦY (X˜ ) ≤ λ¯ r1/2 KL f (X˜ ), f (X ) ,

where r¯ is defined in (21).

Proof. Since ΦλY (X ) ≤ ΦλY (X˜ ), we obtain ΦY (X ) − ΦY (X˜ ) ≤λ

q X l=1

˜ l kσ,1 − kX l kσ,1 ) ≤ λ (kX

q X l=1

˜ l )kσ,1 , k P X˜ l (X − X

! q q X ˜ σ,2 , ≤λ 2 rk(X˜ ) kX − Xk l=1

where we have used Lemma 7-(ii) and (iii) and for the last two lines and the definition of Kγ and Lemma 8 to get the result. ˜ ∈ Rm1 ×m2 we have Lemma 7. For any pair of matrices X, X ⊥ ˜ ˜ (i) kX + P ⊥ X (X)kσ,1 p= kXkσ,1 + k P X (X)kσ,1 , ˜ ˜ (ii) k P X (X)kσ,1 ≤ 2 rk(X)kXkσ,2 , ˜ σ,1 ≤ k P X (X ˜ − X)kσ,1 . (iii) kXkσ,1 − kXk

Proof. If A, B ∈ Rm1 ×m2 are two matrices satisfying Si (A) ⊥ Si (B), i = 1, 2, then kA + Bkσ,1 = kAkσ,1 + kBkσ,1 . Applying this identity with A = X and ˜ B = P⊥ X (X), we obtain ⊥ ˜ ˜ kX + P ⊥ X (X)kσ,1 = kXkσ,1 + k P X (X)kσ,1 ,

showing (i). ˜ = PS (X) XP ˜ S ⊥ (X) + XP ˜ S (X) . Note It follows from the definition that P X (X) 1 2 2 m1 ×m2 equipped with the euclidean that P X is an orthogonal projector on R product h· | ·i. On the other p hand, the Cauchy-Schwarz inequality implies that for any matrix C, kCkσ,1 ≤ rk(C)kCkσ,2 . Consequently (ii) follows from p p ˜ σ,1 ≤ 2 rk(X)k P X (X)k ˜ σ,2 ≤ 2 rk(X)kXk ˜ σ,2 . k P X (X)k ˜ = X + P ⊥ (X ˜ − X) + P X (X ˜ − X) we have Finally, since X X

˜ σ,1 ≥ kX + P ⊥ (X ˜ − X)kσ,1 − k P X (X ˜ − X)kσ,1 , kXk X ˜ ˜ = kXkσ,1 + k P ⊥ X (X − X)kσ,1 − k P X (X − X)kσ,1 , leading to (iii). Lemma 8. For any tensor X , X˜ ∈ Rm1 ×m2 ×q and p-link function f it holds:     d2H f (X ), f (X˜ ) ≤ KL f (X ), f (X˜ )

/Adaptive Multinomial Matrix Completion

15

Proof. See [26, Lemma 4.2] Lemma 9. For any p, q > 0 and p-link function f and any X , X˜ ∈ Rm1 ×m2 ×q satisfying kX k∞ ≤ γ and kX˜ k∞ ≤ γ, we get: q X l=1

 m m   m1 m2 2  1 2 dH f (X ), f (X˜ ≤ KL f (X ), f (X˜ . kX l − X˜ l k2σ,2 ≤ Kγ Kγ

Proof. For p = 2 and q = 1, it is a consequence of Remark 1 and Lemma 8. Otherwise, the proof follows from the definition (16) of Kγ and Lemma 8. Lemma 10. Let X , X˜ ∈ Rm1 ×m2 ×q satisfying kX k∞ ≤ γ and kX˜ k∞ ≤ γ. ˜ σ,∞ and Φλ (X) ≤ Φλ (X). ˜ Then Assume that λ > 2 maxl∈[q] kΣlY (X)k Y Y q X

l=1 q X

l=1 q X

l=1 q X l=1

l ˜l k P⊥ ˜ l (X − X )kσ,1 ≤ 3 X

q X l=1

˜ l )kσ,1 , k P X˜ l (X l − X

˜ l kσ,1 ≤ 4 kX l − X

q q X ˜ l )kσ,2 , 2 rk(X˜ ) k(X l − X

˜ l kσ,1 ≤ 4 kX l − X

q   2m1 m2 rk(X˜ )/Kγ dh f (X˜ ), f (X ), ,

l

˜l

kX − X kσ,1

(27) (28)

l=1

(29)

r q   ≤ 4 2m1 m2 rk(X˜ )/Kγ KL f (X˜ ), f (X ) .

(30)

Proof. Since ΦλY (X ) ≤ ΦλY (X˜ ), we have ΦY (X˜ ) − ΦY (X ) ≥ λ

q X l=1

˜ l kσ,1 ). (kX l kσ,1 − kX

˜ + P ⊥˜ (X − X) ˜ + P ˜ (X − X), ˜ Lemma 7-(i) For any X ∈ Rm1 ×m2 , using X = X X X and the triangular inequality, we get ˜ σ,1 + k P ⊥˜ (X − X)k ˜ σ,1 − k P ˜ (X − X)k ˜ σ,1 , kXkσ,1 ≥ kXk X X which implies ΦY (X˜ ) − ΦY (X ) ≥ λ

q  X l=1

l ˜l ˜l k P⊥ ˜ l (X − X )kσ,1 ˜ l (X − X )kσ,1 − k P X X

Furthermore by concavity of ΦY we have ΦY (X˜ ) − ΦY (X ) ≤

q X l=1

˜ l − X li . hΣlY (X˜ ) | X



. (31)

/Adaptive Multinomial Matrix Completion

16

The duality between k · kσ,1 and k · kσ,∞ (see for instance [1, Corollary IV.2.6]) leads to ˜ σ,∞ ΦY (X˜ ) − ΦY (X ) ≤ max kΣlY (X)k l∈[q]

≤ ≤

q λX

2

l=1

q λX

2

l=1

q X l=1

˜ l − X l kσ,1 , kX

˜ l − X l kσ,1 , kX l l ˜l ˜l (k P ⊥ ˜ l (X − X )kσ,1 ) , ˜ l (X − X )kσ,1 + k P X X

(32)

˜ σ,∞ in the second line. Then combining where we used λ > 2 maxl∈[q] kΣlY (X)k ˜ l = P ⊥˜ l (X l − X ˜ l) + (31) with (32) gives (27). Since for any l ∈ [q], X l − X X l l ˜ ), using the triangular inequality and (27) yields P X˜ l (X − X q X l=1

˜ l kσ,1 ≤ 4k P ˜ l (X l − X ˜ l )kσ,1 . kX l − X X

(33)

Combining (33) and (27) immediately leads to (28) and (29) is a consequence of (28) and the definition of Kγ . The statement (30) follows from (29) and Lemma 8. Lemma 11. Under 2 we have   1 D f (X¯ ), f (X ) ≥ KL f (X¯ ), f (X ) . µ

where D(·, ·) is defined in (22). Proof. Follows from

  j ¯  n  f (Xk,l ) 1X X X D f (X¯ ), f (X ) = πk,l f j (X¯k,l ) log , n i=1 f j (Xk,l ) k∈[m1 ] j∈[p] l∈[m2 ]

 j ¯  X X f (Xk,l ) 1 j ¯ f (Xk,l ) log . ≥ µm1 m2 f j (Xk,l ) k∈[m1 ] j∈[p] l∈[m2 ]

¯ Let α > 1, β > 0 and 0 < η < 1/2α. Then Lemma 12. Assume that λ ≥ Σ. with probability at least 1 − 2(exp(−nη 2 log(α)β 2 /(4Mγ2 ))/(1 − exp(−nη 2 log(α)β 2 /(4Mγ2 ))) we have for all X ∈ Cβ (r):   D f (X¯ ), f (X ) + ǫ(r, α, η) , | ΦY (X ) − ΦY (X¯ ) − D f (X¯ ), f (X ) | ≤ 2

/Adaptive Multinomial Matrix Completion

where ǫ(r, α, η) := and Cβ (r) is defined in (24).

17

4q 2 L2γ r (EkΣR kσ,∞ )2 , 1/(2α) − η

(34)

Proof. The proof is adapted from [24, Theorem 1] and [18, Lemma 12]. We use a peeling argument combined with a sharp deviation inequality detailed in Lemma 13, Consider the events  B := ∃X ∈ Cβ (r) and

   D f (X¯ ), f (X ) ¯ ¯ | ΦY (X ) − ΦY (X ) − D f (X ), f (X ) | > + ǫ(r, α, η) , 2   Sl := X ∈ Cβ (r)|αl−1 β < D f (X¯ ), f (X ) < αl β .

Let us also define the set   Cβ (r, t) = X ∈ Rm1 ×m2 | X ∈ Cβ (r), D f (X¯ ), f (X ) ≤ t ,

and

Zt :=

sup X ∈Cβ (r,t)

 | ΦY (X ) − ΦY (X¯ ) − D f (X¯), f (X ) | ,

(35)

Then for any X ∈ B ∩ Sl we have

 1 | ΦY (X ) − ΦY (X¯ ) − D f (X¯ ), f (X ) | > αl−1 β + ǫ(r, α, η) , 2

Moreover by definition of Sl , X ∈ Cβ (r, αl β). Therefore

1 l α β + ǫ(r, α, η)} , 2α If we now apply the union bound and Lemma 13 we get B ∩ Sl ⊂ Bl := {Zαl β >

2 log(α)β 2   exp(− nη 4M ) 2 nη 2 (αl β)2 γ ≤ , P(B) ≤ P(Bl ) ≤ exp − 2 log(α)β 2 8Mγ2 ) 1 − exp(− nη 4M 2 l=1 l=1

+∞ X

+∞ X

γ

x

where we used x ≤ e in the second inequality.

¯ Let α > 1 and 0 < η < Lemma 13. Assume that λ ≥ Σ.

1 2α .

Then we have  P (Zt > t/(2α) + ǫ(r, α, β)) ≤ exp −nη 2 t2 /(8Mγ2 ) , (36)

where Zt and ǫ(r, α, η) are defined in (35) and (34), respectively.

/Adaptive Multinomial Matrix Completion

18

Proof. Using Massart’s inequality ([22, Theorem 9]) we get for 0 < η < 1/(2α)  P(Zt > E[Zt ] + ηt) ≤ exp −η 2 nt2 /(8Mγ2 ) . (37)

By using the standard symmetrization argument, we get   n   j p 1 X X f (X ) i 1{Yi =j} log j ¯  , εi E[Zt ] ≤ 2E  sup n f (Xi ) X ∈Cβ (r,t) j=1 i=1

where ε := (εi )1≤i≤n is a Rademacher sequence which is independent from (Yi )1≤i≤n and (Ei )1≤i≤n . 5 yields  !  X p q j l X X 1 n g (X ) εi 1{Yi =j} log lj ¯ il  . E[Zt ] ≤ 2E  sup n gl (Xi ) X ∈Cβ (r,t) j=1 i=1 l=1 Since for any i ∈ [n], the function

p 1 X 1{Yi =j} log φi (x) := Lγ j=1

¯ l) glj (x + X i ¯ l) g j (X i

l

!

is a contraction satisfying φi (0) = 0, the contraction principle ([21, Theorem 4.12]) and the fact that (εi )ni=1 is independent from (Yi )ni=1 and (ωi )ni=1 yields # n " q 1 X X ¯ l | Ei i = εi hX l − X E[Zt ] ≤ 4Lγ E sup X ∈Cβ (r,t) n i=1 l=1

Denoting ΣR := n E[Zt ] ≤ 4Lγ ≤ 4Lγ

q X

l=1 q X l=1

Pn −1

i=1 εi Ei

E

"

E

"

sup

X ∈Cβ (r,t)

and the duality, the previous inequality implies # l hX − X ¯ l | ΣR i

sup

X ∈Cβ (r,t)

#

√ kX − X kσ,1 kΣR kσ,∞ ≤ 4qLγ E[kΣR kσ,∞ ] rt , l

¯l

where we have the definition of Cβ (r, t) for the last inequality. Plugging into (37) gives √  P(Zt > 4qLγ E[kΣR kσ,∞ ] rt + ηt) ≤ exp −η 2 nt2 /(8Mγ2 ) . The proof is concluded by noting that, since for any a, b ∈ R and c > 0, ab ≤ (a2 /c + cb2 )/2, √ 4qLγ E[kΣR kσ,∞ ] rt ≤

1 4q 2 L2γ rE[kΣR kσ,∞ ]2 + (1/(2α) − η)t . 1/(2α) − η

/Adaptive Multinomial Matrix Completion

19

5.2. Proof of Theorem 5 ¯ l kσ,∞ and E[kΣR k Proof. By Theorem 5 it suffices to control kΣ σ,∞ ]. For any l ∈ [q], by definition   p n j ¯ X X ∂ f ( X ) 1 ¯l = −  1{Yi =j} l j ¯ i  Ei , Σ n i=1 j=1 f (Xi )

with ∂l designating the partial derivative against the l-th variable. The sequence of matrices     p p j ′ ¯l j ¯ X X (g ) ( X ) ∂ f ( X ) 1{Yi =j} l j ¯ i  Ei =  1{Yi =j} lj ¯ l i  Ei Zi :=  f (Xi ) g (X ) l

j=1

j=1

i

satisfies E[Zi ] = 0 (as any score function) and kZi kσ,∞ ≤ Lγ . Noticing ek (e′k′ )⊤ (ek (e′k′ )⊤ )⊤ = ek (e′k′ )⊤ we also get    2  p m1 m2 n j ¯ ′ X X X ) ∂ f ( X 1X l k,k j  ek (e′k )⊤ ,  f (X¯k,k′ ) E[Zi Zi⊤ ] = πk,k′  j (X ¯k,k′ ) n i=1 f ′ j=1 k=1

k =1

which is diagonal. We recall the definition Ck′ = for any k ′ ∈ [m2 ], k ∈ [m1 ]. Since 

P m1

k=1

πk,k′ and Rk =

P m2

k′ =1

πk,k′

2 ∂l f j (X¯k,k′ ) ≤ L2γ , f j (X¯k,k′ )

and (f j (X¯k,k′ ))pj=1 is a probability distribution, we obtain

" n #

1X

⊤ Zi Zi

E

n i=1

σ,∞

2 1 ≤ L2γ kdiag((Rk )m k=1 )kσ,∞ ≤ Lγ

ν , m

P were we have 3 for the last inequality. Using a similar argument we get kE[ ni=1 Zi⊤ Zi ]kσ,∞ /n ≤ 2 L2γ ν/m. Therefore, Proposition 14 applied with t = log(d), U = Lγ and σZ = 2 Lγ ν/m yields with at least probability 1 − 1/d,

l ˜

ΣY (X)

σ,∞

√ ≤ (1 + 3)Lγ max

(r

2ν log(d) 2 log(d) , mn 3 n

)

.

(38)

P With the same analysis for ΣR := n1 ni=1 εi Ei and by applying Lemma 15 with ν 2 U = 1 and σZ =m , for n ≥ n∗ := m log(d)/(9ν) it holds: E [kΣR kσ,∞ ] ≤ c



r

2eν log(d) . mn

(39)

/Adaptive Multinomial Matrix Completion

20

Assuming n ≥ 2m log(d)/(9ν), implies n ≥ n∗ and (39) is therefore satisfied. p Since it also implies 2ν log(d)/(mn) ≥ 2 log(d)/(3n), p the second term of (38) √ is negligible. Consequently taking λ ≥ 2(1 + 3)Lγ 2ν log(d)/(mn), a union ˜ σ,∞ with probability at bound argument ensures that λ > 2 maxl∈[q] kΣlY (X)k least 1 − q/d. By taking λ, β and n as in Theorem 5 statement , with probability larger than 1 − (2 + q)/d, Theorem 4 result holds when replacing EkΣR kσ,∞ by its upper bound (39). Using the inequality (a + b)2 ≤ 2(a2 + b2 ) yields the result with c¯ = 24832. Proposition 14. Consider a finite sequence of independent random matrices (Zi )1≤i≤n ∈ Rm1 ×m2 satisfying E[Zi ] = 0 and for some U > 0, kZi kσ,∞ ≤ U for all i = 1, . . . , n. Then for any t > 0  

  n

1 X nt2 /2

> t ≤ d exp − 2 P  Zi ,

n σZ + U t/3 i=1

σ,∞

where d = m1 + m2 and 

n

 1 X

2 := max E[Zi Zi⊤ ] σZ

 n i=1

σ,∞

n

1 X

, E[Zi⊤ Zi ]

n

i=1

σ,∞

 

.



In particular it implies that with at least probability 1 − e−t

n

) ( r

1 X t + log(d) U (t + log(d))

Zi , ≤ c∗ max σZ ,

n

n 3n i=1 σ,∞

√ with c∗ = 1 + 3.

Proof. The first claim of the proposition is Bernstein’s inequality for random ma2 /2 trices (see for example [25, Theorem 1.6]). Solving the equation (in t) − σ2nt+Ut/3 + Z

log(d) = −v gives with at least probability 1 − e−v

n " # r

1 X U2 1 U

2 (v + log(d)) , Zi ≤ (v + log(d)) + (v + log(d))2 + 2nσZ

n n 3 9 i=1 σ,∞

2 we conclude the proof by distinguishing the two cases nσZ ≤ (U 2 /9)(v + log(d)) 2 2 or nσZ > (U /9)(v + log(d)).

Lemma 15. Let h ≥ 1. With the same assumptions as Proposition 14, assume 2 n ≥ (U 2 log(d))/(9σZ ) then the following holds: 

h   h/2 n

1 X 2 2ehc∗2 σZ log(d)

  ≤ Zi , E

n n i=1

√ with c∗ = 1 + 3.

σ,∞

/Adaptive Multinomial Matrix Completion

21

2 )/U 2 − Proof. The proof is adapted from [18, Lemma 6]. Define t∗ := (9nσZ log(d) the value of t for which the two bounds of Proposition 14 are equal. Let 2 ∗2 c ) and ν2 := 3n/(U c∗ ) then, from Proposition 14 we have ν1 := n/(σZ  

n

1 X

Zi P  > t ≤ d exp(−ν1 t2 ) for t ≤ t∗ ,

n

i=1 σ,∞  

n

1 X

Zi > t ≤ d exp(−ν2 t) for t ≥ t∗ , P 

n i=1

σ,∞

Let h ≥ 1, then 

h n

1 X

Zi E 

n i=1

σ,∞





2h log(d) 1/(2 log(d)) n

1 X

 ≤ E   Zi ,

n i=1 

σ,∞



Z +∞ n

1 X

  Zi ≤ P

n 0 i=1

σ,∞

≤ d(2h log(d))



−1

Z

+∞



1/(2 log(d))

> t1/(2h log(d))  dt

,

exp(−ν1 t2/(2h log(d)) ) + exp(−ν2 t1/(2h log(d)) )dt

0

1/(2 log(d))

,

1/(2 log(d)) √  −h log(d) −2h log(d) e h log(d)ν1 Γ(h log(d)) + 2h log(d)ν2 Γ(2h log(d)) ,

where we used Jensen’s inequality for the first line. Since Gamma-function satisfies for x ≥ 2, Γ(x) ≤ ( x2 )x−1 (see [17, Proposition 12]) we have 

h  n

1 X

≤ E  Zi

n

i=1

σ,∞

1/(2 log(d)) √  −h log(d) 1−h log(d) −2h log(d) . 2 +2(h log(d))2h log(d) ν2 e (h log(d))h log(d) ν1

2 For n ≥ (U 2 log(d))/(9σZ ) we have ν1 log(d) ≤ ν22 and therefore we get 

h   h/2 n

1 X

 ≤ 2eh log(d) E  Zi .

n

ν1 i=1 σ,∞

5.3. Proof of Theorem 3

Proof. Let h be the following function q n √ o −1 h(κ) = min 1/2, α rM K(1−κ)γ /(8γ n) .

(40)

/Adaptive Multinomial Matrix Completion

22

Since 0 < h(κ) ≤ 1/2 and h is continuous, there exists a fixed point κ∗ ∈ (0, 1/2]: h(κ∗ ) = κ∗ .

(41)

For notational convenience, the dependence of κ∗ in r, M and n is implicit. We start with a packing set construction, inspired by [7]. Assume w.l.o.g., that m1 ≥ m2 . For κ ≤ 1, define o n n κγ κγ o , ∀i ∈ [m1 ], ∀j ∈ [r] , L = L = (lij ) ∈ Rm1 ×r : lij ∈ − , 2 2 and consider the associated set of block matrices n o L′ = L′ = ( L · · · L O ) ∈ Rm1 ×m2 : L ∈ L ,

where O denotes the m1 × (m2 − r⌊m2 /r⌋) zero matrix, and ⌊x⌋ is the integer part of x.

Remark 3. In the case m1 < m2 , we only need to change the construction of ˜ ∈ Rr×m2 with the low rank of the test set. We first build a matrix L  component κγ κγ entries in − 2 , 2 and then we replicate this matrix to obtain a block matrix L of size m1 × m2 . Let Im1 ×m2 denote the m1 × m2 matrix of ones. The Varshamov-Gilbert bound ([26, Lemma 2.9]) guarantees the existence of a subset L′′ ⊂ L′ with cardinality Card(L′′ ) ≥ 2(rM)/8 + 1 containing the matrix (κγ/2) Im1 ×m2 and such that, for any two distinct elements X1 and X2 of L′′ , M r κ 2 γ 2 j m2 k m1 m2 κ 2 γ 2 kX1 − X2 k22 ≥ ≥ . (42) 8 r 16 Then, we construct the packing set A by setting   (2 − κ)γ ′′ Im1 ×m2 : L ∈ L . A= L+ 2 By construction, any element of A as well as the difference of any two elements of A has rank at most r, the entries of any matrix in A take values in [0, γ], and X 0 = γIm1 ×m2 belongs to A. Thus, A ⊂ F (r, γ). Note that A has the same size as L′′ and it also satisfies the same bound on pairwise distances, i.e. for any two distinct elements X1 and X2 of A, (42) is satisfied. For some X ∈ A, we now estimate the Kullback-Leibler divergence D (PX 0kPX) between probability measures PX 0 and PX . By independence of the observations (Yi , ωi )ni=1 ,    j 0  2 X ) f (X ω 1  . D (PX 0 kPX ) = nEω1  f j (Xω01 ) log j (X ) f ω 1 j=1 Since Xω01 = γ and either Xω1 = Xω01 or Xω1 = (1 − κ)γ, by Lemma 16 we get  2 n f 1 (γ) − f 1 ((1 − κ)γ) D (PX 0 kPX ) ≤ 1 . f ((1 − κ)γ) [1 − f 1 ((1 − κ)γ)]

/Adaptive Multinomial Matrix Completion

23

From the mean value theorem, for some ξ ∈ [(1 − κ)γ, γ] we have D (PX 0 kPX ) ≤

n{(f 1 )′ (ξ)}2 (κγ)2 . f 1 ((1 − κ)γ) [1 − f 1 ((1 − κ)γ)]

Using 4, the function (f 1 )′ is decreasing and the latter inequality implies D (PX 0 kPX ) ≤ 8 n(κγ)2 g((1 − κ)γ) ,

(43)

where g is defined in (6). From (43) and plugging κ = κ∗ defined in eq. (41), we get αrM ≤ α log2 (rM/8) , D (PX 0 kPX ) ≤ 8 which implies that X  1 D (PX 0 kPX ) ≤ α log Card(A) − 1 . Card(A) − 1

(44)

X∈A

Using (42) and (44), [26, Theorem 2.5] implies inf

sup

ˆ X∈ ¯ F (r,γ) X

PX¯

!  ˆ − Xk ¯ 2 Mr kX 2 2 ≥ δ > c min γ , m1 m2 n K(1−κ∗)γ

(45)

for some universal constants c > 0 and δ ∈ (0, 1). Lemma 16. Let us consider x, y ∈ (0, 1) and k(x, y) := x log(x/y) + (1 − x) log((1 − x)/1 − y) . Then the following holds k(x, y) ≤

(x − y)2 . y(1 − y)

Proof. The proof is taken from [7, Lemma 4]. Since k(x, y) = k(1 − x, 1 − y), w.l.o.g., we may assume y > x. The function g(t) = k(x, x + t) satisfies g ′ (t) = t/[(x + t)(1 − x − t)] and g ′′ (t) ≥ 0. Therefore the mean value Theorem gives g(y − x) − g(0) ≤ g ′ (y − x)(y − x) which yields the result. References [1] R. Bhatia. Matrix analysis, volume 169 of Graduate Texts in Mathematics. Springer-Verlag, New York, 1997. [2] J. Bobadilla, F. Ortega, A. Hernando, and A. Guti´errez. Recommender systems survey. Knowledge-Based Systems, 46(0):109 – 132, 2013. [3] J-F. Cai, E. J. Cand`es, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.

/Adaptive Multinomial Matrix Completion

24

[4] T. T. Cai and W-X. Zhou. Matrix completion via max-norm constrained optimization. CoRR, abs/1303.0341, 2013. [5] T. T. Cai and W-X. Zhou. A max-norm constrained minimization approach to 1-bit matrix completion. J. Mach. Learn. Res., 14:3619–3647, 2013. [6] E. J. Cand`es and Y. Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010. [7] M. A. Davenport, Y. Plan, E. van den Berg, and M. Wootters. 1-bit matrix completion. CoRR, abs/1209.3672, 2012. [8] M. Dud´ık, Z. Harchaoui, and J. Malick. Lifted coordinate descent for learning with trace-norm regularization. In AISTATS, 2012. [9] M. Fazel. Matrix rank minimization with applications. PhD thesis, Stanford University, 2002. [10] R. Foygel, R. Salakhutdinov, O. Shamir, and N. Srebro. Learning with the weighted trace-norm under arbitrary sampling distributions. In NIPS, pages 2133–2141, 2011. [11] G. H. Golub and C. F. van Loan. Matrix computations. Johns Hopkins University Press, Baltimore, MD, fourth edition, 2013. [12] D. Gross. Recovering low-rank matrices from few coefficients in any basis. Information Theory, IEEE Transactions on, 57(3):1548–1566, 2011. [13] S. Gunasekar, P. Ravikumar, and J. Ghosh. Exponential family matrix completion under structural constraints. ICML, 2014. [14] J. Hui, L. Chaoqiang, S. Zuowei, and X. Yuhong. Robust video denoising using low rank matrix completion. CVPR, 0:1791–1798, 2010. [15] L. Ji, P. Musialski, P. Wonka, and Y. Jieping. Tensor Completion for Estimating Missing Values in Visual Data. IEEE Trans. Pattern Anal. Mach. Intell., 35(1):208–220, 2013. [16] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from noisy entries. J. Mach. Learn. Res., 11:2057–2078, 2010. [17] O. Klopp. Rank penalized estimators for high-dimensional matrices. Electron. J. Stat., 5:1161–1183, 2011. [18] O. Klopp. Noisy low-rank matrix completion with general sampling distribution. Bernoulli, 2(1):282–303, 02 2014. [19] V. Koltchinskii, A. B. Tsybakov, and K. Lounici. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Statist., 39(5):2302–2329, 2011. [20] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009. [21] M. Ledoux and M. Talagrand. Probability in Banach spaces, volume 23 of Ergebnisse der Mathematik und ihrer Grenzgebiete (3) [Results in Mathematics and Related Areas (3)]. Springer-Verlag, Berlin, 1991. Isoperimetry and processes. [22] P. Massart. About the constants in Talagrand’s concentration inequalities for empirical processes. Ann. Probab., 28(2):863–884, 2000. [23] R. Mazumder, T. Hastie, and R. Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res., 11:2287–2322, 2010.

/Adaptive Multinomial Matrix Completion

25

[24] S. Negahban and M. J. Wainwright. Restricted strong convexity and weighted matrix completion: optimal bounds with noise. J. Mach. Learn. Res., 13:1665–1697, 2012. [25] J. A. Tropp. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., 12(4):389–434, 2012. [26] A. B. Tsybakov. Introduction to nonparametric estimation. Springer Series in Statistics. Springer, New York, 2009. [27] H. Xu, W. Jiasong, W. Lu, C. Yang, L. Senhadji, and H. Shu. Linear Total Variation Approximate Regularized Nuclear Norm Optimization for Matrix Completion. Abstr. Appl. Anal., pages Art. ID 765782, 8, 2014. [28] Y. Yang, J. Ma, and S. Osher. Seismic data reconstruction via matrix completion. Inverse Probl. Imaging, 7(4):1379–1392, 2013.