Sparse Maximum-Entropy Random Graphs with a Given Power-Law ...

4 downloads 107 Views 823KB Size Report
May 29, 2017 - network snapshot, one cannot usually trust its degree sequence as some “ultimate truth” for. 1. arXiv:1705.10261v1 [math.PR] 29 May 2017 ...
Sparse Maximum-Entropy Random Graphs with a Given Power-Law Degree Distribution Pim van der Hoorn1 , Gabor Lippner2 , and Dmitri Krioukov1,2,3

arXiv:1705.10261v1 [math.PR] 29 May 2017

1

Northeastern University, Department of Physics Northeastern University, Department of Mathematics 3 Northeastern University, Departments of Electrical&Computer Engineering 2

May 30, 2017

Abstract Even though power-law or close-to-power-law degree distributions are ubiquitously observed in a great variety of large real networks, the mathematically satisfactory treatment of random power-law graphs satisfying basic statistical requirements of realism is still lacking. These requirements are: sparsity, exchangeability, projectivity, and unbiasedness. The last requirement states that entropy of the graph ensemble must be maximized under the degree distribution constraints. Here we prove that the hypersoft configuration model (HSCM), belonging to the class of random graphs with latent hyperparameters, also known as inhomogeneous random graphs or W -random graphs, satisfies all these requirements. The proof relies on generalized graphons, and on mapping the problem of maximization of the normalized Gibbs entropy of a random graph ensemble, to the graphon entropy maximization problem, showing that the two entropies converge to each other in the large-graph limit. Keywords: Sparse random graphs, Power-law degree distributions, Maximum-entropy graphs PACS: 89.75.Hc, 89.75.Fb, 89.70.Cf MSC: 05C80, 05C82, 54C70

1

Introduction

Random graphs have been used extensively to model a variety of real networks. Many of these networks, ranging from the Internet and social networks to the brain and the universe, have broad degree distributions, often following closely power laws [1–3], that the simplest random graph model, the Erd˝ os-R´enyi random graphs [4–6] with Poisson degree distributions, does not reproduce. To resolve this disconnect, several alternative models have been proposed and studied. The first one is the configuration model (CM), random graphs with a given degree sequence [7, 8]. This model is a microcanonical ensemble of random graphs. Every graph in the ensemble has the same fixed degree sequence, e.g., the one observed in a snapshot of a real network, and every such graph is equiprobable in the ensemble. The ensemble thus maximizes Gibbs entropy subject to the constraint that the degree sequence is fixed. Yet given a real network snapshot, one cannot usually trust its degree sequence as some “ultimate truth” for 1

a variety of reasons, including measurement imperfections, inaccuracies, and incompleteness, noise and stochasticity, and most importantly, the fact that most real networks are dynamic both at short and long time scales, growing often by orders of magnitude over years [9, 10, 1–3]. These factors partly motivated the development of the soft configuration model (SCM), random graphs with a given expected degree sequence, first considered in [11, 12], and later corrected in [13–16], where it was shown that this correction yields a canonical ensemble of random graphs that maximize Gibbs entropy under the constraint that the expected degree sequence is fixed. In statistics, canonical ensembles of random graphs are known as exponential random graph models (ERGMs) [17]. In [18, 19] it was shown that the sparse CM and SCM are not equivalent, but they are equivalent in the case of dense graphs [20]. Yet the SCM still treats a given degree sequence as a fixed constraint, albeit not as a sharp but soft constraint. This constraint is in stark contrast with reality of many growing real networks, in which the degree of all nodes constantly change, yet the shape of the degree distribution and the average degree do not change, staying essentially constant in networks that grow in size even by orders of magnitude [9, 10, 1–3]. These observations motivated the development of the hypersoft configuration model (HSCM) in which expected degrees are not fixed but sampled from a fixed distribution, and the expected average degree is also fixed [21–24]. In [24] it was shown that if the expected degree distribution is a power law, then the HSCM is equivalent to a soft version of preferential attachment, a model of growing graphs in which the degree distribution and average degree do not essentially change either as the network grows [25, 26]. However it remains unknown if the HSCM is statistically unbiased, i.e., if it maximizes ensemble entropy subject to the degree distribution constraints. From the statistical perspective, this unbiasedness requirement is critical since the predictive power of a model is maximized if the model maximizes the entropy of an ensemble that it defines, subject to imposed constraints [27–29]. Perhaps the best illustration of this predictive power is the predictive power of equilibrium statistical mechanics, which can be formulated almost fully in terms of the maximum-entropy principle [30]. In fact, in the statistical physics or statistics terminology, random graphs with a given property—with n nodes and m edges, for instance—usually mean the maximum-entropy ensemble of graphs that have this property—the Erd˝os-R´enyi random graphs Gn,m with the entropy-maximizing uniform distribution on the set of graphs in the example. In probability or graph theory however, the maximum-entropy requirement is usually either partially or fully ignored, so that random graphs with n nodes and m edges can mean any ensemble satisfying the constraint—random m-cycles or m-stars, for instance, if m < n. In addition to the natural, dictated by real-world network data, requirements of constant, i.e., independent of graphs size n, average degree and power-law degree distribution, as well as the maximum-entropy requirement, dictated by the basic statistical considerations, a reasonable model of real networks must also satisfy two more requirements: exchangeability and projectivity. Exchangeability takes care of the fact that node labels in random graph models are usually meaningless. 
Even though node labels in real networks often have some network-specific meaning, such as autonomous system numbers in the Internet [9], node labels in random graph models can be, and usually are, random integer indices between 1 and n. A random graph model is exchangeable if for any permutation σ of node indices i = 1 . . . n, the probabilities of any two graphs G and Gσ given by adjacency matrices Gi,j and Gσ(i),σ(j) are the same, P (G) = P (Gσ ) [31, 32]. A random graph model is projective if there exists a map πn7→n0 from graphs of size n to graphs of size n0 < n such that the probability of graphs in the model satisfies P (Gn0 ) = P (πn7→n0 (Gn )) [33, 34]. If this condition is satisfied, then it is easy 2

to see that the same model admits a dual formulation as an equilibrium model of graphs of a fixed size, or as a growing graph model [35]. If this requirement is not satisfied, then as soon as one node is added to a graph, e.g., due to the growth of a real network that this graph represents, then the resulting bigger graph is effectively sampled from a different distribution corresponding to the model with different parameters, necessarily affecting the structure of its existing subgraphs, a clearly unrealistic scenario. As the simplest examples, Gn,p is projective (the map πn7→n0 simply selects any subset of n nodes consisting of n0 nodes), but Gn,k/n with constant k is not. In the first case, one can realize Gn,p by growing graphs adding nodes one at a time, and connecting each new node to all existing nodes with probability p, while in the second case such growth is impossible since some existing edges must be removed with certain probability for resulting graphs to be samples from Gn,k/n for each n. Here we prove that the HSCM satisfies all the requirements discussed above. The powerlaw HSCM is defined by the exponential measure µ on the real line R µ = eαx ,

x ∈ R,

(1)

where α > 1 is a constant, and by the Fermi-Dirac graphon W : R2 → [0, 1] W (x, y) =

1 ex+y

+1

.

(2)

The volume-form measure µ establishes then probability measures µn =

µ|An = α eα(x−Rn ) µ(An )

(3)

on intervals An = (−∞, Rn ], where Rn =

1 n log 2 , 2 β ν

β =1−

1 , α

(4)

and ν > 0 is another constant. The constants α > 1 and ν > 0 are the only two parameters of the model. The HSCM random graphs of size n are defined by (W, An , µn ) via sampling n i.i.d. points x1 , . . . , xn on An according to measure µn , and then connecting pairs of points i and j at sampled locations xi and xj by an edge with probability W (xi , xj ). We prove that the distribution of degrees D in the HSCM ensemble defined above converges— henceforth convergence always means the n → ∞ limit, unless mentioned otherwise—to P (D = k) = α(βν)α

Γ(k − α, βν) Γ(k − α) = α(βν)α (1 − P (k − α, βν)) , k! Γ(k + 1)

(5)

where Γ(a, x) is the upper incomplete Gamma function, and P (a, x) is the regularized lower incomplete Gamma function. Since P (a, x) ∼ (ex/a)a for a  ex, while Γ(k − α)/Γ(k + 1) ∼ k −(α+1) for k  α, we get P (D = k) ∼ α(βν)α k −γ ,

γ = α + 1.

(6)

We also prove that the expected average degree in the ensemble converges to E [D] = ν. 3

(7)

4

Pim van der Hoorn, Gabor Lippner, Dmitri Krioukov

10

10

0

10

10

-10

10

-15

10

Theory

HSCM n=104

HSCM n=104

HSCM n=105

HSCM n=105

HSCM n=106 Internet

-5

0

10

2

10

4

0

Theory

10

10

6

HSCM n=106

-5

10

-10

10

-15

10

0

10

1

10

2

10

3

10

4

10

5

Fig. 1 Degree distribution in the HSCM, theory vs. simulations, and in the Internet. The theory curve

Figure 1:panel Degree distribution the theory vs.data simulations, and in the in the left is Eq. (5) with α = 1.1 (γ in = 2.1) andHSCM, ν = 4.92. The simulation shown by symbols is averaged over 100 random graphs for each graph size n. All the graphs are generated according to the HSCM with the Internet. The theory curve in the left panel is Eq. (5) with α = 1.1 (γ = 2.1) and4 ν =5 same 4.92. α = 1.1 and ν = 4.92. The average degrees, averaged over 100 random graphs, in the graphs of size 10 , 10 , and 6 The simulation data shown by symbols is averaged over 100 random graphs for each graph 10 , are 1.73, 2.16, and 2.51, respectively. The Internet data comes from CAIDA’s Archipelago measurements of the Internet at the are Autonomous System level [34]. to Thethe number of nodes the same averageαdegree in and the size n. Alltopology the graphs generated according HSCM withandthe = 1.1 Internet graph are 23, 752 and 4.92. The right panel shows the theoretical degree distribution curve in Eq. (5) with4 να= average degrees, averaged overHSCM 100 graphs random graphs, in with the the graphs size 10 , = 4.92. 2.0 and The ν = 10.0 versus simulations of 100 random of different sizes same αofand ν. The 5 , and 106 , are 1.73, 2.16, and4 2.51, 10 respectively. Internet data comes from CAIDA’s average degrees in the graphs of size 10 , 105 , and 106 , are 9.96, The 9.98, and 10.0, respectively. Archipelago measurements of the Internet topology at the Autonomous System level [36]. The number of nodes and the average degree in the Internet graph are 23, 752 and 4.92. The right the limit, is constant δ = ν/2 if α = 2. In this case, the HSCM can be equivalently defined as a panel shows the theoretical degree distribution curve in Eq. (5) with α = 2.0 and ν = 10.0 growing model of labeled graphs, converging to soft preferential attachment [22]—for n = 1, 2, . . ., versus simulations of 100 random HSCM graphs of different sizes with the same α and ν. The the location xn of node n belongs to An4 ’s increment, x ∈ Bn = An \ An−1 (A0 = ∅), and 5 , and 106 , nare 9.96, average degrees in the graphs of size 10 , 10 9.98, and sampled from µ restricted to this increment, i.e., from the probability measure µ ˜n10.0, = µ|Brespectively. /µ(Bn ) = n

α eα(x−Rn +Rn−1 ) ; node n is then connected to existing nodes i = 1 . . . n − 1 with probability given by (2). exactdistribution equivalence,inforthe each n, between original HSCM That is, For the the degree ensemble has the a power tailequilibrium with exponent γ,definition while the with ordered xi s, xi < xi+1 , and this growing definition, slightly more care, ensuring that the expected average degree is fixed to constant ν that does not depend on n, Fig. 1. joint distribution of xi s is exactly the same in both cases, must be taken using basics properties The ensemble is manifestly exchangeable asequilibrium any graphon-based [37, R38]. In of Poisson point processes [33]. Specifically, in the formulationensemble with a given n , the agreement Aldous-Hoover [31,be 39]sampled that states the limit graphon ofwith any number of with nodesthe cannot be fixed to theorem n, but must from that the Poisson distribution exchangeable sparse graph family is zero, the limit HSCM graphon is zero as well, as can be mean n. Alternatively, for a given fixed n, the right boundary Rn of interval An cannot be fixed, seen from mapping (W, A , µ ) to (W , I, µ ), where I = [0, 1], µ = 1, and but must be a random variable variable sampled from n n Rn = (1/2) I,n log(2V I n ), where Vn is a random I the Gamma distribution with shape n and rate δ = ν/2. Node n is then placed at xn = Rn , while the coordinates of theWrest(x, ofy) n= − 1 nodes1 are sampled , x, yfrom ∈ I, µn defined by this random (8) I,n 1 n In the Rn , and then labeled in the increasing order. growing graph formulation, the coordinate α (xy) + 1 2 β ν xn+1 of the n + 1’th node is determined by vn+1 = vn + V , where v0 = 0, vi = (1/2)e2xi , and V is a random variable sampled from distribution with x rate δ = ν/2. All the providing an equivalent definition of the theexponential ensemble—sample n i.i.d. i s from the uniform three options are equivalent realizations of the same Poisson point process on R with measure µ distribution on the unit interval [0, 1], and connect i and j with the probability WI,n (xi , xj ). and rate δ, converging to the binomial sampling with both n and Rn fixed [33]. The projective Amap more familiar equivalent Pareto representation, in which x and y are proportional to the πn7→n0 in the projectivity definition above, maps graphs Gn to their subgraphs induced by expected coordinates x,the y [15, 16], isversion given by (WP,n , P, is µPnot ), where interval nodes i =degrees 1 . . . n0 . of Wenodes note at that even though growing of the model exchangeable Psince = [βν, ∞), and it relies on labeling of nodes in the increasing order of their coordinates, it is nevertheless equivalent to the equilibrium HSCM with this ordered labeling, because the joint distribution of α −γ µP = α (βν) x , of x ∈these P, coordinates are the same in (9) node coordinates, and the linking probability as a function the two formulations [33]. This observation suggests 1 that there might exist a less trivial projective WP,n (x, y) = and , x, y ∈ if P.α = 2. If α 6= 2, then the HSCM, (10) νn exchangeable map such that the model is both projective xy + 1 even with ordered labeling, is not equivalent to soft preferential attachment, but it is equivalent to its adjusted version with a certainofrate of (dis)appearance of does edges not between existing vertices [22]. 
We note that the convergence WI,n or WP,n to zero mean that the ensemble Here we note that the HSCM is the zero-clustering limit [38] of random hyperbolic graphs [39], where the α = 2 case corresponds to the uniform density of points in the hyperbolic space Hd , and

4

converges to infinite empty graphs. In fact, the expected degree distribution and average degree converge to (5,7) in the limit, as stated above. If α = 2, the ensemble is manifestly projective, but only with a specific labeling of nodes breaking exchangeability. This can be immediately seen by observing that the density of α/2 1−α/2 points on intervals An , δn = n/µ(An ) = α β 2 ν n , and consequently on the whole real line R in the limit, is constant δ = ν/2 if α = 2. In this case, the HSCM can be equivalently defined as a growing model of labeled graphs, converging to soft preferential attachment [24]—for n = 1, 2, . . ., the location xn of node n belongs to An ’s increment, xn ∈ Bn = An \ An−1 (A0 = ∅), and sampled from µ restricted to this increment, i.e., from the probability measure µ ˜n = µ|Bn /µ(Bn ) = α eα(x−Rn +Rn−1 ) ; node n is then connected to existing nodes i = 1 . . . n − 1 with probability given by (2). For the exact equivalence, for each n, between the original equilibrium HSCM definition with ordered xi s, xi < xi+1 , and this growing definition, slightly more care, ensuring that the joint distribution of xi s is exactly the same in both cases, must be taken using basics properties of Poisson point processes [35]. Specifically, in the equilibrium formulation with a given Rn , the number of nodes cannot be fixed to n, but must be sampled from the Poisson distribution with mean n. Alternatively, for a given fixed n, the right boundary Rn of interval An cannot be fixed, but must be a random variable Rn = (1/2) log(2Vn ), where Vn is a random variable sampled from the Gamma distribution with shape n and rate δ = ν/2. Node n is then placed at xn = Rn , while the coordinates of the rest of n − 1 nodes are sampled from µn defined by this random Rn , and then labeled in the increasing order. In the growing graph formulation, the coordinate xn+1 of the n + 1’th node is determined by vn+1 = vn + V , where v0 = 0, vi = (1/2)e2xi , and V is a random variable sampled from the exponential distribution with rate δ = ν/2. All the three options are equivalent realizations of the same Poisson point process on R with measure µ and rate δ, converging to the binomial sampling with both n and Rn fixed [35]. The projective map πn7→n0 in the projectivity definition above, maps graphs Gn to their subgraphs induced by nodes i = 1 . . . n0 . We note that even though the growing version of the model is not exchangeable since it relies on labeling of nodes in the increasing order of their coordinates, it is nevertheless equivalent to the equilibrium HSCM with this ordered labeling, because the joint distribution of node coordinates, and the linking probability as a function of these coordinates are the same in the two formulations [35]. This observation suggests that there might exist a less trivial projective map such that the model is both projective and exchangeable if α = 2. If α 6= 2, then the HSCM, even with ordered labeling, is not equivalent to soft preferential attachment, but it is equivalent to its adjusted version with a certain rate of (dis)appearance of edges between existing vertices [24]. Here we note that the HSCM is the zero-clustering limit [40] of random hyperbolic graphs [41], where the α = 2 case corresponds to the uniform density of points in the hyperbolic space Hd , and where xi s are the radial coordinates of nodes i in the spherical coordinate system of the hyperboloid model of Hd . 
These coordinates can certainly not be negative, but the expected fraction of nodes with negative coordinates in the HSCM definition above is α/2 negligible: µn (R− ) = β 2 ν/n → 0. In the zero-clustering limit, the angular coordinates of nodes in Hd are ignored in the hyperbolic graphon [41], which becomes equivalent to (2). The main result of this paper is the proof that the HSCM random graphs maximize the Gibbs entropy of random graphs whose degree distribution converges to (5). The key technique used is the mapping of the Gibbs entropy maximization problem to the graphon entropy maximization problem [42]. The Fermi-Dirac graphon (2) is the graphon maximizing

5

graphon entropy subject to the constraints imposed by (5), a reflection of the basic fact in statistical physics that the grand canonical ensemble of Fermi particles, which are edges of energy x + y in our case, is a maximum-entropy ensemble with fixed expected values of energy and number of particles [43], in which the probability to find a particle at a state with energy x + y is given by (2). However, since the rescaled graphon WI,n converges to zero, its entropy converges to zero as well. To address this subtlety we prove that the properly rescaled Gibbs entropy of the graph ensemble converges to the graphon entropy faster than the graphon entropy converges to zero. The combination of graphon (2) being the entropy maximizer, and the convergence of the rescaled Gibbs entropy to the entropy of this graphon, implies that the HSCM is a Gibbs entropy maximizer subject to the degree distribution constraints (5). As a final introductory remark, among the rigorous approaches to sparse exchangeable graphs, the HSCM definition is perhaps closest to graphon processes and graphexes in [44– 46]. In particular in [46], where the focus is on graph convergence to well-defined limits, two ensembles are considered. One ensemble, also appearing in [45], is defined by any graphon W : R2+ → [0, 1] and any measure µ on R+ (R+ can be replaced with any measure space). Graphs of a certain expected size, which is a growing function of time t > 0, are defined by sampling points as the Poisson point process on R+ with intensity tµ, then connecting pairs of points with the probability given by W , and finally removing isolated vertices. The other ensemble is even more similar to the HSCM. It is still defined by W and µ on R+ , but the location of vertex n on R+ is sampled from µn = µ|An /µ(An ), where An s are finite-size intervals growing with n whose infinite union covers the whole R+ . The latter ensemble is not exchangeable, but both ensembles are shown to converge to properly stretched graphons defined by W , yet only if the expected average degree grows to infinity in the limit. The HSCM definition is different—in particular all n vertices of n-sized graphs are sampled from the same µn —ensuring exchangeability, and allowing for an explicit control of the degree distribution and average degree, which can be constant, but making the problem of graph convergence difficult. We do not further discuss graph convergence here, leaving it, as well as the generalization of the results to arbitrary degree distributions, for future publications. In what follows, we first review some necessary background information and definitions, Section 2, then formally state the results in Section 3, and finally provide proofs in Section 4.

2 2.1

Background information and definitions Graph ensembles and their entropy

A graph ensemble is a set of graphs G with probability measure P on G. The Gibbs entropy of the ensemble is X S[P ] = − P (G) log P (G) (11) G∈G

Note that this is just the entropy of the random variable G with respect to the probability measure P . When Gn is a graph of size n, sampled from G according to measure P , we write S[Gn ] instead of S[P ]. Given a set of constraints, e.g., in the form of graph properties fixed to given values, the maximum-entropy ensemble is given by P ∗ that maximizes S[P ] across all measures P that satisfy the constraints. These constraints can be either sharp (microcanonical) or soft (canonical), satisfied either exactly or on average, respectively. The simplest example of the constrained graph property is the number of edges, fixed to m, in

6

graphs of size n. The corresponding microcanonical and canonical maximum-entropy ensem n bles are Gn,m and Gn,p with p = m/ 2 , respectively. The P ∗ is respectively the uniform and exponential Boltzmann distribution P (G) = e−H(G) /Z with Hamiltonian H(G) = λm(G), where m(G) is the number of edges in graph G, and the Lagrange multiplier λ is given by p = 1/ eλ + 1 [13]. When the constraints are given by the degrees of nodes, instead of the number of edges, we have the following characterization of the microcanonical and canonical ensemble. 2.1.1

Maximum-entropy graphs with a given degree sequence (CM)

Given a degree sequence dn = d1 . . . dn , the microcanonical ensemble of graphs that have this degree sequence is the configuration model (CM) [7, 8]. The entropy-maximizing P ∗ is uniform on the set of all graphs that have degree sequence dn . 2.1.2

Maximum-entropy graphs with a given expected degree sequence (SCM)

If the sharp CM constraints are relaxed to soft constraints, the result is the canonical ensemble of the soft configuration model (SCM) [14, 15]. Given an expected degree sequence kn , which in contrast to CM’s dn , does not have to be a graphical sequence of non-negative integers, but can be any sequence of non-negative real numbers, the SCM is defined by connecting nodes i and j with probabilities pij =

1

, where Lagrange multipliers λi are the solution of +1 X ki = pij , i = 1 . . . n.

(12)

eλi +λj

(13)

i 0, β = 1 − , 2 νβ α µ|An µn = = α eα(x−Rn ) . µ(An )

(27) (28) (29) (30) (31)

The dense power-law HSCM is recovered from the above definition by setting ν = ν˜n, where ν˜ is a constant, in which case Rn = R = −(1/2) log ν˜β 2 , An = A = (−∞, R], and µn = µ = α eα(x−R) .

3

Results

In this section we formally state our results, and provide brief overviews of their proofs appearing in subsequent sections. The main result is Theorem 3.1, stating that the HSCM(α, ν) defined in Section 2.3.6 is a maximum-entropy model under hypersoft power-law degree distribution constraints, according to the definition in Section 2.3.5. This result follows from Theorems 3.2-3.6 and Proposition 3.4. Theorems 3.2,3.3 establish the limits of the degree distribution and expected average degree in the HSCM(α, ν). Proposition 3.4 states that HSCM’s graphon maximizes the graphon entropy under the constraints imposed by the degree distribution. Theorem 3.5 establishes proper graphon rescaling and the limit of the rescaled graphon. Finally, the most critical and involved Theorem 3.6 proves that the rescaled Gibbs entropy of the HSCM converges to its rescaled graphon entropy.

12

3.1

Main result

Let Y be a Pareto random variable with shape α > 1, scale νβ > 0, β = 1 − 1/α, so that Y ’s probability density function is PY (y) = α (νβ)α y −γ , y ≥ νβ, γ = α + 1, ( (νβ)α y −α if y ≥ νβ P (Y > y) = 1 otherwise.

and

Let D be a discrete random variable with probability density function   k Y −Y , k = 0, 1, 2, . . . , P (D = k) = E e k!

(32) (33)

(34)

which is the mixed Poisson distribution with mixing parameter Y [62]. Then it follows that E [D] = E [Y ] = ν,

(35)

and since Y is a power law with exponent γ, the tail of D’s distribution is also a power law with the same exponent [62]. In particular, P (D = k) is given by (5). Therefore if D is the degree of a random node in a random graph ensemble, then graphs in this ensemble are sparse and have a power-law degree distribution. Our main result is: Theorem 3.1. For any α > 1 and ν > 0, HSCM(α, ν) is a maximum entropy ensemble of random graphs under the hypersoft constraints (C1,C2) with P (D = k) and ν defined by (3235).

3.2

The limit of the degree distribution in the HSCM

The degree Dn of a random node i in a random HSCM(α, ν) graph of size n, conditioned on the node coordinates xn = x1 . . . xn , is the sum of n − 1 independent Bernoulli random variables with success probabilities W (xi , xj ), j 6= i. The distribution ofPthis sum can be approximated by the mixed Poisson distribution with the mixing parameter j6=i W (xi , xj ). Therefore after first integrating over xj with j 6= i and then over xi , the distribution of Dn is approximately the mixed Poisson distribution   (κn (X))k −κn (X) P (Dn = k) = E e , so that E [Dn ] = E [κn (X)], (36) k! where the random variable X has density µn (31), and the mixing parameter κn (x) is the expected degree of a node at coordinate x: κn (x) = (n − 1)wn (x), Z wn (x) = W (x, y) dµn (y),

(37) (38)

An

where An is given by (29). In the n → ∞ limit, the distribution of the expected degree κn (X) of X converges to the Pareto distribution with shape α and scale νβ. To prove this, we use the observation that 13

the mass of measure µn is concentrated towards the right end of the interval An = (−∞, Rn ], where Rn  1 for large n. Therefore, not only the contributions coming from negative x, y are negligible, but we can also approximate the Fermi-Dirac graphon W (x, y) in (27) with its classical limit approximation c (x, y) = e−(x+y) W (39) on R2+ . In addition, the expected degree function wn (x) can be approximated with w bn (x) defined by ( ωn e−x if 0 ≤ x ≤ Rn , where (40) w bn (x) = 0 otherwise Z Rn ν 1  1 1 − e−(α−1)Rn 2 −x ωn = e dµn (x) = = + o n− 2 , (41) Rn βe n 0 so that the expected degree of a node at coordinate x can be approximated by   1  1 . κ bn (x) = nw bn (x) = e−x (νn) 2 + o n 2

(42)

To see that κ bn (X) converges to a Pareto random variable, note that since X has density µn , it follows that for all t > νβ   nω α nωn  n = e−αRn P (b κn (X) > t) = P X < log t t = (νβ)α t−α (1 + o(1)) = P (Y > t) (1 + o(1)), (43) where Y is a Pareto-distributed random variable (33). We therefore have the following result, the full proof of which can be found in Section 4.2.2: Theorem 3.2 (HSCM(α, ν) satisfies (C1)). Let α > 1, ν > 0, and Dn be the degree of a uniformly chosen vertex in the HSCM(α, ν) graphs of size n. Then, for each k = 0, 1, 2, . . ., lim P (Dn = k) = P (D = k) ,

n→∞

(44)

where P (D = k) is given by (34).

3.3

The limit of the expected average degree in the HSCM

The expected degree of a random node in n-sized HSCM graphs is, for any fixed i, X E [Dn ] = E [W (Xi , Xj )] = (n − 1)E [W (X, Y )] ,

(45)

j6=i

where X and Y are independent random variables with distribution µn . Approximating c (x, y) on R2+ , and using e2Rn = n/νβ 2 , we have W (x, y) with W Z Rn 2 Z Z Rn 2 n  −x c (x, y) dµn (x) dµn (y) = n n W e dµn (x) = 2 2Rn 1 − e−(α−1)Rn = ν+o(1). β e 0 0 (46) All other contributions are shown to also vanish in the n → ∞ limit in Section 4.2.3, where the following theorem is proved: Theorem 3.3 (HSCM satisfies (C2)). Let α > 1, ν > 0, and Dn be the degree of a uniformly chosen vertex in the HSCM(α, ν) graphs of size n. Then lim E [Dn ] = ν.

n→∞

14

(47)

3.4

HSCM maximizes graphon entropy

Let A ⊆ R be some interval, µ a measure on A with µ(A) < ∞, and suppose some µ-integrable function w : A → R is given. Consider the graphon entropy maximization problem under the constraint Z W (x, y) dµ(y). (48) w(x) = A

That is, the problem is to find a symmetric function W ∗ that maximizes graphon entropy σ[W, µ, A] in (17) and satisfies the constraint above, for fixed A and µ. We note that this problem is a “continuous version” of the Gibbs entropy maximization problem in the SCM ensemble in Section 2.1.2. The following proposition, which we prove in Section 4.3.1, states that the solution to this problem is a “continuous version” of the SCM solution (12): Proposition 3.4 (HSCM Maximizes Graphon Entropy). Suppose there exists a solution W ∗ to the graphon entropy maximization problem defined above. Then this solution has the following form: 1 W ∗ (x, y) = λ(x)+λ(y) , (49) e +1 where function λ is µ-almost-everywhere uniquely defined on A by (48). This proposition proves that for each n, the HSCM(α, ν)’s Fermi-Dirac graphon (27) maximizes the graphon entropy under constraint (38), because A and µ in the HSCM(α, ν) are chosen such that λ(x) = x. This is always possible as soon as λ(x) is invertible, cf. Section 2.3.3. For each n, interval An (29) and measure µn (31) in the HSCM(α, ν) can be mapped to [0, 1] and 1, respectively, in which case λn (x) = Rn + (1/α) log x, leading to (8). In other words, node coordinates x in the original HSCM(α, ν) definition in Section 2.3.6, and their coordinates x ˜ in its equivalent definition with An = [0, 1] and µn = 1 are related by x ˜ = eα(x−Rn ) .

3.5

(50)

Graphon entropy scaling and convergence

To derive the rate of convergence of HSCM’s graphon entropy to zero, it suffices to consider c (39) to W (27) on R2+ . Its Bernoulli entropy (16) is the classical limit approximation W   c (x, y) = (x + y)W c (x, y) − (1 − W c (x, y)) log(1 − W c (x, y)). H W (51) c (Rn , Rn ) → 0, the second Since the most of the mass of µn is concentrated near Rn and W term is negligible. Integrating the first term over [0, Rn ], we get Rn

ZZ

2 −2αRn

Rn

ZZ

c (x, y) dµn (x) dµn (y) = α e (x + y)W 0

(x + y)e(α−1)(x+y) dx dy

0

2Rn ν n log n = 2 2Rn + O(n−1 ) = log 2 + O(n−1 ) = ν + O(n−1 ), β e n νβ n

(52)

from which we obtain the proper scaling as log(n)/n. All further details behind the proof of the following theorem are in Section 4.3.2.

15

Theorem 3.5 (Graphon Entropy Convergence). Let σ[Gn ] be the graphon entropy (24) in the HSCM(α, ν) ensemble with any α > 1 and ν > 0. Then, as n → ∞, n σ[Gn ] = O (1/ log n) . − ν (53) log n This theorem implies that σ[Gn ] goes to zero as log(n)/n, while n σ[Gn ]/ log n goes to ν as 1/ log n.

3.6

Gibbs entropy scaling and convergence

The last part of Theorem 3.1 is to prove that the rescaled Gibbs entropy (19) of the HSCM converges to its graphon entropy (24) faster than the latter converges to zero. The graphon entropy is a trivial lower bound for the rescaled Gibbs entropy, Section 2.3.2, so the problem is to find an appropriate upper bound for the latter converging to the graphon entropy. To identify such an upper bound, we rely on an argument similar to [38]. Specifically, we first partition An into m intervals It that induce a partition of A2n into rectangles Ist = Is ×It , s, t = 1 . . . m. We then approximate the graphon by its average value on each rectangle. Such approximation brings in an error term on each rectangle. We then show that the Gibbs entropy is upper-bounded by the entropy of the averaged graphon, plus the sum of entropies of indicator random variables Mi which take value t if the coordinate xi of node i happens to fall within interval It . The smaller the number of intervals m, the smaller the total entropy of these random variables Mi , but the larger the sum of the error terms coming from graphon averaging, because rectangles Ist are large. The smaller they are, the smaller the total error term, but the larger the total entropy of the Mi ’s. The crux of the proof is to find a “sweet spot”—the right number of intervals guaranteeing the proper balance between these two types of contributions to the upper bound, which we want to be tighter than the rate of the convergence of the graphon entropy to zero. This program is executed in Section 4.4, where we prove the following theorem: Theorem 3.6 (Gibbs Entropy Convergence). Let σ[Gn ] be the graphon entropy (24) and S ∗ [Gn ] be the rescaled Gibbs entropy (19) in the HSCM(α, ν) ensemble with any α > 1 and ν > 0. Then n |S ∗ [Gn ] − σ[Gn ]| = 0, and n→∞ log n 2 S[Gn ] lim = ν. n→∞ n log n lim

(54) (55)

We remark that this theorem implies that S[Gn ] ∼

ν n log n, 2

(56)

which is the leading term of the (S)CM Gibbs entropy obtained in [23]. It is also instructive to compare this scaling of Gibbs entropy with its scaling in dense ensembles with limn→∞ σ[Gn ] = σ[W, µ, A] = σ ∈ (0, ∞) [38]: σ S[Gn ] ∼ n2 . (57) 2

16

Finally, it is worth mentioning that even though we use the Fermi-Dirac graphon W (27) to define our W -random graphs, the same convergence results could be obtained for W -random graphs defined by any other graphon W 0 such that   lim n E W (X, Y ) − W 0 (X, Y ) = 0, (58) n→∞

with X and Y having density µn . In fact, to establish the required limits, we use the classic instead of W . Therefore there exists a vast equivalence cal limit approximation graphon W class of W -random graphs defined by graphons W 0 that all have the same limit degree distribution (C1) and average degree (C2), and whose rescaled Gibbs entropy converges to the graphon entropy of the Fermi-Dirac W . However, it follows from Proposition 3.4 that among all these ensembles, only the W -random graph ensemble defined by the Fermi-Dirac graphon (27) uniquely maximizes the graphon entropy (17) for each n, which, by our definition of maximum-entropy ensembles under hypersoft constraints, is a necessary condition for graph entropy maximization.

4

Proofs

In this section we provide the proofs of all the results stated in the previous section. In Section 4.1 we begin with some preliminary results on the accuracy of the approximation of c . In the same section the Fermi-Dirac graphon W (27) by the classical limit approximation W we also establish results showing that the main contribution of the integration with respect to µn is on the positive part [0, Rn ] of the interval An as defined by (29). In particular, we show that all contributions coming from the negative part of this interval, i.e., R− , are o(n−1 ), which means that for all our results the negative part of the support of our measure µn is negligible. We then proceed with proving Theorems 3.2 and 3.3 in Section 4.2. The proofs of Proposition 3.4 and Theorem 3.5 can be found in Section 4.3. Finally, the convergence of the rescaled Gibbs entropy to the graphon entropy (Theorem 3.6) is given in Section 4.4.

4.1

The classical limit approximation of the Fermi-Dirac graphon

We will use e−(x+y) as an approximation to the graphon W to compute all necessary limits. To be precise we define c (x, y) = min{e−(x+y) , 1} W (59) c converge to zero as n tends to and show that differences between the integrals of W and W c , we could have also worked with the integral expressions infinity. Note that, instead of W involving W , which might have led to better bounds. However, these integrals tend to evaluate c are much easier to to combinations of hypergeometric functions, while the integrals of W evaluate and are sufficient for our purposes. c (x, y) we need to consider separately, the intervals (−∞, 0] and By the definition of W (0, Rn ]. Since graphons are symmetric functions, this leads to the following three different cases: I)

− ∞ < x, y ≤ 0

II)

− ∞ < y ≤ Rn and 0 < x ≤ Rn ,

c ≤ 1 and For case I) we note that W, W ZZ 0  dµn (y) dµn (x) = O n−α . −∞

17

III)

0 < x, y ≤ Rn . (60)

(61)

c , only the With this we obtain the following result, which shows that for both W and W 2 integration over (0, Rn ] , i.e. case III, matters. Lemma 4.1. Z Z Rn

ZZ

Rn

c (x, y) dµn (y) dµn (x) − W −∞

  c (x, y) dµn (y) dµn (x) = O n− α+1 2 W

(62)

0

c with W . and the same result holds if we replace W Proof. First note that Z −Rn Z

Rn

Z

−Rn

c (x, y) dµn (y) dµn (x) ≤ W −∞

Z

0

−Rn

(63)

−∞

−∞

We show that

 dµn (y) = O n−α .

Z

Rn

  c (x, y) dµn (y) dµn (x) = O n− α+1 2 W ,

(64)

0

which together with (61) and (63) implies the first result. The result for W follows by noting c. that W ≤ W We split the integral (64) as follows Z 0 Z Rn Z 0 Z −x Z 0 Z Rn c e−(x+y) dµn (y) dµn (x). dµn (y) dµn (x) + W (x, y) dµn (y) dµn (x) = −Rn

−Rn

0

−Rn

0

−x

(65) For the first integral we compute Z Z 0 Z −x dµn (y) dµn (x) = αe−2αRn −Rn

0

 e−αx − 1 eαx dx −Rn   αRn + e−αRn − 1 = O log(n)n−α

0

= e−2αRn

Finally, the second integral evaluates to Z  α −2αRn 0  (α−1)Rn α − e−(α−1)x e(α−1)x dx = e−2αRn e e β β −Rn

e(α−1)Rn − 1 − Rn α−1

!

(66)

 α+1  = O n− 2 , (67)

from which the result follows since α > (α + 1)/2. c ). We can show a similar result for H(W ) and H(W Lemma 4.2. Z Z Rn  ZZ  c H W (x, y) dµn (y) dµn (x) − −∞

where

Rn

  c (x, y) dµn (y) dµn (x) = n , H W

(68)

0

 −α   O (log(n)n ) n = O log(n)2n−2   O n− 2+α 2

if 1 < α < 2, if α = 2, if α > 2,

c with W . Moreover, the same result holds if we replace W 18

(69)

c . For this we split the interval An into three parts Proof. We will first prove the result for W (−∞, −Rn ], [−Rn , 0] and (0, Rn ] and show that the integrals on all ranges other than [0, Rn ]2 are bounded by a term that scales as n . Since H(p) ≤ log(2) for all 0 ≤ p ≤ 1 it follows from (61) and (63) that, for all α > 1, 0

ZZ

   c (x, y) dµn (y) dµn (x) = O n−α = o(n ), H W

−∞ Rn Z −Rn

Z

−∞

   c (x, y) dµn (y) dµn (x) = O n−α = o(n ). H W

(70) (71)

−∞

c , we only need to consider the integration over (0, Rn ] × Hence, using the symmetry of W (−Rn , 0]. First we compute     c (x, y) = e−(x+y) (x + y) − (1 − e−(x+y) ) log 1 − e−(x+y) H W (72) and observe that  − (1 − e−z ) log 1 − e−z ≤ e−2z

for all large enough z.

(73)

Now let δ > 0 be such that (73) holds for all z ≥ δ. Then δ < Rn for sufficiently large n and we split the integration as follows Z 0

Rn

Z

0

  c (x, y) dµn (y) dµn (x) H W

−Rn

Z

δ

0

Z

Z   c (x, y) dµn (y) dµn (x) + H W

= −Rn

0

Rn

Z

0

  c (x, y) dµn (y) dµn (x). H W

(74)

−Rn

δ

The first integral is O(n−α ). For the second we note that x + y > δ for all y > δ − x and hence Z Rn Z 0   c (x, y) dµn (y) dµn (x) H W −Rn

δ

Z

Rn

Z

δ−x



Z

Rn

Z

0

log(2) dµn (y) dµn (x) + −Rn

δ

δ

e−2(x+y) dµn (y) dµn (x).

(75)

δ−x

For the second integral we obtain,  −α   Z Rn Z 0 O (log(n)n ) if 1 < α < 2, if α = 2, e−2(x+y) dµn (y) dµn (x) = O log(n)2n−2  2+α δ δ−x  − O n 2 if α > 2,

(76)

while for the first integral we have Z

Rn

Z

δ−x

−2αRn

Z

log(2) dµn (y)dµn (x) ≤ αe δ

−Rn

δ

19

Rn

 eαδ dx = O log(n)n−α .

(77)

Therefore we conclude that  −α   Z Rn Z 0 O (log(n)n ) if 1 < α < 2,   c (x, y) dµn (y)dµn (x) = O log(n)n−2 if α = 2, H W  4+α   0 −∞   O n− 2 if α > 2,

(78)

c. which yields the result for W For W we first compute that    ex+y (x + y)ex+y x+y = log 1 + e − (x + y) + (x + y) 1 − H(W (x, y)) = log 1 + e − 1 + ex+y 1 + ex+y     (x + y) −(x+y) = log 1 + e−(x+y) + ≤ log 1 + e + (x + y)e−(x+y) . (79) 1 + ex+y x+y



c (x, y)) and noting that log (1 + e−z ) ≤ e−2z for large Comparing this upper bound to H(W c. enough z, the result follows from the computation done for W With these two lemmas we now establish two important results on the approximations of c (X, Y ) W . The first shows that if X and Y are independent with distribution µn , then W converges in expectation to W (X, Y ), faster than n−1 . Proposition 4.3. Let X, Y be independent with density µn and α > 1. Then, as n → ∞, i  h  c (X, Y ) = O n− α+1 2 (80) E W (X, Y ) − W Proof. Since i Z Z h c (X, Y ) = E W (X, Y ) − W

Rn

−∞

c (x, y) dµn (y) dµn (x), W (x, y) − W

(81)

c (x, y) ≤ 1 it follows from Lemma 4.1 that it is enough to consider the and W (x, y) − W integral Z Z Rn c (x, y) dµn (y) dµn (x). (82) W (x, y) − W 0

For this we note that

c (x, y) ≤ e−2(x+y) . W (x, y) − W

Hence we obtain Z Z Rn c W (x, y) − W (x, y) dµn (y) dµn (x) 0  −α Z Rn 2  O (n )  ≤ e−2x dµn (x) = O log(n)2 n−2   0  O n−2 Since all these terms are o(n−(α+1)/2 ), the result follows.

20

(83)

if 1 < α < 2 if α = 2 if α > 2.

(84)

c (X, Y )) converges in expectation to H(W (X, Y )), faster than Next we show that also H(W

n−1 .

Proposition 4.4. Let X, Y be independent with density µn and α > 1. Then, as n → ∞,  −α O (log(n)n ) if 1 < α < 2 h   i  c (X, Y ) = O log(n)3 n−2 E H (W (X, Y )) − H W (85) if α = 2    −2 O log(n)n if α > 2. Proof. Similar to the previous proof we now use Lemma 4.2 to show that it is enough to consider the integral Rn

ZZ 0

  c (x, y) dµn (y)dµn (x). H (W (x, y)) − H W

Define δW (x, y) =

1 , e2(x+y) + ex+y

(86)

(87)

c (x, y) = W (x, y) + δW (x, y). Now fix x, y. Then, by the Taylor-Lagrange and note that W c (x, y) such that theorem, there exists a W (x, y) ≤ cW (x, y) ≤ W   c (x, y) = H 0 (cW (x, y)) δW (x, y) H (W (x, y)) − H W     c (x, y) δW (x, y). ≤ H 0 (W (x, y)) + H 0 W (88) Next we compute that 0 H (W (x, y)) = (x + y),

and

   0 c H W (x, y) = log ex+y − 1 ,

(89)

where log(ex+y − 1) ≤ (x + y) for all x + y ≥ 1. We now split the integral and bound it as follows Z Z Rn   c (x, y) dµn (y) dµn (x) H (W (x, y)) − H W 0 Z 1 Z 1−x   c (x, y) dµn (y) dµn (x) = H (W (x, y)) − H W 0 0 Z 1 Z Rn   c (x, y) δW (x, y) dµn (y) dµn (x) + H (W (x, y)) − H W 0

Rn

Z

1−x Z Rn

  c (x, y) δW (x, y) dµn (y) dµn (x) H (W (x, y)) − H W

+ 1

Z

0 1 Z 1−x

≤ log(2) 0 Z 1Z +2 0

Z

dµn (y) dµn (x) 0 Rn

(x + y)δW (x, y) dµn (y) dµn (x)

1−x Rn Z Rn

+2

(x + y)δW (x, y) dµn (y) dµn (x) 1

0

21

Z

1 Z 1−x

≤ log(2)

ZZ

0

Rn

(x + y)δW (x, y) dµn (y) dµn (x).

dµn (y) dµn (x) + 4

(90)

0

0

The first integral is O(log(n)n−α ), while for the second we have Rn

ZZ

(x + y)δW (x, y) dµn (y) dµn (x)

4 0

 −α   2 O (log(n)n ) if 1 < α < 2 Rn e−2x dµn (x) = O log(n)3 n−2 if α = 2   0  −2 O log(n)n if α > 2.

Z ≤ 8Rn

(91)

Comparing these scaling to the ones from Lemma 4.2 we see that the former are dominating, which finishes the proof.

4.2

Proofs for node degrees in the HSCM

In this section we give the proofs of Theorem 3.2 and Theorem 3.3. Denote by Di the degree of node i and recall that Dn is the degree of a node, sampled uniformly at random. Since the node labels are interchangeable we can, without loss of generality, consider D1 for Dn . For Theorem 3.3 we h i use (45). We show that if X and Y are independent with distribution c µn , then E W (X, Y ) → ν. The final result will then follow from Proposition 4.3. PnThe proof of Theorem 3.2 is more involved. Given the coordinates X1 , . . . Xn we have Dn = j=2 W (X1 , Xj ). We follow the strategy from [63, Theorem 6.7] and construct a coupling between Dn and a mixed Poisson random variable Pn , with mixing parameter nw bn (X), where w bn (x) is given by (40) and X has distribution µn . In general, a coupling between two random variables X and Y consists of a pair of new b Yb ), with some joint probability distribution, such that X b and Yb have random variables (X, the same marginal probabilities as, respectively X and Y . The advantage of a coupling is that we can tune the joint distribution to fit our needs. For our proof we construct a coupling b n , Pbn ), such that (D   b n 6= Pbn = 0. lim P D (92) n→∞

b n and Pbn have the same distribution as, respectively Dn and Pn we have Hence, since D   b n = k, D b n 6= Pbn , P (Dn = k) = P (Pn = k) + P D (93) from which it follows that |P (Dn = k) − P (Pn = k) | → 0. Finally, we show that lim P (nw bn (X) > k) = P (Y > k) ,

n→∞

(94)

where Y is a Pareto random variable with shape νβ and scale α, i.e. it has probability density PY (y) given by (32). This implies that the mixed Poisson random variable Pn with mixing parameter nw bn (X) converges to a mixed Poisson D with mixing parameter Y , which proves the result. Before we give the proofs of the two theorems we first establish some technical results needed to construct the coupling, required for Theorem 3.2, the proof of which is given in Section 4.2.2. 22

4.2.1

Technical results on Poisson couplings and concentrations

We first establish a general result for couplings between mixed Poisson random variables, where the mixing parameters converge in expectation. Lemma 4.5. Let Xn and Yn be random variables such that, lim E [|Xn − Yn |] = 0.

(95)

n→∞

Then, if Pn and Qn are mixed Poisson random variables with, respectively, parameters Xn and Yn , P (Pn 6= Qn ) = O (E [|Xn − Yn |]) , (96) and in particular lim P (Pn 6= Qn ) = 0.

(97)

n→∞

Proof. Let an = E [|Xn − Yn |] and define the event An = {|Xn − Yn | ≤ lim P (|Xn − Yn | >

n→∞



an ) ≤ lim

n→∞



E [|Xn − Yn |] = 0, √ an

an }. Then, since (98)

it is enough to show that lim P (Pn 6= Qn , An ) = 0.

n→∞

(99)

√ Take Pˆn to be mixed Poisson with parameter Xn + an . In addition, let Vn and Zn be mixed √ √ Poisson with parameter min{Xn + an − Yn , 0} and an , respectively. Then, since on An we √ have Xn + an > Yn we get, using Markov’s inequality,     P (Pn 6= Qn , An ) ≤ P Pn 6= Qn , Pn = Pˆn , An + P Pˆn 6= Pn , An   = P Pˆn 6= Qn , An + P (Zn 6= 0, An ) = P (Vn 6= 0, An ) + P (Zn 6= 0, An ) √ √ ≤ P (Vn ≥ 1, An ) + P (Zn ≥ 1, An ) ≤ E [|Xn + an − Yn |] + an √ √ ≤ E [|Xn − Yn |] + 2 an = O( an ). (100) Since an → 0 by assumption, this finishes the proof. c (X, Y ) converges in expectation to w Next we show that W bn (X) when X, Y have distribution µn . We also establish an upper bound on the rate of convergence, showing that it converges faster than n−1 . Lemma 4.6. Let X, Y be independent random variables with density µn . Then, as n → ∞, i h   1+α c E W (X, Y ) − w bn (X) = O log(n)n−α + n− 2 (101) Proof. Recall that ( ωn e−x if 0 ≤ x ≤ Rn w bn (x) = , where 0 otherwise Z Rn 1 − e−(α−1)Rn . ωn = e−x dµn (x) = βeRn 0 23

(102) (103)

Hence, it follows that i Z h c (X, Y ) − w bn (X) = E W

0

Rn

Z

c (x, y) dµn (y) dµn (x) W

−∞ −∞ Z Rn Z Rn

c bn (x) dµn (y) dµn (x), W (x, y) − w

+ −∞

0

where the first integral is O(n−(α+1)/2 ) by Lemma 4.1. To deal with the second integral we first compute Z Rn Z −x  dµn (y) dµn (x) = αRn e−2αRn = O log(n)n−α .

(104)

(105)

−∞

0

Therefore we have Z Rn Z Rn c bn (x) dµn (y)dµn (x) W (x, y) − w −∞

0

Z

Rn

Z

−x

Z

Rn

Z

Rn

dµn (y)dµn (x) +

= 0

−∞

−x

0

≤ O log(n)n

 −α

= O log(n)n

 −α

Z

Rn

Z

Rn

+ 0

0 Rn

Z + ωn

−(x+y) − ωn e−x dµn (y)dµn (x) e

e−x e−y − ωn dµn (y)dµn (x)

−y e − ωn dµn (y)

(106)

0

We proceed to compute the last integral and show that it is O(n−(α+1)/2 ). For this we note that since e−y ≤ ωn ⇐⇒ y ≤ − log(ωn ), (107) we have Z Z Rn −y e − ωn dµn (y) = 0

− log(ωn )

−y

(ωn − e

Z

Rn

) dµn (y) +

(e−y − ωn ) dµn (y) (108)

− log(ωn )

0

For the first integral we compute Z

− log(ωn )

(ωn − e

−y

) dµn (y) = αe

0

= Similar calculations yield Z Rn

 ωn1−α ωn1−α ωn 1 − − + α α−1 α α−1 −αR n ωn e ωn e−αRn e−αRn − − ωn e−αRn = − . (109) α−1 β β

−αRn

e−αRn β



(e−y − ωn ) dµn (y) =

− log(ωn )

and hence, Z Rn 0

ω 1−α e−αRn e−Rn − ωn − n , β α−1

−αRn −y  ω 1−α e−αRn e − ωn dµn (y) = 1 e−Rn + e−αRn − ωn − ωn e − n β β α−1

24

(110)



 1 −Rn e + e−αRn − ωn . β

(111)

We now use this last upper bound, together with ωn =

 1 − e−(α−1)Rn 1 −Rn −αRn = e − e , R βe n β

(112)

to obtain Z ωn 0

Rn

−y  e − ωn dµn (y) ≤ ωn e−Rn + e−αRn − ωn2 β  2 1 1 = 2 e−2Rn + e−α2Rn − 2 e−Rn − e−αRn β β   1+α 2 = 2 e−(α+1)Rn = O n− 2 , β

(113)

from which the result follows. 4.2.2

Proof of Theorem 3.2

b n , Pbn ), between Dn and Pn such that We start with constructing the coupling (D   b n 6= Pbn = 0. lim P D n→∞

(114)

First, let Ij be the indicator that the edge (1, j) is present in Gn . Let Xn = X1 , . . . , Xn denote the coordinates of the nodes. Then, conditioned on these, we have that PnIj are independent Bernoulli random variables with probability W (X , X ), while D = 1 j 1 j=2 Ij . Now, let Qn Pn be a mixed Poisson with parameter j=2 W (X1 , Xj ). Then, see for instance [63, Theorem b 1, Q b n ) of D1 and Qn , such that 2.10], there exists a coupling (D n  X b b P D1 6= Qn Xn ≤ W (X1 , Xj )2 .



(115)

j=2

Therefore, we get that ZZ   b 1 6= Q b n ≤ (n − 1) P D

Rn

W (x, y)2 dµn (y) dµn (x) −∞   −(α−1) Z Rn 2  O n  ≤n e−2x dµn (x) = O log(n)n−1   −∞  O n−1

if 1 < α < 2 . if α = 2 if α > 2

(116)

Next, since X1 and Xj are independent for all 2 ≤ j ≤ n we use Proposition 4.3 together with Lemma 4.6 to obtain   X n  E W (X1 , Xj ) − nw bn (X1 )  j=2 ≤ (n − 1)E [|W (X1 , X2 ) − w bn (X1 )|] + E [w bn (X1 )] 25

i i h h c (X1 , X2 ) + (n − 1)E W c (X1 , X2 ) − w ≤ (n − 1)E W (X1 , X2 ) − W bn (X1 ) + E [w bn (X1 )]  α−1  = O n− 2 . (117) from which it follows that   X n  lim E W (X1 , Xj ) − nw bn (X1 )  = 0. n→∞ j=2

(118)

Now let Pbn be a mixed Poisson random variable with mixing parameter nw bn (X1 ). Then by (118) and Lemma 4.5   bn = 0 lim P Pbn 6= Q (119) n→∞

and (114) follows from       b n 6= Q bn = bn . b n + P Pbn 6= Q 6 Pbn ≤ P D P D

(120)

As a result we have that     b n = k − P Pbn = k lim |P (Dn = k) − P (Pn = k)| = lim P D n→∞ n→∞     b n = k, D b n 6= Pbn ≤ lim P D b n 6= Pbn = 0, = lim P D n→∞

n→∞

(121) We will now prove that if X has density µn , then for any t > 0 lim |P (nw bn (X) > t) − P (Y > t)| = 0.

n→∞

(122)

A sequence {Pn }n≥1 of mixed Poisson random variables, with mixing parameters Zn converges to a mixed Poisson random variable P with mixing parameter Z, if Zn converge to Z in distribution, see for instance [62]. Therefore, since D is mixed Poisson with mixing parameter Y , (122) implies lim P (Pn = k) = P (D = k) (123) n→∞

which, combined with (121), yields lim P (Dn = k) = P (D = k) .

n→∞

(124)

To establish (122), first define n = e−(α−1)Rn − e−2αRn so that nw bn (x) =

 √ n  −(α−1)Rn 1 − e = νn e−x (1 − n ) . βeRn

(125)

(126)

Next, for all 0 ≤ x ≤ Rn we have that νβ (1 − n ) ≤ nw bn (x) ≤ 26



νn (1 − n )

(127)

and hence   √ nωn  if νβ (1 − n ) ≤ t ≤ νn (1 − n ) P X < log t P (nw ˆn (X) > t) = 1 if t < νβ (1 − n )   0 else. √ Moreover, for any νβ (1 − n ) ≤ t ≤ νn (1 − n )   nω  Z log( nωt n ) n P (nw bn (X) > t) = P X < log = dµn (x) t −∞  α  α  νβ −Rn α −α −αRn nωn = nωn e (1 − n )α . t = =e t t

(128)

(129)

Now, fix t < νβ. Then, for large enough n it holds that t < νβ(1 − n ) in which case (122) holds trivially, since both probabilities are 1. Hence we can assume, without loss of generality √ that t ≥ νβ. Then for n large enough such t ≤ νn (1 − n ), it follows from (129) and (33) that  α  α νβ νβ α |P (nw bn (X) > t) − P (Y > t)| = (1 − n ) − t t  α νβ (1 − (1 − n )α ) ≤ 1 − (1 − n )α , (130) = t which converges to zero as n → ∞ and hence proves (122). 4.2.3

Proof of Theorem 3.3

First using Lemma 4.1, we compute that Z Z Rn h i c (x, y) dµn (y) dµn (x) c W (n − 1)E W (X, Y ) = (n − 1) −∞ Rn

ZZ = (n − 1)

 α−1  e−(x+y) dµn (y) dµn (x) + O n− 2

0

 α−1   (n − 1) −2Rn −αRn 2 = e 1 − e + O n− 2 β2    α−1   α−1  2 1 =ν 1− 2 1 − e−αRn + O n− 2 = ν + O n− 2 , (131) β n where the last line follows since −1 < −α/2 < −(α − 1)/2. Next recall (45) E [Dn ] =

n X

E [W (X1 , Xj )] = (n − 1)E [W (X, Y )] .

(132)

j=2

Therefore, using Lemma 4.6 we have |E [Dn ] − ν| = |(n − 1)E [W (X, Y )] − ν| i h h i c (X, Y ) + (n − 1) E W c (X, Y ) − ν ≤ (n − 1)E W (X, Y ) − W    α−1  ≤ O log(n)n−(α−1) + O n− 2 , which yields the result. 27

(133)

4.3

Proofs for graphon entropy

Here we derive the properties of the graphon entropy σ of our graphon W given by (27). We first give the proof of Proposition 3.4, and then that of Theorem 3.5. 4.3.1

Proof of Proposition 3.4

Recall that, given a measure µ on some interval A ⊆ R with µ(A) < ∞ and a µ-integrable function w(x), we consider the problem of maximizing σ[W, µ] under the constraint (48), Z W (x, y) dµ(y). (134) w(x) = A

In particular we need to show that if a solution exists, it is given by (49). Therefore, suppose there exists at least one graphon W which satisfies the constraint. For this we use the technique of Lagrange multipliers from variation calculus [64, 65]. To set up the framework, let W denote the space of symmetric functions W : A×A → [0, 1] which satisfy ZZ W (x, y) dµ(y) dµ(x) < ∞.

(135)

A2

Observe that W is a convex subset of the Banach space WR , of all symmetric, µ-integrable, functions W : A × A → R and that the function w is an element of L1 (−∞, R) with respect to the measure µ, which is also Banach. Denote the latter space by L1A,µ . We slightly abuse notation and write σ for the functional ZZ W 7→ H(W (x, y)) dµ(y) dµ(x) = σ[W, µ], (136) A2

and define the functional F : WR → L1A,µ by Z F(W )(x) = w(x) −

W (x, y) dµ(y).

(137)

A

Then we need to solve the following Euler-Lagrange equation for some Lagrange multiplier functional λ : L1A,µ → R, ∂ (σ(W ) − Λ ◦ F(W )) = 0, (138) ∂W with respect to the Fr´echet derivative. By Riesz Representation Theorem, we have that for any functional Λ : L1A,µ → R, there exists a λ ∈ L∞ , uniquely defined on A, such that Z Λ(f ) = λ(x)f (x) dµ(x). (139) A

Hence our Euler-Lagrange equation becomes     Z Z ∂ σ(W ) − λ(x) w(x) − W (x, y) dµ(y) dµ(x) = 0. ∂W A A In addition, since W is symmetric we have ZZ ZZ 1 λ(x)W (x, y) dµ(y)dµ(x) = (λ(x) + λ(y))W (x, y) dµ(y) dµ(x) 2 A2 A2 28

(140)

(141)

and hence, by absorbing the factor 1/2 into λ we can rewrite our Euler-Lagrange equation as   ZZ Z ∂ (λ(x) + λ(y))W (x, y) dµ(y) dµ(x) = 0. (142) λ(x)w(x) dµ(x) + σ(W ) − ∂W A2 A For the two derivatives we have   ∂σ(W ) W (x, y) , = log(1 − W (x, y)) − log(W (x, y)) = − log ∂W 1 − W (x, y)  Z ZZ ∂ (λ(x) + λ(y))W (x, y) dµ(y) dµ(x) = λ(x) + λ(y). λ(x)w(x) dµ(x) + ∂W A2 A Hence, we need to solve the equation   W (x, y) − log = λ(x) + λ(y), 1 − W (x, y)

(143)

(144)

which gives (49). There is, however, a small technicality related to the computation of the derivative (143). This is caused by the fact that H is only defined on [0, 1]. In particular, for $W_1, W_2 \in \mathcal{W}$ it could be that $W_1 + W_2 > 1$ on some subset C ⊆ A × A, and hence $H(W_1 + W_2)$ is not well defined. To compute the Fréchet derivative we need $H(W_1 + \varepsilon W_2)$ to be well defined for any $W_1, W_2 \in \mathcal{W}$ and some small enough ε > 0. To this end we fix a 0 < δ < 1 and define
\[
H_\delta(x) = H\!\left((1-\delta)x + \frac{\delta}{2}\right), \tag{145}
\]
which is just H, stretched out to the interval $\left(-\frac{\delta}{2(1-\delta)},\, 1 + \frac{\delta}{2(1-\delta)}\right)$. Similarly we define $\sigma_\delta$, using $H_\delta$, and consider the corresponding graphon entropy maximization problem. Then we can compute $\partial\sigma_\delta/\partial W$ by taking a $W' \in \mathcal{W}$ and ε > 0 such that $W + \varepsilon W' < 1 + \frac{\delta}{2(1-\delta)}$. Then $H_\delta(W + \varepsilon W')$ is well defined, and using the chain rule we obtain
\[
\frac{\partial H_\delta(W + \varepsilon W')}{\partial \varepsilon} = -(1-\delta)\, W' \log\left(\frac{(1-\delta)(W + \varepsilon W') + \delta/2}{1 - (1-\delta)(W + \varepsilon W') - \delta/2}\right), \tag{146}
\]
from which it follows that
\[
\frac{\partial \sigma_\delta(W)}{\partial W} = -(1-\delta) \log\left(\frac{(1-\delta)W + \delta/2}{1 - (1-\delta)W - \delta/2}\right). \tag{147}
\]
Therefore we have the following equation,
\[
-(1-\delta) \log\left(\frac{(1-\delta)W + \delta/2}{1 - (1-\delta)W - \delta/2}\right) = \lambda(x) + \lambda(y), \tag{148}
\]
which leads to a solution $W_\delta$ of the form
\[
W_\delta(x,y) = \frac{1 - \frac{\delta}{2}\left(1 + e^{\frac{\lambda(x)+\lambda(y)}{1-\delta}}\right)}{(1-\delta)\left(1 + e^{\frac{\lambda(x)+\lambda(y)}{1-\delta}}\right)}. \tag{149}
\]

Since δ < 1 it follows, using elementary algebra, that $W_\delta(x,y) \in [0,1]$ for all x, y ∈ [0, 1]. From this we conclude that $W_\delta \in \mathcal{W}$. Moreover, $W_\delta$ converges to (49) as δ → 0. Since $\sigma_\delta \to \sigma$ as δ → 0, we obtain the graphon that maximizes the entropy σ, where the function λ is determined by the constraint (48).

For uniqueness, suppose there exist two solutions $\lambda_1$ and $\lambda_2$ in $L^1_{A,\mu}$ to (48) for which the graphon entropy is maximized. Let $f(x) = \lambda_1(x) - \lambda_2(x) \in L^1_{A,\mu}$, so that $\lambda_1 = \lambda_2 + f$. Since $\lambda_1$ satisfies (140) it follows that, due to linearity of the derivative,
\[
\frac{\partial}{\partial W} \iint_{A^2} f(x)\, W(x,y)\, d\mu(y)\, d\mu(x) = 0. \tag{150}
\]
Now since
\[
\frac{\partial}{\partial W} \iint_{A^2} f(x)\, W(x,y)\, d\mu(y)\, d\mu(x) = f(x), \tag{151}
\]
it follows that f = 0, µ-almost everywhere on A, and hence $\lambda_1 = \lambda_2$, µ-almost everywhere on A.
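To make the maximizer (49) concrete, the following sketch solves a discretized analogue of the constrained problem: on a finite grid approximating (A, µ) it fits multipliers λ_i so that the graphon W_{ij} = 1/(1 + e^{λ_i+λ_j}) matches a prescribed degree function w, mirroring the constraint (48). The grid, the target w, and the simple damped fixed-point iteration are illustrative choices only; a production solver would use Newton's method.

```python
# Illustrative discretization of Proposition 3.4: on a grid x_1..x_m with weights
# mu_i approximating the measure mu, find lambda_i such that
#     w(x_i) = sum_j W(x_i, x_j) * mu_j,  with  W(x_i, x_j) = 1 / (1 + exp(lambda_i + lambda_j)),
# which is the discrete counterpart of the entropy-maximizing form (49).
import numpy as np

m = 200
x = np.linspace(0.0, 5.0, m)            # grid on an interval A (illustrative choice)
mu = np.full(m, 5.0 / m)                # weights of a uniform measure on A (illustrative)
w_target = 0.5 * np.exp(-x)             # a prescribed integrable degree function (illustrative)

lam = np.zeros(m)
for _ in range(5000):
    W = 1.0 / (1.0 + np.exp(lam[:, None] + lam[None, :]))
    w_cur = W @ mu
    # Increase lambda_i where the current degree is too large, decrease it where too small.
    lam += 0.5 * (np.log(w_cur) - np.log(w_target))

W = 1.0 / (1.0 + np.exp(lam[:, None] + lam[None, :]))
print("max constraint error:", np.max(np.abs(W @ mu - w_target)))
```

After the iteration the constraint error should be small, and the fitted W then has exactly the logistic form of (49) on the grid.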

4.3.2 Proof of Theorem 3.5

First note that Proposition 4.4 implies that the difference in expectation between H(W) and H(Ŵ) converges to zero faster than log(n)/n, so that for the purpose of Theorem 3.5 we can approximate W with Ŵ. Hence we are left to show that the rescaled entropy nσ[Ŵ, µ_n]/log(n) converges to ν. By Lemma 4.2 the integration over all regimes except [0, R_n]² goes to zero faster than log(n)/n, and therefore we only need to consider the integration over [0, R_n]². The main idea for the rest of the proof is that in this range
\[
H\!\left(\widehat W(x,y)\right) \approx (x+y)\,\widehat W(x,y). \tag{152}
\]
Let us first compute the integral over [0, R_n]² of the right-hand side of the above equation:
\[
\begin{aligned}
\int_0^{R_n}\!\!\int_0^{R_n} (x+y)\,\widehat W(x,y)\, d\mu_n(y)\, d\mu_n(x)
&= \alpha^2 e^{-2\alpha R_n} \int_0^{R_n}\!\!\int_0^{R_n} (x+y)\, e^{(\alpha-1)(x+y)}\, dy\, dx \\
&= \frac{2 e^{-2(1+\alpha)R_n}}{\beta^2(\alpha-1)}\left((\alpha-1)R_n e^{2\alpha R_n} - e^{2\alpha R_n} + 2e^{(\alpha+1)R_n} - (\alpha-1)R_n e^{(\alpha+1)R_n} - e^{2R_n}\right) \\
&= \frac{2R_n e^{-2R_n}}{\beta^2} + O\!\left(n^{-1}\right) = \nu n^{-1}\log(n) + O\!\left(n^{-1}\right),
\end{aligned} \tag{153}
\]

which implies that
\[
\lim_{n\to\infty} \frac{n}{\log(n)} \int_0^{R_n}\!\!\int_0^{R_n} (x+y)\,\widehat W(x,y)\, d\mu_n(y)\, d\mu_n(x) = \nu. \tag{154}
\]
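The limit (154) can also be illustrated by direct numerical integration. As in the earlier sketch, the exponential density α e^{α(x−R_n)} for µ_n and R_n = ½ log(n/(β²ν)) with β = 1 − 1/α are assumed model parametrizations, not derived in this section; the convergence is slow, of order 1/log(n).

```python
# Numerical illustration of (154): the rescaled integral of
# (x + y) * What(x, y) = (x + y) * exp(-(x + y)) over [0, R_n]^2 against mu_n x mu_n
# should approach nu. Assumes d(mu_n)(x) = alpha * exp(alpha * (x - R_n)) dx and
# R_n = 0.5 * log(n / (beta**2 * nu)) with beta = 1 - 1/alpha.
import numpy as np

def rescaled_integral(n, alpha=2.5, nu=4.0, grid=1000):
    beta = 1.0 - 1.0 / alpha
    R = 0.5 * np.log(n / (beta**2 * nu))
    x = (np.arange(grid) + 0.5) * R / grid            # midpoint rule on [0, R]
    dens = alpha * np.exp(alpha * (x - R)) * (R / grid)
    s = x[:, None] + x[None, :]
    integrand = s * np.exp(-s)                        # (x + y) * What(x, y)
    val = np.sum(integrand * dens[:, None] * dens[None, :])
    return n * val / np.log(n)

for n in [10**3, 10**6, 10**9, 10**12]:
    print(n, rescaled_integral(n))                    # slowly approaches nu = 4
```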

Next we show that
\[
n \int_0^{R_n}\!\!\int_0^{R_n} \left| H\!\left(\widehat W(x,y)\right) - (x+y)\,\widehat W(x,y) \right| d\mu_n(y)\, d\mu_n(x) = O\!\left(n^{-\frac{\alpha-1}{2}}\right), \tag{155}
\]

which, together with (154), gives
\[
\lim_{n\to\infty} \frac{n}{\log(n)} \int_0^{R_n}\!\!\int_0^{R_n} H\!\left(\widehat W(x,y)\right) d\mu_n(y)\, d\mu_n(x) = \nu. \tag{156}
\]

We compute that
\[
H\!\left(\widehat W(x,y)\right) = e^{-(x+y)}(x+y) - \left(1-e^{-(x+y)}\right)\log\!\left(\frac{e^{x+y}-1}{e^{x+y}}\right)
= (x+y) - \left(1-e^{-(x+y)}\right)\log\!\left(e^{x+y}-1\right), \tag{157}
\]
and hence we get
\[
H\!\left(\widehat W(x,y)\right) - (x+y)\,\widehat W(x,y)
= \left(1-e^{-(x+y)}\right)\left(x+y-\log\!\left(e^{x+y}-1\right)\right)
= -\left(1-e^{-(x+y)}\right)\log\!\left(1-e^{-(x+y)}\right), \tag{158}
\]
which is of the form
\[
-\left(1-e^{-z}\right)\log\!\left(1-e^{-z}\right). \tag{159}
\]
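The identity (158) is elementary but easy to mistype; a short numerical check of the algebra (purely illustrative) is:

```python
# Check of the identity (158): H(e^{-z}) - z * e^{-z} == -(1 - e^{-z}) * log(1 - e^{-z}).
import numpy as np

z = np.linspace(0.1, 20.0, 1000)
p = np.exp(-z)
H = -p * np.log(p) - (1.0 - p) * np.log(1.0 - p)     # Bernoulli entropy H(p)
lhs = H - z * p                                       # H(What) - (x + y) * What, with z = x + y
rhs = -(1.0 - p) * np.log(1.0 - p)
print(np.max(np.abs(lhs - rhs)))                      # numerically zero, up to float error
```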

We proceed by splitting the integration over the two regimes [0, δR_n] and (δR_n, R_n], for suitable δ, and using appropriate bounds for (159). For this we note that
\[
-\log\!\left(1-e^{-z}\right)\left(1-e^{-z}\right) \le 1 \quad \text{for all } z \ge 0, \tag{160}
\]
\[
-\log\!\left(1-e^{-z}\right)\left(1-e^{-z}\right) \le e^{-2z} \quad \text{for large enough } z. \tag{161}
\]

Now take δ = β/2 and let N be such that δR_N is large enough for (161) to hold. Then we have, for all n ≥ N, using (160),
\[
\begin{aligned}
n \int_0^{\delta R_n}\!\!\int_0^{\delta R_n} \left| H\!\left(\widehat W(x,y)\right) - (x+y)\,\widehat W(x,y) \right| d\mu_n(y)\, d\mu_n(x)
&\le n \int_0^{\delta R_n}\!\!\int_0^{\delta R_n} d\mu_n(y)\, d\mu_n(x) = n\left(\int_0^{\delta R_n} d\mu_n(x)\right)^2 \\
&= n e^{-2\alpha R_n}\left(e^{\alpha\delta R_n} - 1\right)^2 = O\!\left(n^{1-\alpha+\alpha\delta}\right) = O\!\left(n^{-(\alpha-1)/2}\right).
\end{aligned} \tag{162}
\]

Since for the other parts of the integral we have x + y ≥ δR_n, we can use (161). In this way we obtain for the mixed regimes
\[
\begin{aligned}
n \int_0^{\delta R_n}\!\!\int_{\delta R_n}^{R_n} \left| H\!\left(e^{-(x+y)}\right) - (x+y)\,e^{-(x+y)} \right| d\mu_n(y)\, d\mu_n(x)
&\le n \int_0^{\delta R_n}\!\!\int_{\delta R_n}^{R_n} e^{-2(x+y)}\, d\mu_n(y)\, d\mu_n(x)
= n\left(\int_0^{\delta R_n} e^{-2x}\, d\mu_n(x)\right)\left(\int_{\delta R_n}^{R_n} e^{-2y}\, d\mu_n(y)\right) \\
&= \frac{\alpha^2 n e^{-2\alpha R_n}}{(2-\alpha)^2}\left(e^{-(2-\alpha)\delta R_n} - 1\right)\left(e^{-(2-\alpha)R_n} - e^{-(2-\alpha)\delta R_n}\right)
= O\!\left(n^{1-\alpha-\frac{(2-\alpha)\delta}{2}}\right),
\end{aligned} \tag{163}
\]

while for the square (δR_n, R_n]² we have
\[
\begin{aligned}
n \int_{\delta R_n}^{R_n}\!\!\int_{\delta R_n}^{R_n} \left| H\!\left(e^{-(x+y)}\right) - (x+y)\,e^{-(x+y)} \right| d\mu_n(y)\, d\mu_n(x)
&\le n \int_{\delta R_n}^{R_n}\!\!\int_{\delta R_n}^{R_n} e^{-2(x+y)}\, d\mu_n(y)\, d\mu_n(x)
= n\left(\int_{\delta R_n}^{R_n} e^{-2x}\, d\mu_n(x)\right)^2 \\
&= \frac{\alpha^2 n e^{-2\alpha R_n}}{(2-\alpha)^2}\left(e^{-(2-\alpha)R_n} - e^{-(2-\alpha)\delta R_n}\right)^2
= O\!\left(n e^{-2\alpha R_n} e^{-2(2-\alpha)\delta R_n}\right) = O\!\left(n^{1-\alpha-(2-\alpha)\delta}\right).
\end{aligned} \tag{164}
\]

Combining these results and noting that, for all α > 1,
\[
1 - \alpha - \frac{(2-\alpha)\delta}{2} \le -\frac{\alpha-1}{2}, \tag{165}
\]

yields (155). We now have
\[
\sigma[\widehat W, \mu_n] = \int_{-\infty}^{R_n}\!\!\int_{-\infty}^{R_n} H\!\left(\widehat W(x,y)\right) d\mu_n(y)\, d\mu_n(x)
= \int_0^{R_n}\!\!\int_0^{R_n} H\!\left(\widehat W(x,y)\right) d\mu_n(y)\, d\mu_n(x) + O\!\left(n^{-\alpha} + \log(n)\, n^{-\frac{\alpha+1}{2}}\right), \tag{166}
\]
and hence, using (156),
\[
\lim_{n\to\infty} \frac{n\,\sigma[\widehat W, \mu_n]}{\log(n)} = \nu. \tag{167}
\]

4.4 Proof of Theorem 3.6

In this section we first formalize the strategy behind the proof of Theorem 3.6, briefly discussed in Section 3.6. This strategy relies on partitioning the interval A_n into non-overlapping subintervals. We then construct a specific partition satisfying certain requirements, and finish the proof of the theorem.

4.4.1 Averaging W by a partition of A_n

We follow the strategy from [38]. First recall that for a graph G_n generated by the HSCM,
\[
S[G_n] \ge \mathbb{E}\left[S[G_n \mid x_n]\right] = \binom{n}{2} \mathbb{E}\left[H(W(x_1, x_2))\right] = \binom{n}{2} \sigma[G_n], \tag{168}
\]
and hence
\[
S^*[G_n] \ge \sigma[G_n], \tag{169}
\]
where $S^*[G_n] = S[G_n]/\binom{n}{2}$ denotes the normalized Gibbs entropy. Therefore, the key ingredient is to find a matching upper bound. For this we partition the range A_n = (−∞, R_n] of our probability measure µ_n into intervals, and approximate W(x, y) by its average over the box in which x and y lie.

To be more precise, let m_n be any increasing sequence of positive integers, let $\{\rho_t\}_{0 \le t \le m_n}$ be such that
\[
\rho_0 = -\infty \quad \text{and} \quad \rho_{m_n} = R_n, \tag{170}
\]
and consider the partition of (−∞, R_n] given by
\[
I_t = (\rho_{t-1}, \rho_t] \quad \text{for } 1 \le t \le m_n. \tag{171}
\]

Now define J_n(x) = t ⟺ x ∈ I_t, and let M_i be the random variable J_n(X_i), where X_i is distributed according to µ_n, for any vertex i. The value of M_i equal to t indicates that vertex i happens to lie within the interval I_t. Denoting the vector of these random variables by M_n = (M_1, …, M_n), and their entropy by S[M_n], we have that
\[
S[G_n] \le S[M_n] + S[G_n \mid M_n] \le n S[M] + \binom{n}{2} \int_{-\infty}^{R_n}\!\!\int_{-\infty}^{R_n} H\!\left(\widetilde W(x,y)\right) d\mu_n(y)\, d\mu_n(x), \tag{172}
\]
where $\widetilde W(x,y)$ is the average of W over the square $I_{J_n(x)} \times I_{J_n(y)}$. That is,
\[
\widetilde W_n(x,y) = \frac{1}{\mu_{x,y}} \int_{I_{J_n(x)}} \int_{I_{J_n(y)}} W(u,v)\, d\mu_n(v)\, d\mu_n(u), \tag{173}
\]
with
\[
\mu_{x,y} = \int_{I_{J_n(x)}} \int_{I_{J_n(y)}} d\mu_n(v)\, d\mu_n(u), \tag{174}
\]

the measure of the box to which (x, y) belongs.

The first step in our proof is to investigate how well $\widetilde W_n$ approximates W. More specifically, we want to understand how the difference $|\sigma[W, \mu_n] - \sigma[\widetilde W_n, \mu_n]|$ scales, depending on the specific partition. Note that for any partition $\rho_t$ of A_n into m_n intervals we have that
\[
S[M] = -\sum_{t=1}^{m_n} \mathbb{P}(M = t)\, \log\!\left(\mathbb{P}(M = t)\right) \le \log(m_n), \tag{175}
\]
where the upper bound is achieved on the partition that is uniform according to the measure µ_n. Since
\[
\frac{n}{\log(n)}\,\frac{n S[M]}{\binom{n}{2}} = \frac{2 S[M]}{\log(n)(1 - 1/n)}, \tag{176}
\]
it is enough to find a partition $\rho_t$, with log(m_n) = o(log(n)), such that
\[
\lim_{n\to\infty} \frac{n}{\log(n)}\left|\sigma[W, \mu_n] - \sigma[\widetilde W_n, \mu_n]\right| = 0. \tag{177}
\]
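The following sketch illustrates the averaging step on a discretized example: it block-averages a graphon over a partition of its domain, as in (173), and shows that the resulting graphon-entropy difference entering (177) shrinks as the partition is refined. The particular graphon, the uniform measure, and the partition sizes are illustrative choices only.

```python
# Illustration of the averaging step: block-average a graphon over a partition of
# its domain (as in (173)) and compare sigma[W] with sigma[W_tilde]. Discretized,
# with illustrative choices of graphon and (uniform) measure.
import numpy as np

def H(p):
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return -p * np.log(p) - (1.0 - p) * np.log(1.0 - p)

m = 1200                                              # fine grid approximating the domain
x = np.linspace(0.0, 6.0, m)
W = 1.0 / (np.exp(x[:, None] + x[None, :]) + 1.0)     # illustrative graphon
mu = np.full(m, 1.0 / m)                              # uniform probability measure (illustrative)

def sigma(Wmat):
    return np.sum(H(Wmat) * mu[:, None] * mu[None, :])

for blocks in [4, 16, 64]:
    labels = np.minimum((np.arange(m) * blocks) // m, blocks - 1)   # partition into equal cells
    W_tilde = np.empty_like(W)
    for s in range(blocks):
        for t in range(blocks):
            box = np.ix_(labels == s, labels == t)
            W_tilde[box] = W[box].mean()              # average of W over the box, as in (173)
    print(blocks, abs(sigma(W) - sigma(W_tilde)))     # difference shrinks as the partition refines
```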

This then proves Theorem 3.6, since
\[
\begin{aligned}
\lim_{n\to\infty} \frac{n}{\log(n)}\left|S^*[G_n] - \sigma[G_n]\right| &= \lim_{n\to\infty} \frac{n}{\log(n)}\left(S^*[G_n] - \sigma[G_n]\right) \\
&\le \lim_{n\to\infty} \left(\frac{n}{\log(n)}\,\frac{n S[M]}{\binom{n}{2}} + \frac{n}{\log(n)}\left|\sigma[W,\mu_n] - \sigma[\widetilde W_n, \mu_n]\right|\right) \\
&\le \lim_{n\to\infty} \left(\frac{2\log(m_n)}{\log(n)(1 - 1/n)} + \frac{n}{\log(n)}\left|\sigma[W,\mu_n] - \sigma[\widetilde W_n, \mu_n]\right|\right) = 0.
\end{aligned} \tag{178}
\]

4.4.2 Constructing the partition

We will take I_1 = (−∞, −R_n] and partition the remaining interval [−R_n, R_n] into log(n)² equal parts. To this end, let n be sufficiently large so that n ≥ νβ², take m_n = ⌈log(n)²⌉ + 1, and define the partition $\rho_t$ by
\[
\rho_0 = -\infty, \quad \rho_1 = -R_n, \quad \text{and} \quad \rho_t = \rho_{t-1} + \frac{2R_n}{m_n - 1} \quad \text{for all } t = 2, \dots, m_n. \tag{179}
\]
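For concreteness, here is a direct transcription of this partition; R_n is treated as a given input, since its definition belongs to the model setup rather than to this construction.

```python
# Direct transcription of the partition (179): I_1 = (-inf, -R_n], and
# [-R_n, R_n] split into m_n - 1 equal cells, with m_n = ceil(log(n)^2) + 1.
# R_n is taken as an input here (its definition belongs to the model setup).
import math

def partition(n, R_n):
    m_n = math.ceil(math.log(n) ** 2) + 1
    rho = [-math.inf, -R_n]
    step = 2.0 * R_n / (m_n - 1)
    for t in range(2, m_n + 1):
        rho.append(rho[-1] + step)
    return rho                          # rho[0..m_n], giving m_n intervals I_t = (rho[t-1], rho[t]]

n = 10**6
rho = partition(n, R_n=5.0)
m_n = len(rho) - 1
print(m_n, math.log(m_n), math.log(math.log(n)))   # log(m_n) grows like log log n, i.e. o(log n)
```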

Note that log(m_n) = O(log log n) = o(log(n)), so that all that is left is to prove (177). In addition,
\[
\frac{n}{\log(n)} \int_{-\infty}^{-R_n}\!\!\int_{-\infty}^{R_n} H\!\left(\widetilde W(x,y)\right) d\mu_n(y)\, d\mu_n(x) = O\!\left(\log(n)^{-1} n^{1-\alpha}\right),
\]
and the same holds if we replace W̃ with W. Hence it follows that in order to establish (177) we only need to consider the integral over the square [−R_n, R_n] × [−R_n, R_n]. That is, we need to show
\[
\lim_{n\to\infty} \frac{n}{\log(n)} \int_{-R_n}^{R_n}\!\!\int_{-R_n}^{R_n} \left| H(W(x,y)) - H\!\left(\widetilde W(x,y)\right)\right| d\mu_n(y)\, d\mu_n(x) = 0. \tag{180}
\]

For this we compare σ[W, µ_n] and σ[W̃, µ_n] based on the mean value theorem, which states that for any a ≤ b there exists c ∈ [a, b] such that
\[
H(b) - H(a) = H'(c)\,(b - a). \tag{181}
\]
Here |H'(c)| = |log(c) − log(1 − c)|, and due to the symmetry of H we get, for any 0 < a ≤ c ≤ b < 1,
\[
|H'(c)| \le \frac{\min\{H(a), H(b)\}}{\min\{a, b, 1-a, 1-b\}}. \tag{182}
\]
Note that 0