Thursday, December 13, 2018

Ising Perceptron under Gaussian Disorder, and k-NAE-SAT

Blog Post By: Patrick Guo, Vinh-Kha Le, Shyam Narayanan, and David Stoner

Methods in statistical physics are known to be extremely useful for understanding certain problems in theoretical computer science. Physical observations can motivate the underlying theoretical models, which in turn explain some of the physical phenomena.

This post is based on Professor Nike Sun’s guest lecture on the Ising Perceptron model and regular NAE-SAT for CS 229R: Physics and Computation. These are both examples of random constraint satisfaction problems, where we have n variables x_1, ..., x_n \in \{-1, 1\} and certain relations, or constraints, between the x_i, and we wish to approximate the number of solutions or visualize the geometry of the solutions. For both problems, the problem instance can be random: for example, the linear constraints in the Ising perceptron model are random, and the clauses in the NAE-SAT instance are chosen at random. As in the previous blog posts, to understand the geometry of solutions, statistical physicists think of sampling random solutions from \{-1, 1\}^n, which introduces a second type of randomness.

This post is meant to be expository. For interested readers, we point to useful references at the end for more rigorous treatments of these topics.

1. Perceptron Model

The Ising Perceptron under Gaussian disorder asks how many points in \{\pm 1\}^N survive M random half-space cuts (i.e. are satisfying assignments for M constraints), where the hyperplanes’ coefficients are drawn from standard Gaussians. Like many problems in statistical physics, there is likely a critical capacity of constraints where having more constraints yields no survivors with high probability, and having fewer constraints yields survivors with high probability. This lecture gives an overview of a proof for one side of this physics prediction, i.e. a lower bound on the critical capacity, below which there are survivors with positive probability. Briefly, we use the cavity method to approximate the distribution of the number of satisfying assignments, then attempt to use the second moment method on that distribution to get our lower bound. A direct application fails, however, due to the variance of the Gaussian constraints. The solution is to carefully choose exactly what to condition on without destroying the model. Specifically, we iteratively compute values related to the Gaussian disorder, after which we are able to remove enough variance for the second moment method to work and thus establish the lower bound for the Ising Perceptron’s critical capacity. The result holds subject to an analytical condition which is detailed in the paper (Condition 1.2 in [6]) and which remains to be rigorously verified.

1.1. Problem

We pick a random direction in \mathbb{R}^N, and delete all vertices of the hypercube \{\pm 1\}^N which lie in the half-space negatively correlated with that direction. We repeat this process of picking a random half-space and deleting points M times, and see if any points in the hypercube survive. Formally, we define the perceptron model as follows:

Definition 1: Let G = (g_{\mu i})_{\mu\ge 1, i\ge 1} be an array of i.i.d. standard Gaussians. Let M be the largest integer such that:

\left\{J\in \{\pm 1\}^N:\sum_{i=1}^N\frac{g_{\mu{}i}J_i}{\sqrt{N}}\ge 0 \hspace{0.1cm} \forall \mu\le M\right\}\neq \emptyset.

More compactly, M is the largest integer such that

\left\{J\in \{\pm 1\}^N:\frac{GJ}{\sqrt{N}}\ge 0\right\}

is nonempty, where the inequality is taken pointwise, G \in \mathbb{R}^{M \times N} is an array of i.i.d. std. Gaussians, and J is treated as a vector in \{\pm 1\}^N \subset \mathbb{R}^N.
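For intuition, the definition can be checked by brute force when N is small. The sketch below is our own illustration (the sizes and random seed are arbitrary): it draws a Gaussian array G and reports the largest M for which some J survives the first M constraints.

```python
import itertools
import numpy as np

def has_survivor(G):
    """Return True if some J in {-1,+1}^N satisfies G @ J / sqrt(N) >= 0 entrywise."""
    _, N = G.shape
    return any(
        np.all(G @ np.array(J) / np.sqrt(N) >= 0)
        for J in itertools.product([-1, 1], repeat=N)
    )

rng = np.random.default_rng(0)
N = 12                                    # brute force is only feasible for small N
G_full = rng.standard_normal((4 * N, N))  # more rows than we expect to need

M = 0                                     # largest number of constraints with a survivor
while M < G_full.shape[0] and has_survivor(G_full[: M + 1]):
    M += 1
print(f"N = {N}: largest M with a survivor is {M}, so M/N = {M / N:.2f}")
```

At these small sizes the ratio M/N fluctuates considerably from sample to sample; the conjectured convergence of M/N discussed below is an N \to \infty statement.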

[Figure: points of the hypercube \{\pm 1\}^N surviving successive random half-space cuts]

In the late 80’s, physicists conjectured that there is a critical capacity \alpha_{*} such that \frac{M}{N} \overset{P}{\to} \alpha_{*} as N \to \infty (recall that M is a function of N). The predicted critical capacity has been studied, for example, in [7,9]. Our goal is to establish a lower bound on \alpha_* for the Ising Perceptron under Gaussian disorder. To do this, let Z(G) denote the random variable counting how many vectors J \in \{\pm 1\}^N survive the M Gaussian constraints in G (this is the partition function). We want to show Z > 0 with high probability when there are M < \alpha_{*}N constraints. To this end, the second moment method seems promising:

Second Moment Method: If Z is a nonnegative random variable with finite variance, then

P(Z > 0) \ge \frac{\left(\mathbb{E}[Z]\right)^2}{\mathbb{E}[Z^2]}
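The inequality follows from a single application of Cauchy-Schwarz; we record the short derivation here, since the rest of the argument turns on when the resulting bound is nontrivial:

\mathbb{E}[Z] = \mathbb{E}\left[Z \cdot \mathbf{1}_{\{Z > 0\}}\right] \le \sqrt{\mathbb{E}[Z^2]\, \mathbb{E}\left[\mathbf{1}_{\{Z > 0\}}\right]} = \sqrt{\mathbb{E}[Z^2]\, P(Z > 0)},

and squaring both sides and dividing by \mathbb{E}[Z^2] gives the stated bound. The bound is useful only when \mathbb{E}[Z^2] is comparable to \left(\mathbb{E}[Z]\right)^2.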

However, this method actually fails due to various sources of variance in the perceptron model. We will briefly sketch the fix as given in [6].

Before we can start, however, what does the distribution of Z even look like? Pinning this down directly is quite difficult; we will use the cavity equations, a technique developed in [12,13], to approximate Z’s distribution.

1.2. Cavity Method

The goal of the cavity method is to understand how the solution space changes as we remove a row or a column of the matrix G of i.i.d. standard Gaussian entries, with G regarded as fixed. Since the full matrix G is large and difficult to deal with, we instead track how the solution space changes as we add one row or one column at a time. Why is this valuable? We can think of the system of variables and constraints as an interaction between the rows (constraints) and columns (variables), so the number of solutions should factor approximately into a product of contributions attributed to each variable and each constraint. With this as motivation, define G_{-\mu} as the matrix obtained by removing row \mu and G^{-i} as the matrix obtained by removing column i. We can approximate

Z(G) \approx \prod\limits_{\mu = 1}^{M} \frac{Z(G)}{Z(G_{-\mu})} \cdot \prod\limits_{i = 1}^{N} \frac{Z(G)}{Z(G^{-i})}

since we can think of the partition function as receiving a multiplicative factor from each addition of a row and each addition of a column. Thus, the cavity method seeks to compute Z(G)/Z(G_{-\mu}) and Z(G)/Z(G^{-i}).
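For very small instances, this approximation can be tested against brute-force enumeration. The following sketch is our own check with arbitrary small sizes; it assumes all the partition functions involved are nonzero, which holds with overwhelming probability at these parameters.

```python
import itertools
import numpy as np

def Z(G, N_norm):
    """Brute-force partition function: count J in {-1,+1}^(number of columns of G)
    with G @ J / sqrt(N_norm) >= 0 entrywise."""
    _, N = G.shape
    return sum(
        np.all(G @ np.array(J) / np.sqrt(N_norm) >= 0)
        for J in itertools.product([-1, 1], repeat=N)
    )

rng = np.random.default_rng(1)
M, N = 3, 10
G = rng.standard_normal((M, N))

log_Z = np.log(Z(G, N))
# log of the cavity product: sum over removed rows and removed columns of log(Z(G)/Z(cavity)).
cavity_product = (
    sum(log_Z - np.log(Z(np.delete(G, mu, axis=0), N)) for mu in range(M))
    + sum(log_Z - np.log(Z(np.delete(G, i, axis=1), N)) for i in range(N))
)
print(f"log Z(G) = {log_Z:.3f}   cavity product (log) = {cavity_product:.3f}")
```

At such tiny sizes the agreement is only rough; the approximation is meant at the exponential scale as N \to \infty.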

1.2.1. Removing a constraint

Our goal in computing Z(G)/Z(G_{-\mu}) is to understand how the solution space SOL(G_{-\mu}) changes when we add in the constraint \mu. Recalling that J \in \{\pm 1\}^N is our vector that is potentially in the solution space, if we define

\Delta_\mu := \sum_{i = 1}^{N}\frac{g_{\mu{}i}J_i}{\sqrt{N}},

then adding the constraint \mu is equivalent to forcing \Delta_\mu \ge 0. We will try to understand the distribution of \Delta_\mu under the Gibbs measure \nu_{-\mu}, which is the uniform measure on the solution space without the constraint \mu. This way, we can determine the probability of \Delta_\mu \ge 0 under the Gibbs measure, which equals Z(G)/Z(G_{-\mu}). To do so, we will use the Replica Symmetric Cavity Assumption (which we will abbreviate as RS cavity assumption). The RS cavity assumption lets us treat the J_i’s for 1 \le i \le N as independent under \nu_{-\mu}. Since the J_i’s live on a discrete sample space shaped by the constraints, they are not truly independent, but the RS cavity assumption asserts that the dependence between them is weak enough that we may pretend they are.

Note that \Delta_\mu should be approximately normally distributed, since it is the sum of many “independent” terms g_{\mu i} J_i by the RS Cavity assumption. Now, define h_\mu as the expectation of \Delta_\mu under the Gibbs measure without the constraint \mu, and v_\mu as the variance of \Delta_\mu under the Gibbs measure without the constraint \mu. It will also turn out that the variance v_\mu will concentrate around a constant \sigma^2. Thus,

\frac{Z(G)}{Z(G_{-\mu})} = \nu_{-\mu}(\Delta_{\mu} \ge 0) \approx \overline{\Phi}\left(-\frac{h_\mu}{\sigma}\right).

Here, \overline{\Phi}\left(-\frac{h_\mu}{\sigma}\right) is the probability that a \mathcal{N}(h_\mu, \sigma^2) random variable is nonnegative, or equivalently, the probability that a standard Gaussian \mathcal{N}(0, 1) random variable exceeds -\frac{h_\mu}{\sigma}.
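This Gaussian approximation can be sanity-checked by brute force on a small instance. The sketch below is our own experiment (scipy’s norm.sf plays the role of \overline{\Phi}, and the sizes are arbitrary): it enumerates SOL(G_{-\mu}), computes \Delta_\mu over it, and compares the exact fraction with \Delta_\mu \ge 0 against \overline{\Phi}(-h_\mu/\sqrt{v_\mu}).

```python
import itertools
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
M, N = 4, 12
G = rng.standard_normal((M, N))
mu = 0                                   # index of the constraint we remove
G_minus = np.delete(G, mu, axis=0)

# Enumerate SOL(G_{-mu}) and the values of Delta_mu on it.
# (Assumes SOL(G_{-mu}) is nonempty, which holds with overwhelming probability here.)
sols = [np.array(J) for J in itertools.product([-1, 1], repeat=N)
        if np.all(G_minus @ np.array(J) / np.sqrt(N) >= 0)]
deltas = np.array([G[mu] @ J / np.sqrt(N) for J in sols])

exact = np.mean(deltas >= 0)             # equals Z(G) / Z(G_{-mu})
h_mu, v_mu = deltas.mean(), deltas.var()
approx = norm.sf(-h_mu / np.sqrt(v_mu))  # Phi-bar(-h_mu / sigma) with sigma^2 taken to be v_mu

print(f"exact nu_(-mu)(Delta_mu >= 0): {exact:.3f}")
print(f"Gaussian (RS) approximation:   {approx:.3f}")
```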

1.2.2. Removing a spin

To calculate the cavity equation for removing one column, we think of removing a column as deleting one spin from J. Define J^{-i} as the vector resulting from removing spin i from J. We want to calculate Z(G)/Z(G^{-i}). Note that it is possible to have J \in SOL(G) even though J^{-i} \not\in SOL(G^{-i}), which complicates this calculation. It is possible to overcome this difficulty by passing to positive temperature, though this makes the calculations considerably more involved. We do not worry about these issues here, and just briefly sketch how Z(G)/Z(G^{-i}) is computed.

To compute Z(G)/Z(G^{-i}), we will split the numerator into a sum of two terms based on the sign of J_i, which will allow us to compute not only Z(G) but also \langle J_i \rangle, which represents the magnetization at spin i. A series of complicated calculations (see the lecture notes for the details) will give us

\frac{Z(G)}{Z(G^{-i})} = \frac{\exp(H_i)+\exp(-H_i)}{\exp(c)}

for some constant c, where H_i is a quantity that compares how much more correlated spin i is to the constraints than all other spins. The \exp(+H_i) will come from the solutions for G with J_i = 1 and the \exp(-H_i) will come from the solutions with J_i = -1.
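A one-line consequence of this split is worth recording, since the quantity \tanh(H_i) reappears in the fixed-point equations of Section 1.4. Writing Z_{\pm} for the contributions to Z(G) from J_i = \pm 1, the heuristic above gives Z_{\pm} \approx Z(G^{-i}) e^{\pm H_i - c}, so the magnetization is

\langle J_i \rangle = \frac{Z_+ - Z_-}{Z_+ + Z_-} \approx \frac{e^{H_i} - e^{-H_i}}{e^{H_i} + e^{-H_i}} = \tanh(H_i).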

The above equations allow us to deduce that

Z(G) \approx \prod\limits_{\mu = 1}^{M} \overline{\Phi}\left(-\frac{h_\mu}{\sigma}\right) \cdot \prod\limits_{i = 1}^{N} \frac{\exp(H_i) + \exp(-H_i)}{\exp(c)}.

1.3. The Randomness of G

In the previous section, we regarded G as fixed. We now use these results but allow G to be random again. Recall that G’s entries are i.i.d. standard Gaussians, so h_\mu, as a linear combination of the g_{\mu i}’s, is itself Gaussian; we write h_\mu \sim \mathcal{N}(0, q). Similarly, H_i is a linear combination of the g_{\mu i}’s and hence Gaussian; we write H_i \sim \mathcal{N}(0, \psi).

With some calculations detailed in the lecture notes and [12], we get Gardner’s Formula:

\frac{\ln Z}{N} \rightarrow \alpha \int \ln \overline{\Phi}\left(-\frac{\sqrt{q}\, z}{\sigma}\right) \varphi(z)\, dz + \int \ln \left(2 \cosh(\sqrt{\psi}\, z)\right)\varphi(z)\, dz

where \alpha is our capacity \frac{M}{N}. It turns out that \sigma^2 = 1-q and c = \psi(1-q)/2, so the above expression depends on only two parameters, q and \psi. It will also turn out that q and \psi satisfy coupled relations coming from our definitions of h_\mu and H_i, and these give a fixed point equation for (q, \psi), which has two solutions: one corresponding to \alpha \approx 1 and one to \alpha \approx 0.83. It is believed that the second solution is correct, meaning that the critical capacity should be \alpha_* \approx 0.83.
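As a concrete illustration, the right-hand side of Gardner’s formula can be evaluated numerically for given (\alpha, q, \psi) by one-dimensional Gaussian integration. The sketch below is our own and simply evaluates the displayed expression with \sigma^2 = 1 - q; the parameter values in the example call are arbitrary placeholders, not solutions of the fixed point equations.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def gardner_rs(alpha, q, psi):
    """Numerically evaluate the displayed replica-symmetric formula with sigma^2 = 1 - q."""
    sigma = np.sqrt(1.0 - q)
    phi = norm.pdf  # standard Gaussian density varphi(z)

    constraint_term = quad(
        lambda z: np.log(norm.sf(-np.sqrt(q) * z / sigma)) * phi(z), -8, 8
    )[0]
    spin_term = quad(
        lambda z: np.log(2.0 * np.cosh(np.sqrt(psi) * z)) * phi(z), -8, 8
    )[0]
    return alpha * constraint_term + spin_term

# Arbitrary illustrative parameters (NOT the fixed point values discussed in the text).
print(gardner_rs(alpha=0.8, q=0.5, psi=1.0))
```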

1.4. Second Moment Method

Now that we have Gardner’s formula, solving for Z gives us an approximation for its distribution as

Z \sim \exp\{N\mathcal{G}(\alpha) + \mathcal{N}(0, N\varepsilon^2)\}

where \mathcal{G}(\alpha) is Gardner’s formula, and the Gaussian noise \mathcal{N}(0,N\varepsilon^2) arises from the Gaussian distribution of the h_\mu’s and H_i’s. Again, we are interested in the probability that Z > 0, but due to the approximate nature of the above equation, we cannot work with this distribution directly, and instead can try the second moment method. However, the exponentiated Gaussian makes \mathbb{E}[Z^2] \gg \mathbb{E}[Z]^2, so the second moment method only yields an exponentially small lower bound on P(Z>0), whereas we need P(Z>0) to be large.
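To see concretely why the exponentiated Gaussian is fatal, take the displayed approximation at face value: if X \sim \mathcal{N}(0, N\varepsilon^2) and Z = e^{N\mathcal{G}(\alpha) + X}, then the Gaussian moment generating function gives

\mathbb{E}[Z] = e^{N\mathcal{G}(\alpha) + N\varepsilon^2/2}, \qquad \mathbb{E}[Z^2] = e^{2N\mathcal{G}(\alpha) + 2N\varepsilon^2},

so \mathbb{E}[Z^2]/\left(\mathbb{E}[Z]\right)^2 = e^{N\varepsilon^2} grows exponentially in N, and the second moment bound on P(Z>0) is exponentially small.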

The fault here lies with the Gaussian noise due to the h_\mu’s and H_i’s, so it is natural to consider what happens when we condition on them, getting rid of the noise. We denote

\underline{h} = (h_\mu)_{\mu = 1}^M \text{ and } \underline{H} = (H_i)_{i=1}^N.

Then, we can compute that

\frac{1}{N}\log \mathbb{E}[Z | \underline{H},\underline{h}] \to \mathcal{G}(\alpha)

in probability as N\to \infty. (This is not an exact equality because of the approximations made in the derivation of Gardner’s formula.) Moreover, we will have \mathbb{E}[Z^2|\underline{H},\underline{h}] \approx \mathbb{E}[Z|\underline{H},\underline{h}]^2, so the second moment method gives us the desired lower bound on P(Z>0). However, note that these \underline{H},\underline{h} must satisfy the equations that defined them. In vector form, we have

\frac{1}{\sqrt{N}}G\tanh \underline{H} = \underline{h} + b_*F(\underline{h})
\frac{1}{\sqrt{N}}G^T F(\underline{h}) = \underline{H} + d_*\tanh\underline{H}

where b_* and d_* are constants and F is a function, all of which appear in the rigorous definitions of h_{\mu} and H_i.
Here a problem arises, however. We cannot solve these equations easily, and furthermore, when we condition on \underline{H} and \underline{h}, the matrices G that satisfy the above are not necessarily representative of matrices of i.i.d. standard Gaussians. Hence, simply conditioning on knowing the values of \underline{h},\underline{H} destroys the model and will not prove the lower bound on P(Z>0) for Gaussian disorder.

To solve this problem, we introduce an iteration: first, initialize \underline{h}^{(0)} and \underline{H}^{(1)} to specific values detailed in [6]. Then, a simplified version of each time step’s update looks like

\underline{h}^{(t)} = \frac{G\tanh \underline{H}^{(t)}}{\sqrt{N}} - b_*F(\underline{h}^{(t - 1)}),
\underline{H}^{(t + 1)} = \frac{G^T F(\underline{h}^{(t)})}{\sqrt{N}} - d_*\tanh \underline{H}^{(t)}.

It has been proven (see [2,4]) that \underline{h}^{(t)} and \underline{H}^{(t)} converge as t\to\infty, and moreover the iterates look approximately Gaussian at each time step. Since this sequence of \underline{h}^{(t)},\underline{H}^{(t)} converges (at a rate independent of N) and remains representative of a Gaussian G, conditioning on these iteration values allows the second moment method to work for a special kind of truncated partition function \tilde Z and gives

\mathbb{E}[\tilde Z | \underline{h}^{(0)},\underline{H}^{(0)},\dots,\underline{h}^{(t)},\underline{H}^{(t)}]^2 \approx \mathbb{E}[\tilde Z^2 | \underline{h}^{(0)},\underline{H}^{(0)},\dots,\underline{h}^{(t)},\underline{H}^{(t)}]

which is then enough to establish the desired lower bound on P(Z>0) for the non-truncated partition function Z (see [6] for details, and [3] for closely related computations).
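To make the shape of this iteration concrete, here is a toy implementation of the simplified update displayed above. It is only a sketch: the function F and the constants b_* and d_* below are placeholders chosen for illustration, not the actual quantities defined in [6], and the initialization is likewise arbitrary.

```python
import numpy as np

def iterate_fields(G, F, b_star, d_star, T=30):
    """Run the simplified iteration from the text:
        h^(t)   = G tanh(H^(t)) / sqrt(N) - b_* F(h^(t-1))
        H^(t+1) = G^T F(h^(t)) / sqrt(N) - d_* tanh(H^(t)).
    Here F, b_star, d_star stand in for the quantities rigorously defined in [6]."""
    M, N = G.shape
    h = np.zeros(M)   # plays the role of h^(0)
    H = np.ones(N)    # plays the role of H^(1); [6] prescribes specific initial values
    for _ in range(T):
        h_new = G @ np.tanh(H) / np.sqrt(N) - b_star * F(h)
        H = G.T @ F(h_new) / np.sqrt(N) - d_star * np.tanh(H)
        h = h_new
    return h, H

rng = np.random.default_rng(3)
M, N = 80, 100
G = rng.standard_normal((M, N))
h, H = iterate_fields(G, F=np.tanh, b_star=0.3, d_star=0.3)
print("empirical std of h:", round(float(h.std()), 3), " empirical std of H:", round(float(H.std()), 3))
```

For the true F, b_*, and d_* from [6], the results of [2,4] guarantee that the iterates are asymptotically Gaussian at each step, which is exactly the structure the conditioning argument exploits.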

2. k-NAE-SAT

This lecture also gave a brief overview of results in k-NAESAT. In the k-SAT problem, we ask whether a boolean formula in CNF, with each clause containing exactly k literals, has a satisfying assignment. For k-NAESAT, we instead require that the satisfying assignment is not uniform on any clause; equivalently, each clause must contain at least one true and at least one false literal. Finally, we restrict our set of possible boolean formulae to those in which every variable is contained in exactly d clauses; we call this model d-regular k-NAESAT. We briefly outline the critical clause densities at which the solution space changes. For more background on the k-NAESAT problem and its variants, see [10].

As with the perceptron model, we are concerned with the number of solutions Z of this system when the d-regular formula is chosen uniformly at random. It turns out that for any fixed \alpha=\frac{d}{k}, the normalized solution count Z^{\frac{1}{n}} converges as n\to\infty to some f(\alpha). More on this model, and on the critical values of \alpha mentioned below, can be found in [15]. Via an application of Markov’s inequality and Jensen’s inequality for the convex function x^n, f(\alpha) may be bounded above by f^{RS}(\alpha)=(\mathbb{E} Z)^{\frac{1}{n}}, shown graphically below.

[Figure: f^{RS}(\alpha) plotted against \alpha, with the predicted phases of the solution space marked]
The diagram also shows the expected geometry of the solution space of a random d-regular k-NAESAT instance over different ranges of \alpha. Namely, the physics analysis suggests the following:

  • For 0<\alpha<\alpha_d, w.h.p. the solutions are concentrated in a single large cluster.
  • For \alpha_d<\alpha<\alpha_{\text{cond}}, w.h.p. the solutions are distributed among a large number of clusters.
  • For \alpha_{\text{cond}}<\alpha<\alpha_{\text{sat}}, the function f(\alpha) breaks away from f^{RS}(\alpha), and w.h.p. the solutions are concentrated in a small number of clusters.
  • For \alpha>\alpha_{\text{sat}}, w.h.p. there are no satisfying assignments.

One final note for this model: the solutions to satisfiability problems tend to be more clumpy than in the perceptron model, so correlation decay methods won’t immediately work. See [15] for how this is handled.
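To make the model concrete, here is a small brute-force sketch (our own, with arbitrary parameters): it samples a d-regular k-NAE-SAT instance via the configuration model, without rejecting degenerate clauses, and counts its NAE-satisfying assignments.

```python
import itertools
import random

def random_regular_nae_instance(n, d, k, rng):
    """d-regular k-NAE-SAT instance from the configuration model: each variable
    contributes d clause slots, the slots are shuffled into clauses of size k,
    and each literal receives a random negation sign."""
    assert (n * d) % k == 0
    stubs = [v for v in range(n) for _ in range(d)]
    rng.shuffle(stubs)
    return [
        [(v, rng.choice((1, -1))) for v in stubs[j : j + k]]
        for j in range(0, n * d, k)
    ]

def count_nae_solutions(n, clauses):
    """Brute force: an assignment is NAE-satisfying iff no clause is all-true or all-false."""
    count = 0
    for x in itertools.product([1, -1], repeat=n):
        if all(len({sign * x[var] for var, sign in clause}) == 2 for clause in clauses):
            count += 1
    return count

rng = random.Random(0)
n, d, k = 12, 3, 4            # alpha = d/k = 0.75
clauses = random_regular_nae_instance(n, d, k, rng)
Z = count_nae_solutions(n, clauses)
print(f"n={n}, d={d}, k={k}: Z = {Z}, Z^(1/n) = {Z ** (1 / n):.3f}")
```

At n this small, Z^{1/n} fluctuates noticeably from instance to instance, so the comparison with f^{RS}(\alpha) is only meaningful as n grows.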

References

  1. N. Bansal. Constructive algorithms for discrepancy minimization. In Proc. FOCS 2010, pages 3–10.

  2. M. Bayati and A. Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Trans. Inform. Theory, 57(2):764-785, 2011.

  3. E. Bolthausen. A Morita type proof of the replica-symmetric formula for SK. arXiv:1809.07972, 2018.

  4. E. Bolthausen. An iterative construction of solutions of the TAP equations for the Sherrington-Kirkpatrick model. Commun. Math. Phys., 325(1):333-366, 2014.

  5. T. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electron. Comput. 14(3):326-334.

  6. J. Ding and N. Sun. Capacity lower bound for the Ising perceptron. https://arxiv.org/pdf/1809.07742.pdf

  7. E. Gardner and B. Derrida. Optimal storage properties of neural network models. J. Phys. A., 21(1): 271–284, 1988.

  8. J. H. Kim and J. R. Roche. Covering cubes by random half cubes, with applications to binary neural networks. J. Comput. Syst. Sci., 56(2):223–252, 1998

  9. W. Krauth and M. Mézard. Storage capacity of memory networks with binary couplings. J. Phy. 50(20): 3057-3066, 1989.

  10. F. Krzakala et al. Gibbs states and the set of solutions of random constraint satisfaction problems. Proc. Natl. Acad. Sci., 104(25):10318-10323, 2007.

  11. S. Lovett and R. Meka. Constructive discrepancy minimization by walking on the edges. SIAM J. Comput., 44(5):1573-1582.

  12. M. Mézard. The space of interactions in neural networks: Gardner’s computation with the cavity method. J. Phys. A., 22(12):2181, 1989

  13. M. Mézard and G. Parisi and M. Virasoro. SK Model: The Replica Solution without Replicas. Europhys. Lett., 1(2): 77-82, 1986.

  14. M. Shcherbina and B. Tirozzi. Rigorous solution of the Gardner problem. Commun. Math. Phys., 234(3):383-422, 2003.

  15. A. Sly and N. Sun and Y. Zhang. The Number of Solutions for Random Regular NAE-SAT. In Proc. FOCS 2016, pages 724-731.

  16. J. Spencer. Six standard deviations suffice. Trans. Amer. Math. Soc. 289 (1985), 679-706

  17. M. Talagrand. Intersecting random half cubes. Random Struct. Algor., 15(3-4):436–449, 1999.

