"Statistical Physics of Artificial Neural Networks" by Prof. Lenka Zdeborova - 2021
Details
Title: "Statistical Physics of Artificial Neural Networks" by Prof. Lenka Zdeborova
Host: Department of Theoretical Physics, TIFR
Link(s): https://www.youtube.com/watch?v=2P_iB0ldSS8
Rough Notes
Most of modern deep learning derives its successes from large amounts of labeled data, GPUs and (variants of) Stochastic Gradient Descent (SGD).
We know that Neural Networks (NNs) are universal approximators and that training them is NP-complete. However, there are conflicting accounts of why (overparametrized) NNs do not overfit, what the effective number of parameters is, and why training does not get stuck in bad local minima.
Classical statistical learning theory tells us that being bad at fitting random labels implies good generalization (via Rademacher complexity); however, state-of-the-art NNs are able to fit random labels perfectly.
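A standard form of this argument (stated schematically here, since the constants differ across formulations) uses the empirical Rademacher complexity of the function class \(\mathcal{F}\) and the associated generalization bound:

\[
\hat{\mathcal{R}}_n(\mathcal{F}) = \mathbb{E}_{\sigma}\!\left[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i)\right],
\qquad
R(f) \le \hat{R}_n(f) + 2\,\hat{\mathcal{R}}_n(\mathcal{F}) + O\!\left(\sqrt{\frac{\log(1/\delta)}{n}}\right),
\]

where the \(\sigma_i\) are iid Rademacher (\(\pm 1\)) variables. A class that can fit arbitrary \(\pm 1\) labels has \(\hat{\mathcal{R}}_n(\mathcal{F}) \approx 1\), so the bound becomes vacuous; this is why the random-label experiments are problematic for this line of reasoning.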
In data science/statistics, a model is expected to faithfully capture all properties of the observed data; in physics such a data-faithful description would rather be called an ansatz, and a model is instead expected to capture the essence of the problem, acting as a tool for understanding - e.g. the 2D Ising model.
The model situation considered in the statistical physics view is the teacher-student setting. A teacher network receives \(n\) iid random samples \(X\); its weights are also chosen iid at random, and it produces the labels \(y\). The student network observes \(X, y\) and the architecture of the teacher network, but not the values of the weights, and aims to learn weights such that the function it represents matches the teacher's. Concretely, one takes iid Gaussian \(X_{\mu i}\) and iid weights \(w_i^*\) drawn from \(P_w\), creates labels \(y_\mu=\text{sign}(\sum_{i=1}^d X_{\mu i}w_i^*)\), and studies the high-dimensional regime \(n,d \to \infty\) (the thermodynamic limit in physics terms) with \(\alpha=n/d=\Theta(1)\); a small simulation sketch of this data-generating process is given after the table below. The following correspondence is then noted:
| Physics of disordered systems | Machine Learning |
|---|---|
| Position of one molecule | One weight in the NN |
| Interaction between molecules | Input data and labels |
| Inverse temperature | Number of samples per input dimension |
| Physical dynamics | Algorithm |
| Physical property of interest | Test error |
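As a concrete illustration of the teacher-student data-generating process described above, here is a minimal sketch (the function name and the choice \(P_w = \mathcal{N}(0,1)\) are assumptions for illustration, not taken from the talk):

```python
import numpy as np

def teacher_student_data(n, d, rng=None):
    """Generate data from a sign teacher: y_mu = sign(sum_i X_{mu i} w*_i).

    Inputs X are iid standard Gaussian; teacher weights w* are drawn iid
    from P_w, taken here to be a standard Gaussian (one common choice).
    """
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((n, d))   # iid Gaussian inputs X_{mu i}
    w_star = rng.standard_normal(d)   # teacher weights w*_i ~ P_w
    y = np.sign(X @ w_star)           # labels y_mu
    return X, y, w_star

# High-dimensional regime: n, d large with alpha = n / d of order one.
d = 1000
alpha = 2.0
X, y, w_star = teacher_student_data(n=int(alpha * d), d=d, rng=0)
```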
Applying techniques from the physics of glasses/spin glasses, e.g. the replica method, to neural networks, one can compute the optimal (minimal) generalization error in this setting. The state of the art extends this to any activation function and any prior on the weights, alongside rigorous statements about the correctness of the replica prediction for the teacher-student model.
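Concretely (a sketch of the quantities involved; the notation here is an assumption, not taken verbatim from the talk), the object analyzed is the posterior over the student's weights, and for the sign teacher the test error is a simple function of the overlap with the teacher:

\[
P(w \mid X, y) \propto P_w(w) \prod_{\mu=1}^{n} P_{\text{out}}\!\left(y_\mu \,\middle|\, \sum_{i=1}^{d} X_{\mu i} w_i\right),
\qquad
\epsilon_g = \frac{1}{\pi}\arccos\!\left(\frac{\hat{w}\cdot w^*}{\|\hat{w}\|\,\|w^*\|}\right),
\]

where \(\hat{w}\) is the student's estimate and \(\epsilon_g\) is the probability of mislabeling a fresh Gaussian input. The replica method gives the typical value of this overlap, and hence of \(\epsilon_g\), in the limit \(n, d \to \infty\) with \(\alpha = n/d\) fixed.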
Moving towards a theory of deep learning, there is a trio of interrelated concepts: the structure in the data, the architecture of the NN, and the algorithm used to train it.
SGD requires more samples than Approximate Message Passing (AMP).
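To make the SGD side of this comparison concrete, here is a minimal sketch (the logistic loss, plain SGD, and the hyperparameters are assumptions for illustration, not the talk's setup) that trains a linear student on the teacher-generated data from the earlier sketch and measures the angular test error for several values of \(\alpha\):

```python
import numpy as np

def sgd_student(X, y, epochs=50, lr=0.05, rng=None):
    """Train a linear student sign(x . w) with SGD on the logistic loss."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for mu in rng.permutation(n):
            margin = np.clip(y[mu] * (X[mu] @ w), -30.0, 30.0)
            # gradient step for the loss log(1 + exp(-y_mu x_mu . w))
            w += lr * y[mu] * X[mu] / (1.0 + np.exp(margin))
    return w

def test_error(w_hat, w_star):
    """Probability of disagreeing with the teacher on a fresh Gaussian input."""
    cos = w_hat @ w_star / (np.linalg.norm(w_hat) * np.linalg.norm(w_star) + 1e-12)
    return np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

d = 500
for alpha in [0.5, 1.0, 2.0, 4.0]:
    X, y, w_star = teacher_student_data(n=int(alpha * d), d=d, rng=1)
    w_hat = sgd_student(X, y, rng=1)
    print(f"alpha = {alpha:.1f}: test error ~ {test_error(w_hat, w_star):.3f}")
```

An AMP implementation for this model is more involved (it requires the output-channel denoising functions) and is not sketched here; the point of the comparison is that, on the same teacher-student data, AMP reaches a given test error at smaller \(\alpha\) than SGD.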