DL1 Deep Learning (University of Amsterdam)

Introduction

Content

History

Early works include the McCulloch-Pitts neuron, the Perceptron, and Adaline.

Rosenblatt's perceptron was proposed for binary classification: the prediction for sample \(i\) is \(y_i = \operatorname{sign}\left(\sum_{k=1}^p w_k x_{ik} + w_0\right)\), where \(x_i\) is the input vector of size \(p\) and \(l_i\in\{-1,1\}\) is its corresponding label. The main innovation is an algorithm for training perceptrons, i.e. for updating the weights. The algorithm is as follows (a small code sketch is given after the list):

1. Set random weights.
2. Sample a new training image and its label \((x, l)\).
3. Compute the prediction \(y\).
4. If \(y < 0\) and \(l > 0\), set \(w \leftarrow w + \epsilon x\)  (score too low, increase weights).
5. If \(y > 0\) and \(l < 0\), set \(w \leftarrow w - \epsilon x\)  (score too high, decrease weights).
6. Go to 2.
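
A minimal NumPy sketch of this procedure, assuming a fixed number of sampling steps and a bias handled as an extra weight with constant input 1 (both are illustrative choices not fixed by the notes):

  import numpy as np

  def train_perceptron(X, labels, eps=0.1, steps=1000, seed=0):
      """Rosenblatt-style training: X is (n, p), labels are in {-1, +1}."""
      rng = np.random.default_rng(seed)
      n, p = X.shape
      w = rng.normal(size=p)                    # 1. random weights
      w0 = rng.normal()                         # bias, treated as an extra weight
      for _ in range(steps):
          i = rng.integers(n)                   # 2. sample an example and its label
          x, l = X[i], labels[i]
          y = np.sign(x @ w + w0)               # 3. compute the prediction
          if y < 0 and l > 0:                   # 4. score too low: increase weights
              w, w0 = w + eps * x, w0 + eps
          elif y > 0 and l < 0:                 # 5. score too high: decrease weights
              w, w0 = w - eps * x, w0 - eps
      return w, w0                              # 6. the loop plays the role of "go to 2"

On a linearly separable toy problem (e.g. labels given by the sign of one input coordinate) this converges to a separating hyperplane after a few passes over the data.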

Single-layer perceptrons, however, cannot learn the XOR function: one can show that no weights exist that realize it. Multilayer Perceptrons (MLPs) can solve this problem, but we cannot use Rosenblatt's learning algorithm for MLPs, as it requires ground-truth labels, which we do not have for the intermediate layers of an MLP.
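
Why no such weights exist can be seen directly (a short argument filled in here, for a threshold unit \(\operatorname{sign}(w_1 x_1 + w_2 x_2 + w_0)\)): XOR would require

\[
w_0 < 0, \qquad w_1 + w_0 > 0, \qquad w_2 + w_0 > 0, \qquad w_1 + w_2 + w_0 < 0.
\]

Adding the two middle inequalities gives \(w_1 + w_2 + 2w_0 > 0\), while the first and last together give \(w_1 + w_2 + 2w_0 < 0\), a contradiction.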

The XOR issue led to the first AI winter, during which, despite a general lack of funding, significant discoveries were made, e.g. the (re)invention of the backpropagation algorithm for MLPs and of Recurrent Neural Networks (RNNs), which can take sequences of unbounded length.

The second AI winter was dominated by kernel machines and graphical models, which achieved similar accuracies with more formalism and fewer heuristics.

The AI winter started thawing around 2006, when Hinton and colleagues proposed training "deep" neural networks with multiple layers greedily, one layer at a time.

Nowadays, we can train deep neural networks with many layers quite easily; this has led to systems that can classify images and play video games better than humans.

Deep learning, however, is hungry for big data. ImageNet is the most famous dataset; it led to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), consisting of roughly 1M images and 1000 classes, with top-1 and top-5 error as the metrics of success. AlexNet gave a huge improvement in this challenge and is the most well-known catalyst of the deep learning revolution.
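
As a small illustration of the metric (a sketch, not from the course materials): the top-\(k\) error is the fraction of samples whose true class is not among the \(k\) highest-scoring predictions.

  import numpy as np

  def top_k_error(scores, labels, k=5):
      """scores: (n, num_classes) class scores; labels: (n,) true class indices."""
      # indices of the k highest-scoring classes per sample
      top_k = np.argsort(scores, axis=1)[:, -k:]
      hits = (top_k == labels[:, None]).any(axis=1)
      return 1.0 - hits.mean()

  # Toy example: 3 samples, 4 classes (ILSVRC uses 1000 classes and k = 1 or 5)
  scores = np.array([[0.1, 0.5, 0.2, 0.2],
                     [0.7, 0.1, 0.1, 0.1],
                     [0.2, 0.2, 0.3, 0.3]])
  labels = np.array([1, 2, 0])
  print(top_k_error(scores, labels, k=1))  # top-1 error
  print(top_k_error(scores, labels, k=2))  # top-2 error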

Better hardware and more data enabled this breakthrough. That being said, the improvements deep learning has brought to fields ranging from computer vision, robotics, and games like Go to natural language processing and many other supervised learning tasks are very impressive.

In short, Deep Neural Networks (DNNs) are a family of parametric, nonlinear, hierarchical representation learning functions which are massively optimized with (variants of) Stochastic Gradient Descent (SGD) to encode domain knowledge (e.g. domain invariances, stationarity).
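
The "hierarchical" part can be written as a composition of parametric modules (the notation below is an assumed convention, not fixed in the notes):

\[
a_L = h_L\big(h_{L-1}(\cdots h_1(x;\,\theta_1)\cdots;\,\theta_{L-1});\,\theta_L\big),
\]

where each \(h_\ell\) is a (typically nonlinear) layer with parameters \(\theta_\ell\), and all parameters are optimized jointly with SGD.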

Data is often not linearly separable: with \(n\) samples in a binary classification task there are \(2^n\) possible labelings, but the number of linearly separable ones grows only polynomially in \(n\), with degree set by the input dimension \(d\). As \(n\) grows beyond \(d\), the chance that the data is linearly separable therefore drops very fast. One idea is to use nonlinear features followed by linear machines, which is what the kernel trick does.
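
To make the "drops very fast" claim concrete, here is a small sketch assuming Cover's function counting theorem for points in general position and homogeneous linear separators (the theorem is not named in the notes):

  from math import comb

  def separable_fraction(n, d):
      """Fraction of the 2^n labelings of n points (general position, R^d)
      that a homogeneous linear classifier can realize (Cover, 1965)."""
      count = 2 * sum(comb(n - 1, k) for k in range(d))
      return count / 2 ** n

  for n in (5, 10, 20, 40):
      print(n, separable_fraction(n, d=5))
  # The fraction is 1 while n <= d, exactly 1/2 at n = 2d,
  # and collapses towards 0 once n grows well beyond 2d.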

What is a good feature?

  • Invariant but not too invariant.
  • Repeatable but not bursty.
  • Discriminative but not too class-specific.
  • Robust but sensitive enough.

Traditionally, features were designed manually, e.g. by choosing explicit kernels, or explicitly via descriptors such as SIFT and HOG.

A (possibly) better idea is to learn nonlinear features and feed them to linear classifiers. This can be thought of as transforming the features until they are linear enough for the classifier. This is beneficial because manually designed features are expensive to research and validate. Nowadays, the time that used to be spent on designing features is spent on designing architectures.

The manifold hypothesis states that data lies on lower dimensional manifolds which are most probably highly nonlinear, in a way that semantically similar things lie closer together than semantically dissimilar things.

DNNs are universal approximators: with just one hidden layer (of sufficient width) they can in principle approximate any continuous function to arbitrary accuracy. The theorem, however, does not say what the precise architecture should be, nor how to train a DNN to achieve this.
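
A minimal PyTorch sketch illustrating the setting of the theorem (the target function, hidden width, and training setup are arbitrary choices for illustration):

  import torch
  import torch.nn as nn

  # One-hidden-layer MLP: the universal approximation setting.
  model = nn.Sequential(nn.Linear(1, 256), nn.Tanh(), nn.Linear(256, 1))
  optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

  # Toy target: approximate sin(3x) on [-2, 2].
  x = torch.linspace(-2, 2, 512).unsqueeze(1)
  y = torch.sin(3 * x)

  for step in range(5000):
      optimizer.zero_grad()
      loss = nn.functional.mse_loss(model(x), y)
      loss.backward()
      optimizer.step()

  # After training, model(x) should follow sin(3x) closely on [-2, 2];
  # the theorem guarantees such an approximation exists for a wide enough
  # hidden layer, but not that SGD will find it or how wide is enough.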

Tutorials

PyTorch

Activation functions

(#DOUBT Relation between dead neurons from ReLU and residual connections)

Optimization and Initialization

Inception, ResNet and DenseNet

(#DOUBT Why do most computer vision networks perform layer wise normalization instead of pixel wise normalization)
