# Batch Normalization

A technique for normalizing the pre-activations of hidden layers, initially motivated by the observation that normalizing the inputs (i.e. shifting and scaling them towards mean 0 and variance 1) speeds up training*.

During training, the mean and standard deviation are computed for each mini-batch, and backpropagation takes this normalization into account (gradients flow through the batch statistics). Within each mini-batch the normalized pre-activations therefore have mean 0 and variance 1, so they are likely to fall in the non-saturating regime of the activation function, which helps mitigate the vanishing gradient problem. During the evaluation/test phase, the global mean and standard deviation (estimated over the training data rather than from the current batch) are used instead.
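The difference between the training and evaluation behaviour can be illustrated with a minimal sketch in plain Python. It is not a reference implementation: the function names `batch_norm_train` and `batch_norm_eval`, the `momentum` and `eps` parameters, and the use of an exponential moving average to estimate the global statistics are all assumptions made for illustration. The sketch covers only the forward pass and leaves out backpropagation and any learned scale and shift.

```python
import math


def batch_norm_train(batch, running_mean, running_var, momentum=0.9, eps=1e-5):
    """Normalize one mini-batch per feature and update the running statistics.

    batch: list of rows, each row a list of pre-activations (one per feature).
    Returns (normalized_batch, new_running_mean, new_running_var).
    """
    n = len(batch)
    num_features = len(batch[0])

    # Per-feature mean and (biased) variance over the mini-batch.
    mean = [sum(row[j] for row in batch) / n for j in range(num_features)]
    var = [
        sum((row[j] - mean[j]) ** 2 for row in batch) / n
        for j in range(num_features)
    ]

    # Normalize with the statistics of this mini-batch; eps avoids division by zero.
    normalized = [
        [(row[j] - mean[j]) / math.sqrt(var[j] + eps) for j in range(num_features)]
        for row in batch
    ]

    # Assumption: estimate the global statistics with an exponential moving average.
    new_mean = [momentum * rm + (1 - momentum) * m for rm, m in zip(running_mean, mean)]
    new_var = [momentum * rv + (1 - momentum) * v for rv, v in zip(running_var, var)]
    return normalized, new_mean, new_var


def batch_norm_eval(batch, running_mean, running_var, eps=1e-5):
    """Normalize with the stored global statistics instead of batch statistics."""
    return [
        [(x - m) / math.sqrt(v + eps) for x, m, v in zip(row, running_mean, running_var)]
        for row in batch
    ]


# Usage: three examples with two features each.
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized, running_mean, running_var = batch_norm_train(batch, [0.0, 0.0], [1.0, 1.0])
test_output = batch_norm_eval([[2.0, 20.0]], running_mean, running_var)
```

In training mode each feature column of `normalized` has mean 0 and variance close to 1, whatever its original scale. In evaluation mode the output depends only on the stored statistics, so a single example is normalized the same way regardless of what else is in the batch.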