Breno

The initialization and normalization concepts in deep learning

Weight initialization and normalization have proved crucial for training deep models. Over time, deep learning research has deepened our understanding of how to initialize weights and normalize activations properly for better learning. The first question that comes to my mind is how the two differ from one another.

What is initialization? What is normalization?

Interestingly, two important papers, one on initialization and one on normalization, appeared in 2015. So let's discuss and learn about these two concepts with code.

The distinction.

Initialization and normalization address the same underlying problem — keeping activations and gradients in a healthy range throughout the network — but they act at different moments and in different ways.

Initialization is a one-time decision. It sets the starting point of the weight matrices before training begins. The goal is to ensure that, at step zero, the forward pass produces activations with reasonable variance and the backward pass produces gradients that neither vanish nor explode. Once training starts, the initialization has done its job. It cannot correct anything that goes wrong later.

Normalization is an ongoing operation. It runs at every forward pass, during every training step, actively forcing the activations back into a well-behaved distribution. It is a layer in the network, not a pre-training procedure.

In short: initialization sets a good starting condition; normalization maintains good conditions throughout.
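
To make the difference concrete, here is a minimal sketch using PyTorch's built-in modules. The layer sizes, the drifting input scale, and the variable names are my own choices for illustration. The initializer is called once on the weight tensor, while the normalization layer sits inside the model and updates its statistics on every forward pass.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

linear = nn.Linear(32, 64)
bn = nn.BatchNorm1d(64)
model = nn.Sequential(linear, bn, nn.ReLU())

# Initialization: a single call, made once before any training step.
nn.init.kaiming_normal_(linear.weight, nonlinearity='relu')
nn.init.zeros_(linear.bias)
weight_at_start = linear.weight.detach().clone()

# Normalization: part of the forward pass, so it runs on every batch.
model.train()
for step in range(3):
    batch = torch.randn(128, 32) * (step + 1)  # input scale drifts on purpose
    pre_relu = model[:2](batch)                # linear layer followed by batch norm
    print(f"step {step}: std after BatchNorm1d = {pre_relu.std().item():.3f}")

# The layer counts the batches it has normalized; the initializer ran only once.
batches_seen = int(bn.num_batches_tracked)
print("batches normalized:", batches_seen)
```

No optimizer runs here, so the weights stay exactly where the initializer put them, yet the batch norm output keeps a standard deviation close to one even as the input scale grows.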

The two 2015 papers.

The coincidence is remarkable. Within five days of each other in February 2015, two papers appeared on arXiv that would become foundational:

February 6, 2015: He et al., Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (ICCV 2015). This paper derived the initialization scheme now known as Kaiming (or He) initialization. The core idea: if the activation function is a rectifier (ReLU or its variants), the weight variance should be $2/n_{in}$, where $n_{in}$ is the number of inputs to the layer, to preserve signal magnitude across layers.

February 11, 2015: Ioffe & Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (ICML 2015). This paper introduced batch normalization, which standardizes the pre-activations at each layer using the mini-batch mean and variance. The reported result: higher learning rates, less sensitivity to initialization, and a regularizing effect.

Both papers were responding to the same difficulty: deep networks were hard to train. He et al. attacked the problem at the start (initialization), while Ioffe & Szegedy attacked it continuously (normalization). Together, they made deep networks practical.

Initialization in code.

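The Kaiming derivation starts from a simple variance analysis. Take a layer $y = Wx$ with $n_{in}$ inputs, where the entries of $x$ have variance $\sigma_x^2$, the entries of $W$ have variance $\sigma_w^2$, and both are zero-mean and independent. Then:

$$\mathrm{Var}(y) = n_{in} \, \sigma_w^2 \, \sigma_x^2.$$

To keep $\mathrm{Var}(y) = \mathrm{Var}(x)$, we need $\sigma_w^2 = 1/n_{in}$. For ReLU, which zeros out half of a symmetric distribution and so halves the second moment passed to the next layer, we need $\sigma_w^2 = 2/n_{in}$. For tanh, which is contractive, PyTorch uses a gain of $5/3$, giving $\sigma_w^2 = (5/3)^2/n_{in}$.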
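
The variance formula is easy to check numerically. The sketch below assumes zero-mean, independent Gaussian entries for both the inputs and the weights, which is what the derivation relies on. The sizes, standard deviations, and seed are arbitrary values I picked for the experiment.

```python
import torch

torch.manual_seed(0)

n_in, n_out, batch = 512, 256, 4096   # illustrative sizes
sigma_x, sigma_w = 1.5, 0.05          # arbitrary standard deviations

# Zero-mean, independent inputs and weights, as the derivation assumes.
x = torch.randn(batch, n_in) * sigma_x
W = torch.randn(n_in, n_out) * sigma_w
y = x @ W

predicted = n_in * sigma_w**2 * sigma_x**2
measured = y.var().item()
print(f"predicted Var(y) = {predicted:.3f}, measured = {measured:.3f}")

# With sigma_w^2 = 1/n_in, a purely linear layer leaves the variance unchanged.
W_unit = torch.randn(n_in, n_out) / n_in**0.5
ratio = ((x @ W_unit).var() / x.var()).item()
print(f"Var(y) / Var(x) with sigma_w^2 = 1/n_in: {ratio:.3f}")
```

The measured values only match the prediction up to sampling noise, which shrinks as the layer gets wider and the batch gets larger.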

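import torch

def kaiming_init(fan_in, fan_out, activation='relu'):
    """Sample a (fan_in, fan_out) weight matrix with std = gain / sqrt(fan_in)."""
    if activation == 'relu':
        gain = 2.0 ** 0.5      # compensates for ReLU zeroing half the inputs
    elif activation == 'tanh':
        gain = 5.0 / 3.0       # the tanh gain PyTorch uses
    else:  # linear
        gain = 1.0
    std = gain / (fan_in ** 0.5)
    # The matrix is laid out for x @ W, with x of shape (batch, fan_in).
    return torch.randn(fan_in, fan_out) * std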

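This draws weights from the same distribution as torch.nn.init.kaiming_normal_ in its default fan_in mode when that function is given the matching nonlinearity argument, and the gains are the values torch.nn.init.calculate_gain returns. One difference is layout: the function above returns a (fan_in, fan_out) matrix for x @ W, whereas nn.Linear stores its weight as (out_features, in_features). The key insight is that the gain factor depends on the activation function: it compensates for how much the non-linearity contracts or expands the signal.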
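
To see what the gain buys us, we can push random data through a deep stack of ReLU layers and compare a gain of 1 against a gain of sqrt(2). This is a sketch under my own assumptions: the helper name `forward_rms`, the depth, and the width are illustrative and not taken from the paper. It also reads the gains back from `torch.nn.init.calculate_gain` for comparison.

```python
import torch

torch.manual_seed(0)

def forward_rms(gain, depth=20, width=512, batch=1024):
    """Push random inputs through `depth` ReLU layers and return the
    root-mean-square of the last layer's pre-activations."""
    h = torch.randn(batch, width)
    for _ in range(depth):
        W = torch.randn(width, width) * gain / width**0.5
        pre = h @ W
        h = torch.relu(pre)
    return pre.pow(2).mean().sqrt().item()

rms_unit = forward_rms(gain=1.0)          # sigma_w^2 = 1/n_in
rms_kaiming = forward_rms(gain=2**0.5)    # sigma_w^2 = 2/n_in

print(f"gain 1:       final RMS = {rms_unit:.2e}")
print(f"gain sqrt(2): final RMS = {rms_kaiming:.2e}")

# The gains PyTorch itself reports for these activations.
print("calculate_gain('relu') =", torch.nn.init.calculate_gain('relu'))
print("calculate_gain('tanh') =", torch.nn.init.calculate_gain('tanh'))
```

With a gain of 1 the signal shrinks by roughly a factor of two in variance at every ReLU layer, so after twenty layers almost nothing is left. With the sqrt(2) gain the magnitude stays on the order of one.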

Normalization in code.

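Batch normalization standardizes the pre-activations using the mean $\mu_B$ and variance $\sigma_B^2$ of the current mini-batch:

$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}.$$

Then it applies a learnable scale ($\gamma$) and shift ($\beta$):

$$y = \gamma \hat{x} + \beta.$$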

import torch

class BatchNorm1d:
    """Minimal batch normalization for inputs of shape (batch, dim)."""

    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum
        self.training = True
        self.gamma = torch.ones(dim)   # learnable scale
        self.beta = torch.zeros(dim)   # learnable shift
        # Running statistics are buffers used at inference, not trained by backprop.
        self.running_mean = torch.zeros(dim)
        self.running_var = torch.ones(dim)

    def __call__(self, x):
        if self.training:
            # Per-feature statistics of the current mini-batch. The biased
            # (divide-by-N) variance is the batch variance the paper defines.
            mean = x.mean(0, keepdim=True)
            var = x.var(0, keepdim=True, unbiased=False)
            # Update running statistics for inference
            with torch.no_grad():
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            # At inference, use the accumulated statistics instead of the batch's.
            mean = self.running_mean
            var = self.running_var
        xhat = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * xhat + self.beta

    def parameters(self):
        return [self.gamma, self.beta]
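
As a sanity check, the training-mode computation can be compared against PyTorch's own `torch.nn.BatchNorm1d`. The sketch below repeats the formula inline so that it runs on its own. The input shape and the off-center data are arbitrary choices of mine. To my knowledge the built-in layer normalizes with the biased batch variance and stores the unbiased one in its running buffer, which is what the last two checks assume.

```python
import torch

torch.manual_seed(0)

x = torch.randn(32, 8) * 3.0 + 5.0   # batch of 32, 8 features, deliberately off-center
eps = 1e-5

# Manual training-mode batch norm with gamma = 1 and beta = 0.
mean = x.mean(0, keepdim=True)
var = x.var(0, keepdim=True, unbiased=False)   # biased (divide-by-N) variance
manual = (x - mean) / torch.sqrt(var + eps)

# A freshly constructed module starts in training mode with weight = 1 and bias = 0.
reference = torch.nn.BatchNorm1d(8, eps=eps, momentum=0.1)
builtin = reference(x)

print("outputs match:", torch.allclose(manual, builtin, atol=1e-4))

# After one batch, the running buffers are 0.9 * initial value + 0.1 * batch statistic.
print("running mean matches:",
      torch.allclose(reference.running_mean, 0.1 * mean.squeeze(0), atol=1e-5))
print("running var matches:",
      torch.allclose(reference.running_var, 0.9 + 0.1 * x.var(0), atol=1e-4))
```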

The $\gamma$ and $\beta$ parameters are essential. Without them, the normalization would be a hard constraint: the network could never learn that a particular layer should have non-zero mean or non-unit variance. With them, the layer can represent any per-feature affine transformation of the normalized output, including one that undoes the normalization entirely. At initialization, $\gamma = 1$ and $\beta = 0$, so the layer performs pure standardization. During training, the network learns what distribution each layer actually needs.
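
A short sketch shows what the scale and shift can express. The values are set by hand here rather than learned, and the tensor shapes and targets are my own illustrative choices. One setting recovers the original input exactly, another moves every feature to a chosen mean and standard deviation.

```python
import torch

torch.manual_seed(0)

x = torch.randn(64, 4) * 2.0 + 3.0   # 64 examples, 4 features
eps = 1e-5

mean = x.mean(0, keepdim=True)
var = x.var(0, keepdim=True, unbiased=False)
xhat = (x - mean) / torch.sqrt(var + eps)   # standardized: mean 0, variance about 1

# gamma = sqrt(var + eps) and beta = mean undo the normalization completely.
gamma_undo = torch.sqrt(var + eps)
beta_undo = mean
y_identity = gamma_undo * xhat + beta_undo

# Any other target works too, for example mean -1 and standard deviation 0.5.
y_target = 0.5 * xhat + (-1.0)

print("recovers input:", torch.allclose(y_identity, x, atol=1e-4))
print("target mean per feature:", y_target.mean(0))
print("target std per feature: ", y_target.std(0, unbiased=False))
```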

How they interact.

With good initialization alone, the network starts in a healthy state but can drift during training. Activations may saturate, gradients may vanish, and the network has no mechanism to self-correct.

With batch normalization alone, the normalization actively prevents drift, which means the network is less sensitive to how the weights were initialized. This is exactly what Ioffe & Szegedy reported: batch normalization allows higher learning rates and makes initialization less critical.
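
One way to see this reduced sensitivity is to rescale a weight matrix and look at what comes out of the normalization. The sketch below is my own illustration, not an experiment from the paper. The helper `batch_standardize` and the scales are hypothetical, and it only looks at a single forward pass.

```python
import torch

torch.manual_seed(0)

def batch_standardize(z, eps=1e-5):
    """Training-mode batch norm with gamma = 1 and beta = 0."""
    mean = z.mean(0, keepdim=True)
    var = z.var(0, keepdim=True, unbiased=False)
    return (z - mean) / torch.sqrt(var + eps)

x = torch.randn(256, 64)
W = torch.randn(64, 32) / 64**0.5

results = {}  # weight scale -> (std of raw pre-activations, normalized output)
for scale in (0.1, 1.0, 10.0):
    z = x @ (scale * W)              # same weights, badly scaled on purpose
    normed = batch_standardize(z)
    results[scale] = (z.std().item(), normed)
    print(f"scale {scale:>4}: raw std = {z.std().item():.3f}, "
          f"normalized std = {normed.std().item():.3f}")
```

The raw pre-activations differ by a factor of one hundred between the smallest and largest scale, while the normalized outputs are nearly identical. The small remaining difference comes from eps.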

In practice, modern networks use both. Initialization ensures a clean start (no hockey-stick loss curve, where the first steps are spent merely correcting badly scaled outputs), and normalization maintains stability throughout training. They are complementary, not redundant.

The broader picture.

Batch normalization is not the only normalization technique. Layer normalization (Ba et al., 2016) normalizes across the features of each example instead of across the batch, avoiding the batch-size dependency that makes batch normalization awkward at inference time and in sequence models. Group normalization (Wu & He, 2018) splits channels into groups and normalizes within each group. The Transformer architecture uses layer normalization, not batch normalization.
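
The three techniques differ mainly in which axis the statistics are computed over. Here is a minimal sketch on a (batch, features) tensor, with the learnable scale and shift left out. The helper `standardize`, the shapes, and the group count are my own choices. The layer and group versions are compared against `torch.nn.functional.layer_norm` and `torch.nn.functional.group_norm`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

x = torch.randn(16, 12) * 2.0 + 1.0   # (batch, features)
eps = 1e-5

def standardize(t, dim):
    """Subtract the mean and divide by the biased std along `dim`."""
    mean = t.mean(dim, keepdim=True)
    var = t.var(dim, keepdim=True, unbiased=False)
    return (t - mean) / torch.sqrt(var + eps)

batch_normed = standardize(x, dim=0)   # one statistic per feature, across the batch
layer_normed = standardize(x, dim=1)   # one statistic per example, across features

# Group norm: 12 channels split into 3 groups of 4, standardized per example and group.
groups = 3
group_normed = standardize(x.reshape(16, groups, -1), dim=2).reshape(16, 12)

layer_ref = F.layer_norm(x, (12,), eps=eps)
group_ref = F.group_norm(x.unsqueeze(-1), groups, eps=eps).squeeze(-1)
print("layer norm matches:", torch.allclose(layer_normed, layer_ref, atol=1e-5))
print("group norm matches:", torch.allclose(group_normed, group_ref, atol=1e-5))

# Layer norm of one example does not depend on the rest of the batch.
alone = standardize(x[:1], dim=1)
print("batch independent:", torch.allclose(alone, layer_normed[:1]))
```

The last check is the property that matters for sequence models and inference: a single example gets the same output whether it is processed alone or inside a batch.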

Similarly, Kaiming initialization is not the only initialization scheme. Xavier/Glorot initialization (Glorot & Bengio, 2010) preceded it. It was derived under a linear-activation assumption, suits symmetric activations such as tanh, and sets the weight variance to $2/(n_{in} + n_{out})$. The key principle is the same: choose the weight variance so that signal magnitude is preserved across layers.
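
The two schemes can be compared side by side with PyTorch's built-in initializers. In this sketch the layer size is an arbitrary choice of mine. The Xavier standard deviation uses the fan-in and fan-out average from Glorot & Bengio, and the Kaiming one uses the fan-in only with the ReLU gain.

```python
import torch

torch.manual_seed(0)

fan_in, fan_out = 400, 100   # illustrative layer size

xavier_std = (2.0 / (fan_in + fan_out)) ** 0.5   # Glorot: variance 2 / (n_in + n_out)
kaiming_std = (2.0 / fan_in) ** 0.5              # He: variance 2 / n_in for ReLU

# nn.Linear layout: (out_features, in_features).
W = torch.empty(fan_out, fan_in)

torch.nn.init.xavier_normal_(W)
measured_xavier = W.std().item()

torch.nn.init.kaiming_normal_(W, mode='fan_in', nonlinearity='relu')
measured_kaiming = W.std().item()

print(f"Xavier:  expected std = {xavier_std:.4f}, measured = {measured_xavier:.4f}")
print(f"Kaiming: expected std = {kaiming_std:.4f}, measured = {measured_kaiming:.4f}")
```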

What the two 2015 papers established is a way of thinking: the internal statistics of a network are observable quantities that can and should be controlled, either by design (initialization) or by mechanism (normalization). This diagnostic mindset, in which you measure what is happening inside the network and intervene when the statistics go wrong, underlies much of modern deep learning practice.
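
In code, this mindset can be as simple as attaching forward hooks and looking at the numbers. The sketch below is one possible diagnostic, not a standard tool. The small tanh network, the helper names `build_mlp` and `tanh_saturation`, the 0.97 threshold, and the two weight scales are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def build_mlp(weight_std):
    """Hypothetical 3-layer tanh MLP whose weights are drawn with a given std."""
    model = nn.Sequential(
        nn.Linear(100, 200), nn.Tanh(),
        nn.Linear(200, 200), nn.Tanh(),
        nn.Linear(200, 10),
    )
    for m in model:
        if isinstance(m, nn.Linear):
            nn.init.normal_(m.weight, std=weight_std)
            nn.init.zeros_(m.bias)
    return model

def tanh_saturation(model, x, threshold=0.97):
    """Return {layer index: fraction of tanh outputs with |value| > threshold}."""
    stats, handles = {}, []
    for idx, m in enumerate(model):
        if isinstance(m, nn.Tanh):
            def hook(module, inputs, output, idx=idx):
                stats[idx] = (output.detach().abs() > threshold).float().mean().item()
            handles.append(m.register_forward_hook(hook))
    with torch.no_grad():
        model(x)
    for h in handles:
        h.remove()   # leave the model without hooks afterwards
    return stats

x = torch.randn(64, 100)
healthy = tanh_saturation(build_mlp(weight_std=0.1), x)
saturated = tanh_saturation(build_mlp(weight_std=1.0), x)

for idx in healthy:
    print(f"tanh layer {idx}: saturated fraction "
          f"{healthy[idx]:.2%} (std 0.1) vs {saturated[idx]:.2%} (std 1.0)")
```

With the larger weight scale most tanh units sit in the flat part of the curve, where gradients are close to zero. That is the kind of statistic that tells you to intervene.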