Breno

The batch size in training

We have an MLP. The training code is in place. The dataset is the one with names. Now it is important to observe how the training proceeds: we want the loss to be reduced. Karpathy, in the lecture on this MLP, poses a challenge: to beat the loss he reaches in the lecture. One knob to start with is the batch size. We train for 30,000 steps (not epochs: each step sees one mini-batch), with the batch size set to 32. Let's understand the role of the batch size in training.

So, the imports, and after that the code for setting up the data.

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

words = open('names.txt', 'r').read().splitlines()

chars = sorted(list(set(''.join(words))))
s_to_i = {s:i+1 for i, s in enumerate(chars)}
s_to_i['.'] = 0
i_to_s = {i:s for s, i in s_to_i.items()}
vocab_size = len(i_to_s)
block_size = 3

def build_dataset(words):
    X, Y = [], []
    for w in words:
        context = [0] * block_size
        for ch in w + '.':
            ix = s_to_i[ch]
            X.append(context)
            Y.append(ix)
            context = context[1:] + [ix]
    X = torch.tensor(X)
    Y = torch.tensor(Y)
    return X, Y

import random
random.seed(42)
random.shuffle(words)
n1 = int(0.8*len(words))
n2 = int(0.9*len(words))

Xtr, Ytr = build_dataset(words[:n1])
Xdev, Ydev = build_dataset(words[n1:n2])
Xt, Yt = build_dataset(words[n2:])

This is the same dataset code from the previous post. The training set has around 182k examples.

Now, the model definition. Also the same.

n_embd = 10
n_hidden = 200

g = torch.Generator().manual_seed(2147483647)
C = torch.randn((vocab_size, n_embd), generator=g)
W1 = torch.randn((n_embd * block_size, n_hidden), generator=g) * (5/3) / (n_embd * block_size)**0.5
b1 = torch.randn(n_hidden, generator=g) * 0.01
W2 = torch.randn((n_hidden, vocab_size), generator=g) * 0.01
b2 = torch.zeros(vocab_size)

parameters = [C, W1, b1, W2, b2]
for p in parameters:
    p.requires_grad = True

Notice the initialization is already the improved version with Kaiming scaling for W1 and small values for W2. We are past the saturation problem. Now the question is: how does the batch size affect training?

What is the batch size?

The batch size is the number of examples the model sees before computing one gradient update. In our training loop:

ix = torch.randint(0, Xtr.shape[0], (batch_size,), generator=g)
Xb, Yb = Xtr[ix], Ytr[ix]

These two lines randomly select batch_size examples from the training set. The loss is computed on this subset, and the gradients are computed from this loss. The key insight is: the gradient computed from a mini-batch is an approximation of the true gradient over the entire dataset.

With batch_size = 32, we are estimating the gradient from 32 examples out of 182,000. That estimate is noisy. Sometimes the mini-batch is representative of the full data; sometimes it is not. This noise has consequences — both good and bad.
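This noise can be measured directly. A minimal sketch, on a toy problem rather than the MLP above (a small linear regression, where the exact full-dataset gradient is cheap to compute), comparing how far mini-batch gradients of different sizes land from the full gradient:

```python
import torch

torch.manual_seed(0)

# A tiny synthetic regression problem, small enough that the exact
# full-dataset gradient is cheap to compute.
X = torch.randn(1000, 5)
y = X @ torch.randn(5, 1) + 0.1 * torch.randn(1000, 1)

w = torch.zeros(5, 1, requires_grad=True)

def grad_for(idx):
    # Gradient of the MSE loss restricted to the examples in idx.
    loss = ((X[idx] @ w - y[idx]) ** 2).mean()
    (g,) = torch.autograd.grad(loss, [w])
    return g

full_grad = grad_for(torch.arange(1000))  # the "true" gradient

def avg_error(batch_size, trials=200):
    # Average distance between a mini-batch gradient and the full gradient.
    errs = [
        (grad_for(torch.randint(0, 1000, (batch_size,))) - full_grad).norm().item()
        for _ in range(trials)
    ]
    return sum(errs) / trials

print(avg_error(1), avg_error(32), avg_error(256))
```

Larger batches land closer to the full gradient; the error shrinks roughly like one over the square root of the batch size.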

The experiment.

Let's train the same model with three different batch sizes and compare.

def train(batch_size, max_steps=30000, seed=2147483647):
    g = torch.Generator().manual_seed(seed)
    C = torch.randn((vocab_size, n_embd), generator=g)
    W1 = torch.randn((n_embd * block_size, n_hidden), generator=g) * (5/3) / (n_embd * block_size)**0.5
    b1 = torch.randn(n_hidden, generator=g) * 0.01
    W2 = torch.randn((n_hidden, vocab_size), generator=g) * 0.01
    b2 = torch.zeros(vocab_size)

    parameters = [C, W1, b1, W2, b2]
    for p in parameters:
        p.requires_grad = True

    lossi = []

    for i in range(max_steps):
        ix = torch.randint(0, Xtr.shape[0], (batch_size,), generator=g)
        Xb, Yb = Xtr[ix], Ytr[ix]

        emb = C[Xb]
        embcat = emb.view(emb.shape[0], -1)
        hpreact = embcat @ W1 + b1
        h = torch.tanh(hpreact)
        logits = h @ W2 + b2
        loss = F.cross_entropy(logits, Yb)

        for p in parameters:
            p.grad = None
        loss.backward()

        lr = 0.1 if i < 20000 else 0.01
        for p in parameters:
            p.data += -lr * p.grad

        lossi.append(loss.log10().item())

    return lossi, parameters

This function encapsulates the entire training. It receives the batch_size as a parameter and returns the loss history and the final parameters. Everything else is identical. This way, the only variable is the batch size.

lossi_32, params_32 = train(batch_size=32)
lossi_64, params_64 = train(batch_size=64)
lossi_1, params_1 = train(batch_size=1)

Three runs. Batch size 1 (stochastic gradient descent in its purest form), batch size 32 (what Karpathy uses), and batch size 64.

What we observe.

The loss curves reveal the tradeoff.

Batch size 1: the curve is extremely noisy. Each gradient update is computed from a single example, so the direction of the update is highly variable. The loss jumps around wildly from step to step. But the model does make progress — the trend is downward. This is stochastic gradient descent: every step is a rough guess, but on average the guesses point in the right direction.

Batch size 32: the curve is smoother. The gradient is averaged over 32 examples, which reduces the variance of the estimate. The model converges more steadily. This is the standard choice in practice — noisy enough to escape local minima, smooth enough to make consistent progress.

Batch size 64: the curve is the smoothest. Less noise in the gradient means each step is more reliable. But there is a cost: with 30,000 steps, the model has seen 30,000 × 64 = 1.92M examples, against 30,000 × 32 = 960K with batch size 32. Doubling the batch size doubles the computation per step; for a fixed compute budget, a larger batch means fewer steps.
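These curves can be drawn from the lists that train returns. A minimal plotting sketch (the 200-step smoothing window is an arbitrary choice, just enough to make the trend visible under the noise):

```python
import torch
import matplotlib.pyplot as plt

def smooth(lossi, k=200):
    # Moving average over a window of k steps.
    t = torch.tensor(lossi)
    return t.unfold(0, k, 1).mean(dim=1)

def plot_histories(histories):
    # histories: dict mapping a label to a list of log10 losses.
    for label, lossi in histories.items():
        plt.plot(smooth(lossi), label=label)
    plt.xlabel('step')
    plt.ylabel('log10 loss (smoothed)')
    plt.legend()
    plt.show()
```

Calling plot_histories({'batch 1': lossi_1, 'batch 32': lossi_32, 'batch 64': lossi_64}) puts the three runs on the same axes.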

The question is not which batch size gives the lowest loss per step, but which gives the lowest loss per unit of computation.

The evaluation.

To compare fairly, we evaluate on the validation set:

def eval_loss(parameters):
    C, W1, b1, W2, b2 = parameters
    # No gradients are needed for evaluation, so skip building the graph.
    with torch.no_grad():
        emb = C[Xdev]
        embcat = emb.view(emb.shape[0], -1)
        hpreact = embcat @ W1 + b1
        h = torch.tanh(hpreact)
        logits = h @ W2 + b2
        loss = F.cross_entropy(logits, Ydev)
    return loss.item()

print(f"Batch size  1: val loss = {eval_loss(params_1):.4f}")
print(f"Batch size 32: val loss = {eval_loss(params_32):.4f}")
print(f"Batch size 64: val loss = {eval_loss(params_64):.4f}")

The validation loss tells the real story. Training loss can be misleading — a small batch size produces noisy training loss that looks worse step-by-step, but the actual model quality might be comparable or even better, because the noise acts as a form of regularization.

Why batch size matters.

The batch size controls a fundamental tradeoff between three things:

Gradient quality. Larger batches give a more accurate estimate of the true gradient. With the full dataset (batch size = 182,000), the gradient is exact. With batch size 1, it is maximally noisy.

Training speed. Smaller batches mean more parameter updates per epoch. With batch size 32 and 182k examples, one epoch contains about 5,700 updates. With batch size 64, about 2,850. More updates means the model has more chances to adjust.

Regularization. The noise from small batches acts as implicit regularization. It prevents the model from overfitting to the exact training data, because the gradient never points in the exact direction that would overfit — it always has some random perturbation.
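The training-speed arithmetic above is easy to check (a sketch; 182,000 is the approximate size of our training split):

```python
n_train = 182_000  # approximate number of examples in the training split

def updates_per_epoch(batch_size):
    # One epoch is one full pass over the training set, so it contains
    # this many gradient updates.
    return n_train // batch_size

print(updates_per_epoch(32))  # 5687
print(updates_per_epoch(64))  # 2843
```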

In our case, with 30,000 steps and this small network, batch size 32 tends to work well. It is a reasonable balance. But this is something to experiment with. Part of beating Karpathy's loss is finding the right combination of batch size, learning rate, and number of steps. They are not independent — if you increase the batch size, you often need to increase the learning rate proportionally to maintain the same effective step size.
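That proportional adjustment is the linear scaling heuristic, which can be written down directly (a sketch; the rule is a heuristic that tends to hold for moderate batch sizes, not a guarantee):

```python
def scaled_lr(base_lr, base_batch, new_batch):
    # Linear scaling heuristic: keep lr / batch_size constant, so the
    # expected magnitude of one averaged update stays comparable.
    return base_lr * new_batch / base_batch

# Starting from the lr = 0.1, batch size 32 used in this post:
print(scaled_lr(0.1, 32, 64))   # 0.2
print(scaled_lr(0.1, 32, 128))  # 0.4
```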

The takeaway.

The batch size is not just a computational convenience. It is a hyperparameter that shapes the optimization landscape the model navigates. Too small, and the model wanders without clear direction. Too large, and the model takes fewer steps, each one more precise but possibly missing the diversity of the data. The 32 that Karpathy uses is a practical default, but understanding why it works is what allows you to try different values and see if you can push the loss lower in the final result.