The weights in a study on fixing the initial loss
Karpathy explains activations in a way that gives you perspective for research.
Lecture 3 is challenging, but even so, Karpathy keeps the teaching simple.
The goal here is to take on the challenge of detailing what the prereqs would be.
I thought about this during parts of the lecture. I am a Brazilian engineer, and I think that in many regards, education in Brazil, a developing country, at the high school and university level, can fall short in some areas.
I remember that I had just one class on probability and statistics and the teacher was not that motivated or engaged.
So, this is important to avoid unfair comparisons: I should measure only my own progress.
So, let's study!
Before that, I remembered that Karpathy highlighted a really great strategy for learning.
Take a look at the post in the image.

Well, great!
Starter code.
Let's start with the MLP that trains on a simple dataset of names.
Imports.
First, the imports.
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
%matplotlib inline
Line 1 imports the PyTorch library.
Line 2 imports the functional module from PyTorch's neural network package torch.nn and aliases it as F for convenience. This module contains functions such as loss functions and activation functions that we will use during training, for example F.cross_entropy.
Line 3 imports matplotlib for visualization.
Line 4 is not Python code per se. It is a Jupyter 'magic command', that is indicated by the % prefix. It tells Jupyter to display matplotlib plots directly inside the notebook, right below the cells that generated them. Without it, plots might open in a separate window or not display at all.
Dataset.
After that, the code for the dataset preparation. Let's go line by line. Code and explanation.
words = open('names.txt', 'r').read().splitlines()
This line opens the file names.txt in read mode, reads its whole content, and splits it into lines. The result is a list where each element is one line of the file, that is, one name.
chars = sorted(list(set(''.join(words))))
This line obtains all the unique characters that make up the character vocabulary: the 26 lowercase letters of the alphabet. The execution goes as follows. First, join concatenates all elements of the list words using '' (the empty string) as the separator, which results in a single string. Then, set turns that string into a set of characters without repetition. Next, list builds a list from the set. Finally, sorted puts the characters in alphabetical order.
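To see the chain of calls at work, here is a minimal sketch on a tiny list. The name toy_words and its three entries are assumptions for illustration, not the contents of names.txt.

```python
# Hypothetical three-name list standing in for the contents of names.txt.
toy_words = ["emma", "ava", "mia"]

joined = ''.join(toy_words)    # one long string: 'emmaavamia'
unique = set(joined)           # distinct characters, in no particular order
chars = sorted(list(unique))   # alphabetical list of the distinct characters

print(joined)  # emmaavamia
print(chars)   # ['a', 'e', 'i', 'm', 'v']
```

With the real file, the same steps should yield the 26 lowercase letters.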
s_to_i = {s:i+1 for i, s in enumerate(chars)}
This line builds a mapping from each character (a string) to an integer index. The i+1 makes the letters start at index 1, leaving index 0 free.
s_to_i['.'] = 0
This line adds the character '.', mapped to index 0, which marks both the start and the end of a word for the model.
i_to_s = {i:s for s, i in s_to_i.items()}
This line creates a mapping from index to string.
vocab_size = len(i_to_s)
This line defines the variable vocab_size that indicates the number of characters in the vocabulary, which is 27.
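The two mappings can be checked with a quick round trip. This is a minimal sketch on a reduced vocabulary; the three-letter chars list is an assumption for illustration, while the real one has 26 letters.

```python
# Hypothetical small vocabulary; the real one has the 26 lowercase letters.
chars = ['a', 'e', 'm']

s_to_i = {s: i + 1 for i, s in enumerate(chars)}  # letters start at index 1
s_to_i['.'] = 0                                   # index 0 is reserved for '.'
i_to_s = {i: s for s, i in s_to_i.items()}        # inverse mapping
vocab_size = len(i_to_s)

encoded = [s_to_i[ch] for ch in "emma."]          # characters -> indices
decoded = ''.join(i_to_s[i] for i in encoded)     # indices -> characters

print(s_to_i)      # {'a': 1, 'e': 2, 'm': 3, '.': 0}
print(encoded)     # [2, 3, 3, 1, 0]
print(decoded)     # emma.
print(vocab_size)  # 4
```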
block_size = 3
This line defines the block size, that is, the context length the model trains with. In this case, the model is going to look at 3 characters to predict the next one. For example, it sees emm and predicts a, forming the name emma.
def build_dataset(words, block_size):
    X, Y = [], []
    for w in words:
        #print(w)
        context = [0] * block_size
        for ch in w + '.':
            ix = s_to_i[ch]
            X.append(context)
            Y.append(ix)
            context = context[1:] + [ix] # crop and append.
    X = torch.tensor(X)
    Y = torch.tensor(Y)
    print(f"{X.shape}, {Y.shape}.")
    return X, Y
This block defines the function build_dataset, which receives a list of words and the block size and returns the pair X and Y. Each row of X is a context of block_size (here three) character indices, and the matching entry of Y is the index of the character that follows that context.
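The sliding window is easier to follow when traced on a single name. The sketch below repeats the inner loop in plain Python lists, without torch tensors; the reduced s_to_i mapping is an assumption for illustration.

```python
# Plain-Python trace of the windowing logic for one word.
# The tiny mapping below is a hypothetical stand-in for the full vocabulary.
s_to_i = {'.': 0, 'a': 1, 'e': 2, 'm': 3}
i_to_s = {i: s for s, i in s_to_i.items()}
block_size = 3

pairs = []
context = [0] * block_size          # start with '...'
for ch in "emma" + '.':
    ix = s_to_i[ch]
    pairs.append((context, ix))     # (input context, target index)
    context = context[1:] + [ix]    # slide the window by one character

for ctx, ix in pairs:
    print(''.join(i_to_s[i] for i in ctx), '->', i_to_s[ix])
# ... -> e
# ..e -> m
# .em -> m
# emm -> a
# mma -> .
```

So a four-letter name contributes five training examples, the last one teaching the model where the name ends.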
import random
random.seed(42)
random.shuffle(words)
n1 = int(0.8*len(words))
n2 = int(0.9*len(words))
Xtr, Ytr = build_dataset(words[:n1], block_size)
Xdev, Ydev = build_dataset(words[n1:n2], block_size)
Xt, Yt = build_dataset(words[n2:], block_size)
This block uses random to fix a seed, so the shuffle is reproducible, and then shuffles the word list. After that, it defines two cut points, n1 and n2, at 80% and 90% of the list, which are used to slice the words. So, three datasets are defined: the pair Xtr and Ytr for training (the first 80% of the words), the pair Xdev and Ydev for validation (the next 10%), and Xt and Yt for testing (the last 10%).
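The slicing can be checked on a placeholder list. This is a minimal sketch; the 1,000 generated names are an assumption used only to count the sizes of the three slices.

```python
import random

# Hypothetical word list of 1,000 placeholder names, for counting only.
words = [f"name{i}" for i in range(1000)]

random.seed(42)
random.shuffle(words)

n1 = int(0.8 * len(words))  # end of the training slice
n2 = int(0.9 * len(words))  # end of the validation slice

train, dev, test = words[:n1], words[n1:n2], words[n2:]
print(len(train), len(dev), len(test))  # 800 100 100
```

Note that the split is made over words, not over (context, target) examples, so the three datasets built from the slices will not be in an exact 80/10/10 ratio of examples.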
Important to state here that this deep comprehension for each line tremendously helps with the understanding of the model training. This is because you know what the dataset is about and what the X and Y mean.
The model.
Let's build the MLP.
The code is the following.
n_embd = 10
n_hidden = 200

g = torch.Generator().manual_seed(2147483647) # for reproducibility.
C = torch.randn((vocab_size, n_embd), generator=g)
W1 = torch.randn((n_embd * block_size, n_hidden), generator=g)
b1 = torch.randn(n_hidden, generator=g)
W2 = torch.randn((n_hidden, vocab_size), generator=g)
b2 = torch.randn(vocab_size, generator=g)

parameters = [C, W1, b1, W2, b2]

print(f"{sum(p.nelement() for p in parameters)}.") # number of parameters in total.

for p in parameters:
    p.requires_grad = True
Line 1 defines the embedding vector dimensionality for each character. So, the 10 means that for each one of the 27 characters, the representation would be a 10-dimensional vector. So, for example, the character a maps to a 10-dimensional vector.
Line 2 defines the number of neurons in the hidden layer. 200 neurons. Each neuron receives all 30 inputs (3 characters × 10 dimensions each) and produces a single output.
Line 4 creates a random number generator with a fixed seed. The seed 2147483647 ensures that every time we run this code, the same random numbers are generated. This is crucial for reproducibility — we want to be able to compare results across experiments.
Line 5 creates the embedding matrix C with shape (27, 10). Each row is the embedding vector for one character. At this point, these vectors are random. The network will learn meaningful embeddings during training.
Line 6 creates the weight matrix W1 with shape (30, 200). This connects the concatenated embedding (30 values) to the 200 hidden neurons. Notice that there is no scaling here: the weights are drawn from a standard normal distribution $\mathcal{N}(0, 1)$. This is important. This is the problematic initialization that we will need to fix.
Line 7 creates the bias vector b1 with shape (200,). One bias per hidden neuron.
Line 8 creates the weight matrix W2 with shape (200, 27). This connects the 200 hidden neurons to the 27 output logits, one per character in the vocabulary.
Line 9 creates the bias vector b2 with shape (27,). One bias per output character.
Line 11 collects all the parameters into a single list. This makes it easy to iterate over them during the backward pass and the update step.
Line 13 prints the total number of parameters. The count is 11,897: 270 for C, 6,000 for W1, 200 for b1, 5,400 for W2, and 27 for b2.
Lines 15–16 enable gradient tracking for all parameters. PyTorch needs this flag to compute gradients during loss.backward().
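The parameter count follows directly from the shapes listed above. A minimal sketch of the arithmetic in plain Python, without creating any tensors (the shapes dictionary is just a bookkeeping device for this example):

```python
import math

vocab_size, n_embd, block_size, n_hidden = 27, 10, 3, 200

# Shapes of the five parameter tensors, as described in the text.
shapes = {
    'C':  (vocab_size, n_embd),
    'W1': (n_embd * block_size, n_hidden),
    'b1': (n_hidden,),
    'W2': (n_hidden, vocab_size),
    'b2': (vocab_size,),
}

# Number of elements of a tensor = product of its dimensions.
counts = {name: math.prod(shape) for name, shape in shapes.items()}
total = sum(counts.values())

print(counts)  # {'C': 270, 'W1': 6000, 'b1': 200, 'W2': 5400, 'b2': 27}
print(total)   # 11897
```

Almost all of the parameters sit in the two weight matrices W1 and W2.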
The forward pass.
Now, the training loop. This is where the model actually learns.
max_steps = 200000
batch_size = 32
lossi = []
Line 1 defines the total number of training iterations. Line 2 defines the mini-batch size — 32 examples per step. Line 3 creates an empty list to store the loss at each step, for later visualization.
The loop:
for i in range(max_steps):
    # minibatch construct.
    ix = torch.randint(0, Xtr.shape[0], (batch_size,), generator=g)
    Xb, Yb = Xtr[ix], Ytr[ix]
Line 4 randomly selects 32 indices from the training set. This is the mini-batch sampling. We do not train on the full dataset at each step — that would be too slow. Instead, we estimate the gradient from a small random subset.
Line 5 uses those indices to extract the corresponding inputs Xb and targets Yb.
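The sampling step can be mimicked in plain Python. This is only an analogue: random.Random stands in for the torch generator g, n_train and the toy Xtr and Ytr are assumptions, and the sampled values will differ from what PyTorch produces.

```python
import random

# Plain-Python analogue of torch.randint(0, n_train, (batch_size,)).
n_train = 1000                    # hypothetical training-set size
batch_size = 32
rng = random.Random(2147483647)   # stand-in for the torch generator g

# Indices are drawn independently, so the same index may appear twice.
ix = [rng.randrange(0, n_train) for _ in range(batch_size)]

# Hypothetical data: row k of Xtr is [k, k, k] and its label is k % 27.
Xtr = [[k, k, k] for k in range(n_train)]
Ytr = [k % 27 for k in range(n_train)]

# Select the sampled rows, like Xtr[ix] and Ytr[ix] with tensors.
Xb = [Xtr[k] for k in ix]
Yb = [Ytr[k] for k in ix]

print(len(Xb), len(Yb))  # 32 32
```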
    # forward pass.
    emb = C[Xb]
    embcat = emb.view(emb.shape[0], -1)
Line 7 performs the embedding lookup. C[Xb] uses the indices in Xb to select rows from the embedding matrix C. The result emb has shape (32, 3, 10) — 32 examples, each with 3 characters, each character represented by a 10-dimensional vector.
Line 8 concatenates the three embedding vectors into a single vector. The .view(emb.shape[0], -1) reshapes from (32, 3, 10) to (32, 30). The -1 tells PyTorch to compute the second dimension automatically. Now each example is a flat vector of 30 numbers.
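The lookup and the flattening can be followed by hand on a shrunken example. The sketch below uses plain nested lists instead of tensors, with assumed sizes of 4 characters and 2-dimensional embeddings so that every number is visible.

```python
# Plain-Python sketch of C[Xb] followed by .view(batch, -1).
# Sizes are shrunk (hypothetical): 4 characters, 2-dimensional embeddings.
C = [[0.0, 0.1],   # row 0: embedding of '.'
     [1.0, 1.1],   # row 1
     [2.0, 2.1],   # row 2
     [3.0, 3.1]]   # row 3

Xb = [[0, 0, 2],   # two examples, each a context of 3 character indices
      [0, 2, 3]]

# Lookup: replace every index with its embedding row -> shape (2, 3, 2).
emb = [[C[ix] for ix in context] for context in Xb]

# Flatten: join the 3 vectors of each example end to end -> shape (2, 6).
embcat = [[v for vec in example for v in vec] for example in emb]

print(embcat[0])  # [0.0, 0.1, 0.0, 0.1, 2.0, 2.1]
print(embcat[1])  # [0.0, 0.1, 2.0, 2.1, 3.0, 3.1]
```

In the real model the same thing happens with 32 examples and 10-dimensional embeddings, giving the (32, 30) matrix.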
    hpreact = embcat @ W1 + b1
    h = torch.tanh(hpreact)
Line 9 computes the pre-activation of the hidden layer. The @ operator is matrix multiplication: (32, 30) @ (30, 200) = (32, 200). Then we add the bias b1, which broadcasts across the batch. Each of the 32 examples now has 200 pre-activation values.
Line 10 applies the tanh activation function. This squashes each value to the range $(-1, 1)$. The result h has the same shape (32, 200). This non-linearity is what gives the network its expressive power. Without it, the entire network would collapse to a single linear transformation, no matter how many layers we stack.
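A few values of tanh show why the scale of the pre-activations matters. The sketch below rests on an assumption that is only a rough approximation: if the 30 inputs and the unscaled weights are independent with roughly unit variance, a pre-activation has a standard deviation of about the square root of 30.

```python
import math

# tanh is close to linear near 0 and flattens out for large |x|.
for x in [0.0, 0.5, 1.0, 2.0, 5.0, -5.0]:
    print(f"tanh({x:+.1f}) = {math.tanh(x):+.4f}")

# Rough scale of a pre-activation with 30 unit-variance inputs and
# unit-variance weights (assumption: independent terms, bias ignored).
fan_in = 30
typical_scale = math.sqrt(fan_in)
print(f"{typical_scale:.2f}")              # 5.48
print(f"{math.tanh(typical_scale):.5f}")   # very close to 1
```

At that scale many hidden units end up pinned near -1 or +1, which is the saturation problem associated with the unscaled initialization.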
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Yb)
Line 11 computes the output logits. The matrix multiplication (32, 200) @ (200, 27) = (32, 27) produces 27 scores per example — one per character in the vocabulary. These are the raw, unnormalized predictions.
Line 12 computes the cross-entropy loss. Internally, F.cross_entropy does three things: it applies softmax to convert logits into probabilities, it extracts the probability assigned to the correct character (given by Yb), and it takes the negative log. The result is a single scalar — the average loss over the mini-batch.
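The three steps can be written out by hand for a single example. This is a minimal sketch in plain Python, not PyTorch's implementation; the helper cross_entropy and the logit values are assumptions for illustration, and the batch average is left out.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    m = max(logits)                           # subtract the max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]         # softmax
    return -math.log(probs[target])           # pick the target, negative log

# All 27 logits equal: every character gets probability 1/27.
uniform = cross_entropy([0.0] * 27, target=5)

# One large logit on the wrong character: confidently wrong.
confident_wrong = cross_entropy([10.0] + [0.0] * 26, target=5)

print(f"{uniform:.4f}")          # 3.2958, which is log(27)
print(f"{confident_wrong:.4f}")  # much larger than the uniform case
```

The uniform case gives a reference point: a model that knows nothing and spreads its probability evenly over the 27 characters has a loss of about 3.30, and large logits on wrong characters push the loss well above that.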
The backward pass and update.
    # backward pass.
    for p in parameters:
        p.grad = None
    loss.backward()
Lines 14–15 reset all gradients to None. This is necessary because PyTorch accumulates gradients by default. Without resetting, the gradients from the previous step would add to the current ones, which is not what we want.
Line 16 computes the gradients of the loss with respect to all parameters. This is the backpropagation step. After this call, every parameter p in our list has a .grad attribute containing the gradient $\partial \text{loss} / \partial p$, a tensor with the same shape as p.
    # update.
    lr = 0.1 if i < 100000 else 0.01
    for p in parameters:
        p.data += -lr * p.grad
Line 18 defines the learning rate with a step decay: 0.1 for the first 100k steps, then 0.01 for the remaining 100k. The higher rate allows fast initial progress; the lower rate allows fine-tuning.
Lines 19–20 apply the gradient descent update. For each parameter, we move in the direction opposite to the gradient, scaled by the learning rate. The p.data is used instead of p directly to avoid PyTorch tracking this operation in the computational graph.
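The schedule and the update rule can be tried on a toy problem. The sketch below is an assumption-laden illustration: learning_rate is a hypothetical helper wrapping the same condition, and the function being minimised, f(p) = (p - 3)^2, has nothing to do with the MLP.

```python
def learning_rate(i, decay_step=100_000):
    """Step decay: 0.1 before decay_step, 0.01 from decay_step onwards."""
    return 0.1 if i < decay_step else 0.01

# Toy problem (hypothetical): minimise f(p) = (p - 3)^2.
p = 0.0
for i in range(50):
    grad = 2.0 * (p - 3.0)            # derivative of f at the current p
    p += -learning_rate(i) * grad     # step against the gradient

print(learning_rate(0), learning_rate(99_999), learning_rate(100_000))  # 0.1 0.1 0.01
print(p)  # close to the minimum at 3
```

The update has the same form as in the training loop; only the way the gradient is obtained differs, since there it comes from loss.backward().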
    # track stats.
    if i % 10000 == 0:
        print(f'{i:7d}/{max_steps:7d}: {loss.item():.4f}')
    lossi.append(loss.log10().item())
Lines 22–23 print the loss every 10,000 steps. Line 24 stores the base-10 logarithm of the loss at every step. Plotting on a log scale makes the curve easier to read: it compresses the large losses at the start of training, and a loss that decays exponentially shows up as a roughly straight line.
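The effect of the logarithm is easy to see on a few numbers. The loss values below are made up for illustration and are not results from the training run.

```python
import math

# Hypothetical loss values: one large early loss, then slow improvement.
losses = [27.0, 3.3, 2.5, 2.1]

# Same transformation as loss.log10().item(), applied to plain floats.
lossi = [math.log10(x) for x in losses]

print([round(v, 3) for v in lossi])  # [1.431, 0.519, 0.398, 0.322]
```

The first value is about 13 times larger than the last on the raw scale but only about 4 times larger after the logarithm, so the later, smaller changes stay visible in the plot.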
Conclusion.
The forward pass is a sequence of measurable transformations. The embedding lookup converts discrete symbols into continuous vectors. The matrix multiplication projects those vectors into a new space. The activation function introduces non-linearity. The output layer projects back to the vocabulary space. The loss function measures how far the predictions are from reality.
Each of these steps produces a tensor with specific statistical properties: mean, standard deviation, and range. When those statistics are healthy, the network learns. When they are not (when activations saturate, gradients vanish, or logits explode), the network struggles to learn.