Breno

The linear case: a network without non-linearities

In the lecture, Karpathy discusses what happens if you leave out the activation function. The experiment is simple: take the same deep MLP and remove all the tanh layers. What remains is a stack of linear transformations, and a stack of linear transformations is just one linear transformation.

If $h = W_3(W_2(W_1 x))$, then $h = (W_3 W_2 W_1)x = W_{\text{eff}}\,x$. The depth is an illusion. No matter how many layers you stack, the network can only learn a linear function of its input. This is the fundamental reason non-linearities exist: without them, there is no point in having more than one layer.
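The collapse is easy to check numerically. The sketch below is not code from the lecture; it uses plain Python lists and made-up layer sizes (4 → 8 → 8 → 3) to compare a three-layer linear stack against the single product matrix. The helper names `rand_matrix`, `matmul`, and `matvec` are introduced here only for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def rand_matrix(rows, cols):
    """Matrix of standard-normal entries, stored as a list of rows."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """Plain matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(m, v):
    """Matrix-vector product m @ v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

# Hypothetical layer sizes: 4 -> 8 -> 8 -> 3
W1, W2, W3 = rand_matrix(8, 4), rand_matrix(8, 8), rand_matrix(3, 8)
x = [random.gauss(0.0, 1.0) for _ in range(4)]

h_stacked = matvec(W3, matvec(W2, matvec(W1, x)))  # three layers, no tanh
W_eff = matmul(W3, matmul(W2, W1))                 # one 3x4 matrix
h_single = matvec(W_eff, x)

# The two outputs should agree up to floating-point round-off.
max_diff = max(abs(a - b) for a, b in zip(h_stacked, h_single))
print(max_diff)
```

The three weight matrices hold 4·8 + 8·8 + 8·3 = 120 numbers, yet the function they compute is fully described by the 12 entries of `W_eff`.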

But the diagnostic plots reveal something even more instructive. When Karpathy looks at the gradient statistics for the fully linear case, with the weights initialized at the right scale (a gain of 1 rather than the 5/3 used for tanh), all the layers show nearly identical gradient distributions. This makes sense: with no activation function between layers, the backward pass is just a chain of matrix multiplications, and nothing squashes the gradient along the way. The gradient flows uniformly because there is nothing to obstruct it. No saturation, no vanishing, but also no expressive power.
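A rough way to see this uniformity without plots is to push a random gradient backwards through a stack of linear layers and record its standard deviation at each depth. This is a minimal sketch, not the lecture's code. It assumes square layers of a hypothetical width 100, Gaussian weights with standard deviation `gain / sqrt(fan_in)` and a gain of 1, and a random vector standing in for the gradient coming from the loss.

```python
import math
import random

random.seed(0)

WIDTH, DEPTH = 100, 5  # hypothetical sizes, chosen only for illustration

def init_layer(fan_out, fan_in, gain=1.0):
    """Gaussian weights with std = gain / sqrt(fan_in)."""
    std = gain / math.sqrt(fan_in)
    return [[random.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

def backprop_linear(W, grad_out):
    """Gradient w.r.t. a linear layer's input: W^T @ grad_out."""
    return [sum(W[i][j] * grad_out[i] for i in range(len(W)))
            for j in range(len(W[0]))]

def std(v):
    """Population standard deviation of a list of floats."""
    mean = sum(v) / len(v)
    return math.sqrt(sum((a - mean) ** 2 for a in v) / len(v))

layers = [init_layer(WIDTH, WIDTH) for _ in range(DEPTH)]

# Stand-in for the gradient arriving from the loss at the top of the stack.
grad = [random.gauss(0.0, 1.0) for _ in range(WIDTH)]

grad_stds = [std(grad)]
for W in reversed(layers):  # walk from the output back towards the input
    grad = backprop_linear(W, grad)
    grad_stds.append(std(grad))

for depth, s in enumerate(grad_stds):
    print(f"{depth} layers below the output: grad std = {s:.3f}")
```

With a gain of 1 the printed values should stay close to each other. Passing a different `gain` to `init_layer` makes them grow or shrink geometrically with depth, which is why the gain has to match the (absent) non-linearity.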

The moment you add tanh back, two things happen simultaneously. First, the network gains the ability to approximate non-linear functions, which is the whole point. Second, gradient flow becomes non-trivial: the $(1 - \tanh^2)$ factor at each layer means gradients can shrink (if activations saturate) or pass through (if activations are moderate). This is the tradeoff that the entire lecture is about. The non-linearity gives you representational power, but it also introduces the possibility of vanishing gradients. Everything else (Kaiming initialization, batch normalization, careful gain calibration) exists to manage this tradeoff.
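To make the size of that factor concrete, the short sketch below evaluates the local derivative of tanh, $1 - \tanh^2(z)$, at a few pre-activation values. The function name `tanh_grad_factor` and the sample values of `z` are chosen here for illustration only.

```python
import math

def tanh_grad_factor(z):
    """Local derivative of tanh at pre-activation z: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

# Moderate pre-activations let most of the gradient through;
# saturated ones multiply it by a number close to zero.
for z in (0.0, 0.5, 2.0, 5.0):
    print(f"z = {z:>3}: tanh = {math.tanh(z):.4f}, factor = {tanh_grad_factor(z):.4f}")
```

At `z = 0` the factor is exactly 1, so the gradient passes unchanged. It then falls off quickly as `|z|` grows, and because the backward pass multiplies one such factor per layer, a few saturated layers in a row are enough to shrink the gradient dramatically.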

The linear case is the control experiment. It shows what the gradient plots should look like in the absence of any pathology, and makes it clear that the gradient-flow problems we observe in deep networks are introduced by the very non-linearities we need.