The self-attention first minimum code

23 Jun, 2026

So, still thinking considering a small example and perspective for selt-attention e, nesse contexto, thinking in the child learning. That is research, basic, from first principles.

So, for this post, just that. The minimum code that is built.

B, T, C = 1, 2, 2

x = torch.randn((B, T, C))

Just that, precisamos ainda pensar na questão dos keys e queries.

Mas estabelecido assim um batch, duas sílabas porque seria character-level e esse C precisamos pensar, like, what this means? Sure, it is the capacity for the semantic. But this really needs to be investigated carefully.

Really great learning.