The self-attention first minimum code
So, still thinking considering a small example and perspective for selt-attention e, nesse contexto, thinking in the child learning. That is research, basic, from first principles.
So, for this post, just that. The minimum code that is built.
B, T, C = 1, 2, 2
x = torch.randn((B, T, C))
Just that, precisamos ainda pensar na questão dos keys e queries.
Mas estabelecido assim um batch, duas sílabas porque seria character-level e esse C precisamos pensar, like, what this means? Sure, it is the capacity for the semantic. But this really needs to be investigated carefully.
Really great learning.