The math in self-attention

19 Jun, 2026

So, a failure at TMLR. Focus now on what interests you, man. I finished the embedding model discussion with Karpathy. Now comes the self-attention perspective, and Karpathy starts with a discussion of the tokens communicating with each other.

I would like to think here about whether this would be the case. From first principles. Let me first describe the trick that Karpathy says would be the simplest. How should the tokens communicate? Considering the 8 tokens of a batch along the time dimension, Karpathy explains that the fifth token should not know about the later tokens, ok, because the prediction is precisely about them. The fifth character should know about the previous ones. So, in this sense, Karpathy says that the simplest mode of communication would be an average of the tokens, in this case, of the characters. So, a feature vector with the average of the previous ones.
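To make the averaging trick concrete for myself, here is a minimal sketch in plain Python. The names `causal_average` and `causal_average_weights` are my own, not Karpathy's, and I am assuming that the average at position t includes the token at position t itself along with the earlier ones, which is how I remember it from the lecture. The second function expresses the same average as a lower-triangular weight matrix, which I believe is the form that leads toward attention.

```python
def causal_average(tokens):
    """Average each position with all earlier positions.

    tokens: list of T feature vectors, each a list of C floats.
    Assumption: position t averages tokens 0..t, including itself.
    """
    channels = len(tokens[0])
    running = [0.0] * channels  # running sum over the positions seen so far
    out = []
    for t, vec in enumerate(tokens):
        for c in range(channels):
            running[c] += vec[c]
        out.append([s / (t + 1) for s in running])
    return out


def causal_average_weights(T):
    """Row t holds the weight 1/(t+1) on positions 0..t and 0 on later ones."""
    return [[1.0 / (t + 1) if s <= t else 0.0 for s in range(T)]
            for t in range(T)]


def apply_weights(weights, tokens):
    """Plain matrix product: (T x T) weights times (T x C) tokens."""
    channels = len(tokens[0])
    return [[sum(w * tok[c] for w, tok in zip(row, tokens))
             for c in range(channels)]
            for row in weights]


# Toy input: 3 tokens with 2 features each (made-up numbers).
x = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
xbow = causal_average(x)
xbow_matrix = apply_weights(causal_average_weights(len(x)), x)
print(xbow)
```

The zeros above the diagonal in the weight matrix are what keep a token from seeing the later ones. Both routes should give the same result up to floating-point rounding.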

Fundamentally, is this the way a child proceeds with thought? Or anyone, when thinking. When adults think, is it like that? For a child the question is about learning, but let us think about the adult thinking. Like, we are considering the Tolstoy dataset. How does composition happen? And remember here that Tolstoy revised a lot. This is interesting. But, ok, with RL, the models are agentic and check their own work, but this is not the main aspect. It is the production per se.

Let us consider Tolstoy picking up the pen to write. Do the ideas arise as tokens? Consider that there was a plan and now it is just the time for writing, composing. The story was thought out and a particular sentence would be written. The human relates the words, but there is a concept, a vision in his mind to tell the story. Like, Tolstoy did not write only from reading Dickens. Perhaps learning to write is helped by reading, this is known from some perspective, but fundamentally the quality comes from the author's honesty regarding what he wants to write. It seems like a digression, but I am trying to think from first principles about the act of writing, which is circumscribed by the task.

So, is it just an average of the previous characters or even tokens?

That is the challenge. Like, ok, my writing here. It is not established how we choose the next words we write. For sure, it is not the same as what the models do. Sample efficiency counts. Now thinking about my son. There is learning there.

He watched me writing and said one day, at just 2 years and 10 months old, "How difficult, Daddy." Really special.

Let's abstract away the motor skills involved in writing and think about composition. From first principles, remembering here that my son started out saying only "pa." and repeated that syllable. Later, with time, it became "Papá." and he knew that it referred to me. The world model may also not be correct. Like, ok, challenging. But that is great, studying with Karpathy. Think.

Really great learning.