Breno

To guard the guardians

In the Republic, Plato faces a problem that resonates far beyond ancient Athens: if we build guardians to protect the city, who ensures the guardians themselves don't become corrupt?

Plato's answer was education — a formation so deep that the guardian would internalize the good. External control wasn't enough; the guardian's internal structure had to be aligned with its function.

Two and a half millennia later, we face the same problem. We train increasingly powerful models and try to align them with human values.

But how do we know if the alignment is real or superficial? How do we know if the guardian has internalized the good or merely learned to appear good?

This is exactly where interpretability comes in. If alignment is the attempt to educate the guardian, interpretability is the attempt to look inside it — to understand what actually formed in there. Without interpretability, alignment is an act of faith. With it, we at least have a chance of answering Plato's question.

And the question is becoming urgent. The trajectory of AI research points toward models that don't just follow instructions but do science — formulate hypotheses, design experiments, discover. The more autonomous and capable the guardian becomes, the deeper its education must be. A tool that executes commands can be controlled from the outside. A scientist cannot. If we are building AI that thinks for itself, then Plato's answer is not a metaphor — it is the research plan.