Breno

The implementation of the drifting model

The implementation has several moving parts: the tokenizer, the architecture, and a handful of other design choices worth discussing.

Section 3 of the drifting model paper gave us the theory: a drifting field V that pushes generated samples toward the data distribution, a loss function that is simply MSE between the network output and its drifted target, and the elegant equilibrium condition where V vanishes when p = q. That is the mathematics. Section 4 is where Deng et al. tell us how they actually built the thing for image generation. And honestly, the engineering choices here are just as interesting as the theory.
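To make the loss concrete, here is a toy numpy sketch of the training objective described above. The drifting field here is a made-up stand-in (it just pushes samples toward an assumed data mean); in the paper, V is estimated from batches of generated and real samples, so treat every name below as illustrative, not as the authors' implementation.

```python
import numpy as np

def drift_field(x, data_mean):
    # Toy drifting field (assumption): push each sample toward the data mean.
    # The real V compares generated samples against real data; this stand-in
    # only preserves the key property that V vanishes at equilibrium.
    return data_mean - x

def drifting_loss(output, data_mean):
    # The target is the network output plus its drift, treated as a constant
    # (no gradient flows through it); the loss is plain MSE against it.
    target = output + drift_field(output, data_mean)
    return np.mean((output - target) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2))   # pretend these are generated samples
mu = np.zeros(2)              # pretend this summarizes the data distribution
print(drifting_loss(x, mu))
```

Note the equilibrium property falls out directly: when the samples sit exactly where the field wants them, V is zero, the target equals the output, and the loss is zero.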

There are six subsections in Section 4: tokenizer, architecture, CFG conditioning, batching, feature extractor, and pixel-space generation. Today I will focus on the first three.

Tokenizer. The generator f maps Gaussian noise to a latent representation. Following the standard protocol established by Latent Diffusion Models, they use a pretrained VAE to encode images into a lower-dimensional latent space. The generator works in this latent space, and the VAE decoder maps back to pixel space at inference time. This is the same tokenizer approach used by most modern image generators — nothing surprising here, but it is the foundation that makes everything else computationally tractable.
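A shape-level sketch of that LDM-style pipeline, to make the data flow explicit. The `encode`/`decode` functions below are dummies that only model tensor shapes; a real setup would load a pretrained VAE, and the downsampling factor and latent channel count (8 and 4 here) are typical LDM values I am assuming, not numbers from the paper.

```python
import numpy as np

DOWN = 8         # typical VAE spatial downsampling factor (assumption)
LATENT_CH = 4    # typical latent channel count (assumption)

def encode(images):
    # Stand-in for a pretrained VAE encoder: (N, 3, H, W) -> latents.
    n, c, h, w = images.shape
    return np.zeros((n, LATENT_CH, h // DOWN, w // DOWN))

def decode(latents):
    # Stand-in for the VAE decoder, used only at inference time.
    n, c, h, w = latents.shape
    return np.zeros((n, 3, h * DOWN, w * DOWN))

imgs = np.zeros((2, 3, 256, 256))
z = encode(imgs)          # the generator f works entirely in this space
print(z.shape)            # (2, 4, 32, 32)
print(decode(z).shape)    # (2, 3, 256, 256)
```

The point of the exercise: the generator never sees a 256x256 image, only a 32x32x4 latent, which is what keeps training and sampling tractable.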

Architecture. The generator itself is a transformer. Think DiT-style, but adapted for the drifting paradigm. The model takes noise tokens as input, conditions on class labels, and outputs latent tokens that decode into images. They use in-context tokens and random style embeddings to inject diversity into the generations. The full model has 463M parameters. One key observation: because the drifting model only needs a single forward pass at inference time, the architecture does not need to handle timestep conditioning. There is no diffusion timestep. The network is simply a map from noise to data, which makes the architecture cleaner than its diffusion counterparts.
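The interface is the interesting part, so here is a deliberately tiny sketch of it: a single linear map standing in for the paper's DiT-style transformer. Everything about the internals is a placeholder; what the sketch is meant to show is the signature, a direct map from noise tokens plus class and style conditioning to output tokens, with no timestep argument anywhere.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16                                   # toy token width (assumption)
W = rng.normal(scale=0.1, size=(3 * D, D))

def generator(noise_tokens, class_emb, style_emb):
    # Concatenate the conditioning onto every noise token, then apply one
    # linear map. A real model would be a deep transformer over in-context
    # tokens; note the signature has no timestep input.
    n = noise_tokens.shape[0]
    cond = np.concatenate([class_emb, style_emb])
    x = np.concatenate([noise_tokens, np.tile(cond, (n, 1))], axis=1)
    return x @ W

tokens = rng.normal(size=(64, D))        # 64 noise tokens
out = generator(tokens, rng.normal(size=D), rng.normal(size=D))
print(out.shape)                         # (64, 16)
```

Sampling a fresh random style embedding per call is what plays the diversity role here, in the spirit of the paper's random style embeddings.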

CFG conditioning. This is one of the most clever design choices. In diffusion models, Classifier-Free Guidance requires running the model twice at inference time — once with the condition and once without — then combining the outputs. This doubles the computational cost. The drifting model handles CFG at training time instead. They take the guidance scale α as an explicit input to the network, so at inference time you simply feed the desired α and get the guided output in a single pass. The model learns the full spectrum of guidance strengths during training.
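The contrast is easy to see side by side. Below, both "models" are one-line toys of my own invention; the only faithful part is the structure: diffusion-style CFG calls the network twice and combines the outputs, while the drifting approach passes α in as an ordinary input and calls the network once.

```python
import numpy as np

def diffusion_cfg(model, x, cond, alpha):
    # Classic classifier-free guidance: two forward passes per step,
    # conditional and unconditional, combined with the guidance scale.
    uncond = model(x, None)
    return uncond + alpha * (model(x, cond) - uncond)

def drifting_cfg(model, noise, cond, alpha):
    # Drifting-style CFG: alpha is just another network input, one pass.
    return model(noise, cond, alpha)

def toy_eps(x, cond):
    # Toy denoiser (assumption): conditioning shifts the output by 1.
    return x + (1.0 if cond is not None else 0.0)

def toy_drift_net(noise, cond, alpha):
    # Toy drifting network (assumption) that has already learned the
    # guided output as a function of alpha during training.
    return noise + alpha * 1.0

x = np.array([0.5])
print(diffusion_cfg(toy_eps, x, "cat", 2.0))    # [2.5], two model calls
print(drifting_cfg(toy_drift_net, x, "cat", 2.0))  # [2.5], one model call
```

In the toy setup the two agree by construction; the real claim is about cost, not values: amortizing the guidance spectrum into training removes the 2x inference penalty.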

What strikes me most about these choices is how the implementation reveals the true nature of the paradigm shift. The absence of a timestep input in the architecture and the CFG-at-training-time trick are direct consequences of moving the iterative process from inference to training. Once you accept that premise, the engineering follows naturally. The architecture becomes simpler, and the inference becomes cheaper. One forward pass, one output, done.