A training principle for drifting models

12 Feb, 2026

Classifier-Free Guidance (CFG) is a technique that improves generation quality by combining a model's class-conditional prediction with its unconditional prediction.

The drifting formulation considers a related form of guidance. It uses real samples from a specific class as positives and, as negatives, generated samples mixed with real samples drawn from all classes.

What Is CFG?

In standard diffusion models, Classifier-Free Guidance works at inference time. You train a model to predict both a conditional output (given a class label c) and an unconditional output (no label). At sampling time, you extrapolate away from the unconditional output, past the conditional one:

$\tilde{ϵ} = (1 + w) \cdot ϵ_{θ} (x | c) - w \cdot ϵ_{θ} (x)$ .

The parameter w controls the guidance strength. Higher w means sharper, more class-specific samples, but at the cost of diversity. This is purely a sampling trick — nothing about the training loss changes. The model learns the conditional and unconditional distributions separately, and the guidance happens only when you generate.
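To make the formula concrete, here is a minimal sketch of the combination step in plain Python. The function name `cfg_combine` and the toy vectors are assumptions for illustration; in a real sampler the two inputs would come from two forward passes of the network.

```python
def cfg_combine(eps_cond, eps_uncond, w):
    """Return (1 + w) * eps_cond - w * eps_uncond, element by element."""
    return [(1.0 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]


# Made-up stand-ins for the conditional and unconditional predictions.
eps_cond = [0.5, -1.0, 2.0]
eps_uncond = [0.0, -0.5, 1.0]

# w = 0 leaves the conditional prediction unchanged.
guided_0 = cfg_combine(eps_cond, eps_uncond, 0.0)

# w = 2 pushes the result further away from the unconditional prediction.
guided_2 = cfg_combine(eps_cond, eps_uncond, 2.0)
print(guided_0, guided_2)
```

With w = 0 the result is just the conditional prediction. Larger w moves each component further in the direction that separates the conditional prediction from the unconditional one.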

This has always felt somewhat unsatisfying to me. You train one thing and then, at generation time, you do something mathematically different from what you optimized for. The generated distribution with CFG is not the distribution the model was trained to approximate.

How Drifting Models Reframe CFG.

In drifting models, the whole game is different. Recall the core mechanism: samples are "drifted" during training by a field V that attracts generated samples toward real data (positive samples, y⁺ ~ p_data) and repels them from other generated samples (negative samples, y⁻ ~ q_θ). When the generated distribution matches the data distribution, the drifting field becomes zero, and we reach equilibrium.

Now, Section 3.5 of the paper introduces a simple but powerful modification. For class-conditional generation with label c, the positive samples are drawn from the class-conditional data distribution: y⁺ ~ p_data(·|c). So far, nothing surprising. The key insight is in how the negative samples are constructed. Instead of using only generated samples as negatives, the authors mix in real data drawn from all classes:

$\tilde{q} (\cdot | c) = (1 - γ) \cdot q_{θ} (\cdot | c) + γ \cdot p_{data} (\cdot | \emptyset)$ .

Here, γ ∈ [0,1) is a mixing rate, and p_data(·|∅) is the unconditional data distribution (i.e., real images from all classes).
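One way to draw negatives from this mixture is to flip a biased coin per sample. The sketch below assumes that approach; I do not know how the paper implements it, and the name `sample_negatives` and the tagged tuples standing in for images are hypothetical.

```python
import random


def sample_negatives(generated, real_all_classes, gamma, n, rng):
    """Draw n negatives from (1 - gamma) * generated + gamma * real data."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    negatives = []
    for _ in range(n):
        # With probability gamma take a real sample (any class),
        # otherwise take a generated sample for the target class.
        pool = real_all_classes if rng.random() < gamma else generated
        negatives.append(rng.choice(pool))
    return negatives


# Toy stand-ins: tagged tuples instead of images (illustrative only).
generated = [("generated", i) for i in range(100)]
real_all_classes = [("real", i) for i in range(100)]

rng = random.Random(0)
negatives = sample_negatives(generated, real_all_classes, gamma=0.3, n=10_000, rng=rng)
real_fraction = sum(tag == "real" for tag, _ in negatives) / len(negatives)
print(f"fraction of real negatives: {real_fraction:.3f}")
```

The fraction of real samples among the negatives should land close to γ. Setting γ = 0 recovers the original setup, where every negative is a generated sample.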

Why This Works.

This is where it gets beautiful. Think about what equilibrium means in this setup. The drifting field V reaches zero when the positive and negative distributions match. The positive distribution is p_data(·|c). The negative distribution is the mixture above. Setting them equal gives:

$p_{data} (\cdot | c) = (1 - γ) \cdot q_{θ} (\cdot | c) + γ \cdot p_{data} (\cdot | \emptyset)$ .

Solving for q_θ (subtract γ · p_data(·|∅) from both sides, then divide by 1 - γ):

$q_{θ} (\cdot | c) = α \cdot p_{data} (\cdot | c) - (α - 1) \cdot p_{data} (\cdot | \emptyset)$ .

where α = 1/(1 - γ) ≥ 1, so that α - 1 = γ/(1 - γ).

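Look at this equation carefully. The learned distribution q_θ is a linear extrapolation: it starts from the unconditional data distribution and moves past the class-conditional one. Setting w = α - 1 gives the coefficients (1 + w) and -w, the same form as the CFG formula above, except that it acts on distributions rather than on noise predictions. This is exactly the spirit of original CFG, but now it emerges from the training objective itself, not from an inference-time trick.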
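The algebra is easy to check numerically on a small discrete example. The sketch below uses made-up three-outcome distributions and hypothetical helper names (`equilibrium_q`, `mix`). It computes q_θ from the closed form and confirms that mixing it back with the unconditional distribution recovers the class-conditional one.

```python
def equilibrium_q(p_cond, p_uncond, gamma):
    """Closed form: q = alpha * p_cond - (alpha - 1) * p_uncond."""
    alpha = 1.0 / (1.0 - gamma)
    return [alpha * pc - (alpha - 1.0) * pu for pc, pu in zip(p_cond, p_uncond)]


def mix(q, p_uncond, gamma):
    """Negative distribution: (1 - gamma) * q + gamma * p_uncond."""
    return [(1.0 - gamma) * qi + gamma * pu for qi, pu in zip(q, p_uncond)]


# Toy distributions over three outcomes (illustrative values only).
p_cond = [0.7, 0.2, 0.1]    # stands in for p_data(. | c)
p_uncond = [0.4, 0.3, 0.3]  # stands in for p_data(. | no label)
gamma = 0.2                 # so alpha is about 1.25

q = equilibrium_q(p_cond, p_uncond, gamma)
recovered = mix(q, p_uncond, gamma)

print(q)          # roughly [0.775, 0.175, 0.05]
print(recovered)  # roughly equal to p_cond
```

In this toy case q_θ puts more mass on the first outcome than p_data(·|c) does, because that is where the class-conditional distribution exceeds the unconditional one. That is the sharpening effect of guidance, seen directly at the level of distributions.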

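That is what makes it appealing from a learning perspective: the model is trained directly toward the guided distribution, so what it optimizes for and what it samples from are the same thing.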