🌲 Mark Kim

Test-Time Training

Test-time compute has become a household name since the rise of DeepSeek.

But test-time training is one area that has been overlooked.

Test-time training (TTT) is a technique where a model adapts itself during the testing (i.e. inference) phase by using the input data to update its parameters, rather than relying solely on its pre-trained weights. This improves the model's ability to handle distribution shifts between training and test data. TTT has been applied to various neural network architectures, including transformers and RNNs. Researchers at MIT and Stanford have shown remarkable performance improvements, yet the technique hasn't quite taken off.
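
To make the idea concrete, here is a minimal sketch in PyTorch of one common flavor of TTT: for each test batch, a copy of the model is briefly fine-tuned on a self-supervised objective (here, masked reconstruction) computed from the test input itself, and the adapted copy then makes the prediction. The model, heads, and hyperparameters are all illustrative and not taken from any particular paper's codebase.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy model: a shared encoder, a supervised head, and a self-supervised
# reconstruction head used only to adapt the encoder at test time.
class TTTModel(nn.Module):
    def __init__(self, d_in=32, d_hidden=64, n_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.classifier = nn.Linear(d_hidden, n_classes)
        self.recon_head = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        return self.classifier(self.encoder(x))


def predict_with_ttt(model, x, steps=5, lr=1e-3, mask_prob=0.25):
    """Adapt a copy of the model on the test batch via a masked-reconstruction
    loss, then predict with the adapted weights. The original model is untouched."""
    adapted = copy.deepcopy(model)
    adapted.train()
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)

    for _ in range(steps):
        # Self-supervised objective on the test input itself:
        # mask random features and reconstruct the clean input.
        mask = (torch.rand_like(x) > mask_prob).float()
        z = adapted.encoder(x * mask)
        loss = F.mse_loss(adapted.recon_head(z), x)
        opt.zero_grad()
        loss.backward()
        opt.step()

    adapted.eval()
    with torch.no_grad():
        return adapted(x)


model = TTTModel()           # in practice, this would be pre-trained
x_test = torch.randn(8, 32)  # a batch of (possibly shifted) test inputs
logits = predict_with_ttt(model, x_test)
```

The key design choice is that the adaptation signal comes only from the test input, so no labels are needed at inference time; the extra gradient steps are the "test-time compute" being spent on training rather than on sampling.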

I had a chance to sit down with a researcher building TTT models. He put it aptly: pre-training and post-training are ways to amortize the cost of running the model. If you have a fixed compute budget, it may be more efficient to spend part of it on test-time training rather than putting it all into pre-training and post-training.

Perhaps I caught a glimpse of something I wasn't supposed to, but I had the chance to see TTT + SSM in action, delivering results radically better than anything I'd seen before. It blew me away and made me even more excited about this approach.

Yet, as many technologists and venture capitalists know, the best technology doesn’t always win. Plenty of promising model architectures have failed to gain traction and ultimately faded into obscurity, as seen with the Megalodon class of models.

Still, I'm optimistic that test-time training will become increasingly relevant, much like how multi-head latent attention gained widespread attention after DeepSeek R1.

Some papers worth checking out: