Test-time compute is additional computation performed while a model is running inference (as opposed to training) in order to produce better results. Models that use test-time compute are often called reasoning models.

Test-time compute is handy for two reasons:

  1. When applied to a large, trained LLM, it lets the model produce better responses than it could with a single inference pass. This is similar in spirit to chain-of-thought prompting.
  2. When applied to a small, trained LLM, it lets that model perform like a much larger one. This may be a pathway to running high-quality models on edge devices: even though inference would take longer than in a datacenter, it would allow high-quality AI capabilities to fit entirely on personal devices.
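One of the simplest forms of test-time compute is self-consistency: sample the model several times and take a majority vote over the answers. The sketch below is a toy illustration, not a real LLM setup; `noisy_model` is a hypothetical stand-in that returns the correct answer only some of the time, so you can see the vote improve accuracy as the sample count grows.

```python
import random
from collections import Counter

def noisy_model(question, rng):
    # Toy stand-in for an LLM: returns the correct answer with
    # probability 0.6, otherwise one of the wrong options.
    if rng.random() < 0.6:
        return question["answer"]
    return rng.choice(question["wrong"])

def self_consistency(question, n_samples, rng):
    # Test-time compute: sample the model n_samples times and
    # return the most common (majority-vote) answer.
    votes = Counter(noisy_model(question, rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

def accuracy(n_samples, trials=2000, seed=0):
    # Estimate how often the majority vote is correct.
    rng = random.Random(seed)
    q = {"answer": "42", "wrong": ["41", "43", "44"]}
    hits = sum(self_consistency(q, n_samples, rng) == q["answer"]
               for _ in range(trials))
    return hits / trials

for n in (1, 5, 15):
    print(f"samples={n:2d}  accuracy={accuracy(n):.3f}")
```

Spending more inference passes (larger `n`) buys higher accuracy from the same underlying model, which is the basic trade-off test-time compute exploits.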

"Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters" (arXiv:2408.03314) has a good introduction to test-time compute.