Supercomputers are built, and then frontier models are designed to be optimally trainable on them. This contrasts with the traditional approach to scientific computing, where a grand-challenge problem is first defined, then a supercomputer is built to make that problem solvable.
As a result, hardware manufacturers, not model builders, have an outsized effect on the architecture of next-generation AI models. Model builders have some room to choose what architectural features to implement, but many critical parameters (such as number of experts) are defined based on the GPUs on which the model will be trained.
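As a hypothetical illustration of this kind of constraint (the helper and every number below are invented for this sketch, not taken from any particular model): when experts are sharded across GPUs with expert parallelism, the expert count is usually chosen to divide evenly into the expert-parallel group size, so that no GPU idles with a partial shard.

```python
# Hypothetical sketch: constraining an MoE expert count by the cluster shape.
# pick_expert_counts and all numbers here are illustrative assumptions.

def pick_expert_counts(ep_group_size: int, lo: int = 8, hi: int = 512) -> list[int]:
    """Expert counts that shard evenly across an expert-parallel group,
    so every GPU holds the same number of experts."""
    return [n for n in range(lo, hi + 1) if n % ep_group_size == 0]

# With experts sharded across 64 GPUs, only multiples of 64 divide cleanly.
print(pick_expert_counts(64)[:4])  # [64, 128, 192, 256]
```

Real choices also fold in per-GPU memory for the expert weights and interconnect bandwidth for the all-to-all dispatch, but the divisibility constraint alone already rules out most candidate counts.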
This balance of hardware-driving-models and models-driving-hardware has come to be called codesign.
Examples
In decoding, MLA performs a 576-dimensional dot product per attention score, far wider than the 128-dimensional dot product of GQA. The number of attention heads in DeepSeek-V3 was selected according to the roofline of the H800, so it may be suboptimal on other hardware. Because MLA behaves like standard Multi-head Attention (MHA) during training and prefilling, one proposed adjustment for such hardware is to increase the head dimension from 192 to 256 and decrease the number of attention heads by a third.
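To make the arithmetic concrete, here is a minimal sketch of the decoding-time dot-product widths, assuming DeepSeek-V3's published MLA dimensions (a 512-dimensional compressed KV latent plus a 64-dimensional decoupled RoPE component, 512 + 64 = 576) against a typical 128-dimensional GQA head. The head count used for the tradeoff at the end is hypothetical.

```python
# Minimal sketch: per-score dot-product width during decoding.
# Assumes DeepSeek-V3's MLA dimensions (512-dim compressed KV latent
# + 64-dim decoupled RoPE part); the GQA head dim is a common choice.

KV_LATENT_DIM = 512   # MLA compressed key/value latent
ROPE_DIM = 64         # decoupled RoPE component
GQA_HEAD_DIM = 128    # typical GQA head dimension

mla_width = KV_LATENT_DIM + ROPE_DIM   # 576: each score is a 576-dim dot product
gqa_width = GQA_HEAD_DIM               # 128: each score is a 128-dim dot product
print(f"MLA decode dot product: {mla_width} dims")   # 576
print(f"GQA decode dot product: {gqa_width} dims")   # 128
print(f"ratio: {mla_width / gqa_width:.1f}x")        # 4.5x

# Illustrative tradeoff behind the quoted adjustment: raise the head
# dimension 192 -> 256 while cutting the head count by a third.
heads = 48                      # hypothetical head count, not from the source
before = heads * 192            # total query-key width per token before
after = (heads * 2 // 3) * 256  # total query-key width per token after
print(before, after)            # 9216 8192
```

The last two lines show why the head-dimension increase is paired with a head-count cut: the total query-key width per token stays in the same ballpark (here slightly smaller) instead of growing by a third.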