From Pre-Training GPT-4.5:

Quote

We wouldn’t have been able to train GPT 4.5 on the precisely same stack as we did with GPT 4, so let’s say our approach to state management changed. We had to scale to more compute, and that compute was not available as part of one cluster. We had to go to multicluster training, and imagine it’s many different work streams that have to come together in a short period of time for us to be able to do this, aiming for another 10x jump.

Acknowledgment that GPT-4.5 used multi-data center training.

Quote

So, for the next 10x, for me, it would be, of course, fault tolerance, and a form of fault tolerance that we can co-design with the workload, such that we don’t have to worry about the operational burden of keeping such a massive run going. It is not like our prior system. So I would argue that with our prior stack, 4.5, but at the edge of what we could keep up with

Interesting to see that failures at scale are the top issue. Amin later speaks specifically to network protocol support for improving fault tolerance, so perhaps this statement does not apply to the entire system.

Quote

I think up until this rough point in time, like if you look even through GPT-4, we were largely just in a compute-constrained environment. Um, so that was kind of where all the research was going into. But now we’re, you know, in a very different kind of regime. Um, starting with 4.5 for some aspects of the data where we are much more data bound.

Acknowledgment that throwing more GPUs at a model no longer yields a proportional increase in quality once training becomes data bound. This dovetails with reports that GPU datacenter buildout is being scaled back in places (e.g., https://www.wgtd.org/news/microsoft-elaborates-earlier-announced-data-center-pauses).

Quote

Like, how often are you like, “Oh, this is looking really bad.” And then it’s fine? Pretty frequently, I think probably about half the time. Maybe because we’re a paranoid bunch. So I think, yeah, if it wasn’t half the time, we wouldn’t be looking closely enough.

Abnormalities in the micro do not necessarily mean problems in the macro. It sounds like there is an art and a degree of subjectivity to deciding when it is time to pause training and investigate what might be going awry.

Quote

It’s just that where there are faults that could be worked around at a different level than the application level. I would rather Uh, the transport network will do its job and keep running, giving me the available bandwidth without me worrying about it.

Amin’s philosophy on resilience relies on pushing fault tolerance down the stack. In this case, networking issues should be handled by the transport protocol. What about other aspects, like GPU failures or memory corruptions?
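
A toy illustration of “pushing fault tolerance down the stack”: the layer below the training code absorbs transient link failures, so the application only ever sees a working (if degraded) channel. Everything here (class name, retry policy) is a hypothetical sketch, not OpenAI’s transport design.

```python
import socket
import time

class ResilientChannel:
    """Hypothetical transport-layer wrapper: the caller just calls send();
    transient network faults are retried and reconnected below it."""

    def __init__(self, host: str, port: int, max_retries: int = 5):
        self.host, self.port, self.max_retries = host, port, max_retries
        self._connect()

    def _connect(self) -> None:
        self.sock = socket.create_connection((self.host, self.port))

    def send(self, payload: bytes) -> None:
        for attempt in range(self.max_retries):
            try:
                self.sock.sendall(payload)
                return
            except OSError:
                time.sleep(2 ** attempt)  # back off, then reconnect transparently
                try:
                    self._connect()
                except OSError:
                    continue
        # Only an unrecoverable fault is surfaced to the application layer.
        raise RuntimeError("link did not recover; escalate to the application")
```

GPU failures and memory corruption do not fit this pattern as neatly, which is presumably why the quote scopes the philosophy to faults “that could be worked around at a different level than the application level.”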

Quote

we’re entering a new stage of AI research where we’ll be stacking data efficiency wins: 10% here, 20% there, and I think it would be a little foolish to make predictions about it hitting walls that we have no reason to predict.

Context: ML innovation has long benefited from incremental (10-20%) compute-efficiency improvements contributed by teams around the world. That same approach applied to making models learn more efficiently from data has only just begun, suggesting that scaling laws will pivot to a new dimension: instead of loss vs. FLOPs, it will be loss vs. data (assuming some optimal loss-vs.-FLOPs frontier).
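
Loosely sketching the “loss vs. data” framing: hold the model and optimizer fixed, vary only the number of training tokens, and fit a power law with an irreducible floor. The numbers and constants below are illustrative assumptions, not figures from the podcast.

```python
# Hypothetical loss-vs-data fit: L(D) = E + A * D^(-alpha), where D is training
# tokens, E is an irreducible loss floor, and alpha measures data efficiency.
import numpy as np
from scipy.optimize import curve_fit

def loss_vs_data(D, E, A, alpha):
    return E + A * D ** (-alpha)

tokens = np.array([1e9, 1e10, 1e11, 1e12])   # assumed training-set sizes (tokens)
loss   = np.array([3.10, 2.62, 2.31, 2.11])  # assumed held-out losses

(E, A, alpha), _ = curve_fit(loss_vs_data, tokens, loss, p0=[2.0, 100.0, 0.2])
print(f"floor E ~ {E:.2f}, data exponent alpha ~ {alpha:.3f}")
print(f"extrapolated loss at 1e13 tokens: {loss_vs_data(1e13, E, A, alpha):.2f}")
```

In this framing, a data-efficiency win is anything that raises alpha or shifts the curve down without adding tokens; stacking several 10-20% wins compounds the same way compute-efficiency wins did.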

Quote

There’ll definitely be 10 million GPUs working together on an AI system that is learning and doing things, but it might not; all the parts of the brain won’t necessarily all be talking to each other.

Another sign that asynchronous scale-out models will be critical to advancing LLMs. The days of monolithic transformers are coming to an end.
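
One family of techniques consistent with “not all parts talking to each other” is local-update data parallelism (local SGD / periodic parameter averaging), where workers run many steps independently and synchronize only occasionally. The sketch below illustrates that general idea; it is not a description of OpenAI’s system.

```python
import torch

def local_update_round(workers, shards, local_steps=100, lr=1e-3):
    """One round of loosely coupled training: each worker takes many local steps
    with no cross-worker traffic, then all workers average their parameters."""
    for model, shard in zip(workers, shards):
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for x, y in shard[:local_steps]:
            opt.zero_grad()
            torch.nn.functional.mse_loss(model(x), y).backward()
            opt.step()

    # Infrequent synchronization: parameter averaging once per round.
    with torch.no_grad():
        for params in zip(*(w.parameters() for w in workers)):
            mean = torch.stack([p.data for p in params]).mean(dim=0)
            for p in params:
                p.data.copy_(mean)
```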

Quote

the ideal intelligence is called Solomon induction. Basically, it’s uncertain about what universe it’s in and it imagines all possible universes, considering simple ones more likely than less simple ones. It’s fully Bayesian and updates its views as it progresses.

You can approximate this by finding the shortest program that computes everything you’ve seen so far. What we’re doing with pre-training, or one way to think about what pre-training is doing, is it is compressing; it is trying to find the shortest program that explains all of the data that humans have produced so far as a way of approximating.

The term is “Solomonoff induction,” not “Solomon induction” (see https://en.wikipedia.org/wiki/Solomonoff%27s_theory_of_inductive_inference). This is an interesting way of rationalizing how LLMs capture information and relationships.
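
For reference, the objects being gestured at: the Solomonoff prior weights every program that reproduces the observations, with shorter programs dominating, and the compression view of pre-training is a minimum-description-length approximation of it.

```latex
% Solomonoff prior: sum over all programs p whose output on a universal
% machine U begins with the observed string x; shorter programs dominate.
M(x) \;=\; \sum_{p \,:\, U(p) = x\ast} 2^{-|p|}

% MDL reading of pre-training as an approximation: pick the model whose
% description length plus the code length of the data under it is smallest.
\theta^{\ast} \;=\; \arg\min_{\theta}\Big[\,
  \underbrace{\ell(\theta)}_{\text{bits for the model}}
  \;+\; \underbrace{\textstyle\sum_{i} -\log_{2} p_{\theta}(x_{i} \mid x_{<i})}_{\text{bits to encode the data}}
\Big]
```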

Quote

if you can actually train on the entire internet, tests become somewhat Degenerate compared to tests for humans, where they can’t do that. And so the main approach in the field is to look at how much it compresses some held out data that’s thought to be good data. And even then, if you’re not careful about that held out data, it’s too similar to your training data. Training changes to your training algorithm that make it memorize better will seem to make it smarter, because it’ll have already known your test set; and we don’t want to just be measuring memorization.

This is obvious to practitioners, but the claim of memorization is so often repeated by AI deniers.
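
The “compression of held-out data” evaluation reduces to scoring a held-out corpus in bits per byte under the model; the hard part (not shown) is making sure the held-out text is genuinely dissimilar from the training data. The model name and API below are from Hugging Face transformers and are placeholders, not OpenAI’s tooling.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def bits_per_byte(model_name: str, text: str) -> float:
    """Score held-out text as a compression rate: lower bits/byte = better model."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # HF causal-LM loss is mean negative log-likelihood (nats) per predicted token.
        nll = model(ids, labels=ids).loss.item()
    n_predicted = ids.numel() - 1          # the first token is never predicted
    total_bits = nll * n_predicted / math.log(2)
    return total_bits / len(text.encode("utf-8"))

# e.g. bits_per_byte("gpt2", held_out_document)
```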

Quote

Our internal codebase, which we know is not out there, is a very good held out. It has that held as our best thing across like many; it’s still the best thing.

Remarkable! I mean, we joke that a model is its monorepo loss.

OpenAI has a monorepo, and they validate against it since they know the model has not been trained on it. Interesting.

Quote

we kind of used the convention of each major number of GPT as a 100x increment.

This explains why GPT-4.5 was not GPT-5: the roughly 10x jump in effective compute is only half a major version under a 100x-per-major-number convention.
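
One way to read the convention as arithmetic (my interpretation, not stated in the podcast): the version increment scales with the log of effective compute, so a 10x jump is half a major number.

```latex
\Delta v \;=\; \tfrac{1}{2}\,\log_{10}\!\frac{C_{\text{new}}}{C_{\text{old}}},
\qquad
\frac{C_{\text{new}}}{C_{\text{old}}} = 10 \;\Rightarrow\; \Delta v = 0.5
\quad (\text{GPT-4} \to \text{GPT-4.5}),
\qquad
\frac{C_{\text{new}}}{C_{\text{old}}} = 100 \;\Rightarrow\; \Delta v = 1
\quad (\text{GPT-4} \to \text{GPT-5})
```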

Quote

I’d say the the two defining characteristics of the GPT paradigm have been that the law you Can predict the test loss, and it scales predictably and magically. Lower test loss means greater intelligence in all these intangible, amazing, mysterious ways.

Transcription errors make the quote hard to parse; the intended reading appears to be that the two defining characteristics are:

  1. You can predict the test loss.
  2. It scales predictably.

followed by: “Magically, lower test loss means greater intelligence in all these intangible, amazing, mysterious ways.”

This clarifies that test loss, while predictable via scaling laws, does not map straightforwardly onto our perception of model quality.
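
The “predictable” half of the claim is usually written as a power law in compute, fit on small runs and extrapolated to the big one; the standard ansatz is below (constants are empirical). Nothing in the formula says which capabilities emerge as the loss drops, which is the “magical” half.

```latex
L(C) \;\approx\; L_{\infty} \;+\; \left(\frac{C_{0}}{C}\right)^{\alpha_{C}}
```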

Quote

It has more common sense, knowledge, it understands nuance and context, and that’s the magic that came out of the few extra bits of test laws.

“test loss,” not “test laws.”

Again, these notions of “common sense” and “nuance” do not have a straightforward relationship with test loss.

Quote

I think seeing the moment that the whole Once a few of those issues got resolved, we got a big performance boost. After I remember, everybody got — I mean, you could sense that the energy changed. It’s just that everybody feels excited and now more motivated to push everything through to the end. It’s just, it’s fascinating to see the ETA on our status tracker; like, yeah, it has constantly shifted from, let’s say, two years to something tangible.

Interesting anecdote: training performance was far below predictions early in the run, but building the plane while it flew let them keep optimizing and watch the projected completion time shrink from roughly two years to something tangible as live improvements landed.

Quote

we had a number of very large de-risking runs.

Interesting, practical anecdote about LLM training at scale.

Quote

we have built a lot of systems around giving us visibility and the ability to distinguish between whether it is a hardware fault, what type of hardware fault it is, some form of corruption, or if it is some ML potentially an ML bug or something like races in our code.

Acknowledgment that OpenAI takes great care to understand the nature of failures to ensure the model trains in a predictable way.
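
One hypothetical triage technique in that spirit: replay a failing step with identical inputs and seeds. A mismatch that reproduces exactly points toward code or data; a result that changes across replays on the same host points toward flaky hardware or silent corruption. This is an illustration of the distinction, not OpenAI’s tooling.

```python
import torch

def grad_checksum(model, batch, seed: int = 1234) -> torch.Tensor:
    """Run one step deterministically and return a cheap checksum of the gradients."""
    torch.manual_seed(seed)
    model.zero_grad()
    loss = model(batch["x"]).float().pow(2).mean()  # stand-in loss for the sketch
    loss.backward()
    return torch.stack([p.grad.double().sum()
                        for p in model.parameters() if p.grad is not None])

def classify_failure(model, batch) -> str:
    a = grad_checksum(model, batch)
    b = grad_checksum(model, batch)
    if torch.allclose(a, b):
        return "reproducible: suspect a deterministic ML bug or bad data"
    return "non-reproducible: suspect a hardware fault, memory corruption, or a race"
```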

Quote

we are mostly using Triton kernels; it’s just that for some corner cases, let’s say the operations don’t matter much, we basically fall back to using torch operations.

Interesting insight into how far they take kernel-level optimization: custom Triton kernels on the hot paths, with a fallback to plain torch operations for corner cases where the ops don’t matter much.
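
The dispatch pattern described (Triton on the common path, torch ops as the corner-case fallback) looks roughly like the sketch below. The kernel is the standard Triton vector-add tutorial shape, used only to show the structure; it is not one of OpenAI’s kernels.

```python
import torch

try:
    import triton
    import triton.language as tl
    HAS_TRITON = True
except ImportError:
    HAS_TRITON = False

if HAS_TRITON:
    @triton.jit
    def _add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

def fused_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Use the Triton kernel on the common path; fall back to torch for corner
    cases (no Triton, CPU tensors, non-contiguous inputs) that don't matter much."""
    if not (HAS_TRITON and x.is_cuda and x.is_contiguous() and y.is_contiguous()):
        return x + y
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    _add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```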

Quote

it’s a crash at a very, very, I mean, a slow rate. It’s one every hundred steps, one every thousand steps, and it’s something very easy to dismiss. But it’s just that we should not have that in the run as a discipline that we should, we do have, and it’s just not giving up on it, is the story.

Speaks to how easy it would be to ignore rare, nondeterministic errors. It sounds like OpenAI’s tolerance for mysterious errors is low, but not zero.
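
Illustrative arithmetic on why “one every thousand steps” is not actually rare over a long run (both numbers below are assumptions, not figures from the podcast):

```python
steps_per_crash = 1_000      # "one every thousand steps"
total_steps = 300_000        # hypothetical length of a large pre-training run
print(total_steps // steps_per_crash)  # -> 300 interruptions if the bug is tolerated
```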

Quote

I think most people can imagine it, like leading up to, like, pushing “Go” on the run. But after that happens, what is your day-to-day like? Are you just like Sitting there watching loss curves, like how does it go? Definitely a lot of watching loss curves.

Interesting, practical anecdote about LLM training at scale. Similar to the experience described in the OPT-175B on-call logs.