It’s hard to keep track of Microsoft’s public disclosures about its AI supercomputers, so here is what has been stated publicly.
2019 supercomputers
OpenAI’s supercomputer (V100)
Microsoft disclosed that it built a supercomputer with 10,000 NVIDIA V100 GPUs that would’ve listed at #5 on Top500 had they submitted an HPL score.1 This supercomputer was used to train GPT-3.2
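As a rough sanity check on that claim (my own back-of-the-envelope, not a Microsoft figure): the GPT-3 paper reports about 3.14e23 FLOPs of training compute, and NVIDIA rates the V100 at roughly 125 TFLOPS of peak FP16 tensor throughput, so 10,000 V100s at an assumed 25–35% sustained utilization would need on the order of one to two weeks.

```python
# Order-of-magnitude sanity check (illustrative assumptions, not disclosures):
#   - GPT-3 training compute ~3.14e23 FLOPs (Brown et al., 2020, appendix)
#   - V100 peak FP16 tensor throughput ~125 TFLOPS (NVIDIA spec sheet)
#   - Sustained utilization assumed at 25-35% of peak
train_flops = 3.14e23
gpus = 10_000
peak_flops_per_gpu = 125e12

for utilization in (0.25, 0.35):
    sustained = gpus * peak_flops_per_gpu * utilization
    days = train_flops / sustained / 86_400
    print(f"{utilization:.0%} utilization: ~{days:.0f} days")
```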
Kevin Scott qualitatively described this system as “as big as a shark:”2
Kevin Scott, Microsoft Build 2024:
In 2020, we built our first AI supercomputer for OpenAI. It’s the supercomputing environment that trained GPT-3. And so, we’re going to just choose “marine wildlife” as our scale marker.
You can think of that system as about as big as a shark.
Mark Russinovich said the following about it:3
Mark Russinovich, Microsoft Build 2024:
The first AI supercomputer we built back in 2019 for OpenAI was used to train GPT-3 and the size of that infrastructure, if we’d submitted it to the TOP500 supercomputer benchmark, would have been in the top five supercomputers in the world on-premises or in the Cloud and the largest in the Cloud.
2021 supercomputers
Voyager (A100)
Microsoft debuted the Voyager supercomputer at #10 on the November 2021 Top500 list. HPL was run on 264 nodes of Azure ND A100 v4.
2022 supercomputers
OpenAI’s 2022 supercomputer
Very little has been said about this computer other than it is “as big as an orca”2 and that it was built to train GPT-4:3
Mark Russinovich, Microsoft Build 2024:
In November, we talked about the two generations later supercomputer, so that we built one to train GPT-4 after the GPT-3 one…
Kevin Scott also referred to the supercomputer that trained GPT-4 at Build 2024:2
Kevin Scott, Microsoft Build 2024:
The next system that we built, scale wise, is about as big as an orca. And that is the system that we delivered in 2022 that trains GPT-4.
Iowa Supercomputer (2022-2023)
In September 2023, Microsoft disclosed “the Iowa supercomputer” which is “among the largest and most powerful in the world.”4 It is unclear from the article when this supercomputer was deployed.
2023 supercomputers
OpenAI’s 2023 supercomputer
Mark Russinovich said the following in his talk at Build 2024:3
Mark Russinovich, Microsoft Build 2024:
In November, we talked about the two generations later supercomputer, so that we built one to train GPT-4 after the GPT-3 one and this one we’re building out to train the next generation of OpenAI’s models this one with a slice of that infrastructure, just a piece of that supercomputer that we were bringing online.
Kevin Scott also referred to a supercomputer being built out at Build 2024:2
Kevin Scott, Microsoft Build 2024:
The system that we have just deployed is, scale wise, about as big as a whale relative to the shark-sized supercomputer and this orca-sized supercomputer. And it turns out you can build a whole hell of a lot of AI with a whale-sized supercomputer.
Explorer (MI200)
In June 2023, Microsoft debuted Explorer as the #11 supercomputer on Top500. It ran HPL on 480 nodes of Azure ND MI200 v4.
Eagle (H100)
In November 2023, Microsoft debuted Eagle at #3 on Top500. Eagle ran HPL on 1,800 Azure ND H100 v5 nodes and is itself only a “tiny fraction”3 of a much larger supercomputer. See Eagle for more information.
Post-2024 supercomputers
OpenAI’s next supercomputers
Again, Mark Russinovich said the following at Build 2024:3
Mark Russinovich, Microsoft Build 2024:
We’re already on to the next one and designing for the next one after that which are multiple times even bigger than this one.
Global infrastructure in 2024
Mark Russinovich also provided the following timeline of growth:3
- May 2020: 10,000 V100 GPUs, #5 supercomputer (not confirmed)
- Nov 2023: 14,400 H100 GPUs, #3 supercomputer (confirmed)
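For context on how those two data points compare, here is a rough aggregate-throughput comparison. The per-GPU peak figures are NVIDIA spec-sheet numbers, and treating peak dense FP16 tensor FLOPS as a scale proxy is my framing, not something Microsoft stated:

```python
# Back-of-the-envelope comparison of the two disclosed systems.
# Per-GPU peak dense FP16 tensor throughput (NVIDIA spec sheets, no sparsity):
#   V100 SXM2 ~125 TFLOPS, H100 SXM ~989 TFLOPS.
# These per-GPU figures and the "peak FLOPS as scale proxy" framing are
# illustrative assumptions, not Microsoft disclosures.
systems = {
    "May 2020 (V100)": {"gpus": 10_000, "tflops_per_gpu": 125},
    "Nov 2023 (H100)": {"gpus": 14_400, "tflops_per_gpu": 989},
}

for name, s in systems.items():
    total_pflops = s["gpus"] * s["tflops_per_gpu"] / 1_000
    print(f"{name}: ~{total_pflops:,.0f} PFLOPS peak dense FP16")

ratio = (14_400 * 989) / (10_000 * 125)
print(f"Implied growth in peak FP16 throughput: ~{ratio:.1f}x")
```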
30x supercomputers comment
We’ve built 30x the size of our AI infrastructure since November, the equivalent of five of those supercomputers I showed you every single month
The math: roughly five Eagle-sized supercomputers per month over the six months from November 2023 to May 2024 gives the 30x figure.
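A small sketch of that arithmetic; the GPU-equivalent line is my own extrapolation from Eagle’s 14,400 H100s, not a Microsoft disclosure:

```python
# Sanity check of the "30x" claim as read above: ~5 Eagle-sized
# supercomputers per month over the ~6 months from November 2023 to
# May 2024. The GPU-equivalent line is an extrapolation for illustration.
eagles_per_month = 5
months = 6                      # Nov 2023 -> May 2024
h100s_per_eagle = 14_400        # 1,800 nodes x 8 GPUs

eagle_equivalents = eagles_per_month * months
print(f"Eagle-equivalents built: {eagle_equivalents}x")
print(f"Implied H100-class GPUs: ~{eagle_equivalents * h100s_per_eagle:,}")
```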
InfiniBand domain size
If you take a look at the scale of those systems I was talking about, those AI supercomputers, there are tens of thousands of servers, and they all have to be connected together to make that efficient all-reduce happen. …the InfiniBand domain covers the entire supercomputer, which is tens of thousands of servers.
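To see why the quote stresses putting the whole supercomputer in one InfiniBand domain, consider the communication cost of a bandwidth-optimal ring all-reduce: every participant moves roughly twice the gradient payload per step no matter how many servers join, so the fabric has to be fast everywhere. The model size and gradient precision below are illustrative assumptions, not numbers from the talk:

```python
# In a bandwidth-optimal ring all-reduce, each participant transmits about
# 2 * (N - 1) / N * S bytes per reduction, i.e. roughly twice the gradient
# size regardless of cluster size. Parameter count and precision here are
# illustrative assumptions.
def ring_allreduce_bytes_per_rank(payload_bytes: float, n_ranks: int) -> float:
    """Bytes each rank transmits in one reduce-scatter + all-gather pass."""
    return 2 * (n_ranks - 1) / n_ranks * payload_bytes

params = 175e9                 # GPT-3-scale parameter count (illustrative)
bytes_per_grad = 2             # fp16/bf16 gradients
payload = params * bytes_per_grad

for n in (1_000, 10_000, 20_000):
    gib = ring_allreduce_bytes_per_rank(payload, n) / 2**30
    print(f"{n:>6} ranks: ~{gib:,.0f} GiB sent per rank per all-reduce")
```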
Standard cluster size
In the case of our public systems where we don’t have customers that are looking for training at that scale … the InfiniBand domains are 1,000 - 2,000 servers in size, which is still 10,000 - 20,000 GPUs.
Project POLCA
- Training has only 3% power headroom
- Inference has 21% power headroom
- Allows 30% more servers in an existing datacenter with minimal performance loss (a simplified headroom model is sketched below)
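Those numbers fit a simple oversubscription picture. The formula below is my own simplification for illustration, not the POLCA work’s actual methodology: if a workload’s observed peak draw leaves a fraction `headroom` of its provisioned power unused, the same facility power budget can host roughly 1 / (1 − headroom) times as many servers, with power capping handling rare spikes.

```python
# Simplified power-oversubscription model (illustrative framing, not the
# POLCA paper's exact method): unused power headroom translates into extra
# servers under the same facility power budget.
def extra_server_fraction(headroom: float) -> float:
    return 1 / (1 - headroom) - 1

for workload, headroom in {"training": 0.03, "inference": 0.21}.items():
    print(f"{workload}: ~{extra_server_fraction(headroom):.0%} more servers")
# training: ~3%, inference: ~27% -- the same ballpark as the ~30% figure
# quoted above for deployments that mix in inference workloads.
```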
Footnotes
1. Microsoft announces new supercomputer, lays out vision for future AI work
2. Microsoft Build opening keynote video and Microsoft Build opening keynote transcript
3. Inside Microsoft AI innovation with Mark Russinovich
4. How a small city in Iowa became an epicenter for advancing AI