Colossus is an AI training cluster deployed by xAI in Memphis, TN. It has:
- “Over”[^1] 100,000 H100 GPUs (12,500 nodes?)
- Standard 8-GPU HGX baseboards
- 400G Ethernet backend
- 8x BlueField-3 NICs/node[^2]
- Three-layer tree,[^3] likely rail-optimized
- Spectrum SN5600 Ethernet switches[^3]
- 400G Ethernet frontend (1x ConnectX-7 NIC/node)[^2]
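Taken together, the per-node figures above imply some aggregate numbers. A quick arithmetic sketch — nothing here beyond the spec list itself:

```python
# Aggregate figures implied by the spec list (pure arithmetic, no new data).
GPUS = 100_000
GPUS_PER_NODE = 8
BACKEND_NICS_PER_NODE = 8      # one BlueField-3 per GPU
NIC_SPEED_GBPS = 400           # 400G Ethernet backend

nodes = GPUS // GPUS_PER_NODE                                         # 12,500 nodes
node_injection_tbps = BACKEND_NICS_PER_NODE * NIC_SPEED_GBPS / 1000   # 3.2 Tb/s per node
total_injection_pbps = nodes * node_injection_tbps / 1000             # 40 Pb/s aggregate

print(nodes, node_injection_tbps, total_injection_pbps)  # 12500 3.2 40.0
```

The 12,500-node figure matches the "(12,500 nodes?)" guess above, and 3.2 Tb/s (400 GB/s) of backend injection bandwidth per node is the standard ratio for an 8-GPU HGX node with one 400G NIC per GPU.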
“Half” of its racks were provided by Supermicro, and the other half by Dell.[^4]
The Supermicro nodes are mostly liquid-cooled, though it is unclear what fraction of each node remains air-cooled. Each rack holds eight nodes (64 GPUs), a bottom-of-rack CDU (coolant distribution unit), and a rear-door heat exchanger that captures the heat not removed by the cold-plate loops.[^sth] The SN5600 top-of-rack (TOR) switches for the backend network sit in separate racks, suggesting a rail-optimized design.
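In a rail-optimized fabric, NIC i of every node plugs into its own set of leaf switches ("rail" i), so same-numbered GPUs across nodes are one switch hop apart. A minimal sketch of such an assignment — the port split and 64-node leaf groups are illustrative assumptions, not confirmed details of Colossus:

```python
# Hypothetical rail-optimized leaf assignment (illustrative, not xAI's actual topology).
# Assumption: each leaf switch dedicates 64 downlinks to the same-numbered NIC
# of 64 consecutive nodes, with the rest of its ports used as uplinks.
NICS_PER_NODE = 8        # BlueField-3 NICs, one per GPU
NODES_PER_LEAF = 64      # assumed downlinks per leaf

def leaf_switch(node: int, nic: int) -> tuple[int, int]:
    """Return (rail, leaf-group) for a given node's NIC."""
    assert 0 <= nic < NICS_PER_NODE
    return nic, node // NODES_PER_LEAF

# GPU 3 of nodes 0..63 all land on rail 3, leaf-group 0, so same-rank
# GPUs reach each other without crossing the upper tree layers:
print(leaf_switch(0, 3), leaf_switch(63, 3))   # (3, 0) (3, 0)
print(leaf_switch(64, 3))                      # (3, 1): next leaf-group
```

This is why same-rank collectives (the common pattern in data-parallel training) stay within a single rail and avoid the spine layers for most traffic.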
The storage is provided by VAST Data[^5] running on Supermicro (and possibly Dell?) hardware.[^2] DDN has also claimed to be the storage provider for this cluster, but there is no evidence of that in the video.[^6]
If HPL were run across the full 100,000 GPUs (12,500 nodes), I estimate the performance at around 3.8 EFLOPS.
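That figure can be reproduced with a back-of-envelope calculation, assuming NVIDIA's ~67 TFLOPS FP64 tensor-core peak per H100 SXM and an assumed ~57% HPL efficiency at full-system scale (typical for large GPU machines; both numbers are my assumptions, not measurements):

```python
# Back-of-envelope HPL estimate (assumptions, not measurements):
#  - 100,000 H100 SXM GPUs
#  - ~67 TFLOPS FP64 tensor-core peak per GPU (datasheet figure)
#  - ~57% HPL efficiency at full-system scale (assumed)
GPUS = 100_000
PEAK_FP64_TENSOR = 67e12     # FLOPS per GPU
HPL_EFFICIENCY = 0.57        # fraction of peak realized by HPL

rpeak = GPUS * PEAK_FP64_TENSOR            # theoretical peak, FLOPS
rmax_estimate = rpeak * HPL_EFFICIENCY     # estimated HPL result, FLOPS

print(f"Rpeak ≈ {rpeak / 1e18:.1f} EFLOPS")          # Rpeak ≈ 6.7 EFLOPS
print(f"Rmax  ≈ {rmax_estimate / 1e18:.1f} EFLOPS")  # Rmax  ≈ 3.8 EFLOPS
```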
## Facility
The cluster is physically sited at 3231 Paul R. Lowry Road in Memphis, a former appliance factory being leased to xAI.[^7][^8] It is split across 4x 25,000-GPU halls.[^9]
As of July 2024, the site had only 8 MW of utility power, and the substation serving the facility was capable of delivering up to 50 MW.[^7][^10] As of September 2024, xAI was using portable, natural-gas-fired generators fed by a 16” gas main to meet the remaining power demand.[^8][^10] At least one 16 MW generator (a Solar Turbines SMT130) is deployed at the facility.[^11]
Tesla Megapacks are also in use to smooth the power load on the facility.[^9]
xAI has requested an additional 100 MW of power (150 MW total) and will consume 1 million gallons of water per day.[^10]
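Those power figures pass a rough sanity check. Assuming ~700 W per H100 at load, ~10.5 kW per 8-GPU HGX node with CPUs, NICs, and fans included (a typical vendor figure, not one reported for Colossus), and a modest facility overhead, the total lands near the 150 MW request:

```python
# Rough power sanity check (assumed figures, not reported ones):
#  - ~700 W per H100 SXM at full load (datasheet TDP)
#  - ~10.5 kW per 8-GPU HGX node, CPUs/NICs/fans included (typical vendor spec)
#  - ~1.15x facility overhead for cooling, network, and storage (assumed)
GPUS = 100_000
NODES = GPUS // 8

gpu_only_mw = GPUS * 700 / 1e6       # GPUs alone: 70 MW
node_mw = NODES * 10_500 / 1e6       # at the node level: ~131 MW
facility_mw = node_mw * 1.15         # with overhead: ~151 MW

print(f"{gpu_only_mw:.0f} MW GPUs, {node_mw:.0f} MW nodes, {facility_mw:.0f} MW facility")
# 70 MW GPUs, 131 MW nodes, 151 MW facility
```

Under these assumptions, the requested 150 MW is roughly what a fully loaded 100,000-GPU system would draw, leaving little headroom for the planned expansion.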
## Environmental impact
The xAI facility housing Colossus has attracted significant environmental concern because:
- It currently relies on a significant amount of power from portable, on-site natural-gas-fired generators.
- The region exceeded the National Ambient Air Quality Standards for three consecutive years prior to 2024 and is on track to exceed them in 2024 as well.[^12]
- TVA, which operates the local grid, still relies on fossil fuels for around half of its generation.[^13]
It’s not clear to me whether this is just bad press, or whether this system really is an outlier in carbon footprint relative to other AI training clusters of comparable size.
See sustainability in HPC for more.
## Buildout
According to NVIDIA, the cluster was “built by xAI and NVIDIA” in 122 days, with only 19 days between the first rack arriving and training beginning.[^3] It is unclear what role Supermicro or Dell played in this buildout.
## Future
There is a plan to expand this supercomputer with an additional 100,000 H200 GPUs “in a single building”[^14] for a total of 200,000 Hopper-generation GPUs.[^3]
## Footnotes
- Inside the World’s Largest AI Supercluster xAI Colossus (youtube.com)
- Inside the 100K GPU xAI Colossus Cluster that Supermicro Helped Build for Elon Musk - Page 3 of 4 - ServeTheHome
- NVIDIA Ethernet Networking Accelerates World’s Largest AI Supercomputer, Built by xAI | NVIDIA Newsroom
- Dell and Super Micro Computer will build Elon Musk’s ‘AI factory’
- In Memphis, Elon Musk’s xAI Supercomputer Stirs Hope and Concern - Bloomberg
- How Memphis became a battleground over Elon Musk’s xAI supercomputer - OPB
- Inside the 100K GPU xAI Colossus Cluster that Supermicro Helped Build for Elon Musk - Page 4 of 4 - ServeTheHome
- Environmentalists urge health department to take action on xAI
- Why a Memphis Community Is Fighting Elon Musk’s Supercomputer - The New York Times
- “Soon to become a 200k H100/H200 training cluster in a single building” (x.com)