Colossus is a AI training cluster deployed by xAI in Memphis, TN. It has:

  • “Over”1 100,000 H100 GPUs (12,500 nodes?)
  • Standard 8-GPU HGX baseboards
  • 400G Ethernet backend
    • 8x BlueField-3 NICs/node2
    • Three-layer tree, 3 likely rail-optimized
    • Spectrum SN5600 Ethernet switches3
    • 400G Ethernet frontend (1x ConnectX-7 NIC/node)2

“Half” of its racks were provided by Supermicro, and the other half were Dell.4

The Supermicro nodes are mostly liquid cooled, but it is unclear what percent of each node is air-cooled. The racks use rear-door heat exchangers to capture the heat that isn’t covered by the cold-plate cooling.[^sth] Each rack has eight nodes (64 GPUs), a bottom-of-rack CDU, and a rear-door heat exchanger. The SN5600 TOR switches for the backend network are in separate racks, suggesting a rail-optimized design.

The storage is provided by VAST Data5 on Supermicro (and possibly Dell?) hardware.2 DDN has also claimed credit as being the storage provider for this cluster, but there is no evidence of that in the video.6

If HPL was run across the full 100,000 GPUs (12,500 nodes), I estimate the performance at around 3.8 EFLOPS.

Facility

The cluster is physically sited at 3231 Paul R. Lowry Road in Memphis, a former appliance factory being leased to xAI.78 The cluster is split across 4x 25,000-GPU halls.9

Temporary power

As of July 2024, the site only had 8 MW of power, and the substation serving the facility was capable of serving up to 50 MW.710 xAI has requested an additional 100 MW of power (150 MW total) and will consume 1 million gallons of water per day.10

As of September 2024, xAI was using portable, natural-gas-fired generators served by a 16” main to meet the power demands.810 At least one 16 MW generator (Solar Turbines SMT130) is deployed at the facility.11

Long-term power

The site is literally across the street from a 1.1 GW combined cycle natural gas power plant, TVA’s Allen Combined Cycle Plant.

Tesla Megapacks are also in use to smooth the power load on the facility.9

Environmental impact

The xAI facility housing Colossus has attracted significant environmental concern because:

  1. It is currently relying a significant amount of power coming from portable, on-site natural gas-fired generators
  2. The region has exceeded the National Ambient Air Quality Standards for three consecutive years prior to 2024, and is on track to exceed the standard in 2024 as well.12
  3. TVA, which operates the local grid, still relies on fossil fuels for around half its energy generation.13

Given its proximity to a 1.1 GW natural gas plant, it is likely that Colossus will be powered almost entirely by natural gas.

See sustainability in HPC for more.

Buildout

According to NVIDIA, the cluster was “built by xAI and NVIDIA” in 122 days, and it was only 19 days between the first rack arriving and training beginning.3 It is unclear what role Supermicro or Dell may have played in this buildout.

Future

There is a plan to expand this supercomputer to include an additional 100,000 H200 GPUs “in a single building”14 for a total of 200,000 Hopper-generation GPUs.3

Footnotes

  1. Inside the World’s Largest Al Supercluster XAl Colossus (youtube.com)

  2. Inside the 100K GPU xAI Colossus Cluster that Supermicro Helped Build for Elon Musk - Page 3 of 4 - ServeTheHome 2 3

  3. NVIDIA Ethernet Networking Accelerates World’s Largest AI Supercomputer, Built by xAI | NVIDIA Newsroom 2 3 4

  4. Dell and Super Micro Computer will build Elon Musk’s ‘AI factory’

  5. Grok is being built on the VAST Data platform!

  6. Paul Bloch on LinkedIn

  7. In Memphis, Elon Musk’s xAI Supercomputer Stirs Hope and Concern - Bloomberg 2

  8. How Memphis became a battleground over Elon Musk’s xAI supercomputer - OPB 2

  9. Inside the 100K GPU xAI Colossus Cluster that Supermicro Helped Build for Elon Musk - Page 4 of 4 - ServeTheHome 2

  10. 2024xAI and MLGW Quick Facts 1a.pdf 2 3

  11. Environmentalists urge health department to take action on xAI

  12. Why a Memphis Community Is Fighting Elon Musk’s Supercomputer - The New York Times

  13. Memphis Light, Gas and Water - Carbon Allocation

  14. “Soon to become a 200k H100/H200 training cluster in a single building” (x.com)