Colossus is an AI training cluster deployed by xAI in Memphis, TN. It has:
- “Over”[^1] 100,000 H100 GPUs (12,500 nodes?)
- Standard 8-GPU HGX baseboards
- 400G Ethernet backend
- 8x BlueField-3 NICs/node[^2]
- Three-layer tree,[^3] likely rail-optimized
- Spectrum SN5600 Ethernet switches[^3]
- 400G Ethernet frontend (1x ConnectX-7 NIC/node)[^2]
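Taken together, the per-node figures above imply some aggregate numbers. A quick arithmetic sketch — nothing here beyond the spec list itself:

```python
# Aggregate figures implied by the spec list (pure arithmetic, no new data).
GPUS = 100_000
GPUS_PER_NODE = 8
BACKEND_NICS_PER_NODE = 8      # one BlueField-3 per GPU
NIC_SPEED_GBPS = 400           # 400G Ethernet backend

nodes = GPUS // GPUS_PER_NODE                                         # 12,500 nodes
node_injection_tbps = BACKEND_NICS_PER_NODE * NIC_SPEED_GBPS / 1000   # 3.2 Tb/s per node
total_injection_pbps = nodes * node_injection_tbps / 1000             # 40 Pb/s aggregate

print(nodes, node_injection_tbps, total_injection_pbps)  # 12500 3.2 40.0
```

The 12,500-node figure matches the "(12,500 nodes?)" guess above, and 3.2 Tb/s (400 GB/s) of backend injection bandwidth per node is the standard ratio for an 8-GPU HGX node with one 400G NIC per GPU.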
“Half” of its racks were provided by Supermicro, and the other half by Dell.[^4]
The Supermicro nodes are mostly liquid-cooled, though it is unclear what fraction of each node remains air-cooled. Each rack holds eight nodes (64 GPUs), a bottom-of-rack CDU (coolant distribution unit), and a rear-door heat exchanger that captures the heat not removed by the cold-plate loops.[^sth] The SN5600 top-of-rack (TOR) switches for the backend network sit in separate racks, suggesting a rail-optimized design.
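In a rail-optimized fabric, NIC i of every node plugs into its own set of leaf switches ("rail" i), so same-numbered GPUs across nodes are one switch hop apart. A minimal sketch of such an assignment — the port split and 64-node leaf groups are illustrative assumptions, not confirmed details of Colossus:

```python
# Hypothetical rail-optimized leaf assignment (illustrative, not xAI's actual topology).
# Assumption: each leaf switch dedicates 64 downlinks to the same-numbered NIC
# of 64 consecutive nodes, with the rest of its ports used as uplinks.
NICS_PER_NODE = 8        # BlueField-3 NICs, one per GPU
NODES_PER_LEAF = 64      # assumed downlinks per leaf

def leaf_switch(node: int, nic: int) -> tuple[int, int]:
    """Return (rail, leaf-group) for a given node's NIC."""
    assert 0 <= nic < NICS_PER_NODE
    return nic, node // NODES_PER_LEAF

# GPU 3 of nodes 0..63 all land on rail 3, leaf-group 0, so same-rank
# GPUs reach each other without crossing the upper tree layers:
print(leaf_switch(0, 3), leaf_switch(63, 3))   # (3, 0) (3, 0)
print(leaf_switch(64, 3))                      # (3, 1): next leaf-group
```

This is why same-rank collectives (the common pattern in data-parallel training) stay within a single rail and avoid the spine layers for most traffic.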
The storage is provided by VAST Data[^5] running on Supermicro (and possibly Dell?) hardware.[^2] DDN has also claimed to be the storage provider for this cluster, but there is no evidence of that in the video.[^6]
If HPL were run across the full 100,000 GPUs (12,500 nodes), I estimate the performance at around 3.8 EFLOPS.
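That figure can be reproduced with a back-of-envelope calculation, assuming NVIDIA's ~67 TFLOPS FP64 tensor-core peak per H100 SXM and an assumed ~57% HPL efficiency at full-system scale (typical for large GPU machines; both numbers are my assumptions, not measurements):

```python
# Back-of-envelope HPL estimate (assumptions, not measurements):
#  - 100,000 H100 SXM GPUs
#  - ~67 TFLOPS FP64 tensor-core peak per GPU (datasheet figure)
#  - ~57% HPL efficiency at full-system scale (assumed)
GPUS = 100_000
PEAK_FP64_TENSOR = 67e12     # FLOPS per GPU
HPL_EFFICIENCY = 0.57        # fraction of peak realized by HPL

rpeak = GPUS * PEAK_FP64_TENSOR            # theoretical peak, FLOPS
rmax_estimate = rpeak * HPL_EFFICIENCY     # estimated HPL result, FLOPS

print(f"Rpeak ≈ {rpeak / 1e18:.1f} EFLOPS")          # Rpeak ≈ 6.7 EFLOPS
print(f"Rmax  ≈ {rmax_estimate / 1e18:.1f} EFLOPS")  # Rmax  ≈ 3.8 EFLOPS
```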
## Facility
The cluster is physically sited at 3231 Paul R. Lowry Road in Memphis, a former appliance factory being leased to xAI.[^7][^8] It is split across 4x 25,000-GPU halls.[^9]
As of July 2024, the site had only 8 MW of utility power, and the substation serving the facility was capable of delivering up to 50 MW.[^7][^10] As of September 2024, xAI was using portable, natural-gas-fired generators fed by a 16” gas main to meet the remaining power demand.[^8][^10] At least one 16 MW generator (a Solar Turbines SMT130) is deployed at the facility.[^11]
Tesla Megapacks are also in use to smooth the power load on the facility.[^9]
xAI has requested an additional 100 MW of power (150 MW total) and will consume 1 million gallons of water per day.[^10]
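Those power figures pass a rough sanity check. Assuming ~700 W per H100 at load, ~10.5 kW per 8-GPU HGX node with CPUs, NICs, and fans included (a typical vendor figure, not one reported for Colossus), and a modest facility overhead, the total lands near the 150 MW request:

```python
# Rough power sanity check (assumed figures, not reported ones):
#  - ~700 W per H100 SXM at full load (datasheet TDP)
#  - ~10.5 kW per 8-GPU HGX node, CPUs/NICs/fans included (typical vendor spec)
#  - ~1.15x facility overhead for cooling, network, and storage (assumed)
GPUS = 100_000
NODES = GPUS // 8

gpu_only_mw = GPUS * 700 / 1e6       # GPUs alone: 70 MW
node_mw = NODES * 10_500 / 1e6       # at the node level: ~131 MW
facility_mw = node_mw * 1.15         # with overhead: ~151 MW

print(f"{gpu_only_mw:.0f} MW GPUs, {node_mw:.0f} MW nodes, {facility_mw:.0f} MW facility")
# 70 MW GPUs, 131 MW nodes, 151 MW facility
```

Under these assumptions, the requested 150 MW is roughly what a fully loaded 100,000-GPU system would draw, leaving little headroom for the planned expansion.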
## Environmental impact
The xAI facility housing Colossus has attracted significant environmental concern because:
- It currently relies on a significant amount of power from portable, on-site natural-gas-fired generators.
- The region exceeded the National Ambient Air Quality Standards for three consecutive years prior to 2024 and is on track to exceed them in 2024 as well.[^12]
- TVA, which operates the local grid, still relies on fossil fuels for around half of its generation.[^13]
It’s not clear to me whether this is just bad press, or whether this system really is an outlier in carbon footprint relative to other AI training clusters of comparable size.
See sustainability in HPC for more.
## Buildout
According to NVIDIA, the cluster was “built by xAI and NVIDIA” in 122 days, with only 19 days between the first rack arriving and training beginning.[^3] It is unclear what role Supermicro or Dell played in this buildout.
## Future
There is a plan to expand this supercomputer with an additional 100,000 H200 GPUs “in a single building”[^14] for a total of 200,000 Hopper-generation GPUs.[^3]
## Footnotes
- Inside the World’s Largest AI Supercluster xAI Colossus (youtube.com)
- Inside the 100K GPU xAI Colossus Cluster that Supermicro Helped Build for Elon Musk - Page 3 of 4 - ServeTheHome
- NVIDIA Ethernet Networking Accelerates World’s Largest AI Supercomputer, Built by xAI | NVIDIA Newsroom
- Dell and Super Micro Computer will build Elon Musk’s ‘AI factory’
- In Memphis, Elon Musk’s xAI Supercomputer Stirs Hope and Concern - Bloomberg
- How Memphis became a battleground over Elon Musk’s xAI supercomputer - OPB
- Inside the 100K GPU xAI Colossus Cluster that Supermicro Helped Build for Elon Musk - Page 4 of 4 - ServeTheHome
- Environmentalists urge health department to take action on xAI
- Why a Memphis Community Is Fighting Elon Musk’s Supercomputer - The New York Times
- “Soon to become a 200k H100/H200 training cluster in a single building” (x.com)