Trainium2

Chips

Trainium2 is Amazon’s in-house AI training accelerator. A single Trainium2 chip has:¹

8 NeuronCore-v3 cores
96 GiB HBM (2.9 TB/s)
1.3 PF FP8 with support for 4:1 sparsity

Graphically:¹

Trn2 Instances

Trainium2 chips are packaged into Trn2 instances, each with:

16 Trainium2 chips
- 128 NeuronCore-v3
- 1.5 TiB HBM
- 20.8 PF FP8
192 vCPUs
2 TiB DDR DRAM
3.2 Tbps of EFA v3
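
The accelerator totals above follow directly from the per-chip figures. As a quick sanity check, here is a minimal sketch of the arithmetic; the constant and function names are made up for illustration, and the numbers are simply the ones quoted in this note.

```python
# Per-chip figures quoted in the Chips section of this note.
CORES_PER_CHIP = 8
HBM_GIB_PER_CHIP = 96
FP8_PF_PER_CHIP = 1.3

# Chip count quoted for one Trn2 instance.
CHIPS_PER_INSTANCE = 16


def instance_totals(chips: int = CHIPS_PER_INSTANCE) -> dict:
    """Scale the per-chip figures up to a whole instance."""
    return {
        "neuron_cores": chips * CORES_PER_CHIP,
        "hbm_tib": chips * HBM_GIB_PER_CHIP / 1024,  # GiB -> TiB
        "fp8_pf": round(chips * FP8_PF_PER_CHIP, 1),
    }


print(instance_totals())
```

This gives 128 cores, 1.5 TiB of HBM, and 20.8 PF of FP8, matching the list.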

The NOC within a Trn2 instance is described as a “2D torus”, which differs from a plain 2D mesh in that each row and column wraps around at the edges.
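
To make the mesh-versus-torus distinction concrete, here is a small sketch that lists a node's neighbors under each topology. It assumes the 16 chips sit in a 4x4 grid, which is my guess from the chip count rather than something stated above; the function names are likewise just illustrative.

```python
def mesh_neighbors(x, y, width=4, height=4):
    """Neighbors in a plain 2D mesh: nodes on an edge have fewer links."""
    candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [
        (cx, cy)
        for cx, cy in candidates
        if 0 <= cx < width and 0 <= cy < height
    ]


def torus_neighbors(x, y, width=4, height=4):
    """Neighbors in a 2D torus: coordinates wrap around via modulo."""
    return [
        ((x - 1) % width, y),
        ((x + 1) % width, y),
        (x, (y - 1) % height),
        (x, (y + 1) % height),
    ]


# A corner node has two neighbors in a mesh but four in a torus.
print(mesh_neighbors(0, 0))
print(torus_neighbors(0, 0))
```

In a torus every node has the same number of links, so there are no "edge" chips with reduced connectivity.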

UltraServers

It seems like Trn2 instances are packaged into rack-scale UltraServers, each containing four Trn2 instances.

Each UltraServer has:

64 Trainium2 chips
- 512 NeuronCore-v3
- 6 TiB HBM
- 83 PF FP8
768 vCPUs
8 TiB DDR DRAM
12.8 Tbps of EFA v3
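
These UltraServer figures are just the Trn2 instance figures multiplied by four. A minimal sketch of that scaling, using only numbers from this note (the dictionary keys are my own labels):

```python
INSTANCES_PER_ULTRASERVER = 4

# Per-instance figures from the Trn2 Instances section.
trn2_instance = {
    "chips": 16,
    "neuron_cores": 128,
    "hbm_tib": 1.5,
    "fp8_pf": 20.8,
    "vcpus": 192,
    "dram_tib": 2,
    "efa_tbps": 3.2,
}

# Multiply every figure by the instance count; round to hide float noise.
ultraserver = {
    name: round(value * INSTANCES_PER_ULTRASERVER, 1)
    for name, value in trn2_instance.items()
}

for name, value in ultraserver.items():
    print(f"{name}: {value}")
```

The FP8 figure comes out as 83.2 PF, which the list above rounds to 83 PF.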

The 2D tori of the four Trn2 instances within an UltraServer are connected along a third dimension, though it is unclear to me whether the result should be called a 3D torus. The exact language is “cores at corresponding XY positions in each of the four instances are connected in a ring.”
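
One way to read that sentence is sketched below: each node keeps its four in-instance torus links and gains two more links to the same XY position in the neighboring instances. This is only my interpretation. The 4x4 layout per instance, the ordering of instances around the ring, and the function name are all assumptions.

```python
def ultraserver_neighbors(x, y, z, width=4, height=4, instances=4):
    """Neighbors of node (x, y) in instance z under the assumed topology."""
    # Links within the instance's own 2D torus (wraparound in X and Y).
    in_instance = [
        ((x - 1) % width, y, z),
        ((x + 1) % width, y, z),
        (x, (y - 1) % height, z),
        (x, (y + 1) % height, z),
    ]
    # Ring across instances: same XY position, previous and next instance.
    across_instances = [
        (x, y, (z - 1) % instances),
        (x, y, (z + 1) % instances),
    ]
    return in_instance + across_instances


print(ultraserver_neighbors(0, 0, 0))
```

Under these assumptions every node has six neighbors, which is the same adjacency a 4x4x4 3D torus would have.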

Here is what an UltraServer looks like from the front:²

From this, it looks like each UltraServer has:

Two racks
64 (8x4) individual nodes
Cross-rack cabling

Here is a closer-up view of what one of those nodes looks like:³

Zooming in, the labeling implies:

There are two “GPUs” per physical sled, with “left” and “right” ports.
There are external PCIe ports.

The node in the photo above doesn’t exactly match what is shown in the idealized two-rack UltraServer diagram, though.

This is what an UltraServer looks like from the back:³

UltraCluster

AWS announced that it will build its next flagship AI training cluster using Trainium, and Anthropic will be the customer for it.³ This cluster, Rainier, will be located in the continental US and will have “hundreds of thousands” of Trainium2 accelerators.

Glenn's Digital Garden
