There are two SKUs: [1]

  • Intel Data Center GPU Max 1100 (56 Xe Cores)
  • Intel Data Center GPU Max 1550 (128 Xe Cores)

Specifications

Each Intel Data Center GPU Max 1550 has:

  • 2 Xe Stacks
    • 128 Xe Cores (64 per Stack)
      • 1024 Xe Vector Engines (8 per Xe Core)
      • 1024 Xe Matrix Engines (8 per Xe Core)
    • 900 MHz (base), 1.6 GHz (peak)
    • No sparsity
  • 128 GB HBM2e
    • 3.2768 TB/s [2]
  • 16 Xe Links (D2D)
  • 1x16 PCIe Gen5 or CXL 1.1 (H2D)
  • 600 W maximum
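
The listed HBM bandwidth is consistent with eight HBM2e stacks (two Xe Stacks × four HBM2e controllers each, per the Nomenclature section below), assuming the standard 1024-bit HBM2e interface per stack running at a 3.2 GT/s pin rate:

  8 stacks × 1024 bit × 3.2 GT/s ÷ 8 bit/byte = 3276.8 GB/s ≈ 3.2768 TB/s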

Performance

The following are measured values from preproduction Aurora, [3] obtained by running GEMM for each data type. Although it seems like they should make use of the Xe Matrix Engines (since the benchmark simply calls into oneapi::mkl::blas::column_major::gemm [4]), it is unclear how much of the work is vector FMA versus matrix operations. For example, assuming each 4096-bit matrix engine retires one FP64 FMA per 64-bit lane per cycle, the FP64 matrix peak performance should be somewhere between

  1024 matrix engines × (4096 bit / 64 bit) × 2 FLOP/FMA × 0.9 GHz ≈ 118 TFLOPS

and

  1024 matrix engines × (4096 bit / 64 bit) × 2 FLOP/FMA × 1.6 GHz ≈ 210 TFLOPS,

whereas the corresponding VFMA peak is 1024 vector engines × (512 bit / 64 bit) × 2 FLOP/FMA × 0.9 GHz ≈ 14.7 TFLOPS at the base clock (≈ 26.2 TFLOPS at 1.6 GHz). The measured 17 TFLOPS FP64 is far closer to the VFMA peak than to the Matrix peak, yet it exceeds the base-clock VFMA peak. Perhaps this indicates the work is all VFMA, but running at a higher-than-base frequency?
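
For context, the shape of such a measurement is roughly the following. This is a minimal, hypothetical sketch rather than the ALCF code linked in the footnote: the problem size, iteration count, and timing scheme are arbitrary choices made here, and only the oneapi::mkl::blas::column_major::gemm call reflects what the cited benchmark invokes.

  // Minimal sketch of a DGEMM throughput measurement through oneMKL's SYCL
  // interface. NOT the ALCF benchmark cited above: matrix size, iteration
  // count, and timing scheme are illustrative choices only.
  #include <sycl/sycl.hpp>
  #include <oneapi/mkl.hpp>
  #include <chrono>
  #include <cstdint>
  #include <cstdio>

  int main() {
    const std::int64_t n = 8192;   // square matrices; assumed problem size
    const int iters = 10;

    // In-order queue so the enqueued GEMMs execute back to back.
    sycl::queue q{sycl::gpu_selector_v, sycl::property::queue::in_order{}};

    double *A = sycl::malloc_device<double>(n * n, q);
    double *B = sycl::malloc_device<double>(n * n, q);
    double *C = sycl::malloc_device<double>(n * n, q);
    q.fill(A, 1.0, n * n);
    q.fill(B, 1.0, n * n);
    q.fill(C, 0.0, n * n);
    q.wait();

    auto run_gemm = [&] {
      // C = 1.0 * A * B + 0.0 * C, column-major, no transposes.
      return oneapi::mkl::blas::column_major::gemm(
          q, oneapi::mkl::transpose::nontrans, oneapi::mkl::transpose::nontrans,
          n, n, n, /*alpha=*/1.0, A, n, B, n, /*beta=*/0.0, C, n);
    };

    run_gemm().wait();             // warm-up, excluded from timing

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) run_gemm();
    q.wait();
    auto t1 = std::chrono::steady_clock::now();

    double sec = std::chrono::duration<double>(t1 - t0).count() / iters;
    double tflops = 2.0 * n * n * n / sec / 1e12;   // 2*n^3 FLOPs per GEMM
    std::printf("DGEMM n=%lld: %.1f TFLOP/s\n",
                static_cast<long long>(n), tflops);

    sycl::free(A, q);
    sycl::free(B, q);
    sycl::free(C, q);
    return 0;
  }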

  Data Type    VFMA    Matrix    Sparse
  FP64           17
  FP32           23
  TF32                    110
  FP16                    263
  BF16                    273
  FP8
  INT32
  INT8                    577

  (Values in TFLOP/s, TOP/s for INT8.)

Nomenclature

The Intel terminology is confusing. According to James Brodman (Intel), the breakdown is as follows (a quick arithmetic check against the Specifications above follows these lists): [5]

  • 1 GPU = 2 stacks
  • 1 stack = 4 slices + 4 HBM2e controllers + 8 Xe Links [1]
  • 1 slice = 16 cores
  • 1 core = 8 vector engine + 8 matrix engines
  • 1 vector engine = 512 bits
  • 1 matrix engine = 4096 bits

Also,

  • Stacks used to be called tiles
  • Vector Engines used to be called execution units (EUs)
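
Multiplying the hierarchy out reproduces the totals listed under Specifications:

  2 stacks × 4 slices × 16 cores   = 128 Xe Cores
  128 cores × 8 vector engines     = 1024 Xe Vector Engines
  128 cores × 8 matrix engines     = 1024 Xe Matrix Engines
  2 stacks × 8 Xe Links            = 16 Xe Links
  512-bit vector engine            = 8 FP64 lanes (16 FP32, 32 FP16/BF16)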

Footnotes

  1. GPU Optimization Thread Mapping Occupancy (anl.gov)

  2. https://ark.intel.com/content/www/us/en/ark/products/232873/intel-data-center-gpu-max-1550.html

  3. https://docs.alcf.anl.gov/aurora/node-performance-overview/node-performance-overview/

  4. https://github.com/argonne-lcf/user-guides/blob/main/docs/aurora/node-performance-overview/src/gemm.cpp#L65C5-L65C42

  5. Unified Tensor Interface in DPC++ (iwocl.org)