LineShine is an all-CPU exascale supercomputer at the National Supercomputing Center in Shenzhen (NSCC-SZ), built entirely from domestically produced Chinese hardware with no reliance on foreign chips.

See Tadashi Ogawa’s thread for authoritative references.

System Overview

Site: NSCC-SZ, Shenzhen, China
Peak performance: ~2 EFLOPS FP64 (claimed) [1]
Node count: 20,480
Processor: LX2 (ARMv9)
Interconnect: LingQi (dual-plane multi-rail fat-tree)
Network bandwidth per node: 1.6 Tb/s
Storage bandwidth: 10 TB/s
OS: Anolis OS 8.9

Compute Nodes

Each node has two LX2 sockets. The LX2 is an ARMv9 processor with an unusual memory topology: two compute dies per socket, each die with four NUMA domains and on-package HBM alongside off-package DDR.

Per LX2 socket:

  • 2 compute dies × 152 cores = 304 cores total
  • 8 HBM stacks (on-package): 32 GB, ~4 TB/s aggregate bandwidth
  • Off-package DDR: 128 GB per die / 256 GB per socket
  • Dedicated SDMA engine per die for DDR↔HBM movement
  • Peak: 60.3 TFLOPS FP64 / 120.6 TFLOPS FP32 via SME and SVE units; FP16 and INT8 also supported
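
The per-socket figures above can be broken down per NUMA domain. The following is a minimal sketch, not a vendor-provided description: it assumes cores and DDR are split evenly across the four NUMA domains of each die, which the source does not state explicitly. The `NumaDomain` class and `socket_domains` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass

# Figures taken from the per-socket list above.
DIES_PER_SOCKET = 2
NUMA_DOMAINS_PER_DIE = 4
CORES_PER_DIE = 152
HBM_GB_PER_SOCKET = 32
DDR_GB_PER_DIE = 128


@dataclass(frozen=True)
class NumaDomain:
    die: int        # die index within the socket
    index: int      # NUMA domain index within the die
    cores: int
    hbm_gb: float
    ddr_gb: float


def socket_domains() -> list[NumaDomain]:
    """Enumerate the NUMA domains of one socket, ASSUMING an even resource split."""
    domains_per_socket = DIES_PER_SOCKET * NUMA_DOMAINS_PER_DIE
    return [
        NumaDomain(
            die=die,
            index=idx,
            cores=CORES_PER_DIE // NUMA_DOMAINS_PER_DIE,      # assumption: even split
            hbm_gb=HBM_GB_PER_SOCKET / domains_per_socket,
            ddr_gb=DDR_GB_PER_DIE / NUMA_DOMAINS_PER_DIE,     # assumption: even split
        )
        for die in range(DIES_PER_SOCKET)
        for idx in range(NUMA_DOMAINS_PER_DIE)
    ]


domains = socket_domains()
print(len(domains), domains[0])
```

Under these assumptions each socket exposes 8 NUMA domains of 38 cores, each with 4 GB of HBM (matching the per-domain figure in the Notes) and 32 GB of DDR.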

Per node (2 × LX2):

  • 608 cores
  • 64 GB HBM + 512 GB DDR
  • ~120.6 TFLOPS FP64 peak

At 20,480 nodes, this yields a theoretical FP64 peak of ~2.47 EFLOPS (20,480 × 120.6 TFLOPS), consistent with the stated 2+ EFLOPS claim.
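
The arithmetic behind these totals can be checked with a short script. It uses only the per-socket and node-count figures quoted in this article; the variable names are illustrative.

```python
SOCKET_FP64_TFLOPS = 60.3   # per LX2 socket, from the list above
CORES_PER_SOCKET = 304
SOCKETS_PER_NODE = 2
NODES = 20_480

node_tflops = SOCKET_FP64_TFLOPS * SOCKETS_PER_NODE
system_eflops = node_tflops * NODES / 1e6          # 1 EFLOPS = 1e6 TFLOPS
total_cores = CORES_PER_SOCKET * SOCKETS_PER_NODE * NODES

print(f"{node_tflops:.1f} TFLOPS/node, {system_eflops:.2f} EFLOPS, {total_cores:,} cores")
# prints: 120.6 TFLOPS/node, 2.47 EFLOPS, 12,451,840 cores
```

These are theoretical peaks derived from the quoted specifications, not measured results.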

Interconnect

The LingQi network uses a dual-plane multi-rail fat-tree at 1.6 Tb/s per node. The full deployment targets 36 network cabinets.
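
For comparison with the memory and storage figures, which are given in bytes, the per-node network figure can be converted from bits to bytes. This sketch assumes decimal units and treats 1.6 Tb/s as the total injection bandwidth of one node; the source does not say whether the figure is unidirectional or summed over both planes.

```python
NODE_TBIT_PER_S = 1.6
NODES = 20_480

# Tb/s -> GB/s: multiply by 1000 (tera -> giga), divide by 8 (bits -> bytes).
node_gbyte_per_s = NODE_TBIT_PER_S * 1000 / 8

# Sum over all nodes; this is aggregate injection bandwidth, NOT bisection bandwidth.
aggregate_pbyte_per_s = node_gbyte_per_s * NODES / 1e6

print(f"{node_gbyte_per_s:.0f} GB/s per node, {aggregate_pbyte_per_s:.3f} PB/s aggregate injection")
```

That works out to about 200 GB/s per node, roughly a twentieth of a single socket's ~4 TB/s HBM bandwidth.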

Storage

  • 428 storage nodes across 67 cabinets
  • 10 TB/s aggregate bandwidth
  • Liquid-cooled; described as China’s largest liquid-cooled storage deployment
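
As a rough sanity check on the storage figures, the aggregate bandwidth can be divided across storage and compute nodes. This is a back-of-the-envelope sketch that assumes decimal units and a perfectly even distribution, neither of which the source confirms.

```python
STORAGE_TB_PER_S = 10
STORAGE_NODES = 428
COMPUTE_NODES = 20_480   # from the System Overview

# Average bandwidth each storage node would need to deliver (GB/s).
per_storage_node_gb_s = STORAGE_TB_PER_S * 1000 / STORAGE_NODES

# Average storage bandwidth available to each compute node when all are active (MB/s).
per_compute_node_mb_s = STORAGE_TB_PER_S * 1e6 / COMPUTE_NODES

print(f"{per_storage_node_gb_s:.1f} GB/s per storage node, "
      f"{per_compute_node_mb_s:.0f} MB/s per compute node")
```

Under these assumptions each storage node serves about 23 GB/s, and each compute node sees just under 0.5 GB/s when the whole machine performs I/O at once.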

Software Stack

The system runs Anolis OS 8.9 (Alibaba’s RHEL-compatible distribution) with a ROCm-compatible environment, GCC 8.5.0, rocBLAS, and PyTorch 2.7.1. Because PyTorch’s CPU backend lacks CUDA stream semantics, the application paper describes a software-defined asynchronous MPI runtime to compensate. [2]
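
To illustrate what "stream semantics" means here, the sketch below shows a generic in-order asynchronous work queue that lets communication overlap with computation. It is not the paper's runtime and does not use MPI: the `Stream` class and `fake_allreduce` function are hypothetical stand-ins built on the Python standard library only.

```python
from concurrent.futures import Future, ThreadPoolExecutor
import time


class Stream:
    """Hypothetical in-order asynchronous work queue (a stand-in for a CUDA-like stream)."""

    def __init__(self) -> None:
        # A single worker thread guarantees that enqueued operations run in FIFO order.
        self._pool = ThreadPoolExecutor(max_workers=1)

    def enqueue(self, fn, *args) -> Future:
        # Returns immediately; the caller keeps computing while fn runs in the background.
        return self._pool.submit(fn, *args)

    def synchronize(self) -> None:
        # Block until everything enqueued so far has completed.
        self._pool.submit(lambda: None).result()

    def close(self) -> None:
        self._pool.shutdown(wait=True)


def fake_allreduce(grads: list[float]) -> float:
    """Placeholder for a collective operation; sleeps to mimic network latency."""
    time.sleep(0.05)
    return sum(grads)


comm = Stream()
pending = comm.enqueue(fake_allreduce, [1.0, 2.0, 3.0])  # non-blocking
local = sum(i * i for i in range(1000))                   # overlaps with the "communication"
reduced = pending.result()                                # wait only when the value is needed
comm.close()
print(local, reduced)
```

The design point is that ordering within a stream is implicit, so the caller only synchronizes where a result is consumed; a real runtime would issue non-blocking MPI operations instead of the placeholder function.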

Notes

  • CPU-only: Positioned as a deliberate alternative to GPU-dominated Western systems. Workloads highlighted include molecular simulation, CFD, materials design, and LLM training.
  • Domestic stack: LX2 processor, LingQi network, and storage are all Chinese-designed; explicitly framed as a response to US export controls on advanced chips.
  • HBM topology: Unusual for a CPU. The per-NUMA-domain HBM (4 GB/domain, 16 GB/die) with SDMA-mediated DDR↔HBM movement resembles the MI300A APU design more than a conventional CPU+HBM scheme.
  • Phases: Phase 1 was 100 Huawei Kunpeng servers (12,800 cores); the 20,480-node system described in the paper is a later phase. [3]

Footnotes

  1. Stated at an NSCC-SZ institutional meeting, April 2026.

  2. From the preprint “Breaking the Training Barrier of Billion-Parameter Universal Machine Learning Interatomic Potentials,” arXiv:2604.15821.

  3. Some secondary sources describe an intermediate industrial complex phase with x86 blades; the paper’s architecture describes the LX2-based phase only.