LineShine is an all-CPU exascale supercomputer at the National Supercomputing Center in Shenzhen (NSCC-SZ), built entirely from domestically produced Chinese hardware with no reliance on foreign chips.
See Tadashi Ogawa’s thread for authoritative references.
See Torsten Hoefler’s photos for details disclosed by Yutong Lu.
System Overview
| Site | NSCC-SZ, Shenzhen, China |
|---|---|
| Peak performance | ~2 EFLOPS FP64 (claimed)1 |
| Node count | 20,480 |
| Processor | LX2 (ARMv9) |
| Interconnect | LingQi (dual-plane fat-tree) |
| Bandwidth/node | 1.6 Tb/s |
| Storage bandwidth | 10 TB/s |
| OS | Anolis OS 8.9 |
Compute Nodes
Each node has two LX2 sockets. The LX2 is an ARMv9 processor with an unusual memory topology: two compute dies per socket, each die with four NUMA domains and on-package HBM alongside off-package DDR.
Per LX2 socket:
- 2 compute dies × 152 cores = 304 cores total
- 8 on-package HBM stacks: 32 GB total, ~4 TB/s aggregate bandwidth
- Off-package DDR: 128 GB per die / 256 GB per socket
- Dedicated SDMA engine per die for DDR↔HBM movement
- Peak: 60.3 TFLOPS FP64 / 120.6 TFLOPS FP32 via SME and SVE units; FP16 and INT8 also supported
Per node (2 × LX2):
- 608 cores
- 64 GB HBM + 512 GB DDR
- ~120.6 TFLOPS FP64 peak
At 20,480 nodes this yields 20,480 × 120.6 TFLOPS ≈ 2.47 EFLOPS FP64 theoretical peak, consistent with the claimed ~2 EFLOPS.
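The aggregation above can be reproduced with a few lines of arithmetic. This is a minimal sketch that uses only the per-socket figures quoted in this section; the variable names are illustrative, and decimal units (1 PB = 10^6 GB) are an assumption, since the source does not say whether GB means 10^9 or 2^30 bytes.

```python
# Per-socket and system figures as quoted in the text; names are illustrative.
NODES = 20_480
SOCKETS_PER_NODE = 2
DIES_PER_SOCKET = 2
CORES_PER_DIE = 152
HBM_GB_PER_SOCKET = 32
DDR_GB_PER_SOCKET = 256
FP64_TFLOPS_PER_SOCKET = 60.3

# Per-node values follow directly from the socket count.
cores_per_node = SOCKETS_PER_NODE * DIES_PER_SOCKET * CORES_PER_DIE
hbm_gb_per_node = SOCKETS_PER_NODE * HBM_GB_PER_SOCKET
ddr_gb_per_node = SOCKETS_PER_NODE * DDR_GB_PER_SOCKET
fp64_tflops_per_node = SOCKETS_PER_NODE * FP64_TFLOPS_PER_SOCKET

# System-wide totals (assumes decimal prefixes: 1 PB = 1e6 GB, 1 EFLOPS = 1e6 TFLOPS).
system = {
    "cores": NODES * cores_per_node,
    "hbm_pb": NODES * hbm_gb_per_node / 1e6,
    "ddr_pb": NODES * ddr_gb_per_node / 1e6,
    "fp64_eflops": NODES * fp64_tflops_per_node / 1e6,
}

for name, value in system.items():
    print(f"{name}: {value:,.3f}" if isinstance(value, float) else f"{name}: {value:,}")
```

Under these assumptions the totals come to roughly 12.45 million cores, 1.31 PB of HBM, 10.49 PB of DDR, and 2.47 EFLOPS FP64. These are theoretical peaks derived from the quoted specifications, not measured results.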
Interconnect
The LingQi network uses a dual-plane multi-rail fat-tree at 1.6 Tb/s per node. The full deployment targets 36 network cabinets.
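To put the per-node figure in context, the sketch below converts it to bytes and relates it to the node's FP64 peak. It assumes that 1.6 Tb/s is the total injection bandwidth of a node across both planes and that decimal prefixes apply; the source does not specify either, nor whether the figure is unidirectional. The summed injection bandwidth is not a bisection bandwidth, which depends on fat-tree details not given here.

```python
# Figures quoted in the text; names and unit conventions are assumptions.
NODES = 20_480
LINK_TBIT_PER_S = 1.6          # per-node network bandwidth
FP64_TFLOPS_PER_NODE = 120.6   # per-node FP64 peak

# Tb/s -> GB/s (8 bits per byte, decimal prefixes).
node_gbyte_per_s = LINK_TBIT_PER_S * 1000 / 8

# Sum of all nodes' injection bandwidth, in PB/s.
system_pbyte_per_s = NODES * node_gbyte_per_s / 1e6

# Network bytes available per peak FP64 operation on one node.
net_bytes_per_flop = (node_gbyte_per_s * 1e9) / (FP64_TFLOPS_PER_NODE * 1e12)

print(f"per node: {node_gbyte_per_s:.0f} GB/s")
print(f"summed injection: {system_pbyte_per_s:.3f} PB/s")
print(f"network balance: {net_bytes_per_flop:.5f} B/FLOP")
```

With these assumptions each node injects about 200 GB/s, or roughly 0.0017 bytes of network traffic per peak FP64 operation.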
Storage
- 428 storage nodes across 67 cabinets
- 10 TB/s aggregate bandwidth
- Liquid-cooled; described as China’s largest liquid-cooled storage deployment
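As a rough sanity check on the storage figures, the following sketch estimates the per-node share of the aggregate bandwidth and a lower bound on the time to write every compute node's memory to storage. It assumes decimal units and that the full 10 TB/s is sustained and evenly shared, which real workloads are unlikely to achieve; the helper name `dump_seconds` is hypothetical.

```python
# Figures quoted in the text; names are illustrative.
STORAGE_TB_PER_S = 10
STORAGE_NODES = 428
COMPUTE_NODES = 20_480
HBM_GB_PER_NODE = 64
DDR_GB_PER_NODE = 512

storage_gb_per_s = STORAGE_TB_PER_S * 1000  # decimal prefixes assumed

# Average bandwidth each storage node must deliver, and each compute node's share.
per_storage_node_gb_s = storage_gb_per_s / STORAGE_NODES
per_compute_node_gb_s = storage_gb_per_s / COMPUTE_NODES


def dump_seconds(gb_per_node: float) -> float:
    """Idealized time to write gb_per_node from every compute node at full bandwidth."""
    return COMPUTE_NODES * gb_per_node / storage_gb_per_s


print(f"per storage node: {per_storage_node_gb_s:.1f} GB/s")
print(f"per compute node: {per_compute_node_gb_s:.3f} GB/s")
print(f"dump all HBM: {dump_seconds(HBM_GB_PER_NODE):.0f} s")
print(f"dump all DDR: {dump_seconds(DDR_GB_PER_NODE):.0f} s")
```

Under these idealized assumptions each storage node averages about 23 GB/s, and writing out all DDR would take on the order of 17 minutes.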
Software Stack
The system runs Anolis OS 8.9 (Alibaba’s RHEL-compatible distro) with a ROCm-compatible environment, GCC 8.5.0, rocBLAS, and PyTorch 2.7.1. Because PyTorch’s CPU backend lacks CUDA stream semantics, the application paper describes a software-defined asynchronous MPI runtime to compensate.2
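The general idea of emulating stream-like ordering in software can be illustrated with a small, generic pattern: a single background thread that executes submitted operations in FIFO order while the caller continues computing. This is not the paper's runtime, whose design is not described here. The `CommStream` class and `fake_allreduce` function are hypothetical, and a plain Python function stands in for a real MPI collective so the sketch runs without an MPI installation.

```python
import queue
import threading
from concurrent.futures import Future


class CommStream:
    """Hypothetical stream-like queue: runs operations in FIFO order on one thread."""

    def __init__(self):
        self._ops = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            item = self._ops.get()
            if item is None:  # shutdown sentinel
                return
            fn, args, fut = item
            try:
                fut.set_result(fn(*args))
            except Exception as exc:  # propagate errors to the waiting caller
                fut.set_exception(exc)

    def submit(self, fn, *args) -> Future:
        """Enqueue an operation and return immediately with a Future."""
        fut = Future()
        self._ops.put((fn, args, fut))
        return fut

    def close(self):
        """Drain remaining operations, then stop the worker thread."""
        self._ops.put(None)
        self._worker.join()


def fake_allreduce(values):
    # Stand-in for a blocking collective; a real runtime would call into MPI here.
    return sum(values)


stream = CommStream()
pending = stream.submit(fake_allreduce, [1.0, 2.0, 3.0])  # returns without blocking
local = sum(x * x for x in range(1000))                   # work overlapped with the "communication"
total = pending.result()                                  # explicit synchronization point
stream.close()
print(total, local)
```

The single worker thread gives the in-order guarantee that a CUDA stream would provide, and `Future.result()` plays the role of a stream synchronization. A production runtime would also need to handle MPI threading levels and progress, which this sketch ignores.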
Notes
- CPU-only: Positioned as a deliberate alternative to GPU-dominated Western systems. Workloads highlighted include molecular simulation, CFD, materials design, and LLM training.
- Domestic stack: LX2 processor, LingQi network, and storage are all Chinese-designed; explicitly framed as a response to US export controls on advanced chips.
- Unusual HBM topology: HBM is attached per NUMA domain (4 GB/domain, 16 GB/die), with SDMA-mediated DDR↔HBM movement. This resembles AMD’s MI300A APU design more than a conventional CPU+HBM scheme.
- Phasing: Phase 1 consisted of 100 Huawei Kunpeng servers (12,800 cores); the 20,480-node system described in the paper is a later phase.3
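The per-domain HBM layout mentioned in the notes can be written down as a small data model to check that the quoted capacities are mutually consistent and to estimate the memory balance of a socket. This is a sketch built only from the figures on this page; the `NumaDomain` type is hypothetical, and how cores and DDR map onto NUMA domains is not stated in the source, so neither is modeled.

```python
from dataclasses import dataclass

# Figures quoted in the text; names are illustrative.
SOCKETS_PER_NODE = 2
DIES_PER_SOCKET = 2
DOMAINS_PER_DIE = 4
HBM_GB_PER_DOMAIN = 4
DDR_GB_PER_DIE = 128
HBM_TB_PER_S_PER_SOCKET = 4.0     # "~4 TB/s aggregate" per socket
FP64_TFLOPS_PER_SOCKET = 60.3


@dataclass(frozen=True)
class NumaDomain:
    """Hypothetical record for one NUMA domain and its attached HBM."""
    socket: int
    die: int
    index: int
    hbm_gb: int = HBM_GB_PER_DOMAIN


# Enumerate every NUMA domain in one node.
node = [
    NumaDomain(s, d, i)
    for s in range(SOCKETS_PER_NODE)
    for d in range(DIES_PER_SOCKET)
    for i in range(DOMAINS_PER_DIE)
]

hbm_per_die = sum(n.hbm_gb for n in node if (n.socket, n.die) == (0, 0))
hbm_per_socket = sum(n.hbm_gb for n in node if n.socket == 0)
hbm_per_node = sum(n.hbm_gb for n in node)

# Capacity ratio between the DDR and HBM tiers on one die.
ddr_to_hbm_ratio = DDR_GB_PER_DIE / hbm_per_die

# HBM bytes available per peak FP64 operation on one socket.
hbm_bytes_per_flop = HBM_TB_PER_S_PER_SOCKET / FP64_TFLOPS_PER_SOCKET

print(len(node), hbm_per_die, hbm_per_socket, hbm_per_node)
print(f"DDR:HBM capacity = {ddr_to_hbm_ratio:.0f}:1")
print(f"HBM balance = {hbm_bytes_per_flop:.3f} B/FLOP")
```

The model gives 16 NUMA domains per node and reproduces the quoted 16 GB/die, 32 GB/socket, and 64 GB/node. DDR holds eight times the capacity of HBM, which suggests why explicit DDR↔HBM staging through the SDMA engines matters for working sets larger than the HBM tier.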
Footnotes
- [2604.15821] Breaking the Training Barrier of Billion-Parameter Universal Machine Learning Interatomic Potentials
- Some secondary sources describe an intermediate industrial complex phase with x86 blades; the paper’s architecture describes the LX2-based phase only.