LineShine

LineShine is an all-CPU exascale supercomputer at the National Supercomputing Center in Shenzhen (NSCC-SZ), built entirely from domestically produced Chinese hardware with no reliance on foreign chips.

See Tadashi Ogawa’s thread for authoritative references.

System Overview


Site	NSCC-SZ, Shenzhen, China
Peak performance	~2 EFLOPS FP64 (claimed)¹
Node count	20,480
Processor	LX2 (ARMv9)
Interconnect	LingQi (dual-plane multi-rail fat-tree)
Bandwidth/node	1.6 Tb/s
Storage bandwidth	10 TB/s
OS	Anolis OS 8.9

Compute Nodes

Each node has two LX2 sockets. The LX2 is an ARMv9 processor with an unusual memory topology: two compute dies per socket, each die with four NUMA domains and on-package HBM alongside off-package DDR.

Per LX2 socket:

2 compute dies × 152 cores = 304 cores total
8 HBM stacks (on-package): 32 GB, ~4 TB/s aggregate bandwidth
Off-package DDR: 128 GB per die / 256 GB per socket
Dedicated SDMA engine per die for DDR↔HBM movement
Peak: 60.3 TFLOPS FP64 / 120.6 TFLOPS FP32 via SME and SVE units; FP16 and INT8 also supported
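The per-socket figures above can be cross-checked with a little arithmetic. The sketch below derives per-core peak, HBM capacity per stack and per NUMA domain, and the HBM bytes-per-FLOP balance. It assumes cores and HBM are split evenly across the eight NUMA domains, which the source does not state explicitly; all variable names are illustrative.

```python
# Per-socket figures as quoted in this note.
DIES_PER_SOCKET = 2
CORES_PER_DIE = 152
NUMA_DOMAINS_PER_DIE = 4
HBM_STACKS = 8
HBM_GB = 32
HBM_BW_TB_PER_S = 4.0      # "~4 TB/s aggregate", so treat as approximate
PEAK_FP64_TFLOPS = 60.3

cores = DIES_PER_SOCKET * CORES_PER_DIE                  # 304
numa_domains = DIES_PER_SOCKET * NUMA_DOMAINS_PER_DIE    # 8

# Assumption: resources divide evenly across NUMA domains.
cores_per_domain = CORES_PER_DIE // NUMA_DOMAINS_PER_DIE  # 38
hbm_gb_per_stack = HBM_GB / HBM_STACKS                    # 4.0
hbm_gb_per_domain = HBM_GB / numa_domains                 # 4.0

# Derived balance metrics.
gflops_per_core = PEAK_FP64_TFLOPS * 1000 / cores         # ~198.4
hbm_bytes_per_flop = HBM_BW_TB_PER_S / PEAK_FP64_TFLOPS   # ~0.066

print(f"{cores} cores, {numa_domains} NUMA domains, {cores_per_domain} cores/domain")
print(f"{hbm_gb_per_stack:.0f} GB/stack, {hbm_gb_per_domain:.0f} GB/domain")
print(f"{gflops_per_core:.1f} GFLOPS/core, {hbm_bytes_per_flop:.3f} B/FLOP from HBM")
```

Eight stacks over eight NUMA domains gives 4 GB per domain, which agrees with the per-domain figure in the Notes. The roughly 0.066 B/FLOP balance means a kernel needs about 15 FP64 operations per byte fetched from HBM to approach peak.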

Per node (2 × LX2):

608 cores
64 GB HBM + 512 GB DDR
~120.6 TFLOPS FP64 peak

At 20,480 nodes this yields 20,480 × 120.6 TFLOPS ≈ 2.47 EFLOPS FP64 theoretical peak, consistent with the claimed ~2 EFLOPS figure.
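The aggregation from socket to system can be reproduced in a few lines. This is only a check of the quoted numbers (the constant names are illustrative) and says nothing about sustained performance.

```python
NODES = 20_480
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 304
SOCKET_PEAK_FP64_TFLOPS = 60.3

node_peak_tflops = SOCKETS_PER_NODE * SOCKET_PEAK_FP64_TFLOPS   # 120.6
system_peak_eflops = NODES * node_peak_tflops / 1e6             # TFLOPS -> EFLOPS
total_cores = NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET

print(f"node peak:   {node_peak_tflops:.1f} TFLOPS FP64")
print(f"system peak: {system_peak_eflops:.2f} EFLOPS FP64")
print(f"total cores: {total_cores:,}")
```

The same multiplication gives about 12.45 million cores across the full system.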

Interconnect

The LingQi network uses a dual-plane multi-rail fat-tree at 1.6 Tb/s per node. The full deployment targets 36 network cabinets.
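For comparison with the memory and compute figures, the per-node link rate can be converted to bytes. The sketch assumes that "Tb/s" means terabits per second and that 1.6 Tb/s is the total injection bandwidth per node. The source does not say whether the figure is unidirectional or bidirectional, so treat the results as rough.

```python
NODES = 20_480
NODE_LINK_TBIT_PER_S = 1.6          # assumed to be terabits per second
NODE_PEAK_FP64_TFLOPS = 120.6

node_gbyte_per_s = NODE_LINK_TBIT_PER_S * 1000 / 8          # 200 GB/s
aggregate_pbyte_per_s = NODES * node_gbyte_per_s / 1e6      # ~4.1 PB/s
net_bytes_per_flop = node_gbyte_per_s / (NODE_PEAK_FP64_TFLOPS * 1000)

print(f"{node_gbyte_per_s:.0f} GB/s per node")
print(f"{aggregate_pbyte_per_s:.3f} PB/s aggregate injection")
print(f"{net_bytes_per_flop:.4f} network bytes per peak FP64 FLOP")
```

Under these assumptions each node can inject about 200 GB/s, roughly 0.0017 bytes per peak FLOP.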

Storage

428 storage nodes across 67 cabinets
10 TB/s aggregate bandwidth
Liquid-cooled; described as China’s largest liquid-cooled storage deployment
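To put the storage figures in perspective, the sketch below computes the average bandwidth per storage node and a lower bound on the time to write all node memory (HBM plus DDR) to storage at the quoted aggregate rate. The full-memory dump is a hypothetical scenario chosen for illustration, not a workload described in the source.

```python
NODES = 20_480
HBM_GB_PER_NODE = 64
DDR_GB_PER_NODE = 512
STORAGE_NODES = 428
STORAGE_BW_TB_PER_S = 10.0

# Average share of the aggregate bandwidth per storage node.
per_storage_node_gb_per_s = STORAGE_BW_TB_PER_S * 1000 / STORAGE_NODES

# Total compute-node memory, GB -> PB.
total_memory_pb = NODES * (HBM_GB_PER_NODE + DDR_GB_PER_NODE) / 1e6

# Best-case time to write all of it, ignoring every overhead.
full_dump_seconds = total_memory_pb * 1000 / STORAGE_BW_TB_PER_S

print(f"{per_storage_node_gb_per_s:.1f} GB/s per storage node")
print(f"{total_memory_pb:.2f} PB total node memory")
print(f"{full_dump_seconds / 60:.1f} min for a full-memory dump at peak rate")
```

That works out to about 23.4 GB/s per storage node and just under 20 minutes for an idealized full-memory dump.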

Software Stack

Runs Anolis OS 8.9 (Alibaba’s RHEL-compatible distro) with a ROCm-compatible environment plus GCC 8.5.0, rocBLAS, and PyTorch 2.7.1. The application paper describes a software-defined asynchronous MPI runtime to compensate for PyTorch’s CPU backend lacking CUDA stream semantics.²
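As a rough illustration of what emulating stream semantics in software can mean, the sketch below implements an in-order work queue served by a background thread, with an explicit synchronize call. It is not the runtime from the paper and it does not use MPI: `SoftwareStream`, `enqueue`, `synchronize`, and `fake_allreduce` are hypothetical names, and the "communication" is just a local sum.

```python
import queue
import threading


class SoftwareStream:
    """Hypothetical in-order work queue served by one background thread."""

    def __init__(self):
        self._tasks = queue.Queue()
        self._error = None
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            fn, args = self._tasks.get()
            try:
                fn(*args)
            except Exception as exc:   # remember the failure for synchronize()
                self._error = exc
            finally:
                self._tasks.task_done()

    def enqueue(self, fn, *args):
        """Queue fn(*args) and return immediately."""
        self._tasks.put((fn, args))

    def synchronize(self):
        """Block until all queued work is done; re-raise any stored error."""
        self._tasks.join()
        if self._error is not None:
            err, self._error = self._error, None
            raise err


def fake_allreduce(buffers, out):
    # Stand-in for communication: elementwise sum across "ranks".
    out.append([sum(vals) for vals in zip(*buffers)])


comm = SoftwareStream()
result = []
comm.enqueue(fake_allreduce, [[1, 2], [3, 4]], result)  # returns at once
local = sum(x * x for x in range(1000))                 # overlapping compute
comm.synchronize()                                      # wait for the "communication"
print(result[0], local)
```

The point of the pattern is that the caller keeps computing while queued operations run in order on another thread, then waits only where the result is needed. A real runtime would place nonblocking MPI calls behind the queue instead of a local sum.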

Notes

CPU-only: Positioned as a deliberate alternative to GPU-dominated Western systems. Workloads highlighted include molecular simulation, CFD, materials design, and LLM training.
Domestic stack: LX2 processor, LingQi network, and storage are all Chinese-designed; explicitly framed as a response to US export controls on advanced chips.
HBM topology: Unusual; the per-NUMA-domain HBM (4 GB/domain, 16 GB/die) with SDMA-mediated DDR↔HBM movement resembles AMD's MI300A APU design more than a conventional CPU+HBM scheme.
Phasing: Phase 1 was 100 Huawei Kunpeng servers (12,800 cores); the 20,480-node system described in the paper is a later phase.³

¹ Stated at an NSCC-SZ institutional meeting, April 2026.
² From the preprint “Breaking the Training Barrier of Billion-Parameter Universal Machine Learning Interatomic Potentials,” arXiv:2604.15821.
³ Some secondary sources describe an intermediate industrial complex phase with x86 blades; the paper’s architecture describes the LX2-based phase only.

Glenn's Digital Garden
