Reliability

In HPC, there are two ways to look at the reliability of a supercomputer.

Top-down reliability starts with what a full-scale system job would experience in practice and breaks that experience down. It is governed by metrics that characterize job reliability.
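As a rough illustration of a job-level metric, the sketch below computes a mean time to interrupt from a log of job runs. The note does not name a specific metric, so the metric, the function name `job_mean_time_to_interrupt`, and the sample data are all assumptions made for this example.

```python
def job_mean_time_to_interrupt(run_hours, interrupted):
    """Total hours of job runtime divided by the number of interrupted runs.

    run_hours:   hours each job ran before it finished or was interrupted
    interrupted: True where the matching run ended in an interruption
    """
    total_hours = sum(run_hours)
    interruptions = sum(1 for flag in interrupted if flag)
    if interruptions == 0:
        # No interruptions observed, so the sample gives no finite estimate.
        return float("inf")
    return total_hours / interruptions


# Hypothetical log of four full-scale runs (made-up numbers).
run_hours = [12.0, 30.0, 6.0, 24.0]
interrupted = [True, False, True, False]
jmtti = job_mean_time_to_interrupt(run_hours, interrupted)
print(f"Job mean time to interrupt: {jmtti:.1f} hours")
```

This view needs no knowledge of which component failed; it only records what the job saw.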

Bottom-up reliability starts with individual components and builds a reliability model by connecting those components in series and in parallel. It is governed by metrics that characterize component reliability.
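A minimal sketch of the series and parallel composition described above, assuming component failures are independent. The component reliabilities and the node layout are hypothetical values chosen for illustration, not measurements from any real system.

```python
from math import prod


def series(reliabilities):
    """Reliability of components in series: all of them must work."""
    return prod(reliabilities)


def parallel(reliabilities):
    """Reliability of redundant components in parallel: at least one must work."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Hypothetical node: two redundant power supplies feeding one system board.
power = parallel([0.99, 0.99])
node = series([power, 0.999])

# A full-scale job needs every node, so the nodes are in series.
system = series([node] * 1000)
print(f"node reliability:   {node:.6f}")
print(f"system reliability: {system:.6f}")
```

Because a full-scale job places every node in series, even highly reliable nodes multiply out to a much lower system reliability.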
