Revisiting Reliability in Large-Scale Machine Learning Research Clusters is a paper authored by a bunch of folks at Meta that describes the findings from eleven months of operations on two AI clusters: one with 16K A100 GPUs (RSC-1) and another with 8K A100 GPUs (RSC-2). These clusters ran mixed workloads of wildly varying scales, and the paper describes a lot of challenges around reliability and how differently the metrics look when weighted by jobs versus by cycles.
Overall the paper doesn’t have much new, deep insight that would be surprising to people who have been working in HPC for a while. They rediscovered a few metrics that have long been in use (like forward progress, which they call “ETTR”) and emphasize how different the results can be when metrics are weighted by job count instead of node-minutes. They present a heavily formalized model that quantitatively amounts to modeling the reliability of a supercomputer as a pile of nodes connected in series.
A big portion of the paper is also devoted to assessing the impact of their pre-emption policy on overall cluster utilization (which they call “goodput,” a usage they acknowledge differs from the industry-standard definition). Their policy makes jobs eligible for pre-emption after two hours, which allows large jobs to launch without forcing significant parts of the cluster to drain; while this does reduce queue wait time, rapid failures of large jobs cause excessive pre-emption of small jobs and undercut some of the utilization gains from the big jobs.
Although this paper doesn’t contain any new breakthrough insights or methods, it is a good signal that the AI community is arriving at the same conclusions around quantifying reliability as the HPC community. This paper also contains a bunch of operational nuggets and anecdotes (highlighted below) that are indicative of what other leading AI research labs are probably doing. Good on Meta for being open about how they operate so that others who are further behind on their journey can follow.
A few key findings that I thought are worth bubbling up:
- They use node health checks that run every five minutes and catch a variety of overlapping issues, and these checks are integrated with the workload orchestrator (Slurm). This underscores the importance of having reliability integrated throughout the entire stack, from hardware health up into the application layer. This is easier for Meta because both research and facilities live under the same roof, but it is harder for AI labs that rely on a third party to provide their training infrastructure.
- They suffer a significant number of job failures due to their reliance on file systems. AI labs would do well to avoid parallel file systems and instead use object storage; doing so decouples the way applications interact with data from the overall health of the node, since the node only needs to provide the data plane (the network connectivity to storage) and not the control plane (authentication and authorization, which file-based storage requires of the node). Object storage is a user-space protocol, so authentication and authorization are delegated to the application layer.
Quote
4k GPU jobs constitute less than 1% of our jobs while consuming 12% of the GPU resources at the cluster level.
Quote
11 months of data collected from state-of-the-art AI researcher clusters with >80% utilization.
Quote
RSC-1 and RSC-2, follow the same design template discussed below. RSC-1 is a general ML cluster (e.g., training some of the prominent LLMs) of 16k GPU size, while RSC-2 focuses on vision applications and is of 8k GPU size.
Quote
Leaning into the High-Performance Computing (HPC) stack, our clusters use the Slurm [45] scheduler on top of bare-metal allocations.
Quote
Jobs are eligible to be preempted after running for 2 hours, and they have a maximum lifetime of 7 days.
Quote
Overall, our clusters average 7.2k for RSC-1 and 4.4k for RSC-2 jobs submitted per day, averaging 83% and 85% cluster utilization, respectively.
Quote
each rack has two servers, and ten racks are connected via a rail-optimized network, forming a pod. Pod-pod communications going through the next level of switches (spine switches).
Quote
our infrastructure is instead designed to check that jobs are running on healthy hardware, restarting the job on different nodes if there is a failure. This can be viewed as a cooperative recovery strategy as the application is still responsible for correctly implementing checkpoint and resume logic.
This requires the application to be aware of the infrastructure and vice versa, and it underscores the importance of having infrastructure that is programmable by the application layer.
Quote
health checks that are periodically scheduled to run every five minutes, and return codes indicating success, failure, or warning. Each health check examines some aspect of node health, spanning from GPU errors (e.g. XID errors [9]) to file system mounts, and services status (i.e., scheduler).
Quote
High severity check failures will immediately signal a scheduler handler to remove the node and reschedule all jobs executing on the node, while lower severity checks will signal to the scheduler to remove the node for remediation after jobs running on the node have finished
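To make the division of labor concrete, here is a minimal sketch (not Meta’s actual tooling) of how a periodic health check could feed the scheduler: high-severity failures take the node down immediately, which requeues its jobs, while low-severity failures only drain it so running jobs can finish. The check names and severity mapping are invented for illustration; the scontrol invocations are standard Slurm.

```python
import subprocess

# Placeholder checks; a real deployment would inspect XID logs, filesystem
# mounts, service status, etc. Each returns (severity, reason), where
# severity is "ok", "warn" (remediate after jobs finish), or "crit" (pull now).
def check_gpu_xid(node):
    return ("ok", "")

def check_fs_mounts(node):
    return ("ok", "")

CHECKS = [check_gpu_xid, check_fs_mounts]

def run_health_checks(node):
    worst, reason = "ok", ""
    for check in CHECKS:
        severity, why = check(node)
        if severity == "crit":
            return ("crit", why)
        if severity == "warn" and worst == "ok":
            worst, reason = ("warn", why)
    return (worst, reason)

def act_on_result(node, severity, reason):
    if severity == "crit":
        # Marking the node DOWN makes Slurm requeue the jobs running on it.
        subprocess.run(["scontrol", "update", f"NodeName={node}",
                        "State=DOWN", f"Reason={reason}"], check=True)
    elif severity == "warn":
        # DRAIN lets running jobs finish but accepts no new work.
        subprocess.run(["scontrol", "update", f"NodeName={node}",
                        "State=DRAIN", f"Reason={reason}"], check=True)

# Run from cron or a systemd timer every five minutes on each node.
```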
Quote
ETTR is defined as the ratio of productive runtime to the available wallclock time of a job run.
Infrastructure providers operating in zero-trust mode have no insight into this because the infrastructure has no visibility into the application runtime space. As such, the infrastructure cannot define productive runtime.
Quote
The exact definition of productive runtime is open to interpretation depending on context, but we consider two sources of unproductive scheduled time:
So Meta has rediscovered the idea of “forward progress” as defined by NNSA.
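For concreteness, the metric itself is trivial to compute once you decide what counts as unproductive. A minimal sketch with made-up field names (not Meta’s schema), reading the two unproductive buckets as restart overhead and work recomputed since the last checkpoint:

```python
def ettr(wallclock_hrs, restart_overhead_hrs, recomputed_hrs):
    """Effective Training Time Ratio: productive runtime / available wallclock.

    "Unproductive" time here is assumed to be (a) restart/initialization
    overhead after each interruption and (b) work redone because it happened
    after the last checkpoint.
    """
    productive = wallclock_hrs - restart_overhead_hrs - recomputed_hrs
    return productive / wallclock_hrs

# e.g., a 100-hour run that lost 2 hours to restarts and 5 hours to recompute:
print(ettr(100.0, 2.0, 5.0))   # 0.93
```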
Quote
job preemption, resource fragmentation, and failures are the dominant sources of lost goodput.
Quote
a NCCL timeout occurs whenever a rank observes that a collective operation, such as an AllReduce, has not completed within a several minutes.
I am surprised the NCCL timeouts take minutes.
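For context, the watchdog timeout is something the application chooses when it initializes the process group; the ten minutes below is just an illustrative value, not what Meta runs with.

```python
from datetime import timedelta
import torch.distributed as dist

# The NCCL watchdog raises an error if a collective has not completed within
# this timeout; the PyTorch default is on the order of tens of minutes, and
# it is set per process group at initialization.
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=10))
```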
Quote
Errors such as NCCL timeouts may be naively attributed to a proximal cause e.g., on the network rather than a deadlock. Networking has a large “blast-radius”, causing errors across the stack.
Quote
We attribute a failure to a cause if the cause was detected within the last 10 minutes or 5 minutes after a failing jobs lifetime (FAILED or NODE_FAIL).
Again, this works because there is feedback on the state of the application that triggers root-cause attribution at the infrastructure level.
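A minimal sketch of that attribution rule as I read it: join health events to failed jobs when the event lands in a window stretching from ten minutes before the job’s end to five minutes after. Everything besides the window edges is invented for illustration.

```python
from datetime import timedelta

def attribute_failures(failed_jobs, health_events):
    """Map each failed job (FAILED or NODE_FAIL) to candidate causes.

    failed_jobs:   list of dicts with "job_id", "nodes", "end_time"
    health_events: list of dicts with "node", "time", "check_name"
    """
    attributions = {}
    for job in failed_jobs:
        window_start = job["end_time"] - timedelta(minutes=10)
        window_end = job["end_time"] + timedelta(minutes=5)
        causes = [ev["check_name"] for ev in health_events
                  if ev["node"] in job["nodes"]
                  and window_start <= ev["time"] <= window_end]
        attributions[job["job_id"]] = causes or ["unknown"]
    return attributions
```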
Quote
IB Links, filesystem mounts, GPU memory errors, and PCIe errors contribute heavily to the failure rates, however for IB Links in particular this seems to be dominated by a short period of many IB Link related job failures from a handful of nodes in the summer of 2024 as shown in Figure 5.
The fact that file system mounts contribute so much to job failures is a strong indictment against relying on shared, file-based storage for model training. Had Meta chosen to just not offer parallel file storage (which requires a stateful relationship between each compute node’s kernel and a remote, distributed service) and instead used object storage exclusively, a significant number of job failures could’ve been avoided entirely.
This isn’t to say that storage-related problems wouldn’t have ever caused problems, but object storage puts the responsibility of authentication and session management in the hands of the application. In doing so, applications can respond more flexibly to misbehaving storage since storage issues aren’t node health problems anymore.
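As an illustration, here is a sketch of checkpoint reads going through object storage with application-level retries; the bucket and key are hypothetical, and a real setup needs credentials and more careful error handling. The point is that a flaky storage backend surfaces as a catchable exception in user space rather than as a hung kernel mount that fails node health checks.

```python
import time
import boto3
from botocore.exceptions import BotoCoreError, ClientError

s3 = boto3.client("s3")

def load_checkpoint(bucket, key, local_path, retries=5):
    """Fetch a checkpoint object, retrying with backoff on storage errors.

    Because this is a user-space protocol, storage trouble shows up as an
    exception the application can handle (retry, fail over, or skip) rather
    than as an unhealthy filesystem mount on the node.
    """
    for attempt in range(retries):
        try:
            s3.download_file(bucket, key, local_path)
            return local_path
        except (BotoCoreError, ClientError):
            time.sleep(2 ** attempt)  # exponential backoff
    raise RuntimeError(f"could not fetch s3://{bucket}/{key} after {retries} tries")

# Hypothetical usage:
# load_checkpoint("my-training-bucket", "run42/ckpt-001234.pt", "/tmp/ckpt.pt")
```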
Quote
Failures may co-occur—3% and 5% of hardware failures on RSC-1/RSC-2 have co-occuring events of similar priority. For example, we observe PCIe errors often co-occur with XID 79 (GPU falling off the bus) and IPMI “Critical Interrupt” events.
Sounds familiar.
Quote
Figure 7 illustrates that the mean-time-to-failure (MTTF) of 1024-GPU jobs is 7.9 hours—roughly 2 orders-of-magnitude lower than 8-GPU jobs (47.7 days).
From their 1024-GPU job failures (a 1024-GPU job spans 128 eight-GPU nodes), the MTBF of a single node works out to 42.1 days. So they just confirmed that each GPU node is a single point of failure. This should not be surprising.
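The arithmetic, for anyone who wants to check it (assuming 8 GPUs per node and that any single node failure kills the job):

```python
nodes_per_job = 1024 // 8      # 128 nodes, 8 GPUs each
job_mttf_hours = 7.9           # from the paper

# If the job dies when any one of N independent nodes fails, the per-node
# MTBF is roughly N times the job MTTF.
node_mtbf_days = nodes_per_job * job_mttf_hours / 24
print(node_mtbf_days)          # ~42.1 days
```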
Quote
The worst-case version of this is a crash loop, where a single job is configured to requeue on failures (e.g., by using exception handling in the submission script). In the period we observe, we see a 1024 GPU job NODE_FAIL and subsequently requeue 35 times, causing a total of 548 preemptions (over 7k GPUs).
This is a bad interaction between policy and infrastructure.
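One cheap mitigation on the job side is to cap automatic requeues rather than retrying unconditionally. Slurm exports SLURM_RESTART_COUNT to requeued jobs, so a guard like this sketch (mine, not the paper’s) keeps a wedged job from relaunching 35 times:

```python
import os
import sys

MAX_RESTARTS = 3  # arbitrary cap for illustration

# Slurm increments SLURM_RESTART_COUNT each time a job is requeued.
restarts = int(os.environ.get("SLURM_RESTART_COUNT", "0"))
if restarts >= MAX_RESTARTS:
    sys.exit("giving up after repeated NODE_FAIL requeues; "
             "flag the job for human attention instead of requeueing again")

# ...otherwise continue with normal startup (load checkpoint, train, etc.)
```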
Quote
While optimizing large jobs is clearly important, 16% of the total lost goodput resulting from hardware failures is due to second-order preemptions, which come from jobs of much smaller sizes. These results indicate that the cluster as a whole is impacted beyond the failures themselves.
To restate, a significant amount of cluster utilization loss is due to their preemption policy. This is not surprising; everyone who’s had to schedule hugely variable job sizes has encountered this in the form of backfill bubbles or node draining bubbles.
Quote
u_0 ≈ 5–20 mins
Restart time is 5-20 minutes after a failure.
Quote
we find RSC-1 GPUs are swapped at ~3 times the rate compared to RSC-2; both the GPU swap rate and failure rate differences may be due to differing workloads that tax GPUs on RSC-1 more heavily.
This is a weak explanation for an interesting observation: their larger cluster is significantly less reliable on a per-node basis than their smaller one. Was the larger cluster in service longer than the smaller one? That is, are they observing higher dropout from GPUs that have simply been in service longer?
Quote
Moving to a 5 minute checkpoint interval would increase expected ETTR to 0.93, illustrating the value of frequent checkpointing to insulate against interruptions (assuming checkpoint writes are non-blocking)
This statement has no meaning. If checkpointing were non-blocking, why not just checkpoint continuously and get 100% ETTR? I can appreciate that reducing the checkpoint interval improves forward progress/ETTR, but assuming non-blocking checkpoints is akin to assuming a spherical cow here.
Quote
2048-4096 GPU job runs on RSC-1 show an average ETTR of over 0.9 at a one-hour assumed checkpoint interval
That’s a good milestone, but the previous paragraph suggests it is just a function of scale. For larger training jobs, this setup of hourly checkpointing would not work, and this should not be a surprise.
Quote
To reach ETTR of 0.9 for a 100,000 GPU training runs on a hypothetical cluster with an RSC-2-like failure rate, checkpointing intervals and restart overhead need to be ~2 minutes.
You don’t need such a heavily formalized model to make these predictions, because the data presented here (and reality) is that reliability is well-approximated as a system of independent nodes connected in series.
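To make that point concrete, here is the back-of-envelope version: treat the job as N nodes in series, so its MTTF is the node MTTF divided by N, and charge each failure the restart overhead plus half a checkpoint interval of recomputed work. The 15-minute restart overhead below is just the middle of the 5–20 minute range quoted above, and this crude model ignores checkpoint write cost and failures during recovery, yet it lands in the same ballpark as the paper’s formal model.

```python
def expected_ettr(job_mttf_hours, ckpt_interval_hours, restart_hours):
    """Crude ETTR estimate: each failure costs the restart overhead plus,
    on average, half a checkpoint interval of recomputed work."""
    lost_per_failure = restart_hours + ckpt_interval_hours / 2.0
    return 1.0 - lost_per_failure / job_mttf_hours

# Numbers already quoted above: a 1024-GPU job with a 7.9-hour MTTF,
# hourly checkpoints, and ~15 minutes of restart overhead.
print(expected_ettr(7.9, 1.0, 0.25))      # ~0.9
# Shrinking the checkpoint interval to 5 minutes recovers most of the gap:
print(expected_ettr(7.9, 5 / 60, 0.25))   # ~0.96
```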
Quote
Among tens of detection signals available on each node, the following ones correlate with lemon nodes the most:
- excl_jobid_count: Number of distinct jobs that excluded a node.
- xid_cnt: Number of unique XID errors a node experienced.
- tickets: Count of repair tickets created for a node.
- out_count: Number of times node was taken out of availability from the scheduler.
- multi_node_node_fails: Number of multi-node job failures caused by a node.
- single_node_node_fails: Number of single-node job failures caused by a node.
- single_node_node_failure_rate: Rate of single-node job failures on a node.
It sounds like they did the same thing as I did when using Darshan logs en masse to correlate job slowness with specific Lustre OSTs.1
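The general recipe looks something like the following strawman (mine, not Meta’s implementation): aggregate failure attributions per node over a trailing window and flag the outliers.

```python
from collections import Counter

def find_lemon_candidates(job_failures, min_failures=3):
    """job_failures: iterable of (job_id, failed_node) pairs attributed to a
    node over some trailing window (e.g., the last 30 days).

    Flags nodes implicated in unusually many failed jobs. Real detection
    would combine several signals (exclusions, XIDs, tickets, ...) as in the
    quote above, but the shape of the problem is the same.
    """
    per_node = Counter(node for _job, node in job_failures)
    if not per_node:
        return []
    mean = sum(per_node.values()) / len(per_node)
    # Flag nodes well above both an absolute floor and the fleet average.
    return sorted(n for n, c in per_node.items()
                  if c >= min_failures and c > 3 * mean)
```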
Quote
Our lemon node detection mechanism led to 10% reduction in large job failures (512+ GPUs), from 14% to 4%.
Observation 11: Historic data is necessary to find defective nodes. Implementing lemon node detection can improve large job completion rate by over 30%.
The general principle is good: find nodes that keep showing up in jobs that fail. However, they are cherry-picking the definition of “large job” here, and I don’t see how a 10-percentage-point reduction in large-job failures becomes a 30% improvement in job completion rate. This feels like the authors are playing games with statistics to show impact rather than objectively measuring improvement in a way that reflects overall positive outcomes for the cluster. As such, it’s hard to contextualize the impact of this lemon node detection.
However, the qualitative statement that finding lemon nodes is good is undeniable.
Quote
The network must remove and route around failures. Without resilience mechanisms in place, over 50% of bandwidth may be lost.
This is why everyone uses adaptive routing, and there is no reason these days not to use it. I guess this statement is meaningful if your goal is to push for using a fabric that supports fine-grained adaptive routing (i.e., not standard RoCE).
Quote
We therefore envision future infrastructure systems that attempt to make unreliability less noticeable rather than attempting to remove it altogether.
This is a truism. Nobody would disagree. Nobody is trying to make unreliability go away, nor has anyone ever tried to do this since the early days of distributed computing.
Quote
We can improve the success rate of training runs by retroactively identifying the root cause of a NCCL timeout, by comparing logged data across different ranks participating in the collective.
Isn’t this what PyTorch flight recorder already does?
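It is, at least in spirit: the flight recorder dumps each rank’s view of recent collectives, and the triage boils down to finding which ranks never completed the collective everyone else is stuck on. A hedged sketch of that comparison over a hypothetical per-rank dump (rank mapped to last completed collective sequence number):

```python
def find_straggler_ranks(last_completed_seq):
    """last_completed_seq: {rank: highest collective seq number that rank completed}.

    When a NCCL timeout fires, the ranks whose counters lag the rest are the
    ones to inspect first (their node, their link, their stack trace) rather
    than blaming the network wholesale.
    """
    newest = max(last_completed_seq.values())
    return sorted(r for r, seq in last_completed_seq.items() if seq < newest)

# Hypothetical example: rank 3 never finished collective #1042.
print(find_straggler_ranks({0: 1042, 1: 1042, 2: 1042, 3: 1041}))   # [3]
```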