MTBF, FIT, and AFR are closely related concepts that describe component reliability. In brief,

$$\text{MTBF} = \frac{10^9\ \text{hours}}{\text{FIT}} = \frac{8{,}766\ \text{hours}}{\text{AFR}}$$


Mean Time Between Failures (MTBF) refers to the mean time between failures of a single component or a system of components. Pedantically, a failure is something that requires replacement (like a GPU burning up). MTBF has units of time (hours, days, months, years) and is calculated simply as:

$$\text{MTBF} = \frac{\text{time in service}}{\text{number of failures}}$$

If you had to send five nodes back for replacement in the past six months, your MTBF is

$$\text{MTBF} = \frac{6\ \text{months}}{5\ \text{failures}} = 1.2\ \text{months}$$


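FIT (failures in time) rate is a closely related concept: it is the expected number of failures per billion hours in service, which is the inverse of the MTBF (in hours) scaled by $10^9$:

$$\text{FIT} = \frac{10^9\ \text{hours}}{\text{MTBF}}$$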

It is a unitless quantity because it represents a number of failures.

Hardware vendors often express component reliability in terms of their FIT rate.


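AFR (annual failure rate) is the percent chance that a component will fail in a year. Assuming a year is 8,766 hours:

$$\text{AFR} = \frac{8{,}766\ \text{hours}}{\text{MTBF}} = \text{FIT} \times \frac{8{,}766}{10^9}$$

Multiply by 100 to express the result as a percentage.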

Like FIT, it is unitless since it really represents the number of components that will fail within a year.
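The three definitions above can be sketched as a handful of conversion helpers. This is a minimal illustration, not a standard library: the function names are made up for this example, MTBF is assumed to be in hours, and AFR is returned as a fraction rather than a percentage.

```python
HOURS_PER_YEAR = 8766   # same year length the text assumes
BILLION_HOURS = 1e9     # FIT counts failures per billion hours


def mtbf_from_observations(hours_in_service: float, failures: int) -> float:
    """MTBF in hours: observed time in service divided by failures seen."""
    return hours_in_service / failures


def fit_from_mtbf(mtbf_hours: float) -> float:
    """Failures per billion hours for a given MTBF in hours."""
    return BILLION_HOURS / mtbf_hours


def mtbf_from_fit(fit: float) -> float:
    """MTBF in hours for a given FIT rate."""
    return BILLION_HOURS / fit


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Expected failures per year as a fraction (0.05 means 5%)."""
    return HOURS_PER_YEAR / mtbf_hours


def afr_from_fit(fit: float) -> float:
    """Expected failures per year as a fraction, straight from FIT."""
    return fit * HOURS_PER_YEAR / BILLION_HOURS


# The example from the text: five replacements in six months.
mtbf = mtbf_from_observations(HOURS_PER_YEAR / 2, 5)
print(f"MTBF: {mtbf:.1f} hours")
print(f"FIT:  {fit_from_mtbf(mtbf):,.0f}")
print(f"AFR:  {afr_from_mtbf(mtbf):.1f} failures per year")
```

Five failures in half a year works out to roughly ten failures per year, which is a quick sanity check on the AFR conversion.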

Predicting MTBF, FIT, and AFR

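A nice property of FIT is that, for a system of components connected in series, you can add up the FIT rates of all components to get the FIT of the total system. MTBFs do not add this way; their reciprocals do, so it is easiest to sum FIT rates and convert the total to MTBF at the end.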
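A short derivation of why failure rates add in series may help here. It assumes each component fails independently with a constant failure rate $\lambda_i$ (an exponential lifetime model), which the text above does not state explicitly. Under that assumption, the probability that component $i$ survives to time $t$ is

$$R_i(t) = e^{-\lambda_i t}$$

A series system survives only if every component survives, so

$$R_\text{sys}(t) = \prod_i R_i(t) = e^{-\left(\sum_i \lambda_i\right) t} \quad\Rightarrow\quad \lambda_\text{sys} = \sum_i \lambda_i$$

Since FIT is just $\lambda$ expressed per $10^9$ hours and $\text{MTBF} = 1/\lambda$, this gives

$$\text{FIT}_\text{sys} = \sum_i \text{FIT}_i \qquad\text{and}\qquad \frac{1}{\text{MTBF}_\text{sys}} = \sum_i \frac{1}{\text{MTBF}_i}$$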

Example: Calculating node reliability

Let’s say you have a node that is composed of the following parts (FIT values made up by ChatGPT):

| Component | Component FIT | Qty per node | Total FIT |
|---|---|---|---|
| CPU + DRAM | 1,000 | 2 | 2,000 |
| GPU + HBM | 1,500 | 8 | 12,000 |
| SSD backplane | 200 | 1 | 200 |
| Power supply | 2,000 | 8 | 16,000 |

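The FIT for the whole node is the sum of all “Total FIT” values, which is just the sum of FIT rates for every component:

$$\text{FIT}_\text{node} = \sum_i n_i \times \text{FIT}_i$$

where $\text{FIT}_i$ is the FIT of component type $i$ and $n_i$ is its quantity per node.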

This is valid because every component is connected in series; the failure of one component causes the whole node to fail.
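The series sum can be sketched in a few lines of Python. The `PARTS` list and the `series_fit` helper are hypothetical names for this example, and the list contains only the four rows shown in the table above. Those four rows alone sum to 30,200, not the 41,900 node total quoted in the text, so treat the output as an illustration of the summation rather than a reproduction of that figure.

```python
# (component, FIT per component, quantity per node), copied from the table rows
PARTS = [
    ("CPU + DRAM", 1_000, 2),
    ("GPU + HBM", 1_500, 8),
    ("SSD backplane", 200, 1),
    ("Power supply", 2_000, 8),
]


def series_fit(parts):
    """FIT of a series system: every instance of every part contributes its FIT."""
    return sum(fit * qty for _name, fit, qty in parts)


for name, fit, qty in PARTS:
    print(f"{name:<15} {fit * qty:>7,}")
print(f"{'Sum of rows':<15} {series_fit(PARTS):>7,}")
```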


The above is not true if components are redundant. For example, it assumes all eight power supplies are active/active with no redundancy. That is never true in practice; you would instead calculate an aggregate FIT for the 6+2 power supply group and use that in the sum above.
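One common way to estimate an aggregate FIT for a redundant group is the textbook k-out-of-n model, sketched below. It is not necessarily the method the author has in mind, and it rests on several assumptions: "6+2" is read as "any 6 of 8 supplies must work", the supplies are identical and fail independently at a constant rate, and failed supplies are never replaced (hot-swapping in a real system would make the group more reliable than this). Under those assumptions the group's mean time to failure is $\frac{1}{\lambda}\sum_{i=k}^{n}\frac{1}{i}$, and converting that back to a single FIT number is itself an approximation because a redundant group does not have a constant failure rate. The function name is hypothetical.

```python
BILLION_HOURS = 1e9


def k_of_n_effective_fit(unit_fit: float, n: int, k: int) -> float:
    """Approximate FIT of n identical units where at least k must keep working.

    Assumes independent units, constant failure rates, and no repair.
    """
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    unit_mttf_hours = BILLION_HOURS / unit_fit
    # Mean time until fewer than k units remain: (1/lambda) * sum_{i=k}^{n} 1/i
    group_mttf_hours = unit_mttf_hours * sum(1.0 / i for i in range(k, n + 1))
    return BILLION_HOURS / group_mttf_hours


# Eight supplies at 2,000 FIT each, as in the table.
print(f"no redundancy (8 of 8): {k_of_n_effective_fit(2_000, 8, 8):,.0f} FIT")
print(f"6+2 redundancy (6 of 8): {k_of_n_effective_fit(2_000, 8, 6):,.0f} FIT")
```

With no redundancy the model reduces to the plain series sum of 16,000 FIT, which is a useful check that it agrees with the table.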

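In the above example, the node FIT is 41,900. You can then calculate:

$$\text{MTBF}_\text{node} = \frac{10^9\ \text{hours}}{41{,}900} \approx 23{,}866\ \text{hours} \approx 2.72\ \text{years}$$

$$\text{AFR}_\text{node} = \frac{8{,}766\ \text{hours}}{23{,}866\ \text{hours}} \approx 0.367 = 36.7\%$$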


Example: Calculating cluster MTBF

Let’s say you have a cluster of 1,024 of these nodes. Calculating the FIT rate of the whole cluster is as simple as connecting all nodes in series:

$$\text{FIT}_\text{cluster} = 1{,}024 \times 41{,}900 = 42{,}905{,}600$$

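This means

$$\text{AFR}_\text{cluster} = 42{,}905{,}600 \times \frac{8{,}766}{10^9} \approx 376 = 37{,}600\%$$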

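In other words, the cluster will fail about 376 times every year on average. This is better expressed as MTBF:

$$\text{MTBF}_\text{cluster} = \frac{10^9\ \text{hours}}{42{,}905{,}600} \approx 23.3\ \text{hours}$$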
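The cluster arithmetic is easy to reproduce. The sketch below takes the node FIT of 41,900 and the node count of 1,024 from the text; the helper name `cluster_stats` is made up for this example.

```python
HOURS_PER_YEAR = 8766
BILLION_HOURS = 1e9


def cluster_stats(node_fit: float, nodes: int) -> dict:
    """Treat the cluster as `nodes` identical nodes connected in series."""
    cluster_fit = node_fit * nodes
    return {
        "fit": cluster_fit,
        "mtbf_hours": BILLION_HOURS / cluster_fit,
        "failures_per_year": cluster_fit * HOURS_PER_YEAR / BILLION_HOURS,
    }


stats = cluster_stats(node_fit=41_900, nodes=1_024)
print(f"cluster FIT:       {stats['fit']:,.0f}")
print(f"cluster MTBF:      {stats['mtbf_hours']:.1f} hours")
print(f"failures per year: {stats['failures_per_year']:.0f}")
```

Changing `nodes` shows how quickly MTBF shrinks with scale: doubling the node count halves the cluster MTBF.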

Relationship to JMTTI

Assuming an MPI job runs across all these nodes (one node failing = whole job failing), you can claim that the job mean time to interrupt (JMTTI) would be 23.3 hours as well. However, more things can interrupt a job than component failures: link flaps, kernel panics, cosmic rays, and the like. These will almost always make JMTTI lower than MTBF.
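To see why extra interrupt sources can only pull JMTTI down, treat each source as an independent process with a constant rate, so that the rates add just as FIT rates do. That independence is an assumption of this sketch, and the 100-hour figure for non-hardware interrupts is purely hypothetical, chosen only to show the direction of the effect. The function name is also made up.

```python
def combined_mean_time(*mean_hours_between_events: float) -> float:
    """Mean time to the first event when independent sources' rates add."""
    return 1.0 / sum(1.0 / hours for hours in mean_hours_between_events)


hardware_mtbf_hours = 23.3        # cluster MTBF from the example above
other_interrupts_hours = 100.0    # hypothetical: link flaps, kernel panics, etc.

jmtti = combined_mean_time(hardware_mtbf_hours, other_interrupts_hours)
print(f"JMTTI estimate: {jmtti:.1f} hours")
```

Whatever value is used for the second source, the combined figure is always below the hardware-only MTBF.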

In practice

There’s a table in In practice that contains a few MTBF measurements from large-scale supercomputers.