MTBF, FIT, and AFR are closely related concepts that describe component reliability. In brief,

MTBF

Mean time between failures (MTBF) is the average time between failures of a single component or a system of components. Pedantically, a failure is something that requires replacement (like a GPU burning up). It has units of time (hours, days, months, years) and is calculated simply as:

$$\text{MTBF} = \frac{\text{total time in service}}{\text{number of failures}}$$

If you had to send five nodes back for replacement in the past six months, your MTBF is

$$\text{MTBF} = \frac{6\ \text{months}}{5\ \text{failures}} = 1.2\ \text{months}$$

FIT

The FIT (failures in time) rate is a closely related concept. It is the inverse of the MTBF, scaled to one billion hours in service:

$$\text{FIT} = \frac{10^9\ \text{hours}}{\text{MTBF}}$$

It is a unitless quantity because it represents a number of failures.

Hardware vendors often express component reliability in terms of their FIT rate.

AFR

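AFR (annual failure rate) is the percent chance that a component will fail in a year. Assuming a year is 8,766 hours:

$$\text{AFR} = \frac{8{,}766\ \text{hours}}{\text{MTBF}} = \frac{\text{FIT} \times 8{,}766}{10^9}$$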

Like FIT, it is unitless since it really represents the number of components that will fail within a year.
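The three definitions above are simple conversions of one another, so they fit in a few lines of Python. This is a minimal sketch; the function names are made up for illustration, and it assumes MTBF is always expressed in hours.

```python
HOURS_PER_YEAR = 8766          # 365.25 days, as assumed in the text
HOURS_PER_FIT_WINDOW = 1e9     # FIT counts failures per billion hours


def mtbf_from_observations(hours_in_service, failures):
    """MTBF in hours from an observation window and a failure count."""
    return hours_in_service / failures


def fit_from_mtbf(mtbf_hours):
    """Failures expected per billion hours in service."""
    return HOURS_PER_FIT_WINDOW / mtbf_hours


def mtbf_from_fit(fit):
    """Inverse of fit_from_mtbf; returns hours."""
    return HOURS_PER_FIT_WINDOW / fit


def afr_from_mtbf(mtbf_hours):
    """Annual failure rate as a fraction (multiply by 100 for percent)."""
    return HOURS_PER_YEAR / mtbf_hours


def afr_from_fit(fit):
    """Annual failure rate as a fraction, straight from a FIT rate."""
    return fit * HOURS_PER_YEAR / HOURS_PER_FIT_WINDOW


# Five failures in six months (half of an 8,766-hour year)
six_months = HOURS_PER_YEAR / 2
print(mtbf_from_observations(six_months, 5), "hours between failures")
```

Since AFR comes back as a fraction, a value above 1.0 just means more than one failure per year is expected.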

Predicting MTBF, FIT, and AFR

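A nice property of FIT is that, for a system of components connected in series, you can add up the FIT rates of all components to get the FIT rate of the total system. MTBFs do not add directly; the system MTBF is $10^9$ hours divided by the summed FIT, which is the same as adding the reciprocals of the component MTBFs.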

Example: Calculating node reliability

Let’s say you have a node that is composed of the following parts (FIT values made up by ChatGPT):

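| Component     | Component FIT | Qty per node | Total FIT |
|---------------|---------------|--------------|-----------|
| CPU + DRAM    | 1,000         | 2            | 2,000     |
| GPU + HBM     | 1,500         | 8            | 12,000    |
| NIC           | 300           | 8            | 2,400     |
| Transceiver   | 100           | 8            | 800       |
| BMC           | 500           | 1            | 500       |
| SSD           | 1,000         | 8            | 8,000     |
| SSD backplane | 200           | 1            | 200       |
| Power supply  | 2,000         | 8            | 16,000    |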

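The FIT for the whole node is the sum of all “Total FIT” values, which is just the sum of FIT rates for every component:

$$\text{FIT}_\text{node} = 2{,}000 + 12{,}000 + 2{,}400 + 800 + 500 + 8{,}000 + 200 + 16{,}000 = 41{,}900$$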

This is valid because every component is connected in series; the failure of one component causes the whole node to fail.

Warning

The above is not true if components are redundant. For example, the table assumes all eight power supplies are active/active with no redundancy, so losing any one of them takes down the node. That is not how nodes are built in practice; you would instead calculate an aggregate FIT for 6+2 power supplies and use that in the table.
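To get a feel for how much redundancy changes the picture, here is a rough sketch that compares "all eight power supplies must work" against "any six of eight must work" over one year. It assumes independent supplies with a constant failure rate and no repair or replacement during the year, which is a simplification; a real aggregate FIT depends on how quickly failed supplies get swapped. The function names and the "equivalent FIT" shortcut are illustrative, not a standard method.

```python
import math

HOURS_PER_YEAR = 8766


def survival_probability(fit, hours):
    """P(one unit still works after `hours`), assuming a constant failure rate."""
    return math.exp(-fit / 1e9 * hours)


def k_of_n_survival(fit, k, n, hours):
    """P(at least k of n identical, independent units survive `hours`), no repair."""
    p = survival_probability(fit, hours)
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def equivalent_fit(survival, hours):
    """Constant FIT rate that would give the same survival probability over `hours`."""
    return -math.log(survival) / hours * 1e9


psu_fit = 2000  # per-supply FIT from the table
series = k_of_n_survival(psu_fit, 8, 8, HOURS_PER_YEAR)     # all eight required
redundant = k_of_n_survival(psu_fit, 6, 8, HOURS_PER_YEAR)  # any six suffice (6+2)

print("series FIT:", equivalent_fit(series, HOURS_PER_YEAR))
print("6+2 FIT   :", equivalent_fit(redundant, HOURS_PER_YEAR))
```

With no redundancy the equivalent FIT recovers the 16,000 in the table, while the 6+2 arrangement comes out far lower under these assumptions.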

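In the above example, the node FIT is 41,900. You can then calculate:

$$\text{MTBF}_\text{node} = \frac{10^9\ \text{hours}}{41{,}900} \approx 23{,}866\ \text{hours} \approx 2.7\ \text{years}$$

and

$$\text{AFR}_\text{node} = \frac{41{,}900 \times 8{,}766}{10^9} \approx 0.367 = 36.7\%$$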
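The node calculation can be reproduced in a few lines. This sketch copies the made-up FIT values from the table and treats every part as connected in series; the variable and function names are just for illustration.

```python
HOURS_PER_YEAR = 8766

# (component, FIT per component, quantity per node), copied from the table
NODE_PARTS = [
    ("CPU + DRAM", 1000, 2),
    ("GPU + HBM", 1500, 8),
    ("NIC", 300, 8),
    ("Transceiver", 100, 8),
    ("BMC", 500, 1),
    ("SSD", 1000, 8),
    ("SSD backplane", 200, 1),
    ("Power supply", 2000, 8),
]


def series_fit(parts):
    """FIT of a series system: the sum of FIT x quantity over all parts."""
    return sum(fit * qty for _name, fit, qty in parts)


node_fit = series_fit(NODE_PARTS)
node_mtbf_hours = 1e9 / node_fit
node_afr = node_fit * HOURS_PER_YEAR / 1e9

print(f"node FIT  = {node_fit:,}")
print(f"node MTBF = {node_mtbf_hours:,.0f} hours")
print(f"node AFR  = {node_afr:.1%}")
```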

Example: Calculating cluster MTBF

Let’s say you have a cluster of 1,024 of these nodes. Calculating the FIT rate of the whole cluster is as simple as connecting all nodes in series:

$$\text{FIT}_\text{cluster} = 1{,}024 \times 41{,}900 = 42{,}905{,}600$$

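This means

$$\text{AFR}_\text{cluster} = \frac{42{,}905{,}600 \times 8{,}766}{10^9} \approx 376 = 37{,}600\%$$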

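In other words, the cluster will fail 376 times every year on average. This is better expressed as MTBF:

$$\text{MTBF}_\text{cluster} = \frac{10^9\ \text{hours}}{42{,}905{,}600} \approx 23.3\ \text{hours}$$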
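The same arithmetic scales to any node count, which makes it easy to see how quickly cluster MTBF shrinks as the cluster grows. A minimal sketch, assuming identical nodes in series and the node FIT of 41,900 from the example above (function names are illustrative):

```python
HOURS_PER_YEAR = 8766
NODE_FIT = 41_900  # node FIT from the example above


def cluster_fit(node_fit, nodes):
    """FIT of `nodes` identical nodes in series."""
    return node_fit * nodes


def cluster_mtbf_hours(node_fit, nodes):
    """Cluster MTBF in hours; falls off as 1/nodes."""
    return 1e9 / cluster_fit(node_fit, nodes)


def cluster_failures_per_year(node_fit, nodes):
    """Expected number of node-killing failures per year across the cluster."""
    return cluster_fit(node_fit, nodes) * HOURS_PER_YEAR / 1e9


for nodes in (1, 128, 1024, 8192):
    mtbf = cluster_mtbf_hours(NODE_FIT, nodes)
    per_year = cluster_failures_per_year(NODE_FIT, nodes)
    print(f"{nodes:>5} nodes: MTBF {mtbf:10.1f} h, {per_year:8.1f} failures/year")
```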

Relationship to JMTTI

Assuming an MPI job runs across all these nodes (one node failing = whole job failing), you can claim that the job mean time to interrupt (JMTTI) would be 23.3 hours as well. However, there are more things that can interrupt a job than component failures: link flaps, kernel panics, cosmic rays, and the like. This will almost always make JMTTI lower than MTBF.
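One way to see why JMTTI ends up below MTBF is to treat every interrupt source as an independent, constant rate and add the rates, the same way FIT rates add. The sketch below does that; the function name and the non-hardware interrupt rate in the example are hypothetical placeholders, not measurements.

```python
def jmtti_hours(hardware_mtbf_hours, other_interrupts_per_hour=0.0):
    """JMTTI when hardware failures and other interrupts are independent
    and each occurs at a constant rate: rates add, then invert."""
    total_rate = 1.0 / hardware_mtbf_hours + other_interrupts_per_hour
    return 1.0 / total_rate


cluster_mtbf = 23.3  # hours, from the cluster example

# Hypothetical: one non-hardware interrupt (link flap, kernel panic, ...)
# every 100 hours on average. This number is made up for illustration.
print(jmtti_hours(cluster_mtbf))              # hardware failures only
print(jmtti_hours(cluster_mtbf, 1.0 / 100))   # hardware plus other interrupts
```

With no other interrupt sources, the function returns the hardware MTBF unchanged; any additional source can only pull the result down.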

In practice

There’s a table in “In practice” that contains a few MTBF measurements from large-scale supercomputers.