MTBF, FIT, and AFR are closely related concepts that describe component reliability. In brief:

MTBF

Mean Time Between Failures (MTBF) is the mean time between failures of a single component or a system of components. Pedantically, a failure is something that requires replacement (like a burnt-out GPU). Its units are time (hours, days, months, years), and it is calculated simply as:

$$\text{MTBF} = \frac{\text{total time in service}}{\text{number of failures}}$$

If you had to send five nodes back for replacement in the past six months, your MTBF is

$$\text{MTBF} = \frac{6 \text{ months}}{5 \text{ failures}} = 1.2 \text{ months}$$

If you know the survival function of a component, $R(t)$, you can also express MTBF in terms of that:

$$\text{MTBF} = \int_0^\infty R(t) \, dt$$

FIT

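FIT (failures in time) rate is a closely related concept: it is the number of failures expected per billion hours in service, which makes it the inverse of the MTBF scaled to $10^9$ hours:

$$\text{FIT} = \frac{10^9 \text{ hours}}{\text{MTBF}}$$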

It is a unitless quantity because it represents a number of failures.

Hardware vendors often express component reliability in terms of their FIT rate.
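Since vendors quote FIT while operators usually think in MTBF, converting between the two comes up often. The following is a minimal Python sketch of that conversion, assuming MTBF is given in hours; the function names are made up for illustration.

```python
HOURS_PER_BILLION = 1e9  # FIT counts failures per 10^9 hours in service


def mtbf_hours_to_fit(mtbf_hours: float) -> float:
    """Convert an MTBF in hours to a FIT rate."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return HOURS_PER_BILLION / mtbf_hours


def fit_to_mtbf_hours(fit: float) -> float:
    """Convert a FIT rate back to an MTBF in hours."""
    if fit <= 0:
        raise ValueError("FIT must be positive")
    return HOURS_PER_BILLION / fit


# A component with an MTBF of one million hours has a FIT of 1,000.
print(mtbf_hours_to_fit(1_000_000))
# The conversion is its own inverse.
print(fit_to_mtbf_hours(mtbf_hours_to_fit(1_000_000)))
```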

Failure rate ($\lambda$)

Failure rate is the frequency with which a component fails and is measured in units of failures per unit time. If the unit of time is one billion hours, it is the same as the FIT rate.

As with FIT rate, it is inversely related to MTBF:

$$\lambda = \frac{1}{\text{MTBF}}$$

Survival function

Survival function (or reliability function) describes the probability that a component will survive for at least a certain amount of time. It can be inferred if you know the failure rate from above.

For a failure rate $\lambda$ that is constant over time, the survival function is

$$R(t) = e^{-\lambda t}$$

Because the failure rate is the inverse of MTBF, you then get:

$$R(t) = e^{-t / \text{MTBF}}$$

Or maybe more meaningfully, the probability that a component will fail within a certain amount of time:

$$F(t) = 1 - R(t) = 1 - e^{-t / \text{MTBF}}$$
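To make the survival and failure probabilities concrete, here is a minimal Python sketch. It assumes a constant failure rate, so that the survival function is the exponential $e^{-t/\text{MTBF}}$; the function names are illustrative only.

```python
import math


def survival_probability(t_hours: float, mtbf_hours: float) -> float:
    """Probability that a component is still working after t_hours."""
    return math.exp(-t_hours / mtbf_hours)


def failure_probability(t_hours: float, mtbf_hours: float) -> float:
    """Probability that a component has failed within t_hours."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-t_hours / mtbf_hours)


mtbf = 100_000  # hours, an arbitrary example value
for t in (1_000, 10_000, 100_000):
    print(f"t={t:>7,} h  survive={survival_probability(t, mtbf):.3f}  "
          f"fail={failure_probability(t, mtbf):.3f}")
```

One consequence worth noticing: at $t = \text{MTBF}$ the survival probability is $e^{-1}$, so under this model only about 37% of components last as long as their MTBF.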

AFR

AFR (annualized failure rate) seems to have two definitions:

  1. The intuitive one, which is the failure rate normalized to a year
  2. The Wikipedia definition, which is the probability that a component will fail in a year

The big difference is that #1 can be above 100% (more than one component fails per year, or a component fails multiple times per year), but #2 approaches 100% asymptotically.

Intuitive (failures per year)

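The intuitive AFR, or the frequency of component failure per year, is easy to define assuming a year is 8,766 hours:

$$\text{AFR} = \frac{8766 \text{ hours}}{\text{MTBF}} \times 100\% = \frac{\text{FIT} \times 8766}{10^9} \times 100\%$$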

It is unitless because it is a percentage.

Wikipedia (probability of failure)

The Wikipedia definition is really just the probability of failure that follows from the survival function above. Recall:

$$F(t) = 1 - e^{-t / \text{MTBF}}$$

If you set $t$ to the 8,766 hours in a year and express MTBF in hours, you get

$$\text{AFR} = 1 - e^{-8766 / \text{MTBF}}$$

Further confusion

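It doesn’t help that $8766 / \text{MTBF}$ is often used as an approximation of $1 - e^{-8766 / \text{MTBF}}$; the two agree closely when MTBF is much longer than a year. This results in some sources claiming that the intuitive definition is just an approximation of the other definition, but this is not true.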
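The difference between the two AFR definitions is easiest to see numerically. The sketch below assumes MTBF in hours, an 8,766-hour year, and a constant failure rate for the probability form; the function names are illustrative.

```python
import math

HOURS_PER_YEAR = 8_766


def afr_intuitive(mtbf_hours: float) -> float:
    """Expected failures per year; can exceed 1.0 (100%)."""
    return HOURS_PER_YEAR / mtbf_hours


def afr_probability(mtbf_hours: float) -> float:
    """Probability of at least one failure in a year; always below 1.0."""
    return -math.expm1(-HOURS_PER_YEAR / mtbf_hours)


for mtbf in (1_000_000, 100_000, 8_766, 1_000):
    print(f"MTBF={mtbf:>9,} h  intuitive={afr_intuitive(mtbf):9.2%}  "
          f"probability={afr_probability(mtbf):9.2%}")
```

For long MTBFs the two columns nearly coincide, which is why one gets mistaken for an approximation of the other. Once MTBF drops to a year or less, the intuitive value passes 100% while the probability only creeps toward it.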

Predicting MTBF, FIT, and AFR

A nice property of AFR (in the intuitive, failures-per-year sense) and FIT is that, for a system of components connected in series, you can add up the AFR/FIT rates of all components to get an AFR/FIT for the total system.

Example: Calculating node reliability

Let’s say you have a node that is made up of the following parts (FIT values made up by ChatGPT):

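| Component | Component FIT ($\text{FIT}_i$) | Qty per node ($n_i$) | Total FIT |
|---|---|---|---|
| CPU + DRAM | 1,000 | 2 | 2,000 |
| GPU + HBM | 1,500 | 8 | 12,000 |
| NIC | 300 | 8 | 2,400 |
| Transceiver | 100 | 8 | 800 |
| BMC | 500 | 1 | 500 |
| SSD | 1,000 | 8 | 8,000 |
| SSD backplane | 200 | 1 | 200 |
| Power supply | 2,000 | 8 | 16,000 |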

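The FIT for the whole node is the sum of all “Total FIT” values, which is just the sum of FIT rates for every component. With $\text{FIT}_i$ the FIT of component type $i$ and $n_i$ its quantity per node:

$$\text{FIT}_\text{node} = \sum_i n_i \, \text{FIT}_i = 41{,}900$$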

This is valid because every component is connected in series; the failure of one component causes the whole node to fail.
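The series sum for the node can be checked with a short Python sketch. The component values are copied from the table above (and are made up, as noted there); the variable and function names are illustrative, and a year is taken to be 8,766 hours.

```python
HOURS_PER_YEAR = 8_766

# (component, FIT per component, quantity per node)
NODE_PARTS = [
    ("CPU + DRAM", 1_000, 2),
    ("GPU + HBM", 1_500, 8),
    ("NIC", 300, 8),
    ("Transceiver", 100, 8),
    ("BMC", 500, 1),
    ("SSD", 1_000, 8),
    ("SSD backplane", 200, 1),
    ("Power supply", 2_000, 8),
]


def series_fit(parts) -> int:
    """FIT of components in series: any single failure fails the whole."""
    return sum(fit * qty for _name, fit, qty in parts)


node_fit = series_fit(NODE_PARTS)
node_mtbf_hours = 1e9 / node_fit
node_afr = node_fit * HOURS_PER_YEAR / 1e9  # intuitive AFR, failures/year

print(f"node FIT  = {node_fit:,}")
print(f"node MTBF = {node_mtbf_hours:,.0f} hours")
print(f"node AFR  = {node_afr:.1%}")
```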

Warning

The above is not true if components are redundant; for example, the above assumes all eight power supplies are active/active with no redundancy. This is never true in practice; you would calculate an aggregate FIT for 6+2 power supplies and use that above.

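In the above example, the node FIT is 41,900. You can then calculate:

$$\text{MTBF}_\text{node} = \frac{10^9 \text{ hours}}{41{,}900} \approx 23{,}866 \text{ hours} \approx 2.7 \text{ years}$$

and

$$\text{AFR}_\text{node} = \frac{41{,}900 \times 8766}{10^9} \times 100\% \approx 36.7\%$$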

Example: Calculating cluster MTBF

Let’s say you have a cluster of 1,024 of these nodes. Calculating the FIT rate of the whole cluster is as simple as connecting all nodes in series:

$$\text{FIT}_\text{cluster} = 1{,}024 \times 41{,}900 = 42{,}905{,}600$$

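This means

$$\text{AFR}_\text{cluster} = \frac{42{,}905{,}600 \times 8766}{10^9} \times 100\% \approx 37{,}611\%$$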

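In other words, the cluster will fail 376 times every year on average. This is better expressed as MTBF:

$$\text{MTBF}_\text{cluster} = \frac{10^9 \text{ hours}}{42{,}905{,}600} \approx 23.3 \text{ hours}$$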
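The cluster numbers follow from the node FIT by the same series rule. A minimal Python sketch, assuming the node FIT of 41,900 from the example above, 1,024 identical nodes, and an 8,766-hour year (variable names are illustrative):

```python
HOURS_PER_YEAR = 8_766
NODE_FIT = 41_900   # from the node example
NODE_COUNT = 1_024

# Nodes in series: FIT rates simply add.
cluster_fit = NODE_FIT * NODE_COUNT
cluster_failures_per_year = cluster_fit * HOURS_PER_YEAR / 1e9
cluster_mtbf_hours = 1e9 / cluster_fit

print(f"cluster FIT       = {cluster_fit:,}")
print(f"failures per year = {cluster_failures_per_year:.0f}")
print(f"cluster MTBF      = {cluster_mtbf_hours:.1f} hours")
```

Because MTBF scales as the inverse of the node count, doubling the cluster size under these assumptions halves the cluster MTBF.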

Relationship to JMTTI

Assuming an MPI job runs across all these nodes (one node failing = whole job failing), you can claim that the JMTTI (job mean time to interrupt) would be 23.3 hours as well. However, more things can interrupt a job than component failures: link flaps, kernel panics, cosmic rays, and the like. This will almost always make JMTTI lower than MTBF.

In practice

There’s a table in “In practice” that contains a few MTBF measurements from large-scale supercomputers.