MTBF, FIT, and AFR are closely related concepts that describe component reliability.
MTBF
Mean Time Between Failures (MTBF) is the average time that elapses between failures of a single component or a system of components. Pedantically, a failure is something that requires replacement (like a burnt-out GPU). Its units are time (hours, days, months, years), and it is calculated simply as:

$$\mathrm{MTBF} = \frac{\text{total time in service}}{\text{number of failures}}$$

If you had to send five nodes back for replacement in the past six months, your MTBF is

$$\mathrm{MTBF} = \frac{6\ \text{months}}{5\ \text{failures}} = 1.2\ \text{months}$$

If you know the survival function of a component, $R(t)$, you can also express MTBF in terms of that:

$$\mathrm{MTBF} = \int_0^\infty R(t)\,dt$$
FIT
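FIT (failures in time) rate is a closely related concept: it is the expected number of failures per billion hours in service, which makes it the inverse of MTBF scaled to $10^9$ hours:

$$\mathrm{FIT} = \frac{10^9\ \text{hours}}{\mathrm{MTBF}}$$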
It is a unitless quantity because it represents a number of failures.
Hardware vendors often express component reliability in terms of their FIT rate.
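As a minimal sketch of the conversions described so far, the snippet below computes MTBF from observed failures and converts between MTBF and FIT. The function names are illustrative (they do not come from any library), MTBF is assumed to be expressed in hours, and six months is taken as half of an 8,766-hour year.

```python
HOURS_PER_BILLION = 1e9  # FIT counts failures per 10^9 hours in service


def mtbf_from_observations(hours_in_service: float, failures: int) -> float:
    """MTBF in hours: time in service divided by the number of failures."""
    if failures <= 0:
        raise ValueError("need at least one observed failure")
    return hours_in_service / failures


def fit_from_mtbf(mtbf_hours: float) -> float:
    """FIT rate for a given MTBF expressed in hours."""
    return HOURS_PER_BILLION / mtbf_hours


def mtbf_from_fit(fit: float) -> float:
    """MTBF in hours for a given FIT rate."""
    return HOURS_PER_BILLION / fit


if __name__ == "__main__":
    # Five replacements in six months (assumed to be 8,766 / 2 = 4,383 hours).
    mtbf = mtbf_from_observations(4383.0, 5)
    print(f"MTBF: {mtbf:.1f} hours")
    print(f"FIT:  {fit_from_mtbf(mtbf):,.0f}")
```

Because the two conversions are the same reciprocal relationship, converting an MTBF to FIT and back returns the original value.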
Failure rate ($\lambda$)
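Failure rate, written $\lambda$ here, is the frequency with which a component fails and is measured in units of failures per unit time. If the unit of time is a billion hours, it is the same as the FIT rate; measured in failures per hour, it is the FIT rate divided by $10^9$.
As with FIT rate, it is inversely related to MTBF:

$$\lambda = \frac{1}{\mathrm{MTBF}}$$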
Survival function
Survival function (or reliability function) $R(t)$ describes the probability that a component will survive for at least a time $t$. If you assume the failure rate $\lambda$ from above is constant over time, it follows directly:

$$R(t) = e^{-\lambda t}$$

Recalling that the failure rate is the inverse of MTBF, you then get:

$$R(t) = e^{-t/\mathrm{MTBF}}$$

Or maybe more meaningfully, the probability $F(t)$ that a component will fail within a certain amount of time:

$$F(t) = 1 - R(t) = 1 - e^{-t/\mathrm{MTBF}}$$
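The following sketch evaluates the survival function and the failure probability under the constant-failure-rate (exponential) assumption, and numerically checks that integrating the survival function recovers the MTBF. The function names, the step count, and the integration horizon are illustrative choices, not part of any standard.

```python
import math


def survival(t_hours: float, mtbf_hours: float) -> float:
    """R(t) = exp(-t / MTBF), assuming a constant failure rate."""
    return math.exp(-t_hours / mtbf_hours)


def failure_probability(t_hours: float, mtbf_hours: float) -> float:
    """F(t) = 1 - R(t); expm1 keeps precision when t is much smaller than MTBF."""
    return -math.expm1(-t_hours / mtbf_hours)


def mtbf_by_integration(mtbf_hours: float, steps: int = 200_000,
                        horizon: float = 40.0) -> float:
    """Trapezoidal estimate of the integral of R(t) from 0 to horizon * MTBF."""
    end = horizon * mtbf_hours
    dt = end / steps
    total = 0.5 * (survival(0.0, mtbf_hours) + survival(end, mtbf_hours))
    for i in range(1, steps):
        total += survival(i * dt, mtbf_hours)
    return total * dt


if __name__ == "__main__":
    mtbf = 1000.0
    print(f"R(MTBF) = {survival(mtbf, mtbf):.3f}")
    print(f"F(MTBF) = {failure_probability(mtbf, mtbf):.3f}")
    print(f"integral of R(t) = {mtbf_by_integration(mtbf):.3f} hours")
```

One consequence worth noticing: under this model a component has roughly a 63% chance ($1 - e^{-1}$) of having failed by the time it reaches its MTBF, so MTBF is not a typical lifetime.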
AFR
AFR (annualized failure rate) seems to have two definitions:
1. The intuitive one, which is the failure rate normalized to a year
2. The Wikipedia definition, which is the probability that a component will fail in a year
The big difference is that #1 can be above 100% (more than one component fails per year, or a component fails multiple times per year), but #2 approaches 100% asymptotically.
Intuitive (failures per year)
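The intuitive AFR, or the frequency of component failure per year, is easy to define assuming a year is 8,766 hours and MTBF is given in hours:

$$\mathrm{AFR} = \frac{8766\ \text{hours}}{\mathrm{MTBF}} \times 100\% = \frac{\mathrm{FIT} \times 8766}{10^9} \times 100\%$$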
It is unitless because it is a percentage.
Wikipedia (probability of failure)
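The Wikipedia definition is really just the failure probability that follows from the survival function above, assuming a constant failure rate. Recall:

$$F(t) = 1 - e^{-t/\mathrm{MTBF}}$$

If you set $t$ to the number of hours in a year, you get

$$\mathrm{AFR} = 1 - e^{-8766/\mathrm{MTBF}}$$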
Further confusion
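It doesn’t help that $8766/\mathrm{MTBF}$ is often used as an approximation of $1 - e^{-8766/\mathrm{MTBF}}$, which is reasonable only when the MTBF is much longer than a year. This results in some sources claiming that the intuitive definition is just an approximation of the other definition, but this is not true: one is an expected number of failures per year and the other is a probability, and they only agree numerically when failures are rare.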
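To see where the two definitions agree and where they part ways, here is a small comparison under the constant-failure-rate assumption. The function names and the sample MTBF values are made up for illustration.

```python
import math

HOURS_PER_YEAR = 8766.0


def afr_rate(mtbf_hours: float) -> float:
    """Intuitive AFR: expected failures per year (can exceed 1.0, i.e. 100%)."""
    return HOURS_PER_YEAR / mtbf_hours


def afr_probability(mtbf_hours: float) -> float:
    """Wikipedia-style AFR: probability of at least one failure within a year."""
    return -math.expm1(-HOURS_PER_YEAR / mtbf_hours)


if __name__ == "__main__":
    # Illustrative MTBF values in hours, from very reliable to very unreliable.
    for mtbf in (1_000_000.0, 100_000.0, 8766.0, 1_000.0):
        print(f"MTBF {mtbf:>11,.0f} h: "
              f"rate {afr_rate(mtbf):8.2%}  "
              f"probability {afr_probability(mtbf):8.2%}")
```

For an MTBF of a million hours the two numbers are nearly identical, but at an MTBF of exactly one year the rate-based AFR is 100% while the probability-based AFR is only about 63%.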
Predicting MTBF, FIT, and AFR
A nice property of FIT is that, for a system of components connected in series, you can add up the FIT rates of all components to get the FIT rate of the total system. MTBFs do not add directly; their reciprocals do, so the system MTBF follows from the summed FIT.
Example: Calculating node reliability
Let’s say you have a node that is comprised of the following parts (FIT values made up by ChatGPT):
Component | Component FIT ($\mathrm{FIT}_i$) | Qty per node ($n_i$) | Total FIT |
---|---|---|---|
CPU + DRAM | 1,000 | 2 | 2,000 |
GPU + HBM | 1,500 | 8 | 12,000 |
NIC | 300 | 8 | 2,400 |
Transceiver | 100 | 8 | 800 |
BMC | 500 | 1 | 500 |
SSD | 1,000 | 8 | 8,000 |
SSD backplane | 200 | 1 | 200 |
Power supply | 2,000 | 8 | 16,000 |
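The FIT for the whole node is the sum of all “Total FIT” values, which is just the sum of FIT rates for every component:

$$\mathrm{FIT}_\text{node} = \sum_i n_i \, \mathrm{FIT}_i$$

where $\mathrm{FIT}_i$ is the FIT of component type $i$ and $n_i$ is its quantity per node.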
This is valid because every component is connected in series; the failure of one component causes the whole node to fail.
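The sum over the table can be checked with a few lines of Python. This is only a sketch: the component values are copied from the table above (and are made up to begin with), and the names `NODE_PARTS` and `series_fit` are illustrative.

```python
# (component, FIT per component, quantity per node), copied from the table above
NODE_PARTS = [
    ("CPU + DRAM", 1_000, 2),
    ("GPU + HBM", 1_500, 8),
    ("NIC", 300, 8),
    ("Transceiver", 100, 8),
    ("BMC", 500, 1),
    ("SSD", 1_000, 8),
    ("SSD backplane", 200, 1),
    ("Power supply", 2_000, 8),
]


def series_fit(parts) -> int:
    """FIT of a series system: every part's FIT, counted once per instance."""
    return sum(fit * qty for _name, fit, qty in parts)


if __name__ == "__main__":
    for name, fit, qty in NODE_PARTS:
        print(f"{name:<14} {fit * qty:>7,}")
    print(f"{'Node total':<14} {series_fit(NODE_PARTS):>7,}")
```

Keeping the parts list as data makes it easy to swap in a vendor's real FIT figures or a different node configuration.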
Warning
The above is not true if components are redundant; for example, the above assumes all eight power supplies are active/active with no redundancy. This is never true in practice; you would calculate an aggregate FIT for 6+2 power supplies and use that above.
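In the above example, the node FIT is 41,900. You can then calculate:

$$\mathrm{MTBF}_\text{node} = \frac{10^9\ \text{hours}}{41{,}900} \approx 23{,}866\ \text{hours} \approx 2.7\ \text{years}$$

and

$$\mathrm{AFR}_\text{node} = \frac{41{,}900 \times 8766}{10^9} \times 100\% \approx 36.7\%$$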
Example: Calculating cluster MTBF
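Let’s say you have a cluster of 1,024 of these nodes. Calculating the FIT rate of the whole cluster is as simple as treating all nodes as connected in series:

$$\mathrm{FIT}_\text{cluster} = 1{,}024 \times 41{,}900 = 42{,}905{,}600$$

This means

$$\mathrm{AFR}_\text{cluster} = \frac{42{,}905{,}600 \times 8766}{10^9} \times 100\% \approx 37{,}611\%$$

Put another way, the cluster will fail 376 times every year on average. This is better expressed as MTBF:

$$\mathrm{MTBF}_\text{cluster} = \frac{10^9\ \text{hours}}{42{,}905{,}600} \approx 23.3\ \text{hours}$$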
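The cluster arithmetic can be reproduced as follows. This sketch takes the node FIT of 41,900 and the node count of 1,024 from the example above, assumes an 8,766-hour year, and uses illustrative helper names.

```python
HOURS_PER_YEAR = 8766
NODE_FIT = 41_900  # node FIT from the example above
NODES = 1_024      # nodes in the cluster, all treated as connected in series


def mtbf_hours(fit: float) -> float:
    """MTBF in hours for a given FIT rate."""
    return 1e9 / fit


def failures_per_year(fit: float) -> float:
    """Expected failures per year (the intuitive AFR as a plain number)."""
    return fit * HOURS_PER_YEAR / 1e9


cluster_fit = NODE_FIT * NODES

if __name__ == "__main__":
    print(f"Node MTBF:          {mtbf_hours(NODE_FIT):,.0f} hours")
    print(f"Node failures/yr:   {failures_per_year(NODE_FIT):.3f}")
    print(f"Cluster FIT:        {cluster_fit:,}")
    print(f"Cluster failures/yr {failures_per_year(cluster_fit):.1f}")
    print(f"Cluster MTBF:       {mtbf_hours(cluster_fit):.1f} hours")
```

Because the FIT scales linearly with the node count, doubling the cluster size halves the cluster MTBF.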
Relationship to JMTTI
Assuming an MPI job runs across all these nodes (one node failing = whole job failing), you can claim that the JMTTI would be 23.3 hours as well. However, there are more things that can interrupt a job than component failures: link flaps, kernel panics, cosmic rays, and the like. This will almost always make JMTTI lower than MTBF.
In practice
There’s a table in “In practice” that contains a few MTBF measurements from large-scale supercomputers.