MTBF, FIT, and AFR

MTBF, FIT, and AFR are closely related concepts that describe component reliability. In brief,

$MTBF \propto \frac{1}{FIT} \propto \frac{1}{AFR}$

MTBF

Mean Time Between Failures (MTBF) is the mean time between failures of a single component or a system of components. Strictly speaking, a failure is an event that requires replacement (such as a GPU burning out). MTBF has units of time (hours, days, months, years) and is calculated simply as:

$MTBF = \frac{\text{time spent observing}}{\text{\# observed failures}}$

If you had to send five nodes back for replacement in the past six months, your MTBF is

$MTBF = \frac{6 \text{ months}}{5 \text{ failures}} = 1.2 \text{ months}$
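As a minimal sketch of the calculation above, the helper below divides the observation time by the failure count. The function name `observed_mtbf` is made up for illustration.

```python
def observed_mtbf(observation_time: float, failures: int) -> float:
    """Return MTBF in the same time unit as observation_time."""
    if failures <= 0:
        # With zero observed failures this simple estimator is undefined.
        raise ValueError("need at least one observed failure")
    return observation_time / failures


# Five node replacements over six months of observation.
print(observed_mtbf(6, 5))  # 1.2 (months)
```

Note that the result carries whatever time unit you pass in, so convert to hours first if you plan to compare against vendor figures.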

If you know the survival function of a component, $R(t)$, you can also express MTBF in terms of that:

$MTBF = \int_{0}^{\infty} R (t) d t$

FIT

The FIT (failures in time) rate is a closely related concept: it is the number of failures expected per billion hours in service, which makes it the inverse of MTBF scaled to $10^{9}$ hours:

$\text{FIT rate} = \frac{10^{9} \text{ hours}}{MTBF_{\text{hours}}}$

It is a unitless quantity because it represents a number of failures.

Hardware vendors often express component reliability in terms of their FIT rate.
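Since vendor datasheets tend to quote FIT while operators think in MTBF, a pair of conversion helpers is handy. This is a minimal sketch; the function names and the 1,000,000-hour component are hypothetical.

```python
HOURS_PER_BILLION = 1e9


def mtbf_hours_to_fit(mtbf_hours: float) -> float:
    """Failures expected per billion device-hours."""
    return HOURS_PER_BILLION / mtbf_hours


def fit_to_mtbf_hours(fit: float) -> float:
    """Inverse of mtbf_hours_to_fit."""
    return HOURS_PER_BILLION / fit


# A hypothetical component with an MTBF of 1,000,000 hours.
print(mtbf_hours_to_fit(1_000_000))  # 1000.0
```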

Failure rate ( $λ$ )

Failure rate is the frequency with which a component fails and is measured in units of failures per unit time. If the unit time is $10^{9}$ hours, it is the same as the FIT rate.

As with FIT rate, it is inversely related to MTBF:

$λ = \frac{1}{MTBF}$

Survival function

The survival function (or reliability function) describes the probability that a component will survive for at least a certain amount of time. If the failure rate $λ$ from above is constant over time, the survival function follows directly from it:

Survival function $R (t) = e^{- λ t}$

Recalling that the failure rate $λ$ is the inverse of MTBF, you then get:

$R (t) = e^{- \frac{t}{MTBF}}$

Or maybe more meaningfully, the probability that a component will fail within a certain amount of time:

$F (t) = 1 - e^{- \frac{t}{MTBF}}$
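The two formulas above translate directly into code. The sketch below assumes a constant failure rate (the exponential model used here); the function names and the 100,000-hour MTBF are illustrative only.

```python
import math


def survival(t_hours: float, mtbf_hours: float) -> float:
    """R(t): probability that a component survives at least t_hours."""
    return math.exp(-t_hours / mtbf_hours)


def failure_probability(t_hours: float, mtbf_hours: float) -> float:
    """F(t) = 1 - R(t): probability of failing within t_hours."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-t_hours / mtbf_hours)


mtbf = 100_000.0  # hypothetical component MTBF in hours
for t in (1_000, 10_000, 100_000):
    print(t, round(survival(t, mtbf), 4), round(failure_probability(t, mtbf), 4))
```

One thing this makes visible: at $t = MTBF$ the survival probability is $e^{-1}$, so only about 37% of components are expected to last as long as their MTBF.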

AFR

AFR (annualized failure rate) seems to have two definitions:

1. The intuitive one, which is the failure rate $λ$ normalized to a year
2. The Wikipedia definition, which is the probability that a component will fail within a year

The big difference is that #1 can be above 100% (more than one component fails per year, or a component fails multiple times per year), but #2 approaches 100% asymptotically.

Intuitive (failures per year)

The intuitive AFR, or the frequency of component failure per year, is easy to define assuming a year is 8,766 hours:

$AFR = \frac{8766 \text{ hours}}{MTBF_{\text{hours}}}$

It is unitless because it is a percentage.

Wikipedia (probability of failure)

The Wikipedia definition is really just the complement of the survival function from above, evaluated at one year. Recall:

$F (t) = 1 - e^{- \frac{t}{MTBF}}$

If you use $t = 8766$ hours per year, you get

$AFR = 1 - e^{- \frac{8766 \text{ hours}}{MTBF_{\text{hours}}}}$

Further confusion

It doesn’t help that $\frac{1}{x}$ is often used as an approximation of $1 - e^{- \frac{1}{x}}$ . This results in some sources claiming that the intuitive definition is just an approximation of the other definition, but this is not true.
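To see where the two definitions agree and where they part ways, the sketch below evaluates both for a few MTBF values. The function names and the sample MTBFs are made up for illustration.

```python
import math

HOURS_PER_YEAR = 8766


def afr_rate(mtbf_hours: float) -> float:
    """Intuitive AFR: expected failures per year (can exceed 1.0)."""
    return HOURS_PER_YEAR / mtbf_hours


def afr_probability(mtbf_hours: float) -> float:
    """Wikipedia AFR: probability of at least one failure within a year."""
    return -math.expm1(-HOURS_PER_YEAR / mtbf_hours)


for mtbf_hours in (1_000_000, 100_000, 8_766, 1_000):
    print(
        f"{mtbf_hours:>9} h  rate={afr_rate(mtbf_hours):9.2%}  "
        f"probability={afr_probability(mtbf_hours):8.2%}"
    )
```

For MTBFs much longer than a year the two numbers are nearly identical, which is why they get conflated. Once the MTBF shrinks to one year, the rate reaches 100% while the probability is only $1 - e^{-1}$, or about 63%.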

Predicting MTBF, FIT, and AFR

A nice property of AFR (in the failures-per-year sense) and FIT is that, for a system of components connected in series, you can add up the AFR/FIT rates of all components to get an AFR/FIT for the total system.

Example: Calculating node reliability

Let’s say you have a node that is comprised of the following parts (FIT values made up by ChatGPT):

Component	Component FIT ( $C_{i}$ )	Qty per node ( $N_{i}$ )	Total FIT
CPU + DRAM	1,000	2	2,000
GPU + HBM	1,500	8	12,000
NIC	300	8	2,400
Transceiver	100	8	800
BMC	500	1	500
SSD	1,000	8	8,000
SSD backplane	200	1	200
Power supply	2,000	8	16,000

The FIT for the whole node is the sum of all “Total FIT” values which is just the sum of FIT rates for every component:

$\text{Node FIT} = \sum_{i}^{\text{components}} N_{i} C_{i} = (1000 \cdot 2) + (1500 \cdot 8) + \ldots = 41{,}900$

This is valid because every component is connected in series; the failure of one component causes the whole node to fail.

Warning

The above is not true if components are redundant. For example, it assumes all eight power supplies are active/active with no redundancy. This is rarely true in practice; you would instead calculate an aggregate FIT for the 6+2 power supply group and use that in place of the 16,000 above.
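One way to estimate such an aggregate figure is the textbook k-of-n result for identical, independent components with constant failure rates and no repair: the mean time until fewer than $k$ of $n$ remain is $\frac{1}{λ} \sum_{i=k}^{n} \frac{1}{i}$. The sketch below applies it to a 6+2 group. These assumptions are mine rather than anything stated above, the function name is hypothetical, and the "equivalent FIT" is only an approximation because a redundant group does not have a constant failure rate.

```python
def k_of_n_mtbf_hours(component_fit: float, n: int, k: int) -> float:
    """Mean time until fewer than k of n identical components remain.

    Assumes independent components, constant failure rates, all n units
    active, and no repair or replacement of failed units.
    """
    if not 1 <= k <= n:
        raise ValueError("require 1 <= k <= n")
    component_mtbf_hours = 1e9 / component_fit
    # With i units running, the next failure arrives at rate i * lambda.
    return component_mtbf_hours * sum(1.0 / i for i in range(k, n + 1))


psu_fit = 2000  # per-supply FIT from the table above
redundant_mtbf = k_of_n_mtbf_hours(psu_fit, n=8, k=6)
equivalent_fit = 1e9 / redundant_mtbf
print(f"6+2 group: ~{equivalent_fit:,.0f} FIT vs. 16,000 FIT with no redundancy")
```

Setting `k` equal to `n` recovers the series case (8 × 2,000 = 16,000 FIT), which is a quick sanity check on the formula. Replacing failed supplies while the node keeps running would improve the figure further, and this sketch does not model that.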

In the above example, the node FIT is 41,900. You can then calculate:

$AFR = \frac{FIT \times 8766}{10^{9}} = 36.7\%$

and

$MTBF = \frac{1}{AFR} = \frac{1 \text{ year}}{0.367} = 2.72 \text{ years} \approx 23{,}900 \text{ hours}$
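The node calculation can be reproduced in a few lines. This sketch copies the (made-up) FIT values from the table above; the variable names are illustrative.

```python
HOURS_PER_YEAR = 8766

# (component, FIT per component, quantity per node), copied from the table above
NODE_PARTS = [
    ("CPU + DRAM", 1000, 2),
    ("GPU + HBM", 1500, 8),
    ("NIC", 300, 8),
    ("Transceiver", 100, 8),
    ("BMC", 500, 1),
    ("SSD", 1000, 8),
    ("SSD backplane", 200, 1),
    ("Power supply", 2000, 8),
]

# Series system: FIT rates simply add up.
node_fit = sum(fit * qty for _, fit, qty in NODE_PARTS)
node_afr = node_fit * HOURS_PER_YEAR / 1e9  # failures per year
node_mtbf_hours = 1e9 / node_fit

print(node_fit)  # 41900
print(f"AFR:  {node_afr:.1%}")
print(f"MTBF: {node_mtbf_hours:,.0f} hours")
```

Computing MTBF directly from FIT avoids the rounding that creeps in when you go through the three-digit AFR first.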

Example: Calculating cluster MTBF

Let’s say you have a cluster of 1,024 of these nodes. Calculating the FIT rate of the whole cluster is as simple as connecting all nodes in series:

$\text{Cluster FIT} = \text{\# nodes} \times \text{Node FIT} = 1024 \times 41{,}900 = 42{,}905{,}600$

This means

$AFR = \frac{FIT \times 8766}{10^{9}} = 37{,}611\%$

In other words, the cluster will experience about 376 failures every year on average. This is better expressed as MTBF:

$MTBF = \frac{1}{AFR} = \frac{1 \text{ year}}{376.11} = 0.002659 \text{ years} = 23.3 \text{ hours}$
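The same arithmetic generalizes to any node count, which makes it easy to see how quickly MTBF shrinks with scale. A minimal sketch, using the node FIT of 41,900 from the example above; the function name and the extra node counts are hypothetical.

```python
HOURS_PER_YEAR = 8766


def series_system(unit_fit: float, count: int) -> dict:
    """Aggregate reliability of `count` identical units connected in series."""
    fit = unit_fit * count
    return {
        "fit": fit,
        "failures_per_year": fit * HOURS_PER_YEAR / 1e9,
        "mtbf_hours": 1e9 / fit,
    }


for nodes in (128, 1024, 8192):
    cluster = series_system(unit_fit=41_900, count=nodes)
    print(
        f"{nodes:>5} nodes: {cluster['failures_per_year']:8.1f} failures/year, "
        f"MTBF {cluster['mtbf_hours']:6.1f} hours"
    )
```

Because the relationship is a simple inverse, each doubling of the node count halves the cluster MTBF.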

Relationship to JMTTI

Assuming an MPI job runs across all these nodes (so one node failing means the whole job fails), you can claim that the job mean time to interrupt (JMTTI) would be 23.3 hours as well. However, there are more things that can interrupt a job than component failures: link flaps, kernel panics, cosmic rays, and the like. This will almost always make JMTTI lower than MTBF.
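Combining the cluster MTBF with the survival function from earlier gives a rough feel for how likely a full-cluster job is to finish without a component failure. This sketch assumes a constant failure rate and counts only hardware failures, so by the argument above it is an optimistic bound; the function name and job lengths are made up.

```python
import math


def job_success_probability(job_hours: float, mtbf_hours: float) -> float:
    """R(t) for a job that needs every node to survive for job_hours."""
    return math.exp(-job_hours / mtbf_hours)


cluster_mtbf_hours = 23.3  # from the cluster example above
for job_hours in (1, 6, 24, 72):
    p = job_success_probability(job_hours, cluster_mtbf_hours)
    print(f"{job_hours:>3} h job: {p:.1%} chance of no component failure")
```

This is one reason long-running jobs at this scale lean on checkpointing rather than hoping for an uninterrupted run.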

In practice

There’s a table in In practice that contains a few MTBF measurements from large-scale supercomputers.

Glenn's Digital Garden
