MTBF, FIT, and failure rate ($\lambda$) are closely related concepts that describe component reliability. In brief,

$$\text{MTBF} = \frac{1}{\lambda} = \frac{10^9\ \text{hours}}{\text{FIT}}$$

Specifically,

Mean Time Between Failure (MTBF) refers to the mean time between failures of a single component or a system of components. Pedantically, a failure is something that requires a replacement (like a burnt-up GPU). Its units are time (hours, days, months, years).

FIT (failures in time) rate is a closely related concept: it is the inverse of the MTBF, scaled to a billion hours in service:

$$\text{FIT} = \frac{10^9\ \text{hours}}{\text{MTBF}}$$

It is treated as a unitless quantity because it represents a number of failures (over a fixed $10^9$ device-hours), and hardware vendors often express component reliability in terms of a FIT rate.

Failure rate ($\lambda$) is the frequency with which a component fails. It is measured in failures per unit time; failures per hour, failures per month, or failures per year are the most common. If the unit of time is billions of hours, the failure rate is numerically the same as the FIT rate.

As with the FIT rate, it is inversely related to MTBF:

$$\lambda = \frac{1}{\text{MTBF}}$$

Calculating reliability

Modeling the reliability of systems requires an understanding of the statistics governing failure. For example, there are a couple of ways to calculate MTBF depending on what type of data you have.

MTBF with a constant failure rate

A measure of MTBF can be calculated simply as:

$$\text{MTBF} = \frac{\text{total time in service}}{\text{number of failures}}$$

If you had to send five nodes back for replacement in the past six months, your MTBF is

$$\text{MTBF} = \frac{6\ \text{months}}{5\ \text{failures}} = 1.2\ \text{months}$$

This simplification assumes that failures occur at a constant rate; every 1.2 months, a component will fail. That’s not really how failures occur, though.

Exponential distribution

Things don’t typically fail predictably, so this MTBF of 1.2 months is just that—an average. In reality, things are more likely to fail before they reach MTBF than they are to survive beyond their MTBF, because a component can survive arbitrarily long but it cannot survive fewer than zero minutes.

The time between failures for a component often follows an exponential distribution which models Poisson processes. Anything that randomly fails with no dependence on its previous failure history or other components follows a Poisson process, and its time-to-failure can be drawn from an exponential distribution. Hard drive failures are a perfect example: when a hard drive fails, that failure is usually independent of any other hard drives in the data center. And when that hard drive is replaced, the time-to-failure for the replacement is independent of however long it took for the previous drive to fail.

To generate a time-to-failure for a Poisson process, you’d do:

$$t = -\frac{\ln(U)}{\lambda}$$

where

  • $\lambda$ is the failure rate (e.g., failures per year)
  • $U$ is a uniformly random number between 0 and 1
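As a minimal sketch (not part of the original derivation), here is how that inverse-transform sampling looks in Python; the 0.5 failures-per-year rate is an arbitrary example value:

```python
import math
import random

def time_to_failure(failure_rate: float) -> float:
    """Draw one time-to-failure from an exponential distribution.

    failure_rate is lambda in failures per unit time; the result is in the
    same time unit. Using 1 - U avoids taking log(0), and 1 - U is uniform
    on (0, 1] just like U.
    """
    u = random.random()                      # uniformly random in [0, 1)
    return -math.log(1.0 - u) / failure_rate

# Example: a component that fails 0.5 times per year on average
samples = [time_to_failure(0.5) for _ in range(100_000)]
print(sum(samples) / len(samples))           # should be close to 1/0.5 = 2 years
```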

Survival function and AFR

The survival function (or reliability function) describes the probability that a component will survive for at least a certain amount of time. It can be inferred if you know the failure rate $\lambda$ from above and assume a Poisson process:

$$S(t) = e^{-\lambda t}$$

Recalling that the failure rate is inversely related to MTBF ($\lambda = 1/\text{MTBF}$), you then get:

$$S(t) = e^{-t/\text{MTBF}}$$

Or maybe more meaningfully, the probability that a component will have failed by a time $t$:

$$F(t) = 1 - S(t) = 1 - e^{-t/\text{MTBF}}$$

You can also use this to represent the fraction of components that will fail within a time window. For example, let’s assume a hard drive has an MTBF of 2 million hours and calculate this probability of failure in a year:

$$F(1\ \text{year}) = 1 - e^{-8766\ \text{hours}/2{,}000{,}000\ \text{hours}} = 0.437\%$$

This is called the annualized failure rate (AFR) and represents the probability that a component fails within one year. It is a unitless value.
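As a quick sanity check of that arithmetic, here is a minimal Python sketch (the 8766 hours per year and 2-million-hour MTBF are the same values used above):

```python
import math

HOURS_PER_YEAR = 8766

def annualized_failure_rate(mtbf_hours: float) -> float:
    """Probability that a component fails within one year, given its MTBF in hours."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

print(annualized_failure_rate(2_000_000))    # ≈ 0.00437, i.e., 0.437%
```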

MTBF of a Poisson process

From the time to next failure, you can calculate the mean time between failures. We start with the survival function of a component $S(t)$, which is the probability that a component survives beyond time $t$. Assuming a Poisson process like above, this is:

$$S(t) = e^{-\lambda t}$$

The mean time between failures is the expected time until failure happens, which is the integral of the survival function:

$$\text{MTBF} = \int_0^\infty S(t)\, dt$$

When we use the exponential distribution $S(t) = e^{-\lambda t}$, MTBF then becomes:

$$\text{MTBF} = \int_0^\infty e^{-\lambda t}\, dt = \frac{1}{\lambda}$$
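If you want to verify that integral symbolically, here is a short sketch using sympy (the choice of sympy is just for illustration):

```python
import sympy as sp

t = sp.symbols("t", nonnegative=True)
lam = sp.symbols("lambda", positive=True)

# Survival function of an exponentially distributed time-to-failure
S = sp.exp(-lam * t)

# MTBF is the expected time to failure, i.e., the integral of S(t) over [0, inf)
mtbf = sp.integrate(S, (t, 0, sp.oo))
print(mtbf)   # 1/lambda
```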

FIT and failure rate of a Poisson process

Following our hard drive example (MTBF = 2 million hours), we can calculate the failure rate:

$$\lambda = \frac{1}{\text{MTBF}} = \frac{1}{2{,}000{,}000\ \text{hours}} = 5 \times 10^{-7}\ \text{failures per hour}$$

To convert to failures per year, just multiply by 8766 hours per year:

$$\lambda = 5 \times 10^{-7}\ \frac{\text{failures}}{\text{hour}} \times 8766\ \frac{\text{hours}}{\text{year}} = 0.004383\ \text{failures per year}$$

Or to convert this to failures per billion hours (the FIT rate):

$$\text{FIT} = 5 \times 10^{-7}\ \frac{\text{failures}}{\text{hour}} \times 10^9\ \text{hours} = 500$$
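Putting those three conversions together, a minimal Python sketch using the same 2-million-hour MTBF:

```python
HOURS_PER_YEAR = 8766

mtbf_hours = 2_000_000                           # hard drive MTBF

lam_per_hour = 1.0 / mtbf_hours                  # 5e-7 failures per hour
lam_per_year = lam_per_hour * HOURS_PER_YEAR     # 0.004383 failures per year
fit = lam_per_hour * 1e9                         # 500 failures per billion hours

print(lam_per_hour, lam_per_year, fit)
```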

Reliability of many components

Getting the failure rate of a single hard drive is good, but in HPC, we deploy hard drives by the tens, hundreds, or thousands. Let’s talk about how to scale up these reliability calculations.

Failure rate of many components

Knowing a single hard drive will experience 0.004383 failures/year, we can calculate how many failures per year a JBOD with 106 drives (like those used in Frontier’s file system) will experience—this is its failure rate:

$$\lambda_{\text{JBOD}} = 0.004383\ \frac{\text{failures}}{\text{drive} \cdot \text{year}} \times 106\ \text{drives} = 0.4646\ \text{failures per year}$$

That is, a single 4U106 JBOD will experience 0.4646 drive failures per year. You can invert this fraction ($1/\lambda$) to get 2.2 years per drive failure—the MTBF!

You can scale this up and up as well. Knowing Frontier has 47,700 hard drives,

$$\lambda_{\text{Frontier}} = 0.004383\ \frac{\text{failures}}{\text{drive} \cdot \text{year}} \times 47{,}700\ \text{drives} = 209\ \text{failures per year}$$
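A minimal Python sketch of this scaling, using the per-drive failure rate from above (the 106-drive JBOD and 47,700-drive system sizes are the same as in the text):

```python
HOURS_PER_YEAR = 8766
lam_drive = 0.004383                             # failures per drive per year

for n_drives, label in [(106, "4U106 JBOD"), (47_700, "Frontier file system")]:
    lam_total = lam_drive * n_drives             # failures per year across all drives
    mtbf_hours = HOURS_PER_YEAR / lam_total      # mean time between drive failures
    print(f"{label}: {lam_total:.3f} failures/year, MTBF = {mtbf_hours:,.1f} hours")
```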

Probability of failure and AFR

Remember that the annualized failure rate (AFR) represents the probability that a single hard drive will experience a failure in a year:

$$\text{AFR} = 1 - e^{-(8766\ \text{hours})/\text{MTBF}}$$

When we look at a collection of hard drives, the MTBF starts to represent the mean time between failures of any of its components. If there are two drives, the mean time between failures is half of what it was for one drive. We can then scale up the probability that Frontier’s file system will experience a failure in one year:

$$P(\text{failure within } t) = 1 - e^{-N t / \text{MTBF}}$$

# drives    what it represents       MTBF               AFR
1           a hard drive             2,000,000 hours    0.437%
106         a JBOD                   18,868 hours       37.0%
47,700      the whole file system    41.9 hours         ~100%

where

  • $N$ is the number of components (1 drive, 106 drives, or 47,700 drives).
  • $t$ is the duration in which you want to know the likelihood of one failure. It’s 1 year or 8766 hours here.
  • $\text{MTBF}$ is the mean time between failures of a single component.
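Here is a minimal Python sketch that reproduces the table above from the single-drive MTBF; small differences from the table values are just rounding:

```python
import math

HOURS_PER_YEAR = 8766
MTBF_DRIVE = 2_000_000                           # hours, for a single hard drive

def afr(n_components: int, mtbf_hours: float, t_hours: float = HOURS_PER_YEAR) -> float:
    """Probability of at least one failure among n components within t_hours."""
    return 1.0 - math.exp(-n_components * t_hours / mtbf_hours)

for n in (1, 106, 47_700):
    mtbf = MTBF_DRIVE / n                        # MTBF of the collection
    print(f"{n:>6} drives: MTBF = {mtbf:>12,.1f} hours, AFR = {afr(n, MTBF_DRIVE):.3%}")
```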

Reliability of a system

A nice property of $\lambda$ and FIT is that, for a system of components connected in series, you can add up the failure rates and FIT rates of all components to get a rate for the total system.

Example: Calculating node reliability

Let’s say you have a node that comprises the following parts (FIT values made up by ChatGPT):

Component        Component FIT ($\text{FIT}_i$)    Qty per node ($n_i$)    Total FIT
CPU + DRAM       1,000                             2                       2,000
GPU + HBM        1,500                             8                       12,000
NIC              300                               8                       2,400
Transceiver      100                               8                       800
BMC              500                               1                       500
SSD              1,000                             8                       8,000
SSD backplane    200                               1                       200
Power supply     2,000                             8                       16,000

The FIT for the whole node is the sum of all “Total FIT” values, which is just the sum of FIT rates for every component:

$$\text{FIT}_{\text{node}} = \sum_i n_i \cdot \text{FIT}_i = 2{,}000 + 12{,}000 + 2{,}400 + 800 + 500 + 8{,}000 + 200 + 16{,}000 = 41{,}900$$

This is valid because every component is connected in series; the failure of one component causes the whole node to fail.

Warning

The above is not true if components are redundant; for example, the above assumes all eight power supplies are active/active with no redundancy. This is never true in practice; you would calculate an aggregate FIT for 6+2 power supplies and use that above.

In the above example, the node FIT is 41,900. You can then calculate:

$$\text{MTBF}_{\text{node}} = \frac{10^9\ \text{hours}}{41{,}900\ \text{failures}} = 23{,}866\ \text{hours} \approx 2.7\ \text{years}$$

and

$$\lambda_{\text{node}} = 41{,}900\ \frac{\text{failures}}{10^9\ \text{hours}} \times 8766\ \frac{\text{hours}}{\text{year}} = 0.367\ \text{failures per year}$$
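As a sanity check, here is a small Python sketch that sums the (made-up) FIT values from the table and converts the total to an MTBF and a yearly failure rate:

```python
HOURS_PER_YEAR = 8766

# (component FIT, quantity per node) -- illustrative values from the table above
components = {
    "CPU + DRAM":    (1_000, 2),
    "GPU + HBM":     (1_500, 8),
    "NIC":             (300, 8),
    "Transceiver":     (100, 8),
    "BMC":             (500, 1),
    "SSD":           (1_000, 8),
    "SSD backplane":   (200, 1),
    "Power supply":  (2_000, 8),
}

node_fit = sum(fit * qty for fit, qty in components.values())   # 41,900
node_mtbf_hours = 1e9 / node_fit                                 # ≈ 23,866 hours
node_failures_per_year = HOURS_PER_YEAR / node_mtbf_hours        # ≈ 0.367

print(node_fit, node_mtbf_hours, node_failures_per_year)
```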

Example: Calculating cluster MTBF

Let’s say you have a cluster of 1,024 of these nodes. Calculating the FIT rate of the whole cluster is as simple as connecting all nodes in series:

$$\text{FIT}_{\text{cluster}} = 1{,}024\ \text{nodes} \times 41{,}900\ \frac{\text{FIT}}{\text{node}} = 42{,}905{,}600$$

This means

$$\lambda_{\text{cluster}} = \frac{42{,}905{,}600\ \text{failures}}{10^9\ \text{hours}} = 0.043\ \text{failures per hour}$$

Or, converting to failures per year, the cluster will fail 376 times every year on average. This is better expressed as MTBF:

$$\text{MTBF}_{\text{cluster}} = \frac{8766\ \text{hours per year}}{376\ \text{failures per year}} = 23.3\ \text{hours}$$
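And the cluster-level version of the same arithmetic, as a minimal Python sketch:

```python
HOURS_PER_YEAR = 8766
NODES = 1_024
NODE_FIT = 41_900

cluster_fit = NODES * NODE_FIT                                   # 42,905,600
cluster_failures_per_year = cluster_fit / 1e9 * HOURS_PER_YEAR   # ≈ 376
cluster_mtbf_hours = 1e9 / cluster_fit                           # ≈ 23.3 hours

print(cluster_fit, cluster_failures_per_year, cluster_mtbf_hours)
```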

MTBF vs. JMTTI

Assuming an MPI job runs across all of these nodes (i.e., one node failing = the whole job failing), you can claim that the job mean time to interrupt (JMTTI) would be 23.3 hours as well. However, there are more things that can interrupt a job than component failures—link flaps, kernel panics, cosmic rays, and the like. This will almost always make JMTTI lower than MTBF.

In practice

There’s a table in In practice that contains a few MTBF measurements from large-scale supercomputers.

Non-constant failure rates

When the failure rate is not constant, it becomes the hazard function $h(t)$. This can capture the effects of the bathtub curve, where components initially fail a lot right after they’re manufactured, then level out to a lower, roughly constant failure rate, then shoot back up as they approach their end of life.
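The hazard function itself isn't worked out here, but as an illustrative sketch, the Weibull hazard is one common way to model a non-constant failure rate; the shape and scale values below are arbitrary, not taken from any real component:

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard function h(t), a common model for non-constant failure rates.

    shape < 1: decreasing hazard (infant mortality)
    shape = 1: constant hazard (the exponential/Poisson case above)
    shape > 1: increasing hazard (wear-out)
    """
    return (shape / scale) * (t / scale) ** (shape - 1)

# Arbitrary example: hazard at a few ages for a decreasing and an increasing failure mode
for t in (0.1, 1.0, 5.0, 10.0):
    print(t, weibull_hazard(t, shape=0.7, scale=5.0), weibull_hazard(t, shape=3.0, scale=5.0))
```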