MTBF, FIT, and failure rate ($\lambda$) are closely related concepts that describe component reliability. In brief,

$$\text{MTBF} = \frac{1}{\lambda} = \frac{10^9\ \text{hours}}{\text{FIT}}$$

Specifically,

Mean Time Between Failure (MTBF) refers to the mean time between failures of a single component or a system of components. Pedantically, a failure is something that requires a replacement (like a burnt-up GPU). Its units are time (hours, days, months, years).

FIT (failures in time) rate is a closely related concept: it is the inverse of the MTBF, scaled to a billion hours in service:

$$\text{FIT} = \frac{10^9\ \text{hours}}{\text{MTBF}}$$

It is treated as a unitless quantity because it represents a number of failures (over a fixed $10^9$ device-hours), and hardware vendors often express component reliability in terms of a FIT rate.

Failure rate ($\lambda$) is the frequency with which a component fails. It is measured in failures per unit time; failures per hour, failures per month, or failures per year are the most common. If the unit of time is billions of hours, the failure rate is numerically the same as the FIT rate.

As with the FIT rate, it is inversely related to MTBF:

$$\lambda = \frac{1}{\text{MTBF}}$$

Calculating reliability

Modeling the reliability of systems requires an understanding of the statistics governing failure. For example, there are a couple of ways to calculate MTBF depending on what type of data you have.

MTBF with a constant failure rate

A measure of MTBF can be calculated simply as:

$$\text{MTBF} = \frac{\text{total time in service}}{\text{number of failures}}$$

If you had to send five nodes back for replacement in the past six months, your MTBF is

$$\text{MTBF} = \frac{6\ \text{months}}{5\ \text{failures}} = 1.2\ \text{months}$$

This simplification assumes that failures occur at a constant rate; every 1.2 months, a component will fail. That’s not really how failures occur, though.

Exponential distribution

Things don’t typically fail predictably, so this MTBF of 1.2 months is just that—an average. In reality, things are more likely to fail before they reach MTBF than they are to survive beyond their MTBF, because a component can survive arbitrarily long but it cannot survive fewer than zero minutes.

The time between failures for a component often follows an exponential distribution which models Poisson processes. Anything that randomly fails with no dependence on its previous failure history or other components follows a Poisson process, and its time-to-failure can be drawn from an exponential distribution. Hard drive failures are a perfect example: when a hard drive fails, that failure is usually independent of any other hard drives in the data center. And when that hard drive is replaced, the time-to-failure for the replacement is independent of however long it took for the previous drive to fail.

To generate a time-to-failure for a Poisson process, you’d do:

$$t = -\frac{\ln(U)}{\lambda}$$

where

  • $\lambda$ is the failure rate (e.g., failures per year)
  • $U$ is a uniformly random number between 0 and 1
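As a minimal sketch (not part of the original derivation), here is how that inverse-transform sampling looks in Python; the 0.5 failures-per-year rate is an arbitrary example value:

```python
import math
import random

def time_to_failure(failure_rate: float) -> float:
    """Draw one time-to-failure from an exponential distribution.

    failure_rate is lambda in failures per unit time; the result is in the
    same time unit. Using 1 - U avoids taking log(0), and 1 - U is uniform
    on (0, 1] just like U.
    """
    u = random.random()                      # uniformly random in [0, 1)
    return -math.log(1.0 - u) / failure_rate

# Example: a component that fails 0.5 times per year on average
samples = [time_to_failure(0.5) for _ in range(100_000)]
print(sum(samples) / len(samples))           # should be close to 1/0.5 = 2 years
```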

Survival function and AFR

The survival function (or reliability function) describes the probability that a component will survive for at least a certain amount of time. It can be inferred if you know the failure rate $\lambda$ from above and assume a Poisson process:

$$S(t) = e^{-\lambda t}$$

Recalling that the failure rate is inversely related to MTBF ($\lambda = 1/\text{MTBF}$), you then get:

$$S(t) = e^{-t/\text{MTBF}}$$

Or maybe more meaningfully, the probability that a component will have failed by a time $t$:

$$F(t) = 1 - S(t) = 1 - e^{-t/\text{MTBF}}$$

You can also use this to represent the fraction of components that will fail within a time window. For example, let’s assume a hard drive has an MTBF of 2 million hours and calculate this probability of failure in a year:

$$F(1\ \text{year}) = 1 - e^{-8766\ \text{hours}/2{,}000{,}000\ \text{hours}} = 0.437\%$$

This is called the annualized failure rate (AFR) and represents the probability that a component fails within one year. It is a unitless value.
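As a quick sanity check of that arithmetic, here is a minimal Python sketch (the 8766 hours per year and 2-million-hour MTBF are the same values used above):

```python
import math

HOURS_PER_YEAR = 8766

def annualized_failure_rate(mtbf_hours: float) -> float:
    """Probability that a component fails within one year, given its MTBF in hours."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

print(annualized_failure_rate(2_000_000))    # ≈ 0.00437, i.e., 0.437%
```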

MTBF of a Poisson process

From the time to next failure, you can calculate the mean time between failures. We start with the survival function of a component $S(t)$, which is the probability that a component survives beyond time $t$. Assuming a Poisson process like above, this is:

$$S(t) = e^{-\lambda t}$$

The mean time between failures is the expected time until failure happens, which is the integral of the survival function:

$$\text{MTBF} = \int_0^\infty S(t)\, dt$$

When we use the exponential distribution $S(t) = e^{-\lambda t}$, MTBF then becomes:

$$\text{MTBF} = \int_0^\infty e^{-\lambda t}\, dt = \frac{1}{\lambda}$$
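If you want to verify that integral symbolically, here is a short sketch using sympy (the choice of sympy is just for illustration):

```python
import sympy as sp

t = sp.symbols("t", nonnegative=True)
lam = sp.symbols("lambda", positive=True)

# Survival function of an exponentially distributed time-to-failure
S = sp.exp(-lam * t)

# MTBF is the expected time to failure, i.e., the integral of S(t) over [0, inf)
mtbf = sp.integrate(S, (t, 0, sp.oo))
print(mtbf)   # 1/lambda
```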

FIT and failure rate of a Poisson process

Following our hard drive example (MTBF = 2 million hours), we can calculate the failure rate:

$$\lambda = \frac{1}{\text{MTBF}} = \frac{1}{2{,}000{,}000\ \text{hours}} = 5 \times 10^{-7}\ \text{failures per hour}$$

To convert to failures per year, just multiply by 8766 hours per year:

$$\lambda = 5 \times 10^{-7}\ \frac{\text{failures}}{\text{hour}} \times 8766\ \frac{\text{hours}}{\text{year}} = 0.004383\ \text{failures per year}$$

Or to convert this to failures per billion hours (the FIT rate):

$$\text{FIT} = 5 \times 10^{-7}\ \frac{\text{failures}}{\text{hour}} \times 10^9\ \text{hours} = 500$$
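Putting those three conversions together, a minimal Python sketch using the same 2-million-hour MTBF:

```python
HOURS_PER_YEAR = 8766

mtbf_hours = 2_000_000                           # hard drive MTBF

lam_per_hour = 1.0 / mtbf_hours                  # 5e-7 failures per hour
lam_per_year = lam_per_hour * HOURS_PER_YEAR     # 0.004383 failures per year
fit = lam_per_hour * 1e9                         # 500 failures per billion hours

print(lam_per_hour, lam_per_year, fit)
```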

Reliability of many components

Getting the failure rate of a single hard drive is good, but in HPC, we deploy hard drives by the tens, hundreds, or thousands. Let’s talk about how to scale up these reliability calculations.

Failure rate of many components

Knowing a single hard drive will experience 0.004383 failures/year, we can calculate how many failures per year a JBOD with 106 drives (like those used in Frontier’s file system) will experience—this is its failure rate:

$$\lambda_{\text{JBOD}} = 0.004383\ \frac{\text{failures}}{\text{drive} \cdot \text{year}} \times 106\ \text{drives} = 0.4646\ \text{failures per year}$$

That is, a single 4U106 JBOD will experience 0.4646 drive failures per year. You can invert this fraction ($1/\lambda$) to get 2.2 years per drive failure—the MTBF!

You can scale this up and up as well. Knowing Frontier has 47,700 hard drives,

$$\lambda_{\text{Frontier}} = 0.004383\ \frac{\text{failures}}{\text{drive} \cdot \text{year}} \times 47{,}700\ \text{drives} = 209\ \text{failures per year}$$
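A minimal Python sketch of this scaling, using the per-drive failure rate from above (the 106-drive JBOD and 47,700-drive system sizes are the same as in the text):

```python
HOURS_PER_YEAR = 8766
lam_drive = 0.004383                             # failures per drive per year

for n_drives, label in [(106, "4U106 JBOD"), (47_700, "Frontier file system")]:
    lam_total = lam_drive * n_drives             # failures per year across all drives
    mtbf_hours = HOURS_PER_YEAR / lam_total      # mean time between drive failures
    print(f"{label}: {lam_total:.3f} failures/year, MTBF = {mtbf_hours:,.1f} hours")
```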

Probability of failure and AFR

Remember that the annualized failure rate (AFR) represents the probability that a single hard drive will experience a failure in a year:

$$\text{AFR} = 1 - e^{-(8766\ \text{hours})/\text{MTBF}}$$

When we look at a collection of hard drives, the MTBF starts to represent the mean time between failures of any of its components. If there are two drives, the mean time between failures is half of what it was for one drive. We can then scale up the probability that Frontier’s file system will experience a failure in one year:

$$P(\text{failure within } t) = 1 - e^{-N t / \text{MTBF}}$$

# drives    what it represents       MTBF               AFR
1           a hard drive             2,000,000 hours    0.437%
106         a JBOD                   18,868 hours       37.0%
47,700      the whole file system    41.9 hours         ~100%

where

  • $N$ is the number of components (1 drive, 106 drives, or 47,700 drives).
  • $t$ is the duration in which you want to know the likelihood of one failure. It’s 1 year or 8766 hours here.
  • $\text{MTBF}$ is the mean time between failures of a single component.
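Here is a minimal Python sketch that reproduces the table above from the single-drive MTBF; small differences from the table values are just rounding:

```python
import math

HOURS_PER_YEAR = 8766
MTBF_DRIVE = 2_000_000                           # hours, for a single hard drive

def afr(n_components: int, mtbf_hours: float, t_hours: float = HOURS_PER_YEAR) -> float:
    """Probability of at least one failure among n components within t_hours."""
    return 1.0 - math.exp(-n_components * t_hours / mtbf_hours)

for n in (1, 106, 47_700):
    mtbf = MTBF_DRIVE / n                        # MTBF of the collection
    print(f"{n:>6} drives: MTBF = {mtbf:>12,.1f} hours, AFR = {afr(n, MTBF_DRIVE):.3%}")
```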

Reliability of a system

A nice property of $\lambda$ and FIT is that, for a system of components connected in series, you can add up the failure rates and FIT rates of all components to get a rate for the total system.

Example: Calculating node reliability

Let’s say you have a node that comprises the following parts (FIT values made up by ChatGPT):

Component        Component FIT ($\text{FIT}_i$)    Qty per node ($n_i$)    Total FIT
CPU + DRAM       1,000                             2                       2,000
GPU + HBM        1,500                             8                       12,000
NIC              300                               8                       2,400
Transceiver      100                               8                       800
BMC              500                               1                       500
SSD              1,000                             8                       8,000
SSD backplane    200                               1                       200
Power supply     2,000                             8                       16,000

The FIT for the whole node is the sum of all “Total FIT” values, which is just the sum of FIT rates for every component:

$$\text{FIT}_{\text{node}} = \sum_i n_i \cdot \text{FIT}_i = 2{,}000 + 12{,}000 + 2{,}400 + 800 + 500 + 8{,}000 + 200 + 16{,}000 = 41{,}900$$

This is valid because every component is connected in series; the failure of one component causes the whole node to fail.

Warning

The above is not true if components are redundant; for example, the above assumes all eight power supplies are active/active with no redundancy. This is never true in practice; you would calculate an aggregate FIT for 6+2 power supplies and use that above.

In the above example, the node FIT is 41,900. You can then calculate:

$$\text{MTBF}_{\text{node}} = \frac{10^9\ \text{hours}}{41{,}900\ \text{failures}} = 23{,}866\ \text{hours} \approx 2.7\ \text{years}$$

and

$$\lambda_{\text{node}} = 41{,}900\ \frac{\text{failures}}{10^9\ \text{hours}} \times 8766\ \frac{\text{hours}}{\text{year}} = 0.367\ \text{failures per year}$$
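As a sanity check, here is a small Python sketch that sums the (made-up) FIT values from the table and converts the total to an MTBF and a yearly failure rate:

```python
HOURS_PER_YEAR = 8766

# (component FIT, quantity per node) -- illustrative values from the table above
components = {
    "CPU + DRAM":    (1_000, 2),
    "GPU + HBM":     (1_500, 8),
    "NIC":             (300, 8),
    "Transceiver":     (100, 8),
    "BMC":             (500, 1),
    "SSD":           (1_000, 8),
    "SSD backplane":   (200, 1),
    "Power supply":  (2_000, 8),
}

node_fit = sum(fit * qty for fit, qty in components.values())   # 41,900
node_mtbf_hours = 1e9 / node_fit                                 # ≈ 23,866 hours
node_failures_per_year = HOURS_PER_YEAR / node_mtbf_hours        # ≈ 0.367

print(node_fit, node_mtbf_hours, node_failures_per_year)
```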

Example: Calculating cluster MTBF

Let’s say you have a cluster of 1,024 of these nodes. Calculating the FIT rate of the whole cluster is as simple as connecting all nodes in series:

$$\text{FIT}_{\text{cluster}} = 1{,}024\ \text{nodes} \times 41{,}900\ \frac{\text{FIT}}{\text{node}} = 42{,}905{,}600$$

This means

$$\lambda_{\text{cluster}} = \frac{42{,}905{,}600\ \text{failures}}{10^9\ \text{hours}} = 0.043\ \text{failures per hour}$$

Or, converting to failures per year, the cluster will fail 376 times every year on average. This is better expressed as MTBF:

$$\text{MTBF}_{\text{cluster}} = \frac{8766\ \text{hours per year}}{376\ \text{failures per year}} = 23.3\ \text{hours}$$
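And the cluster-level version of the same arithmetic, as a minimal Python sketch:

```python
HOURS_PER_YEAR = 8766
NODES = 1_024
NODE_FIT = 41_900

cluster_fit = NODES * NODE_FIT                                   # 42,905,600
cluster_failures_per_year = cluster_fit / 1e9 * HOURS_PER_YEAR   # ≈ 376
cluster_mtbf_hours = 1e9 / cluster_fit                           # ≈ 23.3 hours

print(cluster_fit, cluster_failures_per_year, cluster_mtbf_hours)
```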

MTBF vs. JMTTI

Assuming an MPI job runs across all of these nodes (i.e., one node failing = the whole job failing), you can claim that the job mean time to interrupt (JMTTI) would be 23.3 hours as well. However, there are more things that can interrupt a job than component failures—link flaps, kernel panics, cosmic rays, and the like. This will almost always make JMTTI lower than MTBF.

In practice

There’s a table in In practice that contains a few MTBF measurements from large-scale supercomputers.

Non-constant failure rates

When the failure rate is not constant, it becomes the hazard function $h(t)$. This can capture the effects of the bathtub curve, where components initially fail a lot right after they’re manufactured, then level out to a lower, roughly constant failure rate, then shoot back up as they approach their end of life.
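The hazard function itself isn't worked out here, but as an illustrative sketch, the Weibull hazard is one common way to model a non-constant failure rate; the shape and scale values below are arbitrary, not taken from any real component:

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard function h(t), a common model for non-constant failure rates.

    shape < 1: decreasing hazard (infant mortality)
    shape = 1: constant hazard (the exponential/Poisson case above)
    shape > 1: increasing hazard (wear-out)
    """
    return (shape / scale) * (t / scale) ** (shape - 1)

# Arbitrary example: hazard at a few ages for a decreasing and an increasing failure mode
for t in (0.1, 1.0, 5.0, 10.0):
    print(t, weibull_hazard(t, shape=0.7, scale=5.0), weibull_hazard(t, shape=3.0, scale=5.0))
```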