Fan Failures and Unavailability
Introduction
Some problems are associated with so many uncertainties (see Uncertainties
in Thermal Design) that it is doubtful
if a computed result is any better than a crude estimate. The issue for this
article is a typical example. Yet, even if fundamental uncertainties make a
calculation method highly approximate, it can reveal valuable overview
information.
Paradoxically, a good expert can be defined as someone who knows why he
does not know. No one can explain this better than Tony Kordyban. Read his
comments about room temperature. The major uncertainty in this case is that
equipment rooms often house devices with a large spectrum of specifications.
A similarly fuzzy problem is how potential fan failures affect
unavailability. The difference in this case, however, is that the impact of
each parameter can be described with reasonably simple correlations.
Combining them in one method can therefore reflect general tendencies, even
if the numerical result is unreliable.
Figure 1
This equipment is fully functional even if 2 fans fail provided that the
ambient temperature is below 30 C. What is the unavailability?
The problem
Figure 1 shows a subrack cooled by four fans. For a typical telecom
application the maximum room temperature would be specified as 50 C.
Suppose that the PCB temperature increases by 10 C if a fan fails. In that
case the system would still function safely if the ambient temperature was
below 40 C. Room temperatures at that level are, however, uncommon. A
system failure will therefore only occur if two unlikely events coincide: a
fan failure and a high room temperature. In that unfortunate event the
downtime would depend on how fast the fan can be replaced.
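The threshold arithmetic above can be written out as a small sketch. It
assumes that each failed fan adds the same 10 C to the PCB temperature, which
matches the 30 C limit for two failed fans quoted under figure 1; the function
name is hypothetical.

```python
def safe_ambient_limit(max_room_temp_c, rise_per_failed_fan_c, failed_fans):
    """Highest ambient temperature at which the system still functions.

    Assumes the PCB temperature rise is the same for every failed fan.
    """
    return max_room_temp_c - rise_per_failed_fan_c * failed_fans


# Specified maximum room temperature 50 C, 10 C rise per failed fan.
for failed in range(3):
    limit = safe_ambient_limit(50.0, 10.0, failed)
    print(f"{failed} failed fan(s): safe below {limit:.0f} C")
```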
Everybody wants to avoid downtime and electrical engineers spend a lot of
time trying to do so. On the system level there is nevertheless a regrettable
tendency to take the easy way out and specify full functionality at maximum
room temperature even if one, or sometimes two, fans fail. This requirement
is pushed onto the thermal engineers, who upsize the fans and get scolded by
their managers for increasing both cost and noise. There is definite room
for improvement here.
Figure 2
Lifetime and MTBF as defined by the bathtub curve.
Fan reliability
There are two aspects of fan reliability: lifetime and failure intensity. The
latter is often represented by its inverse, MTBF (mean time between
failures). Lifetime and MTBF are both measured in hours and can therefore
easily be confused. When defined as in the bathtub curve, figure 2, the
difference is obvious, but things are not always that clear. There are
two distinct cases.
The first case is a straightforward interpretation of figure 2. It can be
applied when the lifetime of the fans is of the same order as the lifetime
of the equipment. In that case lifetime has no impact on failure intensity
until the fans begin to wear out and need to be replaced (mostly because the
lubricant has evaporated). The failure intensity is caused by sporadic
collapses of weak components and a large number of other effects, including
insect attacks.
The second case is when the lifetime of the fans is much shorter than the
lifetime of the equipment and the strategy is to replace fans whenever they
fail. After some time the fans will form a population with a large spread in
age. Sporadic failures will therefore appear not only as component collapses
but also as wear-outs. That is, the bathtub curve for the fan system has lost
its tail.
The lifetime of high-quality fans at room temperature is currently of the
order of 10 to 20 years. The first case would therefore probably be the most
common. It should also be noted that the lifetime of a fan is defined as
the time at which 90% of a large population is still functional.
Figure 3
MTBF as a function of temperature: realistic or pure guesswork?
The problem is to find realistic values for MTBF. Some manufacturers claim it
is enormous. Others indicate values of the order of 35 years or more. There
is also a temperature impact. To base a curve like the one in figure 3 on a
sequence of experience values for various temperature levels is therefore
next to impossible (do not confuse this curve with the temperature dependence
of the lifetime, for which the manufacturers can now provide decent data).
An additional complication is that any external speed controls also
contribute to the failure intensity. Fortunately, there is a physical
principle that describes how material degradation varies with temperature. It
can be formulated as an exponential function with two characteristic
parameters: the Arrhenius function. This equation is by no means exact for
all electronic components, but it is the best that can be done in this
context. Given two MTBF-temperature pairs it is therefore possible to create
a diagram of the figure 3 type. It is of course apparent that predictions
based on data of this quality cannot be anything other than approximations.
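The two-point construction can be sketched as follows. This is an
illustration only, not the code of the referenced calculator. It assumes the
Arrhenius-type form MTBF = A * exp(B / T) with T in kelvin, and both
MTBF-temperature pairs are invented placeholder values.

```python
import math


def fit_arrhenius(temp1_c, mtbf1_h, temp2_c, mtbf2_h):
    """Return (A, B) so that MTBF = A * exp(B / T_kelvin) passes through
    the two given (temperature, MTBF) pairs."""
    k1 = temp1_c + 273.15
    k2 = temp2_c + 273.15
    b = math.log(mtbf1_h / mtbf2_h) / (1.0 / k1 - 1.0 / k2)
    a = mtbf1_h / math.exp(b / k1)
    return a, b


def mtbf_at(temp_c, a, b):
    """MTBF in hours at the given fan temperature."""
    return a * math.exp(b / (temp_c + 273.15))


# Placeholder data: 400 000 h at 25 C and 100 000 h at 60 C (assumed).
A, B = fit_arrhenius(25.0, 400_000.0, 60.0, 100_000.0)
for temp in (25, 40, 60):
    print(f"{temp} C: MTBF about {mtbf_at(temp, A, B):,.0f} h")
```

With only two points the curve is fully determined, so any error in either
pair propagates directly into every interpolated value.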
Figure 4
A temperature duration curve. It shows the relative time for which the
temperature is above a certain level.
Temperature duration
Another important parameter is the temperature duration, figure 4. The curve
shows the relative time for which the ambient temperature is above a certain
level. Diagrams of this type are only physically relevant for specific
locations. The ones found in specifications are worst-case assumptions. In
addition, they are often specific only for the extremes (of the type <5%
run time in the temperature range 45 - 50 C). Creating a temperature duration
diagram therefore often involves an element of guesswork. The one shown in
figure 4 bears some resemblance to actual data for non-air-conditioned
premises but is completely off target for well-controlled environments.
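A duration curve of this kind can be represented as a short table and read by
interpolation. The sketch below is hedged accordingly: the table values are
invented for illustration (only the 5% above 45 C point echoes the type of
specification quoted above), and linear interpolation between points is an
assumption.

```python
# Hypothetical duration curve:
# (ambient temperature in C, fraction of time spent above that temperature).
DURATION_POINTS = [
    (20.0, 1.00),
    (30.0, 0.40),
    (40.0, 0.10),
    (45.0, 0.05),
    (50.0, 0.00),
]


def fraction_above(temp_c, points=DURATION_POINTS):
    """Fraction of time the ambient temperature exceeds temp_c.

    Linear interpolation between table points, clamped at both ends.
    """
    if temp_c <= points[0][0]:
        return points[0][1]
    if temp_c >= points[-1][0]:
        return points[-1][1]
    for (t0, f0), (t1, f1) in zip(points, points[1:]):
        if t0 <= temp_c <= t1:
            return f0 + (f1 - f0) * (temp_c - t0) / (t1 - t0)


hours_per_year = 8760.0
for limit in (30.0, 40.0, 48.0):
    share = fraction_above(limit)
    print(f"above {limit:.0f} C: {share:.1%} of the time, "
          f"about {share * hours_per_year:.0f} h/year")
```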
Figure 5
The fan speed is often temperature controlled.
Fan speed control
An additional complication is that the fan speed is often temperature
controlled, figure 5. There is no uncertainty in the control curve itself,
so this is the least uncertain factor in a fan unavailability calculation.
However, there is a problem with the sensor location. If both the sensor and
the fans are placed in the outlet air and that air is stratified because of
a non-uniform heat dissipation that jumps from one PCB to another, that
uncertainty enters into the fan failure prediction. A further problem is that
the failure intensity is probably speed dependent. The default value that is
used in the referenced calculator, 75% at half speed, is a pure guess.
It should also be noted that the fan speed has an impact on the exhaust air
temperature rise. The volumetric flow is approximately proportional to the
fan speed. A nominal air temperature rise of 10 C therefore changes to 20 C
when the fans are run at half speed. If the fans are placed in the outlet air,
this will naturally impact the failure intensity.
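These relations can be sketched in a few lines. Only the 75% factor at half
speed and the doubling of a 10 C rise at half speed come from the text. The
shape of the control curve (half speed below 25 C, full speed above 40 C) and
the linear interpolation of the failure intensity factor between half and
full speed are assumptions made for illustration.

```python
def fan_speed_fraction(sensor_temp_c, t_low=25.0, t_high=40.0, min_speed=0.5):
    """Assumed control curve: min_speed below t_low, full speed above
    t_high, linear in between."""
    if sensor_temp_c <= t_low:
        return min_speed
    if sensor_temp_c >= t_high:
        return 1.0
    span = (sensor_temp_c - t_low) / (t_high - t_low)
    return min_speed + (1.0 - min_speed) * span


def exhaust_rise_c(nominal_rise_c, speed_fraction):
    """Air temperature rise, with flow taken as proportional to fan speed."""
    return nominal_rise_c / speed_fraction


def failure_intensity_factor(speed_fraction, factor_at_half=0.75):
    """Relative failure intensity: factor_at_half at half speed, 1.0 at
    full speed, linear in between (the linear form is an assumption)."""
    return factor_at_half + (1.0 - factor_at_half) * (speed_fraction - 0.5) / 0.5


for ambient in (20.0, 32.5, 45.0):
    speed = fan_speed_fraction(ambient)
    print(f"{ambient} C: speed {speed:.2f}, "
          f"exhaust rise {exhaust_rise_c(10.0, speed):.1f} C, "
          f"intensity factor {failure_intensity_factor(speed):.2f}")
```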
Discussion
The calculation procedure is quite simple. It is based on a step-by-step
integration of the temperature duration curve. The fan speed, the exhaust air
temperature, the MTBF and the fan temperature are determined for each step.
The result is the average fan temperature and the long-term MTBF. Combined
with the probability that the room temperature is above the safe function
threshold and the repair time, the result is the downtime.
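The procedure can be sketched as below. This is not the referenced
calculator: every number is a placeholder, the speed sensor is assumed to read
the ambient temperature, the long-term MTBF is taken as the inverse of the
time-weighted failure intensity, and fan failures are assumed to be
independent of the room temperature at the moment of failure. The output
should therefore not be compared with the calculator's results.

```python
import math

HOURS_PER_YEAR = 8760.0

# Hypothetical duration curve: (ambient C, fraction of time above it).
DURATION = [(20.0, 1.0), (30.0, 0.40), (40.0, 0.10), (45.0, 0.05), (50.0, 0.0)]


def fraction_above(temp_c):
    """Linear interpolation in the duration curve, clamped at both ends."""
    if temp_c <= DURATION[0][0]:
        return DURATION[0][1]
    if temp_c >= DURATION[-1][0]:
        return DURATION[-1][1]
    for (t0, f0), (t1, f1) in zip(DURATION, DURATION[1:]):
        if t0 <= temp_c <= t1:
            return f0 + (f1 - f0) * (temp_c - t0) / (t1 - t0)


def speed_fraction(ambient_c, t_low=25.0, t_high=40.0, min_speed=0.5):
    """Assumed control curve, driven by the ambient temperature."""
    span = (ambient_c - t_low) / (t_high - t_low)
    return min_speed + (1.0 - min_speed) * min(1.0, max(0.0, span))


def mtbf_hours(fan_temp_c, ref_mtbf_h=300_000.0, ref_temp_c=40.0, b_kelvin=4000.0):
    """Arrhenius-type MTBF; all three parameters are placeholders."""
    inv_t = 1.0 / (fan_temp_c + 273.15)
    inv_ref = 1.0 / (ref_temp_c + 273.15)
    return ref_mtbf_h * math.exp(b_kelvin * (inv_t - inv_ref))


def intensity_factor(speed, factor_at_half=0.75):
    """Relative failure intensity versus speed (linear form assumed)."""
    return factor_at_half + (1.0 - factor_at_half) * (speed - 0.5) / 0.5


def long_term_figures(step_c=1.0, fans_in_exhaust=True, nominal_rise_c=10.0):
    """Integrate over the duration curve in temperature steps.

    Returns (average fan temperature in C, long-term MTBF in hours).
    """
    t_min, t_max = DURATION[0][0], DURATION[-1][0]
    steps = int(round((t_max - t_min) / step_c))
    mean_temp = 0.0
    mean_intensity = 0.0
    for i in range(steps):
        low = t_min + i * step_c
        high = low + step_c
        weight = fraction_above(low) - fraction_above(high)  # time share
        ambient = 0.5 * (low + high)
        speed = speed_fraction(ambient)
        rise = nominal_rise_c / speed if fans_in_exhaust else 0.0
        fan_temp = ambient + rise
        mean_temp += weight * fan_temp
        mean_intensity += weight * intensity_factor(speed) / mtbf_hours(fan_temp)
    return mean_temp, 1.0 / mean_intensity


def downtime_minutes_per_year(n_fans, mtbf_h, threshold_c, repair_h):
    """Expected downtime: failures per year x probability of being above
    the safe threshold x repair time."""
    failures_per_year = n_fans * HOURS_PER_YEAR / mtbf_h
    return failures_per_year * fraction_above(threshold_c) * repair_h * 60.0


avg_temp, mtbf = long_term_figures()
print(f"average fan temperature {avg_temp:.1f} C, long-term MTBF {mtbf:,.0f} h")
print(f"downtime {downtime_minutes_per_year(4, mtbf, 40.0, 10.0):.1f} min/year")
```

Changing one input at a time (repair time, threshold, fan location) and
comparing the outputs is the intended use; the absolute numbers carry all the
uncertainties discussed above.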
It could be of interest to look at some results of the included calculator.
The default values yield an unavailability of ~6 min/year. In view of all
the uncertainties, could it be 6 hours? Probably not, but it is up to anyone
to guess. If the repair time is decreased from 10 to 5 hours, the downtime
changes proportionally. The impact of this parameter is often overlooked.
Actually, it is just as important as MTBF.
Another parameter of interest is the fan location. There are several aspects
of this subject, but one argument against placing fans in the exhaust air is
that it increases the failure intensity. The calculator can be used to
estimate the order of that effect. Even if not numerically exact, the relative
result should be fairly relevant. For the default values this effect is
predicted to be a 40% increase.
It is also easy to simulate a system that is fully functional for worst-case
conditions with one failed fan. Setting the temperature level for one failed
fan to 50 C simulates that. The result is a decrease in unavailability by a
factor of 30,000, truly dramatic. But is it worth the effort? Could it not be
sufficient to specify 48 C? Every degree that does not need to be cooled
is valuable and it is always the last ones that are most expensive.
In spite of all the uncertainties involved, these examples show that a
calculator of this type does have a value as a sensitivity analyser. Since
the parameters of the problem interact in a complex manner, the results
should also be better than purely intuitive conclusions. Another advantage is
that it exposes the complexity of the problem to those who might think
otherwise. The disadvantage, of course, is that the numerical result cannot
be trusted, which could produce erroneous conclusions when used by
someone who does not understand the background.