What is a Failure?
Almost every maintenance organization sets some sort of failure elimination goal. The problem is that this goal is often set without fully understanding what a failure is. In some organizations, equipment isn’t considered failed unless it is totally inoperative. In others, equipment is considered failed if there is some partial loss of function, such as a reduced production rate or off-quality production outside normal targets. The result is constant argument about whether a failure ever really occurred. Eliminating failures requires a slightly different outlook on what constitutes a failure.
Let’s begin by taking a look at the definition generated by Nowlan and Heap in their seminal work on Reliability-Centered Maintenance (Nowlan & Heap).
“…Without a precise definition of what condition represents a failure, there is no way to assess its consequences or to define the physical evidence for which to inspect. The term failure must, in fact, be given a far more explicit definition than ‘an inability to function’ in order to clarify the basis of Reliability-Centered Maintenance.”
“…A failure is an unsatisfactory condition. In other words, a failure is an identifiable deviation from the original condition which is unsatisfactory to a particular user.”
They further define two types of failures.
“A functional failure is the inability of an item (or the equipment containing it) to meet a specified performance standard and is usually identified by an operator.”
“A potential failure is an identifiable physical condition which indicates a functional failure is imminent and is usually identified by a Maintenance Technician using predictive or quantitative preventive maintenance.”
Predictive, or condition-based, maintenance rests on the concept that there is sufficient time between when the potential failure is detected and when the functional failure occurs for the organization to react and prevent the functional failure. This interval is known as the P-F interval.
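To make the timing argument concrete, here is a minimal sketch. It is not a method taken from Nowlan and Heap. It assumes a fixed P-F interval, inspections at a fixed interval, and a hypothetical reaction time needed to plan and carry out the repair. In the worst case the potential failure becomes detectable just after an inspection, so it is only found one full inspection interval later. The function names and the numbers are illustrative assumptions.

```python
def worst_case_warning_days(pf_interval_days: float,
                            inspection_interval_days: float) -> float:
    """Warning time left when the potential failure becomes detectable
    just after an inspection, so it is only found at the next one."""
    return pf_interval_days - inspection_interval_days


def inspection_interval_is_adequate(pf_interval_days: float,
                                    inspection_interval_days: float,
                                    reaction_days: float) -> bool:
    """True if, even in the worst case, enough time remains to react
    before the functional failure occurs."""
    warning = worst_case_warning_days(pf_interval_days, inspection_interval_days)
    return warning >= reaction_days


# Hypothetical example: 30-day P-F interval, 14 days needed to plan and repair.
print(inspection_interval_is_adequate(30, 10, 14))  # True: 20 days of warning remain
print(inspection_interval_is_adequate(30, 20, 14))  # False: only 10 days remain
```

Under these assumptions, inspecting more often than the P-F interval is necessary but not sufficient. The inspection interval also has to leave room for the organization's reaction time.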
These definitions mean that it is up to each organization to decide what constitutes an unacceptable condition. This decision significantly affects whether an organization will actually be able to eliminate all functional failures other than those it has decided to accept by making a run-to-failure (no scheduled maintenance) decision.
Age and reliability studies conducted on aircraft components over a period of years revealed the six basic age-reliability relationships shown in Figure 1-1. The vertical axis of these curves represents the conditional probability of failure, and the horizontal axis represents time in service after installation or overhaul.
Figure 1-1: Aircraft Component Failure Patterns – Nowlan & Heap, John Moubray
What is particularly striking about these curves is the very low percentage of items that display a distinct wear-out region, the large number of items that display a random failure region, and the extremely high percentage of items that display an infant failure region. Only patterns A and B, which together represent just six percent of the items studied, display the wear-out region, denoted by a rapidly increasing conditional probability of failure at the right-hand end of the curve. Ninety-five percent of the items studied had at least some region of random failures, denoted by a flat region in the curve. Pattern C was the only curve that did not have some region of random failure. This means that 95% of the equipment in the study may benefit from some form of condition monitoring and that only 6% may benefit from time-based replacement or overhaul.
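The percentages in the paragraph above can be checked with a short calculation. The sketch below is illustrative only. The per-pattern shares are the commonly cited figures from the Nowlan and Heap study and are an assumption here, because the text states only the totals for A and B combined, for pattern F, and for everything except pattern C.

```python
# Share of items following each pattern, in percent.
# Assumption: commonly cited Nowlan and Heap figures; the text gives only
# A + B = 6, F = 68, and "all but C" = 95.
PATTERN_SHARE = {"A": 4, "B": 2, "C": 5, "D": 7, "E": 14, "F": 68}

WEAR_OUT_PATTERNS = {"A", "B"}   # rising curve at the right-hand end
NO_RANDOM_REGION = {"C"}         # the only curve with no flat region


def share(patterns) -> int:
    """Total percentage of items covered by the given patterns."""
    return sum(PATTERN_SHARE[p] for p in patterns)


wear_out_pct = share(WEAR_OUT_PATTERNS)
random_region_pct = share(set(PATTERN_SHARE) - NO_RANDOM_REGION)
infant_failure_pct = PATTERN_SHARE["F"]

print(wear_out_pct)        # 6
print(random_region_pct)   # 95
print(infant_failure_pct)  # 68
```

The 6% and 95% figures overlap, since patterns A and B show both a flat region and a wear-out region, so the two numbers are not expected to add up to 100.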
It is important to recognize the significance of Pattern F, or infant mortality. Sixty-eight percent of the items studied had a high conditional probability of failure immediately after installation and commissioning. The majority of item failures were being induced by activities directly related to time-based replacements and overhauls. The overall maintenance strategy in place at the time was extremely faulty, and was not achieving the desired goals of restoring, protecting, and preserving the function of the equipment in the safest, most economical manner.
Causes of Failures
All equipment failures are governed by the simple laws of physics present in everyday life. Friction, erosion, corrosion, stress, and impact are the physical basis for most failures. It is the interaction of humans with the equipment that determines whether these causes occur normally or abnormally.
Figure 1-2: Equipment Item Life Cycle
As we can see in Figure 1-2, human interaction with the equipment occurs at every phase of an item’s life. Substandard performance and errors at any phase will result in decreased reliability, and the result will be lower profits, more environmental incidents, and more safety incidents. In general, PM activities are designed either to prevent the physical sources of failure from occurring or to remove the item before degradation caused by those forces results in loss of equipment function. As we saw from the six failure patterns, only a very small percentage of equipment will benefit from time-based replacement or overhaul. A preventive replacement/overhaul (PM) strategy depends on knowing which equipment has the wear-out pattern and what the best time is to perform the PM.
A failure elimination strategy is driven by finding the actions that create random failures, infant failures, and early wear-out failures, and eliminating them. The Failure Reporting, Analysis, and Corrective Action System (FRACAS) is designed to help the organization detect common failure modes, determine the causes of the failure modes, and eliminate them. Table 1-2 shows that stopping at the physical root of a failure will probably not eliminate future failures of the same type. The root cause analysis (RCA) absolutely has to address the human side of the failure equation.
Hidden failures are functional failures that share two very important characteristics. First, they can’t be seen by the operators during normal operation of the system. Second, they usually occur in items that protect people from severe injury or death, or protect equipment from severe damage. The combination of those two characteristics means that these items must have some sort of failure-finding task assigned to them as part of any maintenance strategy. It is important to remember that the longer the hidden loss of function is present, the higher the risk of a catastrophic consequence.
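The link between hidden time and risk can be illustrated with a simple model. This is a sketch under a stated assumption, not a method from the source: demands on the protective device (the events it exists to guard against) are assumed to arrive at random at a constant average rate. The function name and the rate used are hypothetical.

```python
import math


def probability_of_demand(hidden_years: float, demands_per_year: float) -> float:
    """Probability that at least one demand occurs while the protective
    function is lost, assuming demands arrive randomly at a constant rate."""
    return 1.0 - math.exp(-demands_per_year * hidden_years)


# Hypothetical protective device that is called on about once every 10 years.
for hidden_years in (0.25, 1.0, 5.0):
    risk = probability_of_demand(hidden_years, demands_per_year=0.1)
    print(f"hidden for {hidden_years} years: {risk:.1%} chance of an unprotected demand")
```

Under this assumption the chance of an unprotected demand rises steadily with the time the failure stays hidden, which is why a shorter failure-finding interval lowers the exposure.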
Threads between Common Failures
The failures we see in an organization are either crisis failures or chronic failures. Table 1-2 delineates the characteristics of the two types. The primary thing to remember is that solving chronic failures actually changes the system’s overall output. The goal with FRACAS is to be able to recognize the chronic failures, determine and eliminate the cause, and spread that solution across the organization, whether nationally or internationally.
In most instances, crisis failures will be analyzed and the root cause will be eliminated. With a good FRACAS we will be able to see the commonality of failure modes that create chronic failures. We will be able to use the data to determine which of the failure patterns the failure modes fit, and take appropriate action to eliminate them. The beauty of a well-defined corporate-level FRACAS is that failures in every facility can be tracked, and failure modes that are common across the entire corporation can be addressed. That translates into substantial reductions in overall cost. You may find that a certain brand of pump suffers early seal failures in every facility that uses it, regardless of the quality of local repairs. Or you may find that facilities in a certain region are experiencing fewer failures, and a different failure pattern for the same failure mode. In either case, the failure patterns will be easier to determine because the significant failure modes have been recorded and analyzed using the proper tools.
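As a rough illustration of how corporate-level failure records can expose a common failure mode, consider the sketch below. It is not a description of any particular FRACAS product. The record fields, facility names, equipment names, and the two-facility threshold are all hypothetical.

```python
from collections import defaultdict

# Hypothetical failure records as they might be reported by each facility.
records = [
    {"facility": "Plant A", "asset": "Pump brand X", "failure_mode": "seal leak"},
    {"facility": "Plant B", "asset": "Pump brand X", "failure_mode": "seal leak"},
    {"facility": "Plant C", "asset": "Pump brand X", "failure_mode": "seal leak"},
    {"facility": "Plant A", "asset": "Conveyor Y", "failure_mode": "belt misalignment"},
    {"facility": "Plant A", "asset": "Conveyor Y", "failure_mode": "belt misalignment"},
    {"facility": "Plant B", "asset": "Fan Z", "failure_mode": "bearing wear"},
]


def common_failure_modes(records, min_facilities=2):
    """Return (asset, failure mode) pairs reported by at least
    `min_facilities` different facilities, with the facilities involved."""
    facilities_by_mode = defaultdict(set)
    for record in records:
        key = (record["asset"], record["failure_mode"])
        facilities_by_mode[key].add(record["facility"])
    return {
        key: sorted(facilities)
        for key, facilities in facilities_by_mode.items()
        if len(facilities) >= min_facilities
    }


print(common_failure_modes(records))
# {('Pump brand X', 'seal leak'): ['Plant A', 'Plant B', 'Plant C']}
```

The conveyor problem repeats within one plant, so it looks like a local issue, while the pump seal leak appears in every plant and is a candidate for a corporate-wide fix.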
In this chapter we have looked at a working definition of failure that allows organizations to decide for themselves what a failure is by defining which events are unsatisfactory. We saw how the definitions of potential and functional failure are related to condition monitoring. We looked at the six common failure patterns and saw from them that a large number of failures could benefit from condition monitoring, and that a large number of defects are actually induced in equipment by the actions of maintaining and commissioning the equipment. We also looked at how failures can be caused at every point in the life cycle of equipment starting in the design phase. The concept of crisis and chronic failures was introduced, and we looked at how a FRACAS can be used to ferret out chronic failures to help improve overall business performance.
Later we will talk about eliminating failures. We will look at the I-P-F curve and see how it relates to eliminating versus managing failures. Lastly, we will look at how the U.S. Navy used the concept of procedures-based maintenance to significantly reduce infant failures.
1-1: Reliability-Centered Maintenance, F. Stanley Nowlan and Howard F. Heap, DOD Report Number A066-579, December 29, 1978
1-2: Failure Analysis/Problem Solving Methods Student Manual, Reliability Center, Inc.