Accelerated Reliability Solutions, LLC.

Why Electronics Failure Prediction Methodology does not work, but we still wish it did

Posted 12-6-2012

“When the number of factors coming into play in a phenomenological complex is too large, scientific method in most cases fails. One need only think of the weather, in which case the prediction even for a few days ahead is impossible.” – Albert Einstein


“Prediction is very difficult, especially about the future.” – Niels Bohr


We have always had a quest to reduce future uncertainties and know what is going to happen to us, how long we will live, and what may impact our lives. Horoscopes, Tarot Cards, tea leaves, and crystal balls have been used as specialized “tools” by fortune tellers to gaze into the future. The paradox of fortune telling is that by knowing the future, we can change it. The risk side of believing we know the future is also that if we incorrectly guess (assume) the causes of a future event, our prevention action may create additional costs or higher risk of an even worse event.

This is also true when making predictions of the future life of electronics.  Without clear traceability to the actual physics of failure in electronics, assumptions about the causes of failures have added costs without benefits.  Still today, the long-established belief that electronics systems failures are mostly driven by thermal stress persists. This belief continues in spite of the fact that there is no traceability to an intrinsic physical mechanism in components or systems driven by thermal stress in today’s electronics.


Many in management and marketing of electronics companies want to believe and wish reliability engineers could predict the life of electronics systems. By knowing the future failure rates, we could budget warranty costs and the correct number of spare parts and replacement units before the product is launched.


In 1995 my friend Professor Michael Pecht, founder and chairman of the University of Maryland's Center for Advanced Life Cycle Engineering Consortium, wrote and published the article “Why Traditional Reliability Predictions don’t work – Is there an Alternative.”  In it he provides the history of one of the foundational documents of electronics reliability engineering, Military Handbook 217 (MIL HDBK 217), and explains why it cannot predict electronic system failure rates.  It was removed as a military reference document in 1995, largely due to the work of Prof. Pecht.  It is amazing that MIL HDBK 217, removed almost 17 years ago, is still being referenced and that its progeny are still being used for reliability predictions in many electronics companies today.  Needless to say, electronics materials and manufacturing methods have changed tremendously in the last 17 years, but the continued belief that electronics systems reliability can be predicted has changed little in that time.


Electronics reliability cannot be predicted at a system level.  The vast majority of failures of electronics hardware are due to design margin errors, component misapplication, errors in manufacturing processes, and customer misuse or abuse.  It is very easy to confirm this is the case if you have access to the root causes of real field failures in real electronics products.


“All models are wrong, but some are useful” – George E. P. Box


Mathematical models to predict future events are, in many cases, valid and useful.  Computer models and measurement systems that are used in meteorology to forecast weather conditions are improving, yet the ability to predict the weather more than a few days ahead has been elusive.  There would be huge benefits economically and in human lives if we could project longer than a few hours in advance when and where extreme weather events such as tornados or hurricanes will occur.  With more inputs of contributing atmospheric conditions and better computer algorithms, weather forecasting is getting better.  Yet extreme weather prediction is still limited to a few hours for tornados, or a few days for hurricanes, before we know where they will hit.


Of course reliability prediction could be performed more accurately if we knew all of the many inherent potential failure mechanisms in an electronics system and their fatigue responses to the life cycle environmental profile (LCEP) stresses.  Even if we could know all the inherent failure mechanisms in components, we would also need to include some information on the distributions over time of the manufacturing variations and excursions that would modify the strength or rate of degradation of those mechanisms during manufacturing.


In many mechanical and electromechanical systems we do have physical wear mechanisms that can be mathematically modeled and from those models we can mathematically project the “life” of the mechanism.  We know that in electric motors, wear of contact brushes, evaporation of lubricants, and wear of ball bearings eventually use up life, leading to failures due to wear out.  Mechanical switches and hinges have a limited fatigue life.  Through those models we can extend the life in mechanical systems by increasing the reservoir of material or reducing the driving stress conditions.  In electronics there are a few devices, such as batteries, that do have short wear out modes relative to technological obsolescence and modeling life is very useful and necessary.


It is much more difficult to determine the underlying life-limiting mechanisms of solid state electronics components such as ICs in a complex system, let alone in a PWB.  Not only must the intrinsic physics causing component degradation and failure be known, but the PWB and solder fatigue mechanisms must also be known for each package.  BGA solder joints and PTH (Plated Through Hole) vias do not fatigue at the same rate under the same stress inputs.  Of course the stresses for all the mechanisms on the PWB and components can vary widely depending on their locations on the PWB.


LCEP for most electronics systems is a very rough guesstimate


Reliability prediction must also determine the Life Cycle Environmental Profile (LCEP) and also the LCEP distributions for the future field population.  We must know to some precision the actual LCEP stress distributions along with the inherent product “strength entitlement” distributions to know where the strength distribution overlaps the stress distribution resulting in product failures.  Please see my blog post “Reliability Paradigm Shift From Time to Stress Metrics” for more explanation of the Stress/Strength relationship in reliability.
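The stress/strength overlap described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: stress and strength are treated as independent normal distributions, and every number in it is hypothetical rather than taken from any real product. The function name interference_probability is likewise invented for this example.

```python
from math import erf, sqrt


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def interference_probability(stress_mean: float, stress_sd: float,
                             strength_mean: float, strength_sd: float) -> float:
    """P(stress > strength), assuming independent normal distributions.

    The difference (strength - stress) is then also normal, so the overlap
    reduces to a single normal CDF evaluation.
    """
    margin_mean = strength_mean - stress_mean
    margin_sd = sqrt(stress_sd ** 2 + strength_sd ** 2)
    return normal_cdf(-margin_mean / margin_sd)


# Hypothetical values (degrees C): same mean margin, different spreads.
narrow = interference_probability(40.0, 5.0, 90.0, 5.0)
wide = interference_probability(40.0, 15.0, 90.0, 15.0)

print(f"narrow distributions: P(failure) = {narrow:.2e}")
print(f"wide distributions:   P(failure) = {wide:.2e}")
```

With the same mean margin, the wider distributions overlap far more. The sketch also shows the practical problem: the answer is only as good as the assumed LCEP stress distribution and strength distribution, which is exactly what is rarely known.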


So many electronics systems have a wide variety of LCEPs, with new applications of systems resulting in new LCEPs that were never considered.  Take the example of the VGA projectors that we see in many conference and meeting rooms.  Some projectors are permanently mounted on the ceiling and many others are mobile.  The ceiling mounted units’ fatigue stress most likely comes from thermal cycling during power cycling, and the mobile units have that stress plus the shock and vibration from transporting.  The mobile units’ populations have a much wider distribution of LCEPs.  I doubt the manufacturers of these products know the distribution of the LCEP for these two distinct end use environments.  End users will expect the same reliability in both, regardless of the very different LCEPs.  Of course some of the mobile units will break instantaneously from an accidental drop.  If and when a unit breaks from an accidental drop, will the user blame their own mishandling, or blame the manufacturer for making a “fragile” projector and never buy from that manufacturer again?  Certainly we do not expect our cell phones to fail after a waist high drop, but at what drop height would we blame ourselves for the failure?


When it comes to electronics systems reliability modeling and prediction, we really cannot know all the mechanisms or the distributions of the LCEP. Even if all the degradation models were known and all the combinations of stress distributions and effects in the assembly were known, the challenge of reliability prediction is compounded by variations over time in manufacturing.


Focus on real weakness discovery – less on guessing a very uncertain future


We have even less time to model partial or whole systems and the resulting fatigue damage and degradation as the design and manufacturing cycle times for new electronics continue to decrease.  Even if we are able to model the degradation and fatigue damage of every potential failure mechanism in a PWB, the models must be based on units from capable manufacturing, not on variations, and we know there will be variations.  Additionally, modeling can only establish a failure rate based on inherent wear out mechanisms and known LCEPs, even though there may be new applications and different future LCEPs that were not known when the product was designed.


“The best way to predict the future is to create it.” – Peter Drucker


Just as with the prediction of our own future, many would like to know what the future holds for the electronics we make and use.  Yet for complex electronic systems there has been no evidence that we can model and predict future failure rates, regardless of the fact that many still want to believe it can be done and want it to be true.


Empirical stress limit discovery is a vastly more efficient tool for building a reliable electronic system.  Using stepped-stress-to-limits methods (such as HALT) and focusing on discovery of potential weaknesses that could be a reliability risk, we can very quickly find the strength limits of complex electronic systems under stress conditions in order to establish a benchmark of strength based on current standard electronics technologies.  By knowing empirical stress limits, we can develop safe and efficient ongoing accelerated reliability testing to precipitate and detect manufacturing errors or excursions that result in latent defects.


Unfortunately, there is still the wish that accurate prediction is possible and many are still feeding the wish that reliability of electronics hardware can be predicted based on past invalid documents.  Without the ability to share real field reliability data that belief is likely to continue.

Why Success With HALT Begins Long Before Doing HALT

Posted November 23, 2012

By Kirk Gray

Accelerated Reliability Solutions, L.L.C.


HALT is a BIG change


Implementing a new reliability development paradigm in a company which is using traditional, standards-based testing can be a perilous journey.  This is especially true when introducing HALT (Highly Accelerated Life Test), in which strength against stress, and not quantifying electronics lifetimes, is the new metric.  Because of this significant change in test orientation, a critical factor for success is educating the company’s top technical and financial stakeholders on why and how HALT is so effective for rapid reliability development.  Without the upper levels of management understanding, in parallel, the big picture of the HALT paradigm shift, the work of educating each skeptical key player in a serial fashion will cost much more time and put the success of HALT at significant risk.

To illustrate, let’s imagine that you are your electronics systems company’s reliability engineer or you have been involved in reliability qualification or validation testing of its products for several years.  You have experienced field failures that resulted from design margin issues that were overlooked during the development process, as well as some from mistakes in manufacturing.  Reliability development in your company consists of running tests that simulate the LCEP (Life Cycle Environmental Profile) estimates (guesstimates?), or design engineers apply limited stress to their predefined “that’s good enough” level or guesses on what may be the worst case stress environmental conditions for the product.


You have just learned the generic methodology and some of the benefits of HALT (Highly Accelerated Life Test) from reading a book on the subject or attending a class or webinar.  You now want to try HALT for your company’s new product, and find an outside test lab that has a HALT chamber so that you can go do a HALT.  It would seem to be the most straightforward path to getting started with HALT.  Or is it?


Teaching HALT in Series takes longer


Let’s consider the following scenario.

You find an outside test lab that can perform HALT a few miles away from your company.  You have five samples of the new product, support equipment to operate and monitor the UUTs (Units Under Test), and possibly a technician (or if lucky you have a design engineer for the product you are going to test) to go with you.  The environmental design specifications for the product are 0°C to 35°C.


The HALT lab helps you set up the first sample of the product and proceeds to find the lower temperature operational limit and the upper temperature operational limit.  Since the product in this case is a digital system, there are no thermal destruct levels found as the system will not operate above or below the temperature operational limits.  In the five samples used for HALT, you find upper temperature operational limits at 70, 72, 90, 117, and 110°C.  The lower temperature operational limits for the five samples are found to be -55, -45, -50, -58, and -47°C.
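The limits in this scenario can be tabulated in a few lines of Python. This is only a bookkeeping sketch using the numbers given above (the 0°C to 35°C specification and the five samples' limits); the variable names are invented for the example.

```python
# Design specification and HALT results from the scenario above (degrees C).
SPEC_LOW_C, SPEC_HIGH_C = 0, 35
upper_limits_c = [70, 72, 90, 117, 110]
lower_limits_c = [-55, -45, -50, -58, -47]

# Margin of each sample's operational limit beyond the specification.
upper_margins = [limit - SPEC_HIGH_C for limit in upper_limits_c]
lower_margins = [SPEC_LOW_C - limit for limit in lower_limits_c]

# Sample-to-sample spread of the limits themselves.
upper_spread = max(upper_limits_c) - min(upper_limits_c)
lower_spread = max(lower_limits_c) - min(lower_limits_c)

print("upper margins:", upper_margins, "worst:", min(upper_margins))
print("lower margins:", lower_margins, "worst:", min(lower_margins))
print("upper limit spread:", upper_spread)
print("lower limit spread:", lower_spread)
```

Every sample has margin beyond the specification, but the upper limits are spread far more widely than the lower limits. In HALT the sample-to-sample spread is worth as much attention as the worst-case margin.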


The final stress used in HALT is vibration and two of the samples fail when the vibration level reaches the maximum vibration level of the HALT chamber.  The failure mechanism on both is a broken lead of a capacitor mounted high off the PWB.  You and the design engineer repair the capacitor leg and glue it down to the PWB.  To verify the HALT improvement, you apply HALT vibration to the same maximum level and the glued capacitors do not fail.


Figure: PWB in a HALT chamber. Education accelerates adoption long before this HALT.

Improving margins beyond spec – the bigger challenge

After you complete HALT at the outside lab and come back to your company, you wonder why there is a difference of more than 40°C in upper temperature operational limits between the five samples.  You realize that wide variation in limits may be an indicator of inconsistent manufacturing processes for some components, or of significant sensitivity to the inherent parametric variations of a component, which, if the variation increases, could significantly impact field reliability.  You hope that you can have the manager of design engineering support an investigation into the cause of the wide upper thermal limit variations between the samples.  When you meet with him he tells you that his department is very busy with the next design and his limited resources will not be available because:


  • “The product meets the design specifications, and even the worst sample has 35°C margin above design specifications.”


  • “The product will never see 70°C in its worst case use, therefore if it does fail it’s the customer’s fault.”


  • “We do not have time to re-design the product to meet your HALT stress requirements.”



How do you address these obstacles from the design engineering manager for resources needed to identify the weaknesses and potentially improve the product robustness and reliability?

Let’s say you spend an hour with the design manager, overcome his objections, and get help from the design engineers.  With the help of a couple of design engineers you determine that a ten-watt FET is the most likely cause of the upper temperature operational limit.  Fortunately you find a twenty-watt FET in the same size package and voltage and use it to replace the ten-watt FET.  You go back to the HALT lab two weeks after your first HALT and find that all three new samples have an upper operational limit above 115°C, and again no thermal destruct limit is found.



While you were busy in the Lab

Later in the week you find that while you have been doing the HALT at the local environmental test lab, a skeptical design engineer (whom you have yet to speak with) has heard that you want to “overdesign the product for stresses it will never see” and has spoken to others in the design and procurement departments.  Now you start hearing that engineers you have not spoken with are commenting on your desire to add product costs to overdesign a product for an irrelevant failure mode.  You find that you are teaching the HALT paradigm shift to each skeptic as you find them, convincing some, but the ones you have not spoken with are still spreading the same “fear of over-design.”

When you go to the purchasing department you find out the higher wattage FET will add 50 cents of cost to a product that retails at $700.00.  Now the Vice President of Engineering hears about the additional product costs if the FETs are changed.  Since it will reduce the profit margin on an already competitively priced product, the VP then asks very similar questions to those the design engineering manager asked previously.  The design manager attempts to explain the reasoning in his half-hour meeting with the VP, but doesn’t succeed.


Product Launch Time – too late, but now you may get the field failure data



Ultimately, increasing the wattage of the FET that limits thermal operation is not implemented, and the product is released to market.  In a year or two, you may be able to accumulate the warranty return data and find that FET failures are the second highest cause of warranty returns, after NFF (No Fault Found).  You bring this information to the new design engineering manager, who just joined the company or moved from another position during a reorganization only five months ago.

You are back to square one, teaching HALT to the new design engineering manager, and then to the many design engineers, reliability engineers, and department managers who have joined since you first began your path to introducing HALT at your company two years ago.  Do you have the energy to show and explain HALT to the new key players?  Are you still a reliability engineer at the same company?


A “HALT Battlefield” Experienced Consultant can accelerate understanding


It is critical to win the “hearts and minds” of top management when introducing any new engineering concept.  Introducing HALT methods, a relatively new yet still very misunderstood paradigm in electronics reliability development, can be a challenging path.  An experienced HALT consultant can provide a simultaneous education of the top executives and key management personnel to help them understand the fundamentals of the HALT paradigm shift.  They can provide data and examples showing how and why HALT is so effective, and pitfalls to avoid.  A HALT consultant can review the causes of your field failures and identify those that would likely be found during HALT so that you can show the future potential ROI for different test strategies.  They can also provide multiple paths to HALT adoption, such as demonstration tests on a known weakness, or showing how HALT can reduce the NFF warranty returns problem.

A HALT consultant can provide a short overview presentation to the company executives, and their staff can ask questions about HALT with all hearing the same answers and explanations.  You will still have questions to answer about the new methods going forward, but with an experienced HALT consultant, you and your company will have someone who has heard the questions many times before and likely has the data to support the answers or knows where to find it.


Of course there are many other benefits of having an experienced leader in HALT guiding your way, such as planning, test monitoring, procedure writing, and most importantly the continued teaching and coaching of new engineers and skeptics while you are accumulating real field data showing the benefits of HALT in your products.  But, without a common understanding at the highest levels of the company, you may “win several battles but lose the war” of introducing the most efficient approach to making a reliable and robust electronic system.

Many thanks to Chet Haibel for improving this blog post.

Why HALT is a methodology, Not equipment


Posted October 20, 2012

By Kirk Gray

Accelerated Reliability Solutions, L.L.C.



It is easy to understand why the term HALT (Highly Accelerated Life Test) is so tightly coupled to the equipment called “HALT chambers”.  Many do not think they can do HALT processes without a “HALT chamber”. Many know that Dr. Gregg Hobbs, who coined the term HALT and also HASS (Highly Accelerated Stress Screens), spent much of his life promoting the techniques and was also the founder of two “HALT/HASS” environmental chamber companies.  I was very fortunate in working with and learning from Dr. Hobbs beginning 23 years ago at Storage Technology, Inc., before he founded his first chamber company, Qualmark, Inc. Later he asked me to join him at Qualmark in promoting the methods and equipment for HALT and HASS, and we worked together at Qualmark for a year.


Back in the 1980’s Gregg taught that the best and most basic approach for HALT was to use a single stress in a stepwise fashion up to a point of empirical operational discontinuity, that is, an operational limit or destruct limit. Some stresses were more useful and universal than others, such as rapid thermal change and multi-axis vibration (pneumatic hammer driven), which required a combined environmental chamber that became referred to as a “HALT chamber”.
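The single-stress, stepwise approach can be sketched in Python as a simple loop. This is a minimal illustration, not a test procedure: the function name, the step size, the equipment maximum, and the simulated unit (which stands in for a real functional test of the unit under stress) are all hypothetical.

```python
from typing import Callable, Optional


def find_operational_limit(unit_operates: Callable[[float], bool],
                           start: float, step: float,
                           equipment_max: float) -> Optional[float]:
    """Step one stress upward until the unit stops operating.

    Returns the first stress level at which the unit no longer operates,
    or None if no limit is found within the equipment's capability.
    """
    level = start
    while level <= equipment_max:
        if not unit_operates(level):
            return level
        level += step
    return None


# Hypothetical stand-in for a functional test: this unit stops at 72 C or above.
def simulated_unit(temperature_c: float) -> bool:
    return temperature_c < 72.0


limit = find_operational_limit(simulated_unit, start=35.0, step=5.0,
                               equipment_max=150.0)
print("operational limit found at step:", limit)
```

The same loop applies to any stress that can be stepped, such as voltage, clock frequency, or drop height. The step size sets the resolution: here the limit is reported at the 75°C step, while the true limit lies somewhere between the last passing step and that one.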


Even though Gregg made HALT chambers, the HALT processes he taught were a fundamental paradigm shift in that they are an empirical test process rather than the theoretical, probabilistic failure prediction methods used in traditional reliability engineering.  Increasing voltage stress, clock frequency stress, software application stresses, drop height shock stress, and increasing pressure in hydraulic systems are all stresses that could be used for HALT processes, and no HALT chamber is required.


To illustrate, I recall that almost twenty years ago at Qualmark I was in Gregg’s office and he was very excited to tell me about a call he had just had with a test engineer who had a big success with a vibration HALT procedure. The reason I remember this so well is that as I was leaving Gregg’s office he told me, “oh by the way, he used a classical [Electrodynamic] shaker!” Many do not realize that vibration HALT can be performed on a classical shaker by taking the product to an empirically determined vibration stress limit, as Gregg knew it could. Thermal HALT methods can be done in a traditional mechanical refrigeration chamber, although maybe not as efficiently.


HALT is not a chamber, or a type of stress or combination of stresses; it is a basic method of using an applied stress for empirical discovery of design or manufacturing weaknesses.   A much better term to describe HALT would be “Highly Accelerated Limit Test”.


“HALT/HASS” chambers are typically defined by the fact that they are capable of producing multi-axis vibration from pneumatic hammers and rapid thermal stresses from liquid nitrogen supplied cooling and large banks of electric resistive coils for heating. In these systems the multi-axis vibration intensity can be controlled but the frequency spectrum is uncontrolled. This is like the real world, where the vibration in most products’ life cycle environments is uncontrolled in intensity, axis, and frequency. Temperature ramp rates can force rapid product thermal changes of 60°C per minute or greater depending on the product mass and chamber configuration. Even though they are typically referred to as "HALT/HASS chambers" and are excellent for that purpose, their capabilities really shine in HASS processes, when combined environments should be used to induce the most stress and fatigue damage in the shortest period of time. The quicker a latent defect is detected, the shorter and less expensive a HASS process is for those companies doing HASS. If HALT methods are used to make a robust design, as they should be, then most of the failure mechanisms found in HASS occur in the “infant mortality” region of the life cycle bathtub curve, and the vast majority of these early life failures arise from assignable errors or from wide variations in the manufacturing processes. HASS is an ongoing insurance process. It can be very useful, yet it is an expensive ongoing test process if it is not coupled with root cause analysis (RCA) to eliminate the causes of manufacturing defects.


To be sure, “HALT” chambers [multi-axis pneumatic, rapid thermal] are the fastest, most efficient chambers capable of applying some of the most useful universal environmental stresses to find latent design errors or manufacturing errors. Yet at the core of the new methodology and philosophy of HALT is a conceptually simple empirical process. The HALT methodology is performed by increasing stress in a controlled application to an operational and sometimes destruct limit. Many useful stresses do not require a chamber for finding operational and destruct limits, or for improving operating margins and comparing limits between samples and products. That is what HALT is.

Why Parametric Variation Can Lead to Failures and HALT Can Help

Posted October 4, 2012

By Kirk Gray

Accelerated Reliability Solutions, L.L.C.


Many reliability engineers have discovered that HALT will quickly find the weaknesses and reliability risks in electronic and electromechanical systems through the capability of thermal cycling and vibration to create rapid mechanical fatigue in electronic assemblies. Assemblies that have latent defects such as cold or cracked solder joints, loose connectors or mechanical fasteners, or component package defects can be brought to a detectable, or patent, condition, which we can observe and use to potentially improve the robustness of an electronics system. Thermal cycling creates expansion and contraction, stressing interfaces between materials with mismatched thermal coefficients of expansion (TCE). Applying vibration to an assembly, especially the pneumatic repetitive shock of HALT chambers, creates very rapid mechanical fatigue. When Gregg Hobbs, Ph.D., PE created HALT and HASS methods back in the 1980’s, digital systems were not as prevalent and bus speeds were much slower than in today’s electronics. As signal speeds continue to increase and circuit features get smaller in electronics, HALT has a potentially significant additional benefit for signal integrity (SI) and operational reliability during new product development.


Today’s electronics require bus timing with ten times better resolution than the time it takes light to bounce off your nose and hit your eye, which is about 85 picoseconds. As data bus speeds increase, effects in data transmission that were second and third order effects are now becoming dominant in SI issues. These new variables may be difficult if not impossible to model accurately. The continued decrease in metallization and higher bus frequencies will result in increased sensitivity to fabrication variations. SI issues are likely to become more dominant in the reliability of hardware as a result of the continued decrease of metallization and increase in bus speeds. Yet the effect of these developments on operational reliability may also be more difficult to find and reproduce before thousands or millions of units are sent to the field.


SI failures many times result in marginal operational reliability or “soft failures”, where a system can be reset and then operates normally. Depending on the frequency of these operational failure events, the user may or may not tolerate their occurrence. When too frequent, intermittent operational failures may result in the system being returned to the manufacturer. The returned system may then be broken down and all subassemblies subjected to failure analysis. When divided up, the subsystems tested will likely be declared “No Fault Found” (NFF), as the marginality may only come from the stack up of parametric variations, or from unique environmental conditions of the original system in the end-use environment. To modify an old adage, “If you cannot find what broke, you cannot fix it”, and the cause of the marginality and returns will continue. The result is a churn of “good” parts being returned and “good” parts being sent out to replace them. The returned parts may be sent to a repair depot to be used for repair or replacement. Those returned parts may or may not work with a different system depending on that system’s stack up, but it is likely the manufacturer will never come to know one of the potential real contributors to the high NFF rate. Of course there are many other causes of NFF returns not necessarily related to hardware issues. If the issues come from SI and timing marginality, thermal stress to operational limits can be a very useful tool to discover these issues before mass production.


We know that in mass manufacturing of anything there will be variation in any parameter that is measured. We know that some dimensional variations will occur during mass manufacturing of PWBs, although hopefully the variations are small. Dimensional variations in a PWB can affect impedance, crosstalk, noise, and EMI issues in the system. Dimensional expansion and contraction of the PWBA is of course what induces the thermo-mechanical fatigue damage during thermal cycling that has been a primary focus of HALT and HASS methodology, but the dimensional variations also affect SI quality. We know from the SPC teachings of Dr. W. Edwards Deming that reduction of manufacturing variation is the path to making a defect free product, and “six sigma” production capability is the goal. When we design and build a complex high speed digital electronics system we cannot necessarily know how the stack up of all the real future variations in component manufacturing, circuit board fabrication, solder quality, and second sources of these will impact operational reliability. Yet we do know for sure that there will be parametric variations created at all the levels of assembly, and the effect on operational reliability may only be discovered after large numbers are produced and sold.

The challenge of finding marginal operation during early product development is illustrated graphically in Figure 1. Early samples of a new electronics product are typically expensive and scarce, and all development teams want the limited samples.  The graphic shown on the left side of Figure 1 represents the parametric timing distribution found with a limited number of units. With a small number of units, the parametric variation that could be near the upper and lower limits would likely remain undiscovered before the product is released from development to be manufactured in mass.

Figure 1. Thermal stress skews timings to discover marginal conditions

The graph on the right side illustrates the potential of the larger variation found during mass manufacturing and the higher probability that the stack up of parametric variations could fall near operational limits resulting in soft operational failures.
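A rough Python sketch makes the sample size problem concrete. It assumes, purely for illustration, that a timing parameter is normally distributed and that marginal operation begins three standard deviations from the mean; neither assumption comes from measured data, and the function names are invented for this example.

```python
from math import erf, sqrt


def tail_fraction(k_sigma: float) -> float:
    """Fraction of a normal population beyond +/- k_sigma from the mean."""
    return 1.0 - erf(k_sigma / sqrt(2.0))


def chance_of_at_least_one(n_units: int, k_sigma: float) -> float:
    """Probability that at least one of n_units falls beyond +/- k_sigma."""
    p = tail_fraction(k_sigma)
    return 1.0 - (1.0 - p) ** n_units


K_SIGMA = 3.0  # hypothetical distance of the marginal region from the mean

for n_units in (5, 1_000, 1_000_000):
    expected = n_units * tail_fraction(K_SIGMA)
    chance = chance_of_at_least_one(n_units, K_SIGMA)
    print(f"{n_units:>9} units: expected marginal units = {expected:8.1f}, "
          f"chance of at least one = {chance:.4f}")
```

Under these assumptions, five development samples will almost never include a marginal unit, while a production run of a million units will contain thousands of them.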


The benefit of thermal stress in inducing mechanical fatigue to expose mechanical and material weaknesses is well established, but there is another aspect of thermal stimulation that may become more important in the future for assuring reliable operation of high speed digital systems. A little known fact to those who have not performed real thermal HALT on digital electronics is that it almost always ends with finding an operational limit only. It is very rare to find a thermal destruct level in digital systems such as IT hardware. Hot and cold thermal stress causes impedance shifts and signal propagation shifts in conductors and semiconductors, resulting in “skewing” of signals throughout the system. This is probably why thermal HALT on most digital systems results in finding an operational limit and not a destruct limit. At the thermal operational limits the SI fails and a lock up or shut down occurs, but the system can easily be reset when the stress is removed.


The graphic in figure 2 represents how, using a small number of samples stressed to empirical thermal limits, we can skew the system’s signal propagation timings. Higher temperatures slow signals and cold increases signal speeds. By thermally stressing a small number of samples we can observe the hot and cold thermal operating limits, and this can be repeated many times without causing catastrophic damage. Marginal operational reliability may be realized later from the worst case stack-up of parametric variations in a small percentage of products when thousands or millions are produced. As manufacturing volumes ramp up, a wider distribution of parametric variations may then extend near or over the stable operational limit, as previously shown in the right graphic in figure 1. Of course, stimulating timing variations by applying thermal stress to a system moves the parametric skew of all the components in the same direction, either slower or faster. In the larger mass manufacturing population, the lot-to-lot and second source variation of parametrics is a mix of high and low speed distributions. The rapid thermal cycling available in HALT chambers helps discover more of this mixing of timing variations by differentially skewing timings across a PWBA. The differential skew is created by very fast air temperature transitions producing thermal gradients across the PWBA. Low mass components have higher thermal transition rates than larger mass or high wattage components, resulting in a mix of temperatures across a PWBA. An even more detailed understanding of the risk from variations in timing distributions could be gained by individually heating and cooling active components. Individual heating and cooling of components is also a good way to isolate a limiting component found during a thermal HALT.


Figure 2. Thermal stress skews signal timings

Examples of the benefits of HALT techniques in finding software issues have been documented by Allied Telesis. Donovan Johnson and Ken Franks of Allied Telesis published a white paper several years ago, “Software Fault Isolation Using HALT and HASS,” on how the use of HALT has benefited their discovery of reliability issues due to software. In the paper they give examples of significantly increasing thermal operational margins and limits from software changes alone. It is an excellent documentation of another benefit of HALT in making a reliable system.


The benefits of HALT in finding mechanical issues in electronics assemblies have been well established over the last several decades. As the speed and density of electronics continue to increase, operational reliability may become more sensitive to manufacturing variations that result in parametric variations, leading to marginal SI and operational reliability. Many companies have not realized that thermal HALT has so much potential as an effective tool for rapid discovery of operational reliability issues, not just catastrophic hardware failures. Along with the traditional established benefits of HALT for hardware weaknesses, there is growing evidence that thermal HALT improves operational reliability by revealing how parametric variations may affect operation when thousands or millions of units are produced.


Why Change the Reliability Paradigm Shift: From Time to Stress Metrics
Posted on September 6, 2012

Kirk Gray, Accelerated Reliability Solutions L.L.C.

Traditional electronics reliability engineering began during the infancy of solid state electronic hardware. The first comprehensive guide to Failure Prediction Methodology (FPM) premiered in 1956 with the publication of the RCA release TR-1100, "Reliability Stress Analysis for Electronic Equipment," which presented models for computing rates of component failures. The "RADC Reliability Notebook" emerged later in 1959, followed by the publication of a military handbook that addressed reliability prediction, the Military Handbook for Reliability Prediction of Electronic Equipment (MIL HDBK 217). All of these publications and their subsequent revisions developed the FPM based on component failures in time, used to derive a system MTBF as a reference metric for estimating and comparing the reliability of electronics systems designs. At the time these documents were published it was fairly evident that the reliability of an electronics system was dominated by the relatively short life entitlement of a key electronics component, the vacuum tube.

In the 21st century, active components have significant life entitlements if they have been correctly manufactured and applied in circuit. Failures of electronics in the first five or so years are almost always a result of assignable causes somewhere between the design phase and the mass manufacturing process. It is easy to verify this from a review of root causes of verified failures of systems returned from the field. Almost always you will find the cause is an overlooked design margin, an error in system assembly or component manufacture, or accidental customer misuse or abuse. These causes are random in occurrence and therefore do not have a consistent failure mechanism. They are not in general capable of being modeled or predicted.

There is little or no evidence of electronics FPM correlating to actual electronics failure rates over the many decades it has been applied. Despite the lack of supporting evidence, FPM and MTBF are still used and referenced by a large number of electronics systems companies. FPM has shown little benefit in producing a reliable product, since there has been no correlation to the actual causes of field failures or to rates of failure. It may actually result in higher product costs, as it may lead to invalid solutions based on invalid assumptions (Arrhenius anyone?) regarding the cause of electronics field failures.

It’s time for a new frame of reference, a new paradigm of measurement, for confirming and comparing the capability of electronics systems to meet their reliability requirements. The new orientation should be based on the stress-strength interference perspective, the physics of failure, and the material science of electronics hardware.

The new metric and its relationship to reliability are illustrated in the stress-strength graph shown in figure 1. This graphic shows the relationship between a system’s strength and the stress or load it is subjected to. As long as the load is less than the strength, no failures occur.

Figure 1. The Stress-Strength Diagram

In the stress-strength graph in figure 2, failures occur wherever the load on a system exceeds the system’s strength, which is where the two curves overlap. This relationship is true for bridges and buildings as well as electronics systems.

Figure 2. Stress-Strength Intersection
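The overlap in figure 2 can be estimated numerically. The sketch below uses the textbook stress-strength interference result for independent, normally distributed load and strength; the normal assumption and every number in it are hypothetical and serve only to show how the overlap responds to margin and to variation.

```python
from math import sqrt
from statistics import NormalDist


def interference_probability(mean_strength: float, sd_strength: float,
                             mean_load: float, sd_load: float) -> float:
    """P(load > strength) for independent normal load and strength.

    The margin (strength - load) is itself normal, so the failure
    probability is the chance that the margin falls below zero.
    """
    margin_mean = mean_strength - mean_load
    margin_sd = sqrt(sd_strength ** 2 + sd_load ** 2)
    return NormalDist().cdf(-margin_mean / margin_sd)


if __name__ == "__main__":
    # Hypothetical thermal limits and end-use temperatures in degrees C.
    baseline = interference_probability(110.0, 10.0, 60.0, 10.0)
    stronger = interference_probability(130.0, 10.0, 60.0, 10.0)
    inconsistent = interference_probability(110.0, 20.0, 60.0, 10.0)
    print(f"baseline design:           {baseline:.2e}")
    print(f"weak link strengthened:    {stronger:.2e}")
    print(f"same mean, wider variation: {inconsistent:.2e}")
```

Raising the mean strength shrinks the overlap, and widening the spread of strength at the same mean enlarges it, which is why both margin and sample-to-sample consistency matter.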

This relationship between stress, strength, and failures correlates with our common sense understanding that the greater the inherent strength a system has relative to environmental stress, the more reliable it will be. Of course the balance is that we must consider the competitive market and build the unit at the lowest cost. What is probably not known to most electronics companies is how strong standard electronic materials and systems can actually be in relation to thermal stress, since so few companies actually test to empirical thermal operational limits with thermal protection defeated. Many complex electronics systems can operate from -60 °C or lower to +130 °C or greater using standard components. Typically it is only one or two components that keep a system from reaching stress levels that are at the fundamental limit of technology (FLT). The FLT is the point at which the design capability cannot be increased with standard materials. Sometimes designs have significant thermal operating margins without modifications, which can be used to produce shorter and more effective combined stress screens such as HASS (Highly Accelerated Stress Screens) to protect against manufacturing excursions that result in latent defects.

In most applications of electronics systems, technological obsolescence comes well before components or systems wear out. For most electronics systems we will never empirically confirm their total “life entitlement,” since few systems are likely to be operational long enough for “wear out” failures to occur. Again, it is important to emphasize that we are referring more to the life of solid state electronics and less to mechanical systems, where fatigue and material consumption result in wear out failures.

Reliability test and field data are rarely published but there is one published study with data showing a correlation between empirical stress operational margin beyond specifications and field returns. Back in 2002, Ed Kyser, Ph.D., and Nahum Meadowsong from Cisco Systems gave a presentation titled “Economic Justification of Halt Tests:  The relationship between operating margin, test costs, and the cost of field returns” at the IEEE/CPMT 2002 Workshop on Accelerated Stress Testing (now the IEEE/CPMT Workshop on Accelerated Stress Testing and Reliability, ASTR). In their presentation they showed the graph of data on differences in thermal stress operational margin versus the normalized warranty return rate on different line router circuit boards as is shown in Figure 3.

Figure 3. Normalized RMA Return vs. Thermal Margin

The graph shows the correlation between the thermal margin and the RMA (Return Material Authorization) rate, i.e. the warranty return rate. A best fitting curve through this scatter diagram shows a probabilistic relationship between thermal operational margin and warranty returns. It indicates that the lower the operational margin, the higher the probability of a return. Cisco also compared the relationship between the number of parts (on a larger range of products) and the return rate. The graph of that data is shown in figure 4. The relationship between thermal margin and return rate is ten times stronger than the relationship between board parts count and return rate.

Figure 4. Parts count versus Warranty Return Rate

This makes sense from a stress-strength relationship. No matter how long a chain is, it is only as strong as the weakest link in that chain. No matter how many parts are on the PWBA, the design’s tolerance to variation in manufacturing and end-use stress is dependent on the least tolerant part.

For operational reliability or “soft failures” in digital electronics, the relationship between thermal limits and field operational reliability is less obvious since, again, most electronics companies do not discover, and therefore do not compare, empirical thermal limits with rates of warranty returns. In mass production of high speed digital electronics, the variations in components and PWBA manufacturing can lead to variations in impedance and signal propagation (strength) that overlap the worst case stresses in the end use (load), leading to marginal operational reliability. It is very challenging to determine a root cause for operational reliability that is marginal or intermittent, as the subsystems will likely function correctly on a test bench or in another system and be considered a CND (Cannot Duplicate) return. Many times the marginal operational failures observed in the field can be reproduced when the system is cooled or heated to near the operational limit. Heating and cooling the system skews the impedance and propagation of signals, essentially simulating variations in electrical parametrics from mass manufacturing. If companies do not apply thermal stress to empirical limits, they will never discover and be able to utilize this benefit to find difficult to reproduce signal integrity issues.

Faster and lower cost reliability testing is becoming more critical in today’s fast pace of electronics development. Most conventional reliability testing that is done to some pre-established stress above spec or “worst case” field stress takes many weeks if not months, and results in minimal reliability data. Finding an electronics system’s strength by HALT methods is relatively quick, typically taking only a week or less, and requires fewer samples. Even if no “weak link” is discovered during HALT evaluations, HALT always provides very useful variable data on empirical stress limits between samples and design predecessors. Empirically discovered stress limits in an electronics system design are very relevant to potential field reliability, especially thermal stress limits and operational reliability in digital systems. Not only can stress limit data be used for making a business case for the costs of increasing thermal or mechanical margins, but it can also be used for comparing the consistency of strength between samples of the same product. Large variations of strength limits between samples of a new system can be an indicator of some underlying inconsistent manufacturing process. If the variations are large enough, some percentage of units will fail operationally because the end-use stress conditions exceed the strength of the weakest units in the population.

As with any major paradigm shift, the move from the dimension of time to the dimension of stress as a metric for reliability estimation will leave many details and challenges to be worked out on how best to apply it and use the data derived from it. Yet from a physics and engineering standpoint, a new reference of stress levels as a metric has a much stronger potential for relevance and correlation to field reliability than the previous FPM, with its broad assumptions on the causes of field operational and hardware unreliability in current and future electronics systems. If we begin today using stress limits and combinations of stress limits as a new reference for reliability assessments, we will discover new correlations and benefits in developing better test regimens and finding better reliability performance discriminators, resulting in improved real field reliability at the lowest cost.

How can We Advance Electronics Reliability Engineering?


By Kirk Gray

Accelerated Reliability Solutions, L.L.C.



In all aspects of engineering we only make improvements and innovations in technology by building on previous knowledge. Yet in the field of reliability engineering (and in particular electronics assemblies and systems), sharing of knowledge about field failures of electronics hardware and their true root causes is extremely limited. Without the ability to share data and teach what we know about the real causes of “un-reliability” in the field, it is more easily understood why the belief in the ability to model and predict the future of electronics life and MTBF continues to dominate the field of electronics reliability engineering. Please note that I refer to solid state electronics and assemblies. I make a distinction between Failure Prediction Methodologies (FPM) for electronic assemblies (PWBAs and systems), which typically have more life than needed (or really known) for most applications, and those for mechanical systems (i.e. motors, gears, switches), which can have a more limited life assignable to friction and wear out and have some potential to be mathematically modeled.



I have been teaching HALT and HASS methods for over 20 years, and a common question I have heard many times is “If HALT is so great, why aren’t there more examples published on its benefits?” There are several reasons why details of real HALT case histories, as well as any other actual empirical electronics reliability data, are rarely published.


Three key reasons are:


  • Competitive advantages of not sharing the most effective reliability practices
  • Potential legal liability for disclosing real causes of field failures
  • Engineers have little time to write and publish


When a company discovers a new product development process that leads to significantly faster times and lower costs to release a mature product to market, they are not likely to tell their competitors. Doing so would cause them to lose the competitive advantages of those new processes. Does your company publish its best methods for product reliability development? So why expect it from any other company?


Legal liability is a huge risk for manufacturers. Failures of electronics systems might lead to significant loss of property or, in the worst case, human lives. Publishing the cause of electronics failures might provide evidence of liability leading to costly judgments for the product designer or manufacturer. For this reason, most companies will never voluntarily allow failure data to become public. Reliability engineers who want to help the industry by publishing real failure data typically face many challenges in getting legal departments to give permission to publish. Even if they are able to publish something on actual reliability, the paper is so heavily redacted and “sanitized” for public disclosure that the most significant data may not be published.


It takes time to write and publish any technical paper. Given today’s economic challenges, many electronics companies have trimmed engineering departments down to the bone. Engineers are challenged to find enough time in the day to complete projects and meet timelines. Few are motivated to take the extra time necessary, and to face their companies’ legal obstacles, to publish the case histories of real field failures of electronics. Without engineers being willing or able to publish the real case histories, details on the root causes of the failures, and the best methods to prevent them, little can be expected in the advancement of the science of electronics reliability development and testing.


As stated many times in my previous blogs, if reliability engineers really looked at the root causes of field failures in their own products, they would see the same confirming evidence that in general the reliability of an electronics system cannot be predicted from statistics and probabilities. In the 22 years I have spent working with companies to improve the reliability of electronics hardware, I have seen many root causes of field unreliability in electronics systems. I do not recall ever seeing a wear out mode of a nominal electronic component as a cause of field failures. It has always been an overlooked design margin (which may appear to be an early wear out), misapplication of a component, an error in manufacturing, or misuse or abuse by the user. If I could publish the real evidence, data, and root causes of all the failures of electronics in the wide variety of equipment and companies I have worked with, or heard about through colleagues in the field, the case for using empirical stress limit methods of finding weaknesses would be much clearer.


The conditions that limit the sharing of real field reliability data are not likely to change in the foreseeable future. This is why many companies are still doing FPM fundamentally derived from the invalid MIL HDBK 217 methods and still use the meaningless term MTBF. While statistical or probabilistic methods are used for many valid engineering design and analysis applications, they have few applications in predicting the random errors and combinations of events that cause failures in solid state electronics. Most electronics systems become technologically obsolete before any inherent wear out modes cause failures.


We still can make progress in the field of electronics reliability, but we must validate the methodology and results from an engineering basis. We must use our knowledge of physics and material science in electronics, as well as the lessons that have been learned about the causes of real field failures. We must make sure, when electronics reliability is developed now and in the future, that there is traceability and reference to the physics of the failure mechanisms. For instance, traditional electronics FPM has used the Arrhenius equation and the broad assumption of 0.7 eV for the activation energy in silicon components as a major factor in predicting the rate of failure of components. This belief continues today, even though there is little or no evidence of traceability to physical mechanisms in today’s components and it has been over 16 years since MIL HDBK 217 was removed as a DoD reference document.
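To make concrete what the Arrhenius assumption implies, the sketch below computes the acceleration factor in its standard textbook form, AF = exp[(Ea / k) (1/T_use - 1/T_stress)] with temperatures in kelvin. The 0.7 eV value is the assumption discussed above; the other activation energies and the 55 °C and 85 °C temperatures are hypothetical values chosen only to show how strongly the predicted result depends on the assumed activation energy.

```python
from math import exp

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration_factor(activation_energy_ev: float,
                                  use_temp_c: float,
                                  stress_temp_c: float) -> float:
    """Standard Arrhenius acceleration factor between two temperatures."""
    use_k = use_temp_c + 273.15
    stress_k = stress_temp_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / use_k - 1.0 / stress_k)
    return exp(exponent)


if __name__ == "__main__":
    # Hypothetical use and stress temperatures; only the 0.7 eV figure
    # comes from the traditional FPM assumption discussed in the text.
    for ea in (0.3, 0.7, 1.0):
        af = arrhenius_acceleration_factor(ea, use_temp_c=55.0, stress_temp_c=85.0)
        print(f"Ea = {ea:.1f} eV -> acceleration factor = {af:.1f}")
```

Because the activation energy sits in the exponent, a prediction built on a blanket value with no traceability to an actual failure mechanism can be off by a large multiple in either direction.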


When we find a weakness in an electronics system through stepped-stress methods, we should know enough about the materials to know whether the weakness is due to a fundamental limit of technology (FLT), such as the melting of plastics or solder or the limits of LCD operation at temperature, or whether the weakness is due to the in-circuit application of a particular component. After uncovering the causes, we can understand what physics drove the failure and which element to change to increase the system’s strength or capability. Usually it is only necessary to strengthen one or two “weak links” to bring a product’s strength up to the FLT. Sometimes those weak links are software, not hardware, and changing code may be the only change necessary to add significant thermal strength capability and margins. Occasionally the system as designed and built reaches the stress FLT with no change needed, and this becomes a benchmark for subsequent designs. However, if you are not testing to empirical stress limits, you will never find out.
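The stepped-stress search itself can be sketched as a simple loop. This is an illustration only: `is_functional` is a hypothetical callback standing in for whatever functional test exercises the unit at each temperature step, and the start temperature, step size, and chamber maximum are placeholder values, not recommendations.

```python
from typing import Callable, Optional


def find_upper_operational_limit(is_functional: Callable[[float], bool],
                                 start_c: float,
                                 step_c: float,
                                 chamber_max_c: float) -> Optional[float]:
    """Step the temperature upward until the unit stops operating.

    Returns the highest stepped temperature at which the unit still ran,
    or None if it did not run even at the starting temperature.
    """
    last_good: Optional[float] = None
    temp_c = start_c
    while temp_c <= chamber_max_c:
        if not is_functional(temp_c):
            break  # operational limit found between last_good and temp_c
        last_good = temp_c
        temp_c += step_c
    return last_good


if __name__ == "__main__":
    # Simulated unit (hypothetical): locks up at 112 C and above.
    def simulated_unit(temp_c: float) -> bool:
        return temp_c < 112.0

    limit = find_upper_operational_limit(simulated_unit, start_c=20.0,
                                         step_c=10.0, chamber_max_c=150.0)
    print(f"Highest passing step: {limit} C")
```

The loop only locates the limit. Deciding whether that limit is an FLT or a fixable weak link still requires the root cause work described above.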

The causes of the inherent limits in sharing field and test lab reliability data are not likely to change anytime soon. Yet we can change our orientation and approach to electronics reliability development. Realizing the random and unpredictable nature of most electronics failures in the first 5-7 years of use would result in a major shift in activities for many companies developing electronics systems. It is a change that is slowly being adopted by more companies, but they are not going to spread the word to competitors.


The development of reliability based on empirical stress limits still has a long way to go before it becomes the dominant paradigm and activity in electronics reliability engineering. The limited data available from your own product failures, together with physics and material science, must be the basis for validating the use of empirical stress limit methodologies. The need for faster reliability development demands it. I have seen the evidence, but cannot share most of it, that HALT discovery methods have remained valid regardless of the rapidly changing materials and manufacturing processes in electronics components and systems, and I know they will remain so in the future.


“A new scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die and a new generation grows up that is familiar with it”-Max Planck, Scientific Autobiography

Why You Should Take it to the Limit for Maximum Test Value!


Posted August 8, 2012

By Kirk A. Gray

Accelerated Reliability Solutions, L.L.C.


When we go to an automobile race such as the Indianapolis 500, watching those cars circle the track can get fairly boring. What goes unspoken is that everyone observing the race is watching for a race car to find and sometimes exceed a limit, finding a discontinuity. The limit could be how fast a driver enters a curve before the acceleration forces exceed the friction available from the tires, or how close he can get to the racetrack wall before he contacts it and spins out of control. Using the race analogy, time trials before the race are like the design phase of electronics products, where only one race car is on the track, and manufacturing is like the race, where there are many cars and consistent control of each is required to have an “accident free” race.

Of course no one wants to see a driver injured or killed at these events, but watching cars circle the track without incident is fairly boring. The same is true in testing of electronics hardware and software. Highly Accelerated Life Testing (HALT) is fairly boring until an empirical limit, a discontinuity, is discovered. Fortunately engineers are not injured or killed discovering empirical stress limits in HALT evaluations of electronics systems.

Formula One Racing - Pushing the Limits

HALT methodology really is a limit discovery tool, not a pass-fail test. Some of the most useful data lies near and at the empirical limits, not the theoretical or specified operational limits. Testing to those limits is the fastest way of finding weaknesses and of comparing electronics systems designs. Observing wide differences in operational limits between samples of the same product provides evidence of inconsistent manufacturing processes in some component(s) affecting the system. If the deviation is large enough, the variation probably will affect operation of a small percentage of units at field use conditions. Discovery of variable empirical limits across multiple samples can be a discriminator for the quality of component and assembly process consistency. Wide deviation of operational limits between identical system samples is a good indicator of uncontrolled, possibly unknown, process variation that, if wide enough, will lead to failures in the intended use environment. Even if the number of units compared is not statistically significant, wide differences in limits are good qualitative indicators of reliability risk.
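A comparison of limits between samples needs nothing more than a few summary numbers. The sketch below is illustrative only: the limit values are invented, and the 10 °C range threshold is a hypothetical screening value that each team would have to set from its own experience.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def limit_spread(limits_c: Sequence[float]) -> Dict[str, float]:
    """Summarize upper operational limits measured on several samples."""
    return {
        "mean": mean(limits_c),
        "range": max(limits_c) - min(limits_c),
        "stdev": stdev(limits_c),
    }


def flag_inconsistent(limits_c: Sequence[float], max_range_c: float) -> bool:
    """Qualitative flag: True when the sample-to-sample range is wide."""
    return limit_spread(limits_c)["range"] > max_range_c


if __name__ == "__main__":
    # Invented upper operational limits (degrees C) for two builds.
    build_a = [118.0, 121.0, 119.0, 120.0]
    build_b = [122.0, 97.0, 119.0, 108.0]
    for name, limits in (("build A", build_a), ("build B", build_b)):
        summary = limit_spread(limits)
        flagged = flag_inconsistent(limits, max_range_c=10.0)
        print(f"{name}: mean {summary['mean']:.1f} C, range {summary['range']:.1f} C, "
              f"investigate process variation: {flagged}")
```

Four units per build is not a statistically significant sample, which is the point made above: a wide range is a qualitative prompt to look for the inconsistent process, not a failure rate estimate.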

Stress testing well below the operational limits, even though it may be well beyond the end use specifications, provides only very limited data on the product’s strength capability. Testing only to those “margins above spec,” if they are not close to the empirical stress limit, is just like watching a car race with a 120 mph speed limit. Some probability exists that a race car in this speed limited type of race could have a failure, and some cars would have failures and “lose” the race. Still, failure would likely be rare, most of the vehicles would tie for the win, and little differentiating information would be available for improving handling, durability or reliability over the competing cars. The analogy holds for typical reliability testing: the speed limited race cars are still driven much faster (higher stress) than most cars, just as most accelerated reliability testing of electronics is performed at higher stresses than most systems will be exposed to in their useful life, and some percentage do fail in these milder but above spec stress conditions.

So why not test to empirical operational, and sometimes destruct, limits (i.e. HALT)? It is the quickest way to get useful data on product weaknesses. Why do so many resist testing electronics systems to the empirical stress limits of voltage, temperature, vibration, shock, and other stresses that provide data on what the ultimate stress capability is? Here are just some of the reasons given in the last couple of decades:


1.      Product failures above specified component stress specifications are “foolish failures”

2.      Products in the field will never be subjected to those stress levels

3.      The product is too expensive to destroy the samples


To briefly answer those reasons:

1. All components have margins above specification, and functional margins are very dependent on the component’s application in the design, not on individual component specifications. Why assume any failure is foolish before finding it? Not testing to the operational strength of the actual product is leaving what could be valuable data (and ultimately money) on the table.


2. The product may not see the instantaneous stress levels used in the tests, but the cumulative fatigue damage of lower field stresses has a high probability of failing the same weakness in the design that is found at the destruct limits.

3. How expensive is a product failure to a company and its customers? Finding the weaknesses in a test lab almost always costs less than the lost sales and warranty costs incurred when a latent defect or weakness reaches customers. There is risk in all testing, and to find weaknesses at limits you risk catastrophic damage. In digital systems, however, it is very difficult to destroy systems below the thermal empirical operating limits, because parametric shifts cause signal integrity failures first. Perhaps the resistance exists because many believe that finding empirical limits results in a pile of melted solder, components and plastics. Vibration, on the other hand, will eventually cause a hard failure, where the operational limit is also a destruct limit. In any case, many times the unit can be repaired and re-used for additional testing.


In the reliability development of a new product we are somewhat like a person in an unfamiliar dark room. We really don’t know how big the room is until we bump into a wall, and actually several walls, to define the available space in the room. In electronics testing, until we find the actual empirical limits of stress, we do not know what the actual “stress” space is that can be used to find marginal functional or material issues. The larger the stress space, the faster we can find the strength “entitlement” and use that strength to find the one or two weaknesses in an electronics product that put overall reliability at risk.

Just like the title of a song by the rock group the Eagles, we should in testing “Take it to the Limit” to fully benefit from each sample of electronics systems we test. You will find it takes fewer units, less time and money to find the few elements in a design that really could impact field reliability.     

Why the Drain in the Bathtub Curve Matters


Posted May 23, 2012

Kirk Gray, Accelerated Reliability Solutions, L.L.C.


Most reliability engineers are familiar with the life cycle bathtub curve, the shape of the hazard rate or risk of failure of an electronic product over time. A typical electronics life cycle bathtub curve is shown in figure 1.

Figure 1. Typical Electronics Life Bathtub Curve

The origination of the curve is not clear, but it appears that it was based on the human life cycle rates of death. In human life cycles we have a high rate of death due to the risks of birth and fragility of life during that time. As we age, the rates of death decline to a steady state level until we age and our bodies start to wear out. Just as medical science has done much to extend our lives in the last century, electronic components and assemblies have also had a significant increase in expected life since the beginning of electronics when vacuum tube technologies were used.  Vacuum tubes had inherent wear out failure modes that were a significant limiting factor in the life of an electronics system.


During the days of vacuum tubes, wear out of the tubes and other components was the dominant cause of field failure. Although errors in design or manufacturing probably contributed to field failure rates, the mechanical fragility and limited life of vacuum tubes dominated the causes of system failures within a few years of use.

Traditional electronics reliability engineering and failure prediction methodology (FPM) have the concept of the life cycle bathtub curve at their foundation. The declining hazard rate region is called the “infant mortality” region. The wear out failure modes of electronics result in the increasing hazard rate represented at the “back end” of the bathtub curve. The concept of the curve has been used as a guide for “burn-in” testing and, of course, for establishing the misleading and meaningless term, MTBF.


In order to make a life prediction for an electronics assembly, some assumptions must be made regarding the quality and consistency of the manufacturing process, as well as the distribution of life cycle stresses. To create models for electronics devices we must assume that the manufacturing process is capable and that the parts being produced are from the center of the normal distribution. We also must make assumptions about the frequency and distribution of the life cycle stresses. It is difficult to account for variations at all the manufacturing levels in models without creating significant complexity. The same applies to accounting for variation in the life cycle stresses the product population will be subjected to.


Today’s electronics components, especially semiconductors, have an inherent “life entitlement” that is effectively infinite relative to the technologically useful life of the systems they are in. We will not likely determine empirically the life of the vast majority of electronics components and systems, because technological obsolescence occurs relatively quickly. The pace of significantly better performance and features in electronics systems is not likely to slow down, and therefore the rate of technological obsolescence will not slow down. Of course there are some exceptions in electronics, such as in energy systems. Cost justifications for wind and solar energy systems are based on 25 years of use.

The Drain in the Bathtub Curve

Using the same life-cycle “bathtub” curve analogy, technological obsolescence is the “drain” that comes before the back end of the life cycle (the wear out mode) occurs. Figure 2 is a graphical representation of the bathtub curve with the “drain”. The drain is the point in time at which electronics are replaced due to technological obsolescence. Because of this technological obsolescence drain, the back end of the bathtub curve is a relatively small contributor to the overall costs of failure for customers and manufacturers in a very large percentage of electronics. Obsolescence is especially rapid in consumer and IT hardware. The infant mortality region at the front end is where most manufacturers and customers realize the costs of poor reliability development programs, and it is what determines future purchases for most consumers.
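The shape of the curve and the position of the drain can be sketched numerically. The following is only an illustration of the textbook bathtub model, assuming the hazard rate is the sum of a decreasing Weibull term (infant mortality), a constant term, and an increasing Weibull term (wear out). Every parameter value and name here is hypothetical and chosen for illustration; none comes from field data, and the sketch is not evidence for any particular failure rate.

```python
def weibull_hazard(t, beta, eta):
    """Weibull hazard rate h(t) = (beta / eta) * (t / eta) ** (beta - 1)."""
    return (beta / eta) * (t / eta) ** (beta - 1)


def weibull_cum_hazard(t, beta, eta):
    """Weibull cumulative hazard H(t) = (t / eta) ** beta."""
    return (t / eta) ** beta


# Hypothetical parameters (time in years), assumed only to draw the shape.
INFANT = {"beta": 0.5, "eta": 400.0}   # beta < 1: decreasing hazard
RANDOM_RATE = 0.005                    # constant hazard per year
WEAROUT = {"beta": 5.0, "eta": 25.0}   # beta > 1: increasing hazard


def bathtub_hazard(t):
    """Composite hazard: infant mortality + constant + wear out."""
    return (weibull_hazard(t, **INFANT)
            + RANDOM_RATE
            + weibull_hazard(t, **WEAROUT))


def cumulative_shares(t_drain):
    """Share of the cumulative hazard each term contributes before the drain."""
    parts = {
        "infant": weibull_cum_hazard(t_drain, **INFANT),
        "random": RANDOM_RATE * t_drain,
        "wearout": weibull_cum_hazard(t_drain, **WEAROUT),
    }
    total = sum(parts.values())
    return {name: value / total for name, value in parts.items()}


if __name__ == "__main__":
    for years in (0.1, 1.0, 5.0, 15.0, 30.0):
        print(f"t = {years:5.1f} y  hazard = {bathtub_hazard(years):.4f} per year")
    # Assume the product is replaced as obsolete after 5 years (the "drain").
    print(cumulative_shares(5.0))
```

With a drain placed well before the assumed wear out region, the wear out term accounts for only a small share of the accumulated hazard. That follows directly from the assumed parameters, so it shows the mechanics of the analogy rather than a measured result.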

The causes of failures during early production are mostly poor or overlooked design margins, errors in manufacturing of the components or system, or abuse (sometimes accidental). Precise data on rates and costs of product failures is not easily found, as reliability data is very confidential, but most who deal with new product introductions realize that most costs of unreliability come from the front end of the bathtub curve and not much from the wear out end of the curve. Poor reliability may result in total loss of market share in a competitive market. The back end of the bathtub curve is for the most part irrelevant in the case of high rates of failure, as an electronics company may not be in business long enough for technological obsolescence to be a factor, much less “wear out” failures.

The electronics industry in the last few decades has been misdirected by the belief that the life of an electronics system can be calculated and predicted from component failure rates and models of some failure mechanisms (e.g., MIL-HDBK-217), although there has been no empirical evidence of any correlation between most predictions and field failure rates.


The vast majority of the costs of failures for almost all electronics manufacturers come in the first few years of a product's life, some within the warranty period but also in the few years past it. It is the customer’s experience of reliability that determines the perceived quality and reliability of the manufacturer, and future purchases from that manufacturer. The costs of lost future sales may be even greater than warranty costs, but since they are difficult to quantify they may never be known.

Most failures are attributable to the assignable causes previously mentioned. Traditional reliability engineering is mostly based on FPM, and at most electronics design and manufacturing companies the majority of reliability engineering resources have been spent on creating probabilistic estimates of the life entitlement of a system and the back end of the bathtub curve. There is little evidence that this has reduced the costs of unreliability in most electronics products, because those costs occur in the first several years and are due to assignable causes that occur at random, not to wear out.


A much greater return on investment in reliability development can be realized when the industry understands that most of the reliability failures in the first few years of use are not intrinsic wear out, but instead result from random errors in design and manufacturing. Reliability engineering must reorient to spend most of its development time and resources on better accelerated stress tests, using better reliability discriminators (prognostics) to detect errors and overlooked low design margins and eliminate them before market introduction. With this new orientation, electronics companies can be the quickest and most effective at developing a reliable product at market introduction at the lowest cost.

No Evidence of Correlation: Field failures and Traditional Electronics Reliability Engineering

Posted February 10, 2012
Kirk Gray, Accelerated Reliability Solutions, L.L.C.

Historically, reliability engineering of electronics has been dominated by the belief that 1) the life of hardware, or the percentage of hardware failures that occur over time, can be estimated, predicted, or modeled, and 2) reliability can be calculated or estimated through statistical and probabilistic methods in order to improve hardware reliability. The amazing thing is that during the many decades in which reliability engineers have been taught this and have believed it to be true, there has been little if any empirical field data, from the vast majority of verified failures, that shows any correlation with calculated predictions of failure rates.

The probabilistic statistical predictions, based on broad assumptions about the underlying physical causes, began in November 1956 with the first electronics reliability prediction guide, the RCA release TR-1100, "Reliability Stress Analysis for Electronic Equipment", which presented models for computing rates of component failures. This publication was followed by the "RADC Reliability Notebook" in October 1959, and by the publication of a military reliability prediction handbook format known as MIL-HDBK-217.


This approach continues today in various software applications that are progeny of MIL-HDBK-217. Underlying these “reliability prediction assessment” methods and calculations is the assumption that the main driver of unreliability is components that have intrinsic failure rates moderated by absolute temperature. It has been assumed that component failure rates follow the Arrhenius equation and that they approximately double for every 10 °C rise in temperature.
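To make that assumption concrete, the Arrhenius acceleration factor between two absolute temperatures is AF = exp((Ea / k) * (1/T_low - 1/T_high)), where Ea is an activation energy and k is the Boltzmann constant. The sketch below is only an illustration of the arithmetic; the function names and the 50 °C starting point are hypothetical and are not taken from MIL-HDBK-217 or any of its successors. It shows that “doubling every 10 °C” is not a universal constant: it holds only for one particular activation energy at one particular temperature.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def to_kelvin(temp_c):
    """Convert degrees Celsius to kelvin."""
    return temp_c + 273.15


def acceleration_factor(ea_ev, t_low_c, t_high_c):
    """Arrhenius ratio of failure rates at t_high_c versus t_low_c."""
    inv_delta = 1.0 / to_kelvin(t_low_c) - 1.0 / to_kelvin(t_high_c)
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * inv_delta)


def activation_energy_for_doubling(t_low_c, delta_c=10.0):
    """Activation energy (eV) at which the rate exactly doubles over delta_c."""
    inv_delta = 1.0 / to_kelvin(t_low_c) - 1.0 / to_kelvin(t_low_c + delta_c)
    return BOLTZMANN_EV_PER_K * math.log(2.0) / inv_delta


if __name__ == "__main__":
    # Hypothetical starting temperature of 50 degC, chosen for illustration.
    ea = activation_energy_for_doubling(50.0)
    print(f"Ea needed for doubling from 50 to 60 degC: {ea:.3f} eV")
    print(f"Same Ea from 100 to 110 degC: AF = {acceleration_factor(ea, 100.0, 110.0):.2f}")
    print(f"Ea = 0.3 eV from 50 to 60 degC: AF = {acceleration_factor(0.3, 50.0, 60.0):.2f}")
```

Under these assumptions the doubling rule corresponds to an activation energy of roughly 0.64 eV near 50 °C, and the same activation energy gives less than a doubling at higher temperatures. A mechanism with a different activation energy, or one not driven by temperature at all, does not follow the rule.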


MIL-HDBK-217 was removed as a military reference document in 1996 and has not been updated since that time; it is still being referenced unofficially by military contractors and is still believed to have some validity, even without any supporting evidence.


Electronics reliability engineering has a fundamental “knowledge distribution” problem, in that real field failure data and the root causes of those failures can never be shared with the larger reliability engineering community. Reliability data is some of the most confidential and sensitive data a manufacturer has and, short of a court order, will never be published. Without this information being disseminated and shared, little changes in the beliefs of the vast majority of the engineering community.

Even though the probabilistic prediction approach to reliability has been practiced and applied for decades, any engineer who has seen the root causes of verified field failures will observe that almost all failures that occur before the electronic system is technologically obsolete are caused by 1) errors in manufacturing, 2) overlooked design margins, or 3) accidental overstress or abuse by the customer. The timing of the root causes of these failures, which many times involve multiple events, is random and inconsistent. Therefore there is no basis for applying statistical or probabilistic predictive methods.


It is long past time for electronics design and manufacturing organizations to abandon these invalid and misleading approaches and acknowledge that reliability cannot be estimated from assumptions and calculations. Instead, a more effective approach is to put most of your reliability engineering resources into finding, through stress testing and as quickly as possible, 1) errors in manufacturing, 2) overlooked design margins, or 3) the weakest link that may cause failure in the roughest environment to which a customer may subject the product.