By Kirk Gray | February 25, 2017 at 03:46 PM EST | No Comments
A Crisis in the Home IT department
If you are the head of your home IT department, you may relate to this story.
Last night my wonderful wife Stacy was riddled with angst over the disruption of her binge watching a spy series, a mild crisis with our own IT hardware. The cause of her disappointment was the failure of our new large screen Ultra High Definition (UHD) Smart TV to deliver a drama series with reasonable picture and sound quality.
The failure mode was a choppy video stream and intermittent sound.
It was so bad she was reading the closed captioning while watching a series of almost jumping video frames. She was freaking out.
Who is to blame?
With me being the only engineer in the house, and according to our gender division of labor, I was expected to solve this reliability of service problem, and soon.
Like most non-technical users, she did not care which provider in the long chain of service suppliers was the weak link.
Her first target of frustration in our discussions of the issue was the satellite system. I pointed out that the content was being streamed over the internet, so the problem was independent of the satellite system.
Next, she blamed the TV manufacturer for causing the failure of service. I thought that might be possible, as I had previously had to power cycle the TV because an app had failed to load, and the power cycle did fix that problem.
The Long Chain of potential Culprits
It is funny that we still call these amazing displays and systems “Televisions”, now that they are basically computers with amazingly detailed large displays. They are complex systems with many suppliers of interdependent hardware and software in the chain of delivery. Everything must work together to successfully deliver content to the viewer.
As the capabilities and quality of television and the internet increase and the two blend into one system of content delivery, many more potential causes and suppliers of reliability failures are added.
With our new UHD TV we can access movies and TV shows from a satellite system provider as well as from the internet. These days we are watching much more original content from internet streaming providers such as Amazon Prime, Netflix, and YouTube, along with others. The reliability of watching movies and shows from the cloud now depends on our ISP’s DSL signal, which adds many hardware and software nodes to the chain, not all under the control of the provider.
What is most likely, and what is the easiest solution?
My first thought when I saw the TV failure mode was that our ISP was at fault. It was possible that the download data rate, in Mbps (megabits per second), had dropped because heavy Wi-Fi or local internet traffic loads were slowing the system.
So I turned to the easiest of the solutions, one with which I had “fixed” many failed IT systems both in my professional career and at home. It is simple:
TURN IT OFF and TURN IT ON!
It was back to normal again.
My wife was gleeful and expressed deep admiration of my IT prowess.
Rushing to Fix it!
In my desperation to please my most important IT client (my wife) and make the streaming series watchable again, I made a common error in failure cause isolation, that is, in determining which element was the cause of the failure.
I erred in that I reset two hardware systems in the signal chain simultaneously, so I could not tell which reset had cleared the fault.
I unplugged the DC power from the DSL modem and plugged it back in to reset the access to the DSL signal.
For the newer smart TVs, using the remote control on/off button does not really reset the operating software in the system, so I also power cycled the TV by pulling out the AC power cord, waiting a few seconds, and reconnecting it.
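The isolation error can be made concrete with a small simulation. This is only a sketch under stated assumptions: the component names, the `streams_ok` check, and the premise that exactly one element is stuck in a bad state are hypothetical, not a description of the actual home network. It resets one element at a time and rechecks the symptom after each reset, so the reset that clears the fault points to the culprit.

```python
from typing import Dict, List, Optional


def streams_ok(state: Dict[str, bool]) -> bool:
    """The stream plays cleanly only if every element in the chain is healthy."""
    return all(state.values())


def isolate_by_power_cycle(state: Dict[str, bool], order: List[str]) -> Optional[str]:
    """Reset one component at a time, rechecking the symptom after each reset.

    Returns the name of the component whose reset cleared the fault, or None
    if the stream was already fine or no reset helped. Assumes a single fault.
    """
    if streams_ok(state):
        return None
    for name in order:
        state[name] = True      # model: a power cycle clears a soft failure
        if streams_ok(state):
            return name         # the most recent reset is the one that mattered
    return None


# Hypothetical chain: the modem is hung, the TV is healthy.
chain = {"dsl_modem": False, "smart_tv": True}
culprit = isolate_by_power_cycle(chain, ["smart_tv", "dsl_modem"])
print(culprit)
```

Resetting both elements at once, as in the story, corresponds to skipping the check inside the loop: the stream recovers, but the information about which reset did it is lost.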
Conflicting priorities challenge finding the root cause
The solving of an IT problem at home illustrates a common problem in the real world of electronics field reliability.
In most cases the emphasis is on getting the customer back up and running, as was the case when I was repairing semiconductor high vacuum systems. Time was money, and so most field repairs use the method of “shotgunning”, replacing many hardware subsystems simultaneously. It is best in the short term for the customer, but it results in many parts being sent back that have no defect, which is one of the many contributors to the NDF (no defect found) problem. NDF continues to be a big challenge in finding the root cause of failures.
As our home entertainment and other consumer electronics become more dependent on cloud services and internet signal quality, we are adding many more links controlled by many more companies making hardware and software, and therefore making it more difficult to isolate the root cause of service failures.
As Steve Jobs said many times, “it just works”.
But sometimes it doesn’t work and it’s not known why.
Fortunately, for most “soft failures” of digital systems, the simple act of power cycling “repairs it” and it works again. And we are just happy it does.
By Kirk Gray | March 05, 2016 at 06:49 PM EST | No Comments
Kirk A. Gray
HALT (Highly Accelerated Life Test) requires exceeding specifications
One aspect of HALT, a test to find weaknesses and reliability risks empirically, that is difficult for many engineers new to the process is that HALT guarantees that the maximum or recommended operating environmental specifications for the system and components under test will be exceeded, and that failures beyond spec are potentially relevant to field reliability.
All products have some margins beyond the manufacturer’s environmental use specifications, to allow for variations in environment and manufacturing. The goal in HALT is to empirically discover how much strength or operating margin exists beyond the end use stresses, and whether the strength or capability of the product needs to or can be improved.
That is the key goal of HALT: to find empirical margins beyond the designed specs, not theoretical operational limits.
The empirical limits of devices must exceed design ratings, as every device must have stress margins that allow it to meet its published environmental specifications despite variations in manufacturing processes.
There are no industry standards for component manufacturers to provide thermal ratings. Many times the thermal use specification is based on market requirements; sometimes it is based on a similar component or its predecessor. Ratings for components are not always based on a device’s empirical capability or its physics of failure.
Sometimes a component supplier may have a shortage of a lower rated component and will ship higher rated components that are equivalent in form and function to fill a purchase order.
When we purchase a consumer electronics product, we generally want to know that it will operate in most environmental conditions we humans operate in. Of course, we inherently understand that tolerance of some environments or usage conditions, such as water immersion, has to be explicitly identified by the manufacturer, for example as “water-resistant” (and not waterproof) to a specific depth or pressure.
Some thermal specifications are expected to be exceeded
Many portable electronics producers are no longer publishing operating temperature specifications. Yet some popular brands of consumer electronics, such as the Samsung Galaxy S III smartphone and the Apple iPhone and iPad (3rd generation and later), have listed a manufacturer’s maximum operating temperature of 35°C (95°F). This means that if the device is operated when it is at human body temperature (37°C), it is operating outside the manufacturer’s “rating” and technically voiding the warranty.
Of course many of us carry our phones in a shirt or coat pocket, or use them outside on a hot summer day, so using a phone above 35°C is a “normal use” condition. Yet most users agree to these limited specifications through the EULA (End User License Agreement) required for using the product, although probably unknowingly, since most EULAs are probably accepted without actually reading through the details.
I would guess that few users know the operating specifications of not just their cell phones or computers but most electronics they own. Most users probably do not care; they just want it to work wherever and whenever they need to use it, regardless of specifications. The portable phone, tablet and PC market would quickly reject a product that cannot operate in most temperature conditions that humans commonly function in.
I know the iPhone 6 has an internal high temperature monitor circuit, as I have had mine display an over temperature message, “Temperature: iPhone needs to cool down”, on the screen when trying to turn it on after leaving it in my car on a hot summer day. After it had cooled it operated normally, as I would expect. I am certain the smartphone market would not accept the phone failing completely and not being covered under warranty after such a not uncommon event.
The persistence of this standard temperature specification, regardless of its validity, sometimes costs real money. In many parts of the world air conditioning is not available and the outside temperature regularly exceeds 35°C much of the year. For some IT equipment producers, the 35°C maximum operating environment specification prevents bidding on contracts from worldwide enterprises that require a warranty valid for an operating specification of up to 40°C. The inability to bid on large contracts, even when there is no supporting data behind the specification, has cost some enterprise IT hardware companies serious money.
Why specify the maximum thermal operating conditions lower than normal use environments?
So why do manufacturers specify a maximum temperature that is below use temperatures for many consumer electronics used in normal human environment conditions?
There is no clear answer.
It could have been based on the traditional reliability prediction models for semiconductors, such as Military Handbook 217A (MIL-HDBK 217), first published in 1959. It was believed that the failure rate of semiconductors was dominated by temperature. The resulting life models, based on the Arrhenius equation, predicted that the failure rates of semiconductors doubled for each 10°C increase in temperature. There is no data to support this long held and misleading belief, which has led many to refer to the Arrhenius equation as the “erroneous equation” because it was so widely misapplied for electronics life predictions.
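For readers who want to see what the “doubles for each 10°C” rule assumes, here is a minimal sketch of the Arrhenius rate ratio, exp((Ea/k)(1/T1 - 1/T2)). The activation energies and the 25°C to 35°C interval are illustrative assumptions, not values from the handbook. The doubling rule holds only for one particular activation energy at one particular temperature pair; for any other mechanism the ratio is different, and for failures not driven by a thermally activated reaction it does not apply at all.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(ea_ev: float, t_low_c: float, t_high_c: float) -> float:
    """Ratio of reaction rates at t_high_c versus t_low_c for activation energy ea_ev."""
    t_low_k = t_low_c + 273.15
    t_high_k = t_high_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_low_k - 1.0 / t_high_k))


def ea_for_doubling(t_low_c: float, t_high_c: float) -> float:
    """Activation energy (eV) at which the rate exactly doubles between the two temperatures."""
    t_low_k = t_low_c + 273.15
    t_high_k = t_high_c + 273.15
    return BOLTZMANN_EV_PER_K * math.log(2.0) / (1.0 / t_low_k - 1.0 / t_high_k)


# Illustrative activation energies only; real mechanisms must be measured.
for ea in (0.3, 0.7, 1.0):
    print(f"Ea = {ea} eV: rate ratio 25C -> 35C = {arrhenius_acceleration(ea, 25.0, 35.0):.2f}")

print(f"Ea needed for exact doubling 25C -> 35C: {ea_for_doubling(25.0, 35.0):.2f} eV")
```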
Yet it would be easy for a company like Apple to record the phone’s temperature and declare the warranty void if the phone log showed it was operated beyond a 35°C environment. The point is that it doesn’t matter what the ‘spec’ is; it is the consumer and the market that are the ultimate judge and decide whether a producer has made a reasonably robust device. They are the judge and jury of whether a portable product is flimsy or robust if it fails after any stress condition. Most consumers would blame themselves if a portable device did not work after being accidentally dropped 4 meters or being run over by a car. But almost all consumers would consider a cell phone that did not operate above 35°C a poor and unreliable design, and it would not last long on the market.
So does setting a user environment specification really matter? No, not really for the consumer market, as market competitors and consumer demands will ultimately set the benchmarks for how robust consumer electronics should be. This is why it is so beneficial to use HALT methods to find out how robust a product can be before the rest of the consumer market finds out. You know they will.
 Das, D., Pendse, N., Pecht, M., Condra, L., and C. Wilkinson, “Deciphering the Deluge of Data – Understanding Electronic Part Data Sheets For Part Selection And Management,” IEEE Circuits and Devices Magazine, Vol. 16, No. 5, pp. 26-34, September 2000.
Every once in a while I see a comment that by following the HALT methodology you will “overdesign” a product. Many question at what point or operational limit you quit increasing the stress-strength margins. Those who hold this view of HALT do not understand the essence of Gregg Hobbs’ principles and paradigm shift. The point of HALT is to create the most robust design using standard materials and manufacturing methods. Reliability development engineers who have applied HALT to electronic circuit boards have probably observed the same thing: some circuits and assemblies have significant capability. Many products achieve what Gregg Hobbs referred to as the “fundamental limit of technology” or FLT. That is, the strength margins are as large as they can be made without resorting to special materials or designs.
Adding liquid cooling to an air cooled product might be an example of a special method to extend the high temperature operating limit of a circuit or component. Changing a design from air cooled to liquid cooled because of a thermal limit found in HALT would most likely add significant costs, and is in most cases neither cost effective nor necessary. Changing a single limiting component to one with a higher voltage rating or faster switching speed, to increase a low thermal margin found in HALT, is the better approach. Consider each limit, and the cost and time required to improve it, with each discovery in HALT.
Small Change can result in large Gains in Operational Limits
I have performed HALT evaluations in which changing one component, such as replacing an eighth watt diode with a quarter watt diode, increased thermal margins by 30°C or more. Sometimes it is a change in software that increases a thermal limit. Most of the time, a change in the value of one or two components found to be limiting a thermal margin is cost effective for creating a more robust system.
For mechanical weaknesses found in HALT, an extra tie-wrap to hold a component, or repositioning of a component, may be all that is required to increase vibration strength. Rarely is there a need for significant changes to a design to increase the strength of an assembly to meet or be near the FLT.
To paraphrase what I took to be Gregg’s core belief when he developed HALT methods:
Start with the frame of reference of the inherent empirical stress-strength that is already in electronics and electromechanical systems built with standard materials and technologies. Use that strength to develop the best ROI screens through HASS (Highly Accelerated Stress Screens).
HALT is also “Highly Adaptive” to each industry
Standard materials and methods may be different for different industries. Standard materials for electronics systems used in a normal terrestrial environment would include lead free solder (which has a higher reflow temperature than the previous standard lead solders). For common environment products, a HALT failure or limit due to reflow of solder (210°C or so) would in most cases not be relevant to the field. Most likely the product will reach an operational limit long before reaching a solder reflow temperature. Insulation melting on an electrical connector would also likely be irrelevant to the field in most HALTs.
Yet in oil and gas down-hole MWD (measurement while drilling) equipment design and manufacturing, reflow of solder at 210°C would likely be a very relevant failure, as the designed product operating specification may be as high as 190°C. The product must have some strength (temperature) margin to ensure durability over its lifetime, given potential variations in assembly strength and in use environments. Equipment failures that stop an oil or gas well drilling operation are very costly, and using 300°C solders to ensure a strength reliability margin in MWD equipment is a good standard design requirement for these applications. Using 300°C solders in consumer products for use in terrestrial environments would in most cases be unnecessary overdesign.
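The role of margin under variation can be illustrated with the textbook stress-strength interference calculation. This is a rough sketch, not a HALT procedure: it assumes stress and strength are independent and normally distributed, and the 5°C standard deviations are hypothetical. Only the 190°C, 210°C and 300°C figures come from the text above.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def interference_probability(strength_mean: float, strength_sd: float,
                             stress_mean: float, stress_sd: float) -> float:
    """P(stress > strength) for independent, normally distributed stress and strength."""
    z = (strength_mean - stress_mean) / math.sqrt(strength_sd ** 2 + stress_sd ** 2)
    return normal_cdf(-z)


USE_TEMP_C = 190.0   # down-hole operating specification from the text
ASSUMED_SD_C = 5.0   # hypothetical spread for both stress and strength

p_210 = interference_probability(210.0, ASSUMED_SD_C, USE_TEMP_C, ASSUMED_SD_C)
p_300 = interference_probability(300.0, ASSUMED_SD_C, USE_TEMP_C, ASSUMED_SD_C)
print(f"210C solder: P(overstress) = {p_210:.4g}")
print(f"300C solder: P(overstress) = {p_300:.4g}")
```

With the same assumed spread, the 20°C margin leaves a small but real overlap between stress and strength, while the 110°C margin makes the overlap negligible. The same calculation applied to a consumer product that never sees more than 40°C would show no practical benefit from the more expensive solder.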
When HALT has been used as a standard reliability development tool for several products, there will be expectations of strength limits for a new product HALT. Thermal and vibration limits found in HALT for reliable fielded predecessor products can be the benchmark limits for the same types of assemblies (fans, circuit boards, displays, batteries, power supplies). HALT does not mandate a change in the product to increase its thermal or mechanical strength. Even the weakest link in a product may be at the FLT.
Not all HALT results require increasing strength limits
Again, the key is not designing for HALT, but designing products using lessons learned from predecessor products and industry standard materials and manufacturing methods, and then finding the weakest link in the design. The strength of the new product’s “weakest link” may actually be at the fundamental limit of technology, and therefore no improvement is necessary.
By Kirk Gray | June 13, 2014 at 11:49 AM EDT | No Comments
The GM Ignition switch failure case history should be required reading for all reliability engineers.
It is rare to have insight into any internal company history of serious electronic and electromechanical failures. Failure analysis and the causes of electronics or electromechanical systems failure can be a difficult investigation for any manufacturing company. Disclosure of the history and data is rarely if ever published due to the potential liability and litigation costs as well as loss of reputation for reliability and safety.
The report of the probe by former U.S. Atty. Anton Valukas issued June 2014 by the NHTSA on the failure of GM to determine the failure of air bags to deploy due to an ignition switch assembly is a fascinating case history. If you would like a copy of the redacted full 325 page report you can download it here.
The NHTSA report will become a classic case study for management and reliability engineering history books. Anyone who has worked for a large product manufacturing company will likely relate to many aspects of the mismanagement, misdirection, and company silos of information that prevented understanding of, and action on, this fatal flaw in the ignition switch of certain model years of the Cobalt and Ion.
The NHTSA report to GM shows multiple reporting errors, mismanagement of leadership in investigations, lack of data sharing, and lack of individual responsibility, among many other complications. What is also very interesting about the NHTSA report is the many different memories of meetings and investigations, and the lack of documentation of what was or was not communicated during those meetings.
Of course, a major player who misled the failure investigation was Ray DeGiorgio, who led the team that developed the switch and later approved a change to the part. Unfortunately, and against GM procedures, the part number was not changed, and DeGiorgio repeatedly denied making and approving the change to the spring and detent plunger that improved the switch torque and eliminated the failures.
In the end, simple testing and empirical measurement of the torque required to turn the switch, and comparison of those measurements, clearly showed the ease with which the switch could be moved from the “Run” position to the “ACC” position. With 20-20 hindsight, it appears the problem would have been discovered if any engineer involved in the investigation had gone out to used and new car dealers or a junkyard and actually measured the torque required to rotate the switch between the “Run” and “ACC” positions in the Cobalt model years in which the issue was and was not seen. This was eventually done, but only years later and at first unintentionally.
During design development of the switch, it is almost certain that the low force needed to move the switch would have been found in a step stress test with vibration, key chain weight and temperature. It seems that actual testing of the device was a low priority throughout the years of investigation. Instead of using stress testing and comparing switch torque limits, GM tried to reproduce the issue by simulating worst case rough road driving on a test track, but was unable to. Testing several samples of switch assemblies to an empirical boundary or limit would have had a better probability of showing the design flaw than an attribute test of pass or fail for a few samples on the “teeth chattering” test track.
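The weakness of a pass/fail test on a few samples is easy to quantify. The sketch below is an illustration with hypothetical numbers, not data from the report: if each test-track run has some small chance `p_fail_per_sample` of reproducing the shutdown, the chance of seeing at least one failure in `n_samples` independent runs is 1 - (1 - p)^n.

```python
def detection_probability(p_fail_per_sample: float, n_samples: int) -> float:
    """Chance that at least one of n independent samples fails a pass/fail test."""
    return 1.0 - (1.0 - p_fail_per_sample) ** n_samples


# Hypothetical per-run chance of reproducing an intermittent failure.
P_FAIL = 0.01

for n in (3, 10, 50):
    print(f"{n:>2} samples: P(at least one failure seen) = {detection_probability(P_FAIL, n):.3f}")
```

A measurement taken to a limit, such as the torque at which each switch actually moves, yields a number from every sample, so even a handful of parts can show that one population is weaker than another. An attribute test yields information only when a failure happens to occur.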
It all reminds me of a quote from two well-known car experts and MIT engineering graduates, Tom and Ray Magliozzi, who once on their NPR radio show “Car Talk” warned listeners to be cautious of MIT engineers’ understanding of how actual devices operate because “they are never shown the actual ‘thing’ but are only given the mathematical models that describe the ‘thing’”. In the story of the GM Cobalt ignition switch there seemed to be little actual observation, testing, or empirical measurement on a device that would seemingly not have been hard to access, since thousands were built and sitting in used and new car lots.
You can see many photos of the GM ignition switch assembly at the root of the problem from McSwain Engineering, Inc. at the International Business Times website link here.
It took a plaintiff’s investigator to physically compare ignition switches from the Cobalt model years that had the problem with those from the years after the switch was changed, when the issue did not occur. Of course, the denial of the design change by DeGiorgio, who had approved it, misdirected the investigation for years.
Some of the most interesting findings were:
In a 2005 newspaper review of the Cobalt, reviewer Gary Heller of the Sunbury Daily reported that “unplanned engine shutdowns happened four times during a hard-driving test last week…I never encountered anything like this in 37 years of driving and I hope I never do again”.
A Wisconsin State Trooper, Keith Young, correctly identified the problem back in 2007, seven years before GM accepted the cause.
Many of the GM engineers and investigative teams looking at the “moving stall” issue did not know that the ignition switch in the “ACC” position prevents deployment of the airbags; therefore the switch problem was considered only an “inconvenience” and not a safety issue. The report states that “The engineers made a basic mistake. They did not know how their own vehicle had been designed and GM did not have a process in place to make sure some looking at the issue had a complete understanding of what the failure of the ignition switch meant for the customer”.
A proliferation of committees at GM made disavowal of responsibility easy. Some of the interviewed witnesses described the “GM salute”, a crossing of the arms and pointing towards others, indicating that the responsibility for the issue belongs to someone else. Also described was the “GM nod”, when everyone nods in agreement to a proposed plan of action, but then no one really does anything to act on the plan.
The ignition switch was redesigned and parts of the assembly were changed to improve the torque required to hold the switch positions. Unfortunately the part number was not changed, which added to the delay in discovery of the change.
GM issued Technical Service Bulletins about the problem in 2005 and 2006. Trooper Young found them online at the NHTSA back in 2007.
The GM Field Performance Assessment (FPA) engineers did not consider searching for information relevant to the problem of airbag deployment that was publicly available or in GM’s own files.
After one engineer unintentionally noticed how extraordinarily easy it was to turn the key in one of the Cobalt vehicles in a junkyard while trying to retrieve a BCM (Body Control Module), the engineers went to a local fish and tackle store to purchase a fish scale in order to measure the torque required to move the switches from the “Run” position to the “ACC” position in “a number of Cobalt vehicles at the junkyard”.
In the report’s assessment of GM culture, many other interesting factors were discussed.
In critical safety meetings, amazingly no one took notes. The general avoidance of taking notes was seemingly communicated among GM engineers through “urban lore”, as there was no evidence that any lawyer or manager at GM ever gave that instruction. The result was that there were no clear records of attendance, or what was discussed or decided.
The last parts of the report detail the cultural and systematic failures of GM and actions to prevent the same failure of corrective action for a safety issue that occurred with the ignition switch. We can only wait to see if the culture and policies at GM that caused this fatal flaw to be uncorrected for so many years will change.
What aspects of the GM culture and systems that caused this failure to find root cause and implement corrective action have you seen in companies you have worked for?
By Kirk Gray | June 20, 2013 at 02:20 PM EDT | No Comments
MTBF for electronics life entitlement measurements is a meaningless term. It says nothing about the distribution of failures or the cause of failures and is only valid for a constant failure rate, which almost never occurs in the real world. It is a term that should be eliminated along with reliability predictions of electronics systems with no moving parts.
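To show why a single MTBF figure says so little, the sketch below compares three populations with the same mean life but different Weibull shape parameters. All numbers are hypothetical: a 100,000 hour mean life, and shapes of 0.5 (declining hazard rate), 1.0 (the constant failure rate that MTBF assumes) and 3.0 (wear out). The Weibull distribution is used here only as a convenient illustration.

```python
import math


def weibull_scale_for_mean(mean: float, shape: float) -> float:
    """Scale parameter (eta) that gives a Weibull distribution the requested mean life."""
    return mean / math.gamma(1.0 + 1.0 / shape)


def fraction_failed(t: float, shape: float, scale: float) -> float:
    """Weibull CDF: fraction of the population failed by time t."""
    return 1.0 - math.exp(-((t / scale) ** shape))


MEAN_LIFE_HOURS = 100_000.0   # hypothetical "MTBF" shared by all three populations
ONE_YEAR_HOURS = 8760.0

for shape in (0.5, 1.0, 3.0):
    scale = weibull_scale_for_mean(MEAN_LIFE_HOURS, shape)
    failed = fraction_failed(ONE_YEAR_HOURS, shape, scale)
    print(f"shape {shape}: fraction failed in first year = {failed:.4f}")
```

The three populations report the identical mean life, yet the fraction failing in the first year differs by orders of magnitude between them. The mean alone cannot distinguish a product plagued by early failures from one that fails only late in life.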
There is also another term widely used in reliability engineering that is a bit of a misnomer and should be eliminated, that is the term “Infant Mortality”. The term “infant mortality” typically is used to describe early life failures in an electronics system during the declining hazard rate period which may extend to its technological obsolescence.
It is my experience that it is a term used dismissively, as if it were “expected” or acceptable as an intrinsic yet generic cause of failures within the first weeks or months of a new product introduction. It is also considered by some traditional reliability engineers I have met as a “quality department” problem, not to be confused with reliability engineering.
The vast majority of human infant mortality occurs in poorer third world countries, and the main cause is dehydration from diarrhea, which is preventable. Many other factors contribute to the rate of infant deaths, such as limited access to health services, the education of the mother, and access to clean drinking water.
Human infant mortality is defined as the number of deaths in the first year of life. The contributing causes of human infant deaths and of electronics failures are, of course, completely different. Causes of human infant mortality come from the fact that at birth a child may go through a complicated delivery and does not have a fully developed immune system, so it has less resistance to infections. The lack of health care facilities or skilled health workers is a contributing factor.
An electronic component or system is not weaker when fabricated; instead it has the highest inherent strength when turned on for the first time. Opposite of humans, electronics are “adult” when first produced and decline in strength (fatigue life) from that point on. This is why we can subject new systems to high levels of environmental stress to remove latent defects (HASS process) without taking significant life from it.
So why use the dismissive term “infant mortality” to describe latent defects in electronics as if they are expected? The time period that we would classify as “infant mortality” in electronics is arbitrary. It could be the first 30 days or the first 18 months or longer. Since the vast majority of latent (hidden) defects that are found early come from mistakes and errors in either design or manufacturing, and are therefore not controlled, they can have a wide distribution of times to failure. Many times the same mechanism whose weakest manifestations fail within 30 to 90 days continues to cause failures at a declining rate through a product’s usable life.
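A declining failure rate with no natural cutoff can be pictured with a Weibull hazard function, h(t) = (β/η)(t/η)^(β-1), which declines over time whenever the shape β is below 1. The parameters in this sketch are hypothetical and chosen only to show the trend; they are not fitted to any field data.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate h(t) of a Weibull distribution, for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


SHAPE = 0.5           # hypothetical; any value below 1 gives a declining hazard rate
SCALE_HOURS = 50_000.0  # hypothetical

for label, hours in (("30 days", 720.0), ("90 days", 2160.0), ("18 months", 13_140.0)):
    print(f"{label:>9}: hazard = {weibull_hazard(hours, SHAPE, SCALE_HOURS):.2e} per hour")
```

The rate keeps falling at every point in time, so any boundary drawn at 30 days, 90 days or 18 months to mark the end of “infant mortality” is a matter of convention rather than physics.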
Failures of electronics systems in the first days or months after manufacture are not due to intrinsic wear out mechanisms that are known. We can only model those failure mechanisms that have an intrinsic and repeatable physics of failure.
Traditional reliability engineering has been focused on making predictions of the life entitlement of electronics systems using cookbooks of FIT rates to derive a system MTBF or MTTR. This is in spite of the fact that there is little or no evidence of empirical correlation to actual causes of most electronics failures. Traditional reliability engineering, it seems, has not been very focused on early discovery of the causes of early life failures during the declining hazard rate period after market release. Semantics is important and carries implications. The term “infant mortality” contributes to dismissing the significance of early life failures to the overall reliability of a system. Yet it is where the vast majority of costs are for the customer and any electronics systems manufacturer.
Because electronics are not “infants” and not weaker when first “born”, we can be aggressive in our treatment of them before they leave the “birth room”. Unlike with newborns, we can put new electronics through a stress test and, if they fail, diagnose and discover an assignable cause which we can then correct to prevent further failures. Through HALT and HASS we can find the root causes of latent defect failures and, by removing those from the production population, eliminate the most costly period of defects and failures, which because of the potentially wide time distributions can extend until the product is replaced due to technological obsolescence. I believe the term infant mortality when applied to electronics has the connotation that it is expected, inherent, unavoidable, and due to nature. It should be used for human life cycles, not electronics life cycles.
By Kirk Gray | March 12, 2013 at 12:17 PM EDT | No Comments
“When the number of factors coming into play in a phenomenological complex is too large, scientific method in most cases fails. One need only think of the weather, in which case the prediction even for a few days ahead is impossible.” ― Albert Einstein
“Prediction is very difficult, especially about the future.”– Niels Bohr
We have always had a quest to reduce future uncertainties and know what is going to happen to us, how long we will live, and what may impact our lives. Horoscopes, Tarot Cards, tea leaves, and crystal balls have been used as specialized “tools” by fortune tellers to gaze into the future. The paradox of fortune telling is that by knowing the future, we can change it. The risk side of believing we know the future is also that if we incorrectly guess (assume) the causes of a future event, our prevention action may create additional costs or higher risk of an even worse event.
This is also true when making predictions of the future life of electronics. Without clear traceability to actual physics of failure in electronics, assumptions about the causes of failures have added costs without benefits. Reliability prediction in much of the electronics industry is still based on the assumption of a constant rate of failure modified only by the steady state temperature. This is despite the fact that there is NO identified intrinsic physical mechanism in active components that causes or would cause an increase in a constant rate of failure. There is no evidence or reason to believe that components fail at a constant rate.
Many in management and marketing of electronics companies want to believe and wish reliability engineers could predict the life of electronics systems. By knowing the future failure rates, we could budget warranty costs and the correct number of spare parts and replacement units before the product is launched.
In 1995 my friend Professor Michael Pecht, founder and chairman of the University of Maryland's Center for Advanced Life Cycle Engineering Consortium, wrote and published the article “Why Traditional Reliability Predictions don’t work – Is there an Alternative.” In it he provides the history of one of the foundational documents of electronics reliability engineering, Military Handbook 217 (MIL HDBK 217), and explains why it cannot predict electronic system failure rates. It was removed as a military reference document in 1995, largely due to the work of Prof. Pecht. It is amazing that MIL HDBK 217, removed almost 17 years ago, is still being referenced and its progeny are still being used for reliability predictions in many electronics companies today. Needless to say, electronics materials and manufacturing methods have changed tremendously in the last 17 years, but the continued belief that electronics systems reliability can be predicted has changed little in that time.
Electronics reliability cannot be predicted at a system level. The vast majority of failures of electronics hardware are due to design margin errors, component misapplication, errors in manufacturing processes, and customer misuse or abuse. It is very easy to confirm this is the case if you have access to the root causes of real field failures in real electronics products.
“All models are wrong, but some are useful” – George E. P. Box
Mathematical models to predict future events are, in many cases, valid and useful. Computer models and measurement systems that are used in meteorology to forecast weather conditions are improving, yet the ability to predict the weather more than a few days ahead has been elusive. There would be huge benefits economically and in human lives if we could project longer than a few hours in advance when and where extreme weather events such as tornados or hurricanes will occur. With more inputs of contributing atmospheric conditions and computer algorithms, weather forecasting is getting better. Yet extreme weather prediction is limited to a few hours for tornados or a few days for hurricanes, before we know where they will hit.
Of course reliability prediction could be performed more accurately if we knew all of the many inherent potential failure mechanisms in an electronics system and the fatigue responses to the life cycle environmental profile (LCEP) stresses. Even if we could know all the inherent failure mechanisms in components, we would also need to include some information on the distributions over time of manufacturing variations and excursions that would modify the strength or rate of degradation of those mechanisms during manufacturing.
In many mechanical and electromechanical systems we do have physical wear mechanisms that can be mathematically modeled and from those models we can mathematically project the “life” of the mechanism. We know that in electric motors, wear of contact brushes, evaporation of lubricants, and wear of ball bearings eventually use up life, leading to failures due to wear out. Mechanical switches and hinges have a limited fatigue life. Through those models we can extend the life in mechanical systems by increasing the reservoir of material or reducing the driving stress conditions. In electronics there are a few devices, such as batteries, that do have short wear out modes relative to technological obsolescence and modeling life is very useful and necessary.
It is much more difficult to determine the underlying life-limiting mechanisms of solid state electronics components such as IC’s in a complex system and much less in a PWB. Not only the intrinsic physics causing component degradation and failure must be known, but also the PWB and solder fatigue mechanisms must be known for each package. BGA solder joints and PTH (Plated through Hole) vias do not fatigue at the same rate under the same stress inputs. Of course the stresses for all the mechanisms on the PWB and components can vary widely depending on the PWB locations.
LCEP for most electronics systems is a very rough guesstimate
Reliability prediction must also determine the Life Cycle Environmental Profile (LCEP) and also the LCEP distributions for the future field population. We must know to some precision the actual LCEP stress distributions along with the inherent product “strength entitlement” distributions to know where the strength distribution overlaps the stress distribution resulting in product failures. Please see my blog post “Reliability Paradigm Shift From Time to Stress Metrics” for more explanation of the Stress/Strength relationship in reliability.
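The stress/strength overlap described above can be sketched numerically. This is a minimal illustration that assumes both distributions are independent and normal, which is my simplifying assumption and not something field data would necessarily support; the numbers are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def interference_probability(strength_mean, strength_sd, stress_mean, stress_sd):
    """P(stress > strength) for independent, normally distributed stress and strength."""
    margin_mean = strength_mean - stress_mean
    margin_sd = math.sqrt(strength_sd ** 2 + stress_sd ** 2)
    return normal_cdf(-margin_mean / margin_sd)


if __name__ == "__main__":
    # Hypothetical units: same mean margin, but a wider LCEP stress distribution.
    print(interference_probability(100.0, 10.0, 60.0, 10.0))
    print(interference_probability(100.0, 10.0, 60.0, 20.0))
```

Widening only the stress distribution, with the same mean margin, raises the overlap by more than an order of magnitude in this example. That is why a rough guess at the LCEP distribution makes the resulting failure probability a rough guess as well.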
So many electronics systems have a wide variety of LCEP’s, with new applications of systems resulting in new LCEP’s that were never considered. Take the example of VGA projectors that we see in many conference and meeting rooms. Some projectors are permanently mounted on the ceiling and many others are mobile. The ceiling mounted units’ fatigue stress most likely comes from thermal cycling during power cycling, and the mobile units have that stress plus the shock and vibration from transporting. The mobile units’ populations have a much wider distribution of LCEPs. I doubt the manufacturers of these products know the distribution of the LCEP for these two distinct end use environments. End users will expect the same reliability in both, regardless of the very different LCEP’s. Of course some of the mobile units will break instantaneously from an accidental drop. If and when one breaks from an accidental drop, will the user blame their own mishandling, or blame the manufacturer for making a “fragile” projector and never buy from the same manufacturer again? Certainly we do not expect our cell phones to fail after a waist high drop, but at what drop height would we blame ourselves for the failure?
When it comes to electronics systems reliability modeling and prediction, we really cannot know all the mechanisms or the distributions of the LCEP. Even if all the degradation models were known and all the combinations of stress distributions and effects in the assembly were known, the challenge of reliability prediction is compounded by variations over time in manufacturing.
Focus on real weakness discovery – less on guessing a very uncertain future
We have even less time to model partial or whole systems and the resulting fatigue damage and degradation as the design and manufacturing cycle times for new electronics continue to decrease. Even if we are able to model the degradation and fatigue damage of every potential failure mechanism in a PWB, the models must be based on units from capable manufacturing, not on variations, and we know there will be variations. Additionally, modeling can only establish a failure rate based on inherent wear out mechanisms and known LCEPs, even though there may be new applications and different future LCEP’s that were not known when the product was designed.
“The best way to predict the future is to create it.” - Peter Drucker
Just as with predictions of our own future, many would like to know what the future holds for the electronics we make and use. Yet for complex electronic systems there has been no evidence that we can model and predict the future failure rates, regardless of the fact that many still want to believe it can be done and want it to be true.
Empirical stress limit discovery is a vastly more efficient tool for building a reliable electronic system. Using stepped stress to limits methods (such as HALT) and focusing on discovery of potential weaknesses that could be a reliability risk, we can very quickly find the strength limits of complex electronic systems under stress conditions in order to establish a benchmark of strength based on current standard electronics technologies. By knowing empirical stress limits, we can develop safe and efficient ongoing accelerated reliability testing to precipitate and detect manufacturing errors or excursions that result in latent defects.
Unfortunately, there is still the wish that accurate prediction is possible and many are still feeding the wish that reliability of electronics hardware can be predicted based on past invalid documents. Without the ability to share real field reliability data that belief is likely to continue.
By Kirk Gray | March 12, 2013 at 12:12 PM EDT | No Comments
Posted on November 20, 2012 by Kirk Gray
HALT is a BIG change
Implementing a new reliability development paradigm in a company which is using traditional, standards-based testing can be a perilous journey. It is especially true with introducing HALT (Highly Accelerated Life Test), in which strength against stress, and not quantifying electronics lifetimes, is the new metric. Because of this significant change in test orientation, a critical factor for success begins with educating the company’s top technical and financial stakeholders on why and how HALT is so effective for rapid reliability development. Without the upper levels of management understanding in parallel the big picture of the HALT paradigm shift, the work of educating each skeptical key player in a serial fashion will cost much more time and put the success of HALT at significant risk.
To illustrate, let’s imagine that you are your electronics systems company’s reliability engineer or you have been involved in reliability qualification or validation testing of its products for several years. You have experienced field failures that resulted from design margin issues that were overlooked during the development process, as well as some from mistakes in manufacturing. Reliability development in your company consists of running tests that simulate the LCEP (Life Cycle Environmental Profile) estimates (guesstimates?), or design engineers apply limited stress to their predefined “that’s good enough” level or guesses on what may be the worst case stress environmental conditions for the product.
You have just learned the generic methodology and some of the benefits of HALT from reading a book on the subject or attending a class or webinar. You now want to try HALT for your company’s new product, and find an outside test lab that has a HALT chamber so that you can go do a HALT. It would seem to be the most straightforward path to getting started with HALT. Or is it?
Teaching HALT in Series takes longer
Let’s consider the following scenario.
You find an outside test lab that can perform HALT a few miles away from your company. You have five samples of the new product, support equipment to operate and monitor the UUTs (Units Under Test), and possibly a technician (or if lucky you have a design engineer for the product you are going to test) to go with you. The environmental design specifications for the product are 0°C to 35°C.
The HALT lab helps you set up the first sample of the product and proceeds to find the lower temperature operational limit and the upper temperature operational limit. Since the product in this case is a digital system, there are no thermal destruct levels found as the system will not operate above or below the temperature operational limits. In the five samples used for HALT, you find upper temperature operational limits at 70, 72, 90, 117, and 110°C. The lower temperature operational limits for the five samples are found to be -55, -45, -50, -58, and -47°C.
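The limits just listed can be reduced to two kinds of numbers: the spread between samples and the worst-case margin beyond the design specification. The short sketch below does that arithmetic with the values from the scenario; the function and key names are my own.

```python
SPEC_LOW_C, SPEC_HIGH_C = 0, 35            # design specification from the scenario
UPPER_LIMITS_C = [70, 72, 90, 117, 110]    # upper operational limits, five samples
LOWER_LIMITS_C = [-55, -45, -50, -58, -47]  # lower operational limits, five samples


def summarize_limits(upper, lower, spec_low, spec_high):
    """Return sample-to-sample spread and worst-case margin beyond spec, in degrees C."""
    return {
        "upper_spread": max(upper) - min(upper),
        "lower_spread": max(lower) - min(lower),
        "worst_upper_margin": min(upper) - spec_high,
        "worst_lower_margin": spec_low - max(lower),
    }


if __name__ == "__main__":
    print(summarize_limits(UPPER_LIMITS_C, LOWER_LIMITS_C, SPEC_LOW_C, SPEC_HIGH_C))
    # upper_spread 47, lower_spread 13, worst_upper_margin 35, worst_lower_margin 45
```

The cold limits cluster within 13°C of each other while the hot limits spread across 47°C, so the hot side is where sample-to-sample variation deserves a closer look.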
The final stress used in HALT is vibration and two of the samples fail when the vibration level reaches the maximum vibration level of the HALT chamber. The failure mechanism on both is a broken lead of a capacitor mounted high off the PWB. You and the design engineer repair the capacitor leg and glue it down to the PWB. To verify the HALT improvement, you apply HALT vibration to the same maximum level and the glued capacitors do not fail.
Two circuit boards in a HALT chamber - A common education will speed understanding before this HALT begins
Improving margins beyond spec – the bigger challenge
After you complete HALT at the outside lab and come back to your company, you wonder why there is more than a 40°C difference in upper temperature operational limits between the five samples. You realize that wide variation in limits may be an indicator of some components’ inconsistent manufacturing processes, or significant sensitivity to inherent parametric variations of a component, which if it increases its variation could significantly impact field reliability. You hope that you can have the manager of design engineering support an investigation into the cause of the wide upper thermal limit variations between the samples. When you meet with him he tells you that his department is very busy with the next design and his limited resources will not be available because:
“The product meets the design specifications, and even the worst sample has 35°C margin above design specifications.”
“The product will never see 70°C in its worst case use, therefore if it does fail it’s the customer’s fault.”
“We do not have time to re-design the product to meet your HALT stress requirements.”
How do you address these obstacles from the design engineering manager for resources needed to identify the weaknesses and potentially improve the product robustness and reliability?
Let’s say you spend an hour with the design manager, overcome his objections, and get help from the design engineers. With the help of a couple of design engineers you determine a ten-watt FET is the most likely cause of the upper temperature operational limit. Fortunately you find a twenty-watt FET in the same size package and voltage and use it to replace the ten-watt FET. You go to the HALT lab two weeks after your first HALT and find that all three new samples have an upper operational limit above 115°C and again no thermal destruct limit is found.
While you were busy in the Lab
Later in the week you find that during the time you have been doing the HALT at the local environment test lab, a skeptical design engineer (you have yet to speak with) has heard that you want to “over-design the product for stresses it will never see” and has spoken to others in the design and procurement departments. Now you start hearing that engineers (you have not spoken with) are commenting on your desire to add product costs to over-design a product for an irrelevant failure mode. You find that you are teaching the HALT paradigm shift to each skeptic as you find them, convincing some, but the ones you have not spoken with are also spreading the same “fear of over design.”
When you go to the purchasing department you find out the higher wattage FET will add 50 cents of additional cost to a product that retails at $700.00. Now the Vice President of Engineering hears about the additional product costs if the FETs are changed. Since it will reduce the profit margin on an already competitively priced product, the VP then asks very similar questions to those the design engineering manager asked previously. The design manager attempts to explain the reasoning in his half-hour meeting with the VP, but doesn’t succeed.
Product Launch Time – too late, but now you may get the field failure data
Ultimately, increasing the wattage of the FET that limits thermal operation is not implemented and the product is released to market. In a year or two, you may be able to accumulate the warranty return data and find the FET failures are the second highest cause of warranty returns, after NFF (No Fault Found). You bring this information to the new design engineering manager, who just joined the company or moved from another position during a reorganization only five months ago.
You are back to square one in teaching HALT to the new design engineering manager, and then to the many new design engineers, reliability engineers, and department managers who have arrived since you first began your path to introducing HALT at your company two years ago. Do you have the energy to show and explain HALT to the new key players? Are you still a reliability engineer at the same company?
A “HALT Battlefield” Experienced Consultant can accelerate understanding
It is critical to win the “hearts and minds” of the top management when introducing any new engineering concept. Introducing HALT methods, a relatively new yet still very misunderstood paradigm in electronics reliability development, can be a challenging path. An experienced HALT consultant can provide a simultaneous education of the top executives and key management personnel to help them understand the fundamentals of the HALT paradigm shift. They can provide data and examples showing how and why HALT is so effective and pitfalls to avoid. A HALT consultant can review causes of your field failures and identify those that would likely be found during HALT so that you can show the future potential ROI for different test strategies. They can also provide multiple paths to HALT adoption by demonstration tests on a known weakness, or showing how it can reduce the NFF warranty returns problem.
A HALT consultant can provide a short overview presentation to the company executives, and their staff can ask questions about HALT with all hearing the same answers and explanations. You will still have questions to answer about the new methods going forward, but with an experienced HALT consultant, you and your company will have someone who has heard the questions many times before and likely has the data to support the answers or knows where to find it.
Of course there are many other benefits of having an experienced leader in HALT guiding your way, such as planning, test monitoring, procedure writing, and most importantly the continued teaching and coaching of new engineers and skeptics while you are accumulating real field data showing the benefits of HALT in your products. But, without a common understanding at the highest levels of the company, you may “win several battles but lose the war” of introducing the most efficient approach to making a reliable and robust electronic system.
What has convinced you that HALT is worth it or that it is not worth it? What has been your most effective tool for accelerating the adoption of “stress to limits” (HALT) methods? Please leave your questions and comments or stories of success or failure with accelerated stress testing below or contact Accelerated Reliability Solutions, L.L.C. for more information.
By Kirk Gray | March 12, 2013 at 11:49 AM EDT | No Comments
It is easy to understand why the term HALT (Highly Accelerated Life Test) is so tightly coupled to the equipment called “HALT chamber” systems. Many do not think they can do HALT processes without a “HALT chamber”. Many know that Dr. Gregg Hobbs, who coined the term HALT and also HASS (Highly Accelerated Stress Screens), spent much of his life promoting the techniques and was also the founder of two “HALT/HASS” environmental chamber companies. I was very fortunate in working with and learning from Dr. Hobbs beginning 23 years ago at Storage Technology, Inc., before he founded his first chamber company, Qualmark, Inc. Later he asked me to join him at Qualmark in promoting the methods and equipment for HALT and HASS and we worked together at Qualmark for a year.
Back in the 1980’s Gregg taught that the best and basic approach for HALT was to use a single stress in a stepwise fashion to a point of empirical operational discontinuity, that is an operation limit or destruct limit. Some stresses were more useful and universal than others such as rapid thermal change and multi-axis vibration (pneumatic hammer driven) which required a combined environmental chamber and became referred to as a “HALT chamber”.
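The single-stress, stepwise approach described above can be sketched as a simple loop. This is only an outline under my own assumptions: `apply_stress` and `is_operating` are hypothetical callbacks standing in for whatever sets the stress level and runs a functional check, and `SimulatedUnit` is a stand-in for real hardware. Dwell times, recovery checks, and destruct limit handling are left out.

```python
def step_stress_to_limit(apply_stress, is_operating, start, step, max_level):
    """Raise one stress in fixed steps until the unit stops operating.

    Returns (last_good_level, first_failing_level). The second value is None
    if the equipment maximum is reached without finding a limit.
    """
    level = start
    last_good = None
    while level <= max_level:
        apply_stress(level)
        if not is_operating():
            return last_good, level
        last_good = level
        level += step
    return last_good, None


class SimulatedUnit:
    """Hypothetical unit that stops operating at or above a hidden stress level."""

    def __init__(self, hidden_limit):
        self.hidden_limit = hidden_limit
        self.level = None

    def apply(self, level):
        self.level = level

    def operating(self):
        return self.level < self.hidden_limit


if __name__ == "__main__":
    unit = SimulatedUnit(hidden_limit=72)
    print(step_stress_to_limit(unit.apply, unit.operating, start=40, step=10, max_level=120))
```

Nothing in the loop refers to a chamber. The stress could be temperature, voltage, clock frequency, or drop height, as long as it can be raised in controlled steps and the unit's operation can be checked at each step.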
Even though Gregg made HALT chambers, the HALT processes he taught were a fundamental paradigm shift in that they favor an empirical test process over the theoretical, probabilistic failure prediction methods used in traditional reliability engineering. Increasing voltage stress, clock frequency stress, software application stresses, drop height shock stress, and increasing pressure in hydraulic systems are all stresses that could be used for HALT processes and no HALT chamber is required.
To illustrate, I recall that almost twenty years ago at Qualmark I was in Gregg’s office and he was very excited to tell me about a call he had just had with a test engineer who had a big success with a vibration HALT procedure. The reason I remember this so well is that as I was leaving Gregg’s office he told me, “oh by the way, he used a classical [Electrodynamic] shaker!”. Many do not realize that vibration HALT could be performed on a classical shaker by taking the product to an empirically determined vibration stress limit, as Gregg knew it could. Thermal HALT methods could be done in a traditional mechanical refrigeration chamber, although maybe not as efficiently.
HALT is not a chamber, or type of stress or combinations of stress; it is a basic method of using an applied stress for empirical design or manufacturing weakness discovery. A much better term to describe HALT would be “Highly Accelerated Limit Test”.
“HALT/HASS” chambers are typically defined by the fact that they are capable of producing multi-axis vibration from pneumatic hammers and rapid thermal stresses from liquid nitrogen supplied cooling and large banks of electric resistive coils for heating. In these systems the multi-axis vibration intensity can be controlled but the frequency spectrum is uncontrolled. It is like the real world in that vibration in most products’ life cycle environments is uncontrolled in intensity, axis, and frequency. Temperature ramp rates can force rapid product thermal changes of 60°C per minute or greater depending on the product mass and chamber configuration. Even though they are typically referred to as "HALT/HASS chambers" and are excellent for that purpose, their capabilities really shine in HASS processes, when combined environments should be used to induce the most stress and fatigue damage in the shortest period of time. The quicker a latent defect is detected the shorter and less expensive a HASS process is for those companies doing HASS. If HALT methods are used to make a robust design, as they should be, then most of the failure mechanisms found in HASS occur in the “infant mortality” region of the life cycle bathtub curve, and the vast majority of these early life failures arise from assignable errors or from wide variations in the manufacturing processes. HASS is an ongoing insurance process. It can be a very useful yet expensive ongoing test process if not coupled with RCA to eliminate the cause of manufacturing defects.
To be sure, “HALT” chambers [multi-axis pneumatic, rapid thermal] are the fastest most efficient chambers capable of applying some of the most useful universal environmental stresses to find latent design errors or manufacturing errors. Yet at the core of the new methodology and philosophy of HALT is a conceptually simple empirical process. The HALT methodology is performed by increasing stress in a controlled application to an operational and sometimes destruct limit. Many useful stresses to find operational and destruct limits do not require a chamber to find limits, or for improving operating margins and comparing limits between samples and products. That is what HALT is.
By Kirk Gray | March 12, 2013 at 11:48 AM EDT | No Comments
Many reliability engineers have discovered HALT will quickly find the weaknesses and reliability risks in electronic and electromechanical systems from the capability of thermal cycling and vibration to create rapid mechanical fatigue in electronic assemblies. Assemblies that have latent defects such as cold solder or cracked solder joints, loose connectors or mechanical fasteners, or component package defects can be brought to a detectable, or patent, condition by which we can observe and potentially improve the robustness of an electronics system. Thermal cycling creates expansion and contraction, stressing interfaces between materials with mismatched thermal coefficients of expansion (TCE). Applying vibration to an assembly, especially the pneumatic repetitive shock of HALT chambers, creates very rapid mechanical fatigue. When Gregg Hobbs, Ph.D., PE created HALT and HASS methods back in the 1980’s, digital systems were not as prevalent and bus speeds were much slower than in today’s electronics. As signal speeds continue to increase and circuit features get smaller in electronics, HALT has a potentially significant additional benefit for signal integrity (SI) and operational reliability during new product development.
Today’s electronics require bus speeds with timing resolution ten times finer than the time it takes light to bounce off your nose and hit your eye, which is about 85 picoseconds. As data bus speeds increase, effects in data transmission that were second and third order effects are now becoming dominant in SI issues. These new variables may be difficult if not impossible to model accurately. The continued decrease in metallization and higher bus frequencies will result in increased sensitivity to fabrication variations. SI issues are likely to become more dominant in reliability of hardware as a result of the continued decrease of metallization and increase in bus speeds. Yet, the effect of these developments on operational reliability may also be more difficult to find and reproduce before thousands or millions are sent to the field.
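As a quick sanity check on the scale involved, the 85 picosecond figure can be converted to a distance and to the implied timing resolution. The sketch uses the vacuum speed of light, which is an approximation for air, and the names are my own.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # vacuum value, used as an approximation for air


def light_travel_distance_cm(seconds):
    """Distance light covers in the given time, in centimetres."""
    return SPEED_OF_LIGHT_M_PER_S * seconds * 100.0


NOSE_TO_EYE_TIME_S = 85e-12                     # figure quoted in the text
DISTANCE_CM = light_travel_distance_cm(NOSE_TO_EYE_TIME_S)
RESOLUTION_PS = 85 / 10                         # "ten times better" than 85 ps

if __name__ == "__main__":
    print(round(DISTANCE_CM, 2), "cm;", RESOLUTION_PS, "ps")
```

So 85 ps corresponds to roughly 2.5 cm of travel, and a resolution ten times finer is 8.5 ps.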
SI failures many times result in marginal operational reliability or “soft failures” where a system can be reset and operate normally. Depending on the frequency of these operational failure events, the user may or may not tolerate their occurrence. When too frequent, intermittent operational failures may result in the system being returned to the manufacturer. The returned system may then be broken down and all subassemblies subjected to failure analysis. When divided up, the subsystems tested will likely be declared “No Fault Found” (NFF), as the marginality may only come from the stack up of parametric variations, or unique environmental conditions of the original system in the end-use environment. To modify an old adage, “If you cannot find what broke, you cannot fix it”, and the cause of the marginality and returns will continue. The result is a churn of “good” parts being returned and “good” parts being sent out to replace them. The returned parts may be sent to a repair depot to be used for repair or replacement. Those returned parts may or may not work with a different system depending on the system’s stack up, but it is likely the manufacturer will never come to know one of the potential real contributors to the high NFF rate. Of course there are many other causes of NFF returns not necessarily related to hardware issues. If the issues come from SI and timing marginality, thermal stress to operational limits can be a very useful tool to discover these issues before mass production.
We know that in mass manufacturing of anything there will be variation in any parameter that is measured. We know that some dimensional variations will occur during mass manufacturing of PWBs, although hopefully the variations are small. Dimensional variations in a PWB can affect impedance, crosstalk, noise, and EMI issues in the system. Dimensional expansion and contraction of the PWBA is of course what induces the thermo-mechanical fatigue damage during thermal cycling that has been a primary focus of HALT and HASS methodology, but the dimensional variations also affect SI quality. We know from the SPC teachings of Dr. W. Edwards Deming that reduction of manufacturing variation is the path to making a defect free product and “six sigma” production capability is the goal. When we design and build a complex high speed digital electronics system we cannot necessarily know how the stack up of all the real future variations in component manufacturing, circuit board fabrication, solder quality, and second sources of these will impact operational reliability. Yet we do know for sure that there will be parametric variations created at all the levels of assembly, and the effect on operational reliability may only be discovered after large numbers are produced and sold.
The challenge of finding marginal operation during early product development is illustrated graphically in Figure 1. Early samples of a new electronics product are typically expensive and scarce and all development teams want the limited samples. The graphic shown on the left side of Figure 1 represents the parametric timing distribution found with a limited number of units. With a small number of units, the parametric variation that could be near the upper and lower limits would likely remain undiscovered before the product is released from development to be manufactured in mass.
Figure 1. Thermal stress skews timings to discover marginal conditions
The graph on the right side illustrates the potential of the larger variation found during mass manufacturing and the higher probability that the stack up of parametric variations could fall near operational limits resulting in soft operational failures.
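The sample-size effect in Figure 1 can be put in rough numbers. The sketch below assumes a normally distributed timing parameter with operational limits a fixed number of standard deviations from the mean. Both the normal shape and the 4-sigma limit are hypothetical choices for illustration only.

```python
import math


def tail_probability(limit_sigmas):
    """Two-sided chance that one unit falls beyond +/- limit_sigmas (normal assumption)."""
    return math.erfc(limit_sigmas / math.sqrt(2.0))


def prob_any_unit_beyond(limit_sigmas, units):
    """Chance that at least one of `units` independent units falls beyond the limits."""
    p = tail_probability(limit_sigmas)
    return 1.0 - (1.0 - p) ** units


if __name__ == "__main__":
    # Hypothetical: limits at 4 sigma, five development samples vs. 100,000 shipped units.
    print(prob_any_unit_beyond(4.0, 5))
    print(prob_any_unit_beyond(4.0, 100_000))
```

Under these assumptions a five-unit development build will almost never contain a marginal unit, while a production run of 100,000 almost certainly will.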
The benefits of the effect of thermal stress in inducing mechanical fatigue to expose mechanical and material weaknesses are well established, but there is another aspect of thermal stimulation that may become more important in the future for assuring reliable operation of high speed digital systems. A little known fact to those who have not performed real thermal HALT on digital electronics is that it almost always ends with finding an operational limit only. It is very rare ever to find a thermal destruct level in digital systems such as IT hardware. Hot and cold thermal stress causes impedance shifts and signal propagation shifts in conductors and semiconductors, resulting in “skewing” of signals throughout the system. This is probably why thermal HALT on most digital systems results in finding an operational limit and not a destruct limit. At the thermal operational limits the SI fails, and a lock up or shut down occurs, but the system can easily be reset when the stress is removed.
The graphic in Figure 2 represents how, using a small number of samples stressed to empirical thermal limits, we can skew the system’s signal propagation timings. Higher temperatures slow signals and cold increases the signal speeds. Through thermal stressing a small number of samples we can observe the thermal hot and cold operating limits, and this can be repeated many times without causing catastrophic damage. Marginal operational reliability may be realized later from worst case stack up of parametric variations in a smaller percentage of products when thousands or millions are produced. As manufacturing volumes ramp up, a wider distribution of parametric variations may then extend near or over the stable operational limit, as previously shown on the right graphic in Figure 1. Of course the stimulation of timing variations using thermal stress on a system moves all the components’ parametric skew to either slower or faster. In the larger mass manufacturing population, the lot to lot and second source variation of parametrics is mixed with high and low speed distributions. Rapid thermal cycling stress found in HALT chambers helps discover more mixing of timing variations by differentially skewing timings across a PWBA. This is created by very fast air temperature transitions producing thermal gradients across the PWBA. Low mass components have higher thermal transition rates than larger mass or high wattage components, resulting in a mix of temperatures across a PWBA. An even more detailed understanding of the risk from variations in timing distributions could be created by individually heating and cooling active components. Individual heating and cooling of components is a good way to isolate a limiting component found during a thermal HALT.
Figure 2. Thermal stress skews signal timings
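The idea behind Figure 2, that thermal stress shifts the whole timing distribution toward a limit so that even a handful of samples reach it, can be sketched with the same kind of simplified model. The normal distribution, the 4-sigma limit, and the 3-sigma thermal shift are all hypothetical values chosen for illustration.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_unit_over_upper_limit(limit_sigmas, shift_sigmas=0.0):
    """Chance one unit's timing exceeds the upper limit after the mean is shifted."""
    return 1.0 - normal_cdf(limit_sigmas - shift_sigmas)


def prob_sample_reveals_limit(limit_sigmas, shift_sigmas, sample_size):
    """Chance at least one unit in a small sample crosses the upper limit."""
    p = prob_unit_over_upper_limit(limit_sigmas, shift_sigmas)
    return 1.0 - (1.0 - p) ** sample_size


if __name__ == "__main__":
    print(prob_sample_reveals_limit(4.0, 0.0, 5))  # no thermal skew applied
    print(prob_sample_reveals_limit(4.0, 3.0, 5))  # mean skewed 3 sigma toward the limit
```

In an actual thermal HALT the stress keeps increasing until the operational limit is reached, so the limit is found by construction. The model only shows why shifting the timings makes a margin observable that five unstressed samples would hide.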
Examples of the benefits of HALT techniques in finding software issues have been documented by Allied Telesis. Donovan Johnson and Ken Franks of Allied Telesis wrote and published a white paper several years ago, “Software Fault Isolation Using HALT and HASS”, on how the use of HALT has benefited their discovery of reliability issues due to software. In the paper they give examples of significantly increasing thermal operational margins and limits from only software changes. It is an excellent documentation of another benefit from HALT in making a reliable system.
The benefits of HALT to find mechanical issues in electronics assemblies have been well established over the last several decades. As the speed and density of electronics continue to increase, operational reliability may be more sensitive to manufacturing variations that result in parametric variations, leading to marginal SI and operational reliability. Many companies have not realized thermal HALT has so much potential as an effective tool for rapid discovery of operational reliability issues, not just catastrophic hardware failures. Along with the traditional established benefits of HALT on hardware weaknesses, there is growing evidence of the benefit of improving operational reliability by using thermal HALT for finding how parametric variations may affect operational reliability when thousands or millions are produced.
By Kirk Gray | March 12, 2013 at 11:45 AM EDT | No Comments
Traditional electronics reliability engineering began during the infancy of solid state electronic hardware. The first comprehensive guide to Failure Prediction Methodology (FPM) premiered in 1956 with the publication of the RCA release TR-1100, "Reliability Stress Analysis for Electronic Equipment", which presented models for computing rates of component failures. The "RADC Reliability Notebook" emerged later in 1959, followed by the publication of a military handbook that addressed reliability prediction, known as the Military Handbook for Reliability Prediction of Electronics Equipment (MIL-HDBK-217). All of these publications and subsequent revisions developed the FPM based on component failures in time for deriving a system MTBF as a reference metric for estimating and comparing the reliability of electronics systems designs. At the time these documents were published it was fairly evident that the reliability of an electronics system was dominated by the relatively short life entitlement of a key electronics component, the vacuum tube.
In the 21st century, active components have significant life entitlements if they have been correctly manufactured and applied in circuit. Failures of electronics in the first five or so years are almost always a result of assignable causes somewhere between the design phase and the mass manufacturing process. It is easy to verify this from a review of root causes of verified failures of systems returned from the field. Almost always you will find the cause is an overlooked design margin, an error in system assembly or component manufacture, or accidental customer misuse or abuse. These causes are random in occurrence and therefore do not have a consistent failure mechanism. They are not in general capable of being modeled or predicted.
There is little or no evidence of electronics FPM correlating to actual electronics failure rates over the many decades it has been applied. Despite the lack of supporting evidence, FPM and MTBF are still used and referenced by a large number of electronics systems companies. FPM has shown little benefit in producing a reliable product, since there has been no correlation to actual causes of field failure mechanisms or rates of failure. It may actually result in higher product costs, as it may lead to invalid solutions based on invalid assumptions (Arrhenius anyone?) regarding the cause of electronics field failures.
It’s time for a new frame of reference, a new paradigm, for confirming and comparing the capability of electronics systems to meet their reliability requirements. The new orientation should be based on the stress-strength interference perspective, the physics of failure, and the material science of electronics hardware.
The new metric and its relationship to reliability are illustrated in the stress-strength graph shown in figure 1. This graphic shows the relationship between a system’s strength and the stress or load it is subjected to. As long as the load is less than the strength, no failures occur.
Figure 1. The Stress-Strength Diagram
In the stress-strength graph in figure 2, failures occur where the two curves overlap, which is anywhere the load on a system exceeds the system’s strength. This relationship is true for bridges and buildings as well as electronics systems.
Figure 2. Stress-Strength Intersection
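As a rough numerical illustration of the overlap region, the sketch below estimates the probability that load exceeds strength. It assumes, purely for illustration, that load and strength are independent and normally distributed. The function name and the example numbers are hypothetical and are not taken from any real product data.

```python
import math


def interference_probability(strength_mean, strength_sd, load_mean, load_sd):
    """Probability that load exceeds strength.

    Assumes independent, normally distributed load and strength
    (an illustrative assumption, not a claim about real hardware).
    """
    margin_mean = strength_mean - load_mean
    # The difference of two independent normals is normal with this spread.
    margin_sd = math.hypot(strength_sd, load_sd)
    z = margin_mean / margin_sd
    # Standard normal CDF evaluated at -z, via the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical upper thermal limits and end-use temperatures, in degrees C.
narrow_margin = interference_probability(90.0, 8.0, 70.0, 10.0)
wide_margin = interference_probability(120.0, 8.0, 70.0, 10.0)
print(f"narrow margin overlap: {narrow_margin:.4f}")
print(f"wide margin overlap:   {wide_margin:.6f}")
```

The point of the comparison is only that moving the strength distribution away from the load distribution shrinks the overlap quickly. Real limit data would rarely be this tidy.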
This relationship between stress, strength and failures correlates with our common sense understanding that the greater the inherent strength a system has relative to environmental stress, the more reliable it will be. Of course the balance is that we must consider the competitive market and build the unit at the lowest cost. What is probably not known to most electronics companies is how strong standard electronic materials and systems can actually be in relation to thermal stress, since so few companies are actually testing to empirical thermal operational limits with thermal protection defeated. Many complex electronics systems can operate from -60°C or lower to +130°C or greater using standard components. Typically it is only one or two components that keep a system from reaching stress levels that are at the fundamental limit of technology (FLT). The FLT is the point at which the design capability cannot be increased with standard materials. Sometimes designs have significant thermal operating margins without modifications, which can be used to produce shorter and more effective combined stress screens such as HASS (Highly Accelerated Stress Screens) to protect against manufacturing excursions that result in latent defects.
In most applications of electronics systems, technological obsolescence comes well before components or systems wear out. For most electronics systems we will never empirically confirm their total “life entitlement”, since few systems are likely to be operational long enough to have “wear out” failures occur. Again it is important to emphasize that we are referring more to the life of solid state electronics and less to mechanical systems, where fatigue and material consumption result in wear out failures.
Reliability test and field data are rarely published, but there is one published study with data showing a correlation between empirical stress operational margin beyond specifications and field returns. Back in 2002, Ed Kyser, Ph.D., and Nahum Meadowsong from Cisco Systems gave a presentation titled “Economic Justification of Halt Tests: The relationship between operating margin, test costs, and the cost of field returns” at the IEEE/CPMT 2002 Workshop on Accelerated Stress Testing (now the IEEE/CPMT Workshop on Accelerated Stress Testing and Reliability, ASTR). In their presentation they showed a graph of differences in thermal stress operational margin versus the normalized warranty return rate on different line router circuit boards, as shown in Figure 3.
The graph shows the correlation between the thermal margin and the RMA (Return Material Authorization) rate, i.e. the warranty return rate. A best fitting curve through this scatter diagram shows a probabilistic relationship between thermal operational margin and warranty returns. It indicates that the lower the operational margin of a board, the higher the probability of its return. Cisco also compared the relationship between the number of parts (on a larger range of product) and the return rate. The graph of that data is shown in figure 4. The relationship between thermal margins and return rates is ten times stronger than the relationship between board parts counts and return rates.
Figure 4. Parts count versus Warranty Return Rate
This makes sense from a stress-strength relationship. No matter how long a chain is, it is only as strong as the weakest link in that chain. No matter how many parts are on the PWBA, the design’s tolerance to variation in manufacturing and end-use stress is dependent on the least tolerant part.
For operational reliability or “soft failures” in digital electronics, the relationship between thermal limits and field operational reliability is less obvious, since again most electronics companies do not discover and therefore do not compare empirical thermal limits with rates of warranty returns. In mass production of high speed digital electronics, the variations in components and PWBA manufacturing can lead to impedance and signal propagation variations (strength) that overlap the worst case stresses in the end use (load), leading to marginal operational reliability. It is very challenging to determine a root cause for operational reliability that is marginal or intermittent, as the subsystems will likely function correctly on a test bench or in another system and be considered a CND (Cannot Duplicate) return. Many times the marginal operational failures observed in the field can be reproduced when the system is cooled or heated to near the operational limit. Heating and cooling the system skews the impedance and propagation of signals, essentially simulating variations in electrical parametrics from mass manufacturing. If companies do not apply thermal stress to empirical limits, they will never discover this benefit or be able to use it to find difficult to reproduce signal integrity issues.
Faster and lower cost reliability testing is becoming more critical in today’s fast pace of electronics development. Most conventional reliability testing that is done to some pre-established stress above spec or “worst case” field stress takes many weeks if not months, and results in minimal reliability data. Finding an electronics system’s strength by HALT methods is relatively quick, typically taking only a week or less, and requires fewer samples. Even if no “weak link” is discovered during HALT evaluations, it always provides very useful variable data on empirical stress limits between samples and design predecessors. Empirically discovered stress limits in an electronics system design are very relevant to potential field reliability, especially thermal stress limits and operational reliability in digital systems. Not only can stress limit data be used for making a business case for the costs of increasing thermal or mechanical margins, but it can also be used for comparing the consistency of strength between samples of the same product. Large variations of strength limits between samples of a new system can be an indicator of some underlying inconsistent manufacturing processes. If the variations are large enough, some percentage of units will fail operationally because the end-use stress conditions overlap the low end of the variation in the product’s strength.
As with any major paradigm shift, a move from using the dimension of time to the dimension of stress as a metric for reliability estimations, there will be many details and challenges yet to be determined on how best to apply it and use the data derived from it. Yet from a physics and engineering standpoint a new reference of stress levels as a metric has a much stronger potential for relevance and correlation to field reliability than the previous FPM with broad assumptions on the causes of field operational and hardware unreliability in current and future electronics systems. If we begin today using stress limits and combinations of stress limits as a new reference for reliability assessments we will discover new correlations and benefits in developing better test regimens and finding better reliability performance discriminators resulting in improving real field reliability at the lowest costs.
By Kirk Gray | March 12, 2013 at 11:40 AM EDT | No Comments
In all aspects of engineering we only make improvements and innovation in technology by building on previous knowledge. Yet in the field of reliability engineering (and in particular electronics assemblies and systems), sharing of knowledge about field failures of electronics hardware and their true root causes is extremely limited. Without the ability to share data and teach what we know about the real causes of “un-reliability” in the field, it is more easily understood why the belief in the ability to model and predict the future of electronics life and MTBF continues to dominate the field of electronics reliability engineering. Please note that I refer to solid state electronics and assemblies. I make a distinction between Failure Prediction Methodologies (FPM) for electronic assemblies (PWBA and systems), which typically have more life than needed (or really known) for most applications, and those for mechanical systems (i.e. motors, gears, switches), which can have a more limited life assignable to friction and wear out and have some potential to be mathematically modeled.
I have been teaching HALT and HASS methods for over 20 years and a common question I have heard many times is “If HALT is so great, why aren’t there more examples published on its benefits?” There are several reasons why details of real HALT case histories, as well as any other actual empirical electronics reliability data, are rarely published.
Three key reasons are:
Competitive advantages of not sharing the most effective reliability practices
Potential legal liability for disclosing real causes of field failures
Engineers have little time to write and publish.
When a company discovers a new product development process that leads to significantly faster times and lower costs to release a mature product to market, they are not likely to tell their competitors. Doing so would cause them to lose the competitive advantages of those new processes. Does your company publish its best methods for product reliability development? So why expect it from any other company?
Legal liability is a huge risk for manufacturers. Failures of electronics systems might lead to significant loss of property or in the worst case human lives. Publishing the cause of electronics failures might provide evidence of liability leading to costly judgments for the product designer or manufacturer. For this reason, most companies will never voluntarily allow failure data to become public. Reliability engineers that may want to help the industry by publishing the real failure data typically face many challenges to have the legal departments give permission to publish. Even if they are able to publish something on actual reliability the paper has so much redacted and “sanitized” for public disclosure that the most significant data may not be published.
It takes time to write and publish any technical paper. In today’s challenging economy many electronics companies have trimmed engineering departments down to the bone. Engineers are challenged to find enough time in the day to complete projects and meet timelines. Few are motivated to take the extra time necessary, and face their companies’ legal obstacles, to publish the case histories of real field failures of electronics. Without engineers being willing or able to publish the real case histories, details on the root causes of the failures, and the best methods to prevent them, little can be expected in the advancement of the science of electronics reliability development and testing.
As I have stated many times in previous blogs, if reliability engineers really looked at the root causes of field failures in their own products, they would see the same confirming evidence that in general the reliability of an electronics system cannot be predicted from statistics and probabilities. In the 22 years I have spent working with companies to improve reliability of electronics hardware I have seen many root causes of field unreliability in electronics systems. I do not recall ever seeing a wear out mode of a nominal electronic component as a cause of field failures. It has always been an overlooked design margin (which may appear to be an early wear out), misapplication of a component, an error in manufacturing, or misuse or abuse by the user. If I could publish the real evidence, data and root causes of all the failures of electronics in the wide variety of equipment and companies I have worked with, or heard about through colleagues in the field, the case for using empirical stress limit methods of finding weaknesses would be much clearer.
The conditions that limit the sharing of real field reliability data are not likely to change in the foreseeable future. This is why many companies are still doing FPM fundamentally derived from the invalid MIL-HDBK-217 methods and still use the meaningless term MTBF. While statistical or probabilistic methods are used for many valid engineering design and analysis applications, they have few applications to predicting the random errors and combinations of events that cause failures in solid state electronics. Most electronics systems become technologically obsolete before any inherent wear out modes cause failures.
We still can make progress in the field of electronics reliability, but we must validate the methodology and results on an engineering basis. We must use our knowledge of physics and material science in electronics, as well as the lessons that have been learned from the causes of real field failures. We must make sure when electronics reliability is developed now and in the future that there is traceability and reference to the physics of the failure mechanisms. For instance, traditional electronics FPM has used the Arrhenius equation and the broad assumption of 0.7 eV for the activation energy in silicon components as a major factor in predicting the rate of failure of a component. This belief continues today, even though there is little or no evidence of traceability to physical mechanisms in today’s components and it has been over 16 years since MIL-HDBK-217 was removed as a DoD reference document.
“A new scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die and a new generation grows up that is familiar with it”-Max Planck, Scientific Autobiography
When we find a weakness in an electronics system through stepped-stress methods, we should know enough about the materials to know whether the weakness is due to a fundamental limit of technology (FLT), such as the melting of plastics or solder or the limits of LCD operation at temperature, or whether the weakness is due to the in-circuit application of a particular component. After uncovering the causes, we can understand what physics drove the failure and which element to change to increase the system’s strength or capability. Usually it is only necessary to strengthen one or two “weak links” to bring a product’s strength up to the FLT. Sometimes those weak links are software, not hardware, and changing code may be the only change necessary to add significant thermal strength capability and margins. Occasionally the system as designed and built reaches the stress FLT with no change needed, and this becomes a benchmark for subsequent designs. If you are not testing to empirical stress limits, though, you will never find out.
The causes for the inherent limits in sharing field and test lab reliability data are not likely to change anytime soon. Yet, we can change our orientation and approach to electronics reliability development. Realizing the random and unpredictable nature of most electronics failures in the first 5-7 years of use would result in a major shift in the activities of many companies developing electronics systems. It is a change that is slowly being adopted by more companies, but they are not going to spread the word to competitors.
The development of reliability based on empirical stress limits still has a long way to go before it becomes the dominant electronics reliability engineering development paradigm and activity. The limited data available on your own products’ failures, together with physics and material science, must be the basis for validating the use of empirical stress limit methodologies. The need for faster reliability development demands it. I have seen the evidence, though I cannot share most of it, that HALT discovery methods have remained valid through the rapid changes in materials and manufacturing processes in electronics components and systems, and I am confident they will remain valid in the future.
By Kirk Gray | March 12, 2013 at 11:38 AM EDT | No Comments
When we go to an automobile race such as the Indianapolis 500, watching those cars circle the track can get fairly boring. What goes unspoken is that everyone observing the race is watching for a race car to find and sometimes exceed a limit, finding a discontinuity. The limit could be how fast the driver enters a curve before the acceleration forces exceed the tires’ coefficient of friction, or how close to the racetrack wall he can be before he contacts it and spins out of control. Using the race analogy, time trials before the race are like the design phase of electronics products, where only one race car is on the track, and manufacturing is like the race, where there are many cars and consistent control of each is required to have an “accident free” race.
Of course no one wants to see a driver injured or killed at these events, but watching cars circle the track without incident is fairly boring. The same is true in testing of electronics hardware and software. Highly Accelerated Life Testing (HALT) is fairly boring until an empirical limit, a discontinuity, is discovered. Fortunately engineers are not injured or killed discovering empirical stress limits in HALT evaluations of electronics systems.
Formula One Racing - Pushing the Limits
HALT methodology really is a limit discovery tool, not a pass-fail test. Some of the most useful data lies near and at the empirical operational limits, not the theoretical or specified limits. It is the fastest way of finding weaknesses and of comparing electronics systems designs. Observing wide differences in operational limits between samples of the same product provides evidence that inconsistent manufacturing processes in one or more components are affecting the system. If the deviation is large enough, the variation will probably affect operation of a small percentage of units at field use conditions. Discovery of variable empirical limits across multiple samples can be a discriminator for the quality of component and assembly process consistency. Wide deviation of operational limits between identical system samples is a good indicator of uncontrolled, possibly unknown, process variation that if wide enough will lead to failures in the intended use environment. Even if the number of units compared is not statistically significant, wide differences in limits are good qualitative indicators of reliability risks.
Stress testing at well below the operational limits, even though it may be well beyond the end use specifications, provides only very limited data on the product’s strength capability. Testing to only those “margins above spec”, if not close to the empirical stress limit, is just like watching a car race with a 120 mph speed limit. Some probability exists that a race car in this speed limited type of race could have a failure, and some cars would have failures and “lose” the race. Still, failure would likely be rare, most of the vehicles would tie for the win, and little differentiating information would be available for improving handling, durability or reliability over the competing cars. The analogy holds for typical reliability testing: the speed limited race cars are still driven much faster (at higher stress) than most cars are driven, just as most accelerated reliability testing of electronics is performed at higher stresses than most systems will be exposed to in their useful life, and some percentage do fail in these milder but above spec stress conditions.
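To make the sample-to-sample comparison concrete, here is a minimal sketch that summarizes the spread of empirical upper operating limits across a handful of units. The limit values, the specification temperature and the function name are all hypothetical, chosen only to illustrate the comparison. With so few samples the result is a qualitative flag, not a statistical estimate.

```python
import statistics


def summarize_limits(limits_c, spec_limit_c):
    """Summarize empirical upper operating limits (degrees C) across samples."""
    return {
        "mean": statistics.mean(limits_c),
        "stdev": statistics.stdev(limits_c),  # sample standard deviation
        "spread": max(limits_c) - min(limits_c),
        # Margin of the weakest sample above the specified limit.
        "worst_margin": min(limits_c) - spec_limit_c,
    }


# Hypothetical limits found on five samples of two designs, both specified to 70 C.
consistent_design = [118.0, 121.0, 119.0, 122.0, 120.0]
inconsistent_design = [121.0, 96.0, 119.0, 84.0, 120.0]

for name, limits in (("consistent", consistent_design),
                     ("inconsistent", inconsistent_design)):
    summary = summarize_limits(limits, spec_limit_c=70.0)
    print(name, {key: round(value, 1) for key, value in summary.items()})
```

Both hypothetical designs pass at the 70 C specification, so a pass-fail test would not separate them. The wide spread and small worst-case margin of the second set is the kind of difference that only shows up when the limits themselves are recorded.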
So why not test to empirical operation, and sometimes destruct, limits (i.e. HALT)? It is the quickest way to get useful data on product weaknesses. Why do so many resist testing electronics systems to empirical stress limits of voltage, temperature, vibration, shock, and other stresses that provide data on what the ultimate stress capability is? Here are just some of the reasons given in the last couple of decades:
1. Product failures above specified component stress specifications are “foolish failures”
2. Products in the field will never be subjected to those stress levels
3. The product is too expensive to destroy the samples
To briefly answer those reasons:
1. All components have margins above specification, and functional margins are very dependent on a component’s application in the design, not on individual component specifications. Why assume any failure is foolish before finding it? Not testing to the operational strength of the actual product is leaving what could be valuable data (and ultimately money) on the table.
2. The product may not see the instantaneous stress levels used in the tests, but the cumulative fatigue damage of lower field stresses has a high probability of failing the same weakness in the design that is found at the destruct limits.
3. How expensive is a product failure to the company and its customers? Finding the weaknesses in a test lab almost always costs less than the lost sales and warranty costs when a latent defect or weakness reaches customers. There is a risk in all testing, and to find weaknesses at limits you risk catastrophic damage. Perhaps the resistance is because many believe that finding empirical limits results in a pile of melted solder, components and plastics. In digital systems it is actually very difficult to destroy a system while finding thermal empirical operating limits, because parametric shifts cause signal integrity failures that stop operation before the hardware is damaged. Vibration, on the other hand, will eventually cause a hard failure, where the operational limit is also a destruct limit. In any case, many times the unit can be repaired and re-used for additional testing.
In the reliability development of a new product we are somewhat like a person in an unfamiliar dark room. We really don’t know how big the room is until we bump into a wall, and actually several walls, to define the available space in the room. In electronics testing, until we find the actual empirical limits of stress, we do not know what the actual “stress” space is that can be used to find marginal functional or material issues. The larger the stress space, the faster we can find the strength “entitlement” and use that strength to find the one or two weaknesses in an electronics product that puts overall reliability at risk.
Just like the title of a song by the rock group the Eagles, we should in testing “Take it to the Limit” to fully benefit from each sample of electronics systems we test. You will find it takes fewer units, less time and money to find the few elements in a design that really could impact field reliability.
By Kirk Gray | March 12, 2013 at 11:34 AM EDT | No Comments
Most reliability engineers are familiar with the life cycle bathtub curve, the shape of the hazard rate or risk of failure of an electronic product over time. A typical electronics life cycle bathtub curve is shown in figure 1.
Figure 1. Typical Electronics Life Bathtub Curve
The origination of the curve is not clear, but it appears that it was based on the human life cycle rates of death. In human life cycles we have a high rate of death due to the risks of birth and fragility of life during that time. As we age, the rates of death decline to a steady state level until we age and our bodies start to wear out. Just as medical science has done much to extend our lives in the last century, electronic components and assemblies have also had a significant increase in expected life since the beginning of electronics when vacuum tube technologies were used. Vacuum tubes had inherent wear out failure modes that were a significant limiting factor in the life of an electronics system.
During the days of vacuum tubes, wear out of the tubes and other components was the dominant cause of field failure. Although errors in the design or manufacturing probably contributed to field failure rates, the mechanical fragility and limited life of vacuum tubes dominated the causes of system failures within a few years of use.
Traditional electronics reliability engineering and failure prediction methodology (FPM) has in its foundation the concept of the life cycle bathtub curve. The declining hazard rate region is called the “infant mortality” region. The wear out failure modes of electronics results in the increasing hazard rate represented at the “back end” of the bathtub curve. The concept of the curve has been used as a guide for “burn-in” testing and, of course, for establishing the misleading and meaningless term, MTBF.
In order to make a life prediction for an electronics assembly some assumptions must be made regarding the quality and consistency of the manufacturing process as well as assumptions on the distribution of life cycle stresses. To create models for electronics devices we must assume that the manufacturing process is capable and parts being produced are from the center of the normal distribution. We also must make assumptions about the frequency and distribution of the life cycle stresses. It is difficult to account for variations at all the manufacturing levels in models without creating significant complexity. The same applies to accounting for variation in life cycles stresses the product population will be subjected to.
Today’s electronics components, especially semiconductors, have an inherent “life entitlement” that is nearly infinite relative to the technologically useful life of the systems they are in. We will not likely empirically determine the life of the vast majority of electronics components and systems because technological obsolescence occurs relatively quickly. The pace of significantly better performance and features in electronics systems is not likely to slow down, and therefore the rate of technological obsolescence will not slow down. Of course there are some exceptions in electronics, such as in energy systems. Cost justifications for wind and solar energy systems are based on 25 years of use.
The Drain in the Bathtub Curve
Using the same life-cycle “bathtub” curve analogy, technological obsolescence is the “drain” that comes before the back end (wear out) of the life cycle occurs. Figure 2 is a graphical representation of the bathtub curve with the “drain”. The drain is the point in time at which electronics are replaced due to technological obsolescence. Because of this technological obsolescence drain, the back end of the bathtub curve is a relatively small contributor to the overall costs of failure for customers and manufacturers in a very large percentage of electronics. Obsolescence is especially rapid in consumer and IT hardware. The infant mortality region at the front end is where most manufacturers and customers realize the costs of poor reliability development programs, and it is what determines future purchases for most consumers.
The causes of failures during early production are mostly poor or overlooked design margins, errors in manufacturing of the components or system, or abuse (sometimes accidental). Precise data on rates and costs of product failures is not easily found, as reliability data is very confidential, but most who deal with new product introductions realize that most costs of unreliability come from the front end of the bathtub curve and not much from the wear out end of the curve. Poor reliability may result in total loss of market share in a competitive market. The back end of the bathtub curve is for the most part irrelevant in the case of high rates of failure, as an electronics company may not be in business long enough for technological obsolescence to be a factor, much less “wear out” failures.
The electronics industry in the last few decades has been misdirected by the belief that life in an electronics system can be calculated and predicted from component failure rates and models of some failure mechanisms (i.e. MIL-HDBK-217), although there has been no empirical evidence of any correlation of most predictions to field failure rates.
The vast majority of the costs of failures for almost all electronics manufacturers come in the first few years of a product’s life, some within the warranty period but also for a few years past that. It is the customer’s experience of reliability that determines the perceived quality and reliability of the manufacturer and future purchases from that manufacturer. The costs of lost future sales may be even greater than warranty costs, but since they are difficult to quantify they may never be known.
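The shape of the curve and the position of the drain can be sketched numerically. The example below is only an illustration of the analogy: it builds a bathtub-shaped hazard rate from a declining Weibull term, a constant term and a rising Weibull term, then marks which ages fall after a replacement time. Every parameter value, including the 5 year drain, is a made-up assumption and is not fitted to any field data.

```python
def weibull_hazard(t_years, shape, scale_years):
    """Weibull hazard rate at age t_years (t_years must be > 0)."""
    return (shape / scale_years) * (t_years / scale_years) ** (shape - 1)


def bathtub_hazard(t_years):
    """Illustrative bathtub curve: infant mortality + random + wear out terms."""
    infant = weibull_hazard(t_years, shape=0.5, scale_years=2.0)    # declining
    random_rate = 0.02                                              # flat
    wearout = weibull_hazard(t_years, shape=5.0, scale_years=15.0)  # rising
    return infant + random_rate + wearout


DRAIN_YEARS = 5.0  # hypothetical age at which units are replaced as obsolete

for age in (0.1, 1.0, 3.0, 5.0, 10.0, 20.0):
    status = "in service" if age <= DRAIN_YEARS else "already replaced"
    print(f"age {age:5.1f} y  hazard {bathtub_hazard(age):.3f} per year  ({status})")
```

With these made-up numbers the rising wear out term only becomes visible at ages well past the drain, while the front end of the curve sits entirely inside the period the units are actually in use.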
Most of the causes of failures are attributable to assignable causes previously mentioned. Traditional reliability engineering is mostly based on FPM and for most electronics design and manufacture companies the majority of reliability engineering resources have been spent on creating probabilistic estimates of the life entitlement of a system and the back end of the bathtub curve. There is little evidence of reducing the costs of unreliability in most electronics products because it occurs in the first several years due to assignable random causes, and not wear out.
A much greater return on investment in reliability development can be realized when the industry understands that most reliability failures in the first few years of use are not intrinsic wear out, but instead result from random errors in design and manufacturing. Reliability engineering must reorient to spend most of its development time and resources on better accelerated stress tests, using better reliability discriminators (prognostics) to detect errors and overlooked low design margins and eliminate them before market introduction. With this new orientation, electronics companies can be the quickest and most effective at developing a reliable product at market introduction, at the lowest cost.
By Kirk Gray | March 12, 2013 at 11:33 AM EDT | No Comments
Posted February 10, 2012
Historically, reliability engineering of electronics has been dominated by the belief that 1) the life of hardware, or the percentage of hardware failures that occur over time, can be estimated, predicted, or modeled, and 2) reliability can be calculated or estimated through statistical and probabilistic methods in order to improve hardware reliability. The amazing thing is that during the many decades in which reliability engineers have been taught this and have believed it to be true, there has been little if any empirical field data from the vast majority of verified failures that shows any correlation with calculated predictions of failure rates.
Probabilistic statistical predictions based on broad assumptions about the underlying physical causes began in November 1956 with the first electronics reliability prediction guide, RCA release TR-1100, "Reliability Stress Analysis for Electronic Equipment", which presented models for computing rates of component failures. This publication was followed by the "RADC Reliability Notebook" in October 1959, and then by the publication of a military reliability prediction handbook format known as MIL-HDBK-217.
The approach continues today in various software applications that are descendants of MIL-HDBK-217. Underlying these “reliability prediction assessment” methods and calculations is the assumption that the main driver of unreliability is components that have intrinsic failure rates moderated by the absolute temperature. It has been assumed that component failure rates follow the Arrhenius equation and that they approximately double for every 10 °C increase in temperature.
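To make the assumption concrete, here is a minimal sketch of the arithmetic behind it. It is not taken from MIL-HDBK-217 or from any prediction tool; the function names are hypothetical and chosen for illustration. It uses the standard Arrhenius rate ratio exp((Ea/k)(1/T1 - 1/T2)), with temperatures in kelvin, and solves for the activation energy Ea at which the rate exactly doubles over a 10 °C step.

```python
import math

# Boltzmann constant in electron-volts per kelvin.
BOLTZMANN_EV_PER_K = 8.617333262e-5


def celsius_to_kelvin(t_c):
    """Convert a temperature from degrees Celsius to kelvin."""
    return t_c + 273.15


def arrhenius_acceleration(ea_ev, t_low_c, t_high_c):
    """Ratio of Arrhenius rates at t_high_c versus t_low_c.

    ea_ev is an assumed activation energy in eV (illustrative input,
    not a value from any handbook).
    """
    t_low = celsius_to_kelvin(t_low_c)
    t_high = celsius_to_kelvin(t_high_c)
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_low - 1.0 / t_high))


def activation_energy_for_doubling(t_low_c, step_c=10.0):
    """Activation energy (eV) at which the rate exactly doubles
    when the temperature rises from t_low_c by step_c degrees."""
    t_low = celsius_to_kelvin(t_low_c)
    t_high = celsius_to_kelvin(t_low_c + step_c)
    return BOLTZMANN_EV_PER_K * math.log(2.0) * t_low * t_high / (t_high - t_low)


if __name__ == "__main__":
    for start_c in (25.0, 55.0, 85.0):
        ea = activation_energy_for_doubling(start_c)
        print(f"{start_c:.0f} C -> {start_c + 10:.0f} C: doubling needs Ea = {ea:.3f} eV")
```

Under these assumptions the "doubles every 10 °C" rule is not a single property of the Arrhenius equation: it corresponds to an activation energy of roughly 0.55 eV for a step starting at 25 °C and roughly 0.79 eV for a step starting at 85 °C. The rule of thumb therefore fixes both a failure mechanism and a temperature range in advance.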
MIL-HDBK-217 was removed as a military reference document in 1996 and has not been updated since that time. It is still referenced unofficially by military contractors and is still believed to have some validity, even without any supporting evidence.
Electronics reliability engineering has a fundamental “knowledge distribution” problem: real field failure data, and the root causes of those failures, can never be shared with the larger reliability engineering community. Reliability data is some of the most confidential and sensitive data a manufacturer has, and short of a court order it will never be published. Without this information being disseminated and shared, little changes in the beliefs of the vast majority of the engineering community.
Even though the probabilistic prediction approach to reliability has been practiced and applied for decades, any engineer who has seen the root causes of verified field failures will observe that almost all failures that occur before the electronic system is technologically obsolete are caused by 1) errors in manufacturing, 2) overlooked design margins, or 3) accidental overstress or abuse by the customer. The timing of these root causes, which many times involve multiple events, is random and inconsistent. Therefore there is no basis for applying statistical or probabilistic predictive methods.
It is long past time for electronics design and manufacturing organizations to abandon these invalid and misleading approaches and acknowledge that reliability cannot be estimated from assumptions and calculations. A more effective approach is to put most of your reliability engineering resources into stress testing that finds, as quickly as possible, 1) errors in manufacturing, 2) overlooked design margins, or 3) the weakest link that may cause failure in the roughest environment a customer may subject the product to.