Uptime Engineering has developed a comprehensive set of methods to support and establish a powerful reliability process. Its applied statistics and Physics of Failure methods are grounded in broad product development expertise. This methodology is used to establish a uniform reliability strategy from product concept through fleet operation.
How to get information out of data
Statistics delivers powerful methods to uncover correlations and describe dependencies in large data sets. Applied statistics in reliability engineering includes reliability and lifetime analysis, correlation, regression and variance analysis, as well as graphical data representation techniques.
Statistical methods are used in the analysis of series failures to cross-check hypotheses on potential root causes. Combining statistics with Physics of Failure is superior to the use of statistics alone for problem analysis and understanding. In fleet monitoring, statistical analyses of time series data are primarily used to detect deviating system behaviour. System models describing the regular behaviour can also be purely statistical.
Applied statistics deliver a powerful set of data analysis methods for finding correlations and identifying deviating system behaviour. Both tasks are frequently required during product development and operation.
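Lifetime analysis is a typical example of such a method. The following is a minimal sketch of a two-parameter Weibull fit via median-rank regression; the failure times are hypothetical values chosen for illustration.

```python
import numpy as np

def weibull_mrr(times):
    """Fit a 2-parameter Weibull distribution to failure times via
    median-rank regression (a common least-squares fitting method)."""
    t = np.sort(np.asarray(times, dtype=float))
    n = len(t)
    ranks = np.arange(1, n + 1)
    # Bernard's approximation of the median rank
    F = (ranks - 0.3) / (n + 0.4)
    # Linearise: ln(-ln(1-F)) = beta*ln(t) - beta*ln(eta)
    x = np.log(t)
    y = np.log(-np.log(1.0 - F))
    beta, intercept = np.polyfit(x, y, 1)
    eta = np.exp(-intercept / beta)
    return beta, eta

# Hypothetical failure times (hours) from a durability test fleet
beta, eta = weibull_mrr([310, 480, 560, 700, 850, 1020])
print(f"shape beta = {beta:.2f}, scale eta = {eta:.0f} h")
# beta > 1 indicates wear-out behaviour, beta < 1 infant mortality
```

The shape parameter immediately tells the engineer whether the failures look like wear-out or infant mortality, which is exactly the kind of information extraction described above.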
How to assess actual product usages under various conditions
Product development should be performed with respect to the future product usage in different applications and markets, and by various customers with individual usage behaviours. The assessment of this usage variety is based on representative statistics of field data. Via principal component analysis (PCA), the set of available, aggregated information is projected onto a much smaller set of independent variables. PCA is a well-proven statistical method which allows a simpler interpretation of usage variability. Additionally, it may reveal that the usage of a product is not as heterogeneous as expected.
A Usage Space Analysis assesses the various product usage conditions from a load point of view. In addition, it allows the quality of durability tests to be determined relative to customer usage. Further, it indicates outlier applications, which require a large amount of dedicated testing without synergy for other applications.
The usage space links testing to the diversity of load. It is therefore most relevant for keeping test volumes and test variety as low as possible without spoiling the quality of a test programme. It helps to assess the cost/benefit relation of outlier applications. Usage Space Analysis can also assess additional test requirements whenever markets, load profiles, product specifications or similar target parameters change.
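To illustrate the PCA step, the following sketch builds a small synthetic usage data set in which four aggregated indicators are driven by two latent factors, and checks how much variance the leading components explain. All indicator names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical aggregated usage indicators per application, e.g.
# annual operating hours, mean load factor, start/stop count, ambient temp.
# Two latent factors drive all four columns, so PCA should find that
# two components explain almost all of the variance.
factors = rng.normal(size=(40, 2))
mixing = np.array([[1.0, 0.2], [0.8, -0.3], [0.1, 1.0], [-0.2, 0.9]])
usage = factors @ mixing.T + 0.05 * rng.normal(size=(40, 4))

# PCA via eigendecomposition of the covariance of standardised data
z = (usage - usage.mean(axis=0)) / usage.std(axis=0)
eigvals = np.linalg.eigvalsh(np.cov(z, rowvar=False))[::-1]
explained = eigvals / eigvals.sum()
print("explained variance ratios:", np.round(explained, 3))
```

If the first few ratios dominate, the usage variety is effectively low-dimensional, which is the "not as heterogeneous as expected" finding mentioned above.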
How to evaluate damage kinetics for various duty cycles
Physics of Failure (PoF) is a science-based modelling approach to quantify the damaging effects of all kinds of operating conditions on materials, components or technical systems. A PoF model library covers a wide range of failure modes, including, amongst others, fatigue, wear, aging and corrosion. A comprehensive set of models can be found in the scientific literature (Arrhenius, Woehler, Manson-Coffin, Norton, etc.). Moreover, PoF is also a useful concept for the development of dedicated models for specific failure modes. PoF models are transparent: their background can be understood, and the relation between the input and the result is evident. This is most helpful in practical application, e.g. when the applicability of a model has to be assessed.
PoF is used throughout the product life cycle, supporting design, validation, manufacturing, monitoring and maintenance. During the design phase, PoF is used for the systematic investigation of robustness against various failure risks. Product validation is based on PoF models, which deliver quantitative test assessment results. These are used for the optimisation of test procedures with respect to various types of failure risks (Damage Calculation). Manufacturing-related risks are mitigated via sensitivity analyses in order to specify quality scattering limits. Risk-focused fleet monitoring relies on PoF as an indicator of the damage-driving conditions and events to be observed.
Standard PoF models describe the kinetics of damaging mechanisms in a generic form. Thus, a PoF model library is compact, and standard models can be applied to various particular cases in multiple applications. PoF models are transparent: model assumptions and limitations are easy for engineers to assess. The application of PoF models delivers quantitative results, which is particularly relevant in the context of decision making.
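As an example of such a transparent standard model, the following sketch computes the Arrhenius acceleration factor between a test temperature and a use temperature for a thermally activated mechanism. The activation energy and temperatures are illustrative assumptions, not values from the text.

```python
import math

K_B = 8.617e-5  # Boltzmann constant in eV/K

def arrhenius_af(ea_ev, t_use_c, t_test_c):
    """Acceleration factor between a test temperature and a use
    temperature for a thermally activated mechanism (Arrhenius model)."""
    t_use = t_use_c + 273.15   # convert to Kelvin
    t_test = t_test_c + 273.15
    return math.exp(ea_ev / K_B * (1.0 / t_use - 1.0 / t_test))

# Hypothetical aging mechanism with 0.7 eV activation energy:
af = arrhenius_af(0.7, t_use_c=55, t_test_c=105)
print(f"acceleration factor: {af:.1f}")
```

The model's transparency is evident here: every input has a physical meaning, so the applicability of the result (e.g. whether the mechanism really is thermally activated) can be assessed directly.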
How to evaluate the effect of duty cycles
Damage calculation derives damage accumulation from load histories. This is executed via Physics of Failure models applied to time series load data. Load can be of a mechanical, thermal, chemical or electrical nature. Thus, a comprehensive set of input data is necessary to cover all load aspects at critical locations within a product. As measurement at critical locations is often not possible, local load is frequently generated from global load data via transfer functions (“virtual sensors”), or substitute values are taken. Damage calculation is based on load spectra. If, in addition, the load carrying capacity is known, this method can also be used to evaluate the remaining useful lifetime.
Calculation with a set of models applied to a single measured time series allows the assessment of a load situation, a duty cycle or a test condition with respect to various damage mechanisms. In addition, the relative damage intensity of load histories with respect to various types of load can be assessed. This is particularly useful for optimising test procedures relative to reference duty cycles and with respect to certain failure modes. The comparison of test conditions with target customer or reference operating conditions is another purpose of damage calculation programmes.
Damage calculation delivers a quantitative comparison of damage kinetics for various load histories. The methods are adaptable to the available input data. High accuracy can be achieved by System Response Modelling or measurement, which can be performed during Design Verification with limited extra effort.
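A minimal sketch of such a damage calculation is shown below, using linear (Palmgren-Miner) accumulation against a Woehler line. The load spectrum and material parameters are illustrative assumptions.

```python
def miner_damage(stress_cycles, sd_ref, nd_ref, woehler_k):
    """Linear damage accumulation (Palmgren-Miner) against a Woehler
    line N(S) = nd_ref * (sd_ref / S)**k. Damage 1.0 means end of life."""
    damage = 0.0
    for stress, n in stress_cycles:
        n_allow = nd_ref * (sd_ref / stress) ** woehler_k
        damage += n / n_allow
    return damage

# Hypothetical load spectrum: (stress amplitude in MPa, cycle count)
spectrum = [(200.0, 1e4), (150.0, 1e5), (100.0, 1e6)]
d = miner_damage(spectrum, sd_ref=100.0, nd_ref=2e6, woehler_k=5.0)
print(f"accumulated damage: {d:.3f}")
```

Running the same routine on two different load histories gives exactly the quantitative comparison of damage kinetics described above; if the load carrying capacity is known, the remaining damage budget translates into remaining useful lifetime.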
How to develop and optimise test procedures for various purposes
Testing simulates the intended product usage in order to reveal design weaknesses. As both time to market and the related costs are critical resources, tests have to be efficient, i.e. they must deliver a maximum of information within a given time period. Two types of tests should be intelligently combined in order to achieve this goal:
Accelerated testing (HALT: Highly Accelerated Life Testing; HASS: Highly Accelerated Stress Screening) is possible for well-understood failure mechanisms. Prototype components allow testing to start at a very early stage of product development.
Representative testing aims to cover all possible failure modes – including unknown phenomena. The complete system should be subjected to various realistic operation modes and boundary conditions.
Tests are designed based on the customer usage patterns of the product, as identified in the Usage Space Analysis. In general, component testing is allocated to suppliers with supervision and guidance from the OEM. Test acceleration is achieved via the level, the frequency or the type of load. The effects of test candidates on various failure mechanisms are evaluated via Damage Calculation in order to avoid over- or under-testing.
A well-balanced sequence of accelerated and representative tests provides the highest efficiency of a Validation programme. Accelerated tests demonstrate component maturity at early stages of product development, which allows for problem solving at acceptable costs. Subsequent representative tests then show lower failure rates and lead to a much stronger reliability demonstration in the final step of the Validation programme.
How to identify tasks to achieve top reliability
A system (or top-down) risk assessment has to identify various types of risks, including technical, temporal, organisational, manufacturing and service-related risks. This is useful for systematic risk mitigation. It requires comprehensive knowledge about the supplier base, usage modes and conditions, applications, lifetime expectations, markets, etc.
Complementary bottom-up methods are useful for the detailed assessment of both functional and reliability risks.
A system risk assessment delivers the requirements for the overall risk management during the product development phase. Component based bottom-up methods deliver the requirements for the design verification and validation plan and the basis for their assessment.
Comprehensive risk analysis delivers the basis for a systematic product development, outlined in the Product Verification and Validation programme. Risk analysis not only specifies the requirements for risk mitigation, but also identifies target conflicts to be addressed at an early stage of product development.
How to understand failure mechanisms and damage drivers
Fault Tree Analysis (FTA) is a standard reasoning method for understanding how a technical system or a component might fail. It starts from potential failure evidence and derives the corresponding root causes and failure mechanisms. FTA workshops build on the implicit knowledge of domain experts and make it explicit. Moreover, extended FTA identifies failure-driving conditions, corresponding observables, failure indicators and critical properties. These extensions are necessary to make FTA useful in the context of product development and monitoring.
Risk reduction is the general purpose of FTA, and it can be realised in several ways.
FTA delivers a transparent and well-structured knowledge base. If it is used as a standard tool, a corporate reference on reliability risks evolves, which is improved and extended with each analysis. As FTA results can generally be transferred to similar components, this approach uses experts' time quite effectively. The various applications of FTA results to test assessment, programme optimisation, quality specification and fleet analysis form a cornerstone of a robust reliability process.
How to define useful objectives for component validation
Reliability Targeting is used to derive component targets from the specified target of a complete system. The standard method, based on the system block diagram, delivers reasonable results for relatively simple assemblies. However, a rising number of components considered in this approach quickly leads to unrealistically and unnecessarily high demonstration targets. Reliability Targeting takes the factors “engineering experience” and “engineering knowledge” into account by using the results of Risk Assessments to reduce targets for low-risk sub-systems. This method provides testing volumes that are actually achievable.
Reliability Targeting is used for test effort allocation according to the risk distribution among the different sub-systems.
Testing volume is concentrated on the risk focus items. The reliability targets remain demonstrable in practice, and the overall efficiency of testing is optimised.
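A minimal sketch of risk-weighted target allocation for a series system is shown below. The weighting scheme and all numbers are illustrative assumptions, a simple stand-in for the experience-based target reduction described above.

```python
import math

def allocate_targets(r_system, risk_weights):
    """Allocate a system reliability target to components in series.
    The allowed system unreliability is split proportionally to the
    risk weight of each component (higher weight = more allowance)."""
    q_system = 1.0 - r_system
    total = sum(risk_weights)
    return [1.0 - q_system * w / total for w in risk_weights]

# Hypothetical 4-component series system, target R = 0.95;
# component 3 is the risk focus, components 1 and 4 are proven designs.
targets = allocate_targets(0.95, [1.0, 2.0, 6.0, 1.0])
print([round(t, 4) for t in targets])
r_achieved = math.prod(targets)
print(f"resulting system reliability: {r_achieved:.4f}")
```

Low-risk components receive easier (higher-reliability but lower-effort) targets that are already credible from experience, while the demonstration effort is concentrated on the risk focus item.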
How to allocate effort and staff for highest product reliability
Validation is the demonstration of product reliability and lifetime. It is a very time-consuming and costly process which requires various inputs. A detailed Risk Assessment generates targets. The Usage Space is needed for transparency on the heterogeneity of duty cycles. Damage Models deliver risk-related test assessments.
The validation of product families requires a method for the transfer of test results to other variants, a major advantage of parts commonality.
Fig.: Selected failure modes are addressed by dedicated component tests to validate component maturity. Total validation results from further contributions on all levels of integration. A failure mode specific evaluation allows for mitigation of weak points.
Programme optimisation is based on a test hierarchy in line with system integration. Some failure modes can be addressed on component level. Early, fast and cost effective testing delivers mature components for the integration into modules, where interaction is in the focus of testing. Finally, the complete mechatronic system is tested in a representative mode to uncover unknown failure risks.
After programme planning, controlling of the actual target demonstration is used to issue recommendations whenever failure cases, delayed activities or any other changes put target achievement at risk.
Transparency on validation activities is generated, and test contributions with respect to relevant risks are quantified. Thus, optimisation of effort allocation becomes feasible. It results in a work split along the supply chain, establishing validation responsibilities. Programme optimisation delivers reliability demonstration for various product variants with diverse usage conditions.
Products are validated with respect to future customer operation. Heterogeneous usage modes are addressed by one common programme, which is optimised to cover all failure risks to the required extent. Suppliers are integrated for the maturity demonstration of components. The focus of an OEM is the integration and validation of product variants. The synergy of parts commonality, the common validation of product variants, becomes feasible.
How to measure progress in reliability demonstration
Testing for reliability demonstration defines the time to market for many products. Thus, it should be started as soon as possible for the demonstration of component maturity. However, actual system reliability demonstration requires complementary system tests. Reliability Growth monitors the reliability performance of samples similar to the future product, operated under representative load and boundary conditions. Continuous supervision of durability test fleets measures the slope of reliability growth. MTBF (Mean Time Between Failures) or a similar measure is used for reliability reporting.
Reliability Growth provides an early warning system for immature components or infant mortality failures. The evolution of MTBF over time is quite a sensitive indicator of the quality of the validation process. It allows for a robust prediction of reliability target achievement. Moreover, MTBF is used as a reference value to identify the onset of series failure problems.
A simple yet significant KPI (Key Performance Indicator) is derived from the large volume of distributed and heterogeneous validation activities. It tracks progress against customer expectations and gives an objective assessment of the validation process and of product quality. Early indication of serial problems allows a fast reaction before the costs and the impact on customer confidence become severe.
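One common way to derive such a KPI is a power-law (Crow-AMSAA) fit of cumulative failures over cumulative test time; a growth exponent below one indicates rising instantaneous MTBF. The sketch below uses hypothetical fleet data and a simple least-squares fit rather than the maximum-likelihood estimators of the formal standards.

```python
import numpy as np

# Hypothetical durability fleet log: cumulative test hours at each failure
failure_times = np.array([40, 110, 260, 520, 900, 1500, 2600])
n = np.arange(1, len(failure_times) + 1)

# Crow-AMSAA (NHPP power law): E[N(t)] = lam * t**beta.
# Fit on log-log scale; beta < 1 indicates reliability growth.
b, log_lam = np.polyfit(np.log(failure_times), np.log(n), 1)
lam = np.exp(log_lam)

t_now = failure_times[-1]
# Instantaneous MTBF = 1 / instantaneous failure intensity
mtbf_inst = 1.0 / (lam * b * t_now ** (b - 1.0))
print(f"growth exponent beta = {b:.2f}")
print(f"instantaneous MTBF at {t_now} h: {mtbf_inst:.0f} h")
```

Tracking the fitted exponent and instantaneous MTBF over successive reporting periods yields the sensitive trend indicator described above.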
How to predict system behaviour
Most actual reliability problems are related to dynamic system operation. They originate from transient local stresses evolving from gradients in inertial mass, in load or in load capacity. A frequent example is delayed heating and cooling due to gradients in thermal mass, leading to thermal fatigue faults. Accurate measurement of local load or stress in transient operation requires a high instrumentation effort and high-volume data processing. Therefore, system response modelling is used as an efficient replacement for permanent measurement: a one-time measurement of the system response upon load cycling is used to parametrise a transfer function mapping the generally available global load data to local load or stress under transient conditions. This mapping acts as a “virtual sensor” that derives the stresses at critical component locations.
Virtual sensors are used for the prediction of local load. Various duty cycles or load spectra may be compared at high resolution. During product development, virtual sensors are used for test assessment. During fleet monitoring, virtual sensors generate the expectation values for State Detection via residual analyses.
No extra instrumentation is required for testing or operation, and no high-frequency data sampling needs to be executed; instead, load information for various locations within a product is derived from global load data.
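A minimal sketch of such a virtual sensor follows: a first-order thermal model that maps globally available load data (power and ambient temperature) to a local component temperature. The time constant and thermal resistance would be identified from a one-time step-response measurement; all values here are hypothetical.

```python
import numpy as np

def virtual_temp_sensor(power, t_ambient, dt, tau, r_th):
    """First-order thermal model as a 'virtual sensor': estimates a
    local component temperature from global load data. tau (s) and
    r_th (K/W) come from a one-time step-response identification."""
    temp = np.empty_like(power)
    t = t_ambient[0]
    for i, (p, ta) in enumerate(zip(power, t_ambient)):
        t_target = ta + r_th * p  # steady-state temperature for this load
        t += (t_target - t) * (1.0 - np.exp(-dt / tau))
        temp[i] = t
    return temp

# Hypothetical duty cycle: 600 s full load, then 600 s idle, 1 s sampling
power = np.r_[np.full(600, 50.0), np.zeros(600)]   # W
ambient = np.full(1200, 25.0)                      # degrees C
temp = virtual_temp_sensor(power, ambient, dt=1.0, tau=120.0, r_th=1.2)
print(f"peak estimated temperature: {temp.max():.1f} C")
```

The transient lag captured by the time constant is exactly what static channel limits miss, which is why such models matter for thermal fatigue assessment.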
How to find out if a system deviates from sound conditions
State detection discriminates between healthy and degraded system conditions. Significantly deviating states are reported in a warning/alarm system. System Response Modelling is used to generate the references for State Detection. A comprehensive set of models is needed to cover all known deviations indicating the onset of failure propagation. Furthermore, the detection of a deviation can be used as a trigger for a Diagnostics process to drive problem solving. It is most relevant to reflect the details of the transient behaviour in System Response Modelling, because static limits for measurement channels provide only low sensitivity and lead to a high frequency of false alarms.
Residual analysis enables highly sensitive system observation, delivering state detection with a low and quantified error probability. The probability of detecting deviating system properties under realistic load conditions is extremely high, with full usage of the available SCADA (Supervisory Control and Data Acquisition) data.
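A sketch of residual-based state detection under these ideas: a synthetic channel with a slowly growing deviation is compared against a model expectation, and alarms are raised only when the residual persistently exceeds noise limits learned on a known-healthy period. All signals and thresholds are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical monitoring channel: expectation from a system model
# plus measurement noise; from sample 700 onward a drift fault starts.
expected = 60.0 + 5.0 * np.sin(np.linspace(0, 20, 1000))
measured = expected + rng.normal(0.0, 0.5, 1000)
measured[700:] += np.linspace(0.0, 6.0, 300)  # slowly growing deviation

# Residual analysis: deviation from the model expectation, judged
# against noise statistics learned on a known-healthy period.
residual = measured - expected
mu, sigma = residual[:500].mean(), residual[:500].std()
z = (residual - mu) / sigma
# Require persistence (majority of a 20-sample window beyond 3 sigma)
# to keep the false alarm rate low.
alarm = np.convolve(np.abs(z) > 3.0, np.ones(20) / 20, mode="same") > 0.5
print("first alarm at sample:", int(np.argmax(alarm)))
```

Note that a static limit on the raw channel would miss this fault entirely, since the measured values stay within the normal operating range; only the residual against the transient expectation exposes it.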
How to identify root causes for deviations
Diagnosis is a method to determine the root cause(s) of observed deviating system behaviour, i.e. it explains a State Detection as the consequence of an assumed mechanism. In general this is a difficult task, as there are typically several mechanisms that may deliver a consistent and comprehensive explanation of an observation. These hypotheses have to be checked in a second step, the differential diagnosis. It uses additional observations (additional measurement channels or on-site inspections), the system response upon critical operation modes or the time evolution of the observed deviation.
A reasoning engine is used to automate the diagnostic process. The diagnostic algorithms are based on the correlation between observations and root causes – as determined in the extended Fault Tree Analysis. The output of the reasoning algorithm is the list of hypotheses supplemented by discriminating actions for the proper execution of a differential diagnosis.
Diagnostics are used to understand the background of deviating behaviour; thus, transparency on the underlying technical issue is created. Service activities may be organised by identifying the necessary qualifications of the service personnel and the set of spare parts and tools in advance of any intervention. Remaining lifetime models may be selected based on the results of the diagnosis.
A systematic analysis of observations for the determination of a root cause is established. Diagnostics help to gain transparency on the health state of a given system. The insights gained through diagnostics may serve as a basis for risk management, mitigation and countermeasures. The results of Diagnostics may be used for a more efficient planning of service activities and for the determination of the remaining lifetime of a given system in Prognostics.
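A minimal sketch of how such a reasoning step might rank hypotheses is shown below. The cause-symptom table stands in for the correlations extracted from an extended Fault Tree Analysis; all names and scores are illustrative, not from a real FTA.

```python
# Hypothetical cause -> symptom mapping, as would be derived from an
# extended Fault Tree Analysis (all entries are illustrative).
FAULT_TREE = {
    "bearing wear":   {"vibration high", "temperature high"},
    "lubricant loss": {"temperature high", "friction torque high"},
    "sensor drift":   {"temperature high"},
}

def rank_hypotheses(observations):
    """Rank root-cause hypotheses by how well they explain the
    observations, penalising predicted symptoms that are absent."""
    scored = []
    for cause, symptoms in FAULT_TREE.items():
        explained = len(observations & symptoms)
        unexplained = len(symptoms - observations)
        scored.append((explained - 0.5 * unexplained, cause))
    return [c for score, c in sorted(scored, reverse=True) if score > 0]

obs = {"vibration high", "temperature high"}
print(rank_hypotheses(obs))
```

The remaining hypotheses below the top candidate are exactly what the differential diagnosis step then discriminates, e.g. by requesting the additional friction torque channel.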
How to predict remaining lifetime
Prognostics predict system behaviour, in particular the failure probability or the time to failure. A proper analysis sequence starts with State Detection, followed by Diagnostics. The insights gained from Diagnostics are then used as the basis for Prognostics.
Diagnosis of the damaging mechanism allows the proper selection of a Physics of Failure model for prognosis.
These models are not based on overall properties like lifetime or average power but rather refer to damaging load and boundary conditions as input.
Accurate model selection is most relevant since the damage kinetics may be a highly non-linear function of certain load aspects.
Prognostics help to evaluate the remaining lifetime of a system and/or the failure probability over time. The estimated remaining lifetime and/or failure probability in turn provide indicators for the allocation of attention to particular parts of a system. A time-based clustering of the parts of the system at risk may be derived.
Prognostics are a complementary input for decision making in the context of condition- and/or risk-based maintenance and help to improve the efficiency of service activities, leading to cost reductions. At the management level, Prognostics may provide a solid basis for risk management, O&M budgeting and valuation considerations.
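A deliberately naive sketch of a damage-based remaining-lifetime estimate follows, with hypothetical values. As noted above, damage kinetics may be strongly non-linear in certain load aspects, so a properly selected PoF model fed with monitored load would replace the linear extrapolation used here.

```python
def remaining_lifetime(operating_hours, damage_now, damage_limit=1.0):
    """Naive prognosis: extrapolate the current average damage
    accumulation rate (e.g. from a Physics of Failure model fed with
    monitored load) linearly until the damage limit is reached."""
    rate = damage_now / operating_hours  # damage per operating hour
    return (damage_limit - damage_now) / rate

# Hypothetical case: after 12000 operating hours the PoF model
# reports an accumulated damage of D = 0.4.
rul_h = remaining_lifetime(12000.0, 0.4)
print(f"estimated remaining useful lifetime: {rul_h:.0f} h")
```

Repeating this estimate per component yields the time-based clustering of at-risk parts mentioned above, which feeds directly into condition- and risk-based maintenance planning.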