IT systems are growing more complex: they consist of many components that are themselves complex. Ensuring continuous, error-free operation and debugging such systems are becoming more difficult. Fortunately, suitable tools can support these tasks effectively.
In a complex system that consists of many components, determining the cause of an error can be very difficult. However, if we know the exact symptoms of a malfunction and the error propagation behaviour of the components, error propagation analysis (EPA) can determine the internal error that causes the erroneous output.
Performing error propagation analysis can itself be a difficult task. In my thesis, I present how the error propagation problem can be mapped to a general mathematical problem, the constraint satisfaction problem, for which general-purpose solver algorithms already exist. By solving the error propagation problem, we obtain all possible execution paths of the system and the possible errors for each fault. From these results, the cause of every possible error mode can be found.
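To illustrate the mapping, the following is a minimal sketch and not the method of the thesis itself. It assumes a hypothetical three-component chain (sensor, filter, actuator) and a single assumed propagation rule: a component emits an erroneous output exactly when it is faulty or receives an erroneous input. Instead of a real constraint solver, it enumerates all assignments by brute force, which is only feasible for very small models.

```python
from itertools import product

# Hypothetical three-component chain: sensor -> filter -> actuator.
INPUTS = {"sensor": [], "filter": ["sensor"], "actuator": ["filter"]}
ORDER = ["sensor", "filter", "actuator"]


def consistent(faulty, output):
    """Assumed propagation constraint: a component's output is erroneous
    exactly when the component is faulty or any of its inputs is erroneous."""
    return all(
        output[c] == (faulty[c] or any(output[i] for i in INPUTS[c]))
        for c in ORDER
    )


def solve(observed):
    """Enumerate every (faulty, output) assignment that satisfies the
    propagation constraints and agrees with the observed outputs."""
    solutions = []
    n = len(ORDER)
    for bits in product([False, True], repeat=2 * n):
        faulty = dict(zip(ORDER, bits[:n]))
        output = dict(zip(ORDER, bits[n:]))
        matches = all(output[c] == v for c, v in observed.items())
        if matches and consistent(faulty, output):
            solutions.append((faulty, output))
    return solutions


def single_fault_causes(observed):
    """Components whose fault alone explains the observed symptoms."""
    return sorted(
        c
        for faulty, _ in solve(observed)
        if sum(faulty.values()) == 1
        for c in ORDER
        if faulty[c]
    )


# Symptom: the actuator output is erroneous, the sensor output is correct.
print(single_fault_causes({"actuator": True, "sensor": False}))
# prints ['actuator', 'filter']
```

With no observations, `solve({})` returns every consistent fault scenario of the model. Adding observed symptoms narrows the solution set to the candidate causes.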
Applied at design time, error propagation analysis yields further useful information about the analysed system: its single points of failure and its critical points can be identified. These points should be watched by a runtime monitoring system, since they mark the components that can easily cause system-wide breakdowns. Monitoring rules for these critical components can be generated automatically from the analysis results.
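As a rough illustration of the design-time use, the sketch below looks for single points of failure in a hypothetical system: a load balancer in front of two redundant servers. The component names and the failure condition are assumptions made for this example only. A component counts as a single point of failure if its fault alone makes the system fail.

```python
# Hypothetical system: the service works if the load balancer works
# and at least one of the two redundant servers works.
COMPONENTS = ["load_balancer", "server_a", "server_b"]


def system_fails(faulty):
    """Assumed failure condition for the example system."""
    return "load_balancer" in faulty or {"server_a", "server_b"} <= faulty


def single_points_of_failure():
    """Components whose individual fault is enough to fail the system."""
    return [c for c in COMPONENTS if system_fails({c})]


print(single_points_of_failure())
# prints ['load_balancer']
```

In this toy model only the load balancer would need a dedicated monitoring rule, because either server can fail without bringing the service down.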
Analysing historical data can support the construction of the error propagation model and the diagnostic rules. In this thesis, I describe how exploratory data analysis can aid error propagation analysis.
Finally, I present examples of applying this method, drawn from various research projects.