Data analysis is an important activity that is increasingly performed by non-technical users as well.
Data cleaning is necessary in most data analysis processes. A data analysis process can be very sensitive to faults in the input data and often needs manual checking by experts. The structure and the sources of the data may also change over time, which makes data cleaning more difficult. In data analysis projects, data processing, data cleaning and data mining are often implemented as workflows. Graphical data analysis workflow tools (e.g. RapidMiner, KNIME) support these activities for non-technical users as well, who may not have deep experience in data cleaning and analysis.
If data errors are not eliminated completely during data processing and data cleaning, they can seriously distort the output of the actual data analysis steps (e.g. interactive visual analysis, statistical methods). However, these errors can be “caught” during the process by inserting additional data cleaning and consistency checks. This way the further propagation of data errors can be stopped.
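As an illustration of such an inserted consistency check, the following minimal Python sketch places a range check in front of an aggregation step. The field name `cpu_load`, the bounds and the sample rows are hypothetical and chosen only for this example; they are not taken from the workflows studied in the paper.

```python
import math

def consistency_check(rows, low=0.0, high=100.0):
    """Split rows into (clean, rejected) using a simple range rule.

    The field name and the bounds are assumptions made for this sketch.
    """
    clean, rejected = [], []
    for row in rows:
        value = row.get("cpu_load")
        ok = (
            isinstance(value, (int, float))
            and not isinstance(value, bool)
            and not math.isnan(value)
            and low <= value <= high
        )
        (clean if ok else rejected).append(row)
    return clean, rejected

def mean_load(rows):
    """Downstream analysis step that is sensitive to faulty input values."""
    return sum(r["cpu_load"] for r in rows) / len(rows)

# Hypothetical input containing one outlier and one missing value.
rows = [
    {"cpu_load": 40.0},
    {"cpu_load": 60.0},
    {"cpu_load": 9999.0},
    {"cpu_load": None},
]

clean, rejected = consistency_check(rows)
result = mean_load(clean)  # computed only from rows that passed the check
```

Without the check, the missing value would make the aggregation fail and the outlier would dominate the mean; with it, both rows are set aside for expert review instead of propagating further.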
In the paper I described an ontology-based metamodel to represent general data analysis processes. I created data cleaning and analysis processes in a popular graphical tool (RapidMiner) and implemented the automatic generation of ontology instance models from them. I created a data fault-error-failure taxonomy based on a literature review. For the various types of data cleaning, analysis and processing operations, I defined error propagation rules and analysed the data quality sensitivity and robustness of the steps. I showed how system models can help the analysis of data consistency and completeness.
I examined how methods of dependable computing can be used within this field. The above-mentioned processes are transformed to a component-based system model, and I used a tool based on constraint programming for their error propagation analysis. Potential faults and failures of data management workflows are traced back to the original model, so this system can reveal weak points in the process where data cleaning steps should be inserted to prevent failures caused by poor data quality. In this paper I also introduced the practical applicability of this method in a case study on the performance and dependability analysis of metrics measured in a complex cloud-based application.
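The idea of propagating errors through a component model and locating weak points can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a plain fixed-point iteration over a small acyclic workflow instead of the constraint programming tool used in the paper, and all step names, error modes and rules (`WORKFLOW`, `RULES`, `propagate`, `weak_points`) are hypothetical.

```python
# Hypothetical workflow: step -> list of upstream steps (assumed acyclic).
WORKFLOW = {
    "read_csv": [],
    "read_log": [],
    "join": ["read_csv", "read_log"],
    "aggregate": ["join"],
    "plot": ["aggregate"],
}

# Hypothetical propagation rules: step -> {incoming error mode: outgoing error mode}.
# A mode that is not listed for a step is treated as masked by that step.
RULES = {
    "join": {"missing": "missing", "outlier": "outlier"},
    "aggregate": {"missing": "biased", "outlier": "biased"},
    "plot": {"biased": "misleading"},
}

def propagate(workflow, rules, injected, blocked=frozenset()):
    """Return step -> set of error modes on its output.

    injected: step -> error modes activated by faults in that step.
    blocked:  steps followed by an (idealised) cleaning step that
              removes every error mode from their output.
    """
    errors = {
        step: set() if step in blocked else set(injected.get(step, ()))
        for step in workflow
    }
    changed = True
    while changed:  # iterate until no new error mode appears
        changed = False
        for step, upstream in workflow.items():
            if step in blocked:
                continue
            for up in upstream:
                for mode in list(errors[up]):
                    out = rules.get(step, {}).get(mode)
                    if out is not None and out not in errors[step]:
                        errors[step].add(out)
                        changed = True
    return errors

def weak_points(workflow, rules, injected, sink, failure):
    """Steps after which a single cleaning step would prevent the failure."""
    return [
        step
        for step in workflow
        if step != sink
        and failure not in propagate(workflow, rules, injected, {step})[sink]
    ]

# Hypothetical fault activation: one fault in each input source.
injected = {"read_csv": {"outlier"}, "read_log": {"missing"}}
errors = propagate(WORKFLOW, RULES, injected)
candidates = weak_points(WORKFLOW, RULES, injected, "plot", "misleading")
```

In this toy model, cleaning only one of the two input sources does not prevent the failure at the final step, whereas a cleaning step after `join` or `aggregate` does, which is the kind of recommendation the analysis is meant to produce.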
My results can help data analysis projects to be carried out more effectively by giving recommendations on how to eliminate input data errors and the effects of wrong calculations, also considering the model of the system under analysis. These recommendations can save a lot of expert effort and help data analysts to concentrate on the investigation of the business problem. The methodology is independent of the tools used for the analysis.