This Thesis analyzes the S.M.A.R.T. diagnostic data of hard disk drives and uses this
data to predict the remaining lifetime of the drives.
In the information age, the amount of stored information is growing rapidly, and the
majority of this data is stored on traditional hard disk drives. Because of this, the failure
of hard disk drives is a problem that cloud service providers find hard to eliminate. The
hard disk drives have for quite some time been equipped with the so-called S.M.A.R.T.
diagnostic technology, which enables us to read various diagnostic attributes. If a large
amount of this data is available, we can predict failures to a certain degree by analyzing it.
The aim of this Thesis is to find a solution to this problem. The cloud service provider
Backblaze collected and published five years of S.M.A.R.T. data, which I analyze to build
machine learning models that predict the time to failure of the drives.
In the Thesis I present the technology of machine learning and the structure of the data
being used. I remove the uninformative variables and observations from the data, then I
perform statistical analysis to choose a subset of the remaining data, which can be used
for machine learning. I present how the data can be transformed to make it usable for
machine learning, and I describe the creation of the training, validation, and test sets.
After preparing the data, I present the machine learning algorithms I use. For each
algorithm I create several models, which differ in the length of the time period known to
the model. I train these models and report their results by evaluating them on the test
set. I then discuss a few problems and interesting observations encountered while working
on the Thesis.
I conclude the Thesis by summarizing and evaluating the project results, and I suggest a
few directions for future work.