Handling ‘big data’ effectively is now key to the success of many companies across different industries. For example, Netflix uses sophisticated recommendation systems to satisfy the demands of its customers. Amazon has opened a grocery store without checkout lines, where machine learning and artificial intelligence track the items that customers pick up. The Chan Zuckerberg Biohub aims to prevent or treat all diseases within our children’s lifetime, and relies on several machine learning algorithms to pursue that goal.
The most widely used system for handling very large data sets efficiently is the Hadoop framework. One of Hadoop’s main components is its file system, the Hadoop Distributed File System (HDFS). In distributed systems, avoiding data loss is very important. Traditionally, HDFS replicates data to tolerate hardware faults in a Hadoop cluster, but the newest Hadoop version also implements a newer technique, erasure coding, as a framework. Erasure coding achieves the same level of data protection as replication with significantly less raw storage space. According to benchmarks, erasure coding backed by the Intel Intelligent Storage Acceleration Library (ISA-L) also performs better than replication, and its use is now recommended.
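To make the storage argument concrete, the following minimal sketch compares the raw storage needed by replication and by erasure coding. The two schemes shown, three-way replication and a Reed-Solomon code with 6 data and 3 parity units, are assumptions chosen for illustration and are not taken from the measurements in this thesis; the class and property names are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    """A redundancy scheme described by its data and redundant units.

    Illustrative model only: it counts storage units and ignores
    metadata, block sizes and placement constraints.
    """
    name: str
    data_units: int       # units that hold the original data
    redundant_units: int  # extra copies (replication) or parity units (EC)

    @property
    def raw_per_logical(self) -> float:
        # Raw storage consumed for every unit of user data.
        return (self.data_units + self.redundant_units) / self.data_units

    @property
    def overhead_percent(self) -> float:
        # Extra storage relative to the user data, in percent.
        return 100.0 * self.redundant_units / self.data_units

    @property
    def tolerated_failures(self) -> int:
        # Any this-many units may be lost without losing data
        # (true for replication and for MDS codes such as Reed-Solomon).
        return self.redundant_units


# Assumed example schemes, not values reported in this thesis.
replication = Scheme("3x replication", data_units=1, redundant_units=2)
reed_solomon = Scheme("RS(6,3)", data_units=6, redundant_units=3)

for scheme in (replication, reed_solomon):
    print(
        f"{scheme.name}: {scheme.raw_per_logical:.1f}x raw storage, "
        f"{scheme.overhead_percent:.0f}% overhead, "
        f"tolerates {scheme.tolerated_failures} lost units"
    )
```

Under these assumptions, the erasure-coded layout tolerates at least as many lost units as replication while needing half the raw storage, which is the trade-off described above.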
This thesis was prepared in cooperation with a Danish startup, Chocolate Cloud. The main goal was to enable erasure coding in a Hadoop-compatible file system using Chocolate Cloud’s technology. Measurements showed that the company’s solution can compete with the de facto fastest erasure coding library on the market, Intel’s Intelligent Storage Acceleration Library (ISA-L).