Flexible and efficient storage solutions in Hadoop

Dr. Holczer Tamás
Department of Networked Systems and Services

Nowadays nearly everybody owns more than one mobile phone or notebook, and the market for smart household appliances is growing. These devices produce millions of bytes of data every second around the world. This information is relevant to manufacturers, because analyzing this huge amount of data lets them improve their products. Traditional ways of processing such data are slow and ineffective, which is why big data solutions were born. The biggest open source big data project is Hadoop. Hadoop uses HDFS to store data. HDFS offers many ways to store data, but it is designed mostly to handle large files. To handle small files, Hadoop provides SequenceFile and MapFile. These techniques aggregate small files into a bigger one. Unfortunately, the existing SequenceFile/MapFile Writer cannot append to a file that has already been closed. I implemented a solution that solves this issue. My solution has to be transparent and has to work with the Hadoop ecosystem. This ecosystem provides redundancy and many failover methods. I drew ideas from Hadoop to use failover and redundancy in my code too. I also encountered another problem. If the uploaded file is bigger than 85 MB, the Hadoop ecosystem throws a Java exception (java.lang.OutOfMemoryError). 85 MB is not particularly large, but the built-in writers read the whole file into memory. It was implemented that way because some stock compression algorithms only work if the whole file is present in memory. Moreover, the ecosystem uses the apache.commons hash functions, which also require the whole file in memory. My solution solves this problem as well. I implemented two different approaches: the first streams files into HDFS just like Hadoop's HDFS put method, while the other divides big files into smaller ones so that the writer can handle the request. Finally, I compared my solutions with each other and with Hadoop's HDFS put method.
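
The two approaches to the memory problem can be illustrated with a minimal sketch. This is not the code of the thesis: it uses only plain java.io streams as a stand-in for HDFS, and the class name StreamingSketch, the PartHandler interface, the method names, and the 4096-byte buffer size are all hypothetical names chosen for illustration. The first method copies data through a small fixed buffer, so memory use does not depend on the file size; the second hands the input to a callback in parts of bounded size, one part at a time.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Illustrative sketch only; all names here are assumptions, not the thesis code.
class StreamingSketch {
    // Hypothetical buffer size; any small fixed value keeps memory use bounded.
    static final int BUFFER_SIZE = 4096;

    // Callback that receives one part at a time (assumed interface).
    // The buffer is reused between calls, so a handler must consume it immediately.
    interface PartHandler {
        void handle(int index, byte[] buffer, int length) throws IOException;
    }

    // Approach 1: stream the input to the output through a fixed-size buffer,
    // so the whole file is never held in memory. Returns the bytes copied.
    static long streamCopy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[BUFFER_SIZE];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        out.flush();
        return total;
    }

    // Approach 2: divide the input into parts of at most maxPartSize bytes and
    // pass each part to the handler. Returns the number of parts produced.
    static int splitIntoParts(InputStream in, int maxPartSize, PartHandler handler)
            throws IOException {
        if (maxPartSize <= 0) {
            throw new IllegalArgumentException("maxPartSize must be positive");
        }
        byte[] part = new byte[maxPartSize];
        int count = 0;
        while (true) {
            // Fill the part buffer completely unless the input ends first.
            int filled = 0;
            while (filled < maxPartSize) {
                int n = in.read(part, filled, maxPartSize - filled);
                if (n == -1) {
                    break;
                }
                filled += n;
            }
            if (filled == 0) {
                break; // no more data
            }
            handler.handle(count, part, filled);
            count++;
            if (filled < maxPartSize) {
                break; // a short part means the input is exhausted
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[10];
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        System.out.println("copied bytes: " + streamCopy(new ByteArrayInputStream(data), sink));
        int parts = splitIntoParts(new ByteArrayInputStream(data), 4,
                (index, buffer, length) -> System.out.println("part " + index + ": " + length + " bytes"));
        System.out.println("parts: " + parts);
    }
}
```

In a real deployment the output stream would presumably be one obtained from the Hadoop file system API, and each part would be passed to the SequenceFile or MapFile writer; both of those connections are assumptions and are left out of the sketch.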

