By the Big Data problem, we mean that we have become unable to extract information from data using traditional relational databases. This is because the data we generate is high in volume and variety, and usually needs to be processed in very little time. The issue is worsened by the fact that the popular methodology of exploratory data analysis requires great computational power. Without suitable technologies, data processing is near-impossible, and thus we cannot extract information from the data, rendering it useless.
One such technology is the Hadoop framework, developed by the Apache Software Foundation. This framework stores data in a distributed file system, and one of its main design goals is the ability to run on commodity hardware. Hadoop's default data processing model is MapReduce, which performs reasonably on batches of data but is far too slow for near-real-time data visualization.
To solve this problem, Pivotal Software introduced HAWQ, a parallel SQL query engine. Using this database layer, it is possible to read and write data natively to and from Hadoop's distributed file system, replacing the original MapReduce engine for query processing. This technology allows us to issue SQL queries against tera- and even petabyte-sized datasets, and thus to create charts and diagrams that were not possible before.
We can connect to HAWQ from R, the popular data visualization environment. Once this setup is established, computational constraints and pixel counts no longer limit our ability to plot data, solving a major issue.
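Since HAWQ is derived from the PostgreSQL family and accepts connections from standard PostgreSQL clients, such a connection can be sketched in R with the RPostgreSQL package. The host name, database, credentials, and table below are placeholder assumptions, and the snippet requires a running HAWQ cluster, so it is a sketch rather than a runnable example:

```r
# Sketch: querying HAWQ from R over the PostgreSQL protocol.
# Host, database, user, and table names are hypothetical placeholders.
library(RPostgreSQL)

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv,
                 host     = "hawq-master.example.com",  # HAWQ master node (assumption)
                 port     = 5432,
                 dbname   = "analytics",                # placeholder database
                 user     = "gpadmin",
                 password = "secret")

# Push the aggregation into HAWQ; only the summarized
# result is transferred back into the R session.
df <- dbGetQuery(con,
  "SELECT category, COUNT(*) AS n
     FROM events
    GROUP BY category")

# The small result set can then be plotted with base R graphics.
barplot(df$n, names.arg = df$category)

dbDisconnect(con)
```

The key design point is that the heavy aggregation runs inside HAWQ, so R only ever receives a result set small enough to visualize, regardless of the size of the underlying dataset.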
The goal of this paper is to document the setup of such an architecture and to demonstrate, using R and SQL commands, some of the basic diagrams used in exploratory data analysis.