Facebook, WhatsApp, Google Maps and Netflix are among the most popular and warmly welcomed everyday products of our era. Companies create IT solutions that let people connect in real time and share joyful moments online. As a result, a community larger than any seen before is forming, and it requires an infrastructure larger than any built before. This is Big Data.
We are flooded by an enormous amount of information such as messages, GIFs and videos. Social media has been on the rise since 2010, and the infrastructure that operates behind it is Big Data. Over the last 20 years, many people have developed Big Data technologies to make the knowledge of the world reachable for everybody.
Cloudera is one of the biggest companies delivering Big Data value to the B2B market. It offers its product, called CDH, as a full service: it develops, sells and supports CDH for every customer. CDH is a distribution of the open-source Apache Hadoop project that bundles many components, some mature and some brand-new.
Different customers have different demands, based on their IT needs. Some of them want stable operation, while others want variety or velocity. CDH aims to serve both kinds of customers, and Cloudera intends to package the finest components together for every business. Some components are heavily used, and the company benefits greatly from them. Others are rarely used and are candidates for retirement in order to optimize profit. To judge how much each component is used, we need knowledge about its reputation.
In cooperation with Cloudera, I created a popularity analysis. It shows which components are trending and which are declining over a period of time. To accomplish this, I used Python scripts to analyze the lifelines of five components, and I collected essential information about them from various data sources. I broadened my knowledge of Big Data with Tom White’s book “Hadoop: The Definitive Guide”.