One fine day people realized that it is raining gems and diamonds from the sky and they start looking for a huge container to collect and store it all, but even the biggest physical container is not enough since it is raining everywhere and every time, no one can have all of it alone, so they decide to just collect it in their regular containers and then share and use it.
Since the last few years, and more with the introduction of hand-held devices, valuable data is being generated all around us. Right from health care companies, weather information of the entire world, data from GPS, telecommunication, stock exchange, financial data, data from the satellites, aircrafts to the social networking sites which are a rage these days we are almost generating 1.35 million GB of data every minute. This huge amount of valuable, variety data being generated at a very high speed is termed as “Big Data”.
This data is of interest to many companies, as it provides statistical advantage in predicting the sales, health epidemic predictions, climatic changes, economic forecasts etc. With the help of Big Data, the health care providers, are able to detect an outbreak of flu, just by number of people in the geography writing on the social media sites “not feeling well.. down with cold !”.
Big data was used to locate the missing Malaysian flight “MH370″. It was Big Data that helped analyze the million responses and the impact of the very famous TV show “Satyamev Jayate”. Big data techniques are being used in neonatal units, to analyze and record the breathing pattern and heartbeats of babies to predict infections even before the symptoms appear.
As they say, when you have a really big hammer, everything becomes a nail. There is not a single field where big data does not give you the edge, however processing of this massive amount of data is a challenge and hence the need of a framework that could store and process data in a distributed manner (the shared regular containers).
Apache Hadoop is an open source framework, developed by Doug Cutting and Mike Cafarella in 2005, written in java for distributed processing and storage of very large data sets on clusters of normal commodity hardware.
It uses data replication for reliability, high speed indexing for faster retrieval of data and is centrally managed by a search server for locating data. Hadoop has HDFS (Hadoop Distributed File System) for the storage of data and MapReduce for parallel processing of this distributed data. To top it all, it is cost effective since it uses commodity hardware only, and is scalable to the extent you require. Hadoop framework is in huge demand by all big companies. It is the handle for the Big hammer!!