Tag Archives: Big Data Hadoop

6 Questions Organizations Should Ask About Big Data Architecture

Big data comes with big promises, but businesses often struggle to work out how to take full advantage of it and how to deploy an effective architecture seamlessly into their systems.




From descriptive statistics to AI to SAS predictive analytics, everything is spurred by big data innovation. At the 2017 Dell EMC World conference, Cory Minton, chief systems engineer for data analytics at Dell EMC, gave a presentation on Monday that simplified the biggest decisions an organisation needs to make when employing big data.


Are You a Student of Statistics? – You must know these 3 things

We are a premier statistical and data analysis training institute offering courses on SAS, Big Data Hadoop, business intelligence and AI. We asked our faculty to tell us the three most important things that every student of elementary statistics should know.




Infographic: How Big Data Analytics Can Help To Boost Company Sales?

A massive explosion in the world of data has turned slow-paced statisticians into the most in-demand people in the job market right now. But why are all companies, whether big or small, looking for data analysts and scientists?




Companies are collecting data from all possible sources: PCs, smartphones, RFID sensors, gaming devices and even automotive sensors. However, volume is not the only factor changing the business environment. The velocity and variety of data are also increasing rapidly, and all three must be managed effectively.


Things To Be Aware Of Regarding Hadoop Clusters

Hadoop is increasingly used by companies of diverse scope and size, and they are realizing that running Hadoop optimally is a tough call. As a matter of fact, it is not humanly possible to respond in real time to changing conditions, which may occur across several nodes, in order to fix performance dips or bottlenecks. This performance degradation is exactly what needs to be remedied when Hadoop is deployed at large scale and is expected to deliver business-critical results on time. The following three signs signal the health of your Hadoop cluster.




  • The Out of Capacity Problem

The true test of your Hadoop infrastructure comes when you need to run all of your jobs efficiently and complete them within adequate time. It is not rare to come across instances where you have seemingly run out of capacity, because you are unable to run additional applications, yet monitoring tools indicate that you are not making full use of processing capability or other resources. The primary challenge is then to find the root cause of the problem. Most often it is related to the YARN architecture used by Hadoop. YARN is static in nature: once jobs have been scheduled, it does not readjust system and network resources. The solution lies in configuring YARN to deal with worst-case scenarios.
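The gap between "out of capacity" and "underused" can be illustrated with a small sketch. This is not YARN's API; the `Container` class, the node size and the per-container figures are hypothetical numbers chosen only to show how reservations sized for the worst case can fill a node while actual usage stays low.

```python
from dataclasses import dataclass


@dataclass
class Container:
    reserved_mb: int  # memory reserved when the job was scheduled (worst case)
    used_mb: int      # memory the task is actually using right now


def capacity_report(node_memory_mb, containers):
    """Compare what the scheduler has reserved with what is really in use."""
    reserved = sum(c.reserved_mb for c in containers)
    used = sum(c.used_mb for c in containers)
    return {
        "reserved_pct": 100 * reserved / node_memory_mb,
        "used_pct": 100 * used / node_memory_mb,
        "free_for_scheduling_mb": node_memory_mb - reserved,
    }


def can_schedule(node_memory_mb, containers, request_mb):
    """A static scheduler only looks at reservations, not at real usage."""
    reserved = sum(c.reserved_mb for c in containers)
    return node_memory_mb - reserved >= request_mb


# Hypothetical node: 64 GB of memory, eight containers that each reserved
# 8 GB for the worst case but currently use only 2 GB.
NODE_MEMORY_MB = 65536
containers = [Container(reserved_mb=8192, used_mb=2048) for _ in range(8)]

report = capacity_report(NODE_MEMORY_MB, containers)
print(report)
print(can_schedule(NODE_MEMORY_MB, containers, request_mb=1024))
```

In this made-up case the node is fully reserved, so even a 1 GB request is refused, while monitoring would show only a quarter of the memory in use.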


Will Spark Replace Hadoop?

Top 2016 Trends Expected to Turn Fruitful in 2017

I hope this post helps answer some of the questions about Apache Spark, and its role in Big Data analytics, that may be on your mind these days.


The Role of Big Data in the Largest Database of Biometric Information

The Aadhaar project from our very own India happens to be one of the most ambitious Big Data projects ever undertaken. The goal is to collect, store and utilize the biometric details of a population that crossed the billion mark years ago. Needless to say, a project of such epic proportions presents tremendous challenges, but it also gives rise to an incredible opportunity, according to MapR, the company providing the technology behind the execution of this project.


Aadhaar is in essence a 12-digit number assigned to an individual by the UIDAI, the abbreviated form of “Unique Identification Authority of India”. The project was born in 2009, with former Infosys CEO and co-founder Nandan Nilekani as its first chairman and the architect of this grand project, which needed much input in terms of the technology involved.

The intention is to make it a unique identifier for all Indian citizens and to prevent the use of false identities and fraudulent activities. MapR, which is headquartered in California and is a developer and distributor of “Apache Hadoop”, has been putting to use its extensive experience in integrating web-scale enterprise storage and real-time database technology for the purposes of this project.

According to John Schroeder, the CEO and co-founder of MapR, the project presents multiple challenges, including analytics, storage, and making sure that the data involved remains accurate and secure amid authentications that run into the millions every day. Each individual is provided with a number, and an iris scan or fingerprint is taken so that their identity can be proved: it is queried against the database backbone and matched to a headshot photo of the person. Each day witnesses over a hundred million identity verifications, and each one needs to be done in real time, in about 200 milliseconds.
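To put those figures in perspective, here is a rough back-of-the-envelope sketch. It assumes the hundred million verifications are spread evenly over 24 hours, which the source does not state, so real peak load would be higher. The function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_per_second(verifications_per_day):
    """Average request rate if load were spread evenly over the day."""
    return verifications_per_day / SECONDS_PER_DAY


def average_in_flight(rate_per_second, latency_seconds):
    """Little's law: average requests in progress = arrival rate x latency."""
    return rate_per_second * latency_seconds


rate = average_rate_per_second(100_000_000)   # figure quoted in the article
in_flight = average_in_flight(rate, 0.200)    # 200 ms per verification

print(f"average rate: {rate:.0f} verifications per second")
print(f"average in flight: {in_flight:.0f} verifications at any moment")
```

Under that even-load assumption the system handles a little over a thousand verifications per second on average.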

India has a sizable rural population, much of which is yet to be connected to the digital grid. As Schroeder continues, the solution had to be economical and reliable even in low-bandwidth situations, and the technology behind it needed to be resilient enough to work in areas with low levels of connectivity.

Source: Forbes


How Hadoop makes Optimum Use of Distributed Storage and Parallel Computing

Hadoop is a Java-based open source framework from the Apache Software Foundation. It works on the principle of distributed storage and parallel computing for large datasets on commodity hardware.

Let’s look at a few core concepts of Hadoop in detail:

Distributed Storage – In Hadoop we deal with files that are terabytes or even petabytes in size. Each file is divided into parts (blocks), which are stored on multiple machines. By default each block is replicated 3 times (you can change the replication factor as per your requirement); keeping 3 copies minimizes the risk of data loss in the Hadoop ecosystem. It is the same idea as keeping a spare car key at home in case you lose the original.
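The idea of splitting and replicating can be sketched in a few lines. This is a simplified illustration, not how HDFS actually chooses where to put replicas (its real placement policy is rack-aware); the round-robin rule, the node names and the function names are assumptions made for the example.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes of the blocks a file is cut into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be smaller
    return sizes


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def surviving_blocks(placement, failed_nodes):
    """Blocks that still have at least one replica on a healthy node."""
    return [
        block
        for block, replicas in placement.items()
        if any(node not in failed_nodes for node in replicas)
    ]


nodes = ["node1", "node2", "node3", "node4", "node5"]
blocks = split_into_blocks(200)            # a 200 MB file
placement = place_replicas(len(blocks), nodes)

print(blocks)
print(placement)
print(surviving_blocks(placement, failed_nodes={"node1", "node2"}))
```

With three copies of every block, losing any two of the five nodes in this toy cluster still leaves every block readable.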


Parallel Processing – Storage capacity and processor power have progressed a lot, but the seek time of hard disks has not improved significantly. As a result, reading a 1 TB file from a single disk takes a long time. By storing the file across 10 machines in a cluster and reading the parts in parallel, Hadoop can reduce the seek time by up to 10 times.
HDFS uses a large default block size of 64 MB (128 MB in Hadoop 2.x and later) to store large files in an optimized manner.

Let me explain with some calculations:

                   Traditional System                  Hadoop System (HDFS)
File size          1 TB (1,000,000,000 KB)             1 TB (1,000,000,000 KB)
Block size         8 KB (Windows)                      64 MB
No. of blocks      125,000,000 (1,000,000,000 / 8)     15,625 (1,000,000,000 / 64,000)
Avg. seek time     4 ms                                4 ms
Total seek time    125,000,000 × 4 = 500,000,000 ms    15,625 × 4 = 62,500 ms

As you can see, thanks to the HDFS block size of 64 MB we save 499,937,500 ms (about 99.99% of the seek time) when reading a 1 TB file, compared with the Windows system.

We could further reduce seek time by dividing the file into n parts and storing them on n machines that are read in parallel; the seek time for the 1 TB file would then be 62,500/n ms.
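The arithmetic above can be reproduced with a short sketch. It uses the same simplifying assumptions as the calculation in the text: decimal units (1 TB = 1,000,000,000 KB), one seek per block, and an average seek time of 4 ms. The function names are made up for illustration.

```python
KB_PER_MB = 1_000
KB_PER_TB = 1_000_000_000


def block_count(file_kb, block_kb):
    """Number of blocks needed to hold the file (rounded up)."""
    return -(-file_kb // block_kb)


def total_seek_ms(file_kb, block_kb, seek_ms=4):
    """Total seek time, assuming one seek per block."""
    return block_count(file_kb, block_kb) * seek_ms


def parallel_seek_ms(total_ms, machines):
    """Seek time when the file is spread evenly over `machines` nodes."""
    return total_ms / machines


traditional_ms = total_seek_ms(KB_PER_TB, block_kb=8)
hdfs_ms = total_seek_ms(KB_PER_TB, block_kb=64 * KB_PER_MB)
saved_ms = traditional_ms - hdfs_ms
saved_pct = 100 * saved_ms / traditional_ms

print(traditional_ms, hdfs_ms, saved_ms, saved_pct)
print(parallel_seek_ms(hdfs_ms, machines=10))
```

Changing `block_kb` or `machines` shows how strongly both the block size and the degree of parallelism affect the total.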

Here you can see one use of parallel processing: reading a file in parallel from multiple machines in a cluster.
Parallel processing is the concept on which the MapReduce paradigm in Hadoop works. It splits a job into multiple tasks that are processed as a MapReduce job. More details will follow in an upcoming blog post on MapReduce.
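As a small preview of the idea, here is a minimal word-count sketch in the MapReduce style. It runs in a single Python process with a thread pool and does not use Hadoop's own API; the function names and the sample input are assumptions made for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_task(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_outputs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk is handled by its own map task, run concurrently.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_task, chunks))
    groups = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in groups.items())


counts = word_count(["big data hadoop", "hadoop stores big files", "data data"])
print(counts)
```

Each input chunk plays the role of a file part stored on a different machine: the map tasks work on their chunks independently, and only the grouped results are brought together in the reduce step.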

Commodity Hardware – This is the ordinary hardware you use in laptops and desktops, as opposed to high-availability, reliable IBM machines. The use of commodity hardware has helped businesses save a lot of infrastructure cost. Commodity hardware is approximately 60% cheaper than high-availability machines.