Tag Archives: Big Data Hadoop

6 Questions Organizations Should Ask About Big Data Architecture

Big data comes with big promises, but businesses often face tough challenges in determining how to take full advantage of big data and deploy an effective architecture seamlessly into their systems.




From descriptive statistics to AI to SAS predictive analytics – every one of these is spurred by big data innovation. At the 2017 Dell EMC World conference, which took place on Monday, Cory Minton, chief systems engineer for data analytics at Dell EMC, gave a presentation simplifying the biggest decisions an organization needs to make when employing big data.


Are You a Student of Statistics? – You must know these 3 things

We are a premier statistical and data analysis training institute offering courses on SAS, Big Data Hadoop, Business Intelligence and AI. We asked our faculty to tell us the three most important things that every student of elementary statistics should know.




Infographic: How Big Data Analytics Can Help To Boost Company Sales?

A massive explosion in the world of data has turned slow-paced statisticians into the most in-demand people in the job market right now. But why are companies, whether big or small, out for data analysts and scientists?

Companies are collecting data from all possible sources: PCs, smartphones, RFID sensors, gaming devices and even automotive sensors. However, volume is not the only factor that needs to be tackled efficiently, because it is not the only factor changing the business environment; the velocity and variety of data are also increasing at light speed and must be managed with efficacy.

Why is data the new frontier for boosting your sales figures?

Earlier, sales personnel were the only people from whom customers could gather information about products, but today there are various sources from which customers can gather data, so they are no longer so heavily reliant on salespeople.


Things To Be Aware Of Regarding Hadoop Clusters

Hadoop is being used by an increasing number of companies of diverse scope and size, and they are realizing that running Hadoop optimally is a tough call. As a matter of fact, it is not humanly possible to respond in real time to changing conditions that may take place across several nodes, in order to fix dips in performance or conditions that are causing bottlenecks. This performance degradation is exactly what needs to be remedied in cases where Hadoop is deployed at large scale and is expected to deliver business-critical results on time. The following three signs signal the health of your Hadoop cluster.


  • The Out of Capacity Problem

The true test of your Hadoop infrastructure comes to the fore when you are able to run all of your jobs efficiently and complete them within adequate time. It is not rare to come across instances where you have seemingly run out of capacity, because you are unable to run additional applications, yet monitoring tools indicate that you are not making full use of processing capability or other resources. The primary challenge that now lies before you is to sort out the root cause of the problem. Most often you will find it to be related to the YARN architecture used by Hadoop. YARN is static in nature: once jobs have been scheduled, it cannot adjust system and network resources. The solution lies in configuring YARN to deal with worst-case scenarios.
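To make the "out of capacity yet under-utilized" symptom concrete, here is a small back-of-the-envelope sketch (the numbers are hypothetical, not from the post): because YARN reserves the full requested container size, a node can refuse new containers while most of its memory sits idle.

```python
# Illustrative sketch (hypothetical numbers): why a cluster can look "out of
# capacity" while monitoring shows idle resources. YARN reserves the full
# *requested* container size, even if the job actually uses less.

node_memory_gb = 128          # memory YARN can allocate on one node
requested_per_container = 8   # worst-case memory each job requests
actual_use_per_container = 3  # memory a typical container really uses

# YARN admits containers based on the requested size...
containers_admitted = node_memory_gb // requested_per_container   # 16

# ...so real utilization can be far below what the node could handle.
memory_actually_used = containers_admitted * actual_use_per_container  # 48 GB
utilization = memory_actually_used / node_memory_gb

print(f"Containers admitted: {containers_admitted}")
print(f"Utilization: {utilization:.0%}")
```

Under these assumed numbers the node is only 38% utilized, yet YARN reports no room for new containers — which is why sizing requests for realistic worst cases matters.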

  • Jobs with High Priority Fail to Finish on Time

Not all jobs running on a cluster are equally important; there may be jobs of critical importance that must be completed within a given time frame, and one might find oneself in a situation where such high-priority jobs are not finishing within the stipulated deadlines. Troubleshooting such problems may begin by checking parameters or configuration settings that have been modified in the recent past. You may also ask other users of the same cluster whether they have tweaked settings or applications. This approach is time consuming, and not all users will necessarily provide all of the information. Up-front planning holds the key to resolving these sorts of resource contention.

  • Your Cluster Halts Occasionally

In solving problems of this type, node monitoring tools often fail to make the grade, as their visibility cannot be broken down to the level of users, tasks or jobs. An alternative approach is to use tools like iostat, which monitor all of the processes that use disks significantly. Still, anticipating spikes in disk usage through such methods cannot be accomplished by relying solely on human interaction; technology must be used. It is advisable to invest in tools that automatically correct any contention problem even while jobs are in progress. Hadoop’s value is maximized by anticipating problems, reacting swiftly and making decisions in real time.


Will Spark Replace Hadoop?

I hope this post will help you answer some questions about Apache Spark and its role in big data analytics that might be on your mind these days.

Apache Spark

It is a framework for performing analytics on a distributed cluster. It uses in-memory computation instead of MapReduce for better performance and speed. It runs on top of a Hadoop cluster and accesses the Hadoop file system. It can process structured data stored in Hive and streaming data from Flume.


Will Spark replace Hadoop?

– Hadoop is a distributed, parallel processing framework that has been used for MapReduce jobs. These jobs take minutes to hours to complete. Spark has come up as an alternative to the traditional MapReduce model that can be used for real-time data processing and fast interactive queries that complete quickly. Thus, Hadoop supports both MapReduce and Apache Spark.

Spark uses in-memory storage, whereas a Hadoop cluster stores data on disk. Hadoop uses a replication policy as its fault tolerance mechanism, whereas Spark uses Resilient Distributed Datasets (RDDs) for fault tolerance.
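The difference between the two fault tolerance mechanisms can be sketched in a few lines of plain Python (this is a conceptual illustration, not Spark's actual API): HDFS keeps three physical copies of each block, while an RDD records the lineage of transformations and can recompute a lost partition from its source.

```python
# Conceptual sketch (not Spark's real API): an RDD records its lineage of
# transformations, so a lost in-memory partition can be rebuilt by replaying
# them, instead of reading a stored replica as HDFS would.

source = [1, 2, 3, 4]                            # durable input data
lineage = [lambda x: x * 10, lambda x: x + 1]    # recorded transformations

def compute_partition(data, transforms):
    """Apply the recorded lineage to build a partition from scratch."""
    for fn in transforms:
        data = [fn(x) for x in data]
    return data

partition = compute_partition(source, lineage)   # [11, 21, 31, 41]

# Simulate losing the in-memory partition...
partition = None
# ...and recover it by replaying the lineage, no disk replica needed.
recovered = compute_partition(source, lineage)
print(recovered)  # [11, 21, 31, 41]
```

The design trade-off: replication pays a storage cost up front, while lineage pays a recomputation cost only when a failure actually happens.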

Spark features:

1.) Speed – Spark completes jobs running in memory on Hadoop clusters up to 100 times faster, and on disk up to 10 times faster. It stores intermediate data in memory using the concept of the Resilient Distributed Dataset, removing unnecessary disk reads and writes for intermediate data.

2.) Easy to use – It allows you to develop your code in Java, Scala and Python.

3.) SQL, Complex Analytics and Streaming – Spark supports SQL-like features, complex analytics such as machine learning, and stream processing.

4.) Runs Everywhere – Spark runs on Hadoop, on Mesos, standalone, or in the cloud. It can access diverse data sources such as HDFS, HBase, Cassandra and S3.

Spark Use Cases –

Insurance – optimize the claims process by using Spark’s machine learning capabilities to process and analyze all claims being filed.

Retail – use Spark to analyze point-of-sale transaction data and coupon usage; it is also used for interactive data processing and data mining.


The Role of Big Data in the Largest Database of Biometric Information

The Aadhaar project from our very own India happens to be one of the most ambitious projects relying on Big Data ever undertaken. The goal is the collection, storage and utilization of the biometric details of a population that crossed the billion mark years ago. Needless to say, a project of such epic proportions presents tremendous challenges, but it also gives rise to an incredible opportunity, according to MapR, the company providing the technology behind the execution of this project.


Aadhaar is, in essence, a 12-digit number assigned to an individual by the UIDAI, the abbreviated form of “Unique Identification Authority of India”. The project was born in 2009 and had former Infosys CEO and co-founder Nandan Nilekani as its first chairman and the architect of this grand project, which needed much input in terms of the tech involved.

The intention is to make it a unique identifier for all Indian citizens and to prevent the use of false identities and fraudulent activities. MapR, which is headquartered in California and is a distributor and developer of Apache Hadoop, has been putting to use its extensive experience in integrating web-scale enterprise storage and real-time database tech for the purposes of this project.

According to John Schroeder, CEO and co-founder of MapR, the project presents multiple challenges, including analytics, storage and making sure that the data involved remains accurate and secure amidst authentications that amount to several millions over the course of each passing day. Individuals are provided with their number, and an iris scan or fingerprint is taken so that their identity can be proved by querying and matching against the database backbone, along with a headshot photo of the person. Each day witnesses over a hundred million verifications of identity, and all this needs to be done in real time, in about 200 milliseconds.

India has a large rural population, much of which is yet to be connected to the digital grid. As Schroeder continues, the solution had to be economical and reliable even under low-bandwidth conditions, and the technology behind it needed to be resilient enough to work even in areas with low levels of connectivity.

Source: Forbes


How Hadoop makes Optimum Use of Distributed Storage and Parallel Computing

Hadoop is a Java-based open source framework from the Apache Software Foundation. It works on the principle of distributed storage and parallel computing for large datasets on commodity hardware.

Let’s take a few core concepts of Hadoop in detail:

Distributed Storage – In Hadoop we deal with files of TB or even PB in size. We divide each file into parts and store them on multiple machines. Hadoop replicates each part 3 times by default (you can change the replication factor as per your requirement); 3 copies of each part minimize the risk of data loss in the Hadoop ecosystem, much as in real life you keep a spare car key at home to avoid problems in case your keys are lost.
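A quick back-of-the-envelope calculation shows why 3 replicas make data loss so unlikely (the machine failure rate below is an assumed, illustrative figure): a block is lost only if every machine holding a copy fails at the same time.

```python
# Back-of-the-envelope sketch (hypothetical failure rate): why a replication
# factor of 3 makes data loss unlikely. A block is lost only if *every*
# machine holding a copy fails, and failures are assumed independent.

p_machine_failure = 0.01   # assumed probability a given machine is down

def p_block_loss(replication_factor):
    # Replicas live on different machines, so all must fail together.
    return p_machine_failure ** replication_factor

for rf in (1, 2, 3):
    print(f"replication={rf}: P(block lost) = {p_block_loss(rf):.6f}")
```

Under these assumptions, going from 1 copy to the default 3 copies takes the loss probability from 1 in 100 down to 1 in 1,000,000.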


Parallel Processing – We have progressed a lot in terms of storage space and processor power, but the seek time of hard disks has not improved significantly. Because of this, reading a 1 TB file in Hadoop would otherwise take a long time; by storing the file across 10 machines in a cluster, we can reduce seek time by up to 10 times.
HDFS uses a default block size of 64 MB to store large files in an optimized manner.

Let me explain with some calculations:

                    Traditional System          Hadoop System (HDFS)
File Size           1 TB (1,000,000,000 KB)     1 TB (1,000,000,000 KB)
Block Size          8 KB (Windows)              64 MB (64,000 KB)
No. of Blocks       1,000,000,000 / 8           1,000,000,000 / 64,000
                    = 125,000,000               = 15,625
Avg. Seek Time      4 ms (assumed)              4 ms (assumed)
Total Seek Time     125,000,000 × 4 ms          15,625 × 4 ms
                    = 500,000,000 ms            = 62,500 ms

As you can see, with the HDFS block size of 64 MB we save 499,937,500 ms (i.e. about 99.99% of seek time) while reading a 1 TB file, in comparison to a Windows system.

We could further reduce seek time by dividing the file into n parts and saving them on n machines; the seek time for the 1 TB file would then be 62,500/n ms.
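The arithmetic above is easy to sanity-check in a few lines of Python (the 4 ms seek time is the same assumed figure as in the table):

```python
# Reproducing the seek-time arithmetic from the comparison above.

FILE_SIZE_KB = 1_000_000_000   # 1 TB, as in the example
SEEK_MS = 4                    # assumed average seek time per block

def total_seek_ms(block_size_kb, machines=1):
    """Total seek time to read the file, optionally split across machines."""
    blocks = FILE_SIZE_KB // block_size_kb
    return blocks * SEEK_MS // machines

traditional = total_seek_ms(8)       # 8 KB blocks -> 500,000,000 ms
hdfs = total_seek_ms(64_000)         # 64 MB blocks -> 62,500 ms
print(f"saved: {(traditional - hdfs) / traditional:.4%}")  # 99.9875%

# Spreading the file over 10 machines cuts the HDFS figure further:
print(total_seek_ms(64_000, machines=10))  # 6250 ms
```
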

Here you can see one use of parallel processing: the parallel reading of a file across multiple machines in a cluster.
Parallel processing is the concept on which the MapReduce paradigm in Hadoop works; it distributes a job into multiple tasks for processing as a MapReduce job. More details on MapReduce in an upcoming blog post.

Commodity Hardware – This is the usual hardware you use in your laptops and desktops, in place of high-availability, highly reliable machines such as IBM servers. The use of commodity hardware has helped businesses save a lot of infrastructure cost; commodity hardware is approximately 60% cheaper than a high-availability, reliable machine.