How Hadoop makes Optimum Use of Distributed Storage and Parallel Computing

How Hadoop makes Optimum Use of Distributed Storage and Parallel Computing


Hadoop is java based open source framework by Apache Software Foundation, It works on the principle of distributed storage and parallel computing for large datasets on commodity hardware.

Let’s take few core concepts of Hadoop in detail :

Distributed Storage – Here in Hadoop we deal with files of size TB or may be PB. We divide each file into parts and store them on multiple machines. It replicates each file by default 3 times (you can change replication factor as per your requirement) , 3 copies of each file minimizes the risk of data loss in Hadoop Eco system. In real life as you store a copy of car key at home to avoid problem in case your keys are lost

How Hadoop makes Optimum Use of Distributed Storage and Parallel Computing

Parallel Processing – We have progressed a lot in terms of storage space, processing power of processers but seek time of hard disk has not improved significantly to overcome this issue in Hadoop to read a file of 1 TB would take a long time by storing this file on 10 machines on a cluster, we can reduce seek time by upto 10 times.
HDFS has a minimum block size of 64MB to store large files in an optimized manner.

Let me explain you with some calculations:
Traditional System Hadoop System (HDFS)
File Size – 1TB (1000000000 KB) 1TB (1000000000 KB)
Windows Block Size – 8KB 64MB
No. of Blocks = 125000000 (1000000000 /8) 15625 (1000000000 /64000)
Assuming avg seek time = 4ms 4ms
Total Seek Time =125000000* 4 15625 * 4
= 500000000ms =62500ms
As you can see due to HDFS Block size of 64MB we could save 499937500ms (i.e. 99.98%of seek time) while reading 1TB of file in comparison to windows system.

We could further reduce seek time by dividing file into n parts and saving them on n no. of machines then seek time for 1TB file would be 62500/n ms.

Here you can see one use of parallel processing i.e. parallel reading of a file in cluster on multiple machines.
Parallel processing is a concept on which Map Reduce paradigm work in Hadoop, it distributes a job into multiple tasks for processing as a Map Reduce job more details in coming up blog for Map Reduce.

Commodity Hardware – It is the usual hardware that you use as your laptops / desktops in place of High Availability reliable IBM Machines. The use of commodity hardware has helped business hubs to save a lot of infrastructure cost. Commodity hardware is approx. 60% cheaper than High Availability reliable machine.