Category Archives: hadoop

IGFS : Ignite FileSystem

Everyone is talking about Apache Hadoop, and HDFS is the best-known framework for storing data reliably. I agree: it is fast, low cost and fault tolerant. But is there any competitor or alternative to HDFS for storing data? Yes, it is called the Ignite File System (IGFS). In this article I give an overview of the Ignite file system.


What is Ignite File System (IGFS)

Apache Ignite is a unified system to store and process any type of data. Internally, Apache Ignite uses the Ignite File System (IGFS) to store data. In other words, it stores data like HDFS, centralizes data in memory like Alluxio, and processes everything in memory like Spark. It is a fourth-generation system.

HDFS works with two kinds of storage: memory and disk. By default HDFS stores data on disk. Usually the data moves into memory while you are processing it, and the results are written back to disk afterwards.

IGFS, in contrast, keeps data in two storage levels: on-heap memory and off-heap memory. It stores data in memory by default; whatever does not fit on the heap is stored in off-heap memory. This means there are no disk I/O hits at processing time, so the data is processed quickly. That is a huge plus for in-memory processing systems like Spark.

On-heap Vs Off-heap memory
Put simply, temporary data is kept in memory while it is being processed. For example, assume the JVM heap is 8 GB.
If you are processing 5 GB, that data fits in the heap, so it is called on-heap data. After processing, the garbage collector cleans up that on-heap memory.

If the data is larger than the heap (8 GB), the remainder is stored in off-heap memory. For example, if you have an 8 GB heap and want to process 10 GB of data, 8 GB is stored on the heap and the remaining 2 GB is stored off-heap. The garbage collector does not manage off-heap memory.
On-heap memory is faster than off-heap memory, but off-heap memory is still much faster than disk. Ignite stores everything this way, on-heap and off-heap.

IGFS Integration with Other Systems

Ignite integrates easily with distributed systems and distributions such as HDFS, Cloudera and Hortonworks. Unlike HDFS, IGFS does not need a name node. It automatically determines file data locality using a hashing function.
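To make the idea of locating data with a hashing function concrete, here is a minimal sketch in Python. It is not Ignite's actual affinity function; the node names, the file path and the helper node_for_block are all hypothetical. It only shows that any client can compute the owner of a block without asking a central name node.

```python
import hashlib

def node_for_block(path, block_index, nodes):
    """Return the node that owns one block of a file, using only a hash."""
    key = f"{path}#{block_index}".encode("utf-8")
    digest = hashlib.md5(key).digest()
    # Turn the first 8 bytes of the digest into an index into the node list.
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

# Hypothetical three-node cluster and file.
nodes = ["node-a", "node-b", "node-c"]
placement = [node_for_block("/logs/events.log", i, nodes) for i in range(4)]
print(placement)
```

Because the result depends only on the path, the block index and the node list, every client computes the same placement independently.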

If you use Ignite, you do not need Alluxio (formerly Tachyon); both provide the same functionality. Ignite stores data and processes it at the same time, whereas Alluxio is simply an accelerator layer on top of HDFS and does no processing.

Please note that Ignite is a replacement for Alluxio, but not for HDFS or Spark. If you know Spark, you can run Spark directly or execute Ignite commands; Ignite supports both.

Additionally, Ignite supports OLAP and OLTP operations. None of these features is available in HDFS. That is why Apache Ignite is expected to do wonders in the future, especially in the Internet of Things.

RDDs Vs DataFrame Vs DataSet

An Application Program Interface (API) is a set of functions and procedures that lets applications access the features or data of an operating system, application, or other service.
APIs are essential to any framework. Spark exposes many APIs of its own and borrows others from the wider big data ecosystem. Recent Spark versions revolve around three APIs: RDD, DataFrame and Dataset.

Which is the best API, RDD, DataFrame or Dataset, and why? These are common Spark interview questions. In this post I explain the differences between RDD, DataFrame and Dataset.

What is RDD?
In simple words, an RDD is a collection of Java or Scala objects that is immutable, distributed and fault tolerant.
Spark core offers many functions, most of them modeled on Scala collection functions. Based on what they do, Spark divides these functions into transformations and actions.
In the RDD API you use these functions (transformations and actions) to compute the data. The main advantage is that if you know the Scala functions, computing data is easy.
The main disadvantage of RDDs is that they use Java serialization by default. Java and Scala both run on the JVM, so both use Java serialization.

Why Java Serialization?

For example, if you want to store arrays, JSON data, or other objects in a database that does not support them directly, you serialize the data into a binary format and then convert that binary data into a format the database understands.
Java has its own serialization mechanism, called Java serialization.
Java serialization is intended for small numbers of Java objects, not for large volumes. Used at scale, it drastically decreases performance.
Java serialization also consumes a lot of resources. That is why formats such as Avro are used instead: Avro compresses data internally, which gives a small performance advantage.
RDDs use Java serialization, so performance drops when you process large amounts of data. Kryo serialization optimizes Spark RDD jobs a little, but it comes with conditions; for best results you should register your classes.
One more disadvantage is that Java serialization sends both the data and its structure between nodes. That is another headache, and it is resolved in DataFrame.
RDDs have been in Spark since its early days (Spark 1.0). They process data well and everything works; performance is the main disadvantage. If you are processing unstructured data, RDDs are highly recommended.

What is DataFrame?
Later, in Spark 1.3, Spark introduced another API called DataFrame. It is very powerful and focuses mainly on performance and on running SQL queries on top of the data.
In simple words, a DataFrame is an RDD plus a schema. In a DataFrame the data is organized into named columns, like a table in an RDBMS, so the structure and the data are kept separate. Because Spark understands the schema, it does not need Java serialization to encode the structure; it serializes only the data.
A Spark developer can therefore easily run SQL queries on top of distributed data. DataFrame also supports DSL commands, so a Scala programmer can work in Scala as well. These features are not available in RDD.
Because Spark knows the schema, it also does not need to deserialize the data when it sorts or shuffles.

The power of the DataFrame API is the Catalyst optimizer. Internally it builds logical plans and physical plans and then, using a cost model, chooses the best optimized plan. As a result DataFrame queries are optimized internally in a way RDD code is not.
DataFrame is not entirely free of Java serialization, though: once you drop back to arbitrary JVM objects, the same disadvantages as with RDDs apply. So the main advantages are optimized performance and user friendliness, and the disadvantage is serialization.

What is DataSet API?
Another framework, Apache Flink, internally uses two powerful APIs called DataSet and DataStream. The DataSet API processes batch data and the DataStream API processes streaming data. Spark core is batch oriented by default, and Spark 1.6 added an API with the same name, Dataset, on an experimental basis.
The Dataset API got good results in Spark 1.6, so in Spark 2.0 DataFrame was merged into Dataset. In Spark 2.0 a DataFrame is simply a Dataset of Row objects (Dataset[Row]) rather than a separate API.
The main differences between RDD, DataFrame and Dataset are serialization and performance. The Dataset API internally uses a special serialization mechanism called an encoder, which is much more efficient than Java serialization. It supports RDD-style transformations and DataFrame DSL commands, and allows SQL queries as well. So if you know RDDs and DataFrames, you can apply the same steps to Datasets.

In other words, a Dataset unifies RDD and DataFrame using encoder serialization. Dataset was introduced in version 1.6, but it is the main abstraction in Spark 2.0. The main advantage of Dataset is that it is a high-level API with type safety, whereas RDD is type-safe but low level. So the programmer can catch syntax errors and analysis errors at compile time.

Spark as a whole is moving towards Dataset: instead of Spark Streaming, MLlib and GraphX, it is heading towards Structured Streaming, ML pipelines and GraphFrames. My prediction is that RDD concepts will fade away in the future.
One disadvantage is that only Java and Scala support Dataset; Python does not, because of its dynamic nature.

The main difference between RDD, DataFrame and Dataset is performance: to optimize performance, Spark moved from RDD to DataFrame and then to Dataset.

WordCount in Spark Scala

Hello. In these videos I explain how to install Eclipse, how to install Scala, how to create the appropriate Eclipse and Maven configuration for Spark applications, and finally how to run the Spark WordCount program with the Maven build tool.

How to install Eclipse in Ubuntu:

If you want this script, just mail me and I will send it.

# download Eclipse from the Eclipse website
# put it wherever you want, for example /home/hadoop/work
tar -zxvf /home/hadoop/work/eclipse-jee-mars-R-linux-gtk-x86_64.tar.gz
gksudo gedit /usr/share/applications/eclipse.desktop
# enter your password
# paste the following into the file
[Desktop Entry]
Name=Eclipse 4
Comment=Integrated Development Environment

##### How to install the Scala plugin in Eclipse #####
# first check for updates and install them, to prevent problems later
go to Help > Check for Updates > Next > Next > accept conditions > Finish // wait 5 min, then restart Eclipse
go to Help > Eclipse Marketplace > Find: scala > Scala IDE > Confirm > Next > Next > Accept > Finish

after creating the project in Maven, right click it and go to Configure > Add Scala Nature

##### How to create a Maven project and a hello world Scala program #####
for spark Streaming:
spark core:
spark sql:

# change these based on your Spark, Scala and Hadoop versions.

WordCount using Spark Scala

Hadoop Mapreduce Interview Questions

What is Hadoop MapReduce ?
MapReduce is a programming framework used to process or analyze vast amounts of data over a Hadoop cluster. It processes large datasets in parallel across the cluster in a fault-tolerant manner.
Can you elaborate about MapReduce job?
Based on the configuration, a MapReduce job first splits the input data into independent chunks (input splits). These chunks are processed first by the map() function and then by the reduce() function. The framework takes care of sorting the map outputs and scheduling the tasks.
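The flow of a job can be imitated in a few lines of plain Python. This is only a single-process sketch of the map, shuffle/sort and reduce steps for a word count, not Hadoop code; the function names and the sample input splits are made up for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle_and_sort(pairs):
    """Shuffle/sort: group the values by key and sort the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(key, values):
    """Reduce: collapse the list of values for one key into a total."""
    return key, sum(values)

def run_job(splits):
    # Each inner list stands for one input split handled by one mapper.
    intermediate = [pair
                    for split in splits
                    for line in split
                    for pair in map_phase(line)]
    return [reduce_phase(k, v) for k, v in shuffle_and_sort(intermediate)]

splits = [["hadoop stores data", "spark processes data"],
          ["hadoop runs mapreduce"]]
result = run_job(splits)
print(result)
```

In a real cluster each split is mapped on a different node and the grouped keys are spread over several reducers; the order of the three steps is the same.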
Why are the compute nodes and the storage nodes the same?
Compute nodes process the data and storage nodes store the data. The Hadoop framework tries to minimize network traffic, and to achieve that it follows the data locality concept: the compute code executes where the data is stored, so the data node and the compute node are the same.
What is the importance of the Configuration object in MapReduce?

  • It is used to set and get parameter name/value pairs from the XML configuration files.
  • It is used to initialize values: read them from an external file and set them as parameters.
  • Parameter values in the program are always overwritten by new values coming from external configuration files.
  • Parameters that are not set take Hadoop's default values.

Where is MapReduce not recommended?

MapReduce is not recommended for iterative processing, that is, feeding the output back in as input in a loop.
MapReduce is also not suitable for a series of chained MapReduce jobs: each job persists its data to disk, and the next job has to load it again. That is a costly operation and not recommended.

What is the NameNode and what are its responsibilities?

NameNode is the name of the daemon that runs on the master node. It is the heart of the entire Hadoop system: it stores the metadata in the FsImage and receives block information from the DataNodes through heartbeats and block reports.

What is JobTracker’s responsibility?

  • Scheduling the job's tasks on the slaves. The slaves execute the tasks as directed by the JobTracker.
  • Monitoring the tasks and re-executing any that fail.

What are the JobTracker and TaskTracker in MapReduce?
The MapReduce framework consists of a single JobTracker per cluster and one TaskTracker per node. A cluster usually has multiple nodes, so each cluster has a single JobTracker and multiple TaskTrackers.
The JobTracker schedules the jobs and monitors the TaskTrackers. If a TaskTracker fails to execute a task, the JobTracker tries to re-execute the failed task.
The TaskTracker follows the JobTracker's instructions and executes the tasks. As a slave node, it reports the job status to the master JobTracker in the form of heartbeats.
What is the importance of job scheduling in Hadoop MapReduce?
Scheduling is a systematic procedure for allocating resources in the best possible way among multiple tasks. A Hadoop cluster runs many jobs, and sometimes a particular job needs a higher priority so that it finishes quickly; that is where job schedulers come into the picture. The default scheduler is FIFO.
The FIFO scheduler, the Fair Scheduler and the Capacity Scheduler are the most popular schedulers in Hadoop.
When is a reducer used?
A reducer is used to combine the output of multiple mappers. The reducer has three primary phases: shuffle, sort and reduce. It is possible to process data without a reducer; use one when shuffle and sort are required.
What is Replication factor?
The replication factor is the number of copies of each block of data that are stored on different nodes within a cluster. The default value is 3, but it can be changed. Each file is automatically split into blocks that are spread across the cluster.
Where do the shuffle and sort happen?
After the mapper generates its output, the intermediate data is stored temporarily on the local file system. The location of these temporary files is usually configured in core-site.xml. The Hadoop framework aggregates and sorts this intermediate data and then passes it to the reduce function. The framework deletes the temporary data from the local system after the job completes.
Is Java mandatory to write MapReduce jobs?
No. Hadoop itself is implemented in Java, but MapReduce applications need not be written in Java. Hadoop supports Python, Ruby, C++ and other programming languages.
The Hadoop Streaming API lets you create and run MapReduce jobs with any executable or script as the mapper and/or the reducer.
Hadoop Pipes lets programmers implement MapReduce applications in C++.
Which methods control the output types of the map and reduce functions?
setOutputKeyClass() and setOutputValueClass()
If the map output types differ from the final output types, the map output types can be set with these methods:
setMapOutputKeyClass() and setMapOutputValueClass()
What is the main difference between Mapper and Reducer?
The map method is called separately for each input key/value pair. It processes input key/value pairs and emits intermediate key/value pairs.
The reduce method is called separately for each key and its list of values. It processes intermediate key/value pairs and emits final key/value pairs.
Both also have an initialization step that runs before any other method is called and produces no output.

Why are the compute nodes and the storage nodes the same?
Compute nodes are logical processing units and storage nodes are physical storage units. Both run on the same node because of data locality. As a result Hadoop minimizes network traffic and processes data quickly.
What is the difference between a map-side join and a reduce-side join? Or:
When do we use a map-side join and when a reduce-side join?
Joining multiple tables on the mapper side is called a map-side join. Please note that a map-side join has strict requirements: the data must be in a strict format, sorted and partitioned properly. It fits best when one of the datasets is small, because the join then never has to go through the reducer phase.

Joining multiple tables on the reducer side is called a reduce-side join. Use it when you have large tables to join and they do not meet the map-side requirements; it works for any table sizes because the shuffle brings matching keys together. It is the most general way to join multiple tables.
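The difference between the two join styles can be sketched in plain Python. This is not Hadoop code: the tables, the function names and the in-memory dictionaries are assumptions chosen to show where the matching happens in each case.

```python
from collections import defaultdict

def map_side_join(large_rows, small_table):
    """Map-side join: the small table is held in memory by every mapper,
    so each large row is joined on the spot and no reducer is needed."""
    lookup = dict(small_table)
    return [(key, value, lookup[key])
            for key, value in large_rows if key in lookup]

def reduce_side_join(left_rows, right_rows):
    """Reduce-side join: rows from both tables are grouped by key
    (the shuffle), then each key's rows are combined in the reducer."""
    grouped = defaultdict(lambda: ([], []))
    for key, value in left_rows:
        grouped[key][0].append(value)
    for key, value in right_rows:
        grouped[key][1].append(value)
    joined = []
    for key in sorted(grouped):
        lefts, rights = grouped[key]
        joined.extend((key, l, r) for l in lefts for r in rights)
    return joined

# Hypothetical tables keyed by customer id.
orders = [(1, "book"), (2, "pen"), (1, "lamp"), (3, "desk")]
customers = [(1, "Asha"), (2, "Ravi")]

print(map_side_join(orders, customers))
print(reduce_side_join(orders, customers))
```

Both functions return the same joined rows; only the place where the matching is done, and therefore the amount of data that must be shuffled, differs.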
What happens if the number of reducers is 0?
Zero reducers is also a valid configuration in MapReduce. In this scenario no reducer executes, so the mapper output is treated as the final output, and Hadoop stores it in the job's output folder.
When do we use a combiner? Why is it recommended?
Mappers and reducers are independent; they do not talk to each other. When the function is commutative (a.b = b.a) and associative (a.(b.c) = (a.b).c), we use a combiner to optimize the MapReduce process. Many MapReduce jobs are limited by network bandwidth, so the Hadoop framework tries to minimize the data sent over the network. To achieve this, MapReduce allows a user-defined combiner function to run on the map output. It is a MapReduce optimization technique, but it is optional.
What is the main difference between the MapReduce combiner and reducer?
Both the combiner and the reducer are optional, but both are frequently used in MapReduce. There are three main differences:
1) A combiner gets its input from only one mapper, while a reducer gets input from multiple mappers.
2) If aggregation is required, use a reducer; if the function also follows the commutative (a.b = b.a) and associative (a.(b.c) = (a.b).c) laws, you can add a combiner.
3) The input and output key and value types must be the same in a combiner, but a reducer can take any input type and produce any output type.
What is a combiner?
It is a local aggregation of the key/value pairs produced by a mapper. It removes a lot of duplicated data transfer between nodes, so it eventually optimizes job performance. The framework decides whether the combiner runs zero, one or multiple times. It is not suitable for functions such as mean.
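A small worked example shows why sum is safe in a combiner and mean is not. The numbers and variable names are invented for illustration; each inner list stands for the values one mapper emitted for the same key.

```python
# Values emitted for one key by two different mappers.
mapper_outputs = [[1, 2, 3], [10]]

# Sum is commutative and associative, so combining per mapper is safe.
partial_sums = [sum(values) for values in mapper_outputs]
total_with_combiner = sum(partial_sums)
total_without_combiner = sum(v for values in mapper_outputs for v in values)

# Mean is not: the mean of per-mapper means differs from the true mean.
all_values = [v for values in mapper_outputs for v in values]
true_mean = sum(all_values) / len(all_values)
partial_means = [sum(values) / len(values) for values in mapper_outputs]
mean_of_means = sum(partial_means) / len(partial_means)

# A common workaround: the combiner emits (sum, count) pairs instead,
# and only the reducer performs the final division.
partials = [(sum(values), len(values)) for values in mapper_outputs]
mean_from_pairs = sum(s for s, _ in partials) / sum(c for _, c in partials)

print(total_with_combiner, total_without_combiner)
print(true_mean, mean_of_means, mean_from_pairs)
```

Here the combined and uncombined sums are both 16, while the mean of means (6.0) is not the true mean (4.0); carrying the sum and the count restores the correct result.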
What is partition?
After the combiner has run on the intermediate map output, the partitioner decides which reducer each key goes to for the shuffle and sort. The partitioner divides the intermediate data according to the number of reducers, so that all the data in a single partition is processed by a single reducer. In other words, each partition is handled by exactly one reducer. Whenever a job has reducers, the partitioner is called automatically.
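A hash partitioner can be sketched as follows. Hadoop's default partitioner uses the key's Java hashCode modulo the number of reduce tasks; the CRC32 hash here is only a stand-in so that the example runs in Python, and the sample keys are made up.

```python
import zlib

def partition(key, num_reducers):
    """Map a key to a reducer index in the range [0, num_reducers)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

keys = ["hadoop", "spark", "hive", "hadoop", "pig", "spark"]
num_reducers = 3

# Group the keys by the reducer they are routed to.
assignment = {}
for key in keys:
    assignment.setdefault(partition(key, num_reducers), []).append(key)
print(assignment)
```

Every occurrence of the same key lands in the same partition, which is what guarantees that a single reducer sees all values for that key.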
When do we use partitions?
By default Hive reads the entire dataset even when the application needs only a slice of the data. That is a bottleneck for MapReduce jobs, so Hive offers a special option called partitions. When you create a table, Hive partitions it based on your requirements.
What are the important steps when you are partitioning a table?
Do not over-partition the data into too many small partitions; that is an overhead for the NameNode.
With dynamic partitioning in strict mode, at least one static partition must exist; to make every partition dynamic, switch to nonstrict mode with the commands below.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
First load the data into a non-partitioned table, then load that data into the partitioned table. It is not possible to load data from the local file system straight into dynamic partitions.

insert overwrite table table_name partition(year) select * from non_partitioned_table;
Can you elaborate on the MapReduce job architecture?
First the Hadoop programmer submits the MapReduce program to the JobClient.

The JobClient asks the JobTracker for a job ID. The JobTracker provides a unique job ID of the form job_<JobTracker start time>_0001.

Once the JobClient has received the job ID, it copies the job resources (job.xml, job.jar) to the file system (HDFS) and submits the job to the JobTracker. The JobTracker initializes the job and schedules it.

Based on the configuration, the JobClient computes the input splits and writes them to HDFS. The TaskTracker retrieves the job resources from HDFS and launches a child JVM. The map and reduce tasks run in this child JVM and report the job status to the JobTracker.
Why does the TaskTracker launch a child JVM?
Hadoop developers frequently submit wrong or buggy jobs by mistake. If the TaskTracker ran tasks in its own JVM, such a job could interrupt the main JVM and affect other tasks. With a child JVM, if the task tries to damage existing resources, the TaskTracker kills that child JVM and retries the task in a new child JVM.
Why do the JobClient and JobTracker submit job resources to the file system?
Data locality. Moving computation is cheaper than moving data. The logic (computation) is in the JAR file and the splits, and the data is on the DataNodes of the file system. So every resource is copied to where the data is available.

How many mappers and reducers can run?

By default Hadoop can run 2 mappers and 2 reducers on one DataNode; that is, each node has 2 map slots and 2 reduce slots. These defaults can be changed in mapred-site.xml in the conf directory.
What is InputSplit?
An InputSplit is the chunk of data processed by a single mapper. In other words, it is a logical chunk of data handled by one mapper; by default the input split size equals the block size.
How to configure the split value?
By default the block size is 64 MB, but to process the data the framework divides it into splits. Hadoop architects use these formulas to find the split size.

1) split size = min(max_split_size, max(block_size, min_split_size))

2) split size = max(min_split_size, min(block_size, max_split_size))

By default, split size = block size.

The number of splits always equals the number of mappers.

Applying the formulas above:

1) split size = min(max_split_size, max(64 MB, 512 KB)) // max_split_size depends on the environment, maybe 1 GB or 10 GB

split size = min(10 GB (let us assume), 64 MB)

split size = 64 MB

2) split size = max(min_split_size, min(block_size, max_split_size))

split size = max(512 KB, min(64 MB, 10 GB))

split size = max(512 KB, 64 MB)

split size = 64 MB
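The two formulas can be checked with a few lines of Python. The 512 KB minimum and 10 GB maximum are the example values assumed in the walkthrough above, not Hadoop defaults, and the function names are hypothetical.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB

def split_size_v1(block_size, min_split_size, max_split_size):
    """Formula 1: min(max_split_size, max(block_size, min_split_size))."""
    return min(max_split_size, max(block_size, min_split_size))

def split_size_v2(block_size, min_split_size, max_split_size):
    """Formula 2: max(min_split_size, min(block_size, max_split_size))."""
    return max(min_split_size, min(block_size, max_split_size))

block, smallest, largest = 64 * MB, 512 * KB, 10 * GB
size_1 = split_size_v1(block, smallest, largest)
size_2 = split_size_v2(block, smallest, largest)
print(size_1 // MB, size_2 // MB)
```

Both formulas give 64 MB for these values. They agree whenever the minimum split size does not exceed the maximum, because both simply clamp the block size into that range.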
How much RAM is required to process 64 MB of data?
Let us assume a 64 MB block size and a system that runs 2 mappers and 2 reducers: 64 * 4 = 256 MB of memory. The OS takes at least 30% extra, about 77 MB, so at least 256 + 77 = 333 MB of RAM is required to process a chunk of data.

So processing unstructured data in this way requires more memory.
What is the difference between a block and a split?
Block: how much data is stored in one physical chunk in HDFS.
Split: how much data is processed by one mapper; it is a logical chunk.
Why does the Hadoop framework read a file in parallel and not sequentially?
Why does Hadoop read in parallel but not write in parallel?
Hadoop reads data in parallel so that it can retrieve the data faster. It writes in sequence, not in parallel, because with parallel writes one node's data could be overwritten by another's and no node would know where the next one is writing. Parallel processing is independent, so there is no relation between two nodes; if data were written in parallel, it would not be possible to know where the next chunk of data is. For example, write 100 MB in parallel as one 64 MB block and one 36 MB block: the first block would not know where the remaining data is. So Hadoop reads in parallel and writes sequentially.
If I change the block size from 64 MB to 128 MB, what happens?
Changing the block size does not affect existing data. After the change, every new file is chunked into 128 MB blocks.

That means old data stays in 64 MB blocks, while new data is stored in 128 MB blocks.
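How a file is cut into blocks for a given block size is simple arithmetic, sketched here in Python. The helper block_sizes is hypothetical and works in whole megabytes only.

```python
def block_sizes(file_mb, block_mb):
    """Return the sizes (in MB) of the blocks a file is cut into."""
    full_blocks, remainder = divmod(file_mb, block_mb)
    return [block_mb] * full_blocks + ([remainder] if remainder else [])

old_layout = block_sizes(100, 64)   # file written with 64 MB blocks
new_layout = block_sizes(100, 128)  # same size of file written with 128 MB blocks
print(old_layout, new_layout)
```

A 100 MB file written with 64 MB blocks occupies a 64 MB block and a 36 MB block, while the same amount of data written after the change fits in a single block. The last block only takes as much space as the data it holds.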
What is isSplitable()?
By default it returns true. It controls whether the input format may split a file. For unstructured data it is not recommended to split the file, so the entire file is processed as one split; to do that, override isSplitable() to return false.
What are the maximum and minimum block sizes Hadoop allows?
Minimum: 512 bytes, which is the local OS file system block size. The block size cannot be made smaller than that.

Maximum: depends on the environment. There is no upper bound.
What are the job resource files?
job.xml and job.jar are the core resources needed to process the job. The JobClient copies these resources to HDFS.
What does a MapReduce job consist of?
A MapReduce job is a unit of work that the client wants performed. It consists of the input data, the MapReduce program in a JAR file and the configuration settings in XML files. Hadoop runs the job by dividing it into tasks with the help of the JobTracker.
What is data locality?
This is one of the most frequently asked Cloudera certification and MapReduce interview questions. Processing the data wherever the data resides, instead of moving the data to the computation, is called data locality. "Moving computation is cheaper than moving data": to achieve this goal, Hadoop follows data locality. It is possible when the data is splittable, which it is by default.
What is speculative execution?
This is another important question for MapReduce interviews and for Cloudera certification. Hadoop runs its processes on commodity hardware, so systems may fail or run low on memory. If a system fails, its process fails too, which is not acceptable. Speculative execution is a performance optimization technique: the same computation is distributed to multiple systems, and the result from whichever system finishes first is used. By default it is enabled. So even if a system crashes, it is not a problem; the framework takes the result from another system.

E.g.: the logic is distributed to systems A, B, C and D, and each completes within a certain time.

Systems A, B, C and D take 10, 8, 9 and 12 minutes respectively, running simultaneously. The framework takes the result from system B and kills the processes on the remaining systems.
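The selection rule in this example can be written down as a tiny Python sketch. It only models the decision (keep the fastest attempt, kill the rest); the function name and the timing table are taken from the example above and are not a Hadoop API.

```python
def speculative_winner(attempt_minutes):
    """Return the attempt that finishes first and the attempts to kill."""
    winner = min(attempt_minutes, key=attempt_minutes.get)
    killed = sorted(name for name in attempt_minutes if name != winner)
    return winner, killed

# Minutes each system needs for the same piece of work.
attempts = {"A": 10, "B": 8, "C": 9, "D": 12}
winner, killed = speculative_winner(attempts)
print(winner, killed)
```

With these timings the winner is B, and the attempts on A, C and D are the ones the framework would kill.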
When do we use a reducer?
Use reducers only when sort and shuffle are required; otherwise there is no need to partition. A filter, for example, needs no sort and shuffle, so it can be done without a reducer.
What is chain Mapper?
ChainMapper is a special mapper class that runs a set of mappers in a chain within a single map task. One mapper's output acts as the next mapper's input, and in this way any number of mappers can be connected in a chain.
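The chaining idea can be sketched in plain Python with generator functions. This is not the Hadoop ChainMapper class; the chain helper and the three small mappers are hypothetical and only show how each mapper's output feeds the next one inside a single task.

```python
def chain(*mappers):
    """Build one function that pipes a record through every mapper in order."""
    def run(record):
        records = [record]
        for mapper in mappers:
            # Each mapper may emit zero, one or many records per input.
            records = [out for rec in records for out in mapper(rec)]
        return records
    return run

def lowercase(record):
    yield record.lower()

def tokenize(record):
    yield from record.split()

def drop_short_words(record):
    if len(record) > 2:
        yield record

pipeline = chain(lowercase, tokenize, drop_short_words)
words = pipeline("Hadoop is a Framework")
print(words)
```

The input line is lowercased, split into words and filtered in one pass, so no intermediate result has to be written out between the mappers.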
How to do value level comparison?
Hadoop sorts and compares at the key level only, not at the value level.
What are the setup and cleanup methods?
If you do not know the starting and ending points of the input, some problems are difficult to solve. Setup and cleanup resolve that.

With N blocks, by default one mapper is called for each split, and each split gets one setup call and one cleanup call, however many lines it contains. Setup initializes the job resources, map processes the data, and cleanup closes the job resources: once the last map call has completed, cleanup is invoked. This improves data transfer performance. The same approach works in the reducer as well.

If you need to compare one key/value pair with another, use these methods. For record-level comparison, setup and cleanup open a resource once, process many records and close it once, which saves a lot of network traffic during processing.
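The "open once, process many, close once" life cycle can be sketched in Python. The class, its lookup table and the run_task driver are hypothetical; they mirror the order in which a framework would call setup, map and cleanup for one split.

```python
class LookupMapper:
    """Mapper with a resource that is opened once per split."""

    def __init__(self):
        self.events = []   # records the call order, for illustration
        self.lookup = None

    def setup(self):
        # Open the shared resource a single time (here: an in-memory table).
        self.lookup = {"IN": "India", "US": "United States"}
        self.events.append("setup")

    def map(self, record):
        self.events.append("map")
        return record, self.lookup.get(record, "unknown")

    def cleanup(self):
        # Release the resource after the last record of the split.
        self.lookup = None
        self.events.append("cleanup")

def run_task(mapper, split):
    """Call setup once, map once per record, and cleanup once."""
    mapper.setup()
    try:
        return [mapper.map(record) for record in split]
    finally:
        mapper.cleanup()

mapper = LookupMapper()
output = run_task(mapper, ["IN", "US", "FR"])
print(output)
print(mapper.events)
```

The lookup table is built only once for the three records, which is the saving the answer above describes.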
Why does the TaskTracker launch a child JVM to do a task? Why not use the existing JVM?
Sometimes child threads corrupt parent threads, which means a programmer's mistake could disrupt the entire MapReduce task. So the TaskTracker launches a child JVM for each individual map or reduce task. If the TaskTracker used its existing JVM, a task might damage the main JVM. If a bug occurs, the TaskTracker kills the child process and launches another child JVM to do the same task. Usually the task is retried up to 4 times.
How many slots are allocated on each node?
By default each node has 2 slots for mappers and 2 slots for reducers, so each node has 4 slots to process the data.
What is RecordReader?
A RecordReader reads <key, value> pairs from an InputSplit. Typically the RecordReader converts the byte-oriented view of the InputSplit into a record-oriented view for the mapper; only then can the mapper process the data.

The record reader is obtained from the input format configured for the job.

FileInputFormat.addInputPath() reads the files from the specified directory and sends them to the mapper. All these settings are part of the MapReduce job configuration.
Can you explain different types of Input formats?

The input format is very important in MapReduce.




Biginsights installation in Linux

As part of the IBM BigInsights series, in this post I explain how to install IBM BigInsights on Red Hat Linux (RHEL 7). Installing VMware is easy and well known; once VMware is installed, please check that your system meets the requirements below.

BIOS setting:
Press F10 > Security > System Security > enable Virtualization Technology > press F10 to save and exit.

Requirements to install IBM InfoSphere BigInsights:
Minimum: 2CPUs, 4GB RAM, 30GB DISK
Recommended: 4CPUs, 8GB RAM, 70 GB DISK.

Step 1: Register and log in to your IBM account. Click this link to register.

Step 2: Go to your email account and confirm your account. Then log in here.

Step 3: You are automatically redirected to this page.

Step 4: Scroll down, select the desired version, then click Download.

Step 5: Download BigInsights from this link. A pop-up box will appear. The download takes a long time, depending on your network speed.

Please note that the unzipped file takes much more disk space: the zip file is just 12 GB, but after unzipping it is 27 GB. After unzipping, place the vmdk file (iibi3002_QuickStart_Cluster_VMware.vmdk) in the Virtualbox VMS folder where your VMware is installed.

Now get ready to install IBM BigInsights in the virtual machine.

Step 6: Click on Open Virtual Machine.

Step 7: In the dialog that pops up, select the VMX file and click Open. That is it; VMware configures everything automatically.

Step 8: The VM is configured with 8 GB of RAM and 4 CPUs, but if your machine has only 8 GB of RAM, change it to 6 GB and 2 CPUs. The reason is that your host system also needs some RAM, so configure it this way to prevent problems.

Step 9: Double click on Memory, or click Edit virtual machine settings, to modify the memory and processors.

Step 10: Change the RAM size from 8192 to 6916, then click OK. Similarly, change the processors from 4 to 2.

Step 11: Finally click Start up this guest operating system. It shows several pop-ups with OK buttons; simply click OK on each.

Step 12: BigInsights now installs, showing its progress as a percentage.

Step 13: After installation it asks for the bivm login: enter root, then at Password: enter your desired password (example: PAss123).

Step 14: Next it asks for the BiAdmin username: enter your desired username (example: biadmin). Then it asks for a password: enter your desired password (example: BIadmin123). Finally it asks for the language: enter your language (example: English US). Please note that these passwords and the language cannot be changed later. Press F10.

Step 15: IBM InfoSphere BigInsights is now almost installed. Simply log in with your credentials, for example username biadmin and the password you chose in Step 14.

Step 16: You are now automatically taken into the InfoSphere BigInsights console. The installation is complete.

This is not only for Red Hat Linux (RHEL 7): on any host OS, whether Windows, Ubuntu or anything else, you follow the same steps.

Hadoop 2.x Interview questions

What are the core changes in Hadoop 2.x?

There are many changes; the main ones are removing the single point of failure and decentralizing the JobTracker's power to the data nodes. The entire JobTracker architecture changed. Some of the main differences between Hadoop 1.x and 2.x are given below.

  • Single point of failure: rectified.
  • Node limit (from about 4000 to effectively unlimited): rectified.
  • JobTracker bottleneck: rectified.
  • Map and reduce slots changed from static to dynamic.
  • High availability: available.
  • Supports both interactive and iterative graph algorithms (1.x does not).
  • Allows other applications to integrate with HDFS as well.

What is YARN?

YARN stands for "Yet Another Resource Negotiator". It is used for efficient cluster utilization and is the most powerful technology in 2.x. Unlike 1.x, where the JobTracker did everything, resource management (ResourceManager) and job scheduling/monitoring (ApplicationMaster) run in separate daemons, which eases the JobTracker's problems. YARN is the layer made up of the ResourceManager and the NodeManagers.

What is the difference between MapReduce1 and MapReduce2/YARN?

In MapReduce 1, Hadoop centralized all tasks in the JobTracker, which allocated resources and scheduled jobs across the cluster. YARN decentralizes this to ease the pressure on the JobTracker. The ResourceManager allocates resources to particular nodes, the NodeManagers launch the work on those nodes, and a per-application ApplicationMaster manages and executes the job, which allows parallel execution. This approach eases many JobTracker problems, improves scalability and optimizes job performance. Additionally, YARN allows multiple kinds of applications to scale on the distributed environment.

How does Hadoop determine the distance between two nodes?

The Hadoop admin writes a script, called the topology script, that determines the rack location of nodes. Hadoop calls it to learn the distance between nodes when placing replicas. Configure the path to this script in core-site.xml; the script itself should report where each node is located.
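As a rough illustration of what such a topology script can look like, here is a minimal Python sketch. Hadoop invokes the script with one or more node addresses as arguments and reads the rack paths from standard output. The addresses and rack names in the table are hypothetical, and a real script would usually load them from a file.

```python
#!/usr/bin/env python3
"""Hypothetical rack topology script: prints one rack path per argument."""
import sys

# Assumed host-to-rack table, for illustration only.
RACKS = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"


def resolve(hosts):
    """Return the rack path for each host, falling back to the default rack."""
    return [RACKS.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hadoop passes IP addresses or hostnames as command-line arguments.
    print(" ".join(resolve(sys.argv[1:])))
```

Unknown hosts fall back to a default rack so that the script always answers for every argument it is given.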

A user deleted a file by mistake. How does Hadoop remove it from its file system? Can you roll it back?

When trash is enabled, HDFS first moves the deleted file into the user’s .Trash directory and keeps it there for a configurable amount of time. During this time the file can be restored by moving it back out of the trash, and its blocks are not freed. After this time, the NameNode deletes the file from the HDFS namespace and frees its blocks. The retention time is configured as fs.trash.interval (in minutes) in core-site.xml. By default its value is 0, which disables trash so that files are deleted immediately.

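What is the difference between Hadoop NameNode federation, NFS and JournalNodes?

HDFS federation separates the namespace across several independent NameNodes to improve scalability and isolation. NFS and JournalNodes, by contrast, are two ways of sharing the edit log between an active and a standby NameNode for high availability.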


What is DistCP functionality in Hadoop?

DistCp is a distributed copy tool used to transfer large amounts of data within a cluster or between clusters.
hadoop distcp hdfs://namenode1:8020/nn hdfs://namenode2:8020/nn
It can copy multiple sources to a destination cluster. The last argument is the destination.
hadoop distcp hdfs://namenode1:8020/dd1 hdfs://namenode2:8020/dd2 hdfs://namenode3:8020/dd3

Is YARN a replacement for MapReduce?

YARN is a generic framework. It supports MapReduce, but it is not a replacement for MapReduce. You can develop many applications with the help of YARN; Spark, Drill and many more applications work on top of YARN.

What are the core concepts/Processes in YARN?

  1. Resource manager: equivalent to the JobTracker.
  2. Node manager: equivalent to the TaskTracker.
  3. Application master: equivalent to jobs. Everything is an application in YARN. When a client submits a job (application),

Containers: equivalent to slots.

Yarn child: when you submit an application, the application master dynamically launches Yarn child processes to run the map and reduce tasks.

If the application master fails, it is not a problem: the resource manager automatically starts a new attempt.


Steps to upgrade Hadoop 1.x to Hadoop 2.x?

Don’t upgrade from 1.x to 2.x in place. Simply download 2.x locally, then remove the old 1.x files. The upgrade takes some time.

Make sure the share folder is there; it’s important (share/hadoop/mapreduce/lib).

Stop all processes.

Delete the old metadata from work/hadoop2data.

Copy the 1.x data and rename it into work/hadoop2.x.

Don’t format the NameNode during the upgrade.

hadoop namenode -upgrade // It will take a lot of time.

Don’t close the previous terminal; open a new terminal.

To roll back a failed upgrade:

hadoop namenode -rollback

Cloudera Certification Questions

Cloudera certification is a dream for many Hadoop developers and administrators; it is neither too easy nor too hard. These interview questions can help you get a Cloudera certification on the first attempt. They give only an overview of Hadoop core concepts and the ecosystem. Please note that the Cloudera certification uses a multiple-choice format, and those with in-depth knowledge of HDFS and MapReduce will pass it.



Recognize and identify Apache Hadoop daemons and how they function both in data storage and processing.

A daemon is an independent program that runs as a background process. Apache Hadoop 1.x is comprised of five independent daemons, each running in its own JVM. The daemons run in pseudo-distributed mode (all on a single machine) and in cluster mode, but not in standalone mode.

There are two types of nodes in a Hadoop cluster. NameNode, Secondary NameNode and JobTracker run on master nodes; DataNode and TaskTracker run on slave nodes.
NameNode: This daemon holds the namespace (metadata) for HDFS. It stores the metadata both in RAM and on local disk.
Secondary NameNode: This daemon stores a copy of the NameNode’s metadata, but it is not a replacement for or alternative to the NameNode. It stores the data on disk; by default it takes a checkpoint of the metadata every hour.
JobTracker: This daemon schedules jobs and manages the cluster resources.

The slave nodes depend on the master nodes. Each master daemon is a single point of failure.

Daemon               HTTP port   RPC port
NameNode             50070       8020
Secondary NameNode   50090       -
DataNode             50075       -
JobTracker           50030       8021
TaskTracker          50060       -

TaskTracker: This daemon receives instructions from the JobTracker, executes the MapReduce tasks and reports their status back to the JobTracker.
DataNode: After a large file is split, this daemon stores the data in the form of blocks. It sends heartbeats and block reports to the NameNode.

Understand how Apache Hadoop exploits data locality.

The Hadoop framework always tries to minimize network traffic and maximize the throughput of the system. Moving data over the network while an application is processing is a costly operation, so by default Hadoop moves the computation to where the data is located. HDFS provides interfaces for applications to move their logic closer to the data. This is called data locality. Hadoop is fault tolerant, so a few warnings or failures do not stop the entire process. HDFS does not allow in-place updates: once a file is created, written and closed, its existing contents cannot be altered.

Why do you set up a cluster in Hadoop?

It’s mandatory. If data is deployed over separate systems (data sets) instead of a cluster, authentication is required every time. In a cluster, the admin can get authentication privileges to access the different data sets. So it’s mandatory to form a cluster, whether in a pseudo-distributed or a production environment.

Identify the role and use of both MapReduce v1 (MRv1) and MapReduce v2 (MRv2 / YARN) daemons.


Analyze the benefits and challenges of the HDFS architecture.

Analyze how HDFS implements file sizes, block sizes, and block abstraction.

Understand default replication values and storage requirements for replication.

Determine how HDFS stores, reads, and writes files.

Identify the role of Apache Hadoop Classes, Interfaces, and Methods.

Understand how Hadoop Streaming might apply to a job workflow.

What is MapReduce?

MapReduce is a linearly scalable programming model. It is a batch query processor with the ability to run user queries against whole datasets and get results in a reasonable time.

Why is MapReduce needed? Why not use an RDBMS for large-scale batch analysis?

What is the difference between seek time and transfer rate?

Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data, whereas the transfer rate is the disk’s bandwidth. Seek time is improving more slowly than transfer rate, so an access pattern dominated by seeks makes reading or writing large datasets painfully slow.

What is a B-tree?

A B-tree is the data structure traditional RDBMSs use, and it works well for updating a small proportion of the records. For updating the majority of a database, it is less efficient than MapReduce, which uses sort/merge operations to rebuild the database.

What is the difference between an RDBMS and MapReduce?

MapReduce is best suited to analyzing whole datasets in a batch fashion. It suits applications where the data is written once and read many times. A schema is optional, so it can process both structured and unstructured data.
An RDBMS is best suited to queries or updates on small datasets. It suits datasets that are continually updated. A schema is mandatory, so it can process only structured and semi-structured data.

What is structured data?

Data that is organized into entities with a defined format or schema is called structured data.
Semi-structured data is organized more loosely; it may or may not have a schema, and any schema is often ignored.
Unstructured data doesn’t have any particular structure, format or schema; the data is interpreted at processing time.

Why does an RDBMS use normalization?

An RDBMS is most often normalized to retain its integrity and remove redundancy.

Can you elaborate on a few compatibility problems in Hadoop?

There are generally three types of compatibility issues: API compatibility, data compatibility and wire compatibility.
API compatibility concerns the contract between user code and the Hadoop Java API. Data compatibility concerns persistent data and metadata formats.
Wire compatibility concerns interoperability between clients and servers over wire protocols such as RPC and HTTP.

Can you define input splits and blocks?

The Hadoop framework splits the input to a MapReduce job into fixed-size logical pieces called input splits, and creates one map task for each split.
HDFS splits the data it stores into fixed-size physical pieces called blocks.

What is the benefit of processing data in parallel in Hadoop?

Hadoop creates one map task for each split. Many small splits add management overhead, but processing the splits in parallel gives better load balancing. Hadoop is intentionally designed to process vast amounts of data in parallel.

What is data locality?

The framework takes care of this. It runs each map task on a node where the split’s data resides in HDFS; this is called data locality. As a result, the map task doesn’t use cluster network bandwidth to read its input. Reduce tasks don’t have the advantage of data locality.

Why is map task output always written to local disk and not to HDFS?

Map output is intermediate data. If it were stored in HDFS, Hadoop would replicate it, which is unnecessary overhead. It also avoids the risk of one split’s map output being mistaken for another split’s input. So Hadoop stores intermediate data on local disk, while reducer output is always stored in HDFS.

What is Hadoop Streaming?

Hadoop Streaming uses standard streams as the interface between Hadoop and your MapReduce program, so the program can be written in any language that can read standard input and write to standard output.
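To make the standard-input and standard-output contract concrete, here is a minimal word-count sketch in Python. It assumes tab-separated key/value lines and relies on the reducer receiving its input sorted by key. The function names and the map/reduce command-line argument are illustrative choices, not part of Hadoop.

```python
#!/usr/bin/env python3
"""Word count in the Streaming style: text in on stdin, text out on stdout."""
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts per word; input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Run as "map" or "reduce"; any other argument is treated as reduce.
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

Saved under a hypothetical name such as wc.py, the job can be simulated locally with a shell pipeline: cat input.txt | python3 wc.py map | sort | python3 wc.py reduce.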

Why Hadoop Pipes?

Hadoop Pipes is a C++ interface to Hadoop MapReduce that lets Hadoop run map and reduce code written in C++. Pipes doesn’t run in standalone mode.

What are distributed filesystems?

Filesystems that manage storage across a network of machines are called distributed filesystems. Node failures are common in such systems, so they must be designed to tolerate failures without data loss.

What is Streaming data?

HDFS is designed to store very large files with a streaming data access pattern. Hadoop follows a write-once, read-many-times pattern, which it can process most efficiently.

Why HDFS, why not other storage?

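HDFS allows streaming data access, runs on commodity hardware and supports parallel processing. It works best with a modest number of very large files rather than lots of small files, because the namenode keeps metadata for every file in memory.
HDFS is best suited to write-once, read-many-times workloads that need high data throughput; that throughput comes at the expense of latency, so it is not a good fit for low-latency data access.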

What is a block? What is the difference between a file system block and an HDFS block?

A block is the minimum amount of data that a disk or filesystem can read or write. A disk block is normally 512 bytes. The HDFS default block size is 64 MB (128 MB in Hadoop 2.x). On a regular filesystem, a file smaller than a block still occupies a whole block, whereas in HDFS a file smaller than a block doesn’t occupy a full block’s worth of storage.

Why is the HDFS block size so large?

Filesystem blocks are only a few kilobytes (typically 4 KB), but the HDFS default block size is 64 MB. The large size keeps the seek time small relative to the transfer time.
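The trade-off between seek time and transfer rate can be worked through with a little arithmetic. The sketch below assumes a 10 ms seek time and a 100 MB/s transfer rate, which are illustrative figures rather than measurements, and asks how large a block must be for the seek to cost only 1% of the transfer time.

```python
def block_size_for_seek_fraction(seek_time_s, transfer_rate_mb_s, seek_fraction):
    """Block size (MB) at which seek time is `seek_fraction` of transfer time."""
    # The transfer must last long enough that the seek is only a small part of it.
    transfer_time_s = seek_time_s / seek_fraction
    return transfer_rate_mb_s * transfer_time_s


# Assumed figures: 10 ms seek, 100 MB/s transfer, seek = 1% of transfer time.
print(block_size_for_seek_fraction(0.010, 100, 0.01))
```

Under these assumptions the answer comes out at around 100 MB, which is the same order of magnitude as the HDFS default block size.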

What are the namenode’s functions?

The namenode manages the filesystem namespace, receives block reports from datanodes and responds to client requests.

What are the namespace image and edit log files?

The namenode persistently stores its metadata on local disk in two files: the namespace image (fsimage) and the edit log. The fsimage is a checkpoint of the filesystem metadata; the edit log records every change made since that checkpoint and is periodically merged into the namespace image.

How does the namenode overcome the single point of failure?

If the namenode goes down, the whole filesystem becomes unavailable; this is why it is called a single point of failure. To reduce the risk, Hadoop provides two mechanisms. First, the namenode can persist its metadata to multiple filesystems. Second, the secondary namenode periodically merges the namespace image with the edit log. Hadoop 2.x additionally introduces a standby namenode for high availability.

What is HDFS federation?

Namenode federation scales the cluster by adding more namenodes. Each namenode manages a portion of the filesystem namespace, called a namespace volume. The namenodes are independent and don’t need to coordinate with each other. Each datanode registers with every namenode and stores blocks for all of them.
Example: namenode1 manages sales, namenode2 manages products, namenode3 manages services …

What is block pool storage?

Each namespace volume is unique and has its own block pool. A block pool is the set of blocks that belong to a single namespace. Datanodes store blocks for all the block pools in a cluster, and each pool is managed independently.

Why use a ClusterID?

The ClusterID identifies all the nodes in a cluster. It is generated when the namenode is formatted, and it helps to identify the nodes that belong to the cluster.
Each namespace generates block IDs to identify its blocks.

How to increase namenode memory?

By setting HADOOP_NAMENODE_OPTS with the required heap size. For example:
export HADOOP_NAMENODE_OPTS="-Xmx2000m"

How much memory does a namenode need?

Memory usage depends on the number of blocks in the filesystem. The number of blocks can be estimated as:
Number of nodes * Disk space per node / (Block size * Number of replicas)
with disk space and block size expressed in the same unit.
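The estimate above can be sketched in a few lines of Python, with disk space and block size both in MB. The cluster figures are made up for illustration, and the figure of roughly 1,000 MB of namenode memory per million blocks is a commonly quoted rule of thumb that is treated here as an assumption, not an exact requirement.

```python
def estimate_blocks(nodes, disk_mb_per_node, block_size_mb, replication):
    """Approximate number of distinct blocks the cluster has room for."""
    return nodes * disk_mb_per_node / (block_size_mb * replication)


def estimate_namenode_memory_mb(blocks, mb_per_million_blocks=1000):
    """Namenode heap estimate from an assumed per-million-blocks figure."""
    return blocks / 1_000_000 * mb_per_million_blocks


# Hypothetical cluster: 200 nodes, 24 TB per node, 128 MB blocks, 3 replicas.
blocks = estimate_blocks(200, 24_000_000, 128, 3)
print(blocks, estimate_namenode_memory_mb(blocks))
```

For this hypothetical cluster the estimate is about 12.5 million blocks, so the namenode heap would be sized at around 12,500 MB under the assumed rule of thumb.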

What is HDFS high-availability?

The secondary namenode protects against data loss, but does not provide high availability of the filesystem, because the namenode is still the sole repository of the metadata. With the help of NFS, the namenode can persist its metadata to a highly available shared store to prevent data loss, but that alone can’t automatically start a standby namenode. In an HA setup, when the standby namenode takes over as the active namenode, the previously active namenode is fenced (killed if necessary) so that it cannot corrupt the shared state.

What is the difference between HTTP and WebHDFS?

The original HTTP interface is read-only, while the newer WebHDFS interface supports all filesystem operations, including Kerberos authentication. Enable WebHDFS by setting dfs.webhdfs.enabled to true.

Can you elaborate on network topology in Hadoop?

Nodes communicate with each other over the network, and Hadoop transfers vast amounts of data between nodes. Hadoop doesn’t measure bandwidth directly; it uses the distance between two nodes in the network as a proxy for the available bandwidth:

  • Processes on the same node = 0
  • Different nodes on the same rack = 2
  • Nodes on different racks in the same data center = 4
  • Nodes in different data centers = 6
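These distances fall out of treating the network as a tree and counting the hops from each node up to their closest common ancestor. The sketch below assumes nodes are named by paths of the form /datacenter/rack/node; the path values are hypothetical.

```python
def distance(a, b):
    """Tree distance between two node paths such as '/d1/r1/n1'."""
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")
    # Count the leading path components the two nodes share.
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1
    # Hops from each node up to the closest common ancestor.
    return (len(path_a) - common) + (len(path_b) - common)


print(distance("/d1/r1/n1", "/d1/r1/n1"))  # same node
print(distance("/d1/r1/n1", "/d1/r1/n2"))  # same rack
print(distance("/d1/r1/n1", "/d1/r2/n3"))  # different rack, same data center
print(distance("/d1/r1/n1", "/d2/r3/n4"))  # different data center
```

The four calls correspond, in order, to the four cases listed above.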

What is FSDataOutputStream?

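The DistributedFileSystem returns an FSDataOutputStream for the client to write data to. As the client writes, the data is split into packets and placed on a data queue. The DataStreamer consumes the data queue and writes the packets into the datanode pipeline.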

How does Hadoop write data?

The Hadoop client sends a request to the namenode, via the DistributedFileSystem, to create the file and allocate datanodes for writing. The DFSOutputStream communicates with those datanodes, which temporarily form a pipeline for the write. The DataStreamer sends the data to the first datanode, which forwards it to the next datanode in the pipeline, and acknowledgements flow back to the client. As long as the minimum number of replicas (one by default) has been written, the write is considered successful. If there is a problem writing the data, the framework retries four times.

How does Hadoop read data from blocks?

The client first sends a request to the namenode, via the DistributedFileSystem, by calling open() on the file. The namenode returns the addresses of the datanodes that hold each block, and then the client calls read(). The DFSInputStream connects to the closest datanode for each block in turn. After reading the data, the client calls close() on the FSDataInputStream.

How does Hadoop replicate data?

While a block is being written to one node, it is replicated across the cluster, asynchronously if necessary, until its target replication factor is reached. Most often the first replica is stored on the node where the client is running, and the other two replicas are stored on two different nodes in another rack.

What is the importance of a checksum?

When data is transferred or stored, occasionally losing or corrupting a few bytes is common, and because Hadoop handles so much data, the chance of corruption occurring is high. A checksum is an error-detection scheme used to detect corruption when data enters and leaves the system. Datanodes are responsible for verifying the checksum of the data they receive in the write pipeline.
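The idea can be illustrated with a short Python sketch that computes a checksum when data is written and verifies it when the data is read back. It uses zlib.crc32 from the standard library as a stand-in; the helper names are made up for this example, and this is not the code path Hadoop itself uses.

```python
import zlib


def with_checksum(data: bytes):
    """Return the data together with its CRC-32 checksum."""
    return data, zlib.crc32(data)


def is_intact(data: bytes, checksum: int) -> bool:
    """Recompute the checksum and compare it with the stored one."""
    return zlib.crc32(data) == checksum


payload, crc = with_checksum(b"block contents")
print(is_intact(payload, crc))            # unchanged data passes
print(is_intact(b"block c0ntents", crc))  # a corrupted byte is detected
```

A checksum only detects corruption; recovering the data still depends on reading another replica of the block.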

How does Hadoop handle corrupted data?

LocalFileSystem uses ChecksumFileSystem to compute and verify checksums. The getRawFileSystem() method returns the underlying filesystem without checksumming. If corrupted data is found, ChecksumFileSystem calls the reportChecksumFailure() method. The Hadoop administrator then takes care of such files.

What are the benefits of compressing files?

Compression reduces the space needed to store files and speeds up data transfer across the network, so it is highly recommended for vast amounts of data. DEFLATE, gzip, Snappy and LZO are common compression formats. All of them can be used with MapReduce, although not all of them are splittable.

Options: -1 means optimize for speed, -9 means optimize for space. E.g.: gzip -1 filename.
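The same speed-versus-space levels are exposed by Python's gzip module, which makes the trade-off easy to try out. The sample data below is synthetic and highly repetitive, so the sizes it prints say little about real log files.

```python
import gzip

# Synthetic, repetitive sample data standing in for a log file.
data = b"2015-01-01 INFO request served in 12 ms\n" * 10_000

fast = gzip.compress(data, compresslevel=1)   # like gzip -1: favour speed
small = gzip.compress(data, compresslevel=9)  # like gzip -9: favour space

print(len(data), len(fast), len(small))
```

Level 9 typically produces output no larger than level 1 at the cost of more CPU time; both round-trip back to the original bytes with gzip.decompress.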

The CompressionCodec interface lets you compress and decompress files.

Which Compression Format Should I Use?

It depends on the application, but ideally the compression format should allow splitting. For large files such as log files, either store the files uncompressed or use a container format such as a sequence file, which supports both compression and splitting.

Steps to compress files in MapReduce?

set mapred.output.compress=true

What are serialization and deserialization?

The network carries only byte streams, not structured objects. Serialization is the process of converting structured objects into a byte stream for transmission over a network.
Deserialization is the process of converting a byte stream back into structured objects.

Where and when do we use serialization?

Serialization and deserialization occur most frequently in distributed data processing.
The RPC protocol uses serialization and deserialization when Hadoop transmits data between different nodes.

Can you tell me some properties of a good RPC serialization format?

  • Compact – makes the best use of network bandwidth.
  • Fast – inter-process communication is the backbone of a distributed system, so reading and writing terabytes of data must add as little overhead as possible.
  • Extensible – protocols change over time to meet new requirements.
  • Interoperable – supports clients written in different languages from the server.

Why does Hadoop use its own serialization format instead of other RPC formats?

The Writable interface is central to Hadoop: it forms the key and value types and serializes and deserializes the data. It’s compact and fast, but not easy to extend or to use from languages other than Java. Instead of using another serialization framework, Hadoop uses its own interface to serialize and deserialize data.

What is the Writable interface?

The Writable interface is responsible for reading and writing data in serialized form for transmission. It defines two methods: write(), which writes the object’s state to a DataOutput binary stream, and readFields(), which reads it back from a DataInput binary stream.
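To show the shape of this two-method contract, here is a Python analogue of a Writable that wraps a single integer. The class and helper names are hypothetical, and the real interface is Java; the sketch only mirrors the fact that an int is written as four big-endian bytes, as Java's DataOutput does.

```python
import io
import struct


class IntWritableSketch:
    """Python stand-in for a Writable holding one 32-bit integer."""

    def __init__(self, value=0):
        self.value = value

    def write(self, out):
        # Serialize as a 4-byte big-endian signed integer.
        out.write(struct.pack(">i", self.value))

    def read_fields(self, inp):
        # Deserialize the same 4 bytes back into the object.
        (self.value,) = struct.unpack(">i", inp.read(4))


def serialize(writable):
    """Capture the bytes a writable produces."""
    buffer = io.BytesIO()
    writable.write(buffer)
    return buffer.getvalue()


raw = serialize(IntWritableSketch(163))
copy = IntWritableSketch()
copy.read_fields(io.BytesIO(raw))
print(raw.hex(), copy.value)
```

Keeping the format to a fixed four bytes is what makes this style of serialization compact and fast, at the cost of being tied to one agreed layout.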

When do you use safemode?

Safemode is a temporary state of the namenode in which it allows only read-only operations. By default Hadoop enters and leaves safemode automatically when the cluster is started, but the admin can also enter and leave safemode manually.
When you upgrade the Hadoop version, or make complex changes to the namenode, the admin enables safemode manually. To save the metadata to disk manually and reset the edit log, the namenode should be in safemode; save the namespace with the commands given below.
hadoop dfsadmin -safemode enter
hadoop dfsadmin -saveNamespace
hadoop dfsadmin -upgrade
hadoop dfsadmin -safemode leave

Explain different type of modes in hadoop.

Local mode: Hadoop runs on the local OS filesystem, not HDFS. Everything runs in a single JVM. It is most often used to develop MapReduce programs in a development environment.
Pseudo-distributed mode: Hadoop is installed and runs on a single local system. Every daemon runs independently in its own JVM. HDFS is used to store the data. It’s the best choice for developing and testing applications.
Fully distributed mode: Hadoop runs on a cluster (multiple systems). Every daemon runs in its own JVM, and multiple threads can run. A datanode and a task tracker run together on each worker node; the remaining daemons run independently on their own nodes.

What are the datanode and task tracker heap sizes?

Java stores objects in a region of memory called the heap, and the heap size sets how large it can grow. By default the datanode heap size is 128 MB and the task tracker heap size is 512 MB.

What are the ways to interact with HDFS?

  • Command-line interface
  • Java API
  • Web interface

What does the Writable interface do?

It’s the centre point of Hadoop for serializing and deserializing data. It defines two methods that read from a DataInput binary stream and write to a DataOutput binary stream.

What is Avro?

Apache Avro is a language-neutral data serialization system. Writables have no language portability. With the help of Avro, data can be serialized, read and written from C++, Python, Ruby and other programming languages. Avro has a language-independent schema, and code generation is optional.

What are counters?

Counters are a useful channel for gathering statistics about a job, both to assess the quality of the application and to diagnose problems. Every big data analyst should be aware of counters in order to debug jobs.

What is HFTP?

HFTP is a read-only Hadoop filesystem that lets you read data from a remote HDFS cluster. The data is stored in the datanodes, but HFTP does not allow you to write to or modify the filesystem state. If you are moving data from one Hadoop version to another, use HFTP; it is wire-compatible between different versions of HDFS.

Eg: hadoop distcp -i hftp://sourcefile:50070/sourcepath hdfs://destinationfile:50070/destinationpath

How do you tune job performance?

There are many ways. Combine small files by using CombineFileInputFormat. Slightly reduce the number of reducers to maximize performance. Use combiners to cut down the amount of map output sent to the reducers. Compress the map output to save network bandwidth. Use custom serialization and implement a RawComparator to maximize speed. Tuning the MapReduce shuffle and memory management configuration can also improve performance.

Why does Hadoop sometimes fail with a “connection refused” error?

Most frequently there are two reasons. First, ssh is not installed, so install it from the command line: sudo apt-get install ssh.
Second, the hostname is mismatched. Check the hostname entries first with this command: sudo gedit /etc/hosts.
By default    localhost
Add the given entry after this value.    system-hostname

What is Offline Image Viewer?

It dumps HDFS fsimage files into human-readable formats so that you can quickly analyze the cluster’s namespace offline.

Hbase Programming

What is Hbase?

HBase is a column-oriented database management system that runs on top of HDFS. It is used to update vast amounts of existing data.

When do we use HBase?

HBase is not a solution for every problem. It’s designed to solve big data problems, such as a table that contains billions of rows and millions of columns on top of a Hadoop cluster. When an RDBMS doesn’t solve the problem, NoSQL is the best option. HBase eases many Hadoop and big data problems, especially updating existing data.


The HBase shell is based on Ruby (JRuby). If you are well versed in Ruby, HBase is a good fit for you.

To enter HBase and run queries, just enter hbase shell.

You can open http://localhost:60010 to check the master’s health.

localhost:60030 shows the region server information.

All tables are stored in HBase.

Enter ‘list’ to see the tables.