Hadoop MapReduce Interview Questions

What is Hadoop MapReduce?
MapReduce is a programming model and framework used to process and analyze vast amounts of data over a Hadoop cluster. It processes large datasets in parallel across the nodes of the cluster in a fault-tolerant manner.
Can you elaborate on a MapReduce job?
Based on the configuration, a MapReduce job first splits the input data into independent chunks called input splits (by default, one split per HDFS block). These splits are processed by the map() and reduce() functions: the map function processes the data first, and its output is then processed by the reduce function. The framework takes care of sorting the map outputs and scheduling the tasks.
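
The flow described above can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map, shuffle/sort and reduce steps, not Hadoop code; the function names (map_fn, shuffle_and_sort, reduce_fn, run_job) and the sample input are made up for the example.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit an intermediate (word, 1) pair for every word in a line."""
    for word in line.split():
        yield word.lower(), 1


def shuffle_and_sort(pairs):
    """Framework step: group the values by key and sort the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(key, values):
    """Reduce: collapse all values of one key into a final (key, value) pair."""
    return key, sum(values)


def run_job(splits):
    """Run map over every split, then shuffle/sort, then reduce."""
    intermediate = [pair
                    for split in splits
                    for line in split
                    for pair in map_fn(line)]
    return [reduce_fn(key, values)
            for key, values in shuffle_and_sort(intermediate)]


# Two hypothetical input splits, each holding one line of text.
splits = [["big data big"], ["data hadoop"]]
word_counts = run_job(splits)
print(word_counts)
```

In a real cluster each split would be handled by a separate map task on a different node; here the loop over splits stands in for that parallelism.
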
Why are the compute nodes and the storage nodes the same?
Compute nodes process the data and storage nodes store the data. The Hadoop framework tries to minimize network traffic, and to achieve that goal it follows the data locality concept: the compute code executes where the data is stored, so the data node and the compute node are the same machine.
What is the importance of the Configuration object in MapReduce?

  • It is used to set and get parameter name/value pairs that are defined in XML files.
  • It is used to initialize values, for example by reading them from an external file and setting them as parameters.
  • Values set in the program take precedence over values loaded from the configuration files.
  • Parameters that are not set anywhere fall back to Hadoop’s default values.

Where is MapReduce not recommended?

MapReduce is not recommended for iterative processing, where the output of one pass is fed back as input in a loop.
It is also not well suited to a long series of chained MapReduce jobs. Each job persists its data to disk and the next job has to load it again, which is a costly operation.

What is the NameNode and what are its responsibilities?

The NameNode is the master daemon of HDFS and the heart of the entire Hadoop system. It stores the file system metadata in the FsImage and receives block information from the DataNodes through heartbeats and block reports.

What is JobTracker’s responsibility?

  • Scheduling the job’s tasks on the slaves. The slaves execute the tasks as directed by the JobTracker.
  • Monitoring the tasks and re-executing any that fail.

What are the JobTracker and TaskTracker in MapReduce?
The MapReduce framework consists of a single JobTracker per cluster and one TaskTracker per node. A cluster usually has multiple nodes, so each cluster has a single JobTracker and multiple TaskTrackers.
The JobTracker schedules the jobs and monitors the TaskTrackers. If a TaskTracker fails to execute a task, the JobTracker re-executes the failed task.
The TaskTracker follows the JobTracker’s instructions and executes the tasks. As a slave node, it reports task status to the master JobTracker in the form of heartbeats.
What is the importance of job scheduling in Hadoop MapReduce?
Scheduling is a systematic procedure for allocating resources in the best possible way among multiple tasks. A Hadoop cluster runs many jobs at once, and sometimes a particular job should finish quickly and be given higher priority; job schedulers handle this. The default scheduler is FIFO.
The FIFO, Fair and Capacity schedulers are the most popular schedulers in Hadoop.
When is a reducer used?
A reducer is used to combine the outputs of multiple mappers. The reducer has three primary phases: shuffle, sort and reduce. It is possible to process data without a reducer; one is used when shuffle and sort are required.
What is the replication factor?
The replication factor is the number of copies of each chunk of data that are stored on different nodes within a cluster. By default the replication factor is 3, but it can be changed. Each file is automatically split into blocks, which are spread across the cluster.
Where does the shuffle and sort process happen?
After a mapper generates its output, the intermediate data is stored temporarily on the local file system. The location of this temporary data is set in the Hadoop configuration files (by default it falls under hadoop.tmp.dir from core-site.xml). The framework partitions and sorts this intermediate data and then transfers it to the reducers to be processed by the reduce function. The framework deletes the temporary data from the local file system after the job completes.
Is Java mandatory to write MapReduce jobs?
No. Hadoop itself is implemented in Java, but MapReduce applications need not be written in Java. Hadoop supports Python, Ruby, C++ and other programming languages.
The Hadoop Streaming API allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer.
Hadoop Pipes allows programmers to implement MapReduce applications in C++.
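
As a rough illustration of the streaming style, the sketch below follows the convention that a mapper emits tab-separated key/value lines and the reducer receives those lines sorted by key. In a real streaming job the lines would arrive on standard input and leave on standard output; here they are passed as Python lists so the example can be run locally. The function names and sample input are made up for the example.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key.

    The input must already be sorted by key, which is what the
    framework's sort phase guarantees between map and reduce.
    """
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local stand-in for: input -> mapper -> sort -> reducer
map_output = list(mapper(["a b a"]))
reduce_output = list(reducer(sorted(map_output)))
print(reduce_output)
```

The sorted() call plays the role of the framework's shuffle and sort; the reducer relies on equal keys being adjacent.
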
What methods control the output types of the map and reduce functions?
setOutputKeyClass() and setOutputValueClass() set the output key and value types.
If the map output types are different from the final output types, the map output types can be set using
setMapOutputKeyClass() and setMapOutputValueClass().
What is the main difference between Mapper and Reducer?
The map method is called separately for each input key/value pair. It processes input key/value pairs and emits intermediate key/value pairs.
The reduce method is called separately for each key together with its list of values. It processes intermediate key/value pairs and emits final key/value pairs.
Both classes also provide an initialization method that the framework calls once before any other method is called; it emits no output.

Why are compute nodes and storage nodes the same?
Compute nodes are logical processing units and storage nodes are physical storage units. Both run on the same node because of data locality. As a result Hadoop minimizes network traffic and processes the data more quickly.
What is the difference between a map-side join and a reduce-side join? or
When do we use a map-side join and when a reduce-side join?
Map-side join: the tables are joined in the mapper, without a reduce phase. It has strict requirements: the datasets must be sorted by the join key and partitioned identically, or one of them must be small enough to be loaded into memory by every mapper. It suits the case where one table is large and the other is small.

Reduce-side join: the tables are joined in the reducer. The mappers tag each record with its source table, and the shuffle brings records with the same join key together. It places no restrictions on the format or size of the inputs, so it is the general way to join large tables, but it is slower because all the data passes through sort and shuffle.
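
The two join styles can be compared with a small in-memory sketch. It is not Hadoop code: the dictionary stands in for a small table held in each mapper's memory, and the grouping step stands in for the shuffle. The table contents and function names are hypothetical.

```python
from collections import defaultdict


def map_side_join(large_rows, small_table):
    """Join inside the mapper: the small table is held in memory as a dict."""
    for key, value in large_rows:
        if key in small_table:
            yield key, value, small_table[key]


def reduce_side_join(left_rows, right_rows):
    """Join in the reducer: records are tagged by source ("L" or "R"),
    grouped by key (the shuffle), and paired up per key (the reduce)."""
    grouped = defaultdict(lambda: {"L": [], "R": []})
    for key, value in left_rows:
        grouped[key]["L"].append(value)
    for key, value in right_rows:
        grouped[key]["R"].append(value)
    for key in sorted(grouped):
        for left_value in grouped[key]["L"]:
            for right_value in grouped[key]["R"]:
                yield key, left_value, right_value


# Hypothetical tables: many orders, few customers.
orders = [(1, "book"), (2, "pen"), (1, "lamp"), (3, "desk")]
customers = [(1, "Ann"), (2, "Raj")]

map_joined = list(map_side_join(orders, dict(customers)))
reduce_joined = list(reduce_side_join(orders, customers))
print(map_joined)
print(reduce_joined)
```

Both functions produce the same joined rows; the map-side version never groups or sorts anything, which is why it is cheaper when one table fits in memory.
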
What happens if the number of reducers is 0?
Setting the number of reducers to 0 is a valid configuration in MapReduce. In this scenario no reducer executes, so the mapper output is treated as the final output and Hadoop stores it in the job’s output folder.
When do we use a combiner? Why is it recommended?
Mappers and reducers are independent; they don’t talk to each other. When the reduce function is commutative (a.b = b.a) and associative (a.(b.c) = (a.b).c), we can use a combiner to optimize the MapReduce process. Many MapReduce jobs are limited by network bandwidth, so the framework tries to minimize the data sent over the network. To achieve this, MapReduce allows a user-defined “combiner function” to run on the map output. It is a MapReduce optimization technique, and it is optional.
What is the main difference between a MapReduce combiner and a reducer?
Both the combiner and the reducer are optional, but both are frequently used in MapReduce. There are three main differences:
1) A combiner receives its input from only one mapper, while a reducer receives input from multiple mappers.
2) A reducer is used whenever aggregation is required; a combiner can be used only if the function is commutative (a.b = b.a) and associative (a.(b.c) = (a.b).c).
3) The input and output key and value types of a combiner must be the same, while a reducer can take one type as input and emit another.
What is a combiner?
It is a local aggregation of the key/value pairs produced by a mapper. It reduces the amount of duplicated data transferred between nodes and so improves job performance. The framework decides whether the combiner runs zero, one or multiple times. It is not suitable for functions such as the mean, which is not associative.
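
A small worked example shows why a sum can be combined but a mean cannot. The numbers are made up: they stand for the values of one key produced by two different mappers.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Values for a single key, as emitted by two hypothetical mappers.
mapper_outputs = [[1, 2, 3], [10]]
all_values = [v for output in mapper_outputs for v in output]

# Sum is commutative and associative: combining per mapper first
# gives the same answer as reducing everything at once.
sum_direct = sum(all_values)
sum_combined = sum(sum(output) for output in mapper_outputs)

# Mean is not associative: a mean of per-mapper means is wrong.
mean_direct = mean(all_values)
mean_of_means = mean([mean(output) for output in mapper_outputs])

# A common workaround: the combiner emits (sum, count) pairs,
# and only the reducer performs the division.
partials = [(sum(output), len(output)) for output in mapper_outputs]
mean_from_partials = (sum(s for s, _ in partials)
                      / sum(c for _, c in partials))

print(sum_direct, sum_combined)
print(mean_direct, mean_of_means, mean_from_partials)
```

Because the framework may run the combiner zero or more times, the job must give the same result either way; the (sum, count) form satisfies that, the naive mean does not.
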
What is a partition?
The partitioner controls how the keys of the intermediate map output (after any combiner) are distributed to the reducers. It divides the intermediate data according to the number of reducers, so that all the data in a single partition is processed by a single reducer. Whenever a job has reducers, the partitioner is invoked automatically.
When do we use partitions?
By default Hive reads the entire dataset even when the application needs only a slice of the data. This is a bottleneck for MapReduce jobs, so Hive offers a special option called partitions. When you create a table, Hive can partition it based on your requirements.
What are the important steps when you are partitioning a table?
Don’t over-partition the data into too many small partitions; it is an overhead for the NameNode.
With dynamic partitioning in strict mode, at least one static partition must be specified. To make all partitions dynamic, enable dynamic partitioning and set the mode to nonstrict with these commands:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
First load the data into a non-partitioned table, then insert it from there into the partitioned table. Data cannot be loaded from a local file directly into dynamic partitions.

insert overwrite table table_name partition(year) select * from non_partitioned_table;
Can you elaborate on the MapReduce job architecture?
First the Hadoop programmer submits the MapReduce program to the JobClient.

The JobClient asks the JobTracker for a job ID. The JobTracker provides a unique job ID of the form job_<HadoopStartTime>_0001.

Once the JobClient has received the job ID, it copies the job resources (job.xml, job.jar) to the file system (HDFS) and submits the job to the JobTracker. The JobTracker initializes the job and schedules it.

Based on the configuration, the input is divided into input splits, and the split information is also written to HDFS. The TaskTracker retrieves the job resources from HDFS and launches a child JVM. The map and reduce tasks run in this child JVM, and the TaskTracker reports their status to the JobTracker.
Why does the TaskTracker launch a child JVM?
Quite often a Hadoop developer mistakenly submits a wrong job or a job with bugs. If the TaskTracker ran tasks in its own JVM, a faulty task could disrupt the main JVM and affect other tasks. With a child JVM, if the task tries to damage existing resources, the TaskTracker kills that child JVM and retries the task in a new child JVM.
Why do the JobClient and JobTracker submit the job resources to the file system?
Data locality. Moving computation is cheaper than moving data. The logic (computation) is packaged in the jar file and the splits, while the data sits on the DataNodes of the file system. So the job resources are copied to where the data is available.

How many mappers and reducers can run?

By default Hadoop can run 2 mappers and 2 reducers at a time on one node; that is, each node has 2 map slots and 2 reduce slots. These defaults can be changed in mapred-site.xml in the conf directory.
What is an InputSplit?
An InputSplit is the logical chunk of data that is processed by a single mapper. By default the InputSplit size equals the block size.
How to configure the split size?
By default the block size is 64 MB, but to process the data the framework divides the input into splits. Either of these two equivalent formulas gives the split size:

1) split size = min(max_split_size, max(block_size, min_split_size))

2) split size = max(min_split_size, min(block_size, max_split_size))

By default, split size = block size.

The number of splits always equals the number of mappers.

Applying the formulas with block_size = 64 MB, min_split_size = 512 KB and max_split_size = 10 GB (an assumed value; it depends on the environment and may be 1 GB or 10 GB):

1) split size = min(10 GB, max(64 MB, 512 KB))

split size = min(10 GB, 64 MB)

split size = 64 MB

2) split size = max(512 KB, min(64 MB, 10 GB))

split size = max(512 KB, 64 MB)

split size = 64 MB
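
The same calculation can be checked with a short script. This is a sketch of the formulas above, not the Hadoop implementation; the function names are made up, and the split count uses a plain ceiling division as a simplification.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def compute_split_size(block_size, min_split_size, max_split_size):
    """Formula 2: clamp the block size between the min and max split size."""
    return max(min_split_size, min(block_size, max_split_size))


def compute_split_size_alt(block_size, min_split_size, max_split_size):
    """Formula 1: the same clamp written the other way round."""
    return min(max_split_size, max(block_size, min_split_size))


def number_of_splits(file_size, split_size):
    """Simplified split count (ceiling division); one mapper per split."""
    return -(-file_size // split_size)


# Values from the worked example: 64 MB block, 512 KB min, 10 GB max (assumed).
split = compute_split_size(64 * MB, 512 * KB, 10 * GB)
print(split // MB, "MB")
print(number_of_splits(100 * MB, split), "splits for a 100 MB file")

# Raising the minimum above the block size makes the splits larger.
print(compute_split_size(64 * MB, 128 * MB, 10 * GB) // MB, "MB")
```

The last call shows how the split size is tuned in practice: the block size wins unless the minimum is raised above it or the maximum is lowered below it.
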
How much RAM is required to process 64 MB of data?
Let’s assume a 64 MB block size and a system that runs 2 mappers and 2 reducers, so 64 * 4 = 256 MB of memory. The OS takes at least 30% extra (about 80 MB), so at least 256 + 80 = 336 MB of RAM is required to process a chunk of data.

In the same way, more memory is required to process unstructured data.
What is the difference between a block and a split?
Block: the physical chunk of data stored on disk in HDFS.
Split: the logical chunk of data handed to a single mapper for processing.
Why does the Hadoop framework read a file in parallel and not sequentially?
Why does Hadoop read in parallel but not write in parallel?
Hadoop reads data in parallel so that it can retrieve it faster. Writes, however, happen in sequence rather than in parallel, because with parallel writes one node’s data could be overwritten by another’s and no node would know where the next chunk belongs. Parallel processes are independent, so there is no relation between two nodes. For example, if 100 MB of data is written as one 64 MB block and one 36 MB block, and the blocks were written in parallel, the first block would not know where the remaining data is. So Hadoop reads in parallel and writes sequentially.
If I change the block size from 64 MB to 128 MB, what happens?
Changing the block size does not affect existing data. After the change, every new file is divided into 128 MB blocks.

This means old data stays in 64 MB blocks, while new data is stored in 128 MB blocks.
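
To make the block layout concrete, the sketch below computes the block sizes of a file for a given block size. It is plain arithmetic, not an HDFS API; the function name and the file sizes are chosen for the example.

```python
def block_sizes(file_mb, block_mb):
    """Return the sizes (in MB) of the blocks a file is divided into."""
    full_blocks, remainder = divmod(file_mb, block_mb)
    return [block_mb] * full_blocks + ([remainder] if remainder else [])


# A 100 MB file written while the block size was 64 MB:
print(block_sizes(100, 64))

# A 200 MB file written before and after the change to 128 MB:
print(block_sizes(200, 64))
print(block_sizes(200, 128))
```

The last block of a file only occupies as much space as the data it holds, so a 100 MB file uses a 64 MB block and a 36 MB block rather than two full blocks.
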
What is isSplitable()?
By default this method returns true. The input format uses it to decide whether a file may be split. For unstructured data it is often not advisable to split the file, so the entire file should be processed as one split. To do that, override isSplitable() to return false.
What are the maximum and minimum block sizes that Hadoop allows?
Minimum: 512 bytes. This is the block size of the local OS file system; the HDFS block size cannot be smaller than that.

Maximum: depends on the environment. There is no upper bound.
What are the job resource files?
job.xml and job.jar are the core resources needed to process the job. The JobClient copies these resources to HDFS.
What does a MapReduce job consist of?
A MapReduce job is a unit of work that the client wants to be performed. It consists of the input data, the MapReduce program in a jar file and the configuration settings in XML files. Hadoop runs the job by dividing it into tasks with the help of the JobTracker.
What is data locality?
This is one of the most frequently asked Cloudera certification questions and one of the most important MapReduce interview questions. Processing the data on the node where the data is stored, instead of moving the data to the computation, is called data locality. “Moving computation is cheaper than moving data”; data locality is how Hadoop achieves this goal. It is possible when the data is splittable, which it is by default.
What is speculative execution?
It is one of the important MapReduce interview questions and appears in Cloudera certification as well. Hadoop runs on commodity hardware, so systems may fail or run slowly because of low memory. If one system fails or lags, the whole job is held up, which is not desirable. Speculative execution is a performance optimization technique: the same computation is launched on multiple systems, and the result from whichever system finishes first is used. It is enabled by default. Even if one system crashes, it is not a problem; the framework takes the result from another system.

E.g. the same logic is distributed to systems A, B, C and D.

Systems A, B, C and D would take 10, 8, 9 and 12 minutes respectively, running simultaneously. The framework takes the result from system B and kills the processes on the remaining systems.
When do we use a reducer?
A reducer is needed only when sort and shuffle are required; otherwise there is no need for partitioning either. A filter, for example, does not need sort and shuffle, so that operation can be done without a reducer.
What is ChainMapper?
ChainMapper is a special mapper class that runs a set of mapper classes in a chain within a single map task. One mapper’s output acts as the next mapper’s input, and in this way any number of mappers can be connected in a chain.
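How to do value-level comparison?
By default Hadoop sorts and compares at the key level only, not at the value level. To compare at the value level, the value has to be made part of the key (a composite key, as in a secondary sort).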
What are the setup and cleanup methods?
If you don’t know where the starting and ending points of the data are, some problems are difficult to solve. The setup and cleanup methods resolve this.

With N splits, by default one mapper is called for each split, and each mapper runs setup once at the start and cleanup once at the end, however many records the split holds. Setup initializes the resources; map processes the data, once per record; cleanup closes the resources and runs once after the last map call completes. This improves performance. The same methods are available in the reducer as well.

To compare one key/value pair with another, use the map method. To work at the level of a whole split, use setup and cleanup. Resources are opened once, used many times and closed once, which saves a lot of overhead during processing.
Why does the TaskTracker launch a child JVM to do a task? Why not use the existing JVM?
Sometimes child threads corrupt parent threads; a programmer’s mistake could disrupt the entire MapReduce task. So the TaskTracker launches a child JVM to run each individual map or reduce task. If the TaskTracker used its own JVM, a faulty task might damage the main JVM. If a bug occurs, the TaskTracker kills the child process and launches another child JVM to run the same task. Usually the task is retried up to 4 times.
How many slots are allocated on each node?
By default each node has 2 slots for map tasks and 2 slots for reduce tasks, so each node has 4 slots to process the data.
What is a RecordReader?
A RecordReader reads <key, value> pairs from an InputSplit. Typically the RecordReader converts the byte-oriented view of the input provided by the InputSplit into a record-oriented view for the Mapper; only then can the Mapper process the data.

The input format, which supplies the RecordReader, and the input path are set in the job configuration.

FileInputFormat.addInputPath() adds a directory to the job’s input; the files in it are read and sent to the mappers. All these settings are part of the MapReduce job configuration.
Can you explain different types of Input formats?

[Figure: input formats]

Hadoop 2.x Interview Questions

What are the core changes in Hadoop 2.x?

There are many changes. The main ones are the removal of the single point of failure and the decentralization of the JobTracker’s responsibilities to the data nodes. The entire JobTracker architecture changed. Some of the main differences between Hadoop 1.x and 2.x are given below.

  • Single point of failure – Rectified.
  • Node limit (from about 4000 to unlimited) – Rectified.
  • JobTracker bottleneck – Rectified.
  • Map and reduce slots changed from static to dynamic.
  • High availability – Available.
  • Supports interactive and iterative graph algorithms (1.x does not).
  • Allows other applications to integrate with HDFS.

What is YARN?

YARN stands for “Yet Another Resource Negotiator”. It is used for efficient cluster utilization and is the most powerful technology in 2.x. Unlike 1.x, where the JobTracker did everything, resource management (ResourceManager) and job scheduling/monitoring (ApplicationMaster) run in separate daemons, which eases the JobTracker’s problems. YARN is the layer made up of the ResourceManager and the NodeManagers.

What is the difference between MapReduce 1 and MapReduce 2/YARN?

In MapReduce 1, Hadoop centralized all tasks in the JobTracker, which both allocated resources and scheduled jobs across the cluster. YARN decentralizes this to ease the pressure on the JobTracker. The ResourceManager allocates resources across the cluster, the NodeManager launches and monitors containers on its node, and a per-application ApplicationMaster manages and executes the job. This approach removes many JobTracker problems, improves scalability and optimizes job performance. Additionally, YARN allows multiple kinds of applications to scale on the same distributed environment.

How does Hadoop determine the distance between two nodes?

The Hadoop admin writes a script, called the topology script, that determines the rack location of each node. Hadoop invokes it to work out the distance between nodes when replicating data. The script is configured in core-site.xml, and the script itself describes where the nodes are located.

A user deleted a file by mistake. How does Hadoop remove it from its file system? Can you roll it back?

HDFS first renames the file and places it in the trash directory for a configurable amount of time. During this time the file can still be restored by moving it back out of the trash. After this time, the NameNode deletes the file from the HDFS namespace and frees its blocks. The interval is configured as fs.trash.interval in core-site.xml; setting it to 0 deletes files immediately without storing them in the trash.

What is the difference between Hadoop NameNode Federation, NFS and JournalNode?

HDFS Federation separates the namespace from the storage to improve scalability and isolation.


What is the DistCp functionality in Hadoop?

This distributed copy tool is used to transfer large amounts of data within a cluster and between clusters.
hadoop distcp hdfs://namenode1:8020/nn hdfs://namenode2:8020/nn
It can copy multiple sources to a destination cluster. The last argument is the destination.
hadoop distcp hdfs://namenode1:8020/dd1 hdfs://namenode2:8020/dd2 hdfs://namenode3:8020/dd3

Is YARN a replacement for MapReduce?

YARN is a generic concept. It supports MapReduce, but it is not a replacement for MapReduce. You can develop many applications with the help of YARN; Spark, Drill and many more applications work on top of YARN.

What are the core concepts/Processes in YARN?

  1. ResourceManager: equivalent to the JobTracker.
  2. NodeManager: equivalent to the TaskTracker.
  3. ApplicationMaster: equivalent to a job. Everything is an application in YARN, including a job submitted by a client.
  4. Containers: equivalent to slots.
  5. YarnChild: when you submit an application, the ApplicationMaster dynamically launches YarnChild processes to run the map and reduce tasks.

If the ApplicationMaster fails, it is not a problem; the ResourceManager automatically starts a new attempt.


Steps to upgrade Hadoop 1.x to Hadoop 2.x?

Don’t upgrade 1.x to 2.x in place. Download 2.x to a separate local directory first, then remove the old 1.x files. The upgrade takes time.

  • Check that the share folder (share/hadoop/mapreduce/lib) is present; it is important.
  • Stop all processes.
  • Delete old metadata from work/hadoop2data.
  • Copy the 1.x data and rename it to work/hadoop2.x.
  • Don’t format the NameNode during the upgrade.
  • Run: hadoop namenode -upgrade (it will take a lot of time).
  • Don’t close the previous terminal; open a new terminal.
  • To roll back if needed: hadoop namenode -rollback

Hadoop Interview Questions

Why use Hadoop?
Hadoop can handle any type of data in any quantity, and it leverages commodity hardware to keep costs down.
Structured or unstructured, with or without a schema, high volume or low volume: whatever the data is, you can store it reliably.
What is Big Data?
Traditional databases find it difficult to process many different types of data and vast amounts of data. Big data is a strategy for processing large and complex data sets that cannot be processed by traditional databases.

Today every organization generates a massive volume of both structured and unstructured data, which is difficult to store and process. Big data addresses this problem and is characterized by the 4 V’s:

  • Volume – Size of the data
  • Velocity – Speed of the data
  • Variety – Structured & unstructured data
  • Veracity – Uncertain, imprecise data

What is Hadoop?
Hadoop is an open source project from the Apache foundation that enables distributed storage and processing of large data sets across clusters of commodity hardware.
What is a file system?
A file system is a set of structured data files that the OS uses to keep and organize data on disk. Every file system grants users and groups read, write, execute and delete privileges.

What is a FUSE file system?
FUSE stands for Filesystem in Userspace. HDFS is a user-space file system, but not a POSIX file system; Hadoop does not satisfy all POSIX rules and regulations.

What is DFS?

A Distributed File System is a client/server-based application that stores data across different servers in parallel, based on the server architecture.

What is NoSQL?

NoSQL stands for Not Only SQL. It eases many RDBMS problems by storing and accessing data across multiple servers. It is highly recommended for standalone projects and huge unstructured datasets.

What is the difference between real-time and batch processing?

Batch process:

It executes a series of programs (jobs) on a computer without any manual interaction. Hadoop uses batch processing by default.

Real-time process:

A series of jobs that execute continuously and process the data as early as possible is called a real-time process. Many tools in the Hadoop ecosystem allow real-time processing.

What is metadata?

Data about data is called metadata. The NameNode stores the metadata, but it does not index the data. The NameNode knows information about the data only, not the details of its content.

What is NFS?

Network File System is a client/server application that allows resources to be shared between different servers on a computer network. Hadoop 2.x allows NFS to be used to store NameNode metadata on another system. NFS was developed by Sun Microsystems.

What is Hadoop Ecosystem?

It is a community of different tools and applications that work in connection with Hadoop. Pig, Hive, HBase, Sqoop, Oozie and Flume are common Hadoop ecosystem applications.

What are RAID disks?

A Redundant Array of Inexpensive/Independent Disks (RAID) stores the same data in different places. It is highly recommended for the NameNode to store its metadata.

What is Replication?

By default Hadoop automatically stores copies of the actual data on different systems, most often including one on another rack. This replicated backup process is called replication. The default replication factor of 3 can be changed; depending on the requirements, the number of replicas can vary between 1 and 512.

Why doesn’t Hadoop support updates and appends?

Hadoop is designed for write-once, read-many access. Hadoop 2.x supports the append operation, but Hadoop 1.x does not.

What is the use of RecordReader in Hadoop?
An InputSplit describes a unit of work but does not know how to access it. The RecordReader class is responsible for loading the data from its source and converting it into key/value pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.

Elaborate on the Hadoop processes.

NameNode: the arbitrator and repository for all HDFS metadata.
Secondary NameNode: takes a checkpoint (backup) of the metadata every hour.
DataNode: stores the actual data in the form of blocks.
JobTracker: schedules MapReduce tasks on specific nodes in the cluster and tracks their progress.
TaskTracker: follows the JobTracker’s instructions and performs the map, reduce and shuffle operations.

Elaborate on the important RPC & HTTP ports.

RPC ports:
  • 8020 – NameNode
  • 8021 – JobTracker
HTTP ports:
  • 50070 – NameNode
  • 50075 – DataNode
  • 50090 – Secondary NameNode
  • 50030 – JobTracker
  • 50060 – TaskTracker

What is the RPC protocol?

The Remote Procedure Call (RPC) protocol supports client/server communication. Clients most often interact with the NameNode and the JobTracker, which is why only ports 8020 and 8021 are used for RPC.

What is the HTTP protocol?

HTTP is the protocol for transferring files from a web server over the World Wide Web. It is how browsers and servers communicate. Each Hadoop daemon exposes its status pages to the browser over HTTP, so every node has an HTTP port number.

Why are the DataNode and TaskTracker on the same machine?

To process a task, the TaskTracker communicates with the DataNode very often. If the DataNode and TaskTracker were on different nodes or far apart, processing would take a long time and be exposed to network failures. So to ease the process, both run on the same machine.

What are setup() and cleanup()?

These MapReduce methods run at the start and at the end of each split.

setup() initializes the resources.

map() and reduce() process the data.

cleanup() closes the resources.

setup() and cleanup() are triggered in both the mapper and the reducer.

map() and reduce() give record-level control, while these two methods give split-level (block-level) control. File-level control is also possible through the InputFormat.

What is Distributed Cache?

When a map or reduce task needs access to common data, old data, or files that the application depends on, the Hadoop framework uses this feature to boost efficiency. It is configured in the JobConf, and the cached files are distributed to multiple servers.

What is Counter in MapReduce?

Counters provide a way to measure the progress or the number of operations that occur within a MapReduce program. Counters do not affect the logic of a MapReduce program, but Big Data analysts and Hadoop developers use them for analytical purposes.
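
The idea can be illustrated with a plain Python sketch in which a map function updates counters while it processes records. It does not use the Hadoop counter API; the counter names (RECORDS_READ, MALFORMED_RECORDS) and the input records are hypothetical.

```python
from collections import Counter


def map_with_counters(records, counters):
    """Parse 'key,value' records; count every record and every bad one."""
    for record in records:
        counters["RECORDS_READ"] += 1
        fields = record.split(",")
        if len(fields) != 2:
            # The counter records the problem without changing the output.
            counters["MALFORMED_RECORDS"] += 1
            continue
        yield fields[0], fields[1]


counters = Counter()
output = list(map_with_counters(["a,1", "broken", "b,2"], counters))
print(output)
print(dict(counters))
```

The map output is the same whether or not anyone looks at the counters; they only report what happened, which is what makes them useful for spotting bad input after a job finishes.
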

Why are the NameNode, JobTracker and Secondary NameNode on different machines?

If the NameNode fails, the Secondary NameNode holds a backup of the metadata. But if the NN and SNN are on the same machine, both can fail at the same time. So it is a good idea to place them on separate systems.
The JobTracker needs a huge amount of RAM to do its work. If the NN and JT run on the same machine, processing may slow down, so it is a good idea to place the JT on a separate system. Both the NN and the JT handle huge amounts of data.

What are the drawbacks of Hadoop 1.x?

  • Single point of failure.
  • Scalable to a maximum of about 4000 nodes.
  • High latency by default.
  • Poor handling of lots of small files.
  • Limited number of jobs.
  • Appends and updates not possible.
  • OS dependent.
    Most of these problems are resolved in Hadoop 2.x.

What is Low latency?

When a process completes quickly, it is said to have low latency. An RDBMS has low latency because it works on small amounts of data, whereas Hadoop has high latency by default.

Elaborate on the FSImage & edit log.

The edit log is a transaction log file that persistently records every change that occurs in the file system; HDFS metadata changes are persisted to the edit log. The complete file system metadata is stored in a file called the FSImage. Each new FSImage overwrites the previous one. A copy of the FSImage is kept on the Secondary NameNode.

What is Checkpoint?

Checkpoint is a process that merges the FSImage and the edit log and compacts them into a new FSImage. It is critical for efficient NameNode recovery and restart, and for knowing the health of the cluster.
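
Conceptually, a checkpoint replays the edit log on top of the last FSImage. The sketch below shows that idea on toy data: the image is a dictionary from path to replication factor, and the log is a list of operations. The data layout, the operation names and the function name are all invented for the illustration and do not reflect the real on-disk formats.

```python
def checkpoint(fsimage, edit_log):
    """Replay the edit log on a copy of the image.

    Returns the new image and an empty edit log, since every logged
    change is now contained in the image itself.
    """
    new_image = dict(fsimage)
    for op, path, *args in edit_log:
        if op == "create":
            new_image[path] = args[0]          # args[0]: replication factor
        elif op == "delete":
            new_image.pop(path, None)
        elif op == "rename":
            new_image[args[0]] = new_image.pop(path)  # args[0]: new path
    return new_image, []


# Toy namespace and a few hypothetical logged changes.
fsimage = {"/a.txt": 3}
edit_log = [
    ("create", "/b.txt", 3),
    ("rename", "/a.txt", "/c.txt"),
    ("delete", "/b.txt"),
]

new_fsimage, new_edit_log = checkpoint(fsimage, edit_log)
print(new_fsimage, new_edit_log)
```

After a checkpoint, a restart only has to load the compact image instead of replaying a long log, which is why regular checkpoints shorten NameNode recovery.
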

What does a Hadoop daemon do?

A program that runs in the background until its process is stopped is called a daemon.
Most often each of these daemons runs in a separate Java process (JVM instance).

Is Java mandatory to write map and reduce programs?
No. The Hadoop framework has a special utility called Streaming that allows MapReduce programs to be written in Perl, Python, Ruby and other programming languages. To customize MapReduce itself, however, Java is mandatory, because Hadoop is written in Java.
Do the mapper and reducer work together?

Mappers work in parallel and independently on the data in HDFS.
Reducers depend on the mappers’ output: each reducer processes the intermediate data for its own partition of keys.
Intermediate data is stored in the local file system. The reduce processing starts only after all mappers have completed, so there is no direct communication between the two.

What is the importance of the Writable interface?

Writable is an interface that allows data to be serialized and deserialized, based on DataInput and DataOutput. Serialization and deserialization are mandatory for transferring objects over the network.
Hadoop provides different classes that implement the Writable interface, such as Text, IntWritable, LongWritable, FloatWritable, BooleanWritable and more. All of these classes are in the same package.

What is Combiner?

A combiner is a function used to optimize a MapReduce job. It runs on the output of the map phase. The output of the combiner is intermediate data that becomes the input of the reducer.
The output of the reducer is written to disk. The final aggregation across all map outputs is done by the reducer.

What is partitioner?

The partitioner runs after the combiner and before the reducer. It divides the data according to the number of reducers. Whenever there is a reducer, there is a partitioner. The number of mappers is controlled by the splits, while the number of partitions is controlled by the reducers.
Number of partitions = number of reducers.
The hash partitioner is the default partitioner.

What is the hash partitioner?

MapReduce uses HashPartitioner as its partitioner class by default. The hash partitioner ensures that all records with the same map output key go to the same reducer.
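
HashPartitioner picks the partition as `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`. The Python sketch below reproduces that arithmetic. As a stand-in for the key's hash it uses a re-implementation of Java's `String.hashCode`, which is an assumption made for illustration: Hadoop key types such as Text compute their own hash codes.

```python
def java_string_hash(s):
    """Stand-in hash: Java's String.hashCode, kept to 32 bits."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def get_partition(key, num_reducers):
    """Mask off the sign bit, then take the remainder by the reducer count."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers


# The same key always lands in the same partition.
partitions = {key: get_partition(key, 3) for key in ["a", "b", "ab", "a"]}
```

Because the partition depends only on the key and the number of reducers, every mapper routes a given key to the same reducer without any coordination.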

What is normalization?

Normalization is a database design technique that logically divides a database into two or more tables and defines the relationships between those tables.

What is the difference between horizontal and vertical scaling?

Horizontal scaling: scale by adding more machines or nodes to the pool of resources. It is easy to scale dynamically by adding more machines to the existing pool.
E.g. MongoDB, Cassandra
Vertical scaling: add more power (RAM, CPU) to an existing machine or node, so that it handles more data by using more cores.
E.g. MySQL

What are structured and unstructured data?

Data that follows a defined data type model and fits easily into a fixed record is called structured data.
E.g. text, HTML tags
Data that has no defined data type model and is difficult to fit into a fixed record is called unstructured data.
E.g. images, graphics

What is safe mode for NameNode?

On start-up the NameNode temporarily enters a special state called safe mode. The DataNodes send heartbeats and block reports to the NameNode. Once a configurable percentage of data blocks is reported as safely replicated, the NameNode automatically exits safe mode.

What are SSH and HTTPS?

SSH (Secure Shell) is a cryptographic network protocol used for secure access to a remote host.
HTTPS runs on top of SSL/TLS and is used to secure standard HTTP communication.

What is SSH? Why is it used in Hadoop?

SSH (Secure Shell) is what Hadoop's start and stop scripts use to reach the NameNode and DataNode hosts. SSH normally relies on a username/password authentication scheme for secure access to a remote host, but Hadoop needs a passwordless secure connection.

What are Daemons in Hadoop?

A framework process that runs in the background is called a daemon. There are 5 daemons:
  • NameNode
  • DataNode
  • Secondary NameNode
  • JobTracker
  • TaskTracker

Each daemon runs separately in its own JVM.

What is Speculative execution?

When speculative execution is enabled, the JobTracker assigns the same task to multiple nodes and takes the result from whichever node finishes first; the remaining task instances are discarded.

Is it true that the number of blocks equals the number of jobs?

No. By default, number of blocks = number of mappers.

Number of splits = number of maps (always).

Once data is stored, its block size cannot be changed. The split size can be changed, because splitting is a logical operation. So, depending on the project, the split size configuration can be adjusted.
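
The arithmetic behind this can be sketched as follows. FileInputFormat derives the split size as `max(minSize, min(maxSize, blockSize))`; the split count below is a simplification that assumes a single file and ignores the framework's extra handling of the last split. The 1 GB file and 128 MB block size are example values, not taken from the text above.

```python
def compute_split_size(block_size, min_size, max_size):
    """Split size rule: the block size, clamped between min and max."""
    return max(min_size, min(max_size, block_size))


def num_splits(file_size, split_size):
    """Simplified count: ceiling of file size over split size."""
    return (file_size + split_size - 1) // split_size


MB = 1024 * 1024
NO_LIMIT = 2**63 - 1          # effectively "no maximum split size"

default_maps = num_splits(1024 * MB, compute_split_size(128 * MB, 1, NO_LIMIT))
smaller_splits = num_splits(1024 * MB, compute_split_size(128 * MB, 1, 64 * MB))
larger_splits = num_splits(1024 * MB, compute_split_size(128 * MB, 256 * MB, NO_LIMIT))
```

With the defaults the split size equals the block size, so the 1 GB file gets 8 mappers. Lowering the maximum split size to 64 MB doubles that to 16, and raising the minimum to 256 MB halves it to 4, all without touching the stored blocks.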

Is there any relation between Mapper outputs?

No. Each mapper's output is independent; there is no relation between the outputs of different mappers.

Why do we use Pipes in Hadoop?

Hadoop Pipes is a package that allows MapReduce programs for Hadoop to be written in C++. It lets the C++ code exchange data with the framework in a format Hadoop understands.

What is DistCp?

DistCp (distributed copy) is a tool for copying large amounts of data. It copies the data across multiple clusters in parallel.

What is Rack awareness?

To minimize network traffic between DataNodes in different racks, the NameNode chooses where to place blocks based on rack information. This is called rack awareness.

Why is the combiner important?

The combiner is a function used to optimize a MapReduce job. It works as a map-side reducer, but the job must not depend on the combiner for correctness, because the framework does not guarantee how many times it will run.

What are the types of schedulers?

FIFO: the default scheduler. It schedules jobs in first-in, first-out order.
FAIR: assigns priority dynamically.
CAPACITY: assigns each job a percentage of capacity. Highly recommended in 2.x.

What types of compression techniques are there in Hadoop?

  • None
  • Record
  • Block (highly recommended)
Compression codecs are:
  • Default codec
  • Gzip (.gz)
  • Bzip2 (.bz2)
  • Snappy (.snappy), highly recommended
  • LZO (.lzo)

What is the importance of serialization in MapReduce?

Serialization is the process of converting structured objects into a byte stream. Hadoop stores and transfers data only as binary streams, and RPC uses serialization to convert its messages into a byte stream.

What is deserialization and how does it work?

The RPC protocol uses serialization to convert the data on the source node into a binary stream. The framework transfers this stream to the remote destination node.
The destination node uses deserialization to convert the binary stream back into structured objects.

What is Inverted Index?

An inverted index is a simple hash table that maps each word to the set of documents containing it. All search engines use an inverted index to process user-submitted queries.
Doc 1: Venu, brms, Madhavi, anjali, anu, Jyothi, Koti
Doc 2: Venu, anu, brms, Sita, jyothi
Doc 3: Venu, Jyothi
Inverted index (case-sensitive):
  • Venu -> Doc 1, Doc 2, Doc 3
  • brms -> Doc 1, Doc 2
  • Madhavi -> Doc 1
  • anjali -> Doc 1
  • anu -> Doc 1, Doc 2
  • Jyothi -> Doc 1, Doc 3
  • Koti -> Doc 1
  • Sita -> Doc 2
  • jyothi -> Doc 2
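
Building such an index from the three sample documents can be sketched in a few lines of Python. This is a single-process illustration rather than a MapReduce job; it assumes the second, unlabeled word list is Doc 2, and it treats words as case-sensitive, so "Jyothi" and "jyothi" are separate entries. The function name `build_inverted_index` is hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the list of documents that contain it."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        # dict.fromkeys removes repeated words while keeping their order.
        words = dict.fromkeys(w.strip() for w in text.split(","))
        for word in words:
            if word:
                index[word].append(doc_id)
    return dict(index)


docs = {
    "Doc 1": "Venu, brms, Madhavi, anjali, anu, Jyothi, Koti",
    "Doc 2": "Venu, anu, brms, Sita, jyothi",   # assumed to be Doc 2
    "Doc 3": "Venu, Jyothi",
}
index = build_inverted_index(docs)
```

In a MapReduce version the mapper would emit (word, document id) pairs and the reducer would collect the document ids for each word.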

What is Data Locality?

Hadoop follows the principle that “moving the logic to the data is cheaper than moving the data”. Instead of transferring the data, the computation is shipped to the node where the data is stored and executed there. This is the default behaviour. It is much more difficult to achieve for unstructured data.

How does TextInputFormat read the data?

By default, Hadoop MapReduce uses text as its input and output format. The framework treats each line as one record. The key is the byte offset of the line within the file, and the value is the content of the whole line. This key and value are processed by a mapper, which receives the key as a LongWritable parameter and the value as a Text parameter.
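
The record layout can be imitated in plain Python. The generator below is only a sketch of the idea, not Hadoop's reader: it walks over a byte string and yields (byte offset, line) pairs, the same shape of key and value that a mapper receives from TextInputFormat. The name `text_records` and the sample data are made up.

```python
import io


def text_records(data):
    """Yield (byte offset of the line, line text) for each line in data."""
    offset = 0
    for raw in io.BytesIO(data):            # iterates line by line, keeping b"\n"
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)                  # the next line starts after this one


records = list(text_records(b"hello world\nfoo\nbar"))
```

The keys here are 0, 12 and 16: each one is the position in the file where its line begins, not a line number.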

What is the importance of processing data in parallel from multiple disks?

Hard drive capacity has grown massively year after year, a trend often compared to Moore’s law. Storing the data on multiple drives is not a problem, but reading all of it takes a long time.
Storing and processing the data in parallel eases this problem. Hadoop is a framework designed to store and process data in parallel.

What are the problems with parallel writing and reading?

Hardware failure: power failures, network failures and server crashes are the main problems.
Combining the data correctly and in the right order for processing is much more difficult.

How does Hadoop resolve the parallel read/write problem?

HDFS stores the data reliably through replication. It keeps the data on multiple systems and allows it to be processed in parallel.
MapReduce reads the data in parallel and writes it sequentially.

Where is HDFS not suitable?

Applications that require low-latency data access are not a good fit.
Lots of small files increase the amount of metadata, so that is not recommended.
HDFS does not support multiple writers or arbitrary file modifications, so Hadoop is not suitable for such applications.

  • Hardware failure is common in distributed systems. Hadoop eases this problem by replicating the data.
  • Most applications access the data as a stream, in batch processing mode.
  • It scales easily to large data sets, provides high throughput and minimizes network usage.
  • It uses a simple coherency model: write-once, read-many access for files.
  • Portability across different platforms is another plus point of HDFS, so many types of application can adopt it easily.
  • It can run on commodity hardware and store data at very low cost.

Can you explain the HDFS architecture?

  • HDFS has a master/slave architecture.
  • A single NameNode acts as the master and multiple DataNodes act as slaves.
  • Internally, an input file is split into multiple chunks (blocks) of data, and these chunks are stored on multiple DataNodes.
  • Because the chunks are stored across the cluster, they can be read and written in parallel.

What are the DataNode’s responsibilities?

  • The DataNodes are responsible for serving read and write requests from the file system client.
  • Based on the NameNode’s instructions, the DataNodes also perform block creation, deletion and replication.
  • Every three seconds each DataNode sends a heartbeat to the NameNode. Every 10th heartbeat, the DataNode also sends a block report to the NameNode.