
Hadoop MapReduce Interview Questions

What is Hadoop MapReduce?
MapReduce is a programming model and framework for processing vast amounts of data on a Hadoop cluster. It processes large datasets in parallel across the nodes of the cluster in a fault-tolerant manner.
Can you elaborate on a MapReduce job?
Based on the configuration, a MapReduce job first splits the input data into independent chunks (input splits, by default one per HDFS block). These chunks are processed by the map() function, and the map output is then processed by the reduce() function. The framework takes care of sorting the map outputs and scheduling the tasks.
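The flow described above can be imitated in a few lines of plain Python. This is only a single-process sketch of the map, shuffle/sort and reduce steps, not Hadoop code; the names map_fn, reduce_fn and run_job are made up for the illustration.

```python
from collections import defaultdict

def map_fn(line):
    # Emit an intermediate (word, 1) pair for every word in the line.
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    # Collapse all values collected for one key into a single count.
    return key, sum(values)

def run_job(lines):
    groups = defaultdict(list)
    for line in lines:                     # map phase
        for key, value in map_fn(line):
            groups[key].append(value)      # shuffle: group values by key
    # sort: the reducer sees the keys in sorted order
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]

result = run_job(["Hadoop stores data", "Hadoop processes data"])
print(result)
```

In a real cluster the map calls run on many nodes at once and the grouping step moves data over the network; the sketch only shows the order of the steps.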
Why are the compute nodes and the storage nodes the same?
Compute nodes process the data and storage nodes store it. By default the Hadoop framework tries to minimize network traffic, and to achieve that it follows the data locality concept: the code executes where the data is stored, so the data node and the compute node are the same machine.
What is the importance of the Configuration object in MapReduce?

  • It is used to set and get parameter name/value pairs defined in XML files.
  • It is used to initialize values, read them from an external file and set them as parameters.
  • Values set in the program override the values loaded from external configuration files, unless the file marks a parameter as final.
  • Parameters that are not set anywhere fall back to Hadoop's default values.

Where is MapReduce not recommended?

MapReduce is not recommended for iterative processing, where the output of one pass is fed back as input in a loop.
It is also a poor fit for a long series of MapReduce jobs: each job persists its data to disk and the next job loads it again, which is a costly operation.

What is the NameNode and what are its responsibilities?

NameNode is the name of the daemon that runs on the master node of HDFS. It is the heart of the entire Hadoop system. It stores the file system metadata in the FsImage and learns where blocks are located from the block reports and heartbeats sent by the DataNodes.

What are the JobTracker's responsibilities?

  • Scheduling the job's tasks on the slaves. The slaves execute the tasks as directed by the JobTracker.
  • Monitoring the tasks and re-executing any that fail.

What are the JobTracker and TaskTracker in MapReduce?
The MapReduce framework consists of a single JobTracker per cluster and one TaskTracker per node. A cluster usually has multiple nodes, so each cluster has a single JobTracker and multiple TaskTrackers.
The JobTracker schedules the job and monitors the TaskTrackers. If a TaskTracker fails to execute its tasks, the JobTracker re-executes the failed tasks.
The TaskTracker follows the JobTracker's instructions and executes the tasks. As a slave node, it reports the task status to the master JobTracker in the form of heartbeats.
What is the importance of job scheduling in Hadoop MapReduce?
Scheduling is a systematic procedure for allocating resources in the best possible way among multiple tasks. A Hadoop cluster runs many jobs at once, and sometimes a particular job should finish quickly and be given higher priority; this is where job schedulers come into the picture. The default scheduler is FIFO.
The FIFO, Fair and Capacity schedulers are the most popular schedulers in Hadoop.
When is a reducer used?
A reducer is used to combine the output of multiple mappers. It has three primary phases: shuffle, sort and reduce. It is possible to process data without a reducer, but one is needed whenever shuffle and sort are required.
What is the replication factor?
The replication factor is the number of copies of each block of data that are stored on different nodes within a cluster. The default value is 3, but it can be changed. Each file is automatically split into blocks and spread across the cluster.
Where does the shuffle and sort process happen?
After the mapper generates its output, the intermediate data is stored temporarily on the local file system. The location of this temporary data is usually derived from a setting in Hadoop's core-site.xml. The framework aggregates and sorts this intermediate data and then passes it on to be processed by the reduce function. The framework deletes the temporary data from the local system after the job completes.
Is Java mandatory for writing MapReduce jobs?
No. Hadoop itself is implemented in Java, but MapReduce applications need not be written in Java. Hadoop supports Python, Ruby, C++ and other programming languages.
The Hadoop Streaming API allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
Hadoop Pipes allows programmers to implement MapReduce applications in C++.
Which methods control the output types of the map and reduce functions?
setOutputKeyClass() and setOutputValueClass()
If the map output types differ from the final output types, the map output types can be set with these methods:
setMapOutputKeyClass() and setMapOutputValueClass()
What is the main difference between Mapper and Reducer?
The map method is called separately for each input key/value pair. It processes input key/value pairs and emits intermediate key/value pairs.
The reduce method is called separately for each key together with its list of values. It processes intermediate key/value pairs and emits final key/value pairs.
Both also have an initialization method that is called once before any other method; it processes no input records and produces no output.

Why are compute nodes and storage nodes the same?
Compute nodes are logical processing units and storage nodes are physical storage units. Both run on the same node because of data locality. As a result Hadoop minimizes network traffic and processes the data quickly.
What is the difference between a map-side join and a reduce-side join? or
When do we use a map-side join and when a reduce-side join?
Joining multiple tables on the mapper side is called a map-side join. Note that a map-side join has strict requirements: either one of the tables is small enough to be held by every mapper, or the inputs are sorted and partitioned in the same way. When one table has a large number of rows and columns and the other has only a few, use a map-side join.

Joining multiple tables on the reducer side is called a reduce-side join. If you plan to join tables that are both large, go through the reduce phase: the shuffle brings the records with the same key together. It is the most general way to join multiple tables, but it costs more because all the data is shuffled.
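The two join strategies can be contrasted with a small plain-Python sketch. It assumes tables are lists of (key, value) pairs; the function names and sample data are hypothetical and only mimic what mappers and reducers would do.

```python
from collections import defaultdict

def map_side_join(large_rows, small_rows):
    # The small table is held in memory by every mapper as a lookup table.
    lookup = {key: value for key, value in small_rows}
    return [(key, value, lookup[key]) for key, value in large_rows if key in lookup]

def reduce_side_join(left_rows, right_rows):
    # Mappers tag each record with its source; the shuffle groups by key.
    groups = defaultdict(lambda: {"L": [], "R": []})
    for key, value in left_rows:
        groups[key]["L"].append(value)
    for key, value in right_rows:
        groups[key]["R"].append(value)
    # The reducer pairs up the records from both sides for each key.
    return [(key, left, right) for key in sorted(groups)
            for left in groups[key]["L"] for right in groups[key]["R"]]

orders = [(1, "book"), (2, "pen"), (1, "lamp")]
customers = [(1, "Asha"), (2, "Ravi")]
print(map_side_join(orders, customers))
print(reduce_side_join(orders, customers))
```

Both functions produce the same joined rows; the map-side version avoids the grouping step entirely, which is why it is preferred when one input fits in memory.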
What happens if the number of reducers is 0?
Setting the number of reducers to 0 is a valid configuration in MapReduce. In this scenario no reducer runs, so the mapper output is treated as the final output, and Hadoop stores it in the job's output folder.
When do we use a combiner? Why is it recommended?
Mappers and reducers are independent; they don't talk to each other. When the function is commutative (a.b = b.a) and associative (a.(b.c) = (a.b).c), we can use a combiner to optimize the MapReduce process. Many MapReduce jobs are limited by network bandwidth, so the framework tries to minimize the data sent over the network. To achieve this, MapReduce allows a user-defined "combiner function" to run on the map output. It is a MapReduce optimization technique, and it is optional.
What is the main difference between a MapReduce combiner and a reducer?
Both the combiner and the reducer are optional, but both are frequently used in MapReduce. There are three main differences:
1) A combiner gets its input from only one mapper, while a reducer gets input from multiple mappers.
2) If aggregation is required, use a reducer; if the function is also commutative (a.b = b.a) and associative (a.(b.c) = (a.b).c), a combiner can be used as well.
3) The input and output key and value types must be the same in a combiner, whereas a reducer can take one type as input and emit any output type.
What is a combiner?
It is a local aggregation of the key/value pairs produced by a mapper. It greatly reduces the amount of duplicated data transferred between nodes, so it eventually improves job performance. The framework decides whether the combiner runs zero, one or multiple times. It is not suitable for functions such as the mean.
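Why a combiner works for a sum but not directly for a mean can be checked with a little arithmetic. The sketch below is plain Python with made-up numbers; mapper_outputs stands for the values that two mappers emitted for the same key.

```python
def combine_sum(values):
    # Local pre-aggregation of one mapper's output.
    return sum(values)

mapper_outputs = [[1, 2, 3], [10]]  # values for one key from two mappers

# Sum: combining locally first gives the same answer as reducing everything.
total_direct = sum(v for part in mapper_outputs for v in part)
total_combined = sum(combine_sum(part) for part in mapper_outputs)

# Mean: averaging the per-mapper averages gives the wrong answer.
true_mean = total_direct / sum(len(part) for part in mapper_outputs)
mean_of_means = sum(sum(p) / len(p) for p in mapper_outputs) / len(mapper_outputs)

# A combiner can still help if it emits (sum, count) pairs instead.
partials = [(sum(p), len(p)) for p in mapper_outputs]
safe_mean = sum(s for s, _ in partials) / sum(c for _, c in partials)

print(total_direct, total_combined, true_mean, mean_of_means, safe_mean)
```

The usual workaround, shown in the last step, is to carry the sum and the count separately and divide only in the reducer.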
What is a partition?
The Partitioner works on the intermediate map output (after the combiner, if there is one) and controls which reducer each key goes to during the shuffle. It divides the intermediate data according to the number of reducers so that all the data in a single partition is processed by a single reducer. In other words, each partition is processed by exactly one reducer. Whenever a job uses reducers, partitioning happens automatically.
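A hash partitioner can be sketched as follows. Hadoop's default partitioner takes the key's hash code modulo the number of reduce tasks; this Python stand-in uses CRC32 as the hash (an assumption made only so that the result is stable between runs), and the function name partition is hypothetical.

```python
import zlib

def partition(key, num_reducers):
    # Stable hash of the key, reduced modulo the number of reducers.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

keys = ["apple", "banana", "apple", "cherry"]
assignments = [(k, partition(k, 3)) for k in keys]
print(assignments)
```

Because the partition depends only on the key, both "apple" records land on the same reducer, which is what makes per-key aggregation possible.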
When do we use partitions?
By default Hive reads the entire dataset even when the application needs only a slice of the data. That is a bottleneck for MapReduce jobs, so Hive offers a special option called partitions. When you create a table, Hive partitions it according to the partition columns you specify.
What are the important steps when you are partitioning a table?
Don't over-partition the data into many small partitions; that adds overhead on the NameNode.
For dynamic partitioning, strict mode requires at least one static partition. To make every partition dynamic, enable dynamic partitioning and switch to nonstrict mode with these commands:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
First load the data into a non-partitioned table, then insert it from there into the partitioned table. A local file cannot be loaded directly into dynamic partitions.

insert overwrite table table_name partition(year) select * from non_partitioned_table;
Can you elaborate on the MapReduce job architecture?
First the Hadoop programmer submits the MapReduce program to the JobClient.

The JobClient asks the JobTracker for a job ID. The JobTracker provides one in the form job_<JobTracker start time>_0001. It is a unique ID.

Once the JobClient has received the job ID, it copies the job resources (job.xml, job.jar) to the file system (HDFS) and submits the job to the JobTracker. The JobTracker initializes the job and schedules it.

Based on the configuration, the input is divided into input splits, and the split information is also written to HDFS. The TaskTracker retrieves the job resources from HDFS and launches a child JVM. The map and reduce tasks run in this child JVM, and the TaskTracker notifies the JobTracker of the job status.
Why does the TaskTracker launch a child JVM?
Hadoop developers often submit faulty jobs or jobs with bugs by mistake. If the TaskTracker ran tasks in its own JVM, a faulty task could disrupt the main JVM and affect other tasks. With a child JVM, if a task misbehaves, the TaskTracker kills that child JVM and retries the task in a new child JVM.
Why do the JobClient and JobTracker submit job resources to the file system?
Data locality. Moving computation is cheaper than moving data. The logic (the JAR file) and the split information are therefore copied to the file system, whose DataNodes hold the data, so that every resource is available where the data is.

How many mappers and reducers can run?

By default Hadoop can run 2 mappers and 2 reducers on one DataNode, because each node has 2 map slots and 2 reduce slots. These defaults can be changed in mapred-site.xml in the conf directory.
What is an InputSplit?
An InputSplit is the logical chunk of data that is processed by a single mapper. By default the InputSplit size equals the block size.
How to configure the split value?
By default the block size is 64 MB, but to process the data the framework divides the input into splits. The split size can be worked out with either of these formulas, which are equivalent as long as min_split_size is not larger than max_split_size:

1) split size = min(max_split_size, max(block_size, min_split_size))

2) split size = max(min_split_size, min(block_size, max_split_size))

By default, split size = block size.

The number of splits always equals the number of mappers.

Applying the formulas with block_size = 64 MB, min_split_size = 512 KB and max_split_size = 10 GB (an assumed value; it depends on the environment and may be 1 GB or 10 GB):

1) split size = min(10 GB, max(64 MB, 512 KB))

split size = min(10 GB, 64 MB)

split size = 64 MB

2) split size = max(512 KB, min(64 MB, 10 GB))

split size = max(512 KB, 64 MB)

split size = 64 MB
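The same calculation can be written as a short Python function. The function names and the 10 GB maximum are assumptions carried over from the worked example; the split count uses plain ceiling division and ignores the small tolerance Hadoop allows on the last split.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB

def split_size(block_size, min_split_size, max_split_size):
    # split size = max(min_split_size, min(block_size, max_split_size))
    return max(min_split_size, min(block_size, max_split_size))

def num_splits(file_size, size):
    # Ceiling division: a trailing partial chunk still needs its own split.
    return -(-file_size // size)

size = split_size(64 * MB, 512 * KB, 10 * GB)
print(size // MB, num_splits(100 * MB, size))
```

Raising min_split_size above the block size is the usual way to get fewer, larger splits, for example split_size(64 * MB, 128 * MB, 10 * GB) gives 128 MB.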
How much RAM is required to process 64 MB of data?
Let us assume a 64 MB block size and a system that runs 2 mappers and 2 reducers: 64 * 4 = 256 MB of memory. The OS takes at least 30% extra (about 80 MB), so at least 256 + 80 = 336 MB of RAM is required to process a chunk of data.

Processing unstructured data requires correspondingly more memory.
What is the difference between a block and a split?
Block: the physical chunk in which data is stored on disk.
Split: the logical chunk of data that is handed to one mapper for processing.
Why does the Hadoop framework read a file in parallel and not sequentially?
Why does Hadoop read in parallel but not write in parallel?
Hadoop reads data in parallel because that retrieves the data faster. Writes are done in sequence, because with parallel writes one node's data could overwrite another's. Parallel tasks are independent and know nothing about each other, so if data were written in parallel, a writer could not know where the next chunk of data belongs. For example, 100 MB of data is stored as one 64 MB block and one 36 MB block; if the blocks were written in parallel, the first block would not know where the remaining data is. So Hadoop reads in parallel and writes sequentially.
If I change the block size from 64 MB to 128 MB, what happens?
Changing the block size does not affect existing data. After the change, every new file is divided into 128 MB blocks.

This means old data stays in 64 MB blocks, while new data is stored in 128 MB blocks.
What is isSplitable()?
By default it returns true. The input format uses it to decide whether a file may be split. For unstructured data it is not advisable to split the file, so the entire file should be processed as one split. To do that, override isSplitable() to return false.
What are the maximum and minimum block sizes that Hadoop allows?
Minimum: 512 bytes. That is the block size of the local OS file system, and the HDFS block size cannot be smaller than it.

Maximum: depends on the environment. There is no fixed upper bound.
What are the job resource files?
job.xml and job.jar are the core resources needed to process the job. The JobClient copies these resources to HDFS.
What does a MapReduce job consist of?
A MapReduce job is a unit of work that the client wants performed. It consists of the input data, the MapReduce program in a JAR file and the configuration settings in XML files. Hadoop runs the job by dividing it into tasks with the help of the JobTracker.
What is data locality?
This is one of the most frequently asked Cloudera certification questions and one of the most important MapReduce interview questions. Processing the data wherever the data resides, instead of moving it to the computation, is called data locality. "Moving computation is cheaper than moving data": Hadoop follows data locality to achieve this goal. It is possible when the data is splittable, which is true by default.
What is speculative execution?
This is an important MapReduce interview question and comes up in Cloudera certification as well. Hadoop runs on commodity hardware, so machines may fail or be short of memory. If a machine fails, the task running on it fails too, which is not acceptable. Speculative execution is a performance optimization technique: the same computation is sent to multiple machines, and the result is taken from whichever machine finishes first. It is enabled by default. Even if one machine crashes, it is not a problem, because the framework takes the result from another machine.

Eg: the logic is distributed to systems A, B, C and D.

Systems A, B, C and D would need 10 min, 8 min, 9 min and 12 min respectively, running simultaneously. The result from system B is used and the remaining processes are killed; the framework takes care of killing the processes on the other systems.
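The A/B/C/D example can be modelled in a few lines of Python. This only mirrors the example above (pick the attempt that finishes first, kill the rest); the function name is hypothetical and the real scheduler's rules for when to launch a speculative attempt are not modelled.

```python
def speculative_winner(attempt_minutes):
    # The attempt with the shortest running time finishes first and wins.
    winner = min(attempt_minutes, key=attempt_minutes.get)
    # Every other attempt of the same task is killed.
    killed = sorted(name for name in attempt_minutes if name != winner)
    return winner, killed

winner, killed = speculative_winner({"A": 10, "B": 8, "C": 9, "D": 12})
print(winner, killed)
```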
When do we use a reducer?
Use reducers only when sort and shuffle are required; otherwise there is no need for partitioning either. A filter, for example, needs no sort and shuffle, so it can be done without a reducer.
What is ChainMapper?
ChainMapper is a special class that runs a set of mapper classes in a chain within a single map task. The output of one mapper acts as the input of the next mapper, and in this way any number of mappers can be connected in a chain.
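The chaining idea can be sketched in plain Python, assuming each mapper is a function that takes one record and returns a list of output records. The helper chain and the three sample mappers are hypothetical names, not the Hadoop ChainMapper API.

```python
def chain(*mappers):
    # Feed every output record of one mapper into the next mapper.
    def run(record):
        records = [record]
        for mapper in mappers:
            records = [out for rec in records for out in mapper(rec)]
        return records
    return run

def tokenize(line):
    return line.split()

def lowercase(word):
    return [word.lower()]

def drop_short(word):
    # Emit nothing for words of two characters or fewer.
    return [word] if len(word) > 2 else []

pipeline = chain(tokenize, lowercase, drop_short)
print(pipeline("An HBase Row"))
```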
How to do value-level comparison?
Hadoop sorts and compares at the key level only, not at the value level. To compare by value, the value has to be made part of a composite key.
What are the setup and cleanup methods?
If you don't know the starting and ending points of the processing, some problems are very difficult to solve. The setup and cleanup methods address this.

With N blocks, one mapper is called for each split by default. Each map task calls setup once at the start and cleanup once at the end, while map is called once per record. Setup initializes the resources the task needs, map processes the data, and cleanup closes those resources once the last map call has completed. This improves data transfer performance. The same can be done in the reducer as well.

Use map when you compare one key/value pair with another. Use setup and cleanup for work that spans records: a resource is opened once, used many times and closed once, which saves a lot of network overhead during processing.
Why does the TaskTracker launch a child JVM to do a task? Why not use the existing JVM?
Sometimes child threads corrupt parent threads, which means that a programmer's mistake could disrupt the entire MapReduce task. So the TaskTracker launches a child JVM to process each individual map or reduce task. If the TaskTracker used its existing JVM, a faulty task might damage the main JVM. If a bug occurs, the TaskTracker kills the child process and launches another child JVM to do the same task. Usually the TaskTracker retries a task up to 4 times.
How many slots are allocated on each node?
By default each node has 2 slots for mappers and 2 slots for reducers, so each node has 4 slots to process the data.
What is a RecordReader?
A RecordReader reads <key, value> pairs from an InputSplit. The InputSplit presents a byte-oriented view of the input; the RecordReader converts it into a record-oriented view for the Mapper, and only then can the Mapper process the data.

The input location is set with this call:

FileInputFormat.addInputPath() adds a directory to the job's input; the files in it are read and sent to the mappers. All these settings go into the MapReduce job's driver code.
Can you explain different types of Input formats?

[Figure: MapReduce input formats]


Sqoop Interview Questions

What is Sqoop?

Sqoop is an open source tool in the Hadoop ecosystem that imports and exports data between Hadoop and relational databases.
Sqoop provides parallel operation and fault tolerance: it imports and exports the data in parallel tasks, and a failed task can be retried.

Tell me a few import control commands:
These are the commands most frequently used to import RDBMS data.

How does Sqoop handle large objects?

BLOB and CLOB columns are the common large objects. If an object is smaller than 16 MB, it is stored inline with the rest of the data, and such data is materialized in memory for processing. Larger objects are stored in files in the _lobs subdirectory and are processed in a streaming fashion. If you set the inline LOB limit to 0, all large objects are placed in external storage.

What types of databases does Sqoop support?

MySQL, Oracle, PostgreSQL, HSQLDB, IBM Netezza and Teradata. Every database is connected through its JDBC driver.

sqoop import --connect jdbc:mysql://localhost/database --username ur_user_name --password ur_pass_word
sqoop import --connect jdbc:teradata://localhost/DATABASE=database_name --driver "com.teradata.jdbc.TeraDriver" --username ur_user_name --password ur_pass_word

What are the common privilege steps in Sqoop to access MySQL?

As the root user, grant all the privileges needed to access the MySQL database.

mysql -u root -p
//Enter a password
mysql> GRANT ALL PRIVILEGES ON *.* TO '%'@'localhost';
mysql> GRANT ALL PRIVILEGES ON *.* TO ''@'localhost';
// here you can mention db_name.* or db_name.table_name between ON and TO.



Sqoop Interview Question and Answers

What is the importance of the eval tool?
It allows users to run sample SQL queries against the database and preview the results on the console. This helps you check what data can be imported and whether the desired data will be imported.

Syntax: sqoop eval (generic-args) (eval-args)

sqoop eval --connect jdbc:mysql://localhost/database --query "select name, cell from employee limit 10"
sqoop eval --connect jdbc:oracle://localhost/database -e "insert into database values ('Venu', '9898989898')"

Can we import the data with a "where" condition?

Yes, Sqoop has a special option, --where, that imports only the rows matching a condition.

sqoop import --connect jdbc:mysql://localhost/CompanyDatabase --table Customer --username root --password mysecret --where "DateOfJoining > '2005-1-1' "

How to import or export only particular columns?

A separate argument, --columns, selects which columns of the table to import or export.

Syntax: --columns <col,col,col…>


sqoop import --connect jdbc:mysql://localhost/database --table employee --columns "emp_id,name,cell" --username root --password password

What is the difference between Sqoop and distcp?

Distcp can transfer any type of data from one cluster to another, whereas Sqoop transfers data between an RDBMS and the Hadoop ecosystem. Both distcp and Sqoop follow the same approach to pull and transfer data.

What is the difference between Flume and Sqoop?
Flume is a distributed, reliable Hadoop ecosystem tool that collects, aggregates and moves large amounts of log data. It can collect data from different sources and push it asynchronously into HDFS.
It does not depend on a schema, and it can pull any type of data, structured or unstructured.
Sqoop acts as an interpreter that exchanges data between an RDBMS and the Hadoop ecosystem. It can import or export only RDBMS data, and a schema is mandatory for processing.

What are the common delimiters and escape characters in Sqoop?

The default delimiters are a comma (,) for fields and a newline (\n) for records. The common delimiter arguments, each followed by its value, are given below.
--enclosed-by <char>
--escaped-by <char>
--fields-terminated-by <char>
--lines-terminated-by <char>
--optionally-enclosed-by <char>

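Escape characters are: \b (backspace), \n (newline), \r (carriage return), \t (tab), \" (double quote), \' (single quote), \\ (backslash) and \0 (NUL).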

Can Sqoop import tables into Hive?
Yes, it is possible. Several Hive arguments are available for importing into Hive, for example:

--hive-table <table-name>

Can Sqoop import data into HBase?
Yes, a few arguments help to import the data into HBase directly.

--column-family <family>
--hbase-row-key <col>
--hbase-table <table-name>

What is the metastore tool?
This tool hosts a shared metastore, which is configured in sqoop-site.xml. Multiple users can access and execute the saved jobs stored in it, but their clients must be configured in sqoop-site.xml to connect to it.


Syntax: sqoop metastore (generic-args) (metastore-args)

The Sqoop meta-store jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop --store-dir /metastore-hdfs-file

What is the Sqoop merge tool?

The merge tool combines two datasets, where records in the newer dataset overwrite records in the older one. It flattens the two datasets into one.
Syntax: sqoop merge (generic-args) (merge-args)

 sqoop merge --new-data newer --onto older --target-dir merged --jar-file datatypes.jar --class-name Foo --merge-key id
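What the merge amounts to can be sketched in plain Python, assuming each dataset is a list of dictionaries and merge_key names the key column. This only illustrates the "newer overwrites older" rule; it is not how Sqoop implements the tool.

```python
def merge(older, newer, merge_key):
    # Index the older rows by key, then let the newer rows replace them.
    merged = {row[merge_key]: row for row in older}
    merged.update({row[merge_key]: row for row in newer})
    return [merged[key] for key in sorted(merged)]

older = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
newer = [{"id": 2, "name": "B"}, {"id": 3, "name": "c"}]
print(merge(older, newer, "id"))
```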

What is codegen?
Codegen is a tool that generates Java classes which encapsulate and interpret imported records.
Syntax: $ sqoop codegen (generic-args) (codegen-args)

Apart from import and export, can Sqoop do anything else?
Yes, it can do many things.
Codegen: generates code to interact with RDBMS database records.
Eval: evaluates a SQL statement and displays the results.
Merge: flattens multiple datasets into one dataset.

Can you import or export a particular row or column?

Sure. Sqoop provides a few options that restrict an import or export to particular columns, or to the rows that match a where clause.

--columns <col1,col2..>
--where <condition>
--query <SQL query>


sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --where "start_date > '2010-01-01'"
sqoop eval --connect jdbc:mysql://db.example.com/corp \
    --query "SELECT * FROM employees LIMIT 10"
sqoop import --connect jdbc:mysql://localhost/database --username root --password your_password --columns "name,employee_id,jobtitle"

How to create and drop a Hive table in Sqoop?
It is possible to create a Hive table, but it is not possible to drop one.

sqoop create-hive-table --connect jdbc:mysql://localhost/database --table table_name

Assume you use Sqoop to import the data into a temporary Hive table using no special options to set custom Hive table field delimiters. In this case, what will Sqoop use as field delimiters in the Hive table data file?
The Sqoop default delimiter is 0x2c (comma), but when doing a Hive import Sqoop by default uses Hive's default delimiter, which is 0x01 (^A).

How to import only the new data in a particular table every day?
This is one of the main problems for Hadoop developers. For example, suppose you imported 1 TB of data yesterday and 1 GB of new data arrived today. A plain import would bring in the full 1 TB + 1 GB again. To fetch only the new rows, use an incremental import. Assuming the time of the last import is stored in $LASTIMPORT, you can run:

sqoop import --incremental lastmodified --check-column lastmodified --last-value "$LASTIMPORT" --connect jdbc:mysql://localhost:3306/database_name --table table_name --username user_name --password pass_word


  1. You are using Sqoop to import data from a MySQL server on a machine named dbserver, which you will subsequently query using Impala. The database is named db, the table is named sales, and the username and password are fred and fredpass. Which query imports the data into a table which can then be used with the Impala


HBase Interview Questions

What is HBase?

HBase is a column-oriented database management system that runs on top of HDFS. It is a sub-project of Hadoop. It is distributed, highly scalable (both linear and modular scaling) and processes billions of rows quickly.

When do we use Apache HBase?

When the database has billions of rows, millions of columns and sparse datasets, HBase is the best choice to process such data. HBase can process unstructured data as well, and it is used when existing data has to be updated.


What is the importance of a column family in HBase?

A column family is a logical division of the data, identified by a key. Columns inside a family are formed dynamically based on the data. A family holds multiple columns of related data, and all column members of a column family have the same prefix. For example, if Vehicle is the family, then Maruthi, Tata and Hero are columns within Vehicle. So here Vehicle is considered the column family.

Eg (the arguments are, in order: table, row, column family:column, value):

put 'cars', 'price', 'Vehicle:Maruthi', '1,00,000'

put 'cars', 'price', 'Vehicle:Tata', '2,00,000'

put 'cars', 'price', 'Vehicle:Hero', '3,00,000'

Here, cars is the table, price is the row, Vehicle is the column family and 1,00,000 is the value.

What is the difference between column-oriented and row-oriented databases?

Why HBase instead of Hadoop?

HBase is suitable for low-latency requests, whereas MapReduce is high latency. Hadoop does not support updates, but HBase does. Hadoop stores metadata only about files, whereas HBase indexes the data itself.

Which datatypes are supported and not supported in HBase?

HBase has the Put and Result interfaces, which convert values to bytes and store them as byte arrays. So it can support any datatype, such as a string, a number, an image or anything else that can be rendered as bytes. Typecasting is always possible.

What are the different types of block cache?

HBase provides 2 different block caches, an on-heap and an off-heap cache, called LruBlockCache (the default) and BucketCache. The on-heap cache lives on the Java heap, whereas BucketCache keeps its blocks outside the heap, for example in a file.


Elaborate on the most important commands in HBase

create 'table', 'columnFamily'

put 'table', 'row', 'columnFamily:column', 'value'

get 'table', 'row', 'columnFamily:column'

scan 'table'

list 'tablename'

disable 'table'

drop 'table'

describe 'table'

What is the MemStore?

The MemStore is a temporary in-memory repository in HBase that holds modifications to the Store. It holds up to a configured maximum amount of data; once it reaches that size (64 MB here), it flushes the data to HDFS.

What are the functions of the DFSClient?

The DFSClient handles all interactions with the remote HDFS servers. Any communication with the NameNode or the DataNodes goes through the DFSClient. HBase persists its data in HDFS via the DFS client.

Where are read performance and write performance high?

Sequential keys, salted keys, promoted field keys and random keys are the 4 main types of keys.

Sequential reads: sequential keys > salted keys > promoted field keys > random keys
Sequential writes: random keys > promoted field keys > salted keys > sequential keys

All of these are ways of designing the row key, and each of them must still produce a unique key.

What is auto-sharding in HBase?

When a table receives a huge amount of data, HBase dynamically splits it and distributes the pieces across the system. This feature is called auto-sharding.

What are ulimit and nproc in HBase?

ulimit sets upper bounds on the resources a process may use.
nproc limits the maximum number of processes available to a particular application, and so restricts how many processes it can start.

What are Bloom filters?

Bloom filters filter out the blocks that you don't need to read. This saves disk reads and improves read latency.
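A tiny Bloom filter shows how such a check works. This is a generic sketch, not HBase's implementation; the class name, the bit-array size and the use of salted SHA-256 hashes are all assumptions chosen for readability.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive several bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("row-42")
print(bf.might_contain("row-42"))
```

A negative answer is always correct, so a store file whose filter says "absent" can be skipped without touching the disk; a positive answer may occasionally be a false positive and still requires the read.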

What is the importance of the MemStore and the BlockCache?

Memory utilization and caching structures are very important in HBase. To achieve its goals, HBase maintains two cache structures, the MemStore and the BlockCache. The MemStore is a temporary repository that buffers writes in memory. The BlockCache keeps data blocks in memory after they have been read.

What are the different types of blocks in HBase?

A block is the smallest unit of data. There are 4 varieties: Data, Meta, Index and Bloom. Data blocks store user data. Index and Bloom blocks serve to speed up the read path: an Index block provides an index over particular Data blocks, and a Bloom block contains a Bloom filter that helps locate the desired data quickly. Meta blocks store information about the HFile.

What are the core components in HBase?

The HMaster serves one or more HRegionServers.
Each HRegionServer has one HLog and serves one or more Regions.
Each Region has multiple Stores.
Each Store has one MemStore and multiple StoreFiles.
Each StoreFile has only one HFile.
Each HFile is made up of blocks, 64 KB each by default.

How is data written?

First the client writes the data to the HRegionServer. The data is first recorded in the HLog (the write-ahead log) and then written to the MemStore, which holds it temporarily. When the MemStore is full, it flushes the data to an HFile. The data is ordered in both the MemStore and the HFile. The HFile is persisted on HDFS via the DFS client.
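The order of the steps (log first, buffer in memory, flush when full) can be sketched with a small Python class. RegionStore, its flush threshold of three rows and the in-memory lists are all hypothetical simplifications; a real region server flushes by size in bytes and writes to HDFS.

```python
class RegionStore:
    def __init__(self, flush_threshold=3):
        self.wal = []        # write-ahead log (HLog): appended to first
        self.memstore = {}   # in-memory buffer of recent writes
        self.hfiles = []     # flushed, sorted, immutable files
        self.flush_threshold = flush_threshold

    def put(self, row_key, value):
        self.wal.append((row_key, value))  # 1. log the edit for durability
        self.memstore[row_key] = value     # 2. buffer it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()                   # 3. flush when the buffer is full

    def flush(self):
        # Persist the buffered rows sorted by key, then clear the buffer.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

store = RegionStore()
for key in ["r3", "r1", "r2", "r4"]:
    store.put(key, "v")
print(store.hfiles, store.memstore)
```

After the four puts, the first three rows sit sorted in one flushed file and the fourth is still only in the log and the memory buffer, which is why the log is needed for recovery.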

 Explain what is WAL and Hlog in Hbase?

Why Hbase follow lexicographical order?

If you forget the syntax, what should you do?

Use help followed by the command name, for example: help 'scan'

What is different between scan and get?

When are CRUD operations not applicable?

While schema-level updates or alterations are being made, it is not possible to run CRUD operations. To alter the schema, you must first disable the table; this is mandatory.







Pig Interview Questions & Answers

What is Pig?

Pig is a data flow platform that processes data in parallel on Hadoop. It uses a special scripting language called Pig Latin to process and analyze the data. It supports joins, sorting, filtering and UDFs. It can store and analyze any type of data, whether structured or unstructured. It is highly recommended for streaming data.

What is a dataflow language?

To access external data, every language has to follow many rules. In a conventional language, control moves from instruction to instruction through different control statements while the data stays where it is. In a dataflow language, a stream of data passes from one instruction to the next to be processed. Pig can express conditions, jumps and loops in this style and process the data in an efficient manner.

Can you define Pig in 2 lines?

Pig is a platform for analyzing large data sets, either structured or unstructured, using Pig Latin scripts. It is intended for processing streaming and unstructured data in parallel.

What are the main differences between local mode and MapReduce mode?

Local mode: there is no need to install or start Hadoop. The Pig scripts run on the local system, and by default Pig stores the data in the local file system.
The commands are 100% the same in MapReduce mode and local mode; nothing needs to be changed.

MapReduce mode: it is mandatory to start Hadoop. Pig scripts run on the cluster and the data is stored in HDFS. In both modes, Java and Pig installations are mandatory.

What is the difference between the store and dump commands?

Dump displays the processed data on the terminal but does not store it anywhere. Store writes the output to a folder in the local file system or HDFS. In a production environment, Hadoop developers most often use the ‘store’ command to save data in HDFS.

What is the relation between map, tuple and bag?

Bag: a collection of tuples. A bag holds whole tuples, which may in turn contain maps. Bags are written with {}.
Tuple: an ordered, fixed-length set of fields, written with (). The fields in a tuple can be any data type, including the complex data types: bags, tuples, and maps.
Map: a set of key-value pairs whose values can be any Pig data type, written with []. A map is often a convenient way to give shape to unstructured data.

{(‘hyderabad’, ‘500001’), ([‘area’#‘ameerpet’, ‘pin’#500016])}

Here { is a bag, ( is a tuple, and [ is a map.

What are the relational operations in Pig?

foreach: iterates over all records and transforms the data.
order by: sorts the data in ascending or descending order.
filter: similar to the WHERE clause in SQL. It selects the data to process.
group: groups the data to get the desired output.
distinct: returns only unique values. It works on entire records, not on individual fields.
join: logically joins multiple relations to get the desired output.
limit: restricts the output to a limited number of records.

What is the importance of the Pig engine?

It acts as an interpreter between Pig Latin scripts and MapReduce jobs. It provides the environment that turns a Pig script into a series of MapReduce jobs and runs them in parallel.

Why Pig instead of MapReduce?

Compared with MapReduce, Apache Pig offers many more features.
In MapReduce it is difficult to join multiple data sets, and the development cycle is very long.
Depending on the task, Pig automatically converts the code into map or reduce tasks. It is easy to join multiple tables and run many SQL-like operations such as join, filter, group by, order by, union and more.

Can you tell me a little bit about Hive and Pig?

Pig internally uses Pig Latin, which is a procedural language. A schema is optional and there is no metastore concept, whereas Hive uses a database to store its metastore.
Hive internally uses a special language called HQL, which is similar to SQL. A schema is mandatory to process the data. Hive is intended for queries.
Both Pig and Hive run on top of MapReduce and convert their internal commands into MapReduce jobs. Both are used to analyze data and can eventually generate the same output.

What does Flatten do in Pig?

Syntactically, flatten looks like a UDF, but it is more powerful than a UDF. The main aim of flatten is to change the structure of tuples and bags, which UDFs cannot do. Flatten un-nests tuples and bags; it is the opposite of “TOBAG” and “TOTUPLE”.
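As a rough illustration of un-nesting, the Python sketch below mimics what flatten does to a tuple whose last field is a bag. It is not Pig code; the function name and the sample data are made up for the example.

```python
def flatten(record):
    """Un-nest the bag in the last field: one output tuple per bag element."""
    *prefix, bag = record
    return [tuple(prefix) + inner for inner in bag]


# One grouped record: a key followed by a bag of tuples.
grouped = ("hyderabad", [("ameerpet", 500016), ("abids", 500001)])
rows = flatten(grouped)
```

Each element of the bag becomes its own flat row, with the leading field repeated on every row.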

Can we process a vast amount of data in local mode? Why?

No. A single system has a limited, fixed amount of storage, whereas Hadoop can handle a vast amount of data. So pig -x mapreduce mode is the best choice for processing a vast amount of data.

How does Pig integrate with MapReduce to process data?

Pig makes execution easier. When a programmer writes a script to analyze data sets, the Pig compiler converts the program into a form MapReduce understands. The Pig engine executes the query as MapReduce jobs. MapReduce processes the data and generates the output. MapReduce does not return the output to Pig; it is stored directly in HDFS.

How do you debug in Pig?

Describe: reviews the schema.
Explain: shows the logical, physical and MapReduce execution plans.
Illustrate: shows the step-by-step execution of each operator.
These commands are used to debug a Pig Latin script.

Tell me a few important operators for working with data in Pig.

Filter: works on tuples (rows) to filter the data.
Foreach: works on columns of data to generate new columns.
Group: groups the data in a single relation.
Cogroup & Join: group or join data from multiple relations.
Union: merges the data of multiple relations.
Split: partitions the content into multiple relations.

What is a topology script?

Hadoop uses topology scripts to determine the rack location of nodes, and that location is used when replicating data. As part of rack awareness, the script is configured through the topology.script.file.name property. If it is not set, the rack id /default-rack is returned for any passed IP address.

Hive doesn’t support multi-line comments; what about Pig?

Pig supports both single-line and multi-line comments.
Single-line comments:
Dump B; -- It executes the data flow, but does not store the result in the file system.
Multi-line comments:
Store B into ‘/output’; /* It stores/persists the data in HDFS or the local file system.
In production the Store command is used most often. */

Can you tell me the important data types in Pig?

Primitive data types: int, long, float, double, chararray, bytearray.
Complex data types: tuple, bag, map.

What does co-group do in Pig?

Cogroup groups rows based on a column. Unlike group, it can work on multiple relations at once and join them on the grouped column.

See: http://joshualande.com/cogroup-in-pig/
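To make the idea concrete, here is a minimal Python sketch of a cogroup over two relations. It is not Pig code: the function, the sample relations and the choice of the first field as the key are all assumptions for illustration.

```python
from collections import defaultdict


def cogroup(left, right, key=lambda row: row[0]):
    """Group two relations by key; each key maps to one bag per input relation."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[key(row)][0].append(row)
    for row in right:
        groups[key(row)][1].append(row)
    return dict(groups)


owners = [("alice", "cat"), ("bob", "dog"), ("alice", "fish")]
friends = [("alice", "carol"), ("dave", "erin")]
result = cogroup(owners, friends)
```

Every key that appears in either relation gets an entry. A key missing from one relation simply gets an empty bag on that side, which is what distinguishes this from an inner join.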

What is the difference between group by and co-group?

Can we say cogroup is a group of more than one data set?


Why do we use user-defined functions (UDFs) in Pig?



Hadoop Interview Questions

Why use Hadoop?
Hadoop can handle any type of data, in any quantity, and leverages commodity hardware to keep costs down.
Structured or unstructured, with or without a schema, high volume or low volume: whatever the data may be, you can store it reliably.
What is Big Data?
Traditional databases find it difficult to process many different types of data and vast amounts of data. Big data is a strategy for processing large and complex data sets that traditional databases cannot process.

Today every organization generates a massive volume of both structured and unstructured data, which is difficult to store and computationally expensive to process. Big data addresses this problem and is commonly described by the 4 V’s:

  • Volume – Size of the data
  • Velocity – Speed of the data
  • Variety – Structured & unstructured data
  • Veracity – Uncertain, imprecise data

What is Hadoop?
Hadoop is an open source project from the Apache foundation that enables distributed storage and processing of large data sets across clusters of commodity hardware.
What is a File System?
A file system is a set of structured data files that the OS uses to keep and organize data on disk. Every file system lets users and groups be granted read, write, execute and delete privileges.

What is the FUSE filesystem?
FUSE stands for Filesystem in Userspace. HDFS is a user-space file system, but it is not a POSIX file system. This means Hadoop does not satisfy all POSIX rules and regulations.

What is DFS?

A Distributed File System is a client/server-based application (a systematic method) that stores data on different servers/systems in parallel, based on the server architecture.

What is NoSQL?

NoSQL stands for Not Only SQL. It eases many RDBMS problems by storing and accessing data across multiple servers. It is highly recommendable for standalone projects and huge unstructured datasets.

What is the difference between real-time and batch processing?

Batch processing:

It executes a series of programs (jobs) on a computer without any manual interaction. Hadoop uses batch processing by default.

Real-time processing:

A series of jobs that execute continuously and process data as early as possible is called real-time processing. Several tools in the Hadoop ecosystem support real-time processing.

What is metadata?

Data about data is called metadata. The NameNode stores the metadata, but it does not index the data. This means the NameNode knows information about the data only, not the details of its inner content.

What is NFS?

Network File System is a client/server application, developed by Sun Microsystems, that allows resources to be shared between different servers on a computer network. Hadoop 2.x allows NFS to be used to store the NameNode metadata on another system.

What is the Hadoop Ecosystem?

It is a community of different tools/applications that work in connection with Hadoop. Pig, Hive, HBase, Sqoop, Oozie and Flume are common Hadoop ecosystem applications.

What are RAID disks?

A Redundant Array of Inexpensive/Independent Disks (RAID) stores the same data in different places. It is highly recommendable for the NameNode to store its metadata on RAID.

What is Replication?

By default Hadoop automatically stores copies of the actual data on different systems, most often in another rack and sometimes in another data center. This backup process is called replication. The default replication factor is 3, and it can be changed. Depending on the requirements, the number of replicas can vary between 1 and 512.

Why doesn’t Hadoop support updates and append?

By default Hadoop is meant for write-once, read-many access. Hadoop 2.x supports the append operation, but Hadoop 1.x does not.

What is the use of RecordReader in Hadoop?
An InputSplit is assigned a unit of work but does not know how to access it. The RecordReader class is responsible for loading the data from its source and converting it into key-value pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.

Can you elaborate on the Hadoop processes?

NameNode: the arbitrator and repository for all HDFS metadata.
Secondary NameNode: takes a backup of the metadata every hour.
DataNode: stores the actual data in the form of blocks.
JobTracker: manages data processing and schedules MapReduce tasks on specific nodes in the cluster.
TaskTracker: follows the JobTracker’s instructions and performs the map, reduce and shuffle operations.

Can you list the important RPC & HTTP ports?

RPC ports:
  • 8020 – NameNode
  • 8021 – JobTracker
HTTP ports:
  • 50070 – NameNode
  • 50075 – DataNode
  • 50090 – Secondary NameNode
  • 50030 – JobTracker
  • 50060 – TaskTracker

What is the RPC Protocol?

Remote Procedure Call (RPC) protocols support client-server communication. Clients most often interact with the NameNode and the JobTracker, which is why only ports 8020 and 8021 are exposed for RPC.

What is the HTTP Protocol?

HTTP is the protocol for transferring files over the World Wide Web. It handles communication between browsers and servers. Because each daemon exposes its status and data to the browser this way, every node has an HTTP port number.

Why are the DataNode & TaskTracker on the same machine?

To process a task, the TaskTracker most often communicates with the DataNode. If the DataNode and TaskTracker are on different nodes or far apart, processing takes a long time and is exposed to network failures. To ease the process, both run on the same machine.

What are setup() and cleanup()?

These MapReduce methods run at the start and end of each split.

setup() initializes the resources.

map() and reduce() process the data.

cleanup() closes the resources.

Comparison is also triggered. Setup and cleanup are triggered in both the mapper and the reducer.

Map and reduce give record-level control, while these two give block-level control. File-level control is also available through the InputFormat.

What is Distributed Cache?

When map or reduce tasks need access to common data or old data, or when the application depends on existing applications, the Hadoop framework uses this feature to boost efficiency. It is configured in the JobConf, and the cached files are distributed to multiple servers.

What is a Counter in MapReduce?

Counters provide a way to measure the progress of, or the number of operations that occur within, a MapReduce program. Counters do not affect the logic of a MapReduce program, but Big Data analysts and Hadoop developers use them for analytical purposes.

Why are the NameNode, JobTracker & Secondary NameNode on different machines?

If the NameNode fails, the Secondary NameNode holds a backup of the metadata. If the NN and SNN are on the same machine, both can fail at the same time, so it is a good idea to place them on separate systems.
The JobTracker takes a huge amount of RAM to process the data. If the NN and JT run on the same machine, they can slow each other down, so it is a good idea to place the JT on a separate system. Both the NN and the JT process a huge amount of data.

What are the drawbacks of Hadoop 1.x?

  • Single point of failure.
  • Scalable to a maximum of 4000 nodes.
  • High latency by default.
  • Poor handling of lots of small files.
  • Limited number of jobs.
  • Append & updates not possible.
  • OS dependent.
    Most of these problems are resolved in Hadoop 2.x.

What is low latency?

When a process completes quickly, it is said to have low latency. An RDBMS has low latency because it works on small amounts of data, whereas Hadoop has high latency by default.

Can you elaborate on the FSImage & edit log?

The edit log is a transaction log file that persistently records every change that occurs in the file system. HDFS metadata changes are persisted to the edit log. The file system stores the complete metadata in a file called FSImage. A new FSImage overwrites all previous data. A copy of the FSImage is kept on the Secondary NameNode.

What is a Checkpoint?

A checkpoint is a process that takes the FSImage and the edit log and compacts them into a new FSImage. It is critical for efficient NameNode recovery and restart, and for knowing the cluster health.
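The compaction step can be sketched as replaying logged operations onto a snapshot. This is a simplified model, not the HDFS on-disk format: the dictionary image, the two operation names and the sample paths are assumptions.

```python
def checkpoint(fsimage, editlog):
    """Replay the edit log on top of a copy of the image and return the new image."""
    new_image = dict(fsimage)
    for op, path, meta in editlog:
        if op == "create":
            new_image[path] = meta
        elif op == "delete":
            new_image.pop(path, None)
    return new_image


fsimage = {"/data/a.txt": {"replication": 3}}
editlog = [
    ("create", "/data/b.txt", {"replication": 3}),
    ("delete", "/data/a.txt", None),
]
fsimage = checkpoint(fsimage, editlog)
editlog = []  # after a checkpoint the log can start empty again
```

Because the new image already contains every logged change, a restart only has to load the image plus whatever short log accumulated since the last checkpoint.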

What does a Hadoop daemon do?

A set of programs that run in the background until the process is finished are called daemons.
Most often these daemons run in separate Java processes (JVM instances).

Is Java mandatory to write map and reduce programs?
No. The Hadoop framework has a special utility called Streaming that allows MapReduce programs to be written in Perl, Python, Ruby and other programming languages. Java is mandatory for customizing MapReduce itself, because Hadoop is written in Java.
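As an illustration of the Streaming style, here is a hedged Python sketch of a word-count mapper and reducer. Streaming programs exchange tab-separated key/value lines over standard input and output; in this sketch the two stages are plain functions over lines so they can be tried locally, and the sort call stands in for the framework's shuffle. The function names and sample input are assumptions.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated "word<TAB>1" line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation: sorted() stands in for the shuffle-and-sort phase.
mapped = sorted(mapper(["to be or", "not to be"]))
counts = list(reducer(mapped))
```

In an actual job, each function would be wrapped in a small script that reads from sys.stdin and prints its output lines.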
Do the Mapper & Reducer work together?

Mappers work in parallel and independently on the data in HDFS.
Reducers work sequentially and have a relationship with other reducers.
Intermediate data is stored in the local file system. The reducer process starts only after all mappers have completed, so there is no direct relationship between the two.

What is the importance of the Writable interface?

Writable is an interface that allows data to be serialized and deserialized, based on DataInput and DataOutput. Serialization and deserialization are mandatory for transferring objects over the network.
Hadoop provides different classes that implement the Writable interface, such as Text, IntWritable, LongWritable, FloatWritable, BooleanWritable and more. All these classes are in the org.apache.hadoop.io package.
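The serialize/deserialize round trip can be sketched in Python. The class below is a hypothetical stand-in for a Writable that holds one 32-bit integer; it is not the Hadoop class, and the method names only echo the interface's write and readFields methods.

```python
import io


class IntWritableSketch:
    """Hypothetical Python stand-in for a Writable holding one 32-bit integer."""

    def __init__(self, value=0):
        self.value = value

    def write(self, out):
        # Serialize: 4 bytes, big-endian, signed.
        out.write(self.value.to_bytes(4, "big", signed=True))

    def read_fields(self, inp):
        # Deserialize: rebuild the value from the same 4 bytes.
        self.value = int.from_bytes(inp.read(4), "big", signed=True)


buffer = io.BytesIO()
IntWritableSketch(42).write(buffer)
raw = buffer.getvalue()

copy = IntWritableSketch()
copy.read_fields(io.BytesIO(raw))
```

The byte string in raw is what would travel over the network; the receiving side only needs the same class to turn it back into an object.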

What is a Combiner?

A combiner is a function used to optimize a MapReduce job. It runs on the output of the map phase. The output of the combiner is intermediate data that becomes the input of the reducer.
The output of the reducer is written to disk. The aggregation of all map outputs is done by the reducer at the block level.
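A minimal sketch of the idea, using word count as an assumed example: the combiner sums values per key on the map side, so fewer pairs have to be sent to the reducer. The function names and the sample line are illustrative only.

```python
from collections import Counter


def map_words(line):
    """Map phase: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Locally sum the values per key before they leave the map side."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


map_output = map_words("a b a a b")
combined = combine(map_output)
```

Here five map output pairs shrink to two before the shuffle. This only works because summing is associative and commutative, so running the combiner zero, one or several times does not change the final result.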

What is a partitioner?

The partitioner runs after the combiner and before the reducer. It divides the data according to the number of reducers. Wherever there is a reducer, there is a partitioner. The mapper is controlled by the split. We do not have direct access to the partition, but the split can be controlled.
Number of partitions = number of reducers.
The hash partitioner is the default partitioner.

What is the hash partitioner?

MapReduce uses HashPartitioner as its partitioner class by default. The hash partitioner ensures that all records with the same map output key go to the same reducer.
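The rule behind hash partitioning is a non-negative hash of the key taken modulo the number of reducers. The Python sketch below shows that rule under one stated assumption: CRC32 is used as a stand-in hash, since Java's hashCode() is not available here, so the actual reducer numbers will differ from a real cluster.

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer index from a non-negative hash of the key."""
    key_hash = zlib.crc32(key.encode("utf-8"))  # assumption: CRC32 replaces hashCode()
    return (key_hash & 0x7FFFFFFF) % num_reducers


keys = ["venu", "anu", "venu", "sita", "anu"]
assignments = [(key, partition(key, 3)) for key in keys]
```

Whatever hash is used, the property that matters is that equal keys always produce the same index, so all of their values meet in one reducer.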

What is normalization?

Normalization is a database design technique that logically divides a database into two or more tables and defines the relationships between those tables.

What is the difference between horizontal and vertical scaling?

Horizontal scaling: scaling by adding more machines, systems or nodes to the pool of resources. It is easy to scale dynamically by adding more machines to the existing pool.
E.g.: MongoDB, Cassandra
Vertical scaling: adding more power (RAM, CPU) to an existing machine or node, so that it handles more data through multiple cores.
E.g.: MySQL

What is structured data & unstructured data?

Data that has a defined data type model and fits easily within a record is called structured data.
E.g.: text, HTML tags.
Data that has no defined data type model and is difficult to fit within a record is called unstructured data.
E.g.: images, graphics

What is safe mode for the NameNode?

On start-up the NameNode temporarily enters a special state called safe mode. The DataNodes send heartbeats and block report messages to the NameNode. Once a configurable percentage of data blocks is reported as replicated, the NameNode automatically exits safe mode.

What are SSH & HTTPS?

SSH (Secure Shell) is a protocol used for secure access to a remote host.
HTTPS runs standard HTTP communication on top of SSL.

What is SSH? Why do we use it in Hadoop?

SSH (Secure Shell) is central to communication between the client, the NameNode and the DataNodes. It normally requires a username/password authentication scheme for secure access to a remote host, but Hadoop needs a passwordless secure connection.

What are daemons in Hadoop?

A framework process that runs in the background is called a daemon. There are 5 daemons:
  • NameNode
  • DataNode
  • Secondary NameNode
  • JobTracker
  • TaskTracker
    Each daemon runs separately in its own JVM.

What is speculative execution?

When speculative execution is enabled, the JobTracker assigns the same task to multiple nodes and takes the result from whichever node finishes first; the remaining task instances are discarded.

Is it true that number of blocks = number of jobs?
No. By default, number of blocks = number of mappers.

Number of splits = number of maps (always).

Once data is stored, its block size cannot be changed. The split size can be changed, however, because a split is a logical operation. So depending on the project, the split size configuration can be adjusted.
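The relationship can be checked with simple arithmetic. The sketch below assumes a 1 GB file, a 64 MB block size and a hypothetical 128 MB split size; it is a simplification that only divides and rounds up, ignoring any finer rules the framework applies when computing splits.

```python
import math


def count_pieces(file_size, piece_size):
    """Number of pieces needed to cover the file, rounding the last one up."""
    return math.ceil(file_size / piece_size)


MB = 1024 * 1024
file_size = 1024 * MB                        # a 1 GB file (illustrative)
blocks = count_pieces(file_size, 64 * MB)    # physical blocks, 64 MB block size assumed
splits = count_pieces(file_size, 128 * MB)   # logical splits, hypothetical 128 MB split size
```

With the split size equal to the block size there would be 16 mappers; doubling the split size halves that to 8 without touching the stored blocks.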
Is there any relation between mapper outputs?
No, mapper outputs are independent. There is no relation between them.
Why do we use Pipes in Hadoop?
Hadoop Pipes is a package that allows MapReduce programs for Hadoop to be written in C++. The package translates between the C++ code and a format Hadoop understands.
What is distcp?

Distributed Copy is a tool for copying large amounts of data. It copies the data across multiple clusters in parallel.

What is rack awareness?

To minimize network traffic between data nodes in different racks, the NameNode decides where to place the blocks based on rack awareness.

Why is the combiner important?

A combiner is a function used to optimize a MapReduce job. It works as a map-side reducer, but the MapReduce job should not depend on the combiner.

What are the types of schedulers?

FIFO: the default scheduler. It schedules jobs in first-in, first-out order.
FAIR: assigns priority dynamically.
CAPACITY: assigns a percentage of capacity to process a job. Highly recommendable in 2.x.

What types of compression are available in Hadoop?

  • None
  • Record
  • Block – Highly recommendable
Compression codecs:
  • Default codec
  • Gzip (.gz)
  • Bzip2 (.bz2)
  • Snappy (.snappy) – Highly recommendable
  • LZO (.lzo)

What is the importance of serialization in MapReduce?

In Hadoop, data is stored and transferred only as binary streams. Serialization is the process of converting structured objects into a byte stream. RPC uses serialization to convert objects into a byte stream.

What is deserialization and how does it work?

The RPC protocol uses serialization to convert the data on the source node into a binary stream. The framework transfers this stream to the remote destination node.
The destination node uses deserialization to convert the binary stream back into structured objects.

What is an inverted index?

An inverted index is a simple hash table that maps words to the set of documents that contain them. Search engines use an inverted index to process user-submitted queries.
Doc 1: Venu, brms, Madhavi, anjali, anu, Jyothi, Koti
Doc 2: Venu, anu, brms, Sita, jyothi
Doc 3: Venu, Jyothi
Inverted index (ignoring case):
  • Venu: Doc 1, Doc 2, Doc 3
  • Jyothi: Doc 1, Doc 2, Doc 3
  • brms: Doc 1, Doc 2
  • anu: Doc 1, Doc 2
  • anjali: Doc 1
  • Madhavi: Doc 1
  • Koti: Doc 1
  • Sita: Doc 2
What is Data Locality?
Hadoop believes that “moving the logic to the data is cheaper than moving the data”. It transfers the computation instead of the data, which means the logic executes where the data is stored. By default this value is true. It is much more difficult for unstructured data.
How does TextInputFormat read the data?
By default Hadoop MapReduce uses “Text” as the input and output format. The framework treats each line as one record. The key is the byte offset of the line within the file, and the value is the content of the whole line. This key and value are processed by a mapper. The mapper receives the key as a “LongWritable” parameter and the value as a Text parameter.
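A small Python sketch of how line records and their keys line up, assuming the key is the byte offset at which each line starts. The function name and sample bytes are illustrative, not part of Hadoop.

```python
def text_records(data: bytes):
    """Yield (byte_offset, line) pairs, the way a line-oriented reader would."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)  # the next key is where the next line begins


records = list(text_records(b"hello world\nhadoop\nmapreduce\n"))
```

The first line starts at offset 0 and is 12 bytes long including its newline, so the second record's key is 12 and the third is 19.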
What is the importance of processing data in parallel from multiple disks?
Hard drive storage capacity grows massively every year. Storing the data on multiple drives is not a problem, but reading all of it takes a long time.
Storing and processing the data in parallel eases many of these problems. Hadoop is a framework designed to store and process data in parallel.
What are the problems with parallel writing and reading?
Hardware failure: power failure, network failure and server crashes are the main problems.
Combining the data correctly and in order for processing is difficult.
How does Hadoop resolve the parallel read/write process?
HDFS stores the data reliably through replication. It keeps the data on multiple systems and allows it to be processed in parallel.
MapReduce reads the data in parallel and writes sequentially.
Where is HDFS not suitable?
HDFS is not suitable for applications that require low-latency data access.
Lots of small files increase the amount of metadata, so this is not recommendable.
HDFS does not support multiple writers or arbitrary file modifications, so Hadoop is not suitable for such applications.

  • Hardware failure is common in parallel distributions. Hadoop eases this problem by replicating the data.
  • Most applications access the data as a stream, in batch processing.
  • It scales easily to large data sets, provides high throughput and minimizes network wastage.
  • It has a simple coherency model: write-once, read-many access to files.
  • Portability across different platforms is another plus point of HDFS. It can easily be adopted by any type of application.
  • It runs on commodity hardware and stores data at very low cost.

Can you explain the HDFS architecture?

  • HDFS has a master/slave architecture.
  • A single NameNode acts as the master and multiple DataNodes act as slaves.
  • Internally, an input file is split into multiple chunks (blocks) of data, and these chunks are stored on multiple DataNodes.
  • Because the chunks are stored across the cluster, they can be read and written in parallel.

What are the DataNode’s responsibilities?

  • The DataNodes are responsible for serving read and write requests from the file system client.
  • Based on the NameNode’s instructions, the DataNodes also perform block creation, deletion, and replication.
  • Every three seconds the DataNode sends a heartbeat to the NameNode, and with every 10th heartbeat it sends a block report.