
Hadoop Interview Questions

Why use Hadoop?
Hadoop can handle any type of data, in any quantity, and leverages commodity hardware to keep costs low.
Structured or unstructured, with or without a schema, high volume or low: whatever the data is, you can store it reliably.
What is Big Data?
Traditional databases struggle to process many different types of data and very large volumes of data. Big data is a strategy for processing large and complex data sets that traditional databases cannot handle.

Today every organization generates massive volumes of both structured and unstructured data, which is difficult to store and process computationally. Big data addresses this problem and is commonly characterized by the four V's:

  • Volume – size of the data
  • Velocity – speed at which the data arrives
  • Variety – structured and unstructured data
  • Veracity – uncertain, imprecise data

What is Hadoop?
Hadoop is an open source project from the Apache Software Foundation that enables distributed storage and processing of large data sets across clusters of commodity hardware.
What is File System?
A file system is the structure an operating system uses to keep and organize data as files on disk. Every file system lets users and groups be granted read, write, execute and delete privileges.

What is FUSE filesystem?
FUSE (Filesystem in Userspace) lets a file system run in user space. HDFS is a user-space file system, not a POSIX file system: Hadoop does not fully satisfy POSIX rules, but a FUSE mount can expose HDFS as if it were an ordinary local file system.

What is DFS?

A Distributed File System (DFS) is a client/server based application that stores data across several servers/systems in parallel, according to the server architecture.

What is No SQL?

NoSQL is an acronym for "Not Only SQL". It eases many RDBMS problems by storing and accessing data across multiple servers. It is highly recommended for standalone projects and huge unstructured data sets.

What is the difference between real-time and batch processing?

Batch processing:

A series of programs (jobs) executes on a computer without any manual interaction. Hadoop uses batch processing by default.

Real-time processing:

Jobs execute continuously and process data as soon as it arrives. Several Hadoop ecosystem tools allow real-time (or near real-time) processing.

What is meta-Data?

Data about data is called metadata. The NameNode stores the metadata but does not index the data: it knows about the files and blocks, not about the content inside them.

What is NFS?

Network File System (NFS) is a client/server application, developed by Sun Microsystems, that allows resources to be shared between servers on a computer network. Hadoop 2.x can use NFS to store a copy of the NameNode metadata on another system.

What is Hadoop Ecosystem?

It is the community of tools/applications that connect with Hadoop. Pig, Hive, HBase, Sqoop, Oozie and Flume are common Hadoop ecosystem applications.

What are RAID disks?

Redundant Array of Inexpensive/Independent Disks (RAID) stores the same data in different places. It is highly recommended for the NameNode, which stores the metadata.

What is Replication?

By default Hadoop automatically stores copies of the actual data on different systems, often on another rack or in another data center. This backup process is called replication. The default replication factor of 3 can be changed; depending on the requirements, the number of replicas per block can vary between 1 and 512.
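
As a rough illustration, the replication factor can be set in the client configuration or changed for an existing file through the FileSystem API; a minimal sketch, assuming a Hadoop 2.x client on the classpath and a hypothetical path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("dfs.replication", "2");   // default replication for files written by this client
            FileSystem fs = FileSystem.get(conf);
            // Change the replication factor of an existing file (the path is hypothetical).
            fs.setReplication(new Path("/data/input/sample.txt"), (short) 2);
            fs.close();
        }
    }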

Why Hadoop doesn’t support Updates and append?

By default Hadoop is designed for write-once, read-many access. Hadoop 2.x supports the append operation, but Hadoop 1.x does not.

What is the use of RecordReader in Hadoop?
An InputSplit is assigned to a map task but does not know how to access its data. The RecordReader class is responsible for loading the data from its source and converting it into key/value pairs suitable for reading by the Mapper. The RecordReader instance to use is defined by the InputFormat.
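
For illustration, a custom InputFormat mainly has to say which RecordReader turns a split into key/value pairs; a minimal sketch reusing Hadoop's LineRecordReader:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    // The InputFormat chooses the RecordReader that converts a split into key/value pairs.
    public class MyTextInputFormat extends FileInputFormat<LongWritable, Text> {
        @Override
        public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                                   TaskAttemptContext context)
                throws IOException, InterruptedException {
            return new LineRecordReader();   // byte offset as key, whole line as value
        }
    }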

Elaborate the Hadoop processes.

NameNode: the arbitrator and repository for all HDFS metadata.
Secondary NameNode: periodically checkpoints the metadata (by default roughly every hour).
DataNode: stores the actual data in the form of blocks.
JobTracker: schedules MapReduce tasks to specific nodes in the cluster and tracks their progress.
TaskTracker: follows the JobTracker's instructions and performs the map, reduce and shuffle operations.

Elaborate the important RPC & HTTP ports.

RPC ports:
  • 8020 – NameNode
  • 8021 – JobTracker
HTTP ports:
  • 50070 – NameNode
  • 50075 – DataNode
  • 50090 – Secondary NameNode
  • 50030 – JobTracker
  • 50060 – TaskTracker

What is RPC Protocol?

Remote Procedure Call (RPC) protocols support client/server communication. Clients most often interact with the NameNode and the JobTracker, so only ports 8020 and 8021 are exposed over RPC.

What is the HTTP protocol?

HTTP is the protocol for transferring files over the World Wide Web; it is how browsers and servers communicate. Hadoop's web UIs and status pages are served over HTTP, so every daemon has its own HTTP port.

Why are the DataNode and TaskTracker on the same machine?

To process a task, the TaskTracker most often needs to read data from the DataNode. If the DataNode and TaskTracker were on different nodes, or far apart, processing would take longer and be more exposed to network failures. To ease the process, both run on the same machine.

What are setup() and cleanup()?

These MapReduce methods run at the start and end of each task (that is, once per split).

setup() initializes the resources.

map() and reduce() process the data.

cleanup() releases the resources.

setup() and cleanup() are triggered in both the mapper and the reducer.

map() and reduce() give record-level control, while setup() and cleanup() give block-level (per-task) control; the InputFormat additionally allows file-level control.
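
A minimal mapper sketch showing where setup() and cleanup() sit around map(); the class name is made up for illustration:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SetupCleanupMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final IntWritable one = new IntWritable(1);

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Runs once per task, before the first map() call: open connections, load lookups, etc.
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Runs once per record.
            for (String word : value.toString().split("\\s+")) {
                context.write(new Text(word), one);
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Runs once per task, after the last map() call: close resources here.
        }
    }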

What is Distributed Cache?

When a map or reduce task needs access to common data, legacy data, or files that the application depends on, the Hadoop framework uses this feature to boost efficiency: the files are copied to every worker node before the job runs. In the old API it is configured through the JobConf, and it spans multiple servers.
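
A minimal driver-side sketch of shipping a lookup file to every task with the Hadoop 2.x Job API; the HDFS path below is hypothetical:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class CacheFileDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "lookup join");
            // Ship a small read-only lookup file to every task node before the job starts
            // (the path is hypothetical).
            job.addCacheFile(new URI("hdfs:///data/lookup/countries.txt"));
            // ... set mapper/reducer and input/output paths as usual ...
            // Inside a task, the cached files are available via context.getCacheFiles().
        }
    }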

What is Counter in MapReduce?

Counters provide a way to measure the progress of, or count the operations that occur within, a MapReduce program. Counters do not change the outcome of a MapReduce program, but for analytical purposes almost every big data analyst and Hadoop developer uses them.
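
A sketch of a user-defined counter incremented from within a mapper; the enum and class names are illustrative only:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CountingMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        // A user-defined counter group; the names are just for illustration.
        public enum Records { GOOD, MALFORMED }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (value.toString().split(",").length < 3) {
                context.getCounter(Records.MALFORMED).increment(1);  // shows up in the job counters
                return;
            }
            context.getCounter(Records.GOOD).increment(1);
            context.write(value, NullWritable.get());
        }
    }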

Why are the NameNode, JobTracker and Secondary NameNode on different machines?

If the NameNode fails, the Secondary NameNode holds a recent checkpoint of the metadata; but if the NN and SNN sit on the same machine, both can fail at the same time, so it is a good idea to place them on separate systems.
The JobTracker uses a huge amount of RAM to process job data. If the NN and JT perform their operations on the same machine they slow each other down, since both handle large amounts of data, so it is also a good idea to place the JT on a separate system.

What are the drawbacks of Hadoop 1.x?

  • Single point of failure (the NameNode).
  • Scales to a maximum of about 4,000 nodes.
  • By default Hadoop has high latency.
  • Lots of small files overload the NameNode.
  • Limited to MapReduce jobs.
  • Appends and updates are not possible.
  • OS dependent.
Most of these problems are resolved in Hadoop 2.x.

What is Low latency?

A process that completes quickly is said to have low latency. An RDBMS has low latency because it handles relatively little data, whereas Hadoop has high latency by default.

Elaborate on the FSImage & edit log.

The edit log is a transaction log file that persistently records every change that occurs to the file system metadata; HDFS metadata changes are persisted to the edit log. The entire file system namespace is stored in a file called the FSImage. During a checkpoint the edits are merged into a new FSImage, which replaces the previous one; this merge is performed on the Secondary NameNode.

What is Checkpoint?

A checkpoint is the process that merges the edit log into the FSImage, compacting them into a new FSImage. It is critical for efficient NameNode recovery and restart, and also indicates cluster health.

What does Hadoop daemon do?

A daemon is a program (job) that runs in the background until its work is finished.
In Hadoop, these daemons most often run as separate Java processes (JVM instances).

Is Java mandatory to write map and reduce programs?
No. The Hadoop framework has a utility called Streaming that allows map and reduce programs to be written in Perl, Python, Ruby and other languages. However, to customize MapReduce itself, Java is mandatory, mainly because Hadoop is written in Java.
Do the Mapper and Reducer work together?

Mappers work in parallel and independently on their input splits in HDFS.
Reducers work on the grouped, sorted map output, each handling its own partition.
Intermediate data is stored on the local file system, and the reduce phase starts only after all mappers have completed, so mappers and reducers never run together.

What is the importance of Writable interface?

Writable is an interface that allows data to be serialized and deserialized, based on DataInput and DataOutput. Serialization and deserialization are mandatory for transferring objects over the network.
Hadoop provides many classes that implement the Writable interface, such as Text, IntWritable, LongWritable, FloatWritable, BooleanWritable and more. All of these classes live in the org.apache.hadoop.io package.
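
A minimal custom Writable sketch; the class and field names (PageView, timestamp, durationSeconds) are made up for illustration:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // A custom value type: write() serializes the fields, readFields() deserializes them.
    public class PageView implements Writable {
        private long timestamp;
        private int durationSeconds;

        @Override
        public void write(DataOutput out) throws IOException {    // serialization
            out.writeLong(timestamp);
            out.writeInt(durationSeconds);
        }

        @Override
        public void readFields(DataInput in) throws IOException { // deserialization
            timestamp = in.readLong();
            durationSeconds = in.readInt();
        }
    }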

What is Combiner?

A combiner is a function used to optimize a MapReduce job. It runs on the output of the map phase; the combiner's output becomes the intermediate data that is fed to the reducer, and the reducer's output is written to disk. The combiner performs a map-side, block-level aggregation of the map output before it reaches the reducer.
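
A word-count driver sketch that plugs a combiner in with job.setCombinerClass(), reusing Hadoop's built-in TokenCounterMapper and IntSumReducer so the same sum logic serves as both combiner and reducer:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

    public class WordCountWithCombiner {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count with combiner");
            job.setJarByClass(WordCountWithCombiner.class);
            job.setMapperClass(TokenCounterMapper.class);  // emits (word, 1)
            job.setCombinerClass(IntSumReducer.class);     // map-side partial sums
            job.setReducerClass(IntSumReducer.class);      // final sums
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }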

What is partitioner?

The partitioner runs after the combiner and before the reducer: it divides the map output according to the number of reducers, so each record's partition determines which reducer receives it. Whenever there is a reducer, there is a partitioner. The mapper itself is controlled by the input split; we cannot access the partitioning from the mapper directly, but a custom partitioner gives that control.
Number of partitions = number of reducers.
HashPartitioner is the default partitioner.
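
A sketch of a custom partitioner; the routing rule (first letter of the key) is invented purely for illustration:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Route records to reducers by the first letter of the key (illustrative rule only).
    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            if (numReduceTasks == 0 || key.getLength() == 0) {
                return 0;
            }
            char first = Character.toLowerCase(key.toString().charAt(0));
            return first % numReduceTasks;
        }
    }
    // Enable it in the driver with: job.setPartitionerClass(FirstLetterPartitioner.class);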

What is hash partitioner?

MapReduce uses HashPartitioner as its default partitioner class. The hash partitioner ensures that all records with the same map output key go to the same reducer.
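
Conceptually, the default partitioner derives the partition from the key's hash code; a minimal sketch of that logic:

    import org.apache.hadoop.mapreduce.Partitioner;

    // The essence of hash partitioning: equal keys hash to the same partition,
    // so they always reach the same reducer.
    public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }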

What is normalization?

Normalization is a database design technique that logically divides a database into two or more tables and defines the relationships between them.

What is the difference between horizontal and vertical scaling?

Horizontal scaling: scale by adding more machines (nodes) to the pool of resources. It is easy to scale dynamically by adding machines to the existing pool.
E.g. MongoDB, Cassandra.
Vertical scaling: add more power (RAM, CPU) to an existing machine or node, so it can handle more data through more cores.
E.g. MySQL.

What is Structured data & unstructured data?

Data that fits a defined data-type model and can easily be held within a fixed record is called structured data.
E.g. text, HTML tags.
Data that cannot be fitted to a data-type model and is difficult to hold within a fixed record is called unstructured data.
E.g. images, graphics.

What is safe mode for NameNode?

On start-up the NameNode temporarily enters a special state called safe mode. DataNodes send heartbeats and block report messages to the NameNode; once a configurable percentage of data blocks are reported as safely replicated, the NameNode automatically exits the safe mode state.

What are SSH & HTTPS?

SSH (Secure Shell) is a secure protocol used for secure access to a remote host.
HTTPS runs on top of SSL/TLS and is used for secure standard HTTP communication.

What is SSH? Why we used in Hadoop?

SSH (Secure Shell) is the secure channel used for communication between the client, the NameNode and the DataNodes (for example when the cluster start/stop scripts log in to every node). SSH normally requires a username/password authentication scheme for access to a remote host, but Hadoop needs passwordless (key-based) SSH connections.

What are Daemons in Hadoop?

A framework process that runs in the background is called a daemon. There are 5 daemons:
  • NameNode
  • DataNode
  • Secondary NameNode
  • JobTracker
  • TaskTracker
Each daemon runs separately in its own JVM.

What is Speculative execution?

When speculative execution is enabled, the JobTracker assigns the same task to multiple nodes and takes the result from whichever node finishes the task first; the remaining task instances are discarded.
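
A sketch of turning speculative execution off through the job configuration, assuming the Hadoop 2.x property names:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpeculationDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Speculative execution is on by default; these properties disable it.
            conf.setBoolean("mapreduce.map.speculative", false);
            conf.setBoolean("mapreduce.reduce.speculative", false);
            Job job = Job.getInstance(conf, "no speculation");
            // ... configure mapper/reducer and paths, then submit ...
        }
    }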

Is "number of blocks = number of jobs" true?
No. By default, the number of blocks equals the number of mappers.

Number of splits = number of map tasks (always).

Once data has been stored, its block size cannot be changed again, but the split is a logical unit, so depending on the project the split size configuration can be changed.
Is there any relation between mapper outputs?
No. Mapper outputs are independent; there is no relation between them.
Why do we use Pipes in Hadoop?
Hadoop Pipes is a package that allows MapReduce programs to be written in C++. The package bridges the C++ code and the Hadoop framework so the C++ tasks can run as MapReduce jobs.
What is distcp?

DistCp (distributed copy) is a tool used for copying large amounts of data. It copies data within or across clusters in parallel using a MapReduce job, e.g. hadoop distcp <source> <destination>.

What is rack awareness?

To minimize network traffic between DataNodes, the NameNode places block replicas in a sensible order across racks based on rack awareness, i.e. its knowledge of which node sits in which rack.

Why is the combiner important?

A combiner is a function used to optimize a MapReduce job. It works as a map-side reducer, but a MapReduce job must not depend on the combiner, since the framework may run it zero, one or several times.

What are the types of schedulers?

FIFO: the default scheduler; it schedules jobs in first-in, first-out order.
FAIR: assigns priorities dynamically so jobs get a fair share of resources.
CAPACITY: gives each queue a percentage of cluster capacity to process its jobs. Highly recommended in 2.x.

What types of compression are available in Hadoop?

SequenceFile compression types:
  • None
  • Record
  • Block – highly recommended
Compression codecs:
  • Default codec
  • Gzip (.gz)
  • Bzip2 (.bz2)
  • Snappy (.snappy) – highly recommended
  • LZO (.lzo)
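
A driver-side sketch of enabling block-level Snappy compression on the job output, assuming SequenceFile output and native Snappy support on the cluster:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class CompressionDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "compressed output");
            // Compress the job output with Snappy, using block-level compression
            // for SequenceFile output.
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
            SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
            // ... set mapper/reducer and input/output paths, then submit ...
        }
    }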

What is the importance of serialization in MapReduce?

In Hadoop, data is stored and moved only as a binary stream. Serialization is the process of converting structured objects into a byte stream; RPC uses serialization to put objects onto the wire.

What is deserialization and how does it work?

The RPC protocol uses serialization to convert the data on the source node into a binary stream, and the framework transfers this stream to the remote destination node.
The destination node then uses deserialization to convert the binary stream back into structured objects.

What is Inverted Index?

An inverted index is a simple hash table that maps words to the sets of documents that contain them. Search engines use an inverted index to process user-submitted queries.
Eg:
Doc1: Venu, brms, Madhavi, anjali, anu, Jyothi, Koti
Doc2: Venu, anu, brms, Sita, jyothi
Doc3: Venu, Jyothi
Inverted index (case-insensitive):
  • venu -> Doc1, Doc2, Doc3
  • jyothi -> Doc1, Doc2, Doc3
  • brms -> Doc1, Doc2
  • anu -> Doc1, Doc2
  • madhavi -> Doc1
  • anjali -> Doc1
  • koti -> Doc1
  • sita -> Doc2
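
A MapReduce sketch of building such an inverted index, using the input file name as the document id; class names are illustrative:

    import java.io.IOException;
    import java.util.LinkedHashSet;
    import java.util.Set;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class InvertedIndex {

        // Map: (offset, line) -> (word, document name)
        public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String doc = ((FileSplit) context.getInputSplit()).getPath().getName();
                for (String word : value.toString().toLowerCase().split("\\W+")) {
                    if (!word.isEmpty()) {
                        context.write(new Text(word), new Text(doc));
                    }
                }
            }
        }

        // Reduce: (word, [doc, doc, ...]) -> (word, "doc1, doc2, ...")
        public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text word, Iterable<Text> docs, Context context)
                    throws IOException, InterruptedException {
                Set<String> seen = new LinkedHashSet<>();   // keep each document only once
                for (Text d : docs) {
                    seen.add(d.toString());
                }
                context.write(word, new Text(String.join(", ", seen)));
            }
        }
    }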

What is Data Locality?
Hadoop believes that "moving the computation to the data is cheaper than moving the data to the computation". Instead of transferring the data, the processing logic is shipped to, and executed on, the node where the data block is stored. By default this behaviour is enabled, although it is harder to benefit from with unstructured data.
How does TextInputFormat read the data?
By default Hadoop MapReduce uses text as the input and output format. The framework treats each line as a record: the key is the byte offset of the line within the file, and the value is the whole line. The mapper processes this key/value pair, taking the key as a LongWritable parameter and the value as a Text parameter.
What is the importance of processing data from multiple disks in parallel?
Year after year hard drives store ever larger amounts of data. Storing the data on multiple drives is not a problem, but reading all of it back takes a long time.
Storing the data across many disks and processing it in parallel eases this problem, and Hadoop is the framework used to store and process data in parallel this way.
What are the problems with parallel writing and reading?
Hardware failure: power failures, network failures and server crashes are the main problems.
Combining the data correctly and in the right order for processing is also quite difficult.
How does Hadoop resolve the parallel read/write problem?
HDFS stores the data reliably through replication: it keeps the data on multiple systems and allows it to be processed in parallel.
MapReduce reads the data in parallel and writes the output sequentially.
Where is HDFS not suitable?
Applications that require low-latency data access are not suitable.
Lots of small files increase the metadata, so that is not recommended.
HDFS does not support multiple writers or arbitrary file modifications, so Hadoop is not suitable for such applications.
Why HDFS?

  • Hardware failure is common in distributed deployments. Hadoop eases this problem by replicating the data.
  • Most HDFS applications access streaming data and are batch-oriented.
  • It easily scales to large data sets, provides high throughput and minimizes network waste.
  • Simple coherency model: write-once, read-many access for files.
  • Portability across different platforms is another plus point of HDFS; many types of application can adopt it easily.
  • It runs on commodity hardware, so data can be stored very cheaply.

Can you explain the HDFS architecture?

  • HDFS has a master/slave architecture.
  • A single NameNode acts as the master and multiple DataNodes act as slaves.
  • Internally, an input file is split into multiple chunks (blocks) of data, and these blocks are stored on multiple DataNodes.
  • Because the blocks are spread across the cluster, they can be read and written in parallel.

What are the DataNode responsibilities?

  • DataNodes are responsible for serving read and write requests from the file system clients.
  • On the NameNode's instructions, DataNodes also perform block creation, deletion and replication operations.
  • Every three seconds each DataNode sends a heartbeat to the NameNode, and it periodically sends a block report (often described as every 10th heartbeat) listing the blocks it stores.