Cloudera certification is a goal for many Hadoop developers and administrators, and it is neither too easy nor too hard. These interview questions can help you earn a Cloudera certification on the first attempt. They give an overview of Hadoop core concepts and the ecosystem. Note that the Cloudera exam uses a multiple-choice format; candidates with in-depth knowledge of HDFS and MapReduce are the ones who pass.
Recognize and identify Apache Hadoop daemons and how they function in both data storage and processing.
A daemon is an independent program that runs as a background process. Apache Hadoop comprises five independent daemons, each running in its own JVM. The daemons run on a single machine in pseudo-distributed mode and across machines in fully distributed (cluster) mode; standalone mode runs no daemons at all.
There are two types of nodes in a Hadoop cluster: NameNode, Secondary NameNode, and JobTracker are master daemons; DataNode and TaskTracker are slave daemons.
NameNode: this daemon holds the namespace (metadata) for HDFS. It keeps the metadata in RAM and persists it to local disk.
Secondary NameNode: this daemon keeps a copy of the NameNode's metadata, but it is not a replacement or alternative to the NameNode. It stores its data on disk and, by default every hour, takes a checkpoint of the NameNode's state.
JobTracker: this daemon schedules jobs and manages the cluster's compute resources.
The slave nodes depend on the master nodes, and the master nodes are single points of failure.
TaskTracker: this daemon receives instructions from the JobTracker, executes MapReduce tasks, and reports status back to the JobTracker.
DataNode: this daemon stores file data in the form of blocks after HDFS splits large files, and it sends heartbeats and block reports to the NameNode.
Understand how Apache Hadoop exploits data locality.
The Hadoop framework always tries to minimize network traffic and maximize system throughput. Moving data over the network during processing is an expensive operation, so by default Hadoop moves the computation to where the data is located. HDFS provides interfaces for applications to ship their processing logic to the data; this is called data locality. Hadoop is also fault tolerant: a few warnings or failures do not stop the whole job. Finally, HDFS does not allow updates; once a file is created, written, and closed, its data cannot be altered.
Why do you set up a cluster in Hadoop?
It's practically mandatory. If data sets are deployed across separate systems instead of a cluster, authentication is required every time. In a cluster, the admin can grant authentication privileges once to access the different data sets. So forming a cluster is required in pseudo-distributed or production environments.
Identify the role and use of both MapReduce v1 (MRv1) and MapReduce v2 (MRv2 / YARN) daemons.
Analyze the benefits and challenges of the HDFS architecture.
Analyze how HDFS implements file sizes, block sizes, and block abstraction.
Understand default replication values and storage requirements for replication.
Determine how HDFS stores, reads, and writes files.
Identify the role of Apache Hadoop Classes, Interfaces, and Methods.
Understand how Hadoop Streaming might apply to a job workflow.
What is MapReduce?
MapReduce is a linearly scalable programming model and a batch query processor: it has the ability to run ad hoc queries against whole datasets and return results in a reasonable time.
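A toy, in-memory sketch of the MapReduce model (not Hadoop's actual API) makes the map, shuffle, and reduce phases concrete:

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply the user's map function to every input record."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Apply the user's reduce function to each key's list of values."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Word count, the canonical MapReduce example.
lines = ["hadoop stores data", "hadoop processes data"]
pairs = map_phase(lines, lambda line: [(w, 1) for w in line.split()])
counts = reduce_phase(shuffle(pairs), lambda k, vs: sum(vs))
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

Because each split can be mapped independently, adding machines lets the same program handle proportionally more data; that is what "linearly scalable" means here.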
Why is MapReduce needed? Why not an RDBMS for large-scale batch analysis?
What is the difference between seek time and transfer rate?
Seeking is the process of moving the disk's head to the position where data will be read or written, whereas the transfer rate is the disk's bandwidth. If seek time dominates, reading or writing large datasets becomes painfully slow.
What is B-tree?
The B-tree is the structure best suited to a traditional RDBMS for updating small portions of a dataset. For rebuilding a whole dataset it is less efficient than MapReduce, which uses sort/merge operations.
What is the difference between an RDBMS and MapReduce?
MapReduce is best suited to analyzing whole datasets in batch fashion. It fits workloads where data is written once and read many times. A schema is optional, so it can process both structured and unstructured data.
An RDBMS is best suited to small datasets for queries and updates. It fits datasets that are continually updated. A schema is mandatory, so it can process only structured and semi-structured data.
What is structured data?
Data that is organized into entities with a defined format/schema is called structured data.
Semi-structured data is organized into entities with a looser structure; a schema may or may not exist and is often ignored, serving only as a guide.
Unstructured data has no particular structure or format and no schema; its content is interpreted at processing time.
Why do RDBMSs use normalization?
An RDBMS is most often normalized to retain integrity and eliminate redundancy.
Can you elaborate on a few compatibility concerns in Hadoop?
Generally there are three types of compatibility issues: API compatibility, data compatibility, and wire compatibility.
API compatibility concerns the contract between user code and the Hadoop Java API. Data compatibility concerns persistent data and metadata formats.
Wire compatibility concerns interoperability between clients and servers over HTTP and RPC ports.
Can you define input splits and blocks?
The Hadoop framework divides the input to a MapReduce job into fixed-size logical pieces called input splits, and creates one map task for each split.
The framework divides the data stored in HDFS into fixed-size physical pieces called blocks.
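The arithmetic behind blocks is simple; a sketch, assuming the 64 MB default block size:

```python
import math

BLOCK_SIZE_MB = 64  # HDFS default block size

def num_blocks(file_size_mb):
    """An HDFS file occupies ceil(size / block size) blocks;
    the last block holds only the remaining bytes."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

print(num_blocks(200))  # 4 blocks: three full 64 MB blocks plus one 8 MB block
print(num_blocks(1))    # 1 block, storing just 1 MB of data
```

The framework then schedules roughly one map task per split, and by default a split corresponds to one block.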
What is the benefit of Hadoop processing data in parallel?
Hadoop creates one map task per split. Splits add the overhead of managing them, but processing the splits in parallel improves load balancing: a machine that finishes one split can immediately start another. Hadoop is intentionally designed to process vast amounts of data in parallel this way.
What is data locality?
The framework takes care of this: it runs each map task on the node where that split's data resides in HDFS, which is called data locality. As a result, map tasks consume little or no network bandwidth. Reduce tasks, however, cannot take advantage of data locality, since they read output from many mappers.
Why is map task output always written to the local disk, not to HDFS?
If intermediate output were stored in HDFS, Hadoop would replicate it, which is unnecessary overhead: map output is temporary and is thrown away once the job completes. So Hadoop stores intermediate data on the local disk, while reducer output is always stored in HDFS.
What is Hadoop Streaming?
Hadoop Streaming is an interface between Hadoop and user programs written in any language that can read from standard input and write to standard output; the framework passes key-value pairs to and from the program through those streams.
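A sketch of a streaming word-count job in Python; in a real job each function would read sys.stdin and the two scripts would be passed to the hadoop-streaming JAR as -mapper and -reducer:

```python
def mapper(lines):
    """Streaming mapper: emit 'word<TAB>1' for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Streaming reducer: input arrives sorted by key, so all counts for
    one word are contiguous and can be summed in a single pass."""
    current, total = None, 0
    for line in sorted_lines:
        word, count = line.rsplit("\t", 1)
        if word != current and current is not None:
            yield f"{current}\t{total}"
            total = 0
        current = word
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

# The framework sorts map output before the reduce phase; simulated here.
mapped = sorted(mapper(["to be or not to be"]))
result = list(reducer(mapped))
print(result)  # counts for 'be', 'not', 'or', 'to'
```

A hypothetical invocation (paths are illustrative, not exact): hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input in -output out.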
Why Hadoop Pipes?
Hadoop Pipes is a C++ interface to Hadoop that lets map and reduce code written in C++ communicate with the framework. Pipes does not run in standalone mode.
What are distributed filesystems?
Filesystems that manage storage across a network of machines are called distributed filesystems. Node failure and the risk of data loss are routine concerns in distributed filesystems, which is why HDFS replicates data.
What is streaming data access?
HDFS is designed to store very large files with a streaming data access pattern. Hadoop follows the write-once, read-many-times pattern, reading files sequentially end to end, which it can process most efficiently.
Why HDFS, why not other storage?
HDFS allows streaming data access, runs on commodity hardware, and supports parallel processing of very large files.
HDFS is not a good fit for low-latency data access or for lots of small files; it is optimized for write-once, read-many workloads that need high data throughput.
What is a block? What is the difference between a file system block and an HDFS block?
A block is a fixed-size chunk of data, the minimum unit a filesystem reads or writes. A disk filesystem block is typically a few kilobytes (the underlying disk sector is 512 bytes), while the HDFS default block size is 64 MB. A disk block that is only partially filled still occupies the full block on disk, whereas a file in HDFS that is smaller than a block does not occupy a full block's worth of underlying storage.
Why is the HDFS block size so large?
A disk filesystem block is only a few kilobytes (e.g., 4 KB), but the HDFS default block size is 64 MB. Making blocks large keeps the time spent seeking small relative to the time spent transferring data.
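Back-of-the-envelope arithmetic, assuming a ~10 ms average seek and a ~100 MB/s transfer rate (typical commodity-disk figures), shows why a large block keeps seek overhead negligible:

```python
seek_time_s = 0.010        # ~10 ms average disk seek (assumed)
transfer_rate_mb_s = 100   # ~100 MB/s sustained transfer (assumed)

def seek_overhead(block_size_mb):
    """Fraction of total read time spent seeking for one block."""
    transfer_time = block_size_mb / transfer_rate_mb_s
    return seek_time_s / (seek_time_s + transfer_time)

print(f"{seek_overhead(0.004):.0%}")  # ~100%: seeking dominates for a 4 KB block
print(f"{seek_overhead(64):.1%}")     # ~1.5%: negligible for a 64 MB block
```

With 64 MB blocks the disk spends almost all of its time transferring data, which is exactly the streaming access pattern HDFS is built for.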
What are the NameNode's functions?
The NameNode manages the filesystem namespace, receives heartbeats and block reports from the DataNodes, and responds to client requests for metadata.
What are the namespace image and edit log files?
The NameNode persists metadata on local disk in two files: the namespace image (fsimage) and the edit log. The fsimage stores the complete block and file metadata at a checkpoint; subsequent modifications are recorded in the edit log, which is periodically merged (flushed) into the namespace image.
How does the NameNode overcome being a single point of failure?
If the NameNode goes down, the filesystem metadata is lost and the cluster becomes unusable; this is the single point of failure. Hadoop provides two mechanisms to mitigate it. First, the NameNode can persist its state to multiple filesystems (for example, local disk plus a remote NFS mount). Second, the secondary NameNode periodically merges the namespace image with the edit log.
What is HDFS federation?
HDFS federation scales the namespace by adding more NameNodes. Each NameNode manages a portion of the filesystem namespace (a namespace volume). The NameNodes are independent and do not need to coordinate with each other. Each DataNode registers with every NameNode and stores blocks for all of them.
Example: namenode1 manages sales, namenode2 manages products, namenode3 manages services, and so on.
What is block pool storage?
Each namespace volume is unique, but a cluster can have many block pools. A block pool is the set of blocks that belong to a single namespace. DataNodes store blocks for all the block pools in the cluster independently.
Why use a clusterID?
The clusterID identifies all the nodes that belong to a cluster. After a NameNode is formatted, the clusterID ties it back to the existing nodes.
Each namespace generates block IDs to identify its blocks.
How do you increase NameNode memory?
By setting HADOOP_NAMENODE_OPTS in hadoop-env.sh with the desired JVM heap size. For example (the 2000 MB figure is illustrative):
export HADOOP_NAMENODE_OPTS="-Xmx2000m"
How much memory does a NameNode need?
Memory usage grows with the number of files and blocks in the namespace. A rough estimate of the block count is:
Number of nodes * disk space per node / (block size * number of replicas)
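A sketch of this estimate; the "1 GB of heap per million blocks" figure used below is a common rule of thumb, not an exact requirement:

```python
def estimated_blocks(nodes, disk_per_node_tb, block_size_mb=64, replicas=3):
    """Rough count of unique blocks the NameNode must track:
    total raw storage divided by (block size x replication factor)."""
    total_mb = nodes * disk_per_node_tb * 1024 * 1024  # TB -> MB
    return total_mb / (block_size_mb * replicas)

def namenode_heap_gb(blocks, gb_per_million_blocks=1.0):
    """Rule of thumb (assumed): roughly 1 GB of heap per million blocks."""
    return blocks / 1_000_000 * gb_per_million_blocks

blocks = estimated_blocks(nodes=200, disk_per_node_tb=24)
print(round(namenode_heap_gb(blocks), 1))  # ~26.2 GB for this 200-node cluster
```

The real figure also depends on average file size: many small files mean many more namespace objects per unit of storage, which is why HDFS dislikes small files.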
What is HDFS high-availability?
The secondary NameNode protects against total metadata loss, but it does not provide high availability of the filesystem: the NameNode remains the sole repository of the latest metadata. With the help of NFS, the NameNode can persist its state to highly available storage to prevent data loss, but that alone cannot automatically start a standby. In an HA pair, the standby NameNode tries to become active when the active fails; to do so safely it fences the old active NameNode, killing it if necessary, so that two NameNodes never serve at once.
What is the difference between HTTP and WebHDFS?
The HTTP interface is read-only, while the newer WebHDFS interface supports all filesystem operations, including Kerberos authentication. Enable WebHDFS by setting dfs.webhdfs.enabled to true.
Can you elaborate about Network Topology in Hadoop?
Nodes communicate with each other over the network, and Hadoop moves large amounts of data between them, so the bandwidth available between any two nodes matters. Measuring bandwidth directly is impractical, so Hadoop models it by distance: the closer two nodes are in the network tree, the more bandwidth is assumed between them.
Processes on the same node = 0,
different nodes on the same rack = 2,
nodes on different racks within the same data center = 4, and
nodes in different data centers = 6.
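A small sketch of how this distance can be computed, assuming each node is identified by a path of the form /datacenter/rack/node (the same tree model Hadoop's topology uses):

```python
def distance(node_a, node_b):
    """Network distance in Hadoop's tree model: the number of hops from
    each node up to their closest common ancestor, summed."""
    a = node_a.strip("/").split("/")
    b = node_b.strip("/").split("/")
    common = 0
    while common < min(len(a), len(b)) and a[common] == b[common]:
        common += 1
    return (len(a) - common) + (len(b) - common)

print(distance("/d1/r1/n1", "/d1/r1/n1"))  # 0: same node
print(distance("/d1/r1/n1", "/d1/r1/n2"))  # 2: same rack
print(distance("/d1/r1/n1", "/d1/r2/n3"))  # 4: same data center, other rack
print(distance("/d1/r1/n1", "/d2/r3/n4"))  # 6: different data centers
```

The scheduler and the read path both prefer the replica with the smallest distance to the requester.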
What is FSDataOutputStream?
DistributedFileSystem returns an FSDataOutputStream when a client creates a file for writing. The stream buffers data into a data queue and launches a DataStreamer thread to write the data along the DataNode pipeline.
How does Hadoop write data?
The Hadoop client asks the NameNode, via DistributedFileSystem, to allocate DataNodes for the write. DFSOutputStream then forms a temporary pipeline through those DataNodes, and the DataStreamer pushes packets along it, each DataNode forwarding the data to the next and acknowledgments flowing back. Once the minimum number of replicas has been written, the NameNode considers the block successfully written, and the remaining replicas are completed asynchronously. If a write fails, the framework retries it a configurable number of times.
How does Hadoop read data from blocks?
The client first asks the NameNode, via DistributedFileSystem, to open the file by calling open(). The NameNode returns the addresses of the DataNodes holding the file's blocks, and the client then calls read() on the returned FSDataInputStream. DFSInputStream connects to the nearest DataNode for each block in turn. When the client has finished reading, it calls close() on the FSDataInputStream.
How does Hadoop replicate data?
While a block is being written to one node, it is asynchronously replicated across the cluster until its target replication factor is reached. Typically the first replica is stored on the writer's own node, and the other two replicas are stored on a different rack.
What is the importance of checksums?
Whenever data is transferred, a few corrupted bytes are always possible, and in a system moving as much data as Hadoop the chance of corruption is significant. A checksum is an error-detection scheme used to detect corruption when data enters and leaves the network. The DataNodes are responsible for verifying checksums in the write pipeline.
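A sketch of the per-chunk checksum idea using CRC-32; Hadoop checksums every 512 bytes of data by default (the io.bytes.per.checksum setting):

```python
import zlib

BYTES_PER_CHECKSUM = 512  # Hadoop's default io.bytes.per.checksum

def checksum_chunks(data):
    """Compute a CRC-32 per 512-byte chunk, as HDFS does on write."""
    return [zlib.crc32(data[i:i + BYTES_PER_CHECKSUM])
            for i in range(0, len(data), BYTES_PER_CHECKSUM)]

original = b"x" * 1024
stored = checksum_chunks(original)

# On read, recompute and compare: any mismatch pinpoints a corrupt chunk.
corrupted = b"x" * 600 + b"!" + b"x" * 423
bad = [i for i, (a, b) in enumerate(zip(stored, checksum_chunks(corrupted)))
       if a != b]
print(bad)  # [1]: only the second 512-byte chunk was damaged
```

Chunk-level checksums mean a single flipped byte invalidates only one small chunk, so the reader knows exactly which replica region to re-fetch.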
How does Hadoop handle corrupted data?
LocalFileSystem uses ChecksumFileSystem to compute and verify checksum data; the getRawFileSystem() method returns the underlying filesystem without checksum verification. When corruption is found, ChecksumFileSystem calls the reportChecksumFailure() method, and the Hadoop administrator takes care of such files.
What are the benefits when Compression the file?
It reduces the space need to store files and it speeds up data transfer across the network. So compression is highly recommendable for vast amount of data. Default, gzip, snappy and LZO are common compression formats. All compression techniques suitable for Mapreduce.
Options: -1 means optimize speed, -9 means optimize speed. eg: gzip -1 filename.
The CompressionCodec interface allows files to be compressed and decompressed. org.apache.hadoop.io.compress.DefaultCodec, GzipCodec, and SnappyCodec are common codec implementations.
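The codec classes above are Java APIs; as an illustration of the same compress/decompress round trip (and of the -1 vs -9 speed/size trade-off), here is a sketch using Python's gzip module:

```python
import gzip

# Repetitive data, like log files, compresses very well.
data = b"2023-01-01 INFO request handled\n" * 1000

fast = gzip.compress(data, compresslevel=1)   # like gzip -1: optimize speed
small = gzip.compress(data, compresslevel=9)  # like gzip -9: optimize size

assert gzip.decompress(fast) == data          # compression is lossless
print(len(data), len(fast), len(small))       # both far smaller than the input
```

Note that plain gzip output is not splittable: a MapReduce job cannot start a mapper in the middle of a gzip stream, which is why splittability matters in codec choice.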
Which Compression Format Should I Use?
It depends on the application; ideally the compression format should allow splitting. For large files such as log files, either store the files uncompressed, or use a container format such as a sequence file, which supports compression and splitting regardless of the codec.
Steps to compress files in MapReduce?
Set mapred.output.compress to true and mapred.output.compression.codec to the codec class to compress job output; set mapred.compress.map.output to true to compress intermediate map output.
What is Serialization and deserialization?
The network carries only byte streams, not structured objects. Serialization is the process of converting structured objects into a byte stream for transmission over the network.
Deserialization is the reverse process of converting a byte stream back into structured objects.
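A minimal sketch of the idea, converting a structured value to bytes and back (mirroring, loosely, what Hadoop's IntWritable does with its DataOutput/DataInput streams):

```python
import struct

def serialize_int(value):
    """Write an int as 4 big-endian bytes, like IntWritable.write(DataOutput)."""
    return struct.pack(">i", value)

def deserialize_int(data):
    """Read the int back, like IntWritable.readFields(DataInput)."""
    return struct.unpack(">i", data)[0]

wire = serialize_int(42)
print(wire)                   # b'\x00\x00\x00*': four bytes on the wire
print(deserialize_int(wire))  # 42: the structured value restored
```

A fixed binary layout like this is compact and fast to parse, which is exactly why RPC frameworks prefer it over text formats.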
Where and when is serialization used?
Serialization and deserialization occur most frequently in distributed data processing.
The RPC protocol uses serialization and deserialization whenever Hadoop transmits data between nodes.
Can you name some desirable properties of an RPC serialization format?
Compact – makes the best use of network bandwidth.
Fast – low overhead for inter-process communication; essential for distributed systems that read and write terabytes of data.
Extensible – protocols change over time, and the format must be able to evolve to meet new requirements.
Interoperable – supports clients written in languages different from the server's.
Why does Hadoop use its own serialization format instead of a general RPC format?
The Writable interface is central to Hadoop: it forms most key and value types, and it serializes and deserializes the data. It is compact and fast, though not easy to extend. Instead of another serialization framework, Hadoop uses this interface of its own to serialize and deserialize data.
What is writable interface?
The Writable interface is responsible for reading and writing data in serialized form for transmission. It defines two methods: one for writing the object's state to a DataOutput binary stream, and one for reading it back from a DataInput binary stream.
When do you use safemode?
Safemode is a temporary state of the NameNode in which it serves only read-only operations. By default Hadoop enters and leaves safemode automatically when the cluster starts, but an admin can also enter and leave it manually.
Admins enable safemode manually when upgrading the Hadoop version or making complex changes to the NameNode. To save the metadata to disk and reset the edit log, put the NameNode into safemode and save the namespace with the following commands:
hadoop dfsadmin -safemode enter
hadoop dfsadmin -saveNamespace
hadoop dfsadmin -upgrade
hadoop dfsadmin -safemode leave
Explain the different modes in Hadoop.
Local (standalone) mode: Hadoop runs on the local OS filesystem, not HDFS, and everything runs in a single JVM. Most often used to develop MapReduce programs in a development environment.
Pseudo-distributed mode: Hadoop is installed and runs on a single machine, but every daemon runs independently in its own JVM, and HDFS is used to store the data. It is the best choice for developing and testing applications.
Fully distributed mode: Hadoop runs on a cluster of multiple machines. Every daemon runs in its own JVM; DataNode and TaskTracker run together on each slave node, while the remaining daemons run on their own nodes.
What are the DataNode's and TaskTracker's heap sizes?
Java stores runtime objects in a region of memory called the heap (HEAPSIZE). By default the DataNode heap size is 128 MB and the TaskTracker heap size is 512 MB.
What are the ways to interact with HDFS?
The command-line interface (hadoop fs), the Java FileSystem API, and HTTP interfaces such as WebHDFS.
What does the Writable interface do?
It is the centre of Hadoop serialization and deserialization. It defines two methods that read and write data over DataInput and DataOutput binary streams.
What is Avro?
Apache Avro is a language-neutral data serialization system. Writables lack portability across languages; with Avro, Hadoop data can easily be serialized, read, and written from C++, Python, Ruby, and other programming languages. Avro has a language-independent schema, and code generation is optional.
What are counters?
Counters are a useful channel for gathering statistics about a job, for analyzing the quality of the application, and for diagnosing problems. Every big data analyst should be aware of counters when debugging jobs.
What is HFTP?
HFTP is a read-only Hadoop filesystem that lets you read data from a remote HDFS cluster. The data is still stored on the DataNodes, but HFTP does not allow writing to or modifying the filesystem state. Use HFTP when moving data from one Hadoop version to another; it is wire-compatible between different versions of HDFS.
Eg: hadoop distcp -i hftp://sourcefile:50070/sourcepath hdfs://destinationfile:50070/destinationpath
How do you tune job performance?
There are many ways: combine small files using CombineFileInputFormat; choose the number of reducers carefully to maximize performance; use combiners to cut down the data shuffled to reducers; compress map output to save network bandwidth; use custom serialization and implement RawComparator to maximize sort speed; and tune the MapReduce shuffle and memory-management configuration to improve map performance.
Why does Hadoop sometimes give a "connection refused" error?
Most frequently for one of two reasons. First, ssh is not installed; install it from the command line: sudo apt-get install ssh.
Second, the hostname is mismatched. Check the hosts file with: sudo gedit /etc/hosts.
By default it contains: 127.0.0.1 localhost
Add your host's entry after this line.
What is Offline Image Viewer?
The Offline Image Viewer dumps HDFS fsimage files to human-readable formats, so you can quickly analyze the cluster's namespace offline.