What are the core changes in Hadoop 2.x?
There are many changes. The main ones are the removal of the single point of failure and the decentralization of the JobTracker's responsibilities across the cluster. The entire JobTracker architecture was changed. Some of the main differences between Hadoop 1.x and 2.x are given below.
- Single point of failure: rectified.
- Node limit (about 4,000 nodes per cluster in 1.x): rectified; 2.x scales well beyond that.
- JobTracker bottleneck: rectified.
- MapReduce slots changed from static to dynamic: fixed map and reduce slots are replaced by dynamically allocated containers.
- High availability: available.
- Supports both interactive and iterative graph algorithms (1.x does not).
- Allows other applications to integrate with HDFS as well.
What is YARN?
What is the difference between MapReduce1 and MapReduce2/YARN?
In MapReduce 1, Hadoop centralized everything in the JobTracker, which both allocated resources and scheduled jobs across the cluster. YARN decentralizes this to take the pressure off a single daemon. The ResourceManager allocates cluster resources to applications, the NodeManager on each node launches and monitors containers, and a per-application ApplicationMaster manages and executes its own job. Because each application has its own ApplicationMaster, jobs run in parallel without a central bottleneck, which removes many of the JobTracker's problems, improves scalability and helps job performance. YARN also allows multiple kinds of applications to scale out in the same distributed environment.
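How does Hadoop determine the distance between two nodes?
Hadoop treats the network as a tree and takes the distance between two nodes as the sum of their distances to their closest common ancestor. To know where each node sits in that tree, the Hadoop administrator writes a topology script that maps nodes to rack locations. Hadoop invokes this script to work out how far apart nodes are when it decides where to replicate data. Configure the path to the script in core-site.xml (the net.topology.script.file.name property in Hadoop 2.x).
In the script (here rack-awareness.sh), write the logic that reports which rack each node is located in.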
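As a minimal sketch of what such a rack-awareness.sh could look like: Hadoop passes one or more node addresses as arguments and expects one rack path per argument on standard output. The subnets, the rack names and the `resolve_rack` helper are made up for illustration; a real script would encode your own network layout.

```shell
#!/bin/sh
# Sketch of a topology script (rack-awareness.sh).
# Hadoop calls it with one or more IP addresses or host names as arguments
# and reads one rack path per argument from standard output.
# ASSUMPTION: the subnets and rack names below are hypothetical examples.

resolve_rack() {
  # $1 = IP address (or host name) of a node
  case "$1" in
    10.1.1.*) echo "/dc1/rack1" ;;
    10.1.2.*) echo "/dc1/rack2" ;;
    *)        echo "/default-rack" ;;  # fallback for nodes we do not know
  esac
}

# Print one rack per argument, in the same order as the arguments.
for node in "$@"; do
  resolve_rack "$node"
done
```

Running it by hand with a couple of addresses, for example `sh rack-awareness.sh 10.1.1.5 10.1.2.9`, is a quick way to check the mapping before pointing core-site.xml at it.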
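A user deleted a file by mistake. How does Hadoop remove it from its file system? Can the deletion be rolled back?
HDFS does not delete the file immediately. It first renames it and moves it into the trash directory (.Trash under the user's home directory), where it stays for a configurable amount of time. During this time the file can still be restored, and its blocks are not freed. After the interval expires, the NameNode removes the file from the HDFS namespace and its blocks are freed. The interval is set in minutes by fs.trash.interval in core-site.xml. A value of 0, which is the default in Apache Hadoop, disables the trash so files are deleted immediately; set a positive value to keep deleted files recoverable.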
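The rollback itself is just a move out of the trash. The sketch below shows where a deleted file usually ends up, assuming the trash is enabled, the file was deleted through the file system shell, and home directories live under /user (the common default). The `trash_path` helper and the example user and paths are hypothetical.

```shell
#!/bin/sh
# Sketch: locate a deleted file in the HDFS trash.
# ASSUMPTION: home directories are under /user and the file is still in the
# current trash checkpoint; trash_path is a hypothetical helper, not a Hadoop tool.

trash_path() {
  # $1 = user who deleted the file, $2 = absolute HDFS path of the deleted file
  printf '/user/%s/.Trash/Current%s\n' "$1" "$2"
}

# On a running cluster the restore is then a plain move, for example:
#   hdfs dfs -mv "$(trash_path alice /data/report.csv)" /data/report.csv
# To delete a file without going through the trash at all:
#   hdfs dfs -rm -skipTrash /data/report.csv
```

The hdfs commands are left as comments because they need a live cluster; only the path construction runs on its own.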
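What is the difference between Hadoop NameNode Federation, NFS and JournalNode?
HDFS Federation separates the namespace from the storage layer: multiple independent NameNodes each manage part of the namespace while sharing the same DataNodes for block storage, which improves scalability and isolation. NFS and JournalNodes are not alternatives to federation. They are the two ways of sharing the edit log between the active and standby NameNodes in a high-availability setup: either through a shared NFS directory or through a quorum of JournalNodes.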
What is the DistCp functionality in Hadoop?
DistCp (distributed copy) is a tool for transferring large amounts of data within a cluster or between clusters. It runs as a MapReduce job, so the copy is spread across the cluster.
hadoop distcp hdfs://namenode1:8020/nn hdfs://namenode2:8020/nn
It can also copy multiple sources to a destination cluster. The last argument is the destination.
hadoop distcp hdfs://namenode1:8020/dd1 hdfs://namenode2:8020/dd2 hdfs://namenode3:8020/dd3
Is YARN a replacement for MapReduce?
YARN is a generic resource-management layer. It supports MapReduce, but it is not a replacement for MapReduce; MapReduce is one of the applications that run on it. You can develop many kinds of applications with the help of YARN. Spark, Drill and many other applications work on top of YARN.
What are the core concepts/processes in YARN?
- ResourceManager: roughly equivalent to the JobTracker.
- NodeManager: roughly equivalent to the TaskTracker.
- ApplicationMaster: roughly equivalent to a job. Everything is an application in YARN. When a client submits a job (application), the ResourceManager starts an ApplicationMaster for it.
- Containers: roughly equivalent to slots.
- YarnChild: when you submit an application, the ApplicationMaster dynamically launches YarnChild processes to run the map and reduce tasks.
If the ApplicationMaster fails, it is not a problem: the ResourceManager automatically starts a new ApplicationMaster attempt.
What are the steps to upgrade Hadoop 1.x to Hadoop 2.x?
Do not upgrade directly on top of the 1.x installation. Download Hadoop 2.x locally, then remove the old 1.x files. Expect the upgrade to take a long time.
- Check that the share folder is present in the 2.x installation (share/hadoop/mapreduce/lib); it is important.
- Stop all Hadoop processes.
- Delete the old metadata info… from work/hadoop2data.
- Copy the 1.x data into work/hadoop2.x and rename it.
- Do not format the NameNode during the upgrade.
- Start the upgrade. This takes a lot of time:
hadoop namenode -upgrade
- Do not close that terminal; open a new terminal for further commands.
- If the upgrade has to be undone, roll back with:
hadoop namenode -rollback