Why use Hadoop?
Hadoop can handle any type of data, in any quantity, and leverages commodity hardware to keep costs down.
Structured or unstructured, with a schema or without, high volume or low: whatever the data, you can store it reliably.
What is Big Data?
Traditional databases struggle to process many different types of data and very large amounts of data. Big data is a strategy for processing large and complex data sets that traditional databases cannot handle.
Today every organization generates massive volumes of both structured and unstructured data, which are difficult to store and process computationally. Big data addresses this problem and is commonly characterized by the four V's:
- Volume – the size of the data
- Velocity – the speed at which data is generated and processed
- Variety – structured and unstructured data
- Veracity – uncertain, imprecise data
What is Hadoop?
Hadoop is an open-source project from the Apache Software Foundation that enables distributed storage and processing of large data sets across clusters of commodity hardware.
What is File System?
A file system is the structured way an operating system keeps and organizes data files on disk. File systems grant users and groups read, write, execute, and delete privileges.
What is FUSE filesystem?
HDFS is a user-space file system, not a POSIX file system: Hadoop does not satisfy all POSIX rules. FUSE (Filesystem in Userspace) makes it possible to mount HDFS so that it can be accessed like an ordinary file system.
What is DFS?
A Distributed File System (DFS) is a client/server-based application that stores data across different servers/systems in parallel, according to the server architecture.
What is No SQL?
NoSQL is an acronym for Not Only SQL. It eases many RDBMS problems by storing and accessing data across multiple servers. It is highly recommended for standalone projects and huge unstructured datasets.
What is the difference between real-time and batch processing?
Batch processing executes a series of programs (jobs) on a computer without any manual interaction; Hadoop uses batch processing by default.
Real-time processing executes a continuous series of jobs and processes data as soon as possible. Several Hadoop ecosystem tools allow real-time processing.
What is meta-Data?
Metadata is data about data. The NameNode stores the metadata but does not index the data: it knows information about the files, not their inner contents.
What is NFS?
Network File System (NFS) is a client/server application, developed by Sun Microsystems, that allows resources to be shared between different servers on a computer network. Hadoop 2.x can use NFS to store a copy of the NameNode metadata on another system.
What is Hadoop Ecosystem?
It is the community of different tools/applications that integrate with Hadoop. Pig, Hive, HBase, Sqoop, Oozie, and Flume are common Hadoop ecosystem applications.
What is Raid Disks?
Redundant Array of Inexpensive/Independent Disks (RAID) stores the same data in different places. It is highly recommended for the NameNode machine, to protect the metadata.
What is Replication?
Why doesn't Hadoop support updates and append?
By default Hadoop is designed for write-once, read-many functionality. Hadoop 2.x supports the append operation, but Hadoop 1.x does not.
What is the use of RecordReader in Hadoop?
An InputSplit is assigned to a task but does not know how to access its data. The RecordReader class is responsible for loading the data from its source and converting it into key/value pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.
Elaborate on the Hadoop process.
Elaborate on the important RPC & HTTP ports.
- 8020 – NameNode (RPC)
- 8021 – JobTracker (RPC)
- 50070 – NameNode (HTTP)
- 50075 – DataNode (HTTP)
- 50090 – Secondary NameNode (HTTP)
- 50030 – JobTracker (HTTP)
- 50060 – TaskTracker (HTTP)
What is the RPC protocol?
What is the HTTP protocol?
Why are the DataNode & TaskTracker on the same machine?
What are setup() and cleanup()?
These MapReduce methods run at the start and end of each task:
- setup() initializes resources.
- map() and reduce() process the data.
- cleanup() releases resources.
setup() and cleanup() are triggered in both the mapper and the reducer. Sorting/comparison is also triggered between the map and reduce phases.
map() and reduce() give record-level control, while setup() and cleanup() give task-level (per-split) control; the InputFormat additionally allows file-level control.
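The lifecycle can be sketched in Python (a simplified stand-in for Hadoop's Java Mapper API; the class and driver below are hypothetical illustrations, not real Hadoop code):

```python
# Sketch of the mapper task lifecycle: setup() once per split,
# map() once per record, cleanup() once per split.
class WordCountMapper:
    def setup(self):
        # Initialize per-task resources (e.g. open a lookup file).
        self.emitted = []

    def map(self, key, value):
        # Called once for every record in the split.
        for word in value.split():
            self.emitted.append((word, 1))

    def cleanup(self):
        # Release per-task resources.
        pass

def run_task(mapper, split):
    # The framework drives the same sequence for every split.
    mapper.setup()
    for offset, line in split:
        mapper.map(offset, line)
    mapper.cleanup()
    return mapper.emitted

pairs = run_task(WordCountMapper(), [(0, "big data"), (9, "big")])
# -> [("big", 1), ("data", 1), ("big", 1)]
```

The same driver shape applies to the reducer, with reduce() in place of map().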
What is Distributed Cache?
When a map or reduce task needs access to common data, older data, or files an application depends on, the Hadoop framework uses the distributed cache to boost efficiency. It is configured in the JobConf, and the cached files are made available on the task nodes across the cluster.
What is Counter in MapReduce?
Counters provide a way to measure progress or count the number of operations that occur within a MapReduce program. Counters do not affect the program's logic, but Big Data analysts and Hadoop developers use them for analytical purposes.
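With Hadoop Streaming, for instance, a task increments a counter by writing a line of the form reporter:counter:&lt;group&gt;,&lt;counter&gt;,&lt;amount&gt; to standard error. A minimal Python sketch (function and counter names are illustrative):

```python
import sys

def increment_counter(group, counter, amount=1, stream=sys.stderr):
    # Hadoop Streaming parses counter updates from stderr lines
    # of the form: reporter:counter:<group>,<counter>,<amount>
    stream.write("reporter:counter:%s,%s,%d\n" % (group, counter, amount))

def mapper(lines):
    for line in lines:
        if not line.strip():
            # Track skipped records without affecting the job output.
            increment_counter("WordCount", "EMPTY_LINES")
            continue
        for word in line.split():
            print("%s\t%d" % (word, 1))

mapper(["hello world", ""])
```

The counter values are aggregated by the framework and shown in the job's final status.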
Why are the NameNode, JobTracker & Secondary NameNode on different machines?
What are the drawbacks of Hadoop 1.x?
- Single point of failure (the NameNode).
- Scalable only to about 4,000 nodes.
- High latency by default.
- Lots of small files overload the NameNode.
- Limited to MapReduce jobs.
- Append and updates are not possible.
- OS dependent.
Most of these problems are resolved in Hadoop 2.x.
What is Low latency?
Elaborate on FSImage & EditLog.
What is Checkpoint?
What does a Hadoop daemon do?
Is Java mandatory to write Map and reduce programs?
No. The Hadoop framework has a special utility called Streaming that allows map and reduce programs to be written in Perl, Python, Ruby, and other languages. However, for deep customization of MapReduce, Java is mandatory, mainly because Hadoop itself is written in Java.
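For example, the classic word-count mapper can be written in a few lines of Python for Hadoop Streaming (a sketch; in a real job the records would come from standard input, and the script would be passed to the streaming jar with -mapper):

```python
import sys

def map_words(lines):
    # Emit one "word<TAB>1" record per word; Hadoop Streaming
    # splits mapper output on the tab into key and value.
    records = []
    for line in lines:
        for word in line.split():
            records.append("%s\t1" % word)
    return records

# In a real job the mapper would iterate over sys.stdin;
# a sample input is used here for illustration.
for record in map_words(["big data", "big"]):
    sys.stdout.write(record + "\n")
```

The framework then sorts these pairs by key and feeds them to a reducer, which may be another script in any language.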
How do the Mapper and Reducer work together?
What is the importance of Writable interface?
Hadoop provides different classes that implement the Writable interface, such as Text, IntWritable, LongWritable, FloatWritable, BooleanWritable, and more. All these classes are in the org.apache.hadoop.io package.
What is Combiner?
What is partitioner?
What is hash partitioner?
What is normalization?
What is the difference between horizontal and vertical scaling?
What is Structured data & unstructured data?
What is safe mode for NameNode?
What are SSH & HTTPS?
What is SSH? Why is it used in Hadoop?
What are Daemons in Hadoop?
- NameNode
- DataNode
- Secondary NameNode
- JobTracker
- TaskTracker
Each Daemon runs separately in its own JVM.
What is Speculative execution?
Is "number of blocks = number of jobs" true?
No. By default, number of blocks = number of mappers.
Number of splits = number of maps (always).
Once data is stored, it is not possible to change the block size, but it is possible to change the split size: a split is a logical division, so the split size can be configured per job as the project requires.
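Since a split is a logical division, the number of map tasks follows from the input size and the configured split size. A sketch of the arithmetic (the 128 MB split size is an assumed example value, not a universal default):

```python
import math

def num_splits(file_size_bytes, split_size_bytes):
    # One map task is launched per input split.
    return math.ceil(file_size_bytes / split_size_bytes)

MB = 1024 * 1024
# A 1 GB input with a 128 MB split size yields 8 splits, hence 8 mappers.
splits = num_splits(1024 * MB, 128 * MB)
```

Shrinking the split size raises the mapper count (more parallelism, more overhead); growing it does the opposite.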
Is there any relation between mapper outputs?
No, mapper outputs are independent. There is no relation between the outputs of different mappers.
Why do we use Pipes in Hadoop?
Hadoop Pipes is a package that allows MapReduce programs to be written in C++. It provides a C++ API through which the C++ code communicates with the Hadoop framework and runs as map and reduce tasks.
What is DistCp?
What is Rack awareness?
Why is the Combiner important?
What are the types of schedulers?
What compression techniques exist in Hadoop?
- Block compression – highly recommended
- Default codec
- Gzip (.gz)
- Bzip2 (.bz2)
- Snappy (.snappy) – highly recommended
- LZO (.lzo)
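The Gzip and Bzip2 codecs correspond to formats in Python's standard library, which makes the trade-off easy to demonstrate (a sketch; Snappy and LZO need third-party libraries, so they are omitted):

```python
import bz2
import gzip

data = b"hadoop " * 1000  # highly repetitive sample payload

gz = gzip.compress(data)  # .gz: fast and widely supported, not splittable
bz = bz2.compress(data)   # .bz2: slower, better ratio, splittable in Hadoop

# Both codecs must round-trip losslessly.
assert gzip.decompress(gz) == data
assert bz2.decompress(bz) == data
```

Splittability matters in Hadoop: a splittable codec such as Bzip2 lets one large compressed file be processed by several mappers, while a Gzip file must be read by a single mapper.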
What is the importance of serialization in MapReduce?
What is deserialization? How does it work?
What is an Inverted Index?
An inverted index maps each term to the list of documents that contain it, for example:
- venu -> Doc1, Doc2, Doc3
- jyothi -> Doc1, Doc3
- anjali -> Doc1, Doc2
- sudha -> Doc1, Doc2
- anu -> Doc1, Doc2
- madhavi -> Doc1
- koti -> Doc1
- sita -> Doc1
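An index like the one above can be built with a short sketch (the document contents below are hypothetical, chosen to reproduce a few of the rows):

```python
def build_inverted_index(docs):
    # Map each term to the sorted list of documents containing it.
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    "Doc1": "venu jyothi anjali",
    "Doc2": "venu anjali",
    "Doc3": "venu jyothi",
}
index = build_inverted_index(docs)
# index["venu"] -> ["Doc1", "Doc2", "Doc3"]
```

In MapReduce terms, the mapper emits (term, doc_id) pairs and the reducer collects the document list for each term.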
What is Data Locality?
Hadoop's principle is that "moving the computation to the data is cheaper than moving the data to the computation": the logic is executed on the node where the data is stored, instead of transferring the data over the network. By default this behaviour is enabled, but it is much harder to exploit for unstructured data.
How does TextInputFormat read the data?
By default, Hadoop MapReduce uses text as the input and output format. The framework treats each line as one record: the key is the byte offset at which the line starts in the file, and the value is the contents of the line. Each key/value pair is processed by a mapper, which receives the key as a LongWritable parameter and the value as a Text parameter.
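The pairing can be illustrated with a small Python sketch (the byte offset plays the role of the LongWritable key and the line text the Text value; this mimics, rather than calls, Hadoop's TextInputFormat):

```python
def text_input_records(data):
    # Mimic TextInputFormat: key = byte offset of the line start,
    # value = the line contents without the trailing newline.
    records = []
    offset = 0
    for line in data.splitlines(keepends=True):
        records.append((offset, line.rstrip("\n")))
        offset += len(line)
    return records

records = text_input_records("hello world\nbig data\n")
# -> [(0, "hello world"), (12, "big data")]
```

Because the key is a byte offset, mappers rarely use it directly; the line text is what a typical map function works on.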
What is the importance of processing data in parallel from multiple disks?
Hard drive capacity grows massively every year, but read speeds have not kept pace. Storing the data on multiple drives is not a problem, but reading all of the data takes a long time.
Storing the data across multiple drives and processing it in parallel eases this problem. Hadoop is a special framework designed to store and process data in parallel in exactly this way.
What are the problems with parallel writing and reading?
Hardware failure: power failure, network failure, and server crashes are the main problems.
Combining the data correctly and in order during processing is also very difficult.
How does Hadoop resolve the parallel read/write problems?
HDFS stores data reliably through replication: it keeps copies of the data on multiple systems and allows it to be processed in parallel.
MapReduce reads the data in parallel and writes sequentially.
Where is HDFS not suitable?
Applications that require low-latency data access are not suitable for HDFS.
Lots of small files increase the metadata, so they are not recommended.
HDFS does not support multiple writers or arbitrary file modifications, so Hadoop is not suitable for such applications.
What are the key features of HDFS?
- Hardware failure is common in parallel distributed systems. Hadoop eases this problem by replicating the data.
- Most HDFS applications access streaming data and run as batch processes.
- It easily scales to large data sets, provides high throughput, and minimizes network waste.
- Simple coherency model: a write-once, read-many access model for files.
- Portability across different platforms is another plus point of HDFS; any type of application can easily adopt it.
- It runs on commodity hardware, so storage is very cheap.
- HDFS has a master/slave architecture.
- A single NameNode acts as the master and multiple DataNodes act as slaves.
- Internally, an input file is split into multiple chunks (blocks), and these blocks are stored on multiple DataNodes.
- Because the blocks are spread across the cluster, they can be read and written in parallel.
What are the DataNode's responsibilities?
- DataNodes are responsible for serving read and write requests from file system clients.
- Based on the NameNode's instructions, DataNodes also perform block creation, deletion, and replication operations.
- Every three seconds each DataNode sends a heartbeat to the NameNode, and it periodically sends a block report listing the blocks it stores.