Hadoop Admin Interview Questions

What is the use of SSH in Hadoop?

By default, SSH is not required by the Hadoop framework itself, but the Hadoop start/stop helper scripts (for example start-dfs.sh and start-mapred.sh) use SSH to log in to every node in the cluster and launch the daemons there. So passwordless SSH is only needed if you want to use those scripts; otherwise you can start each daemon manually on its own node with a command such as:
hadoop-daemon.sh start namenode
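
A minimal sketch, assuming a standard Apache Hadoop installation with the sbin scripts on the PATH (script names differ slightly between Hadoop 1.x and 2.x/3.x):

start-dfs.sh                     # with passwordless SSH configured, one command starts HDFS on all nodes
hadoop-daemon.sh start namenode  # without SSH, run this on the NameNode host
hadoop-daemon.sh start datanode  # and this on each DataNode host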

What is check pointing in Hadoop?

A checkpoint merges the NameNode's file system metadata (the fsimage and the edit log) into a new fsimage and persists it in HDFS. The checkpointed fsimage is what lets the NameNode recover and restart quickly.
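
A hedged way to trigger a checkpoint manually with standard HDFS admin commands (the NameNode must be in safe mode while the namespace is saved):

hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave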

Can you explain few compatibility issues in Hadoop?

Java API compatibility: Hadoop's interfaces are exposed through Java APIs, annotated for InterfaceAudience and InterfaceStability, so incompatible changes to those interfaces break applications.
Java binary compatibility: most often this problem occurs for MapReduce and YARN applications compiled against one Hadoop version and run against another.
Semantic compatibility: the APIs' behaviour must stay the same; tests and Javadocs specify the expected behaviour.
Wire compatibility: data is transferred between Hadoop processes using RPC, so client-to-server and server-to-server incompatibilities can appear, most often during upgrades.
REST API compatibility: issues such as changed requests and responses are another common source of incompatibility.

Why is ECC RAM recommended for servers?
Error Correcting Code (ECC) RAM automatically detects and corrects the most common internal data and memory errors, so in most cases production servers use ECC RAM.

Can you explain few FileSystem shell commands?

appendToFile
hdfs dfs -appendToFile local_src1 src2 src3 destination
cat:
hdfs dfs -cat filepath_to_read
chgrp: To change the group association of files.
hdfs dfs -chgrp -R group URI
chmod: To change file permissions.
hdfs dfs -chmod -R mode URI
chown: To change ownership.
hdfs dfs -chown -R owner:group URI
copyFromLocal:
hdfs dfs -copyFromLocal local_src URI
copyToLocal:
hdfs dfs -copyToLocal URI local_destination
count: Count the number of files and directories. -q shows the remaining quota, -h is human readable.
hdfs dfs -count -q -h /file
cp: copy files; it can overwrite the destination. -f forcefully overwrites.
hdfs dfs -cp -f URI destination
hdfs dfs -cp src1 src2 target // copy multiple sources into a target directory
du: display the size of files. -s gives a summary of the file lengths, -h is human readable.
hdfs dfs -du -s -h uri
dus: display a summary of the file lengths (deprecated; same as -du -s).
hdfs dfs -dus
expunge: empty trash.
hdfs dfs -expunge
get: hdfs dfs -get hdfs_src local_destination
ls: list files and directories. -R is recursive.
hdfs dfs -ls -R
hdfs dfs -mkdir /paths
hdfs dfs -moveFromLocal localfile1 localfile2 localfile3 dest
hdfs dfs -moveToLocal source local_destination
hdfs dfs -mv source destination
hdfs dfs -put localsrc dest
hdfs dfs -rm -r url
hdfs dfs -setrep 3 path
hdfs dfs -stat uri // statistical information about the path (size, modification time, ...)
hdfs dfs -tail url // last KB of the file.
hdfs dfs -test -[ezd] url // -e: does the file exist?, -z: is it zero length?, -d: is it a directory?
hdfs dfs -text src // outputs the source file in text format
hdfs dfs -touchz url // create an empty file.

mapred historyserver // start the MapReduce JobHistory server, which keeps the history of completed jobs.

Other useful commands: getfacl, getfattr, getmerge, setfacl, setfattr (examples below).
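
A few hedged examples of these commands; /file, /dir and the user name are illustrative only:

hdfs dfs -getfacl /file                        # show the access control list of the file
hdfs dfs -setfacl -m user:hadoop:r-x /file     # add an ACL entry for user hadoop
hdfs dfs -getfattr -d /file                    # list the extended attributes of the file
hdfs dfs -setfattr -n user.owner -v venu /file # set an extended attribute
hdfs dfs -getmerge /dir /tmp/merged.txt        # merge the files under /dir into one local file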

Can you explain few Hadoop admin commands?

hdfs version: to check version of hadoop.
balancer: HDFS does not store data uniformly across the DataNodes, especially after commissioning or decommissioning nodes, so a few nodes end up with much more load than others. The balancer command resolves this by redistributing blocks so that data is placed uniformly across the cluster.

hdfs balancer -threshold <percentage> -policy <datanode|blockpool>
e.g.: hdfs balancer -threshold 80 -policy datanode

mover: Similar to the balancer, but for storage policies. It periodically scans the files in HDFS and checks whether the blocks satisfy the storage policy rules, moving them if they do not.

hdfs mover -p <hdfs files/dirs> -f <local file listing hdfs paths>

daemonlog:
Get/set the log level of a class in a running daemon, through the daemon's HTTP interface.
hadoop daemonlog -getlevel <host:httpport> <classname>
hadoop daemonlog -setlevel <host:httpport> <classname> <level>

datanode:
hdfs datanode -rollback // roll back the DataNode to the previous version; there are many more options.
hdfs namenode -upgrade
hdfs secondarynamenode -checkpoint
hdfs dfsadmin -refreshNodes,
hdfs dfsadmin -safemode leave

hadoop dfsadmin -report // report basic filesystem information and statistics
hadoop dfsadmin -safemode leave/enter/get/wait // get out of safemode, enter safemode, query it, or wait for it
hadoop dfsadmin -refreshNodes // force the NameNode to reread its hosts include/exclude configuration

hadoop distcp /source /destination
// copy files recursively from one location/cluster to another
hadoop fsck /path/to/check -move (or -delete)
// fsck is designed for reporting problems with files and checking the health status of the file system.
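
A few more hedged fsck invocations, assuming a running cluster (the paths are illustrative):

hdfs fsck / -files -blocks -locations   # list every file with its blocks and the DataNodes holding them
hdfs fsck /user/hadoop -openforwrite    # report files currently open for write
hdfs fsck / -list-corruptfileblocks     # print only corrupt blocks and the files they belong to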

hdfs balancer -threshold <percentage of disk capacity>
hdfs datanode -rollback

distcp: Distributed copy tool used to copy large amounts of data from one or more sources to a destination, within or across clusters.
e.g./syntax: hadoop distcp hdfs://nn:8020/file1 /destinationfile
If the destination is missing files or holds an older version, use -update to copy only what has changed.
If the files already exist and you want to replace them unconditionally, use -overwrite.

e.g.: hadoop distcp -update /source1 /source2 /target

hdfs dfs:
Used to run file system shell commands against the file systems Hadoop supports (HDFS, the local file system, and others).
hdfs dfs -cat hdfs://nn:8020/file1

fetchdt:
Fetches a delegation token from the NameNode and stores it in a file on the local system, so a process without full security credentials (e.g., a non-secure client) can later use it to access the secure server.
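
A hedged example, assuming the NameNode web interface is reachable at namenodehost:9870 (host, port and token path are illustrative):

hdfs fetchdt --webservice http://namenodehost:9870 /tmp/my.delegation.token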

fsck: used for reporting problems in files and checking the health status of the cluster. In most cases the NameNode automatically corrects recoverable failures (for example by re-replicating under-replicated blocks) on its own.
hdfs fsck files_path
jar:
hadoop jar jarfile_path main_class args
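
A hedged usage example, assuming the bundled MapReduce examples jar is present under the Hadoop installation (the exact path and version differ per install):

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output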

mapred job -submit job_file, …

pipes: used to run/execute C++ MapReduce programs on Hadoop.
mapred pipes -input path, …
Queue:
mapred queue -list

mapred classpath

Notes on RDBMS & NoSQL

RDBMS

Structured, semi-structured and unstructured data, for small data sizes

Normalized

OLAP/OLTP

CRUD operations

Vertically scalable

ACID

Schema oriented

Row oriented

Joins, views, triggers and primary keys are allowed

Stored procedures

JDBC, ODBC connections

NoSQL

Structured and semi-structured data for large data sizes

Unstructured data (smaller scale)

Denormalized

OLAP/OLTP

Horizontally scalable

Non-ACID

Schema-less

Column oriented, graph, document (and key-value) stores

No joins, no views, no triggers; primary keys (row keys) are allowed

JDBC, ODBC connections (via additional drivers/layers)

more info: http://www.noqul.org

………..

If the application needs ACID functionality, don't go for NoSQL.

Atomicity, Consistency, Isolation, Durability (ACID)

E.g., a bank: ACID is required; if multiple users access the same account, a transaction either fails or the others wait until it finishes, and rollback is also available.

An online newspaper such as Eenadu needs no ACID: anyone can read and process it concurrently.

..

NoSQL is not a replacement for RDBMS.

Using NoSQL on top of (alongside) RDBMS is the best approach.

………..

CAP Theorem: Consistency, Availability, Partition tolerance.

Out of the 3, a distributed system can only guarantee 2.

Hadoop 1.x: C, P

Hadoop 2.x: C, A, P

HBase: C, P (availability was the gap before the 2.x HA features)

Cassandra: A, P

MongoDB: C, P

Two types of architecture are available:

master-slave

peer-to-peer architecture: Cassandra.

Availability is the risk factor in master-slave.

Consistency is the main problem in peer-to-peer architecture.

The gossip mechanism is used in NoSQL peer-to-peer systems.

Every node communicates with the others, so data stays available even if a node is lost; the data becomes eventually consistent.

HBase is not fully ACID, but it follows ACID semantics at the row level.

 

NoSQL categories:

Key-value stores: Memcached, DynamoDB, ... Querying inside the data is not available; they simply store and retrieve values by key.

Column oriented: HBase, Cassandra, BigTable, ... Harder to work with, but querying is available.

Document oriented: MongoDB, CouchDB, ... Querying is easily available (JSON documents).

Graph oriented: Neo4j, OrientDB, ... Used in social media; the focus is not bulk storage but the relationships between users.

 

 

Hbase

Why does HBase use Phoenix? Why doesn't NoSQL support SQL queries?
By default NoSQL does not support SQL queries, but with the help of additional tools it is possible to run SQL commands on top of NoSQL databases. For example, Phoenix is a tool that runs on top of HBase: it is a SQL layer over HBase and uses a JDBC driver to convert user queries into a form HBase understands.
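
A minimal sketch, assuming Phoenix is installed next to HBase and ZooKeeper runs on localhost (host, port and table name are illustrative):

./bin/sqlline.py localhost:2181:/hbase
-- inside the sqlline prompt, ordinary SQL works against HBase-backed tables:
-- CREATE TABLE users (id BIGINT PRIMARY KEY, name VARCHAR);
-- UPSERT INTO users VALUES (1, 'venu');
-- SELECT * FROM users;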

 

Big data:

CRUD is not supported directly on big data (raw HDFS/MapReduce).

Updates are not possible on big data files.

Checking whether a particular keyword is present or not is not supported directly (no indexing).

……….

With a suitable store on top, query results come back in a fraction of the time and CRUD operations are supported.

HBase and Cassandra were implemented as part of big data implementations to fill this gap.

StumbleUpon implemented HBase (its query language is not 100% SQL); with Phoenix on top, SQL is 100% supported.

Cassandra started at Facebook, drew on Amazon's Dynamo design, and is now mainly driven by DataStax alongside Apache Cassandra; CQL is almost 100% SQL-like, but joins are not supported.

 

HBase was built to address the scalability issue and runs on top of HDFS, which is why it fits so well; other NoSQL applications do not run on top of HDFS.
Any application can use HBase. Most usage is through code (client APIs) rather than shell commands. HBase is still only at a 0.x version.

A column family is a collection of columns. There is no concept of a database; everything in HBase is a table.

A collection of column families is called a table.

The only data type in HBase is bytes.

Local mode.

Pseudo-distributed mode: the HMaster and the other daemons run on a single machine.

Cluster mode: internal or external ZooKeeper; for protection, production clusters use an external ZooKeeper ensemble.

ZooKeeper does monitoring somewhat like Nagios. If the system must be highly available, you must use ZooKeeper. It runs Java threads: it simply launches a number of threads and observes them. Ganglia and Nagios are network monitoring tools, but ZooKeeper works at the process level.

…………..

 

How to verify whether HBase is running in distributed mode or not?

In hbase-site.xml, if hbase.cluster.distributed is true you are in distributed mode; if it is false, you are not.

hbase-env.sh

The last line, export HBASE_MANAGES_ZK=true, means HBase manages its own local/internal ZooKeeper.

Start ZooKeeper first, then the master and the region servers (see the sketch below).
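
A hedged sketch of starting the daemons by hand with the standard HBase scripts (with HBASE_MANAGES_ZK=true, start-hbase.sh alone is enough):

hbase-daemon.sh start zookeeper     # only needed when HBase manages ZooKeeper itself
hbase-daemon.sh start master
hbase-daemon.sh start regionserver
# or, in one step:
start-hbase.sh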

hbase shell: starts the HBase prompt.

list: lists the tables.
hadoop fs -rmr /file   // deletes a file from HDFS (deprecated; use hdfs dfs -rm -r).

create 'table_name', 'column_family'   // you can add multiple column families, but it still creates only one table.

create 't', 'cf', 'cf1', 'cf2', 'cf3'

No semicolon is needed; just pressing Enter ends the line.

In a column family you can add millions of columns dynamically; that is not possible in RDBMS.

A column is not the same as a column family.
Wildcard (regex) patterns are also supported.

e.g.: list 'r.*' shows all tables whose names start with r; plain list shows all tables.

CRUD

create, get, scan, put, delete, drop (CRUD)

put 'test', 'row1', 'cf:a', 'value'

Here a is the column (inside column family cf) and 'value' is the value assigned to it.

HBase and Cassandra follow a sparse matrix to store the data, while RDBMS follow a dense matrix.

If you assign a new value to the same cell, the old value is kept as a revision/version.
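
A hedged end-to-end example in the HBase shell; the table, column family and values below are illustrative only:

create 'test', 'cf'
put 'test', 'row1', 'cf:a', 'value1'
get 'test', 'row1'
scan 'test'
delete 'test', 'row1', 'cf:a'
disable 'test'
drop 'test'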

……………………

MR integration

Filter

Phoenix……..

MR Integration: why go for MR integration?

HBase reads data sequentially, but MapReduce works in parallel, so HBase sometimes uses MR integration for better results.
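
A hedged example using one of the MapReduce jobs bundled with HBase, assuming the hbase command is on the PATH and a table named 'test' exists (the table name is illustrative):

hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'test'   # counts the rows of the table with a MapReduce job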

 

 

 

Hive Video Tutorials

Hadoop uses the Hive tool to run SQL queries on top of MapReduce. In this post I explain different ways to create Hive tables and load data from the local file system, from HDFS and from another HQL table.

 

List the databases in Hive: show databases;

Create database: create database hiveSeries;

Select the database to store tables in: use hiveseries;

display current Database: set hive.cli.print.current.db=true;

C1) Create table : create table cellnumbers (name string, age int, cell bigint) row format delimited fields terminated by ‘,’;

List the tables: show tables;

 

Load data from local system:
L1) load data local inpath ‘/home/hadoop/Desktop/cellNo.txt’ into table cellnumbers;

Check the table schema: describe cellnumbers;

select * from cellnumbers;

Advanced Tips to create Hive tables.

C2) Create and load data from an existing table:
create table namewithcell as select name, cell from cellnumbers;

C3) Create one field table for testing:
create table onecell (cell bigint);

L2) Insert data from an existing Hive table:
insert into table onecell select cell from cellnumbers;

C4) Create another table for testing:
create table temp (name string, age int, cell bigint) row format delimited fields terminated by ‘,’;

L3) Load data from HDFS:
load data inpath ‘/samplefile’ into table temp;

Check the data in UI;
http://localhost:50070

Files:
CellNo:
Venu,30,9247159150
satya,28,8889990009
sudha,33,7788990009
venkat,23,8844332244
sudhakar,10,993322887
jyothi,34,6677889900

Create external table, commenting, Alter table, Overwrite table

Debugging: describe table, describe extended, describe formatted: to inspect the schema of the table.

Comment: a comment is added when you create a Hive table, purely to make the schema easier to understand.
Overwrite: this option overwrites the table; it deletes the old data and replaces it with the new data.
External table: the table is created externally (outside the warehouse directory), so when you drop the table only the metadata is deleted, not the original data.

create external table external_table (name string comment “First Name”, age int, officeNo bigint comment “official number”, cell bigint comment “personal mobile”) row format delimited fields terminated by ‘\t’ lines terminated by ‘\n’ stored as textfile;

load data local inpath '/home/hadoop/Desktop/cellNo.txt' overwrite into table external_table;
Display the column header names:
set hive.cli.print.header=true;
Alter table: you can change the structure of the table, e.g. rename the table, add new columns, or replace the columns. Dropping an individual column is not possible, but as an alternative you can create a new table with only the desired columns.
alter table onecell drop column f_name;  // not possible in Hive
create table one_cell as select cell from onecell; // the alternative technique

alter table cellnumbers rename to MobileNumbers;
alter table onecell add columns (name string comment “first name”);
alter table onecell replace columns (cell bigint, f_name string);

………….

Collection data types:
create table details (name string, friends array<string>, cell map<string, bigint>, others struct<company:string, your_pincode:int, married:string, salary:float>) row format delimited fields terminated by '\t' collection items terminated by ',' map keys terminated by ':' lines terminated by '\n' stored as textfile;

load data local inpath ‘/home/hadoop/Desktop/complexData.txt’ into table details;
select name, cell[‘personal’], others.company from details;
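
For reference, each line of complexData.txt has to follow the delimiters declared above: a tab between fields, a comma between collection items and a colon between map keys and values. The line below is a purely hypothetical sample (<TAB> stands for a tab character):

venu<TAB>ravi,kiran<TAB>personal:9247159150,office:9100000000<TAB>cisco,500001,yes,50000.0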

…………………………………….

Bucketed Hive Table | How to sample a Hive table?

Create sample table: create table samp (country string, code string, year int, population bigint) row format delimited fields terminated by ‘,’;

Load data to sample table: load data local inpath ‘/home/hadoop/Desktop/Population.csv’ into table samp;

Create bucket table: create table buck (countryName string, countrycode string, year int, population bigint) clustered by (year) sorted by (countryname) into 45 buckets;

enable bucketing table: set hive.enforce.bucketing = true;

insert data into bucket table: insert overwrite table buck select * from samp;
Test bucket table: select countryName, population from buck tablesample( bucket 10 out of 45 on year);
Test normal table: select country, population from samp limit 10;
////////////////////////////////////////////////

Hive Inner Join, Right Outer Join, Map side Join


What is Join?
Join is a clause that combines the records of two tables (or Data-Sets).

Create sample tables to join

create table products (name string, user string, units int) row format delimited fields terminated by ‘,’;
create table price (product string, price int) row format delimited fields terminated by ‘,’;
create table users (user string, city string, cell bigint) row format delimited fields terminated by ‘,’;

load data local inpath ‘/home/hadoop/Desktop/price.txt’ into table price;
load data local inpath ‘/home/hadoop/Desktop/productname.txt’ into table products;
load data local inpath ‘/home/hadoop/Desktop/users.txt’ into table users;

Inner Join
……………..
select products.* , users.city, users.cell from products join users on products.user = users.user;
………..
Left Outer Join:

select products.* , users.city, users.cell from products left outer join users on products.user = users.user;

Right Outer Join:
——————
select products.* , users.* from products right outer join users on products.user = users.user;

Full Outer Join:
select products.* , users.* from products full outer join users on products.user = users.user;

Map Join:
——————
All the work is performed by the mappers only, which is suitable when one of the tables is small and you want to optimize the job. When you need to join a large table with a small table, Hive can perform a map-side join and skip the reduce phase.

select /*+ mapjoin(users) */ products.* , users.* from products join users on products.user = users.user;

select /*+ mapjoin(products) */ products.* , users.* from products join users on products.user = users.user;
select /*+ mapjoin(products) */ products.* , users.* from products right outer join users on products.user = users.user;

select /*+ mapjoin(users) */ products.* , users.* from products left outer join users on products.user = users.user;
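
As an alternative to the mapjoin hint, Hive can convert a join to a map join automatically when one side is small enough. A hedged sketch using standard Hive settings (the threshold shown is the usual default, in bytes):

set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;   -- size threshold for the small table
select products.*, users.city from products join users on products.user = users.user;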

///////////////////////////////

input formats

Input format .... key/value types

1) TextInputFormat .... LongWritable, Text

2) KeyValueTextInputFormat .... Text, Text

3) NLineInputFormat .... LongWritable, Text

4) DBInputFormat .... LongWritable, a DBWritable record

5) SequenceFileInputFormat .... as stored in the sequence file

6) Custom input format

Output formats

7) TextOutputFormat

8) SequenceFileOutputFormat

 

 

 

1) When you want to read plain text, use TextInputFormat; it cannot read images or videos.

KeyValueTextInputFormat:

2) If the input already consists of key/value pairs like <venu, 987987987>, use KeyValueTextInputFormat. It is still a text format, but each line is split into a key and a value; by default the separator is a tab, and it can be changed to ',' or another character.

3) NLineInputFormat: each mapper repeatedly receives exactly N lines of the file (for example 1000 lines), so the split size is controlled by the line count.

4) DBInputFormat is most often used for database imports (as Sqoop-style tools do); it reads records from a database, whatever format the table holds.

*5) SequenceFileInputFormat is a binary format, most often used for binary data; any file can be converted into a sequence file, and doing so improves performance.

6) A custom input format is written by extending an existing input format.

 

7) TextOutputFormat is the common text output format.

8) Mahout, Nutch and similar applications use SequenceFileOutputFormat internally.

 

Every input format internally has a RecordReader: it takes the input split, reads it record by record (for example line by line) and hands each record to the mapper; on the output side a RecordWriter writes the results.
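
A hedged sketch of choosing an input format from the command line, assuming the driver class uses ToolRunner/GenericOptionsParser so -D properties are honoured (my-job.jar and com.example.MyDriver are hypothetical names):

hadoop jar my-job.jar com.example.MyDriver \
  -D mapreduce.job.inputformat.class=org.apache.hadoop.mapreduce.lib.input.NLineInputFormat \
  -D mapreduce.input.lineinputformat.linespermap=1000 \
  /input /output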


Distributed cache for comparison

 

Inverted Indexing: sequence files.

Combine all the small text files into a single sequence file and process it with SequenceFileInputFormat; the content is converted into a binary format.


 
