
HBase, Phoenix, SQuirreL SQL, Zookeeper Installation

 

Hello! In this video I demonstrate HBase and Zookeeper configuration, and how to integrate Phoenix and SQuirreL SQL with HBase. It's simple and easy.

If you set export HBASE_MANAGES_ZK=true, HBase manages its own internal Zookeeper; with export HBASE_MANAGES_ZK=false, HBase uses an external Zookeeper ensemble.
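For reference, a minimal hbase-env.sh sketch (the JAVA_HOME path is a placeholder; point it at your own JDK):

# conf/hbase-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk    # placeholder, adjust for your machine
export HBASE_MANAGES_ZK=true                    # true = HBase starts/stops its own Zookeeper
                                                # false = you run an external Zookeeper ensemble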

 

HBase Master web UI: http://localhost:60010

Region Server web UI: http://localhost:60030

hbase shell
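To verify the installation, a quick sanity check from the HBase shell (the table and column family names are just examples):

hbase shell
create 'test', 'cf'                      # table with one column family
put 'test', 'row1', 'cf:a', 'value1'     # write one cell
scan 'test'                              # read it back
list                                     # show all tables

For Phoenix, the SQL client is started with sqlline.py localhost (pointing at the Zookeeper quorum), and SQuirreL SQL connects through the Phoenix JDBC driver.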

Hadoop Admin Interview Questions

What is the use of SSH in Hadoop?

SSH is not strictly required by the Hadoop framework itself, but the start/stop helper scripts (for example start-mapred.sh or start-dfs.sh) wrap the per-daemon commands and log in to each node over SSH to run them, so passwordless SSH is normally required. Otherwise you can start each daemon manually on its own node with a command such as:
hadoop-daemon.sh start namenode
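A typical passwordless SSH setup for the user that runs the daemons (hostnames are placeholders):

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa     # generate a key pair with no passphrase
ssh-copy-id hadoop@datanode1                 # repeat for every node in the cluster
ssh hadoop@datanode1 hostname                # should log in without a password
start-dfs.sh                                 # the script can now reach every node over SSH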

What is check pointing in Hadoop?

Checkpointing is the key mechanism for maintaining and persisting the file system metadata in HDFS: the edit log is periodically merged into the fsimage. A recent checkpoint is what allows the NameNode to be recovered and restarted quickly.
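One way to force a checkpoint by hand (a sketch; requires HDFS superuser rights):

hdfs dfsadmin -safemode enter      # stop accepting namespace changes
hdfs dfsadmin -saveNamespace       # write a fresh fsimage from the in-memory namespace
hdfs dfsadmin -safemode leave      # resume normal operation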

Can you explain a few compatibility issues in Hadoop?

Java API compatibility: Hadoop's public interfaces are Java APIs, so changes to them can break applications; the InterfaceAudience/InterfaceStability annotations describe which APIs are stable and which may change.
Java binary compatibility: most often this problem shows up in MapReduce and YARN applications compiled against one Hadoop version and run against another.
Semantic compatibility: the behaviour of the APIs must stay the same; the tests and javadocs specify an API's behaviour.
Wire compatibility: data is transferred between Hadoop processes over RPC, so client-to-server and server-to-server incompatibilities can arise; this most often happens during upgrades.
REST API compatibility: issues such as changed request and response formats are also common compatibility problems.

Why is ECC RAM recommended for servers?
Error Correcting Code (ECC) RAM can automatically detect and correct the most common internal data and memory errors, so in most cases production servers use ECC RAM.

Can you explain a few FileSystem shell commands?

appendToFile: append one or more local files to a file in HDFS.
hdfs dfs -appendToFile local_src1 src2 src3 destination
cat: print a file's contents.
hdfs dfs -cat filepath_to_read
chgrp: change the group association of files. -R for recursive.
hdfs dfs -chgrp -R group URI
chmod: change file permissions. -R for recursive.
hdfs dfs -chmod -R mode URI
chown: change ownership. -R for recursive.
hdfs dfs -chown -R owner:group URI
copyFromLocal: copy a local file into HDFS.
hdfs dfs -copyFromLocal localsrc URI
copyToLocal: copy an HDFS file to the local filesystem.
hdfs dfs -copyToLocal URI localdst
count: count the number of files and directories. -q shows quota information, -h is human readable.
hdfs dfs -count -q -h /file
cp: copy files; -f forcefully overwrites the destination if it already exists.
hdfs dfs -cp -f URI dest
hdfs dfs -cp src1 src2 target // copy multiple sources into a target directory
du: display the size of files. -s shows a summary of the total length, -h is human readable.
hdfs dfs -du -s -h uri
dus: display a summary of file lengths (deprecated; same as du -s).
hdfs dfs -dus uri
expunge: empty the trash.
hdfs dfs -expunge
get: hdfs dfs -get hdfs_src local_destination
ls: list files and directories. -R for recursive.
hdfs dfs -ls -R /path
hdfs dfs -mkdir /paths
hdfs dfs -moveFromLocal localfile1 localfile2 dest
hdfs dfs -moveToLocal source localdest
hdfs dfs -mv source destination
hdfs dfs -put localsrc dest
hdfs dfs -rm -r uri
hdfs dfs -setrep 3 path // set the replication factor
hdfs dfs -stat uri // statistics (size, modification time) about the path
hdfs dfs -tail uri // last kilobyte of the file
hdfs dfs -test -[ezd] uri // -e does the file exist? -z is it zero length? -d is it a directory?
hdfs dfs -text src // output a source file in text format
hdfs dfs -touchz uri // create an empty file

mapred historyserver // start the MapReduce JobHistory server, used to look up past job history.

Other useful commands: getfacl, getfattr, getmerge, setfacl, setfattr.
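A short session tying a few of these together (paths are placeholders):

hdfs dfs -mkdir -p /user/venu/demo              # create a directory
hdfs dfs -put notes.txt /user/venu/demo         # upload a local file
hdfs dfs -ls -R /user/venu                      # list it recursively
hdfs dfs -cat /user/venu/demo/notes.txt         # print its contents
hdfs dfs -setrep 3 /user/venu/demo/notes.txt    # change its replication factor
hdfs dfs -rm -r /user/venu/demo                 # clean up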

Can you explain a few Hadoop admin commands?

hdfs version: to check the version of Hadoop.
balancer: HDFS does not store data uniformly across the DataNodes, especially after nodes are commissioned or decommissioned, so a few nodes end up under much more pressure than others. The balancer command is used to resolve this: it redistributes blocks so the data is spread uniformly across the cluster.

hdfs balancer -threshold <percent> -policy <datanode|blockpool>
eg: hdfs balancer -threshold 80 -policy datanode

Mover: similar to the balancer, but for storage policies. It periodically scans the files in HDFS and checks whether the blocks satisfy the configured storage policy, moving those that do not.

hdfs mover -p <files/dirs> | -f <local file listing paths>

daemonlog:
Get/set the log level of a class inside a running daemon (via the daemon's HTTP /logLevel servlet).
hadoop daemonlog -getlevel localhost:50070 org.apache.hadoop.hdfs.server.namenode.NameNode
hadoop daemonlog -setlevel localhost:50070 org.apache.hadoop.hdfs.server.namenode.NameNode DEBUG

datanode:
hdfs datanode -rollback // roll the DataNode back to the previous version, and more.
hdfs namenode -upgrade
hdfs secondarynamenode -checkpoint
hdfs dfsadmin -refreshNodes
hdfs dfsadmin -safemode leave

hadoop dfsadmin -report // Report basic filesystem information.
hadoop dfsadmin -safemode leave/enter/get/wait // leave or enter safemode, get the current state, or wait until it is off
hadoop dfsadmin -refreshNodes // Force the NameNode to re-read its hosts and exclude files (used when commissioning/decommissioning nodes).

hadoop distcp /source /destination
// Copy files or directories recursively from one path (or cluster) to another.
hdfs fsck /path/to/check -move (or -delete)
// Designed for reporting problems (missing, corrupt or under-replicated blocks) and checking the file system's health status.


distcp: distributed copy tool, used to copy large amounts of data from one or more sources to a single destination, within or across clusters.
eg/syntax: hadoop distcp hdfs://nn1:8020/file1 hdfs://nn2:8020/destinationfile
Use -update to copy only files that are missing or have changed at the destination.
Use -overwrite to overwrite files that already exist at the destination.

eg: hadoop distcp -update /source1 /source2 /target

hdfs dfs:
Used to run file system commands against the file systems Hadoop supports (HDFS, the local file system, and others).
hdfs dfs -cat hdfs://nn:8020/file1

fetchdt:
Fetches a delegation token and stores it in a file on the local system, so that a secure (Kerberos-enabled) cluster can be accessed from a non-secure client.
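A rough example (the token file path is a placeholder, and the exact options can vary by Hadoop version):

hdfs fetchdt /tmp/my.delegation.token              # fetch a delegation token into a local file
hdfs fetchdt --print /tmp/my.delegation.token      # inspect the token
hdfs fetchdt --cancel /tmp/my.delegation.token     # cancel it when no longer needed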

FSCK: used to report problems with files (missing, under-replicated or corrupt blocks) and to check the health status of the cluster. It does not repair anything itself; the NameNode automatically corrects most recoverable failures on its own.
hdfs fsck /files_path
jar:
hadoop jar jarfile_path main_class args

mapred job -submit job_file, …

Pipes is used to run C++ MapReduce programs on Hadoop.
mapred pipes -input path, …
Queue:
mapred queue -list

mapred classpath

Notes on RDBMS & NoSQL

RDBMS

Structured, semi-structured and unstructured data, at small data sizes

normalized

OLAP/OLTP

CRUD

vertically scalable

ACID

schema oriented

row oriented

joins, views, triggers & primary keys allowed

stored procedures

JDBC, ODBC connections

 NoSQL

structured and semi-structured data, at large data sizes

unstructured (small)

de-normalized

OLAP/OLTP

horizontally scalable

non-ACID

schema-less

column oriented, graph, document oriented

no joins, no views, no triggers; primary keys allowed

JDBC, ODBC connections

more info: http://www.noqul.org

………..

If an application needs ACID functionality, don't go for NoSQL.

Atomicity, Consistency, Isolation and Durability (ACID)

E.g., a bank: ACID is required. If multiple users access the same account, a transaction either fails or waits until the other finishes; rollback is also available.

An online newspaper (e.g., Eenadu), on the other hand, needs no ACID: anyone can read and process the content concurrently.

..

NoSQL is not a replacement for RDBMS.

Using NoSQL alongside (on top of) an RDBMS is often the best approach.

………..

CAP theorem: Consistency, Availability, Partition tolerance.

Out of the three, a distributed system can only guarantee two.

Hadoop 1.x: C, P

Hadoop 2.x: C, A, P

HBase: C, A (not available before 2.x), P

Cassandra: A, P

MongoDB: C, P

Two types of architecture are available:

master-slave

peer-to-peer architecture: Cassandra.

Availability is the risk factor in master-slave architecture (the master is a single point of failure).

Consistency is the main problem in peer-to-peer architecture.

A gossip mechanism is used in NoSQL peer-to-peer systems.

Every node communicates with the others, so data stays available even when a node is lost; the data becomes eventually consistent.

HBase is not fully ACID, but it follows ACID-like behaviour (e.g., row-level atomicity).

 

NoSQL categories:

Key-value: Memcached, DynamoDB. Data is stored and fetched by key; rich querying of the data is not available.

Column oriented: HBase, Cassandra, BigTable. Querying is more difficult, but available.

Document oriented: MongoDB, CouchDB. Querying is easy (documents are JSON-like).

Graph oriented: Neo4j, OrientDB. Used in social media; the focus is not bulk storage but the relationships between users.

 

 

Input formats

Input format — key/value data types

1) TextInputFormat — LongWritable, Text

2) KeyValueTextInputFormat — Text, Text

3) NLineInputFormat — LongWritable, Text

4) DBInputFormat — LongWritable, a DBWritable implementation

5) SequenceFileInputFormat

6) Custom input format

Output formats

7) TextOutputFormat

8) SequenceFileOutputFormat

 

 

 

1) When you want to read plain text, use TextInputFormat; it is not possible to read images or videos with it.

2) KeyValueTextInputFormat: if the input consists only of key-value pairs like <venu, 987987987>, use this format. It is also a text format, but each line is split into a key and a value; the default separator is the tab character, and it can be changed to a comma or any other character.

3) NLineInputFormat: each mapper repeatedly receives a fixed number of lines of the file (N lines, e.g. 1000) rather than a whole split.

4) DBInputFormat: reads records from a relational database; most often used by Sqoop. The record layout depends on the database table.

5) SequenceFileInputFormat: a binary format, most often used for binary data; any file can be converted into this binary form, which improves performance.

6) Custom input format: created by extending an existing input format class.

7) TextOutputFormat: the common, default text output format.

8) SequenceFileOutputFormat: applications such as Mahout and Nutch use this output format as their backend format.

 

Every input format internally has a RecordReader; it takes the input split and reads it record by record (for text formats, line by line), handing each key-value pair to the mapper. On the output side, the output format has a corresponding RecordWriter.
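As a rough illustration (a sketch, assuming the Hadoop streaming jar sits at the usual location; input/output paths and the mapper/reducer commands are placeholders), the input and output formats can be chosen on the command line:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat \
  -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat \
  -input /user/venu/phone_book \
  -output /user/venu/phone_book_out \
  -mapper /bin/cat \
  -reducer /usr/bin/wc

Streaming expects the old (org.apache.hadoop.mapred) API class names for -inputformat and -outputformat.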


Distributed cache can be used for comparison/lookups: it ships small side data to every node running the job.

 

Inverted Indexing: sequence files.

Many small text files can be combined into a single "sequence file" and processed with SequenceFileInputFormat; the data is converted into a binary format.

Inverted Indexing: