Optimize Hive Query Performance

The default configuration is not suitable for all applications. A few changes can optimize your Hadoop and Hive queries. In this post I explain different ways to optimize Hive, along with a few Hive technical interview questions.

Is SQL scalable? How do you run SQL queries on Hadoop?

By default SQL is not scalable; SQL databases scale vertically. Hadoop scales horizontally. Hive is a Hadoop component that allows programmers to run SQL queries on top of Hadoop.

Why is scalability so important in Hadoop?

For example: a job runs on 3 nodes out of 5. If one node fails, the job automatically continues on a 4th node. In distributed processing, node failures are expected, so the job completes without interruption.

What are the pros and cons of a broadcast join?
In a broadcast join, the small table is loaded into memory on all nodes. Mappers scan through the large table and perform the join. It is best suited for small tables: it is fast and requires only a single scan of the large table. However, if the small table does not fit in RAM, it cannot be processed, so the table being broadcast must be smaller than the available memory. If it is larger than RAM, use a Sort-Merge-Bucket (SMB) join instead.

How does the Cost Based Optimizer (CBO) optimize Hive?
CBO was introduced in Hive 0.14. Its main goal is to generate efficient execution plans. Cost Based Optimization leverages statistics collected on Hive tables to improve the query plan. Set these two parameters:
set hive.compute.query.using.stats=true;
set hive.stats.dbclass=fs;

How does the Tez execution engine optimize Hive performance?
If you are using Hadoop 2.x, use the Tez execution engine for better performance. Run the following in the Hive shell to enable it.
set hive.execution.engine=tez;

What are skewed tables in Hive?
When certain values appear very often in a column, skewed tables are highly recommended. A skewed table splits the frequently appearing values into separate files and the rest of the values into other files. So when a user queries the table, the heavily skewed values don't have to be processed again, and as a result Hive performance improves.

CREATE TABLE TableName (column1 STRING, column2 STRING) SKEWED BY (column1) ON ('frequent_value') STORED AS DIRECTORIES;

What is Vectorization?
Use vectorization to improve query performance. It processes a batch of rows at a time instead of a single row. Run the following in the Hive shell to enable it.
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;

What is ORCFile? How does it optimize Hive query performance?
Use the ORCFile format to optimize query performance. SNAPPY is a good compression codec to use with the ORC format.
CREATE TABLE ORC_table (EmpID int, Emp_name string, Emp_age int, address string) STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");

Other tips to optimize Hive performance:

  • If you join two tables, one small and one very large, use a map-side join to optimize the task.
  • If a SELECT has multiple fields, break it into simpler queries, e.g. SELECT count(1) FROM (SELECT DISTINCT column_field FROM table_name) t;
  • Partition imported data, for example into hourly buckets based on a time column.
  • Use a WHERE clause to avoid scanning unnecessary data.

Eg: SELECT name, age, cell FROM biodata WHERE time > 1349393020;
In Hive, ORDER BY uses a single reducer, while SORT BY can use multiple reducers.
So if you process a large amount of data, don't use ORDER BY; prefer SORT BY.

Eg: SELECT name, location, voterid FROM aadhar_card DISTRIBUTE BY name SORT BY age.

Increase parallelism: add the following settings to compress the data and to set the maximum split size to 256 MB.

SET hive.exec.compress.output=true;
SET mapred.max.split.size=256000000;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

If you are processing or joining a small table and a large table, use a map-side join.

If you enable "set hive.auto.convert.join=true;",
Hive can automatically convert suitable joins into map-side joins, which optimizes job performance when you perform join operations.

Paste the following into mapred-site.xml to reduce the amount of data transferred during sort & shuffle.

These properties compress the intermediate output of the mappers.
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

If possible, apply an SMB map join.
A Sort-Merge-Bucket (SMB) join is faster than a plain map join. It's very efficient when applicable, but it can only be used when the tables are sorted and bucketed on the join key.
To enable it, use these configuration settings.
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
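For reference, here is a minimal sketch (with illustrative table and column names) of two tables bucketed and sorted on the join key, which is the precondition for the SMB join described above:

CREATE TABLE orders_b (order_id INT, cust_id INT, amount DOUBLE)
CLUSTERED BY (cust_id) SORTED BY (cust_id) INTO 32 BUCKETS STORED AS ORC;
CREATE TABLE customers_b (cust_id INT, name STRING)
CLUSTERED BY (cust_id) SORTED BY (cust_id) INTO 32 BUCKETS STORED AS ORC;
-- with the settings above, Hive can convert this join into an SMB join
SELECT o.order_id, c.name FROM orders_b o JOIN customers_b c ON o.cust_id = c.cust_id;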

Spark Advanced Interview questions

In my previous post I shared a few Spark interview questions; please check them out. If you want to learn Apache Spark, contact me.

What is DataFrame?

SQL + RDD = Spark DataFrame

DataFrames are a SQL programming abstraction on top of Spark Core. A DataFrame is essentially an RDD with a schema. It eases many Spark problems. The DataFrame API is available in Scala, Java, Python, and R, so programmers in any of these languages can create DataFrames.
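As a rough illustration in PySpark (the exact entry point depends on your Spark version; SparkSession shown here is the Spark 2.x API), a DataFrame is created from plain rows plus a schema:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("df-demo").getOrCreate()
# two columns, "id" and "name"; the schema is what turns the rows into a DataFrame
df = spark.createDataFrame([(1, "venu"), (2, "ravi")], ["id", "name"])
df.filter(df.id > 1).show()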

What is the importance of glom in Spark? Glom is an RDD method (RDD.glom()) that returns a new RDD in which all elements of each partition are coalesced into an array. Normally a partition is processed one row at a time, but RDD.glom() lets you treat a whole partition as an array rather than as single rows.
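A minimal PySpark sketch of glom(), assuming an existing SparkContext sc:

rdd = sc.parallelize(range(10), 3)   # 3 partitions
print(rdd.collect())                 # individual elements: [0, 1, ..., 9]
print(rdd.glom().collect())          # one array per partition, e.g. [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]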

What is Shark?

It's an older version of Spark SQL. It allowed running Hive on Spark, but it has since been replaced by Spark SQL.

What is the difference between Tachyon and Apache Spark?

Tachyon is a memory-centric distributed storage system that shares memory across a cluster. Programmers can run Spark, MapReduce, Shark, and Flink on top of it without any code change. Spark, on the other hand, is a cluster computing framework on which you can run batch, streaming, and interactive analytics rapidly. Speed and lazy evaluation are the strengths of Spark.

What is the difference between a framework, a library and an API?

An API is the part of a library that defines how external code interacts with it; when the library is asked to do something, the API is what serves the request. A library is a collection of classes packaged to perform a specific task. A framework provides functionality and solutions for a particular problem area and supplies the surrounding environment; it is the backbone for developing software applications.

What is history of Apache Spark?

Apache Spark was originally developed in the AMPLab at UC Berkeley and later moved to the Apache Software Foundation as a top-level project. Databricks, a major Spark contributor, was founded by the creators of Apache Spark.

What is the main difference between Spark and Storm?

Spark performs data-parallel computation, whereas Storm performs task-parallel computation. By comparison, Storm processes events more quickly, but both are open-source, distributed, fault-tolerant and scalable systems for processing streaming data.

What is Data scientist responsibility?

A data scientist analyzes data for insights and models data for visualization. He or she typically has experience with SQL, statistics, predictive modeling, and programming languages such as Python and R.

What are a data engineer's responsibilities?

A big data engineer usually builds production data processing applications. Most often the engineer monitors, inspects, and tunes those applications using programming languages.

When do you use Apache Spark?

For iterative and interactive applications that need faster processing, and for real-time stream processing. When you want a single platform for batch, streaming, and interactive applications, Apache Spark is the best choice.

What is the purpose of the GraphX library?

Most social media sites generate graphs. GraphX is used for graphs and graph-parallel computation, and it ships with common graph algorithms.

What is Apache Zeppelin?

It's a collaborative data analytics and visualization tool for Apache Spark and Flink. It is in the incubating stage, which means it is not yet stable and is still being implemented.

What is Jupyter?
It evolved from the IPython project. It has built-in visualization support for Python 3 and also supports R, Ruby, PySpark, and other languages.

What is Data Scientist Workbench?
An interactive data platform built around the Python tool Jupyter. It comes pre-installed with Python, Scala and R.

What is Spark Notebook?
It's a Spark SQL tool that dynamically injects JavaScript libraries to create visualizations.

What is Databricks Cloud?
It's available on AWS. If you have an EC2 account, you can use it.

What is Zeppelin?
Zeppelin is an analytical tool that supports multiple language back-ends; by default it supports Scala with a SparkContext.

Is DFS mandatory to run Spark?
No, HDFS is not needed. RDDs use the Hadoop InputFormat API to read data, so an RDD can be backed by any storage system such as AWS, Azure, Google Cloud, or the local file system. Any InputFormat implementation can be used directly in Spark, so input from HBase, Cassandra, MongoDB, or a custom input format can be processed directly as an RDD.

How does Spark identify data locality?
Usually the InputFormat specifies the splits and their locality. RDDs use the Hadoop InputFormat API, so partitions correlate to HDFS splits. Spark can therefore easily identify data locality when it's needed.

How does coalesce increase RDD performance?
Shuffles can decrease RDD performance. repartition increases the number of partitions (for example after a heavy filter) but triggers a shuffle, whereas coalesce decreases the number of partitions without a full shuffle. Coalesce is useful to consolidate partitions before writing output to HDFS, at the cost of parallelism. So coalesce directly affects RDD partition performance.
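A small PySpark sketch of the difference, assuming an existing SparkContext sc:

rdd = sc.parallelize(range(1000), 100)          # 100 partitions
filtered = rdd.filter(lambda x: x % 10 == 0)    # many partitions are now nearly empty
small = filtered.coalesce(4)                    # merge partitions without a full shuffle
print(small.getNumPartitions())                 # 4
wide = small.repartition(16)                    # increases partitions again, but triggers a shuffle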

What is the importance of co-located key-value pairs?

In some cases many values share the same key, especially in iterative workloads. Co-locating the values for a key benefits many operations. RangePartitioner and HashPartitioner ensure that all pairs with the same key end up in the same partition.

What are the statistical operations on numeric RDDs?

Standard deviation, mean, sum, min and max. stats() returns all of these statistics at once.
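For example, in PySpark (assuming an existing SparkContext sc):

nums = sc.parallelize([1.0, 2.0, 3.0, 4.0])
print(nums.mean(), nums.stdev(), nums.sum(), nums.min(), nums.max())
print(nums.stats())   # a StatCounter with count, mean, stdev, max, min and sum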

Can RDDs be shared across applications?

No. RDDs can be shared across the nodes of the cluster within one application, but not across applications.

What is the difference between reduce and fold?

Both aggregate the elements of an RDD with a function, but fold also takes an initial "zero" value that is used as the starting element for each partition. Please find the full answer here.

What is SBT?

SBT is an open-source build tool for Scala and Java projects. It's similar to Java's Maven.

What is the use of Kryo serialization in Spark?

By default Spark uses Java serialization. Kryo is a faster and more compact serialization framework for Java objects that Spark can use instead, and it's highly recommended when you shuffle or cache a large amount of data.
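A minimal sketch of enabling Kryo in PySpark via SparkConf (the application name is illustrative):

from pyspark import SparkConf, SparkContext
conf = (SparkConf()
        .setAppName("kryo-demo")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
sc = SparkContext(conf=conf)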

What is the SizeEstimator utility?

The Java heap is the amount of memory allocated to applications running in the JVM. SizeEstimator is a utility that estimates the size of objects in the Java heap; it helps decide how to partition and cache the data.

What is the pipe operator?

The pipe operator lets you process RDD data with an external application. After creating an RDD, the developer can pipe that RDD through a shell script or any executable; the external program reads records on stdin and writes results back to stdout.
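A small PySpark sketch, assuming an existing SparkContext sc and a Unix shell with tr available:

rdd = sc.parallelize(["hello", "big", "data"])
upper = rdd.pipe("tr 'a-z' 'A-Z'")   # each element is written to the command's stdin
print(upper.collect())               # ['HELLO', 'BIG', 'DATA']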

What are executors?

Spark sends the application code to the executors via the SparkContext. The SparkContext also sends tasks to the executors, which run the computations and store the data for your application. Each application has its own executors.

What is the functionality of the SparkContext object?

Any Spark application consists of a driver program and executors that run on the cluster; a jar contains the application code. The SparkContext object coordinates these processes.

What are DStreams?
A DStream is a sequence of RDDs; it is Spark Streaming's high-level abstraction. It represents a continuous stream of data from sources like Kafka or Flume, broken up into a series of batches.
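A minimal PySpark Streaming sketch, assuming an existing SparkContext sc and a socket source on localhost:9999 (both hypothetical):

from pyspark.streaming import StreamingContext
ssc = StreamingContext(sc, 5)                    # 5-second batches
lines = ssc.socketTextStream("localhost", 9999)  # DStream: a sequence of RDDs, one per batch
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                  # print a few results of every batch
ssc.start()
ssc.awaitTermination()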

Hive Vs Pig Difference

Hive and Pig are both Apache projects for analyzing vast data sets on top of Hadoop without writing MapReduce code. Both tools are used to analyze data, i.e. to perform OLAP operations. Depending on how it is stored, data is categorized into three types: structured, unstructured and semi-structured. Hadoop uses the Hive and Pig ecosystems to turn such analysis into optimized MapReduce jobs. Both ecosystems work on top of Hadoop and ultimately the same outcome can be achieved, but they follow different processes. The features of Pig and Hive are demonstrated here.

Hive is essentially a SQL-like interface on top of Hadoop for analyzing schema-based structured data. Pig is a high-level data-flow language for analyzing any type of data. Both Hive and Pig summarize and analyze data, but they follow different processes.


Hive vs Pig:

  • Hive: Hadoop must be started to run Hive. Pig: You don't need to start Hadoop; you can run Pig in local (standalone) mode or cluster mode, but Hadoop must be installed.
  • Hive: If you have a limited number of joins and filters, go ahead with Hive. Pig: Pig is highly recommended when you have a huge number of joins and filters.
  • Hive: Supports only structured data, so it is most often used in data warehouses. Pig: Can process both structured and unstructured data, so it is best suited for streaming data.
  • Hive: Supports user-defined functions, but they are much harder to debug. Pig: It is very easy to write a UDF to calculate metrics.
  • Hive: You manually create tables to store intermediate data. Pig: No need to create tables.
  • Hive: Stores metadata in a database such as Derby (the default), MySQL or Oracle. Pig: Has no metadata support.
  • Hive: Uses a separate query language called HQL that goes beyond standard SQL. Pig: Uses its own language called Pig Latin, a relational data-flow language.
  • Hive: Best suited for analysts, especially big data analysts familiar with SQL; most often used to generate reports and statistics. Pig: Best suited for programmers and software developers familiar with scripting languages like Python and Java.
  • Hive: Can operate with an optional Thrift-based server and runs on the server side of the cluster. Pig: Operates on the client side of the cluster; there is no server-side concept.
  • Hive: Executes quickly, but does not load data quickly. Pig: Loads data effectively and quickly.
  • Hive: Must be carefully configured in cluster or pseudo mode. Pig: Is installed for shell interaction, so no extra configuration is required; just extract the tar file.

 

Hive Interview Questions

What is Hive?
It's an open-source project under the Apache Software Foundation: a data warehouse software ecosystem in Hadoop. It manages vast amounts of structured data using the HQL language, which is similar to SQL.
Where is Hive best suited?
When you are building data warehouse applications,
when you are working with static data instead of dynamic data,
when the application can tolerate high latency (high response time),
when a large data set is maintained and mined for insights and reports,
and when you are writing queries instead of scripts.
When is Hive not suitable?
It doesn't provide OLTP transaction support, only OLAP.
If the application requires OLTP, switch to a NoSQL database.
HQL queries have higher latency because of MapReduce.

Does Hive support ACID transactions?
By default it doesn't support record-level update, insert and delete, but Hive 0.14 and later versions support insert, update and delete operations, so Hive can support ACID transactions.

To enable update and delete transactions in Hive 0.14 and later, you must change the following default values.

hive.support.concurrency – true
hive.enforce.bucketing – true
hive.exec.dynamic.partition.mode – nonstrict
hive.txn.manager – org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.compactor.initiator.on – true (for exactly one instance of the Thrift metastore service)
hive.compactor.worker.threads – a positive number on at least one instance of the Thrift metastore service

What is the Hive MetaStore?
The metastore is Hive's central repository; it stores metadata in an external database. By default Hive stores metadata in a Derby database, but you can store it in MySQL, Oracle, etc., depending on the project.
Why would I choose Hive instead of MapReduce?
Hive has partitions to simplify data processing and bucketing for sampling data, sorting data quickly, and simplifying the MapReduce process. Partitions and buckets segment large data sets to improve query performance in Hive, so it is highly recommended for structured data.
Can I use Hive without Hadoop?
Hive stores and processes data on top of Hadoop, but it's also possible to run it on other data storage systems like Amazon S3, GPFS (IBM) and the MapR file system.

What is the relationship between MapReduce and Hive? Or: how are MapReduce jobs submitted to the cluster?
Hive provides no additional processing capability beyond MapReduce. Queries are executed as MapReduce jobs via the interpreter. The interpreter runs on a client machine and turns HiveQL queries into MapReduce jobs, and the framework submits those jobs to the cluster.
If you run a SELECT * query in Hive, why doesn't it run MapReduce?
It's an optimization: the hive.fetch.task.conversion property can minimize the latency of MapReduce overhead by using a FETCH task. For simple SELECT, FILTER and LIMIT queries, this property skips MapReduce and uses a FETCH task, so Hive can execute the query without running a MapReduce job.

Its default value is "minimal", which optimizes SELECT *, FILTER on partition columns and LIMIT queries only; the other value, "more", also optimizes SELECT, FILTER and LIMIT with TABLESAMPLE and virtual columns.
How can Hive improve performance with ORC format tables?
Hive can store data very efficiently in the Optimized Row Columnar (ORC) file format, which overcomes many limitations of other Hive file formats. Using ORC files improves performance when reading, writing and processing data. Enable it by running these commands and creating the table like this.

set hive.compute.query.using.stats=true;
set hive.stats.dbclass=fs;

CREATE TABLE orc_table (
id int,
name string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\;'
LINES TERMINATED BY '\n'
STORED AS ORC;
What is the importance of Vectorization in Hive?
It's a query optimization technique. Instead of processing one row at a time, vectorization processes a batch of rows as a unit, which optimizes query performance. The table must be stored in ORC format to enable vectorization. It's disabled by default; enable it with this setting.

set hive.vectorized.execution.enabled=true;
What is the difference between SORT BY and ORDER BY in Hive? Which is faster?
ORDER BY sorts all the data in a single reducer. SORT BY is much faster than ORDER BY.
SORT BY sorts the data within each reducer, and you can use any number of reducers for the sort.

In the first case (ORDER BY) the maps send every value to a single reducer, which handles them all.
In the second case (SORT BY) the maps split the values across many reducers, and each reducer produces its own sorted list, so the sort completes quickly.
Example:

SELECT name, id, cell FROM user_table ORDER BY id, name;
SELECT name, id, cell FROM user_table DISTRIBUTE BY id SORT BY name;
Whenever you run a Hive query, it first creates a new metastore_db. Why? What is the importance of metastore_db?
When we run a Hive query, it first creates a local metastore. Before creating it, Hive checks whether a metastore already exists; if it does, an error is shown, otherwise the process continues. This is configured in hive-site.xml like this:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=metastore_db;create=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
Tell me about the different Hive metastore configurations.
There are three types of metastore configuration:

1) Embedded metastore
2) Local metastore
3) Remote metastore

By default Hive starts in embedded mode; on the command line all operations are done in embedded mode, accessing the Hive libraries locally. In the embedded metastore configuration the Hive driver, the metastore interface and the database all run in the same JVM. It's good for development and testing.


With a local metastore, the metastore stores data in an external database such as MySQL. The Hive driver and the metastore interface still run in the same JVM, but they communicate with the external database remotely. For better protection, credentials are required for a local metastore.

With a remote metastore, queries run through a Thrift server: the Hive driver and the metastore interface run in different JVMs. This gives better protection, because the required credentials are isolated from Hive users.

Can Hive process any type of data format?
Yes. Hive uses the SerDe interface for IO operations, and different SerDe implementations can read and write different types of data. Plain text can be processed directly, but for other data formats stored in Hadoop, Hive uses the appropriate SerDe to process the data.
Example:
MetadataTypedColumnsetSerDe: reads/writes CSV format data.
JsonSerDe: processes JSON data.
RegexSerDe: processes web log data.
AvroSerDe: processes Avro format data.
What is the HWI?
The Hive Web Interface is an alternative to the command line interface; it is a simple graphical interface to Hive. The HWI lets you start directly at the database level: you can see all SerDes, column names and types, and it simplifies the Hive workflow. It is session-based, so you can run multiple Hive queries simultaneously. There is no local metastore mode in HWI.
What is the difference between the LIKE and RLIKE operators in Hive?
LIKE: finds substrings within a string using SQL wildcards such as %.
RLIKE: a special function that matches a string against a Java regular expression and returns true or false.

Example: the table name is table, the column is name.
name = VenuKatragadda, venkatesh, venkateswarlu
SELECT * FROM table WHERE name LIKE 'Venu%'; // VenuKatragadda
SELECT * FROM table WHERE name RLIKE 'venk'; // false, true, true
What are Hive's default read and write classes?
Hive uses 2+2 classes to read and write files:
1) TextInputFormat / HiveIgnoreKeyTextOutputFormat
2) SequenceFileInputFormat / SequenceFileOutputFormat

The first pair reads/writes plain text; the second pair is used for sequence files.
What is the query processor in Hive?
It's the core processing unit in the Hive framework; it converts SQL into map/reduce jobs and runs them with the other dependencies. As a result, Hive can convert Hive queries into MapReduce jobs.

What are views in Hive?

Views are created and managed based on user requirements: you can expose a query result as a view. A view is a logical construct, used when a query is complicated, to hide its complexity and make it easier for users.
Example:
CREATE VIEW view_name AS SELECT * FROM employee WHERE salary > 10000;
What is the difference between a database and a data warehouse?
A database is typically designed for OLTP (transactional) operations, whereas a data warehouse is implemented for OLAP (analysis) operations.
OLTP is usually constrained to a single application; OLAP exists as a layer on top of several databases.
OLTP processes current, streaming and dynamic data, whereas OLAP processes retired, historic and static data only.
A database is normalized; a data warehouse is de-normalized.

Hive Interview Question and Answers

What is the difference between internal and external tables in Hive?
Hive creates a database on the master node to store metadata and keep the data safe. For example, if you partition a table, the table schema is stored in the metastore.
For a managed (internal) table both the schema and the data are managed by Hive, whereas for an external table the metastore is kept separate and the data stays at its external location. With an internal table Hive reads and loads the entire file as-is for processing, while with an external table it simply loads what the query logic requires.

If the user drops an internal table, Hive drops both the original data and the metastore entry; dropping an external table removes only the metastore entry, not the original data. Hive creates internal tables by default, but that is not recommended: store the data in external tables.

How do you write single-line and multi-line comments in Hive?
To write a single-line comment, use -- followed by the comment.

eg: -- This is an important step.

Hive does not support multi-line comments at the moment.
What is the importance of the Thrift server & client, JDBC and ODBC drivers in Hive?

Thrift is a cross-language RPC framework that generates code, combines it into a software stack and executes the Thrift code on a remote server. The Thrift compiler acts as an interpreter between server and client. The Thrift server allows a remote client to submit requests to Hive using different programming languages such as Python, Ruby and Scala.
JDBC driver: a software component that enables a Java application to interact with a database.
ODBC driver: ODBC achieves DBMS independence by using an ODBC driver as a translation layer between the application and the DBMS.
Does Hive support 100% of SQL queries such as insert, delete and update?
Hive doesn't support record-level updates; to update records, it is integrated with HBase.
When do you use Hive?
When the data is structured and static, when high latency is not a problem, and when the data is processed with queries, Hive is the best option. Most often data warehouse data is processed in Hive.
What is the use of partitions in Hive?
To analyze a particular subset of data you don't want to load the entire data set; partitioning the desired data is a good approach. To achieve this, Hive allows you to partition data based on a particular column. Both static and dynamic partitions can optimize Hive performance. For instance, if you need information for a particular year, partition the table by year.
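A minimal sketch of a partitioned table (table and column names are illustrative):

CREATE TABLE logs (msg STRING) PARTITIONED BY (year INT);
-- only the year=2015 partition is scanned, not the whole table
SELECT msg FROM logs WHERE year = 2015;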

Is a schema mandatory in Hive?

Yes, it's mandatory to create a table in a database. Hive follows a schema-oriented model and stores the schema information in an external database.
How does Hive serialize and deserialize data?
In Hive, this is handled by the SerDe (Serializer/Deserializer). When reading data, the request first goes to the InputFormat, which uses a RecordReader to read records; the data inside a record is stored in a serialized (binary) format. The SerDe then uses an ObjectInspector to deserialize the record into rows and fields, and those fields are delivered to the end user.
How does Hive use Java in SerDe?
To insert data into a table, Hive creates Java objects. To transfer Java objects over the network, the data has to be serialized. Each field is serialized using an ObjectInspector, and the serialized data is finally stored in the Hive table.
Does Hive support insert, delete, or update?
Out of the box, Hive doesn't support record-level update, insert and delete queries. HQL is a subset of SQL, but not equal to SQL. To update records, Hive integrates with HBase.
Tell me a few function names in Hive.
CONCAT('Venu','-','Bigdata','-','analyst'); // Venu-Bigdata-analyst
CONCAT_WS('-', 'venu', 'bigdata', 'analyst'); // venu-bigdata-analyst
REPEAT('venu', 3); // venuvenuvenu
TRIM(' VENU '); // VENU (without spaces)
LTRIM(' venu '); // trims the left side, but not the right side
RTRIM(' venu '); // trims the right side only, but not the left side
REVERSE('venu'); // unev
LOWER('Venu'); // venu (LCASE is a synonym)
UPPER('Venu'); // VENU (UCASE is a synonym)
RLIKE returns true/false for a substring regex match:
'Venu' RLIKE 'en' // true
'Venu' RLIKE '^V.*' // true
What is the difference between ORDER BY and SORT BY in Hive?
SORT BY uses multiple reducers, so it can process data quickly.
ORDER BY uses a single reducer, so if the data is too large it takes a long time to sort.
What is the difference between internal and external tables?
External table: the schema is stored in the metastore database while the actual data stays at its external location. If an external table is dropped, only the metastore entry is lost, not the actual data.
Internal table: both the metastore entry and the actual data are managed by Hive. If the table is dropped, both the actual data and the metadata are lost.
What is the difference between Hive and Hbase?

  • Hive allows most SQL queries, but HBase does not allow SQL queries directly.
  • Hive doesn't support record-level update, insert and delete operations on a table, but HBase can do them.
  • Hive is a data warehouse framework, whereas HBase is a NoSQL database.
  • Hive runs on top of MapReduce, whereas HBase runs on top of HDFS.

In how many ways can you run Hive?
In CLI mode (using the command line interface).
Using JDBC or ODBC.
By calling the Hive Thrift client, which allows Java, PHP, Python, Ruby and C++ programs to send commands to Hive.
Can you explain the different types of SerDe?
By default Hive uses the LazySimpleSerDe; it also provides a JsonSerDe, and the RegexSerDe is often used, to serialize and deserialize data.
Why do we use buckets in Hive?
Processing many chunks of files to analyze a vast amount of data can blow up processing time. Bucketing is a sampling concept that divides the data using a hashing algorithm so you can analyze a sample of it. set hive.enforce.bucketing=true; enables the process.
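A minimal sketch of a bucketed table and a sampling query (names are illustrative):

set hive.enforce.bucketing=true;
CREATE TABLE users_b (id INT, name STRING) CLUSTERED BY (id) INTO 16 BUCKETS;
-- read roughly 1/16 of the data by sampling a single bucket
SELECT * FROM users_b TABLESAMPLE(BUCKET 1 OUT OF 16 ON id);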
How does Hive organize data?
Hive organizes data in three ways: tables, partitions and buckets. Tables are organized from primitive column types as well as arrays and maps. Partitions have one or more partition keys based on the project requirements.
Buckets are used to sample the data for analysis; it's a good approach to process a pinch of data in buckets instead of processing all the data.
Can you explain the Hive architecture?
There are 5 core components in Hive: the UI, the Driver, the Compiler, the Metastore and the Execution Engine.
What is the User Interface (UI)?
UI: this interface sits between the users and the Driver; it accepts queries from the user and passes them to the Driver for execution. Two types of interface are available in Hive: a command line interface and a GUI. Hive also provides a Thrift interface and JDBC/ODBC for integrating other applications.
What is the importance of the Driver in Hive?
Driver: it manages the life cycle of HiveQL queries. The Driver receives queries from the user interface (including the JDBC/ODBC interfaces) and processes them, creating a separate session handle for each query.

Compiler: the compiler accepts plans from the Driver and gets the required metadata from the MetaStore to build the execution plan.

MetaStore: Hive stores metadata about tables here. Information about the data, whether a table is internal or external, is stored in the MetaStore, and the Hive compiler retrieves this metadata from it.

Execution Engine:
The Hive Driver executes the plan in the execution engine, which runs the queries as MapReduce jobs via the JobTracker. Based on the required information, Hive queries run in MapReduce to process the data.
When do we use explode in Hive?
Sometimes a Hadoop developer takes an array as input and needs to convert it into separate table rows. To achieve this, Hive uses explode, which turns complex data types into the desired table format.
Syntax:
SELECT explode (arrayName) AS newCol FROM TableName;
SELECT explode(map) AS newCol1, NewCol2 From TableName;
What is the ObjectInspector functionality in Hive?
Hive uses ObjectInspectors to analyze the internal structure of rows, columns and complex objects. They also give us ways to access the internal fields inside an object. They handle not only primitive data types like INT, BIGINT and STRING, but also complex data types like arrays, maps, structs and unions.
Can you overwrite the Hadoop MapReduce configuration in Hive?
Yes, you can override the map and reduce settings in the Hive configuration; Hive allows you to overwrite the Hadoop configuration files.
How to display the present database name in the terminal?

There are two ways to show the current database: one temporary, in the CLI, and one persistent.

1) in CLI just enter this command: set hive.cli.print.current.db=true;

2) In hive-site.xml paste this code:

    <property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
    </property>
In the second scenario, the Hive database name is displayed automatically whenever you open the terminal.
Is a job split into maps?
No. The Hadoop framework splits the data file, not the job. The chunks of data are stored in blocks, and each split needs a map to process it. A job is a configurable unit that controls execution of the plan/logic; it is not a physical data set to be split, but a logical configuration API that processes those splits.
What is the difference between describe and describe extended?

To see the table definition in Hive, use the describe <table name>; command,
whereas
to see more detailed information about the table, use the describe extended <tablename>; command.
Another useful command, describe formatted <tablename>;, shows all the details in a cleaner format.

 

What is the difference between a static and a dynamic partition of a table?

Partitions prune data during a query and can minimize query time; a partition is created when data is inserted into the table. With static partitioning you specify the partition value yourself when inserting rows, whereas dynamic partitioning derives the partition values from a particular column of the source table. In strict mode, at least one static partition is required before you can use dynamic partitions. If you are partitioning a large data set as part of an ETL flow, dynamic partitioning is recommended.

What is the difference between partitioning and bucketing?
The main aim of both partitioning and bucketing is to execute queries more efficiently. With partitioning, the slices are fixed when you create the table.
Bucketing follows a hash algorithm: based on the number of buckets, the data is distributed into buckets, which makes sampling the data easy. For more information about bucketing & partitioning please follow this link.

More tips click here

Hadoop Mapreduce Interview Questions

What is Hadoop MapReduce?
MapReduce is a programming model and set of programs used to process or analyze vast amounts of data over a Hadoop cluster. It processes huge datasets in parallel across the cluster in a fault-tolerant manner within the Hadoop framework.
Can you elaborate on a MapReduce job?
Based on the configuration, a MapReduce job first splits the input data into independent chunks called blocks. These blocks are processed by the Map() and Reduce() functions: first the map function processes the data, then the reduce function processes the map output. The framework takes care of sorting the map outputs and scheduling the tasks.
Why are the compute nodes and the storage nodes the same?
Compute nodes process the data and storage nodes store it. By default the Hadoop framework tries to minimize network traffic, and to achieve that it follows the data locality principle: the compute code executes where the data is stored, so the data node and the compute node are the same.
What is the importance of the Configuration object in MapReduce?

  • It is used to set/get parameter name & value pairs from XML files.
  • It is used to initialize values read from external files and set them as parameters.
  • Parameter values set in the program overwrite values coming from external configuration files.
  • Remaining parameter values are taken from Hadoop's defaults.

Where is MapReduce not recommended?

MapReduce is not recommended for iterative processing, i.e. when the output is fed back in a loop.
It is also not suitable for long series of chained MapReduce jobs: each job persists its data to local disk and the next job has to load it again, which is a costly operation and not recommended.

What is the Namenode and what are its responsibilities?

The Namenode is the daemon running on the master node; it's the heart of the entire Hadoop system. It stores the metadata in the FsImage and receives all block information from the datanodes in the form of heartbeats.

What are the JobTracker's responsibilities?

  • Scheduling the job's tasks on the slaves; the slaves execute the tasks as directed by the JobTracker.
  • Monitoring the tasks and re-executing any failed tasks.

What are the JobTracker & TaskTracker in MapReduce?
The MapReduce framework consists of a single JobTracker per cluster and one TaskTracker per node. A cluster usually has multiple nodes, so each cluster has a single JobTracker and multiple TaskTrackers.
The JobTracker schedules the job and monitors the TaskTrackers; if a TaskTracker fails to execute a task, the JobTracker tries to re-execute the failed task elsewhere.
The TaskTracker follows the JobTracker's instructions and executes the tasks. As a slave node, it reports the job status to the master JobTracker in the form of heartbeats.
What is the importance of job scheduling in Hadoop MapReduce?
Scheduling is a systematic procedure for allocating resources in the best possible way among multiple tasks. The Hadoop task tracker performs many procedures; sometimes a particular procedure should finish quickly and be given higher priority, and this is where job schedulers come in. The default scheduler is FIFO.
FIFO, Fair Scheduler and Capacity Scheduler are the most popular Hadoop schedulers.
When is a reducer used?
A reducer is used to combine the output of multiple mappers. It has 3 primary phases: shuffle, sort and reduce. It's possible to process data without a reducer, but one is used whenever shuffle and sort are required.
What is the replication factor?
The replication factor is the number of nodes within a cluster on which each chunk of data is stored. By default the replication value is 3, but it is possible to change it. Each file is automatically split into blocks and spread across the cluster.
Where does the shuffle and sort process happen?
After the mapper generates its output, the intermediate data is temporarily stored on the local file system; the location of this temporary data is set in the Hadoop configuration files. The Hadoop framework aggregates and sorts this intermediate data and then feeds it to the reduce function. The framework deletes the temporary data from the local system after the job completes.
Is Java mandatory for writing MapReduce jobs?
No. Hadoop itself is implemented in Java, but MapReduce applications need not be written in Java; Hadoop supports Python, Ruby, C++ and other programming languages.
The Hadoop Streaming API allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer.
Hadoop Pipes allows programmers to implement MapReduce applications using C++ programs.
Which methods control the output types of the map and reduce functions?
setOutputKeyClass() and setOutputValueClass() set the final output types.
If the map output types are different from the final output types, they can be set using
setMapOutputKeyClass() and setMapOutputValueClass().
What is the main difference between the Mapper and the Reducer?
The map method is called separately for each key/value pair to be processed: it takes input key/value pairs and emits intermediate key/value pairs.
The reduce method is called separately for each key and its list of values: it processes intermediate key/value pairs and emits the final key/value pairs.
For both, a setup method is initialized and called before any other method; it takes no parameters and produces no output.

Why are the compute nodes and the storage nodes the same?
Compute nodes are logical processing units, while storage nodes are physical storage units. Both run on the same node because of data locality: as a result Hadoop minimizes network traffic and can process data quickly.
What is the difference between a map-side join and a reduce-side join? Or: when do we use a map-side join and when a reduce-side join?
Joining multiple tables on the mapper side is called a map-side join. Note that a map-side join requires the data to be in a strict format and sorted properly, and the data should be partitioned properly; it suits the case where one of the datasets is small.

Joining multiple tables on the reducer side is called a reduce-side join. If you are planning to join large tables, where one table has a large number of rows and columns and the other has only a few, go with a reduce-side join; it's the best way to join multiple large tables.
What happens if the number of reducers is 0?
Setting the number of reducers to 0 is a valid configuration in MapReduce. In this scenario no reducer executes, so the mapper output is the final output and Hadoop stores it directly in the output folder.
When do we use a combiner? Why is it recommended?
Mappers and reducers are independent; they don't talk to each other. When the function is commutative (a.b = b.a) and associative (a.(b.c) = (a.b).c), we use a combiner to optimize the MapReduce process. Many MapReduce jobs are limited by bandwidth, so by default the Hadoop framework tries to minimize network traffic. To achieve this, MapReduce allows a user-defined "combiner function" to run on the map output. It's a MapReduce optimization technique, but it's optional.
What is the main difference between a MapReduce Combiner and a Reducer?
Both the combiner and the reducer are optional but very frequently used in MapReduce. There are three main differences:
1) A combiner gets input from only one mapper, while a reducer gets input from multiple mappers.
2) If aggregation is required, use a reducer; if the function is commutative (a.b = b.a) and associative (a.(b.c) = (a.b).c), a combiner can be used.
3) The input and output key/value types must be the same for a combiner, but a reducer can take one input type and produce a different output type.
What is a combiner?
It's a local aggregation of the key/value pairs produced by a mapper. It reduces a lot of duplicated data transfer between nodes, so it eventually optimizes job performance. The framework decides whether the combiner runs zero, one or multiple times. It's not suitable for non-associative operations such as computing a mean.
What is a partitioner?
After the combiner produces the intermediate map output, the partitioner controls which keys go where during sort and shuffle. The partitioner divides the intermediate data according to the number of reducers, so that all the data in a single partition is processed by a single reducer. Whenever you use reducers, partitioning happens automatically.
When do we use partitions (in Hive)?
By default Hive reads the entire dataset even when the application only needs a slice of it, which is a bottleneck for MapReduce jobs. So Hive provides partitions: when you create a table, you partition it based on the requirements.
What are the important steps when you are partitioning a table?
Don't over-partition the data into too many small partitions; that is an overhead on the namenode.
For dynamic partitioning, at least one static partition should exist in strict mode, or you can switch to non-strict mode with the following commands:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
First load the data into a non-partitioned table, then load that data into the partitioned table. It's not possible to load data from a local file directly into a partitioned table.

insert overwrite table table_name partition(year) select * from non-partition-table;
Can you elaborate on the MapReduce job architecture?
First the Hadoop programmer submits a MapReduce program to the JobClient.

The JobClient requests a job ID from the JobTracker. The JobTracker provides the job ID, in the form job_<HadoopStartTime>_00001; it's a unique ID.

Once the JobClient receives the job ID, it copies the job resources (job.xml, job.jar) to the file system (HDFS) and submits the job to the JobTracker. The JobTracker initializes the job and schedules it.

Based on the configuration, the job's input is divided into input splits, which are submitted to HDFS. A TaskTracker retrieves the job resources from HDFS and launches a child JVM. In this child JVM it runs the map and reduce tasks and notifies the JobTracker of the job status.
Why does the TaskTracker launch a child JVM?
Quite often Hadoop developers mistakenly submit wrong jobs or jobs containing bugs. If the TaskTracker used its own JVM, such a job could corrupt the main JVM and affect other tasks. With a child JVM, if the task tries to damage resources, the TaskTracker can simply kill that child JVM and retry or relaunch a new one.
Why do the JobClient and JobTracker submit job resources to the file system?
Data locality: moving computation is cheaper than moving data. The logic/computation lives in the jar file and the splits, so every resource is copied to the file system (HDFS datanodes) where the data is available.

How many mappers and reducers can run?

By default Hadoop runs 2 mappers and 2 reducers per datanode; each node has 2 map slots and 2 reduce slots. It's possible to change these defaults in mapred-site.xml.
What is an InputSplit?
A chunk of data processed by a single mapper is called an InputSplit. In other words, a logical chunk of data processed by a single mapper is an input split; by default, input split size = block size.
How do you configure the split value?
By default block size = 64 MB, but to process the data the JobTracker splits it further. Hadoop architects use these formulas to determine the split size.

1) split size = min(max_split_size, max(block_size, min_split_size));

2) split size = max(min_split_size, min(block_size, max_split_size));

By default, split size = block size.

The number of splits always equals the number of mappers.

Applying the formulas above (assume block_size = 64 MB, min_split_size = 512 KB, max_split_size = 10 GB; max_split_size depends on the environment and may be 1 GB or 10 GB):

1) split size = min(max_split_size, max(64 MB, 512 KB))

split size = min(10 GB, 64 MB)

split size = 64 MB

2) split size = max(min_split_size, min(block_size, max_split_size))

split size = max(512 KB, min(64 MB, 10 GB))

split size = max(512 KB, 64 MB)

split size = 64 MB
How much RAM is required to process 64 MB of data?
Let's assume a 64 MB block size and that the system runs 2 mappers and 2 reducers, so 64 * 4 = 256 MB of memory; the OS takes at least 30% extra space, so roughly 256 + 80 ≈ 336 MB of RAM is required to process one chunk of data.

Processing unstructured data in this way requires even more memory.
What is the difference between a block and a split?
Block: how the data is chunked for storage.
Split: how the data is chunked for processing.
Why does the Hadoop framework read a file in parallel but not sequentially?
or
Why does Hadoop read in parallel but not write in parallel?
To retrieve data faster, Hadoop reads data in parallel. Writes, however, happen in sequence rather than in parallel, because parallel writers are independent of each other: one node could overwrite another, and a writer would not know where the other writer's chunk ends. For example, when writing 100 MB of data as a 64 MB block and a 36 MB block, if the two blocks were written in parallel the first writer would not know where the remaining data goes. So Hadoop reads in parallel and writes sequentially.
If I change the block size from 64 to 128 MB, what happens?
Changing the block size does not affect existing data. After the change, every new file is chunked into 128 MB blocks.

It means old data stays in 64 MB chunks, but new data is stored in 128 MB blocks.
What is isSplitable()?
By default this value is true. It is used to decide whether the input format may split the data. For unstructured data it is often not advisable to split the file; to process the entire file as a single split, override isSplitable() to return false.
What are the maximum and minimum block sizes Hadoop allows?
Minimum: 512 bytes, the local OS file system block size; you cannot go below this.

Maximum: depends on the environment; there is no upper bound.
What are the job resource files?
job.xml and job.jar are the core resources needed to process the job; the JobClient copies these resources to HDFS.
What does a MapReduce job consist of?
A MapReduce job is a unit of work that the client wants performed. It consists of the input data, the MapReduce program in a jar file, and configuration settings in XML files. Hadoop runs the job by dividing it into different tasks with the help of the JobTracker.
What is data locality?
This is one of the most frequently asked Cloudera certification and MapReduce interview questions. Processing the data wherever the data already resides, i.e. moving the computation to the data, is called data locality. "Moving computation is cheaper than moving data", and data locality is how this goal is achieved. It's possible when the data is splittable, which is true by default.
What is speculative execution?
This is another important MapReduce and Cloudera certification interview question. Hadoop runs its processes on commodity hardware, so systems with low memory can fail, and if a system fails its task fails with it, which is undesirable. Speculative execution is a performance optimization technique: the same computation is distributed to multiple systems, and the result of whichever system finishes first is used. By default this option is enabled. Even if one system crashes, it's not a problem: the framework takes the result from another system.

Eg: the same logic is distributed to systems A, B, C and D.

Systems A, B, C and D execute it in 10, 8, 9 and 12 minutes respectively. The framework takes the result from system B and kills the remaining processes on the other systems.
When do we use a reducer?
A reducer is needed only when sort and shuffle are required; otherwise no partitioning is needed either. For a plain filter there is no need to sort and shuffle, so the operation can be done without a reducer.
What is a chain mapper?
The ChainMapper class is a special mapper class that runs a set of mappers in a chain within a single map task: one mapper's output acts as the next mapper's input, and in this way any number of mappers can be connected in a chain.
How do you do value-level comparison?
Hadoop can compare at the key level only, not at the value level.
What are the setup and cleanup methods?
If you don't know where a task starts and ends, it is much harder to manage its resources; setup and cleanup solve this.

There are N blocks, and by default one mapper is called per split; each task gets one setup and one cleanup call. setup() initializes the job resources, map() processes the data, and cleanup() closes the resources; once the last map call completes, cleanup() is invoked. This improves data transfer performance, and the same kind of initialization and cleanup can be done in the reducer as well.

If you need to compare one key's values against another's, or keep state across records, these methods are useful: resources are opened once, used across many records, and closed once, which saves a lot of network overhead during processing.
Why does the TaskTracker launch a child JVM for a task? Why not use the existing JVM?
Sometimes child threads corrupt parent threads: a programmer mistake could disrupt the entire MapReduce task. So the TaskTracker launches a child JVM to process each individual mapper or reducer. If the TaskTracker used its existing JVM, the task might damage the main JVM. If any bugs occur, the TaskTracker kills the child process and relaunches another child JVM to do the same task. Usually the TaskTracker relaunches and retries the task up to 4 times.
How many slots are allocated for each task?
By default each node has 2 slots for mappers and 2 slots for reducers, so each node has 4 slots to process data.
What is a RecordReader?
A RecordReader reads <key, value> pairs from an InputSplit. After the InputSplit is determined, the RecordReader converts the byte-oriented input into a record-oriented view for the Mapper; only then can the Mapper process the data.

Set the input format with this command:
job.setInputFormat(KeyValueTextInputFormat.class)

FileInputFormat.addInputPath() will read files from a specified directory and send those files to the mapper. All these configurations are included in the MapReduce job file.
Can you explain different types of Input formats?

Common input formats include TextInputFormat (the default), KeyValueTextInputFormat, SequenceFileInputFormat and NLineInputFormat.

Spark Interview questions

If you want Apache Spark training, contact here.

What is Spark?

Spark is a parallel data processing framework. It allows you to develop fast, unified big data applications that combine batch, streaming and interactive analytics.

Why Spark?

Spark is a third-generation distributed data processing platform. It's a unified big data solution for all big data processing problems such as batch, interactive and streaming processing, so it eases many big data problems.

What is an RDD?
Spark's primary core abstraction is the Resilient Distributed Dataset (RDD). An RDD is a collection of partitioned data that satisfies these common properties: immutable, distributed, lazily evaluated and cacheable.

What is Immutable?
Once created and assigned a value, it's not possible to change it; this property is called immutability. Spark RDDs are immutable by default: they don't allow updates or modifications. Note that it is the RDD (the data collection) that is immutable, not the underlying source data, which can still change.

What is Distributed?
RDD data is automatically distributed across the different parallel computing nodes of the cluster.

What is lazy evaluation?
When you execute a bunch of transformations, Spark does not evaluate them immediately; this laziness applies to transformations, which are only triggered when an action is called.

What is Cacheable?
Spark keeps the data in memory for computation rather than going to disk, so it can cache data and run up to 100 times faster than Hadoop.

What is the Spark engine's responsibility?
The Spark engine is responsible for scheduling, distributing and monitoring the application across the cluster.

What are the common Spark ecosystem components?
Spark SQL (Shark) for SQL developers,
Spark Streaming for streaming data,
MLlib for machine learning algorithms,
GraphX for graph computation,
SparkR to run R on the Spark engine,
and BlinkDB, enabling interactive queries over massive data, are common Spark ecosystem components. GraphX, SparkR and BlinkDB are in the incubation stage.


What are partitions?
A partition is a logical division of the data; the idea is derived from the map-reduce split. Data is logically divided to be processed in small chunks, which supports scalability and speeds up the process. Input data, intermediate data and output data are all partitioned RDDs.

How does Spark partition the data?

Spark uses the map-reduce API to partition the data. In the input format you can set the number of partitions. By default the HDFS block size is the partition size (for best performance), but it's possible to change the partition size, just like a split.
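For example, in PySpark (the path is hypothetical; by default one partition is created per HDFS block):

rdd = sc.textFile("hdfs:///data/logs", minPartitions=8)
print(rdd.getNumPartitions())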

How does Spark store the data?
Spark is a processing engine; it has no storage engine of its own. It can retrieve data from any storage engine such as HDFS, S3 and other data resources.

Is it mandatory to start Hadoop to run a Spark application?
No, it's not mandatory, but since Spark has no separate storage it would then use the local file system to store data. You can load data from the local system and process it; Hadoop or HDFS is not required to run a Spark application.


 

What is SparkContext?
The SparkContext is the entry point of a Spark application: when a programmer creates RDDs, the SparkContext connects to the Spark cluster. The SparkContext tells Spark how to access the cluster, and a SparkConf object holds the configuration used to create it.

What are the SparkCore functionalities?
SparkCore is the base engine of the Apache Spark framework. Memory management, fault tolerance, scheduling and monitoring jobs, and interacting with storage systems are its primary functionalities.

How is SparkSQL different from HQL and SQL?
SparkSQL is a special component on top of the SparkCore engine that supports SQL and the Hive Query Language without changing their syntax. It's even possible to join a SQL table and an HQL table.

When do we use Spark Streaming?
Spark Streaming is an API for real-time processing of streaming data. Spark Streaming gathers streaming data from different sources such as web server log files, social media data, stock market data, or Hadoop ecosystem tools like Flume and Kafka.

How does the Spark Streaming API work?
The programmer sets a batch interval in the configuration; whatever data arrives within that interval is grouped into a batch. The input stream (DStream) goes into Spark Streaming, the framework breaks it up into small batches and feeds them into the Spark engine for processing. The Spark Streaming API passes the batches to the core engine, which generates the final results, also as a stream of batches. This allows streaming data and batch data to be processed in the same way.

What is Spark MLlib?

Mahout is a machine learning library for Hadoop; similarly, MLlib is Spark's machine learning library. MLlib provides different algorithms that scale out on the cluster for data processing, and most data scientists use it.

What is GraphX?

GraphX is the Spark API for manipulating graphs and graph collections. It unifies ETL, exploratory analysis and iterative graph computation. It's a very fast graph system that provides fault tolerance and ease of use without requiring special skills.

What is the File System API?
The FS API can read data from different storage devices such as HDFS, S3 or the local file system. Spark uses the FS API to read data from different storage engines.

Why are partitions immutable?
Every transformation generates a new partition. Partitions use the HDFS API, so they are immutable, distributed and fault tolerant, and they are also aware of data locality.

What is a transformation in Spark?

Spark provides two kinds of operations on RDDs: transformations and actions. Transformations are lazy: they are only recorded and do not run until an action is called. Each transformation returns a new RDD. map, flatMap, groupByKey, reduceByKey, filter, cogroup, join, sortByKey, union, distinct and sample are common Spark transformations.
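
A minimal sketch of a few transformations (assuming an existing SparkContext sc); note that nothing executes yet, only the lineage is recorded:

val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
val doubled = nums.map(_ * 2)                                 // transformation: returns a new RDD, runs nothing
val evens = doubled.filter(_ % 2 == 0)                        // still lazy
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val summed = pairs.reduceByKey(_ + _)                         // another transformation returning a new RDD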

What is an action in Spark?

An action is an RDD operation that returns a value to the Spark driver program and kicks off a job to execute on the cluster. A transformation's output is the input of an action. reduce, collect, takeSample, take, first, saveAsTextFile, saveAsSequenceFile, countByKey and foreach are common actions in Apache Spark.
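
A minimal sketch of actions (assuming an existing SparkContext sc; the output path is illustrative):

val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
val total = nums.reduce(_ + _)          // action: runs the job and returns 15 to the driver
val firstTwo = nums.take(2)             // action: returns Array(1, 2)
nums.saveAsTextFile("hdfs:///tmp/nums") // action: materializes the RDD and writes it out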

What is RDD lineage?
Lineage is the process an RDD uses to reconstruct lost partitions. Spark does not replicate data in memory; if data is lost, the RDD uses its lineage to rebuild the lost partitions. Each RDD remembers how it was built from other datasets.

What are map and flatMap in Spark?

map processes each input element and returns exactly one output element for it. With flatMap, each input item can be mapped to zero or more output items (so the function should return a Seq rather than a single item). flatMap is most frequently used to flatten results such as arrays of elements.
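
A small sketch of the difference (assuming an existing SparkContext sc):

val lines = sc.parallelize(Seq("hello world", "hi"))
val lengths = lines.map(_.length)       // exactly one output per input: 11, 2
val words = lines.flatMap(_.split(" ")) // zero or more outputs per input: hello, world, hi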

What are broadcast variables?
Broadcast variables let the programmer keep a read-only variable cached on each machine rather than shipping a copy of it with every task. Spark supports two types of shared variables: broadcast variables (similar to Hadoop's distributed cache) and accumulators (similar to Hadoop counters). Broadcast variables are stored as array buffers and send read-only values to worker nodes.
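
A minimal sketch (assuming an existing SparkContext sc; the lookup table is made up):

val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States")) // cached read-only on each worker
val codes = sc.parallelize(Seq("IN", "US", "UK"))
val named = codes.map(code => countryNames.value.getOrElse(code, "unknown"))   // tasks read, but never modify, the value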

What are accumulators in Spark?
Accumulators are Spark's offline debuggers, similar to Hadoop counters; you can use them to count events and track what is happening during a job. Only the driver program can read an accumulator's value, not the tasks.

How does an RDD persist data?
There are two methods to persist data: persist(), which lets you choose a storage level, and cache(), which is shorthand for persist() with the default MEMORY_ONLY level. Several storage-level options are available, such as MEMORY_ONLY, MEMORY_AND_DISK and DISK_ONLY; which one to use depends on the task.

To learn basic Spark, watch the video tutorials on the Big Data University website.

BigInsights installation on Linux

As part of the IBM BigInsights series, in this post I explain how to install IBM BigInsights on Red Hat Linux (RHEL 7). Everyone knows how to install VMware, and it's easy; after the VMware installation, check that your system meets the requirements given below.


BIOS settings:
Press F10 > Security > System Security > Virtualization Technology > Enable > press F10 to save and exit.

Requirements to install IBM InfoSphere BigInsights:
Minimum: 2 CPUs, 4 GB RAM, 30 GB disk.
Recommended: 4 CPUs, 8 GB RAM, 70 GB disk.

Step 1: Register and log in to your IBM account. Click this link to register.

Step 2: Go to your email account and confirm your account, then log in here.

Step 3: You are automatically redirected to this page.


Step 4: Scroll down, select the desired version, then click Download.


Step 5: Download BigInsights from this link. A pop-up box will appear; the download takes a long time (depending on your network speed).

Please note that unzipping the file consumes much more disk space: the zip file is only about 12 GB, but after unzipping it grows to about 27 GB. After unzipping, place the vmdk file (iibi3002_QuickStart_Cluster_VMware.vmdk) in the Virtual Machines folder where your VMware is installed.

Now get ready to install IBM BigInsights in the virtual machine.

Step 6: Click Open a Virtual Machine.


Step 7: A file dialog pops up; select the VMX file and open it. That's it, VMware configures everything automatically.

Step 8: The defaults are 8 GB of RAM and 4 CPUs, but if your machine has only 8 GB of RAM, change them to 6 GB and 2 CPUs. The main reason is that your host system also uses some RAM, so configure it this way to prevent problems.


Step 9: Double-click Memory, or click Edit virtual machine settings, to modify the memory and processors.

Step 10: Change the RAM size from 8192 to 6916, then click OK. Similarly, change the processors from 4 to 2.

Step 11: Finally, click Power on this guest operating system. Several confirmation pop-ups appear; simply click OK on each one.


Step 12: BigInsights now installs, showing its progress as a percentage.

Step 13: After the installation it asks for login credentials. At the "bivm login:" prompt enter root, then at "Password:" enter your desired password (example: PAss123).


Step 14: It then asks for the BigInsights admin username (your desired username, for example biadmin), then the password (for example BIadmin123), and finally the language (for example English US). Please note that these passwords and the language cannot be changed later. Press F10.

 

Step 15: IBM InfoSphere BigInsights is now almost installed. Simply log in with your credentials: username biadmin, password PAss123.
Step 16: You are automatically taken to the InfoSphere BigInsights console. The installation is now complete.

These steps are not only for Red Hat Linux (RHEL 7); on any OS, whether Windows or Ubuntu, you follow the same steps.

Sqoop Interview Questions

What is Sqoop?

Sqoop is an open source Hadoop ecosystem tool that imports and exports data between Hadoop and relational databases.
Sqoop performs these imports and exports in parallel and provides fault tolerance.

Tell me a few import control arguments:
--append
--columns
--where
These arguments are most frequently used when importing RDBMS data.

How does Sqoop handle large objects?

BLOB and CLOB columns are the common large objects. If an object is less than 16 MB, it is stored inline with the rest of the data and materialized in memory for processing. Larger objects are temporarily stored in a _lob subdirectory and processed in a streaming fashion. If you set the LOB limit to 0, all large objects are placed in external storage.

What types of databases can Sqoop support?

MySQL, Oracle, PostgreSQL, HSQLDB, IBM Netezza and Teradata. Every database connects through a JDBC driver.
Eg:

sqoop import --connect jdbc:mysql://localhost/database --username ur_user_name --password ur_pass_word
sqoop import --connect jdbc:teradata://localhost/DATABASE=database_name --driver "com.teradata.jdbc.TeraDriver" --username ur_user_name --password ur_pass_word

What are the common privilege steps in Sqoop to access MySQL?

As the root user, grant all privileges so that Sqoop can access the MySQL database.

mysql -u root -p
//Enter a password
mysql> GRANT ALL PRIVILEGES ON *.* TO '%'@'localhost';
mysql> GRANT ALL PRIVILEGES ON *.* TO ''@'localhost';
// here you can mention db_name.* or db_name.table_name between ON and TO.

 


What is the importance of the eval tool?
It allows users to run sample SQL queries against the database and preview the results on the console. It helps you check what data will be imported and whether the desired data was imported.

Syntax: sqoop eval (generic-args) (eval-args)
Example:

sqoop eval --connect jdbc:mysql://localhost/database --query "select name, cell from employee limit 10"
sqoop eval --connect jdbc:oracle://localhost/database -e "insert into database values ('Venu', '9898989898')"

Can we import data with a WHERE condition?

Yes, Sqoop has a --where option to import only the rows that satisfy a condition.

sqoop import --connect jdbc:mysql://localhost/CompanyDatabase --table Customer --username root --password mysecret --where "DateOfJoining > '2005-1-1' "

How do you import or export only particular columns?

There is a separate argument, --columns, that lets you import or export specific columns of a table.

Syntax: --columns <col,col,col…>

Example:

sqoop import --connect jdbc:mysql://localhost/database --table employee --columns "emp_id,name,cell" --username root --password password

What is the difference between Sqoop and DistCp?

DistCp can transfer any type of data from one Hadoop cluster to another, whereas Sqoop transfers data between RDBMSs and the Hadoop ecosystem. Both DistCp and Sqoop follow a similar approach, pulling and transferring the data with parallel map tasks.

What is the difference between Flume and Sqoop?
Flume is a distributed, reliable Hadoop ecosystem tool that collects, aggregates and moves large amounts of log data. It can collect data from many different sources and continuously push it into HDFS.
It does not require a schema and handles structured or unstructured data; it can pull any type of data.
Sqoop, in contrast, acts as an interpreter that exchanges and transfers data between RDBMSs and the Hadoop ecosystem. It can import or export only RDBMS data, and a schema is mandatory for processing.

What are the common delimiters and escape characters in Sqoop?

The default delimiters are a comma (,) for fields and a newline (\n) for records. The common delimiter arguments start with -- and are given below.
--enclosed-by <char>
--escaped-by <char>
--fields-terminated-by <char>
--lines-terminated-by <char>
--optionally-enclosed-by <char>

Escape characters are:
\b
\n
\r
\t
\"
\'
\\
\0

Can Sqoop import tables into Hive?
Yes, it's possible; several Hive-related arguments are available to import data into Hive.

--hive-import
--hive-overwrite
--hive-table <table-name>
--hive-drop-import-delims
--create-hive-table

Can Sqoop import data into HBase?
Yes, a few arguments help to import data directly into HBase.

--column-family <family>
--hbase-create-table
--hbase-row-key <col>
--hbase-table <table-name>

What is the metastore tool?
This tool hosts a shared metastore, which is configured in sqoop-site.xml. Multiple users can access and execute the saved jobs, but you must configure it in sqoop-site.xml:

<property>
<name>sqoop.metastore.client.enable.autoconnect</name>
<value>false</value>
</property>

Syntax: sqoop metastore (generic-args) (metastore-args)
Example:

sqoop metastore   # start the shared metastore
sqoop job --list --meta-connect jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop   # a client connecting to it

What is the Sqoop merge tool?

The merge tool combines two datasets, where entries in the newer dataset overwrite entries in the older one. It flattens the two datasets into one.
Syntax: sqoop merge (generic-args) (merge-args)
Example:

 sqoop merge --new-data newer --onto older --target-dir merged --jar-file datatypes.jar --class-name Foo --merge-key id

What is codegen?
Codegen is a tool that generates the Java classes which encapsulate and interpret imported records.
Syntax: $ sqoop codegen (generic-args) (codegen-args)

Apart from import and export, can Sqoop do anything else?
Yes, it can do many things:
Codegen: generate Java code to interact with database records.
Eval: evaluate a SQL statement and display the results.
Merge: flatten multiple datasets into one dataset.

Can you import or export only particular rows or columns?

Sure, Sqoop provides a few options that let you import or export just the rows or columns you want, based on a WHERE clause, a column list or a free-form query:

--columns <col1,col2..>
--where <condition>
--query <SQL query>

Example:

sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --where "start_date > '2010-01-01'"
 sqoop eval --connect jdbc:mysql://db.example.com/corp \
    --query "SELECT * FROM employees LIMIT 10"
sqoop import --connect jdbc:mysql://localhost/database --table table_name --username root --password your_password --columns "name,employee_id,jobtitle"

How do you create and drop a Hive table in Sqoop?
It is possible to create Hive tables with create-hive-table, but it is not possible to drop a Hive table through Sqoop.

sqoop create-hive-table --connect jdbc:mysql://localhost/database --table table_name

Assume you use Sqoop to import the data into a temporary Hive table using no special options to set custom Hive table field delimiters. In this case, what will Sqoop use as field delimiters in the Hive table data file?
Sqoop's default field delimiter is 0x2c (a comma), but when importing into Hive with no delimiters specified, Sqoop uses Hive's default field delimiter, which is 0x01 (^A).

How do you import only the new data in a particular table every day?
This is one of the main problems for Hadoop developers. For example, suppose you imported 1 TB of data yesterday and another 1 GB arrived today; a plain import would transfer the full 1 TB + 1 GB again. To import only the new rows, use an incremental import. Assuming the value from the last import is stored in $LASTIMPORT, run:

sqoop import --incremental lastmodified --check-column lastmodified --last-value "$LASTIMPORT" --connect jdbc:mysql://localhost:3306/database_name --table table_name --username user_name --password pass_word

Exercise:

  1. You are using Sqoop to import data from a MySQL server on a machine named dbserver, which you will subsequently query using Impala. The database is named db, the table is named sales, and the username and password are fred and fredpass. Which query imports the data into a table that can then be used with Impala?

More tips

Hadoop 2.x Interview questions

What are the core changes in Hadoop 2.x?

There are many changes; the most important are removing the single point of failure and decentralizing the JobTracker's responsibilities to the data nodes. The entire JobTracker architecture changed. Some of the main differences between Hadoop 1.x and 2.x are given below.

  • Single point of failure – rectified.
  • Node limitation (4000 nodes to unlimited) – rectified.
  • JobTracker bottleneck – rectified.
  • Map-reduce slots changed from static to dynamic.
  • High availability – available.
  • Supports both interactive and iterative graph algorithms (1.x does not).
  • Allows other applications to integrate with HDFS.

What is YARN?

YARN stands for "Yet Another Resource Negotiator." It is used for efficient cluster utilization and is the most powerful technology in 2.x. Unlike the 1.x JobTracker, resource management and job scheduling/monitoring are handled by separate daemons (the ResourceManager and the ApplicationMaster), which eases the JobTracker problems. YARN is the layer made up of the ResourceManager and the NodeManagers.

What is the difference between MapReduce 1 and MapReduce 2/YARN?

In MapReduce 1, Hadoop centralized all tasks in the JobTracker, which allocated resources and scheduled jobs across the cluster. YARN decentralizes this to ease the pressure on the JobTracker: the ResourceManager allocates resources to particular nodes, and the NodeManagers run the tasks scheduled by the ApplicationMaster. YARN allows parallel execution, with an ApplicationMaster managing and executing each job. This approach eases many JobTracker problems, improves scalability and optimizes job performance. Additionally, YARN allows multiple application types to scale out on the same distributed environment.

How does Hadoop determine the distance between two nodes?

The Hadoop admin writes a topology script to determine the rack location of nodes. It is invoked to find the distance between nodes when replicating data. Configure this script in core-site.xml:
<property>
<name>topology.script.file.name</name>
<value>core/rack-awareness.sh</value>
</property>
In rack-awareness.sh you write the script that maps each node to its location.

A user mistakenly deleted a file; how does Hadoop remove it from its file system? Can you roll it back?

HDFS first renames the file and places it in the /trash directory for a configurable amount of time. While the file sits in the trash, its blocks are not yet freed, so the deletion can be rolled back. After this time, the NameNode deletes the file from the HDFS namespace and frees its blocks. The retention period is configurable as fs.trash.interval in core-site.xml; if it is set to 0, files are deleted immediately without being stored in the trash.

What is the difference between HDFS NameNode Federation, NFS and JournalNodes?

HDFS Federation separates the namespace and the storage layer, which improves scalability and isolation; NFS and JournalNodes, by contrast, are mechanisms for sharing the edit log in NameNode high-availability setups.

 

What is the DistCp functionality in Hadoop?

This distributed copy tool is used to transfer large amounts of data within a cluster or between clusters.
hadoop distcp hdfs://namenode1:8020/nn hdfs://namenode2:8020/nn
It can copy multiple sources to a destination cluster; the last argument is the destination.
hadoop distcp hdfs://namenode1:8020/dd1 hdfs://namenode2:8020/dd2 hdfs://namenode3:8020/dd3

Is YARN a replacement for MapReduce?

YARN is a generic resource-management framework; it supports MapReduce, but it is not a replacement for it. You can develop many applications with the help of YARN; Spark, Drill and many more applications run on top of YARN.

What are the core concepts/processes in YARN?

  1. ResourceManager: roughly equivalent to the JobTracker.
  2. NodeManager: roughly equivalent to the TaskTracker.
  3. ApplicationMaster: roughly equivalent to a job. Everything is an application in YARN; when a client submits a job, an ApplicationMaster is launched for it.

Containers: roughly equivalent to slots.

YarnChild: when you submit an application, the ApplicationMaster dynamically launches YarnChild processes to run the map and reduce tasks.

If an ApplicationMaster fails, it is not a problem; the ResourceManager automatically starts a new application attempt.

 

What are the steps to upgrade Hadoop 1.x to Hadoop 2.x?

Do not upgrade from 1.x to 2.x directly in place. Download the 2.x release locally and then clean up the old 1.x files; the upgrade takes a considerable amount of time.

  • The share folder is important: share/hadoop/mapreduce/lib.
  • Stop all processes.
  • Delete the old metadata from work/hadoop2data.
  • Copy and rename the 1.x data into work/hadoop2.x.
  • Do not format the NameNode during the upgrade.
  • Run hadoop namenode -upgrade (it will take a lot of time).
  • Do not close the previous terminal; open a new terminal.
  • If you need to revert, run hadoop namenode -rollback.
