
Hive Vs Pig Difference

Hive and Pig are both Apache projects for analyzing very large data sets on top of Hadoop without writing MapReduce code by hand. Both are analysis tools, meaning they perform OLAP-style operations. Depending on how it is stored, data falls into three categories: structured, unstructured, and semi-structured. Hive and Pig take queries or scripts over this data and translate them into MapReduce jobs. Both run on top of Hadoop and can ultimately achieve the same outcome, but they follow different processes. This post walks through the features of Pig and Hive.

Hive is essentially a SQL-like interface on top of Hadoop for analyzing schema-based, structured data. Pig is a high-level data-flow language for analyzing any type of data. Both summarize and analyze data, but they go about it differently.
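
To make the difference between a SQL-like interface and a data-flow language concrete, here is a minimal Python sketch. It is not Hive or Pig code: `sqlite3` stands in for Hive's declarative, schema-first style, and plain Python steps stand in for a Pig-style pipeline. The sample data and the function names `sql_style` and `dataflow_style` are assumptions made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (user, page, seconds_on_page)
ROWS = [
    ("alice", "home", 12),
    ("bob", "home", 30),
    ("alice", "docs", 45),
    ("carol", "docs", 5),
    ("bob", "docs", 60),
]


def sql_style(rows, min_seconds):
    """Declarative style: declare a schema, then state the result in one query."""
    conn = sqlite3.connect(":memory:")
    # The table and its schema must exist before any data can be queried.
    conn.execute("CREATE TABLE visits (user TEXT, page TEXT, seconds INTEGER)")
    conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT page, COUNT(*), SUM(seconds) FROM visits "
        "WHERE seconds >= ? GROUP BY page ORDER BY page",
        (min_seconds,),
    ).fetchall()
    conn.close()
    return result


def dataflow_style(rows, min_seconds):
    """Data-flow style: each step produces a named intermediate relation."""
    # Step 1: keep only long enough visits (comparable to a filter step).
    filtered = [r for r in rows if r[2] >= min_seconds]
    # Step 2: group the remaining rows by page (comparable to a group step).
    grouped = defaultdict(list)
    for _user, page, seconds in filtered:
        grouped[page].append(seconds)
    # Step 3: emit one summary tuple per group.
    return sorted((page, len(secs), sum(secs)) for page, secs in grouped.items())


if __name__ == "__main__":
    print(sql_style(ROWS, 10))
    print(dataflow_style(ROWS, 10))
```

Both functions return the same summary. The first describes what result is wanted and leaves the plan to the engine. The second spells out the order of the steps, which is the way a data-flow script is usually written.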

| Hive | Pig |
| --- | --- |
| Hadoop must be running before Hive can run. | Hadoop does not need to be running: Pig can run in standalone (local) mode or cluster mode, but Hadoop should still be installed. |
| A good fit when you have a limited number of joins and filters. | Highly recommended when you have a large number of joins and filters. |
| Supports only structured data, so it is most often used in data warehousing. | Can process both structured and unstructured data, so it is well suited to streaming data. |
| Supports user-defined functions (UDFs), but they are much harder to debug. | Very easy to write a UDF to calculate metrics. |
| Tables must be created manually to store intermediate data. | No table needs to be created. |
| Stores metadata in a database such as Derby (the default), MySQL, or Oracle. | Has no metadata support. |
| Uses its own query language, HQL (HiveQL), which goes beyond standard SQL. | Uses its own language, Pig Latin, a relational data-flow language. |
| Best suited to analysts, especially big data analysts, who are familiar with SQL; most often used to generate reports and statistics. | Best suited to programmers and software developers who are familiar with languages such as Python and Java. |
| Can run an optional Thrift-based server and operates on the server side of a cluster. | Operates on the client side of a cluster; there is no server-side component. |
| Executes quickly, but does not load data quickly. | Loads data effectively and quickly. |
| Must be configured carefully for cluster or pseudo-distributed mode. | Installed through the shell; no other configuration is required, just extract the tar file. |
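
The comparison above says that writing a Pig UDF to calculate metrics is easy. As a rough sketch of what that can look like, here is a small UDF in Python. The metric and the function name `click_through_rate` are hypothetical. The `outputSchema` decorator is the one Pig provides for Python UDFs; the fallback definition is an assumption added here only so the file can also be run and tested outside Pig.

```python
try:
    # Provided by Pig when the script is registered as a streaming Python UDF.
    from pig_util import outputSchema
except ImportError:
    def outputSchema(schema):
        """Fallback stand-in (assumption): records the schema, changes nothing else."""
        def wrap(func):
            func.output_schema = schema
            return func
        return wrap


@outputSchema("ctr:double")
def click_through_rate(clicks, impressions):
    """Hypothetical metric: clicks divided by impressions, or None if undefined."""
    if clicks is None or not impressions:
        return None
    return float(clicks) / float(impressions)


if __name__ == "__main__":
    print(click_through_rate(5, 20))
```

Assuming the file is saved as `metrics.py`, a Pig script would typically register it with a statement along the lines of `REGISTER 'metrics.py' USING streaming_python AS metrics;` and then call `metrics.click_through_rate(...)` inside a `FOREACH ... GENERATE`. The exact registration options depend on the Pig version, so check them against your installation.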