
Hive vs Pig: Differences

Hive and Pig are both Apache projects for analyzing vast data sets on top of Hadoop without writing raw MapReduce code. Both tools are used to analyze data, that is, to perform OLAP-style operations. Based on how it is stored, data falls into three categories: structured, semi-structured, and unstructured. To analyze data, Hadoop uses the Hive and Pig ecosystems to express analyses as optimized MapReduce jobs. Both ecosystems work on top of Hadoop and can ultimately produce the same outcome, but they follow different processes. The features of Pig and Hive are demonstrated below.

Hive is essentially a SQL-like interface on top of Hadoop for analyzing schema-based, structured data. Pig is a high-level data flow language for analyzing any type of data. Both Hive and Pig summarize and analyze data, but they follow different processes.


Hive: Hadoop must be running before you can use Hive.
Pig: Hadoop does not need to be running; Pig can run in local mode or cluster mode, but Hadoop must be installed.

Hive: Go with Hive if you have a limited number of joins and filters.
Pig: Highly recommended when you have a huge number of joins and filters.

Hive: Supports only structured data, so it is most often used in data warehouses.
Pig: Can process both structured and unstructured data, so it is best suited for streaming data.

Hive: Supports user-defined functions, but they are much harder to debug.
Pig: Very easy to write a UDF to calculate metrics.

Hive: You must manually create tables to store intermediate data.
Pig: No need to create tables.

Hive: Stores metadata in a database such as Derby (the default), MySQL, or Oracle.
Pig: Has no metadata support.

Hive: Uses a separate query language called HQL, which goes beyond standard SQL.
Pig: Uses its own language, Pig Latin, a relational data-flow language.

Hive: Best suited for analysts, especially big data analysts familiar with SQL; most often used to generate reports and statistics.
Pig: Best suited for programmers and software developers familiar with scripting languages such as Python or Java.

Hive: Can run an optional Thrift-based server and operates on the server side of the cluster.
Pig: Operates on the client side of the cluster; there is no server-side concept.

Hive: Executes queries quickly, but does not load data quickly.
Pig: Loads data effectively and quickly.

Hive: Must be carefully configured for cluster or pseudo-distributed mode.
Pig: Works through shell interaction, so no extra configuration is required; just extract the tar file.
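As a quick illustration of the two styles, here is a sketch of the same aggregation written in Pig Latin, with the HiveQL equivalent shown as a comment; the file name and fields are hypothetical:

```pig
-- HiveQL equivalent: SELECT dept, AVG(salary) FROM emp GROUP BY dept;
emp = LOAD 'emp.txt' USING PigStorage(',')
      AS (name:chararray, dept:chararray, salary:double);
by_dept = GROUP emp BY dept;
avg_sal = FOREACH by_dept GENERATE group AS dept, AVG(emp.salary) AS avg_salary;
DUMP avg_sal;
```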

 

Pig Interview Questions & Answers

What is Pig?

Pig is a data flow language that processes data in parallel on Hadoop. Pig uses a special scripting language called Pig Latin to process and analyze data. It allows joins, sorts, filters, and UDFs to analyze the data. It can store and analyze any type of data, whether structured or unstructured, and is highly recommended for streaming data.
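A minimal sketch of a Pig Latin script; the input file and its fields are hypothetical:

```pig
-- load a tab-separated log file, keep only server errors, and print them
logs = LOAD 'logs.txt' USING PigStorage('\t') AS (ip:chararray, status:int);
errors = FILTER logs BY status >= 500;
DUMP errors;
```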

What is a dataflow language?

To access external data, every language must follow certain rules. In a conventional control-flow language, instructions execute under the direction of control statements while the data itself stays put. In a dataflow language, a stream of data flows from one instruction to the next, being transformed at each step. Pig expresses conditions, iteration, and transformations in this style and processes data efficiently.

Can you define Pig in 2 lines?

Pig is a platform for analyzing large data sets, whether structured or unstructured, using Pig Latin scripting. It was intentionally designed for processing streaming and unstructured data in parallel.

What are the main differences between local mode and MapReduce mode?

Local mode: No need to start Hadoop; the Pig scripts run on the local system (start Pig with pig -x local). By default Pig reads and writes the local file system. The commands are 100% the same in local and MapReduce modes; nothing needs to change.

MapReduce mode: It is mandatory to start Hadoop (pig -x mapreduce, which is the default). Pig scripts run as MapReduce jobs and the data is stored in HDFS. In both modes, Java and Pig installations are mandatory.

What is the difference between the Store and Dump commands?

Dump displays the processed data on the terminal, but it is not stored anywhere. Store, on the other hand, writes the output to a folder in the local file system or HDFS. In production environments, Hadoop developers most often use the Store command to persist data in HDFS.
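A small sketch of the difference; the relation and paths are hypothetical:

```pig
A = LOAD 'input.txt' AS (line:chararray);
DUMP A;                  -- prints the tuples to the terminal; nothing is persisted
STORE A INTO '/output';  -- writes the tuples to a folder in HDFS (or the local FS in local mode)
```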

What is the relation between map, tuple and bag?

Bag: A collection of tuples is called a bag. It can hold tuples and maps; bags are represented with {}.
Tuple: An ordered, fixed-length collection of fields. The fields in a tuple can be of any data type, including the complex types: bags, tuples, and maps. Tuples are represented with ().
Map: A collection of key-value pairs, where the values can be of any Pig data type. Maps are represented with [] and are handy for unstructured data whose fields vary from record to record.

{('hyderabad', '500001'), (['area'#'ameerpet', 'pin'#500016])}

Here {} is a bag, () is a tuple, and [] is a map.
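A sketch of loading and accessing these complex types; the file and fields are hypothetical:

```pig
-- each record holds a (city, zip) tuple and a map of extra info
cities = LOAD 'cities.txt' AS (loc:tuple(city:chararray, zip:chararray), info:map[]);
pins = FOREACH cities GENERATE loc.city, info#'pin';  -- dot for tuple fields, # for map keys
DUMP pins;
```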

What are the relational operations in Pig?

foreach: iterates over every tuple in a relation and generates new output.
order by: sorts the data in ascending or descending order.
filter: similar to the WHERE clause in SQL; selects which tuples to process.
group: groups the data to get the desired output.
distinct: keeps only unique values; it works on entire records, not on individual fields.
join: logically joins two or more relations to get the desired output, much like a SQL join.
limit: returns only a limited number of tuples for display.
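Several of these operators combined into one sketch; the input file and fields are hypothetical:

```pig
users = LOAD 'users.txt' USING PigStorage(',') AS (name:chararray, city:chararray, age:int);
adults = FILTER users BY age >= 18;               -- like WHERE in SQL
by_city = GROUP adults BY city;                   -- group for aggregation
counts = FOREACH by_city GENERATE group AS city, COUNT(adults) AS n;
ranked = ORDER counts BY n DESC;                  -- sort descending
top5 = LIMIT ranked 5;                            -- keep only five tuples
DUMP top5;
```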

What is the importance of the Pig engine?

The Pig engine acts as the interpreter between Pig Latin scripts and MapReduce jobs. It provides the execution environment that compiles Pig scripts into a series of MapReduce jobs and runs them in parallel.

Why Pig instead of MapReduce?

Compared with raw MapReduce, Apache Pig offers many conveniences.
In MapReduce it is very difficult to join multiple data sets, and the development cycle is long.
Depending on the task, Pig automatically converts the code into map and reduce steps. It is easy to join multiple tables and express SQL-like operations such as join, filter, group by, order by, union, and many more.
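For example, a join that would take substantial boilerplate in hand-written MapReduce is a few lines in Pig Latin; the files and fields here are hypothetical:

```pig
orders = LOAD 'orders.txt' USING PigStorage(',') AS (oid:int, cid:int, amount:double);
customers = LOAD 'customers.txt' USING PigStorage(',') AS (cid:int, name:chararray);
joined = JOIN orders BY cid, customers BY cid;    -- inner join on customer id
DUMP joined;
```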

Can you tell me a little bit about Hive and Pig?

Pig internally uses Pig Latin, a procedural language. A schema is optional, and there is no metastore concept. Hive, by contrast, uses a database as its metastore.
Hive internally uses a special language called HQL, a dialect of SQL. A schema is mandatory for processing, and Hive was intentionally designed for queries.
But both Pig and Hive run on top of MapReduce and convert their commands into MapReduce jobs internally. Both are used to analyze data and can eventually generate the same output. See the comparison above for more about Hive versus Pig.

What does Flatten do in Pig?

Syntactically, FLATTEN looks like a UDF, but it is more powerful than a UDF. The purpose of FLATTEN is to change the structure of tuples and bags, which UDFs cannot do. FLATTEN un-nests tuples and bags; it is the opposite of TOBAG and TOTUPLE.
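The classic word-count use of FLATTEN, as a sketch; the input file is hypothetical:

```pig
lines = LOAD 'input.txt' AS (line:chararray);
-- TOKENIZE returns a bag of words; FLATTEN un-nests it into one word per tuple
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
DUMP words;
```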

Can we process vast amount of data in local mode? Why?

No. A single system has a limited, fixed amount of storage, whereas a Hadoop cluster can handle vast amounts of data. So pig -x mapreduce mode is the best choice for processing vast amounts of data.

How does Pig integrate with MapReduce to process data?

Pig makes execution easier. When a programmer writes a script to analyze a data set, the Pig compiler converts the program into a MapReduce-understandable format, and the Pig engine executes the query as MapReduce jobs. MapReduce processes the data and generates the output report. Note that MapReduce does not return the output to Pig; it is stored directly in HDFS.

How do you debug in Pig?

describe: reviews the schema of a relation.
explain: shows the logical, physical, and MapReduce execution plans.
illustrate: shows the step-by-step execution of each statement on sample data.
These commands are used to debug Pig Latin scripts.

Tell me a few important operators for working with data in Pig.

filter: selects tuples (rows) from a relation based on a condition.
foreach: works on the columns of a relation, generating transformed data.
group: groups the data within a single relation.
cogroup & join: group or join data across multiple relations.
union: merges the data of multiple relations.
split: partitions a relation's content into multiple relations.

What is Topology Script?

Topology scripts are used by Hadoop to determine the rack location of nodes; this information is used when replicating data. As part of rack awareness, Hadoop reads the script configured in topology.script.file.name. If it is not set, the default rack id /default-rack is returned for any passed IP address.

Hive doesn't support multi-line comments; what about Pig?

Pig supports both single-line and multi-line comments.
Single-line comments start with --:
DUMP B; -- prints the relation, but does not store it in the file system
Multi-line comments use /* */:
STORE B INTO '/output'; /* stores/persists the data in HDFS or the local file
system; in production the STORE command is most often used */
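Both comment styles together in one sketch; relation B is assumed to be defined earlier:

```pig
-- a single-line comment; it may also follow a statement
DUMP B;  -- prints the relation without persisting it
/* a multi-line comment:
   STORE persists the relation in HDFS or the local file system */
STORE B INTO '/output';
```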

Can you tell me important data types in Pig?

Primitive data types: int, long, float, double, chararray, bytearray.
Complex data types: tuple, bag, map.

What does cogroup do in Pig?

COGROUP groups rows based on a column, but unlike GROUP it can group multiple relations at once on the grouped column, producing one bag per input relation.

See also: http://joshualande.com/cogroup-in-pig/
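A sketch of COGROUP over two hypothetical relations:

```pig
owners = LOAD 'owners.txt' USING PigStorage(',') AS (name:chararray, animal:chararray);
pets = LOAD 'pets.txt' USING PigStorage(',') AS (animal:chararray, petname:chararray);
grouped = COGROUP owners BY animal, pets BY animal;
-- each output tuple: (group key, {bag of matching owners}, {bag of matching pets})
DUMP grouped;
```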

What is the difference between GROUP and COGROUP?

GROUP operates on a single relation, while COGROUP operates on two or more relations at once. A COGROUP of a single relation behaves the same as GROUP; with multiple relations, each output tuple holds the group key followed by one bag of matching tuples per input relation.
Can we say cogroup is a group of more than one data set?

Yes. COGROUP can be thought of as GROUP applied to more than one data set at the same time (Pig allows up to 127 relations in a single COGROUP); the result pairs each group key with one bag per input relation.

Why do we use user-defined functions (UDFs) in Pig?

Pig's built-in functions cannot cover every processing need. UDFs let you plug in custom logic, most commonly written in Java though other languages are supported, for loading, storing, filtering, and transforming data. Once registered, a UDF is called in a script just like a built-in function.
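Using a UDF from a script looks like this sketch; the jar and class names are hypothetical:

```pig
REGISTER 'myudfs.jar';                          -- make the UDF's jar available
DEFINE TO_UPPER com.example.pig.UpperCase();    -- alias a hypothetical eval function
data = LOAD 'names.txt' AS (name:chararray);
upper_names = FOREACH data GENERATE TO_UPPER(name);
DUMP upper_names;
```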