Pig Interview Questions & Answers

What is Pig?

Pig is a data-flow platform that processes data in parallel on Hadoop. It uses a scripting language called Pig Latin to process and analyze data, with built-in operations such as join, sort, and filter, plus user-defined functions (UDFs). It can store and analyze any type of data, whether structured or unstructured, and is well suited to streaming data.
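A minimal Pig Latin script gives a feel for the data-flow style described above. The file name and field names here are hypothetical, chosen for illustration:

```pig
-- Load a comma-separated file into a relation with a declared schema
users  = LOAD 'users.txt' USING PigStorage(',') AS (name:chararray, age:int);
-- Keep only the tuples that pass the condition
adults = FILTER users BY age >= 18;
-- Print the result to the terminal
DUMP adults;
```

Each statement names a new relation produced from the previous one, which is exactly the "data flowing through instructions" idea.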

What is a dataflow language?

In a conventional language, instructions execute in sequence under control statements (conditions, jumps, loops) while the data stays in place. In a dataflow language, a stream of data flows from one instruction to the next, being transformed at each step. Pig Latin works this way, which lets Pig process such transformations over large data sets in an efficient manner.

Can you define Pig in two lines?

Pig is a platform for analyzing large data sets, either structured or unstructured, using Pig Latin scripts. It is designed to process streaming and unstructured data in parallel.

What are the main differences between local mode and MapReduce mode?

Local mode: There is no need to install or start Hadoop. Pig scripts run on the local system and, by default, read and write the local file system. The commands are exactly the same as in MapReduce mode, so scripts need no changes.

MapReduce mode: Hadoop must be running. Pig scripts execute as MapReduce jobs and read and write HDFS. In both modes, Java and Pig must be installed.
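As a sketch, the same script can be launched in either mode purely by changing the flag (script and path names here are illustrative):

```pig
-- Launch commands (run from the shell, not inside the script):
--   pig -x local script.pig       reads/writes the local file system
--   pig -x mapreduce script.pig   reads/writes HDFS; Hadoop must be running
data = LOAD 'input.txt' AS (line:chararray);
DUMP data;
```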

What is the difference between the Store and Dump commands?

Dump displays the processed data on the terminal but does not store it anywhere. Store writes the output to a folder in the local file system or HDFS. In a production environment, Hadoop developers most often use Store to persist data in HDFS.
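The contrast can be shown in two lines (file and field names are hypothetical):

```pig
logs   = LOAD 'logs.txt' USING PigStorage(',') AS (level:chararray, msg:chararray);
errors = FILTER logs BY level == 'ERROR';
DUMP errors;                          -- display on the terminal only; nothing is persisted
STORE errors INTO '/output/errors';   -- write output files into a folder (HDFS or local)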

What is the relation between map, tuple and bag?

Bag: a collection of tuples, represented with {}. A bag can hold tuples that themselves contain maps.
Tuple: an ordered collection of fields, represented with (). A field in a tuple can be any data type, including the complex types bag, tuple, and map.
Map: a collection of key#value pairs, represented with []. Maps are convenient for giving loose structure to unstructured data.

{('hyderabad', '500001'), (['area'#'ameerpet', 'pin'#'500016'])}

Here {} is a bag, () is a tuple, and [] is a map.
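A schema declaration can combine all three complex types. This is an illustrative sketch only; the input file format has to match the declared structure:

```pig
-- Hypothetical file where each record holds a bag of (city, pin) tuples
-- and a map of extra key#value attributes.
cities = LOAD 'cities.txt' AS (
    pins:bag{t:tuple(city:chararray, pin:chararray)},  -- bag of tuples
    extra:map[]                                        -- map of key#value pairs
);
DESCRIBE cities;  -- prints the schema back, confirming the nesting
```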

What are the relational operators in Pig?

FOREACH — iterate over every tuple in a relation to generate new fields.
ORDER BY — sort the data in ascending or descending order.
FILTER — similar to the WHERE clause in SQL; selects the tuples to process.
GROUP — group the data to get the desired output.
DISTINCT — return only unique records; it works on entire records, not individual fields.
JOIN — logically join multiple relations and get the desired output.
LIMIT — restrict the output to a fixed number of tuples.
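Several of these operators chained together look like this (file, field, and threshold values are made up for the example):

```pig
emps   = LOAD 'emp.txt' USING PigStorage(',') AS (name:chararray, dept:chararray, sal:int);
high   = FILTER emps BY sal > 50000;                 -- FILTER: keep well-paid employees
bydept = GROUP high BY dept;                         -- GROUP: one group per department
avgs   = FOREACH bydept GENERATE group AS dept,      -- FOREACH: derive new fields
                                  AVG(high.sal) AS avg_sal;
sorted = ORDER avgs BY avg_sal DESC;                 -- ORDER BY: descending sort
top5   = LIMIT sorted 5;                             -- LIMIT: first five tuples only
DUMP top5;
```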

What is the importance of the Pig Engine?

It acts as the interpreter between Pig Latin scripts and MapReduce jobs, providing the environment that compiles Pig scripts into a series of MapReduce jobs that execute in parallel.

Why Pig instead of MapReduce?

Apache Pig offers many features that raw MapReduce lacks. In MapReduce it is difficult to join multiple data sets, and the development cycle is long. Depending on the task, Pig automatically converts the script into map and reduce stages, and it makes it easy to join multiple tables and to express SQL-like operations such as join, filter, group by, order by, union, and many more.
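The join case makes the contrast concrete: in hand-written MapReduce a join takes custom mapper and reducer code, while in Pig Latin it is a single statement. File and field names below are illustrative:

```pig
orders = LOAD 'orders.txt'    USING PigStorage(',') AS (cust_id:int, amount:int);
custs  = LOAD 'customers.txt' USING PigStorage(',') AS (cust_id:int, name:chararray);
-- One line replaces an entire hand-written MapReduce join job
joined = JOIN orders BY cust_id, custs BY cust_id;
DUMP joined;
```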

Can you tell me a little about Hive and Pig?

Pig uses Pig Latin, a procedural language. A schema is optional and there is no metastore concept, whereas Hive keeps its metastore in a database.
Hive uses a special language called HQL, a subset of SQL, and a schema is mandatory. Hive is intended primarily for queries.
Both Pig and Hive run on top of MapReduce, converting their commands into MapReduce jobs internally. Both are used to analyze data and can ultimately generate the same output.

What does FLATTEN do in Pig?

Syntactically, FLATTEN looks like a UDF, but it is more powerful than one: its purpose is to change the structure of tuples and bags, which UDFs cannot do. FLATTEN un-nests tuples and bags; it is roughly the opposite of TOBAG and TOTUPLE.
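A sketch of FLATTEN un-nesting the bag produced by GROUP (names are hypothetical):

```pig
emps    = LOAD 'emp.txt' USING PigStorage(',') AS (name:chararray, dept:chararray);
grouped = GROUP emps BY dept;   -- each tuple now holds dept plus a bag of emp tuples
-- FLATTEN un-nests the bag, producing one (dept, name) row per employee again
flat    = FOREACH grouped GENERATE group AS dept, FLATTEN(emps.name) AS name;
DUMP flat;
```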

Can we process a vast amount of data in local mode? Why?

No. A single system has a limited, fixed amount of storage, whereas Hadoop can handle vast amounts of data. So pig -x mapreduce mode is the better choice for processing a vast amount of data.

How does Pig integrate with MapReduce to process data?

Pig is easy to use: when a programmer writes a script to analyze a data set, the Pig compiler converts the program into a MapReduce-understandable format, and the Pig engine executes the query as MR jobs. MapReduce then processes the data and generates the output; it does not return the output to Pig but stores it directly in HDFS.

How do you debug in Pig?

DESCRIBE — review the schema of a relation.
EXPLAIN — show the logical, physical, and MapReduce execution plans.
ILLUSTRATE — show, step by step, how sample data passes through each statement.
These operators are used to debug Pig Latin scripts.
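In a Grunt session, the three debugging operators are applied to a relation by name (file and fields are illustrative):

```pig
emps = LOAD 'emp.txt' USING PigStorage(',') AS (name:chararray, sal:int);
DESCRIBE emps;     -- prints the schema: emps: {name: chararray, sal: int}
EXPLAIN emps;      -- prints the logical, physical, and MapReduce plans
ILLUSTRATE emps;   -- runs a small sample through each step and shows the result
```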

Tell me a few important operators for working with data in Pig.

FILTER: works on tuples (rows) to select data.
FOREACH: works on columns of data to generate new fields.
GROUP: groups the data within a single relation.
COGROUP & JOIN: group or join data across multiple relations.
UNION: merges the data of multiple relations.
SPLIT: partitions the content into multiple relations.
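UNION and SPLIT are mirror images of each other, as this sketch shows (inputs and the split condition are made up):

```pig
a = LOAD 'a.txt' AS (id:int);
b = LOAD 'b.txt' AS (id:int);
merged = UNION a, b;   -- merge two relations with the same schema into one
-- SPLIT partitions one relation into several, by condition
SPLIT merged INTO small IF id < 100, big IF id >= 100;
```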

What is a topology script?

Topology scripts are used by Hadoop to determine the rack location of nodes, which in turn drives how data is replicated. As part of rack awareness, Hadoop runs the script configured in topology.script.file.name. If it is not set, a default rack id is returned for any passed IP address.

Hive doesn't support multi-line comments; what about Pig?

Pig supports both single-line and multi-line comments.
Single-line comments use --:
DUMP B; -- display the data on the terminal; it is not stored in the file system.
Multi-line comments use /* ... */:
STORE B INTO '/output'; /* store/persist the data in HDFS or the local file system.
In production, the STORE command is most often used. */

Can you tell me the important data types in Pig?

Primitive data types: int, long, float, double, chararray, bytearray.
Complex data types: tuple, bag, map.

What does COGROUP do in Pig?

COGROUP groups rows based on a column but, unlike GROUP, it can operate on multiple relations at once, grouping each of them by the chosen column.

See: http://joshualande.com/cogroup-in-pig/
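A COGROUP over two relations produces one tuple per key, holding a bag from each input relation (files and fields below are hypothetical):

```pig
owners = LOAD 'owners.txt' USING PigStorage(',') AS (owner:chararray, pet:chararray);
pets   = LOAD 'pets.txt'   USING PigStorage(',') AS (pet:chararray, legs:int);
-- One output tuple per pet value: (group, {owners tuples}, {pets tuples})
grp = COGROUP owners BY pet, pets BY pet;
DUMP grp;
```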

What is the difference between GROUP and COGROUP?

GROUP operates on a single relation, while COGROUP groups two or more relations by a common key. So yes, COGROUP can be thought of as a group of more than one data set.


Why do we use user-defined functions (UDFs) in Pig?

UDFs let you plug custom processing logic, typically written in Java or another supported language, into a Pig script when the built-in functions and operators cannot express the transformation you need.
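A sketch of registering and invoking a UDF; the jar name and the UpperCase class are hypothetical, shown only to illustrate the mechanism:

```pig
-- Register the jar containing the UDF, then bind an alias to the class
REGISTER myudfs.jar;
DEFINE UPPER com.example.pig.UpperCase();
names = LOAD 'names.txt' AS (name:chararray);
-- Call the UDF like any built-in function
upper = FOREACH names GENERATE UPPER(name);
```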

Note: Pig does provide null checks; operators such as IS NULL and IS NOT NULL help handle NULL values in Pig.

Note: Pig is partly case-sensitive and partly case-insensitive.
Case-sensitive: aliases, built-in functions, and UDF names.
Case-insensitive: keywords and operators such as FOREACH, AS, etc.
