Partition in Hive

When a user runs a query, Hive by default reads the entire data set. On large tables this takes a long time and becomes a heavy bottleneck for MapReduce jobs. To overcome this, Hive lets you partition the table. Big data analysts partition Hive tables very often, and it is the recommended way to analyze large data sets. Partitioning segregates the input records into different directories based on the values of particular columns. You can also partition on multiple columns. Instead of scanning a vast amount of data, a query can read only the target partitions and still return the desired results. It is one of the most effective ways to improve query performance on large tables.

For example, the population of India is about 110 crore. To process only the population of Andhra, partition the table by state. A query for Andhra then processes about 4 crore records instead of 110 crore, so it reads far less data and returns results quickly.
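The state example above can be sketched in HiveQL roughly as follows. The table name population, its columns, and the state code 'AP' are hypothetical and chosen only for illustration.

```sql
-- Hypothetical table: one row per person, partitioned by state.
-- The partition column (state) is declared only in PARTITIONED BY.
CREATE TABLE population (
  person_id BIGINT,
  name      STRING,
  age       INT
)
PARTITIONED BY (state STRING);

-- Each distinct state value gets its own subdirectory under the
-- table location, for example .../population/state=AP/

-- Filtering on the partition column lets Hive read only the
-- matching partition instead of the whole table.
SELECT COUNT(*)
FROM population
WHERE state = 'AP';
```

Because the filter is on the partition column, Hive can skip every other state's directory, which is where the time saving comes from.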

Often a Hadoop developer creates a schema and stores the data in HDFS without partitioning it. To partition the data, use the PARTITIONED BY clause, which names the columns to segregate the data on.

Syntax to partition the data:
CREATE TABLE table_name (col1 data_type1, col2 data_type2, …) PARTITIONED BY (partition_column1 data_type1, partition_column2 data_type2);

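Please note: Don’t partition on too many columns, or on columns with a very large number of distinct values. Every partition becomes a separate HDFS directory with its own files, and a large number of small files and directories is an overhead for the NameNode.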
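One way to judge whether a column is a sensible partition key is to count its distinct values before creating the partitioned table. This is only a sketch; population_raw and its state column are hypothetical names for an un-partitioned source table.

```sql
-- How many partitions would partitioning by state create?
SELECT COUNT(DISTINCT state) AS partition_count
FROM population_raw;

-- How many rows would land in each partition?
-- Very small partitions mean many small files on HDFS.
SELECT state, COUNT(*) AS row_count
FROM population_raw
GROUP BY state
ORDER BY row_count DESC;
```

A few dozen reasonably sized partitions is usually comfortable; a column that yields a huge number of tiny partitions is a poor choice.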

There are two types of partitioning in Hive: static partitioning and dynamic partitioning. To use fully dynamic partitioning, you must run these two commands first.

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

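In a dynamic partition insert, list the partition columns by name only in the PARTITION clause, without values; their data types are declared once, in the PARTITIONED BY clause of CREATE TABLE.
Please note: Don’t repeat the partition columns in the regular column list of the table definition; declare them only in PARTITIONED BY.
In strict mode you need at least one static partition in the insert. The main difference between static and dynamic partitioning is that static partition values are written in the statement, while dynamic partition values are taken from the data; hive.exec.dynamic.partition.mode=strict or nonstrict controls whether an insert may be fully dynamic.
If you want to debug, use DESCRIBE FORMATTED partition_table_name.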
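The difference between static and dynamic partition inserts, and the debugging commands, can be sketched as below. All table and column names (population, population_by_region, population_raw, country, state) and the values 'IN' and 'AP' are hypothetical.

```sql
-- Hypothetical target tables.
CREATE TABLE population (person_id BIGINT, name STRING, age INT)
PARTITIONED BY (state STRING);

CREATE TABLE population_by_region (person_id BIGINT, name STRING, age INT)
PARTITIONED BY (country STRING, state STRING);

-- Static partition: the partition value is written in the statement,
-- so the partition column is not selected.
INSERT OVERWRITE TABLE population
PARTITION (state = 'AP')
SELECT person_id, name, age
FROM population_raw
WHERE state = 'AP';

-- Mixed: country is static, state is dynamic. Static partition columns
-- must come before dynamic ones. This form is allowed in strict mode
-- as long as hive.exec.dynamic.partition is true.
SET hive.exec.dynamic.partition = true;
INSERT OVERWRITE TABLE population_by_region
PARTITION (country = 'IN', state)
SELECT person_id, name, age, state
FROM population_raw
WHERE country = 'IN';

-- Inspect the result.
SHOW PARTITIONS population_by_region;
DESCRIBE FORMATTED population_by_region;
DESCRIBE FORMATTED population PARTITION (state = 'AP');
```

SHOW PARTITIONS lists the partitions that exist, while DESCRIBE FORMATTED shows the partition columns, the table location, and storage details.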

Please note that many analysts run into problems when defining the schema of a partitioned table. It’s tricky.

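Now, after creating the partitioned table, insert the raw Hive data into it so that the data gets partitioned. In a dynamic partition insert, the partition columns must be the last columns returned by the SELECT, in the same order as in the PARTITION clause.
INSERT OVERWRITE TABLE table_name
PARTITION (partition_column1, partition_column2)
SELECT * FROM unpartitioned_table;

Now the data from unpartitioned_table is loaded into the partitioned table called table_name.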
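Putting the steps together, a fully dynamic load might look like the sketch below. The names population, population_raw, and their columns are hypothetical; the explicit column list is used instead of SELECT * so that the partition column is guaranteed to come last.

```sql
-- Allow every partition column in the insert to be dynamic.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Hypothetical partitioned target table.
CREATE TABLE population (person_id BIGINT, name STRING, age INT)
PARTITIONED BY (state STRING);

-- The last selected column (state) supplies the partition value
-- for each row; Hive creates the partitions as needed.
INSERT OVERWRITE TABLE population
PARTITION (state)
SELECT person_id, name, age, state
FROM population_raw;

-- Confirm which partitions were created.
SHOW PARTITIONS population;
```

Listing the columns explicitly avoids silently writing the wrong column into the partition key when the source table's column order differs from the target's.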