The default configuration is not suitable for all applications; a few changes can significantly speed up your Hadoop and Hive queries. In this post I explain different ways to optimize Hive, along with a few Hive technical interview questions.
Is SQL scalable? How do you run SQL queries on Hadoop?
Traditional SQL databases are vertically scalable, while Hadoop is horizontally scalable. Hive is the Hadoop component that lets programmers run SQL-like queries on top of Hadoop.
Why is scalability so important in Hadoop?
For example, suppose a job runs on 3 nodes out of 5. If one of those nodes fails, the work is automatically rescheduled on a 4th node. In a distributed process, individual node failures are expected, so this fault tolerance lets the job finish without interruption.
What are the pros and cons of a broadcast join?
In a broadcast join, the small table is loaded into memory on every node, and the mappers stream through the large table and perform the join. It is fast because it needs only a single scan of the large table, but the small table must fit in RAM on each node; if it doesn't, the join cannot be processed this way. So when you join two tables, one of them must be smaller than RAM. If both tables exceed RAM, use a sort-merge bucket (SMB) join instead.
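A minimal sketch of letting Hive broadcast the small side automatically (the table and column names here are hypothetical):

```sql
-- Let Hive convert joins with a small table into broadcast/map joins
SET hive.auto.convert.join=true;
-- small_dim is loaded into memory on each node; big_fact is streamed by the mappers
SELECT f.order_id, d.country_name
FROM big_fact f
JOIN small_dim d ON f.country_code = d.country_code;
```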
How does the Cost Based Optimizer (CBO) optimize Hive?
The CBO, introduced in Hive 0.14, generates more efficient execution plans by leveraging statistics collected on Hive tables. Enable it by setting two parameters: set hive.compute.query.using.stats=true; set hive.stats.dbclass=fs; How does the Tez execution engine improve Hive performance?
If you are using Hadoop 2.x, use the Tez execution engine for better performance. Run the following in the Hive shell to enable it: set hive.execution.engine=tez;
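Note that the CBO mentioned above only pays off if statistics have actually been collected. A hedged sketch (the table and column names are hypothetical):

```sql
SET hive.compute.query.using.stats=true;
SET hive.stats.dbclass=fs;
-- collect table-level and column-level statistics for the optimizer
ANALYZE TABLE sales COMPUTE STATISTICS;
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS region, amount;
```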
What are skewed tables in Hive?
When certain values of a column appear very often, skewed tables are highly recommended. A skewed table splits the frequently occurring values into separate files and stores the remaining values elsewhere. When a user queries the table, Hive can handle the heavy values separately instead of reprocessing them with everything else, and performance improves.
create table TableName (column1 string, column2 string) skewed by (column1) on ('frequent_value1', 'frequent_value2') stored as directories;
What is vectorization? Vectorization improves query performance by processing a batch of rows at a time instead of a single row. Enable it with: set hive.vectorized.execution.enabled = true; set hive.vectorized.execution.reduce.enabled = true;
What is ORCFile? How does it optimize Hive query performance?
Use the ORC file format to optimize query performance. SNAPPY is a good compression codec to pair with ORC.
CREATE TABLE ORC_table (EmpID int, Emp_name string, Emp_age int, address string) STORED AS ORC tblproperties ('orc.compress' = 'SNAPPY');
Other tips to optimize Hive performance:
If you join two tables where one is small and the other is very large, use a map-side join to optimize the task. If a query counts distinct values, restructure it as a subquery: SELECT count(1) FROM (SELECT DISTINCT column_field FROM table_name) t; If imported data is automatically partitioned into hourly buckets based on time, always use a WHERE clause to prune unnecessary data.
Eg: select name, age, cell from biodata where time > 1349393020; In Hive, ORDER BY uses one reducer while SORT BY uses multiple reducers.
So if you process a large amount of data, don't use ORDER BY; prefer SORT BY.
Eg: SELECT name, location, voterid FROM aadhar_card DISTRIBUTE BY name SORT BY age;
Increase parallelism: add the settings below to compress intermediate data, and ensure the maximum split size is around 256 MB.
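A hedged sketch of such settings (the values are illustrative; tune them for your cluster):

```sql
-- run independent query stages in parallel
SET hive.exec.parallel=true;
-- cap the maximum input split size at roughly 256 MB
SET mapreduce.input.fileinputformat.split.maxsize=268435456;
```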
If you are joining a small table with a large table, use a map-side join.
Enable it with: set hive.auto.convert.join=true;
This can improve job performance when you perform join operations.
Paste the properties below into mapred-site.xml to decrease the burden on the cluster during sort & shuffle:
they compress the output of the map and reduce phases.
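The original listing omitted the actual properties; a minimal sketch for mapred-site.xml (the codec choice is an illustrative assumption):

```xml
<!-- compress intermediate map output to reduce shuffle traffic -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```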
If possible, apply an SMB (sort-merge bucket) map join.
A sort-merge bucket join is very efficient and can outperform a plain map join when the tables are too large to broadcast, but it only applies when both tables are bucketed and sorted on the join key.
To enable it, use these configuration settings:
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
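A minimal sketch of an SMB-eligible pair of tables (the table definitions are hypothetical, and depending on your Hive version additional settings may be required):

```sql
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
-- both tables bucketed AND sorted on the join key, same bucket count
CREATE TABLE orders  (user_id int, amount int)
  CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS;
CREATE TABLE users_b (user_id int, city string)
  CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS;
SELECT o.user_id, o.amount, u.city
FROM orders o JOIN users_b u ON o.user_id = u.user_id;
```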
Hive and Pig are both Apache projects for analyzing vast data sets on top of Hadoop without writing MapReduce code; both are used for OLAP-style analysis. By storage format, data is commonly categorized as structured, unstructured, or semi-structured. Hadoop uses the Hive and Pig ecosystems to simplify and optimize MapReduce processing. Both run on top of Hadoop and can ultimately achieve the same outcome, but they follow different approaches. The comparison below summarizes the features of Pig and Hive.
Hive is a SQL-like interface on top of Hadoop for analyzing schema-based structured data. Pig is a high-level data-flow language for analyzing any type of data. Both summarize and analyze data, but they follow different processes.
Hive: Hadoop must be running before you can run Hive queries.
Pig: you can run in local (standalone) mode or cluster mode; Hadoop must be installed, but local mode does not require the cluster to be running.
If you have a limited number of joins and filters, go ahead with Hive.
Pig is highly recommended when you have a huge number of joins and filters.
Hive supports only structured data, so it is most often used in data warehousing.
Pig can process both structured and unstructured data, so it is better suited to streaming data.
Both support user-defined functions (UDFs), but UDFs are comparatively hard to debug.
What is Hive?
Hive is an open-source project under the Apache Software Foundation: a data-warehouse software ecosystem in Hadoop that manages vast amounts of structured data using HQL, a SQL-like language. Where is Hive the best fit?
When you are building data-warehouse applications,
when you are handling static rather than dynamic data,
when the application can tolerate high latency (high response time),
when a large data set is maintained and mined for insights and reports,
and when you prefer queries over scripting. When is Hive not suitable?
It doesn't support OLTP transactions, only OLAP workloads.
If the application requires OLTP, switch to a NoSQL database.
HQL queries have higher latency because they run as MapReduce jobs.
To achieve update & delete transactions, introduced in Hive 0.14, you must change the given default values:
hive.support.concurrency – true
hive.enforce.bucketing – true
hive.exec.dynamic.partition.mode – nonstrict
hive.txn.manager – org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.compactor.initiator.on – true (for exactly one instance of the Thrift metastore service)
hive.compactor.worker.threads – a positive number on at least one instance of the Thrift metastore service
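With those settings in place, a transactional table can be sketched as follows (a minimal example with hypothetical names; ACID tables must be bucketed and stored as ORC):

```sql
CREATE TABLE txn_demo (id int, status string)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- record-level operations now work on this table
UPDATE txn_demo SET status = 'done' WHERE id = 1;
DELETE FROM txn_demo WHERE id = 2;
```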
What is Hive MetaStore?
The MetaStore is Hive's central repository for metadata, which can live in an external database. By default Hive stores metadata in an embedded Derby database, but you can store it in MySQL, Oracle, etc., depending on the project. Why choose Hive instead of raw MapReduce?
Hive offers partitions to simplify data processing and buckets for sampling and quickly sorting data; both segment large data sets to improve query performance. It is therefore highly recommended for structured data. Can I access Hive without Hadoop?
Hive stores and processes data on top of Hadoop by default, but it is possible to run it against other storage systems such as Amazon S3, GPFS (IBM), and the MapR file system.
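For example, an external table can point at data living in S3. A hedged sketch (the bucket, path, and schema are hypothetical, and the S3 connector must be configured separately):

```sql
CREATE EXTERNAL TABLE s3_logs (ts bigint, msg string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3a://my-bucket/logs/';  -- data stays in S3, only metadata in Hive
```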
What is the relationship between MapReduce and Hive? Or: how are MapReduce jobs submitted to the cluster?
Hive provides no additional capabilities beyond MapReduce: the queries are executed as MapReduce jobs via an interpreter. The interpreter runs on a client machine and turns HiveQL queries into MapReduce jobs, which the framework then submits to the cluster. If you run a select * query in Hive, why doesn't it run MapReduce?
It's an optimization technique. The hive.fetch.task.conversion property lets Hive serve simple SELECT, FILTER, and LIMIT queries with a FETCH task, skipping MapReduce entirely and so minimizing the latency of MapReduce overhead.
By default its value is "minimal", which optimizes only SELECT *, FILTER on partition columns, and LIMIT queries, whereas the value "more" optimizes SELECT, FILTER, and LIMIT (plus TABLESAMPLE and virtual columns). How can Hive improve performance with ORC-format tables?
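A minimal sketch of widening the fetch-task conversion (the table name is hypothetical):

```sql
SET hive.fetch.task.conversion=more;
-- with 'more', a simple projection plus LIMIT can run as a plain fetch,
-- with no MapReduce job launched:
SELECT name, age FROM employees LIMIT 10;
```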
Hive can store data very efficiently in the Optimized Row Columnar (ORC) file format, which overcomes many limitations of other Hive file formats. Using ORC files improves performance when reading, writing, and processing data. Create an ORC table like this:
CREATE TABLE orc_table (
  id int,        -- illustrative columns; the original listing omitted them
  name string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
LINES TERMINATED BY '\n'
STORED AS ORC; What is the importance of vectorization in Hive?
It's a query optimization technique. Instead of processing one row at a time, vectorization processes a batch of rows as a unit, which improves query performance. The table must be stored in ORC format for vectorization to apply. It's disabled by default; enable it with this command:
set hive.vectorized.execution.enabled=true; What is the difference between the sort by and order by clauses in Hive? Which is faster?
ORDER BY – sorts all the data through one reducer; SORT BY is much faster than ORDER BY.
SORT BY – sorts the data within each reducer; you can use any number of reducers.
In the first case (ORDER BY), the maps send every value to a single reducer, which sorts them all.
In the second case (SORT BY), the maps split the values across many reducers, and each reducer sorts its own list, so the sort finishes quickly (though the output is only ordered per reducer). Example:
SELECT name, id, cell FROM user_table ORDER BY id, name;
SELECT name, id, cell FROM user_table DISTRIBUTE BY id SORT BY name; Whenever you run a Hive query, why does it first create a metastore_db? What is the importance of metastore_db?
When we run a Hive query with the embedded Derby metastore, Hive first creates a local metastore; before creating it, Hive checks whether a metastore already exists. If one is present (held by another session) it shows an error, otherwise the process goes on. This is configured in hive-site.xml like this:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property> Tell me the different Hive metastore configurations.
There are three types of metastore configuration:
1) Embedded metastore
2) Local metastore
3) Remote metastore.
By default Hive starts in embedded mode; all command-line operations run in embedded mode, accessing the Hive libraries locally. In the embedded metastore configuration, the Hive driver, the metastore interface, and the database run in the same JVM. It's good for development and testing.
In a local metastore, the metadata is stored in an external database such as MySQL. Here the Hive driver and metastore still run in the same JVM, but they communicate remotely with the external database, which requires credentials for better protection.
In remote mode, queries run over a Thrift server.
In a remote metastore, the Hive driver and the metastore interface run in different JVMs, so for better protection the database credentials are isolated from Hive users.
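A minimal sketch of pointing Hive clients at a remote metastore in hive-site.xml (the host and port are illustrative assumptions):

```xml
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>
```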
Can Hive process any type of data format?
Yes. Hive uses the SerDe interface for I/O operations; different SerDe implementations can read and write different kinds of data stored in Hadoop. Examples:
MetadataTypedColumnsetSerDe: reads/writes CSV-style data.
JsonSerDe: processes JSON data.
RegexSerDe: processes data such as web logs using regular expressions.
AvroSerDe: processes Avro-format data. What is the HWI?
The Hive Web Interface (HWI) is a simple graphical alternative to the command-line interface. It lets you start at the database level directly, shows all SerDes, column names, and types, and simplifies the Hive workflow. It is session based, so you can run multiple Hive queries simultaneously. There is no local-metastore mode in HWI. What is the difference between the LIKE and RLIKE operators in Hive?
LIKE: finds substrings within a string using the SQL wildcards % and _.
RLIKE: a special function that also finds substrings, but it uses Java regular expressions and returns true or false.
Example: table name is table, column is name.
name=VenuKatragadda, venkatesh, venkateswarlu
Select * from table where name like 'Venu%'; // VenuKatragadda.
select * from table where name rlike 'venk'; // false, true, true. What are the Hive default read and write classes?
Hive uses 2+2 classes to read and write files:
one input/output pair for plain text and a second pair for sequence files. What is the query processor in Hive?
It's the core processing unit in the Hive framework: it converts SQL into map/reduce jobs and runs them along with the other dependencies. As a result, Hive can convert Hive queries into MapReduce jobs.
What are Views in Hive?
A view is a logical construct created and managed according to user requirements. It is used where a query is complicated: it hides the complexity of the query and makes things easier for users. Example:
Create view high_paid as select * from employee where salary>10000; What is the difference between a database and a data warehouse?
Typically a database is designed for OLTP transactional operations, whereas a data warehouse is implemented for OLAP (analytical) operations.
OLTP is usually constrained to a single application; OLAP sits as a layer on top of several databases.
OLTP processes current, streaming, dynamic data, whereas OLAP processes retained, historic, static data.
Databases follow normalization; a DWH is typically de-normalized.
What is the difference between internal and external tables in Hive?
Hive creates a metastore database on the master node to keep each table's metadata safe; for example, if you partition a table, the partition schema is stored there.
For a managed (internal) table, Hive owns both the data in its warehouse directory and the metadata. For an external table, the data stays at its external location and only the metadata lives in the metastore, so Hive loads just what the query logic needs.
If a user drops an internal table, Hive drops both the original data and the metastore entry, but dropping an external table removes only the metastore entry, not the original data. Hive creates internal tables by default, but that is not recommended for data you cannot afford to lose: store such data in an external table.
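The drop-time difference can be seen directly; a minimal sketch, assuming hypothetical table names and path:

```sql
CREATE TABLE managed_t (id int, name string);            -- data owned by Hive
CREATE EXTERNAL TABLE external_t (id int, name string)
  LOCATION '/data/external_t';                           -- data stays here
DROP TABLE managed_t;    -- removes the metadata AND the data files
DROP TABLE external_t;   -- removes only the metadata; the files remain
```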
How do you write single-line and multi-line comments in Hive?
To write a single-line comment, start it with --.
eg: -- It is a too-important step.
Hive doesn't support multi-line comments at present. What is the importance of the Thrift server & client, JDBC, and ODBC drivers in Hive?
Thrift is a cross-language RPC framework that generates code, combines it into a software stack, and finally executes the Thrift code on a remote server. The Thrift compiler acts as an interpreter between server and client. The Thrift server allows a remote client to submit requests to Hive using different programming languages such as Python, Ruby, and Scala.
JDBC driver: a software component that enables a Java application to interact with a database.
ODBC driver: ODBC achieves DBMS independence by using an ODBC driver as a translation layer between the application and the DBMS. Does Hive support 100% of SQL queries such as insert, delete, and update?
Hive doesn't support record-level updates; to update records, it integrates with HBase. When do you use Hive?
When the data is structured and static, when latency is not a problem, and when the data is processed with queries, Hive is the best option; data-warehouse workloads are most often processed in Hive. What is the use of partitioning in Hive?
To analyze a particular subset of data you don't need to load all of it: partitioning out the desired data is a good approach. To achieve this, Hive lets you partition the data on a particular column. Both static and dynamic partitioning can improve Hive performance. For instance, if you need information for a particular year, partition by year.
Is a schema mandatory in Hive?
Yes, it's mandatory to create a table in a database: Hive is a schema-oriented model, and it stores the schema information in the metastore database. How does Hive serialize and deserialize data?
In Hive, SerDe stands for Serializer/Deserializer. When reading, a query first goes through the InputFormat, which hands records to a RecordReader; each record holds the data in serialized (binary) form. The SerDe (possibly a custom one) then uses an ObjectInspector to deserialize the record into individual fields, which are delivered to the end user. Writing follows the reverse path. How does Hive use Java in SerDe?
To insert data into a table, Hive creates Java objects. To transfer Java objects over the network, the data must be serialized: each field is serialized using an ObjectInspector, and the serialized data is finally stored in the Hive table. Does Hive support insert, delete, or update?
As of now, Hive doesn't support record-level update, insert, or delete queries. HQL is a subset of SQL, not equal to it; to update records, Hive integrates with HBase. Tell me a few function names in Hive.
CONCAT('Venu','-','Bigdata','-','analyst'); // Venu-Bigdata-analyst
CONCAT_WS('-', 'venu', 'bigdata', 'analyst'); // venu-bigdata-analyst
TRIM(' VENU '); // 'VENU' (spaces removed on both sides)
LTRIM(' venu '); // 'venu ' (trims the left side, but not the right)
RTRIM(' venu '); // ' venu' (trims the right side only, but not the left)
UPPER or UCASE('Venu'); // VENU
RLIKE: returns true/false for a substring/regex match.
'Venu' RLIKE 'en' // true
'Venu' RLIKE '^V.*' // true. What is the difference between order by and sort by in Hive?
SORT BY – uses any number of reducers, so it can process quickly (but only sorts within each reducer).
ORDER BY – uses a single reducer; if the data is too large, it takes a long time to sort. What is the difference between internal and external tables?
External table: only the metadata is stored in the metastore; the actual data stays at its external location. If an external table is dropped, only the metastore entry is lost, not the actual data.
Internal table: both the metadata and the actual data are managed by Hive, so if the table is dropped, both the actual data and the metastore entry are lost. What is the difference between Hive and HBase?
Hive allows most SQL queries, but HBase does not allow SQL queries directly.
Hive doesn't support record-level update, insert, and delete operations on a table, but HBase does.
Hive is a data-warehouse framework, whereas HBase is a NoSQL database.
Hive queries run as MapReduce jobs, whereas HBase serves data directly on top of HDFS.
In how many ways can you run Hive?
In CLI mode (using the command-line interface).
By using JDBC or ODBC.
By calling the Hive Thrift client, which lets Java, PHP, Python, Ruby, and C++ programs submit commands to Hive. Can you explain the different types of SerDe?
By default Hive uses the Lazy SerDe; it also provides a JSON SerDe, and the RegexSerDe is often used to serialize and deserialize data. Why do we use buckets in Hive?
Processing many chunks of files to analyze a vast amount of data can blow up processing time. Bucketing is a sampling technique that distributes the data using a hashing algorithm; set hive.enforce.bucketing=true; enables it. How does Hive organize data?
Hive organizes data in three ways: tables, partitions, and buckets. Tables are organized with primitive column types plus arrays and maps. Partitions have one or more partition keys, chosen per project requirements.
Buckets are used to sample the data for analysis: it's a good approach to process a pinch of the data in buckets instead of processing all of it. Can you explain the Hive architecture?
There are 5 core components in Hive: UI, Driver, Compiler, Metastore, and Execution Engine. What is the User Interface (UI)?
UI: the interface between users and the Driver; it accepts queries from the user and passes them to the Driver. Hive currently offers two types of interface, a command-line interface and a GUI, and provides Thrift and JDBC/ODBC interfaces for integrating other applications. What is the importance of the Driver in Hive? Driver: it manages the life cycle of a HiveQL query. The Driver receives queries from the user interface or the JDBC/ODBC interfaces and creates a separate session handle for each query.
Compiler: accepts the plan from the Driver and gets the required metadata from the MetaStore to build the execution plan.
MetaStore: Hive stores metadata, i.e. information about the data, in the MetaStore in table form; the tables may be internal or external. The Hive compiler gets its metadata information from the MetaStore.
Execution Engine: executes the plan produced by the compiler, submitting the queries as MapReduce jobs (via the JobTracker) to process the data. When do we use explode in Hive?
Sometimes a Hadoop developer needs to turn an array (or map) column into separate table rows. Hive's explode does exactly this, converting complex data types into row format.
SELECT explode(arrayName) AS newCol FROM TableName;
SELECT explode(mapName) AS (newCol1, newCol2) FROM TableName; What is the ObjectInspector functionality in Hive?
Hive uses ObjectInspector to analyze the internal structure of rows, columns, and complex objects, and it gives us ways to access the internal fields inside an object. It handles not only primitive types like int, bigint, and string, but also complex types like arrays, maps, structs, and unions. Can you override the Hadoop MapReduce configuration in Hive?
Yes, you can override the map and reduce settings in Hive's configuration: Hive allows you to override Hadoop configuration values per session. How do you display the current database name in the terminal?
There are two ways to show the current database: one temporary in the CLI, and one persistent.
1) In the CLI just enter this command: set hive.cli.print.current.db=true;
2) In hive-site.xml, set the same property persistently.
In the second scenario, the Hive database name is displayed automatically whenever you open the terminal.
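The original post omitted the hive-site.xml snippet; a minimal sketch of the property in question:

```xml
<property>
  <name>hive.cli.print.current.db</name>
  <value>true</value>
</property>
```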
Is a job split into maps?
No. The Hadoop framework splits the data file, not the job; the chunks of data are stored in blocks, and each split needs one map to process it. A job is a configurable unit that controls execution of the plan/logic: it is not a physical data set to be split, but a logical configuration API for processing those splits. What is the difference between describe and describe extended?
To see a table definition in Hive, use the describe <table name>; command, whereas
to see more detailed information about the table, use describe extended <tablename>;.
Another important command, describe formatted <tablename>;, presents all the details in a clean layout.
What is the difference between static and dynamic partitioning of a table?
Partitioning prunes data during queries and so minimizes query time. A partition is created when data is inserted into the table. With static partitioning you name the partition values yourself for each load, whereas dynamic partitioning derives the partition values from a particular column while processing the whole table. In strict mode, at least one static partition is required before any dynamic partitions can be created. If you are partitioning large data sets in a sort of ETL flow, dynamic partitioning is recommended.
What is the difference between partitioning and bucketing?
The main aim of both partitioning and bucketing is to execute queries more efficiently. When you create a table, partitioning fixes the slices of the table up front.
Bucketing follows a hash algorithm: based on the number of buckets, the data is spread across the buckets, which makes sampling the data easy.
When a user queries an unpartitioned table, Hive reads the entire data set, which takes a long time and is a heavy bottleneck for MapReduce jobs. To overcome this, Hive offers table partitioning. Big-data analysts partition Hive tables very frequently; it is the recommended way to analyze large data sets. Based on the values of particular columns, the input records are segregated into different files, and you can partition by multiple columns. Instead of scanning a vast amount of data, you partition and analyze only the target data to get the desired results: the best approach to improving query performance on larger tables.
For example, the population of India is about 110 crore. To process only the Andhra Pradesh population, simply partition by state: instead of processing 110 crore records you process about 4 crore, so you handle a small amount of data and get quick results.
Hadoop developers most frequently create a schema and store the data in HDFS without partitioning it. To partition, use the PARTITIONED BY clause to segregate by column.
Syntax to partition the data:
CREATE TABLE table_name (col1 data_type1, col2 data_type2, …) PARTITIONED BY (partition_column1 data_type1, partition_column2 data_type2);
Please note: Don’t partition too many columns, which is an overhead to NameNode.
There are two types of partitions in Hive: static and dynamic. To use dynamic partitioning, you must run these 2 commands:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
With dynamic partitioning, you list only the partition column names (no data types or values) in the PARTITION clause of the insert.
Please note: don't repeat the partition columns in the main column list of the table definition.
In strict mode you need at least one static partition; the main difference between static and dynamic partitioning is governed by hive.exec.dynamic.partition.mode=strict or nonstrict.
If you want to debug, use DESCRIBE FORMATTED partition_table_name;.
Please note that many analysts struggle at this step to fit a schema to the raw data.
After creating the partitioned table, insert the raw Hive data to partition it:
INSERT OVERWRITE TABLE table_name
PARTITION (partition_column1, partition_column2)
SELECT * from un_partitioned_table;
Now the data from un_partitioned_table is dumped into the partitioned table called table_name.
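The contrast between the two partition styles can be sketched as follows (the table and column names are hypothetical):

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE logs_part (msg string)
PARTITIONED BY (year int, month int);

-- static: the partition values are named explicitly
INSERT OVERWRITE TABLE logs_part PARTITION (year=2015, month=1)
SELECT msg FROM raw_logs WHERE year = 2015 AND month = 1;

-- dynamic: the partition values come from the trailing SELECT columns
INSERT OVERWRITE TABLE logs_part PARTITION (year, month)
SELECT msg, year, month FROM raw_logs;
```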
To run SQL queries on top of MapReduce, Hadoop uses the Hive tool. In this post I explain different ways to create Hive tables and load data from the local file system, HDFS, and other HQL tables.
List the databases in Hive: show databases;
Create database: create database hiveSeries;
Select a database to store tables in: use hiveseries;
Display the current database: set hive.cli.print.current.db=true;
C1) Create table: create table cellnumbers (name string, age int, cell bigint) row format delimited fields terminated by ',';
Know how many tables: show tables;
Load data from the local file system:
L1) load data local inpath '/home/hadoop/Desktop/cellNo.txt' into table cellnumbers;
Check the table schema: describe cellnumbers;
select * from cellnumbers;
Advanced Tips to create Hive tables.
C2) Create a table & load data from an existing table:
create table namewithcell as select name, cell from cellnumbers;
C3) Create one field table for testing:
create table onecell (cell bigint);
L2) Insert data from an existing Hive table:
insert into table onecell select cell from cellnumbers;
C4) Create another table for testing:
create table temp (name string, age int, cell bigint) row format delimited fields terminated by ',';
L3) Load data from HDFS:
load data inpath '/samplefile' into table temp;
Creating external tables, comments, altering tables, overwriting tables
Debugging: describe, describe extended, describe formatted: to inspect the schema of the table.
Comment: a remark you add for understandability when creating a Hive table. Overwrite: this option overwrites the table, i.e. deletes the old data and replaces it with the new. External table: the table is created outside the warehouse directory, so when you delete it, only the metadata is removed, not the original data.
create external table external_table (name string comment "First Name", age int, officeNo bigint comment "official number", cell bigint comment "personal mobile") row format delimited fields terminated by '\t' lines terminated by '\n' stored as textfile;
load data local inpath '/home/hadoop/Desktop/cellNo.txt' overwrite into table external_table; Display the database column header names:
set hive.cli.print.header=true; Alter table: you can change the structure of the table, e.g. rename the table, add new columns, or replace the columns. Dropping an individual column is not possible, but as an alternative you can select only the desired fields into a new table.
alter table onecell drop column f_name; // not possible in Hive
create table one_cell as select cell from onecell; // the alternative technique
alter table cellnumbers rename to MobileNumbers;
alter table onecell add columns (name string comment "first name");
alter table onecell replace columns (cell bigint, f_name string);
Collection data types:
create table details (name string, friends array<string>, cell map<string, bigint>, others struct<company:string, your_Pincode:int, Married:string, Salary:float>) row format delimited fields terminated by '\t' collection items terminated by ',' map keys terminated by ':' lines terminated by '\n' stored as textfile;
load data local inpath '/home/hadoop/Desktop/complexData.txt' into table details;
select name, cell['personal'], others.company from details;
Bucketized Hive table | How to sample a Hive table?
Create a sample table: create table samp (country string, code string, year int, population bigint) row format delimited fields terminated by ',';
Load data into the sample table: load data local inpath '/home/hadoop/Desktop/Population.csv' into table samp;
Create a bucketed table: create table buck (countryName string, countrycode string, year int, population bigint) clustered by (year) sorted by (countryname) into 45 buckets;
Enable bucketing: set hive.enforce.bucketing = true;
Insert data into the bucketed table: insert overwrite table buck select * from samp;
Test the bucketed table: select countryName, population from buck tablesample(bucket 10 out of 45 on year);
Test the normal table: select country, population from samp limit 10;
Hive Inner Join, Right Outer Join, Map side Join
What is Join?
A join is a clause that combines the records of two tables (or data sets).
Create sample tables to join
create table products (name string, user string, units int) row format delimited fields terminated by ',';
create table price (product string, price int) row format delimited fields terminated by ',';
create table users (user string, city string, cell bigint) row format delimited fields terminated by ',';
load data local inpath '/home/hadoop/Desktop/price.txt' into table price;
load data local inpath '/home/hadoop/Desktop/productname.txt' into table products;
load data local inpath '/home/hadoop/Desktop/users.txt' into table users;
Inner Join:
select products.* , users.city, users.cell from products join users on products.user = users.user;
Left Outer Join:
select products.* , users.city, users.cell from products left outer join users on products.user = users.user;
Right Outer Join:
select products.* , users.* from products right outer join users on products.user = users.user;
Full Outer Join:
select products.* , users.* from products full outer join users on products.user = users.user;
Map-side join: all tasks are performed by the mapper alone, which is suitable when one table is small enough to broadcast. When you need to join a large table with a small table, Hive can perform a map-side join to optimize the task.
select /*+ mapjoin(users) */ products.* , users.* from products join users on products.user = users.user;
select /*+ mapjoin(products) */ products.* , users.* from products join users on products.user = users.user;
select /*+ mapjoin(products) */ products.* , users.* from products right outer join users on products.user = users.user;
select /*+ mapjoin(users) */ products.* , users.* from products left outer join users on products.user = users.user;