Hive Video Tutorials

Hive lets you run SQL-like queries on top of MapReduce in Hadoop. In this post I explain different ways to create Hive tables and load data from the local file system, from HDFS, and from another Hive table.

 

List the databases in Hive: show databases;

Create database: create database hiveSeries;

Select the database in which to store tables: use hiveseries;

Display the current database in the prompt: set hive.cli.print.current.db=true;

C1) Create table: create table cellnumbers (name string, age int, cell bigint) row format delimited fields terminated by ',';

List the tables: show tables;

 

Load data from the local file system:
L1) load data local inpath '/home/hadoop/Desktop/cellNo.txt' into table cellnumbers;

Check the table schema: describe cellnumbers;

select * from cellnumbers;

Advanced Tips to create Hive tables.

C2) Create a table and load data from an existing table:
create table namewithcell as select name, cell from cellnumbers;

C3) Create one field table for testing:
create table onecell (cell bigint);

L2) Insert data from an existing Hive table:
insert into table onecell select cell from cellnumbers;

C4) Create another table for testing:
create table temp (name string, age int, cell bigint) row format delimited fields terminated by ',';

L3) Load data from HDFS:
load data inpath '/samplefile' into table temp;

Check the data in the NameNode web UI:
http://localhost:50070

Files:
CellNo:
Venu,30,9247159150
satya,28,8889990009
sudha,33,7788990009
venkat,23,8844332244
sudhakar,10,993322887
jyothi,34,6677889900

Create external table, comments, Alter table, Overwrite table

Debugging: describe table, describe extended, describe formatted: show the schema (and storage details) of a table.

Comment: a human-readable description you attach to a table or column when creating it, for documentation purposes.
Overwrite: this option overwrites the table's data: the old contents are deleted and replaced with the newly loaded data.
External table: the table's data is stored outside Hive's warehouse directory, so when you drop the table only the metadata is deleted; the original data stays in place.

create external table external_table (name string comment "First Name", age int, officeNo bigint comment "official number", cell bigint comment "personal mobile") row format delimited fields terminated by '\t' lines terminated by '\n' stored as textfile;

load data local inpath '/home/hadoop/Desktop/cellNo.txt' overwrite into table external_table;

Display column header names in query output:
set hive.cli.print.header=true;
Alter table: you can change the structure of a table: rename it, add new columns, or replace its columns. Dropping an individual column is not possible, but as an alternative you can create a new table with only the desired columns.
alter table onecell drop column f_name;  -- not possible in Hive
create table one_cell as select cell from onecell;  -- the alternative technique

alter table cellnumbers rename to MobileNumbers;
alter table onecell add columns (name string comment "first name");
alter table onecell replace columns (cell bigint, f_name string);

………….

Collection data types:
create table details (name string, friends array<string>, cell map<string,bigint>, others struct<company:string, your_pincode:int, married:string, salary:float>) row format delimited fields terminated by '\t' collection items terminated by ',' map keys terminated by ':' lines terminated by '\n' stored as textfile;

load data local inpath '/home/hadoop/Desktop/complexData.txt' into table details;
select name, cell['personal'], others.company from details;
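For reference, here is what one line of complexData.txt could look like for the schema above (the names and numbers are made up for illustration). Fields are separated by tabs, collection items (array elements and struct members) by commas, and map keys from values by colons:

```
venu	ravi,satya,kiran	personal:9247159150,official:9000000001	TCS,500001,yes,45000.0
```

With this row, the select above should return venu, 9247159150 and TCS.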

…………………………………….

Bucketized Hive Table | How to sample a Hive table?

Create a sample table: create table samp (country string, code string, year int, population bigint) row format delimited fields terminated by ',';

Load data into the sample table: load data local inpath '/home/hadoop/Desktop/Population.csv' into table samp;

Create a bucketed table: create table buck (countryName string, countrycode string, year int, population bigint) clustered by (year) sorted by (countryName) into 45 buckets;

Enable bucketing: set hive.enforce.bucketing = true;

Insert data into the bucketed table: insert overwrite table buck select * from samp;
Sample the bucketed table: select countryName, population from buck tablesample(bucket 10 out of 45 on year);
Query the normal table: select country, population from samp limit 10;
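A sketch of how tablesample picks rows, assuming Hive's default bucketing hash (for an int column the hash is the value itself): bucket x "out of" y selects the rows whose hash modulo y equals x - 1. So the sampling query above is roughly equivalent to:

```sql
-- rows whose year falls into bucket 10 of 45 (1-based), i.e. hash(year) % 45 == 9
select countryName, population
from buck
where (year % 45) = 9;
```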
////////////////////////////////////////////////

Hive Inner Join, Right Outer Join, Map-side Join


What is Join?
Join is a clause that combines the records of two tables (or data sets).

Create sample tables to join

create table products (name string, user string, units int) row format delimited fields terminated by ',';
create table price (product string, price int) row format delimited fields terminated by ',';
create table users (user string, city string, cell bigint) row format delimited fields terminated by ',';

load data local inpath '/home/hadoop/Desktop/price.txt' into table price;
load data local inpath '/home/hadoop/Desktop/productname.txt' into table products;
load data local inpath '/home/hadoop/Desktop/users.txt' into table users;
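To make the join results easier to follow, assume the files contain rows like these (made-up sample data; note that "kiran" appears in products but not in users, which matters for the outer joins):

```
productname.txt (name,user,units):
laptop,venu,2
phone,satya,1
tablet,kiran,3

users.txt (user,city,cell):
venu,Hyderabad,9247159150
satya,Chennai,8889990009
```

With these rows, the inner join below returns only the laptop and phone rows, while the left outer join also returns the tablet row with NULL city and cell.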

Inner Join
……………..
select products.* , users.city, users.cell from products join users on products.user = users.user;
………..
Left Outer Join:

select products.* , users.city, users.cell from products left outer join users on products.user = users.user;

Right Outer Join:
——————
select products.* , users.* from products right outer join users on products.user = users.user;

Full Outer Join:
select products.* , users.* from products full outer join users on products.user = users.user;

Map Join:
——————
In a map join, all the work is done by the mappers only; there is no shuffle or reduce phase. It is suitable when joining a large table with a small table: Hive loads the small table into memory on each mapper, which optimizes the join.

select /*+ mapjoin(users) */ products.* , users.* from products join users on products.user = users.user;

select /*+ mapjoin(products) */ products.* , users.* from products join users on products.user = users.user;
select /*+ mapjoin(products) */ products.* , users.* from products right outer join users on products.user = users.user;

select /*+ mapjoin(users) */ products.* , users.* from products left outer join users on products.user = users.user;
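Recent Hive versions can also convert a common join into a map join automatically, so the /*+ mapjoin */ hint is often unnecessary. A sketch using the auto-conversion settings (the threshold shown is, as far as I recall, the default):

```sql
-- let Hive rewrite a join as a map join when one side is small enough
set hive.auto.convert.join=true;
-- size threshold in bytes below which a table counts as "small"
set hive.mapjoin.smalltable.filesize=25000000;

-- no hint needed; Hive loads the smaller table into memory if it fits
select products.*, users.* from products join users on products.user = users.user;
```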

///////////////////////////////