Zeppelin Installation & Run SparkSQL


 

Apache Zeppelin is a web-based notebook that lets programmers implement Spark applications in Scala and Python. It's open source, so you can easily download it and build Spark applications. In this video I explain how to install Zeppelin and analyze sample CSV data.


Zeppelin Documentation

Download Zeppelin from an Apache mirror website and extract the archive:
wget http://www.us.apache.org/dist/incubator/zeppelin/0.5.0-incubating/zeppelin-0.5.0-incubating-bin-spark-1.4.0_hadoop-2.3.tgz

tar -xzf zeppelin-0.5.0-incubating-bin-spark-1.4.0_hadoop-2.3.tgz

Step 1: Install the prerequisites:
sudo apt-get update
sudo apt-get install openjdk-7-jdk
sudo apt-get install git maven npm

Step 2: Clone and build the repository (only needed if you want to build from source instead of using the binary package downloaded above):
git clone https://github.com/apache/incubator-zeppelin.git
cd incubator-zeppelin
mvn clean package -Pspark-1.4 -Dhadoop.version=2.2.0 -Phadoop-2.2 -DskipTests

Step 3:

Configuration:
Modify the env files. Copy zeppelin-env.sh.template to zeppelin-env.sh and zeppelin-site.xml.template to zeppelin-site.xml:

./conf/zeppelin-env.sh
./conf/zeppelin-site.xml

Include the following lines in the ./conf/zeppelin-env.sh file. These paths suit Hadoop 1.x and Spark 1.4.0; adjust them to your own install locations.
export SPARK_HOME=/home/$USER/userwork/spark-1.4.0-bin-hadoop1
export HADOOP_HOME=/home/$USER/work/hadoop-1.1.2
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64/

Add the following to your ~/.bashrc:
export ZEPPELIN_MASTER=/home/$USER/work/zeppelin-0.5.0-incubating-bin-spark-1.4.0_hadoop-2.3/zeppelin-0.5.0-incubating/

export PATH=$ZEPPELIN_MASTER/bin:$PATH

Run:
zeppelin-daemon.sh start

Then open http://localhost:8080 in your browser.

Please note that it is not mandatory to start Spark or Hadoop before starting Zeppelin.

The installation is now complete.
To implement a POC, download the sample dataset from the link below and unzip it somewhere convenient.
wget http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip

unzip bank.zip
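Before parsing, it helps to see the file layout the Scala code in this tutorial relies on: a quoted, semicolon-separated header followed by data rows. The snippet writes two sample lines in that style (the data values, the bank-sample.csv filename, and the abridged six-column header are invented for illustration) and previews them:

```shell
# Write two sample lines mimicking bank-full.csv: a quoted, semicolon-
# separated header plus one data row (values invented for illustration,
# header abridged to the first six columns).
printf '%s\n' '"age";"job";"marital";"education";"default";"balance"' \
              '58;"management";"married";"tertiary";"no";2143' > bank-sample.csv

# Preview: note the quotes around every string field; they must be
# stripped before the fields are usable.
head -n 2 bank-sample.csv
```

The quotes around string fields are why the parsing code further down calls replaceAll on each extracted field.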

Go to Zeppelin and create a new note. First create an RDD; that is the fundamental step.

Create RDD

val x = sc.textFile("/home/hadoop/Desktop/bank/bank-full.csv")

/*Case classes are regular classes which export their constructor parameters and which provide a recursive decomposition mechanism via pattern matching.
more help: http://www.scala-lang.org/old/node/107
*/
case class Bank(age:Integer, job:String, marital : String, education : String, balance : Integer)
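To see what that buys you, here is a standalone sketch (it redeclares Bank so it compiles on its own; the instance values are made up for illustration):

```scala
// Redeclared here so the sketch is self-contained; field values are hypothetical.
case class Bank(age: Integer, job: String, marital: String,
                education: String, balance: Integer)

val b = Bank(35, "technician", "single", "secondary", 1200)

// Constructor parameters are exported as fields automatically.
val a = b.age

// And the class decomposes via pattern matching.
val label = b match {
  case Bank(age, _, _, _, bal) if bal > 1000 => s"age $age, high balance"
  case Bank(age, _, _, _, _)                 => s"age $age"
}
```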

//Here we extract only the specified fields.

val bank = x.map(s => s.split(";"))
  .filter(s => s(0) != "\"age\"")
  .map(s => Bank(s(0).toInt, s(1).replaceAll("\"", ""), s(2).replaceAll("\"", ""), s(3).replaceAll("\"", ""), s(5).replaceAll("\"", "").toInt))

bank.toDF().registerTempTable("bank")

/* Here, s(0).toInt converts the string to an Integer. On s(1) and the other string fields, replaceAll("\"", "") strips the " symbols from the values; if you omit it, you will face errors, as you will see in this video. */
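The same cleaning logic can be checked outside Spark on a plain Scala List. This standalone sketch (with two invented lines in the bank-full.csv layout, and Bank redeclared so it compiles on its own) walks through the split/filter/replaceAll chain:

```scala
// Redeclared so the sketch is self-contained.
case class Bank(age: Integer, job: String, marital: String,
                education: String, balance: Integer)

// Two made-up lines in the bank-full.csv layout: quoted header, then data.
val lines = List(
  "\"age\";\"job\";\"marital\";\"education\";\"default\";\"balance\"",
  "58;\"management\";\"married\";\"tertiary\";\"no\";2143")

val parsed = lines.map(_.split(";"))
  .filter(s => s(0) != "\"age\"")           // drop the header row
  .map(s => Bank(s(0).toInt,
                 s(1).replaceAll("\"", ""), // strip the surrounding quotes
                 s(2).replaceAll("\"", ""),
                 s(3).replaceAll("\"", ""),
                 s(5).replaceAll("\"", "").toInt))
```

Only the List source differs from the RDD version above; the map/filter/map chain is identical, which is what makes the logic easy to test locally before running it in a note.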

Run SQL Queries
%sql select age, count(1) as total from bank where age < 30 group by age order by age

%sql select age, count(1) total from bank where age < ${maxAge=30} group by age order by age

%sql select age, count(1) from bank where marital="${marital=single}" group by age order by age

Stop the server:
zeppelin-daemon.sh stop

References:
https://github.com/apache/incubator-zeppelin
https://zeppelin.incubator.apache.org/docs/tutorial/tutorial.html