IGFS: Ignite File System

Everyone is chanting about Apache Hadoop; it is the number one framework for storing data in a reliable manner. I agree, it is reliable, fast, low cost and fault tolerant. But is there any competitor or alternative to HDFS for storing data? Yes, there is: the Ignite File System (IGFS). In this article I explain an overview of the Ignite File System.


What is the Ignite File System (IGFS)?

Apache Ignite is a unified system to store and process any type of data. Internally, Apache Ignite uses the Ignite File System (IGFS) to store data. In other words, like HDFS it stores data, like Alluxio it centralizes data in memory, and like Spark it processes everything in memory. It is a fourth-generation system.

HDFS has two types of storage, memory and disk, and by default HDFS stores data on disk. Usually data moves to memory only while you are processing it; after processing, the data is stored back on disk.

In IGFS, on the other hand, data lives in two storage levels called on-heap memory and off-heap memory. By default it stores data in memory; whatever does not fit in memory is stored in off-heap memory. That means there are no I/O hits at processing time, so data is processed quickly. This is a huge plus for in-memory processing systems like Spark.

On-heap vs Off-heap Memory
Simply put, while data is being processed, temporary data is held in memory. For example, suppose the heap size is 8 GB.
If you process 5 GB, that data fits in memory, so it is called on-heap data. After processing, the garbage collector cleans that on-heap memory.

If the data is larger than the heap memory (8 GB), the remaining data is stored in off-heap memory, which lies outside the JVM heap. For example, if you have an 8 GB heap and you want to process 10 GB of data, then 8 GB stays on-heap and the remaining 2 GB is stored off-heap. The garbage collector is unable to clean that off-heap memory.
Compared with off-heap memory, on-heap memory is very fast; but compared with disk, off-heap memory is very fast. Ignite stores data in both ways: on-heap and off-heap.

IGFS Integration with Other Systems

Ignite integrates easily with any distributed system, such as HDFS, Cloudera or Hortonworks. Unlike HDFS, IGFS does not need a NameNode; it automatically determines file data locality using a hashing function.

If you use Ignite, there is no need for Alluxio/Tachyon; both provide the same functionality. Ignite both stores data and processes it, whereas Alluxio is simply an acceleration layer on top of HDFS and does not process anything.

Please note that Ignite is a replacement for Alluxio, but not a replacement for HDFS or Spark. If you know Spark, you can run Spark directly or execute Ignite commands; Ignite supports either.

Additionally, Ignite supports OLAP and OLTP operations. These features are not available in HDFS. That is why Apache Ignite will create wonders in the future, especially in the Internet of Things.

RDD vs DataFrame vs DataSet

An Application Program Interface (API) is a set of functions and procedures that allow applications to access the features or data of an operating system, application, or other service in order to process data.
APIs are very important when implementing any framework. Spark uses many APIs and has imported ideas from other Big Data ecosystems. The latest Spark revolves around three types of APIs: the RDD, DataFrame and DataSet APIs.

Now, which is the best API: RDD, DataFrame or DataSet, and why? These are common Spark interview questions. In this post I explain the differences between RDD, DataFrame and DataSet.

What is RDD?
In simple words, an RDD is a collection of Java or Scala objects that is immutable, distributed and fault tolerant.
Spark Core uses many functions, most of them borrowed from Scala. Based on their functionality, Spark separates those functions into transformations and actions.
In the RDD API you use these Scala-style functions (transformations and actions) to compute the data. The main advantage is that if you know Scala functions, it is easy to compute data.
The main disadvantage of RDDs is that they use Java serialization by default. Both Java and Scala run on the JVM, so both use Java serialization.
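To make this concrete, here is a minimal sketch of the RDD API, assuming spark-shell (so a SparkContext named sc already exists); the numbers are just example data.

// Build an RDD from an in-memory collection.
val numbers = sc.parallelize(1 to 10)

// Transformations are lazy: nothing runs yet.
val evens   = numbers.filter(_ % 2 == 0)
val squared = evens.map(n => n * n)

// An action triggers the actual computation.
val result = squared.collect()   // Array(4, 16, 36, 64, 100)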

Why Java Serialization?

For example, if you want to store arrays, JSON data, or other complex objects in a database, the database does not support them directly. So you serialize the data into a binary format and then convert that binary data into a format the database understands.
Java has its own serialization concept, called Java serialization.
Java serialization was intended for small numbers of Java objects, not for huge numbers of objects. If you rely on Java serialization at scale, it drastically decreases performance.
Additionally, Java serialization consumes a huge amount of resources to serialize data. That is why formats such as Avro serialization, which compress data internally, offer a small advantage in improving performance.
RDDs use Java serialization, so performance decreases if you process a large amount of data. Kryo serialization can optimize Spark RDD jobs a little, but you must follow some terms and conditions (see the sketch below).
One more disadvantage is that Java serialization sends both the data and its structure between nodes. That is another headache, but it is resolved in DataFrames.
Spark introduced RDDs at the very beginning, in Spark 1.0. Processing works fine; performance is the only main disadvantage. If you are processing unstructured data, RDDs are still highly recommended.
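The main "condition" for Kryo is that you should register your classes with it. Here is a minimal sketch for a standalone application, assuming the Scala API; the case class and app name are only examples.

import org.apache.spark.{SparkConf, SparkContext}

case class Person(name: String, age: Int)   // example class to register

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes avoids writing full class names into every serialized record.
  .registerKryoClasses(Array(classOf[Person]))

val sc = new SparkContext(conf)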

What is DataFrame?
A couple of releases later, Spark introduced another API called DataFrame. It is very powerful, with the main focus on performance and on running SQL queries on top of the data.
In simple words, a DataFrame is a collection of RDD rows plus a schema. In a DataFrame the data is organized into named columns, like an RDBMS table. The structure is kept separate from the data, and because Spark understands the schema it does not need Java serialization to encode the structure; only the data is serialized.
So a Spark developer can easily run SQL queries on top of distributed data. DataFrames additionally support DSL commands, so a Scala programmer can easily run Scala-style operations as well. These features are not available in the RDD API.
Because Spark knows the schema, there is no need to use Java serialization to encode the structure, and there is no need to deserialize the data when you apply sorting or shuffling.

The power of the DataFrame API is the Catalyst optimizer. It internally builds logical plans and physical plans and finally, based on a cost model, chooses the best optimized plan, so it is internally optimized compared with RDDs.
DataFrames still use Java serialization for the data, so some of the same disadvantages as RDDs remain in DataFrames as well. In short, the main advantages are optimized performance and a user-friendly API, and the main disadvantage is serialization.
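A minimal sketch of the DataFrame API, assuming a Spark 2.x SparkSession named spark (spark-shell creates it for you); the sample data is made up.

import spark.implicits._

// Build a DataFrame from a local collection; the column names form the schema.
val people = Seq(("Venu", 30), ("Jyothi", 25)).toDF("name", "age")

// DSL style
people.filter($"age" > 26).select("name").show()

// SQL style
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 26").show()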

What is DataSet API?
Another framework, Apache Flink, internally uses two powerful APIs called DataSet and DataStream: the DataSet API is used to process batch data and the DataStream API is used to process streaming data. Spark Core is batch-oriented by default, so Spark took the same DataSet idea and added its own Dataset API experimentally in Spark 1.6.
The Dataset API got good results in Spark 1.6, so in Spark 2.0 DataFrame was merged into Dataset. In Spark 2.0 the Dataset is the main abstraction, and a DataFrame is simply a Dataset of Row objects.
The main difference between RDD, DataFrame and Dataset is serialization and performance. The Dataset API internally uses a special serialization mechanism called an encoder, which is far more efficient than Java serialization. It supports RDD transformations and DataFrame DSL commands, and allows SQL queries as well. That means if you know RDDs and DataFrames, you can apply the same steps to a Dataset.

In other words, a Dataset unifies RDD and DataFrame using encoder serialization. The Dataset was introduced in version 1.6, but it is the main abstraction in Spark 2.0. The main advantage of the Dataset is high-level type safety, whereas the RDD offers only low-level type safety, so the programmer can identify syntax errors and analysis errors at compile time.

Spark is also moving toward Datasets: instead of Spark Streaming, MLlib and GraphX, development is heading toward Structured Streaming, ML pipelines and GraphFrames. My prediction is that in the future there will be no RDD-based APIs.
One more disadvantage is that only Java and Scala support Datasets, not Python, because of its dynamic nature.
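A minimal sketch of the Dataset API, again assuming a Spark 2.x SparkSession named spark; the case class is illustrative.

import spark.implicits._

case class Person(name: String, age: Int)

// A typed Dataset: the encoder for Person is derived automatically.
val ds = Seq(Person("Venu", 30), Person("Jyothi", 25)).toDS()

// RDD-style transformation with compile-time type checking.
ds.filter(p => p.age > 26).show()
// ds.filter(p => p.salary > 0)   // would not compile: Person has no salary field

// DataFrame/SQL style on the same data.
ds.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 26").show()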

Conclusion:

The main difference between RDD, DataFrame and Dataset is performance. To optimize performance, Spark moved from RDD to DataFrame, and then from DataFrame to Dataset.

WordCount in Spark Scala

Hello, in these videos I explain how to install Eclipse and Scala, how to create the appropriate configuration in Eclipse and Maven to implement Spark applications, and finally how to run a Spark word-count program with the Maven build tool.

How to install Eclipse in Ubuntu:

If you want this script, just mail me at venu@bigdataanalyst.in and I will mail it to you.

Download Eclipse from an Eclipse mirror:
http://mirrors.ustc.edu.cn/eclipse/technology/epp/downloads/release/mars/R/eclipse-jee-mars-R-linux-gtk-x86_64.tar.gz
Put it wherever you want, e.g. /home/hadoop/work, and extract it:
tar -zxvf /home/hadoop/work/eclipse-jee-mars-R-linux-gtk-x86_64.tar.gz
gksudo gedit /usr/share/applications/eclipse.desktop
#enter password
#paste it
[Desktop Entry]
Name=Eclipse 4
Type=Application
Exec=/home/hadoop/work/eclipse/eclipse
Terminal=false
Icon=/home/hadoop/work/eclipse/icon.xpm
Comment=Integrated Development Environment
NoDisplay=false
Categories=Development;IDE;
Name[en]=Eclipse

##### How to install the Scala plugin in Eclipse #####
# First check for updates, to prevent problems in the future
Go to Help > Check for Updates > Next > Next > accept the conditions > Finish  // wait about 5 minutes, then restart Eclipse
Go to Help > Eclipse Marketplace > search for "scala" > Scala IDE > Confirm > Next > Next > Accept > Finish

After creating the Maven project, right-click it and go to Configure > Add Scala Nature.

##### How to create a Maven project and a hello-world Scala program #####
for Spark Streaming: http://mvnrepository.com/artifact/org.apache.spark/spark-streaming_2.10/1.6.0
Spark Core: http://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10
Spark SQL: http://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10
Scala: http://mvnrepository.com/artifact/org.scala-lang/scala-library/2.10.6
Hadoop: http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common/2.7.2

# Change these based on your Spark, Scala and Hadoop versions.

WordCount using Spark Scala
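The video walks through the full setup; as a quick reference, here is a minimal word-count sketch, assuming spark-shell (so sc already exists) and an input path that is only an example.

// Read a text file, split each line into words, and count each word.
val counts = sc.textFile("/home/hadoop/work/input.txt")   // example path
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.take(10).foreach(println)
// counts.saveAsTextFile("/home/hadoop/work/wordcount-output")   // optional: write the results out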

Ways to create dataframes

Hello, in my previous post I explained different Spark interview questions. In this post I explain easy ways to create DataFrames in Spark, and in this video I show how to create DataFrames in different ways.

There are usually four ways to create a DataFrame; using the DataFrame API is the best and easiest way (a short sketch follows below).
1) Use the DataFrame API
2) Programmatically specify the schema
3) Use case classes
4) Use toDF

If you want the script/code, just mail me at venu@bigdataanalyst.in
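As a quick illustration of ways 3) and 4), here is a hedged sketch, assuming spark-shell with a Spark 2.x SparkSession named spark; the data and the Employee case class are made up (the programmatic-schema way is sketched later, in the interview-question section).

import spark.implicits._

// 4) toDF on a local collection of tuples; column names are supplied explicitly.
val df1 = Seq(("Venu", 30), ("Koti", 28)).toDF("name", "age")

// 3) case class: column names and types are inferred from the fields.
case class Employee(name: String, age: Int)
val df2 = Seq(Employee("Venu", 30), Employee("Koti", 28)).toDF()

df1.show()
df2.printSchema()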

 

Sample Scala Functions

Write a function that computes x to the power n.

def power(x: Int, n: Int): Int =
  if (n == 0) 1
  else if (n > 0 && n % 2 == 0) power(x, n / 2) * power(x, n / 2)
  else x * power(x, n - 1)

Write QuickSort in Scala

def sort(xs: Array[Int]): Array[Int] =
  if (xs.length <= 1) xs
  else {
    val mid = xs(xs.length / 2)
    Array.concat(
      sort(xs filter (mid >)),
      xs filter (mid ==),
      sort(xs filter (mid <)))
  }

Write the greatest common divisor (GCD) of two numbers.
def gcd(a: Int, b: Int): Int =
if (b == 0) a
else
gcd(b, a % b)

Factorial of a number:

def fact(n:Int):Int = if (n==0) 1 else n*fact(n-1)

Write a square root function:

def abs(x: Double) = if (x < 0) -x else x

def isGoodEnough(guess: Double, x: Double) =
  abs(guess * guess - x) / x < 0.001

def improve(guess: Double, x: Double) = (guess + x / guess) / 2

def sqrtIter(guess: Double, x: Double): Double =
  if (isGoodEnough(guess, x)) guess
  else sqrtIter(improve(guess, x), x)

def sqrt(x: Double) = sqrtIter(1.0, x)

Zeppelin Installation & Run SparkSQL

 

Apache Zeppelin is a web-based notebook that lets programmers implement Spark applications in Scala and Python. It is open source, so you can easily download it and implement Spark applications. In this video I explain how to install Zeppelin and analyze sample CSV data.


Zeppelin Documentation

Download Zeppelin from an Apache mirror website and extract the archive:
wget http://www.us.apache.org/dist/incubator/zeppelin/0.5.0-incubating/zeppelin-0.5.0-incubating-bin-spark-1.4.0_hadoop-2.3.tgz

tar -xzf zeppelin-0.5.0-incubating-bin-spark-1.4.0_hadoop-2.3.tgz

step 1:
sudo apt-get update
sudo apt-get install openjdk-7-jdk
sudo apt-get install git  maven  npm

Step 2: Clone the repository and build:
mvn clean package -Pspark-1.4 -Dhadoop.version=2.2.0 -Phadoop-2.2 -DskipTests

Step 3:

Configuration:
Modify the env files: copy zeppelin-env.sh.template to zeppelin-env.sh and zeppelin-site.xml.template to zeppelin-site.xml.

./conf/zeppelin-env.sh
./conf/zeppelin-site.xml

Add the following to the ./conf/zeppelin-env.sh file:
export SPARK_HOME=/home/$USER/userwork/spark-1.4.0-bin-hadoop1
export HADOOP_HOME=/home/$USER/work/hadoop-1.1.2
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64/

// These paths are suitable for Hadoop 1.x and Spark 1.4.0

.bashrc
export ZEPPELIN_MASTER=/home/$USER/work/zeppelin-0.5.0-incubating-bin-spark-1.4.0_hadoop-2.3/zeppelin-0.5.0-incubating/

export PATH=$ZEPPELIN_MASTER/bin:$PATH

Run
zeppelin-daemon.sh start

Then open http://localhost:8080 in a browser.

Please note it is not mandatory to start Spark or Hadoop in order to start Zeppelin.

The installation is now complete.
To implement a POC, download a sample file from the link below.
wget http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip
Unzip it and place it somewhere.

unzip bank.zip

Go to Zeppelin and create a new note. First create an RDD; that is the fundamental step.

Create RDD

val x = sc.textFile("/home/hadoop/Desktop/bank/bank-full.csv")

/*Case classes are regular classes which export their constructor parameters and which provide a recursive decomposition mechanism via pattern matching.
more help: http://www.scala-lang.org/old/node/107
*/
case class Bank(age:Integer, job:String, marital : String, education : String, balance : Integer)

//Here we extract only the specified fields.

 

val bank = x.map(s => s.split(";"))
  .filter(s => s(0) != "\"age\"")
  .map(s => Bank(s(0).toInt,
    s(1).replaceAll("\"", ""),
    s(2).replaceAll("\"", ""),
    s(3).replaceAll("\"", ""),
    s(5).replaceAll("\"", "").toInt))

bank.toDF().registerTempTable("bank")

/* Here s(0).toInt converts the value to an Integer. For s(1) and the other string fields, replaceAll("\"", "") removes the " symbols from the string. If you don't do this, you will face errors, as you will see in this video. */

Run SQL Queries
%sql select age, count(1) as total from bank where age < 30 group by age order by age

%sql select age, count(1) total from bank where age < ${maxAge=30} group by age order by age

%sql select age, count(1) from bank where marital="${marital=single}" group by age order by age

Stop the server:
zeppelin-daemon.sh stop

Reference : https://github.com/apache/incubator-zeppelin
https://zeppelin.incubator.apache.org/docs/tutorial/tutorial.html

 

Scala Interview Questions

Scala stands for "scalable language". It is a functional programming language. Recently it has spread its roots into major frameworks such as Apache Spark and Akka.

What is difference between abstract class and traits in Scala?

An abstract class is a class that is meant to be extended by a subclass and cannot be instantiated on its own. An abstract class has a constructor, and a class can extend only one abstract class.
Traits are collections of fields and behaviors that you can extend or mix into your classes. A trait is a component of a class rather than a class by itself, so it has no constructor parameters, and traits are restricted in comparison to classes to prevent multiple-inheritance problems.

What is the difference between map and flatMap?
map is a method that applies a function to every element of a collection and returns a new collection with the same number of elements. flatMap is similar, but the function returns a sequence for each element, and the results are flattened into a single list.
Eg: val l = List("venu", "Jyothi", "Koti", "Brahma")
l.map(_.toUpperCase) // Returns List(VENU, JYOTHI, KOTI, BRAHMA)
l.flatMap(_.toUpperCase) // Returns List(V, E, N, U, J, Y, O, T, H, I, K, O, T, I, B, R, A, H, M, A)

What is the difference between a for/yield combination and a for loop without yield?

yield is part of for comprehensions: it generates values that are temporarily remembered/buffered internally and returned as a new collection, while the initial collection is not changed.

  • The for/yield combination returns a new collection.
  • A for loop without yield just operates on each element; it does not create a new collection.

What is the difference between a plain for loop, for with a guard, and for with yield?

  • A simple for loop iterates over a collection and is translated to a foreach method call.
  • for with a guard is translated to a withFilter method call followed by foreach.
  • for with yield is translated to withFilter followed by map (see the sketch below).
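A small illustration of these three forms (just a sketch; the list is example data):

val nums = List(1, 2, 3, 4, 5)

// simple for loop: translated to nums.foreach(...)
for (n <- nums) println(n)

// for with a guard: translated to nums.withFilter(_ % 2 == 0).foreach(...)
for (n <- nums if n % 2 == 0) println(n)

// for with yield: translated to nums.withFilter(_ % 2 == 0).map(_ * 10)
val result = for (n <- nums if n % 2 == 0) yield n * 10   // List(20, 40)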

What does indexOf("") do?

It searches for the specified value and returns its index as an Int (or -1 if it is not found).

"Hello".indexOf("l")
res0: Int = 2

Why is one billion plus two billion not equal to three billion?

An Int is 32 bits (signed), so its maximum value is 2^31 - 1, which is 2147483647 (about 2.1 billion).
If you add 1 to that maximum, the value wraps around to -2147483648 (the most negative 32-bit number).
2147483647 + 2 = -2147483647
2147483647 + 3 = -2147483646
That is, after the wraparound the value counts up again from the most negative number.
If you add Int.MaxValue + Int.MaxValue the answer is -2, and Int.MaxValue + Int.MaxValue + 4 gives 2: the result keeps wrapping around through 0.
Similarly, one billion plus two billion is 3,000,000,000, which is larger than 2147483647, so the sum overflows and you see a negative value rather than three billion. If you need values larger than that, use the Long data type (Long is 64 bits) to get an accurate value.
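A quick check in the Scala REPL (a sketch; the values shown are what 32-bit wraparound produces):

val intSum  = 1000000000 + 2000000000     // Int overflow: -1294967296
val longSum = 1000000000L + 2000000000L   // Long: 3000000000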

How to print Unicode characters?
println("\u2230")
You can get the code points for other Unicode characters from any Unicode reference website.

What is the difference between val and var variables?
A var is a variable whose reference can be reassigned, whereas a val reference cannot change once it is assigned.

How can you get this output quickly?
res1: String = (venu, 30, Bangalore)

var n = "venu"
var a = 30
var l = "Bangalore"
s"($n, $a, $l)"

What is the importance of case classes?

Depending on the constructor arguments, a case class generates immutable data-holding objects. Case classes can be reused in new applications without changing the original application. Spark uses case classes to infer a schema and write the code in a single line.

What are the limitations of case classes?

Case classes cannot take more than 22 fields (in Scala 2.10 and earlier), and they cannot be used when you don't know the schema beforehand.

If there are more than 22 fields in a case class, how does Spark process the data?
If a case class doesn't work, the data is loaded as an RDD of Row objects, the schema is created separately using StructType (table) and StructField (field) objects, and that schema is applied to the row RDD to create a DataFrame.
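A hedged sketch of that programmatic-schema approach, assuming a Spark 2.x SparkSession named spark; the data and column names are made up.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// Raw rows, e.g. produced by splitting lines of a text file.
val rowRDD = spark.sparkContext
  .parallelize(Seq("Venu,30", "Koti,28"))
  .map(_.split(","))
  .map(a => Row(a(0), a(1).trim.toInt))

// Schema built by hand instead of being inferred from a case class.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))

val df = spark.createDataFrame(rowRDD, schema)
df.show()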

What is string interpolation?
String interpolation allows users to embed variable references directly in processed string literals.
For example:
val name = "Venu"
s"Hello $name, please welcome"

There are different interpolators, such as the s (string), f (formatted) and raw interpolators.

What are the limitations of the BigInt data type?
Most often Int or Long is used to define a number, but BigInt is used to calculate huge numbers. It is not generally recommended because it can slow down performance.

What is the Difference between Array and List?

Array is mutable: it is possible to change a value in place, whereas List is immutable. Also, arrays are invariant and lists are covariant.

Can you explain a few collections in Scala? (A short example follows the list.)

  • Array: elements of the same type, mutable.
  • List: elements of the same type, immutable.
  • Set: a collection with no duplicates.
  • Tuple: groups together a simple logical collection of items without using a class.
  • map: evaluates a function over each element in the list, returning a list with the same number of elements.
  • foreach: like map but returns nothing; foreach is intended for side effects only.
  • filter: keeps only the elements that satisfy a condition, returning a new list.
  • zip: aggregates the contents of two lists into a single list of pairs.
  • partition: splits a list into two lists based on whether each element satisfies a predicate function.
  • find: returns the first element that matches a predicate (as an Option).
  • drop: drops the first n elements.
  • dropWhile: drops elements from the front as long as the condition holds.
  • flatten: collapses one level of nested structure.
  • flatMap: similar to map, but each element maps to a collection and the results are flattened into one list.
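A few of these in action (a quick REPL-style sketch with example data):

val xs = List(1, 2, 3, 4, 5)

xs.filter(_ % 2 == 0)                 // List(2, 4)
xs.partition(_ < 3)                   // (List(1, 2), List(3, 4, 5))
xs.find(_ > 3)                        // Some(4)
xs.drop(2)                            // List(3, 4, 5)
xs.dropWhile(_ < 3)                   // List(3, 4, 5)
List(List(1, 2), List(3)).flatten     // List(1, 2, 3)
xs.zip(List("a", "b", "c"))           // List((1,a), (2,b), (3,c))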

Spark Technical Interview Questions

Most Big Data analysts get Apache Spark performance interview questions. In this post I explain advanced Spark concepts with detailed explanations.

What is the importance of the DAG in Spark?
The directed acyclic graph (DAG) is Spark's execution engine. It skips the unwanted multi-stage execution model and offers the best performance improvements. In MapReduce, Hadoop can only execute a map and a reduce; if you want to run HQL queries, Hive launches one MapReduce job after another. The DAG is an execution model that lets the whole job run in a straightforward manner, so SQL, HQL and other workloads execute directly in Spark through the DAG execution engine.

How many ways are there to create an RDD? Which is the best way?
Usually two: parallelizing an existing collection, or referencing a dataset in an external storage system. The parallelize option is not recommended for large data; if you parallelize a vast amount of data, it might crash the driver JVM.

1) Parallelize: val data = Array(1, 2, 3, 4, 5) val distData = sc.parallelize(data)

2) External Datasets: val distFile = sc.textFile("data.txt")

groupByKey or reduceByKey: which is best in Spark?
In a MapReduce-style program you can get the same output through either groupByKey or reduceByKey. If you are processing a large dataset, reduceByKey is highly recommended: it combines output with a common key on each partition before shuffling the data. With groupByKey, all the data is transferred over the network unnecessarily, so Spark performance decreases for large amounts of data.
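A small sketch of the two approaches (assuming sc from spark-shell and toy data):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("a", 1)))

// Preferred: values are combined on each partition before the shuffle.
val counts1 = pairs.reduceByKey(_ + _)

// Same result, but every (key, value) pair crosses the network first.
val counts2 = pairs.groupByKey().mapValues(_.sum)

counts1.collect()   // Array((a,3), (b,1)), order may vary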

When should you not call the collect() action?

Don't copy all elements of a large RDD to the driver; it is a bottleneck for the driver program, and if the data is larger than driver memory (say, more than 1 TB) it can crash the driver JVM. Similarly, countByKey, countByValue and collectAsMap are suitable only for small datasets.

Instead of val values = myVeryLargeRDD.collect(), use the take or takeSample actions; they filter and fetch only the desired amount of data.

What is the difference between cache() and persist()?

Both are called to store RDD data in memory. cache() uses the default MEMORY_ONLY storage level. With persist(), you can assign any storage level, such as MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY and more.

If the dataset fits in memory use cache(); otherwise use persist() with an appropriate level.
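For example (a sketch, assuming sc from spark-shell; the paths are illustrative):

import org.apache.spark.storage.StorageLevel

val small = sc.textFile("/data/small-input.txt")   // example path
small.cache()                                      // same as persist(StorageLevel.MEMORY_ONLY)

val big = sc.textFile("/data/big-input.txt")       // example path
big.persist(StorageLevel.MEMORY_AND_DISK_SER)      // spill serialized partitions to disk if they don't fit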

What is the difference between real-time data processing and micro batch processing?
When data is processed the instant it arrives, that is called real-time data processing.
When a chunk of data is collected, held in a small batch, and then processed as early as possible, that is called micro-batch processing.
Storm is an example of real-time processing and Spark Streaming is an example of micro-batch processing.

How does data serialization optimize Spark performance?
Data serialization is the first step in tuning Spark application performance. Spark aims to balance convenience and performance. To achieve this, Spark allows two serialization libraries: Java serialization (the default) and Kryo serialization. Compared with Java serialization, Kryo serialization is the better option. Include the following in your SparkConf: conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").

What is the difference between Narrow Transformation and Wide Transformation?
In a narrow transformation, the input and output stay in the same partition, so no data movement is needed. In a wide transformation, input from other partitions is required, so the data must be shuffled before processing. Narrow transformations are independent and happen in parallel; wide transformations depend on data from multiple parent partitions. Narrow transformations are highly recommended for better Spark RDD performance.

Other Spark Optimization Tips:

  • Network bandwidth is the bottleneck of any distributed system. Store RDDs in serialized form to reduce memory usage and optimize RDD performance.
  • If the objects are large, increase the spark.kryoserializer.buffer (default 64k) and spark.kryoserializer.buffer.max (default 64M) config properties.
  • How to optimize Spark memory?
    There are two places to optimize: at the driver level and at the executor level.
    Specify driver memory when you run an application, e.g. spark-shell --driver-memory 4g
    Specify executor memory when you run an application, e.g. spark-shell --executor-memory 4g

What is the difference between Mesos and Yarn?

Mesos is a general-purpose cluster manager, which is evolving into a data center operating system.
YARN is Hadoop's compute framework and resource manager, and it has robust resource-management features.

What is the importance of the DSL in DataFrames?
The domain-specific language (DSL) is a layer on top of DataFrames for performing relational operations.

How many ways are there to create DataFrames?
The easy way is to leverage a Scala case class; the second way is to programmatically specify the schema.

What is Catalyst optimizer?

The power of Spark SQL and DataFrames comes from the Catalyst optimizer.
The Catalyst optimizer primarily leverages functional programming constructs of Scala, such as pattern matching. It offers a general framework for transforming trees, which is used to perform analysis, optimization, planning, and runtime code generation.
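You can see what Catalyst produces by asking a DataFrame to explain itself; a sketch assuming a Spark 2.x SparkSession named spark and made-up data:

import spark.implicits._

val df = Seq(("Venu", 30), ("Koti", 28)).toDF("name", "age")

// Prints the parsed, analyzed and optimized logical plans and the physical plan
// that Catalyst selected for this query.
df.filter($"age" > 26).select("name").explain(true)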

What are the Catalyst optimizer's goals?

  • Optimize the code for better performance.
  • Allow users to optimize the Spark code.

Why does Spark SQL use the Catalyst optimizer?

  • To analyze the logical plan.
  • To optimize the logical and physical plans.
  • To generate code to complete the query.

Give an example of how the Catalyst optimizer can optimize a logical plan.
For example, when a bunch of data is going to be sorted, Catalyst can first filter out unnecessary data, which reduces network waste and optimizes the logical plan.

What is Physical planning?
Spark SQL takes a logical plan and generates one or more physical plans, then chooses among them based on the cost of each plan.

What are quasiquotes?
A Scala feature that Catalyst uses to generate Java bytecode to run on each machine.

Can you name a few data types in JSON?

String, number, Boolean and null. Everything is in key-and-value format.

What are objects and arrays in JSON data?

Curly braces { } represent an object, whereas a sequence of objects inside square brackets [ ] represents an array.

An object is {"Name": "Venu", "age": "30"}

An array is [{"Name": "Venu", "age": "30"}, {"Name": "Venu", "age": "30"}, {"Name": "Venu", "age": "30"}]
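In Spark, such JSON can be loaded straight into a DataFrame; a sketch assuming a Spark 2.x SparkSession named spark, an example path, and one JSON object per line in the file:

val people = spark.read.json("/home/hadoop/work/people.json")   // example path

people.printSchema()                  // the schema is inferred from the JSON keys
people.select("Name", "age").show()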

How do you decide whether to use MEMORY_ONLY_SER or MEMORY_AND_DISK_SER to persist an RDD?

If the serialized data fits in RAM, use MEMORY_ONLY_SER; if it is larger than the available memory, use MEMORY_AND_DISK_SER.

Why broadcast variables?

Communication is very important in Spark. To reduce communication cost, Spark uses broadcast variables: instead of shipping variables with every task, these variables are kept cached on each node in read-only mode.
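A minimal sketch (assuming sc from spark-shell; the lookup table is made up):

// A small lookup table shipped once per executor instead of once per task.
val countryCodes = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

val users = sc.parallelize(Seq(("Venu", "IN"), ("John", "US")))
val resolved = users.map { case (name, code) =>
  (name, countryCodes.value.getOrElse(code, "Unknown"))
}

resolved.collect()   // Array((Venu,India), (John,United States))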

 
