Tag Archives: Spark

RDDs Vs DataFrame Vs DataSet

An Application Program Interface (API) is a set of functions and procedures that allows applications to access the features or data of an operating system, application, or other service in order to process data.
APIs are very important for implementing any framework. Spark exposes three core APIs, and everything in Spark revolves around them: the RDD, DataFrame and Dataset APIs.

So which is the best API: RDD, DataFrame or Dataset, and why? These are common Spark interview questions. In this post I explain the differences between RDD, DataFrame & Dataset.

What is RDD?
In simple words, an RDD is a collection of Java or Scala objects that is immutable, distributed and fault tolerant.
Spark core provides many functions, most of them borrowed from Scala. Based on functionality, Spark separates these functions into Transformations and Actions.
In the RDD API you use these functions (Transformations & Actions) to compute the data. The main advantage: if you already know Scala functions, it's easy to compute data.
The main disadvantage of RDDs is that they use Java serialization by default. Both Java and Scala run on the JVM, so both use Java serialization.

Why Java Serialization?

For example: if you want to store arrays, JSON data, or other complex objects in a database, it's not supported directly. So you serialize the data into a binary format, then convert that binary data into a database-understandable format.
Java has its own serialization concept, called Java serialization.
Java serialization was intended for small numbers of Java objects, not for huge numbers of objects. If you use Java serialization at scale, it drastically decreases performance.
Additionally, Java serialization consumes a huge amount of resources to serialize data. Formats such as Avro internally compress data, which gives a small performance advantage.
RDDs use Java serialization, so performance decreases when you process a large amount of data. Kryo serialization can optimize Spark RDD jobs a little, but you must follow some terms and conditions.
One more disadvantage: Java serialization sends both the data and its structure between nodes. That's another headache, but it's resolved in DataFrames.
Spark introduced RDDs at the very beginning, in Spark 1.0. Processing works fine, everything is fine, but performance is the main disadvantage. If you are processing unstructured data, RDDs are highly recommended.
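To make the RDD API concrete, here is a minimal sketch, assuming a running spark-shell (so `sc` already exists); the file path is illustrative.

```scala
// Create an RDD from storage, apply transformations (lazy), then an action.
val lines = sc.textFile("hdfs:///data/input.txt")       // RDD[String]
val words = lines.flatMap(line => line.split(" "))      // transformation: lazy
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)  // transformation: lazy
counts.take(10).foreach(println)                        // action: triggers the job
```

Nothing runs until the action (`take`) is called; that laziness is a core RDD property discussed later in this post.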

What is DataFrame?
Later, Spark introduced another API called DataFrame. It's very powerful, mainly focused on performance and on running SQL queries on top of data.
In simple words, RDDs plus a schema is called a DataFrame. In a DataFrame, the data is organized into named columns, like an RDBMS table; structure and data are kept separately. Spark understands the schema, so there is no need to use Java serialization to encode the structure: only the data is serialized.
So a Spark developer can easily run SQL queries on top of distributed data, and DataFrames additionally support DSL commands, so a Scala programmer can easily run Scala-style commands too. These features are not available in the RDD API.
Since Spark knows the schema, there is also no need to de-serialize the data when you apply sorting or shuffling.

The power of the DataFrame API is the Catalyst optimizer. It internally builds logical and physical plans and, based on a cost model, chooses the best optimized plan. So DataFrames are internally optimized compared with RDDs.
DataFrames still use Java serialization for the data itself, so the same serialization disadvantage from RDDs remains. In short, the main advantages are optimized performance and user-friendliness; the main disadvantage is serialization.
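A minimal DataFrame sketch, in the Spark 1.x style this post discusses, assuming spark-shell (so `sqlContext` exists); the path is illustrative.

```scala
// Load JSON: Spark infers the schema, keeping structure separate from data.
val df = sqlContext.read.json("/data/people.json")
df.printSchema()                              // named columns, like an RDBMS table
df.filter(df("age") > 21).show()              // DSL command
df.registerTempTable("people")                // expose to SQL (Spark 1.x API)
sqlContext.sql("SELECT name FROM people WHERE age > 21").show()
```

Both the DSL `filter` and the SQL query go through the same Catalyst optimizer described above.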

What is DataSet API?
Another distributed processing framework, Flink, internally uses two powerful APIs called DataSet and DataStream. The DataSet API processes batch data and the DataStream API processes streaming data. Spark core is batch-oriented by default, so Spark adopted a similar DataSet API and shipped it experimentally in Spark 1.6.
The Dataset API got good results in Spark 1.6, so in Spark 2.0 DataFrame was merged into Dataset: in Spark 2.0 a DataFrame is simply a Dataset of Rows.
The main differences between RDD, DataFrame and Dataset are serialization and performance. The Dataset API internally uses a special serialization mechanism called an encoder, which is far more efficient than Java serialization. It supports RDD-style transformations and DataFrame DSL commands, and allows SQL queries as well. That means if you know RDDs and DataFrames, you can apply the same steps to Datasets.

In other words, unifying RDD + DataFrame with encoder serialization gives you the Dataset. Dataset was introduced in version 1.6, and it became the main abstraction in Spark 2.0. The main advantage of Dataset is high-level, compile-time type safety, whereas RDDs are low-level. So the programmer can identify syntax errors and analysis errors at compile time. More info about type safe

Spark is also moving towards Dataset: instead of Spark Streaming, MLlib and GraphX, development is moving towards Structured Streaming, ML Pipelines and GraphFrames. My prediction is that in the future there will be no RDD concepts left.
One more disadvantage: only Java and Scala support Dataset, not Python, because of Python's dynamic nature.
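A minimal Dataset sketch (Spark 1.6+, assuming spark-shell so `sc` and `sqlContext` exist). The case class gives compile-time type safety, and its encoder replaces Java serialization.

```scala
// The case class defines the schema; the encoder comes in via implicits.
case class Person(name: String, age: Int)
import sqlContext.implicits._

val ds = Seq(Person("Venu", 30), Person("Ravi", 25)).toDS()
ds.filter(p => p.age > 26).show()       // RDD-style lambda, checked at compile time
val names = ds.map(p => p.name)         // typed transformation, encoder-serialized
```

A typo like `p.agee` here fails at compile time, whereas the equivalent DataFrame column name would only fail at runtime.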


To summarize, the main difference between RDD, DataFrame and Dataset is performance: to optimize performance, Spark moved from RDD to DataFrame, and then to Dataset.

WordCount in Spark Scala

Hello, in these videos I explain how to install Eclipse, how to install Scala, how to create the appropriate configurations in Eclipse and Maven to implement Spark applications, and finally how to run a Spark word-count program with the Maven build tool.

How to install Eclipse in Ubuntu:

If you want this script, just mail me at venu@bigdataanalyst.in and I will mail it.

Download Eclipse from the Eclipse website
put it wherever you want, e.g. /home/hadoop/work
tar -zxvf /home/hadoop/work/eclipse-jee-mars-R-linux-gtk-x86_64.tar.gz
gksudo gedit /usr/share/applications/eclipse.desktop
#enter password
#paste it
[Desktop Entry]
Name=Eclipse 4
Comment=Integrated Development Environment

##how to install scala plugin in Eclipse#####
#First check for updates, to prevent problems in future
go to Help > Check for Updates > next > next > accept conditions > finish // wait 5 min, restart Eclipse
go to Help > Eclipse Marketplace > find "scala" > Scala IDE > confirm > next > next > accept > finish

after creating a project in Maven, right click and go to Configure > Add Scala Nature

#####How to create a maven project and hello world scala program####
for spark Streaming: http://mvnrepository.com/artifact/org.apache.spark/spark-streaming_2.10/1.6.0
spark core: http://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10
spark sql: http://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10
scala: http://mvnrepository.com/artifact/org.scala-lang/scala-library/2.10.6

#based on your Spark, Scala and Hadoop versions, change these accordingly.

WordCount using Spark Scala

Ways to create dataframes

Hello, in my previous post I explained different Spark interview questions. In this post I explain easy ways to create DataFrames in Spark, and in the video I show how to create DataFrames in different ways.

There are usually four ways to create a DataFrame; using the DataFrame API is the best and easiest way.
1) Use the DataFrame API
2) Programmatically specify the schema
3) Use case classes
4) Use toDF
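Here is a rough sketch of all four approaches in one place, assuming spark-shell (so `sc` and `sqlContext` exist); paths and names are illustrative.

```scala
import sqlContext.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// 1) DataFrame API / data sources: schema inferred from the file
val df1 = sqlContext.read.json("/data/people.json")

// 2) Programmatically specify the schema
val schema = StructType(Seq(StructField("name", StringType, true)))
val rowRDD = sc.textFile("/data/names.txt").map(line => Row(line))
val df2 = sqlContext.createDataFrame(rowRDD, schema)

// 3) Case classes: schema taken from the class fields
case class Person(name: String, age: Int)
val df3 = sc.parallelize(Seq(Person("Venu", 30))).toDF()

// 4) toDF on a tuple RDD, naming the columns explicitly
val df4 = sc.parallelize(Seq(("Venu", 30))).toDF("name", "age")
```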

If you want the script/code, just mail me at venu@bigdataanalyst.in


Zeppelin Installation & Run SparkSQL


Apache Zeppelin is a web-based notebook that allows a programmer to implement Spark applications in Scala and Python. It's open source, so you can easily download it and implement Spark applications. In this video I explain how to install Zeppelin and analyze sample CSV data.

Zeppelin Documentation

Download Zeppelin from an Apache mirror website and extract the tarball
wget http://www.us.apache.org/dist/incubator/zeppelin/0.5.0-incubating/zeppelin-0.5.0-incubating-bin-spark-1.4.0_hadoop-2.3.tgz

unzip zeppelin-0.5.0-incubating-bin-spark-1.4.0_hadoop-2.3.tgz

step 1:
sudo apt-get update
sudo apt-get install openjdk-7-jdk
sudo apt-get install git  maven  npm

Step 2: Clone the repository and build it:
mvn clean package -Pspark-1.4 -Dhadoop.version=2.2.0 -Phadoop-2.2 -DskipTests

Step 3:

Modify the env files. Copy zeppelin-env.sh.template to zeppelin-env.sh and zeppelin-site.xml.template to zeppelin-site.xml


Include given code in the ./conf/zeppelin-env.sh file.
export SPARK_HOME=/home/$USER/userwork/spark-1.4.0-bin-hadoop1
export HADOOP_HOME=/home/$USER/work/hadoop-1.1.2
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64/

//This is suitable for Hadoop 1.x and Spark 1.4.0

export ZEPPELIN_MASTER=/home/$USER/work/zeppelin-0.5.0-incubating-bin-spark-1.4.0_hadoop-2.3/zeppelin-0.5.0-incubating/


zeppelin-daemon.sh start

Then open http://localhost:8080 in a browser.

Please note it's not mandatory to start Spark or Hadoop to start Zeppelin.

Now the installation is successfully completed.
To implement a POC, download a file from the given link.
wget http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip
unzip it and place it somewhere.

unzip bank.zip

Go to Zeppelin and create a new note. First create an RDD; it's the fundamental step.

Create RDD

val x = sc.textFile("/home/hadoop/Desktop/bank/bank-full.csv")

/* Case classes are regular classes which export their constructor parameters and which provide a recursive decomposition mechanism via pattern matching.
more help: http://www.scala-lang.org/old/node/107 */
case class Bank(age:Integer, job:String, marital : String, education : String, balance : Integer)

//Here we extract only the specified fields.



val bank = x.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(s => Bank(s(0).toInt, s(1).replaceAll("\"", ""), s(2).replaceAll("\"", ""), s(3).replaceAll("\"", ""), s(5).replaceAll("\"", "").toInt))

/* Here, s(0).toInt converts the field to an Integer. s(1) and a few other strings use replaceAll("\"", ""), which removes the " symbols from the string. If you don't do this, you will face errors, as you will see in this video. */

Run SQL Queries
%sql select age, count(1) as total from bank where age < 30 group by age order by age

%sql select age, count(1) as total from bank where age < ${maxAge=30} group by age order by age

%sql select age, count(1) from bank where marital="${marital=single}" group by age order by age

Stop the server:
zeppelin-daemon.sh stop

Reference : https://github.com/apache/incubator-zeppelin


Spark Technical Interview Questions

Most Bigdata analysts will face Apache Spark performance interview questions. In this post I explain advanced Spark concepts with detailed explanations.

What is the DAG importance in Spark?
A directed acyclic graph (DAG) is Spark's execution model. It skips unwanted multi-stage execution and offers big performance improvements. In MapReduce, Hadoop can only execute map and reduce stages: if you want to execute HQL queries, Hive executes once and MapReduce executes again. The DAG execution model allows work to run in a straightforward manner, so SQL, HQL and other languages execute directly in Spark through the DAG execution engine.

How many ways to create an RDD? Which is the best way?
Usually two ways: parallelize existing data, or reference a dataset in external storage. The parallelize option is not recommended for big data: if you process a vast amount of data that way, it might crash the driver JVM.

1) Parallelize: val data = Array(1, 2, 3, 4, 5) val distData = sc.parallelize(data)

2) External Datasets: val distFile = sc.textFile("data.txt")

groupByKey or reduceByKey Which is the best in Spark?
In a MapReduce-style program you can get the same output through groupByKey and reduceByKey. If you are processing a large dataset, reduceByKey is highly recommended: it combines values with a common key on each partition before shuffling the data, whereas with groupByKey all the data, including unnecessary records, is transferred over the network. So Spark performance will decrease on a large amount of data.
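A small sketch of the two approaches side by side, assuming spark-shell (so `sc` exists); the sample data is tiny and illustrative.

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// reduceByKey: performs a map-side combine per partition before shuffling.
val viaReduce = pairs.reduceByKey(_ + _)

// groupByKey: ships every (key, value) pair across the network, then sums.
val viaGroup = pairs.groupByKey().mapValues(_.sum)
```

Both produce the same counts, but on a large dataset `viaReduce` shuffles far less data.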

When you don’t call collect() action?

Don't copy all elements of a large RDD to the driver; it's a bottleneck for the driver program, and if the data is large (say more than 1TB), it can crash the driver JVM. Similarly, countByKey, countByValue and collectAsMap are also suitable only for small datasets.

val values = myVeryLargeRDD.collect() Instead of that, use the take or takeSample actions, which bring back only the desired amount of data.

What is the difference between cache() and persist()?

Both are called to store RDD data in memory. cache() uses the default MEMORY_ONLY storage level. With persist(), you can assign any storage level, like MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, and more.

If the dataset fits in memory use cache(); otherwise use persist().
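A quick sketch of both calls, assuming spark-shell; the file paths are illustrative.

```scala
import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
val small = sc.textFile("/data/small.txt").cache()

// persist() lets you pick a level; this one spills serialized blocks
// to disk when memory runs out instead of recomputing them.
val big = sc.textFile("/data/big.txt").persist(StorageLevel.MEMORY_AND_DISK_SER)
```

Note that once an RDD is persisted you cannot change its storage level without unpersisting it first.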

What is the difference between real-time data processing and micro batch processing?
When you process data instantly as it arrives, that's called real-time data processing.
When you collect incoming data into a small batch and then process that batch as early as possible, that's called micro-batch processing.
Storm is an example of real-time processing and Spark Streaming is an example of micro-batch processing.

How Data Serialization optimize spark performance?
Data serialization is the first step in tuning Spark application performance. Spark's aim is a balance between convenience and performance. To achieve it, Spark offers two serialization libraries: Java serialization (the default) and Kryo serialization. Compared with Java serialization, Kryo serialization is the better option. Include the given code in SparkConf: conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").

What is the difference between Narrow Transformation and Wide Transformation?
In a narrow transformation, input and output stay in the same partition: no data movement is needed, so narrow transformations are independent and happen in parallel. In a wide transformation, input from other partitions is required, so data must be shuffled before processing; a wide transformation depends on multiple parent partitions. Narrow transformations are highly recommended for better Spark RDD performance.
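A short sketch of where the narrow/wide boundary falls, assuming spark-shell; the data is illustrative.

```scala
val nums = sc.parallelize(1 to 100, 4)       // 4 partitions

val doubled = nums.map(_ * 2)                // narrow: each partition processed in place
val pairs   = doubled.map(n => (n % 10, n))  // still narrow

val summed  = pairs.reduceByKey(_ + _)       // wide: shuffles data across partitions

// toDebugString shows the lineage; the shuffle marks a new stage boundary.
println(summed.toDebugString)
```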

Other Spark optimization tips:

  • Network bandwidth is the bottleneck of any distributed system. Store RDDs in serialized form to reduce memory usage and optimize RDD performance.
  • If the objects are large, increase the spark.kryoserializer.buffer (default 64k) and spark.kryoserializer.buffer.max (default 64M) config properties.
  • How to optimize Spark memory?
    There are two places to optimize: at the driver level and at the executor level.
    Specify driver memory when you run an application, e.g.: spark-shell --driver-memory 4g
    Specify executor memory when you run an application, e.g.: spark-shell --executor-memory 4g
  • More Tips

What is the difference between Mesos and Yarn?

Mesos is a cluster manager, which is evolving into a data center operating system.
Yarn is the Hadoop compute framework that has robust resource management features.

What is DSL importance in DataFrame?
A layer on top of DataFrames for performing relational operations is called the domain-specific language (DSL).

How many ways to create DataFrames?
The easy way is to leverage a Scala case class; the second way is to programmatically specify the schema.

What is Catalyst optimizer?

The power of SparkSQL/DataFrames comes from the Catalyst optimizer.
The Catalyst optimizer primarily leverages functional programming constructs of Scala, such as pattern matching. It offers a general framework for transforming trees, which is used to perform analysis, optimization, planning, and runtime code generation.

What are the Catalyst optimizer's goals?

  • Optimize the code for better performance.
  • Allow users to optimize the Spark code.

Why SparkSQL use Catalyst Optimizer?

  • To analyze a logical plan,
  • To optimize Physical and logical plans.
  • To generate code to complete the query.

Give an example of how the Catalyst optimizer can optimize a logical plan.
For example: if a bunch of data is sorted and then filtered, Catalyst can push the filter down so unnecessary data is dropped early, reducing network wastage and optimizing the logical plan.

What is Physical planning?
SparkSQL takes a logical plan, generates one or more physical plans, and then chooses among them based on the cost of each plan.

What are quasiquotes?
A Scala feature that Catalyst uses to generate Java bytecode to run on each machine.

Can you tell me few data-types in JSON?

String, Number, Boolean, null. Everything is represented in key/value format.

What are objects and Array in Json data?

Curly braces { } represent an object, whereas a sequence of objects inside square brackets [ ] represents an array.

Object is {“Name”: “Venu”, “age” : “30”}

Array is [{“Name”: “Venu”, “age” : “30”},{“Name”: “Venu”, “age” : “30”}, {“Name”: “Venu”, “age” : “30”}]

How to decide whether use MEMORY_ONLY_SER or MEMORY_AND_DISK_SER to persist the RDD?

If data is lower than RAM, use MEMORY_ONLY_SER, if more than RAM size, use MEMORY_AND_DISK_SER

Why broadcast variables?

Communication is very important in Spark. To reduce communication cost, Spark uses broadcast variables: instead of transferring variables with every task, these variables are kept cached on each node in read-only mode.
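A small broadcast sketch, assuming spark-shell; the lookup table and data are illustrative.

```scala
// Ship a lookup table once per executor instead of once per task.
val countryNames = Map("IN" -> "India", "US" -> "United States")
val bc = sc.broadcast(countryNames)

val users = sc.parallelize(Seq(("venu", "IN"), ("john", "US")))
val named = users.map { case (user, code) =>
  (user, bc.value.getOrElse(code, "Unknown"))  // read-only access on workers
}
```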


Spark Advanced Interview questions

In my previous post I shared a few Spark interview questions; please check them once. If you want to learn Apache Spark, contact me now.

What is DataFrame?

SQL + RDD = Spark DataFrame

A SQL programming abstraction on top of Spark core is called DataFrames: an RDD with a schema. It can ease many Spark problems. The DataFrame API is available in Scala, Java, Python, and R, so any programmer can create a DataFrame.

What is the glom importance in Spark?
glom is a method (RDD.glom()) that returns a new RDD containing each partition's elements collected into an array. Usually you process a row at a time, but RDD.glom() allows you to treat a whole partition as an array rather than as a single row at a time.
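A tiny glom sketch, assuming spark-shell: computing a per-partition maximum without any shuffle.

```scala
val rdd = sc.parallelize(1 to 8, 4)   // 4 partitions
val perPartition = rdd.glom()         // RDD[Array[Int]]: one array per partition
val partitionMax = perPartition.map(arr => arr.max)  // one value per partition
```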

What is Shark?

It's an older version of SparkSQL. It allowed running Hive on Spark, but it has now been replaced by SparkSQL.

What is the difference between Tachyon and Apache Spark?

Tachyon is a memory-centric distributed storage system that shares memory across a cluster; a programmer can run Spark, Mapreduce, Shark, and Flink on it without any code change. Spark, on the other hand, is a cluster computing framework on which you can rapidly run batch, streaming and interactive analytics. Fastness and laziness are the power of Spark.

What is the difference between a framework, a library and an API?

An API is the part of a library that defines how external code interacts with it: if something is requested through the API, the library serves and fulfills the request. A library is a collection of classes packaged to do a specific task. A framework provides functionality/solutions for a particular problem area and supplies the surrounding environment; it's the heart of developing software applications.

What is history of Apache Spark?

Apache Spark was originally developed in the AMPLab at UC Berkeley; later it moved to become an Apache top-level project. Databricks is a Spark contributor founded by the creators of Apache Spark.

What is the main difference between Spark and Storm?

Spark performs data-parallel computation, whereas Storm performs task-parallel computation. Storm can process individual events more quickly, but both are open source, distributed, fault tolerant and scalable for processing streaming data.

What is Data scientist responsibility?

A data scientist analyzes data for insights and models data for visualization. He/she may have experience with SQL, statistics, predictive modeling, and programming in languages like Python and R.

What is Data engineer responsibilities?

A Bigdata engineer usually builds production data processing applications. Most often the engineer monitors, inspects, and tunes applications by using programming languages.

When do you use apache spark?

For iterative and interactive applications that need faster processing, and for real-time stream processing. As a single platform for batch, streaming and interactive applications, Apache Spark is the best choice.

What is the purpose of the GraphX library?

Most social media sites generate graphs. GraphX is used for graphs and graph-parallel computation, with common algorithms built in.

What is Apache Zeppelin?

It's a collaborative data analytics and visualization tool for Apache Spark and Flink. It's in incubating stage, which means it's not yet stable and still being implemented.

What is Jupyter?
It evolved from the IPython project. It's a Python notebook with built-in APIs for visualization. It also supports R, Ruby, pyspark and other languages.

What is Data Scientist workbench?
An interactive data platform built around the Python tool Jupyter. It comes pre-installed with Python, Scala and R.

What is Spark Notebook?
It's a Spark SQL tool; it dynamically injects JavaScript libraries to create visualizations.

Databricks Cloud?
It's available on AWS; if you have an EC2 account, you can use it.

What is Zeppelin?
Zeppelin is an analytical tool that supports multiple language back-ends; by default it supports Scala with a SparkContext.

Is DFS mandatory to run Spark?
No, HDFS is not needed. RDDs use the Hadoop InputFormat API to read data, so an RDD can work with any storage system like AWS, Azure, Google Cloud, or the local file system. Any InputFormat implementation can be used directly in Spark, so input from HBase, Cassandra, MongoDB, or a custom input format can be processed directly in an RDD.

How does Spark identify data locality?
Usually the InputFormat specifies splits and locality. RDDs use the Hadoop InputFormat API, so partitions correlate to HDFS splits. That way Spark can easily identify data locality when it's needed.

How does coalesce increase RDD performance?
Shuffles decrease RDD performance. repartition increases the number of partitions with a full shuffle, for example to restore parallelism after a filter, whereas coalesce decreases the number of partitions without a full shuffle. Coalesce can consolidate partitions before writing output to HDFS. So coalesce directly affects RDD partition performance.
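A sketch contrasting the two calls, assuming spark-shell; the paths and partition counts are illustrative.

```scala
val data = sc.textFile("/data/big.txt", 100)       // 100 input partitions
val filtered = data.filter(_.contains("ERROR"))    // many now-tiny partitions

val compact    = filtered.coalesce(10)     // merge partitions, avoids a full shuffle
val rebalanced = filtered.repartition(50)  // full shuffle, evenly redistributes data
```

`coalesce` is the usual choice before writing a small result out, `repartition` when you need parallelism back.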

What is the importance of co-located key-value pairs?

In some cases, multiple values share the same key, especially in iterative computations. Co-locating those values benefits many operations. RangePartitioner and HashPartitioner ensure all pairs with the same key end up in the same partition.

What are the numeric RDD statistical operations?

Standard deviation, mean, sum, min, max. stats() returns all the statistic values at once.

Can RDDs be shared across applications?

No. RDDs can be shared across the nodes of one application, but not across applications.

What is the difference between reduce and fold?

fold is similar to reduce, but it takes an initial zero value that is applied per partition, which makes it safe on empty partitions and useful for large aggregations. Please find the answer here.

What is SBT?

SBT is an open-source build tool for Scala and Java projects, similar to Java's Maven.

What is the use of Kryo serialization in Spark?

Spark uses Java serialization by default. Kryo is a fast and efficient serialization framework for Java that Spark can use instead of the default. It's highly recommended for a large amount of data.

What is the sizeEstimator tool?

The Java heap is the amount of memory allocated to applications running in the JVM. SizeEstimator is a utility that estimates the size of objects in the Java heap. It helps decide how to partition the data.

What is pipe operator?

The pipe operator allows RDD data to be processed with external applications. After creating an RDD, the developer pipes that RDD through a shell script or other external command, so external applications can process the RDD's data.
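A minimal pipe sketch, assuming spark-shell on a Unix-like machine. Each partition's elements are written to the external command's stdin, and its stdout lines come back as a new RDD[String].

```scala
val rdd = sc.parallelize(Seq("spark", "hadoop", "hive"))

// Pipe each element through the standard `tr` utility to upper-case it.
val upper = rdd.pipe("tr 'a-z' 'A-Z'")
```

In practice the command is often a custom script (e.g. in Perl or R) installed on every worker node.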

What are executors?

Spark sends the application code to the executors via the SparkContext. The SparkContext also sends tasks to the executors to run computations and store the data for your application. Each application has its own executors.

What is SparkContext object functionality?

Any Spark application consists of a driver program and executors that run on the cluster; a jar contains the application code for processing. The SparkContext object coordinates these processes.

What are DStreams?
A sequence of RDDs is called a DStream. It's a high-level Spark abstraction that creates a continuous stream of data from sources like Kafka and Flume and generates a series of batches.
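A minimal DStream sketch, assuming spark-shell (so `sc` exists); the socket source and port are illustrative.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// One batch (one RDD) is produced every 5 seconds.
val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)

val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()   // prints the first elements of each batch's result

ssc.start()      // begin receiving and processing
```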

Spark Interview questions

If you want Apache Spark training, contact here.

What is Spark?

Spark is a parallel data processing framework. It allows you to develop fast, unified big data applications combining batch, streaming and interactive analytics.

Why Spark?

Spark is a third-generation distributed data processing platform. It's a unified bigdata solution for all bigdata processing problems such as batch, interactive and streaming processing, so it can ease many bigdata problems.

What is RDD?
Spark's primary core abstraction is called a Resilient Distributed Dataset (RDD). An RDD is a collection of partitioned data that satisfies these properties: immutable, distributed, lazily evaluated, cacheable.

What is Immutable?
Once created and assigned a value, it's not possible to change it; this property is called immutability. Spark RDDs are immutable by default: they don't allow updates or modifications. Instead, every transformation produces a new RDD from the old one.

What is Distributed?
The data in an RDD is automatically partitioned and distributed across different parallel computing nodes.

What is Lazy evaluated?
If you execute a bunch of transformations, Spark does not evaluate them immediately; evaluation is deferred until an action triggers it. Transformations in particular follow this laziness.

What is Cacheable?
Spark keeps the data in memory for computation rather than going to disk, so it can cache data and run up to 100 times faster than Hadoop.

What is Spark engine responsibility?
The Spark engine is responsible for scheduling, distributing, and monitoring the application across the cluster.

What are common Spark Ecosystems?
Spark SQL(Shark) for SQL developers,
Spark Streaming for streaming data,
MLLib for machine learning algorithms,
GraphX for Graph computation,
SparkR to run R on Spark engine,
BlinkDB, enabling interactive queries over massive data, are common Spark ecosystems. GraphX, SparkR and BlinkDB are in incubation stage.


What are Partitions?
A partition is a logical division of the data; the idea derives from MapReduce splits. Data is divided into small chunks to support scalability and speed up processing. Input data, intermediate data and output data are all partitioned RDDs.

How spark partition the data?

Spark uses the MapReduce API to partition the data. In the input format we can set the number of partitions. By default the HDFS block size is the partition size (for best performance), but it's possible to change the partition size, like a split.

How Spark store the data?
Spark is a processing engine; it has no storage engine of its own. It can retrieve data from any storage engine like HDFS, S3 and other data resources.

Is it mandatory to start Hadoop to run spark application?
No, it's not mandatory, but since there is no separate storage in Spark, it uses the local file system to store the data. You can load data from the local system and process it; Hadoop or HDFS is not mandatory to run a Spark application.



What is SparkContext?
When a programmer creates RDDs, it is the SparkContext that connects to the Spark cluster. The SparkContext tells Spark how to access the cluster, and SparkConf is the key factor for configuring the application.

What is SparkCore functionalities?
SparkCore is the base engine of the Apache Spark framework. Memory management, fault tolerance, scheduling and monitoring jobs, and interacting with storage systems are its primary functionalities.

How SparkSQL is different from HQL and SQL?
SparkSQL is a special component on top of the SparkCore engine that supports SQL and the Hive Query Language without changing the syntax. It's even possible to join a SQL table and a HQL table.

When do we use Spark Streaming?
Spark Streaming is an API for near-real-time processing of streaming data. It gathers streaming data from different resources like web server log files, social media data, stock market data, or Hadoop ecosystem tools like Flume and Kafka.

How Spark Streaming API works?
The programmer sets a specific time interval in the configuration; whatever data arrives in Spark within that interval is grouped as a batch. The input stream (DStream) goes into Spark Streaming, which breaks it up into small chunks called batches and feeds them to the Spark core engine for processing. The core engine generates the final results in the form of streaming batches; the output is also in batches. This allows processing streaming data and batch data in the same way.

What is Spark MLlib?

Mahout is a machine learning library for Hadoop; similarly, MLlib is Spark's machine learning library. MLlib provides different algorithms that scale out on the cluster for data processing. Most data scientists use this MLlib library.

What is GraphX?

GraphX is a Spark API for manipulating graphs and collections. It unifies ETL, other analysis, and iterative graph computation. It's one of the fastest graph systems and provides fault tolerance and ease of use without special skills.

What is File System API?
The FS API can read data from different storage devices like HDFS, S3 or the local file system. Spark uses the FS API to read data from different storage engines.

Why Partitions are immutable?
Every transformation generates a new partition. Partitions use the HDFS API, so each partition is immutable, distributed and fault tolerant. Partitions are also aware of data locality.

What is Transformation in spark?

Spark provides two special kinds of operations on RDDs: transformations and actions. Transformations follow lazy evaluation, holding off the computation until an action is called. Each transformation generates/returns a new RDD. map, flatMap, groupByKey, reduceByKey, filter, cogroup, join, sortByKey, union, distinct and sample are common Spark transformations.

What is Action in Spark?

Actions are RDD operations whose values return back to the Spark driver program, kicking off a job to execute on the cluster. A transformation's output is the input of actions. reduce, collect, takeSample, take, first, saveAsTextFile, saveAsSequenceFile, countByKey and foreach are common actions in Apache Spark.

What is RDD Lineage?
Lineage is the process an RDD uses to reconstruct lost partitions. Spark does not replicate the data in memory; if data is lost, the RDD uses its lineage to rebuild the lost data. Each RDD remembers how it was built from other datasets.

What is Map and flatMap in Spark?

map processes one element (for example, one line or row) and produces exactly one output element. In flatMap, each input item can be mapped to zero or more output items (so the function should return a Seq rather than a single item); it's most frequently used to flatten arrays of elements.
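The difference in one small sketch, assuming spark-shell:

```scala
val lines = sc.parallelize(Seq("hello world", "hi"))

val mapped = lines.map(_.split(" "))      // RDD[Array[String]]: one array per line
val flat   = lines.flatMap(_.split(" "))  // RDD[String]: the words "hello", "world", "hi"
```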

What are broadcast variables?
Broadcast variables let the programmer keep a read-only variable cached on each machine, rather than shipping a copy of it with tasks. Spark supports two types of shared variables: broadcast variables (like Hadoop's distributed cache) and accumulators (like Hadoop counters). Broadcast values are sent to the worker nodes in read-only form.

What are Accumulators in Spark?
Accumulators are Spark's off-line debuggers, similar to Hadoop counters: you can use them to count the number of events and see what's happening during a job. Only the driver program can read an accumulator's value, not the tasks.
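A small accumulator sketch, assuming spark-shell and the Spark 1.x `sc.accumulator` API this post's era uses; the data is illustrative.

```scala
// Count malformed records seen by the tasks; only the driver reads the total.
val badRecords = sc.accumulator(0, "Bad records")

val data = sc.parallelize(Seq("1", "2", "oops", "4"))
val nums = data.flatMap { s =>
  try Some(s.toInt)
  catch { case _: NumberFormatException => badRecords += 1; None }
}

nums.count()       // an action must run before the accumulator is populated
badRecords.value   // read the final count on the driver
```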

How RDD persist the data?
There are two methods to persist the data: persist(), which lets you choose a storage level, and cache(), which keeps the data temporarily in memory. Different storage level options exist, such as MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY and many more. persist() and cache() thus use different options depending on the task.

To learn from basic Spark video tutorials, just visit the Big Data University website.