Spark Technical Interview Questions

Big data analysts are often asked Apache Spark performance questions in interviews. In this post I explain advanced Spark concepts in detail.

What is the importance of the DAG in Spark?
The directed acyclic graph (DAG) is Spark’s execution model. It skips the unnecessary steps of a fixed multi-stage execution model and so offers better performance. In MapReduce, Hadoop can only execute a map step followed by a reduce step, so a Hive (HQL) query that needs several stages runs as a chain of MapReduce jobs, one after another. The DAG execution model lets Spark run the whole chain of operations directly, in a straightforward manner. So SQL, HQL, and other query languages execute directly in Spark through the DAG execution engine.

How many ways are there to create an RDD? Which is the best way?
There are two ways: parallelizing an existing collection in the driver program, or referencing a dataset in an external storage system. Parallelizing is not recommended for large data, because the whole collection must first fit in the driver, and a vast amount of data might crash the driver JVM.

1) Parallelize: val data = Array(1, 2, 3, 4, 5); val distData = sc.parallelize(data)

2) External datasets: val distFile = sc.textFile("data.txt")

groupByKey or reduceByKey: which is best in Spark?
In a MapReduce-style program you can get the same output through groupByKey and reduceByKey. If you are processing a large dataset, reduceByKey is highly recommended. It combines the values that share a key on each partition before shuffling the data, whereas groupByKey transfers every key/value pair over the network, much of it unnecessarily. So groupByKey decreases Spark performance for large amounts of data.
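As a minimal sketch of the difference, assuming Spark 2.x or later and a local master (the sample words and the names pairs, viaReduce, and viaGroup are made up for illustration), both calls below produce the same counts, but only reduceByKey pre-aggregates inside each partition before the shuffle:

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation only; a real job gets its master from spark-submit.
val spark = SparkSession.builder().master("local[2]").appName("byKeyDemo").getOrCreate()
val sc = spark.sparkContext

// Hypothetical sample data: (word, 1) pairs spread over two partitions.
val pairs = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), 2).map(word => (word, 1))

// reduceByKey sums within each partition first, then shuffles one partial sum per key.
val viaReduce = pairs.reduceByKey(_ + _).collect().toMap

// groupByKey shuffles every (word, 1) pair and only then sums the grouped values.
val viaGroup = pairs.groupByKey().mapValues(_.sum).collect().toMap
```

On six records the difference is invisible; it shows up in the shuffle read/write sizes in the Spark UI once the dataset is large.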

When should you not call the collect() action?

Don’t copy all elements of a large RDD to the driver; it makes the driver program a bottleneck. If the RDD is larger than the driver’s memory, collect() can crash the driver JVM. Similarly, countByKey, countByValue, and collectAsMap are suitable only for small datasets.

Avoid val values = myVeryLargeRDD.collect(). Instead, use the take or takeSample actions, which bring back only the desired number of elements.
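A small sketch of the safer actions, assuming Spark 2.x or later on a local master; the RDD here is a stand-in range, not real data, and the names largeRdd, firstFive, and sampled are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("takeDemo").getOrCreate()
val sc = spark.sparkContext

// Stand-in for a large RDD; imagine far more elements than the driver could hold.
val largeRdd = sc.parallelize(1 to 1000000, 8)

// take(n) fetches only the first n elements, scanning as few partitions as it can.
val firstFive = largeRdd.take(5)

// takeSample returns a fixed-size random sample; the seed makes the run repeatable.
val sampled = largeRdd.takeSample(withReplacement = false, num = 10, seed = 42L)
```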

What is the difference between cache() and persist()?

Both store the RDD’s data so that later actions can reuse it. cache() always uses the default storage level, MEMORY_ONLY. With persist(), you can assign any storage level, such as MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, and more; persist() with no argument is the same as cache().

If the dataset fits in memory, use cache(); otherwise use persist() with a storage level that can spill to disk.
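The two calls can be compared directly by inspecting the storage level each one assigns. This is only a sketch, assuming Spark 2.x or later and a local master, with made-up RDDs:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().master("local[2]").appName("persistDemo").getOrCreate()
val sc = spark.sparkContext

// cache() marks the RDD with the default level for RDDs, MEMORY_ONLY.
val cached = sc.parallelize(1 to 1000).cache()
val cachedLevel = cached.getStorageLevel

// persist() accepts an explicit level; this one serializes and may spill to disk.
val persisted = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_AND_DISK_SER)
val persistedLevel = persisted.getStorageLevel

// Nothing is stored until an action runs; release the storage when done.
cached.count()
persisted.count()
cached.unpersist()
persisted.unpersist()
```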

What is the difference between real-time data processing and micro-batch processing?
When you process each piece of data as soon as it arrives, it is called real-time data processing.
When you hold incoming data in a small batch and then process that batch as early as possible, it is called micro-batch processing.
Storm is an example of real-time processing, and Spark Streaming is an example of micro-batch processing.

How does data serialization optimize Spark performance?
Data serialization is the first step in tuning a Spark application’s performance. Spark aims to balance convenience and performance. To achieve this, Spark offers two serialization libraries: Java serialization (the default) and Kryo serialization. Compared with Java serialization, Kryo is faster and more compact, so it is usually the better option. To enable it, set this in SparkConf: conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
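A fuller configuration might look like the sketch below. The Click class is hypothetical and stands for whatever application classes end up in shuffled or cached data; registering them lets Kryo write a short ID instead of the full class name.

```scala
import org.apache.spark.SparkConf

// Hypothetical application class that will be shuffled or cached.
case class Click(userId: Long, url: String)

val conf = new SparkConf()
  .setAppName("kryoDemo")
  // Switch from the default Java serializer to Kryo.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Optional but recommended: register the classes Kryo will serialize.
  .registerKryoClasses(Array(classOf[Click]))
```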

What is the difference between a narrow transformation and a wide transformation?
In a narrow transformation, input and output stay in the same partition, so no data movement is needed. In a wide transformation, input from other partitions is required, so data must be shuffled before processing. Narrow transformations are independent and happen in parallel; in a wide transformation, each child partition depends on multiple parent partitions. Prefer narrow transformations where possible for better Spark RDD performance.
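One way to see the distinction is to look at the dependency type Spark records for each RDD. The sketch below assumes Spark 2.x or later on a local master and uses made-up numbers; map and filter yield a one-to-one dependency, while reduceByKey yields a shuffle dependency.

```scala
import org.apache.spark.{OneToOneDependency, ShuffleDependency}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("depsDemo").getOrCreate()
val sc = spark.sparkContext

val nums = sc.parallelize(1 to 10, 2)

// Narrow: each output partition is computed from exactly one input partition.
val narrowRdd = nums.map(_ * 2).filter(_ > 5)

// Wide: values with the same key may sit in different partitions, so a shuffle is needed.
val wideRdd = narrowRdd.map(n => (n % 3, n)).reduceByKey(_ + _)

val narrowIsOneToOne = narrowRdd.dependencies.head.isInstanceOf[OneToOneDependency[_]]
val wideIsShuffle = wideRdd.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]]
```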

Other Spark optimization tips:

  • Network bandwidth is often the bottleneck of a distributed system. Store RDDs in serialized form to reduce memory usage and optimize RDD performance.
  • If the objects are large, increase the spark.kryoserializer.buffer (default 64k) and spark.kryoserializer.buffer.max (default 64m) config properties.
  • How to optimize Spark memory?
    There are two places to optimize: the driver level and the executor level.
    Specify driver memory when you run an application. E.g.: spark-shell --driver-memory 4g
    Specify executor memory when you run an application. E.g.: spark-shell --executor-memory 4g

What is the difference between Mesos and YARN?

Mesos is a general-purpose cluster manager, which is evolving into a data center operating system.
YARN is Hadoop’s resource management layer, with robust resource management features.

What is the importance of the DSL in DataFrames?
The domain-specific language (DSL) is a layer on top of DataFrames for performing relational operations, such as select, filter, and groupBy.

How many ways are there to create DataFrames?
Two. The easy way is to leverage a Scala case class, so that the schema is inferred by reflection; the second way is to specify the schema programmatically.
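Both routes can be sketched as follows, assuming Spark 2.x or later (SparkSession) on a local master. The Person class and the sample rows are invented for illustration.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical record type; its fields become the DataFrame columns.
case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[2]").appName("dfDemo").getOrCreate()
import spark.implicits._

// 1) Reflection: the schema is inferred from the case class.
val fromCaseClass = Seq(Person("Venu", 30), Person("Anu", 25)).toDF()

// 2) Programmatic: build the schema by hand and apply it to an RDD of Rows.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = false)
))
val rows = spark.sparkContext.parallelize(Seq(Row("Venu", 30), Row("Anu", 25)))
val fromSchema = spark.createDataFrame(rows, schema)
```

The programmatic route is the one to reach for when the columns are only known at runtime, for example when they come from a header line or a config file.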

What is the Catalyst optimizer?

The power of Spark SQL and DataFrames comes from the Catalyst optimizer.
The Catalyst optimizer primarily leverages functional programming constructs of Scala, such as pattern matching. It offers a general framework for transforming trees, which Spark SQL uses to perform analysis, optimization, planning, and runtime code generation.

What are the Catalyst optimizer’s goals?

  • Optimize the code for better performance.
  • Allow users to extend the optimizer and so optimize their own Spark code.

Why does Spark SQL use the Catalyst optimizer?

  • To analyze a logical plan.
  • To optimize logical and physical plans.
  • To generate code that executes the query.

Give an example of how the Catalyst optimizer can optimize a logical plan.
For example, when a query sorts a large dataset and then filters it, Catalyst can apply the filter first, so unnecessary data is discarded before it is sorted and moved across the network.

What is physical planning?
Spark SQL takes a logical plan and generates one or more physical plans, then selects one of them based on the cost of each plan.

What are quasiquotes?
Quasiquotes are a Scala feature for building syntax trees in code. Catalyst uses them to generate code that is compiled to Java bytecode and run on each machine.

Can you name a few data types in JSON?

String, number, boolean, and null, plus the two structured types, object and array. Objects hold everything in key/value format.

What are objects and arrays in JSON data?

Curly braces { } represent an object, whereas square brackets [ ] represent an array, which is an ordered sequence of values (often objects).

Object: {"Name": "Venu", "age": "30"}

Array: [{"Name": "Venu", "age": "30"}, {"Name": "Venu", "age": "30"}, {"Name": "Venu", "age": "30"}]

How do you decide whether to use MEMORY_ONLY_SER or MEMORY_AND_DISK_SER to persist an RDD?

If the data is smaller than the available RAM, use MEMORY_ONLY_SER; if it is larger than the RAM, use MEMORY_AND_DISK_SER.

Why broadcast variables?

Communication is very important in Spark. To reduce communication cost, Spark uses broadcast variables. Instead of shipping these variables with every task, Spark keeps them cached on each node in read-only mode.


Spark Advanced Interview questions

In my previous post I shared a few Spark interview questions; please check it out.

What is a DataFrame?

SQL + RDD = Spark DataFrame

DataFrames are a SQL programming abstraction on top of Spark Core. A DataFrame is an RDD with a schema. It eases many Spark problems. The DataFrame API is available in Scala, Java, Python, and R, so programmers in any of these languages can create DataFrames.

What is the importance of glom in Spark?
glom is a method (RDD.glom()) that returns a new RDD in which all the elements of each partition are gathered into a single array. Usually a partition is processed one row at a time, but RDD.glom() lets you treat a partition as an array rather than as a single row at a time.
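A quick sketch of glom, assuming Spark 2.x or later on a local master, with six made-up numbers split over three partitions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("glomDemo").getOrCreate()
val sc = spark.sparkContext

// Six numbers in three partitions: (1, 2), (3, 4), (5, 6).
val rdd = sc.parallelize(1 to 6, 3)

// glom() turns each partition into one Array, so the RDD now has three elements.
val perPartition = rdd.glom().collect()

// With whole partitions as arrays, per-partition work is a plain array operation.
val maxPerPartition = rdd.glom().map(_.max).collect()
```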

What is Shark?

Shark is the predecessor of Spark SQL. It allowed Hive queries to run on Spark, but it has been replaced by Spark SQL.

What is the difference between Tachyon and Apache Spark?

Tachyon (since renamed Alluxio) is a memory-centric distributed storage system that shares memory across the cluster. Programmers can run Spark, MapReduce, Shark, and Flink on it without any code change. Spark, by contrast, is a cluster computing framework on which you can run batch, streaming, and interactive analytics rapidly. Speed and lazy evaluation are the power of Spark.

What is the difference between a framework, a library, and an API?

An API is the part of a library that defines how external code interacts with it: when code requests something from the library, the API serves the request. A library is a collection of classes that perform a specific task, packaged for reuse. A framework provides functionality or a solution for a particular problem area, along with the environment and supporting software needed to develop in it. It is the heart of developing software applications.

What is the history of Apache Spark?

Apache Spark was originally developed in the AMPLab at UC Berkeley and later became an Apache top-level project. Databricks, a major Spark contributor, was founded by the creators of Apache Spark.

What is the main difference between Spark and Storm?

Spark performs data-parallel computation, whereas Storm performs task-parallel computation. Compared with Spark Streaming, Storm processes individual events with lower latency, but both are open source, distributed, fault tolerant, and scalable for processing streaming data.

What are a data scientist’s responsibilities?

A data scientist analyzes data for insights and models data for visualization. He or she may have experience with SQL, statistics, predictive modeling, and programming in languages like Python and R.

What are a data engineer’s responsibilities?

A big data engineer usually builds production data processing applications. Most often the engineer monitors, inspects, and tunes those applications by using programming languages.

When do you use Apache Spark?

For iterative and interactive applications that need faster processing, and for real-time stream processing. When you want a single platform for batch, streaming, and interactive applications, Apache Spark is the best choice.

What is the purpose of the GraphX library?

Most social media sites generate graphs. GraphX is used for graphs and graph-parallel computation, and it ships with common graph algorithms.

What is Apache Zeppelin?

It’s a collaborative data analytics and visualization tool for Apache Spark and Flink. At the time of writing it is in the Apache incubating stage, which means it is still being implemented and is not yet stable.

What is Jupyter?
It evolved from the IPython project. It provides a notebook interface with built-in support for visualization, and besides Python 3 it also supports R, Ruby, PySpark, and other languages.

What is Data Scientist Workbench?
It is an interactive data platform built around the Python tool Jupyter. It comes pre-installed with Python, Scala, or R.

What is Spark Notebook?
It’s a Spark SQL tool that dynamically injects JavaScript libraries to create visualizations.

What is Databricks Cloud?
It’s available on AWS. If you have an AWS account, you can use it.

What is Zeppelin?
Zeppelin is an analytical tool that supports multiple language back-ends. By default it supports Scala with SparkContext.

Is HDFS mandatory to run Spark?
No, HDFS is not needed. RDDs use the Hadoop InputFormat API to read data, so an RDD can read from any storage system, such as AWS, Azure, or Google cloud storage, or the local file system. Any InputFormat implementation can be used directly in Spark, so input from HBase, Cassandra, MongoDB, or a custom input format can be processed directly as an RDD.

How does Spark identify data locality?
Usually the InputFormat specifies the splits and their locality. RDDs use the Hadoop InputFormat API, so partitions correspond to HDFS splits. So Spark can easily identify the data locality when it’s needed.

How does coalesce increase RDD performance?
A shuffle can decrease RDD performance. repartition() can increase or decrease the number of partitions, but it always shuffles, whereas coalesce() decreases the number of partitions without a shuffle. Coalesce is useful after a filter that leaves many small partitions, and to consolidate data before writing output to HDFS, at the cost of less parallelism. So coalesce directly affects RDD partition performance.
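The following sketch, assuming Spark 2.x or later on a local master and a made-up RDD, shows both calls shrinking 100 sparse partitions to 4. The difference is in how they get there: coalesce merges existing partitions in place, while repartition redistributes every record through a shuffle.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("coalesceDemo").getOrCreate()
val sc = spark.sparkContext

// 100 partitions, but the filter leaves only ten elements in total.
val manyPartitions = sc.parallelize(1 to 1000, 100)
val filtered = manyPartitions.filter(_ % 100 == 0)

// coalesce merges neighbouring partitions without a shuffle.
val compacted = filtered.coalesce(4)

// repartition reaches the same partition count, but through a full shuffle.
val rebalanced = filtered.repartition(4)

val compactedValues = compacted.collect().sorted
```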

What is the importance of co-located key-value pairs?

In some cases, many values share the same key, especially in iterative computations. Co-locating those values on the same partition benefits many operations, because they can run without a shuffle. RangePartitioner and HashPartitioner ensure that all pairs with the same key end up in the same partition.

What are the statistical operations on numeric RDDs?

Standard deviation, mean, sum, min, and max. stats() returns all of these statistics in one pass.
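As a small illustration, assuming Spark 2.x or later on a local master and four made-up values, stats() returns a StatCounter from which each statistic can be read:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("statsDemo").getOrCreate()
val sc = spark.sparkContext

// stats() is available on RDDs of Double through an implicit conversion.
val values = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
val summary = values.stats()

// Each statistic is a field of the returned StatCounter.
val mean = summary.mean
val total = summary.sum
val smallest = summary.min
val largest = summary.max
val populationStdev = summary.stdev // population standard deviation, not the sample one
```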

Can RDDs be shared across applications?

No. An RDD’s partitions can be distributed across the network, but an RDD belongs to one application and cannot be shared with another.

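What is the difference between reduce and fold?

Both combine the elements of an RDD with a function. fold additionally takes a zero value, which is used as the starting value in each partition and again when the partition results are merged, so it also works on an empty RDD, whereas reduce fails on one.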
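The sketch below, assuming Spark 2.x or later on a local master and made-up numbers in exactly two partitions, shows the two behaving the same with a neutral zero value and differing on an empty RDD. It also shows why the zero value should be neutral for the operation: it is applied once per partition and once more at the final merge.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

val spark = SparkSession.builder().master("local[2]").appName("foldDemo").getOrCreate()
val sc = spark.sparkContext

// Two partitions: (1, 2) and (3, 4).
val nums = sc.parallelize(Seq(1, 2, 3, 4), 2)

val viaReduce = nums.reduce(_ + _)
val viaFold = nums.fold(0)(_ + _)

// A non-neutral zero is added in each of the two partitions and again at the merge.
val foldWithOne = nums.fold(1)(_ + _)

// On an empty RDD, fold returns the zero value while reduce throws.
val empty = sc.parallelize(Seq.empty[Int], 2)
val emptyFold = empty.fold(0)(_ + _)
val emptyReduceFails = Try(empty.reduce(_ + _)).isFailure
```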

What is SBT?

SBT is an open source build tool for Scala and Java projects. It’s similar to Java’s Maven.

What is the use of Kryo serialization in Spark?

Spark uses Java serialization by default. Kryo is a fast and efficient serialization framework for Java that Spark supports as an alternative to Java serialization. It’s highly recommended for large amounts of data.

What is the SizeEstimator tool?

The Java heap is the amount of memory allocated to applications running in the JVM. SizeEstimator is a utility that estimates the size of objects on the Java heap. The estimate helps you decide how to partition the data.

What is the pipe operator?

The pipe operator lets you process RDD data using external applications. After creating an RDD, the developer pipes it through a shell script or command. Each partition’s elements are written to the process’s standard input, one per line, and the lines the process prints become the elements of the resulting RDD.
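A minimal sketch of pipe, assuming Spark 2.x or later on a local master and a Unix-like system where the tr command is on the PATH (an assumption about the environment, not something Spark provides). The sample words are made up.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("pipeDemo").getOrCreate()
val sc = spark.sparkContext

val words = sc.parallelize(Seq("spark", "hadoop"), 2)

// Each partition starts its own `tr` process; elements go in on stdin, one per line,
// and every line the process prints becomes an element of the new RDD.
val shouted = words.pipe("tr a-z A-Z").collect()
```

Any executable that reads lines and writes lines can stand in for tr, which is how scripts in other languages are usually plugged into an RDD pipeline.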

What are executors?

Executors are processes on the worker nodes that run computations and store data for your application. Spark sends the application code to the executors, and then the SparkContext (sc) sends tasks to the executors to run. Each application has its own executors.

What is the functionality of the SparkContext object?

Any Spark application consists of a driver program and executors that run on the cluster. A JAR contains the application code for processing. The SparkContext object coordinates these processes.

What are DStreams?
A DStream is a sequence of RDDs. It’s a high-level Spark Streaming abstraction that represents a continuous stream of data from sources like Kafka and Flume as a series of batches.

Spark Interview questions

What is Spark?

Spark is a parallel data processing framework. It allows you to develop fast, unified big data applications that combine batch, streaming, and interactive analytics.

Why Spark?

Spark is a third-generation distributed data processing platform. It’s a unified big data solution for all big data processing problems, such as batch, interactive, and streaming processing. So it can ease many big data problems.

What is an RDD?
Spark’s primary core abstraction is the Resilient Distributed Dataset (RDD). An RDD is a collection of partitioned data that satisfies these properties: immutable, distributed, lazily evaluated, and cacheable.

What is immutable?
Once a value is created and assigned, it’s not possible to change it; this property is called immutability. RDDs are immutable by default: Spark does not allow updates or modifications in place. Please note that you can still derive a new collection from an RDD; only the values of an existing RDD cannot be changed.

What is distributed?
The data in an RDD is automatically distributed across different parallel computing nodes.

What is lazily evaluated?
When you execute a series of operations, Spark does not have to evaluate them immediately. Transformations in particular are lazy: they are evaluated only when an action triggers them.

What is cacheable?
Spark can keep all the data in memory for computation, rather than going to disk. So Spark can access cached data up to 100 times faster than Hadoop MapReduce.

What is the Spark engine’s responsibility?
The Spark engine is responsible for scheduling, distributing, and monitoring the application across the cluster.

What are the common Spark ecosystem components?
Spark SQL (formerly Shark) for SQL developers,
Spark Streaming for streaming data,
MLlib for machine learning algorithms,
GraphX for graph computation,
SparkR to run R on the Spark engine,
BlinkDB, which enables interactive queries over massive data, are the common Spark ecosystem components. At the time of writing, GraphX, SparkR, and BlinkDB are in the incubation stage.

What are partitions?
A partition is a logical division of the data; the idea is derived from MapReduce (splits). The data is divided logically specifically so that it can be processed in parallel. Working on small chunks of data supports scalability and speeds up the process. Input data, intermediate data, and output data are all partitioned RDDs.

How does Spark partition the data?

Spark uses the MapReduce (Hadoop InputFormat) API to partition the data. The input format determines the number of partitions. By default the partition size is the HDFS block size (for best performance), but it’s possible to change the partition size, as with a split.

How does Spark store the data?
Spark is a processing engine; it has no storage engine. It can retrieve data from any storage engine, like HDFS, S3, and other data sources.

Is it mandatory to start Hadoop to run a Spark application?
No, it’s not mandatory. There is no separate storage in Spark, so it can use the local file system to store the data. You can load data from the local system and process it; Hadoop or HDFS is not mandatory to run a Spark application.

What is SparkContext?
To create RDDs, a program first creates a SparkContext object, which connects to the Spark cluster. SparkContext tells Spark how to access the cluster. A SparkConf object, which holds the application’s configuration, is the key ingredient for creating it.

What are Spark Core’s functionalities?
Spark Core is the base engine of the Apache Spark framework. Memory management, fault tolerance, scheduling and monitoring jobs, and interacting with storage systems are the primary functionalities of Spark Core.

How is Spark SQL different from HQL and SQL?
Spark SQL is a component on top of the Spark Core engine that supports SQL and Hive Query Language (HQL) without changing any syntax. It’s possible to join a SQL table and an HQL table.

When do we use Spark Streaming?
Spark Streaming is an API for near-real-time processing of streaming data. It gathers streaming data from different sources, like web server log files, social media data, and stock market data, or from Hadoop ecosystem tools like Flume and Kafka.

How does the Spark Streaming API work?
The programmer sets a batch interval in the configuration; whatever data arrives in Spark within that interval is separated out as a batch. The input stream (DStream) goes into Spark Streaming. The framework breaks it up into small chunks called batches, then feeds them into the Spark engine for processing: the Spark Streaming API passes the batches to the core engine, and the core engine generates the final results as a stream of batches. The output is also in the form of batches. This lets the same engine process both streaming data and batch data.

What is Spark MLlib?

Mahout is a machine learning library for Hadoop; similarly, MLlib is Spark’s machine learning library. MLlib provides different algorithms that scale out on the cluster for data processing. Most data scientists who work with Spark use the MLlib library.

What is GraphX?

GraphX is a Spark API for manipulating graphs and collections. It unifies ETL, exploratory analysis, and iterative graph computation. It’s a fast graph system that provides fault tolerance and ease of use without special skills.

What is the File System API?
The FS API can read data from different storage systems, like HDFS, S3, or the local file system. Spark uses the FS API to read data from different storage engines.

Why are partitions immutable?
Every transformation generates new partitions. Partitions use the HDFS API, so they are immutable, distributed, and fault tolerant. Partitions are also aware of data locality.

What is a transformation in Spark?

Spark provides two kinds of operations on RDDs: transformations and actions. Transformations are lazy: they only record what should be done, and nothing is computed until an action is called. Each transformation returns a new RDD. map, flatMap, groupByKey, reduceByKey, filter, cogroup, join, sortByKey, union, distinct, and sample are common Spark transformations.

What is an action in Spark?

An action is an RDD operation that returns a value to the Spark driver program, and it kicks off a job to execute on the cluster. A transformation’s output is the input of an action. reduce, collect, takeSample, take, first, saveAsTextFile, saveAsSequenceFile, countByKey, and foreach are common actions in Apache Spark.

What is RDD lineage?
Lineage is the mechanism an RDD uses to reconstruct lost partitions. Spark does not replicate the data in memory; if data is lost, the RDD uses its lineage to rebuild it. Each RDD remembers how it was built from other datasets.

What are map and flatMap in Spark?

map processes each line or row of the data and returns exactly one output item per input item. In flatMap, each input item can be mapped to multiple output items (so the function should return a Seq rather than a single item). So it is most frequently used to flatten arrays into their elements.
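The difference is easiest to see by splitting lines into words. This sketch assumes Spark 2.x or later on a local master; the two sample lines are made up.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("flatMapDemo").getOrCreate()
val sc = spark.sparkContext

val lines = sc.parallelize(Seq("hello spark", "hello big data"))

// map: one output per input, so the result holds two arrays of words.
val mapped = lines.map(_.split(" "))

// flatMap: the arrays are flattened, so the result holds five individual words.
val flattened = lines.flatMap(_.split(" "))

val lineCount = mapped.count()
val words = flattened.collect()
```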

What are broadcast variables?
Broadcast variables let the programmer keep a read-only variable cached on each machine, rather than shipping a copy of it with tasks. Spark supports two types of shared variables: broadcast variables (like the Hadoop distributed cache) and accumulators (like Hadoop counters). A broadcast variable is sent to the worker nodes as a read-only value.
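A typical use is a small lookup table that every task needs. The sketch below assumes Spark 2.x or later on a local master; the country-code map is invented for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("broadcastDemo").getOrCreate()
val sc = spark.sparkContext

// Hypothetical lookup table, shipped to each executor once instead of with every task.
val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

val codes = sc.parallelize(Seq("IN", "US", "IN"))

// Tasks read the cached copy through .value; they must not try to modify it.
val resolved = codes.map(code => countryNames.value.getOrElse(code, "unknown")).collect()
```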

What are accumulators in Spark?
Accumulators are Spark’s offline debuggers. They are similar to Hadoop counters: you can use accumulators to count events and see what’s happening during a job. Only the driver program can read an accumulator’s value, not the tasks.
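For instance, an accumulator can count malformed records while a job runs. This is a sketch assuming Spark 2.x or later (the longAccumulator API) on a local master, with made-up input strings:

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

val spark = SparkSession.builder().master("local[2]").appName("accumulatorDemo").getOrCreate()
val sc = spark.sparkContext

// Named accumulators also show up in the Spark UI.
val badRecords = sc.longAccumulator("badRecords")

val raw = sc.parallelize(Seq("1", "2", "x", "4", "y"))

// Tasks can only add to the accumulator; here each unparseable record adds 1.
raw.foreach { s =>
  if (Try(s.toInt).isFailure) badRecords.add(1)
}

// Only the driver reads the total, after the action has finished.
val badCount = badRecords.value
```

Updating the accumulator inside an action such as foreach, as here, is the safer pattern; updates made inside transformations can be applied more than once if a task is re-executed.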

How does an RDD persist data?
There are two methods to persist data: persist(), which accepts a storage level, and cache(), which uses the default in-memory level. Different storage level options exist, such as MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, and many more. Which one to use depends on the task.

To learn the basics of Spark through video tutorials, visit the Big Data University website.