
Spark Technical Interview Questions

Most Big Data analysts face Apache Spark performance questions in interviews. In this post I explain advanced Spark concepts in detail.

What is the importance of the DAG in Spark?
The directed acyclic graph (DAG) is Spark's execution engine. It skips unwanted stages of the rigid multi-stage execution model and offers significant performance improvements. In MapReduce, Hadoop can execute only map and reduce phases; if you run HQL queries, Hive translates each one into separate MapReduce jobs that execute one after another. The DAG is an execution model that schedules the whole job as a single graph of stages in a straightforward manner, so SQL, HQL, and other languages execute directly in Spark through the DAG execution engine.

How many ways are there to create an RDD? Which is the best way?
There are usually two ways: parallelizing an existing collection, and referencing a dataset in an external storage system. The parallelize option is not recommended for large inputs: if you process a vast amount of data that way, it might crash the driver JVM.

1) Parallelize:
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

2) External Datasets:
val distFile = sc.textFile("data.txt")

groupByKey or reduceByKey: which is better in Spark?
In a MapReduce-style program you can get the same output through either groupByKey or reduceByKey. If you are processing a large dataset, reduceByKey is highly recommended: it combines the output for each common key on every partition before shuffling the data. With groupByKey, all the data, including unnecessary records, is transferred over the network, so Spark performance decreases for large amounts of data.
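As a sketch (assuming a spark-shell session where `sc` already exists), both approaches below produce the same per-word counts, but reduceByKey combines within each partition before the shuffle:

```scala
// Assumes an existing SparkContext `sc` (e.g. from spark-shell).
val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
val pairs = words.map(w => (w, 1))

// groupByKey: ships every (word, 1) pair across the network, then sums.
val countsViaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey: sums within each partition first, so far less data
// crosses the network. Same result, better performance.
val countsViaReduce = pairs.reduceByKey(_ + _)
```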

When should you not call the collect() action?

Don't copy all elements of a large RDD to the driver; it is a bottleneck for the driver program. If the result is more than the driver's heap (say, over 1 TB), it can sometimes crash the driver JVM. Similarly, countByKey, countByValue, and collectAsMap are suitable only for small data-sets.

Instead of val values = myVeryLargeRDD.collect(), use the take or takeSample actions. Those actions return only the desired amount of data.
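A minimal sketch of the safer alternatives, assuming `sc` from spark-shell (the RDD name is illustrative):

```scala
// Assumes an existing SparkContext `sc`.
val myVeryLargeRDD = sc.parallelize(1 to 1000000)

// collect() would pull all elements into the driver JVM. Instead:
val firstTen = myVeryLargeRDD.take(10)                    // first 10 elements only
val sample   = myVeryLargeRDD.takeSample(false, 100, 42L) // 100 random elements
```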

What is the difference between cache() and persist()?

Both are called to store RDD data in memory. cache() uses the default MEMORY_ONLY storage level. With persist(), you can assign any storage level, such as MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, and more.

If the data-set fits in memory, use cache(); otherwise use persist() with a disk-backed storage level.
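A short sketch of both calls, assuming `sc` from spark-shell and illustrative file names:

```scala
import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("data.txt")
logs.cache() // equivalent to persist(StorageLevel.MEMORY_ONLY)

val events = sc.textFile("events.txt")
events.persist(StorageLevel.MEMORY_AND_DISK) // spill partitions to disk when memory is full
```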

What is the difference between real-time data processing and micro batch processing?
When data is processed immediately as it arrives, it is called real-time data processing.
When incoming data is held in small batches and each batch is processed as soon as possible, it is called micro-batch processing.
Storm is an example of real-time processing and Spark Streaming is an example of micro-batch processing.
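A micro-batch sketch with Spark Streaming, assuming an existing SparkContext `sc`; the host and port are illustrative:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Each 5-second window of input becomes one micro-batch (an RDD).
val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()
ssc.start()
ssc.awaitTermination()
```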

How does data serialization optimize Spark performance?
Data serialization is the first step in tuning a Spark application's performance. Spark's aim is to balance convenience and performance. To achieve this aim, Spark supports two serialization libraries: (default) Java serialization and Kryo serialization. Compared with Java serialization, Kryo serialization is the better option. Include the following in your SparkConf: conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
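A fuller sketch of the configuration; the app name and the `SalesRecord` class are hypothetical placeholders for your own application classes:

```scala
import org.apache.spark.SparkConf

// Hypothetical application class, for illustration only.
case class SalesRecord(id: Int, amount: Double)

val conf = new SparkConf()
  .setAppName("kryo-demo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// Registering classes lets Kryo write compact numeric IDs instead of
// full class names in the serialized output.
conf.registerKryoClasses(Array(classOf[SalesRecord]))
```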

What is the difference between Narrow Transformation and Wide Transformation?
In a narrow transformation, input and output stay in the same partition, so no data movement is needed. In a wide transformation, input from other partitions is required, so the data must be shuffled before processing. Narrow transformations are independent and happen in parallel; a wide transformation depends on multiple parent partitions. Narrow transformations are highly recommended for better Spark RDD performance.
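A small sketch contrasting the two, assuming `sc` from spark-shell:

```scala
val nums = sc.parallelize(1 to 10)

// Narrow transformations: each output partition depends on exactly one
// input partition, so no shuffle is needed and stages pipeline together.
val doubled = nums.map(_ * 2)
val evens   = doubled.filter(_ % 4 == 0)

// Wide transformation: reduceByKey must pull matching keys from every
// partition, triggering a shuffle across the network.
val byParity = nums.map(n => (n % 2, n)).reduceByKey(_ + _)
```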

Other Spark Optimization Tips:

  • Network bandwidth is the bottleneck of any distributed file system. Store RDDs in serialized form to reduce memory usage and optimize RDD performance.
  • If the objects are large, increase the spark.kryoserializer.buffer (default 64k) and spark.kryoserializer.buffer.max (default 64m) config properties.
  • How do you optimize Spark memory?
    There are two places to optimize: the driver level and the executor level.
    Specify driver memory when you run an application. Eg: spark-shell --driver-memory 4g
    Specify executor memory when you run an application. Eg: spark-shell --executor-memory 4g

What is the difference between Mesos and Yarn?

Mesos is a general cluster manager, which is evolving into a data-center operating system.
YARN is Hadoop's compute framework, which has robust resource-management features.

What is the importance of the DSL in DataFrames?
The layer on top of DataFrames for performing relational operations is called the domain-specific language (DSL).

How many ways to create DataFrames?
The easy way is to leverage a Scala case class, and the second way is to specify the schema programmatically.
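A sketch of both approaches, assuming a spark-shell session where `sc` and `sqlContext` exist:

```scala
import sqlContext.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// 1) Case class: the schema is inferred by reflection.
case class Person(name: String, age: Int)
val peopleDF = sc.parallelize(Seq(Person("Venu", 30))).toDF()

// 2) Programmatic schema: useful when columns are known only at runtime.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))
val rowRDD = sc.parallelize(Seq(Row("Venu", 30)))
val peopleDF2 = sqlContext.createDataFrame(rowRDD, schema)
```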

What is Catalyst optimizer?

The power of SparkSQL/DataFrames comes from the Catalyst optimizer.
The Catalyst optimizer primarily leverages functional programming constructs of Scala, such as pattern matching. It offers a general framework for transforming trees, which is used to perform analysis, optimization, planning, and runtime code generation.

What are the Catalyst optimizer's goals?

  • Optimize the code for better performance.
  • Allow users to optimize the Spark code.

Why does SparkSQL use the Catalyst optimizer?

  • To analyze a logical plan;
  • To optimize logical and physical plans;
  • To generate code to complete the query.

Give an example of how the Catalyst optimizer can optimize a logical plan.
For example, if a bunch of data is to be sorted and then filtered, Catalyst can push the filter down so unnecessary data is dropped early, reducing network waste and optimizing the logical plan.

What is Physical planning?
SparkSQL takes a logical plan, generates one or more physical plans, and selects among them based on the cost of each plan.
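You can inspect these plans yourself. A sketch assuming a spark-shell session with `sqlContext`; the file name is illustrative:

```scala
// explain(true) prints the parsed, analyzed, and optimized logical
// plans plus the physical plan SparkSQL selected for the query.
val df = sqlContext.read.json("people.json")
df.filter("age > 21").explain(true)
```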

What are quasiquotes?
Quasiquotes are a Scala feature that Catalyst uses to generate Java bytecode to run on each machine.

Can you tell me few data-types in JSON?

String, Number, Boolean, and null. Within an object, everything is in key-value format.

What are objects and Array in Json data?

Curly braces { } represent an object, whereas a sequence of objects inside square brackets [ ] represents an array.

Object: {"Name": "Venu", "age": "30"}

Array: [{"Name": "Venu", "age": "30"}, {"Name": "Venu", "age": "30"}, {"Name": "Venu", "age": "30"}]
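As a sketch of how SparkSQL consumes such data, assuming `sqlContext` from spark-shell and an illustrative "people.json" file with one JSON object per line (JSON Lines format):

```scala
// people.json contains lines like: {"Name": "Venu", "age": "30"}
val people = sqlContext.read.json("people.json")
people.printSchema()                       // schema is inferred from the JSON keys
people.registerTempTable("people")
sqlContext.sql("SELECT Name, age FROM people").show()
```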

How to decide whether use MEMORY_ONLY_SER or MEMORY_AND_DISK_SER to persist the RDD?

If the serialized data fits in RAM, use MEMORY_ONLY_SER; if it is larger than RAM, use MEMORY_AND_DISK_SER.

Why broadcast variables?

Communication is very important in Spark. To reduce communication cost, Spark uses broadcast variables. Instead of transferring a variable with every task, the variable is kept cached on each machine in read-only mode.
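A minimal sketch, assuming `sc` from spark-shell; the lookup table is illustrative:

```scala
// A small lookup table, broadcast once to every executor instead of
// being shipped with each task's closure.
val countryCodes = Map("IN" -> "India", "US" -> "United States")
val bc = sc.broadcast(countryCodes)

val codes = sc.parallelize(Seq("IN", "US", "IN"))
// Tasks read the cached copy via bc.value (read-only).
val names = codes.map(c => bc.value.getOrElse(c, "Unknown"))
```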