RDD vs DataFrame vs Dataset


An Application Programming Interface (API) is a set of functions and procedures that lets an application access the features or data of an operating system, another application, or a service in order to process data.
APIs are essential to any framework. Spark exposes its functionality through a handful of core APIs and has borrowed ideas from other big data ecosystems over time. The latest Spark versions revolve around three of them: the RDD, DataFrame and Dataset APIs.

So which is the best API: RDD, DataFrame or Dataset, and why? These are common Spark interview questions. In this post I explain the differences between RDD, DataFrame and Dataset.

What is RDD?
In simple words, an RDD is a distributed collection of Java or Scala objects that is immutable and fault tolerant.
Spark core exposes many functions, most of them borrowed from Scala's collections. Based on what they do, Spark separates these functions into transformations and actions.
In the RDD API you use these functions (transformations and actions) to compute the data. The main advantage is that if you already know Scala's collection functions, computing data is easy.
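As a rough sketch (the app name and sample numbers are mine, not from the post), this is how transformations and actions look on an RDD in Scala:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RddBasics").setMaster("local[*]"))

    // An RDD built from an in-memory collection
    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

    // Transformations are lazy: map and filter only describe the computation
    val doubled = numbers.map(_ * 2)
    val bigOnes = doubled.filter(_ > 5)

    // Actions actually run the job
    println(bigOnes.collect().mkString(", "))  // 6, 8, 10
    println(doubled.reduce(_ + _))             // 30

    sc.stop()
  }
}
```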
The main disadvantage of RDDs is that they use Java serialization by default. Both Java and Scala run on the JVM, so both fall back on Java serialization.

Why Java Serialization?

For example, if you want to store arrays, JSON data, or other complex objects in a database, the database does not understand them directly. So you serialize the data into a binary format, then convert that binary data into a format the database understands.
Java has its own mechanism for this, called Java serialization.
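Just to make the idea concrete, here is a tiny Scala sketch of plain Java serialization (the Employee class and values are only an illustration):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical record; Scala case classes are Serializable out of the box
case class Employee(id: Int, name: String)

object JavaSerializationDemo {
  def main(args: Array[String]): Unit = {
    val emp = Employee(1, "Ravi")

    // Object -> binary
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(emp)
    out.close()

    // Binary -> object
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    val restored = in.readObject().asInstanceOf[Employee]
    in.close()

    println(s"${buffer.size()} bytes on the wire, restored: $restored")
  }
}
```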
Java serialization was intended for small numbers of objects, not for huge numbers of them; if you rely on it, performance drops drastically. It also consumes a lot of resources while serializing data. Formats such as Avro, by contrast, compress the data internally, which gives a small performance advantage.
RDDs use Java serialization, so performance suffers when you process large amounts of data. Kryo serialization can optimize Spark RDD jobs a little, but you must follow some terms and conditions.
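For example, a minimal sketch of those "terms and conditions": you switch the serializer on the SparkConf and, ideally, register your classes up front (the Sale class here is just a placeholder):

```scala
import org.apache.spark.SparkConf

// Hypothetical record type we want Kryo to serialize efficiently
case class Sale(id: Long, amount: Double)

val conf = new SparkConf()
  .setAppName("KryoTuning")
  // Replace the default Java serializer with Kryo
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes up front keeps Kryo from writing full class names with every record
  .registerKryoClasses(Array(classOf[Sale]))
```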
One more disadvantage is that Java serialization ships both the data and its structure between the nodes. That is another headache, but it is resolved in DataFrames.
Spark introduced RDDs at the very beginning, in Spark 1.0. Processing works fine; performance is the only major disadvantage. If you are processing unstructured data, RDDs are still highly recommended.

What is DataFrame?
A little later, Spark introduced another API called DataFrame. It is very powerful, focused mainly on performance and on running SQL queries on top of the data.
In simple words, a DataFrame is RDDs plus a schema. The data is organized into named columns, like a table in an RDBMS, so the structure is kept separate from the data. Because Spark understands the schema, it does not need Java serialization to encode the structure; only the data itself is serialized.
A Spark developer can easily run SQL queries on top of the distributed data, and the API also supports DSL commands, so a Scala programmer can work in Scala style as well. These features are not available in the RDD API.
Since Spark knows the schema, there is no need to use Java serialization to encode it, and no need to deserialize the data when you apply sorting or shuffling.
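A minimal sketch of both styles on the same DataFrame (the column names and sample rows are mine, purely for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DataFrameBasics").master("local[*]").getOrCreate()
import spark.implicits._

// A DataFrame: rows of data organized into named columns (the schema)
val employees = Seq((1, "Ravi", 45000), (2, "Latha", 60000))
  .toDF("id", "name", "salary")

// DSL style
employees.filter($"salary" > 50000).select("name").show()

// SQL style on the same distributed data
employees.createOrReplaceTempView("employees")
spark.sql("SELECT name FROM employees WHERE salary > 50000").show()
```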

The power of the DataFrame API is the Catalyst optimizer. It builds logical and physical plans internally and, based on a cost model, chooses the best optimized plan. So DataFrames are optimized internally compared with RDDs.
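You can actually watch Catalyst at work with explain(); continuing the hypothetical employees DataFrame from the sketch above:

```scala
// extended = true prints the parsed, analyzed and optimized logical plans
// plus the physical plan that Catalyst finally selects
employees.filter($"salary" > 50000).select("name").explain(true)
```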
DataFrames still rely on Java serialization for the data itself, so that disadvantage of RDDs is present in DataFrames as well. In short: the main advantages are optimized performance and a friendlier API; the main disadvantage is still serialization.

What is DataSet API?
Another big data framework, Flink, internally uses two powerful APIs called DataSet and DataStream: DataSet for processing batch data and DataStream for processing streaming data. Spark core is batch-oriented by default, so Spark took the same DataSet idea and added it to Spark 1.6 as an experimental API.
The Dataset API got good results in Spark 1.6, so in Spark 2.0 DataFrame was merged into Dataset. From Spark 2.0 onward, a DataFrame is simply a Dataset of Row objects (Dataset[Row]).
The main differences between RDD, DataFrame and Dataset are serialization and performance. The Dataset API internally uses a special serialization mechanism called an encoder, which is far more efficient than Java serialization. It supports RDD-style transformations and DataFrame DSL commands, and allows SQL queries as well. In other words, if you know RDDs and DataFrames, you can apply the same steps to Datasets.

Put another way, a Dataset is the unification of RDD and DataFrame on top of encoder serialization. It was introduced in version 1.6 and became the main abstraction in Spark 2.0. The other main advantage of Datasets is compile-time type safety: the programmer can catch syntax errors and analysis errors at compile time, whereas with DataFrames analysis errors only show up at run time.
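A minimal sketch of a typed Dataset (again, the Employee case class and values are only illustrative):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record; the case class gives the Dataset both its schema and its type
case class Employee(id: Int, name: String, salary: Double)

val spark = SparkSession.builder().appName("DatasetBasics").master("local[*]").getOrCreate()
import spark.implicits._   // brings the Encoder[Employee] into scope

// Each element is an Employee, not a generic Row
val ds: Dataset[Employee] =
  Seq(Employee(1, "Ravi", 45000), Employee(2, "Latha", 60000)).toDS()

// RDD-style lambdas work...
ds.filter(_.salary > 50000).map(_.name).show()

// ...and so do DataFrame-style DSL and SQL operations
ds.select($"name").show()

// A typo such as `_.salry` would fail at compile time instead of at run time
```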

Spark itself is also moving toward Datasets: instead of Spark Streaming, MLlib and GraphX, the direction is Structured Streaming, ML pipelines and GraphFrames. My prediction is that in future there will be no RDD-based concepts at all.
One more disadvantage: only Java and Scala support the Dataset API, not Python, because of Python's dynamic nature.

Conclusion:

The main difference between RDD, DataFrame and Dataset comes down to performance. To optimize performance, Spark moved from RDDs to DataFrames, and then from DataFrames to Datasets.