What is a DataFrame?
SQL + RDD = Spark DataFrame
A DataFrame is a SQL-style programming abstraction built on top of Spark Core: essentially an RDD with a schema. It simplifies many common Spark problems. The DataFrame API is available in Scala, Java, Python, and R, so programmers in any of these languages can create DataFrames.
What is the importance of glom in Spark?
glom() is an RDD method that returns a new RDD in which the elements of each partition are collected into a single array. Normally a partition yields one row at a time, but RDD.glom() lets you treat a whole partition as an array rather than as one row at a time.
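As a rough illustration of the semantics, here is a pure-Python sketch that simulates partitions with plain lists (not the actual Spark API):

```python
# Simulate an RDD's partitions as plain Python lists.
partitions = [[1, 2, 3], [4, 5], [6]]

# Without glom: iterating over the RDD yields one element at a time.
flat = [x for part in partitions for x in part]

# With glom: each partition becomes a single array element, so the
# resulting "RDD" has exactly one list per partition.
glommed = [list(part) for part in partitions]

print(flat)     # [1, 2, 3, 4, 5, 6]
print(glommed)  # [[1, 2, 3], [4, 5], [6]]
```

In real Spark this would be `rdd.glom().collect()`, which is handy for inspecting how data is distributed across partitions.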
What is Shark?
It’s the predecessor of Spark SQL. It allowed Hive queries to run on Spark, but it has since been replaced by Spark SQL.
What is the difference between Tachyon and Apache Spark?
Tachyon is a memory-centric distributed storage system that shares in-memory data across a cluster. Programs written for Spark, MapReduce, Shark, or Flink can use it without any code change. Spark, by contrast, is a cluster computing framework for running batch, streaming, and interactive analytics rapidly. Speed and lazy evaluation are Spark’s strengths.
What is the difference between a framework, a library, and an API?
An API is the part of a library that defines how external code interacts with it: when calling code requests something, the API serves and fulfills that request. A library is a collection of classes for performing a specific task, bundled as a package. A framework provides solutions to an entire problem area and supplies the environment in which applications run; it is the backbone of application development.
What is the history of Apache Spark?
Apache Spark was originally developed in the AMPLab at UC Berkeley and later became an Apache top-level project. Databricks, founded by the creators of Apache Spark, is a major Spark contributor.
What is the main difference between Spark and Storm?
Spark performs data-parallel computation, whereas Storm performs task-parallel computation. Storm can process individual events with lower latency, but both are open-source, distributed, fault-tolerant, and scalable systems for processing streaming data.
What are a data scientist’s responsibilities?
Analyzing data for insights and modeling it for visualization. A data scientist typically has experience with SQL, statistics, predictive modeling, and a programming language such as Python or R.
What are a data engineer’s responsibilities?
A big data engineer usually builds production data-processing applications. Most often the engineer monitors, inspects, and tunes those applications using programming languages.
When do you use Apache Spark?
For iterative, interactive applications that need fast processing, and for real-time stream processing. As a single platform for batch, streaming, and interactive applications, Apache Spark is the best choice.
Most social media sites generate graphs. GraphX is Spark’s component for graphs and graph-parallel computation, and it ships with common graph algorithms.
What is Apache Zeppelin?
It’s a collaborative data analytics and visualization tool for Apache Spark and Flink. It is in the Apache incubation stage, meaning it is still being stabilized and actively implemented.
What is Jupyter?
It evolved from the IPython project. It provides built-in visualization support for Python 3 and also supports R, Ruby, PySpark, and other languages.
What is Data Scientist workbench?
An interactive data platform built around the Python tool Jupyter. It comes pre-installed with Python, Scala, and R.
What is Spark Notebook?
It’s available on AWS; if you have an EC2 account, you can use it.
What is Zeppelin?
Zeppelin is an analytical tool that supports multiple language back-ends; by default it supports Scala with a SparkContext.
Is a DFS mandatory to run Spark?
No, HDFS is not required. RDDs use the Hadoop InputFormat API to read data, so they can work with any storage system: AWS, Azure, Google Cloud, or the local file system. Any InputFormat implementation can be used directly in Spark, so input from HBase, Cassandra, MongoDB, or a custom InputFormat can be processed directly into an RDD.
How does Spark identify data locality?
An InputFormat specifies the splits and their locations. Because RDDs use the Hadoop InputFormat API, partitions correlate with HDFS splits, so Spark can easily determine data locality when it needs to.
How does coalesce increase RDD performance?
Shuffles can hurt RDD performance. repartition() increases the number of partitions (for example, after a filter) but triggers a shuffle, whereas coalesce() decreases the number of partitions without a shuffle. Coalesce can consolidate partitions before writing output to HDFS, at the cost of reduced parallelism, so it directly affects partition-level performance.
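A simplified pure-Python sketch of the idea behind coalesce — contiguous input partitions are merged locally into fewer output partitions, without redistributing individual rows the way a shuffle would (this is an illustration of the concept, not Spark’s actual implementation):

```python
# Five input partitions, e.g. left small after a filter.
partitions = [[1], [2, 3], [4], [5, 6], [7]]

def coalesce(parts, n):
    """Merge contiguous partitions into n partitions without moving
    individual rows between partitions (no shuffle) -- a simplification
    of what Spark's coalesce(n) does."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(parts):
        merged[i * n // len(parts)].extend(part)
    return merged

print(coalesce(partitions, 2))  # [[1, 2, 3, 4], [5, 6, 7]]
```

In real Spark this would be `rdd.coalesce(2)`; `rdd.repartition(2)` would produce the same partition count but with a full shuffle.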
What is the importance of co-located key–value pairs?
In some cases many values share the same key, especially in iterative workloads, so co-locating those values benefits many operations. A RangePartitioner or HashPartitioner ensures that all pairs with the same key land in the same partition.
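A pure-Python sketch of how a hash partitioner achieves this co-location (an illustration of the concept, not Spark’s HashPartitioner itself):

```python
def hash_partition(pairs, num_partitions):
    """Sketch of hash partitioning: every pair with the same key hashes
    to the same partition index, so identical keys are co-located."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
parts = hash_partition(pairs, 2)
# All ("a", ...) pairs end up together, and all ("b", ...) pairs too.
```

Because joins and reduce-by-key operations only need local data when keys are co-located, this is what lets Spark avoid shuffling on subsequent operations over a pre-partitioned RDD.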
What are the statistical operations on numeric RDDs?
Standard deviation, mean, sum, count, min, and max. stats() returns all of these statistics in one pass.
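For illustration, the values Spark’s `stats()` returns can be computed with Python’s standard library (a sketch of the semantics, not the Spark API):

```python
import statistics

nums = [1.0, 2.0, 3.0, 4.0]

# The quantities Spark's nums.stats() reports in a single pass:
stats = {
    "count": len(nums),
    "mean": statistics.mean(nums),
    "stdev": statistics.pstdev(nums),  # Spark's stdev() is the population stdev
    "min": min(nums),
    "max": max(nums),
    "sum": sum(nums),
}
print(stats["mean"], stats["sum"])  # 2.5 10.0
```

In Spark itself, `sc.parallelize(nums).stats()` returns a StatCounter holding all of these at once, which is cheaper than calling `mean()`, `stdev()`, etc. separately.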
Can RDDs be shared across applications?
No. An RDD can be shared across the nodes of the cluster within one application, but not across applications.
What is the difference between reduce and fold?
Both combine the elements of an RDD with a binary function, but fold() additionally takes a zero (neutral) value that is used as the initial accumulator for each partition, so it also works on empty partitions; reduce() has no zero value and throws an exception on an empty RDD.
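A pure-Python sketch of the difference (fold takes an explicit zero value, reduce does not; this illustrates the semantics, not the Spark API):

```python
from functools import reduce

nums = [1, 2, 3]

# reduce: combines elements pairwise; fails on an empty collection.
total = reduce(lambda a, b: a + b, nums)  # 6

# fold: like reduce, but starts from a zero value, so an empty
# collection is safe. (In Spark the zero is also applied once per
# partition, so it must be a neutral element such as 0 for addition.)
def fold(zero, op, xs):
    acc = zero
    for x in xs:
        acc = op(acc, x)
    return acc

print(fold(0, lambda a, b: a + b, nums))  # 6
print(fold(0, lambda a, b: a + b, []))    # 0 -- reduce would raise here
```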
What is SBT?
SBT is an open-source build tool for Scala and Java projects, similar to Java’s Maven.
What is the use of Kryo serialization in Spark?
By default, Spark uses Java serialization. Kryo is a faster and more compact serialization framework for Java that Spark can use instead of Java serialization, and it is highly recommended when shuffling or caching large amounts of data.
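Kryo is typically enabled through Spark configuration, for example in spark-defaults.conf (a minimal fragment; additional tuning keys such as class registration are optional):

```
spark.serializer  org.apache.spark.serializer.KryoSerializer
```

The same setting can be applied programmatically on a SparkConf before the SparkContext is created.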
What is the SizeEstimator tool?
The Java heap is the amount of memory allocated to applications running in the JVM. SizeEstimator is a utility that estimates the size of objects on the Java heap; it is useful for deciding how to partition data.
What is the pipe operator?
The pipe operator lets you process RDD data with external applications. After creating an RDD, the developer can pipe it through a shell script, giving external programs access to the RDD’s data.
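A pure-Python sketch of what `rdd.pipe("tr a-z A-Z")` does — each partition’s elements are written to the external command’s stdin, one per line, and the command’s stdout lines become the new elements (this simulates the behavior with `subprocess`; it is not the Spark API, and it assumes a POSIX shell with `tr` available):

```python
import subprocess

def pipe(partition, command):
    """Feed one partition's elements to an external command, one per
    line, and return the command's output lines as the new elements."""
    proc = subprocess.run(
        command,
        input="\n".join(partition) + "\n",
        capture_output=True, text=True, shell=True, check=True,
    )
    return proc.stdout.splitlines()

print(pipe(["spark", "rdd"], "tr a-z A-Z"))  # ['SPARK', 'RDD']
```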
What are executors?
Spark sends the application code to the executors via the SparkContext, which then sends tasks to the executors to run computations and store data for the application. Each application has its own executors.
What is the functionality of the SparkContext object?
Every Spark application consists of a driver program and executors that run on the cluster; a JAR containing the application code is shipped out for processing. The SparkContext object coordinates these processes.
What are DStreams?
A DStream (discretized stream) is a sequence of RDDs; it is a high-level Spark Streaming abstraction. It creates a continuous stream of data from sources such as Kafka or Flume and divides it into a series of batches.
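A pure-Python sketch of the micro-batch idea behind DStreams — an unbounded source is cut into fixed-size batches, and each batch is then handled as one RDD (an illustration of the concept; real Spark Streaming batches by time interval, not by count):

```python
def batches(stream, batch_size):
    """Group an incoming stream of records into fixed-size batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch          # in Spark, each batch becomes an RDD
            batch = []
    if batch:
        yield batch              # flush the final partial batch

print(list(batches(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```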