What is Apache Spark, RDD, DataFrame, DataSet, Action and Transformation


What is Spark?

Spark is a high-performance, general-purpose distributed processing system. It is used with distributed storage such as HDFS and S3, and with cluster managers such as Hadoop YARN.

HDFS(Hadoop Distributed File System)とは - sambaiz-net

How Hadoop YARN allocates resources to applications and check how much resources are allocated - sambaiz-net

It can process data faster than Hadoop’s MapReduce by keeping intermediate data in memory instead of writing it to disk. There are APIs for Java, Scala, Python, and R. While Python is easy to write, performance can suffer due to serialization between the Python process and the JVM.

Develop Spark Applications in Scala, deploy with GitHub Actions, and perform remote debugging on EMR - sambaiz-net

RDD, DataFrame and DataSet

RDD (Resilient Distributed Dataset), the low-level interface of Spark Core, is an immutable, typed, distributed collection.

DataFrame is Spark SQL’s table-like data structure. Its performance is better than calling the RDD API yourself because the byte-level optimizer called Tungsten and the query optimizer called Catalyst work on it.

Dataset is a typed DataFrame, and since Spark 2.0, DataFrame is an alias for Dataset[Row] in the Scala API.

Action and Transformation

An Action is an operation with side effects, such as writing output or returning results to the driver, while a Transformation just returns a new RDD. Transformations are not executed immediately; when an Action is executed, the required dependencies are lazily evaluated based on a DAG (Directed Acyclic Graph). The DAG can be visualized on the Web UI.

Spark Web UI: Monitor Job Stages, Tasks distribution and SQL plan - sambaiz-net

Even if a calculation result is lost due to a network or node failure, it can be recovered quickly by tracing the dependencies and recomputing the lost partitions in parallel.

narrow dependency and wide dependency

Transformations are executed in parallel by the Executors, one task per RDD partition.

Dependencies of operations such as map and filter, where each output partition reads a single input partition, are called “narrow dependencies”. Since such processing can be completed by one Executor, the operations can be pipelined together, which is fast.

On the other hand, dependencies of operations such as groupByKey, where an output partition reads multiple input partitions, are called “wide dependencies”, and if an Executor does not have the required partitions, a high-cost shuffle occurs to move data across the network.

To avoid a shuffle when joining with a DataFrame smaller than spark.sql.autoBroadcastJoinThreshold, a Broadcast Hash Join, which sends the small DataFrame to all Executors, is performed instead. An execution unit that does not require a shuffle is called a Stage.


High Performance Spark - O’Reilly Media

RDD vs DataFrames and Datasets: A Tale of Three Apache Spark APIs

Stage — Physical Unit Of Execution · Mastering Apache Spark