What is Spark?
Spark is a high-performance, general-purpose distributed processing system. It is used with distributed storage such as HDFS and S3, and with cluster managers such as Hadoop YARN.
What is HDFS (Hadoop Distributed File System) - sambaiz-net
It can process data faster than Hadoop's MapReduce by keeping intermediate data in memory. There are APIs for Java, Scala, Python, and R. While Python is easy to write, its performance suffers due to the interaction with the JVM.
RDD, DataFrame and DataSet
RDD (Resilient Distributed Dataset), the low-level interface of Spark Core, is a typed, immutable distributed collection.
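For example, an RDD can be built from a local collection and transformed element by element. The following is a minimal sketch assuming a SparkContext named sc, such as the one spark-shell provides.

```scala
val rdd = sc.parallelize(Seq(1, 2, 3))   // RDD[Int]: the element type is known
val doubled = rdd.map(_ * 2)             // RDDs are immutable: map returns a new RDD
println(doubled.collect().mkString(", ")) // 2, 4, 6
```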
DataFrame is Spark SQL's table-like data format, whose performance is better than calling the RDD API yourself because the byte-level optimization called Tungsten and the query optimizer called Catalyst do the work.
DataSet is a typed DataFrame, and since Spark 2.0, DataFrame is an alias for Dataset[Row].
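A short sketch of the difference, assuming a SparkSession named spark with its implicits imported (as in spark-shell):

```scala
import spark.implicits._

case class User(name: String, age: Int)

val ds = Seq(User("alice", 30), User("bob", 25)).toDS() // Dataset[User]
val df = ds.toDF()                                      // DataFrame = Dataset[Row]

ds.filter(_.age > 26).show()  // typed: the field access is checked at compile time
df.filter($"age" > 26).show() // untyped: the column name is resolved at runtime
```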
Action and Transformation
An Action is a process with side effects, such as producing output, while a Transformation just returns a new RDD. Transformations are not executed immediately; when an Action runs, the required dependencies are lazily executed based on a DAG (Directed Acyclic Graph). The DAG can be visualized on the Web UI.
Spark Web UI: Monitor Job Stages, Tasks distribution and SQL plan - sambaiz-net
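A minimal sketch of this lazy evaluation, again assuming a SparkContext named sc:

```scala
val rdd = sc.parallelize(1 to 10)
val squared = rdd.map(n => n * n)      // transformation: nothing runs yet
val even = squared.filter(_ % 2 == 0)  // transformation: still nothing runs
println(even.count())                  // action: the whole chain executes now, prints 5
```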
Even if a calculation result is lost due to a network or node failure, it can be recovered quickly by tracing the dependencies and recalculating in parallel.
narrow dependency and wide dependency
Transformations are executed in parallel by each Executor on the RDD's partitions.
Dependencies of processes such as map and filter that read a single partition are called "narrow dependencies"; since the processing can be completed by one Executor, it can be executed collectively in one pipeline, which is fast.
On the other hand, dependencies of processes such as groupByKey that read multiple partitions are called "wide dependencies", and if an Executor does not have the required partitions, a high-cost shuffle occurs.
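A small sketch of both kinds of dependency, assuming a SparkContext named sc:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// map and filter each read a single partition (narrow dependencies),
// so they are pipelined together on one Executor.
val scaled = pairs.map { case (k, v) => (k, v * 10) }.filter(_._2 > 10)

// groupByKey reads multiple partitions (a wide dependency),
// so it triggers a shuffle.
val grouped = scaled.groupByKey()
println(grouped.toDebugString) // the lineage shows a ShuffledRDD at the boundary
```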
To avoid a shuffle when joining a DataFrame smaller than spark.sql.autoBroadcastJoinThreshold, a Broadcast Hash Join, which sends that DataFrame to all Executors, occurs instead.
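The broadcast hint can also force this explicitly, regardless of the threshold (10MB by default). A sketch assuming a SparkSession named spark:

```scala
import org.apache.spark.sql.functions.broadcast

val large = spark.range(1000000).withColumnRenamed("id", "key")
val small = spark.range(100).withColumnRenamed("id", "key")

// broadcast() hints that small should be sent to all Executors.
val joined = large.join(broadcast(small), "key")
joined.explain() // the physical plan shows BroadcastHashJoin instead of a shuffle join
```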
An execution unit that does not require a shuffle is called a Stage.
References
High Performance Spark - O’Reilly Media
RDD vs DataFrames and Datasets: A Tale of Three Apache Spark APIs