K-means clustering with Spark MLlib

Make EMR clusters' scale-in faster with Task nodes

Prevent garbled Japanese text when calling DataFrame.toPandas().plot() in an Athena for Apache Spark notebook

Register an EKS cluster launched with CDK to EMR on EKS and run Spark jobs

Develop Spark Applications in Scala, deploy with GitHub Actions, and perform remote debugging on EMR

Build Spark and debug it remotely in IntelliJ

An example of extra partitions being read during a Spark SQL JOIN, and Dynamic Partition Pruning (DPP)

Aggregate logs of Spark running on an EMR cluster with Fluent Bit

Settings for running Spark on EMR

Launch an EMR cluster with AWS CLI and run Spark applications

Redshift Serverless and other serverless ETL services: running queries with Glue Data Catalog

Treat Spark struct as map to expand to multiple rows with explode

Spark Web UI: Monitor Job Stages, Task distribution and SQL plan


Athena (Presto) and Glue (Spark) can return different values when running the same query

Enable AWS Glue Job Bookmarks to resume processing from the records after those processed previously

What are Apache Spark, RDD, DataFrame, Dataset, Action and Transformation

Transform CSV with AWS Glue, convert it to Parquet, partition it, and query it from Athena

Launch a Hive execution environment with the Cloudera Docker image and run queries against JSON logs