Spark

Querying Snowflake with the Spark Connector

Creating Iceberg Tables in S3 Tables from EMR Serverless, inserting data, and querying from Athena

Running Spark MLlib on EMR Serverless from EMR Studio's Jupyter Notebook

Register Iceberg Tables in Glue Data Catalog to query from Athena and Snowflake

Walk through Iceberg metadata contents by creating tables, modifying schema and write mode, and writing data in Spark

Avoiding OOM in count-distinct operations on massive datasets using HyperLogLog++, a probabilistic cardinality estimation algorithm

Share variables to executors using Spark's Broadcast variables and Accumulator

Call Livy's REST API to run a Spark job

Install Livy on EMR on EKS and run Spark jobs from local Jupyter notebooks with Sparkmagic

Clustering by k-means method with MLlib of Spark

Make EMR clusters' scale-in faster with Task nodes

Athena for Apache Spark の Notebook で DataFrame.toPandas().plot() した際の日本語が文字化けしないようにする

Launch an EKS cluster and register it to EMR on EKS with CDK to run Spark jobs

Develop Spark Applications in Scala, deploy with GitHub Actions, and perform remote debugging on EMR

Build Spark and debug it remotely at IntelliJ

Spark SQLのJOIN時に余分なパーティションが読まれる例とDynamic Partition Pruning (DPP)

Aggregate logs of spark running on an EMR cluster with Fluent Bit

Settings for running Spark on EMR

Launch an EMR cluster with AWS CLI and run Spark applications

Redshift Serverless and other serverless ETL services, run query with Glue Data Catalog

Treat Spark struct as map to expand to multiple rows with explode

Spark Web UI: Monitor Job Stages, Tasks distribution and SQL plan

GlueのカスタムコネクタでBigQueryに接続する

Athena (Presto) and Glue (Spark) can return different values when running the same query

Enable Job Bookmark of AWS Glue to process from the records following ones executed previously

What is Apache Spark, RDD, DataFrame, DataSet, Action and Transformation

AWS GlueでCSVを加工しParquetに変換してパーティションを切りAthenaで参照する

Launch Hive execution environment with Cloudera Docker Image and execute query to JSON log