etl

Make EMR cluster scale-in faster with task nodes

Launch an EKS cluster and register it to EMR on EKS with CDK to run Spark jobs

Retry multi-task processing with Callbacks in Airflow

Express dependencies on past tasks in Airflow

Create an Amazon Managed Workflows for Apache Airflow (MWAA) environment with CDK and run a workflow

Run Apache Airflow with Docker Compose and execute a workflow

Develop Spark Applications in Scala, deploy with GitHub Actions, and perform remote debugging on EMR

Build Spark and debug it remotely in IntelliJ

An example of extra partitions being read in Spark SQL JOINs, and Dynamic Partition Pruning (DPP)

Why Athena v2 can fail to query map columns in Parquet source tables

Settings for running Spark on EMR

Launch an EMR cluster with AWS CLI and run Spark applications

Settings for querying tables of other accounts with Athena

Implement Athena's data source connectors and user-defined functions (UDFs)

Compare Redshift Serverless and Athena performance with TPC-DS queries

Generate data with TPC-DS Connector for Glue

Redshift Serverless and other serverless ETL services: running queries with the Glue Data Catalog

Generate data with TPC-DS Connector in Athena's Federated Query

Treat a Spark struct as a map to expand it into multiple rows with explode

Spark Web UI: monitor job stages, task distribution, and SQL plans

Create a Glue DataBrew Project for visualizing and analyzing data and a Job for machine learning preprocessing with CDK

Connect to BigQuery with a Glue custom connector

Athena (Presto) and Glue (Spark) can return different values when running the same query

Create Glue Data Catalog Databases, Tables, Partitions, and Crawlers with CDK

Deploy and run PyFlink code on Kinesis Data Analytics with CDK

Enable AWS Glue Job Bookmarks to process only records added since the previous run

Aggregate with window functions in Athena (Presto)

Run Athena queries from Go

Kinesis Data Analytics SQL, output to Lambda, and resource creation with CDK

Open a Jupyter Notebook in a VSCode Docker dev container to run and visualize Athena queries

What are Apache Spark, RDD, DataFrame, Dataset, Actions, and Transformations

Process CSV with AWS Glue, convert it to Parquet, partition it, and query it from Athena

Built athena-admin, a tool for Athena migrations and partitioning

Launch a Hive execution environment with the Cloudera Docker image and query JSON logs

JOIN logs with Norikra

Aggregate streaming logs in real time with Norikra and Fluentd

Send logs to Kinesis Streams with fluentd, read them with Lambda, and store them in S3

Try Kinesis Streams/Firehose/Analytics