Currently, I don’t have the best practice to manage Iceberg tables in AWS. This is because there’s an issue where Iceberg tables updated with CDK (CloudFormation) are no longer treated as Iceberg tables. Creating them from Athena Notebooks with Spark is the easiest approach, but I was looking for a method that would be more suitable for operations from a review and deployment perspective, and I reached to dbt-athena, the official dbt adapter for Athena.
Register Iceberg Tables in Glue Data Catalog to query from Athena and Snowflake - sambaiz-net
Creating new tables in BigQuery by processing data with SQL using dbt - sambaiz-net
$ python -m pip install dbt-core dbt-athena
$ dbt --version
Core:
- installed: 1.10.4
- latest: 1.10.4 - Up to date!
Plugins:
- athena: 1.9.4 - Up to date!
$ cat ~/.dbt/profiles.yml
athenatest:
outputs:
dev:
database: awsdatacatalog
region_name: ap-northeast-1
s3_data_dir: s3://****/data
s3_staging_dir: s3://****/staging
schema: dbtathenatest
threads: 1
type: athena
target: dev
While dbt normally performs record insertion, you can perform only table creation/update by appending 0 rows. The default on-schema-change is ignore, so added columns won’t be reflected unless configured. Structs in Trino are ROW, not STRUCT. They are registered as STRUCT in DataCatalog.
{{ config(
materialized = 'incremental',
incremental_strategy = 'append',
table_type = 'iceberg',
format = 'parquet',
partitioned_by = ['id'],
on_schema_change = 'append_new_columns'
) }}
select
CAST(NULL AS INTEGER) AS id,
CAST(NULL AS ROW(a INTEGER, b VARCHAR)) AS value
where false
The default location is {data_dir}/{schema}/{model}/{uuid} as defined here.
# create a table
$ dbt run --debug
....
create table "awsdatacatalog"."dbtathenatest"."my_first_dbt_model"
with (
table_type='iceberg',
is_external=false,location='****/data/dbtathenatest/my_first_dbt_model/985181f4-f9d6-4486-9063-697ca8f510c5',
partitioning=ARRAY['id'],
format='parquet'
)
as
...
# add a column
$ dbt run --debug
...
In "awsdatacatalog"."dbtathenatest"."my_first_dbt_model":
Schema changed: True
Source columns not in target: [AthenaColumn(column='b', dtype='string', char_size=None, numeric_precision=None, numeric_scale=None, table_type=<TableType.ICEBERG: 'iceberg_table'>)]
Target columns not in source: []
New column types: []
On model.dbtathenatest.my_first_dbt_model: alter table `dbtathenatest`.`my_first_dbt_model`
add columns (b string)
...
However, there are constraints originating from Athena - only specific tblproperties can be set, and partition changes are not reflected.
{{ config(
...
tblproperties = {
'write.update.mode': 'merge-on-read'
}
) }}
// An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: Table properties [write.update.mode] are not supported.
By the way, similar to BigQuery, insert_overwrite cannot be set as an incremental_strategy.
Create BigQuery Tables for Apache Iceberg using dbt and Read them from Snowflake - sambaiz-net
Invalid incremental strategy provided: insert_overwrite
Incremental models on Iceberg tables only work with 'append', 'microbatch' or 'merge' (v3 only) strategy.
Microbatch model 'my_first_dbt_model' must provide an 'event_time' (string) config that indicates the name of the event time column.