Creating and updating only the schema of Iceberg tables with dbt-athena

icebergaws

Currently, I don’t have the best practice to manage Iceberg tables in AWS. This is because there’s an issue where Iceberg tables updated with CDK (CloudFormation) are no longer treated as Iceberg tables. Creating them from Athena Notebooks with Spark is the easiest approach, but I was looking for a method that would be more suitable for operations from a review and deployment perspective, and I reached to dbt-athena, the official dbt adapter for Athena.

Register Iceberg Tables in Glue Data Catalog to query from Athena and Snowflake - sambaiz-net

Creating new tables in BigQuery by processing data with SQL using dbt - sambaiz-net

$ python -m pip install dbt-core dbt-athena
$ dbt --version
Core:
  - installed: 1.10.4
  - latest:    1.10.4 - Up to date!

Plugins:
  - athena: 1.9.4 - Up to date!
  
$ cat ~/.dbt/profiles.yml 
athenatest:
  outputs:
    dev:
      database: awsdatacatalog
      region_name: ap-northeast-1
      s3_data_dir: s3://****/data
      s3_staging_dir: s3://****/staging
      schema: dbtathenatest
      threads: 1
      type: athena
  target: dev

While dbt normally performs record insertion, you can perform only table creation/update by appending 0 rows. The default on-schema-change is ignore, so added columns won’t be reflected unless configured. Structs in Trino are ROW, not STRUCT. They are registered as STRUCT in DataCatalog.

{{ config(
    materialized = 'incremental',
    incremental_strategy = 'append',
    table_type = 'iceberg',
    format = 'parquet',
    partitioned_by = ['id'],
    on_schema_change = 'append_new_columns'
) }}

select
    CAST(NULL AS INTEGER) AS id,
    CAST(NULL AS ROW(a INTEGER, b VARCHAR)) AS value
where false

The default location is {data_dir}/{schema}/{model}/{uuid} as defined here.

# create a table
$ dbt run --debug
....
create table "awsdatacatalog"."dbtathenatest"."my_first_dbt_model"
    with (
      table_type='iceberg',
      is_external=false,location='****/data/dbtathenatest/my_first_dbt_model/985181f4-f9d6-4486-9063-697ca8f510c5',
      partitioning=ARRAY['id'],
      format='parquet'
    )
    as
    ...

# add a column
$ dbt run --debug
...
In "awsdatacatalog"."dbtathenatest"."my_first_dbt_model":
        Schema changed: True
        Source columns not in target: [AthenaColumn(column='b', dtype='string', char_size=None, numeric_precision=None, numeric_scale=None, table_type=<TableType.ICEBERG: 'iceberg_table'>)]
        Target columns not in source: []
        New column types: []

On model.dbtathenatest.my_first_dbt_model: alter table `dbtathenatest`.`my_first_dbt_model`
          add columns (b string)
...

However, there are constraints originating from Athena - only specific tblproperties can be set, and partition changes are not reflected.

{{ config(
    ...
    tblproperties = {
        'write.update.mode': 'merge-on-read'
    }
) }}

// An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: Table properties [write.update.mode] are not supported.

By the way, similar to BigQuery, insert_overwrite cannot be set as an incremental_strategy.

Create BigQuery Tables for Apache Iceberg using dbt and Read them from Snowflake - sambaiz-net

Invalid incremental strategy provided: insert_overwrite
        Incremental models on Iceberg tables only work with 'append', 'microbatch' or 'merge' (v3 only) strategy.
Microbatch model 'my_first_dbt_model' must provide an 'event_time' (string) config that indicates the name of the event time column.