Understanding DuckDB Hive Partition Pruning from Source Code
databaseThis article examines how pruning is performed when reading Hive-formatted partitioned tables in DuckDB.
Retrieve Google Sheets data using SQL with DuckDB’s Go client - sambaiz-net
$ duckdb --version
v1.4.4 (Andium) 6ddac802ff
With hive_partitioning=true, only the files for the specified partition are read.
$ duckdb
D COPY (SELECT i, i % 12 + 1 AS month FROM range(1000000) t(i))
TO 'data' (FORMAT PARQUET, PARTITION_BY (month));
D EXPLAIN ANALYZE
SELECT * FROM read_parquet('data/**/*.parquet', hive_partitioning=true)
WHERE month = 1;
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││ Total Time: 0.0034s ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
...
┌───────────────────────────┐
│ TABLE_SCAN │
│ ──────────────────── │
│ Function: │
│ READ_PARQUET │
│ │
│ File Filters: │
│ (month = 1) │
│ │
│ Scanning Files: 1/12 │
│ Total Files Read: 1 │
└───────────────────────────┘
In DuckDB’s query execution, the Parser first parses the SQL string, then the Binder resolves table and column references and determines their types using the catalog. From there, a Logical Plan consisting of operations such as scans and filters is created, and the Optimizer performs optimizations such as pushing down filters to be applied at scan time. The plan is then converted to a Physical Plan and executed. Pruning is performed in the Optimizer, though some mechanisms like Join Filter Pushdown are applied dynamically at execution time.
The Binder receives the AST generated by the Parser and analyzes the table function arguments. During this process, HivePartitioning::Parse() extracts key=value partition information from file paths. During the Optimizer’s filter pushdown, the table function’s pushdown_complex_filter callback is invoked, and within it HivePartitioning::ApplyFiltersToFileList() extracts the partition value from each file’s path, substitutes the column in the filter expression WHERE month = 1 with the partition value, and excludes the file if the result is false like 2 = 1.
In queries with JOIN, if the JOIN condition includes the partition column, pruning works on both tables even when filtering by only one table’s column. This is because FilterCombiner groups columns with equal values and treats them the same. From ON t1.month = t2.month and WHERE t1.month = 1, month = 1 is applied to both tables.
D SELECT *
FROM read_parquet('data1/**/*.parquet', hive_partitioning=true) t1
JOIN read_parquet('data2/**/*.parquet', hive_partitioning=true) t2 ON t1.month = t2.month
WHERE t1.month = 1;
Subqueries are converted to JOIN by the Optimizer, and Join Filter Pushdown is applied, which aggregates min/max from the subquery results at execution time and skips files outside that range. Note that in BigQuery, dynamic expressions like subqueries cause all partitions to be scanned.
D SELECT *
FROM read_parquet('data/**/*.parquet', hive_partitioning=true) t1
WHERE t1.month = (SELECT month FROM other_table WHERE key = 1);