Athena’s Federated Query is a feature for running queries against non-S3 data sources such as DynamoDB and RDS through a Lambda function that acts as a data source connector.

This article uses the TPC-DS connector from the official AWS repository. It generates data for TPC-DS, a database benchmark for decision support systems.

Although it lives in the official repository, it is a custom connector, so you need to build it yourself. You can mostly follow the README, but the build fails on JDK 16, so install JDK 8.

$ brew tap homebrew/cask-versions
$ brew install --cask corretto8
$ export JAVA_HOME=`/usr/libexec/java_home -v 1.8`
$ export PATH=${JAVA_HOME}/bin:${PATH}
$ java -version
openjdk version "1.8.0_312"
OpenJDK Runtime Environment Corretto-8.312.07.1 (build 1.8.0_312-b07)
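OpenJDK 64-Bit Server VM Corretto-8.312.07.1 (build 25.312-b07, mixed mode)

Then build and publish the connector. The publish script under ../tools/ takes a region as a parameter, but if the region is not also set in .aws/config or an equivalent place, the AWS SDK cannot read it and publishing fails.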

$ git clone -b v2021.51.1 --depth 1 
$ cd aws-athena-query-federation/
$ mvn clean install -DskipTests=true
$ cd athena-tpcds/
$ mvn clean install -DskipTests=true

# export AWS_REGION=us-west-2
$ ../tools/ <BUCKET_NAME> athena-tpcds us-west-2
Do you wish to proceed? (yes or no) yes
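
The region pitfall can be caught before publishing with a small guard. This is a minimal sketch: the function name require_region is hypothetical, and it only inspects the AWS_REGION and AWS_DEFAULT_REGION environment variables, so it assumes you rely on the environment rather than on .aws/config.

```shell
# Hypothetical helper: print the region found in the environment, or fail.
# It does not read .aws/config; it only checks the two environment variables.
require_region() {
  region="${AWS_REGION:-${AWS_DEFAULT_REGION:-}}"
  if [ -z "$region" ]; then
    echo "No region set: export AWS_REGION before publishing" >&2
    return 1
  fi
  printf '%s\n' "$region"
}
```

Calling require_region just before the publish step stops early with a readable message instead of failing inside the script.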

When publishing completes, the connector is registered as a private application in the Serverless Application Repository, so create the Lambda function from it.

If you execute a query like select * from "lambda:tpcds_catalog".tpcds1.customer limit 100, the Lambda function runs in the backend and returns the generated data. The number in tpcds1 is the scale factor; at scale factor 1 the dataset is 1 GB in total.
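
The schema naming can be illustrated with a tiny helper that extracts the scale factor from a schema name. This is only a sketch of the naming convention described here; the function name scale_factor is hypothetical and is not part of the connector.

```shell
# Hypothetical helper: print the scale factor encoded in a schema name
# such as tpcds1 or tpcds250. Fails for names that do not match.
scale_factor() {
  n="${1#tpcds}"                 # strip the "tpcds" prefix if present
  [ "$n" != "$1" ] || return 1   # prefix was missing
  case "$n" in
    ''|*[!0-9]*) return 1 ;;     # remainder must be digits only
  esac
  printf '%s\n' "$n"
}
```

For example, scale_factor tpcds250 prints 250, which by the same convention corresponds to roughly 250 GB in total.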

The query results can be saved to S3 with UNLOAD.

UNLOAD (select * from "lambda:tpcds_catalog".tpcds250.catalog_sales) 
TO 's3://<BUCKET_NAME>/tpcds_data/' 
WITH ( format = 'JSON', compression = 'gzip')

So I tried to UNLOAD all the tables of tpcds250, but the query timed out at catalog_sales even after I increased the Lambda memory to the maximum of 10 GB. The connector is therefore not suitable for generating files for a huge dataset.
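For smaller scale factors, unloading table by table can be scripted. The following is a minimal sketch that only prints one UNLOAD statement per table, in the same form as the statement above. The function name unload_sql, the per-table prefix layout under tpcds_data/, and the bucket name my-bucket are assumptions for illustration. A separate prefix per table is used because UNLOAD expects an empty destination.

```shell
# Hypothetical helper: print an UNLOAD statement for one table.
# Arguments: schema (e.g. tpcds1), table (e.g. customer), bucket name.
unload_sql() {
  schema="$1"; table="$2"; bucket="$3"
  cat <<EOF
UNLOAD (select * from "lambda:tpcds_catalog".${schema}.${table})
TO 's3://${bucket}/tpcds_data/${schema}/${table}/'
WITH (format = 'JSON', compression = 'gzip')
EOF
}

# Print statements for a few tables (partial list, for illustration only).
print_unloads() {
  for t in customer catalog_sales; do
    unload_sql "$1" "$t" "$2"
    echo
  done
}
```

Running print_unloads tpcds1 my-bucket prints the statements, which can then be pasted into the Athena console or submitted with a client of your choice.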

