Redshift Serverless and other serverless ETL services, run query with Glue Data Catalog

2021-12-26 aws etl spark presto

Redshift Serverless

Redshift Serverless is a new feature that can use Redshift, a petabyte-scale DWH without launching an instance, announced at this year’s re:Invent. It is available existing features such as Redshift Spectrum, refer to S3 directly, Federated Query to RDS, and Redshift ML. I’m happy with this update as it is costly to keep the instance running for occasional usage such as analytics.

The cost is charged for at least 1 minute RPU time and storage. RPU is a resource unit that contains 2vCPU and 16GiB memory and scales automatically. The rate is $0.45/RPU hour in Oregon, whereas Tokyo is set relatively high at $0.70/RPU hour. Since dc2.large (2vCPU, 15 GiB memory) instance is $0.314/hour on-demand in Tokyo, if the resource usage is not over 40% on average due to like infrequent use or irregularity, it seems that the cost can be suppressed. If RI is applied to the maximum, the rate becomes $0.110/hour so the threshold drop to 15%, but there is also an operational advantage that you do not have to worry about the scale. In addition, there is no charge for the load amount of Redshift Spectrum.

(PS: 2022-07-13) When it became GA, the price dropped to $0.36/RPU hour in Oregon and $0.36/RPU hour in Tokyo.

Other serverless ETL services

There are other services that can perform queries on serverless such as Athena, a Presto’s managed service, and Glue, a Spark’s managed service. The former can easily execute queries and charge by load amount is easy to understand, but there are limit of execution time and concurrent execution, and the latter can perform heavy aggregation, but it is not very suitable for ad hoc execution because it is necessary to create a job.

Since both of them supports Hive Metastore so can refer the schema of Glue Data Catalog, the Hive Metastore compatible service, they can be used properly according to the purpose, but there are differences in the syntax that is available, and even if the same query can be executed, different results may be returned, so be careful when it ports.

Athena (Presto) and Glue (Spark) can return different values when running the same query - sambaiz-net

There are connectors for not only RDS but DynamoDB and Redis etc., and other data sources can be connected by using a custom connector.

Generate data with TPC-DS Connector in Athena’s Federated Query - sambaiz-net

GlueのカスタムコネクタでBigQueryに接続する - sambaiz-net

Redshift, Glue and Athena with Glue Data Catalog

There is also EMR Serverless announced at re:Invent this year. If the data size is too large, OOM or “No space left on device” can occur due to heavy transformation or repartition(), but Glue cannot be scaled up, so it may be physically difficult to resolve the problem. In this case, EMR may be suitable.

Launch an EMR cluster with AWS CLI and run Spark applications - sambaiz-net

In third party services, there is Snowflake, SaaS hosted on each cloud. I often hear some cases of migrating from Redshift to Snowflake due to cost and ease of scale. However, now that Redshift Serverless is now available, it may affect technology selection. The cost is charged for Snowflake Credit and storage. The rate depends on the plan.

BigQuery Omni that the query engine runs on the multi-cloud with Anthos became GA in October. Supported regions are few yet, and the cost is not pay-as-you-go so I don’t image to use same as other services, but it’s a long-awaited feature that executes BigQuery queries on AWS without transferring data.

Query the table in Glue Data Catalog

Create a database and table in Data Catalog and a role to access it with CDK.

CDKでGlue Data CatalogのDatabase,Table,Partition,Crawlerを作成する - sambaiz-net

CDK v2 has become stable now as announced at re:Invent, so I write it in v2. The package @aws-cdk/aws-xxx for each service has been merged into aws-cdk-lib, but alpha features are isolated under the name that the previous one with -alpha suffix.

$ npx [email protected] init app --language typescript
$ npm install --save aws-cdk-lib
$ npm install --save @aws-cdk/aws-glue-alpha

The role used by Redshift Serverless is required to contain redshift.amazonaws.com and redshift-serverless.amazonaws.com principal.

const db = new Database(this, 'Database', {
  databaseName: "test_db",
})

const bucket = new Bucket(this, 'Bucket', {
  bucketName: `redshift-serverless-test-${this.account}-${this.region}`
})

new BucketDeployment(this, 'DeployWebsite', {
  sources: [Source.asset('./data')],
  destinationBucket: bucket,
  destinationKeyPrefix: 'test_table/',
});

const table = new Table(this, 'TestTable', {
  tableName: "test_table",
  database: db,
  columns: [{
    name: "id",
    type: Schema.BIG_INT,
  }],
  dataFormat: DataFormat.JSON,
  bucket: bucket,
  s3Prefix: "test_table/",
  compressed: false,
})

const role = new Role(this, 'SchemaRole', {
    assumedBy: new CompositePrincipal(
      new ServicePrincipal('redshift.amazonaws.com'),
      new ServicePrincipal('redshift-serverless.amazonaws.com')
    )
  })
  role.addToPolicy(new PolicyStatement({
    resources: [
      db.databaseArn,
      db.catalogArn,
      table.tableArn
    ],
    actions: ['glue:Get*'],
  }))
  role.addToPolicy(new PolicyStatement({
    resources: [
      bucket.bucketArn,
      `${bucket.bucketArn}/*`,
    ],
    actions: ['s3:Get*', 's3:List*'],
  }))
})

Associate the role to a serverless endpoint to enable to use when creating a schema.

After clicking “Serverless” in the resource list and connecting to it, create an external schema in Redshift specifying the database in Data Catalog and the role.