Athena (Presto) and Glue (Spark) can return different values when running the same query

aws, presto, spark

AWS has multiple managed services that aggregate data with SQL-like queries. For ad-hoc queries, Presto-based Athena, which can quickly and easily query tables in Glue's Data Catalog, is handy, while for heavy batch queries, Spark-based Glue, which avoids Athena's resource and time limitations, can be a better fit. So the two can be used according to the case.

Redshift Serverless and other serverless aggregation services, run query with Glue Data Catalog - sambaiz-net

Presto executes ANSI-compatible SQL, while Spark by default executes Hive-compatible Spark SQL, so their syntaxes partially differ, but sometimes the same query can be reused. In that case, one would expect the same result to be returned, but it may vary significantly due to the following differences in behavior.

Number types

In Presto 0.198 and later, decimal literals are treated as DECIMAL by default, but in Athena engine version 2 (Presto 0.217) they become DOUBLE. Presumably parse-decimal-literals-as-double is being passed for compatibility with engine version 1 (Presto 0.172).

SELECT typeof(1.2), /* double */
       1 / 3.0 * 10000000 /* 3333333.333333333 */

Similarly, in Spark 2.3 and later, decimal literals are treated as DECIMAL, and so they are in Glue 2.0 (Spark 2.4.3). However, the DECIMAL scale (the number of digits after the decimal point) of a division result is limited to a maximum of 6.

print(spark.sql("""SELECT 1.2""").dtypes) 
# [('1.2', 'decimal(2,1)')]
print(spark.sql("""SELECT 1 / 3.0 * 10000000""").collect())
# [Row((CAST((CAST(CAST(1 AS DECIMAL(1,0)) AS DECIMAL(2,1)) / CAST(3.0 AS DECIMAL(2,1))) AS DECIMAL(14,6)) * CAST(CAST(10000000 AS DECIMAL(8,0)) AS DECIMAL(14,6)))=Decimal('3333330.000000'))]
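The loss of precision can be reproduced with a plain-Python sketch using the stdlib decimal module (an analogy, not Spark's actual implementation): capping the quotient at scale 6 before multiplying yields Spark's 3333330, while double arithmetic yields Athena's result.

```python
from decimal import Decimal, ROUND_HALF_UP

# Athena-like (DOUBLE): full double-precision arithmetic
double_result = 1 / 3.0 * 10000000

# Glue/Spark-like (DECIMAL): the quotient is kept at scale 6,
# so the rounding error is amplified by the multiplication
quotient = (Decimal(1) / Decimal("3.0")).quantize(
    Decimal("0.000001"), rounding=ROUND_HALF_UP
)
decimal_result = quotient * 10000000  # Decimal('3333330.000000')

print(double_result, decimal_result)
```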

In Presto, if an integer is divided by an integer, the result is truncated without being cast, but in Spark the operands are cast to DOUBLE. Also, if an integer is divided by a decimal literal, the result is DOUBLE in Presto and DECIMAL in Spark, so the precision differs.

SELECT 1 / 3, /* 0 */
       1 / 3.0 /* 0.3333333333333333 */

print(spark.sql("SELECT 1 / 3").collect())
#[Row((CAST(1 AS DOUBLE) / CAST(3 AS DOUBLE))=0.3333333333333333)]   
print(spark.sql("SELECT 1 / 3.0").collect())
# [Row((CAST(CAST(1 AS DECIMAL(1,0)) AS DECIMAL(2,1)) / CAST(3.0 AS DECIMAL(2,1)))=Decimal('0.333333'))]
print(spark.sql("SELECT 1D / 3.0").collect())
# [Row((1.0 / CAST(3.0 AS DOUBLE))=0.3333333333333333)] 
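As a rough analogy in plain Python (not the engines themselves), Presto's integer division behaves like floor division for non-negative operands, where truncation and floor coincide, while Spark's behaves like true division:

```python
# Presto-like: integer / integer is truncated
# (// matches truncation only for non-negative operands)
print(1 // 3)   # 0

# Spark-like: operands are implicitly cast to double
print(1 / 3)    # 0.3333333333333333
```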

Cast to integer

In Athena, decimals are cast to integers with rounding, while in Glue the fractional part is truncated.

SELECT CAST(1.49 AS INTEGER), /* 1 */
       CAST(1.50 AS INTEGER)  /* 2 */

print(spark.sql("""SELECT CAST(1.49 AS INTEGER)""").collect()[0][0])
# 1
print(spark.sql("""SELECT CAST(1.50 AS INTEGER)""").collect()[0][0])
# 1
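The two behaviors can be mimicked with Python's decimal module (a sketch of the rounding modes, not the engines' code): ROUND_HALF_UP corresponds to Athena's cast, and truncation via int() to Glue's.

```python
from decimal import Decimal, ROUND_HALF_UP

def athena_like_cast(x: str) -> int:
    # Athena-like: round to the nearest integer (half up)
    return int(Decimal(x).quantize(Decimal("1"), rounding=ROUND_HALF_UP))

def glue_like_cast(x: str) -> int:
    # Glue/Spark-like: int() on a Decimal truncates toward zero
    return int(Decimal(x))

print(athena_like_cast("1.50"), glue_like_cast("1.50"))  # 2 1
```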

Array indexes

Athena’s array indexes are 1-based, while Spark’s are 0-based.

SELECT split('aaa,bbb,ccc,ddd', ',')[1] /* aaa */

print(spark.sql("SELECT split('aaa,bbb,ccc,ddd', ',')[1] as v").collect()[0]['v']) 
# bbb
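Spark's 0-based behavior matches ordinary Python indexing; the off-by-one shift against Athena is easy to miss because both queries run without error:

```python
parts = "aaa,bbb,ccc,ddd".split(",")

# Spark-like (0-based): [1] is the second element
print(parts[1])  # bbb

# Athena-like (1-based): [1] is the first element,
# which corresponds to parts[0] here
print(parts[0])  # aaa
```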