PyAthena Returns NULL When Querying Array Columns Containing Structs

awspython

PyAthena returns NULL when querying columns like array<struct<..., array<...>>>.

from pyathena import connect # Version: 3.27.1

conn = connect(region_name="ap-northeast-1", work_group="primary")
cursor = conn.cursor()

cursor.execute("""
SELECT ARRAY[
    CAST(ROW(1, ARRAY['a', 'b']) AS ROW(id INTEGER, tags ARRAY(VARCHAR)))
] AS col
""")

print(cursor.fetchone())  # (None,)

Looking at the implementation, nested arrays like [{id=1, tags=[a, b]}] that cannot be directly parsed by json.loads() are considered too complex and return None instead of being parsed.

Using CAST AS JSON returns [{"id":1,"tags":["a","b"]}] which is parsed by _to_json(), so the value can be retrieved.

cursor.execute("""
SELECT CAST(
    ARRAY[
        CAST(ROW(1, ARRAY['a', 'b']) AS ROW(id INTEGER, tags ARRAY(VARCHAR)))
    ] AS JSON
) AS col
""")

print(cursor.fetchone())  # ('[{"id":1,"tags":["a","b"]}]',)

Alternatively, using PandasCursor with unload=True also works since it reads from Parquet output on S3.

from pyathena.pandas.cursor import PandasCursor

conn = connect(region_name="ap-northeast-1", s3_staging_dir="s3://your-bucket/")
cursor = conn.cursor(PandasCursor, unload=True)

df = cursor.execute("""
SELECT ARRAY[
    CAST(ROW(1, ARRAY['a', 'b']) AS ROW(id INTEGER, tags ARRAY(VARCHAR)))
] AS col
""").as_pandas()

print(df['col'].iloc[0])  # [{'id': 1, 'tags': array(['a', 'b'], dtype=object)}]