PyAthena returns NULL when querying columns like array<struct<..., array<...>>>.
from pyathena import connect # Version: 3.27.1
conn = connect(region_name="ap-northeast-1", work_group="primary")
cursor = conn.cursor()
cursor.execute("""
SELECT ARRAY[
CAST(ROW(1, ARRAY['a', 'b']) AS ROW(id INTEGER, tags ARRAY(VARCHAR)))
] AS col
""")
print(cursor.fetchone()) # (None,)
Looking at the implementation, nested arrays like [{id=1, tags=[a, b]}] that cannot be directly parsed by json.loads() are considered too complex and return None instead of being parsed.
Using CAST AS JSON returns [{"id":1,"tags":["a","b"]}] which is parsed by _to_json(), so the value can be retrieved.
cursor.execute("""
SELECT CAST(
ARRAY[
CAST(ROW(1, ARRAY['a', 'b']) AS ROW(id INTEGER, tags ARRAY(VARCHAR)))
] AS JSON
) AS col
""")
print(cursor.fetchone()) # ('[{"id":1,"tags":["a","b"]}]',)
Alternatively, using PandasCursor with unload=True also works since it reads from Parquet output on S3.
from pyathena.pandas.cursor import PandasCursor
conn = connect(region_name="ap-northeast-1", s3_staging_dir="s3://your-bucket/")
cursor = conn.cursor(PandasCursor, unload=True)
df = cursor.execute("""
SELECT ARRAY[
CAST(ROW(1, ARRAY['a', 'b']) AS ROW(id INTEGER, tags ARRAY(VARCHAR)))
] AS col
""").as_pandas()
print(df['col'].iloc[0]) # [{'id': 1, 'tags': array(['a', 'b'], dtype=object)}]