Athena(Presto)でWindow関数を用いた集計を行う

aws

Athena(Presto)でSUM()AVG()といった集計関数にOVERを付けてWindow集計を行う。

Window Functions — Presto 0.247 Documentation

次のテストデータを使う。

$ cat test-data.csv 
{"date":"2021-02-01","user":1,"value":10}
{"date":"2021-02-01","user":2,"value":20}
{"date":"2021-02-01","user":3,"value":30} 
{"date":"2021-02-01","user":1,"value":40}
{"date":"2021-02-01","user":2,"value":50}
{"date":"2021-02-01","user":3,"value":60}
{"date":"2021-02-02","user":1,"value":100}
{"date":"2021-02-02","user":2,"value":200}
{"date":"2021-02-02","user":3,"value":300} 
{"date":"2021-02-02","user":1,"value":400}
{"date":"2021-02-02","user":2,"value":500}
{"date":"2021-02-02","user":3,"value":600}
{"date":"2021-02-03","user":1,"value":1}
{"date":"2021-02-03","user":2,"value":2}
{"date":"2021-02-03","user":3,"value":3} 
{"date":"2021-02-03","user":1,"value":4}
{"date":"2021-02-03","user":2,"value":5}
{"date":"2021-02-03","user":3,"value":6}

PARTITION BY

その値ごとに集計する。

SELECT date, user, value, SUM(value) OVER (PARTITION BY user) as sum_value 
FROM test
ORDER BY date, user;

GROUP BYしているわけではないので、行数は減らない。

"date","user","value","sum_value"
"2021-02-01","1","40","555"
"2021-02-01","1","10","555"
"2021-02-01","2","20","777"
"2021-02-01","2","50","777"
"2021-02-01","3","60","999"
"2021-02-01","3","30","999"
"2021-02-02","1","100","555"
"2021-02-02","1","400","555"
"2021-02-02","2","500","777"
"2021-02-02","2","200","777"
"2021-02-02","3","600","999"
"2021-02-02","3","300","999"
"2021-02-03","1","1","555"
"2021-02-03","1","4","555"
"2021-02-03","2","5","777"
"2021-02-03","2","2","777"
"2021-02-03","3","6","999"
"2021-02-03","3","3","999"

ORDER BY

ソートし範囲を絞って集計する。 デフォルトの集計範囲が RANGE UNBOUNDED PRECEDING (BETWEEN CURRENT ROW) なので、それまでの値の行で集計されている。

SELECT date, user, value, SUM(value) OVER (ORDER BY date) as sum_value 
FROM test 
WHERE user = 1
ORDER BY date, user;
"date","user","value","sum_value"
"2021-02-01","1","10","50"
"2021-02-01","1","40","50"
"2021-02-02","1","100","550"
"2021-02-02","1","400","550"
"2021-02-03","1","1","555"
"2021-02-03","1","4","555"

ORDER BY + ROWS

行単位で集計範囲を指定する。次の例では1行前からその行までを集計範囲としている。1 FOLLOWING でそれ以降の行も指定できる。 RANGEではdateが等しい行は同じ集計値になるが、ROWSでは行によって異なる。 それなら RANGE 1 PRECEDING のように書きたいところだが、RANGE PRECEDING is only supported with UNBOUNDED となっている。

SELECT date, user, value, SUM(value) OVER (ORDER BY date ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) as sum_value 
FROM test 
WHERE user = 1
ORDER BY date, user;
"date","user","value","sum_value"
"2021-02-01","1","10","10"
"2021-02-01","1","40","50"
"2021-02-02","1","100","140"
"2021-02-02","1","400","500"
"2021-02-03","1","1","401"
"2021-02-03","1","4","5"