Athena(Presto)でSUM()
やAVG()
といった集計関数にOVER
を付けてWindow集計を行う。
Window Functions — Presto 0.247 Documentation
次のテストデータを使う。
$ cat test-data.csv
{"date":"2021-02-01","user":1,"value":10}
{"date":"2021-02-01","user":2,"value":20}
{"date":"2021-02-01","user":3,"value":30}
{"date":"2021-02-01","user":1,"value":40}
{"date":"2021-02-01","user":2,"value":50}
{"date":"2021-02-01","user":3,"value":60}
{"date":"2021-02-02","user":1,"value":100}
{"date":"2021-02-02","user":2,"value":200}
{"date":"2021-02-02","user":3,"value":300}
{"date":"2021-02-02","user":1,"value":400}
{"date":"2021-02-02","user":2,"value":500}
{"date":"2021-02-02","user":3,"value":600}
{"date":"2021-02-03","user":1,"value":1}
{"date":"2021-02-03","user":2,"value":2}
{"date":"2021-02-03","user":3,"value":3}
{"date":"2021-02-03","user":1,"value":4}
{"date":"2021-02-03","user":2,"value":5}
{"date":"2021-02-03","user":3,"value":6}
PARTITION BY
その値ごとに集計する。
SELECT date, user, value, SUM(value) OVER (PARTITION BY user) as sum_value
FROM test
ORDER BY date, user;
GROUP BYしているわけではないので、行数は減らない。
"date","user","value","sum_value"
"2021-02-01","1","40","555"
"2021-02-01","1","10","555"
"2021-02-01","2","20","777"
"2021-02-01","2","50","777"
"2021-02-01","3","60","999"
"2021-02-01","3","30","999"
"2021-02-02","1","100","555"
"2021-02-02","1","400","555"
"2021-02-02","2","500","777"
"2021-02-02","2","200","777"
"2021-02-02","3","600","999"
"2021-02-02","3","300","999"
"2021-02-03","1","1","555"
"2021-02-03","1","4","555"
"2021-02-03","2","5","777"
"2021-02-03","2","2","777"
"2021-02-03","3","6","999"
"2021-02-03","3","3","999"
ORDER BY (+ RANGE)
ソートし範囲を絞って集計する。
デフォルトの集計範囲は RANGE UNBOUNDED PRECEDING
なので、次の例はそれまでの値の行で集計されている。
SELECT date, user, value, SUM(value) OVER (ORDER BY date) as sum_value
FROM test
WHERE user = 1
ORDER BY date, user;
"date","user","value","sum_value"
"2021-02-01","1","10","50"
"2021-02-01","1","40","50"
"2021-02-02","1","100","550"
"2021-02-02","1","400","550"
"2021-02-03","1","1","555"
"2021-02-03","1","4","555"
ORDER BY + ROWS
行単位で集計範囲を指定する。次の例では1行前からその行までを集計範囲としている。1 FOLLOWING
のようにしてそれ以降の行も指定できる。
上のRANGEの例ではdateが等しい行は同じ集計値になっているが、ROWSでは行によって異なっている。
また、RANGE 1 PRECEDING
のような記述は RANGE PRECEDING is only supported with UNBOUNDED
となってしまうのでできない。
SELECT date, user, value, SUM(value) OVER (ORDER BY date ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) as sum_value
FROM test
WHERE user = 1
ORDER BY date, user;
"date","user","value","sum_value"
"2021-02-01","1","10","10"
"2021-02-01","1","40","50"
"2021-02-02","1","100","140"
"2021-02-02","1","400","500"
"2021-02-03","1","1","401"
"2021-02-03","1","4","5"