Memory-efficiently estimating cardinality of union/intersection/difference sets with Theta Sketch

pythonalgorithm

Theta Sketch is an algorithm that takes uniformly distributed hash values in (0,1) for each record, uses the (k+1)th smallest value as θ, and estimates cardinality as k/θ. For example, if k = 10000 and θ = 0.1, the estimated cardinality is 10000 / 0.1 = 100000. This means that if the 10000th smallest hash value is 0.1, the total is likely to be about 10 times larger, which seems intuitively correct.

While only the first k+1 hash values need to be retained, it is still about 2-16 times more memory than HyperLogLog to achieve the same accuracy. However, Theta Sketch has the advantage of being able to calculate cardinality for not only unions but also intersections and differences.

Avoiding OOM in count-distinct operations on massive datasets using HyperLogLog++, a probabilistic cardinality estimation algorithm - sambaiz-net

Let’s try running Theta Sketch using Apache DataSketches’ Python library.

# pip install datasketches

import datasketches

# log2(k) = 12 => k = 4096
sk1 = datasketches.update_theta_sketch(lg_k=12)
sk2 = datasketches.update_theta_sketch(lg_k=12)

for i in range(80000):
    if i % 10000 == 0:
      print(f"{i}: estimate={sk1.get_estimate()}, retained={sk1.num_retained}, theta={sk1.theta}")
    sk1.update(f"user_{i}")
for i in range(50000, 100000):
    sk2.update(f"user_{i}")

print(f"sk1: {sk1.get_estimate()} ({sk1.get_lower_bound(2)} ~ {sk1.get_upper_bound(2)})") # ±2σ

union = datasketches.theta_union(lg_k=12)
union.update(sk1)
union.update(sk2)
res_u = union.get_result()
print(f"sk1 or sk2: {res_u.get_estimate()} ({res_u.get_lower_bound(2)} ~ {res_u.get_upper_bound(2)})")

inter = datasketches.theta_intersection()
inter.update(sk1)
inter.update(sk2)
res_i = inter.get_result()
print(f"sk1 and sk2: {res_i.get_estimate()} ({res_i.get_lower_bound(2)} ~ {res_i.get_upper_bound(2)})")

diff = datasketches.theta_a_not_b()
res_d = diff.compute(sk1, sk2)
print(f"A not B: {res_d.get_estimate()} ({res_d.get_lower_bound(2)} ~ {res_d.get_upper_bound(2)})")

The DataSketches implementation retains values within the range that does not exceed 2k.

0: estimate=0.0, retained=0, theta=1.0
10000: estimate=10098.703305439585, retained=5309, theta=0.5257110580860762
20000: estimate=20117.47194615111, retained=5712, theta=0.2839322960305072
30000: estimate=30192.83690288952, retained=4559, theta=0.1509960794563062
40000: estimate=40299.0595643963, retained=6085, theta=0.1509960794563062
50000: estimate=50358.92340635489, retained=7604, theta=0.1509960794563062
60000: estimate=59881.62457473204, retained=4864, theta=0.08122692118898255
70000: estimate=69275.06198232262, retained=5627, theta=0.08122692118898255
sk1: 78914.72317517114 (77041.05752351602 ~ 80833.78096572716)
sk1 or sk2: 99693.27849790886 (96676.37265451884 ~ 102803.91342023716)
sk1 and sk2: 28980.53952486004 (27851.32041395266 ~ 30155.24656627949)
A not B: 49934.1836503111 (48446.98934122362 ~ 51466.80798682627)