Weighted Random Sampling with PostgreSQL [Follow-up]

Posted by Marco Santoni on Fri 10 February 2017

I received valuable feedback from Jim Nasby about the post on weighted random sampling with PostgreSQL. I'll report Jim's email here.

Sadly, Common Table Expressions (CTEs) are insanely expensive, because each one must be fully materialized. So in your example, you're essentially creating 5 temp tables (one for each CTE). Obviously that's not a big deal with only 4 weights and 1000 samples, but for other use cases that overhead could really add up. Note that this is not the same as the OFFSET 0 trick... You can get a similar breakdown of code by using subselects in FROM clauses. That would look something like:

SELECT color
   FROM (<samples code>) AS samples
   JOIN (
     SELECT <cumulative_bounds SELECT>
       FROM (
         SELECT <sampling_cumulative_prob SELECT>
           FROM (....)
        ) AS sampling_cumulative_prob
     ) AS cumulative_bounds ON ...

Not as nice as WITH, but not horrible. You can also create temporary views for each of the intermediate steps.
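As a sketch of the temporary-view variant Jim suggests: the view names follow the post's CTE names, but the weights(color, weight) table and the exact expressions inside each view are assumptions on my part, not taken from the post.

    -- Temporary views live for the session and can stand in for each CTE
    -- step without forcing materialization of intermediate results.
    -- weights(color, weight) and the cumulative-sum expression below are
    -- illustrative assumptions.
    CREATE TEMP VIEW weights_with_sum AS
      SELECT color, weight, sum(weight) OVER () AS weight_sum
        FROM weights;

    CREATE TEMP VIEW sampling_cumulative_prob AS
      SELECT color,
             sum(weight / weight_sum) OVER (ORDER BY color) AS cum_prob
        FROM weights_with_sum;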

In weights_with_sum, you can get rid of the join in favor of sum(weight) OVER () AS weight_sum.
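A minimal sketch of that change, assuming the post's weights(color, weight) table:

    -- The window aggregate attaches the grand total to every row in a
    -- single scan of weights, so no join against a separate SUM()
    -- subquery is needed.
    SELECT color,
           weight,
           sum(weight) OVER () AS weight_sum
      FROM weights;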

Finally, random() produces 0.0 <= x < 1.0, so the bounds on the numrange should be '[)', not '(]'. Personally, I would just create the numrange immediately in cumulative_bounds, but that's mostly just a matter of style.
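A sketch of building the half-open range in a single step, assuming the cumulative probabilities sit in a cum_prob column; deriving each lower bound with lag() is my assumption, not something from Jim's email:

    -- numrange(lower, upper, '[)') includes the lower bound and excludes
    -- the upper one, matching random()'s 0.0 <= x < 1.0 output.
    SELECT color,
           numrange(coalesce(lag(cum_prob) OVER (ORDER BY cum_prob), 0.0),
                    cum_prob,
                    '[)') AS bounds
      FROM sampling_cumulative_prob;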

BTW, if you've got plpythonu loaded there's probably an easier way to generate the set of ranges, which could then be joined to the random samples.

BTW, width_bucket(operand anyelement, thresholds anyarray) (see the second form in the docs) might be even faster; it'd definitely be simpler:

SELECT colors[width_bucket(random(), thresholds)]
   FROM generate_series(1,1000)
     , (
       SELECT array_agg(color) AS colors
           , array_agg(cum_prob) AS thresholds
         FROM sampling_cumulative_prob
     ) AS prob;
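One caveat worth noting when trying this: width_bucket() returns 0 for an operand below the first threshold, so the thresholds array has to hold each bucket's lower bound (0, then the running sums) rather than the cumulative upper bounds. A self-contained sketch, with illustrative colors and weights that are not from the post:

    -- random() returns double precision, so it is cast to numeric to
    -- match the numeric thresholds array.
    -- Lower bounds 0 / 0.50 / 0.75 / 0.90 give blue, green, orange, red
    -- weights of 50%, 25%, 15%, and 10% respectively.
    SELECT colors[width_bucket(random()::numeric, thresholds)] AS color
      FROM generate_series(1, 1000)
        , (
          SELECT array_agg(color ORDER BY lower_bound) AS colors
              , array_agg(lower_bound ORDER BY lower_bound) AS thresholds
            FROM (VALUES ('blue',   0.00),
                         ('green',  0.50),
                         ('orange', 0.75),
                         ('red',    0.90)) AS t(color, lower_bound)
        ) AS prob;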