Sadly, Common Table Expressions (CTE)s are insanely expensive, because
each one must be fully materialized. So in your example, you're
essentially creating 5 temp tables (one for each CTE). Obviously that's
not a big deal with only 4 weights and 1000 samples, but for other use
cases that overhead could really add up. Note that this is not the same
OFFSET 0 trick...
You can get a similar breakdown of code by using subselects in
clauses. That would look something like:
SELECT color FROM (<samples code>) AS samples JOIN ( SELECT <cumulative_bounds SELECT> FROM ( SELECT <sampling_cumulative_prob SELECT> FROM (....) ) AS sampling_cumulative_prob ) AS cumulative_bounds ON ...
Not as nice as
WITH, but not horrible. You can also create temporary
views for each of the intermediate steps.
in weights_with_sum, you can get rid of the
join in favor of
OVER() AS weight_sum.
0.0 <= x < 1.0, so the bounds on the
'(]'. Personally, I would just create the
cummulative_bounds, but that's mostly just a matter of style.
BTW, if you've got
plpythonu loaded there's probably an easier way to
generate the set of ranges, which could then be joined to the random
width_bucket(operand anyelement, thresholds anyarray) (see second
instance on docs)
might be even faster; it'd definitely be simpler:
SELECT color[width_bucket(random(), thresholds) FROM generate_series(1,1000) , ( SELECT array_agg(color) AS colors , array_agg(cum_prod) AS thresholds FROM sampling_cumulative_prod ) AS prob;