Marco Santoni

Weighted Random Sampling with PostgreSQL

You have a table like the following:

CREATE TABLE weights (
color varchar primary key,
weight float
);

INSERT INTO weights (color, weight)
VALUES
('red', 8),
('blue', 3),
('green', 10),
('yellow', 10);

The table lists the weights associated with certain colors. Imagine a weight representing how much you like that color.

Now, you want to add 1000 colored tiles to your website. You want the color of the tiles to be sampled at random according to the weights table.

We'll write a PostgreSQL script that implements such random sampling. I'll write the entire query first, and then explain each part separately.

CREATE TABLE sampled_colors AS
WITH weights_with_sum AS (
SELECT
color,
weight,
weight_sum
FROM weights
CROSS JOIN (SELECT sum(weight) AS weight_sum FROM weights) s
),
sampling_probability AS (
SELECT
color,
weight / weight_sum AS prob
FROM weights_with_sum
),
sampling_cumulative_prob AS (
SELECT
color,
sum(prob) OVER (order by color) AS cum_prob
FROM sampling_probability
),
cumulative_bounds AS (
SELECT
color,
COALESCE(
lag(cum_prob) OVER (ORDER BY cum_prob, color),
0
) AS lower_cum_bound,
cum_prob AS upper_cum_bound
FROM sampling_cumulative_prob
),
samples AS (
SELECT
generate_series(1, 1000) AS sample_idx,
random() AS sample
)
SELECT
color
FROM samples
JOIN cumulative_bounds ON
sample::numeric <@ numrange(lower_cum_bound::numeric,
upper_cum_bound::numeric, '(]');

Let's look at one piece at a time.

WITH weights_with_sum AS (
SELECT
color,
weight,
weight_sum
FROM weights
CROSS JOIN (SELECT sum(weight) AS weight_sum FROM weights) s
),
sampling_probability AS (
SELECT
color,
weight / weight_sum AS prob
FROM weights_with_sum
)
SELECT *
FROM sampling_probability;
-- output:
color | prob
--------+--------------------
red | 0.258064516129032
blue | 0.0967741935483871
green | 0.32258064516129
yellow | 0.32258064516129

Here, we're just normalizing the weights. Each weight is divided by the total sum of the weights. In this way, we are re-writing each weight as a discrete probability of that color being sampled.

...
sampling_cumulative_prob AS (
SELECT
color,
sum(prob) OVER (order by color) AS cum_prob
FROM sampling_probability
),
cumulative_bounds AS (
SELECT
color,
COALESCE(
lag(cum_prob) OVER (ORDER BY cum_prob, color),
0
) AS lower_cum_bound,
cum_prob AS upper_cum_bound
FROM sampling_cumulative_prob
)
SELECT *
FROM cumulative_bounds;
-- output:
color | lower_cum_bound | upper_cum_bound
--------+--------------------+--------------------
blue | 0 | 0.0967741935483871
green | 0.0967741935483871 | 0.419354838709677
red | 0.419354838709677 | 0.67741935483871
yellow | 0.67741935483871 | 1

In this piece of code, we're are representing the weights as a cumulative distribution function.

...
samples AS (
SELECT
generate_series(1, 1000) AS sample_idx,
random() AS sample
)
SELECT
color
FROM samples
JOIN cumulative_bounds ON
sample::numeric <@ numrange(lower_cum_bound::numeric,
upper_cum_bound::numeric, '(]');

In the last part, we're sampling 1000 times a random number between 0 and 1. We then assign this sample to the corresponding color based on the values of the cumulative function. For example, if the first sample is 0.45, it will match the 'red' range (0.41-0.67). Therefore, that sample will be 'red'.

The result of the query is a table filled with 1000 colors sampled at random based on the weights.

SELECT *
FROM sampled_colors
LIMIT 10;
-- output:
color
--------
green
green
red
yellow
yellow
green
blue
red
red
red

Can we check that the result is correct? Were the weights really taken into account?

SELECT
color,
count(*)
FROM sampled_colors
GROUP BY 1;
-- output:
color | count
--------+-------
yellow | 309
green | 320
red | 276
blue | 95

The proportion of samples is quite close to the proportion of the weights. This similarity is clear if we compare this table with the discrete probability table above.

Other articles

Applied Bayesian Inference with PyMC [video]

Published: Thu 30 June 2016
By mrsantoni

In Uncategorized.

I was glad to give an intro to Bayesian Inference at PyData Florence 2016. The video of the talk is now out.
read more
A Simple Machine Learning Pipeline

Published: Sun 19 June 2016
By mrsantoni

In Uncategorized.

This post contains the code that I used in my talk at Python Milano Meetup on June 22nd 2016. The talk was a quick overview of Pipeline, a nice API by scikitlearn to abstract your machine learning algorithm. It is based on the Boston Housing Data Set.

We'll just load …
read more
Install a .deb file from terminal on Ubuntu
Published: Mon 23 May 2016
By mrsantoni

In Uncategorized.

I use Ubuntu 16.04. Sometimes, when I double-click a .deb file, the installation program does not work. What often solves the problem is installing it from terminal.
```
sudo dpkg -i my_deb_file.deb
sudo apt-get -f install
```
read more
Insights from Data Science Milan - 19/05/16

Published: Fri 20 May 2016
By mrsantoni

In Events.

tags: Events

#DeepLearning introduction and enterprise architectures using #H2O - first #DataScienceMilan meetup! - https://t.co/I8LsfaFJSu
— Andrea Scarso (@andreaesseci) May 18, 2016

A new Data Science meetup is out in Milan. Two talks about Deep Learning were given in the first event.

Neural Networks and Deep Learning: An Introduction. @MilanHighTech. The first …
read more
Bayesian A/B Testing in Python
Published: Sun 15 May 2016
By mrsantoni

In Code.

tags: Bayesian

Imagine you re-designing your e-commerce website. You have to decide whether the "Buy Item" button should be blue or green. You decide to setup an A/B test, so you build two versions of the item page:
- Page A which has a blue button;
- Page B which has a green …
read more
Insights from PyData Florence 16

Published: Wed 20 April 2016
By mrsantoni

In Events.

tags: Events

I have just joined PyData conference in Florence, and I will list briefly some interesting insights.

Oh my... We are already overcrowded @pyconit and it's *just* the beginning!! 🎉🎉 good job guys! 🙌🏻 #pycon7
— (((Valerio Maggio))) (@leriomaggio) April 15, 2016

Time Travel and Time Series Analysis with Pandas and Statsmodels, @hendorf …
read more