Consider the following problem.

We have a stream of incoming data, $x_1, x_2, \ldots, x_n$.

$n$ is not known in advance (i.e. $n$ is random) and is potentially very, very big.

The task is to draw a sample $S_n$ of size $k$ such that $P(x_i \in S_n) = k/n$ for any $i = 1, \ldots, n$. That is, the sampling scheme is such that every element of the stream has the same probability of being included in the sample.

To be clear, $n \gg k$ with high probability, and $n$ could be so large that we cannot simply store all of the elements of the stream in memory first and then perform the sampling once the stream ends. He-he!

Amazingly, there exists a statistimagical algorithm that allows us to draw such a sample. The procedure is often referred to as reservoir sampling.

Here is the basic strategy.

We start by saving the first $k$ elements of the stream in an array as they come. If we denote with $S_t$ the sample after we have observed $x_t$, the $t$-th element of the stream, then we can write $S_t = (x_1, \ldots, x_t)$ for any $t \le k$. Furthermore, note that $|S_k| = k$ and that $P(x_i \in S_k) = k/k = 1$ for all $i \le k$.

  • If the stream ends before or exactly when we reach the $k$-th element, i.e. $n \le k$, we simply end up sampling the whole stream: $S_n = (x_1, \ldots, x_n)$ and $P(x_i \in S_n) = 1$ for all $i = 1, \ldots, n$.

  • Otherwise, if $n > k$, we proceed as follows:

    for $t = k+1, \ldots, n$,

    • with probability $k/t$,

      1. save $x_t$, the $t$-th element of the stream

      2. randomly select one of the $k$ elements of $S_{t-1}$, each with probability $1/k$

      3. pop the randomly selected element out of $S_{t-1}$

      4. fill the empty slot of $S_{t-1}$ with $x_t$

      5. this modified version of $S_{t-1}$ becomes $S_t$.

    • with probability $1 - k/t$, ignore $x_t$ and set $S_t = S_{t-1}$.
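In Python, one update step of this loop might look like the following sketch (the function name and signature are my own, not from the original post):

```python
import random

def reservoir_step(S, x_t, t, k):
    """One iteration of reservoir sampling: S is the current sample
    S_{t-1} (a list of length k), x_t is the t-th stream element, t > k."""
    if random.random() < k / t:   # with probability k/t...
        j = random.randrange(k)   # ...select one of the k slots uniformly,
        S[j] = x_t                # pop that element and fill the slot with x_t
    # otherwise ignore x_t, so that S_t = S_{t-1}
    return S
```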

To convince ourselves that this strategy indeed generates a sample such that $P(x_i \in S_n) = k/n$ for all $i = 1, \ldots, n$, regardless of the value of $n$, it is enough to use an induction argument.

Let’s start by considering iteration $t = k+1$:

  • $P(x_{k+1} \in S_{k+1}) = \frac{k}{k+1}$ by design.

  • At iteration $t = k+1$, the probability that, for $i \le k$, $x_i$ is retained in $S_{k+1}$ is the probability of the event “$x_{k+1}$ is discarded, or $x_{k+1}$ is selected but $x_i$ is not popped out”:

    $$P(x_i \in S_{k+1}) = \left(1 - \frac{k}{k+1}\right) + \frac{k}{k+1} \cdot \frac{k-1}{k} = \frac{1}{k+1} + \frac{k-1}{k+1} = \frac{k}{k+1}.$$

Now, let’s assume that the above also holds for a generic $t > k$, i.e. $P(x_i \in S_t) = k/t$ for every $i \le t$. Then, for $t+1$, we have that

  • $P(x_{t+1} \in S_{t+1}) = \frac{k}{t+1}$ by design.

  • For $i \le t$, conditioning on $x_i \in S_t$ (which has probability $k/t$ by the inductive hypothesis),

    $$P(x_i \in S_{t+1}) = \frac{k}{t} \left[\left(1 - \frac{k}{t+1}\right) + \frac{k}{t+1} \cdot \frac{k-1}{k}\right] = \frac{k}{t} \cdot \frac{t}{t+1} = \frac{k}{t+1}.$$
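Both the base case and the inductive step boil down to the same arithmetic, which can be double-checked with exact rational numbers (a small sanity check of my own, not part of the original argument):

```python
from fractions import Fraction

def retention_prob(k, t):
    """P(x_i stays in the sample at iteration t+1, given x_i is in S_t):
    x_{t+1} is discarded, or it is selected but some other slot is replaced."""
    return (1 - Fraction(k, t + 1)) + Fraction(k, t + 1) * Fraction(k - 1, k)

# Inductive step: P(x_i in S_{t+1}) = k/t * retention_prob = k/(t+1)
for k in range(1, 6):
    for t in range(k, 50):
        assert Fraction(k, t) * retention_prob(k, t) == Fraction(k, t + 1)
```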

Here you can find a Python 3 implementation of this procedure.
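In case it helps, here is a minimal self-contained sketch of such an implementation (my own version, which may differ in details from the one linked):

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")

def reservoir_sample(stream: Iterable[T], k: int,
                     seed: Optional[int] = None) -> List[T]:
    """Draw a uniform random sample of size min(n, k) from a stream
    of unknown length n, in a single pass and O(k) memory."""
    rng = random.Random(seed)
    S: List[T] = []
    for t, x in enumerate(stream, start=1):
        if t <= k:
            S.append(x)              # save the first k elements as they come
        elif rng.random() < k / t:   # with probability k/t...
            S[rng.randrange(k)] = x  # ...x_t replaces a uniformly chosen element
        # otherwise ignore x_t
    return S
```

If the stream ends at or before the $k$-th element, the function simply returns the whole stream, matching the $n \le k$ case discussed above.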

Special thanks to @Matteozio for his suggestions and edits!