This post is about a brilliant algorithm called the Alias Method. This method allows you to generate samples according to a discrete probability distribution in constant time. In other words, it allows you to simulate a loaded or unfair die in constant time.
To motivate the problem, imagine you have a cluster of servers that handle some sort of request. The cluster is not homogeneous: some servers have more resources (i.e. CPU, memory, network), so when you distribute the load, you want the beefier servers to handle more requests. Given the desired probability distribution, how would you implement an algorithm to randomly select instances, weighted according to their resources?
The naive approach is simple: find a common “denominator” for the distribution (here, the sum of the weights) and divide the range into sub-ranges, each proportional in size to its probability. Then randomly generate an integer between 1 and that denominator and find which “bucket” the integer falls in by comparing it against each sub-range.
To illustrate this, imagine you have 5 heterogeneous servers: 2 instances with 8 GB of RAM, 2 instances with 32 GB of RAM, and 1 instance with 16 GB of RAM. The distribution is illustrated below:
So to implement the naive approach you would find the sum (8+8+32+32+16 = 96) and then write something like this (shown here as runnable Python rather than pseudocode):

```python
import random

def select_server():
    # Generate a random integer between 1 and 96, inclusive
    # (note: Java's Random.nextInt(96) would give 0-95, an easy off-by-one)
    selection = random.randint(1, 96)
    # Return a server based on which sub-range the number falls in
    if selection <= 8:
        return "S1"
    elif selection <= 16:
        return "S2"
    elif selection <= 48:
        return "S3"
    elif selection <= 80:
        return "S4"
    else:
        return "S5"
```
However, a runtime analysis shows that this method runs in O(n), where n is the number of buckets. The question that led to my discovery of the Alias Method was: “Could we do this in O(1) constant time?”*
At the crux of this problem is the fact that the only tool we have for generating a random result in constant time (i.e. Random.nextInt or something similar) gives us a uniform distribution. So the question is: how can we use a uniform distribution (or more precisely, a couple of uniform random draws) to simulate a weighted distribution?
The answer is of course: The Alias Method
A lot of my explanation of this method is based on an article written by Keith Schwarz, a Stanford CS lecturer. If you want a much more detailed and in-depth explanation of this method and comparison with similar methods, take a look at his article.
The alias method relies on a preprocessing step that builds a data structure which deconstructs each discrete probability into fragments, redistributed across the whole distribution so that every new bucket has an equal chance of being selected. Each now-uniformly-distributed bucket is composed of at most two parts: a piece of an original probability and an alias probability (a fragmented shard of another). This achieves constant time because the top-level bucket can be selected uniformly at random, and the entry within the bucket can then be selected with a single comparison, since each bucket holds at most two probabilities.
Allow me to illustrate this:
With the server resource distribution that I described, you could easily calculate the desired probability for each instance:
S1 = 1/12 (8/96)
S2 = 1/12 (8/96)
S3 = 1/3 (32/96)
S4 = 1/3 (32/96)
S5 = 1/6 (16/96)
Or for visual-types like me:
So again, the goal here is to build a uniform-distribution structure out of these discrete probabilities while still maintaining the weighted distribution on a secondary layer. To get to the uniform distribution, multiply each probability by the number of buckets (i.e. 5), so that the fractions sum to the number of buckets. Now it is possible to create 5 buckets of height 1, or in other words, a uniform distribution. I’ve multiplied each fraction by 5, and drawn a line to mark where height = 1 is:
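To make the scaling step concrete, here is a quick Python sketch using the example weights (the variable names are mine, not part of the method):

```python
# Weights (GB of RAM) for the five example servers
weights = {"S1": 8, "S2": 8, "S3": 32, "S4": 32, "S5": 16}
total = sum(weights.values())  # 96
n = len(weights)               # 5 buckets

# Each probability multiplied by the number of buckets,
# so the heights now sum to n and average out to 1
scaled = {name: w / total * n for name, w in weights.items()}
# S1 and S2 land at 5/12, S3 and S4 at 20/12, S5 at 10/12
```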
Now, in order to fill out each bucket so that each has a height of 1, take any part of any bucket that sits above the “height-of-1” line and distribute it as fragments to fill up the buckets below the line. Each bucket may hold at most two fragments, so that the secondary lookup remains constant time. The individual steps are illustrated here:
First, notice that there are two buckets whose adjusted probability sits above the height-of-1 line. If we take a 7/12 chunk off S3 and fill in the rest of the S1 bucket, you get the distribution below:
Next, S3 is still above the height-of-1 line (13/12) and so we take another 7/12 out of S3 and fill S2 with it.
Now S3 itself is below 1, but S4 still has plenty to spare! So we take 6/12 off S4 to fill in the rest of S3.
Finally, 2/12 off the top of S4 fits perfectly to fill in the rest of S5.
In the end, this is the distribution that we’re left with.
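The fragment-shuffling steps above can be sketched in code. This is a minimal Python version of the standard construction (Keith Schwarz’s article covers the numerically robust variant, usually attributed to Vose); the names `prob` and `alias` are my own:

```python
def build_alias_table(weights):
    """Build (names, prob, alias) tables from a dict of weights.

    prob[i] is the probability, within bucket i, of choosing names[i];
    alias[i] is the index of the entry chosen otherwise.
    """
    names = list(weights)
    n = len(names)
    total = sum(weights.values())
    # Each probability scaled by n, so the average bucket height is 1
    scaled = [weights[name] / total * n for name in names]

    small = [i for i, s in enumerate(scaled) if s < 1.0]   # below the line
    large = [i for i, s in enumerate(scaled) if s >= 1.0]  # above the line

    prob = [0.0] * n
    alias = [0] * n
    while small and large:
        lo = small.pop()
        hi = large.pop()
        prob[lo] = scaled[lo]      # bucket lo keeps its own fragment...
        alias[lo] = hi             # ...and is topped up by a piece of hi
        scaled[hi] -= 1.0 - scaled[lo]
        (small if scaled[hi] < 1.0 else large).append(hi)
    # Whatever is left is (up to rounding) exactly height 1
    for i in small + large:
        prob[i] = 1.0
        alias[i] = i
    return names, prob, alias
```

Depending on pop order, the fragments may be assigned in a different order than in my illustrations, but the resulting distribution is the same.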
Simulating the desired distribution is now a simple two-step process:
- Generate a random integer between 1 and n (n = number of buckets, in this case 5)
- Generate a random decimal number between 0 and 1.
Hopefully you can see where this is headed. You use the first random integer to select the bucket, then use the second random decimal to find out where in that bucket you land (in other words, whether or not you select the alias, hence the name). So for example, if you generate a random integer of 3 and a decimal of 0.18743, you would look up the 3rd bucket, then check whether 0.18743 < 6/12. It is, so in this case we select S3.
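Here is that lookup as a small Python sketch, with the `prob` and `alias` tables hardcoded to the values derived in the illustrations above (variable names are mine):

```python
import random

# Tables from the worked example: prob[i] is the height of bucket i's own
# entry; alias[i] is the server whose fragment fills the rest of the bucket.
# S4's bucket is already full, so its alias is never chosen.
names = ["S1", "S2", "S3", "S4", "S5"]
prob  = [5/12, 5/12, 6/12, 12/12, 10/12]
alias = ["S3", "S3", "S4", "S4", "S4"]

def sample():
    i = random.randrange(len(names))   # step 1: pick a bucket uniformly
    coin = random.random()             # step 2: pick within the bucket
    return names[i] if coin < prob[i] else alias[i]
```

Both steps are constant time regardless of how many servers there are, which is the whole point of the preprocessing.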
And voila! Now we can generate samples from a non-uniform distribution in constant time!
*A potentially very memory-inefficient alternative could be to create an array of size 96 and use the randomly generated integer as an index into it, where each entry contains a pointer to a server instance. Something like:
["S1", "S1", "S1", "S1", "S1", "S1", "S1", "S1", "S2", "S2", ...]
Then, using a random number between 0 and 95, look up the selected server instance by index in constant time. This is of course extremely memory-inefficient, with a space complexity of O(n × m), where m is the average weight (here, GB of RAM) per instance. I didn’t really consider this a solution given the absurd space complexity.
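For completeness, that dense-array alternative is only a few lines of Python (again using the example weights):

```python
import random

weights = {"S1": 8, "S2": 8, "S3": 32, "S4": 32, "S5": 16}

# One array slot per unit of weight: 96 entries in total
table = [name for name, w in weights.items() for _ in range(w)]

def select_server():
    # O(1) lookup, but O(sum of weights) memory
    return table[random.randrange(len(table))]
```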