Network Applications of Bloom Filters: A Survey (pdf) by Broder and Mitzenmacher is a fun read.
Bloom filters were first invented to solve a problem near and dear to my heart, spelling correction. While entire dictionary of every known word is a very large beast the bloom filter of a dictionary isn’t. It allow you to ask the question “Is this in the dictionary?” using a cleverly arranged bitmap. The down side is that occationally says yes when the answer is no.
Bloom filters are simple. You start with a bit map, all zeros, and then for each symbol in your dictionary you hash it and set a bit in the map. We aren’t done yet, but clearly you can hash your spelling word and see if it’s bit is set. Instead of one bit you actually hash the symbol a dozen times different ways, and set a the dozen bits. If your bit map is large enough then it becomes quite unlikely that a misspelled word will have all 12 bits set. The math to get the right size bit map and number of hashs isn’t too terrible, you can look it up.
The paper mentions a cool varient of a Bloom filter that solves the problem of how to find hot-spots. While this problem comes up a lot but I’ve never seen a satisfactory method before. The hot-spot problem comes up a lot, it is the problem of finding the elite in your power-law distribution. It’s hard because you can’t afford anything that’s the scale of the whole population. Here three example applications. You want to notice that a particular sensor was noticing a large number of small earthquakes. You want to dynamically shift a page into a caching network when it suddenly gets slash dotted. You want dynamicly limit the bandwidth of a client that starts downloading your entire site. Let’s generalize the problem: given a stream of symbols how can you pluck out the ones that are appearing in the stream a lot.
The trick mentioned in the paper was, I gather, outlined in this other paper, New Directions in Traffic Measurement and Accounting (citeseer). Instead of a bit map use counters. Each time a symbol arrives we increment all it’s associated counters. As we do that we assemble the minimum over the counters; that’s a good estimate of how many times we have seen that symbol.
Tada! Symbols with high counters can then be put on a list of likely to be hot. You can then start filtering, caching, watching more carefully, etc. that guy.
This could be used to add a possibly-hot? operator to lots of server data: pages, ip addresses, users, etc. which could then be fed into the server’s operating logic. You could use it to find newly emerging high volume pinger’s in feedmesh. You could use it to find phrases hot new phrases in netnews, newswire, stock trading forums, or to notice sensors in your sensor net that seem to be all a quiver.
ps. There is a cute trick for avoiding the need to have a dozen hash functions. If you only have two h1(x) and h2(x) you can make hi(x) by combining the output of h1 and h2, say by doing h1(x) + i*h1(x). (pdf).
Pingback: Ascription is an Anathema to any Enthusiasm » Blog Archive » Bloom Counter Encoding