Popularity Hashing

As regulars know my favorite distribution is power-law.

I think there is a lot of system design that would play out a lot better if people admitted that one or another key statistic about the load on the system will be power-law distributed. It troubles me to read about system designs that implicitly assume that the traffic loads will be uniform. In most cases I suspect the designers haven’t thought this thru.

One way that plays out is that I suspect the folks that designed most of the internet just did not understand that graphs of communication links will condense into power-law graphs. So if you want to build a resilient network your going to have to take steps to make sure that the hubs don’t become single points of failure. Similarly I don’t think they understood that for similar reasons the only a few vendors would capture most of the DNS, most of the email, etc. etc.

I got to rant about this to Ben Laurie the other day. One example I was giving was that the distributed hash table designs all appear to assume that the frequency of looking up individual keys is uniform. I think that’s vanishingly unlikely. I presume that a handful of elite keys will get looked up a lot more than others – for example looking up Star Wars is a lot more common than looking up Julie London’s verson of Sway. Since I presume the traffic is power-law distributed then the elite keys will account for a disproportionate amount of the traffic. If your node in the peer to peer network implementing the distributed hash table happens to get stuck with one of the elite keys your reward is, in effect, a denial of service attack.

The obvious solution is to spray the popular keys across more nodes in the network. Ben had the clever idea that if you could look up the popular stuff one way and the regular stuff in the traditional way. In effect running 2 or more hash tables one for each tier of popularity. Clients of the distributed hash table would, of course, start by trying the popular table and then fall back on the less popular one. Servers of particular keys would monitor their traffic and shift load onto their neighboring tables as necessary.

Tricks like this would presumably be useful even for some simple single process in memory hash tables. A two tiered hash table is likely to get the elite entries densely packed into the fast cache memory where they are never paged out.

It pulls my cord that designers continue to ignore the prevalence of power-law distributions in the populations they are designing for. For example all the economics text books show the price/volume curve as a straight line. Setting aside my irritating I bet there are some really cool algorithms to be discovered that take this to heart.

Ascription is an Anathema to any Enthusiasm

Ben Hyde

1 thought on “Popularity Hashing”

Leave a Reply