Archive for July, 2005

Powerlaw of War

Wednesday, July 27th, 2005

A bleak but facinating example of a power-law distribution. The first drawing shows one data set from events in Iraq along with another for events in Columbia.

The vertical is how awful the event was. Iraqi conflict is less severely skewed than the Columbian. Adopting the terminology of wealth distribution you’d say the Iraqi conflict is more equitably distributed.

The second chart is a rare catch. It’s plotting alpha, i.e. the slope of the power-law over time.

The article via kolmstead and the powerlaw tag at del.icio.us.

The frequency-intensity distribution of fatalities in “old wars”, 1816-1980, is a power-law with exponent 1.80. Global terrorist attacks, 1968-present, also follow a power-law with exponent 1.71 for G7 countries and 2.5 for non-G7 countries. Here we analyze two ongoing, high-profile wars on opposite sides of the globe - Colombia and Iraq. Our analysis uses our own unique dataset for killings and injuries in Colombia, plus publicly available data for civilians killed in Iraq. We show strong evidence for power-law behavior within each war. Despite substantial differences in contexts and data coverage, the power-law coefficients for both wars are tending toward 2.5, which is a value characteristic of non-G7 terrorism as opposed to old wars. We propose a plausible yet analytically-solvable model of modern insurgent warfare, which can explain these observations.

Hot Spots

Monday, July 25th, 2005

Network Applications of Bloom Filters: A Survey (pdf) by Broder and Mitzenmacher is a fun read.

Bloom filters were first invented to solve a problem near and dear to my heart, spelling correction. While entire dictionary of every known word is a very large beast the bloom filter of a dictionary isn’t. It allow you to ask the question “Is this in the dictionary?” using a cleverly arranged bitmap. The down side is that occationally says yes when the answer is no.

Bloom filters are simple. You start with a bit map, all zeros, and then for each symbol in your dictionary you hash it and set a bit in the map. We aren’t done yet, but clearly you can hash your spelling word and see if it’s bit is set. Instead of one bit you actually hash the symbol a dozen times different ways, and set a the dozen bits. If your bit map is large enough then it becomes quite unlikely that a misspelled word will have all 12 bits set. The math to get the right size bit map and number of hashs isn’t too terrible, you can look it up.

The paper mentions a cool varient of a Bloom filter that solves the problem of how to find hot-spots. While this problem comes up a lot but I’ve never seen a satisfactory method before. The hot-spot problem comes up a lot, it is the problem of finding the elite in your power-law distribution. It’s hard because you can’t afford anything that’s the scale of the whole population. Here three example applications. You want to notice that a particular sensor was noticing a large number of small earthquakes. You want to dynamically shift a page into a caching network when it suddenly gets slash dotted. You want dynamicly limit the bandwidth of a client that starts downloading your entire site. Let’s generalize the problem: given a stream of symbols how can you pluck out the ones that are appearing in the stream a lot.

The trick mentioned in the paper was, I gather, outlined in this other paper, New Directions in Traffic Measurement and Accounting (citeseer). Instead of a bit map use counters. Each time a symbol arrives we increment all it’s associated counters. As we do that we assemble the minimum over the counters; that’s a good estimate of how many times we have seen that symbol.

Tada! Symbols with high counters can then be put on a list of likely to be hot. You can then start filtering, caching, watching more carefully, etc. that guy.

This could be used to add a possibly-hot? operator to lots of server data: pages, ip addresses, users, etc. which could then be fed into the server’s operating logic. You could use it to find newly emerging high volume pinger’s in feedmesh. You could use it to find phrases hot new phrases in netnews, newswire, stock trading forums, or to notice sensors in your sensor net that seem to be all a quiver.

ps. There is a cute trick for avoiding the need to have a dozen hash functions. If you only have two h1(x) and h2(x) you can make hi(x) by combining the output of h1 and h2, say by doing h1(x) + i*h1(x). (pdf).

Digital Fountains in the Walled Garden

Monday, July 25th, 2005

Bummer.

Returning yet again to the topic of how the producer of an information good can shift the distribution load out onto a swarm of consumers. The producer can package his content in ways that make it easier and more likely that the swarm can collaborate. For example he can distributed redundent copies of the data or provide lots of checksums.

The extreme case of this is sometimes called a digital fountain, like a water fountain. The producer sprays a stream of droplets and if the consumer can catch a cup full then he can reassemble the original cup of content. And it turns out there are some very effective algorithums for enabling just that.

Here’s a short and simple survey paper on digital fountains (pdf).

…an idealized digital fountain should …

  • A source generate a potentially infinite supply of encoding packages from the original data. Ideally, encoding packets can be generated in constant time per encoding packet given the original data.
  • A reciever can reconstruct a message that would require k packets to send … once any k encoding packets have ben recieved. This reconstruction should also be extremely fast, preferably linear in k.

Amazingly there are designs that come extremely close to that goal. But.

… Most of the work that has been described above has been undertaken by employees of or consultants with the company Digital Fountain, Inc. … the company has a number of patents issued and pending that appear to cover both the fundamental theory and specific implementation designs …

So, no network effect and this stuff will get designed into only those industrial standards that are pay to play. Damn.

Messages in Unusual Orders

Sunday, July 24th, 2005

I think I first saw this hack in an obfuscated programming contests where in it was used to implement the tape of a Turing machine. A Turing machine runs it’s tape both forward and backward. If you simulating that one obvious possiblity is a doublely linked list where each node holds a pointer to both the next node and the previous node.

So in node B you store a pointer to A and C. The hack saves memory by xoring A and C together and storing that. This works because as you travel foward along the tape and you arive at node B you will have just left node A so you can xor the stored A+C with A and out pops C. And you can do the same thing going in the other direction. Of course if you arived at node B via any other route the the A+C value is useless. For example you lose the ablity to delete B from the tape given a random pointer to it. I’ve never found a practical application of this hack other than saving space and obfuscating code. I’ve never ran into an application other than a Turing machine where I wanted messages that could be read forward and backward; though now that I’m thinking about it there might be some cases involving undo/redo.

In anycase I was reminded of this by looking at some of the schemes used in peer to peer distribution systems. In some of those systems the message is fragmented by the publisher and scattered across the the swarm of clients. They then collaborate to reassemble the message. It’s better if you can arrange things so that the lose of a few fragments doesn’t make it impossible to reassemble the message. A scarce fragment, like a patent on a key element of a standard, can frustrate the entire swarm. This makes the system brittle and raises the stakes for getting the coordination right.

The publisher can solve that. He can pump up is output and provide some redundent info. That removes the scarcity, relaxes the need for perfect coordination. Brittle no more. The easy idea is he just distributes multiple copies of the content; but consider this trick. Just as a very primitive example. The publisher breaks the content into two fragements A and B. He then ships three packets: A, B, and A xor B. Consumers can then reassemble the content given any two of the three packets.

Here some links: used in streaming, used in file transfer, pretty picture the example above.

Getting the publisher’s content moved thru the swarm is the same problem as getting a message passed thru a lossy channel. That problem that’s gotten a lot of attention for the last 50+ years. The world is full of lossy channels so if you go look at any heavy used lossy channel you’ll find tricks of the trade for solving this problem. Netnews for example. When clueful people (and pornographers of course) distribute large files thru netnews they use special archive formats that include a set of PAR or Parity Sets that provide redudency for to enable the consumer to reconstitute the archive given only a subset of the messages that dispatched it.

Free for the Network

Saturday, July 23rd, 2005

Nice post about the role of free in your business model.

…there are huge network effects in the deployment of their service. The more users they have, the more value accrues to each user. They are thinking about the role of free code and data in their business. And well they should. …