Skip to content

Bloom Counter Encoding

It sure looks to me like there is good, fast, linear, bounded storage, way to encode a stream of data on the fly that uses the varient on bloom filters outlined yesterday to select the compression codes on the fly.

Powerlaw of War

A bleak but facinating example of a power-law distribution. The first drawing shows one data set from events in Iraq along with another for events in Columbia.

The vertical is how awful the event was. Iraqi conflict is less severely skewed than the Columbian. Adopting the terminology of wealth distribution you’d say the Iraqi conflict is more equitably distributed.

The second chart is a rare catch. It’s plotting alpha, i.e. the slope of the power-law over time.

The article via kolmstead and the powerlaw tag at del.icio.us.

The frequency-intensity distribution of fatalities in “old wars”, 1816-1980, is a power-law with exponent 1.80. Global terrorist attacks, 1968-present, also follow a power-law with exponent 1.71 for G7 countries and 2.5 for non-G7 countries. Here we analyze two ongoing, high-profile wars on opposite sides of the globe – Colombia and Iraq. Our analysis uses our own unique dataset for killings and injuries in Colombia, plus publicly available data for civilians killed in Iraq. We show strong evidence for power-law behavior within each war. Despite substantial differences in contexts and data coverage, the power-law coefficients for both wars are tending toward 2.5, which is a value characteristic of non-G7 terrorism as opposed to old wars. We propose a plausible yet analytically-solvable model of modern insurgent warfare, which can explain these observations.

Hot Spots

Network Applications of Bloom Filters: A Survey (pdf) by Broder and Mitzenmacher is a fun read.

Bloom filters were first invented to solve a problem near and dear to my heart, spelling correction. While entire dictionary of every known word is a very large beast the bloom filter of a dictionary isn’t. It allow you to ask the question “Is this in the dictionary?” using a cleverly arranged bitmap. The down side is that occationally says yes when the answer is no.

Bloom filters are simple. You start with a bit map, all zeros, and then for each symbol in your dictionary you hash it and set a bit in the map. We aren’t done yet, but clearly you can hash your spelling word and see if it’s bit is set. Instead of one bit you actually hash the symbol a dozen times different ways, and set a the dozen bits. If your bit map is large enough then it becomes quite unlikely that a misspelled word will have all 12 bits set. The math to get the right size bit map and number of hashs isn’t too terrible, you can look it up.

The paper mentions a cool varient of a Bloom filter that solves the problem of how to find hot-spots. While this problem comes up a lot but I’ve never seen a satisfactory method before. The hot-spot problem comes up a lot, it is the problem of finding the elite in your power-law distribution. It’s hard because you can’t afford anything that’s the scale of the whole population. Here three example applications. You want to notice that a particular sensor was noticing a large number of small earthquakes. You want to dynamically shift a page into a caching network when it suddenly gets slash dotted. You want dynamicly limit the bandwidth of a client that starts downloading your entire site. Let’s generalize the problem: given a stream of symbols how can you pluck out the ones that are appearing in the stream a lot.

The trick mentioned in the paper was, I gather, outlined in this other paper, New Directions in Traffic Measurement and Accounting (citeseer). Instead of a bit map use counters. Each time a symbol arrives we increment all it’s associated counters. As we do that we assemble the minimum over the counters; that’s a good estimate of how many times we have seen that symbol.

Tada! Symbols with high counters can then be put on a list of likely to be hot. You can then start filtering, caching, watching more carefully, etc. that guy.

This could be used to add a possibly-hot? operator to lots of server data: pages, ip addresses, users, etc. which could then be fed into the server’s operating logic. You could use it to find newly emerging high volume pinger’s in feedmesh. You could use it to find phrases hot new phrases in netnews, newswire, stock trading forums, or to notice sensors in your sensor net that seem to be all a quiver.

ps. There is a cute trick for avoiding the need to have a dozen hash functions. If you only have two h1(x) and h2(x) you can make hi(x) by combining the output of h1 and h2, say by doing h1(x) + i*h1(x). (pdf).

Digital Fountains in the Walled Garden

Bummer.

Returning yet again to the topic of how the producer of an information good can shift the distribution load out onto a swarm of consumers. The producer can package his content in ways that make it easier and more likely that the swarm can collaborate. For example he can distributed redundent copies of the data or provide lots of checksums.

The extreme case of this is sometimes called a digital fountain, like a water fountain. The producer sprays a stream of droplets and if the consumer can catch a cup full then he can reassemble the original cup of content. And it turns out there are some very effective algorithums for enabling just that.

Here’s a short and simple survey paper on digital fountains (pdf).

…an idealized digital fountain should …

  • A source generate a potentially infinite supply of encoding packages from the original data. Ideally, encoding packets can be generated in constant time per encoding packet given the original data.
  • A reciever can reconstruct a message that would require k packets to send … once any k encoding packets have ben recieved. This reconstruction should also be extremely fast, preferably linear in k.

Amazingly there are designs that come extremely close to that goal. But.

… Most of the work that has been described above has been undertaken by employees of or consultants with the company Digital Fountain, Inc. … the company has a number of patents issued and pending that appear to cover both the fundamental theory and specific implementation designs …

So, no network effect and this stuff will get designed into only those industrial standards that are pay to play. Damn.

Messages in Unusual Orders

I think I first saw this hack in an obfuscated programming contests where in it was used to implement the tape of a Turing machine. A Turing machine runs it’s tape both forward and backward. If you simulating that one obvious possiblity is a doublely linked list where each node holds a pointer to both the next node and the previous node.

So in node B you store a pointer to A and C. The hack saves memory by xoring A and C together and storing that. This works because as you travel foward along the tape and you arive at node B you will have just left node A so you can xor the stored A+C with A and out pops C. And you can do the same thing going in the other direction. Of course if you arived at node B via any other route the the A+C value is useless. For example you lose the ablity to delete B from the tape given a random pointer to it. I’ve never found a practical application of this hack other than saving space and obfuscating code. I’ve never ran into an application other than a Turing machine where I wanted messages that could be read forward and backward; though now that I’m thinking about it there might be some cases involving undo/redo.

In anycase I was reminded of this by looking at some of the schemes used in peer to peer distribution systems. In some of those systems the message is fragmented by the publisher and scattered across the the swarm of clients. They then collaborate to reassemble the message. It’s better if you can arrange things so that the lose of a few fragments doesn’t make it impossible to reassemble the message. A scarce fragment, like a patent on a key element of a standard, can frustrate the entire swarm. This makes the system brittle and raises the stakes for getting the coordination right.

The publisher can solve that. He can pump up is output and provide some redundent info. That removes the scarcity, relaxes the need for perfect coordination. Brittle no more. The easy idea is he just distributes multiple copies of the content; but consider this trick. Just as a very primitive example. The publisher breaks the content into two fragements A and B. He then ships three packets: A, B, and A xor B. Consumers can then reassemble the content given any two of the three packets.

Here some links: used in streaming, used in file transfer, pretty picture the example above.

Getting the publisher’s content moved thru the swarm is the same problem as getting a message passed thru a lossy channel. That problem that’s gotten a lot of attention for the last 50+ years. The world is full of lossy channels so if you go look at any heavy used lossy channel you’ll find tricks of the trade for solving this problem. Netnews for example. When clueful people (and pornographers of course) distribute large files thru netnews they use special archive formats that include a set of PAR or Parity Sets that provide redudency for to enable the consumer to reconstitute the archive given only a subset of the messages that dispatched it.

Free for the Network

Nice post about the role of free in your business model.

…there are huge network effects in the deployment of their service. The more users they have, the more value accrues to each user. They are thinking about the role of free code and data in their business. And well they should. …

52 Card Pickup

More on peer to peer. Recall that what’s cool about peer to peer is that it allows a publisher to reach a bizillion consumers while only expending a few units of bandwidth. He shifts the distribution costs onto the consumers. It makes for a wonderful example of the classic standards problem – “If everybody would just…”

One aspect of this that I’ve been playing with is how the publisher might frame the game to encourage the swarm to collaborate. The BitTorrent technique of this kind is to fragement the content and scatter it into the swarm.

At that point the publisher sits back and says “Oh boys, you sort it out.” It’s a variant of 52 card pickup. Coordinating the resulting card game is the trick. In BitTorrent an entity known as the tracker helps along that coordination. There is a lot of design space to search around how to orchestrate the coordination.

At the same time it’s a special class of market making and clearing house design. Part of the cost born by the consumers is the cost of keeping the market liquid. For example can you design the system so it welcomes and forgives poorly behaving swarm members while at the sametime holding onto the generous ones?

Usability but not presence

This, of course, …


checking foo.h usability... yes
checking foo.h presence... no
configure: WARNING: foo.h: accepted by the compiler, rejected by the preprocessor!
configure: WARNING: foo.h: proceeding with the compiler's result

… happens when your configure.in sets CFLAGS where it ought to have set CPPFLAGS.

Startup

Oh this made me snort!

I worked for a company who’s plausable premise was that lots of CFOs are terribly unhappy because nobody in their firms actually computes the NPV (Net Present Value) of the various projects they are considering. The software was an expert system that the CFO could give/demand that everybody use as part of the project approval process. The premise, while plausable, was wrong.

It’s easy to over estimate the demand for good practice. Particularly if your an expert in said practice.

Emily update


It looks to me like Emily is managing to go ashore just far enough south so the concentrated population along the border will be spared the worst of the wind.

… a strong Category 3 storm … 125 mph … The eye of the storm came ashore just before dawn near San Fernando, a town about 80 miles south of the U.S.-Mexico border. The National Hurricane Center in Miami said hurricane force winds extended outward 70 miles. Tropical storm-force winds blew over south Texas.

I think this side looking radar image (pdf) of the Brownsville-Matamoros area is very cool. You can see how the river has grown more convoluted over time, and also how occationally it folds back on it’s self so that an entire loop is then abandoned. (more maps).

On a more serious note.

expect rains to swamp the low lying ramshackle settlements known as colonias which line the Rio Grande, many of which have no drainage systems…. “It is going to be a very dangerous situation. We could easily see 15 inches of rain in some mountains areas and that will cause flash floods and mudslides,” said Stacy Stewart of the U.S. National Hurricane Center in Miami. …Many fretted over the fate of their modest corrugated iron homes.
(more)