Monthly Archives: July 2005

The Five Great Philosophies of Life

The book I was looking for was not on the shelf, but this was a few books down. The original copyright is 1904. Prudence Hyde renewed the copyright in 1938 by Prudence Hyde. The five great philosophies?

  • Epicurean Pursuit of Pleasure
  • Stoic Self-Control by Law
  • Platonic Subordination of Lower to Higher
  • Aristotelian Sense of Proportion
  • The Christian Spirt of Love

But then I’m sure you knew that. The author was William De Witt Hyde, a president of Bowdoin College. (Amazon)

Loyality is very skew’d

Chris Anderson points out an interesting power point presentation from a company called DVD stations ptt/pdf Their business is DVD rental kiosks. The kiosks print DVDs on demand and have really fast connection back to the mother ship. Like all attacks on a distribution channel the upside comes from unlocking value that couldn’t get thru the channel beforehand. In this case the long tail of content. There are some nice charts showing how much value there is out there. One curiousity, they have a little budge in their revenue for movies that are around 9-10 years old, what’s up with that?

I was interested in two charts. The first one shows a portion of their sales pipeline, it shows that a few customers account for a large chunk of their revenue. You gotta love the labels marketing people put on the folks in various stages of their pipeline.

The second chart shows which channels distribute thru and how much revenue comes out of each. Since I don’t watch cable TV I was surprised how much is premium cable and video-on-demand. This is pie is of course a slice of yet larger pies, i.e. the entertainment pie. In the future the big slice is going to be neighborhood puppet theaters – you heard it here first!

Of course I suspect that both these pie charts are just power-law curves, but ironically they don’t display that. They want to be an elite. For example, in the customer catagory space that leads to inevitably into pricing discussions and then into the new dark ages of DRM content.

Silver Spoon 0.0.2

Fresh version at http://www.cozy.org/silverspoon/.

If you compile up the test program big then you can use it to glean out a list of items that occur (aprox) a certain number of times. For example if “yesterday -ip” dumps the log of yesterday’s web server traffic with just the IP addresses, then: “yesterday -ip | big -c 50 > bl50” will show you the IP addresses that dropped by 50 or more times. Or you can then do “yesterday -ip | big -c 40 -b bl50” to see the ones that dropped by from 40 to 50 times. The “-b” switch stands for black list and all items in that set are discarded.

Silver Spoon

htp-0.0.1.tgz is a tar file containing some code I hacked together this afternoon implementing bloom counters. Minimal doc and source views.

It has got lots of bugs. While is also an implementatiof of bloom filters it’s totally untested. There is one test executable named big. It is unix pipe fitting. Given a stream of lines it prints out all the lines appear more N times.

The code depends on APR.

The mnemonic HTP was for Hot Spot, but only late in the day did I realize that’s too much like HTTP – oops. That’s change if I get back to this.

No promises, no warrenty, lousy license.

Powerlaw of War

A bleak but facinating example of a power-law distribution. The first drawing shows one data set from events in Iraq along with another for events in Columbia.

The vertical is how awful the event was. Iraqi conflict is less severely skewed than the Columbian. Adopting the terminology of wealth distribution you’d say the Iraqi conflict is more equitably distributed.

The second chart is a rare catch. It’s plotting alpha, i.e. the slope of the power-law over time.

The article via kolmstead and the powerlaw tag at del.icio.us.

The frequency-intensity distribution of fatalities in “old wars”, 1816-1980, is a power-law with exponent 1.80. Global terrorist attacks, 1968-present, also follow a power-law with exponent 1.71 for G7 countries and 2.5 for non-G7 countries. Here we analyze two ongoing, high-profile wars on opposite sides of the globe – Colombia and Iraq. Our analysis uses our own unique dataset for killings and injuries in Colombia, plus publicly available data for civilians killed in Iraq. We show strong evidence for power-law behavior within each war. Despite substantial differences in contexts and data coverage, the power-law coefficients for both wars are tending toward 2.5, which is a value characteristic of non-G7 terrorism as opposed to old wars. We propose a plausible yet analytically-solvable model of modern insurgent warfare, which can explain these observations.

Hot Spots

Network Applications of Bloom Filters: A Survey (pdf) by Broder and Mitzenmacher is a fun read.

Bloom filters were first invented to solve a problem near and dear to my heart, spelling correction. While entire dictionary of every known word is a very large beast the bloom filter of a dictionary isn’t. It allow you to ask the question “Is this in the dictionary?” using a cleverly arranged bitmap. The down side is that occationally says yes when the answer is no.

Bloom filters are simple. You start with a bit map, all zeros, and then for each symbol in your dictionary you hash it and set a bit in the map. We aren’t done yet, but clearly you can hash your spelling word and see if it’s bit is set. Instead of one bit you actually hash the symbol a dozen times different ways, and set a the dozen bits. If your bit map is large enough then it becomes quite unlikely that a misspelled word will have all 12 bits set. The math to get the right size bit map and number of hashs isn’t too terrible, you can look it up.

The paper mentions a cool varient of a Bloom filter that solves the problem of how to find hot-spots. While this problem comes up a lot but I’ve never seen a satisfactory method before. The hot-spot problem comes up a lot, it is the problem of finding the elite in your power-law distribution. It’s hard because you can’t afford anything that’s the scale of the whole population. Here three example applications. You want to notice that a particular sensor was noticing a large number of small earthquakes. You want to dynamically shift a page into a caching network when it suddenly gets slash dotted. You want dynamicly limit the bandwidth of a client that starts downloading your entire site. Let’s generalize the problem: given a stream of symbols how can you pluck out the ones that are appearing in the stream a lot.

The trick mentioned in the paper was, I gather, outlined in this other paper, New Directions in Traffic Measurement and Accounting (citeseer). Instead of a bit map use counters. Each time a symbol arrives we increment all it’s associated counters. As we do that we assemble the minimum over the counters; that’s a good estimate of how many times we have seen that symbol.

Tada! Symbols with high counters can then be put on a list of likely to be hot. You can then start filtering, caching, watching more carefully, etc. that guy.

This could be used to add a possibly-hot? operator to lots of server data: pages, ip addresses, users, etc. which could then be fed into the server’s operating logic. You could use it to find newly emerging high volume pinger’s in feedmesh. You could use it to find phrases hot new phrases in netnews, newswire, stock trading forums, or to notice sensors in your sensor net that seem to be all a quiver.

ps. There is a cute trick for avoiding the need to have a dozen hash functions. If you only have two h1(x) and h2(x) you can make hi(x) by combining the output of h1 and h2, say by doing h1(x) + i*h1(x). (pdf).

Digital Fountains in the Walled Garden

Bummer.

Returning yet again to the topic of how the producer of an information good can shift the distribution load out onto a swarm of consumers. The producer can package his content in ways that make it easier and more likely that the swarm can collaborate. For example he can distributed redundent copies of the data or provide lots of checksums.

The extreme case of this is sometimes called a digital fountain, like a water fountain. The producer sprays a stream of droplets and if the consumer can catch a cup full then he can reassemble the original cup of content. And it turns out there are some very effective algorithums for enabling just that.

Here’s a short and simple survey paper on digital fountains (pdf).

…an idealized digital fountain should …

  • A source generate a potentially infinite supply of encoding packages from the original data. Ideally, encoding packets can be generated in constant time per encoding packet given the original data.
  • A reciever can reconstruct a message that would require k packets to send … once any k encoding packets have ben recieved. This reconstruction should also be extremely fast, preferably linear in k.

Amazingly there are designs that come extremely close to that goal. But.

… Most of the work that has been described above has been undertaken by employees of or consultants with the company Digital Fountain, Inc. … the company has a number of patents issued and pending that appear to cover both the fundamental theory and specific implementation designs …

So, no network effect and this stuff will get designed into only those industrial standards that are pay to play. Damn.

Messages in Unusual Orders

I think I first saw this hack in an obfuscated programming contests where in it was used to implement the tape of a Turing machine. A Turing machine runs it’s tape both forward and backward. If you simulating that one obvious possiblity is a doublely linked list where each node holds a pointer to both the next node and the previous node.

So in node B you store a pointer to A and C. The hack saves memory by xoring A and C together and storing that. This works because as you travel foward along the tape and you arive at node B you will have just left node A so you can xor the stored A+C with A and out pops C. And you can do the same thing going in the other direction. Of course if you arived at node B via any other route the the A+C value is useless. For example you lose the ablity to delete B from the tape given a random pointer to it. I’ve never found a practical application of this hack other than saving space and obfuscating code. I’ve never ran into an application other than a Turing machine where I wanted messages that could be read forward and backward; though now that I’m thinking about it there might be some cases involving undo/redo.

In anycase I was reminded of this by looking at some of the schemes used in peer to peer distribution systems. In some of those systems the message is fragmented by the publisher and scattered across the the swarm of clients. They then collaborate to reassemble the message. It’s better if you can arrange things so that the lose of a few fragments doesn’t make it impossible to reassemble the message. A scarce fragment, like a patent on a key element of a standard, can frustrate the entire swarm. This makes the system brittle and raises the stakes for getting the coordination right.

The publisher can solve that. He can pump up is output and provide some redundent info. That removes the scarcity, relaxes the need for perfect coordination. Brittle no more. The easy idea is he just distributes multiple copies of the content; but consider this trick. Just as a very primitive example. The publisher breaks the content into two fragements A and B. He then ships three packets: A, B, and A xor B. Consumers can then reassemble the content given any two of the three packets.

Here some links: used in streaming, used in file transfer, pretty picture the example above.

Getting the publisher’s content moved thru the swarm is the same problem as getting a message passed thru a lossy channel. That problem that’s gotten a lot of attention for the last 50+ years. The world is full of lossy channels so if you go look at any heavy used lossy channel you’ll find tricks of the trade for solving this problem. Netnews for example. When clueful people (and pornographers of course) distribute large files thru netnews they use special archive formats that include a set of PAR or Parity Sets that provide redudency for to enable the consumer to reconstitute the archive given only a subset of the messages that dispatched it.