Thinking about the efficiency problems for blogs, or URIs in general. Here are two things that don’t look too hard to do (one about polling, one about diffs), along with some clarification about the populations that yearn for these improvements.
First let’s block out some givens. There are content consumers and content providers. Two populations, call them readers and writers. Then we have all the stuff in between them that makes it work. We would like to make that stuff more efficient, presumably so life is better for those two populations.
The readers and the writers are power-law distributed. You can make a cartoon of that. Each population has its elites and its masses. There are billions of readers and billions of writers – most of them very low volume – that’s the masses. There is a much smaller number of elites: a few voracious readers like Google or Technorati, and a few very popular and prolific writers (Live Journal is a good example in the blog space). For the purposes of this discussion I’ll just ignore the middle classes and the scale-free networks that really describe the situation.
With that picture – two populations broken into elite/masses – it becomes clear that the stuff in the middle might be broken out into four design problems: {reader,writer}*{elite,masses}. It certainly helps to generate cases to think about. For example, the elite readers yearn for a way to lower the cost of polling all the masses of writers.
So one simple idea is to just attack the polling problem: do something so that it’s easier to know if a site has been updated. Building a distributed system that provides a standard where all parties can rendezvous and collaborate about notification of changes wouldn’t be that hard a design problem. But it is sufficiently complex to take some real work to get right. For example, it would be a bad thing (I think) if that design encouraged a single notification hub owned by one entity.
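To make the rendezvous idea a bit more concrete, here’s a minimal sketch in Python of the writer’s half, assuming a set of independent hubs with a hypothetical /ping endpoint. None of this is an existing standard; it’s just one shape the thing could take:

```python
# Writer side: tell each of several independent hubs that a feed changed.
# The hub URLs and the /ping endpoint are hypothetical.
import urllib.parse
import urllib.request

def notify_hubs(hubs, feed_url):
    data = urllib.parse.urlencode({"url": feed_url}).encode()
    for hub in hubs:
        # POST the changed feed's URL to each hub we know about
        urllib.request.urlopen(hub + "/ping", data=data)
```

The reader side would register a callback with those same hubs instead of polling millions of sites, and because a writer can ping any number of hubs, no single entity has to own the rendezvous point.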
I suspect that the elite writers and the elite readers are the only ones who really care about solving the bandwidth problems at this point; or at least they stand to gain the most from getting the problem solved. I started thinking that after it occurred to me that the current bloom of interest in these problems came out of Microsoft deciding to stop providing full feeds for a mess of blogs they host. I found myself thinking “Oh please! What, they can’t afford the bandwidth? Bandwidth is practically free. … hm. I bet they are really just trying to get those reader eyeballs to come back to their site.” Who knows if all that is true or not, just a thought in my head, but it’s clear that you need a lot of readers before you start finding the bandwidth problem all that serious. My blog is in the top .5% of all blogs and I run it on a little DSL connection off a discarded computer in my basement.
In some standardization situations, when you set out your four boxes, you discover that most of the benefit of introducing a standard is captured by fixing the elite-to-elite exchanges. That happens when the elites mostly exchange with each other. That’s not the case in this scenario – at least I don’t think so. It would be very interesting to see some data about that for various subsets of the reader/writer markets. Microsoft’s misery with their developer network blogs is a particularly ironic case study, though. Developer networks are all about large platform vendors creating relationships with masses of small developers.
Well, so in thinking about the bandwidth problem, it seems to me there is one simple hack I’ve not seen suggested so far.
Just add a new encoding type. Clueful readers write “Accept-Encoding: gzip” in their requests, and responsive content providers act on that to reduce bandwidth. In addition, clueful readers do conditional GETs, sending back the last ETag in an If-None-Match header, so that if the resource hasn’t changed at all they get back a very efficient (i.e. practically empty) 304 response. What if the elite readers started sending “Accept-Encoding: gzip, x-diff-from-etag”?
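To make that concrete, here’s a rough sketch of the client side in Python. The “x-diff-from-etag” coding is the hypothetical one proposed above, and apply_patch() stands in for whatever diff format the parties would agree on:

```python
# Sketch of an elite reader's fetch loop; gzip handling omitted for brevity.
import http.client

cache = {}  # path -> (etag, full_body)

def apply_patch(old_body, patch):
    # Placeholder: whatever diff format gets standardized goes here.
    raise NotImplementedError

def poll(host, path):
    conn = http.client.HTTPConnection(host)
    headers = {"Accept-Encoding": "gzip, x-diff-from-etag"}
    etag, old_body = cache.get(path, (None, None))
    if etag is not None:
        headers["If-None-Match"] = etag  # the conditional GET
    conn.request("GET", path, headers=headers)
    resp = conn.getresponse()
    if resp.status == 304:
        return old_body  # unchanged; the response was practically empty
    body = resp.read()
    if resp.getheader("Content-Encoding") == "x-diff-from-etag":
        # The server sent only the changes since the version named by
        # our ETag, not the whole resource.
        body = apply_patch(old_body, body)
    cache[path] = (resp.getheader("ETag"), body)
    return body
```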
Ah, but I hear you say: “That will never work; we’d have to revise the entire installed base.” Well yes, if we wanted the entire installed base to get the benefit of this efficiency improvement. But what if we only really care about the elites? For example, if Technorati (an elite reader) started doing this and Typepad, Live Journal, MovableType and WordPress all started responding appropriately, it would have a huge impact on the total bandwidth being consumed.
Of course we could spend a few years negotiating exactly what “x-diff-from-etag” means and how many variations there are. While that is the usual approach, I suspect in this case somebody could hack up a proposal that everybody could sign onto pretty quickly. You’ll notice who I think should fund that labor. It’s simple enough that it is unlikely to be biased toward one of the four classes of users ({reader,writer}*{elite,masses}).
The problem is HTTP caches. Consider an AOL cache sitting in front of all its subscribers: if the first subscriber to ask gets a delta, then the cache hands that delta to EVERYONE, including clients that never asked for one. That’s exactly what the Vary: header is for – it tells the cache which request headers the response depends on.
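Continuing the sketch, a delta response would need to carry headers along these lines so that a shared cache keys on who asked for what. The header names are real HTTP/1.1; the coding and the values are the hypothetical parts:

```python
# Headers a delta-serving site might send. The body is only meaningful
# relative to the ETag the client sent in If-None-Match, so the cache
# must vary on that header too, not just on the URL.
delta_response_headers = {
    "ETag": '"v42"',                         # names the new full version
    "Content-Encoding": "x-diff-from-etag",  # body is a diff, not the page
    "Vary": "Accept-Encoding, If-None-Match",
}
```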