My thoughts keep turning to this effort by the publishers to update the robot exclusion protocol, i.e. ACAP. The current situation with the robot exclusion protocol certainly doesn’t look stable. We are going to get a revision of, or a substitute for, that protocol. But who has the market power, the legitimacy, the technical and legal chops to create one? It just makes your brain hurt!
I think you could say … the current protocol works. But why? A combination of factors? A gentleman’s agreement (that doesn’t sound stable). A concern that failing to conform would blacken your spider’s reputation, since no search engine wants to be associated with a badly behaved spider. That the protocol appears to work at all is surprising. These are very weak drivers.
It’s a great edge case in the world of protocols. It isn’t technically or legally enforced. It is impossible to enforce it technically; and any attempt to enforce it legally would rapidly bring a lot of issues out from under the rug.
It is a case study in the general problem: how to tame a pure public good, in this case information. So the usual circus of issues comes into play. “Pee in the pool.” “Information wants to be free.” Copyright. Trade secrets. Privacy. Good manners. I guess it’s possible to imagine a perfect descendant of the robot exclusion protocol that would allow me to mark a communication with some metadata stating exactly what purposes I license it for going forward.
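Just to make that concrete, here is a toy sketch of what purpose-scoped permission metadata might look like; the field names and purposes are invented for illustration and aren’t drawn from robots.txt, ACAP, or any real specification.

```python
# Hypothetical sketch only: the fields and purposes below are invented for
# illustration; they are not part of robots.txt, ACAP, or any real spec.
PAGE_LICENSE = {
    "url": "http://example.com/articles/today.html",
    "licensed_purposes": {
        "index": True,       # go ahead and index it
        "cache": False,      # but don't keep a cached copy
        "excerpt": True,     # short snippets are fine
        "republish": False,  # wholesale republication is not
    },
}

def may_use(page_license, purpose):
    """Grant a purpose only if the stated license explicitly allows it."""
    return page_license["licensed_purposes"].get(purpose, False)

print(may_use(PAGE_LICENSE, "index"))      # True
print(may_use(PAGE_LICENSE, "republish"))  # False
```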
Marking pages with permission metadata is exactly what the robot exclusion protocol is already doing. Off to one side it says “sure, index this” vs. “no peeking!”. In that way it is almost identical to a copyright license, plus the convention that spiders tend to know where to look for it.
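For the mechanics, here is a minimal sketch of that convention using Python’s standard-library urllib.robotparser; the robots.txt rules, the user-agent names, and the URLs are made up for illustration.

```python
# Minimal sketch of the "off to one side" convention: a polite spider reads
# robots.txt and voluntarily honors it. The rules, bot names, and URLs here
# are invented for illustration.
import urllib.robotparser

robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: ImpoliteBot
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# "Sure, index this" ...
print(parser.can_fetch("PoliteBot", "http://example.com/articles/today.html"))   # True
# ... vs. "no peeking!"
print(parser.can_fetch("PoliteBot", "http://example.com/private/diary.html"))    # False
# A spider that has blackened its reputation may find itself shut out entirely.
print(parser.can_fetch("ImpoliteBot", "http://example.com/articles/today.html")) # False
```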
I suspect that something like ACAP is inevitable. I suspect it’s inevitable that the tie to copyright licensing will be strengthened. Spiders can look forward to some regulatory arm twisting.
With the big wealthy content owners on one side and the big wealthy search engines on the other, it’s going to be fighting gorillas. That can’t be avoided. A shame really, given the ties to the privacy problem, since it is tempting to consider using copyright law as a lever for licensing limited use of one’s personal data.
Update: Andy Oram offers an interesting perspective.
Actually, I think the multi-million dollar lawsuit that brought down Bidder’s Edge is a good example of legal enforcement of the robot exclusion protocol. The judge granted a preliminary injunction on the basis of “trespass to chattels”, which was a stretch, but robots.txt was the No Trespassing sign.
My friend Andy Oram wrote a longer, more thoughtful piece on this; see here.
Thanks Kimbo, that’s a fascinating point I’d not considered before in thinking about the Bidder’s Edge case. Metadata as trespassing signs – ha!