I read this morning about bots that pretend to be Google. I’m surprised to realise that I’m unaware of any standard scheme for a bot (or other HTTP client) to assert its identity in a secure way. This seems like a kind of authentication, i.e. some sites would prefer to know they are being crawled by the authentic bot vs. an imposter.
There is a list of the standard authentication schemes. But none of them handle this use case.
This doesn’t look too difficult. You need a way for agents to sign their requests. So, you make another auth scheme. Authorization headers using this scheme include a few fields. One is a URL to denote who is signing; presumably the document at that URL holds the public key for the agent. Another is a field that allows the server to infer exactly what the agent signed. That would need to include enough stuff to frustrate various replay attempts (the requested URL and the time might be sufficient).
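To make that concrete, here is a rough sketch of what the fields and the replay check might look like. Everything named here is my own invention for illustration: the "BotSig" scheme name, the keyurl/ts/sig field names, and the five-minute freshness window. The signing itself is passed in as a function, and the demo uses an HMAC with a shared secret purely as a stand-in so the sketch runs on the standard library; a real scheme would sign with the agent's private key and verify against the public key fetched from the key URL.

```python
import hashlib
import hmac

MAX_SKEW_SECONDS = 300  # assumed freshness window; not from any standard


def signing_string(method, url, timestamp):
    """Canonical bytes the agent signs and the server rebuilds."""
    return f"{method.upper()}\n{url}\n{timestamp}".encode()


def build_header(key_url, method, url, timestamp, sign):
    """Build a hypothetical 'BotSig' Authorization header value."""
    signature = sign(signing_string(method, url, timestamp)).hex()
    return f'BotSig keyurl="{key_url}", ts="{timestamp}", sig="{signature}"'


def parse_header(header):
    """Split the header value into a dict of its fields."""
    scheme, _, params = header.partition(" ")
    if scheme != "BotSig":
        raise ValueError("unexpected scheme")
    fields = {}
    for part in params.split(", "):
        name, _, value = part.partition("=")
        fields[name] = value.strip('"')
    return fields


def check_request(header, method, url, now, verify):
    """Reject stale timestamps, then verify the signature over the
    method, URL and time the server actually saw."""
    fields = parse_header(header)
    timestamp = int(fields["ts"])
    if abs(now - timestamp) > MAX_SKEW_SECONDS:
        return False  # too old or too far in the future: possible replay
    message = signing_string(method, url, timestamp)
    return verify(fields["keyurl"], message, bytes.fromhex(fields["sig"]))


# Stand-in only: a shared-secret HMAC in place of a public-key signature.
_DEMO_SECRET = b"demo-secret"


def demo_sign(message):
    return hmac.new(_DEMO_SECRET, message, hashlib.sha256).digest()


def demo_verify(key_url, message, signature):
    # A real verifier would fetch the public key from key_url.
    return hmac.compare_digest(demo_sign(message), signature)
```

Because the requested URL and the time are both inside the signed text, a captured header is useless for a different URL, and useless for the same URL once the window has passed. Replays of the same URL inside the window would still get through, which may or may not matter for a crawler.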
More interesting, at least to me, is why we do not appear to have such a standard already. We can play the cost/benefit game. There are at least four players in that calculation.
The standards gauntlet is a PIA. For an individual this would be a pretty small feather in one’s cap, and a long haul. So the cost/benefit for the technologist is weak. And this isn’t just a matter of technology; you also have to convince the other players to play the game.
What about the sites the bot is visiting? They pay a tax for serving the bad bots. The size of that tax is the benefit they might capture after running the gauntlet. But meanwhile they have alternatives. They aren’t very good alternatives, but they are probably sufficient. For example, they can whitelist the IP address ranges they think the good bots are using. That’s tedious, but it’s in their comfort zone vs. the standards gauntlet.
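The whitelist alternative really is only a few lines, which is part of why it stays in the comfort zone. A minimal sketch, assuming the site keeps its own list of ranges; the ranges below are reserved documentation blocks standing in for whatever a real bot operator uses, and the function name is made up.

```python
import ipaddress

# Hypothetical ranges (reserved documentation blocks, not real bot ranges).
TRUSTED_BOT_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "2001:db8::/32")
]


def is_trusted_bot(remote_addr):
    """True if the client address falls inside any whitelisted range."""
    address = ipaddress.ip_address(remote_addr)
    return any(address in network for network in TRUSTED_BOT_RANGES)
```

The tedious part is not the check but keeping the list current as the bot operators change their ranges.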
What about the operators of the big “high quality” bots? It might be argued that the fraudulent bots are doing them some sort of reputation damage, but I find that hard to believe. A slightly disconcerting thought is that they might pay some people in their standards office to run the gauntlet because this would create a little barrier to entry for other spidering businesses.
The fourth constituency that might care is the internet architecture crowd. I wonder if they are actually somewhat opposed, or at least ambivalent, to this kind of authentication, since it has the smell of an attempt to undermine anonymity.