I believe it was Ray Kurzweil, circa 1989, who advised encouraging a private jargon inside your new company. I remember because that was just about the time I was starting to think open would totally trump closed in our industry. The advice seemed, to my ears, a bit old-fashioned. But at the same time I suspected he meant that it was a good way to tighten the bonds inside the team. By then I knew enough about cults to recognise that private jargon is common inside them; it supports two of the keys to running a good cult – information control, and thought-stopping processes.
But don’t take any of that too seriously. It is good advice nonetheless. I’ll admit to being a Whorfian: the words you highlight affect your thinking.
These days I’d add to all that, though. Language is the ultimate exchange standard. So when you decide to invent a new private language you’re cutting yourself off and creating friction, or trade barriers, with your outside partners. Importantly, the advantage a new group has is that they can pick and choose what to emphasize. They can take a run at leveraging some particular competitive advantage. As Dave Winer says: You can’t win by zigging when he zigs. You have to zag to beat him. Ray’s advice can be viewed as a bit of implementation advice for that.
So it was with some interest that I saw Google revealing their in-house standard for serializing data. It’s not hard to see that Protocol Buffers are an alternative to XML. And it is amusing, at least to me, to think that they did this in the hope of reducing the frictions that occur when they must translate from their in-house argot into the dialects used by the outside world. It’s fun to note that if your startup is as successful as Google you get to promulgate your private jargon. It is one of the spoils of war. You can push that friction onto your complements and make them pay the switching costs.
Protocol buffers aren’t anything special: a message is a list of key-value pairs, keys can repeat, and there is a small set of value types, including Unicode strings and, recursively, other messages. They are very practical, close to the metal. Choices were made and they are what they are. They are quite dense, and easy to parse. Many messages can be serialized in one pass. In classic length-prefixed style, nested structures have their size mentioned in their header. That makes one-pass serialization of nested messages hard, since the size has to be known before the contents are written.
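To make the length-prefix point concrete, here is a minimal sketch in Python of how a nested message ends up on the wire, assuming the varint and key layout described in Google's published encoding notes. The helper names (`encode_varint`, `key`, `varint_field`, `bytes_field`) are made up for this illustration and are not part of any Google library.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def key(field_number: int, wire_type: int) -> bytes:
    """A field's key is (field_number << 3) | wire_type, stored as a varint."""
    return encode_varint((field_number << 3) | wire_type)


def varint_field(field_number: int, value: int) -> bytes:
    """Wire type 0: key followed by the value as a varint."""
    return key(field_number, 0) + encode_varint(value)


def bytes_field(field_number: int, payload: bytes) -> bytes:
    """Wire type 2: key, then the payload length, then the payload itself."""
    return key(field_number, 2) + encode_varint(len(payload)) + payload


# The inner message has to be fully serialized first, because the outer
# message writes the inner message's length ahead of its bytes.
inner = varint_field(1, 150)
outer = bytes_field(3, inner)
print(inner.hex(), outer.hex())
```

The order of the last two assignments is the whole story: the length in the header forces the writer to finish (or at least measure) the nested message before emitting the enclosing one.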
Given an array of bytes it wouldn’t be child’s play to guess that you’re holding a protocol buffer; you could do it heuristically, but it would still be a guess. You need a protocol buffer’s declaration to parse it. For example, absent the declaration you can’t know if you’ve got a sint32 or an int64, etc. All that disappointed me. It disappointed my inner archivist and my inner peeping tom (who has often debugged tough problems by watching the bytes fly by on the wire).
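A small sketch of that ambiguity, assuming the ZigZag mapping described in the published encoding notes (0, -1, 1, -2, ... map to 0, 1, 2, 3, ...). The function names are hypothetical; the point is only that the same bytes yield two different numbers depending on a declaration the bytes themselves don't carry.

```python
def decode_varint(buf: bytes, pos: int = 0):
    """Decode one base-128 varint; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):  # high bit clear: this was the last byte
            return result, pos
        shift += 7


def zigzag_decode(n: int) -> int:
    """Undo ZigZag: even values are non-negative, odd values are negative."""
    return (n >> 1) ^ -(n & 1)


raw, _ = decode_varint(b'\x03')
as_int32 = raw                  # read as a plain varint field
as_sint32 = zigzag_decode(raw)  # read as a ZigZag-encoded field
print(as_int32, as_sint32)
```

Nothing on the wire says which of the two readings the sender intended.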
There is a nice detail that allows optional keys, which in turn makes it somewhat easier to add new message variants. With luck the old message handlers can just ignore the additions. It made me smile to note that the same mechanism can be used to pad messages, which in turn makes it more likely that you can serialize in a single pass.
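Here is a rough sketch of what "just ignore the additions" can look like, assuming the four common wire types (0 varint, 1 fixed 64-bit, 2 length-delimited, 5 fixed 32-bit). The handler and its field numbers are invented for the example; a real generated parser does considerably more.

```python
def decode_varint(buf: bytes, pos: int):
    """Decode one base-128 varint; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result, pos
        shift += 7


def skip_field(buf: bytes, pos: int, wire_type: int) -> int:
    """Return the position just past a field value we don't understand."""
    if wire_type == 0:
        _, pos = decode_varint(buf, pos)
        return pos
    if wire_type == 1:
        return pos + 8
    if wire_type == 2:
        length, pos = decode_varint(buf, pos)
        return pos + length
    if wire_type == 5:
        return pos + 4
    raise ValueError(f"wire type {wire_type} not handled in this sketch")


def old_handler(buf: bytes):
    """Hypothetical old reader: knows only field 1 (a varint), skips the rest."""
    pos = 0
    value = None
    while pos < len(buf):
        k, pos = decode_varint(buf, pos)
        field_number, wire_type = k >> 3, k & 7
        if field_number == 1 and wire_type == 0:
            value, pos = decode_varint(buf, pos)
        else:
            pos = skip_field(buf, pos, wire_type)
    return value


old_message = b'\x08\x96\x01'               # field 1 = 150
new_message = old_message + b'\x12\x02hi'   # adds field 2, a two-byte string
print(old_handler(old_message), old_handler(new_message))
```

The old reader gets the same answer from both messages; the added field costs it one length lookup and a jump.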
There is another nice detail that allows a key to appear more than once in spite of the metadata saying it is single valued. The last occurrence wins. This lets you implement a kind of inheritance/defaulting. For example, if you’re implementing CSS style sheets you read the default style message, then read the variations from the default, and you’re ready to go. They call that merging.
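A toy version of that defaulting trick, assuming a message made only of varint fields. The field numbers and their meanings (1 as a font size, 2 as a bold flag) are made up for the illustration, as is the `parse_scalars` helper.

```python
def decode_varint(buf: bytes, pos: int):
    """Decode one base-128 varint; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result, pos
        shift += 7


def parse_scalars(buf: bytes) -> dict:
    """Parse varint-only fields; a later occurrence of a key replaces an earlier one."""
    fields = {}
    pos = 0
    while pos < len(buf):
        k, pos = decode_varint(buf, pos)
        field_number, wire_type = k >> 3, k & 7
        if wire_type != 0:
            raise ValueError("this sketch only handles varint fields")
        value, pos = decode_varint(buf, pos)
        fields[field_number] = value  # last occurrence wins
    return fields


defaults = b'\x08\x0c\x10\x01'  # hypothetical style: field 1 = 12, field 2 = 1
override = b'\x08\x0e'          # variation: field 1 = 14
merged = parse_scalars(defaults + override)
print(merged)
```

Concatenating the two serialized messages and parsing the result is all the merging there is: the override's field 1 replaces the default, and field 2 carries through untouched.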
Given the declarative information for a given protocol buffer it’s not hard to convert it to XML or whatever else you like. The observers and archivists will just have to be careful not to lose that metadata; and some of them will no doubt build clever heuristics to cobble together substitute metadata. Interestingly, in spite of efforts to the contrary, you can’t really work with XML without additional metadata to help either. And that stuff is horribly complex.
As I like to emphasize, what really matters with an exchange standard is how many transactions per second move over it. No doubt this one has critical mass, at least inside the Google cloud. What matters here, for how important this standard might become, is how much adoption it gets from non-Google actors. But I suspect we will be speaking this dialect more often and for quite a while. Of course, the rudest way to say that is that we will be chasing their tail lights. But I don’t really feel that way, since I’ve never particularly liked XML and I welcome a substitute.
You actually can parse a protocol buffer without the .proto file, though you’ll lose a lot of the metadata of course:
http://blog.reverberate.org/2008/07/12/100-lines-of-c-that-can-parse-any-protocol-buffer/
Matt – I didn’t look at your code, so no doubt you’re right, but my impression was that since wire type 0 is used for both sint32 and int32 (for example), and they are encoded differently (i.e. ZigZag is used for one but not the other), you need the .proto to tell you if ZigZag is in use. – ben
That’s true – without the .proto you can get the wire type but not the actual field type.
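To illustrate what a schema-less reader can and cannot recover, here is a short sketch (not the linked C code) that walks the top-level fields of a buffer, assuming wire types 0, 1, 2 and 5. The `walk` function is hypothetical and written only for this example.

```python
def decode_varint(buf: bytes, pos: int):
    """Decode one base-128 varint; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result, pos
        shift += 7


def walk(buf: bytes):
    """Yield (field_number, wire_type, raw) for each top-level field."""
    pos = 0
    while pos < len(buf):
        k, pos = decode_varint(buf, pos)
        field_number, wire_type = k >> 3, k & 7
        if wire_type == 0:      # varint: int32, int64, sint32, bool, enum, ...
            raw, pos = decode_varint(buf, pos)
        elif wire_type == 1:    # fixed 64-bit
            raw, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:    # length-delimited: string, bytes, nested message
            length, pos = decode_varint(buf, pos)
            raw, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:    # fixed 32-bit
            raw, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        yield field_number, wire_type, raw


fields = list(walk(b'\x08\x96\x01\x1a\x03\x08\x96\x01'))
print(fields)
```

The walk recovers field numbers, wire types and raw values, but the raw 150 in field 1 could be an int32 of 150 or a sint32 of 75, and the three bytes in field 3 could be a string, opaque bytes, or a nested message.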
At first blush I thought that was a boo-boo, but I’ve got to wondering if they didn’t do some calculation that showed vast savings from leaving the design as is. … Nah, I think it was a mistake.
byhde, it’s simply not a design consideration. Making it possible to declare up to 32 different fields in a single byte header is more valuable than being able to infer any details.
The wire-type declaration is simply enough information to read and skip an unknown field. Nothing more.
Good things would have been enabled by signaling ZigZag explicitly.