Archive for the 'programming' Category

cl-ppcre as a tokenizer

Tuesday, July 29th, 2008

CL-PPCRE is a Lisp clone of the perl regular expression subsystem (i gather it’s even faster).  I  love it because lots of things I used to do in Perl or AWK I now do in Lisp.  Each language has it’s coding tricks and there are lot for Perl’s regular expressions.  The Perl regex manpage is 30 pages, and the FAQ is 20 pages.  That’s a lot of clever, and double or triple is once you recall how concise Perl is!  I often is use all this to build lexical analysiers using regular expressions to nibble tokens off string. (One variation on that is outlined in the Perl FAQ.)

Both Cl-PPCRE and Perl have two notations for their regular expressions.  The concise one you’ve certainly used; and a much more verbose one.  For example you might write “alpha|beta” to match either of the two strings in the concise form, and in CL-PPCRE’s long form you’d write (:alternation “alpha” “beta”).  In Lisp you might write something like: (setf (parse-tree-synomym :tokens) ‘(:sequence :start-anchor (:alternation :begin :end :if …))) and then nibble off tokens by doing (scan :tokens program-text :start start).

The only problem with that is you lost track of what token was recognized.  My trick for that is to define the individual tokens so when they are matched the set a variable in the dynamic extent established for tokenizing.  Along these lines:  (setf (parse-tree-synonym :if) (:sequence “if” (:function set-token-kind-to-if)). The :function form is used to do call outs in the midst of the pattern matching.  It is the analog of Perl’s (?{ code }) construct.

That of course all get’s wrapped up in macros so it reads nicely, e.g: (deftoken :constant (”\\\d+”) (parse-integer (token-text))).

You can juice this up by having parse-tree-synonyms that keep track of the line number in the file your parsing.

If you get the details right this can be very efficent; both in time to code up and at runtime.  When combined with CL-YACC you can whip-up a parser for yet another language in a few hours.  Yesterday I built one for XDR.  Here is a simplified example.

SchemeCat

Sunday, July 27th, 2008

What you say?

Tuesday, July 15th, 2008

I believe it was Ray Kurzweil, circa 1989, who advised encouraging a private jargon inside your new company.  I remember because that was just about the time I was starting to think open would totally trump closed in our industry.  The advise seemed to my ears a bit old fashion.  But at the same time I suspected he meant that it was a good way to tighten the bonds inside the team.  By then I knew enough about cults to recognise that’s common inside cults; it supports two of the keys to running a good cult - information control, and thought stopping processes.

But don’t take any of that too seriously.  It is good advise none the less.  I’ll admit to being a Whorfarian, the words you highlight effect your thinking.

These days I’d add to all that though.  Language is the ultimate exchange standard.  So when you decide to innovate a new private language your cutting your self off and creating friction or trade barriers with your outside partners.   Importantly the advantage a new group has is that they can pick and choose what to emphasis.  They can take a run at leveraging some particular competitive advantage.  As Dave Winer says: You can’t win by zigging when he zigs. You have to zag to beat him.  Ray’s advise can be viewed as a bit of implementation advise for that.

So it was with some interest that I saw Google revealing their in house standard for serailizing data.  It’s not hard to see that Protocol Buffers are alternative to XML.  And it is amusing to, at least to me, to think that they did this in the hope of reducing the frictions that occur when they must translate from their in house argot into the dialects used by the outside world.  It’s fun to note that if your start up is as successful as Google you get to promulgate your private jargon.  It is one of the spoils of war.  You can push that friction into your compliments, make them pay the switching costs.

Protocol buffers aren’t anything special: messages are lists of key value pairs, keys can repeat, a small set of value types; including unicode strings and recursively other messages. They are very practical, close to the metal.  Choices were made and they are what they are.  They are quite dense, and easy to parse.  Many messages can be serialized in one pass.  In classic run length encoding style nested structures have their size mentioned in their header.  That makes emitting one pass serialization hard.

Given an array of bytes it wouldn’t be child play to guess that your holding a protocol buffer, you could do it huristically but it would still be a guess.   You need a protocol buffer’s declaration to parse it.  For example you absent the declaration you can’t know if a you’ve got a sint32 or an int64, etc.  All that disappointed me.  It disapointed my inner archivist and the inner peeping tom (who has often debugged tough problems by watching the bytes fly by on the wire).

There is a nice detail that allows optional keys which in turn makes it somewhat easier to add new message variants.  With luck the old message handlers can just ignore the additions.  It made me smile to note that mechanism can be used to pad messages; which in turn makes it more likely that you can serialize in a single pass.

There is another nice detail that allows a key to appear more than once in spite of the metadat saying it is single valued.  The last occurance wins.  This lets you implement a kind of inheritance/defaulting.  For example if your implementing CSS style sheets you read the default style message, and then read the variations from the default, and your ready to go.  They call that merging.

Given the declarative information for a given protocol buffer it’s not hard to convert it to XML or what every else you like.  The observers and archivists will just have to be careful to not loose that metadata; and some of them will no doubt build clever hueristic to cobble together substitute metadata.  Interestingly, inspite of efforts to the contrary, you can’t really work with XML without additional metadata to help.  And, that stuff is horribly complex.

As I like to emphasis what really matters with an exchange standard is how many transactions per second move over it.  No doubt this one has critical mass at least inside the Google cloud.  What matters here for how important this standard might be is how much adoption by non-Google actors.    But, I suspect we will be speaking this dialect more often and for quite a while.  Of course, the rudest way to say that is that we will be chasing their tail lights.  But I don’t really feel that way since I’ve never particularly liked XML and I welcome a substitute.

Better data, fewer customers

Wednesday, June 4th, 2008

Here’s an amusing business tactic: lock out the customers. You know the drill. You visit the website, log-in, and the vendor inserts an extra page forcing you to provide your missing zip code, or what ever. “Our data quality is more important than your time.”

So this company, a health club, locked their patrons out of the club. If they went to the desk they were told that their account was missing their email address.

It’s always good to keep things neat and tidy, but it’s impossible to get across how much this kind of thing drives customers away. Particularly because it drives off the most lightly connected customers. The ones that are hardest to model. The ones you desperately need going forward. There is probably some deep design principle here. You need to design the system to maximize the amount of chaos in your data. Homogeneity is a false God.

Meanwhile, if you enjoy keeping account data tidy you might want to drop by the Useless Account web site. Sign up! Edit your account profile to your hearts content.

Conditions in Common Lisp

Tuesday, May 6th, 2008

Dan Weinreb wrote a long blog post recently outlining what Common Lisp’s “Conditions (exceptions) are really about.”  He argues, and I won’t disagree, that they provide a clean, even elegant, way to handle all the long tail of situations that arise when your trying to be fastidious about the contract for your functions.  One of the things that used to be unique about Lisp was it’s elegant support for unwinding the stack when odd things forced you to abandon what ever your doing.  It was a lot easier to see to it that the printer handle was relinquished, the file closed, the gate shut in Lisp than it was in Fortran or C.  These days most languages are better about admitting that these things happen and something should be in the language to help.

Interestingly though the Common Lisp condition system remains a much more elegant thing than you find in other languages.  I don’t really understand why.  The condition system doesn’t just provide a way to unwind the stack, passing along an object instance to describe what ‘exceptional’ condition arose.  It also allows the handler for these conditions to resolve them.  In effect this provides an elegant way to implement call backs, i.e. that situation where a low level body of code needs to ask the higher, or later written, code how to resolve a problem.  For example you can use the condition system to implement the classic call back scenario where a GUI widget wants to provide it’s users with the ability to do additional checking on user input.

There’s more, but what I really wanted to get off my chest (and that’s what a blog’s for eh?) is that I sometimes I like to muse about raising a condition is yet another example of how complex function calling really is.  Back in the day we used to be able to figure out what function would implement a call site at link time.  Later we got used to having dynamic linking do some of that, and it was common in memory constrained systems to load and flush code on the fly under the covers when in the function call.  I’ve seen systems where you switched interpreters in the function call for performance and space trade off reasons.  These days people are used to having a complex method dispatch that puzzles out what method will will take care of the request.  The method dispatch is a messy beast when you allow new classes to be loaded.  It was with some pleasure that I first read the mythical man month’s description of how some ornate linker came to be obsoleted; since at the time I was struggling with how loading compiled lisp files was an amazingly disruptive event; since it upset all your method tables.

Generally method dispatch is done by searches up stream in the acyclic directed graph of types.  I.e. the behavior of the function is inherited from the class hierarchy.  The class hierarchy is not the only hierarchy in most programs.  For example there is almost always a containment hierarchy, i.e. the OK button is inside this dialog; or for example this file write is taking place inside the task of writing the log, inside the task of handling this web request.

At one point in my life I was quite enthusiastic about the idea of language designs that might allow these other hierarchies more respect.  During that period I built a few class systems where behavior was inherited first from the containment hierarchy and only secondarily from the class hierarchy.

What I like about conditions is how when a condition is raised and the system goes off seeking a handler for that condition - it’s just another variant of method dispatch - and the call hierarchy is were we search for the handler.  For me the fact that these calls often don’t return; or return in such unusual ways is part of what’s delightful about the condition system.