cl-ppcre as a tokenizer

CL-PPCRE is a Lisp clone of the Perl regular expression subsystem (I gather it’s even faster). I love it because lots of things I used to do in Perl or AWK I can now do in Lisp. Each language has its coding tricks, and there are a lot for Perl’s regular expressions. The Perl regex manpage is 30 pages, and the FAQ is 20 pages. That’s a lot of clever, and double or triple that once you recall how concise Perl is! I often use all this to build lexical analyzers, using regular expressions to nibble tokens off a string. (One variation on that is outlined in the Perl FAQ.)

Both CL-PPCRE and Perl have two notations for their regular expressions: the concise one you’ve certainly used, and a much more verbose one. For example, you might write "alpha|beta" to match either of the two strings in the concise form, and in CL-PPCRE’s long form you’d write (:alternation "alpha" "beta"). In Lisp you might write something like: (setf (parse-tree-synonym :tokens) '(:sequence :start-anchor (:alternation :begin :end :if …))) and then nibble off tokens by doing (scan :tokens program-text :start start).
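The nibbling loop can be sketched roughly as follows. This is a minimal sketch, not the original code: it assumes Quicklisp is available to load CL-PPCRE, and the names token, *token-tree*, *token-scanner* and nibble-tokens are made up for illustration. It uses plain symbols rather than keywords as synonym names and adds a whitespace skip so the tokens can be separated by blanks.

```lisp
;; Assumption: Quicklisp is installed and can load CL-PPCRE.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (ql:quickload :cl-ppcre))

;; Hypothetical synonym: one alternative per keyword of a toy language.
(setf (ppcre:parse-tree-synonym 'token)
      '(:alternation "begin" "end" "if"))

;; Optional blanks, then one token captured in the first register.
;; The start anchor pins the match to the :start position given to SCAN.
(defparameter *token-tree*
  '(:sequence :start-anchor
    (:greedy-repetition 0 nil :whitespace-char-class)
    (:register token)))

;; Synonyms are expanded when the scanner is built, so build it last.
(defparameter *token-scanner* (ppcre:create-scanner *token-tree*))

(defun nibble-tokens (text)
  "Return the token strings at the front of TEXT, in order.
Stops quietly at the first position where no token matches."
  (let ((tokens '())
        (start 0))
    (loop
      (multiple-value-bind (match-start match-end reg-starts reg-ends)
          (ppcre:scan *token-scanner* text :start start)
        (unless match-start
          (return))
        (push (subseq text (aref reg-starts 0) (aref reg-ends 0)) tokens)
        ;; Resume right after the token just nibbled off.
        (setf start match-end)))
    (nreverse tokens)))
```

Calling (nibble-tokens "begin if end") walks the string one token at a time. Note that this version only tells you the text of each token, not which alternative matched.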

The only problem with that is you lose track of which token was recognized. My trick for that is to define the individual tokens so that when they are matched they set a variable in the dynamic extent established for tokenizing. Along these lines: (setf (parse-tree-synonym :if) '(:sequence "if" (:function set-token-kind-to-if))). The :function form is used to do call-outs in the midst of the pattern matching. It is the analog of Perl’s (?{ code }) construct.
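Here is a hedged sketch of that trick. The CL-PPCRE documentation I know of spells the call-out form (:filter function), where the function receives the current position and returns the next position (or NIL to fail), so the sketch assumes that is the form meant by :function above. The names *token-kind*, kind-setter, if-token, constant-token and next-token are hypothetical, and Quicklisp is assumed for loading.

```lisp
;; Assumption: Quicklisp is installed and can load CL-PPCRE.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (ql:quickload :cl-ppcre))

(defvar *token-kind* nil
  "Kind of the token matched last; rebound for each call to NEXT-TOKEN.")

(defun kind-setter (kind)
  "Return a filter function that records KIND and consumes no characters."
  (lambda (position)
    (setf *token-kind* kind)
    position))                  ; returning the position means success

;; Each token ends with a filter, so it only runs once its text has matched.
(setf (ppcre:parse-tree-synonym 'if-token)
      (list :sequence "if" (list :filter (kind-setter :if))))
(setf (ppcre:parse-tree-synonym 'constant-token)
      (list :sequence '(:greedy-repetition 1 nil :digit-class)
            (list :filter (kind-setter :constant))))

(defparameter *kind-tree*
  '(:sequence :start-anchor (:alternation if-token constant-token)))
(defparameter *kind-scanner* (ppcre:create-scanner *kind-tree*))

(defun next-token (text start)
  "Return the token kind, its text, and the next position, or NIL."
  (let ((*token-kind* nil))     ; the dynamic extent for this match
    (multiple-value-bind (match-start match-end)
        (ppcre:scan *kind-scanner* text :start start)
      (when match-start
        (values *token-kind*
                (subseq text match-start match-end)
                match-end)))))
```

The filter sits at the very end of each alternative with nothing after it, so the last value written to *token-kind* comes from the alternative that actually matched. If more pattern followed the filter, backtracking could leave a stale value behind.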

That of course all gets wrapped up in macros so it reads nicely, e.g.: (deftoken :constant ("\\d+") (parse-integer (token-text))).
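One way such a macro could look is sketched below. This is a guess at the shape, not the original deftoken: the bookkeeping variables, token-text, and read-token are all hypothetical, the call-out uses CL-PPCRE's documented :filter form, and Quicklisp is assumed for loading.

```lisp
;; Assumption: Quicklisp is installed and can load CL-PPCRE.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (ql:quickload :cl-ppcre))

(defvar *token-kind* nil "Kind of the token matched last.")
(defvar *token-text* nil "Text of the token whose value is being computed.")
(defvar *token-kinds* '() "Token kinds, in definition order.")
(defvar *token-values* (make-hash-table)
  "Maps a token kind to the function that computes its value.")

(defun token-text ()
  "Text of the current token, for use inside a DEFTOKEN body."
  *token-text*)

(defmacro deftoken (kind (regex) &body body)
  "Define token KIND as REGEX; BODY computes the value of a matched token."
  `(progn
     ;; The token's pattern, followed by a filter that records its kind.
     (setf (ppcre:parse-tree-synonym ',kind)
           (list :sequence
                 (ppcre:parse-string ,regex)
                 (list :filter (lambda (position)
                                 (setf *token-kind* ',kind)
                                 position))))
     (setf (gethash ',kind *token-values*) (lambda () ,@body))
     (unless (member ',kind *token-kinds*)
       (setf *token-kinds* (append *token-kinds* (list ',kind))))
     ',kind))

(deftoken :constant ("\\d+") (parse-integer (token-text)))
(deftoken :if ("if") :if)

(defun read-token (text start)
  "Return the token kind, its value, and the next position, or NIL."
  (let ((*token-kind* nil)
        ;; Rebuilt on every call for brevity; a real tokenizer would cache it.
        (scanner (ppcre:create-scanner
                  `(:sequence :start-anchor (:alternation ,@*token-kinds*)))))
    (multiple-value-bind (match-start match-end)
        (ppcre:scan scanner text :start start)
      (when match-start
        (let ((*token-text* (subseq text match-start match-end)))
          (values *token-kind*
                  (funcall (gethash *token-kind* *token-values*))
                  match-end))))))
```

With these two definitions, (read-token "123 if" 0) yields the kind :constant and the integer value computed by the deftoken body.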

You can juice this up by having parse-tree-synonyms that keep track of the line number in the file you’re parsing.
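A possible shape for the line counting, again only a sketch under assumptions: the names *line-number*, counted-newline, *blank-tree*, *blank-scanner* and skip-blanks are invented here, the call-out uses CL-PPCRE's documented :filter form, and Quicklisp is assumed for loading.

```lisp
;; Assumption: Quicklisp is installed and can load CL-PPCRE.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (ql:quickload :cl-ppcre))

(defvar *line-number* 1 "Line of the text currently being tokenized.")

;; A newline that bumps the line counter each time it is matched.
(setf (ppcre:parse-tree-synonym 'counted-newline)
      (list :sequence #\Newline
            (list :filter (lambda (position)
                            (incf *line-number*)
                            position))))

;; Any run of blanks, anchored at the :start position given to SCAN.
(defparameter *blank-tree*
  '(:sequence :start-anchor
    (:greedy-repetition 0 nil
     (:alternation counted-newline #\Space #\Tab))))
(defparameter *blank-scanner* (ppcre:create-scanner *blank-tree*))

(defun skip-blanks (text start)
  "Return the position after the blanks at START, updating *LINE-NUMBER*."
  (nth-value 1 (ppcre:scan *blank-scanner* text :start start)))
```

Because nothing follows the repetition, each newline is matched once and never revisited. Embedding the same synonym in front of a pattern that can fail and backtrack would need more care, since the counter is not rolled back.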

If you get the details right this can be very efficient, both in time to code up and at runtime. When combined with CL-YACC you can whip up a parser for yet another language in a few hours. Yesterday I built one for XDR. Here is a simplified example.

Ascription is an Anathema to any Enthusiasm

Ben Hyde
