Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.
Download this listing!

Linux Magazine Column 57 (Mar 2004)

[suggested title: ``Creating an Inline language'']

Back in one of the very first issues of Linux Magazine, I wrote about the Spew language. Spew takes a description of text and sentences and paragraphs, and generate a random result based on that description. For example, given the spew input file of:

  START: sentence
  sentence: person " " action "."
  person: "He" | "She" | "Barnacle Bill"
  action: "walks" | "sails" | "swims"

we'd end up with random combinations of people walking, sailing, and swimming. The grammar is specified using a simple BNF-like format, with extensions to give weighting to more-favored choices.

My implementation of spew consists of two parts: a parser to translate the grammar (such as the above) into a compiled internal representation, and then an execution engine to walk through that representation, selecting choices and subrules as needed.

To speed my development, I chose to use Parse::RecDescent to write the first part. While this allowed me to save time while writing my grammar translator, the cost came with a penalty. Now, in order to go from a spew grammar to a spew output, I had to first load Parse::RecDescent, then parse and translate my grammar's grammar (written in Parse::RecDescent's language), then execute that translation to parse the spew input to get a compiled representation of that grammar, and finally run through that final data structure to generate the random output.

For toy operation, this sufficed. But for anything significant, like a random mission statement generated on every RSS hit to my webserver, this setup would be prohibitively CPU expensive. But consider that 90% of the process is the same for every hit. All we need is a cache of that intermediate data structure, and we're done. But we want to ensure that we are using a cache based on the current grammar.

One nice framework that has come along over the years is the Inline framework, originally by Brian Ingerson. He realized that people didn't write a lot of interfaces to code written in languages like C because the XS language was so awkward to use, and wanted to simplify the mechanism. Also, XS required a shim to be created between Perl and the other library, and this required firing up a C compiler on the right input files. For this to be fast and transparent, the shim had to be easily located and cached, and not rebuilt needlessly.

The Inline framework solves this problem by creating an MD5 checksum of the source code, and use that value as part of a directory name located in a writable location. The directory holds the results of analyzing and translating that source code so that it can be quickly reused.

Initially, the only Inline module was Inline::C, for building quick XS glue for arbitrary C code and libraries. But over the following months, various other languages were also supported, including hooks to handle interpreted languages by caching their translation. And that's where we get back to Spew.

By following the Inline-API manpage (and staring at some of the other examples in the CPAN), I created the Inline::Spew module, which I've placed on the CPAN as well. Once installed, I can take a spew grammar and cache the translation in a transparent way:

  use Inline Spew => q{
  START: sentence
  sentence: person " " action "."
  person: "He" | "She" | "Barnacle Bill"
  action: "walks" | "sails" | "swims"
  };
  print spew(), "\n" for 1..5;

And we get (randomly):

  She swims.
  He sails.
  Barnacle Bill sails.
  Barnacle Bill walks.
  He walks.

The first time I invoke the script, the Inline mechanism computes the MD5 of the source text, and notes that I don't have a compiled version of the grammar. So the Spew compiler gets loaded, along with Parse::RecDescent and everything it needs. The grammar grammar gets compiled, then used to convert my grammar to a data structure. And here's the cool part: the data structure is then saved (as a YAML file). On subsequent invocations, we start by loading the data structure, and then performing the final random walk. On my laptop, this results in about a half a second of CPU time saved per invocation, as long as I don't change the grammar.

If I didn't like spew for my subroutine name, I could instead use an Inline configuration control of SUB to change the name. For example, I could change the name to comment with:

  use Inline Spew => q{
  START: sentence
  sentence: person " " action "."
  person: "He" | "She" | "Barnacle Bill"
  action: "walks" | "sails" | "swims"
  }, SUB => 'comment';
  print comment(), "\n" for 1..5;

And how does this all work? Let's examine the module code in [listing one, below]. This listing has been slightly abridged for inclusion here, but I've left the essential pieces in place.

Line 1 defines this module as belonging to Inline::Spew. Lines 3 and 4 bring in the Inline and YAML modules. Line 6 declares this module to be a subclass of Inline.

Lines 8 to 14 define the subroutine that registers this particular Inline module to the Inline framework. Inline goes through Perls' include path when invoked, finding all modules that match Inline::*, and brings them in to find out the languages they support. The register subroutine clarifies the language that this module supports, whether it is a compiled or interpreted language, and any filename suffix that should be used. These are returned as a hashref by protocol.

Line 16 defines the validation routine, to validate the configuration parameters being used. I'm lazy, and didn't write one, but this is where I would check that the only configuration parameter is SUB and that its value is a respectable subroutine name. Maybe in the next release...

Lines 19 to 31 form the compilation phase of this module. This routine will be called if there doesn't exist a cached compilation result for a given piece of Spew grammar. The only input parameter is the Inline object, copied here to $o in line 20.

Lines 21 and 22 extract the spew source text into $code, and the computed desired location of the spew ``compiled'' file into $location.

One thing I figured out by looking at other examples, although it's not clear in the documentation, is that the directory containing the to-be-created file defined in $location might not yet exist. It seems odd to me, because that means that every Inline language must then include a few lines of code to create the directory, rather than just being able to write directly into the directory. However, since that's the way things are done, I've added such code in lines 24 through 26. Oddly enough, there's an undocumented method call of mkpath to handle the path creation once I determine that the directory must be made, which I've seen in an example on the manpage. But why this isn't just part of the API, I do not know.

Once the directory is created, it's time to compile the spew input grammar into the intermediate data structure, via the call to spew_compile in line 28. (The spew_compile subroutine is defined later in this file.) Line 30 dumps this data structure as a YAML-formatted file to the location expected by Inline. And that ends our compilation step.

After the compilation step, Inline then loads the compiled module, defined here with the load subroutine beginning in line 33. First, the Inline object is shifted off into $o, as before (in line 34).

Next, we determine the name of the subroutine to be bound to the compiled object. Line 37 sets $s to the value of the SUB configuration value, defaulting to spew if absent. If the name doesn't contain a double colon, then we need to prefix the proper package. We can get the package name from the pkg variable in the API object hash, shown in line 39.

Lines 43 and 44 extract the compiled spew grammar from the $location file. At this point, $result[0] should be the same as $spew at the end of the build subroutines.

Lines 46 to 53 create the subroutine. First, in line 47, we have to turn off the strictness about symbolic references, because we're about to use a string value ($sub) as a reference. Next, we create a anonymous subroutine, and assign it to the glob defined using the name $sub. This effectively creates the subroutine as a named subroutine. This anonymous subroutine is also a closure, because it refers to the lexicals of $start and @result, defined in an outer scope. When the subroutine is invoked, it takes its only parameter (defaulting to START if not present), and calls the spew_show subroutine, defined below, along with the grammar data structure.

The first time a program with a particular grammar is run, the initial call to Inline invokes register to understand what Inline::Spew is about, validate on the configuration parameters (doing nothing here), build to translate the grammar into the stored file, then load to load that stored file and connect it up with the subroutine name of choice. On subsequent calls, the build step is skipped, but that's good, because that's the expensive one.

The remainder of the module listing is adapted very slightly from the original spew program presented in [this column, from the September 99 issue]. As such, I won't bother re-describing the code in detail, except to point out that the code in lines 132 through 135 computes the grammar grammar ``lazily''. The expensive Parse::RecDescent invocation and execution is avoided until we actually need to compile a spew grammar. Also note that my error checking is extremely clumsy: if a bad grammar is given, the program simply dies (line 137). This too could use a bit of cleaning up.

But it works as it is. Have fun with this module by creating varying web pages or automatic email responders. (I'd be interested to hear from you if you do.) Use the module as an example for your own Inline modules. And until next time, enjoy!

Listings

        =1=     package Inline::Spew;
        =2=     
        =3=     require Inline;
        =4=     require YAML;
        =5=     
        =6=     our @ISA = qw(Inline);
        =7=     
        =8=     sub register {
        =9=       return {
        =10=              language => 'Spew',
        =11=              type => 'interpreted',
        =12=              suffix => 'spew',
        =13=             };
        =14=    }
        =15=    
        =16=    sub validate {
        =17=    }
        =18=    
        =19=    sub build {
        =20=      my $o = shift;
        =21=      my $code = $o->{API}{code};
        =22=      my $location = "$o->{API}{location}";
        =23=    
        =24=      require File::Basename;
        =25=      my $directory = File::Basename::dirname($location);
        =26=      $o->mkpath($directory) unless -d $directory;
        =27=    
        =28=      my $spew = spew_compile($code);
        =29=    
        =30=      YAML::DumpFile($location, $spew);
        =31=    }
        =32=    
        =33=    sub load {
        =34=      my $o = shift;
        =35=    
        =36=      my $sub = do {
        =37=        my $s = $o->{CONFIG}{SUB} || "spew";
        =38=        unless ($s =~ /::/) {
        =39=          $s = $o->{API}{pkg}."::$s";
        =40=        }
        =41=        $s;
        =42=      };
        =43=      my $location = $o->{API}{location};
        =44=      my @result = YAML::LoadFile($location);
        =45=    
        =46=      {
        =47=        no strict 'refs';
        =48=        *$sub = sub {
        =49=          my $start = shift || "START";
        =50=          return spew_show($result[0], $start);
        =51=        };
        =52=      }
        =53=    }
        =54=    
        =55=    sub spew_show {
        =56=      my ($parsed, $defn) = @_;
        =57=      die "missing defn for $defn" unless exists $parsed->{$defn};
        =58=    
        =59=      my @choices = @{$parsed->{$defn}{is}};
        =60=      my $weight = 0;
        =61=      my @keeper = ();
        =62=      while (@choices) {
        =63=        my ($thisweight, @thisitem) = @{pop @choices};
        =64=        $thisweight = 0 if $thisweight < 0; # no funny stuff
        =65=        $weight += $thisweight;
        =66=        @keeper = @thisitem if rand($weight) < $thisweight;
        =67=      }
        =68=      my $result;
        =69=      for (@keeper) {
        =70=        ## should be a list of ids or defns
        =71=        die "huh $_ in $defn" if ref $defn;
        =72=        if (/^ (.*)/s) {
        =73=          $result .= $1;
        =74=        } elsif (/^(\w+)$/) {
        =75=          $result .= spew_show($parsed, $1);
        =76=        } else {
        =77=          die "Can't show $_ in $defn\n";
        =78=        }
        =79=      }
        =80=      return $result;
        =81=    }
        =82=    
        =83=    BEGIN {
        =84=    
        =85=      my $parser;
        =86=      my $GRAMMAR = q{
        =87=    
        =88=    { my %grammar; my $internal = 0; }
        =89=    
        =90=    grammar: rule(s) /\Z/ { \%grammar; }
        =91=    
        =92=    ## rule returns identifier (not used)
        =93=    rule: identifier ":" defn {
        =94=                    push @{$grammar{$item[1]}{is}}, @{$item[3]};
        =95=                    $grammar{$item[1]}{defined}{$itempos[1]{line}{to}}++;
        =96=                    $item[1];
        =97=            }
        =98=            | <error>
        =99=    
        =100=   ## defn returns listref of choices
        =101=   defn: <leftop: choice "|" choice>
        =102=   
        =103=   ## choice returns a listref of [weight => @items]
        =104=   choice: weight unweightedchoice { [ $item[1] => @{$item[2]} ] }
        =105=   
        =106=   ## weight returns weight if present, 1 if not
        =107=   weight: /\d+(\.\d+)?/ <commit> /\@/ { $item[1] } | { 1 }
        =108=   
        =109=   ## unweightedchoice returns a listref of @items
        =110=   unweightedchoice: item(s)
        =111=   
        =112=   ## item returns " literal text" or "identifier"
        =113=   item:
        =114=           { $_ = extract_quotelike($text) and " " . eval }
        =115=           | identifier <commit> ...!/:/ { # must not be followed by colon!
        =116=                   $grammar{$item[1]}{used}{$itempos[1]{line}{to}}++;
        =117=                   $item[1]; # non-leading space flags an identifier
        =118=           }
        =119=           | "(" defn ")" { # parens for recursion, gensym an internal
        =120=                   ++$internal;
        =121=                   push @{$grammar{$internal}{is}}, @{$item[2]};
        =122=                   $internal;
        =123=           }
        =124=           | <error>
        =125=   
        =126=   identifier: /[^\W\d]\w*/
        =127=   };
        =128=   
        =129=     sub spew_compile {
        =130=       my $source = shift;
        =131=   
        =132=       unless ($parser) {
        =133=         require Parse::RecDescent;
        =134=         $parser = Parse::RecDescent->new($GRAMMAR) or die "internal bad";
        =135=       }
        =136=   
        =137=       my $parsed = $parser->grammar($source) or die "bad spew grammar";
        =138=   
        =139=       for my $id (sort keys %$parsed) {
        =140=         next if $id =~ /^\d+$/;       # skip internals
        =141=         my $id_ref = $parsed->{$id};
        =142=         unless (exists $id_ref->{defined}) {
        =143=           die "$id used in @{[sort keys %{$id_ref->{used}}]} but not defined";
        =144=         }
        =145=       }    
        =146=   
        =147=       return $parsed;
        =148=     }
        =149=   }
        =150=   
        =151=   1;
        =152=   __END__
        =153=   =head1 NAME
        =154=   
        =155=   Inline::Spew - Inline module for Spew
        =156=   
        =157=   [rest of pod deleted]

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.