Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.
Download this listing!

Linux Magazine Column 04 (Sep 1999)

[suggested title: Writing Randomly]

Over the 22 years of my professional writing career (interspersed and overlapped with gigs as a system administrator, security consultant, software engineer, systems tester, video producer, and karaoke KJ), I've written quite a few books, columns, Usenet postings, and email messages. Some have accused me of having some sort of artificial means or program to actually spit out this volume of text. Well, it's time for me to finally 'fess up. Everything I've ever written, including this column, has actually been the product of a series of programs of ever-increasing complexity.

In fact, this column is being written by a program that will write about a simple version of itself, which it, er, I mean, I call spew. spew takes a simple BNF-like random sentence grammar describing the text to be generated, then performs a random walk through the grammar, producing a single result per invocation that would successfully be parsed by that grammar. Elements can be ``weighted'', preferring one choice over another, and output strings are full Perl strings, so that control characters can be easily specified.

Given the proper input, a carefully constructed random sentence grammar could generate a random haiku, or as you'll see here, a pointy-hair-boss mission statement. Or even a canned-but-variable email reply to a boring question. Not that I would ever do anything like that.

This spew program was inspired by a recent Usenet posting (archived at http://www.deja.com/[ST_rn=ps]/getdoc.xp?AN=473743216) from Gee Starr, who wrote:

I'm looking for a perl script that uses recursive grammar techniques to generate random sentences. I've found several scripts that will throw up a string of text chosen from a pre-made list, but I'd really like to find something that generates sentences on the fly.


I know there've been a few non-perl solutions like Spew and the Dada
Engine; what I'm looking for doesn't have to be that complex -- Don
Cross's Javascript implementation is along the scale I'm thinking.

Anybody here know of something like that?

Having seen that, I recalled back 20 years to the time I wrote just such a program to test my abilities at using the Unix YACC and LEX tools that I had just read about, called yak (for Yet Another Klugey generator, as well as the meaning of having something that yaks at you). So, I whipped out Damian Conway's wonderful Parse::RecDescent module, with which I had been playing recently, and generated the program in [listing one, below].

The first three lines start most of my long programs, enabling warnings, turning on compiler restrictions, and disabling buffering on STDOUT.

Line 5 brings in the Parse::RecDescent module, found in the CPAN. You need a fairly late-model version of this rapidly developing module (I used version 1.65), so if the program doesn't work the first time, try an upgrade. Line 6 brings in the standard Getopt::Long module.

Lines 8 through 10 look for the designated ``start'' symbol for the random sentence grammar. A grammar can define multiple starting places, defaulting to START unless specified. $START will contain this name later.

Lines 14 through 69 define input to Parse::RecDescent to build a parser for the small BNF random sentence grammar describing the random sentences. Yes, we're building a parser to build an un-parser, so some of the terminology is overlapping. Bear with me. In fact, I tried to make the BNF look very much like Parse::RecDescent grammar as well. If Parse::RecDescent can create this random sentence grammar parser successfully, we get a parsing object in $parser. This will fail only when I've hacked the lines in the source code, hence the simple bad! error.

Lines 16 through 27 describe the return value from the parser once it is executed against a random sentence grammar. In brief, we'll have a hashref with keys being the various non-terminal symbols of the grammar, and the corresponding values will be the information about that non-terminal. The is nested sub-key contains a series of choices, which will be randomly selected during the random walk. Some of the item entries will be literal strings to be printed (indicated by a leading space, stripped before printing), and some are further non-terminal symbols to be walked recursively. To generate this data structure, we parse the random sentence grammar with the grammar that follows.

I'm not going to go into detail about the (Parse::RecDescent) grammar of the (random sentence) grammar, because Damian has done an excellent job in his documentation for Parse::RecDecent. However, if you're familiar with general BNF, you should be able to pick out most of the elements of the random sentence grammar. Some of the slightly noteworthy things include:

  1. Non-terminals of both this grammar and the random sentence grammar are C-symbols (same as Perl identifiers).

  2. This grammar uses both regular expressions and constant strings as terminals, and those terminals are separated by optional white space.

  3. The grammar keeps track of the line numbers where non-terminals are defined and used (for error reporting).

  4. When a rule has a subrule (a parenthesized part), a non-terminal entry is generated for the subrule, but assigned a sequentially increasing integer instead of a real name. In all other respects, the non-terminal acts identical to a user-defined non-terminal. This means that:

      foo: a ( b c | d e ) f | g h

    is the same as

      foo: a 1 f | g h
      1: b c | d e

    Except that you can't really have a non-terminal named 1.

  5. Weights are added by prefixing a choice with a positive floating-point number followed by an @, as in:

      foo: 5 @ fred | 2 @ barney | 1.5 @ dino | betty

    which is five times as likely to pick fred as betty (or a total of 5 out of 9.5 times). This is simpler than repeating a grammar choice multiple times, as I've seen in other similar programs, and makes available fine-grained ratio definitions.

  6. Some productions use lterrorgt, which hints the parser generator as to what to tell the user if something isn't found. For example, if while looking for an item, the parser finds neither a Perl-style quoted string, nor a non-terminal, nor a subrule, the error flag says that this is worthy of an error message at this level, offering the user what the valid choices were at this point. Without that, the error message would have been generated at a higher level, merely indicating that the grammar was bad.

After the parser is constructed, lines 71 through 74 fetch the random sentence grammar from the filenames specified on the command line. Standalone comment lines are discarded, but same-line comments aren't recognized.

Line 76 uses the constructed parser to parse the random sentence grammar, yielding the definitions for the non-terminals. Lines 78 to 87 walk through the parsed output, looking for inconsistencies in non-terminals used but not defined, or vice versa. Note that we don't actually abort the program if something is used but not defined, because it might not be selected on this particular random walk. It's a bit like playing russian roulette though.

Line 89 is a debugging stub. If you want to see the output of the parser, uncomment the line. I used that quite a bit while I was debugging the parser. Line 90 starts the random walk through the random sentence grammar, by generating exactly one item of the top-level non-terminal.

Lines 92 through 116 define the show subroutine, extracting the right portions of the parsed random sentence grammar. Lines 96 to 104 select one of the choices, based on their weights. Lines 105 through 115 dump each item of that choice. If the item is a terminal, it gets printed in line 109. If the item is a symbol, it's a non-terminal, and show gets called recursively to dump the new non-terminal.

So, there's the spew program, but to be fun, we need some input. I typed random sentence into www.google.com looking for some grammars. The most interesting hit I got led me to the Dilbert Mission Generator, located at http://www.dilbert.com/comics/dilbert/career/bin/ms2.cgi. I spent about an hour hitting reload repeatedly to reverse-engineer the output, and came up with [listing two, below]. I've cleaned up some of the choices, and fixed a few misspellings, so this grammar isn't quite what you see there.

The start symbol is START for convenience, in line 5. The output is four missions (one for each fiscal quarter), as shown in line 7. I named the non-terminals similar to the language they evoke, because that makes it easier for me to see what fits together.

Note that lines 14, 15, and 17 use subrules to eliminate some typing, and that line 11 has a weight of 2, choosing the missions that have reasons twice as likely as the missions that are unjustified.

The lists of terms in lines 50 to 84 came off the ``word list'' page linked from the mission generator, with a small bit of modification.

If you stick this into mission.spew, a simple invocation of spew mission.spew yields something like this on standard output:

We have committed to collaboratively integrate ethical sources while promoting personal employee growth.

Our challenge is to professionally simplify progressive services to set us apart from the competition.

We assertively restore virtual data so that we may synergistically create parallel infrastructures while promoting personal employee growth.

It is our business to authoritatively network prospective products so that we may endeavor to efficiently customize seven-habits-conforming services while promoting personal employee growth.

Hmm, I think I heard those very words at the last company shareholder meeting. Those employees sure are growing too.

One of the other results of my research into some random grammars led to a nice little stash at http://www-cs-faculty.stanford.edu/~zelenski/rsg/grammars/. The trouble is that they were all in some weird format that wasn't good as input to my spew program. Well, in a short period of time, I had written a conversion program to go from the one format to the other, given in [listing three, below]. Now, I could leverage off the student's works to generate Bionic Woman episodes and Microsoft Press Releases. How fun!

Anyway, I hope you're now convinced that Randal is actually a figment of his own imagination, and all of the work is really being done by the work of clever programs. [Note to editors: please continue to send the payments for the columns though... we're still paying for hardware maintenance.] Until next time, enjoy!

Listings

        =0=     #### LISTING ONE ####
        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     use Parse::RecDescent 1.65;
        =6=     use Getopt::Long;
        =7=     
        =8=     GetOptions(
        =9=                "start=s" => \ (my $START = "START"),
        =10=              ) or die "see code for usage\n";
        =11=    
        =12=    ## define the grammar of the spew grammar:
        =13=    
        =14=    (my $parser = Parse::RecDescent->new(<<'END_OF_GRAMMAR')) or die "bad!";
        =15=    
        =16=    ## return hashref
        =17=    ## { ident => {
        =18=    ##     is => [
        =19=    ##       [weight => item, item, item, ...],
        =20=    ##       [weight => item, item, item, ...], ...
        =21=    ##     ],
        =22=    ##     defined => { line-number => times }
        =23=    ##     used => { line-number => times }
        =24=    ##   }, ...
        =25=    ## }
        =26=    ## item is " literal" or ident
        =27=    ## ident is C-symbol or number (internal for nested rules)
        =28=    
        =29=    { my %grammar; my $internal = 0; }
        =30=    
        =31=    grammar: rule(s) /\Z/ { \%grammar; }
        =32=    
        =33=    ## rule returns identifier (not used)
        =34=    rule: identifier ":" defn {
        =35=                    push @{$grammar{$item[1]}{is}}, @{$item[3]};
        =36=                    $grammar{$item[1]}{defined}{$itempos[1]{line}{to}}++;
        =37=                    $item[1];
        =38=            }
        =39=            | <error>
        =40=    
        =41=    ## defn returns listref of choices
        =42=    defn: <leftop: choice "|" choice>
        =43=    
        =44=    ## choice returns a listref of [weight => @items]
        =45=    choice: weight unweightedchoice { [ $item[1] => @{$item[2]} ] }
        =46=    
        =47=    ## weight returns weight if present, 1 if not
        =48=    weight: /\d+(\.\d+)?/ <commit> /\@/ { $item[1] } | { 1 }
        =49=    
        =50=    ## unweightedchoice returns a listref of @items
        =51=    unweightedchoice: item(s)
        =52=    
        =53=    ## item returns " literal text" or "identifier"
        =54=    item:
        =55=            { $_ = extract_quotelike($text) and " " . eval }
        =56=            | identifier <commit> ...!/:/ { # must not be followed by colon!
        =57=                    $grammar{$item[1]}{used}{$itempos[1]{line}{to}}++;
        =58=                    $item[1]; # non-leading space flags an identifier
        =59=            }
        =60=            | "(" defn ")" { # parens for recursion, gensym an internal
        =61=                    ++$internal;
        =62=                    push @{$grammar{$internal}{is}}, @{$item[2]};
        =63=                    $internal;
        =64=            }
        =65=            | <error>
        =66=    
        =67=    identifier: /[^\W\d]\w*/
        =68=    
        =69=    END_OF_GRAMMAR
        =70=    
        =71=    my @data = <>;
        =72=    for (@data) {
        =73=      s/^\s*#.*//;
        =74=    }
        =75=    
        =76=    (my $parsed = $parser->grammar(join '', @data)) or die "bad parse";
        =77=    
        =78=    for my $id (sort keys %$parsed) {
        =79=      next if $id =~ /^\d+$/;       # skip internals
        =80=      my $id_ref = $parsed->{$id};
        =81=      unless (exists $id_ref->{defined}) {
        =82=        print "$id used in @{[sort keys %{$id_ref->{used}}]} but not defined - FATAL\n";
        =83=      }
        =84=      unless (exists $id_ref->{used} or $id eq $START) {
        =85=        print "$id defined in @{[sort keys %{$id_ref->{defined}}]} but not used - WARNING\n";
        =86=      }
        =87=    }    
        =88=    
        =89=    #DEBUGGING:# use Data::Dumper; print Dumper($parsed);
        =90=    show($START);
        =91=    
        =92=    sub show {
        =93=      my $defn = shift;
        =94=      die "missing defn for $defn" unless exists $parsed->{$defn};
        =95=    
        =96=      my @choices = @{$parsed->{$defn}{is}};
        =97=      my $weight = 0;
        =98=      my @keeper = ();
        =99=      while (@choices) {
        =100=       my ($thisweight, @thisitem) = @{pop @choices};
        =101=       $thisweight = 0 if $thisweight < 0; # no funny stuff
        =102=       $weight += $thisweight;
        =103=       @keeper = @thisitem if rand($weight) < $thisweight;
        =104=     }
        =105=     for (@keeper) {
        =106=       ## should be a list of ids or defns
        =107=       die "huh $_ in $defn" if ref $defn;
        =108=       if (/^ (.*)/s) {
        =109=         print $1;
        =110=       } elsif (/^(\w+)$/) {
        =111=         show($1);
        =112=       } else {
        =113=         die "Can't show $_ in $defn\n";
        =114=       }
        =115=     }
        =116=   }
        =0=     #### LISTING TWO ####
        =1=     ## Our challenge is to effectively reverse-engineer the output of:
        =2=     ## http://www.dilbert.com/comics/dilbert/career/bin/ms2.cgi
        =3=     ## as well as dynamically provide interactive feedback. :-)
        =4=     
        =5=     START: missions
        =6=     
        =7=     missions: mission "\n\n" mission "\n\n" mission "\n\n" mission "\n"
        =8=     
        =9=     mission:
        =10=      Our_job_is_to " " do_goals "." |
        =11=      2 @ Our_job_is_to " " do_goals " " because "."
        =12=    
        =13=    Our_job_is_to:
        =14=      ("It is our " | "It's our ") job " to" |
        =15=      "Our " job (" is to" | " is to continue to") |
        =16=      "The customer can count on us to" |
        =17=      ("We continually " | "We ") ("strive" | "envision" | "exist") " to" |
        =18=      "We have committed to" |
        =19=      "We"
        =20=    
        =21=    job:
        =22=      "business" | "challenge" | "goal" | "job" | "mission" | "responsibility"
        =23=      
        =24=    do_goals:
        =25=      goal | goal " " in_order_to " " goal
        =26=    
        =27=    in_order_to:
        =28=      "as well as to" |
        =29=      "in order that we may" |
        =30=      "in order to" |
        =31=      "so that we may endeavor to" |
        =32=      "so that we may" |
        =33=      "such that we may continue to" |
        =34=      "to allow us to" |
        =35=      "while continuing to" |
        =36=      "and"
        =37=    
        =38=    because:
        =39=      "because that is what the customer expects" |
        =40=      "for 100% customer satisfaction" |
        =41=      "in order to solve business problems" |
        =42=      "to exceed customer expectations" |
        =43=      "to meet our customer's needs" |
        =44=      "to set us apart from the competition" |
        =45=      "to stay competitive in tomorrow's world" |
        =46=      "while promoting personal employee growth"
        =47=    
        =48=    goal: adverbly " " verb " " adjective " " noun
        =49=    
        =50=    adverbly:
        =51=      "quickly" | "proactively" | "efficiently" | "assertively" |
        =52=      "interactively" | "professionally" | "authoritatively" |
        =53=      "conveniently" | "completely" | "continually" | "dramatically" |
        =54=      "enthusiastically" | "collaboratively" | "synergistically" |
        =55=      "seamlessly" | "competently" | "globally"
        =56=    
        =57=    
        =58=    verb:
        =59=      "maintain" | "supply" | "provide access to" | "disseminate" |
        =60=      "network" | "create" | "engineer" | "integrate" | "leverage other's" |
        =61=      "leverage existing" | "coordinate" | "administrate" | "initiate" |
        =62=      "facilitate" | "promote" | "restore" | "fashion" | "revolutionize" |
        =63=      "build" | "enhance" | "simplify" | "pursue" | "utilize" | "foster" |
        =64=      "customize" | "negotiate"
        =65=    
        =66=    adjective:
        =67=      "professional" | "timely" | "effective" | "unique" | "cost-effective" |
        =68=      "virtual" | "scalable" | "economically sound" |
        =69=      "inexpensive" | "value-added" | "business" | "quality" | "diverse" |
        =70=      "high-quality" | "competitive" | "excellent" | "innovative" |
        =71=      "corporate" | "high standards in" | "world-class" | "error-free" |
        =72=      "performance-based" | "multimedia-based" | "market-driven" |
        =73=      "cutting edge" | "high-payoff" | "low-risk high-yield" |
        =74=      "long-term high-impact" | "prospective" | "progressive" | "ethical" |
        =75=      "enterprise-wide" | "principle-centered" | "mission-critical" |
        =76=      "parallel" | "interdependent" | "emerging" |
        =77=      "seven-habits-conforming" | "resource-leveling"
        =78=    
        =79=    noun:
        =80=      "content" | "paradigms" | "data" | "opportunities" |
        =81=      "information" | "services" | "materials" | "technology" | "benefits" |
        =82=      "solutions" | "infrastructures" | "products" | "deliverables" |
        =83=      "catalysts for change" | "resources" | "methods of empowerment" |
        =84=      "sources" | "leadership skills" | "meta-services" | "intellectual capital"
        =0=     #### LISTING THREE ####
        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     use Parse::RecDescent;
        =6=     
        =7=     (my $parser = Parse::RecDescent->new(<<'END_OF_GRAMMAR')) or die "bad!";
        =8=     
        =9=     start: <leftop: junk rule junk> /\Z/ { 1 }
        =10=    
        =11=    junk: /[^{}]*/ { 1 }
        =12=    
        =13=    rule: "{" defining choices "}" { print "$item[2]: $item[3]\n\n" }
        =14=      | <error>
        =15=    
        =16=    defining: symbol ";" <commit> { $item[1] } | symbol
        =17=    
        =18=    symbol: /^<(\S+)>/ {
        =19=      my $x = $1;
        =20=      $x =~ s/([^a-zA-Z])/sprintf "_%02x_", ord $1/ge;
        =21=      if ($x eq "start") {
        =22=        $x = "START";
        =23=      }
        =24=      ## warn "saw symbol $x\n";
        =25=      $x;
        =26=    }
        =27=    
        =28=    choices: choice(s) { join " |\n  ", @ {$item[1]} } | { q{""} }
        =29=    
        =30=    choice: item(s) ";" { join q{ " " }, @ {$item[1]} }
        =31=      | <error>
        =32=    
        =33=    item: symbol | wordlist | <error>
        =34=    
        =35=    wordlist: word(s) { "q{" . join( " ", @ {$item[1]} ) . "}" }
        =36=    
        =37=    word: /^"(\S+)"/ <commit> { $1 } | /^[^\s<>{};]+/
        =38=    
        =39=    END_OF_GRAMMAR
        =40=    
        =41=    (my $parsed = $parser->start(join '', <>)) or die "bad parse";
        =42=    
        =43=    ## see http://www-cs-faculty.stanford.edu/~zelenski/rsg/grammars/

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.