Copyright Notice
This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.This text has appeared in an edited form in Linux Magazine magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Download this listing! | ||
Linux Magazine Column 04 (Sep 1999)
[suggested title: Writing Randomly]
Over the 22 years of my professional writing career (interspersed and overlapped with gigs as a system administrator, security consultant, software engineer, systems tester, video producer, and karaoke KJ), I've written quite a few books, columns, Usenet postings, and email messages. Some have accused me of having some sort of artificial means or program to actually spit out this volume of text. Well, it's time for me to finally 'fess up. Everything I've ever written, including this column, has actually been the product of a series of programs of ever-increasing complexity.
In fact, this column is being written by a program that will write
about a simple version of itself, which it, er, I mean, I call
spew
. spew
takes a simple BNF-like random sentence grammar
describing the text to be generated, then performs a random walk
through the grammar, producing a single result per invocation that
would successfully be parsed by that grammar. Elements can be
``weighted'', preferring one choice over another, and output strings are
full Perl strings, so that control characters can be easily specified.
Given the proper input, a carefully constructed random sentence grammar could generate a random haiku, or as you'll see here, a pointy-hair-boss mission statement. Or even a canned-but-variable email reply to a boring question. Not that I would ever do anything like that.
This spew
program was inspired by a recent Usenet posting (archived
at http://www.deja.com/[ST_rn=ps]/getdoc.xp?AN=473743216
) from Gee
Starr, who wrote:
I'm looking for a perl script that uses recursive grammar techniques to generate random sentences. I've found several scripts that will throw up a string of text chosen from a pre-made list, but I'd really like to find something that generates sentences on the fly.
I know there've been a few non-perl solutions like Spew and the Dada Engine; what I'm looking for doesn't have to be that complex -- Don Cross's Javascript implementation is along the scale I'm thinking.
Anybody here know of something like that?
Having seen that, I recalled back 20 years to the time I wrote just
such a program to test my abilities at using the Unix YACC and LEX
tools that I had just read about, called yak
(for Yet Another
Klugey generator, as well as the meaning of having something that yaks
at you). So, I whipped out Damian Conway's wonderful Parse::RecDescent
module, with which I had been playing recently, and generated
the program in [listing one, below].
The first three lines start most of my long programs, enabling warnings, turning on compiler restrictions, and disabling buffering on STDOUT.
Line 5 brings in the Parse::RecDescent
module, found in the CPAN.
You need a fairly late-model version of this rapidly developing module
(I used version 1.65), so if the program doesn't work the first time,
try an upgrade. Line 6 brings in the standard Getopt::Long
module.
Lines 8 through 10 look for the designated ``start'' symbol for the
random sentence grammar. A grammar can define multiple starting
places, defaulting to START
unless specified. $START
will
contain this name later.
Lines 14 through 69 define input to Parse::RecDescent
to build a
parser for the small BNF random sentence grammar describing the random
sentences. Yes, we're building a parser to build an un-parser, so
some of the terminology is overlapping. Bear with me. In fact, I
tried to make the BNF look very much like Parse::RecDescent
grammar
as well. If Parse::RecDescent
can create this random sentence
grammar parser successfully, we get a parsing object in $parser
.
This will fail only when I've hacked the lines in the source code,
hence the simple bad!
error.
Lines 16 through 27 describe the return value from the parser once it
is executed against a random sentence grammar. In brief, we'll have a
hashref with keys being the various non-terminal symbols of the
grammar, and the corresponding values will be the information about
that non-terminal. The is
nested sub-key contains a series of
choices, which will be randomly selected during the random walk. Some
of the item
entries will be literal strings to be printed
(indicated by a leading space, stripped before printing), and some are
further non-terminal symbols to be walked recursively. To generate
this data structure, we parse the random sentence grammar with the
grammar that follows.
I'm not going to go into detail about the (Parse::RecDescent
)
grammar of the (random sentence) grammar, because Damian has done an
excellent job in his documentation for Parse::RecDecent
. However,
if you're familiar with general BNF, you should be able to pick out
most of the elements of the random sentence grammar. Some of the slightly
noteworthy things include:
-
Non-terminals of both this grammar and the random sentence grammar are C-symbols (same as Perl identifiers).
-
This grammar uses both regular expressions and constant strings as terminals, and those terminals are separated by optional white space.
-
The grammar keeps track of the line numbers where non-terminals are defined and used (for error reporting).
-
When a rule has a subrule (a parenthesized part), a non-terminal entry is generated for the subrule, but assigned a sequentially increasing integer instead of a real name. In all other respects, the non-terminal acts identical to a user-defined non-terminal. This means that:
foo: a ( b c | d e ) f | g h
is the same as
foo: a 1 f | g h 1: b c | d e
Except that you can't really have a non-terminal named
1
. -
Weights are added by prefixing a choice with a positive floating-point number followed by an
@
, as in:foo: 5 @ fred | 2 @ barney | 1.5 @ dino | betty
which is five times as likely to pick
fred
asbetty
(or a total of 5 out of 9.5 times). This is simpler than repeating a grammar choice multiple times, as I've seen in other similar programs, and makes available fine-grained ratio definitions. -
Some productions use
lt
errorgt
, which hints the parser generator as to what to tell the user if something isn't found. For example, if while looking for an item, the parser finds neither a Perl-style quoted string, nor a non-terminal, nor a subrule, the error flag says that this is worthy of an error message at this level, offering the user what the valid choices were at this point. Without that, the error message would have been generated at a higher level, merely indicating that the grammar was bad.
After the parser is constructed, lines 71 through 74 fetch the random sentence grammar from the filenames specified on the command line. Standalone comment lines are discarded, but same-line comments aren't recognized.
Line 76 uses the constructed parser to parse the random sentence grammar, yielding the definitions for the non-terminals. Lines 78 to 87 walk through the parsed output, looking for inconsistencies in non-terminals used but not defined, or vice versa. Note that we don't actually abort the program if something is used but not defined, because it might not be selected on this particular random walk. It's a bit like playing russian roulette though.
Line 89 is a debugging stub. If you want to see the output of the parser, uncomment the line. I used that quite a bit while I was debugging the parser. Line 90 starts the random walk through the random sentence grammar, by generating exactly one item of the top-level non-terminal.
Lines 92 through 116 define the show
subroutine, extracting the
right portions of the parsed random sentence grammar. Lines 96 to 104
select one of the choices, based on their weights. Lines 105 through
115 dump each item of that choice. If the item is a terminal, it gets
printed in line 109. If the item is a symbol, it's a non-terminal,
and show
gets called recursively to dump the new non-terminal.
So, there's the spew
program, but to be fun, we need some input. I
typed random sentence
into www.google.com
looking for some
grammars. The most interesting hit I got led me to the Dilbert
Mission Generator, located at
http://www.dilbert.com/comics/dilbert/career/bin/ms2.cgi
. I spent
about an hour hitting reload repeatedly to reverse-engineer the
output, and came up with [listing two, below]. I've cleaned up some
of the choices, and fixed a few misspellings, so this grammar isn't
quite what you see there.
The start symbol is START
for convenience, in line 5. The output
is four missions (one for each fiscal quarter), as shown in line 7. I
named the non-terminals similar to the language they evoke, because
that makes it easier for me to see what fits together.
Note that lines 14, 15, and 17 use subrules to eliminate some typing, and that line 11 has a weight of 2, choosing the missions that have reasons twice as likely as the missions that are unjustified.
The lists of terms in lines 50 to 84 came off the ``word list'' page linked from the mission generator, with a small bit of modification.
If you stick this into mission.spew
, a simple invocation of
spew mission.spew
yields something like this on standard output:
We have committed to collaboratively integrate ethical sources while promoting personal employee growth.
Our challenge is to professionally simplify progressive services to set us apart from the competition.
We assertively restore virtual data so that we may synergistically create parallel infrastructures while promoting personal employee growth.
It is our business to authoritatively network prospective products so that we may endeavor to efficiently customize seven-habits-conforming services while promoting personal employee growth.
Hmm, I think I heard those very words at the last company shareholder meeting. Those employees sure are growing too.
One of the other results of my research into some random grammars led
to a nice little stash at
http://www-cs-faculty.stanford.edu/~zelenski/rsg/grammars/
. The
trouble is that they were all in some weird format that wasn't good as
input to my spew
program. Well, in a short period of time, I had
written a conversion program to go from the one format to the other,
given in [listing three, below]. Now, I could leverage off the
student's works to generate Bionic Woman episodes and Microsoft Press
Releases. How fun!
Anyway, I hope you're now convinced that Randal is actually a figment of his own imagination, and all of the work is really being done by the work of clever programs. [Note to editors: please continue to send the payments for the columns though... we're still paying for hardware maintenance.] Until next time, enjoy!
Listings
=0= #### LISTING ONE #### =1= #!/usr/bin/perl -w =2= use strict; =3= $|++; =4= =5= use Parse::RecDescent 1.65; =6= use Getopt::Long; =7= =8= GetOptions( =9= "start=s" => \ (my $START = "START"), =10= ) or die "see code for usage\n"; =11= =12= ## define the grammar of the spew grammar: =13= =14= (my $parser = Parse::RecDescent->new(<<'END_OF_GRAMMAR')) or die "bad!"; =15= =16= ## return hashref =17= ## { ident => { =18= ## is => [ =19= ## [weight => item, item, item, ...], =20= ## [weight => item, item, item, ...], ... =21= ## ], =22= ## defined => { line-number => times } =23= ## used => { line-number => times } =24= ## }, ... =25= ## } =26= ## item is " literal" or ident =27= ## ident is C-symbol or number (internal for nested rules) =28= =29= { my %grammar; my $internal = 0; } =30= =31= grammar: rule(s) /\Z/ { \%grammar; } =32= =33= ## rule returns identifier (not used) =34= rule: identifier ":" defn { =35= push @{$grammar{$item[1]}{is}}, @{$item[3]}; =36= $grammar{$item[1]}{defined}{$itempos[1]{line}{to}}++; =37= $item[1]; =38= } =39= | <error> =40= =41= ## defn returns listref of choices =42= defn: <leftop: choice "|" choice> =43= =44= ## choice returns a listref of [weight => @items] =45= choice: weight unweightedchoice { [ $item[1] => @{$item[2]} ] } =46= =47= ## weight returns weight if present, 1 if not =48= weight: /\d+(\.\d+)?/ <commit> /\@/ { $item[1] } | { 1 } =49= =50= ## unweightedchoice returns a listref of @items =51= unweightedchoice: item(s) =52= =53= ## item returns " literal text" or "identifier" =54= item: =55= { $_ = extract_quotelike($text) and " " . eval } =56= | identifier <commit> ...!/:/ { # must not be followed by colon! =57= $grammar{$item[1]}{used}{$itempos[1]{line}{to}}++; =58= $item[1]; # non-leading space flags an identifier =59= } =60= | "(" defn ")" { # parens for recursion, gensym an internal =61= ++$internal; =62= push @{$grammar{$internal}{is}}, @{$item[2]}; =63= $internal; =64= } =65= | <error> =66= =67= identifier: /[^\W\d]\w*/ =68= =69= END_OF_GRAMMAR =70= =71= my @data = <>; =72= for (@data) { =73= s/^\s*#.*//; =74= } =75= =76= (my $parsed = $parser->grammar(join '', @data)) or die "bad parse"; =77= =78= for my $id (sort keys %$parsed) { =79= next if $id =~ /^\d+$/; # skip internals =80= my $id_ref = $parsed->{$id}; =81= unless (exists $id_ref->{defined}) { =82= print "$id used in @{[sort keys %{$id_ref->{used}}]} but not defined - FATAL\n"; =83= } =84= unless (exists $id_ref->{used} or $id eq $START) { =85= print "$id defined in @{[sort keys %{$id_ref->{defined}}]} but not used - WARNING\n"; =86= } =87= } =88= =89= #DEBUGGING:# use Data::Dumper; print Dumper($parsed); =90= show($START); =91= =92= sub show { =93= my $defn = shift; =94= die "missing defn for $defn" unless exists $parsed->{$defn}; =95= =96= my @choices = @{$parsed->{$defn}{is}}; =97= my $weight = 0; =98= my @keeper = (); =99= while (@choices) { =100= my ($thisweight, @thisitem) = @{pop @choices}; =101= $thisweight = 0 if $thisweight < 0; # no funny stuff =102= $weight += $thisweight; =103= @keeper = @thisitem if rand($weight) < $thisweight; =104= } =105= for (@keeper) { =106= ## should be a list of ids or defns =107= die "huh $_ in $defn" if ref $defn; =108= if (/^ (.*)/s) { =109= print $1; =110= } elsif (/^(\w+)$/) { =111= show($1); =112= } else { =113= die "Can't show $_ in $defn\n"; =114= } =115= } =116= } =0= #### LISTING TWO #### =1= ## Our challenge is to effectively reverse-engineer the output of: =2= ## http://www.dilbert.com/comics/dilbert/career/bin/ms2.cgi =3= ## as well as dynamically provide interactive feedback. :-) =4= =5= START: missions =6= =7= missions: mission "\n\n" mission "\n\n" mission "\n\n" mission "\n" =8= =9= mission: =10= Our_job_is_to " " do_goals "." | =11= 2 @ Our_job_is_to " " do_goals " " because "." =12= =13= Our_job_is_to: =14= ("It is our " | "It's our ") job " to" | =15= "Our " job (" is to" | " is to continue to") | =16= "The customer can count on us to" | =17= ("We continually " | "We ") ("strive" | "envision" | "exist") " to" | =18= "We have committed to" | =19= "We" =20= =21= job: =22= "business" | "challenge" | "goal" | "job" | "mission" | "responsibility" =23= =24= do_goals: =25= goal | goal " " in_order_to " " goal =26= =27= in_order_to: =28= "as well as to" | =29= "in order that we may" | =30= "in order to" | =31= "so that we may endeavor to" | =32= "so that we may" | =33= "such that we may continue to" | =34= "to allow us to" | =35= "while continuing to" | =36= "and" =37= =38= because: =39= "because that is what the customer expects" | =40= "for 100% customer satisfaction" | =41= "in order to solve business problems" | =42= "to exceed customer expectations" | =43= "to meet our customer's needs" | =44= "to set us apart from the competition" | =45= "to stay competitive in tomorrow's world" | =46= "while promoting personal employee growth" =47= =48= goal: adverbly " " verb " " adjective " " noun =49= =50= adverbly: =51= "quickly" | "proactively" | "efficiently" | "assertively" | =52= "interactively" | "professionally" | "authoritatively" | =53= "conveniently" | "completely" | "continually" | "dramatically" | =54= "enthusiastically" | "collaboratively" | "synergistically" | =55= "seamlessly" | "competently" | "globally" =56= =57= =58= verb: =59= "maintain" | "supply" | "provide access to" | "disseminate" | =60= "network" | "create" | "engineer" | "integrate" | "leverage other's" | =61= "leverage existing" | "coordinate" | "administrate" | "initiate" | =62= "facilitate" | "promote" | "restore" | "fashion" | "revolutionize" | =63= "build" | "enhance" | "simplify" | "pursue" | "utilize" | "foster" | =64= "customize" | "negotiate" =65= =66= adjective: =67= "professional" | "timely" | "effective" | "unique" | "cost-effective" | =68= "virtual" | "scalable" | "economically sound" | =69= "inexpensive" | "value-added" | "business" | "quality" | "diverse" | =70= "high-quality" | "competitive" | "excellent" | "innovative" | =71= "corporate" | "high standards in" | "world-class" | "error-free" | =72= "performance-based" | "multimedia-based" | "market-driven" | =73= "cutting edge" | "high-payoff" | "low-risk high-yield" | =74= "long-term high-impact" | "prospective" | "progressive" | "ethical" | =75= "enterprise-wide" | "principle-centered" | "mission-critical" | =76= "parallel" | "interdependent" | "emerging" | =77= "seven-habits-conforming" | "resource-leveling" =78= =79= noun: =80= "content" | "paradigms" | "data" | "opportunities" | =81= "information" | "services" | "materials" | "technology" | "benefits" | =82= "solutions" | "infrastructures" | "products" | "deliverables" | =83= "catalysts for change" | "resources" | "methods of empowerment" | =84= "sources" | "leadership skills" | "meta-services" | "intellectual capital" =0= #### LISTING THREE #### =1= #!/usr/bin/perl -w =2= use strict; =3= $|++; =4= =5= use Parse::RecDescent; =6= =7= (my $parser = Parse::RecDescent->new(<<'END_OF_GRAMMAR')) or die "bad!"; =8= =9= start: <leftop: junk rule junk> /\Z/ { 1 } =10= =11= junk: /[^{}]*/ { 1 } =12= =13= rule: "{" defining choices "}" { print "$item[2]: $item[3]\n\n" } =14= | <error> =15= =16= defining: symbol ";" <commit> { $item[1] } | symbol =17= =18= symbol: /^<(\S+)>/ { =19= my $x = $1; =20= $x =~ s/([^a-zA-Z])/sprintf "_%02x_", ord $1/ge; =21= if ($x eq "start") { =22= $x = "START"; =23= } =24= ## warn "saw symbol $x\n"; =25= $x; =26= } =27= =28= choices: choice(s) { join " |\n ", @ {$item[1]} } | { q{""} } =29= =30= choice: item(s) ";" { join q{ " " }, @ {$item[1]} } =31= | <error> =32= =33= item: symbol | wordlist | <error> =34= =35= wordlist: word(s) { "q{" . join( " ", @ {$item[1]} ) . "}" } =36= =37= word: /^"(\S+)"/ <commit> { $1 } | /^[^\s<>{};]+/ =38= =39= END_OF_GRAMMAR =40= =41= (my $parsed = $parser->start(join '', <>)) or die "bad parse"; =42= =43= ## see http://www-cs-faculty.stanford.edu/~zelenski/rsg/grammars/