Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.
Download this listing!

Linux Magazine Column 31 (Dec 2001)

[suggested title: Word of the day for PostgreSQL]

Recently, the folks at RedHat selected the open-source PostgreSQL database as the base for their commercial RedHat Database product. This was not without some crying from the ranks of the MySQL faithful, who could not quite understand why their baby had been passed up.

After all, MySQL was faster, better, cheaper, easier to install, and all those other cool buzzwords. And PostgreSQL was this oddity: buggy, unsupported, no longer being developed, couldn't deal with large data easily, and (gasp) required a separate process for each connection instead of being nicely threaded.

In my professional opinion, those who discount PostgreSQL in favor of MySQL have not seen PostgreSQL in any recent time. I too had been one of those ``MySQL all the way'' folks until I was nudged by fellow Perl hackers into taking a look again. And to make one last statement about that comparison before I get into the meat of this month's column, let me say that my conclusion is that MySQL is ``putting a bit of structure onto a flat file'', where PostgreSQL is ``an open-source Oracle replacement at a fraction of the cost''.

And to celebrate the successful installation of PostgreSQL on my system, I wanted to tackle a little project. While poking around for one, someone mentioned on the Perl IRC channel about having a ``word of the day'' program, and that inspired me to create one with PostgreSQL (aside, I'll call that ``Pg'' from now on to save typing).

A DBM hash could easily hold the words and their definitions. But a database could also easily hold a per-user ``I've already seen that word'' table, so that I won't end up seeing the same word twice. And recently, someone had shown me the trick of:

  SELECT word FROM list ORDER BY random() LIMIT 1;

to get a random entry, so I wanted to put that into a program somewhere. This works by first sorting the list according to a random value, effectively shuffling it, and then selecting just the first entry.

Of course, after tinkering with it a bit, I decided there's no reason not to make it a CGI script as well as a crontab-able program, so I threw in HTML-cleaning of the definitions.

But what dictionary? I was first pointed at Lingua::Wordnet, and went off on an hour detour to install and understand that. And as I was discussing that online, someone else metioned FOLDOC, the dictionary of computing terms (both historical and current) as a good source of pertinent information. And in fact, it turned out to be a very good source for this little project.

The program flow is as follows: On each invocation, the current FOLDOC flat file is mirrored into a local cache. If the local cache was updated, we parse the file into words and definitions, and put that into the Pg database in a simple two-column table.

Next, for each user (keyed by Unix userid) that invokes the program, a ``deck'' for them is consulted. Initially, the deck is empty, so a new wordlist is created in a random order by ``shuffling'' a copy of the terms from the dictionary. This is done using SQL similar to (for the user ``merlyn''):

  INSERT INTO deck (word, person)
  SELECT word, merlyn
  FROM foldoc
  ORDER BY random()

The first entry for that user is pulled off, and deleted from the deck. The definition is then pulled from the dictionary table. If the invocation was from the command line, the word and definition are displayed with minimal reformatting. However, if CGI invocation is detected, then an HTML massaging locates URLs and email addresses, as well as fixing up HTML entities.

Now comes the cool part about Pg. All this shuffling and updating of the dictionary is being done within transactions that do not block the other readers! If the dictionary is stale, the other readers see the dictionary instantaneously change from the old dictionary to the new dictionary, without blocking. Try that with MySQL. Pg provides the concurrent consistent views that only the big guys like Oracle and Interbase have provided in the past.

Enough on that.. let's look at the code in [listing one, below].

Lines 1 through 3 start most of the programs I write, turning on warnings, enabling compiler restrictions, and disabling the buffering on STDOUT.

Line 5 accesses the DBI module, found in the CPAN. And you'll need to also have the DBD::Pg module, found both in the CPAN and with the Pg source distribution.

Lines 8 through 10 define the configuration parameters that I might want to change in this program. The URL from which I'm fetching the FOLDOC file is in $REMOTE. The local file in which this is being cached is in $LOCAL. And @DSN defines the DBI identifier, user, and password. The Pg database foldoc_word_of_the_day must already exist, but can be empty, because the script creates the necessary tables.

Line 13 establishes the connection to the Pg database. RaiseError is set, causing all serious errors to throw an exception. If uncaught, the exceptions cause the program to die, but for those rare steps where some of the serious errors might be expected, we can use a eval block to catch them.

Line 15 causes the database to be refreshed from the master remote wordlist, but only if the invoker of the script can write to the local cache. This means that if the file is writable only by me, then the CGI invocations won't refresh the file, and I don't have to make any directory or file ``world writable'' just to make it work with CGI.

Line 16 fetches the word and its definition from the base, for the user running the script. This is nobody on my webserver, or merlyn for me. The ``already seen'' lists are per-user.

Lines 18 to 27 format the response. If we're running as CGI, then the environment variable GATEWAY_INTERFACE is set. The text2html subroutine from HTML::FromText is pulled in, and appropriate headers are added for a CGI response. If it's not CGI, then the word and meaning are simply dumped to STDOUT.

Line 29 disconnects from the database, and line 31 keeps us from accidentally executing further code below.

And now for the subroutines, starting with refresh_wordlist in line 33. We'll fetch the list with LWP::Simple (found in the LWP library in the CPAN), using the mirror routine to mirror the file to the local cache. If the return code is 200, then we've got a new version, and it's time to refresh the Pg database as well.

Lines 37 to 40 ensure that the Pg table holding the dictionary has been created. Both the word and meaning columns are of type ``text'', which is a text string of unlimited size, stored compactly.

Lines 42 to 74 form a commit-block for a transaction. If anything fails in the block, then the rollback in line 76 erases the actions as if nothing had happened. Also, any changes made within the block are not visible to other users until line 73 is executed, so it's as if we're working on our own private copy of the database.

Line 45 clears out the foldoc table. Good thing we're doing this in private, or other users would no longer be able to get the definitions for their words. (In MySQL, this could be accomplished by blocking other users while we're doing this, but in Pg, the other users see the old version until we commit.)

Lines 47 and 48 define a statement handle to insert the entries into the foldoc table. Lines 50 to 72 parse the FOLDOC file. The file consists of many thousands of entries that look like:

    Artistic license
      <legal> The {open source license} applicable to {Perl}.
      (1999-12-29)

with the term flush-left, and the definition indented. A trivial parsing problem for Perl. The business step is down in line 62, where the term and definition are inserted into the database. A trace of the term and ``number of lines inserted'' (usually 1) accompanies the insertion, which usually scrolls by faster than I can read it.

Once a new dictionary is inserted, it's time to also invalidate the existing decks. First, the deck is created in line 83 if needed. The deck has two columns: a word (identical to the word column of the other table) and a person. Then the deck is cleaned in line 89, forcing the next hit for each individual requestor to shuffle a clean deck for them.

To get a word and definition, we call the routine starting in line 99, passing the individual into $person in line 100. We'll try to get the first ``card'' of the deck twice, failing after the second try.

Lines 105 to 111 attempt to grab that first card. A placeholder is used to identify the person, ensuring that we don't have to be aware of the quoting conventions for that string. If that succeeds, then we delete the card from the deck, and look up the definition in lines 119 to 122, and return the word and the definition in line 123. If it fails, it's time to shuffle, so lines 131 to 135 create the shuffled deck using the random number generator to insert the items.

Again, the updates to the deck are done inside transaction begin-end brackets, so that other users of the database will not see the partial updates. In fact, even if two hits for the web user (``nobody'') come in at the very same time, they'll be dealt different words (through a serialization thanks to the ``FOR UPDATE'' in the select in line 108).

There's a handful of uncaught die operations in the program that would make some unexpected errors rather unfriendly in a CGI environment, but that's for future tweaking.

In conclusion, it's not hard to use PostgreSQL, especially its advanced features, and it has finally arrived as a real, practical, database. Until next time, enjoy!

Listing

        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     use DBI;
        =6=     
        =7=     ## configuration
        =8=     my $REMOTE = "http://foldoc.doc.ic.ac.uk/foldoc/Dictionary";;
        =9=     my $LOCAL = "/home/merlyn/Web/Dictionary.txt";
        =10=    my @DSN = qw(dbi:Pg:dbname=foldoc_word_of_the_day USER PASS);
        =11=    ## end configuration
        =12=    
        =13=    my $dbh = DBI->connect(@DSN, {RaiseError => 1, PrintError => 0});
        =14=    
        =15=    refresh_wordlist() if -w $LOCAL;
        =16=    my ($word, $meaning) = get_word_and_meaning_for(scalar getpwuid $<);
        =17=    
        =18=    if ($ENV{GATEWAY_INTERFACE}) {  # running under CGI
        =19=      require HTML::FromText;
        =20=    
        =21=      print "Content-type: text/html\n\n";
        =22=      print HTML::FromText::text2html("$word\n\n$meaning",
        =23=                                      map { $_ => 1 }
        =24=                                      qw(title urls email paras));
        =25=    } else {
        =26=      print "$word\n$meaning";
        =27=    }
        =28=    
        =29=    $dbh->disconnect;
        =30=    
        =31=    exit 0;
        =32=    
        =33=    sub refresh_wordlist {
        =34=      require LWP::Simple;
        =35=      return unless LWP::Simple::mirror($REMOTE, $LOCAL) == 200;
        =36=    
        =37=      eval {
        =38=        $dbh->do(q{CREATE TABLE foldoc (word text, meaning text)});
        =39=      };
        =40=      die $@ if $@ and $@ !~ /already exists/;
        =41=    
        =42=      eval {
        =43=        $dbh->begin_work;
        =44=    
        =45=        $dbh->do(q{DELETE FROM foldoc}); # clean it out
        =46=    
        =47=        my $insert = $dbh->prepare
        =48=          (q{INSERT INTO foldoc(word, meaning) VALUES (?, ?)});
        =49=    
        =50=        open LOCAL, $LOCAL or die;
        =51=        my $entry;
        =52=        {
        =53=          $_ = <LOCAL>;
        =54=          if (not defined $_ or /^\S/) { # end of definition
        =55=            if (defined $entry) {   # save any cached definition
        =56=              $entry =~ s/^(\S.*)\n([ \t]*\n)*// or die;
        =57=              my $key = $1;         # get key
        =58=              $entry =~ s/\s+\z/\n/; # clean up definition
        =59=    
        =60=              unless ($key =~ /Free On-line Dictionary|Acknowledgements/) {
        =61=                print "$key -> ";
        =62=                print $insert->execute($key, $entry);
        =63=                print "\n";
        =64=              }
        =65=    
        =66=              undef $entry;
        =67=            }
        =68=            last unless defined $_;
        =69=          }
        =70=          $entry .= $_;
        =71=          redo;
        =72=        }
        =73=        $dbh->commit;
        =74=      };
        =75=      if ($@) {
        =76=        $dbh->rollback;
        =77=        die $@;
        =78=      }
        =79=    
        =80=      ## create and reset word
        =81=    
        =82=      eval {
        =83=        $dbh->do(q{CREATE TABLE deck (word text, person text)});
        =84=      };
        =85=      die $@ if $@ and $@ !~ /already exists/;
        =86=    
        =87=      eval {
        =88=        $dbh->begin_work;
        =89=        $dbh->do(q{DELETE FROM deck}); # clean it out
        =90=        $dbh->commit;
        =91=      };
        =92=      if ($@) {
        =93=        $dbh->rollback;
        =94=        die $@;
        =95=      }
        =96=    
        =97=    }
        =98=    
        =99=    sub get_word_and_meaning_for {
        =100=     my $person = shift;
        =101=   
        =102=     for (my $tries = 0; $tries <= 2; $tries++) {
        =103=       $dbh->begin_work;
        =104=       if (my ($word) =
        =105=           $dbh->selectrow_array(q{
        =106=                                   SELECT word FROM deck
        =107=                                   WHERE person = ?
        =108=                                   FOR UPDATE OF deck
        =109=                                   LIMIT 1
        =110=                                  },
        =111=                                 undef, $person)) {
        =112=   
        =113=         ## got a good word
        =114=         $dbh->do(q{DELETE FROM deck WHERE (word, person) = (?, ?)},
        =115=                  undef, $word, $person);
        =116=         $dbh->commit;
        =117=   
        =118=         if (my ($meaning) =
        =119=             $dbh->selectrow_array(q{
        =120=                                     SELECT meaning FROM foldoc
        =121=                                     WHERE word = ?
        =122=                                    }, undef, $word)) {
        =123=           return ($word, $meaning);
        =124=         }
        =125=   
        =126=         die "missing meaning for $word\n";
        =127=       } else {
        =128=         ## no words left, shuffle the deck
        =129=   
        =130=         $dbh->do(q{
        =131=                    INSERT INTO deck (word, person)
        =132=                    SELECT word, ?
        =133=                    FROM foldoc
        =134=                    ORDER BY random()
        =135=                   }, undef, $person);
        =136=         $dbh->commit;
        =137=       }
        =138=     }
        =139=     die "Cannot get a word for $person\n";
        =140=   }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.