Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.


Linux Magazine Column 19 (Dec 2000)

Well, as a nice follow-on to last month's column about how ``fog''-gy the good old online documentation can get sometimes, let's head in another direction about language: automated translation.

By now you've probably seen the Babelfish Translator at Altavista (at http://babelfish.altavista.com, if you haven't). It's a nice demonstration of some human-language machine translation. You type in (or paste in) a short chunk of text, and then select from a dozen different pairs of languages, and you get an approximate translation in a few seconds or so.

It wasn't long after this service appeared that people started playing with the complementary translations, or at least what should be complementary translations, like going from English to French and back to English. And as was quickly discovered, the translations each introduce some small unavoidable errors. Yes, it's cliche, but ``something gets lost in the translation''. These can be funny or tragic, depending on whether you depend on the service to provide precise communication, I guess.

And soon after that, people got tired of cutting and pasting into the nice web form to try to find goofy mis-translations, and started to write nice programmatic wrappers (usually in Perl) to stuff the form fields directly, invoke the URL, and extract the result. I know, I wrote one within a few weeks after finding out about Babelfish.

A nice wrapper lives in the CPAN as WWW::Babelfish, which I noticed recently, and I decided to put a CGI wrapper around it. Now obviously just calling the real Babelfish from my CGI wrapper around WWW::Babelfish would be a wasted effort, but I thought, hey, why not let people specify an entire chain of translations, to more easily locate the humorous mistranslations? And that's what we have to talk about in this month's column.

Besides being a cute demonstration of Babelfish, I also solved some interesting CGI issues about selecting the language chain, so this is more than just an amusement: it's also a demo for some CGI technology. So let's look at this marvel in [listing one, below].

Lines 1 through 3 start nearly every CGI script I write, enabling taint checks and warnings, restricting the use of soft references, barewords, and undeclared variables, and disabling the buffering of standard output.

Lines 5 through 13 define the list of languages understood (at this writing) by Babelfish. I'll store it as a hash, keyed by a two-letter code for each language. These aren't the country codes, but just the first two letters (lowercase) for each.

Lines 15 through 18 define the supported translations available (again, at the time of this writing) as four-character codes. The first two characters are the ``from'' language, while the last two are the ``to'' language. The hash keys define the legitimate mappings, and the values are simply ``1'' so that an existence test gives the same answer as a truth test, in case I forget and use the wrong one.
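To make that last point concrete, here is a minimal sketch, assuming only a subset of the pairs from the listing. The check_pair helper is hypothetical (it is not part of the listing) and merely reports what each kind of test says about a given four-character code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Built the same way as in the listing, but with only a few of the pairs.
my %PERMITTED;
$PERMITTED{$_}++ for qw(enfr enge fren geen frge gefr);

# Hypothetical helper, not in the listing: run both tests on one pair code.
sub check_pair {
  my ($pair) = @_;
  my $exists = exists $PERMITTED{$pair} ? 1 : 0;   # existence test
  my $true   = $PERMITTED{$pair}        ? 1 : 0;   # truth test
  return ($exists, $true);
}

for my $pair (qw(enfr frge rufr)) {
  my ($exists, $true) = check_pair($pair);
  print "$pair: exists=$exists true=$true\n";
}
```

Because every stored value is 1, the two columns always agree. Had the values been 0 or undef, the existence test would still succeed while the truth test failed.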

Line 20 brings in the CGI module (included with Perl), defining all the normal shortcuts, and some additional shortcuts to unroll the start and end of a table and table row. Additionally, we'll pull in the escapeHTML routine to fix the HTML entities in the Babelfish output (if any, just to be safe).

Line 22 prints the CGI/HTTP header, and starts the HTML with a title of babel linker and a big head of the same.

Lines 24 through 28 generate the form element to hold the text to be translated, as well as a ``translate'' button to trigger this translation. The default text on the first invocation is a tribute to the famous Monty Python Hungarian Phrasebook sketch. Later invocations will use whatever the current text is, thanks to the ``sticky fields'' feature of CGI.pm.

Now, you may be saying ``Hmm, where's the control to select the language translation chain?''. And here's the cool part. The translation chain will be encoded in the URL that invokes this script! For example, if the script is installed as /cgi/babel-on, then an English to German to French to English translation will be encoded as /cgi/babel-on/engefren. Note that the ``path info'' immediately following the script name encodes the language chain as two-character pairs for each language, which we'll decode a few lines down.

So, when this script ``invokes itself'' from the URL created in line 25, it passes along the current language chain setting as the ``path info'' information, and the text to be translated comes in via a normal form element. Too cool, and very simple once you understand it.

But how does the path info get created in the first place? That's coming up. Before that, in lines 30 and 31, we note whether the button was selected on this invocation, and then rip out any presence of the submit button in the sticky fields. Without this, the language selection links we're about to generate would also trigger an immediate translation, and I tolerated that during development for about all of two minutes before I hacked in these two lines of code.

Next, we'll grab the current path info in line 33, and skip past the leading slash if any (there'll be a slash if there's any info) by moving pos($pi) forward to 1 if the slash is present.

Line 35 is also tightly coded to walk through $pi looking for adjacent two-character language codes, breaking them out into separate elements of @path. The keys of %LANGUAGES define the codes, which we join with the regex-or vertical bar. The resulting regex looks something like /\G(en|fr|ge|it|po|ru|sp)/g, although the ordering of the keys is unpredictable, which doesn't matter here. So, for our path info example above, we now have:

  @path = ("en","ge","fr","en");

There. The language chain is becoming obvious. Line 36 was added after some initial testing. It's not necessary, but provides a friendly default of ``English to German'' if no languages are specified at all. Without it, we get no chain selected, and you have to make at least two stabs at some language titles before the ``translate'' button becomes useful.
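The slash-skipping and \G matching can be tried outside of a CGI environment. The following is only a sketch of the same idea: parse_chain is a hypothetical wrapper (the listing does this inline on the result of path_info()), and the keys are sorted here just to keep the generated regex stable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %LANGUAGES = qw(
  en English  fr French   ge German  it Italian
  po Portuguese  ru Russian  sp Spanish
);

# Hypothetical wrapper around the logic of lines 33 through 36.
sub parse_chain {
  my ($pi) = @_;
  $pi = "" unless defined $pi;
  $pi =~ m{^/}g;            # on success, leaves pos($pi) just past the slash
  my $codes = join "|", sort keys %LANGUAGES;
  # \G anchors each match where the previous one ended (or at pos($pi)),
  # so the walk stops at the first thing that is not a language code.
  my @path = $pi =~ /\G($codes)/g;
  @path = qw(en ge) unless @path;   # same friendly default as line 36
  return @path;
}

print join(" ", parse_chain("/engefren")), "\n";   # en ge fr en
print join(" ", parse_chain("")), "\n";            # en ge
print join(" ", parse_chain("/frxxen")), "\n";     # fr
```

The last call shows why \G matters: without it, the regex would skip over the junk and pick up the trailing ``en'' as well.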

Lines 38 to 56 generate the language chain creation matrix. The result is a table that reads like: ``from english to german to french to english and then to (unselected)'', except that vertically above and below each selected name (in boldface) is a list of links that can replace the current selection. The list is dynamically chosen such that a chain is made up only of items allowed by the %PERMITTED hash defined earlier. So, to select French as the starting language, you select the ``French'' word below the first English word. And then the second column changes to an ``unselected'' column, but only English and German appear there, since they are the only permitted destination languages. So you poke one of those (say, ``German''), and you get back a three column matrix now with German bolded, and only English and French links in the third column. You really have to play with it to appreciate it, because it came out much slicker and more intuitive than I had imagined (and it's all straight CGI and HTML).

The mechanism behind the sequence is straightforward (he says now, after toiling for about an hour over the code to make it simple and maintainable). First, we start a borderless table and single row in lines 39 through 41. These are closed off in line 55. Note that we use the extended start/end tags rather than the traditional table() wrapper around the contents. I thank Lincoln Stein for adding this feature in recent versions of CGI.pm, as it has solved some traditionally difficult tasks in much nicer ways than before.

Line 43 constructs a CGI path that reinvokes this script, without any path info information. We'll be gluing on to the end of this string as we go along.

Line 45 takes the current language chain, and wraps it in two extra null entries, so that we may slide along a two-element window to generate each column. (I tried three approaches for this before I stumbled across this one. Whew!)

Lines 46 to 53 dump each column, one by one. First, line 47 extracts the first two elements of @links (which changes each time thanks to line 52). Then, we print two table data cells. The first varies depending on whether we're looking at a middle column, the last column, or the first column respectively, thanks to a nested pair of ?: tests. The second is a nice stack of links, generated by calling the links subroutine, defined below and described in a moment. Line 51 adds the next language in the chain to the end of the URL, so the next round will start from the right place.
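The two-element window is easier to see with the HTML stripped away. This is a rough sketch under that assumption: columns is a hypothetical helper that collects what each pass of the loop would work with, instead of printing table cells, and the base URL is passed in rather than taken from url().

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: one [label, from, to, url-prefix] entry per column.
sub columns {
  my ($base, @path) = @_;
  my @links = ("", @path, "");      # pad both ends, as in line 45
  my $pathstring = $base;
  my @columns;
  while (@links > 1) {
    my ($from, $to) = @links;       # the window: first two elements only
    my $label = $from ? $to ? "to" : "and then to" : "from";
    push @columns, [$label, $from, $to, $pathstring];
    $pathstring .= $to;             # the next column's links build on this
    shift @links;                   # slide the window over by one
  }
  return @columns;
}

for my $col (columns("/cgi/babel-on/", qw(en ge fr))) {
  my ($label, $from, $to, $prefix) = @$col;
  printf "%-12s selected=%-12s link prefix=%s\n",
    $label, ($to || "(unselected)"), $prefix;
}
```

A three-language chain yields four columns: the padding at the front produces the ``from'' column, and the padding at the back produces the trailing ``and then to'' column with nothing selected yet.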

Now, out of sequence slightly, let's look at links, down in lines 83 to 95. First, we take the URL path leading up to this, the ``from'' language, and the ``to'' language, and stuff them into local variables in line 84. Next, we'll compute the possible languages that can follow the ``from'' language. Initially, this is all languages (line 85), but we narrow that down in line 88 by verifying the permitted pairs. $from is empty on the first call to links so we have all possible languages in the first column, always.

Finally, in lines 90 to 94, we create a stack of links for each language in this column. Line 91 turns the two-character code into a human readable name. Line 92 detects whether the language is the one already selected for this column, and if so, the name is merely bolded. Otherwise, line 93 constructs a link to the path info constructed so far, plus the two character language abbreviation, plus the sticky field info (so the text block gets carried forward) with a visible tag of the full language name. And that's all followed by a br tag, so that we get the links (or bolded text) on separate lines.
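The narrowing step in links can be exercised on its own. As a sketch, using the same two hashes as the listing, the hypothetical candidates helper below returns just the language codes that may follow a given ``from'' code, leaving out the HTML generation entirely.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %LANGUAGES = qw(
  en English  fr French   ge German  it Italian
  po Portuguese  ru Russian  sp Spanish
);

my %PERMITTED;
$PERMITTED{$_}++ for qw(
  enfr enge enit enpo ensp fren geen iten poen ruen spen frge gefr
);

# Hypothetical helper mirroring lines 85 and 88 of the listing.
sub candidates {
  my ($from) = @_;
  my @codes = sort keys %LANGUAGES;
  # An empty $from means the first column, where every language is allowed.
  @codes = grep { exists $PERMITTED{"$from$_"} } @codes if $from;
  return @codes;
}

for my $from ("", "en", "fr", "ru") {
  printf "%-8s => %s\n", ($from || "(first)"),
    join(" ", map { $LANGUAGES{$_} } candidates($from));
}
```

With French as the ``from'' language, only English and German come back, which matches the behavior described for the second column above.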

OK, back up to the main code to finish this out. Lines 59 through 79 handle the translation if requested, requiring both a valid path of sufficient length (you can't have a one-element chain!) and having selected the translate button. If so, we bring in the WWW::Babelfish module via require (so that we don't incur that cost for non-translation invocations while we're setting up the chain), and create a translator object in line 62.

Lines 64 and 77 create a borderless table again, similar to the one created above. Lines 65 to 76 perform the step-by-step translation on the contents of $text, mirroring what we did to generate the language chain matrix above. Recall that the elements of @path are all the two-character abbreviations, so line 67 patches them up to be full strings (which is what the WWW::Babelfish object wants).

Lines 68 and 69 perform the ``remote procedure call'' to the Altavista Babelfish service, courtesy of the $linguist object. Lines 70 to 72 show the results, step by step. If $result is defined, we show the translation at this stage, and continue. Otherwise, we break out in line 73. Lines 74 and 75 slide the window over, resetting the invariants for the top of the loop again.
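Since the real service may not be reachable, the loop's control flow can be tried with a stand-in translator. Everything named here is an assumption made for illustration: run_chain is a hypothetical helper, and $stub is a fake translator that only tags the text and pretends that any translation into ``ru'' fails. It does not use WWW::Babelfish, and it skips the code-to-name conversion of line 67.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: apply $translate along @path, one pair at a time.
sub run_chain {
  my ($translate, $text, @path) = @_;
  my @steps;
  while (@path > 1) {
    my ($src, $dst) = @path;            # first two elements, rest ignored
    my $result = $translate->($src, $dst, $text);
    push @steps, [$src, $dst, $result];
    last unless defined $result;        # abort the chain on failure
    shift @path;                        # slide it over
    $text = $result;                    # output feeds the next step
  }
  return @steps;
}

# Fake translator for demonstration only: tags the text, fails for "ru".
my $stub = sub {
  my ($src, $dst, $text) = @_;
  return $dst eq "ru" ? undef : "$text [$src>$dst]";
};

for my $step (run_chain($stub, "hello", qw(en fr en))) {
  my ($src, $dst, $result) = @$step;
  print "... from $src to $dst becomes ...\n    ",
    (defined $result ? $result : "... unintelligible (aborting) ..."), "\n";
}
```

Passing the translator in as a code reference is a choice made for this sketch, so that the loop can be tested without a network connection. The listing calls the translate method on the $linguist object directly.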

And line 81 ends the response, and we're done!

There's theoretically no limit to the number of elements of the chain this way. I played with the selection thing for a good 20 or 30 minutes, having fun stabbing the links and watching the rest of everything ripple to the only legitimate choices, and then hitting ``translate'' when I was finally curious about that chain's effect on the text. For example, the phrase ``The spirit is willing, but the flesh is weak'', generates the following hits when translated from English to French to English to German to English:

  ... from English to French becomes ...
    L'esprit est dispos, mais la chair est faible.
  ... from French to English becomes ...
    The spirit is laid out, but the flesh is weak.
  ... from English to German becomes ...
    Der Geist wird ausgebritten, aber das Fleisch ist schwach.
  ... from German to English becomes ...
    The spirit becomes ausgebritten, but the flesh is weak.

Yup. Loses something in the translation (and it's interesting to note that the English to German generates what I presume is a fairly common word that the German to English translator spits up on). But at least I got the answer in a dozen seconds or so. So, hopefully, you learned a bit about self-invoking scripts, about using the path info to pass control information separate from the form data, about creating a simple yet powerful user interface using a table and links, and about how amusingly a machine translation can mangle some good text. Until next time, enjoy!

Listings

        =1=     #!/usr/bin/perl -Tw
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     my %LANGUAGES = qw(
        =6=       en English
        =7=       fr French
        =8=       ge German
        =9=       it Italian
        =10=      po Portuguese
        =11=      ru Russian
        =12=      sp Spanish
        =13=    );
        =14=    
        =15=    my %PERMITTED;
        =16=    $PERMITTED{$_}++ for qw(
        =17=      enfr enge enit enpo ensp fren geen iten poen ruen spen frge gefr
        =18=    );
        =19=    
        =20=    use CGI qw(:all *table *Tr escapeHTML);
        =21=    
        =22=    print header, start_html('babel linker'), h1('babel linker');
        =23=    
        =24=    print                           # text area form, translate button:
        =25=      start_form,
        =26=      submit('translate'),
        =27=      textarea('text', "My hovercraft is full of eels!", 4, 50),
        =28=      end_form;
        =29=    
        =30=    my $translate_wanted = defined param('translate');
        =31=    Delete('translate');            # so language-changing URLs don't trigger
        =32=    
        =33=    (my $pi = path_info()) =~ m{^/}g; # skip past leading slash if present
        =34=    
        =35=    my @path = $pi =~ /\G(@{[join "|", keys %LANGUAGES]})/g;
        =36=    @path = qw(en ge) unless @path; # default to english-to-german if no path
        =37=    
        =38=    ## start of language selection matrix...
        =39=    print
        =40=      start_table({border => 0, cellspacing => 0, cellpadding => 2}),
        =41=      start_Tr;
        =42=    
        =43=    my $pathstring = url()."/";
        =44=    
        =45=    my @links = ("",@path,"");
        =46=    while (@links > 1) {
        =47=      my ($from, $to) = @links;     # first two, ignore rest for now
        =48=      print
        =49=        td($from ? $to ? "to" : "and then to" : "from"),
        =50=          td(links($pathstring, $from, $to));
        =51=      $pathstring .= $to;
        =52=      shift @links;
        =53=    }
        =54=    
        =55=    print end_Tr, end_table;
        =56=    ## ...end of language selection matrix
        =57=    
        =58=    ## now do the translation if needed:
        =59=    if ($translate_wanted and @path > 1) {
        =60=      require WWW::Babelfish;
        =61=      my $text = param('text');
        =62=      my $linguist = WWW::Babelfish->new or die "no linguist";
        =63=    
        =64=      print start_table({border => 0, cellspacing => 0, cellpadding => 3});
        =65=      while (@path > 1) {
        =66=        my ($src, $dst) = @path;    # first two elements, rest ignored for now
        =67=        $_ = $LANGUAGES{$_} for $src, $dst;
        =68=        my $result = $linguist->
        =69=          translate(source => $src, destination => $dst, text => $text);
        =70=        print Tr(td("... from $src to $dst becomes ..."),
        =71=                 td(defined $result ? escapeHTML($result) :
        =72=                    "... unintelligible (aborting) ..."));
        =73=        last unless defined $result;
        =74=        shift @path;                # slide it over
        =75=        $text = $result;
        =76=      }
        =77=      print end_table;
        =78=    
        =79=    }
        =80=    
        =81=    print end_html;
        =82=    
        =83=    sub links {
        =84=      my ($path, $from, $to) = @_;
        =85=      my @permitted = sort keys %LANGUAGES;
        =86=    
        =87=      ## strip bogus combos if this isn't the first in the chain:
        =88=      @permitted = grep { exists $PERMITTED{"$from$_"} } @permitted if $from;
        =89=    
        =90=      return map {
        =91=        my $lang = $LANGUAGES{$_};
        =92=        ($_ eq $to) ? b($lang) :
        =93=          a({-href => "$path$_?".query_string()}, $lang), br;
        =94=      } @permitted;
        =95=    }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.