Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Unix Review Column 34 (Oct 2000)

[suggested title: Little acts of magic]

So, let's start with some text manipulation. I have a poem written by a good friend of mine, Peg Edera, in a file named peg_poem, as follows:

    The Little Acts

    Maybe there is no magic.
    Maybe it is only faith.
    The lonely girl stuffs a letter
    In a bottle, casts it in the sea.
    It floats, it sinks.
    What counts is that it's cast.
    Those little intentions,
    The petals placed on the altar,
    The prayer whispered into air,
    The deep breath.
    The little acts,
    The candles lit,
    The incense burning
    And we remember what counts
    And we know what we need
    In the little acts.
    Then is when to listen.
    What is it that brought you
    To this?
    There's the magic.
    Right in hearing what
    Sends you to 
    Hope.
      Peg Edera
      February 8, 2000

The title of her poem inspired the theme of this column, so it's only appropriate that we use the text as grist for the mill.

Let's start with the basics of opening the file and reading. That'd be something like:

  open POEM, "peg_poem" or die "Cannot open: $!";
  while (<POEM>) {
    # ... do something here with each line
  }

Within the body of this while loop, $_ contains each line from the poem. So on the first iteration, we get The Little Acts and a newline, and so on.
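For experimenting without the file on disk, here's a minimal sketch of the same loop. It assumes Perl 5.8 or later, where opening a reference to a scalar reads from that string; the three-argument open and the lexical $poem handle are my substitutions, and the $text sample is just a stand-in for peg_poem.

```perl
use strict;
use warnings;

# Assumption: a short stand-in for the contents of peg_poem.
my $text = "The Little Acts\n\nMaybe there is no magic.\n";

# Opening a reference to a scalar reads from that string
# instead of from a file on disk.
open my $poem, '<', \$text or die "Cannot open: $!";

my @lines;
while (<$poem>) {
    push @lines, $_;    # $_ holds each line, newline included
}
close $poem;

print scalar(@lines), " lines read\n";
```

The loop body sees exactly what it would see with the POEM filehandle: one line per iteration, in $_.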

If we just want to copy the data to STDOUT, a simple print will do:

  # open...
  while (<POEM>) {
    print;
  }

Here, the print defaults to STDOUT, so we'll end up copying the input to the output. What if we wanted line numbers? The variable $. contains the current line number of the most recently read filehandle:

  # open...
  while (<POEM>) {
    print "$.: $_";
  }

And now we get a nice display labeled by line numbers. Let's optimize this a bit... there's too much typing for such a small thing happening in the middle of the loop:

  # open...
  print "$.: $_" while <POEM>;

Ahh yes, the while modifier form. Each line is still read into $_, and thus the print gets the right info.

Even the open can be optimized out of there, by using the cool "diamond" operator. The operator looks at the current value of @ARGV for a list of filenames, so let's give it one:

  @ARGV = qw(peg_poem);
  while (<>) {
    print;
  }

Notice we don't have to explicitly open now, because that's handled by the diamond. Of course, copying files is best done by a special purpose module for copying:

  use File::Copy;
  copy "peg_poem", \*STDOUT;

But that's just another way of doing it.

Let's go the other direction: processing the information before sending it out. As an artist, I'm sure Peg appreciates the ability to include blank lines between the paragraphs of the poem. But how would we strip those blank lines on the output? Simple enough: use a regular expression:

  while (<>) {
    print if /\S/;
  }

Here, the regular expression is looking for any single non-whitespace character. If there aren't any of those, the line is at least blank looking, and not worth printing.
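To see which lines survive that test, here's a small sketch using grep on a list instead of a loop over a file. The sample lines are made up for illustration, including one that holds only spaces and a tab.

```perl
use strict;
use warnings;

# Assumption: sample lines standing in for the poem file.
my @lines = (
    "The Little Acts\n",
    "\n",                 # truly empty line
    "   \t\n",            # whitespace only: blank looking
    "Maybe there is no magic.\n",
);

# Keep only lines with at least one non-whitespace character.
my @kept = grep { /\S/ } @lines;

print @kept;
```

Both the empty line and the whitespace-only line get dropped, because neither contains anything that \S can match.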

Besides printing things as quickly as we read them, we can also read the entire file into memory for more interesting operations:

  while (<>) {
    push @data, $_;
  }

Each new line is added to the end of @data, which initially starts empty. Now we can print them out in the reverse order:

  for ($i = $#data; $i >= 0; $i--) {
    print $data[$i];
  }

And while this works (it's a normal for loop), it's actually much less work for the programmer (and slightly more for Perl) to write this simply as:

  print reverse @data;

which takes the @data value and reverses a copy of it end-for-end before handing this new list off to print.

What if we wanted to reverse each string? Hmm. Well, the reverse operator in a scalar context turns the string around. But then the newline is at the wrong end. So, it's a multiple-step procedure:

  foreach $line (@data) {
    chomp($copy = $line);
    print reverse($copy)."\n";
  }

Here, I take the string, copy it into a separate variable (so that the chomp doesn't affect the original @data element), then reverse that variable's contents in a scalar context (because it's the operand of the string concatenation operator), and then dump that out.
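The context matters more than it might seem. Here's a quick sketch of the difference, using one line from the poem as sample data; the variable names are mine.

```perl
use strict;
use warnings;

my $line = "Hope.\n";
chomp(my $copy = $line);    # chomp the copy, not the original

# List context: reverse sees a one-element list and hands it
# back in "reversed" order, which is the same single element.
my @as_list = reverse($copy);

# Scalar context: reverse turns the characters around.
my $as_string = reverse($copy);

print "list:   $as_list[0]\n";
print "scalar: $as_string\n";
```

So a bare print reverse($copy) would print the string unchanged, since print supplies a list context. The concatenation (or an explicit scalar) is what forces the string reversal.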

Another way to grab the part of the string up to but not including the newline is with a regular expression:

  foreach (@data) {
    print reverse($1)."\n" if /(.*)/;
  }

In this case, I'm using the implicit $_ variable together with a regular-expression match to find all the characters that don't include newline (because dot doesn't match a newline), and then using that as the argument to reverse. Magic!

We could also drift this towards a mapping operation, now that I look at it. Let's make a little assembly line:

  @reversed = map {
    /(.*)/ && reverse($1)."\n";
  } @data;
  print @reversed;

The map operation takes every element of @data and temporarily places it into $_. The regular expression match always succeeds, and when it does, $1 contains the string up to but not including the newline, which then gets reversed and a newline is tacked on the end. Of course, we don't need that intermediate variable:

  print map {
    /(.*)/ && reverse($1)."\n";
  } @data;

I think Peg would probably laugh at the output of that program applied to her work, so let's look at some other small magic items.

If we wanted to break the lines into a series of words, the easiest way is to apply a regular expression match with a "global" modifier to each line, like so:

  while (<>) {
    push @words, /(\S+)/g;
  }

Here, the regular expression of \S+ matches every contiguous chunk of non-whitespace characters. So, after the first line has been processed, we'll have:

  @words = ("The", "Little", "Acts");

and the second line contributes nothing to the array, because there are no matches. We can shorten this slightly, using that map operator again:

  @words = map /(\S+)/g, <>;

And this is pretty powerful, so let me go through it slowly. First, the diamond operator on the right is being used in a list context, meaning that all the lines from all the files of @ARGV are being sucked in at once. Next, the map operator takes each element (line) from the list, and shoves it into $_. Next, the regular expression is being evaluated in a list context, and since the match can occur multiple times on a given string, each line contributes 0 or more elements to the result list. That result list then becomes the value for @words. Wow, all in one line of code.
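Here's the same pipeline as a sketch you can run without a file, with a literal list of lines standing in for what the diamond would return. The sample lines are an assumption, and I've used the block form of map, which behaves the same way here.

```perl
use strict;
use warnings;

# Assumption: three lines standing in for the list that <> returns.
my @lines = ("The Little Acts\n", "\n", "Maybe there is no magic.\n");

# In list context, a /g match returns every captured chunk,
# so each line contributes zero or more words.
my @words = map { /(\S+)/g } @lines;

print scalar(@words), " words: @words\n";
```

The blank line in the middle contributes an empty list, so it simply vanishes from the result.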

The problem with this particular rendering is that we're sucking in the punctuation as well. So "magic" is in the array as "magic." (period included), and that's not the same word, especially if we want to count the words.

So, we can alter this a bit:

  @words = map /(\w+)/g, <>;

and now we're grabbing all contiguous alphanumerics-and-underscores, selected by the things that \w+ matches.

But that breaks the word "There's" into two pieces. Bleh. There's been many a long hacking session when I've wished that the definition for \w had included apostrophes but excluded underscores. So, it's a slightly more precise and explicit regex for me:

  @words = map /([a-zA-Z']+)/g, <>;

There. That works for this poem. And leaves out those nasty date numbers as well.
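To compare the three patterns side by side, here's a sketch that runs each of them over one line of the poem. The array names are mine, chosen just for this illustration.

```perl
use strict;
use warnings;

my $line = "There's the magic.\n";

# Same line, three notions of what a "word" is.
my @by_nonspace = $line =~ /(\S+)/g;           # keeps punctuation
my @by_word     = $line =~ /(\w+)/g;           # splits at the apostrophe
my @by_letters  = $line =~ /([a-zA-Z']+)/g;    # letters and apostrophes

print join("|", @by_nonspace), "\n";
print join("|", @by_word),     "\n";
print join("|", @by_letters),  "\n";
```

The letters-and-apostrophes version is tuned to English text like this poem. Accented letters or hyphenated words would need a wider character class.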

Now, as a final bit of magic, let's see what the most frequent word is. Wait, some of them are initial caps, so we need to do one more hop:

  @words = map lc, map /([a-zA-Z']+)/g, <>;

That fixes it so they're all lowercase. Better! Now let's count them:

  $count{$_}++ for @words;

No... it can't be that simple? But it is. Each of the words ends up in $_. We use that as a key to a hash. The value is initially undef, which the increment treats as zero, so each word's count goes up by one every time we see it.
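Here's that counting step as a standalone sketch, with a short made-up word list in place of the real @words so the result is easy to check by eye.

```perl
use strict;
use warnings;

# Assumption: a small lowercase word list standing in for @words.
my @words = qw(the little acts the candles lit the little acts);

my %count;
$count{$_}++ for @words;    # a missing key springs into being at 1

# Sort the keys by name only so the listing is repeatable.
for my $word (sort keys %count) {
    print "$word => $count{$word}\n";
}
```

No initialization is needed, and even under warnings the increment of a not-yet-existing entry is silent.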

Now it's time to dump them out:

  @by_order = sort { $count{$b} <=> $count{$a} } keys %count;
  for (@by_order) {
    print "$_ => $count{$_}\n";
  }

And that dumps them out from the most frequent word to the least. We're using a sort block to control the ordering of the keys, then dumping them in that order.
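One wrinkle: words with the same count come out in whatever order the hash happens to hand them over. If a stable listing matters, a second comparison can break the ties. Here's a sketch, with a made-up %count and an alphabetical tie-breaker that is my addition, not part of the code above.

```perl
use strict;
use warnings;

# Assumption: sample counts standing in for the real %count.
my %count = (the => 3, little => 2, acts => 2, lit => 1);

# Descending by count first; when the counts are equal, <=> returns 0
# and the "or" falls through to an alphabetical comparison.
my @by_order = sort {
    $count{$b} <=> $count{$a}
        or
    $a cmp $b
} keys %count;

print "$_ => $count{$_}\n" for @by_order;
```

With the tie-breaker, "acts" always lands ahead of "little", no matter how the hash is laid out internally.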

Well, I hope you see that Perl can be a little magic at times. It's the little things that count. Until next time, enjoy!


Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.