Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Unix Review Column 34 (Oct 2000)
[suggested title: Little acts of magic]
So, let's start with some text manipulation. I have a poem written by a good friend of mine, Peg Edera, in a file named peg_poem, as follows:
The Little Acts

Maybe there is no magic.
Maybe it is only faith.
The lonely girl stuffs a letter
In a bottle, casts it in the sea.
It floats, it sinks.
What counts is that it's cast.

Those little intentions,
The petals placed on the altar,
The prayer whispered into air,
The deep breath.
The little acts,
The candles lit,
The incense burning
And we remember what counts
And we know what we need
In the little acts.

Then is when to listen.
What is it that brought you
To this?
There's the magic.
Right in hearing what
Sends you to Hope.

Peg Edera
February 8, 2000
The title of her poem inspired the theme of this column, so it's only appropriate that we use the text as grist for the mill.
Let's start with the basics of opening the file and reading. That'd be something like:
open POEM, "peg_poem" or die "Cannot open: $!";
while (<POEM>) {
    # ... do something here with each line
}
Within the body of this while loop, $_ contains each line from the poem. So on the first iteration, we get The Little Acts and a newline, and so on.
If we just want to copy the data to STDOUT, a simple print will do:
# open...
while (<POEM>) {
    print;
}
Here, the print defaults to STDOUT, so we'll end up copying the input to the output. What if we wanted line numbers? The variable $. contains the current line number of the most recently read filehandle:
# open...
while (<POEM>) {
    print "$.: $_";
}
And now we get a nice display labeled by line numbers. Let's optimize this a bit... there's too much typing for such a small thing happening in the middle of the loop:
# open...
print "$.: $_" while <POEM>;
Ahh yes, the while modifier form. Each line is still read into $_, and thus the print gets the right info.
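To try the line-numbering idea without a peg_poem file on disk, here is a minimal sketch that reads from an in-memory string instead. The sample text and the names $text, $poem, and @numbered are assumptions made for illustration; they are not part of the original program.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the first few lines of peg_poem.
my $text = "The Little Acts\n\nMaybe there is no magic.\n";

# Open a read-only filehandle on the string instead of a file on disk.
open my $poem, '<', \$text or die "Cannot open: $!";

# Collect each line with its number; $. counts lines read from $poem.
my @numbered;
push @numbered, "$.: $_" while <$poem>;
close $poem;

print @numbered;
```

The blank line still gets a number of its own, because $. counts every line read, not just the ones with text on them.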
Even the open can be optimized out of there, by using the cool "diamond" operator. The operator looks at the current value of @ARGV for a list of filenames, so let's give it one:
@ARGV = qw(peg_poem);
while (<>) {
    print;
}
Notice we don't have to explicitly open now, because that's handled by the diamond. Of course, copying files is best done by a special purpose module for copying:
use File::Copy;
copy "peg_poem", \*STDOUT;
But that's just another way of doing it.
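One detail of the diamond operator that is easy to miss is that it consumes @ARGV as it opens each file. The following sketch, which assumes a throwaway temporary file in place of peg_poem (the file contents and the @seen array are made up for the demonstration), shows that @ARGV is empty once the loop finishes:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small hypothetical sample to a temporary file.
my ($fh, $filename) = tempfile(UNLINK => 1);
print {$fh} "first line\nsecond line\n";
close $fh or die "Cannot close: $!";

# Hand the filename to the diamond operator via @ARGV.
@ARGV = ($filename);

my @seen;
while (<>) {
    push @seen, $_;   # remember what we read
    print;            # and copy it to STDOUT
}

# Each filename is shifted off @ARGV as the diamond opens it.
print "files left in \@ARGV: ", scalar(@ARGV), "\n";
```

If @ARGV had been empty the first time the diamond was evaluated, it would have read from STDIN instead, which is why the same loop works for both named files and piped input.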
Let's go the other direction: processing the information before sending it out. As an artist, I'm sure Peg appreciates the ability to include blank lines between the paragraphs of the poem. But how would we strip those blank lines on the output? Simple enough, with a regular expression:
while (<>) {
    print if /\S/;
}
Here, the regular expression is looking for any single non-whitespace character. If there aren't any of those, the line is at least blank-looking, and not worth printing.
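As a small illustration of what "blank-looking" covers, this sketch applies the same /\S/ test to a list of lines held in memory. The sample lines and the @lines and @kept names are assumptions for the example; grep stands in for the while loop.

```perl
use strict;
use warnings;

# Hypothetical sample: a title, an empty line, a line holding only
# spaces and a tab, and a line of text.
my @lines = (
    "The Little Acts\n",
    "\n",
    "  \t\n",
    "Maybe there is no magic.\n",
);

# Keep only lines with at least one non-whitespace character.
my @kept = grep { /\S/ } @lines;

print @kept;
```

Both the truly empty line and the whitespace-only line are dropped, since neither contains anything that \S can match.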
Besides printing things as quickly as we read them, we can also read the entire file into memory for more interesting operations:
while (<>) {
    push @data, $_;
}
Each new line is added to the end of @data, which initially starts empty. Now we can print them out in the reverse order:
for ($i = $#data; $i >= 0; $i--) {
    print $data[$i];
}
And while this works (it's a normal for loop), it's actually much less work for the programmer (and slightly more for Perl) to write this simply as:
print reverse @data;
which takes the @data value and reverses a copy of it end-for-end before handing this new list off to print.
What if we wanted to reverse each string? Hmm. Well, the reverse operator in a scalar context turns the string around. But then the newline is at the wrong end. So, it's a multiple-step procedure:
foreach $line (@data) {
    chomp($copy = $line);
    print reverse($copy)."\n";
}
Here, I take the string, copy it into a separate variable (so that the chomp doesn't affect the original @data element), then reverse that variable's contents in a scalar context (because it's the operand of the string concatenation operator), and then dump that out.
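Because reverse behaves so differently depending on context, a side-by-side comparison may help. This is only a sketch with made-up sample data; the variable names are illustrative.

```perl
use strict;
use warnings;

my @data = ("one\n", "two\n", "three\n");   # hypothetical lines

# List context: the elements come back in the opposite order.
my @backwards = reverse @data;

# Scalar context: the characters of the string are reversed.
my $flipped = reverse "magic";

# List context with a single string: a one-element list reversed
# end-for-end is the same one-element list, so nothing changes.
my @unchanged = reverse "magic";

print @backwards;
print "$flipped\n";
print "@unchanged\n";
```

The last case is the one that tends to surprise people: writing print reverse($copy) without the concatenation would put reverse in list context and print the string unreversed.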
Another way to grab the part of the string up to but not including the newline is with a regular expression:
foreach (@data) {
    print reverse($1)."\n" if /(.*)/;
}
In this case, I'm using the implicit $_ variable together with a regular-expression match to capture all the characters up to, but not including, the newline (because dot doesn't match a newline), and then using that as the argument to reverse. Magic!
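To see the capture at work on a single line, here is a minimal sketch using the poem's title as input. The $line and $reversed variables are assumptions for the example.

```perl
use strict;
use warnings;

my $line = "The Little Acts\n";   # sample line, newline included

my $reversed;
if ($line =~ /(.*)/) {
    # $1 holds everything before the newline, because . does not
    # match a newline unless the /s modifier is in effect.
    # Concatenation puts reverse in scalar context.
    $reversed = reverse($1) . "\n";
}

print $reversed;   # prints "stcA elttiL ehT" followed by a newline
```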
We could also drift this towards a mapping operation, now that I look at it. Let's make a little assembly line:
@reversed = map {
    /(.*)/ && reverse($1)."\n";
} @data;
print @reversed;
The map operation takes every element of @data and temporarily places it into $_. The regular expression match always succeeds, and when it does, $1 contains the string up to but not including the newline, which then gets reversed and a newline is tacked on the end. Of course, we don't need that intermediate variable:
print map { /(.*)/ && reverse($1)."\n"; } @data;
I think Peg would probably laugh at the output of that program applied to her work, so let's look at some other small magic items.
If we wanted to break the lines into a series of words, the easiest way is to apply a regular expression match with a "global" modifier to each line, like so:
while (<>) {
    push @words, /(\S+)/g;
}
Here, the regular expression of \S+ matches every contiguous chunk of non-whitespace characters. So, after the first line has been processed, we'll have:
@words = ("The", "Little", "Acts");
and the second line contributes nothing to the array, because there are no matches. We can shorten this slightly, using that map operator again:
@words = map /(\S+)/g, <>;
And this is pretty powerful, so let me go through it slowly. First, the diamond operator on the right is being used in a list context, meaning that all the lines from all the files of @ARGV are being sucked in at once. Next, the map operator takes each element (line) from the list, and shoves it into $_. Next, the regular expression is being evaluated in a list context, and since the match can occur multiple times on a given string, each match contributes 0 or more elements to the result list. That result list then becomes the value for @words. Wow, all in one line of code.
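The "0 or more elements" behavior can be checked on a few lines held in memory. In this sketch the sample lines and the @lines, @words, and @per_line names are assumptions; an array replaces the diamond so the example does not need a file.

```perl
use strict;
use warnings;

# Hypothetical sample: a title, a blank line, and a line of text.
my @lines = ("The Little Acts\n", "\n", "It floats, it sinks.\n");

# Each line contributes however many matches it has, possibly none.
my @words = map { /(\S+)/g } @lines;

# Count the matches per line by assigning the match list to an
# empty list and taking that assignment in scalar context.
my @per_line = map { my $n = () = /(\S+)/g; $n } @lines;

print join("|", @words), "\n";
print "@per_line\n";
```

The blank line adds nothing to @words, and the punctuation stays attached to floats, and sinks. because \S+ accepts any non-whitespace character.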
The problem with this particular rendering is that we're sucking in the punctuation as well. So magic is in the array as "magic.", and that's not the same word, especially if we want to count the words.
So, we can alter this a bit:
@words = map /(\w+)/g, <>;
and now we're grabbing all contiguous alphanumerics-and-underscores, selected by the things that \w+ matches.
But that breaks the word "There's" into two pieces. Bleh. There's been many a long hacking session when I've wished that the definition for \w had included apostrophes but excluded underscores. So, it's a slightly more precise and explicit regex for me:
@words = map /([a-zA-Z']+)/g, <>;
There. That works for this poem. And leaves out those nasty date numbers as well.
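For a direct comparison of the three patterns, this sketch runs each of them against one line of the poem and against the date line. The array names are assumptions chosen for the example.

```perl
use strict;
use warnings;

my $line = "There's the magic.\n";
my $date = "February 8, 2000\n";

# Non-whitespace runs: punctuation stays attached.
my @nonspace = $line =~ /(\S+)/g;           # There's  the  magic.

# Word characters: the apostrophe splits the first word.
my @wordish  = $line =~ /(\w+)/g;           # There  s  the  magic

# Letters and apostrophes only.
my @letters  = $line =~ /([a-zA-Z']+)/g;    # There's  the  magic

# The same class skips the digits in the date line.
my @from_date = $date =~ /([a-zA-Z']+)/g;   # February

print join("|", @nonspace), "\n";
print join("|", @wordish), "\n";
print join("|", @letters), "\n";
print join("|", @from_date), "\n";
```

Note that the explicit character class only covers unaccented ASCII letters, which is fine for this poem but would need widening for text with other alphabets.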
Now, as a final bit of magic, let's see what the most frequent word is. Wait, some of them are initial caps, so we need to do one more hop:
@words = map lc, map /([a-zA-Z']+)/g, <>;
That fixes it so they're all lowercase. Better! Now let's count them:
$count{$_}++ for @words;
No... it can't be that simple? But it is. Each of the words ends up in $_. We use that as a key to a hash. The value initially is undef, and is incremented as we see each word.
Now it's time to dump them out:
@by_order = sort { $count{$b} <=> $count{$a} } keys %count;
for (@by_order) {
    print "$_ => $count{$_}\n";
}
And that dumps them out in the sorted order. We're using a sort block to control the ordering of the keys, then dumping them in that order.
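Words with the same count come out in no particular order with that sort block, since hash keys have no defined order. One possible refinement, shown here as a sketch on a made-up word list rather than the poem itself, is to break ties alphabetically so the output is the same on every run:

```perl
use strict;
use warnings;

# Hypothetical, already-lowercased word list.
my @words = qw(the little acts the candles lit the little acts);

# Tally each word; a missing hash value starts out as undef and
# ++ treats it as zero.
my %count;
$count{$_}++ for @words;

# Most frequent first; equal counts fall back to alphabetical order.
my @by_order = sort {
    $count{$b} <=> $count{$a} or $a cmp $b
} keys %count;

print "$_ => $count{$_}\n" for @by_order;
```

The or only reaches the cmp comparison when the numeric comparison returns 0, so the primary ordering by frequency is unchanged.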
Well, I hope you see that Perl can be a little magic at times. It's the little things that count. Until next time, enjoy!