Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Unix Review Column 35 (Dec 2000)
[suggested title: So what's the difference?]
A lot of common programming is dealing with things that change. And things do indeed change, and sometimes we'd like to know how they changed.
For example, if we have a list of items:
@one = qw(a b c d e f g);
and then later, we look at it again, and there's a different set of items:
@two = qw(b c e h i j);
How can we tell what's new, what's old, and what's gone?
We could certainly try to do it by brute force:
@one = qw(a b c d e f g); @two = qw(b c e h i j); foreach $one (@one) { if (grep $one eq $_, @two) { print "$one is in both old and new\n"; } else { print "$one has been deleted\n"; } } foreach $two (@two) { unless (grep $two eq $_, @one) { print "$two has been added\n"; } }
And this in fact gives us an appropriate response:
a has been deleted b is in both old and new c is in both old and new d has been deleted e is in both old and new f has been deleted g has been deleted h has been added i has been added j has been added
But this is incredibly inefficient. The computation time will rise in
proportion to the product of sizes of both the lists. This is because
every element of one list is being compared to every element of the
other list (twice, in fact). The grep
operator is a loop over each
item, so we've effectly got nested loops, and that should nearly
always be a danger sign.
The perlfaq4
manpage approaches this subject, giving a solution of
something like:
@union = @intersection = @difference = (); %count = (); foreach $element (@one, @two) { $count{$element}++ } foreach $element (keys %count) { push @union, $element; push @{ $count{$element} > 1 ? \@intersection : \@difference }, $element; }
with the caveat that we're assuming one item of each kind within each list. While that works for our input data as well, we'll run into trouble on more general data. However, with a slight modification, we can handle even duplicate items in each list:
@one = qw(a a a a b c d e f g); @two = qw(b c e h i i i i j); my %tracker = (); $tracker{$_} .= 1 for @one; $tracker{$_} .= 2 for @two; for (sort keys %tracker) { if ($tracker{$_} !~ /1/) { print "$_ has been added\n"; } elsif ($tracker{$_} !~ /2/) { print "$_ has been deleted\n"; } else { print "$_ is in both old and new\n"; } }
Success. Correct output, and reasonably efficient. If you're doing a
lot of these, check into the CPAN modules starting with Set::
.
And then we come to the problem of telling the difference between two
sequences, where the ordering matters. The very nice
Algorithm::Diff
in the CPAN computes a reasonably short
difference-list, similar to the Unix diff command, to tell us how
to transform one list into another. There are a number of interfaces.
The most interesting one I found was traverse_sequences
, which
gives me all of the elements of the two lists in sequence, but marked
in a way that I can tell which of the two lists (or both) the item
belongs.
Let's look at a simple example:
use Algorithm::Diff qw(traverse_sequences); @one = qw(M N a b P Q c d e f V W g h); @two = qw(a b R S c d T U e f g h X Y); traverse_sequences(\@one, \@two, { MATCH => sub { show($one[$_[0]], $two[$_[1]]) }, DISCARD_A => sub { show($one[$_[0]], "---") }, DISCARD_B => sub { show("---", $two[$_[1]]) }, }); sub show { printf "%10s %10s\n", @_; }
Here we've given two token sequences in @one
and @two
. Using
traverse_sequences
, we'll print out common sequences (via the
MATCH
callback, removed material (via the DISCARD_A
callback),
and new material (via the DISCARD_B
callback). Changed material
shows up as a series of deletes followed by a series of inserts.
The callbacks are defined as references to anonymous subroutines,
more commonly known as ``coderefs''. The two parameters passed
to each of the callbacks are the current indicies within the @one
and @two
arrays. As this isn't the actual value, I need to take
the index and look it up in the appropriate array.
The result is something like:
M --- N --- a a b b P --- Q --- --- R --- S c c d d --- T --- U e e f f V --- W --- g g h h --- X --- Y
Notice the common sequences. The printf
operation lines up the
columns nicely.
Well, this is a nice text-mode tabular output, but we can get a bit nicer if we know we're sending the result to HTML. Let's color-code all deletions in red, and insertions in green.
A first cut at the algorithm generates far too many font tags:
use Algorithm::Diff qw(traverse_sequences); @one = qw(M N a b P Q c d e f V W g h); @two = qw(a b R S c d T U e f g h X Y); traverse_sequences(\@one, \@two, { MATCH => sub { colorshow("", $one[$_[0]]) }, DISCARD_A => sub { colorshow("red", $one[$_[0]]) }, DISCARD_B => sub { colorshow("green", $two[$_[1]]) }, }); sub colorshow { my $color = shift; my $string = shift; if (length $color) { print "<font color=$color>$string</font>\n"; } else { print "$string\n"; } }
This generates a correct result, but excessive output:
<font color=red>M</font> <font color=red>N</font> a b <font color=red>P</font> <font color=red>Q</font> <font color=green>R</font> <font color=green>S</font> c d <font color=green>T</font> <font color=green>U</font> e f <font color=red>V</font> <font color=red>W</font> g h <font color=green>X</font> <font color=green>Y</font>
What we need is some tracking of state information to figure out if we're already in red or green mode:
use Algorithm::Diff qw(traverse_sequences); @one = qw(M N a b P Q c d e f V W g h); @two = qw(a b R S c d T U e f g h X Y); traverse_sequences(\@one, \@two, { MATCH => sub { colorshow("", $one[$_[0]]) }, DISCARD_A => sub { colorshow("red", $one[$_[0]]) }, DISCARD_B => sub { colorshow("green", $two[$_[1]]) }, }); colorshow(""); # reset back to BEGIN { my $currentcolor = "";
sub colorshow { my $color = shift; my $string = shift; if ($color ne $currentcolor) { print "</font>\n" if length $currentcolor; print "<font color=$color>\n" if length $color; $currentcolor = $color; } if (defined $string and length $string) { print "$string\n"; } } }
Here, I'm tracking the state of the current HTML color in the
$currentcolor
static variable. As it changes, I send out the
end-font or begin-font tags as needed. The only oddness now is that I
need to make one final call to colorshow
with the uncolored tag to
close off any final begin-font tag. This call is harmless if we were
already outside a colored region.
And that's much better, resulting in:
<font color=red> M N </font> a b <font color=red> P Q </font> <font color=green> R S </font> c d <font color=green> T U </font> e f <font color=red> V W </font> g h <font color=green> X Y </font>
Although my web-hacking friends might prefer to see that as:
<span style="background: red; color: black"> M N </span> a b <span style="background: red; color: black"> P Q </span> <span style="background: green; color: black"> R S </span> c d <span style="background: green; color: black"> T U </span> e f <span style="background: red; color: black"> V W </span> g h <span style="background: green; color: black"> X Y </span>
And that'd be a pretty easy small change, but I'll leave that to you. There's a little extra whitespace in the output here than I like, but at least the job is getting done with minimal hassle.
So, now when someone asks you ``what the difference?'', you can show different ways of answering that question! Until next time, enjoy!