Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.
Download this listing!

Unix Review Column 67 (Nov 2006)

[Suggested title: ``Formatting reports with Template Toolkit'']

Recently, a Usenet posting (yes, I still read that) discussed using Perl to determine the disk block usage for various users on the system. Apparently, the poster had noticed a discrepency between the number of blocks reported by du and the total blocks returned by summing Perl's -s function applied to each file. I replied that you can't use -s, because that's merely the length of the file, and thanks to sparse Unix files, the value wasn't necessarily the number of blocks used. Also, a large file consumes indirect blocks to locate the blocks as needed.

Because of sparse files and indirect blocks, the accurate way to determine the actual cost of a file requires calling stat, and using the blocks value (element index 12). As I was thinking about this, I thought it'd be interesting to profile my own disk usage organized according to logarithmic buckets. Once I had that working, I started tinkering some more (as I often do), and wanted a nice HTML table output, organized by user, of the disk blocks and file count for a given directory hierarchy. And the result is in [listing one, below].

Line 5 provides a constant to help me scale a given block count into the relevant bucket. If I take the natural log of the block count, and divide it by the natural log of 2, then truncate that downward to an integer, I'll get buckets for 1 block, 2-3 blocks, 4-7 blocks, 8-15 blocks, and so on, in nice powers of two. Rather than keep dividing by the natural log of 2, I'll use the reciprocal and multiply, which in hindsight is a completely unnoticed microoptimization. Oh well.

Lines 7 and 8 pull in the File::Find module (found in the core distribution) and Template Toolkit module (found in the CPAN).

Line 10 defines the summary data structure which will hold the results of our data gathering. The hash is keyed by user ID number, pointing at an array organized by logarithmic block buckets, each holding a hashref that accumulates the number of blocks and the count of files.

Line 14 defaults the directories to be searched to include the current directory.

Line 16 creates a hash to enable us to ignore the subsequent times that we see the same multiply-linked file. This hash has to be defined outside the subroutine used by File::Find.

Lines 18 to 28 recurse through the directories using File::Find. The inline anonymous subroutine is called for every entry: it's our job to return quickly if the entry is not needed. Hence, lines 19 and 20 return for symbolic links or items that are not either files or directories. (Directories consume disk blocks, and thus I chose to count those in my storage summary.)

Lines 21 to 23 pick out the relevant parts of the stat call, including the device and inode number, the user ID of the file's owner, and the number of blocks. If the device and inode number pair have been seen before, line 23 rejects the entry quickly.

If we make it to line 25, it's the first time we've seen a given file, so we next need to determine the bucket in which this entry should be accumulated. If the block count is 0, we can't compute the logarithm, so I decided to arbitrarily put that in with the entries with a block count of 1. Otherwise, the natural log of the block count is divided by the log of 2. We don't need to take the integer value of the result because we're using it as an array index, which automatically truncates the value.

Lines 26 to 27 accumulate the information for this file into the appropriate bucket.

Once the data is gathered, I create two auxiliary data structures to help the rendering a bit. Line 30 creates a mapping from user ID to usernames. Lines 31 to 36 create the top-of-table column headers by computing the powers of 2 as needed. The @labels array is compared to each user's data, and if needed, extended until it's long enough.

And for the rendering, I reach for my favorite templating system, Template Toolkit. I need to pass the data into the template using a hash, which I create in lines 38 to 41. Inside the template, sums refers to the my %sums hashref, and so on.

Lines 43 and 44 create the top-level Template object, configured to invoke post-chomping (to avoid spurious newlines in the output), and invoke the template on my data. The result appears on STDOUT.

This is the end of the Perl code. The rest of the file is a Template Toolkit template in a giant ``here'' document. Line 45 declares that the blocks returned by stat are in 1K blocks, which I determined by experimentation.

Lines 46 to 49 are a slight tweak to make the table cells look reasonable in my browser. I'm no CSS expert, so if this is broken, just remember that I never claimed to be ``Just another CSS hacker''. It does seem to do what I want though.

Lines 51 to 67 define a macro to take a number and format it nicely into 4 characters or less (up to a terabyte, and after that, you're on your own, you lucky dog). I tried using Template::Plugin::Number::Format and it didn't do what I wanted, so I wrote my own. I use this both for the bytes and for the file count, simply by applying the BLOCKSIZE scaling factor as needed.

Lines 68 to 70 define a macro to show bytes and file count in a consistent way from column to column, on separate lines in the HTML.

Lines 71 to 107 define the main loop, walking through the results user by user.

We need to preface the first user's output with all of the table header information, including the column headers. This is detected in lines 72 to 84.

Lines 74 to 83 produce the column heads by walking through the labels array. Ahead of the first column, I insert the username column (line 76), and after the last column, I append the totals column (line 80). The remaining labels are scaled by the block size and displayed using the nice number formatting routine (line 78).

Once the headers are displayed, it's time to display each row. Lines 85 to 103 iterate over the columns of a given row, using the labels to define the columns over which we will iterate.

As with the column headers, we'll insert the username before the first data column, and initialize the total accumulators (lines 88 to 91). And after the data is complete, the totals are dumped (lines 99 to 102).

For each data cell, we'll pull out the detailed hashref in line 87. If this cell exists (checked in line 92), then we should show the data (likes 93 to 95). Otherwise, we need to include a non-breaking space so that the browser renders the cell anyway (line 97).

Okay, this started out being a 10-line program, but there it is. A sample run on my laptop's library directory as rendered by my browser is shown in [figure one]. Hope you find this tool useful, and learned a few things about creating data reduction tools. Until next time, enjoy!

Listing One

        =1=     #!/usr/bin/perl
        =2=     use strict;
        =3=     use warnings;
        =4=     
        =5=     use constant ONEOVERLOG2 => 1/log(2);
        =6=     
        =7=     use File::Find;
        =8=     use Template;
        =9=     
        =10=    my %sums;
        =11=    ## $sums{$uid}[$log_2_blocks]{blocks} += $blocks;
        =12=    ## $sums{$uid}[$log_2_blocks]{count} ++;
        =13=    
        =14=    @ARGV = qw(.) unless @ARGV;
        =15=    
        =16=    my %seen;
        =17=    
        =18=    find sub {
        =19=      return if -l;
        =20=      return unless -f _ or -d _;
        =21=      my @stat = stat(_) or return;
        =22=      my ($dev, $ino, $uid, $blocks) = @stat[0, 1, 4, 12];
        =23=      return if $seen{"$dev $ino"}++;
        =24=    
        =25=      my $slot = $blocks ? log($blocks) * ONEOVERLOG2 : 0;
        =26=      $sums{$uid}[$slot]{blocks} += $blocks;
        =27=      $sums{$uid}[$slot]{count}++;
        =28=    }, @ARGV;
        =29=    
        =30=    my %usernames = map { $_ => scalar getpwuid $_ } keys %sums;
        =31=    my @labels;
        =32=    for (values %sums) {
        =33=      while (@labels < @$_) {
        =34=        push @labels, 2**@labels;
        =35=      }
        =36=    }
        =37=    
        =38=    my %VARS = 
        =39=      (sums => \%sums,
        =40=       usernames => \%usernames,
        =41=       labels => \@labels);
        =42=    
        =43=    Template->new({ POST_CHOMP => 1 })
        =44=      ->process(\<<'END_OF_TEMPLATE', \%VARS) or die Template->error;
        =45=    [% BLOCKSIZE = 1024 %]
        =46=    <style type="text/css">
        =47=    th, td {
        =48=      margin: 0; border: 1px dotted black; padding: 0.2em; text-align: center;
        =49=    }
        =50=    </style>
        =51=    [% MACRO nice(n) BLOCK;
        =52=      IF n >= 1073741824 * 9.995;
        =53=        n / 1073741824 | format('%dG');
        =54=      ELSIF n >= 1073741824;
        =55=        n / 1073741824 | format('%.2gG');
        =56=      ELSIF n >= 1048576 * 9.995;
        =57=        n / 1048576 | format('%dM');
        =58=      ELSIF n >= 1048576;
        =59=        n / 1048576 | format('%.2gM');
        =60=      ELSIF n >= 1024 * 9.995;
        =61=        n / 1024 | format('%dK');
        =62=      ELSIF n >= 1024;
        =63=        n / 1024 | format('%.2gK');
        =64=      ELSE;
        =65=        n;
        =66=      END;
        =67=    END %]
        =68=    [% MACRO bc(b,c) BLOCK %]
        =69=    <td>[% bytes = b * BLOCKSIZE; nice(bytes) %]<br/>[% nice(c) %]</td>
        =70=    [% END %]
        =71=    [% FOR uid = sums.keys.nsort %]
        =72=    [%   IF loop.first %]
        =73=    <table>
        =74=    [%     FOR label = labels %]
        =75=    [%       IF loop.first %]
        =76=    <tr><th>User</th>
        =77=    [%       END %]
        =78=    <th>[% label_bytes = label * BLOCKSIZE; nice(label_bytes) %]</th>
        =79=    [%       IF loop.last %]
        =80=    <th>Total</th>
        =81=    </tr>
        =82=    [%       END %]
        =83=    [%     END %]
        =84=    [%   END %]
        =85=    [%   labels_max = labels.max %]
        =86=    [%   FOR column = [0..labels_max] %]
        =87=    [%     cell = sums.$uid.$column %]
        =88=    [%     IF loop.first %]
        =89=    [%       total_blocks = 0; total_count = 0 %]
        =90=    <tr><th>[% usernames.$uid || "?"; " ("; uid; ")" %]</th>
        =91=    [%     END %]
        =92=    [%     IF cell %]
        =93=    [%       total_blocks = total_blocks + cell.blocks %]
        =94=    [%       total_count = total_count + cell.count %]
        =95=    [%       bc(cell.blocks, cell.count) %]
        =96=    [%     ELSE %]
        =97=    <td>&nbsp;</td>
        =98=    [%     END %]
        =99=    [%     IF loop.last %]
        =100=   [%       bc(total_blocks, total_count) %]
        =101=   </tr>
        =102=   [%     END %]
        =103=   [%   END %]
        =104=   [%   IF loop.last %]
        =105=   </table>
        =106=   [%   END %]
        =107=   [% END %]
        =108=   END_OF_TEMPLATE

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.