Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Linux Magazine Column 83 (Jul 2006)

[Suggested title: ``Progress Bars for Download'']

If you're like me, you spend a lot of time getting things over your net connection, downloading them to your desktop machine (or in my case, my laptop, which is my only machine). One of the things I find myself doing frequently is watching the output of curl as it keeps me up to date on how much has been downloaded, and how much longer it'll take to do the rest.

I recently stumbled across Term::ProgressBar. This CPAN module can draw a labeled progress bar, to show how much of a task has been completed. The bar has a major part drawn with nice = characters, labeled by percentage, and a little flying * that shows the percentage complete within each one of the = steps. The bar is drawn in such a way that successive invocations overwrite the previous one, creating the illusion that it's just ``growing'' as we make progress.

One of the nice features of Term::ProgressBar is that it also notes the times when the update is called, and can give an estimate of how many hours, minutes, and seconds it'll be before the task is complete. This is done automatically without any work on the caller's part, except for requesting the option. It was this particular feature that made me think that I could emulate what curl does during a download with a nice little progress bar. I knew I could hook those values of the download-in-progress with an LWP::UserAgent content callback, and the result is in [listing one, below].

Line 1 declares my path to Perl, and enables warnings throughout the program. I still use -w instead of use warnings, mostly because I'm lazy and habitual. The problem with using -w is that it enables warnings globally, even for code I didn't write or test. With use warnings, only the files (or smaller lexical scope) in which it appears will have warnings enabled.
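The difference in scope is easy to see in a small experiment. The following is only a sketch, not part of the listing: the subroutine names are made up for illustration, and the runtime variable $^W stands in for the -w switch (which sets it) so that the demonstration can turn the global flag on for just one call.

```perl
use strict;

# Collect warnings instead of printing them, so they can be counted.
my @seen;
$SIG{__WARN__} = sub { push @seen, $_[0] };

sub quiet_sum {
    # No "use warnings" in this scope: undef silently counts as 0.
    my ($x, $y) = @_;
    return $x + $y;
}

sub noisy_sum {
    use warnings;    # lexically scoped: applies only inside this block
    my ($x, $y) = @_;
    return $x + $y;
}

quiet_sum(1, undef);
my $quiet_count = @seen;

noisy_sum(1, undef);
my $noisy_count = @seen - $quiet_count;

# -w sets the global $^W; turning it on here reaches into quiet_sum
# even though that code never asked for warnings.
{
    local $^W = 1;
    quiet_sum(1, undef);
}
my $global_count = @seen - $quiet_count - $noisy_count;

print "plain: $quiet_count, use warnings: $noisy_count, ",
      "global flag: $global_count\n";
```

The same unchanged quiet_sum stays silent on its own but complains once the global flag is on, which is exactly what happens to other people's modules under -w.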

Line 2 enables the standard Perl restrictions, disabling arbitrary barewords and symbolic references, and requiring simple variables to be declared lexically.

Lines 4 through 6 pull in the three modules I've installed from CPAN. Term::ProgressBar provides the progress bar described earlier. URI and LWP::UserAgent are part of Bundle::LWP, the useful collection of modules to deal with everything about the web except for CGI.

Line 8 creates my virtual user agent in $ua, acting as a client for HTTP transactions. I'll be using this user agent object as I might use a browser, telling it to fetch a particular URL. Various configuration options exist for an LWP::UserAgent object, such as what kind of browser it tells the server it might be; however, I've left all the settings at the default because, yes, I'm lazy.

Note that some servers care about the browser identification, and I might want to go back and reconfigure this user agent to have it pretend to be a certain version of Internet Explorer or Firefox to access certain ``restricted pages''. Yes, the server trusts an arbitrary string sent by the browser, and some sites use that string to control access. How silly.

Lines 10 through 55 loop once for each URL specified on the command line. The loop is exited when @ARGV is finally empty, which happens eventually because the first item of @ARGV is shifted off into $url in line 11. Line 12 shows the URL that we're currently trying to download.

Lines 14 to 18 try to figure out a suitable local filename for the downloaded information. I wanted to emulate curl -O by taking the last component of the path as the name, so I pulled out the URI module to do the parsing.

Line 14 creates a URI object from the requested URL. Line 15 grabs just the path part of the URL. That's the section after the host, but before the optional query string. Line 16 removes everything from the path up to the final (or only) slash. At this point, $path is a candidate for a filename. However, if it's empty (the path ends in slash, for example), I force it to be download instead in line 17. And finally, because I want it to be a new file, I just add X in front of the name until the file doesn't exist locally in line 18.

Yes, that's a pretty hokey chunk of code there, but it was good enough for the few samples with which I used it.
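To make the naming rules easier to try out in isolation, here is a sketch that wraps the same three steps in a subroutine. The name local_name_for and the optional existence-check callback are inventions for this illustration; the input is assumed to be the string that the URI object's path method would return.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the path part of a URL into a local filename
# that is not yet taken. $exists is a code ref used to test for
# collisions; it defaults to a real -e test.
sub local_name_for {
    my ($path, $exists) = @_;
    $exists ||= sub { -e $_[0] };

    (my $name = $path) =~ s{.*/}{};            # keep the last path component
    $name = "download" unless length $name;    # path was empty or ended in "/"
    $name = "X$name" while $exists->($name);   # prefix X until the name is free
    return $name;
}

# Pretend these files are already present in the current directory.
my %taken  = map { $_ => 1 } qw(index.html Xindex.html);
my $exists = sub { $taken{ $_[0] } };

print local_name_for("/pub/perl-5.8.8.tar.gz", $exists), "\n";
print local_name_for("/docs/", $exists), "\n";
print local_name_for("/index.html", $exists), "\n";
```

Passing in the existence check keeps the demonstration from touching the disk. Note that checking for a name and then opening it later leaves a small window in which another process could create the same file, which is one more reason to call this approach hokey.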

Once I have a filename, line 20 opens up a handle to that file, using a lexical filehandle and 3-argument open, which works fine on modern Perl versions, but probably won't work if you haven't upgraded Perl since 1998.

Lines 22 to 24 create the progress bar object. I'm selecting a label of Download, which seems appropriate at this point, along with an initial guess at the total size as 1024 bytes. Later, I'll be updating this amount with either a better guess, or the actual byte count as reported by the server. Finally, I'm also enabling estimated-time mode, using linear approximation (the only choice possible).

Line 26 establishes $output. I'll be using this to count the bytes downloaded so far, so I'll start with 0.

Line 27 defines a boolean flag, $target_is_set, initially false. When I've seen a good length from the server, I'll use it as the final target value for the upper bound, and set this to true. This keeps me from having to repeatedly check for the value on each iteration, which seemed wasteful.

Line 28 holds the number of bytes I should see before updating the bar again. Each bar update tells me, in terms of bytes downloaded, how far to go before about a half-second would have passed. By paying attention to this value, I can cut down on the number of calls I make to update the bar.

Lines 29 to 50 perform the download, by calling the get method against the user agent. Line 30 defines the desired URL for this request.

Lines 31 to 50 define a content callback. Normally, as the LWP::UserAgent object is fetching the reply, the ``content'' is loaded into the object, and available only when the entire response has been seen (by calling the content method on the response object). However, we can define a callback subroutine which will be called as each chunk is observed from the server.

In this case, as each chunk is observed, we'll get a call to our subroutine beginning in line 31 (an anonymous subroutine is being used here). The subroutine will be passed three values: the chunk of data that has been read ($chunk in line 32), the response object as constructed so far ($response), and the protocol handler object ($protocol). I'm not using the $protocol object at all, but the other two are very important.

Lines 34 to 41 attempt to update the total-bytes target, unless we've already done this for this download (noted because $target_is_set is set). Line 35 reaches into the response object, looking for the content_length header in the web server's response. If that's been provided, we can get a clearer idea of percentage of text seen so far.

If we know, we'll set the content length as the target in line 36, noting that we've already done that in line 37. However, for many downloads (especially those created dynamically), the server has no idea how many bytes it will eventually send. So in line 39, I fake up a target that is everything seen so far, plus the chunk we've just seen, plus perhaps one more chunk just like it. It's wrong, but there's no right value anyway, and we keep seeing the value as ``almost there''.
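The effect of that fake target is easier to see with numbers. This sketch pulls the choice out into a subroutine (estimate_target is a made-up name, and the chunk sizes are arbitrary) and feeds it a few chunks with no content length available.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the choice made in the callback.
# $seen is the byte count before the current chunk has been added.
sub estimate_target {
    my ($seen, $chunk_len, $content_length) = @_;
    return $content_length if $content_length;   # the server told us the size
    return $seen + 2 * $chunk_len;               # otherwise stay one chunk ahead
}

my $seen = 0;
for my $chunk_len (4096, 4096, 1000) {
    my $target = estimate_target($seen, $chunk_len, undef);
    $seen += $chunk_len;
    printf "seen %5d of estimated %5d (%.0f%%)\n",
        $seen, $target, 100 * $seen / $target;
}
```

With equal-sized chunks the bar sits at n/(n+1) after the nth chunk, so it creeps toward full without ever getting there until the final target is set after the download.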

Once I've updated the target, it's time to actually write the data that has been seen. I update the total bytes seen so far in line 43, and then print the data to the handle in line 44. Actually, I could eliminate the $output variable by using -s on the filehandle every time I need it, since those numbers should be the same once any buffered output has been flushed. However, that would be making an operating system request repeatedly for information that I can easily calculate, so why not just calculate it?
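One more point in favor of the counter: output through a Perl filehandle is normally buffered, so -s can lag behind what has been printed. The following sketch, which writes a small scratch file in a temporary directory (the file name is arbitrary), compares the two before and after a flush.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use IO::Handle;    # provides the flush method on older Perls

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/demo.bin";
open my $fh, ">", $file or die "Cannot create $file: $!";

my $output = 0;
my $chunk  = "x" x 10;
$output += length $chunk;
print {$fh} $chunk;

# -s asks the operating system, which only knows about flushed data.
my $before = (-s $fh) || 0;
$fh->flush;
my $after  = (-s $fh) || 0;

print "counter: $output, -s before flush: $before, -s after flush: $after\n";
close $fh or die "Cannot close $file: $!";
```

A small write like this one usually sits in the buffer until the flush, so the counter is the more trustworthy number during a download.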

Lines 46 to 48 update the bar. Initially, $next_so_far is 0, so we call this method on the first chunk of data we see. That will draw the initial bar with the initial guess of maximum bytes (possibly an accurate value directly from the server), and leave room for a ``time remaining'' value that will be updated after a few more calls. The return value from the update will modify $next_so_far, giving us the suggestion to not call update again until we've seen that many bytes. As mentioned earlier, this is an optimization so that the bar is updated roughly every half second, based on the calls seen previously for this progress bar. I could completely ignore this value, and just call update on each chunk, and the result would be similar, although a lot more output would be generated.
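The saving from honoring that return value can be measured with a stand-in object. The FakeBar class below is not Term::ProgressBar; it is a hypothetical substitute whose update method simply counts calls and asks to be called again 10,000 bytes later, an arbitrary step chosen for the demonstration.

```perl
use strict;
use warnings;

# Stand-in for a progress bar: update() records the call and returns the
# byte count at which it next wants to be called.
{
    package FakeBar;
    sub new    { return bless { calls => 0 }, shift }
    sub update { my ($self, $so_far) = @_; $self->{calls}++; return $so_far + 10_000 }
    sub calls  { return $_[0]{calls} }
}

my $bar = FakeBar->new;
my ($output, $next_so_far, $chunks) = (0, 0, 0);

for (1 .. 100) {                  # pretend 100 chunks of 1024 bytes arrive
    $output += 1024;
    $chunks++;
    if ($output >= $next_so_far) {
        $next_so_far = $bar->update($output);
    }
}
$bar->update($output);            # final call so the bar shows the true total

printf "%d chunks, %d update calls\n", $chunks, $bar->calls;
```

The loop body is the same shape as lines 46 to 48 of the listing, so swapping the stand-in for a real bar changes only how the next threshold is chosen.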

Once the download is complete, the call to get in line 29 returns, and we move on with the next step of the program. I want the bar to read ``100% downloaded'' when I'm done. I know the total length in $output, so I call target in line 52 to say ``yes, this number of bytes is 100%''. I also call update to say ``yes, I've seen exactly this many bytes'' in line 53. And I'm done with that file!

So there you have it: emulating curl's download time and percentage using Term::ProgressBar. Hopefully, you've seen enough to add progress bars to your own applications. Also, check out Tk::ProgressBar and CGI::ProgressBar in the CPAN for graphic and web-based applications, and Smart::Comments for automatically adding progress bars to your loops. Until next time, enjoy!

LISTING

        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     
        =4=     use Term::ProgressBar;
        =5=     use URI;
        =6=     use LWP::UserAgent;
        =7=     
        =8=     my $ua = LWP::UserAgent->new;
        =9=     
        =10=    while (@ARGV) {
        =11=      my $url = shift;
        =12=      print "$url:\n";
        =13=    
        =14=      my $uri = URI->new($url);
        =15=      my $path = $uri->path;
        =16=      $path =~ s{.*/}{};
        =17=      $path = "download" unless length $path;
        =18=      $path = "X$path" while -e $path;
        =19=    
        =20=      open my $outhandle, ">", $path or die "Cannot create $path: $!";
        =21=    
        =22=      my $bar = Term::ProgressBar->new({ name => 'Download',
        =23=                                         count => 1024,
        =24=                                         ETA => 'linear'});
        =25=    
        =26=      my $output = 0;
        =27=      my $target_is_set = 0;
        =28=      my $next_so_far = 0;
        =29=      $ua->get
        =30=        ($url,
        =31=         ":content_cb" => sub {
        =32=           my ($chunk, $response, $protocol) = @_;
        =33=    
        =34=           unless ($target_is_set) {
        =35=             if (my $cl = $response->content_length) {
        =36=               $bar->target($cl);
        =37=               $target_is_set = 1;
        =38=             } else {
        =39=               $bar->target($output + 2 * length $chunk);
        =40=             }
        =41=           }
        =42=    
        =43=           $output += length $chunk;
        =44=           print {$outhandle} $chunk;
        =45=    
        =46=           if ($output >= $next_so_far) {
        =47=             $next_so_far = $bar->update($output);
        =48=           }
        =49=    
        =50=         });
        =51=    
        =52=      $bar->target($output);
        =53=      $bar->update($output);
        =54=    
        =55=    }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.