Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.


Linux Magazine Column 25 (Jun 2001)

[suggested title: Simple online quiz technique (part 1 of 2)]

I have a pretty long list of ``write a magazine article about this someday'' items. But I could always use more, so if you want to see your name in print, please email them to me, and you'll be appropriately credited!

One item that's been in there for nearly as long as I have been keeping a list is ``show how to do an online quiz correctly, so that people can't cheat by backing up or peeking at the URLs''. Why was that there? Well, far too often, I've seen ``web quiz'' freeware that was all too trivial: either the proper answer was guessable by staring at the mouseover URLs, or I could simply hit the ``back'' button when I got it wrong and try a different guess until I got it right.

So, I gave it some thought, and came up with a scheme that was very simple, permitting me to generate random quizzes, yet preventing anyone from sneaking a peek at the ``proper'' answer, and permitting only one try at each question.

But I hate to write content. If you have visited my website, you know the only really changing content is the online archive of the past magazine columns I've written, and the code that results from writing those magazine columns. So if I were to show a quiz in a column, I'd need to come up with content. Argh.

However, the other day, on a mailing list I frequent called (void), which seems mostly populated by 20-something new-media hackers in London, many of whom are big Perl fans, someone posed a ``trivia'' question which got me thinking.

He quoted a paragraph from the screenit.com website. Now, I'm very familiar with screenit.com as one of my most frequently visited sites. Each week, nearly every new movie opening in the US gets thoroughly reviewed, not only for artistic merit (in great detail), but also in terms of ``parental information'' organized into 16 categories such as ``Alcohol/Drug use'' and ``Frightening/Tense scenes''. At the top of the review is a quick scoreboard, but the details further down the page give a meticulously detailed amount of information about how appropriate this movie is for your kids or (in my case) your big date with that special someone.

I don't know how the reviewers get this much information on each movie and get it up on the website in near real time, as they have every week since about 1996, especially when they are supported primarily by ad revenue and a passion for the result. So please, support their sponsors, and drop them a thank-you note at the provided links.

OK, so back to the mailing list. The quoted paragraph was the ``Profanity'' section for one of the movies, listing all the words used about which a conservative parent should be concerned, and the challenge was to ``guess the movie''. Because the Profanity paragraph seemed to go on and on and on, a few guessed ``South Park'', which probably ranks near the top for excessive profanity, but the correct answer was ``The Big Lebowski''. If you want to see that, follow this link:

  http://www.screenit.com/movies/1998/the_big_lebowski.html#p

[Editors, do we dare simply reprint the paragraph in the column? Your call!]

I got to thinking ``hey, here's a lot of free content for an interesting quiz''. After all, their archives go back a few years, and the data format is fairly regular (not quite regular enough though... as we'll see later).

So I hacked up a program to grab the data, and another one to implement the quizzing architecture I had already sketched. This month, we'll look at the data grabber, and you'll have to wait until next month to see the quizzer. Sorry, too much for one month.

For the data grabber, I used Google to search for the movies. Google has this nice underdocumented ``site-only'' search feature: by adding site:screenit.com to a query, I get hits only from that site. By adding ``profanity'' to the query, I got back a list of the pages that included that word at least once, a narrow-enough query that I got a lot of high-quality hits.
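For the curious, the index-page request that we'll construct boils down to a URL of roughly this shape (reverse-engineered at the time of writing, so Google may change it at any moment):

  http://www.google.com/search?q=site:screenit.com+profanity&num=100&start=0&filter=0

The q parameter carries the query, num asks for 100 hits per page, start selects the offset, and filter=0 turns off the ``similar pages'' collapsing.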

Of course, Google doles out the hits at most 100 at a time, so I had to repeat the query using an increasing starting point until I got all the hits it could give me. We'll see how that works when we get to the code below.

And then the really cool part. Part of the link that Google returns is a pointer to Google's own cache for that page. So, rather than following the link back to screenit.com's site, causing stress on their sometimes overloaded server, I simply ask Google for its cached version! After all, if it's not in Google's cache, it won't be returned in a search.
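Inside a Google result page, a cache link looks something like this (illustrative only; the exact surrounding markup is whatever Google happens to emit this week):

  <A HREF=/search?q=cache:www.screenit.com/movies/1998/the_big_lebowski.html+site:screenit.com+profanity>Cached</A>

Everything between ``cache:'' and the next plus sign is the host-plus-path of the original page, which is exactly what we'll capture and use as our database key.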

From that cached page, I then look into the data to find the profanity paragraph and the movie title. The movie title is somewhat related to the screenit.com URL, but I didn't count on that, since some of them seemed to be arbitrary.

Finally, for permanent storage (my cache of their cache), I use a simple DBM database, accessible in Perl as a hash (a hash cache), allowing for easy programming and updates and queries. I had first considered using a MySQL database, but this turned out to be much easier.

And the profanity grabbing program is presented in [listing one, below].

Lines 1 through 3 begin nearly every program I write, enabling warnings, turning on the common compiler restrictions (mandatory for programs longer than 10 lines), and disabling the buffering of standard output.

Line 6 defines the only configuration constant for this program: the location of the database that this program and the quiz program share. This has to be in a directory that is accessible from CGI programs, although it should not be in a directory that is mapped to a URL. You wouldn't want it to be that easy to cheat.

Lines 9 through 11 pull in the web access modules (part of the LWP library) to fetch and manipulate the data from Google.

Line 13 connects us to our DBM database. Yes, purists may point out that dbmopen is officially ``deprecated'', but it's still the most convenient interface for quick-and-dirty programs like this.
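For the record, dbmopen is little more than a tie in disguise, so the purists could write the equivalent as something like this sketch (AnyDBM_File selects the same default DBM implementation that dbmopen would):

  use Fcntl;        # for O_RDWR and O_CREAT
  use AnyDBM_File;  # same default DBM choice as dbmopen
  tie my %DATA, 'AnyDBM_File', $DATA_DB, O_RDWR|O_CREAT, 0644
    or die "Cannot open db: $!";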

Lines 15 through 18 remind me of the format of the database. I picked the format out of sheer laziness, not because of any design. The key is the URL minus the scheme, because that's how Google reports it for an internal link. The value consists of the real name of the movie (as we would call it), followed by a newline and then the paragraph of the profanity information. I didn't do any cleanup of the paragraph, so most of them still contain an embedded ``A HREF=...'' element. Please remember that laziness is a Perl virtue.
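Given that two-part format, taking a record apart again is a one-liner. Next month's quiz program will do something much like this sketch (using the sample key from the comment in the listing):

  my ($title, $prof) =
    split /\n/, $DATA{"www.screenit.com/movies/1997/gone_fishin.html"}, 2;

The three-argument split stops after the first newline, so $prof keeps the embedded newlines of the profanity paragraph intact.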

Line 20 creates the ``user agent'' object, acting as a tiny browser to go view the web. If I had needed to set up a particular configuration, such as web proxy information or a particular user agent, I'd do that here too.
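For instance, if I had been behind a firewall, or had wanted a distinctive user-agent string, the setup might have looked like this sketch (the agent name here is made up):

  my $ua = LWP::UserAgent->new;
  $ua->env_proxy;                       # honor http_proxy and friends
  $ua->agent("ProfanityGrabber/0.1");   # hypothetical identifying string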

Line 22 starts the ``outer loop''. We'll loop once for each index page from Google, stepping through result hits 100 at a time.

Lines 24 through 26 set up the Google URL for this particular query. We'll ask for links into screenit.com that contain the word ``profanity'', 100 at a time, starting at the link numbered in $start, and disabling the ``similar pages'' filter. These parameters were reverse-engineered by typing in those queries directly to Google, and watching the resulting links being provided. So, if Google changes format, this breaks. Such is the way with screen scraping.

Lines 27 and 28 ask Google for the response, aborting the outer loop if the request fails.

Lines 31 and 32 look within the Google response for links to the cache, which means we've stumbled across some pages that Google has seen at screenit.com. If none, again, we simply abort the outer loop.

Line 35 begins the ``inner loop'', cycling once for each URL in Google's cache, which should be a page at screenit.com containing the word ``profanity''.

Lines 36 through 39 skip any URL that doesn't look like a movie. Again, this was determined by looking at the returned links, and noticing that screenit.com also rates music CDs and other things. But the movies are all movies/year (where year is a 4-digit number), so we ignore anything not in that pattern.
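If you want to see that filter in action, a quick test shows which keys survive (the music-CD path here is my guess at the shape of those URLs, not a real page):

  for my $u ('www.screenit.com/movies/1998/the_big_lebowski.html',
             'www.screenit.com/musiccds/some_album.html') {
    print "$u => ", ($u =~ m{movies/\d\d\d\d/} ? "keep" : "skip"), "\n";
  }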

Lines 40 to 43 skip over any URL for which we've already got good data. This permits us to re-run this program quickly, even as often as once a day, without spending a lot of time refetching data we've already extracted. This step is why the DBM is keyed by the URL rather than the movie title: we want this lookup to be very fast.

Lines 46 to 52 ``follow'' the cache URL link, grabbing the data from Google's cache. Note that at no time are we actually touching screenit.com's web site: we're simply making a cache of a cache. I think this is important in terms of being a good neighbor for such a valuable service. If we can't get the cache, we simply ignore it.

Lines 54 through 70 locate the ``profanity'' paragraph within the response. I'm using an ``extended'' regular expression here, so the whitespace within the regular expression spanning lines 56 to 65 is simply ignored. Again, the pattern here came from staring at enough of the entries until I could see the possible range. Even so, there are a few movies that this doesn't match properly, but I've got enough movies for a good quiz, so I stopped. The pattern puts the profanity paragraph in the $1 match variable, which is saved into $prof in line 70.

Similarly, lines 73 to 86 find the movie title by extracting the HTML page title. Some of the titles have a preceding label, optionally matched in lines 75 to 77. Again, this came from running the program a half-dozen times, constantly tweaking what was and wasn't matching. Line 87 dumps the title to standard output as an indication that we got good matching data for this movie.

Finally, line 90 stores the data into the database through a simple DBM assignment.

Line 94 reports the total number of movies in the quiz database: 933 at the moment I'm writing this. That's plenty for the quiz.
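To give you the flavor, a run traces along these lines (heavily abridged, with hypothetical URLs; your hits will certainly differ):

  $ perl profanity_grab
  skipping www.screenit.com/musiccds/some_album.html
  http://www.google.com/search?q=cache:www.screenit.com/movies/... ==>
  ... THE BIG LEBOWSKI
  933 total movies for the quiz!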

So, we plop this into a file, adjust the one configuration parameter, and run it. Out pops a database of movies and their screenit.com profanity paragraphs, all without touching screenit.com at all, thanks to our friends at Google. Next month, we'll see how to build a CGI quiz script using this database. Until then, watch a movie or two at the theatre, or rent a DVD. Enjoy!

Listings

        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     ## config
        =6=     my $DATA_DB = "/home/merlyn/Web/profanity_quiz";
        =7=     ## end config
        =8=     
        =9=     use LWP::UserAgent;
        =10=    use HTTP::Request::Common;
        =11=    use URI;
        =12=    
        =13=    dbmopen my %DATA, $DATA_DB, 0644 or die "Cannot open db: $!";
        =14=    
        =15=    ## %DATA format: for each movie, keyed by partial URL,
        =16=    ## value is "$title\n$profanity_paragraph_with_newlines", as in:
        =17=    ## $DATA{"www.screenit.com/movies/1997/gone_fishin.html"} =
        =18=    ## "GONE FISHIN'\n<DL>many\nlines\n</DL>\n";
        =19=    
        =20=    my $ua = LWP::UserAgent->new;
        =21=    
        =22=    for (my $start = 0; ; $start += 100) {
        =23=      ## fetch each index page:
        =24=      my $uri = URI->new("http://www.google.com/search");
        =25=      $uri->query_form('q' => "site:screenit.com profanity",
        =26=                       'num' => 100, 'start' => $start, 'filter' => 0);
        =27=      my $response = $ua->simple_request(GET $uri);
        =28=      last unless $response->is_success;
        =29=      
        =30=      ## parse the index page looking for links to movie pages in cache:
        =31=      my @urls = $response->content =~ m{A HREF=/search\?q=cache:(.*?)\+}g;
        =32=      last unless @urls;
        =33=    
        =34=      ## fetch each cached movie page if it fits the profile:
        =35=      for my $url (@urls) {
        =36=        unless ($url =~ m{movies/\d\d\d\d/}) {
        =37=          print "skipping $url\n";
        =38=          next;
        =39=        }
        =40=        if ($DATA{$url}) {
        =41=          print "skipping $url because we have it\n";
        =42=          next;
        =43=        }
        =44=    
        =45=        ## get cached movie page from cache:
        =46=        $uri->query_form('q' => "cache:$url");
        =47=        my $res = $ua->simple_request(GET $uri);
        =48=        print $uri, " ==>\n";
        =49=        unless ($res->is_success) {
        =50=          print "___ FAILURE ___\n", $res->as_string, "______\n";
        =51=          next;
        =52=        }
        =53=    
        =54=        ## look for profanity paragraph:
        =55=        unless ($res->content =~ m{
        =56=                                   \n
        =57=                                   (
        =58=                                    <dl>
        =59=                                    .*?
        =60=                                    (?:\n.*?)??
        =61=                                    profanity</a>\n
        =62=                                    (?:.+\n)*?
        =63=                                    </dl>\n
        =64=                                   )
        =65=                                   \n
        =66=                                  }ix) {
        =67=          print "can't find profanity DL in\n", $res->content;
        =68=          next;
        =69=        }
        =70=        my $prof = $1;
        =71=    
        =72=        ## look for title:
        =73=        unless ($res->content =~ m{
        =74=                                   <title>
        =75=                                   (?:
        =76=                                    SCREEN \s+ IT! \s+ \S+ \s+ REVIEW: \s+
        =77=                                   )?
        =78=                                   (
        =79=                                    .+
        =80=                                   )
        =81=                                   </title>
        =82=                                  }ix) {
        =83=          print "can't find title in\n", $res->content;
        =84=          next;
        =85=        }
        =86=        my $title = $1;
        =87=        print "... $title\n";       # for tracing
        =88=    
        =89=        ## save data:
        =90=        $DATA{$url} = "$title\n$prof";
        =91=      }
        =92=    }
        =93=    
        =94=    print scalar keys %DATA, " total movies for the quiz!\n";

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.