Clojure tutorial: fetching web comics (part 2)

November 18, 2008

Last time, we saw how to fetch the image URL of web comics in Clojure by using regular expressions and some Java objects. A shortcoming of that program was that it could only fetch an image URL. Some web comics (such as Xkcd) have a small tooltip text that appears when you hover the mouse cursor over the image on the web site; this text is often an integral part of the comic and we would like to fetch it as well.

Today’s program

In this article, we will modify our original program to fetch the latest Xkcd with its tooltip text. To keep things interesting for the avid Clojure apprentice, we will use multi-methods for this purpose. We will also see more aspects of Clojure’s integration with Java by using the HTML Parser library.

Language changes

Recently, Clojure had a couple of changes in its core library that will make it into 1.0. For this tutorial to be useful with future stable versions of Clojure, I will now be using the SVN version of Clojure instead of the latest stable release. The changes are not many, but they do affect the program from the first article. Here is a list of things you’ll need to change:

Regular expression literals are automatically escaped

There was only one occurrence in the original program of a regular expression with a backslash in it, the :regex attribute for Penny Arcade. Delete one backslash to make the line look like this:

:regex #"images/\d{4}/.+?(?:png|gif|jpg)"
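
As a quick check of the single-backslash form, here is a minimal sketch that runs the regular expression against a made-up HTML fragment. The name penny-arcade-regex and the sample string are assumptions for illustration only; they are not part of the original program.

```clojure
;; Hypothetical name; the pattern itself is the one from the tutorial.
(def penny-arcade-regex #"images/\d{4}/.+?(?:png|gif|jpg)")

;; Made-up HTML fragment standing in for the page source.
(def sample-html "<img src=\"/images/2008/20081117.jpg\" alt=\"\">")

;; The pattern has no capturing groups, so re-find returns the matched string.
(re-find penny-arcade-regex sample-html)
;; => "images/2008/20081117.jpg"
```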

Binding syntax is done inside vectors everywhere

Some people may have found it inconsistent that the syntax of certain binding-introducing forms was (form [var val]) while other forms didn’t have the square brackets. This has now been addressed and all bindings are done inside square brackets. There were two such occurrences in the original program: the with-open call in fetch-url and doseq at the end of the program. Change these two lines to the following:

(with-open [stream (. url (openStream))]

(doseq [comic *comics*]
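
To try the vector binding syntax without touching the network, here is a small sketch that uses an in-memory reader instead of a URL stream. The function first-line and the sample data are hypothetical and only serve to show the shape of the two forms.

```clojure
(import '(java.io BufferedReader StringReader))

;; with-open takes its binding in a vector and closes the reader on exit.
(defn first-line
  "Returns the first line of text (hypothetical helper for illustration)."
  [text]
  (with-open [reader (BufferedReader. (StringReader. text))]
    (.readLine reader)))

;; doseq also takes its binding in a vector.
(doseq [comic [{:name "Penny-Arcade"} {:name "Xkcd"}]]
  (println (:name comic)))
```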

Multi-methods

Multi-methods are one of Clojure’s ways to create polymorphic code. There are two parts to them:

  • The declaration: We create a new multi-method with the defmulti macro. We specify the name of the multi-method and a dispatch function. The dispatch function will be called with all the arguments passed to the multi and its return value will be used to choose which method to execute. An optional third argument specifies a default dispatch value; if it’s omitted, :default is assumed.
  • The methods: They’re called multi-methods because they can have multiple implementations. You define a method with the defmethod macro. You must supply the name of the multi, the dispatch value, the parameter vector and the body.

To make this clearer, here is a simple example. report is a multi-method that is passed a collection and returns "I am empty" if calling the dispatch function empty? on its argument returns true and "I have elements" otherwise.

(defmulti report empty?)
(defmethod report true [x] "I am empty")
(defmethod report :default [x] "I have elements")

(report "")         ; "I am empty"
(report [1 2 3])    ; "I have elements"

fetch-comic

We will declare a fetch-comic multi-method that takes a comic and dispatches on its :type value. The default method will be our old regular expression function, which we'll transform into a method.

(defmulti fetch-comic :type)

Now, let's convert image-url to a method; the name was changed to fetch-comic because we don't simply fetch a URL anymore, we may get other information as well. Don't forget to update the call in the doseq at the end of the program. Methods cannot have documentation strings, so we had to remove the one from the original function.

(defmethod fetch-comic :default [comic]
  (let [src (fetch-url (:url comic))
        image (re-find (:regex comic) src)]
    (str (or (:prefix comic) (:url comic))
         image)))

The program should work just like it did before.
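
The reason the existing comics keep working is that a map without a :type key yields nil as its dispatch value, and since no method is registered for nil, the :default method is chosen. The sketch below illustrates this with a stand-in multi-method; describe-comic and its return strings are hypothetical and not part of the tutorial's program.

```clojure
;; Hypothetical multi-method that dispatches on :type, like fetch-comic.
(defmulti describe-comic :type)

(defmethod describe-comic :tooltip-comic [comic]
  (str (:name comic) " uses the tooltip method"))

;; Chosen whenever no method matches the dispatch value, including nil.
(defmethod describe-comic :default [comic]
  (str (:name comic) " uses the default method"))

(describe-comic {:name "Xkcd" :type :tooltip-comic})
;; => "Xkcd uses the tooltip method"
(describe-comic {:name "Penny-Arcade"})
;; => "Penny-Arcade uses the default method"
```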

Fetching image URL and tooltip

With our old function transformed into a method, we are ready to tackle the tooltip-fetching method. Although nothing stops us from using regular expressions for this task, we will use a Java library specifically designed for HTML parsing and extraction.

The method is fairly short (12 lines), but I must first introduce some concepts that will be used and talk about the HTML Parser library.

  • Refs: Despite being a functional language, Clojure recognizes that there are situations when having data that changes is necessary. Refs are one way to do so: a ref is basically a variable that holds a reference to an immutable value. When you "modify" that value, what actually happens behind the scenes is that a new value is created and your ref now points to it, leaving the old one intact.
  • proxy: proxy is a macro that extends a class, implements interfaces and returns an instance of that new class.
  • HTML Parser: a Java library to parse and extract content from an HTML document. The org.htmlparser.Parser constructor fetches the HTML online if its argument looks like a URL. The library provides many built-in filter classes, though none allows using a regular expression to search for a particular attribute in a tag. We will therefore use the visitor pattern method it provides: visitAllNodesWith takes a NodeVisitor argument, and we'll use proxy to implement its visitTag method.

(import '(org.htmlparser Parser)
        '(org.htmlparser.visitors NodeVisitor)
        '(org.htmlparser.tags ImageTag))

(defmethod fetch-comic :tooltip-comic [comic]
  (let [img-tags (ref [])
        parser (Parser. (:url comic))
        visitor (proxy [NodeVisitor] []
                  (visitTag [tag]
                            (when (and (instance? ImageTag tag)
                                       (re-find (:regex comic)
                                                (.getImageURL tag)))
                              (dosync (alter img-tags conj tag)))))]
    (.visitAllNodesWith parser visitor)
    [(.getImageURL (first @img-tags))
     (.getAttribute (first @img-tags) "title")]))

That may seem like a lot of code, but there are actually many things in there that you already know. Let's look at it in detail:

  • import: we went over this in the first article; it just imports some names into the current namespace. We import some classes from HTML Parser to keep our code a little more succinct.
  • defmethod: we've just seen this: it creates a method for the multi-method fetch-comic, used when the dispatch value is :tooltip-comic.
  • let: we've seen let before also: it creates a new scope and establishes some bindings within that scope.
  • img-tags (ref []): ref returns a reference that points to its argument. We will store the image tags that fit our search criteria into img-tags. We'll see in a minute why we need a "mutating" variable for this purpose.
  • parser (Parser. (:url comic)): call the Parser constructor with the URL of the comic.
  • visitor (proxy [NodeVisitor] []): this is the really interesting part. proxy will subclass NodeVisitor and return an instance of this new class. We implement the visitTag method: it takes one argument, a tag, and has a void return value, which is why we need to store the tags in a ref. When the tag is an image tag and its src value matches our regular expression, we conj it onto img-tags.
  • (dosync (alter img-tags conj tag)): dosync executes the expressions in its body in a transaction. alter (which must be called within a transaction) replaces the value referenced by img-tags with the result of conjing the current tag onto it.
  • (.visitAllNodesWith parser visitor): visit all the nodes of parser using our custom visitor object. When this has completed, img-tags should have the image tag of the comic.
  • (.getImageURL (first @img-tags)): get the URL of the first image tag. @img-tags is syntactic sugar for (deref img-tags); it returns the value referenced by the ref. getImageURL returns the complete URL of the image, so we won't need a prefix like we did with the other method.
  • (.getAttribute (first @img-tags) "title"): getAttribute returns the value of an arbitrary attribute of a tag. The tooltip text of a comic is in the title attribute of its img tag.
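
The same ref-plus-proxy pattern can be tried without HTML Parser on the class path. The sketch below is only an analogy: it assumes Java 8 or later and uses java.util.function.Consumer as a stand-in for NodeVisitor, since its accept method also returns void. The function collect-matching and the sample URLs are hypothetical.

```clojure
(import '(java.util ArrayList)
        '(java.util.function Consumer))

(defn collect-matching
  "Returns a vector of the strings in items that match pattern.
  Hypothetical helper; Consumer plays the role of NodeVisitor here."
  [pattern items]
  (let [matches (ref [])
        ;; accept returns void, so results are accumulated in the ref.
        visitor (proxy [Consumer] []
                  (accept [item]
                    (when (re-find pattern item)
                      (dosync (alter matches conj item)))))]
    ;; forEach calls accept once per element, like visitAllNodesWith.
    (.forEach (ArrayList. items) visitor)
    @matches))

(collect-matching #"comics"
                  ["http://imgs.xkcd.com/comics/a_bunch_of_rocks.png"
                   "http://example.com/logo.png"])
;; => ["http://imgs.xkcd.com/comics/a_bunch_of_rocks.png"]
```

As in the real method, the callback cannot return anything useful, so the only way to get results out is to record them in state that outlives the call.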

Data

The final step is to add Xkcd to our *comics* vector:

{:name "Xkcd"
 :url "http://www.xkcd.com"
 :regex #"comics"
 :type :tooltip-comic
}

Running the script

To run the script, you will need to include HTML Parser in your class path:

$ java -cp $HOME/src/clojure/clojure.jar:$HOME/src/htmlparser1_6/lib/htmlparser.jar \
clojure.lang.Script comics2.clj

Penny-Arcade: http://www.penny-arcade.com/images/2008/20081117.jpg
We The Robots: http://www.wetherobots.com/comics/2008-11-14-Gnawed.jpg
Xkcd: ["http://imgs.xkcd.com/comics/a_bunch_of_rocks.png" "I call Rule 34 on Wolfram's Rule 34."]

Full program

You can download the full program here

Special thanks to Chouser for proof reading a draft of this post.