Clojure tutorial: fetching web comics (part 1)

October 31, 2008

This post is the first of what I hope will become a series about Clojure. Clojure is a young language, and though there is a lot of documentation already on the Internet and in the blogs of many enthusiasts, I figured there would be no harm in having some more.

In this series, I will start with a simple script and with each post, I will improve the program by introducing new Clojure features. The whole thing is not mapped out, and as such, I am very receptive to constructive comments on how to make these posts better.

The problem

The problem we will tackle on is a fairly simple one: scraping web sites. I am a of web comics enthusiast, I read quite a few of them, but I don’t like to go on 20 web sites to view them all, and their RSS feeds are often not to my satisfaction: some feeds only give you a link to the latest strip, others have news, information or publicity in them that you don’t want. So we will write a script that extracts the latest strip from web comic sites and create an RSS feed with those.

What we’ll do today

In this first post, we’ll get something very simple working: the program will download the content of a web site, extract the strip link and print it. This will allow us to view data structures and Java interop. I will not assume that you know any Clojure, so I’ll try to explain as we go along. If something is unclear, you can always check out the documentation on the official Clojure web site.

The pseudo code of our application will be as follows:

for each comic:
    get the html
    extract the image URL with a regex
    display the complete image URL

Data

We’ll start our program by defining our data. We will want to scrape several comic strips and not have to write one function per web comic, so we’ll need a standard way to represent the different comics we have. We will need four pieces of data:

  • The name of the web comic
  • The URL where the latest comic can be found
  • A regular expression to capture the strip image link
  • An optional URL prefix to construct an absolute URL

Because most sites use relative links for their images, if no URL prefix is given, we will assume that the URL of the latest strip page is to be used as the prefix.

We will represent the data of one comic with a hash-map and we will put all those hash-maps inside a vector. Here’s the result with two comics:

(def *comics*
  [{:name "Penny-Arcade"
    :url "http://www.penny-arcade.com/comic/"
    :regex #"images/\\d{4}/.+?(?:png|gif|jpg)"
    :prefix "http://www.penny-arcade.com/"
    }
   {:name "We The Robots"
    :url "http://www.wetherobots.com/"
    :regex #"comics/.+?[.](?:jpg|png|gif)"
    }
  ])

A few notes about this piece of code:

  • def is special form to assign a value to a name. In this case, we assign our vector of hash-maps to *comics*.
  • * is a valid character in an identifier. It is a Lisp convention to use asterisks around a variable name to indicate that it is a global variable. The list of valid identifier characters is described in the page about the reader.
  • Clojure has literal syntax for vectors: space-separated values enclosed in square braces.
  • Clojure has literal syntax for maps: space-separated values enclosed in curly braces. Clojure considers commas to be white space, so you can use them to clearly separate the different pairs: {:false 0, :true 1}.
  • Clojure has a special data type called a keyword. Keywords begin with a colon followed by one or more identifier characters.
  • Clojure strings are enclosed inside double quotes.
  • Clojure has literal syntax for regular expressions: #"regex". In the latest stable release of Clojure (20080916), the text inside the quotes is not automatically escaped, and the backslashes need to be doubled. In the Subversion repository and in future releases, this behavior has been changed and you no longer need to double the backslashes (except to represent a literal backslash.)
  • We omitted the :prefix key/value pair for “We The Robot”: accessing a non-existing key in a Clojure map returns nil, which is what we said we want when we want to use the value of the :url field as the prefix.

That’s actually quite a lot of notes for such a short piece of code! Now that we have our data, let’s look at the next step, fetching the HTML from a URL.

Fetching the HTML

Java has a class to read documents through the HTTP protocol, which means that Clojure has a class to read documents through the HTTP protocol. Sadly, Java does not have a method to download an entire document as a string. We’ll have to create our own function to do the deed.

The classes that we’ll need can be accessed by their fully-qualified names (e.g.: java.io.BufferedReader), but this tends to make the code long-winded. We’ll use the import function to load the class names into the current namespace to keep our code shorter.

(import '(java.net URL)
        '(java.lang StringBuilder)
        '(java.io BufferedReader InputStreamReader))

import takes an arbitrary number of lists where the first element is a symbol representing the name of the package and the rest are the classes to be added to the namespace. Here, we import URL, StringBuilder, BufferedReader and InputStreamReader.

Now, let’s look at the code to download an HTML page:

(defn fetch-url
  "Return the web page as a string."
  [address]
  (let [url (URL. address)]
    (with-open stream (. url (openStream))
      (let [buf (BufferedReader. (InputStreamReader. stream))]
        (apply str (line-seq buf))))))

We’ll look at the code line by line in just a moment, but let me first explain quickly what this function does. fetch-url is a function that takes an argument, address, uses this argument to create a new URL object and open a stream to that object. We then read all the lines from that stream, join them together and return one big string.

  • (defn fetch-url: defn is a macro used to define a new function. It is followed by the name of the function, fetch-url.
  • "Return the web page as a string.": functions can have a documentation string which can be consulted in the REPL with the doc function. It appears between the name of the function and the formal parameters.
  • [address]: the list of formal parameters is actually a vector of formal parameters.
  • (let [url (URL. address)]: let is a special form used to introduce a new scope with some bindings. The bindings are defined inside square brackets with the format [name1 value1 name2 value2 ...]. (URL. address) constructs a new URL (the Java class) from address. Suffixing a dot to a class name is the same as (new ClassName).
  • (with-open stream (. url (openStream)): with-open is a macro that wraps code inside a try/finally block and calls the close method after the block has finished executing. Here we open a stream to the URL of our comic and with-open will automatically close that stream when we’re done. There are other ways to call the method: (. url openStream) and (.openStream url) are both valid.
  • Next we have one more definitions, a buffered reader. This should be familiar to Java people.
  • (apply str (line-seq buf)))))): the function line-seq returns a lazy sequence of all the lines in a BufferedReader. We then apply the str function to all those lines to join them together into one string and this value is returned. You’ll note that there are a lot of closing parentheses on this line: it’s a Lisp convention to close every parentheses on the same line instead of putting each one on a separate line as is conventional in the Java world.

Phew, that was a lot to take in! Now that we’ve completed the second line of our pseudo code, we’re ready to extract the image links.

Extracting the image link

The function used to get the image link is much shorter than fetch-url. We will pass a comic (a map), we will use the Clojure function re-find to find the string we are looking for and we will return it with the prefix. Let’s look at the code:

(defn image-url
  "Return the absolute URL of the image of a comic.
  If the comic has a prefix, prepend it to the URL,
  otherwise use the :url value."
  [comic]
  (let [src (fetch-url (:url comic))
        image (re-find (:regex comic) src)]
    (str (or (:prefix comic) (:url comic))
         image)))

This should now look familiar to you. A function of one argument with a documentation string. We won’t look at every line, instead I’ll explain the important parts:

  • Maps are functions of their keys: to access a value in a map, you say (map key). If a key is a keyword, you can also say (:keyword map).
  • re-find returns either the matching string if there were no captures in the regular expression, a vector if there were captures or nil if no match was found. We don’t do any captures in our examples, so image is a string.
  • The function str is used to concatenate strings. (str "foo" "bar") returns "foobar".
  • or returns its first argument if it’s true, the second one otherwise. nil and false are the only false values, all other values are true. This returns the prefix if there is one or the url if there is no defined prefix.

Printing the URLs

Finally, we can print the URLs. We will use the doseq macro for this purpose, which is practically a foreach loop. doseq takes three argument: the name of an individual item, a collection and a body. We will print the name of the comic and the URL of its latest strip.

(doseq comic *comics*
  (println (str (:name comic) ": " (image-url comic))))

This should give us the following output:

Penny-Arcade: http://www.penny-arcade.com/images/2008/20081029.jpg
We The Robots: http://www.wetherobots.com/comics/2008-10-22-Storytime.jpg

Next time

Next time, we’ll look at how multimethods can help us to handle cases such as Xkcd where we also want to get the URL of the strip, but also the alt text to have a complete strip.


PHP: wrong for long-running processes (wrong for America?)

October 28, 2008

Yesterday at work, we heard from a client for whom we’d written a web application two years ago. Although his call had nothing to do with this project, I was reminded of a problem with PHP that we’d found out about when we wrote this program: PHP doesn’t reuse freed file descriptors and thus can run out of them in a long-running process.

For this particular project, we wrote a small PHP service that needs to be running 24/7. I won’t get into details of why we needed such a service, NDA and whatnot. One thing I can tell you is that it crashes every 3-4 weeks (wow, that’s a painful thing to say…) and needs to be restarted. The client doesn’t really mind, and sees little value in paying to fix this bug. So we never got to fix the problem. Nevertheless, let’s look at it.

As I said in the first paragraph, PHP can run out of file descriptors. Here is a simple script to demonstrate the problem.

<?php
$fd = fopen('/etc/passwd', 'r');
echo "$fd\n";
fclose($fd);

$fd = fopen('/etc/fstab', 'r');
echo "$fd\n";
fclose($fd);
?>

And the program’s output:

$ php fds.php
Resource id #5
Resource id #6

Here’s the equivalent Python program for comparison purposes:

f = open('/etc/passwd')
print f.fileno()
f.close()

f = open('/etc/fstab')
print f.fileno()
f.close()

And the output:

$ python fds.py
3
3

As you can see, in the PHP program, although we clearly close the first file, the interpreter used the next file descriptor, while the same was reused in Python. If you wrap the code of these scripts inside an infinite loop, you will see that Python prints the same file descriptor over and over again while the one in PHP keeps incrementing.

And that’s the problem we bumped into; after weeks of opening and closing sockets, files and other types of resources, the program reached the last possible file descriptor and then crashed.

For the vast majority of PHP scripts out there, this is not a real problem since the interpreter is usually launched for small periods of time to generate an HTML document. It does, however, render PHP completely unsuitable for developing any sort of long-running program.

Update: I got some flack for calling resources file descriptors. First, I would like to say that when you observe the /proc//fd/ directory on Linux when the script is running, the fd #3 is indeed reused over and over again.

However, resources are not. I ran the script in an infinite loop print every 200,000th line. It went up to 2^31-1 then it wrapped into negative numbers. And finally, it reached back positive numbers. Here are the last lines of output.

Resource id #-1000000
Resource id #-800000
Resource id #-600000
Resource id #-400000
Resource id #-200000
Resource id #0
real: 105074.820s; user: 59353.445s; sys: 45313.772s; CPU: 99.61%
~/php$

There we go, PHP just exited.


Cribbage points counter in Clojure

October 11, 2008

For the past week or so, I’ve been playing around with Clojure, the JVM-based Lisp dialect written by Rich Hickey. Clojure has been out for a year now and I guesstimate I’ve known about it for maybe 7-8 months, but I’d never really tried it before this week. I converted my Haskell Cribbage points counter and am currently in the process of converting my Python comics fetcher. In this post, I will discuss the Cribbage point counter, which doesn’t exploit the JVM interoperability features nor the concurrency features. It’s a strictly functional and sequential program.

I go over the code function by function in this post, but those who want to see the whole thing, you can see it here.

Cribbage rules

First, for the people who don’t know Cribbage, let me give you a quick primer. In a two-player match, each player is dealt six cards. They each discard two cards into a hand that we call “the crib.” Then, the non-dealer cuts the deck and the dealer reveals the card at the cut point. This is the starter card. There’s a part of the game called the play, but we’re not going to focus on that here. Instead, we’ll look at the count. There are five (5) ways to make points at the count:

  1. Fifteens: every card combination that adds up to fifteen is worth 2 points. Numbered cards have a value equal to their rank, the ace is worth one point and all figure cards are worth 10 points.
  2. Pairs: every pair (two cards with identical rank) are worth 2 points. This means that three of a kind is worth 6 points and four of a kind is worth 12 points.
  3. Flush: If every card in the hand is of the same suit, each card is worth one point. If the hand is the crib, the starter card must be of the same suit, otherwise no points are awarded.
  4. Straights: straights of three or more cards are worth one point per card. Only the longest straight(s) are counted.
  5. His Nobs: if you have a jack in your hand and that its suit is equal to the suit of the starter card, that is worth one point.

Data structure

Let’s look at the code now. The first thing we need is a way to create cards. We will use a Clojure struct to hold a card’s properties. Each struct will have four fields: rank, suit, numerical value and index. The first two fields are obvious, the numerical value field will be used when we add up combinations to see if they equal 15 and the index will allow us to sort cards to see if we have a straight.

(def *ranks* {:ace     [ 1  0]
              :two     [ 2  1]
              :three   [ 3  2]
              :four    [ 4  3]
              :five    [ 5  4]
              :six     [ 6  5]
              :seven   [ 7  6]
              :eight   [ 8  7]
              :nine    [ 9  8]
              :ten     [10  9]
              :jack    [10 10]
              :queen   [10 11]
              :king    [10 12]})

(def *suits* #{:club :diamond :heart :spade})

(defstruct card :rank :suit :value :index)

(defn make-card [rank suit]
  (when (and (contains? *ranks* rank)
             (*suits* suit))
	(struct card rank
                 suit
                 (first (*ranks* rank))
                 (second (*ranks* rank)))))

*ranks* is a hash-map with the keys being keywords (a data type in Clojure) and the values being a vector (denoted by the square brackets syntax) containing the numerical value and the index. *suits* is a set (denoted by the #{} syntax) of four keywords representing the suits.

defstruct is used to create a new type of structures. The first parameter is the name of the structure type and the rest are the names of the different fields.

Finally, we have the make-card function that will serve as a constructor to create a well-formed card. It takes two parameters, a rank and a suit, and returns a card struct if the rank and the suit are valid, nil otherwise (if a function has no explicit return value, nil is returned.)

Now that we have a way to represent cards, let’s look at how to count cards. I listed five ways a player can score in the count, so it would make sense to write one function for each and we’ll just add them all to get the total score.

Count fifteen

As was explained earlier, every combination of cards that adds up to 15 is worth two points. We shall do this by the following process:

  • List all card combinations
  • For every combination adding up to 15, yield 2 points
  • Add up the points

The first thing we need, is to list the card combinations. There is a function (not included in Clojure) called powerset that does just that. If we give it the vector [1, 2, 3], it should return [[], [1], [2], [3], [1, 2], [1, 3], [2, 3], [1, 2, 3]]. Wikipedia explains way we can generate a subset. Here it is in Clojure:

(defn powerset [xs]
  (let [n (count xs)
        m (Math/pow 2 n)]
    (for [i (range m)]
      (for [j (range n) :when (bit-test i j)]
        (nth xs j)))))

A few notes about this function: unlike Common Lisp or Scheme, bindings in a let form are done inside a vector. (Math/pow 2 n) calls the pow static method in the Math class from Java. Clojure has list comprehensions; their syntax is (for [element list] element). It is possible to filter elements by using :when followed by a boolean expression. The elements are generated lazily.

I won’t explain in detail how this function works. Instead, let’s look at how we use it to count the fifteens of a hand:

(defn count-fifteen [cards]
  (sum (for [set (powerset (map :value cards))
             :when (= (sum set) 15)] 2)))

We use a list comprehension here too. We generate the powerset of the numerical values of the cards passed to the function, and when a particular combination has a sum of 15, we yield 2. And we sum up all those 2s to give us the final tally.

One quick note: sum is not a function in Clojure, I wrote it myself. The definition is simple, it’s just a (reduce + args), but I find that the name sum is shorter and more descriptive.

Counting pairs

After we’re done with the fifteens, let’s count pairs. The count-pairs function is gonna be almost identical to count-fifteen except that we’re going to use a function called pairs on the ranks of the cards instead of a powerset. Let’s look at pairs:

(defn pairs [[x & xs]]
  (when (seq xs)
    (lazy-cat (map (fn [y] [x y]) xs)
              (pairs xs))))

When the input to pairs is [1, 2, 3], it yields [[1, 2], [1, 3], [2, 3]]. If a list with zero or one element is given, it returns nil. Like powerset, we generate pairs lazily.

With the pairs function, writing count-pairs is trivial:

(defn count-pairs [cards]
  (sum (for [[r1 r2] (pairs (map :rank cards))
             :when (= r1 r2)] 2)))

This reads: for every pair of ranks, if r1 and r2 are equal, yield 2 and them sum the 2s.

Counting His Nobs

I skipped over the flush and the straight, because counting His Nobs is easier ;) There is a difference with count-fifteen and count-pairs: we must pass one extra parameter, the starter card, to test for suit equality. Let’s look at the code:

(defn count-his-nobs [cards starter]
  (if (some #(and (= (:rank %) :jack)
                  (= (:suit %) (:suit starter))) cards)
    1
    0))

I introduce a special Clojure syntax here, the #() syntax. Usually, anonymous functions are defined with: (fn [x] (something x)). Clojure provides a shortcut version: #(something %). % will replace the formal parameter, and if there are more than one, you use %1, %2, etc. to denote each parameter.

The function some is similar to the function any? in other programming languages: if an element in a collection returns true when applied to a predicate, some returns true. If no element is found to match, nil is returned (not false.)

You should now understand that if some find a card in cards that is a jack and that its suit is the same as the suit of the starter card, we return 1, otherwise we return 0.

Didn’t I tell you that it was an easy one?

Counting a flush

Like all good British games, Cribbage has some weird rules. Flushes are one example. With a “normal” hand, if the four cards in your hands are suited, that’s a flush and that’s worth four points. If the starter card is suited as well, that’s five points. However, in the case of the crib, you can only make a flush if your four cards and the starter card are suited.

This means that in order to properly count the points, we need to know if we are counting the crib or not.

count-flush will therefore have three formal parameters: the cards in your hands, the starter card and a boolean value to tell us if it’s the crib or not.

(defn count-flush [cards starter crib?]
  (defn count-flush-aux [cards]
    (if (apply = (map :suits cards))
        (count cards)
        0))
  (if crib?
      (count-flush-aux (conj cards starter))
      (max (count-flush-aux cards)
           (count-flush-aux (conj cards starter)))))

There’s a nested function inside count-flush: count-flush-aux. It returns the number of cards it’s been passed if they all have the same suit and zero otherwise.

The function conj is used to add an element to a sequence. If count-flush is called to count a crib hand, we test for suit equality for the hand cards and the starter card. If it’s a normal hand, we take the maximum value between counting a flush of just the hand cards and of the hand cards plus the starter card.

Counting straights

The count-straight function is probably the messiest function of the entire program (if you have suggestions on how to improve it, please do so in the comment.)

I’ll just explain the steps one by one, then I’ll paste the code. Hopefully, this will be understandable.

  • We get a powerset of the sorted indexes (the struct field) of our cards.
  • A straight has at least 3 cards in it, so we keep only the combinations that have a count of 3 or more.
  • We verify that all the indexes of a combination are increasing by one. We do so by zipping the combination and an infinite sequence starting from the first element of the combination and subtracting the two values. If every difference equals zero, we have a straight.
  • We find the longest straight (or zero if we have none.)
  • For every straight that has a count equal to the longest straight, we yield longest and sum them.
(defn irange
  "Infinite range starting at n"
  [n]
  (lazy-cons n (irange (inc n))))

(defn count-straight [cards]
  (let [indexes (sort (map :index cards))
        possible-straights (filter #(>= (count %) 3) (powerset indexes))
        straights (filter (fn [s] (every? zero?  (map - s (irange (first s)))))
                          possible-straights)
        longest (if (empty? straights)
                    0
                    (apply max (map count straights)))]
    (sum (for [s straights :when (= (count s) longest)] longest))))

Again, if you find a simpler way to do this function, please let me know.

Putting it all together

Now that we have our five counting functions, the last thing we have to do is add them all together.

(defn count-hand [cards starter crib]
  (let [all-cards (if starter
                      (conj cards starter)
                      cards)]
    (+ (count-fifteen all-cards)
       (count-pairs all-cards)
       (count-straight all-cards)
       (count-flush cards starter crib)
       (count-his-nobs cards starter))))

And that’s how you count points in Cribbage!

Impressions of Clojure

Despite its youth, Clojure is a very nice and very fun language. Rich Hickey clearly did his homework and thought about the design of the language instead of pitching everything in like C# or Java, without regard to the general cohesion of the language. It diverges from “traditional” Lisps in many regards, but it feels like one when you are coding. The many “shortcuts”, such as #() functions or having hash-maps be functions of their keys helps to keep the programs succinct.

I also have a couple of negative points; the execution environment could probably be a little more polished: sometimes, I would get a runtime exception, but I couldn’t get the line where the error was occurring. With a small program like mine, it’s not too hard to find the source of the error, but in a larger project, this could be a real pain point. I wouldn’t dislike it if Clojure came with a “clj” program to start the interpreter or execute a program instead of having you write a small bash script to do this task.

I will finish the translation of my comic script and I’ll probably write another post. This one will use more Java and concurrency, two of the main benefits of Clojure over other languages.