K&R and Programming Safety

The C Programming Language by Dennis Ritchie and Brian Kernighan — colloquially known as K&R — is consistently ranked among the top computer programming books. It introduces the C language, its idioms and a part of its library in 190 pages. The last 50 pages contain a reference of the language. Programmers frequently praise K&R for its succinctness and clarity and deplore that the newer programming books are much longer and don’t have clear explanations or real-world examples.

I own a copy of K&R and I agree that it is great; authors of programming books should read it and work to bring the same pith and clarity to their own work. Nevertheless, writing a book on a general purpose programming language in less than 200 pages is not something that authors should aim for. K&R achieves its thinness by avoiding some important topics, not the least of which is safety.

When learning a new skill, such as driving or cooking, it is important to be aware of the possible dangers and how to avoid them. Road safety should not begin after you’ve wrecked your father’s car and it is not after you’ve cut your fingers that you need to learn how to properly handle a knife. Programming is no different: safety needs to be taught from the start.

Regretfully, K&R lacks detailed explanations on how to use the language safely. I cannot recall instances when the authors discussed data validation, how to gracefully and idiomatically handle error conditions or how to avoid the dreaded buffer overflows or format string exploits that plague so many programs today. It might have been okay in the late 70s to just tell the reader how to make a program work and let them deal with safety themselves, but in the 21st century, when so many aspects of our lives are governed by software, it is crucial that safety be constantly on the programmers’ minds. Authors should not leave out the details on how to use a tool safely just to skimp on page; doing so would be highly irresponsible.

I recently read Programming in Scala by Martin Odersky, Lex Spoon and Bill Venners and I want to point to their book as an example of clear and succinct writing while still providing the read with plenty of details and making sure that the best practises of the language are taught.

Enums in C

A few months ago, in a general-purpose programming chat room, a few friends and I were arguing about the treatment of enumerated types in C. I had just watched one of the Stanford University videos on C++ where I learned that C++’s enums were real types and that they were strongly typed. I expressed my surprise and satisfaction on IRC about this fact and how I thought this was an improvement on their behavior in C. What followed was a shouting match where nobody listened to anybody and everybody just argued past one another about the treatment of enums in C.

I would like to explain in this post the problems I have with C’s enums without having to address the concerns of four different people at the same time. I wish to stress that those are the problems that I have with C’s enums, not the problems that C’s enums have.

First, a quick explanations of what enums are for the folks who don’t know C. Enums in C are a way to map identifiers to integral values. These identifiers can then be used to avoid magic numbers in your code. Here is a simple example:

enum Temperature { temp_low, temp_medium, temp_high };
enum Temperature t = temp_high;

The first line declares three “tags”, temp_low, temp_medium and temp_high which will be respectively mapped to the values 0, 1 and 2. (You can specify the exact value of a tag if you need to, but I won’t get into that, nor into the other subtleties, since they are irrelevant to the point I want to make.) The second line declares a variable of that enumerated type and assigns one of our tags.

One might ask why we don’t simply use #defines instead of enums; enums can be auto-enumerated, which is convenient, the type of the variable gives an indication of its usage, and certain compilers can emit a warning when you forget one of the tags in a switch/case statement. Beyond that, they’re pretty much the same.

I’m of the opinion that enumerations should create a new type and that the values of that type should be a set of “objects” that are completely distinct from the objects of other enumerations and from integers. In C, enums are not strongly typed. The following two declarations are perfectly valid:

enum Temperature t2 = 2;
int t3 = temp_high;

This is because enum Temperature is not a new type, it’s simply a set of textual tags that represent integers. You can mix and match ints and enums, and even with -Wall -Wextra -pedantic, gcc won’t raise a peep about the issue.

Even more insidious than that, you can mix and match different enums and you will not even get a warning. Consider the following code:

#include <stdio.h>

enum Direction { North, South, East, West };
enum Color { Red, Green, Blue };

void paint_screen(enum Color c) {
    switch(c) {
    case Red:
        puts("painting screen red");
        break;
    case Green:
        puts("painting screen green");
        break;
    case Blue:
        puts("painting screen blue");
        break;
    }
}

int main(void) {
    enum Direction d = South;

    paint_screen(d); /* oops! */

    return 0;
}

If you compile this code with -Wall -Wextra -pedantic, the compiler will remain silent, even though painting the screen South makes no sense.

In a language such as C++, enumerated types are their own types, not just integers with easy-to-remember names; the compiler would’ve caught that the argument passed to paint_screen was of the wrong type. This would raise an error that you’d be forced to fix to make the code compile.

The C apologists will argue that it’s not in the standard, that They Know What They’re Doing™, that it would make the language more complex, that doing bit-operations on enum values would be impossible, etc. I’m of the opinion that strongly-typed enums help make systems reliable, even if it seems like a trivial feature. A programming language’s target audience are programmers — people, not computers — and because programmers are fallible, the language should help making sure that silly mistakes are avoided.

Org-Mode

I’ve been using Org-Mode for the past few weeks and I have to say that, as a casual user, it’s a very nice, very solid piece of software.

todo

I wanted a small todo list application to remember the books I wanted to buy, the online articles I wanted to read, the blog posts I wanted to write, etc. Since Emacs is always running on my machine, and with all the good I heard about org-mode, I figured I ought to try it, and I wasn’t disappointed.

I am using version 6.29, which has supports levels-by-indentation, which is makes it really nicer to use and view. My usage is limited to adding a few tasks with check lists (though I hope to expand on that in the near future) and Org-Mode works wonders for that. The keybindings are very well thought out, such as the usage of tab to cycle through the different levels, the intelligent backspace key which goes over indentation, the different variations on the Return key or the M-Down/M-Up commands to re-arrange elements in a list. Features such as displaying a task’s progress with percentage or number of items checked off are surprisingly easy to activate and work flawlessly. I was also extremely impressed with the table support Org-Mode offers, though I’ve yet to use them for anything in my todo.org file.

I highly suggest that any Emacs user check out Org-Mode for the times when they need to take notes, make todo lists, plan an event, or even to manage a litte software project.

Letter to math book authors

Hello authors and future authors of mathematics introductory textbooks,

Today, I mailed my last integral calculus homework. In a couple of weeks, I’ll be able to take the exam and be one step closer to obtaining all the necessary requisite credits to enroll in a university computer science class. Although I did good in this calculus class and the one before, I’d like to give you a message: stop trying to impress me!

I’m not talented when it comes to maths. If I want to understand, I need to work very hard, read the material a couple of times to make sure everything really sunk in, do every single problem in the exercises, redo the ones I struggled with, ask friends of mine for further explanations, etc. And I don’t mind; there are things I’m naturally good at (probably), and maths are not one of those and I’m fine with that and with working harder to understand them. And I know I’m not the only person who’s in this situation.

As such, I wish that the authors of introductory math books would clearly spell out the steps in their explanations, even if that takes more pages. I remember one instance where a formula was introduced, and I could not see how they went from step a to step b. I tried to, but I just couldn’t see it. After 20-25 minutes of fruitless attempts to understand what reductions they applied, I went to a couple of math-savvy friends who were able to quickly explain to me what happened between those two steps.

I’m fortunate to have people around me with strong math skills who can help me whenever I’m stuck, but some people may not be so blessed. For their sake, and for mine too, please don’t assume that we implicitly know what happened between steps a and b, where you skipped 8 steps to shorten the text. Spell it out for us please. We won’t mind or be offended. If we understand without the extra explanations, we’ll just skim over them. If we don’t, we’ll be very grateful that the entire process is clearly explained.

A introductory math textbook is also not a place to try and dazzle us with your delusions of literary grandeur. Please use plain, simple sentences. I’m already devoting a large part of my limited brain power to understanding the math, I don’t need to waste anymore of those to figure out what your meaning. Just say it plainly.

Thank you, and with that, I’m off to see if linear algebra and vector calculus is any easier, and if my textbook is better written that the previous ones.

Are the new JavaScript engines going to completely change the game?

Last week, a gentleman wrote on his blog about the performance of CPython, Jython and IronPython in creating an empty binary tree to demonstrate a performance problem in IronPython.

The article was posted to Reddit, but I thought that the demonstration of other how dynamically typed languages performed was more interesting than the comparison of the different Python implementations. Among the implementations submitted were Slava Pestov’s Factor version and my own version in Clojure. On my home PC, creating a binary tree of depth 22 took 549 seconds in Python, 14 seconds in Clojure and 4 seconds in Factor.

But my biggest surprise was when I tried a JavaScript implementation with Google’s V8 engine: it executed in 1.8 seconds. That is 300 times faster than the Python implementation. I have not (yet) done any serious benchmarking of V8, but I think it’s not completely unreasonable to expect that its performance is better than Python’s in most cases.

Now, I’m sure we’ve all joked at some point in the past about JavaScript being slow, but V8 — and from what I hear, WebKit’s JavaScriptCore — sound like game changers. The better performance would surely be a major factor in creating more complex web-based applications and the little language that everybody laughed at and relegated to cutesy web page effects could suddenly become a very important player in a lot of other fields. Here are a few from the top of my head:

  • Server-side web programming*
  • General purpose scripting*
  • General purpose programming*
  • Embedded language in applications such as productivity suites or games
  • Extension language for Emacs-like applications
  • Mobile device programming language

* Given that it has a decent standard library

The possibilities are very vast, and because JavaScript is already known by a lot of programmers, adoption would certainly be easy. Damn, was Steve Yegge right?

Clojure performance tips

I originally wrote this in a Google Groups thread, but I figured it’s worth repeating here.

Somebody posted a Java and Clojure snippet to the Clojure Google group and mentioned that the Java code was vastly faster than the Clojure code and he wondered if Clojure could get within reach of Java’s speed.

In my own clj-starcraft project, I faced — and actually, still face — performance problems vis-à-vis Java. Specifically, at the time of this writing, my Clojure code is roughly 6 times slower than Java (Clojure takes around 70 seconds to parse 1,050 files, Java takes 12.)

The 70 seconds figure used to be much worse however. At the beginning of the project, it took over ten minutes to analyze my 1,050 files. That was even slower than my Python implementation (which, I must confess, pretty much ignored performance.)

Thanks to Java’s nice profilers and the friendly Clojure folks, I was able to improve the performance of my program. Here are some of the things I did.

(set! *warn-on-reflection* true)

This is likely the most important thing you can do to improve performance: turning on this setting will warn you of every place where the Java reflection API is used to resolve methods and attributes. As you may imagine, a direct call is a lot faster than going through all that reflection machinery. Wherever Clojure tells you that it can’t resolve a method, you need to go there and put a type hint to avoid the reflection call. The type hints section of the Clojure website shows an example of the speed-up and how to use type hints.

Fixing all the instances where *warn-on-reflection* complained improved the performance of clj-starcraft from ten minutes down to about three and half.

Coerce your numbers

Clojure can use Java’s primitive types. Whenever you are in a tight loop, strongly consider coercing your values to primitive types to get the most speed out of your code. The primitive types section of the Clojure website shows an example of the speed-up and how to do the coercion.

Use binary arithmetic operators

For a while now, Clojure has supported inlining of certain expressions. For arithmetic operations, only the calls with exactly two arguments will be inlined. If you find yourself doing arithmetic in a tight loop with more than two operands, you may want to consider rewriting the code so that every operator has exactly two operands. The following micro-benchmark shows the effect of inlining.

user> (time (dotimes [_ 1e7] (+ 2 4 5)))
"Elapsed time: 1200.703487 msecs"

user> (time (dotimes [_ 1e7] (+ 2 (+ 4 5))))
"Elapsed time: 241.716554 msecs"

Use == instead of =

Using == to compare numbers instead of = can have an appreciable performance impact:

user> (time (dotimes [i 1e7] (= i i)))
"Elapsed time: 230.797482 msecs"

user> (time (dotimes [i 1e7] (== i i)))
"Elapsed time: 5.143681 msecs"

Avoid using destructuring binding for vectors

In a tight loop, if you want to assign the values inside a vector to names to improve readability, consider using direct indexing instead of destructuring binding. Although the later will yield clearer code, it will also be slower.

user> (let [v [1 2 3]]
        (time
         (dotimes [_ 1e7]
           (let [[a b c] v]
             a b c))))
"Elapsed time: 537.239895 msecs"

user> (let [v [1 2 3]]
        (time
         (dotimes [_ 1e7]
           (let [a (v 0)
                 b (v 1)
                 c (v 2)]
             a b c))))
"Elapsed time: 12.072122 msecs"

Prefer locals to vars

If you need to lookup a value inside a tight loop, you may want to consider using a local variable (defined with let) instead of a Var. Let’s look at another timing comparison:

user> (time
       (do
         (def x 1)
         (dotimes [_ 1e8]
           x)))
"Elapsed time: 372.373304 msecs"

user> (time
       (let [x 1]
         (dotimes [_ 1e8]
           x)))
"Elapsed time: 3.479041 msecs"

If you think that using a local variable would help performance, consider the following idiom to avoid breaking other code that may depend on that Var:

(let [local-x x]
  (defn my-fn [a b c]
    ...))

Use the profilers

Sun’s JVM has two profilers, -Xprof and -Xrunhprof. Use them to find your bottlenecks instead of blindly guessing.

Caveat

Do note that many of these performance tricks improve the performance by a few hundred milliseconds over millions of executions. Unless they give you a significant performance boost, you may want to avoid using them if they make your code less clear.