## In Search of Performance in Haskell

I was doing ITA Software’s “Word Numbers” problem. There are various solutions on the Web, and reviewing them tells us it’s a graph/grammar problem. Today we’re going to solve it brute force.

Now let me give a quick recap of the problem:

A "word number" is defined aswordNumber 1 = "one" wordNumber 2 = "onetwo" wordNumber 3 = "onetwothree" ... wordNumber 12 = "onetwothreefourfivesixseveneightnineteneleventwelve" ...

Find the 51-billion-th letter of wordNumber infinity. Assuming that letter is found in wordNumber x, also find the sum of 1 to x.

(As someone noted in the StackOverflow comments, I actually misinterpreted the original problem. But that doesn’t affect our purpose here.)

### A naive algorithm in C++

The idea seems simple: two counters, one for the sum of the numbers and one for the length. We go from wordNumber 1 upward until the length reaches 51000000000. Here is how I said it in C++:

OK, nothing surprising. It takes **45 seconds** on my machine to execute (g++ -O3). We will use this as a baseline.

## First attempt in Haskell

OK, I lied. It was not really my first attempt. My initial version had space leaks that made the program eat up 1 GB of RAM within seconds. Using profiling techniques from Real World Haskell, I identified the space leaks and sprinkled bang patterns everywhere.
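That first version is not reproduced here, but the shape of the problem is easy to illustrate. Below is a minimal sketch (my own example, not the actual Word Numbers code) of the kind of lazy-accumulator leak involved, and the bang-pattern fix:

```haskell
{-# LANGUAGE BangPatterns #-}

-- A lazy accumulator: without optimization, `acc` grows as an unevaluated
-- chain of (+) thunks, which is exactly the kind of space leak profiling finds.
leakySum :: Int -> Int
leakySum n = go 0 1
  where
    go acc i
        | i > n     = acc
        | otherwise = go (acc + i) (i + 1)

-- The bang forces `acc` at every step, so the loop runs in constant space.
strictSum :: Int -> Int
strictSum n = go 0 1
  where
    go !acc !i
        | i > n     = acc
        | otherwise = go (acc + i) (i + 1)

main :: IO ()
main = print (strictSum 1000000)  -- prints 500000500000
```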

So how well does that run?

Well, I don’t know.

As far as I remember, one version finished in 12 minutes, but I am not sure it was the version posted here. In short, the performance was not bearable.

Now if we were using Perl or Ruby, that might not have been too surprising. But for Haskell, it seemed something obvious was missing.

## The solution

So there I went, I asked for help on StackOverflow and in Haskell-cafe.

There I got help from Bryan O’Sullivan, who posted his version. Reiner Pope said that with `-fllvm`, it finishes in about **1.5 minutes**. (OK, he did not exactly say that. He said the Haskell version is half as fast as C++.)

## But it did not work for me

So I downloaded Bryan’s code (we’ll call it `RealWordNumber.hs` from now on; sorry for the bad pun 🙂) and off I went in excitement:

```
ghc -O2 RealWordNumber.hs
./RealWordNumber
Sum: 1
Segmentation fault/access violation in generated code
```

Since `RealWordNumber.hs` uses `Int`, there is an integer overflow in his code. That’s also exactly the reason why I had to use `Int64` in my original version.

(I was using Windows, where GHC only comes in 32-bit.)

So I patched his code to use `Int64` and off I went. It never finished; I had to kill it.

Huh?

## The culprit — 64-bit integers are slow in 32-bit programs

OK, the **major** reason turned out to be this. To confirm, I compiled the C++ version to 32-bit using VC++ (`cl /Ox WordNumber.cpp`). It took **4 minutes** to complete.

So I took RealWordNumber.hs to Arch Linux x64. It turned out that `Int` in GHC x64 is actually `Int64`, and the program completed in **5 minutes** (`ghc -O2`).

Since the LLVM backend for GHC 7.0.3 seemed broken at the time of writing, and I did not want to go fixing it immediately, I trusted that it would really come down to **1.5 minutes** with `-fllvm`.

## Now the fun part — breaking down performance improvements

“So,” I figured, “I just need to take my original version and put it on Arch x64.” I tried that, and it was still taking longer than my patience allowed (8+ minutes). The following explains how I got my version up to the same league as `RealWordNumber.hs`, step by step.

First of all, I tuned the number down to 510 million (510000000) to avoid having to wait all day doing this.

Here is a baseline using 510 million as the input:

```
RealWordNumber -- 1.94 seconds
K2WordNumber   -- 5.41 seconds
```

From here we see that if I had waited about 15 minutes, I could have seen my original version finish 😛

Now on to how I got it up to speed:

## 1. Change `divMod` to `quotRem`

I think I heard it on #haskell: people say that `quotRem` is faster than `divMod`. So I simply changed all `divMod` to `quotRem`. Look at what we’ve got:

```
K2WordNumber-1 -- 4.48 seconds
```

OK, that was true. A surprisingly huge speed-up.
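For context (my addition, not from the original post): `quot` and `rem` truncate toward zero, which is what the machine division instruction does, while `div` and `mod` round toward negative infinity and need an extra adjustment. On the non-negative numbers used in this program the two pairs agree, which is why the swap is safe:

```haskell
main :: IO ()
main = do
    -- Agree on non-negative operands (the only kind used in this program):
    print (123 `divMod`  10 :: (Int, Int))   -- (12,3)
    print (123 `quotRem` 10 :: (Int, Int))   -- (12,3)
    -- Differ on negative operands: div/mod floor, quot/rem truncate:
    print ((-123) `divMod`  10 :: (Int, Int))   -- (-13,7)
    print ((-123) `quotRem` 10 :: (Int, Int))   -- (-12,-3)
```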

## 2. Change `Int64` to `Int`

Since GHC x64’s `Int` is actually 64-bit, I changed all `Int64` to `Int` and removed the unnecessary `fromIntegral` calls.

```
K2WordNumber-2 -- 3.20 seconds
```

Two simple changes, and we are already getting there 😛

## 3. Change `quotRem` to separate `quot` and `rem`

`quotRem` returns a tuple of boxed `Int64`s, which may need to be constructed and deconstructed. Max Bolingbroke suggested that using `quot` and `rem` separately might help. Here’s the experimental result:

```
K2WordNumber-3 -- 3.13 seconds
```

It might have been an improvement, but it might also just have been experimental error. Either way, not much speed-up was achieved.

## 4. Removed redundant parameters

It turned out that in my original version, several redundant parameters were being passed *recursively* in the function `solve` (the `acc` label, and `curr`). I used `ghc -Wall` to track them down and removed them.

```
K2WordNumber-4 -- 3.16 seconds
```

No change. Since GHC figured out enough to warn me, it probably did the right thing and optimized them away anyway.

## 5. Simplified parameter passing

The original `solve` function looked like this:

```haskell
solve :: Int64 -> (Int64, Int64, Int64) -> [Int64] -> (Int64, Int64, Int64)
solve !n !acc@(!sumNum, !sumLen, !curr) (!num:nums)
    | sumLen' >= n = (sumNum', sumLen, num)
    | otherwise    = solve n (sumNum', sumLen', num) nums
  where
    sumNum' = sumNum + num
    sumLen' = sumLen + wordLength num
```

To quote Bryan:

> You’re passing parameters in three different ways: a regular Int, a 3-tuple, and a list! Whoa.

He is right. I also noticed that the `n` parameter never changes and does not need to be passed along. Johan Tibell blogged that GHC could in theory create this closure for us, but I did it manually to see what happens:

```haskell
solve :: Int -> (Int, Int, Int)
solve n = go 0 0 1
  where
    go !sumNum sumLen i
        | sumLen' >= n = (sumNum', sumLen, i)
        | otherwise    = go sumNum' sumLen' (i+1)
      where
        sumNum' = sumNum + i
        sumLen' = sumLen + wordLength i
```

That resulted in

```
K2WordNumber-5 -- 2.87 seconds
```

Good.

## 6. Using unboxed integer types

I replaced `quot` and `rem` with these:

```haskell
(I# a) // (I# b) = I# (a `quotInt#` b)
(I# a) %  (I# b) = I# (a `remInt#` b)
```

```
K2WordNumber-6 -- 2.77 seconds
```

Not as big a win as I had thought.

## 7. Using `Data.Vector.Unboxed` instead of `Data.Array.Unboxed`

```
K2WordNumber-7 -- 2.50 seconds
```

## 8. Using `Data.Vector.Generic.unsafeIndex` instead of the `(!)` operator

```
K2WordNumber-8 -- 2.20 seconds
```

Wow, this one was huge.

## 9. Move the length vectors inside `wordLength`’s closure

```haskell
wordLength :: Int -> Int
wordLength i = go 0 i
  where
    go !pad !n
        | n < 10      = lenOnes `VG.unsafeIndex` n + pad
        | n < 20      = lenTeens `VG.unsafeIndex` (n-10) + pad
        | n < 100     = go (lenTens `VG.unsafeIndex` (n//10) + pad) (n%10)
        | n < 1000    = go (go (7+pad) (n//100)) (n%100)
        | n < 1000000 = go (go (8+pad) (n//1000)) (n%1000)
        | otherwise   = go (go (7+pad) (n//1000000)) (n%1000000)
    (I# a) // (I# b) = I# (a `quotInt#` b)
    (I# a) %  (I# b) = I# (a `remInt#` b)
    lenOnes  = VU.fromList [0,3,3,5,4,4,3,5,5,4]  -- "", "one", "two", ...
    lenTens  = VU.fromList [0,3,6,6,5,5,5,7,6,6]
    lenTeens = VU.fromList [3,6,6,8,8,7,7,9,8,8]  -- first element is "ten" (3)
```

```
K2WordNumber-9a -- 2.20 seconds
```

No improvement at all… But if I add bangs before the vectors…

```haskell
!lenOnes  = VU.fromList [0,3,3,5,4,4,3,5,5,4]  -- "", "one", "two", ...
!lenTens  = VU.fromList [0,3,6,6,5,5,5,7,6,6]
!lenTeens = VU.fromList [3,6,6,8,8,7,7,9,8,8]  -- first element is "ten" (3)
```

```
K2WordNumber-9 -- 2.00 seconds
```

WOW!! That was a large speed-up. It turns out the reason is related to strictness: `wordLength` is not strict in `lenOnes`, `lenTens`, and `lenTeens`, so they might be garbage collected (?). Putting bangs on them allows them to stay alive. Since we use those arrays very frequently, it is better for them to be around all the time.

Haskell does not allow putting bangs on “global” (top-level) variables, so I missed these in my original bang-sprinkling exercise.

OK, so now we have transformed my original sloppy Haskell program into a “well-performing” one. Since it performs the same as Bryan O’Sullivan’s (casually) tuned version, I am going to assume it is about as fast as it should be (short of resorting to serious hacks).

And here is the final version of the code:
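The final file itself is not reproduced in this copy of the post, so here is a runnable reconstruction assembled from the snippets in steps 5, 6, and 9 above. The `main` wrapper and the small demo input are my additions (the post runs `solve 51000000000` on a 64-bit GHC), and it requires the `vector` package:

```haskell
{-# LANGUAGE BangPatterns, MagicHash #-}
import GHC.Exts (Int (I#), quotInt#, remInt#)
import qualified Data.Vector.Generic as VG
import qualified Data.Vector.Unboxed as VU

-- Number of letters in the English words for i, e.g. wordLength 12 == 6 ("twelve").
wordLength :: Int -> Int
wordLength i = go 0 i
  where
    go !pad !n
        | n < 10      = lenOnes `VG.unsafeIndex` n + pad
        | n < 20      = lenTeens `VG.unsafeIndex` (n - 10) + pad
        | n < 100     = go (lenTens `VG.unsafeIndex` (n // 10) + pad) (n % 10)
        | n < 1000    = go (go (7 + pad) (n // 100)) (n % 100)          -- "hundred"
        | n < 1000000 = go (go (8 + pad) (n // 1000)) (n % 1000)        -- "thousand"
        | otherwise   = go (go (7 + pad) (n // 1000000)) (n % 1000000)  -- "million"
    (I# a) // (I# b) = I# (a `quotInt#` b)
    (I# a) %  (I# b) = I# (a `remInt#` b)
    !lenOnes  = VU.fromList [0,3,3,5,4,4,3,5,5,4] :: VU.Vector Int  -- "", "one", ...
    !lenTens  = VU.fromList [0,3,6,6,5,5,5,7,6,6] :: VU.Vector Int
    !lenTeens = VU.fromList [3,6,6,8,8,7,7,9,8,8] :: VU.Vector Int  -- "ten" is 3

-- Walk i = 1, 2, ... until the accumulated word length reaches n;
-- returns (sum of 1..i, length before wordNumber i, i).
solve :: Int -> (Int, Int, Int)
solve n = go 0 0 1
  where
    go !sumNum sumLen i
        | sumLen' >= n = (sumNum', sumLen, i)
        | otherwise    = go sumNum' sumLen' (i + 1)
      where
        sumNum' = sumNum + i
        sumLen' = sumLen + wordLength i

-- Small demo input; the post's actual input is 51000000000.
main :: IO ()
main = print (solve 1000)
```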

## Moral of the story

- Haskell is a very fast functional language. But for tight loops like this, always consider giving plain old C a try; it may just save you days of profiling.
- 64-bit operations in 32-bit programs are slow. (*)
- `quotRem` is faster than `divMod`.
- Earlier I argued in #haskell that “if `Integer` vs `Int` matters for your performance, your program is probably written wrong”. Well, there are cases where it matters quite a lot. It turns out the conversions are quite costly. (Though I still believe people should use `Integer` by default.)
- Use the worker/wrapper pattern liberally.
- If you recurse, make sure every variable actually changes in each recursion. (In most cases this will help; if you only recurse a few times, the closure’s overhead may be larger than the benefit.)
- Don’t use a list or tuple just because infinite lists look smart.
- `Data.Vector` is a better “general case” container than `Data.Array`.
- Try to reduce the scope of “global” variables. Put them in a closure so you can put a *bang* on them.

(*) This may actually be a shortcoming of Windows’ “WOW64” (32-bit emulation layer). I have not tried to verify this on 32-bit Linux GHC.

### Extra: the fastest indexing data structure — functions!?

I heard on #haskell that “pattern matching in Haskell is less than O(n)”. When I heard it, I didn’t really believe it. I had always thought that N pattern matches translate into N `if-then-else` tests in the general case. He said something about Church encoding, but I did not see how that could be applied in the general case to avoid the N `if-then-else` tests.

Just for fun, I tried changing the length vectors into functions:

```haskell
lenOnes 0 = 0
lenOnes 1 = 3
lenOnes 2 = 3
lenOnes 3 = 5
-- ...
lenTeens 9 = 8
```

```
K2WordNumber-10 -- 2.00 seconds
```

Yes, pattern matching is as fast as `unsafeIndex`. WOW! wren nt thornton, in the comments below, gives a more thorough explanation of this one. It turns out GHC is really darn smart and can figure out fast code given “primitive” definitions.
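To make the experiment concrete, here is a complete, self-contained version of the `lenOnes` table as a function (the listing above elides most cases; the letter counts are the same ones used in the vectors earlier, and the `main` wrapper is my addition):

```haskell
-- Complete pattern-match version of the lenOnes table: letter counts of
-- "", "one", "two", ..., "nine". For a dense match on small Int literals
-- like this, GHC can generate efficient dispatch rather than a linear
-- chain of comparisons, which is why it keeps up with unsafeIndex.
lenOnes :: Int -> Int
lenOnes 0 = 0
lenOnes 1 = 3  -- "one"
lenOnes 2 = 3  -- "two"
lenOnes 3 = 5  -- "three"
lenOnes 4 = 4  -- "four"
lenOnes 5 = 4  -- "five"
lenOnes 6 = 3  -- "six"
lenOnes 7 = 5  -- "seven"
lenOnes 8 = 5  -- "eight"
lenOnes 9 = 4  -- "nine"
lenOnes _ = error "lenOnes: argument out of range"

main :: IO ()
main = print (map lenOnes [0 .. 9])  -- prints [0,3,3,5,4,4,3,5,5,4]
```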