Fibonacci numbers on the Galaxy S3: ARM benchmarks of Rust, Ocaml, Haskell, Go, Racket, Lua, C# and Java

I recently upgraded my phone to a Galaxy S3, an ARMv7 device with 2GB of RAM and a quad-core 1.3GHz processor. After installing Arch Linux on it (only in a chroot; I’m not brave enough to boot directly into it), which seemed like the right thing to do, I thought it’d be interesting to do some benchmarking and see how various language implementations perform on ARM compared to my x86-64 laptop (Intel dual-core i5 M430, 2.27GHz).

This benchmark involves calculating a large number of Fibonacci numbers repeatedly. The exact code used is available here. Note that the purpose of this benchmark is to compare the ARM and x86 implementations of each language, not to compare the different languages to each other, so no effort has been put into optimising the different implementations.
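
For those who don’t want to dig through the repo, the workload looks roughly like the following Go sketch. This is my own illustration, not the repo’s code: the modulus and names are invented, and the actual implementations are written recursively, as discussed under Miscellanea below.

package main

import "fmt"

// fibBench repeatedly computes Fibonacci numbers, keeping the values
// bounded with a modulus so the loop can run for very large n.
func fibBench(n int) int {
    a, b := 0, 1
    for i := 0; i < n; i++ {
        a, b = b, (a+b)%1000000007
    }
    return a
}

func main() {
    fmt.Println(fibBench(500000000)) // the benchmark input: 500 million
}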

The benchmarks were run with an input of 500 million, with each language’s speed expressed as a percentage of the speed of the C implementation (so, for example, Go on x86-64 scores 3726/8489 ≈ 43.9%). Expressing each language in terms of C speed rather than in absolute terms allows for easy comparison, as it minimises the effect of the different speeds of the two processors.

The results from my x86-64 laptop are as follows (input of 500 million):

Language Running time (ms) % C speed
C (Clang) 3726 100
C# (Mono) 7625 48.8656
Go 8489 43.8921
Haskell 9839 37.8697
Java (OpenJDK) 4237 87.9396
LuaJit 15135 24.6184
Ocaml 9668 38.5395
Racket 22785 16.3529
Rust 3910 95.3

The results from my ARMv7 Galaxy S3 (input of 500 million):

Language Running time (ms) % C speed
C (Clang) 5051 100
C# (Mono) 52736 9.5779
Go 12246 41.2461
Haskell 22304 22.6462
Java (OpenJDK) 50079 10.0861
LuaJit 21738 23.2358
Ocaml 28774 17.554
Racket 56519 8.93682
Rust 5741 87.9812

The performance of each language on ARMv7 expressed as a percentage of its x86-64 performance (these figures are relative to C on each platform, so C is 100 by definition; C’s absolute ARM/x86 ratio is discussed below):

Language Performance (% x86-64 speed)
C 100
LuaJit 94.3839
Go 93.9716
Rust 92.3190
Haskell 59.8003
Racket 54.6498
Ocaml 45.5481
C# (Mono) 19.6005
Java (OpenJDK) 11.4693

The benchmarks can be run with `sh fib.sh 500000000 true`, replacing true with false if the executables have already been compiled. This will print an HTML table to stdout and store the results in text form in `./sortedResults`. Rename this file to something like armResults or x86Results; two such files can then be compared with `sh fibdiff.sh armResults x86Results`, which outputs the above HTML table to stdout.

Again, note the purpose of this benchmark is not to compare the languages with each other, but to see how well each one performs on ARM compared to on x86.

Interestingly, the Rust implementation actually runs faster on ARM. I suspect this indicates a bug of some sort in the compiler, as Rust generally performs at close to C speed. Both the ARM and x86 implementations are running on the latest Arch Linux build, but the Rust version on ARM appears to be “0.11.0-pre”, whereas on x86 it’s “0.11.0”, so it’s possible there was some regression between those two versions.

*Edit: the Rust version used Rust’s int type, which is 64-bit on x86-64, whereas C’s int is 32-bit. This handicapped Rust relative to C on x86-64, which made its ARM numbers look better by comparison. The results table has now been updated to show results using Rust’s fixed-size i32 type, which is much faster on x86-64. It can now be seen that Rust performs quite well on both platforms, which makes sense considering its use of the LLVM backend.

*Another edit: according to strncat on Reddit: “Rust is slower than C at this task on ARM (and x86_64, once the 64-bit overhead is gone) because it uses LLVM’s segmented stack support for stack overflow checking. The sane alternative is outputting a one byte write in multiples of the guard size, so 99.5% of functions won’t have any overhead, and the remaining ones will only have some writes rather than the branch that is there today.”

Java on ARM is hilariously slow, a tenth of its x86 speed, which I suspect may be because the JVM I’m using, OpenJDK, doesn’t yet have a JIT compiler for ARM, or, if it does, one that’s not particularly well developed. The Oracle JVM is apparently faster on ARM.

C# (via Mono) is also much slower on ARM, at a fifth of its x86 speed. Presumably Mono’s ARM code generation hasn’t received as much attention as its x86 code generation.

LuaJit’s performance on ARM, almost equal to its x86 performance, proves once again that Mike Pall is a genius. Interestingly, as every number in Lua is a double, it may actually have an advantage in this case: there’s a modulus operation in the code, and ARM (as far as I’m aware) has no hardware support for integer division, only floating-point division.

Code produced by the Go compiler also runs surprisingly well on ARM, nearly matching its x86 performance. I’m not sure whether this is due to a lot of work having been put into ARM code generation or not so much work having been put into x86 code generation, but it certainly bodes well for future plans to allow the use of Go in the Android NDK.

Racket, OCaml and Haskell all run at about half their x86 speed on ARM, which seems reasonable: they’re presumably not often run on ARM, so their ARM performance probably hasn’t received much attention.

As for the C implementation, it’s interesting that its absolute speed on the ARM device is almost 74% of its x86 speed (3726ms vs 5051ms), in spite of the x86 device being a 16-inch i5 laptop and the ARM device being a 5-inch phone.

Miscellanea

The Haskell implementation is named hsfib.hs instead of fib.hs, breaking from the naming conventions of the other implementations. Why is this? If I name it fib.hs, GHC will notice the fib.o left over from compiling the OCaml implementation and try to link that, with hilarity ensuing soon thereafter. This could be avoided if GHC checked whether the object files it was attempting to link were actually Haskell object files.

Getting an Integer from a Number in Typed Racket seems to take a lot of work:
`(numerator (inexact->exact (real-part n)))`
Although there’s probably a simpler method I’m missing.

There’s a Common Lisp implementation in the repo, but the ARM support of SBCL is quite recent and I couldn’t get it to build.

At least 70% of the C# implementation is just copy-pasted from the Java implementation. Syntactically, those languages seem even more similar than Lisp and Scheme.

Recursion in Lua is interesting. Doing a comparison like `if n == 1` does not work out well, due to the nature of floating point equality, so `if n < 1` is necessary instead.

The correctness of the Rust and C implementations is actually optimisation-dependent. Compiled with -O, the compilers turn the tail-recursive calls into a loop, so they run in constant stack space; compiled without it, each call consumes a stack frame, so they use linear space and die with a stack overflow. Never, ever write code like this for any project that’s even remotely important.
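
For illustration, the shape in question looks like this in Go syntax (my sketch; the point applies to the C and Rust compilers, since Go itself never eliminates tail calls):

// fib is tail-recursive: a compiler performing tail-call optimisation
// turns this into a loop that runs in constant stack space, while one
// that doesn't will push a stack frame per call and eventually overflow.
func fib(n, a, b int) int {
    if n == 0 {
        return a
    }
    return fib(n-1, b, a+b)
}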

OCaml’s Emacs support is really awesome. I installed a couple of plugins, Tuareg and Merlin, and not only did they give me autocomplete but also automatic error checking whenever I save. I was particularly impressed when this combined with OCaml’s type-checking to warn me that the float I was accidentally passing to printf didn’t match the type required by the %d in the format string; even Haskell’s default printf function doesn’t do compile-time checking of the format string.

*Edit: I was asked why I didn’t include Nimrod. Nimrod compiles to C, so its ARM/x86 ratio should simply track C’s. That does make it seem like a good option for ARM development.

Bonus

I ran the Java implementation of the benchmark under Dalvik. The results are rather underwhelming, to say the least. Here are the results from running with an input of 500 million (the same as above), with the Dalvik results included.

Language Running time (ms) % C speed
C 5051 100
C# (Mono) 52736 9.5779
Dalvik 75699 6.6725
Go 12246 41.2461
Haskell 22304 22.6462
Java (OpenJDK) 50079 10.0861
LuaJit 21738 23.2358
Ocaml 28774 17.554
Racket 56519 8.93682
Rust 5741 87.9812

Now that’s why the new ahead-of-time-compiled Android Runtime (ART) is A Good Thing. It’s also possibly part of the reason the Android UI is less fluid than the iPhone’s, and I say this as someone who wouldn’t use an iPhone if you paid me. Where would Objective-C appear in that table? Right at the top: Objective-C is a superset of C, so valid C code is generally valid Objective-C code, making it extremely easy to write fast code in Objective-C (just write C).

Note however that I ran the Dalvik file (compiled with `dx --dex --output=fib.dex fib.class`) with `dalvikvm -cp fib.dex fib`, the simplest way to run it. It’s quite possible there’s a better way to compile/run it that makes it faster; I’m not particularly familiar with Android development. There apparently exists a tool called ‘dexopt’, which is run automatically when a new apk is installed, but I’m not sure whether it’s run when I execute a .dex file manually with `dalvikvm`. Running `dexopt` tells me to look at the dexopt source for usage information; I don’t have time to figure it out right now, but if someone wants to send me a script for running dexopt on a .dex file I’ll be happy to test it.

Moral of the story: performantly porting a JIT compiler to ARM is difficult (unless you’re Mike Pall).

*Edit: apparently development of the ARM port of Luajit was sponsored, so the developer may have had more incentive to optimise it than the developers of the ARM ports of OpenJDK and Mono.


Why Go is Great (for servers)

The Go language has some weaknesses compared to other modern languages. It also, however, has some strengths that let its users easily do things that users of many other languages cannot: in particular, running a large number of green threads on a single machine, making it easy to structure a server with one such thread (goroutine) per client connection, and communicating between these goroutines with channels, making it easy to coordinate them without mutexes and other forms of manual synchronisation.

As an example, take the following program, a race relay race (it’s like a data race race). Imagine two chains of n (in this case three) threads: (a1, a2, a3) and (b1, b2, b3). In each chain, a message is sent from the first to the last and then back again (e.g. from a1 to a2 to a3 to a2 to a1). Whichever of (a1..an) and (b1..bn) completes this first wins the race. Now, the average computer can have at most around 65k open TCP connections, so a server using one connection per client would need around that many goroutines. Let’s therefore set n to 65k/2 ≈ 32k, so the two racers together use around 65k: 32k green threads per racer.

Here’s how we make the racetrack. The function is recursive: when numLeft > 0, it makes another section of track to the right in a new goroutine, then spawns another goroutine to pass messages from its left to its right, then itself passes messages from its right to its left. This means each racer actually uses 2*numLeft goroutines, so we set the number of sections per track to 32k/2 = 16k. When numLeft reaches zero, the end of the track simply bounces whatever arrives on the left channel straight back into it.

func makeTrack(left chan int, numLeft int) {
    if numLeft > 0 {
        right := make(chan int)
        go makeTrack(right, numLeft-1) // build the rest of the track
        go sendSidewards(left, right)  // pass messages rightwards
        sendSidewards(right, left)     // pass messages leftwards
    } else {
        // End of the track: bounce whatever arrives straight back.
        for {
            n := <-left
            left <- n
        }
    }
}

func sendSidewards(from chan int, to chan int) {
    for {
        n := <-from
        to <- n
    }
}

Now, you may wonder: why not just use one channel for both directions? It’s bad practice to have a single goroutine both writing to and reading from the same unbuffered channel: a send blocks until another goroutine is ready to receive, so the lone goroutine deadlocks. We’d hence need four channels if we were only using one goroutine: left in, right in, left out, and right out. I found that less pleasant than simply having one goroutine for sending left and one for sending right.
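
A minimal illustration of that deadlock (a made-up snippet, not part of the race program):

package main

func main() {
    ch := make(chan int)
    ch <- 1 // blocks forever: a send on an unbuffered channel needs a concurrent receiver
    <-ch    // never reached; the runtime aborts with "all goroutines are asleep - deadlock!"
}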

Now, for starting the race:

func doRace(start chan int, resultChan chan []int, numRunners int) {
    results := make([]int, 0, numRunners)
    for i := 0; i < numRunners; i++ {
        start <- i
    }
    for i := 0; i < numRunners; i++ {
        n := <-start
        results = append(results, n)
    }
    resultChan <- results
}

We create numRunners ints, each numbered with its start position (the first to run is 0, the last is numRunners-1), and send them down the channel. Then we receive them all back and stick them in a slice, which is sent back through the results channel. This lets us examine the order in which they arrived.

Putting it all together:   

package main

import (
    "fmt"
    "os"
    "runtime"
    "strconv"
)

func main() {
    if len(os.Args) != 3 {
        panic("Error: program takes two arguments, runners and threads.")
    }
    inThreads := os.Args[2]
    inRunners := os.Args[1]
    runtime.GOMAXPROCS(4)
    numThreads, err := strconv.Atoi(inThreads)
    if err != nil {
        panic("Error: threads must be an int.")
    }
    numRunners, err := strconv.Atoi(inRunners)
    if err != nil {
        panic("Error: runners must be an int.")
    }
    racerA := make(chan int)
    racerB := make(chan int)
    go makeTrack(racerA, numThreads)
    go makeTrack(racerB, numThreads)
    resultsA := make(chan []int)
    resultsB := make(chan []int)
    go doRace(racerA, resultsA, numRunners)
    go doRace(racerB, resultsB, numRunners)
    select {
    case results := <-resultsA:
        fmt.Printf("A won! With results: \n %v\n", results)
    case results := <-resultsB:
        fmt.Printf("B won! With results: \n %v\n", results)
    }
}

That creates the racetracks, the channels to them and for the race results to be returned on, and runs the races. Note the select statement: this is an extremely useful feature of Go that receives from whichever of the target channels (in this case resultsA and resultsB) is ready first. Running this takes around 3 seconds, not bad for sending data across so many channels. What’s really impressive, however: quadruple numThreads to 64,000, meaning a total of 256 thousand goroutines are created, and it still runs in around 3 seconds! Now that’s why Go is great for servers. I’m fairly certain that trying to do the same thing with 256 thousand kernel threads on this laptop would melt it.

I attempted an implementation of the same thing in Elixir (awesome mega Erlang preprocessor), but I stumbled when trying to get hold of the pid of a newly launched process. Running:

pid = spawn foo(bar)  # evaluates foo(bar) first, rather than spawning it
baz(pid)

seemed to wait for foo(bar) to complete before running baz, which breaks the program if foo loops until it receives a message from baz. In hindsight, this is expected behaviour rather than a bug: spawn takes a zero-argument function, so the argument foo(bar) is evaluated eagerly, and spawn(fn -> foo(bar) end) is what’s needed. Anyway, while it’s certainly possible to do in Erlang what Go can do, Go is five to ten times faster for CPU-intensive work. It’s also statically typed, catching at compile time errors that wouldn’t be caught in Erlang/Elixir.

*Edit: it should be noted that Erlang has a tool called Dialyzer, which can do static checking of Erlang code. A redditor wrote an Erlang implementation of the code for comparison, available here, which performs quite well.

I considered an implementation in D: while D has green threads in the form of fibres, it doesn’t seem to include a scheduler, so I’d have to write one myself. Also, while it has message passing between threads, it doesn’t seem to have message passing between fibres. It also has no equivalent of Go’s select statement, so the only way to poll multiple channels would be to wait on each in turn with a very short timeout, in a loop, which is rather ugly.

*Edit: D does have a fibre scheduler as part of the Vibe.d webserver, but this isn’t yet integrated into the standard library. Integration is planned, so eventually message passing between fibres should be possible in D as it is in Go.

What other languages have lightweight threads? Rust still has m:n threading, even if it’s no longer the default, but I’m not sure whether its message-passing facilities work between green threads or just between OS threads. I attempted an implementation in Rust, but it fails with a generic ‘failed while receiving on a closed channel’ error that doesn’t point to where in the code it occurred, and I ain’t got time for debugging that. Better to wait until the library and its error messages are more developed.

C# and F# don’t seem to natively support message passing between lightweight threads. Clojure has a library for it, but it seems to be an order of magnitude slower than Go. The JVM has Akka, but it’s incredibly verbose in Java, and to use it with Scala would first require learning Scala, which is for some reason a language that has never appealed to me.

Anything else? Interestingly, Haskell actually has lightweight threads and message passing, and they’re quite performant (see the benchmark linked in the previous paragraph). I attempted to write a Haskell implementation, and amazingly succeeded thanks to judicious use of unsafePerformIO, which is all the fun of unsafe sex with none of the risks (being able to write an `IO a -> a` function after having spent hours battering your head against the type system is pretty much the best thing ever). Yes, I’m a bad person. Anyway, it managed to match the Go implementation, handling even 360,000 threads and 70 runners with no problem. What’s more, if I compiled with -O3 and hardcoded numRunners and numThreads (as opposed to reading them from argv), it would do all the calculation at compile time, inlining the result and always producing the same answer when run.

*Edit: turns out the unsafePerformIO was completely unnecessary. The same program without unsafePerformIO can be seen here, courtesy of a helpful redditor.

Okay, the Haskell implementation must have cheated. It works when setting numThreads to a billion, and there’s no way it’s actually creating a billion threads, no matter how lightweight they are.

Anyway, so Haskell can also operate with multitudes of lightweight threads; why still use Go? One reason: Haskell doesn’t have select, and it can’t even be implemented on standard channels, as they offer no means of checking whether they contain data. It could be done with synchronous channels, which offer more functionality, but those failed with a build error when I attempted to cabal install them. There are also Software Transactional Memory (STM) channels, but they operate in the STM monad rather than IO, so I’d lose the advantage of being able to stick print statements everywhere for debugging (well, you can if you really want to, you might just get things printed multiple times if the transaction is retried). Finally, there are MVars, which are like channels that only hold one item and which allow checking for emptiness; these are what I used for checking who won the race.

Another reason to use Go: reasoning about both concurrency and lazy evaluation simultaneously is more difficult than reasoning about concurrency in a strict environment. Learning Go in general is also much easier: you can teach someone Go in a week, whereas Haskell takes a lot longer to learn. Finally: while it’s possible to write extremely fast code in Haskell, idiomatic code in Go is more performant than idiomatic Haskell.

In summary: Go is great for servers because it allows you to easily build efficient servers based on message passing between lightweight threads. Its advantage over Erlang/Elixir is that it’s much faster at calculation-heavy workloads and can do shared-memory concurrency. Its advantage over Haskell is that it’s much easier to learn, more idiomatically performant, and has a nifty select statement that, as far as I’m aware, Haskell hasn’t implemented. It’s also more ‘batteries-included’: no need to figure out which of the four-plus different channel libraries is most appropriate.

Its advantage over Rust is that it’s stable, with a comprehensive standard library, while Rust’s standard library is still in development. It’s also easier to write, as there’s no need to fight the borrow checker or worry about lifetimes. While Rust’s advanced linear type system allows memory safety without a garbage collector, garbage collection isn’t usually a big problem for servers, so Rust’s type system represents a potentially unnecessary overhead. Although Rust does have easier-to-use garbage-collected allocation via managed pointers, its garbage collector is currently quite primitive compared to Go’s, so Go currently makes more sense to use if one desires a garbage-collected server language with lightweight message passing concurrency. Finally, the biggest advantage to me personally of Go over Rust: no macros, meaning no need to read other people’s macros. The only macros I’ve ever liked are lisp macros, and that’s largely because they can be expanded and examined dynamically in the REPL.

Go’s advantage over Java/Scala + Akka on the JVM? Concurrency in Go is much more concise than using Akka from Java. Go has the same advantages over Scala that it has over Haskell, i.e. an extremely shallow learning curve, plus the extra advantage of significantly faster compilation. Finally, Akka uses an actor model, like Erlang, which is subtly different from the lightweight-threads-plus-channels model used by Go and Haskell, and so is not a direct substitute.

The Go and Haskell implementations, along with the incomplete Rust and Elixir implementations, are available at this repo.

Any valid criticisms of this post (of which there are undoubtedly many) are welcome either here or at the relevant reddit discussion.


The Magic Forest problem revisited: rehabilitating Java with the aid of D.

I recently came across an interesting blog post in which Java performed slower than I’d have thought reasonable, around 4 times slower than C++, so I decided to try my luck at making it faster. The problem? Imagine a forest of goats, wolves and lions. “Wolves can devour goats, and lions can devour wolves and goats. (‘The stronger animal eats the weaker one’.) As this is a magic forest, a wolve, after having devoured a goat, is transmuted into a lion; a lion, after having devoured a goat, is transmuted into a wolve; and a lion having devoured a wolve becomes a goat.” A new world is generated for each possible outcome (a forest (2,2,2) would generate (3,1,1), (1,3,1) and (1,1,3)), with the same then happening in each of these new forests recursively, and so on until a ‘stable’ forest is found: one with only one kind of animal remaining.

Note: some new results have been found since this post was originally made, and are listed at the bottom.

My first impulse was to rewrite the functional Java implementation in imperative code, assuming it would be faster. Big mistake! My implementation ran in around 1.5 seconds for an input of (100,100,100), compared to 65ms for C++: more than an order of magnitude slower, and much slower than the functional version. To try to understand what went wrong, I translated the Java implementation into C, and it was equally slow! C is generally not (barring a really insidious compiler) an order of magnitude slower than C++, so I realised I must have made a mistake in the translation from functional to imperative. I hence decided to translate the functional code into a less verbose language in order to understand it better, so I grabbed the D language and got busy:

forest_t[] meal(forest_t[] forests) {
  return map!(a => [forest_t(-1, -1, +1)+a, 
                    forest_t(-1, +1, -1)+a, 
                    forest_t(+1, -1, -1)+a])(forests)
    .join
    .partition!(forest_invalid)
    .sort
    .uniq.array;
}

The D turned out to be rather elegant compared to the original C++, which is a little harder on the eyes:

std::vector<forest_t> meal(const std::vector<forest_t>& forests) {
  static const std::array<forest_t,3> possible_meals = {{
    forest_t{{-1,-1,+1}},
    forest_t{{-1,+1,-1}},
    forest_t{{+1,-1,-1}}
  }};
  std::vector<forest_t> next_forests;
  next_forests.reserve(forests.size() * possible_meals.size());
  for (auto meal: possible_meals) {
    forest_t next_forest;
    std::transform(std::begin(forests), std::end(forests), std::back_inserter(next_forests),
      [&](const forest_t& forest) {
        std::transform(std::begin(forest), std::end(forest),
        std::begin(meal), std::begin(next_forest), std::plus<int>());
        return next_forest;
      });
  }
  auto valid_end = std::remove_if(std::begin(next_forests), std::end(next_forests), forest_invalid);
  std::stable_sort(std::begin(next_forests), valid_end);
  next_forests.erase(std::unique(std::begin(next_forests), valid_end), std::end(next_forests));
  return next_forests;
}

Unfortunately, the D was also noticeably slower: around 200ms for input (100,100,100), compared to 65ms for C++. Assuming, for no good reason, that creating and populating the array of new forests was what was consuming the most time, I next decided to vectorise it using D’s delightful inline assembly support:

forest_t[] meal(forest_t[] forests) {
  auto forestsAddr = forests.ptr;
  size_t forLen = forests.length;
  forest_t[] newFors = uninitializedArray!(forest_t[])(forLen*3);
  auto newForsAddr = newFors.ptr;
  size_t bytesToSimd = (forLen)*3*4;
  int[12] potentialMeals = [-1, -1, 1, 0, -1, 1, -1, 0, 1, -1, -1, 0];
  asm{
      movupd XMM0, [potentialMeals];
      movupd XMM1, [potentialMeals+16];
      movupd XMM2, [potentialMeals+32];
      mov R8, forestsAddr;
      mov R9, forestsAddr;
      add R9, bytesToSimd;
      mov RCX, newForsAddr;
  loop:;
      movupd XMM3, [R8];
      movupd XMM4, [R8];
      movupd XMM5, [R8];
      paddd XMM3, XMM0;
      paddd XMM4, XMM1;
      paddd XMM5, XMM2;
      movupd [RCX], XMM3;
      movupd [RCX+12], XMM4;
      movupd [RCX+24], XMM5;
      add RCX, 36;
      add R8, 12;
      cmp R8, R9;
      jne loop;
  }
  return newFors;
}

Which, unfortunately, had absolutely no effect on performance whatsoever. I was considering rewriting it to use only aligned access, but it finally occurred to me that I should probably check that this allocation actually was a bottleneck before proceeding. I ran the program through Valgrind’s callgrind tool, and lo and behold, it turns out I’d been chasing waterfalls: the allocation accounted for an extremely insignificant proportion of the runtime. The sort was taking the most time, and upon a little much-needed reflection, I realised that made perfect sense, as sorting is O(n log n), while allocating a large array is just O(n). Thanks to a clever suggestion from someone on the D forums, I then replaced the .sort.uniq with sticking everything into a hashtable and taking it out again (which should be only O(n)), and got the runtime down to around 100ms, quite competitive with C++’s 65ms.
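
To make the trick concrete, the map-as-set dedup looks roughly like this in Go (my own sketch of the technique, not code from the repo):

type forest struct{ goats, wolves, lions int }

func invalid(f forest) bool {
    return f.goats < 0 || f.wolves < 0 || f.lions < 0
}

// meal generates the three successor forests of each forest, drops the
// invalid ones, and uses a map as a set for O(n) duplicate removal.
func meal(forests []forest) []forest {
    seen := make(map[forest]struct{}, len(forests)*3)
    next := make([]forest, 0, len(forests)*3)
    for _, f := range forests {
        for _, d := range []forest{{-1, -1, 1}, {-1, 1, -1}, {1, -1, -1}} {
            n := forest{f.goats + d.goats, f.wolves + d.wolves, f.lions + d.lions}
            if invalid(n) {
                continue
            }
            if _, dup := seen[n]; dup {
                continue
            }
            seen[n] = struct{}{}
            next = append(next, n)
        }
    }
    return next
}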

I figured that, as O(n) beats the O(n log n) sort+std::unique algorithm C++ was using for removing duplicates, the D version might actually be faster for larger inputs. Unfortunately, this was not the case; at (500,500,500), while the C++ implementation ran in around 7 seconds, the D version took over 25. I assumed this was due to D’s hashmap implementation generating garbage, which gave me an oh-so-brilliant idea. Java’s got the best garbage collector in the world, right? So if I translated the D implementation into Java, the hashmap garbage shouldn’t cause trouble, and the O(n) uniquifying would allow it to beat C++ (oh, and around this time I also realised why my original C and Java implementations were so slow: they were using a naive O(n*n) method of removing duplicates; yes, I’m an idiot). Anyway, I fired up Java and got an implementation going using java.util.HashSet.

static Forest[] meal(Forest[] forests) {
    HashSet<Forest> uniqFors = new HashSet<Forest>(forests.length*3);
    for(int i = 0, j = 0; i < forests.length; i++, j+=3){
        uniqFors.add(new Forest(forests[i].goats + 1,
            forests[i].wolves - 1,
            forests[i].lions - 1));
        uniqFors.add(new Forest(forests[i].goats - 1,
            forests[i].wolves + 1,
            forests[i].lions - 1));
        uniqFors.add(new Forest(forests[i].goats - 1,
            forests[i].wolves - 1,
            forests[i].lions + 1));
    }
    return uniqFors.toArray(forests);
}

How’d it go? Put it this way. Imagine the pain of giving birth while being repeatedly punched in the testicles, stung by extremely venomous spiders and debugging a Boost library, convert that pain into ineffectiveness, multiply it by a thousand, and it’d still be a few orders of magnitude more effective than using HashSet to remove duplicates in Java. It literally failed with a java.lang.OutOfMemoryError on an input of around (30,30,30), while the D implementation could handle (300, 300, 300). Needless to say, it was time to go back to the proverbial drawing board.

Okay, turns out I’m even more of a shenjingbing than first imagined. I didn’t realise it, but Java defaults to reference-based equality checks when no hashing function is defined for a class. This means the HashSet wasn’t working as intended: it wasn’t removing duplicates, as it didn’t recognise duplicates as duplicates, since they were separate objects even when their goats, wolves and lions values were the same. Properly implementing equals() and hashCode() makes the Java HashSet implementation slightly faster than the C++ for high inputs, and much faster than the original functional implementation, so in that sense: yaay, I did it. Thanks to JingJing23 on reddit for pointing out my mistake.

They say when you don’t know what to do, do what you know. So I translated the original D implementation (the 200ms one using .sort.uniq) into C, albeit using a naive quicksort instead of whatever D uses, and it ran in around 90ms, quite competitive with the C++. What’s more, its performance didn’t degrade with larger inputs like the D hashmap implementation’s did: (500,500,500) took 7s in C++ and only 9s in C. On a hunch, I decided to change the D version to use quicksort too, via the following elegant implementation (don’t worry, I promise this is the last D I’ll show! It’s actually not even mine; I found it on the internet).

forest_t[] quickSort(forest_t[] fors) pure nothrow {
  if (fors.length >= 2) {
    auto parts = partition3!(forLessThan)(fors, fors[$ / 2]);
    parts[0].quickSort;
    parts[2].quickSort;
  }
  return fors;
}

And voila, it was as fast as the C, and around 80% of the C++. Presumably the sorting algorithm in D’s std.algorithm is just particularly unsuited to this case; sorting algorithms are sometimes designed to guarantee good performance on worst-case input (unlike quicksort, which is O(n*n) in the worst case), but this introduces a constant factor that can make them slower on friendly input than the appropriately named quicksort.

Armed with this knowledge, I was certain I’d finally be able to give the Java implementation the speedup I’d been searching for. I thus set about translating the C implementation into Java, and by translating I mean transliterating. I capitalised on the fact that an array of struct{int, int, int} is equivalent to an array of ints, in order to get past Java’s lack of value types, and changed various functions to take and reuse buffers, in the end managing to remove ‘new’ from the main loop completely. My miraculously ugly Java quicksort:

public void qsFors(int start, int len){
  if (len < 2){
    return;
  }
  int p = len/2+start;
  forests[(this.cap-1)*3] = this.gAt(p);
  forests[(this.cap-1)*3+1] = this.wAt(p);
  forests[(this.cap-1)*3+2]= this.lAt(p);    
  int left = start;
  int right = start + len - 1;
  while (left <= right) {
    if (forLessThan(left, this.cap-1)) {
      left++;
      continue;
    }
    if (forLessThan(this.cap-1, right)) {
      right--;
      continue;
    }
    swapFors(left, right);
    left++;
    right--;
  }
  qsFors(start, right - start + 1);
  qsFors(left, start + len - left);
}

Note how it uses integers like pointers, as the array of forests is actually just an array of ints. Also note how it stores the pivot goat, wolf and lion values at the final 3 spots in the array, as the ‘forLessThan’ function can only compare two parts of the same array. If you’re thinking this could lead to problems if the final 3 spots in the array contained a valid forest, you’re right, but at that point there’d be an extremely high probability of overflowing the array and getting an outOfBoundsException anyway.

How’d it go? In spite of my having essentially written C in Java, it was actually slightly slower than the original functional implementation from the original blog! So a big congratulations to the writers of java.util.stream; I’ve no idea how distinct() works under the hood, but it certainly kicked my ass.

For fun, I also wrote a few other implementations. One in Haskell, which was surprisingly fast considering I removed duplicates simply by sticking everything into a Data.Set and taking it out again. It ran in around 350ms for input (100,100,100), but slowed down at higher inputs, presumably due to garbage generation.

meals :: V.Vector Forest -> V.Vector Forest
meals fors = V.fromList $ S.toList $ S.fromList $ V.toList $
             snd $ V.unstablePartition forInvalid $ V.concatMap meal fors

I’d be interested to see how fast one using sort and removing consecutive duplicates would be; I attempted one myself, however it was rather slow as I lacked the monad-foo to make an in-place partitioning function for removing duplicates.

Minor gripe: why doesn’t Haskell support generic functions on collections; why do I have to use Set.fromList for a set but Vector.fromList for a vector? Why isn’t there a generic map function? D seems to handle this fine, via ranges, and even C++ manages it via std::algorithm.

Edit: Turns out it can, via type classes, but Haskell programmers don’t always write it this way.

I also wrote a Go version, but it’s slower than I’d have expected, presumably because the Go standard library sorting algorithm uses heapsort “if depth of 2*ceil(lg(n+1)) is reached”. It also generates more garbage than the D version, as D’s creator Walter Bright recently did some much-needed work on removing allocations from D’s std.algorithm. So a big thank you to Walter for ironing out that wart on the language!

I wrote a Common Lisp version too, but it’s rather slow as I was too lazy to write my own unique function, and the one in the standard library (‘remove-duplicates’) uses the naive O(n*n) algorithm for some reason. I’d be quite excited to see how fast it is if someone wants to write up an improved version!

Finally, I attempted a Rust version. I wasn’t able to write it in a functional style, as the iterator syntax got the better of me, but I did get an imperative version working, and it ran at around the same speed as the C and D versions.

I had one more idea for optimising the Java version. If I could create an ideal hashset, with O(n) uniquifying and minimal garbage generation, I should be able to beat the C++ implementation’s O(n log n). A simple way to create a perfect hash (unique for any forest) is to pack the low BITS_FOR_HASH bits of the goats, wolves and lions values of a forest together, as follows:

ulong hash = 0;
hash |= this.goats;
hash <<= BITS_FOR_HASH;
hash |= this.wolves;
hash <<= BITS_FOR_HASH;
hash |= this.lions;

This produces a maximum value MAX of 1 << (BITS_FOR_HASH*3), meaning that if BITS_FOR_HASH is less than eleven the hash will fit in a uint. For example, with BITS_FOR_HASH = 10, the forest (3,4,5) hashes to (3<<20) | (4<<10) | 5 = 3,149,829. (Just to clarify for any Haskell programmers reading this: that ‘<<=’ is to left-shift as ‘+=’ is to ‘+'; it’s not a bind.) To guarantee O(1) insertion and removal (no collisions), a simple if super space-inefficient solution is to use a bitset of size MAX, as follows:

  forest_t[] uniquify(forest_t[] fors){
    auto newFors = uninitializedArray!(forest_t[])(fors.length);
    size_t i = 0;
    foreach(f; fors){
      ulong hash = f.Hash();
      if(this.bits[hash] == false){
        this.bits[hash] = true;
        newFors[i]=f;
        i++;
      }
    }
    newFors.length=i;
    return newFors;
  }

While incredibly slow for small inputs (as it’s allocating and deallocating hundreds of megabytes every call), for an input of (500,500,500) it took only around 20 seconds, slightly less than half the speed of the quickSort+uniq implementation. I suspect that with even larger inputs it could well prove faster, however inputs greater than (512,512,512) are not possible, as the maximum goat/wolf/lion value that can be hashed with BITS_FOR_HASH < 11 is 1023. Using BITS_FOR_HASH of 11 or larger wouldn’t translate into the Java version, as the maximum hash would exceed 2^32 and there’s no way to index into a Java BitSet with a long. The only option would be to write my own long-indexable bitset in Java, and ain’t nobody got time for that, so I admit defeat. java.util.stream: 1. Logicchains: 0.

Moral of the story: profile before attempting to optimise, don’t assume functional programming necessarily performs poorly, and give the D language a try sometime; while it may have some wrinkles, it can nevertheless be quite powerful and elegant.

Also, sometimes it pays to implement your own sorting function if you know the input won’t be worst-case, as demonstrated here by quicksort being twice as fast as D’s standard sort. (Or, conversely, maybe D needs to improve its default sorting algorithm for this kind of input, as C++’s and Rust’s weren’t so slow.)

Edit: The Java HashSet implementation of the problem was buggy, in that I didn’t realise I had to manually implement hashCode() and equals() on the Forest class. Fixing that made it actually faster than the C++ at higher inputs, so yay, optimisation complete!

Also, changing the D to use the default stable sort (like the C++) actually made it as fast as the C++ for higher inputs (such as 500,500,500). Stable sorting is faster in this case than unstable sorting, possibly because the input is already partially sorted (forests generate similar forests nearby to themselves in the array).

Generic functions on collections are possible in Haskell, via type classes; they’re just not shown in the tutorials, as they can produce error messages that are less comprehensible to beginners.

Finally, there’s a superior algorithmic implementation that runs in linear time, described here.


Memory Access Microbenchmark: Clojure, F#, Go, Julia, Lua, Racket, Ruby, Self…

*Edit – For clarity, I’ve decided to just include the most up-to-date, fair and relevant results tables at the top here. Scroll down below to read more about the optimisation process and the effectiveness of various different implementations.

NumTrades = 150

Language Compiler Compile time (s) Running time (ms) % Fastest Resident mem use (KiB) Compressed source size Lines of code Number of characters Executable size (KB)
C3 clang 0.0944 248 100.00 352224 757 60 1502 9
Clojure - 0.0000 607 40.86 893908 976 79 2814 0
Java3 javac 0.8451 624 39.74 826964 864 119 3386 2
Scala - 0.0000 737 33.65 751168 787 57 1719 0
Common Lisp - 0.0000 1152 21.53 902192 715 44 2103 0
Python2 (Pypy)2 - 0.0000 1716 14.45 1382304 609 42 1382 0
C#3 mcs 0.7248 2099 11.82 727712 893 114 3268 4
F#3 fsharpc 4.2022 2126 11.67 731808 660 39 1541 12
Racket raco 17.1353 2150 11.53 875424 691 38 1959 11142
Go3 go 0.3930 2516 9.86 352952 656 62 1313 2799
Ruby1 - 0.0000 6122 4.05 1070024 636 52 1266 0
JuliaBigNum - 0.0000 8932 2.78 2526800 565 52 1227 0
Tcl - 0.0000 18754 1.32 1702460 543 45 1130 0

NumTrades = 100

Language Compiler Compile time (s) Running time (ms) % Fastest Resident mem use (KiB) Compressed source size Lines of code Number of characters Executable size (KB)
C3 clang 3.0596 210 100.00 235036 757 60 1502 9
Clojure - 0.0000 295 71.19 706660 976 79 2814 0
Java3 javac 2.8895 421 49.88 561296 864 119 3386 2
Scala - 0.0000 478 43.93 479916 787 57 1719 0
Common Lisp - 0.0000 644 32.61 866428 715 44 2103 0
Python2 (Pypy)2 - 0.0000 1036 20.27 955904 609 42 1382 0
F#3 fsharpc 7.6323 1082 19.41 492912 660 39 1541 12
C#3 mcs 1.7726 1144 18.36 488812 893 114 3268 4
Racket raco 26.5090 1208 17.38 639700 691 38 1959 11142
Go3 go 2.8093 1921 10.93 235768 656 62 1313 2799
Ruby1 - 0.0000 3045 6.90 675208 636 52 1266 0
Lua - 0.0000 3780 5.56 2202448 576 53 1305 0
Smalltalk (GNU) - 0.0000 5545 3.79 611088 712 76 1641 0
JuliaBigNum - 0.0000 7836 2.68 1012580 565 52 1227 0
Tcl - 0.0000 9865 2.13 1135652 543 45 1130 0

NumTrades = 50

Language Compiler Compile time (s) Running time (ms) % Fastest Resident mem use (KiB) Compressed source size Lines of code Number of characters Executable size (KB)
C3 clang 3.6418 81 100.00 117848 757 60 1502 8
Common Lisp - 0.0000 84 96.43 324988 715 44 2103 0
Clojure - 0.0000 117 69.23 403120 976 79 2814 0
Racket raco 20.1401 132 61.36 373368 691 38 1959 11142
Java3 javac 2.8739 176 46.02 318364 864 119 3386 2
Scala - 0.0000 213 38.03 276620 787 57 1719 0
LuaJit - 0.0000 230 35.22 649116 576 53 1305 0
F#3 fsharpc 5.3502 381 21.26 254008 660 39 1541 12
C#3 mcs 1.4964 389 20.82 249920 893 114 3268 4
Python2 (Pypy)2 - 0.0000 428 18.93 482480 609 42 1382 0
Go3 go 2.3946 814 9.95 118576 656 62 1313 2799
Ruby1 - 0.0000 1435 5.64 301384 636 52 1266 0
Lua - 0.0000 1720 4.71 1101764 576 53 1305 0
Smalltalk (GNU) - 0.0000 1825 4.44 311132 712 76 1641 0
JuliaBigNum - 0.0000 2424 3.34 907624 565 52 1227 0
Python3 - 0.0000 2425 3.34 1338756 497 32 1126 0
Tcl - 0.0000 4102 1.97 569216 543 45 1130 0

NumTrades = 10

Language Compiler Compile time (s) Running time (ms) % Fastest Resident mem use (KiB) Compressed source size Lines of code Number of characters Executable size (KB)
Common Lisp - 0.0000 16 100.00 111068 715 44 2103 0
C3 clang 0.0966 16 100.00 24096 757 60 1502 9
Racket raco 17.0957 26 61.54 190028 691 38 1959 11142
Clojure - 0.0000 27 59.26 121268 976 79 2814 0
Java3 javac 0.8250 39 41.03 96224 864 119 3386 2
LuaJit - 0.0000 40 40.00 128036 576 53 1305 0
Scala - 0.0000 41 39.02 148820 787 57 1719 0
F#3 fsharpc 4.1270 63 25.40 62920 660 39 1541 12
C#3 mcs 0.7194 64 25.00 58820 893 114 3268 4
Python2 (Pypy)2 - 0.0000 85 18.82 113608 609 42 1382 0
Go3 go 0.5028 158 10.13 24828 656 62 1313 2799
JuliaBigNum - 0.0000 273 5.86 316656 565 52 1227 0
Ruby1 - 0.0000 286 5.59 65244 636 52 1266 0
Smalltalk (GNU) - 0.0000 324 4.94 66276 712 76 1641 0
Lua - 0.0000 370 4.32 216136 576 53 1305 0
Python3 - 0.0000 485 3.30 273000 497 32 1126 0
Tcl - 0.0000 813 1.97 115772 543 45 1130 0

I recently came across an interesting blog post that dealt with optimising a common situation in which the JVM performed surprisingly poorly, which prompted me to try the microbenchmark in a few other languages to see how they fared. The code for this benchmark can be found here.

The problem: iterating over an array of objects, modifying and reading its contents. What’s interesting about it? Firstly, as this is a relatively common task, the results give a tiny bit of insight into the ‘general’ speed of the languages tested. Secondly, it’s directly relevant to the effectiveness of object pooling, a technique I’m using in the other benchmark I’m working on. The idea of an object pool is to allocate the needed objects statically at initialisation and reuse them instead of creating new objects, to avoid generating garbage. The effectiveness of this technique would however be limited if it were impossible to access and modify objects in the pool without generating garbage.

Why are some of the usual suspects (D, Rust, Nimrod, C++) missing? They use the same backend as C so they’d produce results of similar speed, and their metaprogramming facilities are powerful enough that it’d be easy to ‘cheat’ and do all the calculation at compile time. If somebody wants to submit an implementation in one of these languages they are however quite welcome, as long as it doesn’t ‘cheat’.

Now, the results (but don’t neglect to read the rest of the post, as there’s a twist at the end). The value NumTrades multiplied by 50,000 is the number of objects in the array of trade objects. Note that if a particular language/implementation is missing from a certain NumTrades value’s results, it ran out of memory or was otherwise killed by the OS. The tests were run on a dual-core, 64-bit, 1.6GHz Intel i5 with 4GB of RAM, running Linux Mint 15 64-bit.

*Edit: It should be noted that this set of tables tests the form of allocation used in the blog post that inspired this one. That form is however suboptimal, and the set of tables further below uses an optimal method of allocation/object pooling. This first set is included for comparison, to show how a small change can lead to a significant speedup. Also note that the Ruby, Smalltalk and Lua implementations here use the faster form of allocation, as without it they ran out of memory and died too soon; it is hence fairer to compare them with the implementations suffixed with 2 in the later set. If you’re interested in the fairest numbers, scroll to the final set of tables.

NumTrades = 150 

Language Compiler Compile time (s) Running time (ms) % Fastest Resident mem use (KiB) Compressed source size Lines of code Number of characters Executable size (KB)
x86-64 assembly fasm 0.0318 110 100.00 300524 2279 180 5840 8
Go go 0.3103 124 88.71 352484 614 61 1206 2256
C clang 0.0737 127 86.61 352044 712 56 1425 9
JavaUnSafe javac 1.1290 203 54.19 326036 1182 156 5148 2
CLispFixNum - 0.0000 228 48.24 810536 745 44 2152 0
Julia - 0.0000 298 36.91 609564 552 52 1228 0
Common Lisp - 0.0000 1120 9.82 902192 715 44 2103 0
Java javac 2.1709 1575 6.98 918820 819 121 3329 1
C# mcs 0.6872 1587 6.93 999892 878 115 3301 4
Racket raco 16.8814 1955 5.63 877364 691 38 1959 11142
F# fsharpc 4.2951 5761 1.91 1452868 646 40 1526 12
Python2 (Pypy) - 0.0000 9489 1.16 1519092 561 38 1192 0
Tcl - 0.0000 14486 0.76 1702460 544 46 1170 0

NumTrades = 100

Language Compiler Compile time (s) Running time (ms) % Fastest Resident mem use (KiB) Compressed source size Lines of code Number of characters Executable size (KB)
x86-64 assembly fasm 0.3159 70 100.00 200468 2279 180 5840 8
Go go 0.3185 82 85.37 235300 614 61 1206 2256
C clang 1.8142 84 83.33 234856 722 56 1435 9
JavaUnSafe javac 1.1487 135 51.85 223596 1182 156 5148 2
CLispFixNum - 0.0000 152 46.05 594160 745 44 2152 0
Julia - 0.0000 200 35.00 415212 552 52 1228 0
Common Lisp - 0.0000 656 10.67 866428 715 44 2103 0
C# mcs 0.7074 953 7.35 966208 890 116 3320 4
Java javac 2.7260 1052 6.65 651904 819 121 3329 1
Racket raco 20.3636 1237 5.66 639704 691 38 1959 11142
F# fsharpc 6.2570 1393 5.03 976720 646 40 1526 12
Python2 (Pypy) - 0.0000 2234 3.13 1183108 561 38 1192 0
Lua - 0.0000 4580 1.53 2202468 554 52 1208 0
Ruby - 0.0000 4753 1.47 2233092 605 56 1250 0
Smalltalk (GNU) - 0.0000 5313 1.32 611088 712 76 1641 0
Tcl - 0.0000 9279 0.75 1135660 544 46 1170 0
Python3 - 0.0000 10654 0.66 1582084 607 39 1268 0
Clojure - 0.0000 86824 0.08 994700 960 53 2811 0

NumTrades = 50

Language Compiler Compile time (s) Running time (ms) % Fastest Resident mem use (KiB) Compressed source size Lines of code Number of characters Executable size (KB)
Go go 1.8760 40 100.00 118112 614 61 1206 2256
x86-64 assembly fasm 0.6947 40 100.00 100416 2279 180 5840 8
C clang 1.6018 43 93.02 117668 722 56 1435 7
JavaUnSafe javac 1.2099 73 54.79 121152 1182 156 5148 2
Common Lisp - 0.0000 96 41.67 324988 715 44 2103 0
Julia - 0.0000 99 40.40 271700 552 52 1228 0
Racket raco 21.5142 130 30.77 373364 691 38 1959 11142
LuaJit - 0.0000 230 17.39 649136 554 52 1208 0
C# mcs 1.5198 460 8.70 488508 890 116 3320 4
F# fsharpc 5.7907 480 8.33 455608 646 40 1526 12
Java javac 3.0675 677 5.91 493744 819 121 3329 1
Python2 (Pypy) - 0.0000 1010 3.96 620820 561 38 1192 0
Smalltalk (GNU) - 0.0000 1835 2.18 311136 712 76 1641 0
Lua - 0.0000 2150 1.86 1101764 554 52 1208 0
Ruby - 0.0000 2197 1.82 1046204 605 56 1250 0
Python3_fst - 0.0000 2460 1.62 1338748 497 32 1127 0
Tcl - 0.0000 4121 0.97 569212 544 46 1170 0
Python3 - 0.0000 4951 0.81 794376 607 39 1268 0
Clojure - 0.0000 10246 0.39 858024 960 53 2811 0

NumTrades = 10

Language Compiler Compile time (s) Running time (ms) % Fastest Resident mem use (KiB) Compressed source size Lines of code Number of characters Executable size (KB)
Go go 1.6017 8 100.00 24360 614 61 1206 2256
C clang 1.6518 8 100.00 23920 722 56 1435 9
Java javac 1.1308 13 61.54 117792 819 121 3329 1
JavaUnSafe javac 1.0829 14 57.14 39256 1182 156 5148 2
Common Lisp - 0.0000 16 50.00 111068 715 44 2103 0
Julia - 0.0000 20 40.00 164340 552 52 1228 0
Racket raco 17.0663 26 30.77 190052 691 38 1959 11142
LuaJit - 0.0000 50 16.00 128032 554 52 1208 0
C# mcs 0.6989 67 11.94 106228 890 116 3320 4
F# fsharpc 4.4070 83 9.64 116864 646 40 1526 12
Python2 (Pypy) - 0.0000 203 3.94 142012 561 38 1192 0
Smalltalk (GNU) - 0.0000 324 2.47 66276 712 76 1641 0
Ruby - 0.0000 448 1.79 212004 605 56 1250 0
Lua - 0.0000 450 1.78 216132 554 52 1208 0
FasterPython - 0.0000 484 1.65 273000 497 32 1127 0
Tcl - 0.0000 821 0.97 115768 544 46 1170 0
Python3 - 0.0000 1017 0.79 163864 607 39 1268 0
Clojure - 0.0000 1121 0.71 322328 960 53 2811 0

Take the results for Python, Ruby and Julia with a grain of salt, as I’m less experienced with these languages and there are probably optimisations that I’ve missed. Suggestions are welcome! (Note also that the rightmost column of the table represents the size of the resulting executable; it doesn’t seem to fit fully into the page.)

*Edit: The Ruby and Julia versions have been improved since the benchmark was originally run, and now perform quite impressively. A version that’s faster in CPython (but not in Pypy) has also been added, as Python3_fst.

Points of interest:

Pypy – Significantly faster than standard Python 3 (CPython); quite impressive. For such an otherwise elegant language, however, writing constructors in Python is surprisingly awkward. Or perhaps it’d be more accurate to say the fact that one even has to write constructors in the first place seems odd; is there a default constructor I’m not aware of, a more concise way to write the following? *edit: there is indeed a better way to write this. I need to learn more Python.

class PyMemTrade():
    def __init__(self, tradeId, clientId, venueId, instrumentCode, price, quantity, side):
        self.tradeId = tradeId
        self.clientId = clientId
        self.venueId = venueId
        self.instrumentCode = instrumentCode
        self.price = price
        self.quantity = quantity
        self.side = side

C# – Faster than Java, and uses less memory. An interesting note: I originally ran the benchmark with the older Mono 2.1 runtime, before having to upgrade to 3.2 to get F# to compile, and the C# implementation actually ran faster there. Presumably the old garbage collector was better suited to this case, whereas maybe the new one is more like Java’s.

x86-64 assembly – Uses less memory, courtesy of struct packing. Normally languages pad structs, aligning memory such that each access of a particular type is at an address that’s a multiple of that type’s size: an int32 can sit at 4, 8, 12, etc, whereas an int64 can only sit at 8, 16, 24, etc. The trade struct in this benchmark is long, long, int, int, long, long, byte: 8, 8, 4, 4, 8, 8, 1, or 41 bytes total. When sticking these in an array, since 41 is not a multiple of 8, 7 bytes of padding are needed at the end of each struct so that the long at the start of the next one stays aligned. As assembly doesn’t have structs, just byte arrays, it’s possible to avoid this padding. Note that while it’s faster in this case, unaligned access is generally thought to be slower, and on non-x86 architectures can cause the program to segfault. Also, you can do the same in C with compiler extensions, and that’s generally considered a much better way to achieve unaligned data structures than writing the whole program in assembly.
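
The padding arithmetic is easy to check; in Go, for example (a hypothetical struct mirroring the field order described above, not the benchmark’s actual code):

package main

import (
    "fmt"
    "unsafe"
)

type trade struct {
    tradeID        int64
    clientID       int64
    venueCode      int32
    instrumentCode int32
    price          int64
    quantity       int64
    side           byte
}

func main() {
    // 41 bytes of fields, padded out to the struct's 8-byte alignment.
    fmt.Println(unsafe.Sizeof(trade{})) // prints 48
}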

Note that I’m a pretty mediocre x86 assembly programmer; it might well be faster if written by someone more experienced. Pull requests are welcome. Also note that even excluding the printing code (I couldn’t get printf to work without segfaulting, so I copy-pasted an integer printing routine; take a look, printing integers is actually ridiculously slow and complicated compared to printing strings), the assembly code is still more concise than its Java equivalent. Heck, going by plain line count the entire assembly file, including the ridiculous integer printing routine, is 225 lines, only around 15% more than the unsafe Java code :/ To be fair, however, I should note that the Java code does include numerous setters and getters, which aren’t strictly necessary. For those who aren’t familiar with assembly, here’s an example of setting the values in a struct, where rcx holds the address of the struct and rbx the value you want to assign to tradeId, price and quantity (please don’t hesitate to let me know if you spot any mistakes, as I’m quite confident the x86-64 implementation doesn’t function completely as intended):

  mov qword [rcx], rbx ;tradeId = rbx
  mov qword [rcx+8], 1 ;clientId = 1
  mov dword [rcx+16], 123 ;venueCode = 123
  mov dword [rcx+20], 321 ;instrumentCode = 321
  mov qword [rcx+24], rbx ;price = rbx
  mov qword [rcx+32], rbx ;quantity = rbx

Go – Almost matches C, quite impressive considering how many more man-hours have been put into LLVM than into the Go compiler. Why is it so much faster than C# and Java? Maybe because it lets me chase pointers, manipulating data in the array directly without generating any temporary objects, hence minimising garbage creation. Something I learned: on x86-64, Go’s int is 64-bit; I initially made the mistake of assuming it was 32-bit like C’s int.
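
The pointer-chasing loop in question looks something like this (a sketch using the field values from the assembly snippet above and the trade struct from the earlier sizeof sketch; the actual implementation is in the repo):

// initTrades mutates a flat slice of structs in place: taking a pointer
// into the slice's backing array allocates nothing, so the loop
// generates no garbage.
func initTrades(trades []trade) {
    for i := range trades {
        t := &trades[i]
        t.tradeID = int64(i)
        t.clientID = 1
        t.venueCode = 123
        t.instrumentCode = 321
        t.price = int64(i)
        t.quantity = int64(i)
        if i%2 == 0 {
            t.side = 'B'
        } else {
            t.side = 'S'
        }
    }
}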

Common Lisp (SBCL) – Incredibly impressive speed for a dynamic language, particularly as I didn’t even add type annotations (my attempts at adding them didn’t seem to speed the implementation up at all). The speed drops as NumTrades increases, however, and profiling with (time) reveals that garbage collection takes up a significant (60%+) portion of the running time. This means that either the compiler isn’t able to convert:

 (let ((trade-ref (aref trades i)))
   (setf (lisp-memory-trade-trade-id trade-ref) i)
   (setf (lisp-memory-trade-client-id trade-ref) 1))

into

 struct CMemoryTrade *trade = &(trades[i]);
  trade->TradeId = i;
  trade->ClientId = 1;

or that it’s unable to stack allocate trade-ref so as to avoid it becoming heap garbage. That, or its bignum implementation generates more garbage. The same applies to the Racket compiler and the JVM.

*Edit: Thanks to type annotations it’s possible to make the Common Lisp implementation perform even faster, by avoiding bignums and overflowing instead, as C, Go and Java do. This result is titled CLispFixNum.

Pharo – Sure it’s slow, but it’s had far fewer development hours put into it than any of the other languages used. It was however one of the funnest implementations to write, and also the most concise. Wait, you say, Pharo’s not in the list? I converted the program to GNU Smalltalk, as that’s easier to run from the command line and around 50% faster, but it was developed in Pharo. It’s also not too slow compared to CPython. If I were given the resources to write my own language (tangent incoming; feel free to skip the next two paragraphs),

it would be a statically typed Smalltalk (or maybe something closer to Self) with type inference, a focus on referential transparency, and incremental compilation. If the methods of an immutable object only messaged other immutable objects (i.e. were pure functions), it’d be possible to compile the whole object into an optimised block that was potentially as fast as the equivalent C code. The compiler would be a first-class citizen, so it’d also be possible to compile non-referentially-transparent blocks with Compiler compile: myBlock. Track would be kept of all the objects that such a block messaged and its assumed responses from them, and if any of them changed then the compiled block would become invalid and revert to interpreted code. The file system would also be a first-class citizen; compiled blocks could be saved as executables that took their messages as flags (./myblock doLotsOfMathsWithInput: someNum) and output the result to stdout, which the runtime could then capture. Block data would be stored in metadata files accompanying the executables, and the filesystem could be used as if it were a Smalltalk image. Note that this would allow easy parallelism via the OS process scheduler, a bit like Erlang. It would also allow for runtime code specialisation, as described in this link, making it faster than C in some cases.
For instance, the following function:

 int myFun(int a, int b, int c){
     return a*b + c;
 }

can be compiled to just return 12 + c if a and b are known at compile time to be 3 and 4, which is much faster. If they’re runtime constants instead (like Java’s final), it should still be possible to apply the same optimisation, but in practice the JVM doesn’t (and I don’t think the CLR does either). A language that took advantage of this kind of optimisation could be faster than C for problems where a lot of information is only known at runtime but doesn’t change once it’s loaded.
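To make the idea concrete, here’s a hedged sketch in Go of specialising on runtime constants with a closure; a compiler built around the idea would emit fresh machine code rather than a closure, but the principle is the same (all the names here are mine):

package main

import "fmt"

// specialise folds the runtime constants a and b into a closure, so
// each later call only pays for the remaining addition.
func specialise(a, b int) func(int) int {
    ab := a * b // computed once, at "specialisation time"
    return func(c int) int { return ab + c }
}

func main() {
    myFun := specialise(3, 4) // a and b become runtime constants
    fmt.Println(myFun(5))     // 17, i.e. 12 + c
}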

Racket – Very enjoyable to write. It was originally much faster, before I fixed a weird bug I was having. Can you spot the bug in the following code?

 (define trades (make-vector *NUM_RECORDS* [rkt-mem-trade 1 1 1 1 1 1 #\a]))

If not, neither did I. I couldn’t understand why my attempts to mutate the objects in the vector were producing weird results, especially when the equivalent code in Common Lisp worked fine. Even though half the trades should have had a side value of ‘S’ and half ‘B’, for some reason after the vector was initialised they all had ‘S’, even though my initialisation code had been checked and double-checked. Then I realised: for every other field, all the objects in the vector also shared the same value, the value set on the final object in the vector. Why was this? Answer: all the slots in the vector pointed to the same object. Changing the array creation code to the following:

 (define (make-trade i) (rkt-mem-trade 1 1 1 1 1 1 #\a))
 (define: trades : (Vectorof rkt-mem-trade) (build-vector *NUM_RECORDS* make-trade) )

fixed the problem. It also slowed the code down significantly. It does, however, suggest a solution to the problem of the poor cache locality of *obj arrays compared to obj arrays: have all the *obj point to the same object, as then it’ll always be in the cache…
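For anyone who hasn’t been bitten by this class of bug before, here it is in miniature in Go (a sketch with a made-up Trade type, not code from the benchmark):

package main

import "fmt"

type Trade struct{ Side byte }

func main() {
    shared := &Trade{Side: 'B'}
    trades := make([]*Trade, 3)
    for i := range trades {
        trades[i] = shared // every slot aliases the same object...
    }
    trades[2].Side = 'S'               // ...so one write shows up everywhere
    fmt.Println(trades[0].Side == 'S') // true
}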

Lua – LuaJit really is quite impressive, achieving a 10x speedup over the standard Lua interpreter, though still falling behind Racket and Common Lisp. Interestingly, although LuaJit appears to use less heap, plain Lua actually handles conditions of high memory use better: the reason there is no LuaJit implementation listed for numTrades values of 100 and 150 is that it died with an out-of-memory error while running them (the benchmarking script automatically excludes languages that don’t compile or run properly).

Self – What’s Self? It’s a research language that’s like a purer form of Smalltalk. It was extremely awesome to write, but I didn’t include it as it ran out of memory on NumTrades values of greater than 10 (and took over a second to run at NumTrades=10). Nevertheless I think it’s worth checking out for fans of Smalltalk, as it does an even better job of ‘pure OO’, and is extremely elegant about it. My original implementation in Self was only around 20 lines of code, although there are a few more now, as I wasn’t sure how to concatenate strings for printing so I printed them the ugly way.

F# – A neat language, and generally fun to write. Not quite as fast as C#, although originally it came out as the fastest language in the benchmark. Why? Firstly, I made the same mistake as in Racket, initialising an array with a function that ended up filling it with references to the same object rather than with distinct objects. After I fixed that, however, it was still suspiciously fast. This was because I didn’t realise that parameter-less functions have to be called with a () at the end; a function with one parameter can be called as ‘myFun param’, but a function with no parameters must be called as ‘myFun()’, so my benchmark function was never actually being run. In the process of figuring out what I was doing wrong, I ended up looking through the CIL (CLR bytecode) disassembly of the program, and I must say it was one of the most nauseating things I’ve ever seen in my life. Imagine a language with twice the verbosity of Java and half the clarity of x86 assembly, and you’ve got a language that’s about four times as elegant as the CLR Common Intermediate Language. I really don’t think the following should belong in bytecode: *Edit: okay, I may be wrong here, and the textual CIL isn’t itself the bytecode, but rather is assembled into it. In which case, ignore this complaint.

.class nested public auto ansi sealed serializable FSMemTrade
  extends [mscorlib]System.Object
  implements class [mscorlib]System.IEquatable`1, [mscorlib]System.Collections.IStructuralEquatable, class [mscorlib]System.IComparable`1, [mscorlib]System.IComparable, [mscorlib]System.Collections.IStructuralComparable {
  .custom instance void class [FSharp.Core]Microsoft.FSharp.Core.CompilationMappingAttribute::'.ctor'(valuetype [FSharp.Core]Microsoft.FSharp.Core.SourceConstructFlags) = (01 00 02 00 00 00 00 00 )

Even functions seemed to have multiple verbose constructors defined. The total disassembly was around 1800 lines, almost an order of magnitude more than the equivalent program written by hand (or emitted by LLVM) in assembly language. If you ever plan on writing low-level code for a virtual machine by hand, my advice is to stick to the JVM, as its bytecode is much nicer (only 160 lines for the disassembly of Java.class, just 10 lines more than the original Java.java source file…). To be fair, however, the disassembly of the C# executable (as opposed to the F# one) was only around 460 lines long, and didn’t have quite as much boilerplate.

Ruby – I don’t quite understand the fuss about Ruby; like Python, it seems to be a dynamic language that wants me to write constructors (and getters/setters), or else use ugly stuff like $trades[i].instance_variable_get("@price"). To be fair, it does seem to be slightly faster than CPython, but it’s still slower than Smalltalk, and if I was going to use a language that slow I’d definitely prefer Smalltalk. Maybe someone can enlighten me as to what’s so great about Ruby? (And Rails doesn’t count, as I’m not a webdev.)

*Edit: Okay, I just mentally compared Ruby to Javascript and PHP, and was overcome with love for the language. I think I now better understand its appeal.

*Edit2: Wow, I made a small change and now the Ruby implementation is twice as fast, faster than Smalltalk and beating CPython. Now I’m impressed.

Julia – The language is quite fast on many other tasks, comparable to C or Fortran, but it seems the optimiser doesn’t know what to do with this particular program. I suspect someone more familiar with the language than I am could significantly speed it up by changing the array of trades into a one-dimensional matrix of int64s and manipulating that; the language is built around numerical computing, and the compiler has specialised optimisations it applies when it detects idiomatic numerics code. Writing in the language was quite pleasant; I think the future looks bright for Julia. It was also the most robust of the ‘slower’ languages, managing to complete 150 numTrades where GST, Ruby and Clojure dropped out. Note, if you want to test it yourself, that the Julia compiler takes quite a while to build (maybe at least half as long as Rust). Interestingly, this is the most concise language in terms of compressed source file size.

*Edit: Wow, a patch to make the global trades array local has sped it up massively, to near C speed. This shows the importance of being familiar with the gotchas of a language currently still in development in order to maximise its performance.

Clojure – Poor performance here can probably be blamed on the JVM itself, considering how poorly Java fares. The Clojure implementation was, however, actually the least enjoyable to write (although normally my experience with Clojure is quite pleasant), due to how the language goes out of its way to make declaring mutable types difficult. Where Racket only requires a #:mutable tag to render a struct mutable, Clojure requires declaring an interface complete with getters and setters for every field, and then declaring a type that implements them. Somebody please tell me there’s an alternative to the following monstrosity:

(definterface IMemTest
  (^Long gtradeId []) (^Long gclientId []) (^Long gvenueId []) (^Long ginstrumentCode []) (^Long gprice [])
  (^Long gquantity []) (^Character gside [])
  (stradeId [^Long v]) (sclientId [^Long v]) (svenueId [^Long v]) (sinstrumentCode [^Long v]) (sprice [^Long v])
  (squantity [^Long v]) (sside [^Character v]))
(deftype CJMemTest [^:unsynchronized-mutable ^Long tradeId ^:unsynchronized-mutable ^Long clientId ^:unsynchronized-mutable ^Long venueId
  ^:unsynchronized-mutable ^Long instrumentCode ^:unsynchronized-mutable ^Long price ^:unsynchronized-mutable ^Long quantity
  ^:unsynchronized-mutable ^Character side]
IMemTest
  (gtradeId [_] tradeId)(gclientId [_] clientId)(gvenueId [_] venueId)(ginstrumentCode [_] instrumentCode)
  (gprice [_] price)(gquantity [_] quantity)(gside [_] side)
  (stradeId [this v] (set! tradeId v)) (sclientId [this v] (set! clientId v)) (svenueId [this v] (set! venueId v))
  (sinstrumentCode [this v] (set! instrumentCode v))
  (sprice [this v] (set! price v)) (squantity [this v] (set! quantity v))(sside [this v] (set! side v)))

I wonder if Rich Hickey ever had any bad experiences with Godzilla, as the above code suggests he must really hate mutation. Even Haskell makes declaring mutable datastructures easier. (I’d love it if someone were to submit a Haskell implementation, by the way; my type-foo isn’t up to scratch for dealing with the state monad.)

Finally, the twist: although the C, Go and C# implementations are faster, they’re also wrong! Run the Clojure version with a sufficiently large numTrades, changing +’ to +, and it’ll throw an ‘integer overflow’ exception, kindly letting us know (as the other implementations didn’t) that the buyCost and sellCost variables are overflowing. The other Lisps, Python and Smalltalk handle this by automatic promotion to bignum. The benchmark is hence not particularly fair: bignum operations are slower than standard integer multiplication, and the dynamic languages are doing bignum calculations while the statically typed languages are simply overflowing. Look at the numbers for numTrades < 100 for a fairer comparison, as no language is using bignums then. Note that the current Clojure version uses +’ rather than +, so it promotes to bignums automatically and avoids the overflow.
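To see the difference in miniature, here’s a hedged Go sketch contrasting silent int64 wraparound with math/big promotion (the variable names are mine; buyCost is borrowed from the description above):

package main

import (
    "fmt"
    "math/big"
)

func main() {
    // int64 wraps silently on overflow, which is what the C, Go and
    // Java implementations were doing.
    var wrapped int64 = 1 << 62
    wrapped += wrapped       // 2^63 doesn't fit; the value wraps
    fmt.Println(wrapped < 0) // true: the sign bit flipped

    // math/big promotes instead, like the Lisps, Python and Smalltalk.
    buyCost := new(big.Int).Lsh(big.NewInt(1), 62) // 2^62
    buyCost.Add(buyCost, buyCost)                  // 2^63, exact
    fmt.Println(buyCost.String())
}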

I bet you thought that was all. But no, there’s more! Long story short, the Java code in the original blog post was poorly written from a performance perspective (unnecessary ‘new’s). I cleaned it up, and voila: it’s faster than the original unsafe Java implementation and quite close to the C. Moral of the story: don’t resort to sun.misc.Unsafe until you’ve removed all the ‘new’s from your performance-sensitive code. In the following tables, each language suffixed with ‘2’ is an implementation where the following (or equivalent):

JavaMemoryTrade trade = new JavaMemoryTrade();
trades[i] = trade;
trade.setTradeId(i);
trade.setClientId(1);

is changed to:

trades[i].setTradeId(i);
trades[i].setClientId(1);

Here are the results:

NumTrades = 150 

Language Compiler Compile time (s) Running time (ms) % Fastest Resident mem use (KiB) Compressed source size Lines of code Number of characters Executable size (KB)
x86-64 assembly fasm 0.3751 120 100.00 300528 2279 180 5840 8
Go go 0.7014 122 98.36 352484 614 61 1206 2256
C clang 0.0903 127 94.49 352044 712 56 1425 9
Go2 go 0.3431 127 94.49 352476 594 59 1170 2256
C2 clang 1.8033 129 93.02 352044 689 54 1374 9
Java2 javac 2.4070 164 73.17 706224 803 118 3239 1
CLispFixNum - 0.0000 228 52.63 810536 745 44 2152 0
Julia - 0.0000 298 40.27 601312 552 52 1228 0
F#2 fsharpc 4.8907 343 34.99 729336 637 38 1521 12
C#2 mcs 1.3778 347 34.58 727544 849 112 3213 4
Common Lisp - 0.0000 1108 10.83 902192 715 44 2103 0
C# mcs 0.7031 1592 7.54 999880 878 115 3301 4
Java javac 0.8713 1612 7.44 912288 819 121 3329 1
Common Lisp2 - 0.0000 1644 7.30 902772 675 39 1978 0
Python2 (Pypy)2 - 0.0000 1676 7.16 1382296 609 42 1382 0
Racket raco 16.9661 1989 6.03 875392 691 38 1959 11142
F# fsharpc 4.1842 2274 5.28 1453200 646 40 1526 12
Racket2 raco 19.8999 2334 5.14 875384 698 36 2020 11142
Python2 (Pypy) - 0.0000 3443 3.49 1519136 561 38 1192 0

NumTrades = 100 

Language Compiler Compile time (s) Running time (ms) % Fastest Resident mem use (KiB) Compressed source size Lines of code Number of characters Executable size (KB)
x86-64 assembly fasm 0.2580 70 100.00 200468 2279 180 5840 8
Go go 0.3212 82 85.37 235300 614 61 1206 2256
C clang 0.1121 83 84.34 234856 712 56 1425 9
C2 clang 1.7537 85 82.35 234856 689 54 1374 9
Go2 go 0.3121 85 82.35 235288 594 59 1170 2256
Java2 javac 2.7039 106 66.04 530540 803 118 3239 1
CLispFixNum - 0.0000 152 46.05 594160 745 44 2152 0
Julia - 0.0000 198 35.35 436676 552 52 1228 0
F#2 fsharpc 5.5708 228 30.70 490452 637 38 1521 12
C#2 mcs 1.6621 231 30.30 488656 849 112 3213 4
Common Lisp - 0.0000 652 10.74 866428 715 44 2103 0
Java javac 0.8986 831 8.42 632704 819 121 3329 1
C# mcs 0.7101 979 7.15 827400 878 115 3301 4
Python2 (Pypy)2 - 0.0000 990 7.07 955840 609 42 1382 0
Common Lisp2 - 0.0000 1012 6.92 866200 675 39 1978 0
Racket raco 16.8806 1219 5.74 639732 691 38 1959 11142
F# fsharpc 4.0892 1386 5.05 976720 646 40 1526 12
Racket2 raco 23.0935 1461 4.79 639720 698 36 2020 11142
Python2 (Pypy) - 0.0000 2213 3.16 1183096 561 38 1192 0
Lua - 0.0000 4730 1.48 2202468 554 52 1208 0
Ruby - 0.0000 5262 1.33 2233164 605 56 1250 0
Smalltalk (GNU) - 0.0000 5345 1.31 611088 712 76 1641 0
Python3-2 - 0.0000 7901 0.89 1582084 607 42 1380 0
Python3 - 0.0000 9863 0.71 1582080 607 39 1268 0

It can be seen that this optimisation results in a huge speedup for the Java version, as it reduces object creation and hence garbage collection. The same goes for C#, F# and Python. Interestingly, however, it has an adverse effect on the performance of the Lisp implementations, suggesting that an object is created for each aref/vector-ref, so minimising the use of those functions produces better performance. This suggests, somewhat sadly, that object pooling may not be an effective option in Lisp, as there’d be no way to mutate objects in an array without generating garbage (short of using unsafe ffi functions).

Finally, the most useful part of the benchmark: benchmarking the fastest implementation in each language, with the statically typed languages all using bignums so as to produce a fair result. The results are as follows:

NumTrades = 150

Language Compiler Compile time (s) Running time (ms) % Fastest Resident mem use (KiB) Compressed source size Lines of code Number of characters Executable size (KB)
C3 clang 1.9361 252 100.00 352224 750 60 1492 9
Java3 javac 2.6750 547 46.07 797152 864 119 3386 2
Common Lisp - 0.0000 1116 22.58 902192 715 44 2103 0
Racket raco 20.1457 1957 12.88 875388 691 38 1959 11142
C#3 mcs 1.6699 2100 12.00 727712 893 114 3268 4
F#3 fsharpc 5.8450 2125 11.86 731956 659 39 1541 12
Go3 go 0.6288 2515 10.02 352956 656 62 1313 2799
Python2 (Pypy) - 0.0000 3333 7.56 1519136 561 38 1192 0

NumTrades = 100

Language Compiler Compile time (s) Running time (ms) % Fastest Resident mem use (KiB) Compressed source size Lines of code Number of characters Executable size (KB)
C3 clang 1.7614 162 100.00 235036 750 60 1492 9
Julia - 0.0000 198 81.82 421180 552 52 1228 0
Java3 javac 2.9222 369 43.90 561684 864 119 3386 2
Common Lisp - 0.0000 644 25.16 866428 715 44 2103 0
F#3 fsharpc 5.4417 1073 15.10 492908 659 39 1541 12
C#3 mcs 1.3916 1106 14.65 488812 893 114 3268 4
Racket raco 20.3282 1209 13.40 639704 691 38 1959 11142
Go3 go 0.6998 1659 9.76 235768 656 62 1313 2799
Python2 (Pypy) - 0.0000 2122 7.63 1183156 561 38 1192 0
Ruby - 0.0000 4258 3.80 2233108 605 56 1250 0
Lua - 0.0000 4600 3.52 2202356 554 52 1208 0
Smalltalk (GNU) - 0.0000 5295 3.06 611084 712 76 1641 0
JuliaBigNum - 0.0000 7726 2.10 1001056 565 52 1227 0
Python3 - 0.0000 10212 1.59 1582084 607 39 1268 0

C really shines here, courtesy of libGMP and stack allocation, but Go drops well behind, presumably due to its bignum library being slower and generating more garbage. Java performs relatively well, which I imagine is due either to its garbage collector being better or to its bignum implementation generating little garbage. Interestingly, SBCL beats Go, F# and C# (twice as fast as them at NumTrades=150) in spite of being a dynamic language and being compiled without any type annotations (they didn’t seem to make it any faster), which just goes to show that dynamically typed languages needn’t be slower than statically typed ones in all cases. Racket also appears to beat them slightly at NumTrades=150. If anybody knows a way to stop the SBCL implementation from consing I’d be interested to hear it, as garbage collection accounts for over 50% of the running time, so avoiding it would bring it close to Java in speed. I also made an SBCL implementation that just calls libGMP via the ffi, which results in no garbage collection occurring (which means aref isn’t creating garbage, so object pooling _is_ possible in Common Lisp, yay!), yet it’s about twice as slow. It’s in the repo as Lisp3.lisp if anyone wants to fix it (it’s inspired by the Lisp pidigits code from the Alioth language benchmarks).

Note that D, Rust or Nimrod could also perform as well as C here, simply by calling out to libGMP. As could Go, for that matter, but I thought it made more sense for the benchmark to use each language’s native bignum implementation.

If you want to run the benchmarks yourself, it’s as simple as:

 git clone https://github.com/logicchains/ArrayAccessBench
 go run Benchmarker.go

In the Benchmarker.go source, the output and input files can be changed; as input, BenchmarkData.dat produces the first set of tables, BenchmarkData2.dat the second, and BenchmarkData3.dat the third. The numTrades values to test are stored near the top of the file in an array of strings, and the time to pause between runs can be set via the WaitTime constant.
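For illustration only, the knobs described above might look something like the following near the top of Benchmarker.go; apart from WaitTime and the BenchmarkData file names, every identifier here is a guess rather than the repo’s actual code:

package main

import (
    "fmt"
    "time"
)

// Hypothetical reconstruction of the tunables described above; only
// WaitTime and the input file names appear in the post.
var numTradesToTest = []string{"10", "50", "100", "150"}

const (
    inputFile = "BenchmarkData.dat" // or BenchmarkData2.dat / BenchmarkData3.dat
    WaitTime  = 2 * time.Second     // pause between runs (duration guessed)
)

func main() {
    fmt.Println("input:", inputFile, "numTrades:", numTradesToTest, "pause:", WaitTime)
}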

Hope you enjoyed the benchmark! The moral of this post: new considered harmful. The tl;dr: Lisp and Racket beat Go at bignums.


“I’m from Microsoft, and you’re blocking our servers.”

Last night I received a call to my landline, a heavily Indian-accented voice stating “I am from Microsoft. Your computer is sending many errors and blocking our servers. Today the amount of errors has intensified, so we decided to take the initiative and contact you directly”. I was intrigued; I run Linux Mint, and rarely boot into Windows (I certainly hadn’t that day). I told him this, and he helpfully corrected my English; “it’s Lih-nux, not Lie-nux”, and then told me I must have a Windows running somewhere. Maybe I had a rare Linux trojan that was somehow booting up the Windows partition in the background and using it to DDOS Microsoft by sending repeated error reports?

I dutifully restarted and booted into Windows. I asked him how he got my landline, and he replied that I had ‘registered’ it with Microsoft when setting up the computer. How prescient of me, considering that I had installed Windows well before moving to this house and getting the current number. I asked him what my name was, and he responded with the name attached to the number in the phonebook, one not used anywhere in my Windows registration; maybe he was just trying to be polite?

By this time Windows had finished booting. He told me to hit win+r (the run hotkey), and type eventvwr into the run dialogue. I followed his instructions, clicking first on Custom Views and then on Administrative Events, and there were over 13,000 ‘errors’ logged there! I told him this, and he was quite shocked too, stating that this was a new ‘record’ for them. Which explains why the ‘Microsoft Servers’, despite being able to handle error reports coming in from over a billion computers running Windows, are unable to handle a few thousand error reports coming from a single PC over what looks like the course of weeks (I didn’t raise the possibility to him that these were internal errors, and not error reports sent to Microsoft, as I didn’t wish to hurt his feelings).

He told me to open up a command prompt and spelled out a command for me to type, assoc, and I dutifully obeyed. This spilled out a list of what appeared to be registry entries. He told me to take a look at the third last and pay attention to the line containing CLSID, something like “ZFSendToTarget=CLSID{SomeStringOfDigitsAndLetters}”. He then explained patiently how only my computer and Microsoft have access to that number, and proceeded to read his copy out to me, which was identical to mine, and assured me that this proved he was genuinely from Microsoft.

Figuring I was clearly hooked, he then moved on to the crux of his brilliant plan. He instructed me to type a URL into the run prompt, “freepcupdate.com”. I opened the website in a secure browser, to be greeted with a poorly formatted page with ‘MicroSoft’ as the heading and a crudely drawn arrow further down pointing to a large green ‘update’ button. He told me to click the button, and, figuring I had wasted enough of his time and completed my scambaiting duty, I promptly hung up the phone. He called again, and I politely told him to stop calling, and hung up. He called again, and told me in a threatening voice that “We are disconnecting your computer!”. Disconnected from what? He called yet again, and I remembered that, for reasons that now escape me, I happen to have a large singing bowl lying about the house. It’s basically a bowl-shaped gong, like a smaller version of the image at the bottom of this post. I fetched the bowl, answered the phone, and then placed the handset inside the bowl. I then proceeded to strike the bowl repeatedly until it was resonating a note (somewhat ironically, it was tuned to F#) loud enough to be just on the threshold of causing pain. I left the handset in there until the sound had died down, then hung up the phone. Needless to say, he didn’t call again.

http://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Rin_gong_at_Kiyomizu-dera,_Kyoto.JPG/300px-Rin_gong_at_Kiyomizu-dera,_Kyoto.JPG
Singing Bowl (a bit larger than mine), courtesy of Wikipedia.


Benchmarks Round Two: Parallel Go, Rust, D, Scala and Nimrod.

I’m working on a modular random level generator, the essence of which uses similar logic to that of a roguelike. Last month I benchmarked the effectiveness of various programming languages at a simplified level generation algorithm that roughly mimics the demands of my actual algorithm, using single-threaded code. I’m now running the same benchmark with concurrent code, to examine how easy the languages tested are to parallelise for this task.

The code is available here. Any improvements to the code are most welcome. Most of the running time is spent on the following (shown in Haskell* as it’s easy to read):

roomHitRoom Room {rPos=(x,y), rw=w, rh=h} Room {rPos=(x2, y2), rw=w2, rh=h2}
| (x2 + w2 +1 ) < x || x2 > (x+w+1 ) = False
| (y2 + h2 +1 ) < y || y2 > (y+h+1 ) = False
| otherwise = True

This checks a newly generated room against a list of previous rooms to see if they collide, discarding the new room if it does (it’s a brute-force level generation technique; our actual engine is a bit more sophisticated, but still relies on the same principle). Much of the rest of the time is spent on random number generation, with all languages using a similar PRNG algorithm to ensure fair comparison. Note that the program is embarrassingly parallel, a trait I aim for in algorithm selection, so how easy the implementations are to parallelise for this simple task is not necessarily representative of how effective the languages are at more complex parallel tasks. I.e., these benchmarks are relevant to my goal, not necessarily yours. I enclose the generalisations made later in this post within the context of that statement.
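The same check reads roughly as follows in Go (a sketch; the repo’s Go version may name things differently):

package main

import "fmt"

// Room is an axis-aligned rectangle: (X, Y) is one corner, W and H the extents.
type Room struct{ X, Y, W, H int }

// roomHitRoom mirrors the Haskell guards above: two rooms collide unless
// one lies entirely past the other, with one tile of padding so rooms
// can't sit flush against each other.
func roomHitRoom(a, b Room) bool {
    if b.X+b.W+1 < a.X || b.X > a.X+a.W+1 {
        return false
    }
    if b.Y+b.H+1 < a.Y || b.Y > a.Y+a.H+1 {
        return false
    }
    return true
}

func main() {
    fmt.Println(roomHitRoom(Room{0, 0, 5, 5}, Room{3, 3, 5, 5}))   // true: overlap
    fmt.Println(roomHitRoom(Room{0, 0, 5, 5}, Room{20, 20, 5, 5})) // false: far apart
}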

Edit: A couple of months after the benchmark was originally run, somebody submitted a D implementation that takes advantage of LLVM optimisation particularly well and is significantly faster than the other entries (also quite concise, using the D parallel foreach). It’s included at the top of the table. An impressive achievement, although it’s certainly possible that someone quite familiar with optimising C or C++ could produce an equally fast result.

The results are as follows:

Language Compiler Speed (s) % Fastest Resident Mem Use (KiB)
D ldc2 0.812 116.38% 26,536
C++ clang++ 0.945 100.00% 25,552
D****** ldc2 0.955 98.95% 26,536
Neat***** fcc 0.958 98.64% 26,762
Nimrod clang 0.980 96.43% 25,932
C++ *** g++ 1.025 92.20% 25,532
Rust rustc 1.109 85.21% 47,708
Go 6g 1.184 79.81% 30,768
C clang 1.199 78.82% 25,796
Scala scala 1.228 76.95% 72,960
Nimrod gcc 1.376 68.68% 26,120
C**** gcc 1.467 64.42% 25,800
D dmd 2.103 44.94% 26,508
Go gccgo 2.710 34.87% 69,120

Language SLOC SLOC to Parallelise
Nimrod 109 -*
Neat 117 6
Rust 123 20**
Scala 99 20
Go 131 0
C++ 142 15
D 83 -24**
C 172 32

*Haskell was excluded from this version of the benchmark as there seems to be a space leak of some sort in the algorithm that neither I nor anyone who’s examined it so far has been able to overcome. Nimrod was added to the benchmark instead; since it has no single-threaded version to compare against, it has no ‘SLOC to Parallelise’ measure. Nimrod is a language with whitespace-based syntax, like Python, but which compiles to C for optimal speed.

**I parallelised the D and Rust programs myself, so the code is probably not idiomatic and still has plenty of room for improvement. D, for instance, has a parallel for loop; I couldn’t get it working, but if someone did it would significantly reduce the size of the code. Edit: the D version has now been made more idiomatic, and uses the parallel foreach.

***Somebody submitted a C++ version that runs almost twice as fast (in around 0.550 s with GCC), using an occlusion buffer for collision testing between rooms. I’m not including it in the benchmark numbers as the algorithm is different, but anyone who’s interested can view it here.

****It turns out the reason the C version runs slower than the C++ one is that the PRNG seeds for each thread are all stored in one array, so the hardware threads contend for the same cache line (false sharing), slowing the program down. Having each thread use a local copy of its original seed from the array brings the speed up to that of the C++ implementation.

*****A Redditor submitted a version in a language they’re working on called Neat, currently built on top of LLVM and inspired by D; the compiler is here. I was impressed by how a new language can take advantage of LLVM like that to achieve the same level of performance as much more mature languages.

******This is the D version from the time of the benchmarks, not the faster more recent submission.

Speed:

Generally, the relative speeds for the concurrent benchmark were the same as those for the single-threaded one, with LLVM D and C++ running fastest, along with the new entry, Nimrod. I was surprised, however, that C was slower than C++ and D; I imagined this might have been due to my naive C implementation not giving the compiler sufficient hints, or something related to aliasing. (Edit: the actual reason C is slower than C++ is described at **** above.) C is definitely capable of reaching those speeds: the Nimrod compiler compiles to C, and impressively fast C at that. The gap between gccgo’s speed for this problem and 6g’s is surprising; it just goes to show that gccgo isn’t always the best choice for speed. Note that the GCC C implementation was slower than the LLVM one because GCC missed an optimisation turning a jump in the GenRands function into a conditional move, leaving the GCC C version with more branch misses.

Memory Use:

Memory use was mostly as expected. I was impressed by how Go and particularly D didn’t use much more memory than the C and C++ versions in spite of being garbage collected, but Rust’s memory use was surprisingly high. This is likely just due to the immaturity of the language, as there’s no reason it should need so much memory for such a task. Scala used an expectedly large amount for a JVM language.

Concision:

I was quite impressed by Nimrod’s concision. Presumably it has such a low SLOC count because it uses whitespace rather than braces, like Python, or because it uses a very concise parallel for loop courtesy of OpenMP:

for i in 0 || <NumThreads:
        makeLevs(i.int32)

Note however that Nimrod was written completely idiomatically. Go, Scala and C++ are similar, but for the D and Rust implementations only the single-threaded portion was written idiomatically (correction: the Rust one was, but the D one was written hackishly for speed); the parallelisation of that code was done naively by myself. Note also that the Scala version’s ‘SLOC to parallelise’ measure includes optimisations made for speed; it was possible to parallelise the algorithm much more simply, but this had inferior speed characteristics.

Subjective experience:

Working on the C implementation I had one of those rare moments where it feels like the language is laughing at me. I was encountering weird bugs in my implementation of the for loop:

for (i=0;i<NumThreads;i++){
      pthread_create(&threads[i], NULL, MakeLevs, (void *)&i);
      …
}

Can you spot the problem in that code? I was passing ‘i’ into the newly created thread by reference, to represent the thread number (pthread_create only takes a void* pointer as an argument, normally to a struct, but in this case a single integer was all that was needed). What was happening was that when a thread tried to read ‘i’, even if it was the first thing that thread did, ‘i’ had often already been incremented by the for loop in the original thread, so the thread would get the wrong number; there might be two threads with thread number 2, and they’d both do the exact same calculations, writing to the same part of the global array and corrupting the output. I fixed this using the rather cumbersome method of filling an array with the numbers 1 to NumThreads and passing a pointer to a value from there; it could have been done more concisely by just casting ‘i’ to void* and then back to an integer, but I feel it’s bad practice to cast values to pointers. It’s potentially unportable: if ‘i’ were a large 64bit integer, converting it to void* and back would work on a 64bit system but not on a 32bit one, as pointers on the latter are only 32bit, and a large 64bit integer can’t be stored in a 32bit pointer (although this problem would be unlikely to surface here unless one somehow had a machine with over 2^32 cores…).
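Incidentally, goroutines capturing the loop variable reproduce the same bug in Go; the idiomatic cure is the same as in C, copying the counter per thread, here by passing it as an argument (a sketch, not code from the repo):

package main

import (
    "fmt"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    for i := 0; i < 4; i++ {
        wg.Add(1)
        // Capturing i directly in the closure would race with the
        // loop's increments; passing it as an argument copies it.
        go func(threadNum int) {
            defer wg.Done()
            fmt.Println("thread number", threadNum)
        }(i)
    }
    wg.Wait()
}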

The D implementation also surprised me, although I think this was largely because I was unfamiliar with D’s memory model. I originally tried having each thread write to part of a global array, like in the C implementation, but, after they had completed their task and the time came to read the data, the array was empty, possibly having been garbage collected. The solution was to mark the array as shared, which required propagating the chain of ‘shared’ typing to the objects being written to the array in order to ensure safety. This was an interesting difference from Rust’s memory safety model, where the type of pointers rather than objects determines how they can be shared, but I’m not familiar enough with D’s memory model to comment on their relative effectiveness. I liked the use of messages in D, which allow the sending of data between threads using Thread IDs for addressing, rather than channels, and I imagine this would be particularly useful for applications running on multiple processes across a network once messaging between processes is supported (currently it’s only supported between in-process threads). Note that D offers support for multiple methods of parallelising code in its standard library, so the method I used may not necessarily be the neatest or most idiomatic.

(Edit: It turns out that all memory in D is thread local unless specified otherwise, which seems like an effective way of ensuring memory safety.)

The Rust implementation proved to be relatively straightforward, apart from some slight difficulties I had with the syntax. Declaring a shared channel (one that can have multiple senders) required:

    let (port, chan) = stream();

    let chan = comm::SharedChan::new(chan); 

I imagine this process will be more concise by the time version 1.0 is reached. Using ‘.clone()’ for explicit capture of a variable into a closure also took a bit of getting used to, but it makes sense in light of Rust’s memory model (no direct sharing of memory between tasks). I think there may be a more concise way to parallelise parts of the problem in Rust, using something like (from the Rust docs): 

let result = ports.iter().fold(0, |accum, port| accum + port.recv() );

I wasn’t, however, familiar enough with the language and its current iterator syntax to implement it myself. Using a future might also be more concise. Go felt the easiest to parallelise, although to a degree this was because it didn’t enforce the same kind of memory safety as Rust or D. This made it more enjoyable for a project like this, but I imagine on a much larger project the stricter nature of Rust and D’s memory models might come in useful, especially if one was working with programmers of dubious competence who couldn’t be trusted to write memory-safe code. I didn’t write the Scala, Nimrod or C++ versions, so I have no comments on the experience of writing them.
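For contrast, the overall shape of the Go parallelisation, as a minimal fan-out/fan-in sketch rather than the repo’s actual code (the seeding rule is the one described in the notes below; makeLevs here is a stub):

package main

import "fmt"

// makeLevs stands in for the per-task level generation; here it just
// echoes its seed so the plumbing stays visible.
func makeLevs(seed uint32, results chan<- uint32) {
    results <- seed // real code would generate levels from this seed
}

func main() {
    const numTasks = 4
    baseSeed := uint32(18)
    results := make(chan uint32, numTasks)
    for n := 1; n <= numTasks; n++ {
        // Each task's seed is the base seed times the square of its
        // task number, per the seeding rule described below.
        go makeLevs(baseSeed*uint32(n*n), results)
    }
    for i := 0; i < numTasks; i++ {
        fmt.Println(<-results)
    }
}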

Notes for optimising:

To anyone wanting to optimise the code, note that it must produce consistent results for any given seed, but needn’t produce the same results as the other implementations. This is because different seeding behaviours will produce different results: the seed for each thread/task is calculated from the original seed multiplied by the square of that thread/task’s number, meaning a different number of threads may produce a different result. Currently D, C, C++ and Rust will all produce the same result for any given seed, but different results for a given number of cores. Go, on the other hand, will produce the same result for any particular seed no matter how many cores are used. Also note that optimisations that change the fundamental logic, the number of calls to GenRands, or the number of checks done are not allowed. Finally, remember to change the NumCores variable or its equivalent to however many cores your machine has.

Timing is done using bash’s ‘time’ function, as I’ve found it to be the simplest accurate method for timing threaded applications. The fastest result of 20 trials is taken, with all trials using the same seed. Resident memory use is obtained from running “command time -f ‘max resident:\t%M KiB’ filename seed” in Bash.

The moral of the story:

For a task like this, compiler choice has as significant an effect on speed as language choice.

Benchmarking level generation: Go, Rust, Haskell and D (and now Scala).

I’m working on a random level generator for a game that, although written in C++, is modular such that the level generator can be written in something more high-level. As C++ isn’t always the most fun and productive language to program in, I set out to benchmark some potential alternatives, using a simple Roguelike level generation benchmark that relies heavily on iteration and conditionals and hence roughly mimics the actual generator. The code used is available here: https://github.com/logicchains/levgen-benchmarks. Any suggestions for improving the code used would be most welcome. The majority of the running time is spent on this (using the Haskell version as I think it’s the easiest to read):

roomHitRoom Room {rPos=(x,y), rw=w, rh=h} Room {rPos=(x2, y2), rw=w2, rh=h2}
| (x2 + w2 +1 ) < x || x2 > (x+w+1 ) = False
| (y2 + h2 +1 ) < y  || y2 > (y+h+1 ) = False
| otherwise = True

This checks a newly generated room against a list of previous rooms to see if they collide, discarding the new room if it does (it’s a brute-force level generation technique; the actual engine is a bit more sophisticated, but still relies on the same principle). Much of the rest of the time is spent on random number generation, so take these benchmarks with a grain of salt: to a degree they’re as much a benchmark of the respective languages’ random number generators as of their general speed (i.e. these benchmarks are relevant to my goal, not necessarily yours; I enclose the generalisations made later in this post within the context of that statement).

All implementations now use the XorShift PRNG, with Rust, Go, C and D using the exact same code (a 32bit xorshift function), Scala using a 128-bit xorshift function, and Haskell using a 128-bit xorshift library. It is hence now less unreasonable to make comparisons between the languages, compilers and implementations.
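For reference, a 32-bit xorshift step looks roughly like this in Go (this is Marsaglia’s 13/17/5 variant; I haven’t checked that the repo uses exactly these shift constants):

package main

import "fmt"

// xorshift32 advances the PRNG state with three shift-and-xor steps:
// cheap, decent-quality randomness, but nowhere near cryptographic.
func xorshift32(state uint32) uint32 {
    state ^= state << 13
    state ^= state >> 17
    state ^= state << 5
    return state
}

func main() {
    seed := uint32(2463534242) // any non-zero seed works
    for i := 0; i < 3; i++ {
        seed = xorshift32(seed)
        fmt.Println(seed)
    }
}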

The results are as follows:

Compiler Speed (s) % Fastest SLOC
LDC 0.256 100% 107
Clang 0.279 92% 140
FCC* 0.283 90% 111
Rustc 0.303 85% 103
6g 0.323 79% 131
G++ 0.330 78% 127
Scala 0.344 74% 79
GCC 0.347 74% 140
LLVM-GHC 0.428 60% 78
GHC 0.546 47% 78
DMD 0.567 45% 136
GCCGO 0.598 43% 131

*A Redditor submitted a version in a language they’re working on called Neat, currently built on top of LLVM and inspired by D; the compiler is here: https://github.com/FeepingCreature/fcc. I was impressed by how a new language can take advantage of LLVM like that to achieve the same level of performance as much more mature languages.

**Edit: Improvements to the D code have made it the fastest implementation. Most of the speedup came from rearranging the checkColl function to allow for better LLVM optimisation, but I’m not personally familiar enough with the language or LLVM to explain the optimisations in question.

The LLVM version used by LDC, Rust and Clang is 3.2. GCC and GCCGo used GCC version 4.7.3, while GDC used GCC version 4.6.4. Rust was version “0.8-pre (9da42dc 2013-07-17 04:16:42 -0700)”, GHC was version 7.6.2, DMD was 2.036 and 6g (Go) was 1.1.1. They were all run with -O3 where available, --opt-level=3 for Rust, -release for DMD, and -funbox-strict-fields for Haskell. D now also runs with -inline and -noboundscheck.

D was the fastest non-C language tested, with Rust following close behind. It’s worth noting that for this task at least both D and Rust “beat C” at the same code, when comparing their results to the GCC C executable’s. The LLVM appears to do a much better job of optimising this kind of PRNG than GCC, showing how sometimes compiler effectiveness at optimising a given piece of code can be a bigger factor than language choice in determining speed. I will be excited to run these benchmarks again in a year’s time and see how Clang, LLVM D and LLVM Rust compare then.

Scala really surprised me; I didn’t realise a JVM language could run so fast for this kind of code. It even matched the speed of GCC C, giving the lie to the claim that languages running on a VM must necessarily be slower. Many thanks to markehammons for submitting the Scala code! Although I haven’t yet learned Scala, the code is surprisingly easy to read, and isn’t full of ugly optimisation code like I’d have expected a fast JVM program to be. Definitely a language worth considering even for speed-critical tasks.

Rust’s speed was impressive, for such a new language. Its flexibility with memory (optional GC) made it interesting to write in, however Rust’s flexibility (at least currently) comes at a slight price, as its syntax takes a bit of getting used to; to pass a heap-allocated vector by reference I need to use myFunc(& mut myVector) (even though it’s already a mutable vector), and the function receiving it needs myAlias:&mut~[myVector] in its type signature, like fn myFunc(myAlias:&mut ~[myVector]) {..}. Compared even to C ( void myFunc(struct MyType myAlias[arrayLength]){..} ), the Rust version looks a bit byzantine. For those who haven’t been following Rust, there are around seven different kinds of pointers: @ (garbage collected heap allocation), ~ (uniquely owned heap allocation), & (borrowed pointer to heap/stack allocation), * (raw C pointer, only useable in unsafe code), and mutable versions of the first three of those (actually, I’m not sure if there’s a mut & pointer, so maybe just six pointers in total). Note also that a pointer to a mutable value behaves differently than a pointer to a non-mutable value.

**Quoting kibwen on YCombinator:
“We’re moving the garbage-collected pointers (@) out into a library, so by default we’ll have only a single pointer, ~ (and there’s no mutable version of this). We’ll still have & and &mut, but those aren’t pointers, they’re references (it’s our own fault for calling them “borrowed pointers” in our documentation). So depending on how you count, that’s either one or three pointer types, which I’m happy with. The “unsafe” pointers exist only for C interop and have no bearing on normal code, so it’s inaccurate to count them.” So there’s actually really only three relevant kinds of pointer.

I also benchmarked a version of the Rust code using stack-allocated vectors, but there was no noticeable speed difference. I did however learn that stack-allocating vectors in Rust is currently rather verbose: since uninitialised values aren’t allowed, I had to create an instance of the object I wanted an array of with all its values set to zero, and fill the vector with that upon creation. Hopefully in future Rust will adopt something like Go’s automatic initialisation of all values to zero, or at least have this as an option (or document it better, if it already is an option). Currently it looks like this:

**Apparently it is already possible, as described here: https://news.ycombinator.com/item?id=6094819. Hopefully it’ll be documented in the Rust tutorial next time it’s updated.**

let emptyr = Room {X:0,Y:0,W:0,H:0,N:0};
let emptyt = Tile{X:0,Y:0,T:0};
let emptyl = Lev{TS : [emptyt,..2500] , RS : [emptyr,..100] };
let mut ls : [Lev, ..100] = [emptyl,..100];

Which is a lot of unnecessary code, and would quickly become cumbersome in a large project that made heavy use of stack-allocated arrays (although there doesn’t seem to be a reason for doing so, as in this test at least they performed no faster than heap-allocated ones).

Go was impressive, performing well for a relatively new language albeit still falling behind D. The default PRNG was quite lacking in speed, so changing it to use XorShift resulted in a massive speedup. Interestingly, while GCCGo was faster with the default PRNG, the standard 6g runtime was faster with the bitshifted XorShift PRNG. The freedom from semicolons is a nice experience, and in terms of my subjective experience it was the most enjoyable language to write imperative code in. I’m quite looking forward to the development of LLVM Go.

Although it took me quite a while to optimise, I’m happy with Haskell’s performance, especially since it has to carry a vector of [initially] ten million random Ints around (random numbers can’t just be generated in the middle of functions, as that breaks purity). Writing recursively was a nice change, and the naive Haskell version was the most concise implementation of the languages tested (it grew significantly when I switched from lists to vectors and to the faster random number generator, as I couldn’t get pattern matching to work on vectors so had to stick [unsafe]Head and Tail everywhere).

**Updated the Haskell version to use the MWC generator instead of Mersenne, and updated again to use more idiomatic code, resulting in a speedup to 0.770s. Updated yet again to use XorShift, getting the running time down to 0.546s. Running with -fllvm speeds it up to 0.428s.

Any more improvements to the Haskell code would be most welcome. Stay tuned for parallel level generation benchmarks! (Eventually..)

Finally, the moral of the story: there’s no such thing as a slow language, only an insufficiently optimising compiler.

Second moral of the story: if cryptographic-quality randomness isn’t needed, XorShift is about the fastest PRNG algorithm going.
