Benchmarks Round Two: Parallel Go, Rust, D, Scala and Nimrod.

I’m working on a modular random level generator, the essence of which uses similar logic to that of a roguelike. Last month I benchmarked the effectiveness of various programming languages at a simplified level generation algorithm that roughly mimics the demands of my actual algorithm, using single-threaded code. I’m now running the same benchmark with concurrent code, to examine how easy the languages tested are to parallelise for this task.

The code is available here. Any improvements to the code are most welcome. Most of the running time is spent on (writing in Haskell* as it’s easy to read):

roomHitRoom Room {rPos=(x,y), rw=w, rh=h} Room {rPos=(x2, y2), rw=w2, rh=h2}
    | (x2 + w2 + 1) < x || x2 > (x + w + 1) = False
    | (y2 + h2 + 1) < y || y2 > (y + h + 1) = False
    | otherwise = True

This checks a newly generated room against the list of previous rooms to see whether it collides with any of them, discarding it if it does (it’s a brute-force level generation technique; our actual engine is a bit more sophisticated, but still relies on the same principle). Much of the rest of the time is spent on random number generation, with all languages using a similar PRNG algorithm to ensure a fair comparison. Note that the program is embarrassingly parallel, a trait I aim for in algorithm selection, and so how easy the implementations are to parallelise for this simple task is not necessarily representative of how effective the languages are at more complex parallel tasks. I.e., these benchmarks are relevant to my goal, not necessarily yours. The generalisations made later in this post should be read within the context of that statement.
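
As a rough illustration of that check-and-discard step, here is a minimal C sketch. It is not the benchmark's code: the Room layout, the MAX_ROOMS cap and the names room_hit_room and try_add_room are assumptions made for the example; only the comparison logic is taken from the Haskell above.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ROOMS 100  /* illustrative cap, assumed for this sketch */

typedef struct {
    int x, y;  /* position (rPos in the Haskell) */
    int w, h;  /* width and height (rw, rh) */
} Room;

/* Mirrors roomHitRoom: the rooms are clear of each other only if one lies
   entirely past the other, with a one-tile margin, on the x or the y axis. */
static bool room_hit_room(const Room *a, const Room *b) {
    if (b->x + b->w + 1 < a->x || b->x > a->x + a->w + 1) return false;
    if (b->y + b->h + 1 < a->y || b->y > a->y + a->h + 1) return false;
    return true;
}

/* Brute-force placement: keep the candidate only if it hits no earlier room.
   Returns true when the candidate was appended to rooms[0..*count). */
static bool try_add_room(Room *rooms, int *count, Room candidate) {
    assert(*count >= 0 && *count <= MAX_ROOMS);
    if (*count == MAX_ROOMS) return false;  /* level is already full */
    for (int i = 0; i < *count; i++) {
        if (room_hit_room(&candidate, &rooms[i])) return false;  /* discard */
    }
    rooms[(*count)++] = candidate;
    return true;
}
```

A caller would generate a random candidate, pass it to try_add_room, and repeat until the level is full or the attempt budget runs out. The inner loop is linear in the number of rooms already placed, which is why this check dominates the running time.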

Edit: A couple of months after the benchmark was originally run, somebody submitted a D implementation that takes advantage of LLVM optimisation particularly well and is significantly faster than the other entries (also quite concise, using the D parallel foreach). It’s included at the top of the table. An impressive achievement, although it’s certainly possible that someone quite familiar with optimising C or C++ could produce an equally fast result.

The results are as follows:

Language    Compiler   Speed (s)   % Fastest   Resident Mem Use (KiB)
D           ldc2       0.812       116.38%     26,536
C++         clang++    0.945       100.00%     25,552
D******     ldc2       0.955       98.95%      26,536
Neat*****   fcc        0.958       98.64%      26,762
Nimrod      clang      0.980       96.43%      25,932
C++ ***     g++        1.025       92.20%      25,532
Rust        rustc      1.109       85.21%      47,708
Go          6g         1.184       79.81%      30,768
C           clang      1.199       78.82%      25,796
Scala       scala      1.228       76.95%      72,960
Nimrod      gcc        1.376       68.68%      26,120
C****       gcc        1.467       64.42%      25,800
D           dmd        2.103       44.94%      26,508
Go          gccgo      2.710       34.87%      69,120

Language    SLOC   SLOC to Parallelise
Nimrod      109    -*
Neat        117    6
Rust        123    20**
Scala       99     20
Go          131    0
C++         142    15
D           83     -24**
C           172    32

*Haskell was excluded from this version of the benchmark as there seems to be a space leak of some sort in the algorithm that neither I nor anyone who’s examined it so far has been able to overcome. Nimrod was added to the benchmark instead; since it has no single-threaded version to compare against, it has no ‘SLOC to Parallelise’ measure. Nimrod is a language with a whitespace-based syntax, like Python, but which compiles to C for optimal speed.

**I parallelised the D and Rust programs myself, hence the code is probably not idiomatic, and still has plenty of room for improvement. D for instance has a parallel for loop; I couldn’t get it working, but if someone got it working then it would significantly reduce the size of the code. Edit: the D version has now been made more idiomatic, and uses the parallel foreach.

***Somebody submitted a C++ version that runs twice as fast (in around 0.550 s with g++), using an occlusion buffer for collision testing between rooms. I’m not including it in the benchmark numbers as the algorithm is different, but anyone who’s interested can view it here.

****It turns out the C version runs slower than the C++ one because the PRNG seeds for each thread are all stored together in one array, forcing the hardware threads to compete for access to them and slowing the program down. Having each thread work on its own copy of its seed from the array brings the speed up to that of the C++ implementation.
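
A minimal C sketch of the two access patterns this footnote contrasts. The contention described is commonly called false sharing, where adjacent array elements sit on one cache line. The xorshift32 generator is a stand-in, not necessarily the PRNG the benchmark uses, and all function names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in PRNG (Marsaglia's xorshift32); the benchmark's generator may differ. */
static uint32_t next_rand(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Slow pattern: every draw reads and writes the thread's slot in the shared
   seeds array, so neighbouring slots can bounce between cores. */
static uint32_t draw_via_shared_slot(uint32_t *seeds, int thread_id, int n) {
    assert(seeds[thread_id] != 0);  /* xorshift needs a non-zero state */
    uint32_t acc = 0;
    for (int i = 0; i < n; i++) acc ^= next_rand(&seeds[thread_id]);
    return acc;
}

/* Faster pattern: copy the seed once and advance only the private copy. */
static uint32_t draw_via_local_copy(const uint32_t *seeds, int thread_id, int n) {
    assert(seeds[thread_id] != 0);
    uint32_t local = seeds[thread_id];
    uint32_t acc = 0;
    for (int i = 0; i < n; i++) acc ^= next_rand(&local);
    return acc;
}
```

Both functions produce the same numbers for the same starting seed; only where the state lives differs. This shows the shape of the change rather than measuring it, since a single-threaded call cannot exhibit the contention.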

*****A Redditor submitted a version in a language they’re working on called Neat, currently built on top of LLVM and inspired by D; the compiler is here. I was impressed by how a new language can take advantage of LLVM like that to achieve the same level of performance as much more mature languages.

******This is the D version from the time of the benchmarks, not the faster more recent submission.

Speed:

Generally, the relative speeds for the concurrent benchmark were the same as those for the single-threaded one, with LLVM-compiled D and C++ running fastest, along with the new entry, Nimrod. I was surprised, however, by how C was slower than C++ and D; I imagine this may have been due to my naive C implementation not giving the compiler sufficient hints, or something related to aliasing. (Edit: The reason C is slower than C++ is described at **** above.) C is definitely capable of reaching those speeds: the Nimrod compiler compiles to C, and impressively fast C at that. The gap between gccgo’s speed for this problem and 6g’s is surprising; it just goes to show that gccgo isn’t always the best choice for speed. Note that the gcc C implementation was slower than the LLVM one because gcc missed an optimisation that turns a jump in the GenRands function into a conditional move, resulting in the gcc build encountering more branch misses.

Memory Use:

Memory use was mostly as expected. I was impressed by how Go and particularly D didn’t use much more memory than the C and C++ versions in spite of being garbage collected, but Rust’s memory use was somewhat surprisingly high. This is likely just due to the immaturity of the language, as there’s no reason it should need so much memory for such a task. Scala used an expectedly large amount for a JVM language.

Concision:

I was quite impressed by Nimrod’s concision. Presumably it has such a low SLOC count because it uses whitespace rather than {}, like Python, or because it uses a very concise parallel for loop courtesy of OpenMP:

for i in 0 || <NumThreads:
        makeLevs(i.int32)

Note however that the Nimrod version was written completely idiomatically. Go, Scala and C++ are similar, but for the D and Rust implementations only the single-threaded portion was written idiomatically (correction: the Rust one was, but the D one was written hackishly for speed); the parallelisation of that code was done naively by myself. Note also that the Scala version’s ‘SLOC to parallelise’ measure includes optimisations made for speed; it was possible to parallelise the algorithm much more simply, but this had inferior speed characteristics.

Subjective experience:

Working on the C implementation I had one of those rare moments where it feels like the language is laughing at me. I was encountering weird bugs in my implementation of the for loop:

for (i=0;i<NumThreads;i++){
      pthread_create(&threads[i], NULL, MakeLevs, (void *)&i);
      …
}

Can you spot the problem in that code? I was passing ‘i’ into the newly created thread by reference, to represent the thread number (pthread_create only takes a void* pointer as an argument, normally to a struct, but in this case a single integer was all that was needed). What was happening was that when the thread tried to access ‘i’, even if that was the first thing the thread did, ‘i’ had often already been incremented by the for loop in the original thread, so the thread would have the wrong number; there might be two threads with thread number 2, and they’d both do the exact same calculations, writing to the same part of the global array and corrupting the output. I fixed this using the rather cumbersome method of filling an array with the numbers 1 to NumThreads and passing a pointer to a value from there; it could have been done more concisely by just casting ‘i’ to void* and then back to an integer, but I feel it’s bad practice to cast values to pointers. It’s potentially unportable: if ‘i’ was a large 64-bit integer, converting it to void* and back would work on a 64-bit system but not on a 32-bit one, as pointers on the latter are only 32 bits wide, and it’s impossible to store a large 64-bit integer in a 32-bit pointer (although this problem would be unlikely to surface in this case unless one somehow had a machine with over 2^32 cores).
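
The array-based fix described above might look roughly like the following C sketch. It is an illustration rather than the benchmark's code: NUM_THREADS, thread_nums, runs_per_num and the body of make_levs are all assumptions made for the example.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 4  /* illustrative; stands in for the post's NumThreads */

static int thread_nums[NUM_THREADS];   /* one stable slot per thread */
static int runs_per_num[NUM_THREADS];  /* how many threads saw each number */

/* Thread entry point: reads its number from its own slot, which the parent
   never modifies after the thread has been created. */
static void *make_levs(void *arg) {
    int num = *(const int *)arg;  /* 1..NUM_THREADS */
    runs_per_num[num - 1] += 1;   /* each thread writes a distinct element */
    return NULL;
}

static void run_all_threads(void) {
    pthread_t threads[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++) {
        thread_nums[i] = i + 1;
        /* Pass the address of this thread's slot, not the address of i. */
        int rc = pthread_create(&threads[i], NULL, make_levs, &thread_nums[i]);
        assert(rc == 0);
        (void)rc;  /* silence the unused-variable warning under NDEBUG */
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
}
```

After run_all_threads returns, every entry of runs_per_num should be exactly 1; with the original `&i` version some entries could be 0 and others 2. The slots only need to outlive the threads, so a local array would also do as long as the threads are joined before it goes out of scope.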

The D implementation also surprised me, although I think this was largely because I was unfamiliar with D’s memory model. I originally tried having each thread write to part of a global array, like in the C implementation, but, after they had completed their task and the time came to read the data, the array was empty, possibly having been garbage collected. The solution was to mark the array as shared, which required propagating the chain of ‘shared’ typing to the objects being written to the array in order to ensure safety. This was an interesting difference from Rust’s memory safety model, where the type of pointers rather than objects determines how they can be shared, but I’m not familiar enough with D’s memory model to comment on their relative effectiveness. I liked the use of messages in D, which allow the sending of data between threads using Thread IDs for addressing, rather than channels, and I imagine this would be particularly useful for applications running on multiple processes across a network once messaging between processes is supported (currently it’s only supported between in-process threads). Note that D offers support for multiple methods of parallelising code in its standard library, so the method I used may not necessarily be the neatest or most idiomatic.

(Edit: It turns out that all memory in D is thread local unless specified otherwise, which seems like an effective way of ensuring memory safety.)

The Rust implementation proved to be relatively straightforward, apart from some slight difficulties I had with the syntax. Declaring a shared channel (one that can have multiple senders) required:

    let (port, chan) = stream();

    let chan = comm::SharedChan::new(chan); 

I imagine this process will be more concise by the time version 1.0 is reached. Using ‘.clone()’ for explicit capture of a variable into a closure also took a bit of getting used to, but it makes sense in light of Rust’s memory model (no direct sharing of memory between tasks). I think there may be a more concise way to parallelise parts of the problem in Rust, using something like (from the Rust docs): 

let result = ports.iter().fold(0, |accum, port| accum + port.recv() );

I wasn’t however familiar enough with the language and current iterator syntax to implement it myself. Using a future might also be more concise.

Go felt the easiest to parallelise, although this was largely because it didn’t enforce the same kind of memory safety as Rust or D. This made it more enjoyable for a project like this, but I imagine on a much larger project the stricter nature of Rust and D’s memory models might come in useful, especially if one was working with programmers of dubious competence who couldn’t be trusted to write memory-safe code.

I didn’t write the Scala, Nimrod or C++ versions, so I have no comments on the experience of doing so.

Notes for optimising:

To anyone wanting to optimise the code, note that the code must produce consistent results for any given seed, but needn’t produce the same results as the other implementations. This is because different seeding behaviours will produce different results, as the seed for each thread/task is calculated as the original random seed multiplied by the square of that thread/task’s number. This means a different number of threads may produce a different result. Currently D, C, C++ and Rust will all produce the same result for any given seed, but will produce different results for a different number of cores. Go on the other hand will produce the same result for any particular seed no matter how many cores are used. Also note that optimisations that change the fundamental logic, the number of calls to GenRands used, or the number of checks done are not allowed. Finally, remember to change the NumCores variable or its equivalent to however many cores you have in your machine.

Timing is done using bash’s ‘time’ function, as I’ve found it to be the simplest accurate method for timing threaded applications. The fastest result of 20 trials is taken, with all trials using the same seed. Resident memory use is obtained from running “command time -f ‘max resident:\t%M KiB’ filename seed” in Bash.
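
The per-thread seeding rule from these notes can be sketched in C as below. The function name thread_seed and the 32-bit width are assumptions for illustration, as is the 1-based numbering, which is inferred from the array of "numbers 1 to NumThreads" mentioned earlier.

```c
#include <assert.h>
#include <stdint.h>

/* Per-thread seed: the base seed times the square of the thread number.
   Numbers are assumed to start at 1, since 0 would always give a zero seed.
   Unsigned arithmetic wraps modulo 2^32. */
static uint32_t thread_seed(uint32_t base_seed, uint32_t thread_num) {
    assert(thread_num >= 1);
    return base_seed * thread_num * thread_num;
}
```

With a base seed of 10, two threads would draw from streams seeded 10 and 40, while four threads would use 10, 40, 90 and 160. The levels are split across a different set of streams in each case, which is why changing the thread count changes the output.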

The moral of the story:

For a task like this, compiler choice has as significant an effect on speed as language choice.

48 Responses to Benchmarks Round Two: Parallel Go, Rust, D, Scala and Nimrod.

  1. Jon Renner says:

    What were the command args for gccgo? gccgo is usually faster than 6g. did you use -O3 or whatnot?

  2. Jason says:

    More idiomatic C than using an array would be to just cast the integer to a void * and then cast it back in your thread. It’s technically non-portable, but will work on just about any implementation of C ever.

    • logicchains says:

      I considered that; it’s mentioned towards the end of the second paragraph of the ‘Subjective experience’ section. I avoided it because if the integer was a really big 64 bit value (or a really big long long), converting to a pointer and back on a 32 bit system (with 32 bit pointers) would mangle it. I suppose that might be being a bit too cautious in this case, however.

  3. Isn’t the “referencing a changing loop variable” thing just easily cured by stack-allocating a new int inside the loop and copying the value of i into it?

    • logicchains says:

      Nope; the variable will go out of scope when that pass through the loop ends. A static variable would work, but they can’t be used like that in C.

      • I would just pass the int “by value” (don’t pass a pointer, just put the int inside the pointer). The other option is to pass the threads array as a pointer, you can then use pointer-arithmetic to determine the index of “your” thread by finding the offset to the start of the array.

      • logicchains says:

        I try to avoid treating integers as pointers and pointers as integers because of the potential bugs that it can cause, but I’m probably being too cautious for this case.

      • Ah, you’re right, I’m thinking in Go, where a stack local will be magically promoted to the heap if it escapes; my above is indeed the recommended fix in Go.

  4. qznc says:

    Your irritation with the D memory model is likely due to the fact that global variables are thread-local by default in D. Declaring them “shared” makes them truly global like in C/C++.

  5. This made me chuckle: “[...] especially if one was working with programmers of dubious competence who couldn’t be trusted to write memory-safe code.”

  6. ANdrew says:

    For the C implementation, try changing ‘void MakeRoom’ to ‘inline void MakeRoom’. Halved the time it took on my machine.

  7. sap says:

    both the D and C++ code gives me segmentation fault at runtime.
    i compiled the C++ one using -lpthread and the D one using the flags provided on the source file. both compile with 0 errors / warnings.

  8. What was the issue with your Haskell code? Could you elaborate?

    • logicchains says:

      Take the code here: https://github.com/logicchains/levgen-benchmarks/blob/master/H.hs and run it. Then, take the same code, change the genRooms function to contain:

          where
            noFit = genRooms (n-1) (restInts) rsDone
            tr = Room {rPos=(x,y), rw= w, rh= h}
            x = rem (U.unsafeHead randInts) levDim
            y = rem (U.unsafeIndex randInts 1) levDim
            restInts = U.unsafeDrop 4 randInts
            w = rem (U.unsafeIndex randInts 2) maxWid + minWid
            h = rem (U.unsafeIndex randInts 3) maxWid + minWid

      And change:

          let rands = U.unfoldrN 10000000 (Just . next) gen

      to:

          let rands = U.unfoldrN 20000000 (Just . next) gen

      The running time should double. Does it? Or does it increase by orders of magnitude? The latter is what happens to me, and to a couple of other people who have tried it.

  9. Andres says:

    It would be cool if the next round were about memory management. GC performance hasn’t been measured here, and it would be interesting to see how D, Go and Nimrod perform compared to the well-known JVM with Scala.

  10. In the ‘C’ implementation example, why are we passing the address of loop variable ‘i’? Because of call-by-reference, a newly created thread accesses ‘i’ itself the contents of which is changing in the main looping thread anyway, if I am not missing something obvious. One quick solution could be to write a function which

    - takes in an ‘int’ as a param,
    - internally ‘malloc’s an integer,
    - initializes this newly allocated integer with the param
    - returns the newly allocated integer

    and then, call this function with loop variable ‘i’ and use its return value as a parameter to pthread_create(). Shouldn’t that work?

    • logicchains says:

      That should work fine, thanks. I was just avoiding heap allocation for simplicity. I’m not sure if it would result in less lines of code used than the current version.

  11. Andreas says:

    I messed for fun with the c++ example. It is now on my system almost twice as fast.

    I replaced the analytical intersection test with an occlusion buffer that records, for every tile, whether it is used. I think there is an off-by-1 error somewhere, but it seems to work fine.

    It would be fun to redo the test for this implementation. This lowlevel rasterisation is something c/c++ shine on with pointer arithmetic and that stuff, even though i didn’t bother to try out how to SIMDfy it.

    Code: http://pastebin.com/5wtzPE1R

    • logicchains says:

      Great, thanks! I imagine it wouldn’t be so easy to implement efficiently in the other languages, although I think D could possibly just copy the C code, and Rust and Go might be able to do something similar with an unsafe block and raw pointers.

  12. Jeff says:

    I sent a pull request with a parallel Haskell implementation.

    • logicchains says:

      Thanks! Would it be possible to add a timeout, so that after 50,000 attempts at generating 100 rooms it gives up and moves onto the next level, as in the other implementations?

  13. none says:

    genGoodRooms in H.hs looks very suspicious re space leaks.

    aux accum count t = do …

    should say aux accum !count !t = …

    to avoid thunk buildups in incrementing the count and decrementing t.

    Enable the BangPatterns language extension for the compiler to recognize that idiom (otherwise you’d need seq annotations).

    • logicchains says:

      I tried that, but it didn’t significantly increase the speed (it’s still far from possible for me to generate 1000 levels with
      50000 tries per level in under one minute).

  14. mleise says:

    I’ve adapted the parallel D version with “idiomatic” parallelization. It practically just adds the word “parallel” and moves the random number generator inside the level loop. It is also around 18% faster on my computer than your current D code on Github. Here’s the code:
    https://gist.github.com/mleise/7330759

    • logicchains says:

      Great, it’s faster for me too. It seems to be generating different results than the one in the repo; is it creating NUM_LEVS threads with a different seed for each one? If so, it’s impressive that using so many threads can produce a faster result than using NUM_CORES threads.

      • mleise says:

        You are right, about the new seed (or RNG) for every level. “parallel” by default (and I didn’t change it) runs on as many threads as there are CPU cores and splits the levels array into work units (e.g. 80 chunks of 10 levels). Both thread count and work unit size can be configured globally, but the defaults were already good and I could get away with no additional lines of code.

      • logicchains says:

        That’s impressive. I wonder if it’d be possible to optimise the C++ similarly.

  15. mleise says:

    Ok, and now comes the killer implementation: ~0.100s for 800 levels! It produces the same levels as the original algorithm, but fills the whole level as if NUM_TRIES was close to infinity.

    Instead of trial and error it works deterministically. That means each step adds a new room to the level. To produce the same results, I reproduce the probabilities that a room of each size can be placed in the level. E.g. a 3×3 room might be twice as probable than a 5×4 room. Once a room size is selected using these probabilities, it is placed at a random free spot.

    These free spots are kept in 64 bitmaps (for each of the 64 possible room shapes) in the size of the level, that tells the algorithm where these free spots are. After placing a room all 64 bitmaps are updated by marking a rectangular area and then updating the probabilities for each room size accordingly.

    Code: http://dpaste.dzfl.pl/6ce1b7d9

    • logicchains says:

      That is extremely impressive! I’d be quite interested to see how it compares to the occlusion buffer/free list technique used here: https://github.com/danielhams/Levgen-Algorithm-Benchmarks. It works as follows:

      -Generate a room with random dimensions
      -Attempt to find some free space of appropriate size
      -Offset the room randomly inside the free space
      -Compute new free entries for the space left over and put them back into the free list
      -When out of free space walk the occlusion buffer looking for blocks with appropriate minimum size
      -When there are enough rooms or the compaction of the free space above doesn’t produce any free space, level generation is complete

      Note that I can’t include the numbers in the benchmark results as it’s a fundamentally different algorithm to the other implementations.

      • mleise says:

        That free list technique is a bit faster (e.g. 0.093 seconds in one of the 4 versions using the built-in timing). The runtime of the other algorithms that are included depends largely on the number of trial and error steps. The quality of the output is roughly the same as far as I can tell.
        That said, it does create different levels, as the free areas are created going from left to right, top to bottom. For every free cell, the largest possible rectangle is drawn to the bottom-right and added to the free list. This creates arbitrary borders on the level where no rooms will be placed (as far as I understood the code). I.e. after placing a few rooms you have several disjoint areas where rooms can go to and they are mostly square. So it gives up on the “completely random” property that the trial and error approach you used has naturally.

    • logicchains says:

      Ah, right. I think you’re correct, your implementation does indeed produce more random results. Would you say there’s anything in particular about D that made the algorithm easier to implement?

      • mleise says:

        D’s default to thread local storage gave me per-thread random number generators “for free” and I relied on compile-time function execution to generate the initial state of the level generator struct. So where I wrote “MyLevelGen.RoomGenerator rg;” it is just a blit of the pre-generated bit pattern at run-time for the chosen level and room sizes.

        An example: I use single bits for the occluded bitmaps (so one byte holds 8 values), but round up to machine word size. So if the bitmap would be 30 pixels wide, it is rounded up to 32 bits on a 32-bit computer. To make it easier for the algorithm I initialize these 2 extra bits to “occluded” for the full height of the bitmap at compile-time and rely on the memcpy() style default initialization of structs in D, instead of running a constructor.

    • logicchains says:

      That’s interesting, from skimming over the code I didn’t even realise it was doing any compile-time calculation (apart from optimisation, of course). Quite impressive, considering how ugly it would probably look to do the same thing with C++ templates. This is probably a stupid question, but do you think it would ever be possible to embed DMD into an executable and have it generate code on the fly based on parameters known only at runtime, like one can with Lisp macros, or would that be unfeasible?

      • mleise says:

        I don’t believe you’ll ever be able to replace existing code in a D program like you can do in Lisp. The design of the language is focused on static code and compile-time reflection. Depending on your needs you can write your code to a file, and compile a shared library with dmd from it that you load and unload as a plugin or use LuaJIT where you need the flexibility of an embedded scripting language: https://github.com/JakobOvrum/LuaD

      • logicchains says:

        Right. I’d love to see a language where it was possible to do what your D implementation does but with the room size parameters supplied at runtime, but right now that seems to only be possible with assembly or writing one’s own JIT. D’s templates are quite impressive nonetheless; I haven’t seen a language that does them better, although I haven’t tried Nimrod’s.

  16. bearophile says:

    The version by mleise with many changes and improvements, the performance should be the same, but the line count is different, so please update the D entry and its lines number:
    http://dpaste.dzfl.pl/d37ba995

  17. bearophile says:

    Better, this D version doesn’t mix leading tabs and spaces, sorry:
    http://dpaste.dzfl.pl/raw/e2bfc00f

    • logicchains says:

      I’m currently getting an error on line 58: cannot implicitly convert expression (i / 50LU) of type ulong to uint. Could you make it compatible with ldc2 of version: ‘based on DMD v2.063.2 and LLVM 3.2’?

      • bearophile says:

        The compiler version is OK, but I have compiled the code on a 32 bit system, where size_t is 32 bit, while on your 64 bit system “i / levelSize” is an ulong, and it can’t fit into an uint of the Tile.
        To solve the problem just replace the line 57 “foreach (immutable i, ref t; this.tiles)” with “foreach (immutable uint i, ref t; this.tiles)”.
        Apparently such 32/64 bit conversion problems come from a design decision of the D language.

      • logicchains says:

        Great, it works fine now. I take it this problem would be avoided if the constants were typed?

  18. bearophile says:

    The immutable index was already typed, but in a different way (size_t instead of uint). More info in the newsgroup because this comment system is not very good for complex explanations.
