Benchmarking level generation: Go, Rust, Haskell and D (and now Scala).

I’m working on a random level generator for a game that, although written in C++, is modular such that the level generator can be written in something more high-level. As C++ isn’t always the most fun and productive language to program in, I set out to benchmark some potential alternatives, using a simple Roguelike level generation benchmark that relies heavily on iteration and conditionals and hence roughly mimics the actual generator. The code used is available here: https://github.com/logicchains/levgen-benchmarks. Any suggestions for improving the code used would be most welcome. The majority of the running time is spent on this (using the Haskell version as I think it’s the easiest to read):

roomHitRoom Room {rPos=(x,y), rw=w, rh=h} Room {rPos=(x2, y2), rw=w2, rh=h2}
| (x2 + w2 +1 ) < x || x2 > (x+w+1 ) = False
| (y2 + h2 +1 ) < y  || y2 > (y+h+1 ) = False
| otherwise = True

Checking a newly generated room against a list of previous rooms to see if they collide or not, discarding it if it does (it’s a brute-force level generation technique; the actual engine is a bit more sophisticated, but still relies on the same principle). Much of the rest of the time is spent on random number generation, so take these benchmarks with a grain of salt, as to a degree they’re as much a benchmark of the respective languages’ random number generators as they are of their general speed (i.e. these benchmarks are relevant to my goal, not necessarily yours. I enclose the generalisations made later in this post within the context of that statement).

All implementations now use the XorShift PRNG, with Rust, Go, C and D using the exact same code (a 32bit xorshift function), Scala using a 128-bit xorshift function, and Haskell using a 128-bit xorshift library. It is hence now less unreasonable to make comparisons between the languages, compilers and implementations.

The results are as follows:

Compiler Speed(s)  %Fastest  SLOC 
LDC 0.256 100% 107
Clang 0.279 92% 140
FCC* 0.283 90% 111
Rustc 0.303 85% 103
6g 0.323 79% 131
G++ 0.330 78% 127
Scala 0.344 74% 79
GCC 0.347 74% 140
LLVM-GHC 0.428 60% 78
GHC 0.546 47% 78
DMD 0.567 45% 136
GCCGO 0.598 43% 131

*A Redditor submitted a version in a language they’re working on called Neat, currently built on top of the LLVM and inspired by D; the compiler is here https://github.com/FeepingCreature/fcc. I was impressed by how a new language can take advantage of the LLVM like that to achieve the same level of performance as much maturer languages.

**Edit: Improvements to the D code make it now the fastest implementation. Most of the speedup came from rearranging the checkColl function to allow for better LLVM optimisation, but I’m not personally familiar enough with the language or the LLVM to explain the optimisations in question.  

The LLVM version used by LDC, Rust and Clang is 3.2. GCC and GCCGo used GCC version 4.7.3, while GDC used GCC version 4.6.4. Rust was version “0.8-pre (9da42dc 2013-07-17 04:16:42 -0700)”, GHC was version 7.6.2, DMD was 2.036 and 6g (Go) was 1.1.1. They were all run with 03 where available, –opt-level=3 for Rust, -release for DMD, and -funbox-strict-fields for Haskell. D now also runs with -inline and -noboundscheck.

D was the fastest non-C language tested, with Rust following close behind. It’s worth noting that for this task at least both D and Rust “beat C” at the same code, when comparing their results to the GCC C executable’s. The LLVM appears to do a much better job of optimising this kind of PRNG than GCC, showing how sometimes compiler effectiveness at optimising a given piece of code can be a bigger factor than language choice in determining speed. I will be excited to run these benchmarks again in a year’s time and see how Clang, LLVM D and LLVM Rust compare then.

Scala really surprised me; I didn’t realise a JVM language could run so fast for this kind of code. It even matched the speed of GCC C, putting a lie to the statement that languages running on a VM must necessarily be slower. Many thanks to markehammons for submitting the Scala code! Although I haven’t yet learned Scala, the code is surprisingly easy to read, and isn’t full of ugly optimisation code like I’d have expected a fast JVM program to be. Definitely a language worth considering even for speed-critical tasks.

Rust’s speed was impressive, for such a new language. Its flexibility with memory (optional GC) made it interesting to write in, however Rust’s flexibility (at least currently) comes at a slight price, as its syntax takes a bit of getting used to; to pass a heap-allocated vector by reference I need to use myFunc(& mut myVector) (even though it’s already a mutable vector), and the function receiving it needs myAlias:&mut~[myVector] in its type signature, like fn myFunc(myAlias:&mut ~[myVector]) {..}. Compared even to C ( void myFunc(struct MyType myAlias[arrayLength]){..} ), the Rust version looks a bit byzantine. For those who haven’t been following Rust, there are around seven different kinds of pointers: @ (garbage collected heap allocation), ~ (uniquely owned heap allocation), & (borrowed pointer to heap/stack allocation), * (raw C pointer, only useable in unsafe code), and mutable versions of the first three of those (actually, I’m not sure if there’s a mut & pointer, so maybe just six pointers in total). Note also that a pointer to a mutable value behaves differently than a pointer to a non-mutable value.

**Quoting kibwen on YCombinator:
“We’re moving the garbage-collected pointers (@) out into a library, so by default we’ll have only a single pointer, ~ (and there’s no mutable version of this). We’ll still have & and &mut, but those aren’t pointers, they’re references (it’s our own fault for calling them “borrowed pointers” in our documentation). So depending on how you count, that’s either one or three pointer types, which I’m happy with. The “unsafe” pointers exist only for C interop and have no bearing on normal code, so it’s inaccurate to count them.” So there’s actually really only three relevant kinds of pointer.

I also benchmarked a version of the Rust code using stack-allocated vectors, but there was no noticeable speed difference. I did however learn that stack-allocating vectors in Rust is currently rather verbose, as since uninitialised values aren’t allowed, I had to create an instance of the object I wanted an array of that had all its values set to zero and fill the vector with that upon creation. Hopefully in future Rust will adopt something like Go’s automatic initialisation of all values to zero, or at least have this as an option (or document it better, if it already is an option). Currently it looks like this:

**Apparently it is already possible, as described here: https://news.ycombinator.com/item?id=6094819. Hopefully it’ll be documented in the Rust tutorial next time it’s updated.**

let emptyr = Room {X:0,Y:0,W:0,H:0,N:0};
let emptyt = Tile{X:0,Y:0,T:0};
let emptyl = Lev{TS : [emptyt,..2500] , RS : [emptyr,..100] };
let mut ls : [Lev, ..100] = [emptyl,..100];

Which is a lot of unnecessary code, and would quickly become cumbersome in a large project that made heavy use of stack-allocated arrays (although there doesn’t seem to be a reason for doing so, as in this test at least they performed no faster than heap-allocated ones).

Go was impressive, performing well for a relatively new language albeit still falling behind D. The default PRNG was quite lacking in speed, so changing it to use XorShift resulted in a massive speedup. Interestingly, while GCCGo was faster with the default PRNG, the standard 6g runtime was faster with the bitshifted XorShift PRNG. The freedom from semicolons is a nice experience, and in terms of my subjective experience it was the most enjoyable language to write imperative code in. I’m quite looking forward to the development of LLVM Go.

Although it took me quite a while to optimise, I’m happy with Haskell’s performance, especially since it has to carry a vector of [initially] ten million random Ints around (random numbers can’t just be generated in the middle of functions, as that breaks purity). Writing recursively was a nice change, and the naive Haskell version was the most concise implementation of the languages tested (it grew significantly when I switched from lists to vectors and to the faster random number generator, as I couldn’t get pattern matching to work on vectors so had to stick [unsafe]Head and Tail everywhere).

**Updated the Haskell version to use MWC generator instead of Mersenne, and updated again to use more idiomatic code, resulting in a speedup to 0.770s. Updated yet again to use XorShift, now got the running time down to 0.546s. Running with -fllvm speeds it up to 0.428s.

Any more improvements to the Haskell code would be most welcome. Stay tuned for parallel level generation benchmarks! (Eventually..)

Finally, the moral of the story: there’s no such thing as a slow language, only an insufficiently optimising compiler.

Second moral of the story: if cryptographic-level randomness isn’t needed, then XorShift is the best PRNG algorithm available in terms of speed.

About these ads
This entry was posted in Uncategorized and tagged , , , , , , , , , . Bookmark the permalink.

77 Responses to Benchmarking level generation: Go, Rust, Haskell and D (and now Scala).

  1. Romain says:

    Hi!

    Just some quick remarks on the Haskell code:

    The reason why you had to refactor your code in the example (the conditional) is not directly related to a lazy evaluation problem. The problem is that || evaluates its left operand before its right. Consider this:

    Prelude> True || undefined
    True
    Prelude> undefined || True
    *** Exception: Prelude.undefined
    Prelude> False || undefined
    *** Exception: Prelude.undefined

    In the first case, undefined is not evaluated, but it is in both other cases. We say that || is strict in its first argument.

    Secondly, I think it would be more idiomatic to generate random number on the fly, especially to avoid storing this vectors of ints. You can still have pure functions, it is just your main “loop” which must be “impure”. If this wouldn’t be possible, an idiomatic solution could be to use monads such as MonadRandom, which only allow the generation of random numbers.

    • logicchains says:

      Ah, thanks for that. So the Haskell || is essentially the same as the short-circuiting || operator in C, makes sense.

      If it’s idiomatic Haskell to just generate random numbers on the fly, then I could just pass around a random number generator, like in the Rust version. I’ll test it later.

      • Romain says:

        Yes, it behaves the same as a short-circuiting or in C. Except that in C, you could not write this function yourself (except with macros), as arguments to a function are always evaluated before the function is even called. Lazy evaluation allows you to have this “short-circuiting” behaviour in any function.

        To come back to the random numbers problem, you can actually get rid of your big vector of ints without changing much of your code.

        I see you make a call to randoms from the System.Random.Mersenne module. This function actually returns an infinite list of random values, which you can pass around instead of the vector. Lazy evaluation will make that this list is not generated beforehand, as your vector is, but on the fly, as elements are needed.

      • orclev says:

        When you’re doing something inherently stateful, such as generating random numbers, it isn’t considered bad form to use a less pure function. Instead of passing the entire vector when you only need 2 numbers, pass those 2 numbers, and make the function that calls that function stateful by using the ST monad. Most of the more useful parallelism libraries in Haskell also internally use a StateT monad transformer so you can use a random number generator in those easily as well.

        The key thing is to avoid using IO as much as you can as that’s really where most of the “bad” stuff in terms of impure code lives. ST or State is just a convenience to keep from having to pass around state arguments to all your functions.

        You could also go the other direction and manually pass around the RNG, most of the random number libraries make this possible by returning not only a current value, but a next seed value when you use them, so you could just pass around the state of the RNG manually rather than using the ST or State monads.

      • logicchains says:

        Somebody attempted a version using the State Monad, but for some reason it ran a lot slower. I’ll attempt passing the PRNG around randomly when I have time, although I would like to figure out why the current version fails with four random numbers per room.

    • logicchains says:

      I replaced the unboxed vector of Ints with an infinite list and cut the running time down to 1.75s from 1.9s! In future I’ll know better than to think an unboxed vector is necessarily faster than a list just because it’s unboxed, haha.

  2. logicchains says:

    Would an infinite list behave better than an unboxed vector? I was originally using an infinite list, but changed to an unboxed vector to avoid a space leak while I was optimising. The code now is much cleaner than it was at the time I switched though, so I’ll try using an infinite list again and see what effect it has on performance.

    • Romain says:

      The thing is that an infinite list won’t consume nearly as much space as elements are generated on the fly. If you use the head and only keep the tail of the list in your functions, then the elements consumed can be garbage collected. The unevaluated tail will just take constant space.

  3. Typo “GCCGo and DMD only being secondary implementations”, I bet you meant GDC.

    In my experience performance of the rust is very sensitive to the amount of hand-tuned inlining. Afaik rust doesn’t do automatic inlining like C does and rust uses segmented stacks, so function calls aren’t very cheap. Take a look at benchmark I was partly an author of: https://github.com/mozilla/rust/blob/master/src/test/bench/noise.rs. All these inlines were very crucial to the performance. Oh, if you’re interested in my perlin noise benchmarks you can find C, D, Go, Python, Tcl and Rust versions here: https://gist.github.com/nsf/1170424, however it’s recommended to use the rust version from their repo as it has these inlines applied.

    • logicchains says:

      Ah, yes I did mean GDC, thanks. Corrected that now.

      I’ll try the Rust version with inlining. Any guidance on how to choose between [inline(always)] and [inline], or should I just benchmark each and find which works best? Also, is it idiomatic to use for int::range(min, max) |i| {..} like in the benchmark you mention or to use a ‘while i< max {..i+=1}' loop like in the Rust Tutorial?

      • I’m not a rust guru myself, so I have no idea what’s the different between inline(always) and inline, but I have a pretty good guess that using inline(always) will force it as much as possible, while the other may or may not. Same about idiomatic code, I have no idea, ask rust guys. This perlin noise benchmark was the first and the last program I wrote in rust. :D

  4. Matt K says:

    The default Go random number generator is wrapped with a big mutex which basically makes it useless for multithreaded (goroutine) code. I wrote a short replacement for it that only works on Linux as it relies on the kernel’s read() implementation being atomic for short reads, and relies on the kernel’s /dev/urandom driver; needless to say it is much much quicker.

  5. foobar says:

    I’d say it’s pretty evident from these results that if you need a critical speed advantage, use C or C++. The compilers are excellent and their lint like tools are improving constantly. I’d pick Rust or C any day over D or Go or Haskell.

    If you pick Haskell, you can’t find cheap developers anywhere for your product. Go is beginning to look like vaporware. Rust is still too immature but will probably beat D once it’s finalized.

    • No. says:

      Yeah Go is vaporware, only Google and a large number of other high profile companies use it (e.g. cloudflare)

    • orclev says:

      You assume someone wants to find cheap developers. You usually get what you pay for. I’d happily hire a Erlang, OCaml, or Haskell dev for pretty much anything as odds are good they’re going to know what they’re doing, a PHP dev, not so much. Not to say that a PHP dev can’t be good, just that there are an awful lot of bad PHP devs, so you end up spending a ton of time screening out the bad ones to find the good ones.

      There’s other considerations than just execution time involved as well, there’s metrics like development time, size of the code base, and how common bugs are (or how easy it is to introduce different categories of bugs). Sure in the end if all you really care about is the absolute fastest speed you can get, a naive implementation in C is probably going to beat a naive implementation in pretty much anything else (except maybe assembly), but if done properly most languages can achieve something approaching C speed (or sometimes exceeding it).

  6. Jesse McNelis says:

    The reason why dealing with a pointer to a slice in Go is awkward is because it’s not idiomatic. It’s much more common to pass a slice to a function and have the function return the slice if it modified it’s length.
    Awkward code in Go is generally a sign that you’re trying to do the wrong thing.

  7. Huon Wilson says:

    Since everyone’s jumping on the optimisation band-wagon, I’ll do it too, I adapted the Rust code to use a faster RNG (XorShift, in the standard library too), and use iterators & vector-builders, which took Rust down to equal with GCC (which is faster than clang on my computer):

    Rust       0.36
    Rust isaac 0.57
    Rust orig  0.70
    Clang      0.42
    GCC        0.36
    

    The code: https://github.com/huonw/levgen-benchmarks/blob/master/R.rs

  8. Quon says:

    It just test random number generator and compiler opt, not the language itself.

    • logicchains says:

      The Rust and LDC versions both use the same random number generator, and the highest compiler optimisation settings available, but the LDC version is notably faster.

  9. Dave says:

    I am not sure if this was implemented as you did not discuss it. However I would give the same seed to the PRNG’s, though results could vary using different PRNG’s.

    • logicchains says:

      I tried each one with the seeds 10,20..100 and took the fastest speed. Generally the seed had little significant affect on the timing, so I didn’t mention it.

  10. pikachu says:

    There are a few things you’re doing bad with Go. The most important this is with the range statement with causes a Room struct to be copied.

    Try this:
    func CheckColl(x, y, w, h int, rs []Room) bool {
    var r *Room
    for i := range rs {
    r = &rs[i]
    if ((r.X + r.W + 1) (x + w + 1)) {
    continue
    }
    if ((r.Y + r.H + 1) (y + h + 1)) {
    continue
    }
    return true
    }
    return false
    }

  11. Maxwell McAdams says:

    The Haskell implementation seems odd to me in a few ways, and I think there are large performance gains to be had with some more idiomatic code.

    Manipulating big vectors is the least idiomatic Haskell possible. The usual way to refactor a giant data structure sitting in memory would be to stream it lazily through a thread-safe queue. Okasaki’s cons-list queues are popular and you can probably find a fast implementation on hackage, but even a lazy ByteStream would be better than that gargantuan vector. Same goes for bounds-checking after the fact, but I don’t see a simple way to rearrange that.

    Also, I’m confused by your trepidation towards generating PRNs on-the-fly. Yes, you break referential transparency, but the “Random” in the type signature is enough to protect your code purity in spirit. Idiomatic Haskell is about encapsulating side-effects, not contorting your code to hide them. Streaming random numbers into memory as they are consumed would keep all of the dirty impurity hidden from the numeric functions anyways.

    Also, the bounds-checking can be done with unsafe casts into the Bits type for fast mass bitwise operations. If you were unboxing your numericals anyways the overhead won’t be significant. This shatters purity, but not referential transparency, so its probably okay.

    • logicchains says:

      I don’t really have enough experience with Haskell right now to write good idiomatic code, but any improvements submitted to the source are most welcome. PRNG speed is the biggest factor in the overall speed right now, so for Haskell to compete with the other implementations it would really need an XorShift PRNG, which I’m not sure how to implement.

  12. iopq says:

    The table doesn’t line up for me, I can’t tell which number corresponds to which language/compiler.

    • logicchains says:

      Sorry. I’ll add a separator character in between the numbers/compilers to make it clearer when the formatting fails.

      • p says:

        Why not use an actual table for the table? Is it a WordPress limitation?

        If it allows you to inject JavaScript you could do something like:

        function tablify(p, paragraphCount) {
            var table = document.createElement("table");
            table.id = "real-table";
            table.style.borderCollapse = "collapse";
            table.style.color = "black";
            var discard = [];
            for (var i=0; i  p:nth-child(6)"), 3);
        
      • logicchains says:

        It didn’t allow Javascript injection, but turns out a HTML table works fine. Updated now to use a real table.

  13. John Smith says:

    There is another catch with go. With 6g, int means 64-bit. The “equivalent” C code uses 32-bit integers, and is obviously faster.

  14. Joe Wakeling says:

    I think you are still not comparing like for like in the RNG stakes. D’s default Xorshift is the 128-bit version, i.e. the internal state consists of four uint32_t words. The C and Go versions use a single uint32_t for the RNG state, i.e. they are 32-bit Xorshift.

    Try replacing Xorshift in the D code with Xorshift32 and see how that affects things.

    Oh, and — be careful with Xorshift implementations that directly copy the paper by Marsaglia. It has typos :-)

    • Joe Wakeling says:

      Sorry, just spotted that D.d uses the same handwritten 32-bit Xorshift as C and Go versions. How does the speed of that compare with the 128-bit library version used in D-Xorshift.d?

      • logicchains says:

        It’s much faster, at least with LDC. The LDC-compiled version using the 128-bit library ran in around 0.40s, whereas the current LDC-compiled version using the 32-bit Xorshift runs in 0.29s.

  15. chai2010 says:

    How about disable bounds checking in Go (-B optional in 6g)?

  16. Pingback: 系统级编程语言性能PK _ 业界资讯 _ HTML5工作室

  17. Dave says:

    Data profiling in Haskell shows an awful lot of ints staying in memory from main.rands.

    Doesn’t look like you’re getting lazy generation of this list. You’re likely better off writing this as concurrent code with a Random generator channel.

    Do this

    ghc -O2 -prof -auto-all -caf-all –make H.hs -rtsopts=all -fforce-recomp

    ./H 3 +RTS -hy -p >/dev/null

    hp2ps -c H.hp

    open H.ps (if you’re on a mac, maybe gnome-open on ubuntu, or just use a PS viewer)

    • logicchains says:

      It currently uses an unboxed vector of 10 million ints, which is slower than a lazy list of vectors. The only reason for this is that I’m unsure how to get an infinite list from the XorShift package; if you can change the following code:

      gen <- newXorshift
      let rands = U.unfoldrN 10000000 (Just . next) gen

      so that rands is an infinite list, then that should speed it up a bit (U refers there to Data.Vector.Unboxed).

      I'm not looking to use any concurrency yet so as to allow fair comparison between the different implementations, but within a month or so I'll be implementing concurrent versions in each language, so I'll try the prng channel then if I can't get the lazy list of rands working.

      • Dave says:

        You can still write it concurrently and use 1 actual thread. Just a thought. It may not buy you very much unless you compile with -threaded, which would allow for some overlap between threads.

        A lazy list of random numbers from Xorshift is

        myRandList :: Xorshift -> ([Int], Xorshift)
        myRandList g =
        let (i, g’) = next g
        (rest, g”) = myRandList g’
        in ((i : rest), g”)

        I think..

        That’s what I basically stuck in a State Monad like so…

        type MyRandomGen = State Xorshift

        genRandom :: MyRandomGen Int
        genRandom = do
        gen V.Vector Room -> MyRandomGen (V.Vector Room)
        genRooms 0 done = return done
        genRooms !n rsDone = do
        [r1,r2] <- sequence $ replicate 2 genRandom
        let x = r1 `rem` levDim
        y = r2 `rem` levDim
        w = r1 `rem` maxWid + minWid
        h = r2 `rem` maxWid + minWid
        tr = Room {rPos=(x,y), rw= w, rh= h}
        if (checkBound tr) || (checkColl tr rsDone)
        then genRooms (n-1) rsDone
        else genRooms (n-1) (V.cons tr rsDone)

        By the way, the performance goes straight to hell if you generate 4 random numbers in either H.hs or this version. I don't know why! Using 2 as H.hs does works well enough where the y and h are paired up with x and w….

        I will post the full code.

  18. Dave says:

    I wrote a version that uses a State monad wrapper around the random number generator and runs every generation function in that monad. With -O2 it does not perform as well on my machine as the giant Vector one. Using Control.Monad.State.Strict made it a little faster, but not quite as fast.

    According to the heap profiles i’ve generated though, the state monad generator used 1/50th of the memory of H.hs

    H.hs uses 157 MB of memory while the state monad version uses 3 MB.

    I could post my version somewhere if you want it.

    • logicchains says:

      Sure, if you post a copy to Pastebin I’ll test it and see how it compares to H.hs on my machine. When the code used the Mersenne twister prng, changing from an unboxed vector to a lazy infinite list sped it up by over 10%, so I’m surprised the same doesn’t apply to the XorShift prng. If the slowdown with the state monad generator saves that much memory then it would probably perform better when the number of rooms/iterations needed is scaled up, too, and would work better with parallel code.

  19. Dave says:

    State Monad version here:
    http://pastebin.com/gamMWuyQ

    • logicchains says:

      That was surprisingly slow, less than half the speed of the unboxed vector one. I’ll profile it later to see why. I think the easiest way to speed it up would be to use the same XorShift function as the other implementations use:

      int GenRand(uint32_t *gen) {
      *gen += *gen;
      *gen ^= 1;
      int32_t tgen=*gen;
      if ( tgen < 0) {
      *gen ^= 0x88888eef;
      }
      int a = *gen;
      return a;
      }
      rather than the XorShift library. It might also be easier to get an infinite list out of it if you code the XorShift yourself.

  20. dave says:

    Yes, I was surprised how slow it was as well. I’m wondering if this is something for the ST Monad. When I profiled it it just looked like a ton of data being created and garbage collected over time. This makes me think we’d rather modify the same memory location a lot than create/destroy a ton of stuff.

  21. Dave says:

    I have another version that does away with the Vectors, and the strictness, still using the State Monad for the random number generation and it’s a little bit faster than the one with Vectors and bang patterns on my machine.

    http://pastebin.com/G1MMuka1

    5,810,043,592 bytes allocated in the heap
    7,354,800 bytes copied during GC
    126,296 bytes maximum residency (3 sample(s))
    30,576 bytes maximum slop
    2 MB total memory in use (0 MB lost due to fragmentation)

    Tot time (elapsed) Avg pause Max pause
    Gen 0 11229 colls, 0 par 0.14s 0.15s 0.0000s 0.0001s
    Gen 1 3 colls, 0 par 0.00s 0.00s 0.0002s 0.0003s

    INIT time 0.00s ( 0.00s elapsed)
    MUT time 3.42s ( 3.46s elapsed)
    GC time 0.14s ( 0.15s elapsed)
    RP time 0.00s ( 0.00s elapsed)
    PROF time 0.00s ( 0.00s elapsed)
    EXIT time 0.00s ( 0.00s elapsed)
    Total time 3.57s ( 3.61s elapsed)

    %GC time 3.9% (4.2% elapsed)

    Alloc rate 1,696,550,332 bytes per MUT second

    Productivity 96.1% of total user, 95.0% of total elapsed

    real 0m3.614s
    user 0m3.567s
    sys 0m0.045s

    • logicchains says:

      Your new version is a bit faster, but still doesn’t match the unboxed vector one. I’m not fully up to speed with the state monad so I can’t think of any way to improve it. I did try writing a new genRand function using Data.Bits, but it was incredibly slow, less than a third of the speed of the state monad one.

      I’ve also found it doesn’t work well with 4 rands (20 million ints total); maybe it’s an addressing problem? 20 million ints (which are 64 bit by default on my program) is 160 million bytes, and maybe GHC has trouble properly addressing that much memory (entirely speculation, but I can’t think why else performance would decrease exponentially like that).

  22. Dave says:

    If we could just get rid of those MUT objects… I’m not sure of their origin.

    • logicchains says:

      Would it be possible to implement the prng directly using the Data.Bits package? http://hackage.haskell.org/packages/archive/base/latest/doc/html/Data-Bits.html It seems the functions there are pure, so it might be possible to make a pure XorShift function and bypass the state monad entirely. Something like (I don’t have time to test this right now, and slightly pseudocodish):

      GenRand :: UInt -> UInt
      GenRand seed = if signedval < 0 then xored2 else xored
      where
      shifted = shiftL seed 1
      xored = xor shifted 1
      signedval :: Int = fromIntegral xored
      xored2 = xor xored 0x88888eef

      The value returned is both the random number generated and the seed for the next one.

  23. yxtiger says:

    Go-lang code below may obtain 10% speed up. I’m using “go1.1.1 windows/386″.
    (And it’s intresting that, if assignment of “x_m1,xw_p1,y_m1,yh_p1″ split to 4 lines, or combine to 1 line, there will be 1% speed lose. So there’s still more optimization to be done in Go compiler)

    func CheckColl(x, y, w, h uint32, rs []Room) bool {
    var r *Room
    x_m1, xw_p1 := x-1, x+w+1
    y_m1, yh_p1 := y-1, y+h+1
    for i := range rs {
    r = &rs[i]
    switch {
    case r.X > xw_p1:
    continue
    case (r.X + r.W) yh_p1:
    continue
    case (r.Y + r.H) < y_m1:
    continue
    }
    return true
    }
    return false
    }

  24. yxtiger says:

    There will be 4 “case” sentences above, but some vanished during post…
    Try again:
    switch{
    case r.X > xw_p1:
    continue
    case (r.X + r.W) yh_p1:
    continue
    case (r.Y + r.H) < y_m1:
    continue
    }

  25. yxtiger says:

    Golang code below leads 5% speed up:
    func GenRand(gen *uint32) uint32 {
    seed := (*gen <> 31) * 0x88888eef
    *gen = seed
    return seed
    }
    ————————
    Even more, reduce function parameter may gain 1-2% speed up:
    (need the caller r1/r2 modify accordingly)

    var genseed uint32 = 18
    func GenRand() uint32 {
    seed := (genseed <> 31) * 0x88888eef
    genseed = seed
    return genseed
    }
    ————————
    ASM code may be found after compile(use “go tool 8g -S main.go”) like below, it’s obviously not optimized thoroughly. So it’s reasonable to expect more performance with later Golang version.
    0268 (main.go:136) MOVL BX,genseed+0(SB)
    0269 (main.go:136) MOVL genseed+0(SB),BX

    0272 (main.go:136) XORL BX,genseed+0(SB)
    0273 (main.go:136) MOVL genseed+0(SB),CX

  26. yxtiger says:

    seed := (genseed << 1) + 1
    seed ^= (seed >> 31) * 0x88888eef

  27. logicchains says:

    Great thanks; it runs significantly faster now.

  28. andrew Aw says:

    please show the loc every language..which was the most expresive one?..

    • logicchains says:

      Did it. Seems Haskell is the most ‘expressive’ of the implementations, followed by Scala and then Rust. Haskell is slower, but I think the Haskell code is much less optimised than the other implementations, as I haven’t found a Haskell programmer willing to improve the Haskell XorShift prng, and I’m not skilled enough at Haskell yet to do it myself.

  29. yxtiger says:

    1.
    Intel compiler should have a position, for its great optimization among CPU instructions.
    C/C++ should get much benefit from it, “(x<<1)+1" will get "lea ebx, DWORD PTR [esi+esi+1]" like, really great!

    2.
    Since many modifies made during comparison, algorithm becomes non-consistence.
    It's more faire to move "for(…;.<50000;…" loop inside "MakeRoom" function, "Rand" changed to inline, especially on the benchmark-basement.

    • logicchains says:

      I didn’t realise there was a free version of the Intel compiler available, but apparently there’s one for Linux, so I’ll use it on the upcoming concurrent version of the benchmarks. I’ll also update the functions then to bring the for 50k loop into MakeRoom.

  30. yxtiger says:

    Note on Golang: vars are initialized when declare, rather than only assign mem.
    So should pay attention to array/map’s declaration.

    Changes can be take:

    func MakeRoom(count uint32) *[]Room {
    rs := make([]Room, 100)
    —–change to—–
    var rs [100]Room
    func MakeRoom(count uint32) *[]Room {

    for i := 0; i < 100; i++ {
    rs := MakeRoom(50000)
    ts := make([]Tile, 2500)
    for ii := uint32(0); ii < 2500; ii++ {
    ts[ii] = Tile{X: ii % TileDim, Y: ii / TileDim, T: 0}
    —–change to—–
    ts := make([]Tile, 2500)
    for ii := uint32(0); ii < 2500; ii++ {
    ts[ii] = Tile{ii % TileDim, ii / TileDim, 0}
    }
    for i := 0; i < 100; i++ {
    rs := MakeRoom(50000)
    for ii := uint32(0); ii < 2500; ii++ {
    ts[ii].T = 0

    for _, r := range *rs {
    Room2Tiles(&r, &ts)
    —–change to(because duplication in range)—–
    for j := range *rs {
    Room2Tiles(&((*rs)[j]), &ts)

  31. Dave says:

    genRooms is taking the most time on the Haskell versions I’ve been looking at.

    • logicchains says:

      That it does, but it’s still possible to reduce the overall running time a bit by using a faster PRNG. I can’t think of any way to speed up genRooms that doesn’t involve changing the prng (or better, using a lazy list of random numbers, but I haven’t found an effective way to do that with the current prng).

      • Dave says:

        I suppose if we wanted a faster PRNG, we could implement it as transformations done in the ST monad. It definitely feels a little like cheating, but it’s something Haskell programmers reach for on occasion for certain update-in-place situations. Given that we’re generating a LOT of data that seems to just get GC’d… it could be useful.

        I, however, have never used ST for this kind of stuff, or much at all for that matter.

  32. Dave says:

    Actually, I take it back. I think ST is not going to help much as the way it seems to work is to take a regular old haskell value from which you create an STRef, then do an update, and return a new Haskell value. Unless the intermediate shifts in the PRNG cause a lot of allocations and such, it may be of little value.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s