Benchmarking level generation: Go, Rust, Haskell and D (and now Scala).

Posted on July 23, 2013 by logicchains

I’m working on a random level generator for a game that, although written in C++, is modular such that the level generator can be written in something more high-level. As C++ isn’t always the most fun and productive language to program in, I set out to benchmark some potential alternatives, using a simple Roguelike level generation benchmark that relies heavily on iteration and conditionals and hence roughly mimics the actual generator. The code used is available here: https://github.com/logicchains/levgen-benchmarks. Any suggestions for improving the code used would be most welcome. The majority of the running time is spent on this (using the Haskell version as I think it’s the easiest to read):

roomHitRoom Room {rPos=(x,y), rw=w, rh=h} Room {rPos=(x2, y2), rw=w2, rh=h2}
| (x2 + w2 +1 ) < x || x2 > (x+w+1 ) = False
| (y2 + h2 +1 ) < y || y2 > (y+h+1 ) = False
| otherwise = True

Checking a newly generated room against a list of previous rooms to see if they collide or not, discarding it if it does (it’s a brute-force level generation technique; the actual engine is a bit more sophisticated, but still relies on the same principle). Much of the rest of the time is spent on random number generation, so take these benchmarks with a grain of salt, as to a degree they’re as much a benchmark of the respective languages’ random number generators as they are of their general speed (i.e. these benchmarks are relevant to my goal, not necessarily yours. I enclose the generalisations made later in this post within the context of that statement).

All implementations now use the XorShift PRNG, with Rust, Go, C and D using the exact same code (a 32bit xorshift function), Scala using a 128-bit xorshift function, and Haskell using a 128-bit xorshift library. It is hence now less unreasonable to make comparisons between the languages, compilers and implementations.

The results are as follows:

Compiler	Speed(s)	%Fastest	SLOC
LDC	0.256	100%	107
Clang	0.279	92%	140
FCC*	0.283	90%	111
Rustc	0.303	85%	103
6g	0.323	79%	131
G++	0.330	78%	127
Scala	0.344	74%	79
GCC	0.347	74%	140
LLVM-GHC	0.428	60%	78
GHC	0.546	47%	78
DMD	0.567	45%	136
GCCGO	0.598	43%	131

*A Redditor submitted a version in a language they’re working on called Neat, currently built on top of the LLVM and inspired by D; the compiler is here https://github.com/FeepingCreature/fcc. I was impressed by how a new language can take advantage of the LLVM like that to achieve the same level of performance as much maturer languages.

**Edit: Improvements to the D code make it now the fastest implementation. Most of the speedup came from rearranging the checkColl function to allow for better LLVM optimisation, but I’m not personally familiar enough with the language or the LLVM to explain the optimisations in question.

The LLVM version used by LDC, Rust and Clang is 3.2. GCC and GCCGo used GCC version 4.7.3, while GDC used GCC version 4.6.4. Rust was version “0.8-pre (9da42dc 2013-07-17 04:16:42 -0700)”, GHC was version 7.6.2, DMD was 2.036 and 6g (Go) was 1.1.1. They were all run with 03 where available, –opt-level=3 for Rust, -release for DMD, and -funbox-strict-fields for Haskell. D now also runs with -inline and -noboundscheck.

D was the fastest non-C language tested, with Rust following close behind. It’s worth noting that for this task at least both D and Rust “beat C” at the same code, when comparing their results to the GCC C executable’s. The LLVM appears to do a much better job of optimising this kind of PRNG than GCC, showing how sometimes compiler effectiveness at optimising a given piece of code can be a bigger factor than language choice in determining speed. I will be excited to run these benchmarks again in a year’s time and see how Clang, LLVM D and LLVM Rust compare then.

Scala really surprised me; I didn’t realise a JVM language could run so fast for this kind of code. It even matched the speed of GCC C, putting a lie to the statement that languages running on a VM must necessarily be slower. Many thanks to markehammons for submitting the Scala code! Although I haven’t yet learned Scala, the code is surprisingly easy to read, and isn’t full of ugly optimisation code like I’d have expected a fast JVM program to be. Definitely a language worth considering even for speed-critical tasks.

Rust’s speed was impressive, for such a new language. Its flexibility with memory (optional GC) made it interesting to write in, however Rust’s flexibility (at least currently) comes at a slight price, as its syntax takes a bit of getting used to; to pass a heap-allocated vector by reference I need to use myFunc(& mut myVector) (even though it’s already a mutable vector), and the function receiving it needs myAlias:&mut~[myVector] in its type signature, like fn myFunc(myAlias:&mut ~[myVector]) {..}. Compared even to C ( void myFunc(struct MyType myAlias[arrayLength]){..} ), the Rust version looks a bit byzantine. For those who haven’t been following Rust, there are around seven different kinds of pointers: @ (garbage collected heap allocation), ~ (uniquely owned heap allocation), & (borrowed pointer to heap/stack allocation), * (raw C pointer, only useable in unsafe code), and mutable versions of the first three of those (actually, I’m not sure if there’s a mut & pointer, so maybe just six pointers in total). Note also that a pointer to a mutable value behaves differently than a pointer to a non-mutable value.

**Quoting kibwen on YCombinator:
“We’re moving the garbage-collected pointers (@) out into a library, so by default we’ll have only a single pointer, ~ (and there’s no mutable version of this). We’ll still have & and &mut, but those aren’t pointers, they’re references (it’s our own fault for calling them “borrowed pointers” in our documentation). So depending on how you count, that’s either one or three pointer types, which I’m happy with. The “unsafe” pointers exist only for C interop and have no bearing on normal code, so it’s inaccurate to count them.” So there’s actually really only three relevant kinds of pointer.

I also benchmarked a version of the Rust code using stack-allocated vectors, but there was no noticeable speed difference. I did however learn that stack-allocating vectors in Rust is currently rather verbose, as since uninitialised values aren’t allowed, I had to create an instance of the object I wanted an array of that had all its values set to zero and fill the vector with that upon creation. Hopefully in future Rust will adopt something like Go’s automatic initialisation of all values to zero, or at least have this as an option (or document it better, if it already is an option). Currently it looks like this:

**Apparently it is already possible, as described here: https://news.ycombinator.com/item?id=6094819. Hopefully it’ll be documented in the Rust tutorial next time it’s updated.**

let emptyr = Room {X:0,Y:0,W:0,H:0,N:0};
let emptyt = Tile{X:0,Y:0,T:0};
let emptyl = Lev{TS : [emptyt,..2500] , RS : [emptyr,..100] };
let mut ls : [Lev, ..100] = [emptyl,..100];

Which is a lot of unnecessary code, and would quickly become cumbersome in a large project that made heavy use of stack-allocated arrays (although there doesn’t seem to be a reason for doing so, as in this test at least they performed no faster than heap-allocated ones).

Go was impressive, performing well for a relatively new language albeit still falling behind D. The default PRNG was quite lacking in speed, so changing it to use XorShift resulted in a massive speedup. Interestingly, while GCCGo was faster with the default PRNG, the standard 6g runtime was faster with the bitshifted XorShift PRNG. The freedom from semicolons is a nice experience, and in terms of my subjective experience it was the most enjoyable language to write imperative code in. I’m quite looking forward to the development of LLVM Go.

Although it took me quite a while to optimise, I’m happy with Haskell’s performance, especially since it has to carry a vector of [initially] ten million random Ints around (random numbers can’t just be generated in the middle of functions, as that breaks purity). Writing recursively was a nice change, and the naive Haskell version was the most concise implementation of the languages tested (it grew significantly when I switched from lists to vectors and to the faster random number generator, as I couldn’t get pattern matching to work on vectors so had to stick [unsafe]Head and Tail everywhere).

**Updated the Haskell version to use MWC generator instead of Mersenne, and updated again to use more idiomatic code, resulting in a speedup to 0.770s. Updated yet again to use XorShift, now got the running time down to 0.546s. Running with -fllvm speeds it up to 0.428s.

Any more improvements to the Haskell code would be most welcome. Stay tuned for parallel level generation benchmarks! (Eventually..)

Finally, the moral of the story: there’s no such thing as a slow language, only an insufficiently optimising compiler.

Second moral of the story: if cryptographic-level randomness isn’t needed, then XorShift is the best PRNG algorithm available in terms of speed.

This entry was posted in Uncategorized and tagged Benchmark, C++, D, Go, Haskell, LLVM, Programming, Random Level Generation, Random number generation, Rust. Bookmark the permalink.

77 Responses to Benchmarking level generation: Go, Rust, Haskell and D (and now Scala).

Romain says:

July 24, 2013 at 12:38 pm

Hi!

Just some quick remarks on the Haskell code:

The reason why you had to refactor your code in the example (the conditional) is not directly related to a lazy evaluation problem. The problem is that || evaluates its left operand before its right. Consider this:

Prelude> True || undefined
True
Prelude> undefined || True
*** Exception: Prelude.undefined
Prelude> False || undefined
*** Exception: Prelude.undefined

In the first case, undefined is not evaluated, but it is in both other cases. We say that || is strict in its first argument.

Secondly, I think it would be more idiomatic to generate random number on the fly, especially to avoid storing this vectors of ints. You can still have pure functions, it is just your main “loop” which must be “impure”. If this wouldn’t be possible, an idiomatic solution could be to use monads such as MonadRandom, which only allow the generation of random numbers.

Reply
- logicchains says:
  
  July 24, 2013 at 12:54 pm
  
  Ah, thanks for that. So the Haskell || is essentially the same as the short-circuiting || operator in C, makes sense.
  
  If it’s idiomatic Haskell to just generate random numbers on the fly, then I could just pass around a random number generator, like in the Rust version. I’ll test it later.
  
  Reply
  - Romain says:
    
    July 24, 2013 at 1:04 pm
    
    Yes, it behaves the same as a short-circuiting or in C. Except that in C, you could not write this function yourself (except with macros), as arguments to a function are always evaluated before the function is even called. Lazy evaluation allows you to have this “short-circuiting” behaviour in any function.
    
    To come back to the random numbers problem, you can actually get rid of your big vector of ints without changing much of your code.
    
    I see you make a call to randoms from the System.Random.Mersenne module. This function actually returns an infinite list of random values, which you can pass around instead of the vector. Lazy evaluation will make that this list is not generated beforehand, as your vector is, but on the fly, as elements are needed.
  - orclev says:
    
    August 24, 2013 at 4:34 am
    
    When you’re doing something inherently stateful, such as generating random numbers, it isn’t considered bad form to use a less pure function. Instead of passing the entire vector when you only need 2 numbers, pass those 2 numbers, and make the function that calls that function stateful by using the ST monad. Most of the more useful parallelism libraries in Haskell also internally use a StateT monad transformer so you can use a random number generator in those easily as well.
    
    The key thing is to avoid using IO as much as you can as that’s really where most of the “bad” stuff in terms of impure code lives. ST or State is just a convenience to keep from having to pass around state arguments to all your functions.
    
    You could also go the other direction and manually pass around the RNG, most of the random number libraries make this possible by returning not only a current value, but a next seed value when you use them, so you could just pass around the state of the RNG manually rather than using the ST or State monads.
  - logicchains says:
    
    August 24, 2013 at 2:28 pm
    
    Somebody attempted a version using the State Monad, but for some reason it ran a lot slower. I’ll attempt passing the PRNG around randomly when I have time, although I would like to figure out why the current version fails with four random numbers per room.
- logicchains says:
  
  July 24, 2013 at 2:04 pm
  
  I replaced the unboxed vector of Ints with an infinite list and cut the running time down to 1.75s from 1.9s! In future I’ll know better than to think an unboxed vector is necessarily faster than a list just because it’s unboxed, haha.
  
  Reply
logicchains says:

July 24, 2013 at 1:15 pm

Would an infinite list behave better than an unboxed vector? I was originally using an infinite list, but changed to an unboxed vector to avoid a space leak while I was optimising. The code now is much cleaner than it was at the time I switched though, so I’ll try using an infinite list again and see what effect it has on performance.

Reply
- Romain says:
  
  July 24, 2013 at 2:17 pm
  
  The thing is that an infinite list won’t consume nearly as much space as elements are generated on the fly. If you use the head and only keep the tail of the list in your functions, then the elements consumed can be garbage collected. The unevaluated tail will just take constant space.
  
  Reply
  - logicchains says:
    
    July 24, 2013 at 3:37 pm
    
    It worked, saving me 0.15s. Do you see anything else there that would work better as a list than a vector?
nsf (@nosmileface) says:

July 24, 2013 at 1:23 pm

Typo “GCCGo and DMD only being secondary implementations”, I bet you meant GDC.

In my experience performance of the rust is very sensitive to the amount of hand-tuned inlining. Afaik rust doesn’t do automatic inlining like C does and rust uses segmented stacks, so function calls aren’t very cheap. Take a look at benchmark I was partly an author of: https://github.com/mozilla/rust/blob/master/src/test/bench/noise.rs. All these inlines were very crucial to the performance. Oh, if you’re interested in my perlin noise benchmarks you can find C, D, Go, Python, Tcl and Rust versions here: https://gist.github.com/nsf/1170424, however it’s recommended to use the rust version from their repo as it has these inlines applied.

Reply
- logicchains says:
  
  July 24, 2013 at 2:13 pm
  
  Ah, yes I did mean GDC, thanks. Corrected that now.
  
  I’ll try the Rust version with inlining. Any guidance on how to choose between [inline(always)] and [inline], or should I just benchmark each and find which works best? Also, is it idiomatic to use for int::range(min, max) |i| {..} like in the benchmark you mention or to use a ‘while i< max {..i+=1}' loop like in the Rust Tutorial?
  
  Reply
  - nsf (@nosmileface) says:
    
    July 24, 2013 at 8:52 pm
    
    I’m not a rust guru myself, so I have no idea what’s the different between inline(always) and inline, but I have a pretty good guess that using inline(always) will force it as much as possible, while the other may or may not. Same about idiomatic code, I have no idea, ask rust guys. This perlin noise benchmark was the first and the last program I wrote in rust. 😀
Matt K says:

July 24, 2013 at 3:10 pm

The default Go random number generator is wrapped with a big mutex which basically makes it useless for multithreaded (goroutine) code. I wrote a short replacement for it that only works on Linux as it relies on the kernel’s read() implementation being atomic for short reads, and relies on the kernel’s /dev/urandom driver; needless to say it is much much quicker.

Reply
- Jan Mercl says:
  
  July 24, 2013 at 3:59 pm
  
  Use http://golang.org/pkg/math/rand/#New per goroutine and you don’t need any locking whatsoever.
  
  Reply
foobar says:

July 24, 2013 at 3:21 pm

I’d say it’s pretty evident from these results that if you need a critical speed advantage, use C or C++. The compilers are excellent and their lint like tools are improving constantly. I’d pick Rust or C any day over D or Go or Haskell.

If you pick Haskell, you can’t find cheap developers anywhere for your product. Go is beginning to look like vaporware. Rust is still too immature but will probably beat D once it’s finalized.

Reply
- No. says:
  
  August 2, 2013 at 5:35 am
  
  Yeah Go is vaporware, only Google and a large number of other high profile companies use it (e.g. cloudflare)
  
  Reply
- orclev says:
  
  August 24, 2013 at 4:41 am
  
  You assume someone wants to find cheap developers. You usually get what you pay for. I’d happily hire a Erlang, OCaml, or Haskell dev for pretty much anything as odds are good they’re going to know what they’re doing, a PHP dev, not so much. Not to say that a PHP dev can’t be good, just that there are an awful lot of bad PHP devs, so you end up spending a ton of time screening out the bad ones to find the good ones.
  
  There’s other considerations than just execution time involved as well, there’s metrics like development time, size of the code base, and how common bugs are (or how easy it is to introduce different categories of bugs). Sure in the end if all you really care about is the absolute fastest speed you can get, a naive implementation in C is probably going to beat a naive implementation in pretty much anything else (except maybe assembly), but if done properly most languages can achieve something approaching C speed (or sometimes exceeding it).
  
  Reply
Jesse McNelis says:

July 24, 2013 at 3:48 pm

The reason why dealing with a pointer to a slice in Go is awkward is because it’s not idiomatic. It’s much more common to pass a slice to a function and have the function return the slice if it modified it’s length.
Awkward code in Go is generally a sign that you’re trying to do the wrong thing.

Reply
- logicchains says:
  
  July 24, 2013 at 6:06 pm
  
  Right, I’ll change that.
  
  Reply
Huon Wilson says:

July 24, 2013 at 5:01 pm
Since everyone’s jumping on the optimisation band-wagon, I’ll do it too, I adapted the Rust code to use a faster RNG (XorShift, in the standard library too), and use iterators & vector-builders, which took Rust down to equal with GCC (which is faster than clang on my computer):
```
Rust       0.36
Rust isaac 0.57
Rust orig  0.70
Clang      0.42
GCC        0.36
```
The code: https://github.com/huonw/levgen-benchmarks/blob/master/R.rs
Reply
- Vladimir says:
  
  July 24, 2013 at 6:12 pm
  
  D’s default RNG is Mersenne-Twister. Here’s a patch to use XorShift in the D code too: https://github.com/CyberShadow/levgen-benchmarks/compare/d-xorshift
  
  This gives D a 20% speedup in my tests.
  
  Reply
  - logicchains says:
    
    July 24, 2013 at 7:04 pm
    
    I tested it and it didn’t run as fast as the current D version in the repo (which uses the C stdlib PRNG).
- logicchains says:
  
  July 24, 2013 at 7:02 pm
  
  Thanks for that! It sped the Rust implementation up to 0.620s, 90% of the best C version here, and faster than the D versions produced by the DMD and GDC compilers.
  
  Reply
Quon says:

July 24, 2013 at 8:25 pm

It just test random number generator and compiler opt, not the language itself.

Reply
- logicchains says:
  
  July 24, 2013 at 9:16 pm
  
  The Rust and LDC versions both use the same random number generator, and the highest compiler optimisation settings available, but the LDC version is notably faster.
  
  Reply
Dave says:

July 24, 2013 at 8:37 pm

I am not sure if this was implemented as you did not discuss it. However I would give the same seed to the PRNG’s, though results could vary using different PRNG’s.

Reply
- logicchains says:
  
  July 24, 2013 at 9:14 pm
  
  I tried each one with the seeds 10,20..100 and took the fastest speed. Generally the seed had little significant affect on the timing, so I didn’t mention it.
  
  Reply
pikachu says:

July 24, 2013 at 10:17 pm

There are a few things you’re doing bad with Go. The most important this is with the range statement with causes a Room struct to be copied.

Try this:
func CheckColl(x, y, w, h int, rs []Room) bool {
var r *Room
for i := range rs {
r = &rs[i]
if ((r.X + r.W + 1) (x + w + 1)) {
continue
}
if ((r.Y + r.H + 1) (y + h + 1)) {
continue
}
return true
}
return false
}

Reply
- pikachu says:
  
  July 24, 2013 at 10:23 pm
  
  With this change I got run-time down to 0.88, of which check room is 16%, and (C code with -Ofast is 0.511s on my machine) and random functions take about 77%.
  If you use a faster PRNG (see for example https://code.google.com/p/go/source/detail?r=3bf9ffdcca1f9585f28dcf0e4ca1c75ea29e18be). 6g should match the speed of C.
  
  Reply
- logicchains says:
  
  July 24, 2013 at 10:38 pm
  
  Thanks, that change decreased the running time from 1.9s to 1.65s with 6g and down to 0.720 with GCCGo.
  
  Reply
- logicchains says:
  
  July 24, 2013 at 10:53 pm
  
  Sorry, make that down to 0.952s with 6g. For some reason however it makes the GCCGo one take longer, around 0.990s.
  
  Reply
Maxwell McAdams says:

July 25, 2013 at 3:53 am

The Haskell implementation seems odd to me in a few ways, and I think there are large performance gains to be had with some more idiomatic code.

Manipulating big vectors is the least idiomatic Haskell possible. The usual way to refactor a giant data structure sitting in memory would be to stream it lazily through a thread-safe queue. Okasaki’s cons-list queues are popular and you can probably find a fast implementation on hackage, but even a lazy ByteStream would be better than that gargantuan vector. Same goes for bounds-checking after the fact, but I don’t see a simple way to rearrange that.

Also, I’m confused by your trepidation towards generating PRNs on-the-fly. Yes, you break referential transparency, but the “Random” in the type signature is enough to protect your code purity in spirit. Idiomatic Haskell is about encapsulating side-effects, not contorting your code to hide them. Streaming random numbers into memory as they are consumed would keep all of the dirty impurity hidden from the numeric functions anyways.

Also, the bounds-checking can be done with unsafe casts into the Bits type for fast mass bitwise operations. If you were unboxing your numericals anyways the overhead won’t be significant. This shatters purity, but not referential transparency, so its probably okay.

Reply
- logicchains says:
  
  July 25, 2013 at 12:20 pm
  
  I don’t really have enough experience with Haskell right now to write good idiomatic code, but any improvements submitted to the source are most welcome. PRNG speed is the biggest factor in the overall speed right now, so for Haskell to compete with the other implementations it would really need an XorShift PRNG, which I’m not sure how to implement.
  
  Reply
iopq says:

July 25, 2013 at 5:54 am

The table doesn’t line up for me, I can’t tell which number corresponds to which language/compiler.

Reply
- logicchains says:
  
  July 25, 2013 at 12:16 pm
  
  Sorry. I’ll add a separator character in between the numbers/compilers to make it clearer when the formatting fails.
  
  Reply
  - p says:
    
    July 26, 2013 at 6:20 am
    Why not use an actual table for the table? Is it a WordPress limitation?
    
    If it allows you to inject JavaScript you could do something like:
    
    function tablify(p, paragraphCount) { var table = document.createElement("table"); table.id = "real-table"; table.style.borderCollapse = "collapse"; table.style.color = "black"; var discard = []; for (var i=0; i p:nth-child(6)"), 3);
  - logicchains says:
    
    July 26, 2013 at 1:20 pm
    
    It didn’t allow Javascript injection, but turns out a HTML table works fine. Updated now to use a real table.
John Smith says:

July 25, 2013 at 9:35 am

There is another catch with go. With 6g, int means 64-bit. The “equivalent” C code uses 32-bit integers, and is obviously faster.

Reply
- John Smith says:
  
  July 25, 2013 at 9:37 am
  
  You should either switch to int32/uint32 in Go code, or use int64_t/uint64_t from stdint.h in C code to make a fair comparison.
  
  Reply
  - logicchains says:
    
    July 25, 2013 at 4:05 pm
    
    Switching to int32/uint32 didn’t result in any speedup to the code for me.
- logicchains says:
  
  July 25, 2013 at 12:15 pm
  
  Right, I’ll update that and post the results.
  
  Reply
Joe Wakeling says:

July 25, 2013 at 6:36 pm

I think you are still not comparing like for like in the RNG stakes. D’s default Xorshift is the 128-bit version, i.e. the internal state consists of four uint32_t words. The C and Go versions use a single uint32_t for the RNG state, i.e. they are 32-bit Xorshift.

Try replacing Xorshift in the D code with Xorshift32 and see how that affects things.

Oh, and — be careful with Xorshift implementations that directly copy the paper by Marsaglia. It has typos 🙂

Reply
- Joe Wakeling says:
  
  July 25, 2013 at 6:49 pm
  
  Sorry, just spotted that D.d uses the same handwritten 32-bit Xorshift as C and Go versions. How does the speed of that compare with the 128-bit library version used in D-Xorshift.d?
  
  Reply
  - logicchains says:
    
    July 25, 2013 at 6:59 pm
    
    It’s much faster, at least with LDC. The LDC-compiled version using the 128-bit library ran in around 0.40s, whereas the current LDC-compiled version using the 32-bit Xorshift runs in 0.29s.
chai2010 says:

July 26, 2013 at 9:50 am

How about disable bounds checking in Go (-B optional in 6g)?

Reply
Pingback: 系统级编程语言性能PK _ 业界资讯 _ HTML5工作室
Dave says:

July 30, 2013 at 12:14 pm

Data profiling in Haskell shows an awful lot of ints staying in memory from main.rands.

Doesn’t look like you’re getting lazy generation of this list. You’re likely better off writing this as concurrent code with a Random generator channel.

Do this

ghc -O2 -prof -auto-all -caf-all –make H.hs -rtsopts=all -fforce-recomp

./H 3 +RTS -hy -p >/dev/null

hp2ps -c H.hp

open H.ps (if you’re on a mac, maybe gnome-open on ubuntu, or just use a PS viewer)

Reply
- logicchains says:
  
  July 30, 2013 at 1:41 pm
  
  It currently uses an unboxed vector of 10 million ints, which is slower than a lazy list of vectors. The only reason for this is that I’m unsure how to get an infinite list from the XorShift package; if you can change the following code:
  
  gen <- newXorshift
  let rands = U.unfoldrN 10000000 (Just . next) gen
  
  so that rands is an infinite list, then that should speed it up a bit (U refers there to Data.Vector.Unboxed).
  
  I'm not looking to use any concurrency yet so as to allow fair comparison between the different implementations, but within a month or so I'll be implementing concurrent versions in each language, so I'll try the prng channel then if I can't get the lazy list of rands working.
  
  Reply
  - Dave says:
    
    July 30, 2013 at 3:19 pm
    
    You can still write it concurrently and use 1 actual thread. Just a thought. It may not buy you very much unless you compile with -threaded, which would allow for some overlap between threads.
    
    A lazy list of random numbers from Xorshift is
    
    myRandList :: Xorshift -> ([Int], Xorshift)
    myRandList g =
    let (i, g’) = next g
    (rest, g”) = myRandList g’
    in ((i : rest), g”)
    
    I think..
    
    That’s what I basically stuck in a State Monad like so…
    
    type MyRandomGen = State Xorshift
    
    genRandom :: MyRandomGen Int
    genRandom = do
    gen V.Vector Room -> MyRandomGen (V.Vector Room)
    genRooms 0 done = return done
    genRooms !n rsDone = do
    [r1,r2] <- sequence $ replicate 2 genRandom
    let x = r1 `rem` levDim
    y = r2 `rem` levDim
    w = r1 `rem` maxWid + minWid
    h = r2 `rem` maxWid + minWid
    tr = Room {rPos=(x,y), rw= w, rh= h}
    if (checkBound tr) || (checkColl tr rsDone)
    then genRooms (n-1) rsDone
    else genRooms (n-1) (V.cons tr rsDone)
    
    By the way, the performance goes straight to hell if you generate 4 random numbers in either H.hs or this version. I don't know why! Using 2 as H.hs does works well enough where the y and h are paired up with x and w….
    
    I will post the full code.
Dave says:

July 30, 2013 at 1:58 pm

I wrote a version that uses a State monad wrapper around the random number generator and runs every generation function in that monad. With -O2 it does not perform as well on my machine as the giant Vector one. Using Control.Monad.State.Strict made it a little faster, but not quite as fast.

According to the heap profiles i’ve generated though, the state monad generator used 1/50th of the memory of H.hs

H.hs uses 157 MB of memory while the state monad version uses 3 MB.

I could post my version somewhere if you want it.

Reply
- logicchains says:
  
  July 30, 2013 at 2:05 pm
  
  Sure, if you post a copy to Pastebin I’ll test it and see how it compares to H.hs on my machine. When the code used the Mersenne twister prng, changing from an unboxed vector to a lazy infinite list sped it up by over 10%, so I’m surprised the same doesn’t apply to the XorShift prng. If the slowdown with the state monad generator saves that much memory then it would probably perform better when the number of rooms/iterations needed is scaled up, too, and would work better with parallel code.
  
  Reply
Dave says:

July 30, 2013 at 3:20 pm

State Monad version here:
http://pastebin.com/gamMWuyQ

Reply
- logicchains says:
  
  July 31, 2013 at 2:25 pm
  
  That was surprisingly slow, less than half the speed of the unboxed vector one. I’ll profile it later to see why. I think the easiest way to speed it up would be to use the same XorShift function as the other implementations use:
  
  int GenRand(uint32_t *gen) {
  *gen += *gen;
  *gen ^= 1;
  int32_t tgen=*gen;
  if ( tgen < 0) {
  *gen ^= 0x88888eef;
  }
  int a = *gen;
  return a;
  }
  rather than the XorShift library. It might also be easier to get an infinite list out of it if you code the XorShift yourself.
  
  Reply
dave says:

July 31, 2013 at 3:07 pm

Yes, I was surprised how slow it was as well. I’m wondering if this is something for the ST Monad. When I profiled it it just looked like a ton of data being created and garbage collected over time. This makes me think we’d rather modify the same memory location a lot than create/destroy a ton of stuff.

Reply
Dave says:

August 1, 2013 at 12:20 pm

I have another version that does away with the Vectors, and the strictness, still using the State Monad for the random number generation and it’s a little bit faster than the one with Vectors and bang patterns on my machine.

http://pastebin.com/G1MMuka1

5,810,043,592 bytes allocated in the heap
7,354,800 bytes copied during GC
126,296 bytes maximum residency (3 sample(s))
30,576 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)

Tot time (elapsed) Avg pause Max pause
Gen 0 11229 colls, 0 par 0.14s 0.15s 0.0000s 0.0001s
Gen 1 3 colls, 0 par 0.00s 0.00s 0.0002s 0.0003s

INIT time 0.00s ( 0.00s elapsed)
MUT time 3.42s ( 3.46s elapsed)
GC time 0.14s ( 0.15s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 3.57s ( 3.61s elapsed)

%GC time 3.9% (4.2% elapsed)

Alloc rate 1,696,550,332 bytes per MUT second

Productivity 96.1% of total user, 95.0% of total elapsed

real 0m3.614s
user 0m3.567s
sys 0m0.045s

Reply
- logicchains says:
  
  August 1, 2013 at 9:45 pm
  
  Your new version is a bit faster, but still doesn’t match the unboxed vector one. I’m not fully up to speed with the state monad so I can’t think of any way to improve it. I did try writing a new genRand function using Data.Bits, but it was incredibly slow, less than a third of the speed of the state monad one.
  
  I’ve also found it doesn’t work well with 4 rands (20 million ints total); maybe it’s an addressing problem? 20 million ints (which are 64 bit by default on my program) is 160 million bytes, and maybe GHC has trouble properly addressing that much memory (entirely speculation, but I can’t think why else performance would decrease exponentially like that).
  
  Reply
Dave says:

August 1, 2013 at 12:21 pm

If we could just get rid of those MUT objects… I’m not sure of their origin.

Reply
- logicchains says:
  
  August 1, 2013 at 1:02 pm
  
  Would it be possible to implement the prng directly using the Data.Bits package? http://hackage.haskell.org/packages/archive/base/latest/doc/html/Data-Bits.html It seems the functions there are pure, so it might be possible to make a pure XorShift function and bypass the state monad entirely. Something like (I don’t have time to test this right now, and slightly pseudocodish):
  
  GenRand :: UInt -> UInt
  GenRand seed = if signedval < 0 then xored2 else xored
  where
  shifted = shiftL seed 1
  xored = xor shifted 1
  signedval :: Int = fromIntegral xored
  xored2 = xor xored 0x88888eef
  
  The value returned is both the random number generated and the seed for the next one.
  
  Reply
yxtiger says:

August 5, 2013 at 3:11 am

Go-lang code below may obtain 10% speed up. I’m using “go1.1.1 windows/386”.
(And it’s intresting that, if assignment of “x_m1,xw_p1,y_m1,yh_p1” split to 4 lines, or combine to 1 line, there will be 1% speed lose. So there’s still more optimization to be done in Go compiler)

func CheckColl(x, y, w, h uint32, rs []Room) bool {
var r *Room
x_m1, xw_p1 := x-1, x+w+1
y_m1, yh_p1 := y-1, y+h+1
for i := range rs {
r = &rs[i]
switch {
case r.X > xw_p1:
continue
case (r.X + r.W) yh_p1:
continue
case (r.Y + r.H) < y_m1:
continue
}
return true
}
return false
}

Reply
- rogerdpackroger says:
  
  January 4, 2014 at 12:17 am
  
  go 1.2 is supposed to have some speed improvements?
  
  Reply
  - logicchains says:
    
    January 5, 2014 at 8:56 am
    
    I just tested it on the newest version of Go, but it didn’t run any faster.
yxtiger says:

August 5, 2013 at 3:19 am

There will be 4 “case” sentences above, but some vanished during post…
Try again:
switch{
case r.X > xw_p1:
continue
case (r.X + r.W) yh_p1:
continue
case (r.Y + r.H) < y_m1:
continue
}

Reply
yxtiger says:

August 5, 2013 at 8:19 pm

Golang code below leads 5% speed up:
func GenRand(gen *uint32) uint32 {
seed := (*gen <> 31) * 0x88888eef
*gen = seed
return seed
}
————————
Even more, reduce function parameter may gain 1-2% speed up:
(need the caller r1/r2 modify accordingly)

var genseed uint32 = 18
func GenRand() uint32 {
seed := (genseed <> 31) * 0x88888eef
genseed = seed
return genseed
}
————————
ASM code may be found after compile(use “go tool 8g -S main.go”) like below, it’s obviously not optimized thoroughly. So it’s reasonable to expect more performance with later Golang version.
0268 (main.go:136) MOVL BX,genseed+0(SB)
0269 (main.go:136) MOVL genseed+0(SB),BX
…
0272 (main.go:136) XORL BX,genseed+0(SB)
0273 (main.go:136) MOVL genseed+0(SB),CX

Reply
yxtiger says:

August 5, 2013 at 8:20 pm

seed := (genseed ＜＜ 1) + 1
seed ^= (seed ＞＞ 31) * 0x88888eef

Reply
logicchains says:

August 5, 2013 at 9:03 pm

Great thanks; it runs significantly faster now.

Reply
andrew Aw says:

August 6, 2013 at 5:56 am

please show the loc every language..which was the most expresive one?..

Reply
- logicchains says:
  
  August 6, 2013 at 2:12 pm
  
  Did it. Seems Haskell is the most ‘expressive’ of the implementations, followed by Scala and then Rust. Haskell is slower, but I think the Haskell code is much less optimised than the other implementations, as I haven’t found a Haskell programmer willing to improve the Haskell XorShift prng, and I’m not skilled enough at Haskell yet to do it myself.
  
  Reply
yxtiger says:

August 7, 2013 at 2:06 pm

1.
Intel compiler should have a position, for its great optimization among CPU instructions.
C/C++ should get much benefit from it, “(x<<1)+1" will get "lea ebx, DWORD PTR [esi+esi+1]" like, really great!

2.
Since many modifies made during comparison, algorithm becomes non-consistence.
It's more faire to move "for(…;.<50000;…" loop inside "MakeRoom" function, "Rand" changed to inline, especially on the benchmark-basement.

Reply
- logicchains says:
  
  August 7, 2013 at 11:46 pm
  
  I didn’t realise there was a free version of the Intel compiler available, but apparently there’s one for Linux, so I’ll use it on the upcoming concurrent version of the benchmarks. I’ll also update the functions then to bring the for 50k loop into MakeRoom.
  
  Reply
yxtiger says:

August 7, 2013 at 2:41 pm

Note on Golang: vars are initialized when declare, rather than only assign mem.
So should pay attention to array/map’s declaration.

Changes can be take:

func MakeRoom(count uint32) *[]Room {
rs := make([]Room, 100)
—–change to—–
var rs [100]Room
func MakeRoom(count uint32) *[]Room {

for i := 0; i < 100; i++ {
rs := MakeRoom(50000)
ts := make([]Tile, 2500)
for ii := uint32(0); ii < 2500; ii++ {
ts[ii] = Tile{X: ii % TileDim, Y: ii / TileDim, T: 0}
—–change to—–
ts := make([]Tile, 2500)
for ii := uint32(0); ii < 2500; ii++ {
ts[ii] = Tile{ii % TileDim, ii / TileDim, 0}
}
for i := 0; i < 100; i++ {
rs := MakeRoom(50000)
for ii := uint32(0); ii < 2500; ii++ {
ts[ii].T = 0

for _, r := range *rs {
Room2Tiles(&r, &ts)
—–change to(because duplication in range)—–
for j := range *rs {
Room2Tiles(&((*rs)[j]), &ts)

Reply
- logicchains says:
  
  August 7, 2013 at 11:48 pm
  
  If you’ve tested those changes and they improve the speed, then you’re welcome to submit a pull request for them at the Github repository.
  
  Reply
Dave says:

August 8, 2013 at 1:21 pm

genRooms is taking the most time on the Haskell versions I’ve been looking at.

Reply
- logicchains says:
  
  August 9, 2013 at 12:11 am
  
  That it does, but it’s still possible to reduce the overall running time a bit by using a faster PRNG. I can’t think of any way to speed up genRooms that doesn’t involve changing the prng (or better, using a lazy list of random numbers, but I haven’t found an effective way to do that with the current prng).
  
  Reply
  - Dave says:
    
    August 9, 2013 at 1:13 am
    
    I suppose if we wanted a faster PRNG, we could implement it as transformations done in the ST monad. It definitely feels a little like cheating, but it’s something Haskell programmers reach for on occasion for certain update-in-place situations. Given that we’re generating a LOT of data that seems to just get GC’d… it could be useful.
    
    I, however, have never used ST for this kind of stuff, or much at all for that matter.
Dave says:

August 9, 2013 at 1:26 am

Actually, I take it back. I think ST is not going to help much as the way it seems to work is to take a regular old haskell value from which you create an STRef, then do an update, and return a new Haskell value. Unless the intermediate shifts in the PRNG cause a lot of allocations and such, it may be of little value.

Reply
- logicchains says:
  
  August 9, 2013 at 11:23 am
  
  A simple solution would be to just take the source at http://hackage.haskell.org/packages/archive/xorshift/2.0.1/doc/html/src/Random-Xorshift-Int32.html#makeXorshift32 and modify it slightly. Looking at it now however I’m not sure whether that would help much, as it uses the code:
  step32 :: Xorshift32 -> Xorshift32
  step32 (Xorshift32 a) = Xorshift32 d where
  b = a `xor` (shiftL a 13)
  c = b `xor` (shiftR b 17)
  d = c `xor` (shiftL c 5)
  which is the same as Scala’s prng, yet the Scala version performs faster than the Haskell one even though it’s using 4 random numbers per room and the Haskell one only uses two.
  
  I’ll try changing the unboxed vector of random integers to a lazy vector and see if that helps at all.
  
  Reply
- logicchains says:
  
  August 10, 2013 at 12:37 pm
  
  I tried it with a lazy list and a lazy vector, but both were significantly slower than the unboxed vector.
  
  Reply

	bob on Fibonacci numbers on the Galax…
	kimberly on “I’m from Microsof…
	logicchains on Fulfilling a Pikedream: the up…
	logicchains on Fulfilling a Pikedream: the up…
	logicchains on Fulfilling a Pikedream: the up…