Benchmarks Round Two: Parallel Go, Rust, D, Scala and Nimrod.

What were the command args for gccgo? gccgo is usually faster than 6g. Did you use -O3 or whatnot?

I’m fairly certain I did, or at least I intended to. I’ll check once more to make sure it didn’t accidentally slip my mind.

Yes, I used -O3. Gccgo just appears to be much much slower than 6g for this task, for some reason. I’ll check the gccgo assembly when I have time to see if it’s mangling the GenRands function.

Did you use a -march=native or equivalent on any of the gcc/clang compiled code?

I tried it, but it didn’t seem to make a noticeable difference, so I built without it.

More idiomatic C than using an array would be to just cast the integer to a void * and then cast it back in your thread. It’s technically non-portable, but will work on just about any implementation of C ever.

I considered that; it’s mentioned towards the end of the second paragraph of the ‘Subjective experience’ section. I avoided it because if the integer was a really big 64 bit value (or a really big long long), converting to a pointer and back on a 32 bit system (with 32 bit pointers) would mangle it. I suppose that might be being a bit too cautious in this case, however.

Isn’t the “referencing a changing loop variable” thing just easily cured by stack-allocating a new int inside the loop and copying the value of i into it?

Nope; the variable will go out of scope when that pass through the loop ends. A static variable would work, but they can’t be used like that in C.

I would just pass the int “by value” (don’t pass a pointer, just put the int inside the pointer). The other option is to pass the threads array as a pointer, you can then use pointer-arithmetic to determine the index of “your” thread by finding the offset to the start of the array.

I try to avoid treating integers as pointers and pointers as integers because of the potential bugs that it can cause, but I’m probably being too cautious for this case.

Ah, you’re right, I’m thinking in Go, where a stack local will be magically promoted to the heap if it escapes; my above is indeed the recommended fix in Go.

Your irritation with the D memory model is likely due to the fact that global variables are thread-local by default in D. Declaring them “shared” makes them truly global like in C/C++.

Thanks, I didn’t realise that.

This made me chuckle: “[…] especially if one was working with programmers of dubious competence who couldn’t be trusted to write memory-safe code.”

For the C implementation, try changing ‘void MakeRoom’ to ‘inline void MakeRoom’. Halved the time it took on my machine.

Both the D and C++ code give me a segmentation fault at runtime.
I compiled the C++ one using -lpthread and the D one using the flags provided in the source file. Both compile with 0 errors / warnings.

You need to run them with the seed, as in ./PD 10 or ./PCPP 42 . If they still segfault even with the seed, then I’m not sure what’s going wrong.

What was the issue with your Haskell code? Could you elaborate?

Take the code here: https://github.com/logicchains/levgen-benchmarks/blob/master/H.hs and run it. Then, take the same code, change the genRooms function to contain:

	where
	  noFit = genRooms (n-1) (restInts) rsDone
	  tr = Room {rPos=(x,y), rw= w, rh= h}
	  x = rem (U.unsafeHead randInts) levDim
	  y = rem (U.unsafeIndex randInts 1) levDim
	  restInts = U.unsafeDrop 4 randInts
	  w = rem (U.unsafeIndex randInts 2) maxWid + minWid
	  h = rem (U.unsafeIndex randInts 3) maxWid + minWid

And change:

	let rands = U.unfoldrN 10000000 (Just . next) gen

to:

	let rands = U.unfoldrN 20000000 (Just . next) gen

The running time should double. Does it? Or does it increase by orders of magnitude? The latter is what happens to me, and to a couple of other people who have tried it.

It would be cool if the next round were about memory management. The performance of GC languages has not been measured here, and it would be interesting to see how D, Go and Nimrod perform compared to the well-known JVM with Scala.

The next benchmark will be about garbage collected languages’ effectiveness at running an OpenGL animation at a consistent framerate (without stuttering).

In the ‘C’ implementation example, why are we passing the address of the loop variable ‘i’? Because it is passed by reference, a newly created thread accesses ‘i’ itself, the contents of which are changing in the main looping thread anyway, if I am not missing something obvious. One quick solution could be to write a function which

– takes in an ‘int’ as a param,
– internally ‘malloc’s an integer,
– initializes this newly allocated integer with the param
– returns the newly allocated integer

and then, call this function with loop variable ‘i’ and use its return value as a parameter to pthread_create(). Shouldn’t that work?

That should work fine, thanks. I was just avoiding heap allocation for simplicity. I’m not sure if it would result in fewer lines of code than the current version.

I messed with the C++ example for fun. It is now almost twice as fast on my system.

I replaced the analytical intersection test with an occlusion buffer that records, for every tile, whether it is used. I suspect there is an off-by-one error somewhere, but it seems to work fine.

It would be fun to redo the test for this implementation. This kind of low-level rasterisation is something C/C++ shine at, with pointer arithmetic and the like, even though I didn’t bother to try to SIMDfy it.

Code: http://pastebin.com/5wtzPE1R

Great, thanks! I imagine it wouldn’t be so easy to implement efficiently in the other languages, although I think D could possibly just copy the C code, and Rust and Go might be able to do something similar with an unsafe block and raw pointers.

I sent a pull request with a parallel Haskell implementation.

Thanks! Would it be possible to add a timeout, so that after 50,000 attempts at generating 100 rooms it gives up and moves on to the next level, as in the other implementations?

genGoodRooms in H.hs looks very suspicious re space leaks.

aux accum count t = do …

should say aux accum !count !t = …

to avoid thunk buildups in incrementing the count and decrementing t.

Use the {-# LANGUAGE BangPatterns #-} pragma so the compiler recognizes that idiom (otherwise you’d need seq annotations).

I tried that, but it didn’t significantly increase the speed (it’s still far from possible for me to generate 1000 levels with 50000 tries per level in under one minute).

I’ve adapted the parallel D version with “idiomatic” parallelization. It practically just adds the word “parallel” and moves the random number generator inside the level loop. It is also around 18% faster on my computer than your current D code on Github. Here’s the code:

	// Compile with:
	// ldc2 -O5 -check-printf-calls -fdata-sections -ffunction-sections -release -singleobj -strip-debug -wi -disable-boundscheck -L=--gc-sections -L=-s D.d

	@safe:
	import std.conv, std.stdio, std.parallelism;

	enum LEVEL_SIZE = 50; /// Width and height of a level
	enum ROOMS = 100; /// Maximum number of rooms in a level
	enum ROOM_SIZE_BASE = 2; /// Rooms will be at least this value plus one in size.
	enum ROOM_SIZE_MOD = 8; /// Random additional room size: [0 .. ROOM_SIZE_MOD)

	enum NUM_LEVS = 800;
	enum NUM_TRIES = 50000;

	struct Tile {
		uint x;
		uint y;
		uint t;
	}

	struct Room {
		uint x;
		uint y;
		uint w;
		uint h;
		size_t number;
	}

	struct Level {
		Tile[LEVEL_SIZE ^^ 2] tiles = void;
		Room[ROOMS] rooms = void;
		size_t roomCnt = 0;

		void makeRoom(ref Random rnd) nothrow {
			immutable x = rnd.next() % LEVEL_SIZE;
			immutable y = rnd.next() % LEVEL_SIZE;
			if (x == 0 || y == 0) return;

			immutable w = ROOM_SIZE_BASE + rnd.next() % ROOM_SIZE_MOD;
			immutable h = ROOM_SIZE_BASE + rnd.next() % ROOM_SIZE_MOD;
			if (x + w >= LEVEL_SIZE || y + h >= LEVEL_SIZE) return;
			if (checkColl( x, y, w, h )) return;

			Room* r = &this.rooms[this.roomCnt];
			r.number = this.roomCnt++;
			r.x = x;
			r.y = y;
			r.w = w;
			r.h = h;
		}

		/// Returns true, when the given area collides with existing rooms.
		bool checkColl(in uint x, in uint y, in uint w, in uint h) const pure nothrow {
			foreach (ref r; this.rooms[0 .. this.roomCnt]) {
				if (r.x + r.w + 1 >= x && r.x <= x + w + 1 &&
					r.y + r.h + 1 >= y && r.y <= y + h + 1) {
					return true;
				}
			}
			return false;
		}

		/// Initializes and then builds the tiles from the room definitions.
		void buildTiles() pure nothrow {
			foreach (uint i; 0 .. this.tiles.length) {
				this.tiles[i].x = i % LEVEL_SIZE;
				this.tiles[i].y = i / LEVEL_SIZE;
				this.tiles[i].t = 0;
			}
			foreach (ref r; this.rooms[0 .. this.roomCnt]) {
				foreach (xi; r.x .. r.x + r.w + 1)
					foreach (yi; r.y .. r.y + r.h + 1) {
						this.tiles[yi * LEVEL_SIZE + xi].t = 1;
					}
			}
		}

		void dump() const @trusted {
			foreach (row; 0 .. LEVEL_SIZE) {
				immutable offset = LEVEL_SIZE * row;
				foreach (col; 0 .. LEVEL_SIZE) {
					write( this.tiles[offset + col].t );
				}
				writeln();
			}
		}
	}

	struct Random {
		uint current;

		uint next() nothrow {
			current += current;
			current ^= (current > int.max) ? 0x88888eee : 1;
			return current;
		}
	}

	__gshared Level[NUM_LEVS] levels;

	void main(string[] args) @system {
		// Create a local random number generator
		immutable seed = (args.length > 1) ? args[1].to!uint() : 123;
		writefln( "The random seed is: %s", seed );

		// Create several levels for benchmarking purposes
		foreach (levelIdx, ref level; parallel( levels[] )) {
			auto rnd = Random( cast(uint) (seed * (levelIdx+1) * (levelIdx+1)) );
			foreach (i; 0 .. NUM_TRIES) {
				level.makeRoom(rnd);
				if (level.roomCnt == ROOMS) {
					break;
				}
			}
			level.buildTiles();
		}

		// Select the level with the most rooms for printing
		Level* levelToPrint = &levels[0];
		foreach (ref level; levels[1 .. $]) {
			if (level.roomCnt > levelToPrint.roomCnt) {
				levelToPrint = &level;
			}
		}
		levelToPrint.dump();
	}


Great, it’s faster for me too. It seems to be generating different results than the one in the repo; is it creating NUM_LEVS threads with a different seed for each one? If so, it’s impressive that using so many threads can produce a faster result than using NUM_CORES threads.

You are right about the new seed (or RNG) for every level. “parallel” by default (and I didn’t change it) runs on as many threads as there are CPU cores and splits the levels array into work units (e.g. 80 chunks of 10 levels). Both thread count and work unit size can be configured globally, but the defaults were already good and I could get away with no additional lines of code.

That’s impressive. I wonder if it’d be possible to optimise the C++ similarly.

Ok, and now comes the killer implementation: ~0.100s for 800 levels! It produces the same levels as the original algorithm, but fills the whole level as if NUM_TRIES was close to infinity.

Instead of trial and error it works deterministically. That means each step adds a new room to the level. To produce the same results, I reproduce the probabilities that a room of each size can be placed in the level. E.g. a 3×3 room might be twice as probable as a 5×4 room. Once a room size is selected using these probabilities, it is placed at a random free spot.

These free spots are tracked in 64 bitmaps (one for each of the 64 possible room shapes), each the size of the level, which tell the algorithm where the free spots are. After placing a room, all 64 bitmaps are updated by marking a rectangular area, and the probabilities for each room size are then updated accordingly.

Code: http://dpaste.dzfl.pl/6ce1b7d9

That is extremely impressive! I’d be quite interested to see how it compares to the occlusion buffer/free list technique used here: https://github.com/danielhams/Levgen-Algorithm-Benchmarks. It works as follows:

-Generate a room with random dimensions
-Attempt to find some free space of appropriate size
-Offset the room randomly inside the free space
-Compute new free entries for the space left over and put them back into the free list
-When out of free space walk the occlusion buffer looking for blocks with appropriate minimum size
-When there are enough rooms or the compaction of the free space above doesn’t produce any free space, level generation is complete

Note that I can’t include the numbers in the benchmark results as it’s a fundamentally different algorithm to the other implementations.

That free list technique is a bit faster (e.g. 0.093 seconds in one of the 4 versions using the built-in timing). The runtime of the other algorithms that are included depends largely on the number of trial and error steps. The quality of the output is roughly the same as far as I can tell.
That said, it does create different levels, as the free areas are created going from left to right, top to bottom. For every free cell, the largest possible rectangle is drawn to the bottom-right and added to the free list. This creates arbitrary borders on the level where no rooms will be placed (as far as I understood the code). I.e. after placing a few rooms you have several disjoint areas where rooms can go to and they are mostly square. So it gives up on the “completely random” property that the trial and error approach you used has naturally.

Ah, right. I think you’re correct, your implementation does indeed produce more random results. Would you say there’s anything in particular about D that made the algorithm easier to implement?

D’s default to thread local storage gave me per-thread random number generators “for free” and I relied on compile-time function execution to generate the initial state of the level generator struct. So where I wrote “MyLevelGen.RoomGenerator rg;” it is just a blit of the pre-generated bit pattern at run-time for the chosen level and room sizes.

An example: I use single bits for the occluded bitmaps (so one byte holds 8 values), but round up to machine word size. So if the bitmap would be 30 pixels wide, it is rounded up to 32 bits on a 32-bit computer. To make it easier for the algorithm I initialize these 2 extra bits to “occluded” for the full height of the bitmap at compile-time and rely on the memcpy() style default initialization of structs in D, instead of running a constructor.

That’s interesting, from skimming over the code I didn’t even realise it was doing any compile-time calculation (apart from optimisation, of course). Quite impressive, considering how ugly it would probably look to do the same thing with C++ templates. This is probably a stupid question, but do you think it would ever be possible to embed DMD into an executable and have it generate code on the fly based on parameters known only at runtime, like one can with Lisp macros, or would that be unfeasible?

I don’t believe you’ll ever be able to replace existing code in a D program like you can do in Lisp. The design of the language is focused on static code and compile-time reflection. Depending on your needs you can write your code to a file, and compile a shared library with dmd from it that you load and unload as a plugin or use LuaJIT where you need the flexibility of an embedded scripting language: https://github.com/JakobOvrum/LuaD

Right. I’d love to see a language where it was possible to do what your D implementation does but with the room size parameters supplied at runtime, but right now that seems to only be possible with assembly or writing one’s own JIT. D’s templates are quite impressive nonetheless; I haven’t seen a language that does them better, although I haven’t tried Nimrod’s.

This is the version by mleise with many changes and improvements. The performance should be the same, but the line count is different, so please update the D entry and its line count:
http://dpaste.dzfl.pl/d37ba995

Better, this D version doesn’t mix leading tabs and spaces, sorry:
http://dpaste.dzfl.pl/raw/e2bfc00f

I’m currently getting an error on line 58: cannot implicitly convert expression (i / 50LU) of type ulong to uint. Could you make it compatible with ldc2 of version: ‘based on DMD v2.063.2 and LLVM 3.2’ ?

The compiler version is OK, but I compiled the code on a 32 bit system, where size_t is 32 bit, while on your 64 bit system “i / levelSize” is a ulong, and it can’t fit into the uint of the Tile.
To solve the problem just replace line 57 “foreach (immutable i, ref t; this.tiles)” with “foreach (immutable uint i, ref t; this.tiles)”.
Apparently such 32/64 bit conversion problems come from a design decision of the D language.

Great, it works fine now. I take it this problem would be avoided if the constants were typed?

The immutable index was already typed, but in a different way (size_t instead of uint). More info in the newsgroup because this comment system is not very good for complex explanations.

Language	Compiler	Speed (s)	% Fastest	Resident Mem Use (KiB)
D	ldc2	0.812	116.38%	26,536
C++	clang++	0.945	100.00%	25,552
D******	ldc2	0.955	98.95%	26,536
Neat*****	fcc	0.958	98.64%	26,762
Nimrod	clang	0.980	96.43%	25,932
C++ ***	g++	1.025	92.20%	25,532
Rust	rustc	1.109	85.21%	47,708
Go	6g	1.184	79.81%	30,768
C	clang	1.199	78.82%	25,796
Scala	scala	1.228	76.95%	72,960
Nimrod	gcc	1.376	68.68%	26,120
C****	gcc	1.467	64.42%	25,800
D	dmd	2.103	44.94%	26,508
Go	gccgo	2.710	34.87%	69,120

Language	SLOC	SLOC to Parallelise
Nimrod	109	-*
Neat	117	6
Rust	123	20**
Scala	99	20
Go	131	0
C++	142	15
D	83	-24**
C	172	32

48 Responses to Benchmarks Round Two: Parallel Go, Rust, D, Scala and Nimrod.
