Memory Access Microbenchmark: Clojure, F#, Go, Julia, Lua, Racket, Ruby, Self…

*Edit – For clarity, I’ve decided to just include the most up-to-date, fair and relevant results tables at the top here. Scroll down below to read more about the optimisation process and the effectiveness of various different implementations.

NumTrades = 150

Language Compiler Compile time (s) Running time (ms) % Fastest Resident mem use (KiB) Compressed source size Lines of code Number of characters Executable size (KB)
C3 clang 0.0944 248 100.00 352224 757 60 1502 9
Clojure - 0.0000 607 40.86 893908 976 79 2814 0
Java3 javac 0.8451 624 39.74 826964 864 119 3386 2
Scala - 0.0000 737 33.65 751168 787 57 1719 0
Common Lisp - 0.0000 1152 21.53 902192 715 44 2103 0
Python2 (Pypy)2 - 0.0000 1716 14.45 1382304 609 42 1382 0
C#3 mcs 0.7248 2099 11.82 727712 893 114 3268 4
F#3 fsharpc 4.2022 2126 11.67 731808 660 39 1541 12
Racket raco 17.1353 2150 11.53 875424 691 38 1959 11142
Go3 go 0.3930 2516 9.86 352952 656 62 1313 2799
Ruby1 - 0.0000 6122 4.05 1070024 636 52 1266 0
JuliaBigNum - 0.0000 8932 2.78 2526800 565 52 1227 0
Tcl - 0.0000 18754 1.32 1702460 543 45 1130 0

NumTrades = 100

Language Compiler Compile time (s) Running time (ms) % Fastest Resident mem use (KiB) Compressed source size Lines of code Number of characters Executable size (KB)
C3 clang 3.0596 210 100.00 235036 757 60 1502 9
Clojure - 0.0000 295 71.19 706660 976 79 2814 0
Java3 javac 2.8895 421 49.88 561296 864 119 3386 2
Scala - 0.0000 478 43.93 479916 787 57 1719 0
Common Lisp - 0.0000 644 32.61 866428 715 44 2103 0
Python2 (Pypy)2 - 0.0000 1036 20.27 955904 609 42 1382 0
F#3 fsharpc 7.6323 1082 19.41 492912 660 39 1541 12
C#3 mcs 1.7726 1144 18.36 488812 893 114 3268 4
Racket raco 26.5090 1208 17.38 639700 691 38 1959 11142
Go3 go 2.8093 1921 10.93 235768 656 62 1313 2799
Ruby1 - 0.0000 3045 6.90 675208 636 52 1266 0
Lua - 0.0000 3780 5.56 2202448 576 53 1305 0
Smalltalk (GNU) - 0.0000 5545 3.79 611088 712 76 1641 0
JuliaBigNum - 0.0000 7836 2.68 1012580 565 52 1227 0
Tcl - 0.0000 9865 2.13 1135652 543 45 1130 0

NumTrades = 50

Language Compiler Compile time (s) Running time (ms) % Fastest Resident mem use (KiB) Compressed source size Lines of code Number of characters Executable size (KB)
C3 clang 3.6418 81 100.00 117848 757 60 1502 8
Common Lisp - 0.0000 84 96.43 324988 715 44 2103 0
Clojure - 0.0000 117 69.23 403120 976 79 2814 0
Racket raco 20.1401 132 61.36 373368 691 38 1959 11142
Java3 javac 2.8739 176 46.02 318364 864 119 3386 2
Scala - 0.0000 213 38.03 276620 787 57 1719 0
LuaJit - 0.0000 230 35.22 649116 576 53 1305 0
F#3 fsharpc 5.3502 381 21.26 254008 660 39 1541 12
C#3 mcs 1.4964 389 20.82 249920 893 114 3268 4
Python2 (Pypy)2 - 0.0000 428 18.93 482480 609 42 1382 0
Go3 go 2.3946 814 9.95 118576 656 62 1313 2799
Ruby1 - 0.0000 1435 5.64 301384 636 52 1266 0
Lua - 0.0000 1720 4.71 1101764 576 53 1305 0
Smalltalk (GNU) - 0.0000 1825 4.44 311132 712 76 1641 0
JuliaBigNum - 0.0000 2424 3.34 907624 565 52 1227 0
Python3 - 0.0000 2425 3.34 1338756 497 32 1126 0
Tcl - 0.0000 4102 1.97 569216 543 45 1130 0

NumTrades = 10

Language Compiler Compile time (s) Running time (ms) % Fastest Resident mem use (KiB) Compressed source size Lines of code Number of characters Executable size (KB)
Common Lisp - 0.0000 16 100.00 111068 715 44 2103 0
C3 clang 0.0966 16 100.00 24096 757 60 1502 9
Racket raco 17.0957 26 61.54 190028 691 38 1959 11142
Clojure - 0.0000 27 59.26 121268 976 79 2814 0
Java3 javac 0.8250 39 41.03 96224 864 119 3386 2
LuaJit - 0.0000 40 40.00 128036 576 53 1305 0
Scala - 0.0000 41 39.02 148820 787 57 1719 0
F#3 fsharpc 4.1270 63 25.40 62920 660 39 1541 12
C#3 mcs 0.7194 64 25.00 58820 893 114 3268 4
Python2 (Pypy)2 - 0.0000 85 18.82 113608 609 42 1382 0
Go3 go 0.5028 158 10.13 24828 656 62 1313 2799
JuliaBigNum - 0.0000 273 5.86 316656 565 52 1227 0
Ruby1 - 0.0000 286 5.59 65244 636 52 1266 0
Smalltalk (GNU) - 0.0000 324 4.94 66276 712 76 1641 0
Lua - 0.0000 370 4.32 216136 576 53 1305 0
Python3 - 0.0000 485 3.30 273000 497 32 1126 0
Tcl - 0.0000 813 1.97 115772 543 45 1130 0

I recently came across an interesting blog post that dealt with optimising a common situation in which the JVM performed surprisingly poorly, which prompted me to try the microbenchmark in a few other languages to see how they fared. The code for this benchmark can be found here.

The problem: iterating over an array of objects modifying and reading its contents. What’s  interesting about it? Firstly, as this is a relatively common task, the results give a tiny bit of insight into the ‘general’ speed of the languages tested. Secondly, it’s directly relevant to the effectiveness of object pooling, a technique I’m using in the other benchmark I’m working on. The idea of an object pool is to statically allocate the needed objects at initialisation and reuse them instead of creating new objects, to avoid generating garbage. The effectiveness of this technique would however be limited if it’s impossible to access and modify objects in the pool without generating garbage.

Why are some of the usual suspects (D, Rust, Nimrod, C++) missing? They use the same backend as C so they’d produce results of similar speed, and their metaprogramming facilities are powerful enough that it’d be easy to ‘cheat’ and do all the calculation at compile time. If somebody wants to submit an implementation in one of these languages they are however quite welcome, as long as it doesn’t ‘cheat’.

Now, the results (but don’t neglect to read the rest of the post, as there’s a twist at the end). The value ‘NumTrades’ multiplied by 50,000 is the number of objects in the array of trade objects. Note that if a particular language/implementation is missing from a certain NumTrades value’s results, this means that it ran out of memory or was otherwise killed by the OS. The tests are run on a dual-core 64bit, 1.6ghz Intel i5, with 4gb of ram, running Linux Mint 15 64bit.

*Edit: It should be noted that this set of graphs tests the form of allocation used in the blog post that inspired this post. It is however suboptimal, and the set of graphs further below involve implementations using an optimal method of allocation/object pooling. This first set of graphs just provides for interesting comparison, to show how  small change can lead to significant speedup. Also note that the Ruby, Smalltalk and Lua implementations in this graph use the faster form of allocation, as they ran out of memory and died too soon without it. It is hence more fair to compare them with the implementations suffixed with 2, in the later set of graphs. In fact, these graphs are partially just here as part of explaining the optimisation process; if you’re interested in the fairest numbers, scroll to the final set of graphs.

NumTrades = 150 

Language Compiler Compile time (s) Running time (ms) % Fastest Resident mem use (KiB) Compressed source size Lines of code Number of characters Executable size (KB)
x86-64 assembly fasm 0.0318 110 100.00 300524 2279 180 5840 8
Go go 0.3103 124 88.71 352484 614 61 1206 2256
C clang 0.0737 127 86.61 352044 712 56 1425 9
JavaUnSafe javac 1.1290 203 54.19 326036 1182 156 5148 2
CLispFixNum - 0.0000 228 48.24 810536 745 44 2152 0
Julia - 0.0000 298 36.91 609564 552 52 1228 0
Common Lisp - 0.0000 1120 9.82 902192 715 44 2103 0
Java javac 2.1709 1575 6.98 918820 819 121 3329 1
C# mcs 0.6872 1587 6.93 999892 878 115 3301 4
Racket raco 16.8814 1955 5.63 877364 691 38 1959 11142
F# fsharpc 4.2951 5761 1.91 1452868 646 40 1526 12
Python2 (Pypy) - 0.0000 9489 1.16 1519092 561 38 1192 0
Tcl - 0.0000 14486 0.76 1702460 544 46 1170 0

NumTrades = 100

Language Compiler Compile time (s) Running time (ms) % Fastest Resident mem use (KiB) Compressed source size Lines of code Number of characters Executable size (KB)
x86-64 assembly fasm 0.3159 70 100.00 200468 2279 180 5840 8
Go go 0.3185 82 85.37 235300 614 61 1206 2256
C clang 1.8142 84 83.33 234856 722 56 1435 9
JavaUnSafe javac 1.1487 135 51.85 223596 1182 156 5148 2
CLispFixNum - 0.0000 152 46.05 594160 745 44 2152 0
Julia - 0.0000 200 35.00 415212 552 52 1228 0
Common Lisp - 0.0000 656 10.67 866428 715 44 2103 0
C# mcs 0.7074 953 7.35 966208 890 116 3320 4
Java javac 2.7260 1052 6.65 651904 819 121 3329 1
Racket raco 20.3636 1237 5.66 639704 691 38 1959 11142
F# fsharpc 6.2570 1393 5.03 976720 646 40 1526 12
Python2 (Pypy) - 0.0000 2234 3.13 1183108 561 38 1192 0
Lua - 0.0000 4580 1.53 2202468 554 52 1208 0
Ruby - 0.0000 4753 1.47 2233092 605 56 1250 0
Smalltalk (GNU) - 0.0000 5313 1.32 611088 712 76 1641 0
Tcl - 0.0000 9279 0.75 1135660 544 46 1170 0
Python3 - 0.0000 10654 0.66 1582084 607 39 1268 0
Clojure - 0.0000 86824 0.08 994700 960 53 2811 0

NumTrades = 50

Language Compiler Compile time (s) Running time (ms) % Fastest Resident mem use (KiB) Compressed source size Lines of code Number of characters Executable size (KB)
Go go 1.8760 40 100.00 118112 614 61 1206 2256
x86-64 assembly fasm 0.6947 40 100.00 100416 2279 180 5840 8
C clang 1.6018 43 93.02 117668 722 56 1435 7
JavaUnSafe javac 1.2099 73 54.79 121152 1182 156 5148 2
Common Lisp - 0.0000 96 41.67 324988 715 44 2103 0
Julia - 0.0000 99 40.40 271700 552 52 1228 0
Racket raco 21.5142 130 30.77 373364 691 38 1959 11142
LuaJit - 0.0000 230 17.39 649136 554 52 1208 0
C# mcs 1.5198 460 8.70 488508 890 116 3320 4
F# fsharpc 5.7907 480 8.33 455608 646 40 1526 12
Java javac 3.0675 677 5.91 493744 819 121 3329 1
Python2 (Pypy) - 0.0000 1010 3.96 620820 561 38 1192 0
Smalltalk (GNU) - 0.0000 1835 2.18 311136 712 76 1641 0
Lua - 0.0000 2150 1.86 1101764 554 52 1208 0
Ruby - 0.0000 2197 1.82 1046204 605 56 1250 0
Python3_fst - 0.0000 2460 1.62 1338748 497 32 1127 0
Tcl - 0.0000 4121 0.97 569212 544 46 1170 0
Python3 - 0.0000 4951 0.81 794376 607 39 1268 0
Clojure - 0.0000 10246 0.39 858024 960 53 2811 0

NumTrades = 10

Language Compiler Compile time (s) Running time (ms) % Fastest Resident mem use (KiB) Compressed source size Lines of code Number of characters Executable size (KB)
Go go 1.6017 8 100.00 24360 614 61 1206 2256
C clang 1.6518 8 100.00 23920 722 56 1435 9
Java javac 1.1308 13 61.54 117792 819 121 3329 1
JavaUnSafe javac 1.0829 14 57.14 39256 1182 156 5148 2
Common Lisp - 0.0000 16 50.00 111068 715 44 2103 0
Julia - 0.0000 20 40.00 164340 552 52 1228 0
Racket raco 17.0663 26 30.77 190052 691 38 1959 11142
LuaJit - 0.0000 50 16.00 128032 554 52 1208 0
C# mcs 0.6989 67 11.94 106228 890 116 3320 4
F# fsharpc 4.4070 83 9.64 116864 646 40 1526 12
Python2 (Pypy) - 0.0000 203 3.94 142012 561 38 1192 0
Smalltalk (GNU) - 0.0000 324 2.47 66276 712 76 1641 0
Ruby - 0.0000 448 1.79 212004 605 56 1250 0
Lua - 0.0000 450 1.78 216132 554 52 1208 0
FasterPython - 0.0000 484 1.65 273000 497 32 1127 0
Tcl - 0.0000 821 0.97 115768 544 46 1170 0
Python3 - 0.0000 1017 0.79 163864 607 39 1268 0
Clojure - 0.0000 1121 0.71 322328 960 53 2811 0

Take the results for Python, Ruby and Julia with a grain of salt, as I’m less experienced with these languages and there are probably optimisations that I’ve missed. Suggestions are welcome! (Note also that the rightmost column of the table represents the size of the resulting executable; it doesn’t seem to fit fully into the page.)

*Edit: The Ruby and Julia versions have been improved since the benchmark was originally run, and now perform quite impressively. A version that’s faster in CPython (but not in Pypy) has also been added, as Python3_fst.

Points of interest:

Pypy – Significantly faster than standard Python 3 (CPython); quite impressive. For such an otherwise elegant language, however, writing constructors in Python is surprisingly awkward. Or perhaps it’d be more accurate to say the fact that one even has to write constructors in the first place seems odd; is there a default constructor I’m not aware of, a more concise way to write the following? *edit: there is indeed a better way to write this. I need to learn more Python.

class PyMemTrade():
  def __init__(self, tradeId, clientId, venueId, instrumentCode, price, quantity, side):
  self.tradeId = tradeId
  self.clientId = clientId
  self.venueId = venueId
  self.instrumentCode = instrumentCode
  self.price = price
  self.quantity = quantity
  self.side = side

C# – Faster than Java, and uses less memory. An interesting note: I originally ran the benchmark with the older Mono 2.1 runtime, before having to upgrade to 3.2 to get F# to compile, and the C# implementation actually ran faster there. Presumably the old garbage collector was better suited to this case, whereas maybe the new one is more like Java’s.

x86-64 assembly – Uses less memory, courtesy of struct packing. Normally languages pad structs, to align memory such that each memory access of a particular type is at an address that’s a multiple of that type. So an int32 can be accessed at 4, 8, 12, etc, whereas an int64 can only be accessed at 8, 16, 24 etc. The trade struct in this benchmark is long, long, int, int, long, long, byte; 8, 8, 4, 4, 8, 8, 1, or 41 bytes total. If sticking these in an array, 41 is not a multiple of 8, so 7 bytes of padding are necessary at the end of each struct so that the long at the start of the next one is aligned. As assembly doesn’t have structs, just byte arrays, it’s possible to avoid this padding. Note that while it’s faster in this case, unaligned access is generally thought to be slower, and on non-X86 architectures can cause the program to segfault. Also, you can do the same in C with compiler extensions, and that’s generally considered a much better way to achieve unaligned datastructures than writing the whole program in assembly.

Note that I’m a pretty mediocre x86 assembly programmer; it might well be faster if written by someone more experienced. Pull request are welcome. Also note that if you exclude the printing code (I couldn’t get printf to work without segfaulting, so I copy-pasted an integer printing routine. Take a look; printing integers is actually ridiculously slow and complicated compared to printing strings.), the assembly code is still more concise than its Java equivalent. Heck, going by plain linecount the entire assembly file, including the ridiculous integer printing routine, is 225 lines, only around 15% more lines than the unsafe Java code :/ To be fair, however, I should note that the Java code does include numerous setters and getters, which aren’t strictly necessary. For those who aren’t familiar with assembly, here’s an example of setting the values in a struct, where rcb is the address of the struct and rbx is the value you want to set tradeId, price and quantity to (please don’t hesitate to let me know if you spot any mistakes, as I’m quite confident the x86-64 implementation doesn’t function completely as intended):

  mov qword [rcx], rbx ;tradeId = rbx
  mov qword [rcx+8], 1 ;clientId = 1
  mov dword [rcx+16], 123 ;venueCode = 123
  mov dword [rcx+20], 321 ;instrumentCode = 123
  mov qword [rcx+24], rbx ;price = rbx
  mov qword [rcx+32], rbx ;quantity = rbx

Go – Almost matches C, quite impressive considering how many more man-hours the LLVM has had put into it than the Go compiler. Why is it so much faster than C# and Java? Maybe because it allows me to pointer chase, manipulating data in the array directly without generating any temporary objects, and hence minimising garbage creation. Something I learned: on x86-64, Go ints are 64bit; I initially made the mistake of assuming they were 32bit like C ints.

Common Lisp (SBCL) – Incredibly impressive speed for a dynamic language. Note in particular that I didn’t even add type annotations, as my attempts at adding them didn’t seem to speed the implementation up at all. The speed drops as NumTrades increases however, and profiling with time() reveals that garbage collection takes up a significant (60%+) percent of the running time. This means that either the compiler isn’t able to convert:

 (let ((trade-ref (aref trades i)))
   (progn
   (setf (lisp-memory-trade-trade-id trade-ref) i)
   (setf (lisp-memory-trade-client-id trade-ref) 1))

into

 struct CMemoryTrade *trade = &(trades[i]);
  trade->TradeId = i;
  trade->ClientId = 1;

or that it’s unable to stack allocate trade-ref so as to avoid it becoming heap garbage. That, or its bignum implementation generates more garbage. The same applies to the Racket compiler and the JVM.

*Edit: Thanks to type annotations it’s possible to make the Common Lisp implementation perform even faster, via avoiding bignums and overflowing instead as C, Go and Java do. This result is titled CLispFixnum.

Pharo – Sure it’s slow, but it’s had far fewer development hours put into it than any of the other languages used. It was however one of the funnest implementations to write, and also the most concise. Wait, you say, Pharo’s not in the list? I converted the program to GNU Smalltalk as that’s easer to run from the command line and around 50% faster, but it was developed in Pharo. It’s also not too slow compared to CPython. If I was given the resources to write my own language (tangent incoming; feel free to skip the next two paragraphs)

it would be a statically typed Smalltalk (or maybe something closer to Self) with type inference, a focus on referential transparency, and incremental compilation. If the methods of an immutable object only messaged other immutable objects (were pure functions), it’d be possible to compile the whole object into an optimised block that was potentially as fast as the equivalent C code. The compiler would be a first class citizen. so it’d also be possible to compile non-referentially-transparent blocks with Compiler compile: myBlock. Track would be kept of all the objects that this block messaged and its assumed responses from them, and if any were changed then the compiled block would become invalid and revert to interpreted code. The file system would also be a first class citizen; compiled blocks could be saved as executables that took their messages as flags (./myblock doLotsOfMathsWithInput: someNum) and outputted the result to stdout, which the runtime could then capture. Block data would be stored in metadata files accompanying the executables, and the filesystem could be used as if it were a Smalltalk image. Note that this would allow easy parallelism via the OS process scheduler, a bit like Erlang. It would also allow for runtime code specialisation, as described in this link, making it faster than C in some cases.
For instance, the following function:

 int myFun(a, b, c){
      return a*b+c
  }

can be compiled to just return 12 + c if a and b are known at compile time to be 3 and 4, which is much faster. If these are runtime constants, however (Java final), while it should still be possible to do the same optimisation on the code, in practice the JVM doesn’t (and I don’t think the CLR does either). A language that took advantage of this kind of optimisation could be faster than c for problems where there’s a lot of information only known at runtime that doesn’t change once it’s loaded.

Racket – Very enjoyable to write. It was originally much faster, before I fixed a weird bug I was having. Can you spot the bug in the following code?

 (define trades (make-vector *NUM_RECORDS* [rkt-mem-trade 1 1 1 1 1 1 #\a]))

If not, neither did I. I couldn’t understand why my attempts to mutate the objects in the vector were producing weird results, especially when equivalent code in Common Lisp worked fine. Even though half the trades should have a side value of ‘S’ and half should have ‘B’, for some reason after the vector was initialised they all had ‘S’, even though my initialisation code had been checked and double-checked; Then I realised: for every other value, all the objects in the vector also shared the same value, the value of the final object in the vector. Why was this? Answer: all the objects in the vector pointed to the same object. Changing the array creation code to the following:

 (define (make-trade i) (rkt-mem-trade 1 1 1 1 1 1 #\a))
 (define: trades : (Vectorof rkt-mem-trade) (build-vector *NUM_RECORDS* make-trade) )

fixed the problem. It also slowed the code down significantly. It does however suggest at a solution to the problem of poor performance due to the lack of cache locality of *obj arrays compared to obj arrays: have all the *obj point to the same object, as then it’ll always be in the cache…

Lua – Luajit really is quite impressive, achieving a 10x speedup over the standard Lua interpreter, however still falling behind Racket and Common Lisp. Interestingly, although Luajit appears to use less heap, plain Lua actually handles conditions of high memory use better: the reason there is no Luajit implementation listed for numTrades values of 100 and 150 is because it died with an out of memory error while running them (the benchmarking script will automatically exclude languages that don’t compile or run properly).

Self – What’s Self? It’s a research language that’s like a more-pure form of Smalltalk. It was extremely awesome to write, but I didn’t include it as it ran out of memory on NumTrades values of less than 10 (and took over a second to run at NumTrades=10). Nevertheless I think it’s worth checking out for fans of Smalltalk as it does an even better job of ‘pure OO’, and is extremely elegant about it. My original implementation in Self was only around 20 lines of code, although there’s a few more now as I wasn’t sure how to concatenate strings for printing so I printed them the ugly way.

F# – A neat language, and generally fun to write. Not quite as fast as C#, although it was originally the fastest language. Why? Firstly, I made the same mistake as in Racket, initialising an array with a function that ended up filling it with references to the same object rather than different objects. After I fixed that, however, it was still really fast. This was because I didn’t realise that parameter-less functions had to be called with a () at the end; a function with one parameter can be called as ‘myFun param’, but a function with no parameters must be called as ‘myFun()’. In the process of figuring out what I was doing wrong, I ended up looking through the CIL (CLR bytecode) disassembly of the program, and I must say it was the one of the most nauseating things I’ve ever seen in my life. Imagine a language with twice the verbosity of Java and half the clarity of x86 assembly, and you’ve got a language that’s about four times as elegant as the CLR Common Intermediate Language. I really don’t think the following should belong in bytecode: *Edit: okay, I may be wrong here, and the CIL isn’t itself bytecode, but rather is translated into bytecode. In which case, ignore this complaint.

.class nested public auto ansi sealed serializable FSMemTrade
  extends [mscorlib]System.Object
  implements class [mscorlib]System.IEquatable`1, [mscorlib]System.Collections.IStructuralEquatable, class [mscorlib]System.IComparable`1, [mscorlib]System.IComparable, [mscorlib]System.Collections.IStructuralComparable {
  .custom instance void class [FSharp.Core]Microsoft.FSharp.Core.CompilationMappingAttribute::'.ctor'(valuetype [FSharp.Core]Microsoft.FSharp.Core.SourceConstructFlags) = (01 00 02 00 00 00 00 00 )

Even functions seemed to have multiple verbose constructors defined. The total assembly size was around 1800 lines, almost an order of magnitude more than the equivalent program written by hand or by the LLVM in an assembly language. If you ever plan on writing low-level code for a virtual machine by hand, my advice is to stick to the JVM, as its bytecode is much nicer (only 160 lines for the disassembly of Java.class, which is only 10 lines more than the original Java.java source file…). To be fair however, the disassembly of the C# executable (as opposed to the F# one) was only around 460 lines long, and didn’t have quite as much boilerplate.

Ruby - I don’t quite understand the fuss about Ruby; like Python, it seems to be a dynamic language that wants me to write constructors (and getters/setters), or else use ugly stuff like $trades[i].instance_variable_get(“@price”) . To be fair, it does seem to be slightly faster than CPython, but it’s still slower than Smalltalk, and if I was going to use a language that slow I’d definitely prefer Smalltalk. Maybe someone can enlighten me as to what’s so great about Ruby? (And Rails doesn’t count, as I’m not a webdev.)

*Edit: Okay, I just mentally compared Ruby to Javascript and PHP, and was overcome with love for the language. I think I now better understand its appeal.

*Edit2: Wow, I made a small change and now the Ruby implementation is twice as fast, faster than Smalltalk and beating CPython. Now I’m impressed.

Julia – The language is quite fast on many other tasks, comparable to C or Fortran, but it seems the optimiser doesn’t know what to do with this particular program. I suspect someone more familiar with the language than I am could significantly speed it up by changing the array of trades into a one-dimensional matrix of int64s and manipulating that; the language is built around numerical computing, and the compiler has specialised optimisations it applies when it detects idiomatic numerics code. Writing the language was quite pleasant; I think the future looks bright for Julia. It was also the most robust of the ‘slower’ languages: managing to complete 150 numTrades where GST, Ruby and Clojure dropped out. Note if you want to test it yourself that the Julia compiler takes quite a while to build, maybe at least half as long as Rust. Interestingly, this is the most concise language in terms of compressed source file size.

*Edit: Wow, a patch to make the global trades array local has sped it up massively, to near C speed. This shows the importance of being familiar with the gotchas of a language currently still in development in order to maximise its performance.

Clojure – Poor performance here can probably be blamed on the JVM itself, considering how poorly Java fares. The Clojure implementation was however actually the least enjoyable to write (although normally my experience with Clojure is quite pleasant), due to how the language goes out of its way to make declaring mutable types difficult. Where Racket only requires a #mutable tag to render a struct mutable, Clojure requires declaring an interface complete with getters and setters for every value, and then declaring a type that implements these getters and setters. Somebody please tell me there’s an alternative to the following monstrosity:

(definterface IMemTest
  (^Long gtradeId []) (^Long gclientId []) (^Long gvenueId []) (^Long ginstrumentCode []) (^Long gprice [])
  (^Long gquantity []) (^Character gside [])
  (stradeId [^Long v]) (sclientId [^Long v]) (svenueId [^Long v]) (sinstrumentCode [^Long v]) (sprice [^Long v])
  (squantity [^Long v]) (sside [^Character v]))
(deftype CJMemTest [^:unsynchronized-mutable ^Long tradeId ^:unsynchronized-mutable ^Long clientId ^:unsynchronized-mutable ^Long venueId
  ^:unsynchronized-mutable ^Long instrumentCode ^:unsynchronized-mutable ^Long price ^:unsynchronized-mutable ^Long quantity
  ^:unsynchronized-mutable ^Character side]
IMemTest
  (gtradeId [_] tradeId)(gclientId [_] clientId)(gvenueId [_] venueId)(ginstrumentCode [_] instrumentCode)
  (gprice [_] price)(gquantity [_] quantity)(gside [_] side)
  (stradeId [this v] (set! tradeId v)) (sclientId [this v] (set! clientId v)) (svenueId [this v] (set! venueId v))
  (sinstrumentCode [this v] (set! instrumentCode v))
  (sprice [this v] (set! price v)) (squantity [this v] (set! quantity v))(sside [this v] (set! side v)))

I wonder if Rich Hickey ever had any bad experiences with Godzilla, as the above code suggests that he must really hate mutation. Even Haskell makes declaring mutable datastructures easier. (I’d love if someone were to submit a Haskell information, by the way. My type-foo isn’t up to scratch for dealing with the state monad.)

Finally, the twist: although the C, Go and C# implementations are faster, they’re also wrong! Run the Clojure version with sufficient numTrades changing +’ to + and it’ll throw an ‘integer overflow’ exception, kindly letting us know as the other implementations didn’t that the buyCost and sellCost variables are overflowing. The other Lisps, Python and Smalltalk handle this by automatic promotion to bignum. The benchmark is hence not particularly fair, as bignum operations are slower than standard integer multiplication, and the dynamic languages are doing bignum calculations while the statically typed languages are simply overflowing. Look at the numbers for numruns < 100 for fairer comparison, as no languages are using bignums then. Also note that the Clojure version can use +’ instead of + to automatically promote to bignums and avoid the overflow, which the current version does.

I bet you thought that was all. But no, there’s more! Long story short, the Java code in the original blog post was poorly written from a performance perspective (unnecessary ‘new’s). I cleaned it up, and voila, it’s faster than the original unsafe Java implementation and quite close to the C. Moral of the story: don’t resort to sun.misc.unsafe until you’ve removed all the ‘new’s from your performance-sensitive code. In the following table, each language suffixed with ’2′ is an implementation where the following (or equivalent):

JavaMemoryTrade trade = new JavaMemoryTrade();
 trades[i] = trade;
 trade.setTradeId(i);
 trade.setClientId(1);

is changed to:

trades[i].setTradeId(i);
 trades[i].setClientId(1);

Here it is:

NumTrades = 150 

Language Compiler Compile time (s) Running time (ms) % Fastest Resident mem use (KiB) Compressed source size Lines of code Number of characters Executable size (KB)
x86-64 assembly fasm 0.3751 120 100.00 300528 2279 180 5840 8
Go go 0.7014 122 98.36 352484 614 61 1206 2256
C clang 0.0903 127 94.49 352044 712 56 1425 9
Go2 go 0.3431 127 94.49 352476 594 59 1170 2256
C2 clang 1.8033 129 93.02 352044 689 54 1374 9
Java2 javac 2.4070 164 73.17 706224 803 118 3239 1
CLispFixNum - 0.0000 228 52.63 810536 745 44 2152 0
Julia - 0.0000 298 40.27 601312 552 52 1228 0
F#2 fsharpc 4.8907 343 34.99 729336 637 38 1521 12
C#2 mcs 1.3778 347 34.58 727544 849 112 3213 4
Common Lisp - 0.0000 1108 10.83 902192 715 44 2103 0
C# mcs 0.7031 1592 7.54 999880 878 115 3301 4
Java javac 0.8713 1612 7.44 912288 819 121 3329 1
Common Lisp2 - 0.0000 1644 7.30 902772 675 39 1978 0
Python2 (Pypy)2 - 0.0000 1676 7.16 1382296 609 42 1382 0
Racket raco 16.9661 1989 6.03 875392 691 38 1959 11142
F# fsharpc 4.1842 2274 5.28 1453200 646 40 1526 12
Racket2 raco 19.8999 2334 5.14 875384 698 36 2020 11142
Python2 (Pypy) - 0.0000 3443 3.49 1519136 561 38 1192 0

NumTrades = 100 

Language Compiler Compile time (s) Running time (ms) % Fastest Resident mem use (KiB) Compressed source size Lines of code Number of characters Executable size (KB)
x86-64 assembly fasm 0.2580 70 100.00 200468 2279 180 5840 8
Go go 0.3212 82 85.37 235300 614 61 1206 2256
C clang 0.1121 83 84.34 234856 712 56 1425 9
C2 clang 1.7537 85 82.35 234856 689 54 1374 9
Go2 go 0.3121 85 82.35 235288 594 59 1170 2256
Java2 javac 2.7039 106 66.04 530540 803 118 3239 1
CLispFixNum - 0.0000 152 46.05 594160 745 44 2152 0
Julia - 0.0000 198 35.35 436676 552 52 1228 0
F#2 fsharpc 5.5708 228 30.70 490452 637 38 1521 12
C#2 mcs 1.6621 231 30.30 488656 849 112 3213 4
Common Lisp - 0.0000 652 10.74 866428 715 44 2103 0
Java javac 0.8986 831 8.42 632704 819 121 3329 1
C# mcs 0.7101 979 7.15 827400 878 115 3301 4
Python2 (Pypy)2 - 0.0000 990 7.07 955840 609 42 1382 0
Common Lisp2 - 0.0000 1012 6.92 866200 675 39 1978 0
Racket raco 16.8806 1219 5.74 639732 691 38 1959 11142
F# fsharpc 4.0892 1386 5.05 976720 646 40 1526 12
Racket2 raco 23.0935 1461 4.79 639720 698 36 2020 11142
Python2 (Pypy) - 0.0000 2213 3.16 1183096 561 38 1192 0
Lua - 0.0000 4730 1.48 2202468 554 52 1208 0
Ruby - 0.0000 5262 1.33 2233164 605 56 1250 0
Smalltalk (GNU) - 0.0000 5345 1.31 611088 712 76 1641 0
Python3-2 - 0.0000 7901 0.89 1582084 607 42 1380 0
Python3 - 0.0000 9863 0.71 1582080 607 39 1268 0

It can be seen that this optimisation results in a huge speedup for the Java version, as it reduces object creation and hence reduces garbage collection. Similar for C#, F# and Python. Interestingly however it has an adverse effect on the performance of the Lisp implementations, suggesting that an object is created for each aref/vector-ref, so minimising the use of those functions produces better performance. This suggests, somewhat sadly, that object pooling may not be an effective option in Lisp, as there’s no way to mutate objects in an array without generating garbage (short of using unsafe ffi functions).

Finally, the most useful part of the benchmark: benchmarking the fastest implementation in each language, with the statically typed languages all using bignums so as to produce a fair result. The results are as follows:

NumTrades = 150

Language Compiler Compile time (s) Running time (ms) % Fastest Resident mem use (KiB) Compressed source size Lines of code Number of characters Executable size (KB)
C3 clang 1.9361 252 100.00 352224 750 60 1492 9
Java3 javac 2.6750 547 46.07 797152 864 119 3386 2
Common Lisp - 0.0000 1116 22.58 902192 715 44 2103 0
Racket raco 20.1457 1957 12.88 875388 691 38 1959 11142
C#3 mcs 1.6699 2100 12.00 727712 893 114 3268 4
F#3 fsharpc 5.8450 2125 11.86 731956 659 39 1541 12
Go3 go 0.6288 2515 10.02 352956 656 62 1313 2799
Python2 (Pypy) - 0.0000 3333 7.56 1519136 561 38 1192 0

NumTrades = 100

Language Compiler Compile time (s) Running time (ms) % Fastest Resident mem use (KiB) Compressed source size Lines of code Number of characters Executable size (KB)
C3 clang 1.7614 162 100.00 235036 750 60 1492 9
Julia - 0.0000 198 81.82 421180 552 52 1228 0
Java3 javac 2.9222 369 43.90 561684 864 119 3386 2
Common Lisp - 0.0000 644 25.16 866428 715 44 2103 0
F#3 fsharpc 5.4417 1073 15.10 492908 659 39 1541 12
C#3 mcs 1.3916 1106 14.65 488812 893 114 3268 4
Racket raco 20.3282 1209 13.40 639704 691 38 1959 11142
Go3 go 0.6998 1659 9.76 235768 656 62 1313 2799
Python2 (Pypy) - 0.0000 2122 7.63 1183156 561 38 1192 0
Ruby - 0.0000 4258 3.80 2233108 605 56 1250 0
Lua - 0.0000 4600 3.52 2202356 554 52 1208 0
Smalltalk (GNU) - 0.0000 5295 3.06 611084 712 76 1641 0
JuliaBigNum - 0.0000 7726 2.10 1001056 565 52 1227 0
Python3 - 0.0000 10212 1.59 1582084 607 39 1268 0

C really shines here, courtesy of libGMP and stack allocation, but Go drops well behind, presumably due to its bignum library being slower and generating more garbage. Java performs relatively well, which I imagine is due either to its garbage collector being better or its bignum implementation generating little garbage. Interestingly, SBCL beats Go, F# and C#  (twice as fast as them at NumTrades=150) in spite of being a dynamic language and being compiled without any type annotations (they didn’t seem to make it any faster), which just goes to show that dynamically typed languages needn’t necessarily be slower than statically typed ones in all cases. Racket also appears to beat them slightly at NumTrades=150, too. If anybody knows a way to stop the SBCL implementation from consing I’d be interested to hear it, as garbage collection accounts for over 50% of the running time so avoiding it would being it close to Java in speed. I actually also made a SBCL implementation that just calls libgmp via the ffi, which results in no garbage collection occurring (which means aref isn’t creating garbage, so object pooling _is_ possible in Common Lisp, yay!), yet it’s about twice as slow. It’s in the repo as Lisp3.lisp if anyone wants to fix it (it’s inspired by the lisp pidigits code from the alioth language benchmarks).

Note that D, Rust or Nimrod could also perform as well as C here, simply by calling out to libGMP. As could Go, for that matter, but I thought for the sake of the benchmark it made more sense to use the native Bignum implementation.

If you want to run the benchmarks yourself, it’s as simple as:

 git clone https://github.com/logicchains/ArrayAccessBench
 go run Benchmarker.go

In the Benchmarker.go source the output and input file can be changed; as input, BenchmarkData.dat produces the first set of tables, BenchmarkData2.dat the second, and BenchmarkData3.dat the third. The numtrades values to test are stored near the top of the file in an array of strings, and the time to pause between runs can also be set via the WaitTime constant.

Hope you enjoyed the benchmark! The moral of this post: new considered harmful. The tl;dr of it: Lisp and Racket beat Go at Bignums. 

About these ads
This entry was posted in Uncategorized and tagged , , , , , , , , , , , , , , , , , . Bookmark the permalink.

33 Responses to Memory Access Microbenchmark: Clojure, F#, Go, Julia, Lua, Racket, Ruby, Self…

  1. Guy Murphy says:

    With the .NET bytecode… where you perhaps looking at CIL (http://en.wikipedia.org/wiki/Common_Intermediate_Language). This is the step before bytecode…

    1.Source code is converted to Common Intermediate Language (CIL), which is the CLI’s equivalent to assembly language for a CPU.
    2.CIL is then assembled into a form of so-called bytecode and a CLI assembly is created.
    3.Upon execution of a CLI assembly, its code is passed through the runtime’s JIT compiler to generate native code. Ahead-of-time compilation may also be used, which eliminates this step, but at the cost of executable-file portability.
    4.The computer’s processor executes the native code.

    This is I’ll admit lower level than my comfort-zone, and you’re clearly a more talented programmer, so my text isn’t meant to read as snide, I might have the wrong end of the stick… I’m just checking if you might be misinterpreting what CIL is… I still thought it was MIL and just found out its now called CIL. It was MIL I went looking for.

    • logicchains says:

      I was just assuming that CIL was bytecode, I didn’t realise that there may be another immediate step in which it’s transformed into bytecode. I’m more familiar with Java, where the intermediate language is the bytecode. Thanks for that; I’ll update the post to correct it.

  2. James says:

    Would Clojure’s agents let you write its implementation less hideously? Or maybe transients? Or hiphip array, if non-core libraries are allowed… in any case, I don’t see why a type definition has to be involved, unless it would definitely perform much worse without.

    • logicchains says:

      Can transients or agents be used with standard Java arrays? My reason for using a type definition is that types can be stuck in an array, and there’s a good change the JVM will store the objects sequentially in memory, improving cache performance.

  3. Scott Tadman says:

    I know that Ruby is not known for blazing performance relative to C in terms of execution speed, but the way this Ruby code is constructed is sabotaging things pretty badly. Adding a simple attr_accessor for each of the instance variables you want to access, which follows the Smalltalk convention of only exposing data you wish to have exposed rather than brute-forcing it with instance_variable_get, makes it perform twice as fast.

    Obviously Ruby won’t come out on top, but it is a capable language that’s very fast to write in and has an unusually high degree of expressiveness. Compared to Perl, Python and PHP, it’s very easy to read and even easier to learn. Just don’t expect it to have the same philosophy as C, it’s coming at programming from a completely different angle, just as Haskell, LISP and others have their own take.

    • logicchains says:

      Ah, thank you, I didn’t realise that would make such a difference. I’ve updated the Ruby version to reflect this, and made a note under the results table. I’m planning on running the benchmarks again in a few hours (it takes over an hour to run them all) and updating the tables.

  4. Timo says:

    The benchmark is a bit silly because the fastest languages (ASM, C, Go) seem to be only limited by memory bandwidth. I believe the difference between C and ASM comes mostly from the fact that the C struct has 7 byte padding, which causes more memory traffic. By using the smaller integer sizes (int->short, long->int) and struct packing the struct size can be reduced from 48 bytes to 21 and the code runs twice as fast. Also it’s worth mentioning that “long” is 32 bits on some platforms, and the correct type for 64-bit data would be “long long” or “int64_t”.

    • logicchains says:

      It actually mentions in the x86-64 section that the reason it’s faster is because it uses less memory due to lack of padding

      I’m using that pattern of longs and ints because they are what were used in the original benchmark that inspired this post; presumably ints and shorts would have insufficient capacity for that specific case, hence the use of longs and ints.

  5. jc says:

    Adding some type declarations to the Common Lisp version speeds it up 5 times (130ms per iteration vs 80 ms for C), and makes it give the same incorrect result as the C version for numTrades=150:
    - in the defstruct: (price 0 :type fixnum) (quantity 0 :type fixnum)
    - change the declared type of trades: (declaim ((simple-array lisp-memory-trade *) trades))
    - in perf-run declare the type of the intermediate result to be a fixnum, similar for sell-cost: (incf buy-cost (the fixnum (* (lisp-memory-trade-price trade) (lisp-memory-trade-quantity trade))))

  6. B says:

    Why does your Go code take a reference to trades([i]) and then continually dereference it?

    Instead:
    trade := &(trades[i])
    trade.TradeId = i
    trade.ClientId = 1
    trade.VenueCode = 123
    trade.InstrumentCode = 321

    It shouldn’t have any effect on speed, but the code is a lot nicer to look at

    http://play.golang.org/p/Jwms6USttT

    • logicchains says:

      You’re right. I was thinking Go didn’t automatically dereference struct pointers, but of course it does, it’s array pointers it doesn’t automatically dereference. Whoops.

  7. Sorry to be critical, but I’m afraid that this is a bit of a waste of time for anyone who reads this if the results of each language version do not match. Having some versions silently overflow and benchmarking against versions producing a “correct” result while saying “Look, this is 100 times faster” is a nonsense.

    Also please detail what system you are benchmarking on for comparitive/illustrative purposes. *Personally* I’m seeing numbers way higher than yours and failure to complete which gives me reason to doubt the benchmarks even further.

    • logicchains says:

      As it says near the top, the best versions for comparison are the final two graphs, in which all languages produce the same results. The first set are just a copy of the program described in the link in the first paragraph, which happened to overflow.

      I’ll add data on my system. It’s a few years old, so I’m not surprised you’re seeing better numbers.

      • Actually I’m seeing worse numbers:
        e.g. Java3 = 1040ms after the first run, Ruby = 10916 ms, Python dies

        This is a laptop with only a 4700MQ and 16GB hence I assumed these must have been run on the latest Xeon….

      • Oh BTW, don’t be too hurt by the critical responses. I personally find this kind of thing quite fun and actually learned a few things optimising your Ruby.

        I actually think that a site similar to the computer language benchmark game http://benchmarksgame.alioth.debian.org/ could be useful if it compared basic operations rather than entire algorithms….

      • logicchains says:

        At what value of NumTrades, and Which JVM are you using? Mine is:

        java version “1.7.0_25″
        OpenJDK Runtime Environment (IcedTea 2.3.10) (7u25-2.3.10-1ubuntu0.13.04.2)
        OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)

        Considering C#3 is around four times slower than Java3 for me, it’s not surprising if a different JVM could be slower.

        And, no worries. I think it would have worked better if I’d just focused on the bignum aspect of it. Essentially, it’s really just calculating n^2 + (n-1)^2 .. + 1^2 where X is NumTrades and is also < sqrt(LONG_MAX). That might make an interesting benchmark.

  8. Jeff C. Hunt says:

    Most micro benchmarks are silly, but this one is particularly so. The Lua code is *very* far from optimal (almost everything is global) and no Lua programmer would write anything remotely resembling it.

    • logicchains says:

      Well contributions are welcome. The idea of the benchmark is to have people more familiar with the languages used submit more optimal implementations, but maybe I didn’t make that clear enough. For comparison, the implementations from the final two tables are what you should be looking at, as they all do the same thing. Well, apart from the Lua, as that uses floats rather than bignums.

      • Jeff C. Hunt says:

        Right, but the whole thing appears to be a randomly chosen sequence of operations. If you actually came up with a feasible, realistic thing to implement, other people could get creative with the implementation and actually compete to improve them without mindless, verbatim translation.

      • logicchains says:

        You’re right; if I’d thought it through more I could probably have designed a much better benchmark. I think I focused too much on just verbatim translation of the program that inspired the benchmark, without giving enough thought to what was actually being benchmarked.

  9. Jeff C. Hunt says:

    The more I read the code for these benchmark, the more I’m in shock and awe and how dumb the idea is, how poorly implemented they all are and how none of them do quite the same thing.

    • logicchains says:

      If you read the whole post, you’d see that all the implementations suffixed 2 do the same thing, and all those suffixed 3 do too, and the ones suffixed 3 produce the same results as the Python, Ruby and Lisp implementations (which automatically promote to bignum). But I realise now I could definitely have organised it better.

  10. Pingback: F# Weekly #8, 2014 | Sergey Tihon's Blog

  11. Alex Miller says:

    The Clojure code had several issues in it. I’ve posted a pull request that addresses these issues. I think the two biggest problems were using boxed fields instead of primitive and calling def inside the loop and maybe the third was using a faster version of “odd?”.

    https://github.com/logicchains/ArrayAccessBench/pull/9

    I would expect this version should be comparable with the Java3 version.

    • logicchains says:

      Thank you! It is indeed comparable to the Java3 version. I feel like a complete idiot for not realising ^Long was boxed. I’m surprised that ‘def’ performs so poorly though, but I suppose there’s no reason something should work the same way in Clojure as it does in Scheme and Common Lisp.

  12. Tables are not displayed completely. Their width is wider than column. Very sad because there is plenty of free space.

  13. togo toto says:

    What a fucking retard you are. A veritable cunt.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s