Fibonacci numbers on the Galaxy S3: ARM benchmarks of Rust, Ocaml, Haskell, Go, Racket, Lua, C# and Java

I recently upgraded my phone to a Galaxy S3, an ARMv7 device with 2 GB of RAM and a quad-core 1.3 GHz processor. After installing Arch Linux on it, which seemed like the right thing to do (only in a chroot; I'm not brave enough to boot directly into it), I thought it'd be interesting to do some benchmarking and see how various language implementations perform on ARM compared to on my x86-64 laptop (an Intel dual-core i5 M 430 at 2.27 GHz).

This benchmark involves calculating a large number of Fibonacci numbers repeatedly. The exact code used is available here. Note that the purpose of this benchmark is to compare the ARM and x86 implementations of each language, not to compare the different languages to each other, so no effort has been put into optimising the different implementations.
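
The exact code is in the linked repo; the C below is not that code, just a rough sketch of the general shape (the function names and constants are made up for illustration): a tight loop that repeatedly computes small Fibonacci numbers, with a modulus keeping the argument small so nothing overflows and the compiler can't throw the loop away.

```c
/* Hypothetical sketch only -- not the code from the repo. */
#include <stdio.h>
#include <stdlib.h>

/* Iterative Fibonacci of a small argument. */
static long fib(long n) {
    long a = 0, b = 1;
    while (n-- > 0) {
        long t = a + b;
        a = b;
        b = t;
    }
    return a;
}

int main(int argc, char **argv) {
    long iterations = argc > 1 ? atol(argv[1]) : 500000000L; /* e.g. 500 million */
    long acc = 0;
    for (long i = 0; i < iterations; i++)
        acc += fib(i % 25);   /* the modulus keeps the argument small */
    printf("%ld\n", acc);     /* print the result so the work can't be elided */
    return 0;
}
```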

The benchmarks were run with an input of 500 million, and each language's speed is expressed as a percentage of the C implementation's speed on the same machine. Working relative to C, rather than in absolute terms, makes the two machines easy to compare, since it minimises the effect of their different processor speeds.

The results from my x86-64 laptop are as follows (input of 500 million):

| Language       | Running time (ms) | % C speed |
|----------------|-------------------|-----------|
| C (Clang)      | 3726              | 100       |
| C# (Mono)      | 7625              | 48.8656   |
| Go             | 8489              | 43.8921   |
| Haskell        | 9839              | 37.8697   |
| Java (OpenJDK) | 4237              | 87.9396   |
| LuaJit         | 15135             | 24.6184   |
| Ocaml          | 9668              | 38.5395   |
| Racket         | 22785             | 16.3529   |
| Rust           | 3910              | 95.3      |
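
To make the percentage column concrete: Java on x86-64 took 4237 ms against C's 3726 ms, so it is listed as 3726 / 4237 ≈ 87.9% of C's speed.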

The results from my ARMv7 Galaxy S3 (input of 500 million):

| Language       | Running time (ms) | % C speed |
|----------------|-------------------|-----------|
| C (Clang)      | 5051              | 100       |
| C# (Mono)      | 52736             | 9.5779    |
| Go             | 12246             | 41.2461   |
| Haskell        | 22304             | 22.6462   |
| Java (OpenJDK) | 50079             | 10.0861   |
| LuaJit         | 21738             | 23.2358   |
| Ocaml          | 28774             | 17.554    |
| Racket         | 56519             | 8.93682   |
| Rust           | 5741              | 87.9812   |

The performance of each language on ARMv7 expressed as a percentage of its x86-64 performance:

| Language       | Performance (% of x86-64 speed) |
|----------------|---------------------------------|
| C              | 100                             |
| LuaJit         | 94.3839                         |
| Go             | 93.9716                         |
| Rust           | 92.3190                         |
| Haskell        | 59.8003                         |
| Racket         | 54.6498                         |
| Ocaml          | 45.5481                         |
| C# (Mono)      | 19.6005                         |
| Java (OpenJDK) | 11.4693                         |
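
To spell out how these numbers are derived: they compare each language's relative-to-C figures on the two platforms (so C is 100 by definition). For example, LuaJit runs at 23.2358% of C speed on ARM and 24.6184% on x86-64, and 23.2358 / 24.6184 ≈ 94.4%, the figure in the table.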

The benchmarks can be run with `sh fib.sh 500000000 true`, replacing true with false if the executables have already been compiled. This prints an HTML table to stdout and stores the results in text form in `./sortedResults`. Rename that file to something like armResults or x86Results; two such files can then be compared with `sh fibdiff.sh armResults x86Results`, which outputs the above HTML table to stdout.

Again, note the purpose of this benchmark is not to compare the languages with each other, but to see how well each one performs on ARM compared to on x86.

Interestingly, the Rust implementation actually runs faster on ARM. I suspect this indicates a bug of some sort in the compiler, as Rust generally performs at close to C speed. Both the ARM and x86 implementations are running on the latest Arch Linux build, but the Rust version on ARM appears to be "0.11.0-pre", whereas on x86 it's "0.11.0", so it's possible there was some regression between those two versions.

*Edit: the Rust version used Rust's int type, which is 64-bit on x86-64, whereas C's int is 32-bit. This exaggerated Rust's ARM performance relative to its x86-64 performance. The results table has now been updated to show results using Rust's fixed-size i32 type, which is much faster on x86-64. It can now be seen that Rust performs quite well on both platforms, which makes sense considering its use of the LLVM backend.
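
For a sense of where the 64-bit overhead comes from: the hot operation is an integer modulus, and on these x86-64 chips a 64-bit divide is noticeably more expensive than a 32-bit one. A rough C analogue (not the benchmark code):

```c
#include <stdint.h>

/* Roughly what the original Rust code did on x86-64: 64-bit operands, so the
   modulus compiles to the slower 64-bit divide. */
int64_t step64(int64_t i, int64_t m) { return i % m; }

/* Roughly what the i32 version does: 32-bit operands, 32-bit divide. */
int32_t step32(int32_t i, int32_t m) { return i % m; }
```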

*Another edit: according to strncat on Reddit: “Rust is slower than C at this task on ARM (and x86_64, once the 64-bit overhead is gone) because it uses LLVM’s segmented stack support for stack overflow checking. The sane alternative is outputting a one byte write in multiples of the guard size, so 99.5% of functions won’t have any overhead, and the remaining ones will only have some writes rather than the branch that is there today.”

Java on ARM is hilariously slow, at a tenth of its x86 speed, which I suspect may be because the JVM I'm using, OpenJDK, doesn't yet have a JIT compiler on ARM, or, if it does, not a particularly well developed one. The Oracle JVM is apparently faster on ARM.

C# (via Mono) is also much slower on ARM, 1/5 of its x86 speed. Presumably ARM code generation hasn’t received as much attention as x86.

LuaJit's performance on ARM, almost equal to its x86 performance, proves once again that Mike Pall is a genius. Interestingly, since every number in Lua is a double, it may actually have an advantage here: the code contains a modulus operation, and (as far as I'm aware) the ARM chip in this phone has no hardware support for integer division, only floating-point division.
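
As a concrete illustration (my own aside, and assuming the S3's quad-core chip is a Cortex-A9 class core without the hardware-divide extension), an integer modulus with a non-constant divisor can't be done in a single instruction there; compilers fall back to a runtime helper:

```c
/* With a non-constant divisor on an ARMv7-A core lacking the hardware divide
   extension, this modulus compiles to a call to the EABI runtime helper
   __aeabi_idivmod rather than to a single divide instruction. */
int wrap(int i, int m) {
    return i % m;
}
```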

Go code also runs surprisingly well on ARM, nearly matching its x86 performance. I'm not sure whether this is due to a lot of work having been put into ARM code generation or not so much work having been put into x86 code generation, but it certainly bodes well for future plans to allow the use of Go in the Android NDK.

Racket, OCaml and Haskell all run at about half their x86 speed on ARM, which seems reasonable: they're presumably not often run on ARM, so their ARM performance probably hasn't received much attention.

As for the C implementation, it's interesting to note that its absolute speed on the ARM device is almost 74% of its x86 speed (5051 ms versus 3726 ms), in spite of the x86 device being a 16-inch i5 laptop and the ARM device being just a 5-inch phone.

Miscellanea

The Haskell implementation is named hsfib.hs instead of fib.hs, breaking from the naming conventions of the other implementations. Why is this? If I name it fib.hs, GHC will notice the fib.o left over from compiling the OCaml implementation and try to link that, with hilarity ensuing soon thereafter. This could be avoided if GHC checked whether the object files it was attempting to link were actually Haskell object files.

Getting an Integer from a Number in Typed Racket seems to take a lot of work:
`(numerator (inexact->exact (real-part n)))`
Although there’s probably a simpler method I’m missing.

There’s a Common Lisp implementation in the repo, but the ARM support of SBCL is quite recent and I couldn’t get it to build.

At least 70% of the C# implementation is just copy-pasted from the Java implementation. Syntactically, those languages seem even more similar than Lisp and Scheme.

Recursion in Lua is interesting. Doing a comparison like `if n == 1` does not work out well, due to the nature of floating point equality, so `if n < 1` is necessary instead.

The correctness of the Rust and C implementations is actually optimisation-dependent. If compiled with -O, they use constant space; if compiled without it, they use linear space and fail with a stack overflow, because the recursion is only turned into a loop (via tail-call elimination) when the optimiser runs. Never, ever write code like this for any project that's even remotely important.
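
A minimal C illustration of the effect (not the repo's code; the names are made up): a tail-recursive driver that clang will normally turn into a loop at -O2, but which keeps half a billion stack frames alive when built without optimisation.

```c
#include <stdio.h>

/* Tail-recursive driver: with -O2, clang typically converts the self tail
   call into a jump, so stack usage is constant; without optimisation each
   call gets its own frame, and 500 million frames overflow the stack. */
static long loop(long remaining, long acc) {
    if (remaining == 0)
        return acc;
    return loop(remaining - 1, acc + remaining % 7);  /* tail call */
}

int main(void) {
    printf("%ld\n", loop(500000000L, 0));
    return 0;
}
```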

OCaml's Emacs support is really awesome. I installed a couple of plugins, Tuareg and Merlin, and not only did they give me autocompletion but also automatic error checking whenever I save. I was particularly impressed when this combined with OCaml's type checking to warn me that the float I was accidentally passing to printf didn't match the type required by the %d in the format string; even Haskell's default printf function doesn't do compile-time checking of the format string.

*Edit: I was asked why I didn’t include Nimrod. Nimrod compiles to C, so it would be the same speed on both ARM and x86. Seems like it would be a good option for ARM development.

Bonus

I ran the Java implementation of the benchmark under Dalvik. The results are rather underwhelming, to say the least. Here are the results from running with an input of 500 million (the same as above), with the Dalvik results included.

| Language       | Running time (ms) | % C speed |
|----------------|-------------------|-----------|
| C              | 5051              | 100       |
| C# (Mono)      | 52736             | 9.5779    |
| Dalvik         | 75699             | 6.6725    |
| Go             | 12246             | 41.2461   |
| Haskell        | 22304             | 22.6462   |
| Java (OpenJDK) | 50079             | 10.0861   |
| LuaJit         | 21738             | 23.2358   |
| Ocaml          | 28774             | 17.554    |
| Racket         | 56519             | 8.93682   |
| Rust           | 5741              | 87.9812   |

Now that's why the new ahead-of-time-compiled Android Runtime (ART) is A Good Thing. It's also possibly part of the reason why the Android UI is less fluid than the iPhone's, and I say this as someone who wouldn't use an iPhone if you paid me. Where would Objective-C appear in that table? Right at the top: Objective-C is a superset of C, so valid C code is generally valid Objective-C code, making it extremely easy to write fast code in Objective-C (just write C).

Note however that I ran the Dalvik file (compiled with `dx --dex --output=fib.dex fib.class`) with `dalvikvm -cp fib.dex fib`, the simplest way to run it. It's quite possible there's a better way to compile/run it that makes it faster; I'm not particularly familiar with Android development. There apparently exists a tool called 'dexopt', which is run automatically when a new apk is installed, but I'm not sure whether it's run when I execute a .dex file manually with `dalvikvm`. Running `dexopt` tells me to look at the dexopt source for usage information; I don't have time to figure it out right now, but if someone wants to send me a script for running dexopt on a .dex file I'll be happy to test it.

Moral of the story: performantly porting a JIT compiler to ARM is difficult (unless you’re Mike Pall).

*Edit: apparently development of the ARM port of Luajit was sponsored, so the developer may have had more incentive to optimise it than the developers of the ARM ports of OpenJDK and Mono.


26 Responses to Fibonacci numbers on the Galaxy S3: ARM benchmarks of Rust, Ocaml, Haskell, Go, Racket, Lua, C# and Java

  1. Is this real life ? says:

    It would be interesting to see the results for Android with ARM+NEON, -O3 compiler flags.

  2. Daniel Micay says:

    Your x86_64 Rust results are slow because the int type is defined as pointer-size. You should use i32 instead for an apples to apples comparison.

    • logicchains says:

      The point is not to compare the languages to each other, but only with themselves on ARM. I imagine the common case of porting a program to ARM would be that it primarily used ints, not i32s. Ints are more commonly used than fixed-size types in other languages, in my experience.

      • John Doe says:

        Then why include tables containing ratios in comparison to C at all? Or at least, move them further down as annex data, instead of having them as the first visible tables.

        Your benchmarks ought to do the same exact thing in all languages. If Lua uses floats, it's not doing the same thing. Neither would it be in JS, but you're not including it although it's a ubiquitous language and more relevant on current ARM devices, thus way more interesting.

        You’re timing the number parsing in most languages except Lua and Rust, then in some of those, you’re timing the cast (Go) or transformation (Racket) to integer, and finally, you’re timing the formatted output.

        Simply take all of those out of the benchmarks; they obviously introduce a constant time that is not interesting. I don't care if the code isn't so tidy, as long as it's correct.

        If you care about number parsing, integer casting or float to integer truncation and text formatting, make separate benchmarks for each of those.

        You ought to justify why you use accumulating recursion in some languages and not in others. Is it that they are actually performing tail-call optimization? Then they should support deep recursion without a stack overflow, unlike a few of the other languages. You should either eliminate recursion where tail calls are not guaranteed, or test stack blowing for an advantage/disadvantage analysis.

        You ought to justify why you didn’t declare fixnums or 32-bit integers in Common Lisp (I know you didn’t include it in the article, but it’s in the repo) and Racket, because you could make it much faster. On the other hand, if you don’t optimize them, they’ll support bignums, unlike a few of the other languages which will either fail (not that bad) or silently overflow (really bad). You should either optimize them, or include tests where languages with number restrictions either fail/overflow (32-bit integers) or yield imprecise results (floating-point numbers) for an advantage/disadvantage analysis.

        BTW, in Racket, the numerator might be a really big number: (numerator (inexact->exact 2.1)) => 4728779608739021. You ought to use (inexact->exact (truncate (real-part n))).

        Conclusion: your performance benchmarks are not testing just one thing, they’re not 1:1 comparable, some are not even correct, and you’re not providing an analysis of the advantages and disadvantages of each languages, i.e. what is it that you lose by using C or on the other hand what is it you earn by using languages with bignums.

        Overall, this is a good effort, but for this kind of post, you must level up.

      • logicchains says:

        I’ve sorted the first two tables alphabetically instead in order to make it seem less like it’s comparing languages.

        The benchmarks do not need to do the same thing in all languages as they aren’t meant to be comparing the languages. As the post says many times.

        The time taken to parse the input int and print the results is minuscule compared to the time spent in the main loop.

        There’s no particular reason for using recursion in some languages and not others, it’s just how I felt like writing them. As each language is only being compared with itself, this shouldn’t matter.

        Again, I’m not analysing the advantages and disadvantages of each language as I’m not comparing them, I’m only comparing how well each language works on ARM compared to on x86.

  3. None says:

    Interesting idea, but poor test. Try again with compression or encryption. I think that would produce much more useful results.

  4. Philippe Marschall says:

    I would expect Oracle JDK to be much faster than OpenJDK on ARM. The reason being that only Oracle JDK has the C1 and C2 JITs. For whatever reason Oracle open sources C1 and C2 only for x64, not ARM. See also https://blogs.oracle.com/jtc/entry/comparing_jvms_on_arm_linux.

  5. What flags did you use to compile for each language?

  6. I compiled the OCaml version using the pre-release 4.02 version and I get a huge improvement.

    When I compiled with the latest release version (4.01) I was getting around 9200 (faster than your results, but I'm on an Intel Core i7, maybe you're on an i5?). After compiling with 4.02.0 beta I get:

    75699
    LANGUAGE Ocaml 2796

    …which is actually faster than your C results.
    (that’s on my x86-64 core I7 laptop)

    What compile flags did you use for your C version?

    • logicchains says:

      You're correct, I'm using an i5 M, 2010 model. Flags for the C were just -O3.

      I saw on Reddit that the speedup comes from Ocaml using a magic constant instead of idiv for the mod instruction, very impressive. I actually considered doing that manually for the implementations, but I could only find a formula for division, not for getting the modulus.

      You’re probably not the right person to ask, but will 4.02 include the reentrant Ocaml runtime? Real threading support in Ocaml is probably one of the things I’m looking forward to most this year.

  7. Madmann says:

    I experimented a little with your Haskell code and found out that changing the 'mod' operation into a 'rem' (since the second argument is positive) gave me a 20% boost in performance (you will need to change the name of the variable rem in 'doWork' to be able to use this Prelude function). Could you try this as well?

    • logicchains says:

      I tried that and it actually made the Haskell version around 40% faster than the C version. I suspect in that case GHC is smart enough to figure out that the Fibonacci function doesn’t need to be called n times, whereas Clang isn’t.

  8. funnuy says:

    Notice that mod in Haskell isn't the same as % in C.

    in C:
    2 % 3 = 2
    2 % -3 = 2
    -2 % 3 = -2
    -2 % -3 = -2

    in Haskell:
    2 `mod` 3 = 2
    2 `mod` (-3) = -1
    (-2) `mod` 3 = 1
    (-2) `mod` (-3) = -2

    You should use rem instead, which is faster and behaves like %. With this little change, Haskell is even faster than C on my old x86-64 laptop:

    $ clang -O3 ./fib.c -o fib && ./fib 500000000
    75699
    LANGUAGE C 4089

    $ ghc -fllvm -O3 hsfib.hs && ./hsfib 500000000
    75699
    LANGUAGE Haskell 2357

    • logicchains says:

      Wow, I tested it and the Haskell is indeed much faster. Any idea what it does? I suspect maybe it’s figured out it doesn’t need to do the Fibonacci function more than a certain amount of times.

  9. Isaac Gouy says:

    Why do you think the old Haskell benchmark suite is called NoFib?

    Why are you doing something worse than the benchmarks game?

    • logicchains says:

      The benchmark game compares the performance of different languages on particular platforms. This is meant only to compare the performance of each language on multiple platforms, not to compare languages with each other.

      • Isaac Gouy says:

        1) That doesn’t make it OK to base your comparison on 5-line Fib snippets.

        2) If the point was to compare the same program on different platforms, you would present 9 tables (one for each language implementation) each with 2 rows (one for each platform).

        The comparison you present is between 9 language implementations on the same platform, for two different platforms.

      • logicchains says:

        Nine tables of two rows doesn’t seem particularly ergonomic. I’m going to take the traditionalist approach and maintain that I’m comparing what I say I’m comparing, not what my tables apparently suggest I’m comparing.

  10. Jun Furuse says:

    At least for OCaml, this is not a benchmark of fib. Printf debugging of the fib argument reveals:

    1
    2
    3
    5
    10
    15
    25
    0
    0
    0
    0
    0
    0
    0
    0
    0
    0
    0
    0
    0

    So after fib 25, it continuously computes fib 0

    • Jun Furuse says:

      The code is almost measuring the speed of mod. This is why the integer div optimization of OCaml 4.02 affected the result so tremendously.

      • logicchains says:

        Right, but that optimisation wouldn’t affect the relative performance difference between Ocaml on x86 and ARM, would it?

        I’ll admit it was probably a misnomer to call it a Fibonacci number benchmark, as the Fibonacci number bit is only there to stop the compiler from optimising out the whole loop. I just wanted an extremely simple program to compare ARM and x86 performance.

    • logicchains says:

      I'm aware of this. I couldn't have it continuously compute larger Fibonacci numbers, as that would overflow into bignum territory, and it would become essentially a bignum library benchmark.
