The Computer Language
23.03 Benchmarks Game

A History

Once upon a time…

Doug Bagley had a burst of crazy curiosity: When I started this project, my goal was to compare all the major scripting languages. Then I started adding in some compiled languages for comparison…

That project was abandoned in 2002, restarted in 2004 by Brent Fulgham, continued from 2008 by Isaac Gouy, and interrupted in 2018 by the Debian Alioth hosting service EOL. Everything has changed; several times.

August 2020 through July 2021, Google Search Console saw 329,977 clicks.

Enough that many web search results show web spam and phishing - be careful!

a good starting point

How does Java compare in terms of speed to C or C++ or C# or Python? The answer depends greatly on the type of application you're running. No benchmark is perfect, but The Computer Language Benchmarks Game is a good starting point.

Differences in approach — to memory management, parallel programming, regex, arbitrary precision arithmetic, implementation technique — are part and parcel of using different programming languages.

So we accept something intermediate between chaos and rigidity — enough flex & slop & play to allow for Haskell programs that are not just mechanically translated from Fortran; enough similarity in the basic workloads & tested results.

We both accept PCRE and GMP and … library code; and refuse custom memory pool and hash table implementations.

The best way to complain is to make things.

As one person, you can’t help everyone. And you can’t make everyone happy because your tool is only a tiny part of their life. Don’t make that your goal: focus on learning and broadening your perspective. If it stops being constructive for you, then stop. Period.

Perhaps you would make different library choices? Perhaps you would include different programming language implementations? Perhaps you would make measurements on new computers?

Please do! Please use codespeed or hyperfine, or take the measurement scripts we use, and start making the kind of measurements you would like to see.
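If you'd rather make the measurements yourself, here is a minimal sketch in Java (an illustration only, not the project's actual measurement scripts): run the program as a child process several times and keep the fastest wall-clock time, start-up included. The command line is a hypothetical placeholder.

  // Minimal fastest-of-N wall-clock measurement sketch (illustration only).
  public class MeasureFastest {
      public static void main(String[] args) throws Exception {
          String[] command = {"./nbody", "50000000"};   // hypothetical program + argument
          int runs = 6;                                  // fastest-of-6
          long fastestNanos = Long.MAX_VALUE;

          for (int i = 0; i < runs; i++) {
              long start = System.nanoTime();
              Process p = new ProcessBuilder(command).inheritIO().start();
              p.waitFor();                               // elapsed time includes start-up
              fastestNanos = Math.min(fastestNanos, System.nanoTime() - start);
          }
          System.out.printf("fastest of %d runs: %.3f secs%n", runs, fastestNanos / 1e9);
      }
  }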

Dismiss, Distract

You will probably come across stuff that people have said about the benchmarks game. Did they show that what they claim is true? If they provided a URL, does the content actually confirm what they said?

Maybe they were genuinely mistaken. Maybe the content changed.

I heard that one is pretty bad…

Maybe they heard wrong.

…a source of comparable programs, which is not remotely true…

Long ago, some of these programs were part of the Go install. Presumably the Go programs were intended to be "comparable" to the C programs and more: actually intended to be the same. Some of those programs are still listed on the benchmarks game website.

What about programs that are not just C-transliterated into some other language? Can those different but similar programs be comparable programs?

What specific criteria could make it so obvious that these different programs should not be described as comparable programs?

… if your study claims that C++ uses … 56% more time … than C … Approximately every C program is a valid C++ program …

For that particular study, only 3 of the 9 selected C programs seem to compile as C++. Similarly, none of the selected JavaScript programs seem to compile as TypeScript.

For a single outlier (regex-redux) there's a 12x difference between the measured times of the selected C and C++ programs. For a single outlier (fannkuch-redux) there's a 15x difference between the measured times of the selected JS and TS programs.

        ratio of averages            average of ratios
        Mean   GeoMean  Median       Mean   GeoMean  Median
  C      1.00    1.00    1.00        1.00    1.00    1.00
  C++    1.56    1.34    1.03        2.37    1.34    1.00
  JS     6.52    6.13    6.01        7.64    6.13    7.25
  TS    46.20    8.60    6.12       20.67    8.60    7.80
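Those aggregations diverge because a single outlier skews an arithmetic mean far more than it moves a geometric mean or a median. A rough sketch in Java, using invented times (hypothetical, not the study's data), with one regex-redux-style outlier:

  import java.util.Arrays;

  // Illustration of "ratio of averages" versus "average of ratios" with
  // invented times; one outlier dominates the mean but barely moves the median.
  public class Aggregations {
      public static void main(String[] args) {
          double[] cTimes   = {1.0, 2.0, 4.0, 8.0};      // hypothetical baseline times
          double[] cppTimes = {1.0, 2.1, 4.2, 96.0};     // one ~12x outlier

          double[] ratios = new double[cTimes.length];
          for (int i = 0; i < ratios.length; i++) ratios[i] = cppTimes[i] / cTimes[i];

          System.out.printf("ratio of averages          %.2f%n", mean(cppTimes) / mean(cTimes));
          System.out.printf("average of ratios   mean %.2f  geomean %.2f  median %.2f%n",
                  mean(ratios), geomean(ratios), median(ratios));
      }

      static double mean(double[] xs) {
          return Arrays.stream(xs).average().orElse(Double.NaN);
      }

      static double geomean(double[] xs) {
          return Math.exp(Arrays.stream(xs).map(Math::log).average().orElse(Double.NaN));
      }

      static double median(double[] xs) {
          double[] s = xs.clone();
          Arrays.sort(s);
          int n = s.length;
          return n % 2 == 1 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
      }
  }

With those invented numbers, the mean of the ratios comes out several times larger than the median: the same pattern the TS row shows above.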

"The desire to have one number that represents relative … performance is certainly understandable, as we want to draw simple conclusions about one … compared to others." … "However, it should be made clear that any measure of the mean value of data is misleading when there is large variance."

… neither scientific nor indicative of expected performance…

… That having been said … on the current benchmarks, Rust already outperforms C++, which is a pretty big deal…

No, we have to choose —

- either we accept the "neither scientific nor indicative" dismissal and don't even consider "a pretty big deal";

- or we reject the "neither scientific nor indicative" dismissal and consider "a pretty big deal".

… nor indicative of expected performance on real-world idiomatic code.

We've certainly not attempted to prove that these measurements, of a few tiny programs, are somehow representative of the performance of any real-world applications — not known — and in any case Benchmarks are a crock.

There's a reason they call it a game

Wrong!

The name "benchmarks game" is just a name.

…to dispute a decision you basically need to pray the maintainer reopens it for some reason.

Never true. Followup comments could always be made in the project ticket tracker. There was a public discussion forum, etc. etc.

Someone's brilliant hack was rejected. Someone saw the opportunity to push traffic to their personal blog.

The guy that runs it arbitrarily decided to stop tracking some languages he didn't care about…

Measurements are no longer made for these —

ATS, FreeBASIC, CINT, Cyclone, Tiny C, Mono C#, Mono F#, Intel C++, Clang++, CAL, Clean, Clojure, Digital Mars D, GNU D, Gwydion Dylan, SmartEiffel, bigForth, GNU GForth, Groovy, Hack, Icon, Io, Java -client, Java -Xint, gcj Java, Substrate VM, Rhino JavaScript, SpiderMonkey, TraceMonkey, Lisaac, LuaJIT, Mercury, Mozart/Oz, Nice, Oberon-2, Objective-C, Pike, SWI Prolog, YAP Prolog, IronPython, PyPy, Rebol, Rexx, Scala, Bigloo Scheme, Chicken Scheme, Ikarus Scheme, GNU Smalltalk, Squeak Smalltalk, Mlton SML, SML/NJ, Tcl, Truffle, Zonnon.

Like everyone else, I'm sitting on my hands waiting for kostya and hanabi1224 to make and publish all the other program measurements.

I know it will take more time than I choose. Been there; done that.

Be curious

Wtf kind of benchmark counts the jvm startup time?

How much difference does it make for these tiny programs?

Let's compare our fastest-of-6 no-warmup measurements against the fastest-of-25 (or 55 or 175) with-warmup JMH SampleTime p(0.0000) measurements:

  java 14.0.1                              SampleTime
  i5-3330 (secs)          No Warmup       With Warmup
  fannkuch-redux #3          39.939            39.460
  fannkuch-redux #1          10.614            11.442
  n-body #1                   7.704             7.608
  n-body #4                   6.742             6.644
  spectral-norm #1            7.574             8.171
  spectral-norm #2            1.679             1.537
  spectral-norm #3            1.578             1.432

  JMH @Fork(value=55) SingleShotTime
  spectral-norm #3                              1.495
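For reference, a minimal JMH sketch of a with-warmup SampleTime setup. It assumes the org.openjdk.jmh dependency is on the build path, and the workload body is only a stand-in, not the actual spectral-norm program.

  import java.util.concurrent.TimeUnit;
  import org.openjdk.jmh.annotations.*;

  // Sketch of a with-warmup measurement; swap the stand-in loop for the real
  // program's work to reproduce the kind of numbers in the table above.
  @BenchmarkMode(Mode.SampleTime)        // Mode.SingleShotTime was used with @Fork(value=55)
  @OutputTimeUnit(TimeUnit.SECONDS)
  @Warmup(iterations = 5)
  @Measurement(iterations = 25)          // collect samples; p(0.0000) reports the fastest
  @Fork(value = 1)
  @State(Scope.Thread)
  public class SpectralNormBench {

      @Param({"5500"})                   // problem size passed as a JMH parameter
      public int n;

      @Benchmark
      public double workload() {
          double sum = 0.0;              // stand-in computation, not spectral-norm itself
          for (int i = 1; i <= n; i++) sum += 1.0 / ((double) i * i);
          return sum;                    // returned so the JIT can't eliminate the work
      }
  }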

On the one hand, JVM start-up, JIT, OSR… are quickly effective, and these no-warmup / with-warmup comparisons show a minuscule difference.

On the other hand, for measurements of a few tenths of a second, a few tenths of a second is a huge difference.

(In stark contrast to the traditional expectation of warmup, some benchmarks exhibit slowdown, where the performance of in-process iterations drops over time.)

The benchmarks game benchmarks your code… on a 2012 desktop machine?

Let's compare normalized published times of the same n-body programs: made in the cloud on a 2021 AMD EPYC 7763 and a 2019 Xeon 8272, and made on our 2012 i5-3330 desktop.

  n-body          AMD EPYC 7763   Xeon 8272   i5-3330
  programs              Q1'2021     Q2'2019   Q3'2012
  GCC C++                  1.00        1.00      1.00
  Rust #7                  1.33        1.53      1.47
  C gcc #8                 1.53        1.70      1.89
  C gcc #5                 1.86        1.93      2.81
  Rust #2                  1.72        1.98      2.57
  C gcc #2                 1.85        2.05      3.31
  Rust                     2.02        2.29      2.61
  Swift #7                 2.12        2.48      2.43
  Chapel #2                1.97        2.54      2.87
  OCaml                    2.23        2.81      3.14
  Go                       2.18        3.03      3.02
  C# .NET #8               2.24        3.03      3.25
  Java                     2.65        3.46      3.60
  Node js #6               2.98        4.05      3.88
  Dart #3                  2.67        5.23      5.78
  Racket #2                            7.24      6.48

We should expect that increasing use of hand-written vector instructions will cause changes in the relative performance of the most optimised programs.
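The normalization itself is simple: on each machine, divide every program's time by the time of the baseline program (here GCC C++) measured on that same machine. A minimal sketch in Java, with hypothetical times:

  import java.util.LinkedHashMap;
  import java.util.Map;

  // Normalize times against a baseline measured on the same machine.
  // All times here are hypothetical placeholders.
  public class NormalizeTimes {
      public static void main(String[] args) {
          double baselineSecs = 4.10;                    // hypothetical GCC C++ time
          Map<String, Double> secs = new LinkedHashMap<>();
          secs.put("GCC C++", baselineSecs);
          secs.put("Rust #7", 5.45);                     // hypothetical
          secs.put("Java", 13.30);                       // hypothetical

          secs.forEach((name, t) ->
                  System.out.printf("%-10s %.2f%n", name, t / baselineSecs));
      }
  }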

Apples and Oranges

We compare programs against each other, as though the different programming languages had been designed for the exact same purpose — that just isn't so.

The problems introduced by multicore processors, networked systems, massive computation clusters, and the web programming model were being worked around rather than addressed head-on. Moreover, the scale has changed: today's server programs comprise tens of millions of lines of code, are worked on by hundreds or even thousands of programmers, and are updated literally every day. To make matters worse, build times, even on large compilation clusters, have stretched to many minutes, even hours. Go was designed and developed to make working in this environment more productive.

Most (all?) large systems developed using Erlang make heavy use of C for low-level code, leaving Erlang to manage the parts which tend to be complex in other languages, like controlling systems spread across several machines and implementing complex protocol logic.

Lua is a tiny and simple language, partly because it does not try to do what C is already good for, such as sheer performance, low-level operations, or interface with third-party software. Lua relies on C for those tasks.