How programs are measured (Benchmarks Game)

As-of 2025, BenchExec

Time-to-Solve

The host machine is isolated from networks and unloaded. (Of course, that is not how programs are used!)
To provide each program with a similiar context, file system caches and swap are cleared before measurements are made for each program. As part of that reset, the first measurement is excluded from the mean and confidence intervals. To ameliorate measurement bias we vary UNIX environment size.
For a different approach, see Robust benchmarking in noisy environments.
If the program gives the expected output within an arbitrary cutoff time (usually 5 or 10 minutes elapsed time) the program is measured again (11 more times) with output redirected to /dev/null.
If the program doesn't give the expected output within an arbitrary timeout (usually one hour) the program is forced to quit.
Most measurements shown on the website are either:
- within the arbitrary cutoff - the lowest elapsed time from 12 BenchExec measurements
- within the arbitrary cutoff - the 95% t-score Confidence Interval from 11 CPU time BenchExec measurements; or if the CI was miniscule, the Mean
- outside the arbitrary cutoff - the elapsed time, cpu time and peak memory from one BenchExec measurement
(For sure, programs taking 4 and 5 hours were only measured once!)
Some few programs were changed to make in-process measurements, within the BenchExec container:
- Some Java programs were changed to make 24 successive calculations and print System.nanoTime() to System.err before and after each calculation
- Some n-body programs were changed to make a 1st calculation, 4 successive warmup calculations, and a final 6th calculation. Before and after the 1st and 6th calculations; the programs triggered file changes and thus mtime timestamps and thus millisecond elapsed time measurements

Who wrote the programs

The programs have been crowd sourced, contributed to the project by an ever-changing self-selected group.

Ancient hardware

How source-code size is measured

We start with the source-code markup you can see, remove comments, remove duplicate whitespace characters, and then apply minimum GZip compression. The measurement is the size in bytes of that GZip compressed source-code file.

Thanks to Brian Hurt for the idea of using size of compressed source-code instead of lines of code.

median
February 2025
Toit	558
Perl	570
Lua	580
PHP	581
Ruby	583
Python 3	585
Julia	634
Chapel	646
Racket	696
JavaScript	698
OCaml	741
Erlang	798
Go	831
Dart	847
Smalltalk	871
Haskell	892
Java	910
Lisp	938
Swift	939
F#	943
Pascal	959
Fortran	1091
C#	1117
C	1121
C++	1129
Rust	1235
Ada	1825

(Note: There is some evidence that complexity metrics don't provide any more information than SLoC or LoC.)

How make-time is measured

There may be required (or chosen) actions before program run-time. Some languages are most known for their ahead-of-time language implementations and others for their run-time language implementations.

(The program log shows seconds to complete and log all make actions. The measurement for Python programs records use of Pyright.)

median
February 2025
Lua
Perl
PHP
Racket
Ruby
Julia	0.12
Node.js	0.12
Pascal	1.12
Toit	1.14
OCaml	1.50
Erlang	1.59
Java -Xint	2.14
C gcc	2.34
Lisp SBCL	2.58
Java jdk23	2.67
Fortran	2.96
Dart jit	3.50
C++ gcc	3.69
Ada gcc	3.71
Python 3	5.29
C clang	5.43
Go	5.43
Dart aot	6.29
C#	7.74
Rust	10.29
F#	12.37
Swift	14.54
Haskell GHC	17.97
Chapel	20.86
C# naot	24.29
Smalltalk	30.34
Java naot	98.17