As of 2025, measurements are made with BenchExec.
Time-to-Solve
The host machine is isolated from networks and unloaded. (Of course, that is not how programs are used!)
To provide each program with a similar context, file system caches and swap are cleared before measurements are made for each program. As part of that reset, the first measurement is excluded from the mean and confidence intervals. To ameliorate measurement bias, we vary UNIX environment size.
For a different approach, see Robust benchmarking in noisy environments.
If the program gives the expected output within an arbitrary cutoff time (usually 5 or 10 minutes elapsed time), the program is measured again (11 more times) with output redirected to /dev/null.
If the program doesn't give the expected output within an arbitrary timeout (usually one hour) the program is forced to quit.
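The run-check-quit logic above can be sketched as follows. This is a minimal illustration only, not the actual harness: BenchExec does the real process control and resource accounting, and `run_once`, the example command, and the timeout value are hypothetical.

```python
import subprocess

def run_once(cmd, expected_output, timeout_secs):
    """Run a benchmark program once; return True only if it prints the
    expected output before the timeout. Hypothetical helper, not the
    actual BenchExec harness."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_secs)
    except subprocess.TimeoutExpired:
        return False  # subprocess.run kills the child process on timeout
    return result.stdout == expected_output

# Usage: check against a cutoff before deciding to repeat the measurement.
ok = run_once(["echo", "hello"], "hello\n", timeout_secs=300)
```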
Most measurements shown on the website are either:
within the arbitrary cutoff - the lowest elapsed time from 12 BenchExec measurements
within the arbitrary cutoff - the 95% t-score confidence interval from 11 CPU time BenchExec measurements; or, if the CI was minuscule, the mean
outside the arbitrary cutoff - the elapsed time, CPU time and peak memory from one BenchExec measurement
(Of course, programs taking 4 or 5 hours were only measured once!)
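The confidence-interval rule can be illustrated with a short calculation. A sketch only: the sample CPU times are made up, and the Student t critical value 2.228 (for 10 degrees of freedom) is hardcoded on the assumption of exactly 11 measurements.

```python
from statistics import mean, stdev

def t_interval_95(samples):
    """95% confidence interval for the mean, using the Student t
    critical value 2.228 = t(0.975, df=10), hardcoded for the
    11-measurement case described above."""
    assert len(samples) == 11
    m = mean(samples)
    half_width = 2.228 * stdev(samples) / len(samples) ** 0.5
    return m - half_width, m + half_width

# Usage: 11 made-up CPU-time measurements in seconds.
cpu_secs = [4.01, 3.98, 4.05, 4.00, 3.99, 4.02, 4.03, 3.97, 4.01, 4.00, 4.02]
low, high = t_interval_95(cpu_secs)
```

When the interval is very narrow, reporting the mean alone loses almost nothing, which is the "if the CI was minuscule" case above.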
A few programs were changed to make in-process measurements within the BenchExec container:
Some Java programs were changed to make 24 successive calculations and print System.nanoTime() to System.err before and after each calculation
Some n-body programs were changed to make a 1st calculation, 4 successive warmup calculations, and a final 6th calculation. Before and after the 1st and 6th calculations, the programs triggered file changes; the mtime timestamps of those files provided millisecond elapsed-time measurements
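The in-process instrumentation described above follows a simple pattern, sketched here in Python. The actual programs are Java and call System.nanoTime(); `measure_in_process` and the toy calculation are hypothetical stand-ins using Python's monotonic nanosecond clock.

```python
import sys
import time

def measure_in_process(calculate, n=24):
    """Print a nanosecond timestamp to stderr before and after each of
    n successive in-process calculations, mirroring the Java
    System.nanoTime() instrumentation described above."""
    results = []
    for _ in range(n):
        print(time.monotonic_ns(), file=sys.stderr)  # before
        results.append(calculate())
        print(time.monotonic_ns(), file=sys.stderr)  # after
    return results

# Usage: time a toy calculation 24 times inside the running process,
# so warmup and steady-state iterations can be distinguished later.
out = measure_in_process(lambda: sum(range(1000)))
```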
Who wrote the programs
The programs have been crowd-sourced, contributed to the project by an ever-changing, self-selected group.
Ancient hardware
How source-code size is measured
We start with the source-code markup you can see, remove comments, remove duplicate whitespace characters, and then apply minimum GZip compression. The measurement is the size in bytes of that GZip compressed source-code file.
Thanks to Brian Hurt for the idea of using size of compressed source-code instead of lines of code.
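That size measurement can be sketched as below. The `gzip_source_size` helper is hypothetical, and the comment-stripping pattern only handles `#` line comments for illustration; the real scripts handle each language's comment syntax.

```python
import gzip
import re

def gzip_source_size(source_text):
    """Approximate the metric described above: remove comments, collapse
    whitespace runs, gzip at minimum compression, report bytes.
    Illustrative sketch -- comment handling is '#' line comments only."""
    stripped = re.sub(r"#.*", "", source_text)        # drop '#' line comments
    stripped = re.sub(r"\s+", " ", stripped).strip()  # collapse whitespace runs
    return len(gzip.compress(stripped.encode("utf-8"), compresslevel=1))

# Usage: comments and extra whitespace don't count toward the measurement.
size = gzip_source_size("x = 1   # a comment\ny = 2\n")
```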
median source-code size, February 2025 (gzip bytes)

Toit         558
Perl         570
Lua          580
PHP          581
Ruby         583
Python 3     585
Julia        634
Chapel       646
Racket       696
JavaScript   698
OCaml        741
Erlang       798
Go           831
Dart         847
Smalltalk    871
Haskell      892
Java         910
Lisp         938
Swift        939
F#           943
Pascal       959
Fortran     1091
C#          1117
C           1121
C++         1129
Rust        1235
Ada         1825
(Note: There is some evidence that complexity metrics don't provide any more information than SLoC or LoC.)
How make-time is measured
There may be required (or chosen) actions before program run-time. Some languages are best known for their ahead-of-time implementations, others for their run-time (JIT or interpreted) implementations.
(The program log shows seconds to complete and log all make actions. The measurement for Python programs records use of Pyright.)
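Measuring make-time amounts to timing the build actions; a minimal sketch follows. The `timed_make` helper and the example command are hypothetical, not the project's actual scripts.

```python
import subprocess
import time

def timed_make(cmd):
    """Run one make action (e.g. a compile step) and return the elapsed
    seconds that a program log could record. Hypothetical helper."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.perf_counter() - start

# Usage: time a stand-in build command.
seconds = timed_make(["echo", "compiled"])
```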
median make time, February 2025 (seconds; blank means no make actions)

Lua
Perl
PHP
Racket
Ruby
Julia        0.12
Node.js      0.12
Pascal       1.12
Toit         1.14
OCaml        1.50
Erlang       1.59
Java -Xint   2.14
C gcc        2.34
Lisp SBCL    2.58
Java jdk23   2.67
Fortran      2.96
Dart jit     3.50
C++ gcc      3.69
Ada gcc      3.71
Python 3     5.29
C clang      5.43
Go           5.43
Dart aot     6.29
C#           7.74
Rust        10.29
F#          12.37
Swift       14.54
Haskell GHC 17.97
Chapel      20.86
C# naot     24.29
Smalltalk   30.34
Java naot   98.17