The Computer Language
25.02 Benchmarks Game

As of 2025, measurements are made with BenchExec.

Time-to-Solve

  1. The host machine is isolated from networks and unloaded. (Of course, that is not how programs are used!)

    To provide each program with a similar context, file system caches and swap are cleared before measurements are made for each program. As part of that reset, the first measurement is excluded from the mean and confidence intervals. To ameliorate measurement bias, we vary UNIX environment size.

    For a different approach, see Robust benchmarking in noisy environments.

     

  2. If the program gives the expected output within an arbitrary cutoff time (usually 5 or 10 minutes elapsed time), the program is measured again (11 more times) with output redirected to /dev/null.

  3. If the program doesn't give the expected output within an arbitrary timeout (usually one hour), the program is forced to quit.

  4. Most measurements shown on the website are either:

    • within the arbitrary cutoff - the lowest elapsed time from 12 BenchExec measurements

    • within the arbitrary cutoff - the 95% t-score confidence interval from 11 CPU time BenchExec measurements; or, if the CI was minuscule, the mean

    • outside the arbitrary cutoff - the elapsed time, CPU time and peak memory from one BenchExec measurement

    (For sure, programs taking 4 and 5 hours were only measured once!)

  5. Some few programs were changed to make in-process measurements, within the BenchExec container:

    • Some Java programs were changed to make 24 successive calculations and print System.nanoTime() to System.err before and after each calculation

    • Some n-body programs were changed to make a 1st calculation, 4 successive warmup calculations, and a final 6th calculation. Before and after the 1st and 6th calculations, the programs triggered file changes, whose mtime timestamps provide millisecond elapsed-time measurements
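As a sketch of the in-process pattern described above (the function names and repetition count here are illustrative, not the project's actual harness), a Python equivalent of the Java programs' System.nanoTime() approach might look like:

```python
import sys
import time

def measure_in_process(calculate, repetitions=24):
    """Run `calculate` repeatedly, printing a monotonic nanosecond
    timestamp to stderr before and after each run -- the same pattern
    as the Java programs that print System.nanoTime() to System.err."""
    results = []
    for _ in range(repetitions):
        print(time.monotonic_ns(), file=sys.stderr)   # timestamp before
        results.append(calculate())
        print(time.monotonic_ns(), file=sys.stderr)   # timestamp after
    return results
```

An external script can then pair successive stderr timestamps to recover per-calculation elapsed times, excluding start-up and warmup from the reported figure.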

Who wrote the programs

The programs have been crowd-sourced, contributed to the project by an ever-changing, self-selected group.

Ancient hardware

How source-code size is measured

We start with the source-code markup you can see, remove comments, remove duplicate whitespace characters, and then apply minimum GZip compression. The measurement is the size in bytes of that GZip compressed source-code file.

Thanks to Brian Hurt for the idea of using size of compressed source-code instead of lines of code.
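A minimal sketch of that metric, assuming '#' line comments for the example (real comment-stripping rules differ per language):

```python
import gzip
import re

def source_size_bytes(source: str) -> int:
    """Approximate the size metric: strip comments, collapse runs of
    whitespace to a single space, then GZip at minimum compression (1)
    and report the compressed size in bytes."""
    no_comments = re.sub(r"#[^\n]*", "", source)         # drop '#' line comments
    squeezed = re.sub(r"\s+", " ", no_comments).strip()  # duplicate whitespace -> one space
    return len(gzip.compress(squeezed.encode("utf-8"), 1))
```

Because comments and repeated whitespace are removed before compression, reformatting or re-commenting a program does not change its measured size.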

median
February 2025
Toit 558
Perl 570
Lua 580
PHP 581
Ruby 583
Python 3 585
Julia 634
Chapel 646
Racket 696
JavaScript 698
OCaml 741
Erlang 798
Go 831
Dart 847
Smalltalk 871
Haskell 892
Java 910
Lisp 938
Swift 939
F# 943
Pascal 959
Fortran 1091
C# 1117
C 1121
C++ 1129
Rust 1235
Ada 1825

(Note: There is some evidence that complexity metrics don't provide any more information than SLoC or LoC.)

How make-time is measured

There may be required (or chosen) actions before program run-time. Some languages are best known for their ahead-of-time implementations and others for their run-time implementations.

(The program log shows the seconds taken to complete and log all make actions. For Python programs, the measurement also records the use of Pyright.)
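For illustration (the commented-out compile command uses hypothetical file names, and this is not the project's actual logging harness), the "seconds to complete all make actions" figure amounts to timing the build command with a wall clock:

```python
import subprocess
import sys
import time

def timed_make(cmd):
    """Run one make action and return its wall-clock elapsed seconds,
    analogous to the make-time figure recorded in the program logs."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)   # raises CalledProcessError if the step fails
    return time.perf_counter() - start

# Example ahead-of-time build action (hypothetical file names):
# elapsed = timed_make(["gcc", "-O3", "nbody.c", "-o", "nbody", "-lm"])
```

Languages with no make step (for example, those run directly by an interpreter) simply have nothing to time, which is why some rows below are blank.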

median
February 2025
Lua  
Perl  
PHP  
Racket  
Ruby  
Julia 0.12
Node.js 0.12
Pascal 1.12
Toit 1.14
OCaml 1.50
Erlang 1.59
Java -Xint 2.14
C gcc 2.34
Lisp SBCL 2.58
Java jdk23 2.67
Fortran 2.96
Dart jit 3.50
C++ gcc 3.69
Ada gcc 3.71
Python 3 5.29
C clang 5.43
Go 5.43
Dart aot 6.29
C# 7.74
Rust 10.29
F# 12.37
Swift 14.54
Haskell GHC 17.97
Chapel 20.86
C# naot 24.29
Smalltalk 30.34
Java naot 98.17