The Computer Language
24.11 Benchmarks Game

Who wrote the programs

The programs have been crowdsourced: contributed to the project by an ever-changing, self-selected group.

How programs could be measured better

For some programs the measurement script fails to measure CPU time and memory use: OCaml fannkuch-redux #3, OCaml fannkuch-redux #4, OCaml reverse-complement, OCaml reverse-complement #3.

BenchExec successfully makes measurements for those programs. For example, OCaml fannkuch-redux #3:

starttime=2024-07-16T12:24:19.263290-07:00
returnvalue=0
walltime=9.037729900999693s
cputime=35.813672s
memory=31322112B
blkio-read=0B
blkio-write=0B
pressure-cpu-some=8.862072s
pressure-io-some=0s
pressure-memory-some=0s

BenchExec now seems like a better framework for this kind of project.

How programs are measured

  1. Each program is run and measured at the smallest input value, with program output redirected to a file and compared to the expected output. As long as the output matches the expected output, the program is then run and measured at the next larger input value, until measurements have been made at every input value.

  2. If the program gives the expected output within an arbitrary cutoff time (120 seconds) the program is measured again (5 more times) with output redirected to /dev/null.

  3. If the program doesn't give the expected output within an arbitrary timeout (usually one hour) the program is forced to quit. If measurements at a smaller input value have been successful within an arbitrary cutoff time (120 seconds), the program is measured again (5 more times) at that smaller input value, with output redirected to /dev/null.

  4. The measurements shown on the website are either:

    • within the arbitrary cutoff - the lowest time and highest memory use from 6 measurements

    • outside the arbitrary cutoff - the sole time and memory use measurement

  5. For sure, programs taking 4 and 5 hours were only measured once!
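The reporting rule in step 4 can be sketched as a small function. This is only an illustration under stated assumptions, not the project's script: the names `Measurement`, `reported` and `CUTOFF_SECONDS` are hypothetical, and the sketch assumes the cutoff is checked against the time of the first (output-checked) run.

```python
from dataclasses import dataclass

CUTOFF_SECONDS = 120.0  # the "arbitrary cutoff" named in steps 2 and 3


@dataclass
class Measurement:
    seconds: float      # time recorded for one run
    memory_bytes: int   # memory use recorded for one run


def reported(first: Measurement, repeats: list[Measurement]) -> Measurement:
    """Pick the figures to show for one program at one input value.

    `first` is the run whose output was checked; `repeats` holds the
    later runs made with output sent to /dev/null (empty if none).
    """
    if first.seconds > CUTOFF_SECONDS or not repeats:
        # Outside the cutoff: the sole measurement is shown as-is.
        return first
    runs = [first, *repeats]
    # Within the cutoff: lowest time and highest memory over all runs.
    return Measurement(
        seconds=min(r.seconds for r in runs),
        memory_bytes=max(r.memory_bytes for r in runs),
    )
```

Note that the lowest time and the highest memory use may come from different runs, so the pair shown need not correspond to any single execution.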

How programs are timed

Each program is run as a child process of a Python script using Popen.

Note: These measurements include startup time. Compare with a dozen examples that exclude startup cost and are made after warm-up iterations.
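A minimal sketch of timing a child process started with `subprocess.Popen` shows why startup time is included. It is not the project's measurement script; the function name `time_command` and the choice of `time.perf_counter` as the clock are assumptions made for illustration.

```python
import subprocess
import sys
import time


def time_command(cmd: list[str]) -> tuple[int, float]:
    """Run cmd as a child process; return (exit status, elapsed seconds)."""
    start = time.perf_counter()
    # The clock starts before the child exists, so process creation and
    # language-runtime startup are part of the measured time.
    child = subprocess.Popen(cmd, stdout=subprocess.DEVNULL)
    status = child.wait()
    elapsed = time.perf_counter() - start
    return status, elapsed


if __name__ == "__main__":
    status, elapsed = time_command([sys.executable, "-c", "pass"])
    print(f"exit status {status}, elapsed {elapsed:.3f}s")
```

Even a program that does nothing reports a non-zero time here, which is the startup cost the note refers to.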

How program memory use is measured

The maximum resident set size of the script's child process is recorded using os.wait3.

This is not the maximum resident set size of the process tree and may underestimate the memory use of multi-process programs.
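As a rough sketch of that approach (Unix only, and not the project's script), `os.wait3` returns a resource-usage record for the child it reaps, and its `ru_maxrss` field holds the peak resident set size. The function name `max_rss_of_child` is hypothetical, and the sketch assumes the caller has no other unreaped children.

```python
import os
import subprocess
import sys


def max_rss_of_child(cmd: list[str]) -> tuple[int, int]:
    """Run cmd; return (wait status, ru_maxrss of the reaped child).

    ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    os.wait3 reaps whichever child exits first, so this assumes cmd is
    the only child process of the caller.
    """
    child = subprocess.Popen(cmd, stdout=subprocess.DEVNULL)
    pid, status, rusage = os.wait3(0)  # block until a child exits
    # The child is already reaped; this call only lets the Popen object
    # mark itself as finished.
    child.wait()
    return status, rusage.ru_maxrss


if __name__ == "__main__":
    status, peak = max_rss_of_child([sys.executable, "-c", "x = [0] * 10**6"])
    print(f"wait status {status}, ru_maxrss {peak}")
```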

How source code size is measured

We start with the source-code markup you can see, remove comments, remove duplicate whitespace characters, and then apply minimum GZip compression. The measurement is the size in bytes of that GZip compressed source-code file.
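The steps above can be sketched roughly as follows. This is an illustration only: real comment removal depends on each language's syntax, so the sketch assumes whole-line `#` comments; it reads "duplicate whitespace" as runs of the same whitespace character; and it takes "minimum compression" to mean `compresslevel=1`. The function name `gzip_source_size` is hypothetical.

```python
import gzip
import re


def gzip_source_size(source: str) -> int:
    """Size in bytes of the cleaned, gzip-compressed source text."""
    # Assumption: only whole-line '#' comments are removed; other
    # languages need their own comment handling.
    lines = [ln for ln in source.splitlines()
             if not ln.lstrip().startswith("#")]
    text = "\n".join(lines)
    # Collapse each run of one repeated whitespace character to a
    # single occurrence of that character.
    text = re.sub(r"(\s)\1+", r"\1", text)
    # Assumption: "minimum compression" is gzip level 1.
    return len(gzip.compress(text.encode("utf-8"), compresslevel=1))


if __name__ == "__main__":
    print(gzip_source_size("x  =  1\n# a comment\ny = 2\n"))
```

Because comments and repeated whitespace are stripped first, the figure is less sensitive to commenting and layout style than a raw byte count would be.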

Thanks to Brian Hurt for the idea of using size of compressed source code instead of lines of code.

Median gzip-compressed source code size in bytes, by language (July 2018)
Perl 513
TypeScript 532
Lua 553
Ruby 568
Dart 610
Chapel 632
Racket 638
Python 672
PHP 736
Hack 745
JavaScript 779
Erlang 792
Go 829
Haskell 842
Pascal 846
F# 876
OCaml 914
Java 945
Smalltalk 950
Lisp 1004
Fortran 1019
C++ 1044
C# 1059
C 1115
Swift 1164
Rust 1319
Ada 1819

(Note: There is some evidence that complexity metrics don't provide any more information than SLoC or LoC.)

How CPU load is measured

CPU load is taken from psutil.cpu_percent, called before forking the child process and again after the child process exits.

This will include processing done by the measurement script, and whatever else happens on the idle isolated test machine.
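A hedged sketch of that before-and-after pattern: with `interval=None`, the first `psutil.cpu_percent` call sets a reference point and the second returns utilisation since the first. The function name `cpu_load_during`, the injectable `sample` parameter and the use of `percpu=True` are assumptions for illustration; psutil is a third-party package and must be installed for the default sampler to work.

```python
import subprocess
import sys

try:
    import psutil  # third-party; assumed to be installed
except ImportError:
    psutil = None


def cpu_load_during(cmd, sample=None):
    """Run cmd; return system-wide CPU utilisation over its lifetime."""
    if sample is None:
        # percpu=True is an assumption; the text names only the function.
        def sample():
            return psutil.cpu_percent(interval=None, percpu=True)
    sample()  # first call sets the reference point; its value is ignored
    subprocess.run(cmd, stdout=subprocess.DEVNULL, check=False)
    return sample()  # utilisation since the first call


if __name__ == "__main__" and psutil is not None:
    print(cpu_load_during([sys.executable, "-c", "sum(range(10**7))"]))
```

The figures are system-wide, so anything else running between the two calls, including this script itself, is counted along with the child process.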