The Computer Language
25.03 Benchmarks Game

startup warmup

BenchExec wall-clock time (aka elapsed seconds, aka secs) is recorded outside the program being measured, and is independent of it.


Instead, timestamps can be recorded in-process, inside the program being measured. Each program is edited to include measurement code, which may differ from one programming language to another.
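
For example, a minimal sketch of such measurement code in Java (illustrative only, not the game's actual code; calculate() is a hypothetical stand-in for a benchmark's workload):

  // Illustrative sketch: timestamps are taken inside the running
  // program, so JVM startup and shutdown fall outside the
  // measured interval.
  public final class InProcess {
      public static void main(String[] args) {
          long start = System.nanoTime();
          long checksum = calculate();          // hypothetical workload
          double secs = (System.nanoTime() - start) / 1e9;
          System.err.printf("in-process secs: %.3f%n", secs);
          System.out.println(checksum);
      }

      // Hypothetical stand-in for a real benchmark calculation.
      static long calculate() {
          long acc = 0;
          for (long i = 0; i < 100_000_000L; i++) acc += i % 7;
          return acc;
      }
  }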


Let's ascribe differences between the independent measurements and in-process measurements to startup and shutdown costs. For example, some n-body programs —

  source     wall-clock secs  in-process secs
  C gcc                5.540            5.533
  C clang              5.704            5.697
  Intel C              6.023            6.017
  Swift                6.664            6.645
  C# naot              7.025            7.008
  Dart aot             7.139            7.126
  C#                   7.218            7.143
  Dart                 7.775            7.260
  Java naot            7.302            7.280
  Java                 7.610            7.536
  Go                   9.606            9.598
  Node.js              9.672            9.608
  Java -Xint         156.236          156.164
  PHP                242.367          242.325
  Ruby               317.904          317.793
  Toit               496.780          496.706
  Python 3           511.083          510.028
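
Subtracting gives an estimate of those costs: for Java, 7.610 - 7.536 ≈ 0.07 secs of startup and shutdown; for Python 3, 511.083 - 510.028 ≈ 1.05 secs; for C gcc, a mere 0.007 secs.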

Once programs have been edited to include measurement code, successive calculations can be measured. (In practice we found that the fifth iteration … for default workload … exhibit well-warmed up behavior …)

Let's ascribe differences between the 1st calculation and the 6th calculation to warmup. For example, some n-body programs —

  source     1st (secs)  6th (secs)
  C gcc           5.536       5.533
  C clang         5.697       5.687
  Intel C         6.009       6.007
  Swift           6.671       6.663
  C# naot         7.008       7.030
  Dart aot        7.141       7.118
  C#              7.168       7.021
  Dart            7.283       7.232
  Java naot       7.299       7.267
  Java            7.595       7.225
  Go              9.596       9.599
  Node.js         9.644       9.564
  Java -Xint    160.354     160.057
  PHP           242.477     242.454
  Ruby          318.026     317.894
  Toit          503.125     502.172
  Python 3      513.793     515.733
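
Subtracting gives an estimate of warmup: for Java, 7.595 - 7.225 ≈ 0.37 secs; for C#, 7.168 - 7.021 ≈ 0.15 secs; for C gcc, essentially none. Notice that a few programs (C# naot, Go, Python 3) were slightly slower on the 6th calculation than the 1st.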

The in-process measurements for those particular tiny tiny Java programs are tenths-of-a-second faster than the BenchExec wall-clock measurements. The calculations themselves take seconds and tens-of-seconds.

Sometimes fast startup does matter.

… a major release of the DaCapo benchmark suite for Java that took fourteen years to develop

Remember, the benchmarks game only shows a tiny number of tiny, tiny programs. So with humility: Let's perform 24 successive calculations and record System.nanoTime() before and after each calculation (excluding startup and shutdown costs, and warmup).
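
A minimal sketch of such a harness (illustrative only, not the game's actual measurement code; calculate() is a hypothetical stand-in for the real n-body work):

  // Illustrative sketch: 24 successive in-process measurements.
  // Per-calculation times expose warmup; startup and shutdown
  // costs fall outside every measured interval.
  public final class Repeat {
      public static void main(String[] args) {
          for (int i = 1; i <= 24; i++) {
              long start = System.nanoTime();
              long checksum = calculate();      // hypothetical workload
              double secs = (System.nanoTime() - start) / 1e9;
              System.err.printf("calc %2d: %.3f secs  (checksum %d)%n",
                      i, secs, checksum);
          }
      }

      // Hypothetical stand-in for the real n-body calculation.
      static long calculate() {
          long acc = 0;
          for (long i = 0; i < 100_000_000L; i++) acc += i % 7;
          return acc;
      }
  }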

  [chart: 24 in-process nbody Java #8 elapsed seconds measurements]

The way the tiny programs were written does seem to matter:

  [chart: nbody Java #1 #2 #3 #4 #5 elapsed seconds]

fannkuch-redux

Successive calculations with the fannkuch-redux #8 program became slower; calculations with the fannkuch-redux #3 program were stable, and after the first calculation the fannkuch-redux #2 program became faster.

  [chart: fannkuchredux Java #8 elapsed seconds]
  [chart: fannkuchredux Java #2 #3 elapsed seconds]

Some experimental studies show that 10.9% of process executions don't reach a steady state of peak performance, that 43.5% of process executions are inconsistent, and that sometimes later calculations are slower than those that came before.
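
The cited studies analyze whole sequences of per-iteration timings statistically. Purely as a toy illustration, and not their method: a run's last few in-process timings can be checked for drift against their mean:

  // Toy heuristic, illustrative only -- NOT the cited studies' method:
  // call a run "steady" if the spread of its last few per-calculation
  // timings stays within a small tolerance of their mean.
  public final class SteadyCheck {
      static boolean looksSteady(double[] secs, int tail, double tolerance) {
          double min = Double.MAX_VALUE, max = 0, sum = 0;
          for (int i = secs.length - tail; i < secs.length; i++) {
              min = Math.min(min, secs[i]);
              max = Math.max(max, secs[i]);
              sum += secs[i];
          }
          return (max - min) / (sum / tail) <= tolerance;
      }

      public static void main(String[] args) {
          // Warmup on the first calculations, then settled: prints true.
          double[] run = {7.595, 7.310, 7.250, 7.230, 7.226, 7.225};
          System.out.println(looksSteady(run, 4, 0.01));
      }
  }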

spectral-norm

Written for one core:

  [chart: spectralnorm Java #1 #8 elapsed seconds]

and written for multicore:

  [chart: spectralnorm Java #2 #3 elapsed seconds]

How much difference does it make for these tiny programs? On the one hand, for measurements of tenths of a second, tenths of a second is a huge difference. On the other hand, for measurements of seconds and tens-of-seconds, JVM startup, JIT, OSR… are quickly effective.