The Computer Language
25.03 Benchmarks Game

startup warmup

BenchExec wall-clock time (aka elapsed seconds, aka secs) is recorded outside the program being measured, and is independent of it.


Instead, timestamps can be recorded in-process, inside the program being measured. Each program is edited to include measurement code, which may differ from one programming language to another.
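
For example, a minimal sketch of such measurement code in Java (illustrative only, not the game's actual code; calculate() is a hypothetical stand-in for a benchmark's workload):

  // Illustrative sketch: timestamps are taken inside the running
  // program, so JVM startup and shutdown fall outside the
  // measured interval.
  public final class InProcess {
      public static void main(String[] args) {
          long start = System.nanoTime();
          long checksum = calculate();          // hypothetical workload
          double secs = (System.nanoTime() - start) / 1e9;
          System.err.printf("in-process secs: %.3f%n", secs);
          System.out.println(checksum);
      }

      // Hypothetical stand-in for a real benchmark calculation.
      static long calculate() {
          long acc = 0;
          for (long i = 0; i < 100_000_000L; i++) acc += i % 7;
          return acc;
      }
  }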


Let's ascribe differences between the independent measurements and in-process measurements to startup and shutdown costs. For example, some n-body programs —

  source     wall-clock secs  in-process secs
  C gcc                5.540            5.533
  C clang              5.704            5.697
  Intel C              6.023            6.017
  Swift                6.664            6.645
  C# naot              7.025            7.008
  Dart aot             7.139            7.126
  C#                   7.218            7.143
  Dart                 7.775            7.260
  Java naot            7.302            7.280
  Java                 7.610            7.536
  Go                   9.606            9.598
  Node.js              9.672            9.608
  Java -Xint         156.236          156.164
  PHP                242.367          242.325
  Ruby               317.904          317.793
  Toit               496.780          496.706
  Python 3           511.083          510.028
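
Subtracting gives an estimate of those costs: for Java, 7.610 - 7.536 ≈ 0.07 secs of startup and shutdown; for Python 3, 511.083 - 510.028 ≈ 1.05 secs; for C gcc, a mere 0.007 secs.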

Once programs have been edited to include measurement code, successive calculations can be measured. (In practice we found that the fifth iteration … for default workload … exhibit well-warmed up behavior …)

Let's ascribe differences between the 1st calculation and the 6th calculation to warmup. For example, some n-body programs —

  source     1st (secs)  6th (secs)
  C gcc           5.536       5.533
  C clang         5.697       5.687
  Intel C         6.009       6.007
  Swift           6.671       6.663
  C# naot         7.008       7.030
  Dart aot        7.141       7.118
  C#              7.168       7.021
  Dart            7.283       7.232
  Java naot       7.299       7.267
  Java            7.595       7.225
  Go              9.596       9.599
  Node.js         9.644       9.564
  Java -Xint    160.354     160.057
  PHP           242.477     242.454
  Ruby          318.026     317.894
  Toit          503.125     502.172
  Python 3      513.793     515.733
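
Subtracting gives an estimate of warmup: for Java, 7.595 - 7.225 ≈ 0.37 secs; for C#, 7.168 - 7.021 ≈ 0.15 secs; for C gcc, essentially none. Notice that a few programs (C# naot, Go, Python 3) were slightly slower on the 6th calculation than the 1st.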

The in-process measurements for those particular tiny tiny Java programs are tenths-of-a-second faster than the BenchExec wall-clock measurements. The calculations themselves take seconds and tens-of-seconds.

Sometimes fast startup does matter.

… a major release of the DaCapo benchmark suite for Java that took fourteen years to develop

Remember, the benchmarks game only shows a tiny number of tiny, tiny programs. So with humility: Let's perform 24 successive calculations and record System.nanoTime() before and after each calculation (excluding startup and shutdown costs, and warmup).
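
A minimal sketch of such a harness (illustrative only, not the game's actual measurement code; calculate() is a hypothetical stand-in for the real n-body work):

  // Illustrative sketch: 24 successive in-process measurements.
  // Per-calculation times expose warmup; startup and shutdown
  // costs fall outside every measured interval.
  public final class Repeat {
      public static void main(String[] args) {
          for (int i = 1; i <= 24; i++) {
              long start = System.nanoTime();
              long checksum = calculate();      // hypothetical workload
              double secs = (System.nanoTime() - start) / 1e9;
              System.err.printf("calc %2d: %.3f secs  (checksum %d)%n",
                      i, secs, checksum);
          }
      }

      // Hypothetical stand-in for the real n-body calculation.
      static long calculate() {
          long acc = 0;
          for (long i = 0; i < 100_000_000L; i++) acc += i % 7;
          return acc;
      }
  }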

  [chart: 24 in-process nbody Java #8 elapsed seconds measurements]

The way the tiny programs were written does seem to matter:

  [chart: nbody Java #1 #2 #3 #4 #5 elapsed seconds]

fannkuch-redux

Successive calculations with the fannkuch-redux #8 program became slower; calculations with the fannkuch-redux #3 program were stable, and after the first calculation the fannkuch-redux #2 program became faster.

  [chart: fannkuchredux Java #8 elapsed seconds]
  [chart: fannkuchredux Java #2 #3 elapsed seconds]

Some experimental studies show that 10.9% of process executions don't reach a steady state of peak performance, that 43.5% of process executions are inconsistent, and that sometimes later calculations are slower than those that came before.
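
The cited studies analyze whole sequences of per-iteration timings statistically. Purely as a toy illustration, and not their method: a run's last few in-process timings can be checked for drift against their mean:

  // Toy heuristic, illustrative only -- NOT the cited studies' method:
  // call a run "steady" if the spread of its last few per-calculation
  // timings stays within a small tolerance of their mean.
  public final class SteadyCheck {
      static boolean looksSteady(double[] secs, int tail, double tolerance) {
          double min = Double.MAX_VALUE, max = 0, sum = 0;
          for (int i = secs.length - tail; i < secs.length; i++) {
              min = Math.min(min, secs[i]);
              max = Math.max(max, secs[i]);
              sum += secs[i];
          }
          return (max - min) / (sum / tail) <= tolerance;
      }

      public static void main(String[] args) {
          // Warmup on the first calculations, then settled: prints true.
          double[] run = {7.595, 7.310, 7.250, 7.230, 7.226, 7.225};
          System.out.println(looksSteady(run, 4, 0.01));
      }
  }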

spectral-norm

Written for one core:

  [chart: spectralnorm Java #1 #8 elapsed seconds]

and written for multicore:

  [chart: spectralnorm Java #2 #3 elapsed seconds]

How much difference does it make for these tiny programs? On the one hand, for measurements of tenths of a second, tenths of a second is a huge difference. On the other hand, for measurements of seconds and tens-of-seconds, JVM startup, JIT, OSR… are quickly effective.