fannkuch-redux description

Background

The fannkuch benchmark is defined by programs in [pdf]Performing Lisp Analysis of the FANNKUCH Benchmark, Kenneth R. Anderson and Duane Rettig. FANNKUCH is an abbreviation for the German word Pfannkuchen, or pancakes, in analogy to flipping pancakes. The conjecture is that the maximum count is approximated by n*log(n) when n goes to infinity.

How to implement

We ask that contributed programs not only give the correct result, but also use the same algorithm to calculate that result.

Each program should:

Take a permutation of {1,...,n}, for example: {4,2,1,5,3}.
Take the first element, here 4, and reverse the order of the first 4 elements: {5,1,2,4,3}.
Repeat this until the first element is a 1, so flipping won't change anything more: {3,4,2,1,5}, {2,4,3,1,5}, {4,2,3,1,5}, {1,3,2,4,5}.
Count the number of flips, here 5.
Keep a checksum
- checksum = checksum + (if permutation_index is even then flips_count else -flips_count)
- checksum = checksum + (toggle_sign_-1_1 * flips_count)
Do this for all n! permutations, and record the maximum number of flips needed for any permutation.

diff program output N = 7 with this output file to check your program output has the correct format, before you contribute your program.

Use a larger command line argument (12) to check program performance.

Example

Thanks to Oleg Mazurov for insisting on a checksum and providing this helpful description of the approach he took -

common idea for parallel implementation is to divide all work (n! permutations) into chunks small enough to avoid load imbalance but large enough to keep overhead low. I set the number of chunks as a parameter (NCHUNKS = 150) from which I derive the size of a chunk (CHUNKSZ) and the actual number of chunks/tasks to be processed (NTASKS), which may be different from NCHUNKS because of rounding.
Task scheduling is trivial: threads will atomically get and increment the taskId variable to derive a range of permutation indices to work on:
```
task = taskId.getAndIncrement();
idxMin = task * CHUNKSZ;
idxMax = min( idxMin + CHUNKSZ, n! );
```
Maximum flip counts and partial checksums can be computed for chunks in arbitrary order and recombined to generate the required result at the final step (CHUNKSZ must be even for adding partial checksums to be associative - I didn't enforce it in my submission).
Now I need to go from a permutation index to the permutation itself.
The predefined order in which all permutations are to be generated can be described as follows: to generate n! permutations of n arbitrary numbers, rotate the numbers left (from higher position to lower) n times, so that each number appears in the n-th position, and for each rotation recursively generate (n-1)! permutations of the first n-1 numbers whatever they are.
To optimize the process I use an intermediate data structure, count[], which keeps count of how many rotations have been done at every level. Apparently, count[0] is always 0, as there is only one element at that level, which can't be rotated; count[1] = 0..1 for two elements, count[2] = 0..2 for three elements, etc.
To generate next permutation I swap the first two elements and increase count[1]. If count[1] becomes greater than 1, I'm done with rotations at level 1 and need to "return" (as it would have been in the recursive implementation) to level 2. Now, I rotate 3 elements and increment count[2]. If it becomes greater than 2, I'm done with level 2 and need to go to level 3, etc.
It should be clear now how to generate a permutation and corresponding count[] array from an arbitrary index. Basically, count[k] = ( index % (k+1)! ) / k! is the number of rotations we need to perform on elements 0..k. Doing it in the descending order from n-1 to 1 gives us both the count[] array and the permutation.