Thanks to Oleg Mazurov for insisting on a checksum and providing this helpful description of the approach he took -
common idea for parallel implementation is to divide all work (n! permutations) into chunks small enough to avoid load imbalance but large enough to keep overhead low. I set the number of chunks as a parameter (NCHUNKS = 150) from which I derive the size of a chunk (CHUNKSZ) and the actual number of chunks/tasks to be processed (NTASKS), which may be different from NCHUNKS because of rounding.
Task scheduling is trivial: threads will atomically get and increment the taskId variable to derive a range of permutation indices to work on:
task = taskId.getAndIncrement();
idxMin = task * CHUNKSZ;
idxMax = min( idxMin + CHUNKSZ, n! );
Maximum flip counts and partial checksums can be computed for chunks in arbitrary order and recombined to generate the required result at the final step (CHUNKSZ must be even for adding partial checksums to be associative - I didn't enforce it in my submission).
Now I need to go from a permutation index to the permutation itself.
The predefined order in which all permutations are to be generated can be described as follows: to generate n! permutations of n arbitrary numbers, rotate the numbers left (from higher position to lower) n times, so that each number appears in the n-th position, and for each rotation recursively generate (n-1)! permutations of the first n-1 numbers whatever they are.
To optimize the process I use an intermediate data structure, count[], which keeps count of how many rotations have been done at every level. Apparently, count[0] is always 0, as there is only one element at that level, which can't be rotated; count[1] = 0..1 for two elements, count[2] = 0..2 for three elements, etc.
To generate next permutation I swap the first two elements and increase count[1]. If count[1] becomes greater than 1, I'm done with rotations at level 1 and need to "return" (as it would have been in the recursive implementation) to level 2. Now, I rotate 3 elements and increment count[2]. If it becomes greater than 2, I'm done with level 2 and need to go to level 3, etc.
It should be clear now how to generate a permutation and corresponding count[] array from an arbitrary index. Basically, count[k] = ( index % (k+1)! ) / k! is the number of rotations we need to perform on elements 0..k. Doing it in the descending order from n-1 to 1 gives us both the count[] array and the permutation.