The Computer Language
24.12 Benchmarks Game

k-nucleotide description

Variance

Some language implementations have hash tables built-in; some provide a hash table as part of a collections library; some use a third-party hash table library. (For example, use either khash or CK_HT for C language k-nucleotide programs.) The hash table algorithm implemented is likely to be different in different libraries.

Please don't implement your own custom "hash table" - it will not be accepted.

The work

The work is to use the built-in or library hash table implementation to accumulate count values - lookup the count for a key and update the count in the hash table. Don't optimize away the work.

Mapping the DNA letters to the bytes 0, 1, 2, 3, and using a hash function that concatenates those two-byte codes is an acceptable (but not a required) optimization.

How to implement

We ask that contributed programs not only give the correct result, but also use the same algorithm to calculate that result.

Each program should:

diff program output for this 10KB input file (generated with the fasta program N = 1000) with this output file to check your program output has the correct format, before you contribute your program.

Generate a larger input file (using one of the fasta programs with command line arguments: 25000000 > input25000000.txt) to check program performance.