regex-redux description

[ Contentious. Different libraries. ]

Background

See the HN discussion and "regex engines on a curated set of tasks".

Variance

Some language implementations have regex built-in; some provide a regex library; some use a third-party regex library.

Different libraries are very likely to implement different regex algorithms.

The work

The work is to use the same simple regex patterns and actions to manipulate FASTA format data. Don't optimize away the work.

How to implement

We ask that contributed programs not only give the correct result, but also use the same algorithm to calculate that result.

Each program should:

read all of a redirected FASTA format file from stdin, and record the sequence length
use the same simple regex pattern match-replace to remove FASTA sequence descriptions and all linefeed characters, and record the sequence length

use the same simple regex patterns -

agggtaaa|tttaccct
[cgt]gggtaaa|tttaccc[acg]
a[act]ggtaaa|tttacc[agt]t
ag[act]gtaaa|tttac[agt]ct
agg[act]taaa|ttta[agt]cct
aggg[acg]aaa|ttt[cgt]ccct
agggt[cgt]aa|tt[acg]accct
agggta[cgt]a|t[acg]taccct
agggtaa[cgt]|[acg]ttaccct

- representing DNA 8-mers and their reverse complement (with a wildcard in one position), and (one pattern at a time) count matches in the redirected file

write the regex pattern and count
use the same magic regex patterns -
```
tHa[Nt]
aND|caN|Ha[DS]|WaS
a[NSt]|BY
<[^>]*>
\\|[^|][^|]*\\|
```
- to (one pattern at a time, in the same order) match-replace the pattern in the redirected file with -
```
<4>
<3>
<2>
|
-
```
- and record the sequence length
write the 3 recorded sequence lengths
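
The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation, and it rests on several assumptions: matching is case-sensitive; the doubled backslashes in the last magic pattern are string-literal escaping for `\|`; the pattern used to strip descriptions and linefeeds (`>.*\n|\n`) is one plausible choice; and the report layout (pattern, space, count, then a blank line and the three lengths) should be checked against the reference output file. The names `regex_redux` and `format_report` are hypothetical.

```python
import re

# The nine 8-mer variant patterns, copied from the description above.
VARIANTS = [
    "agggtaaa|tttaccct",
    "[cgt]gggtaaa|tttaccc[acg]",
    "a[act]ggtaaa|tttacc[agt]t",
    "ag[act]gtaaa|tttac[agt]ct",
    "agg[act]taaa|ttta[agt]cct",
    "aggg[acg]aaa|ttt[cgt]ccct",
    "agggt[cgt]aa|tt[acg]accct",
    "agggta[cgt]a|t[acg]taccct",
    "agggtaa[cgt]|[acg]ttaccct",
]

# Magic patterns paired with their replacements, in the required order.
# Assumption: "\\|" in the description is the escaped form of the regex "\|".
SUBSTITUTIONS = [
    (r"tHa[Nt]", "<4>"),
    (r"aND|caN|Ha[DS]|WaS", "<3>"),
    (r"a[NSt]|BY", "<2>"),
    (r"<[^>]*>", "|"),
    (r"\|[^|][^|]*\|", "-"),
]


def regex_redux(text):
    """Return (counts, lengths) for FASTA-format text."""
    initial_length = len(text)

    # Remove sequence descriptions (lines starting with '>') and all linefeeds.
    sequence = re.sub(r">.*\n|\n", "", text)
    code_length = len(sequence)

    # Count non-overlapping matches, one pattern at a time.
    counts = [(p, len(re.findall(p, sequence))) for p in VARIANTS]

    # Apply the match-replace steps one pattern at a time, in order.
    for pattern, replacement in SUBSTITUTIONS:
        sequence = re.sub(pattern, replacement, sequence)

    return counts, (initial_length, code_length, len(sequence))


def format_report(counts, lengths):
    """Assumed layout: 'pattern count' lines, a blank line, three lengths."""
    lines = [f"{pattern} {count}" for pattern, count in counts]
    lines.append("")
    lines.extend(str(n) for n in lengths)
    return "\n".join(lines)
```

A driver would read all of stdin once, for example `print(format_report(*regex_redux(sys.stdin.read())))` after `import sys`. Reading as text means the lengths are character counts, which equal byte counts only for ASCII input.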

Before you contribute your program, diff its output for this 10KB input file (generated with the fasta program, N = 1000) against this output file to check that the output has the correct format.
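
Where a command-line diff is not convenient, the same format check can be sketched with Python's standard `difflib`. The function name `diff_outputs` is hypothetical, and how the two strings are obtained (for example, by reading the program's captured output and the reference output file) is left to the caller.

```python
import difflib


def diff_outputs(actual: str, expected: str) -> list[str]:
    """Return unified-diff lines; an empty list means the outputs match."""
    return list(
        difflib.unified_diff(
            expected.splitlines(),
            actual.splitlines(),
            fromfile="expected",
            tofile="actual",
            lineterm="",
        )
    )
```

An empty result indicates the program output matches the reference line for line; any returned lines show where the format or the counts differ.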

Generate a larger input file (using one of the fasta programs with command line arguments: 5000000 > input5000000.txt) to check program performance.

Thanks to Jeremy Zerfas for insisting that the programs follow the "one pattern at a time" guideline, and developing the magic regex patterns. Thanks to Matt Brubeck for the good enough magic regex pattern.