Just Write a Parser

At the Broad Institute, we use a wide variety of bioinformatics software. I recently learned that we use Biopython to transform ACE to FASTA. I learned this because a script that calls it failed for one ACE file. I don’t know the ACE format, so even after I found an old spec, I couldn’t immediately tell if it was a data error, script error, library error, or spec error.

The easiest solution would be to find a common tool that also parses ACE; if it parses this data, then it’s not the data. Looking around online yielded a couple of code snippets, but this comment was the highest-rated answer:

The file format is simple enough that you can just write a parser.

I see time.

I see compiler errors, runtime errors. I see string-splitting errors. I see can’t-read-the-whole-file-into-memory errors. I see newline or Unicode errors. I see case errors, whitespace errors. Too-much-shit-in-a-comment-field errors.

I see a solution that at best handles the file in front of me. I see it breaking on the second, seventh, eighth, and future attempts to run.

I see my script being a bare-minimum, half-assed effort that is never right or good, because this task isn’t the task I’m trying to work on. This task is merely an obstacle to my real task, which is processing a strain of HIV and moving on to the next one, because I have 100 strains to work through.

I see an industry with tens of thousands of silent, unique implementations in common languages for the exact same file format. I see major application platforms inadvertently forking the spec due to implementation errors or unclear specifications. I see unrelated applications and platforms receiving support tickets for these undocumented forks: “It works here, it should work in your application.” I see the silent forks propagate and the spec muddy.

I’m learning that web development is similar.

In BioPerl, there’s a module called Bio::Assembly::IO. It’s a complete implementation of the parser I need. We don’t have it installed at the Broad Institute, despite our many employees who are Perl hackers. No doubt because it’s simple enough to just write your own parser.