Your code is your lab book

Code, reuse and documentation

An extensive, if slightly contentious, Twitter conversation popped up recently following this tweet from Aylwyn Scally:

What followed was a rather long discussion about how, when and even whether code should be documented. Whilst the resounding answer in general should be “yes”, there’s lots of grey in between, ranging from “good code shouldn’t need documentation”, to “if you publish a tool, it needs minimal documentation for use and reuse”, to “documentation is a time pressure”. Some tweets, however, got a bit scary, with some useful ripostes (NB I’m not picking on Aylwyn here at all – it’s just relevant context):

I would recommend reading the whole thread as there are some really interesting points raised on all sides. However, one tweet stood out for me:

Why shouldn’t code be like a lab book?

Consider the lowly paper lab book. A stalwart of both the wet and dry lab, it’s a place to store your thoughts, processes and even results. The context to those thoughts, processes and results is just as important in the lab notebook as it is in a block of code.

At first glance, some bits of code don’t require much, if any, documentation – “a = b + c” would seem pretty self-explanatory given some assumptions about the syntax. However, C++ is a language that allows operator overloading. Given our assumptions about what b and c represent, we reasonably assume that the + operator has an additive effect. But what if the context for this code dictates that b and c are actually strings, and that + concatenates the first half of b with the latter half of c? This is a well-worn problem, but it’s relevant to why documentation is vital wherever possible.
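To make that concrete, here’s a minimal sketch of the overloading trap. The type name (Splice) and its splicing behaviour are hypothetical inventions for illustration – the point is that without a comment, “a = b + c” tells you nothing about what + actually does here.

```cpp
#include <string>

// Hypothetical type for illustration: its operator+ does NOT add.
// It splices the first half of the left string onto the latter
// half of the right string. Without this comment, a reader of
// "a = b + c" would have no idea.
struct Splice {
    std::string s;
};

Splice operator+(const Splice& lhs, const Splice& rhs) {
    std::string left  = lhs.s.substr(0, lhs.s.size() / 2);
    std::string right = rhs.s.substr(rhs.s.size() / 2);
    return Splice{left + right};  // here + IS std::string concatenation
}
```

So Splice{"abcd"} + Splice{"wxyz"} yields "abyz", not anything additive – context the code alone doesn’t communicate.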

Abstracting up from this, a tool requires even more documentation to be useful. Even the simplest tools have helper context – even though I use “ls” every day, and I use a subset of the flags from muscle memory alone, “man ls” gives me a 243-line file that describes everything about a tool that simply lists directory contents. I’ve seen bioinformatics tools with less documentation than this. Although I don’t read the documentation every time I run “ls”, it’s necessary for communicating what the tool does, how to change its behaviour, and most importantly, how to reproduce what someone else is seeing when they run the same tool – “If you run ls like this, you should see the same thing I do”.

The lab notebook is a labour of love, just like code. Picking up someone else’s notebook can either be an incomprehensible alien landscape, or a journey through someone’s methods with a handy tour guide. In addition, the necessity of traditional publication guidelines means the former gets distilled into the latter (I’m not going to get into the argument about the pros and cons of traditional publishing here – that’s for another time). I’ll call this context distillation. Publishing code (and by publishing here, I mean in the sense that you’re intending for other people to read and assess it, not just sticking it on GitHub) needs the same rigour. Sure, I could post up scanned images of my lab notebook, but without documentation, it would be pretty hard for someone to pick up my outputs and use them in this form.

Context distillation

Coding/engineering conventions help a bit here – they can provide part of that distillation that takes a hacky script into a publishable piece of reusable understandable code:

Command line tool conventions: http://www.gigasciencejournal.com/content/2/1/15

Engineering conventions: http://www.gigasciencejournal.com/content/3/1/31

I’m slightly biased (I reviewed both these manuscripts), but they are a nice starting point for a range of simple requirements that can help bioinformaticians write better tools, one of which is providing help and documentation to users. This was mirrored in the tweetpile:

This is key. We need, as a community, usable requirements and standards for saying “this is how code should go from being available to being reusable”. How do we get our lab notebook code into that form via a number of checkpoints that both programmers and reviewers agree on?

Every researcher has a responsibility to distil their methods effectively. Their papers just wouldn’t (shouldn’t?) get published otherwise. Coders are in the same boat.

The “I don’t have time” excuse doesn’t wash.

Provide the minimum set of required documentation, as agreed by a community, to go from “your mileage may vary” code to “I’d be surprised if your mileage varies” code. C. Titus Brown (are there bioinformatics blog posts that don’t mention Titus?) and Brad Chapman proposed a reviewer criteria checklist that goes some way to providing these requirements for context distillation, but notably omits anything regarding documentation.

Sounds like there’s more work to be done!