Your code is your lab book

Code, reuse and documentation

An extensive, if slightly contentious, Twitter conversation popped up recently following this tweet from Aylwyn Scally:

What followed was a rather long discussion about how, when and even if code should be documented. Whilst the resounding answer in general should be “yes”, there’s lots of grey in between, ranging from “good code shouldn’t need documentation”, to “if you publish a tool, it needs minimal documentation for use and reuse”, to “documentation is a time pressure”. Some tweets, however, got a bit scary, with some useful ripostes (NB I’m not picking on Aylwyn here at all – it’s just relevant context):

I would recommend reading the whole thread as there are some really interesting points raised on all sides. However, one tweet stood out for me:

Why shouldn’t code be like a lab book?

Consider the lowly paper lab book. A stalwart of both the wet and dry lab, it’s a place to store your thoughts, processes and even results. The context to those thoughts, processes and results is just as important in the lab notebook as it is in a block of code.

At first glance, some bits of code don’t require much, if any, documentation – “a = b + c” would seem pretty self-explanatory given some assumptions about the syntax. However, C++ is a language that allows operator overloading. Having made assumptions about what b and c represent, we reasonably assume that the + operator has an additive effect. What if the context for this code dictates that b and c are actually strings, and + concatenates the first half of b with the latter half of c? This is a well-worn problem, but it’s relevant to why documentation is vital wherever possible.
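As a minimal, hypothetical sketch of that trap (the Label type and the exact splicing rule below are invented purely for illustration), an overloaded + in C++ can make “a = b + c” do something entirely non-additive:

```cpp
#include <iostream>
#include <string>

// Hypothetical type for illustration: '+' splices two values together
// rather than adding them.
struct Label {
    std::string text;

    // Overloaded '+': concatenate the first half of the left operand
    // with the latter half of the right operand.
    Label operator+(const Label& other) const {
        std::string left  = text.substr(0, text.size() / 2);
        std::string right = other.text.substr(other.text.size() / 2);
        return Label{left + right};
    }
};

int main() {
    Label b{"forward"};
    Label c{"reverse"};
    Label a = b + c;              // reads like addition, but it isn't
    std::cout << a.text << "\n";  // prints "forerse"
    return 0;
}
```

Without a comment or a line of documentation, nothing at the call site tells you which behaviour you’re getting – exactly the sort of context a lab book would record.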

Abstracting up from this, a tool requires even more documentation to be useful. Even the simplest tools come with helper context – even though I use “ls” every day, and I use a subset of its flags from muscle memory alone, “man ls” gives me a 243-line manual page describing everything about a tool that simply lists directory contents. I’ve seen bioinformatics tools with less documentation than that. Although I don’t read the documentation every time I run “ls”, it’s necessary for communicating what the tool does, how to change its behaviour, and most importantly, how to reproduce what someone else is seeing when they run the same tool – “if you run ls like this, you should see the same thing I do”.

The lab notebook is a labour of love, just like code. Someone else’s notebook can be either an incomprehensible alien landscape or a journey through their methods with a handy tour guide. In addition, the necessity of traditional publication guidelines means the former gets distilled into the latter (I’m not going to get into the argument about the pros and cons of traditional publishing here – that’s for another time). I’ll call this context distillation. Publishing code (and by publishing here, I mean in the sense that you’re intending for other people to read and assess it, not just sticking it on GitHub) needs the same rigour. Sure, I could post up scanned images of my lab notebook, but without documentation it would be pretty hard for someone to pick up my outputs and use them in this form.

Context distillation

Coding/engineering conventions help a bit here – they can provide part of that distillation that turns a hacky script into a publishable piece of reusable, understandable code:

Command line tool conventions: http://www.gigasciencejournal.com/content/2/1/15

Engineering conventions: http://www.gigasciencejournal.com/content/3/1/31

I’m slightly biased (I reviewed both these manuscripts), but they are a nice starting point for a range of simple requirements that can help bioinformaticians write better tools, one of which is providing help and documentation to users. This was mirrored in the tweetpile:

This is key. We need, as a community, usable requirements and standards for saying “this is how code should go from being available to being reusable“. How do we get our lab notebook code into that form via a number of checkpoints that both programmers and reviewers agree on?

Every researcher has a responsibility to distil their methods effectively. Their papers just wouldn’t (shouldn’t?) get published otherwise. Coders are in the same boat.

The “I don’t have time” excuse doesn’t wash.

Provide the minimum set of required documentation, as agreed by a community, to go from “your mileage may vary” code to “I’d be surprised if your mileage varies” code. C. Titus Brown (are there bioinformatics blog posts that don’t mention Titus?) and Brad Chapman proposed a reviewer criteria checklist that goes some way to providing these requirements for context distillation, but notably omits anything regarding documentation.

Sounds like there’s more work to be done!


3 Comments

  1. Maybe this is a good place to clarify my position on this. Twitter certainly isn’t, and debates there are usually unsatisfactory.

    Like everyone in the field, I am very happy to see the growing trend towards reproducibility in science, and the development of tools and systems to make this easier. I value using them myself. What I don’t see is any need to force the issue, e.g. by introducing further barriers to publication or penalties for failure to meet some new set of standards. The field is moving in a good direction already, and it’s doing so in an organic and user-driven way. Not perfectly by any means, but I also don’t see where the crisis is.

    For my own part, I like to try and write clear code, with documentation where useful, and I make my code freely available in public repositories. I do it primarily for my own benefit and that of people in my group, but if someone else is interested in using my code or reproducing an analysis then I help them to do that, often writing additional documentation to support it. I’ve always been a proponent of free and open software, and not just in science.

    I’m also very happy for people to advocate continued movement in this direction, to suggest voluntary standards of good practice, establish training material etc. What I object to, however, is a naive and sometimes sanctimonious attitude that people who aren’t working in a certain way are doing bad science, or are part of the problem, whatever the problem is.

    Much of the drive for full and uncompromising reproducibility comes from people with a particular perspective on what science involves. Not an invalid perspective but a limited one nevertheless. It’s frequently people for whom science is dominated by writing software and processing data. Many of them have come from software development into science, or perhaps even see themselves as moving in the opposite direction.

    Science, even computational science, is not software. Nevertheless there are many lessons from software development and engineering which have been beneficial for how we do computational science, and I’m sure this trend will continue.

    But people need to think carefully about using their view of good practice as a stick to beat up on other scientists, or make sweeping statements like ‘poorly documented code is bad code’ or ‘lack of time is no excuse’. Partly because yes it bloody well is, partly because it alienates people, and partly because that kind of thing can cut both ways.

    For example, here’s another sweeping statement: some of the most easily reproducible science is also the poorest. It’s work which is almost entirely data processing, gluing together standard tools, building pipelines and databases and doing obvious things. A lot of the activities involved in reproducibility are the kinds of busy-work that people like to do to avoid tackling the hard part of science: generating new ideas and genuine understanding. Our field is already swamped by this stuff – that’s the real crisis.

    If that paragraph offends you, maybe reflect on the fact that scientific careers are heavily dependent on being valued by one’s peers, so attacking others’ worth is a dangerous and too-easily deployed weapon.

    I like hearing about tools and approaches to make reproducibility easier, and how to integrate it into existing training for students. I’m totally uninterested in discussing ways to make it harder for me to publish, or to spend even more effort on time-consuming activities for which I get inadequate career reward. I have plenty of admin and teaching for that. And then there is peer review.

    To people who feel very strongly about this, I say be persuasive, not critical, and think about how to provide carrots, not make sticks.

    • I agree that much gets lost in translation in 140 characters, which is why I thought a blog post would be a good place to get my general thoughts down longhand and encourage comments – thank you for your fantastic response!

      It isn’t a crisis per se, but Titus made a good point: It’s good to try and do things right the first time around. We’re at a point where we can start to sow the seed of good practice (and that’s all I’m suggesting, not enforcing) to get our context around our code and software into a consistently descriptive format.

      I apologise if my post was coming across as sanctimonious, as that certainly wasn’t the intention. I too believe that buy-in to good practice comes from the carrot, not the stick. Show the benefits, and people will naturally want to join in the ecosystem. If they don’t, they get left behind. That doesn’t mean that their science is bad by any means, but they are making it harder for themselves to contribute, collaborate and communicate effectively.

      Poorly documented code is unhelpful, hard-to-reuse code, not bad code. I believe lack of time is just an excuse, because what should happen (hindsight is wonderful) is that documentation and community engagement are a *part* of programming and engineering, not an afterthought, as they so often are in science. I make time for my team to write and read documentation. We’ve got more papers out this way, which is a nice by-product. Again, carrot, not stick.

      Oh, I don’t get offended very easily, but I do disagree often 🙂 I completely agree with you that science careers, especially bioinformatics careers, are based on a very singular form of output, and this sorely needs to change. I’m involved in projects (COPO – @copo_project) to build infrastructure (pipelines? 😉 ) to enable scientists to catalogue and disseminate their research outputs as first-class semantic objects, allowing them to cite almost everything they do and track the full provenance of the work of others. This is a cultural shift as well as a technical one, so I certainly don’t subscribe to attacking the worth of others, not just because it’s hypocritical, but because I don’t think we even have a good system in place to ascribe “worth” to science anyway!

      I also value and like the tools and approaches for reproducibility. We are only content with and/or tolerant of the current traditional publishing paradigm because it’s all we’ve had since scientific discourse began, and our whole valorisation system is bolted onto it. This is the crisis, in my opinion. Too many researchers, especially data generators and bioinformaticians, are not getting the recognition and reward (see the recent Nature comment article) they deserve and need to promote their career progression and self-worth. I feel that giving researchers a simple, light set of conventions (maybe “requirements” is too strong) can actually help rather than hinder.

  2. Documenting bioinformatics code is a topic that always creates some debate. I think it would help to distinguish between different types of documentation and different types of code.

    For instance, documentation can exist at several levels in a piece of software. There are the in-depth descriptions of what a particularly intricate section of code does. There is the systematic description of what each function does, what it takes as input and what it returns as output. There is also a top-level description of what the tool does, its command-line options, etc. Then you need a description of what you have done, i.e. what input went through what software in what order – that would be your lab book (see PD at the bottom).

    When you take into account the type of software, the amount of documentation that you need/expect for a public API, a standalone tool or a quick bash script to launch a set of tools is not the same. Also as @froggleston mentioned, if you intend to publish a paper describing that particular piece of software, I expect/request the documentation to be of good quality.

    And just to add to the controversy, I don’t think more documentation is necessarily better. Knowing what level of documentation is necessary is part of being a good developer. Swamp your users/yourself with too much information and you will have failed. Don’t describe your complex data structures and you will feel like re-writing the code from scratch if you need to make any amendment in the future.

    PD: I agree with Aylwyn in that code and a lab book are not the same. I like to think that a piece of code is more like a lab protocol. Have you seen many comments in a lab protocol? Does it explain why you need to incubate overnight instead of for only two hours? Overall, I think we are doing a decent job.
