Training for Bioinformatics Triage

I was perusing the Twitter when this passed my peepers from the Grauniad, which got me wondering about a new blog post related to the upcoming Software Carpentry course I’m helping to give at TGAC next month.

The inimitable Prof. Brian Cox states:

“I think if you’re not comfortable with the unknown, then it’s difficult to be a scientist … I don’t need answers to everything. I want to have answers to find.”

This is similar, but suitably different for the purposes of argument, to what a programmer wants:

“I think if you’re not comfortable with the unknown, then it’s difficult to be a programmer … How do I find the problem, and how does that affect my diagnosis to start searching for the solution?”

How do you teach people code diagnosis skills? Surely the ability to triage is based on personality, experience and simple dogged determination, and is reserved for experts with years of practice? Well, yes, and thankfully a resounding no.

I’m a daily regular in the ##java freenode IRC channel, aimed at helping people with all manner of Java development questions. It’s a great, if a little aggressive, resource and I’d recommend that developers in the Java space at least idle in there to pick up best practice. There are other channels devoted to the other programming languages too. Whatever the language, one of the common attitudes among entry-level and beginner programmers (the two are different) is that there is always a clear resource to help you get to your answer, and that someone must be able to provide an answer quickly. This simply isn’t the case. Often in programming, it’s a really labour-intensive task to search, filter and read through documentation and tutorials to give yourself the best hope of attempting a solution. That said, learning how to find the problem and being able to describe that problem are imperative.

So, given that science is basically an exercise in delving into the unknown, it would seem sensible to conclude scientists should make great programmers. So here are some steps that I’ve found invaluable along the way, which should hopefully be applicable to those, particularly in the biological domain, who want to get started or improve their programming and get into doing a bit of bioinformatics development themselves.

How do you ask the right question?

Know what you’re trying to achieve

  • You need to have a clear idea of what your end goal is, down to specific small packages of work. This is not the same as knowing what the problem is. 
  • A global goal, e.g. “Parse in my VCF file, search for variants in my region of interest, pull out those results, show a report”, can comprise a relatively daunting body of work for a beginner. However, this goal can be split into a number of far more manageable chunks, each with more granular goals.
  • When you know what you’re trying to do, finding the problem becomes easier. A pencil and paper or a whiteboard is a great way to slow the thought process down and concentrate on thinking about the goals, rather than how to complete them. This can immediately throw up design problems before you’ve coded up a load of stuff.
  • Knowing where your code deviates from your goals is great for focusing the mind, keeping small chunks of information in your head, and hence great for triage.
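As a rough illustration of that splitting, here is a minimal Python sketch of the VCF goal above broken into three small pieces. The function names, the Variant record and the simplified tab-separated parsing are my own assumptions for the example, not a real VCF library, and it only reads the CHROM, POS, REF and ALT columns.

```python
from typing import Iterable, Iterator, NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal record; a real VCF line carries more columns."""
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_vcf_lines(lines: Iterable[str]) -> Iterator[Variant]:
    """Goal 1: turn VCF text lines into Variant records, skipping headers."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank lines and '#'/'##' header lines carry no variants
        fields = line.rstrip("\n").split("\t")
        # VCF column order: CHROM, POS, ID, REF, ALT, ...
        yield Variant(fields[0], int(fields[1]), fields[3], fields[4])


def in_region(variant: Variant, chrom: str, start: int, end: int) -> bool:
    """Goal 2: decide whether one variant falls in the region (inclusive)."""
    return variant.chrom == chrom and start <= variant.pos <= end


def report(variants: Iterable[Variant]) -> str:
    """Goal 3: format the selected variants as a plain-text report."""
    return "\n".join(f"{v.chrom}:{v.pos} {v.ref}>{v.alt}" for v in variants)


def variants_in_region_report(lines, chrom, start, end) -> str:
    """The global goal, expressed as the three small goals chained together."""
    hits = (v for v in parse_vcf_lines(lines) if in_region(v, chrom, start, end))
    return report(hits)
```

If the report comes out wrong, each small function can be checked on its own, which narrows down where the code deviates from the goal.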

Granularity leads to an excellent basis for test-driven development

  • If you know you are working towards a small goal with a clear outcome, i.e. a unit, write a test for it, i.e. a unit test.
  • Each time you add more code to your program, the tests are run again, making sure the downstream code that uses each unit is still sane.
  • Knowing that tests previously passed and have now stopped working because of underlying code changes, i.e. a “regression”, is great for triage.
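To make the idea of a unit and its unit test concrete, here is a small sketch using Python’s built-in unittest module. The gc_content function is a hypothetical unit I’ve made up for the example; the point is only that a small goal with a clear outcome is easy to pin down with a test.

```python
import unittest


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


class GcContentTest(unittest.TestCase):
    def test_half_gc(self):
        self.assertAlmostEqual(gc_content("ATGC"), 0.5)

    def test_lowercase_is_accepted(self):
        self.assertAlmostEqual(gc_content("ggcc"), 1.0)

    def test_empty_sequence_is_rejected(self):
        # An empty sequence has no meaningful GC fraction, so it is an error.
        with self.assertRaises(ValueError):
            gc_content("")
```

Saved in a file, these can be run with "python -m unittest". If a later change to gc_content makes one of them fail, that is a regression, and you know roughly where to look.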

Learn how to use Google and how to skim read

  • You’d be amazed how many developer questions are prefixed with “I tried to Google, but I couldn’t find anything”.
  • It’s surprisingly easy to filter out cruft by doing a cursory broad search, then using the simple operators that Google gives you to filter, for example the exact query quotes, and negation.
  • By reading a lot of documentation, you learn how to skim read to find the nuggets of relevance quickly. Scientists are usually great at this because they read a lot of scientific papers, and as such should make great documentation readers and question-askers in the bioinformatics space.
  • Poor search and documentation-reading skills are sloppy, avoidable and bad for triage.

Know the tools and techniques to help you pinpoint where errors are happening

  • Finding the bits of code that aren’t doing as you expect (which aren’t that many because you’re writing tests now, right?) is probably the single biggest time sink.
  • Many practices and tools are available in all languages to help you find where the issues are:
    • Sensible logging – even wrapping code segments with printing to stdout can be sufficient to breakpoint larger bits of code.
    • Debuggers – may seem daunting, but they are almost second to none for finding potentially stubborn bugs like race conditions.
    • Small code fragments – ten 100-line code snippets are far easier to debug than a single 1000-line one.
    • Read the APIs – the specifications and syntax of a language or library are crucial to understand what things do, and what they don’t. This is amazingly frequently overlooked by entry-level programmers.
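For the logging point, a minimal sketch with Python’s standard logging module might look like the following. The mean_quality function and the logger name are invented for the example; the idea is just that messages before and after each step show how far the code got and what values it saw.

```python
import logging

# Show DEBUG messages and above; in real code this is usually set once at start-up.
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("triage_example")


def mean_quality(scores):
    """Average a list of per-base quality scores, logging each step."""
    logger.debug("mean_quality called with %d scores", len(scores))
    if not scores:
        logger.warning("no scores supplied; returning 0.0")
        return 0.0
    total = sum(scores)
    logger.debug("total=%s", total)
    return total / len(scores)
```

Raising the level to logging.WARNING later silences the debug chatter without having to delete any lines, which is one advantage over bare print statements.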

Don’t paraphrase

  • If you have a specific problem related to a specific goal, don’t gloss over elements of your issue or your attempts at solutions.
  • Writing a simple self-contained test case demonstrating the problem is good practice, as it minimises ambiguity or misunderstanding.
  • Being able to state categorically what you’ve attempted and any errors you see is great for triage.
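As a sketch of what a simple self-contained test case can look like, the snippet below reproduces one specific failure with nothing else around it and captures the exact error text rather than a paraphrase. The parse_position function and the "1,024" input are hypothetical stand-ins for whatever your real problem is.

```python
import traceback


def parse_position(field: str) -> int:
    """The one small piece of code under suspicion."""
    return int(field)


def reproduce() -> str:
    """Run the failing call and return the full traceback text verbatim."""
    try:
        parse_position("1,024")  # the exact input that triggers the problem
    except ValueError:
        return traceback.format_exc()
    return "no error"
```

Pasting the code and the output of reproduce() into a question gives whoever is helping you the input, the call and the full error message, with no room for misunderstanding.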

What next?

So here’s the clincher. Training courses and tutorials are great ways to learn syntax, to speak to experienced developers and to try out new things. However, unless you have a vested interest/job in bioinformatics, maintaining relevance to your work following the course or tutorial is extremely hard. This is where experience and personality come in.

Whenever you see a problem in your day-to-day work, take the time out to see if you could work out how to help yourself by programming your way out of it. This is not time wasted. You’re training your brain to think in a programming context, which will make you quicker at diagnosing issues in the future.

Similarly, learning best practice is not time wasted. “I don’t have time” is synonymous with “I am not motivated to do things properly”. A sensible scientist wouldn’t leave their paper acceptance chances in the hands of a knowingly hasty and flawed experimental design, and a bioinformatician shouldn’t do the same when publishing code. Hacking scripts together is commonplace, and for good reason – it’s procedural glue for your tasks that can be automated to some extent. However, a lot of the time spent in triage as a result of this quick and dirty development can be avoided by making things easier on yourself. That includes taking the time to learn version control, learn the proper conventions of the language, name everything concisely, use unit tests, and document regularly and as fully as is relevant and possible. I like using travelling downtime to document my code and processes.

Finally, the more people that understand code triage, the more people will be better qualified to undertake peer review of software in bioinformatics, which is an area that is sorely lacking.

Updated

Changed the regression sentence for a bit more clarity – thanks for the suggestion Mick!