Training for Bioinformatics Triage

I was perusing the Twitter when this passed my peepers from the Grauniad, which got me thinking about a new blog post, one that also relates to the upcoming Software Carpentry course I’m helping to give at TGAC next month.

The inimitable Prof. Brian Cox states:

“I think if you’re not comfortable with the unknown, then it’s difficult to be a scientist … I don’t need answers to everything. I want to have answers to find.”

This is similar, but suitably different for the purposes of argument, to what a programmer wants:

“I think if you’re not comfortable with the unknown, then it’s difficult to be a programmer … How do I find the problem, and how does that affect my diagnosis to start searching for the solution?”

How do you teach people code diagnosis skills? Surely the ability to triage is based on personality, experience and simple dogged determination, and is reserved for experts with years of practice? Well, yes to the former, and thankfully a resounding no to the latter.

I’m a daily regular in the ##java freenode IRC channel, aimed at helping people with all manner of Java development questions. It’s a great, if a little aggressive, resource, and I’d recommend that developers in the Java space at least idle in there to pick up best practice. There are channels devoted to other programming languages too. Whatever the language, one of the common attitudes among entry-level and beginner programmers (the two are different) is that there is always a clear resource that will get you to your answer, and that someone must be able to provide that answer quickly. This simply isn’t the case. Often in programming, it’s a genuinely labour-intensive task to search, filter and read through documentation and tutorials to give yourself the best hope of attempting a solution. That said, learning how to find the problem, and being able to describe that problem, are imperative.

So, given that science is basically an exercise in delving into the unknown, it would seem sensible to conclude that scientists should make great programmers. Here are some steps I’ve found invaluable along the way, which should hopefully be applicable to those, particularly in the biological domain, who want to get started with or improve their programming and get into doing a bit of bioinformatics development themselves.

How do you ask the right question?

Know what you’re trying to achieve

  • You need to have a clear idea of what your end goal is, down to specific small packages of work. This is not the same as knowing what the problem is. 
  • A global goal, e.g. “Parse in my VCF file, search for variants in my region of interest, pull out those results, show a report”, can comprise a relatively daunting body of work for a beginner. However, this goal can be split into a number of far more manageable chunks, each with more granular goals (there’s a sketch of one possible breakdown after this list).
  • When you know what you’re trying to do, finding the problem becomes easier. A pencil and paper or a whiteboard is a great way to slow the thought process down and concentrate on the goals themselves, rather than how to complete them, which can throw up design problems before you’ve coded up a load of stuff.
  • Knowing where your code deviates from your goals is great for focusing the mind, keeping small chunks of information in your head, and hence great for triage.
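
To make that decomposition concrete, here’s a minimal sketch in Python (my choice of language for the examples in this post is arbitrary; the ideas apply just as well to Java or anything else). The function names, file name and coordinates are all illustrative rather than a prescribed API, and it assumes a plain, uncompressed VCF – a real project might well reach for an existing library such as pysam instead.

```python
# Hypothetical decomposition of "parse VCF, filter by region, report"
# into three small, individually testable chunks.

def read_vcf_records(path):
    """Yield (chrom, pos, ref, alt) tuples, skipping header lines."""
    with open(path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            yield fields[0], int(fields[1]), fields[3], fields[4]

def in_region(record, chrom, start, end):
    """True if a record falls inside the region of interest (inclusive)."""
    rec_chrom, pos, _, _ = record
    return rec_chrom == chrom and start <= pos <= end

def report(records):
    """Print a minimal report of the selected variants."""
    for chrom, pos, ref, alt in records:
        print(f"{chrom}:{pos} {ref}>{alt}")

if __name__ == "__main__":
    # Illustrative file name and coordinates only.
    hits = [r for r in read_vcf_records("variants.vcf")
            if in_region(r, "chr1", 1000, 2000)]
    report(hits)
```

Each function maps onto one granular goal, so when something goes wrong you already know which small chunk to stare at.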

Granularity leads to an excellent basis for test-driven development

  • If you know you are working towards a small goal with a clear outcome, i.e. a unit, write a test for it, i.e. a unit test.
  • Each time you add more code to your program, run the tests again; they make sure the downstream code that depends on earlier chunks is still sane.
  • Knowing that tests which previously passed have now stopped passing because of underlying code changes – a “regression” – is great for triage (see the sketch after this list).
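
As a sketch of what this looks like in practice (function and test names are mine, and pytest-style tests are assumed), a small chunk such as “is this variant inside my region of interest?” gets its own handful of tests:

```python
# A minimal unit-test sketch for one small, clearly scoped chunk of the
# larger goal. Run with pytest; every time the underlying code changes,
# re-running these tests flags any regression immediately.

def in_region(chrom, pos, region_chrom, start, end):
    """True if chrom:pos lies within region_chrom:start-end (inclusive)."""
    return chrom == region_chrom and start <= pos <= end

def test_inside_region():
    assert in_region("chr1", 150, "chr1", 100, 200)

def test_wrong_chromosome():
    assert not in_region("chr2", 150, "chr1", 100, 200)

def test_region_boundary():
    # Off-by-one errors at the region edges are a classic regression to catch.
    assert in_region("chr1", 200, "chr1", 100, 200)
```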

Learn how to use Google and how to skim read

  • You’d be amazed how many developer questions are prefixed with “I tried to Google, but I couldn’t find anything”.
  • It’s surprisingly easy to filter out cruft by doing a cursory broad search, then using the simple operators Google gives you to narrow things down, for example quoting an exact phrase and negating terms with a leading minus sign.
  • By reading a lot of documentation, you learn how to skim read to find the nuggets of relevance quickly. Scientists are usually great at this because they read a lot of scientific papers, and as such should make great documentation readers and question-askers in the bioinformatics space.
  • Poor search and documentation-processing skills are needlessly sloppy, and they are bad for triage.

Know the tools and techniques to help you pinpoint where errors are happening

  • Finding the bits of code that aren’t doing what you expect (which aren’t that many, because you’re writing tests now, right?) is probably the single biggest time sink.
  • Many practices and tools are available in all languages to help you find where the issues are:
    • Sensible logging – even wrapping code segments with output to stdout can be enough to act as crude breakpoints in larger bits of code (see the sketch after this list).
    • Debuggers – they may seem daunting, but they are second to none for tracking down stubborn bugs like race conditions.
    • Small code fragments – 10 100-line code snippets are far easier to debug than a single 1000-line one.
    • Read the APIs – the specifications and syntax of a language or library are crucial to understanding what things do, and what they don’t. This is overlooked surprisingly often by entry-level programmers.
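
To make the logging point concrete, here’s a sketch using Python’s standard logging module (the function name and toy data are illustrative). For the debugger point, Python ships a built-in debugger, pdb (python -m pdb script.py), and every mainstream Java IDE has a graphical debugger built in.

```python
# A sketch of "sensible logging": the standard library's logging module used
# as a lightweight way to breakpoint a larger piece of code without littering
# it with bare print statements.
import logging

logging.basicConfig(level=logging.DEBUG,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

def filter_variants(records, chrom, start, end):
    log.debug("filtering %d records against %s:%d-%d",
              len(records), chrom, start, end)
    kept = [r for r in records if r[0] == chrom and start <= r[1] <= end]
    log.info("kept %d of %d records", len(kept), len(records))
    return kept

if __name__ == "__main__":
    toy = [("chr1", 150, "A", "T"), ("chr2", 90, "G", "C")]
    filter_variants(toy, "chr1", 100, 200)
```

Once the bug is found, raising the level in basicConfig to logging.INFO silences the noisy debug lines, which is the advantage over scattered prints.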

Don’t paraphrase

  • If you have a specific problem related to a specific goal, don’t gloss over elements of your issue or your attempts at solutions.
  • Writing a simple, self-contained test case that demonstrates the problem is good practice, as it minimises ambiguity and misunderstanding (see the example after this list).
  • Being able to state categorically what you’ve attempted and any errors you see is great for triage.
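
For example, a self-contained test case attached to a question might be no more than a handful of lines that anyone can paste and run, with the expected and observed behaviour stated up front. Everything below is illustrative:

```python
# Minimal, self-contained demonstration of a problem, suitable for pasting
# into a question or bug report verbatim.
#
# Expected: position 100 should count as inside the region 100-200 (inclusive).
# Observed: it is reported as outside.

def in_region(pos, start, end):
    return start < pos <= end   # suspect line: should this be start <= pos?

print(in_region(100, 100, 200))  # prints False; I expected True
```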

What next?

So here’s the clincher. Training courses and tutorials are great ways to learn syntax, to speak to experienced developers and to try out new things. However, unless you have a vested interest or a job in bioinformatics, maintaining relevance to your own work after the course or tutorial is extremely hard. This is where experience and personality come in.

Whenever you see a problem in your day-to-day work, take the time out to see if you could work out how to help yourself by programming your way out of it. This is not time wasted. You’re training your brain to think in a programming context, which will make you quicker at diagnosing issues in the future.

Similarly, learning best practice is not time wasted. “I don’t have time” is synonymous with “I am not motivated to do things properly”. A sensible scientist wouldn’t leave their paper acceptance chances in the hands of a knowingly hasty and flawed experimental design, and a bioinformatician shouldn’t leave their published code to the same fate. Hacking scripts together is commonplace and ubiquitous, and for good reason – it’s procedural glue for your tasks that can be automated to some extent. However, a lot of the time spent in triage as a result of this quick-and-dirty development can be avoided by making things easier on yourself: take the time to learn version control, learn the proper conventions of the language, name everything concisely, use unit tests, and document regularly and as fully as is relevant and possible. I like using travelling downtime to document my code and processes.

Finally, the more people that understand code triage, the more people will be better qualified to undertake peer review of software in bioinformatics, which is an area that is sorely lacking.

Updated

Changed the regression sentence for a bit more clarity – thanks for the suggestion Mick!


Retiring Scientific Ideas

I thought I’d wade in and start writing a blog about science and the errant wisps of thought that float around. I’ve been following a few blogs for a while, and whilst I like the commenting idea, I often find I’d rather write about it afresh. Seeing as how I now “do Twitter”, this seemed like the next level – please comment constructively and tell me I’m talking rubbish.

There’s been much talk over a wide range of topics within said blogs, from “cats and dogs”, to “harassment”, to “how to publish science”. I have views on all these subjects, and I guess I’ll try to condense them down into a single meandering waft.

It all came together when I happened upon the edge.org news item regarding the next big question: “WHAT SCIENTIFIC IDEA IS READY FOR RETIREMENT?”

Given the previous Edge questions, such as “WHAT WILL CHANGE EVERYTHING?” and “WHAT DO YOU BELIEVE IS TRUE EVEN THOUGH YOU CANNOT PROVE IT?“, I initially thought this new offering a bit under-existential, but felt it brought some of the previous discussion topics I mentioned earlier into focus.

Cats and Dogs

Mick Watson recently opined on Ewan Birney’s driving factors in personal scientific attitudes, quite conveniently compartmentalised into two four-legged analogies. Whilst broadly generalising scientists into two distinct categories is easy, the reality is not that simple, and being “catlike” has distinct downsides in scientific outlook. It’s a common preconception that PIs and bioinformaticians are “cats” – powerful, independent, gatekeeper-like entities. “Dogs” are viewed as group workers, following the direction of the pack. As Mick says, there are times to be a cat and times to be a dog, but on the whole scientists need to be both. I would disagree, in that scientists need to be dog most of the time, unless the situation warrants cat-like behaviour. Fierce, unwavering catlike independence is great for moments of personal development and for the motivation to carry on with a problem when everything is screaming at you to drop it (which is why the link is made to bioinformatics – coding is a labour of cat-like love). Fierce, unwavering catlike independence is also shitty for a myriad of reasons: it often comes across as arrogance, and science has more than enough of that already; it’s counter-productive in project-based discussions where consensus is essential (labouring a viewpoint is commonplace); it stifles interoperability on a personal and practical scale; and it breeds resentment of other cats and promotes sectarianism. This widely-posted PLOS article outlines these pigeon-holed behaviours, where all the types described are “cat-like” traits. Luckily, the article also describes how to be more dog in each instance. As a self-professed dog lover, I’m probably biased in this analogy.

Harassment

We’ve all read the recent descriptions, and the subsequent outpourings of mutual support for the victims, of sexual harassment in the scientific world, demonstrated as naming and shaming of the perpetrators. Whilst I’m not going to add my 2 pence here as to why sexual harassment is so completely horrific (because my feelings are more than ably summarised by people far more eloquent than myself), it raises an issue for the greater whole. Why does science have a problem with harassment in the first place? Why the continual propagation of not just sexual harassment, but harassment over ideas? Furthering knowledge is no more a gender-exclusive pursuit than is the ability to tie shoelaces, or carry out the most basic of bodily functions. History could be, and most surely is, to blame, but I’m pretty sure we don’t live in the 40s anymore. To be clear, I’m not talking about scientific debate – that is natural disagreement with a view to pushing one’s envelope. I’m talking about personally denigrating someone for their scientific viewpoint. This came to a head with the ENCODE debacle. Whether you like 80% functional as a figure or not, and whether you like the notion of “big science” or not, the belittling of those involved in the project by others was, in my view, an utter dick move for science. What better way to make the field of science look and feel like a disagreement at a child’s ballpool birthday party to those “on the outside” than to publish an article with the tone of a sarcastic ageing relative you can’t get rid of at Christmas? The shame is that the content of the article is generally sound, and does bring to mind the discrepancies in funding availability for smaller science projects in this world of “big data”, “big science” and “big ideas”. Which brings me nicely on to…

How To Publish Science

The ideology behind science is the dissemination of findings based on empirical observations, but this empiricism is not essential for promoting scientific discourse. I couldn’t care less if Randy got a Nobel and shunned the very same “glamour” journals he used to disseminate knowledge – his message, in time, will remain the same: “you don’t need a journal’s impact factor to help your career“. Sure, he’s had the benefits of hindsight, success and publicity to make that point. Some don’t like his eLife-bolstered open access stance when coupled with the silver-spoon hypocrisy it’s based on. I couldn’t care less. Casting stones seems to be rife, with people staking claims that their own dedication to open access began earlier, and with far more vigour, than that of hapless dinosaur Randy. This is the hipster way, and it’s driven by the same desire for “glamour” that drives impact factors.

Similarly, one thing that exposes the nasty side of being bent over the impact-factor barrel is the stories of science group managers demanding that papers from the group only be published in journals with an impact factor at or above a certain score. This promotes closed-world behaviour in postdocs and RAs, and is tantamount to harassment. Science gains no intrinsic value from having a shiny wrapper on it, and worse still is the persistent notion that publishing your science in a non-glamour journal means your science is somehow worth less. Given the monumental fubars in the closed review process, whereby trojan papers have made their way into these gold-standard publications, the beauty of modern-day science dissemination is that Twitter, open access publishing and open peer review (arXiv, F1000, PeerJ, PubPeer, etc.) are being actively embraced by new researchers and outwardly recommended by research leaders. This truly is the way forward for good science. The problem resurfaces, however, when attempting to rationalise your research outputs to funding bodies and tenure panels without the archaic traditional measurements of scientific success. That is not the way forward for good science and needs careful consideration in future. It genuinely worries me that some of the people I have discussed this with fear that if they don’t publish their work in these glamour journals, their careers will suffer. This is a horrible existence in a modern scientific arena.

So, what scientific idea is ready for retirement? Sectarian Closed Science.

Sectarianism and closedness need to die in science, and I’m both disturbed and heartened in equal measure by the precariously balanced see-saw of the world around scientific endeavour. We’re teetering on the edge of something shamefully simple yet vital, and that is science for everyone. Scientific careers that are recognised for what the science is, not where it’s published. Science that is freely available for discussion and debate, from sofas to auditoriums, without fear of personal attacks. Science that excludes no one.