Bioinformatics Basics: A Training Conundrum

A couple of tweets got me started on what turned out to be a long old journey through the realms of ideology around the best training medium for presenting bioinformatics to novices:

And later on:

Mick’s original statement proved to be a very popular starting point for discussion with over 140 replies to date, many of them me trying to get to the bottom of why IPython Notebook was viewed as suboptimal for training, and at worst, pointless for getting students (herein, “students” refers to “bootcamp attendees” or “training sponges”, not students in the common educational sense) engaged in learning Python.

Before I start, I’ll reiterate a tweet I sent:

So, to summarise a very long line of tweets, I’ll concentrate on a few key points: abstraction (and why it’s helpful for learning); should novices be thrown in at the deep end (and why bioinformatics needs less “dark arts magician” and more “it doesn’t take much training to become relatively effective”); bioinformatics training doesn’t have to be contextualised by a specific research goal (and why general foundations in best practice are far more valuable). This isn’t a blog post about why I love Python (which I don’t) or why everyone should use IPython Notebook (which they shouldn’t).


A few of the tweets circled around the idea of abstraction, i.e. layering up of (usually) simplifications to a paradigm to make things easier or to restrict a complex domain to more manageable foci. Some thought (myself included) that abstraction was good for novices, and others thought otherwise, preferring “realism” to “obfuscating the reality with layers”. Firstly, abstraction is not obfuscation:

retain only information which is relevant for a particular purpose (Wikipedia)

The key word here is “retain”. You aren’t throwing away information, merely providing facile inroads to key points for the particular areas you’re concentrating on. Abstraction done well should mean traversing concepts from the abstract to the real is natural and intuitive. Done badly, it’s confusing and leads to obfuscation. In the case of the Software Carpentry (SWC) bootcamps, IPython Notebook (ipynb) provides abstraction over the IPython interpreter, which in turn is an abstraction over the Python interpreter, all contained in a very simple web page. Everyone recognises and understands the web nowadays. The argument that the ipynb web page format is confusing to a novice and doesn’t help them learn is rubbish. A central feature of ipynb is that the code that you write in each interpreted “cell” is valid standalone (in the context of the page of the notebook) Python code. It’s a glorified text editor that lets you run Python code, akin to something like (albeit done better). You can copy and paste the code into a script using vi and the shell, which are also taught prior to the Python section in a typical SWC bootcamp. The students are not being misled, and the goal is to teach them basic Python, not ipynb itself. This is abstraction used well, in my opinion.
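To make the "valid standalone Python" point concrete, here is a minimal sketch of the sort of thing a novice might type into a single notebook cell. The `gc_content` helper and the sequence are made up for illustration and are not taken from any SWC lesson. The point is only that nothing in the cell is notebook-specific.

```python
# Hypothetical cell contents: plain Python, nothing notebook-specific.

def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        # Avoid dividing by zero on an empty sequence.
        return 0.0
    gc = sequence.count("G") + sequence.count("C")
    # float() keeps the division exact on older Python versions too.
    return float(gc) / len(sequence)

# A made-up sequence, purely for illustration.
dna = "ATGCGCTA"
print(gc_content(dna))
```

Pasted into a file with vi (say `gc_content.py`, a hypothetical name) and run with `python gc_content.py` from the shell, the same lines should behave exactly as they do in the notebook cell.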


Mick and Nick Loman both hold the opinion, from the tweets they have made, that novices should learn realism first. I assume this means sitting a student down at a terminal, teaching them the shell, then tools that they need to use to do a particular analysis. Firstly, and this really is a subject for another blog post but merits mentioning anyway, not all bioinformatics is data analysis. Moving on, I’m talking (hah) more abstract – the period before you even get to analysis which underpins everything bioinformaticians do in the shell, which is monkey work. You don’t need to be a pro, and the more basic you can make the entry point, the better.

Nick also tweeted that you can teach someone the concept of the shell in one line. Whilst this is true, the reality is immediately deeper (note, not immediately more complex) – the shell takes a short time to learn the basics, and a lifetime to master. After 15 years of using the Linux shell and, later on, related day-to-day bio-admin toolchains I’m still finding treats, tidbits, and the need to revisit reference material to help me complete a task. It’s a behemoth of epic proportions, and completely out of the realm of a novice to learn it in a short time. Showing someone the true power of the shell and its environs in a short time is truly daunting to a novice and promotes the all-too-common fallacy that bioinformatics is intrinsically hard, that only computer nerds with poor social skills and pasty complexions can truly succeed, and that you have to be a sociopath to speak to one let alone start a collaboration. This, my friends, is the monumental pile of steaming turds, festering away in the corner of labs and institutions all over the globe. This is why people are still stuck using Excel for data analysis. This attitude needs to go away, and soon.

That said, learning some very simple basics can provide a great foundation to get the student working with IO, file manipulation, sorting, searching and interoperability. The shell isn’t a dark art in itself. The majority of tasks can be done by someone with a short introduction and a good dose of on-the-job full-time experience (in the order of a couple of weeks). This is what the SWC bootcamps are trying to do – instil the notion that everyone can learn enough of the shell in 2 days to become more confident to tackle their own research problems.

Training for bioinformatics research

As has been noted recently in blogs and papers, there is a wide range of demand and skill sets for the budding and indeed accomplished bioinformatician. User level, engineer level, researcher level – whatever the level, best practice is vital.

The realism problem rears its ugly head here. By concentrating on the exact problems in mind you’re missing out on all the underlying ideology to make you a better bioinformatics user/engineer/researcher from the outset. Mick recently said that “most bioinformaticians are bad scientists” (provocatively of course – unsure of how much seriousness was present). I wince at this statement. Most bioinformaticians haven’t been given the opportunity to learn what being a good researcher is, because they are required to get started on a problem ASAP, analyse some data, produce a result, and get a middle-author gig for the foreseeable future. Eeesh. This is horrible. One of the attendees at a recent bootcamp told me that they were being asked to analyse a particular type of dataset that they hadn’t seen before. They wanted training (an exact course to help them was actually coming up) but their lab manager said that it wasn’t suitable, so the attendee was considering paying with their own money. W. T. Actual. F. Transferable skills are probably the most valuable things you can give a researcher.


I truly believe that frameworks like ipynb help novices to quickly get started with hacking in Python. The reference material is interspersed with the code that actually runs within the page, allowing real-time assessment by the student themselves of the code they have written and the output (and errors!) they get. This has to be more effective than a massive wodge of printed out text or book and a blank Python interpreter for getting novices up a couple of rungs on the ladder.

The whole point of ipynb is not to teach students how to use ipynb to do their research. It’s a tool to at the very least get them started in understanding what comprises bioinformatics research, how best to go about learning the fundamentals, and how to apply them in a safe environment – away from pestering lab/project/group managers. You don’t need years of experience to be effective in bioinformatics, and the training I’m talking about isn’t aimed at shoving this pseudo-experience down their throats at a bootcamp. It’s ground-work. Preparation. Core principles that make bioinformatics what it is – interesting, fun, and at the end of the day, valuable to the community. I couldn’t agree with this tweet more:

Budding bioinformaticians should learn these central tools that are so ubiquitous that not doing so would be needlessly ignorant and an unwarranted misappropriation of their own time. Likewise, I argue that the entry point to the world of bioinformatics should be sensible, measured and consistent – learn the basics of the shell, versioning systems, a suitable programming language, and the notion of reproducibility and best practice. Every time.

These concepts should be just as ubiquitous as screen and parallel.