JSON representations of ENA objects

Generating JSON from ENA accessions

I’m currently hoovering up ENA data for consumption by collaborators and researchers on the Norwich Research Park, and am placing the metadata about the accessions into iRODS for more efficient data management.

The ENA has some handy REST APIs for getting at accession information, but they return tab-delimited text. Here’s a script to convert that TSV output into a nicer JSON representation. You can then parse, filter and pretty-print the output using a tool like jq (see the examples after the script).

#!/bin/bash

## Generate JSON output from an ENA accession
## Author: Rob Davey, The Genome Analysis Centre (TGAC), UK
## http://www.ebi.ac.uk/ena/data/warehouse/usage

## supply an optional local path to link to the accession.
## handy for importing into iRODS for example
LOCALPATH=""

## returned result type. defaults to "read_run"
RESULT="read_run"

## the leading ":" puts getopts into silent mode, so the ":" case
## below actually fires when an option is missing its argument
while getopts ":hl:r:" opt; do
  case "$opt" in
  h)
    echo "show_ena.sh [options] ACCESSION"
    exit 0
    ;;
  l) LOCALPATH="$OPTARG"
     ;;
  r) RESULT="$OPTARG"
     ;;
  :) echo "ERROR: Option -$OPTARG requires an argument." >&2
     exit 1
     ;;
  \?) echo "ERROR: Invalid option: -$OPTARG" >&2
      exit 1
      ;;
  esac
done

## first positional parameter after any options
IN="${@:OPTIND:1}"

## project, study, sample, experiment, or run accession. if no positional parameter exists, read from stdin
[ -n "$IN" ] && PROJ="$IN" || read -r PROJ
if [ -z "$PROJ" ]; then
  echo "No accession supplied" >&2
  exit 1
fi

OUT=$(curl --silent "http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=$PROJ&download=text&result=$RESULT")

i=1
HEADERS=()

echo "["

while read -r line; do
  ## substitute tabs for commas. IFS doesn't do nice multi-whitespace separation.
  ## NB this assumes no field value contains a literal comma
  line=${line//$'\t'/,}

  if [ $i -eq 1 ]; then
    ## parse headers
    IFS=',' read -r -a HEADERS <<< "$line"
  else
    ## parse values
    echo "{"
      while IFS=',' read -r -a VALUES ; do
        for j in "${!HEADERS[@]}" ; do
          echo "\"${HEADERS[j]}\":\"${VALUES[j]}\","
        done
      done <<< "$line" | sed '$s/,$//'

      ## insert local path if supplied by -l flag
      if [ -n "$LOCALPATH" ]; then
        echo ",\"local_path\":\"$LOCALPATH\""
      fi

    echo "},"
  fi
  i=$((i + 1))
done <<< "$OUT" | sed '$s/,$//'

echo "]"

It’s also available through my GitHub. Comments gratefully received, as always!

UPDATE v2:

Now reads from stdin as well as from a supplied accession. This is helpful if you’ve already downloaded read files from the ENA, but want to get the metadata for them:

find "$(pwd)" -name "*.fastq*" | \
awk 'match($1, /[SE]RR[0-9]+/, m) { print $0, m[0] }' | \
xargs -L 1 bash -c 'show_ena.sh -l "$0" "$1"'

This command will find all the FASTQ files in the current directory (potentially with a further extension, like .gz), match those with an ENA-like accession in the filename, and hand the local file path and accession to the above script. Note that the three-argument form of match() is GNU awk-specific.


An open science abstract

I’ve been invited to give a talk soon on open science and bioinformatics, and this is my abstract. I wanted to get feedback as I think I might have polarised it too much from the outset, but I wanted to be intentionally critical of the status quo 🙂

Thoughts and comments are very welcome please! Thanks to the following for their input: Chris Cole, Richard Smith-Unna, Michael Markie, Torsten Seemann

Open science needs open scientists: an ever-increasing interdependence

In recent years, scientific research has experienced an interesting juxtaposition. There is increasing pressure from funding bodies to make research data accessible. At the same time, researchers need the increasingly sensational track records built on publication in high-impact journals to ensure continued project support and/or tenure.

Pressure from funders to release data, whilst obviously a step in the right direction, represents a formidably large stick but a depressingly small carrot, resulting in simply another tedious hurdle to getting research published rather than a vehicle for recognition. The constant push for papers in journals with perceived impact and prestige, whilst still seen as a key assessment mechanism for a researcher’s career, promotes a closed-door approach and a touch of paranoia, with research becoming a competitive endeavour rather than a mutually beneficial collaborative one.

Thankfully, a new breed of researchers at all career stages, from graduate to PI, who can see the mutual benefits of sharing their work openly, is growing in number and becoming more vocal by the day. Open source code, open data, powerful tools and infrastructure, social networks, and open access publishing all play a part in the ecosystems of the Open Science movement.

Your code is your lab book

Code, reuse and documentation

An extensive, if slightly contentious, Twitter conversation popped up recently following this tweet from Aylwyn Scally:

What followed was a rather long discussion about how, when and even if code should be documented. Whilst the resounding answer in general should be “yes”, there’s lots of grey in between, ranging from “good code shouldn’t need documentation”, to “if you publish a tool, it needs minimal documentation for use and reuse”, to “documentation is a time pressure”. Some tweets however got a bit scary, with some useful ripostes (NB I’m not picking on Aylwyn here at all – it’s just relevant context):

https://twitter.com/jetpack/status/585734840420540416

I would recommend reading the whole thread as there are some really interesting points raised on all sides. However, one tweet stood out for me:

Why shouldn’t code be like a lab book?

Consider the lowly paper lab book. A stalwart of both the wet and dry lab, it’s a place to store your thoughts, processes and even results. The context to those thoughts, processes and results is just as important in the lab notebook as it is in a block of code.

At first glance, some bits of code don’t require much, if any, documentation – “a = b+c” would seem pretty self-explanatory given some assumptions over the syntax. However, C++ is a language that allows operator overloading. Given that we made assumptions about what b and c represent, we reasonably assume that the + operator has an additive effect. What if the context for this code dictates that b and c are actually strings, and + concatenates the first half of b with the latter half of c? This is a well-known problem, but it’s relevant to why documentation is vital wherever possible.

Abstracting up from this, a tool requires even more documentation to be useful. Even the simplest tools have helper context – even though I use “ls” every day, and I use a subset of the flags from muscle memory alone, “man ls” gives me a 243-line file that describes everything about a tool that simply lists directory contents. I’ve seen bioinformatics tools with less documentation than this. Although I don’t read the documentation every time I run “ls”, it’s necessary for communicating what the tool does, how to change its behaviour, and most importantly, how to reproduce what someone else is seeing when they run the same tool – “If you run ls like this, you should see the same thing I do“.

The lab notebook is a labour of love, just like code. Picking up someone else’s notebook can either be an incomprehensible alien landscape, or a journey through someone’s methods with a handy tour guide. In addition, the necessity of traditional publication guidelines means the former gets distilled into the latter (I’m not going to get into the argument about the pros and cons of traditional publishing here – that’s for another time). I’ll call this context distillation. Publishing code (and by publishing here, I mean in the sense that you’re intending for other people to read and assess it, not just sticking it on GitHub) needs the same rigour. Sure, I could post up scanned images of my lab notebook, but without documentation, it would be pretty hard for someone to pick up my outputs and use them in this form.

Context distillation

Coding/engineering conventions help a bit here – they can provide part of that distillation, taking a hacky script to a publishable piece of reusable, understandable code:

Command line tool conventions: http://www.gigasciencejournal.com/content/2/1/15

Engineering conventions: http://www.gigasciencejournal.com/content/3/1/31

I’m slightly biased (I reviewed both these manuscripts), but they are a nice starting point for a range of simple requirements that can help bioinformaticians write better tools, one of which is providing help and documentation to users. This was mirrored in the tweetpile:

This is key. We need, as a community, usable requirements and standards for saying “this is how code should go from being available to being reusable“. How do we get our lab notebook code into that form via a number of checkpoints that both programmers and reviewers agree on?

Every researcher has a responsibility to distil their methods effectively. Their papers just wouldn’t (shouldn’t?) get published otherwise. Coders are in the same boat.

The “I don’t have time” excuse doesn’t wash.

Provide the minimum set of required documentation, as agreed by a community, to go from “your mileage may vary” code to “I’d be surprised if your mileage varies” code. C. Titus Brown (are there bioinformatics blog posts that don’t mention Titus?) and Brad Chapman proposed a reviewer criteria checklist that goes some way to providing these requirements for context distillation, but notably omits anything regarding documentation.

Sounds like there’s more work to be done!

DOI to RDF to JSON-LD

It’s Friday afternoon, so why not write a DOI-to-JSON-LD resolver, I thought?

Luckily, web services already exist to help me on this:

http://dx.doi.org/ – The global DOI service (that has a REST API yay!)

http://rdf-translator.appspot.com/ – An RDF translator (that has a REST API yay!)

So, putting these together with a bit of curl, you get:

#!/bin/bash

## resolve a DOI to RDF/XML, then convert that RDF to JSON-LD
DOI="$1"
OUT="$2"

## content-negotiate RDF/XML from the DOI resolver.
## -L follows redirects; -D - dumps the response headers to stdout
curl -D - -L -H "Accept: application/rdf+xml" "http://dx.doi.org/$DOI" -o "$OUT.rdf.xml"

## POST the RDF/XML to the rdf-translator service to convert it to JSON-LD
curl --data-urlencode "content@$OUT.rdf.xml" "http://rdf-translator.appspot.com/convert/detect/json-ld/content" -o "$OUT.json"

The script takes 2 parameters – firstly the DOI itself, and secondly the output filename (no extensions, just the name!).

NB: thanks to @rmounce for the heads-up about DOI complexities. Put the DOI in double quotes, and bash will be happy.

So for example:

./doi-resolve-to-jsonld.sh "10.1126/science.1249721" choulet

Gives us 2 files:

choulet.rdf.xml
choulet.json

The first is the RDF XML output from the DOI resolver, and the second is the converted JSON-LD.


Enjoy! 🙂

Bioinformatics Basics: A Training Conundrum

A couple of tweets got me started on what turned out to be a long old journey through the realms of ideology around the best training medium for presenting bioinformatics to novices.

And later on:

Mick’s original statement proved to be a very popular starting point for discussion with over 140 replies to date, many of them me trying to get to the bottom of why IPython Notebook was viewed as suboptimal for training, and at worst, pointless for getting students (herein, “students” refers to “bootcamp attendees” or “training sponges”, not students in the common educational sense) engaged in learning Python.

Before I start, I’ll reiterate a tweet I sent:

So, to summarise a very long line of tweets, I’ll concentrate on a few key points: abstraction (and why it’s helpful for learning); should novices be thrown in at the deep end (and why bioinformatics needs less “dark arts magician” and more “it doesn’t take much training to become relatively effective”); bioinformatics training doesn’t have to be contextualised by a specific research goal (and why general foundations in best practice are far more valuable). This isn’t a blog post about why I love Python (which I don’t) or why everyone should use IPython Notebook (which they shouldn’t).

Abstraction

A few of the tweets circled around the idea of abstraction, i.e. layering (usually simplifying) views onto a paradigm to make things easier, or to restrict a complex domain to more manageable foci. Some thought (myself included) that abstraction was good for novices, and others thought otherwise, preferring “realism” to “obfuscating the reality with layers”. Firstly, abstraction is not obfuscation:

retain only information which is relevant for a particular purpose (Wikipedia)

The key word here is “retain”. You aren’t throwing away information, merely providing facile inroads to key points for the particular areas you’re concentrating on. Abstraction done well should mean traversing concepts from the abstract to the real is natural and intuitive. Done badly, it’s confusing and leads to obfuscation. In the case of the Software Carpentry (SWC) bootcamps, IPython Notebook (ipynb) provides abstraction over the IPython interpreter, which in turn is an abstraction over the Python interpreter, all contained in a very simple web page. Everyone recognises and understands the web nowadays. The argument that the ipynb web page format is confusing to a novice and doesn’t help them learn is rubbish. A central feature of ipynb is that the code that you write in each interpreted “cell” is valid standalone (in the context of the page of the notebook) Python code. It’s a glorified text editor that lets you run Python code, akin to something like ideone.com (albeit done better). You can copy and paste the code into a script using vi and the shell, which are also taught prior to the Python section in a typical SWC bootcamp. The students are not being misled, and the goal is to teach them basic Python, not ipynb itself. This is abstraction used well, in my opinion.

Novices – the deep end problem

Mick and Nick Loman both hold the opinion, from the tweets they have made, that novices should learn realism first. I assume this means sitting a student down at a terminal, teaching them the shell, then the tools that they need to use to do a particular analysis. Firstly, and this really is a subject for another blog post but merits mentioning anyway, not all bioinformatics is data analysis. Moving on, I’m talking (hah) more abstractly – about the period before you even get to analysis, the groundwork in the shell that underpins everything bioinformaticians do, which is monkey work. You don’t need to be a pro, and the more basic you can make the entry point, the better.

Nick also tweeted that you can teach someone the concept of the shell in one line. Whilst this is true, the reality is immediately deeper (note, not immediately more complex) – the shell takes a short time to learn the basics, and a lifetime to master. After 15 years of using the Linux shell and, later on, related day-to-day bio-admin toolchains, I’m still finding treats, tidbits, and the need to revisit reference material to help me complete a task. It’s a behemoth of epic proportions, and completely out of the realm of a novice to learn it in a short time. Showing someone the true power of the shell and its environs in a short time is truly daunting to a novice and promotes the all-too-common fallacy that bioinformatics is intrinsically hard, that only computer nerds with poor social skills and pasty complexions can truly succeed, and that you have to be a sociopath to speak to one, let alone start a collaboration. This, my friends, is the monumental pile of steaming turds, festering away in the corner of labs and institutions all over the globe. This is why people are still stuck using Excel for data analysis. This attitude needs to go away, and soon.

That said, learning some very simple basics can provide a great foundation to get the student working with IO, file manipulation, sorting, searching and interoperability. The shell isn’t a dark art in itself. The majority of tasks can be done by someone with a short introduction and a good dose of on-the-job full-time experience (in the order of a couple of weeks). This is what the SWC bootcamps are trying to do – instil the notion that everyone can learn enough of the shell in 2 days to become more confident to tackle their own research problems.

Training for bioinformatics research

As has been noted recently in blogs and papers, there is a wide range of demand and skill sets for the budding and indeed accomplished bioinformatician. User level, engineer level, researcher level – whatever the level, best practice is vital.

The realism problem rears its ugly head here. By concentrating on the exact problems in mind, you miss out on all the underlying ideology that makes you a better bioinformatics user/engineer/researcher from the outset. Mick recently said that “most bioinformaticians are bad scientists” (provocatively of course – unsure of how much seriousness was present). I wince at this statement. Most bioinformaticians haven’t been given the opportunity to learn what being a good researcher is, because they are required to get started on a problem ASAP, analyse some data, produce a result, and get a middle-author gig for the foreseeable future. Eeesh. This is horrible. One of the attendees at a recent bootcamp said to me that he/she was being asked to analyse a particular type of dataset that he/she hadn’t seen before. They wanted training (an exact course to help them was actually coming up) but their lab manager said that it wasn’t suitable, so the attendee was considering paying with their own money. W. T. Actual. F. Transferable skills are probably the most valuable things you can give a researcher.

To conclude…

I truly believe that frameworks like ipynb help novices to quickly get started with hacking in Python. The reference material is interspersed with the code that actually runs within the page, allowing real-time assessment by the student themselves of the code they have written and the output (and errors!) they get. This has to be more effective than a massive wodge of printed out text or book and a blank Python interpreter for getting novices up a couple of rungs on the ladder.

The whole point of ipynb is not to teach students how to use ipynb to do their research. It’s a tool to at the very least get them started in understanding what comprises bioinformatics research, how best to go about learning the fundamentals, and how to apply them in a safe environment – away from pestering lab/project/group managers. You don’t need years of experience to be effective in bioinformatics, and the training I’m talking about isn’t aimed at shoving this pseudo-experience down their throats at a bootcamp. It’s ground-work. Preparation. Core principles that make bioinformatics what it is – interesting, fun, and at the end of the day, valuable to the community. I couldn’t agree with this tweet more:

Budding bioinformaticians should learn these central tools that are so ubiquitous that not doing so would be needlessly ignorant and an unwarranted misappropriation of their own time. Likewise, I argue that the entry point to the world of bioinformatics should be sensible, measured and consistent – learn the basics of the shell, versioning systems, a suitable programming language, and the notion of reproducibility and best practice. Every time.

These concepts should be just as ubiquitous as screen and parallel.

Training for Bioinformatics Triage

I was perusing the Twitter when this passed my peepers from the Grauniad, which got me wondering about a new blog post, one related to the upcoming Software Carpentry course I’m helping to give at TGAC next month.

The inimitable Prof. Brian Cox states:

“I think if you’re not comfortable with the unknown, then it’s difficult to be a scientist … I don’t need answers to everything. I want to have answers to find.”

This is similar, but suitably different for the purposes of argument, to what a programmer wants:

“I think if you’re not comfortable with the unknown, then it’s difficult to be a programmer … How do I find the problem, and how does that affect my diagnosis to start searching for the solution?”

How do you teach people code diagnosis skills? Surely the ability to triage is based on personality, experience and simple dogged determination, and is reserved for experts with years of practice? Well, yes, and thankfully a resounding no.

I’m a daily regular in the ##java freenode IRC channel, aimed at helping people with all manner of Java development questions. It’s a great, if a little aggressive, resource and I’d recommend that developers in the Java space at least idle in there to pick up best practice. There are channels devoted to the other programming languages too. Whatever the language, one common attitude among entry-level and beginner programmers (the two are different) is that there is always a clear resource to help you get to your answer, and that someone must be able to provide an answer quickly. This simply isn’t the case. Often in programming, it’s a real labour-intensive task to search, filter and read through documentation and tutorials to give you the best hope of attempting a solution. That said, learning how to find the problem and being able to describe that problem are imperative.

So, given that science is basically an exercise in delving into the unknown, it would seem sensible to conclude that scientists should make great programmers. So here are some steps that I’ve found invaluable along the way, which hopefully should be applicable to those, particularly in the biological domain, who want to get started or improve their programming and get into doing a bit of bioinformatics development themselves.

How do you ask the right question?

Know what you’re trying to achieve

  • You need to have a clear idea of what your end goal is, down to specific small packages of work. This is not the same as knowing what the problem is. 
  • A global goal, e.g. “Parse in my VCF file, search for variants in my region of interest, pull out those results, show a report”, can comprise a relatively daunting body of work for a beginner. However, this goal can be split into a number of far more manageable chunks, each with more granular goals (see the sketch after this list).
  • When you know what you’re trying to do, finding the problem becomes easier. A pencil and paper or a whiteboard is a great way to slow the thought process down and concentrate on thinking about the goals, rather than how to complete them, which can throw up design problems before you’ve coded up a load of stuff.
  • Knowing where your code deviates from your goals is great for focusing the mind, keeping small chunks of information in your head, and hence great for triage.
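
A minimal sketch of that chunking in bash (the filenames and region are hypothetical; assumes a bgzipped, tabix-indexed VCF and standard tabix/awk):

## 1. pull out just the region of interest
tabix variants.vcf.gz chr2:100000-200000 > region.vcf

## 2. keep only variants that passed filtering (FILTER is VCF column 7)
awk '$7 == "PASS"' region.vcf > region.pass.vcf

## 3. report: a simple count of passing variants in the region
wc -l < region.pass.vcf

Each chunk is small enough to check by eye before moving on to the next.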

Granularity leads to an excellent basis for test-driven development

  • If you know you are working towards a small goal with a clear outcome, i.e. a unit, write a test for it, i.e. a unit test.
  • Each time you add more code to your program, the tests are run again, making sure the downstream code that uses each unit is still sane.
  • Knowing that tests which previously passed have stopped passing due to underlying code changes, i.e. “regression”, is great for triage (see the sketch below).
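
A minimal sketch of a unit and its unit test in bash (the helper function is hypothetical, echoing the ENA filename example from earlier):

## the unit: pull an ENA-like accession out of a filename
get_accession() {
  [[ "$1" =~ [SE]RR[0-9]+ ]] && echo "${BASH_REMATCH[0]}"
}

## the unit test: re-run it after every change to catch regressions
test_get_accession() {
  local got
  got=$(get_accession "/data/reads/SRR123456_1.fastq.gz")
  [ "$got" = "SRR123456" ] && echo "PASS" || echo "FAIL: got '$got'" >&2
}

test_get_accession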

Learn how to use Google and how to skim read

  • You’d be amazed how many developer questions are prefixed with “I tried to Google, but I couldn’t find anything”.
  • It’s surprisingly easy to filter out cruft by doing a cursory broad search, then narrowing with the simple operators Google gives you – for example exact-phrase quotes ("command not found"), negation (-windows) and site restriction (site:stackoverflow.com).
  • By reading a lot of documentation, you learn how to skim read to find the nuggets of relevance quickly. Scientists are usually great at this because they read a lot of scientific papers, and as such should make great documentation readers and question-askers in the bioinformatics space.
  • Poor search and documentation-processing skills are unnecessarily sloppy, unwarranted, and bad for triage.

Know the tools and techniques to help you pinpoint where errors are happening

  • Finding the bits of code that aren’t doing as you expect (which aren’t that many because you’re writing tests now, right?) is probably the single biggest time sink.
  • Many practices and tools are available in all languages to help you find where the issues are:
    • Sensible logging – even wrapping code segments with printing to stdout can be sufficient to breakpoint larger bits of code (see the sketch after this list).
    • Debuggers – may seem daunting, but they are second to none for finding potentially stubborn bugs like race conditions.
    • Small code fragments – 10 100-line code snippets are far easier to debug than a single 1000-line one.
    • Read the APIs – the specifications and syntax of a language or library are crucial to understanding what things do, and what they don’t. This is overlooked amazingly often by entry-level programmers.
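
A minimal logging sketch in bash (the messages and filename are illustrative):

## timestamped messages go to stderr so they don't pollute data on stdout
log() {
  echo "[$(date +%T)] $*" >&2
}

log "filtering started"
## ... the actual work ...
log "filtering finished: $(wc -l < region.pass.vcf) records"

## bash will also trace every command as it runs if you ask it to:
set -x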

Don’t paraphrase

  • If you have a specific problem related to a specific goal, don’t gloss over elements of your issue or your attempts at solutions.
  • Writing a simple self-contained test case demonstrating the problem is good practice, as it minimises ambiguity or misunderstanding.
  • Being able to state categorically what you’ve attempted and any errors you see is great for triage.

What next?

So here’s the clincher. Training courses and tutorials are great ways to learn syntax, to speak to experienced developers and to try out new things. However, unless you have a vested interest/job in bioinformatics, maintaining relevance to your work following the course or tutorial is extremely hard. This is where experience and personality come in.

Whenever you see a problem in your day-to-day work, take the time out to see if you could work out how to help yourself by programming your way out of it. This is not time wasted. You’re training your brain to think in a programming context, which will make you quicker at diagnosing issues in the future.

Similarly, learning best practice is not time wasted. “I don’t have time” is synonymous with “I am not motivated to do things properly”. A sensible scientist wouldn’t leave their paper acceptance chances in the hands of a knowingly hasty and flawed experimental design, and a bioinformatician shouldn’t do the same when publishing code. Hacking scripts together is commonplace and ubiquitous, and for good reason – it’s procedural glue for your tasks that can be automated to some extent. However, a lot of time spent in triage as a result of this quick and dirty development can be avoided by making things easier on yourself, which includes taking the time to learn version control, learn the proper conventions of the language, name everything concisely, use unit tests, document regularly and as fully as is relevant and possible. I like using travelling downtime to document my code and processes.

Finally, the more people that understand code triage, the more people will be better qualified to undertake peer review of software in bioinformatics, which is an area that is sorely lacking.

Updated

Changed the regression sentence for a bit more clarity – thanks for the suggestion Mick!

Retiring Scientific Ideas

I thought I’d wade in and start writing a blog about science and the errant wisps of thought that float around. I’ve been following a few blogs for a while, and whilst I like the commenting idea, I often find I’d rather write about it afresh. Seeing as how I now “do Twitter”, this seemed like the next level – please comment constructively and tell me I’m talking rubbish.

There’s been much talk over a wide range of topics within said blogs, from “cats and dogs”, to “harassment”, to “how to publish science”. I have views on all these subjects, and I guess I’ll try to condense them down into a single meandering waft.

It all came together when I happened upon the edge.org news item regarding the next big question: “WHAT SCIENTIFIC IDEA IS READY FOR RETIREMENT?”

Given the previous Edge questions, such as “WHAT WILL CHANGE EVERYTHING?” and “WHAT DO YOU BELIEVE IS TRUE EVEN THOUGH YOU CANNOT PROVE IT?“, I initially thought this new offering a bit under-existential, but felt it brought some of the previous discussion topics I mentioned earlier into focus.

Cats and Dogs

Mick Watson recently opined on Ewan Birney’s driving factors in personal scientific attitudes, quite conveniently compartmentalised into two four-legged analogies. Whilst broadly generalising scientists into two distinct categories is easy, the reality is not that simple, and being “catlike” has distinct downsides in scientific outlook. It’s a common preconception that PIs and bioinformaticians are “cats” – powerful, independent, gatekeeper-like entities. “Dogs” are viewed as group workers, following the direction of the pack. As Mick says, there are times to be a cat and times to be a dog but on the whole, scientists need to be both. I would disagree, in that scientists need to be dog most of the time, unless the situation warrants cat-like behaviour. Fierce unwavering catlike independence is great for moments of personal development and the motivation to carry on with a problem when everything is screaming at you to drop it (which is why the link is made to bioinformatics – coding is a labour of cat-like love). Fierce unwavering catlike independence is also shitty for a myriad of reasons: it often comes across as arrogance, and science has more than enough of that already; it’s counter-productive in project-based discussions when consensus is essential (labouring a viewpoint is commonplace); it stifles interoperability on a personal and practical scale; it breeds resentment of other cats and promotes sectarianism. This widely-posted PLOS article outlines these pigeon-holed behaviours, where all the types described are “cat-like” traits. Luckily, the article also describes how to be more dog in each instance. As a self-professed dog lover, I’m probably biased in this analogy.

Harassment

We’ve all read the recent descriptions and subsequent outpourings of mutual support for the victims of sexual harassment in the scientific world, demonstrated in the naming and shaming of the perpetrators. Whilst I’m not going to add my 2 pence here as to why sexual harassment is so completely horrific (because my feelings are more than ably summarised by people far more eloquent than myself), it raises an issue for the greater whole. Why does science have the problem of harassment in the first place? Why the continual propagation of not just sexual harassment, but ideas harassment? Furthering knowledge is no more a gender-exclusive pursuit than is the ability to tie shoelaces, or carry out the most basic of bodily functions. History could be, and most surely is, to blame, but I’m pretty sure we don’t live in the 40s anymore. So, I’m not talking about scientific debate – that is natural disagreement with a view to pushing one’s envelopes. I’m talking about personally denigrating someone for their scientific viewpoint. This came to a head with the ENCODE debacle. Whether you like 80% functional as a figure or not, and whether you like the notion of “big science” or not, the dumbing down of those involved in the project by others was, in my view, an utter dick move for science. What better way to make the field of science look and feel like a disagreement at a child’s ballpool birthday party to those “on the outside”, than to publish an article with the tone of a sarcastic ageing relative that you can’t get rid of at Christmas? The shame is that the content of the article is generally sound, and does bring to mind the discrepancies of funding availability for smaller science projects in this world of “big data”, “big science” and “big ideas”. Which brings me nicely on to…

How To Publish Science

The ideology behind science is the dissemination of findings based on empirical observations, but this empiricism is not essential for promoting scientific discourse. I couldn’t care less if Randy got a Nobel and shunned the very same “glamour” journals he used to disseminate knowledge – his message, in time, will remain the same: “you don’t need a journal’s impact factor to help your career“. Sure, he’s had the benefits of hindsight, success and publicity to make that point. Some don’t like his eLife-bolstered open access stance when coupled with the silver-spoon hypocrisy it’s based on. I couldn’t care less. Casting stones seems to be rife, with people staking claim on their open access dedication being initiated earlier and with far more vigour than hapless dinosaur Randy. This is the hipster way, and it’s based on the same desire to seek “glamour” that drives impact factors.

Similarly, one thing that exposes the nasty side of being bent over the impact-factor barrel is the stories of science group managers demanding that papers from the group only be published in journals with an impact factor at or above a certain score. This promotes closed-world behaviour in postdocs and RAs, and is tantamount to harassment. Science gains no intrinsic value from having a shiny wrapper on it, and worse still is the persistent notion that publishing your science in a non-glamour journal means your science is somehow worth less. Given the monumental fubars in the closed review process, whereby trojan papers have made their way into these gold-standard publications, the beauty of modern-day science dissemination is that Twitter, open access publishing and open peer review (arXiv, F1000, PeerJ, PubPeer, etc.) are being actively embraced by new researchers and outwardly recommended by research leaders. This truly is the way forward for good science. However, the problem resurfaces when attempting to rationalise your research outputs to funding bodies and tenure panels without the archaic traditional measurements of scientific success. This is not the way forward for good science and needs careful consideration in future. I was genuinely worried that some people I have discussed this with are fearful that if they don’t publish their work in these glamour journals, their careers will suffer. This is a horrible existence in a modern scientific arena.

So, what scientific idea is ready for retirement? Sectarian Closed Science.

Sectarianism and closedness need to die in science, and I’m both disturbed and heartened in equal measure by the precariously balanced see-saw of the world around scientific endeavour. We’re teetering on the edge of something shame-facedly simple yet vital, and that is science for everyone. Scientific careers that are recognised for what the science is, not where it’s published. Science that is freely available for discussion and debate, from sofas to auditoriums, without fear of personal attacks. Science that excludes no one.