JSON representations of ENA objects

Generating JSON from ENA accessions

I’m currently hoovering up ENA data for consumption by collaborators and researchers on the Norwich Research Park, and am placing the metadata about the accessions into iRODS for more efficient data management.

The ENA has some handy REST APIs for getting to accession information, but they return tab-delimited text. Here’s a script to convert that TSV output into a nicer JSON representation. You can then parse, filter and pretty-print the output using a command-line tool like jq.

#!/bin/bash

## Generate JSON output from an ENA accession
## Author: Rob Davey, The Genome Analysis Centre (TGAC), UK
## http://www.ebi.ac.uk/ena/data/warehouse/usage

## supply an optional local path to link to the accession.
## handy for importing into iRODS for example
LOCALPATH=""

## returned result type. defaults to "read_run"
RESULT="read_run"

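## the leading ":" switches getopts to silent error reporting,
## which is what makes the ":" (missing argument) case below reachable
while getopts ":hl:r:" opt; do
  case "$opt" in
  h)
    echo "show_ena.sh [-l LOCALPATH] [-r RESULT] [ACCESSION]"
    exit 0
    ;;
  l) LOCALPATH="$OPTARG"
     ;;
  r) RESULT="$OPTARG"
     ;;
  :) echo "ERROR: Option -$OPTARG requires an argument." >&2
     exit 1
     ;;
  \?) echo "ERROR: Unknown option -$OPTARG" >&2
     exit 1
     ;;
  esac
done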

IN=${@:OPTIND:1}

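## project, study, sample, experiment, or run accession.
## if no positional parameter remains after the options, read from stdin.
## ($# also counts the options, so test the parsed value instead)
if [ -n "$IN" ]; then PROJ="$IN"; else read -r PROJ; fi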
if [ -z "$PROJ" ]; then
  echo "No accession supplied"
  exit 1
fi

## RESULT already defaults to "read_run" above and is only changed by -r

OUT=$(curl --silent "http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=$PROJ&download=text&result=$RESULT")

i=1
HEADERS=

echo "["

while IFS= read -r line; do
  ## swap tabs for the ASCII unit separator. tab is IFS whitespace, so runs of
  ## tabs (empty fields) would collapse, and commas can occur inside values.
  line=${line//$'\t'/$'\x1f'}

  if [ $i -eq 1 ]; then
    # parse headers
    IFS=$'\x1f' read -r -a HEADERS <<< "$line"
  else
    # parse values
    echo "{"
      while IFS=$'\x1f' read -r -a VALUES ; do
        for j in "${!HEADERS[@]}" ; do
          echo "\"${HEADERS[j]}\":\"${VALUES[j]}\","
        done
      done <<< "$line" | sed '$s/,$//'   # drop only the trailing comma of the last pair

      ## insert local path if supplied by -l flag
      if [ -n "$LOCALPATH" ]; then
        echo ",\"local_path\":\"$LOCALPATH\""
      fi

    echo "},"
  fi
  i=$((i + 1))
done <<< "$OUT" | sed '$s/,$//'

echo "]"
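As a rough sketch of the jq step mentioned earlier: the sample below is made up, but it has the shape the script emits (an array of objects whose values are all strings). The field names run_accession and read_count are used for illustration only, and the block does nothing if jq is not installed.

```shell
# Made-up sample in the shape the script emits; real output has many more fields.
json='[{"run_accession":"ERR000001","read_count":"100"},{"run_accession":"ERR000002","read_count":"2500"}]'

if command -v jq >/dev/null 2>&1; then
  # pretty-print the whole array
  jq . <<< "$json"

  # list just the run accessions, one per line
  runs=$(jq -r '.[].run_accession' <<< "$json")
  echo "$runs"

  # every value is a string, so convert before comparing numerically
  big=$(jq -r '.[] | select((.read_count | tonumber) > 1000) | .run_accession' <<< "$json")
  echo "$big"
fi
```

In practice you would pipe the script straight in, along the lines of show_ena.sh ACCESSION | jq .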

It’s also available through my GitHub. Comments gratefully received, as always!
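A side note on why the script swaps tabs for another delimiter before splitting each row. Tab counts as IFS whitespace in bash, so consecutive tabs (an empty field) are treated as a single separator and the columns shift. The sketch below uses a made-up three-column row and a comma as the replacement delimiter purely for illustration.

```shell
# Made-up row with an empty middle field
row=$'ERR000001\t\tsome_value'

# Splitting on tab directly: the two adjacent tabs collapse into one separator
IFS=$'\t' read -r -a collapsed <<< "$row"
echo "split on tab:   ${#collapsed[@]} fields"

# Swapping tabs for a non-whitespace delimiter keeps the empty field in place
swapped=${row//$'\t'/,}
IFS=',' read -r -a kept <<< "$swapped"
echo "split on comma: ${#kept[@]} fields"
```

The trade-off is that the replacement delimiter must never occur inside a value, which is worth checking against the fields you request.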

UPDATE v2:

Now reads the accession from stdin if none is supplied as an argument. This is helpful if you’ve already downloaded read files from the ENA but want to get the metadata for them:

## requires GNU awk for the three-argument match()
find "$(pwd)" -name "*.fastq*" |
awk 'match($0, /[SE]RR[0-9]+/, m) { print $0, m[0] }' |
xargs -L 1 bash -c 'show_ena.sh -l "$0" "$1"'

This command finds all the FASTQ files in and below the current directory (potentially with a further extension, like .gz), keeps those that have an ENA-like run accession in their path, and passes the local file path and accession to the script above as arguments.
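If GNU awk is not to hand, the accession can also be pulled out with bash's own regex matching. This is only a sketch: accession_from_path is a hypothetical helper name, the paths are made up, and the pattern is the same SRR/ERR one used above.

```shell
# Hypothetical helper: print the first run accession found in a path,
# or return non-zero if there is none.
accession_from_path() {
  local re='[SE]RR[0-9]+'
  if [[ $1 =~ $re ]]; then
    printf '%s\n' "${BASH_REMATCH[0]}"
  else
    return 1
  fi
}

# Made-up paths for illustration
acc=$(accession_from_path "/data/reads/ERR000001_1.fastq.gz")
echo "$acc"
accession_from_path "/data/reads/notes.fastq" || echo "no accession found"

# Possible use with the script (not run here):
# find "$(pwd)" -name "*.fastq*" -print0 |
# while IFS= read -r -d '' f; do
#   acc=$(accession_from_path "$f") && show_ena.sh -l "$f" "$acc"
# done
```

Reading NUL-delimited paths in the commented loop should also cope with spaces in file names, which the xargs version does not.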