How to tell if a biom file is formatted as JSON or HDF5?

Andrew Krohn

Dec 24, 2015, 8:49:05 PM
to Qiime 1 Forum
I'm writing some scripts that also pass my QIIME output through phyloseq for some alternative graphical output.  Everything is working just fine, but phyloseq seems to accept only JSON-formatted biom tables.  That is fine I guess, but I decided to jump on the HDF5 bandwagon about a year ago, so now I have to make sure my input file is JSON before I start.  It's a minor annoyance since it is a simple command to convert, but I would like my script to handle this step if possible.

After playing with the biom command for a bit and reading some other documentation, I can't find any output that reports the biom file's format.  It's entirely possible I am missing something simple and obvious, in which case please do clue me in.  Otherwise, maybe someone has an alternative command that will do this (via python perhaps?) and is willing to share?

Colin Brislawn

Dec 25, 2015, 12:05:51 AM
to Qiime 1 Forum
Hello Andrew,

Try this:
head -c 100 otu_table.biom
{"id": "None","format": "Biological Observation Matrix 1.0.0","format_url": "http://biom-format.org"

1.0.0 is json,
2.1.0 is hdf5
I'll leave the string parsing up to you.
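
One possible way to do that string parsing, as a minimal sketch only: pull the version number out of the format string near the start of the file. The function name biom_json_version is hypothetical, and the sketch assumes the "format" field appears within the first 1000 bytes, as in the output above. It prints nothing for HDF5 tables, whose header is binary.

```shell
# Hypothetical helper (not part of the biom CLI): print the version from a
# JSON BIOM header, e.g. "1.0.0". Prints nothing when no format string is
# found in the first 1000 bytes (which is what happens for HDF5 tables).
biom_json_version() {
    head -c 1000 "$1" \
        | grep -a -o 'Biological Observation Matrix [0-9][0-9.]*' \
        | head -n 1 \
        | sed 's/.* //'
}
```

Calling biom_json_version otu_table.biom on the table above should print the version from its format string; an empty result means the file is either HDF5 or not a biom table at all, so it is worth treating that case separately.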

Also HDF5 will be supported by phyloseq, eventually. 

Good luck!
Merry Christmas you crazy man,
Colin Brislawn

Andrew Krohn

Dec 28, 2015, 12:25:35 PM
to Qiime 1 Forum
Thanks, Colin.  That worked OK for the JSON-formatted tables, but the HDF5 tables are still not human-readable.  The command does turn up a human-readable "HDF" as part of the output, so at first I was using grep "HDF" to check whether a file is HDF5 or JSON.  However, it seems possible, if unlikely, that a taxonomic string somewhere might contain HDF for some reason.  So now I have switched the code to check for JSON by grepping for "Biological Observation Matrix", which almost certainly won't wind up in a non-human-readable string by accident.

Here's the code:
## Trap function on exit: remove the temporary JSON table if one was created.
function finish {
    if [[ -f "$jsontemp" ]]; then
        rm "$jsontemp"
    fi
}
trap finish EXIT

## Test if input is properly formatted and correct if necessary
## ($input and $tempdir must already be set at this point)
randcode=$(tr -dc 'a-zA-Z0-9' < /dev/urandom 2>/dev/null | head -c 8)
jsontemp="$tempdir/${randcode}_json.biom"
if grep -q "Biological Observation Matrix" "$input"; then
    ## input is already JSON, use it as-is
    table="$input"
else
    ## convert biom for processing
    echo "Converting input table (HDF5 format) to JSON for processing."
    biom convert -i "$input" -o "$jsontemp" --to-json
    table="$jsontemp"
fi

## Execute R slaves
#Call separate R scripts here...
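
On the worry that "HDF" could turn up in a taxonomy string: every HDF5 file carries a fixed 8-byte signature (\x89 H D F \r \n \x1a \n), so another option is to compare the leading bytes rather than grep the whole file. This is a rough sketch, not a tested part of the script above. The function name is_hdf5 is made up for illustration, and it assumes the signature sits at byte 0. The HDF5 format also allows the signature at offsets 512, 1024 and so on when a user block is present, which this does not check.

```shell
# Hypothetical helper: succeed (exit 0) if the file starts with the
# 8-byte HDF5 signature, fail otherwise. Only byte offset 0 is checked.
is_hdf5() {
    # Dump the first 8 bytes as hex and strip od's spacing and newline.
    sig=$(head -c 8 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$sig" = "894844460d0a1a0a" ]
}
```

It could then stand in for the grep test, along the lines of: if is_hdf5 "$input"; then convert, else use the input as-is.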

Yoshiki Vázquez Baeza

Dec 28, 2015, 1:41:27 PM
to Qiime 1 Forum
Sorry for jumping in late, but if you use the unix file command you should see two different outputs.

See the different outputs for the following two example tables that are hosted in the biom-format repo:

yoshikivazquezbaeza:examples@master$ file rich_sparse_otu_table.biom 
rich_sparse_otu_table.biom: ASCII text
yoshikivazquezbaeza:examples@master$ file rich_sparse_otu_table_hdf5.biom 
rich_sparse_otu_table_hdf5.biom: Hierarchical Data Format (version 5) data

Andy, if you think adding this to the documentation would be useful, let us know and we'll try to do this!

Thanks!

Yoshiki.

Colin Brislawn

Dec 28, 2015, 2:33:59 PM
to Qiime 1 Forum
Oh that's great! I never knew about the unix file command before.

Andrew Krohn

Dec 28, 2015, 3:21:06 PM
to Qiime 1 Forum
That is much cleaner.  Thanks Yoshiki.  I have revised my code thusly:

## Test if input is properly formatted and correct if necessary
## (file -b omits the file name, so a name containing the phrase cannot cause a false match)
if file -b "$input" | grep -q "Hierarchical Data Format"; then
    ## convert biom for processing
    echo "Converting input table (HDF5 format) to JSON for processing."
    biom convert -i "$input" -o "$jsontemp" --to-json
    table="$jsontemp"
else
    table="$input"
fi

I was hoping there was a clean way to find out the format straight from the biom command.  I think that could be useful, but given the output of the file command I would say such a feature is not necessary.  One just needs to know the correct command to issue.