Problem with giant biom file


deenie

Oct 21, 2014, 10:56:45 AM
to qiime...@googlegroups.com
Hi,

I'm trying to work with a giant biom file I received from the EMP 10,000 OTU project (2.5 GB), but I'm having trouble even reading it in anywhere.  I'm happy to share the file by direct message, but since I'm not sure whether it is open access yet, I'd prefer not to post it here.

I am trying to work with this biom file in any way I can (e.g., Python, R), but it seems that all the libraries I've tried crash.

Thanks in advance!
Adina

For example,
R
===================================================
> x1 = read_biom("/mnt/data/emp/full_emp_table_w_tax.biom")
Segmentation fault (core dumped)


PYTHON:
===================================================

I am trying to convert the biom to hdf5 format or to tsv (I know it will be big - I will subset it eventually). 

biom convert -i full_emp_table_w_tax.biom -o table.from_biom.txt --to-tsv

biom convert -i full_emp_table_w_tax.biom -o out.hdf5 --to-hdf5

Traceback (most recent call last):
  File "/usr/local/bin/pyqi", line 184, in <module>
    optparse_main(cmd_obj, argv[1:])
  File "/usr/local/lib/python2.7/dist-packages/pyqi/core/interfaces/optparse/__init__.py", line 275, in optparse_main
    result = optparse_cmd(local_argv[1:])
  File "/usr/local/lib/python2.7/dist-packages/pyqi/core/interface.py", line 38, in __call__
    cmd_input = self._input_handler(in_, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pyqi/core/interfaces/optparse/__init__.py", line 194, in _input_handler
    self._optparse_input[optparse_clean_name])
  File "/usr/local/lib/python2.7/dist-packages/biom/interfaces/optparse/input_handler.py", line 42, in load_biom_table
    return parse_biom_table(table_f)
  File "/usr/local/lib/python2.7/dist-packages/biom/parse.py", line 307, in parse_biom_table
    t = Table.from_tsv(fp, None, None, lambda x: x)
  File "/usr/local/lib/python2.7/dist-packages/biom/table.py", line 3622, in from_tsv
    t_md_name) = Table._extract_data_from_tsv(lines, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/biom/table.py", line 3714, in _extract_data_from_tsv
    first_values = lines[data_start].strip().split(delim)
IndexError: list index out of range


System information
==================
          Platform:     linux2
Python/GCC version:     2.7.3 (default, Feb 27 2014, 19:58:35)  [GCC 4.6.3]
 Python executable:     /usr/bin/python

Dependency versions
===================
 pyqi version:  0.3.2
NumPy version:  1.6.1
SciPy version:  0.9.0
 h5py version:  2.3.1

biom-format package information
===============================
biom-format version:    2.1

Kyle Bittinger

Oct 21, 2014, 4:05:09 PM
to qiime...@googlegroups.com
The internal format for .biom files has recently changed, and I wonder what format you have.

The JSON format (produced by QIIME 1.8) can be parsed in R using any JSON library.  My R package, qiimer (on CRAN), has a couple functions to extract data in a convenient form.

The HDF5 format was developed specifically for large OTU tables coming out of the Earth Microbiome Project.  AFAIK there is no direct support for this yet in R, though you may have some luck with a generic HDF5 library.  You may also want to check in with Joey McMurdie to see what the status is for support in phyloseq.

To see if your .biom file is in JSON format or HDF5 format, you can use the command "head" with the -c flag to print out the first several bytes.  For example, to print the first 500 bytes in the file "otu_table.biom", you'd type
head -c 500 otu_table.biom

The JSON format will start with a "{" and will follow the rules given at json.org.  The HDF5 format will print a bunch of gibberish to your screen.

Hope that helped,
Kyle

--
You received this message because you are subscribed to the Google Groups "Qiime Forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to qiime-forum...@googlegroups.com.

deenie

Oct 22, 2014, 8:46:36 AM
to qiime...@googlegroups.com
Hi Kyle + others of interest:

Thanks for the response.  I think I was unclear in my initial post -- I have a JSON-format BIOM file.  It is 2.5 GB, and from my conversations with Sean Gibbons, we think it is too big for the current JSON parsers in both the R biom and Python biom packages.  One solution is to write another parser for it, but I was wondering whether there is any development of scripts that can handle this, without having to go down that rabbit hole?
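For reference, the layout itself is simple -- it's purely the size that's the problem. Here's a toy-scale JSON (BIOM 1.0) table and a stdlib-only subset of one OTU's row; the real file is of course far too large for a plain `json.load`, which is exactly the issue:

```python
import json

# Toy BIOM 1.0 (JSON) table: 2 OTUs x 3 samples.
# With matrix_type "sparse", "data" holds [row, col, value] triples.
toy = json.loads("""
{
  "id": null, "format": "Biological Observation Matrix 1.0.0",
  "type": "OTU table", "matrix_type": "sparse",
  "shape": [2, 3],
  "rows": [{"id": "OTU_1", "metadata": null},
           {"id": "OTU_2", "metadata": null}],
  "columns": [{"id": "S1", "metadata": null},
              {"id": "S2", "metadata": null},
              {"id": "S3", "metadata": null}],
  "data": [[0, 0, 5], [0, 2, 2], [1, 1, 7]]
}
""")

def row_counts(table, otu_id):
    """Return {sample_id: count} for one OTU in a sparse JSON BIOM table."""
    r = [i for i, row in enumerate(table["rows"]) if row["id"] == otu_id][0]
    samples = [c["id"] for c in table["columns"]]
    return {samples[c]: v for i, c, v in table["data"] if i == r}

print(row_counts(toy, "OTU_1"))  # → {'S1': 5, 'S3': 2}
```

(`row_counts` is just an illustration of the triples layout, not a scalable parser.)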

Best,
Adina

Kyle Bittinger

Oct 22, 2014, 10:20:36 AM
to qiime...@googlegroups.com
Will it parse in R using the RJSONIO or rjson package?  Also try the ijson library for Python (https://pypi.python.org/pypi/ijson/2.0).

Please update us on how it goes, either way.  I am interested to see how these libraries handle large files.

--Kyle

Jorge Cañardo Alastuey

Oct 22, 2014, 4:41:33 PM
to qiime...@googlegroups.com
Hi,

A few memory leaks were recently fixed in the Python biom package, and converting to tsv should now take much less memory. You'll need to install the package from github or wait for a new biom release (it should be possible to update the package running pip install --upgrade https://github.com/biocore/biom-format/tarball/master).

Does the conversion to hdf5 also fail?

Best,
Jorge

deenie

Oct 22, 2014, 5:42:54 PM
to qiime...@googlegroups.com
Hi Jorge,

Thanks for the upgrade link, I've been looking for that.  There is still an error, and it seems to point at the biom file itself.  Is that what it looks like to you?  I trust Sean on this biom file, but there could always be an issue with it (it is the one from this effort, https://peerj.com/articles/545/).  It's the full EMP table with taxonomy, if you are familiar with this project.

biom convert -i full_emp_table_w_tax.biom -o table.hdf5 --table-type "OTU table" --to-hdf5
ValueError: Extra data: line 1 column -1660304975 - line 1 column 2634662321 (char -1660304975 - 2634662321)

biom convert -i full_emp_table_w_tax.biom -o otu_table.txt --to-tsv --table-type="OTU table"
ValueError: Extra data: line 1 column -1660304975 - line 1 column 2634662321 (char -1660304975 - 2634662321)
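One thing I notice about those numbers (pure speculation on my part): the two reported column offsets differ by exactly 2^32, so the negative one is just the unsigned value wrapped into a signed 32-bit integer. That makes me wonder whether this is a 32-bit overflow in the JSON parser on a >2 GB input rather than real corruption:

```python
# The two offsets from "Extra data: line 1 column -1660304975 - line 1
# column 2634662321" are the same position, seen through a signed vs
# unsigned 32-bit lens.
neg, pos = -1660304975, 2634662321

print(pos - neg == 2**32)        # → True: they differ by exactly 2**32
print(neg % 2**32 == pos)        # → True: neg is pos wrapped to signed 32-bit
print(round(pos / 2**30, 2))     # → 2.45: the offset is ~2.45 GiB in, about the file size
```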

Thanks,
Adina

Jorge Cañardo Alastuey

Oct 24, 2014, 7:47:48 PM
to qiime...@googlegroups.com
Hi Adina,

It certainly looks like a problem in the biom file, so I believe it must have gotten corrupted somehow. The conversion from json to hdf5 for such a large table will need a fair amount of RAM (~a couple dozen GiB) but it should work without any issues (I just tried it!).

Best,
Jorge


Alex Chase

Nov 21, 2014, 5:12:49 PM
to qiime...@googlegroups.com
Hi all,

I am currently having the same issue with processing the full_emp table. Have there been any updates on this front? For my purposes, I only need to extract specific OTUs and the data corresponding to their rows. Every time I load the file into memory, Python crashes. Any help would be much appreciated!

Thanks in advance!

Best,
Alex 

Daniel McDonald

Nov 21, 2014, 5:51:45 PM
to qiime...@googlegroups.com
Hi Alex and Adina,

Could one of you post a link to the file or share it with me directly? I'll identify the specific issue and produce an HDF5, BIOM 2.0-compatible table from it.

Best,
Daniel

Alex Chase

Nov 21, 2014, 6:08:55 PM
to qiime...@googlegroups.com
Hey Daniel,

Thanks for the help. Here is the link:

Feel free to email me directly (alex.b...@gmail.com) if that is easier.

Thanks again and really appreciate it!

Best,
Alex

Daniel McDonald

Nov 21, 2014, 6:11:11 PM
to qiime...@googlegroups.com

Great, thanks. I'll pull it down now.

Daniel McDonald

Nov 21, 2014, 6:43:47 PM
to qiime...@googlegroups.com
This file parses fine for me, though it requires ~30 GB of RAM, which is understandably prohibitive. Jorge has an HDF5-formatted file, and we'll be making it available shortly. The HDF5 representation will be more manageable, and you can subset on load as well if you only want a selection of the OTUs or samples.

Best,
Daniel

Adina Chuang Howe

Nov 21, 2014, 6:46:17 PM
to qiime...@googlegroups.com

Thanks!  Sorry I got to this later than everyone!

Daniel McDonald

Nov 21, 2014, 7:12:02 PM
to qiime...@googlegroups.com
Hey everyone,

An HDF5-formatted version of the EMP table can be grabbed here:


Best,
Daniel

Alex Chase

Nov 21, 2014, 7:16:09 PM
to qiime...@googlegroups.com
Thank you!

Adina Chuang Howe

Dec 12, 2014, 11:15:56 AM
to qiime...@googlegroups.com
Hi,

Does anyone have examples of parsers that use h5py to select subsets out of the biom file?  I am spending a lot of time learning about HDF5 and am hoping for some baseline starter scripts.

Thanks,
Adina

Daniel McDonald

Dec 12, 2014, 11:28:35 AM
to qiime...@googlegroups.com
Hi Adina,

If you're working from the command line, you can use the command:

biom subset-table 

If you're working from within Python, you can subset on load by passing in a list of IDs to keep.
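If you do want to go at the HDF5 directly with h5py, note that BIOM 2.x stores each axis as CSR/CSC-style arrays (e.g. `observation/matrix/data`, `indices`, and `indptr` -- dataset names per the 2.x spec as I recall it). Row extraction is then just pointer slicing; here is a pure-Python sketch with toy lists standing in for the HDF5 datasets:

```python
# Toy CSR arrays, as they'd appear under observation/matrix/* in a
# BIOM 2.x HDF5 file (2 observations x 3 samples).
data    = [5, 2, 7]      # nonzero values, stored row by row
indices = [0, 2, 1]      # sample (column) index of each value
indptr  = [0, 2, 3]      # row i occupies data[indptr[i]:indptr[i+1]]

def csr_row(i):
    """Return {column_index: value} for observation (row) i."""
    start, end = indptr[i], indptr[i + 1]
    return dict(zip(indices[start:end], data[start:end]))

print(csr_row(0))  # → {0: 5, 2: 2}
print(csr_row(1))  # → {1: 7}
```

The nice part with h5py is that slicing `data[start:end]` on an HDF5 dataset reads only that slice from disk, so pulling a handful of OTU rows never loads the full table.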

I hope that helps, please let me know if you need more help here.

Best,
Daniel

Adina Chuang Howe

Dec 12, 2014, 11:34:38 AM
to qiime...@googlegroups.com
Thanks Daniel - I was at a loss without this!  Saved me a ton of effort.

Daniel McDonald

Dec 12, 2014, 11:36:36 AM
to qiime...@googlegroups.com
That's great to hear! 

HDF5 is one of those things that sounds too good to be true, but is true. However, there is still a learning curve, of course, and from a practical standpoint, users and developers shouldn't have to worry about the actual on-disk representation; they should be able to interact via the command line and/or a stable API.

Best,
Daniel