Binary file format over CSV?


Aaron Iba

Oct 24, 2017, 7:14:53 PM
to H2O Open Source Scalable Machine Learning - h2ostream
I started out using CSV to import my data to H2O, because that's the format most of the H2O documentation & examples use.

My CSV files are several gigabytes (~2M rows by 90 columns).  It seems to me that a binary file format would be much more compact.

The H2O documentation page lists 8 different supported formats, with various caveats.  It's not clear to me which format would be best in my case.

Could someone recommend a fast/compact binary format?

My data is being generated by Java/Clojure code, and consists entirely of double values.

I did a test using the org.apache.avro/avro 1.8.0 library to generate Avro files.  As expected, the files were smaller than CSV.  But working with these files in H2O was, surprisingly, slower than working with CSV files.  Perhaps one of the other supported formats would be faster than CSV as well as more compact?

Appreciate the help!

-- Aaron

Erin LeDell

Oct 24, 2017, 9:40:32 PM
to Aaron Iba, H2O Open Source Scalable Machine Learning - h2ostream

I assume you're using the parallel file reader, h2o.importFile, and not h2o.uploadFile?  If not, make that switch; it will speed things up a lot.  H2O can read a zipped CSV file, but I'm not sure that will speed anything up; it might just be better for storage.

ARFF may be slightly faster than CSV because all the column types are pre-defined by the user.

-Erin

--
You received this message because you are subscribed to the Google Groups "H2O Open Source Scalable Machine Learning - h2ostream" group.
To unsubscribe from this group and stop receiving emails from it, send an email to h2ostream+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

-- 
Erin LeDell Ph.D.
Statistician & Machine Learning Scientist | H2O.ai

Aaron Iba

Oct 25, 2017, 2:57:01 PM
to H2O Open Source Scalable Machine Learning - h2ostream
Didn't realize H2O supports zipped CSV.  I am now gzipping the CSV files and giving them a file extension ".csv.gz" and H2O seems to automatically do the right thing.

Gzipping CSVs still doesn't seem as efficient to serialize/deserialize as just storing the raw bytes of the doubles, but it's in the same ballpark, so I'm happy.
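For anyone curious about the size tradeoff, here's a rough sketch (illustrative only, not part of my actual pipeline; file names are made up) of writing the same doubles three ways from Java: plain CSV, gzipped CSV, and raw 8-byte doubles as a lower bound:

```java
import java.io.*;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

public class FormatSizes {

    // Write the matrix as text CSV to whatever stream we're given.
    static void writeCsv(double[][] data, OutputStream out) throws IOException {
        try (PrintWriter pw = new PrintWriter(
                new BufferedWriter(new OutputStreamWriter(out)))) {
            for (double[] row : data) {
                StringBuilder sb = new StringBuilder();
                for (int c = 0; c < row.length; c++) {
                    if (c > 0) sb.append(',');
                    sb.append(row[c]);
                }
                pw.println(sb);
            }
        }
    }

    // Write the matrix as raw doubles: exactly 8 bytes per value, no text overhead.
    static void writeRaw(double[][] data, OutputStream out) throws IOException {
        try (DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(out))) {
            for (double[] row : data)
                for (double v : row)
                    dos.writeDouble(v);
        }
    }

    public static void main(String[] args) throws IOException {
        int rows = 10_000, cols = 9;              // small stand-in for ~2M x 90
        double[][] data = new double[rows][cols];
        Random rng = new Random(42);
        for (double[] row : data)
            for (int c = 0; c < cols; c++)
                row[c] = rng.nextDouble();

        File csv   = new File("data.csv");
        File csvGz = new File("data.csv.gz");     // the extension H2O keys off
        File raw   = new File("data.bin");

        writeCsv(data, new FileOutputStream(csv));
        writeCsv(data, new GZIPOutputStream(new FileOutputStream(csvGz)));
        writeRaw(data, new FileOutputStream(raw));

        System.out.printf("csv=%d  csv.gz=%d  raw=%d bytes%n",
                csv.length(), csvGz.length(), raw.length());
    }
}
```

On my random-doubles data the gzipped CSV lands between plain CSV and the raw-bytes floor, which matches the "same ballpark" impression above.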

It would be cool if this were noted in the H2O documentation.

Thank you!

-- Aaron

Erin LeDell

Oct 25, 2017, 3:09:15 PM
to Aaron Iba, H2O Open Source Scalable Machine Learning - h2ostream, Angela Bartz

Aaron,

Good call, we will add this to the docs.  cc-ing Angela Bartz, Head of Documentation.

-Erin

Angela Bartz

Oct 25, 2017, 3:19:52 PM
to Erin LeDell, Aaron Iba, H2O Open Source Scalable Machine Learning - h2ostream
Thanks for finding this discrepancy. I'll make the update in the docs.

Sincerely,
Angela Bartz


Darren Cook

Oct 26, 2017, 6:38:14 AM
to h2os...@googlegroups.com
> Didn't realize H2O supports zipped CSV. I am now gzipping the CSV files and
> giving them a file extension ".csv.gz" and H2O seems to automatically do the
> right thing.
>
> GZipping CSVs still doesn't seem as efficient to serialize/deserialize as just
> storing the raw bytes of doubles, but it's in the same ballpark so I'm happy.
>
> It would be cool if this was noted in the H2O documentation.

I thought it was mentioned in the manual! Must've been something I was
told directly. (So, if I've got any of the below wrong, please let me
know!)

One additional consideration is parallel data import into a cluster. If it
is an unzipped CSV file, H2O can do offset reads, so each node in your
cluster can directly read its own part of the CSV file, in parallel.
Whereas if it is zipped, H2O has to read the whole file and unzip it
before it can do the parallel parse.
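To make that concrete, here is a toy sketch (Java, since that's what Aaron
is generating his data with; file names and sizes are made up). With a plain
file any worker can seek straight to its byte offset, while a gzip stream
has to be inflated from byte zero to reach the same offset:

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Random;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class OffsetRead {
    public static void main(String[] args) throws IOException {
        // Stand-in for a big CSV: 1 MiB of bytes, stored plain and gzipped.
        byte[] payload = new byte[1 << 20];
        new Random(1).nextBytes(payload);
        Files.write(Paths.get("plain.bin"), payload);
        try (OutputStream gz = new GZIPOutputStream(
                Files.newOutputStream(Paths.get("plain.bin.gz")))) {
            gz.write(payload);
        }

        long offset = payload.length / 2;   // "this node's" slice starts here
        byte[] fromPlain = new byte[1024];
        byte[] fromGzip  = new byte[1024];

        // Plain file: jump straight to the offset; cost is O(slice).
        try (RandomAccessFile raf = new RandomAccessFile("plain.bin", "r")) {
            raf.seek(offset);
            raf.readFully(fromPlain);
        }

        // Gzip: no random access. Everything before the offset still gets
        // decompressed, so cost is O(whole file up to the offset).
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(Files.newInputStream(Paths.get("plain.bin.gz"))))) {
            in.skipBytes((int) offset);     // decompresses all skipped bytes
            in.readFully(fromGzip);
        }

        // Same bytes either way; only the cost to reach them differs.
        System.out.println(Arrays.equals(fromPlain, fromGzip));
    }
}
```

That is the whole reason offset reads matter for cluster import: the seek
is free on a plain file and impossible on a gzip stream.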

So, if you are using very large data files and reading from HDFS, it is
best to use unzipped CSV. But if the data is farther away than the LAN,
it is best to use zipped CSV. IIRC, when reading from S3 on AWS instances,
there was not much difference.

BTW, looking at the big picture: if you are spending half an hour or more
training models, then the few minutes saved by optimizing importFile are
not worth worrying about.

Darren


--
Darren Cook, Software Researcher/Developer
My New Book: Practical Machine Learning with H2O:
http://shop.oreilly.com/product/0636920053170.do

Erin LeDell

Oct 26, 2017, 2:20:44 PM
to Darren Cook, h2os...@googlegroups.com, Angela Bartz
Darren,

Thanks for the input.  I don't think it's mentioned in our user guide,
so you probably got that information talking with one of our engineers
directly.

Angela can validate this on our end and update the docs.  This is good
information!  Especially if it's true :-)

-Erin

Aaron Iba

Oct 27, 2017, 5:38:01 PM
to H2O Open Source Scalable Machine Learning - h2ostream
Darren, thank you for the additional thoughts.  All great points.

I'm reading from S3, so unfortunately I can't take advantage of parallel reads on the same file.  Reading .csv.gz files from S3 is now working great; thank you again for the help.

In case it's of interest to others loading data from S3, or to those maintaining the H2O web docs: I found the docs unclear on how to get data in from S3.  The User Guide doesn't say much about S3.  (Strangely, if you search for "S3" using the user guide's "Search docs" box, you get no results at all.)  But I did eventually find this page.

I couldn't get s3n:// URLs to work with either inline access keys (e.g. s3n://<KEY>:<SECRET>@<BUCKET>/path) or a core-site.xml config.  It was also unclear to me what the difference between s3n:// and s3:// is.  Googling "s3n protocol" returns a bunch of Hadoop material, but I'm not using Hadoop.

Instead, after some more googling, I realized that if I run h2o.jar with the UNIX environment variables AWS_ACCESS_KEY and AWS_SECRET_ACCESS_KEY set, then s3:// URLs (not s3n://) work fine.  So that's my current solution.
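Concretely, my launch looks something like this (variable names exactly as I used them; the key values and bucket name here are obviously placeholders):

```shell
# Placeholder credentials -- substitute your own.
export AWS_ACCESS_KEY="AKIAXXXXXXXXXXXXXXXX"
export AWS_SECRET_ACCESS_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# H2O picks the credentials up from the environment at startup.
java -jar h2o.jar

# Then, from the client, s3:// paths (not s3n://) import fine, e.g.
# importing "s3://my-bucket/data/part-0001.csv.gz".
```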

I also purchased your O'Reilly book on H2O, and then discovered that all of this is covered quite nicely in Chapter 2.  Wish I had done that first!  Well worth it.

-- Aaron