SVM Light Data Format


Praneet Mhatre

Sep 11, 2014, 12:22:32 PM
to h2os...@googlegroups.com
I am a new user of h2o. I apologize in advance if I'm missing something very obvious here. It looks like h2o supports svmlight as an acceptable sparse data format. I tried reading my svmlight data through the web UI and it worked great. But I can't figure out how to read it via R or the REST API. None of the docs I searched for mention svmlight or any other parser type, for that matter. Can someone please tell me what is the appropriate place to specify the parser type?

Thank you,
Praneet

ke...@0xdata.com

Sep 12, 2014, 4:29:46 AM
to h2os...@googlegroups.com
Hi Praneet,
If your data parsed through the web ui, that's a good sign.

In general, H2O tries to not require the user to tell the tool about the dataset. H2O tries to deduce the right behavior from analyzing the data.

In the case of the web ui, did you leave the parser_type selection empty or set to Auto? It may have auto-selected SVMLight after previewing your data, so you might not have had to select SVMLight yourself. Usually you don't have to, but for some datasets h2o doesn't deduce the format correctly and has to be told.

I think it's likely that when you tell h2o to parse in R, it will deduce your SVMLight data format correctly without being told.

Can (or did?) you try that?

I was just reading the h2o R documentation and I noticed that parser_type is not listed as a parameter. This might be an oversight, since parser_type sometimes has to be forced (although that's rare, and the choices are just auto, two xls (excel) cases, and SVMLight).


-kevin

Praneet Mhatre

Sep 12, 2014, 12:45:34 PM
to h2os...@googlegroups.com, ke...@0xdata.com
Hi Kevin,

Thank you for the reply.

And yes, I did try to parse the data in R without specifying the parser_type and it failed to detect that I was using an svmlight file. And to respond to your earlier point, the UI did not auto select svmlight as the parser type. It auto-suggested csv with '\t' as the separator. So I had to manually specify the type as svmlight and everything worked great from that point on.

And just to clarify, h2o currently does not have a way to specify the parser_type via R, correct? If that's right, is there any other way I can read my sparse data in svmlight format?

Thanks,

Kevin

Sep 12, 2014, 1:24:53 PM
to Praneet Mhatre, h2os...@googlegroups.com
Hi Praneet,

ok, thanks for the additional information.

Yes, it sounds like H2O auto-suggested csv parse with \t as the separator.
So the SVMLight parser_type has to be given to h2o explicitly (the parameter exists in the browser and json interfaces).
(a separate question I have as a test person is whether H2O should have been able to deduce the SVM format, even with tabs for white space. I will check on that, but let's hold off on that for now)

I have to check with people today as to whether the 'parser_type' param exists in R. If it doesn't, it needs to be added.

I was thinking of something you might try.

I'm trying to remember, but I think the SVMLight definition of separators is lenient and not limited to a single space? I'm wondering if you have tabs in your SVM dataset.

If you have time, and that's true, could you open your dataset in an editor and replace all the tabs with spaces?
(since SVM specifies the column, there's no additional information that a "separator" provides, unlike normal csvs)

e.g. in vi:
:1,$ s/\t/ /g

or
sed -e 's/\t/ /g' < dataset > new_dataset

and then try new_dataset in the web ui and see if it gets the auto-detected format correct. (you might also try it in R)

thanks,
-kevin
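
For readers who prefer a script to vi or sed, the tab-to-space substitution suggested above can be sketched in Python. This is only an illustration: the function name tabs_to_spaces is hypothetical, the file names follow the dataset / new_dataset placeholders from the sed example, and UTF-8 text is assumed.

```python
def tabs_to_spaces(src_path, dst_path):
    """Copy src_path to dst_path, replacing every tab with one space.

    newline="" keeps the original line endings untouched.
    """
    with open(src_path, "r", encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line.replace("\t", " "))


# Example call, assuming a file named "dataset" exists in the working directory:
# tabs_to_spaces("dataset", "new_dataset")
```

Like the sed command, this writes a new file and leaves the original untouched, so both versions can be tried in the web ui.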

spn...@gmail.com

Sep 12, 2014, 2:27:04 PM
to h2os...@googlegroups.com

Praneet,

I've opened up a new jira for adding this parameter: https://0xdata.atlassian.net/browse/PUB-1019

Once this bug is resolved, you will be able to specify parser_type ("AUTO", "SVMLight", "CSV", "XLS") from the R interface. This fix will make its way into the next nightly build (tonight @ 11PM), but you can get at this feature immediately from git.

Stay tuned,
Spencer

Praneet Mhatre

Sep 12, 2014, 2:27:47 PM
to h2os...@googlegroups.com, pranee...@gmail.com, ke...@0xdata.com
Hi Kevin,

The UI still detects CSV as the parser_type and space as the separator.

Praneet Mhatre

Sep 12, 2014, 2:33:37 PM
to h2os...@googlegroups.com, spn...@gmail.com
That would be great. Thanks, Spencer.

spn...@gmail.com

Sep 12, 2014, 2:49:16 PM
to h2os...@googlegroups.com, spn...@gmail.com
The "parser_type" parameter has been pushed and the ticket closed (commit 63332a30adbbcccfdeb22de152f6684ed95ed47d).


Thanks,
Spencer

Kevin Normoyle

Sep 12, 2014, 3:21:44 PM
to Praneet Mhatre, h2os...@googlegroups.com
Hi Praneet,


On 09/12/2014 11:27 AM, Praneet Mhatre wrote:

> The UI still detects CSV as the parser_type and space as the separator.

not to belabor the issue since Spencer's update will cover things, but I'm curious why auto detect didn't work
(when I looked at my tests, there was only one case where I had to specify SVMLight)

can you post the first 5 rows of your data?

I'm wondering if you have comments or some text in your SVMLight file, or something odd we didn't see before.


thanks for spending the time on this
-kevin

here's an interesting SVMLight example, as a fer-instance to compare to, that I parsed with no parser_type= parameter.
(the numbers can be integer or other fp formats)


34
99
88
63 4:58.7922              5:-86.6483             9:55.8207              10:51.5601          
130 4:-61.6788             5:-62.645
90 3:-85.6198             6:-51.4775             7:-50.988              10:-94.965          
95 5:-54.5767             6:-69.6121             9:-66.1534             10:-73.7346         
185
33 4:-82.4124
70 10:75.8938
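
To make the structure of the rows above concrete, here is a minimal Python sketch of reading one SVMLight-style line: a numeric label followed by zero or more index:value pairs. It is an illustration only, not H2O's parser; the function name parse_svmlight_line is hypothetical, and comments or qid tokens that the format may allow are not handled.

```python
def parse_svmlight_line(line):
    """Parse one SVMLight-style line into (label, {index: value}).

    Tokens are split on any run of whitespace, so tabs, single spaces,
    and padded columns are all treated the same way.
    """
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


# A few rows copied from the example above, including a label-only row.
rows = [
    "34",
    "63 4:58.7922              5:-86.6483             9:55.8207              10:51.5601",
    "130 4:-61.6788             5:-62.645",
]
for row in rows:
    print(parse_svmlight_line(row))
```

Because each value carries its own column index, the amount or kind of whitespace between tokens makes no difference to the result, and a row with only a label simply has no non-zero features.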

Praneet Mhatre

Sep 12, 2014, 4:49:58 PM
to h2os...@googlegroups.com, pranee...@gmail.com, ke...@0xdata.com
Sure. I was planning to attach a sample with my original post, but I didn't know what the attachment etiquette here was. Here are the first 5 rows:

1.0     0:1.0   1:1.0   2:1.0   3:75.0  4:512.0 5:1.0   6:0.0   7:0.0   8:0.0   9:0.0   10:0.0  11:0.0  12:9.0
0.0     0:1.0   1:1.0   2:1.0   3:0.0   4:0.0   5:0.0   6:0.0   7:0.0   8:1.0   9:0.0   10:0.0  11:0.0  12:3.0
0.0     0:1.0   1:1.0   2:1.0   3:0.0   4:30.0  5:0.0   6:0.0   7:1.0   8:0.0   9:0.0   10:0.0  11:0.0  12:32.0
0.0     0:1.0   1:1.0   2:1.0   3:0.0   4:0.0   5:0.0   6:0.0   7:0.0   8:0.0   9:0.0   10:0.0  11:0.0  12:1.0
0.0     1:1.0   2:1.0   3:0.0   4:25.0  5:0.0   6:0.0   7:1.0   8:0.0   9:0.0   10:0.0  11:0.0  12:27.0 13:1.0
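
As a rough way to check whether rows like these have the SVMLight shape before handing them to a parser, one could use a heuristic along the following lines. This is a hypothetical sketch in Python and does not reflect how H2O's auto-detection actually works; the names looks_like_svmlight and _PAIR are made up for illustration.

```python
import re

# Hypothetical pattern for one "index:value" token; not taken from H2O.
_PAIR = re.compile(r"^\d+:[-+]?(\d+\.?\d*|\.\d+)([eE][-+]?\d+)?$")


def _is_number(token):
    """Return True if the token can be read as a float."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def looks_like_svmlight(lines):
    """Return True if every non-blank line is a numeric label followed by
    zero or more index:value tokens, with at least one such token overall."""
    saw_pair = False
    for line in lines:
        tokens = line.split()  # any whitespace: tabs or spaces
        if not tokens:
            continue
        if not _is_number(tokens[0]):
            return False
        for token in tokens[1:]:
            if not _PAIR.match(token):
                return False
            saw_pair = True
    return saw_pair
```

The check splits on any whitespace, so it gives the same answer whether the columns are separated by tabs or by spaces. It deliberately rejects input that contains only labels, since such a file is indistinguishable from a single numeric column.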