Problem importing CSV data


Antonio Vidal Vidal

Dec 12, 2013, 1:03:05 PM
to h2os...@googlegroups.com
Hi,
I'm trying out H2O 2.0.1.2 to analyze some data on a single Windows machine. To do so, I'm importing two datasets with the same kind of data from two separate CSVs: one for training and another for testing.

The problem is that in one of them a variable is imported as categorical and in the other as numeric. Is there a way to specify the type of an imported variable?

The log of the problem:

12-dic 06:50:19.873 172.20.34.149:54321 2408 FJ-9-41 ERRR WATER:
java.lang.IllegalArgumentException: Incompatible column: 'hnd_price', expected (trained on) numeric, was passed a categorical
	at water.Model.adapt(Model.java:195)
	at water.Model.adapt(Model.java:232)
	at water.Model.score(Model.java:99)
	at hex.gbm.SharedTreeModelBuilder$Score.doIt(SharedTreeModelBuilder.java:383)
	at hex.drf.DRF.doScoring(DRF.java:166)
	at hex.drf.DRF.buildModel(DRF.java:149)
	at hex.gbm.SharedTreeModelBuilder.buildModel(SharedTreeModelBuilder.java:147)
	at hex.drf.DRF.exec(DRF.java:111)
	at water.Job$4.compute2(Job.java:482)
	at water.H2O$H2OCountedCompleter.compute(H2O.java:668)
	at jsr166y.CountedCompleter.exec(CountedCompleter.java:429)
	at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
	at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
	at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
	at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

Many thanks,
Antonio Vidal

Antonio Vidal Vidal | Innovation Manager
T/ + 34 986 410 091 (ext) 240
M/ + 34 673 214 580
www.optaresolutions.com



Tom Kraljevic

Dec 12, 2013, 2:25:00 PM
to Antonio Vidal Vidal, h2os...@googlegroups.com

Hi Antonio,


We are adding a procedure to do this in the UI early in the new year.
Meanwhile, we will send you a recipe that allows you to do this from R.

A quick workaround is to turn the integers into strings by changing the data
from '1', '2', '3'…  to 'val1', 'val2', 'val3', …

We automatically detect strings and turn them into categoricals.


Thanks,
Tom
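
A rough R sketch of that quick workaround (the file name train.csv and the column name code below are placeholders, not from this thread):

# Sketch: prefix the integer codes with a string before handing the file to H2O,
# so the parser sees strings and builds a categorical column.
df <- read.csv("train.csv", stringsAsFactors = FALSE)   # "train.csv" is a placeholder
df$code <- paste0("val", df$code)                        # 1 -> "val1", 2 -> "val2", ...
write.csv(df, "train_categorical.csv", row.names = FALSE)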



Tom Kraljevic

Dec 12, 2013, 3:21:49 PM
to Antonio Vidal Vidal, h2os...@googlegroups.com

Antonio,


Apologies, I think I misread your problem earlier.

I originally read that you wanted to treat your column as a categorical, but it looks
like you really want to treat it as a number (since it's a price).

Most likely what happened is that your test data has some bad (i.e., character) values
in the column, so H2O treated it as a categorical instead of a number.

Since H2O automatically treats columns containing strings as categoricals, the solution
would be to ensure all values in the column are numbers.

As I mentioned earlier, we are working on a set of UI changes which will enable the
user to bind a column to a particular data type.  But for now, I suggest you ensure the
values in the column are strictly numerical.


Thanks,
Tom
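
One rough way to follow that advice from R before importing (a sketch; the file name test.csv and the column name hnd_price simply mirror the error message above and may need adjusting):

# Sketch: read everything as character and list the hnd_price values that do not
# parse as numbers -- these are what make H2O flip the column to categorical.
df <- read.csv("test.csv", colClasses = "character", stringsAsFactors = FALSE)
bad <- df$hnd_price[is.na(suppressWarnings(as.numeric(df$hnd_price)))]
unique(bad)   # e.g. "", "N/A", or other non-numeric tokens to fix or drop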

Chris Kuethe

Dec 24, 2013, 12:23:43 AM
to h2os...@googlegroups.com, Antonio Vidal Vidal
On Thursday, December 12, 2013 11:25:00 AM UTC-8, Tom Kraljevic wrote:

Hi Antonio,


We are adding a procedure to do this in the UI early in the new year.
Meanwhile, we will send you a recipe that allows you to do this from R.

I'd also like to see this recipe.
 

A quick workaround is to turn the integers into strings by changing the data
from '1', '2', '3'…  to 'val1', 'val2', 'val3', …

We automatically detect strings and turn them into categoricals.

Not very well? Here's a little CSV file that demonstrates categoricals not being generated correctly. I'm double-quoting every value since some of my input may have commas, spaces, or empty strings. When I upload this through the web UI, the string values are not handled correctly. Should I be reformatting the input somehow?


"name","int1","int2","int3","str1","float1","str2","float2"
"alice","0","0","0","edison","0.5","washington","-0.73"
"bob","0","0","1","","0","","0"
"carol","0","1","0","edison","1.0","jefferson","1.0"
"dave","0","1","1","-","13.0","-1","42.0"
"eve","1","0","0","17","1.609","jefferson","42.0"
"fred","1","0","1","westinghouse","3.1416","washington","42.0"
"gillian","1","1","0","tesla","69.0","jefferson","42.0"
"harold","1","1","1","edison","13.0","madison","42.0"

SriSatish Ambati

Dec 24, 2013, 1:08:28 AM
to Chris Kuethe, h2ostream, Antonio Vidal Vidal
Chris & Antonio -

- We support as.factor() via R syntax, which you can use for this. An example:

# Test-harness example: upload a file, then coerce a numeric column to a factor.
setwd(normalizePath(dirname(R.utils::commandArgs(asValues=TRUE)$"-f")))
source('../../findNSourceUtils.R')

test.as.factor.basic <- function(conn) {
  hex <- h2o.uploadFile(conn, locate("../smalldata/cars.csv"), key = "cars.hex")
  hex[,"cylinders"] <- as.factor(hex[,"cylinders"])   # force the column to categorical
  expect_true(is.factor(hex[,"cylinders"])[1])
  testEnd()
}
doTest("Test the as.factor unary operator", test.as.factor.basic)

In the example you gave, we strip the quotes and make them ints. It's an issue we have not yet run into in the field (HEX-453).

If you are trying the shortcut to an enum, I'd concatenate a single character onto the front of the values, or use as.factor via R.

FYI, string handling with large cardinality is currently broken and being fixed (HEX-959).

thanks &
Let us know if you need more help, Sri






--
ceo & co-founder, 0xdata Inc

Chris Kuethe

Dec 24, 2013, 12:56:34 PM
to h2os...@googlegroups.com, Chris Kuethe

On Monday, December 23, 2013 10:08:28 PM UTC-8, srisatish wrote:

Thanks for the reply!
 
In the example you gave, we strip the quotes and make them ints. It's an issue we have not yet run into in the field (HEX-453).

If you are trying the shortcut to an enum, I'd concatenate a single character onto the front of the values, or use as.factor via R.

I'm extracting values from user-controlled text input, so I can't know ahead of time what values I'll see: 1, 1.0, "mickey mouse", or "" could all legitimately appear.

It looks like the parser uses the first non-header row to determine the column type? Along with the UI change to set the column type, what about scanning further into the input to help detect whether non-numeric values are present? For now, I'll just add a dummy row with non-numeric data:

index,country,region
0,0,0
1,Mexico,"North America"
2,Canada,"North America"
3,Bermuda,Caribbean
4,Japan,Asia
5,"United States","North America"

Changes to:

index,country,region
0,X,X
1,Mexico,"North America"
2,Canada,"North America"
3,Bermuda,Caribbean
4,Japan,Asia
5,"United States","North America"

FYI, string handling with large cardinality is currently broken and being fixed (HEX-959).

Hm. I think I might have just hit that. What is considered "large"? I just tried parsing a CSV with 108 values in my enum - does it generate a backtrace like this?

java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.NullPointerException: null while mapping key $00000000000000000000$input.csv
11:00:46.251 FJ-9-1    INFO WATER: at water.parser.ParseDataset.parse(ParseDataset.java:330)
11:00:46.252 FJ-9-1    INFO WATER: at water.parser.ParseDataset$ParserFJTask.compute2(ParseDataset.java:359)
11:00:46.253 FJ-9-1    INFO WATER: at water.H2O$H2OCountedCompleter.compute(H2O.java:668)
11:00:46.253 FJ-9-1    INFO WATER: at jsr166y.CountedCompleter.exec(CountedCompleter.java:429)
11:00:46.254 FJ-9-1    INFO WATER: at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
11:00:46.255 FJ-9-1    INFO WATER: at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
11:00:46.256 FJ-9-1    INFO WATER: at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
11:00:46.257 FJ-9-1    INFO WATER: at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
11:00:46.258 FJ-9-1    INFO WATER: Caused by: java.util.concurrent.ExecutionException: java.lang.NullPointerException: null while mapping key $00000000000000000000$input.csv
11:00:46.258 FJ-9-1    INFO WATER: at jsr166y.ForkJoinTask.get(ForkJoinTask.java:961)
11:00:46.259 FJ-9-1    INFO WATER: at water.parser.ParseDataset.parse(ParseDataset.java:329)
11:00:46.261 FJ-9-1    INFO WATER: ... 7 more
11:00:46.267 FJ-9-1    INFO WATER: Caused by: java.lang.NullPointerException: null while mapping key $00000000000000000000$input.csv
11:00:46.269 FJ-9-1    INFO WATER: at water.parser.Enum.getTokenId(Enum.java:83)
11:00:46.270 FJ-9-1    INFO WATER: at water.parser.DParseTask.addStrCol(DParseTask.java:901)
11:00:46.271 FJ-9-1    INFO WATER: at water.parser.CsvParser.parallelParse(CsvParser.java:127)
11:00:46.272 FJ-9-1    INFO WATER: at water.parser.DParseTask.map(DParseTask.java:575)
11:00:46.274 FJ-9-1    INFO WATER: at water.MRTask.lcompute(MRTask.java:66)
11:00:46.275 FJ-9-1    INFO WATER: at water.DRemoteTask.compute2(DRemoteTask.java:74)
11:00:46.276 FJ-9-1    INFO WATER: ... 6 more


thanks &
Let us know if you need more help, Sri

Thanks for your help, and season's greetings.
-C

SriSatish Ambati

Dec 24, 2013, 1:16:58 PM
to Chris Kuethe, h2ostream
Chris,

- Yes, you guessed right: we take the first few rows and guess the column types. The heuristics will be replaced with more rule-based parsing in the next-gen parser. Thanks for working around the issue.

- 65k unique strings counts as "large" at the moment.

I'm afraid we ran into something else here -

Would it be possible to describe -
- Version you are using (See: Admin > Inspect Logs - the git sha)
- Steps to repro, in case it is reproducible

thanks, Sri

Chris Kuethe

Dec 24, 2013, 8:08:46 PM
to h2os...@googlegroups.com, Chris Kuethe
On Tuesday, December 24, 2013 10:16:58 AM UTC-8, srisatish wrote:
- 65k unique strings counts as "large" at the moment.
 
My example data is tiny - less than 1000 records, each less than 256 bytes, less than 30 bytes for any given string.

I'm afraid we ran into something else here -

Would it be possible to describe -
- Version you are using (See: Admin > Inspect Logs - the git sha)

11:11:14.779 main      INFO WATER: Build git branch: rel-ito
11:11:14.779 main      INFO WATER: Build git hash: a49bf63680d0181d97054c301583f6ed63c00060
11:11:14.779 main      INFO WATER: Build git describe: nn-2-1112-ga49bf63
11:11:14.779 main      INFO WATER: Build project version: 2.0.1.2
Bug report sent.
 
- Is this data something you can share so we can try & fix it from here? 

My source data isn't for public consumption, but I cooked up a trivial example to reproduce it (and included it in the bug report). 

string1,int1,string2,float1
"foobar","12345","Zürich, Switzerland",3.1416

And here's a uuencoded version of the source file, in case Google strips the high bits...

begin 644 sample.csv
M<W1R:6YG,2QI;G0Q+'-T<FEN9S(L9FQO870Q"B)F;V]B87(B+"(Q,C,T-2(L
=(EK\<FEC:"P@4W=I='IE<FQA;F0B+#,N,30Q-@H`
`
end

avi...@optaresolutions.com

Dec 26, 2013, 3:57:02 AM
to h2os...@googlegroups.com, Chris Kuethe
Hi all,
I also have a problem importing CSVs with a Spanish locale (write.csv2 in R).

In your uploading assistant I could specify the column separator character, but not the decimal-point character, which is "," in Spain, so the numbers are imported as text. Of course there are workarounds, but I think it would be interesting for the new heuristics to also detect the locale automatically, or to allow specifying the decimal-point character manually.

Thank you for your great work, and Merry Christmas!

BRs
Antonio Vidal
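
Until the parser takes a decimal-separator option, one workaround is to round-trip the file through R; a rough sketch, assuming the file was written with write.csv2 (";" separators and "," decimals; file names are placeholders):

# Sketch: convert a Spanish-locale CSV into the "." decimal / "," separator
# format that H2O's parser reads as numeric.
df <- read.csv2("datos.csv")                        # read.csv2 assumes sep=";" and dec=","
write.csv(df, "datos_h2o.csv", row.names = FALSE)   # write.csv uses sep="," and dec="."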

Zach Hespelt

Oct 16, 2014, 12:11:01 PM
to h2os...@googlegroups.com
I'm trying to upload a dataset (as a CSV) to H2O, but the FirstName column gets converted to null/missing.

Here's a sample of the dataset in R. The column types are: numeric, factor, factor, numeric. The thing that is tripping me up is that one of the factor columns (Gender) gets interpreted correctly, while the other (FirstName) doesn't. I've tried quoting the columns. There aren't any missing or numeric values, or odd characters, in the column. Any help would be great, thanks.


   
> head(TrainingNames)
  Year FirstName Gender Freq
1 1880      Mary      F 7065
2 1880      Anna      F 2604
3 1880      Emma      F 2003
4 1880 Elizabeth      F 1939
5 1880    Minnie      F 1746
6 1880  Margaret      F 1578

> summary(TrainingNames)
      Year        FirstName       Gender           Freq        
 Min.   :1880   Francis:    268   F:1062432   Min.   :    5.0  
 1st Qu.:1948   James  :    268   M: 729659   1st Qu.:    7.0  
 Median :1981   Jean   :    268               Median :   12.0  
 Mean   :1972   Jesse  :    268               Mean   :  186.1  
 3rd Qu.:2000   Jessie :    268               3rd Qu.:   32.0  
 Max.   :2013   John   :    268               Max.   :99674.0  
                (Other):1790483                                


Detected format after uploading to H2O web UI on localhost


   
Row#   Year  FirstName  Gender  Freq
Row 0  1880  Mary       F       7065
Row 1  1880  Anna       F       2604
Row 2  1880  Emma       F       2003
Row 3  1880  Elizabeth  F       1939
Row 4  1880  Minnie     F       1746
Row 5  1880  Margaret   F       1578
Row 6  1880  Ida        F       1472
Row 7  1880  Alice      F       1414
Row 8  1880  Bertha     F       1320


Result after clicking okay


   
Row          Year       FirstName  Gender  Freq
Change Type  As Factor  As Factor          As Factor
Type         Int        Int        Enum    Int
Min          1880       -                  5
Max          2013       -                  99674
Mean         1971.852                      186.05
Std Dev      33.358                        1578.377
Cardinality  2
Missing      1792091

0            1880       -          F       7065
1            1880       -          F       2604
2            1880       -          F       2003
3            1880       -          F       1939
4            1880       -          F       1746
5            1880       -          F       1578
6            1880       -          F       1472
7            1880       -          F       1414
8            1880       -          F       1320
9            1880       -          F       1288
10           1880       -          F       1258

ke...@0xdata.com

Oct 17, 2014, 1:08:12 AM
to h2os...@googlegroups.com, avi...@optaresolutions.com
I didn't see anyone address this paragraph below, where the OP mentioned "double quoting every value"

That's assuming that double quoting something makes it a string.
That's not true in h2o.

While not updated in a while, this parser spec for h2o is reasonable:
https://github.com/0xdata/h2o/wiki/Parser-Specification

"The whitespace and quote stripping rules, imply that a pure number can never be used or interpreted as a string."

You can't make a number into an enum by quoting it, in h2o.

I think this might explain why the two files got different type guesses.

---------------------------

ke...@0xdata.com

Oct 17, 2014, 1:22:06 AM
to h2os...@googlegroups.com, avi...@optaresolutions.com, ke...@0xdata.com
Hi Zach Hespelt:

You wrote
"I'm trying to upload a dataset (as a CSV) to H2O, but the FirstName column gets converted to null/missing
"

You have 1792091 rows in your dataset.

Typically, when h2o flips a col to all NAs, it's because there's a bad mix of integer/reals (or what h2o thinks is integer/real) and enums in a col, or there are too many unique values for the enums.


Here's what I do to discover that,

Assuming you have a linux system and your file is t.csv
Your "First Name" col is the interesting one (col 2)

Is the data comma separated?
outside of both R and h2o you can do
cut -d, -f2 t.csv | sort -u | wc -l

That will tell you how many unique categoricals you have.
If it's a relatively small number, you can pipe it into a file, and see if any numbers are in there

cut -d, -f2 t.csv | sort -u > t_uniques.txt   # write to a new file so t.csv isn't overwritten
grep '[0-9]' t_uniques.txt

Let me know if you're able to do that.

-kevin

You included this information from h2o Inspect that said Cardinality 2
I suspect that info was not from the First Name col, since it won't say Cardinality 2 when it flips the entire col to NA.
Can you confirm? -thanks -kevin

Min 1880 - 5
Max 2013 - 99674
Mean 1971.852 186.05
Std Dev 33.358 1578.377
Cardinality 2
Missing 1792091


To be blunt, we have a set of rules for flipping a col to all NAs, and they have some good rationale. But we don't print out information when this happens, so it surprises people. I do find, though, that if we explain what h2o does, it is possible to find the problem. It is usually something about the data having numbers plus enums in the same col, or too many enums. It might be that you have too many first names (possible with 1.7M rows)... but the cut + wc -l above will tell us!



ke...@0xdata.com

Oct 17, 2014, 1:57:46 AM
to h2os...@googlegroups.com, avi...@optaresolutions.com, ke...@0xdata.com
Hi Chris Kuethe:

I think rel-ito is from Nov. 2013? I was wondering if you could update to rel-mandelbrot or rel-lambert?


more detail:
You mentioned you saw a stack trace with this data? (or did I misunderstand?)

I was excited to see it, as I've been testing full UTF8 support (for enums) in h2o. Can I ask what h2o version you have? Oh, I see you have rel-ito

11:11:14.779 main INFO WATER: Build git branch: rel-ito

rel-ito was h2o version 2.0.12 http://h2o-release.s3.amazonaws.com/h2o/rel-ito/2/index.html

Our current version is 2.8.0.1
http://h2o-release.s3.amazonaws.com/h2o/rel-mandelbrot/1/index.html


I could go back and test your case with rel-ito, but I wonder if it makes sense for you to get the latest release: Mandelbrot (or even the last release Lambert) and see if there is an issue
at http://0xdata.com/download/


I notice you had a utf8 (non-ascii) char in the 2nd row
string1,int1,string2,float1
"foobar","12345","Zürich, Switzerland",3.1416

I put that in a file and parsed it and it parsed okay for me. (I am using a local build, but it should be the same as rel-mandelbrot)

Note my browser doesn't support the font for Zürich so it looks weird in the parse Inspect result I copy/pasted here

Maybe I misunderstood your report above. Is this report about your expectation around quoted numbers, or does this case cause a stack trace?

I've done extensive testing of non-ascii UTF8 chars in our current H2O. Our special chars that affect parsing (cols/numbers/reals/column separators/row separators) need to be all ascii, but within an enum, any UTF8 is fine.

thanks
-kevin


My result:


Row          string1  int1    string2              float1
Change Type  As Factor
Type         Enum     Int     Enum                 Real
Min                   12345                        3.1416
Max                   12345                        3.1416
Mean                  12345                        3.142
Std Dev
Cardinality  1                1
0            foobar   12345   Zürich, Switzerland  3.1416

So I think the latest h2o will work for you?

thanks
-kevin

ke...@0xdata.com

Oct 17, 2014, 2:00:29 AM
to h2os...@googlegroups.com, avi...@optaresolutions.com, ke...@0xdata.com
I just noticed Chris Kuethe had posted a year ago, and Zach et al. had updated this post recently, so my reply to Chris is probably too late now! In any case, the info might be useful to folks, even today!

-kevin

Brandon Hill

Oct 17, 2014, 2:10:13 AM
to h2os...@googlegroups.com
I accidentally sent my reply to only Zach, and not the list this morning.  For those curious about the answer: 

Zach is running into a combination of two items. In the first version of H2O, we don't support strings as a column data type. Columns with strings are treated as factors (or enums, depending on your background). This is working fine for his Gender column.

The other limit we have is on the number of unique factors we allow. In this version of H2O, only 65,000 unique factors are allowed. For values beyond that, the system sets the remaining values to NA. So what he is seeing is expected behavior. Columns with more than 65,000 unique values are typically identifiers that aren't meant to be used as parameters in a machine learning algorithm.

We have added support for a string column type in the next generation of H2O (in our h2o-dev source repository on GitHub), which would prevent the NAs. In Zach's case, it seems he was anticipating almost 2M names but still wanted them treated as factors, which we don't currently support.

-Brandon
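
A quick way to check a column against that limit before uploading, as a rough R sketch (the file name is a placeholder):

# Sketch: count distinct values per column; in this H2O version, a factor column
# with more than ~65,000 uniques gets its values flipped to NA.
df <- read.csv("TrainingNames.csv", stringsAsFactors = FALSE)
sapply(df, function(col) length(unique(col)))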


On 10/16/14 11:00 PM, ke...@0xdata.com wrote:
I just noticed Chris Kuethe had posted a year ago, and Zach et al. had updated this post recently, so my reply to Chris is probably too late now! In any case, the info might be useful to folks, even today!

-kevin


