Double quote in the CSV file


hmx...@gmail.com

Jul 5, 2014, 8:20:57 PM
to druid-de...@googlegroups.com
It seems the CSV parser doesn't handle double quotes correctly.

For example, if there are 4 fields in the input file (columns A, B, C, D) and A = '"', B = 'abc', C = 'efg', D = '"', the parser reads the row as

A = "abc,efg" with all the other columns as ''.

Is there any parameter to disable this default behavior?
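
For what it's worth, this is easy to see by calling opencsv (the parser Druid uses under the covers, per the reply below) directly; a rough sketch, assuming the Google Code-era au.com.bytecode.opencsv package:

    import au.com.bytecode.opencsv.CSVParser;

    public class QuoteCollapseDemo {
        public static void main(String[] args) throws Exception {
            // The four intended fields are: "  abc  efg  "
            String line = "\",abc,efg,\"";

            // Default parser: separator ',' and quote character '"'
            CSVParser parser = new CSVParser();
            String[] fields = parser.parseLine(line);

            // The opening quote makes the inner commas literal, so the
            // four columns collapse into a single field (roughly ",abc,efg,")
            System.out.println(fields.length);   // 1, not 4
            for (String f : fields) {
                System.out.println("[" + f + "]");
            }
        }
    }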

Fangjin Yang

Jul 6, 2014, 12:52:33 AM
to druid-de...@googlegroups.com
We are just using https://code.google.com/p/opencsv/ under the covers to parse CSV. I'm not sure if later versions are better with this. If you could submit a unit test, we can look into a fix.
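
A sketch of what such a unit test might look like (JUnit 4 against opencsv directly; the class and test names are made up):

    import static org.junit.Assert.assertEquals;

    import au.com.bytecode.opencsv.CSVParser;
    import org.junit.Test;

    public class CsvQuoteHandlingTest {
        @Test
        public void quotedRowCollapsesToSingleField() throws Exception {
            CSVParser parser = new CSVParser();  // default ',' separator, '"' quote

            // Intended as four fields: " / abc / efg / "
            String[] fields = parser.parseLine("\",abc,efg,\"");

            // Documents the behavior reported above: the quotes swallow the
            // inner commas and the row collapses into one field.
            assertEquals(1, fields.length);
        }
    }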

hmx...@gmail.com

Jul 6, 2014, 1:40:20 AM
to druid-de...@googlegroups.com
Thanks for the information. It took me 3 days to find the line containing a double quote among 400M rows. It caused a metrics datatype exception because all the columns were shifted.

Some suggestions:

* Let the user specify the field delimiter, like Hive/Pig do.
* When an exception happens, print out the row that caused the issue together with the exception.
* Make "ignoreInvalidRows": "true" really ignore them.

hmx...@gmail.com

Jul 6, 2014, 2:01:35 AM
to druid-de...@googlegroups.com
This is easy to reproduce. Say you have a 2-line file with 6 columns:

a,b,c,3,d,e
",x,",1,c,g

Assuming the 4th column is a metric column and all the other columns are dimensions, you won't be able to ingest it, because for the 2nd row the first dimension becomes "x", the 2nd becomes 1, the 3rd becomes c, and the metric column (4th) becomes "g".

If you have a longSum on that column, it will give you a datatype exception because it will try to add 3 + g.
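
Running that second line through opencsv's defaults shows the shift directly; same assumptions as the sketch above about the opencsv package and version:

    import au.com.bytecode.opencsv.CSVParser;

    public class ShiftedRowDemo {
        public static void main(String[] args) throws Exception {
            CSVParser parser = new CSVParser();

            // Second line of the file above; intended as 6 columns
            String[] fields = parser.parseLine("\",x,\",1,c,g");

            System.out.println(fields.length);   // 4 fields, not 6

            // fields[3] is now "g", which lands in the metric column;
            // this is the kind of numeric parse a longSum would fail on:
            long metric = Long.parseLong(fields[3]); // throws NumberFormatException
        }
    }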

Gian Merlino

Jul 6, 2014, 10:35:32 AM
to druid-de...@googlegroups.com
I think that's expected behavior for CSV. Most CSV parsers I am aware of will treat double quotes as a way to include commas within fields, so the first field there would get parsed as ,x,. If you *don't* want to treat quotes specially, and just split on commas, you can use the delimited data spec with delimiter "," instead of the csv data spec. I think the way you spell that is:

   {"format":"tsv", "delimiter": ",", "columns": [...], "dimensions": [...]}

hmx...@gmail.com

Jul 6, 2014, 10:58:15 AM
to druid-de...@googlegroups.com
Thanks, nice to know.

Isn't TSV commonly known as tab-separated values, BTW?

I don't see this mentioned in the documentation either.

Fangjin Yang

Jul 7, 2014, 1:23:13 PM
to druid-de...@googlegroups.com
"TSV" can technically use any custom delimiter. It should probably be called "delimited" instead of tsv.

hmx...@gmail.com

Jul 7, 2014, 1:31:34 PM
to druid-de...@googlegroups.com
100% agree. It will avoid a lot of confusion. Could you guys update the doc to reflect that?

Fangjin Yang

Jul 7, 2014, 2:16:39 PM
to druid-de...@googlegroups.com
Yup. We also hope for community contributions to the docs. All Druid docs are within the repo itself under docs/content.

Xavier Léauté

Jul 7, 2014, 2:20:23 PM
to druid-de...@googlegroups.com
We need to leave "tsv" for backward compatibility, but we can add "delimited" as an alias.

For CSV, the assumption of most parsers is that if your fields contain double quotes or commas, the field has to be double-quoted.
Generally speaking, if your file opens fine in typical spreadsheet software without specifying extra settings, it has a good chance of working. I tried importing your test case in Google Spreadsheets and it behaved the same way Druid did, leaving ",x," as a single field.

Since there is no agreed-upon CSV standard to speak of, we hope the library we use will cover most CSV files people encounter, which are typically generated by libraries that also target spreadsheets.

For production usage, we recommend using a well-defined format such as JSON or a simple delimited format where the input data has been scrubbed of delimiter characters and line terminators.
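
For reference, this is how the original four-column row would need to be written for a quote-aware parser to recover it: fields containing quotes or commas are wrapped in quotes, and embedded quotes are doubled; a sketch against opencsv's defaults:

    import au.com.bytecode.opencsv.CSVParser;

    public class ProperQuotingDemo {
        public static void main(String[] args) throws Exception {
            // Intended fields: a literal quote, abc, efg, a literal quote.
            // Each quote-only field is doubled and wrapped in quotes: """"
            String line = "\"\"\"\",abc,efg,\"\"\"\"";

            String[] fields = new CSVParser().parseLine(line);

            System.out.println(fields.length);   // 4
            for (String f : fields) {
                System.out.println("[" + f + "]");  // ["]  [abc]  [efg]  ["]
            }
        }
    }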


hmx...@gmail.com

Jul 7, 2014, 2:45:41 PM
to druid-de...@googlegroups.com
If you go to the opencsv doc (that's what you guys use, according to Fangjin), it allows you to specify whether you want to use a quote character or not.
https://code.google.com/p/opencsv/

    CSV csv = CSV
        .separator(',')  // delimiter of fields
        .quote('"')      // quote character

The problem here is that this option is not exposed in the Druid JSON spec.

For my case, the delimited option works great.
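
For reference, at the library level opencsv can be told to ignore the quote character entirely; a sketch assuming the two-argument CSVParser constructor and the NULL_CHARACTER constant from the opencsv releases of that era (not something Druid's spec exposes today):

    import au.com.bytecode.opencsv.CSVParser;

    public class NoQuoteParserDemo {
        public static void main(String[] args) throws Exception {
            // Quote handling disabled: '"' becomes an ordinary character
            CSVParser noQuotes = new CSVParser(',', CSVParser.NULL_CHARACTER);

            String[] fields = noQuotes.parseLine("\",x,\",1,c,g");

            System.out.println(fields.length);   // 6, quotes kept literally
        }
    }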

Xavier Léauté

Jul 7, 2014, 3:23:20 PM
to druid-de...@googlegroups.com
Correct, we currently just expose the default opencsv behavior, and the underlying implementation has been abstracted.

