[csvreader] Is there any way to keep going after encountering an error?


Steven Jeffries

Nov 16, 2018, 5:52:19 PM
to wwwmake
I'm processing some huge files. It really sucks to get an error 80% of the way through and not be able to finish. Is there any way to get an error on a line, do something with it (probably log it), but continue processing the rest of the file?

Thanks!

Gerald Bauer

Nov 17, 2018, 7:41:18 AM
to www...@googlegroups.com
Hello,

First, thanks for trying a different encoding. Great to hear it works.

> I'm processing some huge files. It really sucks to get an error 80% of the way through and not be able to finish. Is there any way to get an error on a line, do something with it (probably log it), but continue processing the rest of the file?

Good point. Yes, of course. Since this is a kind of new
(alternative) csv reader / library, the idea is to learn from the real
world and from real experience - and that is where your help is needed :-).

The idea is to collect the "real-world" errors one-by-one and add
options and tests to make the reader more "stable" when handling
real-world cases / error conditions. Again, the more real-world cases
that get reported, the better the reader gets and the more I can
fix / handle.

Thus, to conclude - if you want to help out, I invite you to post
the "offending" csv line(s) / text record(s) and the error you get -
the more the better.

Cheers. Prost.

Steven Jeffries

Nov 18, 2018, 4:05:09 PM
to wwwmake

Thanks for always responding so quickly!

I'm not really talking about any specific errors; I'm talking more about continuing past an error once one is encountered. As anyone who has ever worked with client-generated CSVs will know, this is always an issue.

An example would be something like this:

First Name,Middle Name,Last Name
Johnathan,Madeup,Smith
Farrokh,"Freddy" Mercury,Bulsara
Jane,AlsoMadeup,Smith

The improperly quoted line for Freddy Mercury will fail, and we will not get to the Jane Smith line. Now imagine there were 2GB of data before that line and another 1GB of data after it (a scenario that comes up far more frequently than I'd like).

It would be really nice to be able to say, "Hey, line 123456 failed" while still importing the rest of the data.

I'm not sure of how possible that is, or if something like that is even within the scope of this project, but it would be super cool.
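
To make the request concrete, here is a minimal sketch of "log the bad line and keep going". It uses Ruby's standard csv library rather than csvreader, and the helper name `parse_lines_tolerantly` is made up for illustration. It assumes that no record spans several physical lines (no line breaks inside quoted values), which is a real limitation for some files.

```ruby
require "csv"
require "stringio"

# Hypothetical helper (not part of csvreader): parses every physical line on
# its own with the standard csv library and collects failures instead of
# aborting. Assumption: no quoted value contains a line break.
def parse_lines_tolerantly(io)
  rows   = []
  errors = []
  io.each_line.with_index(1) do |line, lineno|
    begin
      row = CSV.parse_line(line.chomp)
      rows << row unless row.nil?        # nil means the line was empty
    rescue CSV::MalformedCSVError => e
      errors << [lineno, e.message]      # remember the line, then continue
    end
  end
  [rows, errors]
end

text = <<~TEXT
  First Name,Middle Name,Last Name
  Johnathan,Madeup,Smith
  Farrokh,"Freddy" Mercury,Bulsara
  Jane,AlsoMadeup,Smith
TEXT

rows, errors = parse_lines_tolerantly(StringIO.new(text))
errors.each { |lineno, message| warn "line #{lineno} failed: #{message}" }
```

With the sample above, the Freddy Mercury line ends up in `errors` with its line number, and the Jane Smith line is still imported.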

Also, I can't say this enough, your CSV library is one of the best out there right now. Thanks for all of the work you've put into it!

Thanks!
- Steve

Gerald Bauer

Nov 19, 2018, 3:27:30 AM
to www...@googlegroups.com
Hello Steve,

Thanks for your kind words and your detailed response and for the example:

> Farrokh,"Freddy" Mercury,Bulsara

Again, the idea is (sorry if that disappoints you) to add better
error recovery support case-by-case / example-by-example. The goal
is, yes, that you can choose / configure how to handle errors, including
ideally your case where a "broken" line gets skipped / recovered - but
the idea / theory is that there is no one-true-way/solution. Example:

If you (auto) convert 12.2.2 to a float - what do you expect?

- 12.2
- Float::NAN ?
- nil ?
- raise FormatException
- and so on

The idea is to offer all options (with some great defaults, of course).
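
As a rough illustration of "offer all options", the sketch below wraps Ruby's strict `Float()` conversion in a helper with a selectable error policy. The helper name `to_float` and the policy names (`:raise`, `:lenient`, `:nan`, `:nil`) are assumptions made up for this example; they are not csvreader options.

```ruby
# Hypothetical converter with a configurable error policy.
# Kernel#Float is strict and raises ArgumentError for "12.2.2";
# String#to_f is lenient and stops at the first invalid character.
def to_float(text, on_error: :raise)
  Float(text)
rescue ArgumentError
  case on_error
  when :lenient then text.to_f     # keep the leading valid part
  when :nan     then Float::NAN
  when :nil     then nil
  else raise                       # default: re-raise the original error
  end
end

to_float("12.2.2", on_error: :lenient)  # => 12.2
to_float("12.2.2", on_error: :nan)      # => NaN
to_float("12.2.2", on_error: :nil)      # => nil
```

Which policy makes a sensible default depends on the data, which is the point made above: there is no single right answer.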

Anyway, back to your sample:

> Farrokh,"Freddy" Mercury,Bulsara

What do you expect?

- A recoverable Format/StrayQuote error/exception?
- Auto-fixing the >"Freddy" Mercury< value, if that's possible, with a
  new rule: if a quoted value is followed by more data, keep adding it
  until hitting the separator (that is, comma) and turn the quotes into
  "literal" quotes as part of the value

That's my point. Ideally all "errors" can get auto-fixed and
recovered (with sensible defaults).
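
The proposed rule can be sketched for a single line as follows. This is only an illustration written with Ruby's `StringScanner`; it is not the csvreader implementation, and the name `split_with_autofix` is hypothetical. It also assumes the broken value contains no separator inside its quotes.

```ruby
require "strscan"

# Hypothetical single-line splitter that illustrates the auto-fix rule:
# a quoted value followed by more data is re-read up to the next separator
# and kept as-is, with its quotes as literal characters.
def split_with_autofix(line, sep: ",")
  ss        = StringScanner.new(line)
  sep_re    = /#{Regexp.escape(sep)}/
  until_sep = /[^#{Regexp.escape(sep)}]*/
  values    = []
  loop do
    ss.skip(/[ \t]*/)                       # ignore leading whitespace
    start = ss.pos
    if ss.scan(/"((?:[^"]|"")*)"/)          # well-formed quoted value
      value = ss[1].gsub('""', '"')
      ss.skip(/[ \t]*/)
      unless ss.eos? || ss.check(sep_re)    # extra data after closing quote
        ss.pos = start                      # auto-fix: rewind, read raw text
        value  = ss.scan(until_sep).strip
      end
    else
      value = ss.scan(until_sep).strip      # plain unquoted value
    end
    values << value
    break unless ss.skip(sep_re)            # no separator left: end of line
  end
  values
end

split_with_autofix(%Q{Farrokh,"Freddy" Mercury,Bulsara})
# => ["Farrokh", "\"Freddy\" Mercury", "Bulsara"]
```

Properly quoted values such as `"b,c"` still go through the normal quoted path; only a value with data after its closing quote takes the fallback.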

Thus, if you're interested, please keep reporting / posting more
cases so I can add the "fixes".

I will try to add an "auto-fix" recovery for the >"Freddy"
Mercury< case in the next csvreader library update.

Cheers.

PS: I also started a new ERRORS.md page to document all (recoverable)
built-in auto-fixes [1] and error recovery options.

[1] https://github.com/csvreader/csvreader/blob/master/ERRORS.md


Gerald Bauer

Nov 19, 2018, 4:43:51 AM
to www...@googlegroups.com
Hello,

FYI: I uploaded / pushed a new csvreader version, that is, v1.2.2,
which now includes an auto-fix for quoted values with extra
trailing values. Try:


def test_quote_with_trailing_value   # see [1]
  recs = [[ "Farrokh", "\"Freddy\" Mercury", "Bulsara" ]]
  assert_equal recs, parser.parse( %Q{Farrokh,"Freddy" Mercury,Bulsara} )
  assert_equal recs, parser.parse( %Q{ Farrokh , "Freddy" Mercury , Bulsara } )
  assert_equal recs, parser.parse( %Q{Farrokh, "Freddy" Mercury ,Bulsara} )
end

The new "auto-fix" will read

"Freddy" Mercury

as is, that is, turn it into an "unquoted" value with "literal"
quotes. Note: Leading and trailing whitespace gets trimmed / ignored
as usual.

Cheers. Prost.

[1] https://github.com/csvreader/csvreader/blob/master/test/test_parser_autofix.rb