CSV parsing: usability and performance considerations

Antonin Delpeuch (lists)

Jan 9, 2021, 9:48:43 AM
to openref...@googlegroups.com
Hello all,

As part of my work on scaling OpenRefine's backend, I have been working
on CSV/TSV parsing over the last month. I would like to submit some
thoughts to you about this area of the tool. In short, the support for
cell value escaping (by default with double quotes) makes it hard to
build an efficient importer and can also cause incorrect parses, so I am
looking for ways to improve the behaviour of the tool in this regard.

Let's first talk about an existing issue:
https://github.com/OpenRefine/OpenRefine/issues/724

Because not every CSV / TSV file uses escaping in the same way, our
parser does not always get things right with the default settings. In
some cases, the first character of a cell happens to be the one that is
treated as the escaping character (by default, the double quote), in
which case the parser can swallow many lines of the CSV file as a single
cell. The workaround in this case is to fix the importing settings, for
instance by changing the escaping character or disabling escaping
altogether. If the problem appears in the first few lines of the file,
then this will be visible in the import preview and users get a chance
to fix the problem straight away, but very often this happens further
down the file, and so the problem is only visible once the project is
created (with a lower than expected number of lines - which is a pretty
confusing symptom). It would be better to be able to flag this issue
earlier, ideally at the preview stage. We could also try to improve the
settings that are proposed by default to the user: if we notice a cell
which contains a significant proportion of the newlines of the entire
file, that suggests incorrect escaping settings.
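
To make the failure mode concrete, here is a small made-up example where
the second cell of the first data row happens to start with a double
quote that was never meant as an opening quote:

   id,size
   1,"6 feet (roughly)
   2,5 feet
   3,"4 feet
   4,3 feet

With the default settings, the parser treats that quote as the start of
an escaped cell and keeps reading until it meets the next double quote
(at the start of row 3), so rows 2 and 3 end up folded into a single
cell of row 1 and the project comes out with fewer rows than the file
has lines.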

Interestingly, this long standing issue is relevant for the new
architecture. Escaping is also a hurdle to making CSV parsing efficient.

In Spark, datasets are split into partitions, each of which contains a
finite stream of rows. Ideally, all partitions contain about the same
number of rows. For instance, the first partition could contain the
first batch of 1000 rows, the second contains rows 1001 to 2000, and so
on. To load a CSV file as a dataset in Spark, one must figure out where
in the file each partition should start (at which byte offset). Once
these offsets are determined, it becomes easy to enumerate the rows in
any partition: read the file from the partition offset in a streaming
fashion.

The problem is: these offsets can only be computed efficiently when
escaping is disabled. In that case, the strategy is simple:
first, determine the total size (in bytes) of the file. Divide this
evenly by the desired number of partitions, giving a list of byte
offsets. These byte offsets are likely to be in the middle of a line, so
they need to be moved a bit: for each partition, start reading from the
initial offset until you reach a newline character. That is now your new
offset: since we do not use escaping, we are guaranteed that this is the
beginning of a new row (and not a newline character inside a cell).
However, if you want to allow newlines in cells (with escaping), then
you cannot make this assumption anymore. You can read about this issue
in various places:

https://kokes.github.io/blog/2018/05/19/spark-sane-csv-processing.html

https://issues.apache.org/jira/browse/SPARK-22236
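
To make the strategy above concrete (evenly spaced candidate offsets,
each moved forward to the next newline), here is a minimal Java sketch
of that offset computation, under the assumption that escaping is
disabled. The class and method names are made up for illustration, this
is not the actual importer code:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class CsvPartitionOffsets {

    // Compute the byte offset at which each partition should start,
    // assuming escaping is disabled, so that any newline marks the end
    // of a row.
    static List<Long> computeOffsets(String path, int numPartitions)
            throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long size = file.length();
            offsets.add(0L); // the first partition starts at the beginning
            for (int i = 1; i < numPartitions; i++) {
                // start from an evenly spaced candidate offset...
                file.seek(i * (size / numPartitions));
                // ...and skip the remainder of the current line
                int b;
                while ((b = file.read()) != -1 && b != '\n') {
                    // keep reading until the next newline
                }
                long start = file.getFilePointer();
                if (start < size) {
                    offsets.add(start);
                }
            }
        }
        return offsets;
    }
}

Each partition then reads rows from its own offset up to the next
partition's offset, which is exactly the assumption that breaks down as
soon as quoted newlines are allowed.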

So, to sum up: replicating the exact same behaviour as the current
CSV/TSV importer in the new architecture means loading the entire file
in memory by default, so we will not be able to process large CSV files
efficiently.

Now, because of the issue above, we know that the current behaviour is
not ideal. So the question is: can we find a middle ground that gives us
the best of both worlds? We want to parse the file as the user expects
it and do so in an efficient way.

One cheap "solution" would be to change the default settings of the
importer to no escaping, making it efficient by default, but that feels
wrong to me: that does not solve the problem when escaping must be used,
which is arguably fairly common if the file is clean. And the preview
stage will be efficient no matter which options are used (since only the
first few lines of the file are read), so there is no point in changing
the default settings for the sake of giving a responsive preview.

Another solution would be to have a "multiLine" option similar to the
one Spark offers. When setting this option to false (which is the
default), newlines are not allowed to be part of cell values, even when
they are placed inside quotes. All other characters (such as cell
separators) can still be escaped there. We could offer the same option
to the user, setting it to false by default as well. This has the double
benefit of avoiding issue #724 and of enabling efficient parsing by
default, but the downside is that our default settings would not match
RFC 4180 anymore.
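
For reference, this is roughly how the corresponding option is used on
the Spark side (a minimal sketch with Spark's DataFrameReader; the file
path is made up):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MultiLineExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("multiline-csv")
                .master("local[*]")
                .getOrCreate();

        // multiLine = false (Spark's default): each physical line is a
        // row, quoted newlines are not honoured, and the file can be
        // split into partitions cheaply.
        Dataset<Row> fast = spark.read()
                .option("header", "true")
                .option("multiLine", "false")
                .csv("data.csv"); // made-up path

        // multiLine = true: quoted newlines are honoured (RFC 4180
        // style), but the file can no longer be split blindly at byte
        // offsets.
        Dataset<Row> compliant = spark.read()
                .option("header", "true")
                .option("multiLine", "true")
                .csv("data.csv");

        spark.stop();
    }
}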

A third option would be to adapt my implementation of the records mode,
based on custom RDDs, to allow for RFC-compliant parsing without loading
the entire dataset in memory, but it is not yet clear to me whether that
can work.

A fourth option would be to build the RDD after reading the CSV file in
multiple passes, none of which would load the entire dataset in memory:

- a first pass to count the number of rows. From this we deduce the
desired number of partitions and their row offsets.

- a second pass, reading the CSV file again in streaming mode and
counting the rows seen so far. Every time we reach one of the partition
row offsets, we note down the corresponding byte offset in the parser.

That then gives us all the necessary information to construct an RDD,
with each partition starting at its precomputed byte offset with a CSV
parser in a fresh state.
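
To illustrate the second pass, here is a rough Java sketch, assuming a
hypothetical parser interface that honours escaping and exposes the byte
offset of the next record (none of these names exist in the codebase):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class TwoPassSplitter {

    // Hypothetical interface: a CSV parser that honours escaping and
    // exposes the byte offset at which the next record starts.
    interface OffsetAwareCsvParser {
        boolean readRecord() throws IOException; // false at end of file
        long nextRecordByteOffset();
    }

    // Second pass: given the total row count obtained in the first
    // pass, record the byte offset at which each partition should start.
    static List<Long> partitionOffsets(OffsetAwareCsvParser parser,
                                       long totalRows,
                                       int numPartitions) throws IOException {
        long rowsPerPartition = (totalRows + numPartitions - 1) / numPartitions;
        List<Long> offsets = new ArrayList<>();
        offsets.add(parser.nextRecordByteOffset()); // first partition
        long row = 0;
        while (parser.readRecord()) {
            row++;
            if (row % rowsPerPartition == 0 && row < totalRows) {
                offsets.add(parser.nextRecordByteOffset());
            }
        }
        return offsets;
    }
}

The cost is one extra full pass over the file, but neither pass needs to
hold more than one record in memory at a time.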

The challenge with this approach is to make it work in a way that is
independent of the datamodel implementation: our CSV parser should
remain compatible with all implementations (Spark, in-memory, and
potentially others). It does mean adding some complexity to the
datamodel interface, but that might well be doable.

As you can see, this is a fairly complicated problem! I think it is
worth the effort, though: CSV is still a pervasive format, so we should
get this right. Looking forward to your thoughts on this.

Best,

Antonin

Thad Guidry

Jan 9, 2021, 3:11:52 PM
to openref...@googlegroups.com
> Another solution would be to have a "multiLine" option similar to the
> one Spark offers. When setting this option to false (which is the
> default), newlines are not allowed to be part of cell values, even when
> they are placed inside quotes. All other characters (such as cell
> separators) can still be escaped there. We could offer the same option
> to the user, setting it to false by default as well. This has the double
> benefit of avoiding issue #724 and of enabling efficient parsing by
> default, but the downside is that our default settings would not match
> RFC 4180 anymore.

This is fine, I think. RFC 4180 is not a hard spec on this point, if you
didn't know. Notice the word "should":
   6.  Fields containing line breaks (CRLF), double quotes, and commas
       should be enclosed in double-quotes.  For example:

       "aaa","b CRLF
       bb","ccc" CRLF
       zzz,yyy,xxx



Antonin Delpeuch (lists)

Apr 5, 2021, 10:00:14 AM
to openref...@googlegroups.com
Following up on this, I have finally got round to implementing this option.

By default, the "multiLine" option will be set to false, so the
importer will treat each line of the source file as its own row.

To parse examples like the one above correctly, one will need to enable
the "multiLine" option, which is going to make parsing less efficient on
large files.

How should we expose this option in the UI? I can think of some wordings:
- Rows can spread on multiple lines
- Cells can contain newline characters
- Support quoted newline characters

Any preferences?

Antonin

Thad Guidry

Apr 5, 2021, 12:44:11 PM
to openref...@googlegroups.com
I like the last one:
- Support quoted newline characters
But I'd probably rephrase it...

I tend to think the basis for the general rule is that of:
TRUE/FALSE - CRLF and LF will always be considered line breaks, no matter if enclosed in quotes or not.

So is that the actual rule you are going to implement with that option?

Antonin Delpeuch (lists)

Apr 5, 2021, 2:13:50 PM
to openref...@googlegroups.com
On 05/04/2021 18:43, Thad Guidry wrote:
>
> I tend to think the basis for the general rule is that of:
> TRUE/FALSE - CRLF and LF will always be considered line breaks, no
> matter if enclosed in quotes or not.
>
> So is that the actual rule you are going to implement with that option?

To phrase it that way, it would be more precise to say:

Line breaks will always be considered new *rows*, no matter if enclosed
in quotes or not.

Antonin

Thad Guidry

Apr 5, 2021, 2:38:26 PM
to openref...@googlegroups.com
I tend to think in characters.  Line breaks are characters, but hidden.
So...

Line breaks are not the new rows.
Line breaks are the separators for new rows.