Vendoring `encoding/csv/reader.go` (Is this a feature the golang community would entertain?)


P. Taylor Goetz

Apr 5, 2022, 8:05:42 PM
to golang-dev
My company has a use case where we need the golang CSV reader to be a little more permissive with regards to double-quoted fields. We deal with a lot of CSVs, some of which hit grey areas of the [CSV spec](https://www.rfc-editor.org/rfc/rfc4180.html).

Specifically, we have users who provide unquoted fields containing double-quotes like `a,"Ugly" Sweater",b` and `a,30" Television,b`.

Our potential (hopefully temporary) solution is to vendor `encoding/csv/reader.go` to allow `csv.Reader` clients to specify a rune that defines the double-quote character used during CSV parsing.

For example, it would allow the following (very crude) test to pass:

```
func TestUnquotedReader(t *testing.T) {
    // Runs inside the vendored package, so NewReader is unqualified;
    // assertions use a gotest.tools-style assert helper.
    line := `1234,24" Television`
    csvReader := NewReader(strings.NewReader(line))
    csvReader.Quote = rune(0) // the proposed knob: disable quote handling
    record, err := csvReader.Read()
    assert.Assert(t, err == nil)
    assert.Assert(t, len(record) == 2)
    assert.Assert(t, record[0] == "1234")
    assert.Assert(t, record[1] == `24" Television`)

    line = `a,"first" second,b`
    csvReader = NewReader(strings.NewReader(line))
    csvReader.Quote = rune(0)
    record, err = csvReader.Read()
    assert.Assert(t, err == nil)
    assert.Assert(t, len(record) == 3)
    assert.Assert(t, record[0] == "a")
    assert.Assert(t, record[1] == `"first" second`)
    assert.Assert(t, record[2] == "b")
}
```

Before going down the path of attempting to add this change to the golang core, my question is:

What is the community's interest in entertaining such a change?

Thanks in advance,

-Taylor



Ian Lance Taylor

Apr 6, 2022, 4:43:13 PM
to P. Taylor Goetz, golang-dev
Interest is low. We've found that there are a vast number of minor variations of the CSV format. Each knob we add gives an exponential increase in the number of configurations to test. After a while we decided to limit the package to RFC 4180, as the package comment says. People who need slightly different formats should copy and modify the package, as you suggest.

Ian

ben...@gmail.com

Apr 6, 2022, 6:23:39 PM
to golang-dev
I'm also in the process of vendoring encoding/csv and hacking it around, though in my case it's more so I can use it as a bufio.SplitFunc than because I need different parsing behaviour.

BTW, I presume you know about the csv.Reader's LazyQuotes option? It doesn't quite get what you want in the second test case, though... https://go.dev/play/p/6bL4x8BtcoN
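
Here's a minimal reproduction of what LazyQuotes does with that second case (my reading of the playground output):

```
package main

import (
    "encoding/csv"
    "fmt"
    "strings"
)

func main() {
    r := csv.NewReader(strings.NewReader(`a,"first" second,b`))
    r.LazyQuotes = true
    record, err := r.Read()
    // The bare quote is tolerated, but the quoted field swallows the
    // rest of the line: ["a" "first\" second,b"], 2 fields instead of 3.
    fmt.Printf("%q %v\n", record, err)
}
```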

-Ben

Mike Schinkel

Apr 7, 2022, 2:23:54 PM
to golang-dev
Ian

Just curious, but I wonder if the changes required might be handled in the standard package by the addition of a ReadFunc() method?

-Mike

Ian Lance Taylor

Apr 7, 2022, 4:38:25 PM
to Mike Schinkel, golang-dev
On Thu, Apr 7, 2022 at 11:23 AM Mike Schinkel <mi...@newclarity.net> wrote:
>
> Just curious, but I wonder if the changes required might be handled in the standard package by the addition of a ReadFunc() method?

The package takes an io.Reader, so you can already implement whatever
transformations you like on the original data.
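
For example, a minimal sketch (the specific rewrite rule is a made-up illustration, not a recommendation):

```
package main

import (
    "encoding/csv"
    "fmt"
    "io"
    "log"
    "strings"
)

func main() {
    var src io.Reader = strings.NewReader(`a,30" Television,b` + "\n")

    // Read the raw bytes, rewrite the problematic bare quote, and hand
    // the cleaned data to encoding/csv.
    data, err := io.ReadAll(src)
    if err != nil {
        log.Fatal(err)
    }
    clean := strings.ReplaceAll(string(data), `30" `, `30in `)

    record, err := csv.NewReader(strings.NewReader(clean)).Read()
    fmt.Printf("%q %v\n", record, err) // ["a" "30in Television" "b"] <nil>
}
```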

Ian

Mike Schinkel

Apr 7, 2022, 11:23:29 PM
to golang-dev
Hi Ian,

Thank you for the quick reply.

I spent some time tracing through the code in debug mode, and it appears you are indeed correct, in theory.

In practice, I see that csv.Reader employs a bufio.Reader, which calls the passed io.Reader to fill the bufio buffer before it is processed by csv.Reader.

So while, yes, one can implement custom functionality in one's own Reader, that requires duplicating most (all?) of the csv.Reader logic *and* adds overhead to the process.

OTOH, here is what I was suggesting with a ReadFunc():


Unless I missed a testing requirement, this would not require any testing beyond your existing tests. Further, I think it would be flexible enough that, if there were ever a need for "one more special case" in the future, it could be addressed in a backward-compatible way.

Something to consider?  

I will of course respect it if you do not consider this compatible with the "Go way," but presenting it at least gave me the chance to illustrate the idea in concrete form.

-Mike
P.S. Worst case, maybe Taylor will find this helpful.



peterGo

Apr 8, 2022, 3:32:18 PM
to golang-dev
This thread looks like The XY Problem to me: https://xyproblem.info/

The X problem is well known. It is the DSV (Delimiter-Separated Values) problem: https://en.wikipedia.org/wiki/Delimiter

We have known about the simple DSV solution since the days of Teletype ASRs and paper tape. The ASCII standard incorporates control character values for unit (field), record, group, and file separators.

The advent of spreadsheet programs drove the implementation of readable, printable data exchange formats, often CSV (Comma-Separated Values). Go implements RFC 4180 Common Format and MIME Type for Comma-Separated Values (CSV) Files: https://www.ietf.org/rfc/rfc4180.txt. The CSV format uses matching double quotes (") to frame arbitrary text, with two embedded double quotes as an escape for one embedded double quote.

The examples illustrate a simple scheme that uses commas as field delimiters: an easily implemented line = Join(fields, ",") and fields = Split(line, ",") scheme. However, the scheme breaks when commas occur in fields, for example, €1.599,99.
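
For example (a throwaway illustration):

```
package main

import (
    "fmt"
    "strings"
)

func main() {
    // One record whose middle field is the price €1.599,99.
    line := "a,€1.599,99,b"
    fmt.Printf("%q\n", strings.Split(line, ","))
    // ["a" "€1.599" "99" "b"]: the price splits in two, so naive Split fails.
}
```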

One solution is to use an unused value like the tab control character as the delimiter - TSV (Tab-Separated Values).

On a case-by-case basis it is sometimes possible to reformat malformed fields to CSV format. That's best done in a separate program. It's also possible to reformat package csv input using a custom io.Reader. A ReadFunc() or other changes to package csv are neither necessary nor desirable.

Here are the malformed examples when passing a custom DSV io.Reader to package csv.

Input:
a,"Ugly" Sweater",b
a,30" Television,b
1234,24" Television
a,"first" second,b

Output:
a,"""Ugly"" Sweater""",b
a,"30"" Television",b
1234,"24"" Television"
a,"""first"" second",b

Here are the same results using package bufio Scanner.

Input:
a,"Ugly" Sweater",b
a,30" Television,b
1234,24" Television
a,"first" second,b

Output:
a,"""Ugly"" Sweater""",b
a,"30"" Television",b
1234,"24"" Television"
a,"""first"" second",b

Peter

Ian Lance Taylor

Apr 8, 2022, 4:44:32 PM
to Mike Schinkel, golang-dev
I guess that's the reverse of what I was thinking. It seems to me
that you implement the ReadFunc you suggest just by wrapping something
around encoding/csv, which seems straightforward. No doubt I am
missing something.

Ian

P. Taylor Goetz

Apr 8, 2022, 5:36:23 PM
to golang-dev
Thanks Mike. I did indeed find it helpful.

-Taylor


P. Taylor Goetz

Apr 8, 2022, 6:15:40 PM
to golang-dev
Thanks for the input Peter.

To address the XY Problem concern, here is the missing background:

We have clients who ship us data in CSV/TSV format. And there are a lot of them. We have a system to parse and validate all files received.

Circa 2016 (years before I joined the company/project) we published a spec describing what CSV/TSV formats we accept. It's basically RFC 4180.

The actual implementation was written in Python, using Pandas, and was very liberal in terms of what it accepted as valid CSV/TSV. Clients throw crap CSVs at us, and the Pandas config readily accepts them. We were not enforcing our own published spec.

Fast forward to 2021, and that system is showing its limitations. Some files take up to 24 hours to process and pose a risk to our SLAs. Our solution: rewrite the validation logic in Go. Initial and current performance tests show up to a 10x performance increase over the Python implementation.

But we still have some clients who throw questionable CSVs at us, as we've allowed it for upward of 7 years. Now we've got to figure out a way to get the toothpaste back in the tube and actually enforce our published spec. My goal is to get all clients to 100% compatibility with RFC 4180. But we need some runway to make that happen in a way that doesn't piss off our customers. We need a way to turn the screws toward RFC 4180 compliance slowly.

It's a small but significant percentage of clients. For now we have a flag that directs "problem children" to the Python implementation. Good citizens get the Go implementation and the resulting 10x performance improvement.

-Taylor



P. Taylor Goetz

Apr 8, 2022, 6:35:00 PM
to golang-dev
Yes, I am. We've got a table that maps client feeds to CSV parser options until we can get our clients to comply with RFC 4180.

-Taylor

P. Taylor Goetz

Apr 8, 2022, 9:53:38 PM
to golang-dev
I think this approach could warrant some further discussion.

If I'm interpreting Mike's proposed intent correctly, it seems he's just looking to hook into the CSV parsing code.

I'm not sure I agree with this particular implementation, but I like the idea of being able to inject custom functions into the CSV parsing process. The first thing that comes to mind is being able to access the current line being parsed. The second would be being able to specify a function to run on the current line prior to parsing.

Function hooks into the CSV parsing process could be an easy way to satisfy needs like ours in a way that absolves the golang community of any responsibility.

What I'm looking for is:
    a) A way to get the current line being parsed. (This is easy.)
    b) The ability to define a function to run on each line prior to CSV parsing. Essentially a user-defined pre-processor.

The latter (b) is also fairly straightforward and non-invasive. The default would be a no-op, and anyone who overrides the default is responsible for the consequences of whatever they supply.

The option of last resort, IMHO, would be to fork "encoding/csv" and build a community of those who want more flexibility with CSV parsing in golang. If that's a path anyone wants to pursue, I have some experience helping OSS projects get established.

Would adding one or more plugin points for users of `encoding/csv` really be invasive?

-Taylor

PS: If anything I said comes across as adversarial, that is not my intent. I only want to promote discourse.



Mike Schinkel

Apr 9, 2022, 12:02:51 AM
to golang-dev
On Friday, April 8, 2022 at 9:53:38 PM UTC-4 ptg...@gmail.com wrote:
I think this approach could warrant some further discussion.

If I'm interpreting Mike's proposed intent correctly, it seems he's just looking to hook into the CSV parsing code.

Yes!  That is exactly the case. 
 
I'm not sure I agree with this particular implementation,

To be clear, I was just offering up a strawman implementation to spark discussion. That's why I presented it as a gist and not a PR.

I actually do not like some of what I did, so I went back and reimplemented it, simplified the callback a lot, and got rid of the interface{} performance penalty for non-custom use-cases:

- https://gist.github.com/mikeschinkel/98ed9de850f9c9a5f5be42a3bb70ca5f

However, I still don't fully like it, even though it is better, so it is still just a strawman/PoC.

As an aside for Ian to consider: these use-cases really seem to come down to needing a custom Quote character, just as a custom Comma character is already possible. Given that, maybe you could reconsider the potential of just adding that Quote property? If so, I think that much of the code in the gist could be repurposed to that end.
 
but I like the idea of being able to inject custom functions into the CSV parsing process. The first thing that comes to mind is being able to access the current line being parsed. The second would be being able to specify a function to run on the current line prior to parsing.

Function hooks into the CSV parsing process could be an easy way to satisfy needs like ours in a way that absolves the golang community of any responsibility.

What I'm looking for is:
    a) A way to get the current line being parsed. (This is easy.)
    b) The ability to define a function to run on each line prior to CSV parsing. Essentially a user-defined pre-processor.

The latter (b) is also fairly straightforward and non-invasive. The default would be a no-op, and anyone who overrides the default is responsible for the consequences of whatever they supply.

Would love to see you tackle the injection of custom functionality; I might learn some new approaches. :-)

BTW, my first attempt at the PoC did not actually try to solve the problem as you posed it, or at least as I understood it. As is so often the case, my approaches to solving a problem rarely survive an actual unit test! In that vein, I encourage you to implement a PoC for your approach to CSV extensibility, as you might find, like I so often do, that the solution you envision up-front has blockers you did not anticipate.

The option of last resort, IMHO, would be to fork "encoding/csv" and build a community of those who want more flexibility with CSV parsing in golang. If that's a path anyone wants to pursue, I have some experience helping OSS projects get established.

I agree that forking stdlib code should ideally be an option of last resort.

FWIW, my interest in this is not because I have a specific use-case for CSV. Instead, I was hoping that the Go team would be open to exploring more general-purpose ways to provide "hooks" into the standard library, so that forks of stdlib code, which often won't be maintained with bug fixes and security patches, don't proliferate any more than necessary, and so that compatibility can hopefully be maintained across multiple different extensions.

So I was just exploring what such extensibility might look like, and yes, I do not like what I came up with either. Hopefully there are more robust ways to handle this, possibly with generics.

-Mike

P.S. At this point I will probably bow out of this discussion because my interest was to spark a discussion about extensibility, and I think that has been done.  Good luck with your customer use-cases.


Manlio Perillo

Apr 9, 2022, 1:50:08 AM
to golang-dev
On Friday, April 8, 2022 at 9:32:18 PM UTC+2 peterGo wrote:
This thread looks like The XY Problem to me: https://xyproblem.info/

The X problem is well known. It is the DSV (Delimiter-Separated Values) problem: https://en.wikipedia.org/wiki/Delimiter

We have known about the simple DSV solution since the days of Teletype ASRs and paper tape. The ASCII standard incorporates control character values for unit (field), record, group, and file separators.

The advent of spreadsheet programs drove the implementation of readable, printable data exchange formats, often CSV (Comma-Separated Values). Go implements RFC 4180 Common Format and MIME Type for Comma-Separated Values (CSV) Files: https://www.ietf.org/rfc/rfc4180.txt. The CSV format uses matching double quotes (") to frame arbitrary text, with two embedded double quotes as an escape for one embedded double quote.

The examples illustrate a simple scheme that uses commas as field delimiters: an easily implemented line = Join(fields, ",") and fields = Split(line, ",") scheme. However, the scheme breaks when commas occur in fields, for example, €1.599,99.


The scheme also breaks when a newline occurs in a field (verified with LibreOffice's conversion to CSV).
Moreover, LibreOffice's CSV export only allows changing the field delimiter (unit separator) and the string delimiter, but not the record separator.

> [...]

Manlio 

Ian Lance Taylor

Apr 9, 2022, 11:50:37 PM
to P. Taylor Goetz, golang-dev
On Fri, Apr 8, 2022 at 6:53 PM P. Taylor Goetz <ptg...@gmail.com> wrote:
>
> Would adding one or more plugin points for users of `encoding/csv` really be invasive?

In my opinion, yes. All the plugin points need to be documented,
tested, and maintained. We have to field bug reports against them.
It's additional complexity to a package that already has a large
number of API knobs in the Reader struct, so we also need to document,
test, and maintain how the plugin points interact with all of the API
knobs.

And this is for code that is all of 450 lines including extensive
comments. Over at https://go-proverbs.github.io/ you will see "A
little copying is better than a little dependency." For input files
that don't follow RFC 4180, copy the package and adjust it as you
need.

In general, thanks for the discussion, but I want to say that we made
the decision to stop adding knobs to encoding/csv years ago, and we
documented it. As the package docs say, "There are many kinds of CSV
files; this package supports the format described in RFC 4180." Once
a decision like that is made, we prefer to not revisit it unless there
is new information. Otherwise we will spend all of our time
relitigating old decisions. The fact that there are kinds of CSV
files that the package doesn't handle isn't new information. Thanks.

Ian

P. Taylor Goetz

Apr 10, 2022, 10:36:07 PM
to golang-dev
I understand wanting to keep an API stable, but I still politely disagree with the basis of your argument.

We could simply add a function pointer to the csv.Reader struct that, if not nil, gets invoked (e.g., before parsing each CSV line).

The default would be nil, i.e. a no-op, so the only overhead would be a nil check at best and a function call at worst.

It would be backward compatible. No testing burden on Google or the golang community. You set that pointer, you own it.

If the default were nil (the zero value), why would you have to expand your tests? Anyone who sets that pointer to something non-nil would be on their own for the consequences. 
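
For concreteness, here is a hypothetical sketch of that knob (this is not the real encoding/csv API; the field name, placement, and readRawLine stand-in are illustrative only):

```
// Hypothetical sketch; NOT the actual encoding/csv Reader.
type Reader struct {
    Comma      rune
    LazyQuotes bool
    // ... the existing knobs ...

    // PreProcess, if non-nil, runs on each raw line before parsing.
    // The nil zero value preserves today's behavior exactly.
    PreProcess func(line []byte) []byte
}

func (r *Reader) readLine() ([]byte, error) {
    line, err := r.readRawLine() // stand-in for the existing buffered read
    if r.PreProcess != nil {
        line = r.PreProcess(line)
    }
    return line, err
}
```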

NB: I realize any changes suggested or implied in this thread are likely to be rejected. At the moment this is kind of an exercise for me in getting a better understanding of the core golang community. My apologies if it's distracting or disruptive.

- Taylor

Ian Lance Taylor

Apr 11, 2022, 7:49:14 PM
to P. Taylor Goetz, golang-dev
On Sun, Apr 10, 2022 at 7:36 PM P. Taylor Goetz <ptg...@gmail.com> wrote:
>
> I understand wanting to keep an API stable, but I still politely disagree with the basis of your argument.
>
> We could simply add a function pointer to the csv.Reader struct that, if not nil, gets invoked (e.g., before parsing each CSV line).
>
> The default would be nil, i.e. a no-op, so the only overhead would be a nil check at best and a function call at worst.
>
> It would be backward compatible. No testing burden on Google or the golang community. You set that pointer, you own it.
>
> If the default were nil (the zero value), why would you have to expand your tests? Anyone who sets that pointer to something non-nil would be on their own for the consequences.
>
> NB: I realize any changes suggested or implied in this thread are likely to be rejected. At the moment this is kind of an exercise for me in getting a better understanding of the core golang community. My apologies if it's distracting or disruptive.

Adding a function pointer means that we need to document and maintain
the function pointer so that code that uses it continues to work as it
expects. The encoding/csv package already has a number of knobs, and
each of those knobs is going to interact with the function pointer, if
only in deciding whether the knob is applied before or after the
function is called. That all has to be documented, and once
documented it can never be changed. In effect the function pointer
will lock us into a specific implementation.

More generally, adding a new knob always complicates an API. We made
a choice a while back to not add any new knobs to encoding/csv.
Adding a function pointer is adding a knob.

Ian

Sam Hughes

Apr 12, 2022, 8:34:30 AM
to golang-dev
@ptg, hey, why not the other way around?

1. I did observe that there's a "Transformer" interface in golang.org/x/text/transform, but I haven't had time to dig around in it.
2. Something like this would permit you to wrap a reader (such as an fs.File), packaged with a map of byte->rewrite-rule, and implement the basic io.Reader interface. You could then pass a configured instance to a CSV reader. There are some shortcuts, such as not permitting you to construct a new byte slice, push any necessary escape characters, the triggering byte, and then the remaining bytes. This approach permits a rule to request an arbitrary look-forward chunk, but with no scan-more functionality. It should be noted that, even if it doesn't look like it, I scraped this together for funsies; I'm pretty confident the approach is alright, but you may very well be able to make some major improvements. https://go.dev/play/p/uGwWSMmm9hO

```
// Requires: import "io"

const CHUNKSIZE int = 32

type RewriteFunc func([]byte) []byte

type RewriteRule struct {
    RewriteFunc
    Size int
}

func (rule RewriteRule) IsOk() bool {
    return rule.Size > 0 && rule.RewriteFunc != nil
}

// Transform applies the rule if at least Size bytes are available.
func (rule RewriteRule) Transform(p []byte) ([]byte, bool) {
    if len(p) < rule.Size {
        return nil, false
    }
    return rule.RewriteFunc(p), true
}

type SmartCSV struct {
    buffer  []byte
    ruleset map[byte]RewriteRule
    err     error
    src     io.Reader
}

// feed reads up to n more bytes from the source into the buffer and
// reports how many arrived.
func (scv *SmartCSV) feed(n int) int {
    if scv.err != nil {
        return 0
    }
    chunk := make([]byte, n)
    n, scv.err = scv.src.Read(chunk)
    if n > 0 {
        scv.buffer = append(scv.buffer, chunk[:n]...)
    }
    return n
}

// peek reports the byte at offset n, filling the buffer on demand.
func (scv *SmartCSV) peek(n int) (byte, bool) {
    if n < len(scv.buffer) {
        return scv.buffer[n], true
    }
    if scv.err != nil || scv.feed(CHUNKSIZE) == 0 {
        return 0, false
    }
    return scv.peek(n)
}

// pop consumes one byte, first applying any rewrite rule keyed on it.
func (scv *SmartCSV) pop() (b byte, ok bool) {
    if len(scv.buffer) == 0 {
        if _, ok = scv.peek(0); !ok {
            return 0, false
        }
    }
    b = scv.buffer[0]
    rule, found := scv.ruleset[b]
    if !found || !rule.IsOk() {
        scv.buffer = scv.buffer[1:]
        return b, true
    }
    if _, ok = scv.peek(rule.Size); !ok {
        scv.buffer = scv.buffer[1:]
        return b, false
    }
    p, ok := rule.Transform(scv.buffer)
    if !ok {
        scv.buffer = scv.buffer[1:]
        return b, false
    }
    b, scv.buffer = p[0], p[1:]
    return b, true
}
```

maili...@carlmjohnson.net

Apr 25, 2022, 12:07:39 PM
to golang-dev
As I understand the core issue, the problem is that you are accepting poorly formatted CSV files because Python allows this, but Python is now too slow to meet your requirements, so you want to switch to Go, but Go requires properly formatted CSV files. Some observations:

- In prior benchmarking, encoding/csv has been slower than Python's csv parser. This may have changed in the intervening years as Go improves, but probably not. You may want to do your own benchmarks before getting too deep into things.
- It seems unlikely to me in any case that you can read a CSV file much faster than Python, which of course "cheats" by being a C-wrapper.
- There was never going to be much interest in getting non-RFC encodings into the standard library, and especially now that Go modules make third-party packages much easier to distribute, it's definitely not going to happen.
- If Python is too slow, you may want to try a partial rewrite where you let Python do the things it is fast at (calling C and Fortran libraries) and use Go for the things it is fast at (using memory efficiently, concurrency/parallelism, IO scheduling). In particular, what if you read the malformed CSVs in Python, dump them out to a well-formatted file (possibly even a binary format, or at the very least a well-formatted CSV), and then read that in Go?
- In the long run, you should get your vendors to do the right thing, because otherwise you're at the mercy of their worst code. Liberal acceptance makes sense for a large Postel-space like the Internet, where there is no alternative, but for a closed system like a B2B relationship, you should just insist that they do the right thing and give them a deadline to become compliant.