Dealing with BOM in encoding/csv

Ondrej

Mar 8, 2016, 8:26:35 AM
to golang-nuts
I know the misuse of BOM has been discussed here and in GH issues, but I wanted to ask how to work around it.

That is: if I simply have a BOM in a file, I have to deal with it. The added nuisance in my case is that I'm reading a CSV file. If the top row is unquoted, that's fine (the BOM just adds 3 bytes to the first column name, which I could live with), but if it's quoted, the reader errors out, because the three BOM bytes precede the first quote, so that quote is treated as a bare quote inside an unquoted field. If I turn on LazyQuotes, the quotes end up in the first value and I get ["column1", column2, column3], which is suboptimal (and wrong). To illustrate, I took the original reader example (https://golang.org/pkg/encoding/csv/#example_Reader) and added a BOM: http://play.golang.org/p/plpyHIFKCV
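
Roughly what the playground example boils down to (a minimal, self-contained sketch; the column names are made up for illustration):

    package main

    import (
        "encoding/csv"
        "fmt"
        "strings"
    )

    func main() {
        // UTF-8 BOM (0xEF 0xBB 0xBF) in front of a quoted header row
        in := "\xef\xbb\xbf\"column1\",\"column2\",\"column3\"\nval1,val2,val3\n"
        records, err := csv.NewReader(strings.NewReader(in)).ReadAll()
        // the BOM means the first field no longer starts with a quote, so the
        // quote is reported as a bare quote and ReadAll returns an error
        fmt.Println(records, err)
    }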

As Ian suggests in the Github issue, one should deal with it before passing it to the CSV reader. The most naive way would be

    f, _ := os.Open(fn)
    rd := make([]byte, 3)
    f.Read(rd)
    if rd[0] == 0xEF && rd[1] == 0xBB && rd[2] == 0xBF {
        // fmt.Println("BOM")
    } else {
        // fmt.Println("no BOM")
        f.Seek(0, 0)
    }

But that only handles the UTF-8 BOM, ignores read errors and files shorter than three bytes, etc. So my question is: does anyone have a BOM-stripping procedure in place?

Thanks!

Konstantin Khomoutov

Mar 9, 2016, 11:12:49 AM
to Ondrej, golang-nuts
On Tue, 8 Mar 2016 04:01:45 -0800 (PST)
Ondrej <ondrej...@gmail.com> wrote:

[...]
> That is - if I simply have a BOM in a file and have to deal with it.
> The added nuisance in my case is that I'm reading a CSV file, which,
> if the top row is unquoted, is fine (just adds 3 bytes to the first
> column name, I could live with that), but if it's quoted, the reader
> will err, because it assumes the first quote is lazy, even though it
> isn't (because three bytes precede it). If I turn on LazyQuotes, then
> I get ["column1", column2, column3], which is suboptimal (and
> wrong). To give you an example, I took the original reader example
> <https://golang.org/pkg/encoding/csv/#example_Reader> and added BOM
> <http://play.golang.org/p/plpyHIFKCV> to illustrate this point.
>
> As Ian suggests in the Github issue, one should deal with it before
> passing it to the CSV reader. The most naive way would be
>
> f, _ := os.Open(fn)
> rd := make([]byte, 3)
> f.Read(rd)
> if rd[0] == 0xEF && rd[1] == 0xBB && rd[2] == 0xBF {
> // fmt.Println("BOM")
> } else {
> // fmt.Println("no BOM")
> f.Seek(0, 0)
> }
>
> But that's just "UTF-8", no EOFs etc. So my question is - does anyone
> have a BOM stripping procedure in place?

It's a bit unclear what exactly you are asking about.
If you mean "how do I implement a bullet-proof UTF-8 CSV stream reader
which would just ignore the BOM if it exists" then I'd say "use a
bufio.Reader and its Peek() function". I even managed to google a
working example [1] which is almost like your case but deals with
UTF-16 rather than UTF-8; it's super easy to adapt it by peeking
three bytes instead: [2].

The implementation in [2] also uses a technique (which makes it more
general) of checking whether the reader is already an instance of
`*bufio.Reader`, in which case there's no point in wrapping it in another
buffered reader, as the original one can be used right away.

If you wanted to ask something else, please state your goals clearly
(you could use a bulleted checklist).

1. http://play.golang.org/p/zGrNnYRkPF
2. http://play.golang.org/p/V-rApA06al
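
Since the playground contents aren't reproduced here, a rough sketch of the
idea (the skipBOM helper name and the sample input are mine, not the exact
code from [2]):

    package main

    import (
        "bufio"
        "bytes"
        "encoding/csv"
        "fmt"
        "io"
        "strings"
    )

    var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

    // skipBOM returns a reader positioned after a leading UTF-8 BOM, if any.
    // If r is already a *bufio.Reader it is reused instead of being wrapped
    // in another buffered reader.
    func skipBOM(r io.Reader) io.Reader {
        br, ok := r.(*bufio.Reader)
        if !ok {
            br = bufio.NewReader(r)
        }
        b, err := br.Peek(3)
        if err != nil {
            // fewer than three bytes available: nothing to strip,
            // let the caller hit the short read / EOF itself
            return br
        }
        if bytes.Equal(b, utf8BOM) {
            br.Discard(3)
        }
        return br
    }

    func main() {
        in := "\xef\xbb\xbf\"column1\",\"column2\"\nval1,val2\n"
        records, err := csv.NewReader(skipBOM(strings.NewReader(in))).ReadAll()
        fmt.Println(records, err) // [[column1 column2] [val1 val2]] <nil>
    }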

Tamás Gulácsi

Mar 9, 2016, 4:00:43 PM
to golang-nuts
Or
    var b [3]byte
    n, _ := io.ReadFull(r, b[:])
    if !(n == 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) { // not a BOM
        // put the bytes we consumed back in front of the reader
        r = io.MultiReader(bytes.NewReader(b[:n]), r)
    }
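
(A nice property of this variant is that it works on any io.Reader without
committing to bufio; prepending only the n bytes actually read also keeps
inputs shorter than three bytes intact, and the resulting r can be passed
straight to csv.NewReader.)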