encoding/csv quote all option

749 views
Skip to first unread message

pythona...@gmail.com

unread,
Oct 21, 2013, 7:31:55 PM10/21/13
to golan...@googlegroups.com
Hi all

I was thinking of adding a quote all option to the encoding/csv package writer.

The reason behind this is some databases and applications (i'm looking at you mssql and excel) need all fields to be quoted when importing csv files or strange results can happen e.g converting telephone number to int's.

The change itself would be minimal, adding a field to the writer called QuoteAll and changing the fieldNeedQuotes to return true if this is set.

If folks would be happy with this I can easily make the code change.

Stay Frosty
Colin Gemmell

twitter: @colin_gemmell


peterGo

unread,
Oct 21, 2013, 10:53:40 PM10/21/13
to golan...@googlegroups.com, pythona...@gmail.com
Colin,

Please provide some test cases that illustrate your problem.

Peter

Klaus Post

unread,
Oct 22, 2013, 3:32:02 AM10/22/13
to golan...@googlegroups.com, pythona...@gmail.com
CSV is rather 'floaty' format.

For Excel, I have had to write  fields as ="Content of Field" to completely avoid formatting of content.

Either way, I just wrote my own code for CSV files, since it is so basic.

A few things I learned:

Remember to add an UTF-8 BOM for correct character encoding; Don't use "sep=;" function, since it for some reason makes Excel revert to Windows ANSI charset - and don't write fields with more than 254 characters, since they will be truncated.

Jan Mercl

unread,
Oct 22, 2013, 3:35:21 AM10/22/13
to Klaus Post, golang-nuts, pythona...@gmail.com
On Tue, Oct 22, 2013 at 9:32 AM, Klaus Post <klau...@gmail.com> wrote:
> A few things I learned:
>
> Remember to add an UTF-8 BOM for correct character encoding...

No.

-j

atomly

unread,
Oct 22, 2013, 10:41:44 AM10/22/13
to Jan Mercl, Klaus Post, golang-nuts, pythona...@gmail.com
 
:: atomly ::

[ ato...@atomly.com : www.atomly.com  : http://blog.atomly.com/ ...
[ atomiq records : new york city : +1.347.692.8661 ...
[ e-mail atomly-new...@atomly.com for atomly info and updates ...

Jan Mercl

unread,
Oct 22, 2013, 10:53:31 AM10/22/13
to atomly, Klaus Post, golang-nuts, pythona...@gmail.com
On Tue, Oct 22, 2013 at 4:41 PM, atomly <ato...@gmail.com> wrote:
> On Tue, Oct 22, 2013 at 3:35 AM, Jan Mercl <0xj...@gmail.com> wrote:
>
>> On Tue, Oct 22, 2013 at 9:32 AM, Klaus Post <klau...@gmail.com> wrote:
>> > A few things I learned:
>> >
>> > Remember to add an UTF-8 BOM for correct character encoding...
>>
>> No.
>
>
> Yes.

Byte order has no meaning in UTF-8. The Unicode Standard is not
requiring or recommending to use it. See also
http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8

-j

PS: Why are you linking to anything Microsoft when Unicode is not a
Microsoft standard? It makes no sense to me. Moreover, the document
mandates ("Always prefix a Unicode plain text file with a byte order
mark...") the opposite of what the actual Unicode Standard says. This
is another example how Microsoft intentionally damages the world of
computing for their own financial interests.

Rob Pike

unread,
Oct 22, 2013, 11:54:57 AM10/22/13
to Jan Mercl, atomly, Klaus Post, golang-nuts, pythona...@gmail.com
From that document: 

"Always prefix a Unicode plain text file with a byte order mark, which informs an application receiving the file that the file is byte-ordered. "

There is enough misleading information in that one sentence to confuse an entire industry. And it succeeded in doing so.

What they want to say but are too misinformed to express clearly is that Windows code uses the first few bytes of the the file to determine the encoding used in the file. Somewhere along the way, a terrible decision was made to use a UTF-8-encoded BOM as the magic number for a UTF-8 file. In so doing the concept of byte order was introduced into a file format that has no byte order and a generation of programmers is now programming around that nonsensical misuse.

And as Jan points out, the Unicode standard says the exact opposite of this document: Don't use BOMs in UTF-8 text.

Disregard this document. It knows not of what it speaks.

-rob

Klaus Post

unread,
Oct 22, 2013, 11:57:22 AM10/22/13
to golan...@googlegroups.com
Anyway, if you want it to work automatically in Excel include a BOM marker. If you don't care about international characters or you have other concerns, don't.

Ibrahim M. Ghazal

unread,
Oct 22, 2013, 2:07:02 PM10/22/13
to Klaus Post, golang-nuts
On Tue, Oct 22, 2013 at 6:57 PM, Klaus Post <klau...@gmail.com> wrote:
> Anyway, if you want it to work automatically in Excel include a BOM marker.
> If you don't care about international characters or you have other concerns,
> don't.
>

I learned the hard way that Excel is even more broken than that. If
you do include a BOM then try to save the same csv from Excel, it
saves it as a "text" document with _tab_ separators instead of commas
by default. If you select CSV from the Save As dialog, it converts all
non-ASCII characters to question marks.

The worst part is that no combination of encoding and BOM will make
CSV files work in Excel in a sane way, see this Stack Overflow
question: http://stackoverflow.com/questions/6588068/which-encoding-opens-csv-files-correctly-with-excel-on-both-mac-and-windows

If you care about non-ASCII characters, I think the best option is to
save it as UTF-8 without BOM and tell the users to open it in a
non-broken program (LibreOffice, Google Docs, etc).

Mandolyte

unread,
Oct 23, 2013, 6:06:20 AM10/23/13
to golan...@googlegroups.com, pythona...@gmail.com
the ="00100" syntax is the only way make excel take a non-numeric information (which happens to be all digits) and avoid dropping the leading zeros. For example, US CAGE codes are all digits and have leading zeros (French CAGE codes begin with an "F", etc.).

I think this would be a nice option for the CSV package.

atomly

unread,
Oct 23, 2013, 10:28:12 AM10/23/13
to Jan Mercl, Klaus Post, golang-nuts, pythonandchips
On Tue, Oct 22, 2013 at 10:53 AM, Jan Mercl <0xj...@gmail.com> wrote:
Byte order has no meaning in UTF-8. The Unicode Standard is not
requiring or recommending to use it. See also
http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8

I'm well aware of that.  How could byte order mean anything in a standard based around single-byte characters? The problem, though, is that Microsoft insists on having a BOM at the beginning to identify it as UTF-8.

PS: Why are you linking to anything Microsoft when Unicode is not a
Microsoft standard? It makes no sense to me. Moreover, the document
mandates ("Always prefix a Unicode plain text file with a byte order
mark...") the opposite of what the actual Unicode Standard says. This
is another example how Microsoft intentionally damages the world of
computing for their own financial interests.

 I'm linking to Microsoft because the topic of this thread is how to properly encode CSV for use with Excel. I'm not advocating for anything or suggesting that anything Microsoft does is sane, but that doesn't mean you don't have to deal with their particular brand of insanity if you are hoping to interface with their products...
Reply all
Reply to author
Forward
0 new messages