Invalid Character in CSV Import

Michael Howden

Jun 6, 2011, 8:14:20 AM

to sahan...@googlegroups.com

Hey All,

Invalid characters cause the CSV import to fail:

(s3xml.py ln:1501 - text = cls.xml_encode(unicode(text.decode("utf-8"))))

Any suggestions on the best method to clean invalid characters from a CSV file before it is imported? Is it legitimate to convert them all to _, or to remove them?
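
To make the two options concrete, here is a minimal sketch in Python 3 (the code referenced above is Python 2, so this is an illustration rather than the s3xml.py code). The sample bytes are a made-up example of non-UTF-8 input; the names `raw`, `removed` and `underscored` are hypothetical.

```python
# Hypothetical sample: "café" encoded as Latin-1, which is not valid UTF-8.
raw = "café".encode("latin-1")  # b'caf\xe9'

try:
    raw.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    # A strict UTF-8 decode rejects the stray byte, which is the kind of
    # failure described above.
    failed = True

# Option 1: drop the undecodable bytes entirely.
removed = raw.decode("utf-8", errors="ignore")

# Option 2: substitute the Unicode replacement character, then map it to "_".
underscored = raw.decode("utf-8", errors="replace").replace("\ufffd", "_")
```

Both options lose information: the "é" in the sample cannot be recovered afterwards.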

Cheers

Michael

Dominic König

Jun 6, 2011, 6:48:06 PM

to sahan...@googlegroups.com

Hmm--

Modifying input data due to encoding issues is generally a bad idea. I am very
reluctant to do that.

Generally, the input should be UTF-8 or plain ASCII to be on the safe side
with the Python csv module. However, I'm aware that Windows applications often
don't do proper UTF-8 encoding, so we probably have to build a UTF-8 encoder
into the reader.
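
As a rough illustration of what such a reader could look like, here is a minimal sketch in Python 3, where the csv module works on text, so the "encoder" becomes a decoding wrapper around the byte stream. The function name `read_rows` and the default source encoding `cp1252` are assumptions for the example, not part of the actual importer.

```python
import csv
import io

def read_rows(binary_stream, source_encoding="cp1252"):
    """Read CSV rows from a byte stream written in a known, non-UTF-8 encoding."""
    # newline="" lets the csv module handle line endings itself.
    text = io.TextIOWrapper(binary_stream, encoding=source_encoding, newline="")
    rows = list(csv.reader(text))
    # Detach so the caller's stream is not closed along with the wrapper.
    text.detach()
    return rows

# Hypothetical input, as a Windows application might have saved it.
sample = "name,city\r\nZoë,Zürich\r\n".encode("cp1252")
rows = read_rows(io.BytesIO(sample))
```

The rows come back as ordinary Python strings, which can then be written out or passed on as UTF-8 without touching the data itself.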

Let me add that and then we try again.
Could you please send me the source file in question so that I can test for the
issue?

Dominic

Dominic König

Jun 8, 2011, 8:17:24 AM

to sahan...@googlegroups.com

Done --

I made "csv2tree" guess the character encoding of the source and re-encode it
as UTF-8 before import.

You can simply add the encodings you want to support. Keep this list short,
though, with only the most likely encodings - we do not really want or need to
guess through all possible values here (otherwise we should use chardet).
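
The general idea can be sketched as follows in Python 3. This is not the actual csv2tree code: the function name `to_utf8` and the candidate list are assumptions chosen for illustration.

```python
# Hypothetical short list of likely encodings; order matters, so the
# strictest candidate (UTF-8) is tried first.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252")

def to_utf8(raw, encodings=CANDIDATE_ENCODINGS):
    """Return (utf8_bytes, detected_encoding) for the first encoding that fits."""
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # Not this encoding, try the next candidate.
        return text.encode("utf-8"), encoding
    raise ValueError("none of the candidate encodings could decode the source")

# Hypothetical source saved by a Windows application.
data, detected = to_utf8("Zürich".encode("cp1252"))
```

Note that a permissive encoding such as latin-1 accepts any byte sequence, so it would only make sense as the last entry in the list, if at all.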

Encoding all source files properly as UTF-8 is still the best option.

Dominic

Michael Howden

Jun 10, 2011, 6:09:14 AM

to sahan...@googlegroups.com

Hey Dominic,

Thanks a lot - all seems to work fine now!

Cheers

Michael
