Invalid Character in CSV Import

Michael Howden

Jun 6, 2011, 8:14:20 AM
to sahan...@googlegroups.com

Hey All,

Invalid characters cause the CSV import to fail at s3xml.py ln:1501:

text = cls.xml_encode(unicode(text.decode("utf-8")))
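For readers who have not hit this before, here is a minimal sketch of the kind of failure described, written for Python 3 rather than the Python 2 code quoted from s3xml.py. The sample string and the choice of Windows-1252 as the source encoding are assumptions for illustration, not taken from the actual file that failed.

```python
# Bytes for "café" as a Windows-1252 export would write them (assumed sample).
raw = "café".encode("cp1252")

try:
    # Decoding non-UTF-8 bytes as UTF-8 is what makes the import fail.
    raw.decode("utf-8")
    failed = False
except UnicodeDecodeError as error:
    failed = True
    print("decode failed:", error)
```

The byte for "é" in Windows-1252 is not a valid UTF-8 sequence on its own, so the decode step raises before any XML encoding happens.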

 

Any suggestions on the best method to “clean” invalid characters from a CSV file before it is imported? Is it legitimate to convert them all to _, or to remove them?

Cheers

Michael

Dominic König

Jun 6, 2011, 6:48:06 PM
to sahan...@googlegroups.com
Hmm--

modifying input data due to encoding issues is generally a bad idea. I am very
reluctant to do that.

Generally, the input should be UTF-8 or plain ASCII to be on the safe side
with the Python csv module. However, I'm aware that Windows applications often
don't do proper UTF-8 encoding, so we probably have to build a UTF-8 re-encoding
step into the reader.

Let me add that and then we try again.
Could you please send me the respective source so that I can test for the
issue?

Dominic


Dominic König

Jun 8, 2011, 8:17:24 AM
to sahan...@googlegroups.com
Done --

made "csv2tree" guess the character encoding of the source and re-encode it
as UTF-8 before import.

You can simply add the encodings you want to support. Keep this list short,
though, with only the most likely encodings - we do not really want or need to
guess through all possible values here (otherwise we should use chardet).

Encoding all source files properly as UTF-8 is still the best option.
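The approach described above can be sketched roughly as follows. This is not the actual "csv2tree" code: the names CANDIDATE_ENCODINGS, to_utf8 and read_rows are hypothetical, the encoding list is an assumed example, and the sketch targets Python 3.

```python
import csv
import io

# Assumed short list of likely encodings, tried in order. "latin-1" accepts
# any byte sequence, so if it is listed at all it must come last.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def to_utf8(raw, encodings=CANDIDATE_ENCODINGS):
    """Return (UTF-8 bytes, encoding used) for the first encoding that decodes."""
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # not this encoding, try the next candidate
        return text.encode("utf-8"), encoding
    raise ValueError("source does not match any supported encoding")


def read_rows(raw):
    """Re-encode the source as UTF-8, then parse it with the csv module."""
    utf8_bytes, _ = to_utf8(raw)
    stream = io.StringIO(utf8_bytes.decode("utf-8"), newline="")
    return list(csv.reader(stream))
```

Trying UTF-8 first matters because valid UTF-8 also decodes without error under the single-byte encodings, only to the wrong characters. A guess like this cannot tell similar single-byte encodings apart, which is one more reason to prefer properly encoded UTF-8 sources.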

Dominic


Michael Howden

Jun 10, 2011, 6:09:14 AM
to sahan...@googlegroups.com
Hey Dominic,

Thanks a lot - all seems to work fine now!

Cheers

Michael
