using DictReader() with .decode('utf-8', 'ignore')

Vincent Davis

Apr 14, 2015, 9:01:19 AM

to pytho...@python.org

I had been reading in a file like so. (python 3)

with open(dfile, 'rb') as f:
    for line in f:
        line = line.decode('utf-8', 'ignore').split(',')

How can I accomplish decode('utf-8', 'ignore') when reading with DictReader()?

Vincent Davis

720-301-3003

Michiel Overtoom

Apr 14, 2015, 9:17:03 AM

to Python

> How can I accomplish decode('utf-8', 'ignore') when reading with DictReader()?

Have you tried using the csv module in conjunction with codecs?
There shouldn't be any need to 'ignore' characters.

import csv
import codecs

rs = csv.DictReader(codecs.open(fn, "r", "utf8"))
for r in rs:
    print(r)

Greetings,

--
"You can't actually make computers run faster, you can only make them do less." - RiderOfGiraffes

Steven D'Aprano

Apr 14, 2015, 9:23:24 AM

On Tue, 14 Apr 2015 10:54 pm, Vincent Davis wrote:

> I had been reading in a file like so. (python 3)
> with open(dfile, 'rb') as f:
>     for line in f:
>         line = line.decode('utf-8', 'ignore').split(',')
>
> How can I accomplish decode('utf-8', 'ignore') when reading with
> DictReader()?

Which DictReader? Do you mean the one in the csv module? I will assume so.

I haven't tried it, but I think something like this will work:

# untested
import csv

with open(dfile, 'r', encoding='utf-8', errors='ignore', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['fieldname'])
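
For anyone who wants to try the idea without a real data file, here is a minimal self-contained sketch. The helper name read_rows_ignoring_bad_bytes and the sample bytes are made up for illustration; only open() and csv.DictReader() come from the suggestion above.

```python
import csv
import os
import tempfile


def read_rows_ignoring_bad_bytes(path):
    """Return CSV rows as dicts, dropping bytes that are not valid UTF-8."""
    # errors='ignore' discards undecodable bytes; newline='' is what the
    # csv module expects for files it reads.
    with open(path, 'r', encoding='utf-8', errors='ignore', newline='') as f:
        return list(csv.DictReader(f))


# Hypothetical sample data: 0xa2 on its own is not a valid UTF-8 start byte.
raw = b'id,name\r\n994946,WELLS FARGO &\xa2 COMPANY\r\n'
with tempfile.NamedTemporaryFile(delete=False, suffix='.csv') as tmp:
    tmp.write(raw)

rows = read_rows_ignoring_bad_bytes(tmp.name)
os.remove(tmp.name)
print(rows)
```

With this sample the stray byte is simply dropped from the name field, so the row comes back as if the byte had never been in the file.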

--
Steven

Vincent Davis

Apr 14, 2015, 9:38:27 AM

to Steven D'Aprano, pytho...@python.org

> Which DictReader? Do you mean the one in the csv module? I will assume so.

yes.

> # untested
> with open(dfile, 'r', encoding='utf-8', errors='ignore', newline='') as f:
>     reader = csv.DictReader(f)
>     for row in reader:
>         print(row['fieldname'])

What you have seems to work. Now I need to go find my strange symbols that are not valid UTF-8 and see what happens.

I thought that I had to open with 'rb' to use an encoding?

Vincent Davis

Steven D'Aprano

Apr 14, 2015, 9:48:39 AM

No, in Python 3 the rules are:

'rb' reads in binary mode, returns raw bytes without doing any decoding;

'r' reads in text mode, returns Unicode text, using the codec/encoding
specified. If no encoding is specified, the default depends on the platform
(it is whatever locale.getpreferredencoding(False) returns), so it is often
UTF-8 but not always.
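
A small sketch of the difference between the two modes, assuming a throwaway temporary file (the sample text is made up for illustration). It also prints the platform's preferred encoding via locale.getpreferredencoding(False), which is not mentioned above but is the usual way to check what you would get without an explicit encoding.

```python
import locale
import os
import tempfile

# Write a short UTF-8 encoded sample containing one non-ASCII character.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write('caf\u00e9\n'.encode('utf-8'))

# Binary mode: raw bytes, no decoding at all.
with open(tmp.name, 'rb') as f:
    raw = f.read()

# Text mode: bytes are decoded with the encoding you name.
with open(tmp.name, 'r', encoding='utf-8', newline='') as f:
    text = f.read()

os.remove(tmp.name)

print(type(raw).__name__, raw)
print(type(text).__name__, ascii(text))
# What text mode would fall back to on this machine if encoding were omitted.
print(locale.getpreferredencoding(False))
```

The first read gives a bytes object holding the two-byte UTF-8 sequence for the accented letter; the second gives a str with a single character in its place.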

If you are getting decoding errors when reading the file, it is possible
that the file isn't actually UTF-8. One test you can do:

with open(dfile, 'rb') as f:
    for line in f:
        try:
            s = line.decode('utf-8', 'strict')
        except UnicodeDecodeError as err:
            print(err)

If you need help deciphering the errors, please copy and paste them here and
we'll see what we can do.
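
A variation on the same test that may make the errors easier to read: it records the line number, the byte offset within the line, and the offending byte itself, using the start and end attributes of UnicodeDecodeError. This is only a sketch; the function name find_bad_bytes and the sample bytes are invented for illustration, and it works on an in-memory bytes object rather than a file.

```python
def find_bad_bytes(data, encoding='utf-8'):
    """Return (line number, offset, offending bytes) for each undecodable line.

    Strict decoding stops at the first error, so only the first bad byte
    of each line is reported.
    """
    problems = []
    for lineno, line in enumerate(data.splitlines(keepends=True), start=1):
        try:
            line.decode(encoding, 'strict')
        except UnicodeDecodeError as err:
            problems.append((lineno, err.start, line[err.start:err.end]))
    return problems


# Hypothetical sample: line 1 is clean, lines 2 and 3 contain stray bytes.
sample = b'ok line\nWELLS FARGO &\xa2 COMPANY\nOSSO\xac\xacO\n'
for lineno, offset, bad in find_bad_bytes(sample):
    print(lineno, offset, bad)
```

To run it on a real file, read the file in 'rb' mode and pass the resulting bytes in.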

--
Steven

Vincent Davis

Apr 14, 2015, 10:52:37 AM

to pytho...@python.org

On Tue, Apr 14, 2015 at 7:48 AM, Steven D'Aprano <steve+comp....@pearwood.info> wrote:

> with open(dfile, 'rb') as f:
>     for line in f:
>         try:
>             s = line.decode('utf-8', 'strict')
>         except UnicodeDecodeError as err:
>             print(err)
>
> If you need help deciphering the errors, please copy and paste them here and
> we'll see what we can do.

Below are the errors. I knew about these, and I think the correct encoding is windows-1252. I will paste some code and output at the end of this email that prints the offending column in each line. These are very likely errors, so I want to remove them. I am reading this csv into a django sqlite3 db. What is strange to me is that using

with open(dfile, 'r', encoding='utf-8', errors='ignore', newline='')

does not seem to remove these; it seems to correctly save them to the db, which I don't understand.

'utf-8' codec can't decode byte 0xa6 in position 368: invalid start byte
'utf-8' codec can't decode byte 0xac in position 223: invalid start byte
'utf-8' codec can't decode byte 0xa6 in position 1203: invalid start byte
'utf-8' codec can't decode byte 0xa2 in position 44: invalid start byte
'utf-8' codec can't decode byte 0xac in position 396: invalid start byte

import chardet

with open("DATA/ATSDTA_ATSP600.csv", 'rb') as f:
    for line in f:
        code = chardet.detect(line)
        #if code == {'confidence': 0.5, 'encoding': 'windows-1252'}:
        if code != {'encoding': 'ascii', 'confidence': 1.0}:
            print(code)
            # Decode the same raw line four ways and split into columns.
            win = line.decode('windows-1252').split(',') #windows-1252
            norm = line.decode('utf-8', 'ignore').split(',')
            ascii = line.decode('ascii', "ignore").split(',')
            ascii2 = line.decode('ISO-8859-1').split(',')
            # Print only the columns where the decodings disagree.
            for w, n, a, a2 in zip(win, norm, ascii, ascii2):
                if w != n:
                    print(w, n, a, a2)
                    print(win[0])

## Output

{'encoding': 'windows-1252', 'confidence': 0.5}
"¦   " "   " "   " "¦   "
"040543"
{'encoding': 'windows-1252', 'confidence': 0.5}
"LEASE GREGPRU D ¬ETERSPM                 " "LEASE GREGPRU D ETERSPM                 " "LEASE GREGPRU D ETERSPM                 " "LEASE GREGPRU D ¬ETERSPM                 "
"979643"
{'encoding': 'windows-1252', 'confidence': 0.5}
"¦   " "   " "   " "¦   "
"986979"
{'encoding': 'windows-1252', 'confidence': 0.5}
"WELLS FARGO &¢ COMPANY                   " "WELLS FARGO & COMPANY                   " "WELLS FARGO & COMPANY                   " "WELLS FARGO &¢ COMPANY                   "
"994946"
{'encoding': 'windows-1252', 'confidence': 0.5}
OSSOSSO¬¬O         " OSSOSSOO         " OSSOSSOO         " OSSOSSO¬¬O         "
"996535"
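
The same comparison can be reproduced on a single in-memory value, which may be easier to experiment with than the full file. This is a sketch only; the byte string is a hand-built stand-in for one of the columns above, not data taken from the actual CSV.

```python
# Stand-in for one raw column: ASCII text plus the stray byte 0xa2.
raw = b'WELLS FARGO &\xa2 COMPANY'

decoded = {
    'windows-1252': raw.decode('windows-1252'),     # 0xa2 maps to a character
    'utf-8/ignore': raw.decode('utf-8', 'ignore'),  # 0xa2 is dropped
    'ascii/ignore': raw.decode('ascii', 'ignore'),  # 0xa2 is dropped
    'ISO-8859-1': raw.decode('ISO-8859-1'),         # 0xa2 maps to a character
}

for name, text in decoded.items():
    # ascii() keeps the output readable on consoles that cannot print U+00A2.
    print(name, ascii(text))
```

The two single-byte codecs keep the byte as the cent sign (U+00A2), while both 'ignore' decodings silently drop it, which matches the pattern in the output above.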

Vincent Davis
720-301-3003