
using DictReader() with .decode('utf-8', 'ignore')


Vincent Davis

Apr 14, 2015, 9:01:19 AM
to pytho...@python.org
I had been reading in a file like so (Python 3):

with open(dfile, 'rb') as f:
    for line in f:
        line = line.decode('utf-8', 'ignore').split(',')

How can I accomplish decode('utf-8', 'ignore') when reading with DictReader()?


Vincent Davis 
720-301-3003

Michiel Overtoom

Apr 14, 2015, 9:17:03 AM
to Python

> How can I accomplish decode('utf-8', 'ignore') when reading with DictReader()

Have you tried using the csv module in conjunction with codecs?
There shouldn't be any need to 'ignore' characters.


import csv
import codecs

# codecs.open always opens the underlying file in binary mode and decodes on read
rs = csv.DictReader(codecs.open(fn, "r", "utf8"))
for r in rs:
    print(r)

Greetings,

--
"You can't actually make computers run faster, you can only make them do less." - RiderOfGiraffes

Steven D'Aprano

Apr 14, 2015, 9:23:24 AM
On Tue, 14 Apr 2015 10:54 pm, Vincent Davis wrote:

> I had been reading in a file like so. (python 3)
> with open(dfile, 'rb') as f:
>     for line in f:
>         line = line.decode('utf-8', 'ignore').split(',')
>
> How can I accomplish decode('utf-8', 'ignore') when reading with
> DictReader()


Which DictReader? Do you mean the one in the csv module? I will assume so.

I haven't tried it, but I think something like this will work:


# untested
import csv

with open(dfile, 'r', encoding='utf-8', errors='ignore', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['fieldname'])
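
A self-contained way to try the idea without a real file is sketched below. The sample bytes and the column names `id` and `name` are made up for illustration, and `io.TextIOWrapper` stands in for `open()` because it accepts the same `encoding`, `errors` and `newline` arguments.

```python
import csv
import io

# Hypothetical sample data: a header row plus one row containing the byte 0xA2,
# which is not a valid UTF-8 sequence on its own.
raw = b"id,name\r\n1,WELLS FARGO &\xa2 COMPANY\r\n"

# TextIOWrapper takes the same encoding/errors/newline arguments as open().
with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8",
                      errors="ignore", newline="") as f:
    rows = list(csv.DictReader(f))

# The undecodable byte is silently dropped from the field.
print(rows[0]["name"])
```

Note that errors='ignore' discards the bad bytes without any warning, so it hides the problem rather than explaining it.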



--
Steven

Vincent Davis

Apr 14, 2015, 9:38:27 AM
to Steven D'Aprano, pytho...@python.org

> Which DictReader? Do you mean the one in the csv module? I will assume so.

Yes.

> # untested
> with open(dfile, 'r', encoding='utf-8', errors='ignore', newline='') as f:
>     reader = csv.DictReader(f)
>     for row in reader:
>         print(row['fieldname'])

What you have seems to work. Now I need to go find my strange symbols that are not valid 'utf-8' and see what happens.
I thought that I had to open with 'rb' to use encoding?


Vincent Davis

Steven D'Aprano

Apr 14, 2015, 9:48:39 AM
No, in Python 3 the rules are:

'rb' reads in binary mode, returns raw bytes without doing any decoding;

'r' reads in text mode, returns Unicode text, using the codec/encoding
specified. If no encoding is specified, the default depends on the platform
(it comes from locale.getpreferredencoding(False)); it is often UTF-8, but
not always.
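
The difference between the two modes can be checked with a short sketch like the one below. The temporary file and its contents are invented for the demonstration. The last line prints the locale's preferred encoding, which is what open() has traditionally fallen back to when no encoding is given (newer Python versions and UTF-8 mode may behave differently).

```python
import locale
import os
import tempfile

# Write a small UTF-8 file to a temporary location (hypothetical sample text).
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write("name\ncafé\n".encode("utf-8"))
    path = tmp.name

try:
    with open(path, "rb") as f:
        as_bytes = f.read()   # bytes: no decoding is performed
    with open(path, "r", encoding="utf-8", newline="") as f:
        as_text = f.read()    # str: decoded with the codec given above
finally:
    os.remove(path)

print(type(as_bytes).__name__, type(as_text).__name__)
# The locale's preferred encoding; the value is platform dependent.
print(locale.getpreferredencoding(False))
```

Passing encoding= explicitly, as in the text-mode call here, avoids relying on whatever the platform default happens to be.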


If you are getting decoding errors when reading the file, it is possible
that the file isn't actually UTF-8. One test you can do:

with open(dfile, 'rb') as f:
    for line in f:
        try:
            s = line.decode('utf-8', 'strict')
        except UnicodeDecodeError as err:
            print(err)

If you need help deciphering the errors, please copy and paste them here and
we'll see what we can do.
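
A variation on the same test, sketched below, also records which line failed and which bytes were rejected. The function name `find_bad_lines` and the sample lines are hypothetical. It relies on the `start` and `end` attributes of `UnicodeDecodeError`, and it reports only the first bad sequence on each line.

```python
def find_bad_lines(lines, encoding="utf-8"):
    """Return (line_number, byte_offset, offending_bytes) for each line that fails to decode."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        try:
            line.decode(encoding, "strict")
        except UnicodeDecodeError as err:
            # err.start/err.end delimit the first rejected byte sequence in the line.
            problems.append((lineno, err.start, line[err.start:err.end]))
    return problems


# Hypothetical sample: the third line contains the stray byte 0xA6.
sample = [b"id,name\n", b"1,ok\n", b"2,bad \xa6 byte\n"]
print(find_bad_lines(sample))
```

Because a file opened with 'rb' iterates over bytes lines, the file object can be passed in directly, for example find_bad_lines(f) inside the with block.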



--
Steven

Vincent Davis

Apr 14, 2015, 10:52:37 AM
to pytho...@python.org

On Tue, Apr 14, 2015 at 7:48 AM, Steven D'Aprano <steve+comp....@pearwood.info> wrote:
> with open(dfile, 'rb') as f:
>     for line in f:
>         try:
>             s = line.decode('utf-8', 'strict')
>         except UnicodeDecodeError as err:
>             print(err)
>
> If you need help deciphering the errors, please copy and paste them here and
> we'll see what we can do.

Below are the errors. I knew about these, and I think the correct encoding is windows-1252. I will paste some code and output at the end of this email that prints the offending column in the line. These are very likely errors, so I want to remove them. I am reading this csv into a django sqlite3 db. What is strange to me is that using
"with open(dfile, 'r', encoding='utf-8', errors='ignore', newline='')"
does not seem to remove these; it seems to correctly save them to the db, which I don't understand.


'utf-8' codec can't decode byte 0xa6 in position 368: invalid start byte
'utf-8' codec can't decode byte 0xac in position 223: invalid start byte
'utf-8' codec can't decode byte 0xa6 in position 1203: invalid start byte
'utf-8' codec can't decode byte 0xa2 in position 44: invalid start byte
'utf-8' codec can't decode byte 0xac in position 396: invalid start byte
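
The three bytes named in these messages (0xa6, 0xac, 0xa2) can be examined in isolation, as in the sketch below. It compares how each one decodes as windows-1252 with what is left after decoding as UTF-8 with errors='ignore'. The dictionary name `decoded` is just for illustration.

```python
# Map each reported byte to (windows-1252 text, utf-8 text with errors ignored).
decoded = {}
for byte in (b"\xa6", b"\xac", b"\xa2"):
    as_cp1252 = byte.decode("windows-1252")           # a printable symbol
    as_utf8_ignored = byte.decode("utf-8", "ignore")  # the byte is dropped
    decoded[byte] = (as_cp1252, as_utf8_ignored)
    print(byte, repr(as_cp1252), repr(as_utf8_ignored))
```

Each byte is a legitimate character in windows-1252 but cannot start a UTF-8 sequence, which is what "invalid start byte" is reporting.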

import chardet
with open("DATA/ATSDTA_ATSP600.csv", 'rb') as f:
    for line in f:
        code = chardet.detect(line)
        #if code == {'confidence': 0.5, 'encoding': 'windows-1252'}:
        if code != {'encoding': 'ascii', 'confidence': 1.0}:
            print(code)
        win = line.decode('windows-1252').split(',') #windows-1252
        norm = line.decode('utf-8', 'ignore').split(',')
        ascii = line.decode('ascii', "ignore").split(',')
        ascii2 = line.decode('ISO-8859-1').split(',')
        
        for w, n, a, a2 in zip(win, norm, ascii, ascii2):
            if w != n:
                print(w, n, a, a2)
                print(win[0])

## Output

{'encoding': 'windows-1252', 'confidence': 0.5}
"¦   " "   " "   " "¦   "
"040543"
{'encoding': 'windows-1252', 'confidence': 0.5}
"LEASE GREGPRU D ¬ETERSPM                 " "LEASE GREGPRU D ETERSPM                 " "LEASE GREGPRU D ETERSPM                 " "LEASE GREGPRU D ¬ETERSPM                 "
"979643"
{'encoding': 'windows-1252', 'confidence': 0.5}
"¦   " "   " "   " "¦   "
"986979"
{'encoding': 'windows-1252', 'confidence': 0.5}
"WELLS FARGO &¢ COMPANY                   " "WELLS FARGO & COMPANY                   " "WELLS FARGO & COMPANY                   " "WELLS FARGO &¢ COMPANY                   "
"994946"
{'encoding': 'windows-1252', 'confidence': 0.5}
OSSOSSO¬¬O         " OSSOSSOO         " OSSOSSOO         " OSSOSSO¬¬O         "
"996535"

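If the file really is windows-1252 and the stray symbols should be removed, one possible approach is to decode with that encoding and then strip non-ASCII characters from each field. The sketch below assumes that every non-ASCII character is unwanted, which may not hold for real data. The helper `strip_non_ascii`, the sample bytes and the column names are all hypothetical.

```python
import csv
import io


def strip_non_ascii(text):
    """Drop every character outside the ASCII range from a string."""
    return text.encode("ascii", "ignore").decode("ascii")


# Hypothetical sample row modelled on the output above; 0xA2 is the cent sign
# in windows-1252.
raw = b'id,name\r\n"994946","WELLS FARGO &\xa2 COMPANY"\r\n'

with io.TextIOWrapper(io.BytesIO(raw), encoding="windows-1252", newline="") as f:
    rows = [
        # Only string values are cleaned; anything else is passed through as-is.
        {key: strip_non_ascii(value) if isinstance(value, str) else value
         for key, value in row.items()}
        for row in csv.DictReader(f)
    ]

print(rows[0])
```

Decoding with the encoding the file actually uses keeps the filtering step explicit, so it is easy to log or review what gets removed before the rows go into the database.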

Vincent Davis
720-301-3003