windows utf8 & lxml

Sayth Renshaw

Dec 20, 2016, 6:54:03 AM

Hi

I have been trying to get a script that works on Mint to work on Windows. The key blocker has been UTF-8 errors, most of which I have solved.

Now, however, for the last error I am trying to overcome, the solution appears to be to use .decode('windows-1252') to correct an ASCII error.

I am using lxml to read my content, and decode is not supported. Are there any known ways to read with lxml and fix Unicode faults?

The key part of my script is

for content in roots:
    utf8_parser = etree.XMLParser(encoding='utf-8')
    fix_ascii = utf8_parser.decode('windows-1252')
    mytree = etree.fromstring(
        content.read().encode('utf-8'), parser=fix_ascii)

Without the added .decode my code looks like

for content in roots:
    utf8_parser = etree.XMLParser(encoding='utf-8')
    mytree = etree.fromstring(
        content.read().encode('utf-8'), parser=utf8_parser)

However doing it in such a fashion returns this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
I found this SO answer for it, http://stackoverflow.com/a/29217546/461887, but cannot seem to implement it with lxml.

Ideas?

Sayth

Sayth Renshaw

Dec 20, 2016, 1:58:29 PM

Possibly I will have to use a different method from lxml, like this:
http://stackoverflow.com/a/29057244/461887

Sayth

Sayth Renshaw

Dec 21, 2016, 4:05:22 AM

Why is Windows so hard? I am sort of running out of ideas; I have tried methods from the docs, SO, etc.

Currently

for xml_data in roots:
    parser_xml = etree.XMLParser()
    mytree = etree.parse(xml_data, parser_xml)

Returns
C:\Users\Sayth\Anaconda3\envs\race\python.exe C:/Users/Sayth/PycharmProjects/bs4race/race.py data/ -e *.xml
Traceback (most recent call last):
  File "C:/Users/Sayth/PycharmProjects/bs4race/race.py", line 100, in <module>
    data_attr(rootObs)
  File "C:/Users/Sayth/PycharmProjects/bs4race/race.py", line 55, in data_attr
    mytree = etree.parse(xml_data, parser_xml)
  File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src\lxml\lxml.etree.c:81110)
  File "src/lxml/parser.pxi", line 1832, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:118109)
  File "src/lxml/parser.pxi", line 1852, in lxml.etree._parseFilelikeDocument (src\lxml\lxml.etree.c:118392)
  File "src/lxml/parser.pxi", line 1747, in lxml.etree._parseDocFromFilelike (src\lxml\lxml.etree.c:117180)
  File "src/lxml/parser.pxi", line 1162, in lxml.etree._BaseParser._parseDocFromFilelike (src\lxml\lxml.etree.c:111907)
  File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:105102)
  File "src/lxml/parser.pxi", line 702, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:106769)
  File "src/lxml/lxml.etree.pyx", line 324, in lxml.etree._ExceptionContext._raise_if_stored (src\lxml\lxml.etree.c:12074)
  File "src/lxml/parser.pxi", line 373, in lxml.etree._FileReaderContext.copyToBuffer (src\lxml\lxml.etree.c:102431)
io.UnsupportedOperation: read

Process finished with exit code 1

Thoughts?

Sayth

Peter Otten

Dec 21, 2016, 4:37:10 AM

I don't think this has anything to do with the OS. Your xml_data is
probably not what you think it is. Compare:

$ python3
Python 3.4.3 (default, Nov 17 2016, 01:08:31)
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> import lxml.etree
>>> lxml.etree.parse(sys.stdout)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 3239, in lxml.etree.parse (src/lxml/lxml.etree.c:69955)
  File "parser.pxi", line 1769, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:102257)
  File "parser.pxi", line 1789, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:102516)
  File "parser.pxi", line 1684, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:101442)
  File "parser.pxi", line 1134, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:97069)
  File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91275)
  File "parser.pxi", line 679, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92426)
  File "lxml.etree.pyx", line 327, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:10196)
  File "parser.pxi", line 373, in lxml.etree._FileReaderContext.copyToBuffer (src/lxml/lxml.etree.c:89083)
io.UnsupportedOperation: not readable

That looks similar to what you get.
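
In both tracebacks the exception is raised when the parser calls read() on an object that is not open for reading. Here is a minimal sketch of that failure using only the standard library (no lxml involved). The helper name can_be_read and the sample file are made up for illustration, and whether xml_data really is a write-only handle in the original script is only a guess.

```python
import io
import os
import tempfile


def can_be_read(source):
    """Hypothetical helper: True if `source` is a file object open for reading."""
    readable = getattr(source, "readable", None)
    return callable(readable) and readable()


# Create a small XML file to experiment with.
with tempfile.NamedTemporaryFile("wb", suffix=".xml", delete=False) as tmp:
    tmp.write(b"<meeting/>")
    path = tmp.name

# A handle opened for appending/writing cannot be read from.
with open(path, "a") as write_only:
    write_ok = can_be_read(write_only)
    try:
        write_only.read()
        error_name = None
    except io.UnsupportedOperation as exc:
        error_name = type(exc).__name__

# A handle opened in binary read mode is something a parser can consume.
with open(path, "rb") as readable_handle:
    read_ok = can_be_read(readable_handle)
    payload = readable_handle.read()

os.remove(path)
```

Checking what each item in the loop actually is (a path string, a handle, and in which mode it was opened) before handing it to the parser is probably the quickest way to narrow this down.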

Deborah Swanson

Dec 21, 2016, 8:39:12 PM

I'm not a beginning python coder, but I'm not an advanced one either. I
can't see why I have this problem, though at this point I've probably
been looking at it too hard and for too long (several days), so maybe
I'm just too close to it.
Can one of you guys see the problem (besides my childish coding)? I'll
give you the code first, and then the problem.

def moving():
    import csv
    ls = []
    with open('E:\\Coding projects\\Pycharm\\Moving\\New Listings.csv', 'r') as infile:
        raw = csv.reader(infile)
        indata = list(raw)
        rows = indata.__len__()
        for i in range(rows):
            ls.append(indata[i])
    # sort: Description only, to make hyperelinks & find duplicates
    mergeSort(ls)
    # find & mark dups, make hyperlink if not dup
    for i in range(1, len(ls) - 1):
        if ls[i][0] == ls[i + 1][0]:
            ls[i][1] = "dup"
        else:
            # make hyperlink
            desc = ls[i][0]
            url = ls[i][1]
            ls[i][0] = '=HYPERLINK(\"' + url + '\",\"' + desc + '\")'
    # save to csv
    ls.insert(0, ["Description","url"])
    with open('E:\\Coding projects\\Pycharm\\Moving\\Moving 2017 out.csv', 'w') as outfile:
        writer = csv.writer(outfile, lineterminator='\n')
        writer.writerows(ls)

import operator
def mergeSort(L, compare = operator.lt):
    if len(L) < 2:
        return L[:]
    else:
        middle = int(len(L)/2)
        left = mergeSort(L[:middle], compare)
        right = mergeSort(L[middle:], compare)
        return merge(left, right, compare)

def merge(left, right, compare):
    result = []
    i,j = 0, 0
    while i < len(left) and j < len(right):
        if compare(left[i], right[j]):
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    while (i < len(left)):
        result.append(left[i])
        i += 1
    while (j < len(right)):
        result.append(right[j])
        j += 1
    return result

moving()

The problem is this: mergeSort puts the list ls in perfect order,
which I can see by looking at result on merge's final return to
mergeSort, and at the left and the right once back in mergeSort. Both
the left half and the right half are in order. But the list L is still
in its original order, and after mergeSort completes, ls is still in its
original order. Maybe there's some bonehead error causing this, but I
just can't see it.

I can provide a sample csv file for input, if you want to execute this,
but to keep things simple, you can see the problem in just a table with
webpage titles in one column and their urls in the second column.

Any insights would be greatly appreciated.

Chris Angelico

Dec 21, 2016, 8:46:16 PM

On Thu, Dec 22, 2016 at 11:55 AM, Deborah Swanson
<pyt...@deborahswanson.net> wrote:
> The problem is this: mergeSort puts the list ls in perfect order,
> which I can see by looking at result on merge's final return to
> mergeSort, and at the left and the right once back in mergeSort. Both
> the left half and the right half are in order. But the list L is still
> in its original order, and after mergeSort completes, ls is still in its
> original order. Maybe there's some bonehead error causing this, but I
> just can't see it.
>

Your analysis is excellent. Here's what happens: When you merge-sort,
you're always returning a new list (either "return L[:]" or "result =
[]"), but then you call it like this:

# sort: Description only, to make hyperelinks & find duplicates
mergeSort(ls)

This calls mergeSort, then drops the newly-sorted list on the floor.
Instead, try: "ls = mergeSort(ls)".
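
The difference between a function that returns a new list and one that sorts in place can be sketched as follows. The merge_sort below is a compact stand-in written for illustration, not the exact code from the post, and the sample listings are made up.

```python
def merge_sort(items):
    """Return a new sorted list; the argument itself is never modified."""
    if len(items) < 2:
        return items[:]
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        # Take from the right only when it is strictly smaller (keeps the sort stable).
        if right[j] < left[i]:
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


listings = [["Zebra St", "url3"], ["Apple Ave", "url1"], ["Maple Rd", "url2"]]

merge_sort(listings)                   # sorted copy is created, then discarded
first_after_bare_call = listings[0][0]

listings = merge_sort(listings)        # rebinding the name keeps the sorted copy
first_after_rebinding = listings[0][0]
```

If other code holds a reference to the same list object, "ls[:] = merge_sort(ls)" replaces the contents in place instead of rebinding the name.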

Thank you for making it so easy for us!

ChrisA

Deborah Swanson

Dec 21, 2016, 9:26:53 PM

"ls = mergeSort(ls)" works perfectly!

I can see why now, but I'm not sure how long I would have knocked my
head against it before I saw it on my own. It must take a while to
develop an eye for these things.

So thank you from the bottom of my heart! I do have a future in python
coding planned, but right now I need to find the cheapest nice little
house to move to, and this sorting problem was a major roadblock! The
webpage titles and urls are from Craigslist, soon to be joined by many
other fields, but I just couldn't get past this one problem.

Andrea D'Amore

Dec 22, 2016, 8:59:12 AM

I know a code review wasn't the main goal of your message, but I feel
it's worth mentioning two tips:

On 22 December 2016 at 01:55, Deborah Swanson <pyt...@deborahswanson.net> wrote:
>     ls = []
>     with open('E:\\Coding projects\\Pycharm\\Moving\\New Listings.csv', 'r') as infile:
>         raw = csv.reader(infile)
>         indata = list(raw)
>         rows = indata.__len__()
>         for i in range(rows):
>             ls.append(indata[i])

This block initializes an empty list, creates a csv.reader, consumes it
all by converting it to a list, then loops over every item in that list
and appends it to the initially created list. The initial list object is
actually useless, since at the end ls and indata will contain exactly
the same objects.
Your code can be simplified to:

with open(your_file_path) as infile:
    ls = list(csv.reader(infile))

Then I see you're looping with an index-based approach, here

>         for i in range(rows):
>             ls.append(indata[i])
[…]

>     # find & mark dups, make hyperlink if not dup
>     for i in range(1, len(ls) - 1):

and in the other functions, basically wherever you use len().

Check Ned Batchelder's "Loop like a native" talk about that; there
are both a webpage and a PyCon talk.
By using "native" looping you'll get simplified code that is more
expressive in fewer lines.
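
As a rough sketch of what "native" looping could look like for the duplicate check, adjacent rows can be paired with zip instead of being addressed by index. The function name mark_duplicates and the sample rows are invented for this example, and unlike the original loop (which starts at index 1 and stops before the last row) it compares every adjacent pair.

```python
def mark_duplicates(rows):
    """Mark a row as "dup" when the next row has the same description.

    Assumes `rows` is already sorted by description (column 0) and that
    column 1 holds the url, as in the original post.
    """
    for current, following in zip(rows, rows[1:]):
        if current[0] == following[0]:
            current[1] = "dup"
    return rows


rows = [
    ["Cozy cabin", "url-a"],
    ["Cozy cabin", "url-b"],
    ["Sunny cottage", "url-c"],
]
mark_duplicates(rows)
```

No len() or range() is needed, and the pair of names makes it clear which two rows are being compared.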

--
Andrea

Stefan Behnel

Dec 26, 2016, 10:56:27 AM

Hi!

Sayth Renshaw wrote on 20.12.2016 at 12:53:
> I have been trying to get a script that works on Mint to work on Windows. The key blocker has been UTF-8 errors, most of which I have solved.
>
> Now, however, for the last error I am trying to overcome, the solution appears to be to use .decode('windows-1252') to correct an ASCII error.
>
> I am using lxml to read my content, and decode is not supported. Are there any known ways to read with lxml and fix Unicode faults?
>
> The key part of my script is
>
> for content in roots:
>     utf8_parser = etree.XMLParser(encoding='utf-8')
>     fix_ascii = utf8_parser.decode('windows-1252')

This looks rather broken. Are you sure this is what your code looks like,
or did you just type this into your email while trying to strip down your
actual code into a simpler example?

>     mytree = etree.fromstring(
>         content.read().encode('utf-8'), parser=fix_ascii)

Note that lxml can parse from Unicode, so once you have decoded your data,
you can just pass it into the parser as is, e.g.

mytree = etree.fromstring(content.decode('windows-1252'))

This is not something I'd encourage since it requires a bit of back and
forth encoding internally and is rather memory inefficient, but if your
decoding is non-trivial, this might still be a viable approach.
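
A minimal sketch of that decode-then-parse approach, under a few assumptions: the sample bytes are invented, they carry no XML declaration (lxml refuses str input that includes an encoding declaration), and the fallback import to the standard library is only there so the snippet runs where lxml is not installed.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: fall back to the stdlib API so the sketch runs without lxml.
    import xml.etree.ElementTree as etree

# Hypothetical sample: bytes as they might sit in a windows-1252 file.
raw = '<horse name="Café Noir"/>'.encode("windows-1252")

# Decode explicitly with the codec the data was really written in ...
text = raw.decode("windows-1252")

# ... then hand the resulting str to the parser as is.
root = etree.fromstring(text)
name = root.get("name")
```

When the file does have a correct XML declaration, passing the raw bytes and letting the parser read the declared encoding is usually the simpler route.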

> Without the added .decode my code looks like
>
> for content in roots:
>     utf8_parser = etree.XMLParser(encoding='utf-8')
>     mytree = etree.fromstring(
>         content.read().encode('utf-8'), parser=utf8_parser)
>
> However doing it in such a fashion returns this error:
>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Same thing as above: I don't see how this error message matches the code
you show here. The exception you get might be a Python 2.x problem in the
first place.

Stefan

wxjm...@gmail.com

Dec 27, 2016, 5:07:33 AM

On Monday, 26 December 2016 at 16:56:27 UTC+1, Stefan Behnel wrote:
>
>
> This is not something I'd encourage since it requires a bit of back and
> forth encoding internally and is rather memory inefficient, but if your
> decoding is non-trivial, this might still be a viable approach.
>

You cannot imagine how I'm laughing.
Poor Python.

jmf

Steve D'Aprano

Dec 27, 2016, 5:46:47 AM

On Tue, 20 Dec 2016 10:53 pm, Sayth Renshaw wrote:

> content.read().encode('utf-8'), parser=utf8_parser)
>
> However doing it in such a fashion returns this error:
>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0:
> invalid start byte

That tells you that the XML file you have is not actually UTF-8.

You have a file that begins with a byte 0xFF. That is invalid UTF-8. No
valid UTF-8 string contains the byte 0xFF.

https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

So you need to consider:

- Are you sure that the input file is intended to be UTF-8? How was it
created?

- Is the second byte 0xFE? If so, that suggests that you actually have
UTF-16 with a byte-order mark.
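
The byte-order-mark check can be sketched like this. The helper name guess_codec_from_bom and the sample document are made up for illustration; a file without a BOM tells you nothing this way, so treat the result as a hint only.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32 LE mark starts with FF FE too.
BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def guess_codec_from_bom(raw):
    """Hypothetical helper: return a codec name if `raw` starts with a BOM, else None."""
    for bom, codec in BOMS:
        if raw.startswith(bom):
            return codec
    return None


# Made-up sample: a little-endian UTF-16 document with a byte-order mark,
# so its first two bytes are 0xFF 0xFE.
sample = codecs.BOM_UTF16_LE + "<meeting/>".encode("utf-16-le")

try:
    sample.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc.reason

codec = guess_codec_from_bom(sample)
text = sample.decode(codec)  # the "utf-16" codec consumes the BOM itself
```

In practice you would read the first few bytes of the file in binary mode ('rb') and pass them to the helper before deciding how to decode or parse the rest.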

--
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.