UTF-16 encoding line breaks?

Richard

Jun 11, 2003, 8:36:50 AM

Hi,

I have a script which uses the .encode('UTF-16') function to encode a string
into UTF-16. However I am having difficulties in putting line breaks into
that string. \n is what I normally use but does not appear to become valid
UTF-16 once encoded. Can anyone tell me what escape command I can use in my
string to ensure that I get line breaks in my UTF-16 encoded output?

Thanks

Richard

Isaac To

Jun 11, 2003, 12:48:23 PM

>>>>> "Richard" == Richard <rich...@hmgcc.gov.uk> writes:

Richard> Hi, I have a script which uses the .encode('UTF-16') function
Richard> to encode a string into UTF-16. However I am having
Richard> difficulties in putting line breaks into that string. \n is
Richard> what I normally use but does not appear to become valid UTF-16
Richard> once encoded. Can anyone tell me what escape command I can use
Richard> in my string to ensure that I get line breaks in my UTF-16
Richard> encoded output?

Why bother encoding something to UTF-16 before adding the line breaks?
UTF-16 is an awkward format to work with once the data is encoded. For
example, you have to detect the byte order of the string after it is
encoded, since the implementation is free to use either byte ordering. It
can also contain surrogate pairs, which means two 16-bit code units can
represent a single UCS-4 character. So basically, if you want to operate on
it, don't encode it yet, or decode it first.
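
To illustrate, here is a minimal sketch of adding the line breaks first and
encoding last. It assumes a current Python 3 interpreter (not the Python 2.2
of this thread), where plain string literals are already Unicode; all
variable names are made up for the example.

```python
import codecs

# Add the line breaks while the data is still a text string.
text = "abc\ndef\n"

# Plain "utf-16" prepends a byte order mark (BOM) and uses the
# platform's native byte order.
encoded = text.encode("utf-16")
has_bom = encoded[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Naming the byte order explicitly produces no BOM, so the result is
# the same on every platform.
newline_le = "\n".encode("utf-16-le")   # the code unit 0x000A, low byte first

# A character outside the Basic Multilingual Plane needs a surrogate
# pair, that is, two 16-bit code units.
clef = "\U0001D11E"
clef_units = len(clef.encode("utf-16-le")) // 2

# Decoding gives back the original text, line breaks included.
round_trip = encoded.decode("utf-16")
```

The encoded bytes are best treated as opaque: anything that needs to look at
characters or lines should work on `text` or `round_trip`, not on `encoded`.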

Regards,
Isaac.

Martin v. Löwis

Jun 11, 2003, 2:33:43 PM

"Richard" <rich...@hmgcc.gov.uk> writes:

> I have a script which uses the .encode('UTF-16') function to encode a string
> into UTF-16. However I am having difficulties in putting line breaks into
> that string. \n is what I normally use but does not appear to become valid
> UTF-16 once encoded.

Can you demonstrate that? It works fine for me.

Regards,
Martin

Chris Reedy

Jun 11, 2003, 3:47:03 PM

Here's an example that, if it's not an outright error, at least confuses me:

Python 2.2.2 (#37, Oct 14 2002, 17:02:34) [MSC 32 bit (Intel)] on win32
IDLE 0.8 -- press F1 for help
>>> import codecs
>>> testfile = codecs.open('testfile.txt', mode='wb', encoding='utf16')
>>> testfile.write('abc\r\ndef\r\n')
>>> testfile.close()
>>> testfile = codecs.open('testfile.txt', encoding='utf16')
>>> testfile.read(100)
u'abc\r\ndef\r\n'
>>> testfile.close()

Everything's ok so far.

>>> testfile = codecs.open('testfile.txt', encoding='utf16')
>>> testfile.readlines()
[u'abc\r\n', u'def\r\n']
>>> testfile.close()

That looks fine.

>>> testfile = codecs.open('testfile.txt', encoding='utf16')
>>> testfile.readline()
Traceback (most recent call last):
  File "<pyshell#27>", line 1, in ?
    testfile.readline()
  File "C:\DD\Python\lib\codecs.py", line 330, in readline
    return self.reader.readline(size)
  File "C:\DD\Python\lib\codecs.py", line 252, in readline
    return self.decode(line, self.errors)[0]
UnicodeError: UTF-16 decoding error: truncated data
>>>

Huh?

All explanations gratefully appreciated, Chris


Martin v. Löwis

Jun 11, 2003, 5:21:00 PM

Chris Reedy <cre...@mitretek.org> writes:

> >>> testfile = codecs.open('testfile.txt', mode='wb', encoding='utf16')
> >>> testfile.write('abc\r\ndef\r\n')

That's an error: you should write Unicode objects to a file opened
by codecs.open. So use

testfile.write(u'abc\r\ndef\r\n')

instead.
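
A rough modern analogue of the same rule, as a sketch only: it assumes
Python 3 and uses io.TextIOWrapper over an in-memory buffer as a stand-in
for the file returned by codecs.open. In Python 3 the text/bytes mix-up is
rejected outright instead of being converted implicitly.

```python
import io

# In-memory stand-in for a file opened with an encoding.
buf = io.BytesIO()
writer = io.TextIOWrapper(buf, encoding="utf-16", newline="")

# Text (Unicode) goes in; the wrapper does the UTF-16 encoding.
writer.write("abc\r\ndef\r\n")
writer.flush()
raw = buf.getvalue()  # BOM followed by two bytes per character

# Writing a byte string to the encoding wrapper is an error.
try:
    writer.write(b"abc\r\ndef\r\n")
    rejected = False
except TypeError:
    rejected = True
```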

> UnicodeError: UTF-16 decoding error: truncated data

Yes, .readline does not work on a UTF-16 stream. In Python 2.3,
you get

NotImplementedError: .readline() is not implemented for UTF-16

Contributions are welcome.
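
Until then, one possible workaround is to decode the whole file and split
the decoded text into lines afterwards, which is consistent with read() and
readlines() working in the session above. The sketch below assumes Python 3
and uses the built-in open() instead of codecs.open; read_utf16_lines is a
hypothetical helper name, and the file is created in a temporary directory.

```python
import os
import tempfile


def read_utf16_lines(path):
    """Decode the whole file, then split the decoded text into lines."""
    # newline="" turns off newline translation, so "\r\n" survives as is.
    with open(path, encoding="utf-16", newline="") as f:
        text = f.read()
    # True keeps the line endings, like readlines() does.
    return text.splitlines(True)


path = os.path.join(tempfile.mkdtemp(), "testfile.txt")
with open(path, "w", encoding="utf-16", newline="") as f:
    f.write("abc\r\ndef\r\n")

lines = read_utf16_lines(path)
```

This reads the entire file into memory, so it only suits files of moderate
size.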

Regards,
Martin