unicode troubles and postgres

Ethan Furman

Jan 9, 2014, 1:49:27 PM

to Python

So I'm working with postgres, and I get a datadump which I try to restore to my test system, and I get this:

ERROR: value too long for type character varying(4)
CONTEXT: COPY res_currency, line 32, column symbol: "руб"

"py6" sure looks like it should fit, but it don't. Further investigation revealed that "py6" is made up of the bytes
d1 80 d1 83 d0 b1.

Any ideas on what that means, exactly?

--
~Ethan~

Peter Otten

Jan 9, 2014, 2:50:41 PM

to pytho...@python.org

It may look like the ASCII "py6", but you have three Cyrillic letters:

>>> import unicodedata as ud
>>> [ud.name(c) for c in u"руб"]
['CYRILLIC SMALL LETTER ER', 'CYRILLIC SMALL LETTER U', 'CYRILLIC SMALL LETTER BE']

The bytes you are seeing are the UTF-8 encoding of those letters:

>>> u"руб".encode("utf-8")
'\xd1\x80\xd1\x83\xd0\xb1'

So postgres may be storing the string as utf-8.
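
The same check can be run starting from the raw bytes in the error report. This is a minimal Python 3 sketch (the session above is Python 2), assuming the bytes really are UTF-8; a strict decode raises UnicodeDecodeError if they are not.

```python
import unicodedata

# Bytes reported in the original post, written as hex pairs.
raw = bytes.fromhex("d1 80 d1 83 d0 b1")

# Assumption: the bytes are UTF-8. A strict decode fails loudly otherwise.
text = raw.decode("utf-8")

# Look up the official Unicode name of every decoded character.
names = [unicodedata.name(ch) for ch in text]

print(text)
for ch, name in zip(text, names):
    print(f"U+{ord(ch):04X} {name}")
```

Six bytes decode to three characters, two bytes each.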

Ethan Furman

Jan 9, 2014, 2:51:21 PM

to pytho...@python.org

On 01/09/2014 10:49 AM, Ethan Furman wrote:
> So I'm working with postgres, and I get a datadump which I try to restore to my test system, and I get this:
>
> ERROR: value too long for type character varying(4)
> CONTEXT: COPY res_currency, line 32, column symbol: "руб"
>
> "py6" sure looks like it should fit, but it don't. Further investigation revealed that "py6" is made up of the bytes
> d1 80 d1 83 d0 b1.
>
> Any ideas on what that means, exactly?

For the curious, it means CYRILLIC SMALL LETTER ER, CYRILLIC SMALL LETTER U, CYRILLIC SMALL LETTER BE in utf-8 format.

The problem was I had created the database from template0 instead of template1, and 0 is SQL-ASCII while 1 is UTF8.
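
A rough illustration of why the database encoding matters here. As far as I understand it, a SQL_ASCII database treats every byte as one character, so the varchar(4) limit ends up being compared against the byte count, while a UTF8 database counts characters. The sketch below only reproduces that arithmetic in Python 3; it does not talk to postgres, and the variable names are made up for the example.

```python
symbol = "руб"
limit = 4  # from "character varying(4)" in the error message

# Length as a UTF8 database would see it: one unit per character.
char_count = len(symbol)

# Length as a byte-oriented (SQL_ASCII) database would see it.
byte_count = len(symbol.encode("utf-8"))

print("characters:", char_count, "fits:", char_count <= limit)
print("bytes:     ", byte_count, "fits:", byte_count <= limit)
```

Three characters fit in varchar(4); six bytes do not, which matches the error.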

--
~Ethan~

Chris Angelico

Jan 9, 2014, 3:56:38 PM

to pytho...@python.org

On Fri, Jan 10, 2014 at 6:51 AM, Ethan Furman <et...@stoneleaf.us> wrote:
> The problem was I had created the database from template0 instead of
> template1, and 0 is SQL-ASCII while 1 is UTF8.

Ah, this is one of the traps with Postgres. This is one of the reasons
I prefer not to touch template[01] and to script the initialization of
a new database - any config changes are committed to source control as
part of that script.
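
A scripted initialization along those lines might start with something like the sketch below. It only builds the CREATE DATABASE statement with the encoding and template spelled out, so nothing depends on what a template happens to contain. The helper name create_database_sql and the database name testdb are hypothetical, and the statement still has to be run through whatever client you use; locale settings may also need to be given explicitly on some setups.

```python
import re

# Plain, unquoted SQL identifiers only, so interpolation below is safe.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_ENCODING = re.compile(r"[A-Za-z0-9_]+")


def create_database_sql(name, encoding="UTF8", template="template0"):
    """Build a CREATE DATABASE statement with an explicit encoding.

    Hypothetical helper for illustration; it returns SQL text and
    does not connect to any server.
    """
    for ident in (name, template):
        if not _IDENT.fullmatch(ident):
            raise ValueError(f"unsafe identifier: {ident!r}")
    if not _ENCODING.fullmatch(encoding):
        raise ValueError(f"unsafe encoding name: {encoding!r}")
    return (
        f"CREATE DATABASE {name} "
        f"WITH TEMPLATE {template} ENCODING '{encoding}';"
    )


print(create_database_sql("testdb"))
```

Keeping this in source control means the encoding is a reviewed choice rather than an accident of which template was copied.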

ChrisA

wxjm...@gmail.com

Jan 10, 2014, 5:44:57 AM

to

When one has to face such a characteristic sequence,
the first thing to do is to think "utf-8".

(Not a proof)

>>> a = list(range(0x0410, 0x0415))
>>> a += list(range(0x0440, 0x0445))
>>> a += list(range(0x0480, 0x0485))
>>> import unicodedata as ud
>>> for i in a:
...     hex(i), chr(i).encode('utf-8'), ud.name(chr(i))
...
('0x410', b'\xd0\x90', 'CYRILLIC CAPITAL LETTER A')
('0x411', b'\xd0\x91', 'CYRILLIC CAPITAL LETTER BE')
('0x412', b'\xd0\x92', 'CYRILLIC CAPITAL LETTER VE')
('0x413', b'\xd0\x93', 'CYRILLIC CAPITAL LETTER GHE')
('0x414', b'\xd0\x94', 'CYRILLIC CAPITAL LETTER DE')
('0x440', b'\xd1\x80', 'CYRILLIC SMALL LETTER ER')
('0x441', b'\xd1\x81', 'CYRILLIC SMALL LETTER ES')
('0x442', b'\xd1\x82', 'CYRILLIC SMALL LETTER TE')
('0x443', b'\xd1\x83', 'CYRILLIC SMALL LETTER U')
('0x444', b'\xd1\x84', 'CYRILLIC SMALL LETTER EF')
('0x480', b'\xd2\x80', 'CYRILLIC CAPITAL LETTER KOPPA')
('0x481', b'\xd2\x81', 'CYRILLIC SMALL LETTER KOPPA')
('0x482', b'\xd2\x82', 'CYRILLIC THOUSANDS SIGN')
('0x483', b'\xd2\x83', 'COMBINING CYRILLIC TITLO')
('0x484', b'\xd2\x84', 'COMBINING CYRILLIC PALATALIZATION')
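
The "think utf-8" reflex can be turned into a quick check. This is a heuristic sketch rather than a proof: it reports True when the bytes decode cleanly as UTF-8 and contain at least one multi-byte sequence. The function name looks_like_utf8 is made up for the example.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: valid UTF-8 that contains at least one multi-byte character."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # Fewer characters than bytes means some character took 2+ bytes.
    return len(text) < len(data)


print(looks_like_utf8(bytes.fromhex("d180d183d0b1")))  # bytes from the error
print(looks_like_utf8(b"py6"))                         # plain ASCII lookalike
print(looks_like_utf8("руб".encode("koi8-r")))         # a legacy Cyrillic encoding
```

Pure ASCII is deliberately reported as False, since it is valid in almost every encoding and says nothing about which one is in use.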

jmf