DBD::Pg && handling of UTF-8 char columns

Matthias Apitz

unread,

Oct 11, 2019, 4:30:03 PM10/11/19

to dbd...@perl.org

Hi there,

The background of reaching out to you is: We're porting a huge Library
Management System from support of DBS Sybase and Oracle to PostgreSQL.
The software, some 10 million lines of code, is written in all
programming languages one can think of: C, C++, ESQL/C, Perl, Java....

One special problem we face at the moment is how DBD::Pg is handling
UTF-8 strings in the char columns in the database. The PG server is 11.4
on Linux and DBD::Pg is 3.10.0-3.

I connect to the PG server with something like this (for tests):

$dbh = DBI->connect($PGDB, $PGDB_USER, $PGDB_PASS,
{ pg_utf8_flag => 1,
pg_enable_utf8 => 1,
AutoCommit => 0,
RaiseError => 0,
PrintError => 0,
}
);

and do a SELECT for a column which contains UTF-8 data (I double checked
this with SQL and ::bytea):

$sth=$dbh->prepare(
"select d02name from d02ben where d02bnr = '00001048313'")
or die "parse error\n".$DBI::errstr."\n";

$sth->execute
or die "exec error\n".$DBI::errstr."\n";

but when I now fetch the first row with:

@row = $sth->fetchrow_array;
$HexStr = unpack("H*", $row[0]);
print "HexStr: " . $HexStr . "\n";
print "$row[0]\n";

The resulting column contains ISO 8859-1 data:

HexStr: 50e46461676f67697363686520486f6368736368756c65205765696e67617274656e2020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020
P<E4>dagogische Hochschule Weingarten

Following the man page of DBD::Pg the attribute pg_enable_utf8 => 1
should ensure that strings are returned from DBI with the UTF-8 flag
switched on. The server sends the string in UTF-8 as I can see with
strace (see the chars P\303\244dagogische...):

...
recvfrom(3, "T\0\0\0 \0\1d02name\0\0\1\313\237\0\3\0\0\4\22\377\377\0\0\0|\0\0D\0\0\0\203\0\1\0\0\0yP\303\244dagogische Hochschule Weingarten C\0\0\0\rSELECT 1\0Z\0\0\0\5T", 16384, 0, NULL, NULL) = 185
write(1, "HexStr: 50e46461676f67697363686520486f6368736368756c65205765696e67617274656e2020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020\n", 249) = 249
write(1, "P\344dagogische Hochschule Weingarten
...

But why it gets translated to ISO? What do we wrong?

Thanks,

matthias
--
Matthias Apitz, ✉ gu...@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub

Matthias Apitz

unread,

Oct 11, 2019, 5:30:02 PM10/11/19

to Jon Jensen, dbd...@perl.org

El día viernes, octubre 11, 2019 a las 03:02:48p. m. -0600, Jon Jensen escribió:

> On Fri, 11 Oct 2019, Matthias Apitz wrote:
>
> > One special problem we face at the moment is how DBD::Pg is handling
> > UTF-8 strings in the char columns in the database. The PG server is 11.4
> > on Linux and DBD::Pg is 3.10.0-3.

> [snip]

> >
> > The resulting column contains ISO 8859-1 data:
> >
> > HexStr: 50e46461676f67697363686520486f6368736368756c65205765696e67617274656e2020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020
> > P<E4>dagogische Hochschule Weingarten
> >

> > ...

> Hi, Matthias.
>
> That output is from a print statement going to STDOUT, so I wonder, are
> you already telling Perl that you want UTF-8 output with something like
> this at the beginning of your script?
>
> use open qw( :std :utf8 );
>
> Otherwise Perl may be defaulting to writing out Latin-1.

Hi Jon,

Of course the line

> > P<E4>dagogische Hochschule Weingarten

is caused by missing UTF-8 on STDOUT. But, the line with

> > HexStr: 50e464616 ...

is not and shows that the \xe4 is there. Why?

Thanks for your repl in any case.

Matthias Apitz

unread,

Oct 12, 2019, 4:45:03 AM10/12/19

to Jon Jensen, dbd...@perl.org

El día viernes, octubre 11, 2019 a las 04:03:31p. m. -0600, Jon Jensen escribió:

> Perl's internal storage of string data is a little odd. \xe4 is the
> correct Unicode code point as per:
>
> https://en.wikipedia.org/wiki/Latin-1_Supplement_%28Unicode_block%29
>
> It is not UTF-8 encoded, true, but there's no reason Perl internally needs
> to use UTF-8 specifically, and I believe for Latin-1 it does not by
> default. It's a question of in-memory storage and processing (some kind of
> Unicode) vs. input/output (where you want UTF-8).
>
> If your script is configured to send UTF-8 to STDOUT, then I would expect
> that \xe4 will show up as the UTF-8 \xc3\xa4 instead.

I inserted another row into this table, encoded in UTF-8:

pos71=# select d02name from d02ben where d02bnr = '08.05.1945' ;
освобождение

pos71=# select d02name::bytea from d02ben where d02bnr = '08.05.1945' ;
\xd0bed181d0b2d0bed0b1d0bed0b6d0b4d0b5d0bdd0b8d0b520202020202020 ...

If I run this through Perl DBD::Pg:

@row = $sth->fetchrow_array;
$HexStr = unpack("H*", $row[0]);
print "HexStr: " . $HexStr . "\n";
print "$row[0]\n";

binmode(STDOUT, ':encoding(utf8)');
print "after binmode: $row[0]\n";

it gives:

DBI is version 1.642, DBD::Pg is version 3.10.0
client_encoding=UTF8, server_encoding=UTF8
HexStr: 3e41323e313e3634353d38352020202020202020 ...
Wide character in print at ./utf8-01.pl line 66.
освобождение
after binmode: освобождение

and if I add an utf8::encode($row[0]) after the fetch, like:

@row = $sth->fetchrow_array;
utf8::encode($row[0]);

it gives the correkt UTF-8 encoding:

DBI is version 1.642, DBD::Pg is version 3.10.0
client_encoding=UTF8, server_encoding=UTF8
HexStr: d0bed181d0b2d0bed0b1d0bed0b6d0b4d0b5d0bdd0b8d0b520202020202020 ...
освобождение
after binmode: Ð¾ÑÐ²Ð¾Ð±Ð¾Ð¶Ð´ÐµÐ½Ð¸Ðµ

i.e. the array returned by $sth->fetchrow_array does not contain an UTF-8 string.

Why it has to be passed through utf8::encode($row[0]) ?

Thanks

matthias

--
Matthias Apitz, ✉ gu...@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub

3. Oktober! Wir gratulieren! Der Berliner Fernsehturm wird 50
aus: https://www.jungewelt.de/2019/10-02/index.php

signature.asc

Matthias Apitz

unread,

Oct 15, 2019, 2:00:02 AM10/15/19

to Jon Jensen, Matthias Apitz, dbd...@perl.org

El día viernes, octubre 11, 2019 a las 04:03:31p. m. -0600, Jon Jensen escribió:

> > is caused by missing UTF-8 on STDOUT. But, the line with
> >
> >>> HexStr: 50e464616 ...
> >
> > is not and shows that the \xe4 is there. Why?
>

> Perl's internal storage of string data is a little odd. \xe4 is the
> correct Unicode code point as per:
>
> https://en.wikipedia.org/wiki/Latin-1_Supplement_%28Unicode_block%29
>
> It is not UTF-8 encoded, true, but there's no reason Perl internally needs
> to use UTF-8 specifically, and I believe for Latin-1 it does not by
> default. It's a question of in-memory storage and processing (some kind of
> Unicode) vs. input/output (where you want UTF-8).
>
> If your script is configured to send UTF-8 to STDOUT, then I would expect
> that \xe4 will show up as the UTF-8 \xc3\xa4 instead.

The byte \xe4 is not UTF-8. The Unicode Codepoint for the letter

Unicode Character 'LATIN SMALL LETTER A WITH DIAERESIS' (U+00E4)

can be seen here:

http://www.fileformat.info/info/unicode/char/00E4/index.htm

and must be \xc3\xa4. One can see this also on any UNIX shell:

$ echo ä | od -tx1
0000000 c3 a4 0a

and if you convert it to ISO-8859-1 then you will get \xe4:

$ echo ä | iconv -f utf-8 -t iso-8859-1 | od -tx1
0000000 e4 0a
0000002

I learned meanwhile how to dump strings in Perl with Devel::Peek and this
shows for the column coming out of PostgreSQL:

...
Dump $string;

gives for this case:

SV = PVIV(0x386c3d0) at 0x2429050\n REFCNT = 1
FLAGS = (POK,IsCOW,pPOK,UTF8)
IV = 2
PV = 0x39f6aa0 "P\303\244dagogische Hochschule Weingarten"\0 [UTF8 "P\x{e4}dagogische Hochschule Weingarten"]
CUR = 35
LEN = 37
COW_REFCNT = 1

i.e. from the database PG server is coming the code point correctly as
(octal) \303\244 which is the same as \xc3\xa4. And Perl mangles this to

[UTF8 "P\x{e4}dagogische Hochschule Weingarten"]

which is IMHO not correct and causing all this confusion.

We have to deal with this in our perl code. It's not a PostrgreSQL
problem.

Thanks in any case for your attention to this case.

matthias

--
Matthias Apitz, ✉ gu...@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub

signature.asc