What is 'client_encoding'?


Ron Savage

Aug 20, 2008, 3:01:21 AM
to List - Rose::DB::Object
Hi Folks

I'm using DBD::Pg V 2.9.0 and Rose::DB::Object V 0.769.

I created a table with no special attributes for the table nor columns,
and when trying to store the country name "Cote D'ivoire" (with a '^'
over the first 'o' of course!) I get the unsurprising message:

<==><8==>
DBD::Pg::st execute failed: ERROR: invalid byte sequence for encoding
"UTF8": 0xd474
HINT: This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding". at /usr/local/share/perl/5.10.0/Rose/DB/Object.pm
line 1105, <DATA> line 22205.
insert() - DBD::Pg::st execute failed: ERROR: invalid byte sequence for
encoding "UTF8": 0xd474
HINT: This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding". at /usr/local/share/perl/5.10.0/Rose/DB/Object.pm
line 1105, <DATA> line 22205.
at populate.countries.pl line 64
<==><8==>
but who exactly is outputting the 'client_encoding' part of the message?

I can't find it in the source of DBI, DBD::Pg, or Rose, so I guess it's
coming from the PostgreSQL server.

Yet, I can't find any reference to it in the indexes of:
The PostgreSQL Reference Manual
Vols 1 & 2.

Any ideas?
--
Ron Savage
r...@savage.net.au
http://savage.net.au/index.html


Cees Hek

Aug 20, 2008, 3:37:17 AM
to rose-db...@googlegroups.com

Hi Ron,

It is a PostgreSQL value that you can change per connection.

SET CLIENT_ENCODING TO 'value';

SHOW client_encoding;

More info on the following docs page:

http://www.postgresql.org/docs/8.1/static/multibyte.html

What is happening is that postgresql is expecting the data to be in
the given character set (run SHOW client_encoding to find out what
that is), but you are providing your data in another character set.

You should also know that client_encoding has nothing to do with what
encoding your database is in. If you are using postgresql 8.x then
your database is probably in UTF8. That doesn't mean you have to
provide all your data in UTF8. The point of the client_encoding
setting is that you can provide the data in any encoding you like, and
PostgreSQL will do the work of converting it to UTF8 for storage. It
also converts it back to your chosen client_encoding when you pull the
data back out.

Most likely you can fix your problem by calling SET CLIENT_ENCODING TO
'LATIN1'; before your insert. Or to make it global, you can set an
environment variable to do the same thing: $ENV{PGCLIENTENCODING} =
'LATIN1'. That is, if your data is currently in Latin-1 (also known as
ISO 8859-1 or ECMA 94), which it most likely is.
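
In code, that looks something like this (the DSN and credentials here
are placeholders, not anything from Ron's setup):

```perl
use strict;
use warnings;
use DBI;

# Global: every new connection tells the server our data is Latin-1.
$ENV{PGCLIENTENCODING} = 'LATIN1';

my $dbh = DBI->connect('dbi:Pg:dbname=mydb', 'user', 'secret',
                       { RaiseError => 1 });

# Or per connection, after connecting:
$dbh->do(q{SET CLIENT_ENCODING TO 'LATIN1'});

# PostgreSQL now converts Latin-1 to the server encoding on the way in,
# and back to Latin-1 on the way out.
```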

Cheers,

Cees

Ron Savage

Aug 20, 2008, 3:57:48 AM
to rose-db...@googlegroups.com
Hi Cees

> > Yet, I can't find any reference to it in the indexes of:
> > The PostgreSQL Reference Manual
> > Vols 1 & 2.

I've ordered V 3 via Amazon...

> Hi Ron,
>
> It is a PostgreSQL value that you can change per connection.
>
> SET CLIENT_ENCODING TO 'value';
>
> SHOW client_encoding;

It's UTF8.

> More info on the following docs page:
>
> http://www.postgresql.org/docs/8.1/static/multibyte.html

I'm using V 8.2.5.

> What is happening is that postgresql is expecting the data to be in
> the given character set (run SHOW client_encoding to find out what
> that is), but you are providing your data in another character set.

Hmm. I get the error with or without 'use utf8;' in my program.

> You should also know that client_encoding has nothing to do with what
> encoding your database is in. If you are using postgresql 8.x then
> your database is probably in UTF8. That doesn't mean you have to
> provide all your data in UTF8. The point of the client_encoding
> setting is that you can provide the data in any encoding you like, and
> PostgreSQL will do the work of converting it to UTF8 for storage. It
> also converts it back to your chosen client_encoding when you pull the
> data back out.
>
> Most likely you can fix your problem by calling SET CLIENT_ENCODING TO
> 'LATIN1'; before your insert. Or to make it global, you can set an
> environment variable to do the same thing: $ENV{PGCLIENTENCODING} =
> 'LATIN1'. That is if your data is currently in latin1 (also known as
> ISO 8859-1 or ECMA 94) , which it most likely is.

Setting $ENV{} as you suggest works, with and without 'use utf8;' in the
code.

Beautiful.

$many x $thanx;

I owe myself, errr, you a beer.

Miika Pekkarinen

Aug 20, 2008, 4:12:05 AM
to rose-db...@googlegroups.com
Hi,

I got the encoding working nice by doing some deflate and inflate
triggers. Here is my RoseSQL.pm -module that is used to auto-generate
modules from the db and loading the pre-generated modules. I hope this
might be useful for someone.

package RoseSQL;

use strict;
use warnings;
use Rose::DB::Object::Loader;
use RoseSQL::DB;
use Encode;

our $moduledirname = 'dbmodules';

sub new
{
    my ($proto, $verbose) = @_;
    my $class = ref($proto) || $proto;
    my $this  = {
        verbose => $verbose,
    };

    $| = 1 if $verbose;

    # Find the module dir
    for my $dir ($moduledirname, "src/$moduledirname", "../src/$moduledirname")
    {
        if (-d $dir)
        {
            $this->{moduledir} = $dir;
            push @INC, $dir;
            last;
        }
    }

    die "Module directory not found!" unless $this->{moduledir};

    return bless($this, $class);
}

sub build_modules
{
    my ($this) = @_;

    my $moduledir = $this->{moduledir};
    die "Incorrect moduledir: $moduledir" unless length($moduledir) > 5;

    print STDERR "Removing old modules..." if $this->{verbose};
    `find $moduledir -name '*.pm' -type f -exec rm -f {} \\;`;
    print STDERR " done.\n" if $this->{verbose};

    print STDERR "Building Rose::DB::Object modules..." if $this->{verbose};

    my $loader = Rose::DB::Object::Loader->new(
        db           => RoseSQL::DB->new,
        # db_class   => 'RoseSQL::DB',
        class_prefix => 'RoseSQL',
        base_class   => 'RoseSQL::DBO',
    );

    $loader->module_dir($this->{moduledir});
    $loader->make_modules;
    print STDERR " done.\n\n" if $this->{verbose};
}

# Storing to the database: make sure the utf8 flag is on
sub _deflate_utf8
{
    my ($this, $value) = @_;

    if (!Encode::is_utf8($value))
    {
        $value = Encode::decode_utf8($value);
    }

    return $value;
}

# Reading from the database: make sure the utf8 flag is on
sub _inflate_utf8
{
    my ($this, $value) = @_;

    if (!Encode::is_utf8($value))
    {
        $value = Encode::decode_utf8($value);
    }

    return $value;
}

sub load
{
    my ($this) = @_;
    my $moduledir = $this->{moduledir};

    print STDERR "Loading Rose::DB::Object." if $this->{verbose};
    RoseSQL::DB->new;
    print STDERR '.' if $this->{verbose};

    opendir(my $dh, "$moduledir/RoseSQL") or die "open fail: $!";

    my @entries = grep { /\.pm$/ } readdir($dh);
    for my $idx (0 .. $#entries)
    {
        print STDERR '.' if ($this->{verbose} && $idx % 10 == 0);
        my $entry = $entries[$idx];
        $entry =~ s/\.pm$//;
        require "$moduledir/RoseSQL/$entry/Manager.pm";

        # Now process all text columns to set triggers
        my $meta = "RoseSQL::$entry"->meta;
        for my $column ($meta->columns)
        {
            next unless $column->type =~ /^(?:text|varchar|character)$/;

            $column->add_trigger(
                event => 'deflate',
                code  => \&_deflate_utf8,
            );

            # This should always work right (but doesn't)
            $column->add_trigger(
                event => 'inflate',
                code  => \&_inflate_utf8,
            );
        }

        # Set the default speculative mode to true
        $meta->default_load_speculative(1);
    }
    closedir($dh);
    print STDERR " done.\n\n" if $this->{verbose};
}

1;

--
Miika Pekkarinen <mii...@ihme.org>

Miika Pekkarinen

Aug 20, 2008, 4:14:26 AM
to rose-db...@googlegroups.com
And sorry about top-posting, I really didn't want to do that but was
too fast to post. :)

--
Miika Pekkarinen <mii...@ihme.org>

Bill Moseley

Aug 24, 2008, 2:22:34 PM
to rose-db...@googlegroups.com
On Wed, Aug 20, 2008 at 05:57:48PM +1000, Ron Savage wrote:
> >
> > Most likely you can fix your problem by calling SET CLIENT_ENCODING TO
> > 'LATIN1'; before your insert. Or to make it global, you can set an
> > environment variable to do the same thing: $ENV{PGCLIENTENCODING} =
> > 'LATIN1'. That is if your data is currently in latin1 (also known as
> > ISO 8859-1 or ECMA 94) , which it most likely is.
>
> Setting $ENV{} as you suggest works, with and without 'use utf8;' in the
> code.
>
> Beautiful.

Ron, 'use utf8;' says your Perl source includes utf8 constants. In my
code any text that is utf8 always comes from a template or database that
is in utf8 (or some correctly decoded encoding) -- so I never need
'use utf8;' in my source.

I would question the need for setting the client encoding to anything
but utf8 for Postgresql. Then set DBD::Pg's pg_enable_utf8 option.
I'm careful to decode all data read into my program -- which is
something you should always do.
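
As a sketch (connection details here are placeholders), that setup is
just:

```perl
use strict;
use warnings;
use DBI;

# Leave client_encoding at UTF8 and have DBD::Pg mark the strings it
# returns with the utf8 flag, so Perl treats them as characters.
my $dbh = DBI->connect('dbi:Pg:dbname=mydb', 'user', 'secret',
                       { RaiseError => 1, pg_enable_utf8 => 1 });
```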

I also always use -E UTF8 when creating the database.

The pg_enable_utf8 option sets the utf8 flag on utf8 data read from
the database. I'm not clear how DBD::Pg implements that -- setting
the flag makes the (correct) assumption that Perl character strings
are utf8.

I would think that it would be more correct for DBD::Pg to
use encode_utf8() when sending data to Postgresql and decode_utf8()
when reading. A brief look at DBD::Pg and I don't see where it's
encoding to utf8 -- but maybe again it's making the assumption that
Perl strings are utf8 and uses them directly without encoding first.

Perhaps someone can explain how DBD::Pg handles encoding better.

--
Bill Moseley
mos...@hank.org
Sent from my iMutt

Ron Savage

Aug 25, 2008, 12:19:42 AM
to rose-db...@googlegroups.com
Hi Bill

$many x $thanx for the info.

> Ron, 'use utf8;' says your Perl source includes utf8 constants. In my
> code any text that is utf8 always comes from a template or database that
> is in utf8 (or some correctly decoded encoding) -- so I never need
> 'use utf8;' in my source.

OK. I won't use that gratuitously.

> I would question the need for setting the client encoding to anything
> but utf8 for Postgresql. Then set DBD::Pg's pg_enable_utf8 option.
> I'm careful to decode all data read into my program -- which is
> something you should always do.

OK. I can adopt that convention. To be explicit, I take you to mean
calling decode_utf8() on all column values retrieved via DBD::Pg?

Errr, no, I won't.

I did 'use Encode;' to get decode_utf8(), but then I was not able to
find an option in the 3-arg version of open which allowed me to write
a disk file which, when displayed in the browser, showed the glyphs I
expected.

So I removed Encode, decode_utf8() and the 3-arg form of open, and it
all started working.

Just for the record, the env var LC_CTYPE is empty.

> I also always use -E UTF8 when creating the database.

I did not use that when creating databases, but SHOW SERVER_ENCODING
displays UTF8, so the value I want is the default. Neat. But it's good
to know that that's happening (and what to check for).

> The pg_enable_utf8 option sets the utf8 flag on utf8 data read from
> the database. I'm not clear how DBD::Pg implements that -- setting
> the flag makes the (correct) assumption that Perl character strings
> are utf8.

Yep, saw the pg_enable_utf8 option in the docs, but as long as I don't
need it, I won't activate it.

> I would think that it would be more correct for DBD::Pg to
> use encode_utf8() when sending data to Postgresql and decode_utf8()
> when reading. A brief look at DBD::Pg and I don't see where it's
> encoding to utf8 -- but maybe again it's making the assumption that
> Perl strings are utf8 and uses them directly without encoding first.
>
> Perhaps someone can explain how DBD::Pg handles encoding better.

I too await enlightenment.

I'd love to see a Perl minigrant dedicated to someone writing a set of
test cases for us to download and study. The thought of each and every
one of us grinding through this is distressing...

Cees Hek

Aug 25, 2008, 1:36:02 AM
to rose-db...@googlegroups.com
On Mon, Aug 25, 2008 at 4:22 AM, Bill Moseley <mos...@hank.org> wrote:
>
> I would question the need for setting the client encoding to anything
> but utf8 for Postgresql. Then set DBD::Pg's pg_enable_utf8 option.
> I'm careful to decode all data read into my program -- which is
> something you should always do.

If your application doesn't natively use utf8, then it is a pain to
have to deal with converting to and from utf8 when dealing with the
database. That is exactly what the client encoding option is meant to
get around. It gets postgresql to do the conversions for you.

If your webapp is sending out latin1 as its content-encoding, then
you would be forced to manually convert everything you get from the DB
to latin1 when adding it to a template, also any data received from
webpages would need to be converted from latin1 to utf8 before
updating the database. Why not avoid that work...

Now, if you always set the content-encoding of your HTML pages to
utf8, then this isn't an issue to begin with, because then you can
send and retrieve from the DB in utf8 without needing to do any
conversions.

So in my opinion, you should always set the postgresql client_encoding
to what your application is natively spitting out and pulling in. To
ignore that option forces a lot more work on yourself.

> I also always use -E UTF8 when creating the database.

So do I. Even though my apps usually spit out latin1.

> The pg_enable_utf8 option sets the utf8 flag on utf8 data read from
> the database. I'm not clear how DBD::Pg implements that -- setting
> the flag makes the (correct) assumption that Perl character strings
> are utf8.

As far as I understand it, this has nothing to do with converting data
to or from utf8. It is only a flag that is used internally by Perl to
decide how the data should be treated when comparing strings (ie using
the 'eq' operator). Check the Encode docs and search for 'The UTF-8
flag'.

Turning the DBD::Pg option pg_enable_utf8 on marks all data read from
PostgreSQL with the perl utf8 flag, which means when you test equality
on one of these strings, it will be character-oriented instead of
byte-oriented. This flag is there for backwards compatibility, so
utf8 doesn't break old programs. So if your postgresql database is
spitting out utf8 to your app, you should probably set the
pg_enable_utf8 to true, so that if you use this data in any
comparisons, the new character-oriented comparison will be used.
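
You can see the difference the flag makes without a database at all;
this sketch just uses Encode directly to stand in for what
pg_enable_utf8 does:

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# The same UTF-8 octets for "Côte", with and without the utf8 flag.
my $octets = "C\xC3\xB4te";          # flag off: 5 raw bytes
my $chars  = decode_utf8($octets);   # flag on: "\xC3\xB4" is one character

print length($octets), "\n";   # 5 (byte-oriented)
print length($chars),  "\n";   # 4 (character-oriented)
```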

> I would think that it would be more correct for DBD::Pg to
> use encode_utf8() when sending data to Postgresql and decode_utf8()
> when reading. A brief look at DBD::Pg and I don't see where it's
> encoding to utf8 -- but maybe again it's making the assumption that
> Perl strings are utf8 and uses them directly without encoding first.
>
> Perhaps someone can explain how DBD::Pg handles encoding better.

PostgreSQL already handles encoding/decoding for you, so why add it
into the driver module as well. Just tell PostgreSQL what encoding you
are sending in, and it will do the right thing (ie set the
CLIENT_ENCODING option and don't worry about encoding issues).

What I do think would be a handy addition to DBD::Pg is an option that
specifies the content encoding you would like to use. So then DBD::Pg
can do the right thing by calling SET CLIENT_ENCODING for you, or
converting the data manually if need be. Or at least providing some
docs around client encoding since it is a complex topic.

Cheers,

Cees Hek

Bill Moseley

Aug 25, 2008, 1:38:01 PM
to rose-db...@googlegroups.com
This is really two (off) topic issues -- one about how to work with
characters in Perl and the other with how DBD::Pg handles encoding.


On Mon, Aug 25, 2008 at 03:36:02PM +1000, Cees Hek wrote:
>
> If your application doesn't natively use utf8, then it is a pain to
> have to deal with converting to and from utf8 when dealing with the
> database. That is exactly what the client encoding option is meant to
> get around. It gets postgresql to do the conversions for you.

I think you are mixing two different things there. The Postgresql
client encoding allows clients to communicate with the database in
a different encoding than the database is encoded in.

But, programs commonly work with characters -- and to work with
characters you should decode the octets on input into Perl's abstract
concept of characters. Characters inside Perl, octets outside.
Postgresql's client encoding does not do that for you.

True, you can continue to ignore encoding and use byte semantics in
your Perl code, and things work just like prior to Perl's unicode
support. But you are not working with characters -- and most of the
time you do want to work with characters. Ignoring this is where the
pain will come in.


> If your webapp is sending out latin1 as its content-encoding, then
> you would be forced to manually convert everything you get from the DB
> to latin1 when adding it to a template, also any data received from
> webpages would need to be converted from latin1 to utf8 before
> updating the database. Why not avoid that work...

Not decoding the data on input means you are working with raw octets,
not characters. And passing it through un-decoded just to avoid the
work sounds like trouble waiting to happen. What if someday you want
to start sending a different encoding? What if you want to compare
strings from the DB with strings from another source that might be in
a different encoding? What if you want to know the length of a
string in characters?

There's no encoding of data when adding to a template. You decode
octets when reading in the template (e.g. ENCODING option in TT2), and
then you have characters. Data read from the database is likewise
decoded into Perl's characters -- and any data from any other source
(like an include file in Latin1 would be decoded as Latin1). Once
that is done you can safely work with your character data in Perl
because your input has all been "normalized" into characters.

Build your final character string then output using an I/O layer or
using encode(). Pick whatever encoding you like.
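
A minimal sketch of that round trip, using core Encode (the Latin-1
bytes are just an example input):

```perl
use strict;
use warnings;
use Encode qw(decode encode_utf8);

# Octets arrive from outside -- say, Latin-1 bytes for "Côte D'Ivoire".
my $latin1_octets = "C\xF4te D'Ivoire";

# Decode on input: octets become Perl characters.
my $chars = decode('iso-8859-1', $latin1_octets);

# Inside the program we work with characters: length() counts 13 of them.

# Encode on output: characters become octets in whatever encoding we
# choose -- here UTF-8, giving 14 octets since "ô" takes two.
my $utf8_octets = encode_utf8($chars);
```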

It's not that much work, and I'd argue ignoring it will be much more
work in the long run.


Again, to get character data into Perl from any source requires decoding;
therefore I would expect Postgresql to be no different.

AFAIK, DBD::Pg doesn't provide a way to specify the client encoding. That
is, you can't specify that client encoding is latin1 or utf8 (which
would mean decode( 'iso-8859-1', $octets ) or decode_utf8( $octets )).

What DBD::Pg does provide is the pg_enable_utf8 option. From what I
can see is if the octets look like a valid utf8 string DBD::Pg will
force the utf8 flag on the scalar. This can work because it just so
happens that Perl uses utf8 internally to represent characters[1], but
I think it would be more "proper" to use decode() and specify what the
client encoding PG is using.

Likewise, writing to the db should encode back to whatever PG's client
encoding is set to.

Sure, you can ignore all the encoding and decoding and just work with
octets (and then set PG's client encoding to match), but you are not
working with characters.

> PostgreSQL already handles encoding/decoding for you, so why add it
> into the driver module as well. Just tell PostgreSQL what encoding you
> are sending in, and it will do the right thing (ie set the
> CLIENT_ENCODING option and don't worry about encoding issues).

I think that terminology is incorrect. Postgresql's client encoding
will convert the database's native encoding to the client encoding for
you. But, it's still octets. It's not *decoding* in the sense that
Encode::decode() decodes octets into Perl's characters.

Again, Perl works with characters. Data to and from Postgresql is in
octets -- and I/O layers (and encode() and decode() ) are what convert
between octets and characters.

If you do SET CLIENT_ENCODING $foo then I'd want:

$chars = Encode::decode( $foo, $octets );

when reading from the db, and

$octets = Encode::encode( $foo, $characters );

when writing.

The problem with that approach is that not all data in the database
may be character data and thus can't always be decoded. So, unless
DBD::Pg knows what columns are binary then forcing the utf8 flag is
probably the safest approach as decode() would fail on binary data.
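
That failure mode is easy to demonstrate with strict checking turned
on:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

my $binary = "\xFF\xFE\x00\x01";   # not valid UTF-8 anywhere
my $chars  = eval { decode('UTF-8', $binary, FB_CROAK) };

print defined $chars ? "decoded\n" : "decode failed\n";   # decode failed
```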

Anyway, that's how I see things...

[1] it's much safer to ignore the fact that Perl uses utf8 as its
internal encoding for characters.

Cees Hek

Aug 26, 2008, 5:35:03 PM
to rose-db...@googlegroups.com
On Tue, Aug 26, 2008 at 3:38 AM, Bill Moseley <mos...@hank.org> wrote:
>
> This is really two (off) topic issues -- one about how to work with
> characters in Perl and the other with how DBD::Pg handles encoding.

I think it is only slightly off topic for the list, but this is
something that a lot of people run into, so hopefully people won't
mind. Speak up if this discussion is annoying anyone.

> On Mon, Aug 25, 2008 at 03:36:02PM +1000, Cees Hek wrote:
>>
>> If your application doesn't natively use utf8, then it is a pain to
>> have to deal with converting to and from utf8 when dealing with the
>> database. That is exactly what the client encoding option is meant to
>> get around. It gets postgresql to do the conversions for you.
>
> I think you are mixing two different things there. The Postgresql
> client encoding allows clients to communicate with the database in
> a different encoding than the database is encoded in.
>
> But, programs commonly work with characters -- and to work with
> characters you should decode the octets on input into Perl's abstract
> concept of characters. Characters inside Perl, octets outside.
> Postgresql's client encoding does not do that for you.
>
> True, you can continue to ignore encoding and use byte semantics in
> your Perl code then things work just like prior to Perl's unicode
> support. But you are not working with characters -- and most of the
> time you do want to work with characters. Ignoring this is where the
> pain will come in.

I guess my point was that this is only important if your application
needs/wants to support utf8 (or some other multi-byte encoding). If
you are just working with a single byte character set (ie latin1),
then this multibyte stuff does not come into play. With single byte
character sets, your 'length' functions, and comparison operators
should work on the octets, or on the decoded characters. At least
that is how I understand it... Please tell me if I am wrong on this,
because I may be deluding myself into a false sense of security...

I appreciate your comments though, because I think this info will be
very important when I do finally decide to tackle support for multiple
encodings and unicode in my application layer.

Cheers,

Cees
