DBI and character sets (yet again)


Tim Bunce

Mar 21, 2004, 6:17:16 PM
to Dean Arnold, DBI-users
On Sun, Mar 21, 2004 at 01:10:27PM -0800, Dean Arnold wrote:
> (Note: I'm sending this to both -users and -dev, I'm not
> certain which it belongs to at this point)

dbi-user I think, at this point, as wide user comment may be helpful.
Though I might regret that if this produces more heat than light.
[I've removed dbi-dev from the CC list. Anyone else replying to
(replies to) the original please do the same. Thanks.]

> Is there a consistent charset encoding behavior defined for
> DBI at this time ?

No.

> If not, is a rule wrt charset encoding behavior needed ?

Yes.

> If a list of charset behaviors for each DBD is needed,
> I'd be happy to put one together, assuming the DBD authors
> send me the details for each driver.

That would be great.


I'm no expert on this, as I'm probably about to prove, but here's
my perspective, for today at least...

1. Most applications only work with one character set encoding
(not counting UTF8). Obvious example: Latin-1.

2. Unicode is where we're going. Get used to it.

3. I don't really want the DBI to be involved in any recoding
of character sets (from client charset to server charset)
and I suggest that the drivers don't try to do that either.

4. DBI v2 will provide hooks to allow callbacks to be fired
on fetching a field and/or row and that could be used by an
application for recoding if it wants to 'hide' it under the DBI.

5. When selecting data from the database the driver should:
- return strings which have a unicode character set as UTF8.
- return strings with other character sets as-is (unchanged) on
the presumption that the application knows what to do with it.

6. Drivers that want to can offer a mechanism to recode non-unicode
character sets into unicode but I don't see a big need for the
DBI to standardize an interface for that at the moment.

7. DBI v2 will probably provide a way for applications to force the
UTF8 flag on particular columns as a workaround for drivers that
don't know the string of bytes they're returning is actually UTF8.

8. When passing data to the database (including the SQL statement)
the driver should (perhaps) warn if it's presented with UTF8
strings but the driver or database can't handle unicode.
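For the curious, points 5 and 7 can be sketched with the Encode module
that ships with perl 5.8: is_utf8() reports whether a scalar carries the
UTF8 flag, and decode() is the portable way to turn known-UTF8 octets
into a character string when a driver hasn't flagged them. A minimal
sketch (the $field value is made up for illustration):

```perl
use strict;
use warnings;
use Encode qw(is_utf8 decode);

# Suppose $field came back from a driver as raw octets that we happen
# to know are UTF-8 encoded, but the driver didn't set the UTF8 flag.
my $field = "caf\xc3\xa9";    # "café" as UTF-8 octets

print is_utf8($field) ? "flagged as UTF8\n" : "plain octets\n";  # plain octets

# The point-7 workaround, done portably: decode the octets so perl
# treats the value as characters from here on.
my $chars = decode('UTF-8', $field);
printf "%d octets -> %d characters\n", length($field), length($chars);
# 5 octets -> 4 characters
```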

Comments welcome, of course, but please stick to practical issues,
ideally with examples, rather than theoretical ones. Thanks.

Tim.

Dean Arnold

Mar 21, 2004, 7:50:34 PM
to Tim Bunce, DBI-users
>
> > Is there a consistent charset encoding behavior defined for
> > DBI at this time ?
>
> No.
>
> > If not, is a rule wrt charset encoding behavior needed ?
>
> Yes.
>
> > If a list of charset behaviors for each DBD is needed,
> > I'd be happy to put one together, assuming the DBD authors
> > send me the details for each driver.
>
> That would be great.

OK. Shall we start w/ DBD::Oracle ? ;^)
And driver authors, feel free to forward to me (and/or this
list). I'll try to put together a little webpage with the info.

>
> 1. Most applications only work with one character set encoding
> (not counting UTF8). Obvious example: Latin-1.

Agreed.

>
> 2. Unicode is where we're going. Get used to it.
>


Agreed.

> 3. I don't really want the DBI to be involved in any recoding
> of character sets (from client charset to server charset)
> and I suggest that the drivers don't try to do that either.
>


OK. It could certainly be a performance killer.

> 4. DBI v2 will provide hooks to allow callbacks to be fired
> on fetching a field and/or row and that could be used by an
> application for recoding if it wants to 'hide' it under the DBI.
>
> 5. When selecting data from the database the driver should:
> - return strings which have a unicode character set as UTF8.
> - return strings with other character sets as-is (unchanged) on
> the presumption that the application knows what to do with it.

The problem is that there's no standard metadata currently defined
to provide what the encoding is, and (AFAIK) Perl only has an
"is_utf8()" test that can be applied to any string independently.

E.g., let's say I'm writing a database IDE with the new Perl/Tk
that handles UNICODE. I want my tool to be dbms/driver independent
(as much as possible). I'm retrieving data to stuff into a fancy
spreadsheet, and I want pTk to use UNICODE so I'm I18N'd. If
the dbms/driver returns UTF8, that's all great. If it doesn't, or it
returns things with multiple encodings (ugh), I don't have
any (standard) metadata to tell me how to force the string into UTF8
via Encode::from_to(). Or for that matter, if drivers aren't tagging
the returned strings as UTF8, I don't have any idea what I'm dealing with.

I'm not expecting every driver, or DBI, to normalize everything; rather, just
a piece of info to tell an app what encoding the data is in. Presumably,
just another bit of $sth metadata, e.g., $sth->{CHAR_SET}, to provide
the info. If the driver doesn't know, then it fills in with undef, and
the app is on its own. Otherwise, the app has enough info to make
the necessary conversion:

foreach (1..@{$sth->{TYPES}}) {
    Encode::from_to($row->[$_-1], $sth->{CHAR_SET}->[$_-1], 'utf8')
        if (defined($sth->{CHAR_SET}->[$_-1]) && ($sth->{CHAR_SET}->[$_-1] ne 'utf8'));
}

As for parameter data, there might be an optional CHAR_SET attribute provided,
or perhaps each driver can specify its "preferred" encoding, and the app
can coerce its data into that encoding as needed.
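To make that concrete, here's roughly what "coerce its data into that
encoding" could look like on the app side, assuming a hypothetical
per-driver "preferred encoding" attribute whose value is an Encode-style
name (both the attribute and the value here are invented):

```perl
use strict;
use warnings;
use Encode qw(encode is_utf8);

# Hypothetical: the driver advertises its preferred octet encoding
# for parameter data (the name 'iso-8859-1' is just for the example).
my $preferred = 'iso-8859-1';

my $param = "na\x{EF}ve";    # "naïve" as a character string
utf8::upgrade($param);       # ensure perl stores it UTF8-flagged,
                             # as a driver might receive it

# Coerce flagged character data down to the driver's preferred octets
# before binding; leave plain octet strings alone.
my $octets = is_utf8($param) ? encode($preferred, $param) : $param;
# $octets is now 5 Latin-1 bytes, safe for a Latin-1-only server
```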

(I think the PerlIO encoding mechanisms provide some direction, or at least
have to deal with similar issues:
http://www.perldoc.com/perl5.8.0/pod/perluniintro.html#Unicode-I-O.
Come to think of it, it might be of interest to those using DBD::CSV
or other file-based DBDs ?)

Actually, this all may be a Perl problem more than a DBI issue. E.g.,
in Java, String objects always have an associated character set,
but all Perl appears to have is "it's UTF8" or "I don't know what it is
except a bunch of bytes".

>
> 6. Drivers that want to can offer a mechanism to recode non-unicode
> character sets into unicode but I don't see a big need for the
> DBI to standardize an interface for that at the moment.
>
> 7. DBI v2 will probably provide a way for applications to force the
> UTF8 flag on particular columns as a workaround for drivers that
> don't know the string of bytes they're returing is actually UTF8.
>
> 8. When passing data to the database (including the SQL statement)
> the driver should (perhaps) warn if it's presented with UTF8
> strings but the database or database can't handle unicode.

Qualified warning: attempt to convert to (e.g.) Latin1 first before throwing the
exception:

if (Encode::is_utf8($sql)) {
    #
    # my dbms only knows latin1, so check if compatible;
    # FB_CROAK makes from_to() die on unmappable characters
    #
    eval { from_to($sql, "utf-8", "iso-8859-1", Encode::FB_CROAK) };
    $dbh->set_err(-1, 'Unsupported characters in query.') if $@;
}

(the above is likely only reliable on Perl 5.8+)

>
> Comments welcome, of course, but please stick to practical issues,
> ideally with examples, rather than theoretical ones. Thanks.
>
> Tim.

And maybe an addition to the driver writer's POD to encourage UTF8
encoding ?

Hopefully this wasn't just a theoretical discussion...

Dean Arnold
Presicient Corp.
www.presicient.com

Dean Arnold

Mar 21, 2004, 8:21:56 PM
to Tim Bunce, DBI-users
FYI: I did a cursory scan of the online ODBC docs @ MSFT, but couldn't find
anything wrt providing charset info in the SQLColAttribute or SQLDescribeCol.

Tim Bunce

Mar 22, 2004, 6:04:15 AM
to Dean Arnold, Tim Bunce, DBI-users
On Sun, Mar 21, 2004 at 04:50:34PM -0800, Dean Arnold wrote:
> >
> > > If a list of charset behaviors for each DBD is needed,
> > > I'd be happy to put one together, assuming the DBD authors
> > > send me the details for each driver.
> >
> > That would be great.
>
> OK. Shall we start w/ DBD::Oracle ? ;^)

You could, but that's very much a moving target at the moment.

> And driver authors, feel free to forward to me (and/or this
> list). I'll try to put together a little webpage with the info.

I think it would help if you formulated a set of questions for driver
authors (or anyone else) to answer. Especially as finding the right
questions can be harder than finding the answers.

Here are a few to get you started:

- Does the database:
- have any concept of national character sets?
- at what levels: database, table, field?
- url for list of character set names?
- does it support unicode?
- Does the database client API:
- provide access to character set information, and how?
- at what levels: database, table, field?
- does it have a concept of a client character set?
- how is the client charset determined (locale, env var etc)
- does it perform charset recoding?
- Does the DBD driver:
- (repeat last set of questions)


> > 3. I don't really want the DBI to be involved in any recoding
> > of character sets (from client charset to server charset)
> > and I suggest that the drivers don't try to do that either.
>
> OK. It could certainly be a performance killer.

It's not a performance issue. It would do nothing if the client charset
the app wants is the same as the server charset. If it's not, then
something has to do the recoding somewhere. At this point I don't see
a need for the DBI to do that itself - though it should provide
hooks that can help.

> > 5. When selecting data from the database the driver should:
> > - return strings which have a unicode character set as UTF8.
> > - return strings with other character sets as-is (unchanged) on
> > the presumption that the application knows what to do with it.
>
> The problem is that there's no standard metadata currently defined
> to provide what the encoding is, and (AFAIK) Perl only has a
> "is_utf8()" method that can be tested on any string independently.

It's a driver-private issue. If a column is unicode then (I think)
the driver should set $sth->{TYPE}->[...] to the appropriate 'wide'
type (SQL_WCHAR, SQL_WVARCHAR, etc).
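A sketch of how an app could pick out the unicode columns from that
today. The type codes below are the ODBC values, which is what DBI's
:sql_types tag exports; the $types array is a stand-in for a real
$sth->{TYPE}:

```perl
use strict;
use warnings;

# ODBC-style type codes for wide (unicode) character columns;
# DBI exports these via its :sql_types tag.
use constant SQL_WCHAR        => -8;
use constant SQL_WVARCHAR     => -9;
use constant SQL_WLONGVARCHAR => -10;

my %is_wide = map { $_ => 1 } (SQL_WCHAR, SQL_WVARCHAR, SQL_WLONGVARCHAR);

# $sth->{TYPE} is a ref to an array of per-column type codes;
# fake one here: VARCHAR, WVARCHAR, CHAR.
my $types = [ 12, -9, 1 ];

my @unicode_cols = grep { $is_wide{ $types->[$_] } } 0 .. $#$types;
print "unicode columns: @unicode_cols\n";    # unicode columns: 1
```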

> I'm not expecting every driver, or DBI, to normalize everything; rather, just
> a piece of info to tell an app what encoding the data is in.

It boils down to unicode or not. And if not then you (currently)
have to assume that it's the same charset as the client because
99.8745% of the time it will be.

> Presumably,
> just another bit of $sth metadata, e.g., $sth->{CHAR_SET}, to provide
> the info. If the driver doesn't know, then it fills in with undef, and
> the app is on its own. Otherwise, the app has enough info to make
> the necessary conversion:

You're presuming that all databases that support charsets will use
the same set of names as Encode uses. I hope that is the case but
it might not be. (Add that to your list of things to discover :)

> As for parameter data, there might be an optional CHAR_SET attribute provided,
> or perhaps each driver can specify its "preferred" encoding, and the app
> can coerce its data into that encoding as needed.

The "preferred" encoding could (should?) be unicode.

> [...]


> (the above is likely only reliable on Perl 5.8+)

There's still a regular stream of unicode-related bugs being found
and fixed in perl. Anyone doing much work with unicode should be
using 5.8.3.

> > Comments welcome, of course, but please stick to practical issues,
> > ideally with examples, rather than theoretical ones. Thanks.
> >
> > Tim.
>
> And maybe an addition to the driver writer's POD to encourage UTF8
> encoding ?
>
> Hopefully this wasn't just a theoritcal discussion...

Only a little :) My point is that as engineers we want a beautiful
system that'll automatically and transparently recode between
multiple client and server charsets. But in practice very few people
need that (see points 1 and 2).

Some features, like charsets, vary greatly in how they're handled
by database APIs. For these kind of features the DBI usually lags
the drivers. Once a few drivers have implemented their own driver-specific
interfaces, and had them proven as practical by users, *then* I
can work with driver authors to see how best to extend the DBI API
in a way that'll work well for those drivers and others.

That's what happened for $sth->execute_array, and that's exactly
what's happening with $sth->more_results at the moment.

The driver survey would be a valuable step along this road.

Tim.

Michael Peppler

Mar 22, 2004, 10:03:21 AM
to Tim Bunce, Dean Arnold, DBI-users
On Mon, 2004-03-22 at 03:04, Tim Bunce wrote:
> On Sun, Mar 21, 2004 at 04:50:34PM -0800, Dean Arnold wrote:
> > >
> > > > If a list of charset behaviors for each DBD is needed,
> > > > I'd be happy to put one together, assuming the DBD authors
> > > > send me the details for each driver.
> > >
> > > That would be great.
> >
> > OK. Shall we start w/ DBD::Oracle ? ;^)
>
> You could, but that's very much a moving target at the moment.
>
> > And driver authors, feel free to forward to me (and/or this
> > list). I'll try to put together a little webpage with the info.
>
> I think it would help if you formulated a set of questions for driver
> authors (or anyone else) to answer. Especially as finding the right
> questions can be harder than finding the answers.
>
> Here are a few to get you started:

For Sybase ASE (and DBD::Sybase)

> - Does the database:
> - have any concept of national character sets?

ASE has a concept of locales, with a mapping from the locale to a
character set.

> - at what levels: database, table, field?

server.

> - url for list of character set names?
> - does it support unicode?

Yes.

> - Does the database client API:
> - provide access to character set information, and how?

Yes, in the connection properties.

> - at what levels: database, table, field?

Server (i.e. connection).

> - does it have a concept of a client character set?

Yes.

> - how is the client charset determined (locale, env var etc)

locale/env var (LC_ALL/LANG), but can be overridden via connection
properties.

> - does it perform charset recoding?

Yes, if possible.

> - Does the DBD driver:
> - (repeat last set of questions)

DBD::Sybase will honor the current locale as that is the default
behavior of Sybase OpenClient, and you can override the client charset
in the DBI DSN as needed.

> > Presumably,
> > just another bit of $sth metadata, e.g., $sth->{CHAR_SET}, to provide
> > the info. If the driver doesn't know, then it fills in with undef, and
> > the app is on its own. Otherwise, the app has enough info to make
> > the necessary conversion:
>
> You're presuming that all database that support charsets will use
> the same set of names as Encode uses. I hope that is the case but
> it might not be. (Add that to your list of things to discover :)

ASE uses "iso_1", "cp850", "sjis", "eucjis", "eucgb", "euccns", "big5",
"utf8", "roman8", "roman9", "cp437", "gb18030", "eucksc" and a few others
that I've probably missed. The charset names depend on the platform (i.e.
Win32 has a different set of charset names than, say, Linux or VMS).
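Which suggests a driver would carry a small map from those names to
whatever Encode calls them; Encode::resolve_alias() can at least tell you
whether a candidate name is one perl already understands. The map below
is guesswork for illustration, not an official Sybase table:

```perl
use strict;
use warnings;
use Encode ();

# Illustrative driver-side map from Sybase-style charset names to
# Encode names; these entries are assumptions for the example only.
my %ase2encode = (
    iso_1 => 'iso-8859-1',
    cp850 => 'cp850',
    sjis  => 'shiftjis',
    utf8  => 'utf8',
);

while ( my ($ase, $enc) = each %ase2encode ) {
    # resolve_alias() returns the canonical Encode name, or false
    # if perl has no encoding by that name.
    my $canonical = Encode::resolve_alias($enc);
    print "$ase -> ", ( $canonical || 'not known to Encode' ), "\n";
}
```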

FWIW... :-)

Michael
--
Michael Peppler Data Migrations, Inc.
mpep...@peppler.org http://www.peppler.org/
Sybase T-SQL/OpenClient/OpenServer/C/Perl developer available for short or
long term contract positions - http://www.peppler.org/resume.html

Dean Arnold

Mar 22, 2004, 6:08:55 PM
to Tim Bunce, DBI-users
>
> I think it would help if you formulated a set of questions for driver
> authors (or anyone else) to answer. Especially as finding the right
> questions can be harder than finding the answers.
>
> Here are a few to get you started:
>
> - Does the database:
> - have any concept of national character sets?
> - at what levels: database, table, field?
> - url for list of character set names?
> - does it support unicode?
> - Does the database client API:
> - provide access to character set information, and how?
> - at what levels: database, table, field?
> - does it have a concept of a client character set?
> - how is the client charset determined (locale, env var etc)
> - does it perform charset recoding?
> - Does the DBD driver:
> - (repeat last set of questions)

Er, that's a pretty long list of detailed info, and I'm not certain it's
all entirely needed.

How about:

1. Does your DBD currently support NLS encodings ?

2. If so
   a. what character sets does it support ?
   b. how does it determine what character set to use for returned data,
      and for parameter data and SQL statements ?
   c. does it provide any metadata to indicate the character set of
      returned columns ?
   d. does it provide any mechanism for apps to indicate the character
      set of parameter data and/or SQL statements ?
   e. does it perform any internal character set conversions in support
      of any of the above ?

3. If not, do your target database and/or any supporting client access libraries
support NLS capabilities (ie, *could* you support NLS) ?

I'm assuming at least *some* databases (including some of the more popular)
support different charsets for columns in the same table, so the fact that
some others don't doesn't (IMHO) really matter (if we want to avoid an LCD
implementation). Either a given DBD

- supports NLS
- doesn't support NLS, but the database and client libs do
- doesn't support NLS because the database and/or client libs don't

And of course, ODBC, JDBC, and ADO have only "fuzzy" answers to the above.

>
> > > 5. When selecting data from the database the driver should:
> > > - return strings which have a unicode character set as UTF8.
> > > - return strings with other character sets as-is (unchanged) on
> > > the presumption that the application knows what to do with it.
> >
> > The problem is that there's no standard metadata currently defined
> > to provide what the encoding is, and (AFAIK) Perl only has a
> > "is_utf8()" method that can be tested on any string independently.
>
> It's a driver-private issue. If a column is unicode then (I think)
> the driver should set $sth->{TYPE}->[...] to the appropriate 'wide'
> type (SQL_WCHAR, SQL_WVARCHAR, etc).

Problem is, that's still a pretty high level. SQL_WCHAR can be UTF8, UTF16,
UTF32, UCS2, etc. And we don't have a standard way of indicating which it is
at present, and we've previously agreed that we're not going to force DBD's
to normalize on anything.

(I'm probably getting confused at this point, but bear with me...)

Are we relying on locale to determine which UNICODE encoding
the data is in those cases ?

E.g., my locale's charset is UTF8, and I retrieve some UNICODE columns. The DBD returns
UTF16, but doesn't have a std. means of telling the app that. Is the implicit assumption
that either a) the DBD, b) the client libs, or c) the database must figure out how to
locale-ize the results ?

For that matter, if my locale's charset is UTF8 and the DBD returns some latin3 columns,
who is responsible for getting them into the locale's charset ? At present, the app
can't do it (except via driver-specific i/fs, if they exist), since it doesn't know
that the columns are latin3, only that they're SQL_CHAR.

>
> > I'm not expecting every driver, or DBI, to normalize everything; rather, just
> > a piece of info to tell an app what encoding the data is in.
>
> It boils down to unicode or not. And if not then you (currently)
> have to assume that it's the same charset as the client because
> 99.8745% of the time it will be.

Which I assume is to be derived from the locale. OK.

>
> > Presumably,
> > just another bit of $sth metadata, e.g., $sth->{CHAR_SET}, to provide
> > the info. If the driver doesn't know, then it fills in with undef, and
> > the app is on its own. Otherwise, the app has enough info to make
> > the necessary conversion:
>
> You're presuming that all database that support charsets will use
> the same set of names as Encode uses. I hope that is the case but
> it might not be. (Add that to your list of things to discover :)

Or that DBD's will make the effort to map their database's
encoding names to the Perl equivalents, much as we assume
they know how to map whatever their client lib/database uses
to indicate "this is an integer" into SQL_INTEGER.

>
> > As for parameter data, there might be an optional CHAR_SET attribute provided,
> > or perhaps each driver can specify its "preferred" encoding, and the app
> > can coerce its data into that encoding as needed.
>
> The "preferred" encoding could (should?) be unicode.

So the app specifies SQL_WCHAR...but should the DBD then assume it's
UTF8 encoded ? Couldn't it be UTF16, UCS2, etc. ? But there's no way
to tell the DBD that. And the DBD may need to use some other
encoding than UTF8.

Maybe it boils down to a set of rules:

1. DBD's should communicate the available locale information
to the datasource (which includes everything below the DBD:
client libs and database)

2. The datasource is responsible for returning data in the client's
locale defined charset.

3. The DBD assumes all data provided by an application is in
the locale defined charset.

4. If the datasource cannot return or accept data in the client's locale
defined charset, then ???

If we can nail down (4), even if it's just "throw an error on connection",
then I think we've got your 99.8745% covered.

BTW: I did some checking on setting/getting locale info
(via setlocale(LC_CTYPE)), and AS Perl on Win32 doesn't appear
to behave very well. It reports something, but won't permit
a modification. I tried AS 5.6 on WinXP and AS 5.8.3 on Win2K.
Does anyone know how to make that work (Google wasn't too helpful)?
Fedora 1 w/ Perl 5.8.3 works fine. I don't know if that impacts our
discussion or not, but I'd hate to settle on a solution that precluded
support for the most common platform.
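In case it helps, here's the sort of best-effort client-charset lookup I
had in mind: try I18N::Langinfo (POSIX-ish systems, perl 5.8+), then fall
back to parsing the usual env vars. Just a sketch; behavior on Win32 is
exactly the open question above:

```perl
use strict;
use warnings;

# Best-effort guess at the client charset.
sub client_charset {
    # I18N::Langinfo is core in 5.8 but only useful on POSIX-ish
    # systems, so guard it with eval.
    my $cs = eval {
        require I18N::Langinfo;
        I18N::Langinfo::langinfo( I18N::Langinfo::CODESET() );
    };
    return $cs if $cs;

    # Fall back to the charset suffix of the locale env vars,
    # e.g. "en_US.UTF-8" -> "UTF-8".
    for my $var (qw(LC_ALL LC_CTYPE LANG)) {
        next unless defined $ENV{$var};
        return $1 if $ENV{$var} =~ /\.([-\w]+)/;
    }
    return undef;    # truly no idea; the app is on its own
}

print client_charset() || 'unknown', "\n";
```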

I've ranted and braindumped enough for one day, so flame away,

Tim Bunce

Mar 23, 2004, 4:44:18 AM
to Dean Arnold, Tim Bunce, DBI-users
On Mon, Mar 22, 2004 at 03:08:55PM -0800, Dean Arnold wrote:
> >
> > I think it would help if you formulated a set of questions for driver
> > authors (or anyone else) to answer. Especially as finding the right
> > questions can be harder than finding the answers.
> >
> > Here are a few to get you started:
> >
> > - Does the database:
> > - have any concept of national character sets?
> > - at what levels: database, table, field?
> > - url for list of character set names?
> > - does it support unicode?
> > - Does the database client API:
> > - provide access to character set information, and how?
> > - at what levels: database, table, field?
> > - does it have a concept of a client character set?
> > - how is the client charset determined (locale, env var etc)
> > - does it perform charset recoding?
> > - Does the DBD driver:
> > - (repeat last set of questions)
>
> Er, thats a pretty long list of detailed info, and I'm not certain its all entirely
> needed.

Better to ask for more detail up front or I suspect we'll be left
without sufficient information to guess the answers to the questions
we didn't think of :)

And here's one: Give a URL to (the relevant part of) the database API specification.


> > It's a driver-private issue. If a column is unicode then (I think)
> > the driver should set $sth->{TYPE}->[...] to the appropriate 'wide'
> > type (SQL_WCHAR, SQL_WVARCHAR, etc).
>
> Problem is, thats still a pretty high level. SQL_WCHAR can be UTF8, UTF16, UTF32,
> UCS2, etc. And we don't have a standard way of indicating which it is at present, and
> we've previously agreed that we're not going to force DBD's to normalize on anything.

Maybe I haven't said this before, but from the DBI's perspective
all the wide char types imply perl's native unicode, i.e. utf8.

> Are we relying on locale to determine which UNICODE encoding
> the data is in those cases ?

The database API should always make it clear. (I think it's typically UCS2.)

> E.g., my locale's charset is UTF8, and I retrieve some UNICODE columns. The DBD returns
> UTF16, but doesn't have a std. means of telling the app that. Is the implicit assumption that
> either the a) DBD, b) client libs, or c) database must figure out how to locale-ize the results ?

Converting UTF16 or UCS2 to UTF8 doesn't require any figuring out.
The locale isn't involved at all. It's a lossless conversion (so long
as you know what you're converting from, which the database API will specify).
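For example, with Encode (the UTF-16BE octets below are made up for
illustration):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "Hi" plus U+263A (white smiling face) as UTF-16BE octets, the kind
# of thing a database API might hand up for a wide column.
my $wide = "\x00H\x00i\x26\x3A";

# Decode from the encoding the database API documents; the locale is
# not involved anywhere.
my $chars = decode('UTF-16BE', $wide);

# Re-encode as UTF-8 octets; the round trip is lossless.
my $utf8 = encode('UTF-8', $chars);
printf "%d UTF-16 octets -> %d UTF-8 octets\n",
    length($wide), length($utf8);
# 6 UTF-16 octets -> 5 UTF-8 octets
```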

> For that matter, if my locale's charset is UTF8 and the DBD returns some latin3 columns,
> who is responsible for getting them into the locale's charset ? At present, the app
> can't (except using driver-specific i/fs, if they exist), since it doesn't know that the
> columns are latin3, only that they're SQL_CHAR.

Currently it's the app, possibly using driver-specific interfaces.

> > > As for parameter data, there might be an optional CHAR_SET attribute provided,
> > > or perhaps each driver can specify its "preferred" encoding, and the app
> > > can coerce its data into that encoding as needed.
> >
> > The "preferred" encoding could (should?) be unicode.
>
> So the app specifies SQL_WCHAR...but should the DBD then assume its
> UTF8 encoded ?

The data from perl to the driver can be assumed to be utf8 if the
app has bound the column as SQL_WCHAR or SQL_WVARCHAR etc.
Wide data from the database API 'up' into to the driver is whatever
the database API says it will be (typically UCS2).


> Maybe it boils down to a set of rules:


>
> 1. DBD's should communicate the available locale information
> to the datasource (which includes everything below the DBD:
> client libs and database)
>
> 2. The datasource is responsible for returning data in the client's
> locale defined charset.

Those first two normally happen by default in most database APIs.

> 3. The DBD assumes all data provided by an application is in
> the locale defined charset.
>
> 4. If the datasource cannot return or accept data in the client's locale
> defined charset, then ???
>
> If we can nail down (4), even if its just "throw an error on connection",
> then I think we've got your 99.8745% covered.

Sadly not. All the above doesn't address how to get unicode _and_ one
other charset to coexist. That's the main goal I'm after as that's
what's needed to support migration towards unicode.

> I've ranted and braindumped enough for one day, so flame away,

Take a deep breath. Gather the driver information (formulate the
questions, email to driver authors, tabulate the results) and then
we'll take stock of where we're at.

Thanks!

Tim.
