Which DEFAULT CHARACTER SET should I choose?

507 views
Skip to first unread message

Luciano Rodrigues Nunes Mendes

unread,
May 23, 2022, 8:04:08 PM5/23/22
to firebird-support
Hi Guys,

I have a database where most VARCHAR columns use the character set WIN1252 and only one column of a table uses the character set UTF8.

Based on this scenario:
  •  Which character set should I choose with the default of this database? WIN1252 or UTF8?
  • Which character set should I choose when connecting to this database? WIN1252 or UFT8?
Thanks in advance,
Luciano

Tomasz Tyrakowski

unread,
May 24, 2022, 3:33:13 AM5/24/22
to firebird...@googlegroups.com
On 24.05.2022 at 02:04, Luciano Rodrigues Nunes Mendes wrote:
> I have a database where most VARCHAR columns use the character set WIN1252
> and only one column of a table uses the character set UTF8.
>
> Based on this scenario:
>
> - Which character set should I choose with the default of this
> database? WIN1252 or UTF8?

If your columns have explicitly specified character set, the default
charset of the dabatase won't affect them. Choose one most convenient
for you when creating new tables in your DB.

> - Which character set should I choose when connecting to this database?
> WIN1252 or UFT8?

The character set you specify when connecting is not the charset of the
database, but one which your client software uses (so that FB knows how
to transliterate strings between your client software and the database).
So specify whichever matches your client application character encoding.

regards
Tomasz

Stefan Heymann

unread,
May 24, 2022, 4:08:00 AM5/24/22
to firebird...@googlegroups.com
Luciano,

> I have a database where most VARCHAR columns use the character set WIN1252
> and only one column of a table uses the character set UTF8. [...]
>
> - Which character set should I choose when connecting to this database?
> WIN1252 or UFT8?

Assuming that your client software is Unicode aware (which it should be nowadays), you should use UTF8 as the client connection character set. Otherwise you will not get all the characters that could possibly be stored in that UTF8 column. You will get a transliteration error (exception) instead.

Regards

Stefan


Luciano Rodrigues Nunes Mendes

unread,
May 24, 2022, 6:41:15 AM5/24/22
to firebird...@googlegroups.com
Hi Tomasz and Stefan

Thank you very much for your prompt reply!

One last question:
If I always use UTF8 as the client charset wouldn't I unnecessarily increase my system's network latency since UTF8 needs 4 times more bytes to represent the same character in WIN1252?

Thanks in advance,
Luciano

--
You received this message because you are subscribed to the Google Groups "firebird-support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to firebird-suppo...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/firebird-support/38215115.20220524100759%40familie-heymann.de.

Mark Rotteveel

unread,
May 24, 2022, 6:48:29 AM5/24/22
to firebird...@googlegroups.com
On 24-05-2022 12:41, Luciano Rodrigues Nunes Mendes wrote:
> Hi Tomasz and Stefan
>
> Thank you very much for your prompt reply!
>
> One last question:
> If I always use UTF8 as the client charset wouldn't I unnecessarily
> increase my system's network latency since UTF8 needs 4 times more bytes
> to represent the same character in WIN1252?

It doesn't work that way. UTF8 requires up to 4 bytes per character, not
4 bytes for each character, and unless you use CHAR everywhere instead
of VARCHAR, the protocol implementation (both client and server) will
only send the actual bytes used for a value.

Mark
--
Mark Rotteveel

Luciano Rodrigues Nunes Mendes

unread,
May 24, 2022, 6:56:19 AM5/24/22
to firebird...@googlegroups.com
Thank you so much for all the clarification Mark!

Best Regards,
Luciano

--
You received this message because you are subscribed to the Google Groups "firebird-support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to firebird-suppo...@googlegroups.com.

Tomasz Tyrakowski

unread,
May 24, 2022, 6:58:33 AM5/24/22
to firebird...@googlegroups.com
On 24.05.2022 at 12:41, Luciano Rodrigues Nunes Mendes wrote:
> One last question:
> If I always use UTF8 as the client charset wouldn't I unnecessarily
> increase my system's network latency since UTF8 needs 4 times more bytes to
> represent the same character in WIN1252?

It all depends on your client setup. I mean: what kind of SDK / database
library you use to talk to Firebird and which string encoding your
client application uses internally.
If your client app is for example an "old" win32 app without unicode
support and only uses single byte WIN1252 (ANSI) encoding internally,
you can safely specify win1252 as the connection charset. This way the
data you get back from the database will be automatically converted to
win1252, if possible (i.e. if it doesn't contain characters not
representable in win1252).
If, on the other hand, your application uses unicode internally,
probably utf-8 would be a better option. However, keep in mind, that if
your database column encoding is win1252, trying to store in that column
any unicode characters, that can't be translated to win1252, will result
in a database error (Firebird can't store a character outside of win1252
in a win1252 column). In other words, setting utf-8 as connection
charset doesn't circumvent in any way the single byte win1252 encoding
of a table column.
In general, you should choose the connection encoding best matching the
internal string representation in your client application.
As the last remark, I wouldn't worry too much about wire representation
of utf-8. Most characters from win1252 will probably requre 1 or 2 bytes
at most. Unless you store huge amounts of text in a single column, utf-8
shouldn't impact the network performance significantly (at least that's
my general experience - I've never actually benchmarked it).

regards
Tomasz

Tomasz Tyrakowski

unread,
May 24, 2022, 7:07:39 AM5/24/22
to firebird...@googlegroups.com
One last remark. AFAIR you mentioned that some of your columns are
utf-8, while _most_ of them are win1252.
If you plan to store in those few utf-8 columns characters not included
in win1252 (I assume that's why they're utf-8 in the first place), you
definitely shouldn't set your connection charset to win1252 (and go with
utf-8). Any attempt at reading text containing non-win1252 characters
(from a utf-8 column) with win1252 connection charset will result in an
error (Firebird will try to convert such character to win1252, 'cause
that's your client charset, and of course it will fail).
The downside is that if you use utf-8 as your client charset, you may
accidently attempt to store non-win1252 character in one of your win1252
columns, which will fail as well.

regards
Tomasz

Luciano Rodrigues Nunes Mendes

unread,
May 24, 2022, 7:11:47 AM5/24/22
to firebird...@googlegroups.com
Thank you very much for your support Tomasz!

Based on your comment I will use UTF8 when connecting to the database since I use the latest version of Firebird's ADO.NET provider.

Best Regards,
Luciano

--
You received this message because you are subscribed to the Google Groups "firebird-support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to firebird-suppo...@googlegroups.com.

Mark Rotteveel

unread,
May 24, 2022, 7:12:54 AM5/24/22
to firebird...@googlegroups.com
On 24-05-2022 13:07, Tomasz Tyrakowski wrote:
> One last remark. AFAIR you mentioned that some of your columns are
> utf-8, while _most_ of them are win1252.
> If you plan to store in those few utf-8 columns characters not included
> in win1252 (I assume that's why they're utf-8 in the first place), you
> definitely shouldn't set your connection charset to win1252 (and go with
> utf-8). Any attempt at reading text containing non-win1252 characters
> (from a utf-8 column) with win1252 connection charset will result in an
> error (Firebird will try to convert such character to win1252, 'cause
> that's your client charset, and of course it will fail).
> The downside is that if you use utf-8 as your client charset, you may
> accidently attempt to store non-win1252 character in one of your win1252
> columns, which will fail as well.

In theory, the OP could also use connection character set NONE, but this
only works for clients that actually use the character set information,
and don't blanket apply the platform default character set.

That is, when you use connection character set NONE, Firebird
communicates the actual character set of the column, and sends the data
as stored. That is, a WIN1252 column is sent as WIN1252, a UTF8 column
is sent as UTF8.

But given the historic issues with connection character set NONE, like
not all client libraries implementing this (or implementing this
correctly), this is an option that should be carefully tested before using.

Mark
--
Mark Rotteveel

Tomasz Tyrakowski

unread,
May 24, 2022, 7:21:37 AM5/24/22
to firebird...@googlegroups.com, Mark Rotteveel
On 24.05.2022 at 13:12, Mark Rotteveel wrote:
> In theory, the OP could also use connection character set NONE, but this
> only works for clients that actually use the character set information,
> and don't blanket apply the platform default character set.
>
> That is, when you use connection character set NONE, Firebird
> communicates the actual character set of the column, and sends the data
> as stored. That is, a WIN1252 column is sent as WIN1252, a UTF8 column
> is sent as UTF8.
>
> But given the historic issues with connection character set NONE, like
> not all client libraries implementing this (or implementing this
> correctly), this is an option that should be carefully tested before using.
>
> Mark

Wow, thanks for that info. Must have missed it reading the manuals. I've
been fighting for years with databases that for historic reasons (like
they're 20+ years old, migrated from Interbase back in early 2000s) are
mostly win1250 encoded, and it's a real pain in the butt.
But just thinking about it, that sounds logical. If my client charset is
NONE, it's my own business how I decode the data read from the
connection. What I didn't know was that FB provides the encoding
information together with actual data. And it's really great this option
exists.
Thanks again.

regards
Tomasz

Dimitry Sibiryakov

unread,
May 24, 2022, 7:25:21 AM5/24/22
to firebird...@googlegroups.com
Tomasz Tyrakowski wrote 24.05.2022 13:07:
> If you plan to store in those few utf-8 columns characters not included in
> win1252 (I assume that's why they're utf-8 in the first place), you definitely
> shouldn't set your connection charset to win1252 (and go with utf-8). Any
> attempt at reading text containing non-win1252 characters (from a utf-8 column)
> with win1252 connection charset will result in an error (Firebird will try to
> convert such character to win1252, 'cause that's your client charset, and of
> course it will fail).

It is not quite so. Data is delivered in charset specified by
SQLDA/MessageMetadata which indeed is defaulted to connection charset but can be
overridden at will in every single case.

--
WBR, SD.

Mark Rotteveel

unread,
May 24, 2022, 7:29:37 AM5/24/22
to firebird...@googlegroups.com
Only if your client library supports that, which I think you'll find is
rarely the case. Most people will probably not use or dive into the
low-level abstractions of the fbclient API.

Mark
--
Mark Rotteveel

Tomasz Tyrakowski

unread,
May 24, 2022, 7:40:31 AM5/24/22
to firebird...@googlegroups.com
Well, maybe I wasn't clear enough. What I meant was:
a) when the client's internal string representation is win1252 and I get
a non-win1252 character from the database, I can either handle it
manually (thanks to Mark now I know I can ;) ) and use some kind of
placeholder character (loosing data) or I can rely on automatic
transcoding which will fail,
b) when my internal string is unicode, or any encoding incompatible with
win1252 for that matter, and the connection charset is e.g. utf-8, when
I try to store a non-win1252 character in a win1252 column, it will fail
because there's no way this character can be transcoded to win1252.

Overriding the default character set of the connection won't help in
either case. When my client strings can store only win1252, setting
connection charset temporarily to utf8 won't magically let me store
non-win1252 characters in my ANSI win1252 strings. And the other way
around: trying to store a non-win1252 character in a win1252 DB column
will fail no matter the connection charset.
Of course my mental model may be oversimplified, in which case I'm more
than ready to learn how in fact this is handled client-side and DB-side.

regards
Tomasz

Tomasz Tyrakowski

unread,
May 24, 2022, 8:04:35 AM5/24/22
to firebird...@googlegroups.com
That's exactly the case. Talking to Firebird from Python, node,
Embarcadero RAD or PHP, I've always relied on setting the connection
charset and letting the fbclient library transcode the strings for me
(and it's really great FB does it). Nowadays it's less of a problem,
most client software is unicode anyway, and utf-8 connection solves most
of the issues (unless one has winXXXX or ISO-XXXX legacy DBs, which I do
I'm afraid, but working on it ;) ).
I admit openly that the abundance of high-level client libraries (like
fdb for Python, node-firebird, PDO::Firebird for PHP, IBO, IBX or
FireDAC for Embarcadero) allowed me not to worry about the details of
the actual data being exchanged between the client and the server. I
don't know if any of the high level libraries even has an API letting me
access this lower level details.
However, Mark's remark about setting charset to NONE and handling all
the encoding myself is a very valuable one (in some edge cases it may
prove very useful). Provided, of course, that the wrapper libraries will
let me know the encoding of the strings they get me from the DB, so back
to reading manuals I suppose ;)
Thank you both for valuable info.

cheers
Tomasz
Reply all
Reply to author
Forward
0 new messages