[AOLSERVER] Charset differences between 3.3+ad13 and 4.0.10?

11 views
Skip to first unread message

Janine Ohmer

unread,
Dec 1, 2010, 12:19:48 AM12/1/10
to AOLS...@listserv.aol.com
Me again... still working out the last details on those sites I had to move.

We're having some issues with characters like apostrophes and dashes either disappearing (Safari) or showing up as garbage characters (Firefox).

I'm using the same version of Postgres and the same codebase. The databases on both systems use the UNICODE encoding. The main thing that's different is the AOLserver version. There isn't anything about charset in either the old or the new config file.

Has anyone done this conversion (in recent memory, that is :) and knows what the problem might be?

thanks,

janine

---
Janine Ohmer (formerly Sisk)
President/CEO of furfly, LLC
503-693-6407


--
AOLserver - http://www.aolserver.com/

To Remove yourself from this list, simply send an email to <list...@listserv.aol.com> with the
body of "SIGNOFF AOLSERVER" in the email message. You can leave the Subject: field of your email blank.

Fenton, Brian

unread,
Dec 1, 2010, 6:32:39 AM12/1/10
to AOLS...@listserv.aol.com
Hi Janine

this may not be the answer but it can't do any harm. You should consider adding these to the AOLserver tcl file:
ns_param HackContentType 1
ns_param DefaultCharset utf-8
ns_param HttpOpenCharset utf-8
ns_param OutputCharset utf-8
ns_param URLCharset utf-8

Also, rule out any OS differences by checking your locale, LANG etc. Does Postgres have an equivalent to Oracle's NLS_LANG? Hopefully you won't have to dive into codepages and all that stuff!

best wishes
Brian Fenton


________________________________________
From: AOLserver Discussion [AOLS...@LISTSERV.AOL.COM] On Behalf Of Janine Ohmer [jan...@FURFLY.NET]
Sent: 01 December 2010 05:19
To: AOLS...@LISTSERV.AOL.COM
Subject: [AOLSERVER] Charset differences between 3.3+ad13 and 4.0.10?

Janine Ohmer

unread,
Dec 1, 2010, 1:44:04 PM12/1/10
to AOLS...@listserv.aol.com
Hi Brian,

The old system isn't accessible anymore (something went very wrong and even though it's RAID and all, fsck ended up removing a bunch of files and making the system unbootable). But from looking at the files in etc from the backups I can deduce that LANG was set to the same thing as it is on the new system, en_US.UTF-8.

I had tried setting these:

ns_param HackContentType 1
ns_param URLCharset "utf-8"
ns_param OutputCharset "utf-8"
ns_param HttpOpenCharset "utf-8"

Which did not work. I will try adding DefaultCharset just in case that is the key, but I'm not holding my breath...

There is an environment variable I can set, PGCLIENTENCODING, which will let me specify what encoding to use for the client, so I can add that to nsd-postgres. But I'm not sure what to set it to - everything's already in UTF8, so there shouldn't *be* any encoding issues (famous last words, I know :).

I can't restart the site during the day (very picky client) so will not be able to try anything until late afternoon.

thanks,

janine

Bernhard van Woerden

unread,
Dec 1, 2010, 2:16:34 PM12/1/10
to AOLS...@listserv.aol.com
Can you check the byte sequence of the string in the db that's causing a problem.

select encode(column_name::bytea,'hex') from ...
or use get_byte to isolate 

Janine Ohmer

unread,
Dec 2, 2010, 2:39:20 AM12/2/10
to AOLS...@listserv.aol.com
On Dec 1, 2010, at 11:16 AM, Bernhard van Woerden wrote:

Can you check the byte sequence of the string in the db that's causing a problem.

Sort of... the column in question is of type text, and the only function I can find that will convert from text to bytea is decode, not encode.  So here's what I did:

select decode(answer_3, 'escape') from public_places where public_place_id=1012;

The interesting part of the result is  "Park\302\222s", which is supposed to be "Park's".

I then ran 

select encode('\\302\\222':bytea, 'hex');

and got "c292".

Google tells me that this is the same as U+0092, which is defined simply as <control>, or "private use 2".

This doesn't seem very helpful, but it's possible that my initial assumptions on how to do this were wrong and I'm really just trying to look up garbage data. :)

Bernhard, does this give you any clues?

thanks,

janine

Janine Ohmer

unread,
Dec 2, 2010, 2:41:37 AM12/2/10
to AOLS...@listserv.aol.com
I tried adding DefaultCharset to the config file, to no effect. I also tried setting PGCLIENTENCODING to UTF-8 in nsd-postgres, and that didn't change anything either. Just for grins I tried changing it to WIN, since the assumption is that these are Windows smart quotes, but that just made the garbage character disappear altogether.

janine

russell muetzelfeldt

unread,
Dec 2, 2010, 3:17:40 AM12/2/10
to AOLS...@listserv.aol.com
On 02/12/2010, at 6:39 PM, Janine Ohmer wrote:
> On Dec 1, 2010, at 11:16 AM, Bernhard van Woerden wrote:
>> Can you check the byte sequence of the string in the db that's causing a problem.
>
> Sort of... the column in question is of type text, and the only function I can find that will convert from text to bytea is decode, not encode. So here's what I did:
>
> select decode(answer_3, 'escape') from public_places where public_place_id=1012;
>
> The interesting part of the result is "Park\302\222s", which is supposed to be "Park's".

I don't know whether this is relevant or not, but 0x92 is "�" in the windows-1252 (aka "ISO-8859-1 (Windows)") encoding...


cheers

Russell

russell muetzelfeldt

unread,
Dec 2, 2010, 3:23:09 AM12/2/10
to AOLS...@listserv.aol.com
On 02/12/2010, at 7:17 PM, russell muetzelfeldt wrote:
> On 02/12/2010, at 6:39 PM, Janine Ohmer wrote:
>>
>> The interesting part of the result is "Park\302\222s", which is supposed to be "Park's".
>
> I don't know whether this is relevant or not, but 0x92 is "" in the windows-1252 (aka "ISO-8859-1 (Windows)") encoding...

hmmm, I'm not sure if that character got eaten at my end or by lserv-m03.elist.aol.com, but what I meant was 0x92 is U+2019 RIGHT SINGLE QUOTATION MARK (aka a typographic quote mark) in windows-1252.

Gustaf Neumann

unread,
Dec 2, 2010, 4:27:57 AM12/2/10
to AOLS...@listserv.aol.com
It seems as if you have invalid character data in your database.

http://networking.itags.org/internet-explorer/28465/
http://php.net/manual/de/function.urlencode.php

it is most probably the easiest approach to fix these in the
database.
With postgres, you could make a dump, change the quotes in
the dump and
reload.

-gustaf neumann

On 02.12.10 08:39, Janine Ohmer wrote:
> Sort of... the column in question is of type text, and the
> only function I can find that will convert from text to
> bytea is decode, not encode. So here's what I did:
>
> select decode(answer_3, 'escape') from public_places where
> public_place_id=1012;
>
> The interesting part of the result is "Park\302\222s",
> which is supposed to be "Park's".
>
>

Bernhard van Woerden

unread,
Dec 2, 2010, 5:08:54 AM12/2/10
to AOLS...@listserv.aol.com
Just found the same as Russell, it started life as a single byte but needs some conversion to match it with the correct unicode character which can then be stored in utf-8.

What happens if you ask AOLserver to output iso-8859-1 does it convert 0xc2 0x92 to 0x92 ?
ns_param  OutputCharset     iso-8859-1

Fenton, Brian

unread,
Dec 2, 2010, 8:10:35 AM12/2/10
to AOLS...@listserv.aol.com
How did you get the data from the old database to the new? Maybe it got corrupted somewhere along the way. I've seen this happen with Oracle databases. Oracle have a tool called the "Character Set Scanner utility" which will determine the actual character set of the data, and a tool called CSALTER to change the character set - I wonder is there a Postgres equivalent?

Brian


________________________________________
From: AOLserver Discussion [AOLS...@LISTSERV.AOL.COM] On Behalf Of Janine Ohmer [jan...@FURFLY.NET]

Sent: 02 December 2010 07:41
To: AOLS...@LISTSERV.AOL.COM
Subject: Re: [AOLSERVER] Charset differences between 3.3+ad13 and 4.0.10?

Janine Ohmer

unread,
Dec 2, 2010, 12:49:16 PM12/2/10
to AOLS...@listserv.aol.com
I will try this tonight and see what happens (can't restart the site during the day).

Gustaf, you may well be correct that it's bad data, but this wasn't happening on the old system (or at least that's what they tell me) so I'm hoping to fix it with configuration.

Brian, I used pg_dump/restore.  I have been using the "custom" dump format for a while, which unfortunately doesn't let me look at the data the way a text dump does, but I would have thought that a binary dump might be less prone to data corruption than a text one.  Perhaps not, or maybe this was unavoidable. 

I don't know of any PG tools to fix the character set;  I haven't run across anything like that in Google yet.  But Torben sent me an email suggesting I try the Linux utility recode, which seems to be found here:  http://recode.progiciels-bpi.ca/index.html.  If nothing else works, I can try running this on a text dump and see if it fixes the bad data.

thanks,

janine

Janine Ohmer

unread,
Dec 3, 2010, 1:29:08 AM12/3/10
to AOLS...@listserv.aol.com
Changing the charset seems to have fixed it, after a spot check of some pages I know had problems.  If the client doesn't find more, we'll be in good shape.

Thanks, everyone!

janine

Tom Jackson

unread,
Jan 26, 2011, 1:28:19 AM1/26/11
to AOLS...@listserv.aol.com
Hi Janine,

My name is Maria and I am Tom wife, he need your your help, Can you
help him? call me at 425-241-8218 (cell) or 425-614-3105 (home)
Please call. My email is maria.a...@gmail.com. I need your your
help to updated some website, please call,

thanks
maria

Reply all
Reply to author
Forward
0 new messages