Codepage confusion (with file names)

Alex Strickland

May 17, 2017, 6:12:31 AM
to harbou...@googlegroups.com
Hi all

I am trying to upgrade my app from 3.0.1dev (10/01/2012) to 3.4.0dev (*)
and hitting some issues.

One is that files with non-English characters in their names are not
loaded, e.g. Abschöpfung.jpg. I guess that the name is encoded in CP437 and
that is now incorrect. But I am not sure how to take things further. To
experiment a bit I added:

qout("hb_cdpSelect = " + hb_cdpSelect())
qout("hb_cdpSelect = " + hb_cdpSelect("CP437"))
qout("hb_cdpSelect = " + hb_cdpSelect())
qout("HB_LANGSELECT = " + HB_LANGSELECT())
qout("hb_cdpOS = " + hb_cdpOS())
qout("Set( _SET_OSCODEPAGE ) = " + asstring(Set( _SET_OSCODEPAGE )) )
qout("Set( _SET_OSCODEPAGE, hb_cdpOS() ) = " + asstring(Set( _SET_OSCODEPAGE, hb_cdpOS() )) )
qout("Set( _SET_OSCODEPAGE ) = " + asstring(Set( _SET_OSCODEPAGE )) )
qout("Set( _SET_CODEPAGE ) = " + asstring(Set( _SET_CODEPAGE )) )
qout("Set( _SET_CODEPAGE, CP437 ) = " + asstring(Set( _SET_CODEPAGE, "CP437" )) )
qout("Set( _SET_CODEPAGE ) = " + asstring(Set( _SET_CODEPAGE )) )

which results in:

hb_cdpSelect = EN
hb_cdpSelect = EN
hb_cdpSelect = EN
HB_LANGSELECT = en
hb_cdpOS = cp1252
Set( _SET_OSCODEPAGE ) = NIL
Set( _SET_OSCODEPAGE, hb_cdpOS() ) = NIL
Set( _SET_OSCODEPAGE ) = SVWIN
Set( _SET_CODEPAGE ) = EN
Set( _SET_CODEPAGE, CP437 ) = EN
Set( _SET_CODEPAGE ) = EN


I remain confused.

The "SVWIN" looks dodgy?

I am wondering whether I must disable the Unicode build to get the same
results - at least temporarily?

--
Regards
Alex


* - difficult because I lost the source code to the HwGUI lib I've been
using along with my PC all those years ago.

W.

May 17, 2017, 7:57:47 AM
to Harbour Users
Have you tried if Set( _SET_OSCODEPAGE, hb_cdpSelect() ) prevents automatic conversions?
Also remember if it's Windows, then it could by default use OS unicode API. What's your real app hb_cdpSelect()?

Regards, W.

Alex Strickland

May 17, 2017, 9:11:16 AM
to harbou...@googlegroups.com
On 2017-05-17 01:57 PM, W. wrote:

> Have you tried if Set( _SET_OSCODEPAGE, hb_cdpSelect() ) prevents
> automatic conversions?

I tried it now, and it does not seem to help:

Set( _SET_OSCODEPAGE, hb_cdpSelect() )
qout("Set( _SET_CODEPAGE ) = " + asstring(Set( _SET_CODEPAGE )) )

and it shows "EN".

> Also remember if it's Windows, then it could by default use OS unicode
> API.

It is Windows. I guess that it is using the OS unicode API. I am not
sure how to restore it to the default (CP437?) that was used in the
Harbour 3.0.1dev.

> What's your real app hb_cdpSelect()?

"EN".

--

Regards
Alex

Przemyslaw Czerpak

May 17, 2017, 3:59:48 PM
to harbou...@googlegroups.com
On Wed, 17 May 2017, Alex Strickland wrote:

Hi Alex,

> >Have you tried if Set( _SET_OSCODEPAGE, hb_cdpSelect() ) prevents
> >automatic conversions?
> I tried it now, and it does not seem to help:
>
> Set( _SET_OSCODEPAGE, hb_cdpSelect() )
> qout("Set( _SET_CODEPAGE ) = " + asstring(Set( _SET_CODEPAGE )) )
>
> and it shows "EN".

So the HVM CP is EN, which uses CP437.

_SET_OSCODEPAGE is used only for system calls which do not support
UNICODE. In MS-Windows Harbour builds nearly all system calls use
UNICODE parameters, so _SET_OSCODEPAGE is ignored and characters
are translated directly from _SET_CODEPAGE.
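
To make that effect concrete, here is a minimal sketch in Python (not Harbour; it is used only because its standard codecs expose the same code page tables). It assumes the file name was typed as CP1252 bytes while the active code page maps bytes through the CP437 table, which is the situation suspected in this thread. The variable names are made up for illustration.

```python
# Assumption: the name is stored as CP1252 bytes, but is interpreted
# through the CP437 table before being handed to a Unicode file API.
name_bytes = "Abschöpfung.jpg".encode("cp1252")  # ö becomes the single byte 0xF6

as_cp437 = name_bytes.decode("cp437")    # what a CP437 table produces
as_cp1252 = name_bytes.decode("cp1252")  # what the author intended

print(as_cp437)
print(as_cp1252)
```

Byte 0xF6 is ö in CP1252 but ÷ in CP437, so under this assumption the Unicode file name passed to the OS no longer matches the file on disk.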

> >Also remember if it's Windows, then it could by default use OS
> >unicode API.
> It is Windows. I guess that it is using the OS unicode API. I am not
> sure how to restore it to the default (CP437?) that was used in the
> Harbour 3.0.1dev.

False assumption gives false conclusions - Aristotle.

Your current application compiled and linked with HB3.2 when
_SET_CODEPAGE is "EN" converts strings to UNICODE using CP437
UNICODE character values. Just try:
? hb_cdpUniID( "EN" )
or:
? hb_cdpUniID( Set( _SET_CODEPAGE ) )

In old Harbour versions most MS-Windows API calls used ANSI
character encoding and a few (e.g. GTWIN output) OEM encoding.
ANSI and OEM are very imprecise terms which have different meanings
depending on the national MS-Windows version. In US and West Europe
ANSI means CP1252 and OEM means CP850, so probably you are using
one of these two encodings, not CP437.
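
As a rough illustration of why the distinction matters, the following Python sketch (again not Harbour, just the standard codec tables) prints the byte that represents ö in each of the code pages mentioned in this thread. The dictionary name `codes` is only an illustrative choice.

```python
# Byte value of "ö" in the code pages discussed in this thread.
codes = {cp: "ö".encode(cp)[0] for cp in ("cp437", "cp850", "cp1252")}

for codepage, byte in codes.items():
    print(f"{codepage}: 0x{byte:02X} ({byte})")
```

The two OEM pages agree on this character, while the ANSI page uses a different byte, so the numeric value of the character in your data indicates which family it was typed in.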

> >What's your real app hb_cdpSelect()?
> "EN".

But are you sure that you have characters encoded in CP437
in your source code and data tables?

best regards,
Przemek

Alex Strickland

May 18, 2017, 11:13:45 AM
to harbou...@googlegroups.com
On 2017-05-17 09:59 PM, Przemyslaw Czerpak wrote:

> False assumption gives false conclusions - Aristotle.

As you say. Or more crudely: "Assumption is the mother of all f*ups".

> Your current application compiled and linked with HB3.2 when
> _SET_CODEPAGE is "EN" converts strings to UNICODE using CP437
> UNICODE character values. Just try:
> ? hb_cdpUniID( "EN" )
> or:
> ? hb_cdpUniID( Set( _SET_CODEPAGE ) )

OK, from my 3.4.0 app I see :

hb_cdpUniID( 'EN' ) = cp437
hb_cdpUniID( Set( _SET_CODEPAGE ) )) = cp437

But when does it as you say above "convert strings to UNICODE using
CP437"? When I do a system call like FILE(str)? Is str converted from
CP437 to unicode?

Perhaps I should try and move to UTF8? How would I do that? Something like:

hb_cdpSelect("UTF8")?

> In old Harbour versions most MS-Windows API calls used ANSI
> character encoding and a few (e.g. GTWIN output) OEM encoding.
> ANSI and OEM are very imprecise terms which have different meanings
> depending on the national MS-Windows version. In US and West Europe
> ANSI means CP1252 and OEM means CP850, so probably you are using
> one of these two encodings, not CP437.

hb_cdpUniID returns cp437 in the 3.0.1 version but I assume from what
you say above that it is not relevant because Unicode calls are used.

> But are you sure that you have characters encoded in CP437
> in your source code and data tables?

I will try and manually check.

How would I convert the strings if they are cp1252? Something like:

hb_translate(string, "cp1252", "utf8")?

Or can I specify CP1252 as a parameter to DBUseArea? But then I guess
when I try and open the file that call will use unicode and the CP1252
encoding is still wrong?
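
For reference, this is what such a conversion does at the byte level, sketched in Python rather than Harbour. The helper `transcode` is hypothetical and the input is assumed to be CP1252; whether hb_translate() takes exactly these code page names is the open question above, not something this sketch settles.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Decode bytes from the src code page and re-encode them in dst."""
    return data.decode(src).encode(dst)

raw = b"Absch\xf6pfung.jpg"               # assumed to be CP1252 bytes
utf8 = transcode(raw, "cp1252", "utf-8")  # ö becomes the two bytes C3 B6
oem = transcode(utf8, "utf-8", "cp437")   # ö becomes the single byte 94

print(utf8)
print(oem)
```

The conversion is only correct if the source code page is named correctly; decoding the same bytes as CP437 would yield a different character before re-encoding.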

BTW does the "SVWIN" result I mentioned make any sense to you?

I'm sorry to be so dumb about this but I have tried to read the relevant
posts in the Changelog but there are a lot!

--

Regards
Alex


vszakats

May 18, 2017, 12:25:50 PM
to Harbour Users
Hi Alex,


On Thursday, May 18, 2017 at 5:13:45 PM UTC+2, Alex Strickland wrote:
> BTW does the "SVWIN" result I mentioned make any sense to you?

It looks confusing indeed and the explanation is that internally Harbour
still uses legacy "sorting" modules similar to Cl*pper. These sorting
modules not only specify a codepage, but also a culture dependent
collation and upper/lower-case info. One such sorting module is SVWIN.

In codepage translation contexts, however, it's more useful to work with
standard codepage names, and there are cases where you want to
work with a standard codepage name received from the user or a
3rd-party component (e.g. the OS itself, as returned by hb_cdpOS()).

To avoid having to look up legacy sorting modules for the codepage
we need, Harbour accepts standard codepage names ("cp1252")
directly in functions where sorting modules are accepted. These
standard codepage names, however, must be internally converted
to a matching legacy sorting module name. For cp1252 this happens
to be SVWIN (it could be any other module associated
with cp1252; there is no documented rule for that). For codepage
conversion operations (like in hb_Translate()), both will give the exact
same result.

One clean solution would be to split off codepage and culture
related properties from current sorting modules, but that's a long shot.

-Viktor

Alex Strickland

May 18, 2017, 4:08:56 PM
to harbou...@googlegroups.com
Hi Viktor

> One clean solution would be to split off codepage and culture
> related properties from current sorting modules, but that's a long shot.

I did read your comments about the difficulties of doing that in the
changelog.

Thank you for confirming that it is "expected" behaviour.

--
Regards
Alex

Przemyslaw Czerpak

May 18, 2017, 5:27:02 PM
to harbou...@googlegroups.com
On Thu, 18 May 2017, Alex Strickland wrote:

Hi Alex,

> >False assumption gives false conclusions - Aristotle.
> As you say. Or more crudely: "Assumption is the mother of all f*ups".

;)

> But when does it as you say above "convert strings to UNICODE using
> CP437"? When I do a system call like FILE(str)? Is str converted
> from CP437 to unicode?

CP437, just like any other code page (e.g. CP1252), is an array of Unicode
values indexed by the unsigned character value in the string.
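
A minimal sketch of that idea, written in Python rather than Harbour because its standard codecs make the tables easy to build. The helper `unicode_table` is hypothetical and is not how Harbour stores its tables internally; it only shows the "array indexed by byte value" view.

```python
def unicode_table(codepage: str) -> list:
    """Return a 256-entry list: index = byte value, item = Unicode character.
    Bytes the code page leaves undefined come back as U+FFFD."""
    return [bytes([i]).decode(codepage, errors="replace") for i in range(256)]

cp437 = unicode_table("cp437")
cp1252 = unicode_table("cp1252")

# The same byte indexes a different Unicode value in each table.
for byte in (0x94, 0xF6):
    print(f"0x{byte:02X}: cp437 -> U+{ord(cp437[byte]):04X}, "
          f"cp1252 -> U+{ord(cp1252[byte]):04X}")
```

Converting a string to Unicode is then just a lookup per byte, which is why choosing the wrong table silently yields the wrong characters instead of an error.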

> Perhaps I should try and move to UTF8? How would I do that? Something like:
>
> hb_cdpSelect("UTF8")?

Why? You only have to specify correct codepage. That's all.
CP1252 or CP850 in your case.

> hb_cdpUniID returns cp437 in the 3.0.1 version but I assume from
> what you say above that it is not relevant because Unicode calls are
> used.

It does not mean that you ever used characters in CP437 encoding.
Probably you never did when you were typing national characters for
West European languages. In 3.0.1 they were simply passed as you typed
them, so they are, and were, probably in CP1252 encoding.

> >But are you sure that you have characters encoded in CP437
> >in your source code and data tables?
> I will try and manually check.

Just try
? ASC( "ö" )
Then check the corresponding Unicode character value in the s_uniCodes
array in src/rtl/cdpapi.c (CP437) and in src/codepage/uc1252.c
(CP1252).
Finally, check which glyph is assigned to the Unicode value you are using.

SVWIN internally uses CP1252 so it can resolve your problem, but
you should rather use something like "ENWIN" so as not to change the
collation order, e.g. the code below defines such a codepage:

best regards,
Przemek


REQUEST HB_CODEPAGE_ENWIN
PROCEDURE Main()
   Set( _SET_CODEPAGE, "ENWIN" )
   ? Set( _SET_CODEPAGE ), hb_cdpUniID( Set( _SET_CODEPAGE ) )
   ? "Abschöpfung.jpg:", file( "Abschöpfung.jpg" )
   RETURN

#pragma begindump
#define HB_CP_ID ENWIN
#define HB_CP_INFO "English CP-1252"
#define HB_CP_UNITB HB_UNITB_1252
#define HB_CP_ACSORT HB_CDP_ACSORT_NONE
#define HB_CP_UPPER "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
#define HB_CP_LOWER "abcdefghijklmnopqrstuvwxyz"
#define HB_CP_UTF8
#include "hbcdpreg.h" /* include CP registration code */
#pragma enddump

Alex Strickland

May 21, 2017, 9:31:43 AM
to harbou...@googlegroups.com
On 2017-05-18 11:26 PM, Przemyslaw Czerpak wrote:
>
> Just try
> ? ASC( "ö" )
> Then check the corresponding Unicode character value in the s_uniCodes
> array in src/rtl/cdpapi.c (CP437) and in src/codepage/uc1252.c
> (CP1252).
> Finally, check which glyph is assigned to the Unicode value you are using.
>
> SVWIN internally uses CP1252 so it can resolve your problem, but
> you should rather use something like "ENWIN" so as not to change the
> collation order, e.g. the code below defines such a codepage:

I didn't check but the code you have posted works perfectly for me.
Perhaps it is worth adding to Harbour?

Thank you.
--
Regards
Alex
