2025-09-03 12:21 UTC+0200 Przemyslaw Czerpak (druzus/at/poczta.onet.pl)


Przemyslaw Czerpak

Sep 3, 2025, 6:21:21 AM
to harbou...@googlegroups.com
2025-09-03 12:21 UTC+0200 Przemyslaw Czerpak (druzus/at/poczta.onet.pl)
* src/rtl/cdpapi.c
+ added a fallback translation table for variants of Latin
characters: when a translation is made between different encodings
and a variant of a Latin character does not exist in the destination
code page, it is now translated to its base Latin form, e.g.:
hb_utf8ToStr( "ĄĆĘŁŃÓŚŹŻąćęłńóśźż", "EN" ) -> "ACELNOSZZacelnószz"

* tests/uc16_gen.prg
; updated comment

best regards
Przemek

Marek Długosz

Sep 3, 2025, 6:29:36 AM
to harbou...@googlegroups.com
Thanks a lot!

What about indexing a Unicode field? Let's say DBCodePage is PLWIN and
some exotic letter like ư is inside a Unicode text field (C:U).

Regards

Marek

Aleksander Czajczynski

Sep 3, 2025, 2:03:33 PM
to harbou...@googlegroups.com
Marek Długosz wrote:
> Thanks a lot!
>
> What about indexing an unicode field? Lets say DBCodePage is PLWIN and
> some exotic letter like ư is inside Unicode text field (C:U)
>
The recent addition by Przemek most likely performs the necessary
simplification both during key creation and in DBSeek().

https://os.allcom.pl/hb/#!eJyFkMFKw0AQhu_7FMOcEowHeyyo7CaxBJamZLcILUVqumDQZkO6OXjU1_AlfIn4XO5uaglKcU4z883_zzA8jymHrTBtR5YiBaMOhvgSpteQMJeVJghJzHORDmQNV7BZwwQ2bgbj6RJ_wOQPSFjcqq1RATrrrq4wGtZBSE4bbRsoz6gAP4UQ50m6oLMUcMHvszkSkre7oxGcnCCKAO-yQsgLToVEZ5kw2jSq3tmTZSrk5Y3n_p7-TfcfeGw7ge-u-nddVgfziv9qvz7Pqx1L2ExL3Vi5y5g2Ru9tQW7dI5V6DgYbwDCCQpVzbeEvZm3Ow27MLBwdGcHT4wN7UXUAhWyrfTCGENrPgI9vEbOTeA

LOCAL aStru

USE test // from /tests or Harbour Playground
aStru := DBStruct()
CLOSE
aStru[ 1 ][ 2 ] := "C:U"
aStru[ 2 ][ 2 ] := "C:U"
DBCreate("testuni", aStru )

USE testuni ALIAS "test" CODEPAGE "PLWIN"

OrdCreate( "testuni" ,, "FIRST+LAST" )

DBAppend()
TEST->FIRST := "Łoś"
TEST->LAST := "Złocisty"
DBAppend()
TEST->FIRST := "Łośư"
TEST->LAST := "Złocistyư"
DBGoTop()
DBGoBottom()

? DBSeek("Łoś "), RecNo() // 1
? DBSeek("Łośư"), RecNo() // 2
? DBSeek("Łośu"), RecNo() // 2

The real solution is a new, multibyte-aware index format (ADS?), but is
this a real-life problem?
I wouldn't trust Unicode identifiers as unique keys, converted or not,
in any database, because of the many visually similar characters available.

A hack: use hb_strtohex( hb_translate( UFIELD, , "UTF16LE" ) ) in the ordkey
expression, and a similar expression in DBSeek().
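
A minimal sketch of that hack, not tested against a current build. It assumes the HVM runs with UTF8EX, that the UTF16LE code page can be linked in with REQUEST HB_CODEPAGE_UTF16LE, and that this source file is saved as UTF-8. UniKey() and the table name "testkey" are hypothetical names introduced only for illustration.

```harbour
REQUEST HB_CODEPAGE_PLWIN
REQUEST HB_CODEPAGE_UTF8EX
REQUEST HB_CODEPAGE_UTF16LE   // assumption: this CDP is available in the build

PROCEDURE Main()

   hb_cdpSelect( "UTF8EX" )   // string literals below are UTF-8

   dbCreate( "testkey", { { "FIRST", "C:U", 20, 0 } } )
   USE testkey NEW EXCLUSIVE CODEPAGE "PLWIN"

   // The key is built from the UTF-16 form of the field, so no character
   // is folded to its base Latin letter on the way into the index.
   ordCreate( "testkey",, "UniKey( FIRST )" )

   dbAppend()
   FIELD->FIRST := "Łoś"
   dbAppend()
   FIELD->FIRST := "Łośư"

   // The seek value goes through the same helper as the key expression.
   ? dbSeek( UniKey( "Łośư" ) ), RecNo()
   ? dbSeek( UniKey( "Łośu" ) ), RecNo()

   CLOSE
   RETURN

// Hypothetical helper: hex dump of the UTF16LE form of cText.
// It has to be linked into every program that opens this index.
FUNCTION UniKey( cText )
   RETURN hb_StrToHex( hb_Translate( cText,, "UTF16LE" ) )
```

With such a key the second seek is expected to fail, because "u" and "ư" produce different hex digits. The price is that the index order follows the hex digits of little-endian UTF-16, not any linguistic collation, and each character takes four bytes in the key.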

Best regards, Aleksander

Przemyslaw Czerpak

Sep 4, 2025, 3:23:51 AM
to harbou...@googlegroups.com
Hi Aleksander and Marek,

To clarify.
Yes, it may help in such situations, but only if translation is enabled.
In this example the test server uses UTF8 encoding as the default CDP
and the DBF file uses PLWIN (CP1250 encoding).
When the HVM uses a different CDP than the DBF, translation with the
fallback table is enabled (BTW, all translations between different CDPs
now use this fallback table, and neither the source nor the destination
CDP needs to use any variant of Unicode encoding). It means that during
indexing field values are translated to CP1250, and because the letter
'ư' does not exist in CP1250 it is stored in the index as plain Latin 'u'.
The same happens with the SEEK parameter, so the code works quite nicely.
However, the reverse translation will not make such a conversion. The data
read from Unicode fields in the DBF file can be fully represented in UTF8
encoding, so it contains the letter 'ư', not 'u'. In practice it means that
any filter expressions should compare values in the PLWIN context. This
problem has existed from the beginning and the recent modification has
not changed it. When the translation is not fully reversible, expressions
evaluated by the HVM may give unexpected results, so the programmer
should keep it in mind. This is true in all languages using such
translations. After the recent modification, though, it can be masked
by a special sorting/comparison order which keeps the same weight for
all characters that are variants of the same Latin letter.
Maybe we should add it.
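
A small sketch of what "compare values in the PLWIN context" can look like in a filter expression. It assumes the HVM uses UTF8EX, the source file is saved as UTF-8, and the build contains the fallback table from 2025-09-03. InDbCp() is a hypothetical helper name, not a Harbour function.

```harbour
REQUEST HB_CODEPAGE_PLWIN
REQUEST HB_CODEPAGE_UTF8EX

PROCEDURE Main()

   LOCAL cStored := "Łośư"   // what a C:U field hands back to the HVM
   LOCAL cSearch := "Łośu"   // what the index key / SEEK value was reduced to

   hb_cdpSelect( "UTF8EX" )

   // Compared as UTF-8 in the HVM the two strings differ,
   // although SEEK treated them as the same key.
   ? cStored == cSearch

   // Compared after translation to the DBF code page they are reduced
   // the same way the index keys were (assumes the fallback table).
   ? InDbCp( cStored ) == InDbCp( cSearch )

   RETURN

// Hypothetical helper: translate from the current HVM CDP to PLWIN.
FUNCTION InDbCp( cText )
   RETURN hb_Translate( cText,, "PLWIN" )
```

A filter written as InDbCp( FIELD->FIRST ) == InDbCp( cSearch ) then agrees with what the index and SEEK see, at the cost of one translation per evaluated record.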

Finally, I've seen on this list the discussion about GET and Harbour's
"UTF8" CDP, and it looks like I have to repeat some information.
It's bound with character encoding, so I'll do it in this thread.

Harbour's UTF8 CDP is only for encoding. It does not introduce any
additional information about the encoded characters, so it will not
change the behavior of functions like UPPER(), LOWER(), ISALPHA() or of
string functions operating on character indexes, e.g. LEN(),
LEFT(), RIGHT(), SUBSTR(), STUFF(), STRTRAN(), PAD*() etc.
All of them will operate on the byte (binary) representation of the
strings in UTF8 encoding. The GET system, which uses the above
functions, will also not work as you may expect. Still, this CDP is
useful for translation and is extremely light, so it is present in all
Harbour static builds.
But if someone wants to use UTF8 as a fully functional CDP which
affects all the above functions, sorting order, etc., then they should
use UTF8EX. It consumes more memory and increases the size of the
static executable due to additional information about Unicode
characters, so it's not the default choice.
In short: the "bug" examples in code using the GET system
can be "fixed" by replacing:
    hb_cdpSelect( "UTF8" )
with
    REQUEST HB_CODEPAGE_UTF8EX
    hb_cdpSelect( "UTF8EX" )
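
A short sketch of the difference described above, assuming the source file is saved as UTF-8. "Łoś" has three characters and takes five bytes in UTF-8 ("Ł" and "ś" use two bytes each).

```harbour
REQUEST HB_CODEPAGE_UTF8EX

PROCEDURE Main()

   LOCAL cText := "Łoś"

   hb_cdpSelect( "UTF8" )
   ? Len( cText )          // 5: plain UTF8 CDP counts bytes
   ? hb_utf8Len( cText )   // 3: explicit UTF-8 aware function

   hb_cdpSelect( "UTF8EX" )
   ? Len( cText )          // 3: UTF8EX counts characters
   ? hb_BLen( cText )      // 5: byte length, independent of the CDP

   RETURN
```

The same split applies to LEFT(), SUBSTR() and the other functions listed above, which is why GET misbehaves with the plain UTF8 CDP.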

For more information about these changes in Harbour, look at the ChangeLog entry:
2012-04-20 17:52 UTC+0200 Przemyslaw Czerpak (druzus/at/poczta.onet.pl)

best regards,
Przemek


On 3.09.2025 at 20:03, Aleksander Czajczynski wrote:

Marek Długosz

Sep 4, 2025, 9:07:50 AM
to harbou...@googlegroups.com
Thanks!

I use both Delphi and Harbour, so modifying RDDADS was on my mind. It is not a critical issue. Because of KSeF I need to be prepared for exotic names in UTF8. My canteen program may need to put some foreign names on the menu, for example when serving meals to foreign refugees, soldiers, etc.

Regards

Marek


Marek Długosz

Sep 4, 2025, 11:25:40 AM
to harbou...@googlegroups.com
Hi!
You wrote:
----------------------------
"But if someone wants to use UTF8 as fully functional CDP which
 affects all above functions. sorting order, etc. then he should use
 UTF8EX. It consumes more memory and increases the size of
 static executable due to additional information about Unicode
 characters so it's not default choice.
 In short words: the "bug" examples in code using GET system
 can be "fixed" by replacing:
 hb_cdpSelect( "UTF8" )
 with
 REQUEST HB_CODEPAGE_UTF8EX
 hb_cdpSelect( "UTF8EX" )
---------------------------
I have discovered a convenient way to modify a string (turning a table header into a footer in this case) using FOR EACH, but it does not work in UTF8EX. This is not a problem, because my previous attempt used hb_utf8Peek()/hb_utf8Poke() and that works. Anyway, this is what I have found.

if m='═'
    t:='╧'
    n:='╩'
else //if m='─'
    t:='┴'
    n:='╨'
endif

FOR EACH x IN @s
    switch x
      case '╦'
      case '╥'
        x:=n
        exit
      case '|'
      case '+'
      case '┬'
      case '╤'
        x:=t
        exit
      case '-'
      case '═'
      case '─'
        x:=m
    end switch
NEXT
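
For comparison, a sketch of a CDP-independent alternative that builds a new string with the hb_utf8*() functions and does not modify the string in place. Utf8MapChars() is a hypothetical helper and the map is a simplified subset of the cases above. The source file is assumed to be saved as UTF-8.

```harbour
PROCEDURE Main()

   // Simplified map for the '═' variant: top junctions -> bottom junctions
   LOCAL hMap := { "╦" => "╩", "╥" => "╩", "┬" => "╧", "╤" => "╧" }
   LOCAL cBottom

   cBottom := Utf8MapChars( "╦═╤═╦", hMap )   // "╩═╧═╩"
   ? cBottom

   RETURN

// Hypothetical helper: replace every UTF-8 character found in hMap,
// copy all other characters unchanged.
FUNCTION Utf8MapChars( cUtf8, hMap )

   LOCAL cOut := ""
   LOCAL cChar
   LOCAL i

   FOR i := 1 TO hb_utf8Len( cUtf8 )
      cChar := hb_utf8SubStr( cUtf8, i, 1 )
      cOut += hb_HGetDef( hMap, cChar, cChar )
   NEXT

   RETURN cOut
```

hb_utf8SubStr() scans from the start of the string on every call, so this is quadratic in the string length. That is fine for a table border line but not for large texts.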
