Text encoding under Linux


Alain Aupeix

Aug 21, 2013, 3:14:20 AM
to harbou...@googlegroups.com
Hi,

I have made an improved tool based on gmail.prg for Linux.
It works fine, but I have some trouble with text encoding.

If I use a file with ansi encoding, the mail received is correct (accents, ...)
If I use a file with utf8 encoding, the result is wrong, and it is still wrong after using hb_utf8tostr()

I looked for a function like utf8toansi(), but unfortunately it doesn't exist.

What can I do?

Here is a part of my code:

FUNCTION Main()

......
REQUEST HB_CODEPAGE_FR850
REQUEST HB_CODEPAGE_UTF8
......

  if cBody == NIL
     if isConf
        nConf=searchConf(aConf,nbConf,"iBody")
        if len(aConf[nConf][2])>0
           cBody=aConf[nConf][2]
           if left(cBody,1)=="F"     //if we keep body in a file
              cBody=substr(cBody,2)
              if file(cBody)
                 if lower(right(cBody,4))=".htm".or.lower(right(cBody,5))=".html"
//                  that's ok, nothing to do
                 else
                    cBody=hb_utf8tostr(memoread(cBody))
//                  cBody=memoread(cBody)
                 endif
              endif
           endif
        endif
     endif
     if cBody == NIL
        cerror=cerror+" / "+"Body"
     endif
  endif

Thanks
A+
--

Alain Aupeix
http://jujuland.pagesperso-orange.fr/
http://pissobi-lacassagne.pagesperso-orange.fr/

U.buntu 12.04 | G.ramps 3.4.5-1 | H.arbour 3.2.0dev (2013-07-18 00:24) | HbIDE (Rev.255) | FiveLinux (r138) | Hw.Gui (2129)

Alain Aupeix

Aug 21, 2013, 4:24:05 AM
to harbou...@googlegroups.com
I have also tried this:

cBody=hb_oemtoansi(hb_utf8tostr(memoread(cBody)))

and it behaves exactly like

cBody=hb_utf8tostr(memoread(cBody))

A+

Alain Aupeix

Aug 21, 2013, 6:46:00 AM
to harbou...@googlegroups.com
Well, I have written a function that converts from str to ansi, and from ansi to str.
I'm not sure I have handled all the characters, but it can be improved by adding the missing values.
If you don't want to use the hbft lib, you must replace the hex values with decimal values, but hex values are easier to use because hex editors show them directly ...

/******************************************************************************/
Function str_ansi(cstring, nfrom, nto)
/******************************************************************************/
local cconv, aconv[20][2], rg

aconv[ 1,1]=chr(ft_hex2dec("85"))
aconv[ 1,2]=chr(ft_hex2dec("E0"))
aconv[ 2,1]=chr(ft_hex2dec("83"))
aconv[ 2,2]=chr(ft_hex2dec("E2"))
aconv[ 3,1]=chr(ft_hex2dec("84"))
aconv[ 3,2]=chr(ft_hex2dec("E4"))
aconv[ 4,1]=chr(ft_hex2dec("82"))
aconv[ 4,2]=chr(ft_hex2dec("E9"))
aconv[ 5,1]=chr(ft_hex2dec("8A"))
aconv[ 5,2]=chr(ft_hex2dec("E8"))
aconv[ 6,1]=chr(ft_hex2dec("88"))
aconv[ 6,2]=chr(ft_hex2dec("EA"))
aconv[ 7,1]=chr(ft_hex2dec("89"))
aconv[ 7,2]=chr(ft_hex2dec("EB"))
aconv[ 8,1]=chr(ft_hex2dec("8C"))
aconv[ 8,2]=chr(ft_hex2dec("EE"))
aconv[ 9,1]=chr(ft_hex2dec("8B"))
aconv[ 9,2]=chr(ft_hex2dec("EF"))
aconv[10,1]=chr(ft_hex2dec("93"))
aconv[10,2]=chr(ft_hex2dec("F4"))
aconv[11,1]=chr(ft_hex2dec("94"))
aconv[11,2]=chr(ft_hex2dec("F6"))
aconv[12,1]=chr(ft_hex2dec("97"))
aconv[12,2]=chr(ft_hex2dec("F9"))
aconv[13,1]=chr(ft_hex2dec("96"))
aconv[13,2]=chr(ft_hex2dec("FB"))
aconv[14,1]=chr(ft_hex2dec("81"))
aconv[14,2]=chr(ft_hex2dec("FC"))
aconv[15,1]=chr(ft_hex2dec("87"))
aconv[15,2]=chr(ft_hex2dec("E7"))
aconv[16,1]=chr(ft_hex2dec("8E"))
aconv[16,2]=chr(ft_hex2dec("C4"))
aconv[17,1]=chr(ft_hex2dec("90"))
aconv[17,2]=chr(ft_hex2dec("C9"))
aconv[18,1]=chr(ft_hex2dec("99"))
aconv[18,2]=chr(ft_hex2dec("D6"))
aconv[19,1]=chr(ft_hex2dec("9A"))
aconv[19,2]=chr(ft_hex2dec("DC"))
aconv[20,1]=chr(ft_hex2dec("80"))
aconv[20,2]=chr(ft_hex2dec("C7"))
   
for rg= 1 to 20
    cstring=strtran(cstring,aconv[rg,nfrom],aconv[rg,nto])
next
return(cstring)

/******************************************************************************
  This is the end ...
/******************************************************************************/

The parameters are cstring, nfrom, nto

cstring, 1, 2 to convert from str to ansi
cstring, 2, 1 to convert from ansi to str

Example:

cBody=str_ansi(hb_utf8tostr(memoread(cBody)),1,2) converts from str to ansi

A+

Klas Engwall

Aug 21, 2013, 5:51:19 PM
to harbou...@googlegroups.com
Hi Alain,

> I have made an improved tool based on gmail.prg for Linux.
> It works fine, but I have some troubles with text encoding.
>
> If I use a file with ansi encoding, the mail received is correct
> (accents, ...)
> If I use a file with utf8 encoding, that is not ok, same after using
> hb_utf8tostr()
>
> I have tried to use a function like utf8toansi(), but unfortunately, it
> doesn't exist.
>
> How can I do ?

You can use HB_Translate( cYourString, 'UTF8EX', cYourAnsiCodePage )
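
A minimal sketch of that call, assuming "FRISO" (ISO-8859-1) as the target codepage and a hard-coded UTF-8 sample string; both are illustrative choices, not taken from the original program:

```harbour
REQUEST HB_CODEPAGE_UTF8EX
REQUEST HB_CODEPAGE_FRISO

PROCEDURE Main()

   // "café" as raw UTF-8 bytes: the é is the two-byte sequence C3 A9
   LOCAL cUtf8 := "caf" + hb_BChar( 0xC3 ) + hb_BChar( 0xA9 )

   // Translate from UTF-8 to the 8-bit codepage named in the third argument
   LOCAL cAnsi := hb_Translate( cUtf8, "UTF8EX", "FRISO" )

   // Dump both strings as hex; in ISO-8859-1 the é is the single byte 0xE9
   ? "UTF-8 :", hb_StrToHex( cUtf8, " " )
   ? "FRISO :", hb_StrToHex( cAnsi, " " )

   RETURN
```

Both codepages have to be REQUESTed so that they are linked into the program before HB_Translate() can use them.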

But isn't the problem rather that the <cCharSet> argument to
HB_SendMail() is missing? This results in HB_SendMail() defaulting it to
"ISO-8859-1".

Check the (too short) argument list in gmail.prg against the argument
list in sendmail.prg. And please note that the docs inside HB_SendMail()
have not been updated to reflect the entire set of arguments.

Regards,
Klas

Klas Engwall

Aug 21, 2013, 6:20:21 PM
to harbou...@googlegroups.com
BTW Alain,

> REQUEST HB_CODEPAGE_FR850
> REQUEST HB_CODEPAGE_UTF8

> ......

> cBody=hb_utf8tostr(memoread(cBody))

I suppose you are also setting the codepage to "FR850" in between those
lines, so you are getting <cBody> translated to that codepage, right?
And that is also not what HB_SendMail() expects.

Regards,
Klas

Alain Aupeix

Aug 22, 2013, 2:52:36 AM
to harbou...@googlegroups.com
On 22/08/2013 00:20, Klas Engwall wrote:
> BTW Alain,
>
> REQUEST HB_CODEPAGE_FR850
> REQUEST HB_CODEPAGE_UTF8
>
> ......
>
> cBody=hb_utf8tostr(memoread(cBody))
>
> I suppose you are also setting the codepage to "FR850" in between those lines,

No, I don't set it.

> so you are getting <cBody> translated to that codepage, right?

I don't know how hb_utf8tostr() works, but on the screen I can see that the conversion is done.
The main reason I use hb_utf8tostr() is that all the string manipulation functions (strtran, substr, ...) then work, which they do not on utf8 data, and it's easier to convert from str to ansi than from utf8 to ansi.

> And that is also not what HB_SendMail() expects.

All my tests show that sendmail expects ansi.

In a previous answer you mentioned that the hb_sendmail() documentation was not up to date.
I agree: there are now 25 parameters, more than the documentation included in hb_sendmail.prg describes.

In the prg, there is this:

   hb_default( @cCharset, "ISO-8859-1" )
   hb_default( @cEncoding, "quoted-printable" )


I know about the Charset parameter, but it is not used (NIL) in gmail, nor in my code. I think that this parameter is used for html code, and here I'm using an ansi test file.

What value should I use to get correct text?
What is the resulting encoding of hb_utf8tostr()?

The complete source of smail is available on my website JujuLand => Ubuntu
I don't think it will work under Windows without some small modifications.

Alain Aupeix

Aug 22, 2013, 4:28:45 AM
to harbou...@googlegroups.com
On 22/08/2013 08:52, Alain Aupeix wrote:
> I think that this parameter is used for html code. And here I'm using ansi testfile.

I have tested with an encoding (FR850, for example) and I saw my mistake: the encoding parameter is also used for plain text, not only for html files.
I am thinking of eliminating a few of the conversion functions ...

Klas Engwall

Aug 22, 2013, 7:30:13 AM
to harbou...@googlegroups.com
Hi Alain,

>>> REQUEST HB_CODEPAGE_FR850
>>> REQUEST HB_CODEPAGE_UTF8
>>
>>> ......
>>
>>> cBody=hb_utf8tostr(memoread(cBody))
>>
>> I suppose you are also setting the codepage to "FR850" in between those
>> lines,
> No, I don't set it.

If you are not setting the codepage for the VM, then the default is
used: "EN", which is a 437 OEM codepage.

>> so you are getting <cBody> translated to that codepage, right?
> I don't know how works hb_utf8tostr(), but on the screen, I see that the
> change is done.

What is shown on the screen is also affected by the codepage used by the
VM. To analyze the result without risking side effects, converting it
to hex is safer.
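
One way to do that check, sketched with hb_StrToHex() and a hard-coded sample string (the sample text and variable names are only illustrative):

```harbour
PROCEDURE Main()

   // "café" as raw UTF-8 bytes: the é is the two-byte sequence C3 A9
   LOCAL cUtf8 := "caf" + hb_BChar( 0xC3 ) + hb_BChar( 0xA9 )

   // No codepage argument: the result is in the current VM codepage
   LOCAL cStr := hb_UTF8ToStr( cUtf8 )

   // hb_cdpSelect() without arguments returns the id of the VM codepage
   ? "VM codepage   :", hb_cdpSelect()

   // The hex dumps show the actual bytes, independent of the terminal
   ? "before (UTF-8):", hb_StrToHex( cUtf8, " " )
   ? "after         :", hb_StrToHex( cStr, " " )

   RETURN
```

Comparing the byte that replaces "C3 A9" against a codepage table shows which 8-bit codepage the text really ended up in.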

> The principal reason, I use hb_utf8tostr() is that all the string
> manipulations functions works, instead of using utf8 (strtran,
> substr,...), and it's easier to convert from str to ansi than from utf8
> to ansi.

HB_Translate() can convert from any codepage to any codepage and
HB_Utf8ToStr() and HB_StrToUtf8() can convert between UTF-8 and any
8-bit codepage ("str" can be text in any 8-bit codepage, see below).
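
A short sketch of the explicit-codepage form, assuming "FRISO" and "FR850" as the targets and a hard-coded sample string (illustrative choices only):

```harbour
REQUEST HB_CODEPAGE_FRISO
REQUEST HB_CODEPAGE_FR850

PROCEDURE Main()

   // "café" as raw UTF-8 bytes: the é is the two-byte sequence C3 A9
   LOCAL cUtf8 := "caf" + hb_BChar( 0xC3 ) + hb_BChar( 0xA9 )

   // The second argument selects the target codepage instead of the VM default
   LOCAL cIso := hb_UTF8ToStr( cUtf8, "FRISO" )
   LOCAL c850 := hb_UTF8ToStr( cUtf8, "FR850" )

   // Same text, different bytes for the accented character
   ? "FRISO:", hb_StrToHex( cIso, " " )
   ? "FR850:", hb_StrToHex( c850, " " )

   // Converting back with the same codepage id should give the original UTF-8
   ? "round trip equal:", hb_StrToUTF8( cIso, "FRISO" ) == cUtf8

   RETURN
```

Passing the codepage id explicitly keeps the conversion independent of whatever codepage the VM happens to be using.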

>> And that is also not what HB_SendMail() expects.
> All my test shows that sendmail expected ansi.

It is really all about what HB_SendMail() tells the receiving mail
client to interpret the text as, and that happens in the Content-Type
header where the <cCharSet> argument is put if it is passed. If not
passed, "ISO-8859-1" is used by default:

Content-Type: text/plain; charset="ISO-8859-1"

If your body of text is not in ISO-8859-1 then you have to pass the
<cCharSet> argument to match it.

> In a previous answer you're talking of hb_sendmail() about the fact that
> the doc wasn't up to date.
> I agree, and now, there are 25 parameters instead of the doc included in
> hb_sendmail.prg
>
> In the prg, there is this:
>
> hb_default( @cCharset, "ISO-8859-1" )

Yes, that is where it goes wrong if you pass UTF-8 data and no UTF-8
<cCharSet> argument.

> I know the Charset parameter, but in gmail, it's not used (NIL),

If you mean the gmail.prg sample code, then it is just an incomplete
sample. Add the missing arguments and the problem should be solved.

> and in
> mine too. I think that this parameter is used for html code. And here I'm
> using ansi testfile.

I noticed in your other post that you have now seen the light :-)

> What could be the value to use to have a correct text ?

Pass "UTF-8" in the <cCharSet> argument

> What is the result encoding of hb_utf8tostr() ?

2007-06-23 11:10 UTC+0200 Przemyslaw Czerpak (druzus/at/priv.onet.pl)
  [...]
  * harbour/source/rtl/cdpapi.c
    + added two prg functions for translations from/to UTF-8:
         HB_STRTOUTF8( <cStr> [, <cCPID> ] ) -> <cUTF8Str>
         HB_UTF8TOSTR( <cUTF8Str> [, <cCPID> ] ) -> <cStr>
      <cCPID> is Harbour codepage id, f.e.: "EN", "ES", "ESWIN",
      "PLISO", "PLMAZ", "PL852", "PLWIN", ...
      When not given then default HVM codepage (set by HB_SETCODEPAGE())
      is used.

Regards,
Klas

Alain Aupeix

Aug 22, 2013, 8:14:51 AM
to harbou...@googlegroups.com, Klas Engwall
On 22/08/2013 13:30, Klas Engwall wrote:
> I noticed in your other post that you have now seen the light :-)

Hum, the light was not very bright ...
I tried some functions: hb_cdpselect(), hb_langselect(), and nothing worked.

> Yes, that is where it goes wrong if you pass UTF-8 data and no UTF-8 <cCharSet> argument.

In fact, keeping the data in UTF-8, as on the command line or in the file, and setting the charset to UTF-8 for hb_sendmail did the trick.

Why keep it simple when we can make it complicated :-!

>> What is the result encoding of hb_utf8tostr() ?
>
> 2007-06-23 11:10 UTC+0200 Przemyslaw Czerpak (druzus/at/priv.onet.pl)
>   [...]
>   * harbour/source/rtl/cdpapi.c
>     + added two prg functions for translations from/to UTF-8:
>          HB_STRTOUTF8( <cStr> [, <cCPID> ] ) -> <cUTF8Str>
>          HB_UTF8TOSTR( <cUTF8Str> [, <cCPID> ] ) -> <cStr>
>       <cCPID> is Harbour codepage id, f.e.: "EN", "ES", "ESWIN",
>       "PLISO", "PLMAZ", "PL852", "PLWIN", ...
>       When not given then default HVM codepage (set by HB_SETCODEPAGE())

I tried with the parameter, and it works too.

Thanks for debugging me.

I haven't yet chosen which way I will use, but I know it will work without my conversion function ...

Just one question concerning CPID: why isn't there a FRWIN, like ESWIN?

Klas Engwall

Aug 22, 2013, 5:58:05 PM
to harbou...@googlegroups.com
Hi Alain,

Please do not CC me. The CC arrives before the main message which is
then discarded by the mail server, and that confuses my mail client.

>> I noticed in your other post that you have now seen the light :-)
> Hum, the light was not too much brilliant ...
> I tried some functions : hb_cdpselect(), hb_langselect(), and nothing
> was ok

It can be done in different ways (but not with HB_LangSelect() which is
something different). I always use Set( _SET_CODEPAGE, cCodePage ) after
first requesting it. In my case:

   request hb_codepage_svwin
   ...
   Set( _SET_CODEPAGE, 'SVWIN' )

I also use

   request hb_codepage_sv437c
   ...
   Set( _SET_DBCODEPAGE, 'SV437C' )

for compatibility with my old Clipper-created dbf files and get
automatic conversion between the two.

>> Yes, that is where it goes wrong if you pass UTF-8 data and no UTF-8
>> <cCharSet> argument.
> In fact, keeping datas in UTF-8, as on the command line or in the file,
> and setting charset to UTF-8 for hb_sendmail done the trick.

That is good to hear!

> Why to do simple, when we can do complicated :-!

:-)

>>> What is the result encoding of hb_utf8tostr() ?
>>
>> 2007-06-23 11:10 UTC+0200 Przemyslaw Czerpak (druzus/at/priv.onet.pl)
>> [...]
>> * harbour/source/rtl/cdpapi.c
>> + added two prg functions for translations from/to UTF-8:
>> HB_STRTOUTF8( <cStr> [, <cCPID> ] ) -> <cUTF8Str>
>> HB_UTF8TOSTR( <cUTF8Str> [, <cCPID> ] ) -> <cStr>
>> <cCPID> is Harbour codepage id, f.e.: "EN", "ES", "ESWIN",
>> "PLISO", "PLMAZ", "PL852", "PLWIN", ...
>> When not given then default HVM codepage (set by HB_SETCODEPAGE())
> I tried with the parameter, and it works too

OK :-)

> Thanks for debugging me
>
> I haven't now choiced the way I will use, but i know it will work
> without my converting function ...
>
> Just a question, concerning CPID: why isn't there a FRWIN ? Same as ESWIN ?

But there is. And it is called exactly "FRWIN". There is also "FRISO".
Request it and then Set() it. Check out the cpfrwin.c, cpfriso.c and
other .c source files in the src/codepage directory for a complete list
of all the available codepages.

Regards,
Klas