Convert Japanese to unicode characters for use with SAPI text to speech

sled...@gmail.com

Apr 11, 2018, 1:02:28 PM
I am trying to use the text 今日は元気ですか with MS SAPI. However, SAPI TTS requires either romaji or UTF-8 encoded Unicode.

Thus, the question is "How to get the utf-8 values corresponding to the Japanese characters?"

The examples on the wiki assume one already has one or more hex byte values, then uses encoding to get the values for use with \u????.

It is not clear to me how to get the hex values, or if there is another way to do it.

e.g. set ha [encoding convertfrom euc-jp "\xA4\xCF"]

However, it is not clear to me that using encoding different from the system is even needed.

In addition, having read the TCL X docs on internationalization, it is not clear to me how to proceed.

BTW: the translation is "How are you today".

Appreciate any assistance.

Rich

Apr 11, 2018, 4:25:41 PM
sled...@gmail.com wrote:
> Am trying to use the text 今日は元気ですか with ms SAPI. However, SAPI TTS
> requires either romanji or utf-8, unicode.
>
> Thus, the question is "How to get the utf-8 values corresponding to
> the Japanese characters?

Where do you have "the Japanese characters"? Note that the rest of
what I write below assumes you have the string you want in a Tcl
variable somewhere. If that is not the case then you need to give more
detail on what you want to do.

> The examples on wiki assume one gets the one or more hex characters,
> then use encode to get the values for use with \u????.
>
> It is not clear to me how to get the hex values, or if there is
> another way to do it.

If you have a string in Tcl, and you would like to see the actual bytes
behind an encoding, then do this:

set encoded [encoding convertto utf-8 $original_tcl_string]

That gives you a 'binary string' that is encoded according to the rules
of that encoding (in this example, utf-8).

Then, to get hex output of the bytes of the encoded data value, the
easiest way is to use the 'binary' command:

binary scan $encoded H* hexstring

And now the variable hexstring contains the hex encoded values of the
utf-8 bytes in '$encoded'.

Example (using the Unicode division sign, code point U+00F7,
https://www.fileformat.info/info/unicode/char/00f7/index.htm):

$ rlwrap tclsh
% set original_tcl_string \u00f7
÷
% set encoded [encoding convertto utf-8 $original_tcl_string]
÷
% binary scan $encoded H* hexstring
1
% set hexstring
c3b7
%

And, if you look on the page I provide a link to above, you'll see that
the utf-8 encoding of \u00f7 is 0xC3 0xB7, which is what came out from
Tcl above.

> However, it is not clear to me that using encoding different from the
> system is even needed.

You'd want to "convertto" whatever encoding the receiver of this data
expects, then ship it off as if it were a binary string.
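
For the Japanese text from the original question, the same two steps might look like the sketch below. The variable names are only illustrative, and the string is written with \u escapes so the result does not depend on the encoding of the script file.

```tcl
# Sketch only: the text 今日は元気ですか spelled out as code point escapes.
set text "\u4eca\u65e5\u306f\u5143\u6c17\u3067\u3059\u304b"

# Encode the Tcl string as UTF-8, which yields a byte string.
set utf8 [encoding convertto utf-8 $text]

# Dump those bytes as lowercase hex digits into the variable 'hex'.
binary scan $utf8 H* hex
puts $hex
```

Each of the eight characters takes three bytes in UTF-8, so 'hex' ends up holding 48 hex digits.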

sled...@gmail.com

Apr 12, 2018, 4:20:34 AM
thanks for the reply...

set j 今日は元気ですか

Is there a more direct way to get the \uxxxx values of the characters stored in a string?

It seemed obtaining the hex values was a prereq to getting the corresponding \u values. But is that really necessary?

sled...@gmail.com

Apr 12, 2018, 4:37:35 AM
Please let me rephrase what I am trying to do:
Given the char ÷, how does one get to \u00f7

The way I posited the problem, it seemed the purpose was to obtain hex values.

Rick

Brad Lanam

Apr 12, 2018, 7:19:46 AM
On Thursday, April 12, 2018 at 1:37:35 AM UTC-7, sled...@gmail.com wrote:
> Please let me rephrase what I am trying to do:
> Given the char ÷, how does one get to \u00f7
>
> The way I posited the problem, it seemed the purpose was to obtain hex values.

Donal Fellows has this code on the wiki, and I turned it into a little
utility program:

#!/usr/bin/tclsh

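proc u2a {s} {
    set res ""
    foreach i [split $s ""] {
        # %c yields the Unicode code point of the character
        scan $i %c c
        # keep ASCII as-is, escape everything else as \uXXXX
        if {$c<128} {append res $i} else {append res \\u[format %04.4X $c]}
    }
    return $res
}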

puts [u2a [lindex $::argv 0]]

Rich

Apr 12, 2018, 7:25:01 AM
sled...@gmail.com wrote:

[Note, quoting some context from the prior article is always a good
idea. This is Usenet after all, not Google. Note the quotations below.]

> thanks for the reply...
>
> set j 今日は元気ですか
>
> Is there a more direct way to get the \uxxxx values of the characters
> stored in a string?

What I showed you is the direct way to get the hex values of the
characters stored in a string. And it sounded (from your first vague
posting) like this was what you wanted.

But if you want the code point values (the xxxx part of the \uxxxx
escape is a "code point" value, which is *different* from "the bytes of
a utf-8 encoded string") you'd just iterate over the string by character
and use the %c conversion of the 'scan' command to obtain the code
point value. Then you can use the %x conversion of format to get the
hex values:

$ rlwrap tclsh
% set str "Hello."
Hello.
% foreach c [split $str ""] {
puts -nonewline [format {\u%04x} [scan $c %c z ; set z]]
} ; puts ""
\u0048\u0065\u006c\u006c\u006f\u002e
%

And that string of \uxxxx items is the equivalent of the "Hello."
string that was first put into the "str" variable. You could also do:

set str "\u0048\u0065\u006c\u006c\u006f\u002e"

And you end up with the identical string in "str" because the Tcl
parser interprets the \uxxxx escapes for you, converting each to the
character represented by that unicode code point.

But if you already have a string, unless you want to write it out in a
Tcl script or otherwise convert it to the \uxxxx format for feeding to
the Tcl parser, you don't need the hex values. The hex escape (\uxxxx)
is for entering 'characters' into your script code when you cannot
otherwise type them directly on your keyboard.

> It seemed obtaining the hex values was a prereq to getting the
> corresponding \u values. But is that really necessary?

You need the code point values for creating proper \uxxxx escapes. But
unless you are writing code to output Tcl code, or unless wherever you
are sending this data understands how to interpret the \uxxxx escapes,
they are not very useful to you other than as a debugging aid to see
what code points are actually in the string.
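
As a debugging aid of that kind, the loop could be wrapped in a small proc, roughly as sketched here. The proc name 'to_escapes' is hypothetical, the sketch assumes every character fits in four hex digits (the Basic Multilingual Plane), and it calls scan without a variable argument so that the code point is returned directly.

```tcl
# Hypothetical helper: turn every character of a string into a \uxxxx escape.
# Assumes all characters are in the Basic Multilingual Plane.
proc to_escapes {str} {
    set out ""
    foreach ch [split $str ""] {
        # With no variable argument, scan returns the code point inline.
        append out [format {\u%04x} [scan $ch %c]]
    }
    return $out
}

set text "\u4eca\u65e5\u306f"
set escaped [to_escapes $text]
puts $escaped

# subst performs backslash substitution, so it turns the escapes back into
# characters; the round trip should reproduce the original string.
set back [subst -nocommands -novariables $escaped]
puts [expr {$back eq $text}]
```

The round trip through subst is only a sanity check that the escapes describe the same code points as the original string.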

Rich

Apr 12, 2018, 7:28:56 AM
sled...@gmail.com wrote:
> On Wednesday, April 11, 2018 at 1:25:41 PM UTC-7, Rich wrote:
>> sled...@gmail.com wrote:
>> > Am trying to use the text 今日は元気ですか with ms SAPI. However, SAPI TTS
>> > requires either romanji or utf-8, unicode.
>> >
>> > Thus, the question is "How to get the utf-8 values corresponding to
>> > the Japanese characters?
>>
>
> Please let me rephrase what I am trying to do:
> Given the char ÷, how does one get to \u00f7
>
> The way I posited the problem, it seemed the purpose was to obtain
> hex values.

Yes, your original question was quite open to multiple
interpretations.

For your new question, the exact answer is:

format {\u%04x} [scan ÷ %c z ; set z]

Testing that gives:

$ rlwrap tclsh
% format {\u%04x} [scan ÷ %c z ; set z]
\u00f7

Note, the "z ; set z" part above is a 'trick' to convert scan from
"write to a variable" (its default operation mode) into "return result
from evaluation" mode. The 'z' variable is still written, but the
result from the [] evaluation is the contents of z, not the return
value from 'scan'.

Yusuke Yamasaki

Apr 12, 2018, 10:23:24 PM
How about using "map"?
I also used tcllib package to get integer list of unicode points.

package require unicode
proc map {lambda list} {
    set result {}
    foreach item $list {
        lappend result [apply $lambda $item]
    }
    return $result
}
proc get_unicode_point {str} {
    join [map {x {format {\u%04x} $x}} [unicode::fromstring $str]] ""
}

set result [get_unicode_point "今日は元気ですか"]
puts $result

Rich

Apr 13, 2018, 12:48:05 AM
Yusuke Yamasaki <tm92...@gmail.com> wrote:
> How about using "map"?

If one has 8.6, one can use the lmap builtin, and not have to write
their own map implementation.

> I also used tcllib package to get integer list of unicode points.

Nice, I had not noticed this package in tcllib.

> package require unicode
> proc map {lambda list} {
> set result {}
> foreach item $list {
> lappend result [apply $lambda $item]
> }
> return $result
> }
> proc get_unicode_point {str} {
> join [map {x {format {\u%04x} $x}} [unicode::fromstring $str]] ""
> }

For lmap (8.6+), you'd have:

package require unicode
puts [join [lmap x [unicode::fromstring $str] {format {\u%04x} $x}] ""]

Schelte Bron

Apr 13, 2018, 7:18:37 AM
Rich wrote:
> % format {\u%04x} [scan ÷ %c z ; set z]
> \u00f7
>
> Note, the "z ; set z" part above is a 'trick' to convert scan from
> "write to a variable" (its default operation mode) into "return
> result from evaluation" mode.

What would be a reason to use this trick, rather than simply:

% format {\u%04x} {*}[scan ÷ %c]
\u00f7
?


Schelte.

Rich

Apr 13, 2018, 8:56:55 AM
Because I didn't read far enough into the first paragraph of the
manpage when refreshing my memory, and missed this:

If no varName variables are specified, then scan works in an inline
manner,

The trick is not needed for scan. It is needed for [binary scan].
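
A small side-by-side sketch of that difference, using the division sign from earlier in the thread (the variable names are only illustrative):

```tcl
# Inline form of scan: no variable name given, so the value is returned.
set cp [scan "\u00f7" %c]
puts [format {\u%04x} $cp]

# binary scan has no inline form: it stores into the named variable and
# returns the number of conversions performed.
set bytes [encoding convertto utf-8 "\u00f7"]
set count [binary scan $bytes H* hex]
puts "$count $hex"
```

So with [binary scan] the hex digits have to be read back from the variable afterwards, which is where the "; set z" style of trick still applies.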

Don Porter

Apr 13, 2018, 9:47:49 AM
On 04/12/2018 07:28 AM, Rich wrote:
> % format {\u%04x} [scan ÷ %c z ; set z]

> Note, the "z ; set z" part above is a 'trick' to convert scan from
> "write to a variable" (its default operation mode) into "return result
> from evaluation" mode.

This is advice from 1999. [scan] has been able to directly return its
matches since Tcl 8.3.0 was released in February 2000.

http://wiki.tcl.tk/682

% info patchlevel
8.3.5
% scan A %c
65

--
| Don Porter Applied and Computational Mathematics Division |
| donald...@nist.gov Information Technology Laboratory |
| http://math.nist.gov/~DPorter/ NIST |
|______________________________________________________________________|

sled...@gmail.com

Apr 13, 2018, 1:22:48 PM
Thanks to all who replied...

sled...@gmail.com

Apr 13, 2018, 1:27:01 PM
Perfect...

Rich

Apr 13, 2018, 4:25:58 PM
Don Porter <donald...@nist.gov> wrote:
> On 04/12/2018 07:28 AM, Rich wrote:
>> % format {\u%04x} [scan ÷ %c z ; set z]
>
>> Note, the "z ; set z" part above is a 'trick' to convert scan from
>> "write to a variable" (its default operation mode) into "return result
>> from evaluation" mode.
>
> This is advice from 1999. [scan] has been able to directly return its
> matches since Tcl 8.3.0 was released in February 2000.
>
> http://wiki.tcl.tk/682
>
> % info patchlevel
> 8.3.5
> % scan A %c
> 65

Yes, goof admitted in this posting: <paq9ek$k8h$1...@dont-email.me>

I didn't read far enough into the first paragraph of the scan manpage
when refreshing my memory before posting.