How to translate utf-8 hex to unicode hex?

Sean

Nov 10, 2009, 1:44:51 PM
to vim_use
Hi,

My input comes from HTTP: three percent-encoded bytes, i.e. the hex
values of a UTF-8 sequence.
What I want is the 2-byte Unicode code point.

For example:
let input = "%E9%A6%AC"
let output = "99AC"

Based on the output, I can then get the real CJK: 馬.

Is it possible to do it from within Vim?

Thanks

Sean

Tony Mechelynck

Nov 10, 2009, 3:06:24 PM
to vim...@googlegroups.com, Sean
You can do it the hard way, with arithmetic computations which I shall
explain below.

Or you can do it the easy way, by writing the bytes to disc as if they
were Latin1 (see ":help ++opt") and reading them back as UTF-8.

Or you can use the iconv() function (q.v.).


UTF-8 bytes are divided in "waterproof" categories as follows:

- Bytes 0x00 to 0x7F are "single" bytes, they each represent a single
codepoint in the exact same format as in Latin-1 or 7-bit US-ASCII.

- Bytes 0xC0 to (currently) 0xF4 or (as originally foreseen and still
supported by Vim) 0xFD are "header" bytes in a multibyte sequence. Such
a byte MUST be the first byte of its sequence, and the number of "one"
bits above the topmost "zero" bit indicates the number of bytes
(including this one) in the whole sequence. (Under the current standard,
0xC0 and 0xC1 never occur either, since they could only start overlong
encodings of ASCII.)

- Bytes 0x80 to 0xBF are "trailer" bytes in a multibyte sequence. They
can be any byte in the sequence except the first.

- Bytes 0xFE and 0xFF are always invalid anywhere in UTF-8 text.

- In the bytes of a multibyte sequence, all bits after the topmost
"zero" bit in each byte constitute the "payload": they are data bits,
and in UTF-8 the most significant bits always come first.
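
The byte categories above can be sketched in a few lines. This is only an illustration in Python rather than Vim script; the function names `classify_utf8_byte` and `sequence_length` are made up for this example, and the header range follows the original 0xC0 to 0xFD scheme described above, not the stricter current standard.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify one byte value (0-255) by its role in UTF-8."""
    if b <= 0x7F:
        return "single"    # same as 7-bit US-ASCII
    if b <= 0xBF:
        return "trailer"   # 10xxxxxx: any byte of a sequence except the first
    if b <= 0xFD:
        return "header"    # 110xxxxx and up: first byte of a sequence
    return "invalid"       # 0xFE and 0xFF never appear in UTF-8


def sequence_length(header: int) -> int:
    """Count the one bits above the topmost zero bit of a header byte."""
    length = 0
    mask = 0x80
    while header & mask:
        length += 1
        mask >>= 1
    return length


print(classify_utf8_byte(0xE9), sequence_length(0xE9))  # header 3
```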


Your example translates as follows:

0xE9 = 1110.1001 binary
    header byte
    the sequence is of three bytes
    payload: 1001
0xA6 = 1010.0110 binary
    trailer byte
    payload: 100110
0xAC = 1010.1100 binary
    trailer byte
    payload: 101100

Result (concatenated payload bits): 1001.1001.1010.1100 binary, or U+99AC
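
The same arithmetic can be written out as a short program. This is a minimal sketch in Python, not Vim script, and `decode_utf8_sequence` is a hypothetical name chosen for this example; it does no overlong or range checking beyond the basic byte roles.

```python
def decode_utf8_sequence(seq: bytes) -> int:
    """Return the codepoint encoded by one complete UTF-8 sequence."""
    header = seq[0]
    if header < 0x80:
        if len(seq) != 1:
            raise ValueError("a single byte cannot be followed by trailers")
        return header

    # Count the one bits above the topmost zero bit of the header.
    length = 0
    mask = 0x80
    while header & mask:
        length += 1
        mask >>= 1
    if length < 2 or length > 6 or length != len(seq):
        raise ValueError("bad header byte or wrong sequence length")

    # The header payload is every bit below that topmost zero bit.
    codepoint = header & (mask - 1)
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("expected a trailer byte (0x80 to 0xBF)")
        # Each trailer contributes six payload bits, most significant first.
        codepoint = (codepoint << 6) | (b & 0x3F)
    return codepoint


cp = decode_utf8_sequence(bytes([0xE9, 0xA6, 0xAC]))
print("%04X" % cp)  # 99AC
```

The four-byte sequence 0xF0 0xA0 0x84 0xA3 goes through the same loop with one more trailer and yields 0x20123.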

Note that some hanzi are above U+20000; the UTF-8 code for them consists
of four bytes, not three: e.g. 𠄣 = U+20123 = UTF-8 0xF0 0xA0 0x84 0xA3
= %F0%A0%84%A3 in "percent-escaped" HTTP coding.


The Unicode code space had originally been foreseen as ranging from
U+0000 to U+7FFFFFFF, but the current standards end it at U+10FFFF;
also, codepoints whose hex representation is xxFFFE or xxFFFF (where xx
is anything) have been expressly designated as noncharacters, never to
be used, so the highest usable codepoint is U+10FFFD.
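
The xxFFFE / xxFFFF rule and the upper limit are easy to check mechanically. A small sketch in Python, with hypothetical helper names, covering only the two rules mentioned in this message:

```python
def in_current_code_space(cp: int) -> bool:
    """True if cp lies within the present-day limit of the code space."""
    return 0 <= cp <= 0x10FFFF


def ends_in_fffe_or_ffff(cp: int) -> bool:
    """True for xxFFFE and xxFFFF, whatever xx is."""
    # Masking with 0xFFFE ignores the lowest bit, so FFFE and FFFF both match.
    return (cp & 0xFFFE) == 0xFFFE


for cp in (0x99AC, 0xFFFE, 0x10FFFD, 0x10FFFF):
    print("U+%04X" % cp, in_current_code_space(cp), ends_in_fffe_or_ffff(cp))
```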


Best regards,
Tony.
--
Putt's Law:
Technology is dominated by two types of people:
Those who understand what they do not manage.
Those who manage what they do not understand.

Sean

Nov 10, 2009, 4:24:19 PM
to vim_use
Hi Tony,

I thought I had enough knowledge on UNICODE and UTF8, but it is
nothing after reading your message.

Now, I get what I want:

let input = "\xE9\xA6\xAC"
let output=iconv(input, "utf-8", "utf8")

Bingo! The output is real ==> '馬'

Thanks again.

Sean

Tony Mechelynck

Nov 10, 2009, 6:05:19 PM
to vim...@googlegroups.com, Sean
On 10/11/09 22:24, Sean wrote:
>
> Hi Tony,
>
> I thought I had enough knowledge on UNICODE and UTF8, but it is
> nothing after reading your message.
>
> Now, I get what I want:
>
> let input = "\xE9\xA6\xAC"
> let output=iconv(input, "utf-8", "utf8")
>
> Bingo! The output is real ==> '馬'
>
> Thanks again.
>
> Sean

The particular parameters you give to iconv() (UTF-8 to UTF-8) make it
an identity conversion: the output is the same as the input.

When I do

:echo "\xE9\xA6\xAC"

(with 'encoding' set to "utf-8") the result is

馬

Best regards,
Tony.
--
"I am, in point of fact, a particularly haughty and exclusive person,
of pre-Adamite ancestral descent. You will understand this when I tell
you that I can trace my ancestry back to a protoplasmal primordial
atomic globule. Consequently, my family pride is something
inconceivable. I can't help it. I was born sneering."
-- Pooh-Bah, "The Mikado", Gilbert & Sullivan

Sean

Nov 10, 2009, 8:06:56 PM
to vim_use
Hi Tony,

The last mile to go:

This always worked: (output is 馬)
-------------------------------------------
let input = "\xE9\xA6\xAC"
let output = iconv(input, "UTF-8", "UTF-8")
-------------------------------------------

However, this failed: (output is '\xE9\xA6\xAC')
-------------------------------------------
let input = "%E9%A6%AC"
let input = substitute(input, '%', '\\x', 'g')
let output = iconv(input, "UTF-8", "UTF-8")
-------------------------------------------

Now, the key becomes how to translate string (single quoted) to string
(double quoted). I guess that "\x" is meaningful only within double
quote.

Thanks

Sean

Tony Mechelynck

Nov 10, 2009, 9:22:25 PM
to vim...@googlegroups.com, Sean
what about (untested)

function HttpToString(str)
    return substitute(a:str, '%\(\x\x\)',
        \ '\=eval(''"\x'' . submatch(1) . ''"'')', 'g')
endfunction

If I haven't goofed,

:echo HttpToString('%E9%A6%AC')

ought to return

馬


Note the use of pairs of single quotes to represent actual single quotes
in a single-quoted string.

The use of a continuation line assumes 'nocompatible'.

See also
:help sub-replace-expression
:help eval()
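
For comparison outside Vim, the same idea (a regex substitution whose replacement is computed per match) can be sketched in Python. The name `http_to_string` is made up for this example; `urllib.parse.unquote` is the standard-library function that does the same job and is used here only as a cross-check. Invalid UTF-8 in the input would raise an error in this sketch.

```python
import re
from urllib.parse import unquote


def http_to_string(text: str) -> str:
    """Replace each %XX by the byte it names, then read the bytes as UTF-8."""
    raw = re.sub(
        rb"%([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),  # one byte per %XX match
        text.encode("utf-8"),
    )
    return raw.decode("utf-8")


decoded = http_to_string("%E9%A6%AC")
print(decoded == unquote("%E9%A6%AC"))  # True
print("U+%04X" % ord(decoded))          # U+99AC
```

Working on bytes first and decoding once at the end matters: the three bytes only make sense as a character when taken together.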


Best regards,
Tony.
--
Lizzie Borden took an axe,
And plunged it deep into the VAX;
Don't you envy people who
Do all the things _YOU_ want to do?

Sean

Nov 10, 2009, 11:15:32 PM
to vim_use
Hi Tony,

You are a real genius! It simply worked without modification!

I added it as part of VimIM plugin online:
http://maxiangjiang.googlepages.com/vimim.vim.html
" ================================ }}}
" ==== VimIM SoGou Cloud IM ==== {{{
" ====================================

Now, let me show you the power of Vim:

input in PinYin => woyouyigeqiguaidemeilidemeng
output in Chinese => 我有一个奇怪的美丽的梦
It is meaningless :) => "I have a strange but beautiful dream."

This is my gift to you:
http://maxiangjiang.googlepages.com/dream.png

Thanks

Sean

winterTTr

Nov 11, 2009, 12:25:20 AM
to vim...@googlegroups.com
You are the author of VimIM? Nice to see you here.

And you want to take the SoGou cloud input result and translate the
content from the URL into Vim text?
Really amazing idea ;-)


Tony Mechelynck

Nov 11, 2009, 2:13:41 AM
to vim...@googlegroups.com, Sean
On 11/11/09 05:15, Sean wrote:
> Hi Tony,
>
> You are a real genius! It simply worked without modification!
>
> I added it as part of VimIM plugin online:
> http://maxiangjiang.googlepages.com/vimim.vim.html
> " ================================ }}}
> " ==== VimIM SoGou Cloud IM ==== {{{
> " ====================================
>
> Now, let me show you the power of Vim:
>
> input in PinYin => woyouyigeqiguaidemeilidemeng
> output in Chinese => 我有一个奇怪的美丽的梦
> It is meaningless :) => "I have a strange but beautiful dream."

Meaningless? It evokes powerful meanings to me; I link it with Martin
Luther King's famous "I Have a Dream" speech, and, maybe less known,
the Hymn ("La Espero", i.e. "Hope") of the Esperantist movement, which
ends in words meaning "Our diligent colleagues won't tire in a labour of
peace, till the beautiful dream of mankind shall come true for eternal
blessing".

>
> This is my gift to you:
> http://maxiangjiang.googlepages.com/dream.png
>
> Thanks

My thanks to you, for the beautiful sentence in hanzi.

>
> Sean

Best regards,
Tony.
--
hundred-and-one symptoms of being an internet addict:
187. You promise yourself that you'll only stay online for another
15 minutes...at least once every hour.

Lily

Nov 11, 2009, 9:02:35 AM
to vim_use
This is what I am using (needs +python):

" %xx -> corresponding characters (echoed as a message) [[[2
function Lilydjwg_hexchar()
  let chars = Lilydjwg_get_pattern_at_cursor('\(%[[:xdigit:]]\{2}\)\+')
  if chars == ''
    " Warn that no %xx string was found at the cursor.
    echohl WarningMsg
    echo 'No %-encoded hexadecimal string found at the cursor!'
    echohl None
    return
  endif
  " Rewrite each %XX as \xXX and let Python's print interpret the escapes.
  let str = substitute(chars, '%', '\\x', 'g')
  exe 'py print ''' . str . ''''
endfunction
" Get the match at the cursor [[[2
" (This is a function I borrowed from another plugin.)
function Lilydjwg_get_pattern_at_cursor(pat)
  let col = col('.') - 1
  let line = getline('.')
  let ebeg = -1
  let cont = match(line, a:pat, 0)
  " Walk through the matches on the line until one covers the cursor column.
  while (ebeg >= 0 || (0 <= cont) && (cont <= col))
    let contn = matchend(line, a:pat, cont)
    if (cont <= col) && (col < contn)
      let ebeg = match(line, a:pat, cont)
      let elen = contn - ebeg
      break
    else
      let cont = match(line, a:pat, contn)
    endif
  endwhile
  if ebeg >= 0
    return strpart(line, ebeg, elen)
  else
    return ""
  endif
endfunction

nmap <silent> t% :call Lilydjwg_hexchar()<CR>

After writing this to .vimrc, when I move the cursor to where the %xx
string is and press t%, I can see the decoded string.
Or you can just use a program called ascii2uni, eg:

echo %E9%A6%AC | ascii2uni -q -a J

and the output is 馬. You can combine this with the filter (:h filter)
feature of Vim to get lines of characters converted directly from
within Vim.
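
The "match under the cursor" step can also be sketched on its own. This is an illustration in Python only, with hypothetical names (`percent_run_at`, `decode_at`), and it assumes `col` is a 0-based character index, whereas Vim's col() counts bytes from 1.

```python
import re
from urllib.parse import unquote

# One or more consecutive %XX escapes, as in the pattern passed above.
PERCENT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def percent_run_at(line: str, col: int) -> str:
    """Return the run of %XX escapes covering column col, or ""."""
    for m in PERCENT_RUN.finditer(line):
        if m.start() <= col < m.end():
            return m.group(0)
    return ""


def decode_at(line: str, col: int) -> str:
    """Decode the %XX run under the given column as UTF-8."""
    return unquote(percent_run_at(line, col))


line = "see %E9%A6%AC here"
print(percent_run_at(line, 6))  # %E9%A6%AC
```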