About the internal function "iconv"

winterTTr

Nov 23, 2009, 1:43:06 AM

to vim...@googlegroups.com, vim...@googlegroups.com

I use Vim to read the file attached to this mail.

The file can be read with the encoding "cp936", and everything works well.

When I read the file via ":e ++enc=sjis" (a wrong encoding),

Vim shows "conversion error" and the characters are garbled.

So the conversion fails with the "sjis" encoding.

However, when I use Vim's internal function iconv() like this:

---------------code-----------------------

let line1=readfile("cp936.txt",'b')[0]

echo iconv(line1,"sjis","utf-8")

---------------------------------------------

the result is the same kind of garbled characters.

I think this case is much the same as when I used ":e ++enc=sjis",

so the conversion should have failed here too.

The doc for iconv() reads as follows:

iconv({expr}, {from}, {to}) *iconv()*
The result is a String, which is the text {expr} converted
from encoding {from} to encoding {to}.
When the conversion fails an empty string is returned.

According to the doc, I think iconv() should return an empty string.

So how can I tell when a conversion done by iconv() has gone wrong?

Or have I misunderstood iconv()?

cp936.txt

Tony Mechelynck

Nov 23, 2009, 2:56:36 AM

to vim...@googlegroups.com, winterTTr, vim...@googlegroups.com

If the whole text consists of "valid bytes" according to the definition
of Shift-JIS, the conversion from sjis to utf-8 will not "fail", but if
the text was written using a different encoding, the result will
probably not "make sense". In that case you will get garbled text.
"Failing", from the point of view of the iconv routine, means "finding a
byte sequence which is invalid for the 'from' encoding at the position
where that sequence was encountered".

Similarly, conversion from Latin1 to UTF-8 will never fail, because any
byte is "valid" in Latin1; but if the text was originally written in
some non-Latin alphabet (using an encoding appropriate for that
alphabet), the result will not make sense.
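
The distinction above can be reproduced outside Vim. The following is a minimal sketch that uses Python's built-in codecs as a stand-in for the iconv library (an assumption made purely for illustration; Vim is not involved). The sample string is hypothetical and is not taken from the attached cp936.txt.

```python
# Bytes of a short Chinese string encoded as cp936 (GBK): D6 D0 CE C4.
gbk_bytes = "\u4e2d\u6587".encode("cp936")

# Each of these bytes is also a valid single-byte character in Shift-JIS
# (half-width katakana), so decoding "succeeds" and yields garbled text.
garbled = gbk_bytes.decode("shift_jis")

# Latin1 assigns a character to all 256 byte values, so decoding never fails.
latin1_text = bytes(range(256)).decode("latin-1")

# A real failure needs a byte sequence that is invalid in the source encoding;
# 0xFF is not a legal byte in Shift-JIS.
try:
    b"\xff".decode("shift_jis")
    failed = False
except UnicodeDecodeError:
    failed = True

print(repr(garbled), len(latin1_text), failed)
```

Only the last case counts as a "failure" from the converter's point of view; the first one returns text that is wrong but well-formed.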

Best regards,
Tony.
--
hundred-and-one symptoms of being an internet addict:
202. You're amazed to find out Spam is a food.

winterTTr

Nov 23, 2009, 3:24:22 AM

to Tony Mechelynck, vim...@googlegroups.com, vim...@googlegroups.com

On Mon, Nov 23, 2009 at 3:56 PM, Tony Mechelynck <antoine.m...@gmail.com> wrote:

> If the whole text consists of "valid bytes" according to the definition
> of Shift-JIS, the conversion from sjis to utf-8 will not "fail", but if
> the text was written using a different encoding, the result will
> probably not "make sense". In that case you will get garbled text.
> "Failing", from the point of view of the iconv routine, means "finding a
> byte sequence which is invalid for the 'from' encoding at the position
> where that sequence was encountered".

You mean that if iconv() cannot find an invalid byte sequence, it returns the result even though the result may be garbled text?

And in my case I see "conversion error" when I use :e ++enc=sjis. Does that mean iconv() should also detect the conversion failure?

Tony Mechelynck

Nov 23, 2009, 10:00:22 AM

to winterTTr, vim...@googlegroups.com, vim...@googlegroups.com

If there is no invalid byte sequence, how could iconv "see" that the
text is garbled? The routine has no linguistic knowledge, it only has
conversion tables and conversion subroutines between Unicode codepoints
and the representation of characters in a (large) number of encodings.

OTOH, if ":e ++enc=sjis" (with 'encoding' set to utf-8) says that the
conversion failed, then at least there exists "some" machine-usable
criterion to say that the "from" text is invalid. (With 'encoding' set
to some non-Unicode value, "conversion error" could also mean that there
is no invalid sequence for the "from" encoding but that there are
characters in the "from" text which cannot be represented in the "to"
encoding.)
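
The two kinds of "conversion error" described above can be told apart by checking the source side and the target side separately. This is a rough sketch using Python's codecs in place of Vim's conversion code; `classify` is a hypothetical helper, not a Vim or iconv function.

```python
def classify(raw: bytes, src: str, dst: str) -> str:
    """Report which side of a conversion fails, if any (hypothetical helper)."""
    try:
        text = raw.decode(src)
    except UnicodeDecodeError:
        # The input contains a byte sequence that is invalid for `src`.
        return "invalid in source encoding"
    try:
        text.encode(dst)
    except UnicodeEncodeError:
        # The input is valid, but `dst` has no code for some character.
        return "not representable in target encoding"
    return "ok"


sample = "\u4e2d\u6587".encode("cp936")          # hypothetical sample text
print(classify(sample, "cp936", "utf-8"))        # Unicode target
print(classify(sample, "cp936", "latin-1"))      # non-Unicode target
print(classify(b"\xff", "shift_jis", "utf-8"))   # invalid source bytes
```

With a Unicode target the second kind of error cannot occur, which matches the remark about 'encoding' being set to utf-8.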

I don't know the details of ++enc= vs. iconv(), or of which
circumstances might lead to different results. Bram would probably be
able to say better than I whether the behaviour you noted is intended,
or whether you found a bug.

--

fortune: cpu time/usefulness ratio too high -- core dumped.

Bram Moolenaar

Nov 23, 2009, 2:57:17 PM

to winterTTr, vim...@googlegroups.com, vim...@googlegroups.com

Yongwei Wu reported a similar problem last week. It appears that
iconv() tries conversion even when there are errors. If you get an
error message when loading the file that should also happen with
iconv().

Although getting an empty string on an error is a good way to notice
that something went wrong, at other times you might want to get whatever
could be converted. We could add an argument that tells it whether to
fail or to do a "best effort" conversion. I'm not sure that will always
be possible; we depend on what the iconv library does.
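
The proposed switch between failing and a "best effort" conversion could look roughly like the sketch below. It is written against Python's codecs rather than Vim's iconv(), and the `convert` function with its `strict` parameter is hypothetical; it only illustrates the two behaviours being discussed.

```python
def convert(raw: bytes, src: str, dst: str, strict: bool = True) -> bytes:
    """Hypothetical converter with a fail / best-effort switch."""
    if strict:
        try:
            return raw.decode(src).encode(dst)
        except UnicodeError:
            # Mirrors the documented "an empty string is returned".
            return b""
    # Best effort: keep what converts, substitute a marker for the rest.
    return raw.decode(src, errors="replace").encode(dst, errors="replace")


bad = b"ok\xffok"   # 0xFF is not valid Shift-JIS
print(convert(bad, "shift_jis", "utf-8", strict=True))
print(convert(bad, "shift_jis", "utf-8", strict=False))
```

In strict mode the caller can test for an empty result; in best-effort mode the convertible parts survive and the invalid byte is replaced by a marker character.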

--
If your company is not involved in something called "ISO 9000" you probably
have no idea what it is. If your company _is_ involved in ISO 9000 then you
definitely have no idea what it is.
(Scott Adams - The Dilbert principle)

/// Bram Moolenaar -- Br...@Moolenaar.net -- http://www.Moolenaar.net \\\
/// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ download, build and distribute -- http://www.A-A-P.org ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///

Christian Brabandt

Nov 23, 2009, 3:08:31 PM

to vim...@googlegroups.com, vim...@googlegroups.com

Hi Bram!

On Mo, 23 Nov 2009, Bram Moolenaar wrote:

> Yongwei Wu reported a similar problem last week. It appears that
> iconv() tries conversion even when there are errors. If you get an
> error message when loading the file that should also happen with
> iconv().
>
> Although getting an empty string on an error is a good way to notice
> that something went wrong, at other times you might want to get whatever
> could be converted. We could add an argument that tells it whether to
> fail or to do a "best effort" conversion. I'm not sure that will always
> be possible; we depend on what the iconv library does.

Well, iconv_open()¹ allows appending the special strings //TRANSLIT and
//IGNORE to the target charset. With //TRANSLIT, a character that cannot
be represented in the target character set is approximated by one or
several similar-looking characters; with //IGNORE, characters that cannot
be represented in the target charset are silently discarded.

¹) at least on a GNU system (see
http://www.gnu.org/software/libiconv/documentation/libiconv/iconv_open.3.html)
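
To give a feel for the two suffixes, here is a loose analogy in Python. It does not call iconv_open() at all: `errors="ignore"` behaves much like //IGNORE, while the NFKD decomposition step is only a crude approximation of //TRANSLIT and does not use the GNU transliteration tables. The sample word is hypothetical.

```python
import unicodedata

text = "caf\u00e9"   # "café", with a precomposed e-acute

# Roughly like //IGNORE: characters the target cannot represent are dropped.
ignored = text.encode("ascii", errors="ignore")

# Very roughly like //TRANSLIT: decompose the accented letter into a base
# letter plus a combining accent, then drop the accent that ASCII lacks.
decomposed = unicodedata.normalize("NFKD", text)
approximated = decomposed.encode("ascii", errors="ignore")

print(ignored, approximated)
```

Neither mode reports an error, so both are "best effort" in the sense discussed earlier in the thread.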

regards,
Christian

Yongwei Wu

Nov 24, 2009, 6:56:18 AM

to vim...@googlegroups.com, winterTTr, vim...@googlegroups.com

2009/11/24 Bram Moolenaar <Br...@moolenaar.net>:

I like this idea. It should be possible. My understanding is that
iconv(3) stops on invalid multi-byte sequences.

--
Wu Yongwei
URL: http://wyw.dcweb.cn/

Yongwei Wu

Nov 24, 2009, 7:24:49 AM

to vim...@googlegroups.com, winterTTr, vim...@googlegroups.com

2009/11/24 Bram Moolenaar <Br...@moolenaar.net>:

I like this idea. It should be possible. iconv(3) converts one
multibyte character at a time, and on an invalid sequence it sets errno
to EILSEQ and returns (size_t)-1. I have just verified this behaviour
with the text in my last report.
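
Because the converter stops at the invalid sequence, the position of the failure is available to the caller. The sketch below shows the same idea with Python's codecs, where the exception plays the role of the EILSEQ error; `first_invalid_offset` is a hypothetical helper and does not call iconv(3).

```python
from typing import Optional


def first_invalid_offset(raw: bytes, encoding: str) -> Optional[int]:
    """Return the byte offset of the first invalid sequence, or None.

    Hypothetical helper: UnicodeDecodeError stands in for the EILSEQ
    error that iconv(3) reports.
    """
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start   # offset where decoding stopped
    return None


print(first_invalid_offset(b"abc\xffdef", "shift_jis"))
print(first_invalid_offset(b"abc", "shift_jis"))
```

A caller that wants the strict behaviour can treat any non-None result as a failed conversion.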

(If you received a previous message like this, please ignore. It was
the result of wrongly pressing "Send".)