Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

[Samba] Conversion error: Illegal multibyte sequence

1,233 views
Skip to first unread message

Laurent Blume

unread,
Sep 5, 2013, 4:20:01 PM9/5/13
to
Hello list,

I've noticed this problem for a few years now, I think. I see it popped
out now and then in discussions. But they always end before a solution
is given.

So let's try one more time :-)

I have plenty of UTF-8 named files and directories. It's UTF-8 all
round, I don't use anything else, so I have no doubt the byte sequences
are correct in the filesystem (I happen to have accented Latin chars,
Chinese, Japanese, and some non-letters chars, so it'd show up really
quick if there was an issue there :-)

I see those kind of errors in the logs, here about a directory named
"♻_Corbeille".
Please note: those lines are a direct copy of the log. So yes, the "♻"
character is indeed correct in the two first entries (including the
first conversion error line), but improperly logged as an invalid byte
sequence in the two latter entries, and those two have different lengths.

[2013/09/05 20:43:50.280597, 3] smbd/dir.c:1046(smbd_dirptr_get_entry)
smbd_dirptr_get_entry mask=[*] found ./♻_Corbeille fname=♻_Corbeille
(♻_Corbeille)
[2013/09/05 20:43:50.280641, 3] lib/charcnv.c:161(convert_string_internal)
convert_string_internal: Conversion error: Illegal multibyte
sequence(♻_Corbeille)
[2013/09/05 20:43:50.280679, 3] lib/charcnv.c:140(convert_string_internal)
convert_string_internal: Conversion error: Incomplete multibyte
sequence(��_Corbeille)
[2013/09/05 20:43:50.280715, 3] lib/charcnv.c:140(convert_string_internal)
convert_string_internal: Conversion error: Incomplete multibyte
sequence(�_Corbeille)


It does not prevent using the directory, and it displays properly on
Windows clients. So the issue is merely an annoying flood of logs.

The system is Solaris 10, running Samba 3.6.18 linked against GNU
libiconv 1.14.

The charsets are defined like this in the configuration:

dos charset = cp850
unix charset = UTF8
display charset = UTF8


So, any definitive fix for that?

Thanks!

Laurent
--
To unsubscribe from this list go to the following URL and read the
instructions: https://lists.samba.org/mailman/options/samba

Jeremy Allison

unread,
Sep 5, 2013, 4:40:01 PM9/5/13
to

This is the call to smb_iconv() returning an errno of EINVAL.

Firstly, add some debug statements inside smb_iconv_open_ex()
to find out if we're using the sys_iconv() function (that
calls the system iconv) or the internal UFT8 converters.

If it's the system iconv then you'll have to look inside
that source code.

If it's the internal converters add some debug statements
inside utf8_pull() and utf8_push() to see where the EINVAL
is being returned.

This will help track it down for your individual case.

Jeremy.

Laurent Blume

unread,
Sep 5, 2013, 5:30:02 PM9/5/13
to
On 2013-09-05 10:35 PM, Jeremy Allison wrote:
> This is the call to smb_iconv() returning an errno of EINVAL.
>
> Firstly, add some debug statements inside smb_iconv_open_ex()
> to find out if we're using the sys_iconv() function (that
> calls the system iconv) or the internal UFT8 converters.
>
> If it's the system iconv then you'll have to look inside
> that source code.
>
> If it's the internal converters add some debug statements
> inside utf8_pull() and utf8_push() to see where the EINVAL
> is being returned.
>
> This will help track it down for your individual case.

I'm not sure I'm still good at adding printf("DEBUG\n") lines around :-)
so I tried my hand with dtrace for a start.

Here are some examples of what it returns when looking at smb_iconv()
while I opened the directory and listed the content:

0 61539 smb_iconv:entry
0 88770 convert_string_talloc:return
0 88786 push_ucs2_talloc:return
0 69264 talloc_tos:entry
0 88535 talloc_tos:return
0 69516 push_ucs2_talloc:entry
0 69500 convert_string_talloc:entry
0 69497 lazy_initialize_conv:entry
0 88767 lazy_initialize_conv:return
0 61537 get_iconv_convenience:entry
0 61585 get_iconv_convenience:return
0 69852 get_conv_handle:entry
0 89122 get_conv_handle:return
0 82016 _talloc_realloc:entry
0 81992 talloc_alloc_pool:entry
0 81991 talloc_pool_objectcount:entry
0 101091 talloc_pool_objectcount:return
0 101092 talloc_alloc_pool:return
0 101116 _talloc_realloc:return


0 61539 smb_iconv:entry
0 88770 convert_string_talloc:return
0 88786 push_ucs2_talloc:return
0 69821 strupper_w:entry
0 69840 toupper_w:entry
0 69846 toupper_m:entry
0 89116 toupper_m:return
0 89110 toupper_w:return
0 69840 toupper_w:entry
0 69846 toupper_m:entry
0 75259 toupper:entry
0 94407 toupper:return
0 89116 toupper_m:return
0 89110 toupper_w:return
0 69840 toupper_w:entry
0 69846 toupper_m:entry
0 75259 toupper:entry
0 94407 toupper:return
0 89116 toupper_m:return
0 89110 toupper_w:return
0 89091 strupper_w:return
0 69499 convert_string:entry
0 69497 lazy_initialize_conv:entry
0 88767 lazy_initialize_conv:return
0 61537 get_iconv_convenience:entry
0 61585 get_iconv_convenience:return
0 69852 get_conv_handle:entry
0 89122 get_conv_handle:return

Does it make sense? There are some longer examples, but the only
function call containing iconv is get_iconv_convenience().

Laurent

Jeremy Allison

unread,
Sep 5, 2013, 5:40:01 PM9/5/13
to
No, it doesn't make sense. smb_iconv() vectors
through pull and push function pointers that
do the actual conversion - that's where the
errno is coming from. That's why you need
debug statements - you know it's going into
smb_iconv() but you don't know where it's
going from there.

Jeremy.

Laurent Blume

unread,
Sep 6, 2013, 3:50:02 PM9/6/13
to
On 2013-09-05 11:34 PM, Jeremy Allison wrote:
> No, it doesn't make sense. smb_iconv() vectors
> through pull and push function pointers that
> do the actual conversion - that's where the
> errno is coming from. That's why you need
> debug statements - you know it's going into
> smb_iconv() but you don't know where it's
> going from there.
>
> Jeremy.
>

At least those proved that neither Solaris' nor GNU's iconv were
involved :-)

I tagged all the possibles causes for EILSEQ, so here's the code causing
that issue, in lib/util/charset/iconv.c:

if ((c[0] & 0xf0) == 0xe0) {
if (in_left < 3 ||
(c[1] & 0xc0) != 0x80 ||
(c[2] & 0xc0) != 0x80) {
errno = EILSEQ;
DEBUG(3,("DEBUG, %d, %d, %d, %d\n",
in_left, c[0], c[1], c[2]));
goto error;
}

It happens because in_left == 2 when it should be 3:

[2013/09/06 20:59:50.249004, 3] ../lib/util/charset/iconv.c:600(utf8_pull)
DEBUG, 2, 226, 153, 187
[2013/09/06 20:59:50.249056, 3] lib/charcnv.c:161(convert_string_internal)
convert_string_internal: Conversion error: Illegal multibyte
sequence(â<99>»_Corbeille)

The sequence values are good, as this char is 0xE2 0x99 0xBB in UTF8.

I tried to backtrack the origin of this value, but I got lost after
convert_string_internal() in source3/lib/charcnv.c , where it was
incorrect already.

Is it getting warmer?

Laurent

Jeremy Allison

unread,
Sep 6, 2013, 5:00:02 PM9/6/13
to
Yes ! The tests against 0x80 are correct, so as you've
surmised it's the in_left == 2 is the real problem.

If after the DEBUG(3,("..)) statement you have
here you cause it to terminate with abort(),
then you can catch the backtrace if you set

panic action = /bin/sleep 999999

in the [global] section of your smb.conf. At
that point you should be able to attach to
the parent process of the sleep (which will
be an smbd) with gdb, and examine the contents
of i_len inside convert_string_internal().

Either that or add another debug inside
convert_string_internal() to print out
the values of srclen and i_len at various
points and try and determine why it's off
by one.

Cheers,

Jeremy.

Laurent Blume

unread,
Sep 6, 2013, 8:10:02 PM9/6/13
to
On 2013-09-06 10:54 PM, Jeremy Allison wrote:

> Either that or add another debug inside
> convert_string_internal() to print out
> the values of srclen and i_len at various
> points and try and determine why it's off
> by one.

Well, it was /quite/ further than that.

I think I got it this time, in sources3/smbd/mangle_hash2.c:

/*
* Note that if CH_UNIX is utf8 a string may be 3
* bytes, but this is ok as mb utf8 characters don't
* contain embedded ascii bytes. We are really
checking
* for mb UNIX asian characters like Japanese
(SJIS) here.
* JRA.
*/
DEBUG(3, ("DEBUG ++ name, %s\n", name));
>>>> if (convert_string(CH_UNIX, CH_UTF16LE,
name, 2, mbc, 2, False) == 2) {
/* Was a good mb string. */
name += 2;
continue;
}
DEBUG(3, ("DEBUG -- name, %s\n", name));


Why is the length here hardcoded to 2?

It's past 2am around here, so I could be missing something, time for bed.

Thanks,

Laurent

Jeremy Allison

unread,
Sep 6, 2013, 8:10:02 PM9/6/13
to
Woo hoo ! I think you've found an oooooold old bug :-).

I'll take a look at that asap.

Thanks !

Jeremy.

Laurent Blume

unread,
Sep 7, 2013, 4:50:01 AM9/7/13
to
On 2013-09-07 2:04 AM, Jeremy Allison wrote:
> Woo hoo ! I think you've found an oooooold old bug :-).

I told you so from the beginning! Years! ;-)
Not a critical one though, it seems.

> I'll take a look at that asap.
>
> Thanks !

I'll be happy to get rid of it too!

Laurent

Jeremy Allison

unread,
Sep 9, 2013, 6:50:01 PM9/9/13
to
Ok, here is a fix for 3.6.x. Can you test this and see
if it fixes the problem ? If so, I'll get this fixed
in master and back-ported to all releases.

Thanks !

Jeremy.
look1

Laurent Blume

unread,
Sep 10, 2013, 5:00:02 AM9/10/13
to
On 10/09/13 00:40, Jeremy Allison wrote:
> Ok, here is a fix for 3.6.x. Can you test this and see
> if it fixes the problem ? If so, I'll get this fixed
> in master and back-ported to all releases.

Applies, builds, and runs, and the error message disappeared: all good.

Thanks!

Laurent

Volker Lendecke

unread,
Sep 10, 2013, 5:50:03 AM9/10/13
to
Hi, Jeremy!

On Mon, Sep 09, 2013 at 03:40:06PM -0700, Jeremy Allison wrote:
> Ok, here is a fix for 3.6.x. Can you test this and see
> if it fixes the problem ? If so, I'll get this fixed
> in master and back-ported to all releases.
>
> Thanks !
>
> Jeremy.

> diff --git a/source3/smbd/mangle_hash2.c b/source3/smbd/mangle_hash2.c
> index 5aafe2f..e1aedf1 100644
> --- a/source3/smbd/mangle_hash2.c
> +++ b/source3/smbd/mangle_hash2.c
> @@ -626,7 +626,8 @@ static bool is_legal_name(const char *name)
> while (*name) {
> if (((unsigned int)name[0]) > 128 && (name[1] != 0)) {
> /* Possible start of mb character. */
> - char mbc[2];
> + size_t size = 0;
> + (void)next_codepoint(name, &size);
> /*
> * Note that if CH_UNIX is utf8 a string may be 3
> * bytes, but this is ok as mb utf8 characters don't
> @@ -634,7 +635,7 @@ static bool is_legal_name(const char *name)
> * for mb UNIX asian characters like Japanese (SJIS) here.
> * JRA.
> */
> - if (convert_string(CH_UNIX, CH_UTF16LE, name, 2, mbc, 2, False) == 2) {
> + if (size == 2) {
> /* Was a good mb string. */
> name += 2;
> continue;

Can you explain what this check is supposed to do at all? I
don't get it ... :-)

Thanks,

Volker

--
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
http://www.sernet.de, mailto:kon...@sernet.de

*****************************************************************
visit us on it-sa:IT security exhibitions in Nürnberg, Germany
October 8th - 10th 2013, hall 12, booth 333
free tickets available via code 270691 on: www.it-sa.de/gutschein
******************************************************************

Jeremy Allison

unread,
Sep 10, 2013, 12:50:02 PM9/10/13
to
It's an old, old check back from when SJIS and EUC were
common multi-byte systems.

SJIS especially has the property that the second byte
can contain a value <127 as part of the 2-byte char
set. So if CH_UNIX is set to a char set with such a
property we can't walk it as bytes, but must see if
a pair of values [0] (> 0x80) [1] (any value) can be
converted into a valid multi-byte char, in which case
we ignore it (otherwise we might look at the second
byte value of ':' or something and consider it invalid).

I thought about removing this and re-writing it, but
it made my brain hurt (and might break some very old
systems :-). So moving to next_codepoint() which checks
the next char len without causing the conversion error
messages seemed the simplest fix :-).

Jeremy.

Volker Lendecke

unread,
Sep 10, 2013, 1:50:01 PM9/10/13
to
On Tue, Sep 10, 2013 at 09:48:57AM -0700, Jeremy Allison wrote:
> It's an old, old check back from when SJIS and EUC were
> common multi-byte systems.
>
> SJIS especially has the property that the second byte
> can contain a value <127 as part of the 2-byte char
> set. So if CH_UNIX is set to a char set with such a
> property we can't walk it as bytes, but must see if
> a pair of values [0] (> 0x80) [1] (any value) can be
> converted into a valid multi-byte char, in which case
> we ignore it (otherwise we might look at the second
> byte value of ':' or something and consider it invalid).
>
> I thought about removing this and re-writing it, but
> it made my brain hurt (and might break some very old
> systems :-). So moving to next_codepoint() which checks
> the next char len without causing the conversion error
> messages seemed the simplest fix :-).

Thanks! +1 from me.

Volker

--
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
http://www.sernet.de, mailto:kon...@sernet.de

*****************************************************************
visit us on it-sa:IT security exhibitions in Nürnberg, Germany
October 8th - 10th 2013, hall 12, booth 333
free tickets available via code 270691 on: www.it-sa.de/gutschein
******************************************************************

Jeremy Allison

unread,
Sep 10, 2013, 2:10:02 PM9/10/13
to
On Tue, Sep 10, 2013 at 07:44:39PM +0200, Volker Lendecke wrote:
> On Tue, Sep 10, 2013 at 09:48:57AM -0700, Jeremy Allison wrote:
> > It's an old, old check back from when SJIS and EUC were
> > common multi-byte systems.
> >
> > SJIS especially has the property that the second byte
> > can contain a value <127 as part of the 2-byte char
> > set. So if CH_UNIX is set to a char set with such a
> > property we can't walk it as bytes, but must see if
> > a pair of values [0] (> 0x80) [1] (any value) can be
> > converted into a valid multi-byte char, in which case
> > we ignore it (otherwise we might look at the second
> > byte value of ':' or something and consider it invalid).
> >
> > I thought about removing this and re-writing it, but
> > it made my brain hurt (and might break some very old
> > systems :-). So moving to next_codepoint() which checks
> > the next char len without causing the conversion error
> > messages seemed the simplest fix :-).
>
> Thanks! +1 from me.

Actually - your question made me think about this
some more and I think I can easily simplify this - due
to the fact that no encoding with a length > 1 can
contain invalid characters (which are all ASCII < 0x80).

So here is the fix I'd like to commit to master, and
then I'll create a bug and back-port for 4.1.0, 4.0.next
and 3.6.next.

Please re-review (sorry :-).

Jeremy
0 new messages