MD5 digest of UTF-8 string in Perl 5.8

Juha-Mikko Ahonen

Oct 23, 2002, 10:42:27 AM

to Markus Kuhn, perl-u...@perl.org, linux...@nl.linux.org

On Wednesday 23 October 2002 17:20, Markus Kuhn wrote:
> How can I calculate the MD5 message digest of a Unicode string in
> Perl 5.8? The MD5 hash algorithm naturally expects a sequence of
> bytes as its input, and I have a string with a sequence of
> characters. I tried
>
> $ perl -e 'use Digest::MD5 qw(md5_hex); print md5_hex("\x{20ac}");'
> Wide character in subroutine entry at -e line 1.

The problem is in \x{20ac}. If you place the character in UTF-8 encoding
in place of the escape, it works perfectly. If you have real UTF-8
data, not perl escapes, then there is no problem.

Markus Kuhn

Oct 23, 2002, 10:20:39 AM

to perl-u...@perl.org, linux...@nl.linux.org

How can I calculate the MD5 message digest of a Unicode string in Perl
5.8? The MD5 hash algorithm naturally expects a sequence of bytes as its
input, and I have a string with a sequence of characters. I tried

$ perl -e 'use Digest::MD5 qw(md5_hex); print md5_hex("\x{20ac}");'
Wide character in subroutine entry at -e line 1.

but it seems that I have to explicitly convert my character string to a
byte sequence first.

How can I do this most efficiently (keeping in mind that it ought to be
a NOP internally, as the string is already stored in UTF-8)?
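
For readers who want to reproduce the failure in a script rather than a one-liner, here is a minimal sketch. It assumes only the stock Digest::MD5 module; the variable names are made up for illustration. It wraps the call in eval so that the "Wide character" error is captured instead of aborting the program.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# A character string holding one character above 0xFF (U+20AC, EURO SIGN).
my $chars = "\x{20ac}";

# md5_hex() wants bytes; handing it a wide character makes it die,
# so trap the error with eval and inspect $@ afterwards.
my $digest = eval { md5_hex($chars) };
my $error  = $@;

print defined $digest ? "digest: $digest\n" : "failed: $error";
```

The string has to be turned into bytes explicitly before it can be hashed.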

Markus

--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>

Jean-Michel Hiver

Oct 23, 2002, 10:53:52 AM

to Markus Kuhn, perl-u...@perl.org, linux...@nl.linux.org

> How can I calculate the MD5 message digest of a Unicode string in Perl
> 5.8? The MD5 hash algorithm naturally expects a sequence of bytes as its
> input, and I have a string with a sequence of characters. I tried
>
> $ perl -e 'use Digest::MD5 qw(md5_hex); print md5_hex("\x{20ac}");'
> Wide character in subroutine entry at -e line 1.

I'd do something like this:

use Encode;
use Digest::MD5 qw(md5_hex);

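# Named differently from the imported md5_hex: a sub called md5_hex
# would replace the import and then call itself forever.
sub md5_hex_utf8
{
    my $string = shift;           # work on a copy of the caller's string
    Encode::_utf8_off ($string);  # expose the internal UTF-8 bytes
    return md5_hex ($string);
}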

In this case turning the UTF8 flag off is OK because $string is a copy
of the original string and won't be used outside of the subroutine.
You have to be careful when playing with Encode::is_utf8(),
Encode::_utf8_on() and Encode::_utf8_off(), but they're very handy.

Best regards,
--
IT'S TIME FOR A DIFFERENT KIND OF WEB
================================================================
Jean-Michel Hiver - Software Director
jhi...@mkdoc.com
+44 (0)114 255 8097
================================================================
VISIT HTTP://WWW.MKDOC.COM

Gisle Aas

Oct 23, 2002, 11:01:17 AM

to Jean-Michel Hiver, Markus Kuhn, perl-u...@perl.org, linux...@nl.linux.org

Jean-Michel Hiver <jhi...@mkdoc.com> writes:

> > How can I calculate the MD5 message digest of a Unicode string in Perl
> > 5.8? The MD5 hash algorithm naturally expects a sequence of bytes as its
> > input, and I have a string with a sequence of characters. I tried
> >
> > $ perl -e 'use Digest::MD5 qw(md5_hex); print md5_hex("\x{20ac}");'
> > Wide character in subroutine entry at -e line 1.
>
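> I'd do something like this:
>
> use Encode;
> use Digest::MD5 qw(md5_hex);
>
> # Named differently from the imported md5_hex: a sub called md5_hex
> # would replace the import and then call itself forever.
> sub md5_hex_utf8
> {
>     my $string = shift;           # work on a copy of the caller's string
>     Encode::_utf8_off ($string);  # expose the internal UTF-8 bytes
>     return md5_hex ($string);
> }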

I would argue that it is much better to write it as:

md5_hex(Encode::encode_utf8($string))

Playing with _utf8_{on,off} is ugly for good reason and will break if
the internal representation changes. Calling encode_utf8() should be
almost as efficient and is future-proof.
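
As a rough sketch of the difference, assuming only the stock Encode and Digest::MD5 modules (the helper name md5_hex_utf8 is made up for illustration): encode_utf8() always produces the UTF-8 bytes of the characters, whereas _utf8_off() only exposes whatever bytes Perl happens to be storing. For a string such as "\xe9" that Perl normally keeps as a single Latin-1 byte, the two approaches hash different byte sequences.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: MD5 of the UTF-8 encoding of a character string.
sub md5_hex_utf8 {
    my ($string) = @_;
    return md5_hex(encode_utf8($string));
}

my $euro   = "\x{20ac}";  # U+20AC, held internally as UTF-8
my $eacute = "\xe9";      # U+00E9, normally held as one Latin-1 byte

# Works for any character string, wide or not.
print md5_hex_utf8($euro), "\n";

# The flag-flipping approach: a no-op here, because the UTF8 flag was
# never set on this string, so the single byte E9 gets hashed.
my $copy = $eacute;
Encode::_utf8_off($copy);
my $flag_digest = md5_hex($copy);

# encode_utf8() hashes the two bytes C3 A9 instead.
my $encode_digest = md5_hex_utf8($eacute);

print $flag_digest eq $encode_digest ? "same\n" : "different\n";
```

With the flag-flipping version the digest depends on how a particular string happens to be stored, which is exactly the kind of breakage to avoid.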

Regards,
Gisle

Markus Kuhn

Oct 23, 2002, 1:50:12 PM

to Gisle Aas, perl-u...@perl.org, linux...@nl.linux.org

Gisle Aas wrote on 2002-10-23 15:01 UTC:
> md5_hex(Encode::encode_utf8($string))

Thanks, that indeed looks like the proper solution.

Juha-Mikko Ahonen wrote on 2002-10-23 14:42 UTC:
> > $ perl -e 'use Digest::MD5 qw(md5_hex); print md5_hex("\x{20ac}");'
> > Wide character in subroutine entry at -e line 1.
>

> The problem is in \x{20ac}. If you place the character in UTF-8 encoding
> in place of the escape, it works perfectly. If you have real UTF-8
> data, not perl escapes, then there is no problem.

I'm afraid this didn't make sense to me. The internal representation of
the input value of the MD5 function should not depend on whether I used
the UTF-8 character or the hex escape notation in the source code. The
Perl compiler should eliminate this difference already in the scanner.
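
One way to probe this point is the sketch below, which again assumes only the stock Encode and Digest::MD5 modules. Without the `use utf8` pragma, Perl reads a literal UTF-8 character in the source as its raw bytes, so the two notations give different strings. To keep the script pure ASCII, the raw bytes are spelled out here as "\xe2\x82\xac", which is the UTF-8 encoding of U+20AC.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);
use Digest::MD5 qw(md5_hex);

# The escape yields a character string: one character, U+20AC.
my $escaped = "\x{20ac}";

# Stand-in for a literal euro sign typed into a UTF-8 source file
# without "use utf8": Perl sees three separate bytes.
my $raw_bytes = "\xe2\x82\xac";

print length($escaped), "\n";    # prints 1
print length($raw_bytes), "\n";  # prints 3

# The byte string can be hashed directly; the character string must be
# encoded first. Both then describe the same three bytes.
print md5_hex(encode_utf8($escaped)) eq md5_hex($raw_bytes)
    ? "same digest\n"
    : "different digest\n";      # prints "same digest"
```

So the two spellings are not the same string as far as Perl is concerned, even though encoding the character string gives back the same bytes and therefore the same digest.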