His patch is not sane. The utility functions look at the UTF8
flag to decide what to do, which is a broken approach, by
definition. (The UTF8 flag signifies the internal format of the
byte buffer of the string, but it says nothing about whether
a string consists of characters or bytes.) It will hide some
problems but cause others.
I do not have enough knowledge of CHI to give good advice on how
to proceed, however.
How many 3rd party drivers exist? How important is it that a new
version of CHI which fixes this problem be backward-compatible
with old drivers? (It will probably require changes to the driver
interface to fix it correctly.)
Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>
> * Jonathan Swartz <swa...@pobox.com> [2010-05-23 15:55]:
>> Can a second person who understands encoding vouch for Jiri's
>> approach - not necessarily the exact implementation, but the
>> concept of encoding keys and values as they come in and
>> decoding them as they go out?
>
> His patch is not sane. The utility functions look at the UTF8
> flag to decide what to do, which is a broken approach, by
> definition. (The UTF8 flag signifies the internal format of the
> byte buffer of the string, but it says nothing about whether
> a string consists of characters or bytes.) It will hide some
> problems but cause others.
>
I'm not so sure. I know that using is_utf8 is generally Wrong. But
CHI's role is not to interpret the data in any way, merely to store
and retrieve it and make sure it doesn't change in that process. In
that case, isn't it the right thing to make sure the utf8 flag is set
exactly the same after a store and retrieve?
For example, this script
#!/usr/bin/perl -w
use Encode;
use Storable qw(freeze thaw);
use strict;
my ($in, $out);
$in = "\x{263a}b";
$out = thaw(freeze([$in]))->[0];
print "is_utf8 before Storable: " . (Encode::is_utf8($in) ? 't' : 'f') . "\n";
print "is_utf8 after Storable: " . (Encode::is_utf8($out) ? 't' : 'f') . "\n\n";
$in = join('', map { chr($_) } (226, 152, 186, 98));
$out = thaw(freeze([$in]))->[0];
print "is_utf8 before Storable: " . (Encode::is_utf8($in) ? 't' : 'f') . "\n";
print "is_utf8 after Storable: " . (Encode::is_utf8($out) ? 't' : 'f') . "\n\n";
prints
is_utf8 before Storable: t
is_utf8 after Storable: t
is_utf8 before Storable: f
is_utf8 after Storable: f
so Storable is preserving the is_utf8 flag. JSON does the same thing.
We use Storable to serialize reference values, but we store scalar
values raw, so we need to take responsibility for the utf8 flag in the
scalar case.
Aristotle, please help me understand your objections better.
Thanks
Jon
That would be my expectation - my application-level code should work the
same way with caching turned on or turned off. In the typical use case
you're going to call some API to get some data, and under the covers
that data might be coming from the cache, or might be coming e.g.
directly from a database (and also stored in the cache as a side
effect). So at the application level I would handle the data the same
way regardless, and expect CHI to be transparent.
Larry
Is it? If I store some data under the key `"naïve"`, once with
UTF8 flag turned off and once with the flag on, am I storing the
data under two different keys, or the same key? And if it *is*
considered the same key, and I ask for what that key is, should
I get it with UTF8 flag on (that was how it was first stored) or
off (as it was off in the latest write)?
Note that because Perl considers the flag a transparent internal
implementation detail, it can easily happen that code used the
*same scalar variable* in both cases (flag off, flag on), just
because it also used the variable in some other operation in the
meantime that implicitly upgraded the string.
That’s fine in Perl – a string with UTF8 flag off and a string
with UTF8 flag on mean the same thing if they contain the same
sequence of logical characters even when their internal
representation differs.
But it means that you’d in turn be forcing client code to look at
the UTF8 flag to make sure it’s really passing what it thinks
it’s passing. And IMO any API which forces its clients to care
about the UTF8 flag is broken.
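
To make that concrete, here is a minimal sketch (the variable names are mine, nothing here is CHI code). It takes one string, upgrades a copy, and compares the two at the Perl level and at the level of the internal buffer:

```perl
use strict;
use warnings;

my $off = "na\x{ef}ve";   # one byte per character, UTF8 flag off
my $on  = $off;
utf8::upgrade($on);       # same characters, variable-width storage, flag on

my $same_chars = $off eq $on ? 1 : 0;
my $flag_off   = utf8::is_utf8($off) ? 1 : 0;
my $flag_on    = utf8::is_utf8($on)  ? 1 : 0;

# "use bytes" makes length() report the size of the internal buffer.
my ($buf_off, $buf_on) = do { use bytes; (length($off), length($on)) };

print "eq: $same_chars, flags: $flag_off/$flag_on, buffer bytes: $buf_off/$buf_on\n";
# prints: eq: 1, flags: 0/1, buffer bytes: 5/6
```

The two scalars are equal as far as Perl code can tell; only code that peeks at the flag or at the raw buffer sees a difference.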
> For example, this script […] prints […] so Storable is
> preserving the is_utf8 flag.
Storable never interprets the data you pass it in any way
whatsoever. CHI does.
> JSON does the same thing.
Colour me dubious. There is no way to express the concept of
a UTF8 flag in JSON.
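
A quick way to check this, sketched with the core JSON::PP module (an assumption on my part: the thread does not say which JSON module CHI uses). The `ascii` option is only there to keep the output printable:

```perl
use strict;
use warnings;
use JSON::PP ();

my $off = "na\x{ef}ve";   # UTF8 flag off
my $on  = $off;
utf8::upgrade($on);       # UTF8 flag on, same characters

my $json     = JSON::PP->new->ascii;
my $text_off = $json->encode([$off]);
my $text_on  = $json->encode([$on]);

print "same JSON text: ", ($text_off eq $text_on ? "yes" : "no"), "\n";
```

If the serialised text is the same either way, then whatever flag the decoded string comes back with is decided by the decoder, not carried in the JSON.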
> We use Storable to serialize reference values, but we store
> scalar values raw, so we need to take responsibility for the
> utf8 flag in the scalar case.
I think the only sane thing to do is to consider all strings to
be strings – and in Perl there is no difference between character
strings and byte strings. You can slurp a JPEG image file into
a scalar, upgrade the scalar (turning on the UTF8 flag), and then
write the scalar back out to another file, and you’ll get the
very same JPEG image back, even though the UTF8 flag was turned
on in internal storage.
The UTF8 flag is misnamed. What it actually means is whether the
internal storage format of the string is a fixed-width packed
bytes array or a variable-width integer sequence. That’s all.
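
The JPEG claim is easy to try. A small sketch, using all 256 byte values as a stand-in for image data and a temporary file (both are my choices for illustration):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $bytes    = join '', map { chr } 0 .. 255;   # stand-in for binary file content
my $upgraded = $bytes;
utf8::upgrade($upgraded);                       # turn the UTF8 flag on

# Write the upgraded scalar through a plain binary filehandle.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} $upgraded;
close $out or die "close failed: $!";

# Read the file back as raw bytes.
open(my $in, '<:raw', $path) or die "open failed: $!";
my $read_back = do { local $/; <$in> };
close $in;

print "flag on: ", (utf8::is_utf8($upgraded) ? 1 : 0), "\n";
print "identical: ", ($read_back eq $bytes ? "yes" : "no"), "\n";
```

The file ends up holding the original 256 bytes even though the scalar that was written had the flag on.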
The REAL problem you have is that some of your cache backends can
cope with keys containing characters > 255 and some cannot.
I think it’s a bug in those backends when they cannot cope.
But you could decide to solve the problem centrally in CHI by
defining some kind of universal characters→bytes transliteration
scheme. As it happens, the UTF-8 encoding is a good choice for
such an encoding. In other words, you would encode ALL keys, no
matter whether the UTF8 flag is turned on or off, because the
flag does not change the semantic meaning of the string, ie. the
key `"naïve"` should yield the same encoded result regardless of
whether it was stored in a scalar with the UTF8 flag on or off.
This does however mean that backends which would be capable of
storing characters > 255 will only store the transliterated
versions, just like everyone else, so eg. if you use a DBI or DBM
backend then the data in the store will be harder to examine
because it will be stored in encoded form.
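
A minimal sketch of "encode ALL keys" (`encode_key` is a hypothetical helper name, not an existing CHI method):

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: turn any key into UTF-8 bytes without
# ever looking at the UTF8 flag.
sub encode_key {
    my ($key) = @_;
    return Encode::encode('UTF-8', $key);
}

my $off = "na\x{ef}ve";   # flag off
my $on  = $off;
utf8::upgrade($on);       # flag on, same characters
my $wide = "\x{263a}b";   # contains a character > 255

# %vd prints the byte values joined by dots.
print join(' ', map { sprintf '%vd', encode_key($_) } $off, $on, $wide), "\n";
# prints: 110.97.195.175.118.101 110.97.195.175.118.101 226.152.186.98
```

Both forms of `"naïve"` map to the same six bytes, and the wide key becomes plain bytes that any backend can store.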
Thanks for your message Aristotle. I still have a tenuous grasp of
these issues, so I appreciate your advice and anything further you can
provide!
Here's how the code looks now for keys and values.
KEYS
* Any key passed to a CHI operation (get, set, remove, etc.) is UTF-8
encoded iff its utf-8 flag is on.
* The encoding is a one-way operation. We don't record that the key
was encoded, and get_keys does not attempt to decode it. (There is no
real support in CHI for storing meta-data about keys.)
The rationale here is that
* I want to use utf-8 strings for keys even in drivers that can't
handle wide characters
* I want to be able to pass the results of get_keys() back into get()
and have it still retrieve the same object, without double-encoding it
(though I realize this will break if someone calls get_keys(), then
somehow turns the utf-8 flag back on before passing it back into get())
* I want to be backwards compatible with existing caches with binary
string keys - thus I cannot encode all keys blindly
http://github.com/jonswar/perl-chi/blob/master/lib/CHI/Driver.pm#L500-502
VALUES
* Any scalar value passed to set is encoded iff the utf-8 flag is on.
* The encoding is a two-way operation. We record the fact that the
value was encoded, and we decode it when retrieving it from the cache.
The rationale here is that
* I want to be able to store utf-8 strings as values even in drivers
that can't handle wide characters
* I want the values to come out exactly the same way as when they were
stored
* I want to be backwards compatible with existing caches with binary
string values - thus I cannot decode all values blindly
http://github.com/jonswar/perl-chi/blob/master/lib/CHI/CacheObject.pm#L60-62
http://github.com/jonswar/perl-chi/blob/master/lib/CHI/CacheObject.pm#L131-132
Here's a test class that attempts to confirm some of this:
http://github.com/jonswar/perl-chi/blob/master/lib/CHI/t/Encode.pm
So. I'm consulting the utf-8 flag in both cases, even though I
understand from all the docs that it is "wrong" to depend on this
flag. But I can't figure out a better way to get the behavior and the
backward compatibility that I want without consulting the flag.
Feedback welcome.
Jon
* Jonathan Swartz <swa...@pobox.com> [2010-06-03 19:30]:
> KEYS
>
> * Any key passed to a CHI operation (get, set, remove, etc.) is
> UTF-8 encoded iff its utf-8 flag is on.
> * The encoding is a one-way operation. We don't record that the
> key was encoded, and get_keys does not attempt to decode it.
> (There is no real support in CHI for storing meta-data about
> keys.)
>
> The rationale here is that
> * I want to use utf-8 strings for keys even in drivers that
> can't handle wide characters
> * I want to be able to pass the results of get_keys() back into
> get() and have it still retrieve the same object, without
> double-encoding it (though I realize this will break if
> someone calls get_keys(), then somehow turns the utf-8 flag
> back on before passing it back into get())
> * I want to be backwards compatible with existing caches with
> binary string keys - thus I cannot encode all keys blindly
>
> http://github.com/jonswar/perl-chi/blob/master/lib/CHI/Driver.pm#L500-502
That’s a problem.
Keys with only characters < 128 will always yield the same value
because their representation is the same regardless of the UTF8
flag, and keys with characters > 255 will also always yield the
same value because they can only be stored in strings with the
UTF8 flag on.
But for keys with characters in the 128..255 range, there are two
possible internal representations. So a string which contains
such characters will correspond to two different keys, depending
on its UTF8 flag. Different code paths that should access the
same key might therefore end up accessing different keys. This is,
to put it poetically, schizophrenic.
The right thing to do is to either always encode strings for use
as keys (= backends do not have to handle wide characters), or
never encode them (= backends have to decide for themselves how
to handle characters > 255) – rather than encoding them sometimes
and not encoding them other times.
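
Here is the failure in a few lines. This is a model of "encode iff the flag is on" as described in the quoted text, not CHI's actual code, and `flag_conditional_key` is a made-up name:

```perl
use strict;
use warnings;
use Encode ();

# Model of the flag-dependent scheme: encode only when the UTF8 flag is on.
sub flag_conditional_key {
    my ($key) = @_;
    return utf8::is_utf8($key) ? Encode::encode('UTF-8', $key) : $key;
}

my $off = "na\x{ef}ve";   # character 239 is in the 128..255 range
my $on  = $off;
utf8::upgrade($on);       # same string, flag on

my $key_off = flag_conditional_key($off);
my $key_on  = flag_conditional_key($on);

print "same string: ", ($off eq $on ? "yes" : "no"), "\n";         # yes
print "same key:    ", ($key_off eq $key_on ? "yes" : "no"), "\n"; # no
```

One logical string, two cache keys: five bytes when the flag was off, six when it was on.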
Note that whichever of these changes you make, the only data that
will be affected by this change is data that CHI already handles
in a schizophrenic fashion.
There is no sane solution to centralise the handling of big
characters in CHI if you are aiming for zero compatibility
breakage.
I would seriously opt for the null strategy: simply document
that backends are required to handle big characters in whichever
way they deem best for themselves, unless they tell CHI that it
should encode keys for them, in which case CHI would *always*
encode keys. This way, old backends that were broken WRT big
characters continue to be broken in exactly the same way as they
used to be, i.e. compatibility is automatic. New backends (or new
backend releases) would take this into account.
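
Roughly what I have in mind, as a sketch only. The per-driver switch is hypothetical (CHI has no such option that I know of), and `key_for_backend` is a made-up name:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical: $driver_wants_encoding is something a driver would declare
# once. If set, CHI encodes every key; if not, keys pass through untouched.
# The UTF8 flag is never consulted in either case.
sub key_for_backend {
    my ($driver_wants_encoding, $key) = @_;
    return $driver_wants_encoding ? Encode::encode('UTF-8', $key) : $key;
}

my $off = "na\x{ef}ve";
my $on  = $off;
utf8::upgrade($on);

for my $wants (0, 1) {
    my $same = key_for_backend($wants, $off) eq key_for_backend($wants, $on);
    print "driver opts in: $wants, same key for both forms: ",
        ($same ? "yes" : "no"), "\n";
}
```

Either way the decision is made per driver, not per string, so the same logical key always takes the same path.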
> VALUES
>
> * Any scalar value passed to set is encoded iff the utf-8 flag
> is on.
> * The encoding is a two-way operation. We record the fact that
> the value was encoded, and we decode it when retrieving it
> from the cache.
>
> The rationale here is that
> * I want to be able to store utf-8 strings as values even in
> drivers that can't handle wide characters
> * I want the values to come out exactly the same way as when
> they were stored
> * I want to be backwards compatible with existing caches with
> binary string values - thus I cannot decode all values blindly
>
> http://github.com/jonswar/perl-chi/blob/master/lib/CHI/CacheObject.pm#L60-62
> http://github.com/jonswar/perl-chi/blob/master/lib/CHI/CacheObject.pm#L131-132
This is sane.
You get different results depending on whether the UTF8 flag is
on or off, but you also process them differently, so that it
cancels out on the bottom line.
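
A sketch of why it cancels out. The hash layout and the names `pack_value` and `unpack_value` are invented for illustration and are not how CHI::CacheObject stores things:

```perl
use strict;
use warnings;
use Encode ();

# Encode only flagged values, and remember that we did so.
sub pack_value {
    my ($value) = @_;
    return utf8::is_utf8($value)
        ? { is_encoded => 1, raw => Encode::encode('UTF-8', $value) }
        : { is_encoded => 0, raw => $value };
}

# Undo exactly what pack_value did, based on the recorded marker.
sub unpack_value {
    my ($stored) = @_;
    my $raw = $stored->{raw};
    return $stored->{is_encoded} ? Encode::decode('UTF-8', $raw) : $raw;
}

my $chars = "\x{263a}b";                                  # flag on
my $bytes = join '', map { chr } (226, 152, 186, 98);     # flag off

for my $value ($chars, $bytes) {
    my $back = unpack_value(pack_value($value));
    print "round trip ok: ", ($back eq $value ? "yes" : "no"), "\n";
}
```

Because the marker travels with the value, the decision made on the way in is reversed on the way out, which is exactly what keys lack.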
> Here's a test class that attempts to confirm some of this:
>
> http://github.com/jonswar/perl-chi/blob/master/lib/CHI/t/Encode.pm
>
> So. I'm consulting the utf-8 flag in both cases, even though
> I understand from all the docs that it is "wrong" to depend on
> this flag. But I can't figure out a better way to get the
> behavior and the backward compatibility that I want without
> consulting the flag.
There is no way to get both. Consulting the flag merely trades
one set of broken behaviours for another.
Yes, I see the problem.
Ok, one more try: What if I only encoded strings that contained wide
characters? e.g.
    if (is_utf8($key) && $key =~ /[^\x00-\xFF]/) {
        $key = encode(utf8 => $key);
    }
Then there is no way for a key with characters in the 128..255 range to be
stored as two different keys.
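
Spelled out as a runnable sketch (`wide_only_key` is just a name for this example). Since characters above 255 can only live in flagged strings, the regex alone decides and the is_utf8 check can be dropped:

```perl
use strict;
use warnings;
use Encode ();

# Encode a key only if it contains a character above 255.
sub wide_only_key {
    my ($key) = @_;
    return $key =~ /[^\x00-\xFF]/ ? Encode::encode('UTF-8', $key) : $key;
}

my $off = "na\x{ef}ve";   # 128..255 character, flag off
my $on  = $off;
utf8::upgrade($on);       # same string, flag on
my $wide = "\x{263a}b";   # needs encoding

print "128..255 keys agree: ",
    (wide_only_key($off) eq wide_only_key($on) ? "yes" : "no"), "\n";
print "wide key is bytes:   ",
    (utf8::is_utf8(wide_only_key($wide)) ? "no" : "yes"), "\n";
```

Note that the 128..255 key is returned untouched, so it still carries whatever flag it came in with; whether a given backend then treats the two forms as one key is a separate matter.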
I know that it seems simpler and more correct to encode all keys. But
if I do that, I have to decode them all on the way back in (otherwise
I'll get double-encoding when people pass the results of get_keys() or
get_object()->key() back into CHI), which is undesirable (I have to
capture and filter all calls that return keys).
Letting the backends take care of this themselves will either amount
to the same thing, or result in inconsistent behavior.
> Note that whichever of these changes you make, the only data that
> will be affected by this change is data that CHI already handles
> in a schizophrenic fashion.
I don't see that - right now, keys with chars in the 128..255 range
are always handled as binary chars.
Thanks
Jon
#!/usr/bin/perl -w
use Cache::FastMmap;
use Carp::Assert;
use DBI;
use DBD::SQLite;
use strict;
my $binary_off = chr(129);
my $binary_on = substr($binary_off . "\x{263a}", 0, length($binary_off));
assert($binary_off eq $binary_on);
print "** sqlite **\n";
unlink("sqlite.dat");
my $dbh = DBI->connect("dbi:SQLite:dbname=sqlite.dat","","");
$dbh->do("create table foo (key text)");
my $sth = $dbh->do("insert into foo values (?)", {}, $binary_off);
print "binary_off: " . $dbh->selectcol_arrayref("select count(*) from foo where key = ?", {}, $binary_off)->[0] . "\n";
print "binary_on: " . $dbh->selectcol_arrayref("select count(*) from foo where key = ?", {}, $binary_on)->[0] . "\n";
print "** fastmmap **\n";
my $cache = Cache::FastMmap->new();
$cache->set($binary_off, 5);
print "binary_off: " . defined($cache->get($binary_off)) . "\n";
print "binary_on: " . defined($cache->get($binary_on)) . "\n";
This prints
** sqlite **
binary_off: 1
binary_on: 0
** fastmmap **
binary_off: 1
binary_on:
Meaning that, even though $binary_off eq $binary_on, both sqlite and
fastmmap treat them as distinct.
Jon