[mongodb-user] should turn off utf8 flag in BSON string (Perl driver)?

nightsailer

May 18, 2010, 10:41:11 AM
to mongodb-user
In the Perl driver, BSON string values automatically get the UTF8 flag turned on,
but this makes them hard to display in a page, HTML, or a template.

I have to turn the UTF8 flag off, like:

utf8::encode($content);

or

binmode(STDOUT, ':encoding(utf8)');

The second seems quick and dirty, but it is hard to make it work with many web
frameworks, such as Dancer or other PSGI-compatible frameworks.

So my hack is to just turn off the UTF8 flag in perl_mongo.c:

value = newSVpvn(buf->pos, len-1);
// SvUTF8_on(value);

I wonder: will this cause any issues, or will it just work?

--
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To post to this group, send email to mongod...@googlegroups.com.
To unsubscribe from this group, send email to mongodb-user...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.

Kristina Chodorow

May 18, 2010, 2:20:31 PM
to mongod...@googlegroups.com
The problem is that, if you turn off the UTF8 flag, string functions (like getting the length) will return the wrong value.
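
To illustrate the point about string functions, here is a minimal sketch in plain Perl (no MongoDB involved). It compares length on a character string with length on the same text held as UTF-8 octets; the variable names are made up for the example.

```perl
use strict;
use warnings;

# Two CJK characters, written as escapes so the example does not depend
# on the encoding of this source file.
my $chars = "\x{65E5}\x{672C}";

# Copy the string and encode the copy in place to UTF-8 octets.
# The copy no longer carries the UTF8 flag.
my $bytes = $chars;
utf8::encode($bytes);

printf "characters: %d\n", length $chars;   # counts code points
printf "octets:     %d\n", length $bytes;   # counts UTF-8 bytes
```

The same difference shows up in substr, index, regex character classes and anything else that works per character.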

Tim Hawkins

May 18, 2010, 2:27:53 PM
to mongod...@googlegroups.com
Why not just set the encoding of your pages to UTF-8? Then everything works out fine.
I don't see any reason not to use UTF-8 in this day and age.

nightsailer

May 18, 2010, 11:51:15 PM
to mongodb-user
All my pages are UTF-8 encoded. That works fine with MySQL and with PHP plus
MongoDB, but it does not work with the Perl MongoDB driver because of its UTF8 flag:
the template text is displayed garbled, while the MongoDB data is displayed correctly.

So I don't mean "don't use UTF-8 encoding"; I actually use it every day.

What I mean is: don't turn on the UTF8 flag, so that the string stays a Perl
byte string (a sequence of octets) rather than a character string; see the
Perl utf8 manual.


So I disabled the UTF8 flag, and also fixed a minor bug in the driver ;-)

Now it works fine, and it also works with other PHP stuff.


I've pushed my fork:

http://github.com/nightsailer/ns-mongo-perl-driver

Please review it.


With the flag on, the page displays weirdly and Perl throws warnings like
"Wide character in print".


nightsailer

May 19, 2010, 12:07:32 AM
to mongodb-user
On May 19, 2:20 AM, Kristina Chodorow <krist...@10gen.com> wrote:
> The problem is that, if you turn off the UTF8 flag, string functions (like
> getting the length) will return the wrong value.


I see: it will return the length in bytes, because the string is a byte string
in Perl's internal form.
I don't mind that; you can always decode it when you need the length in characters.

With the UTF8 flag on, it is hard to display the UTF-8 data correctly.
I'm stuck applying utf8::encode to every row fetched from MongoDB, or calling
binmode(STDIN ...), and sometimes STDIN or STDOUT gets redirected or reopened,
for example during template rendering.

Many template engines (like TT) can't render a template correctly once strings
with the UTF8 flag are involved. We save the template in UTF-8 encoding, but Perl
reads it as a byte string. So when it is mixed with a MongoDB result, Perl warns
"Wide character in print" and the template text is displayed garbled, while the
MongoDB result is displayed correctly.
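
The mixing problem described above can be reproduced without a template engine or a database. The following is a sketch under assumptions: $template_bytes stands in for template text read without an encoding layer, and $db_chars stands in for a value returned by the driver with the UTF8 flag on.

```perl
use strict;
use warnings;

# Template text as read from a file with no encoding layer: raw UTF-8
# octets for "café " (the é is the two bytes 0xC3 0xA9).
my $template_bytes = "caf\xC3\xA9 ";

# Stand-in for a driver result: a character string with the UTF8 flag on.
my $db_chars = "\x{65E5}\x{672C}";

# Concatenation makes Perl treat each template byte as one Latin-1
# character, so 0xC3 and 0xA9 become two separate characters.
my $mixed = $template_bytes . $db_chars;

# Encoding the mixed string for output encodes those two characters a
# second time, which is the garbled template text.
my $garbled = $mixed;
utf8::encode($garbled);

# Decoding the template first keeps both parts in character form.
my $template_chars = $template_bytes;
utf8::decode($template_chars);
my $clean = $template_chars . $db_chars;
utf8::encode($clean);

printf "garbled: %d octets, clean: %d octets\n",
    length $garbled, length $clean;
```

Only the template part is damaged in the mixed case, which matches the symptom reported here: the MongoDB data looks right and the surrounding page does not.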

I spent a whole day fighting this problem and found that, if I keep the
MongoDB driver as it is, I would have to modify many frameworks, like Dancer
and Catalyst. So in the end I just hacked the MongoDB driver.

I've used MySQL before (with data also stored in UTF-8 encoding) without
any problem.

Maybe I'm doing something wrong; I'd be pleased to hear any further
advice.


nightsailer.

2010. 5. 19


Kristina Chodorow

May 19, 2010, 8:19:25 AM
to mongod...@googlegroups.com
I think you should just call utf8::encode on your strings before displaying them.  All strings in Mongo are UTF-8, so it makes more sense for the database to return UTF-8 strings.  I know it's a pain and Perl does the stupidest handling of UTF-8 I've ever seen, but UTF-8 strings really should be indicated as such.
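
One way to avoid calling utf8::encode on every field by hand is a small recursive helper. The sketch below is only an illustration: encode_strings is a hypothetical function, not part of the MongoDB driver, and it assumes that every string coming from the driver carries the UTF8 flag.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the MongoDB driver): return a copy of a
# fetched document in which every flagged string value is UTF-8 octets.
# Hash keys are left as they are.
sub encode_strings {
    my ($data) = @_;
    my $type = ref $data;

    if ($type eq 'HASH') {
        my %copy;
        $copy{$_} = encode_strings($data->{$_}) for keys %$data;
        return \%copy;
    }
    if ($type eq 'ARRAY') {
        return [ map { encode_strings($_) } @$data ];
    }

    # Leave undef and other references (objects, code refs) untouched.
    return $data if $type ne '' || !defined $data;

    # Plain scalar: encode only strings that carry the UTF8 flag.
    my $copy = $data;
    utf8::encode($copy) if utf8::is_utf8($copy);
    return $copy;
}

# A made-up document standing in for a row fetched from MongoDB.
my $doc = {
    title => "\x{65E5}\x{672C}",
    tags  => [ "\x{4E2D}", "plain" ],
    views => 3,
};
my $plain = encode_strings($doc);
print length($plain->{title}), " octets in title\n";
```

Returning a copy rather than encoding in place keeps the original character strings available for length, sorting and matching, and moves the encoding step to the point where the data is handed to the template.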



nightsailer

May 19, 2010, 9:51:15 AM
to mongodb-user
Yeah, just utf8::encode :-(
But that is such a blanket thing to do: every field of every row fetched
has to be encoded. It's a clumsy way, and it feels like it sucks.

Please also consider DBI drivers such as MySQL or Oracle: those databases
are UTF-8 too, so why do they return everything as bytes rather than as
character strings?

Currently your driver also automatically converts byte strings back to UTF-8
when sending them to the database, so I don't see why the returned strings
must be flagged this way.

Please also consider the other Perl users who use UTF-8 in the real world.
I bet most of them are not CJK users, which means the UTF8 flag being on or
off makes no difference to them.

There are fewer Perl users than Ruby or PHP users, but if MongoDB is to work
with modern frameworks like Catalyst, this problem should be fixed the right way.

I hope you'll consider a handier way to resolve this problem (like adding a
switch to the options of find/query), but for now I just use my hack.

BTW, there seems to be a minor bug when processing IxHash; I'll send you a
pull request.




nightsailer

May 19, 2010, 2:22:22 PM
to mongodb-user
Hi Kristina:

I've added a $MongoDB::BSON::utf8_flag_on switch to turn the UTF8 flag
on or off.

The default is 1, which means the behavior is the same as in the previous version.
If you set $MongoDB::BSON::utf8_flag_on = 0, SvUTF8_on is bypassed.

The fork is at:

http://github.com/nightsailer/ns-mongo-perl-driver


Finally, I read the Perl manual on Unicode again, along with other related
books.

As I understand it, Perl reads files as bytes by default, so although the
template file is UTF-8 encoded, Perl doesn't care and just reads it in byte mode.

The MongoDB result is treated as UTF-8, so it is printed in character mode.

In the end the two modes are mixed, so the output is garbled.

There are two workarounds:

1. Modify the template engine to read templates as UTF-8, e.g.
binmode(STDIN, ':utf8')
2. Treat the MongoDB result as a byte string.


I prefer 2: it's easy, and compatible with other template engines
(I currently use TT2).
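
For comparison, workaround 1 can be sketched in plain Perl without touching a template engine: open the template with an encoding layer so it arrives as a character string, combine it with the data, and encode once on output. The temporary file and the $db_value stand-in are assumptions made for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small UTF-8 encoded "template" to a temporary file.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} "Title: \xE6\x97\xA5\xE6\x9C\xAC\n";   # raw UTF-8 octets
close $fh or die "close: $!";

# Workaround 1: decode the template while reading it.
open my $in, '<:encoding(UTF-8)', $path or die "open: $!";
my $template = do { local $/; <$in> };
close $in;

# Both pieces are now character strings, so they combine safely.
my $db_value = "\x{672C}";   # stands in for a driver result
my $page = $template . $db_value;

# Encode once, at the output boundary.
binmode STDOUT, ':encoding(UTF-8)';
print $page, "\n";
```

Whether a given template engine offers an option to read its files this way depends on the engine, so check its documentation before relying on this approach.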






Kristina Chodorow

May 21, 2010, 2:32:31 PM
to mongod...@googlegroups.com
I'm still not sure this makes sense... I think the only reason this
"works" with relational drivers is that they don't know what encoding
the db is using.

If you create a string in Perl, like:

$x = "Ã";

and then check if it's UTF8:

print utf8::is_utf8($x);

it is (provided "use utf8;" is in effect, so that the literal is parsed as
characters). It's a little weird to create a UTF8 string, save it to a
database that uses UTF8 encoding, and get back a non-UTF8 string.

If you still want it, how about you make a feature request at
jira.mongodb.org and if you can get a few votes for it, I'll merge in
MongoDB::BSON::utf8_flag_on?



nightsailer

May 23, 2010, 11:06:33 AM
to mongodb-user


On May 22, 2:32 AM, Kristina Chodorow <krist...@10gen.com> wrote:
> I'm still not sure this makes sense... I think the only reason this
> "works" with relational drivers is that they don't know what encoding
> the db is using.
>
> If you create a string in Perl, like:
>
> $x = "Ã";
>
> and then check if it's UTF8:
>
> print utf8::is_utf8($x);
>

Yes, it is. But if you then do:
print $x;

it throws a warning:

"Wide character in print ...."

so you must do:

$x=encode('utf8',$x);
print $x;

This is the same as with data fetched from MongoDB.

Why?

Why is the encode needed?

As the Perl manual says:

"Encodes a string from Perl's internal form into ENCODING and returns
a sequence of octets.
When you run $octets = encode("utf8", $string), then $octets may not
be equal to $string. Though they both contain the same data, the UTF8
flag for $octets is always off. When you encode anything, UTF8 flag of
the result is always off, even when it contains completely valid utf8
string."


I think Perl's default I/O can't output UTF-8, so printing a string with the
UTF8 flag on causes the warning.
encode just produces a string with this flag turned off.

The data the MongoDB driver returns is already UTF-8, but turning this flag
on currently causes my problem.


You know, Perl I/O currently can't output a UTF-8 character sequence by
default; it just prints the data as a byte string, unless you force it to
treat the data as UTF-8:

binmode STDOUT, ':utf8';


That's all.
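
The behavior described above can be checked with in-memory filehandles, so nothing is written to the terminal. This is a sketch, not driver code: one handle has no encoding layer, the other has one, and a __WARN__ handler collects any warnings.

```perl
use strict;
use warnings;

my $x = "\x{65E5}";   # one character above 0xFF

# Collect warnings instead of printing them.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# 1. Handle without an encoding layer: Perl warns and writes UTF-8 octets.
my $raw = '';
open my $plain, '>', \$raw or die "open: $!";
print {$plain} $x;
close $plain;

# 2. Handle with an encoding layer: the same octets, without a warning.
my $layered = '';
open my $utf8, '>:encoding(UTF-8)', \$layered or die "open: $!";
print {$utf8} $x;
close $utf8;

printf "warnings: %d, same octets: %s\n",
    scalar @warnings, ($raw eq $layered ? 'yes' : 'no');
```

Characters in the range 0x80 to 0xFF behave differently on a handle without a layer: they are written as single Latin-1 bytes and no warning is raised, which is why the problem tends to surface with CJK text first.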

I guess most Perl users are Latin-script users, so they don't run into this
problem.

Anyone who uses non-Latin strings in a real-world project will have to
fight this problem.



> it is.  It's a little weird to create a UTF8 string, save it to a
> database that uses UTF8 encoding, and get back a non-UTF8 string.
>

I don't think of it as getting back a non-UTF8 string; rather, it comes back
as a byte string in Perl's form, and at any time you can convert it to a
character string with decode('utf8', $v).