Unicode EN Dash character + php driver

Chris Clarke

Feb 15, 2012, 4:43:32 PM
to mongod...@googlegroups.com
I have some data in a mongo document that includes the EN DASH Unicode character (http://www.fileformat.info/info/unicode/char/2013/index.htm). 

When queried via the mongo shell I have something like this:

{
_id: "someid",
title: "Semiconductor Physics and Devices (2011/12 – Semester 1)"
}

However, using the php driver, when I query this string and print_r the result I am getting:

(
        [_id] => someid
        [title] => Semiconductor Physics and Devices (2011/12 \u2013 Semester 1)
)

I feel like I am doing something very obviously wrong here...

Chris

Nat

Feb 15, 2012, 9:31:06 PM
to mongod...@googlegroups.com
It looks correct to me. Which OS, terminal, and font are you using? Certain configurations cannot display non-ASCII characters properly.

Andreas Jung

Feb 16, 2012, 12:32:54 AM
to mongod...@googlegroups.com
Chris Clarke wrote:

> ( [_id] => someid [title] => Semiconductor Physics and Devices
> (2011/12 \u2013 Semester 1) )

What else do you expect? The string is likely returned as a Unicode
string in PHP (I am not a PHP guru) and displayed by print_r using its
*internal* representation. That's what print_r is for. You need to
convert it to UTF-8, of course, to render it in "human-readable" form.

-aj
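
For reference, PHP strings are plain byte sequences, and print_r() writes those bytes out as they are. The sketch below is one way to check that; the $doc array is only an illustrative stand-in for a query result, not the actual document from this thread.

```php
<?php
// Illustrative document; the title holds U+2013 EN DASH as raw UTF-8 bytes.
$doc = array('title' => "2011/12 \xE2\x80\x93 Semester 1");

// Passing true as the second argument makes print_r() return its output
// as a string instead of printing it.
$out = print_r($doc, true);

// The three UTF-8 bytes appear in the output unchanged.
var_dump(strpos($out, "\xE2\x80\x93") !== false);

// print_r() does not introduce a backslash-u escape sequence by itself.
var_dump(strpos($out, '\u2013') === false);
```

If a literal \u2013 shows up in print_r() output, those six characters are most likely already part of the string that was stored.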

Chris Clarke

Feb 16, 2012, 9:50:47 AM
to mongodb-user
That doesn't smell right.

This data is put in (via the PHP driver) using UTF-8 by json_encoding
the document before inserting. This stores it in DB as UTF-8.

Are you saying the php driver (which one can only assume is doing a
json_decode) is then converting to Unicode?

By the way, echoing the value (rather than using print_r) gives the
same result.

Chris
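
One possible explanation, which nothing in this thread confirms: PHP's json_encode() escapes non-ASCII characters as \uXXXX by default, so a value kept in its JSON-encoded form contains the literal six characters \u2013 rather than the UTF-8 bytes. A minimal sketch of that behaviour follows; the $title value is illustrative only, and JSON_UNESCAPED_UNICODE requires PHP 5.4 or later.

```php
<?php
// Illustrative value containing U+2013 EN DASH as raw UTF-8 (bytes E2 80 93).
$title = "Semester 1 \xE2\x80\x93 2012";

// By default json_encode() escapes non-ASCII characters as \uXXXX.
$escaped = json_encode($title);

// JSON_UNESCAPED_UNICODE (PHP 5.4+) keeps the raw UTF-8 bytes instead.
$raw = json_encode($title, JSON_UNESCAPED_UNICODE);

// json_decode() turns the escaped form back into the original UTF-8 string.
$roundTrip = json_decode($escaped);

echo $escaped, "\n";
echo $raw, "\n";
echo $roundTrip, "\n";
```

If the escaped JSON text itself ends up being stored, or a string is encoded twice, the backslash sequence would survive a later read, which would match the output shown in the first message.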



Sam Millman

Feb 16, 2012, 10:01:26 AM
to mongod...@googlegroups.com
Can we see the code that saves the unicode character to DB in PHP?

PHP should not convert your character to either UTF8 (which it isn't) or unicode without your express permission. There might be something in the driver that triggers a weird case where this occurs, but I would bet it's other code.

Also, is this:

(
        [_id] => someid
        [title] => Semiconductor Physics and Devices (2011/12 \u2013 Semester 1)
)

an exact representation of your output document, or is it formatted without "" enclosures?





Derick Rethans

Feb 16, 2012, 10:31:56 AM
to mongod...@googlegroups.com
On Wed, 15 Feb 2012, Chris Clarke wrote:

> I have some data in a mongo document that includes the EN DASH Unicode
> character
> (http://www.fileformat.info/info/unicode/char/2013/index.htm).
>
> When queried via the mongo shell I have something like this:
>
> {
> _id: "someid",
> title: "Semiconductor Physics and Devices (2011/12 – Semester 1)"
> }
>
> However, using the php driver, when I query this string and print_r
> the result I am getting:
>
> (
> [_id] => someid
> [title] => Semiconductor Physics and Devices (2011/12 \u2013 Semester 1)
> )

I can't reproduce this:

derick@whisky:/tmp$ cat unicode.php

<?php
$m = new Mongo();
$m->demo->test->insert( array( '_id' => 'unicode', 'unicode' => '–' ) );

$r = $m->demo->test->findOne( array( '_id' => 'unicode' ) );
var_dump($r);
print_r($r);
?>

derick@whisky:/tmp$ php unicode.php

array(2) {
["_id"]=>
string(7) "unicode"
["unicode"]=>
string(3) "–"
}
Array
(
[_id] => unicode
[unicode] => –
)
)

How are you adding the data? Could you do a var_dump() of your arguments
to insert/update?

cheers,
Derick

--
http://mongodb.org | http://derickrethans.nl
twitter: @derickr and @mongodb

Derick Rethans

Feb 16, 2012, 10:33:17 AM
to mongod...@googlegroups.com
On Thu, 16 Feb 2012, Sam Millman wrote:

> Can we see the code that saves the unicode character to DB in PHP?
>
> PHP should not convert your character to either UTF8 (which it isn't) or
> unicode without your express permission.

Just to clarify, PHP does not do any conversion between character sets,
and neither does the driver. The user is responsible for providing the
driver with UTF-8 strings.
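
As a rough illustration of checking that responsibility on the application side, the sketch below tests whether a string is well-formed UTF-8 before it is handed to the driver. The is_utf8() helper is hypothetical (it is not part of PHP or the driver); it relies on preg_match() with the /u modifier failing on malformed UTF-8 subjects.

```php
<?php
// Hypothetical helper: returns true if $s is well-formed UTF-8.
// preg_match() with the /u modifier returns false for malformed UTF-8 input.
function is_utf8($s)
{
    return preg_match('//u', $s) === 1;
}

$enDashUtf8   = "\xE2\x80\x93"; // U+2013 EN DASH encoded as UTF-8
$enDashCp1252 = "\x96";         // the same character in Windows-1252

var_dump(strlen($enDashUtf8));   // byte length of the UTF-8 form
var_dump(is_utf8($enDashUtf8));
var_dump(is_utf8($enDashCp1252));
```

The UTF-8 form is three bytes long, which lines up with the string(3) in the var_dump() output of the test script earlier in the thread.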
