UTF8 between Java and PHP


ccou...@gmail.com

May 5, 2015, 8:52:13 AM
to caucho-...@googlegroups.com
More issues with UTF8 encoding!?

I have Java code that returns a list retrieved from the DB as a String[]. When I print the entries to the console, they show the proper encoding:
Elément1
Elément2
âéèà
...

In PHP code, I retrieve that list using something like this:

$qws = java_class ( "com.java.Utils" );

$result = $qws->getList();


Unfortunately when I display that list in PHP, I get this:

El�ment1
El�ment2
����
...


There's something definitely wrong here, but I can't find any solution.
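
For reference, output like the above is typical when text is encoded with one charset and decoded with another. Here is a minimal Java sketch of that effect, assuming the bytes are written as ISO-8859-1 and then read as UTF-8. The class and method names are hypothetical, and this is not a confirmed diagnosis of what Quercus does internally:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Quercus or the code in this thread.
final class MojibakeDemo {

    // Encode text as ISO-8859-1, then decode the same bytes as UTF-8.
    // Byte sequences that are not valid UTF-8 are replaced with U+FFFD,
    // which browsers and consoles usually render as a question-mark glyph.
    static String encodeLatin1DecodeUtf8(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "El\u00e9ment1"; // "Elément1", written with an escape
        String garbled = encodeLatin1DecodeUtf8(original);
        // Report whether the replacement character appeared.
        System.out.println(garbled.indexOf('\uFFFD') >= 0);
    }
}
```

Plain ASCII text survives this mismatch unchanged, which is why only the accented characters are affected.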


Database is set to use UTF8 encoding, with the following resource in context.xml:

connectionProperties="useCompression=true;useUnicode=true;characterEncoding=utf8;characterSetResults=utf8;connectionCollation=utf8_general_ci"


Then in php.ini, I've set this:

unicode.output_encoding=utf-8
unicode.runtime_encoding=utf-8 (also tried with iso-8859-1)
unicode-semantics=on


And html content is also set to UTF8:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />



This is becoming extremely annoying, and I'm wondering whether using Quercus was a good idea at all. I can't prove to my client that Resin will be a great performance improvement until I resolve this.


The only thing that works with UTF8 is this:

$dbh = mysqli_connect();
$dbh->set_charset("utf8");

Everything else just fails.

Kaz Nishimura

May 5, 2015, 6:41:33 PM
to Cedric Counotte, caucho-...@googlegroups.com

That would be because Java Strings are normally in UTF-16.  If you specify characterEncoding=utf8, UTF-8 will be used on wire and the MySQL JDBC driver will convert it to UTF-16 after receiving data since Java normally operates in UTF-16.
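
As a small illustration of that conversion, here is a sketch in plain Java (no JDBC involved) of UTF-8 bytes being decoded into a String made of UTF-16 code units. The class name is hypothetical, and the byte array just stands in for data arriving from the server:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only mimics the decoding step a driver performs.
final class WireToStringDemo {

    // Decode UTF-8 bytes, as they would arrive on the wire, into a Java String.
    static String decodeUtf8(byte[] wireBytes) {
        return new String(wireBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "é" is the two bytes 0xC3 0xA9 in UTF-8.
        byte[] wire = {(byte) 0xC3, (byte) 0xA9};
        String s = decodeUtf8(wire);
        System.out.println(wire.length);       // 2 (bytes on the wire)
        System.out.println(s.length());        // 1 (UTF-16 code unit in memory)
        System.out.println((int) s.charAt(0)); // 233, i.e. U+00E9
    }
}
```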

Quercus Mysqli uses characterEncoding=8859_1 by default to suppress encoding conversion in the driver, AFAIK. It then issues the MySQL SET NAMES command so that communication with the server uses the encoding you asked for. Since the first 256 characters of Unicode are the same as ISO 8859-1, the JDBC driver will not convert the encoding of strings, and your PHP scripts will get strings in the encoding specified in the SET NAMES command.  Quite an ugly implementation IMO.
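
The reason ISO 8859-1 works as a pass-through can be sketched in plain Java: decoding arbitrary bytes as ISO-8859-1 yields one char per byte, and encoding back returns exactly the same bytes. The class and method names below are hypothetical, and this only illustrates the principle, not the actual Quercus code:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class showing a lossless byte-to-char round trip.
final class Latin1PassThroughDemo {

    // Carry raw bytes inside a String, one char (0x00..0xFF) per byte.
    static String wrap(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Recover the raw bytes from such a carrier String.
    static byte[] unwrap(String carrier) {
        return carrier.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "Elément1" encoded as UTF-8 is 9 bytes long.
        byte[] utf8 = "El\u00e9ment1".getBytes(StandardCharsets.UTF_8);
        String carrier = wrap(utf8);
        System.out.println(carrier.length());                     // 9
        System.out.println(Arrays.equals(unwrap(carrier), utf8)); // true
    }
}
```

The carrier String is not meaningful text on the Java side. It only becomes readable again once the bytes are interpreted as UTF-8.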

If you use unicode.semantics=on in php.ini, your PHP scripts will work with UTF-16 strings and the story will change.


ccou...@gmail.com

May 6, 2015, 4:47:17 AM
to caucho-...@googlegroups.com, ccou...@gmail.com
Thanks a lot for your reply.

I added the semantics setting to php.ini and verified that php.ini is being read (I set a specific log path and I see logs there).

The end result is quite interesting: I still get weird characters in the HTML output, but PHP output to the log is OK!

So I tried changing the html output to UTF-16 like this:
<meta http-equiv="Content-Type" content="text/html; charset=utf-16" />

That didn't work either. I'm quite lost as to what is really going on here.

ccou...@gmail.com

May 6, 2015, 5:25:29 AM
to caucho-...@googlegroups.com, ccou...@gmail.com
Thanks to Kaz for his hints. I tried a few things and ended up finding a workaround of sorts. I wish there were a more solid solution, but I didn't find one:

I have to convert all strings received from Java to UTF-8, doing this:
mb_convert_encoding($string_from_java, 'UTF-8')



I tried several options that all failed one way or another:

- This creates an infinite loop in the server and the page never loads: mb_convert_encoding($string_from_java, 'UTF-8', 'UTF-16')! As if the original string were not in UTF-16!
- Setting the HTTP content-type to UTF-16 doesn't help in any way!
- Setting unicode.output_encoding=utf-16 doesn't either!

Kaz Nishimura

May 6, 2015, 6:53:13 AM
to ccou...@gmail.com, caucho-...@googlegroups.com
If you have unicode.semantics=on, all internal Quercus strings must be encoded in UTF-16 and they will be converted into the appropriate encoding when written to HTTP.  So what is the value of your mbstring.http_output?  If it is UTF-8, the output should be encoded as such.
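
The general idea of converting at the output boundary can be sketched in plain Java with an OutputStreamWriter. This is only an analogy for what an output-encoding setting such as mbstring.http_output does. The class and method names are hypothetical, and it is not Quercus code:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: strings stay UTF-16 in memory and are
// encoded only when they are written out.
final class OutputEncodingDemo {

    // Write text through a Writer configured with the output charset
    // and return the bytes that would be sent to the client.
    static byte[] render(String text, Charset outputCharset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(sink, outputCharset)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // "é"
        System.out.println(render(text, StandardCharsets.UTF_8).length);      // 2
        System.out.println(render(text, StandardCharsets.ISO_8859_1).length); // 1
    }
}
```

The same in-memory string produces different bytes depending on the output charset, so the charset declared in the page has to match the one actually used for writing.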

I used unicode.semantics=on before but now don't, so my answer could be wrong for you.

Kaz Nishimura

May 6, 2015, 7:05:04 AM
to ccou...@gmail.com, caucho-...@googlegroups.com
Please note that it is not unicode-semantics but unicode.semantics.

If I were you, I would convert Java strings by Java methods to PHP's encoding and leave unicode.semantics as is.
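
A rough sketch of what such a Java-side conversion could look like, assuming the PHP side expects UTF-8. The helper class and method are hypothetical, and how Quercus marshals a byte[] into a PHP value is an assumption not verified here:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the name and shape are illustrative only.
final class PhpEncodingHelper {

    // Encode each Java String (UTF-16 in memory) as UTF-8 bytes,
    // so the PHP side receives data already in its own encoding.
    static byte[][] toUtf8(String[] values) {
        byte[][] encoded = new byte[values.length][];
        for (int i = 0; i < values.length; i++) {
            encoded[i] = values[i].getBytes(StandardCharsets.UTF_8);
        }
        return encoded;
    }

    public static void main(String[] args) {
        String[] list = {"El\u00e9ment1", "abc"}; // "Elément1", "abc"
        byte[][] out = toUtf8(list);
        System.out.println(out[0].length); // 9
        System.out.println(out[1].length); // 3
    }
}
```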


ccou...@gmail.com

May 18, 2015, 3:52:31 AM
to caucho-...@googlegroups.com, ccou...@gmail.com
Quick update: using unicode.semantics=on with the proper syntax fixed this, so I'm going to use that instead of converting everything. It makes the PHP much more portable in case I need to migrate to other servers.

Thanks Kaz for your help!

Kaz Nishimura

May 18, 2015, 8:52:53 PM
to caucho-...@googlegroups.com
Please note that PHP apparently abandoned the adoption of unicode.semantics and internal Unicode representation for its strings.  You might need to adjust character positions in strings if you enabled unicode.semantics.
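
To illustrate why positions can shift, here is a small Java sketch comparing an offset counted in UTF-16 code units with the same offset counted in UTF-8 bytes, which is how byte-oriented string functions count. The class and method names are hypothetical:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class comparing character offsets with byte offsets.
final class PositionDemo {

    // Offset of needle counted in UTF-16 code units.
    static int charIndex(String haystack, String needle) {
        return haystack.indexOf(needle);
    }

    // Offset of needle counted in UTF-8 bytes, or -1 if it is absent.
    static int utf8ByteIndex(String haystack, String needle) {
        int idx = haystack.indexOf(needle);
        if (idx < 0) {
            return -1;
        }
        return haystack.substring(0, idx).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String text = "El\u00e9ment1"; // "Elément1"
        System.out.println(charIndex(text, "ment"));     // 3
        System.out.println(utf8ByteIndex(text, "ment")); // 4
    }
}
```

Any hard-coded offset or length that was tuned against byte strings may therefore be off by one or more once strings are counted in characters.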

