[PHP-DEV] PHP Unicode support design document

Andrei Zmievski

Aug 10, 2005, 3:31:58 AM

Before we go breaking things, please read this document that
describes how PHP will support the Unicode standard natively.
Hopefully the attachment will work.

Thanks,

-Andrei


Attachment: php-unicode-design.txt
Introduction
============

As successful as PHP has proven to be over the past several years, it is
the only member of the P-trinity of scripting languages (Perl and Python
being the other two) that remains blithely ignorant of the multilingual and
multinational environment around it. The software development community has
been moving towards the Unicode Standard for some time now, and PHP can no
longer afford to stay outside of this movement. Admittedly, some steps have
been taken recently to allow for easier processing of multibyte data with
the mbstring extension, but it is not enabled in PHP by default and is not
as intuitive or transparent as it could be.

The basic goal of this document is to describe how PHP 6 will support the
Unicode Standard natively. Since the full implementation of the Unicode
Standard is very involved, the idea is to use the already existing,
well-tested, full-featured, and freely available ICU (International
Components for Unicode) library. This will allow us to concentrate on the
details of PHP integration and speed up the implementation.

General Remarks
===============

Backwards Compatibility
-----------------------
Throughout the design and implementation of Unicode support, backwards
compatibility must be of paramount concern. PHP is used on an enormous number of
sites and the upgrade to Unicode-enabled PHP has to be transparent. This means
that the existing data types and functions must work as they have always
done. However, the speed of certain operations may be affected, due to
increased complexity of the code overall.

Unicode Encoding
----------------
The initial version will not support the Byte Order Mark (BOM). Characters
are expected to be precomposed, i.e. in Normalization Form C. Later versions
will support the BOM, as well as decomposed and other character forms.


Implementation Approach
=======================

The implementation is done in phases. This allows for more basic and
low-level implementation issues to be ironed out and tested before
proceeding to more advanced topics.

Legend:
- TODO
+ finished
* in progress

Phase I
-------
+ Basic Unicode string support, including instantiation, concatenation,
indexing

+ Simple output of Unicode strings via 'print' and 'echo' statements
with appropriate output encoding conversion

+ Conversion of Unicode strings to/from various encodings via encode() and
decode() functions

+ Determining length of Unicode strings via strlen() function, some
simple string functions ported (substr).


Phase II
--------
* HTTP input request decoding

+ Fixing remaining string-aware operators (assignment to {}, etc)

+ Comparison (collation) of Unicode strings with built-in operators

* Support for Unicode and binary strings in PHP streams

+ Support for Unicode identifiers

* Configurable handling of conversion failures

+ \C{} escape sequence in strings


Phase III
---------
* Exposing ICU API

- Porting all remaining functions to support Unicode and/or binary
strings


Encoding Names
==============
All the encoding settings discussed in this document accept any valid
encoding name supported by ICU. See ICU online documentation for the full
list of encodings.


Internal Encoding
=================

UTF-16 is the internal encoding used for Unicode strings. UTF-16 uses two
bytes for any Unicode character in the Basic Multilingual Plane, which is
where most of the world's current languages are represented. While it is
less memory-efficient for basic ASCII text, it simplifies processing and
makes interfacing with ICU easier, since ICU uses UTF-16 for its internal
processing as well.
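
To make the storage cost concrete, here is a minimal sketch of how a single
code point maps onto UTF-16 code units. It is not taken from the PHP or ICU
sources; the name utf16_encode() is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-16.
 * Writes 1 or 2 code units to out and returns how many were written,
 * or 0 if cp is a surrogate or lies above U+10FFFF. */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;
    }
    if (cp < 0x10000) {
        /* Basic Multilingual Plane: one 16-bit unit (two bytes) */
        out[0] = (uint16_t)cp;
        return 1;
    }
    /* Supplementary planes: a surrogate pair (four bytes) */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));
    return 2;
}
```

An ASCII letter therefore costs two bytes instead of one, while anything
outside the BMP costs four.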


Fallback Encoding
=================

This setting specifies the "fallback" encoding for all the other ones. So if
a specific encoding setting is not set, PHP defaults it to the fallback
encoding. If the fallback_encoding is not specified either, it is set to
UTF-8.

fallback_encoding = "iso-8859-1"
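
The lookup order described above could be sketched roughly as follows. The
function name resolve_encoding() is hypothetical, and treating an empty
string the same as "not set" is an assumption of this sketch.

```c
#include <stddef.h>

/* Hypothetical helper: pick the effective encoding for one setting.
 * specific - value of the specific setting, NULL or "" if unset
 * fallback - value of fallback_encoding, NULL or "" if unset */
static const char *resolve_encoding(const char *specific, const char *fallback)
{
    if (specific != NULL && specific[0] != '\0') {
        return specific;            /* the setting itself wins */
    }
    if (fallback != NULL && fallback[0] != '\0') {
        return fallback;            /* then fallback_encoding */
    }
    return "UTF-8";                 /* built-in default */
}
```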


Runtime Encoding
================

Currently PHP neither specifies nor cares what the encoding of its strings
is. However, the Unicode implementation needs to know what this encoding is
for several reasons, including type coercion and encoding conversion for
strings generated at runtime via function calls and casting. This setting
specifies this runtime encoding.

runtime_encoding = "iso-8859-1"


Output Encoding
===============

Automatic output encoding conversion is supported on the standard output
stream. Therefore, commands such as 'print' and 'echo' automatically convert
their arguments to the specified encoding. No automatic output encoding is
performed for anything else. Consequently, when writing to files or external
resources, the developer has to manually encode the data using functions
provided by the unicode extension or rely on stream encoding filters. The
unicode extension provides the necessary stream filters to make developers'
lives easier.

The existing default_charset setting so far has been used only for
specifying the charset portion of the Content-Type MIME header. For several
reasons, this setting is deprecated. Now it is only used when the Unicode
semantics switch is disabled and does not affect the actual transcoding of
the output stream. The output encoding setting takes precedence in all other
cases.

output_encoding = "utf-8"


HTTP Input Encoding
===================

To make accessing HTTP input variables easier, PHP automatically decodes
HTTP GET and POST requests based on the specified encoding. If the HTTP
request contains the encoding specification in the headers, then it will be
used instead of this setting. If the HTTP input encoding setting is not
specified, PHP falls back onto the output encoding setting, because modern
browsers are supposed to return the data in the same encoding as they
received it in.

If the actual encoding is passed in the request itself or is found
elsewhere, then the application can ask PHP to re-decode the raw input
explicitly.

http_input_encoding = "utf-8"
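
One reading of the precedence rules above, as a sketch only. The struct and
function names are invented for illustration, and chaining on to
fallback_encoding and UTF-8 at the end is an assumption that combines this
section with the Fallback Encoding section.

```c
#include <stddef.h>

/* Hypothetical view of the relevant INI settings; NULL or "" means unset. */
struct encoding_settings {
    const char *http_input_encoding;
    const char *output_encoding;
    const char *fallback_encoding;
};

static int is_set(const char *s)
{
    return s != NULL && s[0] != '\0';
}

/* request_charset is the encoding named in the request headers, if any. */
static const char *http_input_charset(const struct encoding_settings *ini,
                                      const char *request_charset)
{
    if (is_set(request_charset))          return request_charset;
    if (is_set(ini->http_input_encoding)) return ini->http_input_encoding;
    if (is_set(ini->output_encoding))     return ini->output_encoding;
    if (is_set(ini->fallback_encoding))   return ini->fallback_encoding;
    return "UTF-8";
}
```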


Script Encoding
===============

PHP scripts may be written in any encoding supported by ICU. The encoding
of the scripts can be specified site-wide via the INI directive
script_encoding, or with a 'declare' pragma at the beginning of the script.
The reason for the pragma is that an application written in Shift-JIS, for
example, should be executable on a system where the INI directive cannot be
changed by the application itself. The pragma setting is valid only for the
script it occurs in, and does not propagate to the included files.

pragma:
<?php declare(encoding = 'utf-8'); ?>

INI setting:
script_encoding = utf-8


Conversion Semantics
====================

Not all characters can be converted between Unicode and legacy encodings.
Normally, when downconverting from Unicode, the default behavior of ICU
converters is to substitute the missing sequence with the appropriate
substitution sequence for that codepage, such as 0x1A (Control-Z) in
ISO-8859-1. When upconverting to Unicode, if an encoding has a character
which cannot be converted into Unicode, that sequence is replaced by the
Unicode substitution character (U+FFFD).

The conversion failure behavior can be customized:

- perform substitution as described above with a custom substitution
character
- skip any invalid characters
- stop the conversion, raise an error, and return partial conversion
results
- replace the missing character with a diagnostic character and continue,
e.g. [U+hhhh]

There are two INI settings that control this.

unicode.from_error_mode = U_INVALID_SUBSTITUTE
                          U_INVALID_SKIP
                          U_INVALID_STOP
                          U_INVALID_ESCAPE

unicode.from_error_subst_char = a2

The second setting is supposed to contain the Unicode code point value for
the substitution character. This value has to be representable in the target
encoding.

Note that PHP always tries to convert as much of the data as possible and
returns the converted results even if an error happens.
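
The four behaviors can be illustrated with a toy downconverter from code
points to ISO-8859-1. This is a sketch under stated assumptions, not the ICU
callback mechanism: the enum, the function name to_latin1(), and the exact
escape format are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the four error modes. */
enum from_error_mode { MODE_SUBSTITUTE, MODE_SKIP, MODE_STOP, MODE_ESCAPE };

/* Convert n code points to ISO-8859-1, writing at most cap bytes to out.
 * Returns the number of bytes written (a partial result on failure) and
 * sets *failed to 1 if any code point had no ISO-8859-1 representation. */
static size_t to_latin1(const uint32_t *cps, size_t n,
                        enum from_error_mode mode, unsigned char subst,
                        unsigned char *out, size_t cap, int *failed)
{
    size_t w = 0;

    *failed = 0;
    for (size_t i = 0; i < n; i++) {
        if (cps[i] <= 0xFF) {               /* representable as-is */
            if (w + 1 > cap) break;
            out[w++] = (unsigned char)cps[i];
            continue;
        }
        *failed = 1;
        if (mode == MODE_SKIP) continue;    /* drop the character */
        if (mode == MODE_STOP) break;       /* keep what we have so far */
        if (mode == MODE_SUBSTITUTE) {
            if (w + 1 > cap) break;
            out[w++] = subst;
            continue;
        }
        /* MODE_ESCAPE: emit a diagnostic such as [U+4E2D] */
        char esc[16];
        int len = snprintf(esc, sizeof esc, "[U+%04X]", (unsigned)cps[i]);
        if (w + (size_t)len > cap) break;
        memcpy(out + w, esc, (size_t)len);
        w += (size_t)len;
    }
    return w;
}
```

For the input U+0061 U+4E2D U+0062, the substitute mode with 0x1A yields
three bytes, skip yields "ab", stop yields "a", and escape yields
"a[U+4E2D]b".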


Unicode Switch
==============

Obviously, PHP cannot simply impose new Unicode support on everyone. There
are many applications that do not care about Unicode and do not need it.
Consequently, there is a switch that enables certain fundamental language
changes related to Unicode. This switch is available only as a site-wide or
per-directory INI setting.

Note that having the switch turned off does not imply that PHP is unaware of
Unicode altogether or that no Unicode string can exist. It only affects
certain aspects of the language, and Unicode strings can always be created
programmatically.

unicode_semantics = On

[TODO: list areas that are affected by this switch]


Unicode String Type
===================

The Unicode string type (IS_UNICODE) is supposed to contain text data
encoded in UTF-16. It is the main string type in PHP when the Unicode
semantics switch is turned on. Unicode strings can exist when the switch is
off, but they have to be produced programmatically, via calls to functions
that return the Unicode type.

The operational unit when working with Unicode strings is a code point, not
a code unit or a byte. One code point in UTF-16 consists of 1 or 2 code
units, each of which is a 16-bit word. Working on the code point level is
necessary because doing otherwise would mean offloading the processing of
surrogate pairs onto PHP users, and that is less than desirable.

The repercussion is that one cannot expect code point N to be at offset
N in the Unicode string. Instead, one has to iterate from the beginning of
the string using the U16_FWD() macro until the desired code point is reached.

The codepoint access is one of the primary areas targeted for optimization.
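
ICU provides macros such as U16_FWD_1() for this kind of iteration. The
dependency-free sketch below hand-rolls the same walk to show why lookup by
code point index is linear in the string length; utf16_offset_of() is a
hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the code unit offset at which code point
 * number n (0-based) starts, or len if n is past the last code point. */
static size_t utf16_offset_of(const uint16_t *s, size_t len, size_t n)
{
    size_t i = 0;

    while (i < len && n > 0) {
        /* A lead surrogate followed by a trail surrogate is one code point */
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
            i + 1 < len && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            i += 2;
        } else {
            i += 1;
        }
        n--;
    }
    return n == 0 ? i : len;
}
```

In the string 'a', U+10410, 'b' (four code units), code point 2 starts at
code unit offset 3, not 2.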


Native Encoding String Type
===========================

The native encoding string type (IS_STRING) serves two purposes: backwards
compatibility when the Unicode semantics switch is off, and representing
strings in non-Unicode encodings (native encodings) when it is on. It is
processed on the byte level.


Binary String Type
==================

The binary string type (IS_BINARY) can be used for storing images, PDFs, or
other binary data that is intended to be processed on the byte level and
cannot be interpreted as text.

Binary data type does not participate in implicit conversions, and cannot be
explicitly upconverted to other string types, although the inverse is
possible.

Printing binary data to the standard output passes it through as-is,
independent of the output encoding.

When Unicode semantics switch is off, binary string literals and binary
strings returned by functions actually resolve to IS_STRING type, for
backwards compatibility reasons.


Zval Structure Changes
======================

PHP is a type-agnostic language. Its data values are encapsulated in a zval
(Zend value) structure that can change as necessary to accommodate various types.

struct _zval_struct {
    /* Variable information */
    union {
        long lval;                  /* long value */
        double dval;                /* double value */
        struct {
            char *val;
            int len;
        } str;                      /* string value */
        HashTable *ht;              /* hash table value */
        zend_object_value obj;      /* object value */
    } value;
    zend_uint refcount;
    zend_uchar type;                /* active type */
    zend_uchar is_ref;
};

The type field determines what is stored in the union, IS_STRING being the only
data type pertinent to this discussion. In the current version, the strings
are binary-safe, but, for all intents and purposes, are assumed to be
made up of 8-bit characters. It is possible to treat the string value as
an opaque type containing arbitrary binary data, and in fact that is how
the mbstring extension uses it, in order to store multibyte strings. However,
many extensions and the Zend engine itself manipulate the string value
directly without regard to its internals. Needless to say, this can lead to
problems.

For the IS_UNICODE type, we need to add another structure to the union:

union {
    ....
    struct {
        UChar *val;                 /* Unicode string value */
        int32_t len;                /* number of UChar's */
    } ustr;
    ....
} value;

This cleanly separates the two types of strings and helps preserve backwards
compatibility. For the IS_BINARY type, we can reuse the existing str structure.


Language Modifications
======================

If the Unicode switch is turned on, PHP string literals - single-quoted,
double-quoted, and heredocs - become Unicode strings (IS_UNICODE type).
They support all the same escape sequences and variable interpolations as
previously, with the addition of some new escape sequences.

The contents of the strings are interpreted as follows:

- all non-escaped characters are interpreted as a corresponding Unicode
codepoint based on the current script encoding, e.g. ASCII 'a' (0x61) =>
U+0061, Shift-JIS (0x92 0x86) => U+4E2D

- existing PHP escape sequences are also interpreted as Unicode codepoints,
including \xXX (hex) and \OOO (octal) numbers, e.g. "\x20" => U+0020

- two new escape sequences, \uXXXX and \UXXXXXX are interpreted as a 4 or
6-hex Unicode codepoint value, e.g. \u0221 => U+0221, \U010410 =>
U+10410

- a new escape sequence allows specifying a character by its full
Unicode name, e.g. \C{THAI CHARACTER PHO SAMPHAO} => U+0E20
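
As an illustration of the two numeric escapes in the list above, a scanner
could recognize them roughly like this. It is a sketch only and not the
actual lexer code; parse_unicode_escape() and hex_value() are hypothetical
names.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: parse \uXXXX or \UXXXXXX at the start of the
 * NUL-terminated string s. On success stores the code point in *cp and
 * returns the number of characters consumed (6 or 8); returns 0 if s does
 * not start with a well-formed escape. */
static size_t parse_unicode_escape(const char *s, uint32_t *cp)
{
    size_t digits;
    uint32_t value = 0;

    if (s[0] != '\\') return 0;
    if (s[1] == 'u') {
        digits = 4;
    } else if (s[1] == 'U') {
        digits = 6;
    } else {
        return 0;
    }
    for (size_t i = 0; i < digits; i++) {
        int h = hex_value(s[2 + i]);
        if (h < 0) return 0;        /* also stops at a premature NUL */
        value = (value << 4) | (uint32_t)h;
    }
    if (value > 0x10FFFF) return 0; /* outside the Unicode code space */
    *cp = value;
    return 2 + digits;
}
```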

The single-quoted string is more restrictive than the other two types: so
far the only escape sequence allowed inside of it was \', which specifies
a literal single quote. However, single quoted strings now support the new
Unicode character escape sequences as well.

PHP allows variable interpolation inside the double-quoted and heredoc strings.
However, the parser separates the string into literal and variable chunks during
compilation, e.g. "abc $var def" -> "abc" . $var . "def". This means that the
literal chunks can be handled in the normal way as far as Unicode support is
concerned.

Since all string literals become Unicode by default, one loses the ability
to specify byte-oriented or binary strings. In order to create binary string
literals, a new syntax is necessary: prefixing a string literal with letter
'b' creates a binary string.

$var = b'abc\001';
$var = b"abc\001";
$var = b<<<EOD
abc\001
EOD;

The binary string literals support the same escape sequences as the current
PHP strings. If the Unicode switch is turned off, then the binary string
literals generate normal string (IS_STRING) type internally, without any
effect on the application.

The string operators have been changed to accommodate the new IS_UNICODE and
IS_BINARY types. In more detail:

- The concatenation (.) operator has been changed to automatically coerce
IS_STRING type to the more precise IS_UNICODE if its operands are of two
different string types. It does not perform coercion for IS_BINARY type,
however, since binary data is not considered to be in any encoding. To
concatenate a string with binary data, the string has to be cast to binary
type first. The coercion uses the conversion matrix specified later in
this document.

- The concatenation assignment operator (.=) has been changed similarly.

- The string indexing operators {}/[] have been changed to accommodate
IS_UNICODE type strings and extract the specified character. Note that
the index specifies a code point, not a byte or a code unit, thus
supporting supplementary characters as well.

- Both Unicode and binary string types can be used as array keys. If the
Unicode switch is on, the native encoding strings are converted to
Unicode, if they are used as hash keys, but binary strings are not.
Note that this means if Unicode switch is off, then Unicode string "abc"
and native string "abc" do not hash to the same value.

- Bitwise operators and increment/decrement operators do not work on
Unicode strings. They do work on binary strings.

- Two new casting operators are introduced, (unicode) and (binary).
They use the conversion matrix specified later in this document.

- The comparison operators, when applied to Unicode strings, perform
comparison in binary code point order. They also do appropriate coercion
if the strings are of differing types.

- The arithmetic operators use the same semantic as today for converting
strings to numbers. A Unicode string is considered numeric if it
represents a long or a double number in en_US_POSIX locale.
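
Code point order is not the same as UTF-16 code unit order: a BMP character
in the U+E000..U+FFFF range has a larger code unit than any lead surrogate,
yet its code point is smaller than that of any supplementary character. A
minimal sketch of a comparison in code point order follows; the function
names are hypothetical and this is not the engine's implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode the code point starting at s[*i] and advance *i past it.
 * Unpaired surrogates are returned as-is. */
static uint32_t next_code_point(const uint16_t *s, size_t len, size_t *i)
{
    uint32_t c = s[(*i)++];

    if (c >= 0xD800 && c <= 0xDBFF && *i < len &&
        s[*i] >= 0xDC00 && s[*i] <= 0xDFFF) {
        c = 0x10000 + ((c - 0xD800) << 10) + (uint32_t)(s[(*i)++] - 0xDC00);
    }
    return c;
}

/* Compare two UTF-16 strings in code point order.
 * Returns -1, 0, or 1 in the manner of strcmp(). */
static int utf16_cmp_code_points(const uint16_t *a, size_t alen,
                                 const uint16_t *b, size_t blen)
{
    size_t i = 0, j = 0;

    while (i < alen && j < blen) {
        uint32_t ca = next_code_point(a, alen, &i);
        uint32_t cb = next_code_point(b, blen, &j);
        if (ca != cb) {
            return ca < cb ? -1 : 1;
        }
    }
    if (i < alen) return 1;     /* b is a proper prefix of a */
    if (j < blen) return -1;    /* a is a proper prefix of b */
    return 0;
}
```

With this ordering U+FF5E sorts before U+10000, even though its single code
unit 0xFF5E is numerically greater than the lead surrogate 0xD800.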


Inline HTML
===========
Because inline HTML blocks are intermixed with PHP ones, they are also
written in the script encoding. PHP transcodes the HTML blocks to the output
encoding as needed, resulting in direct passthrough if the script encoding
matches output encoding.


Identifiers
===========
Considering that scripts may be written in various encodings, we do not
restrict identifiers to be ASCII-only. PHP allows any valid identifier based
on the Unicode Standard Annex #31. The identifiers are case folded when
necessary (class and function names) and converted to normalization form
NFKC, so that two identifiers written in two compatible ways refer to the
same thing.


Numbers
=======
Unlike identifiers, we restrict numbers to consist only of ASCII digits and
do not interpret them as written in a specific locale. The numbers are
expected to adhere to en_US_POSIX or C locale, i.e. having no thousands
separator and fractional separator being (.) "full stop". Numeric strings
are supposed to adhere to the same rules, i.e. "10,3" is not interpreted as
a number even if the current locale's fractional separator is comma.
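
A locale-independent check along these lines might look as follows. The
grammar is deliberately simplified (optional sign, digits, optional fraction,
optional exponent) and is an assumption of this sketch rather than PHP's
exact numeric-string rules; is_numeric_string() is a hypothetical name.

```c
#include <stddef.h>

/* Only ASCII digits count, regardless of the current locale. */
static int is_ascii_digit(char c)
{
    return c >= '0' && c <= '9';
}

/* Hypothetical helper: return 1 if s is a number written with ASCII digits
 * and '.' as the fractional separator, 0 otherwise. */
static int is_numeric_string(const char *s)
{
    size_t i = 0, digits = 0;

    if (s[i] == '+' || s[i] == '-') i++;
    while (is_ascii_digit(s[i])) { i++; digits++; }
    if (s[i] == '.') {
        i++;
        while (is_ascii_digit(s[i])) { i++; digits++; }
    }
    if (digits == 0) return 0;
    if (s[i] == 'e' || s[i] == 'E') {
        size_t exp_digits = 0;
        i++;
        if (s[i] == '+' || s[i] == '-') i++;
        while (is_ascii_digit(s[i])) { i++; exp_digits++; }
        if (exp_digits == 0) return 0;
    }
    return s[i] == '\0';    /* no trailing characters such as ",3" */
}
```

The check never consults the C locale, so "10,3" is rejected even when the
process runs under a locale whose fractional separator is a comma.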


Parameter Parsing API Modifications
===================================

Internal PHP functions largely use the zend_parse_parameters() API in order to
obtain the parameters passed to them by the user. For example:

char *str;
int len;

if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s", &str, &len) == FAILURE) {
    return;
}

This forces the input parameter to be a string, and its value and length are
stored in the variables specified by the caller.

There are now three new specifiers: 't', 'u', and 'T'.

't' specifier
-------------
This specifier indicates that the caller requires the incoming parameter
to be string data (IS_STRING, IS_UNICODE, IS_BINARY). The caller has to provide
the storage for string value, length, and type.

void *str;
int len;
zend_uchar type;

if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "t", &str, &len, &type) == FAILURE) {
    return;
}
if (type == IS_UNICODE) {
    /* process UTF-16 data */
} else {
    /* process native string or binary data */
}

For IS_STRING and IS_BINARY types, the length represents the number of
bytes, and for IS_UNICODE the number of UChar's. When converting other
types (numbers, booleans, etc) to strings, the exact behavior depends on
the Unicode semantics switch: if on, they are converted to IS_UNICODE,
otherwise to IS_STRING.


'u' specifier
-------------
This specifier indicates that the caller requires the incoming parameter
to be a Unicode UTF-16 encoded string. If a non-Unicode string is passed,
the engine creates a copy of the string and automatically converts it
to Unicode type before passing it to the internal function. No such
conversion is necessary for Unicode strings, obviously. Binary type cannot
be upconverted, and the engine issues an error in such a case.

UChar *str;
int32_t len;

if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "u", &str, &len) == FAILURE) {
    return;
}
/* process UTF-16 data */


'T' specifier
-------------
This specifier is useful when the function takes two or more strings and
operates on them. Using 't' specifier for each one would be somewhat
problematic if the passed-in strings are of mixed types, and multiple
checks need to be performed in order to do anything. All parameters
marked by the 'T' specifier are promoted to the same type.

Binary type is generally speaking the most precise one. However, we do not
want to convert Unicode strings to binary ones, so an error is thrown
if the incoming list of parameters has both Unicode and binary strings in
it.

If there are no binary strings, and at least one of the strings is of
Unicode type, then all the rest of the strings are upconverted to Unicode.

Otherwise the promotion is to IS_STRING type.


void *str1, *str2;
int len1, len2;
zend_uchar type1, type2;

if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "TT", &str1, &len1,
                          &type1, &str2, &len2, &type2) == FAILURE) {
    return;
}
if (type1 == IS_UNICODE) {
    /* process as Unicode, str2 is guaranteed to be Unicode as well */
} else {
    /* process as native string, str2 is guaranteed to be the same */
}
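
The promotion rule for 'T' parameters can be written down as a small
decision function. This is a sketch of the rule as worded above, not the
engine code: the enum and function names are hypothetical, and the case of
binary strings mixed only with native strings follows the literal
"otherwise" wording, which the text does not spell out further.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the three string type tags. */
enum sketch_type { SKETCH_STRING, SKETCH_UNICODE, SKETCH_BINARY };

/* Decide the common type for all 'T' parameters of one call.
 * Returns the target type, or -1 if Unicode and binary strings are mixed. */
static int promote_T_params(const enum sketch_type *types, size_t n)
{
    int has_unicode = 0, has_binary = 0;

    for (size_t i = 0; i < n; i++) {
        if (types[i] == SKETCH_UNICODE) has_unicode = 1;
        if (types[i] == SKETCH_BINARY)  has_binary = 1;
    }
    if (has_unicode && has_binary) {
        return -1;                  /* error: cannot reconcile the two */
    }
    if (has_unicode) {
        return SKETCH_UNICODE;      /* upconvert everything else */
    }
    return SKETCH_STRING;           /* the "otherwise" case */
}
```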


The existing 's' specifier has been modified as well. If a Unicode string is
passed in, it automatically copies and converts the string to the runtime
encoding, and issues a warning. If a binary string is passed in, no conversion
is necessary.


Upgrading Existing Functions
============================

Upgrading functions to work with new data types will be a deliberate and
involved process, because one needs to consider not only the mechanisms for
processing Unicode characters, for example, but also the semantics of
the function.

The main tenet of the upgrade process should be that when processing Unicode
strings, the unit of operation is a code point, not a code unit or a byte.
For example, strlen() returns the number of code points in the string.

strlen('abc') = 3
strlen('ab\U010000') = 3
strlen('ab\uD800\uDC00') = 3 /* not 4 */
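
The counting rule behind these examples, as a minimal sketch that works
directly on UTF-16 code units. The name utf16_strlen() is hypothetical and
this is not the upgraded strlen() itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count code points in a UTF-16 string of len code
 * units. A well-formed surrogate pair counts once; an unpaired surrogate
 * counts as one code point by itself. */
static size_t utf16_strlen(const uint16_t *s, size_t len)
{
    size_t count = 0;

    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
            i + 1 < len && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            i++;                    /* skip the trail surrogate */
        }
        count++;
    }
    return count;
}
```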

Function upgrade guidelines are available in a separate document.


Unicode Extension
=================

There will be one or more extensions that provide Unicode and i18n services
to PHP. In phase I only the conversion service is necessary. The Unicode
extension is 'ext/unicode' and its functions should be prefixed with 'unicode'
or 'icu'.

Conversion Functions
--------------------

string unicode_encode(unicode $input, text $encoding)

Takes a UTF-16 Unicode string and converts it to the target
encoding, returning the result.

unicode unicode_decode(string $input, text $encoding)

Takes a string in the source encoding and converts it to a UTF-16
Unicode string, returning the result.


Type Conversion Matrix
======================

         to | IS_STRING    | IS_UNICODE   | IS_BINARY
from        |              |              |
------------+--------------+--------------+--------------
IS_STRING   | n/a          | implicit=yes | implicit=no
            |              | explicit=yes | explicit=yes
------------+--------------+--------------+--------------
IS_UNICODE  | implicit=no  | n/a          | implicit=no
            | explicit=yes |              | explicit=yes
------------+--------------+--------------+--------------
IS_BINARY   | implicit=no  | implicit=no  | n/a
            | explicit=no  | explicit=no  |

explicit = casting
implicit = for concatenation, etc

IS_STRING <-> IS_UNICODE uses runtime-encoding
IS_UNICODE -> IS_BINARY converts to runtime encoding first, then to binary


Implementation Details That Need Expanding
==========================================
- Streams support for Unicode - What stream filters will we be providing?
- Conversion errors behavior - Need to define the default.
- INI files encoding - Do we support BOMs?
- There are likely to be other issues which are missing from this document


Build System
============

Unicode support in PHP is always enabled. The only configuration option
during development should be the location of the ICU headers and libraries.

--with-icu-dir=<dir> <dir> parameter specifies the location of ICU
header and library files.

After the initial development we have to repackage ICU library for our needs
and bundle it with PHP.


Document History
================
0.5: Updated per latest discussions. Removed tentative language in several
places, since we have decided on everything described here already.
Clarified details according to Phase II progress.

0.4: Updated to include all the latest discussions. Updated development
phases.

0.3: Updated to include all the latest discussions.

0.2: Updated Phase I design proposal per discussion on uni...@php.net.
Modified Internal Encoding section to contain only UTF-16 info.
Expanded Script Encoding section.
Added Binary Data Type section.
Amended Language Modifications section to describe string literals
behavior.
Amended Build System section.

0.1: Phase I design proposal


References
==========

Unicode
http://www.unicode.org

Unicode Glossary
http://www.unicode.org/glossary/

UTF-8
http://www.utf-8.com/

UTF-16
http://www.ietf.org/rfc/rfc2781.txt

ICU Homepage
http://www.ibm.com/software/globalization/icu/

ICU User Guide and API Reference
http://icu.sourceforge.net/

Unicode Annex #31
http://www.unicode.org/reports/tr31/

PHP Parameter Parsing API
http://www.php.net/manual/en/zend.arguments.retrieval.php


Authors
=======
Andrei Zmievski <and...@gravitonic.com>

vim: set et :


--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php

Ron Korving

Aug 10, 2005, 6:46:02 AM
This looks very promising, I'm impressed by the work you guys have done (big
thumbs up).

There are a few issues/questions I have after reading your document:


"Therefore, command such as 'print' and 'echo' automatically convert their
arguments to the specified encoding. No automatic output encoding is
performed for anything else."

What about the other functions that output to stdout directly, such as
readfile() and passthru()?


"The conversion failure behavior can be customized"...

Maybe it would be a nice feature to have a U_INVALID_EXCEPTION, so that
users can actually catch the error and deal with it. Just an idea. Of course
it's not usual for the PHP core and extensions to throw exceptions, but
perhaps this could change with PHP6.


"In order to create binary string literals, a new syntax is necessary:
prefixing a string literal with letter 'b' creates a binary string."

The b-prefix for binary strings is great, but how does that work with a
function like file_get_contents() or fread() ?
One can't do: $data = bfile_get_contents("somefile.bin");
And even if one could (somehow), wouldn't file_get_contents() already
unicode-encode all data it reads? How does such a function know if the user
is expecting binary or textual data or does the encoding simply happen after
the string is returned? In that case it's up to the user to use the
b-prefix, but then there's the syntax problem I mentioned.


Keep up the good work,

Ron

Antony Dovgal

Aug 10, 2005, 6:55:10 AM
On Wed, 10 Aug 2005 12:45:27 +0200
"Ron Korving" <r.ko...@xit.nl> wrote:

> This looks very promising, I'm impressed by the work you guys have done (big
> thumbs up).
>
> There are a few issues/questions I have after reading your document:
>
>
> "Therefore, command such as 'print' and 'echo' automatically convert their
> arguments to the specified encoding. No automatic output encoding is
> performed for anything else."

That's actually something I wanted to ask about too.

Do we really need such kind of magic?

I think it may be pretty confusing when after echo'ing or print'ing a variable
you can see one output, but after writing the very same variable into a file
you can see something completely different.

IMO it's similar to what we have with __toString() ATM.
Yes, it's documented, but it's *still* confusing that there is some magic
involved in one case and there is no magic in another, almost similar case.

--
Wbr,
Antony Dovgal

Derick Rethans

Aug 10, 2005, 7:02:08 AM
On Wed, 10 Aug 2005, Ron Korving wrote:

> "In order to create binary string literals, a new syntax is necessary:
> prefixing a string literal with letter 'b' creates a binary string."
>
> The b-prefix for binary strings is great, but how does that work with a
> function like file_get_contents() or fread() ?
> One can't do: $data = bfile_get_contents("somefile.bin");

fopen() and file_get_contents() already understand a context parameter;
specifying whether you want binary or string/unicode data can be done
through that.

and the b syntax, only works for literal strings in your code:
b"foo", but b$foo is not going to work.

Derick

--
Derick Rethans
http://derickrethans.nl | http://ez.no | http://xdebug.org

Christian Schneider

Aug 10, 2005, 9:26:59 AM
Derick Rethans wrote:
> On Wed, 10 Aug 2005, Ron Korving wrote:
>>"In order to create binary string literals, a new syntax is necessary:
>>prefixing a string literal with letter 'b' creates a binary string."
>>
>>The b-prefix for binary strings is great, but how does that work with a
>>function like file_get_contents() or fread() ?
>>One can't do: $data = bfile_get_contents("somefile.bin");
>
> fopen() and file_get_contents() already understands a context parameter,
> specifying whethter you'd want to have binary or string/unicode data can
> be done through that.

We create images in PHP scripts and pass them through with
readfile("foo.gif"). Did I understand correctly that this would still
work without changes? But echo file_get_contents("foo.gif") would fail,
right?

This is not a complaint, just trying to understand the implications,
- Chris

Rasmus Lerdorf

Aug 10, 2005, 10:32:42 AM
Yeah, print/echo was just a way of describing the underlying output
stuff. It wasn't meant to be taken literally.

-Rasmus

Andi Gutmans wrote:
> We need to automatically convert the output as internally we will be
> storing UTF-16 which is not what you want to send to the user. The SAPI
> output mechanism does the conversion, I don't think it's print & echo.
> It will actually save people a lot of headache that this is done
> automatically.
> As far as files are concerned, the default is also to convert to the INI
> encoding (forgot which INI parameter), but we will supply streams which
> allow you to control the in/out encoding of specific files.
>
> So basically, I think we need to update the doc as I am pretty sure we
> didn't change print/echo but the underlying input/output mechanisms.
>
> Andi

George Schlossnagle

Aug 10, 2005, 10:36:55 AM

On Aug 10, 2005, at 10:30 AM, Rasmus Lerdorf wrote:

> Yeah, print/echo was just a way of describing the underlying output
> stuff. It wasn't meant to be taken literally.

Given the __toString fiasco, it's understandable that this would be
confusing though.

George

Ron Korving

Aug 10, 2005, 10:56:59 AM
Exactly. That's how I understood it too: "Ah, the __toString behavior". I'm
very glad this is not the case.

Ron


"George Schlossnagle" <geo...@omniti.com> wrote in message
news:1B80531E-3842-4DF6...@omniti.com...

Andi Gutmans

Aug 10, 2005, 11:08:07 AM

Andi

Antony Dovgal

Aug 10, 2005, 11:11:58 AM

Ah.
Ok, then I'm happy =)


On Wed, 10 Aug 2005 07:30:38 -0700
Rasmus Lerdorf <ras...@lerdorf.com> wrote:

> Yeah, print/echo was just a way of describing the underlying output
> stuff. It wasn't meant to be taken literally.

Andrei Zmievski

Aug 10, 2005, 11:35:39 AM
On Aug 10, 2005, at 7:26 AM, Andi Gutmans wrote:

> We need to automatically convert the output as internally we will
> be storing UTF-16 which is not what you want to send to the user.
> The SAPI output mechanism does the conversion, I don't think it's
> print & echo. It will actually save people a lot of headache that
> this is done automatically.

That's not true, actually. 'echo' and 'print' resolve to the ZEND_ECHO
opcode, which calls zend_print_variable(), which in turn calls
zend_make_printable_zval(). Now, this last function is supposed to
take a zval and turn it into a printable string, of course, which is
then output using utility_functions->write_function, aka
php_body_write(). All that function cares about is how to output a
binary string. So, if we want to bubble the conversion down to the
output layer, we probably need to change the write function so that it
takes a void* and a type and knows how to deal with them appropriately.
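
A minimal sketch of what such a typed write function might look like, in
plain C. Everything here is an assumption made for illustration: the names
out_type, utf8_encode and typed_write are hypothetical and not part of the
Zend API, the caller-supplied buffer stands in for the SAPI output layer,
and UTF-8 is assumed as the output encoding.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type tags; the names are illustrative, not Zend's. */
typedef enum { OUT_BINARY, OUT_UTF16 } out_type;

/* Encode one code point as UTF-8; returns the number of bytes written. */
static size_t utf8_encode(uint32_t cp, unsigned char *dst)
{
    if (cp < 0x80) {
        dst[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        dst[0] = (unsigned char)(0xC0 | (cp >> 6));
        dst[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        dst[0] = (unsigned char)(0xE0 | (cp >> 12));
        dst[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        dst[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    dst[0] = (unsigned char)(0xF0 | (cp >> 18));
    dst[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    dst[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    dst[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/*
 * Typed write: takes a void* plus a type tag. `len` counts bytes for
 * OUT_BINARY and 16-bit code units for OUT_UTF16. Binary data is passed
 * through untouched; UTF-16 is transcoded to UTF-8 on the way out.
 * Returns the number of bytes produced, or (size_t)-1 if `out` is too small.
 */
size_t typed_write(const void *data, size_t len, out_type type,
                   unsigned char *out, size_t cap)
{
    if (type == OUT_BINARY) {
        if (len > cap) return (size_t)-1;
        memcpy(out, data, len);
        return len;
    }

    const uint16_t *u = data;
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        uint32_t cp = u[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < len &&
            u[i + 1] >= 0xDC00 && u[i + 1] <= 0xDFFF) {
            /* Combine a valid surrogate pair into one code point. */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (u[i + 1] - 0xDC00);
            i++;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD; /* lone surrogate: substitute U+FFFD */
        }
        if (cap - n < 4) return (size_t)-1; /* worst case is 4 bytes */
        n += utf8_encode(cp, out + n);
    }
    return n;
}
```

The point of the sketch is only that the decision "pass through or
transcode" moves into the write function once it is told what kind of
string it was handed.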

Functions like readfile() are based on streams, so by default they
will be in binary mode, simply passing the data through, unless you
put a filter on it or change the default context.

> As far as files are concerned, the default is also to convert to
> the INI encoding (forgot which INI parameter), but we will supply
> streams which allow you to control the in/out encoding of specific
> files.

The filename_encoding setting is supposed to be used only for
filenames, not for file contents. That is dealt with using the
mechanisms that Sara implemented in the streams system.

-Andrei

Andrei Zmievski

Aug 10, 2005, 11:36:10 AM

We have not changed the underlying output mechanism. The transcoding
is done by zend_make_printable_zval().

-Andrei


On Aug 10, 2005, at 7:30 AM, Rasmus Lerdorf wrote:

> Yeah, print/echo was just a way of describing the underlying output
> stuff. It wasn't meant to be taken literally.
>
> -Rasmus
>
> Andi Gutmans wrote:
>
>> We need to automatically convert the output as internally we will be
>> storing UTF-16 which is not what you want to send to the user. The SAPI
>> output mechanism does the conversion, I don't think it's print & echo.
>> It will actually save people a lot of headache that this is done
>> automatically.
>> As far as files are concerned, the default is also to convert to the INI
>> encoding (forgot which INI parameter), but we will supply streams which
>> allow you to control the in/out encoding of specific files.

Rasmus Lerdorf

Aug 10, 2005, 12:07:28 PM

Andrei Zmievski wrote:
> We have not changed the underlying output mechanism. The transcoding is
> done by zend_make_printable_zval().

Ok, but all the non-stream based output functions pass through that.
Not that we have very many, but it is more than just echo/print.

-Rasmus

Ron Korving

Aug 10, 2005, 1:00:13 PM

I firmly believe though, that all outputting functions should act the same.
It's the same problem otherwise as with __toString(). I would use __toString
if it wasn't just restricted to echo and print, but right now it's pretty
useless to me. I hope that behavior can change in a major version update
(5.1, 6.0).

Ron


"Rasmus Lerdorf" <ras...@lerdorf.com> wrote in message
news:42FA261B...@lerdorf.com...

Marcus Boerger

Aug 10, 2005, 3:26:43 PM

Hello Ron,

I had a chat with Andi about __toString() and I hope that he finally
understood why a lot of people wanted it right from the beginning. To me the
current situation is simply the worst case because no one understands when it
works and when it does not (. vs ,). Since we are making a drastic change in
string behavior anyway, and this time have enough time until the next release,
I have already agreed to take care of this one as early as possible. This way
we will have enough time to find any places where __toString() is not possible.
Andi and I also discussed the major problem again, and I could assure Andi
that there is a simple solution, which he agreed on: we require
__toString() to return a string and do a non-interruptible halt otherwise.
Maybe in a later version this can be replaced by an exception; we'll see
if someone finds the time to analyse that in detail.

best regards
marcus

Wednesday, August 10, 2005, 4:50:17 PM, you wrote:

> Exactly. That's how I understood it too: "Ah, the __toString behavior". I'm
> very glad this is not the case.
>
> Ron
>
> "George Schlossnagle" <geo...@omniti.com> wrote in message
> news:1B80531E-3842-4DF6...@omniti.com...
>>
>> On Aug 10, 2005, at 10:30 AM, Rasmus Lerdorf wrote:
>>
>> > Yeah, print/echo was just a way of describing the underlying output
>> > stuff. It wasn't meant to be taken literally.
>>
>> Given the __toString fiasco, it's understandable that this would be
>> confusing though.
>>
>> George


Best regards,
Marcus

Ron Korving

Aug 10, 2005, 3:30:40 PM

Sounds absolutely great :)

Ron


"Marcus Boerger" <he...@php.net> wrote in message
news:1299141168.2...@marcus-boerger.de...

Andrei Zmievski

Aug 10, 2005, 4:13:25 PM

On Aug 10, 2005, at 3:45 AM, Ron Korving wrote:

> This looks very promising, I'm impressed by the work you guys have done
> (big thumbs up).

Thanks.

> What about the other functions that output to stdout directly, such as
> readfile() and passthru()?

readfile() uses streams so it would rely on stream filters and such.
passthru() should probably operate in binary mode.

> Maybe it would be a nice feature to have an U_INVALID_EXCEPTION, so that
> users can actually catch the error and deal with it. Just an idea. Of
> course it's not usual for the PHP core and extensions to throw exceptions,
> but perhaps this could change with PHP6.

I think the feature of raising exceptions vs. errors is orthogonal to
what the switch does. Consider that you may want the
skip/substitute/escape performed and then raise an error or not.

> The b-prefix for binary strings is great, but how does that work with a
> function like file_get_contents() or fread() ?
> One can't do: $data = bfile_get_contents("somefile.bin");
>
> And even if one could (somehow), wouldn't file_get_contents() already
> unicode-encode all data it reads? How does such a function know if the
> user is expecting binary or textual data or does the encoding simply
> happen after the string is returned? In that case it's up to the user to
> use the b-prefix, but then there's the syntax problem I mentioned.

'b' prefix is only for string literals. file_get_contents(), fread()
and other streams-based functions use the default stream semantics,
meaning that unless you change the default context, the data returned
by them will be of IS_BINARY type. The default context can contain a
filter that decodes the data from the specified encoding into Unicode.

-Andrei

Andrei Zmievski

Aug 10, 2005, 4:17:40 PM

On Aug 10, 2005, at 3:54 AM, Antony Dovgal wrote:
> Do we really need such kind of magic?
>
> I think it may be pretty confusing when after echo'ing or print'ing a
> variable you can see one output, but after writing the very same variable
> into a file you can see something completely different.

Absolutely, we do need it. Consider that the internal encoding is
UTF-16 and outputting that directly to a terminal (or browser) is bound
to cause havoc. That's just one of the examples.

Andrei Zmievski

Aug 10, 2005, 4:20:18 PM

I did not have time to write the full reply earlier so here goes.

Even if we modify the output layer to be aware of various types of
strings coming down the pipe, it would still need to know the encoding
of IS_STRING's in order to convert them to the output encoding. This
presents a particular problem for inline HTML blocks, as they are
supposed to be in the script encoding, but by the time the HTML is sent
to the output layer, we don't know what the source script encoding was
for these HTML blocks. This problem exists in the current
implementation also, because the ZEND_ECHO opcode does not keep track
of what the script encoding was. This needs to be fixed, obviously.

One approach could be to implement a separate opcode for inline HTML
blocks and store the name of the script encoding it came from in the
opcode. Then when the output layer (or whatever else) gets to it, we
can check the encoding name in the opcode vs. the output encoding and
perform transcoding if necessary. This does mean that we may need to
dynamically open and close converters on each output (if there are
different script encodings floating around), but that can be alleviated
by keeping some sort of converter cache around.
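
As a rough illustration of the converter cache idea, here is a small sketch
in plain C. It is an assumption-laden toy, not the engine's design: the
converter struct is a stand-in for a real converter handle (ICU's
UConverter, for instance), converter_open() and cache_get() are
hypothetical names, encoding names are compared byte for byte (real
encoding names have aliases and are case-insensitive), and eviction is
simple round-robin.

```c
#include <stdlib.h>
#include <string.h>

#define CACHE_SLOTS 4

/* Stand-in for a real converter handle; only remembers its encoding. */
typedef struct {
    char name[32];
} converter;

/* Counts how many converters were actually opened (for illustration). */
static int open_count;

/* Hypothetical "expensive" open; returns NULL on failure. */
static converter *converter_open(const char *name)
{
    if (strlen(name) >= sizeof(((converter *)0)->name))
        return NULL; /* name too long for this toy struct */
    converter *c = malloc(sizeof *c);
    if (!c)
        return NULL;
    strcpy(c->name, name);
    open_count++;
    return c;
}

typedef struct {
    converter *slot[CACHE_SLOTS];
    unsigned next; /* round-robin eviction index */
} converter_cache;

/*
 * Return a converter for `name`, reusing a cached one when possible.
 * On a miss, open a new converter and evict the oldest slot.
 */
converter *cache_get(converter_cache *cache, const char *name)
{
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache->slot[i] && strcmp(cache->slot[i]->name, name) == 0)
            return cache->slot[i];
    }
    converter *c = converter_open(name);
    if (!c)
        return NULL;
    unsigned victim = cache->next++ % CACHE_SLOTS;
    free(cache->slot[victim]); /* free(NULL) is a no-op */
    cache->slot[victim] = c;
    return c;
}
```

With something like this, two inline HTML blocks from scripts in the same
encoding would share one converter instead of opening and closing one per
output call.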

I am open to other ideas.

-Andrei

On Aug 10, 2005, at 8:34 AM, Andrei Zmievski wrote:

> That's not true, actually. 'echo' and 'print' resolve to ZEND_ECHO
> opcode which calls zend_print_variable(), which in turn calls
> zend_make_printable_zval(). Now, this last function is supposed to
> take a zval and turn it into a printable string, of course, which is
> then output using utility_functions->write_function aka
> php_body_write(). All that function cares about is how to output a
> binary string. So, if we want to bubble the conversion down to the
> output layer, we probably need to change the write function so that it
> takes a void* and a type and knows how to deal with them
> appropriately.

--

Adam Maccabee Trachtenberg

Aug 10, 2005, 4:46:16 PM

On Wed, 10 Aug 2005, Marcus Boerger wrote:

> I had a chat with Andi about __toString() and I hope that he finally
> understood why a lot of people wanted it right from the beginning. To me the
> current situation is simply the worst case because no one understands when it
> works and when it does not (. vs ,).

Yea, this is super tricky and subtle because even though print/echo
and ,/. are different, they appear to be identical in almost all other
cases.


I agree that forcing __toString() to return a string is perfectly
reasonable.

-adam

--
ad...@trachtenberg.com | http://www.trachtenberg.com
author of o'reilly's "upgrading to php 5" and "php cookbook"
avoid the holiday rush, buy your copies today!

Rasmus Lerdorf

Aug 15, 2005, 6:11:25 PM

I think the main issue here is that if your script encoding is set to
UTF-8 and you do everything in UTF-8 then these large blocks of UTF-8
are going to make a UTF-8 -> UTF-16 -> UTF-8 conversion roundtrip on
every request. It would be nice if we could somehow avoid that.
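
One way to picture the shortcut being asked for, as a hedged sketch in
plain C: if the output path knows the encoding an inline HTML block came
from, it can copy the bytes straight through when that matches the output
encoding and only transcode otherwise. The encoding enum and
emit_inline_html() are hypothetical names, the buffer stands in for the
real output layer, and Latin-1 to UTF-8 is used as the only "slow path"
purely to keep the example short.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical encoding tags for the sketch. */
typedef enum { ENC_UTF8, ENC_LATIN1 } encoding;

/*
 * Emit an inline HTML block into `out`.
 * Returns the number of bytes written, or (size_t)-1 if the buffer is
 * too small or the encoding pair is not handled by this sketch.
 */
size_t emit_inline_html(const unsigned char *src, size_t len,
                        encoding script_enc, encoding output_enc,
                        unsigned char *out, size_t cap)
{
    if (script_enc == output_enc) {
        /* Fast path: same encoding, so no conversion roundtrip at all. */
        if (len > cap) return (size_t)-1;
        memcpy(out, src, len);
        return len;
    }

    if (script_enc == ENC_LATIN1 && output_enc == ENC_UTF8) {
        /* Slow path: each Latin-1 byte maps to code point U+00xx. */
        size_t n = 0;
        for (size_t i = 0; i < len; i++) {
            if (src[i] < 0x80) {
                if (n + 1 > cap) return (size_t)-1;
                out[n++] = src[i];
            } else {
                if (n + 2 > cap) return (size_t)-1;
                out[n++] = (unsigned char)(0xC0 | (src[i] >> 6));
                out[n++] = (unsigned char)(0x80 | (src[i] & 0x3F));
            }
        }
        return n;
    }

    return (size_t)-1; /* other pairs are outside this sketch */
}
```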

-Rasmus

Andi Gutmans wrote:
> Wouldn't it be easiest to have inline html become IS_UNICODE and then
> not deal with the problem of remembering what the script encoding was? I
> thought that's what we already do today.
>
> Andi

--

Andi Gutmans

Aug 15, 2005, 6:13:38 PM

If you want to optimize then I guess "remembering" the script_encoding is
the only way to do it. We could do it similar to the way we "cache" script
file names.
Another option is to just optimize for UTF-8 and use BOMs for UTF-8/UTF-16...

Andi

Ondrej Ivanič

Aug 16, 2005, 4:19:44 AM

Andrei Zmievski wrote:
> + Determining length of Unicode strings via strlen() function, some
> simple string functions ported (substr).

It's not a problem to determine the kind of a character in single-byte
character sets, but in Unicode, with its various encoding schemes, I don't
see an easy way to do it.

It would be nice to have functions like this: isNumber(char),
isAlphabetic(char), isWhitespace(char) ...

Is this planned or not?
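
To make the difficulty concrete, here is a small sketch in plain C of why
byte-level tests stop working: classification has to happen on decoded code
points, not on individual bytes. The function names are hypothetical, the
decoder is simplified (it does not reject overlong forms or encoded
surrogates), and is_digit_cp() covers only two digit ranges as an example;
a real implementation would rely on full Unicode property data from a
library such as ICU.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence starting at `s` (with `n` bytes available).
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 if the input is malformed or truncated.
 */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t need;

    if (n == 0) return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        *cp = s[0] & 0x1F;
        need = 2;
    } else if ((s[0] & 0xF0) == 0xE0) {
        *cp = s[0] & 0x0F;
        need = 3;
    } else if ((s[0] & 0xF8) == 0xF0) {
        *cp = s[0] & 0x07;
        need = 4;
    } else {
        return 0; /* stray continuation byte or invalid lead byte */
    }

    if (n < need) return 0;
    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return need;
}

/*
 * Illustrative subset of "is this a decimal digit?": ASCII digits and
 * Arabic-Indic digits (U+0660..U+0669) only.
 */
int is_digit_cp(uint32_t cp)
{
    return (cp >= 0x0030 && cp <= 0x0039) ||
           (cp >= 0x0660 && cp <= 0x0669);
}
```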

--
Ondrej Ivanic
(ond...@kmit.sk)

cshm...@bellsouth.net

Aug 16, 2005, 6:59:13 AM

>
> It will be nice to have functions like this: isNumber(char),
> isAlphabetic(char), isWhitespace(char) ...
>
> It is on the plan or not?

it's done already, just not committed yet...


clayton

""Ondrej Ivanic"" <ond...@kmit.sk> wrote in message
news:4301A0D6...@kmit.sk...

Andrei Zmievski

Aug 16, 2005, 1:22:42 PM

We certainly could, but we lose some speed, especially when
script_encoding == output_encoding (where we don't really need to
transcode HTML blocks). Are we up for that?

-Andrei

--

Andrei Zmievski

Aug 16, 2005, 1:31:23 PM

Where should we save the script encoding from which an oparray was
built? In the oparray itself?

-Andrei

On Aug 15, 2005, at 3:13 PM, Andi Gutmans wrote:

> If you want to optimize then I guess "remembering" the script_encoding
> is the only way to do it. We could do it similar to the way we "cache"
> script file names.
> Another option is to just optimize for UTF-8 and use BOMs for
> UTF-8/UTF-16...

--

Andrey Hristov

Aug 16, 2005, 4:08:43 PM

cshm...@bellsouth.net wrote:
>>It will be nice to have functions like this: isNumber(char),
>>isAlphabetic(char), isWhitespace(char) ...
>>
>>It is on the plan or not?
>
>
> its done already, just not committed yet...
>
>
> clayton
>
> ""Ondrej Ivanic"" <ond...@kmit.sk> wrote in message
> news:4301A0D6...@kmit.sk...
>
>>Andrei Zmievski wrote:
>>--
>>Ondrej Ivanic
>>(ond...@kmit.sk)
>
>
>
Please don't use stupid caps, these are functions not methods.


Andrey

l0t3k

Aug 16, 2005, 4:30:27 PM

> Please don't use stupid caps, these are functions not methods.
>
>
> Andrey

of course not. see
http://icu.sourceforge.net/apiref/icu4c/uchar_8h.html
but note that the functions conform to PHP's function naming conventions
( lower_case_words_separated_by_underscores() ).

clayton

Peter Brodersen

Aug 16, 2005, 5:58:49 PM

On Wed, 10 Aug 2005 00:31:30 -0700, in php.internals
and...@gravitonic.com (Andrei Zmievski) wrote:

> - existing PHP escape sequences are also interpreted as Unicode codepoints,
> including \xXX (hex) and \OOO (octal) numbers, e.g. "\x20" => U+0020
[..]
>The single-quoted string is more restrictive than the other two types: so
>far the only escape sequence allowed inside of it was \', which specifies
>a literal single quote. However, single quoted strings now support the new
>Unicode character escape sequences as well.

For what it's worth, would \1 be interpreted as well in single quotes
(as it currently is in double quotes)?

I suppose one of the places where \digit would be present in several
cases is in poorly written pregs - such as:
print preg_replace('/([A-Z])/','<b>\1</b>',$string);
(where \1 is used as backreference instead of \\1 or $1)

I'm not that worried about my own preg-usage. I just want to be
prepared if I ever have to review some code for the purpose of
migrating to PHP6.

--
- Peter Brodersen

Andrei Zmievski

Aug 16, 2005, 6:24:29 PM

On Aug 16, 2005, at 2:57 PM, Peter Brodersen wrote:
> For what it's worth, would \1 be interpreted as well in single quotes
> (as it currently is in double quotes)?

No. Only \u and \U have meaning in single quotes (in addition to
current ones).

-Andrei
