-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
I apologize for this being a bit long, but I tried to really clarify
what normalization is all about and how it affects ZFS on OSX.
On 17.03.14 20:56, Philip Robar wrote:
The two normalization forms "formD" and "formC" mandate how certain
characters outside the standard ASCII range (A-Z, a-z, 0-9, a few
punctuation characters such as ".,-;" and some others) are represented.
For example (note: the following is not fully technically correct, but
illustrates the idea), the German letter "ö", named o_umlaut, could
be represented as-is, that is as a single entity with Unicode code
point number 246 (U+00F6).
However, the "ö" could also be seen as a plain "o" with two dots (in
printed text and modern German handwriting since 1978) or two short
downward lines (in some German handwriting scripts, for example the
Sütterlin script or other Kurrent scripts and handwriting taught
before 1978).
Similarly, the "ö" can be encoded in Unicode as a two-character
sequence, a plain "o" and a modifier '"' with the meaning "put two
dots above the previous character" (note: '"' is not such a modifier,
it serves here as a visualization of the actual modifier).
Now, a text in normalization "formC" or "composed form" would have all
characters that can be represented by a single entity encoded using
this single character.
A text in "formD" or "decomposed form" normalization would have all
characters that have some dots, accents, or other "additions" encoded
using the plain base character followed by one or more modifiers.
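To make the two forms concrete, here is a small sketch in Python (my
choice for illustration only; nothing in ZFS uses it). It assumes that
Python's "NFC" and "NFD" correspond to "formC" and "formD", and that the
modifier in question is U+0308 COMBINING DIAERESIS. The helper
code_points is a made-up name.

```python
import unicodedata

composed = "\u00f6"      # "ö" as a single code point (decimal 246)
decomposed = "o\u0308"   # plain "o" followed by U+0308 COMBINING DIAERESIS


def code_points(text):
    """Return the code points of text in U+XXXX notation (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Converting between the two forms with the standard library.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(code_points(nfc))  # one entity
print(code_points(nfd))  # base character plus modifier
```

Both strings render as the same "ö", but they have different lengths and
different binary representations.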
It is in normalized formD if the modifiers come in a defined order; for
example, if a character has a dot above and a dot below, the modifier
for "dot below" always comes first.
It is in irregular formD if all characters are decomposed but the
modifiers do not come in the defined order. In the example of a dot
below and above a character, having the modifier for "dot above"
come before the modifier for "dot below" makes the string irregular.
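The ordering rule can be checked with a short Python sketch. I am
assuming here that "dot below" and "dot above" are U+0323 and U+0307;
the variable names are only for illustration.

```python
import unicodedata

DOT_BELOW = "\u0323"  # COMBINING DOT BELOW
DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE

regular = "o" + DOT_BELOW + DOT_ABOVE    # modifiers in the defined order
irregular = "o" + DOT_ABOVE + DOT_BELOW  # decomposed, but in the wrong order

# Each modifier carries a numeric combining class; within a run of
# modifiers, normalization sorts them by ascending class.
class_below = unicodedata.combining(DOT_BELOW)
class_above = unicodedata.combining(DOT_ABOVE)

# Normalizing the irregular string reorders its modifiers.
fixed = unicodedata.normalize("NFD", irregular)
print(fixed == regular)
```

The string with "dot below" first is left untouched by the
normalization, while the other one gets its modifiers swapped.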
This whole mess is important, because it affects how sorting works.
For example, two strings "o" + "dot_below" + "dot_above" and "o" +
"dot_above" + "dot_below" should compare equal, because they carry
the same information, despite the fact that they differ in their
binary representation.
Normalizing makes comparing and sorting easier.
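As a rough Python sketch of that point: comparing and sorting on a
normalized key treats the differing binary representations alike. The
function nfd_key and the sample names are made up for this example, and
a plain code point sort is of course not a proper language-aware
collation.

```python
import unicodedata


def nfd_key(name):
    """Sort/compare key: the name converted to NFD (illustrative helper)."""
    return unicodedata.normalize("NFD", name)


a = "o\u0323\u0307"  # o + dot_below + dot_above
b = "o\u0307\u0323"  # o + dot_above + dot_below

raw_equal = (a == b)                          # binary comparison
normalized_equal = (nfd_key(a) == nfd_key(b))  # comparison after normalizing

# Two spellings of the same name, composed and decomposed, among others.
names = ["zebra", "o\u0308l", "\u00f6l", "apple"]
raw_sorted = sorted(names)
normalized_sorted = sorted(names, key=nfd_key)

print(raw_equal, normalized_equal)
print(raw_sorted)
print(normalized_sorted)
```

In the raw sort the composed "öl" ends up after "zebra", far away from
its decomposed twin; with the normalized key the two land next to each
other.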
Normalization and ZFS and OSX
=============================
Why should we care?
Because Finder wants to sort directory listings, and for this it needs
to know how the byte sequence it gets from the VFS maps to written
symbols and how these symbols are ordered.
Finder expects text like filenames to be in formD.
For file systems like ZFS this means they need to either
(a) simple case: ignore encoding altogether and just deal with byte
sequences. Since names are stored and returned as they arrive from
the Finder & Co., no problem arises. (In practice, problems do arise
when using the terminal or applications that don't follow Apple's
encoding rules, because names in the wrong encoding could end up on
the file system.)
(b) complex case: convert the internal form to and from formD when
communicating with the VFS (and through it with higher levels like
Finder).
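The caveat in (a) can be shown with a toy model in Python. The
dictionary and the create function are purely hypothetical stand-ins
for a directory that keys its entries on raw bytes; they say nothing
about how ZFS actually stores names.

```python
# Toy directory: raw UTF-8 name bytes -> file contents (hypothetical model).
directory = {}


def create(name, contents):
    """Store an entry under the exact byte sequence of its name."""
    directory[name.encode("utf-8")] = contents


# The same visible name arriving in two different encodings.
create("o\u0308l.txt", "created via Finder, decomposed")
create("\u00f6l.txt", "created via a terminal, composed")

for raw_name in directory:
    print(raw_name)

print(len(directory))
```

Because only bytes are compared, the directory ends up with two entries
that look identical to the user.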
In case of (b) we have two implementation choices:
(1) stick to the rules and really do the conversion, in both
directions, and verify that whatever we get from the VFS is
actually in formD (it might not be, when using the terminal or 3rd
party applications not following Apple's encoding rules). In that case,
the setting of the normalization property doesn't matter, because it
controls how names are recorded *on* *disk*, and this encoding would
*never* be exposed to the VFS.
(2) be lazy and essentially do (a), that is, present the names to the
VFS in the form mandated by the normalization property when reading,
i.e. pass-through, but still make a best effort to force names received
from the VFS into the form mandated by the normalization property when
writing.
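The difference between (1) and (2) can be sketched in Python as follows.
This is only an illustration of the idea, not how any ZFS
implementation is written: all function names are invented, and
ON_DISK_FORM merely stands in for the normalization property (here
assumed to be set to formC).

```python
import unicodedata

ON_DISK_FORM = "NFC"  # stand-in for the normalization property (assumed formC)
VFS_FORM = "NFD"      # the form Finder is said to expect


def is_form_d(name_from_vfs):
    """Option (1) check: is the incoming name really in formD?"""
    return unicodedata.normalize(VFS_FORM, name_from_vfs) == name_from_vfs


def store_name(name_from_vfs):
    """Writing, options (1) and (2): force the name into the on-disk form."""
    return unicodedata.normalize(ON_DISK_FORM, name_from_vfs)


def present_name_strict(name_on_disk):
    """Reading, option (1): always convert to formD for the VFS."""
    return unicodedata.normalize(VFS_FORM, name_on_disk)


def present_name_lazy(name_on_disk):
    """Reading, option (2): pass the on-disk form through unchanged."""
    return name_on_disk


on_disk = store_name("o\u0308l.txt")      # Finder hands us formD
strict = present_name_strict(on_disk)     # VFS sees formD again
lazy = present_name_lazy(on_disk)         # VFS sees the on-disk form
print(is_form_d(strict), is_form_d(lazy))
```

With option (1) the on-disk form never leaks out, whatever the property
says; with option (2) the VFS sees whatever form the property mandates.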
I hope this answers the question and sheds some light on the problem
of filename encoding.
Best regards
Björn
- --
| Bjoern Kahl +++ Siegburg +++ Germany |
| "googlelogin@-my-domain-" +++ www.bjoern-kahl.de |
| Languages: German, English, Ancient Latin (a bit :-)) |
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Using GnuPG with Thunderbird -
http://www.enigmail.net/
iQCVAgUBUyeFwlsDv2ib9OLFAQIpNQP/a5lWCP4RGusktQhUsdm8uIaILLPendG2
9K3zgX8zHr2oMHftLQO8RU9Gk6dN68woINWmXkwGJYhrgFjQOuUMzJo38rR++AoJ
ZsKqX5siOGTnxHntypyxFsjiLfY6NBHHY1spHAH9wU6kDTgJyqRrQ8LoBi5xASuh
nweR0JdSXQg=
=Wbv0
-----END PGP SIGNATURE-----