On Thu, 19 Jan 2012 00:30:14 -0800 (PST)
Kristoff <kristof...@ingres.com> wrote:
> On Jan 19, 12:31 am, "James K. Lowden" <jklow...@schemamania.org>
> wrote:
> > On Mon, 16 Jan 2012 05:16:50 -0800 (PST)
> > Kristoff <kristoff.pic...@ingres.com> wrote:
>
> > > The default unicode collation sequence used by Ingres treats
> > > underscore and space as the same.
> >
> > How is such a collation advantageous?
>
> Don't know the advantage of that, but it was the official standard
> collation sequence for Unicode 2.1
> (http://www.unicode.org/versions/components-2.1.9.html), fairly
> outdated now.
Hi Kristoff,
AIUI the default collation weights for blank and underscore are not the same.
Maybe I'm misinterpreting what you said or what it says.
http://unicode.org/collation/ points to
http://www.unicode.org/reports/tr10/, the Unicode Collation Algorithm,
which references http://www.unicode.org/reports/tr10/#Allkeys, the
allkeys file (the Default Unicode Collation Element Table), which says:
$ grep -E '^00.+(LOW LINE|SPACE)$' allkeys.txt | head
0020 ; [*020A.0020.0002.0020] # SPACE
005F ; [*021B.0020.0002.005F] # LOW LINE
They sure don't *look* the same. They differ at the first level and at
the fourth (the fourth being the one the UCA says is "computable").
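To check that by machine rather than by eye, here is a minimal Python sketch. It takes the two lines exactly as grep printed them above and reports which weight levels differ. The helper names `weights` and `differing_levels` are mine, not anything from the UCA, and the parsing only handles single-element entries like these two.

```python
# The two DUCET lines as quoted in this post (not re-fetched from unicode.org).
SPACE = "0020 ; [*020A.0020.0002.0020] # SPACE"
LOW_LINE = "005F ; [*021B.0020.0002.005F] # LOW LINE"

def weights(line):
    """Return the dot-separated weights of a single collation element as ints."""
    inside = line[line.index("[") + 1 : line.index("]")]
    # Drop the leading '*' (variable) or '.' (non-variable) marker.
    return [int(w, 16) for w in inside.lstrip("*.").split(".")]

def differing_levels(a, b):
    """Return the 1-based levels at which two single-element entries differ."""
    pairs = zip(weights(a), weights(b))
    return [level for level, (x, y) in enumerate(pairs, start=1) if x != y]

print(differing_levels(SPACE, LOW_LINE))  # [1, 4]
```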
The '*' indicates a "variable collation element", which
the standard says "can be either treated as ignorables or not". These
constitute 22% of the Basic Latin page, 58 characters.
Is it the case that Ingres treats all such elements as ignorable?
I read
http://docs.actian.com/ingres/10.0/system-administrator-guide/3937-supported-collation-sequences?hilite=collation
but it doesn't describe a general Unicode collation.
For those following along at home, the UCA also says sequences of
ignorables are all ignored, e.g.,
"This is some text"
would sort equal to any of
" This is some text"
"This is some text"
"Thisissometext"
"This__is__some__text"
The whole set of ignorables includes most punctuation, so an
implementation could fairly treat these as equivalent too:
"This is some text?"
"(This is some text)"
"This is *some* text!"
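For anyone who wants to poke at the idea, here is a toy Python sketch of a first-level sort key with the variable elements either ignored or not. It is not the UCA and certainly not what Ingres does. The table holds only the two entries quoted above, the function name `primary_key` is mine, and the primary weight for every other character is a made-up stand-in (0x10000 plus the code point), not a DUCET value.

```python
# Toy table: character -> (is_variable, primary weight), taken from the
# two DUCET lines quoted earlier in this post.
TABLE = {
    " ": (True, 0x020A),  # 0020 SPACE
    "_": (True, 0x021B),  # 005F LOW LINE
}

def primary_key(text, ignore_variable):
    """Build a first-level-only sort key for text.

    With ignore_variable=True, variable elements contribute nothing, so
    runs of blanks and underscores vanish from the key.
    """
    key = []
    for ch in text:
        if ch in TABLE:
            variable, primary = TABLE[ch]
            if variable and ignore_variable:
                continue
            key.append(primary)
        else:
            # ASSUMPTION: stand-in weight, not taken from the DUCET.
            key.append(0x10000 + ord(ch))
    return tuple(key)

a = primary_key("This is some text", ignore_variable=True)
b = primary_key("This__is__some__text", ignore_variable=True)
print(a == b)  # True: equal once variables are ignored

c = primary_key("This is some text", ignore_variable=False)
d = primary_key("This_is_some_text", ignore_variable=False)
print(c < d)   # True: 020A sorts before 021B when they are not ignored
```

The point is only that blank and underscore compare equal when both are ignored, not because the table gives them the same weight.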
I'm not saying this is what Ingres is doing. I'm saying this is a
valid interpretation of what I understand to be the controlling
standard. And that nothing in the UCA DUCET standard suggests blanks
and underscores must have the same sorting value.
--jkl