Ingres 9.2 - Why does Ingres think these are the same?

Dennis

unread,

Jan 15, 2012, 6:51:30 PM1/15/12

to

The value for varchar colx in two different rows are
'AAA_BBB_12345' and 'AAA BBB 12345'.

Ingres 9.2.1 (a64.lnx/103)NPTL thinks these are the same, but Ingres 10
doesn't.

Ingres 9.2 wont allow this table to be created as btree unique on colx.

Is this a bug or am I missing something?

Thanks

Dennis

Ingres Forums

unread,

Jan 16, 2012, 2:36:05 AM1/16/12

to

Could you please perform this query and let us know the result?
SELECT DBMSINFO('unicode_level')\p\g

--
bilgihan
------------------------------------------------------------------------
bilgihan's Profile: http://community.actian.com/forum/member.php?userid=8658
View this thread: http://community.actian.com/forum/showthread.php?t=14157

Ingres Forums

unread,

Jan 16, 2012, 4:04:25 AM1/16/12

to

Is your where clause *WHERE colx = 'AAA_BBB_12345'* or is it *WHERE colx
LIKE 'AAA_BBB_12345'*?

LIKE treats the underscore as a wildcard for any single character unless
you escape it (*WHERE colx LIKE 'AAA\_BBB\_12345' ESCAPE '\'*) .

--
rhann
------------------------------------------------------------------------
rhann's Profile: http://community.actian.com/forum/member.php?userid=131

Martin Bowes

unread,

Jan 16, 2012, 4:23:48 AM1/16/12

to Ingres and related product discussion forum, Dennis

HI Dennis

On ordinary databases: createdb bowtest

I get the same answer on both 9.2.0 and 10.0.0 on a64/lnx.

In both cases the rows sort in order:
AAA BBB 12345
AAA_BBB_12345

You possibly used a different collation sequence on the different installations?

Martin Bowes

_______________________________________________
Info-Ingres mailing list
Info-...@kettleriverconsulting.com
http://ext-cando.kettleriverconsulting.com/mailman/listinfo/info-ingres

Martin Bowes

unread,

Jan 16, 2012, 5:03:02 AM1/16/12

to Ingres and related product discussion forum, Dennis

Hi All,

I've mucked around with this further....
I can replicate the error(?) if the database is unicode enabled: createdb -i bowtest

And only then if the sort column is nvarchar not varchar.

create table x(a nvarchar(100) not null not default);

insert into x values('AAA BBB 12345');
insert into x values('AAA_BBB_12345');

modify x to btree unique on a;

select * from x;

The modify will generate the error:

E_US1591 MODIFY: table could not be modified because rows contain duplicate keys.

I get this on both 9.2 and on 10.0.

Marty

-----Original Message-----
From: Martin Bowes
Sent: 16 January 2012 09:24
To: info-...@kettleriverconsulting.com
Cc: 'Dennis'
Subject: RE: [Info-Ingres] Ingres 9.2 - Why does Ingres think these are the same?

HI Dennis

On ordinary databases: createdb bowtest

I get the same answer on both 9.2.0 and 10.0.0 on a64/lnx.

In both cases the rows sort in order:
AAA BBB 12345
AAA_BBB_12345

You possibly used a different collation sequence on the different installations?

Martin Bowes

-----Original Message-----
From: Dennis [mailto:droe...@comcast.net]
Sent: 15 January 2012 23:52
To: info-...@kettleriverconsulting.com
Subject: [Info-Ingres] Ingres 9.2 - Why does Ingres think these are the same?

eric.m...@w-link.net

unread,

Jan 16, 2012, 1:07:52 AM1/16/12

to Ingres and related product discussion forum

Ingres 9.2 is evaluating the underscore "_" as the wildcard character.

Ian Kirkham

unread,

Jan 16, 2012, 7:26:42 AM1/16/12

to Ingres and related product discussion forum

No it isn't - pattern mating is only performed with the pattern
operators LIKE and SIMILAR TO.

This behaviour will likely be due to a UTF-8 installation and hence
Unicode collation enabled on the column in question.

Regards,
Ian

-----Original Message-----
From: info-ingr...@kettleriverconsulting.com
[mailto:info-ingr...@kettleriverconsulting.com] On Behalf Of
eric.m...@w-link.net
Sent: 16 January 2012 06:08
To: Ingres and related product discussion forum
Subject: Re: [Info-Ingres] Ingres 9.2 - Why does Ingres think these are
the same?

Ingres 9.2 is evaluating the underscore "_" as the wildcard character.

Kristoff

unread,

Jan 16, 2012, 8:16:50 AM1/16/12

to

It depends on the unicode collation sequence in use. The default
unicode collation sequence used by Ingres treats underscore and space
as the same.
You can use the udefault5 collation sequence instead (createdb -
iudefault5 dbname), which distinguishes between these characters (and
some more).
Not sure at the moment whether this was available in the first 9.2 GA
release, but at least available by patch and in 10.0.

By the way, when using a UTF8 installation you will see the problem
with varchar too.

Kristoff

> Info-Ing...@kettleriverconsulting.comhttp://ext-cando.kettleriverconsulting.com/mailman/listinfo/info-ingres

Emma K. McGrattan

unread,

Jan 16, 2012, 8:44:23 AM1/16/12

to Ingres and related product discussion forum

Thanks, Kristoff! I wasn't aware of this

-- Emma

PS Getting used to my iPhone. Excuse typos and nonsensical auto corrects!

Dennis

unread,

Jan 16, 2012, 10:56:57 AM1/16/12

to

On 1/16/2012 12:36 AM, Ingres Forums wrote:
> Could you please perform this query and let us know the result?
> SELECT DBMSINFO('unicode_level')\p\g
>
>

This returns 1

Dennis

Dennis

unread,

Jan 16, 2012, 11:04:52 AM1/16/12

to

On 1/16/2012 6:16 AM, Kristoff wrote:
> It depends on the unicode collation sequence in use. The default
> unicode collation sequence used by Ingres treats underscore and space
> as the same.
> You can use the udefault5 collation sequence instead (createdb -
> iudefault5 dbname), which distinguishes between these characters (and
> some more).

It looks like alterdb will let me change the collation sequence, but am
I better off to unload, drop, createdb and reload?

This database was created with createdb -i and II_CHARSET is UTF8. The
database was created the same way on Ingres 10 as well.

> Not sure at the moment whether this was available in the first 9.2 GA
> release, but at least available by patch and in 10.0.
>
> By the way, when using a UTF8 installation you will see the problem
> with varchar too.

These are declared as varchar.

Thanks

Dennis

Ingres Forums

unread,

Jan 17, 2012, 3:21:27 AM1/17/12

to

This confirms unicode is being used here.
'Knowlege Base | How to Determine if Unicode Enabled? - 411686 -
Actian' (http://www.actian.com/kb/article/411686)

This was also the most of discussion topic yesterday.

Dennis

unread,

Jan 17, 2012, 5:38:57 AM1/17/12

to

On 1/16/2012 6:16 AM, Kristoff wrote:

> It depends on the unicode collation sequence in use. The default
> unicode collation sequence used by Ingres treats underscore and space
> as the same.
> You can use the udefault5 collation sequence instead (createdb -
> iudefault5 dbname), which distinguishes between these characters (and
> some more).
> Not sure at the moment whether this was available in the first 9.2 GA
> release, but at least available by patch and in 10.0.

Is it safe to copy the udefault5 file from an Ingres 10 installation to
an Ingres 9.2 installation, or does this require some other patch to use
this in Ingres 9.2?

Dennis

Kristoff

unread,

Jan 17, 2012, 10:46:21 AM1/17/12

to

To be safe, copy the file udefault5.uce (in $II_SYSTEM/ingres/files/
collation/) to the 9.2 installation into the same directory and then
compile the collation sequence using:
aducompile udefault5.uce udefault5 -u

Kristoff

James K. Lowden

unread,

Jan 18, 2012, 6:31:01 PM1/18/12

to info-...@kettleriverconsulting.com

On Mon, 16 Jan 2012 05:16:50 -0800 (PST)
Kristoff <kristof...@ingres.com> wrote:

> The default unicode collation sequence used by Ingres treats
> underscore and space as the same.

That strikes me as somewhat ... counterintuitive.

How is such a collation advantageous? Is there a Unicode
recommendation that led Ingres^WActian to choose it?

--jkl

Kristoff

unread,

Jan 19, 2012, 3:30:14 AM1/19/12

to

On Jan 19, 12:31 am, "James K. Lowden" <jklow...@schemamania.org>
wrote:

> On Mon, 16 Jan 2012 05:16:50 -0800 (PST)
>

> Kristoff <kristoff.pic...@ingres.com> wrote:
> > The default unicode collation sequence used by Ingres treats
> > underscore and space as the same.
>
> That strikes me as somewhat ... counterintuitive.
>
> How is such a collation advantageous? Is there a Unicode
> recommendation that led Ingres^WActian to choose it?
>
> --jkl

Don't know the advantage of that, but it was the official standard
collation sequence for Unicode 2.1 (http://www.unicode.org/versions/
components-2.1.9.html), fairly outdated now.

Kristoff

James K. Lowden

unread,

Jan 21, 2012, 4:36:50 PM1/21/12

to info-...@kettleriverconsulting.com

On Thu, 19 Jan 2012 00:30:14 -0800 (PST)
Kristoff <kristof...@ingres.com> wrote:
> On Jan 19, 12:31 am, "James K. Lowden" <jklow...@schemamania.org>
> wrote:
> > On Mon, 16 Jan 2012 05:16:50 -0800 (PST)
> > Kristoff <kristoff.pic...@ingres.com> wrote:
>
> > > The default unicode collation sequence used by Ingres treats
> > > underscore and space as the same.
> >

> > How is such a collation advantageous?
>

> Don't know the advantage of that, but it was the official standard
> collation sequence for Unicode 2.1 (http://www.unicode.org/versions/
> components-2.1.9.html), fairly outdated now.

Hi Kristoff,

AIUI the default collation for blank and underscore are not the same.
Maybe I'm misinterpreting what you said or what it says.

http://unicode.org/collation/
points to
http://www.unicode.org/reports/tr10/,
the Unicode Collation Algorithm, which references
http://www.unicode.org/reports/tr10/#Allkeys
the allkeys file, the Default Unicode Collation Element Table,
which says:

$ grep -E '^00.+(LOW LINE|SPACE)$' allkeys.txt | head
0020 ; [*020A.0020.0002.0020] # SPACE
005F ; [*021B.0020.0002.005F] # LOW LINE

They sure don't *look* the same. They differ on the first level and
the fourth, which the UCA says is "computable".

The '*' indicates a "variable collation elements", which
the standard says "can be either treated as ignorables or not". These
constitute 22% of the Basic Latin page, 58 characters.

Is it the case that Ingres treats all such elements as ignorable?

I read
http://docs.actian.com/ingres/10.0/system-administrator-guide/3937-supported-collation-sequences?hilite=collation
but it doesn't describe a general Unicode collation.

For those following along at home, the UCA also says sequences of
igorables are all ignored e.g.,

"This is some text"

would sort equal to any of

" This is some text"
"This is some text"
"Thisissometext"
"This__is__some__text"

The whole set of ignorables includes most punctuation, so an
implementation could also fairly treat these as also equivalent:

"This is some text?"
"(This is some text)"
"This is *some* text!"

I'm not saying this is what Ingres is doing. I'm saying this is a
valid interpretation of what I understand to be the controlling
standard. And that nothing in the UCA DUCET standard suggests blanks
and underscores must have the same sorting value.

--jkl

Kristoff

unread,

Jan 23, 2012, 4:21:09 AM1/23/12

to

On Jan 21, 10:36 pm, "James K. Lowden" <jklow...@schemamania.org>
wrote:

> On Thu, 19 Jan 2012 00:30:14 -0800 (PST)
>

> I readhttp://docs.actian.com/ingres/10.0/system-administrator-guide/3937-su...

> but it doesn't describe a general Unicode collation.
>
> For those following along at home, the UCA also says sequences of
> igorables are all ignored e.g.,
>
> "This is some text"
>
> would sort equal to any of
>
> " This is some text"
> "This is some text"
> "Thisissometext"
> "This__is__some__text"
>
> The whole set of ignorables includes most punctuation, so an
> implementation could also fairly treat these as also equivalent:
>
> "This is some text?"
> "(This is some text)"
> "This is *some* text!"
>
> I'm not saying this is what Ingres is doing. I'm saying this is a
> valid interpretation of what I understand to be the controlling
> standard. And that nothing in the UCA DUCET standard suggests blanks
> and underscores must have the same sorting value.
>
> --jkl

Hi James,
I'm not speaking about the actual default collation at unicode.org,
but the old one which came with the 2.0 standard. Take a look at:
http://unicode.org/Public/UCA/2.1.9/basekeys-2.1.9.txt

$egrep '^005F|^0020' basekeys-2.1.9.txt

0020 ; [*0209.0020.0002.0020] # SPACE
005F ; [*0209.0021.0002.005F] # LOW LINE; COMPATSEQ

So in this old collation sequence the first weight of both chars is
the same.

Kristoff

Ian Kirkham

unread,

Jan 23, 2012, 4:38:57 AM1/23/12

to Ingres and related product discussion forum

Hi James,
You are correct in your reading of *current* DUCET tables but as Kristoff said 9.2 Ingres is based on an *older* DUCET which did collate space and underscore. If you don't like it, you can change it.
Regards,
Ian
-----Original Message-----
From: info-ingr...@kettleriverconsulting.com [mailto:info-ingr...@kettleriverconsulting.com] On Behalf Of James K. Lowden
Sent: 21 January 2012 21:37
To: info-...@kettleriverconsulting.com
Subject: Re: [Info-Ingres] Ingres 9.2 - Why does Ingres think these are the same?

On Thu, 19 Jan 2012 00:30:14 -0800 (PST)
Kristoff <kristof...@ingres.com> wrote:
> On Jan 19, 12:31 am, "James K. Lowden" <jklow...@schemamania.org>
> wrote:
> > On Mon, 16 Jan 2012 05:16:50 -0800 (PST)
> > Kristoff <kristoff.pic...@ingres.com> wrote:
>
> > > The default unicode collation sequence used by Ingres treats
> > > underscore and space as the same.
> >

> > How is such a collation advantageous?
>

> Don't know the advantage of that, but it was the official standard
> collation sequence for Unicode 2.1 (http://www.unicode.org/versions/
> components-2.1.9.html), fairly outdated now.

Hi Kristoff,

AIUI the default collation for blank and underscore are not the same.
Maybe I'm misinterpreting what you said or what it says.

http://unicode.org/collation/
points to
http://www.unicode.org/reports/tr10/,
the Unicode Collation Algorithm, which references
http://www.unicode.org/reports/tr10/#Allkeys
the allkeys file, the Default Unicode Collation Element Table,
which says:

$ grep -E '^00.+(LOW LINE|SPACE)$' allkeys.txt | head
0020 ; [*020A.0020.0002.0020] # SPACE
005F ; [*021B.0020.0002.005F] # LOW LINE

They sure don't *look* the same. They differ on the first level and
the fourth, which the UCA says is "computable".

The '*' indicates a "variable collation elements", which
the standard says "can be either treated as ignorables or not". These
constitute 22% of the Basic Latin page, 58 characters.

Is it the case that Ingres treats all such elements as ignorable?

I read

http://docs.actian.com/ingres/10.0/system-administrator-guide/3937-supported-collation-sequences?hilite=collation

but it doesn't describe a general Unicode collation.

For those following along at home, the UCA also says sequences of
igorables are all ignored e.g.,

"This is some text"

would sort equal to any of

" This is some text"
"This is some text"
"Thisissometext"
"This__is__some__text"

The whole set of ignorables includes most punctuation, so an
implementation could also fairly treat these as also equivalent:

"This is some text?"
"(This is some text)"
"This is *some* text!"

I'm not saying this is what Ingres is doing. I'm saying this is a
valid interpretation of what I understand to be the controlling
standard. And that nothing in the UCA DUCET standard suggests blanks
and underscores must have the same sorting value.

--jkl

James K. Lowden

unread,

Jan 24, 2012, 12:03:39 AM1/24/12

to info-...@kettleriverconsulting.com

On Mon, 23 Jan 2012 01:21:09 -0800 (PST)
Kristoff <kristof...@ingres.com> wrote:

> On Jan 21, 10:36 pm, "James K. Lowden" <jklow...@schemamania.org>
> wrote:
> > the allkeys file, the Default Unicode Collation Element Table,
> > which says:
> >
> > $ grep -E '^00.+(LOW LINE|SPACE)$' allkeys.txt | head
> > 0020 ; [*020A.0020.0002.0020] # SPACE
> > 005F ; [*021B.0020.0002.005F] # LOW LINE

...

> > I'm not saying this is what Ingres is doing. I'm saying this is a
> > valid interpretation of what I understand to be the controlling
> > standard.
>

> I'm not speaking about the actual default collation at unicode.org,
> but the old one which came with the 2.0 standard. Take a look at:
> http://unicode.org/Public/UCA/2.1.9/basekeys-2.1.9.txt
>
> $egrep '^005F|^0020' basekeys-2.1.9.txt
>
> 0020 ; [*0209.0020.0002.0020] # SPACE
> 005F ; [*0209.0021.0002.005F] # LOW LINE; COMPATSEQ
>
> So in this old collation sequence the first weight of both chars is
> the same.

Thanks for the clarification, Kristoff. If I may, did the standard
also change how/when the second field comes into play? AIUI
the current standard, each level is a tie-breaker for the the one
before. I won't go into detail now in case it's irrelevant.

To the OP's question, is the Unicode collation user modfiable (without a
compiler) or contemplated for a future release?

Regards,

--jkl

Kristoff

unread,

Jan 24, 2012, 4:24:20 AM1/24/12

to

On Jan 24, 6:03 am, "James K. Lowden" <jklow...@schemamania.org>
wrote:

Yes, the collation can be changed - there is no "true" collation which
would fit into any language, so actually it is expected that user will
customize it.
For Ingres you need to compile the collation sequence, but the
compiler (aducompile) is delivered with the Ingres distribution.

Be careful when creating your own sequence or when using the very
latest available at unicode.org. Currently Ingres does not understand
all the actual unicode features. If you look at the udefault5
collation which is now delivered in addition to the old one, it is not
the original one coming with Unicode 5.0, it only use the weights
defined in that version, but only contains those characters defined in
standard one coming wit Ingres.

Kristoff