Corrupt indexes on a system catalog

Jacob Salomon

unread,

Jan 6, 1998, 3:00:00 AM1/6/98

to

AIX 3.1.4
OnLine 7.11.UC1
Level: Not for the faint of heart.

Hi Family.

I have a biiig problem with a client system down. I am bcc'ing the tech
support person but I think he needs some help.

Several times a day, the alarm program kicks in with a complaint about a
corrupt index and the af file tells me to run the appropriate oncheck
command. When this succeeds, it looks OK. But it fails alot. One of the
af messages got me suspicious and decided to run oncheck against
systables, syscolumns and sysindexes. Here is a sample session:

43p-v3$ oncheck -cDI garpacv2:systables

Validating indexes for garpacv2:informix.systables...
Index tabname
Index tabid
ERROR:Key value mismatch between data row and btree item
Rowid 0xa21 contains key value:
Key: 4463:
Btree item contains rowid 0xa21, key value:
Key: 4399:

Index tabid is bad. OK to repair it? y
Error recreating index.
ISAM error: non-exclusive access.

TBLspace data check for garpacv2:informix.systables

Now, this non-exlusive access is strange - I (informix) am the only one
on the system. I even got this when I ran oncheck in quiescent mode.

In another window I ran onstat -k before the oncheck and while the
oncheck is awaiting my "y". The lock list before the oncheck is, of
course, empty - my control. Here is the result when I ran it while the
prompt was up:

43p:/home/garpac/jake[234]> onstat -k

INFORMIX-OnLine Version 7.11.UC1 -- Quiescent -- Up 20:03:53 -- 26616
Kbytes

Locks
address wtlist owner lklist type tblsnum rowid key#/bsiz
30257f60 0 4000dae8 0 HDR+S 100002 209 0
3025887c 0 4000dae8 30257f60 HDR+S 300002 0 0
2 active, 80000 total, 32768 hash buckets

As I recall, tblspace 100002 is table sysdatabases and rowid 209 refers
to the sysdatabases entry for my database (garpacv2). tblspace 300002 is
systables from my database - I checked.

So, I am holding a lock on a table (systables), yet when I go to do
something to the table I have locked I seem to be blocked my my own
lock.

43p:/home/garpac/jake[235]> onstat -u

INFORMIX-OnLine Version 7.11.UC1 -- Quiescent -- Up 20:08:38 -- 26616
Kbytes

Userthreads
address flags sessid user tty wait tout locks nreads nwrites
4000a010 ---P--D 0 informix - 0 0 0 129 283
4000a444 ---P--F 0 informix - 0 0 0 0 0
4000a878 ---P--F 0 informix - 0 0 0 0 0
4000acac ---P--B 8 informix - 0 0 0 0 0
4000b0e0 ---P--D 0 informix - 0 0 0 0 0
4000dae8 ---P--- 145 informix 4 0 0 2 0 0
4000ebb8 Y------ 145 informix 4 40173a48 0 2 0 0
7 active, 128 total, 19 maximum concurrent

Those last 2 sessions are my oncheck. So what does my session look
like? Sorry you asked...

43p:/home/garpac/jake[236]> onstat -g ses 145

INFORMIX-OnLine Version 7.11.UC1 -- Quiescent -- Up 20:09:48 -- 26616
Kbytes

session #RSAM total used
id user tty pid hostname threads memory memory
145 informix 4 194282 43p 2 81920 41636

tid name rstcb flags curstk status
321 oncheck 4000dae8 ---P--- 3064 sleeping(Forever)
322 oncheckm 4000ebb8 Y------ 5068 cond wait(sm_read)

Memory pools count 1
name class addr totalsize freesize #allocfrag #freefrag
145 V 40326010 81920 40284 132 11

name free used name free used
overhead 0 100 scb 48 40
opentable 0 6856 filetable 28 1348
misc 728 376 log 32768 8336
temprec 0 2444 blob 0 392
ralloc 0 4096 gentcb 0 376
ostcb 0 2268 net 68 3788
tbutility 0 32 sqscb 0 7392
rdahead 0 344 scan_desc 544 0
hashfiletab 0 552 osenv 48 772
oncheck 6052 68 sqtcb 0 1848
fragman 0 208

Sess SQL Current Iso Lock SQL ISAM F.E.
Id Stmt type Database Lvl Mode ERR ERR Vers
145 UNKNOWN garpacv2 CR Not Wait 0 0 7.11

Obviously no SQL statament - I can't run SQL in quiescent mode. What
other info can I include here? Oh yes, the transaction list, such as it
can be under quiescent mode:

43p:/home/garpac/jake[238]> onstat -x

INFORMIX-OnLine Version 7.11.UC1 -- Quiescent -- Up 20:20:56 -- 26616
Kbytes

Transactions
address flags userthread locks log begin isolation retrys coordinator
4002c010 A---- 4000a010 0 0 COMMIT 0
4002c134 A---- 4000a444 0 0 COMMIT 0
4002c258 A---- 4000a878 0 0 COMMIT 0
4002c37c A---- 4000acac 0 0 COMMIT 0
4002c4a0 A---- 4000b0e0 0 0 COMMIT 0
4002ca54 A---- 4000dae8 2 0 COMMIT 0
6 active, 128 total, 15 maximum concurrent

Note the last transaction in the list - the userthread corresponds to
the next-to-last thread listed in onstat -u.

What else can I look at?

Bottom line: HOW CAN I CORRECT a CORRUPTED INDEX ON S SYSTEM CATALOG?

Thanks for any help you can offer. (If you even read this you are quite
a person!)
--
-- Jake (Never yelled "CROWDED THEATER!" during a fire)
+------------------------------------------------------------+
| The expedient performance of a task with excessive concern |
| regarding its duration-to-completion engenders a virtual |
| certainty of diminished benefit therefrom. |
| -- Benjamin Franklin (but he said it in 3 words) |
+------------------------------------------------------------+

Peter Tashkoff

unread,

Jan 6, 1998, 3:00:00 AM1/6/98

to

Jake
Hi
I have had the same problem over the last month except that the table =
concerned was sysindexes, not systables. Other than that, the problem was =
the same.
The gallant lads at tech support tried their best but even with the =
benefit of the source code they were unable to correct the situation. The =
solution that I was forced to take was, dbexport the database, (take a =
level zero archive) , drop and dbimport the database. (I hesitate to share =
with you the feeling I had when I typed drop database for my live =
database...., suffice it to say that I would rather be pursuing undomestica=
ted aquatic avians)

Anyhow, it worked, my database is back, and I got to keep my job. =20
I hope you can find another solution. If you can, Please share it with us =
all
Cheers

Peter Tashkoff <tas...@iname.com>
Zespri International Limited Std Disclaimers Apply
All rights reserved. No party may use this document to vilify another.

Zespri New Zealand Kiwifruit, The World's Finest

>>> Jacob Salomon <ja...@garpac.com> 7/01/98 11:43:51 >>>

Hi Family.

--=20

]Steve Berg

unread,

Jan 7, 1998, 3:00:00 AM1/7/98

to

Jake,

Not to add to your dilemma, but there was a bug with oncheck -cI and
building indexes in 7.10.xx. It was fixed in 7.11.UD1. The bug number is
42264. If a table had multiple extents in the same chunk you would get
the errors you saw when running oncheck. Ofcourse exporting and importing
could sometimes correct the problem id the table became a single extent.

I think I was running 7.10.UC3 at the time.

Good luck
Steve Berg
Neuberger & Berman LLC
s...@masada.nb.com

Felix Koshy Mathews

unread,

Jan 7, 1998, 3:00:00 AM1/7/98

to

Peter Tashkoff wrote:

> Jake
> Hi
> I have had the same problem over the last month except that the table concerned was sysindexes, not systables. Other than that, the problem was the same.
> The gallant lads at tech support tried their best but even with the benefit of the source code they were unable to correct the situation. The solution that I was forced to take was, dbexport the database, (take a level zero archive) , drop and dbimport the database. (I hesitate to share with you the feeling I had when I typed drop database for my live database...., suffice it to say that I would rather be pursuing undomesticated aquatic avians)

<snip>

If you want to re-create your database to overcome this problem you can't use a binary archive / backup, like the one you get from ontape, etc. But I will never use dbexport for many reasons, out of which first two are performance related problems. Normally any well designed/tuned dbschema will have first & next extend sizes and lock level differing from the default sizes of 16-pages (were 8 before) and page level locking ( a major screw-up on OLTP
). When you dbimport your database will be nailed to these defaults. Now after the dbimport if you don't happen to have any document saying what where those extend sizes before, you will end up re-doing the calculations. Other problems are like long transaction, etc. for which you can have workarounds.

We make use of awaked output of dbschema, and unload with named pipes and compress to get a neat-compressed ( 5-7 times compression ) backup. For re-store we use uncompress named pipes & dbaccess, we get the job done perfectly. This whole thing is done in two scripts one for archive and another one for restore

--
Have a nice day

Felix K. Mathews
mailto:fmat...@systems.dhl.com

Peter Tashkoff

unread,

Jan 7, 1998, 3:00:00 AM1/7/98

to

dbexport -ss gives the server specific information including extent sizes
lock levels and dbspace assignments.
Taking the logging off the database until it is loaded
avoids any long transaction issues.
I don't disagree with your other comments though Felix, I simply prefer to
use dbexport because it is a known quality to me, and I do not
have a background of using unloads, named pipes etc.
I can imagine that others like yourself, who are comfortable
at this level would prefer not to use dbexport.
cheers

Peter Tashkoff <tas...@iname.com>
Zespri International Limited Std Disclaimers Apply
All rights reserved. No party may use this document to vilify another.

Zespri New Zealand Kiwifruit, The World's Finest

>>> Felix Koshy Mathews <fmat...@systems.DHL.COM> 8/01/98 11:34:32 >>>
Stuff snipped
Normally any well designed/tuned dbschema will have first & next extend =
sizes and lock level differing from the default sizes of 16-pages (were 8 =

before) and page level locking ( a major screw-up on OLTP

). When you dbimport your database will be nailed to these defaults. Now =
after the dbimport if you don't happen to have any document saying what =
where those extend sizes before, you will end up re-doing the calculations.=
Other problems are like long transaction, etc. for which you can have =
workarounds.
stuff snipped

Felix Koshy Mathews

unread,

Jan 7, 1998, 3:00:00 AM1/7/98

to

Peter Tashkoff wrote:

> dbexport -ss gives the server specific information including extent sizes
> lock levels and dbspace assignments.
> Taking the logging off the database until it is loaded
> avoids any long transaction issues.
> I don't disagree with your other comments though Felix, I simply prefer to
> use dbexport because it is a known quality to me, and I do not

<snip>

Hi-Pete,

You are right about that -ss option, but I have seen problems with the Informix_Server_Id being slapped in before column names which gets updated in triggers, we started noticing this from Engine 7.23. It modifies "UPDATE Table SET Col_Name" to "UPDATE Ownerid.Table SET
DBName@Informix_Server_Id:ownerid.Table". If you use this backup/dbexport to restore the database under another Informix_Server_Id, it will fail. As long as you are comfortable with dbexport, and it doesn't give you any trouble, it is good, but databases which have not been successfully dbimported from
dbexport-files, may face problems.

David Williams

unread,

Jan 9, 1998, 3:00:00 AM1/9/98

to

In article <34B2B3A7...@garpac.com>, Jacob Salomon
<ja...@garpac.com> writes

Yes, you have a shared lock in the database (to stop anyone from
dropping it whilst you are acessing it) and a shared lock on
systables. i.e. you are reading systables.

>So, I am holding a lock on a table (systables), yet when I go to do
>something to the table I have locked I seem to be blocked my my own
>lock.
>
>43p:/home/garpac/jake[235]> onstat -u
>
>INFORMIX-OnLine Version 7.11.UC1 -- Quiescent -- Up 20:08:38 -- 26616
>Kbytes
>
>Userthreads
>address flags sessid user tty wait tout locks nreads nwrites
>4000a010 ---P--D 0 informix - 0 0 0 129 283
>4000a444 ---P--F 0 informix - 0 0 0 0 0
>4000a878 ---P--F 0 informix - 0 0 0 0 0
>4000acac ---P--B 8 informix - 0 0 0 0 0

What is session 8 doing? This is not an oninit since it would have a
sessionid of 0.

Yes, the oncheck thread is holdng two locks. Everything seems ok.

What does onstat -g ath show?

Try bouncing the engine to make sure no-one is connected, even
remotely as informix.

>What else can I look at?
>
>Bottom line: HOW CAN I CORRECT a CORRUPTED INDEX ON S SYSTEM CATALOG?
>
>Thanks for any help you can offer. (If you even read this you are quite
>a person!)

--
David Williams

Maintainer of the Informix FAQ
Primary site (Beta Version) http://www.smooth1.demon.co.uk
Official site http://www.iiug.org/techinfo/faq/faq_top.html

I see you standin', Standin' on your own, It's such a lonely place for you, For
you to be If you need a shoulder, Or if you need a friend, I'll be here
standing, Until the bitter end...
So don't chastise me Or think I, I mean you harm...
All I ever wanted Was for you To know that I care

Jacob Salomon

unread,

Jan 19, 1998, 3:00:00 AM1/19/98

to

Hi Family.

This is a detailed summary of what I did to get out of a very serious
problem. As with the original, it is not reading for the faint of
heart. But since some of the replys I got indicated this has happened to
others, it may serve as a template for others in this situation, with
ideas on how to peice back together a corrupted OnLine system. Yes, it
was certainly a traumatic experience.

In case you are unfamiliar with this thread, please refer to the
original plaint at the end of this article. In a nutshell, the OnLine
system was corrupt at the partition page level, with the symptoms
showing up as tables with no columns defined. The result was that I
could get no info on one set of tables (18 to be exact), with the error
message to the effect that the entry is not found in syscolumns.

Thus, the following query:
select tabname from systables
where tabid not in (select tabid from syscolumns)

actually returned 18 table names.

I tried running dbexport on the database but it too ran up against the
corruption and inconsistent behavior in the catalogs.

This is consistent with some of the more sympathetic responses I got
from my original post. With the help of Informix tech support to get me
started, as well as a utility from iiug, I was able to unload the
database (800 tables or so) with their schemas, destroy the OnLine
instance, rebuild it, and reload the database. All in one (very long)
evening of work. Overall, I was fortunate (other things considered)
that the OnLine system here was rather small and that there was enough
file-system space to do everything I needed to do.

1.
For starters, there was another uncorrupted database in the instance.
How uncorrupted? I could run dbexport against it with no problems. I
did this.

2.
Next, I ran the following SQL script against the corrupted database:

select tabname from systables
where tabid not in (select tabid from syscolumns)

Yes, it’s the above query, identifying 18 tables I would need to
reconstruct from other sources, which were available on another machine.

3.
In order to save all schemas that I could save, I ran the following
script against the corrupted database:

output to unload_schema.sh without headings
select "dbschema -q -ss -d garpacv2 -t "
|| trim(tabname) || " create_" || trim(tabname) || ".sql;"
from systables
where tabid in (select tabid from syscolumns
where tabid > 99)
This produced a shell script in the form of:
dbschema -q -ss -d garpacv2 -t stxactnr create_stxactnr.sql;
dbschema -q -ss -d garpacv2 -t stxaddld create_stxaddld.sql;
dbschema -q -ss -d garpacv2 -t stxaddlr create_stxaddlr.sql;
. . . etc.

Of course, I had to remember to chmod +x on the resulting shell script
file.

4.
Similarly, I needed to save all the data I could save, hence the
following SQL script:

output to onload_tables.sql without headers
select "unload to " || trim(tabname) || ".unl"
|| " select * from " || trim(tabname) || ";"
from systables
where tabid in (select tabid from syscolumns
where tabid > 99)

This produced an SQL script looking like:

unload to stxactnr.unl select * from stxactnr;
unload to stxaddld.unl select * from stxaddld;
unload to stxaddlr.unl select * from stxaddlr;
unload to stxerord.unl select * from stxerord;
. . . etc.

5.
I ran the shell script created in step 3 and the SQL script created in
step 4. Hence, a crude approximation of a dbexport.

6.
Next, I needed to modify each schema script (created in step 3) to add a
load command. Hence the following shell command:

for SQ in create_*.sql -- For each schqma
do
tname=`echo $SQ | sed -e s/create_// -e s/\.sql//`
unlname=${tname}.unl
load_cmd="load from $unlname insert into $tname \;"
echo $load_name >>$SQ
done

Accomplishment: At the end of each schema file, after the create table
and create index commands, I have appended a load command.

If I had not been so under the gun, I might have come up with a more
clever script, with more sed, to insert the load command before the
create index. (Sigh.. Dilbert’s dilemma.. ;-)

7.
Obtained the copy_spaces script from the iiug archives. Thanks very
much to author Pete Stiglich. I *did* have to edit it to work the way I
needed with my (korn) shell but these were minor - < 10 minutes. Ran
copy_spaces to produce a shell-script of onspaces commands that would
recreate all of my dbspaces should the need arise.
***** The need has arisen! *****
8.
Modified the ONCONFIG file (yea, I saved a copy first) for the location
of the PHYSLOG and number of logical logs.

9.
Moment of truth! I took the instance off line and ran oninit -i to
reinitialize the root Dbspace.

10.
Ran the script produced by copy_spaces, getting back all my dbspaces.

11.
Ran the necessary onparams commands to get all my logs back into the
logdbs, followed by the necessary ontape to activate them. Ran onmode -l
a few times followed by onmode -c to make these new logs the active ones
and, since I was logging (initially) to /dev/null, allowing me to drop
the original 3 logs in rootdbs.

My system was now configured as before.

12.
Ran dbimport on the other database that had been uncorrupted.

13.
Recreated the formerly corrupted database in its intended dbspace.

14.
Ran the shell command loop:

for SQ in create_*.sql
do
dbaccess garpacv2 $SQ
done

15.
Manually created the 18 tables that could not be unloaded.

16.
Re-enabled logging on both databases.

17.
Scream in triumph! Howl at the moon (except it was a dark and stormy
night). Came up with a quick evasion when the building security guys
came it with a butterfly net.

Those 18 tables were, of course, empty. This skewed the application at
first but (another furtunate accident) the data was not essential and
could be regenerated by our admin folks. Permissions were also screwed
up but this was an easy, if tedious, task to fix up.

Thanks to those who tried to offer help but it was really to no avail
and this horrid task really was the only way to get my users back in
business.
--
=========================== Original Message ===========================

Hi Family.

========================================================================