
Problems with long checkpoints


Sean Baker

Sep 27, 2011, 3:05:47 PM
to inform...@iiug.org
Hello,

We're stuck without a dba after some massive layoffs. I know a little, but I'm no dba.

IDS 11.50.FC6 on RHEL 5.3

We have a development instance that is suddenly going into long checkpoints (43 minutes, 61 minutes, 90 minutes!). We've rebooted the server and restarted the instance, and even did a restore from our production server to the dev server. It lasts for a while (maybe 30 minutes) then goes into long checkpoints.

One of our developers has been creating new dbspaces, and it seems that might be the cause. Although he's added a few over the past several months without any problems. After the restore today, he added a dbspace, and we're stuck in long checkpoints again:

IBM Informix Dynamic Server Version 11.50.FC6 -- On-Line (CKPT INP) -- Up 01:39:05 -- 6638456 Kbytes

I've listed the last few log entries from today below. Can anyone give me some obvious things to check? Or some better data to provide to the list for some help?

Thanks,

Sean

Log:

11:31:41 Checkpoint Completed: duration was 0 seconds.
11:31:41 Tue Sep 27 - loguniq 15415, logpos 0x33f1018, timestamp: 0x4f4c8892 Interval: 209340

11:31:41 Maximum server connections 12
11:31:41 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 0, Llog used 2

11:35:02 Space 'sm_dbs0' added.
11:35:02 Checkpoint Completed: duration was 0 seconds.
11:35:02 Tue Sep 27 - loguniq 15415, logpos 0x33fb2e4, timestamp: 0x4f4c8a3d Interval: 209341

11:35:02 Maximum server connections 12
11:35:02 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 16, Llog used 10

11:35:13 Checkpoint Completed: duration was 0 seconds.
11:35:13 Tue Sep 27 - loguniq 15415, logpos 0x33fe018, timestamp: 0x4f4c8b0c Interval: 209342

11:35:13 Maximum server connections 12
11:35:13 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 0, Llog used 3

11:40:40 Checkpoint Completed: duration was 1 seconds.
11:40:40 Tue Sep 27 - loguniq 15415, logpos 0x3867018, timestamp: 0x4f4cdac8 Interval: 209343

11:40:40 Maximum server connections 12
11:40:40 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 134, Llog used 1129

11:44:58 Adjusting LRU for bufferpool - id 0 size 2k
11:44:58 Old max 60.0 min 50.0 New max 59.4 min 49.5
11:46:07 Adjusting LRU for bufferpool - id 0 size 2k
11:46:07 Old max 59.4 min 49.5 New max 58.8 min 49.0

jrenaut

Sep 27, 2011, 3:40:46 PM
to

Well, we can't really get much from what you've posted, as the
checkpoint durations listed in that snippet of the MSGPATH file are 0
seconds. How about you post onstat -g ckp output after a long
checkpoint happens? Perhaps that will be more helpful.
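
If it's intermittent, one crude way to catch it in the act is to keep timestamped snapshots (just a sketch; assumes onstat is on your PATH and /tmp has room, adjust interval and location to taste):

----------
# keep timestamped copies of checkpoint history so the long one is captured
while true; do
  onstat -g ckp > /tmp/ckp.$(date +%Y%m%d%H%M%S).out
  sleep 300
done
----------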

Sean Baker

Sep 27, 2011, 3:48:18 PM
to inform...@iiug.org
Will do. Although we're going on an hour since the last checkpoint...

jrenaut

Sep 27, 2011, 4:06:10 PM
to
On Sep 27, 2:48 pm, Sean Baker <SBa...@moneymailer.com> wrote:
Well, if it's happening right now, how about you post an onstat -g
ath, then find the thread id for the main_loop() thread and post an
onstat -g stk <thread id of main_loop() thread> to see what main_loop
is doing, or what it might be waiting for.
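
Something like this should do it in one go (rough sketch; assumes only one main_loop() thread shows up in onstat -g ath):

----------
# pull the tid of the main_loop() thread and dump its stack
tid=$(onstat -g ath | awk '/main_loop\(\)/ {print $1}')
onstat -g stk $tid
----------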

Jacques Renaut
IBM Informix Advanced Support
APD Team

Sean Baker

Sep 27, 2011, 4:15:14 PM
to inform...@iiug.org
The developer tried dropping the dbspace he created, so now we're blocked with "DBS_DROP". Here's the onstat -g ath for the main_loop():

IBM Informix Dynamic Server Version 11.50.FC6 -- On-Line (CKPT INP) -- Up 02:48:10 -- 6638456 Kbytes
Blocked:DBS_DROP

Threads:
tid tcb rstcb prty status vp-class name
18 15f37a548 15e56f028 3 sleeping secs: 1 5cpu main_loop()

And here's onstat -g stk 18:

IBM Informix Dynamic Server Version 11.50.FC6 -- On-Line (CKPT INP) -- Up 02:49:56 -- 6638456 Kbytes
Blocked:DBS_DROP

Stack for thread: 18 main_loop()
base: 0x000000015f3c6000
len: 69632
pc: 0x0000000001073d0e
tos: 0x000000015f3d5a40
state: sleeping
vp: 4

0x0000000001073d0e (/opt/IBM/informix/bin/oninit) yield_processor_mvp
0x000000000107556b (/opt/IBM/informix/bin/oninit) mt_yield
0x0000000000bb951a (/opt/IBM/informix/bin/oninit) dskflush_bfpool
0x0000000000bb98c2 (/opt/IBM/informix/bin/oninit) dskflush
0x0000000000b56165 (/opt/IBM/informix/bin/oninit) checkpoint
0x00000000005336e9 (/opt/IBM/informix/bin/oninit) main_loop
0x000000000108560f (/opt/IBM/informix/bin/oninit) startup

Thanks,

Sean.

Sean Baker

Sep 27, 2011, 4:30:38 PM
to inform...@iiug.org
The checkpoint finally finished. Here's onstat -g ckp:

----------
IBM Informix Dynamic Server Version 11.50.FC6 -- On-Line -- Up 03:02:03 -- 6638456 Kbytes

AUTO_CKPTS=On RTO_SERVER_RESTART=Off

Critical Sections Physical Log Logical Log
Clock Total Flush Block # Ckpt Wait Long # Dirty Dskflu Total Avg Total Avg
Interval Time Trigger LSN Time Time Time Waits Time Time Time Buffers /Sec Pages /Sec Pages /Sec
209328 11:06:09 *Reorg 15415:0x312f018 0.0 0.0 0.0 1 0.0 0.0 0.0 6 6 30 30 24 24
209329 11:06:10 *Reorg 15415:0x314b018 0.0 0.0 0.0 1 0.0 0.0 0.0 12 12 53 53 28 28
209330 11:06:19 *Reorg 15415:0x32657bc 0.0 0.0 0.0 2 0.0 0.0 0.0 18 18 85 9 283 31
209331 11:06:20 *Reorg 15415:0x326b7bc 0.0 0.0 0.0 1 0.0 0.0 0.0 8 8 24 24 6 6
209332 11:06:21 *Reorg 15415:0x3282018 0.0 0.0 0.0 1 0.0 0.0 0.0 6 6 29 29 22 22
209333 11:06:22 *Reorg 15415:0x329a018 0.0 0.0 0.0 1 0.0 0.0 0.0 11 11 49 49 24 24
209334 11:06:28 *Reorg 15415:0x33cc7bc 0.0 0.0 0.0 1 0.0 0.0 0.0 14 14 114 19 307 51
209335 11:06:29 *Reorg 15415:0x33d4018 0.0 0.0 0.0 1 0.0 0.0 0.0 5 5 26 26 7 7
209336 11:11:39 CKPTINTVL 15415:0x33e0018 0.0 0.0 0.0 0 0.0 0.0 0.0 117 117 65 0 12 0
209337 11:16:40 CKPTINTVL 15415:0x33e2018 0.0 0.0 0.0 0 0.0 0.0 0.0 10 10 0 0 2 0
209338 11:21:40 CKPTINTVL 15415:0x33ed018 0.0 0.0 0.0 0 0.0 0.0 0.0 43 43 35 0 11 0
209339 11:26:40 CKPTINTVL 15415:0x33ef018 0.0 0.0 0.0 0 0.0 0.0 0.0 10 10 2 0 2 0
209340 11:31:41 CKPTINTVL 15415:0x33f1018 0.0 0.0 0.0 0 0.0 0.0 0.0 9 9 0 0 2 0
209341 11:35:02 *Admin 15415:0x33fb2e4 0.0 0.0 0.0 1 0.0 0.0 0.0 2 2 16 0 10 0
209342 11:35:13 *User 15415:0x33fe018 0.0 0.0 0.0 1 0.0 0.0 0.0 0 0 0 0 3 0
209343 11:40:40 CKPTINTVL 15415:0x3867018 1.1 1.1 0.0 0 0.0 0.0 0.0 897 831 134 0 1129 3
209344 13:23:25 CKPTINTVL 15415:0x477779c 5865.4 5718.6 0.0 1 0.8 24.7 24.7 1818113 317 332 1 4270 13
209345 13:23:26 *Admin 15415:0x4917018 0.2 0.0 0.0 25 0.0 0.2 0.2 151 151 28120 4 416 0
209346 13:23:26 *Admin 15415:0x492a018 0.0 0.0 0.0 1 0.0 0.0 0.0 15 15 37 37 19 19
209347 13:24:11 *User 15415:0x492c018 0.0 0.0 0.0 1 0.0 0.0 0.0 0 0 0 0 2 0

Max Plog Max Llog Max Dskflush Avg Dskflush Avg Dirty Blocked
pages/sec pages/sec Time pages/sec pages/sec Time
768 368 5719 43 153 0

Based on the current workload, the physical log might be too small
to accommodate the time it takes to flush the buffer pool during
checkpoint processing. The server might block transactions during checkpoints.
If the server blocks transactions, increase the physical log size to
at least 8838144 KB.
----------

Not sure if the warning at the end is real or not.
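If it is real, am I right that the fix would be something along these lines? (I haven't run this; plogdbs is just a placeholder for whatever dbspace has room, the size is in KB, and I'd take a level-0 backup first.)

----------
# grow the physical log to the size recommended in the onstat -g ckp note
onparams -p -s 8838144 -d plogdbs -y
----------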

Thanks,

Sean.

jrenaut

Sep 27, 2011, 4:34:48 PM
to
Well, main_loop is at the point where it's trying to flush the buffer
pool, so if that's taking a long time it could be an I/O problem.
You could look at onstat -g iof and see whether any dbspace/chunk has
a really bad avg. time. You could also watch onstat -F output: if the
"Chunk" column is changing, the flushers are moving from chunk to
chunk; if it isn't changing, then whichever chunks are being flushed
might be the ones having I/O problems (not likely errors, just slow
I/O performance).
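
If you want to watch it while a checkpoint is stuck, a quick loop is usually enough (rough sketch, just dumps to the terminal; stop it with Ctrl-C):

----------
# sample flusher states and per-chunk I/O stats every 10 seconds
while true; do
  date
  onstat -F
  onstat -g iof
  sleep 10
done
----------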

jrenaut

Sep 27, 2011, 5:12:13 PM
to
Taking a look at your onstat -g ckp output, your system goes from <
1000 dirty buffers per checkpoint to the checkpoint that takes ~5800
seconds, which shows 1.8 million dirty buffers. So it would seem you
are doing something on your system that makes a huge number of page
modifications, and at checkpoint time those 1.8 million pages need to
be flushed, which is taking a long time. It might be a slow I/O issue,
or you might just need to tune your page cleaners to do more
aggressive cleaning by changing your LRU min/max settings. Or you need
to find out why your system suddenly dirtied up 1.8 million pages, if
that's not supposed to be happening.
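
For reference, in 11.50 the LRU thresholds live on the BUFFERPOOL line in the ONCONFIG, so more aggressive cleaning would look something like the line below (buffer and LRU counts here are made up, keep your own values and only lower lru_min_dirty/lru_max_dirty; I believe the instance has to be bounced for it to take effect, and your log shows auto-LRU tuning already nudging these down a bit):

----------
# illustrative BUFFERPOOL line with lower LRU thresholds (values are placeholders)
BUFFERPOOL size=2K,buffers=500000,lrus=16,lru_min_dirty=10,lru_max_dirty=20
----------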

Sean Baker

Sep 27, 2011, 5:48:17 PM
to inform...@iiug.org
You may be onto something regarding the slow I/O. The developer has been creating these test dbspaces on an NFS volume for several months, but there was an upgrade to the server holding the NFS volume last week. Here's onstat -g iof; I'm only including the output for the main dbspace and the new one:

----------
IBM Informix Dynamic Server Version 11.50.FC6 -- On-Line (CKPT INP) -- Up 01:07:04 -- 6638456 Kbytes

AIO global files:
gfd pathname bytes read page reads bytes write page writes io/s
6 dbs1 4337414144 2117867 3074048 1501 2156.6
op type count avg. time
seeks 0 N/A
reads 221019 0.0005
writes 1028 0.0015
kaio_reads 0 N/A
kaio_writes 0 N/A

9 sm_dbspace0 8747304960 4271145 892133376 435616 17.2
op type count avg. time
seeks 0 N/A
reads 10781 0.0878
writes 353054 0.0571
kaio_reads 0 N/A
kaio_writes 0 N/A
----------

The avg. time for sm_dbspace0 is quite a bit worse than dbs1. As a shot in the dark, we're going to create the dbspace on the same storage that dbs1 is on, just to rule out the NFS volume. Here's onstat -F, but I'm not sure how to interpret the results, other than to say nothing changes if I run it over and over:

----------
IBM Informix Dynamic Server Version 11.50.FC6 -- On-Line (CKPT INP) -- Up 01:13:31 -- 6638456 Kbytes


Fg Writes LRU Writes Chunk Writes
0 545128 13310

address flusher state data # LRU Chunk Wakeups Idle Tim
15e56f880 0 L 1d 3 8 2266 2257.890
15e5700d8 1 L 9 2 5 2261 2257.565
15e570930 2 L 5 4 3 2261 2257.660
15e571188 3 C 7 2 2 2258 2257.661
15e5719e0 4 L 15 2 0 2351 2351.992
15e572238 5 L d 6 0 2386 2382.430
15e572a90 6 L 13 3 0 2317 2316.740
15e5732e8 7 L 11 3 0 2383 2382.432
15e573b40 8 L 23 9 0 2437 2430.166
15e574398 9 L 19 4 0 2328 2327.008
15e574bf0 10 L 21 3 0 2391 2390.153
15e575448 11 L 25 2 0 2392 2391.980
15e575ca0 12 L 1 3 0 2393 2392.360
15e5764f8 13 L 1f 2 0 2264 2263.785
15e576d50 14 L 3 6 0 2284 2279.906
15e5775a8 15 L 27 2 0 2316 2316.650
states: Exit Idle Chunk Lru
----------

Thanks,

Sean.

Obnoxio The Clown

Sep 27, 2011, 5:51:27 PM
to Sean Baker, IIUG Informix List
On 27/09/2011 22:48, Sean Baker wrote:

>
> The avg. time for sm_dbspace0 is quite a bit worse than dbs1. As a shot in the dark, we're going to create the dbspace on the same storage that dbs1 is on, just to rule out the NFS volume. Here's onstat -F, but I'm not sure how to interpret the results, other than to say nothing changes if I run it over and over:

NFS? ARGH!

--
Cheers,
Obnoxio The Clown

http://obotheclown.blogspot.com
I will now proceed to pleasure myself with this fish.

jrenaut

Sep 27, 2011, 10:43:00 PM
to
Ok, I was mistaken in my original post about the onstat -F. I thought
the chunk number the flusher was cleaning was in the column labeled
"Chunk", but it's actually in the column labeled "data". So if the
"state" column is "C" (for chunk), the flusher is trying to clean the
chunk number shown in the "data" column. In your output above, cleaner
#3 is assigned to clean chunk #7, and it appears to be the only
cleaner currently assigned to write out any chunks for the checkpoint.
If the checkpoint has been going on for a long time at this point,
it's probably safe to assume that's the only chunk that has not
finished flushing, so I think that would be the chunk you're having
the I/O slowness on. I'm hoping it maps to some chunk in sm_dbspace0
(you can find the chunk numbers in onstat -d output), which you've
mentioned is NFS mounted, does seem to have a fairly slow avg. time
for write operations, and seems to be doing a lot of them.
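
To map the chunk number back to a path and dbspace, the Chunks section of onstat -d is enough; roughly:

----------
# print the Chunks section; chunk number is the 2nd column of each data row,
# pathname is the last
onstat -d | sed -n '/^Chunks/,$p'
----------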

Art Kagel

Sep 28, 2011, 7:38:25 AM
to Sean Baker, inform...@iiug.org
Here's the cause, in your onstat header during the checkpoint:



IBM Informix Dynamic Server Version 11.50.FC6     -- On-Line (CKPT INP) -- Up 02:48:10 -- 6638456 Kbytes
Blocked:DBS_DROP

The checkpoint is blocked while a dbspace is being dropped.  Probably very slow IO response.  You should not be using NFS mounted volumes for chunks.
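
For what it's worth, recreating the test dbspace on local storage is just the usual onspaces create against a cooked file, roughly like the sketch below (path and size are placeholders only; size is in KB):

----------
# create a cooked chunk file on local disk and a 2 GB dbspace on it (placeholders)
touch /local/chunks/sm_dbs0.c1
chown informix:informix /local/chunks/sm_dbs0.c1
chmod 660 /local/chunks/sm_dbs0.c1
onspaces -c -d sm_dbs0 -p /local/chunks/sm_dbs0.c1 -o 0 -s 2097152
----------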

Art

Art S. Kagel
Advanced DataTools (www.advancedatatools.com)
Blog: http://informix-myview.blogspot.com/

Disclaimer: Please keep in mind that my own opinions are my own opinions and do not reflect on my employer, Advanced DataTools, the IIUG, nor any other organization with which I am associated either explicitly, implicitly, or by inference.  Neither do those opinions reflect those of other individuals affiliated with any entity with which I am affiliated nor those of the entities themselves.


