last Friday and yesterday we had checkpoints with duration of 290 and 135
seconds. All writing applications were blocked. Typically the duration of
checkpoints is 0 to 2 seconds.
Didn't I understand right that blocking checkpoints shouldn't occur any more
in this release (IDS 11.50.FC5W4)? What could be the reason for such long
blocking checkpoints?
TIA,
Reinhard.
You would want to look at the onstat -g ckp output. Yes, most
checkpoints should be the new style, interval checkpoints which should
only block writing for a very small period (and not while the buffer
pool is flushing). But certain activity can still trigger the old
style, block while writing the buffer pool to disk, type checkpoints.
Jacques Renaut
IBM Informix advanced support
AUTO_CKPTS=On RTO_SERVER_RESTART=600 seconds Estimated recovery time 568
seconds
Critical
Sections Physical Log Logical Log
Clock Total Flush Block #
Ckpt Wait Long # Dirty Dskflu Total Avg Total Avg
Interval Time Trigger LSN Time Time Time Waits
Time Time Time Buffers /Sec Pages /Sec Pages /Sec
12597 10:14:31 RTO 1038004:0x1a335b4 1.6 1.4 0.0 27
0.0 0.1 0.2 6060 4314 50671 88 202807 355
12598 10:23:20 RTO 1038017:0x381470c 1.7 1.5 0.0 29
0.0 0.2 0.2 6245 4295 45914 86 203075 383
12599 10:32:03 RTO 1038031:0x1cf142c 1.5 1.3 0.0 31
0.0 0.1 0.2 6525 5112 46859 89 203523 389
12600 10:41:04 RTO 1038045:0x74d018 2.0 1.4 0.0 30
0.0 0.5 0.5 6405 4662 48977 90 204979 378
12601 10:50:10 RTO 1038058:0x23e8130 1.5 1.2 0.0 38
0.0 0.2 0.3 6406 5212 47638 87 202665 371
12602 10:58:36 RTO 1038072:0x667018 2.3 1.5 0.0 34
0.0 0.6 0.7 6385 4153 61832 122 202895 401
12603 11:09:02 RTO 1038085:0x2425018 3.0 1.8 0.0 43
0.0 0.9 1.1 6356 3468 63637 101 203340 324
12604 11:19:10 RTO 1038099:0x6da018 2.6 1.5 0.0 45
0.0 0.9 1.0 6262 4188 65118 107 202942 333
12605 11:29:12 RTO 1038112:0x25b3018 2.4 1.3 0.0 33
0.0 0.8 1.0 6350 4773 53909 89 203268 337
12606 11:39:14 RTO 1038126:0x9b0018 1.8 1.3 0.0 35
0.0 0.4 0.4 6590 4961 54366 90 203153 337
12607 11:48:09 RTO 1038139:0x2aad018 2.3 1.4 0.0 35
0.0 0.7 0.9 6143 4458 56251 105 203925 381
12608 11:58:05 RTO 1038153:0x14b8018 2.4 1.5 0.0 31
0.0 0.7 0.8 6119 4000 50029 83 204900 343
12609 12:08:00 RTO 1038166:0x37a92a0 2.2 1.4 0.0 39
0.0 0.7 0.8 6672 4807 54712 91 204315 343
12610 12:17:45 RTO 1038180:0x1f5d018 2.4 1.5 0.0 44
0.0 0.7 0.8 6216 4207 45676 78 204144 348
12611 12:27:40 RTO 1038189:0x4325018 2.5 1.2 0.0 36
0.0 1.0 1.2 6149 5006 52263 87 204519 343
12612 12:37:02 RTO 1038196:0x2552018 2.0 1.2 0.0 28
0.0 0.7 0.7 6369 5248 46698 83 202591 360
12613 12:47:27 RTO 1038203:0x9ce360 1.9 1.3 0.0 35
0.0 0.4 0.5 6221 4681 50462 80 203239 325
12614 12:57:57 RTO 1038209:0x67da018 1.9 1.4 0.0 34
0.0 0.3 0.4 6181 4406 41553 65 204604 324
12615 13:07:51 RTO 1038216:0x4a22018 2.8 1.9 0.0 38
0.0 0.8 0.9 6241 3359 49462 83 202880 342
12616 13:19:31 RTO 1038223:0x3472478 2.0 1.3 0.0 30
0.0 0.6 0.7 6276 4782 51021 72 204857 292
Max Plog Max Llog Max Dskflush Avg Dskflush Avg Dirty
Blocked
pages/sec pages/sec Time pages/sec pages/sec
Time
2096 1268 4 3973
3 0
The physical log size is smaller than the recommended size for a server
configured with RTO_SERVER_RESTART. Fast recovery performance might not
be optimal. For best fast recovery performance when RTO_SERVER_RESTART
is enabled, increase the physical log size to at least 4620000 KB.
For servers configured with a large buffer pool, this might
not be necessary. See the Administrator's Guide for more information.
Blocked Time is 0. So it is reported that since the last start of IDS no
transactions where blocked. In spite of this we _did_ see CHECKPOINT BLOCKED
during the long checkpoints. How can we solve this conflict and how can we
avoid such sporadically occuring long checkpoints?
> _______________________________________________
> Informix-list mailing list
> Inform...@iiug.org
> http://www.iiug.org/mailman/listinfo/informix-list
>
Block time isn't labeled very well. There's different types of
blocking, and block time is only increased if the server is running
out of resources and it has to switch an interval checkpoint to an old
blocking checkpoint. I do see the "# wait, wait time and long time"
columns have some values, and I think that is what you would be
calling your people blocked on the checkpoint. Those columns are for
threads/sessions that have to wait to enter a critical section but
were prevented because the interval checkpoint is in it's blocking
portion. So # wait is the total number of threads that tried to get
into critical sections but couldn't. The wait time column I think is
the total wait time divided by the # waits, so it's an average time to
wait to get into the critical section, and then long time is the
longest time any 1 thread had to wait to get into it's critical
section. So 11.x checkpoints aren't completely non-blocking, they do
have to block, but they no longer have to block writers (i.e. critical
sections) while flushing out the buffer pool, which depending on how
you have the system configured would generally be the bulk of your
checkpoint duration. Would this system happen to be a SDS primary?
Jacques Renaut
IBM Informix Advanced Support
This output does not show the values from the checkpoints that took a
long time, you need to run it just after the long checkpoints complete
not hours/days
later.
You would also need to output from onstat -g ath and onstat -g stk all
(after the first few seconds of checkpoint time)
whilst the checkpoint is running to see what is causing the issue..
Does anyone know if setting TRACEFUZZYCKPT or TRACECKPT environment
variables to 1 before starting the engine works in 11.50 as it did in
earlier versions?