Help with abnormal pausing of program

Jano

Jun 10, 2003, 7:37:33 PM

Hello everybody,

I have been discussing this matter on the Gnatlist but it seems that we
have not reached any definitive conclusion, so I want to ask here too.

I have a multitasking program; it has service tasks which are usually
accepting rendezvous, and periodic tasks that do some processing and
sleep for some time.

The problem is that I've detected that some (many) of these tasks are
being stopped for unusually long times after running for a while. It is
as if I issue a delay 1.0 but they sleep for much longer. I haven't found
any deadlock or critical race condition anywhere, and the CPU is idle.

Since the program is fairly big and complex, I've distilled the
following test case to rule out problems in my code:

http://www.mycgiserver.com/~mosteo/dead-noio.zip

This is fairly simple but exhibits the same behavior. It has a queue for
trace messages, a task reading them, a task producing them, and other
dummy tasks that seem to make the problem more likely.

The program runs fine for some time, and suddenly traces cease to
appear. Some arbitrary time later, usually less than half an hour (but
sometimes hours, sometimes "never"), everything resumes normally, but it is
evident that the task I'm using as a reference (called heartbeat in my
sample) has slept for much longer than it should. I detect that
condition and report it.

I have tested the following combinations; in each one I launch eight
simultaneous instances of the test:

A) AMD 2000+, WinXP SP1, Gnat 3.15p. Usually some of the instances
exhibit the strange stall in less than half an hour.

B) Same computer as A) but with Linux, Gnat 3.15p native threading:
Everything OK (several hours uptime).

C) Same computer as A) but using ObjectAda free edition: Everything OK
(several days uptime).

D) PIII 1000 WinXP Gnat 3.15p. It has happened only once in several
days.

People from the gnatlist have run similar versions of this test program;
a Win2000 user has reported the freezing after 90 minutes, while users
of other platforms have not reported any problem for very long uptimes.

I would be very grateful if you could review my program and test it. I
would like to know:

a) Whether something in my code justifies abnormally long delays,
because of some misunderstanding or bug in my tasks.

b) Whether this is XP-specific, or at least Windows-specific, or whether
it happens to anyone apart from me.

c) Whether it could be a problem in GNAT for Windows itself.

d) How I could work around it, if possible.

I launched the eight instances just before starting this post. They're
now all frozen, and, stranger still, I saw them freeze within 30 seconds
of each other (I was keeping an eye on them), and they resumed at nearly
the same time. The CPU was idle the whole time. This is driving
me crazy. Any light or hope in this matter would make me very happy. I
usually have uptimes of several days and no other program is showing
strange behavior. I've also tested my RAM.

Thanks in advance,

--
-------------------------
Jano
402450.at.cepsz.unizar.es
-------------------------
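Jano's heartbeat check (the test program itself is only available through the link above) can be approximated as follows. This is a minimal sketch under assumptions: the 1.0 s period, the 0.5 s tolerance, the five beats and the names `Heartbeat_Demo` and `Overslept` are illustrative, not taken from the original code. Each `delay` is timed with `Ada.Calendar.Clock`, and the task reports when the measured sleep is clearly longer than requested.

```ada
with Ada.Calendar; use Ada.Calendar;
with Ada.Text_IO;  use Ada.Text_IO;

procedure Heartbeat_Demo is
   --  Illustrative values, not taken from the original test program.
   Period    : constant Duration := 1.0;
   Tolerance : constant Duration := 0.5;

   --  True when a "delay Period" actually lasted noticeably longer.
   function Overslept (Slept : Duration) return Boolean is
   begin
      return Slept > Period + Tolerance;
   end Overslept;

   task Heartbeat;

   task body Heartbeat is
      Before : Time;
      Slept  : Duration;
   begin
      for Beat in 1 .. 5 loop
         Before := Clock;
         delay Period;
         Slept := Clock - Before;   --  wall-clock time actually spent asleep
         if Overslept (Slept) then
            Put_Line ("Beat" & Integer'Image (Beat) & " overslept:"
                      & Duration'Image (Slept) & " s");
         else
            Put_Line ("Beat" & Integer'Image (Beat) & " ok");
         end if;
      end loop;
   end Heartbeat;

begin
   --  Heartbeat_Demo cannot return until the Heartbeat task terminates.
   null;
end Heartbeat_Demo;
```

If the run-time derives both `delay` and `Ada.Calendar.Clock` from the same faulty counter, a check like this may miss some anomalies; it only flags sleeps that the wall clock sees as too long.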

Dmitriy Anisimkov

Jun 11, 2003, 10:31:07 AM

Jano wrote:
> The problem is that I've detected that some (many) of these tasks are
> being stopped for unusually long times after running for a while. It is
> as if I issue a delay 1.0 but they sleep for much longer. I haven't found
> any deadlock or critical race condition anywhere, and the CPU is idle.

I had the same problem. It was because of wrong results from the QueryPerformanceCounter Win32 API call on some hardware configurations. GNAT for Win32 counts time using QueryPerformanceCounter, and some hardware configurations can make this counter wrong sometimes. See this MSDN article:

http://support.microsoft.com/default.aspx?scid=kb%3Ben-us%3B274323

It talks about the counter leaping forward, and contains a C program to detect it.

I encountered the counter leaping backward, and wrote a program in Ada to detect it.

See the sources to understand how to use it.

A very short description:
press '1' and the counter will be checked 10000 times to verify that each reading is bigger than the previous one;
if it is not, the message
[prev_counter] > [next_counter] appears.

Press '2' and the check runs 20000 times, and so on up to '9'. ('0' prints the current counter, 'f' the counter frequency, and 'q' quits.)

------------------------------------
with System.OS_Interface;
with Ada.Text_IO;

--  Checks that QueryPerformanceCounter never goes backwards.
--  Keys: '1'..'9' = run N * 10000 checks, '0' = print current counter,
--  'f' = print counter frequency, 'q' = quit.
procedure Time is
   use Ada.Text_IO;
   use System.OS_Interface;

   --  LARGE_INTEGER is a real type in GNAT's Win32 binding,
   --  hence the 0.0 literals below.
   Perf_Freq : aliased LARGE_INTEGER;
   Curr_Counter, Prev_Counter : aliased LARGE_INTEGER;
   Char : Character;
   Pass : Natural := 0;
   Max_Diff : LARGE_INTEGER := 0.0;
   Diff : LARGE_INTEGER;

   --  Turn a failed Win32 call into an exception.
   procedure Check (Item : BOOL);
   pragma Inline (Check);

   procedure Check (Item : BOOL) is
   begin
      if not Item then
         raise Program_Error;
      end if;
   end Check;

begin
   Check (QueryPerformanceFrequency (Perf_Freq'Access));
   Put_Line (LARGE_INTEGER'Image (Perf_Freq));

   Check (QueryPerformanceCounter (Prev_Counter'Access));

   loop
      Check (QueryPerformanceCounter (Curr_Counter'Access));

      Diff := Curr_Counter - Prev_Counter;

      --  A negative difference means the counter went backwards.
      if Diff < 0.0 then
         Put_Line (LARGE_INTEGER'Image (Prev_Counter)
                   & " >" & LARGE_INTEGER'Image (Curr_Counter));
      elsif Diff > Max_Diff then
         Max_Diff := Diff;
      end if;

      Prev_Counter := Curr_Counter;

      if Pass = 0 then
         --  Idle: wait for a command key.
         Get_Immediate (Char);

         case Char is
            when '0' =>
               Put_Line (LARGE_INTEGER'Image (Curr_Counter));
            when 'f' | 'F' =>
               Check (QueryPerformanceFrequency (Perf_Freq'Access));
               Put_Line (LARGE_INTEGER'Image (Perf_Freq));
            when '1' .. '9' =>
               Pass := Integer'Value ("" & Char) * 10000;
               Max_Diff := 0.0;
               Check (QueryPerformanceCounter (Prev_Counter'Access));
            when 'q' | 'Q' =>
               exit;
            when others =>
               Put (Char);
         end case;
      else
         Pass := Pass - 1;

         --  End of a run: report the last counter and the largest step seen.
         if Pass = 0 then
            Put_Line (LARGE_INTEGER'Image (Curr_Counter)
                      & LARGE_INTEGER'Image (Max_Diff));
         end if;
      end if;
   end loop;
end Time;
-----------------------------------

Jano

Jun 11, 2003, 2:42:19 PM

Dmitriy Anisimkov wrote:

> I had the same problem. It was because of wrong results from the QueryPerformanceCounter Win32 API call on some hardware configurations. GNAT for Win32 counts time using QueryPerformanceCounter, and some hardware configurations can make this counter wrong sometimes. See this MSDN article:

Finally some certain facts! You don't know how grateful I feel now. I
was desperate and questioning my most basic understanding of tasking.

Your test program has indeed shown that my PC is affected.

Now the question :D Is there a solution?

I don't understand clearly from the Microsoft article if a motherboard
driver update can help. I will try that as first attempt.

I suppose that my only other alternative is to alter my Gnat (yuck!). Any
hints for that?

Out of curiosity, is that corrected in Gnat 3.16?

Pascal Obry

Jun 11, 2003, 3:54:34 PM

Jano <no...@celes.unizar.es> writes:

> Now the question :D Is there a solution?

Change your PC :)

Pascal.

--

--|------------------------------------------------------
--| Pascal Obry Team-Ada Member
--| 45, rue Gabriel Peri - 78114 Magny Les Hameaux FRANCE
--|------------------------------------------------------
--| http://perso.wanadoo.fr/pascal.obry
--| "The best way to travel is by means of imagination"
--|
--| gpg --keyserver wwwkeys.pgp.net --recv-key C1082595

tmo...@acm.org

Jun 11, 2003, 4:16:43 PM

> > The problem is that I've detected that some (many) of these tasks are
> > being stopped for unusually long times after running for a while. It is
>
> I had the same problem. It was because of wrong results from the
> QueryPerformanceCounter Win32 API call on some hardware configurations.
> GNAT for Win32 counts time using QueryPerformanceCounter,
>...
> http://support.microsoft.com/default.aspx?scid=kb%3Ben-us%3B274323

See also
www.adapower.com/articles/gnatclockfix.html
That, IIRC, was a fix for Gnat 3.14p. They released 3.15p with a
different correction, which had a bug in it (see the comp.lang.ada thread
"Re: ANN: GNAT 3.15p binary for OS/2 available").

Jano

Jun 11, 2003, 5:11:47 PM

tmo...@acm.org wrote:

> See also www.adapower.com/articles/gnatclockfix.html

> That, IIRC, was a fix for Gnat 3.14p. They released 3.15p with a
> different correction, which had a bug in it (see comp.lang.ada
> "Re: ANN: GNAT 3.15p binary for OS/2 available")

Mmm. Judging by the dates, the patch seems to be for 3.14p. Does
anyone know if it would work with 3.15p?

In any case, I'm clueless about what I must do with the replacement
file. Do I need to download the sources and recompile? I see that the
sources are in my binary installation, but I don't think that replacing
the file there has achieved anything.

Thanks,

tmo...@acm.org

Jun 11, 2003, 6:36:25 PM

> > That, IIRC, was a fix for Gnat 3.14p. They released 3.15p with a
> > different correction, which had a bug in it (see comp.lang.ada
> > "Re: ANN: GNAT 3.15p binary for OS/2 available")
>
> Mmm. Judging by the dates, the patch seems to be for 3.14p. Does
> anyone know if it would work with 3.15p?
>
> In any case, I'm clueless about what I must do with the replacement
> file. Do I need to download the sources and recompile? I see that the
> sources are in my binary installation, but I don't think that replacing
> the file there has achieved anything.
I don't have the stuff in front of me, but IIRC, ACT's 3.15p
s-osprim.adb needed a one-line patch. Unfortunately, they use pragma
Inline a lot, so a bunch of stuff needs to be recompiled once you
modify that file. And you need special compile option(s) to recompile
system components. It's doable, but somewhat tedious. I'll try to
post my correction, but in the meantime perhaps some kind soul, in the
spirit of Free Software, will send you s-osprim.adb from 3.16, which
hopefully has only the single change.
You originally mentioned discussing this on gnatlist, where at least
one other user tested and found the same behavior. I presume that was
also using Gnat 3.15p?

Anisimkov

Jun 11, 2003, 10:21:31 PM

"Jano" <no...@celes.unizar.es> wrote in message

> Finally some certain facts! You don't know how grateful I feel now. I
> was desperate and questioning my most basic understanding of tasking.


Glad to help.

> Your test program effectively has shown that my PC is affected.
>
> Now the question :D Is there a solution?

I think there is no way to completely work around a hardware error with
any software solution, so the best way is to change the computer.

> I don't understand clearly from the Microsoft article if a motherboard
> driver update can help. I will try that as first attempt.


I don't think so.

> I suppose that my only other alternative is to alter my Gnat (ucks!) Any
> hints for that?
>
> Out of curiosity, is that corrected in Gnat 3.16?

It is not a GNAT bug, so GNAT should not be fixed. Maybe somebody could
provide a GNAT RTL for Win32 that is not based on QueryPerformanceCounter? It
would help, but the time precision would be lower.


Jano

Jun 12, 2003, 2:31:26 AM

Anisimkov wrote:

> I think there is no way to completely work around a hardware error with
> any software solution, so the best way is to change the computer.

Argh! It's only some months old!

> > I don't understand clearly from the Microsoft article if a motherboard
> > driver update can help. I will try that as first attempt.
>
> I don't think so.

They didn't :(

> It is not a GNAT bug, so GNAT should not be fixed. Maybe somebody could
> provide a GNAT RTL for Win32 that is not based on QueryPerformanceCounter? It
> would help, but the time precision would be lower.

Are we talking about two different problems? The MSKB article
talks about forward leaps, but your test program shows backward leaps.
Actually, rather than leaps, it seems that some value is mangled, because
the incorrect values are usually very small (< 5000).

Would the patch everybody is talking about correct both problems, or only
the forward leap? If only the latter, I suppose I'm on my own.

And that motherboard got good reviews... argh... and is not a strange
one... argh again. Oh well.

Jano

Jun 12, 2003, 2:38:29 AM

tmo...@acm.org wrote:

> And you need special compile option(s) to recompile
> system components. It's doable, but somewhat tedious.

Is that procedure described somewhere in the documentation? I've looked
(not very thoroughly) but haven't found it.

> I'll try to
> post my correction, but in the meantime perhaps some kind soul, in the
> spirit of Free Software, will send you s-osprim.adb from 3.16, which
> hopefully has only the single change.

0:-)

> You originally mentioned discussing this on gnatlist, where at least
> one other user tested and found the same behavior. I presume that was
> also using Gnat 3.15p?

I can't find their post right now and I am in a hurry... I'll dig in the
archives later. I don't remember however any mention about that so I
suppose it was 3.15p.

Vinzent Hoefler

Jun 12, 2003, 3:06:44 AM

Anisimkov wrote:

>I think there is no way to completely work around a hardware error with
>any software solution, so the best way is to change the computer.

Not sure if you meant that, but "working around hardware errors with
software" is part of my daily job. ;)


Vinzent.

tmo...@acm.org

Jun 12, 2003, 3:26:34 AM

>Are we talking about two different problems? The MSKB article
>talks about forward leaps, but your test program shows backward leaps.
>Actually, rather than leaps, it seems that some value is mangled, because
-- Some buggy chipsets have a counter that appears to run backwards
-- under heavy load (see Q274323), and Windows corrects this by
-- adding 2**24 (5-15 sec). Yuck! We need to try to detect this
So Gnat 3.14p had a clock that could jump forward several seconds.
The fix is to watch for a suspicious jump, check it against the
Windows system clock, and correct if needed. For that you need
the system clock and the performance counter values at some base
synchronization point. Gnat 3.15p failed to recalculate the base
when it did a resynchronization, so, IIRC, time would go into the
future, and a short "delay" would become a long one, waiting for
the future time.
It's late, and I'm tired, so I don't guarantee this is right,
but it appears my fix for Gnat 3.15p was to add the line
Base_Monotonic_Clock := Base_Clock;
at line 165 of s-osprim.adb, just before the line
end Get_Base_Time;
and then to compile with
gcc -c -gnatg -O2 s-osprim.adb
Because of heavy inlining, you'll find other parts of the Gnat
run-time that also need recompiling the same way.
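tmoran's explanation can be illustrated with a toy model. The sketch below is hypothetical: it reuses the names `Base_Clock` and `Base_Monotonic_Clock` from the thread, but it is not GNAT's `s-osprim.adb`, and the 1 MHz counter, the 600 s uptime and the 5 s jump are made-up numbers. It assumes that a pending `delay 1.0` is stored as an absolute deadline on the monotonic clock, and that a resynchronization rebases the counter on a trusted system time. If `Base_Monotonic_Clock` is not rebased as well, the monotonic clock restarts from its old base, so the deadline appears far in the future.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Resync_Model is
   --  HYPOTHETICAL, simplified model of a counter-based clock.  The names
   --  echo the thread, but this is NOT the code of GNAT's s-osprim.adb.
   subtype Ticks is Long_Long_Integer;
   Freq : constant Ticks := 1_000_000;        --  assumed 1 MHz counter

   Base_Counter         : Ticks    := 0;      --  counter value at last sync
   Base_Clock           : Duration := 0.0;    --  wall time at last sync
   Base_Monotonic_Clock : Duration := 0.0;    --  monotonic time at last sync

   function Monotonic_Clock (Counter : Ticks) return Duration is
   begin
      return Base_Monotonic_Clock
        + Duration (Counter - Base_Counter) / Integer (Freq);
   end Monotonic_Clock;

   --  Re-anchor on a trusted system time after a suspicious counter jump.
   --  Fixed = True applies the one-line correction quoted in the thread.
   procedure Resync
     (Counter : Ticks; System_Time : Duration; Fixed : Boolean) is
   begin
      Base_Counter := Counter;
      Base_Clock   := System_Time;
      if Fixed then
         Base_Monotonic_Clock := Base_Clock;
      end if;
   end Resync;

   --  Remaining wait for a "delay 1.0" issued half a second before a
   --  resync at 600 s of uptime, after the counter jumped 5 s ahead.
   function Remaining_Wait (Fixed : Boolean) return Duration is
      Real_Now : constant Ticks := 600 * Freq;  --  true elapsed ticks
      Jump     : constant Ticks := 5 * Freq;    --  illustrative leap
      Deadline : Duration;
   begin
      Base_Counter         := 0;                --  state after elaboration
      Base_Clock           := 0.0;
      Base_Monotonic_Clock := 0.0;
      Deadline := Monotonic_Clock (Real_Now - Freq / 2) + 1.0;
      Resync (Real_Now + Jump, Duration (Real_Now) / Integer (Freq), Fixed);
      return Deadline - Monotonic_Clock (Real_Now + Jump);
   end Remaining_Wait;

begin
   Put_Line ("Remaining wait without fix:"
             & Duration'Image (Remaining_Wait (False)));
   Put_Line ("Remaining wait with fix:   "
             & Duration'Image (Remaining_Wait (True)));
end Resync_Model;
```

In this model the remaining wait is 600.5 s without the extra assignment and 0.5 s with it: the short delay turns into a wait as long as the uptime since the last synchronization.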

Jano

Jun 12, 2003, 3:47:21 AM

> Are we talking about two different problems? The MSKB article
> talks about forward leaps, but your test program shows backward leaps.
> Actually, rather than leaps, it seems that some value is mangled, because
> the incorrect values are usually very small (< 5000).

SORRY SORRY SORRY, my fault: I was misinterpreting the results of
Anisimkov's test. I have no backward leaps. So the patch being discussed
may be my solution.

Jano

Jun 12, 2003, 3:47:54 AM

tmo...@acm.org wrote:

> It's late, and I'm tired, so I don't guarantee this is right,
> but it appears my fix for Gnat 3.15p was to add the line
> Base_Monotonic_Clock := Base_Clock;
> at line 165 of s-osprim.adb, just before the line
> end Get_Base_Time;
> and then to compile with
> gcc -c -gnatg -O2 s-osprim.adb
> Because of heavy inlining, you'll find other parts of the Gnat
> run-time that also need recompiling the same way.

Many thanks for your advice, I'm going to try it right now. Furthermore,
I was misinterpreting the results from the test (I didn't read the source
carefully enough the first time). So my problem is the one described in
the knowledge base and the one the patch addresses :) :) :)

Preben Randhol

Jun 12, 2003, 4:32:29 AM

Jano wrote:
> Anisimkov dice...
>
>> I think there is no way to completely work around a hardware error with
>> any software solution, so the best way is to change the computer.
>
> Argh! It's only some months old!

Go and complain at the store and ask to get the hardware replaced. If
that fails, complain to the manufacturer.
--
Preben Randhol http://www.pvv.org/~randhol/

Georg Bauhaus

Jun 12, 2003, 5:43:01 AM

tmo...@acm.org wrote:
: It's late, and I'm tired, so I don't guarantee this is right,

: but it appears my fix for Gnat 3.15p was to add the line
: Base_Monotonic_Clock := Base_Clock;
: at line 165 of s-osprim.adb, just before the line
: end Get_Base_Time;

This line appears as the last line before end System.OS_Primitives,
in package elaboration, in 5wosprim.adb, in GCC 3.3.

Dmitriy Anisimkov

Jun 12, 2003, 8:59:50 AM

Jano wrote:
>>the incorrect values are usually very small (< 5000)
>
> SORRY SORRY SORRY, my fault: I was misinterpreting the results of
> Anisimkov's test. I have no backward leaps. So the patch being discussed
> may be my solution.

A backward leap would look like

123412341234 > 123412341234

in the console of the Ada test after pressing a key such as '1' or '9'.

A forward leap would be detected by the C test program in the MSDN article I linked.
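A forward leap can only be seen by comparing the counter with a second, independent clock, which is roughly what tmoran describes the run-time fix doing. A hypothetical classifier covering both directions is sketched below; the 1 MHz frequency, the 0.1 s tolerance and the sample readings are assumptions, and on Windows the reference elapsed time would have to come from a coarser system clock.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Leap_Check is
   --  Hypothetical classifier: compare how far the high-resolution counter
   --  advanced with how far a coarse reference clock advanced over the
   --  same interval.  Frequency and tolerance are illustrative assumptions.
   subtype Ticks is Long_Long_Integer;
   Freq      : constant Ticks    := 1_000_000;  --  assumed 1 MHz counter
   Tolerance : constant Duration := 0.1;        --  assumed coarse-clock slack

   type Verdict is (OK, Backward_Leap, Forward_Leap);

   function Classify
     (Prev_Counter, Curr_Counter : Ticks;
      Ref_Elapsed                : Duration) return Verdict
   is
      Counter_Elapsed : Duration;
   begin
      if Curr_Counter < Prev_Counter then
         return Backward_Leap;   --  the case the Ada test above reports
      end if;
      Counter_Elapsed :=
        Duration (Curr_Counter - Prev_Counter) / Integer (Freq);
      if Counter_Elapsed > Ref_Elapsed + Tolerance then
         return Forward_Leap;    --  counter ran ahead of the reference clock
      end if;
      return OK;
   end Classify;

begin
   Put_Line (Verdict'Image (Classify (1_000_000, 1_500_000, 0.5)));
   Put_Line (Verdict'Image (Classify (1_500_000, 1_400_000, 0.0)));
   Put_Line (Verdict'Image (Classify (1_000_000, 7_000_000, 0.5)));
end Leap_Check;
```

The three calls should print OK, BACKWARD_LEAP and FORWARD_LEAP. Like the Ada test above, this only inspects pairs of readings; it does not correct anything.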

tmo...@acm.org

Jun 12, 2003, 1:09:33 PM

> : but it appears my fix for Gnat 3.15p was to add the line
> : Base_Monotonic_Clock := Base_Clock;
> : at line 165 of s-osprim.adb, just before the line
> : end Get_Base_Time;
>
> This line appears as the last line before end System.OS_Primitives,
> in package elaboration, in 5wosprim.adb, in GCC 3.3.
Yes. That's the initial synchronization. But it also needs to
appear when there's a re-synchronization, i.e., when Base_Clock changes.
Unless of course 5wosprim.adb has a different solution to the problem.

Wiljan Derks

Jun 12, 2003, 2:08:13 PM

Up to now, I was not aware of potential problems with the performance counter.
As a customer of ACT, I reported problems with the clock in the past.
It turned out that the clock was sometimes not monotonic, so that
the clock value sometimes even decreased over time.
That problem, which was related to the calculations done with the
performance counter, has been fixed in 316 and 500a.

However, 316 and 500a still have a problem in GNAT related to delays.
We found that applications would sometimes wait for a very long time.
This is caused by the waiting logic inside s-prtaop.adb.
If you check the loops in the routines that wait with a timeout, you will
find that in each iteration the routines call the monotonic clock twice,
which is wrong.
This bug is fixed in the wavefront of 501.

Wiljan
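Wiljan does not show the faulty loop, and the thread does not spell out exactly how the double clock read produced the long waits. As an illustration only, a timed wait can take a single clock snapshot per iteration and derive both the exit test and the remaining timeout from it, so the two can never disagree. The simulated clock and the names `Read_Clock`, `Sleep_At_Most` and `Wait_Until` are hypothetical; this is not the GNAT run-time code.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Timed_Wait_Sketch is
   --  Hypothetical simulated clock: each read returns the next sample so
   --  the run is deterministic.  This is not the GNAT run-time's code.
   type Sample_Array is array (Positive range <>) of Duration;
   Samples     : constant Sample_Array := (0.0, 0.4, 0.9, 1.3);
   Next_Sample : Positive := Samples'First;

   function Read_Clock return Duration is
      T : constant Duration := Samples (Next_Sample);
   begin
      if Next_Sample < Samples'Last then
         Next_Sample := Next_Sample + 1;
      end if;
      return T;
   end Read_Clock;

   --  Stand-in for an OS-level timed wait; it only reports the timeout.
   procedure Sleep_At_Most (Timeout : Duration) is
   begin
      Put_Line ("  wait up to" & Duration'Image (Timeout));
   end Sleep_At_Most;

   --  Wait until Deadline, reading the clock ONCE per iteration and using
   --  that single snapshot both for the exit test and for the timeout.
   --  Deadlines above the last sample (1.3) would loop forever in this toy.
   function Wait_Until (Deadline : Duration) return Natural is
      Iterations : Natural := 0;
      Now        : Duration;
   begin
      Next_Sample := Samples'First;   --  restart the simulated clock
      loop
         Now := Read_Clock;
         exit when Now >= Deadline;
         Sleep_At_Most (Deadline - Now);
         Iterations := Iterations + 1;
      end loop;
      return Iterations;
   end Wait_Until;

begin
   Put_Line ("Iterations:" & Natural'Image (Wait_Until (1.0)));
end Timed_Wait_Sketch;
```

With the sample readings 0.0, 0.4, 0.9 and 1.3, `Wait_Until (1.0)` sleeps three times, with timeouts of 1.0, 0.6 and 0.1 s, and returns 3.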


Georg Bauhaus

Jun 12, 2003, 6:19:06 PM

tmo...@acm.org wrote:
:> : but it appears my fix for Gnat 3.15p was to add the line

Same code as in 3.15p. And I think I have missed the . Sorry.

Dmitriy Anisimkov

Jun 13, 2003, 9:35:10 AM

Yes, the world is not ideal, and some computer subsystems (hardware or software) have errors.
I think we should fix errors where they are; that is better than working around them. We would have more stable systems this way.

Jano

Jun 13, 2003, 2:13:49 PM

tmo...@acm.org wrote:

> Yes. That's the initial synchronization. But it also needs to
> appear when there's a re-synchronization, i.e., when Base_Clock changes.
> Unless of course 5wosprim.adb has a different solution to the problem.

I've applied the change you suggested and recompiled the runtime. After
more than 8 hours of uptime, when before my program rarely lasted for
more than half an hour without hanging, it seems I'm done with that
pesky problem.

Many thanks!

tmo...@acm.org

Jun 14, 2003, 12:59:12 AM

> it seems I'm done with that pesky problem.
Good.
> Many thanks!
Glad to be of help.