Problem with DragAnObject on RISC OS 6.20

Matthew Phillips

May 26, 2012, 5:50:40 PM

Does anyone know of any change to the DragAnObject module in RISC OS 6.20
which might be causing some code to crash only on that version of the
operating system? Impact's database record card design uses DragAnObject to
allow users to drag fields around the card. It has worked fine for a number
of years on most versions of RISC OS, and I can verify that it works on 4.02,
4.39, 5.16 and 5.18. Users of RISC OS 6.20 are reporting that Impact crashes
immediately on the drag starting.

I've appended a typical back trace. A function within Impact called
Drag_Object calls the SWI DragAnObject_Start (via a generic library function
called Swi) with a C function as the callback for rendering. The SWI is
called via _kernel_swi. While within that library function an address
exception occurs, which is passed to my signal handler which spits out the
back trace to a file.

The code works fine on lots of other versions of RISC OS, but fails every
time on RISC OS 6.20. Does anyone know of any changes which might have
caused this?

In <dd968d15-13ad-4f4b...@c26g2000vbq.googlegroups.com> (3
June 2011) James Lampard described a fault in DragAnObject which could lead
to stack overflows and aborts on 32-bit systems, which was fixed in ROOL's
DragAnObject 0.09. Might there be a chance that RISC OS 6.20's version of
the module still has this fault and it is coming to the surface for some
reason? RISC OS 6.20 is able to be compiled as a 32-bit system, so maybe
26-bit builds are affected somehow as well? All speculation on my part.

The other thing I am considering is a fault in Impact. The documentation is
a bit sparse as to what environment your C program might expect to receive.
I have found a problem (on RISC OS 4.02, but not on 5.16, not tested on
others) where using setjmp and longjmp within the rendering callback code
caused a crash, so I had to trap errors in other ways. I think the crashes
occurred even if setjmp was called and longjmp was not needed. Maybe RISC OS
6.20 is less tolerant of something else I am doing wrong?

I could always fall back on Wimp_DragBox, but it would be nicer to use
DragAnObject.

Any ideas before I insert masses of debug code to attempt to debug at a
distance?

Thanks,

Matthew

Serious error at 18:14:04 25th May 2012, user quitting...
Impact, version 3.40 (10-Jan-2012)
WIMP version 6.61

Internal error: Abort on data transfer at &0386B0E4

Postmortem requested

Stack backtrace
Arg2: 0x00000001 1
Arg1: 0x0009b800 636928 -> [0x65746e49 0x6c616e72 0x72726520 0x203a726f]
385d0d0 in function _sys_error
Arg1: 0x00000005 5
38645f0 in function _real_default_signal_handler
Arg2: 0x00000000 0
Arg1: 0x00000005 5
38646f4 in function _raise_trap
Arg1: 0x00000005 5
386474c in function raise
Arg1: 0x00000005 5
897c8 in function SignalHandler
386b100 in unknown procedure
Arg1: 0x0009d20c 643596 -> [0x00050190 0x000122d4 0x0009d1fc 0x0009d258]
3859778 in shared library function
Arg2: 0x0009d20c 643596 -> [0x00050190 0x000122d4 0x0009d1fc 0x0009d258]
Arg1: 0x00049c40 302144 -> [0xe594100c 0xe5b2300c 0xe1510003 0x0a000004]
758ec in function Swi
Arg5: 0x00000000 0
Arg4: 0x00000190 400
Arg3: 0x0009d258 643672 -> [0x0000042a 0x00000424 0x00000624 0x0000045c]
Arg2: 0x000e7424 947236 -> [0x6e69614d 0x00000000 0x00000000 0x00000000]
Arg1: 0x000122d4 74452 -> [0xe1a0c00d 0xe92dd831 0xe24cb004 0xe15d000a]
65540 in function Drag_Object
Arg1: 0x000e7424 947236 -> [0x6e69614d 0x00000000 0x00000000 0x00000000]
12490 in function C_DragSelection
Arg3: 0x00000040 64
Arg2: 0x00000007 7
Arg1: 0x000e2a7c 928380 -> [0x5fc1ba39 0x00000000 0x000e6384 0x000da8b8]
.....

--
Matthew Phillips
Durham

Steve Fryatt

May 26, 2012, 6:15:08 PM

On 26 May, Matthew Phillips wrote in message
<d3bc1c965...@sinenomine.freeserve.co.uk>:

> Does anyone know of any change to the DragAnObject module in RISC OS 6.20
> which might be causing some code to crash only on that version of the
> operating system? Impact's database record card design uses DragAnObject
> to allow users to drag fields around the card. It has worked fine for a
> number of years on most versions of RISC OS, and I can verify that it
> works on 4.02, 4.39, 5.16 and 5.18. Users of RISC OS 6.20 are reporting
> that Impact crashes immediately on the drag starting.

My memory of the details is hazy (it's been several months since I've even
looked at the code), but that sounds similar to a problem that I've
seen in NetSurf's toolbar GUI with DragASprite. IIRC, the SWI crashed out
immediately, but it wasn't obvious what was wrong with the parameters.

It wasn't just RO6, though (probably 5.16 as well).

I'm afraid I've got no useful suggestions as I never solved it, but I might
go back and take another look if/when I get enough time now you've reminded
me.

--
Steve Fryatt - Leeds, England

http://www.stevefryatt.org.uk/

Gerph

May 26, 2012, 6:59:51 PM

On May 26, 10:50 pm, Matthew Phillips <spam20...@yahoo.co.uk> wrote:
> Does anyone know of any change to the DragAnObject module in RISC OS 6.20
> which might be causing some code to crash only on that version of the
> operating system? Impact's database record card design uses DragAnObject to
> allow users to drag fields around the card. It has worked fine for a number
> of years on most versions of RISC OS, and I can verify that it works on 4.02,
> 4.39, 5.16 and 5.18. Users of RISC OS 6.20 are reporting that Impact crashes
> immediately on the drag starting.
>
> I've appended a typical back trace. A function within Impact called
> Drag_Object calls the SWI DragAnObject_Start (via a generic library function
> called Swi) with a C function as the callback for rendering. The SWI is
> called via _kernel_swi. While within that library function an address
> exception occurs, which is passed to my signal handler which spits out the
> back trace to a file.

That sounds like you've got a problem right there. Unless you really
need it, stop using the signal handler and you'll get the Diagnostic
Dump out of the system when the failure occurs. At the very least that
will help you to work out where the problem is. The stack trace you
generated was not useful because it only traces the user mode stack,
not the actual failure cases, nor does it include the registers which
would help you diagnose these problems. The standard backtrace handler
will produce a diagnostic dump to help you (and others) diagnose the
problem, including the register values at the point of failure.

Of course, you can always use "*BTSDump -a -w dumpfile" to produce
such a dump manually.

> The code works fine on lots of other versions of RISC OS, but fails every
> time on RISC OS 6.20. Does anyone know of any changes which might have
> caused this?
>
> In <dd968d15-13ad-4f4b-a736-74152f76d...@c26g2000vbq.googlegroups.com> (3
> June 2011) James Lampard described a fault in DragAnObject which could lead

They start by basing their description on the erroneous documentation
from PRM 5a that states that the bit 17 controls whether the render
function operation is called in USR mode or not (which is what they
state). Bit 17 controls whether the C environment is set up for the
module callback and has nothing to do with the code being called in
USR mode. There probably should not be a USR mode entry sequence for
this module in any case - I believe that this was a documentation
error introduced between the author and the production of the updated
PRMs.

> to stack overflows and aborts on 32-bit systems, which was fixed in ROOL's
> DragAnObject 0.09.

The 'fix' they describe is unlikely to affect the code that was
changed and in any case will cause breakages on earlier modules unless
care is taken to ensure the version of the module, as no validation of
the flags is made.

> Might there be a chance that RISC OS 6.20's version of
> the module still has this fault and it is coming to the surface for some
> reason? RISC OS 6.20 is able to be compiled as a 32-bit system, so maybe
> 26-bit builds are affected somehow as well? All speculation on my part.
>
> The other thing I am considering is a fault in Impact. The documentation is
> a bit sparse as to what environment your C program might expect to receive.

If bit 17 is set:

R0-R3 are updated from the parameter block from R2.
R10 contains the SVC stack limit
R13 is the SVC stack
You are in SVC mode.
IRQ state has been preserved.

If bit 17 is clear:

R0-R3 are set up from the parameter block from R2.
R10 contains 0 (stack limit checking may be a problem for C
applications).
R13 is the SVC stack
You are in SVC mode.
IRQ state has been preserved.

As you can see the sequence is very similar.

Calling down to USR mode from SVC mode is unwise and will most likely
cause issues that you were not aware of.

> I have found a problem (on RISC OS 4.02, but not on 5.16, not tested on
> others) where using setjmp and longjmp within the rendering callback code
> caused a crash, so I had to trap errors in other ways.

Yeah, don't do that.
You do not have a USR mode C environment in the render callback and
should not do any calls like that. Because you don't have a C
environment set up, you shouldn't rely on any of the operations within
the callback function working in C environment. You may have luck
using the SVC C-module form of entry sequence, but I wouldn't
guarantee it.

And NEVER longjmp out of the render function.

> I think the crashes
> occurred even if setjmp was called and longjmp was not needed. Maybe RISC OS
> 6.20 is less tolerant of something else I am doing wrong?
>
> I could always fall back on Wimp_DragBox, but it would be nicer to use
> DragAnObject.
>
> Any ideas before I insert masses of debug code to attempt to debug at a
> distance?

Use the BTS system that was provided to handle debugging this sort of
case. It should give you all the information you need.

--
Gerph

Matthew Phillips

May 28, 2012, 3:42:11 AM

In message <6dc02b81-c9ed-4034...@n16g2000vbn.googlegroups.com>

on 26 May 2012 Gerph wrote:

> On May 26, 10:50 pm, Matthew Phillips <spam20...@yahoo.co.uk> wrote:
> > I've appended a typical back trace. A function within Impact called
> > Drag_Object calls the SWI DragAnObject_Start (via a generic library
> > function called Swi) with a C function as the callback for rendering.
> > The SWI is called via _kernel_swi. While within that library function
> > an address exception occurs, which is passed to my signal handler which
> > spits out the back trace to a file.
>
> That sounds like you've got a problem right there. Unless you really
> need it, stop using the signal handler and you'll get the Diagnostic
> Dump out of the system when the failure occurs.

One of the users sent me the diagnostic dump file, but I expect the registers
and disassembly will relate to the exception being raised from within the
signal handler, whose main purpose is to capture the backtrace to a file.
I'll compile a version without this and send it to one of the users to obtain
a better diagnostic.

> > In <dd968d15-13ad-4f4b-a736-74152f76d...@c26g2000vbq.googlegroups.com> (3
> > June 2011) James Lampard described a fault in DragAnObject which could
> > lead
>
> They start by basing their description on the erroneous documentation
> from PRM 5a that states that the bit 17 controls whether the render
> function operation is called in USR mode or not (which is what they
> state). Bit 17 controls whether the C environment is set up for the
> module callback and has nothing to do with the code being called in
> USR mode. There probably should not be a USR mode entry sequence for
> this module in any case - I believe that this was a documentation
> error introduced between the author and the production of the updated
> PRMs.
>
> > to stack overflows and aborts on 32-bit systems, which was fixed in
> > ROOL's DragAnObject 0.09.
>
> The 'fix' they describe is unlikely to affect the code that was
> changed and in any case will cause breakages on earlier modules unless
> care is taken to ensure the version of the module, as no validation of
> the flags is made.

As I received the code, Impact called DragAnObject_Start with bit 17 set (so
the callback would be called in SVC mode according to the PRM, but in reality
set up for module callback, which doesn't seem right to me, given we are
calling from an application).

In April 2011 I changed the code so that bit 17 was clear. This was because
I had discovered Impact crashed on RISC OS 4.02 when dragging an object. The
rendering procedure called setjmp, but not longjmp, but this was enough to
provoke the crash. The bit 17 value was probably a red herring but it looked
wrong to me so I tidied it up at the same time.

In May 2011 a user reported the problem of the crash on RISC OS 6.20, but
owing to a misunderstanding I thought some other corrections fixed it. There
was also, independently, a problem running Impact on the A9, where the Shared
C Library setjmp/longjmp does not work. This gave rise to the thread I
quoted from, where James Lampard replied regarding the changes on the RISC OS
5 fork. As a result of that, and a posting from Martin Wuerthner, I changed
my DragAnObject_Start call to have bit 17 clear and bit 18 set, and that's
how things are at the moment.

> If bit 17 is set:
>
> R0-R3 are updated from the parameter block from R2.
> R10 contains the SVC stack limit
> R13 is the SVC stack
> You are in SVC mode.
> IRQ state has been preserved.
>
> If bit 17 is clear:
>
> R0-R3 are set up from the parameter block from R2.
> R10 contains 0 (stack limit checking may be a problem for C
> applications).
> R13 is the SVC stack
> You are in SVC mode.
> IRQ state has been preserved.
>
> As you can see the sequence is very similar.

If you have bit 17 set, we have the SVC stack limit in R10, but does that
make stack limit checking better for C applications, as opposed to C modules?
If so, I cannot see why you would want to call with bit 17 clear.

> > I have found a problem (on RISC OS 4.02, but not on 5.16, not tested on
> > others) where using setjmp and longjmp within the rendering callback code
> > caused a crash, so I had to trap errors in other ways.
>
> Yeah, don't do that.
> You do not have a USR mode C environment in the render callback and
> should not do any calls like that. Because you don't have a C
> environment set up, you shouldn't rely on any of the operations within
> the callback function working in C environment. You may have luck
> using the SVC C-module form of entry sequence, but I wouldn't
> guarantee it.
>
> And NEVER longjmp out of the render function.

There are various places where the rendering function might attempt to use
longjmp to get out, but that will only be if there are errors in SWIs and as
far as I know there will not be in these cases. I removed the only
use of setjmp a year ago. Obviously I ought to correct these calls and
handle SWI errors differently if the function is used by a render callback,
and there may be some obscure reason why one of the SWIs is giving rise to an
error on RISC OS 6.20 but not on any other version, I suppose.

The rendering function does call quite a few other functions to get its work
done: they are nested up to six deep, and the automatic variables involved
require a maximum of 47 words in the worst case. The SWIs which are called
include Hourglass_On and Hourglass_Off, Wimp_PlotIcon, Font_FindFont,
Font_Paint, OS_Plot, and a few things to set colours. As for C Library
functions there are memset and strlen which should be innocuous, and bsearch
which I don't suppose requires much stack.

From what you write above, about not relying on the C environment, it sounds
like I could be in trouble with the amount of stack needed. If that's the
case I may have to rewrite to output to a sprite and use DragASprite rather
than DragAnObject, which would be a nuisance.

Or maybe I just ought to switch back to having bit 17 set.

I'll try to get a better backtrace from BTSDump and see if that tells me
anything.

Thanks,

--
Matthew Phillips
Durham

Gerph

May 28, 2012, 3:00:23 PM

On May 28, 8:42 am, Matthew Phillips <spam20...@yahoo.co.uk> wrote:
> In message <6dc02b81-c9ed-4034-90d7-671f3e804...@n16g2000vbn.googlegroups.com>
> on 26 May 2012 Gerph wrote:
>
> > On May 26, 10:50 pm, Matthew Phillips <spam20...@yahoo.co.uk> wrote:
> > > I've appended a typical back trace. A function within Impact called
> > > Drag_Object calls the SWI DragAnObject_Start (via a generic library
> > > function called Swi) with a C function as the callback for rendering.
> > > The SWI is called via _kernel_swi. While within that library function
> > > an address exception occurs, which is passed to my signal handler which
> > > spits out the back trace to a file.
>
> > That sounds like you've got a problem right there. Unless you really
> > need it, stop using the signal handler and you'll get the Diagnostic
> > Dump out of the system when the failure occurs.
>
> One of the users sent me the diagnostic dump file, but I expect the registers
> and disassembly will relate to the exception being raised from within the
> signal handler, whose main purpose is to capture the backtrace to a file.
> I'll compile a version without this and send it to one of the users to obtain
> a better diagnostic.

You could, of course, look at the dump that you've been given. It
might actually hold the details you're interested in.

> > > In <dd968d15-13ad-4f4b-a736-74152f76d...@c26g2000vbq.googlegroups.com> (3
> > > June 2011) James Lampard described a fault in DragAnObject which could
> > > lead
>

[snip]

> In May 2011 a user reported the problem of the crash on RISC OS 6.20, but
> owing to a misunderstanding I thought some other corrections fixed it. There
> was also, independently, a problem running Impact on the A9, where the Shared
> C Library setjmp/longjmp does not work. This gave rise to the thread I

A few people reported this, but nobody supplied a test case that I
could use to reproduce the problem. Aside from the screwy way that CTL
mangled the jmpbuf, I couldn't see anything else that would cause
problems... not to say that there aren't problems, but just that I
could never reproduce them. That said, the version of the OS on the A9
compared to the rest of the system was significantly older.

[snip]

> > If bit 17 is clear:
>
> > R0-R3 are set up from the parameter block from R2.
> > R10 contains 0 (stack limit checking may be a problem for C
> > applications).
> > R13 is the SVC stack
> > You are in SVC mode.
> > IRQ state has been preserved.
>
> > As you can see the sequence is very similar.
>
> If you have bit 17 set, we have the SVC stack limit in R10, but does that
> make stack limit checking better for C applications, as opposed to C modules?
> If so, I cannot see why you would want to call with bit 17 clear.

IIRC the stack extension check is done by checking SP-<size> < SL.
Which cannot be the case if SL is 0. However when the static base is
being loaded from SL-536 (or whatever the number is) that's likely to
break if SL was 0.
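
The arithmetic behind those two statements can be sketched in portable C. This is only a model of the check as described above, not the Shared C Library's actual code: the function names are invented for illustration, and the offset 536 is the figure recalled from memory in this post, so treat it as a placeholder.

```c
#include <stdint.h>

/* Offset recalled from memory in the thread ("536 or whatever the
   number is"); a placeholder, not a verified constant. */
#define STATIC_BASE_OFFSET UINT32_C(536)

/* Model of the check as described: the stack needs extending when
   SP - size < SL. Registers are modelled as unsigned 32-bit values. */
int needs_stack_extension(uint32_t sp, uint32_t frame_size, uint32_t sl)
{
    return (uint32_t)(sp - frame_size) < sl;
}

/* Address from which the static base would be loaded for a given SL.
   The subtraction wraps modulo 2^32 when SL is smaller than the offset. */
uint32_t static_base_address(uint32_t sl)
{
    return (uint32_t)(sl - STATIC_BASE_OFFSET);
}
```

With SL equal to 0 the unsigned comparison can never be true, so the extension check passes silently, while the static base address wraps round to 0xFFFFFDE8, nowhere near valid application memory.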

> > > I have found a problem (on RISC OS 4.02, but not on 5.16, not tested on
> > > others) where using setjmp and longjmp within the rendering callback code
> > > caused a crash, so I had to trap errors in other ways.
>
> > Yeah, don't do that.
> > You do not have a USR mode C environment in the render callback and
> > should not do any calls like that. Because you don't have a C
> > environment set up, you shouldn't rely on any of the operations within
> > the callback function working in C environment. You may have luck
> > using the SVC C-module form of entry sequence, but I wouldn't
> > guarantee it.
>
> > And NEVER longjmp out of the render function.
>
> There are various places where the rendering function might attempt to use
> longjmp to get out, but that will only be if there are errors in SWIs and as
> far as I know there will not be in these cases. I removed the only
> use of setjmp a year ago. Obviously I ought to correct these calls and
> handle SWI errors differently if the function is used by a render callback,
> and there may be some obscure reason why one of the SWIs is giving rise to an
> error on RISC OS 6.20 but not on any other version, I suppose.

It's not an option when I use the phrase 'NEVER' - using longjmp from
within SVC mode to a block that has USR mode r13 will corrupt SVC_r13
and result in... well you can probably imagine what will happen, I'm
sure. I doubt that this is the cause of your problems, but the point
is that doing such an operation from there will be fatal, so you
should remove all possible instances of it.
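
One conventional alternative to jumping out is to have every rendering step return an error pointer and let the caller stop at the first failure. The sketch below shows that pattern in plain C. It is not taken from Impact: the type and function names are hypothetical, and the error record is only loosely modelled on a RISC OS error block.

```c
#include <stddef.h>

/* Hypothetical error record, loosely modelled on a RISC OS error block. */
typedef struct {
    int  errnum;
    char errmess[64];
} render_error;

/* One rendering step (for example a wrapper round a single SWI).
   Returns NULL on success or a pointer to an error record. */
typedef const render_error *(*render_step)(void *context);

/* Run the steps in order and hand the first error back to the caller
   with an ordinary return, so the stack unwinds normally. */
const render_error *render_all(const render_step *steps, size_t count,
                               void *context)
{
    size_t i;
    for (i = 0; i < count; i++) {
        const render_error *err = steps[i](context);
        if (err != NULL)
            return err;
    }
    return NULL;
}

/* Two demonstration steps; the context is an int counting steps run. */
const render_error *step_ok(void *context)
{
    ++*(int *)context;
    return NULL;
}

const render_error *step_fail(void *context)
{
    static const render_error demo = { 1, "demonstration failure" };
    ++*(int *)context;
    return &demo;
}
```

The render callback can then decide for itself what to do with a non-NULL result, such as drawing nothing further, without ever leaving SVC mode by an abnormal route.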

> The rendering function does call quite a few other functions to get its work
> done: they are nested up to six deep, and the automatic variables involved
> require a maximum of 47 words in the worst case. The SWIs which are called
> include Hourglass_On and Hourglass_Off, Wimp_PlotIcon, Font_FindFont,
> Font_Paint, OS_Plot, and a few things to set colours. As for C Library
> functions there are memset and strlen which should be innocuous, and bsearch
> which I don't suppose requires much stack.

bsearch requires a callback function, and is handled differently to
other calls - the size of the stack is not at issue, it is a matter of
how variables are referenced and the operations performed. bsearch
requiring the callback function means that the SCL will need to read
from the static base to determine what calling convention you are
using. The static base for the SCL is found from the stack limit - 536
(I think, OTTOMH), which won't have been set up by any code, so will
be invalid. You'll be reading random data. Depending on that data, the
function will either use APCS_A bindings, APCS_R bindings or the
APCS_32 sequence. The APCS_A bindings would result in a fatal call to
your callback function because R13 is no longer SP. The APCS_R
bindings would work fine. The APCS_32 bindings would result in R14 not
including the PSR, which would mean that the return from your callback
function restoring flags (assuming your code is built 26bit, not
32bit) would return to USR mode - which would be bad.

Come to think of it, the longjmp would have the same issue if called
within the render function context.

(at least that's if I'm remembering the environment correctly -
someone who's looked at the SCL entry sequences in the last 6 years or
so might know more than I)
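
If the library callback is the worry, one way to avoid it (not something proposed in this thread) is to replace bsearch with a hand-written binary search, so that the comparison is done inline and nothing is called back through the SCL. A minimal sketch, assuming a table sorted in ascending order of an integer key; the structure and function names are made up for the example:

```c
#include <stddef.h>

/* Hypothetical table entry; the table must be sorted by ascending key. */
typedef struct {
    int         key;
    const char *label;
} lookup_entry;

/* Binary search with the comparison written inline, so no comparison
   function pointer is ever passed to the C library. */
const lookup_entry *lookup_by_key(const lookup_entry *table, size_t count,
                                  int key)
{
    size_t lo = 0, hi = count;        /* search the half-open range [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (table[mid].key == key)
            return &table[mid];
        if (table[mid].key < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;                      /* not found */
}
```

Whether this is enough to make the render function safe depends on everything else it calls; it only removes the one library call that takes a callback.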

> From what you write above, about not relying on the C environment, it sounds
> like I could be in trouble with the amount of stack needed. If that's the
> case I may have to rewrite to output to a sprite and use DragASprite rather
> than DragAnObject, which would be a nuisance.

It's not just the stack size. The C environment is more than the stack
size - if the static base data for the C library is not correct you
can get yourself in a whole world of pain.

> Or maybe I just ought to switch back to having bit 17 set.
>
> I'll try to get a better backtrace from BTSDump and see if that tells me
> anything.

Good luck. I've just been writing a bunch in my rambles about BTSDump
and how it helps. It amuses me how primitive it is. But it does get
the job done far better than anything else contemporary. Maybe things
have changed in the last 6 years or so.

--
Gerph

Matthew Phillips

May 28, 2012, 6:39:14 PM

In message <b498a2ed-3973-49cd...@q2g2000vbv.googlegroups.com>

on 28 May 2012 Gerph wrote:

> On May 28, 8:42 am, Matthew Phillips <spam20...@yahoo.co.uk> wrote:
> > In May 2011 a user reported the problem of the crash on RISC OS 6.20, but
> > owing to a misunderstanding I thought some other corrections fixed it.
> > There was also, independently, a problem running Impact on the A9, where
> > the Shared C Library setjmp/longjmp does not work.
>
> A few people reported this, but nobody supplied a test case that I could
> use to reproduce the problem. Aside from the screwy way that CTL mangled
> the jmpbuf, I couldn't see anything else that would cause problems... not
> to say that there aren't problems, but just that I could never reproduce
> them. That said, the version of the OS on the A9 compared to the rest of
> the system was significantly older.

Here's what I said on this group on 9 July 2011:

I have been trying to track down a difficult bug in my software which only
affects users of the A9Home computer, specifically an A9 running "RISC OS
Select 4.42 kernel 8.68". The Shared C Library version is "5.59 (04 Mar
2006) 32bit only". The same user has a RISC PC with Select 6.10 kernel 10.49
and Shared C Library "5.63 (11 Mar 2007) 26bit only" on which the problem
does not occur. It does not occur on the Iyonix or Beagleboard either.

After investigating, I can only conclude that the A9's Shared C Library is
not fully restoring the context saved at setjmp when a subsequent longjmp is
executed. Let me set out my reasoning and if anyone finds any holes in it,
please let me know.

The problem occurs in a function which I can boil down to the following:

static void ExecuteFile( action a, dbase d, card c, char *path, int context )
{
    execstr e;
    int f;

    e.path = M_StoreString( path );
    e.a = a;
    e.line = 0;

    f = FC_ReadFile( path );

    if ( !setjmp(E_GetJumpBuffer()) )
    {
        ExecuteBlock( &e, d, c, f, FALSE );
    }

    M_FreeMemory( e.path );
    E_UnSetErrorPath();
    if ( f ) FC_CloseFile( f );
}

The integer f is set to a RISC OS file handle returned by the function
FC_ReadFile. The E_GetJumpBuffer() returns a jmp_buf for use by setjmp, and
E_UnSetErrorPath unallocates it again.

The function works fine if the ExecuteBlock function is exited normally, but
if it is exited using longjmp then upon closing the file with FC_CloseFile it
turns out that f has somehow got corrupted and is equal to 618708. The SWI
then gives the error that the filehandle is illegal or already closed.

As you can see, f is an automatic and its address is not passed to the inner
function.

I can make the error go away completely by changing the code so that we have
a global called DebugF declared as an int*:

static int *DebugF;

static void ExecuteFile( action a, dbase d, card c, char *path, int context )
{
    execstr e;
    int f;

    DebugF = &f;
    ... etc. as before.

When I examine the assembly language generated by the compiler in each case,
I find that in the version which produces the error f is stored throughout
the function in the register v5: it is never committed to memory. In the
second version, because we have taken the address of f and stored it
somewhere, f is forced to be stored in memory.

So somehow, when longjmp is used, the value which has been placed in v5 is
not the value which was in f when setjmp was called.

There are three explanations for this that I can think of:

1) the implementation of longjmp on the A9 is flawed, and fails to restore v5
properly.

2) the implementation of setjmp on the A9 is flawed, and stores the wrong
value for v5.

3) some other part of my programme is stomping on the jmp_buf.

Normally of course any sane programmer would go for (3) as the most likely
explanation. However, I have dumped the contents of the jmp_buf to file
immediately after the setjmp and on return from the longjmp and they are
identical. So that rules out (3). The structure of the jmp_buf is not
documented in public, but one of the words contains a value equal to the file
handle, f, so that suggests that longjmp is failing to restore v5, but only
on the A9.

Has anyone else had this sort of behaviour? Are there any alternative
explanations before I take this up with ROL?
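
A reduced test case for this behaviour might look something like the sketch below. It is not code that anyone in the thread ran, and all the names are invented. It sets a local before setjmp, never modifies it afterwards, leaves via longjmp from a called function, and then checks the local. On a conforming C library the check should succeed, because only automatic variables changed between the setjmp and the longjmp are allowed to become indeterminate.

```c
#include <setjmp.h>

static jmp_buf test_buf;

/* Read through a volatile so the compiler cannot fold the comparison
   below into a constant. The value itself is arbitrary. */
static volatile int expected_handle = 42;

/* Leave via longjmp, as ExecuteBlock does on an error. A real test
   would do enough work here to reuse callee-saved registers first. */
static void jump_back(void)
{
    longjmp(test_buf, 1);
}

/* Returns 1 if a local set before setjmp, and not modified since,
   still holds its value after the longjmp; 0 otherwise. */
int local_survives_longjmp(void)
{
    int f = expected_handle;    /* a candidate for a register such as v5 */

    if (setjmp(test_buf) == 0)
        jump_back();

    return f == expected_handle;
}
```

A pass on one machine proves little, since whether f ends up in a register depends on the compiler and its options. The interesting result would be a failure on the A9 with the same binary that passes elsewhere.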

----------------

Martin Wuerthner confirmed that he had reported the problem to Advantage 6
and ROL in 2006 and that he regarded TechWriter and EasiWriter as beta
software on the A9 as a consequence. As I did not have a very small test
case which provoked the problem, I did not take it any further, especially as
Martin confirmed that the problem seemed to have been resolved in later
versions of the ROL fork of the OS which had not yet been released for the
A9.

I concluded it was more a matter of lack of ongoing OS releases for the A9,
and as an A9 user was able to confirm that a softload of the Shared C
Library, as supplied with the C compiler, cured the problem, I have since
distributed that to the few people who use both Impact and the A9.

--
Matthew Phillips
Durham

Matthew Phillips

May 28, 2012, 6:54:41 PM

In message <b498a2ed-3973-49cd...@q2g2000vbv.googlegroups.com>

on 28 May 2012 Gerph wrote:

> On May 28, 8:42 am, Matthew Phillips <spam20...@yahoo.co.uk> wrote:
> > In message <6dc02b81-c9ed-4034-90d7-671f3e804...@n16g2000vbn.googlegroups.com>
> > on 26 May 2012 Gerph wrote:
> > > If bit 17 is clear:
> >
> > > R0-R3 are set up from the parameter block from R2.
> > > R10 contains 0 (stack limit checking may be a problem for C
> > > applications).
> > > R13 is the SVC stack
> > > You are in SVC mode.
> > > IRQ state has been preserved.
> >
> > > As you can see the sequence is very similar.
> >
> > If you have bit 17 set, we have the SVC stack limit in R10, but does that
> > make stack limit checking better for C applications, as opposed to C
> > modules? If so, I cannot see why you would want to call with bit 17
> > clear.
>
> IIRC the stack extension check is done by checking SP-<size> < SL.
> Which cannot be the case if SL is 0. However when the static base is
> being loaded from SL-536 (or whatever the number is) that's likely to
> break if SL was 0.

Looking at what it says in PRM 5, it says bit 17 set means "SVC mode" (which
you have explained is a bit of a mistake) and also notes "use for modules,
also allows access to statics".

I wrote on 3 June 2011:

> > However, when reading PRM 5 later I noticed that it says of bit 17 "1:
> > SVC mode (use for modules; also allows access to statics)".
> >
> > My use is for a plain application, not a module.
> >
> > I take it this means that if my function (or functions it in turn calls?)
> > needs to access any static variables I will need to set this bit.

Martin Wuerthner replied:

> > No, this bit and the reference to statics is only relevant for module
> > code. Your USR mode application will be able to access its statics in any
> > case. So, if your program is an ordinary application it should clear bit
> > 17 and set bit 18, so it really gets called back in USR mode.

So that's what I did.

Clearly this only helps for versions of the module with the RO5 fix, but it
seems from what you say later, that Martin is wrong about earlier versions of
the module. Unless I have misunderstood either or both of you. My brain is
definitely hurting now!

> It's not an option when I use the phrase 'NEVER' - using longjmp from
> within SVC mode to a block that has USR mode r13 will corrupt SVC_r13 and
> result in... well you can probably imagine what will happen, I'm sure. I
> doubt that this is the cause of your problems, but the point is that doing
> such an operation from there will be fatal, so you should remove all
> possible instances of it.

I did understand what "NEVER" in capitals meant! Thanks. I will tackle this
aspect of the code also.

> bsearch requires a callback function, and is handled differently to other
> calls - the size of the stack is not at issue, it is a matter of how
> variables are referenced and the operations performed. bsearch requiring
> the callback function means that the SCL will need to read from the static
> base to determine what calling convention you are using. The static base
> for the SCL is found from the stack limit - 536 (I think, OTTOMH), which
> won't have been set up by any code, so will be invalid.

[snip]

> It's not just the stack size. The C environment is more than the stack size
> - if the static base data for the C library is not correct you can get
> yourself in a whole world of pain.

And if I have understood you properly, the static base can only be found from
the stack limit and that is only set up if bit 17 is set (whether
DragAnObject_Start was called from module code, or from application code).

Is that correct?

If so I definitely ought to have bit 17 set, as there are a few static
variables being accessed from the rendering procs. It's surprising anything
is working as well as it is on RO 4.02, 4.39, 5.16, 5.18.

Matthew Phillips

Jun 7, 2012, 7:35:00 PM

In message <63452a975...@sinenomine.freeserve.co.uk>

I have now got a BTS dump from the user, and it says the following. I'll put
my comments before the dump, in case you don't want to scroll down.

To me it looks as though the problem comes when the Shared C Library is
trying to read from [R10,#-540] which makes sense from what you said, given
this won't have been set up. I don't know what part of the C Library is
trying to do this. It has to be pretty fundamental because my callback
function never gets called by DragAnObject_Start. I can tell this because I
modified the application to write some debug to a file at various stages, and
the line at the start of the callback function is not written to file but the
line immediately before the call to DragAnObject_Start is written. (Or,
perhaps, the debug function could be responsible for triggering the address
exception in the same way that the code did before I added the debug.)

Changing the code back so that bit 17 is set and bit 18 is clear seems to
solve the problem on RISC OS 6.20. I had better test on a few other versions
of RISC OS before I can be happy it is fixed.
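
For anyone following along, the two flag combinations discussed here can be written out as below. This is only a sketch: the macro and function names are made up for illustration, and the bit meanings are taken from the discussion in this thread (bit 16 selects a function callback, bit 17 is the PRM's "SVC mode" bit, bit 18 is the USR mode callback that only exists in modules with the RO5 fix) rather than checked against the module source.

```c
/* Hypothetical names; bit meanings as described in this thread. */
#define DAO_FLAG_FUNCTION  (1ul << 16)  /* R1 is a function, not a SWI number */
#define DAO_FLAG_SVC_MODE  (1ul << 17)  /* PRM 5: "SVC mode (use for modules...)" */
#define DAO_FLAG_USR_MODE  (1ul << 18)  /* USR mode callback (RO5 fix only) */

/* Bit 17 set, bit 18 clear: the combination reverted to above. */
unsigned long dao_flags_svc_callback(unsigned long flags)
{
    flags &= ~DAO_FLAG_USR_MODE;
    return flags | DAO_FLAG_FUNCTION | DAO_FLAG_SVC_MODE;
}

/* Bit 17 clear, bit 18 set: the combination suggested for applications. */
unsigned long dao_flags_usr_callback(unsigned long flags)
{
    flags &= ~DAO_FLAG_SVC_MODE;
    return flags | DAO_FLAG_FUNCTION | DAO_FLAG_USR_MODE;
}
```

Both helpers leave the low bits of the flags word alone, so whatever alignment or bounding options are already being passed through are preserved.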

I'm not totally sure about my interpretation of the dump, though, as the
value of R10 given in the register listing looks like it could have 540
deducted from it and still be in application memory, so on the face of it I
don't see a complete explanation.

It still sounds to me as though there are no disadvantages for the C
environment in having bit 17 set, so I'll stick with that if it works. I
hope this is useful to other programmers in the future puzzling over the PRM
documentation! I've certainly learned quite a lot.

The other interesting thing is that some of the other calls my application
makes to DragAnObject_Start are working fine. They have different callback
functions but go through the same Drag_Object function as the problematic
call, and were also using bit 17 clear, bit 18 set. It must have been down
to stack usage and whether statics were accessed or the calling convention
needed reading. And interesting that there were no problems on 4.02 or 4.39,
as well as the RO5 branch, of course.

Dump coming up:

Register dump:
a1/r0 =0x00000000 0
a2/r1 =0x0386e674 59172468 -> [0x00000000] [0x00000000] [0x00000000] [0x00000000]
a3/r2 =0x01c06a68 29387368 -> [0x01c06ab0] [0x01c06af4] [0x00000000] [0x7fffffff]
a4/r3 =0x03862164 59122020 -> [0xe3a00000] [0xe1b0f00e] [0x7270665f] [0x66746e69]
v1/r4 =0x01c06a6c 29387372 -> [0x01c06af4] [0x00000000] [0x7fffffff] [0x0000000a]
v2/r5 =0x01c06a68 29387368 -> [0x01c06ab0] [0x01c06af4] [0x00000000] [0x7fffffff]
v3/r6 =0x00089c9a 564378
v4/r7 =0x00000000 0
v5/r8 =0x00000073 115
v6/r9 =0x00000000 0
sl/r10=0x000a7cac "2 F1321 S81199 N"...
fp/r11=0x000a873c 689980 -> [0x2386d7d8] [0x00049c40] [0x000a3e58] [0x000126a0]
ip/r12=0x00069c40 433216 -> [0xaa000022] [0xe1a01008] [0xe1a00004] [0xebffcfd2]
sp/r13=0x000a8714
lr/r14=0x0007b0f0 504048 -> [0xe3500000] [0x159f100c] [0x18810011] [0xe2700001]
pc/r15=0x238617fb

PC disassembly from &038617a8-&038617f8:
&038617d0 : .... : e3c00003 : BIC R0,R0,#3
&038617d4 : .... : e2800004 : ADD R0,R0,#4
&038617d8 : .... : e5850000 : STR R0,[R5,#0]
&038617dc : .... : e5100004 : LDR R0,[R0,#-4]
&038617e0 : ..P. : e3500000 : CMP R0,#0
&038617e4 : ..`. : b2600000 : RSBLT R0,R0,#0
&038617e8 : .p'. : b2277001 : EORLT R7,R7,#1
&038617ec : .... : e4d68001 : LDRB R8,[R6],#1
&038617f0 : .... : ea00000c : B &03861828
&038617f4 : p... : e59f1070 : LDR R1,&0386186C
&038617f8 : .... : e51ac21c : LDR R12,[R10,#-540]

&01c068c8 (SVC) : Exception called by &03861800, CPSR SVC-26 ARM fi vCzn
{Module SharedCLibrary Code+&97FC}
Exception registers dump at &01c068dc:
a1/r0 =0x00000000 0
a2/r1 =0x0386e674 59172468 -> [0x00000000] [0x00000000] [0x00000000] [0x00000000]
a3/r2 =0x01c06a68 29387368 -> [0x01c06ab0] [0x01c06af4] [0x00000000] [0x7fffffff]
a4/r3 =0x03862164 59122020 -> [0xe3a00000] [0xe1b0f00e] [0x7270665f] [0x66746e69]
v1/r4 =0x01c06a6c 29387372 -> [0x01c06af4] [0x00000000] [0x7fffffff] [0x0000000a]
v2/r5 =0x01c06a68 29387368 -> [0x01c06ab0] [0x01c06af4] [0x00000000] [0x7fffffff]
v3/r6 =0x00089c9a 564378
v4/r7 =0x00000000 0
v5/r8 =0x00000073 115
v6/r9 =0x00000000 0
sl/r10=0x00000000 0
fp/r11=0x01c06a50 29387344 -> [0x23861707] [0x01c06a6c] [0x00089c98] [0x01c06a68]
ip/r12=0x01c06a64 29387364 -> [0x00000000] [0x01c06ab0] [0x01c06af4] [0x00000000]
sp/r13=0x01c06a08
lr/r14=0x038622d4 59122388
pc/r15=0x238617fb
PSR=0x20000003 : SVC-26 ARM fi vCzn
&01c069f4 (SVC) : Aborting called by &03861800, CPSR SVC-26 ARM fi vCzn
{Module SharedCLibrary Code+&97FC}
&01c07fc8 (SVC) : SWI &49c40 called &03c1006c {-> Module DragAnObject Code+&68}
&01c07fe0 (SVC) : SWI &69c40 called by &0386d81c {Module SharedCLibrary Code+&15818}
SWI X[DragAnObject]_0 / XDragAnObject_Start (local)
Caller disassembly from &0386d7cc-&0386d81c:
&0386d7cc : .... : e1a0c00d : MOV R12,R13
&0386d7d0 : ..-. : e92ddbf4 : STMDB R13!,{R2,R4-R9,R11,R12,R14,PC}
&0386d7d4 : ..L. : e24cb004 : SUB R11,R12,#4
&0386d7d8 : .... : e3c0c102 : BIC R12,R0,#&80000000
&0386d7dc : .... : e3100102 : TST R0,#&80000000
&0386d7e0 : .... : 038cc802 : ORREQ R12,R12,#&00020000
&0386d7e4 : ..1. : e3310000 : TEQ R1,#0
&0386d7e8 : .... : 189103ff : LDMNEIA R1,{R0-R9}
&0386d7ec : .... : 1a000009 : BNE &0386D818
&0386d7f0 : .... : e3a00000 : MOV R0,#0
&0386d7f4 : .... : e3a01000 : MOV R1,#0
&0386d7f8 : . .. : e3a02000 : MOV R2,#0
&0386d7fc : .0.. : e3a03000 : MOV R3,#0
&0386d800 : .@.. : e3a04000 : MOV R4,#0
&0386d804 : .P.. : e3a05000 : MOV R5,#0
&0386d808 : .`.. : e3a06000 : MOV R6,#0
&0386d80c : .p.. : e3a07000 : MOV R7,#0
&0386d810 : .... : e3a08000 : MOV R8,#0
&0386d814 : .... : e3a09000 : MOV R9,#0
&0386d818 : q... : ef020071 : SWI XOS_CallASWIR12
&0386d81c : .... : e59dc000 : LDR R12,[R13,#0]
Leave C-environment in USR mode: PC was &0386d81c
PC disassembly already performed

Backtrace (USR mode):
386d81c: Unnamed routine at &0386d7cc
Arg 1: 0x000a3e58 671320 -> [0x00050190] [0x000126a0] [0x000a3e80] [0x000a8770]
7b0f0: Swi
Arg 1: 0x00049c40 302144 -> [0xe5a10008] [0xe59f0000] [0xe91ba810] [0x000a2b10]
Arg 2: 0x000a3e58 671320 -> [0x00050190] [0x000126a0] [0x000a3e80] [0x000a8770]
68994: StartDragObject
68acc: Drag_Object
Arg 1: 0x000126a0 function ?
Arg 2: 0x000f1500 "Main"
Arg 3: 0x000a87d0 690128 -> [0x000003b2] [0x000003f4] [0x00000456] [0x0000042c]
Arg 4: 0x00000190 400
Arg 5: 0x00000000 0
128f8: C_DragSelection
Arg 1: 0x000f1500 "Main"
Arg 2: 0x00000001 1
16788: DragSelect
Arg 1: 0x000dd0ec 905452 -> [0x5fc23779] [0x00000000] [0x000f0c68] [0x000dd0c8]
Arg 2: 0x00000000 0
Arg 3: 0x00000040 64
69aac: CallProcs
Arg 1: 0x000dd0ec 905452 -> [0x5fc23779] [0x00000000] [0x000f0c68] [0x000dd0c8]
Arg 2: 0x00000000 0
Arg 3: 0x00000040 64
Arg 4: 0x00000040 64
69ba8: CallAllIconProcs
Arg 1: 0x000dd0ec 905452 -> [0x5fc23779] [0x00000000] [0x000f0c68] [0x000dd0c8]
Arg 2: 0x00000000 0
Arg 3: 0x00000040 64
Arg 4: 0x00000040 64
6b9ec: WIN_ProcessWindowClick
Arg 1: 0x5fc23779 1606563705
Arg 2: 0x00000000 0
Arg 3: 0x00000040 64
75e6c: MouseClick
76b9c: EV_ProcessEvent
Arg 1: 0x00000006 6
96710: IMP_ProcessEvent
Arg 1: 0x00000000 0
233c4: main
Arg 1: 0x00000002 2
Arg 2: 0x000a8e88 -> [0x000a8e9c] [0x000a8ed1] [0x00000000] [0x3c694c3e] {char**}
386b100: _main
Arg 1: 0x000a8a30 "ADFS::HardDisc4."...
Arg 2: 0x00022b34 function main
323a0: Unnamed routine at &0003236c
386b504: <root call>

--
Matthew Phillips
Durham

Gerph

Jun 12, 2012, 4:18:42 PM

[ Google ate my first mail after I'd spent an hour and a half on it.
Grr. So here's my second attempt. Hopefully it makes sense.
And apologies for delay; I've not felt like replying to news the past
week or so. ]

On Jun 8, 12:35 am, Matthew Phillips <spam20...@yahoo.co.uk> wrote:
> In message <63452a9752.Matt...@sinenomine.freeserve.co.uk>

> on 28 May 2012 Matthew Phillips wrote:
>

> > In message <b498a2ed-3973-49cd-b3a8-a53b6333a...@q2g2000vbv.googlegroups.com>

[this might seem a little matter-of-fact - the original draft of my mail
had me working out what was going on, but this time I know what's going
on, so it's a little more fact and less steps-to-find-out]

That I can help with. If you use BTSDump to list the code following the
failure (I don't know how far forward it looks, but I think it's about
64 bytes or something, depending on how you've got the system
configured), you'll see that it's checking for a '.' and a '*'
character. The bit before the code that precedes the failure is looking
for a '*' character and then doing some processing. This made me think
that it was part of the printf code. A quick search further back in the
SharedCLibrary in Zap shows that it's in _vfprintf - the RSBLT/EORLT
operation is unique within the module, I think.

Anyhow, the code before the failing case (which isn't part of the code
path) is dealing with a varargs variant of the width specifier (ie
getting the value from the passed registers for a width specifier like
"%*s"). The failing code is something like this:

LDR r1, <SCL static address>
LDR r12, [r10, #-540] ; get C Library statics offset
ADD r1, r12,r1 ; get the address of the variable
LDRB r2, [r1,r8]
TST r2, #&20

This is essentially doing an 'isdigit' call on the character. The
'isdigit()' macro expands to a test of a bit in the locale character
type data, of which &20 is (I guess) the bit that means 'this is a
digit'.

So you're dying in a printf routine whilst it's trying to use the C
library statics to access the locale data structures to determine if
you've given it a digit or not whilst working out the width field
specifier.
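
The lookup described above can be modelled in portable C. This is a sketch under stated assumptions, not the SCL source: the names are invented, the &20 digit bit is the guess made above, and the "linkage address plus offset" addressing is simulated with an ordinary buffer.

```c
#include <stddef.h>
#include <stdint.h>

#define CT_DIGIT 0x20u  /* assumed: the bit tested by TST r2, #&20 */

/* Fill a 256-entry character type table, marking '0'..'9' as digits. */
void ctype_init(uint8_t table[256])
{
    for (int c = 0; c < 256; c++)
        table[c] = (c >= '0' && c <= '9') ? CT_DIGIT : 0;
}

/* Model of the failing sequence: the table is found at its linkage
 * address plus the per-client offset that lives at [sl-540]. */
int is_digit_via_statics(const uint8_t *linkage, ptrdiff_t statics_offset,
                         unsigned char c)
{
    const uint8_t *table = linkage + statics_offset; /* ADD r1, r12, r1 */
    return (table[c] & CT_DIGIT) != 0;    /* LDRB r2, [r1, r8]; TST r2, #&20 */
}
```

With the right offset the test behaves like isdigit(); with the wrong offset it silently reads whatever happens to be at the resulting address; and if the offset itself cannot be loaded (as with R10 = 0) the code never gets as far as the table.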

> It has to be pretty fundamental because my callback
> function never gets called by DragAnObject_Start. I can tell this because I
> modified the application to write some debug to a file at various stages, and
> the line at the start of the callback function is not written to file but the
> line immediately before the call to DragAnObject_Start is written. (Or,
> perhaps, the debug function could be responsible for triggering the address
> exception in the same way that the code did before I added the debug.)

Bingo - if you used [fs]printf to write your debug information, you've
just caused the problem that it is reporting.

Refer to my previous comment about using C functions within the
callback (albeit in the context of calling setjmp/longjmp):

"Because you don't have a C environment set up, you shouldn't rely on
any of the operations within the callback function working in C
environment."

So I'm not sure that the dump you've got is telling you anything other
than 'the debug code is killing your application'.

> Changing the code back so that bit 17 is set and bit 18 is clear seems to
> solve the problem on RISC OS 6.20. I had better test on a few other versions
> of RISC OS before I can be happy it is fixed.
>
> I'm not totally sure about my interpretation of the dump, though, as the
> value of R10 given in the register listing looks like it could have 540
> deducted from it and still be in application memory, so on the face of it I
> don't see a complete explanation.

I think you're looking at the wrong bit - the 'Register dump' at
the top is the reconstructed C library view on the User mode
registers, not the ones that were in place at the time of the
exception. Look at the dump of registers under 'Exception
registers dump' and you'll see that R10 = 0, and so consequently
[R10,#-540] would be bad.

> It still sounds to me as though there are no disadvantages for the C
> environment in having bit 17 set, so I'll stick with that if it works. I
> hope this is useful to other programmers in the future puzzling over the PRM
> documentation! I've certainly learned quite a lot.

I wouldn't do that. It sets up the environment in a way that is not
correct for your application.

Ok, here are the scenarios:

Normally when you are running your USR mode application, R10 = stack
limit, which is base of stack+540. The base of the stack contains data
which is important to the C application.
[sl-540] = Offset from linkage address to C library statics
[sl-536] = Offset from linkage address to user code statics (ie the
stuff you write)

In a normal USR mode application:
[sl-540] = the offset of the application's C library workspace from
linkage location of the C library statics for the version
of the SCL which was present when the application initialised
[sl-536] = 0 (because the user code statics are at the address they
are linked at)

Every time you use a CLib function which needs access to one of its
statics, it offsets the address of the variable by [sl-540].
Every time you use a variable in your code you access it directly (you
do not use [sl-536] to offset the address).

sl itself is the stack limit, and if your functions were to hit upon
that limit a stack extension would occur. The data at the base of the
stack, which includes the two values mentioned but also includes
pointers to the previous and next stack chunks, would be copied to a
new chunk (and the pointers to the chunks updated).

So if you have R10=0 it'll crash when any C library statics are accessed
for your code linked as a USR mode application, but you can access the
application statics ok (because the offset at [sl-536] will not be
used).
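
To put numbers on that, here is a small sketch of the address arithmetic. The helper names are invented; the 540 and 536 offsets are the ones given above, and the addresses are modelled as 32-bit unsigned values so that the wrap-around at sl = 0 is visible.

```c
#include <stdint.h>

#define SL_CLIB_STATICS 540u  /* [sl-540]: offset to C library statics */
#define SL_USER_STATICS 536u  /* [sl-536]: offset to user code statics */

/* Address from which the C library statics offset is loaded. */
uint32_t clib_offset_slot(uint32_t sl)
{
    return sl - SL_CLIB_STATICS;
}

/* Address from which the user statics offset is loaded (modules only). */
uint32_t user_offset_slot(uint32_t sl)
{
    return sl - SL_USER_STATICS;
}
```

With the USR mode value from the dump (sl = &000a7cac) the first slot is at &000a7a90, comfortably inside application space. With sl = 0 it wraps to &fffffde4, which I'd expect to be reported as an address exception on a 26-bit system, since it is outside the 26-bit address range.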

Now let's consider the usage with R10 = SVC stack base+540, which is
what the DragAnObject module sets up. DragAnObject sets nothing else up.
The entry point is intended for modules which previously set up the
stack entrails to contain the values it needs.

Let's explain that a bit more. The module entry sequence used by CMHG
(and CMunge, and anything else that wants to use SCL within the module)
does the following (I've omitted things that aren't relevant):

* Preserve registers
* Work out stack base; all SVC stacks are based at a megabyte boundary,
so this is simply clearing the bottom 20 bits.
* Preserve the two values found at the base of the stack - these are
sl-540 and sl-536.
* Store the offset from the module linkage location to the C library
statics in sl-540.
* Store the offset from the module linkage location to the module
statics in sl-536.
* Set FP = 0 (frame pointer of 0 indicates the end of backtrace)
* Enter the module code at the C routine specified.
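
The "work out stack base" step is plain bit masking, which can be sketched as follows. The function names are made up for illustration, and the example value is the SVC mode sp from the exception registers dump earlier in the thread.

```c
#include <stdint.h>

#define SVC_STACK_ALIGN (1u << 20)  /* SVC stacks sit on a megabyte boundary */
#define STACK_ENTRAILS  540u        /* sl is this far above the stack base */

/* Clear the bottom 20 bits of the stack pointer to find the stack base. */
uint32_t svc_stack_base(uint32_t sp)
{
    return sp & ~(SVC_STACK_ALIGN - 1u);
}

/* The stack limit (sl) that goes with that base. */
uint32_t svc_stack_limit(uint32_t sp)
{
    return svc_stack_base(sp) + STACK_ENTRAILS;
}
```

For sp = &01c06a08 this gives a base of &01c00000 and sl = &01c0021c, so the two entrails words would be at &01c00000 and &01c00004.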

So if, at this point, the module calls DragAnObject_Start the
stack entrails contain the right values for the module, so any
callback function that is called (within the original calling
module) would work just fine. The important thing here is that
the callback function *is a C function*. Normally you wouldn't
do that. The normal way to provide a callback function in a C
module is to use an explicit generic veneer declaration in
CMHG/CMunge to set up the stack entrails for you.

Anyhow, that it differs from the more normal way of handling the
entry point isn't relevant in this case - except that it is the
root cause of the confusion.

The module code which accesses its statics will always use the
offset at [sl-536]; this is where it differs from the
application code that you would normally compile. The -zm switch
that you compile modules with changes this behaviour. Why do
modules need this ? Because the address that they are linked at
cannot contain a direct pointer to its module workspace. A
module at &21001324 could not simply have a pointer in the code
to the workspace at &2105678. Why ? One reason is that the
modules can be multiply instantiated, so the code may be used
with different module workspaces, so you have to have a
different area for each instance. Another reason is that the
module might be in ROM, and the workspace is dynamic in RAM.

So, this value exists in order that you can reference different
areas. APCS uses the term 'static base' for this type of
referencing, and it's possible to build your code to use a
static base register, instead of this offset. SCL doesn't and to
get into how much easier things might be if we used a static
base would probably drift too far off topic for most people to
care.

(just to drift very slightly off topic here, I just found
some ARM documentation that explicitly states that there are
"functions that do not use any static data of any kind, for
example fprintf()" - a fact that is not true of the SCL, as
fprintf does use static data, as I've described above)

Returning to the matter in hand....
The module's code will work because the values are available on
the stack, because DragAnObject set up R10 to point to the right
place - if it had been set up with R10 = 0, it would crash because
it referenced address -540 as we've seen above.

Assuming all is well, the module code returns to DragAnObject
and it restores things:

* Restore the old [sl-540] and [sl-536] contents
* Restore registers
* Set any flags necessary
* Return to OS

So you can see that under normal circumstances the stack base will
either be set to the offsets for the C library statics and module
statics from the module that is currently executing, or the previous
values preserved.
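
The preserve / replace / call / restore pattern described above can be modelled in C like this. It is only a model: the real thing has to be done in assembler around the SVC stack, and the type and function names here are invented for illustration.

```c
#include <stdint.h>

/* The two words at the base of the stack: [sl-540] and [sl-536]. */
typedef struct {
    int32_t clib_statics_offset;
    int32_t user_statics_offset;
} stack_entrails;

typedef void (*c_entry)(void *private_word);

/* Install our offsets, call the C routine, then put the old ones back. */
void call_with_entrails(stack_entrails *base, stack_entrails ours,
                        c_entry fn, void *private_word)
{
    stack_entrails saved = *base;  /* preserve the two values */
    *base = ours;                  /* store our offsets */
    fn(private_word);              /* enter the C routine */
    *base = saved;                 /* restore on the way out */
}

/* Example callback: records which offsets were visible while it ran. */
typedef struct {
    const stack_entrails *base;
    stack_entrails seen;
} entrails_probe;

void record_entrails(void *private_word)
{
    entrails_probe *probe = private_word;
    probe->seen = *probe->base;
}
```

If the routine never returns (the aborting case discussed next), the final restore is skipped and the callee's offsets are left behind at the base of the stack.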

If a module were to exit from within the code that had stored values
at the base of the stack (eg because it aborted within its SWI
handler) there will remain values at the base of the stack which
relate to the module which exited.

The value of [sl-536] would be that of the offset from the
linkage address of the module statics to the module statics
within module's workspace. This is of no use to anyone but that
module.

The value of [sl-540] would be that of the offset from the
linkage address of the SharedCLibrary at the time that the module
initialised to the C library statics within that module.

So, consider the situation where nothing has gone wrong and the base
of the stack contains (0,0) - no offset present. If your user mode
application called DragAnObject with R10 pointing to the base of the
SVC stack+540, and called back into your application:

* Access to your application statics would be fine - because your
application code never uses [sl-536] to access its statics.

* Access to the C library statics wouldn't be right, but would
probably work so long as the values they contained were not
important. Why ?
Because the C library statics addresses that are being loaded
are the linkage addresses, and they're being offset by 0.
So the code will dereference values off the end of the
SharedCLibrary - the linkage address for statics in a module is
'as if' the statics were located immediately following the code.
So, for example, a read of the locale tables might get random
data, and in some cases that might work out just fine.
The important thing here is that unless there was a pointer in
the C library statics, it wouldn't crash. It would just get
the wrong data.

Now consider the state where the stack contains the data for some
other module, because it crashed as described above. If your user
mode application called DragAnObject with R10 pointing to the base
of the SVC stack+540 (which contains these offsets for the module
that broke), and called back into your application:

* Again, access to your application statics is fine, for the same
reasons.

* Access to the C library statics might seem to work just fine,
unless the module that crashed was killed. Why?
Because the offset in [sl-540] is the offset from the C library
linkage location to the statics of the failing module.
Consequently, any access to the statics works just fine, and
is actually a valid set of statics. Well. Valid so long as
the module didn't select a different locale, or whatever.
Standard file descriptors (stdout, stdin, stderr) will be the
descriptors /for the module/ not for the application. Similarly
the locale settings will be those for the module not the
application. And so on.

Consider the case where the base of the stack contains random values
or values which do not result in a valid block of memory when applied
to the linkage locations of the static data in the SCL. In this case
you're pretty much guaranteed that you'll get a crash.

Ok, so step back further. We cannot rely on the value of those two
addresses from a given application - we don't know what they might be
and any of the above might have happened (or those addresses may be
updated to be markers in some other version of the system - yes that
could be useful).

And if we cannot rely on the value of those addresses, we cannot have
C code called with them set to any random values. Therefore... a
callback from DragAnObject which sets R10 to the SVC stackbase+540
WILL NOT WORK if you call an application C function. That it may work
sometimes is no justification.

If you can see anywhere in the above justification which is wrong and
means that it should work, please let me know - it's very possible that
in the time that I've spent away from RISC OS that I've forgotten
something about how it functions. There are, I'm sure (/there ought to
be) people who know the way that the C library interactions work who
can corroborate or correct the above, and I'd be happy to be corrected
if someone can show where I'm wrong. It's been 6 or 7 years, so ... I
can easily misremember.

Anyhow, I hope I've convinced you that using bare C functions is not
going to work. DragAnObject was not meant to be used
by applications - and the documentation further confused things by
implying things which were not the case. The above description applies
to all versions of RISC OS (unless anyone's changed the way in which
C modules work, which is possible but ... tricky).

DragAnObject always felt a little unfinished to me. It doesn't quite do
what you want and really I think you'd be better off having it as a
library rather than a module. Maybe.

But, what can you do to mitigate the problem ? Create a veneer that
does the right thing for you would be my guess. That's all that the
modules do.

Here's a quick guess at how I'd provide the veneer.

--------8<--------
EXPORT callDragAnObject_Start
; _kernel_oserror *callDragAnObject_Start(unsigned long flags,
; void (*cfunc) (void *private),
; void *private,
; bbox_t *dragbox,
; bbox_t *boundingbox);
;
; Call DragAnObject_Start, providing the rendering function as a C
; function that we can operate on.
; => R0 = flags
; R1 = pointer to C function to call
; R2 = private value to pass to C function in R0
; R3 = pointer to 16-byte block containing box
; [sp+0] = pointer to optional 16-byte block containing bounding box
; <= R0 = pointer to error block, or 0 if no error.
callDragAnObject_Start SIGNATURE
MOV ip, sp
STMFD sp!, {v1,fp,ip,lr,pc}
SUB fp, ip, #4 ; fp -> pc on the stack

LDR r4, [fp, #4] ; r4(v1) = stacked value
ORR r0, r0, #(1<<16)+(1<<17) ; set the function callback
; and as C function flags
STMFD sp!, {r1,r2,sl} ; set up our registers
ADR r1, dao_callback_veneer ; use our veneer
MOV r2, sp ; workspace is on our stack
SWI XDragAnObject_Start

MOVVC r0, #0
[ {CONFIG}=32
LDMDB fp, {v1,fp,sp,pc}
|
LDMDB fp, {v1,fp,sp,pc}^
]

; Callback veneer, called by DragAnObject to render.
; Must call down to our C function, having set up the stack entrails
; properly.
; Will be entered in SVC mode.
; => R0 = C function pointer
; R1 = private value to pass in R0 to C function
; R2 = SL from our old veneer
; Can corrupt all R0-R10
dao_callback_veneer SIGNATURE
STMFD sp!, {r0, lr} ; stack the C function pointer (called via [sp] below)

LDR v1, [sl, #-540] ; preserve SVC CLib statics
LDR v2, [sl, #-536] ; preserve SVC app statics
LDR v3, [r2, #-540] ; our CLib statics
LDR v4, [r2, #-536] ; our app statics
STR v3, [sl, #-540] ; replace SVC CLib statics
STR v4, [sl, #-536] ; replace SVC app statics

MOV fp, #0 ; stop backtrace here
MOV r0, r1
MOV lr, pc
LDR pc, [sp] ; call function

STR v1, [sl, #-540] ; restore SVC CLib statics
STR v2, [sl, #-536] ; restore SVC app statics
LDMFD sp!, {r1, pc}
--------8<--------

I'm not completely sure that that's right - I'm quite rusty on my
ARM code.

As for the 'fix' which was meant to call back to USR mode. Any
such fix needs to be able to restore SL, which is in R10. Which
is not available as part of the SWI entry parameters. Therefore
there is no legal way of getting the value of R10 within the
SWI call - therefore you couldn't restore the C environment
sufficiently in order to be useful.

Maybe I missed something about how they finessed that particular
'fix', but I feel that trying to fix a piece of code to allow
a documentation error to work probably isn't a good plan. Calling
down from a SWI call into USR mode I'm certain is a bad plan.
I'm pretty certain that calling back into an application through
a callback in SVC mode is also a bad plan but there you go.
Certainly you can do everything that DragAnObject does in a library
in your application which would save the unsafe callback, albeit
losing the advantage of having a block of code already in the
system.

> The other interesting thing is that some of the other calls my application
> makes to DragAnObject_Start are working fine. They have different callback
> functions but go through the same Drag_Object function as the problematic
> call, and were also using bit 17 clear, bit 18 set. It must have been down
> to stack usage and whether statics were accessed or the calling convention
> needed reading. And interesting that there were no problems on 4.02 or 4.39,
> as well as the RO5 branch, of course.

I imagine that the change in the way that applications are
called may have affected this. I'm not certain, but my
recollection is that the stack flattening performed by the
system does not manipulate the two words at the bottom of the
stack to clear them, so they would be left. Consequently the two
stack entrails values are probably the ones from AIF. Maybe. I'm
not certain and further speculation probably isn't worthwhile as
I've already shown that you're relying on values that you did
not initialise and therefore the behaviour is going to be
undefined.

Anyhow, I hope that using BTSDump has shown it to be useful as a
diagnostic tool, even if it merely tells you here that you broke
things by putting debugging code in.

If I were looking at the code now, I'd be very tempted to add a
PBTS point in DragAnObject, as it's transferring control to
another chunk of code. That would allow you to trace the
transition to the C code (or any other code) and produce a
useful backtrace for BTSDump if anything went wrong.

--
Gerph
... Every nightmare that I dream;
make it mean something better.
