Quickstep without alloca (final)

Oleh Derevenko

unread,

Jun 14, 2009, 2:38:00 PM6/14/09

to ODE Group

Hi All,

Here's final version of quickstep with stack allocation removed from it. The patch has been made against revision 1680.

dWorldStep is fully functional (even the new APIs) but still uses stack allocation internally.

If there are no objections reported during the week I'll commit it next week-end.

========= From CHANGELOG.txt =========

06/14/09 Oleh Derevenko
  * dWorldQuickStep re-implemented to avoid memory allocation on stack.
    Also several optimizations have been made to decrease memory
    requirements and optimize algorithm implementation of dWorldQuickStep.
    dWorldStep still remains with old memory allocation however new APIs
    mentioned below are fully functional for it.
    Both dWorldStep and dWorldQuickStep have been changed to return boolean
    success status.
    Both dWorldStep and dWorldQuickStep can now be used from multiple threads
    without serialization only if TLS is enabled for library with --enable-ou.
    New functions have been added:
     - dWorldStep2ContextRealloc
     - dWorldStep2
     - dWorldQuickStep2ContextRealloc
     - dWorldQuickStep2
     - dWorldStepContextFree
    Also dWorldStep and dWorldQuickStep have been changed to be wrapers for
    these new APIs with an internal context stored statically in library.
    dWorldStepCleanup and dWorldQuickStepCleanup functions have been added
    to free this internal context.
    From now on dWorldStep and dWorldQuickStep are to be considered obsolete.

========= From CHANGELOG.txt =========

Oleh Derevenko
-- ICQ: 36361783

quickstep-final.zip

Oleh Derevenko

unread,

Jun 21, 2009, 10:51:09 AM6/21/09

to ode-...@googlegroups.com

Already commited. ;)

Daniel K. O.

unread,

Jun 21, 2009, 4:22:26 PM6/21/09

to ode-...@googlegroups.com

Oleh Derevenko escreveu:
> Already commited. ;)

I have a few comments.

Please include the revision so people know where to look when things
break. ;)

As you were going to touch a large part of the quickstep code, you
should have made 2 commits; one to astyle the original source, and one
for your patch. Now the diff shows a lot of changes that were just
formatting changes, and distract me from the real changes.

I don't like the semantics you introduced.

> * dWorldQuickStep re-implemented to avoid memory allocation on stack.
> Also several optimizations have been made to decrease memory
> requirements and optimize algorithm implementation of dWorldQuickStep.
> dWorldStep still remains with old memory allocation however new APIs
> mentioned below are fully functional for it.
> Both dWorldStep and dWorldQuickStep have been changed to return boolean
> success status.

What happens when they return 0? Can the simulation go on? Are the
objects in an undefined state?

> Both dWorldStep and dWorldQuickStep can now be used from multiple threads
> without serialization only if TLS is enabled for library with --enable-ou.

Nothing changed on step.cpp. Why didn't dWorldStep work before and works
now?

> New functions have been added:
> - dWorldStep2ContextRealloc
> - dWorldStep2
> - dWorldQuickStep2ContextRealloc
> - dWorldQuickStep2
> - dWorldStepContextFree

Why do I need to deal with a context? What is a context? Would I want to
share a context between worlds?

The docs say the world must not be changed between the realloc and step2
calls. Didn't you just create a sequential coupling?

> Also dWorldStep and dWorldQuickStep have been changed to be wrapers for
> these new APIs with an internal context stored statically in library.

And seem to have quite a bit of overhead added to them. I wonder if the
people who where arguing when I proposed the damping addition about the
cost of 3 multiplications will accept the extra loops and indirect
calls. Can you show us some numbers?

> dWorldStepCleanup and dWorldQuickStepCleanup functions have been added
> to free this internal context.
> From now on dWorldStep and dWorldQuickStep are to be considered obsolete.

why are they deprecated? Didn't you just change their semantics to
return a success status?

--
Daniel K. O.
"The only way to succeed is to build success yourself"

Oleh Derevenko

unread,

Jun 21, 2009, 5:51:10 PM6/21/09

to ode-...@googlegroups.com

----- Original Message -----

From: "Daniel K. O."

To: <ode-...@googlegroups.com>

Sent: Sunday, June 21, 2009 11:22 PM

Subject: [ode-users] Re: Quickstep without alloca (final)

Please include the revision so people know where to look when things
break. ;)

Nothing will break. :)

As you were going to touch a large part of the quickstep code, you
should have made 2 commits; one to astyle the original source, and one
for your patch. Now the diff shows a lot of changes that were just
formatting changes, and distract me from the real changes.

First, if you use TortoiseSVN it can show you the differences with ignoring whitespace changes. I don't know if TortoiseSVN is available for Linux though.

Second, there are lots of real changes indeed. In quickstep.cpp in two main functions, probably almost every line was changed.

I don't like the semantics you introduced.

What exactly? Could you clarify?

>     * dWorldQuickStep re-implemented to avoid memory allocation on stack.
>       Also several optimizations have been made to decrease memory
>       requirements and optimize algorithm implementation of dWorldQuickStep.
>       dWorldStep still remains with old memory allocation however new APIs
>       mentioned below are fully functional for it.
>       Both dWorldStep and dWorldQuickStep have been changed to return boolean
>       success status.

What happens when they return 0? Can the simulation go on? Are the
objects in an undefined state?

It means that memory allocation has failed. This leaves objects in unchanged state and you can retry as soon as more memory is available.

> Both dWorldStep and dWorldQuickStep can now be used from multiple threads
> without serialization only if TLS is enabled for library with --enable-ou.

Nothing changed on step.cpp. Why didn't dWorldStep work before and works
now?

Well, maybe because I just did not call all the functions from utils.cpp. ;) It's easy to try it with my first patch file, is not it?

>       New functions have been added:
>        - dWorldStep2ContextRealloc
>        - dWorldStep2
>        - dWorldQuickStep2ContextRealloc
>        - dWorldQuickStep2
>        - dWorldStepContextFree

Why do I need to deal with a context? What is a context? Would I want to
share a context between worlds?

The context is abstract object that currently holds the memory preallocated for current step (or allocated at previous step). But similarly it can carry more meaning in the future. It should be an opaque pointer for the client code.

You can share a context among multiple worlds unless you do not use it for simulation in multiple threads at the same time. On one hand it saves memory by using one largest required memory area for all the worlds. On the other hand it does not allow you to efficiently manage allocated memory size when number of objects in one of worlds decreases significantly. If you used individual contexts for every world yuo would just delete corresponding one and allocate it once again for smaller world size. However if you share one context for everything you do not know if it is worth deleting it as you do not know if current world was the largest one and if there are other worlds of nearly same size that would require the same memory to be allocated back.

Though I would rathet say it is more profitable to share context if you have several worlds and it does not complicate implementation of multithreading for you.

The docs say the world must not be changed between the realloc and step2
calls. Didn't you just create a sequential coupling?

Yes, these functions must be called one right after another. The first one just quickly estimates the required memory size and tries to increase the allocated buffer if necessary. It can return status if allocation fails. The second one received everything prepared and does the actual solving job.

> Also dWorldStep and dWorldQuickStep have been changed to be wrapers for
> these new APIs with an internal context stored statically in library.

And seem to have quite a bit of overhead added to them. I wonder if the
people who where arguing when I proposed the damping addition about the
cost of 3 multiplications will accept the extra loops and indirect
calls. Can you show us some numbers?

I do not have simulations in my projects. This is why I posted patch file here for preview several weeks ago. The only person who did some tests (or who reported the results here) was John Hsu.

Don't come down to counting individual multiplications, Daniel. Yes, I spend some small amount of time on memory size pre-estimation but I also made several optimizations for both memory usage and algorithm implementation that should fairly compensate that cost. After all, the program that just crashes in memory lack condition and you can do nothing about it can't be seriously considered better than a program that is little bit slower but can survive and continue computations or display a clear error message at least.

>       dWorldStepCleanup and dWorldQuickStepCleanup functions have been added
>       to free this internal context.
>       From now on dWorldStep and dWorldQuickStep are to be considered obsolete.

why are they deprecated? Didn't you just change their semantics to
return a success status?

They are deprecated because they are just the wrappers now left for compatibility with existing code and they use static context variable that can't be efficiently controlled and they do not provide access to other features available in new APIs (like memory reservation policy and memory manager customizaton) and they would not provide access to the new features that could be added in the future (like multithreading support).

Oleh Derevenko

Daniel K. O.

unread,

Jun 21, 2009, 7:40:11 PM6/21/09

to ode-...@googlegroups.com

Oleh Derevenko escreveu:

> I don't like the semantics you introduced.
>
> What exactly? Could you clarify?

I clarified them in the items that followed.

> What happens when they return 0? Can the simulation go on? Are the
> objects in an undefined state?
>
> It means that memory allocation has failed. This leaves objects in
> unchanged state and you can retry as soon as more memory is
> available.

This should be in the docs then.

> Nothing changed on step.cpp. Why didn't dWorldStep work before and
> works now?
>
> Well, maybe because I just did not call all the functions from
> utils.cpp.

That's specious reasoning. I have a tiger-repellant rock that might
interest you.

> The context is abstract object that currently holds the memory
> preallocated for current step (or allocated at previous step). But
> similarly it can carry more meaning in the future.

So it's a poltergeist object. Why can't it live withing the world
object? We already set the quickstep parameters by setting them in the
world object.

> The docs say the world must not be changed between the realloc and
> step2 calls. Didn't you just create a sequential coupling?
>
> Yes, these functions must be called one right after another.

And don't you see anything wrong with that?

> I do not have simulations in my projects. This is why I posted patch
> file here for preview several weeks ago. The only person who did some
> tests (or who reported the results here) was John Hsu. Don't come
> down to counting individual multiplications, Daniel.

I don't want multiplications. I want to be able to tell how much slower
it is now. I doubt anyone struggling with the framerate will risk to
lower it even more.

> implementation that should fairly compensate that cost. After all,
> the program that just crashes in memory lack condition and you can do
> nothing about it can't be seriously considered better than a program
> that is little bit slower but can survive and continue computations
> or display a clear error message at least.

Some programs need speed, not memory.

> They are deprecated because they are just the wrappers now left for
> compatibility with existing code and they use static context variable
> that can't be efficiently controlled and they do not provide access
> to other features available in new APIs (like memory reservation
> policy and memory manager customizaton) and they would not provide
> access to the new features that could be added in the future (like
> multithreading support).

With this let me sum up my concerns: this contribution is unacceptable
as a whole.

It not only adds complexity to the code base (I was expecting you would
make the code shorter and clean; but it's now more than 30% longer), it
also makes the API more complex. Internal details are being
unnecessarily exposed to the users.

As John Hsu suggested, I think your changes should stay in a branch for
a while until things are ironed out.

Oleh Derevenko

unread,

Jun 22, 2009, 5:40:14 AM6/22/09

to ode-...@googlegroups.com

Hi Daniel

----- Original Message -----

From: "Daniel K. O."

To: <ode-...@googlegroups.com>

Sent: Monday, June 22, 2009 2:40 AM

Subject: [ode-users] Re: Quickstep without alloca (final)

> What happens when they return 0? Can the simulation go on? Are the
> objects in an undefined state?
>
> It means that memory allocation has failed. This leaves objects in
> unchanged state and you can retry as soon as more memory is
> available.

This should be in the docs then.

Sorry, I've lost my access to Wiki. Could add it please? I can add it to the Doxygen comments though.

> Nothing changed on step.cpp. Why didn't dWorldStep work before and
> works now?
>
> Well, maybe because I just did not call all the functions from
> utils.cpp.

That's specious reasoning. I have a tiger-repellant rock that might
interest you.

My English is not good enough to understand that. :) Don't you believe me? I just remember that I had quickly put NULL in parameter somewhere in first version to have it compile.

> The context is abstract object that currently holds the memory
> preallocated for current step (or allocated at previous step). But
> similarly it can carry more meaning in the future.

So it's a poltergeist object. Why can't it live withing the world
object? We already set the quickstep parameters by setting them in the
world object.

Well, and what's wrong about keeping it outside? As you already mentioned you could share the same memory for several worlds which would otherwise need to allocate copies individually.

> The docs say the world must not be changed between the realloc and
> step2 calls. Didn't you just create a sequential coupling?
>
> Yes, these functions must be called one right after another.

And don't you see anything wrong with that?

You mean, why can't they be just joined together? Well, that's an interesting question. They should be separated for the reasons of symmetry. Since it's necessary to have separate context destruction function (dWorldStepContextFree) it's also logical to have a separate context allocation function. Otherwise it would look as you would be freeing some by-product, some waste of world stepping function. ;)

> I do not have simulations in my projects. This is why I posted patch
> file here for preview several weeks ago. The only person who did some
> tests (or who reported the results here) was John Hsu. Don't come
> down to counting individual multiplications, Daniel.

I don't want multiplications. I want to be able to tell how much slower
it is now. I doubt anyone struggling with the framerate will risk to
lower it even more.

Well, why don't you have a try and let us know the results? I just can't test it myself. I do not have any physics simulation code. However I know there are people who use ODE for physics simulation and who do measure performance. Let them tell us their impression.

> implementation that should fairly compensate that cost. After all,
> the program that just crashes in memory lack condition and you can do
> nothing about it can't be seriously considered better than a program
> that is little bit slower but can survive and continue computations
> or display a clear error message at least.

Some programs need speed, not memory.

No one program should crash. It does not matter how fast you do something if you fail to finish that job.

> They are deprecated because they are just the wrappers now left for
> compatibility with existing code and they use static context variable
> that can't be efficiently controlled and they do not provide access
> to other features available in new APIs (like memory reservation
> policy and memory manager customizaton) and they would not provide
> access to the new features that could be added in the future (like
> multithreading support).

With this let me sum up my concerns: this contribution is unacceptable
as a whole.

It not only adds complexity to the code base (I was expecting you would
make the code shorter and clean; but it's now more than 30% longer), it
also makes the API more complex. Internal details are being
unnecessarily exposed to the users.

I disagree. The code base was unnecessary complex from the very beginning, you can't do it much complex any more. Don't count the lines. Generally the structure of functions did not change. Some loops were dropped and merged together, some loops were split into two or three - that's all. If I would implement it my way I would split those huge functions into smaller ones, not more than one loop per function (you can see the suggested splitting structure by compound operators {...} that have been inserted in the code). However I tried to minimize changes and follow the existing naming convention style (oh, that was a hard task!!!) because I knew that would bear much more indignation from everybody.

As for exposing internals - those are not the internals. Those are the operating levers. The more control functions you have for machinery the finer you can tune it and the smoother you can operate it. You still do not need to use those all. All the new parameters allow passing defaults of NULL.

As John Hsu suggested, I think your changes should stay in a branch for
a while until things are ironed out.

I suggest someone should try it in real application and present some numbers first, so that you could have information for judgement.

Oleh Derevenko

unread,

Jun 22, 2009, 6:02:28 AM6/22/09

to ode-...@googlegroups.com

----- Original Message -----

To: ode-...@googlegroups.com

Sent: Monday, June 22, 2009 12:40 PM

Subject: [ode-users] Re: Quickstep without alloca (final)

> The docs say the world must not be changed between the realloc and
> step2 calls. Didn't you just create a sequential coupling?
>
> Yes, these functions must be called one right after another.

And don't you see anything wrong with that?

You mean, why can't they be just joined together? Well, that's an interesting question. They should be separated for the reasons of symmetry. Since it's necessary to have separate context destruction function (dWorldStepContextFree) it's also logical to have a separate context allocation function. Otherwise it would look as you would be freeing some by-product, some waste of world stepping function. ;)

Well, actually, I planned you would first call context allocation function then use it with world stepping function several times and then call context destruction function. However it turned out, that context allocation function is unnecessary and memory size adjustment is needed before each step.

We can still change the API to add empty context allocation function (an empty stub for now, a fake) and do it that way (Alloc -> N x Use -> Free). However I don't see much sense in that unless there would be something that would be necessary to allocate just one and that would require substantial amount of time for allocation. All the rest can always be allocated on demand in world stepping functions pair if NULL parameter is passed.

jcooper

unread,

Jun 22, 2009, 11:21:53 AM6/22/09

to ode-users

I agree with Daniel on this. The changes seem to add unnecessary
complexity and messiness and are in need of additional testing.

> > I do not have simulations in my projects. This is why I posted patch
> > file here for preview several weeks ago. The only person who did some
> > tests (or who reported the results here) was John Hsu. Don't come
> > down to counting individual multiplications, Daniel.
>
> I don't want multiplications. I want to be able to tell how much slower
> it is now. I doubt anyone struggling with the framerate will risk to
> lower it even more.
>
> Well, why don't you have a try and let us know the results? I just can't
> test it myself. I do not have any physics simulation code. However I know
> there are people who use ODE for physics simulation and who do
> measure performance. Let them tell us their impression.

It seems that if Oleh doesn't have/use dynamics simulations, he should
not be messing much in the dynamics code. Messing with the code and
then saying, "Well, I don't use this code; so someone else should test
it for me..." is not a good way to go. There's a "space_stress"
demo. There should probably also be a "dynamics_stress" demo or
benchmark suite that gives an idea as to how performance compares.
The timing mechanism is conveniently there already. As Daniel
suggested, frame rate is extremely important for my applications. I'd
really like to see how the new code compares in terms of raw speed
(assuming no change in accuracy).

> > implementation that should fairly compensate that cost. After all,
> > the program that just crashes in memory lack condition and you can do
> > nothing about it can't be seriously considered better than a program
> > that is little bit slower but can survive and continue computations
> > or display a clear error message at least.
>
> Some programs need speed, not memory.
>
> No one program should crash. It does not matter how fast you do something if you fail to finish that job.

Honestly, I don't mind having alloca() in the code at all. As long as
it's clear that you need to make your stack large enough, it seems
like a nicer way to go. Temporary objects get put on the stack and
then flushed away without fragmentation concerns. I added multi-
threaded island processing into dWorldStep without a lot of difficulty
as it stands without even using the ODEou option. Of course, I might
have missed some potential race conditions and edge cases, but my
thread-pool/work-queue seemed to work fine after a few simple changes
(like adding an extra "tag" field so that stepping didn't conflict
with connected component detection and protecting the body list so
that pointers didn't get stomped when changed objects were moved to
the front of the list).

At least in the applications for which I use ODE, if memory allocation
fails, I'm out of luck whether it failed on the stack or on the heap.
I can't really hunt around for spare memory to free up and try again.
Granted, I'm not developing commercial products; so as far as I'm
concerned, crashing is as good as dying gracefully. I'd rather be
fast, and have the option of increasing the stack size as needed, I
think.

> > They are deprecated because they are just the wrappers now left for
> > compatibility with existing code and they use static context variable
> > that can't be efficiently controlled and they do not provide access
> > to other features available in new APIs (like memory reservation
> > policy and memory manager customizaton) and they would not provide
> > access to the new features that could be added in the future (like
> > multithreading support).
>
> With this let me sum up my concerns: this contribution is unacceptable
> as a whole.
>
> It not only adds complexity to the code base (I was expecting you would
> make the code shorter and clean; but it's now more than 30% longer), it
> also makes the API more complex. Internal details are being
> unnecessarily exposed to the users.
>
> I disagree. The code base was unnecessary complex from the very beginning, you can't do it much complex any more. Don't count the lines. Generally the structure of functions did not change. Some loops were dropped and merged together, some loops were split into two or three - that's all. If I would implement it my way I would split those huge functions into smaller ones, not more than one loop per function (you can see the suggested splitting structure by compound operators {...} that have been inserted in the code). However I tried to minimize changes and follow the existing naming convention style (oh, that was a hard task!!!) because I knew that would bear much more indignation from everybody.
> As for exposing internals - those are not the internals. Those are the operating levers. The more control functions you have for machinery the finer you can tune it and the smoother you can operate it. You still do not need to use those all. All the new parameters allow passing defaults of NULL.
> As John Hsu suggested, I think your changes should stay in a branch for
> a while until things are ironed out.
>
> I suggest someone should try it in real application and present some numbers first, so that you could have information for judgement.

As above, I agree with Daniel here. I also find repeated complaints
and arguments about existing conventions unnecessary. It's just not
that big a deal. They work okay and the lengthy threads about whether
or not a particular parameter should come first or last in the
arguments list makes this forum harder for everyone else to follow. I
certainly wouldn't say that the code base is complex beyond reason.

Joseph

Oleh Derevenko

unread,

Jun 22, 2009, 11:35:23 AM6/22/09

to ode-...@googlegroups.com

----- Original Message -----

From: "jcooper" <joseph...@gmail.com>

To: "ode-users" <ode-...@googlegroups.com>

Sent: Monday, June 22, 2009 6:21 PM

Subject: [ode-users] Re: Quickstep without alloca (final)

It seems that if Oleh doesn't have/use dynamics simulations, he should
not be messing much in the dynamics code. Messing with the code and
then saying, "Well, I don't use this code; so someone else should test
it for me..." is not a good way to go.

There are demos included with ODE. Is it not enough to run demos to be able to say if the code works or not after the changes? And regarding performance measurement someone should help me with this.

Honestly, I don't mind having alloca() in the code at all. As long as
it's clear that you need to make your stack large enough, it seems
like a nicer way to go.

You never know what is "large enough".

Temporary objects get put on the stack and
then flushed away without fragmentation concerns.

Allocation policy I use should not produce any fragmentation either.

I added multi-
threaded island processing into dWorldStep without a lot of difficulty
as it stands without even using the ODEou option. Of course, I might
have missed some potential race conditions and edge cases, but my
thread-pool/work-queue seemed to work fine after a few simple changes
(like adding an extra "tag" field so that stepping didn't conflict
with connected component detection and protecting the body list so
that pointers didn't get stomped when changed objects were moved to
the front of the list).

It's not the best solution. What if there would be just one island? It's the simplest but far not the best possible.

At least in the applications for which I use ODE, if memory allocation
fails, I'm out of luck whether it failed on the stack or on the heap.
I can't really hunt around for spare memory to free up and try again.
Granted, I'm not developing commercial products; so as far as I'm
concerned, crashing is as good as dying gracefully. I'd rather be
fast, and have the option of increasing the stack size as needed, I
think.

Please do some measurements with new sources and compare to 0.11.1 before blaming me for slowing down the code. If it will ne noticeably slower, I'll easily rollback all that.

Oleh Derevenko

jcooper

unread,

Jun 22, 2009, 12:44:45 PM6/22/09

to ode-users

> > It seems that if Oleh doesn't have/use dynamics simulations, he should
> > not be messing much in the dynamics code. Messing with the code and
> > then saying, "Well, I don't use this code; so someone else should test
> > it for me..." is not a good way to go.
>
> There are demos included with ODE. Is it not enough to run demos to be able to say if the code works or not after the changes?

I'm pretty sure that the demos currently included with ODE are not
exhaustive by any means, and it is rarely tenable to claim that
because a demo "runs" the code must therefore work. I haven't played
with the Unit Test framework that's included in the project. Maybe it
does the job. I dunno, but I kinda doubt it. However, I wasn't
questioning whether the new code actually works.

> And regarding performance measurement someone should help me with this.

It doesn't seem quite right to state that someone else should help
you justify your changes. You really should be able to benchmark/
profile your own code.

> > Honestly, I don't mind having alloca() in the code at all. As long as
> > it's clear that you need to make your stack large enough, it seems
> > like a nicer way to go.
>
> You never know what is "large enough".

But the same is true for the heap. It's guess it's common to have a
big heap instead of a stack, but that doesn't really make it better.

> > I added multi-
> > threaded island processing into dWorldStep without a lot of difficulty
> > as it stands without even using the ODEou option. Of course, I might
> > have missed some potential race conditions and edge cases, but my
> > thread-pool/work-queue seemed to work fine after a few simple changes
> > (like adding an extra "tag" field so that stepping didn't conflict
> > with connected component detection and protecting the body list so
> > that pointers didn't get stomped when changed objects were moved to
> > the front of the list).
>
> It's not the best solution. What if there would be just one island?

Clearly. I haven't suggested that anyone else should use it.
However, I never have just one island, but potentially hundreds of
small ones. With only four processors, the additional work of
creating a parallel LCP solver would have added no extra benefit for
me.

> It's the simplest but far not the best possible.

I would not say "by far". After all, the complexity of LCP solving
grows linearly with the number of islands but, in general, much faster
with the number of bodies/constraints in a single island. This
suggests that in a constant amount of time you could almost process
twice as many islands (of comparable complexity) if you had twice as
many processors working on them. However, the size of a single island
that could be processed in parallel would increase much more slowly
than the number of processors you threw at it, even if the matrix
solver didn't require a lot of inter-process communication overhead.
Naturally, there must be applications that will benefit greatly from
that small speed-up on a single, complex island. That doesn't mean
that an island stepping solution is tremendously inferior to a
parallel stepping solution since there are lots of applications that
can only be hurt by the parallel overhead of processing at such small
granularity.

Anyway, this discussion probably belongs in another thread.

> > At least in the applications for which I use ODE, if memory allocation
> > fails, I'm out of luck whether it failed on the stack or on the heap.
> > I can't really hunt around for spare memory to free up and try again.
> > Granted, I'm not developing commercial products; so as far as I'm
> > concerned, crashing is as good as dying gracefully. I'd rather be
> > fast, and have the option of increasing the stack size as needed, I
> > think.
>
> Please do some measurements with new sources and compare to
> 0.11.1 before blaming me for slowing down the code. If it will ne
> noticeably slower, I'll easily rollback all that.

I did not intend to sound accusatory. Your changes might make the
code run twice as fast. It would be nice to see and should be
investigated before the commitment is set in stone. I only meant to
voice my agreement with Daniel lest it should appear that this is only
a difference in opinion between two people and no one else cares what
ODE is like.

Personally, now is a rather poor time for me to be benchmarking other
code. If it's still an issue in a couple weeks, I can look into
creating some interesting benchmark demo cases.

Joseph

Oleh Derevenko

unread,

Jun 22, 2009, 1:01:47 PM6/22/09

to ode-...@googlegroups.com

Hi, Joseph

----- Original Message -----

> There are demos included with ODE. Is it not enough to run demos to be able to say if the code works or not after the changes?

I'm pretty sure that the demos currently included with ODE are not
exhaustive by any means, and it is rarely tenable to claim that
because a demo "runs" the code must therefore work. I haven't played
with the Unit Test framework that's included in the project. Maybe it
does the job. I dunno, but I kinda doubt it. However, I wasn't
questioning whether the new code actually works.

So, if you looked inside of quickstep/step implementation you could understand that if these functions ever run and produce correct result for anything they just can't produce incorrect result for anything else. There is no brancing in them - just the linear code flow.

> And regarding performance measurement someone should help me with this.

It doesn't seem quite right to state that someone else should help
you justify your changes. You really should be able to benchmark/
profile your own code.

I can estimate performance changes on my code just in my mind and it is enough for me to judge if it is wordth doing the change or not. If you need more precise estimation you are free to do it yourself. After all, if you are happy with current code why do you look at new commits at all? ;)

> > Honestly, I don't mind having alloca() in the code at all. As long as
> > it's clear that you need to make your stack large enough, it seems
> > like a nicer way to go.
>
> You never know what is "large enough".

But the same is true for the heap. It's guess it's common to have a
big heap instead of a stack, but that doesn't really make it better.

Heap usually can automatically grow up until all the free memory is used. For the stack that's not true. You have to reserve the necessary stack amount beforehand and you have to know how much stack you could potentially need.

> > I added multi-
> > threaded island processing into dWorldStep without a lot of difficulty
> > as it stands without even using the ODEou option. Of course, I might
> > have missed some potential race conditions and edge cases, but my
> > thread-pool/work-queue seemed to work fine after a few simple changes
> > (like adding an extra "tag" field so that stepping didn't conflict
> > with connected component detection and protecting the body list so
> > that pointers didn't get stomped when changed objects were moved to
> > the front of the list).
>
> It's not the best solution. What if there would be just one island?

Clearly. I haven't suggested that anyone else should use it.
However, I never have just one island, but potentially hundreds of
small ones. With only four processors, the additional work of
creating a parallel LCP solver would have added no extra benefit for
me.

It's specific to your problem. Somebody else could have different problem to solve where there could sometimes be several smaller islands or sometimes they could unite into one large island.

> > At least in the applications for which I use ODE, if memory allocation
> > fails, I'm out of luck whether it failed on the stack or on the heap.
> > I can't really hunt around for spare memory to free up and try again.
> > Granted, I'm not developing commercial products; so as far as I'm
> > concerned, crashing is as good as dying gracefully. I'd rather be
> > fast, and have the option of increasing the stack size as needed, I
> > think.
>
> Please do some measurements with new sources and compare to
> 0.11.1 before blaming me for slowing down the code. If it will ne
> noticeably slower, I'll easily rollback all that.

I did not intend to sound accusatory. Your changes might make the
code run twice as fast. It would be nice to see and should be
investigated before the commitment is set in stone. I only meant to
voice my agreement with Daniel lest it should appear that this is only
a difference in opinion between two people and no one else cares what
ODE is like.

Personally, now is a rather poor time for me to be benchmarking other
code. If it's still an issue in a couple weeks, I can look into
creating some interesting benchmark demo cases.

I suppose everybody would like to have benchmarking tests for ODE anyway, whether it would still be an issue for current change or not.

Oleh Derevenko

unread,

Jun 22, 2009, 4:17:26 PM6/22/09

to ode-...@googlegroups.com

Hi Joseph again,

----- Original Message -----

From: "jcooper"

> Please do some measurements with new sources and compare to
> 0.11.1 before blaming me for slowing down the code. If it will ne
> noticeably slower, I'll easily rollback all that.

I did not intend to sound accusatory. Your changes might make the
code run twice as fast. It would be nice to see and should be
investigated before the commitment is set in stone.

Unfortunately, my changes can't make the code run twice as fast. :) The most time is spent in the good old "asympthotic computational complexity" which did not change. So, taking into account that I have made some tuning to implementation inside the loop and have added some overhead outside of it I would expect that it will be a little bit slower for smal worlds (though it does not matter very much as small worlds are calculated quite quickly anyway) and approximately the same or a little bit faster for large worlds. In any case, the effect of changes is asympthotically negligible.

Spoke

unread,

Jun 22, 2009, 4:50:37 PM6/22/09

to ode-...@googlegroups.com

Hi!
I just test the current HEAD agaist HEAD-2 and I get arround 2 or 3
FPS without other changes in my current proyect with the new
implementation.

Anyone else have some numbers to share?

Jose A. Milan

2009/6/22, Oleh Derevenko <Oleh.De...@eleks.com>:

--
May the force be with you!

Daniel K. O.

unread,

Jun 22, 2009, 6:38:09 PM6/22/09

to ode-...@googlegroups.com

Oleh Derevenko escreveu:

> This should be in the docs then.
>
> Sorry, I've lost my access to Wiki. Could add it please? I can add it
> to the Doxygen comments though.

The wiki will be updated accordingly with what's in the doxygen docs.
But first the docs must exist.

> That's specious reasoning. I have a tiger-repellant rock that might
> interest you.
>
> My English is not good enough to understand that. :) Don't you
> believe me? I just remember that I had quickly put NULL in parameter
> somewhere in first version to have it compile.

See this:
http://www.youtube.com/watch?v=u2zXSaDFi7o

> Well, and what's wrong about keeping it outside? As you already
> mentioned you could share the same memory for several worlds which
> would otherwise need to allocate copies individually.

Then what about this?

dWorldShareWorkingMemory(dWorldID a, dWorldID b);
dWorldCleanupWorkingMemory(dWorldID w);
dWorldSetAllocationPolicy(dWorldID w, ...);

Done. No need for a "context" object anymore.

But what I would really prefer would be:

dWorldSetStepAllocator(dWorldID w, dWorldStepAllocator* a);

where you could write your own allocator if you wanted to do fancy
stuffs. It would be something like this:

struct dWorldStepAllocator {
int (*preallocate)(allocator, world, totalsize); // 0 = failure
void* (*allocate)(allocator, world, size);
void (*release)(allocator, world, ptr);

void (*ref)(allocator, world); // attached to a new world
void (*unref)(allocator, world); // dettached

void (*beginStep)(allocator, world); // before the step
void (*endStep)(allocator, world);// after the step
};

Want two worlds to share the same working memory? Just make your
allocator do so, and set the same allocator to both worlds. Want the
allocator to cache memory? Trivial. Want to know beforehand if there
will be enough memory? Set the preallocate member to a non-null function.

Then, with no allocator set, the world will be stepped with alloca-based
memory. The user can use malloc (or anything else) at runtime.

A single function, and a struct. Nothing to deprecate, no new entities
to be managed.

> No one program should crash. It does not matter how fast you do
> something if you fail to finish that job.

Most programs crash when memory allocation fails. One could already
switch to a malloc-based instead of alloca-based ODE if the stack limit
was being a problem.

> I disagree. The code base was unnecessary complex from the very
> beginning, you can't do it much complex any more. Don't count the
> lines.

If new abstract entities that needed to be managed and handled come up
in the API and internally, there's definitely more complexity.

When I objected against bringing mmap into ODE it was because I didn't
want a memory allocation implementation inside ODE. Why should I trust
the physics engine developers to do systems programming better than a
systems programmer (which can be either the authors of the C runtime, or
the application writer doing application-specific optimizations)?

My position is: I think committing all your changes to the trunk was a
bit premature. I don't think a memory allocator belongs to ODE, but
means to define your own are needed.

Oleh Derevenko

unread,

Jun 23, 2009, 4:29:23 AM6/23/09

to ode-...@googlegroups.com

Hi Daniel

----- Original Message -----

From: "Daniel K. O."

My position is: I think committing all your changes to the trunk was a
bit premature. I don't think a memory allocator belongs to ODE, but
means to define your own are needed.

That is the main purpose of SVN: to work with code base. You do not need to have each commit to be usable for making a release snapshot at once. You can commit something then you can change that - that's absolutely natural. Branches are for something experimental, which needs to be commited (to track changes and to avoid losing them) but does not work well or does not work at all yet.

No problem, I'll consider changing API as you suggest.

Oleh Derevenko

unread,

Jun 23, 2009, 5:35:52 AM6/23/09

to ode-...@googlegroups.com

Though I consider it incorrect to store runtime information as properties of world object. That's the incorrect OOD approach.

There could be many operations you could do with an object and some operations could require one set of runtime data the other operations could require the other set. These data do not belong to the object, they are not the attributes of the object and they should not be stored in object fields. Those are runtime attributes of operations and their lifecycle should be limited to those operations. They should be the method parameters.

If you already have the attributes that customize world stepping via world object fields - that's your problem and your mistake. In my opinion they all should have been moved to stepping function parameters (probably in the form of a preallocated structure, like dWorldStepReserveInfo, to avoid increasing number of stepping function parameters when anything needs to be added). But you are the admin of the project and I'm not goint to kill my nerves on arguing with you and proving anything to you. I can do it any way you like it.

Oleh Derevenko
-- ICQ: 36361783

----- Original Message -----

From: Oleh Derevenko

To: ode-...@googlegroups.com

Sent: Tuesday, June 23, 2009 11:29 AM

Subject: [ode-users] Re: Quickstep without alloca (final)

Daniel K. O.

unread,

Jun 24, 2009, 8:40:29 AM6/24/09

to ode-...@googlegroups.com

Oleh Derevenko escreveu:

> Though I consider it incorrect to store runtime information as
> properties of world object. That's the incorrect OOD approach. There
> could be many operations you could do with an object and some
> operations could require one set of runtime data the other operations
> could require the other set. These data do not belong to the object,
> they are not the attributes of the object and they should not be
> stored in object fields. Those are runtime attributes of operations
> and their lifecycle should be limited to those operations. They
> should be the method parameters.

By following your reasoning, there should be a
Stepper object that contains these attributes, and perform the simulation.

> If you already have the attributes that customize world stepping via
> world object fields - that's your problem and your mistake.

You mean, ERP, gravity, contact layer, auto disabling... every single
attribute of the World that is only used in stepping... should move
outside the World? The World exists to manage the simulation; but you
want to reduce it to nothing but a list of bodies and joints?

Actually, the list of bodies and joints is only useful for the stepping.
So let's move it too to the Stepper class. But wait, that leaves only
dWorldCreate and dWorldDestroy... The Stepper is now the World. So the
World is was the entity responsible for moving the simulation one step
ahead... who would have thought?

> But you are the
> admin of the project and I'm not goint to kill my nerves on arguing
> with you and proving anything to you. I can do it any way you like
> it.

Now, with all seriousness.
I'm as much the project admin as you are.
This is not a matter of preference. I think you are over-designing a
solution which will lower the usability of ODE in general.

An API should be tailored to create concepts for users to easily
understand. That's why we have joints, with axes and anchors, instead of
dCreateCosntraintJacobian().

Oleh Derevenko

unread,

Jun 27, 2009, 11:04:13 AM6/27/09

to ode-users

OK, here it is nice and shiny. :-) Enjoy! :-)
Thanks, Daniel. ;-)

======== from changelog ========
* New functions have been added:
- dWorldUseSharedWorkingMemory
- dWorldCleanupWorkingMemory
- dWorldSetStepMemoryReservationPolicy
- dWorldSetStepMemoryManager
======== from changelog ========

Oleh Derevenko
-- ICQ: 36361783

On 23 Чер, 01:38, "Daniel K. O." <danielko.lis...@gmail.com> wrote:
> Oleh Derevenko escreveu:

Daniel K. O.

unread,

Jun 27, 2009, 10:21:31 PM6/27/09

to ode-...@googlegroups.com

Oleh Derevenko escreveu:

> OK, here it is nice and shiny. :-) Enjoy! :-)
> Thanks, Daniel. ;-)
>
> ======== from changelog ========
> * New functions have been added:
> - dWorldUseSharedWorkingMemory
> - dWorldCleanupWorkingMemory
> - dWorldSetStepMemoryReservationPolicy
> - dWorldSetStepMemoryManager

Any reason for not using my second, more general, suggestion? The one
that requires only a single new function and a single struct, and has
the allocator "methods" know about the world, handle reference counting
for automatic sharing, etc, and most importantly, doesn't have any
allocator implementation built into ODE?

Just for completeness, here is again the struct I suggested:

struct dWorldStepAllocator {
int (*preallocate)(allocator, world, totalsize); // 0 = failure
void* (*allocate)(allocator, world, size);
void (*release)(allocator, world, ptr);

void (*ref)(allocator, world); // attached to a new world
void (*unref)(allocator, world); // dettached

void (*beginStep)(allocator, world); // before the step
void (*endStep)(allocator, world);// after the step
};

Let me describe the each "method" semantics; if any of them is NULL, it
will not be invoked. Except for the allocate method: alloca() will be
used instead.

preallocate: ODE will walk around all islands to compute the upper
bound, and call this method. The implementation can then allocate the
memory from the system; in case of failure, it can notify the
application, use a garbage collector, etc, and return 0. ODE will detect
this return status and not even try to step the world, returning
immediately from dWorldStep or dWorldQuickStep.

allocate: this is called multiple times from the island stepping code,
to allocate each array. There's nothing ODE can do in case of a failure,
so we can just require that this method should never fail (and let the
user perform this check).

release: releases the memory that was returned by allocate.

ref/unref: perform reference counting, if you want to share the same
allocator between multiple worlds. If the allocator is removed from the
world, or if the world is destroyed, the allocator will be unref()ed.

beginStep: the first thing done in a step. Can be used to acquire
resources or to start profiling the memory usage during a step.

endStep: the last thing called before the step finishes successfully.
Can be used to release memory that was cached during the step (e.g. if
the release method just returned the memory to a local memory pool).

This is how one would implement a malloc-based step allocator:

---
typedef dWorldStepAllocator A;
typedef dWorldID W;

void* malloc_allocate(A *a, W* w, unsigned size)
{
void *ptr = malloc(size);
assert(ptr);
return ptr;
}

void malloc_release(A* a, W* w, void *ptr)
{
free(ptr);
}

static dWorldStepAllocator mallocAllocator = {
0, // preallocate
malloc_allocate,
malloc_release
};

dWorldSetStepAllocator(world, &mallocAllocator);
---

With this approach, ODE doesn't need to know anything about sharing
memory between worlds. And I can replicate alloca's "near-zero" cost of
just incrementing an integer for each allocation (I, the ODE user, am
the one creating bodies and joints, so I can estimate how much memory I
need without requiring ODE to query that).

For proper allocator sharing between worlds (not necessarily to share
the memory allocated), the allocator methods definitely need pointers to
both world, and allocator objects.

Please comment on what you might think are the shortcomings of this
design. Preferably *before* you start coding yet another version, to
avoid wasting energy. This proposal already got a few refinements before
I posted it here. Hopefully you can check if it is missing anything.

Oleh Derevenko

unread,

Jun 28, 2009, 9:42:27 AM6/28/09

to ode-users

Let me explain how it is implemented internally.

There is a memory block of some size allocated from previous calls (or
there is not any if it is a first call). From that memory block the
memory is given on demand to the code by returning current allocation
tail address and adding area size to it to prepare for the next
allocation. This way a kind of virtual stack is created with the
exception that it grows to upper addresses and resides in a dynamic
memory block.
As a first step, algorithm splits bodies and joints of the worlds into
isles. I have changed original algorithm to first save all the bodies/
joints groupped by isles into one large array with numbers of bodies/
joints in each island being saved to another array instead of solving
each island immediately. This allows to estimate maximum memory size
that would be required for subsequent world stepping (the maximum by
all the isles). This all is done in context reallocation function.
Along with possibility to estimate memory size the bodies/joints
collected are preserved in the memory block to have them ready and
avoid building isles for the second time in the stepping function
itself. The first memory block is needed to be allocated to store this
body/joint distribution and island sizes. If there is any block
already available in context, it's used. If the block is of
insufficient size, it's freed (as no data is needed from previous
step) and new larger block is allocated. If there is no memory block
yet, it's just allocated from scratch. The necessary size can easily
be estimated from total number of bodies and joints in the world.
Now as all the bodies/joints are assigned to some isles, there is a
pass for all of them and for each island the joints are ispected and
approximate upper margin of required memory size is calculated. This
value is usually somewhat larger than the actual memory that would be
used by island but the main idea here is to obtain memory requirement
estimation with minimum extra computations (as quickly as possible).
After passing all the isles and calculating the maximum requirement
for all of them the additional memory size is obtained that must be
available in memory block after the current occupied area of isles'
bodies/joints and isles' sizes. If this or larger memory reserve is
already available in the block - it's great - it's just used as is. If
the memory already allocated is insufficient, the new block size is
calculated taking into account memory reservation policy (the memory
is allocated with extra reserve to allow for some additional world
growth on next steps without reallocations). However, it's necessary
to preserve bodies distribution by isles already available and on the
other hand it's necessary to avoid copying all the rest of current
block (which could be already quite large) during reallocation. To
solve this, a special technique is employed that instead of
reallocating existing block to a larger size, it's first shrunk to the
size of useful data in it and saved. Then another block of overall
required size is allocated to be used for world stepping. Then, after
the world stepping the old block with body/joint distribution is freed
completely and context remains with only the new, larger block. This
is the second and the last allocation might be needed for world
stepping. At this point context realloc call ends.
Then, if the memory could be allocated successfully the stepping
callback is invoked for each island.
As it was already mentioned the memory is given to the code by simply
incrementing allocated area end pointer. The memory does not need to
be released from code. And also, the allocated area end pointer can be
and is saved in some points along the code to return to previous
addresses and discard some arrays that could be needed in some local
code areas only. This way it's possible to reuse the same memory for
different data and decrease overall memory requirements. The
allocation order was specially rearranged in code to provide
possibility for this discarding of local scope data.

Now the conclusions.
1) The allocator object is almost unnecessary for external user. In
most cases standard dAlloc/dRealloc/dFree based on standard malloc/
realloc/free would be quite fine. The only reason I publised it to API
was possibility to have some specific implementation of realloc on
some system that would not make the memory remainder available for
subsequent allocations immediately when the size of a block is
decreased. In this case (or just by no reason) a library user can
employ mmap/munmap for allocation and unmap the remainder of the pages
explicitly (again if the target system supports that) instead of using
realloc.
2) The memory allocation used is faster and simpler to use by a client
(you just don't have to do anything at all - what could be even
simpler? ;)) than any malloc/free for each individual data block you
suggest. Even with memory pools (which would waste memory) even with
anything else you could think of. It's even faster and better than
alloca() itself because a) alloca has to probe each page by reading a
byte from it to detect stack overflow as it allocates while I don't
have to; b) alloca can't reuse memory already allocated in current
function unless the function is split into two while I only need to
save current allocation end pointer and then assign it back to context
later; c) alloca does not provide any means for necessary stack memory
reservation and memory allocation failure handling except for crashing
the program/raising hardware exception while I can easily check
everything and return boolean status.

The allocator structure you suggest is senseless to implement because
you can't achieve anything better with it. You can even hardly achieve
what is already there. And that would require lots of effort and
copmplex implementation from a user who tries to create that memory
manager. It will be for sure slower and will waste much more memory if
it would like to reuse anything.

The only concern I have about my current implementation is when there
are lots of very small worlds. With curent default memory reservation
policy of 64k minimum it could waste lots of memory in this case
unless working memory is shared for all worlds explicitly or user
explicitly assigns different reservation policy for each world
created.

Oleh Derevenko

Daniel K. O.

unread,

Jun 28, 2009, 11:51:34 AM6/28/09

to ode-...@googlegroups.com

Oleh Derevenko escreveu:

> Let me explain how it is implemented internally.

That's the problem. There shouldn't be any implementation at all. ONE
implementation fits ONE solution.

> Along with possibility to estimate memory size the bodies/joints
> collected are preserved in the memory block to have them ready and
> avoid building isles for the second time in the stepping function
> itself.

Island computation is fast, I don't think we need to cache it.

> After passing all the isles and calculating the maximum requirement
> for all of them the additional memory size is obtained that must be
> available in memory block after the current occupied area of isles'
> bodies/joints and isles' sizes. If this or larger memory reserve is
> already available in the block - it's great - it's just used as is. If
> the memory already allocated is insufficient, the new block size is
> calculated taking into account memory reservation policy (the memory
> is allocated with extra reserve to allow for some additional world
> growth on next steps without reallocations). However, it's necessary
> to preserve bodies distribution by isles already available and on the
> other hand it's necessary to avoid copying all the rest of current
> block (which could be already quite large) during reallocation. To
> solve this, a special technique is employed that instead of
> reallocating existing block to a larger size, it's first shrunk to the
> size of useful data in it and saved. Then another block of overall
> required size is allocated to be used for world stepping. Then, after
> the world stepping the old block with body/joint distribution is freed
> completely and context remains with only the new, larger block. This
> is the second and the last allocation might be needed for world
> stepping. At this point context realloc call ends.

Then this is just a classic memory allocator. Only one implementation,
that is, with specific characteristics, that might not cut for
everybody. One size doesn't fit all.

> As it was already mentioned the memory is given to the code by simply
> incrementing allocated area end pointer. The memory does not need to
> be released from code.

Why do you think so? I might need that memory back. But as ODE is
managing that memory, I can't retrieve it.

> Now the conclusions.
> 1) The allocator object is almost unnecessary for external user.

As is your current implementation. And the old alloca/malloc. The thing
is, you are changing the internal implementation without adding any
flexibility, just internal and external complexity.

> decreased. In this case (or just by no reason) a library user can
> employ mmap/munmap for allocation and unmap the remainder of the pages
> explicitly (again if the target system supports that) instead of using
> realloc.

Please tell me why it wouldn't be possible to implement such allocation
strategy with my suggestion?

> 2) The memory allocation used is faster and simpler to use by a client
> (you just don't have to do anything at all - what could be even
> simpler? ;))

And you also don't have control over the implementation.

> than any malloc/free for each individual data block you
> suggest.

No, I suggest the old malloc/free semantics can be replicated trivially
with my design. You can malloc on the prealloc method, and just do an
integer increment/decrement per allocation/deallocation. You can also
request memory from a garbage collector, and reuse memory from other
steps (and other worlds).

> Even with memory pools (which would waste memory) even with
> anything else you could think of.

Whoa.

> It's even faster and better than
> alloca() itself because

That's a bold claim, taken from your ass, I presume. You sound like a
conceited programmer who thinks his code is the greatest because he
wrote it himself.

> a) alloca has to probe each page by reading a
> byte from it to detect stack overflow as it allocates while I don't
> have to;

That does not correspond to my alloca implementation.

> b) alloca can't reuse memory already allocated in current
> function unless the function is split into two while I only need to
> save current allocation end pointer and then assign it back to context
> later;

And why does it matter? The algorithm is: allocate enough memory so the
function can run, then run.

> c) alloca does not provide any means for necessary stack memory
> reservation and memory allocation failure handling except for crashing
> the program/raising hardware exception while I can easily check
> everything and return boolean status.

Thus making alloca faster.

> The allocator structure you suggest is senseless to implement because
> you can't achieve anything better with it. You can even hardly achieve
> what is already there. And that would require lots of effort and
> copmplex implementation from a user who tries to create that memory
> manager. It will be for sure slower and will waste much more memory if
> it would like to reuse anything.

Another bold claim with nothing to back up. Can't you show a scenario
that can't be implemented in that design?

My design allows for the old alloca implementation to co-exist, if
that's the user wants; it allows switching to a malloc implementation
without having to recompile the library. And it allows for a block
allocator like yours, contrary to your claims. It allows for memory
reuse, if that's what the user wants; it allows a faster alternative, if
that's what's needed. It allows you to plug in any memory allocator or
garbage collector, if the user wants.

I stand by my position that ODE should not have a memory allocator; if
you are going to change how the memory is allocated, then define an API
to let the users control it. Don't just change the limiting semantics to
another limiting semantics.

Oleh Derevenko

unread,

Jun 28, 2009, 1:34:17 PM6/28/09

to ode-users

Hi Daniel

On 28 Чер, 18:51, "Daniel K. O." <danielko.lis...@gmail.com> wrote:

> Island computation is fast, I don't think we need to cache it.

Well, it's still a loopthrough all bodies and joints in the world with
necessity to save them into the memory array. Why do it twice if it is
possible to do it just once?

>
> Then this is just a classic memory allocator. Only one implementation,
> that is, with specific characteristics, that might not cut for
> everybody. One size doesn't fit all.

So the old way with memory allocated on stack could fit for everybody
and the new way with memory allocated just the same way in a dunamic
block does not. This sounds a bit strange to me.

> Why do you think so? I might need that memory back. But as ODE is
> managing that memory, I can't retrieve it.

No you can. You can retrieve that memory by calling
dWorldCleanupWorkingMemory() after each step. Since the memory could
be allocated with your custom allocator that might have been defined
with dWorldSetStepMemoryManager(), you'll get your memory back into
free_block() callback you might have provided and you'll be free to do
with it anything you like.

>
> > Now the conclusions.
> > 1) The allocator object is almost unnecessary for external user.
>
> As is your current implementation. And the old alloca/malloc. The thing
> is, you are changing the internal implementation without adding any
> flexibility, just internal and external complexity.

I'm improving internal implementation, not changing it.

>
> > decreased. In this case (or just by no reason) a library user can
> > employ mmap/munmap for allocation and unmap the remainder of the pages
> > explicitly (again if the target system supports that) instead of using
> > realloc.
>
> Please tell me why it wouldn't be possible to implement such allocation
> strategy with my suggestion?

Please compare your suggestion with what I need for the code to work
(the implementation description). I personally do not have any idea
how to implement all that with your allocator structure and what could
be the benefit from it. If you see anything there - go on and
implement it.
I have made that huge complex code to require zero to two memory
allocation calls per invocation regardless of what could be the
changes to world between the simulation steps. I've made internal
memory acquisition as fast as few integer operations, even faster than
alloca. What could be better? Why aren't you happy?

>
> > 2) The memory allocation used is faster and simpler to use by a client
> > (you just don't have to do anything at all - what could be even
> > simpler? ;))
>
> And you also don't have control over the implementation.

I sincerely do not understand what control do you need over the

implementation.

>
> > than any malloc/free for each individual data block you
> > suggest.
>
> No, I suggest the old malloc/free semantics can be replicated trivially
> with my design. You can malloc on the prealloc method, and just do an
> integer increment/decrement per allocation/deallocation. You can also
> request memory from a garbage collector, and reuse memory from other
> steps (and other worlds).

I have made it to require AT MOST TWO MEMORY ALLOCATIONS PER WHOLE
CALL. This is much better than trivial replacement of malloc/free
semantics. This is most likely the best what can be achieved at all.
And please note, you, as a developer, receive that already there in
ODE, without having to write any sophisticated memory allocators.

> > It's even faster and better than
> > alloca() itself because
>
> That's a bold claim, taken from your ass, I presume. You sound like a
> conceited programmer who thinks his code is the greatest because he
> wrote it himself.

Yor claim is bold and taken from ass, and my claim is based on
knowledge.

>
> > a) alloca has to probe each page by reading a
> > byte from it to detect stack overflow as it allocates while I don't
> > have to;
>
> That does not correspond to my alloca implementation.

Then if you allocate three pages with alloca while you have only one
page remaining you may miss the guard page at the end of the stack and
corrupt some data without even knowing it.

>
> > b) alloca can't reuse memory already allocated in current
> > function unless the function is split into two while I only need to
> > save current allocation end pointer and then assign it back to context
> > later;
>
> And why does it matter? The algorithm is: allocate enough memory so the
> function can run, then run.

You have to have that "enough" memory to be able to allocate it. And
with smaller size it's always more likely to happen. Also, please
remember old tales about the processor caches: the less your working
set is the more likely you'll have a cache hit on your data access.

>
> > c) alloca does not provide any means for necessary stack memory
> > reservation and memory allocation failure handling except for crashing
> > the program/raising hardware exception while I can easily check
> > everything and return boolean status.
>
> Thus making alloca faster.

No, not at all. Adding size to pointer to offset it to the next memory
block is as fast. And with that, you can check if you'll have enough
memory available while you allocate the memory container area itself.

>
> > The allocator structure you suggest is senseless to implement because
> > you can't achieve anything better with it. You can even hardly achieve
> > what is already there. And that would require lots of effort and
> > copmplex implementation from a user who tries to create that memory
> > manager. It will be for sure slower and will waste much more memory if
> > it would like to reuse anything.
>
> Another bold claim with nothing to back up. Can't you show a scenario
> that can't be implemented in that design?
>
> My design allows for the old alloca implementation to co-exist, if
> that's the user wants; it allows switching to a malloc implementation
> without having to recompile the library. And it allows for a block
> allocator like yours, contrary to your claims. It allows for memory
> reuse, if that's what the user wants; it allows a faster alternative, if
> that's what's needed. It allows you to plug in any memory allocator or
> garbage collector, if the user wants.

And it requires lots more of work debugging and bugfixing from user to
still have alloca-allocator which is worse (because it does not check
for available memory) or to still have malloc/free based allocator
which is also worse (because it's much slower and causes memory leaks
in ODE as its developers can't even design the code in a safe way so
that paired free was always called for malloc).

>
> I stand by my position that ODE should not have a memory allocator; if
> you are going to change how the memory is allocated, then define an API
> to let the users control it. Don't just change the limiting semantics to
> another limiting semantics.
>

You are free to implement anything you like or even revert my two
commits and leave everything as it was before.

Teravus Ovares

unread,

Jun 29, 2009, 3:09:01 AM6/29/09

to ode-...@googlegroups.com

On the memory manager item, I'm with Oleh. The point of it was to
allow a good way to prevent a stack collision. Heh, in one of my
shipped apps, I guesstimate the maximum amount of joints that a user
can create within the stack I set at compile time. (which turns out to
be somewhere between 8,000 and 15,000 joints with a specially compiled
windows dll, and with a script that sets ulimit to unlimited in Linux.
It's much less with the default stack size reservation).

It then counts up all of the joints in the step and stops creating
more when it hits the limit. This is a HUGE limitation for several
reasons.. large box stacks tends to create thousands of contacts.
Even with combining contacts, it limits box stacks to about 580
cubes. Throw in tri-mesh? Forget about it. Stack collision, game
over, do not pass go, do not collect $200. No error handling
opportunities for that.

Previously, the user could only create box stacks at ~580 cubes and
then it stack collided. In a multi-user, collaborative environment
with user created content, however, it is not desirable for a single
user to stack collide the environment on the other users. This is
affectionately called 'greifing' if done purposely to ruin another
person's experience in the collaborative environment.

Currently to deal with that possibility, after a number of joints are
created, it just doesn't create any more that step. This has the
negative side effect of causing objects to fall through the ground
when the maximum number of joints have been created. Additionally,
certain joints are required for specific functionality in the app.
AMotors and LMotors, for example, are the best way to get smooth
motion done. These also have to be counted and ignored if necessary.
Through a somewhat random process, most objects stay above ground even
when they miss one contact, but it's really hit and miss.. and when
it misses, it tends to be noticeable. A good comparison in real
world physics .. would be like a house sitting on the ground...
then when it gets demolished by a wrecking ball, half of the the mass
of the house in parts sinks into the ground and occupies the same
general space as the ground and other parts.. and 'fall asleep' via
some unknown quantum effect.... then as the bulldozer runs over
parts of the house, the parts of the house that it runs into that are
sharing the same space wake up and explode violently flipping the
bulldozer and causing the wrecking ball to hit the neighbor's house
also which then uses more contacts and the whole world sinks into the
ground... houses, bulldozers.. everything :D

It would be much nicer to be able to tune it and manage it runtime
rather then compile time. Maybe, this time, the complexity is
warranted.

I do like the idea of implementing the memory managing as a set of
virtual methods and having the current functionality as the reference
implementation. It would be hard to implement compatible methods
that extend the default functionality, but it would be far more
flexible.

Regards

Teravus

Oleh Derevenko

unread,

Jun 29, 2009, 8:29:32 AM6/29/09

to ode-...@googlegroups.com

Hi

----- Original Message -----

From: "Daniel K. O."

That's a bold claim, taken from your ass, I presume.

As an offtopic, in Ukrainian that sounds as "sucked out of a finger". ;)

gabriel

unread,

Jun 29, 2009, 8:50:39 AM6/29/09

to ode-users

With concern in Brazil, below the level now.

Wracky

unread,

Jul 1, 2009, 4:08:12 PM7/1/09

to ode-users

Hi Oleh,

Just wanted to say: great job! :D
Thanks for fixing this. As you might have seen, I've commented
on the stack-overflow problem before, and I'm really happy
to see you took care of it!

So thanks again for stepping up and fixing it! :)
Seriously cool.

Kind Regards,
Roel Binnendijk,
www.kalydo.com

> smime.p7s
> 3KViewDownload

Oleh Derevenko

unread,

Jul 1, 2009, 5:04:32 PM7/1/09

to ode-...@googlegroups.com

It's only QuickStep. Regular Step still uses stack and there is much more
code in it.

Reply all

Reply to author

Forward