I have a few comments.
Please include the revision so people know where to look when things
break. ;)
As you were going to touch a large part of the quickstep code, you
should have made 2 commits; one to astyle the original source, and one
for your patch. Now the diff shows a lot of changes that were just
formatting changes, and distract me from the real changes.
I don't like the semantics you introduced.
> * dWorldQuickStep re-implemented to avoid memory allocation on stack.
> Also several optimizations have been made to decrease memory
> requirements and optimize algorithm implementation of dWorldQuickStep.
> dWorldStep still remains with old memory allocation however new APIs
> mentioned below are fully functional for it.
> Both dWorldStep and dWorldQuickStep have been changed to return boolean
> success status.
What happens when they return 0? Can the simulation go on? Are the
objects in an undefined state?
> Both dWorldStep and dWorldQuickStep can now be used from multiple threads
> without serialization only if TLS is enabled for library with --enable-ou.
Nothing changed on step.cpp. Why didn't dWorldStep work before and works
now?
> New functions have been added:
> - dWorldStep2ContextRealloc
> - dWorldStep2
> - dWorldQuickStep2ContextRealloc
> - dWorldQuickStep2
> - dWorldStepContextFree
Why do I need to deal with a context? What is a context? Would I want to
share a context between worlds?
The docs say the world must not be changed between the realloc and step2
calls. Didn't you just create a sequential coupling?
> Also dWorldStep and dWorldQuickStep have been changed to be wrapers for
> these new APIs with an internal context stored statically in library.
And seem to have quite a bit of overhead added to them. I wonder if the
people who where arguing when I proposed the damping addition about the
cost of 3 multiplications will accept the extra loops and indirect
calls. Can you show us some numbers?
> dWorldStepCleanup and dWorldQuickStepCleanup functions have been added
> to free this internal context.
> From now on dWorldStep and dWorldQuickStep are to be considered obsolete.
why are they deprecated? Didn't you just change their semantics to
return a success status?
--
Daniel K. O.
"The only way to succeed is to build success yourself"
I clarified them in the items that followed.
> What happens when they return 0? Can the simulation go on? Are the
> objects in an undefined state?
>
> It means that memory allocation has failed. This leaves objects in
> unchanged state and you can retry as soon as more memory is
> available.
This should be in the docs then.
> Nothing changed on step.cpp. Why didn't dWorldStep work before and
> works now?
>
> Well, maybe because I just did not call all the functions from
> utils.cpp.
That's specious reasoning. I have a tiger-repellant rock that might
interest you.
> The context is abstract object that currently holds the memory
> preallocated for current step (or allocated at previous step). But
> similarly it can carry more meaning in the future.
So it's a poltergeist object. Why can't it live withing the world
object? We already set the quickstep parameters by setting them in the
world object.
> The docs say the world must not be changed between the realloc and
> step2 calls. Didn't you just create a sequential coupling?
>
> Yes, these functions must be called one right after another.
And don't you see anything wrong with that?
> I do not have simulations in my projects. This is why I posted patch
> file here for preview several weeks ago. The only person who did some
> tests (or who reported the results here) was John Hsu. Don't come
> down to counting individual multiplications, Daniel.
I don't want multiplications. I want to be able to tell how much slower
it is now. I doubt anyone struggling with the framerate will risk to
lower it even more.
> implementation that should fairly compensate that cost. After all,
> the program that just crashes in memory lack condition and you can do
> nothing about it can't be seriously considered better than a program
> that is little bit slower but can survive and continue computations
> or display a clear error message at least.
Some programs need speed, not memory.
> They are deprecated because they are just the wrappers now left for
> compatibility with existing code and they use static context variable
> that can't be efficiently controlled and they do not provide access
> to other features available in new APIs (like memory reservation
> policy and memory manager customizaton) and they would not provide
> access to the new features that could be added in the future (like
> multithreading support).
With this let me sum up my concerns: this contribution is unacceptable
as a whole.
It not only adds complexity to the code base (I was expecting you would
make the code shorter and clean; but it's now more than 30% longer), it
also makes the API more complex. Internal details are being
unnecessarily exposed to the users.
As John Hsu suggested, I think your changes should stay in a branch for
a while until things are ironed out.
----- Original Message -----Sent: Monday, June 22, 2009 12:40 PMSubject: [ode-users] Re: Quickstep without alloca (final)
> The docs say the world must not be changed between the realloc and
> step2 calls. Didn't you just create a sequential coupling?
>
> Yes, these functions must be called one right after another.
And don't you see anything wrong with that?You mean, why can't they be just joined together? Well, that's an interesting question. They should be separated for the reasons of symmetry. Since it's necessary to have separate context destruction function (dWorldStepContextFree) it's also logical to have a separate context allocation function. Otherwise it would look as you would be freeing some by-product, some waste of world stepping function. ;)
The wiki will be updated accordingly with what's in the doxygen docs.
But first the docs must exist.
> That's specious reasoning. I have a tiger-repellant rock that might
> interest you.
>
> My English is not good enough to understand that. :) Don't you
> believe me? I just remember that I had quickly put NULL in parameter
> somewhere in first version to have it compile.
See this:
http://www.youtube.com/watch?v=u2zXSaDFi7o
> Well, and what's wrong about keeping it outside? As you already
> mentioned you could share the same memory for several worlds which
> would otherwise need to allocate copies individually.
Then what about this?
dWorldShareWorkingMemory(dWorldID a, dWorldID b);
dWorldCleanupWorkingMemory(dWorldID w);
dWorldSetAllocationPolicy(dWorldID w, ...);
Done. No need for a "context" object anymore.
But what I would really prefer would be:
dWorldSetStepAllocator(dWorldID w, dWorldStepAllocator* a);
where you could write your own allocator if you wanted to do fancy
stuffs. It would be something like this:
struct dWorldStepAllocator {
int (*preallocate)(allocator, world, totalsize); // 0 = failure
void* (*allocate)(allocator, world, size);
void (*release)(allocator, world, ptr);
void (*ref)(allocator, world); // attached to a new world
void (*unref)(allocator, world); // dettached
void (*beginStep)(allocator, world); // before the step
void (*endStep)(allocator, world);// after the step
};
Want two worlds to share the same working memory? Just make your
allocator do so, and set the same allocator to both worlds. Want the
allocator to cache memory? Trivial. Want to know beforehand if there
will be enough memory? Set the preallocate member to a non-null function.
Then, with no allocator set, the world will be stepped with alloca-based
memory. The user can use malloc (or anything else) at runtime.
A single function, and a struct. Nothing to deprecate, no new entities
to be managed.
> No one program should crash. It does not matter how fast you do
> something if you fail to finish that job.
Most programs crash when memory allocation fails. One could already
switch to a malloc-based instead of alloca-based ODE if the stack limit
was being a problem.
> I disagree. The code base was unnecessary complex from the very
> beginning, you can't do it much complex any more. Don't count the
> lines.
If new abstract entities that needed to be managed and handled come up
in the API and internally, there's definitely more complexity.
When I objected against bringing mmap into ODE it was because I didn't
want a memory allocation implementation inside ODE. Why should I trust
the physics engine developers to do systems programming better than a
systems programmer (which can be either the authors of the C runtime, or
the application writer doing application-specific optimizations)?
My position is: I think committing all your changes to the trunk was a
bit premature. I don't think a memory allocator belongs to ODE, but
means to define your own are needed.
----- Original Message -----From: Oleh DerevenkoSent: Tuesday, June 23, 2009 11:29 AMSubject: [ode-users] Re: Quickstep without alloca (final)
By following your reasoning, there should be a
Stepper object that contains these attributes, and perform the simulation.
> If you already have the attributes that customize world stepping via
> world object fields - that's your problem and your mistake.
You mean, ERP, gravity, contact layer, auto disabling... every single
attribute of the World that is only used in stepping... should move
outside the World? The World exists to manage the simulation; but you
want to reduce it to nothing but a list of bodies and joints?
Actually, the list of bodies and joints is only useful for the stepping.
So let's move it too to the Stepper class. But wait, that leaves only
dWorldCreate and dWorldDestroy... The Stepper is now the World. So the
World is was the entity responsible for moving the simulation one step
ahead... who would have thought?
> But you are the
> admin of the project and I'm not goint to kill my nerves on arguing
> with you and proving anything to you. I can do it any way you like
> it.
Now, with all seriousness.
I'm as much the project admin as you are.
This is not a matter of preference. I think you are over-designing a
solution which will lower the usability of ODE in general.
An API should be tailored to create concepts for users to easily
understand. That's why we have joints, with axes and anchors, instead of
dCreateCosntraintJacobian().
Any reason for not using my second, more general, suggestion? The one
that requires only a single new function and a single struct, and has
the allocator "methods" know about the world, handle reference counting
for automatic sharing, etc, and most importantly, doesn't have any
allocator implementation built into ODE?
Just for completeness, here is again the struct I suggested:
struct dWorldStepAllocator {
int (*preallocate)(allocator, world, totalsize); // 0 = failure
void* (*allocate)(allocator, world, size);
void (*release)(allocator, world, ptr);
void (*ref)(allocator, world); // attached to a new world
void (*unref)(allocator, world); // dettached
void (*beginStep)(allocator, world); // before the step
void (*endStep)(allocator, world);// after the step
};
Let me describe the each "method" semantics; if any of them is NULL, it
will not be invoked. Except for the allocate method: alloca() will be
used instead.
preallocate: ODE will walk around all islands to compute the upper
bound, and call this method. The implementation can then allocate the
memory from the system; in case of failure, it can notify the
application, use a garbage collector, etc, and return 0. ODE will detect
this return status and not even try to step the world, returning
immediately from dWorldStep or dWorldQuickStep.
allocate: this is called multiple times from the island stepping code,
to allocate each array. There's nothing ODE can do in case of a failure,
so we can just require that this method should never fail (and let the
user perform this check).
release: releases the memory that was returned by allocate.
ref/unref: perform reference counting, if you want to share the same
allocator between multiple worlds. If the allocator is removed from the
world, or if the world is destroyed, the allocator will be unref()ed.
beginStep: the first thing done in a step. Can be used to acquire
resources or to start profiling the memory usage during a step.
endStep: the last thing called before the step finishes successfully.
Can be used to release memory that was cached during the step (e.g. if
the release method just returned the memory to a local memory pool).
This is how one would implement a malloc-based step allocator:
---
typedef dWorldStepAllocator A;
typedef dWorldID W;
void* malloc_allocate(A *a, W* w, unsigned size)
{
void *ptr = malloc(size);
assert(ptr);
return ptr;
}
void malloc_release(A* a, W* w, void *ptr)
{
free(ptr);
}
static dWorldStepAllocator mallocAllocator = {
0, // preallocate
malloc_allocate,
malloc_release
};
dWorldSetStepAllocator(world, &mallocAllocator);
---
With this approach, ODE doesn't need to know anything about sharing
memory between worlds. And I can replicate alloca's "near-zero" cost of
just incrementing an integer for each allocation (I, the ODE user, am
the one creating bodies and joints, so I can estimate how much memory I
need without requiring ODE to query that).
For proper allocator sharing between worlds (not necessarily to share
the memory allocated), the allocator methods definitely need pointers to
both world, and allocator objects.
Please comment on what you might think are the shortcomings of this
design. Preferably *before* you start coding yet another version, to
avoid wasting energy. This proposal already got a few refinements before
I posted it here. Hopefully you can check if it is missing anything.
That's the problem. There shouldn't be any implementation at all. ONE
implementation fits ONE solution.
> Along with possibility to estimate memory size the bodies/joints
> collected are preserved in the memory block to have them ready and
> avoid building isles for the second time in the stepping function
> itself.
Island computation is fast, I don't think we need to cache it.
> After passing all the isles and calculating the maximum requirement
> for all of them the additional memory size is obtained that must be
> available in memory block after the current occupied area of isles'
> bodies/joints and isles' sizes. If this or larger memory reserve is
> already available in the block - it's great - it's just used as is. If
> the memory already allocated is insufficient, the new block size is
> calculated taking into account memory reservation policy (the memory
> is allocated with extra reserve to allow for some additional world
> growth on next steps without reallocations). However, it's necessary
> to preserve bodies distribution by isles already available and on the
> other hand it's necessary to avoid copying all the rest of current
> block (which could be already quite large) during reallocation. To
> solve this, a special technique is employed that instead of
> reallocating existing block to a larger size, it's first shrunk to the
> size of useful data in it and saved. Then another block of overall
> required size is allocated to be used for world stepping. Then, after
> the world stepping the old block with body/joint distribution is freed
> completely and context remains with only the new, larger block. This
> is the second and the last allocation might be needed for world
> stepping. At this point context realloc call ends.
Then this is just a classic memory allocator. Only one implementation,
that is, with specific characteristics, that might not cut for
everybody. One size doesn't fit all.
> As it was already mentioned the memory is given to the code by simply
> incrementing allocated area end pointer. The memory does not need to
> be released from code.
Why do you think so? I might need that memory back. But as ODE is
managing that memory, I can't retrieve it.
> Now the conclusions.
> 1) The allocator object is almost unnecessary for external user.
As is your current implementation. And the old alloca/malloc. The thing
is, you are changing the internal implementation without adding any
flexibility, just internal and external complexity.
> decreased. In this case (or just by no reason) a library user can
> employ mmap/munmap for allocation and unmap the remainder of the pages
> explicitly (again if the target system supports that) instead of using
> realloc.
Please tell me why it wouldn't be possible to implement such allocation
strategy with my suggestion?
> 2) The memory allocation used is faster and simpler to use by a client
> (you just don't have to do anything at all - what could be even
> simpler? ;))
And you also don't have control over the implementation.
> than any malloc/free for each individual data block you
> suggest.
No, I suggest the old malloc/free semantics can be replicated trivially
with my design. You can malloc on the prealloc method, and just do an
integer increment/decrement per allocation/deallocation. You can also
request memory from a garbage collector, and reuse memory from other
steps (and other worlds).
> Even with memory pools (which would waste memory) even with
> anything else you could think of.
Whoa.
> It's even faster and better than
> alloca() itself because
That's a bold claim, taken from your ass, I presume. You sound like a
conceited programmer who thinks his code is the greatest because he
wrote it himself.
> a) alloca has to probe each page by reading a
> byte from it to detect stack overflow as it allocates while I don't
> have to;
That does not correspond to my alloca implementation.
> b) alloca can't reuse memory already allocated in current
> function unless the function is split into two while I only need to
> save current allocation end pointer and then assign it back to context
> later;
And why does it matter? The algorithm is: allocate enough memory so the
function can run, then run.
> c) alloca does not provide any means for necessary stack memory
> reservation and memory allocation failure handling except for crashing
> the program/raising hardware exception while I can easily check
> everything and return boolean status.
Thus making alloca faster.
> The allocator structure you suggest is senseless to implement because
> you can't achieve anything better with it. You can even hardly achieve
> what is already there. And that would require lots of effort and
> copmplex implementation from a user who tries to create that memory
> manager. It will be for sure slower and will waste much more memory if
> it would like to reuse anything.
Another bold claim with nothing to back up. Can't you show a scenario
that can't be implemented in that design?
My design allows for the old alloca implementation to co-exist, if
that's the user wants; it allows switching to a malloc implementation
without having to recompile the library. And it allows for a block
allocator like yours, contrary to your claims. It allows for memory
reuse, if that's what the user wants; it allows a faster alternative, if
that's what's needed. It allows you to plug in any memory allocator or
garbage collector, if the user wants.
I stand by my position that ODE should not have a memory allocator; if
you are going to change how the memory is allocated, then define an API
to let the users control it. Don't just change the limiting semantics to
another limiting semantics.