ISR and schd::preemptable problems in resolve_pltgot

17 views
Skip to first unread message

Rick Payne

unread,
Aug 5, 2020, 6:00:27 PM8/5/20
to OSv Development

I've been noodling with the old 'assigned virtio' code, trying to make
it work again so I can use this method to get raw packets. See
discusion on the 'raw sockets' thread.

I'm mostly there (though I will admit, I'm far from a C++ programmer,
so the code is pretty hideous), but I ran into an issue with the
interrupt routine crashing. It turns out that if the interrupt routine
is in my .so file, then the symbol resolution is not done at load time,
rather its done on first call.

I was hitting an assert in resolve_pltgot. I remember having a similar
issue a while ago (2016!) - back then it was a bug in
pthread_spin_lock. In this case, because the symbols aren't resolved on
loading of the .so file, the first time the isr is called, we try to
solve the missing symbols. However, we are not sched::preemptable() at
that point, so the assert fires.

I tried the existing isr code, copying it into my code. The wait part
looks like this:

void receiver(virtio::vring *vq)
{
while (1) {
sched::thread::wait_until([&] {
bool have_elements = vq->used_ring_not_empty();
if (!have_elements) {
vq->enable_interrupts();

// we must check that the ring is not empty *after*
// we enable interrupts to avoid a race where a packet
// may have been delivered between queue->used_ring_not_empty()
// and queue->enable_interrupts() above
have_elements = vq->used_ring_not_empty();
if (have_elements) {
vq->disable_interrupts();
}
}
return have_elements;
});
...


sched::thread::wait_until and sched::thread::stop_wait are not
resolved. I changed resolve_pltgot to what was suggested back in 2016:

diff --git a/core/elf.cc b/core/elf.cc
index ffb16004..5cdf1ec2 100644
--- a/core/elf.cc
+++ b/core/elf.cc
@@ -816,7 +816,7 @@ void object::relocate_pltgot()

void* object::resolve_pltgot(unsigned index)
{
- assert(sched::preemptable());
+ // assert(sched::preemptable());
auto rel = dynamic_ptr<Elf64_Rela>(DT_JMPREL);
auto slot = rel[index];
auto info = slot.r_info;
@@ -826,6 +826,14 @@ void* object::resolve_pltgot(unsigned index)
void *addr = _base + slot.r_offset;
auto sm = symbol(sym);

+ if (!sched::preemptable()) {
+ auto nameidx = (dynamic_ptr<Elf64_Sym>(DT_SYMTAB) + sym)-
>st_name;
+ auto name = dynamic_ptr<const char>(DT_STRTAB) + nameidx;
+ std::cerr << "resolve_pltgot " << demangle(name) <<
+ " in " << pathname() << " found in '" <<
+ sm.obj->pathname() << "'\n";
+ }
+
if (sm.obj != this) {
WITH_LOCK(_used_by_resolve_plt_got_mutex) {
_used_by_resolve_plt_got.insert(sm.obj-
>shared_from_this());

This no longer asserts, and shows the symbols being resolved:

resolve_pltgot _ZN5sched6thread9stop_waitEv
(sched::thread::stop_wait()) in /tests/tst-assign-virtio.so found in ''
Handling packets
resolve_pltgot _ZN6virtio5vring17enable_interruptsEv
(virtio::vring::enable_interrupts()) in /tests/tst-assign-virtio.so
found in ''
resolve_pltgot _ZN5sched6thread4waitEv (sched::thread::wait()) in
/tests/tst-assign-virtio.so found in ''

So what is the appropriate solution here? I could try and arrange for
my code to be linked with "-Wl,-z,now" but that feels like quite a big
hammer (as that could then apply to the whole app?)

OTOH, it feels unnatural to be relying on symbol lookups in an isr. If
resolve_pltgot really does not need the assert check (whch was
suggested to be the case in 2016), maybe there is nothing to do here
other than comment out an assert?

Probably I should arrange the code so that I can call
virtio_driver::wait_for_queue, but at the moment, I can't. I don't
think we should assume that the user's code would want that - or do we?

Thoughts?

Rick

Nadav Har'El

unread,
Aug 6, 2020, 4:02:23 AM8/6/20
to Rick Payne, OSv Development
The assert was there to make a rare random crash into a reliable crash. The thing is
that symbol resolution needs to take a mutex, because in very rare cases it might be
running in parallel to someone loading an additional shared object. This taking of a mutex
may block the thread, and when in non-preemptable state, will cause a crash at that
point. So the idea of the assert is to make you aware of this problem much earlier,
before you start seeing it once a year in the field.

This was explained in the commit that added this assert, back in 2014: 930cb83d9.

The "z,now" solution will work, but there is another problem you should be aware of:
When user code is running in non-preemptable state, both the code's stack, and the code
itself, must already be in memory and cannot be swapped-in. The latter in particular means
that the executable must be entirely loaded into memory, not page by page ("demand paging")
as we normally do.

We have a trick that will do both the on-load symbol resolution and loading the entire object
and not page by page: If you add:

asm(".pushsection .note.osv-mlock, \"a\"; .long 0, 0, 0; .popsection");

To your source code, both things will auto-magically happen :-)
We have in <osv/elf.hh> a macro doing this, OSV_ELF_MLOCK_OBJECT(), but you don't really need that macro, you can just copy this line.

I suggest you use this trick, instead of the "z,now" thing.

Unless you executable is huge, I don't think you'll have problems doing this for the entire executable (you only need this "section" thing in one place in your source code), but if that bothers you, you can always put the problematic code in a separate shared object, and only that object will be marked with this special section.
 

Probably I should arrange the code so that I can call
virtio_driver::wait_for_queue, but at the moment, I can't. I don't
think we should assume that the user's code would want that - or do we?

Thoughts?

Rick

--
You received this message because you are subscribed to the Google Groups "OSv Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to osv-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/7fa151d59007a824119e83d6906f64ea1f94fe3d.camel%40rossfell.co.uk.

Rick Payne

unread,
Aug 6, 2020, 4:40:19 AM8/6/20
to Nadav Har'El, OSv Development
On Thu, 2020-08-06 at 11:02 +0300, Nadav Har'El wrote:

> We have a trick that will do both the on-load symbol resolution and
> loading the entire object
> and not page by page: If you add:
>
> asm(".pushsection .note.osv-mlock, \"a\"; .long 0, 0, 0;
> .popsection");
>
> To your source code, both things will auto-magically happen :-)
> We have in <osv/elf.hh> a macro doing this, OSV_ELF_MLOCK_OBJECT(),
> but you don't really need that macro, you can just copy this line.
>
> I suggest you use this trick, instead of the "z,now" thing.

Thanks for the great explaination - and yes, that does indeed sort the
issue, thanks ;)

> Unless you executable is huge, I don't think you'll have problems
> doing this for the entire executable (you only need this "section"
> thing in one place in your source code), but if that bothers you, you
> can always put the problematic code in a separate shared object, and
> only that object will be marked with this special section.

Understood. I'll see how we get on, but we have your backup plan
anyway.

Thanks again. I'll try and get my patch into something that could be
applied, as I know there are others looking to do similar things.

cheers,
Rick


Nadav Har'El

unread,
Aug 6, 2020, 6:44:28 AM8/6/20
to Rick Payne, OSv Development
On Thu, Aug 6, 2020 at 11:02 AM Nadav Har'El <n...@scylladb.com> wrote:
The assert was there to make a rare random crash into a reliable crash. The thing is
that symbol resolution needs to take a mutex, because in very rare cases it might be
running in parallel to someone loading an additional shared object. This taking of a mutex
may block the thread, and when in non-preemptable state, will cause a crash at that
point. So the idea of the assert is to make you aware of this problem much earlier,
before you start seeing it once a year in the field.

By the way, if this problem really bothered us, it should be possible with relatively small effort to
make symbol-resolution lock-free. We actually have the list of objects protected by RCU,
not a mutex (see commit 68afb68ee84769db064949839ee50bff08145c6e)
so it is already possible to do most of the lookup in a lock-free manner, and we just
need to replace the last remaining mutexes in the resolv_pltgot code, which might not
be difficult.

Rick Payne

unread,
Aug 6, 2020, 5:54:50 PM8/6/20
to Nadav Har'El, OSv Development

> On 6 Aug 2020, at 21:00, Nadav Har'El <n...@scylladb.com> wrote:
>
> By the way, if this problem really bothered us, it should be possible with relatively small effort to
> make symbol-resolution lock-free. We actually have the list of objects protected by RCU,
> not a mutex (see commit 68afb68ee84769db064949839ee50bff08145c6e)
> so it is already possible to do most of the lookup in a lock-free manner, and we just
> need to replace the last remaining mutexes in the resolv_pltgot code, which might not
> be difficult.

Noted, but as you says, suspect its not an issue for what I’m doing. If it becomes an issue, I’ll come back to you for pointers on how to progress this fix ;)

Cheers,
Rick
Reply all
Reply to author
Forward
0 new messages