Re: [PATCH 0/3] cell_management: fix logic in prologue

5 views
Skip to first unread message

Jan Kiszka

unread,
Jul 17, 2015, 3:40:05 AM7/17/15
to Veaceslav Falico, jailho...@googlegroups.com, Henning Schild
On 2015-07-17 09:24, Veaceslav Falico wrote:
> On Fri, Jul 17, 2015 at 08:29:46AM +0200, Jan Kiszka wrote:
>> On 2015-07-16 14:08, Jan Kiszka wrote:
>>> On 2015-07-16 13:47, Veaceslav Falico wrote:
>>>> Hi,
>>>>
>>>> I'm only getting starting to learn the code and I don't understand the
>>>> logic completely, so this patchset might be complete garbage.
>>>>
>>>> Anyway, hypervisor-side cell_management_prologue() verification for
>>>> task
>>>> actions and their permissions strikes quite odd - both that it seems
>>>> to be
>>>> buggy for the CELL_DESTROY case and that it lacks CELL_START
>>>> verification.
>>>>
>>>> This patchset tries to address both and introduce a macro for
>>>> current and
>>>> future use - hopefully it gets along the logic of online/offline cells.
>>>>
>>>> The CELL_START verification bug can be easily reproduced:
>>>>
>>>> sudo tools/jailhouse enable configs/qemu-vm.cell
>>>> sudo tools/jailhouse cell create configs/apic-demo.cell
>>>> sudo tools/jailhouse cell load apic-demo
>>>> inmates/demos/x86/apic-demo.bin -a 0xf0000
>>>> sudo tools/jailhouse cell start apic-demo
>>>> sudo tools/jailhouse cell start apic-demo
>>>> sudo tools/jailhouse cell start apic-demo
>>>>
>>>> After the 3rd start it hangs for me completely.
>>>
>>> Yeah, good catch - confirmed.
>>>
>>>>
>>>> It's based on the most up-to-date next branch.
>>>>
>>>> Hope that the fixes make sense :).
>>>
>>> Will look into them ASAP. Thanks already!
>>
>> OK, understood the effect now. This pattern used to work for the
>> apic-demo until ef3b370c64 (SMP support). Since then, the bootstrap code
>> uses some global variables to define the CPU ID and type (BP vs. AP).
>> This adds states to the bootstrap process, thus we can no longer restart
>> the cell without reloading it first (to reset the states).
>
> And that's exaclty why I think that there shouldn't be an implied restart
> on start (and shutdown on set_loadable) command - because of these kind
> of bugs.
>
> Anyway, understood, and it's definitely not a priority... but:
>
> Maybe there should be a more clean devision between (implied) tasks and
> what they do?
>
> currently both SET_LOADABLE and START will shut down the running (!) cell,
> which doesn't strike me as something intuitive.
>
> What do you think about the following design:
>
> --------------------
> CREATE: creates a cell from cell_desc if one doesn't exist already.
> DESTROY: destroys a cell if it's offline (implies UNLOAD)
>
> LOAD: maps regions to the cell if it's not mapped already and the cell is
> offline, copies the image
> UNLOAD: unmaps regions if the cell is offline
>
> START: starts a cell if it's offline and loaded
> STOP: shuts down a cell if it's online
> RESTART: STOP && START
>
> FLUSH (?): memzeros the loadable memory if it's offline and loaded
> --------------------
>
> This way there'll be a lot more flexibility, more defined behaviour and
> easier to understand/get into it.

I fact, the current interface is already flexible enough to handle all
the (currently known) use cases - but it does intentionally avoid too
much policing about how the APIs are used. And we try to keep the
interface small.

So we don't copy cell memory in the hypervisor but offloaded this to the
Linux kernel driver. Keeps the hypervisor simpler without losing
functionality. But even if we copy in the hypervisor, we would still not
know if the root cell copied the right data, so we could still lock up
like we do now.

To address both the current issue as well as other mistakes (I still
tend to forget specifying -a 0xf0000 on x86 loads once in a while e.g.),
we need the timeout to recover from unavoidable crashes of cells, even
of those with higher privileges.

That said, a lot more could be done in a higher management layer, higher
than the jailhouse tool because that is stateless.

>
>>
>> The real problem is now related to the TODO item "cell software watchdog
>> via comm region messages": The apic-demo participates in the comm region
>> protocol, thus the hypervisor expects an answer on the first restart
>> request. But as the cell is now stuck during boot, there won't be an
>> answer, and the hypervisor locks up the root cell.
>>
>> The answer to this is what is listed under that TODO item: adding a
>> timeout to the communication region protocol and fail a cell that does
>> not react in time. The only challenge of this is that the hypervisor has
>> no notion of time so far.
>
> The easiest solution I see is just to add a watchdog to the root cell and a
> new hypercall _TICK or whatever - it'll keep the base small, more or less
> reliable and it's easy to implement :).
>
> What do you think?

If the hypervisor is supposed to declare a cell dead that is not
reacting within a specific time, the hypervisor also has to be able to
retrieve that time on its own (we don't want to rely on the root cell in
most scenarios after startup). Moreover, while we are reconfiguring and
eventually waiting on a non-root cell to confirm this, the root cell is
stopped (that's part of the locking architecture). So we can't use a
root cell-based tick for that case anyway.

Jan

--
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
Reply all
Reply to author
Forward
0 new messages