Re: Some questions about OSv


Waldek Kozaczuk

Nov 28, 2023, 1:20:06 AM
to Yueyang Pan, OSv Development
Hi,

It is great to hear from you. Please see my answers below. 

I hope you also do not mind that I reply to the group, so others may add something extra or refine/correct my answers, as I am not an original developer/designer of OSv.

On Fri, Nov 24, 2023 at 8:50 AM Yueyang Pan <yueya...@epfl.ch> wrote:
Dear Waldemar Kozaczuk,
    I am Yueyang Pan from EPFL. Currently I am working on a project about remote memory and trying to develop a prototype based on OSv. I am also the person who raised the questions on the Google group several days ago. For that question, I made a workaround by adding my own stats class which records the sum and count, because what I need is the average. Now I have some further questions. They are probably a bit dumb for you, but I would be very grateful if you could spend a little bit of time to give me some suggestions.

The tracepoints use ring buffers of fixed size, so eventually all old tracepoints get overwritten by new ones. I think you can either increase the size or use the approach taken by the script freq.py (you need to add the module httpserver-monitoring-api). There is also newly added (though experimental) strace-like functionality (see https://github.com/cloudius-systems/osv/commit/7d7b6d0f1261b87b678c572068e39d482e2103e4). Finally, you may find the comments on this issue relevant - https://github.com/cloudius-systems/osv/issues/1261#issuecomment-1722549524. I am also sure you have come across this wiki page - https://github.com/cloudius-systems/osv/wiki/Trace-analysis-using-trace.py.
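
To make this concrete, here is a minimal, hypothetical sketch of adding a tracepoint to kernel code, following the TRACEPOINT macro pattern from include/osv/trace.hh (the tracepoint name and the evict_page() function are made up for illustration):

    #include <osv/trace.hh>

    // Declares a new tracepoint: name, printf-like format, argument types.
    TRACEPOINT(trace_myapp_page_evict, "addr=%p len=%d", void*, size_t);

    void evict_page(void* addr, size_t len)
    {
        // Records an event in the ring buffer when tracing is enabled,
        // or just bumps a counter when only counting is enabled.
        trace_myapp_page_evict(addr, len);
        // ... actual eviction work ...
    }

Once compiled in, such a tracepoint can be enabled by name through run.py's tracing options or counted with freq.py like any built-in one.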

    Now, after my profiling, I found the global mutex tlb_flush_mutex to be hot in my benchmark, so I am trying to remove it, but that turns out to be a bit hard without understanding the threading model of OSv. So I would like to ask whether there is any high-level doc that describes what the scheduling policy of OSv is, how the priorities of the threads are decided, whether we can disable preemption or not (the functionality of preempt_lock), and the design of the synchronisation primitives (for example, why it is not allowed to have preemption disabled inside lockfree::mutex). I am trying to understand by reading the code directly, but it would be really helpful if there were some material that describes the design.

If indeed your "hot" spot is around tlb_flush_mutex (used by flush_tlb_all()), then I am guessing your program does a lot of mmap/munmap (see the unpopulate class in core/memory.cc that uses tlb_gather). I am not familiar with the details of what tlb_gather exactly does, but it probably forces the TLB (Translation Lookaside Buffer) to flush old virtual/physical memory mapping entries after unmapping. The mmu::flush_tlb_all() is actually used in more places.
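
For context, a tiny made-up userspace workload of the kind that would make this path hot might look like the following; every munmap of a previously touched range eventually forces a TLB shoot-down on the other CPUs:

    #include <sys/mman.h>
    #include <cstddef>
    #include <thread>
    #include <vector>

    int main()
    {
        std::vector<std::thread> workers;
        for (int i = 0; i < 4; i++) {
            workers.emplace_back([] {
                constexpr size_t len = 1 << 20;   // 1MB per iteration
                for (int j = 0; j < 100000; j++) {
                    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                    if (p == MAP_FAILED) {
                        continue;
                    }
                    // Touch each page so the mapping actually gets populated.
                    for (size_t off = 0; off < len; off += 4096) {
                        static_cast<char*>(p)[off] = 1;
                    }
                    munmap(p, len);   // the unmap is what ends up flushing TLBs
                }
            });
        }
        for (auto& t : workers) {
            t.join();
        }
    }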

My wild suggestion would be to try to convert tlb_flush_mutex to a spinlock (see include/osv/spinlock.h and core/spinlock.cc). It is a bit of a controversial idea, as OSv prides itself on lock-less structures and almost no spinlocks are used (the console initialization is the only place left). But in some places (see https://github.com/cloudius-systems/osv/issues/853#issuecomment-279215964, https://github.com/cloudius-systems/osv/commit/f8866c0dfd7ca1fcb4b2d9a280946878313a75d3 and https://groups.google.com/g/osv-dev/c/4wMAHCs7_dk/m/1LHdvmoeBwAJ) we may benefit from them.

Please note that the lock-less sched::thread::wait_until at the end of flush_tlb_all would need to be replaced with a "busy" wait/sleep.
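
To make the suggestion concrete, a rough sketch of the experiment could look like the code below. It assumes the lock lives next to flush_tlb_all() and that OSv's spinlock type from include/osv/spinlock.h exposes plain lock()/unlock(); the pending counter and the busy loop merely stand in for the sched::thread::wait_until call mentioned above, so treat this as an outline rather than the actual OSv code:

    #include <osv/spinlock.h>
    #include <atomic>

    static spinlock tlb_flush_spinlock;              // was: mutex tlb_flush_mutex
    static std::atomic<int> tlb_flush_pending{0};    // CPUs that still have to flush

    void flush_tlb_all_sketch(int other_cpus)
    {
        tlb_flush_spinlock.lock();
        tlb_flush_pending.store(other_cpus, std::memory_order_release);
        // ... send IPIs asking the other CPUs to flush their TLBs; each IPI
        //     handler flushes locally and decrements tlb_flush_pending ...
        while (tlb_flush_pending.load(std::memory_order_acquire) > 0) {
            // busy wait - we cannot sleep while holding a spinlock
        }
        tlb_flush_spinlock.unlock();
    }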

As for information on mutexes and scheduling, the best material can be found in the original OSv paper - https://www.usenix.org/conference/atc14/technical-sessions/presentation/kivity. See also https://github.com/cloudius-systems/osv/wiki/Components-of-OSv and many other wiki pages.

As for your preemption question - the lock-free mutex needs preemption to be on. Imagine we have a single CPU and a thread ends up in the wait state while trying to acquire the lock: it would eventually need to be switched out in favour of another thread that releases the lock. But if preemption is off, the scheduler would keep coming back to the same waiting thread on each timer event, and our original thread would never acquire the lock.
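
As a made-up illustration of that constraint (none of this is actual OSv code, and bad_idea() is obviously hypothetical), the forbidden pattern is taking a sleeping mutex with preemption disabled:

    #include <osv/mutex.h>
    #include <osv/sched.hh>

    mutex m;   // OSv's default (lock-free) sleeping mutex

    void bad_idea()
    {
        sched::preempt_disable();
        // If another thread currently holds m, lock() has to put this thread
        // to sleep and let the scheduler run the owner so it can unlock. With
        // preemption disabled on a single-CPU guest the owner can never run,
        // so we would wait forever - hence the rule that the lock-free mutex
        // must be taken with preemption enabled.
        m.lock();
        m.unlock();
        sched::preempt_enable();
    }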

I hope all this helps.

Waldek

    Thanks in advance for any advice you could provide. The questions may be a bit dumb so pardon me if I disturb you.
    Best Wishes
    Pan

Nadav Har'El

Nov 28, 2023, 2:30:00 AM
to Waldek Kozaczuk, Yueyang Pan, OSv Development
On Tue, Nov 28, 2023 at 8:20 AM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
Hi,

It is great to hear from you. Please see my answers below. 

I hope you also do not mind that I reply to the group, so others may add something extra or refine/correct my answers, as I am not an original developer/designer of OSv.

On Fri, Nov 24, 2023 at 8:50 AM Yueyang Pan <yueya...@epfl.ch> wrote:
Dear Waldemar Kozaczuk,
    I am Yueyang Pan from EPFL. Currently I am working on a project about remote memory and trying to develop a prototype based on OSv. I am also the person who raised the questions on the Google group several days ago. For that question, I made a workaround by adding my own stats class which records the sum and count, because what I need is the average. Now I have some further questions. They are probably a bit dumb for you, but I would be very grateful if you could spend a little bit of time to give me some suggestions.

The tracepoints use ring buffers of fixed size, so eventually all old tracepoints get overwritten by new ones. I think you can either increase the size or use the approach taken by the script freq.py

Exactly. OSv's tracepoints have two modes. One is indeed to save them in a ring buffer - so you'll see the last N traced events when you read that buffer - but the other is a mode that just counts the events. What freq.py does is retrieve the count at one second, then retrieve the count the next second - and the difference is the average number of this event per second.

If, instead of counting the events, you want a sum of, say, integers that come from the event (e.g., a sum of packet lengths), we don't have support for this at the moment - we only increment the count by 1. It could be added as a feature, I guess. But you can always do something ad hoc like maintaining a global variable to which you add.
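
A minimal sketch of that ad-hoc approach, using plain atomics next to the code being measured (the packet-length example and all names here are made up; this is not an existing OSv facility):

    #include <atomic>
    #include <cstdint>

    static std::atomic<uint64_t> total_len{0};
    static std::atomic<uint64_t> total_events{0};

    inline void record_packet(uint64_t len)
    {
        total_len.fetch_add(len, std::memory_order_relaxed);
        total_events.fetch_add(1, std::memory_order_relaxed);
    }

    // average length so far = total_len / total_events (guarding against zero)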
 
(you need to add the module httpserver-monitoring-api). There is also newly added (though experimental) strace-like functionality (see https://github.com/cloudius-systems/osv/commit/7d7b6d0f1261b87b678c572068e39d482e2103e4). Finally, you may find the comments on this issue relevant - https://github.com/cloudius-systems/osv/issues/1261#issuecomment-1722549524. I am also sure you have come across this wiki page - https://github.com/cloudius-systems/osv/wiki/Trace-analysis-using-trace.py.

    Now after my profiling, I found the mutex in global tib_flush_mutex to be hot in my benchmark so I am trying to remove it but it turns to be a bit hard without understanding the thread model of OSv. So I would like to ask whether there is any high-level doc that describes what the scheduling policy of OSv is, how the priority of the threads are decided, whether we can disable preemption or not (the functionality of preempt_lock) and the design of synchronisation primitives (for example why it is not allowed to have preemption disabled inside lockfree::mutex). I am trying to understand by reading the code directly but it can be really helpful if there is some material which describes the design.

There are a lot of questions here, and I'm not even sure answering them will explain specifically why tlb_flush_mutex is highly contended in your workload.

Waldek suggested that you read the OSv paper from Usenix, which is a good start for understanding the overall OSv architecture.
The scheduling policy and priority (how to decide which thread should run next) is described in more detail in this document: https://docs.google.com/document/d/1W7KCxOxP-1Fy5EyF2lbJGE2WuKmu5v0suYqoHas1jRM/edit

If you have specific questions, post them here and I'll try to answer. But only a few at a time :-) You had a lot of questions above and I can't answer them all in one mail :-)

ilya meignan-masson

Nov 28, 2023, 5:22:45 AM
to OSv Development
Hello
I am also working on a similar project, about page caching for databases, and I am facing issues similar to the OP's. I have a workload that executes for several minutes, yet the traces I manage to extract are only a few seconds long.

Exactly. OSv's tracepoints have two modes. One is indeed to save them in a ring buffer - so you'll see the last N traced events when you read that buffer - but the other is a mode that just counts the events. What freq.py does is retrieve the count at one second, then retrieve the count the next second - and the difference is the average number of this event per second.

I don't understand your answer. How do you enable either of those modes? I don't see any mention of such modes in the wiki (https://github.com/cloudius-systems/osv/wiki/Trace-analysis-using-trace.py).
/scripts/run.py only includes options to "enable" tracepoints, but no way to choose between those modes. It seems to me that the default is the first mode, recording events.
In the code, I have not found any definition of a fixed-size buffer for trace events.
The only variable that seems relevant is trace_log_size, defined in include/osv/trace.hh. It corresponds to the size of a ring buffer for trace logging, which you also mention. However, this ring buffer seems to be used only by the strace functionality and not during "normal" tracing.
I tried to increase the size of this ring buffer, but there was no change in the number of events collected, contrary to what I expected.
Could you enlighten me on this part?

As a comment, I believe what the OP meant by "I made a workaround by adding my own stats class which records the sum and count, because what I need is the average" refers to the sum of the run time of a function and the number of times the function is called, used to compute the average execution time. This is exactly my use case and, while it is possible to use global variables, it would be nice to use the OSv functionality if possible.
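
For what it's worth, a sketch of such an ad-hoc stats class (all names are illustrative, nothing here is an OSv facility) could be a pair of atomic accumulators plus an RAII timer placed at the top of the measured function:

    #include <atomic>
    #include <chrono>
    #include <cstdint>

    struct func_stats {
        std::atomic<uint64_t> total_ns{0};
        std::atomic<uint64_t> calls{0};

        uint64_t average_ns() const {
            auto n = calls.load(std::memory_order_relaxed);
            return n ? total_ns.load(std::memory_order_relaxed) / n : 0;
        }
    };

    // Construct at the top of the traced function; the destructor adds the
    // elapsed time and bumps the call count.
    struct scoped_timer {
        func_stats& s;
        std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
        explicit scoped_timer(func_stats& stats) : s(stats) {}
        ~scoped_timer() {
            auto elapsed = std::chrono::steady_clock::now() - start;
            s.total_ns += std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count();
            s.calls += 1;
        }
    };

    // Usage: static func_stats my_stats;
    //        void my_func() { scoped_timer t(my_stats); /* work */ }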

As a side note regarding the wiki, I was misled at first by the symbol resolution part. I understood that you could insert tracepoints in an application. Upon testing, I have found that tracepoints can only be placed inside the kernel object files.
If it is all the same to you, I would like to propose a PR with modifications to the wiki to prevent this misunderstanding.

Thanks in advance for your answers.
Regards
Ilya

Waldek Kozaczuk

Nov 28, 2023, 8:07:35 AM
to OSv Development
On Tuesday, November 28, 2023 at 1:20:06 AM UTC-5 Waldek Kozaczuk wrote:
Hi,

It is great to hear from you. Please see my answers below. 

I hope you also do not mind that I reply to the group, so others may add something extra or refine/correct my answers, as I am not an original developer/designer of OSv.

On Fri, Nov 24, 2023 at 8:50 AM Yueyang Pan <yueya...@epfl.ch> wrote:
Dear Waldemar Kozaczuk,
    I am Yueyang Pan from EPFL. Currently I am working on a project about remote memory and trying to develop a prototype based on OSv. I am also the person who raised the questions on the Google group several days ago. For that question, I made a workaround by adding my own stats class which records the sum and count, because what I need is the average. Now I have some further questions. They are probably a bit dumb for you, but I would be very grateful if you could spend a little bit of time to give me some suggestions.

The tracepoints use ring buffers of fixed size, so eventually all old tracepoints get overwritten by new ones. I think you can either increase the size or use the approach taken by the script freq.py (you need to add the module httpserver-monitoring-api). There is also newly added (though experimental) strace-like functionality (see https://github.com/cloudius-systems/osv/commit/7d7b6d0f1261b87b678c572068e39d482e2103e4). Finally, you may find the comments on this issue relevant - https://github.com/cloudius-systems/osv/issues/1261#issuecomment-1722549524. I am also sure you have come across this wiki page - https://github.com/cloudius-systems/osv/wiki/Trace-analysis-using-trace.py.

    Now, after my profiling, I found the global mutex tlb_flush_mutex to be hot in my benchmark, so I am trying to remove it, but that turns out to be a bit hard without understanding the threading model of OSv. So I would like to ask whether there is any high-level doc that describes what the scheduling policy of OSv is, how the priorities of the threads are decided, whether we can disable preemption or not (the functionality of preempt_lock), and the design of the synchronisation primitives (for example, why it is not allowed to have preemption disabled inside lockfree::mutex). I am trying to understand by reading the code directly, but it would be really helpful if there were some material that describes the design.

If indeed your "hot" spot is around tlb_flush_mutex (used by flush_tlb_all()), then I am guessing your program does a lot of mmap/munmap (see the unpopulate class in core/memory.cc that uses tlb_gather). I am not familiar with the details of what tlb_gather exactly does, but it probably forces the TLB (Translation Lookaside Buffer) to flush old virtual/physical memory mapping entries after unmapping. The mmu::flush_tlb_all() is actually used in more places.

My wild suggestion would be to try to convert tlb_flush_mutex to a spinlock (see include/osv/spinlock.h and core/spinlock.cc). It is a bit of a controversial idea, as OSv prides itself on lock-less structures and almost no spinlocks are used (the console initialization is the only place left). But in some places (see https://github.com/cloudius-systems/osv/issues/853#issuecomment-279215964, https://github.com/cloudius-systems/osv/commit/f8866c0dfd7ca1fcb4b2d9a280946878313a75d3 and https://groups.google.com/g/osv-dev/c/4wMAHCs7_dk/m/1LHdvmoeBwAJ) we may benefit from them.

Please note that the lock-less sched::thread::wait_until at the end of flush_tlb_all would need to be replaced with a "busy" wait/sleep.
Or, instead of a spinlock, you could use Nadav's "mutex with spinning" - https://groups.google.com/g/osv-dev/c/4wMAHCs7_dk/m/1LHdvmoeBwAJ - it may be a good fit here.

Yueyang Pan

Nov 28, 2023, 3:04:20 PM
to Nadav Har'El, Waldek Kozaczuk, OSv Development
Hi Nadav and Waldek,
Thanks a lot for very detailed answers from both of you. I have some updates on this.
For the first question, I ended up implementing my own ad hoc stats class with which I can measure the total time (or total count) of a function and calculate the average. I am still struggling to make perf work. I got this error from perf when using perf kvm as shown here https://github.com/cloudius-systems/osv/wiki/Debugging-OSv#profiling:
Couldn't record guest kernel [0]'s reference relocation symbol.
Have you ever encountered this problem when you were developing?

For the second question, I ended up removing the global tlb_flush_mutex and introducing a Linux-like design where you have a per-CPU call_function_data which contains a per-CPU array of call_single_data. Each CPU has its own call_single_queue where the call_single_data is enqueued or dequeued. If you don't mind, I can tidy up the code a bit and send the patch so you can review it. I am not sure how the development process works for OSv, so I would appreciate it very much if you could give me some guidance.

For the scheduling part, I am reading the paper and the doc now. Thanks for the resources. I need some time to digest them, because I found that preempt_lock matters a lot for the performance of my code.
    
    Best Wishes
    Pan

Waldek Kozaczuk

Nov 28, 2023, 3:17:50 PM
to Yueyang Pan, Nadav Har'El, OSv Development
On Tue, Nov 28, 2023 at 3:04 PM Yueyang Pan <yueya...@epfl.ch> wrote:
Hi Nadav and Waldek,
Thanks a lot for very detailed answers from both of you. I have some updates on this.
For the first question, I ended up implementing my own ad hoc stats class with which I can measure the total time (or total count) of a function and calculate the average. I am still struggling to make perf work. I got this error from perf when using perf kvm as shown here https://github.com/cloudius-systems/osv/wiki/Debugging-OSv#profiling:
Couldn't record guest kernel [0]'s reference relocation symbol.
Have you ever encountered this problem when you were developing?

I have never seen it but I will try to dig a bit deeper once I have time.

For the second question, I ended up removing the global tlb_flush_mutex and introducing a Linux-like design where you have a per-CPU call_function_data which contains a per-CPU array of call_single_data. Each CPU has its own call_single_queue where the call_single_data is enqueued or dequeued. If you don't mind, I can tidy up the code a bit and send the patch so you can review it. I am not sure how the development process works for OSv, so I would appreciate it very much if you could give me some guidance.
Feel free to create a PR on GitHub.

Do you see a significant improvement with your change to use per-CPU call_function_data? OSv has its own percpu structures concept (see include/osv/percpu.hh), so I wonder if you can leverage it.

I wonder how this Linux-like solution helps, given that the point of mmu::flush_tlb_all() (where tlb_flush_mutex is used) is to coordinate the flushing of the TLB and make sure all CPUs do it, so the virtual/physical mappings stay in sync across all CPUs. How do you achieve that in your solution? Is the potential speed improvement gained from avoiding IPIs, which are known to be slow?

Yueyang Pan

Nov 29, 2023, 8:59:56 AM
to Waldek Kozaczuk, Nadav Har'El, OSv Development
On 28 Nov 2023, at 21:17, jwkoz...@gmail.com wrote:



On Tue, Nov 28, 2023 at 3:04 PM Yueyang Pan <yueya...@epfl.ch> wrote:
Hi Nadav and Waldek,
Thanks a lot for very detailed answers from both of you. I have some updates on this.
For the first question, I ended up implementing my own ad hoc stats class with which I can measure the total time (or total count) of a function and calculate the average. I am still struggling to make perf work. I got this error from perf when using perf kvm as shown here https://github.com/cloudius-systems/osv/wiki/Debugging-OSv#profiling:
Couldn't record guest kernel [0]'s reference relocation symbol.
Have you ever encountered this problem when you were developing?

I have never seen it but I will try to dig a bit deeper once I have time.

Thanks a lot in advance!


For the second question, I ended up removing the global tlb_flush_mutex and introducing a Linux-like design where you have a per-CPU call_function_data which contains a per-CPU array of call_single_data. Each CPU has its own call_single_queue where the call_single_data is enqueued or dequeued. If you don't mind, I can tidy up the code a bit and send the patch so you can review it. I am not sure how the development process works for OSv, so I would appreciate it very much if you could give me some guidance.
Feel free to create a PR on GitHub.

Do you see a significant improvement with your change to use per-CPU call_function_data? OSv has its own percpu structures concept (see include/osv/percpu.hh), so I wonder if you can leverage it.

Yeah, I am having a look right now at how to initialise per-CPU variables. The previous implementation I had was using a std::array sized for the maximum number of CPUs. It is a bit messy, so I am taking some time to massage the code. I will create a PR once it is done properly.
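
For reference, here is a sketch of the two options being discussed - a fixed array indexed by the current CPU's id, or OSv's own PERCPU machinery. Everything named here is illustrative, and the exact PERCPU accessors should be double-checked against include/osv/percpu.hh:

    #include <osv/sched.hh>
    #include <array>

    struct call_single_queue {
        // ... lock-free queue of flush requests for one CPU ...
    };

    // Option 1: a plain array sized for an assumed maximum CPU count.
    constexpr unsigned max_cpus = 64;   // assumption, not an OSv constant
    static std::array<call_single_queue, max_cpus> queues;

    static call_single_queue& this_cpu_queue()
    {
        return queues[sched::cpu::current()->id];   // index by current CPU id
    }

    // Option 2 (outline only): declare the queue with the PERCPU macro from
    // include/osv/percpu.hh and access the current CPU's instance through the
    // percpu<> wrapper instead of indexing a fixed array.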



I wonder how this Linux-like solution helps, given that the point of mmu::flush_tlb_all() (where tlb_flush_mutex is used) is to coordinate the flushing of the TLB and make sure all CPUs do it, so the virtual/physical mappings stay in sync across all CPUs. How do you achieve that in your solution? Is the potential speed improvement gained from avoiding IPIs, which are known to be slow?

I have seen some performance improvement on my own benchmark, which is based on an academic prototype that added swap to OSv. I will find some multithreaded mmap/munmap benchmarks and share the numbers once done. I think it will also improve performance because multiple cores don't need to be serialised over the whole mmu::flush_tlb_all once the batch size has been reached. They can send IPIs at the same time and wait. The receiver side can perform just one TLB flush and pop all the remaining requests from the software queue.

For example, suppose both A and B want to do mmu::flush_tlb_all(). With the global mutex, A has to go first and B later. With the Linux-like approach, A and B can both send IPIs (or even skip sending one if the target core has already received an IPI but not yet processed it), and the receiving side only needs to do flush_tlb_local once and pop both requests, A and B, from the queue.

I can see benefits in multiple places: the number of IPIs sent can be reduced, the waiting time for the mutex can be eliminated, and the total time spent in interrupt handling on the receiver side can be reduced.
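
To spell out that flow, here is a very rough sketch of the coalescing idea. All of the types, method names and memory-ordering details are illustrative pseudo-C++ describing the design above, not the actual patch; mmu::flush_tlb_local() is only forward-declared here to keep the sketch self-contained:

    #include <atomic>

    namespace mmu { void flush_tlb_local(); }   // provided by the real kernel code

    struct flush_request {
        flush_request* next = nullptr;
        std::atomic<bool> done{false};
        void mark_done() { done.store(true, std::memory_order_release); }
        void wait_done() {
            while (!done.load(std::memory_order_acquire)) { /* spin, or sleep */ }
        }
    };

    // A minimal "push tells you whether the queue was empty, pop_all takes
    // everything" queue - here just an atomic intrusive stack.
    struct request_queue {
        std::atomic<flush_request*> head{nullptr};
        bool push(flush_request* req) {
            flush_request* old = head.load(std::memory_order_relaxed);
            do {
                req->next = old;
            } while (!head.compare_exchange_weak(old, req, std::memory_order_release,
                                                 std::memory_order_relaxed));
            return old == nullptr;        // true -> queue was empty, an IPI is needed
        }
        flush_request* pop_all() {
            return head.exchange(nullptr, std::memory_order_acquire);
        }
    };

    struct per_cpu_state {
        request_queue queue;
        void send_ipi();                  // stand-in for the real IPI mechanism
    };

    // Sender: enqueue, and only raise an IPI when the target queue was empty.
    void request_remote_flush(per_cpu_state& target, flush_request* req)
    {
        if (target.queue.push(req)) {
            target.send_ipi();            // otherwise an IPI is already pending
        }
        req->wait_done();
    }

    // Receiver (IPI handler): drain first, flush once, then release every
    // coalesced waiter - requests A and B above share a single local flush.
    void on_flush_ipi(per_cpu_state& self)
    {
        flush_request* batch = self.queue.pop_all();
        mmu::flush_tlb_local();           // one flush covers the whole batch
        for (auto* req = batch; req; req = req->next) {
            req->mark_done();
        }
    }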

Nadav Har'El

Nov 30, 2023, 12:42:47 PM
to ilya meignan-masson, OSv Development
On Tue, Nov 28, 2023 at 12:22 PM ilya meignan-masson <ilya.m...@gmail.com> wrote:
Hello
I am also working on a similar project, about page caching for databases, and I am facing issues similar to the OP's. I have a workload that executes for several minutes, yet the traces I manage to extract are only a few seconds long.

Exactly. OSv's tracepoints have two modes. One is indeed to save them in a ring buffer - so you'll see the last N traced events when you read that buffer - but the other is a mode that just counts the events. What freq.py does is retrieve the count at one second, then retrieve the count the next second - and the difference is the average number of this event per second.

I don't understand your answer. How do you enable either of those modes? I don't see any mention of such modes in the wiki (https://github.com/cloudius-systems/osv/wiki/Trace-analysis-using-trace.py).

The tracing described in this document - at least in its beginning - is the ring-buffer "mode": each trace event is saved to a ring buffer whose size should be configurable (I don't remember right now where it's configured...).

The "counting" mode is different - when a tracepoint is enabled in "count" mode, each time this event occurs, a counter is incremented and nothing else. This is super-fast - you can count extremely frequent, even a million per second, with little or no performance degredation.

We have a script, scripts/freq.py, which makes this easy to use. It needs the httpserver module to be included in the OSv image; it connects to it to enable the counters and then shows the counter increments every few seconds (i.e., the frequency of the event).

For example,

scripts/build -j8 modules=rogue,httpserver
scripts/run

and in another window,

scripts/freq.py localhost sched_switch

(or something like that... I can't test this right now because for some reason the httpserver module doesn't build on my Fedora 38, I'll need to check this later)

 
/scripts/run.py only includes options to "enable" tracepoints, but no way to choose between those modes. It seems to me that the default is the first mode, recording events.
In the code, I have not found any definition of a fixed-size buffer for trace events.
The only variable that seems relevant is trace_log_size, defined in include/osv/trace.hh. It corresponds to the size of a ring buffer for trace logging, which you also mention. However, this ring buffer seems to be used only by the strace functionality and not during "normal" tracing.
I tried to increase the size of this ring buffer, but there was no change in the number of events collected, contrary to what I expected.
Could you enlighten me on this part?

I think this is core/trace.cc, which has
        const size_t size = trace_page_size * std::max(size_t(256), 1024 / ncpu);

which means at least 256*4096 = 1MB of buffer. I think you can increase this to any number you want and recompile. I don't remember why this isn't documented anywhere or made easier to configure (it's been years since I last looked at this code...)
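
So, for anyone who wants longer traces, the crude experiment would presumably be to raise that minimum and rebuild; the 4096 below (roughly 16MB with 4KB trace pages) is just an example value, not a recommendation:

    // core/trace.cc - local experiment, not an upstream change
    const size_t size = trace_page_size * std::max(size_t(4096), 1024 / ncpu);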