Dear Waldemar Kozaczuk,
I am Yueyang Pan from EPFL. Currently I am working on a project about remote memory and trying to develop a prototype based on OSv. I am the guy who raised the questions on the Google group several days ago as well. For that question, I made a workaround by adding my own stats class which records the sum and the count, because what I need is the average. Now I have some further questions. They may be a bit naive, but I would be very grateful if you could spend a little time giving me some suggestions.
Now, after my profiling, I found the global tlb_flush_mutex to be hot in my benchmark, so I am trying to remove it, but it turns out to be a bit hard without understanding the thread model of OSv. So I would like to ask whether there is any high-level doc that describes what the scheduling policy of OSv is, how the priority of threads is decided, whether we can disable preemption or not (the functionality of preempt_lock), and the design of the synchronisation primitives (for example, why it is not allowed to have preemption disabled inside lockfree::mutex). I am trying to understand by reading the code directly, but it would be really helpful if there were some material describing the design.
Thanks in advance for any advice you can provide, and pardon me if the questions are naive.
Best Wishes
Pan
Hi,

It is great to hear from you. Please see my answers below. I hope you also do not mind that I reply to the group, so others may add something extra or refine/correct my answers, as I am not an original developer/designer of OSv.

On Fri, Nov 24, 2023 at 8:50 AM Yueyang Pan <yueya...@epfl.ch> wrote:

> Dear Waldemar Kozaczuk,
> I am Yueyang Pan from EPFL. Currently I am working on a project about remote memory and trying to develop a prototype based on OSv. I am the guy who raised the questions on the Google group several days ago as well. For that question, I made a workaround by adding my own stats class which records the sum and the count, because what I need is the average. Now I have some further questions. They may be a bit naive, but I would be very grateful if you could spend a little time giving me some suggestions.

The tracepoints use ring buffers of fixed size, so eventually all old tracepoints get overwritten by new ones. I think you can either increase the size or use the approach taken by the script freq.py (you need to add the module httpserver-monitoring-api). There is also newly added (though experimental) strace-like functionality (see https://github.com/cloudius-systems/osv/commit/7d7b6d0f1261b87b678c572068e39d482e2103e4). Finally, you may find the comments on this issue relevant: https://github.com/cloudius-systems/osv/issues/1261#issuecomment-1722549524. I am also sure you have come across this wiki page: https://github.com/cloudius-systems/osv/wiki/Trace-analysis-using-trace.py.

> Now, after my profiling, I found the global tlb_flush_mutex to be hot in my benchmark, so I am trying to remove it, but it turns out to be a bit hard without understanding the thread model of OSv. So I would like to ask whether there is any high-level doc that describes what the scheduling policy of OSv is, how the priority of threads is decided, whether we can disable preemption or not (the functionality of preempt_lock), and the design of the synchronisation primitives (for example, why it is not allowed to have preemption disabled inside lockfree::mutex). I am trying to understand by reading the code directly, but it would be really helpful if there were some material describing the design.
Exactly. OSv's tracepoints have two modes. One is indeed to save them in a ring buffer - so you'll see the last N traced events when you read that buffer - but the other is a mode that just counts the events. What freq.py does is to retrieve the count at one second, then retrieve the count the next second - and the difference is the average number of this event per second.
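The counting mode described above can be sketched like this - note the names (tp_count, tracepoint_hit, events_per_second) are hypothetical illustrations, not OSv's actual tracepoint API:

```cpp
#include <atomic>

// Sketch of the counting mode: every tracepoint hit bumps an atomic
// counter; a sampler (like freq.py does over the monitoring API) reads
// the counter once per interval, and the difference between consecutive
// samples is the events-per-second rate.
static std::atomic<unsigned long> tp_count{0};

// Called on every traced event - cheap, no ring buffer involved.
void tracepoint_hit() {
    tp_count.fetch_add(1, std::memory_order_relaxed);
}

// freq.py-style computation: rate = (curr - prev) / interval.
unsigned long events_per_second(unsigned long prev_sample,
                                unsigned long curr_sample,
                                unsigned long interval_sec = 1) {
    return (curr_sample - prev_sample) / interval_sec;
}
```

The point is that the counter never overflows useful data the way a fixed-size event buffer does - only the deltas matter.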
> Now, after my profiling, I found the global tlb_flush_mutex to be hot in my benchmark, so I am trying to remove it, but it turns out to be a bit hard without understanding the thread model of OSv. So I would like to ask whether there is any high-level doc that describes what the scheduling policy of OSv is, how the priority of threads is decided, whether we can disable preemption or not (the functionality of preempt_lock), and the design of the synchronisation primitives (for example, why it is not allowed to have preemption disabled inside lockfree::mutex).
> I am trying to understand by reading the code directly, but it would be really helpful if there were some material describing the design.

If your "hot" spot is indeed around tlb_flush_mutex (used by flush_tlb_all()), then I am guessing your program does a lot of mmap/munmap (see the unpopulate class in core/memory.cc that uses tlb_gather). I am not familiar with the details of what tlb_gather does exactly; it probably forces the TLB (Translation Lookaside Buffer) to flush old virtual/physical memory mapping entries after unmapping. mmu::flush_tlb_all() is actually used in more places.

My wild suggestion would be to try converting tlb_flush_mutex to a spinlock (see include/osv/spinlock.h and core/spinlock.cc). It is a somewhat controversial idea, as OSv prides itself on lock-less structures and almost no spinlocks (console initialization is the only place left). But in some places (see https://github.com/cloudius-systems/osv/issues/853#issuecomment-279215964, https://github.com/cloudius-systems/osv/commit/f8866c0dfd7ca1fcb4b2d9a280946878313a75d3 and https://groups.google.com/g/osv-dev/c/4wMAHCs7_dk/m/1LHdvmoeBwAJ) we may benefit from them.

Please note that the lock-less sched::thread::wait_until at the end of flush_tlb_all would need to be replaced with a "busy" wait/sleep.
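To make the suggestion concrete, here is a minimal sketch of the shape that flush_tlb_all() would take under a spinlock with a busy-wait acknowledgement, using a std::atomic_flag spinlock rather than OSv's actual spinlock type; the IPI-sending step is elided and stood in for by a local loop, so treat this purely as an illustration of the idea:

```cpp
#include <atomic>

// Minimal test-and-set spinlock (OSv's real one lives in
// include/osv/spinlock.h; this is only a stand-in).
class spinlock {
    std::atomic_flag locked = ATOMIC_FLAG_INIT;
public:
    void lock()   { while (locked.test_and_set(std::memory_order_acquire)) { /* spin */ } }
    void unlock() { locked.clear(std::memory_order_release); }
};

spinlock tlb_flush_lock;          // replaces the mutex tlb_flush_mutex
std::atomic<int> acks{0};         // CPUs that have flushed their TLB

// Hypothetical flush_tlb_all() shape: the lock-less
// sched::thread::wait_until at the end becomes a busy wait on the
// acknowledgement counter, because a thread holding a spinlock must
// not block.
void flush_tlb_all_sketch(int ncpus) {
    tlb_flush_lock.lock();
    acks.store(0);
    // ...here the real code would send IPIs asking every other CPU to
    // flush its TLB; each CPU would do acks.fetch_add(1) when done.
    // Stand-in for those remote acknowledgements:
    for (int i = 0; i < ncpus; i++) acks.fetch_add(1);
    while (acks.load() < ncpus) { /* busy wait instead of wait_until */ }
    tlb_flush_lock.unlock();
}
```

The trade-off is exactly the one mentioned above: a spinlock avoids the scheduler interaction of a mutex, but burns CPU while waiting, which only pays off when the critical section is short.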
Hi Nadav and Waldek,

Thanks a lot for the very detailed answers from both of you. I have some updates on this.

For the first question, I ended up implementing my own ad hoc stat class where I can measure the total time (or total count) of a function and calculate the average. I am still struggling to make perf work. I got this error when using perf kvm as shown at https://github.com/cloudius-systems/osv/wiki/Debugging-OSv#profiling:

    Couldn't record guest kernel [0]'s reference relocation symbol.

Have you ever encountered this problem when you were developing?
For the second question, I ended up removing the global tlb_flush_mutex and introduced a Linux-like design where you have a per-CPU call_function_data which contains a per-CPU array of call_single_data. Each CPU has its own call_single_queue where call_single_data entries are enqueued and dequeued. If you don't mind, I can clean the code up a bit and send the patch; then you can review it. I am not sure how the development process works for OSv, and I would appreciate it very much if you could give me some guidance.
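The design described above can be sketched as follows - the names mirror Linux's smp.c (call_single_data, call_single_queue), and everything here (including the plain std::deque standing in for a lock-free or per-CPU-private queue, and the elided IPI) is an illustrative assumption, not the actual patch:

```cpp
#include <deque>
#include <functional>
#include <vector>

// One unit of cross-CPU work: a function to run on the target CPU.
struct call_single_data {
    std::function<void()> func;
};

// Per-CPU context: each CPU owns one call_single_queue, so no global
// mutex is needed - only the target CPU drains its own queue.
struct cpu_ctx {
    std::deque<call_single_data*> call_single_queue;
};

std::vector<cpu_ctx> cpus;

// Initiator side: put the work on the target CPU's queue.  The real
// code would then send an IPI so the target CPU notices the entry.
void enqueue_on(unsigned cpu, call_single_data* csd) {
    cpus[cpu].call_single_queue.push_back(csd);
    // ...send IPI to `cpu`...
}

// Target-CPU side (e.g. in the IPI handler): drain and run everything
// queued for this CPU, such as a local TLB flush.
void flush_call_queue(unsigned cpu) {
    auto& q = cpus[cpu].call_single_queue;
    while (!q.empty()) {
        auto* csd = q.front();
        q.pop_front();
        csd->func();
    }
}
```

The contention win comes from each CPU touching only its own queue instead of all CPUs serializing on one global mutex.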
On 28 Nov 2023, at 21:17, jwkoz...@gmail.com wrote:

On Tue, Nov 28, 2023 at 3:04 PM Yueyang Pan <yueya...@epfl.ch> wrote:

> Hi Nadav and Waldek, thanks a lot for the very detailed answers from both of you. I have some updates on this. For the first question, I ended up implementing my own ad hoc stat class where I can measure the total time (or total count) of a function and calculate the average. I am still struggling to make perf work. I got this error when using perf kvm as shown at https://github.com/cloudius-systems/osv/wiki/Debugging-OSv#profiling: "Couldn't record guest kernel [0]'s reference relocation symbol." Have you ever encountered this problem when you were developing?

I have never seen it, but I will try to dig a bit deeper once I have time.
> For the second question, I ended up removing the global tlb_flush_mutex and introduced a Linux-like design where you have a per-CPU call_function_data which contains a per-CPU array of call_single_data. Each CPU has its own call_single_queue where call_single_data entries are enqueued and dequeued. If you don't mind, I can clean the code up a bit and send the patch; then you can review it. I am not sure how the development process works for OSv, and I would appreciate it very much if you could give me some guidance.

Feel free to create a PR on GitHub.

Do you see a significant improvement with your change to use per-CPU call_function_data? OSv has its own percpu structures concept (see include/osv/percpu.hh), so I wonder if you can leverage it.
I wonder how this Linux-like solution helps, given that the point of mmu::flush_tlb_all() (where tlb_flush_mutex is used) is to coordinate the flushing of the TLB and make sure all CPUs do it, so that the virtual/physical mapping stays in sync across all CPUs. How do you achieve that in your solution? Is the potential speed improvement gained from avoiding IPIs, which are known to be slow?
Hello,

I am also working on a similar project, about page caching for databases, and I am faced with similar issues as the OP. I have a workload that executes for several minutes, and the traces that I manage to extract are only a few seconds long.

> Exactly. OSv's tracepoints have two modes. One is indeed to save them in a ring buffer - so you'll see the last N traced events when you read that buffer - but the other is a mode that just counts the events. What freq.py does is to retrieve the count at one second, then retrieve the count the next second - and the difference is the average number of this event per second.

I don't understand your answer. How do you enable either of those modes? I don't see any mention of such modes in the wiki (https://github.com/cloudius-systems/osv/wiki/Trace-analysis-using-trace.py).
/scripts/run.py only includes options to "enable" tracepoints, with no way to choose between those modes. It seems to me that the default is the first mode, recording events.

In the code, I have not found any definition of a fixed-size buffer for trace events. The only variable that seems relevant is trace_log_size, defined in include/osv/trace.hh. It corresponds to the size of a ring buffer for trace logging, which you are also mentioning. However, this ring buffer seems to be used only by the strace functionality and not during "normal" tracing. I tried to increase the size of this ring buffer, but there was no change in the number of events collected, as I expected.

Could you enlighten me on this part?