[V2 0/4] cyclictest support

luca abeni

unread,

May 10, 2023, 4:57:55 AM5/10/23

to osv...@googlegroups.com, luca...@gmail.com

From: luca abeni <luca...@gmail.com>

Hi,

here is Version 2 of the patches I am using to run a simple real-time test
(cyclictest) on OSv.
The patches are based on github issue 386
(https://github.com/cloudius-systems/osv/issues/386) and are a forward
port of patches by Nadav Har'El and Claudio Fontana; I applied some
minor changes requested by Nadav:
- I removed the "policy" parameter (which was actually not used) from
the original patches
- I split into a different patch (with a changelog explaining the
motivation for this change) the addition of "#define __NEED_size_t"
in include/api/sched.h
- I moved the definitions of some constants and macros from header
files to the appropriate C files
- I removed a useless NULL pointer check
- I used UNIMPLEMENTED() instead of WARN_STUBBED() (moving it to the
acutally unimplemented code path) in clock_nanosleep()
- I updated Nadav's patch to the latest version

With these patches, cyclictest can start and measure some latencies.

Please cc me if you reply because I am not subscribed to the mailing list

Claudio Fontana (2):
libc: add RT sched API
libc/time: add stub for clock_nanosleep

Nadav Har'El (1):
sched: low-level support for POSIX-compatible real-time scheduling

luca abeni (1):
Define __NEED_size_t in include/api/sched.h

Makefile | 2 +
core/sched.cc | 82 ++++++++++++++++++++---
include/api/sched.h | 1 +
include/osv/sched.hh | 85 +++++++++++++++++++++++-
libc/pthread.cc | 22 -------
libc/sched.cc | 143 ++++++++++++++++++++++++++++++++++++++++
libc/time.cc | 49 +++++++++++++-
tests/misc-scheduler.cc | 91 +++++++++++++++++++++++++
8 files changed, 441 insertions(+), 34 deletions(-)
create mode 100644 libc/sched.cc

--
2.30.2

luca abeni

unread,

May 10, 2023, 4:57:59 AM5/10/23

to osv...@googlegroups.com, luca...@gmail.com, Nadav Har'El

From: Nadav Har'El <n...@scylladb.com>

This patch adds "real-time priority" to OSv threads, which can be used
to implement Posix's SCHED_FIFO and SCHED_RR scheduling policies.

Refs #386. It doesn't yet "Fixes" it because this patch does not yet add
the Posix front-end to this implementation (sched_setscheduler() et al.) -
this will be added in a separate patch.

We add to each thread a _realtime._priority, an unsigned integer.
Normal (non-real-time) threads have _realtime._priority of 0, and are
time-shared fairly as before, while realtime threads have
_realtime._priority > 0, and preempt any thread with a lower realtime
priority, including all normal threads.

We add a new API, t->set_realtime_priority(priority), to make a thread
real-time scheduled. The "priority" is an unsigned int as explained above
(0 means normal thread). The resulting scheduling policy is fully
compatible with POSIX's SCHED_FIFO. This includes following the intricate
rules on what happens to the running thread's place in the queue when there
are several real-time threads with the same priority and the running
thread yields, waits, or is preempted by a higher-priority thread.
This patch also adds an exercise for these cases in tests/misc-scheduler.cc.

This patch does not yet implement time slices as needed for supporting
POSIX's SCHED_RR policy. This patch already includes a time-slice setting
API, t->set_realtime_time_slice(duration), but if used it produces a
warning message, and the scheduler currently ignores this setting.
Adding support for real-time time slices is left for a future patch:
To implement time slices, we will need to keep in _realtime the total
time this thread ran since starting this slice (only zeroed when the
thread is removed from the run queue), and each time we switch to a
realtime thread with timeslice >= 0, we'll set an appropriate timer
(for timeslice minus already used time).

I tested that although this patch adds code into the scheduler it does
not noticably slow down context switches of normal threads. E.g.,
tests/misc-ctxsw.so measured colocated context switch time of 240ns
before this patch, and exactly the same after it.

While developing this patch I tested it using added workloads in
tests/misc-scheduler.cc, included in this patch, but unfortunately
this is not a real regression test - it has no assertions and cannot
automatically detect failure (and, like all the "misc-*" tests, it
doesn't run automatically in "make check").

Signed-off-by: Nadav Har'El <n...@scylladb.com>
---
core/sched.cc | 82 ++++++++++++++++++++++++++++++++-----
include/osv/sched.hh | 75 +++++++++++++++++++++++++++++++++-
tests/misc-scheduler.cc | 89 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 235 insertions(+), 11 deletions(-)

diff --git a/core/sched.cc b/core/sched.cc
index 65842ff3..68f49bcd 100644
--- a/core/sched.cc
+++ b/core/sched.cc
@@ -26,6 +26,7 @@
#include <osv/preempt-lock.hh>
#include <osv/app.hh>
#include <osv/symbols.hh>
+#include <osv/stubbing.hh>

MAKE_SYMBOL(sched::thread::current);
MAKE_SYMBOL(sched::cpu::current);
@@ -290,7 +291,16 @@ void cpu::reschedule_from_interrupt(bool called_from_yield,
#endif
} else if (!called_from_yield) {
auto &t = *runqueue.begin();
- if (p->_runtime.get_local() < t._runtime.get_local()) {
+ if (p->_realtime._priority > 0) {
+ // Only switch to a higher-priority realtime thread
+ if (t._realtime._priority <= p->_realtime._priority) {
+#ifdef __aarch64__
+ return switch_data;
+#else
+ return;
+#endif
+ }
+ } else if (t._realtime._priority == 0 && p->_runtime.get_local() < t._runtime.get_local()) {
preemption_timer.cancel();
auto delta = p->_runtime.time_until(t._runtime.get_local());
if (delta > 0) {
@@ -302,6 +312,20 @@ void cpu::reschedule_from_interrupt(bool called_from_yield,
return;
#endif
}
+ } else { /* called_from_yield */
+ if (p->_realtime._priority > 0) {
+ auto &t = *runqueue.begin();
+ // When yielding, only switch if the next thread has a higher
+ // or same priority as p (i.e., don't switch if t it has a
+ // lesser priority than p).
+ if (t._realtime._priority < p->_realtime._priority) {
+#ifdef __aarch64__
+ return switch_data;
+#else
+ return;
+#endif
+ }
+ }
}
// If we're here, p no longer has the lowest runtime. Before queuing
// p, return the runtime it borrowed for hysteresis.
@@ -309,7 +333,10 @@ void cpu::reschedule_from_interrupt(bool called_from_yield,
p->_detached_state->st.store(thread::status::queued);

if (!called_from_yield) {
- enqueue(*p);
+ // POSIX requires that if a real-time thread doesn't yield but
+ // rather is preempted by a higher-priority thread, it be
+ // reinserted into the runqueue first, not last, among its equals.
+ enqueue_first_equal(*p);
}

trace_sched_preempt();
@@ -350,16 +377,18 @@ void cpu::reschedule_from_interrupt(bool called_from_yield,
n->_runtime.add_context_switch_penalty();
}
preemption_timer.cancel();
- if (!called_from_yield) {
- if (!runqueue.empty()) {
- auto& t = *runqueue.begin();
- auto delta = n->_runtime.time_until(t._runtime.get_local());
- if (delta > 0) {
- preemption_timer.set_with_irq_disabled(now + delta);
+ if (n->_realtime._priority == 0) {
+ if (!called_from_yield) {
+ if (!runqueue.empty()) {
+ auto& t = *runqueue.begin();
+ auto delta = n->_runtime.time_until(t._runtime.get_local());
+ if (delta > 0) {
+ preemption_timer.set_with_irq_disabled(now + delta);
+ }
}
+ } else {
+ preemption_timer.set_with_irq_disabled(now + preempt_after);
}
- } else {
- preemption_timer.set_with_irq_disabled(now + preempt_after);
}

if (app_thread.load(std::memory_order_relaxed) != n->_app) { // don't write into a cache line if it can be avoided
@@ -530,6 +559,16 @@ void cpu::enqueue(thread& t)
runqueue.insert_equal(t);
}

+// When the run queue has several threads with equal thread_runtime_compare,
+// enqueue() puts a thread after its equals, while enqueue_first_equal()
+// puts it before its equals. The distinction is mostly interesting for real-
+// time priority threads.
+void cpu::enqueue_first_equal(thread& t)
+{
+ trace_sched_queue(&t);
+ runqueue.insert_before(runqueue.lower_bound(t), t);
+}
+
void cpu::init_on_cpu()
{
arch.init_on_cpu();
@@ -866,6 +905,29 @@ float thread::priority() const
return _runtime.priority();
}

+void thread::set_realtime_priority(unsigned priority)
+{
+ _realtime._priority = priority;
+}
+
+unsigned thread::realtime_priority() const
+{
+ return _realtime._priority;
+}
+
+void thread::set_realtime_time_slice(thread_realtime::duration time_slice)
+{
+ if (time_slice > 0) {
+ WARN_ONCE("set_realtime_time_slice() used but real-time time slices"
+ " not yet supported");
+ }
+ _realtime._time_slice = time_slice;
+}
+
+thread_realtime::duration thread::realtime_time_slice() const {
+ return _realtime._time_slice;
+}
+
sched::thread::status thread::get_status() const
{
return _detached_state->st.load(std::memory_order_relaxed);
diff --git a/include/osv/sched.hh b/include/osv/sched.hh
index 8a2694cb..c1f414ba 100644
--- a/include/osv/sched.hh
+++ b/include/osv/sched.hh
@@ -313,6 +313,24 @@ private:
int _renormalize_count;
};

+// "Normal" threads are fairly time-shared according to the thread_runtime
+// data above. Such normal threads all have _realtime._priority == 0.
+// A "real-time" thread is one where _realtime._priority > 0. The scheduler
+// always picks the thread with the highest _realtime_priority to run next;
+// In particular normal threads run only when no real-time thread wants to
+// run. When several real-time threads with equal _realtime._priority want to
+// run, each one is run for _realtime._time_slice before switching to the
+// next one; If _realtime_time_slice is == 0, there is no limit on the amount
+// of time the thread may run - it will only be preempted when it waits,
+// yields, or a higher-priority thread comes along.
+// The realtime scheduling policy matches the POSIX SCHED_RR and SCHED_FIFO.
+class thread_realtime {
+public:
+ using duration = thread_runtime::duration;
+ unsigned _priority = 0;
+ duration _time_slice = duration::zero();
+};
+
// "tau" controls the length of the history we consider for scheduling,
// or more accurately the rate of decay of an exponential moving average.
// In particular, it can be seen that if a thread has been monopolizing the
@@ -615,6 +633,54 @@ public:
* explained in set_priority().
*/
float priority() const;
+ /**
+ * Set thread's real-time priority
+ *
+ * By default new threads have a "real-time priority" of 0 and participate
+ * in fair time-sharing of the CPUs (the share is determined by
+ * set_priority()).
+ *
+ * A thread becomes a "real-time" thread by setting a positive integer as
+ * real-time priority. The scheduler always picks the thread with the
+ * highest real-time priority to run next; In particular normal threads run
+ * only when no real-time thread wants to run. When several real-time
+ * threads with equal real-time priority want to run, each one is run
+ * until it waits or yields, before switching to the next one.
+ *
+ * The real-time scheduling policy matches POSIX's "SCHED_FIFO" policy,
+ * or "SCHED_RR" if set_realtime_time_slice() was also used.
+ */
+ void set_realtime_priority(unsigned priority);
+ /**
+ * Get thread's real-time priority
+ *
+ * Returns the thread's real-time priority, an unsigned integer whose
+ * meaning is explained in set_realtime_priority().
+ */
+ unsigned realtime_priority() const;
+ /**
+ * Set thread's real-time scheduling time slice.
+ *
+ * By default, real-time threads (see set_realtime_priority()) continue to
+ * run until they yield, wait, or are preempted by a higher-priority
+ * real-time thread. When a time_slice > 0 is set for a thread with this
+ * function, a thread will be limited to that time-slice before it
+ * automatically yields to the next thread of equal real-time priority (if
+ * any). Setting time_slice == 0 reverts to the default behavior, disabling
+ * the time-slice limit for the thread.
+ *
+ * With time_slice == 0, the real-time scheduling policy matches POSIX's
+ * "SCHED_FIFO" policy. With time_slice > 0, it matches POSIX's "SCHED_RR"
+ * policy.
+ */
+ void set_realtime_time_slice(thread_realtime::duration time_slice);
+ /**
+ * Get thread's real-time scheduling time slice
+ *
+ * Returns the thread's real-time scheduling time slice, whose meaning is
+ * explained in set_realtime_time_slice().
+ */
+ thread_realtime::duration realtime_time_slice() const;
/**
* Prevent a waiting thread from ever waking (returns false if the thread
* was not in waiting state). This capability is not safe: If the thread
@@ -708,6 +774,7 @@ private:
//
// wake() on any state except waiting is discarded.
thread_runtime _runtime;
+ thread_realtime _realtime;
// part of the thread state is detached from the thread structure,
// and freed by rcu, so that waking a thread and destroying it can
// occur in parallel without synchronization via thread_handle
@@ -862,7 +929,12 @@ osv::clock::uptime::duration process_cputime();
class thread_runtime_compare {
public:
bool operator()(const thread& t1, const thread& t2) const {
- return t1._runtime.get_local() < t2._runtime.get_local();
+ if (t1._realtime._priority > t2._realtime._priority)
+ return true;
+ else if (t2._realtime._priority > 0)
+ return false;
+ else
+ return t1._runtime.get_local() < t2._runtime.get_local();
}
};

@@ -931,6 +1003,7 @@ struct cpu : private timer_base::client {
thread_runtime::duration preempt_after = thyst);
#endif
void enqueue(thread& t);
+ void enqueue_first_equal(thread& t);
void init_idle_thread();
virtual void timer_fired() override;
class notifier;
diff --git a/tests/misc-scheduler.cc b/tests/misc-scheduler.cc
index f4de24d1..578fabe1 100644
--- a/tests/misc-scheduler.cc
+++ b/tests/misc-scheduler.cc
@@ -139,6 +139,66 @@ void priority_test(std::vector<float> ps)
}
#endif

+#ifdef __OSV__
+void realtime_test(std::vector<int> ps)
+{
+ std::cerr << "Starting realtime test\n";
+ std::vector<std::thread> threads;
+ mutex mtx;
+ for (auto p : ps) {
+ threads.push_back(std::thread([p]() {
+ sched::thread::current()->set_realtime_priority(p);
+ std::cout << "Starting thread with realtime priority " << p << "\n";
+ // Sleep a bit, to let all test threads get started. The thread
+ // starting the test threads is not realtime, so it can be preempted
+ // by the test threads.
+ sleep(1);
+ for (int i=0; i<10; i++) {
+ _loop(100000);
+ std::cout << p << std::flush;
+ }
+ }));
+ }
+ for (auto &t : threads) {
+ t.join();
+ }
+ std::cerr << "\nRealtime test done\n";
+}
+
+void realtime_test2(bool yield)
+{
+ std::cerr << "Starting realtime test #2 - FIFO order, yield=" << yield << "\n";
+ std::vector<std::thread> threads;
+ mutex mtx;
+ std::atomic<int> last_seen(-1);
+ for (int p = 0; p < 10; p++) {
+ threads.push_back(std::thread([p,yield,&last_seen]() {
+ sched::thread::current()->set_realtime_priority(1);
+ // Sleep a bit, to let all test threads get started. The thread
+ // starting the test threads is not realtime, so it can be preempted
+ // by the test threads.
+ sleep(1);
+ for (int i = 0 ; i < 3; i++) {
+ for(int j = 0; j < 100000; j++) {
+ if (last_seen.exchange(p) != p) {
+ std::cout << p << std::flush; // context-switched to p
+ }
+ _loop(1);
+ }
+ if (yield)
+ sched::thread::yield();
+ else
+ sched::thread::sleep(std::chrono::milliseconds(1));
+ }
+ }));
+ }
+ for (auto &t : threads) {
+ t.join();
+ }
+ std::cerr << "\nRealtime test #2 done\n";
+}
+#endif
+

int main()
{
@@ -148,6 +208,35 @@ int main()
return 0;
}

+ #ifdef __OSV__
+ // Tests for thread::set_realtime() support for POSIX-like realtime
+ // scheduling.
+ // TODO: Move this code into a real test, and in addition to just
+ // printing progress, also save it into a string and check this string.
+ // (Need to check the first 10 characters of this string repeat 2 more
+ // times and that's it).
+ realtime_test({0, 1, 2});
+ realtime_test2(false);
+ realtime_test2(true);
+ // Check that intermittent thread with priority 2 doesn't force
+ // realtime_test2 to context-switch more often than it normally
+ // should (each time we the priority-2 thread sleeps, we need to
+ // go back to the same priority-1 thread that previously ran - not
+ // to the next one). We expect the output from the test below to
+ // be identical to that from the test above.
+ std::cout << "Additional intermittent thread with priority 2\n";
+ std::atomic<bool> stop(false);
+ std::thread ti([&stop]() {
+ sched::thread::current()->set_realtime_priority(2);
+ while (!stop.load()) {
+ sched::thread::sleep(std::chrono::milliseconds(5));
+ }
+ });
+ realtime_test2(true);
+ stop.store(true);
+ ti.join();
+#endif
+
#ifdef __OSV__
auto p = sched::thread::priority_default;
priority_test({p, p*4});
--
2.30.2

luca abeni

unread,

May 10, 2023, 4:58:01 AM5/10/23

to osv...@googlegroups.com, luca...@gmail.com

From: luca abeni <luca...@gmail.com>

This header file actually uses size_t (for example, in the definition of
__sched_cpucount()). Hence, including api/sched.h without having defined
__NED_size_t before can result in build errors.
This issue is not visible now because api/sched.h has only one used, which
includes it after osv/sched.hh, which ends up defining __NEED_size_t.
---
include/api/sched.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/include/api/sched.h b/include/api/sched.h
index 1893efc1..68074b22 100644
--- a/include/api/sched.h
+++ b/include/api/sched.h
@@ -9,6 +9,7 @@ extern "C" {
#define __NEED_struct_timespec
#define __NEED_pid_t
#define __NEED_time_t
+#define __NEED_size_t

#include <bits/alltypes.h>

--
2.30.2

luca abeni

unread,

May 10, 2023, 4:58:02 AM5/10/23

to osv...@googlegroups.com, luca...@gmail.com, Claudio Fontana

From: Claudio Fontana <claudio...@huawei.com>

implement the sched_ API using the OSv realtime scheduler
implementation.

This includes:

sched_setscheduler, sched_getscheduler,
sched_setparam, sched_getparam,
sched_get_priority_min, sched_get_priority_max,
and a stub for Linux-only sched_rr_get_interval.

They are implemented together as they are tightly
integrated, and one calls the other.

Signed-off-by: Claudio Fontana <claudio...@huawei.com>
---
Makefile | 2 +

libc/pthread.cc | 22 -------
libc/sched.cc | 143 ++++++++++++++++++++++++++++++++++++++++

tests/misc-scheduler.cc | 6 +-
4 files changed, 149 insertions(+), 24 deletions(-)
create mode 100644 libc/sched.cc

diff --git a/Makefile b/Makefile
index ea6f1c0d..b626a1f2 100644
--- a/Makefile
+++ b/Makefile
@@ -1913,6 +1913,8 @@ libc += mallopt.o

libc += linux/makedev.o

+libc += sched.o
+
musl += fenv/fegetexceptflag.o
musl += fenv/feholdexcept.o
musl += fenv/fesetexceptflag.o
diff --git a/libc/pthread.cc b/libc/pthread.cc
index de5979e8..76e6d982 100644
--- a/libc/pthread.cc
+++ b/libc/pthread.cc
@@ -960,28 +960,6 @@ void pthread_exit(void *retval)
t->_thread->exit();
}

-// Following 4 functions provide minimal implementation
-// that ONLY covers default Linux SCHED_OTHER policy
-int sched_get_priority_min(int policy)
-{
- switch (policy) {
- case SCHED_OTHER:
- return 0;
- default:
- return EINVAL;
- }
-}
-
-int sched_get_priority_max(int policy)
-{
- switch (policy) {
- case SCHED_OTHER:
- return 0;
- default:
- return EINVAL;
- }
-}
-
int pthread_setschedparam(pthread_t thread, int policy,
const struct sched_param *param)
{
diff --git a/libc/sched.cc b/libc/sched.cc
new file mode 100644
index 00000000..733dc6ae
--- /dev/null
+++ b/libc/sched.cc
@@ -0,0 +1,143 @@
+/*
+ * Copyright (C) 2014 Huawei Technologies Duesseldorf GmbH
+ *
+ * This work is open source software, licensed under the terms of the
+ * BSD license as described in the LICENSE file in the top-level directory.
+ */
+
+#include <errno.h>
+#include <api/sched.h>
+#include <api/sys/resource.h>
+#include <osv/sched.hh>
+
+#include <osv/stubbing.hh>
+
+#define SCHED_PRIO_MIN 1
+#define SCHED_PRIO_MAX 99
+
+/* sched_rr_get_interval writes the time quantum for SCHED_RR
+ * processes to the tp timespec structure.
+ * This is Linux-specific, and we don't implement this yet.
+ */
+
+int sched_rr_get_interval(pid_t pid, struct timespec * tp)
+{
+ WARN_STUBBED();
+ errno = ENOSYS;
+ return -1;
+}
+
+static int sched_get_priority_minmax(int policy, int value)
+{
+ switch (policy) {
+ case SCHED_OTHER:
+ return 0;
+ case SCHED_FIFO:
+ return value;
+ default:
+ errno = EINVAL;
+ /* error return value unfortunately overlaps with allowed
+ * POSIX API allowed range. Another good reason to have
+ * only rt priorities > 0.
+ */
+ return -1;
+ }
+}
+
+int sched_get_priority_min(int policy)
+{
+ return sched_get_priority_minmax(policy, SCHED_PRIO_MIN);
+}
+
+int sched_get_priority_max(int policy)
+{
+ return sched_get_priority_minmax(policy, SCHED_PRIO_MAX);
+}
+
+static int sched_setparam_aux(sched::thread *t, int policy, int prio)
+{
+ switch (policy) {
+ case SCHED_OTHER:
+ if (prio != 0) {
+ errno = EINVAL;
+ return -1;
+ }
+ break;
+ case SCHED_FIFO:
+ if (prio < SCHED_PRIO_MIN || prio > SCHED_PRIO_MAX) {
+ errno = EINVAL;
+ return -1;
+ }
+ break;
+ default:
+ errno = EINVAL;
+ return -1;
+ }
+
+ t->set_realtime_priority(prio);
+ return 0;
+}
+
+int sched_setscheduler(pid_t pid, int policy,
+ const struct sched_param *param)
+{
+ sched::thread *t = sched::thread::current();
+
+ if (pid != 0 && pid != getpid()) {
+ errno = ESRCH;
+ return -1;
+ }
+ if (!param) {
+ errno = EINVAL;
+ return -1;
+ }
+
+ return sched_setparam_aux(t, policy, param->sched_priority);
+}
+
+int sched_getscheduler(pid_t pid)
+{
+ sched::thread *t = sched::thread::current();
+
+ if (pid != 0 && pid != getpid()) {
+ errno = ESRCH;
+ return -1;
+ }
+
+ if (t->realtime_priority() == 0) {
+ return SCHED_OTHER;
+ };
+
+ return SCHED_FIFO;
+}
+
+/*
+int sched_setparam(pid_t pid, const struct sched_param *param)
+{
+ sched::thread *t = sched::thread::current();
+
+ if (pid != 0 && pid != getpid()) {
+ errno = ESRCH;
+ return -1;
+ }
+ if (!param) {
+ errno = EINVAL;
+ return -1;
+ }
+
+ return sched_setparam_aux(t, t->get_realtime().policy(), param->sched_priority);
+}
+*/
+
+int sched_getparam(pid_t pid, struct sched_param *param)
+{
+ sched::thread *t = sched::thread::current();
+
+ if (pid != 0 && pid != getpid()) {
+ errno = ESRCH;
+ return -1;
+ }
+
+ param->sched_priority = t->realtime_priority();
+ return 0;
+}
diff --git a/tests/misc-scheduler.cc b/tests/misc-scheduler.cc
index 578fabe1..d36201e7 100644
--- a/tests/misc-scheduler.cc
+++ b/tests/misc-scheduler.cc
@@ -140,13 +140,15 @@ void priority_test(std::vector<float> ps)
#endif

#ifdef __OSV__
-void realtime_test(std::vector<int> ps)
+void realtime_test(int policy, std::vector<int> ps)
{
+ assert(policy == SCHED_RR || policy == SCHED_FIFO);

std::cerr << "Starting realtime test\n";

std::vector<std::thread> threads;

mutex mtx;
for (auto p : ps) {
- threads.push_back(std::thread([p]() {
+ assert(p >= SCHED_PRIO_MIN && p <= SCHED_PRIO_MAX);
+ threads.push_back(std::thread([policy, p]() {
sched::thread::current()->set_realtime_priority(p);

std::cout << "Starting thread with realtime priority " << p << "\n";

// Sleep a bit, to let all test threads get started. The thread

--
2.30.2

luca abeni

unread,

May 10, 2023, 4:58:04 AM5/10/23

to osv...@googlegroups.com, luca...@gmail.com, Claudio Fontana

From: Claudio Fontana <claudio...@huawei.com>

we only support CLOCK_MONOTONIC, and we don't support remainder
due to the signal implementation.

Signed-off-by: Claudio Fontana <claudio...@huawei.com>
---

include/osv/sched.hh | 10 +++++++++
libc/time.cc | 49 +++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/include/osv/sched.hh b/include/osv/sched.hh
index c1f414ba..03502c38 100644
--- a/include/osv/sched.hh
+++ b/include/osv/sched.hh
@@ -513,6 +513,8 @@ public:
inline void wake_with_from_mutex(Action action);
template <class Rep, class Period>
static void sleep(std::chrono::duration<Rep, Period> duration);
+ template <class Clock, class Duration>
+ static void sleep(std::chrono::time_point<Clock, Duration> time_point);
/**
* Let the other thread on the current CPU run if there is any.
*
@@ -1407,6 +1409,14 @@ void thread::sleep(std::chrono::duration<Rep, Period> duration)
sleep_impl(t);
}

+template <class Clock, class Duration>
+void thread::sleep(std::chrono::time_point<Clock, Duration> time_point)
+{
+ timer t(*current());
+ t.set(time_point);
+ wait_until([&] { return t.expired(); });
+}
+
template <class Action>
inline
void thread::wake_with_irq_or_preemption_disabled(Action action)
diff --git a/libc/time.cc b/libc/time.cc
index 2660eb86..c31f31fd 100644
--- a/libc/time.cc
+++ b/libc/time.cc
@@ -15,7 +15,18 @@
#include <osv/sched.hh>
#include "pthread.hh"

-u64 convert(const timespec& ts)
+#define timespecisset(t) ((t)->tv_sec || (t)->tv_nsec)
+#define timespecclear(t) ((t)->tv_sec = (t)->tv_nsec = 0)
+#define timespeccmp(s,t,op) ((s)->tv_sec == (t)->tv_sec ? \
+ (s)->tv_nsec op (t)->tv_nsec : (s)->tv_sec op (t)->tv_sec)
+#define timespecadd(s,t,a) ( (a)->tv_sec = (s)->tv_sec + (t)->tv_sec, \
+ ((a)->tv_nsec = (s)->tv_nsec + (t)->tv_nsec) >= 1000000000L && \
+ ((a)->tv_nsec -= 1000000000L, (a)->tv_sec++) )
+#define timespecsub(s,t,a) ( (a)->tv_sec = (s)->tv_sec - (t)->tv_sec, \
+ ((a)->tv_nsec = (s)->tv_nsec - (t)->tv_nsec) < 0 && \
+ ((a)->tv_nsec += 1000000000L, (a)->tv_sec--) )
+
+static u64 convert(const timespec& ts)
{
return ts.tv_sec * 1000000000 + ts.tv_nsec;
}
@@ -36,6 +47,9 @@ int gettimeofday(struct timeval* tv, struct timezone* tz)
OSV_LIBC_API
int nanosleep(const struct timespec* req, struct timespec* rem)
{
+ if (!req || req->tv_nsec < 0 || req->tv_nsec >= 1000000000L || req->tv_sec < 0)
+ return libc_error(EINVAL);
+
sched::thread::sleep(std::chrono::nanoseconds(convert(*req)));
return 0;
}
@@ -127,3 +141,36 @@ clock_t clock(void)
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
return ts.tv_sec * 1000000000L + ts.tv_nsec;
}
+
+int clock_nanosleep(clockid_t clock_id, int flags,
+ const struct timespec *request,
+ struct timespec *remain)
+{
+ /* XXX we are implementing really only CLOCK_MONOTONIC, */
+ /* and we don't support remain, due to signals. */
+ if (remain) {
+ UNIMPLEMENTED("clock_nanosleep(): remain not supported, due to signals");
+ }
+ if (clock_id != CLOCK_MONOTONIC) {
+ UNIMPLEMENTED("clock_nanosleep(): only CLOCK_MONOTONIC is supported");
+ }
+
+ switch (flags) {
+ case 0:
+ return nanosleep(request, NULL);
+ case TIMER_ABSTIME: {
+ struct timespec ts;
+ if (clock_gettime(clock_id, &ts) != 0) {
+ return -1;
+ }
+ if (timespeccmp(request, &ts, <=)) {
+ return 0;
+ }
+ sched::thread::sleep(osv::clock::uptime::time_point(
+ osv::clock::uptime::duration(convert(*request))));
+ return 0;
+ }
+ default:
+ return libc_error(EINVAL);
+ }
+}
--
2.30.2

Nadav Har'El

unread,

May 10, 2023, 5:07:22 AM5/10/23

to luca abeni, osv...@googlegroups.com

Reviewed-by: Nadav Har'El <n...@scylladb.com>

--
Nadav Har'El

n...@scylladb.com

--
You received this message because you are subscribed to the Google Groups "OSv Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to osv-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/20230510085745.11416-3-luca.abeni%40santannapisa.it.

Nadav Har'El

unread,

May 10, 2023, 5:47:41 AM5/10/23

to luca abeni, osv...@googlegroups.com, Claudio Fontana

Thanks! Looks better, but I still have some requests below.

--
Nadav Har'El

n...@scylladb.com

On Wed, May 10, 2023 at 11:58 AM luca abeni <luca...@gmail.com> wrote:

You can also add your own copyright line, if you want.

+ *
+ * This work is open source software, licensed under the terms of the
+ * BSD license as described in the LICENSE file in the top-level directory.
+ */
+
+#include <errno.h>
+#include <api/sched.h>
+#include <api/sys/resource.h>
+#include <osv/sched.hh>
+
+#include <osv/stubbing.hh>
+
+#define SCHED_PRIO_MIN 1
+#define SCHED_PRIO_MAX 99
+
+/* sched_rr_get_interval writes the time quantum for SCHED_RR
+ * processes to the tp timespec structure.
+ * This is Linux-specific, and we don't implement this yet.

We shouldn't say it's Linux-specific because it's not (see https://pubs.opengroup.org/onlinepubs/009696799/functions/sched_rr_get_interval.html)

But it's fine that we don't implement it yet.

The id passed to sched_setscheduler should be a thread id, not a process id,

so I think we should compare pid to gettid(), not getpid().

But, there's a bigger problem here:

+ errno = ESRCH;

This incorrectly claims that the other thread doesn't exist, without checking, and may confuse

an application which tries to change the scheduling of a different thread (which POSIX supports).

So a t a minimum we should print an "unimplemented" warning here, but much better is simply

to implement it, which is easy, we have an example in sched_setaffinity().

The relevant code snippet:

sched::thread *t;
if (pid == 0) {
t = sched::thread::current();
} else {
t = sched::thread::find_by_id(pid);
if (!t) {
errno = ESRCH;
return -1;
}
// TODO: After the thread was found, if it exits the code below
// may crash. Perhaps we should have a version of find_by_id(),
// with_thread_by_id(pid, func), which holds thread_map_mutex while
// func runs.
}

I think you can just use exactly that code, as is (I guess with the same TODO, which is still valid).

Same in other places below where you wrongly check getpid() and return ESRCH without actually checking.

I won't repeat the same comment - so just look for all of them.

If this code is commented out, please remove it from the patch.

You don't use the "policy" here, so why do you add it in this patch?

sched::thread::current()->set_realtime_priority(p);
std::cout << "Starting thread with realtime priority " << p << "\n";
// Sleep a bit, to let all test threads get started. The thread
--
2.30.2

--
You received this message because you are subscribed to the Google Groups "OSv Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to osv-dev+u...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/20230510085745.11416-4-luca.abeni%40santannapisa.it.

Nadav Har'El

unread,

May 10, 2023, 6:03:20 AM5/10/23

to luca abeni, Waldek Kozaczuk, osv...@googlegroups.com, Claudio Fontana

Thanks. Again, my review inline below.

--
Nadav Har'El

n...@scylladb.com

On Wed, May 10, 2023 at 11:58 AM luca abeni <luca...@gmail.com> wrote:

From: Claudio Fontana <claudio...@huawei.com>

we only support CLOCK_MONOTONIC, and we don't support remainder
due to the signal implementation.

Signed-off-by: Claudio Fontana <claudio...@huawei.com>
---
include/osv/sched.hh | 10 +++++++++
libc/time.cc | 49 +++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/include/osv/sched.hh b/include/osv/sched.hh
index c1f414ba..03502c38 100644
--- a/include/osv/sched.hh
+++ b/include/osv/sched.hh
@@ -513,6 +513,8 @@ public:
inline void wake_with_from_mutex(Action action);
template <class Rep, class Period>
static void sleep(std::chrono::duration<Rep, Period> duration);
+ template <class Clock, class Duration>
+ static void sleep(std::chrono::time_point<Clock, Duration> time_point);

The function name sleep() is deeply ingrained in C / Unix / Linux users' minds as something that takes

a relative time (duration) as parameter (i.e., sleep for N seconds), not a deadline (time_point).

So I suggest giving this new function a different name. A good name is "sleep_until", this is also the

name that C++'s thread API uses: https://en.cppreference.com/w/cpp/thread/sleep_until

Unless I'm missing something, you aren't using any of these macros below, just timespeccmp(),

so please remove all the unused ones.

In general we should need functions to add or subtract timestamps, because what we normally

would do is to convert the timespec to a std::chrono::timepoint, and those we can add and

subtract with the "+" and "-" operators which are already implemented by std::chrono. We can

also do the same with <=, by the way.

By the way, in modern C++ (and C), it's much nicer to replace macros by inline

functions, which are easier to understand and also give you some compile-time checking

on types, etc.

+
+static u64 convert(const timespec& ts)
{
return ts.tv_sec * 1000000000 + ts.tv_nsec;
}
@@ -36,6 +47,9 @@ int gettimeofday(struct timeval* tv, struct timezone* tz)
OSV_LIBC_API
int nanosleep(const struct timespec* req, struct timespec* rem)
{
+ if (!req || req->tv_nsec < 0 || req->tv_nsec >= 1000000000L || req->tv_sec < 0)
+ return libc_error(EINVAL);
+
sched::thread::sleep(std::chrono::nanoseconds(convert(*req)));
return 0;
}
@@ -127,3 +141,36 @@ clock_t clock(void)
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
return ts.tv_sec * 1000000000L + ts.tv_nsec;
}
+
+int clock_nanosleep(clockid_t clock_id, int flags,

I think Waldek would like us to put the "OSV_LIBC_API" tag on functions which implement the C library

functions - as we do above on nanosleep(). CCing Waldek to agree or correct me.

+ const struct timespec *request,
+ struct timespec *remain)
+{
+ /* XXX we are implementing really only CLOCK_MONOTONIC, */
+ /* and we don't support remain, due to signals. */
+ if (remain) {
+ UNIMPLEMENTED("clock_nanosleep(): remain not supported, due to signals");
+ }
+ if (clock_id != CLOCK_MONOTONIC) {
+ UNIMPLEMENTED("clock_nanosleep(): only CLOCK_MONOTONIC is supported");
+ }
+
+ switch (flags) {
+ case 0:
+ return nanosleep(request, NULL);
+ case TIMER_ABSTIME: {
+ struct timespec ts;
+ if (clock_gettime(clock_id, &ts) != 0) {
+ return -1;
+ }
+ if (timespeccmp(request, &ts, <=)) {
+ return 0;
+ }

Do we really need to do this check? Won't sched::thread::sleep() simply silently return immediately

if the given time point is in the past? Please check, and if it just works, you don't need the above if()s

at all, and don't need the timespeccmp function.

+ sched::thread::sleep(osv::clock::uptime::time_point(
+ osv::clock::uptime::duration(convert(*request))));
+ return 0;
+ }
+ default:
+ return libc_error(EINVAL);
+ }
+}
--
2.30.2

--
You received this message because you are subscribed to the Google Groups "OSv Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to osv-dev+u...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/20230510085745.11416-5-luca.abeni%40santannapisa.it.

Luca Abeni

unread,

May 14, 2023, 4:50:44 PM5/14/23

to Nadav Har'El, Waldek Kozaczuk, osv...@googlegroups.com, Claudio Fontana

Hi,

thanks for all the comments; I'll try to address them in the next
days (same for the comments on the other patch)

Luca

Luca Abeni

unread,

May 21, 2023, 5:53:49 PM5/21/23

to Nadav Har'El, osv...@googlegroups.com, Claudio Fontana

Hi,

On Wed, 10 May 2023 at 11:47, Nadav Har'El <n...@scylladb.com> wrote:
[...]

>> +int sched_setscheduler(pid_t pid, int policy,
>> + const struct sched_param *param)
>> +{
>> + sched::thread *t = sched::thread::current();
>> +
>> + if (pid != 0 && pid != getpid()) {
>
>
> The id passed to sched_setscheduler should be a thread id, not a process id,
> so I think we should compare pid to gettid(), not getpid().

I have to admit this comment confused me a little bit... After
re-reading the manpage, I now understand the issue and I am addressing
it.
I hope to send an updated patchset in the next few days.

Luca

Nadav Har'El

unread,

May 22, 2023, 8:26:10 AM5/22/23

to Luca Abeni, osv...@googlegroups.com, Claudio Fontana

On Mon, May 22, 2023 at 12:53 AM Luca Abeni <luca...@gmail.com> wrote:

Hi,

On Wed, 10 May 2023 at 11:47, Nadav Har'El <n...@scylladb.com> wrote:
[...]
>> +int sched_setscheduler(pid_t pid, int policy,
>> + const struct sched_param *param)
>> +{
>> + sched::thread *t = sched::thread::current();
>> +
>> + if (pid != 0 && pid != getpid()) {
>
>
> The id passed to sched_setscheduler should be a thread id, not a process id,
> so I think we should compare pid to gettid(), not getpid().

I have to admit this comment confused me a little bit... After
re-reading the manpage, I now understand the issue and I am addressing
it.

Thanks! I hope you understood that my comment about gettid() vs getpid() was just the first one,

when I saw you wrongly using getpid() in the check - but then I noted that this check isn't the

best thing to do anyway - instead of checking that pid must be the same as gettid(), we can

support any thread id by looking up the thread by id instead of assuming that it must be the

current thread.

Nadav Har'El

unread,

May 22, 2023, 8:30:35 AM5/22/23

to luca abeni, osv...@googlegroups.com, Claudio Fontana

On Wed, May 10, 2023 at 12:47 PM Nadav Har'El <n...@scylladb.com> wrote:

sched::thread *t;
if (pid == 0) {
t = sched::thread::current();
} else {
t = sched::thread::find_by_id(pid);
if (!t) {
errno = ESRCH;
return -1;
}
// TODO: After the thread was found, if it exits the code below
// may crash. Perhaps we should have a version of find_by_id(),
// with_thread_by_id(pid, func), which holds thread_map_mutex while
// func runs.
}

I think you can just use exactly that code, as is (I guess with the same TODO, which is still valid).

I looked again, and we actually have a function exactly like I wished for in the TODO :-)

with_thread_by_id() from include/osv/sched.hh. I added it in commit f142ccf42f5d4465e827e3ff0a330b62d070755d

six years ago - I guess I fulfilled my own wish but then forgot to go back to where I wanted to use it.

So you should probably use the with_thread_by_id() function.

Luca Abeni

unread,

May 22, 2023, 12:23:46 PM5/22/23

to Nadav Har'El, osv...@googlegroups.com, Claudio Fontana

Hi,

OK, I am looking at this API right now... So, the setscheduler
function would look like
int sched_setscheduler(pid_t pid, int policy, const struct sched_param *param)
{
sched::thread *t;
int res = -1;

if (!param) {
errno = EINVAL;
return -1;

}

if (pid == 0) {
t = sched::thread::current();

res = sched_setparam_aux(t, policy, param->sched_priority);
} else {
sched::with_thread_by_id(pid, [param, policy, &res] (sched::thread *t) {
if (t) res = sched_setparam_aux(t, policy, param->sched_priority);
});
if (res == -1) {
errno = ESRCH;
}
}

return res;
}

...Right? For the moment, this is only compile-tested; if you agree
that this is the right thing to do, I'll test it and I will update all
the parts of the patch where "getpid()" was erroneously called.
(note: I left the "pid == 0" check to have an efficient fastpath; If
this is a useless optimization, I'll remove the if() - the alternative
would be to always call something like "with_thread_by_id(pid ? pid:
gettid(), ...))

Luca

Waldek Kozaczuk

unread,

May 31, 2023, 12:23:06 PM5/31/23

to OSv Development

BTW I have just pushed Nadav's latest low-level real-time scheduler MR so you no longer have to send it along with others (please rebase accordingly if necessary).

I saw some of you mentioned OSv pid. It used to be 0 but now it is 2 (see constant in include/osv/pid.h). Changing from 0 to 2 helps us run some apps (like Dotnet) that do not like to run with pid 0.

Regards,

Waldek

Luca Abeni

unread,

Jun 5, 2023, 7:51:01 AM6/5/23

to Nadav Har'El, Waldek Kozaczuk, osv...@googlegroups.com, Claudio Fontana

Hi,

On Wed, 10 May 2023 at 12:03, Nadav Har'El <n...@scylladb.com> wrote:
[...]

So, I looked at this, and I think it was just some kind of
optimization: without the checks above, the sleep() function (now
sleep_until()) will be invoked, will create a timer, and will add it
to the cpu's timers list... Then, the clockevent device will try to
program the cpu timer, will see that the time is in the past, and will
consider the timer as already expired... So, when sleep() will invoke
"wait_until([&] { return t.expired(); })", this will not block the
thread.

I think this check is trying to optimize for the unlikely situation
(clock_nanosleep() asking to wait until a time in the past), so I
removed the checks... Everything seems to be working correctly
(cyclictest still works fine).

I'll send the updated patchset soon.

Luca

Reply all

Reply to author

Forward