Code Review - Add MSR device to devarch

Davide Libenzi

unread,

Nov 9, 2015, 10:01:56 AM11/9/15

to Akaros

This adds the "msr" device to devarch, allowing userspace software (ie, CPU counters programming machinery) to interact with MSRs.
I also cleaned up that file a bit, dropping a few things that were unused anywhere (there are some more in there).

https://github.com/brho/akaros/compare/master...dlibenzi:devarch_msr

The following changes since commit 1165c2bda44b7f1fb3b776c0dc5b0fb4dd499961:

  Add networking unit tests (2015-11-03 12:00:38 -0500)

are available in the git repository at:

  g...@github.com:dlibenzi/akaros devarch_msr

for you to fetch changes up to bd69792de3d53f08b7b6e15de684b22744c52f62:

  Added test for devarch MSR file (2015-11-09 06:49:04 -0800)

----------------------------------------------------------------
Davide Libenzi (6):
      Added completion data structure
      Added CPU set data structure
      Added APIs to read and write MSR values on multiple CPUs
      Cleaned up devarch.c code to remove unused code
      Plugged MSR read and write APIs into devarch MSR file
      Added test for devarch MSR file

 kern/arch/x86/devarch.c        | 430 +++++++++++++++++++----------------------
 kern/include/completion.h      |  17 ++
 kern/include/cpu_set.h         |  66 +++++++
 kern/src/Kbuild                |   1 +
 kern/src/completion.c          |  30 +++
 user/utest/devarch_file_test.c |  71 +++++++
 6 files changed, 379 insertions(+), 236 deletions(-)
 create mode 100644 kern/include/completion.h
 create mode 100644 kern/include/cpu_set.h
 create mode 100644 kern/src/completion.c
 create mode 100644 user/utest/devarch_file_test.c

Davide Libenzi

unread,

Nov 19, 2015, 8:44:24 PM11/19/15

to Akaros

If you prefer, you can hold on on this one, and do a bigger review later.

I have a devarch_msr_perf branch forked off this one, where I am adding the counter overflow based tracing.

Davide Libenzi

unread,

Nov 20, 2015, 9:51:27 AM11/20/15

to Akaros

I have rebased both branches from brho/master.

The rebase script worked fine ☺

Barret Rhoden

unread,

Nov 23, 2015, 2:38:18 PM11/23/15

to aka...@googlegroups.com

On 2015-11-19 at 17:44 "'Davide Libenzi' via Akaros"

<aka...@googlegroups.com> wrote:
> If you prefer, you can hold on on this one, and do a bigger review
> later. I have a devarch_msr_perf branch forked off this one, where I
> am adding the counter overflow based tracing.

I already started on Friday (on the airplane). Will try to finish it
up today.

Barret

Barret Rhoden

unread,

Nov 23, 2015, 3:42:59 PM11/23/15

to aka...@googlegroups.com

Hi -

On 2015-11-09 at 07:01 "'Davide Libenzi' via Akaros"

<aka...@googlegroups.com> wrote:
> This adds the "msr" device to devarch, allowing userspace software
> (ie, CPU counters programming machinery) to interact with MSRs.
> I also cleaned up that file a bit, dropping a few things that were
> unused anywhere (there are some more in there).
>
>
> https://github.com/brho/akaros/compare/master...dlibenzi:devarch_msr
>
>
> The following changes since commit
> 1165c2bda44b7f1fb3b776c0dc5b0fb4dd499961:
>
> Add networking unit tests (2015-11-03 12:00:38 -0500)
>
> are available in the git repository at:
>
> g...@github.com:dlibenzi/akaros devarch_msr
>
> for you to fetch changes up to
> bd69792de3d53f08b7b6e15de684b22744c52f62:
>
> Added test for devarch MSR file (2015-11-09 06:49:04 -0800)

Comments below:

> From b8425da6e7813b9f0c6cb2e1ebe0124236e9eeba Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <dlib...@google.com>
> Date: Sun, 8 Nov 2015 11:25:39 -0800
> Subject: Added completion data structure
>
> diff --git a/kern/src/completion.c b/kern/src/completion.c
> new file mode 100644
> index 000000000000..40e922ab8120
> --- /dev/null
> +++ b/kern/src/completion.c
> @@ -0,0 +1,30 @@
> +/* Copyright (c) 2015 Google Inc
> + * Davide Libenzi <dlib...@google.com>
> + * See LICENSE for details.
> + */
> +
> +#include <kthread.h>
> +#include <completion.h>
> +
> +void completion_init(struct completion *comp, int count)
> +{
> + cv_init(&comp->cv);

If you want to signal completion from IRQ context, then we'll need these CVs to
be irqsave. In that case, you'll need to use the irqsave variants for most of
the CV operations (e.g. cv_lock_irqsave). If you don't, you'll run the risk of
deadlocking, and CONFIG_SPINLOCK_DEBUG will complain.

> + comp->count = count;
> +}
> +
> +void completion_complete(struct completion *comp, int how_much)
> +{
> + cv_lock(&comp->cv);
> + comp->count -= how_much;
> + if (comp->count <= 0)
> + __cv_broadcast(&comp->cv);

Would it be a bug for comp->count to ever be less than 0? (Depends on how this
is used).

> + cv_unlock(&comp->cv);
> +}
> +
> +void completion_wait(struct completion *comp)
> +{
> + cv_lock(&comp->cv);
> + if (comp->count > 0)
> + cv_wait(&comp->cv);

This is fine as is.

As an FYI, if you want to be a little faster, you can use cv_wait_and_unlock,
then immediately return (careful of IRQs). And/or you could do a racy read of
comp->count to see if it's already 0 without grabbing the lock.

> + cv_unlock(&comp->cv);
> +}
> --
> 2.6.0.rc2.230.g3dd15c0
>
> From 80128f764bdc236540355df797bf3fc550b93fb7 Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <dlib...@google.com>
> Date: Sun, 8 Nov 2015 11:27:01 -0800
> Subject: Added CPU set data structure

Do we want to call these cpu_sets or core_sets? In general, we've referred to
logical processors as 'cores' or 'pcores' throughout the codebase. Though we
also have been a little inconsistent with it.

> diff --git a/kern/include/cpu_set.h b/kern/include/cpu_set.h

> +#define CPUSET_INT_BITS (sizeof(cpu_set_int_t) * 8)
> +
> +typedef uint64_t cpu_set_int_t;
> +
> +struct cpu_set {
> + cpu_set_int_t cpus[(MAX_NUM_CORES + CPUSET_INT_BITS - 1) / CPUSET_INT_BITS];

We have a helper macro for the array size calculation:
cpus[DIV_ROUND_UP(MAX_NUM_CORES, CPUSET_INT_BITS)];

> +static inline void cpu_set_setcpu(struct cpu_set *cset, unsigned int cpuno)
> +{
> + cset->cpus[cpuno / CPUSET_INT_BITS] |= 1 << (cpuno % CPUSET_INT_BITS);
> +}

This stuff all looks correct. But we should use bitops for it instead of
rolling our own. We brought in many of Linux's bitops (k/i/bitops.h,
k/a/x/bitops.h). If there's a general-purpose bitop we're missing that you
need, then we should add it to bitops.h

> From 6781115d82ce87a64894ec773b0d69671ebb84ed Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <dlib...@google.com>
> Date: Sun, 8 Nov 2015 11:33:56 -0800
> Subject: Added APIs to read and write MSR values on multiple CPUs
>

> --- a/kern/arch/x86/devarch.c
> +++ b/kern/arch/x86/devarch.c

> +uint64_t *cpuset_read_msr(const struct cpu_set *cset, uint32_t addr,
> + size_t *nvalues)
> +{
> + int cpu = core_id();
> + struct smp_read_values srv;
> +
> + srv.values = kzmalloc(num_cores * sizeof(*srv.values), KMALLOC_WAIT);
> + if (srv.values) {

With KMALLOC_WAIT, kmalloc always succeeds.

> + completion_init(&srv.comp, cpu_set_remote_count(cset));
> + for (int i = 0; i < num_cores; i++) {
> + if (cpu_set_getcpu(cset, i)) {
> + if (i == cpu)
> + srv.values[i] = read_msr(addr);
> + else
> + send_kernel_message(i, cpu_read_msr, (long) &srv, addr, 0,
> + KMSG_ROUTINE);
> + }
> + }
> + completion_wait(&srv.comp);

This whole block should be done with smp_do_in_cpus.

> + *nvalues = num_cores;
> + }
> +
> + return srv.values;
> +}
> +
> +static void cpu_write_msr(uint32_t srcid, long a0, long a1, long a2)
> +{
> + struct completion *comp = (struct completion *) a0;
> +
> + write_msr((uint32_t) a1, (uint64_t) a2);
> + completion_complete(comp, 1);
> +}
> +
> +void cpuset_write_msr(const struct cpu_set *cset, uint32_t addr, uint64_t value)
> +{
> + int cpu = core_id();
> + struct completion comp;
> +
> + completion_init(&comp, cpu_set_remote_count(cset));
> + for (int i = 0; i < num_cores; i++) {
> + if (cpu_set_getcpu(cset, i)) {
> + if (i == cpu)
> + write_msr(addr, value);
> + else
> + send_kernel_message(i, cpu_write_msr, (long) &comp, addr, value,
> + KMSG_ROUTINE);
> + }
> + }
> + completion_wait(&comp);

This whole block should be done with smp_do_in_cpus.

> From 008c4762727a39ad8e5c5dd2265317445ddb86bd Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <dlib...@google.com>
> Date: Sun, 8 Nov 2015 16:44:41 -0800
> Subject: Cleaned up devarch.c code to remove unused code

> -static char *devname(void)
> -{
> - return archdevtab.name;
> -}

Not a big deal, but the reason I had this devname() helper is that many devices
borrow code from each other. devarch only used it once, but others use devname
a bunch of times. When that code gets reused for a new device, we can leave
devname as is. e.g. copying error messages around like:
set_error(ENOSYS, "Can't tap #%s file type %d", devname(), type);

> From ca1462749dbdd1c4d8fb34973d42da5a3972aebe Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <dlib...@google.com>
> Date: Sun, 8 Nov 2015 17:49:16 -0800
> Subject: Plugged MSR read and write APIs into devarch MSR file

> --- a/kern/arch/x86/devarch.c
> +++ b/kern/arch/x86/devarch.c
> @@ -19,12 +19,15 @@
> #include <error.h>
> #include <cpio.h>
> #include <pmap.h>
> +#include <umem.h>
> #include <smp.h>
> #include <ip.h>
> #include <time.h>
> #include <cpu_set.h>
> #include <completion.h>
>
> +#define REAL_MEM_SIZE (1024 * 1024)
> +
> struct io_map {
> struct io_map *next;
> int reserved;
> @@ -48,7 +51,7 @@ enum {
> Qiow,
> Qiol,
> Qgdb,
> - Qmem,
> + Qrealmem,

So in the last cleanup patch you removed realmem and make it read up to
max_paddr bytes. In this patch, you undid that. Ideally, the parts of
this patch that made that change would be squashed into the previous
patch.

You can do this with git rebase -i. For this commit, choose "e" (edit),
then:
$ git reset HEAD^ # unstage this entire commit's changes
$ git add -p # selectively pick what should be squashed
$ git commit --amend # push those changes into the previous commit
or
git commit -m WIP-squash # make a separate commit, fix it up later
$ git add -p # add everything for *this* commit
$ git commit -c ca14627 # commit with the original message
$ git rebase --continue

> uint64_t *cpuset_read_msr(const struct cpu_set *cset, uint32_t addr,
> size_t *nvalues)
> {
> - int cpu = core_id();
> struct smp_read_values srv;
>
> - srv.values = kzmalloc(num_cores * sizeof(*srv.values), KMALLOC_WAIT);
> + srv.addr = addr;
> + srv.values = kzmalloc(num_cores * sizeof(*srv.values), 0);
> if (srv.values) {
> - completion_init(&srv.comp, cpu_set_remote_count(cset));
> - for (int i = 0; i < num_cores; i++) {
> - if (cpu_set_getcpu(cset, i)) {
> - if (i == cpu)
> - srv.values[i] = read_msr(addr);
> - else
> - send_kernel_message(i, cpu_read_msr, (long) &srv, addr, 0,
> - KMSG_ROUTINE);
> - }
> - }
> - completion_wait(&srv.comp);
> + smp_do_in_cpus(cset, cpu_read_msr, &srv);

This is exactly what I said to do for the earlier patch. I'm glad you
did it, but I would be even happier if you fixed up the git history so
that it was like that originally.

It may seem trivial, but not only does it keep the git history clean (so
that we can easily see how the code evolved), but it saves me time. I
was making comments on code that had already been fixed.

So now it looks like there's two parts of this commit that should be in
other commits. That's something that can be fixed with multiple runs of
git rebase -i, as described above.

> @@ -397,10 +386,27 @@ static long archread(struct chan *c, void *a, long n, int64_t offset)
> return n;
> case Qioalloc:
> break;
> - case Qmem:
> - printk("readmem %p %p %p %p %p\n", offset, a, n, KADDR(0),
> - max_paddr);
> - return readmem(offset, a, n, KADDR(0), max_paddr);
> + case Qrealmem:
> + return readmem(offset, a, n, KADDR(0), REAL_MEM_SIZE);

This change (realmen) is right next to Qmsr. When you try to fix this
with a rebase and do a git add -p, git won't be able to figure out that
these are separate. You'll have to specify 'e' and manually edit the
hunk when you git add. It should be pretty straightforward.

> + case Qmsr:
> + cpu_set_init(&cset);
> + cpu_set_fill_available(&cset);
> + values = cpuset_read_msr(&cset, (uint32_t) offset, &nvalues);
> + if (values) {
> + if (n >= nvalues * sizeof(uint64_t)) {
> + if (memcpy_to_user_errno(current, a, values,
> + nvalues * sizeof(uint64_t)))
> + n = -1;
> + else
> + n = nvalues * sizeof(uint64_t);
> + } else {
> + n = 0;

So if the user doesn't ask for a large enough n, we tell them EOF? Is
this a bug/error on the user's part?

> From 0407977d44cc3cbcc8c24de1dca67676b6b6189f Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <dlib...@google.com>
> Date: Sun, 8 Nov 2015 20:11:28 -0800
> Subject: Added test for devarch MSR file

> --- /dev/null
> +++ b/user/utest/devarch_file_test.c
> @@ -0,0 +1,69 @@
> +
> +#include <sys/stat.h>
> +#include <sys/types.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <unistd.h>
> +#include <fcntl.h>
> +#include <utest/utest.h>
> +
> +TEST_SUITE("DEVARCH");
> +
> +/* <--- Begin definition of test cases ---> */
> +
> +static bool test_msr(void)
> +{
> + static const int max_cpus = 256;

256 should be MAX_NUM_CORES, or MAX_NUM_CORES + 1 if you want a buffer.
To get the symbol, #include <ros/arch/arch.h>.

> + int fd = open("#arch/msr", O_RDWR);
> + uint64_t *values = malloc(max_cpus * sizeof(uint64_t));
> + uint64_t tmp = 0;
> + ssize_t n;
> +
> + UT_ASSERT_M("Failed to open MSR device file (#arch/msr)", fd >= 0);
> + UT_ASSERT_M("Failed to allocated MSR values memory", values);
> +
> + n = pread(fd, values, max_cpus * sizeof(uint64_t), 0x10);
> + UT_ASSERT_M("Failed to read MSR values from 0x10", n > 0);
> +
> + n = pread(fd, values, max_cpus * sizeof(uint64_t), 0x309);
> + UT_ASSERT_M("Failed to read MSR values from 0x309", n > 0);
> +
> + n = pwrite(fd, &tmp, sizeof(uint64_t), 0x309);
> + UT_ASSERT_M("Failed to write MSR values to 0x309",
> + n == sizeof(uint64_t));

Will these tests always be passing? What are those MSRs, and why do we allow
the user to read / write them?

> From e8cb42302b1ab90d6457356e3cc1063a499d3665 Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <dlib...@google.com>
> Date: Mon, 9 Nov 2015 16:33:48 -0800
> Subject: Added whitelisting to MSR read/write code

> --- a/kern/arch/x86/devarch.c
> +++ b/kern/arch/x86/devarch.c
> @@ -82,7 +80,19 @@ static struct dirtab archdir[Qmax] = {
> {"realmem", {Qrealmem, 0}, 0, 0444},
> {"msr", {Qmsr, 0}, 0, 0666},
> };
> -
> +/* White list entries needs to be ordered by start address, and never overlap.
> + */
> +static const struct address_range msr_rd_wlist[] = {
> + ADDRESS_RANGE(0x00000000, 0xffffffff),
> +};

I like the use of a whitelist, but we need to actually pick what we want
userspace to be able to read. They could give us a garbage MSR and probably
trigger a GPF or something nasty.

> +static const struct address_range msr_wr_wlist[] = {
> + ADDRESS_RANGE(0x000000c1, 0x000000c8),
> + ADDRESS_RANGE(0x00000186, 0x00000189),
> + ADDRESS_RANGE(0x00000199, 0x00000199),
> + ADDRESS_RANGE(0x00000309, 0x0000030b),
> + ADDRESS_RANGE(0x0000038d, 0x0000038d),
> + ADDRESS_RANGE(0x0000038f, 0x00000391),
> +};

What MSRs are these? Given the sensitivity of this stuff, it'd be nice to use
some sort of #define / name for MSRs instead of magic numbers. I like the
address range, but we lose out on explicitly marking what MSRs we are allowing.
I think using real names (e.g. MSR_FOOBAR) and also using the range should be
okay, so long as we only make ranges of similar, contiguous MSRs.

ron minnich

unread,

Nov 23, 2015, 6:28:04 PM11/23/15

to aka...@googlegroups.com

Looking at how linux does it, it's

/dev/cpu/#/msr, and you do reads/writes at offset for the msr. Further, the offset is interpreted: a true byte offset into the "MSR address space" would shift the msr # left 3 bits, since each MSR is 8 bytes, but they just do the same thing we all do, and just use the offset as an offset into "msr space", e.g. reading msr 58 it does a pread at offset 58, and you get eight bytes.

So to read all the msrs you have a pread per core :-(. But if we had an iov call that let us specify offset per iov then we could have a gather of all the msrs in the kernel that we want for all cores in one vector read, and we could even do crazy stuff like read multiple msrs per core for some set of cores into an array ... I have a strange attraction for scatter/gather IO. I.e. you could do this:

d0 = open("/dev/cpu/0/msr" ...)

d1 = open("/dev/cpu/1/msr" ...)

vread({d0, &buf[0], 16, 3a}, {d1, &buf[16], 16, 3a}) to read 3a and 3b in cores 0 and 1

ron

--
You received this message because you are subscribed to the Google Groups "Akaros" group.
To unsubscribe from this group and stop receiving emails from it, send an email to akaros+un...@googlegroups.com.
To post to this group, send email to aka...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Davide Libenzi

unread,

Nov 23, 2015, 11:17:42 PM11/23/15

to Akaros

Sure, let's turn them irqsave.

> + comp->count = count;
> +}
> +
> +void completion_complete(struct completion *comp, int how_much)
> +{
> + cv_lock(&comp->cv);
> + comp->count -= how_much;
> + if (comp->count <= 0)
> + __cv_broadcast(&comp->cv);

Would it be a bug for comp->count to ever be less than 0? (Depends on how this
is used).

For this use, it would be a bug. it might be better to keep it such that the count is a threshold, more than a limit.

> + cv_unlock(&comp->cv);
> +}
> +
> +void completion_wait(struct completion *comp)
> +{
> + cv_lock(&comp->cv);
> + if (comp->count > 0)
> + cv_wait(&comp->cv);

This is fine as is.

This seems racy, in general. In this case, we never issue to the local CPU, otherwise you could do cv_lock(), get an IRQ which does cv_lock_irqsave(), and get a deadlock.

IMO that cv_lock() should be irqsave. Like this?

void completion_wait(struct completion *comp)

{

int8_t state;

cv_lock_irqsave(&comp->cv, &state);

if (comp->count > 0)

cv_wait_and_unlock(&comp->cv);

enable_irqsave(&state);

}

As an FYI, if you want to be a little faster, you can use cv_wait_and_unlock,
then immediately return (careful of IRQs). And/or you could do a racy read of
comp->count to see if it's already 0 without grabbing the lock.

Yes, thanks.

> Subject: Added CPU set data structure

Do we want to call these cpu_sets or core_sets? In general, we've referred to
logical processors as 'cores' or 'pcores' throughout the codebase. Though we
also have been a little inconsistent with it.

OK, renaming to core_set

> diff --git a/kern/include/cpu_set.h b/kern/include/cpu_set.h

> +#define CPUSET_INT_BITS (sizeof(cpu_set_int_t) * 8)
> +
> +typedef uint64_t cpu_set_int_t;
> +
> +struct cpu_set {
> + cpu_set_int_t cpus[(MAX_NUM_CORES + CPUSET_INT_BITS - 1) / CPUSET_INT_BITS];

We have a helper macro for the array size calculation:
cpus[DIV_ROUND_UP(MAX_NUM_CORES, CPUSET_INT_BITS)];

> +static inline void cpu_set_setcpu(struct cpu_set *cset, unsigned int cpuno)
> +{
> + cset->cpus[cpuno / CPUSET_INT_BITS] |= 1 << (cpuno % CPUSET_INT_BITS);
> +}

This stuff all looks correct. But we should use bitops for it instead of
rolling our own. We brought in many of Linux's bitops (k/i/bitops.h,
k/a/x/bitops.h). If there's a general-purpose bitop we're missing that you
need, then we should add it to bitops.h

It is missing everything. get/set/clear

The operations in bitops seem to be more special operations, and targeting a single unsigned long.

I added mem_get_bit, mem_set_bit, mem_clear_bit to bitops.h.

> From 6781115d82ce87a64894ec773b0d69671ebb84ed Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <dlib...@google.com>
> Date: Sun, 8 Nov 2015 11:33:56 -0800
> Subject: Added APIs to read and write MSR values on multiple CPUs
>

> --- a/kern/arch/x86/devarch.c
> +++ b/kern/arch/x86/devarch.c

> +uint64_t *cpuset_read_msr(const struct cpu_set *cset, uint32_t addr,
> + size_t *nvalues)
> +{
> + int cpu = core_id();
> + struct smp_read_values srv;
> +
> + srv.values = kzmalloc(num_cores * sizeof(*srv.values), KMALLOC_WAIT);
> + if (srv.values) {

With KMALLOC_WAIT, kmalloc always succeeds.

This has already been changed in continuation branch. Will leave it as is here.

> + completion_init(&srv.comp, cpu_set_remote_count(cset));
> + for (int i = 0; i < num_cores; i++) {
> + if (cpu_set_getcpu(cset, i)) {
> + if (i == cpu)
> + srv.values[i] = read_msr(addr);
> + else
> + send_kernel_message(i, cpu_read_msr, (long) &srv, addr, 0,
> + KMSG_ROUTINE);
> + }
> + }
> + completion_wait(&srv.comp);

This whole block should be done with smp_do_in_cpus.

Ditto.

> + completion_init(&comp, cpu_set_remote_count(cset));
> + for (int i = 0; i < num_cores; i++) {
> + if (cpu_set_getcpu(cset, i)) {
> + if (i == cpu)
> + write_msr(addr, value);
> + else
> + send_kernel_message(i, cpu_write_msr, (long) &comp, addr, value,
> + KMSG_ROUTINE);
> + }
> + }
> + completion_wait(&comp);

This whole block should be done with smp_do_in_cpus.

Ditto.

> From 008c4762727a39ad8e5c5dd2265317445ddb86bd Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <dlib...@google.com>
> Date: Sun, 8 Nov 2015 16:44:41 -0800
> Subject: Cleaned up devarch.c code to remove unused code

> -static char *devname(void)
> -{
> - return archdevtab.name;
> -}

Not a big deal, but the reason I had this devname() helper is that many devices
borrow code from each other. devarch only used it once, but others use devname
a bunch of times. When that code gets reused for a new device, we can leave
devname as is. e.g. copying error messages around like:
set_error(ENOSYS, "Can't tap #%s file type %d", devname(), type);

There wasn't a single user of that, this is why I dropped it.

Will try. /shivering ...

> @@ -397,10 +386,27 @@ static long archread(struct chan *c, void *a, long n, int64_t offset)
> return n;
> case Qioalloc:
> break;
> - case Qmem:
> - printk("readmem %p %p %p %p %p\n", offset, a, n, KADDR(0),
> - max_paddr);
> - return readmem(offset, a, n, KADDR(0), max_paddr);
> + case Qrealmem:
> + return readmem(offset, a, n, KADDR(0), REAL_MEM_SIZE);

This change (realmen) is right next to Qmsr. When you try to fix this
with a rebase and do a git add -p, git won't be able to figure out that
these are separate. You'll have to specify 'e' and manually edit the
hunk when you git add. It should be pretty straightforward.

I will try. I might fsck it up 😀

> + case Qmsr:
> + cpu_set_init(&cset);
> + cpu_set_fill_available(&cset);
> + values = cpuset_read_msr(&cset, (uint32_t) offset, &nvalues);
> + if (values) {
> + if (n >= nvalues * sizeof(uint64_t)) {
> + if (memcpy_to_user_errno(current, a, values,
> + nvalues * sizeof(uint64_t)))
> + n = -1;
> + else
> + n = nvalues * sizeof(uint64_t);
> + } else {
> + n = 0;

So if the user doesn't ask for a large enough n, we tell them EOF? Is
this a bug/error on the user's part?

I am not sure this is a bug on the user part.

I changed to return ERANGE now.

OK.

> + int fd = open("#arch/msr", O_RDWR);
> + uint64_t *values = malloc(max_cpus * sizeof(uint64_t));
> + uint64_t tmp = 0;
> + ssize_t n;
> +
> + UT_ASSERT_M("Failed to open MSR device file (#arch/msr)", fd >= 0);
> + UT_ASSERT_M("Failed to allocated MSR values memory", values);
> +
> + n = pread(fd, values, max_cpus * sizeof(uint64_t), 0x10);
> + UT_ASSERT_M("Failed to read MSR values from 0x10", n > 0);
> +
> + n = pread(fd, values, max_cpus * sizeof(uint64_t), 0x309);
> + UT_ASSERT_M("Failed to read MSR values from 0x309", n > 0);
> +
> + n = pwrite(fd, &tmp, sizeof(uint64_t), 0x309);
> + UT_ASSERT_M("Failed to write MSR values to 0x309",
> + n == sizeof(uint64_t));

Will these tests always be passing? What are those MSRs, and why do we allow
the user to read / write them?

The 0x309 is a counter value register. Well, if we need to test write(), we need to write somewhere.

Here we write back the same value, but yes, it could be racy if other code does that.

> +/* White list entries needs to be ordered by start address, and never overlap.
> + */
> +static const struct address_range msr_rd_wlist[] = {
> + ADDRESS_RANGE(0x00000000, 0xffffffff),
> +};

I like the use of a whitelist, but we need to actually pick what we want
userspace to be able to read. They could give us a garbage MSR and probably
trigger a GPF or something nasty.

Basically, MSR varies from CPU family to CPU family.

In Linux, if you have permissions, you can read/write whatever you want.

Linux read msr code uses exception tables to catch errors.

Not sure what to do here.

> +static const struct address_range msr_wr_wlist[] = {
> + ADDRESS_RANGE(0x000000c1, 0x000000c8),
> + ADDRESS_RANGE(0x00000186, 0x00000189),
> + ADDRESS_RANGE(0x00000199, 0x00000199),
> + ADDRESS_RANGE(0x00000309, 0x0000030b),
> + ADDRESS_RANGE(0x0000038d, 0x0000038d),
> + ADDRESS_RANGE(0x0000038f, 0x00000391),
> +};

What MSRs are these? Given the sensitivity of this stuff, it'd be nice to use
some sort of #define / name for MSRs instead of magic numbers. I like the
address range, but we lose out on explicitly marking what MSRs we are allowing.
I think using real names (e.g. MSR_FOOBAR) and also using the range should be
okay, so long as we only make ranges of similar, contiguous MSRs.

Ditto.

The ranges also, depends on the number of cores.

We could have the ranges empty, and populate them with boot cmdline options.

Keep in mind that I am planning not to use the MSR device for userspace perf anymore.

Needs discussion on the separate email I sent.

Rebased branch according to comments so far. Missing some pending ...

Davide Libenzi

unread,

Nov 23, 2015, 11:23:19 PM11/23/15

to Akaros

On Mon, Nov 23, 2015 at 3:27 PM, ron minnich <rmin...@gmail.com> wrote:

Looking at how linux does it, it's
/dev/cpu/#/msr, and you do reads/writes at offset for the msr. Further, the offset is interpreted: a true byte offset into the "MSR address space" would shift the msr # left 3 bits, since each MSR is 8 bytes, but they just do the same thing we all do, and just use the offset as an offset into "msr space", e.g. reading msr 58 it does a pread at offset 58, and you get eight bytes.

So to read all the msrs you have a pread per core :-(. But if we had an iov call that let us specify offset per iov then we could have a gather of all the msrs in the kernel that we want for all cores in one vector read, and we could even do crazy stuff like read multiple msrs per core for some set of cores into an array ... I have a strange attraction for scatter/gather IO. I.e. you could do this:
d0 = open("/dev/cpu/0/msr" ...)
d1 = open("/dev/cpu/1/msr" ...)
vread({d0, &buf[0], 16, 3a}, {d1, &buf[16], 16, 3a}) to read 3a and 3b in cores 0 and 1

I am not in love in populating N cpus namespace for MSR.

We could use:

PREAD_PWRITE_OFFSET = MSR_ADDRESS | (CORE << 32)

As pread/write offset.

But keep in mind that the MSR device will not be used by userspace perf, for the reasons I outlined (and waiting for comments) in a separate email.

Davide Libenzi

unread,

Nov 24, 2015, 9:53:33 AM11/24/15

to Akaros

I added safe (exception table guarded) MSR read/write APIs.

Barret Rhoden

unread,

Nov 24, 2015, 10:19:37 AM11/24/15

to aka...@googlegroups.com

On 2015-11-23 at 20:17 "'Davide Libenzi' via Akaros"

<aka...@googlegroups.com> wrote:
> > > +void completion_wait(struct completion *comp)
> > > +{
> > > + cv_lock(&comp->cv);
> > > + if (comp->count > 0)
> > > + cv_wait(&comp->cv);
> >
> > This is fine as is.
> >
>
> This seems racy, in general. In this case, we never issue to the
> local CPU, otherwise you could do cv_lock(), get an IRQ which does
> cv_lock_irqsave(), and get a deadlock.
> IMO that cv_lock() should be irqsave.

Yeah, cv_lock would need to be irqsave, since it could be used from IRQ
context (as mentioned above).

> Like this?
>
> void completion_wait(struct completion *comp)
> {
> int8_t state;
>
> cv_lock_irqsave(&comp->cv, &state);
> if (comp->count > 0)
> cv_wait_and_unlock(&comp->cv);
> enable_irqsave(&state);
> }

Roughly. That also needs to unlock if we don't take the branch. I
think this would work:

void completion_wait(struct completion *comp)
{
int8_t state;

cv_lock_irqsave(&comp->cv, &state);
if (comp->count > 0) {
cv_wait_and_unlock(&comp->cv);
enable_irqsave(&state);

} else {
cv_unlock_irqsave(&comp->cv, &state);

}
}

> > > +static inline void cpu_set_setcpu(struct cpu_set *cset, unsigned
> > > int
> > cpuno)
> > > +{
> > > + cset->cpus[cpuno / CPUSET_INT_BITS] |= 1 << (cpuno %
> > CPUSET_INT_BITS);
> > > +}
> >
> > This stuff all looks correct. But we should use bitops for it
> > instead of rolling our own. We brought in many of Linux's bitops
> > (k/i/bitops.h, k/a/x/bitops.h). If there's a general-purpose bitop
> > we're missing that you need, then we should add it to bitops.h
> >
>
> It is missing everything. get/set/clear

I think we were missing get_bit, but we have __set_bit and __clear_bit
(the __ are the non-atomic versions).

> The operations in bitops seem to be more special operations, and
> targeting a single unsigned long.

I think they treat the pointer argument as a base address, and then
they offset to any address. For instance:

static inline void __set_bit(int nr, volatile unsigned long *addr)
{
asm volatile("bts %1,%0" : ADDR : "Ir" (nr) : "memory");
}

where bts:
If the bit base operand specifies a memory location, the
operand represents the address of the byte in memory that
contains the bit base (bit 0 of the specified byte) of the bit
string. The range of the bit position that can be referenced by
the offset operand depends on the operand size.

and for 64 bit operands, it can go from -2^63 to 2^63 - 1 (from the SDM
V2a 3-10).

> I added mem_get_bit, mem_set_bit, mem_clear_bit to bitops.h.

So I think we just need get_bit (or test_bit), which can go in the
arch/bitops.h. x86's can use the 'bt' instruction.

> > So now it looks like there's two parts of this commit that should
> > be in other commits. That's something that can be fixed with
> > multiple runs of git rebase -i, as described above.
> >
>
> Will try. /shivering ...

Thanks. Once you get the hang of it, it's pretty powerful. I think of
it as assigning hunks (that make up a diff) to various commits.

> > > + n = pwrite(fd, &tmp, sizeof(uint64_t), 0x309);
> > > + UT_ASSERT_M("Failed to write MSR values to 0x309",
> > > + n == sizeof(uint64_t));
> >
> > Will these tests always be passing? What are those MSRs, and why
> > do we allow
> > the user to read / write them?
> >
>
> The 0x309 is a counter value register. Well, if we need to test
> write(), we need to write somewhere.

Sorry if I was unclear, I was mostly looking for a #define naming the
register.

> > +/* White list entries needs to be ordered by start address, and
> > never overlap.
> > > + */
> > > +static const struct address_range msr_rd_wlist[] = {
> > > + ADDRESS_RANGE(0x00000000, 0xffffffff),
> > > +};
> >
> > I like the use of a whitelist, but we need to actually pick what we
> > want userspace to be able to read. They could give us a garbage
> > MSR and probably
> > trigger a GPF or something nasty.
> >
>
> Basically, MSR varies from CPU family to CPU family.

Do the values vary for a particular MSR, or just that the availability
of MSRs vary based on the family? I was under the impression that it's
the latter (availability varies, not the value). In that case, we can
be explicit about what MSRs we're giving access to.

> In Linux, if you have permissions, you can read/write whatever you
> want.

I guess this means if you're root? I'd like for us to limit the
whitelist to things we need, if we're going to open this up for any
use. (But it sounds like you don't need it for perf, so maybe we can
put this in something for "admin only" in the future).

> Linux read msr code uses exception tables to catch errors.
> Not sure what to do here.

Ah nice. I guess we'd need something similar. Even if we whitelist
MSR_FOOBAR in the code, knowing that it is okay for the user to
read/write it, but then it turns out we run on a machine that doesn't
have that MSR, we'd still GPF. So we'd want something like the
exception fixups.

> > > +static const struct address_range msr_wr_wlist[] = {
> > > + ADDRESS_RANGE(0x000000c1, 0x000000c8),
> > > + ADDRESS_RANGE(0x00000186, 0x00000189),
> > > + ADDRESS_RANGE(0x00000199, 0x00000199),
> > > + ADDRESS_RANGE(0x00000309, 0x0000030b),
> > > + ADDRESS_RANGE(0x0000038d, 0x0000038d),
> > > + ADDRESS_RANGE(0x0000038f, 0x00000391),

> The ranges also, depends on the number of cores.

Why is that?

> We could have the ranges empty, and populate them with boot cmdline
> options. Keep in mind that I am planning not to use the MSR device
> for userspace perf anymore.

Ah, okay.

> Rebased branch according to comments so far. Missing some pending ...

Thanks!

Barret

Barret Rhoden

unread,

Nov 24, 2015, 10:20:03 AM11/24/15

to aka...@googlegroups.com

On 2015-11-24 at 06:53 "'Davide Libenzi' via Akaros"

<aka...@googlegroups.com> wrote:
> I added safe (exception table guarded) MSR read/write APIs.

awesome!

Davide Libenzi

unread,

Nov 24, 2015, 11:21:31 AM11/24/15

to Akaros

On Tue, Nov 24, 2015 at 7:19 AM, Barret Rhoden <br...@cs.berkeley.edu> wrote:

Roughly. That also needs to unlock if we don't take the branch. I
think this would work:

void completion_wait(struct completion *comp)
{
int8_t state;

cv_lock_irqsave(&comp->cv, &state);
if (comp->count > 0) {
cv_wait_and_unlock(&comp->cv);
enable_irqsave(&state);
} else {
cv_unlock_irqsave(&comp->cv, &state);
}
}

Right, done.

> > > +static inline void cpu_set_setcpu(struct cpu_set *cset, unsigned
> > > int
> > cpuno)
> > > +{
> > > + cset->cpus[cpuno / CPUSET_INT_BITS] |= 1 << (cpuno %
> > CPUSET_INT_BITS);
> > > +}
> >
> > This stuff all looks correct. But we should use bitops for it
> > instead of rolling our own. We brought in many of Linux's bitops
> > (k/i/bitops.h, k/a/x/bitops.h). If there's a general-purpose bitop
> > we're missing that you need, then we should add it to bitops.h
> >
>
> It is missing everything. get/set/clear

I think we were missing get_bit, but we have __set_bit and __clear_bit
(the __ are the non-atomic versions).

Oh, I missed the *arch* bitops.h file 😀

Undoing the mem_ ... adding get_bit()

> > > + n = pwrite(fd, &tmp, sizeof(uint64_t), 0x309);
> > > + UT_ASSERT_M("Failed to write MSR values to 0x309",
> > > + n == sizeof(uint64_t));
> >
> > Will these tests always be passing? What are those MSRs, and why
> > do we allow
> > the user to read / write them?
> >
>
> The 0x309 is a counter value register. Well, if we need to test
> write(), we need to write somewhere.

Sorry if I was unclear, I was mostly looking for a #define naming the
register.

Oh, OK.

> > +/* White list entries needs to be ordered by start address, and
> > never overlap.
> > > + */
> > > +static const struct address_range msr_rd_wlist[] = {
> > > + ADDRESS_RANGE(0x00000000, 0xffffffff),
> > > +};
> >
> > I like the use of a whitelist, but we need to actually pick what we
> > want userspace to be able to read. They could give us a garbage
> > MSR and probably
> > trigger a GPF or something nasty.
> >
>
> Basically, MSR varies from CPU family to CPU family.

Do the values vary for a particular MSR, or just that the availability
of MSRs vary based on the family? I was under the impression that it's
the latter (availability varies, not the value). In that case, we can
be explicit about what MSRs we're giving access to.

The number of MSRs depends on the number of counters.

I will try to add #define based ranges there, and we can defer the MSR device access policy to when we have a policy.

> In Linux, if you have permissions, you can read/write whatever you
> want.

I guess this means if you're root? I'd like for us to limit the
whitelist to things we need, if we're going to open this up for any
use. (But it sounds like you don't need it for perf, so maybe we can
put this in something for "admin only" in the future).

OK.

> Linux read msr code uses exception tables to catch errors.
> Not sure what to do here.

Ah nice. I guess we'd need something similar. Even if we whitelist
MSR_FOOBAR in the code, knowing that it is okay for the user to
read/write it, but then it turns out we run on a machine that doesn't
have that MSR, we'd still GPF. So we'd want something like the
exception fixups.

Added.

> > > +static const struct address_range msr_wr_wlist[] = {
> > > + ADDRESS_RANGE(0x000000c1, 0x000000c8),
> > > + ADDRESS_RANGE(0x00000186, 0x00000189),
> > > + ADDRESS_RANGE(0x00000199, 0x00000199),
> > > + ADDRESS_RANGE(0x00000309, 0x0000030b),
> > > + ADDRESS_RANGE(0x0000038d, 0x0000038d),
> > > + ADDRESS_RANGE(0x0000038f, 0x00000391),
> The ranges also, depends on the number of cores.

Why is that?

Sorry, number of *counters*, which is ARCH specific.

Branch updated.

Barret Rhoden

unread,

Nov 24, 2015, 2:15:50 PM11/24/15

to aka...@googlegroups.com

On 2015-11-24 at 08:21 "'Davide Libenzi' via Akaros"

<aka...@googlegroups.com> wrote:
> > void completion_wait(struct completion *comp)
> > {
> > int8_t state;
> >
> > cv_lock_irqsave(&comp->cv, &state);
> > if (comp->count > 0) {
> > cv_wait_and_unlock(&comp->cv);
> > enable_irqsave(&state);
> > } else {
> > cv_unlock_irqsave(&comp->cv, &state);
> > }
> > }
> >

My fault, state needs to be initialized:
int8_t state = 0;

I can touch that up if you'd like.

> > I think we were missing get_bit, but we have __set_bit and
> > __clear_bit (the __ are the non-atomic versions).
> >
>
> Oh, I missed the *arch* bitops.h file 😀
> Undoing the mem_ ... adding get_bit()

Again, my fault, I missed test_bit() in arch bitops.h. It's a macro,
which might be why I missed it. Not sure. =P

I can touch that up too (replace get_bit with test_bit).

One other minor thing is that 3f21a3870 "Added CPU set data structure"
adds both core_set.h and cpu_set.h. I can also touch that up if you'd
like.

Let me know if you want to update the branch or if you want me to do
the touchups.

Thanks,

Barret

Davide Libenzi

unread,

Nov 24, 2015, 2:38:27 PM11/24/15

to Akaros

I'll take care of that.

Barret

Barret Rhoden

unread,

Nov 24, 2015, 2:42:53 PM11/24/15

to aka...@googlegroups.com

On 2015-11-24 at 11:38 "'Davide Libenzi' via Akaros"

<aka...@googlegroups.com> wrote:
> I'll take care of that.

Thanks.

Also, we'll need to hook the exception fixups into the GPF handling
code:

/ $ cat \#arch/msr
HW TRAP frame at 0xffff80013ec7fad0 on core 0
rax 0x0000000000000000
rbx 0x0000000000000000
rcx 0x0000000000000040
rdx 0xffff80013fff9d10
rbp 0xffff80013ec7fb98
rsi 0x0000000000000000
rdi 0xffff80013ec7fca0
r8 0xffff80013fff9d10
r9 0xffff80013ec7fc20
r10 0x00000000000000b0
r11 0xffff80013ec7fba8
r12 0xffff80013ec7fca0
r13 0xffffffffc20fc320
r14 0xffff80013ec7fcf8
r15 0x0000000000000001
trap 0x0000000d General Protection
gsbs 0xffffffffc6abb100
fsbs 0x00000000deadbeef
err 0x--------00000000
rip 0xffffffffc20fc33a
cs 0x------------0008
flag 0x0000000000000246
rsp 0xffff80013ec7fb98
ss 0x------------0010
kernel panic at kern/arch/x86/trap.c:385, from core 0: Damn Damn! Unhandled trap in the kernel!
Entering Nanwan's Dungeon on Core 0 (Ints off):
Type 'help' for a list of commands.

Davide Libenzi

unread,

Nov 24, 2015, 3:24:55 PM11/24/15

to Akaros

Should be all set now.

Barret Rhoden

unread,

Nov 24, 2015, 3:43:35 PM11/24/15

to aka...@googlegroups.com

On 2015-11-24 at 12:24 "'Davide Libenzi' via Akaros"

<aka...@googlegroups.com> wrote:
> Should be all set now.

Awesome, thanks!

Thanks, merged to master at aecd50eb71c2..abe793d7c750 (from, to]

You can see the entire diff with 'git diff' or at
https://github.com/brho/akaros/compare/aecd50eb71c2...abe793d7c750

Reply all

Reply to author

Forward