Merging relayfs?

Tom Zanussi

Jul 11, 2005, 9:19:01 PM

to ak...@osdl.org, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com

Hi Andrew, can you please merge relayfs? It provides a low-overhead
logging and buffering capability, which does not currently exist in
the kernel.

relayfs key features:

- Extremely efficient high-speed logging/buffering
- Simple mechanism for user-space data retrieval
- Very short write path
- Can be used in any context, including interrupt context
- No runtime resource allocation
- Doesn't do a kmalloc for each "packet"
- No need for end-recipient
- Data may remain buffered whether it is consumed or not
- Data committed to disk in bulk, not per "packet"
- Can be used in circular-buffer mode for flight-recording
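
The circular-buffer (flight-recording) mode in the last item can be pictured with a small user-space sketch in Python. This is only an illustration of the idea under stated assumptions: the `FlightRecorder` class and its methods are hypothetical names, not the relayfs API, and a real in-kernel buffer works on raw memory rather than Python objects.

```python
class FlightRecorder:
    """Fixed-size ring of records; the oldest record is overwritten when full.

    Hypothetical illustration only; this is not the relayfs API.
    """

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # All slots are created up front, so log() never grows the buffer.
        self._slots = [None] * capacity
        self._next = 0    # index of the slot the next record goes into
        self._count = 0   # number of valid records, at most capacity

    def log(self, record):
        """Store a record, overwriting the oldest one if the ring is full."""
        self._slots[self._next] = record
        self._next = (self._next + 1) % len(self._slots)
        self._count = min(self._count + 1, len(self._slots))

    def dump(self):
        """Return the retained records, oldest first."""
        start = (self._next - self._count) % len(self._slots)
        return [self._slots[(start + i) % len(self._slots)]
                for i in range(self._count)]
```

Logging five records into a three-slot recorder and then calling `dump()` returns only the last three, which is the "latest events before the crash" behaviour that flight recording is after.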

The relayfs code has been in -mm for more than three months following
the extensive review that took place on LKML at the beginning of the
year, at which time we addressed all of the issues people had. Since
then only a few minor patches to the original codebase have been
needed, most of which were sent to us by users; we'd like to thank
those who took the time to send patches or point out problems.

The code in the -mm tree has also been pounded on very heavily through
normal use and testing, and we haven't seen any problems with it - it
appears to be very stable.

We've also tried to make it as easy as possible for people to create
'quick and dirty' (or more substantial) kernel logging applications.
Included is a link to an example that demonstrates how useful this can
be. In a nutshell, it uses relayfs logging functions to track
kmalloc/kfree and detect memory leaks. The only thing it does in the
kernel is to log a small binary record for each kmalloc and kfree.
The data is then post-processed in user space with a simple Perl
script. You can see an example of the output and the example itself
here:

http://relayfs.sourceforge.net/examples.html#kleak
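
For readers who do not follow the link, the user-space half of such a leak detector can be sketched roughly as below. This is not the kleak script (which is Perl) and the record layout is an assumption made for the illustration: one byte of event type, an 8-byte address, and a 4-byte size, little-endian. The names `RECORD`, `KMALLOC`, `KFREE`, and `find_leaks` are hypothetical.

```python
import struct

# Hypothetical record layout (NOT the kleak format): little-endian
# event type (1 byte), address (8 bytes), size in bytes (4 bytes).
RECORD = struct.Struct("<BQI")
KMALLOC, KFREE = 0, 1


def find_leaks(log):
    """Return {address: size} for allocations never freed within `log`.

    `log` is a bytes object holding back-to-back RECORD entries.
    """
    live = {}
    for event, addr, size in RECORD.iter_unpack(log):
        if event == KMALLOC:
            live[addr] = size
        elif event == KFREE:
            # Frees of addresses allocated before logging began are ignored.
            live.pop(addr, None)
    return live
```

Whatever is left in the returned dictionary at the end of the log is a candidate leak; a real tool would also record the call site so the report can say where the allocation came from.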

Last but not least, it's still small (40k worth of source),
self-contained and unobtrusive to the rest of the kernel.

In summary, relayfs is very stable, is useful to current users and,
with inclusion, would be useful to many others. If you can think of
anything we've overlooked or should work on to get relayfs to the
point of inclusion, please let us know.

Thanks,

Tom Zanussi
Karim Yaghmour


Andrew Morton

Jul 11, 2005, 9:51:04 PM

to Tom Zanussi, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com

Tom Zanussi <zan...@us.ibm.com> wrote:
>
> Hi Andrew, can you please merge relayfs?

I guess so. Would you have time to prepare a list of existing and planned
applications?

Dave Airlie

Jul 11, 2005, 10:18:53 PM

to Andrew Morton, Tom Zanussi, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com

> >
> > Hi Andrew, can you please merge relayfs?
>
> I guess so. Would you have time to prepare a list of existing and planned
> applications?

I have a plan to use it for something that no-one knows about yet..

I was going to use it for doing a DRM packet debug logger... to try
and trace hangs in the system, using printk doesn't really help as
guess what it slows the machine down so much that your races don't
happen... I wrote some basic code for this already.. and I'm hoping to
use some work time to get it finished at some stage...

Dave.

Tom Zanussi

Jul 11, 2005, 10:23:41 PM

to Andrew Morton, Tom Zanussi, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com, ros...@goodmis.org, bar...@ev-en.org, pras...@us.ibm.com

Andrew Morton writes:
> Tom Zanussi <zan...@us.ibm.com> wrote:
> >
> > Hi Andrew, can you please merge relayfs?
>
> I guess so. Would you have time to prepare a list of existing and planned
> applications?

Sure. I know that systemtap (http://sourceware.org/systemtap/) is
using relayfs and that LTT (http://www.opersys.com/ltt/index.html) is
also currently being reworked to use it.

I've also added a couple of people to the cc: list that I've consulted
with in getting their applications to use relayfs, one of which is the
logdev debugging device recently posted to LKML.

I also know that there are still users of the old relayfs around; I
don't however know what their plans are regarding moving to the new
relayfs.

My own personal interest is to start playing around with creating some
visualization tools using data gathered from relayfs. Hopefully, I'll
have more time to do that if relayfs gets merged. ;-)

Hope that helps,

Tom

Christoph Hellwig

Jul 11, 2005, 10:26:51 PM

to Tom Zanussi, ak...@osdl.org, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com

On Mon, Jul 11, 2005 at 08:10:42PM -0500, Tom Zanussi wrote:
>
> Hi Andrew, can you please merge relayfs? It provides a low-overhead
> logging and buffering capability, which does not currently exist in
> the kernel.

While the code is pretty nicely in shape it seems rather pointless to
merge until an actual user goes with it.

Andrew Morton

Jul 11, 2005, 10:36:58 PM

to Christoph Hellwig, zan...@us.ibm.com, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com

Christoph Hellwig <h...@infradead.org> wrote:
>
> On Mon, Jul 11, 2005 at 08:10:42PM -0500, Tom Zanussi wrote:
> >
> > Hi Andrew, can you please merge relayfs? It provides a low-overhead
> > logging and buffering capability, which does not currently exist in
> > the kernel.
>
> While the code is pretty nicely in shape it seems rather pointless to
> merge until an actual user goes with it.

Ordinarily I'd agree. But this is a bit like kprobes - it's a funny thing
which other kernel features rely upon, but those features are often ad-hoc
and aren't intended for merging.

relayfs is more for in-kernel "applications" than for userspace ones, if
you like.

Still, first let us get a handle on who wants relayfs now and in the future
and for what. Then we can better decide.

Greg KH

Jul 11, 2005, 11:07:17 PM

to Tom Zanussi, ak...@osdl.org, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com

On Mon, Jul 11, 2005 at 08:10:42PM -0500, Tom Zanussi wrote:
>

> Hi Andrew, can you please merge relayfs? It provides a low-overhead
> logging and buffering capability, which does not currently exist in
> the kernel.
>
> relayfs key features:
>
> - Extremely efficient high-speed logging/buffering
> - Simple mechanism for user-space data retrieval
> - Very short write path
> - Can be used in any context, including interrupt context
> - No runtime resource allocation
> - Doesn't do a kmalloc for each "packet"
> - No need for end-recipient
> - Data may remain buffered whether it is consumed or not
> - Data committed to disk in bulk, not per "packet"
> - Can be used in circular-buffer mode for flight-recording

What ever happened to exporting the relayfs file ops, and just using
debugfs as your controlling fs instead? As all of the possible users
fall under the "debug" type of kernel feature, it makes more sense to
confine users to that fs, right?

thanks,

greg k-h

Karim Yaghmour

Jul 11, 2005, 11:09:57 PM

to Andrew Morton, Christoph Hellwig, zan...@us.ibm.com, linux-...@vger.kernel.org, va...@us.ibm.com, richard...@uk.ibm.com

Andrew Morton wrote:
> Still, first let us get a handle on who wants relayfs now and in the future
> and for what. Then we can better decide.

We used relayfs for our series of tests on PREEMPT_RT and I-Pipe.
Specifically, we used relayfs buffers to store the timestamps for our
interrupt latency measurements. This allowed us to easily have access
to very large buffering areas without having to worry about any form
of detailed resource allocation, or runtime overhead of logging. IOW,
it allowed us to concentrate on our main priority: logging a very large
number of timestamps.

On the LTT side, relayfs is bound to be at the center of whatever
architecture we settle on for the ongoing rewrite. Having used it
for past releases of LTT, we know that it can handle very heavy data
throughput with little overhead using a relatively simple API.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546

Karim Yaghmour

Jul 11, 2005, 11:21:24 PM

to Greg KH, Tom Zanussi, ak...@osdl.org, linux-...@vger.kernel.org, va...@us.ibm.com, richard...@uk.ibm.com

Greg KH wrote:
> What ever happened to exporting the relayfs file ops, and just using
> debugfs as your controlling fs instead? As all of the possible users
> fall under the "debug" type of kernel feature, it makes more sense to
> confine users to that fs, right?

Actually, like we discussed the last time this surfaced, there are far
more users for relayfs than just debugging. What we settled on was
having relayfs export its file ops so that indeed debugfs users could
use it to log things in conjunction with debugfs.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546

Greg KH

Jul 11, 2005, 11:26:27 PM

to Karim Yaghmour, Tom Zanussi, ak...@osdl.org, linux-...@vger.kernel.org, va...@us.ibm.com, richard...@uk.ibm.com

On Mon, Jul 11, 2005 at 11:03:59PM -0400, Karim Yaghmour wrote:
>
> Greg KH wrote:
> > What ever happened to exporting the relayfs file ops, and just using
> > debugfs as your controlling fs instead? As all of the possible users
> > fall under the "debug" type of kernel feature, it makes more sense to
> > confine users to that fs, right?
>
> Actually, like we discussed the last time this surfaced, there are far
> more users for relayfs than just debugging.

Based on the proposed users of this fs, I don't see any. What ones are
you saying are not "debug" type operations? And yes, I consider LTT a
"debug" type operation :)

The best part of this is that it gives distros and users a consistent place
to mount the fs, and to know where this kind of thing shows up in the fs
namespace.

> What we settled on was having relayfs export its file ops so that
> indeed debugfs users could use it to log things in conjunction with
> debugfs.

Last I looked, this was not possible. Has this changed in the latest
version?

thanks,

greg k-h

Tom Zanussi

Jul 11, 2005, 11:57:35 PM

to Greg KH, Karim Yaghmour, Tom Zanussi, ak...@osdl.org, linux-...@vger.kernel.org, va...@us.ibm.com, richard...@uk.ibm.com

Greg KH writes:
> On Mon, Jul 11, 2005 at 11:03:59PM -0400, Karim Yaghmour wrote:
> >
> > Greg KH wrote:
> > > What ever happened to exporting the relayfs file ops, and just using
> > > debugfs as your controlling fs instead? As all of the possible users
> > > fall under the "debug" type of kernel feature, it makes more sense to
> > > confine users to that fs, right?
> >
> > Actually, like we discussed the last time this surfaced, there are far
> > more users for relayfs than just debugging.
>
> Based on the proposed users of this fs, I don't see any. What ones are
> you saying are not "debug" type operations? And yes, I consider LTT a
> "debug" type operation :)
>
> The best part of this is that it gives distros and users a consistent place
> to mount the fs, and to know where this kind of thing shows up in the fs
> namespace.

Makes sense, and I don't see a problem with getting rid of the fs part
of relayfs and letting debugfs take over that role, if debugfs were
there for all potential users. It doesn't sound like it would satisfy
users like LTT and systemtap though, who expect to be available at all
times even on production systems, which wouldn't be the case unless
the distros always shipped with debugfs enabled.

>
> > What we settled on was having relayfs export its file ops so that
> > indeed debugfs users could use it to log things in conjunction with
> > debugfs.
>
> Last I looked, this was not possible. Has this changed in the latest
> version?

The file operations are all exported, but I haven't actually tried to
use relayfs files in debugfs. Is there something more needed?

Tom

Karim Yaghmour

Jul 12, 2005, 12:00:29 AM

to Greg KH, Tom Zanussi, ak...@osdl.org, linux-...@vger.kernel.org, va...@us.ibm.com, richard...@uk.ibm.com

Greg KH wrote:
> Based on the proposed users of this fs, I don't see any. What ones are
> you saying are not "debug" type operations? And yes, I consider LTT a
> "debug" type operation :)
>
> The best part of this is that it gives distros and users a consistent place
> to mount the fs, and to know where this kind of thing shows up in the fs
> namespace.

Except that relayfs contains files that all behave in a very specific
way: as relayfs buffers, while debugfs may contain a variety of different
types of files.

I kind'a see what you're trying to say, and I fully understand that some
debugfs users may indeed use the relayfs fileops to add an entry in
debugfs which serves as a buffer, and that's the very reason we exported
them to boot. But there's something to be said about having a single
filesystem (and therefore tree somewhere in /) which contains entries
dedicated to a single purpose: dump huge amounts of data out of the
kernel and into userspace whether or not the system is being debugged.

From a user point of view, it sounds awfully weird if they're using
"debugfs" on a production system ...

> Last I looked, this was not possible. Has this changed in the latest
> version?

Here's from 2.6.13-rc2-mm1 fs/relayfs/inode.c
> +EXPORT_SYMBOL_GPL(relayfs_open);
> +EXPORT_SYMBOL_GPL(relayfs_poll);
> +EXPORT_SYMBOL_GPL(relayfs_mmap);
> +EXPORT_SYMBOL_GPL(relayfs_release);
> +EXPORT_SYMBOL_GPL(relayfs_file_operations);
> +EXPORT_SYMBOL_GPL(relayfs_create_dir);
> +EXPORT_SYMBOL_GPL(relayfs_remove_dir);

It's been there ever since you've asked for it earlier this year :)

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546

Greg KH

Jul 12, 2005, 12:34:01 AM

to Karim Yaghmour, Tom Zanussi, ak...@osdl.org, linux-...@vger.kernel.org, va...@us.ibm.com, richard...@uk.ibm.com

On Mon, Jul 11, 2005 at 11:52:57PM -0400, Karim Yaghmour wrote:
>
> Greg KH wrote:
> > Based on the proposed users of this fs, I don't see any. What ones are
> > you saying are not "debug" type operations? And yes, I consider LTT a
> > "debug" type operation :)
> >
> > The best part of this is that it gives distros and users a consistent place
> > to mount the fs, and to know where this kind of thing shows up in the fs
> > namespace.
>
> Except that relayfs contains files that all behave in a very specific
> way: as relayfs buffers, while debugfs may contain a variety of different
> types of files.

The path/filename dictates how it is used, so putting relayfs type files
in debugfs is just fine. debugfs allows any types of files to be there.

> I kind'a see what you're trying to say, and I fully understand that some
> debugfs users may indeed use the relayfs fileops to add an entry in
> debugfs which serves as a buffer, and that's the very reason we exported
> them to boot.

Good.

> But there's something to be said about having a single filesystem (and
> therefore tree somewhere in /)

New trees in / are not LSB compliant, hence the reason for writing
securityfs to get rid of /selinux and other LSM filesystems that were
starting to sprout up.

> which contains entries dedicated to a single purpose: dump huge
> amounts of data out of the kernel and into userspace whether or not
> the system is being debugged.

But that's exactly what debugfs is for, to allow data to be dumped out
of the kernel for different usages.

> From a user point of view, it sounds awfully weird if they're using
> "debugfs" on a production system ...

Ok, have a better name for it? It's simple and easy to understand.

> > Last I looked, this was not possible. Has this changed in the latest
> > version?
>
> Here's from 2.6.13-rc2-mm1 fs/relayfs/inode.c
> > +EXPORT_SYMBOL_GPL(relayfs_open);
> > +EXPORT_SYMBOL_GPL(relayfs_poll);
> > +EXPORT_SYMBOL_GPL(relayfs_mmap);
> > +EXPORT_SYMBOL_GPL(relayfs_release);
> > +EXPORT_SYMBOL_GPL(relayfs_file_operations);
> > +EXPORT_SYMBOL_GPL(relayfs_create_dir);
> > +EXPORT_SYMBOL_GPL(relayfs_remove_dir);
>
> It's been there ever since you've asked for it earlier this year :)

Thanks, didn't realize that. Wait, all that should be needed is
"relayfs_file_operations", right? Why have those others exported?

thanks,

greg k-h

Greg KH

Jul 12, 2005, 12:36:38 AM

to Tom Zanussi, Karim Yaghmour, ak...@osdl.org, linux-...@vger.kernel.org, va...@us.ibm.com, richard...@uk.ibm.com

On Mon, Jul 11, 2005 at 10:55:33PM -0500, Tom Zanussi wrote:
> Greg KH writes:
> > On Mon, Jul 11, 2005 at 11:03:59PM -0400, Karim Yaghmour wrote:
> > >
> > > Greg KH wrote:
> > > > What ever happened to exporting the relayfs file ops, and just using
> > > > debugfs as your controlling fs instead? As all of the possible users
> > > > fall under the "debug" type of kernel feature, it makes more sense to
> > > > confine users to that fs, right?
> > >
> > > Actually, like we discussed the last time this surfaced, there are far
> > > more users for relayfs than just debugging.
> >
> > Based on the proposed users of this fs, I don't see any. What ones are
> > you saying are not "debug" type operations? And yes, I consider LTT a
> > "debug" type operation :)
> >
> > > The best part of this is that it gives distros and users a consistent place
> > to mount the fs, and to know where this kind of thing shows up in the fs
> > namespace.
>
> Makes sense, and I don't see a problem with getting rid of the fs part
> of relayfs and letting debugfs take over that role, if debugfs were
> there for all potential users. It doesn't sound like it would satisfy
> users like LTT and systemtap though, who expect to be available at all
> times even on production systems, which wouldn't be the case unless
> the distros always shipped with debugfs enabled.

They will, the overhead of adding debugfs support is _very_ tiny, only:
$ size fs/debugfs/built-in.o
   text    data     bss     dec     hex filename
   2257     788       8    3053     bed fs/debugfs/built-in.o

So I do not see why you should not just drop your fs part.

> > > What we settled on was having relayfs export its file ops so that
> > > indeed debugfs users could use it to log things in conjunction with
> > > debugfs.
> >
> > Last I looked, this was not possible. Has this changed in the latest
> > version?
>
> The file operations are all exported, but I haven't actually tried to
> use relayfs files in debugfs. Is there something more needed?

Shouldn't be. Try it to make sure though :)

thanks,

greg k-h

Andi Kleen

Jul 12, 2005, 12:42:00 AM

to Andrew Morton, linux-...@vger.kernel.org

Andrew Morton <ak...@osdl.org> writes:

> Christoph Hellwig <h...@infradead.org> wrote:
> >
> > On Mon, Jul 11, 2005 at 08:10:42PM -0500, Tom Zanussi wrote:
> > >
> > > Hi Andrew, can you please merge relayfs? It provides a low-overhead
> > > logging and buffering capability, which does not currently exist in
> > > the kernel.
> >

> > While the code is pretty nicely in shape it seems rather pointless to
> > merge until an actual user goes with it.
>
> Ordinarily I'd agree. But this is a bit like kprobes - it's a funny thing
> which other kernel features rely upon, but those features are often ad-hoc
> and aren't intended for merging.

Yes, it's a special case because it's useful for custom debugging
hacks. I would be in favour of merging it.

-Andi

Karim Yaghmour

Jul 12, 2005, 12:47:32 AM

to Greg KH, Tom Zanussi, ak...@osdl.org, linux-...@vger.kernel.org, va...@us.ibm.com, richard...@uk.ibm.com

Greg KH wrote:
> The path/filename dictates how it is used, so putting relayfs type files
> in debugfs is just fine. debugfs allows any types of files to be there.

..

> New trees in / are not LSB compliant, hence the reason for writing
> securityfs to get rid of /selinux and other LSM filesystems that were
> starting to sprout up.

..

> But that's exactly what debugfs is for, to allow data to be dumped out
> of the kernel for different usages.

..
> Ok, have a better name for it? It's simple and easy to understand.

It also carries with it the stigma of "kernel debugging", which I just
don't see production system maintainers liking very much.

So tell you what, how about if we merged what's in debugfs into relayfs
instead? We'll still end up with one filesystem, but we'll have a more
innocuous name. After all, if debugfs is indeed for dumping data from the
kernel to user-space for different usages, then relaying is what it's
actually doing, right?

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546

Greg KH

Jul 12, 2005, 1:35:17 AM

to Karim Yaghmour, Tom Zanussi, ak...@osdl.org, linux-...@vger.kernel.org, va...@us.ibm.com, richard...@uk.ibm.com

On Tue, Jul 12, 2005 at 12:40:41AM -0400, Karim Yaghmour wrote:
>
> Greg KH wrote:
> > The path/filename dictates how it is used, so putting relayfs type files
> > in debugfs is just fine. debugfs allows any types of files to be there.

> ...

> > New trees in / are not LSB compliant, hence the reason for writing
> > securityfs to get rid of /selinux and other LSM filesystems that were
> > starting to sprout up.

> ...

> > But that's exactly what debugfs is for, to allow data to be dumped out
> > of the kernel for different usages.

> ...

> > Ok, have a better name for it? It's simple and easy to understand.
>
> It also carries with it the stigma of "kernel debugging", which I just
> don't see production system maintainers liking very much.

But they like the name "dtrace" instead? (sorry, couldn't resist...)

Come on, they will never see the name "debugfs", right? Your tools will
then have a common place to look for your ltt and other files, as you
_know_ where it will be mounted in the fs namespace.

And you _are_ doing kernel debugging and tracing with ltt, what's wrong
with admitting that?

> So tell you what, how about if we merged what's in debugfs into relayfs
> instead? We'll still end up with one filesystem, but we'll have a more
> innocuous name. After all, if debugfs is indeed for dumping data from the
> kernel to user-space for different usages, then relaying is what it's
> actually doing, right?

Sorry, but debugfs was there first, and people are already using it in
the kernel tree :)

Anyway, good luck trying to get the distros to accept
yet-another-fs-to-mount-somewhere, I know it was hard to get support for
sysfs as it was...

greg k-h

Baruch Even

Jul 12, 2005, 5:17:18 AM

to Tom Zanussi, Andrew Morton, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com, ros...@goodmis.org, pras...@us.ibm.com

Tom Zanussi wrote:
> Andrew Morton writes:
> > Tom Zanussi <zan...@us.ibm.com> wrote:
> > >
> > > Hi Andrew, can you please merge relayfs?
> >
> > I guess so. Would you have time to prepare a list of existing and planned
> > applications?
>

> I've also added a couple of people to the cc: list that I've consulted
> with in getting their applications to use relayfs, one of which is the
> logdev debugging device recently posted to LKML.

I'm using relayfs during my development work to log the current TCP
stack parameters and timing information. There is no reason that I can
see to merge this into the kernel, but it's very useful for my
development work.

I'd like to see relayfs merged.

Baruch

Steven Rostedt

Jul 12, 2005, 9:05:41 AM

to Christoph Hellwig, Tom Zanussi, ak...@osdl.org, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com

On Tue, 2005-07-12 at 03:25 +0100, Christoph Hellwig wrote:
> On Mon, Jul 11, 2005 at 08:10:42PM -0500, Tom Zanussi wrote:
> >
> > Hi Andrew, can you please merge relayfs? It provides a low-overhead
> > logging and buffering capability, which does not currently exist in
> > the kernel.
>
> While the code is pretty nicely in shape it seems rather pointless to
> merge until an actual user goes with it.
>

I have to also say that this is an exception. How many people out there
have written a variant of relayfs to do debugging? It is about time
that there's a buffer in the kernel that can be written to and later
retrieved to debug things like the scheduler, where printk in all its
forms just doesn't cut it.

I've been working with Tom to get my logdev debugging tool to use
relayfs as a back end. This allows for showing output that shows
exactly what's going on inside the kernel. It keeps the latest data
around and when/if the kernel crashes, it shows all the events that led
up to the crash. Well, it doesn't automatically show what has happened,
but you can put print like statements anywhere in any context and the
latest will be dumped on command or a NMI/panic/oops or whatever.

Once relayfs is added, we need to make a buffer that can be written to
from multiple CPUs. I understand that Tom got complaints that the
buffers were not originally lockless, and different CPUs would have their
own buffers. But this really hurts trying to debug race conditions on
SMP machines, since you don't get the interleaved output of what's going
on. God I need to get KCSP working, and not worry about race conditions
anymore! :-)
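
One way to recover an interleaved view from per-CPU buffers is to merge them in user space after the fact. The Python sketch below assumes each record carries a timestamp that is comparable across CPUs, which the thread does not state and which real hardware does not always guarantee; the `interleave` helper and the record shape are hypothetical.

```python
import heapq


def interleave(per_cpu_logs):
    """Merge per-CPU event lists into one timestamp-ordered list.

    `per_cpu_logs[n]` is the log of CPU n: a list of (timestamp, message)
    tuples already sorted by timestamp. Returns (timestamp, cpu, message)
    tuples; ties on timestamp are ordered by CPU number.
    """
    tagged = (
        [(ts, cpu, msg) for ts, msg in events]
        for cpu, events in enumerate(per_cpu_logs)
    )
    # heapq.merge lazily merges inputs that are each already sorted.
    return list(heapq.merge(*tagged))
```

The merged order is only as trustworthy as the timestamps, so this does not replace a genuinely shared buffer when the events of interest are closer together than the clock skew between CPUs.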

-- Steve

Tomasz Kłoczko

Jul 12, 2005, 10:04:15 AM

to Tom Zanussi, ak...@osdl.org, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com

On Mon, 11 Jul 2005, Tom Zanussi wrote:

>
> Hi Andrew, can you please merge relayfs? It provides a low-overhead
> logging and buffering capability, which does not currently exist in
> the kernel.
>
> relayfs key features:
>
> - Extremely efficient high-speed logging/buffering

Usually, at least for now, relayfs is used as base infrastructure for
various kinds of debugging/measuring.
IMO storing raw data and transferring it to user space is the wrong way.
Why? Because it adds a very big overhead in memory and storage.
Big, that is, compared to storing partially analyzed data in situ in
counters and the like, as is done in DTrace.

IMO it would be much better to add a base/template set of functions for
use in KProbes probes, shipped with the KProbes code as a base tool set.
It would cut the size of transferred data from megabytes/gigabytes to
hundreds of bytes/kilobytes, and make debugging/measuring smoother,
without the additional latency of transferring data out of kernel space.

It would be good not to reinvent the wheel the wrong way when a working
implementation like DTrace does this more than well.

Yes, it may be good to have something like relayfs for some other
tasks, but for debugging/measuring it would IMO be better to go another
way, one that does not use this technique.

kloczek
--
-----------------------------------------------------------
*People do not have problems; they only create them for themselves*
-----------------------------------------------------------
Tomasz Kłoczko, sys adm @zie.pg.gda.pl|*e-mail: klo...@rudy.mif.pg.gda.pl*

Baruch Even

Jul 12, 2005, 10:26:39 AM

to Tomasz Kłoczko, Tom Zanussi, ak...@osdl.org, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com

Tomasz Kłoczko wrote:
> On Mon, 11 Jul 2005, Tom Zanussi wrote:
>
>>
>> Hi Andrew, can you please merge relayfs? It provides a low-overhead
>> logging and buffering capability, which does not currently exist in
>> the kernel.
>>
>> relayfs key features:
>>
>> - Extremely efficient high-speed logging/buffering
>
>
> Usually, at least for now, relayfs is used as base infrastructure for
> various kinds of debugging/measuring.
> IMO storing raw data and transferring it to user space is the wrong way.
> Why? Because it adds a very big overhead in memory and storage.
> Big, that is, compared to storing partially analyzed data in situ in
> counters and the like, as is done in DTrace.
>
> IMO it would be much better to add a base/template set of functions for
> use in KProbes probes, shipped with the KProbes code as a base tool set.
> It would cut the size of transferred data from megabytes/gigabytes to
> hundreds of bytes/kilobytes, and make debugging/measuring smoother,
> without the additional latency of transferring data out of kernel space.

There is no relation between using kprobes and reducing the logged data
size. In the end, the debugging/tracing facility is there to provide data
to the developer who tries to detect the problem or ensure correctness.

Kprobes can only serve as a replacement for changing the source code
in order to extract the debugging information, and it does that very well.

Cutting the amount of data transferred is only possible if you add the
problem detection logic into the kernel and only transport problem
reports to user-mode.

Baruch

Steve Rotolo

Jul 12, 2005, 10:43:30 AM

to Greg KH, Karim Yaghmour, Tom Zanussi, ak...@osdl.org, linux-...@vger.kernel.org, va...@us.ibm.com, richard...@uk.ibm.com

On Tue, 2005-07-12 at 01:23, Greg KH wrote:
> And you _are_ doing kernel debugging and tracing with ltt, what's wrong
> with admitting that?
>

Hi. I think that viewing tracing tools like LTT and systemtap as
strictly kernel debug tools is very short-sighted. With a good
post-processing tool, tracing is very useful to application developers
who can benefit by visualizing the interaction between user-level tasks
and the OS as well as the synchronization of multiple tasks/threads.

IOW, tracing is in many ways an _application_ debug tool, not a _kernel_
debug tool. And application developers usually do not want to run a
debug kernel.

I would like to see relayfs merged.

--
Steve Rotolo
<steve....@ccur.com>

Jason Baron

Jul 12, 2005, 10:58:42 AM

to Tom Zanussi, ak...@osdl.org, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com

On Mon, 11 Jul 2005, Tom Zanussi wrote:

>
> Hi Andrew, can you please merge relayfs? It provides a low-overhead
> logging and buffering capability, which does not currently exist in
> the kernel.
>

One concern I had about relayfs, which was raised previously, is its
use of vmap:
http://marc.theaimsgroup.com/?l=linux-kernel&m=110755199913216&w=2
On x86, the vmap space is at a premium, and this space is reserved over
the entire lifetime of a 'channel'. Is the use of vmap really critical
for performance?

thanks,

-Jason

Tom Zanussi

Jul 12, 2005, 11:20:17 AM

to Tomasz Kłoczko, Tom Zanussi, ak...@osdl.org, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com

Tomasz Kłoczko writes:
> On Mon, 11 Jul 2005, Tom Zanussi wrote:
>
> >
> > Hi Andrew, can you please merge relayfs? It provides a low-overhead
> > logging and buffering capability, which does not currently exist in
> > the kernel.
> >
> > relayfs key features:
> >
> > - Extremely efficient high-speed logging/buffering
>
> Usually, at least for now, relayfs is used as base infrastructure for
> various kinds of debugging/measuring.
> IMO storing raw data and transferring it to user space is the wrong way.
> Why? Because it adds a very big overhead in memory and storage.
> Big, that is, compared to storing partially analyzed data in situ in
> counters and the like, as is done in DTrace.
>

But isn't it supposed to be a good thing to keep analysis out of the
kernel if possible? And many things can't be aggregated, such as the
detailed sequence of events in a trace. Anyway, it doesn't have to be
an 'all or nothing' thing. For some applications it may make sense to
do some amount of filtering and aggregation in the kernel. AFAICS
DTrace takes this to the extreme and does everything in the kernel,
and IIRC it can't easily be made to do general system tracing along the
lines of LTT, for instance.

> IMO it would be much better to add a base/template set of functions for use in
> KProbes probes, which would come with the KProbes code as a base tool set. It would
> allow cutting the transferred data size from megabytes/gigabytes to hundreds of
> bytes/kilobytes, and make debugging/measuring smoother, without additional
> latency for transferring data outside kernel space.

The systemtap project is using kprobes along these lines.

Tom

Tom Zanussi

Jul 12, 2005, 11:32:23 AM

to Jason Baron, Tom Zanussi, ak...@osdl.org, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com

Jason Baron writes:
>
> On Mon, 11 Jul 2005, Tom Zanussi wrote:
>
> >
> > Hi Andrew, can you please merge relayfs? It provides a low-overhead
> > logging and buffering capability, which does not currently exist in
> > the kernel.
> >
>
> One concern I had regarding relayfs, which was raised previously, was
> regarding its use of vmap,
> http://marc.theaimsgroup.com/?l=linux-kernel&m=110755199913216&w=2 On x86,
> the vmap space is at a premium, and this space is reserved over the entire
> lifetime of a 'channel'. Is the use of vmap really critical for
> performance?

Yes, the vmap'ed area is reserved over the lifetime of the channel,
but the typical usage of a channel is transient - allocate it at the
start of say a tracing run, and then vunmap it and free the memory
when done. Unless you're using huge buffers, you wouldn't run into a
problem running out of vmalloc space, and typical applications should
be able to use relatively small buffers.

I don't really know how we would get around using vmap - it seems like
the alternatives, such as managing an array of pages or something like
that, would slow down the logging path too much to make it useful as a
low overhead logging mechanism. If you have any ideas though, please
let me know.

Tom

Tomasz Kłoczko

Jul 12, 2005, 11:34:52 AM

to Baruch Even, Tom Zanussi, ak...@osdl.org, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com

On Tue, 12 Jul 2005, Baruch Even wrote:
[..]

>> Usually/for now relayfs is used as base infrastructure for various
>> debugging/measuring.
>> IMO storing raw data and transferring it to user space is the wrong way.
>> Why? Because it adds a very big overhead for memory and storage.
>> Big .. compared to storing partially analyzed data in situ in counters
>> and the like, as is done in DTrace.
>>
>> IMO it would be much better to add a base/template set of functions for use in
>> KProbes probes, which would come with the KProbes code as a base tool set. It
>> would allow cutting the transferred data size from megabytes/gigabytes to hundreds of
>> bytes/kilobytes, and make debugging/measuring smoother, without additional
>> latency for transferring data outside kernel space.
>
> There is no relation between using kprobes and reducing the logged data
> size. At the end the debugging/tracing facility is there to provide data
> to the developer who tries to detect the problem or ensure correctness.

Yes, right now relayfs and KProbes are two different stories without a
strict relation, but a relation does exist at a higher level. Both are used
to solve the same problems (measuring, watching, some skeleton debugging).

Collecting data _without_ dynamically attached probes requires relayfs, but
if the collected data can be rolled up into the data types you want to see as
the result of the experiment (e.g. the number of calls of some code associated
with different stack paths, or the number of I/O operations associated with the
average data transferred per I/O operation), pulling out the result data will
not be an issue :)

> The kprobes can only serve as a replacement to changing the source code
> in order to extract the debugging information, and it does it very well.
>
> Cutting the amount of data transferred is only possible if you add the
> problem detection logic into the kernel and only transport problem
> reports to user-mode.

Of course, yes. I only want to say: if KProbes had this logic, relayfs
would not be necessary, and instead of focusing on developing and merging
relayfs it would be better to spend the time preparing code for this
additional logic (and the necessary amount of code would probably be
comparable to the current relayfs code size :)

Tomasz Kłoczko

Jul 12, 2005, 11:46:23 AM

to Tom Zanussi, ak...@osdl.org, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com

On Tue, 12 Jul 2005, Tom Zanussi wrote:

> Tomasz Kłoczko writes:
> > On Mon, 11 Jul 2005, Tom Zanussi wrote:
> >
> > >
> > > Hi Andrew, can you please merge relayfs? It provides a low-overhead
> > > logging and buffering capability, which does not currently exist in
> > > the kernel.
> > >
> > > relayfs key features:
> > >
> > > - Extremely efficient high-speed logging/buffering
> >
> > Usually/for now relayfs is used as base infrastructure for various
> > debugging/measuring.
> > IMO storing raw data and transferring it to user space is the wrong way.
> > Why? Because it adds a very big overhead for memory and storage.
> > Big .. compared to storing partially analyzed data in situ in counters
> > and the like, as is done in DTrace.
> >
>
> But isn't it supposed to be a good thing to keep analysis out of the
> kernel if possible?

As long as you are trying to, for example, measure (?) .. no.

> And many things can't be aggregated, such as the detailed sequence of
> events in a trace.

Real DTrace examples show something completely different.
MANY things (if not almost all) can be kept in aggregated form only
during experiments.

> Anyway, it doesn't have to be
> an 'all or nothing' thing. For some applications it may make sense to
> do some amount of filtering and aggregation in the kernel. AFAICS
> DTrace takes this to the extreme and does everything in the kernel,
> and IIRC it can't easily be made to do general system tracing along the
> lines of LTT, for instance.

Try measuring the number of disk I/O operations without touching storage to
store raw data. What do you need? Only one counter (a few bytes) instead of a huge
amount of memory for buffering and storing logs. Try measuring something like
the scheduler with the smallest possible system disruption.

Steven Rostedt

Jul 12, 2005, 11:55:00 AM

to Jason Baron, richard...@uk.ibm.com, va...@us.ibm.com, ka...@opersys.com, linux-...@vger.kernel.org, ak...@osdl.org, Tom Zanussi

On Tue, 2005-07-12 at 10:58 -0400, Jason Baron wrote:
> On Mon, 11 Jul 2005, Tom Zanussi wrote:

> One concern I had regarding relayfs, which was raised previously, was
> regarding its use of vmap,
> http://marc.theaimsgroup.com/?l=linux-kernel&m=110755199913216&w=2 On x86,
> the vmap space is at a premium, and this space is reserved over the entire
> lifetime of a 'channel'. Is the use of vmap really critical for
> performance?

I believe that (Tom correct me if I'm wrong) the use of vmap was to
allocate a large buffer without risking failing to allocate, since the
buffer does not need to be in contiguous pages. If this is a problem,
maybe Tom can use my buffer method to make a buffer :-)

See http://www.kihontech.com/logdev for my logdev debugging tool, which
allocates separate pages and uses an accounting system, instead of the
more efficient vmalloc, to keep the data in the pages together. I'm
currently working with Tom to get this to use relayfs as the back end.
But here you can take a look at how the buffering works, and it doesn't
use up vmalloc space.

-- Steve

Steven Rostedt

Jul 12, 2005, 12:06:18 PM

to Tom Zanussi, richard...@uk.ibm.com, va...@us.ibm.com, ka...@opersys.com, linux-...@vger.kernel.org, ak...@osdl.org, Jason Baron

On Tue, 2005-07-12 at 10:26 -0500, Tom Zanussi wrote:

> I don't really know how we would get around using vmap - it seems like
> the alternatives, such as managing an array of pages or something like
> that, would slow down the logging path too much to make it useful as a
> low overhead logging mechanism. If you have any ideas though, please
> let me know.

Tom,

My logdev device was pretty quick! The cost of managing the pages was
negligible compared to copying the data into the buffer. Sometimes
you needed to copy across buffers, but this wouldn't have much of
an impact either.

-- Steve

Tom Zanussi

Jul 12, 2005, 12:20:41 PM

to Steven Rostedt, Jason Baron, richard...@uk.ibm.com, va...@us.ibm.com, ka...@opersys.com, linux-...@vger.kernel.org, ak...@osdl.org, Tom Zanussi

Steven Rostedt writes:
> On Tue, 2005-07-12 at 10:58 -0400, Jason Baron wrote:
> > On Mon, 11 Jul 2005, Tom Zanussi wrote:
>
> > One concern I had regarding relayfs, which was raised previously, was
> > regarding its use of vmap,
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=110755199913216&w=2 On x86,
> > the vmap space is at a premium, and this space is reserved over the entire
> > lifetime of a 'channel'. Is the use of vmap really critical for
> > performance?
>
> I believe that (Tom correct me if I'm wrong) the use of vmap was to
> allocate a large buffer without risking failing to allocate, since the
> buffer does not need to be in contiguous pages. If this is a problem,
> maybe Tom can use my buffer method to make a buffer :-)
>

The main reason we use vmap is so that from the kernel side we have a
nice contiguous address range to log to even though the pages
aren't actually contiguous.

> See http://www.kihontech.com/logdev for my logdev debugging tool, which
> allocates separate pages and uses an accounting system, instead of the
> more efficient vmalloc, to keep the data in the pages together. I'm
> currently working with Tom to get this to use relayfs as the back end.
> But here you can take a look at how the buffering works, and it doesn't
> use up vmalloc space.

It might be worthwhile to try out different alternatives and compare
them, but I'm pretty sure we won't be able to beat what's already in
relayfs. The question is I guess, how much slower would be
acceptable?

Tom

Steven Rostedt

Jul 12, 2005, 12:32:02 PM

to Tom Zanussi, Jason Baron, richard...@uk.ibm.com, va...@us.ibm.com, ka...@opersys.com, linux-...@vger.kernel.org, ak...@osdl.org

On Tue, 2005-07-12 at 11:08 -0500, Tom Zanussi wrote:
> Steven Rostedt writes:
> > On Tue, 2005-07-12 at 10:58 -0400, Jason Baron wrote:
> > > On Mon, 11 Jul 2005, Tom Zanussi wrote:
> >
> > > One concern I had regarding relayfs, which was raised previously, was
> > > regarding its use of vmap,
> > > http://marc.theaimsgroup.com/?l=linux-kernel&m=110755199913216&w=2 On x86,
> > > the vmap space is at a premium, and this space is reserved over the entire
> > > lifetime of a 'channel'. Is the use of vmap really critical for
> > > performance?
> >
> > I believe that (Tom correct me if I'm wrong) the use of vmap was to
> > allocate a large buffer without risking failing to allocate, since the
> > buffer does not need to be in contiguous pages. If this is a problem,
> > maybe Tom can use my buffer method to make a buffer :-)
> >
>
> The main reason we use vmap is so that from the kernel side we have a
> nice contiguous address range to log to even though the pages
> aren't actually contiguous.

That's what I meant, but you said it better :-)

>
> > See http://www.kihontech.com/logdev for my logdev debugging tool, which
> > allocates separate pages and uses an accounting system, instead of the
> > more efficient vmalloc, to keep the data in the pages together. I'm
> > currently working with Tom to get this to use relayfs as the back end.
> > But here you can take a look at how the buffering works, and it doesn't
> > use up vmalloc space.
>
> It might be worthwhile to try out different alternatives and compare
> them, but I'm pretty sure we won't be able to beat what's already in
> relayfs. The question is I guess, how much slower would be
> acceptable?

I totally agree that the vmalloc way is faster, but I would also argue
that the accounting to handle the separate pages would not even be
noticeable with the time it takes to do the actual copying into the
buffer. So if the accounting adds 3ns on top of 500ns to complete, I
don't think people will mind.

I haven't looked too much into the workings of relayfs (I let you handle
that ;-) so I don't really know the impact it would have to use
something like logdev's buffering system.

-- Steve

Tom Zanussi

Jul 12, 2005, 12:33:56 PM

to Tomasz Kłoczko, Tom Zanussi, ak...@osdl.org, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com

Tomasz Kłoczko writes:
> On Tue, 12 Jul 2005, Tom Zanussi wrote:
>
> > Tomasz Kłoczko writes:
> > > On Mon, 11 Jul 2005, Tom Zanussi wrote:
> > >
> > > >
> > > > Hi Andrew, can you please merge relayfs? It provides a low-overhead
> > > > logging and buffering capability, which does not currently exist in
> > > > the kernel.
> > > >
> > > > relayfs key features:
> > > >
> > > > - Extremely efficient high-speed logging/buffering
> > >
> > > Usually/for now relayfs is used as base infrastructure for various
> > > debugging/measuring.
> > > IMO storing raw data and transferring it to user space is the wrong way.
> > > Why? Because it adds a very big overhead for memory and storage.
> > > Big .. compared to storing partially analyzed data in situ in counters
> > > and the like, as is done in DTrace.
> > >
> >
> > But isn't it supposed to be a good thing to keep analysis out of the
> > kernel if possible?
>
> As long as you are trying to, for example, measure (?) .. no.
>
> > And many things can't be aggregated, such as the detailed sequence of
> > events in a trace.
>
> Real DTrace examples show something completely different.
> MANY things (if not almost all) can be kept in aggregated form only
> during experiments.

But you can also do the aggregation in user space if you have a cheap
way of getting it there, as we've shown with some of the examples.
Why do you need it in the kernel? And what do you do when you need to
know the exact sequence of events, especially if you don't really know
what you're looking for?

>
> > Anyway, it doesn't have to be
> > an 'all or nothing' thing. For some applications it may make sense to
> > do some amount of filtering and aggregation in the kernel. AFAICS
> > DTrace takes this to the extreme and does everything in the kernel,
> > and IIRC it can't easily be made to do general system tracing along the
> > lines of LTT, for instance.
>
> Try measuring the number of disk I/O operations without touching storage to
> store raw data. What do you need? Only one counter (a few bytes) instead of a huge
> amount of memory for buffering and storing logs. Try measuring something like
> the scheduler with the smallest possible system disruption.

Most of the time the data is just being buffered and only when the
buffer is full is it written to disk, as one write. If that's too
disruptive, then maybe you do need to do some aggregation in the kernel,
but it sounds like a special case.

As for measuring the scheduler, I know that people have used it for
that, e.g. Steven Rostedt's logdev device, which he uses to trace
problems in the RT kernel.

Tom

Tom Zanussi

Jul 12, 2005, 12:48:21 PM

to Steven Rostedt, Tom Zanussi, Jason Baron, richard...@uk.ibm.com, va...@us.ibm.com, ka...@opersys.com, linux-...@vger.kernel.org, ak...@osdl.org

OK, it sounds like something to experiment with - I can play around
with it, and later submit a patch to remove vmap if it works out.
Does that sound like a good idea?

Tom

Steven Rostedt

Jul 12, 2005, 12:53:19 PM

to Tom Zanussi, Jason Baron, richard...@uk.ibm.com, va...@us.ibm.com, ka...@opersys.com, linux-...@vger.kernel.org, ak...@osdl.org

On Tue, 2005-07-12 at 11:36 -0500, Tom Zanussi wrote:

> >
> > I totally agree that the vmalloc way is faster, but I would also argue
> > that the accounting to handle the separate pages would not even be
> > noticeable with the time it takes to do the actual copying into the
> > buffer. So if the accounting adds 3ns on top of 500ns to complete, I
> > don't think people will mind.
>
> OK, it sounds like something to experiment with - I can play around
> with it, and later submit a patch to remove vmap if it works out.
> Does that sound like a good idea?

Sounds good to me, since different approaches to a problem are always
good, since it allows for comparing the plusses and minuses. Not sure
if you want to take a crack using my ring buffers, but although they are
quite confusing, they have been fully tested, since I haven't changed
the ring buffer for a few years (although logdev itself has gone through
several changes). I use the logdev device on a daily basis to debug
almost every kernel I ever touch. When working with a new kernel, the
first thing I do is usually add my logdev patch.

Note to all: The patch I posted is not the same patch that I usually
use (although the ring buffers _are_ the same), since I add stuff that
is usually more specific to what I do. So if something is broken with
it, I would greatly appreciate it if someone lets me know.

Thanks,

-- Steve

Tomasz Kłoczko

Jul 12, 2005, 1:05:24 PM

to Tom Zanussi, ak...@osdl.org, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com

On Tue, 12 Jul 2005, Tom Zanussi wrote:

[..]

> > Real DTrace examples show something completely different.
> > MANY things (if not almost all) can be kept in aggregated form only
> > during experiments.
>
> But you can also do the aggregation in user space if you have a cheap
> way of getting it there, as we've shown with some of the examples.

Sorry, but real-life examples show that storing a chunk of
data in an aggregator is less expensive than the context switch necessary to
store the data, or the time necessary to send and handle a signal from the
buffer like "I'm full! let me out of here ..".

[..]

> > store raw data. What do you need? Only one counter (a few bytes) instead of a huge
> > amount of memory for buffering and storing logs. Try measuring something like
> > the scheduler with the smallest possible system disruption.
>
> Most of the time the data is just being buffered and only when the
> buffer is full is it written to disk, as one write. If that's too
> disruptive, then maybe you do need to do some aggregation in the kernel,
> but it sounds like a special case.

OK .. "so you would say it is better to stop flushing buffers for a measurement
that will take a day or more"? :_)
Some DTrace probes/techniques are specially prepared for long or even very
long experiments which will only produce a few lines of results at the end of
the experiment.
Look at the DTrace documentation for speculative tracing:
http://docs.sun.com/app/docs/doc/817-6223/6mlkidli7?a=view

Some experiments do not have a deterministic duration and must be finished after,
e.g., an "occasional failure". What if it takes so long that you fill
all available storage the relayfs way?
OK, never mind .. you have discontinued storage. Using a kind of speculative
tracing I'll have the result *just after* the "occasional failure", while you
will just be starting to parse the data stored using relayfs.

Tom Zanussi

Jul 12, 2005, 1:08:28 PM

to Steven Rostedt, Tom Zanussi, Jason Baron, richard...@uk.ibm.com, va...@us.ibm.com, ka...@opersys.com, linux-...@vger.kernel.org, ak...@osdl.org

Steven Rostedt writes:
> On Tue, 2005-07-12 at 11:36 -0500, Tom Zanussi wrote:
>
> > >
> > > I totally agree that the vmalloc way is faster, but I would also argue
> > > that the accounting to handle the separate pages would not even be
> > > noticeable with the time it takes to do the actual copying into the
> > > buffer. So if the accounting adds 3ns on top of 500ns to complete, I
> > > don't think people will mind.
> >
> > OK, it sounds like something to experiment with - I can play around
> > with it, and later submit a patch to remove vmap if it works out.
> > Does that sound like a good idea?
>
> Sounds good to me, since different approaches to a problem are always
> good, since it allows for comparing the plusses and minuses. Not sure
> if you want to take a crack using my ring buffers, but although they are
> quite confusing, they have been fully tested, since I haven't changed
> the ring buffer for a few years (although logdev itself has gone through

I was thinking of something simpler, like just using the page array we
already have in relayfs, but not vmap'ing it and instead writing to
the current page, detecting when to split a record, moving on to the
next page, etc. and seeing how it compares with the vmap version.
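
A minimal user-space sketch of the page-array idea described above, for comparison purposes only. It is not relayfs code: `struct page_buf`, `page_buf_write()` and `PAGE_SZ` are hypothetical names, and the page size is an assumption. The point is the extra work per write: find the current page, and split the record when it crosses a page boundary.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SZ 4096u   /* assumed page size for this sketch */

/* Hypothetical buffer: an array of separately allocated pages. */
struct page_buf {
    unsigned char **pages;  /* n_pages pointers, each to PAGE_SZ bytes */
    size_t n_pages;
    size_t offset;          /* next write position, in bytes from the start */
};

/*
 * Copy len bytes into the buffer, splitting the record wherever it
 * crosses a page boundary. Returns 0 on success, -1 if it does not fit.
 */
static int page_buf_write(struct page_buf *b, const void *data, size_t len)
{
    const unsigned char *src = data;

    if (len > b->n_pages * PAGE_SZ - b->offset)
        return -1;

    while (len > 0) {
        size_t page = b->offset / PAGE_SZ;      /* which page we are in */
        size_t in_page = b->offset % PAGE_SZ;   /* position inside that page */
        size_t chunk = PAGE_SZ - in_page;       /* room left in this page */

        if (chunk > len)
            chunk = len;
        memcpy(b->pages[page] + in_page, src, chunk);
        src += chunk;
        b->offset += chunk;
        len -= chunk;
    }
    return 0;
}
```

With a vmap'ed (virtually contiguous) buffer the same write would be a bounds check plus a single memcpy() to base + offset, which is the difference the proposed comparison would measure.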

Tom

Tom Zanussi

Jul 12, 2005, 1:26:33 PM

to Tomasz Kłoczko, Tom Zanussi, ak...@osdl.org, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com, pras...@us.ibm.com

Tomasz Kłoczko writes:
> On Tue, 12 Jul 2005, Tom Zanussi wrote:

[...]

> >
> > Most of the time the data is just being buffered and only when the
> > buffer is full is it written to disk, as one write. If that's too
> > disruptive, then maybe you do need to do some aggregation in the kernel,
> > but it sounds like a special case.
>
> OK .. "so you would say it is better to stop flushing buffers for a measurement
> that will take a day or more"? :_)
> Some DTrace probes/techniques are specially prepared for long or even very
> long experiments which will only produce a few lines of results at the end of
> the experiment.
> Look at the DTrace documentation for speculative tracing:
> http://docs.sun.com/app/docs/doc/817-6223/6mlkidli7?a=view
>

It's also possible to do long-running 'experiments' using relayfs, and
never write anything at all to disk. Here's an example prototype I
did using a Perl interpreter embedded in the user space event-reading
loop:

http://www.listserv.shafik.org/pipermail/ltt-dev/2004-August/000649.html

> Some experiments do not have a deterministic duration and must be finished after,
> e.g., an "occasional failure". What if it takes so long that you fill
> all available storage the relayfs way?
> OK, never mind .. you have discontinued storage. Using a kind of speculative
> tracing I'll have the result *just after* the "occasional failure", while you
> will just be starting to parse the data stored using relayfs.

As in the example above, you don't necessarily need to fill any
available storage. You can also use relayfs in 'circular-buffer'
mode, which would capture a buffer full of events up to the point of your
failure. Sounds like speculative tracing to me.
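
To illustrate what 'circular-buffer' (flight-recorder) mode means in general, here is a small sketch in which new events overwrite the oldest ones, so the buffer always holds the last few events before a failure. This is not the relayfs API; `struct recorder`, `rec_log()` and `rec_snapshot()` are hypothetical names, and the slot count is arbitrary.

```c
#include <stddef.h>
#include <stdint.h>

#define REC_SLOTS 4u  /* illustrative size; must be a power of two */

struct event { uint32_t id; };

struct recorder {
    struct event slot[REC_SLOTS];
    uint64_t written;  /* total events ever logged */
};

/* Log an event, silently overwriting the oldest one once the buffer is full. */
static void rec_log(struct recorder *r, uint32_t id)
{
    r->slot[r->written++ & (REC_SLOTS - 1)].id = id;
}

/* Copy the surviving events, oldest first, into out; returns how many. */
static size_t rec_snapshot(const struct recorder *r, struct event *out)
{
    uint64_t n = r->written < REC_SLOTS ? r->written : REC_SLOTS;
    uint64_t first = r->written - n;
    uint64_t i;

    for (i = 0; i < n; i++)
        out[i] = r->slot[(first + i) & (REC_SLOTS - 1)];
    return (size_t)n;
}
```

After logging events 1 to 6 into the four slots, a snapshot yields events 3, 4, 5 and 6: only the history leading up to the point of interest is kept, and nothing has to be written to disk while the experiment runs.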

Tomasz Kłoczko

Jul 12, 2005, 3:35:49 PM

to Tom Zanussi, ak...@osdl.org, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com, pras...@us.ibm.com

On Tue, 12 Jul 2005, Tom Zanussi wrote:

[..]
> > This is much simpler and much better for control (also from the point of
> > view of catching bugs in the aggregator code -> and so also from the point
> > of view of kernel stability).
> >
> > Also .. probably some code for handling e.g. counters can be the same as
> > existing code in the current kernel.
> > Probably some "atomic" (and/or simpler) aggregators can be useful in other
> > places in the kernel for collecting some data the whole time the system
> > is running .. so the code for handling this can be reused in non-occasional
> > tracing/measuring.
> > And again: all without things like relayfs.
>
> Well, you should check out the systemtap project. It's basically a
> DTrace clone which is already doing these kinds of things with
> kprobes, and it's using relayfs...

Probably because of this it will be harder to say "KProbes is a Solaris DTrace
clone".

Vara Prasad

Jul 12, 2005, 4:49:11 PM

to Tomasz Kłoczko, Tom Zanussi, ak...@osdl.org, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com

Tomasz Kłoczko wrote:

> On Tue, 12 Jul 2005, Tom Zanussi wrote:
> [..]
>
>> > This is much simpler and much better for control (also from the point of
>> > view of catching bugs in the aggregator code -> and so also from the point
>> > of view of kernel stability).
>> >
>> > Also .. probably some code for handling e.g. counters can be the same as
>> > existing code in the current kernel.
>> > Probably some "atomic" (and/or simpler) aggregators can be useful in other
>> > places in the kernel for collecting some data the whole time the system
>> > is running .. so the code for handling this can be reused in non-occasional
>> > tracing/measuring.
>> > And again: all without things like relayfs.
>>
>> Well, you should check out the systemtap project. It's basically a
>> DTrace clone which is already doing these kinds of things with
>> kprobes, and it's using relayfs...
>
>
> Probably because of this it will be harder to say "KProbes is a Solaris DTrace
> clone".
>

I have not looked at the DTrace code, but based on their USENIX paper it looks
like we cannot call Systemtap a DTrace clone without a buffering
scheme like relayfs.

> kloczek

Vara Prasad

Jul 12, 2005, 5:33:54 PM

to Tomasz Kłoczko, Tom Zanussi, ak...@osdl.org, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com

Tomasz Kłoczko wrote:

> On Tue, 12 Jul 2005, Tom Zanussi wrote:
>
>> Tomasz Kłoczko writes:
>> > On Tue, 12 Jul 2005, Tom Zanussi wrote:
>>
>> [...]
>>
>> >
>> > OK .. "so you would say it is better to stop flushing buffers for a measurement
>> > that will take a day or more"? :_)
>> > Some DTrace probes/techniques are specially prepared for long or even very
>> > long experiments which will only produce a few lines of results at the end of
>> > the experiment.
>> > Look at the DTrace documentation for speculative tracing:
>> > http://docs.sun.com/app/docs/doc/817-6223/6mlkidli7?a=view
>> >
>

How do you propose to implement speculative tracing without a buffer to
hold the data, when data needs to stay in the kernel for a while before
we decide to commit or discard?
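
A minimal sketch of the commit-or-discard idea, to show why some buffer is needed either way. This is a simplification and not DTrace's or relayfs's actual mechanism; `struct spec_buf` and the `spec_*()` functions are hypothetical names. Records are appended tentatively and only become permanent on commit.

```c
#include <stddef.h>
#include <string.h>

#define SPEC_CAP 256u  /* illustrative capacity */

struct spec_buf {
    unsigned char data[SPEC_CAP];
    size_t committed;  /* bytes that are kept for good */
    size_t pending;    /* committed bytes plus tentative ones */
};

/* Append a tentative record; returns -1 if there is no room. */
static int spec_write(struct spec_buf *b, const void *rec, size_t len)
{
    if (len > SPEC_CAP - b->pending)
        return -1;
    memcpy(b->data + b->pending, rec, len);
    b->pending += len;
    return 0;
}

/* The interesting condition occurred: keep everything written so far. */
static void spec_commit(struct spec_buf *b)
{
    b->committed = b->pending;
}

/* Nothing interesting happened: drop the tentative records. */
static void spec_discard(struct spec_buf *b)
{
    b->pending = b->committed;
}
```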

>>
>> [...]

Tom Zanussi

Jul 12, 2005, 5:46:28 PM

to Tom Zanussi, Steven Rostedt, Jason Baron, richard...@uk.ibm.com, va...@us.ibm.com, ka...@opersys.com, linux-...@vger.kernel.org, ak...@osdl.org

Tom Zanussi writes:
> Steven Rostedt writes:
> > On Tue, 2005-07-12 at 11:36 -0500, Tom Zanussi wrote:
> >
> > > >
> > > > I totally agree that the vmalloc way is faster, but I would also argue
> > > > that the accounting to handle the separate pages would not even be
> > > > noticeable with the time it takes to do the actual copying into the
> > > > buffer. So if the accounting adds 3ns on top of 500ns to complete, I
> > > > don't think people will mind.
> > >
> > > OK, it sounds like something to experiment with - I can play around
> > > with it, and later submit a patch to remove vmap if it works out.
> > > Does that sound like a good idea?
> >
> > Sounds good to me, since different approaches to a problem are always
> > good, since it allows for comparing the plusses and minuses. Not sure
> > if you want to take a crack using my ring buffers, but although they are
> > quite confusing, they have been fully tested, since I haven't changed
> > > the ring buffer for a few years (although logdev itself has gone through
>
> I was thinking of something simpler, like just using the page array we
> already have in relayfs, but not vmap'ing it and instead writing to
> the current page, detecting when to split a record, moving on to the
> next page, etc. and seeing how it compares with the vmap version.
>

Just a clarification - I didn't mean to ignore your ring buffers - it
would be good to try both, I think...

Steven Rostedt

Jul 12, 2005, 7:48:22 PM

to Tom Zanussi, Jason Baron, richard...@uk.ibm.com, va...@us.ibm.com, ka...@opersys.com, linux-...@vger.kernel.org, ak...@osdl.org

On Tue, 2005-07-12 at 16:38 -0500, Tom Zanussi wrote:

> Tom Zanussi writes:
> >
> >
> > I was thinking of something simpler, like just using the page array we
> > already have in relayfs, but not vmap'ing it and instead writing to
> > the current page, detecting when to split a record, moving on to the
> > next page, etc. and seeing how it compares with the vmap version.
> >
>
> Just a clarification - I didn't mean to ignore your ring buffers - it
> would be good to try both, I think...

Oh, by all means, simple is usually better. I didn't take any offense
to not using it. My ring buffers are quite confusing, and took quite a
bit of debugging to finally get them straight. If you get something that
works then it should be good to go. My ring buffers were meant to be
always used as a ring buffer that would only save the latest data and
not stop when full. So, each page had to have its own start and stop
since the beginning of the buffer could actually be anywhere on any
page. That's because, once the ring buffer filled up, the start of the
buffer would move as you added more data.

A simple approach should be best, but if you start doing the individual
page accounting, and find that it's getting complex to handle all cases,
then it's good to know that my ring buffers are always out there :-)

I will also admit that my ring buffers lost one byte per page, because
I wanted to save on space with the accounting, and only had a start and
end pointer per page. So when start and end were equal, the buffer was
considered empty and when end was one less than start, it was considered
full. But since end always pointed to an empty spot, that spot would still be
empty when the buffer was full, thus wasting one byte per page. But to
solve this, I would either have to add another variable in the buffer
page descriptor (adding at least one byte, but probably 4 bytes) which
would just be more waste, or I would have to make a complex system even
more complex (i.e. adding a flag on the end pointer at the MSB to
differentiate between end being empty or filled).
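
The start/end accounting described above can be sketched as follows. This is not the logdev code: `struct ring`, `ring_put()` and the tiny `RING_SIZE` are hypothetical, and for brevity the sketch refuses writes when full, whereas the buffers described here keep going and overwrite old data. It only shows why one slot is always lost.

```c
#include <stddef.h>

#define RING_SIZE 8u  /* illustrative capacity; one slot always stays unused */

struct ring {
    unsigned char data[RING_SIZE];
    size_t start;   /* index of the oldest byte */
    size_t end;     /* index of the next free slot */
};

/* Empty: start and end are equal. */
static int ring_empty(const struct ring *r)
{
    return r->start == r->end;
}

/* Full: end is one less than start (modulo the size). */
static int ring_full(const struct ring *r)
{
    return (r->end + 1) % RING_SIZE == r->start;
}

/* Store one byte; returns -1 when the ring is considered full. */
static int ring_put(struct ring *r, unsigned char c)
{
    if (ring_full(r))
        return -1;
    r->data[r->end] = c;
    r->end = (r->end + 1) % RING_SIZE;
    return 0;
}
```

With RING_SIZE set to 8, only 7 bytes can be stored before ring_full() reports true: the slot that end points to has to stay empty so that "full" and "empty" remain distinguishable.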

-- Steve

Andrew Morton

Jul 12, 2005, 7:59:05 PM

to Steven Rostedt, zan...@us.ibm.com, jba...@redhat.com, richard...@uk.ibm.com, va...@us.ibm.com, ka...@opersys.com, linux-...@vger.kernel.org

Steven Rostedt <ros...@goodmis.org> wrote:
>
> I will also admit that my ring buffers lost one byte per page, because
> I wanted to save on space with the accounting, and only had a start and
> end pointer per page. So when start and end were equal, the buffer was
> considered empty and when end was one less than start, it was considered
> full. But since end always pointed to an empty spot, that spot would still be
> empty when the buffer was full, thus wasting one byte per page. But to
> solve this, I would either have to add another variable in the buffer
> page descriptor (adding at least one byte, but probably 4 bytes) which
> would just be more waste, or I would have to make a complex system even
> more complex (i.e. adding a flag on the end pointer at the MSB to
> differentiate between end being empty or filled).

Nope. Just make the indices 32-bit numbers and let them wrap.

Full: (tail - head) == size
Empty: (tail - head) == 0
Add item: buf[head++ & (size-1)] = item;
Remove item: buf[tail++ & (size-1)]
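
A compilable sketch of the free-running-index scheme above, under a few stated assumptions: `struct fifo` and the `fifo_*()` names are hypothetical, the size must be a power of two for the mask to work, and occupancy is computed as head - tail because head is the index advanced on add. Unsigned arithmetic makes the subtraction come out right even after the 32-bit counters wrap.

```c
#include <stdint.h>

#define FIFO_SIZE 8u   /* must be a power of two for the mask to work */

struct fifo {
    unsigned char buf[FIFO_SIZE];
    uint32_t head;  /* total items ever added; wraps at 2^32 */
    uint32_t tail;  /* total items ever removed; wraps at 2^32 */
};

/* Number of items currently stored; correct across counter wrap. */
static uint32_t fifo_count(const struct fifo *f)
{
    return (uint32_t)(f->head - f->tail);
}

static int fifo_full(const struct fifo *f)
{
    return fifo_count(f) == FIFO_SIZE;
}

static int fifo_empty(const struct fifo *f)
{
    return fifo_count(f) == 0;
}

/* Add one item; returns -1 if the fifo is full. */
static int fifo_add(struct fifo *f, unsigned char item)
{
    if (fifo_full(f))
        return -1;
    f->buf[f->head++ & (FIFO_SIZE - 1)] = item;
    return 0;
}

/* Remove the oldest item; returns -1 if the fifo is empty. */
static int fifo_remove(struct fifo *f, unsigned char *item)
{
    if (fifo_empty(f))
        return -1;
    *item = f->buf[f->tail++ & (FIFO_SIZE - 1)];
    return 0;
}
```

Because the indices are only masked when used to address the array, all FIFO_SIZE slots are usable and no extra flag or third variable is needed to tell full from empty.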

Steven Rostedt

Jul 12, 2005, 8:15:45 PM

to Andrew Morton, zan...@us.ibm.com, jba...@redhat.com, richard...@uk.ibm.com, va...@us.ibm.com, ka...@opersys.com, linux-...@vger.kernel.org

On Tue, 2005-07-12 at 16:55 -0700, Andrew Morton wrote:
> Steven Rostedt <ros...@goodmis.org> wrote:
> >
> > I will also admit that my ring buffers lost one byte per page, because
> > I wanted to save on space with the accounting, and only had a start and
> > end pointer per page. So when start and end were equal, the buffer was
> > considered empty and when end was one less than start, it was considered
> > full. But since end always pointed to an empty spot, that spot would still be
> > empty when the buffer was full, thus wasting one byte per page. But to
> > solve this, I would either have to add another variable in the buffer
> > page descriptor (adding at least one byte, but probably 4 bytes) which
> > would just be more waste, or I would have to make a complex system even
> > more complex (i.e. adding a flag on the end pointer at the MSB to
> > differentiate between end being empty or filled).
>
> Nope. Just make the indices 32-bit numbers and let them wrap.
>
> Full: (head - tail) == size
> Empty: (head - tail) == 0
> Add item: buf[head++ & (size-1)] = item;
> Remove item: buf[tail++ & (size-1)]

You know, I knew someone would have an answer. Look for version 0.2.1
coming soon :-)

Thanks,

-- Steve

Vara Prasad

Jul 13, 2005, 12:31:01 AM

to linux-...@vger.kernel.org

Tomasz Kłoczko wrote:

O.K., Tomasz, your point is that we can do aggregation in the kernel and
cut down the amount of data that needs to be sent out of the kernel, hence
we don't need an efficient, low-overhead mechanism like relayfs to get
the data out of the kernel. Having relayfs doesn't prevent anyone from
aggregating the data in the kernel, so that is not an argument against
including relayfs in the kernel when it fills the need of those who
need raw data.

I am part of a team working on systemtap, where we are developing a
tool similar to DTrace that does some aggregation where appropriate but
nothing like fancy statistics etc. We use relayfs in our systemtap
project, and based on my reading of the DTrace paper they use a buffering
mechanism very similar to relayfs as well.

There are tools like itrace, and Intel has one (I forgot the name), that
would like to get the raw data into user space and do all kinds of
fancy statistical analysis, visualization etc. Their value add is the
analysis of the data. I am sure you are not suggesting pushing the
capabilities of those tools into the kernel, right?

As Steven Rostedt mentioned in his initial reply in this thread, many of
us have written ad hoc buffering schemes similar to what relayfs provides
to debug kernel problems that happen after a long-running test. If such
a facility already exists in the kernel, not everyone has to develop
one.

I would like to see relayfs merged.

Spirakis, Charles

Jul 13, 2005, 4:15:26 AM

to Vara Prasad, linux-...@vger.kernel.org

I believe the Intel tool that Vara is referencing is the Vtune tool
(which has an open source, GPL'ed statistical sampling driver). It keeps
a trace history (instead of aggregating the data) that is passed into
user space so that it can do post processing analysis from user space.
The most common method of aggregating data for sampling/profiling is to
lose the time information of when a sample is taken (for example, that
is what oprofile does). For many people, this is fine. For others, they
want the time information so they can visualize the sequence of events.
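
The difference between the two approaches can be illustrated with a small user-space C sketch. All names and sizes here are hypothetical and are not taken from VTune, oprofile or relayfs. An aggregated profile keeps only a hit count per bucket, while a trace history keeps each sample together with its timestamp.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_BUCKETS 4   /* hypothetical number of code regions */
#define TRACE_CAP 16  /* hypothetical trace capacity */

/* Aggregated profile: one hit count per bucket; order and time are gone. */
struct profile { unsigned long hits[N_BUCKETS]; };

/* Trace history: every sample is kept with the time it was taken. */
struct sample { uint64_t time_ns; unsigned bucket; };
struct trace  { struct sample rec[TRACE_CAP]; size_t n; };

static void profile_add(struct profile *p, unsigned bucket)
{
    assert(bucket < N_BUCKETS);
    p->hits[bucket]++;
}

/* Returns 0 on success, -1 when the trace is full. */
static int trace_add(struct trace *t, uint64_t time_ns, unsigned bucket)
{
    assert(bucket < N_BUCKETS);
    if (t->n == TRACE_CAP)
        return -1;
    t->rec[t->n++] = (struct sample){ time_ns, bucket };
    return 0;
}
```

Two runs that hit the same buckets in a different order yield identical profile contents but different trace contents, which is why a sequence-of-events view needs the raw records.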

Having relayfs merged into the kernel would allow us to have a
consistent and reliable way of passing the data we need from kernel
space into user space.

In essence, relayfs is a basic infrastructure upon which other tools can
be built - whether that's profiling, debugging, logging, etc.

-- charles

> -----Original Message-----
> From: linux-ker...@vger.kernel.org
> [mailto:linux-ker...@vger.kernel.org] On Behalf Of Vara Prasad
> Sent: Tuesday, July 12, 2005 9:30 PM
> To: unlisted-recipients
> Cc: linux-...@vger.kernel.org
> Subject: Re: Merging relayfs?
>
> *** snip ***
>
> There are tools like itrace and Intel has one (I forgot the
> name) that would like to get the raw data into user space and
> do all kinds of fancy statistical analysis, visualization
> etc. Their value add is the analysis of the data. I am sure
> you are not suggesting pushing capabilities of those tools to
> the kernel, right?
>
> As Steven Rostedt mentioned in his initial reply in this
> thread, many of us have written adhoc buffering scheme

Tomasz Kłoczko

Jul 13, 2005, 8:42:06 AM

to Vara Prasad, Tom Zanussi, ak...@osdl.org, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com

On Tue, 12 Jul 2005, Vara Prasad wrote:

> Tomasz Kłoczko wrote:
>
>> On Tue, 12 Jul 2005, Tom Zanussi wrote:
>>
>>> Tomasz Kłoczko writes:
>>> > On Tue, 12 Jul 2005, Tom Zanussi wrote:
>>> [...]
>>>
>>> >
>>> > OK .. "so you could say it is better to stop flushing buffers during a
>>> > measurement which will take a day or more" ? :_)
>>> > Some DTrace probes/techniques are specially prepared for long or even very
>>> > long experiments which will only produce a few lines of results at the end
>>> > of the experiment.
>>> > Look at DTrace documentation for speculative tracing:
>>> > http://docs.sun.com/app/docs/doc/817-6223/6mlkidli7?a=view
>
> How do you propose to implement speculative tracing without a buffer to hold
> the data, when data needs to stay in the kernel for a while before we decide
> to commit or discard?

Buffering some data inside kernel space, and buffering with
infrastructure for transfer to user space, are two different things.
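
To make the commit-or-discard question concrete, here is a minimal user-space sketch of a speculative buffer. It is an assumption-laden illustration, not how DTrace, systemtap or relayfs implement it: the struct names, the capacities and the three functions are invented for the example. Records accumulate in a private buffer and are either copied to the output buffer on commit or dropped on discard.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SPEC_CAP 64   /* hypothetical size of the speculative buffer */
#define OUT_CAP  256  /* hypothetical size of the committed-output buffer */

struct spec_buf { char data[SPEC_CAP]; size_t len; };
struct out_buf  { char data[OUT_CAP];  size_t len; };

/* Append a record speculatively; returns 0 on success, -1 if it does not fit. */
static int spec_log(struct spec_buf *s, const void *rec, size_t n)
{
    assert(s->len <= SPEC_CAP);
    if (n > SPEC_CAP - s->len)
        return -1;
    memcpy(s->data + s->len, rec, n);
    s->len += n;
    return 0;
}

/* The speculation turned out to be uninteresting: forget everything. */
static void spec_discard(struct spec_buf *s)
{
    s->len = 0;
}

/* The speculation is interesting: move the records to the output buffer. */
static int spec_commit(struct spec_buf *s, struct out_buf *o)
{
    if (s->len > OUT_CAP - o->len)
        return -1;
    memcpy(o->data + o->len, s->data, s->len);
    o->len += s->len;
    s->len = 0;
    return 0;
}
```

In this sketch only committed data ever reaches the buffer that would be handed to user space; how that hand-off happens is a separate question.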

Tomasz Kłoczko

Jul 13, 2005, 9:48:57 AM

to Vara Prasad, linux-...@vger.kernel.org, Linus Torvalds, ak...@osdl.org

On Tue, 12 Jul 2005, Vara Prasad wrote:
[..]

> O.K, Tomasz your point is we can do aggregation in the kernel and cut down
> the amount of data that needs to be sent out from the kernel hence we don't
> need an efficient, low overhead mechanism like relayfs to get the data out of
> the kernel. Having relayfs doesn't prevent someone from aggregating the data in
> the kernel, so it is not an argument for not including relayfs in the kernel
> when it fills the need for those who need raw data.

Of course you are right, and (look again) this is what I said in my first mail
in this thread :)

> I am part of a team working on systemtap where we are developing a tool
> similar to Dtrace that does some aggregation where appropriate but nothing
> like fancy statistics etc. We use relayfs in our systemtap project and based
> on my reading of the Dtrace paper they use a buffering mechanism very similar
> to relayfs as well.

If I may suggest an order in which to prepare the features:

1) prepare the base infrastructure for counters.

This "tool" will need a very small amount of data and can be implemented
with very small pieces of binary code. Even this alone will allow some
*very* interesting experiments on existing kernel code.
And after the above:

2) prepare the base infrastructure for association tables of counters (for
collecting data about, for example, I/O operations or other operations
with two or more arguments),
3) prepare a user space tool with some kind of language which will allow
attaching probes that use the above two (simple counters and association
tables of counters),
4) base functions for measuring time (with and without the KProbes overhead)
and storing the results in counters and association tables.

All the base "tools" above will need a small or medium amount of data
and can be implemented with small or medium pieces of binary code. And
after the above:

5) prepare infrastructure for probes which store data in different
containers depending on the initiating process and/or thread (and maybe in
a later stage it would also be good to have something more general which
depends on the stack path),
6) prepare base functions for tracing stack paths (counting them and storing
them in association tables),
7) make some kind of study of where it would be good to compute something
more complicated, like basic "speculative probes" (looking at how
DTrace works, the answer here will probably be "yes").

Up to this point none of this will require relayfs, because the amount of
transferred data will be _very low_.
The details of the above will probably be different (I have only some very
general knowledge of DTrace implementation details and average knowledge
of using the dtrace tool), but I want to count/point out *only* features
which will not require using relayfs.
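
As a rough idea of what an "association table of counters" could look like, here is a small user-space C sketch. It is not taken from DTrace, KProbes or systemtap: the table size, the hash function and all names are hypothetical. It keeps one counter per pair of keys in a fixed-size open-addressing table, so only the aggregated counts would ever need to leave the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SLOTS 64u /* hypothetical table size; must be a power of two */

struct slot { uint32_t key_a, key_b; uint64_t count; int used; };
struct assoc_table { struct slot slots[TABLE_SLOTS]; };

/* Simple multiplicative mix of the two keys (an arbitrary choice). */
static uint32_t hash2(uint32_t a, uint32_t b)
{
    return (a * 2654435761u) ^ (b * 40503u);
}

/* Find or create the counter for (a, b); returns NULL when the table is full. */
static uint64_t *assoc_counter(struct assoc_table *t, uint32_t a, uint32_t b)
{
    assert((TABLE_SLOTS & (TABLE_SLOTS - 1u)) == 0);
    uint32_t start = hash2(a, b) & (TABLE_SLOTS - 1u);
    for (uint32_t i = 0; i < TABLE_SLOTS; i++) {
        struct slot *s = &t->slots[(start + i) & (TABLE_SLOTS - 1u)];
        if (!s->used) {           /* free slot: claim it for this key pair */
            s->used = 1;
            s->key_a = a;
            s->key_b = b;
            s->count = 0;
            return &s->count;
        }
        if (s->key_a == a && s->key_b == b)
            return &s->count;     /* existing entry */
    }
    return NULL;
}

/* Increment the counter for (a, b); returns -1 when the table is full. */
static int assoc_inc(struct assoc_table *t, uint32_t a, uint32_t b)
{
    uint64_t *c = assoc_counter(t, a, b);
    if (!c)
        return -1;
    (*c)++;
    return 0;
}
```

A probe handler would call something like assoc_inc(table, device, operation) on each event; a real in-kernel version would also have to deal with concurrency, which this sketch ignores.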

And *after finishing the above* it will be much easier to perform some kind
of study of "is relayfs still necessary?", and *if* the answer is still
"YES", to try to integrate the necessary patches (or maybe something else ..
maybe better adjusted to all the cases not covered above). Also, adding
something like relayfs at that point _will not require_ changes in existing
code (if it does require changes they will be very small, but maybe it will
allow reducing the relayfs that exists now (?)).

But if you build all the infrastructure, even for simple counters, on a
relayfs foundation, it will be (IMO) badly/incorrectly designed .. and using
even simple counters will introduce too high an overhead for the system.

*NOT using relayfs* where it is not necessary, for a possibly large number
of future KProbes features, is IMO *fundamental* in this case.

Until these basic features that do not require relayfs are integrated
into the kernel code, it will IMO be better to hold off merging relayfs.
> There are tools like itrace and Intel has one (I forgot the name) that would
> like to get the raw data into user space and do all kinds of fancy
> statistical analysis, visualization etc. Their value add is the analysis of
> the data. I am sure you are not suggesting pushing capabilities of those
> tools to the kernel, right?

I don't know anything about this tool (can you send a URL?) but please ..
don't be a fool and do not try to prepare some eye candy first :)
Leave this area to other developers and focus on the fundamentals :)

regards

Karim Yaghmour

Jul 13, 2005, 12:04:01 PM

to Tomasz Kłoczko, Vara Prasad, linux-...@vger.kernel.org, Linus Torvalds, ak...@osdl.org

Tomasz Kłoczko wrote:
> *NOT using relayfs* where it is not necessary, for a possibly large number
> of future KProbes features, is IMO *fundamental* in this case.
>
> Until these basic features that do not require relayfs are integrated
> into the kernel code, it will IMO be better to hold off merging relayfs.

This part of the thread is really veering off-topic. This counters thing is
your own personal crusade and has nothing to do with the fundamental need
for a generic buffering mechanism such as relayfs.

I would suggest you start a separate thread to discuss the implementation of
a generic counters mechanism, if that's indeed what you're interested in.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546

Tomasz Kłoczko

Jul 13, 2005, 12:23:33 PM

to Vara Prasad, Tom Zanussi, ak...@osdl.org, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com

On Wed, 13 Jul 2005, Vara Prasad wrote:
[..]

> O.K, looks like you are agreeing that we need a buffering mechanism in the
> kernel to implement speculative tracing, right.

Each aggregator has its own data. This data is buffered ..
In this sense: yes, infrastructure to allocate, deallocate, copy ..
(generally) operate on these buffers is needed.

> Once we have the buffering mechanism we need to create an efficient API
> for producers of the data to write to that buffering scheme. To my
> knowledge there is no such generic buffering mechanism already in the
> kernel, Relayfs implements that buffering scheme and an efficient API to
> write to it. Isn't that a good reason to have Relayfs merged?

Sorry, but no. Relayfs is much more than is required for simply
managing buffers (it would be better to say "probes data
containers" here). All operations of this kind can be performed
using a reference/index.

> Once the data in the buffer is decided to be committed you need a mechanism
> to get that data from the kernel to userspace. If you don't like Relayfs
> transfer mechanism, what do you suggest using?

Correct me if I'm wrong .. and try to fill in all the areas where you see
that my knowledge is worse than yours or that of other strict kernel developers.

1) relayfs was prepared for low latency when moving data out of kernel space,
2) getting data from probes does not require organizing all of it in a regular
file system structure, and in most cases will not require low latency.
Only the cases where a buffer must necessarily be moved out of kernel
space will require minimal overhead.

Many other kernel subsystems allow transferring data as the result of a simple
request with a reference/index as argument. Organizing all the data stored/used
by probes in a named structure (if it is *really* necessary) can IMO be moved
outside kernel space.
Why? Because *all kernel-side operations on this data* can apparently be
performed without an additional naming abstraction (buffer number, buffer size
and the data type stored in the buffer will be all that is necessary in
probably all cases, even when operating on complex data).

If you really want to get data from probes via fopen()/read(), why not map
"probes data containers" to procfs/sysfs? Receiving signals from
probes to move the mapped buffer content out of kernel space, and/or ALSO
receiving signals with DATA (on request from user space), can probably be
done via the existing netlink infrastructure or (higher-level) event
notification.

(?)

Allow me to ask you: did you try to test whether using netlink allows
performing the operations in the necessary time frame? (with the additional
assumption of aggregating as much data as possible at "short range" from the
probe) .. probably not, because most skeleton usages of KProbes, and also the
LTT interface, were prepared with the assumption that data is aggregated
outside kernel space.
Do you see this?

This was and still is the core cause of LTT's problems and why it will never
be as useful as DTrace. Aggregating data at the shortest possible "distance"
from the probe is a *core DTrace assumption*. Simple .. this is why using
DTrace is *very light* even if you enable/attach thousands of probes inside
kernel space, and it still allows using this kind of technique even on systems
that are very fragile (from the stability point of view) or under very high
pressure.

Tomasz Kłoczko

Jul 13, 2005, 12:57:15 PM

to Vara Prasad, linux-...@vger.kernel.org, Linus Torvalds, ak...@osdl.org

On Wed, 13 Jul 2005, Vara Prasad wrote:
[..]

> Looks like you have not looked at the systemtap project although Tom pointed
> it out to you in his previous postings. The URL for systemtap is
> http://sourceware.org/systemtap/, I strongly suggest you look at that
> project.

I'm just filling this gap.
Sorry, but I can't find in this document even a single word about an assumption
of aggregating data as close as possible to the probe. But point 6.1
of this document says:

"Kernel-to-user transport Data collected from systemtap in the kernel must
somehow be transmitted to userspace. This transport must
have high performance and minimal performance impact on the monitored
system. One candidate is relayfs. Relayfs provides an efficient way to
move large blocks of data from the kernel to userspace. The data is sent
in per-cpu buffers which a userspace program can save or display.
Drawbacks are that the data arrives in blocks and is separated into
per-cpu blocks, possibly requiring a post-processing step that stitches
the data into an integrated stream. Relayfs is included in some recent -mm
kernels. It can be built as a loadable module and is currently checked
into CVS under src/runtime/relayfs. The other candidate is netlink.
Netlink is included in the kernel. It allows a simple stream of data to be
sent using the familiar socket APIs. It is unlikely to be as fast as
relayfs. Relayfs typically makes use of netlink as a control channel. With
some simple extensions, the runtime can use netlink as the main transport
too. So we can currently select in the runtime between relayfs and
netlink, allowing us to support streams of data or blocks. And allowing us
to perform direct comparisons of efficiency. [..]"

So .. using relayfs is necessary because all collected data "must
somehow be transmitted to userspace", and this is why a huge
amount of data must be transferred.

But if transferring a big amount of data is not an issue, it seems netlink can
be used to transfer the (generally aggregated) data from kernel probes (?).
And also "with some simple extensions, the runtime can use netlink as the
main transport too".
Even this document says "relayfs isn't a necessary foundation for
systemtap". So .. why try to push for merging relayfs *NOW*?
Because KProbes does not have expressions and some basic aggregators like
counters, it isn't possible to check NOW on real examples whether relayfs is
really necessary (?) :)

Vara Prasad

Jul 13, 2005, 11:05:54 AM

to Tomasz Kłoczko, Tom Zanussi, ak...@osdl.org, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com

Tomasz Kłoczko wrote:

O.K, looks like you are agreeing that we need a buffering mechanism in
the kernel to implement speculative tracing, right. Once we have the
buffering mechanism we need to create an efficient API for producers of
the data to write to that buffering scheme. To my knowledge there is no
such generic buffering mechanism already in the kernel, Relayfs
implements that buffering scheme and an efficient API to write to it.
Isn't that a good reason to have Relayfs merged?

Once the data in the buffer is decided to be committed you need a
mechanism to get that data from the kernel to userspace. If you don't
like Relayfs transfer mechanism, what do you suggest using?

Vara Prasad

Jul 13, 2005, 12:01:44 PM

to Tomasz Kłoczko, linux-...@vger.kernel.org, Linus Torvalds, ak...@osdl.org

Tomasz Kłoczko wrote:

> On Tue, 12 Jul 2005, Vara Prasad wrote:
> [..]
>

[..]

Looks like you have not looked at the systemtap project although Tom pointed
it out to you in his previous postings. The URL for systemtap is
http://sourceware.org/systemtap/, I strongly suggest you look at that
project. We are implementing most of what you are suggesting above
in the systemtap project. I don't agree with you that implementing the
above features is trivial and takes a small amount of code. Can you submit
patches to show the simple implementation you are talking about?

> Up to this point none of this will require relayfs, because the amount of
> transferred data will be _very low_.

I think you are forgetting the fact that relayfs has two different
portions: one is the buffering scheme, the other is the data transfer
mechanism. Some of the above features you are talking about need a
buffering scheme.

> The details of the above will probably be different (I have only some very
> general knowledge of DTrace implementation details and average knowledge
> of using the dtrace tool), but I want to count/point out *only*
> features which will not require using relayfs.

I beg to differ. As I mentioned in my earlier postings, DTrace has a
similar per-CPU buffering scheme according to their USENIX paper
(http://www.sun.com/bigadmin/content/dtrace/dtrace_usenix.pdf, see
section 3.3). Can you explain why?

[...]

>
> But if you build all the infrastructure, even for simple counters, on a
> relayfs foundation, it will be (IMO) badly/incorrectly designed .. and
> using even simple counters will introduce too high an overhead for the
> system.

Do you have any performance data to justify your claim of high overhead?

[...]

>
>
> regards
>
> kloczek

bye,
Vara Prasad

Roman Zippel

Jul 14, 2005, 9:28:24 AM

to Andrew Morton, Christoph Hellwig, zan...@us.ibm.com, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com

Hi,

On Mon, 11 Jul 2005, Andrew Morton wrote:

> > > Hi Andrew, can you please merge relayfs? It provides a low-overhead
> > > logging and buffering capability, which does not currently exist in
> > > the kernel.
> >
> > While the code is pretty nicely in shape it seems rather pointless to
> > merge until an actual user goes with it.
>
> Ordinarily I'd agree. But this is a bit like kprobes - it's a funny thing
> which other kernel features rely upon, but those features are often ad-hoc
> and aren't intended for merging.

I agree with Christoph, I'd like to see a small (and useful) example
included, which can be used as reference. relayfs clients still need some
code of their own to communicate with user space. If I look at the example
code I'm not really sure netlink is a good way to go as a control channel.
kprobes has a rather simple interface, relayfs is more complex and I think
it's a good idea to provide some sane and complete example code to copy
from.

Looking through the patch there are still a few areas I'm concerned about:
- the usage of atomic_t looks a little silly, as there is only a single
writer, and it probably needs some cache line optimisations
- I would prefer "unsigned int" over just "unsigned"
- the padding/commit arrays can be easily managed by the client
- overwrite mode can be implemented via the buffer switch callback

In general I'm not against merging, but I have a few ideas for further
cleanups/optimisations and it really would help to have some useful
example code (e.g. a _simple_ event tracer).

bye, Roman

Tom Zanussi

Jul 14, 2005, 11:03:36 AM

to Roman Zippel, Andrew Morton, Christoph Hellwig, zan...@us.ibm.com, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com

Roman Zippel writes:
> Hi,
>
> On Mon, 11 Jul 2005, Andrew Morton wrote:
>
> > > > Hi Andrew, can you please merge relayfs? It provides a low-overhead
> > > > logging and buffering capability, which does not currently exist in
> > > > the kernel.
> > >
> > > While the code is pretty nicely in shape it seems rather pointless to
> > > merge until an actual user goes with it.
> >
> > Ordinarily I'd agree. But this is a bit like kprobes - it's a funny thing
> > which other kernel features rely upon, but those features are often ad-hoc
> > and aren't intended for merging.
>
> I agree with Christoph, I'd like to see a small (and useful) example
> included, which can be used as reference. relayfs clients still need some
> code of their own to communicate with user space. If I look at the example
> code I'm not really sure netlink is a good way to go as a control channel.
> kprobes has a rather simple interface, relayfs is more complex and I think
> it's a good idea to provide some sane and complete example code to copy
> from.
>

The netlink control channel seems to work very well, but I can
certainly change the examples to use something different. Could you
suggest something?

> Looking through the patch there are still a few areas I'm concerned about:
> - the usage of atomic_t looks a little silly, as there is only a single
> writer, and it probably needs some cache line optimisations

The only things that are atomic are the counts of produced and
consumed buffers and these are only ever updated or read in the slow
buffer-switch path. They're atomic because if they weren't, wouldn't
it be possible for the client to read an unfinished value if the
producer was in the middle of updating it?
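
For readers who want to picture the counters under discussion, here is a user-space sketch using C11 atomics. It does not reproduce the relayfs code: the struct and function names are invented, and C11 <stdatomic.h> stands in for the kernel's atomic_t. Whether a plain aligned integer would already be safe on a given architecture is exactly the point being debated, and the sketch does not settle it.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical per-buffer accounting shared by one producer and one consumer. */
struct subbuf_counts {
    atomic_uint produced; /* sub-buffers the writer has finished */
    atomic_uint consumed; /* sub-buffers the reader has released */
};

/* Producer side: called once per buffer switch, i.e. on the slow path. */
static void count_produced(struct subbuf_counts *c)
{
    atomic_fetch_add(&c->produced, 1u);
}

/* Either side: how many finished sub-buffers are waiting to be read. */
static unsigned count_ready(struct subbuf_counts *c)
{
    return atomic_load(&c->produced) - atomic_load(&c->consumed);
}

/* Consumer side: release n sub-buffers that have been read. */
static void count_consumed(struct subbuf_counts *c, unsigned n)
{
    assert(n <= count_ready(c));
    atomic_fetch_add(&c->consumed, n);
}
```

Since the updates happen only on buffer switches, the cost of the atomic operations is paid once per sub-buffer rather than once per logged record.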

> - I would prefer "unsigned int" over just "unsigned"
> - the padding/commit arrays can be easily managed by the client

Yes, I can move them out and update the examples to reflect that, but
I thought that if this was something that most clients would need to
do, it made some sense to keep it in relayfs and avoid duplication in
the clients.

> - overwrite mode can be implemented via the buffer switch callback

The buffer switch callback is already where this is handled, unless
you're thinking of something else - one of the first checks in the
buffer switch is relay_buf_full(), which always returns 0 if the
buffer is in overwrite mode.
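
The behaviour described here can be modelled in a few lines of user-space C. This is a simplified model under stated assumptions, not the relayfs source: buf_full() stands in for relay_buf_full(), and the struct, the stalled flag and try_switch() are hypothetical. The point it shows is that a "full" check which always fails in overwrite mode lets the writer keep cycling through the sub-buffers.

```c
#include <assert.h>
#include <stdbool.h>

struct chan_buf {
    unsigned produced;  /* sub-buffers finished by the writer */
    unsigned consumed;  /* sub-buffers released by the reader */
    unsigned n_subbufs; /* sub-buffers in the ring */
    bool overwrite;     /* flight-recorder mode */
    bool stalled;       /* last sub-buffer already counted, writer is blocked */
};

/* Model of the full check: never full in overwrite mode. */
static bool buf_full(const struct chan_buf *b)
{
    if (b->overwrite)
        return false;
    return b->produced - b->consumed >= b->n_subbufs;
}

/* Called when the current sub-buffer cannot take the next record.
 * Returns true if the writer may continue in the next sub-buffer,
 * false if the record has to be dropped. */
static bool try_switch(struct chan_buf *b)
{
    assert(b->n_subbufs > 0);
    if (!b->stalled)
        b->produced++;        /* the finished sub-buffer becomes ready */
    b->stalled = buf_full(b); /* always false in overwrite mode */
    return !b->stalled;
}
```

In non-overwrite mode the writer stalls once every sub-buffer holds unread data and resumes after the reader releases one; in overwrite mode it never stalls and old data is simply replaced.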

Tom

bert hubert

Jul 16, 2005, 5:14:14 PM

to Tom Zanussi, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com, relayf...@lists.sourceforge.net

Ok, I'm working furiously on my OLS presentation (Wednesday, 3pm, be
there), but I'm running into a wall with relayfs, which I intend to use to
convey large amounts of disk statistics towards userspace.

Now, I've read Documentation/filesystems/relayfs.txt many times over, and I
don't get it.

It appears there is relayfs, and 'klog' on top of that. It also appears that
to access relayed data from the kernel in userspace there is librelay.c.

On reading librelay.c, I find code sending and receiving netlink
messages, but relayfs.txt doesn't even contain the word netlink!

I then launched the 'kleak-app' sample program, but told it to look at
/relay/diskstat* instead of its own file, but it gives me unspecified
netlink errors.

Things I need to know, and which I hope to find documented somewhere:

1) Do I need to do the netlink thing?
2) What kind of messages do I need to send/receive?
3) What is the exact format userspace sees in the relayfs file? Iow, can I
access that file w/o using librelay.c?
4) What are the semantics for reading from that file?
5) When using klog, is there only one channel?
6) does librelay.c talk to regular relayfs or to klog?

Don't get me wrong, relayfs sure looks nice for what I'm trying to do but
from userspace it is sort of a black box right now..

Thanks!

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services

Tom Zanussi

Jul 16, 2005, 7:15:47 PM

to bert hubert, Tom Zanussi, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com, relayf...@lists.sourceforge.net

bert hubert writes:
> Ok, I'm working furiously on my OLS presentation (Wednesday, 3pm, be
> there), but I'm running into a wall with relayfs, which I intend to use to
> convey large amounts of disk statistics towards userspace.
>
> Now, I've read Documentation/filesystems/relayfs.txt many times over, and I
> don't get it.
>
> It appears there is relayfs, and 'klog' on top of that. It also appears that
> to access relayed data from the kernel in userspace there is librelay.c.
>
> On reading librelay.c, I find code sending and receiving netlink
> messages, but relayfs.txt doesn't even contain the word netlink!

Hi,

relayfs itself only provides the buffering and file operations along
with the kernel API for clients as documented in
Documentation/filesystems/relayfs.txt. Applications still need some
kind of communication between the kernel and user space in order to
know when data is ready and how much is ready - the relay-apps stuff
tries to make this easy to do by allowing clients to ignore all those
details. It happens to use netlink for this, but clients can use
whatever they want to do this communication.

The klog patch just makes a couple of utility logging functions
available for use from anywhere within the kernel which allow the
client to not have to worry about whether or not there's a relayfs
channel ready to receive the data - you could just as well use
relay_write directly in say the IO function you want to trace, but
you'd have to do something like if(relay_channel) relay_write(). It
just allows you to unconditionally log regardless of whether there's a
channel ready or not.
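
The guard pattern described above can be sketched in user-space C as follows. None of this is the actual klog or relayfs code: struct channel, chan_write() (standing in for relay_write()) and klog_sketch() are hypothetical names, and the channel is just a flat array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a relayfs channel. */
struct channel { char data[128]; size_t used; };

static struct channel *log_chan; /* NULL until a channel has been opened */

/* Stand-in for relay_write(): requires a valid channel. */
static void chan_write(struct channel *c, const void *rec, size_t n)
{
    assert(c != NULL);
    if (n > sizeof c->data - c->used)
        return; /* no room: drop the record */
    memcpy(c->data + c->used, rec, n);
    c->used += n;
}

/* klog-style wrapper: safe to call whether or not a channel exists. */
static void klog_sketch(const void *rec, size_t n)
{
    if (log_chan)
        chan_write(log_chan, rec, n);
}
```

Instrumented code can then call klog_sketch() everywhere and simply logs nothing until log_chan has been set.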

If you just want to get something up and running without worrying
about the netlink channel and all that stuff, you can just modify the
kleak example as follows:

- apply the klog.patch

- in kleak.c, change init_relay_app("kleak", "cpu", NULL) to
init_relay_app("diskstat", "cpu", NULL). The relayfs files will be
created as /mnt/relay/diskstat/cpu0...cpuX, if you've mounted relayfs
at /mnt/relay.

- in kleak-app.c, change

static char *kleak_filebase = "/mnt/relay/kleak/cpu";

to

static char *kleak_filebase = "/mnt/relay/diskstat/cpu";

- log the data from the kernel functions using klog() or
klog_printk(). The kleak.patch file shows how to do this for
kmalloc/kfree, just do something similar in the functions you actually
want to instrument. You can also use klog_printk() if you want to log
as text.

Then just run the kleak app, and when it finishes, you should have a
set of files, cpu0...cpuX in your current directory containing all
the data you've logged.

If you still have problems and would be willing to share your code,
I'd be happy to get it going myself. Just let me know.

>
> I then launched the 'kleak-app' sample program, but told it to look at
> /relay/diskstat* instead of its own file, but it gives me unspecified
> netlink errors.

Can you give me more details about these errors?

>
> Things I need to know, and which I hope to find documented somewhere:
>
> 1) Do I need to do the netlink thing?

No, the example code uses netlink, but you could use anything you want
to communicate between the kernel and daemon.

> 2) What kind of messages do I need to send/receive?

Basically, the daemon needs to know, for a given per-cpu buffer, how
many sub-buffers have been produced and consumed, in order to know
which sections of the mmapped buffer to read. It also needs to notify
the kernel client of how many sub-buffers it's consumed. Basically
that's it - the rest is application management e.g. the buffer sizes
to use, when to start/stop logging, etc.
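
As a rough illustration of that bookkeeping, the following user-space C sketch copies every produced-but-not-yet-consumed sub-buffer out of a buffer in order. It is not librelay.c: the function name and parameters are assumptions, it ignores padding at the end of each sub-buffer, and it assumes the counters have not wrapped.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy all ready sub-buffers, oldest first, from the mapped buffer `buf`
 * into `dst`. `produced` and `consumed` are the running counts reported
 * for this per-cpu buffer. Returns the number of bytes copied; the caller
 * would then tell the kernel side that (produced - consumed) sub-buffers
 * have been consumed. */
static size_t copy_ready(char *dst, const char *buf, size_t subbuf_size,
                         unsigned n_subbufs, unsigned produced,
                         unsigned consumed)
{
    unsigned ready = produced - consumed;
    size_t out = 0;

    assert(n_subbufs > 0 && ready <= n_subbufs);
    for (unsigned i = 0; i < ready; i++) {
        /* the i-th unread sub-buffer lives at this slot of the ring */
        size_t off = (size_t)((consumed + i) % n_subbufs) * subbuf_size;
        memcpy(dst + out, buf + off, subbuf_size);
        out += subbuf_size;
    }
    return out;
}
```

For example, with four sub-buffers, produced = 5 and consumed = 3, the daemon would read slot 3 and then slot 0.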

> 3) What is the exact format userspace sees in the relayfs file? Iow, can I
> access that file w/o using librelay.c?

The format is whatever the client writes into it - relayfs itself
doesn't impose any format at all. The client doesn't need librelay.c
to read the data itself - librelay.c is for managing the daemon side
of the application and writing ready data to disk as it becomes
available. It doesn't know anything about the actual data being written.

> 4) What are the semantics for reading from that file?

The file is a buffer broken up into sub-buffers. The client reads the
sub-buffers it knows are ready directly from the mmapped buffer.
The file can only be mmap()ed - there is no read() available.
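
A user-space reader might map one of the per-cpu files along these lines. This is a sketch under assumptions: the helper names are made up, the protection and sharing flags (PROT_READ, MAP_SHARED) are a guess at what a read-only consumer would use, and the total size must be known from the channel setup (sub-buffer size times number of sub-buffers).

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `size` bytes of a per-cpu file read-only; returns NULL on failure.
 * The caller releases the mapping with munmap(ptr, size). */
static const char *map_percpu_file(const char *path, size_t size)
{
    assert(path != NULL && size > 0);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    void *p = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : (const char *)p;
}

/* Address of sub-buffer `idx` inside the mapping. */
static const char *subbuf_at(const char *base, size_t subbuf_size, unsigned idx)
{
    return base + (size_t)idx * subbuf_size;
}
```

With relayfs mounted at /mnt/relay as in the example above, the path would be something like /mnt/relay/diskstat/cpu0.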

> 5) When using klog, is there only one channel?

There is only one channel, which is represented in the filesystem as a
set of per-cpu files.

> 6) does librelay.c talk to regular relayfs or to klog?

librelay.c talks to the client code in relay-app.h, which in turn uses
the relayfs kernel API to talk to relayfs.

BTW, there's also documentation in relay-app.h, don't know if you saw
that.

Hope that helps,

Tom

bert hubert

Jul 17, 2005, 5:02:53 AM

to Tom Zanussi, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com, relayf...@lists.sourceforge.net

On Sat, Jul 16, 2005 at 06:13:55PM -0500, Tom Zanussi wrote:

> relayfs itself only provides the buffering and file operations along
> with the kernel API for clients as documented in
> Documentation/filesystems/relayfs.txt. Applications still need some
> kind of communication between the kernel and user space in order to
> know when data is ready and how much is ready - the relay-apps stuff
> tries to make this easy to do by allowing clients to ignore all those
> details. It happens to use netlink for this, but clients can use
> whatever they want to do this communication.

Ok - that is good to know. What is missing from relayfs.txt is a demarcation
of which system does what.

As I see it there are three things currently:

1) Basic relayfs facilities, which only stuff data into N sub-buffers per
CPU, but also offer a set of functions that could be called via userspace
over some sort of communication channel.

2) klog which is a thin wrapper over relay_write

3) relay-app.h which lives in the kernel and communicates with librelay.c in
user space, providing that communication.

Is this correct?

> Then just run the kleak app, and when it finishes, you should have a
> set of files, cpu0...cpuX in your current directory containing all
> the data you've logged.

I've changed the fprintf(stderr, "netlink send error") to perror("netlink
send error") and now it prints 'Connection refused', which makes heaps of
sense since I did not use relay-app.h, but wrote directly to the channel.

> > 2) What kind of messages do I need to send/receive?
>
> Basically, the daemon needs to know, for a given per-cpu buffer, how
> many sub-buffers have been produced and consumed, in order to know
> which sections of the mmapped buffer to read. It also needs to notify

I currently just write away without any userspace component, except that I
mmap the entire relayfs file, in which I see the four configured sub-buffers.
I guess that would work in overwrite mode?

> The format is whatever the client writes into it - relayfs itself
> doesn't impose any format at all. The client doesn't need librelay.c
> to read the data itself - librelay.c is for managing the daemon side
> of the application and writing ready data to disk as it becomes
> available. It doesn't know anything about the actual data being written.

Ok - so there is nothing in there except n stretches of data, and some
padding? Each write is either IN a sub-buffer or not at all, it doesn't span
sub-buffers?

> > 4) What are the semantics for reading from that file?
>
> The file is a buffer broken up into sub-buffers. The client reads the
> sub-buffers it knows are ready directly from the mmapped buffer.
> The file can only be mmap()ed - there is no read() available.

Indeed. So the idea is to wait for a sub-buffer to become 'full', read it,
and wait for the next one to become full?

> BTW, there's also documentation in relay-app.h, don't know if you saw
> that.

Yes - but it only makes sense after the 'separation of powers' within
relayfs is clear. relayfs.txt talks rather cavalierly of 'clients' and
'calls' but does not make clear this client lives in userspace and can't
just call kernel functions.

Please consider the patch below. I'm not 100% sure if everything is correct,
but I'd love to know.

I'm wondering how relayfs could be operated safely in overwrite mode, btw -
who's to say the kernel might not have zoomed past my sub-buffer once I'm
notified of the crossing? The padding data I receive might be outdated by
then. Sounds racy.

In fact, it appears this might even happen in non-overwrite mode.

diff -urBb -X linux-2.6.13-rc3-mm1/Documentation/dontdiff linux-2.6.13-rc3-mm1/Documentation/filesystems/relayfs.txt linux-2.6.13-rc3-mm1-ahu/Documentation/filesystems/relayfs.txt
--- linux-2.6.13-rc3-mm1/Documentation/filesystems/relayfs.txt 2005-07-17 11:00:48.000638680 +0200
+++ linux-2.6.13-rc3-mm1-ahu/Documentation/filesystems/relayfs.txt 2005-07-17 10:58:21.634889656 +0200
@@ -23,6 +23,46 @@
the function parameters are documented along with the functions in the
filesystem code - please see that for details.

+Semantics
+=========
+
+Each relayfs channel has one buffer per CPU, and each buffer has one or
+more sub-buffers. Messages are written to the first sub-buffer until it
+is too full to contain a new message, in which case the message is
+written to the next (if available). At this point, userspace can be
+notified so it empties the first sub-buffer, while the kernel continues
+writing to the next.
+
+If notified that a sub-buffer is full, the kernel knows how many bytes
+of it are padding, ie, unused. Userspace can use this knowledge to copy
+only valid data.
+
+After copying, userspace can notify the kernel that a sub-buffer has
+been consumed.
+
+relayfs can operate in a mode where it will overwrite data not yet
+collected by userspace, and not wait for it to consume it.
+
+relayfs itself does not provide for communication of such data between
+userspace and kernel, allowing the kernel side to remain simple and not
+impose a single interface on userspace. It does provide a separate
+helper though, described below.
+
+Klog, relay-app & librelay
+==========================
+
+relayfs itself is ready to use, but to make things easier, two
+additional systems are provided. Klog is a simple wrapper to make
+sending data to a channel simpler. relay-app is the kernel counterpart
+of userspace librelay.c; combined, these two files provide glue to
+easily stream data, without having to bother with housekeeping.
+
+It is possible to use relayfs without relay-app & librelay, but you'll
+have to implement communication between userspace and kernel, allowing
+both to convey the state of buffers (full, empty, amount of padding).
+
+Klog, relay-app and librelay can be found on
+http://relayfs.sourceforge.net

The relayfs user space API
==========================
@@ -34,7 +74,8 @@
open() enables user to open an _existing_ buffer.

mmap() results in channel buffer being mapped into the caller's
- memory space.
+ memory space. Note that you can't do a partial mmap - you must
+ map the entire file, which is NRBUF * SUBBUFSIZE.

poll() POLLIN/POLLRDNORM/POLLERR supported. User applications are
notified when sub-buffer boundaries are crossed.
@@ -70,6 +109,9 @@
relayfs_create_dir(name, parent)
relayfs_remove_dir(dentry)
relay_commit(buf, reserved, count)
+
+ channel management typically called on instigation of userspace:
+
relay_subbufs_consumed(chan, cpu, subbufs_consumed)

write functions:
@@ -86,10 +128,9 @@
buf_unmapped(buf, filp)
buf_full(buf, subbuf_idx)

-
-A relayfs channel is made of up one or more per-cpu channel buffers,
-each implemented as a circular buffer subdivided into one or more
-sub-buffers.
+As explained above, a relayfs channel is made up of one or more per-cpu
+channel buffers, each implemented as a circular buffer subdivided into
+one or more sub-buffers.

relay_open() is used to create a channel, along with its per-cpu
channel buffers. Each channel buffer will have an associated file
@@ -123,24 +164,25 @@
data regardless of whether it's actually been consumed. In
no-overwrite mode, writes will fail i.e. data will be lost, if the
number of unconsumed sub-buffers equals the total number of
-sub-buffers in the channel. In this mode, the client is reponsible
-for notifying relayfs when sub-buffers have been consumed via
-relay_subbufs_consumed(). A full buffer will become 'unfull' and
-logging will continue once the client calls relay_subbufs_consumed()
-again. When a buffer becomes full, the buf_full() callback is invoked
-to notify the client. In both modes, the subbuf_start() callback will
-notify the client whenever a sub-buffer boundary is crossed. This can
-be used to write header information into the new sub-buffer or fill in
-header information reserved in the previous sub-buffer. One piece of
-information that's useful to save in a reserved header slot is the
-number of bytes of 'padding' for a sub-buffer, which is the amount of
-unused space at the end of a sub-buffer. The padding count for each
-sub-buffer is contained in an array in the rchan_buf struct passed
-into the subbuf_start() callback: rchan_buf->padding[prev_subbuf_idx]
-can be used to to get the padding for the just-finished sub-buffer.
-subbuf_start() is also called for the first sub-buffer in each channel
-buffer when the channel is created. The mode is specified to
-relay_open() using the overwrite parameter.
+sub-buffers in the channel.
+
+In this mode, the userspace client is responsible for notifying relayfs when
+sub-buffers have been consumed via relay_subbufs_consumed(). A full buffer
+will become 'unfull' and logging will continue once the client calls
+relay_subbufs_consumed(). When a buffer becomes full, the buf_full()
+callback is invoked to notify the client. In both modes, the subbuf_start()
+callback will notify the client whenever a sub-buffer boundary is crossed.
+
+This can be used to write header information into the new sub-buffer or fill
+in header information reserved in the previous sub-buffer. One piece of
+information that's useful to save in a reserved header slot is the number of
+bytes of 'padding' for a sub-buffer, which is the amount of unused space at
+the end of a sub-buffer. The padding count for each sub-buffer is contained
+in an array in the rchan_buf struct passed into the subbuf_start() callback:
+rchan_buf->padding[prev_subbuf_idx] can be used to get the padding for
+the just-finished sub-buffer. subbuf_start() is also called for the first
+sub-buffer in each channel buffer when the channel is created. The mode is
+specified to relay_open() using the overwrite parameter.

kernel clients write data into the current cpu's channel buffer using
relay_write() or __relay_write(). relay_write() is the main logging

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services

Roman Zippel

Jul 17, 2005, 10:07:06 AM

to Tom Zanussi, Andrew Morton, Christoph Hellwig, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com

Hi,

On Thu, 14 Jul 2005, Tom Zanussi wrote:

> The netlink control channel seems to work very well, but I can
> certainly change the examples to use something different. Could you
> suggest something?

It just looks like a complicated way to do an ioctl; a control file that
you can read/write would be a lot simpler and faster.

> > Looking through the patch there are still a few areas I'm concerned about:
> > - the usage of atomic_t look a little silly, there is only a single
> > writer and probably needs some cache line optimisations
>
> The only things that are atomic are the counts of produced and
> consumed buffers and these are only ever updated or read in the slow
> buffer-switch path. They're atomic because if they weren't, wouldn't
> it be possible for the client to read an unfinished value if the
> producer was in the middle of updating it?

No.

> > - I would prefer "unsigned int" over just "unsigned"
> > - the padding/commit arrays can be easily managed by the client
>
> Yes, I can move them out and update the examples to reflect that, but
> I thought that if this was something that most clients would need to
> do, it made some sense to keep it in relayfs and avoid duplication in
> the clients.

If a lot of clients need this, there are different ways to do this, e.g. by
introducing some helper functions that clients can use. This way you can
keep the core simple and allow the client to modify its behaviour.

> > - overwrite mode can be implemented via the buffer switch callback
>
> The buffer switch callback is already where this is handled, unless
> you're thinking of something else - one of the first checks in the
> buffer switch is relay_buf_full(), which always returns 0 if the
> buffer is in overwrite mode.

I mean, relayfs doesn't have to know about this, the client itself can do
it (e.g. via helper functions).

bye, Roman

Tom Zanussi

Jul 17, 2005, 11:45:04 AM

to bert hubert, Tom Zanussi, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com, relayf...@lists.sourceforge.net

bert hubert writes:
> On Sat, Jul 16, 2005 at 06:13:55PM -0500, Tom Zanussi wrote:
>
> > relayfs itself only provides the buffering and file operations along
> > with the kernel API for clients as documented in
> > Documentation/filesystems/relayfs.txt. Applications still need some
> > kind of communication between the kernel and user space in order to
> > know when data is ready and how much is ready - the relay-apps stuff
> > tries to make this easy to do by allowing clients to ignore all those
> > details. It happens to use netlink for this, but clients can use
> > whatever they want to do this communication.
>
> Ok - that is good to know. What is missing from relayfs.txt is a demarcation
> of which system does what.
>
> As I see it there are three things currently:
>
> 1) Basic relayfs facilities, which only stuff data into N sub-buffers per
> CPU, but also offer a set of functions that could be called via userspace
> over some sort of communication channel.
>
> 2) klog which is a thin wrapper over relay_write
>
> 3) relay-app.h which lives in the kernel and communicates with librelay.c in
> user space, providing that communication.
>
> Is this correct?

Yes.

>
> > Then just run the kleak app, and when it finishes, you should have a
> > set of files, cpu0l...cpuX in your current directory containing all
> > the data you've logged.
>
> I've changed the fprintf(stderr, "netlink send error") to perror("netlink
> send error") and now it prints 'Connection refused', which makes heaps of
> sense since I did not use relay-app.h, but wrote directly to the channel.

Right - you need to insmod kleak.ko in order for the netlink socket to
be created in the kernel.

>
> > > 2) What kind of messages do I need to send/receive?
> >
> > Basically, the daemon needs to know, for a given per-cpu buffer, how
> > many sub-buffers have been produced and consumed, in order to know
> > which sections of the mmapped buffer to read. It also needs to notify
>
> I currently just write away without any userspace component, except that I
> mmap the entire relayfs file in which I see the four configured sub-buffers.
> I guess that in override mode that would work?

Right - this sounds exactly like what overwrite mode is meant for -
flight-recording types of applications, where you don't have an active
reader in userspace and you're interested in the most recent data. If
you don't have an active reader and use no-overwrite mode, the buffer
will become full when it wraps around the first time, and subsequent
events will be lost (the buffer-full callback will tell you when this
happens).

>
> > The format is whatever the client writes into it - relayfs itself
> > doesn't impose any format at all. The client doesn't need librelay.c
> > to read the data itself - librelay.c is for managing the daemon side
> > of the application and writing ready data to disk as it becomes
> > available. It doesn't know anything about the actual data being written.
>
> Ok - so there is nothing in there except n stretches of data, and some
> padding? Each write is either IN a sub-buffer or not at all, it doesn't span
> sub-buffers?

Right, a write will never be split across sub-buffers.
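
As an illustration of the "never split across sub-buffers" rule, here is a minimal userspace sketch in C. It is not relayfs code: the struct, the function name and the tiny sizes are made up for the example, and it deliberately ignores the buffer-full check so that only the padding bookkeeping is visible.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SUBBUF_SIZE 16 /* hypothetical size, tiny for illustration */
#define N_SUBBUFS   4  /* hypothetical sub-buffer count */

/* Illustrative stand-in for one per-cpu channel buffer (not the real rchan_buf). */
struct sketch_buf {
    char   data[N_SUBBUFS * SUBBUF_SIZE];
    size_t padding[N_SUBBUFS]; /* unused bytes at the end of each finished sub-buffer */
    size_t cur;                /* index of the sub-buffer currently being written */
    size_t offset;             /* bytes already used in the current sub-buffer */
};

/*
 * Write one message. If it does not fit in what is left of the current
 * sub-buffer, record the leftover space as padding and start the next
 * sub-buffer, so a message is never split across two sub-buffers.
 * Returns 0 on success, -1 if the message cannot fit in any sub-buffer.
 */
static int sketch_write(struct sketch_buf *b, const void *msg, size_t len)
{
    assert(b != NULL && msg != NULL);
    if (len > SUBBUF_SIZE)
        return -1;
    if (b->offset + len > SUBBUF_SIZE) {
        b->padding[b->cur] = SUBBUF_SIZE - b->offset;
        b->cur = (b->cur + 1) % N_SUBBUFS; /* wraps without any full check */
        b->offset = 0;
    }
    memcpy(b->data + b->cur * SUBBUF_SIZE + b->offset, msg, len);
    b->offset += len;
    return 0;
}
```

With two 10-byte messages and 16-byte sub-buffers, the second message lands at the start of sub-buffer 1 and sub-buffer 0 ends up with 6 bytes of padding.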

>
> > > 4) What are the semantics for reading from that file?
> >
> > The file is a buffer broken up into sub-buffers. The client reads the
> > sub-buffers it knows are ready directly from the mmapped buffer.
> > The file can only be mmap()ed - there is no read() available.
>
> Indeed. So the idea is to wait for a ringbuffer to become 'full', read it,
> and wait for the next one to become full?

Right, as sub-buffers become full, the userspace part of the client
should read them, update the kernel part with how many it just
consumed, and wait around for more.
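
The read side described above could look roughly like the following C sketch. It assumes the userspace part has already learned the produced and consumed counts and the per-sub-buffer padding through whatever control channel the client uses; the function name and parameters are hypothetical and are not part of relayfs or librelay.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the valid bytes of every ready sub-buffer into out.
 *
 * map          start of the mmapped per-cpu buffer
 * padding      unused byte count at the end of each sub-buffer
 * produced     running count of sub-buffers finished by the kernel
 * consumed     running count of sub-buffers already read by userspace
 *
 * Returns the number of bytes copied and stores in *newly_consumed the
 * number of sub-buffers read, i.e. the value the kernel part would then
 * pass on to relay_subbufs_consumed().
 */
static size_t sketch_drain(const char *map, size_t subbuf_size, size_t n_subbufs,
                           const size_t *padding,
                           size_t produced, size_t consumed,
                           char *out, size_t *newly_consumed)
{
    size_t copied = 0;
    size_t ready = produced - consumed;

    assert(ready <= n_subbufs); /* holds in no-overwrite mode */
    for (size_t i = 0; i < ready; i++) {
        size_t idx = (consumed + i) % n_subbufs;    /* oldest unread first */
        size_t valid = subbuf_size - padding[idx];  /* skip trailing padding */

        memcpy(out + copied, map + idx * subbuf_size, valid);
        copied += valid;
    }
    *newly_consumed = ready;
    return copied;
}
```

The caller is expected to provide an output area of at least n_subbufs * subbuf_size bytes; a real daemon would more likely write each sub-buffer straight to a file.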

>
> > BTW, there's also documentation in relay-app.h, don't know if you saw
> > that.
>
> Yes - but it only makes sense after the 'separation of powers' within
> relayfs is clear. relayfs.txt talks rather cavalierly of 'clients' and
> 'calls' but does not make clear this client lives in userspace and can't
> just call kernel functions.
>
> Please consider the patch below. I'm not 100% sure if everything is correct,
> but I'd love to know.

Yes, on first reading, it all looks correct, and does a nice job of
clarifying things - thanks for taking the time to do this. :-)

>
> I'm wondering how relayfs could be operated safely in overwrite mode, btw -
> who's to say the kernel might not have zoomed past my sub-buffer once I'm
> notified of the crossing? The padding data I receive might be outdated by
> then. Sounds racey.

It is racy - in this mode, there's nothing to keep the kernel from
writing as much as it wants before the user side has a chance to read
any of it. The only way this can be used safely is to make sure the
kernel side isn't writing anything when the client is reading. This
would be typical of a flight-recording usage i.e. kernel writes a
bunch of data continuously, then stops and allows the client to read
whatever's in there.

>
> In fact, it appears this might even happen in non-overwrite mode.

It shouldn't ever be able to happen in non-overwrite mode - if it
did, it would be a bug. Can you be more specific as to how you see
this happening in this mode?

Thanks,

Tom

--
Regards,

Tom Zanussi <zan...@us.ibm.com>
IBM Linux Technology Center/RAS

Tom Zanussi

Jul 17, 2005, 11:55:11 AM

to Roman Zippel, Tom Zanussi, Andrew Morton, Christoph Hellwig, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com

Roman Zippel writes:
> Hi,
>
> On Thu, 14 Jul 2005, Tom Zanussi wrote:
>
> > The netlink control channel seems to work very well, but I can
> > certainly change the examples to use something different. Could you
> > suggest something?
>
> It just looks like a complicated way to do an ioctl, a control file that
> you can read/write would be a lot simpler and faster.

You're right - in previous versions, we did use ioctl - we ended up
using netlink as it seemed like the least offensive option to most people.
I'll try modifying the example code to use a control file or something
like that instead though.

>
> > > Looking through the patch there are still a few areas I'm concerned about:
> > > - the usage of atomic_t look a little silly, there is only a single
> > > writer and probably needs some cache line optimisations
> >
> > The only things that are atomic are the counts of produced and
> > consumed buffers and these are only ever updated or read in the slow
> > buffer-switch path. They're atomic because if they weren't, wouldn't
> > it be possible for the client to read an unfinished value if the
> > producer was in the middle of updating it?
>
> No.
>
> > > - I would prefer "unsigned int" over just "unsigned"
> > > - the padding/commit arrays can be easily managed by the client
> >
> > Yes, I can move them out and update the examples to reflect that, but
> > I thought that if this was something that most clients would need to
> > do, it made some sense to keep it in relayfs and avoid duplication in
> > the clients.
>
> If a lot of clients needs this, there a different ways to do this, e.g. by
> introducing some helper functions that clients can use. This way you can
> keep the core simple and allow the client to modify its behaviour.

OK, I'll think about the best way to change this.

>
> > > - overwrite mode can be implemented via the buffer switch callback
> >
> > The buffer switch callback is already where this is handled, unless
> > you're thinking of something else - one of the first checks in the
> > buffer switch is relay_buf_full(), which always returns 0 if the
> > buffer is in overwrite mode.
>
> I mean, relayfs doesn't has to know about this, the client itself can do
> it (e.g. via helper functions).

In a previous version, we did something like having the client pass
back a return value from the callback indicating whether to continue
or stop. I can try doing something like that again.
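
To make the idea concrete, here is a small C sketch of a buffer switch that knows nothing about overwrite mode and simply asks a client callback whether to continue. All names are invented for illustration; this is not the interface of any relayfs version.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative per-buffer state (not the real rchan_buf). */
struct sketch_state {
    size_t n_subbufs;
    size_t produced; /* sub-buffers finished by the writer */
    size_t consumed; /* sub-buffers released by the reader */
};

/* Client policy: return nonzero to let the writer move to the next sub-buffer. */
typedef int (*sketch_switch_cb)(const struct sketch_state *s);

/* Flight-recorder policy: always continue, old data gets overwritten. */
static int sketch_policy_overwrite(const struct sketch_state *s)
{
    (void)s;
    return 1;
}

/* No-overwrite policy: stop once every sub-buffer is unconsumed. */
static int sketch_policy_no_overwrite(const struct sketch_state *s)
{
    return s->produced - s->consumed < s->n_subbufs;
}

/* The core only calls the callback; it has no mode flag of its own. */
static int sketch_switch(struct sketch_state *s, sketch_switch_cb cb)
{
    assert(s != NULL && cb != NULL);
    if (!cb(s))
        return 0; /* client said stop: the pending write is dropped */
    s->produced++;
    return 1;
}
```

The two policy functions are the kind of thing that could live in shared helper code, which is the trade-off being discussed: the core stays small, at the cost of an indirect call on every buffer switch.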

Tom

bert hubert

Jul 17, 2005, 3:51:19 PM

to Tom Zanussi, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com, relayf...@lists.sourceforge.net

On Sun, Jul 17, 2005 at 10:43:40AM -0500, Tom Zanussi wrote:

> It is racey - in this mode, there's nothing to keep the kernel from
> writing as much as it wants before the user side has a chance to read
> any of it. The only way this can be used safely is to make sure the
> kernel side isn't writing anything when the client is reading. This
> would be typical of a flight-recording usage i.e. kernel writes a
> bunch of data continuously, then stops and allows the client to read
> whatever's in there.

Or by numbering the entries written out; in flight-recording mode you
wouldn't want to block the kernel.
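
A rough C sketch of what numbering entries buys you: with a sequence number in every record, a reader can find the oldest surviving record and tell how many were overwritten, without the writer ever blocking. The record layout and function names are assumptions made for this example (fixed-size records, sequence numbers starting at 1, no handling of 32-bit wraparound), not anything relayfs provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-size record; seq == 0 means "slot never written". */
struct sketch_rec {
    uint32_t seq;
    uint32_t value;
};

/* Writer: always overwrites the oldest slot, never waits for a reader. */
static void sketch_log(struct sketch_rec *ring, size_t n,
                       uint32_t *next_seq, uint32_t value)
{
    uint32_t seq = (*next_seq)++;
    struct sketch_rec *r = &ring[(seq - 1) % n];

    r->value = value;
    r->seq = seq;
}

/*
 * Reader: return the index of the oldest surviving record, or n if the
 * ring is empty. The sequence number found there, minus one, is the
 * number of records that have already been overwritten.
 */
static size_t sketch_oldest(const struct sketch_rec *ring, size_t n)
{
    size_t best = n;

    for (size_t i = 0; i < n; i++) {
        if (ring[i].seq == 0)
            continue;
        if (best == n || ring[i].seq < ring[best].seq)
            best = i;
    }
    return best;
}
```

A reader racing with a live writer would additionally re-check a record's sequence number after copying it, and discard the copy if the number changed in the meantime.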

> > In fact, it appears this might even happen in non-overwrite mode.
>
> It shouldn't ever be able to happen in non-overwrite mode - if it
> did, it would be a bug. Can you be more specific as to how you see
> this happening in this mode?

Yeah - you're right. The misunderstanding is because in both cases
(overwrite and non-overwrite) data is lost, except that in one case you lose
old data, and in the other new data.

It might be a good idea to document this as well.

Btw, I've already uncovered interesting things using relayfs, but I still
don't see the case for having it merged :-)

Thanks for your answers, I think I get it all now.

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services

Tom Zanussi

Jul 17, 2005, 4:51:01 PM

to bert hubert, Tom Zanussi, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com, relayf...@lists.sourceforge.net

bert hubert writes:
> On Sun, Jul 17, 2005 at 10:43:40AM -0500, Tom Zanussi wrote:
>
> > It is racey - in this mode, there's nothing to keep the kernel from
> > writing as much as it wants before the user side has a chance to read
> > any of it. The only way this can be used safely is to make sure the
> > kernel side isn't writing anything when the client is reading. This
> > would be typical of a flight-recording usage i.e. kernel writes a
> > bunch of data continuously, then stops and allows the client to read
> > whatever's in there.
>
> Or by numbering entries written out, when in flight-recording mode you
> wouldn't want to block the kernel.
>
> > > In fact, it appears this might even happen in non-overwrite mode.
> >
> > It shouldn't ever be able to happen in non-overwrite mode - if it
> > did, it would be a bug. Can you be more specific as to how you see
> > this happening in this mode?
>
> Yeah - you're right. The misunderstanding is because in both cases
> (overwrite and non-overwrite) data is lost, except that in one case you lose
> old data, and in the other new data.

Just to clarify - in either mode, if you don't have a consumer or the
consumer can't keep up with the amount of data being written by the
kernel, you will of course lose data at some point. Normally you
wouldn't want to lose data; by using non-overwrite mode you're
implicitly letting relayfs know this i.e. if at any point all the
sub-buffers remain unread and the kernel is still trying to write into
them, let the client know (via the buffer-full callback) that this has
happened. Presumably you would then increase the buffer size or have
the kernel write less etc.
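
The difference between the two modes can be shown with a simplified C model of the sub-buffer accounting. It is only a model under stated assumptions: the struct and function are hypothetical, and real overwrite mode does not move a consumed count the way this does; the counters here exist purely to show which data is lost.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified accounting for one per-cpu buffer (names are illustrative). */
struct sketch_chan {
    size_t n_subbufs;
    size_t produced;  /* sub-buffers finished by the writer */
    size_t consumed;  /* sub-buffers no longer holding unread data */
    int    overwrite; /* nonzero: flight-recorder behaviour */
    size_t lost_new;  /* sub-buffers of new data refused (no-overwrite) */
    size_t lost_old;  /* sub-buffers of unread data overwritten (overwrite) */
};

/*
 * Called when the writer has filled a sub-buffer and wants the next one.
 * Returns 1 if logging continues, 0 if the buffer is full, which is the
 * point where a buffer-full callback would tell the client.
 */
static int sketch_finish_subbuf(struct sketch_chan *c)
{
    int full = (c->produced - c->consumed == c->n_subbufs);

    assert(c->n_subbufs > 0);
    if (full && !c->overwrite) {
        c->lost_new++;   /* new data is dropped until the reader catches up */
        return 0;
    }
    if (full) {
        c->consumed++;   /* the oldest unread sub-buffer is sacrificed */
        c->lost_old++;
    }
    c->produced++;
    return 1;
}
```

Running six switches against four sub-buffers with no reader loses two sub-buffers in either mode; only which two differs.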

>
> It might be a good idea to document this as well.
>

Yes, I'll make it more explicit in the documentation.

> Btw, I've already uncovered interesting things using relayfs, but I still
> don't see the case for having it merged :-)

Glad to hear it. Can you say what if anything would convince you it
should be merged?

>
> Thanks for your answers, I think I get it all now.

No problem, and thanks for the patch and other suggestions.

Tom

Hareesh Nagarajan

Jul 18, 2005, 1:20:03 AM

to Tom Zanussi, Roman Zippel, Andrew Morton, Christoph Hellwig, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com

Tom Zanussi wrote:
> Roman Zippel writes:
> > Hi,
> >
> > On Thu, 14 Jul 2005, Tom Zanussi wrote:
> >
> > > The netlink control channel seems to work very well, but I can
> > > certainly change the examples to use something different. Could you
> > > suggest something?
> >
> > It just looks like a complicated way to do an ioctl, a control file that
> > you can read/write would be a lot simpler and faster.
>
> You're right - in previous versions, we did use ioctl - we ended up
> using netlink as it seemed like least offensive option to most people.
> I'll try modifying the example code to use a control file or something
> like that instead though.

Having an ioctl() interface will definitely make things less
complicated. Are the older versions which use ioctl available off the
relayfs website?

I'm not quite sure if my opinion matters but I'd like to see relayfs
merged. To me it appears to be the quickest and cleanest way to export
trace data from the kernel to userspace.

Thanks,

Hareesh Nagarajan
-= Engineering Intern =-

Richard J Moore

Jul 18, 2005, 4:45:42 AM

to Tom Zanussi, Andrew Morton, Christoph Hellwig, ka...@opersys.com, linux-...@vger.kernel.org, va...@us.ibm.com, zan...@us.ibm.com, Roman Zippel

Tom Zanussi <zan...@us.ibm.com> wrote on 14/07/2005 16:01:25:

> The only things that are atomic are the counts of produced and
> consumed buffers and these are only ever updated or read in the slow
> buffer-switch path. They're atomic because if they weren't, wouldn't
> it be possible for the client to read an unfinished value if the
> producer was in the middle of updating it?

This depends on architecture. It is possible under some architectures to
see the so-called score-boarding effect when reading on one processor while
writing on another when not having imposed any atomicity. From memory, I
believe this might be possible with zSeries, but I'll need to check the
microarchitecture docs. It's been a long time since I read them but I do
recall a reference to the score-boarding effect.
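
For readers who want to see the shape of the counters under discussion, here is a userspace analogue using C11 atomics. It is only an analogy: the kernel's atomic_t is a different API, the names below are invented, and the sketch takes no position on whether atomicity is actually required on a given architecture.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Produced/consumed counts shared between one writer and one reader. */
struct sketch_counts {
    atomic_size_t produced;
    atomic_size_t consumed;
};

/* Writer side: a sub-buffer has been finished. */
static void sketch_publish(struct sketch_counts *c)
{
    /* release: data written into the sub-buffer is visible before the count */
    atomic_fetch_add_explicit(&c->produced, 1, memory_order_release);
}

/* Reader side: how many sub-buffers are ready to be read? */
static size_t sketch_ready(struct sketch_counts *c)
{
    size_t p = atomic_load_explicit(&c->produced, memory_order_acquire);
    size_t q = atomic_load_explicit(&c->consumed, memory_order_relaxed);

    return p - q;
}

/* Reader side: n sub-buffers have been copied out and can be reused. */
static void sketch_release(struct sketch_counts *c, size_t n)
{
    atomic_fetch_add_explicit(&c->consumed, n, memory_order_release);
}
```

Atomic loads and stores rule out a reader observing a half-updated count; the release/acquire pairing is what additionally orders the sub-buffer contents against the count.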

> ...

Richard

- -
Richard J Moore
IBM Advanced Linux Response Team - Linux Technology Centre
MOBEX: 264807; Mobile (+44) (0)7739-875237
Office: (+44) (0)1962-817072

Steven Rostedt

Jul 18, 2005, 9:29:53 AM

to bert hubert, relayf...@lists.sourceforge.net, richard...@uk.ibm.com, va...@us.ibm.com, ka...@opersys.com, linux-...@vger.kernel.org, Tom Zanussi

On Sun, 2005-07-17 at 21:45 +0200, bert hubert wrote:
> On Sun, Jul 17, 2005 at 10:43:40AM -0500, Tom Zanussi wrote:
>
> > It is racey - in this mode, there's nothing to keep the kernel from
> > writing as much as it wants before the user side has a chance to read
> > any of it. The only way this can be used safely is to make sure the
> > kernel side isn't writing anything when the client is reading. This
> > would be typical of a flight-recording usage i.e. kernel writes a
> > bunch of data continuously, then stops and allows the client to read
> > whatever's in there.
>
> Or by numbering entries written out, when in flight-recording mode you
> wouldn't want to block the kernel.

Exactly! I've written a logging device to record data in the kernel
that a printk can't help with. I've used this in debugging interrupts,
the scheduler, and high-speed network packets, where a printk to a
serial line would just slow things down, and going to the network is
too expensive (and complex if you happen to be debugging the network).
This tool is called logdev (http://www.kihontech.com/logdev) and uses a
ring buffer that is like the relayfs overwrite mode. It can do
printk-like records, and when something goes wrong, I dump the buffer to
the serial line, or I have a user space program read it from a device.
I don't care about anything that happened earlier; I only want to know
what happened up to the point I dumped the buffer. Lately, I've been
using this with Ingo's RT patch, and when the system locks up, I dump
the buffer, and it shows quite nicely where the lockup occurred, and why.

With Tom's help, I also have a version that uses relayfs as a backend in
overwrite mode. It's still a work in progress (so no web site yet!)
since there are some issues with using a single buffer for multiple CPUs.
This helps in debugging race conditions since you need to see how events
interleave.

>
> > > In fact, it appears this might even happen in non-overwrite mode.
> >
> > It shouldn't ever be able to happen in non-overwrite mode - if it
> > did, it would be a bug. Can you be more specific as to how you see
> > this happening in this mode?
>
> Yeah - you're right. The misunderstanding is because in both cases
> (overwrite and non-overwrite) data is lost, except that in one case you lose
> old data, and in the other new data.
>
> It might be a good idea to document this as well.
>
> Btw, I've already uncovered interesting things using relayfs, but I still
> don't see the case for having it merged :-)

The reason I would like to see this merged is so that kernel hackers
don't need to constantly write their own logging buffers every time they
need to debug a complex area of the kernel.

-- Steve

Steven Rostedt

Jul 18, 2005, 9:47:05 AM

to Tom Zanussi, richard...@uk.ibm.com, va...@us.ibm.com, ka...@opersys.com, linux-...@vger.kernel.org, Christoph Hellwig, Andrew Morton, Roman Zippel

On Sun, 2005-07-17 at 10:52 -0500, Tom Zanussi wrote:

> >
> > > > - overwrite mode can be implemented via the buffer switch callback
> > >
> > > The buffer switch callback is already where this is handled, unless
> > > you're thinking of something else - one of the first checks in the
> > > buffer switch is relay_buf_full(), which always returns 0 if the
> > > buffer is in overwrite mode.
> >
> > I mean, relayfs doesn't has to know about this, the client itself can do
> > it (e.g. via helper functions).
>
> In a previous version, we did something like having the client pass
> back a return value from the callback indicating whether or not to
> continue or stop. I can try doing something like that instead again.

Tom,

I'm actually very much against this. Looking at it from the point of
view of the logdev device, having a callback to decide whether to
continue at every buffer switch would just slow down something that is
expected to be very fast. I don't see the problem with having an
overwrite mode or not. Why can't relayfs know this? It _is_ an operation
of relayfs, and pushing it to the client would seem to make the client
need to know more about how relayfs works than it needs to; the logdev
device doesn't care about buffer switches.

-- Steve

Roman Zippel

Jul 18, 2005, 10:23:57 AM

to Steven Rostedt, Tom Zanussi, richard...@uk.ibm.com, va...@us.ibm.com, ka...@opersys.com, linux-...@vger.kernel.org, Christoph Hellwig, Andrew Morton

Hi,

On Mon, 18 Jul 2005, Steven Rostedt wrote:

> I'm actually very much against this. Looking at a point of view from the
> logdev device. Having a callback to know to continue at every buffer
> switch would just be slowing down something that is expected to be very
> fast.

What exactly would be slowed down?
It would just move around some code and even avoid the overwrite mode
check.

> I don't see the problem with having an overwrite mode or not. Why
> can't relayfs know this?

The point is to design a simple and flexible relayfs layer, which means
not every possible function has to be done in the relayfs layer, as long
as it's flexible enough to build additional functionality on top of it (for
which it can again provide some library functions).

bye, Roman

Tom Zanussi

Jul 18, 2005, 10:32:49 AM

to Hareesh Nagarajan, Tom Zanussi, Roman Zippel, Andrew Morton, Christoph Hellwig, linux-...@vger.kernel.org, ka...@opersys.com, va...@us.ibm.com, richard...@uk.ibm.com

Hareesh Nagarajan writes:
> Tom Zanussi wrote:
> > Roman Zippel writes:
> > > Hi,
> > >
> > > On Thu, 14 Jul 2005, Tom Zanussi wrote:
> > >
> > > > The netlink control channel seems to work very well, but I can
> > > > certainly change the examples to use something different. Could you
> > > > suggest something?
> > >
> > > It just looks like a complicated way to do an ioctl, a control file that
> > > you can read/write would be a lot simpler and faster.
> >
> > You're right - in previous versions, we did use ioctl - we ended up
> > using netlink as it seemed like least offensive option to most people.
> > I'll try modifying the example code to use a control file or something
> > like that instead though.
>
> Having an ioctl() interface will definitely make things less
> complicated. Are the older versions which use ioctl available off the
> relayfs website?

Yes, the 'old relayfs' patches on the website implement ioctl.

Tom

Steven Rostedt

Jul 18, 2005, 10:40:34 AM

to Roman Zippel, Tom Zanussi, richard...@uk.ibm.com, va...@us.ibm.com, ka...@opersys.com, linux-...@vger.kernel.org, Christoph Hellwig, Andrew Morton

On Mon, 2005-07-18 at 16:16 +0200, Roman Zippel wrote:
> Hi,
>
> On Mon, 18 Jul 2005, Steven Rostedt wrote:
>
> > I'm actually very much against this. Looking at a point of view from the
> > logdev device. Having a callback to know to continue at every buffer
> > switch would just be slowing down something that is expected to be very
> > fast.
>
> What exactly would be slowed down?
> It would just move around some code and even avoid the overwrite mode
> check.

Yes: you're adding a jump to another function via a function pointer,
which would kill the cache line of execution, just to avoid a simple
check or some other way of handling it. Since I don't want to know the
internals of relayfs, the overwrite mode could be implemented there in a
more efficient way. Granted, this probably isn't much of a slowdown,
since the copying of data would take much longer.

>
> > I don't see the problem with having an overwrite mode or not. Why
> > can't relayfs know this?
>
> The point is to design a simple and flexible relayfs layer, which means
> not every possible function has to be done in the relayfs layer, as long
> as it's flexible enough to build additional functionality on top of it (for
> which it can again provide some library functions).

The overwrite mode isn't that complex. You don't want to make something
so flexible that it becomes more complex. Assembly is more flexible
than C but I wouldn't want to code a lot with it. A library function
for me is out of the question, since what I build on top of relayfs is
mostly in the kernel. The overwrite mode would then have to be
implemented through another kernel activity. I might as well keep my
own ring buffers and forget about using relayfs, and all the points I
made in favor of merging it are moot.

-- Steve

Karim Yaghmour

Jul 18, 2005, 10:43:17 AM

to Roman Zippel, Steven Rostedt, Tom Zanussi, richard...@uk.ibm.com, va...@us.ibm.com, linux-...@vger.kernel.org, Christoph Hellwig, Andrew Morton

Roman Zippel wrote:
> The point is to design a simple and flexible relayfs layer, which means
> not every possible function has to be done in the relayfs layer, as long
> as it's flexible enough to build additional functionality on top of it (for
> which it can again provide some library functions).

I guess I just don't get the point here. Why cut something away if many
users will need it. If it's that popular that you're ready to provide a
library function to do it, then why not just leave it to boot? One of the
goals of relayfs is to avoid code duplication with regards to buffering
in general.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546

Tom Zanussi

Jul 18, 2005, 10:45:14 AM

to Steven Rostedt, Tom Zanussi, richard...@uk.ibm.com, va...@us.ibm.com, ka...@opersys.com, linux-...@vger.kernel.org, Christoph Hellwig, Andrew Morton, Roman Zippel

Steven Rostedt writes:
> On Sun, 2005-07-17 at 10:52 -0500, Tom Zanussi wrote:
>
> > >
> > > > > - overwrite mode can be implemented via the buffer switch callback
> > > >
> > > > The buffer switch callback is already where this is handled, unless
> > > > you're thinking of something else - one of the first checks in the
> > > > buffer switch is relay_buf_full(), which always returns 0 if the
> > > > buffer is in overwrite mode.
> > >
> > > I mean, relayfs doesn't have to know about this, the client itself can do
> > > it (e.g. via helper functions).
> >
> > In a previous version, we did something like having the client pass
> > back a return value from the callback indicating whether or not to
> > continue or stop. I can try doing something like that instead again.
>
> Tom,
>
> I'm actually very much against this. Looking at it from the point of
> view of the logdev device, having a callback to decide whether to
> continue at every buffer switch would just slow down something that is
> expected to be very fast. I don't see the problem with having an
> overwrite mode or not. Why can't relayfs know this? It _is_ an operation
> of relayfs, and pushing it to the client would seem to make the client
> need to know more about how relayfs works than it needs to, because the
> logdev device doesn't care about buffer switches.

I don't think it would slow anything down - it would be pretty much
the same code being executed as before. For example, the buffer_start()
callback for overwrite mode could look like this:

int buffer_start()
{
	...
	return 1; // continue unconditionally
}

And for no-overwrite mode:

int buffer_start()
{
	...
	return !relay_buf_full(buf); // continue if not full
}

Since the buffer start callback already returns the amount that's
supposed to be reserved at the start of the sub-buffer, I'd have to
make that an outparam instead, I guess, but it's basically the same
code handling the overwrite/no-overwrite condition.
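
The two variants above can be sketched as a small user-space model. This is only an illustration under stated assumptions: `demo_buf`, `demo_buf_full()`, the callback type, and the `reserved` outparam are hypothetical names, not part of the relayfs API. It shows that the overwrite/no-overwrite decision can sit entirely in the client callback, while the reserved byte count moves to an outparam.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rchan_buf; only what the sketch needs. */
struct demo_buf {
	unsigned int produced;   /* sub-buffers filled so far */
	unsigned int consumed;   /* sub-buffers already read by user space */
	unsigned int n_subbufs;  /* number of sub-buffers in the ring */
};

/* Full when every sub-buffer holds data that has not been consumed. */
static int demo_buf_full(const struct demo_buf *buf)
{
	return (buf->produced - buf->consumed) >= buf->n_subbufs;
}

/*
 * Assumed callback shape: return 1 to continue logging, 0 to stop.
 * *reserved receives the number of bytes reserved at the start of
 * the new sub-buffer (the former return value).
 */
typedef int (*demo_start_cb)(struct demo_buf *buf, size_t *reserved);

/* Overwrite mode: always continue, even over unread data. */
static int start_overwrite(struct demo_buf *buf, size_t *reserved)
{
	(void)buf;
	assert(reserved != NULL);
	*reserved = 0;
	return 1;
}

/* No-overwrite mode: continue only while there is a free sub-buffer. */
static int start_no_overwrite(struct demo_buf *buf, size_t *reserved)
{
	assert(reserved != NULL);
	*reserved = 0;
	return !demo_buf_full(buf);
}
```

With a ring of four sub-buffers that are all unread, `start_overwrite()` still reports "continue" while `start_no_overwrite()` reports "stop" until the reader consumes at least one sub-buffer.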

Tom

Roman Zippel

Jul 18, 2005, 11:09:26 AM

to Steven Rostedt, Tom Zanussi, richard...@uk.ibm.com, va...@us.ibm.com, ka...@opersys.com, linux-...@vger.kernel.org, Christoph Hellwig, Andrew Morton

Hi,

On Mon, 18 Jul 2005, Steven Rostedt wrote:

> > What exactly would be slowed down?
> > It would just move around some code and even avoid the overwrite mode
> > check.
>
> Yes, you're adding a jump to another function via a function pointer,
> that would kill the cache line of execution, to avoid a simple check, or
> some other way of handling it.

RTFS. (deliver_default_callback)

> Since I don't want to know the internals
> of relayfs,

You have to anyway; currently relayfs clients need some knowledge about how
buffers are managed.

> the overwrite mode could be implemented in a more efficient way.

I wouldn't call the buffer switch routine efficient, yet.

> > > I don't see the problem with having an overwrite mode or not. Why
> > > can't relayfs know this?
> >
> > The point is to design a simple and flexible relayfs layer, which means
> > not every possible function has to be done in the relayfs layer, as long
> > as it's flexible enough to build additional functionality on top of it (for
> > which it can again provide some library functions).
>
> The overwrite mode isn't that complex. You don't want to make something
> so flexible that it becomes more complex. Assembly is more flexible
> than C but I wouldn't want to code a lot with it. A library function
> for me is out of the question, since what I build on top of relayfs is
> mostly in the kernel. The overwrite mode would then have to be
> implemented through another kernel activity. I might as well keep my
> own ring buffers and forget about using relayfs, and all the points I
> made in favor of merging it are moot.

I must admit I have no clue what you're talking about here...
The keywords above are "_simple_ _and_ _flexible_".

bye, Roman

Roman Zippel

Jul 18, 2005, 11:22:33 AM

to Karim Yaghmour, Steven Rostedt, Tom Zanussi, richard...@uk.ibm.com, va...@us.ibm.com, linux-...@vger.kernel.org, Christoph Hellwig, Andrew Morton

Hi,

On Mon, 18 Jul 2005, Karim Yaghmour wrote:

> I guess I just don't get the point here. Why cut something away if many
> users will need it. If it's that popular that you're ready to provide a
> library function to do it, then why not just leave it to boot? One of the
> goals of relayfs is to avoid code duplication with regards to buffering
> in general.

The road to bloatness is paved with lots of little features.
There aren't that many users anyway (none of the examples use that
feature). I'd prefer to concentrate on a simple and correct relayfs layer
and we can still think about other features as more users appear.
Starting a design by implementing every little feature which _might_ be
needed is a really bad idea.

bye, Roman

Tom Zanussi

Jul 18, 2005, 12:01:40 PM

to Roman Zippel, Karim Yaghmour, Steven Rostedt, Tom Zanussi, richard...@uk.ibm.com, va...@us.ibm.com, linux-...@vger.kernel.org, Christoph Hellwig, Andrew Morton

Roman Zippel writes:
> Hi,
>
> On Mon, 18 Jul 2005, Karim Yaghmour wrote:
>
> > I guess I just don't get the point here. Why cut something away if many
> > users will need it. If it's that popular that you're ready to provide a
> > library function to do it, then why not just leave it to boot? One of the
> > goals of relayfs is to avoid code duplication with regards to buffering
> > in general.
>
> The road to bloatness is paved with lots of little features.
> There aren't that many users anyway (none of the examples use that
> feature). I'd prefer to concentrate on a simple and correct relayfs layer
> and we can still think about other features as more users appear.
> Starting a design by implementing every little feature which _might_ be
> needed is a really bad idea.
>

OK, if we got rid of the padding counts and commit counts and let the
client manage those, we can simplify the buffer switch slow path and
make the API simpler in the process. Here's a first proposal for
doing that - I won't know until I actually do it what snags I may run
into, but if this looks like the right direction to go, I'll go ahead
with it...

- get rid of the padding counts - the client can manage those if it
wants to, but in any case pass the padding for the previous sub-buffer
in to the subbuf_start callback.

- get rid of the commit counts - the client can manage those. Also,
get rid of the related API functions that deal with those
i.e. relay_commit() and the deliver() callback.

- change the buffer_start() callback to something like the following
(the body shows an example of what would typically be done by a
client):

/*
 * subbuf_start() callback.
 *
 * Return 1 to allow logging to continue, 0 to stop.
 */
static int subbuf_start_default_callback (struct rchan_buf *buf,
					  void *subbuf,
					  void *prev_subbuf,
					  int prev_padding)
{
	*((int *)prev_subbuf) = prev_padding;

	if (relay_buf_full(buf))
		return 0;

	relay_reserve(subbuf, sizeof (int));

	return 1;
}

- add a relay_reserve() function for the client to use to reserve
space at the beginning of the sub-buffer (it can use this reserved
space to save the padding among other things). This would be used by
the client in the subbuf_start callback, rather than returning it via
an outparam or struct.

- remove the buf_full() callback - the client can determine this in
the subbuf_start() callback.
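
To make the proposal above concrete, here is a minimal user-space sketch of a client that keeps the padding count itself. Everything in it is an assumption made for illustration: `demo_chan_buf`, `demo_init()`, `demo_switch()` and the fixed sizes are hypothetical and are not relayfs code. Unlike the example callback above, the sketch checks for a NULL previous sub-buffer, since the first call has none.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_SUBBUF_SIZE 64u
#define DEMO_N_SUBBUFS   4u

/* Hypothetical stand-in for a per-cpu channel buffer. */
struct demo_chan_buf {
	unsigned char mem[DEMO_N_SUBBUFS * DEMO_SUBBUF_SIZE];
	unsigned char *data;    /* start of the current sub-buffer */
	unsigned int offset;    /* write offset within the current sub-buffer */
	unsigned int produced;  /* sub-buffers completed */
	unsigned int consumed;  /* sub-buffers read */
};

static int demo_full(const struct demo_chan_buf *b)
{
	return (b->produced - b->consumed) >= DEMO_N_SUBBUFS;
}

/*
 * Client callback: store the previous sub-buffer's padding in the header
 * it reserved earlier, then reserve a header in the new sub-buffer.
 * Returns 1 to continue logging, 0 to stop.
 */
static int demo_subbuf_start(struct demo_chan_buf *b, void *subbuf,
			     void *prev_subbuf, unsigned int prev_padding)
{
	(void)subbuf;
	if (prev_subbuf)
		memcpy(prev_subbuf, &prev_padding, sizeof(prev_padding));
	if (demo_full(b))
		return 0;
	b->offset = sizeof(unsigned int);  /* reserve room for the padding */
	return 1;
}

static void demo_init(struct demo_chan_buf *b)
{
	memset(b, 0, sizeof(*b));
	b->data = b->mem;
	demo_subbuf_start(b, b->data, NULL, 0);
}

/* Simplified buffer switch: finish the current sub-buffer, start the next. */
static int demo_switch(struct demo_chan_buf *b)
{
	unsigned char *old = b->data;
	unsigned char *next;
	unsigned int padding;

	assert(b->offset <= DEMO_SUBBUF_SIZE);
	padding = DEMO_SUBBUF_SIZE - b->offset;
	b->produced++;
	next = b->mem + (b->produced % DEMO_N_SUBBUFS) * DEMO_SUBBUF_SIZE;
	if (!demo_subbuf_start(b, next, old, padding))
		return 0;  /* client asked to stop */
	b->data = next;
	return 1;
}
```

A reader of the buffer can then recover how many bytes of each completed sub-buffer are valid by reading the leading `unsigned int` and subtracting it from the sub-buffer size, without the core layer tracking padding at all.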

Also, as far as the netlink/ioctl/proc file communication, I'll have
to think more about it, but will play around with something when I
update the example code.

Let me know if this sounds ok, or if you have better suggestions.

Thanks,

Tom

Paul Jackson

Jul 20, 2005, 5:29:35 PM

to Steven Rostedt, bert....@netherlabs.nl, relayf...@lists.sourceforge.net, richard...@uk.ibm.com, va...@us.ibm.com, ka...@opersys.com, linux-...@vger.kernel.org, zan...@us.ibm.com

Steve wrote:
> The reason I would like to see this merged is so kernel hackers don't
> need to constantly write their own logging buffers every time they need
> to debug a complex area of the kernel.

But I doubt that relayfs, or anything resembling it, will accomplish
this purpose, at least for some of us, in many such situations.

When I'm debugging something requiring detailed tracing, I don't want
to have to think about whether the tracing tool has the particular
behaviour, performance, data loss, and other such characteristics
needed for my immediate needs. It is easier to code up some little
ad hoc mechanism than it is to try to figure out whether some general
purpose mechanism is suitable and how to use the generic mechanism.

Invariably in any particular situation, there is some almost trivial
way to hack in something adequate, for very little effort, doing
things that would be utterly useless in some other case.

Such tracing mechanisms work to obtain major subsystem isolation,
by exposing the flow of data and control back and forth across a
major boundary, such as using strace for the initial isolation of a
problem that might be in user space, or might be in the kernel.

But for detailed work within a subsystem, the corners that one can
cut with ad hoc tools often make them vastly superior to general
purpose tools.

Even the best equipped of carpenters sometimes throw together some
temporary scaffolding using rough cut 2x4's (2 inch by 4 inch cross
section lumber; I don't know what they're called in metric nations.)

If there are enough specific purposes for relayfs, fine. But beware
of over generalizing its potential usefulness. There is always the
risk of over designing it, adding additional flexibility and options
in an effort to gain customers, at the expense of making it less and
less obviously useful in a trivial way for any specific purpose.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@sgi.com> 1.925.600.0401

bert hubert

Jul 20, 2005, 5:47:16 PM

to Paul Jackson, Steven Rostedt, relayf...@lists.sourceforge.net, richard...@uk.ibm.com, va...@us.ibm.com, ka...@opersys.com, linux-...@vger.kernel.org, zan...@us.ibm.com

>
> When I'm debugging something requiring detailed tracing, I don't want
> to have to think about whether the tracing tool has the particular
> behaviour, performance, data loss, and other such characteristics
> needed for my immediate needs. It is easier to code up some little
> ad hoc mechanism than it is to try to figure out whether some general
> purpose mechanism is suitable and how to use the generic mechanism.

You can do lots of modes with relayfs already - no ping-pong buffer,
n-buffer, lossy, not lossy etc etc.

I currently use it in 'flight-recorder' mode where new messages overwrite
old ones.
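
As a rough illustration of what 'flight-recorder' mode means, here is a tiny user-space ring in which new records overwrite the oldest ones. It is a sketch of the idea only, not of relayfs; `flight_rec`, `fr_log()` and `fr_dump()` are made-up names, and the sketch ignores wraparound of the record counter.

```c
#include <assert.h>

#define FR_SLOTS 4u

/* Hypothetical fixed-size recorder holding the last FR_SLOTS values. */
struct flight_rec {
	int slot[FR_SLOTS];
	unsigned int written;  /* total records ever logged */
};

/* Log one record, overwriting the oldest once the ring is full. */
static void fr_log(struct flight_rec *fr, int value)
{
	fr->slot[fr->written % FR_SLOTS] = value;
	fr->written++;
}

/* Copy the surviving records, oldest first, into out; returns the count. */
static unsigned int fr_dump(const struct flight_rec *fr, int *out)
{
	unsigned int n = fr->written < FR_SLOTS ? fr->written : FR_SLOTS;
	unsigned int first = fr->written - n;
	unsigned int i;

	assert(out != 0);
	for (i = 0; i < n; i++)
		out[i] = fr->slot[(first + i) % FR_SLOTS];
	return n;
}
```

Logging six records into the four-slot ring and then dumping it yields only the last four, which is the behaviour one wants when the interesting data is whatever happened just before a failure.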

It might be good to document different possible ways of using relayfs.

> If there are enough specific purposes for relayfs, fine. But beware
> of over generalizing its potential usefulness. There is always the
> risk of over designing it, adding additional flexibility and options
> in an effort to gain customers, at the expense of making it less and
> less obviously useful in a trivial way for any specific purpose.

It's currently pretty limited - but you can add more features on top of it,
in a modular fashion. I tend not to use the complex stuff, but you can layer
it if you want.

It'd be nice if we had some basic relaying infrastructure available that'd
cover most needs successfully. Advanced users can do advanced things if they
want.

Btw, the diskstat tools (http://ds9a.nl/diskstat) require relayfs. It'll be
released this Friday or so.

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services

Paul Jackson

Jul 20, 2005, 8:33:51 PM

to bert hubert, ros...@goodmis.org, relayf...@lists.sourceforge.net, richard...@uk.ibm.com, va...@us.ibm.com, ka...@opersys.com, linux-...@vger.kernel.org, zan...@us.ibm.com

Bert wrote:
> the diskstat tools require relayfs

That way might lie the real value of relayfs, as a common
technology basis for specific tools that are developed
and maintained on top of relayfs.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@sgi.com> 1.925.600.0401

Paul Jackson

Jul 22, 2005, 4:03:56 PM

to bert hubert, ros...@goodmis.org, relayf...@lists.sourceforge.net, richard...@uk.ibm.com, va...@us.ibm.com, ka...@opersys.com, linux-...@vger.kernel.org, zan...@us.ibm.com

Another vote in favor of relayfs here ...

I am reminded by my good colleagues at SGI that relayfs is a key
to the Linux Trace Toolkit (LTT), which is in turn an important
technology for some product(s) on which SGI is working.

It is uses such as this which speak to the value of including
relayfs in the kernel.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@sgi.com> 1.925.600.0401

bert hubert

Jul 22, 2005, 4:36:56 PM

to Paul Jackson, ros...@goodmis.org, relayf...@lists.sourceforge.net, richard...@uk.ibm.com, va...@us.ibm.com, ka...@opersys.com, linux-...@vger.kernel.org, zan...@us.ibm.com

On Fri, Jul 22, 2005 at 01:01:32PM -0700, Paul Jackson wrote:
> Another vote in favor of relayfs here ...

At OLS the 'SystemTAP' idea was presented, which has been partially
implemented already, and it builds on relayfs as well. It dovetails nicely
with kprobes.

So it appears there is a sizeable amount of code which is building on
relayfs, iow, it is getting to be infrastructure.

I'm redoing diskstat to work with k/jprobes so it won't require a kernel
patch anymore, but it will still rely on relayfs.

So it would be tremendously helpful if relayfs would be part of the
mainline. I'll be banging out some HOWTO style documentation soonish.

Bert.

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services

Tom Zanussi

Jul 22, 2005, 4:44:32 PM

to Tom Zanussi, Roman Zippel, Karim Yaghmour, Steven Rostedt, richard...@uk.ibm.com, va...@us.ibm.com, linux-...@vger.kernel.org, Christoph Hellwig, Andrew Morton

Tom Zanussi writes:
>
> OK, if we got rid of the padding counts and commit counts and let the
> client manage those, we can simplify the buffer switch slow path and
> make the API simpler in the process. Here's a first proposal for
> doing that - I won't know until I actually do it what snags I may run
> into, but if this looks like the right direction to go, I'll go ahead
> with it...
>

Here's a preliminary patch that does this cleanup. It ends up being a
nice little simplification of the API and the buffer switch path.
Despite the size of the patch, the changes aren't that significant and
they don't reduce the functionality at all - I've tested using an
updated version of the relay-apps examples - those are still in flux
at the moment, but if anyone wants to see them now, I'll clean them up
and make them available. I have tested that things basically work,
but I still need to do more testing; I'm posting now just in case
anyone disagrees with the changes.

I'll also be posting an update to the documentation shortly.

Here are the changes made by this patch:

- changed unsigned to unsigned int, and also changed several uses of
int to unsigned int where it made sense
- removed the padding counts and commit counts
- changed the subbuf_start() callback to add a prev_padding param, and
return a boolean value to indicate whether or not the buffer switch
should occur
- added a subbuf_start_reserve() helper function
- removed the deliver() callback
- removed the relay_commit() function
- removed the buf_full() callback
- added __relay_reserve(), which is used by relay_reserve() but allows
a client that already has a buffer pointer to use that instead

Tom

diff -urpN -X dontdiff linux-2.6.13-rc3-mm1/fs/relayfs/buffers.c linux-2.6.13-rc3-mm1-cur/fs/relayfs/buffers.c
--- linux-2.6.13-rc3-mm1/fs/relayfs/buffers.c 2005-07-16 11:47:34.000000000 -0500
+++ linux-2.6.13-rc3-mm1-cur/fs/relayfs/buffers.c 2005-07-22 01:10:21.000000000 -0500
@@ -95,7 +95,7 @@ int relay_mmap_buf(struct rchan_buf *buf
static void *relay_alloc_buf(struct rchan_buf *buf, unsigned long size)
{
void *mem;
- int i, j, n_pages;
+ unsigned int i, j, n_pages;

size = PAGE_ALIGN(size);
n_pages = size >> PAGE_SHIFT;
@@ -137,27 +137,15 @@ struct rchan_buf *relay_create_buf(struc
if (!buf)
return NULL;

- buf->padding = kmalloc(chan->n_subbufs * sizeof(unsigned *), GFP_KERNEL);
- if (!buf->padding)
- goto free_buf;
-
- buf->commit = kmalloc(chan->n_subbufs * sizeof(unsigned *), GFP_KERNEL);
- if (!buf->commit)
- goto free_buf;
-
buf->start = relay_alloc_buf(buf, chan->alloc_size);
- if (!buf->start)
- goto free_buf;
-
+ if (!buf->start) {
+ kfree(buf);
+ return NULL;
+ }
+
buf->chan = chan;
kref_get(&buf->chan->kref);
return buf;
-
-free_buf:
- kfree(buf->commit);
- kfree(buf->padding);
- kfree(buf);
- return NULL;
}

/**
@@ -167,7 +155,7 @@ free_buf:
void relay_destroy_buf(struct rchan_buf *buf)
{
struct rchan *chan = buf->chan;
- int i;
+ unsigned int i;

if (likely(buf->start)) {
vunmap(buf->start);
@@ -175,8 +163,6 @@ void relay_destroy_buf(struct rchan_buf
__free_page(buf->page_array[i]);
kfree(buf->page_array);
}
- kfree(buf->padding);
- kfree(buf->commit);
kfree(buf);
kref_put(&chan->kref, relay_destroy_channel);
}
diff -urpN -X dontdiff linux-2.6.13-rc3-mm1/fs/relayfs/relay.c linux-2.6.13-rc3-mm1-cur/fs/relayfs/relay.c
--- linux-2.6.13-rc3-mm1/fs/relayfs/relay.c 2005-07-16 11:47:34.000000000 -0500
+++ linux-2.6.13-rc3-mm1-cur/fs/relayfs/relay.c 2005-07-23 09:27:40.000000000 -0500
@@ -26,10 +26,7 @@
*/
int relay_buf_empty(struct rchan_buf *buf)
{
- int produced = atomic_read(&buf->subbufs_produced);
- int consumed = atomic_read(&buf->subbufs_consumed);
-
- return (produced - consumed) ? 0 : 1;
+ return (buf->subbufs_produced - buf->subbufs_consumed) ? 0 : 1;
}

/**
@@ -38,17 +35,10 @@ int relay_buf_empty(struct rchan_buf *bu
*
* Returns 1 if the buffer is full, 0 otherwise.
*/
-static inline int relay_buf_full(struct rchan_buf *buf)
+int relay_buf_full(struct rchan_buf *buf)
{
- int produced, consumed;
-
- if (buf->chan->overwrite)
- return 0;
-
- produced = atomic_read(&buf->subbufs_produced);
- consumed = atomic_read(&buf->subbufs_consumed);
-
- return (produced - consumed > buf->chan->n_subbufs - 1) ? 1 : 0;
+ unsigned int ready = buf->subbufs_produced - buf->subbufs_consumed;
+ return (ready >= buf->chan->n_subbufs) ? 1 : 0;
}

/*
@@ -65,22 +55,13 @@ static inline int relay_buf_full(struct

*/
static int subbuf_start_default_callback (struct rchan_buf *buf,
void *subbuf,

- unsigned prev_subbuf_idx,
- void *prev_subbuf)
+ void *prev_subbuf,
+ unsigned int prev_padding)
{
return 0;
}

/*
- * deliver() default callback. Does nothing.
- */
-static void deliver_default_callback (struct rchan_buf *buf,
- unsigned subbuf_idx,
- void *subbuf)
-{
-}
-
-/*
* buf_mapped() default callback. Does nothing.
*/
static void buf_mapped_default_callback(struct rchan_buf *buf,
@@ -96,22 +77,11 @@ static void buf_unmapped_default_callbac
{
}

-/*
- * buf_full() default callback. Does nothing.
- */
-static void buf_full_default_callback(struct rchan_buf *buf,
- unsigned subbuf_idx,
- void *subbuf)
-{
-}
-
/* relay channel default callbacks */
static struct rchan_callbacks default_channel_callbacks = {
.subbuf_start = subbuf_start_default_callback,
- .deliver = deliver_default_callback,
.buf_mapped = buf_mapped_default_callback,
.buf_unmapped = buf_unmapped_default_callback,
- .buf_full = buf_full_default_callback,
};

/**
@@ -148,10 +118,8 @@ static inline void *get_next_subbuf(stru
*
* See relay_reset for description of effect.
*/
-static inline void __relay_reset(struct rchan_buf *buf, int init)
+static inline void __relay_reset(struct rchan_buf *buf, unsigned int init)
{
- int i;
-
if (init) {
init_waitqueue_head(&buf->read_wait);
kref_init(&buf->kref);
@@ -161,28 +129,19 @@ static inline void __relay_reset(struct
flush_scheduled_work();
}

- atomic_set(&buf->subbufs_produced, 0);
- atomic_set(&buf->subbufs_consumed, 0);
- atomic_set(&buf->unfull, 0);
+ buf->subbufs_produced = 0;
+ buf->subbufs_consumed = 0;
buf->finalized = 0;
buf->data = buf->start;
buf->offset = 0;

- for (i = 0; i < buf->chan->n_subbufs; i++) {
- buf->padding[i] = 0;
- buf->commit[i] = 0;
- }
-
- buf->offset = buf->chan->cb->subbuf_start(buf, buf->data, 0, NULL);
- buf->commit[0] = buf->offset;
+ buf->chan->cb->subbuf_start(buf, buf->data, NULL, 0);
}

/**
* relay_reset - reset the channel
* @chan: the channel
*
- * Returns 0 if successful, negative if not.
- *
* This has the effect of erasing all data from all channel buffers
* and restarting the channel in its initial state. The buffers
* are not freed, so any mappings are still in effect.
@@ -192,7 +151,7 @@ static inline void __relay_reset(struct
*/
void relay_reset(struct rchan *chan)
{
- int i;
+ unsigned int i;

if (!chan)
return;
@@ -255,14 +214,10 @@ static inline void setup_callbacks(struc

if (!cb->subbuf_start)
cb->subbuf_start = subbuf_start_default_callback;
- if (!cb->deliver)
- cb->deliver = deliver_default_callback;
if (!cb->buf_mapped)
cb->buf_mapped = buf_mapped_default_callback;
if (!cb->buf_unmapped)
cb->buf_unmapped = buf_unmapped_default_callback;
- if (!cb->buf_full)
- cb->buf_full = buf_full_default_callback;
chan->cb = cb;
}

@@ -272,7 +227,6 @@ static inline void setup_callbacks(struc
* @parent: dentry of parent directory, NULL for root directory
* @subbuf_size: size of sub-buffers
* @n_subbufs: number of sub-buffers
- * @overwrite: overwrite buffer when full?
* @cb: client callback functions
*
* Returns channel pointer if successful, NULL otherwise.
@@ -284,12 +238,11 @@ static inline void setup_callbacks(struc
*/
struct rchan *relay_open(const char *base_filename,
struct dentry *parent,
- unsigned subbuf_size,
- unsigned n_subbufs,
- int overwrite,
+ unsigned int subbuf_size,
+ unsigned int n_subbufs,
struct rchan_callbacks *cb)
{
- int i;
+ unsigned int i;
struct rchan *chan;
char *tmpname;

@@ -304,7 +257,6 @@ struct rchan *relay_open(const char *bas
return NULL;

chan->version = RELAYFS_CHANNEL_VERSION;
- chan->overwrite = overwrite;
chan->n_subbufs = n_subbufs;
chan->subbuf_size = subbuf_size;
chan->alloc_size = FIX_SIZE(subbuf_size * n_subbufs);
@@ -339,35 +291,6 @@ free_chan:
}

/**
- * deliver_check - deliver a guaranteed full sub-buffer if applicable
- */
-static inline void deliver_check(struct rchan_buf *buf,
- unsigned subbuf_idx)
-{
- void *subbuf;
- unsigned full = buf->chan->subbuf_size - buf->padding[subbuf_idx];
-
- if (buf->commit[subbuf_idx] == full) {
- subbuf = buf->start + subbuf_idx * buf->chan->subbuf_size;
- buf->chan->cb->deliver(buf, subbuf_idx, subbuf);
- }
-}
-
-/**
- * do_switch - change subbuf pointer and do related bookkeeping
- */
-static inline void do_switch(struct rchan_buf *buf, unsigned new, unsigned old)
-{
- unsigned start = 0;
- void *old_data = buf->start + old * buf->chan->subbuf_size;
-
- buf->data = get_next_subbuf(buf);
- buf->padding[new] = 0;
- start = buf->chan->cb->subbuf_start(buf, buf->data, old, old_data);
- buf->offset = buf->commit[new] = start;
-}
-
-/**
* relay_switch_subbuf - switch to a new sub-buffer
* @buf: channel buffer
* @length: size of current event
@@ -377,45 +300,30 @@ static inline void do_switch(struct rcha
* Performs sub-buffer-switch tasks such as invoking callbacks,
* updating padding counts, waking up readers, etc.
*/
-unsigned relay_switch_subbuf(struct rchan_buf *buf, unsigned length)
+unsigned int relay_switch_subbuf(struct rchan_buf *buf, unsigned int length)
{
- int new, old, produced = atomic_read(&buf->subbufs_produced);
- unsigned padding;
+ void *old, *new;

if (unlikely(length > buf->chan->subbuf_size))
goto toobig;

- if (unlikely(atomic_read(&buf->unfull))) {
- atomic_set(&buf->unfull, 0);
- new = produced % buf->chan->n_subbufs;
- old = (produced - 1) % buf->chan->n_subbufs;
- do_switch(buf, new, old);
- return 0;
- }
-
- if (unlikely(relay_buf_full(buf)))
- return 0;
-
- old = produced % buf->chan->n_subbufs;
- padding = buf->chan->subbuf_size - buf->offset;
- buf->padding[old] = padding;
- deliver_check(buf, old);
- buf->offset = buf->chan->subbuf_size;
- atomic_inc(&buf->subbufs_produced);
-
- if (waitqueue_active(&buf->read_wait)) {
- PREPARE_WORK(&buf->wake_readers, wakeup_readers, buf);
- schedule_delayed_work(&buf->wake_readers, 1);
+ if (buf->offset != buf->chan->subbuf_size + 1) {
+ buf->prev_padding = buf->chan->subbuf_size - buf->offset;
+ buf->subbufs_produced++;
+ if (waitqueue_active(&buf->read_wait)) {
+ PREPARE_WORK(&buf->wake_readers, wakeup_readers, buf);
+ schedule_delayed_work(&buf->wake_readers, 1);
+ }
}

- if (unlikely(relay_buf_full(buf))) {
- void *old_data = buf->start + old * buf->chan->subbuf_size;
- buf->chan->cb->buf_full(buf, old, old_data);
+ old = buf->data;
+ new = get_next_subbuf(buf);
+ buf->offset = 0;
+ if (!buf->chan->cb->subbuf_start(buf, new, old, buf->prev_padding)) {
+ buf->offset = buf->chan->subbuf_size + 1;
return 0;
}
-
- new = (produced + 1) % buf->chan->n_subbufs;
- do_switch(buf, new, old);
+ buf->data = new;

if (unlikely(length + buf->offset > buf->chan->subbuf_size))
goto toobig;
@@ -429,26 +337,6 @@ toobig:
}

/**
- * relay_commit - add count bytes to a sub-buffer's commit count
- * @buf: channel buffer
- * @reserved: reserved address associated with commit
- * @count: number of bytes committed
- *
- * Invokes deliver() callback if sub-buffer is completely written.
- */
-void relay_commit(struct rchan_buf *buf,
- void *reserved,
- unsigned count)
-{
- unsigned offset, subbuf_idx;
-
- offset = reserved - buf->start;
- subbuf_idx = offset / buf->chan->subbuf_size;
- buf->commit[subbuf_idx] += count;
- deliver_check(buf, subbuf_idx);
-}
-
-/**
* relay_subbufs_consumed - update the buffer's sub-buffers-consumed count
* @chan: the channel
* @cpu: the cpu associated with the channel buffer to update
@@ -461,9 +349,10 @@ void relay_commit(struct rchan_buf *buf,
* NOTE: kernel clients don't need to call this function if the channel
* mode is 'overwrite'.
*/
-void relay_subbufs_consumed(struct rchan *chan, int cpu, int subbufs_consumed)
+void relay_subbufs_consumed(struct rchan *chan,
+ unsigned int cpu,
+ unsigned int subbufs_consumed)
{
- int produced, consumed;
struct rchan_buf *buf;

if (!chan)
@@ -473,14 +362,9 @@ void relay_subbufs_consumed(struct rchan
return;

buf = chan->buf[cpu];
- if (relay_buf_full(buf))
- atomic_set(&buf->unfull, 1);
-
- atomic_add(subbufs_consumed, &buf->subbufs_consumed);
- produced = atomic_read(&buf->subbufs_produced);
- consumed = atomic_read(&buf->subbufs_consumed);
- if (consumed > produced)
- atomic_set(&buf->subbufs_consumed, produced);
+ buf->subbufs_consumed += subbufs_consumed;
+ if (buf->subbufs_consumed > buf->subbufs_produced)
+ buf->subbufs_consumed = buf->subbufs_produced;
}

/**
@@ -502,7 +386,7 @@ void relay_destroy_channel(struct kref *
*/
void relay_close(struct rchan *chan)
{
- int i;
+ unsigned int i;

if (!chan)
return;
@@ -524,7 +408,7 @@ void relay_close(struct rchan *chan)
*/
void relay_flush(struct rchan *chan)
{
- int i;
+ unsigned int i;

if (!chan)
return;
@@ -541,5 +425,5 @@ EXPORT_SYMBOL_GPL(relay_close);
EXPORT_SYMBOL_GPL(relay_flush);
EXPORT_SYMBOL_GPL(relay_reset);
EXPORT_SYMBOL_GPL(relay_subbufs_consumed);
-EXPORT_SYMBOL_GPL(relay_commit);
EXPORT_SYMBOL_GPL(relay_switch_subbuf);
+EXPORT_SYMBOL_GPL(relay_buf_full);
diff -urpN -X dontdiff linux-2.6.13-rc3-mm1/include/linux/relayfs_fs.h linux-2.6.13-rc3-mm1-cur/include/linux/relayfs_fs.h
--- linux-2.6.13-rc3-mm1/include/linux/relayfs_fs.h 2005-07-16 11:47:34.000000000 -0500
+++ linux-2.6.13-rc3-mm1-cur/include/linux/relayfs_fs.h 2005-07-23 09:31:22.000000000 -0500
@@ -22,7 +22,7 @@
/*
* Tracks changes to rchan_buf struct
*/
-#define RELAYFS_CHANNEL_VERSION 3
+#define RELAYFS_CHANNEL_VERSION 4

/*
* Per-cpu relay channel buffer
@@ -31,20 +31,18 @@ struct rchan_buf
{
void *start; /* start of channel buffer */
void *data; /* start of current sub-buffer */
- unsigned offset; /* current offset into sub-buffer */
- atomic_t subbufs_produced; /* count of sub-buffers produced */
- atomic_t subbufs_consumed; /* count of sub-buffers consumed */
- atomic_t unfull; /* state has gone from full to not */
+ unsigned int offset; /* current offset into sub-buffer */
+ unsigned int subbufs_produced; /* count of sub-buffers produced */
+ unsigned int subbufs_consumed; /* count of sub-buffers consumed */
struct rchan *chan; /* associated channel */
wait_queue_head_t read_wait; /* reader wait queue */
struct work_struct wake_readers; /* reader wake-up work struct */
struct dentry *dentry; /* channel file dentry */
struct kref kref; /* channel buffer refcount */
struct page **page_array; /* array of current buffer pages */
- int page_count; /* number of current buffer pages */
- unsigned *padding; /* padding counts per sub-buffer */
- unsigned *commit; /* commit counts per sub-buffer */
- int finalized; /* buffer has been finalized */
+ unsigned int page_count; /* number of current buffer pages */
+ unsigned int finalized; /* buffer has been finalized */
+ unsigned int prev_padding; /* temporary variable */
} ____cacheline_aligned;

/*
@@ -53,10 +51,9 @@ struct rchan_buf
struct rchan
{
u32 version; /* the version of this struct */
- unsigned subbuf_size; /* sub-buffer size */
- unsigned n_subbufs; /* number of sub-buffers per buffer */
- unsigned alloc_size; /* total buffer size allocated */
- int overwrite; /* overwrite buffer when full? */
+ unsigned int subbuf_size; /* sub-buffer size */
+ unsigned int n_subbufs; /* number of sub-buffers per buffer */
+ unsigned int alloc_size; /* total buffer size allocated */
struct rchan_callbacks *cb; /* client callbacks */
struct kref kref; /* channel refcount */
void *private_data; /* for user-defined data */
@@ -86,32 +83,23 @@ struct rchan_callbacks
* subbuf_start - called on buffer-switch to a new sub-buffer
* @buf: the channel buffer containing the new sub-buffer
* @subbuf: the start of the new sub-buffer
- * @prev_subbuf_idx: the previous sub-buffer's index
* @prev_subbuf: the start of the previous sub-buffer
+ * @prev_padding: unused space at the end of previous sub-buffer
*
- * The client should return the number of bytes it reserves at
- * the beginning of the sub-buffer, 0 if none.
+ * The client should return 1 to continue logging, 0 to stop
+ * logging.
*
* NOTE: subbuf_start will also be invoked when the buffer is
* created, so that the first sub-buffer can be initialized
* if necessary. In this case, prev_subbuf will be NULL.
+ *
+ * NOTE: the client can reserve bytes at the beginning of the new
+ * sub-buffer by calling subbuf_start_reserve() in this callback.
*/
int (*subbuf_start) (struct rchan_buf *buf,
void *subbuf,
- unsigned prev_subbuf_idx,
- void *prev_subbuf);
-
- /*
- * deliver - deliver a guaranteed full sub-buffer to client
- * @buf: the channel buffer containing the sub-buffer
- * @subbuf_idx: the sub-buffer's index
- * @subbuf: the start of the new sub-buffer
- *
- * Only works if relay_commit is also used
- */
- void (*deliver) (struct rchan_buf *buf,
- unsigned subbuf_idx,
- void *subbuf);
+ void *prev_subbuf,
+ unsigned int prev_padding);

/*
* buf_mapped - relayfs buffer mmap notification
@@ -132,18 +120,6 @@ struct rchan_callbacks
*/
void (*buf_unmapped)(struct rchan_buf *buf,
struct file *filp);
-
- /*
- * buf_full - relayfs buffer full notification
- * @buf: the channel channel buffer
- * @subbuf_idx: the current sub-buffer's index
- * @subbuf: the start of the current sub-buffer
- *
- * Called when a relayfs buffer becomes full
- */
- void (*buf_full)(struct rchan_buf *buf,
- unsigned subbuf_idx,
- void *subbuf);
};

/*
@@ -152,21 +128,19 @@ struct rchan_callbacks

struct rchan *relay_open(const char *base_filename,
struct dentry *parent,
- unsigned subbuf_size,
- unsigned n_subbufs,
- int overwrite,
+ unsigned int subbuf_size,
+ unsigned int n_subbufs,
struct rchan_callbacks *cb);
extern void relay_close(struct rchan *chan);
extern void relay_flush(struct rchan *chan);
extern void relay_subbufs_consumed(struct rchan *chan,
- int cpu,
- int subbufs_consumed);
+ unsigned int cpu,
+ unsigned int consumed);
extern void relay_reset(struct rchan *chan);
-extern unsigned relay_switch_subbuf(struct rchan_buf *buf,
- unsigned length);
-extern void relay_commit(struct rchan_buf *buf,
- void *reserved,
- unsigned count);
+extern int relay_buf_full(struct rchan_buf *buf);
+
+extern unsigned int relay_switch_subbuf(struct rchan_buf *buf,
+ unsigned int length);
extern struct dentry *relayfs_create_dir(const char *name,
struct dentry *parent);
extern int relayfs_remove_dir(struct dentry *dentry);
@@ -186,7 +160,7 @@ extern int relayfs_remove_dir(struct den
*/
static inline void relay_write(struct rchan *chan,
const void *data,
- unsigned length)
+ unsigned int length)
{
unsigned long flags;
struct rchan_buf *buf;
@@ -214,7 +188,7 @@ static inline void relay_write(struct rc
*/
static inline void __relay_write(struct rchan *chan,
const void *data,
- unsigned length)
+ unsigned int length)
{
struct rchan_buf *buf;

@@ -227,20 +201,19 @@ static inline void __relay_write(struct
}

/**
- * relay_reserve - reserve slot in channel buffer
- * @chan: relay channel
+ * __relay_reserve - reserve slot in channel buffer
+ * @buf: relay channel buffer
* @length: number of bytes to reserve
*
* Returns pointer to reserved slot, NULL if full.
*
- * Reserves a slot in the current cpu's channel buffer.
+ * Reserves a slot in the specified channel buffer.
* Does not protect the buffer at all - caller must provide
* appropriate synchronization.
*/
-static inline void *relay_reserve(struct rchan *chan, unsigned length)
+static inline void *__relay_reserve(struct rchan_buf *buf, unsigned int length)
{
void *reserved;
- struct rchan_buf *buf = chan->buf[smp_processor_id()];

if (unlikely(buf->offset + length > buf->chan->subbuf_size)) {
length = relay_switch_subbuf(buf, length);
@@ -253,6 +226,38 @@ static inline void *relay_reserve(struct
return reserved;
}

+/**
+ * relay_reserve - reserve slot in channel buffer
+ * @chan: relay channel
+ * @length: number of bytes to reserve
+ *
+ * Returns pointer to reserved slot, NULL if full.
+ *
+ * Reserves a slot in the current cpu's channel buffer.
+ * Does not protect the buffer at all - caller must provide
+ * appropriate synchronization.
+ */
+static inline void *relay_reserve(struct rchan *chan, unsigned int length)
+{
+ struct rchan_buf *buf = chan->buf[smp_processor_id()];
+ return __relay_reserve(buf, length);
+}
+
+/**
+ * subbuf_start_reserve - reserve bytes at the start of a sub-buffer
+ * @buf: relay channel buffer
+ * @length: number of bytes to reserve
+ *
+ * Helper function used to reserve bytes at the beginning of
+ * a sub-buffer in the subbuf_start() callback.
+ */
+static inline void subbuf_start_reserve(struct rchan_buf *buf,
+ unsigned int length)
+{
+ BUG_ON(length >= buf->chan->subbuf_size - 1);
+ buf->offset = length;
+}
+
/*
* exported relayfs file operations, fs/relayfs/inode.c
*/
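
The reservation path in the header above can be modelled outside the kernel. What follows is a minimal userspace sketch, not the relayfs code: struct model_chan, struct model_buf and every model_* function are names invented for illustration, and model_switch() only stands in for relay_switch_subbuf() by always advancing to the next sub-buffer and reserving a header the size of an unsigned int. It is meant to show how the offset check, the sub-buffer switch and the start-of-sub-buffer reservation fit together.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only: these structs copy a few field names from
 * rchan/rchan_buf but are NOT the kernel definitions. */
struct model_chan {
	unsigned int subbuf_size;	/* sub-buffer size */
	unsigned int n_subbufs;		/* sub-buffers per buffer */
};

struct model_buf {
	char *start;			/* start of the whole buffer */
	char *data;			/* start of current sub-buffer */
	unsigned int offset;		/* write offset into sub-buffer */
	unsigned int produced;		/* sub-buffers switched away from */
	struct model_chan *chan;
};

/* Same bound as the BUG_ON() in subbuf_start_reserve(). */
static void model_start_reserve(struct model_buf *buf, unsigned int length)
{
	assert(length < buf->chan->subbuf_size - 1);
	buf->offset = length;
}

/* Hypothetical stand-in for relay_switch_subbuf(): always switches
 * (overwrite style) and reserves an unsigned int header in the new
 * sub-buffer. Returns 0 if the request cannot fit even then. */
static unsigned int model_switch(struct model_buf *buf, unsigned int length)
{
	unsigned int next;

	buf->produced++;
	next = buf->produced % buf->chan->n_subbufs;
	buf->data = buf->start + next * buf->chan->subbuf_size;
	model_start_reserve(buf, sizeof(unsigned int));

	if (buf->offset + length > buf->chan->subbuf_size)
		return 0;
	return length;
}

/* Follows the shape of __relay_reserve(): switch when the slot does
 * not fit, then hand out the slot and advance the offset. */
static void *model_reserve(struct model_buf *buf, unsigned int length)
{
	void *reserved;

	if (buf->offset + length > buf->chan->subbuf_size) {
		length = model_switch(buf, length);
		if (!length)
			return NULL;
	}
	reserved = buf->data + buf->offset;
	buf->offset += length;

	return reserved;
}
```

With 16-byte sub-buffers and a 4-byte header, a second 8-byte reservation no longer fits in the first sub-buffer, so the model moves on to the next one and hands out the slot just past its header.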

Tom Zanussi

Jul 22, 2005, 4:47:13 PM

to Tom Zanussi, Roman Zippel, Karim Yaghmour, Steven Rostedt, richard...@uk.ibm.com, va...@us.ibm.com, linux-...@vger.kernel.org, Christoph Hellwig, Andrew Morton

Tom Zanussi writes:
>
> OK, if we got rid of the padding counts and commit counts and let the
> client manage those, we can simplify the buffer switch slow path and
> make the API simpler in the process. Here's a first proposal for
> doing that - I won't know until I actually do it what snags I may run
> into, but if this looks like the right direction to go, I'll go ahead
> with it...
>

And here's a patch to update the Documentation...

diff -urpN -X dontdiff linux-2.6.13-rc3-mm1/Documentation/filesystems/relayfs.txt linux-2.6.13-rc3-mm1-cur/Documentation/filesystems/relayfs.txt
--- linux-2.6.13-rc3-mm1/Documentation/filesystems/relayfs.txt 2005-07-16 11:47:32.000000000 -0500
+++ linux-2.6.13-rc3-mm1-cur/Documentation/filesystems/relayfs.txt 2005-07-23 12:50:46.000000000 -0500
@@ -23,6 +23,47 @@ This document provides an overview of th

the function parameters are documented along with the functions in the
filesystem code - please see that for details.

+Semantics
+=========
+
+Each relayfs channel has one buffer per CPU, each buffer has one or
+more sub-buffers. Messages are written to the first sub-buffer until
+it is too full to contain a new message, in which case the message is
+written to the next (if available). Messages are never split across
+sub-buffers. At this point, userspace can be notified so it empties
+the first sub-buffer, while the kernel continues writing to the next.
+
+When notified that a sub-buffer is full, the kernel knows how many
+bytes of it are padding i.e. unused. Userspace can use this knowledge
+to copy only valid data.
+
+After copying it, userspace can notify the kernel that a sub-buffer
+has been consumed.

+
+relayfs can operate in a mode where it will overwrite data not yet
+collected by userspace, and not wait for it to consume it.
+
+relayfs itself does not provide for communication of such data between
+userspace and kernel, allowing the kernel side to remain simple and not
+impose a single interface on userspace. It does provide a separate
+helper though, described below.
+

+klog, relay-app & librelay
+==========================
+
+relayfs itself is ready to use, but to make things easier, two
+additional systems are provided. klog is a simple wrapper to make
+sending data to a channel simpler, regardless of whether a channel to
+write to exists or not. relay-app is the kernel counterpart of
+userspace librelay.c; combined, these two files provide glue to easily
+stream data, without having to bother with housekeeping.
+
+It is possible to use relayfs without relay-app & librelay, but you'll
+have to implement communication between userspace and kernel, allowing
+both to convey the state of buffers (full, empty, amount of padding).
+

+klog, relay-app and librelay can be found in the relay-apps tarball on
+http://relayfs.sourceforge.net

The relayfs user space API
==========================

@@ -34,7 +75,8 @@ available and some comments regarding th

open() enables user to open an _existing_ buffer.

mmap() results in channel buffer being mapped into the caller's
- memory space.
+ memory space. Note that you can't do a partial mmap - you must
+ map the entire file, which is NRBUF * SUBBUFSIZE.

poll() POLLIN/POLLRDNORM/POLLERR supported. User applications are
notified when sub-buffer boundaries are crossed.

@@ -63,13 +105,15 @@ Here's a summary of the API relayfs prov
channel management functions:

relay_open(base_filename, parent, subbuf_size, n_subbufs,
- overwrite, callbacks)
+ callbacks)
relay_close(chan)
relay_flush(chan)
relay_reset(chan)
relayfs_create_dir(name, parent)
relayfs_remove_dir(dentry)
- relay_commit(buf, reserved, count)

+
+ channel management typically called on instigation of userspace:
+
relay_subbufs_consumed(chan, cpu, subbufs_consumed)

write functions:

@@ -77,19 +121,22 @@ Here's a summary of the API relayfs prov
relay_write(chan, data, length)
__relay_write(chan, data, length)
relay_reserve(chan, length)
+ __relay_reserve(buf, length)

callbacks:

- subbuf_start(buf, subbuf, prev_subbuf_idx, prev_subbuf)
- deliver(buf, subbuf_idx, subbuf)
+ subbuf_start(buf, subbuf, prev_subbuf, prev_padding)
buf_mapped(buf, filp)
buf_unmapped(buf, filp)
- buf_full(buf, subbuf_idx)

+ helper functions:
+
+ relay_buf_full(buf)
+ subbuf_start_reserve(buf, length)

-A relayfs channel is made of up one or more per-cpu channel buffers,
-each implemented as a circular buffer subdivided into one or more
-sub-buffers.
+

+Creating a channel
+------------------

relay_open() is used to create a channel, along with its per-cpu
channel buffers. Each channel buffer will have an associated file

@@ -117,30 +164,106 @@ though, it's safe to assume that having
idea - you're guaranteed to either overwrite data or lose events
depending on the channel mode being used.

-relayfs channels can be opened in either of two modes - 'overwrite' or
-'no-overwrite'. In overwrite mode, writes continuously cycle around
-the buffer and will never fail, but will unconditionally overwrite old
-data regardless of whether it's actually been consumed. In
-no-overwrite mode, writes will fail i.e. data will be lost, if the
+Channel 'modes'
+---------------
+
+relayfs channels can be used in either of two modes - 'overwrite' or
+'no-overwrite'. The mode is entirely determined by the implementation
+of the subbuf_start() callback, as described below. In 'overwrite'
+mode, also known as 'flight recorder' mode, writes continuously cycle
+around the buffer and will never fail, but will unconditionally
+overwrite old data regardless of whether it's actually been consumed.
+In no-overwrite mode, writes will fail i.e. data will be lost, if the

number of unconsumed sub-buffers equals the total number of
-sub-buffers in the channel. In this mode, the client is reponsible
-for notifying relayfs when sub-buffers have been consumed via
-relay_subbufs_consumed(). A full buffer will become 'unfull' and
-logging will continue once the client calls relay_subbufs_consumed()
-again. When a buffer becomes full, the buf_full() callback is invoked
-to notify the client. In both modes, the subbuf_start() callback will
-notify the client whenever a sub-buffer boundary is crossed. This can
-be used to write header information into the new sub-buffer or fill in
-header information reserved in the previous sub-buffer. One piece of
-information that's useful to save in a reserved header slot is the
-number of bytes of 'padding' for a sub-buffer, which is the amount of
-unused space at the end of a sub-buffer. The padding count for each
-sub-buffer is contained in an array in the rchan_buf struct passed
-into the subbuf_start() callback: rchan_buf->padding[prev_subbuf_idx]
-can be used to to get the padding for the just-finished sub-buffer.
-subbuf_start() is also called for the first sub-buffer in each channel
-buffer when the channel is created. The mode is specified to
-relay_open() using the overwrite parameter.

+sub-buffers in the channel. It should be clear that if there is no
+consumer or if the consumer can't consume sub-buffers fast enough,
+data will be lost in either case; the only difference is whether data
+is lost from the beginning or the end of a buffer.
+

+As explained above, a relayfs channel is made up of one or more
+per-cpu channel buffers, each implemented as a circular buffer
+subdivided into one or more sub-buffers. Messages are written into
+the current sub-buffer of the channel's current per-cpu buffer via the
+write functions described below. Whenever a message can't fit into
+the current sub-buffer, because there's no room left for it, the
+client is notified via the subbuf_start() callback that a switch to a
+new sub-buffer is about to occur. The client uses this callback to 1)
+initialize the next sub-buffer if appropriate 2) finalize the previous
+sub-buffer if appropriate and 3) return a boolean value indicating
+whether or not to actually go ahead with the sub-buffer switch.
+
+To implement 'no-overwrite' mode, the userspace client would provide
+an implementation of the subbuf_start() callback something like the
+following:
+
+static int subbuf_start(struct rchan_buf *buf,
+                        void *subbuf,
+                        void *prev_subbuf,
+                        unsigned int prev_padding)
+{
+	if (relay_buf_full(buf))
+		return 0;
+
+	if (prev_subbuf)
+		*((unsigned *)prev_subbuf) = prev_padding;
+
+	subbuf_start_reserve(buf, sizeof(unsigned int));
+
+	return 1;
+}
+
+If the current buffer is full i.e. all sub-buffers remain unconsumed,
+the callback returns 0 to indicate that the buffer switch should not
+occur yet i.e. until the consumer has had a chance to read the current
+set of ready sub-buffers. For the relay_buf_full() function to make
+sense, the consumer is responsible for notifying relayfs when
+sub-buffers have been consumed via relay_subbufs_consumed(). Any
+subsequent attempts to write into the buffer will again invoke the
+subbuf_start() callback with the same parameters; only when the
+consumer has consumed one or more of the ready sub-buffers will
+relay_buf_full() return 0, in which case the buffer switch can
+continue.
+
+The implementation of the subbuf_start() callback for 'overwrite' mode
+would be very similar:
+
+static int subbuf_start(struct rchan_buf *buf,
+                        void *subbuf,
+                        void *prev_subbuf,
+                        unsigned int prev_padding)
+{
+	if (prev_subbuf)
+		*((unsigned *)prev_subbuf) = prev_padding;
+
+	subbuf_start_reserve(buf, sizeof(unsigned int));
+
+	return 1;
+}
+
+In this case, the relay_buf_full() check is meaningless and the
+callback always returns 1, causing the buffer switch to occur
+unconditionally. It's also meaningless for the client to use the
+relay_subbufs_consumed() function in this mode, as it's never
+consulted.
+
+Header information can be reserved at the beginning of each sub-buffer
+by calling the subbuf_start_reserve() helper function from within the
+subbuf_start() callback. This reserved area can be used to store
+whatever information the client wants. In the example above, room is
+reserved in each sub-buffer to store the padding count for that
+sub-buffer. This is filled in for the previous sub-buffer in the
+subbuf_start() implementation; the padding value for the previous
+sub-buffer is passed into the subbuf_start() callback along with a
+pointer to the previous sub-buffer, since the padding value isn't
+known until a sub-buffer is filled. The subbuf_start() callback is
+also called for the first sub-buffer when the channel is opened, to
+give the client a chance to reserve space in it. In this case the
+previous sub-buffer pointer passed into the callback will be NULL, so
+the client should check the value of the prev_subbuf pointer before
+writing into the previous sub-buffer.
+
+Writing to a channel
+--------------------

kernel clients write data into the current cpu's channel buffer using
relay_write() or __relay_write(). relay_write() is the main logging

@@ -151,22 +274,31 @@ __relay_write(), which only disables pre
don't return a value, so you can't determine whether or not they
failed - the assumption is that you wouldn't want to check a return
value in the fast logging path anyway, and that they'll always succeed
-unless the buffer is full and in no-overwrite mode, in which case
-you'll be notified via the buf_full() callback.
+unless the buffer is full and no-overwrite mode is being used, in
+which case you can detect a failed write in the subbuf_start()
+callback by calling the relay_buf_full() helper function.

relay_reserve() is used to reserve a slot in a channel buffer which
can be written to later. This would typically be used in applications
that need to write directly into a channel buffer without having to
stage data in a temporary buffer beforehand. Because the actual write
may not happen immediately after the slot is reserved, applications
-using relay_reserve() can call relay_commit() to notify relayfs when
-the slot has actually been written. When all the reserved slots have
-been committed, the deliver() callback is invoked to notify the client
-that a guaranteed full sub-buffer has been produced. Because the
-write is under control of the client and is separated from the
-reserve, relay_reserve() doesn't protect the buffer at all - it's up
-to the client to provide the appropriate synchronization when using
-relay_reserve().
+using relay_reserve() can keep a count of the number of bytes actually
+written, either in space reserved in the sub-buffers themselves or as
+a separate array. See the 'reserve' example in the relay-apps tarball
+at http://relayfs.sourceforge.net for an example of how this can be
+done. Because the write is under control of the client and is
+separated from the reserve, relay_reserve() doesn't protect the buffer
+at all - it's up to the client to provide the appropriate
+synchronization when using relay_reserve().
+
+relay_reserve() uses __relay_reserve() to actually do the reservation;
+__relay_reserve() is also available to clients - in some cases it may
+be more efficient for the client to directly access the buffer, when
+maintaining commit counts, for example.
+
+Closing a channel
+-----------------

The client calls relay_close() when it's finished using the channel.
The channel and its associated buffers are destroyed when there are no
@@ -175,6 +307,9 @@ forces a sub-buffer switch on all the ch
to finalize and process the last sub-buffers before the channel is
closed.

+Misc
+----
+
Some applications may want to keep a channel around and re-use it
rather than open and close a new channel for each use. relay_reset()
can be used for this purpose - it resets a channel to its initial
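
The difference between the two channel 'modes' described in the documentation above comes down to what the subbuf_start() callback returns. The following is a rough userspace model of that decision, not kernel code: struct sim_buf and all sim_* names are invented here, the full test (produced minus consumed reaching n_subbufs) is an assumption based on the documentation's wording rather than on the relay_buf_full() source, and the 'stalled' flag is a modelling shortcut so that a refused switch is not counted twice.

```c
#include <assert.h>

/* Userspace model: only the counters that matter for the full check. */
struct sim_buf {
	unsigned int n_subbufs;	/* sub-buffers in this buffer */
	unsigned int produced;	/* sub-buffers finished by the writer */
	unsigned int consumed;	/* sub-buffers released by the reader */
	int stalled;		/* current sub-buffer already counted */
};

/* Assumed meaning of relay_buf_full(): every sub-buffer holds data
 * that has not been consumed yet. */
static int sim_buf_full(const struct sim_buf *buf)
{
	return buf->produced - buf->consumed >= buf->n_subbufs;
}

/* 'no-overwrite' policy: refuse the switch while the buffer is full. */
static int sim_start_no_overwrite(const struct sim_buf *buf)
{
	return sim_buf_full(buf) ? 0 : 1;
}

/* 'overwrite' (flight recorder) policy: always switch. */
static int sim_start_overwrite(const struct sim_buf *buf)
{
	(void)buf;
	return 1;
}

/* Attempt a sub-buffer switch using the given policy callback.
 * Returns 1 if the writer moved on, 0 if the write has to be dropped. */
static int sim_switch(struct sim_buf *buf,
		      int (*start)(const struct sim_buf *))
{
	if (!buf->stalled)
		buf->produced++;	/* sub-buffer being left is ready */

	if (!start(buf)) {
		buf->stalled = 1;
		return 0;
	}
	buf->stalled = 0;
	return 1;
}

/* Model of the reader reporting progress, as relay_subbufs_consumed()
 * does for a real channel. */
static void sim_consumed(struct sim_buf *buf, unsigned int n)
{
	buf->consumed += n;
}
```

With two sub-buffers and no reader, the no-overwrite policy lets the first switch through and refuses every later one until sim_consumed() is called; the overwrite policy never refuses, so old data is simply written over.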

Karim Yaghmour

Jul 22, 2005, 7:29:20 PM

to Tom Zanussi, Roman Zippel, Steven Rostedt, richard...@uk.ibm.com, va...@us.ibm.com, linux-...@vger.kernel.org, Christoph Hellwig, Andrew Morton

Tom Zanussi wrote:
> - removed the deliver() callback
> - removed the relay_commit() function

This breaks LTT. Any reason why this needed to be removed? In the end,
the code will just end up being duplicated in ltt and all other users.
IOW, this is not some potential future use, but something that's
currently being used.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546

Tom Zanussi

Jul 22, 2005, 10:34:59 PM

to ka...@opersys.com, Tom Zanussi, Roman Zippel, Steven Rostedt, richard...@uk.ibm.com, va...@us.ibm.com, linux-...@vger.kernel.org, Christoph Hellwig, Andrew Morton

Karim Yaghmour writes:
>
> Tom Zanussi wrote:
> > - removed the deliver() callback
> > - removed the relay_commit() function
>
> This breaks LTT. Any reason why this needed to be removed? In the end,
> the code will just end up being duplicated in ltt and all other users.
> IOW, this is not some potential future use, but something that's
> currently being used.

Because I realized that like the padding and commit arrays, they're
not really necessary.

In all the examples, the padding is saved in space reserved at the
beginning of the sub-buffer via subbuf_start_reserve(), except that
now the padding is passed into the subbuf_start() callback rather than
kept in an array. The padding value passed in is then directly saved
in the reserved padding space.

Similarly, in the case of the reserve/commit example, extra space is
also reserved for the commit count using subbuf_start_reserve().
After space for an event is reserved using relay_reserve() and
completely written, the event length is added to that commit value.
In userspace, the sub-buffer reading loop looks at the commit value in
the sub-buffer, and if it matches (sub-buffer size - padding), the
buffer has been completely written and can be saved, otherwise it's
not yet complete and is checked again the next time around. This way,
there's no need for a deliver() callback, the relay_commit() is
replaced with the increment of the reserved commit value, the arrays
aren't needed and you get the same result in the end in a much simpler
way, IMHO.
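
As a rough illustration of that reader-side check, here is a small userspace sketch. The header layout (struct rd_header, with padding followed by commit) and the rd_* function names are hypothetical and chosen only for this example; the thread does not fix a layout. It also assumes the commit count includes the reserved header bytes, so that the comparison against (sub-buffer size - padding) can be made exactly as described.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-sub-buffer header, reserved by the kernel client
 * with subbuf_start_reserve(). The layout is an assumption. */
struct rd_header {
	unsigned int padding;	/* unused bytes at the end */
	unsigned int commit;	/* bytes written so far, header included */
};

/* A sub-buffer is complete once the commit count matches
 * (sub-buffer size - padding). */
static int rd_subbuf_complete(const void *subbuf, unsigned int subbuf_size)
{
	struct rd_header h;

	assert(subbuf_size >= sizeof(h));
	memcpy(&h, subbuf, sizeof(h));	/* avoid unaligned access */

	return h.commit == subbuf_size - h.padding;
}

/* Number of event bytes that follow the header in a complete
 * sub-buffer, i.e. what the reader would save. */
static size_t rd_payload_len(const void *subbuf, unsigned int subbuf_size)
{
	struct rd_header h;

	assert(subbuf_size >= sizeof(h));
	memcpy(&h, subbuf, sizeof(h));

	return subbuf_size - h.padding - sizeof(h);
}
```

A reading loop would call rd_subbuf_complete() on each ready sub-buffer and skip any that are still being written, coming back to them on the next pass.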

But if you see a problem with it or have any suggestions to make it
better/different, please let me know...

Tom

Christoph Hellwig

Jul 23, 2005, 2:54:22 PM

to Paul Jackson, bert hubert, ros...@goodmis.org, relayf...@lists.sourceforge.net, richard...@uk.ibm.com, va...@us.ibm.com, ka...@opersys.com, linux-...@vger.kernel.org, zan...@us.ibm.com

On Fri, Jul 22, 2005 at 01:01:32PM -0700, Paul Jackson wrote:

> Another vote in favor of relayfs here ...
>
> I am reminded by my good colleagues at SGI that relayfs is a key
> to the Linux Trace Toolkit (LTT), which is in turn an important
> technology for some product(s) on which SGI is working.

I don't think anyone cares for product plans of particular companies.
That being said I wish LTT folks would make a little more progress so
we could actually include it.

Christoph Hellwig

Jul 23, 2005, 2:56:51 PM

to bert hubert, Paul Jackson, ros...@goodmis.org, relayf...@lists.sourceforge.net, richard...@uk.ibm.com, va...@us.ibm.com, ka...@opersys.com, linux-...@vger.kernel.org, zan...@us.ibm.com

On Fri, Jul 22, 2005 at 10:33:21PM +0200, bert hubert wrote:
> On Fri, Jul 22, 2005 at 01:01:32PM -0700, Paul Jackson wrote:
> > Another vote in favor of relayfs here ...
>
> At OLS the 'SystemTAP' idea was presented, which has been partially
> implemented already, and it builds on relayfs as well. It dovetails nicely
> with kprobes.

And what exactly is this systemtap thing supposed to be? And why the
heck do they announce it at some conference and we should suddenly care
about it?

Karim Yaghmour

Jul 25, 2005, 7:56:58 PM

to Christoph Hellwig, Paul Jackson, bert hubert, ros...@goodmis.org, relayf...@lists.sourceforge.net, richard...@uk.ibm.com, va...@us.ibm.com, linux-...@vger.kernel.org, zan...@us.ibm.com

Christoph Hellwig wrote:
> That being said I wish LTT folks would make a little more progress so
> we could actually include it.

We're working on it. On the topic of revamping LTT, 3 different people
came up with 3 different implementations.

Following your feedback on the patch I sent a few weeks back, I headed
out asking myself "what is the bare-minimum tracing functionality that
will actually fly while still being flexible enough to add to it?" I
spent some time at the OLS comparing notes with others interested in this
area, and I think we've got something that should fit the bill. We should
be able to post something sooner rather than later.

Now if only I could remember what I talked about after I left the Black
Thorn at 2h45am and the guy in the elevator at Les Suites pressed on a
button and said "'M' for more beer" ...

Thanks,

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546

Karim Yaghmour

Jul 25, 2005, 10:43:20 PM

to Tom Zanussi, Roman Zippel, Steven Rostedt, richard...@uk.ibm.com, va...@us.ibm.com, linux-...@vger.kernel.org, Christoph Hellwig, Andrew Morton

Tom Zanussi wrote:
> In userspace, the sub-buffer reading loop looks at the commit value in
> the sub-buffer, and if it matches (sub-buffer size - padding), the
> buffer has been completely written and can be saved, otherwise it's
> not yet complete and is checked again the next time around. This way,
> there's no need for a deliver() callback, the relay_commit() is
> replaced with the increment of the reserved commit value, the arrays
> aren't needed and you get the same result in the end in a much simpler
> way, IMHO.

Actually this has a much greater potential of losing buffers because
we have to poll the buffer for completion. Seen another way, the kernel-
side has got to wait until the user-side has "figured out" that it needs
to commit content to disk. As it was originally, it was relatively
straightforward to determine why data was lost: ok, we've signaled it
from kernel space, but the daemon never flushed it out. Without commit/
deliver, things are much less clear, and I still miss what gain we
are making by removing them.

I would very much like to see the commit/deliver functionality back.
Such mechanisms are required for any sane producer-consumer model.

Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546

bert hubert

Jul 26, 2005, 1:17:29 AM

to Karim Yaghmour, Christoph Hellwig, Paul Jackson, ros...@goodmis.org, relayf...@lists.sourceforge.net, richard...@uk.ibm.com, va...@us.ibm.com, linux-...@vger.kernel.org, zan...@us.ibm.com

On Mon, Jul 25, 2005 at 07:47:45PM -0400, Karim Yaghmour wrote:

> Now if only I could remember what I talked about after I left the Black
> Thorn at 2h45am and the guy in the elevator at Les Suites pressed on a
> button and said "'M' for more beer" ...

I bet it involved 'M' for more markers, Karim :-)

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services