Synchronous signal delivery..

Linus Torvalds

Feb 13, 2003, 3:00:16 PM

Ok,
I talked about select() and signal races in Australia at IBM (forget who
exactly was involved so I just cc'd the usual suspects), and here's a
prototype example of what I suggested as a potential solution.

[ Warning: this goes on top of the current BK tree with all the signal
fixes and the generalized "dequeue_signal()" stuff.

Also note that it is a prototype, ie the packet returned to "read()" is
obviously not the full info, and there may be other issues with this. ]

It's a generic "synchronous signal delivery" method, and it uses a
perfectly regular file descriptor to deliver an arbitrary set of signals
that are pending.

It adds one new system call:

fd = sigfd(sigset_t * mask, unsigned long flags);

which creates a file descriptor that is associated with the particular
thread that created it, and the particular signal mask that the user was
interested in. That file descriptor can be passed around all the normal
ways: it can be dup()'ed, given to somebody else with an AF_UNIX socket,
and obviously read() and select()/poll()'ed upon.

So you can have a process that does a sigfd(), forks, and the child can
then listen in on the specified signals of the parent, for example.

NOTE! For it to be useful, the signals that you want to wait on and read()
using the file descriptor obviously have to be blocked, otherwise they'll
just be delivered the old-fashioned way.
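
A minimal sketch of why the blocking matters, using only standard POSIX
calls rather than the prototype: a blocked signal is held pending instead
of being delivered, and that pending state is what a sigfd reader would
pick it up from. The helper name is made up for illustration.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdbool.h>

/*
 * Hypothetical helper: block signo, raise it, and report whether it ended
 * up pending instead of being delivered. The signal stays blocked on return.
 */
bool blocked_signal_stays_pending(int signo)
{
	sigset_t mask, pending;
	int got = 0;
	bool was_pending;

	sigemptyset(&mask);
	sigaddset(&mask, signo);
	if (sigprocmask(SIG_BLOCK, &mask, NULL) < 0)
		return false;

	raise(signo);	/* blocked, so it is queued rather than delivered */

	if (sigpending(&pending) < 0)
		return false;
	was_pending = sigismember(&pending, signo) == 1;

	/* Drain it synchronously so nothing is left behind. */
	if (was_pending && sigwait(&mask, &got) != 0)
		return false;
	return was_pending && got == signo;
}
```

Without the sigprocmask() call the default action (or a handler) would
run as soon as raise() is called, and there would be nothing left to read.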

Here's a trivial example program using the new system call:

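#include <stdio.h>
#include <signal.h>
#include <unistd.h>	/* read() */

/* The system call number this patch assigns to sigfd() on i386 */
#define __NR_syscall (259)

/*
 * Raw i386 system call through the vsyscall page. This only means
 * anything on a kernel that has this patch applied.
 */
int sigfd(sigset_t *mask, unsigned long flags)
{
	void *vsyscall = (void *) 0xffffe000;
	int ret;
	asm volatile("call *%1"
		:"=a" (ret)
		:"m" (vsyscall),
		 "0" (__NR_syscall),
		 "b" (mask),
		 "c" (flags));
	return ret;
}

int main(void)
{
	sigset_t mask;
	int fd, len;
	char buffer[1024];

	/* Block everything, so the signals stay pending for read() */
	sigfillset(&mask);
	sigprocmask(SIG_BLOCK, &mask, NULL);

	fd = sigfd(&mask, 0);
	printf("sigfd returns %d\n", fd);

	/* The prototype returns just the signal number, as an int */
	len = read(fd, buffer, sizeof(buffer));
	printf("read returns %d (signal %d)\n", len, *(int *)buffer);
	return 0;
}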

Just run it, and send the process signals to see what happens.

The above example program is obviously totally useless, and any real use
would have to expand the implementation by adding the full siginfo to
the packet read (which is trivial apart from deciding on what format to
use - it would be good to not have it be architecture-dependent and in
particular it would be horrible to have different formats for different
compatibility layers).

Any real use would also probably be a select() or poll() loop.
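
A sketch of what such a loop could look like. The prototype's sigfd() only
exists with this patch applied, so the sketch substitutes signalfd(2), the
interface mainline Linux later gained for the same job. It is not the call
proposed here: it takes different arguments and returns a struct
signalfd_siginfo rather than a bare int. The helper names are made up.

```c
#define _GNU_SOURCE
#include <poll.h>
#include <signal.h>
#include <sys/signalfd.h>
#include <unistd.h>

/*
 * Hypothetical helper: block the signals in mask and return a descriptor
 * that becomes readable whenever one of them is pending. -1 on error.
 */
int open_signal_fd(const sigset_t *mask)
{
	if (sigprocmask(SIG_BLOCK, mask, NULL) < 0)
		return -1;
	return signalfd(-1, mask, SFD_CLOEXEC);
}

/*
 * One turn of a poll() loop. A real program would put its sockets and
 * pipes in the same pollfd array next to the signal descriptor.
 * Returns the signal number, 0 on timeout, -1 on error.
 */
int wait_signal(int sfd, int timeout_ms)
{
	struct pollfd pfd = { .fd = sfd, .events = POLLIN };
	struct signalfd_siginfo si;
	int n;

	n = poll(&pfd, 1, timeout_ms);
	if (n <= 0)
		return n;	/* 0: timeout, -1: error */
	if (read(sfd, &si, sizeof(si)) != sizeof(si))
		return -1;
	return (int)si.ssi_signo;
}
```

The signal is consumed by the read(), so there is no window between
checking for it and going to sleep, which is the race the handler-based
approach has with select().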

Linus

---
# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.1053 -> 1.1054
# include/linux/init_task.h 1.21 -> 1.22
# include/linux/sched.h 1.131 -> 1.132
# kernel/fork.c 1.104 -> 1.105
# arch/i386/kernel/entry.S 1.54 -> 1.55
# fs/exec.c 1.70 -> 1.71
# kernel/signal.c 1.71 -> 1.72
# fs/Makefile 1.56 -> 1.57
# include/asm-i386/unistd.h 1.23 -> 1.24
# (new) -> 1.1 fs/sigfd.c
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 03/02/13 torv...@home.transmeta.com 1.1054
# Add support for "sigfd()" system call, that allows you to get signals
# through a synchronous system call interface instead of interrupting the
# process asynchronously.
#
# This allows things like adding certain signal masks to poll() or select()
# loops without the normal races or special cases.
# --------------------------------------------
#
diff -Nru a/arch/i386/kernel/entry.S b/arch/i386/kernel/entry.S
--- a/arch/i386/kernel/entry.S Thu Feb 13 11:19:23 2003
+++ b/arch/i386/kernel/entry.S Thu Feb 13 11:19:23 2003
@@ -801,6 +801,7 @@
.long sys_epoll_wait
.long sys_remap_file_pages
.long sys_set_tid_address
+ .long sys_sigfd

.rept NR_syscalls-(.-sys_call_table)/4
diff -Nru a/fs/Makefile b/fs/Makefile
--- a/fs/Makefile Thu Feb 13 11:19:23 2003
+++ b/fs/Makefile Thu Feb 13 11:19:23 2003
@@ -10,7 +10,8 @@
namei.o fcntl.o ioctl.o readdir.o select.o fifo.o locks.o \
dcache.o inode.o attr.o bad_inode.o file.o dnotify.o \
filesystems.o namespace.o seq_file.o xattr.o libfs.o \
- fs-writeback.o mpage.o direct-io.o aio.o eventpoll.o
+ fs-writeback.o mpage.o direct-io.o aio.o eventpoll.o \
+ sigfd.o

obj-$(CONFIG_COMPAT) += compat.o

diff -Nru a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c Thu Feb 13 11:19:23 2003
+++ b/fs/exec.c Thu Feb 13 11:19:23 2003
@@ -578,6 +578,7 @@
return -ENOMEM;

spin_lock_init(&newsighand->siglock);
+ init_waitqueue_head(&newsighand->waiting);
atomic_set(&newsighand->count, 1);
memcpy(newsighand->action, oldsighand->action, sizeof(newsighand->action));

diff -Nru a/fs/sigfd.c b/fs/sigfd.c
--- /dev/null Wed Dec 31 16:00:00 1969
+++ b/fs/sigfd.c Thu Feb 13 11:19:23 2003
@@ -0,0 +1,327 @@
+/*
+ * linux/fs/sigfd.c
+ *
+ * Copyright (C) 2003 Linus Torvalds
+ */
+
+#include <linux/file.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/mount.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/signal.h>
+
+#include <asm/uaccess.h>
+
+#define SIGFD_MAGIC (0xa01eaf86) /* Random number, no meaning. No, really! */
+
+struct sigfd_inode {
+ sigset_t sigmask;
+ struct task_struct *tsk;
+ struct sighand_struct *sighand;
+ struct inode vfs_inode;
+};
+
+static inline struct sigfd_inode *SIGFD_I(struct inode *inode)
+{
+ return container_of(inode, struct sigfd_inode, vfs_inode);
+}
+
+
+static kmem_cache_t * sigfd_inode_cachep;
+
+static struct inode *sigfd_alloc_inode(struct super_block *sb)
+{
+ struct sigfd_inode *ei;
+ ei = (struct sigfd_inode *)kmem_cache_alloc(sigfd_inode_cachep, SLAB_KERNEL);
+ if (!ei)
+ return NULL;
+ return &ei->vfs_inode;
+}
+
+static void sigfd_destroy_inode(struct inode *inode)
+{
+ kmem_cache_free(sigfd_inode_cachep, SIGFD_I(inode));
+}
+
+static void init_once(void * foo, kmem_cache_t * cachep, unsigned long flags)
+{
+ struct sigfd_inode *ei = (struct sigfd_inode *) foo;
+
+ if ((flags & (SLAB_CTOR_VERIFY|SLAB_CTOR_CONSTRUCTOR)) == SLAB_CTOR_CONSTRUCTOR)
+ inode_init_once(&ei->vfs_inode);
+}
+
+/*
+ * This needs to have some standardized structure return value..
+ */
+static ssize_t sigfd_read(struct file *filp, char *buf, size_t count, loff_t *ppos)
+{
+ struct sigfd_inode *ei = SIGFD_I(filp->f_dentry->d_inode);
+ struct task_struct *tsk = ei->tsk;
+ struct sighand_struct *sighand = ei->sighand;
+ DECLARE_WAITQUEUE(wait, current);
+ int signr;
+ siginfo_t info;
+
+ for (;;) {
+ spin_lock_irq(&sighand->siglock);
+ signr = 0;
+ if (sighand != tsk->sighand)
+ break;
+ signr = dequeue_signal(tsk, &ei->sigmask, &info);
+ if (signr)
+ break;
+
+ add_wait_queue(&sighand->waiting, &wait);
+ set_task_state(tsk, TASK_INTERRUPTIBLE);
+ spin_unlock_irq(&sighand->siglock);
+ schedule();
+ remove_wait_queue(&sighand->waiting, &wait);
+ if (signal_pending(current))
+ return -ERESTARTNOHAND;
+ }
+ spin_unlock_irq(&sighand->siglock);
+ if (count > sizeof(int))
+ count = sizeof(int);
+ count -= copy_to_user(buf, &signr, count);
+ if (!count)
+ count = -EFAULT;
+ return count;
+}
+
+static unsigned int sigfd_poll(struct file *filp, poll_table *wait)
+{
+ struct sigfd_inode *ei = SIGFD_I(filp->f_dentry->d_inode);
+ struct task_struct *tsk = ei->tsk;
+ struct sighand_struct *sighand = ei->sighand;
+ unsigned long *mask, *shared, *private;
+ int i;
+
+ poll_wait(filp, &sighand->waiting, wait);
+
+ if (sighand != tsk->sighand)
+ return 0;
+
+ mask = ei->sigmask.sig;
+ shared = tsk->signal->shared_pending.signal.sig;
+ private = tsk->pending.signal.sig;
+ for (i = 0; i < _NSIG_WORDS; i++) {
+ if (*mask & (*shared | *private))
+ return POLLIN;
+ mask++; shared++; private++;
+ }
+ return 0;
+}
+
+static int sigfd_release(struct inode *inode, struct file *filp)
+{
+ struct sigfd_inode *ei = SIGFD_I(inode);
+ struct sighand_struct *sighand = ei->sighand;
+
+ if (atomic_dec_and_test(&sighand->count))
+ kmem_cache_free(sighand_cachep, sighand);
+ put_task_struct(ei->tsk);
+ return 0;
+}
+
+static struct file_operations sigfd_fops = {
+ .read = sigfd_read,
+ .poll = sigfd_poll,
+ .release = sigfd_release,
+};
+
+static int sigfd_delete_dentry(struct dentry *dentry)
+{
+ return 1;
+}
+
+static struct dentry_operations sigfd_dentry_operations = {
+ .d_delete = sigfd_delete_dentry,
+};
+
+static struct vfsmount *sigfd_mnt;
+
+static struct inode * get_sigfd_inode(void)
+{
+ struct inode *inode = new_inode(sigfd_mnt->mnt_sb);
+
+ if (inode) {
+ /* This is what we do. */
+ inode->i_fop = &sigfd_fops;
+
+ /*
+ * Mark the inode dirty from the very beginning,
+ * that way it will never be moved to the dirty
+ * list because "mark_inode_dirty()" will think
+ * that it already _is_ on the dirty list.
+ */
+ inode->i_state = I_DIRTY;
+ inode->i_mode = S_IFIFO | S_IRUSR | S_IWUSR;
+ inode->i_uid = current->fsuid;
+ inode->i_gid = current->fsgid;
+ inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+ inode->i_blksize = PAGE_SIZE;
+ }
+ return inode;
+}
+
+/*
+ * Create a file descriptor that is associated with our signal
+ * state. We can pass it around to others if we want to, but
+ * it will always be _our_ signal state.
+ */
+asmlinkage long sys_sigfd(sigset_t *user_mask, unsigned long flags)
+{
+ struct qstr this;
+ char name[32];
+ struct dentry *dentry;
+ struct file *file;
+ struct inode *inode;
+ struct sigfd_inode *ei;
+ sigset_t sigmask;
+ int error, fd;
+
+ error = -EINVAL;
+ if (copy_from_user(&sigmask, user_mask, sizeof(sigmask)))
+ goto badmask;
+ sigdelsetmask(&sigmask, sigmask(SIGKILL) | sigmask(SIGSTOP));
+
+ /*
+ * The user gives us the signals he is interested in,
+ * but the signal code is geared towards the signals
+ * that are blocked and thus _not_ interesting. Switch
+ * the meaning here.
+ */
+ signotset(&sigmask);
+
+ error = -ENFILE;
+ file = get_empty_filp();
+ if (!file)
+ goto no_files;
+ inode = get_sigfd_inode();
+ if (!inode)
+ goto close_file;
+ error = get_unused_fd();
+ if (error < 0)
+ goto close_inode;
+ fd = error;
+
+ error = -ENOMEM;
+ sprintf(name, "[%lu]", inode->i_ino);
+ this.name = name;
+ this.len = strlen(name);
+ this.hash = inode->i_ino; /* will go */
+ dentry = d_alloc(sigfd_mnt->mnt_sb->s_root, &this);
+ if (!dentry)
+ goto close_fd;
+
+ dentry->d_op = &sigfd_dentry_operations;
+ d_add(dentry, inode);
+ file->f_vfsmnt = mntget(mntget(sigfd_mnt));
+ file->f_dentry = dget(dentry);
+
+ file->f_pos = 0;
+ file->f_flags = O_RDONLY;
+ file->f_op = &sigfd_fops;
+ file->f_mode = FMODE_READ;
+ file->f_version = 0;
+
+ /* sigfd state */
+ ei = SIGFD_I(inode);
+ ei->sigmask = sigmask;
+ get_task_struct(current);
+ ei->tsk = current;
+
+ /*
+ * We also increment the sighand count to make sure
+ * it doesn't go away from us in poll() when the task
+ * exits (which can happen if the fd is passed to
+ * another process with unix domain sockets.
+ *
+ * This also guarantees that an execve() will reallocate
+ * the signal state, and thus avoids security concerns
+ * with a untrusted process that passes off the signal
+ * queue fd to another, and then does a suid execve.
+ */
+ ei->sighand = current->sighand;
+ atomic_inc(&ei->sighand->count);
+
+ /* Ok, return it! */
+ fd_install(fd, file);
+ return fd;
+
+close_fd:
+ put_unused_fd(fd);
+close_inode:
+ iput(inode);
+close_file:
+ put_filp(file);
+no_files:
+badmask:
+ return error;
+}
+
+static struct super_operations sigfd_ops = {
+ .alloc_inode = sigfd_alloc_inode,
+ .destroy_inode = sigfd_destroy_inode,
+ .statfs = simple_statfs,
+};
+
+static struct super_block *sigfd_get_sb(struct file_system_type *fs_type,
+ int flags, char *dev_name, void *data)
+{
+ return get_sb_pseudo(fs_type, "sigfd:", &sigfd_ops, SIGFD_MAGIC);
+}
+
+static struct file_system_type sigfd_fs_type = {
+ .name = "sigfd",
+ .get_sb = sigfd_get_sb,
+ .kill_sb = kill_anon_super,
+};
+
+static int __init init_sigfd(void)
+{
+ int err;
+
+ sigfd_inode_cachep = kmem_cache_create("sigfd_inode_cache",
+ sizeof(struct sigfd_inode),
+ 0, SLAB_HWCACHE_ALIGN,
+ init_once, NULL);
+
+ err = -ENOMEM;
+ if (!sigfd_inode_cachep)
+ goto cachep_failed;
+
+ err = register_filesystem(&sigfd_fs_type);
+ if (err)
+ goto registration_failed;
+
+ sigfd_mnt = kern_mount(&sigfd_fs_type);
+ err = PTR_ERR(sigfd_mnt);
+ if (IS_ERR(sigfd_mnt))
+ goto mount_failed;
+
+ /* All done */
+ return 0;
+
+mount_failed:
+ unregister_filesystem(&sigfd_fs_type);
+registration_failed:
+ kmem_cache_destroy(sigfd_inode_cachep);
+cachep_failed:
+ return err;
+}
+
+static void __exit exit_sigfd(void)
+{
+ kmem_cache_destroy(sigfd_inode_cachep);
+ unregister_filesystem(&sigfd_fs_type);
+ mntput(sigfd_mnt);
+}
+
+module_init(init_sigfd)
+module_exit(exit_sigfd)
diff -Nru a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
--- a/include/asm-i386/unistd.h Thu Feb 13 11:19:23 2003
+++ b/include/asm-i386/unistd.h Thu Feb 13 11:19:23 2003
@@ -264,6 +264,7 @@
#define __NR_epoll_wait 256
#define __NR_remap_file_pages 257
#define __NR_set_tid_address 258
+#define __NR_sigfd 259

/* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
diff -Nru a/include/linux/init_task.h b/include/linux/init_task.h
--- a/include/linux/init_task.h Thu Feb 13 11:19:23 2003
+++ b/include/linux/init_task.h Thu Feb 13 11:19:23 2003
@@ -52,6 +52,7 @@
.count = ATOMIC_INIT(1), \
.action = { {{0,}}, }, \
.siglock = SPIN_LOCK_UNLOCKED, \
+ .waiting = __WAIT_QUEUE_HEAD_INITIALIZER(sighand.waiting), \
}

/*
diff -Nru a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h Thu Feb 13 11:19:23 2003
+++ b/include/linux/sched.h Thu Feb 13 11:19:23 2003
@@ -224,6 +224,7 @@
atomic_t count;
struct k_sigaction action[_NSIG];
spinlock_t siglock;
+ wait_queue_head_t waiting;
};

/*
diff -Nru a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c Thu Feb 13 11:19:23 2003
+++ b/kernel/fork.c Thu Feb 13 11:19:23 2003
@@ -676,6 +676,7 @@
if (!sig)
return -1;
spin_lock_init(&sig->siglock);
+ init_waitqueue_head(&sig->waiting);
atomic_set(&sig->count, 1);
memcpy(sig->action, current->sighand->action, sizeof(sig->action));
return 0;
diff -Nru a/kernel/signal.c b/kernel/signal.c
--- a/kernel/signal.c Thu Feb 13 11:19:23 2003
+++ b/kernel/signal.c Thu Feb 13 11:19:23 2003
@@ -758,6 +758,7 @@
if (LEGACY_QUEUE(&t->pending, sig))
return 0;

+ wake_up(&t->sighand->waiting);
ret = send_signal(sig, info, &t->pending);
if (!ret && !sigismember(&t->blocked, sig))
signal_wake_up(t, sig == SIGKILL);
@@ -849,6 +850,7 @@
* We always use the shared queue for process-wide signals,
* to avoid several races.
*/
+ wake_up(&p->sighand->waiting);
ret = send_signal(sig, info, &p->signal->shared_pending);
if (unlikely(ret))
return ret;


Davide Libenzi

Feb 13, 2003, 3:20:13 PM

On Thu, 13 Feb 2003, Linus Torvalds wrote:

> It's a generic "synchronous signal delivery" method, and it uses a
> perfectly regular file descriptor to deliver an arbitrary set of signals
> that are pending.
>
> It adds one new system call:
>
> fd = sigfd(sigset_t * mask, unsigned long flags);
>
> which creates a file descriptor that is associated with the particular
> thread that created it, and the particular signal mask that the user was
> interested in. That file descriptor can be passed around all the normal
> ways: it can be dup()'ed, given to somebody else with an AF_UNIX socket,
> and obviously read() and select()/poll()'ed upon.

That's really nice, I like file-based interfaces. No plan to have a way to
change the sig-mask ? Close and reopen ?
What do you think about having timers through a file interface ?

> Any real use would also probably be a select() or poll() loop.

And since it supports ->poll(), epoll.

- Davide

Linus Torvalds

Feb 13, 2003, 3:40:04 PM

On Thu, 13 Feb 2003, Davide Libenzi wrote:
>
> That's really nice, I like file-based interfaces. No plan to have a way to
> change the sig-mask ? Close and reopen ?

You can have multiple fd's open at the same time, which is a lot more
convenient.
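
A sketch of the "several fds, one mask each" idea, again with mainline's
later signalfd(2) standing in for the prototype sigfd(). The helper names
are made up, and each descriptor here covers a single signal only to keep
the example short.

```c
#define _GNU_SOURCE
#include <poll.h>
#include <signal.h>
#include <sys/signalfd.h>
#include <unistd.h>

/* Hypothetical helper: block signo and return an fd reporting only it. */
int open_one_signal_fd(int signo)
{
	sigset_t mask;

	sigemptyset(&mask);
	sigaddset(&mask, signo);
	if (sigprocmask(SIG_BLOCK, &mask, NULL) < 0)
		return -1;
	return signalfd(-1, &mask, SFD_CLOEXEC);
}

/* Nonzero if fd has a signal waiting right now; never blocks. */
int signal_fd_ready(int fd)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };

	return poll(&pfd, 1, 0) == 1 && (pfd.revents & POLLIN) != 0;
}
```

Each descriptor keeps the mask it was created with, so different parts of
a program can each hold one without being able to affect the others.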

I will _not_ add a fcntl-like or ioctl interface to this. They are ugly
_and_ there are security issues (ie if you pass on the fd to somebody
else, they must not be able to change the signal mask _you_ specified).

> What do you think about having timers through a file interface ?

One of the reasons for the "flags" field (which is now unused) was because
I thought it might have extensions for things like alarms etc.

Linus

Paul P Komkoff Jr

Feb 13, 2003, 3:50:12 PM

Replying to Linus Torvalds:

> It's a generic "synchronous signal delivery" method, and it uses a
> perfectly regular file descriptor to deliver an arbitrary set of signals
> that are pending.

The one piece of functionality I miss way too much in Linux (compared to
win32) is the FindFirstChangeNotification and ReadDirectoryChangesW thing.

These functions have one nice purpose: we can watch a directory hierarchy
for changes in an efficient way. E.g. AFAIK via dnotify I can only see that
a directory was changed, but cannot actually get all the changes. If I
re-read the whole directory, I can miss some changes (if another process is
tampering with this dir too).

With ReadDirectoryChangesW I can read all the changes that happened in the
watched hierarchy by doing a sequence of, probably blocking, reads from some
handle, and each read will return some action/event "description" (e.g.
"created file a; renamed file a to file b; etc").
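
For comparison, a sketch of the same read-events-from-a-handle pattern
using inotify, which mainline Linux gained after this thread. It is shown
only as an illustration of the semantics described above and is not what
was proposed here. Note that an inotify watch covers one directory, not a
whole hierarchy. The struct and function names are made up, and the
scratch directory under /tmp is an assumption.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Hypothetical reader state: the fd plus the unread part of the last read(). */
struct dir_watch {
	int fd;
	size_t pos, len;
	char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
};

int dir_watch_open(struct dir_watch *w, const char *dir)
{
	w->pos = w->len = 0;
	w->fd = inotify_init1(IN_CLOEXEC);
	if (w->fd < 0)
		return -1;
	if (inotify_add_watch(w->fd, dir, IN_CREATE | IN_DELETE |
			      IN_MOVED_FROM | IN_MOVED_TO) < 0) {
		close(w->fd);
		return -1;
	}
	return 0;
}

/* Return the next event, blocking in read() when the buffer is used up. */
const struct inotify_event *dir_watch_next(struct dir_watch *w)
{
	const struct inotify_event *ev;

	if (w->pos == w->len) {
		ssize_t n = read(w->fd, w->buf, sizeof(w->buf));
		if (n <= 0)
			return NULL;
		w->pos = 0;
		w->len = (size_t)n;
	}
	ev = (const struct inotify_event *)(w->buf + w->pos);
	w->pos += sizeof(*ev) + ev->len;	/* records are variable length */
	return ev;
}

/* Create "a" in a scratch directory, rename it to "b", check the events. */
int dir_watch_demo(void)
{
	char dir[] = "/tmp/dirwatchXXXXXX";
	char a[64], b[64];
	struct dir_watch w;
	const struct inotify_event *ev;
	int fd, ok;

	if (!mkdtemp(dir) || dir_watch_open(&w, dir) < 0)
		return -1;
	snprintf(a, sizeof(a), "%s/a", dir);
	snprintf(b, sizeof(b), "%s/b", dir);
	fd = open(a, O_CREAT | O_WRONLY, 0600);
	if (fd < 0)
		return -1;
	close(fd);
	rename(a, b);

	ev = dir_watch_next(&w);
	ok = ev && (ev->mask & IN_CREATE) && strcmp(ev->name, "a") == 0;
	ev = dir_watch_next(&w);
	ok = ok && ev && (ev->mask & IN_MOVED_FROM) && strcmp(ev->name, "a") == 0;
	ev = dir_watch_next(&w);
	ok = ok && ev && (ev->mask & IN_MOVED_TO) && strcmp(ev->name, "b") == 0;

	unlink(b);
	rmdir(dir);
	close(w.fd);
	return ok ? 0 : -1;
}
```

Each event names the entry that changed, so the reader does not have to
re-scan the directory and guess what happened in between.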

I was thinking about a way of implementing this functionality in Linux, by
adding my own syscalls with semantics similar to sigfd.

And, thus, not only signals can be delivered this way. Maybe it is
worth generalizing into some other "abstraction"?

> Any real use would also probably be a select() or poll() loop.

P.S. The kernel already has an almost similar thing for a different
purpose - rtnetlink.

--
Paul P 'Stingray' Komkoff Jr /// (icq)23200764 /// (http)stingr.net
This message represents the official view of the voices in my head

Richard B. Johnson

Feb 13, 2003, 4:30:13 PM

On Thu, 13 Feb 2003, Paul P Komkoff Jr wrote:

> Replying to Linus Torvalds:
> > It's a generic "synchronous signal delivery" method, and it uses a
> > perfectly regular file descriptor to deliver an arbitrary set of signals
> > that are pending.
>
> The one functionality I miss way too much in linux (comparing to win32) is
> FindFirstChangeNotification and ReadDirectoryChangesW thing.
>

But they are __not__ kernel functions! They are part of the "window"
(basically a shell). All the stuff necessary to implement that
on a Unix/Linux system already exists, and has existed since the
birth of Unix, circa 1970.

... and if it were necessary to be implemented, it
WouldNeverHaveSuchAGoddamnLongFunctionName(ever);

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.

Davide Libenzi

Feb 13, 2003, 5:30:09 PM

On Thu, 13 Feb 2003, Linus Torvalds wrote:

>
> On Thu, 13 Feb 2003, Davide Libenzi wrote:
> >
> > That's really nice, I like file-based interfaces. No plan to have a way to
> > change the sig-mask ? Close and reopen ?
>
> You can have multiple fd's open at the same time, which is a lot more
> convenient.
>
> I will _not_ add a fcntl-like or ioctl interface to this. They are ugly
> _and_ there are security issues (ie if you pass on the fd to somebody
> else, they must not be able to change the signal mask _you_ specified).

It does not necessarily have to be just another ioctl/fcntl, it can be a
write. About security, changes might be allowed only to the task that
created the fd, if you're concerned. It's not that someone will starve
w/out such functionality though.

> > What do you think about having timers through a file interface ?
>
> One of the reasons for the "flags" field (which is now unused) was because
> I thought it might have extensions for things like alarms etc.

I was thinking more like :

int timerfd(int timeout, int oneshot);

that returns a pollable fd.
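
A sketch of a wrapper with that shape, built on timerfd_create(2), which
mainline Linux later gained. The wrapper name is made up, and treating the
timeout as milliseconds is an assumption: the proposal above does not give
a unit.

```c
#define _GNU_SOURCE
#include <poll.h>
#include <stdint.h>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: arm a timer that fires after timeout_ms, once if
 * oneshot is nonzero, otherwise repeatedly. Returns a pollable fd or -1.
 */
int timerfd_simple(int timeout_ms, int oneshot)
{
	struct itimerspec its = { 0 };
	int fd;

	if (timeout_ms <= 0)
		return -1;	/* an all-zero it_value would disarm the timer */
	its.it_value.tv_sec = timeout_ms / 1000;
	its.it_value.tv_nsec = (long)(timeout_ms % 1000) * 1000000L;
	if (!oneshot)
		its.it_interval = its.it_value;

	fd = timerfd_create(CLOCK_MONOTONIC, TFD_CLOEXEC);
	if (fd < 0)
		return -1;
	if (timerfd_settime(fd, 0, &its, NULL) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Wait up to wait_ms for the timer. Returns the number of expirations
 * since the last read, 0 if it has not fired yet, -1 on error.
 */
long timer_wait(int fd, int wait_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	uint64_t ticks;
	int n = poll(&pfd, 1, wait_ms);

	if (n <= 0)
		return n;
	if (read(fd, &ticks, sizeof(ticks)) != sizeof(ticks))
		return -1;
	return (long)ticks;
}
```

Because the result is an ordinary descriptor, it can sit in the same
poll() or epoll set as sockets and signal descriptors.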

- Davide

Linus Torvalds

Feb 13, 2003, 6:10:09 PM

On Thu, 13 Feb 2003, Davide Libenzi wrote:
>
> It does not have necessarily to be just another ioctl/fcntl, it can be a
> write. About security, changes might be allowed only to the task that
> created the fd, if you're concerned. It's not that someone will starve
> w/out such functionality though.

I'd actually like to reserve writes to _sending_ signals. Especially if
you have another process that listens in on the signals you get, it might
want to also force the signals through.

> > One of the reasons for the "flags" field (which is now unused) was because
> > I thought it might have extensions for things like alarms etc.
>
> I was thinking more like :
>
> int timerfd(int timeout, int oneshot);

It could be a separate system call, but since the infrastructure is
hopefully identical (most of the sigfd() code is actually creating the fs
infrastructure to get an inode with the information), it should share a
lot of the paths. Maybe even the system call.

Linus

Jamie Lokier

Feb 13, 2003, 9:50:06 PM

Linus Torvalds wrote:
> The above example program is obviously totally useless, and any real use
> would have to expand the implementation by adding the full siginfo to
> the packet read (which is trivial apart from deciding on what format to
> use - it would be good to not have it be architecture-dependent and in
> particular it would be horrible to have different formats for different
> compatibility layers).

siginfo is _almost_ architecture independent now. There are some
quirks. siginfo is bad though, because it has to be compatible with
whatever standard it came from, so nobody dares to extend it. (See
dnotify and sigsegv below).

I see that there are several fairly general event sources now:

- signals
- epoll events
- async I/O events
- posix timers

More events that don't provide enough information:

- dnotify details (siginfo doesn't say enough)
- sigsegv read/write? (siginfo doesn't say enough)

More events that should be accessible but aren't:

- vm paging like crazy, please release some memory

Your synchronous signals code effectively makes signals work with
select/poll/epoll nicely. Async I/O still reports events in its own
way.

Suggestion du jour
------------------

I think that epoll and sigfd and async I/O could meld into a nice and
extensible, and fast, event interface. You'd first create an "event
fd" (i.e. very similar to epoll_create or sigfd), then you'd add event
sets of interest (which include signals and file descriptors), then
you'd wait for them using read().

All event sources (such as timers) would have a facility for adding
themselves to a given sigfd.

Such fds could be passed around in the usual ways (as you described
for sigfds).

Async I/Os would be queued with reference to a sigfd, instead of an
aio descriptor (which is just an integer anyway) as they are now.

read() would return one or more structures containing pending
events, in the format: CATEGORY, LENGTH, REST... with the first word
saying which category of events (and giving the format), the next
giving the length, and the rest in much the same format as epoll or
async I/O do now, and whatever format is appropriate for signals.
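
A sketch of what such a record stream could look like in user space. It
follows the prose above: category word first, then length. The category
values, the field widths, and all names are assumptions made for
illustration; nothing here is a format anyone agreed on. Fields are in
host byte order, since the records never leave the machine.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical categories; the mail only lists the kinds of sources. */
enum ev_category { EV_SIGNAL = 1, EV_POLL = 2, EV_AIO = 3 };

/* Fixed-width header, so the layout does not depend on the word size. */
struct ev_header {
	uint32_t category;	/* selects the format of the payload */
	uint32_t length;	/* total record size, header included */
};

/*
 * Append one record at offset used (used <= cap).
 * Returns the new fill level, or 0 if the record does not fit.
 */
size_t ev_put(uint8_t *buf, size_t used, size_t cap, uint32_t category,
	      const void *payload, uint32_t payload_len)
{
	struct ev_header h = { category, (uint32_t)sizeof(h) + payload_len };

	if (used > cap || cap - used < h.length)
		return 0;
	memcpy(buf + used, &h, sizeof(h));
	memcpy(buf + used + sizeof(h), payload, payload_len);
	return used + h.length;
}

/*
 * Step through a buffer the way a reader of read()'s result would.
 * Fills *out and returns the offset of the next record, or 0 when the
 * buffer is exhausted or the record is malformed.
 */
size_t ev_next(const uint8_t *buf, size_t len, size_t off,
	       struct ev_header *out)
{
	if (off > len || len - off < sizeof(*out))
		return 0;
	memcpy(out, buf + off, sizeof(*out));
	if (out->length < sizeof(*out) || out->length > len - off)
		return 0;
	return off + out->length;
}
```

Carrying the length in every record lets a reader skip categories it does
not understand, which is what makes adding new event kinds cheap.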

Davide has already worked out the tricky logic of attaching fd-poll
events to an event reporter, although a special system call is used to
get the events instead of just read(). The async I/O folk have
already done similar for async I/Os, except a different special system
call is used. But both basically return an array of bytes containing
event structures.

In summary:

sigfd(...) // Create an event reporter fd.
sigfd_sigmask(fd,...) // Attach a signal mask to the reporter.
epoll_ctl(fd,...) // Attach an fd watcher to the reporter.
fcntl(...DNOTIFY_*,fd) // Attach a directory watcher to the reporter.
futex_event(...,fd) // Attach a futex waiter to the reporter.

My main reason for wanting a little unity, rather than lots of
different event reporter types (i.e. epoll, aio, sigfd) as now, is
that async I/O doesn't fit at all because it doesn't use an fd
to report its events. Anything fd-based can obviously be monitored
using epoll (or poll/select).

Obviously, all that's needed there is for async I/O to report _its_
events using a file descriptor and read().

However, once you've got a kernel data structure for queuing events
(epoll's logic is close to ideal IMHO), it seems clean and efficient
to generalise the code, rather than having several different event
queuing and reporting subsystems.

And when that's done you have some nice bonuses:

- All event types are reported equally fast, and in a single
system call (read()).

- The order in which events occurred is preserved.
(This is lost when you have to scan multiple queues).

- Hierarchies of event sets of any kind are possible.
(epoll has solved the logical problems of this already).

- Less code duplicated.

- Adding new kinds of kernel events becomes _very_ simple.

-- Jamie

Keith Adamson

Feb 13, 2003, 10:10:08 PM

trimmed cc list

> I see that there are several fairly general event sources now:
>
> - signals
> - epoll events
> - async I/O events
> - posix timers
>
> More events that don't provide enough information:
>
> - dnotify details (siginfo doesn't say enough)
> - sigsegv read/write? (siginfo doesn't say enough)
>
> More events that should be accessible but aren't:
>
> - vm paging like crazy, please release some memory
>
> Your synchronous signals code effectively makes signals work with
> select/poll/epoll nicely.

How about also including a connect()/bind() interface so that
you can sort of have a "sockets for signals" type interface.
This seems like a nice type of interface for synchronization.
And maybe use send()/recv() instead of read()/write(). Or am
I on crack:)

Keith Adamson

Feb 13, 2003, 11:00:17 PM

On Thu, 2003-02-13 at 22:11, Keith Adamson wrote:
>
> How about also including a connect()/bind() interface so that
> you can sort of have a "sockets for signals" type interface.
> This seems like a nice type of interface for synchronization.
> And maybe use send()/recv() instead of read()/write(). Or am
> I on crack:)
>

I guess what I'm trying to say is that if you want to generalize
a synchronous signal delivery interface I think the networking
interface is a better paradigm than the filesystem interface.

Dan Kegel

Feb 14, 2003, 2:10:06 AM

Linus wrote:
> fd = sigfd(sigset_t * mask, unsigned long flags);

Damn. I got the interface wrong. I guessed it would be
int sigopen(int signum);
when I wrote the man page...
http://www.uwsg.iu.edu/hypermail/linux/kernel/0106.3/0404.html

- Dan

--
Dan Kegel
http://www.kegel.com
http://counter.li.org/cgi-bin/runscript/display-person.cgi?user=78045

Abramo Bagnara

Feb 14, 2003, 4:00:15 AM

Linus Torvalds wrote:
>
> On Thu, 13 Feb 2003, Davide Libenzi wrote:
> >
> > It does not have necessarily to be just another ioctl/fcntl, it can be a
> > write. About security, changes might be allowed only to the task that
> > created the fd, if you're concerned. It's not that someone will starve
> > w/out such functionality though.
>
> I'd actually like to reserve writes to _sending_ signals. Especially if
> you have another process that listens in on the signals you get, it might
> want to also force the signals through.

This reminds me of the unfortunate lack of a (much needed) unified way
to send/receive out-of-band data to/from a regular fd.

Something like:
oob = fd_open(fd, channel, flags);
write(oob, ...)
read(oob, ....)
close(oob);

Don't you think it's time to introduce it and to start to avoid the
proliferation of different tricky ways to do the same things?

--
Abramo Bagnara mailto:abramo....@libero.it

Opera Unica Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy

Giuliano Pochini

Feb 14, 2003, 6:00:18 AM

> It's a generic "synchronous signal delivery" method, and it uses a
> perfectly regular file descriptor to deliver an arbitrary set of signals
> that are pending.
>

> It adds one new system call:
>

> fd = sigfd(sigset_t * mask, unsigned long flags);

IMHO it's not simply a signal delivery system, it's a message queue. It's
possible to deliver any kind of data to the process, and the fd can be
used to send data to other processes as well.

Bye.

Keith Adamson

Feb 14, 2003, 8:10:09 AM

On Fri, 2003-02-14 at 05:55, Giuliano Pochini wrote:
>
> > It's a generic "synchronous signal delivery" method, and it uses a
> > perfectly regular file descriptor to deliver an arbitrary set of signals
> > that are pending.
> >
> > It adds one new system call:
> >
> > fd = sigfd(sigset_t * mask, unsigned long flags);
>
> IMHO it's not simply a signal delivery system, it's a message queue. It's
> possible to deliver any kind of data to the process, and the fd can be
> used to send data to other processes as well.
>

IMHO I agree, it would be real nice to be able to expose signals to
other processes using a socket type of interface. Even the kernel
could expose certain internal signals to user space programs such
as VM pressure.

Alan Cox

Feb 14, 2003, 8:30:33 AM

On Fri, 2003-02-14 at 08:55, Abramo Bagnara wrote:
> This reminds me the unfortunate (and much needed) lack of an unified way
> to send/receive out-of-band data to/from a regular fd.
>
> Something like:
> oob = fd_open(fd, channel, flags);
> write(oob, ...)
> read(oob, ....)
> close(oob);
>
> Don't you think it's time to introduce it and to start to avoid the
> proliferation of different tricky ways to do the same things?

Why are you trying to throw yet more crap into the kernel? Linus' signals
as fd thing is questionable but makes a little sense (it's in many ways
more unix than the traditional approach of using real time queued
signals, since you can now select on it).

Out of band data is a second data channel, so open two pipes. Jeez
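
A sketch of the two-pipe arrangement, with made-up names. It gives the
control channel priority when both are readable, which is one possible
policy and not something stated above. It does not record how writes on
the two pipes were ordered relative to each other.

```c
#include <poll.h>
#include <unistd.h>

/* Hypothetical pairing of a data pipe with a separate control pipe. */
struct chan {
	int data[2];
	int oob[2];
};

int chan_open(struct chan *c)
{
	if (pipe(c->data) < 0)
		return -1;
	if (pipe(c->oob) < 0) {
		close(c->data[0]);
		close(c->data[1]);
		return -1;
	}
	return 0;
}

/*
 * Returns 2 if out-of-band input is waiting, 1 if only ordinary data is,
 * 0 on timeout, -1 on error or hangup.
 */
int chan_poll(const struct chan *c, int timeout_ms)
{
	struct pollfd pfd[2] = {
		{ .fd = c->oob[0],  .events = POLLIN },
		{ .fd = c->data[0], .events = POLLIN },
	};
	int n = poll(pfd, 2, timeout_ms);

	if (n <= 0)
		return n;
	if (pfd[0].revents & POLLIN)
		return 2;
	if (pfd[1].revents & POLLIN)
		return 1;
	return -1;
}
```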

Abramo Bagnara

Feb 14, 2003, 9:10:08 AM

Alan Cox wrote:
>
> On Fri, 2003-02-14 at 08:55, Abramo Bagnara wrote:
> > This reminds me the unfortunate (and much needed) lack of an unified way
> > to send/receive out-of-band data to/from a regular fd.
> >
> > Something like:
> > oob = fd_open(fd, channel, flags);
> > write(oob, ...)
> > read(oob, ....)
> > close(oob);
> >
> > Don't you think it's time to introduce it and to start to avoid the
> > proliferation of different tricky ways to do the same things?
>
> Why are you trying to throw yet more crap into the kernel. Linus signals
> as fd thing is questionable but makes a little sense (its in many ways
> more unix than the traditional approach of using real time queued
> signal since you can now select on it)

My comment was not related to the "signals as fd" stuff, but to the more
generic need (implicit in Linus' reply to Davide's comment) to sometimes
have a control channel for an open fd (much like a file approach to the
ioctl/fcntl problem space).

FWIW and IIRC a similar solution (based on a fs approach) was suggested
also by Al Viro some time ago.

> Out of band data is a second data channel, so open two pipes. Jeez

What about the relation between the two channels?

--
Abramo Bagnara mailto:abramo....@libero.it

Opera Unica Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy

Davide Libenzi

Feb 14, 2003, 7:00:23 PM

On Thu, 13 Feb 2003, Linus Torvalds wrote:

> > > One of the reasons for the "flags" field (which is not unused) was because
> > > I thought it might have extensions for things like alarms etc.
> >
> > I was thinking more like :
> >
> > int timerfd(int timeout, int oneshot);
>
> It could be a separate system call, ...

I would personally like it a lot to have timer events available on
pollable fds. Am I alone in this ?

> but since the infrastructure is hopefully identical (most of the sigfd()
> code is actually creating the fs infrastructure to get an inode with the
> information), it should share a lot of the paths.

About that, I think we should make a utility function to be shared among
all the code that needs to create "fake" inodes to expose fds. Right now
many components (pipes, futexes, epoll, ...) use the same basic code,
sharing the same needs and duplicating basically the same code.

- Davide

Davide Libenzi

Feb 14, 2003, 7:10:10 PM

On Fri, 14 Feb 2003, Jamie Lokier wrote:

> And when that's done you have some nice bonuses:
>
> - All event types are reported equally fast, and in a single
> system call (read()).
>
> - The order in which events occurred is preserved.
> (This is lost when you have to scan multiple queues).
>
> - Hierarchies of event sets of any kind are possible.
> (epoll has solved the logical problems of this already).
>
> - Less code duplicated.
>
> - Adding new kinds of kernel events becomes _very_ simple.

Hmm ... using read() you'll lose the timeout capability, that IMHO is
pretty nice.

Matti Aarnio

Feb 14, 2003, 7:10:14 PM

On Fri, Feb 14, 2003 at 04:00:03PM -0800, Davide Libenzi wrote:
> On Thu, 13 Feb 2003, Linus Torvalds wrote:

....

> > > > One of the reasons for the "flags" field (which is not unused) was because
> > > > I thought it might have extensions for things like alarms etc.
> > > I was thinking more like :
> > >
> > > int timerfd(int timeout, int oneshot);
> >
> > It could be a separate system call, ...
>
> I would personally like it a lot to have timer events available on
> pollable fds. Am I alone in this ?

Somehow this whole idea has the feel of a long-established
Linux kernel facility called netlink.

It can send varying messages to userspace via a file handle, and is
pollable. It was originally meant for networking code, and therefore it
already has a protocol capable of handling multiple different formats,
queue saturation, etc.

Do we need new syscall(s) ? Could it all be done with netlink ?

/Matti Aarnio

Jamie Lokier

Feb 14, 2003, 8:10:08 PM

Davide Libenzi wrote:
> > And when that's done you have some nice bonuses:
> >
> > - All event types are reported equally fast, and in a single
> > system call (read()).
> >
> > - The order in which events occurred is preserved.
> > (This is lost when you have to scan multiple queues).
> >
> > - Hierarchies of event sets of any kind are possible.
> > (epoll has solved the logical problems of this already).
> >
> > - Less code duplicated.
> >
> > - Adding new kinds of kernel events becomes _very_ simple.
>
> Hmm ... using read() you'll lose the timeout capability, that IMHO is
> pretty nice.

Very good point.

Timeouts could be events too - probably a good idea as they can then
be absolute, relative, attached to different system clocks (monotonic
vs. timeofday). I think the POSIX timer work is like that.

It seems like a good idea to be able to attach one timeout event in
the same system call as the event_read call itself - because it is
_so_ common to vary the expiry time every time.

Then again, it is also extremely common to write this:

gettimeofday(...)
// calculate time until next application timer expires.
// Note also race condition here, if we're preempted.
read_events(..., next_app_time - timeofday)
// we need to know the current time.
gettimeofday(...)

So perhaps the current select/poll/epoll timeout method is not
particularly optimal as it is?

-- Jamie
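
The gettimeofday()/read_events() pattern above boils down to turning an absolute application deadline into a relative timeout on every loop iteration. A minimal sketch of that step, assuming the deadline is kept on CLOCK_MONOTONIC and that poll() stands in for the hypothetical read_events(); the helper names ms_until() and wait_until() are made up for illustration:

```c
#define _XOPEN_SOURCE 700
#include <limits.h>
#include <poll.h>
#include <time.h>

/* Hypothetical helper: milliseconds from `now` until `deadline`,
 * rounded up and clamped to [0, INT_MAX] so it fits poll()'s timeout. */
int ms_until(const struct timespec *deadline, const struct timespec *now)
{
    long long ns = (long long)(deadline->tv_sec - now->tv_sec) * 1000000000LL
                 + (deadline->tv_nsec - now->tv_nsec);
    long long ms;

    if (ns <= 0)
        return 0;                   /* deadline already passed: do not block */
    ms = (ns + 999999) / 1000000;   /* round up so we never wake early */
    return ms > INT_MAX ? INT_MAX : (int)ms;
}

/* Hypothetical wrapper: wait on fds until the absolute deadline.
 * The race Jamie mentions is still here: if we are preempted between
 * clock_gettime() and poll(), the relative timeout is already stale. */
int wait_until(struct pollfd *fds, nfds_t nfds, const struct timespec *deadline)
{
    struct timespec now;

    if (clock_gettime(CLOCK_MONOTONIC, &now) < 0)
        return -1;
    return poll(fds, nfds, ms_until(deadline, &now));
}
```

The extra clock read per iteration is exactly the overhead being discussed; an interface taking the absolute deadline directly would avoid both it and the stale-timeout window.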

Linus Torvalds

Feb 14, 2003, 8:10:10 PM

On Fri, 14 Feb 2003, Davide Libenzi wrote:
>
> I would personally like it a lot to have timer events available on
> pollable fds. Am I alone in this ?

I don't know.

HOWEVER, judging from the discussion following, I do know that there are a
lot of people who want to have just "random things" available.

That's not what this patch was about. I'm not in the least interested in
some "generic event" mechanism, and it's not where I think this should
even go. This was very much about signals, and while I can see the
potential to extend the notion of signals to things like timers, I don't
think it's necessarily a good idea to extend it too far.

For example, you _can_, with the existing patch, already get timers. You
won't get any _new_ timers, but all the normal itimer signal stuff would
come down the sigfd() pipe the same way any other signal does.

Could we extend that to bind "other" timers to the sigfd()? Yes. And maybe
we could make it easier in general to "bind" events to the fd, instead of
having the coupling be static (ie right now it's a static coupling at
"sigfd()" call time, it could be split up into a "create descriptor" and
"bind descriptor" thing).

> About that, I think we should make an utility function to be shared among
> all the code that need to create "fake" inodes to expose fds. Right now
> many component ( pipes, futexes, epoll, ... ) uses the basic code, sharing
> the same needs, and duplicating basically the same code.

Some of it can be pulled in. However, the way the dynamic inode allocation
works, different kinds of inodes _have_ to have different superblocks,
since that's the level where the inode allocation and caching works. So
the fake inodes for a pipe, for example, are _not_ the same as the fake
inodes for the sigfd's. So not all of it is shared.

Linus
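
The point about itimer signals can be illustrated without the prototype: once SIGALRM is blocked, an ordinary interval timer becomes an event that is collected synchronously. The sketch below does not use sigfd() (which exists only in the patch under discussion); it uses POSIX sigwait() as a stand-in for reading the descriptor, and the function name wait_for_itimer() is made up for illustration:

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <string.h>
#include <sys/time.h>

/* Arm a one-shot ITIMER_REAL for `ms` milliseconds (ms > 0) and collect
 * the resulting SIGALRM synchronously. Returns the signal number, or -1. */
int wait_for_itimer(int ms)
{
    sigset_t mask;
    struct itimerval it;
    int sig;

    if (ms <= 0)
        return -1;              /* a zero value would disarm the timer */

    sigemptyset(&mask);
    sigaddset(&mask, SIGALRM);
    /* Block first: an unblocked SIGALRM would be delivered the
     * old-fashioned way, to a handler or the default action. */
    if (sigprocmask(SIG_BLOCK, &mask, NULL) < 0)
        return -1;

    memset(&it, 0, sizeof(it)); /* it_interval stays 0: one shot */
    it.it_value.tv_sec = ms / 1000;
    it.it_value.tv_usec = (ms % 1000) * 1000;
    if (setitimer(ITIMER_REAL, &it, NULL) < 0)
        return -1;

    /* Stand-in for read() on the signal descriptor. */
    if (sigwait(&mask, &sig) != 0)
        return -1;
    return sig;
}
```

Unlike a descriptor, sigwait() cannot be combined with select()/poll() on other fds, which is the gap the proposal is meant to close. A multi-threaded program would use pthread_sigmask() instead of sigprocmask().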

Linus Torvalds

Feb 14, 2003, 8:20:08 PM

On Sat, 15 Feb 2003, Matti Aarnio wrote:
>
> Somehow all this idea has a feeling of long established
> Linux kernel facility called: netlink

Several people have said that, and it's completely NOT TRUE.

The thing about sigfd() has _nothing_ to do with sending packets, and
everything to do with the fact that you _associate_ signals with the thing
that you get the packets from.

Sure, the code could associate signals with a netlink fd instead. But
netlink is not actually a very good abstraction in my opinion - it has
another layer of code (the network layer) between it and the user, which
does not add any value.

> Do we need new syscall(s) ? Could it all be done with netlink ?

We'd need the same new system call - the one to associate signals of this
process with the netlink thing.

(Yeah, the "system call" could be an ioctl entry, but quite frankly,
that's much WORSE than adding a system call. It's just system calls
without type checking).

Randy.Dunlap

Feb 14, 2003, 8:20:21 PM

On Sat, 15 Feb 2003 01:01:53 +0000
Jamie Lokier <ja...@shareable.org> wrote:

| Davide Libenzi wrote:
...

| >
| > Hmm ... using read() you'll lose the timeout capability, that IMHO is
| > pretty nice.
|
| Very good point.
|
| Timeouts could be events too - probably a good idea as they can then
| be absolute, relative, attached to different system clocks (monotonic
| vs. timeofday). I think the POSIX timer work is like that.

Hi Davide, Jamie-

Yep. And there are people (plural :) who would still like to get
that patch accepted into 2.5 too....

| It seems like a good idea to be able to attach one timeout event in
| the same system call as the event_read call itself - because it is
| _so_ common to vary the expiry time every time.
|
| Then again, it is also extremely common to write this:
|
| gettimeofday(...)
| // calculate time until next application timer expires.
| // Note also race condition here, if we're preempted.
| read_events(..., next_app_time - timeofday)
| // we need to know the current time.
| gettimeofday(...)
|
| So perhaps the current select/poll/epoll timeout method is not
| particularly optimal as it is?

--
~Randy

Davide Libenzi

Feb 14, 2003, 8:30:05 PM

On Sat, 15 Feb 2003, Matti Aarnio wrote:

> On Fri, Feb 14, 2003 at 04:00:03PM -0800, Davide Libenzi wrote:
> > On Thu, 13 Feb 2003, Linus Torvalds wrote:
> ....
> > > > > One of the reasons for the "flags" field (which is not unused) was because
> > > > > I thought it might have extensions for things like alarms etc.
> > > > I was thinking more like :
> > > >
> > > > int timerfd(int timeout, int oneshot);
> > >
> > > It could be a separate system call, ...
> >
> > I would personally like it a lot to have timer events available on
> > pollable fds. Am I alone in this ?
>
> Somehow all this idea has a feeling of long established
> Linux kernel facility called: netlink
>
> It can send varying messages to userspace via a file-handle, and is
> pollable. Originally that is for network codes, and therefore it
> already has protocol capable to handle multiple different formats,
> handle queue saturation, etc.
>
> Do we need new syscall(s) ? Could it all be done with netlink ?

The (eventual) new syscalls do not have to implement anything special
about queueing and message delivery; f_op->poll() support will be
sufficient to have them working with select/poll/epoll. About netlink, I
personally find it quite confusing compared with simple syscalls like:

int sigfd(...);
int timerfd(...);

Netlink is quite powerful because of its generic message-passing
infrastructure, which is IMHO overkill when you simply have to receive
one timer/signal event. I personally do not like the idea of multiplexing
APIs, especially ones that were not born for that purpose.

- Davide

Jeff Garzik

Feb 14, 2003, 8:40:07 PM

Linus Torvalds wrote:
> On Sat, 15 Feb 2003, Matti Aarnio wrote:
>
>>Do we need new syscall(s) ? Could it all be done with netlink ?
>
> We'd need the same new system call - the one to associate signals of this
> process with the netlink thing.
>
> (Yeah, the "system call" could be an ioctl entry, but quite frankly,
> that's much WORSE than adding a system call. It's just system calls
> without type checking).

I have been lobbying for sys_garzik(2) for years... while you're in
there adding stuff, can you slip that in too please?

... :)

More seriously, and a bit of a tangent, I wonder how much attention we
need to give netlink. Because it either has the potential to be used as
a de facto in-kernel event-passing API, or it's too heavyweight for
that, implying [IMO] we need a netlink-lite.

I _don't_ want to see mini-netlinks springing up every time we need
[a]sync <foo> delivery inside the kernel.

Jeff

Davide Libenzi

Feb 14, 2003, 8:50:08 PM

On Sat, 15 Feb 2003, Jamie Lokier wrote:

> Davide Libenzi wrote:
> > > And when that's done you have some nice bonuses:
> > >
> > > - All event types are reported equally fast, and in a single
> > > system call (read()).
> > >
> > > - The order in which events occurred is preserved.
> > > (This is lost when you have to scan multiple queues).
> > >
> > > - Hierarchies of event sets of any kind are possible.
> > > (epoll has solved the logical problems of this already).
> > >
> > > - Less code duplicated.
> > >
> > > - Adding new kinds of kernel events becomes _very_ simple.
> >
> > Hmm ... using read() you'll lose the timeout capability, that IMHO is
> > pretty nice.
>
> Very good point.
>
> Timeouts could be events too - probably a good idea as they can then
> be absolute, relative, attached to different system clocks (monotonic
> vs. timeofday). I think the POSIX timer work is like that.

POSIX timers might work, and events might be captured using the sigfd()
descriptor.

> It seems like a good idea to be able to attach one timeout event in
> the same system call as the event_read call itself - because it is
> _so_ common to vary the expiry time every time.
>
> Then again, it is also extremely common to write this:
>
> gettimeofday(...)
> // calculate time until next application timer expires.
> // Note also race condition here, if we're preempted.
> read_events(..., next_app_time - timeofday)
> // we need to know the current time.
> gettimeofday(...)
>
> So perhaps the current select/poll/epoll timeout method is not
> particularly optimal as it is?

What's wrong with using epoll_wait() to get events from all pollable
descriptors?

- Davide

Davide Libenzi

Feb 14, 2003, 9:10:06 PM

On Fri, 14 Feb 2003, Linus Torvalds wrote:

>
> On Fri, 14 Feb 2003, Davide Libenzi wrote:
> >
> > I would personally like it a lot to have timer events available on
> > pollable fds. Am I alone in this ?
>
> I don't know.
>
> HOWEVER, judging from the discussion following, I do know that there are a
> lot of people who want to have just "random things" available.
>
> That's not what this patch was about. I'm not in the least interested in
> some "generic event" mechanism, and it's not where I think this should
> even go. This was very much about signals, and while I can see the
> potential to extend the notion of signals to things like timers, I don't
> think it's necessarily a good idea to extend it too far.
>
> For example, you _can_, with the existing patch, already get timers. You
> won't get any _new_ timers, but all the normal itimer signal stuff would
> come down the sigfd() pipe the same way any other signal does.

You're right, I had not thought about something like POSIX timers plus
sigfd().

> > About that, I think we should make an utility function to be shared among
> > all the code that need to create "fake" inodes to expose fds. Right now
> > many component ( pipes, futexes, epoll, ... ) uses the basic code, sharing
> > the same needs, and duplicating basically the same code.
>
> Some of it can be pulled in. However, the way the dynamic inode allocation
> works, different kinds of inodes _have_ to have different superblocks,
> since that's the level where the inode allocation and caching works. So
> the fake inodes for a pipe, for example, are _not_ the same as the fake
> inodes for the sigfd's. So not all of it is shared.

Superblocks will be different, but their "fake" functionality can be
shared. A few parameters like file system name, file system magic number,
root name, ... will be able to do the trick :

fakefs_t fakefs_create(char const *root, char const *name, unsigned long magic);
struct inode *fakefs_new_inode(fakefs_t fkfs);
void fakefs_close(fakefs_t fkfs);

Jamie Lokier

Feb 14, 2003, 9:10:09 PM

Davide Libenzi wrote:
> > Then again, it is also extremely common to write this:
> >
> > gettimeofday(...)
> > // calculate time until next application timer expires.
> > // Note also race condition here, if we're preempted.
> > read_events(..., next_app_time - timeofday)
> > // we need to know the current time.
> > gettimeofday(...)
> >
> > So perhaps the current select/poll/epoll timeout method is not
> > particularly optimal as it is?
>
> What's bad in epoll_wait() to get events from all pollable descriptors ?

Nothing wrong with that. It's the "relative to now" timeout argument
that is a bit racy, and the fact that you need a gettimeofday() system
call just before - every time - _purely_ to calculate the time until
the next application timer event.

If you must have a separate system call every time round your event
loop, it may as well set up a timerfd and let that be another pollable
descriptor.

In which case, read() is just fine for getting events :)

-- Jamie

Davide Libenzi

Feb 14, 2003, 11:10:07 PM

On Sat, 15 Feb 2003, Jamie Lokier wrote:

> Davide Libenzi wrote:
> > > Then again, it is also extremely common to write this:
> > >
> > > gettimeofday(...)
> > > // calculate time until next application timer expires.
> > > // Note also race condition here, if we're preempted.
> > > read_events(..., next_app_time - timeofday)
> > > // we need to know the current time.
> > > gettimeofday(...)
> > >
> > > So perhaps the current select/poll/epoll timeout method is not
> > > particularly optimal as it is?
> >
> > What's bad in epoll_wait() to get events from all pollable descriptors ?
>
> Nothing wrong with that. It's the "relative to now" timeout argument
> that is a bit racy, and the fact that you need a gettimeofday() system
> call just before - every time - _purely_ to calculate the time until
> the next application timer event.
>
> If you must have a separate system call every time round your event
> loop, it may as well set up a timerfd and let that be another pollable
> descriptor.
>
> In which case, read() is just fine for getting events :)

Many (many) times, when you're going to wait for events, you want to
specify a maximum wait time (relative time) and not an absolute time.
This is how ppl think about "timeouts". The absolute timer is a different
beast, one that you can easily achieve with POSIX timers (TIMER_ABSTIME)
and a sigfd() dropped inside an event retrieval interface.

- Davide

Werner Almesberger

Feb 14, 2003, 11:10:09 PM

Davide Libenzi wrote:
> What do you think about having timers through a file interface ?

Maybe I'm missing something obvious, but couldn't you simply
do this with a signal handler that writes to (a) pipe(s) ?

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina w...@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
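
The "signal handler that writes to a pipe" approach suggested above is commonly called the self-pipe trick. A minimal sketch, assuming a single process-wide pipe and that forwarding the signal number as one byte is enough; sigpipe_init() and sigpipe_wait() are names made up for illustration:

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static int sig_pipe[2] = { -1, -1 };

/* Handler: async-signal-safe calls only. Forwards the signal number as
 * one byte; if the pipe is full the byte is silently dropped. */
static void on_signal(int signum)
{
    int saved = errno;
    unsigned char b = (unsigned char)signum;
    ssize_t r = write(sig_pipe[1], &b, 1);

    (void)r;
    errno = saved;
}

/* Route `signum` into the pipe. Returns the pollable read end, or -1. */
int sigpipe_init(int signum)
{
    struct sigaction sa;
    int i;

    if (sig_pipe[0] < 0) {
        if (pipe(sig_pipe) < 0)
            return -1;
        for (i = 0; i < 2; i++) {
            /* Non-blocking so neither the handler nor a reader can hang. */
            fcntl(sig_pipe[i], F_SETFL, O_NONBLOCK);
            fcntl(sig_pipe[i], F_SETFD, FD_CLOEXEC);
        }
    }
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(signum, &sa, NULL) < 0)
        return -1;
    return sig_pipe[0];
}

/* Wait up to timeout_ms for a forwarded signal.
 * Returns the signal number, 0 on timeout, -1 on error. */
int sigpipe_wait(int timeout_ms)
{
    struct pollfd p;
    unsigned char b;
    int n;

    p.fd = sig_pipe[0];
    p.events = POLLIN;
    p.revents = 0;
    do {
        n = poll(&p, 1, timeout_ms);
    } while (n < 0 && errno == EINTR);
    if (n <= 0)
        return n;
    if (read(sig_pipe[0], &b, 1) != 1)
        return -1;
    return b;
}
```

The read end can be added to any select()/poll()/epoll set next to other descriptors. What this does not give you is the siginfo payload or any guarantee once the pipe fills up, which is part of why a kernel-side descriptor is being discussed.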

Werner Almesberger

Feb 14, 2003, 11:30:11 PM

Giuliano Pochini wrote:
> IMHO it's not simply a signal delivery system, it's a message queue.

Not entirely, because - as I understand it - signals would be
aggregated, so you'd always get one item, no matter how many
signals are actually pending at that time.

For generalizing such mechanisms, it might be useful to have
an atomic "overwrite" operation for pipes, and maybe also for
some sockets, e.g. something like this:

ssize_t overwrite(int fd,
const void *data_if_empty,size_t size_if_empty,
const void *data_if_full,size_t size_if_full,
int *was_empty);

If there is no data in the pipe/queue, "overwrite" would
write "data_if_empty", and clear *was_empty. Otherwise, it
would discard what's there, then write "data_if_full", and
set *was_empty. The whole operation is atomic with respect
to readers/pollers.

E.g.

static int signal_set = 0;

... add_signal(int signum)
{
	int new_signal = 1 << signum;
	int was_empty;

	signal_set |= new_signal;
	overwrite(fd,&new_signal,sizeof(int),&signal_set,sizeof(int),
	    &was_empty);
	if (was_empty)
		signal_set = new_signal;
}

Jamie Lokier

Feb 14, 2003, 11:40:06 PM

Davide Libenzi wrote:
> Many ( many ) times, when you're going to wait for events, you want to
> specify a maximum wait time ( relative time ) and not an absolute time.
> This is how ppl think about "timeouts". Different beast is the absolute
> timer, that you can easily achieve with POSIX timers ( TIMER_ABSTIME ) and
> a sigfd() dropped inside an event retrieval interface.

Agreed, both interfaces are useful. You see that epoll_wait is
optimised for one in particular though.

Curiously, I'll probably continue to use a calculated relative
timeout instead of a POSIX timer, as setting up and tearing down the
latter costs extra system calls, which we still like to avoid if it's
not hard.

-- Jamie

Keith Adamson

Feb 15, 2003, 12:10:06 AM

On Fri, 2003-02-14 at 20:03, Linus Torvalds wrote:
> Could we extend that to bind "other" timers to the sigfd()? Yes. And maybe
> we could make it easier in general to "bind" events to the fd, instead of
> having the coupling be static (ie right now it's a static coupling at
> "sigfd()" call time, it could be split up into a "create descriptor" and
> "bind descriptor" thing).
>

How about the reverse ... having multiple processes dynamically
connect to a single existing sigfd and listen for a signal? You said
that you want to reserve write() for sending signals through the sigfd.
If you implement write(sigfd, ...) then this seems to provide a very
nice writer/reader signal delivery interface with well defined end
points for the sender and receivers. Or maybe I'm just confused.

Werner Almesberger

Feb 15, 2003, 12:20:09 AM

I wrote:
> ssize_t overwrite(int fd,
> const void *data_if_empty,size_t size_if_empty,
> const void *data_if_full,size_t size_if_full,
> int *was_empty);

Bah, rubbish. Make this

ssize_t overwrite(int fd,const void *data,size_t size);

If the pipe/queue is empty, don't write, and return 0. Now, this
could probably be implemented with a flag to send(2) (then this
wouldn't work for pipes, but you could use socketpair(2) for a
similar effect). Example:

void add_signal(int signum)
{
	static int signal_set = 0;
	int new_signal = 1 << signum;
	int sent;

	signal_set |= new_signal;
	sent = send(fd,&signal_set,sizeof(int),MSG_OVERWRITE);
	if (!sent) {
		send(fd,&new_signal,sizeof(int),0);
		signal_set = new_signal;
	}
}

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina w...@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
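
MSG_OVERWRITE is a flag proposed in this message, not an existing send(2) flag, so the example above cannot be run as written. The intended behaviour (at most one unread item, with later signals merged into it) can still be modelled in plain memory. A minimal single-threaded sketch; struct sigbox and both functions are made up for illustration, and signal numbers are assumed to lie in 0..31:

```c
#include <stdbool.h>

/* One-slot mailbox: holds at most one unread bitmask of signals. */
struct sigbox {
    unsigned int pending;   /* accumulated signal bits */
    bool full;              /* true while an item is waiting to be read */
};

/* Post a signal. If an item is already waiting, merge into it instead
 * of queueing a second item (the "overwrite" case); otherwise start a
 * fresh item (the plain send case). */
void sigbox_post(struct sigbox *b, int signum)
{
    unsigned int bit = 1u << signum;

    if (b->full) {
        b->pending |= bit;
    } else {
        b->pending = bit;
        b->full = true;
    }
}

/* Take the waiting item: returns the accumulated set and empties the
 * slot, or returns 0 if nothing is waiting. */
unsigned int sigbox_take(struct sigbox *b)
{
    unsigned int set;

    if (!b->full)
        return 0;
    set = b->pending;
    b->pending = 0;
    b->full = false;
    return set;
}
```

The reader therefore sees one item no matter how many signals arrived since the last read. The model has no locking; the whole point of the proposed primitive is that the kernel would make the check-and-replace step atomic with respect to readers.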

Ingo Oeser

Feb 15, 2003, 8:10:07 AM

On Fri, Feb 14, 2003 at 02:57:38PM +0100, Abramo Bagnara wrote:
> > Out of band data is a second data channel, so open two pipes. Jeez
>
> What about the relation between the two channels?

Encoded in the program logic, where it belongs. We have enough
needless interrelations between API functions already.

If you would like to have two channels in one, then simply
implement a multiplexer in the program that needs it (look at ssh
for an example).

Regards

Ingo Oeser
--
Science is what we can tell a computer. Art is everything else. --- D.E.Knuth

Abramo Bagnara

Feb 15, 2003, 1:10:15 PM

Ingo Oeser wrote:
>
> On Fri, Feb 14, 2003 at 02:57:38PM +0100, Abramo Bagnara wrote:
> > > Out of band data is a second data channel, so open two pipes. Jeez
> >
> > What about the relation between the two channels?
>
> Encoded in the program logic, where it belongs. We have enough
> needless interrelations between API functions already.
>
> If you would like to have two channels in one, than simply
> implement a multiplexer in the program that needs it (look at ssh
> for an example).

We might call this thread "the neverending misunderstanding" ;-)

My message was a proposal for a universal solution for
control/out-of-band streams on pseudo-files (like the Linus sig fd,
device fds, sockets, proc files, etc.) as a way to communicate between
user space and kernel space.

I.e. something that might replace the ioctl/fcntl mess, giving the same
(and more) flexibility and power (extending the 'everything is a file'
concept also to control data).

This is *not* something I'd propose for user space (where we definitely
have many good ways to achieve these results).

--
Abramo Bagnara mailto:abramo....@libero.it

Opera Unica Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy

James Antill

Feb 15, 2003, 2:30:11 PM

Davide Libenzi <dav...@xmailserver.org> writes:

> On Thu, 13 Feb 2003, Linus Torvalds wrote:
>
> > > > One of the reasons for the "flags" field (which is not unused) was because
> > > > I thought it might have extensions for things like alarms etc.
> > >
> > > I was thinking more like :
> > >
> > > int timerfd(int timeout, int oneshot);
> >
> > It could be a separate system call, ...
>
> I would personally like it a lot to have timer events available on
> pollable fds. Am I alone in this ?

Think of "timer events" as a single TCP connection, so you have...

time X: empty
time X+Y: timed event "Arrives"
time X+Z: timed event "Arrives"

...at which point it's pretty obvious that if you "poll" the timer
event queue from anytime before X+Y it'll be empty, and anytime after
X+Y it'll be "full". There isn't any point in being able to distinguish
between the events X+Y and X+Z, you only need to know a timed event has
occurred so you should process all timed events that are needed.
At which point you just need to work out the difference between X and
X+Y, and pass that to poll/sigtimedwait/etc.

--
# James Antill -- ja...@and.org
:0:
* ^From: .*james@and\.org
/dev/null

Linus Torvalds

Feb 15, 2003, 3:20:07 PM

On Fri, 14 Feb 2003, Davide Libenzi wrote:
> >
> > Some of it can be pulled in. However, the way the dynamic inode allocation
> > works, different kinds of inodes _have_ to have different superblocks,
> > since that's the level where the inode allocation and caching works. So
> > the fake inodes for a pipe, for example, are _not_ the same as the fake
> > inodes for the sigfd's. So not all of it is shared.
>
> Superblocks will be different, but their "fake" functionality can be
> shared. A few parameters like file system name, file system magic number,
> root name, ... will be able to do the trick :
>
> fakefs_t fakefs_create(char const *root, char const *name, unsigned long magic);
> struct inode *fakefs_new_inode(fakefs_t fkfs);
> void fakefs_close(fakefs_t fkfs);

I'd love to see this. I agree that a fair amount of this should be
shareable with the pipe and socket code, for example. But I would call it
"virtual" instead of "fake", since there is nothing fake about the inode.
A pipe inode is one of the most fundamental and basic things in UNIX, it's
not "fake" just because it doesn't live on a harddisk.

In fact, one thing I noticed when doing the sigfd() code is that the pipe
code doesn't take advantage of the new inode allocation layer as well as
it could. It still uses multiple allocations, doing a _separate_
allocation for the small "pipe_inode_info" instead of doing the embedding
trick.

And that's obviously because the code was only minimally changed, because
doing the FS setup with a super-block operation to get at a specialized
allocator is more lines of code than people were willing to do when doing
the conversion.

(It's not _that_ many lines of code, but it could certainly be improved.
Almost everybody does the same thing at inode allocation and
de-allocation, it's just that the structure containers are different).

Linus
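
The "embedding trick" mentioned above means placing the generic inode inside the filesystem-specific structure, so one allocation covers both and the private part is recovered by pointer arithmetic. A user-space model of the idea; the structures are simplified stand-ins rather than the kernel's definitions, and pipe_alloc_inode(), PIPE_I() and pipe_destroy_inode() are names chosen for illustration:

```c
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-ins for the kernel structures. */
struct inode {
    unsigned long i_ino;
};

struct pipe_info {
    unsigned int readers;
    unsigned int writers;
};

/* Container: private data and the generic inode live in one allocation. */
struct pipe_inode {
    struct pipe_info info;
    struct inode vfs_inode;
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* One allocation instead of two; callers only ever see the inode. */
struct inode *pipe_alloc_inode(void)
{
    struct pipe_inode *p = calloc(1, sizeof(*p));

    return p ? &p->vfs_inode : NULL;
}

/* Map a generic inode back to its pipe-specific part. */
struct pipe_info *PIPE_I(struct inode *inode)
{
    return &container_of(inode, struct pipe_inode, vfs_inode)->info;
}

/* Free the whole container, not just the embedded inode. */
void pipe_destroy_inode(struct inode *inode)
{
    free(container_of(inode, struct pipe_inode, vfs_inode));
}
```

In the kernel the allocate and destroy steps are what a filesystem supplies through its super-block operations, which is the per-filesystem boilerplate the message says could be shared.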

Davide Libenzi

Feb 15, 2003, 5:00:17 PM

On Sat, 15 Feb 2003, Werner Almesberger wrote:

> Davide Libenzi wrote:
> > What do you think about having timers through a file interface ?
>
> Maybe I'm missing something obvious, but couldn't you simply
> do this with a signal handler that writes to (a) pipe(s) ?

You could do that, although once you start having many timers things
might get messy.

- Davide

Davide Libenzi

Feb 15, 2003, 5:30:13 PM

On Sat, 15 Feb 2003, James Antill wrote:

> > I would personally like it a lot to have timer events available on
> > pollable fds. Am I alone in this ?
>
> Think of "timer events" as a single TCP connection, so you have...
>
> time X: empty
> time X+Y: timed event "Arrives"
> time X+Z: timed event "Arrives"
>
> ...at which point it's pretty obvious that if you "poll" the timer
> event queue from anytime before X+Y it'll be empty, and anytime after
> X+Y it'll be "full". There isn't any point in being able to distinguish
> between the events X+Y and X+Z, you only need to know a timed event has
> occurred so you should process all timed events that are needed.
> At which point you just need to work out the difference between X and
> X+Y, and pass that to poll/sigtimedwait/etc.

I'm sorry, I'm a bit confused. What's the point here ?

Davide Libenzi

Feb 15, 2003, 6:40:07 PM

On Sat, 15 Feb 2003, Linus Torvalds wrote:

>
> On Fri, 14 Feb 2003, Davide Libenzi wrote:
> > >
> > > Some of it can be pulled in. However, the way the dynamic inode allocation
> > > works, different kinds of inodes _have_ to have different superblocks,
> > > since that's the level where the inode allocation and caching works. So
> > > the fake inodes for a pipe, for example, are _not_ the same as the fake
> > > inodes for the sigfd's. So not all of it is shared.
> >
> > Superblocks will be different, but their "fake" functionality can be
> > shared. A few parameters like file system name, file system magic number,
> > root name, ... will be able to do the trick :
> >
> > > fakefs_t fakefs_create(char const *root, char const *name, unsigned long magic);
> > struct inode *fakefs_new_inode(fakefs_t fkfs);
> > void fakefs_close(fakefs_t fkfs);
>
> I'd love to see this. I agree that a fair amount of this should be
> shareable with the pipe and socket code, for example. But I would call it
> "virtual" instead of "fake", since there is nothing fake about the inode.
> A pipe inode is one of the most fundamental and basic things in UNIX, it's
> not "fake" just because it doesn't live on a harddisk.

Would something like the following work for all those cases (not tested,
not even compiled)?

- Davide

#define VIRTFS_NAME_MAX 128
#define VIRTFS_ROOT_MAX 128

typedef struct virtfs {
	char name[VIRTFS_NAME_MAX];
	char root[VIRTFS_ROOT_MAX];
	unsigned long magic;
	struct vfsmount *mnt;
	struct file_system_type fst;
	struct dentry_operations dop;
} *virtfs_t;

/* Never cache unused dentries of a virtual fs. */
static int virtfs_delete_dentry(struct dentry *dentry) {

	return 1;
}

static struct super_block *virtfs_get_sb(struct file_system_type *fs_type,
					 int flags, char *dev_name, void *data) {
	/* fst is embedded in struct virtfs, so the container can be recovered. */
	virtfs_t vfs = container_of(fs_type, struct virtfs, fst);

	return get_sb_pseudo(fs_type, vfs->root, NULL, vfs->magic);
}

virtfs_t virtfs_create(char const *root, char const *name, unsigned long magic) {
	int error;
	virtfs_t vtfs;

	if (!(vtfs = kmalloc(sizeof(struct virtfs), GFP_KERNEL)))
		return ERR_PTR(-ENOMEM);

	/* The memset plus "size - 1" keeps both strings NUL terminated. */
	memset(vtfs, 0, sizeof(*vtfs));
	strncpy(vtfs->root, root, sizeof(vtfs->root) - 1);
	strncpy(vtfs->name, name, sizeof(vtfs->name) - 1);
	vtfs->magic = magic;
	vtfs->fst.name = vtfs->name;
	vtfs->fst.get_sb = virtfs_get_sb;
	vtfs->fst.kill_sb = kill_anon_super;
	vtfs->dop.d_delete = virtfs_delete_dentry;

	error = register_filesystem(&vtfs->fst);
	if (error)
		goto eexit_1;

	vtfs->mnt = kern_mount(&vtfs->fst);
	error = PTR_ERR(vtfs->mnt);
	if (IS_ERR(vtfs->mnt))
		goto eexit_2;

	return vtfs;

eexit_2:
	unregister_filesystem(&vtfs->fst);
eexit_1:
	kfree(vtfs);
	return ERR_PTR(error);
}

int virtfs_new_file(virtfs_t vtfs, struct inode **ninode, struct file **nfile) {
	int error;
	struct qstr this;
	struct dentry *dentry;
	struct inode *inode;
	struct file *file;
	char name[32];

	error = -ENFILE;
	file = get_empty_filp();
	if (!file)
		goto eexit_1;

	/* new_inode() takes the superblock and returns NULL on failure. */
	error = -ENOMEM;
	inode = new_inode(vtfs->mnt->mnt_sb);
	if (!inode)
		goto eexit_2;

	sprintf(name, "[%lu]", inode->i_ino);
	this.name = name;
	this.len = strlen(name);
	this.hash = inode->i_ino;
	dentry = d_alloc(vtfs->mnt->mnt_sb->s_root, &this);
	if (!dentry)
		goto eexit_3;
	dentry->d_op = &vtfs->dop;
	d_add(dentry, inode);
	file->f_vfsmnt = mntget(vtfs->mnt);
	file->f_dentry = dget(dentry);

	*nfile = file;
	*ninode = inode;

	return 0;
eexit_3:
	iput(inode);
eexit_2:
	put_filp(file);
eexit_1:
	return error;
}

void virtfs_close(virtfs_t vtfs) {

	unregister_filesystem(&vtfs->fst);
	mntput(vtfs->mnt);
	kfree(vtfs);
}

Werner Almesberger

Feb 15, 2003, 7:20:11 PM

Davide Libenzi wrote:
> You could do that, even if when you start having many timers things might
> get messy.

Manage a list of pending timers, schedule a signal for the next
one (or, if you wish, launch a thread), etc. All that is pretty
standard stuff that can be hidden in some library function, and
you can even steal a lot of the code from the kernel :-)

It would be useful, though, to have something like the
"overwrite" function I described later in this thread, in case
there is a single fd that can accumulate more timer expirations
between reads, than fit into the pipe/queue. (Admittedly a bit
of a fringe scenario.)

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina w...@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

Ingo Oeser

Feb 16, 2003, 10:00:29 AM

Hi there,

On Sat, Feb 15, 2003 at 06:59:47PM +0100, Abramo Bagnara wrote:
> We might call this thread "the neverending misunderstanding" ;-)

;-)

> My message was a proposal about a universal solution for
> control/out-of-band streams on pseudo-files (like the Linus sig fd,
> devices fd, socket, proc files, etc.) as a way to communicate between
> user space and kernel space.
>
> I.e. something that might replace ioctl/fcntl mess giving same (and
> more) flexibility and power (extending the 'everything is a file'
> concept also to control data).
>
> This is *not* something I'd propose for user space (where we definitely
> have many good ways to achieve these results).

So you propose channels that are defined ONLY by the kernel? That
would make much more sense and could replace ioctl/fcntl indeed,
if we add one more syscall: sys_transact().

long transact(int fd, const void *request_buffer, const size_t request_size,
void *result_buffer, const size_t result_size);

This syscall works like read/write with some exceptions:

(Basic features)
- short reads/short writes are not allowed
- EINTR will not happen.
- the effects done to the fd and the underlying object must be
  undone if the request fails
- it is optional (file operation) and can return -1/errno=ENOSYS,
  if not implemented by the underlying kernel object.
- supplying (fd, NULL, 0, NULL, 0) checks for existence (NOP).

(Advanced features)
- supplying (fd, REQUEST, sizeof(REQUEST), MAP_FAILED, 0)
  returns the expected return buffer size for this request (or
  zero, if no result will be returned). May return -ENXIO, if
  this checking is not supported or -EINVAL, if that request
  is not supported by the underlying kernel object.

- supplying (fd, REQUEST, sizeof(REQUEST), NULL, 0)
  issues a request without a result (e.g. a COMMAND) or
  signals that the user is not interested in any detailed
  results (that variant might be supported by the kernel object or not)

- The two buffers might alias partially or as a whole.
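
The calling conventions listed above can be tried out with a user-space mock. This is only a sketch: transact() does not exist as a system call, `mock_transact` and its single "GETVER" request are made up for illustration, and two details the proposal leaves open are assumptions here (errors are returned as negative errno values, as a kernel file operation would, and a successful call returns the number of result bytes).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>	/* MAP_FAILED */

/* Hypothetical stand-in for one kernel object's transact operation.
 * It understands a single made-up request, "GETVER" (including its
 * terminating NUL), and answers with one int. fd is ignored here. */
static long mock_transact(int fd, const void *req, size_t req_size,
			  void *res, size_t res_size)
{
	static const char getver[] = "GETVER";
	const int version = 1;

	(void)fd;
	if (req == NULL && req_size == 0 && res == NULL && res_size == 0)
		return 0;			/* existence check (NOP) */
	if (req == NULL || req_size != sizeof(getver) ||
	    memcmp(req, getver, sizeof(getver)) != 0)
		return -EINVAL;			/* unsupported request */
	if (res == MAP_FAILED && res_size == 0)
		return (long)sizeof(version);	/* result size query */
	if (res == NULL && res_size == 0)
		return 0;			/* caller wants no result */
	if (res_size < sizeof(version))
		return -EINVAL;			/* no short results */
	/* The request was fully examined above, so res may alias req. */
	memcpy(res, &version, sizeof(version));
	return (long)sizeof(version);
}
```

A caller would typically issue the MAP_FAILED size query first, allocate that many bytes, and then repeat the request with a real result buffer.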

With that additional semantic we could replace all ioctl() too,
where an ioctl() is used only to get this semantic.

Other OSes have an ioctl() mechanism like that syscall and we
implement it by destroying the argument in IORW-ioctls.

With that in place, I could live without ioctl() and fcntl().
That would also destroy all the interfaces that require you to
read after a write, because it needs transaction semantics
(sg.c comes to mind).

Regards

Ingo Oeser
--
Science is what we can tell a computer. Art is everything else. --- D.E.Knuth

Jan Harkes

Feb 16, 2003, 9:40:09 PM

On Sat, Feb 15, 2003 at 12:08:10PM -0800, Linus Torvalds wrote:
> it could. It still uses multiple allocations, doing a _separate_
> allocation for the small "pipe_inode_info" instead of doing the embedding
> trick.

I believe that is because of named pipes. These are sharing the inode
with the filesystem in which the named pipe is stored.

Jan

Jamie Lokier

Feb 16, 2003, 11:50:07 PM

Werner Almesberger wrote:
> Davide Libenzi wrote:
> > What do you think about having timers through a file interface ?
>
> Maybe I'm missing something obvious, but couldn't you simply
> do this with a signal handler that writes to (a) pipe(s) ?

You can do that, and sometimes it is done, but the point is to provide
a mechanism which is _fast_, as epoll() is.
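
For reference, the signal-handler-plus-pipe approach being compared against can be sketched with standard POSIX calls only. The helper names `forward_signal`, `signal_pipe_open` and `signal_pipe_wait` are hypothetical; the sketch assumes one process-wide pipe and signal numbers that fit in one byte. Each delivery costs a handler invocation plus a write() and a read(), which is the overhead the remark about speed refers to.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static int sig_pipe[2] = { -1, -1 };	/* [0] read end, [1] write end */

/* Async-signal-safe handler: forward the signal number as one byte. */
static void forward_signal(int sig)
{
	int saved_errno = errno;
	unsigned char b = (unsigned char)sig;

	if (write(sig_pipe[1], &b, 1) < 0) {
		/* Pipe full: this notification is dropped. */
	}
	errno = saved_errno;
}

/* Route sig into the pipe; returns the fd to poll() on, or -1. */
static int signal_pipe_open(int sig)
{
	struct sigaction sa;

	assert(sig > 0 && sig < 256);
	if (pipe(sig_pipe) < 0)
		return -1;
	/* Non-blocking, so the handler can never stall on a full pipe. */
	if (fcntl(sig_pipe[0], F_SETFL, O_NONBLOCK) < 0 ||
	    fcntl(sig_pipe[1], F_SETFL, O_NONBLOCK) < 0)
		return -1;
	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = forward_signal;
	sigemptyset(&sa.sa_mask);
	if (sigaction(sig, &sa, NULL) < 0)
		return -1;
	return sig_pipe[0];
}

/* Wait up to timeout_ms; returns the signal number, 0 on timeout, -1 on error. */
static int signal_pipe_wait(int fd, int timeout_ms)
{
	struct pollfd pfd;
	unsigned char b;
	int n;

	pfd.fd = fd;
	pfd.events = POLLIN;
	pfd.revents = 0;
	do {
		n = poll(&pfd, 1, timeout_ms);
	} while (n < 0 && errno == EINTR);
	if (n <= 0)
		return n;
	return read(fd, &b, 1) == 1 ? (int)b : -1;
}
```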

-- Jamie

Daniel Heater

Feb 17, 2003, 10:30:16 AM

* Davide Libenzi (dav...@xmailserver.org) wrote:
> On Thu, 13 Feb 2003, Linus Torvalds wrote:
>
> > > > One of the reasons for the "flags" field (which is not unused) was because
> > > > I thought it might have extensions for things like alarms etc.
> > >
> > > I was thinking more like :
> > >
> > > int timerfd(int timeout, int oneshot);
> >
> > It could be a separate system call, ...
>

> I would personally like it a lot to have timer events available on
> pollable fds. Am I alone in this ?

I currently do something similar to this with a driver for an 82c54 timer,
and for a couple of hardware timer implementations.

I create a fd for each timer in the device. write() sets the timer count.
read() reads back the current timer count. select() with the timer's fd
as the exceptionfd argument is used to poll for a timer expiration. With
this hardware, the count is automatically reloaded and continues counting.

With this interface, I can write a simple app that waits for any number of
events on file descriptors using select, but will also timeout periodically
to do some housekeeping data. I just switch on the file descriptor when I
come out of the select to decide what needs to be done.

This seems simpler to me in many cases where I can allow some drift in doing
the timeout housekeeping because the rest of the code need not be concerned
with getting preempted by a signal, i.e. there is no need to lock data structures
because there is always only one thread of control.
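
The dispatch pattern described above might look roughly like this. It is a sketch under assumptions, not the driver interface itself: the real timer fd is reported through select()'s exception set, whereas this version watches both descriptors for readability so that ordinary pipes can stand in for the devices, and `wait_event` and `enum event` are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

enum event { EV_NONE, EV_DATA, EV_TIMER };

/* Block until data_fd or timer_fd is readable and report which one.
 * The timer is checked first so that housekeeping is not starved by
 * a busy data descriptor. One byte is consumed per reported event. */
static enum event wait_event(int data_fd, int timer_fd)
{
	fd_set rfds;
	int maxfd = data_fd > timer_fd ? data_fd : timer_fd;
	char c;

	assert(data_fd >= 0 && timer_fd >= 0);
	FD_ZERO(&rfds);
	FD_SET(data_fd, &rfds);
	FD_SET(timer_fd, &rfds);
	if (select(maxfd + 1, &rfds, NULL, NULL, NULL) <= 0)
		return EV_NONE;
	if (FD_ISSET(timer_fd, &rfds))
		return read(timer_fd, &c, 1) == 1 ? EV_TIMER : EV_NONE;
	if (FD_ISSET(data_fd, &rfds))
		return read(data_fd, &c, 1) == 1 ? EV_DATA : EV_NONE;
	return EV_NONE;
}
```

The caller loops on wait_event() and switches on the result; since everything runs in one thread of control, the housekeeping branch can touch shared state without locking.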

daniel.

James Antill

Feb 17, 2003, 6:40:18 PM

Davide Libenzi <dav...@xmailserver.org> writes:

> On Sat, 15 Feb 2003, James Antill wrote:
>
> > > I would personally like it a lot to have timer events available on
> > > pollable fds. Am I alone in this ?
> >

> > Think of "timer events" as a single TCP connection, so you have...
>

> I'm sorry, I'm a bit confused. What's the point here ?

Just that although you could have each timer event on separate
pollable fds it's much the same as having separate events for 1
byte, 2 bytes, and 4 bytes available on a network socket (i.e. you are
much better off keeping that all in user space).

--
# James Antill -- ja...@and.org
:0:
* ^From: .*james@and\.org
/dev/null