[PATCH] driver-core: devtmpfs - driver core maintained /dev tmpfs

Kay Sievers

30 Apr 2009, 09:30:17

From: Kay Sievers <kay.s...@vrfy.org>
Subject: driver-core: devtmpfs - driver core maintained /dev tmpfs

Devtmpfs lets the kernel create a tmpfs very early at kernel
initialization, before any driver core device is registered. Every
device with a major/minor will have a device node created in this
tmpfs instance. After the rootfs is mounted by the kernel, the
populated tmpfs is mounted at /dev. In initramfs, it can be moved
to the manually mounted root filesystem before /sbin/init is
executed.

The tmpfs instance can be changed and altered by userspace at any time,
and in any way needed - just like today's udev-mounted tmpfs. Unmodified
udev versions will run just fine on top of it, and will recognize an
already existing kernel-created device node and use it.
The default node permissions are root:root 0600. The driver core will
remove the device node when the device goes away only if none of these
values have been changed by userspace. If the device node was altered
by udev, by applying the appropriate permissions and ownership, it will
need to be removed by udev - just as it usually works today.

This makes init=/bin/sh work without any further userspace support.
/dev will be fully populated and dynamic, and always reflect the current
device state of the kernel. Especially in the face of the already
implemented dynamic device numbers for block devices, this can be very
helpful in a rescue situation, where static device nodes no longer
work.
Custom, embedded-like systems should be able to use this as a dynamic
/dev directory without any need for additional userspace tools.

With the kernel-populated /dev, existing initramfs or kernel-mount
bootup logic can be optimized to be more efficient, and not to require a
full coldplug run, which is currently needed to bootstrap the initial
/dev directory content before continuing to bring up the rest of
the system. There will be no missed events to replay, because /dev is
available before the first kernel device is registered with the core.
A coldplug run can take, depending on the speed of the system and the
number of devices which need to be handled, from one to several seconds.

Signed-off-by: Kay Sievers <kay.s...@vrfy.org>
Signed-off-by: Jan Blunck <jbl...@suse.de>
Signed-off-by: Greg Kroah-Hartman <gre...@suse.de>
---

--- a/block/bsg.c
+++ b/block/bsg.c
@@ -1062,6 +1062,11 @@ EXPORT_SYMBOL_GPL(bsg_register_queue);

static struct cdev bsg_cdev;

+static char *bsg_nodename(struct device *dev)
+{
+ return kasprintf(GFP_KERNEL, "bsg/%s", dev_name(dev));
+}
+
static int __init bsg_init(void)
{
int ret, i;
@@ -1082,6 +1087,7 @@ static int __init bsg_init(void)
ret = PTR_ERR(bsg_class);
goto destroy_kmemcache;
}
+ bsg_class->nodename = bsg_nodename;

ret = alloc_chrdev_region(&devid, 0, BSG_MAX_DEVS, "bsg");
if (ret)
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -8,6 +8,23 @@ config UEVENT_HELPER_PATH
Path to uevent helper program forked by the kernel for
every uevent.

+config DEVTMPFS
+ bool "Create a kernel maintained /dev tmpfs (EXPERIMENTAL)"
+ depends on HOTPLUG
+ help
+ This creates a tmpfs filesystem at bootup and mounts it
+ at /dev. The kernel driver core creates device
+ nodes for all registered devices in that filesystem. All device
+ nodes are owned by root and have the default mode of 0600.
+ Userspace can add and delete the nodes as needed. This is
+ intended to simplify bootup, and make it possible to delay
+ the initial coldplug at bootup done by udev in userspace.
+ It should also provide a simpler way for rescue systems
+ to bring up a kernel with dynamic major/minor numbers.
+ Meaningful symlinks, permissions and device ownership must
+ still be handled by userspace.
+ If unsure, say N here.
+
config STANDALONE
bool "Select only drivers that don't need compile-time external firmware" if EXPERIMENTAL
default y
--- a/drivers/base/Makefile
+++ b/drivers/base/Makefile
@@ -4,6 +4,7 @@ obj-y := core.o sys.o bus.o dd.o \
driver.o class.o platform.o \
cpu.o firmware.o init.o map.o devres.o \
attribute_container.o transport_class.o
+obj-$(CONFIG_DEVTMPFS) += devtmpfs.o
obj-y += power/
obj-$(CONFIG_HAS_DMA) += dma-mapping.o
obj-$(CONFIG_ISA) += isa.o
--- a/drivers/base/base.h
+++ b/drivers/base/base.h
@@ -134,3 +134,9 @@ static inline void module_add_driver(str
struct device_driver *drv) { }
static inline void module_remove_driver(struct device_driver *drv) { }
#endif
+
+#ifdef CONFIG_DEVTMPFS
+extern int devtmpfs_init(void);
+#else
+static inline int devtmpfs_init(void) { return 0; }
+#endif
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -161,10 +161,18 @@ static int dev_uevent(struct kset *kset,
struct device *dev = to_dev(kobj);
int retval = 0;

- /* add the major/minor if present */
+ /* add device node properties if present */
if (MAJOR(dev->devt)) {
+ const char *tmp;
+ const char *name;
+
add_uevent_var(env, "MAJOR=%u", MAJOR(dev->devt));
add_uevent_var(env, "MINOR=%u", MINOR(dev->devt));
+ name = device_get_nodename(dev, &tmp);
+ if (name) {
+ add_uevent_var(env, "DEVNAME=%s", name);
+ kfree(tmp);
+ }
}

if (dev->type && dev->type->name)
@@ -912,6 +920,8 @@ int device_add(struct device *dev)
error = device_create_sys_dev_entry(dev);
if (error)
goto devtattrError;
+
+ devtmpfs_create_node(dev);
}

error = device_add_class_symlinks(dev);
@@ -1055,6 +1065,7 @@ void device_del(struct device *dev)
if (parent)
klist_del(&dev->p->knode_parent);
if (MAJOR(dev->devt)) {
+ devtmpfs_delete_node(dev);
device_remove_sys_dev_entry(dev);
device_remove_file(dev, &devt_attr);
}
@@ -1125,6 +1136,47 @@ static struct device *next_device(struct
}

/**
+ * device_get_nodename - path of device node file
+ * @dev: device
+ * @tmp: possibly allocated string
+ *
+ * Return the relative path of a possible device node.
+ * Non-default names may need to allocate memory to compose
+ * a name. This memory is returned in tmp and needs to be
+ * freed by the caller.
+ */
+const char *device_get_nodename(struct device *dev, const char **tmp)
+{
+ char *s;
+
+ *tmp = NULL;
+
+ /* the device type may provide a specific name */
+ if (dev->type && dev->type->nodename)
+ *tmp = dev->type->nodename(dev);
+ if (*tmp)
+ return *tmp;
+
+ /* the class may provide a specific name */
+ if (dev->class && dev->class->nodename)
+ *tmp = dev->class->nodename(dev);
+ if (*tmp)
+ return *tmp;
+
+ /* return name without allocation, tmp == NULL */
+ if (strchr(dev_name(dev), '!') == NULL)
+ return dev_name(dev);
+
+ /* replace '!' in the name with '/' */
+ *tmp = kstrdup(dev_name(dev), GFP_KERNEL);
+ if (!*tmp)
+ return NULL;
+ while ((s = strchr(*tmp, '!')))
+ s[0] = '/';
+ return *tmp;
+}
+
+/**
* device_for_each_child - device child iterator.
* @parent: parent struct device.
* @data: data for the callback.
--- /dev/null
+++ b/drivers/base/devtmpfs.c
@@ -0,0 +1,337 @@
+/*
+ * /dev tmpfs device nodes
+ *
+ * Copyright (C) 2009, Kay Sievers <kay.s...@vrfy.org>
+ *
+ * During bootup, before any driver core device is registered, a tmpfs
+ * filesystem is created. Every device which requests a devno will
+ * create a device node in this filesystem. The node is named after
+ * the name of the device, or the subsystem can provide a custom name
+ * for the node.
+ *
+ * All devices are owned by root. This is intended to simplify bootup, and
+ * make it possible to delay the initial coldplug done by udev in userspace.
+ *
+ * It should also provide a simpler way for rescue systems to bring up a
+ * kernel with dynamic major/minor numbers.
+ */
+
+#include <linux/kernel.h>
+#include <linux/syscalls.h>
+#include <linux/mount.h>
+#include <linux/device.h>
+#include <linux/genhd.h>
+#include <linux/namei.h>
+
+static struct vfsmount *dev_mnt;
+
+static int dev_mkdir(const char *name, mode_t mode)
+{
+ struct nameidata nd;
+ struct dentry *dentry;
+ int err;
+
+ err = vfs_path_lookup(dev_mnt->mnt_root, dev_mnt,
+ name, LOOKUP_PARENT, &nd);
+ if (err)
+ return err;
+
+ dentry = lookup_create(&nd, 1);
+ if (!IS_ERR(dentry)) {
+ err = vfs_mkdir(nd.path.dentry->d_inode, dentry, mode);
+ dput(dentry);
+ } else {
+ err = PTR_ERR(dentry);
+ }
+ mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+
+ path_put(&nd.path);
+ return err;
+}
+
+static int dev_symlink(const char *target, const char *name)
+{
+ struct nameidata nd;
+ struct dentry *dentry;
+ int err;
+
+ err = vfs_path_lookup(dev_mnt->mnt_root, dev_mnt,
+ name, LOOKUP_PARENT, &nd);
+ if (err)
+ return err;
+
+ dentry = lookup_create(&nd, 0);
+ if (!IS_ERR(dentry)) {
+ err = vfs_symlink(nd.path.dentry->d_inode, dentry, target);
+ dput(dentry);
+ }
+ mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+
+ path_put(&nd.path);
+ return err;
+}
+
+static int create_path(const char *nodepath)
+{
+ char *path;
+ struct nameidata nd;
+ int err = 0;
+
+ path = kstrdup(nodepath, GFP_KERNEL);
+ if (!path)
+ return -ENOMEM;
+
+ err = vfs_path_lookup(dev_mnt->mnt_root, dev_mnt,
+ path, LOOKUP_PARENT, &nd);
+ if (err == 0) {
+ struct dentry *dentry;
+
+ /* create directory right away */
+ dentry = lookup_create(&nd, 1);
+ if (!IS_ERR(dentry)) {
+ err = vfs_mkdir(nd.path.dentry->d_inode,
+ dentry, 0775);
+ dput(dentry);
+ }
+ mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+
+ path_put(&nd.path);
+ } else if (err == -ENOENT) {
+ char *s;
+
+ /* parent directories do not exist, create them */
+ s = path;
+ while (1) {
+ s = strchr(s, '/');
+ if (!s)
+ break;
+ s[0] = '\0';
+ err = dev_mkdir(path, 0755);
+ if (err && err != -EEXIST)
+ ;//break;
+ s[0] = '/';
+ s++;
+ }
+ }
+
+ kfree(path);
+ return err;
+}
+
+int devtmpfs_create_node(struct device *dev)
+{
+ const char *tmp = NULL;
+ const char *nodename;
+ mode_t mode;
+ struct nameidata nd;
+ struct dentry *dentry;
+ int err;
+
+ if (!dev_mnt)
+ return 0;
+
+ nodename = device_get_nodename(dev, &tmp);
+ if (!nodename)
+ return -ENOMEM;
+
+ if (dev->class == &block_class)
+ mode = S_IFBLK|0600;
+ else
+ mode = S_IFCHR|0600;
+
+ err = vfs_path_lookup(dev_mnt->mnt_root, dev_mnt,
+ nodename, LOOKUP_PARENT, &nd);
+ if (err == -ENOENT) {
+ /* create missing parent directories */
+ create_path(nodename);
+ err = vfs_path_lookup(dev_mnt->mnt_root, dev_mnt,
+ nodename, LOOKUP_PARENT, &nd);
+ if (err)
+ goto out_name;
+ }
+
+ dentry = lookup_create(&nd, 0);
+ if (!IS_ERR(dentry)) {
+ err = vfs_mknod(nd.path.dentry->d_inode,
+ dentry, mode, dev->devt);
+ dput(dentry);
+ } else {
+ err = PTR_ERR(dentry);
+ }
+ mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+
+ path_put(&nd.path);
+out_name:
+ kfree(tmp);
+ return err;
+}
+
+static int dev_rmdir(const char *name)
+{
+ struct nameidata nd;
+ struct dentry *dentry;
+ int err;
+
+ err = vfs_path_lookup(dev_mnt->mnt_root, dev_mnt,
+ name, LOOKUP_PARENT, &nd);
+ if (err)
+ return err;
+
+ mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
+ dentry = lookup_one_len(nd.last.name, nd.path.dentry, nd.last.len);
+ if (!IS_ERR(dentry)) {
+ if (dentry->d_inode)
+ err = vfs_rmdir(nd.path.dentry->d_inode, dentry);
+ else
+ err = -ENOENT;
+ dput(dentry);
+ } else {
+ err = PTR_ERR(dentry);
+ }
+ mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+
+ path_put(&nd.path);
+ return err;
+}
+
+static int delete_path(const char *nodepath)
+{
+ const char *path;
+ int err = 0;
+
+ path = kstrdup(nodepath, GFP_KERNEL);
+ if (!path)
+ return -ENOMEM;
+
+ while (1) {
+ char *base;
+
+ base = strrchr(path, '/');
+ if (!base)
+ break;
+ base[0] = '\0';
+ err = dev_rmdir(path);
+ if (err)
+ break;
+ }
+
+ kfree(path);
+ return err;
+}
+
+/* never delete a node that userspace has changed */
+static int dev_unchanged(struct device *dev, struct kstat *stat)
+{
+ if (stat->uid != 0 || stat->gid != 0)
+ return 0;
+ if (dev->class == &block_class) {
+ if (stat->mode != (S_IFBLK|0600))
+ return 0;
+ } else {
+ if (stat->mode != (S_IFCHR|0600))
+ return 0;
+ }
+ if (stat->rdev != dev->devt)
+ return 0;
+ return 1;
+}
+
+int devtmpfs_delete_node(struct device *dev)
+{
+ const char *tmp = NULL;
+ const char *nodename;
+ struct nameidata nd;
+ struct dentry *dentry;
+ struct kstat stat;
+ int deleted = 1;
+ int err;
+
+ if (!dev_mnt)
+ return 0;
+
+ nodename = device_get_nodename(dev, &tmp);
+ if (!nodename)
+ return -ENOMEM;
+
+ err = vfs_path_lookup(dev_mnt->mnt_root, dev_mnt,
+ nodename, LOOKUP_PARENT, &nd);
+ if (err)
+ goto out_name;
+
+ mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
+ dentry = lookup_one_len(nd.last.name, nd.path.dentry, nd.last.len);
+ if (!IS_ERR(dentry)) {
+ if (dentry->d_inode) {
+ err = vfs_getattr(nd.path.mnt, dentry, &stat);
+ if (!err && dev_unchanged(dev, &stat)) {
+ err = vfs_unlink(nd.path.dentry->d_inode,
+ dentry);
+ if (err == 0 || err == -ENOENT)
+ deleted = 1;
+ }
+ } else {
+ err = -ENOENT;
+ }
+ dput(dentry);
+ } else {
+ err = PTR_ERR(dentry);
+ }
+ mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+
+ path_put(&nd.path);
+ if (deleted && strchr(nodename, '/'))
+ delete_path(nodename);
+out_name:
+ kfree(tmp);
+ return err;
+}
+
+/* After the root filesystem is mounted by the kernel at /root, or the
+ * initramfs is extracted at /root, this tmpfs will be mounted at /root/dev.
+ */
+int devtmpfs_mount(const char *mountpoint)
+{
+ struct path path;
+ int err;
+
+ if (!dev_mnt)
+ return 0;
+
+ err = kern_path(mountpoint, LOOKUP_FOLLOW, &path);
+ if (err)
+ return err;
+ err = do_add_mount(dev_mnt, &path, 0, NULL);
+ if (err)
+ printk(KERN_INFO "devtmpfs: error mounting %i\n", err);
+ else
+ printk(KERN_INFO "devtmpfs: mounted\n");
+ path_put(&path);
+ return err;
+}
+
+/*
+ * Create tmpfs mount, created core devices will add their device
+ * nodes here.
+ */
+__init int devtmpfs_init(void)
+{
+ int err;
+
+ dev_mnt = do_kern_mount("tmpfs", 0, "devtmpfs", NULL);
+ if (IS_ERR(dev_mnt)) {
+ err = PTR_ERR(dev_mnt);
+ printk(KERN_ERR "devtmpfs: unable to initialize %i\n", err);
+ dev_mnt = NULL;
+ return -1;
+ }
+
+ /* create common files/directories */
+ dev_mkdir("pts", 0755);
+ dev_mkdir("shm", 01755);
+ dev_symlink("/proc/self/fd", "fd");
+ dev_symlink("/proc/self/fd/0", "stdin");
+ dev_symlink("/proc/self/fd/1", "stdout");
+ dev_symlink("/proc/self/fd/2", "stderr");
+ printk(KERN_INFO "devtmpfs: initialized\n");
+ return 0;
+}
--- a/drivers/base/init.c
+++ b/drivers/base/init.c
@@ -20,6 +20,7 @@
void __init driver_init(void)
{
/* These are the core pieces */
+ devtmpfs_init();
devices_init();
buses_init();
classes_init();
--- a/drivers/gpu/drm/drm_sysfs.c
+++ b/drivers/gpu/drm/drm_sysfs.c
@@ -70,6 +70,11 @@ static ssize_t version_show(struct class
CORE_MINOR, CORE_PATCHLEVEL, CORE_DATE);
}

+static char *drm_nodename(struct device *dev)
+{
+ return kasprintf(GFP_KERNEL, "dri/%s", dev_name(dev));
+}
+
static CLASS_ATTR(version, S_IRUGO, version_show, NULL);

/**
@@ -101,6 +106,8 @@ struct class *drm_sysfs_create(struct mo
if (err)
goto err_out_class;

+ class->nodename = drm_nodename;
+
return class;

err_out_class:
--- a/drivers/input/input.c
+++ b/drivers/input/input.c
@@ -1238,8 +1238,14 @@ static struct device_type input_dev_type
.uevent = input_dev_uevent,
};

+static char *input_nodename(struct device *dev)
+{
+ return kasprintf(GFP_KERNEL, "input/%s", dev_name(dev));
+}
+
struct class input_class = {
.name = "input",
+ .nodename = input_nodename,
};
EXPORT_SYMBOL_GPL(input_class);

--- a/drivers/media/dvb/dvb-core/dvbdev.c
+++ b/drivers/media/dvb/dvb-core/dvbdev.c
@@ -447,6 +447,15 @@ static int dvb_uevent(struct device *dev
return 0;
}

+static char *dvb_nodename(struct device *dev)
+{
+ struct dvb_device *dvbdev = dev_get_drvdata(dev);
+
+ return kasprintf(GFP_KERNEL, "dvb/adapter%d/%s%d",
+ dvbdev->adapter->num, dnames[dvbdev->type], dvbdev->id);
+}
+
+
static int __init init_dvbdev(void)
{
int retval;
@@ -469,6 +478,7 @@ static int __init init_dvbdev(void)
goto error;
}
dvb_class->dev_uevent = dvb_uevent;
+ dvb_class->nodename = dvb_nodename;
return 0;

error:
--- a/drivers/usb/core/usb.c
+++ b/drivers/usb/core/usb.c
@@ -305,10 +305,21 @@ static struct dev_pm_ops usb_device_pm_o

#endif /* CONFIG_PM */

+
+static char *usb_nodename(struct device *dev)
+{
+ struct usb_device *usb_dev;
+
+ usb_dev = to_usb_device(dev);
+ return kasprintf(GFP_KERNEL, "bus/usb/%03d/%03d",
+ usb_dev->bus->busnum, usb_dev->devnum);
+}
+
struct device_type usb_device_type = {
.name = "usb_device",
.release = usb_release_dev,
.uevent = usb_dev_uevent,
+ .nodename = usb_nodename,
.pm = &usb_device_pm_ops,
};

--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -194,6 +194,7 @@ struct class {
struct kobject *dev_kobj;

int (*dev_uevent)(struct device *dev, struct kobj_uevent_env *env);
+ char *(*nodename)(struct device *dev);

void (*class_release)(struct class *class);
void (*dev_release)(struct device *dev);
@@ -289,6 +290,7 @@ struct device_type {
const char *name;
struct attribute_group **groups;
int (*uevent)(struct device *dev, struct kobj_uevent_env *env);
+ char *(*nodename)(struct device *dev);
void (*release)(struct device *dev);

int (*suspend)(struct device *dev, pm_message_t state);
@@ -369,6 +371,7 @@ struct device_dma_parameters {

struct device {
struct device *parent;
+ char *(*nodename)(struct device *dev);

struct device_private *p;

@@ -496,6 +499,7 @@ extern struct device *device_find_child(
extern int device_rename(struct device *dev, char *new_name);
extern int device_move(struct device *dev, struct device *new_parent,
enum dpm_order dpm_order);
+extern const char *device_get_nodename(struct device *dev, const char **tmp);

/*
* Root device objects for grouping under /sys/devices
@@ -553,6 +557,16 @@ extern void put_device(struct device *de

extern void wait_for_device_probe(void);

+#ifdef CONFIG_DEVTMPFS
+extern int devtmpfs_create_node(struct device *dev);
+extern int devtmpfs_delete_node(struct device *dev);
+extern int devtmpfs_mount(const char *mountpoint);
+#else
+static inline int devtmpfs_create_node(struct device *dev) { return 0; }
+static inline int devtmpfs_delete_node(struct device *dev) { return 0; }
+static inline int devtmpfs_mount(const char *mountpoint) { return 0; }
+#endif
+
/* drivers/base/power/shutdown.c */
extern void device_shutdown(void);

--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -42,6 +42,8 @@ static inline struct shmem_inode_info *S
return container_of(inode, struct shmem_inode_info, vfs_inode);
}

+extern int init_tmpfs(void);
+
#ifdef CONFIG_TMPFS_POSIX_ACL
int shmem_permission(struct inode *, int);
int shmem_acl_init(struct inode *, struct inode *);
--- a/init/do_mounts.c
+++ b/init/do_mounts.c
@@ -414,7 +414,7 @@ void __init prepare_namespace(void)

mount_root();
out:
+ devtmpfs_mount("dev");
sys_mount(".", "/", NULL, MS_MOVE, NULL);
sys_chroot(".");
}
-
--- a/init/initramfs.c
+++ b/init/initramfs.c
@@ -8,6 +8,7 @@
#include <linux/dirent.h>
#include <linux/syscalls.h>
#include <linux/utime.h>
+#include <linux/device.h>

static __initdata char *message;
static void __init error(char *x)
@@ -604,6 +605,7 @@ static int __init populate_rootfs(void)
printk(KERN_EMERG "%s\n", err);
} else {
printk(" done\n");
+ devtmpfs_mount("dev");
}
free_initrd();
#endif
--- a/init/main.c
+++ b/init/main.c
@@ -64,6 +64,7 @@
#include <linux/idr.h>
#include <linux/ftrace.h>
#include <linux/async.h>
+#include <linux/shmem_fs.h>
#include <trace/boot.h>

#include <asm/io.h>
@@ -778,6 +779,7 @@ static void __init do_basic_setup(void)
init_workqueues();
cpuset_init_smp();
usermodehelper_init();
+ init_tmpfs();
driver_init();
init_irq_proc();
do_initcalls();
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2519,7 +2519,7 @@ static struct file_system_type tmpfs_fs_
.kill_sb = kill_litter_super,
};

-static int __init init_tmpfs(void)
+int __init init_tmpfs(void)
{
int error;

@@ -2687,5 +2687,3 @@ int shmem_zero_setup(struct vm_area_stru
vma->vm_ops = &shmem_vm_ops;
return 0;
}
-
-module_init(init_tmpfs)
--- a/sound/sound_core.c
+++ b/sound/sound_core.c
@@ -27,6 +27,11 @@ MODULE_DESCRIPTION("Core sound module");
MODULE_AUTHOR("Alan Cox");
MODULE_LICENSE("GPL");

+static char *sound_nodename(struct device *dev)
+{
+ return kasprintf(GFP_KERNEL, "snd/%s", dev_name(dev));
+}
+
static int __init init_soundcore(void)
{
int rc;
@@ -41,6 +46,8 @@ static int __init init_soundcore(void)
return PTR_ERR(sound_class);
}

+ sound_class->nodename = sound_nodename;
+
return 0;
}


Andrew Morton

1 May 2009, 01:40:10

On Thu, 30 Apr 2009 15:23:42 +0200 Kay Sievers <kay.s...@vrfy.org> wrote:

> From: Kay Sievers <kay.s...@vrfy.org>
> Subject: driver-core: devtmpfs - driver core maintained /dev tmpfs
>
> Devtmpfs lets the kernel create a tmpfs very early at kernel
> initialization, before any driver core device is registered. Every
> device with a major/minor will have a device node created in this
> tmpfs instance. After the rootfs is mounted by the kernel, the
> populated tmpfs is mounted at /dev. In initramfs, it can be moved
> to the manually mounted root filesystem before /sbin/init is
> executed.

Lol, devfs.

> ...
>
> block/bsg.c | 6

> drivers/gpu/drm/drm_sysfs.c | 7
> drivers/input/input.c | 6
> drivers/media/dvb/dvb-core/dvbdev.c | 10 +
> drivers/usb/core/usb.c | 11 +

These five subsystems were updated, but there are so many others. Why
these five in particular?

> +const char *device_get_nodename(struct device *dev, const char **tmp)
> +{
> + char *s;
> +
> + *tmp = NULL;
> +
> + /* the device type may provide a specific name */
> + if (dev->type && dev->type->nodename)
> + *tmp = dev->type->nodename(dev);

dev->type->nodename() might have failed due to -ENOMEM, in which case
it seems wrong to assume that it returned NULL for <whatever reason you
thought it might want to return NULL>.

It's all a bit confused.
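
One way to remove the ambiguity, sketched here in userspace C, is to let
the callback return an error code and hand the name back through an
out-parameter, so that "no custom name" and "allocation failed" can be
told apart. Everything below (nodename_fn, get_nodename() and the three
example callbacks) is hypothetical and only illustrates the idea; it is
not the interface proposed in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical callback contract: return 0 with *out set to an
 * allocated name, 0 with *out == NULL if there is no custom name,
 * or a negative errno value on failure.
 */
typedef int (*nodename_fn)(const char *dev_name, char **out);

static char *dup_string(const char *s)
{
	size_t len = strlen(s) + 1;
	char *copy = malloc(len);

	if (copy)
		memcpy(copy, s, len);
	return copy;
}

/* example callback: places the node in an "input/" subdirectory */
static int input_nodename(const char *dev_name, char **out)
{
	size_t len = strlen("input/") + strlen(dev_name) + 1;

	*out = malloc(len);
	if (!*out)
		return -ENOMEM;
	snprintf(*out, len, "input/%s", dev_name);
	return 0;
}

/* example callback: the subsystem has no opinion about the name */
static int no_custom_name(const char *dev_name, char **out)
{
	(void)dev_name;
	*out = NULL;
	return 0;
}

/* example callback: simulates an allocation failure */
static int failing_nodename(const char *dev_name, char **out)
{
	(void)dev_name;
	*out = NULL;
	return -ENOMEM;
}

/* Returns 0 with *name set (caller frees), or a negative errno value. */
static int get_nodename(nodename_fn fn, const char *dev_name, char **name)
{
	int err;

	assert(dev_name != NULL && name != NULL);

	*name = NULL;
	if (fn) {
		err = fn(dev_name, name);
		if (err)
			return err;	/* real failure: do not fall back */
		if (*name)
			return 0;	/* subsystem supplied a name */
	}

	/* default: use the plain device name */
	*name = dup_string(dev_name);
	return *name ? 0 : -ENOMEM;
}
```

With this shape, a failing callback makes get_nodename() fail as well
instead of silently falling through to the default name.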

> + if (*tmp)
> + return *tmp;
> +
> + /* the class may provide a specific name */
> + if (dev->class && dev->class->nodename)
> + *tmp = dev->class->nodename(dev);
> + if (*tmp)
> + return *tmp;
> +
> + /* return name without allocation, tmp == NULL */
> + if (strchr(dev_name(dev), '!') == NULL)

s/  / /

> + return dev_name(dev);
> +
> + /* replace '!' in the name with '/' */
> + *tmp = kstrdup(dev_name(dev), GFP_KERNEL);
> + if (!*tmp)
> + return NULL;
> + while ((s = strchr(*tmp, '!')))
> + s[0] = '/';
> + return *tmp;
> +}
>

> ...

Greg KH

1 May 2009, 02:20:06

On Thu, Apr 30, 2009 at 10:29:00PM -0700, Andrew Morton wrote:
> On Thu, 30 Apr 2009 15:23:42 +0200 Kay Sievers <kay.s...@vrfy.org> wrote:
>
> > From: Kay Sievers <kay.s...@vrfy.org>
> > Subject: driver-core: devtmpfs - driver core maintained /dev tmpfs
> >
> > Devtmpfs lets the kernel create a tmpfs very early at kernel
> > initialization, before any driver core device is registered. Every
> > device with a major/minor will have a device node created in this
> > tmpfs instance. After the rootfs is mounted by the kernel, the
> > populated tmpfs is mounted at /dev. In initramfs, it can be moved
> > to the manually mounted root filesystem before /sbin/init is
> > executed.
>
> Lol, devfs.

Well, devfs "done right" with hopefully none of the vfs problems the
last devfs had. :)

> > block/bsg.c | 6
> > drivers/gpu/drm/drm_sysfs.c | 7
> > drivers/input/input.c | 6
> > drivers/media/dvb/dvb-core/dvbdev.c | 10 +
> > drivers/usb/core/usb.c | 11 +
>
> These five subsystems were updated, but there are so many others. Why
> these five in particular?

These are the ones that create a subdirectory in /dev/. None of the
others do.
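
As a rough userspace illustration of the naming described here: a
subsystem that wants a subdirectory supplies a prefix, and everything
else falls back to the plain device name, with '!' mapped to '/' as in
device_get_nodename() from the patch. struct model_dev and
model_nodename() are hypothetical names used only for this sketch, and
"cciss!c0d0" is just an example of a device name containing '!'.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical model of a device: its kernel name plus an optional
 * subdirectory supplied by its class (NULL if the class has none).
 */
struct model_dev {
	const char *name;
	const char *class_subdir;
};

/*
 * Write the /dev-relative node path into buf.
 * Returns 0 on success, -1 if buf is too small.
 */
static int model_nodename(const struct model_dev *dev, char *buf, size_t len)
{
	char *s;
	int n;

	assert(dev != NULL && dev->name != NULL && buf != NULL);

	/* the class may provide a specific name */
	if (dev->class_subdir) {
		n = snprintf(buf, len, "%s/%s", dev->class_subdir, dev->name);
		return (n < 0 || (size_t)n >= len) ? -1 : 0;
	}

	/* default: the device name, with '!' replaced by '/' */
	n = snprintf(buf, len, "%s", dev->name);
	if (n < 0 || (size_t)n >= len)
		return -1;
	while ((s = strchr(buf, '!')))
		*s = '/';
	return 0;
}
```

Under these assumptions an input device named "event0" ends up at
"input/event0", while a device without a class prefix keeps its name.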

> > +const char *device_get_nodename(struct device *dev, const char **tmp)
> > +{
> > + char *s;
> > +
> > + *tmp = NULL;
> > +
> > + /* the device type may provide a specific name */
> > + if (dev->type && dev->type->nodename)
> > + *tmp = dev->type->nodename(dev);
>
> dev->type->nodename() might have failed due to -ENOMEM, in which case
> it seems wrong to assume that it returned NULL for <whatever reason you
> thought it might want to return NULL>.
>
> It's all a bit confused.

I'll let Kay answer this one.

> > + if (*tmp)
> > + return *tmp;
> > +
> > + /* the class may provide a specific name */
> > + if (dev->class && dev->class->nodename)
> > + *tmp = dev->class->nodename(dev);
> > + if (*tmp)
> > + return *tmp;
> > +
> > + /* return name without allocation, tmp == NULL */
> > + if (strchr(dev_name(dev), '!') == NULL)
>
> s/  / /

good catch. I've edited that in the version of this patch in my tree.

thanks,

greg k-h

Andrew Morton

1 May 2009, 02:50:07

On Thu, 30 Apr 2009 23:17:01 -0700 Greg KH <gr...@kroah.com> wrote:

> On Thu, Apr 30, 2009 at 10:29:00PM -0700, Andrew Morton wrote:
> > On Thu, 30 Apr 2009 15:23:42 +0200 Kay Sievers <kay.s...@vrfy.org> wrote:
> >
> > > From: Kay Sievers <kay.s...@vrfy.org>
> > > Subject: driver-core: devtmpfs - driver core maintained /dev tmpfs
> > >
> > > Devtmpfs lets the kernel create a tmpfs very early at kernel
> > > initialization, before any driver core device is registered. Every
> > > device with a major/minor will have a device node created in this
> > > tmpfs instance. After the rootfs is mounted by the kernel, the
> > > populated tmpfs is mounted at /dev. In initramfs, it can be moved
> > > to the manually mounted root filesystem before /sbin/init is
> > > executed.
> >
> > Lol, devfs.
>
> Well, devfs "done right" with hopefully none of the vfs problems the
> last devfs had. :)

I think Adam Richter's devfs rewrite (which, iirc, was tmpfs-based)
would have fixed up these things. But it was never quite completed and
came when minds were already made up.

I don't understand why we need devfs2, really. What problems are
people having with the existing design?

> > > block/bsg.c | 6
> > > drivers/gpu/drm/drm_sysfs.c | 7
> > > drivers/input/input.c | 6
> > > drivers/media/dvb/dvb-core/dvbdev.c | 10 +
> > > drivers/usb/core/usb.c | 11 +
> >
> > These five subsystems were updated, but there are so many others. Why
> > these five in particular?
>
> These are the ones that create a subdirectory in /dev/ None of the
> others do.

oic.

Where is it determined that these subsystems create /dev subdirectories?
udev rules? If so, do we need to henceforth keep devfs2 (sorry, I
can't resist) in sync with udev?

Greg KH

1 May 2009, 03:00:17

On Thu, Apr 30, 2009 at 11:43:12PM -0700, Andrew Morton wrote:
> On Thu, 30 Apr 2009 23:17:01 -0700 Greg KH <gr...@kroah.com> wrote:
>
> > On Thu, Apr 30, 2009 at 10:29:00PM -0700, Andrew Morton wrote:
> > > On Thu, 30 Apr 2009 15:23:42 +0200 Kay Sievers <kay.s...@vrfy.org> wrote:
> > >
> > > > From: Kay Sievers <kay.s...@vrfy.org>
> > > > Subject: driver-core: devtmpfs - driver core maintained /dev tmpfs
> > > >
> > > > Devtmpfs lets the kernel create a tmpfs very early at kernel
> > > > initialization, before any driver core device is registered. Every
> > > > device with a major/minor will have a device node created in this
> > > > tmpfs instance. After the rootfs is mounted by the kernel, the
> > > > populated tmpfs is mounted at /dev. In initramfs, it can be moved
> > > > to the manually mounted root filesystem before /sbin/init is
> > > > executed.
> > >
> > > Lol, devfs.
> >
> > Well, devfs "done right" with hopefully none of the vfs problems the
> > last devfs had. :)
>
> I think Adam Richter's devfs rewrite (which, iirc, was tmpfs-based)
> would have fixed up these things. But it was never quite completed and
> came when minds were already made up.
>
> I don't understand why we need devfs2, really. What problems are
> > people having with the existing design?

Boot speed, boot speed, boot speed.

Oh, and reduction in complexity in init scripts, and saving embedded
systems a lot of effort to implement a dynamic /dev properly (have you
_seen_ what Android does to keep from having to ship udev? It's
horrible...)

> > > > block/bsg.c | 6
> > > > drivers/gpu/drm/drm_sysfs.c | 7
> > > > drivers/input/input.c | 6
> > > > drivers/media/dvb/dvb-core/dvbdev.c | 10 +
> > > > drivers/usb/core/usb.c | 11 +
> > >
> > > These five subsystems were updated, but there are so many others. Why
> > > these five in particular?
> >
> > These are the ones that create a subdirectory in /dev/ None of the
> > others do.
>
> oic.
>
> Where is it determined that these subsystems create /dev subdirectories?
> udev rules? If so, do we need to henceforth keep devfs2 (sorry, I
> can't resist) in sync with udev?

No, with this, udev rules can get simpler and remove these directory
names, keeping them only in one place, preventing anything from getting
out of sync.

thanks,

greg k-h

Andrew Morton

1 May 2009, 03:10:06

Why can't they ship udev?

> > > > > block/bsg.c | 6
> > > > > drivers/gpu/drm/drm_sysfs.c | 7
> > > > > drivers/input/input.c | 6
> > > > > drivers/media/dvb/dvb-core/dvbdev.c | 10 +
> > > > > drivers/usb/core/usb.c | 11 +
> > > >
> > > > These five subsystems were updated, but there are so many others. Why
> > > > these five in particular?
> > >
> > > These are the ones that create a subdirectory in /dev/ None of the
> > > others do.
> >
> > oic.
> >
> > Where is it determined that these subsystems create /dev subdirectories?
> > udev rules? If so, do we need to henceforth keep devfs2 (sorry, I
> > can't resist) in sync with udev?
>
> No, with this, udev rules can get simpler and remove these directory
> names, keeping them only in one place, preventing anything from getting
> out of sync.

This assumes that devtmpfs is enabled in config, yes?

Does that mean that we need two versions of udev out there, or can one
version be feasibly extended to handle both flavours of kernel?

Chris Wedgwood

1 May 2009, 03:10:08

On Thu, Apr 30, 2009 at 03:23:42PM +0200, Kay Sievers wrote:

> Devtmpfs lets the kernel create a tmpfs very early at kernel
> initialization, before any driver core device is registered. Every
> device with a major/minor will have a device node created in this
> tmpfs instance. After the rootfs is mounted by the kernel, the
> populated tmpfs is mounted at /dev. In initramfs, it can be moved to
> the manually mounted root filesystem before /sbin/init is executed.

Why can't the initramfs create /dev and populate it?

Alan Jenkins

1 May 2009, 06:30:13

Aren't you overreaching in your claims here? I'm sure you can't avoid
at least one coldplug run on a contemporary general purpose system,
because you lose so much of the functionality provided by udev. I'm
sure of that, but it would be nice if you could address it in the
changelog. And modern initramfs images require udev RUN rules to read
UUIDs and set up LVM.

I'm loving this for embedded, init=/bin/sh, and rescue floppies :-).
But I can't understand how you plan to use this as an optimisation.

And - I'm sure you must have considered this in a moment of madness -
do you know why we couldn't just start _udev_ "before the first kernel
device is registered with the core"?

Regards
Alan

Kay Sievers

1 May 2009, 07:00:10

On Fri, May 1, 2009 at 09:03, Andrew Morton <ak...@linux-foundation.org> wrote:
> On Thu, 30 Apr 2009 23:55:27 -0700 Greg KH <gr...@kroah.com> wrote:
>> On Thu, Apr 30, 2009 at 11:43:12PM -0700, Andrew Morton wrote:
>> > I don't understand why we need devfs2, really. What problems are
>> > people having with the existing design?

Bootup complexity: you need a rather complex userspace to get a
system booted. That is pretty annoying in the rescue case, where you
might not have any working device node for a current disk. The static
ones you may have on the rootfs image, now that we can have dynamic
block device minors, may not work at all.

>> Boot speed, boot speed, boot speed.

>> Oh, and reduction in complexity in init scripts, and saving embedded
>> systems a lot of effort to implement a dynamic

It makes it possible to run udev in the background at bootup, parallel
to other things, instead of having a hard checkpoint to wait for,
before you can continue to bring up the rest.

>> /dev properly (have you
>> _seen_ what Android does to keep from having to ship udev? It's
>> horrible...)
>
> Why can't they ship udev?

Not sure, seems they avoid GPL code, if they can. Besides that it's
pretty slow to bootstrap /dev with stock udev. Most embedded systems
use busybox's mdev here.

>> > Where is it determined that these subsystems create /dev subdirectories?
>> > udev rules?

It should be defined in Documentation/devices.txt, and for some
subsystems not mentioned there, it is only in the default udev rules
set.

>> > If so, do we need to henceforth keep devfs2 (sorry, I
>> > can't resist) in sync with udev?

>> No, with this, udev rules can get simpler and remove these directory
>> names, keeping them only in one place, preventing anything from getting
>> out of sync.

Over the last few years, we synchronized all distros, and all ship
exactly the same naming rules today besides a few distro-specific
custom additions. So we have a common, well-defined setup today. With
devtmpfs, the final decision is still in udev rules, like it is today,
but with this, the kernel can pre-create a node, to which udev will
apply permissions/ownership.

Future versions of udev will be able to follow the kernel provided
device name hint, so drivers can provide a hint to userspace, without
the need to drop udev rules. Existing udev rules would still overwrite
the kernel-provided device node name hints. But this is all optional,
devtmpfs enabled today works just fine with any old udev version, and
udev should do the right thing on top.

> This assumes that devtmpfs is enabled in config, yes?

It's optional, yes.

> Does that means that we need two versions of udev out there, or can one
> version be feasibly extended to handle both flavours of kernel?

No, udev works just fine with or without it. When udev receives the
event, it looks if there is an already existing node, and if yes, it
just applies the defined permissions/ownership to it. The final
decisions are still made in udev. All ownerships must be handled in
userspace anyway, there is no way for the kernel to assign group
numbers.
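
The "existing node" handling described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `apply_node_policy` is a hypothetical helper name, not udev's actual code, and a real implementation would also verify the node's type and device number before reusing it.

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not udev's real code: if something already exists
 * at `path` (for example a node pre-created by devtmpfs), only apply the
 * ownership and mode chosen by the userspace rules.
 *
 * Returns 1 if an existing node was updated, 0 if nothing exists at `path`
 * (the caller would then have to create the node itself), -1 on error.
 */
int apply_node_policy(const char *path, mode_t mode, uid_t uid, gid_t gid)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return errno == ENOENT ? 0 : -1;

    /* Ownership first: chown() may clear set-id bits on some systems. */
    if (chown(path, uid, gid) != 0)
        return -1;

    /* Only the permission bits are passed on; the file type is untouched. */
    if (chmod(path, mode & 07777) != 0)
        return -1;

    return 1;
}
```

A rule such as "group disk, mode 0660" would then map to a call like `apply_node_policy("/dev/sda", 0660, 0, disk_gid)`, where `disk_gid` stands for whatever group ID userspace resolved.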

It does not really change anything visible for the normal distros,
besides that bootup can be made more efficient by running things in
parallel, and that there will be a working and correct /dev in the
rescue case.

For tiny, custom, embedded like setups, it might just work without any
userspace help, or with something very simple compared to what we need
today.

Thanks,
Kay

Alan Cox

1 May 2009, 07:10:09

> These are the ones that create a subdirectory in /dev/ None of the
> others do.

Yes they do - sound for example does.

Alan Cox

1 May 2009, 07:10:12

> Boot speed, boot speed, boot speed.
>
> Oh, and reduction in complexity in init scripts, and saving embedded
> systems a lot of effort to implement a dynamic /dev properly (have you
> _seen_ what Android does to keep from having to ship udev? It's
> horrible...)

This is nothing to do with udev or the kernel side of things. Your
argument seems to be

"Remove a user space problem and make it a kernel one stuffed in
unpageable RAM and less flexible"

It seems to me your actual problem is "my tools suck" and the fix for
that is to fix the tools not add more random kernel junk.

Kay Sievers

1 May 2009, 07:10:13

On Fri, May 1, 2009 at 13:01, Alan Cox <al...@lxorguk.ukuu.org.uk> wrote:
>> These are the ones that create a subdirectory in /dev/ None of the
>> others do.
>
> Yes they do - sound for example does.

Sound is already in there, in the posted version. And the few missing
are in my local tree.

Thanks,
Kay

Kay Sievers

1 May 2009, 07:20:08

On Fri, May 1, 2009 at 12:19, Alan Jenkins
<sourcej...@googlemail.com> wrote:
> On 4/30/09, Kay Sievers <kay.s...@vrfy.org> wrote:

>> With the kernel populated /dev, existing initramfs or kernel-mount
>> bootup logic can be optimized to be more efficient, and not to require a
>> full coldplug run, which is currently needed to bootstrap the initial
>> /dev directory content, before continuing bringing up the rest of
>> the system. There will be no missed events to replay, because /dev is
>> available before the first kernel device is registered with the core.
>> A coldplug run can take, depending on the speed of the system and the
>> amount of devices which need to be handled, from one to several seconds.
>
> Aren't you overreaching in your claims here? I'm sure you can't avoid
> at least one coldplug run on a contemporary general purpose system,

You still need coldplug, sure. But distros have 2 full coldplugs
today, one in initramfs, one in the rootfs.

> because you lose so much of the functionality provided by udev.

No, udev will do the same thing as it does today, it will not be
different, it's just that the coldplug can run in parallel with other
stuff, and the initramfs coldplug can be much simplified, or even
avoided.

> I'm
> sure of that, but it would be nice if you could address it in the
> changelog. And modern initramfs' require udev RUN rules to read UUIDs
> and set up LVM.

A simple uuid/label lookup will not need a full coldplug run in
initramfs. You just need to start udev, and wait for the links to show
up. Possibly triggering block events, but not the hundreds of
other devices.
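
As a rough illustration of "start udev, and wait for the links to show up", an initramfs tool could wait for a single path instead of a full coldplug run. The helper name `wait_for_path` and the polling loop are assumptions made for brevity; a real tool would more likely be event-driven.

```c
#include <time.h>
#include <unistd.h>

/*
 * Poll until `path` exists or roughly `timeout_ms` has elapsed.
 * Returns 0 when the path showed up, -1 on timeout.
 * access() follows symlinks, so a dangling link does not count.
 */
int wait_for_path(const char *path, long timeout_ms)
{
    const long step_ms = 10;
    struct timespec step = { 0, step_ms * 1000000L };
    long waited = 0;

    for (;;) {
        if (access(path, F_OK) == 0)
            return 0;
        if (waited >= timeout_ms)
            return -1;
        nanosleep(&step, NULL);
        waited += step_ms;
    }
}
```

For a root filesystem selected by UUID, the caller would wait on the matching entry under `/dev/disk/by-uuid/` and mount it once the call returns 0.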

> I'm loving this for embedded, init=/bin/sh, and rescue floppies :-).
> But I can't understand how you plan to use this as an optimisation.
>
> And - I'm sure you must have considered this in a moment of madness -
> do you know why we couldn't just start _udev_ "before the first kernel
> device is registered with the core"?

I think it's pretty obvious, that this is the fastest you can get for
/dev. And it's not about the time when udev is started, it's about
doing the stuff when the devices are created, and not to need to redo
it at bootup, or do it even twice.

Thanks,
Kay

Kay Sievers

1 May 2009, 07:20:05

On Fri, May 1, 2009 at 07:29, Andrew Morton <ak...@linux-foundation.org> wrote:

> dev->type->nodename() might have failed due to -ENOMEM, in which case
> it seems wrong to assume that it returned NULL for <whatever reason you
> thought it might want to return NULL>.
>
> It's all a bit confused.

This logic is only for providing a custom name hint. Only a few
devices need that at all. If the allocation fails, the default name
will be used, not the custom name.
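
The fallback described here can be sketched in plain userspace C. All names below (`nodename_fn`, `pick_node_name`, `input_hint`, `failing_hint`) are hypothetical and do not reflect the signatures in the patch; the sketch only shows that a NULL hint, whether it means "no custom name" or a failed allocation, leads to the default name.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical callback type: returns a malloc()ed custom name, or NULL. */
typedef char *(*nodename_fn)(const char *default_name);

/*
 * Pick the node name: use the hint if the callback supplies one, otherwise
 * fall back to a copy of the default name. NULL from the callback covers
 * both "no custom name" and "allocation failed"; the two cannot be told
 * apart here. The result is malloc()ed; NULL means even the copy failed.
 */
char *pick_node_name(const char *default_name, nodename_fn hint)
{
    char *name = hint ? hint(default_name) : NULL;
    size_t len;

    if (name)
        return name;

    len = strlen(default_name) + 1;
    name = malloc(len);
    if (name)
        memcpy(name, default_name, len);
    return name;
}

/* Example hint: place the node in a subdirectory, e.g. "input/event0". */
char *input_hint(const char *default_name)
{
    size_t len = strlen("input/") + strlen(default_name) + 1;
    char *name = malloc(len);

    if (!name)
        return NULL;
    snprintf(name, len, "input/%s", default_name);
    return name;
}

/* Example hint that always fails, standing in for an -ENOMEM case. */
char *failing_hint(const char *default_name)
{
    (void)default_name;
    return NULL;
}
```

With `input_hint` the result is "input/event0"; with `failing_hint` the same device silently ends up as "event0", which is the inconsistency being questioned in the quoted text.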

Thanks,
Kay

Kay Sievers

1 May 2009, 07:20:06

On Fri, May 1, 2009 at 13:03, Alan Cox <al...@lxorguk.ukuu.org.uk> wrote:
>> Boot speed, boot speed, boot speed.
>>
>> Oh, and reduction in complexity in init scripts, and saving embedded
>> systems a lot of effort to implement a dynamic /dev properly (have you
>> _seen_ what Android does to keep from having to ship udev? It's
>> horrible...)
>
> This is nothing to do with udev or the kernel side of things. Your
> argument seems to be
>
> "Remove a user space problem and make it a kernel one stuffed in
> unpageable RAM and less flexible"
>
> It seems to me your actual problem is "my tools suck" and the fix for
> that is to fix the tools not add more random kernel junk.

The problem is that people optimize for fractions of seconds today,
driven mainly by your company. :)

No tool ever has a chance to get to the information only available at
early kernel init. All such tools will need to "replay" what the
kernel already did. This is intended to save us from doing this, and
retain the information which is there, but lost at the moment the
tools have the first chance to run.

It's not about a sucking tool, it's just impossible. And there is no
space wasted, it's a single string for a very few subsystems, and
nothing is stored per device.

Thanks,
Kay

Michael Tokarev

1 May 2009, 07:40:12

Kay Sievers wrote:
[]

> It does not really change anything visible for the normal distros,
> besides that bootup can be made more efficient by running things in
> parallel, and that there will be a working and correct /dev in the
> rescue case.

Ok, how about switch_root done as the last thing in initramfs?
If there's another filesystem (besides rootfs and real /root)
mounted, what switch_root (or equivalent) should do? Or should
the initramfs script move it (as in mount --move) to /root/dev
before invoking switch_root?

Thanks!

/mjt

Kay Sievers

1 May 2009, 07:50:11

On Fri, May 1, 2009 at 13:38, Michael Tokarev <m...@tls.msk.ru> wrote:
> Kay Sievers wrote:
> []
>>
>> It does not really change anything visible for the normal distros,
>> besides that bootup can be made more efficient by running things in
>> parallel, and that there will be a working and correct /dev in the
>> rescue case.
>
> Ok, how about switch_root done as the last thing in initramfs?
> If there's another filesystem (besides rootfs and real /root)
> mounted, what switch_root (or equivalent) should do? Or should
> the initramfs script move it (as in mount --move) to /root/dev
> before invoking switch_root?

I thought the initramfs should --move it manually, instead of mounting
its own tmpfs at /dev, which is what all common distros do today.

But we can do anything here, that might be more appropriate, if you
have a better idea.

Thanks,
Kay

Hugh Dickins

1 May 2009, 08:00:15

On Thu, 30 Apr 2009, Kay Sievers wrote:
...

I've no opinion on the big picture of your devtmpfs proposal.
But note that there's a second, !CONFIG_SHMEM, version of init_tmpfs()
also in mm/shmem.c - you'll want to take the "static " off that too,
for the tiny embedded case when ramfs is used to serve up tmpfs.

Hugh

Kay Sievers

1 May 2009, 08:10:09

On Fri, May 1, 2009 at 13:41, Hugh Dickins <hu...@veritas.com> wrote:
> I've no opinion on the big picture of your devtmpfs proposal.
> But note that there's a second, !CONFIG_SHMEM, version of init_tmpfs()
> also in mm/shmem.c - you'll want to take the "static " off that too,
> for the tiny embedded case when ramfs is used to serve up tmpfs.

Ah, I see. Done. Will be in the next update.

Thanks a lot,
Kay

Alan Jenkins

1 May 2009, 08:40:06

On 5/1/09, Kay Sievers <kay.s...@vrfy.org> wrote:
> On Fri, May 1, 2009 at 12:19, Alan Jenkins
> <sourcej...@googlemail.com> wrote:
>> On 4/30/09, Kay Sievers <kay.s...@vrfy.org> wrote:
>
>>> With the kernel populated /dev, existing initramfs or kernel-mount
>>> bootup logic can be optimized to be more efficient, and not to require a
>>> full coldplug run, which is currently needed to bootstrap the initial
>>> /dev directory content, before continuing bringing up the rest of
>>> the system. There will be no missed events to replay, because /dev is
>>> available before the first kernel device is registered with the core.
>>> A coldplug run can take, depending on the speed of the system and the
>>> amount of devices which need to be handled, from one to several seconds.
>>
>> Aren't you overreaching in your claims here? I'm sure you can't avoid
>> at least one coldplug run on a contemporary general purpose system,
>
> You still need coldplug, sure. But distros have 2 full coldplugs
> today, one in initramfs, one in the rootfs.
>
>> because you lose so much of the functionality provided by udev.
>
> No, udev will do the same thing as it does today, it will not be
> different, it's just that the coldplug can run in parallel with other
> stuff, and the initramfs coldplug can be much simplified, or even
> avoided.

Ok, I can see there's no problem there, given that coldplug still
happens eventually.

>> I'm
>> sure of that, but it would be nice if you could address it in the
>> changelog. And modern initramfs' require udev RUN rules to read UUIDs
>> and set up LVM.
>
> A simple uuid/label lookup will not need a full coldplug run in
> initramfs. You just need to start udev, and wait for the links to show
> up. Possibly triggering block events, but not the hundreds of
> other devices.

Assuming you have all the necessary modules loaded! I guess that's
easy enough though

find /sys -name modalias -exec cat \{\} \; | xargs -n 1 modprobe -Qa

bonus: increase "-n" for reduced modprobe fork overhead.

>> I'm loving this for embedded, init=/bin/sh, and rescue floppies :-).
>> But I can't understand how you plan to use this as an optimisation.
>>
>> And - I'm sure you must have considered this in a moment of madness -
>> do you know why we couldn't just start _udev_ "before the first kernel
>> device is registered with the core"?
>
> I think it's pretty obvious, that this is the fastest you can get for
> /dev. And it's not about the time when udev is started, it's about
> doing the stuff when the devices are created, and not to need to redo
> it at bootup, or do it even twice.

Ok, yes. I can see that cutting initramfs coldplug down to size is a win.

I thought it was a useful comparison. Start udev early enough, and
you could avoid re-doing absolutely anything. Thinking about it, the
reasons it doesn't work are

a) running before /dev/null and /dev/console requires hacks
b) it requires an initramfs
c) it pulls everything that hooks into or otherwise affects udev into
the initramfs; that's much more than we have at present, and a bigger
initramfs can only make bootup _slower_.

Regards
Alan

Alan Cox

1 May 2009, 09:20:08

> a) running before /dev/null and /dev/console requires hacks

Not really, well not if you are writing a serious small tool anyway. In
actual fact the bigger problem if you are using the standard dynamic
link setup (which you wouldn't be I suspect) is /dev/zero.

> b) it requires an initramfs
> c) it pulls everything that hooks into or otherwise affects udev into
> the initramfs; that's much more than we have at present, and a bigger
> initramfs' can only make bootup _slower_.

Ditto a bigger kernel. Remember moving stuff from initramfs to the kernel
makes it less memory efficient usually as it's now harder to get rid of
and not pageable later on either.

Alan Cox

1 May 2009, 09:30:16

> No tool ever has a chance to get to the information only available at
> early kernel init. All such tools will need to "replay" what the
> kernel already did. This is intended to save us from doing this, and
> retain the information which is there, but lost at the moment the
> tools have the first chance to run.

Serious question - which is the better problem to fix ?

> It's not about a sucking tool, it's just impossible. And there is no
> space wasted, it's a single string for a very few subsystems, and
> nothing is stored per device.

Plus code plus tmpfs nodes (the latter are not quite free because you
create unneeded ones versus udev but I agree that is noise for most users)

Kay Sievers

1 May 2009, 09:30:20

On Fri, May 1, 2009 at 15:18, Alan Cox <al...@lxorguk.ukuu.org.uk> wrote:
>> No tool ever has a chance to get to the information only available at
>> early kernel init. All such tools will need to "replay" what the
>> kernel already did. This is intended to save us from doing this, and
>> retain the information which is there, but lost at the moment the
>> tools have the first chance to run.
>
> Serious question - which is the better problem to fix ?

I'm very sure you can not fix it outside the kernel. Or do you have
an idea how to create the missing device nodes for devices without
crawling sysfs, when the first userspace process is started?

>> It's not about a sucking tool, it's just impossible. And there is no
>> space wasted, it's a single string for a very few subsystems, and
>> nothing is stored per device.
>
> Plus code plus tmpfs nodes (the latter are not quite free because you
> create unneeded ones versus udev but I agree that is noise for most users)

It's exactly the same as udev. Udev creates a node for every device
the core registers in tmpfs. This tmpfs is mounted by the kernel and
pre-populated with this, not by udev, that's all the difference it
has. There is not a single node more than before.

Thanks,
Kay

Greg KH

1 May 2009, 10:10:11

On Thu, Apr 30, 2009 at 11:57:54PM -0700, Chris Wedgwood wrote:
> On Thu, Apr 30, 2009 at 03:23:42PM +0200, Kay Sievers wrote:
>
> > Devtmpfs lets the kernel create a tmpfs very early at kernel
> > initialization, before any driver core device is registered. Every
> > device with a major/minor will have a device node created in this
> > tmpfs instance. After the rootfs is mounted by the kernel, the
> > populated tmpfs is mounted at /dev. In initramfs, it can be moved to
> > the manually mounted root filesystem before /sbin/init is executed.
>
> Why can't the initramfs create /dev and populate it?

Right now it does, and it takes about 1-2 seconds to do so depending on
the hardware.

Which is over double the time it takes to boot the kernel entirely these
days, so it is quite noticeable.

thanks,

greg k-h

Kay Sievers

1 May 2009, 11:10:10

Exactly. That can not work, we always need to do the coldplug in the
rootfs, because only there are all the tools we require.

With the automatic devtmpfs mount in the rootfs, we can do most of the
coldplug in the background, because other stuff can be sure that
mandatory device nodes already exist, and they do not need to wait for
udev to finish.

Inside the initramfs, the need for coldplug is very much limited to
module loading and block device handling, and has also no hard
checkpoint anymore, when devtmpfs pre-populates /dev.

Static /dev entries are no option anymore, with all the dynamic number
assignments today. It might be fine for custom systems, but distros
can not use it at all. And even for the custom setups, which could do
it, devtmpfs should be the most flexible and reliable option.

I think the init=/bin/sh case alone would make it worth doing, without
any of the other optimizations it makes possible. It's a pretty
difficult to manage situation today, if your userspace breaks, and you
need to get to your devices, and /dev is empty, contains entries with
the wrong numbers, or does not contain what you are looking for.

Thanks,
Kay

Alan Jenkins

1 May 2009, 11:50:14

On 5/1/09, Greg KH <gr...@kroah.com> wrote:
> On Thu, Apr 30, 2009 at 11:57:54PM -0700, Chris Wedgwood wrote:
>> On Thu, Apr 30, 2009 at 03:23:42PM +0200, Kay Sievers wrote:
>>
>> > Devtmpfs lets the kernel create a tmpfs very early at kernel
>> > initialization, before any driver core device is registered. Every
>> > device with a major/minor will have a device node created in this
>> > tmpfs instance. After the rootfs is mounted by the kernel, the
>> > populated tmpfs is mounted at /dev. In initramfs, it can be moved to
>> > the manually mounted root filesystem before /sbin/init is executed.
>>
>> Why can't the initramfs create /dev and populate it?
>
> Right now it does, and it takes about 1-2 seconds to do so depending on
> the hardware.
>
> Which is over double the time it takes to boot the kernel entirely these
> days, so it is quite noticeable.

Please, this argument is pants. The initramfs _could_ create /dev
and populate it. Neither crawling /sys nor creating device nodes is
horribly expensive. It's udev that adds overhead which is not needed
at this point. If the initramfs was optimised to do the same as
devtmpfs, it needn't take more than 50ms on my eeepc.

Here's my 630MHz Celeron in action:

# Crawl sysfs to discover valid devices
time ls -l /sys/dev/ > /dev/null

real 0m0.008s
user 0m0.000s
sys 0m0.007s

# Create X number of device nodes
time cp -a /dev/block /dev/char .dev2

real 0m0.016s
user 0m0.003s
sys 0m0.013s

If this deserves to live in the kernel, let's not pretend that it is
because it works dramatically faster.

Thanks
Alan

Chris Wedgwood

1 May 2009, 12:00:18

On Fri, May 01, 2009 at 07:01:08AM -0700, Greg KH wrote:

> Right now it does, and it takes about 1-2 seconds to do so depending
> on the hardware.

The code I have only does block devices right now, but it could do
more. Even if it was 10x slower it wouldn't be close to 1-2 seconds.

> Which is over double the time it takes to boot the kernel entirely
> these days, so it is quite noticeable.

So the kernel takes 500-1000ms then? That seems really high.

Greg KH

1 May 2009, 12:10:15

On Fri, May 01, 2009 at 04:43:49PM +0100, Alan Jenkins wrote:
> On 5/1/09, Greg KH <gr...@kroah.com> wrote:
> > On Thu, Apr 30, 2009 at 11:57:54PM -0700, Chris Wedgwood wrote:
> >> On Thu, Apr 30, 2009 at 03:23:42PM +0200, Kay Sievers wrote:
> >>
> >> > Devtmpfs lets the kernel create a tmpfs very early at kernel
> >> > initialization, before any driver core device is registered. Every
> >> > device with a major/minor will have a device node created in this
> >> > tmpfs instance. After the rootfs is mounted by the kernel, the
> >> > populated tmpfs is mounted at /dev. In initramfs, it can be moved to
> >> > the manually mounted root filesystem before /sbin/init is executed.
> >>
> >> Why can't the initramfs create /dev and populate it?
> >
> > Right now it does, and it takes about 1-2 seconds to do so depending on
> > the hardware.
> >
> > Which is over double the time it takes to boot the kernel entirely these
> > days, so it is quite noticeable.
>
> Please, this argument is pants.

Is that short pants? Jeans? Khakis? What color? :)

> The initramfs _could_ create /dev and populate it.

How? With a bash script like Android does? Do you want to maintain two
different code streams for this kind of thing?

> Neither crawling /sys nor creating device nodes is horribly expensive.

No, but it is measurable.

> It's udev that adds overhead which is not needed at this point.
> If the initramfs was optimised to do the same as devtmpfs, it needn't
> take more than 50ms on my eeepc.

It would take longer than 50ms.

> Here's my 630MHz Celeron in action:
>
> # Crawl sysfs to discover valid devices
> time ls -l /sys/dev/ > /dev/null
>
> real 0m0.008s
> user 0m0.000s
> sys 0m0.007s
>
> # Create X number of device nodes
> time cp -a /dev/block /dev/char .dev2
>
> real 0m0.016s
> user 0m0.003s
> sys 0m0.013s

But first, time it with an initramfs with the load time of your script
and the rest of the stuff you need to do there to get it running. And
also drop your caches before doing such a test as well to make it
"real".
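
Dropping the caches before a cold-start measurement is done on Linux by writing "3" to /proc/sys/vm/drop_caches as root. A minimal sketch, assuming a hypothetical helper `drop_caches` that takes the control file path as a parameter so it can be exercised against an ordinary file:

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Flush dirty data, then ask the kernel to drop clean page cache, dentries
 * and inodes by writing "3" to the control file at `ctl_path`
 * (normally /proc/sys/vm/drop_caches, which requires root).
 * Returns 0 on success, -1 on failure.
 */
int drop_caches(const char *ctl_path)
{
    FILE *f;

    /* Write back dirty pages first so they become droppable. */
    sync();

    f = fopen(ctl_path, "w");
    if (!f)
        return -1;
    if (fputs("3\n", f) == EOF) {
        fclose(f);
        return -1;
    }
    return fclose(f) == 0 ? 0 : -1;
}
```

A benchmark would call `drop_caches("/proc/sys/vm/drop_caches")` immediately before starting the timer.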

> If this deserves to live in the kernel, let's not pretend that it is
> because it works dramatically faster.

But it does.

And it's also a solution for the embedded people, and the rescue disk
users, and the others of us that have to drop down to init=/bin/bash at
times.

If you don't like it, don't build it into your kernel, it's only 300
lines of code to keep away from your machine.

thanks,

greg k-h

Chris Wedgwood

1 May 2009, 12:20:10

On Fri, May 01, 2009 at 09:09:51AM -0700, Greg KH wrote:

> You usually need to do more. We started out just only doing block
> devices, but you also need the memory devices, and console, and a
> few others as well.

I could add that. It's probably not hard.

What's more, it has libblkid support so you could mount by label and a
few other things. It seems reasonable that you could add lvm support
as well.

The main downsides I see right now are:

- it's not in the kernel by default

- it's actually quite large because I link against glibc

It's not clear these are show stoppers long term.

> After all that, you have already written another tool, which seems
> like overkill when we can do it all in the kernel so much easier and
> simpler.

I think that's a different argument. Doesn't the same logic imply we
shouldn't have udevd?

Greg KH

1 May 2009, 12:20:12

On Fri, May 01, 2009 at 08:53:42AM -0700, Chris Wedgwood wrote:
> On Fri, May 01, 2009 at 07:01:08AM -0700, Greg KH wrote:
>
> > Right now it does, and it takes about 1-2 seconds to do so depending
> > on the hardware.
>
> The code I have only does block devices right now, but it could do
> more. Even if it was 10x slower it wouldn't be close to 1-2 seconds.

You usually need to do more. We started out only doing block
devices, but you also need the memory devices, and console, and a few
others as well.

After all that, you have already written another tool, which seems like
overkill when we can do it all in the kernel so much easier and simpler.

> > Which is over double the time it takes to boot the kernel entirely
> > these days, so it is quite noticeable.
>
> So the kernel takes 500-1000ms then? That seems really high.

Yes, we can boot a kernel in 500-1000ms on some pretty low end, crappy
disk controller/device laptops these days. Yes, it's not as fast as we
would like it, but we are working on making it even faster. I think
Arjan has some pending patches to resolve some of the remaining issues
that are present for machines with "legacy" input devices that we need
to get merged.

thanks,

greg k-h

Andrew Morton

1 May 2009, 15:30:13

On Fri, 1 May 2009 13:16:22 +0200
Kay Sievers <kay.s...@vrfy.org> wrote:

> On Fri, May 1, 2009 at 07:29, Andrew Morton <ak...@linux-foundation.org> wrote:
>
> > dev->type->nodename() might have failed due to -ENOMEM, in which case
> > it seems wrong to assume that it returned NULL for <whatever reason you
> > thought it might want to return NULL>.
> >
> > It's all a bit confused.
>
> This logic is only for providing a custom name hint. Only a few
> devices need that at all. If the allocation fails, the default name
> will be used, not the custom name.

But that's bad, isn't it? It means that the kernel will come up with
one name if the memory allocation succeeded, and a different name if
the allocation failed.

Alan Jenkins

1 May 2009, 17:20:11

On 5/1/09, Greg KH <gr...@kroah.com> wrote:
> On Fri, May 01, 2009 at 04:43:49PM +0100, Alan Jenkins wrote:
>> On 5/1/09, Greg KH <gr...@kroah.com> wrote:
>> > On Thu, Apr 30, 2009 at 11:57:54PM -0700, Chris Wedgwood wrote:
>> >> On Thu, Apr 30, 2009 at 03:23:42PM +0200, Kay Sievers wrote:
>> >>
>> >> > Devtmpfs lets the kernel create a tmpfs very early at kernel
>> >> > initialization, before any driver core device is registered. Every
>> >> > device with a major/minor will have a device node created in this
>> >> > tmpfs instance. After the rootfs is mounted by the kernel, the
>> >> > populated tmpfs is mounted at /dev. In initramfs, it can be moved to
>> >> > the manually mounted root filesystem before /sbin/init is executed.
>> >>
>> >> Why can't the initramfs create /dev and populate it?
>> >
>> > Right now it does, and it takes about 1-2 seconds to do so depending on
>> > the hardware.
>> >
>> > Which is over double the time it takes to boot the kernel entirely these
>> > days, so it is quite noticeable.
>>
>> Please, this argument is pants.
>
> Is that short pants? Jeans? Khakis? What color? :)

Damned international communication :-).

Adjective
pants (comparative more pants, superlative most pants)

(British, slang) Of inferior quality, rubbish
Your mobile is pants - why don't you get one like mine?

And - apologies, for raising my voice for something which doesn't deserve it.

>> The initramfs _could_ create /dev and populate it.
>
> How? With a bash script like Android does? Do you want to maintain two
> different code streams for this kind of thing?

I hypothesized a small, "obviously correct", C program. It would make
sense to maintain it as part of udev. Seeing as most initramfs' will
still use udev, and just ask it to do less.

>> Neither crawling /sys nor creating device nodes is horribly expensive.
>
> No, but it is measurable.
>
>> It's udev that adds overhead which is not needed at this point.
>> If the initramfs was optimised to do the same as devtmpfs, it needn't
>> take more than 50ms on my eeepc.
>
> It would take longer than 50ms.

> But first, time it with an initramfs with the load time of your script
> and the rest of the stuff you need to do there to get it running. And
> also drop your caches before doing such a test as well to make it
> "real".

I was trying to measure the system calls required. I'm all too aware
of the overhead of fork() even without exec(), so I wouldn't write it
in shell. And I was comparing initramfs w/"makedev.c" to initramfs
w/devtmpfs.

>> If this deserves to live in the kernel, let's not pretend that it is
>> because it works dramatically faster.
>
> But it does.
>
> And it's also a solution for the embedded people, and the rescue disk
> users, and the others of us that have to drop down to init=/bin/bash at
> times.
>
> If you don't like it,

I said upthread that I would love it for all of those reasons.
Seriously, I've done all three in the past, this sounds pretty
attractive.

I'm trying to gain an honest understanding of the idea, and what it
will mean, in part by comparing it to the alternatives. I am very
confused when you answer "why put this in the kernel" with "it's too
slow at the moment", because there is no _direct_ connection between
the two.

I think what your answer missed is that you're making the process much
_simpler_, which explains both why it can be faster, and why it should
be considered as a legitimate role for the kernel.

<deleted comprehensive notes which explain the nature of the
simplification among other things, but probably aren't worth restating
here>.

The problem is that the simplification is the important part of the
change, it has tradeoffs, and glossing over tradeoffs can come over
all suspicious. But I'm happy that Kay's taken the time to answer
all the questions I had now.

> don't build it into your kernel, it's only 300
> lines of code to keep away from your machine.

I refuse to rise to that :). I think on the contrary, it will become
difficult to disable when future initramfs' rely on it. But I won't
want to at that point.

Thanks
Alan

Kay Sievers

1 May 2009, 18:10:06

On Fri, May 1, 2009 at 21:26, Andrew Morton <ak...@linux-foundation.org> wrote:
> On Fri, 1 May 2009 13:16:22 +0200
> Kay Sievers <kay.s...@vrfy.org> wrote:
>
>> On Fri, May 1, 2009 at 07:29, Andrew Morton <ak...@linux-foundation.org> wrote:
>>
>> > dev->type->nodename() might have failed due to -ENOMEM, in which case
>> > it seems wrong to assume that it returned NULL for <whatever reason you
>> > thought it might want to return NULL>.
>> >
>> > It's all a bit confused.
>>
>> This logic is only for providing a custom name hint. Only a few
>> devices need that at all. If the allocation fails, the default name
>> will be used, not the custom name.
>
> But that's bad, isn't it? It means that the kernel will come up with
> one name if the memory allocation succeeded, and a different name if
> the allocation failed.

Yeah, sure, it's bad. But I think we have pretty much lost anyway, if
we run into oom at this stage.

What should we do instead? If we, for some reason, can not get a
possible custom name?

Thanks,
Kay

Andrew Morton

1 May 2009, 18:30:14

On Fri, 1 May 2009 23:59:32 +0200
Kay Sievers <kay.s...@vrfy.org> wrote:

> On Fri, May 1, 2009 at 21:26, Andrew Morton <ak...@linux-foundation.org> wrote:
> > On Fri, 1 May 2009 13:16:22 +0200
> > Kay Sievers <kay.s...@vrfy.org> wrote:
> >
> >> On Fri, May 1, 2009 at 07:29, Andrew Morton <ak...@linux-foundation.org> wrote:
> >>
> >> > dev->type->nodename() might have failed due to -ENOMEM, in which case
> >> > it seems wrong to assume that it returned NULL for <whatever reason you
> >> > thought it might want to return NULL>.
> >> >
> >> > It's all a bit confused.
> >>
> >> This logic is only for providing a custom name hint. Only a few
> >> devices need that at all. If the allocation fails, the default name
> >> will be used, not the custom name.
> >

> > But that's bad, isn't it? __It means that the kernel will come up with

> > one name if the memory allocation succeeded, and a different name if
> > the allocation failed.
>
> Yeah, sure, it's bad. But I think we have pretty much lost anyway, if
> we run into oom at this stage.
>
> What should we do instead? If we, for some reason, can not get a
> possible custom name?

Not much - just sayin'.

Presumably the page allocator will have given a big spew, so the
operator knows what went wrong.
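
To make the fallback under discussion concrete, here is a minimal userspace sketch. It is not the patch's code: `struct fake_dev`, `pick_node_name()` and both callbacks are hypothetical names made up for illustration, and only the "bsg/%s" format is taken from the quoted patch. It shows that a NULL return from the hint callback looks the same whether the device has no custom name or the allocation failed, so the default name is used in both cases.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a device; this is not the kernel's struct device. */
struct fake_dev {
	const char *name;                               /* default kernel name */
	char *(*nodename)(const struct fake_dev *dev);  /* optional hint, may be NULL */
};

/* Hint in the spirit of bsg_nodename(): "bsg/<name>", or NULL if malloc fails. */
char *bsg_like_nodename(const struct fake_dev *dev)
{
	size_t len = strlen("bsg/") + strlen(dev->name) + 1;
	char *buf = malloc(len);

	if (buf)
		snprintf(buf, len, "bsg/%s", dev->name);
	return buf;
}

/* Simulates a hint callback whose allocation failed. */
char *failing_nodename(const struct fake_dev *dev)
{
	(void)dev;
	return NULL;
}

/*
 * Returns a heap-allocated node name (caller frees), or NULL if even the
 * copy of the default name cannot be allocated.
 */
char *pick_node_name(const struct fake_dev *dev)
{
	char *name = dev->nodename ? dev->nodename(dev) : NULL;
	size_t len;

	if (name)
		return name;

	/* NULL is ambiguous here: "no hint wanted" and "out of memory" look alike. */
	len = strlen(dev->name) + 1;
	name = malloc(len);
	if (name)
		memcpy(name, dev->name, len);
	return name;
}
```

A callback that could report an error code separately from "no hint" would remove the ambiguity, at the cost of a slightly wider interface; whether that is worth it this early in boot is exactly the question raised above.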

Brian Swetland

1 May 2009, 21:30:11

On Thu, Apr 30, 2009 at 11:55 PM, Greg KH <gr...@kroah.com> wrote:
> On Thu, Apr 30, 2009 at 11:43:12PM -0700, Andrew Morton wrote:
>> On Thu, 30 Apr 2009 23:17:01 -0700 Greg KH <gr...@kroah.com> wrote:
>> > On Thu, Apr 30, 2009 at 10:29:00PM -0700, Andrew Morton wrote:
>> >
>> > Well, devfs "done right" with hopefully none of the vfs problems the
>> > last devfs had. :)
>>
>> I think Adam Richter's devfs rewrite (which, iirc, was tmpfs-based)
>> would have fixed up these things. But it was never quite completed and
>> came when minds were already made up.
>>
>> I don't understand why we need devfs2, really. What problems are
>> people having with the existing design?
>
> Boot speed, boot speed, boot speed.
>
> Oh, and reduction in complexity in init scripts, and saving embedded
> systems a lot of effort to implement a dynamic /dev properly (have you
> _seen_ what Android does to keep from having to ship udev? It's
> horrible...)

It's always struck me as odd that sysfs couldn't provide device node
access, given that there's already an entity exposed for everything
(or nearly everything). It seems weird to have to have an agent in
userspace to create another hierarchy in addition to what the kernel
already maintains.

I guess the really tricky bit is how to deal with
permissions/ownership sanely. I suspect there's no easy way to do
something that "just works" for even the majority of userspace
environments. Most of the ugly in the microudev thing in our init
comes from having to do something about permissions.

I would love to have a way for the kernel to do something like devfs
(it'd let me kill some ugly userspace code on my side)....

Brian

Kay Sievers

1 May 2009, 21:50:09

On Sat, May 2, 2009 at 03:24, Brian Swetland <swet...@google.com> wrote:
> On Thu, Apr 30, 2009 at 11:55 PM, Greg KH <gr...@kroah.com> wrote:
>> On Thu, Apr 30, 2009 at 11:43:12PM -0700, Andrew Morton wrote:
>>> On Thu, 30 Apr 2009 23:17:01 -0700 Greg KH <gr...@kroah.com> wrote:
>>> > On Thu, Apr 30, 2009 at 10:29:00PM -0700, Andrew Morton wrote:
>>> >
>>> > Well, devfs "done right" with hopefully none of the vfs problems the
>>> > last devfs had. :)
>>>
>>> I think Adam Richter's devfs rewrite (which, iirc, was tmpfs-based)
>>> would have fixed up these things. But it was never quite completed and
>>> came when minds were already made up.
>>>
>>> I don't understand why we need devfs2, really. What problems are
>>> people having with the existing design?
>>
>> Boot speed, boot speed, boot speed.
>>
>> Oh, and reduction in complexity in init scripts, and saving embedded
>> systems a lot of effort to implement a dynamic /dev properly (have you
>> _seen_ what Android does to keep from having to ship udev? It's
>> horrible...)
>
> It's always struck me as odd that sysfs couldn't provide device node
> access, given that there's already an entity exposed for everything
> (or nearly everything).

You really want to be able to run grep-like stuff in sysfs, which
would do horrible things with device nodes. Also it does not support
extended attributes, nor access control lists, ..., all stuff we need
for device nodes. You also want userspace to have control over device
nodes, and possibly mangle them, regardless of what the kernel exports,
that's why it's a tmpfs and not part of sysfs.

> It seems weird to have to have an agent in
> userspace to create another hierarchy in addition to what the kernel
> already maintains.

Well, until just recently, there was no sane definition of how device
nodes are named and laid out; every system did it differently, some
even tried to keep the totally useless devfs naming scheme alive. Now
that we managed to define a common default setup, which almost
everybody ships, it is possible to add the few needed rules.

> I guess the really tricky bit is how to deal with
> permissions/ownership sanely.

Simple permissions would be possible without too much hassle, but uid
gid ownership, I can't see how the kernel could do that.

> I suspect there's no easy way to do
> something that "just works" for even the majority of userspace
> environments.

It will work just fine for root environments, what's missing without
userspace support is if you need to grant specific users access to
devices.

> Most of the ugly in the microudev thing in our init
> comes from having to do something about permissions.
>
> I would love to have a way for the kernel to do something like devfs
> (it'd let me kill some ugly userspace code on my side)....

How are permissions defined in your environment? What's the set of
permissions you need to apply?

Thanks,
Kay

Brian Swetland

1 May 2009, 22:10:09

On Fri, May 1, 2009 at 6:48 PM, Kay Sievers <kay.s...@vrfy.org> wrote:
> On Sat, May 2, 2009 at 03:24, Brian Swetland <swet...@google.com> wrote:
>> It's always struck me as odd that sysfs couldn't provide device node
>> access, given that there's already an entity exposed for everything
>> (or nearly everything).
>
> You really want to be able to run grep-like stuff in sysfs, which
> would do horrible things with device nodes. Also it does not support
> extended attributes, nor access control lists, ..., all stuff we need
> for device nodes. You also want userspace to have control over device
> nodes, and possibly mangle them, regardless of what the kernel exports,
> that's why it's a tmpfs and not part of sysfs.

That makes sense.

>> It seems weird to have to have an agent in
>> userspace to create another hierarchy in addition to what the kernel
>> already maintains.
>
> Well, until just recently, there was no sane definition of how device
> nodes are named and laid out; every system did it differently, some
> even tried to keep the totally useless devfs naming scheme alive. Now
> that we managed to define a common default setup, which almost
> everybody ships, it is possible to add the few needed rules.

That's good news -- I wasn't sure if there was still variety in layout
policies that required somehow supporting multiple different ones.

>> I guess the really tricky bit is how to deal with
>> permissions/ownership sanely.
>
> Simple permissions would be possible without too much hassle, but uid
> gid ownership, I can't see how the kernel could do that.

Yeah, I don't see any easy solution there. Which means we end up
having to have some userspace agent responsible for arranging
permissions as devices are published.

>> I suspect there's no easy way to do
>> something that "just works" for even the majority of userspace
>> environments.
>
> It will work just fine for root environments, what's missing without
> userspace support is if you need to grant specific users access to
> devices.
>
>> Most of the ugly in the microudev thing in our init
>> comes from having to do something about permissions.
>>
>> I would love to have a way for the kernel to do something like devfs
>> (it'd let me kill some ugly userspace code on my side)....
>
> How are permissions defined in your environment? What's the set of
> permissions you need to apply?

In our world we use groups to provide access to specific classes of
hardware resources (audio, video, display, dsp, etc) and processes
that have the appropriate permissions are arranged to run with
necessary additional groups for the hardware they need to access.
Very little of the system ever runs as root -- most runs as a per-app
or per-service untrusted user with permissions granted via group
membership.

Brian

Kay Sievers

1 May 2009, 22:30:14

On Sat, May 2, 2009 at 04:02, Brian Swetland <swet...@google.com> wrote:
> On Fri, May 1, 2009 at 6:48 PM, Kay Sievers <kay.s...@vrfy.org> wrote:

>>> I would love to have a way for the kernel to do something like devfs
>>> (it'd let me kill some ugly userspace code on my side)....
>>
>> How are permissions defined in your environment? What's the set of
>> permissions you need to apply?
>
> In our world we use groups to provide access to specific classes of
> hardware resources (audio, video, display, dsp, etc) and processes
> that have the appropriate permissions are arranged to run with
> necessary additional groups for the hardware they need to access.

These group numbers are always static on your system, and don't get
changed while the system is running? Like you always assign gid X to
all sound devices? Or do you need to manage them dynamically?

Do your device nodes permissions/ownership ever change on the running
system after the device is alive?

Thanks,
Kay

Brian Swetland

2 May 2009, 00:50:08

On Fri, May 1, 2009 at 7:28 PM, Kay Sievers <kay.s...@vrfy.org> wrote:
> On Sat, May 2, 2009 at 04:02, Brian Swetland <swet...@google.com> wrote:
>> On Fri, May 1, 2009 at 6:48 PM, Kay Sievers <kay.s...@vrfy.org> wrote:
>
>>>> I would love to have a way for the kernel to do something like devfs
>>>> (it'd let me kill some ugly userspace code on my side)....
>>>
>>> How are permissions defined in your environment? What's the set of
>>> permissions you need to apply?
>>
>> In our world we use groups to provide access to specific classes of
>> hardware resources (audio, video, display, dsp, etc) and processes
>> that have the appropriate permissions are arranged to run with
>> necessary additional groups for the hardware they need to access.
>
> These group numbers are always static on your system, and don't get
> changed while the system is running? Like you always assign gid X to
> all sound devices? Or do you need to manage them dynamically?

The gids for device access are static, part of the platform
definition. I could imagine a model where the board file for a target
defined the ownership for devices or that was loaded in from userspace
at boot (just a table of path-match-string, uid, gid, perms). Not
sure if that'd be considered too ugly, but it would certainly solve
the problem in a straightforward way. Early on the "device ownership
policy" is installed and then userspace can leave everything to the
kernel.
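
A table like that could look roughly like the sketch below. This is a userspace illustration under stated assumptions, not anything from the patch: `struct node_policy`, `lookup_policy()`, the patterns and the gid values are all hypothetical, and the matching uses the POSIX `fnmatch()` call, which a kernel-side version would have to replace with its own matcher.

```c
#include <fnmatch.h>
#include <stddef.h>
#include <sys/types.h>

/* One row of the hypothetical "device ownership policy" table. */
struct node_policy {
	const char *pattern;  /* shell-style glob matched against the node path */
	uid_t uid;
	gid_t gid;
	mode_t mode;
};

/* Example rows; the group ids are invented for illustration only. */
static const struct node_policy policy_table[] = {
	{ "snd/*",  0, 1005, 0660 },
	{ "video*", 0, 1006, 0660 },
	{ "fb*",    0, 1003, 0660 },
};

/* Used when no row matches: the root:root 0600 default from the patch text. */
static const struct node_policy default_policy = { "*", 0, 0, 0600 };

/* Return the first matching row, or the default policy if none matches. */
const struct node_policy *lookup_policy(const char *node_path)
{
	size_t i;

	for (i = 0; i < sizeof(policy_table) / sizeof(policy_table[0]); i++) {
		if (fnmatch(policy_table[i].pattern, node_path, 0) == 0)
			return &policy_table[i];
	}
	return &default_policy;
}
```

First match wins, so more specific patterns would need to come before more general ones.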

The uids for applications are dynamic, assigned on app install (every
app gets its own uid).

> Do your device nodes permissions/ownership ever change on the running
> system after the device is alive?

We never change device node permissions or ownership at runtime. I
could see situations in which that would be useful, but it's not
something that is currently used by the platform.

Brian

Christoph Hellwig

2 May 2009, 03:20:09

On Thu, Apr 30, 2009 at 03:23:42PM +0200, Kay Sievers wrote:

> From: Kay Sievers <kay.s...@vrfy.org>
> Subject: driver-core: devtmpfs - driver core maintained /dev tmpfs

Umm, guys this needs much broader discussion than just sneaking in
a patch under the covers.

It basically does re-introduce devfs under a different name, and from
looking at the implementation it might not be quite as bad as Gooch's
original, but it's certainly worse than Adam Richter's rewrite that we
never ended up merging.

Now we might want to revisit the decision to leave all the device name
handling to a userspace daemon, because it proved to be quite fragile
under certain circumstances, and you apparently see performance issues.

> Devtmpfs lets the kernel create a tmpfs very early at kernel
> initialization, before any driver core device is registered. Every
> device with a major/minor will have a device node created in this
> tmpfs instance. After the rootfs is mounted by the kernel, the
> populated tmpfs is mounted at /dev. In initramfs, it can be moved
> to the manually mounted root filesystem before /sbin/init is
> executed.

That for example is something that is not acceptable. We really don't
want the kernel to mess with the initial namespace in such a major way.

> The tmpfs instance can be changed and altered by userspace at any time,
> and in any way needed - just like today's udev-mounted tmpfs. Unmodified
> udev versions will run just fine on top of it, and will recognize an
> already existing kernel-created device node and use it.
> The default node permissions are root:root 0600. Only if none of these
> values have been changed by userspace, the driver core will remove the
> device node when the device goes away. If the device node was altered
> by udev, by applying the appropriate permissions and ownership, it will
> need to be removed by udev - just as it usually works today.

Those are some really, really odd lifetime rules.

Counter-proposal: Re-introduce a proper mini-devfs. All nodes in there
are kernel-created and not changeable which sorts out that whole
mess of both drivers and userspace messing with tree topology we had
both in original devfs and this new devtmpfs. Single-instance so it can be
populated before it's actually mounted somewhere, that way the kernel
doesn't have to make any policy decision on where it's mounted. Mount
point would usually be /dev/something so /dev can remain a udev-managed
tmpfs or even manually maintained and symlinks can point into
/dev/something.

> +static char *bsg_nodename(struct device *dev)
> +{
> + return kasprintf(GFP_KERNEL, "bsg/%s", dev_name(dev));
> +}
> +
> static int __init bsg_init(void)
> {
> int ret, i;
> @@ -1082,6 +1087,7 @@ static int __init bsg_init(void)
> ret = PTR_ERR(bsg_class);
> goto destroy_kmemcache;
> }
> + bsg_class->nodename = bsg_nodename;

And adding this gunk to every driver is really ugly. I must say the
late-devfs version of the same definitely was prettier.

Christoph Hellwig

2 May 2009, 03:30:08

On Fri, May 01, 2009 at 03:24:01PM +0200, Kay Sievers wrote:
> I'm very sure, you can not fix it outside the kernel. Or do you have
> an idea how to create the missing device nodes for device without
> crawling sysfs, when the first userspace process is started?

Just make sure to queue up your uevents in a ring buffer that udev
can read once it has started?

Kay Sievers

2 May 2009, 07:40:06

On Sat, May 2, 2009 at 09:16, Christoph Hellwig <h...@infradead.org> wrote:
>> After the rootfs is mounted by the kernel, the
>> populated tmpfs is mounted at /dev. In initramfs, it can be moved
>> to the manually mounted root filesystem before /sbin/init is
>> executed.
>
> That for example is something that is not acceptable. We really don't
> want the kernel to mess with the initial namespace in such a major way.

There is nothing like "mess around", it's not mounted at all, until
the kernel mounts the root filesystem at /, then devtmpfs is mounted
the first time, and only if it's compiled in because you asked for it.
Also, just try:
egrep 'mknod|create_dev' init/*.c
and see what we currently do.

> Counter-proposal: Re-introduce a proper mini-devfs. All nodes in there
> are kernel-created and not changeable which sorts out that whole
> mess of both drivers and userspace messing with tree topology we had
> both in original devfs and this new devtmpfs. Single-instance so it can be
> populated before it's actually mounted somewhere, that way the kernel
> doesn't have to make any policy decision on where it's mounted.

That sounds worse than devtmpfs, and does not help for most of the
mentioned problems we are trying to solve here.

> Mount
> point would usually be /dev/something so /dev can remain a udev-managed
> tmpfs or even manually maintained and symlinks can point into
> /dev/something.

And that would solve what? init=/bin/sh would still not work, you can
not bring your box up with that, and you have some pretty useless
unchangeable stuff hanging around in a /dev subdirectory?

>> @@ -1082,6 +1087,7 @@ static int __init bsg_init(void)
>> ret = PTR_ERR(bsg_class);
>> goto destroy_kmemcache;
>> }
>> + bsg_class->nodename = bsg_nodename;
>
> And adding this gunk to every driver is really ugly. I must say the
> late-devfs version of the same definitely was prettier.

There are only a very few places that need this; there is nothing
like "every driver". It's a very few subsystems, not even the drivers,
if they have more than one. Most device nodes do not have a
subdirectory and don't have that at all.

Thanks,
Kay

Kay Sievers

2 May 2009, 09:40:12

On Sat, May 2, 2009 at 06:42, Brian Swetland <swet...@google.com> wrote:

>>> In our world we use groups to provide access to specific classes of
>>> hardware resources (audio, video, display, dsp, etc) and processes
>>> that have the appropriate permissions are arranged to run with
>>> necessary additional groups for the hardware they need to access.
>>
>> These group numbers are always static on your system, and don't get
>> changed while the system is running? Like you always assign gid X to
>> all sound devices? Or do you need to manage them dynamically?
>
> The gids for device access are static, part of the platform
> definition.

This definition does mainly assign a specific gid to every device of a
specific subsystem, right? Like you have all sound devices use one
gid, all video devices use another one, ... Very similar to what udev
does on every system with the:
SUBSYSTEM=="...", GROUP="..."
matches, right?

> I could imagine a model where the board file for a target
> defined the ownership for devices or that was loaded in from userspace
> at boot (just a table of path-match-string, uid, gid, perms). Not
> sure if that'd be considered too ugly, but it would certainly solve
> the problem in a straightforward way.

Hmm, at the point we would need it, there is no userspace to load it,
or a file to read.

> Early on the "device ownership
> policy" is installed and then userspace can leave everything to the
> kernel.

It would need to be right in the kernel, linked into the driver core.
Some "module" could do that, and hook into the processing, when a
device is registered.

> The uids for applications are dynamic, assigned on app install (every
> app gets its own uid).
>
>> Do your device nodes permissions/ownership ever change on the running
>> system after the device is alive?
>
> We never change device node permissions or ownership at runtime. I
> could see situations in which that would be useful, but it's not
> something that is currently used by the platform.

That would in theory make it possible for your system to put the gid
information in a custom kernel "module" and link it to the driver
core. Usual systems, like the distros do, can not do that, they have a
far more complex policy and completely depend on meaningful symlinks
to the device nodes - but udev solves that problem already, and will
continue to do that.
For very customized systems like yours, maybe we can make it possible
to link-in custom code. I can definitely see the advantage over the
weird hacks you need in your userspace code, just to bring the system
up. Let's just wait and see where this discussion ends up, and whether
somebody finds a fundamental problem with the devtmpfs approach; otherwise
I'll take a look at what we can do.

We have a well defined common device naming today, and nobody
implements any custom "naming scheme" like the misguided devfs
hierarchy with default names like:
/dev/ide/host0/bus0/target0/lun0/part1, or /dev/tty/0, or
/dev/cdroms/0 anymore.
We all use the default kernel names, and udev just follows the kernel
given name already for almost all devices. We have a few exceptions in
udev rules left, but there is no good reason not to add these
exceptions as a few strings per subsystem to the kernel itself, and
have the kernel create the default nodes, if it solves a real problem.

Thanks,
Kay

Kay Sievers

2 May 2009, 09:50:15

On Sat, May 2, 2009 at 09:19, Christoph Hellwig <h...@infradead.org> wrote:
> On Fri, May 01, 2009 at 03:24:01PM +0200, Kay Sievers wrote:
>> I'm very sure, you can not fix it outside the kernel. Or do you have
>> an idea how to create the missing device nodes for device without
>> crawling sysfs, when the first userspace process is started?
>
> Just make sure to queue up your uevents in a ring buffer that udev
> can read once it has started?

Which does not really target any of the problems we try to solve, and
is probably even larger than the 300 lines to create the proper /dev
stuff right away. It's about fractions of a second, we are optimizing
for, and we need to start as many things in parallel, as early as
possible. And a working and populated /dev is mandatory for most of
the stuff we need to bring up.

I think the init=/bin/sh case alone would be justification enough to
do that, it can save you a lot of trouble if things go wrong, which
things do, and which is pretty hard to cope with today, with no access
to your devices.

We are not implementing anything crazy here like devfs did, including
the later versions - there is no modprobe behind your back, no lookup
hooks, no stupid new naming scheme, no new filesystem type to
register.
Udev uses the kernel provided names anyway today, there are no naming
rules at all in current userspace for 98 of 100 devices. It's today's
kernel which provides the naming already, and we will not change
anything here, just add the few exceptions, which are only in udev
rules today, and let the kernel create the node that udev will create
anyway.

Thanks,
Kay

Kyle Moffett

2 May 2009, 11:10:14

On Fri, May 1, 2009 at 9:12 AM, Alan Cox <al...@lxorguk.ukuu.org.uk> wrote:
>> a) running before /dev/null and /dev/console requires hacks
>
> Not really, well not if you are writing a serious small tool anyway. In
> actual fact the bigger problem if you are using the standard dynamic
> link setup (which you wouldn't be I suspect) is /dev/zero.
>
>> b) it requires an initramfs
>> c) it pulls everything that hooks into or otherwise affects udev into
>> the initramfs; that's much more than we have at present, and a bigger
>> initramfs' can only make bootup _slower_.
>
> Ditto a bigger kernel. Remember moving stuff from initramfs to the kernel
> makes it less memory efficient usually as its now harder to get rid of
> and not pageable later on either.

Hmm... I'm almost curious how early it is possible to start userspace,
if you ensure that the userspace is very careful to limit its
dependencies. Specifically, you could actually include software
that's little more than a paged kernel thread with its own mm. If
there's any userspace code that we would really want to be updated in
lockstep with the kernel (possibly this udev-replacement), it could be
originally linked as part of the kernel image and later marked as
swappable pages.

So then the essential questions would be things like:

* How early can rootfs be set up? (even if not yet populated from the
initramfs)
* How early can pipefs be ready?
* What syscalls are safe from that early of a context?

If those questions have satisfactory answers, you could basically
build a mini-udev as a part of the kernel, perhaps even with a
device-table automatically extracted from a specific section by the
linker script. It would also be useful without an initramfs, as it
could replace the existing root= name-to-dev_t lookup code with
something swappable.

Cheers,
Kyle Moffett

Andy Lutomirski

2 May 2009, 11:20:12

Kay Sievers wrote:
> On Sat, May 2, 2009 at 09:19, Christoph Hellwig <h...@infradead.org> wrote:
>> On Fri, May 01, 2009 at 03:24:01PM +0200, Kay Sievers wrote:
>>> I'm very sure, you can not fix it outside the kernel. Or do you have
>>> an idea how to create the missing device nodes for device without
>>> crawling sysfs, when the first userspace process is started?
>> Just make sure to queue up your uevents in a ring buffer that udev
>> can read once it has started?
>
> Which does not really target any of the problems we try to solve, and
> is probably even larger than the 300 lines to create the proper /dev
> stuff right away. It's about fractions of a second, we are optimizing
> for, and we need to start as many things in parallel, as early as
> possible. And a working and populated /dev is mandatory for most of
> the stuff we need to bring up.
>
> I think the init=/bin/sh case alone would be justification enough to
> do that, it can save you a lot of trouble if things go wrong, which
> things do, and which is pretty hard to cope with today, with no access
> to your devices.

What's wrong with:

mount -n -t sysfs none /sys
mount -n -t tmpfs none /tmp
udevd --daemon
udevadm trigger

once the shell comes up? There could even be a standard script that all
distributions ship that does that, plus mounts /proc and does whatever
magic is needed to make Ctrl-C work.

(OK, so you depend on udev and mount working, but you already depend
on sh working, and you'll have a heck of a time rescuing anything if even
mount doesn't work.)

If you want a really reliable rescue mode, then either put a whole
working busybox system in a spare initramfs with a spare boot menu entry
or just use a real rescue disk, neither of which require devtmpfs.

As a separate question, what happens with devtmpfs if I plug in some
device that uses dynamic minors, then unplug it, then plug in another
device that gets a new minor but the same name, all before (or even
after) udev starts? Are there any subsystems that could do that?

--Andy

Kay Sievers

2 May 2009, 11:40:09

On Sat, May 2, 2009 at 17:18, Andy Lutomirski <lu...@myrealbox.com> wrote:
> What's wrong with:
>
> mount -n -t sysfs none /sys
> mount -n -t tmpfs none /tmp

You need tmpfs on /dev, and you need initial nodes in the new tmpfs
before you can start udevd.

> udevd --daemon
> udevadm trigger

Then you need to wait for udev to finish, to be sure the nodes are
there, before you start anything else.

> once the shell comes up?

If the kernel mounts your root, and not initramfs, the shell does not
come up, when /dev is empty, like it is on some distros which always
use tmpfs on /dev.

> There could even be a standard script that all
> distributions ship that does that, plus mounts /proc and does whatever magic
> is needed to make Ctrl-C work.

There is no such thing today, as "standard that distros ship", and I
doubt there will be. We needed years to sync the device names across
distros. That happened now, but there will never be any "standard
thing" that brings up a box that the kernel can "ship". Same for the
"let every new driver provide a udev rule for the names of the device
at the same time", that will never happen, people just don't care
about userspace issues at all - I've given up on trying that.

> (OK, so you depend on udev and mount working, but you already depend
> on sh working, and you'll have a heck of a time rescuing anything if even
> mount doesn't work.)

You need access to your devices, and device numbers may be dynamic.
It's not about mount(8) not working; you may have lost anyway in
such cases.

> If you want a really reliable rescue mode, then either put a whole working
> busybox system in a spare initramfs with a spare boot menu entry or just use
> a real rescue disk, neither of which require devtmpfs.

Yeah, why not have a spare box close to every one, then you can use
that instead. :) This is not about being prepared that stuff went
wrong, then you are lucky. You usually can not copy a new image to a
box that does not boot, and you do not have a working rescue disk
right next to you.

> As a separate question, what happens with devtmpfs if I plug in some device
> that uses dynamic minors, then unplug it, then plug in another device that
> gets a new minor but the same name, all before (or even after) udev starts?
> Are there any subsystems that could do that?

The node gets removed, then a new one gets created for the new device.
Besides, there are no devices with the same name at the same time
today; udev would not solve that problem either.

Jeff Garzik

2 May 2009, 13:00:19

Christoph Hellwig wrote:
> On Thu, Apr 30, 2009 at 03:23:42PM +0200, Kay Sievers wrote:
>> From: Kay Sievers <kay.s...@vrfy.org>
>> Subject: driver-core: devtmpfs - driver core maintained /dev tmpfs
>
> Umm, guys this needs much broader discussion than just sneaking in
> a patch under the covers.
>
> It basically does re-introduce devfs under a different name, and from
> looking at the implementation it might not be quite as bad as Gooch's
> original, but it's certainly worse than Adam Richter's rewrite that we
> never ended up merging.

I was interested in this Richter devfs rewrite, since I was unfamiliar
with it.

For the benefit of the thread, here is a URL that people can examine:

http://marc.info/?l=linux-kernel&m=104138806530375&w=2

Regards,

Jeff

Kay Sievers

2 May 2009, 14:00:17

On Sat, May 2, 2009 at 18:59, Jeff Garzik <je...@garzik.org> wrote:
> Christoph Hellwig wrote:

>> It basically does re-introduce devfs under a different name, and from
>> looking at the implementation it might not be quite as bad as Gooch's
>> original, but it's certainly worse than Adam Richter's rewrite that we
>> never ended up merging.
>
> I was interested in this Richter devfs rewrite, since I was unfamiliar with
> it.
>
> For the benefit of the thread, here is a URL that people can examine:
>
> http://marc.info/?l=linux-kernel&m=104138806530375&w=2

And it did:
- the crazy devfs naming scheme with:
  /dev/ide/host0/bus0/target0/lun0/part1, /dev/tty/0,
  and it did not even create our current names by default
- path lookup interception, and it called modprobe behind your back
- call_usermodehelper() from the kernel to name devices
- introduced a new filesystem type and a bunch of
  new datatypes
- and so on ...

I don't think anybody else besides Christoph would call devtmpfs
"seriously worse" than this. :) None of these "features" are needed
today, or even close to worth thinking about.

And for the "policy in kernel department" which will be the next naive
argument: The kernel carries the policy today for 98% of the devices,
if you change any driver given name, it will no longer show up in /dev
with the current name. That has been the reality for years, and will not
be different anytime soon; there is no real naming policy besides the
current kernel supplied names.

Now that we managed to define a common set of /dev names we all share
across distros, it's time to let the kernel know about the remaining
2% and let it pre-create the device nodes itself for userspace to pick
up, with all the earlier mentioned opportunities it offers.

The final policy will still live in udev which sets ownership and
mode, and possibly overwrites the node name. Nothing really has
changed, it can just be made faster, and it is much more reliable if
stuff in userspace goes wrong.

Thanks,
Kay

Michael Riepe

2 May 2009, 14:30:15

Hi!

Kay Sievers wrote:
> On Fri, May 1, 2009 at 15:18, Alan Cox <al...@lxorguk.ukuu.org.uk> wrote:
>
>>>No tool ever has a chance to get to the information only available at
>>>early kernel init. All such tools will need to "replay" what the
>>>kernel already did. This is intended to save us from doing this, and
>>>retain the information which is there, but lost at the moment the
>>>tools have the first chance to run.
>>
>>Serious question - which is the better problem to fix ?

>
>
> I'm very sure, you can not fix it outside the kernel. Or do you have
> an idea how to create the missing device nodes for device without
> crawling sysfs, when the first userspace process is started?

Well, what about this? Let the kernel buffer all events until udev is
ready to process them. Once it signals that it's up and running - the
best method for that is tbd -, empty the queue, and then discard it,
including the additional code. If udev doesn't come up in time after
init has started, or the buffer overflows, assume the system is using a
static /dev and throw away the queue as well. If udev starts too late,
it can still do a coldplug - in that case, boot speed is already so low
that the additional delay hardly matters.

Comments?

--
Michael "Tired" Riepe <michae...@googlemail.com>
X-Tired: Each morning I get up I die a little

Alan Jenkins

2 May 2009, 16:00:22

On 5/2/09, Michael Riepe <michae...@googlemail.com> wrote:
> Hi!
>
> Kay Sievers wrote:
>> On Fri, May 1, 2009 at 15:18, Alan Cox <al...@lxorguk.ukuu.org.uk> wrote:
>>
>>>>No tool ever has a chance to get to the information only available at
>>>>early kernel init. All such tools will need to "replay" what the
>>>>kernel already did. This is intended to save us from doing this, and
>>>>retain the information which is there, but lost at the moment the
>>>>tools have the first chance to run.
>>>
>>>Serious question - which is the better problem to fix ?
>>
>>
>> I'm very sure, you can not fix it outside the kernel. Or do you have
>> an idea how to create the missing device nodes for device without
>> crawling sysfs, when the first userspace process is started?
>
> Well, what about this? Let the kernel buffer all events until udev is
> ready to process them. Once it signals that it's up and running - the
> best method for that is tbd -, empty the queue, and then discard it,
> including the additional code. If udev doesn't come up in time after
> init has started, or the buffer overflows, assume the system is using a
> static /dev and throw away the queue as well. If udev starts too late,
> it can still do a coldplug - in that case, boot speed is already so low
> that the additional delay hardly matters.

Already answered by Kay.
<http://groups.google.com/group/linux.kernel/msg/9851b3f5e82d65ef>

But the premise of this question is bogus anyway. You don't gain
anything by capturing events earlier; replaying them is really quite
cheap, and extracting device numbers and names from sysfs is even
cheaper. The gains are in reducing how much processing you need to do
on them. Both in terms of performance, and in making it
simple/neutral enough to be considered for kernel inclusion, which in
turn brings a number of small benefits.
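To make "extracting device numbers and names from sysfs" concrete, here is a rough sketch of such a crawl. It is not what udev actually does. It assumes every device directory holds a `dev` file containing "major:minor", takes the node name from the directory name, and treats anything below a `block` path component as a block device. The function name `sysfs_to_mknod` is made up for illustration, and it only prints the `mknod` commands instead of running them.

```shell
#!/bin/sh
# Sketch only: print one mknod command per device found under a
# sysfs-like tree. sysfs_to_mknod is a hypothetical helper name.
# Assumptions: each device directory contains a "dev" file holding
# "major:minor"; paths with a "block" component are block devices.
sysfs_to_mknod() {
    sysroot=$1      # e.g. /sys
    devdir=$2       # e.g. /dev
    find "$sysroot" -name dev -type f | sort | while read -r devfile; do
        # split "major:minor" on the colon
        IFS=: read -r major minor < "$devfile"
        # use the directory name as the node name (kernel-supplied name)
        name=$(basename "$(dirname "$devfile")")
        case $devfile in
            */block/*) type=b ;;
            *) type=c ;;
        esac
        echo "mknod $devdir/$name $type $major $minor"
    done
}
```

Calling `sysfs_to_mknod /sys /dev` would list the commands a coldplug-style pass needs to run; the walk itself is cheap, the cost is in having a process up early enough to do it.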

Alan

Alan Jenkins

2 May 2009, 16:30:12

On 5/2/09, Kay Sievers <kay.s...@vrfy.org> wrote:
> On Sat, May 2, 2009 at 09:16, Christoph Hellwig <h...@infradead.org> wrote:
>> >After the rootfs is mounted by the kernel, the
>>> populated tmpfs is mounted at /dev. In initramfs, it can be moved
>>> to the manually mounted root filesystem before /sbin/init is
>>> executed.
>>
>> That for example is something that is not acceptable. We really don't
>> want the kernel to mess with the initial namespace in such a major way.
>
> There is nothing like "mess around", it's not mounted at all, until
> the kernel mounts the root filesystem at /, then devtmpfs is mounted
> the first time, and only if it's compiled in because you asked for it.
> Also, just try:
> egrep 'mknod|create_dev' init/*.c
> and see what we currently do.
>
>> Counter-proposal: Re-introduce a proper mini-devfs. All nodes in there
>> are kernel-created and not changeable which sorts out that whole
>> mess of both drivers and userspace messing with tree topology we had
>> both in original devfs and this new devtmpfs. Single-instance so it can
>> be
>> populated before it's actually mounted somewhere, that way the kernel
>> doesn't have to do any policy decision on where it's mounted.
>
> That sounds worse than devtmpfs, and does not help for most of the
> mentioned problems we are trying to solve here.

On a narrow issue: do you really object to moving the "mount dev -t
devfs2 /dev" into userspace (and therefore giving it a user-visible
name)?? That would address Christoph's particular objection about
"messing around" with the initial namespace. It means I can be 100%
sure I can boot an old initramfs with this option enabled. And it
gives a nice clean way for new initramfs' to test for this feature -
when they try to mount it, it fails. It would seem to make for a
rather smoother migration path.

It shouldn't mean too many more LOC, you're already doing the "single
instance" thing.

Thanks
Alan

Kay Sievers

2 May 2009, 17:40:07

On Sat, May 2, 2009 at 22:22, Alan Jenkins
<sourcej...@googlemail.com> wrote:

> On a narrow issue: do you really object to moving the "mount dev -t
> devfs2 /dev" into userspace (and therefore giving it a user-visible
> name)?? That would address Christoph's particular objection about
> "messing around" with the initial namespace.

An argument which does not stand at all, there is no mess, it is not
mounted at all until we are ready with initializing the kernel. And
instead of creating some random nodes in /dev like we do today, we
mount it, and it contains a node for every device. I hardly see any of
the mentioned "namespace mess" here, it's just the simplest, most
robust, and most efficient thing we can do. :)

> It means I can be 100%
> sure I can boot an old initramfs with this option enabled.

Oh, it does not change anything for an existing initramfs if that
option is enabled. After the initramfs has found and mounted the real
rootfs at /root, you are totally free to call:
mount --move /dev/ /root/dev
or not to do that, like we do today.

> And it
> gives a nice clean way for new initramfs' to test for this feature -
> when they try to mount it, it fails. It would seem to make for a
> rather smoother migration path.

I think that is all covered just fine.

One thing that I tried to solve by doing a kernel mounted fs, is that
/dev on the rootfs is completely empty, it is that way on some distros
today, and if you do init=/bin/sh, it will not work, because
/dev/console is missing.

Another reason why I would like to avoid a new fstype is that
userspace checks if /dev is a tmpfs to find out if it's a dynamic /dev.
Nothing really prevents us from doing a new filesystem, but we would
need a good reason to do it, I think.

Thanks,
Kay

Kay Sievers

2 May 2009, 17:50:07

On Sat, May 2, 2009 at 21:55, Alan Jenkins
<sourcej...@googlemail.com> wrote:
> On 5/2/09, Michael Riepe <michae...@googlemail.com> wrote:

>> Well, what about this? Let the kernel buffer all events until udev is
>> ready to process them. Once it signals that it's up and running - the
>> best method for that is tbd -, empty the queue, and then discard it,
>> including the additional code. If udev doesn't come up in time after
>> init has started, or the buffer overflows, assume the system is using a
>> static /dev and throw away the queue as well. If udev starts too late,
>> it can still do a coldplug - in that case, boot speed is already so low
>> that the additional delay hardly matters.
>
> Already answered by Kay.
> <http://groups.google.com/group/linux.kernel/msg/9851b3f5e82d65ef>
>
> But the premise of this question is bogus anyway. You don't gain
> anything by capturing events earlier; replaying them is really quite
> cheap, and extracting device numbers and names from sysfs is even
> cheaper. The gains are in reducing how much processing you need to do
> on them. Both in terms of performance, and in making it
> simple/neutral enough to be considered for kernel inclusion, which in
> turn brings a number of small benefits.

Exactly, we would add more code, make stuff more complex than it already
is, and gain basically nothing.

It is not about missing events, it is about the bootstrap step we
would like to avoid. Buffering events, or reading all device
information from sysfs at this stage introduces a hard checkpoint,
where we need to bring a process up to process these events and create
nodes for them. Only after that we can start other things in userspace
which depend on a working /dev, and with the kernel populated one, we
can just go ahead.

Even if it weren't the fastest we can do, which it probably is,
it would still be the simplest we can do, and that would win in most
cases.

Thanks,
Kay

Alan Jenkins

2 May 2009, 17:50:10

Also, AFAICS this would avoid a memory leak on umount.

In its original form, if you unmount it, you can't get it back. But
it doesn't get destroyed either; all the tmpfs nodes just hang around
in limbo, right?

It'd be even nicer if it somehow avoided consuming memory when it
isn't used. I guess that requires more code, looks more like a "real"
devfs, and as C. says is probably more sane if exported as a read
only. Hopefully it would make it possible to remove the code
afterwards, but that sounds like even more work.

But is read-only so bad? You just have to copy it over to a tmpfs and
then mount that on top of /dev. That's atomic, so it won't interfere
with parallel early init. I sympathize, devtmpfs is a really neat
hack that does exactly what udev needs. But you have to admit, it
doesn't fit in _quite_ as well with the kernel status quo.

Greg KH

2 May 2009, 18:00:10

On Sat, May 02, 2009 at 10:41:50PM +0100, Alan Jenkins wrote:
>
> But is read-only so bad? You just have to copy it over to a tmpfs and
> then mount that on top of /dev. That's atomic, so it won't interfere
> with parallel early init.

The copy would not be atomic.

> I sympathize, devtmpfs is a really neat hack that does exactly what
> udev needs. But you have to admit, it doesn't fit in _quite_ as well
> with the kernel status quo.

I disagree, it mirrors exactly what we are doing today from userspace,
which is quite the "status quo" in that distros have been doing that for
years now :)

thanks,

greg k-h

Kay Sievers

2 May 2009, 18:10:08

On Sat, May 2, 2009 at 23:41, Alan Jenkins

Sure you can't get it back if you drop it, but what would be the point
of un-mounting it and wanting it back? It's just like every other
tmpfs too. :)

> It'd be even nicer if it somehow avoided consuming memory when it
> isn't used.

It's probably easy to make it destroy itself, and free everything, if
it's un-mounted, if that solves your concerns about built-in but not
used.

> I guess that requires more code, looks more like a "real"
> devfs, and as C. says is probably more sane if exported as a read
> only. Hopefully it would make it possible to remove the code
> afterwards, but that sounds like even more work.

Then you should probably not enable it, or have a boot option to
disable it. What's the point in removing it later, it is exactly what
you want, and what everybody has today, a /dev on tmpfs, just that the
kernel will put the default nodes in there.

> But is read-only so bad? You just have to copy it over to a tmpfs and
> then mount that on top of /dev. That's atomic, so it won't interfere
> with parallel early init. I sympathize, devtmpfs is a really neat
> hack that does exactly what udev needs. But you have to admit, it
> doesn't fit in _quite_ as well with the kernel status quo.

I think it fits quite nicely as a "device node companion" to sysfs. And
I don't really see the point of a read-only mount, it should be the
/dev that is used on the real system, until you shut it down.

Thanks,
Kay

Alan Jenkins

2 May 2009, 18:10:10

On 5/2/09, Kay Sievers <kay.s...@vrfy.org> wrote:

> On Sat, May 2, 2009 at 22:22, Alan Jenkins
> <sourcej...@googlemail.com> wrote:
>
>> On a narrow issue: do you really object to moving the "mount dev -t
>> devfs2 /dev" into userspace (and therefore giving it a user-visible
>> name)?? That would address Christoph's particular objection about
>> "messing around" with the initial namespace.
>
> An argument which does not stand at all, there is no mess, it is not
> mounted at all until we are ready with initializing the kernel. And
> instead of creating some random nodes in /dev like we do today, we
> mount it, and it contains a node for every device. I hardly see any of
> the mentioned "namespace mess" here, it's just the simplest, most
> robust, and most efficient thing we can do. :)
>
>> It means I can be 100%
>> sure I can boot an old initramfs with this option enabled.
>
> Oh, it does not change anything for an existing initramfs if that
> option is enabled. After the initramfs has found and mounted the real
> rootfs at /root, you are totally free to call:
> mount --move /dev/ /root/dev
> or not to do that, like we do today.

Sorry, you're right. I should go to bed :-).

It would matter if you had a different naming scheme for /dev than the
kernel, and you were trying to get away with a static /dev. I can't
believe anyone important does that though :-).

>> And it
>> gives a nice clean way for new initramfs' to test for this feature -
>> when they try to mount it, it fails. It would seem to make for a
>> rather smoother migration path.
>
> I think that is all covered just fine.

Oh, I see.

grep "/dev" /proc/mounts > /dev/null

> One thing that I tried to solve by doing a kernel mounted fs, is that
> /dev on the rootfs is completely empty, it is that way on some distros
> today, and if you do init=/bin/sh, it will not work, because
> /dev/console is missing.
>
> Another thing, why I would like to avoid a new fstype is that
> userspace checks if /dev is a tmpfs to find out if it's a dynamic /dev
> - nothing really that should prevent us from doing a new filesystem,
> but we should need a good reason to do it, I think.

I thought udev was documented somewhere as compatible with a non-tmpfs
/dev, in a "just because you could" sort of way. I've seen something
test for tmpfs... nevermind, it's probably something different. (Just
the init script that checks whether /dev has been mounted that way by
an initramfs, or the user decided to do without an initramfs, so the
rootfs gets to mount it instead).

Good night
Alan

Michael Tokarev

3 May 2009, 03:30:09

Alan Jenkins wrote:
> On 5/2/09, Kay Sievers <kay.s...@vrfy.org> wrote:

[]

>>> And it
>>> gives a nice clean way for new initramfs' to test for this feature -
>>> when they try to mount it, it fails. It would seem to make for a
>>> rather smoother migration path.
>> I think that is all covered just fine.
>
> Oh, I see.
>
> grep "/dev" /proc/mounts > /dev/null

Oh no. This one will match all your /dev/sda1 etc too :)
What you want to do is either

grep -w /dev /proc/mounts

or

awk '$2 == "/dev" { exit 0; } END { exit 1; }' /proc/mounts

or something similar. To distinguish between the following
two cases:

/dev/sda1 / ext3 rw,noatime,data=ordered 0 0
devfs /dev tmpfs rw,nosuid,size=512k,mode=755 0 0
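
A minimal sketch of the field-based check, wrapped so it can be pointed at any mounts table. The function name `is_mount_point` and its optional second argument are illustrative, not part of any existing tool. Comparing the whole second field matters: `grep -w` treats `/` as a non-word character, so a pattern like `/dev` can still match a line that only mentions `/dev/sda1`.

```shell
#!/bin/sh
# Sketch only: succeed if the given path appears as a mount point,
# i.e. as the second field of a line in a mounts table.
# is_mount_point is a hypothetical helper; the table defaults to
# /proc/mounts but can be any file in the same format.
is_mount_point() {
    target=$1
    table=${2:-/proc/mounts}
    # exit status 0 when a matching line was seen, 1 otherwise
    awk -v t="$target" '$2 == t { found = 1 } END { exit !found }' "$table"
}
```

An init script could then write `if is_mount_point /dev; then ...` and would not be fooled by a root device named /dev/sda1.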

/mjt

Lars Marowsky-Bree

4 May 2009, 12:30:20

On 2009-05-02T23:47:03, Kay Sievers <kay.s...@vrfy.org> wrote:

> It is not about missing events, it is about the bootstrap step we
> would like to avoid. Buffering events, or reading all device
> information from sysfs at this stage introduces a hard checkpoint,
> where we need to bring a process up to process these events and create
> nodes for them. Only after that we can start other things in userspace
> which depend on a working /dev, and with the kernel populated one, we
> can just go ahead.

So if udevd was present before the kernel started scanning devices this
would not be as much of a problem, would it?

(Couldn't udevd trigger the device scan? Could it be started so early as
for this not to matter?)

And isn't part of the problem that you have a choke point in the
dependency graph - namely "/dev working" as a predicate for running
certain services/scripts - instead of more fine-grained dependencies for
just the devices they need?

Regards,
Lars

--
SuSE Labs, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

Kay Sievers

4 May 2009, 13:00:14

On Mon, May 4, 2009 at 18:20, Lars Marowsky-Bree <l...@suse.de> wrote:
> On 2009-05-02T23:47:03, Kay Sievers <kay.s...@vrfy.org> wrote:
>
>> It is not about missing events, it is about the bootstrap step we
>> would like to avoid. Buffering events, or reading all device
>> information from sysfs at this stage introduces a hard checkpoint,
>> where we need to bring a process up to process these events and create
>> nodes for them. Only after that we can start other things in userspace
>> which depend on a working /dev, and with the kernel populated one, we
>> can just go ahead.
>
> So if udevd was present before the kernel started scanning devices this
> would not be as much of a problem, would it?

It is too early, you would need a working userspace environment to
start a process like udev, which we don't have at this point. You
really want to catch the creation of /dev/null and /dev/console and
such, and there is no real chance to run anything like udev reliably
at this stage.

> (Couldn't udevd trigger the device scan? Could it be started so early as
> for this not to matter?)

The driver-core is basically "80% of basic udev" in the kernel - the
/dev part is just the last missing piece. :)

The problem is not the missing events, they could be pretty easily
recovered from sysfs with just another special hack to run at bootup -
it's the time it takes to bring up the engine to bootstrap /dev, to
allow us to start any other process which looks for devices. Today,
udev mounts /dev as a tmpfs, and at that point it is obviously empty,
and needs to be filled, and nothing else can reliably run at that
time.

> And isn't part of the problem that you have a choke point in the
> dependency graph - namely "/dev working" as a predicate for running
> certain services/scripts - instead of more fine-grained dependencies for
> just the devices they need?

Without the basic devices, you cannot start anything "normal", and
currently it's udev which provides /dev/null, /dev/console,
/dev/random, the ttys, or whatever else might be needed; you cannot
know that in advance.
The plan is to start udevd, but run the coldplug in the background
and start other stuff in parallel, because you can be sure that all
currently known devices are already there, and the missing meaningful
symlinks created by udev will show up soon, along with a new event to
hook into. There will be no hard checkpoint anymore to wait for the
basic environment.

The other important reason besides that it saves us from coming up
with just another custom hack to fill the initial /dev, is that it is
damn simple and very reliable. It allows init=/bin/sh to just work,
and /dev has the same logic when an initramfs is used, or the kernel
mounts the rootfs. And simpler is almost always better - and I think
it's also the fastest we can do.
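
The ordering described above (start udevd, let the coldplug run in the background, start other things right away) could look roughly like the sketch below. It is only an illustration of the idea, not an existing init script. The helper names `run` and `start_with_background_coldplug` and the `DRY_RUN` switch are assumptions added so the sequence can be inspected without touching a real system; the udevd and udevadm invocations are the ones used in the initramfs script later in this thread.

```shell
#!/bin/sh
# Sketch only. run and start_with_background_coldplug are hypothetical
# helpers; with DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

start_with_background_coldplug() {
    run udevd --daemon
    # coldplug in the background: the kernel-created nodes are assumed
    # to be in /dev already, only symlinks and permissions are missing
    ( run udevadm trigger && run udevadm settle --timeout=3 ) &
    coldplug_pid=$!
    # whatever is passed as arguments runs while the coldplug is in flight
    if [ "$#" -gt 0 ]; then
        run "$@"
    fi
    wait "$coldplug_pid"
}
```

Without a kernel-populated /dev, the `udevadm settle` step would have to finish before anything else may start; here it only delays the final `wait`.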

Thanks,
Kay

Michael Riepe

4 May 2009, 14:00:12

Hi!

Kay Sievers wrote:

> The problem is not the missing events, they could be pretty easily
> recovered from sysfs with just another special hack to run at bootup -
> it's the time it takes to bring up the engine to bootstrap /dev, to
> allow us to start any other process which looks for devices. Today,
> udev mounts /dev as a tmpfs, and at that point it is obviously empty,
> and needs to be filled, and nothing else can reliably run at that
> time.

And what about mounting /dev from an already populated image? Or, even
faster, using the /dev directory of the root fs? That way, the device
nodes would be present as soon as / is mounted, without any additional
overhead, except the very first time the system boots (in case you
choose not to populate /dev with a default set of device nodes in advance).

I know, not using tmpfs is a security risk and whatever. But does that
really matter in an embedded system where you have no user accounts?

[...]

> The plan is to start udevd, but run the coldplug in the background
> and start other stuff in parallel, because you can be sure that all
> currently known devices are already there, and the missing meaningful
> symlinks created by udev will show up soon, along with a new event to
> hook into. There will be no hard checkpoint anymore to wait for the
> basic environment..

You can do that with a persistent /dev as well. It will even keep the
symlinks udev created before the system rebooted. The only drawback is
that you have to wait for device nodes that belong to new devices which
were connected to the system while it was down. But that rarely happens,
and will eventually be fixed by udev.

> The other important reason besides that it saves us from coming up
> with just another custom hack to fill the initial /dev, is that it is
> damn simple and very reliable.

Pardon me, but I have to ask this: Isn't your patch a custom hack, too?

--
Michael "Tired" Riepe <michae...@googlemail.com>
X-Tired: Each morning I get up I die a little

Kay Sievers

4 May 2009, 14:20:14

On Mon, May 4, 2009 at 19:54, Michael Riepe
<michae...@googlemail.com> wrote:
>> The problem is not the missing events, they could be pretty easily
>> recovered from sysfs with just another special hack to run at bootup -
>> it's the time it takes to bring up the engine to bootstrap /dev, to
>> allow us to start any other process which looks for devices. Today,
>> udev mounts /dev as a tmpfs, and at that point it is obviously empty,
>> and needs to be filled, and nothing else can reliably run at that
>> time.
>
> And what about mounting /dev from an already populated image? Or, even
> faster, using the /dev directory of the root fs? That way, the device
> nodes would be present as soon as / is mounted, without any additional
> overhead, except the very first time the system boots (in case you
> choose not to populate /dev with a default set of device nodes in advance).

Dynamic device numbers! A static /dev does not work at all for many
subsystems, not to mention the risk you take by talking to the wrong
device pointed to, by your incorrect static device nodes. It's not an
option at all today, and it will get much worse in the future.

> I know, not using tmpfs is a security risk and whatever. But does that
> really matter in an embedded system where you have no user accounts?

If you have a very limited environment, and no hotplug setups, sure
you can do that. But some of us have not even sd* as stable numbers
anymore.

>> The plan is to start udevd, but run the coldplug in the background
>> and start other stuff in parallel, because you can be sure that all
>> currently known devices are already there, and the missing meaningful
>> symlinks created by udev will show up soon, along with a new event to
>> hook into. There will be no hard checkpoint anymore to wait for the
>> basic environment..
>
> You can do that with a persistent /dev as well. It will even keep the
> symlinks udev created before the system rebooted. The only drawback is
> that you have to wait for device nodes that belong to new devices which
> were connected to the system while it was down. But that rarely happens,
> and will eventually be fixed by udev.

What? Keeping the links is insane, many things can go wrong with that.
Udev does not work reliably on non-tmpfs - you can do that, but don't
be surprised if stuff goes wrong and talks to the wrong devices.
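
One of the failure modes of a persistent /dev is easy to show: links left over from a previous boot whose target no longer exists. The sketch below lists them; `stale_links` is a made-up helper name. Note that it only catches dangling links. A link that now points at an existing but different device, which is the more dangerous case, cannot be detected this way.

```shell
#!/bin/sh
# Sketch only: print symlinks below a directory whose target does not
# exist. stale_links is a hypothetical helper name.
stale_links() {
    # "test -e" follows the link, so it fails for dangling links
    find "$1" -type l ! -exec test -e {} \; -print
}
```

On a tmpfs-backed /dev the list starts out empty on every boot, because nothing survives the reboot in the first place.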

Thanks,
Kay

Michael Riepe

4 May 2009, 15:00:26

Kay Sievers wrote:
> On Mon, May 4, 2009 at 19:54, Michael Riepe
> <michae...@googlemail.com> wrote:
>
>>>The problem is not the missing events, they could be pretty easily
>>>recovered from sysfs with just another special hack to run at bootup -
>>>it's the time it takes to bring up the engine to bootstrap /dev, to
>>>allow us to start any other process which looks for devices. Today,
>>>udev mounts /dev as a tmpfs, and at that point it is obviously empty,
>>>and needs to be filled, and nothing else can reliably run at that
>>>time.
>>
>>And what about mounting /dev from an already populated image? Or, even
>>faster, using the /dev directory of the root fs? That way, the device
>>nodes would be present as soon as / is mounted, without any additional
>>overhead, except the very first time the system boots (in case you
>>choose not to populate /dev with a default set of device nodes in advance).
>
>
> Dynamic device numbers! A static /dev does not work at all for many
> subsystems, not to mention the risk you take by talking to the wrong
> device pointed to, by your incorrect static device nodes. It's not an
> option at all today, and it will get much worse in the future.

Maybe it's just me, but my devices end up being numbered the same after
every reboot. Unless I add or remove devices to/from the system, of course.

Unfortunately, that doesn't mean that it will always stay that way.

--
Michael "Tired" Riepe <michae...@googlemail.com>
X-Tired: Each morning I get up I die a little

Kay Sievers

4 May 2009, 15:20:17

On Mon, May 4, 2009 at 20:55, Michael Riepe
<michae...@googlemail.com> wrote:
>> Dynamic device numbers! A static /dev does not work at all for many
>> subsystems, not to mention the risk you take by talking to the wrong
>> device pointed to, by your incorrect static device nodes. It's not an
>> option at all today, and it will get much worse in the future.
>
> Maybe it's just me, but my devices end up being numbered the same after
> every reboot. Unless I add or remove devices to/from the system, of course.

Sure, that works fine for most things, and will continue to do so. But
there are entire subsystems in the kernel, mostly newer ones, which do
not have any static number assignment and require dynamic /dev
support.

It's not about /dev/null and /dev/tty and such, that will not change,
but there is already an option in the upstream kernel to assign
dynamic numbers to sd* disks to allow more than 15 partitions, and the
number of subsystems requiring dynamic numbers will likely grow.
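
For reference, the 15-partition limit follows from the classic static layout, where each sd disk on major 8 owns 16 consecutive minors (sda is 8:0, sdb is 8:16). A small sketch of that arithmetic, covering only the first 16 disks on major 8; `sd_static_minor` is an illustrative name, not a kernel or udev function:

```shell
#!/bin/sh
# Sketch only: minor number of an sd disk or partition under the
# classic static scheme. sd_static_minor is a hypothetical helper.
# Each disk owns 16 minors: 0 is the whole disk, 1..15 are partitions.
sd_static_minor() {
    disk=$1     # 0 for sda, 1 for sdb, ...
    part=$2     # 0 for the whole disk
    if [ "$part" -gt 15 ]; then
        # partition 16 of sda would collide with the minor of sdb
        echo "partition $part does not fit the static scheme" >&2
        return 1
    fi
    echo $(( disk * 16 + part ))
}
```

Anything beyond partition 15 needs a number that is handed out at runtime, which a static /dev cannot know in advance.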

The thing today is that userspace on distro systems messes around in
/dev and assigns access control lists to grant device-access to
specific users, udev creates many symlinks there to identify your
devices, because the kernel names get reordered, and nothing will ever
revert /dev to the state where it came from at boot.
So the wider support for a static /dev is kind of fading away,
stuff that works today statically will likely continue to work, but
all the new stuff will not work with a static /dev. That's one part of
what devtmpfs addresses.

Thanks,
Kay

Greg KH

4 May 2009, 15:40:13

On Mon, May 04, 2009 at 08:55:34PM +0200, Michael Riepe wrote:
>
>
> Kay Sievers wrote:
> > On Mon, May 4, 2009 at 19:54, Michael Riepe
> > <michae...@googlemail.com> wrote:
> >
> >>>The problem is not the missing events, they could be pretty easily
> >>>recovered from sysfs with just another special hack to run at bootup -
> >>>it's the time it takes to bring up the engine to bootstrap /dev, to
> >>>allow us to start any other process which looks for devices. Today,
> >>>udev mounts /dev as a tmpfs, and at that point it is obviously empty,
> >>>and needs to be filled, and nothing else can reliably run at that
> >>>time.
> >>
> >>And what about mounting /dev from an already populated image? Or, even
> >>faster, using the /dev directory of the root fs? That way, the device
> >>nodes would be present as soon as / is mounted, without any additional
> >>overhead, except the very first time the system boots (in case you
> >>choose not to populate /dev with a default set of device nodes in advance).
> >
> >
> > Dynamic device numbers! A static /dev does not work at all for many
> > subsystems, not to mention the risk you take by talking to the wrong
> > device pointed to, by your incorrect static device nodes. It's not an
> > option at all today, and it will get much worse in the future.
>
> Maybe it's just me, but my devices end up being numbered the same after
> every reboot. Unless I add or remove devices to/from the system, of course.

It's just you :)

I have machines here that enumerate their USB devices on every other
boot in different ways.

I have one fun laptop here that enumerates the PCI bus in different ways
depending on how it feels at the moment.

Nothing about either of these busses is deterministic, nor will it be
changing to be that way in the future.

> Unfortunately, that doesn't mean that it will always stay that way.

Exactly, it will not.

thanks,

greg k-h

Kay Sievers

6 May 2009, 09:00:23

On Thu, 2009-04-30 at 15:23 +0200, Kay Sievers wrote:
> From: Kay Sievers <kay.s...@vrfy.org>
> Subject: driver-core: devtmpfs - driver core maintained /dev tmpfs

Below are some numbers. Appended is the initramfs /init script I used, which
gets very simple with devtmpfs compared to what we need to do today to handle
and fill an empty /dev after mounting its own tmpfs there.

This simple initramfs example still supports udev by-{label,uuid,id} device
name links, and other setups which need udev to find/setup/assemble
the root device.

Thanks,
Kay

kvm: initramfs mount-by-label
[ 2.018622] Freeing unused kernel memory: 2172k freed
[ 2.034799] initramfs: starting ...
[ 2.093656] initramfs: looking for /dev/disk/by-label/root
[ 2.121746] initramfs: starting udev
[ 2.141221] udev: starting version 142
[ 2.223777] initramfs: trigger block events
[ 2.465649] initramfs: mounting /dev/disk/by-label/root
[ 2.937051] kjournald starting. Commit interval 5 seconds
[ 2.941167] EXT3 FS on sda1, internal journal
[ 2.962522] EXT3-fs: mounted filesystem with writeback data mode.
[ 3.122292] initramfs: switching to root filesystem /dev/disk/by-label/root and start /bin/bash

kvm: initramfs mount-by-kernel-node
[ 1.939865] Freeing unused kernel memory: 2172k freed
[ 1.957918] initramfs: starting ...
[ 2.226510] initramfs: looking for /dev/sda1
[ 2.249192] initramfs: mounting /dev/sda1
[ 2.731584] kjournald starting. Commit interval 5 seconds
[ 2.736040] EXT3 FS on sda1, internal journal
[ 2.779422] EXT3-fs: mounted filesystem with writeback data mode.
[ 2.866554] initramfs: switching to root filesystem /dev/sda1 and start /bin/bash

kvm: direct kernel mount
[ 2.425663] kjournald starting. Commit interval 5 seconds
[ 2.434656] EXT3 FS on sda1, internal journal
[ 2.506279] EXT3-fs: mounted filesystem with writeback data mode.
[ 2.517395] VFS: Mounted root (ext3 filesystem) on device 259:524288.
[ 2.550488] devtmpfs: mounted
[ 2.555379] Freeing unused kernel memory: 2172k freed

real box: initramfs and mount-by-label
[ 1.513740] Freeing unused kernel memory: 2172k freed
[ 1.519782] initramfs: starting ...
[ 1.532795] initramfs: looking for /dev/disk/by-label/root
[ 1.533226] initramfs: starting udev
[ 1.537925] initramfs: trigger block events
[ 1.538715] udev: starting version 142
[ 1.647484] initramfs: mounting /dev/disk/by-label/root
[ 1.657646] kjournald starting. Commit interval 5 seconds
[ 1.657804] EXT3 FS on sda1, internal journal
[ 1.657864] EXT3-fs: mounted filesystem with writeback data mode.
[ 1.687084] initramfs: switching to root filesystem /dev/disk/by-label/root and start /sbin/init

real box: kernel mount
[ 1.329632] kjournald starting. Commit interval 5 seconds
[ 1.329756] EXT3 FS on sda1, internal journal
[ 1.329814] EXT3-fs: mounted filesystem with writeback data mode.
[ 1.329867] VFS: Mounted root (ext3 filesystem) on device 259:524288.
[ 1.330313] devtmpfs: mounted
[ 1.330377] async_waiting @ 1
[ 1.330381] async_continuing @ 1 after 0 usec
[ 1.330424] Freeing unused kernel memory: 2172k freed

#!/bin/sh

# minimal initramfs with persistent device name support
# requires devtmpfs support provided by the kernel

shell() {
    echo "initramfs: dropping to shell"
    # attach stdin, stdout and stderr to the console
    exec >/dev/console 2>&1 </dev/console
    sh -i
}

# print the value of a "name=" kernel argument, or just succeed for a bare flag
getarg() {
    local o
    for o in $cmdline; do
        test "$o" = "$1" && return 0
        if test "${o%%=*}" = "${1%=}"; then
            echo "${o#*=}"
            return 0
        fi
    done
    return 1
}

export PATH=/sbin:/bin:/usr/sbin:/usr/bin
export TERM=linux

# console redirect
exec >/dev/kmsg 2>&1 </dev/console
echo "initramfs: starting ..."

# mount needed filesystems
mount -t proc /proc /proc >/dev/null
mount -t sysfs /sys /sys >/dev/null
mount -t devpts -o gid=5,mode=620 /dev/pts /dev/pts

# kernel commandline
read cmdline </proc/cmdline;

# root we are looking for
root=$(getarg root=)
echo "initramfs: looking for $root"

# shortcut, in case root is already there
if test -e "$root"; then
    echo "initramfs: mounting $root"
    mount "$root" /root
    mounted=yes
fi

# we need to start udev to create the symlink we are looking for
if test -z "$mounted"; then
    echo "initramfs: starting udev"
    udevd --daemon --resolve-names=never
    # create links for already existing block devices
    echo "initramfs: trigger block events"
    udevadm trigger --subsystem-match=block
    udevadm settle --timeout=3

    # try if the link is there now
    if test -e "$root"; then
        echo "initramfs: mounting $root"
        mount "$root" /root
        mounted=yes
    fi

    if test -z "$mounted"; then
        # load modules to let the root device show up
        echo "initramfs: trigger module events"
        udevadm trigger --attr-match=modalias

        echo "initramfs: waiting for $root"
        # loop until root shows up
        while test -z "$mounted"; do
            if test -e "$root"; then
                echo "initramfs: mounting $root"
                mount "$root" /root && mounted=yes
            fi
            sleep 0.05
        done
    fi

    # kill udev
    kill $(pidof udevd) >/dev/null 2>&1
fi

# move filesystems over to the mounted root
mount --move /dev /root/dev
mount --move /proc /root/proc
mount --move /sys /root/sys

# move into root
cd /root
init=$(getarg init=)
test -z "$init" && init=/sbin/init
echo "initramfs: switching to root filesystem $root and start $init"
exec run-init -c /dev/console /root $init
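
For readers who want to see how the getarg helper above treats the kernel command line, here is a small standalone sketch. The command line string is made up for illustration; in the real script it is read from /proc/cmdline.

```shell
#!/bin/sh
# getarg as used in the initramfs script above; it reads the global $cmdline
getarg() {
    local o
    for o in $cmdline; do
        # bare flag such as "quiet": succeed without printing anything
        test "$o" = "$1" && return 0
        # "name=value": print the value when the name matches
        if test "${o%%=*}" = "${1%=}"; then
            echo "${o#*=}"
            return 0
        fi
    done
    return 1
}

# hypothetical command line, for illustration only
cmdline="root=/dev/disk/by-label/root init=/bin/bash quiet"

getarg root=                               # prints /dev/disk/by-label/root
getarg init=                               # prints /bin/bash
getarg quiet && echo "quiet is set"        # prints quiet is set
getarg rescue || echo "rescue is not set"  # prints rescue is not set
```

An argument that is absent makes getarg return 1, which is why the script can fall back to /sbin/init when init= is not given.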

Arjan van de Ven

6 May 2009, 21:40:07

On Thu, 30 Apr 2009 15:23:42 +0200
Kay Sievers <kay.s...@vrfy.org> wrote:

> From: Kay Sievers <kay.s...@vrfy.org>
> Subject: driver-core: devtmpfs - driver core maintained /dev tmpfs
>

> Devtmpfs lets the kernel create a tmpfs very early at kernel
> initialization, before any driver core device is registered. Every
> device with a major/minor will have a device node created in this
> tmpfs instance. After the rootfs is mounted by the kernel, the

> populated tmpfs is mounted at /dev. In initramfs, it can be moved
> to the manually mounted root filesystem before /sbin/init is
> executed.

so just to state the obvious: this code is not needed to boot fast.
It is mostly a workaround for having a bad initrd; if you don't use an
initrd, or if you use an initrd that's made with the right device nodes
in it already, you really just don't need this.

I would much rather that you just fix your initrd... than to put this
sort of thing into the kernel....

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

Kay Sievers

6 May 2009, 22:10:12

On Thu, May 7, 2009 at 03:41, Arjan van de Ven <ar...@infradead.org> wrote:
> On Thu, 30 Apr 2009 15:23:42 +0200
> Kay Sievers <kay.s...@vrfy.org> wrote:
>
>> From: Kay Sievers <kay.s...@vrfy.org>
>> Subject: driver-core: devtmpfs - driver core maintained /dev tmpfs
>>
>> Devtmpfs lets the kernel create a tmpfs very early at kernel
>> initialization, before any driver core device is registered. Every
>> device with a major/minor will have a device node created in this
>> tmpfs instance. After the rootfs is mounted by the kernel, the
>> populated tmpfs is mounted at /dev. In initramfs, it can be moved
>> to the manually mounted root filesystem before /sbin/init is
>> executed.
>
> so just to state the obvious: this code is not needed to boot fast.
> It is mostly a workaround for having a bad initrd; if you don't use an
> initrd, or if you use an initrd that's made with the right device nodes
> in it already, you really just don't need this.
>
> I would much rather that you just fix your initrd... than to put this
> sort of thing into the kernel....

How will you solve the dynamic device numbers? They are a complete
reality today.

Thanks,
Kay

Arjan van de Ven

6 May 2009, 22:30:19

On Thu, 7 May 2009 04:08:33 +0200
Kay Sievers <kay.s...@vrfy.org> wrote:
>
> How will you solve the dynamic device numbers? They are a complete
> reality today.

not for storage though... and for the rest udev is fast enough in
practice....

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

Kay Sievers

6 May 2009, 22:50:11

On Thu, May 7, 2009 at 04:25, Arjan van de Ven <ar...@infradead.org> wrote:
> On Thu, 7 May 2009 04:08:33 +0200
> Kay Sievers <kay.s...@vrfy.org> wrote:
>>
>> How will you solve the dynamic device numbers? They are a complete
>> reality today.
>
> not for storage though...

Oh sure:
cat /sys/class/block/sd*/dev
259:524288
259:262144
259:786432
259:131072
259:0
259:393216
259:655360

That's the box I write this from. It's experimental now but in the
upstream kernel. Distros want to support more than 15 partitions, so
that will happen sooner rather than later.

And you may talk to the wrong rtc device, if you use static nodes, and
so on. There are just too many things that can go wrong, and distros
that don't have a very limited set of hardware, can not take the risk
of a static /dev. They don't have this today, and will not go back to
it. That's why we are coming up with this.
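
The stale-node risk described above can be checked by hand. The following is only a sketch, not part of the patch: it assumes a `stat` that supports `-c '%t %T'` (GNU coreutils or busybox), and that a node in /dev carries the same name as its sysfs entry. The function names are made up for this example.

```shell
#!/bin/sh
# Sketch: find /dev nodes whose numbers disagree with what sysfs reports.

# Convert the hex "major minor" pair printed by `stat -c '%t %T'`
# into the decimal MAJOR:MINOR form that sysfs uses.
hex_to_devnum() {
    echo "$((0x$1)):$((0x$2))"
}

# Succeed if NODE carries the device number EXPECTED (decimal MAJOR:MINOR).
# Returns 2 if the node cannot be inspected at all.
node_has_devnum() {
    node=$1 expected=$2
    nums=$(stat -c '%t %T' "$node" 2>/dev/null) || return 2
    # $nums is left unquoted on purpose so it splits into two fields
    test "$(hex_to_devnum $nums)" = "$expected"
}

# Walk all block devices known to sysfs and report stale or missing nodes.
check_block_nodes() {
    for sysdev in /sys/class/block/*/dev; do
        test -e "$sysdev" || continue
        name=${sysdev%/dev}
        name=${name##*/}
        read -r expected < "$sysdev"
        node_has_devnum "/dev/$name" "$expected" ||
            echo "/dev/$name does not match $expected"
    done
}
```

Calling `check_block_nodes` on a box with a static /dev and dynamic block numbers would list the nodes that point at the wrong device; with a kernel-maintained /dev it should print nothing.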

> and for the rest udev is fast enough in
> practice....

Sure, but you still have the transition to the tmpfs /dev, and during
that you are very limited in what you can do, and that is what
devtmpfs addresses.

Thanks,
Kay

Eric W. Biederman

7 May 2009, 04:20:13

Have I read the summary of this properly?

sysfs is slow.
udev is slow.

Let's bloat the kernel to solve that problem?

Eric

Kay Sievers

7 May 2009, 05:40:09

On Thu, May 7, 2009 at 10:17, Eric W. Biederman <ebie...@xmission.com> wrote:

> sysfs is slow.
> udev is slow.

Everything is "slow" if you want to boot "fast". It's just numbers,
and this is more about simplicity and reliability, with just the side
effect that it's the fastest you can do, when you don't want to lose
current functionality.

> Let's bloat the kernel to solve that problem?

You want to add namespaces for devices to it?

Thanks,
Kay

Theodore Tso

7 May 2009, 10:50:40

On Thu, May 07, 2009 at 11:28:19AM +0200, Kay Sievers wrote:
> On Thu, May 7, 2009 at 10:17, Eric W. Biederman <ebie...@xmission.com> wrote:
>
> > sysfs is slow.
> > udev is slow.
>
> Everything is "slow" if you want to boot "fast". It's just numbers,
> and this is more about simplicity and reliability, with just the side
> effect that it's the fastest you can do, when you don't want to lose
> current functionality.

How much of this is because, say, udev is using shell scripts as
opposed to C? If the answer is you lose flexibility when you do it as
shell script fragments, I just spent 30 minutes with Alasdair chewing
on my ear about how udev's use of shell scripts is making dm/LVM's
life hard --- and I'll note that if we solve this problem by bloating
the kernel, you lose flexibility *anyway* since the kernel code will
be in C and not shell scripts. :-)

- Ted

Kay Sievers

7 May 2009, 11:20:18

On Thu, May 7, 2009 at 16:43, Theodore Tso <ty...@mit.edu> wrote:
> On Thu, May 07, 2009 at 11:28:19AM +0200, Kay Sievers wrote:
>> On Thu, May 7, 2009 at 10:17, Eric W. Biederman <ebie...@xmission.com> wrote:
>>
>> > sysfs is slow.
>> > udev is slow.
>>
>> Everything is "slow" if you want to boot "fast". It's just numbers,
>> and this is more about simplicity and reliability, with just the side
>> effect that it's the fastest you can do, when you don't want to lose
>> current functionality.
>
> How much of this is because, say, udev is using shell scripts as
> opposed to C? If the answer is you lose flexibility when you do it as
> shell script fragments, I just spent 30 minutes with Alasdair chewing
> on my ear about how udev's use of shell scripts is making dm/LVM's
> life hard --- and I'll note that if we solve this problem by bloating
> the kernel, you lose flexibility *anyway* since the kernel code will
> be in C and not shell scripts. :-)

Udev does not use any shell scripts besides one for firmware loading
and one composing /dev/disk/by-path/ link names for block devices
without metadata. Both are very simple and cheap, and do not cause any
known problems for anybody.

It's probably not udev's fault, why people chew on your ear. Sorry, I
don't think I can solve that problem. :)

Thanks,
Kay

Eric W. Biederman

9 May 2009, 20:30:12

Kay Sievers <kay.s...@vrfy.org> writes:

> On Thu, May 7, 2009 at 10:17, Eric W. Biederman <ebie...@xmission.com> wrote:
>
>> sysfs is slow.
>> udev is slow.
>
> Everything is "slow" if you want to boot "fast". It's just numbers,
> and this is more about simplicity and reliability, with just the side
> effect that it's the fastest you can do, when you don't want to lose
> current functionality.

My primary concern is that you are inventing a new mechanism when
the existing mechanism has known issues that could explain the slowdowns.

My secondary concern is that we are adding a new mechanism when
the existing mechanism has some significant known issues that are
not currently being addressed. Instead the next shiny bauble is
being tackled.

Finally, I am stumped how creating a couple hundred device nodes can be
a significant slowdown in booting. It feels like your patchset
adds a line of code for each device node it intends to create.

>> Let's bloat the kernel to solve that problem?
>
> You want to add namespaces for devices to it?

I think it is a logical conclusion, and it might potentially be needed.
But that is a long ways out.

Eric

Kay Sievers

9 May 2009, 21:00:23

On Sun, May 10, 2009 at 02:29, Eric W. Biederman <ebie...@xmission.com> wrote:
> Kay Sievers <kay.s...@vrfy.org> writes:
>
>> On Thu, May 7, 2009 at 10:17, Eric W. Biederman <ebie...@xmission.com> wrote:
>>
>>> sysfs is slow.
>>> udev is slow.
>>
>> Everything is "slow" if you want to boot "fast". It's just numbers,
>> and this is more about simplicity and reliability, with just the side
>> effect that it's the fastest you can do, when you don't want to lose
>> current functionality.
>
> My primary concern is that you are inventing a new mechanism when
> the existing mechanism has known issues that could explain the slowdowns.

You mean that mounting a new tmpfs at /dev has the issue of being
empty? And it takes time to fill it, where you can't reliably do other
things at the same time that might need device nodes? Yeah, I guess
that's an issue with the existing mechanism, and I'm interested in
your proposal to solve it.

Thanks,
Kay

Eric W. Biederman

9 May 2009, 22:20:13

Kay Sievers <kay.s...@vrfy.org> writes:

>> My primary concern is that you are inventing a new mechanism when
>> the existing mechanism has known issues that could explain the slowdowns.
>
> You mean that mounting a new tmpfs at /dev has the issue of being
> empty?

I was thinking more along the lines of a single global lock for all of
sysfs.

> And it takes time to fill it, where you can't reliably do other
> things at the same time that might need device nodes? Yeah, I guess
> that's an issue with the existing mechanism, and I'm interested in
> your proposal to solve it.

As for that question, why not?

initramfs has always come prepopulated with device nodes.

It isn't a challenge to mount a new tmpfs somewhere else, populate it
with the basics and then move it onto tmp.

I would be very surprised if you needed much more than /dev/console,
/dev/zero and /dev/null for reliable operation of most programs.
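
The approach described above could look roughly like the following sketch. It assumes the final target is /dev, uses a made-up staging directory, and picks node permissions arbitrarily; the major/minor numbers are the standard ones for console (5,1), null (1,3) and zero (1,5). The `run` wrapper and `DRYRUN` variable are inventions for this example, so that the commands can be printed without root.

```shell
#!/bin/sh
# Sketch: build a minimal /dev on a fresh tmpfs, then move it into place.
# Set DRYRUN=1 to print the commands instead of running them
# (running them for real needs root).

run() {
    if test -n "$DRYRUN"; then
        echo "$*"
    else
        "$@"
    fi
}

# populate_dev STAGING TARGET
populate_dev() {
    staging=$1 target=$2
    run mkdir -p "$staging" &&
    run mount -t tmpfs -o mode=0755 tmpfs "$staging" &&
    run mknod -m 600 "$staging/console" c 5 1 &&
    run mknod -m 666 "$staging/null" c 1 3 &&
    run mknod -m 666 "$staging/zero" c 1 5 &&
    run mount --move "$staging" "$target"
}
```

For example, `DRYRUN=1 populate_dev /newdev /dev` prints the six commands in order. Note that this only covers the three static nodes named above; it does nothing for devices that show up later or that have dynamic numbers.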

The rest just looks like ensuring you can deal with root (or whatever
device you care about) showing up after things start running.
With usb devices apparently not having a guaranteed maximum wait
and things like dhcp required for network connectivity I don't see
where adding code to wait for your device to show up would be
unnecessary logic.

That said it might make sense to be able to kick off udevd or similar
before /sbin/init was exec'd. I don't know how much that would gain.

Numbers like 2 seconds sound absolutely huge in cpu time. And I don't
have a clue why udev would take anywhere near that long even doing
everything synchronously. Have you instrumented things up to
see what takes that much time?

Eric

Pavel Machek

14 May 2009, 05:30:12

Hi!

> > so just to state the obvious: this code is not needed to boot fast.
> > It is mostly a workaround for having a bad initrd; if you don't use an
> > initrd, or if you use an initrd that's made with the right device nodes
> > in it already, you really just don't need this.
> >
> > I would much rather that you just fix your initrd... than to put this
> > sort of thing into the kernel....
>
> How will you solve the dynamic device numbers? They are a complete
> reality today.

Maybe dynamic device numbers are a bad idea to start with?

I'd like to keep static /dev, thank you... With big enough major/minor
space (which we have today, I believe), complexity of dynamic device
numbers can and should be avoided.

Plus... this patch 'hardcoded' device names. Why not 'hardcode'
device numbers in the same way, so that static /dev keeps working?
Namespace should be big enough...

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html