udev and devfs - The final word

Greg KH

Dec 30, 2003, 7:40:09 PM

(This text can be found at
kernel.org/pub/linux/utils/kernel/hotplug/udev_vs_devfs for those who
want to link to it. I'll also update it with info based on the thread I
know is going to spawn from this post...)

Executive summary for those too lazy to read this whole thing:
I don't care about devfs, and I don't want to talk about it at
all anymore. If you love devfs, fine, I'm not trying to tell
anyone what to do. But you really should be looking into using
udev instead. All further email messages sent to me about devfs
will be gladly ignored.

First off, some background. For a description of udev, and what its
original design goals were, please see the OLS 2003 paper on udev,
available at:
<http://www.kroah.com/linux/talks/ols_2003_udev_paper/Reprint-Kroah-Hartman-OLS2003.pdf>
and the slides for the talk, available at:
<http://www.kroah.com/linux/talks/ols_2003_udev_talk/>
The OLS paper can also be found in the docs/ directory of the udev
tarball, available on kernel.org in the /pub/linux/utils/kernel/hotplug/
directory.

In that OLS paper, I described the current situation of a static /dev
and the current problems that a number of people have with it. I also
detailed how devfs tries to solve a number of these problems. In
hindsight, I should have never mentioned the word, devfs, when talking
about udev. I did so only because it seemed like a good place to start
with. Most people understood what devfs is, and what it does. To
compare udev against it, showing how udev was more powerful, and a more
complete solution to the problems people were having, seemed like a
natural comparison to me.

But no more. I hereby never want to compare devfs and udev again. With
the exception of this message...

The Problems:
1) A static /dev is unwieldy and big. It would be nice to only show
the /dev entries for the devices we actually have running in the
system.
2) We are (well, were) running out of major and minor numbers for
devices.
3) Users want a way to name devices in a persistent fashion (i.e. "This
disk here, must _always_ be called "boot_disk" no matter where in
the scsi tree I put it", or "This USB camera must always be called
"camera" no matter if I have other USB scsi devices plugged in or
not.")
4) Userspace programs want to know when devices are created or removed,
and what /dev entry is associated with them.

The constraints:
1) No policy in the kernel!
2) Follow standards (like the LSB)
3) must be small so embedded devices will use it.

So, how does devfs stack up to the above problems and constraints:
Problems:
1) devfs only shows the dev entries for the devices in the system.
2) devfs does not handle the need for dynamic major/minor numbers
3) devfs does not provide a way to name devices in a persistent
fashion.
4) devfs does provide a daemon that userspace programs can hook into
to listen to see what devices are being created or removed.
Constraints:
1) devfs forces the devfs naming policy into the kernel. If you
don't like this naming scheme, tough.
2) devfs does not follow the LSB device naming standard.
3) devfs is small, and embedded devices use it. However it is
implemented in non-pageable memory.

Oh yeah, and there are the insolvable race conditions with the devfs
implementation in the kernel, but I'm not going to talk about them right
now, sorry. See the linux-kernel archives if you care about them (and
if you use devfs, you should care...)

So devfs is 2 for 7, ignoring the kernel races.

And now for udev:
Problems:
1) using udev, the /dev tree only is populated for the devices that
are currently present in the system.
2) udev does not care about the major/minor number schemes. If the
kernel tomorrow switches to randomly assign major and minor numbers
to different devices, it would work just fine (this is exactly
what I am proposing to do in 2.7...)
3) This is the main reason udev is around. It provides the ability
to name devices in a persistent manner. More on that below.
4) udev emits D-BUS messages so that any other userspace program
(like HAL) can listen to see what devices are created or removed.
It also allows userspace programs to query its database to see
what devices are present and what they are currently named as
(providing a pointer into the sysfs tree for that specific device
node.)
Constraints:
1) udev moves _all_ naming policies out of the kernel and into
userspace.
2) udev defaults to using the LSB device naming standard. If users
want to deviate away from this standard (for example when naming
some devices in a persistent manner), it is easily possible to do
so.
3) udev is small (49Kb binary) and is entirely in userspace, which
is swappable, and doesn't have to be running at all times.

Nice, 7 out of 7 for udev. Makes you think the problems and constraints
were picked by a udev developer, right? No, the problems and
constraints are ones I've seen over the years and so udev, along with
the kernel driver model and sysfs, were created to solve these real
problems. I also have had the luxury to see the problems that the
current devfs implementation has, and have taken the time to work out
something that does not have those same problems.

So by just looking at the above descriptions, everyone should instantly
realize that udev is far better than devfs and start helping out udev
development, right? Oh, you want more info, ok...

Back in May 2003 I released a very tiny version of udev that implemented
everything that devfs currently does, in about 6Kb of userspace code:
http://marc.theaimsgroup.com/?l=linux-kernel&m=105003185331553

Yes, that's right, 6Kb. So, you are asking, why are you still working
on udev if it did everything devfs did back in May 2003? That's because
just managing static device nodes based on what the kernel calls the
devices is _not_ the primary goal of udev. It's just a tiny side effect
of its primary goal, the ability to never worry about major/minor
number assignments and provide the ability to achieve persistent device
names if wanted.

All the people wanting to bring up the udev vs. devfs argument go back
and read the previous paragraph. Yes, all Gentoo users who keep filling
up my inbox with smoking emails, I mean you.

So, how well does udev meet its goals:
Prevent users from ever worrying about major/minor numbers
And here you were, not knowing you ever needed to worry about
major/minor numbers in the first place, right? Ah, I see you
haven't plugged in 2 USB printers and tried to figure out which
printer was which /dev entry? Or plugged in 4000 SCSI disks and
tried to figure out how to access that 3642nd disk and what it was
called in /dev. Or plugged in a USB camera and a USB flash drive
and then tried to download the pictures off of the flash drive by
accident?

As the above scenarios show, desktop users and big iron users
both need to not worry about which device is assigned to what
major/minor number.

udev doesn't care what major/minor number is assigned to a device.
It merely takes the numbers that the kernel says it assigned to the
device and creates a device node based on it, which the user can
then use (if you don't understand the whole major/minor to device
node issue, or even what a device node is, trust me, you don't
really want to, go install udev and don't worry about it...) As
stated above, if the kernel decides to start randomly assigning
major numbers to all devices, then udev will still work just fine.
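
The step described here, taking whatever number the kernel reports and
building a node from it, can be sketched roughly as follows. This is only
an illustration, not udev's actual code: it assumes the kernel exposes the
number as a "MAJOR:MINOR" string (as the sysfs `dev` attribute does), and
the helper names `parse_sysfs_dev` and `node_request` are made up for the
example.

```python
import os

def parse_sysfs_dev(text):
    """Parse a "MAJOR:MINOR" string such as the one in a sysfs `dev` file."""
    major_str, minor_str = text.strip().split(":")
    return int(major_str), int(minor_str)

def node_request(kernel_name, dev_text, dev_dir="/dev"):
    """Return the (path, device number) a udev-like helper would hand to mknod.

    Nothing here interprets the numbers; they are copied through as-is,
    so any assignment scheme the kernel picks would work.
    """
    major, minor = parse_sysfs_dev(dev_text)
    return os.path.join(dev_dir, kernel_name), os.makedev(major, minor)

# Hypothetical input: the kernel announced "sda3" with dev = "8:3".
path, devnum = node_request("sda3", "8:3\n")
print(path, os.major(devnum), os.minor(devnum))
```

A real helper would then call os.mknod() on the path with that device
number, which needs root privileges and is left out of the sketch.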

Provide a persistent device naming solution:
Lots of people want to assign a specific name by which they can talk to
a device, no matter where it is in the system, or what order they
plugged the device in. USB printers, SCSI disks, PCI sound cards,
Firewire disks, USB mice, and lots of other devices all need to be
assigned a name in a consistent manner (udev doesn't handle network
devices, naming them is already a solved problem, using nameif).
udev allows users to create simple rules to describe what device to
name. If users want to call a program running a large database
half-way around the world, asking it what to name this device, it
can. We don't put the naming database into the kernel (like other
Unix variants have), everything is in userspace, and easily
accessible. You can even run a perl script to name your device if
you are that crazy...

For more information on how to create udev rules to name devices,
please see the udev man page, and look at the example udev rules
that ship with the tarball.
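
To make the idea of userspace naming rules concrete, here is a minimal
sketch of rule matching. It does not use udev's real rule syntax (see the
man page for that); the rule format, the attribute names and the serial
number are all hypothetical. The only point is that the policy lives in an
ordinary userspace table, with the kernel name as the fallback.

```python
def name_device(kernel_name, attrs, rules):
    """Return the name from the first rule whose attributes all match.

    `rules` is a list of (match_dict, name) pairs. If nothing matches,
    the kernel's own name is used, which mirrors the default behaviour
    described in the text.
    """
    for match, name in rules:
        if all(attrs.get(key) == value for key, value in match.items()):
            return name
    return kernel_name

# Hypothetical rules: pin one disk and one camera to fixed names.
RULES = [
    ({"bus": "scsi", "serial": "HYPOTHETICAL-0001"}, "boot_disk"),
    ({"bus": "usb", "vendor": "ExampleCam"}, "camera"),
]

print(name_device("sdb", {"bus": "scsi", "serial": "HYPOTHETICAL-0001"}, RULES))
print(name_device("sdc", {"bus": "scsi", "serial": "SOMETHING-ELSE"}, RULES))
```

The first call yields the persistent name regardless of which kernel name
the disk happened to get; the second falls through to the kernel name.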

So, convinced already why you should use udev instead of devfs? No?
Ok, fine, I'm not forcing you to abandon your bloated, stifling policy,
nonextensible, end of life feature if you don't want to. But please
don't bother me about it either, I don't care about devfs, only about
udev.

This is my last posting about this topic, all further emails sent to me
about why devfs is wonderful, and why are you making fun of this
wonderful, stable gift from the gods, will be gleefully ignored and
possibly posted in a public place where others can see.

thanks,

greg k-h

Prakash K. Cheemplavam

Dec 30, 2003, 8:00:16 PM

Greg KH wrote:

[big snip]

> All the people wanting to bring up the udev vs. devfs argument go back
> and read the previous paragraph. Yes, all Gentoo users who keep filling
> up my inbox with smoking emails, I mean you.

[yet another big snip]

Hihi, life is unfair to you. ;-) I am one of those nasty gentoo users
and still use devfs, but I want to switch asap, as I found a thread in
gentoo forums about it and furthermore tend to do experiments with my
installation. So not all gentoo users are bad users. ;-) I really
appreciate your work and hope you will find more time in developing udev
instead of wasting time (though it was quite interesting for me to read
your text) with arguing for it. So I hope when I do the transition it
goes smoothly, but even if not, I won't bash onto your head. ;-)

Cheers,

Prakash

Pascal Schmidt

Dec 30, 2003, 10:20:05 PM

On Wed, 31 Dec 2003 01:40:09 +0100, you wrote in linux.kernel:

> 2) udev does not care about the major/minor number schemes. If the
> kernel tomorrow switches to randomly assign major and minor numbers
> to different devices, it would work just fine (this is exactly
> what I am proposing to do in 2.7...)

Why? I want to keep my static device files in /dev. I don't even have
hotpluggable devices, and many months do pass before even one piece
of hardware gets changed (in which case I know what I have to do).
I don't want to eat any overhead or run any daemons or hotplug agents.

What benefit would there be in "random" numbers? More compressed number
space by giving out numbers sequentially? Or less having to work with
the numbers because they become just cookies and never need to be
inspected except in very small parts of the kernel?

--
Ciao,
Pascal

Paulo Marques

Dec 31, 2003, 7:50:09 AM

Greg KH wrote:

> Oh yeah, and there are the insolvable race conditions with the devfs
> implementation in the kernel, but I'm not going to talk about them right
> now, sorry. See the linux-kernel archives if you care about them (and
> if you use devfs, you should care...)

I really think you should, because IMHO this is *the* major argument against devfs.

I spent days trying to tweak a mandrake distribution into running from a Compact
Flash card.

The init sequence would fail with I/O errors as if the card had hardware
problems. It took me a long time to realize that it was devfs and devfsd the
culprits. With *exactly* the same setup, but static device nodes the system
worked just fine.

Maybe it was the slow compact flash PIO modes that were triggering the bug, but
the truth was that devfs had bugs in it, and I never saw anyone trying to
correct them later.

So my opinion is: udev is *really* needed and you're doing a great job with it.
Don't let anyone tell you otherwise :)

Just my 2 cents,

--
Paulo Marques - www.grupopie.com

"In a world without walls and fences who needs windows and gates?"

Greg KH

Dec 31, 2003, 2:40:08 PM

On Wed, Dec 31, 2003 at 01:53:55AM +0100, Prakash K. Cheemplavam wrote:
> Greg KH wrote:
>
> [big snip]
> > All the people wanting to bring up the udev vs. devfs argument go back
> > and read the previous paragraph. Yes, all Gentoo users who keep filling
> > up my inbox with smoking emails, I mean you.
> [yet another big snip]
>
> Hihi, life is unfair to you. ;-) I am one of those nasty gentoo users
> and still use devfs, but I want to switch asap, as I found a thread in
> gentoo forums about it and furthermore tend to do experiments with my
> installation. So not all gentoo users are bad users. ;-) I really
> appreciate your work and hope you will find more time in developing udev
> instead of wasting time (though it was quite interesting for me to read
> your text) with arguing for it. So I hope when I do the transition it
> goes smoothly, but even if not, I won't bash onto your head. ;-)

Thanks, I have gotten a lot of response to this message from Gentoo
users apologizing for the "bad seeds". By no means did I mean to
disparage all Gentoo users, just the ones that keep bothering me with
this pointless argument.

In fact, now that I know Gentoo works without devfs, I'm considering
putting it on an old laptop I have around here...

thanks,

greg k-h

Greg KH

Dec 31, 2003, 2:40:12 PM

On Wed, Dec 31, 2003 at 04:05:59AM +0100, Pascal Schmidt wrote:
> On Wed, 31 Dec 2003 01:40:09 +0100, you wrote in linux.kernel:
>
> > 2) udev does not care about the major/minor number schemes. If the
> > kernel tomorrow switches to randomly assign major and minor numbers
> > to different devices, it would work just fine (this is exactly
> > what I am proposing to do in 2.7...)
>
> Why? I want to keep my static device files in /dev. I don't even have
> hotpluggable devices, and many months do pass before even one piece
> of hardware gets changed (in which case I know what I have to do).
> I don't want to eat any overhead or run any daemons or hotplug agents.

You would not have any "extra" overhead if you don't add any new devices
to your system. udev only runs when /sbin/hotplug runs. As for extra
space on your disk, this email thread is almost as big as the udev
binary is :)

> What benefit would there be in "random" numbers? More compressed number
> space by giving out numbers sequentially?

Yes.

> Or less having to work with the numbers because they become just
> cookies and never need to be inspected except in very small parts of
> the kernel?

That is already happening today in the kernel.

And 2.8 will probably have the "random number" assignment be a compile
option, depending on the maturity of udev. We'll just have to see how
it works out.

thanks,

greg k-h

Rob Love

Dec 31, 2003, 3:30:10 PM

On Wed, 2003-12-31 at 14:23, Greg KH wrote:

> What benefit would there be in "random" numbers? More compressed number
> space by giving out numbers sequentially?

That is one advantage.

> Or less having to work with the numbers because they become just
> cookies and never need to be inspected except in very small parts of
> the kernel?

Yup, especially this one. It is not so much "let's make the device
numbers random" but "let's just not care what they are."

We can get to the point where we don't even need the explicit concept of
device numbers, but just "any old unique value" to use as a cookie. The
kernel can pull that number from anywhere, and notify user-space via
udev ala hotplug.

Rob Love

Nathan Conrad

Dec 31, 2003, 5:10:07 PM

One thing that I'm confused about with respect to device files is how
kernel arguments are supposed to work. Now, we _seem_ to have a
mish-mash of different ways to tell the kernel which device to open as
a console, which device to use as a suspend device, etc.... Now, all
of the device names are being migrated to userland. How is the kernel
supposed to determine which device to use when it is told to use
/dev/hda3 or /dev/ide/host0/something/part3 as the suspend partition?
The kernel no longer knows to which device this string is
connected.

(I have not looked into how these parameters are parsed; this is pure
speculation)

One solution that I see if the device names are totally removed from
the kernel is specifying these parameters as sysfs paths. Would this
work? Or is there a better way?

-Nathan

--
Nathan J. Conrad Campus phone #5930
301 Scott hall, UNC Charlotte http://bungled.net
GPG: F4FC 7E25 9308 ECE1 735C 0798 CE86 DA45 9170 3112

Rob Love

Dec 31, 2003, 5:30:09 PM

On Wed, 2003-12-31 at 17:01, Nathan Conrad wrote:

> One thing that I'm confused about with respect to device files is how
> kernel arguments are supposed to work. Now, we _seem_ to have a
> mish-mash of different ways to tell the kernel which device to open as
> a console, which device to use as a suspend device, etc.... Now, all
> of the device names are being migrated to userland. How is the kernel
> supposed to determine which device to use when it is told to use
> /dev/hda3 or /dev/ide/host0/something/part3 as the suspend partition?
> The kernel no longer knows to which device this string is
> connected.

Uh, Unix systems (Linux included) do not use the filename of the device
node at all. Those are just names for you, the user.

The kernel uses the device number to understand what device user-space
is trying to access. The kernel associates the device with a device
number. Normally that number is static, and known a priori, so we just
create a huge /dev directory with all possible devices and their
assigned numbers (you can see these numbers with ls -la).

But if the kernel _tells_ user-space what the device number is, for each
device as it is created, we do not need a static /dev directory. We can
assemble the directory on the fly and device numbers really no longer
matter. This is what udev does.
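
The point that the kernel looks at the number stored in the node, not at
the node's filename, can be checked from userspace. The following is a
small sketch using only Python's standard `os` and `stat` modules; the
helper names are invented for the example, and the 3:3 pair is used on the
assumption that it is the traditional number for the first IDE disk's
third partition.

```python
import os
import stat

def describe_node(st_mode, st_rdev):
    """Return (kind, major, minor) for a device node, or None otherwise."""
    if stat.S_ISCHR(st_mode):
        kind = "char"
    elif stat.S_ISBLK(st_mode):
        kind = "block"
    else:
        return None  # regular file, directory, etc.: no device number
    return kind, os.major(st_rdev), os.minor(st_rdev)

def describe_path(path):
    """Same as describe_node(), but reads the fields from a real file."""
    st = os.stat(path)
    return describe_node(st.st_mode, st.st_rdev)

# A block node carrying 3:3 identifies the same device whatever it is called.
print(describe_node(stat.S_IFBLK | 0o660, os.makedev(3, 3)))
```

Calling describe_path() on two differently named nodes that carry the same
number would return the same tuple, which is all the kernel cares about.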

walt

Dec 31, 2003, 5:30:19 PM

Greg KH wrote:

> In fact, now that I know Gentoo works without devfs, I'm considering
> putting it on an old laptop I have around here...

That would be ideal. I'm sure you will like the 'portage' system as
much as we (the gentoo hordes) do.

Note that the portage system already includes 'hotplug' and 'udev'
but possibly lagging behind a bit: hotplug-20030805-r3 and udev-011.

I have installed them both but just have not been able to get udev
working yet -- I don't yet understand the problems well enough to tell
you why, unfortunately. (udev is still marked 'experimental' so I'm
probably omitting important steps somewhere.)

If you could get udev working in gentoo you would become an instant
hero rather than the target of nasty emails. Think of how great
that would be for your New Year! We would become the wind beneath
your wings instead of the rotten tomatoes in your mailbox ;0)

vi...@parcelfarce.linux.theplanet.co.uk

Dec 31, 2003, 6:00:14 PM

On Wed, Dec 31, 2003 at 05:20:18PM -0500, Rob Love wrote:
> On Wed, 2003-12-31 at 17:01, Nathan Conrad wrote:
>
> > One thing that I'm confused about with respect to device files is how
> > kernel arguments are supposed to work. Now, we _seem_ to have a
> > mish-mash of different ways to tell the kernel which device to open as
> > a console, which device to use as a suspend device, etc.... Now, all
> > of the device names are being migrated to userland. How is the kernel
> > supposed to determine which device to use when it is told to use
> > /dev/hda3 or /dev/ide/host0/something/part3 as the suspend partition?
> > The kernel no longer knows to which device this string is
> > connected.
>
> Uh, Unix systems (Linux included) do not use the filename of the device
> node at all. Those are just names for you, the user.
>
> The kernel uses the device number to understand what device user-space
> is trying to access. The kernel associates the device with a device
> number. Normally that number is static, and known a priori, so we just
> create a huge /dev directory with all possible devices and their
> assigned numbers (you can see these numbers with ls -la).
>
> But if the kernel _tells_ user-space what the device number is, for each
> device as it is created, we do not need a static /dev directory. We can
> assemble the directory on the fly and device numbers really no longer
> matter. This is what udev does.

I think you've missed a point here. There are several places where kernel
deals with device identification.
a) when normal pathname lookup results in a device node on filesystem.
That's the regular way.
b) when we create a new device node; device number is passed to
->mknod() and new device node is created. Also a normal codepath.
c) when late-boot code mounts the final root. It used to be black
magic, but these days it's done by regular syscalls. Namely, we parse the
"device name" (most of the work is done by lookups in sysfs), do mknod(2)
and mount(2). It's still done from the kernel mode, but it could be moved
to userland. Should be, actually.
d) when kernel deals with resume/suspend stuff. Currently - black
magic. Should be moved to early userland (same parser as for final root
name + mknod on rootfs + open() to get the device in question).
e) in several pathological syscalls we pass device number to
identify a device. ustat(2) and its ilk - bad API that can't die.
f) /dev/raw passes device number to bind raw device to block device.
Bad API; we probably ought to replace it with saner one at some point.
g) RAID setup - mix of both pathologies; should be done in userland
and interfaces are in bad need of cleanup.
h) nfsd uses device number as a substitute for export ID if said
ID is not given explicitly. That, BTW, is a big problem for crackpipe
dreams about random device numbers - export ID _must_ be stable across
reboots.
i) mtdblk parses "device name" on boot; should be taken to early
userland, same as RAID et al.

Eventually name_to_dev_t() should be gone from kernel mode
completely - all callers should be shifted to early userland. But
that will take a lot of work - currently we have a big mess in that
area.
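
Case (c) above, parse the "device name" through sysfs lookups and then
do mknod(2) and mount(2), might look roughly like this if it were moved to
early userland. This is a sketch under stated assumptions, not the
kernel's name_to_dev_t(): the dictionary stands in for sysfs, the function
names are made up, and the "/dev/root" and "/root" paths are only
illustrative defaults.

```python
import os

def name_to_devnum(name, sysfs_dev):
    """Resolve a '/dev/<kernel name>' string via a table standing in for sysfs."""
    if not name.startswith("/dev/"):
        raise ValueError(f"unsupported device name: {name}")
    kernel_name = name[len("/dev/"):]
    try:
        text = sysfs_dev[kernel_name]
    except KeyError:
        raise LookupError(f"kernel knows no device called {kernel_name}") from None
    major, minor = (int(part) for part in text.strip().split(":"))
    return os.makedev(major, minor)

def plan_root_mount(root_arg, sysfs_dev, node="/dev/root", target="/root"):
    """List the syscalls needed for root=..., without performing them."""
    devnum = name_to_devnum(root_arg, sysfs_dev)
    return [
        ("mknod", node, os.major(devnum), os.minor(devnum)),
        ("mount", node, target),
    ]

# Hypothetical sysfs contents: one partition known to the kernel as hda3.
SYSFS_DEV = {"hda3": "3:3\n"}
for step in plan_root_mount("/dev/hda3", SYSFS_DEV):
    print(step)
```

Note that the lookup only understands kernel names: a name that exists
purely as userspace policy is not in the table, so the lookup fails
instead of guessing.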

Rob Love

Dec 31, 2003, 6:10:09 PM

On Wed, 2003-12-31 at 17:55, vi...@parcelfarce.linux.theplanet.co.uk
wrote:

> I think you've missed a point here. There are several places where kernel
> deals with device identification.

I know all of this. I was trying to explain how Unix VFS understands
devices (via major/minor number, not filename). Different audience.

Rob Love

Tommi Virtanen

Dec 31, 2003, 6:20:05 PM

Rob Love wrote:
>>One thing that I'm confused about with respect to device files is how
>>kernel arguments are supposed to work. Now, we _seem_ to have a
>>mish-mash of different ways to tell the kernel which device to open as
>>a console, which device to use as a suspend device, etc.... Now, all
>>of the device names are being migrated to userland. How is the kernel
>>supposed to determine which device to use when it is told to use
>>/dev/hda3 or /dev/ide/host0/something/part3 as the suspend partition?
>>The kernel no longer knows to which device this string is
>>connected.

...

> The kernel uses the device number to understand what device user-space
> is trying to access. The kernel associates the device with a device
> number. Normally that number is static, and known a priori, so we just
> create a huge /dev directory with all possible devices and their
> assigned numbers (you can see these numbers with ls -la).

Let me try to rephrase Nathan's question more explicitly.

If user policy decides all naming, how does the kernel parse e.g.
root=/dev/foo arguments? Or the swap partition to use for swsuspend?

Rob Love

Dec 31, 2003, 6:20:13 PM

On Wed, 2003-12-31 at 16:45, Tommi Virtanen wrote:

> Let me try to rephrase Nathan's question more explicitly.
>
> If user policy decides all naming, how does the kernel parse e.g.
> root=/dev/foo arguments? Or the swap partition to use for swsuspend?

Oh. That has always been a hack, ala name_to_dev_t().

We will have to continue doing that hack so long as those users are in
the kernel proper (and not early user-space, for example).

Rob Love

Tommi Virtanen

Dec 31, 2003, 6:20:15 PM

Rob Love wrote:
>>Let me try to rephrase Nathan's question more explicitly.
>>
>>If user policy decides all naming, how does the kernel parse e.g.
>>root=/dev/foo arguments? Or the swap partition to use for swsuspend?
> Oh. That has always been a hack, ala name_to_dev_t().
>
> We will have to continue doing that hack so long as those users are in
> the kernel proper (and not early user-space, for example).

I think devfs names are accepted as root= arguments, so that's a bit of
a loss.. with udev, your /dev and your root= are equal only if you
follow the standard naming.

For root=, I can see how early userspace can move that to userspace.
But what about swsuspend?

Are there any more kernel options taking file names? I think now would
be a good time to stop adding more of them :)

Andreas Dilger

Dec 31, 2003, 7:00:11 PM

On Dec 31, 2003 22:55 +0000, vi...@parcelfarce.linux.theplanet.co.uk wrote:
> h) nfsd uses device number as a substitute for export ID if said
> ID is not given explicitly. That, BTW, is a big problem for crackpipe
> dreams about random device numbers - export ID _must_ be stable across
> reboots.

We had a problem with this and Lustre, when we NFS export it. Lustre is
already a network filesystem so we don't have a device number. I had a
discussion with Neil Brown about this and suggested that we allow NFS to
get a _real_ stable export ID from the filesystem (e.g. superblock UUID
or similar) instead of the device number hackery which only has a vague
relationship to stable.

We implemented it for Lustre with a filesystem option FS_NFSEXP_FSID
that tells nfsd it can export such a filesystem in the absence of
FS_REQUIRES_DEV and then put our export ID into sb->s_dev (although I'd
prefer something slightly cleaner than that).

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

Andries Brouwer

Dec 31, 2003, 7:20:23 PM

On Wed, Dec 31, 2003 at 03:19:22PM -0500, Rob Love wrote:

> We can get to the point where we don't even need the explicit concept of
> device numbers, but just "any old unique value" to use as a cookie. The
> kernel can pull that number from anywhere, and notify user-space via
> udev ala hotplug.

My plan has been to essentially use a hashed disk serial number
for this "any old unique value". The problem is that "any old"
is easy enough, but "unique" is more difficult.
Naming devices is very difficult, but in some important cases,
like SCSI or IDE disks, that would work and give a stable name.

The kernel must not invent consecutive numbers - that does not
lead to stable names. Setting this up correctly is nontrivial.
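
The difference between consecutive numbers and a value derived from the
device itself can be shown in a few lines. This is a sketch only: CRC32 is
used as a stand-in for whatever hash would actually be chosen, the serial
strings are invented, and nothing here reflects an existing kernel
interface.

```python
import zlib

def hashed_id(serial):
    """Derive a 32-bit value from a serial number (CRC32 as a stand-in hash)."""
    return zlib.crc32(serial.encode("utf-8")) & 0xFFFFFFFF

def consecutive_ids(serials_in_probe_order):
    """Hand out 0, 1, 2, ... in the order the devices were discovered."""
    return {serial: index for index, serial in enumerate(serials_in_probe_order)}

# Hypothetical serials, probed in two different orders on two boots.
boot_one = ["DISK-AAA", "DISK-BBB"]
boot_two = ["DISK-BBB", "DISK-AAA"]

for serial in boot_one:
    print(serial,
          "consecutive:", consecutive_ids(boot_one)[serial],
          "->", consecutive_ids(boot_two)[serial],
          "hashed:", hex(hashed_id(serial)))
```

The consecutive number of each disk changes with probe order, while the
hashed value depends only on the serial. What the sketch does not solve
is the "unique" half of the problem.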

Rob Love

Dec 31, 2003, 7:40:19 PM

On Wed, 2003-12-31 at 19:15, Andries Brouwer wrote:

> My plan has been to essentially use a hashed disk serial number
> for this "any old unique value". The problem is that "any old"
> is easy enough, but "unique" is more difficult.
> Naming devices is very difficult, but in some important cases,
> like SCSI or IDE disks, that would work and give a stable name.

Yup.

> The kernel must not invent consecutive numbers - that does not
> lead to stable names. Setting this up correctly is nontrivial.

This is definitely an interesting problem space.

I agree wrt just inventing consecutive numbers. If there was a nice way
to trivially generate a random and unique number from some
device-inherent information, that would be nice.

Rob Love

Helge Hafting

Dec 31, 2003, 8:10:10 PM

On Tue, Dec 30, 2003 at 04:29:42PM -0800, Greg KH wrote:
>
> 2) We are (well, were) running out of major and minor numbers for
> devices.

devfs tried to fix this one by _getting rid_ of those numbers.
Seriously - what are they needed for?
(Yes, I know why they're needed with /dev on ext2)
Opening a device in devfs went straight to the device from the
inode - no extra lookup of "device numbers"
Numbers were provided mostly for backward compatibility - they
weren't used for the main task of accessing devices.

udev has many other advantages of course, too bad we still
have to carry those numbers around.

Helge Hafting

Martin Schlemmer

Dec 31, 2003, 9:10:08 PM

On Thu, 2004-01-01 at 00:17, walt wrote:

> Note that the portage system already includes 'hotplug' and 'udev'
> but possibly lagging behind a bit: hotplug-20030805-r3 and udev-011.
>

AFAIK, we are current on udev :D As for hotplug, I will have to check -
I see the latest usb patches cause usb.agent to complain about "09" not
valid token or such, but I have not looked into it yet.

> I have installed them both but just have not been able to get udev
> working yet -- I don't yet understand the problems well enough to tell
> you why, unfortunately. (udev is still marked 'experimental' so I'm
> probably omitting important steps somewhere.)
>

Well, ideally you need baselayout-1.8.6.12-r3 as well ... But if you
do have issues, try to bother me first, as it could be something I did
or did not do ;)

> If you could get udev working in gentoo you would become an instant
> hero rather than the target of nasty emails. Think of how great
> that would be for your New Year! We would become the wind beneath
> your wings instead of the rotten tomatoes in your mailbox ;0)

Hmm, It works fine here? With sysfs patches from Greg (not yet into
official linux bk), I only had to run alsa's script to create device
nodes, and create /dev/{core,stdin,stdout,stderr} - the rest udev
creates - although, yes we do have the ramdisk/tarball feature to
save permissions/additions.

But once again, drop me a mail first with versions of udev, baselayout,
kernel, hotplug, etc, if you have latest unstable baselayout and still
cannot get it working - it is a Gentoo issue after all (as well as the
fact that I was under the impression that it should _just_work_ if you
have latest everything unstable =) ...

Thanks,

--

Martin Schlemmer

Martin Schlemmer

Dec 31, 2003, 9:10:13 PM

On Thu, 2004-01-01 at 04:03, Martin Schlemmer wrote:
> On Thu, 2004-01-01 at 00:17, walt wrote:
>
> > Note that the portage system already includes 'hotplug' and 'udev'
> > but possibly lagging behind a bit: hotplug-20030805-r3 and udev-011.
> >
>
> AFAIK, we are current on udev :D

Err, correction - I just saw 012 is out =p

--
Martin Schlemmer

Rob Landley

Jan 1, 2004, 7:40:12 AM

On Wednesday 31 December 2003 18:31, Rob Love wrote:
> On Wed, 2003-12-31 at 19:15, Andries Brouwer wrote:
> > My plan has been to essentially use a hashed disk serial number
> > for this "any old unique value". The problem is that "any old"
> > is easy enough, but "unique" is more difficult.
> > Naming devices is very difficult, but in some important cases,
> > like SCSI or IDE disks, that would work and give a stable name.
>
> Yup.
>
> > The kernel must not invent consecutive numbers - that does not
> > lead to stable names. Setting this up correctly is nontrivial.
>
> This is definitely an interesting problem space.
>
> I agree wrt just inventing consecutive numbers. If there was a nice way
> to trivially generate a random and unique number from some
> device-inherent information, that would be nice.
>
> Rob Love

Fundamental problem: "Unique" depends on the other devices in the system. You
can't guarantee unique by looking at one device, more or less by definition.

Combine that with hotplug and you have a world of pain. Generating a number
from a device is just a fancy hashing function, but as soon as you have two
devices that generate the same number independently (when in separate
systems) and you plug them both into the same system: boom.

Now if you don't care about hotplug, it gets a little easier. You can have a
collision handler that does some kind of hashing thing, figuring out which
device needs to get bumped and bumping it. (As long as it consistently picks
the same victim, you're okay, although that in and of itself could get
interesting. And if you remove the earlier device it conflicted with and
reboot, the device could get renumbered which is evil...)

Of course the EASY way to deal with collisions is to just fail the hash thingy
in a detectable way, and punt to some kind of udev override. So if you yank
a drive from system A, throw it in system B, try to re-export it NFS, and
it's not going to work, it TELLS you.
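
To illustrate the "fail in a detectable way" idea, here is a minimal Python sketch. It is not from the original message: `DeviceTable`, `HashCollision`, and the choice of CRC-32 over a serial string are all assumptions made for the example, standing in for whatever hash the kernel or udev would really use.

```python
import zlib


class HashCollision(Exception):
    """Raised when two different devices hash to the same number."""


class DeviceTable:
    def __init__(self, hash_fn=None):
        # Default: 32-bit CRC of the serial string (an arbitrary choice).
        self.hash_fn = hash_fn or (lambda serial: zlib.crc32(serial.encode()))
        self.owner = {}  # device number -> serial that claimed it

    def register(self, serial):
        number = self.hash_fn(serial)
        holder = self.owner.get(number)
        if holder is not None and holder != serial:
            # Do not renumber anything: fail loudly and leave it to a
            # human (or a udev-style override) to sort out.
            raise HashCollision(
                f"{serial!r} collides with {holder!r} on {number:#010x}")
        self.owner[number] = serial
        return number
```

Registering the same serial twice is harmless; only a second, different device landing on an occupied number raises, so nothing already numbered ever gets bumped.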

Solve 90% of the problem space and have a human deal with the exceptions. How
big's the unique number being exported, anyway? (If it's 32 bits, the
exceptions are 1 in 4 billion. It may never be seen in the wild...)

Rob

Rob Love

Jan 1, 2004, 10:30:11 AM

On Thu, 2004-01-01 at 07:34, Rob Landley wrote:

> Fundamental problem: "Unique" depends on the other devices in the system. You
> can't guarantee unique by looking at one device, more or less by definition.

Of course.

> Combine that with hotplug and you have a world of pain. Generating a number
> from a device is just a fancy hashing function, but as soon as you have two
> devices that generate the same number independently (when in separate
> systems) and you plug them both into the same system: boom.

A solution would have to deal with collisions.

> Of course the EASY way to deal with collisions is to just fail the hash thingy
> in a detectable way, and punt to some kind of udev override. So if you yank
> a drive from system A, throw it in system B, try to re-export it NFS, and
> it's not going to work, it TELLS you.

No no no. Nothing this complicated. No punting to udev.

> Solve 90% of the problem space and have a human deal with the exceptions. How
> big's the unique number being exported, anyway? (If it's 32 bits, the
> exceptions are 1 in 4 billion. It may never be seen in the wild...)

Device numbers are 64-bit now.

Rob Love

Andries Brouwer

Jan 1, 2004, 11:00:11 AM

On Thu, Jan 01, 2004 at 10:22:53AM -0500, Rob Love wrote:

> Device numbers are 64-bit now.
>
> Rob Love

I am afraid I have to disappoint you. I made them 64-bit,
and I think they were 64-bit for a few months in the -mm tree,
forgot the details, but unfortunately Al went back to 32-bit again.

Rob Love

Jan 1, 2004, 11:00:12 AM

On Thu, 2004-01-01 at 10:48, Andries Brouwer wrote:

> I am afraid I have to disappoint you. I made them 64-bit,
> and I think they were 64-bit for a few months in the -mm tree,
> forgot the details, but unfortunately Al went back to 32-bit again.

You did disappoint me! My heart is crushed and my aspirations for the
future ruined.

But you are right, dunno what I was thinking.

Rob Love

Pascal Schmidt

Jan 1, 2004, 11:30:14 AM

On Wed, 31 Dec 2003, Greg KH wrote:

> You would not have any "extra" overhead if you don't add any new devices
> to your system. udev only runs when /sbin/hotplug runs. As for extra
> space on your disk, this email thread is almost as big as the udev
> binary is :)

Well, but if random device numbers become a reality, udev would have
to run at boot time or I wouldn't get usable device nodes. So there
is some setup complexity (because so far I don't need a correctly setup
hotplug system at all). Not much of a problem, granted, distributions
will do this for most of us and only a few people will do it by hand.

--
Ciao,
Pascal

Shaheed

Jan 1, 2004, 12:10:04 PM

Rob Landley wrote:

>Combine that with hotplug and you have a world of pain. Generating a number
> from a device is just a fancy hashing function, but as soon as you have two
> devices that generate the same number independently (when in separate
> systems) and you plug them both into the same system: boom.

If one has two otherwise identical devices, the only thing that distinguishes
them to the system is their point of attachment. Even from a user's point of
view, the only difference is the connector it is plugged into. That implies
that the hash resolution value ought to be based on the point of attachment.

It seems to me that the key to making this system as transparent as possible
is to make the source value of the hash and the attachment point visible
and navigable by userspace/humans. Perhaps something like this:

- Every driver exports its name and some driver-or-devicetype-dependent value
(serial number, MAC address, disk WWID, pty number, kernel address of kobject
or whatever) to /sbin/hotplug. The userspace logic gets to hash+uniquify the
value as required, and then create a sysfs tree node ("/uid/xxx") whose
leaves contain the point of attachment.

- At the bottom of the sysfs tree for the device add a leaf that points back
to the entry into "/uid" tree.

Thus, userspace can navigate in either direction between the point of
attachment and the identifying characteristic of the device.
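
The two-way navigation described above could be sketched roughly as follows. This is an illustration only, not an existing sysfs interface: `UidTree`, its method names, and the use of a truncated SHA-1 are hypothetical, and the example strings are made up.

```python
import hashlib


class UidTree:
    """Two-way map between a device's uid and its point of attachment."""

    def __init__(self):
        self.attachment_of = {}  # uid -> attachment (the "/uid/xxx" leaf)
        self.uid_of = {}         # attachment -> uid (the back-pointer leaf)

    @staticmethod
    def _digest(*parts):
        return hashlib.sha1("\0".join(parts).encode()).hexdigest()[:16]

    def add(self, driver, value, attachment):
        uid = self._digest(driver, value)
        if uid in self.attachment_of:
            # Otherwise identical devices: only the point of attachment
            # distinguishes them, so fold it into the hash.
            uid = self._digest(driver, value, attachment)
        self.attachment_of[uid] = attachment
        self.uid_of[attachment] = uid
        return uid
```

Note the limitation of this sketch: which of two identical devices keeps the "plain" uid depends on which one was seen first, so it only shows the navigation, not a full answer to stable naming.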

Thanks, Shaheed

walt

Jan 1, 2004, 3:00:16 PM

Martin Schlemmer wrote:
> On Thu, 2004-01-01 at 00:17, walt wrote:

>> ...I have not been able to get udev working yet...

> Hmm, It works fine here? I was under the impression that
> it should _just_work_ if you have latest everything unstable...

Yes! I want to confirm that it DOES 'just work' with this one
little thingy I missed:

I needed to add TWO boot flags because of the way I have my
kernel configured: 'nodevfs' AND 'devfs=nomount'.

Without the 'devfs=nomount' flag the kernel was starting devfsd
anyway, which keeps udev from working, apparently.

So, Greg, please be nice to Martin, who is working hard to
get gentoo people out of your mailbox.

Thanks to both of you, and Happy New Year!

Greg KH

Jan 1, 2004, 4:10:05 PM

On Thu, Jan 01, 2004 at 05:17:50PM +0100, Pascal Schmidt wrote:
> On Wed, 31 Dec 2003, Greg KH wrote:
>
> > You would not have any "extra" overhead if you don't add any new devices
> > to your system. udev only runs when /sbin/hotplug runs. As for extra
> > space on your disk, this email thread is almost as big as the udev
> > binary is :)
>
> Well, but if random device numbers become a reality, udev would have
> to run at boot time or I wouldn't get usable device nodes.

Exactly, it's on the TODO list :)

thanks,

greg k-h

Martin Schlemmer

Jan 1, 2004, 5:10:24 PM

On Thu, 2004-01-01 at 21:53, walt wrote:
> Martin Schlemmer wrote:
> > On Thu, 2004-01-01 at 00:17, walt wrote:
>
> >> ...I have not been able to get udev working yet...
>
> > Hmm, It works fine here? I was under the impression that
> > it should _just_work_ if you have latest everything unstable...
>
> Yes! I want to confirm that it DOES 'just work' with this one
> little thingy I missed:
>
> I needed to add TWO boot flags because of the way I have my
> kernel configured: 'nodevfs' AND 'devfs=nomount'.
>
> Without the 'devfs=nomount' flag the kernel was starting devfsd
> anyway, which keeps udev from working, apparently.
>

Hmm, right, that will do it.

Perhaps I could change this to display a warning if udev is present,
but devfs is mounted over /dev ...

--
Martin Schlemmer

Rob

Jan 1, 2004, 6:20:15 PM

On Wednesday 31 December 2003 07:31 pm, Rob Love wrote:

<snip>

> This is definitely an interesting problem space.
>
> I agree wrt just inventing consecutive numbers. If there was a nice way
> to trivially generate a random and unique number from some
> device-inherent information, that would be nice.
>
> Rob Love

my first thought was hardware serial numbers, but i'm guessing they mostly
don't exist based on the discomfort caused by the pentium 3 serial number in
the past. my second thought was raw latency. in the real world, 2 identical
devices of any nature are going to respond electrically at different rates. i
kind of stole the concept from what i read about the i810 rng... quantum
differences can distinguish between 2 of anything, and based on the response
time, 'cookies' can be written out to keep them separately ID'd. some devices
will get slower over time, e.g. increasing error rates and aging silicon will
throw the 'cookie' off, so you'd re-calibrate every so often, like on a
reboot. those are rare for some of us ;)

the big IF: can you measure that with enough precision to at least decrease
the probability of collision?

--
Rob Couto
r...@cafe4111.org
Rules for computing success:
1) Attitude is no substitute for competence.
2) Ease of use is no substitute for power.
3) Safety matters; use a static-free hammer.
--

Hollis Blanchard

Jan 1, 2004, 7:30:12 PM

On Wednesday, Dec 31, 2003, at 15:52 US/Central, Tommi Virtanen wrote:
> I think devfs names are accepted as root= arguments, so that's a bit of
> a loss.. with udev, your /dev and your root= are equal only if you
> follow the standard naming.
>
> For root=, I can see how early userspace can move that to userspace.
> But what about swsuspend?
>
> Are there any more kernel options taking file names? I think now would
> be a good time to stop adding more of them :)

"console=" takes driver-supplied names which usually happen to match
/dev node names. For example, drivers/serial/8250.c names itself
"ttyS", so "console=ttyS0" will end up going to that driver, regardless
of the state of /dev.

I'm not saying that's good or bad, but what's the alternative?
"console=class/tty/ttyS0"?

--
Hollis Blanchard
IBM Linux Technology Center

Maciej Zenczykowski

Jan 1, 2004, 7:30:13 PM

> Solve 90% of the problem space and have a human deal with the exceptions.
> How big's the unique number being exported, anyway? (If it's 32 bits, the
> exceptions are 1 in 4 billion. It may never be seen in the wild...)

Wouldn't this be a classical birthday problem with 50% collision chance
popping up in and around a few hundred devices? [20 for 8 bits, 23 for
365, 302 for 16 bits, 77163 for 32 bits], and that's only in a single
system - with hundreds of thousands of systems even a 0.1% collision rate
is deadly. [0.1% collision rate at 32 bits with 2932 devices] Even with
only 300 devices per system, you'll still get a collision (at 32 bits) on
more than 1 system in a hundred thousand.
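
For anyone who wants to check these figures, here is a short Python sketch of the birthday calculation. It assumes identifiers are drawn uniformly at random from a space of the given size, which a real hash of serial numbers only approximates; the function names are made up for the example.

```python
import math


def collision_probability(devices, space):
    """Chance that at least two of `devices` uniform draws from
    `space` possible values coincide."""
    if devices > space:
        return 1.0
    # log of P(no collision) = sum of log(1 - k/space), k = 0..devices-1
    log_unique = sum(math.log1p(-k / space) for k in range(devices))
    return -math.expm1(log_unique)


def devices_for_probability(target, space):
    """Smallest number of devices whose collision chance reaches `target`."""
    count = 0
    log_unique = 0.0
    while -math.expm1(log_unique) < target:
        log_unique += math.log1p(-count / space)
        count += 1
    return count
```

Calling `devices_for_probability(0.5, 2**16)` or `collision_probability(300, 2**32)` gives values in line with the ones quoted above; the 32-bit thresholds land within a device or two of them.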

Cheers,
MaZe.

vi...@parcelfarce.linux.theplanet.co.uk

Jan 1, 2004, 7:40:10 PM

On Thu, Jan 01, 2004 at 06:17:43PM -0600, Hollis Blanchard wrote:
> "console=" takes driver-supplied names which usually happen to match
> /dev node names. For example, drivers/serial/8250.c names itself
> "ttyS", so "console=ttyS0" will end up going to that driver, regardless
> of the state of /dev.
>
> I'm not saying that's good or bad, but what's the alternative?
> "console=class/tty/ttyS0"?

Console code will need serious work anyway; note that current names
do _not_ refer to tty devices - there is some overlap, but right now
we have
* console drivers
* some of them being connected with tty drivers; those can tell
which tty driver corresponds to them
* console output code maintaining chain of console drivers; output
is sent to them, attempt to open() /dev/console ends up picking the first
console driver that has corresponding tty one (== has console->device())
and opening the tty device in question
* unholy mess with redirects.

There's no device nodes for console drivers. So names in console=... are
something very odd, indeed.

Tyler Hall

Jan 1, 2004, 11:00:13 PM

Since we're moving toward treating device numbers as unique handles for
devices in a system, why can't we just dynamically allocate them like
process ID's? As each device driver loads and registers with the kernel,
it can request a device number and the kernel can assign the next
available one.
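
A rough Python sketch of what "assign the next available one" could look like. The names (`NextFreeAllocator`, `boot`) and the lowest-unused-number policy are assumptions for illustration, not kernel code.

```python
class NextFreeAllocator:
    """Hands out the lowest unused device number on request."""

    def __init__(self):
        self.in_use = {}  # number -> driver name

    def allocate(self, driver):
        number = 0
        while number in self.in_use:
            number += 1
        self.in_use[number] = driver
        return number

    def release(self, number):
        del self.in_use[number]


def boot(load_order):
    """Simulate one boot: drivers register in the given order."""
    allocator = NextFreeAllocator()
    return {driver: allocator.allocate(driver) for driver in load_order}
```

The allocation itself is trivial. The catch is that the number a driver gets depends on load order, so `boot(["sd", "usb-storage"])` and `boot(["usb-storage", "sd"])` hand "sd" different numbers, which is the stability concern raised earlier in the thread about consecutive numbers.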

Tyler

Rob wrote:

>On Wednesday 31 December 2003 07:31 pm, Rob Love wrote:
>
><snip>
>
>
>>This is definitely an interesting problem space.
>>
>>I agree wrt just inventing consecutive numbers. If there was a nice way
>>to trivially generate a random and unique number from some
>>device-inherent information, that would be nice.
>>
>> Rob Love
>>
>>
>
>my first thought was hardware serial numbers, but i'm guessing they mostly
>don't exist based on the discomfort caused by the pentium 3 serial number in
>the past. my second thought was raw latency. in the real world, 2 identical
>devices of any nature are going to respond electrically at different rates. i
>kind of stole the concept from what i read about the i810 rng... quantum
>differences can distinguish between 2 of anything, and based on the response
>time, 'cookies' can be written out to keep them separately ID'd. some devices
>will get slower over time, e.g. increasing error rates and aging silicon will
>throw the 'cookie' off, so you'd re-calibrate every so often, like on a
>reboot. those are rare for some of us ;)
>
>the big IF: can you measure that with enough precision to at least decrease
>the probablity of collision?
>
>
>

Rob Landley

Jan 2, 2004, 2:30:15 AM

On Thursday 01 January 2004 13:43, Kai Henningsen wrote:
> r...@landley.net (Rob Landley) wrote on 01.01.04 in
<20040101063...@landley.net>:

> > On Wednesday 31 December 2003 18:31, Rob Love wrote:
> > > On Wed, 2003-12-31 at 19:15, Andries Brouwer wrote:
> > > > My plan has been to essentially use a hashed disk serial number
> > > > for this "any old unique value". The problem is that "any old"
> > > > is easy enough, but "unique" is more difficult.
> > > > Naming devices is very difficult, but in some important cases,
> > > > like SCSI or IDE disks, that would work and give a stable name.
> > >
> > > Yup.
> > >
> > > > The kernel must not invent consecutive numbers - that does not
> > > > lead to stable names. Setting this up correctly is nontrivial.
> > >
> > > This is definitely an interesting problem space.
> > >
> > > I agree wrt just inventing consecutive numbers. If there was a nice
> > > way to trivially generate a random and unique number from some
> > > device-inherent information, that would be nice.
> > >
> > > Rob Love
> >
> > Fundamental problem: "Unique" depends on the other devices in the system.
> > You can't guarantee unique by looking at one device, more or less by
> > definition.
>

> This is actually not fundamental at all.
>
> The best-known exception is probably the MAC address. But it is not the
> only example of devices having true unique information.

I thought of mentioning this, but deleted it as a digression. But since you
brought it up:

A) There are ethernet cards that have the same mac address. (Over the years,
the cheap manufacturers have managed to screw this up. Ask Alan Cox.) They
show up randomly and cause real headaches for network administrators if you
don't think to look for it.

B) You can override the MAC address the thing comes with. This is done all
the time. (Hot failover comes to mind, but it's not the only one.) I
remember how the cable modem company that serviced my mother's house snagged
the mac address of the cable modem as part of the initial setup, and refused
to work with a different mac address. (I asked their support guys: They
wanted to make sure you were still using the machine they'd installed their
special software on, which was a windows machine and I was installing a linux
firewall. And predicting THIS digression: yes I power cycled and hit the
reset button on the cable modem, it didn't help. The problem was at the
other end, their gateway dropped packets from the wrong mac address.)

So I changed the mac address of the other machine as part of its init scripts,
and it worked again...

> It is certainly true, though, that there are devices without this kind of
> info.
>
> And remember that you can sometimes use secondary information. With any
> kind of read-write storage device, it might be possible to create such a
> piece of information and store it onto that device.

I.E. a udev config entry?

> Moral: keep the identifier creation framework flexible enough so that you
> can chose device-specific means to produce useful identifiers. (And, use
> long identifiers, as they're less likely to be duplicated in general.)

Seems to be what udev is for. When we do go to random major and minor
numbers, maybe it would be useful to let udev request specific ones? (Just a
thought...)

> MfG Kai

Shawn

Jan 2, 2004, 11:50:29 AM

On Wed, 2003-12-31 at 13:17, Greg KH wrote:
> In fact, now that I know Gentoo works without devfs, I'm considering
> putting it on an old laptop I have around here...

If you use an "old" laptop you might want to use the distcc option... ;)
Unless you like your installs to take three weeks... Literally.

Andreas Jellinghaus

Jan 2, 2004, 1:10:14 PM

On Wed, 31 Dec 2003 00:32:58 +0000, Greg KH wrote:
> The Problems:
> 1) A static /dev is unwieldy and big. It would be nice to only show
> the /dev entries for the devices we actually have running in the
> system.

last time i checked, devices for physical resources are only a part
of the devices in /dev. the other big part are those devices for
virtual resources, like virtual master/slave tty, network block devices,
loop devices, virtual consoles, etc.

neither devfs nor udev handle the virtual part. only devpts does,
and only for one special class of virtual devices. and usb devices
are neither handled by devfs nor udev, but by usbfs.

Actually udev is a regression:
- devfs was a first effort at a sane /dev naming policy, udev returns to
the old and cryptic lsb device naming.
- devfs made makedev obsolete, udev doesn't work without it / can
currently not create all devices because of missing sysfs support.

Ignore this mail if you want, but people might be unhappy with udev
because of these regressions and not caring about it will not improve
the situation.

Andreas

Shawn

Jan 2, 2004, 1:30:12 PM

Let me begin by pointing out that I was a proponent of devfs from when
it first got written.

On Fri, 2004-01-02 at 11:54, Andreas Jellinghaus wrote:
> On Wed, 31 Dec 2003 00:32:58 +0000, Greg KH wrote:
> > The Problems:
> > 1) A static /dev is unwieldy and big. It would be nice to only show
> > the /dev entries for the devices we actually have running in the
> > system.

> neither devfs nor udev handle the virtual part. only devpts does,
> and only for one special class of virtual devices. and usb devices
> are neither handled by devfs nor udev, but by usbfs.

I'm thinking maybe this is just fine.

> Actually udev is a regression:
> - devfs was a first efford at a sane /dev naming policy, udev returns to
> the old and cryptic lsb device naming.

Every way of doing things is just another way of doing it. Location
based naming has its major issues. It's solved by UUID or LABEL, so
device naming is just a matter of preference anyway. You can change it
with udev, IIRC. You could not with devfs. Chances are you use devfsd
anyway, right?

> - devfs made makedev obsolete, udev doesn't work without it / can
> currently not create all devices because of missing sysfs support.

No one is saying it is currently perfect for everyone, however, it suits
many people just fine. devfs went through the same thing and this is an
invalid argument when debating the technical merit of either.

> Ignore this mail if you want, but people might be unhappy with udev
> because of these regressions and not caring about it will not improve
> the situation.

By the time devfs goes away enough testing will have happened. Don't
look for it to go away within 2.6.

Linus Torvalds

Jan 2, 2004, 3:50:12 PM

On Thu, 1 Jan 2004, Rob Love wrote:
>
> On Thu, 2004-01-01 at 10:48, Andries Brouwer wrote:
> > I am afraid I have to disappoint you. I made them 64-bit,
> > and I think they were 64-bit for a few months in the -mm tree,
> > forgot the details, but unfortunately Al went back to 32-bit again.
>
> You did disappoint me! My heart is crushed and my aspirations for the
> future ruined.
>
> But you are right, dunno what I was thinking.

Note that one reason I didn't much like the 64-bit versions is that not
only are they bigger, they also encourage insanity. Ie you'd find SCSI
people who want to try to encode device/controller/bus/target/lun info
into the device number.

We should resist any effort that makes the numbers "mean" something. They
are random cookies. Not "unique identifiers", and not "addresses".

The unique identifiers you get from things like udev, using contents of
the device itself or user preferences etc. That's outside the scope of the
kernel. The addresses you get from /sys.

Linus

Shaheed

Jan 2, 2004, 4:40:07 PM

Hi,

I have a device called an IT8212 IDE RAID controller. It comes with a Linux
2.4 driver which emulates a SCSI interface and supports both JBOD and
hardware RAID (0 and 1) modes of operation.

I don't quite understand the relationship between drivers/ide/... and RAID
support. Why for example, would the existing driver be written as a SCSI
driver and not an IDE driver?

Andries Brouwer

Jan 2, 2004, 11:20:08 PM

On Fri, Jan 02, 2004 at 12:42:41PM -0800, Linus Torvalds wrote:

Hi Linus - A happy 2004 !

> Note that one reason I didn't much like the 64-bit versions is that not
> only are they bigger, they also encourage insanity. Ie you'd find SCSI
> people who want to try to encode device/controller/bus/target/lun info
> into the device number.

Weak. "We don't want this power that has good uses because it also
can be used stupidly." That is not Unix-style.

> We should resist any effort that makes the numbers "mean" something. They
> are random cookies. Not "unique identifiers", and not "addresses".

Random cookies? I prefer "arbitrary" over "random". The value plays no role
at all, but it must be unique, preferably stable across reboots.

Andries

Linus Torvalds

Jan 3, 2004, 12:00:06 AM

On Sat, 3 Jan 2004, Andries Brouwer wrote:
>
> > Note that one reason I didn't much like the 64-bit versions is that not
> > only are they bigger, they also encourage insanity. Ie you'd find SCSI
> > people who want to try to encode device/controller/bus/target/lun info
> > into the device number.
>
> Weak. "We don't want this power that has good uses because it also
> can be used stupidly." That is not Unix-style.

No.

That's not the argument: the argument is that the _only_ thing that 64-bit
stuff can be used for is stupid things.

For everything else, a 32-bit dev_t is sufficient.
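
For a sense of how much room 32 bits leaves, here is a small Python sketch of packing and unpacking a device number. It is not part of the original message, and it assumes the 12-bit major / 20-bit minor split used inside the 2.6 kernel (`MINORBITS` = 20); the helpers only mimic the kernel's `MKDEV`/`MAJOR`/`MINOR` macros.

```python
MINORBITS = 20                      # assumed 2.6 in-kernel split
MINORMASK = (1 << MINORBITS) - 1
MAJORBITS = 32 - MINORBITS


def mkdev(major, minor):
    """Pack a major/minor pair into one 32-bit cookie."""
    if not 0 <= major < (1 << MAJORBITS):
        raise ValueError("major out of range")
    if not 0 <= minor <= MINORMASK:
        raise ValueError("minor out of range")
    return (major << MINORBITS) | minor


def major(dev):
    return dev >> MINORBITS


def minor(dev):
    return dev & MINORMASK
```

Under that split there are 4096 majors with about a million minors each, and nothing in the packed value needs to "mean" anything beyond being a lookup key.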

And the UNIX way is definitely: "do one thing, and do it well" and "small
is beautiful". It has _never_ been "overdesign everything to accommodate
stupidity".

You may have confused UNIX with Multics. Where overdesign was the rule,
not the exception.

> > We should resist any effort that makes the numbers "mean" something. They
> > are random cookies. Not "unique identifiers", and not "addresses".
>
> Random cookies? I prefer "arbitrary" over "random". The value plays no role
> at all, but it must be unique, preferably stable across reboots.

Don't use "unique". It has way too many connotations of _true_ uniqueness
in computer science.

And the operative word in "preferably stable across reboots" is
"preferably". Because it basically cannot be in the general case (it
can't be unique for things that aren't enumerable, and clearly a lot of
things aren't), and thus nothing must ever _assume_ it is.

And the thing is, to break those wrong assumptions (that are true in many
common cases, but are _not_ true in the rare general case), we may have to
actively do things that are "silly" on purpose. For example, for
debugging, we start the "jiffies" counter not at zero, but at -300. That's
patently _silly_, but it was very useful in finding the cases where the
rare general case was not handled correctly.
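
A small Python sketch of why starting the counter just below the wrap point flushes out bugs. It is not from the original message: it assumes a 32-bit counter, HZ = 1000, and a start value of 300 seconds' worth of ticks before zero, and `time_after` here is only a Python stand-in for the kernel's wrap-safe comparison of the same name.

```python
MASK = 0xFFFFFFFF          # assume a 32-bit tick counter
HZ = 1000                  # assumed ticks per second
INITIAL_JIFFIES = (-300 * HZ) & MASK   # wraps about 5 minutes after boot


def to_signed(value):
    """Reinterpret a 32-bit unsigned value as signed."""
    value &= MASK
    return value - (1 << 32) if value & 0x80000000 else value


def time_after(a, b):
    """True if tick a is later than tick b, even across a wrap."""
    return to_signed(b - a) < 0


def naive_after(a, b):
    """The buggy comparison that only breaks once the counter wraps."""
    return a > b
```

With a start at zero the naive comparison would keep working for weeks of uptime on such a counter; starting near the wrap makes it fail within minutes of every boot, so code using it gets caught during ordinary testing.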

Similarly, I'll probably advocate at some point (when distributions are
using udev) that we purposefully try to make device numbers _unstable_
across reboots, to find cases that do the wrong thing and have things
hardcoded. Exactly to find and fix them, so that the distribution works
correctly even when things aren't enumerable.

(As to examples of non-enumerable devices, iSCSI comes to mind. As does pretty
much anything else that is connected over IP - you can't even enumerate
according to path or IP, since those may change too).

Linus

Greg KH

Jan 3, 2004, 1:10:05 AM

On Thu, Jan 01, 2004 at 02:18:55AM +0100, Helge Hafting wrote:
> On Tue, Dec 30, 2003 at 04:29:42PM -0800, Greg KH wrote:
> >
> > 2) We are (well, were) running out of major and minor numbers for
> > devices.
>
> devfs tried to fix this one by _getting rid_ of those numbers.
> Seriously - what are they needed for?

But devfs failed in this. The devfs kernel interface still requires a
major/minor number to create device nodes.

Hopefully I can work on fixing this up in 2.7.

thanks,

greg k-h

Greg KH

Jan 3, 2004, 1:10:08 AM

On Thu, Jan 01, 2004 at 06:17:43PM -0600, Hollis Blanchard wrote:

> On Wednesday, Dec 31, 2003, at 15:52 US/Central, Tommi Virtanen wrote:
> >I think devfs names are accepted as root= arguments, so that's a bit of
> >a loss.. with udev, your /dev and your root= are equal only if you
> >follow the standard naming.
> >
> >For root=, I can see how early userspace can move that to userspace.
> >But what about swsuspend?
> >
> >Are there any more kernel options taking file names? I think now would
> >be a good time to stop adding more of them :)
>
> "console=" takes driver-supplied names which usually happen to match
> /dev node names. For example, drivers/serial/8250.c names itself
> "ttyS", so "console=ttyS0" will end up going to that driver, regardless
> of the state of /dev.

These are just string matches that the different console drivers use.
They have nothing to do with an actual /dev node.

thanks,

greg k-h

Greg KH

Jan 3, 2004, 1:20:13 AM

On Fri, Jan 02, 2004 at 05:31:04AM -0500, Mark Mielke wrote:

> On Fri, Jan 02, 2004 at 01:17:20AM +0100, Maciej Zenczykowski wrote:
> > Wouldn't this be a classical birthday problem with 50% collision chance
> > popping up in and around a few hundred devices? [20 for 8 bits, 23 for
> > 365, 302 for 16 bits, 77163 for 32 bits], and that's only in a single
> > system - with hundreds of thousands of systems even a 0.1% collision rate
> > is deadly. [0.1% collision rate at 32 bits with 2932 devices] Even with
> > only 300 devices per system, you'll still get a collision (at 32 bits) on
> > more than 1 system in a hundred thousand.
>

> I don't see this (multiple systems) as relevant. Device numbers do not need
> to be unique across systems, and they shouldn't even need to be unique across
> system reboots. Even when collisions occur, it doesn't matter, as it can just
> pick a different random number, or follow a free list, or hundreds of other
> algorithms.
>
> Isn't this all just a question of device registration performance? 1) The
> device module needs to register the appropriate numbers efficiently.

What is "efficiently"? No one really cares about milliseconds here,
seconds are even tolerable at least for small seconds :)

> 2) /dev needs to be populated or updated efficiently. devfs tried for
> a just in time approach, whereas udev tries for a proactive approach.

"proactive"? udev is "reactive" in that it reacts to the number that
the kernel exports to userspace. That's all.

Remember, devfs also uses those same, hardcoded numbers...

thanks,

greg k-h

Valdis.K...@vt.edu

Jan 3, 2004, 2:00:15 AM

On Fri, 02 Jan 2004 22:07:48 PST, Greg KH <gr...@kroah.com> said:

> What is "efficiently"? No one really cares about milliseconds here,
> seconds are even tollerable at least for small seconds :)

Anybody who's had to sit and watch a Sun E10K enumerate 400+ disks
will disagree with that, unless "small seconds" are tiny fractions thereof. :)

Ian Kent

Jan 3, 2004, 7:10:05 AM

Even an old E3500 with only 70 or so disks and the evil RDAC is enough.

Ian

Andries Brouwer

Jan 3, 2004, 8:20:09 AM

On Fri, Jan 02, 2004 at 08:46:33PM -0800, Linus Torvalds wrote:

> > Random cookies? I prefer "arbitrary" over "random". The value plays no role
> > at all, but it must be unique, preferably stable across reboots.
>

> The operative word in "preferably stable across reboots" is
> "preferably". Because it basically cannot be in the general case,

> and thus nothing must ever _assume_ it is.

Sure. It is not "need". It is "quality of implementation".
Consider NFS.

Andries

Helge Hafting

Jan 3, 2004, 10:20:09 AM

On Fri, Jan 02, 2004 at 09:59:38PM -0800, Greg KH wrote:
> On Thu, Jan 01, 2004 at 02:18:55AM +0100, Helge Hafting wrote:
> > On Tue, Dec 30, 2003 at 04:29:42PM -0800, Greg KH wrote:
> > >
> > > 2) We are (well, were) running out of major and minor numbers for
> > > devices.
> >
> > devfs tried to fix this one by _getting rid_ of those numbers.
> > Seriously - what are they needed for?
>
> But devfs failed in this. The devfs kernel interface still requires a
> major/minor number to create device nodes.
>

Yes. The numbers went unused in the common case of opening a device by name though.

> Hopefully I can work on fixing this up in 2.7.

Interesting - how do you plan to do this?
There must be some connection from device node to driver. Devfs had
a pointer in the inode. The old way has numbers, and spends time on
a search.

Are you considering a sort of "minimal devfs" managed by udev?

Helge Hafting

Pavel Machek

Jan 3, 2004, 1:40:19 PM

Hi!

> actively do things that are "silly" on purpose. For example, for
> debugging, we start the "jiffies" counter not at zero, but at -300. That's
> patently _silly_, but it was very useful in finding the cases where the
> rare general case was not handled correctly.

BTW, as we are currently in stable series, it might be good idea to
make jiffies start at zero... Hopefully jiffie wrap had enough testing
during stable...

Pavel
--
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]

vi...@parcelfarce.linux.theplanet.co.uk

Jan 3, 2004, 4:30:11 PM

On Sat, Jan 03, 2004 at 04:22:41PM +0100, Helge Hafting wrote:
> On Fri, Jan 02, 2004 at 09:59:38PM -0800, Greg KH wrote:
> > On Thu, Jan 01, 2004 at 02:18:55AM +0100, Helge Hafting wrote:
> > > On Tue, Dec 30, 2003 at 04:29:42PM -0800, Greg KH wrote:
> > > >
> > > > 2) We are (well, were) running out of major and minor numbers for
> > > > devices.
> > >
> > > devfs tried to fix this one by _getting rid_ of those numbers.
> > > Seriously - what are they needed for?
> >
> > But devfs failed in this. The devfs kernel interface still requires a
> > major/minor number to create device nodes.
> >
> Yes. The numbers went unused in the common case of opening a device by name though.

No, they were not. RTFS, please.

Greg KH

Jan 3, 2004, 5:20:10 PM

On Sat, Jan 03, 2004 at 04:22:41PM +0100, Helge Hafting wrote:

> > Hopefully I can work on fixing this up in 2.7.
>
> Interesting - how do you plan to do this?

Probably something like the current interface for USB minor numbers when
CONFIG_USB_DYNAMIC_MINORS is enabled. The drivers will request a
certain major/minor, but the kernel will just give it whatever it feels
like.

That's my first guess, actual implementation will probably differ wildly
:)
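
Very roughly, and only as a sketch of the idea rather than the actual USB
code: the driver passes in the minor it would like, and the allocator
ignores the hint and hands back the first free one. The names alloc_minor,
free_minor and MAX_MINORS are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_MINORS 256 /* assumed table size */

static bool minor_in_use[MAX_MINORS];

/* The driver may pass the minor it would like, but the hint is ignored:
 * the allocator returns the first free slot, or -1 if none is left. */
static int alloc_minor(int requested)
{
	(void)requested;
	for (int i = 0; i < MAX_MINORS; i++) {
		if (!minor_in_use[i]) {
			minor_in_use[i] = true;
			return i;
		}
	}
	return -1;
}

/* Give a previously allocated minor back to the pool. */
static void free_minor(int minor)
{
	assert(minor >= 0 && minor < MAX_MINORS && minor_in_use[minor]);
	minor_in_use[minor] = false;
}
```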

thanks,

greg k-h

Greg KH

unread,

Jan 3, 2004, 5:20:11 PM1/3/04

to

It's "small seconds" _after_ the kernel has enumerated them. That's the
majority of the time spent enumerating scsi disks.

Also, udev will be running while the kernel is off detecting the next
disk.

Greg KH

unread,

Jan 3, 2004, 5:30:12 PM1/3/04

to

On Sat, Jan 03, 2004 at 02:01:40PM +0100, Witukind wrote:

> On Fri, 2 Jan 2004 21:59:38 -0800
> Greg KH <gr...@kroah.com> wrote:
>
> > On Thu, Jan 01, 2004 at 02:18:55AM +0100, Helge Hafting wrote:
> > > On Tue, Dec 30, 2003 at 04:29:42PM -0800, Greg KH wrote:
> > > >
> > > > 2) We are (well, were) running out of major and minor numbers for
> > > > devices.
> > >
> > > devfs tried to fix this one by _getting rid_ of those numbers.
> > > Seriously - what are they needed for?
> >
> > But devfs failed in this. The devfs kernel interface still requires a
> > major/minor number to create device nodes.
>

> Let's be more precise and not say that "devfs" failed this, but that the
> current implementation of devfs failed this.

Um, that's all we have to go by right now, sorry.

> If devfs works good on FreeBSD, it probably means that the current
> devfs for Linux is badly designed, not that the idea of devfs is bad.

I have no idea how FreeBSD implemented devfs.

If you know how FreeBSD implemented devfs, and how it solves all of the
problems that I detailed in my original posting, I would be interested.

Linus Torvalds

unread,

Jan 3, 2004, 5:40:08 PM1/3/04

to

On Sat, 3 Jan 2004, Andries Brouwer wrote:
>

> Sure. It is not "need". It is "quality of implementation".
> Consider NFS.

The problems occur when there are things we _cannot_ guarantee, and that
user space starts unnecessarily to depend on. And that ends up resulting
in bugs waiting to happen. Bugs that many "normal" developers may never
hit, simply because the quality of implementation ends up being so good
that it hides the problem cases in regular usage.

And then a high-quality implementation actually ends up being
_detrimental_. It's hiding problems that can still happen, they just
happen rarely enough that the bugs don't get found and fixed.

And then the painful thing of forcing "stupid", aka "bad QoI" behaviour,
actually ends up being the better thing in the long run.

Linus

Christoph Hellwig

unread,

Jan 3, 2004, 5:40:19 PM1/3/04

to

On Sat, Jan 03, 2004 at 02:16:04PM -0800, Greg KH wrote:
> > If devfs works good on FreeBSD, it probably means that the current
> > devfs for Linux is badly designed, not that the idea of devfs is bad.
>
> I have no idea how FreeBSD implemented devfs.
>
> If you know how FreeBSD implemented devfs, and how it solves all of the
> problems that I detailed in my original posting, I would be interested.

The FreeBSD implementation is pretty similar to the devfs we have in 2.6,
API- and implementation-wise. Just because it works somehow in most
situations doesn't mean it's right.

Andries Brouwer

unread,

Jan 3, 2004, 6:20:06 PM1/3/04

to

On Sat, Jan 03, 2004 at 02:27:47PM -0800, Linus Torvalds wrote:

> > Sure. It is not "need". It is "quality of implementation".
> > Consider NFS.

> And then a high-quality implementation actually ends up being
> _detrimental_. It's hiding problems that can still happen, they just
> happen rarely enough that the bugs don't get found and fixed.

Empty talk. This is not about finding and fixing bugs.
We know very precisely what properties the NFS protocol has.
Now one can have a system that works as well as possible with NFS.
And one can have a worse system.

Andries

Mark Mielke

unread,

Jan 3, 2004, 8:20:11 PM1/3/04

to

On Sun, Jan 04, 2004 at 12:08:40AM +0100, Andries Brouwer wrote:
> On Sat, Jan 03, 2004 at 02:27:47PM -0800, Linus Torvalds wrote:
> > And then a high-quality implementation actually ends up being
> > _detrimental_. It's hiding problems that can still happen, they just
> > happen rarely enough that the bugs don't get found and fixed.
> Empty talk. This is not about finding and fixing bugs.
> We know very precisely what properties the NFS protocol has.
> Now one can have a system that works as well as possible with NFS.
> And one can have a worse system.

It seems to me that as long as /dev is always a local mount (tmpfs in
the case of an NFS-root installation), it doesn't really matter. Maintaining
system-specific information on a remote machine seems dirty, and something
that shouldn't be *expected* to work. You wouldn't expect /proc to work
over NFS, would you? :-)

mark

--
ma...@mielke.cc/ma...@ncf.ca/ma...@nortelnetworks.com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

Valdis.K...@vt.edu

unread,

Jan 3, 2004, 9:00:21 PM1/3/04

to

On Sat, 03 Jan 2004 20:16:26 EST, Mark Mielke said:

> It seems to me that as long as /dev is always a local mount (tmpfs in
> the case of an NFS-root installation), it doesn't really matter. Maintaining
> system-specific information on a remote machine seems dirty, and something
> that shouldn't be *expected* to work. You wouldn't expect /proc to work
> over NFS, would you? :-)

ISTR that SunOS 4.0 handled an NFS-mounted /dev and swap just fine some 15
years ago? (In fact, due to performance differences between the disks on a
Sun 3/2xx server and the shoebox disk on a 3/50, you could page faster over
the net than to a local /dev/swap.)

So it's more a case of "we have decided to do it differently" than "that's so nuts
that it shouldn't be expected to work"....

Linus Torvalds

unread,

Jan 3, 2004, 9:20:07 PM1/3/04

to

On Sun, 4 Jan 2004, Andries Brouwer wrote:
>
> Empty talk. This is not about finding and fixing bugs.
> We know very precisely what properties the NFS protocol has.
> Now one can have a system that works as well as possible with NFS.
> And one can have a worse system.

Oh, things can be _much_ worse than /dev over NFS.

You don't seem to realize what I mean with "not enumerable".

With NFS, you could have some strange per-mount device number mapping etc,
and it wouldn't need to be all that complicated.

But if you start considering network-attached storage (as in "disks over
IP", not as in "samba"), the problem is that you fundamentally cannot
enumerate the things on a kernel level. EVER. There is no way to do
automatic discovery, because the bus fundamentally isn't enumerable. It
isn't even _repeatable_, i.e. if you broadcast "tell me what disks
exist", the results won't come back in any particular order.

In other words, the device numbers that eventually get attached to these
disks (however the discovery ends up working - with the sysadmin
explicitly mentioning them, or with some kind of broadcast protocol)
simply WILL NOT NECESSARILY be the same across reboots.

And there just _isn't_ any way to make them the same or to "describe" the
storage in any integer of any finite length. It has nothing to do with
32-bit vs 64-bit vs 1024-bit.

Once you accept that fact, you should accept the fact that device numbers
not only have no meaning, they literally have no permanence across reboots
either.

Yes, the common case is permanent. What I'm saying is that the common case
_cannot_ be the generic case.

Linus

Andries Brouwer

unread,

Jan 3, 2004, 10:00:12 PM1/3/04

to

On Sat, Jan 03, 2004 at 06:09:47PM -0800, Linus Torvalds wrote:
> On Sun, 4 Jan 2004, Andries Brouwer wrote:

> > Empty talk. This is not about finding and fixing bugs.
> > We know very precisely what properties the NFS protocol has.
> > Now one can have a system that works as well as possible with NFS.
> > And one can have a worse system.
>
> Oh, things can be _much_ worse than /dev over NFS.

Yes, but why do you start saying that?

Our topic is the statement that it is good to have device numbers
stable across a reboot. Not absolutely necessary, but good.

For example, given an NFS mount, if the server reboots and
suddenly the client sees different stat data, that would be
less than optimal. A low quality NFS implementation.

You write long stories - but it really is desirable to have
stable device numbers.

> You don't seem to realize what I mean with "not enumerable".

One of your side avenues is the matter of enumeration.
I don't see why that would be relevant. One identifies
things by their UUID. Order is never important.

> And there just _isn't_ any way to make them the same or to "describe" the
> storage in any integer of any finite length. It has nothing to do with
> 32-bit vs 64-bit vs 1024-bit.

A UUID usually takes 128 bits.

Andries

Norman Diamond

unread,

Jan 3, 2004, 10:00:13 PM1/3/04

to

Pavel Machek wrote:

> BTW, as we are currently in stable series, it might be good idea to
> make jiffies start at zero...

I disagree. The importance of fixing bugs does not decrease in stable.
Hiding bugs is still the opposite of fixing bugs.

Perhaps I misunderstand the meaning of stable, but I expected stable to mean
that efforts tend more towards fixing things so they work properly, and
unstable meant that efforts tend more towards adding features even though
they're broken at first. Hiding a broken thing is still the opposite of
fixing a broken thing.

> Hopefully jiffie wrap had enough testing during stable...

I think you mean unstable, in which case I agree with this half of what I
think you meant. This still doesn't give any reason to switch back to
hiding things. In fact this doesn't give any reason to switch from a
technique that "hopefully [...] had enough testing" to a different
technique, even if logically the different technique doesn't need as much
testing.

Linus Torvalds

unread,

Jan 3, 2004, 10:10:11 PM1/3/04

to

On Sun, 4 Jan 2004, Andries Brouwer wrote:
>
> You write long stories - but it really is desirable to have
> stable device numbers.

And I write the long stories because you do not seem to _get_ the point.

The point is that we will most likely ON PURPOSE break those stable device
numbers, for debugging reasons. Because it is _not_ desirable to have
people _believe_ that they can depend on stable device numbers.

> I don't see why that would be relevant. One identifies
> things by their UUID. Order is never important.

And this is exactly how it should be. However, it requires that user code
actually does the right thing.

And to _verify_ that user code properly identifies devices by other things
than device numbers, we should during 2.7.x explicitly _break_ all
dependencies on stable device numbers.

And UUID's are _not_ "device numbers". They fundamentally _cannot_ be
that, because the kernel just doesn't have any information on how to
generate a unique identifier that is actually stable.

The kernel doesn't know what it can depend on - should it look at the UUID
in the boot sector of the disk, or should it look up the UUID using IP
number reverse lookup, or what?

The only thing that can generate a UUID is literally user mode. Which is
_exactly_ why things like udev exist.

So device numbers are _not_ UUID's. Device numbers are needed before the
UUID's have been identified.

And that has been my point all along: device numbers do not have any
meaning. They are neither unique nor stable across reboots. They have no
information AT ALL associated with them. Anybody who thinks that they are
is fundamentally _wrong_ about it.

I agree that for a stable kernel we should then go back to "best effort"
mode, where for simple politeness reasons we should try to keep device
numbers as stable as we can.

Linus

Ananda Bhattacharya

unread,

Jan 3, 2004, 11:40:10 PM1/3/04

to

Hi,
I was wondering if one compiles a kernel for a
Pentium 4 which has HyperThreading will we need to recompile
SMP support for a single physical CPU or will one need to
have SMP enabled to take advantage of hyperthreading?

thanks
-A

Martin J. Bligh

unread,

Jan 4, 2004, 1:00:12 AM1/4/04

to

> I was wondering if one compiles a kernel for a
> Pentium 4 which has HyperThreading will we need to recompile
> SMP support for a single physical CPU or will one need to
> have SMP enabled to take advantag of hyperthreading.

You need SMP.

M.

Greg KH

unread,

Jan 4, 2004, 4:10:14 AM1/4/04

to

On Fri, Jan 02, 2004 at 01:26:44AM -0600, Rob Landley wrote:
> > Moral: keep the identifier creation framework flexible enough so that you
> > can chose device-specific means to produce useful identifiers. (And, use
> > long identifiers, as they're less likely to be duplicated in general.)
>
> Seems to be what udev is for. When we do go to random major and minor
> numbers, maybe it would be useful to let udev request specific ones? (Just a
> thought...)

Let udev request specific what? Major/minor numbers? Huh? I think you
are very confused here...

thanks,

greg k-h

Rob Landley

unread,

Jan 4, 2004, 4:50:14 AM1/4/04

to

On Sunday 04 January 2004 02:57, Greg KH wrote:
> On Fri, Jan 02, 2004 at 01:26:44AM -0600, Rob Landley wrote:
> > > Moral: keep the identifier creation framework flexible enough so that
> > > you can chose device-specific means to produce useful identifiers.
> > > (And, use long identifiers, as they're less likely to be duplicated in
> > > general.)
> >
> > Seems to be what udev is for. When we do go to random major and minor
> > numbers, maybe it would be useful to let udev request specific ones?
> > (Just a thought...)
>
> Let udev request specific what? Major/minor numbers? Huh? I think you
> are very confused here...

Currently, NFS exports are using device major/minor as part of the identifier
for an exported directory, and device numbers are going to be dynamically
allocated in 2.7 to support hotplug, so I was wondering if there was a need
to have some way for root to go "I know this device hotplugged in at major 3
minor 99, but if major 53 minor 12 is free, could you change it to that?" A
bit like dup2, only for devices.

The discussion has moved on since then, and now it seems pretty clear that NFS
is going to be expected to use something OTHER than device numbers, and Linus
wants a clean break with device nodes being cookies. Better solution all
around, really...

But the original question did make sense. (The answer was "no", but that's
often the sign of a good question. :)

> thanks,
>
> greg k-h

Rob

Andries Brouwer

unread,

Jan 4, 2004, 8:30:13 AM1/4/04

to

On Sat, Jan 03, 2004 at 07:04:17PM -0800, Linus Torvalds wrote:

> I agree that for a stable kernel we should then go back to "best effort"
> mode, where for simple politeness reasons we should try to keep device
> numbers as stable as we can.

Good - you understand now.
So, the right setup - you call it politeness, I call it quality
of implementation - is to have both stable names and stable numbers,
in as many cases as possible.

Concerning the names, we are in reasonable shape. We have nameif
that binds a stable name to a MAC address. Much better than eth2.
Also udev is a good step in the right direction - it gives
stable names under certain circumstances.

(And since udev can use the kernel device number, it can give stable
names under more circumstances when the kernel device number is
more often stable.)

Concerning the numbers, numbers based on enumeration are less than
satisfactory - they must be the last fallback when nothing else
can be found. And the ordering then is the ordering in time.

Almost always something better can be found. It is the drivers' job
to invent the device number. For the important special case of
SCSI or IDE disk, the disk serial number can be used.

Our helper function takes a string and an integer and a range, and
produces a device number in the given range, distinct from already
existing numbers. If you prefer random device numbers you make this
function ignore the string argument. I prefer stable device numbers
so would do an md5sum-like thing.
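
A minimal sketch of such a helper, under loud assumptions: FNV-1a stands in
for the "md5sum-like thing", collisions are resolved by linear probing, and
the names pick_devno, fnv1a and RANGE_MAX are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RANGE_MAX 1024 /* assumed upper bound on the range size */

static bool taken[RANGE_MAX];

/* FNV-1a string hash, seeded by the caller. */
static uint32_t fnv1a(const char *s, uint32_t h)
{
	while (*s) {
		h ^= (unsigned char)*s++;
		h *= 16777619u;
	}
	return h;
}

/* Map (serial, unit) to a number in [0, range) that has not been handed
 * out yet. The same inputs give the same number as long as the slot is
 * free; otherwise the next free slot is used. Returns -1 if the range
 * is exhausted. */
static int pick_devno(const char *serial, uint32_t unit, uint32_t range)
{
	assert(serial != NULL && range > 0 && range <= RANGE_MAX);
	uint32_t start = fnv1a(serial, 2166136261u ^ unit) % range;

	for (uint32_t i = 0; i < range; i++) {
		uint32_t n = (start + i) % range;
		if (!taken[n]) {
			taken[n] = true;
			return (int)n;
		}
	}
	return -1;
}
```

Ignoring the string argument and starting from a random slot instead would
give the random numbering; the probing loop stays the same either way.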

And that brings us back to the start of this thread:
Life is simpler when there is more room.
So it is a pity that we chose less room.

Andries

Mark Mielke

unread,

Jan 4, 2004, 4:00:19 PM1/4/04

to

On Sat, Jan 03, 2004 at 08:54:36PM -0500, Valdis.K...@vt.edu wrote:
> ISTR that SunOS 4.0 handled an NFS-mounted /dev and swap just fine
> some 15 years ago? (in fact, due to performance differences between
> the disks on a Sun3/ 2xx server and the shoebox disk on a 3/50, you
> could page faster over the net than to a local /dev/swap).

Whether it did at some point, or whether it didn't, doesn't really matter.

It doesn't need to, and with the amount of memory that most computers come
with these days, remote access storage for tiny kernel data structures, like
that which would be required for tmpfs /dev that is only populated with the
devices that actually exist, just isn't worth it.

> So it's more a case of "we have decided to do it differently" than
> "that's so nuts that it shouldn't be expected to work"....

I was saying "why do you think this is a good model?" not "I can't imagine
why you would do it..." :-) Sorry it didn't come across as I intended.

Linus Torvalds

unread,

Jan 4, 2004, 4:10:23 PM1/4/04

to

On Sun, 4 Jan 2004, Andries Brouwer wrote:
>

> On Sat, Jan 03, 2004 at 07:04:17PM -0800, Linus Torvalds wrote:
> >
> > I agree that for a stable kernel we should then go back to "best effort"
> > mode, where for simple politeness reasons we should try to keep device
> > numbers as stable as we can.
>
> Good - you understand now.

Oh, _I_ always understood. You were the one that was arguing for stable
numbers as somehow important. I'm just telling you that they aren't
stable, and that a user application that depends on their stability or
their uniqueness is BROKEN.

> So, the right setup - you call it politeness, I call it quality
> of implementation - is to have both stable names and stable numbers,
> in as many cases as possible.

And I still disagree. You seem to think that this is an "absolute
goodness", and call it a quality issue.

While I personally strongly believe that it is a bug in user space to
care, and that it is not a quality issue at all, but rather a matter of
"allow buggy and/or nonconverted user space to work".

In other words, it's not about "quality", as much as about compatibility
with applications that are old and/or braindead. Big difference.

Linus

Andries Brouwer

unread,

Jan 4, 2004, 5:10:16 PM1/4/04

to

On Sun, Jan 04, 2004 at 01:05:20PM -0800, Linus Torvalds wrote:

> Oh, _I_ always understood. You were the one that was arguing for
> stable numbers as somehow important.

Indeed. I said "preferably stable across reboots".

> I'm just telling you that they aren't stable, and that a
> user application that depends on their stability or
> their uniqueness is BROKEN.

Surprise! Are you leaving POSIX? Or ditching NFS?
Or demanding that NFS servers must never reboot?

A common Unix idiom is testing for the identity
of two files by comparing st_ino and st_dev.
A broken idiom?

No idea what part of our Unix heritage you now have decided to call broken.

Andries

Helge Hafting

unread,

Jan 4, 2004, 5:30:16 PM1/4/04

to

On Sun, Jan 04, 2004 at 11:01:04PM +0100, Andries Brouwer wrote:
> On Sun, Jan 04, 2004 at 01:05:20PM -0800, Linus Torvalds wrote:
>
> > Oh, _I_ always understood. You were the one that was arguing for
> > stable numbers as somehow important.
>
> Indeed. I said "preferably stable across reboots".
>
> > I'm just telling you that they aren't stable, and that a
> > user application that depends on their stability or
> > their uniqueness is BROKEN.
>
> Surprise! Are you leaving POSIX? Or ditching NFS?
> Or demanding that NFS servers must never reboot?
>
> A common Unix idiom is testing for the identity
> of two files by comparing st_ino and st_dev.
> A broken idiom?
>
> No idea what part of our Unix heritage you now have decided to call broken.
>

You worry about /dev over nfs, with the server booting in the middle of
such a comparison? This can work even with randomized device numbers,
just don't let that nfs server populate the exported /dev itself.

Let the client(s) run udev, and have one /dev for each on persistent
storage. If the nfs server reboots it simply keeps serving /dev's
in whatever shape the clients set them up with.

Helge Hafting

vi...@parcelfarce.linux.theplanet.co.uk

unread,

Jan 4, 2004, 5:40:23 PM1/4/04

to

On Sun, Jan 04, 2004 at 11:01:04PM +0100, Andries Brouwer wrote:

> A common Unix idiom is testing for the identity
> of two files by comparing st_ino and st_dev.
> A broken idiom?

No, just your usual highly selective reading. First of all, that
idiom relies only on different ->s_dev *among* *currently* *mounted*
*filesystems*. In the part that has anything to do with devices, it means
only one thing:

Any two different block devices that are both currently opened by
the kernel and are both alive must have different device numbers.

Note the "are alive" part - we can even allow reuse of device numbers
as long as we make sure that stat() will fail on filesystems mounted
from dead ones.

Now, care to explain how preserving aforementioned common Unix idiom
is related to your expostulations?

Valdis.K...@vt.edu

unread,

Jan 4, 2004, 6:40:17 PM1/4/04

to

On Sun, 04 Jan 2004 23:01:04 +0100, Andries Brouwer said:

> A common Unix idiom is testing for the identity
> of two files by comparing st_ino and st_dev.
> A broken idiom?

Comparing two of these obtained at the same time is *usually* a good
test, although racy even on current systems. (Consider the case of an
unlink()/creat() pair between the two stat() calls - there's been more than
one race condition resulting in a security hole based on THIS one). It's
only safe if you actually have an open reference to both files before you
fstat() either one. And yes, it has to be fstat(), as you can't guarantee
that the file referenced by path in stat() is the one you did an open() on.
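
A minimal sketch of the safe form of the test, assuming a POSIX system. The
helper names same_file and same_path_file are made up for this example;
fstat(), open() and close() are the standard calls.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

/* Both descriptors must already be open, so neither inode can be freed
 * and reused between the two fstat() calls. */
static bool same_file(int fd1, int fd2)
{
	struct stat a, b;

	assert(fd1 >= 0 && fd2 >= 0);
	if (fstat(fd1, &a) != 0 || fstat(fd2, &b) != 0)
		return false;
	return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}

/* Open both paths first, compare afterwards, then close. */
static bool same_path_file(const char *p1, const char *p2)
{
	int fd1 = open(p1, O_RDONLY);
	int fd2 = open(p2, O_RDONLY);
	bool same = fd1 >= 0 && fd2 >= 0 && same_file(fd1, fd2);

	if (fd1 >= 0)
		close(fd1);
	if (fd2 >= 0)
		close(fd2);
	return same;
}
```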

Comparing the st_ino/st_dev for a file today with one from last Friday has
NEVER been a good idea.

Mark Mielke

unread,

Jan 4, 2004, 8:10:15 PM1/4/04

to

On Sun, Jan 04, 2004 at 10:37:10PM +0000, vi...@parcelfarce.linux.theplanet.co.uk wrote:
> On Sun, Jan 04, 2004 at 11:01:04PM +0100, Andries Brouwer wrote:
> > A common Unix idiom is testing for the identity
> > of two files by comparing st_ino and st_dev.
> > A broken idiom?
> No, just your usual highly selective reading. First of all, that
> idiom relies only on different ->s_dev *among* *currently* *mounted*
> *filesystems*.

> ...

> Now, care to explain how preserving aforementioned common Unix idiom
> is related to your expostulations?

I think he is defending bad design practices by pointing out common
bad design practices, and asking why these bad practices shouldn't be
allowed to continue, given that they are so common... :-)

Are there any real programs that assume st_dev/st_ino values are constant
across mount/unmount/mount? If so, Linus is saying we should break these
programs, so that the authors can become aware of the problem, rather than
leaving the problem as a subtle corner case.

I see no reason at all to keep these programs running. They are incorrect,
and that is that.

If and when this comes up in 2.7 development, I would like to see an
option of the sort: 1) Try to maintain major:minor numbers across
reboots (even at the expense of complexity and efficiency), 2) Try to
maintain a subset of the major:minor numbers across reboots
(compromise) 3) Provide the most efficient implementation, making no
guarantees regarding the numbering scheme, unless using a numbering
scheme turns out to be more efficient. Deprecate 1), and let 2) and 3)
evolve until we see who the victor is... :-) As long as the interface
that maps device to number is abstracted, the above should be pluggable.

mark

--
ma...@mielke.cc/ma...@ncf.ca/ma...@nortelnetworks.com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

Jeremy Maitin-Shepard

unread,

Jan 4, 2004, 8:50:08 PM1/4/04

to

Valdis.K...@vt.edu writes:

> On Sun, 04 Jan 2004 23:01:04 +0100, Andries Brouwer said:
>> A common Unix idiom is testing for the identity
>> of two files by comparing st_ino and st_dev.
>> A broken idiom?

> Comparing two of these obtained at the same time is *usually* a good
> test, although racy even on current systems. (Consider the case of an
> unlink()/creat() pair between the two stat() calls - there's been more than
> one race condition resulting in a security hole based on THIS one). It's
> only safe if you actually have an open reference to both files before you
> fstat() either one. And yes, it has to be fstat(), as you can't guarantee
> that the file referenced by path in stat() is the one you did an
> open() on.

Unfortunately, programs such as tar depend on inode numbers of distinct
files being distinct even when the file is not open over a period of
several minutes/seconds. This is needed to avoid dumping hard links
more than once. Furthermore, there is no efficient way to write
programs such as tar without depending on this capability. Thus, if
st_ino cannot be used reliably for this purpose, it would be useful for
there to be a system call for retrieving a true
unique-within-the-filesystem identifier for the file.

--
Jeremy Maitin-Shepard

vi...@parcelfarce.linux.theplanet.co.uk

unread,

Jan 4, 2004, 9:10:05 PM1/4/04

to

On Sun, Jan 04, 2004 at 08:43:27PM -0500, Jeremy Maitin-Shepard wrote:

> Unfortunately, programs such as tar depend on inode numbers of distinct
> files being distinct even when the file is not open over a period of
> several minutes/seconds. This is needed to avoid dumping hard links
> more than once. Furthermore, there is no efficient way to write
> programs such as tar without depending on this capability. Thus, if
> st_ino cannot be used reliably for this purpose, it would be useful for
> there to be a system call for retrieving a true
> unique-within-the-filesystem identifier for the file.

No such thing. It's not the matter of having a syscall to extract such
identifier - it's that on a lot of filesystems (including many common Unix
ones) there's nothing that would qualify.

Note that tar et al. do not behave well if used on an actively modified
directory tree, and ->st_ino reuse is the least of the problems in that area.

Jeremy Maitin-Shepard

unread,

Jan 4, 2004, 9:10:08 PM1/4/04

to

Mark Mielke <ma...@mark.mielke.cc> writes:

> On Sun, Jan 04, 2004 at 08:43:27PM -0500, Jeremy Maitin-Shepard wrote:

>> Unfortunately, programs such as tar depend on inode numbers of distinct
>> files being distinct even when the file is not open over a period of
>> several minutes/seconds. This is needed to avoid dumping hard links
>> more than once. Furthermore, there is no efficient way to write
>> programs such as tar without depending on this capability. Thus, if
>> st_ino cannot be used reliably for this purpose, it would be useful for
>> there to be a system call for retrieving a true
>> unique-within-the-filesystem identifier for the file.

> We already have that: st_nlink

> I think you mean a system call that would allow you to be certain that
> two file descriptors refer to the same file. Then, any files with
> st_nlink >= 2 would have to use the system call to match them up.

In order to efficiently implement tar, it is necessary to store the
inode numbers for files with a link count greater than 1 in a hash
table. It would not be practical to keep open all of these files in
order to ensure that the inode numbers remain valid. Thus, a different
unique identifier is needed, which is unique even for files that are not
open.
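
A minimal sketch of that bookkeeping, assuming a fixed-size open-addressing
table keyed on (st_dev, st_ino). The names seen_before, is_repeat_hard_link
and TABLE_SIZE are invented here and are not taken from any tar source. The
sketch inherits exactly the assumption under discussion: the numbers must
stay valid for the whole run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

#define TABLE_SIZE 4096 /* assumed fixed capacity, must be a power of two */

struct link_entry {
	dev_t dev;
	ino_t ino;
	bool used;
};

static struct link_entry seen[TABLE_SIZE];

/* Returns true if (dev, ino) was recorded earlier; otherwise records it
 * and returns false. Collisions are resolved by linear probing. */
static bool seen_before(dev_t dev, ino_t ino)
{
	size_t h = ((size_t)ino * 31u + (size_t)dev) & (TABLE_SIZE - 1);

	for (size_t i = 0; i < TABLE_SIZE; i++) {
		struct link_entry *e = &seen[(h + i) & (TABLE_SIZE - 1)];

		if (!e->used) {
			e->dev = dev;
			e->ino = ino;
			e->used = true;
			return false;
		}
		if (e->dev == dev && e->ino == ino)
			return true;
	}
	assert(!"link table full");
	return false;
}

/* Only files with more than one link need to be remembered. */
static bool is_repeat_hard_link(const struct stat *st)
{
	return st->st_nlink > 1 && seen_before(st->st_dev, st->st_ino);
}
```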

Jeremy Maitin-Shepard

unread,

Jan 4, 2004, 9:20:07 PM1/4/04

to

vi...@parcelfarce.linux.theplanet.co.uk writes:

> On Sun, Jan 04, 2004 at 08:43:27PM -0500, Jeremy Maitin-Shepard wrote:
>> Unfortunately, programs such as tar depend on inode numbers of distinct
>> files being distinct even when the file is not open over a period of
>> several minutes/seconds. This is needed to avoid dumping hard links
>> more than once. Furthermore, there is no efficient way to write
>> programs such as tar without depending on this capability. Thus, if
>> st_ino cannot be used reliably for this purpose, it would be useful for
>> there to be a system call for retrieving a true
>> unique-within-the-filesystem identifier for the file.

> No such thing. It's not the matter of having a syscall to extract such
> identifier - it's that on a lot of filesystems (including many common Unix
> ones) there's nothing that would qualify.

Even if the files in question aren't being modified, created, deleted,
etc.? Even if nothing on the filesystem is being modified, created,
deleted, etc.?

> [snip]

--
Jeremy Maitin-Shepard

Valdis.K...@vt.edu

unread,

Jan 4, 2004, 9:30:20 PM1/4/04

to

On Sun, 04 Jan 2004 20:02:36 EST, Mark Mielke said:

> If and when this comes up in 2.7 development, I would like to see an
> option of the sort: 1) Try to maintain major:minor numbers across
> reboots (even at the expense of complexity and efficiency), 2) Try to
> maintain a subset of the major:minor numbers across reboots
> (compromise) 3) Provide the most efficient implementation, making no
> guarantees regarding the numbering scheme, unless using a numbering
> scheme turns out to be more efficient. Deprecate 1), and let 2) and 3)
> evolve until we see who the victor is... :-) As long as the interface
> that maps device to number is abstracted, the above should be pluggable.

I'd recommend (at least during 2.7) some code in the allocator:

	if (LINUX_VERSION_CODE % 3) {
		u32 r[2];

		/* get_random_bytes() fills a buffer rather than returning a value */
		get_random_bytes(r, sizeof(r));
		major ^= r[0];
		minor ^= r[1];
	}

Just to keep everybody honest. :)

Andries Brouwer

unread,

Jan 4, 2004, 9:40:04 PM1/4/04

to

On Sun, Jan 04, 2004 at 10:37:10PM +0000, vi...@parcelfarce.linux.theplanet.co.uk wrote:

Hi Al - a happy 2004 to you too!

> Now, care to explain how preserving aforementioned common Unix idiom
> is related to your expostulations?

Hmm. You sound like you agree that random device numbers and NFS
are a bad combination, but don't see why my example might be
relevant.

There is a great variation here in what various servers and clients do,
but roughly speaking filehandles tend to contain a fsid, and this fsid
often (no fsid= given) involves (major,minor,ino). When device numbers
vary randomly, the fsid may vary randomly. Various bad things may happen:
maybe all file handles go stale (or, worse, refer to something else),
or maybe device numbers on the client vary randomly.

Andries

Linus Torvalds

unread,

Jan 4, 2004, 10:00:07 PM1/4/04

to

On Sun, 4 Jan 2004, Andries Brouwer wrote:
>

> Surprise! Are you leaving POSIX? Or ditching NFS?
> Or demanding that NFS servers must never reboot?

Ok, Andries, time for you to take a deep breath, and calm down. Because
your arguments are getting ridiculous in the extreme.

A NFS server is sure as hell not going to export _its_ dynamic /dev to its
clients. That would be not just stupid, but crazy. Next you tell me that
you were using devfs and exporting that over NFS.

A NFS server is going to export something _totally_ different than its own
/dev directory - it needs to be _client_-specific anyway. That's true with
stable numbers too, btw - ever tried to mount a Solaris /dev on a Linux
client? No workee.

> A common Unix idiom is testing for the identity
> of two files by comparing st_ino and st_dev.
> A broken idiom?

No. It still works. Even if the device numbers change across reboots.

Why? Because that _program_ sure as hell isn't running across a reboot.

And again, this is not something we haven't seen before. Have you ever
looked at the "st_dev" values? Try once - look at what it returns for a
NFS-mounted filesystem. Ponder. Notice how it already is NOT stable across
reboots.

In other words, the stuff you're complaining about is all stuff that
nobody has _ever_ been able to rely on, and that has nothing to do with
udev or anything else. It all just shows how 100% right I am for saying
that you cannot rely on stable numbers.

So I repeat: calm down, and think it through.

Linus

David Lang

Jan 4, 2004, 10:10:12 PM

Linus, what Andries is saying is that if you export a directory (say
/home), the process of exporting it somehow uses the /dev device number, so
if the server reboots and gets a different device number for the partition
that /home is on, the clients won't see it as the same export, breaking the
NFS requirement that a server can be rebooted.

I don't agree with him, because if the NFS server does include /dev info in
what it shows to the outside world, it's already broken.

David Lang

On Sun, 4 Jan 2004, Linus Torvalds wrote:

> Date: Sun, 4 Jan 2004 18:52:56 -0800 (PST)
> From: Linus Torvalds <torv...@osdl.org>
> To: Andries Brouwer <ae...@win.tue.nl>
> Cc: Rob Love <r...@ximian.com>, r...@landley.net,
> Pascal Schmidt <der.e...@email.de>, linux-...@vger.kernel.org,
> Greg KH <gr...@kroah.com>
> Subject: Re: udev and devfs - The final word

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

Daniel Jacobowitz

Jan 4, 2004, 10:20:05 PM

On Sun, Jan 04, 2004 at 06:52:56PM -0800, Linus Torvalds wrote:
>
>
> On Sun, 4 Jan 2004, Andries Brouwer wrote:
> >
> > Surprise! Are you leaving POSIX? Or ditching NFS?
> > Or demanding that NFS servers must never reboot?
>
> Ok, Andries, time for you to take a deep breath, and calm down. Because
> your arguments are getting ridiculous in the extreme.
>
> An NFS server is sure as hell not going to export _its_ dynamic /dev to its
> clients. That would be not just stupid, but crazy. Next you tell me that
> you were using devfs and exporting that over NFS.
>
> An NFS server is going to export something _totally_ different than its own
> /dev directory - it needs to be _client_-specific anyway. That's true with
> stable numbers too, btw - ever tried to mount a Solaris /dev on a Linux
> client? No workee.

I think you two are talking straight past each other. Andries is
talking about the fsid, which is determined by the NFS server, based on
its idea of the device number of the filesystem underlying the exported
directory. Right now, I can reboot my host system, and when it comes
up then the NFS directories it exports to clients will have the same
fsid. With random device numbers it won't work; after rebooting the
NFS server all clients will be forced to explicitly unmount and
remount.

Now, it seems to me that this isn't much of an argument against random
device numbers. Have userspace set a UUID for the device if you want,
and use that in the fsid instead. But that's the argument; it has
nothing to do with the NFS server exporting its /dev.

--
Daniel Jacobowitz
MontaVista Software Debian GNU/Linux Developer

vi...@parcelfarce.linux.theplanet.co.uk

Jan 4, 2004, 10:20:16 PM

On Sun, Jan 04, 2004 at 09:02:02PM -0500, Jeremy Maitin-Shepard wrote:

> In order to efficiently implement tar, it is necessary to store the
> inode numbers for files with a link count greater than 1 in a hash
> table. It would not be practical to keep open all of these files in
> order to ensure that the inode numbers remain valid. Thus, a different
> unique identifier is needed, which is unique even for files that are not
> open.

Files that are not open could've been removed and replaced with something
completely different since your stat(2).
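
For reference, the kind of table being discussed could look roughly like the sketch below. The names seen_set and seen_before and the fixed capacity are assumptions made for illustration; this is not taken from tar's source.

```c
#include <stddef.h>
#include <sys/types.h>

#define SEEN_MAX 1024	/* arbitrary capacity chosen for this sketch */

struct seen_set {
	dev_t dev[SEEN_MAX];
	ino_t ino[SEEN_MAX];
	size_t count;
};

/*
 * Returns 1 if (dev, ino) was recorded earlier, 0 if it is new and has
 * now been recorded, -1 if the table is full.  A real archiver would use
 * a hash table; a linear scan keeps the sketch short.
 */
int seen_before(struct seen_set *s, dev_t dev, ino_t ino)
{
	size_t i;

	for (i = 0; i < s->count; i++)
		if (s->dev[i] == dev && s->ino[i] == ino)
			return 1;
	if (s->count == SEEN_MAX)
		return -1;
	s->dev[s->count] = dev;
	s->ino[s->count] = ino;
	s->count++;
	return 0;
}
```

An entry in such a table is only a record of what stat(2) said at the time; nothing stops the file from being unlinked and the inode number reused before the archiver looks again.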

Linus Torvalds

Jan 4, 2004, 10:40:09 PM

On Sun, 4 Jan 2004, Daniel Jacobowitz wrote:
>
> I think you two are talking straight past each other. Andries is
> talking about the fsid, which is determined by the NFS server, based on
> its idea of the device number of the filesystem underlying the exported
> directory. Right now, I can reboot my host system, and when it comes
> up then the NFS directories it exports to clients will have the same
> fsid. With random device numbers it won't work; after rebooting the
> NFS server all clients will be forced to explicitly unmount and
> remount.

Ahh. I'll buy into that, and yes, this is an example of something that
needs to be fixed.

It shouldn't be fixed by saying "device numbers have to be stable across
reboots", because the fact is, we're most likely going to have storage
that is really really hard to enumerate in a repeatable fashion.

So the _proper_ thing to do is to have the NFS server not use the device
number as part of fsid. It should use the _stable_ UUID of the filesystem
or some similar label.
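
A minimal sketch of that idea, assuming (purely for illustration) that the identifier is a 32-bit value obtained by hashing the UUID text with FNV-1a. The function name is hypothetical and this is not how any existing NFS server computes its fsid:

```c
#include <stdint.h>

/*
 * Hypothetical: fold a filesystem UUID string into a 32-bit identifier
 * with the FNV-1a hash.  The result depends only on the UUID text, not
 * on whatever major/minor pair the disk happened to get at this boot.
 */
uint32_t fsid_from_uuid(const char *uuid)
{
	uint32_t h = 2166136261u;	/* FNV-1a 32-bit offset basis */

	for (; *uuid; uuid++) {
		h ^= (unsigned char)*uuid;
		h *= 16777619u;		/* FNV-1a 32-bit prime */
	}
	return h;
}
```

Any stable function of the UUID would do; the point is only that the input survives a reboot and a move to a different controller, while the device number may not.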

And it should do that exactly because the device number isn't as stable as
NFS exporting would like it to be. Exactly because of things like
network-attached disks etc. How would you otherwise export a disk that perhaps
gets its address from DHCP?

[ I incredulously asked a NetApp person why you'd ever want to expose the
_disk_ over ethernet, rather than just have the NAS device export a
filesystem of its own. It turns out that some people really want to just
see a block device, either because Windows sucks at network filesystems
or because they want to do things like databases onto them. The point
being that once you do that, you'll likely want to export the thing as
an SMB share from the thing that "owns" the disk.

So you would literally have a _disk_ whose IP address changed depending
on what other machines were booted on the same network. ]

Issues like this are also why Linux vendors have already started doing
things like "mount by label" - because disks have a nasty tendency to move
around, and specifying the fstab contents (or "root=xxx" on the kernel
command line) with physical location or similar just doesn't work all
that well. It happens today with things like USB2 or firewire disks. They
get moved around, and they get a new device number.

It's still not _common_, but it's slowly getting there.

> Now, it seems to me that this isn't much of an argument against random
> device numbers. Have userspace set a UUID for the device if you want,
> and use that in the fsid instead. But that's the argument; it has
> nothing to do with the NFS server exporting its /dev.

I buy into that, and I agree 100% with you that this is just a case where
you should use a UUID.

Linus

vi...@parcelfarce.linux.theplanet.co.uk

Jan 4, 2004, 10:50:08 PM

On Mon, Jan 05, 2004 at 03:29:01AM +0100, Andries Brouwer wrote:
> On Sun, Jan 04, 2004 at 10:37:10PM +0000, vi...@parcelfarce.linux.theplanet.co.uk wrote:
>
> Hi Al - a happy 2004 to you too!
>
> > Now, care to explain how preserving aforementioned common Unix idiom
> > is related to your expostulations?
>
> Hmm. You sound like you agree that random device numbers and NFS
> are a bad combination, but don't see why my example might be
> relevant.

No. I don't see what the fuck it has to do with POSIX compliance, the ability
to determine whether two files are identical by st_ino/st_dev, and common
UNIX idioms.

> There is a great variation here in what various servers and clients do,
> but roughly speaking filehandles tend to contain a fsid, and this fsid
> often (no fsid= given) involves (major,minor,ino).

Now, _that_ is true. And yes, I agree that setups with unstable device
numbers do need explicit actions on part of admin. In particular, editing
/etc/exports to add fsid= in each relevant entry.

Which means that *in* *setups* *where* *numbers* *are* *currently* *stable*
we should not make them random without admin's knowledge. And /etc/exports
is not the only problem - RAID, journaling filesystems with device number of
log stored on-disk, etc.

*However*, if we are talking about new classes of devices, all bets are off
and proper fix is to stop using unsuitable interfaces for those devices.
For exports it means "use explicit fsid". For RAID we both agreed, IIRC,
that raidtools will need to switch to saner API, etc.

Rob Landley

Jan 4, 2004, 11:00:12 PM

On Sunday 04 January 2004 21:06, David Lang wrote:
> Linus, what Andries is saying is that if you export a directory (say
> /home), the process of exporting it somehow uses the /dev device number, so
> if the server reboots and gets a different device number for the partition
> that /home is on, the clients won't see it as the same export, breaking the
> NFS requirement that a server can be rebooted.

NFS always struck me as a perverse design. "The fileserver must be stateless
with regard to clients, even though maintaining state is what a filesystem
DOES, and the point of the thing is to export a filesystem." Okay... (If it
was exporting read-only filesystems with no locking of any kind, maybe they'd
have a point, but come on guys...)

So here's an example of where the fileserver _does_ expect to maintain
non-file state across reboots. "Ooh, the device node we're exporting is part
of the ID, gee, we missed one!"

So why, exactly, can the NFS server not maintain whatever extra state it needs
to remember between reboots in a filesystem? (Not even necessarily the one
it's exporting, just some rc file something under /var.) The device node it
was exporting USED to be in the filesystem, you know, ala mknod. Now that
the kernel's not keeping that stable, have the #*%(&# server generate a
number and make a note of it somewhere. (Is requiring an NFS server to have
access to persistent storage too much to ask?)
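
Roughly what that could look like, as a sketch only: the function name, the one-number-per-file format and the caller-supplied fresh value are all assumptions made here, not an existing nfsd or nfs-utils interface.

```c
#include <stdio.h>

/*
 * Hypothetical: fetch the export ID stored in `path`.  If the file does
 * not exist yet, store `fresh` there and hand it back, so later boots
 * read the same value.  Returns 0 on success, -1 on error (including a
 * file that exists but does not contain a number).
 */
int load_or_create_id(const char *path, unsigned long fresh,
		      unsigned long *id)
{
	FILE *f = fopen(path, "r");

	if (f) {
		int ok = fscanf(f, "%lu", id) == 1;

		fclose(f);
		return ok ? 0 : -1;
	}

	f = fopen(path, "w");
	if (!f)
		return -1;
	*id = fresh;
	if (fprintf(f, "%lu\n", fresh) < 0) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}
```

The first call for a given path records the number; every later call ignores `fresh` and returns what was recorded, whatever device number the kernel handed out this time.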

Personally, I could never figure out why Samba servers are in userspace but
NFS servers seem to want to live in the kernel. I can almost secure a samba
server for access to the outside world, but a NFS system that isn't behind a
firewall automatically says to me "this machine has already been compromised
eight ways from sunday within five minutes of being exposed to the internet".
Call me paranoid...

Rob

vi...@parcelfarce.linux.theplanet.co.uk

Jan 4, 2004, 11:00:14 PM

On Sun, Jan 04, 2004 at 07:33:16PM -0800, Linus Torvalds wrote:
> Ahh. I'll buy into that, and yes, this is an example of something that
> needs to be fixed.
>
> It shouldn't be fixed by saying "device numbers have to be stable across
> reboots", because the fact is, we're most likely going to have storage
> that is really really hard to enumerate in a repeatable fashion.
>
> So the _proper_ thing to do is to have the NFS server not use the device
> number as part of fsid. It should use the _stable_ UUID of the filesystem
> or some similar label.

... and we already have a way to specify it explicitly. Which, BTW, allows
you to take the server down, copy the exported fs from a failing IDE disk to
a SCSI one and reexport. With clients remaining happy with you. Remember
discussions circa 2.5.50 or so about that stuff?

So we have tools for that. And it's 100% OK to say "if you are doing NFS
export of filesystem that lives on $new_weird_device, explicit fsid= is
not just a good idea, it's a must-have".

What is _not_ OK, though, is to have folks suddenly see /dev/hda3 changing
its device number - then we would break existing setups that worked all
along; even if admin can fix the breakage, it's not a good thing to do.

Linus Torvalds

Jan 4, 2004, 11:10:07 PM

On Mon, 5 Jan 2004 vi...@parcelfarce.linux.theplanet.co.uk wrote:
>
> What is _not_ OK, though, is to have folks suddenly see /dev/hda3 changing
> its device number - then we would break existing setups that worked all
> along; even if admin can fix the breakage, it's not a good thing to do.

Ehh, it will actually happen.

If nothing else, things like SATA will end up meaning that the device you
were used to seeing as /dev/hdc will suddenly show up as /dev/scd0
instead. Just because you changed the cabling while you upgraded to a
newer version of your CD-ROM drive.

And the thing is, with fs labels and udev, even "existing systems" really
shouldn't much care.

Now, we'd probably not want to force the switch, but I do suspect we'll
have exactly this as a switch in the "Kernel Debugging Config" section.
Where even _common_ things like disks could end up with per-bootup values.
Just to verify that every part of the system ends up having it right.

Think of it this way: RedHat not that long ago decided to break with a
_lot_ of tradition by switching over to UTF-8 as the common text encoding.
It broke some _major_ programs in how they dealt with "simple" things like
keyboard input that had worked for literally _decades_.

And you could switch it off if you really wanted to, but quite frankly, it
wasn't even a simple choice in the install. You had to know what you were
doing to switch it off.

And the thing is, that is _the_ single thing that cleaned up a lot of
remaining problems wrt UTF-8 on Linux. Yes, almost all of them had been
solved already, or RH wouldn't have dared do the switch. But to get there
all the way, you had to literally force the cut-over.

(Yeah, I'm a bad person, and I personally went back to the C locale,
because "pine" still doesn't get UTF-8 right, and nobody is apparently
ever going to fix it. Oh, well. But at least I know I'm doing something
_wrong_, which in itself is a good thing.).

Linus

Peter Chubb

Jan 4, 2004, 11:20:07 PM

>>>>> "Andries" == Andries Brouwer <ae...@win.tue.nl> writes:

Andries> On Sun, Jan 04, 2004 at 01:05:20PM -0800, Linus Torvalds
Andries> wrote:

Andries> Surprise! Are you leaving POSIX? Or ditching NFS? Or
Andries> demanding that NFS servers must never reboot?

Andries> A common Unix idiom is testing for the identity of two files
Andries> by comparing st_ino and st_dev. A broken idiom?

It's worse than that. You can do

    mknod fred b maj minor

anywhere on any UNIX filesystem and expect it to a) work and b) refer
to the same device for all time until it is removed. However, this
doesn't appear to be guaranteed by SUS -- the only guarantees are that
the dev_t returned from the stat() family of calls is unique within a LAN.
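
To make concrete what such a node stores, here is a small sketch that checks which (major, minor) pair a device node carries. It assumes a Linux/glibc system where major() and minor() come from <sys/sysmacros.h>; the helper name is made up.

```c
#include <sys/stat.h>
#include <sys/sysmacros.h>	/* makedev(), major(), minor() on glibc */
#include <sys/types.h>

/*
 * Hypothetical helper: returns 1 if `path` is a block or character
 * device node carrying the given numbers, 0 if it is not (including
 * when it is not a device node at all), -1 if stat() fails.
 */
int node_refers_to(const char *path, unsigned int maj, unsigned int min)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;
	if (!S_ISBLK(st.st_mode) && !S_ISCHR(st.st_mode))
		return 0;

	/* st_rdev is the number stored in the node itself */
	return major(st.st_rdev) == maj && minor(st.st_rdev) == min;
}
```

The node holds nothing but the two numbers and the block/char type; which piece of hardware answers to them is decided by the kernel when the node is opened.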

I know that Linux already breaks this (the stupid /dev/sg[0-9] that
depend not on the SCSI bus and lun but on the order they're detected,
for example).

vi...@parcelfarce.linux.theplanet.co.uk

Jan 4, 2004, 11:50:07 PM

On Sun, Jan 04, 2004 at 08:02:20PM -0800, Linus Torvalds wrote:
>
>
> On Mon, 5 Jan 2004 vi...@parcelfarce.linux.theplanet.co.uk wrote:
> >
> > What is _not_ OK, though, is to have folks suddenly see /dev/hda3 changing
> > its device number - then we would break existing setups that worked all
> > along; even if admin can fix the breakage, it's not a good thing to do.
>
> Ehh, it will actually happen.
>
> If nothing else, things like SATA will end up meaning that the device you
> were used to seeing as /dev/hdc will suddenly show up as /dev/scd0
> instead. Just because you changed the cabling while you upgraded to a
> newer version of your CD-ROM drive.

If I open the damn box, I sure as hell can be bothered to edit stuff in
/etc...

> And the thing is, with fs labels and udev, even "existing systems" really
> shouldn't much care.
>
> Now, we'd probably not want to force the switch, but I do suspect we'll
> have exactly this as a switch in the "Kernel Debugging Config" section.
> Where even _common_ things like disks could end up with per-bootup values.
> Just to verify that every part of the system ends up having it right.

Then we'd better have a very good idea of the things that are going to
break. Note that right now even late-boot code in kernel itself will
break on that - there are explicit checks for ROOT_DEV==MKDEV(2,0),
all sorts of weird crap deep in the bowels of arch/ppc/*/*, etc.

It won't be an easy transition - I know that Greg is very optimistic
about it, but there will be a *lot* of crap to take care of. In theory
getting bigger dev_t should've been very straightforward, but if you
check what really had been involved...

ObOtherStraightforwardThings: net_device refcounting. Take a look at
Jeff's queue someday - by now it's one big merge short of getting it
right for practically all drivers. 1.9Mb total + 247Kb pending patches
here. Several hundred changesets, practically all of them fixing
exploitable holes. And yes, most of them had been bugs all along -
since 2.2 if not earlier. Sure, that made things better, but if somebody
comes along and makes similar "fun" necessary for e.g. ALSA...

> because "pine" still doesn't get UTF-8 right, and nobody is apparently
> ever going to fix it. Oh, well. But at least I know I'm doing something
> _wrong_, which in itself is a good thing.).

Heh. Took you long enough - "using pine" should've been a dead giveaway
from the very beginning ;-)

Linus Torvalds

Jan 4, 2004, 11:50:09 PM

On Mon, 5 Jan 2004, Peter Chubb wrote:
>
> It's worse than that. You can do
> mknod fred b maj minor
> anywhere on any UNIX filesystem and expect it to a) work and b) refer
> to the same device for all time until it is removed.

Hmm.. I can see (a) (except for the fact that pretty much all unixes have
mount-flags to say "no device files") but I don't see why you'd _ever_
expect (b) to be true.

It's patently not true for such rather traditional unix devices as pty's,
for example. The "same device" ends up being true only for as long as the
master at the other end exists - and the same numbers get re-used in all
normal usage for different virtual devices.

> I know that Linux already breaks this (the stupid /dev/sg[0-9] that
> depend not on the SCSI bus and lun but on the order they're detected,
> for example)

That "stupid" thing is a hell of a lot less stupid than the alternatives,
and is very much equivalent to how pty's work.

In fact, the "number according to detection" is pretty much the best
device number allocation strategy. It's the _only_ one that doesn't have
some incorrect bias built-in.

Linus

Trond Myklebust

Jan 5, 2004, 12:00:10 AM

On Sun, 04/01/2004 at 22:48, Rob Landley wrote:

> NFS always struck me as a perverse design. "The fileserver must be stateless
> with regard to clients, even though maintaining state is what a filesystem
> DOES, and the point of the thing is to export a filesystem." Okay... (If it
> was exporting read-only filesystems with no locking of any kind, maybe they'd
> have a point, but come on guys...)

Sigh... What has that got to do with anything?

Read the RFCs: NFS *was* entirely stateless until v4 was drafted.
Locking was never part of the NFS protocol, but was an external addition
that was documented by the Open Group. So, yes, there is a history and a
reason behind all the talk of statelessness.

As for the current thread about remembering device numbers: as far as
NFS is concerned, that is entirely an implementation issue. There is no
need for any extra NFS protocol support for this sort of crap.

> So why, exactly, can the NFS server not maintain whatever extra state it needs
> to remember between reboots in a filesystem? (Not even necessarily the one
> it's exporting, just some rc file something under /var.) The device node it
> was exporting USED to be in the filesystem, you know, ala mknod. Now that
> the kernel's not keeping that stable, have the #*%(&# server generate a
> number and make a note of it somewhere. (Is requiring an NFS server to have
> access to persistent storage too much to ask?)

It could be done (and probably entirely in userspace). I assume you are
volunteering to do the work?

> Personally, I could never figure out why Samba servers are in userspace but
> NFS servers seem to want to live in the kernel. I can almost secure a samba
> server for access to the outside world, but a NFS system that isn't behind a
> firewall automatically says to me "this machine has already been compromised
> eight ways from sunday within five minutes of being exposed to the internet".
> Call me paranoid...

Sun was doing Kerberos for NFS years before the Samba project was
started.

Security has bugger all to do with kernel or userland and everything to
do with the short-sighted "munitions" policies of certain governments at
the time around when the Sun RPC protocol was being drafted. The same
policies were still around to dictate our implementation much later when
we were doing RPC for Linux. Now the laws have changed, and so we've
finally been able to add strong authentication in 2.6.x.

Cheers,
Trond

Linus Torvalds

Jan 5, 2004, 12:00:11 AM

On Mon, 5 Jan 2004 vi...@parcelfarce.linux.theplanet.co.uk wrote:

> > If nothing else, things like SATA will end up meaning that the device you
> > were used to seeing as /dev/hdc will suddenly show up as /dev/scd0
> > instead. Just because you changed the cabling while you upgraded to a
> > newer version of your CD-ROM drive.
>
> If I open the damn box, I sure as hell can be bothered to edit stuff in
> /etc...

Actually, not necessarily.

The thing is, _the_ most common reason I have for opening the box is that
the effing thing started having problems.

At which point I want to just remove the disk, move it to another box, and
boot up the other box.

And THAT is exactly the kind of situation where I sure as hell don't want
to care exactly where the disk was. I can't "prepare" for it by editing
files in /etc, since I don't know that the CPU fan or whatever is going to
die on me.

And this is _exactly_ why we should try to get away from device numbering
having any meaning. Because if we do this right, then when something like
the CPU fan dies and I move the disk to a new machine that has SATA (with
the disk having both SATA and PATA connectors), I shouldn't need to even
_think_ about it.

That's where "mount by label" does part of the job. But if the system is
_always_ set up to do things like NFS exports according to some separate
UUID, that too would "just work".

There's a lot to be said for "just work". Even if sometimes it takes some
pain when you break old (and broken) assumptions.

> > because "pine" still doesn't get UTF-8 right, and nobody is apparently
> > ever going to fix it. Oh, well. But at least I know I'm doing something
> > _wrong_, which in itself is a good thing.).
>
> Heh. Took you long enough - "using pine" should've been a dead giveaway
> from the very beginning ;-)

Those are them fighting words.

But since you brought it up: do you actually have anything else that can
open a remote IMAP file with a few thousand messages without taking ages
for it, and that you don't have to mouse around with? I'd like a graphical
interface for configuring stuff etc, but I sure as hell don't want to find
some f*ing icon to save a few messages that I selected in-order to my
"doit" queue or go to the next one, or pipe the thing to a shell-script,
or any number of things that are my actual _job_.

And the "no mousing" means that I don't want to have some popup window
that asks me what file I want to save into or similar crap. I can type
fast enough if I stay on the keyboard and can focus on one part of the
screen, but if I have to switch my focus around, I'm a goner.

On a related matter, I'm probably a retard, but I've tried alternatives to
"trn" too, and there really aren't any. None of the graphical news readers
can show me one full page of threads, select the 3-4 threads from _that_
one page that I want (from the keyboard), and then kill _that_ one page.
Not the whole newsgroup: only the part that shows in the window at that
time.

In "trn", the magic command is capital-D, for "discard".

Linus

Eric W. Biederman

Jan 5, 2004, 12:40:10 AM

vi...@parcelfarce.linux.theplanet.co.uk writes:

> On Sun, Jan 04, 2004 at 08:02:20PM -0800, Linus Torvalds wrote:
> > Now, we'd probably not want to force the switch, but I do suspect we'll
> > have exactly this as a switch in the "Kernel Debugging Config" section.
> > Where even _common_ things like disks could end up with per-bootup values.
> > Just to verify that every part of the system ends up having it right.
>
> Then we'd better have a very good idea of the things that are going to
> break. Note that right now even late-boot code in kernel itself will
> break on that - there are explicit checks for ROOT_DEV==MKDEV(2,0),
> all sorts of weird crap deep in the bowels of arch/ppc/*/*, etc.

/sbin/lilo and possibly some of the other bootloaders. Relationships
between devices are a challenge to work with. How do you go from a
partition to its actual block device, etc.? I don't remember how many
major numbers lilo has hard coded, I just remember looking at it once
and realizing I couldn't think of a better way to accomplish what it
was trying to do.

Eric

vi...@parcelfarce.linux.theplanet.co.uk

Jan 5, 2004, 1:20:10 AM

On Sun, Jan 04, 2004 at 08:52:56PM -0800, Linus Torvalds wrote:

> That's where "mount by label" does part of the job. But if the system is
> _always_ set up to do things like NFS exports according to some separate
> UUID, that too would "just work".

mount by label does part of the job, until you decide to use dd(1) to copy
a disk. At which point you have, AFAICS, no way to tell which copy will get
mounted.
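
A sketch of the ambiguity, with a made-up volume table and function name. It assumes the caller has already read each filesystem's label; how mount(8) actually resolves labels is not modelled here.

```c
#include <stddef.h>
#include <string.h>

struct volume {
	const char *device;	/* e.g. "/dev/hda3" */
	const char *label;	/* filesystem label read from the volume */
};

/*
 * Hypothetical lookup: returns how many volumes carry `label`.  *match
 * is set to the device only when exactly one volume matches; otherwise
 * it is NULL and the caller has to refuse or ask the admin.
 */
size_t find_by_label(const struct volume *vols, size_t n,
		     const char *label, const char **match)
{
	size_t i, hits = 0;

	*match = NULL;
	for (i = 0; i < n; i++) {
		if (strcmp(vols[i].label, label) == 0) {
			if (hits == 0)
				*match = vols[i].device;
			hits++;
		}
	}
	if (hits != 1)
		*match = NULL;
	return hits;
}
```

After a dd(1) copy both disks carry the same label, so the count comes back as 2; a lookup that silently took the first hit would depend on detection order.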

> Those are them fighting words.
>
> But since you brought it up: do you actually have anything else that can
> open a remote IMAP file with a few thousand messages without taking ages
> for it, and that you don't have to mouse around with? I'd like a graphical
> interface for configuring stuff etc, but I sure as hell don't want to find
> some f*ing icon to save a few messages that I selected in-order to my
> "doit" queue or go to the next one, or pipe the thing to a shell-script,
> or any number of things that are my actual _job_.

I prefer to ssh to another box and use mutt. Seriously, I made the mistake
of reading the imapd source, and that was enough to decide that I'm _not_
touching uw-<anything> and that protocol in general unless I really have no
other options. So far I've managed to avoid that...

> On a related matter, I'm probably a retard, but I've tried alternatives to
> "trn" too, and there really aren't any.

Same here. There are things about trn command set I'd prefer to see changed,
but it's better than other newsreaders I've seen...

Rob Landley

Jan 5, 2004, 2:10:09 AM

On Sunday 04 January 2004 22:52, Trond Myklebust wrote:
> On Sun, 04/01/2004 at 22:48, Rob Landley wrote:
> > NFS always struck me as a perverse design. "The fileserver must be
> > stateless with regard to clients, even though maintaining state is what
> > a filesystem DOES, and the point of the thing is to export a filesystem."
> > Okay... (If it was exporting read-only filesystems with no locking of
> > any kind, maybe they'd have a point, but come on guys...)
>
> Sigh... What has that got to do with anything?
>
> Read the RFCs: NFS *was* entirely stateless until v4 was drafted.
> Locking was never part of the NFS protocol, but was an external addition
> that was documented by the Open Group. So, yes, there is a history and a
> reason behind all the talk of statelessness.

I vaguely remember being pretty well up to speed on V2 (circa... 1995?). The
last one I even glanced at was V3, but I never had to support it. I haven't
even looked at V4. For exporting /home directories, everybody I deal with
seems to want samba servers these days instead for some reason. (A couple of
net boot systems care more about permissions than that, but ram's so cheap
that it's easier to just run

    ssh user@bootserver -i key "cat root_img.tgz" | tar xz

into a ramfs or shmfs or some such. Heck, the last system I set up like that
mounted a zisofs image and ran from that...)

I'm sure it's still useful. I just haven't wanted to even attempt to secure
it. For home directories, samba is doing a simple tcp/ip connection per
session, reestablishing it automatically if it breaks (same server reboot
question). Since _both_ protocols seem to suck pretty badly under the hood,
it's been a question of choosing the lesser of two evils. It seems that more
people actually USE samba, so...

> > So why, exactly, can the NFS server not maintain whatever extra state it
> > needs to remember between reboots in a filesystem? (Not even necessarily
> > the one it's exporting, just some rc file something under /var.) The
> > device node it was exporting USED to be in the filesystem, you know, ala
> > mknod. Now that the kernel's not keeping that stable, have the #*%(&#
> > server generate a number and make a note of it somewhere. (Is requiring
> > an NFS server to have access to persistent storage too much to ask?)
>
> It could be done (and probably entirely in userspace). I assume you are
> volunteering to do the work?

I don't like nfs, I haven't bothered to actually use it for anything since
1999, so no.

> > Personally, I could never figure out why Samba servers are in userspace
> > but NFS servers seem to want to live in the kernel. I can almost secure
> > a samba server for access to the outside world, but a NFS system that
> > isn't behind a firewall automatically says to me "this machine has
> > already been compromised eight ways from sunday within five minutes of
> > being exposed to the internet". Call me paranoid...
>
> Sun was doing Kerberos for NFS years before the Samba project was
> started.
>
> Security has bugger all to do with kernel or userland and everything to
> do with the short-sighted "munitions" policies of certain governments at
> the time around when the Sun RPC protocol was being drafted. The same

I can transparently tunnel any tcp/ip session through ssh with some iptables
rules and a dozen-line python script. (Great fun for rolling your own vpn.)
Mixing UDP and encryption is just plain a bad idea: there's no level at which
it makes sense to store persistent connection state in a "fire and forget"
packet protocol...

I.E. this also works with samba, but didn't with (old) NFS.

> policies were still around to dictate our implementation much later when
> we were doing RPC for Linux. Now the laws have changed, and so we've
> finally been able to add strong authentication in 2.6.x.

Can you recommend a good link to the history of NFS? Computer history's a
hobby of mine. (I've got snippets on this topic, but not any kind of unified
story of NFS...)

http://www.landley.net/history/mirror/index.html
http://www.landley.net/history/scans/index.html

> Cheers,
> Trond

Rob