Adding checkpointing API to Linux kernel

Werner Krebs

Jan 18, 1999, 3:00:00 AM

to linux-...@vger.rutgers.edu

I like your comments, and am forwarding them to the linux-kernel mailing
list.

I'm the maintainer of GNU Queue, a distributed batch and interactive job
load-balancing system. The homepage for GNU Queue is
http://bioinfo.mbb.yale.edu/~wkrebs/queue.html (frequently updated) and
http://www.gnu.org/software/queue (official).

Andy Glew writes about the need for an NT-like (gasp!) API in
GNU/Linux that would allow trapping of OS system calls without recompiling
code. (For use in checkpoint migration facilities, which we are
considering adding to GNU Queue.) Avoiding recompilation is necessary to
use these facilities with commercial apps. The scary thing is it appears to
be easier in NT than in GNU/Linux (see Andy's comments). Given NT's
other "features" (lack of true simultaneous multiuser capabilities), it
would seem that GNU/Linux and UNIX are otherwise ideal systems for this
type of distributed application, so this is not a good thing, IMHO.

Could anyone develop such an API?

Andy Glew (gl...@cs.wisc.edu)
> I'm not in the Condor group. I'm just a user. I recommend that you contact
> the Condor folk directly, off their web page.
>
> By the way, rumour has it that the Condor source code has been "sold"
> to the NCSA - the National Center for Supercomputer Applications, you know,
> the people who paid for Mosaic. Since the NCSA is not a company, maybe
> they can be approached wrt public use.
>
> Other comments:
>
> I frequently use SPSS, and less frequently SAS, so I know of what you speak.
> You probably saw that Condor has the option of running "vanilla UNIX" jobs,
> without the remote UNIX system calls, and without checkpointing, without
> recompilation. On UNIX, checkpointing requires recompilation. My jobs
> usually run using AFS filesystem access on the local machine, and checkpointing.
>
> Recent Condor work has involved binary editing to insert the Condor code,
> which would give you a way of getting checkpoints and migration without recompiling.
> A student presented a class project on this last year.
>
> Most annoyingly, it looks like the Condor facilities can be obtained most easily
> without recompilation in the ongoing Condor port to NT. It turns out that Windows
> has an API to allow editing of any interface between modules that is exported as a
> DLL interface. So transparency is easier on NT than on UNIX. Providing
> such an API for LINUX might be enough to help jumpstart LINUX applications.
>
>
>
> Werner Krebs wrote:
>
> > Yes, I only recently looked into Condor (I thought it was commercial).
> >
> > With the source code being restricted, it is certainly not ``free'' software in the usual sense of the term.
> >
> > Queue and Condor have different origins. Queue was originally written to migrate SAS and Splus jobs, where checkpoint migration isn't an option. These are large statistical jobs
> > which not only use fork and read and write the same files, but create huge temporary files that they need to re-read over and over again. You need a large scratch disk on each
> > machine for these to run efficiently, so you simply can't trap all system calls and send them over the network; you need to have both fast local disk space ('/usr/tmp') and global
> > disk space (NFS or AFS filesystem) that are distinguished by pathnames. An efficient, reliable, cached network filesystem is also a plus to reduce network traffic, as these jobs are
> > both I/O and CPU bound at different times.
> >
> > This, in turn, means that once the program is running on a machine, migrating it to another machine is basically impossible.
> >
> > However, it turns out Queue is similar to an expensive commercial product, LSF, which apparently uses old Condor code (the network host advertisement code). We're planning to
> > reimplement something like this (probably radically different, as I have yet to even read the Condor manual) using a PostgreSQL server to take information from clients. But, it would be
> > much simpler if we could use existing free code and go along with an existing standard. (Adding old checkpoint migration from Condor to Queue wouldn't be bad, either.)
> >
> > At one point, Condor published its source code and allowed free re-distribution of it; I was wondering if it would be possible for Queue to use some of that code. We're GNU, so to
> > actually distribute the code (rather than just the hooks), RMS would probably want the code to be licensed under the GPL, which is a much stricter license than what it
> > is currently licensed under (no use of the code in commercial software products without consent of Copyright holders).
> >
> > Since I notice you're at Wisconsin, I was wondering if something like this could be arranged (using an old, widely published version of the Condor code in Queue). The Condor folks
> > seem to have removed all their old source code from the net.
> >
> > Thanks.

[... snip for brevity]

>
> --
> ---
> Andy "Krazy" Glew, gl...@cs.wisc.edu, UW Madison and Intel.
> DISCLAIMER: private posting, not representative of university or Intel.
> Please respond by email in addition to replying to newsgroup.
>

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

Andy Glew

Jan 18, 1999, 3:00:00 AM

to Alan Cox, Werner Krebs, linux-...@vger.rutgers.edu

> > Andy Glew writes about the need for an NT-like (gasp!) API in
> > GNU/Linux that would allow trapping of OS system calls without recompiling
> > code. (For use in checkpoint migration facilities, which we are
> > considering adding to GNU Queue.) Avoiding recompilation is necessary to
> > use these facilities with commercial apps. The scary thing is it appears to
> > be easier in NT than in GNU/Linux (see Andy's comments). Given NT's
>

> You can already do it in Linux. LD_PRELOAD and LD_LIBRARY_PATH allow you
> to control and modify linkloading order. If you get a static linked commercial
> binary just ask them for the object modules under the LGPL terms the C library
> has to rebuild it (or just for a dynamic binary)

I think the key thing is that the NT APIs allow you to do things like

    handle := open(API-name);

    FOREACH symbol IN handle.symbollist DO
        IF is_function(symbol) THEN
            ...
            ... dynamically create code to marshall arguments,
            ... intercept syscall, and then call original syscall.
            ...
        ENDIF
    ENDDO

I.e. you do not have to know, a priori, what functions you want to intercept
- as you would have to if placing something on the path.
You can intercept them all.

Moreover, you can intercept a call to foo(), have it call your_foo(),
and then allow it to continue on to call the default foo().

Come to think of it, I suspect that Beowulf has something similar
- note that "marshalling arguments" is a classic RPC (remote procedure
call) term. Or maybe Beowulf doesn't need it?
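
For comparison, here is a minimal sketch of the LD_PRELOAD route mentioned earlier in the thread. A preloaded shared object defines its own getpid(), does its own bookkeeping first, and then forwards to the next definition in link order via dlsym(RTLD_NEXT, ...). The choice of getpid() and the intercepted_calls counter are assumptions made purely for illustration; unlike the loop above, this wraps one function that is named in advance rather than every symbol in an interface.

```c
#define _GNU_SOURCE          /* exposes RTLD_NEXT on glibc */
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical counter: how many calls passed through the wrapper. */
unsigned long intercepted_calls;

/* Replacement getpid(): runs our code, then calls the original. */
pid_t getpid(void)
{
    static pid_t (*real_getpid)(void);

    if (real_getpid == NULL) {
        /* Look up the next getpid() after this object in link order. */
        real_getpid = (pid_t (*)(void))dlsym(RTLD_NEXT, "getpid");
        assert(real_getpid != NULL);
    }
    intercepted_calls++;      /* the "your_foo()" part */
    return real_getpid();     /* continue on to the default foo() */
}
```

Built as a shared object and named in LD_PRELOAD, a wrapper like this would apply to dynamically linked programs only; statically linked binaries never consult the dynamic linker. Older C libraries may need -ldl at link time.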

Andy Glew

Jan 18, 1999, 3:00:00 AM

to Werner Krebs

By the way, although I do not know how to do this "API interposition"
as flexibly on LINUX as on NT (I hope that Alan Cox will show me how),
I would probably agree with someone who said that this is not an appropriate
topic for the kernel group.

One of Windows' strengths has been that such interposition can be done
for any API that is implemented in a DLL, not just the kernel/syscall
interface. (There have been papers about syscall interposition for UNIXes
that worked just on the syscall interface, not for any other.)

Therefore, typically, in Windows a whole slew of different modules
come together in the form of DLLs. Sometimes a module is implemented
by something resembling an OS kernel or device driver - i.e. a shared
library interface that is implemented in a different privilege domain.
Sometimes a module is implemented by message passing or remote
procedure call - marshalling its arguments and sending them to a different
process, perhaps a different computer system. Sometimes a module is just
an ordinary library.

The nice thing about interposition is that you can intercept any such DLL
level API. It's frequently used for profiling.

Doing interposition just for the kernel would be insufficient, although it might
answer the network batch programs' needs - such as Werner's for GNU QUEUE.

There are any number of binary editing tools for UNIXes out there.
Probably at least one of them runs on LINUX. Unfortunately, most
of the ones I know are a bit too low level, but maybe not all are like that.

So, therefore: I don't think that it is appropriate to add such an interposition API
to the LINUX kernel. It is better done at a higher level - in the dynamic link
tools, but also, perhaps, through binary editing on running programs.
It should be noted that API editing is a very specific form of binary editing
- if APIs are first class, and all calls go through interface tables, then
it corresponds solely to changing the interface tables, and involves none
of the complexities of actually editing the call sites (finding free registers,
etc.)
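
The interface-table point can be illustrated with a small sketch. Every name here (api_table, install_counter, and so on) is hypothetical and stands in for whatever table a real dynamic linker or DLL loader maintains; the point is only that interposition becomes a pointer swap when callers always go through the table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical interface table: callers never call the function directly. */
struct api_table {
    size_t (*length)(const char *s);
};

size_t default_length(const char *s) { return strlen(s); }

struct api_table api = { default_length };

/* State owned by the interposer. */
size_t (*saved_length)(const char *s);
unsigned long length_calls;

/* Interposed entry: count the call, then forward to the saved original. */
size_t counting_length(const char *s)
{
    length_calls++;
    return saved_length(s);
}

/* Interposition is a single table update; no call site is edited. */
void install_counter(void)
{
    assert(api.length != counting_length);  /* installing twice would recurse */
    saved_length = api.length;
    api.length = counting_length;
}
```

After install_counter(), a call such as api.length("hello") still returns 5 but is counted on the way through, and the caller's code is untouched.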

Andy Glew

Jan 18, 1999, 3:00:00 AM

to Alan Cox

Great!

Like I said, all of this started out when the CONDOR developers told me
that they would be able to give users like me totally transparent checkpointing
and process migration on NT, whereas on UNIX (Solaris, LINUX, etc.)
you have to compile and link your code specially.

Maybe the problem is just that batch queue developers like Werner,
or the CONDOR folk, have been so hot on having solutions portable
across all UNIXes that they have not been willing to make LINUX specific
modifications.

I look forward to using stuff like this in future versions of GNU QUEUE
and CONDOR: transparent checkpointing, process migration, and being
able to take an already running job, ex post facto, and migrate it
somewhere else.

Alan Cox wrote:

> > I.e. you do not have to know, a priori, what functions you want to intercept
> > - as you would have to if placing something on the path.
> > You can intercept them all.
>

> Learn about libdl and ld.so ..

>
> > Moreover, you can intercept a call to foo(), have it call your_foo(),
> > and then allow it to continue on to call the default foo().
>

> Learn about libdl and ld.so - it's all in there, it's all been in there
> for a very long time
>
> Alan

--
---
Andy "Krazy" Glew, gl...@cs.wisc.edu, UW Madison and Intel.
DISCLAIMER: private posting, not representative of university or Intel.
Please respond by email in addition to replying to newsgroup.

-

Werner Krebs

Jan 18, 1999, 3:00:00 AM

to Andy Glew

Andy Glew wrote:
>
> By the way, although I do not know how to do this "API interposition"
> as flexibly on LINUX as on NT (I hope that Alan Cox will show me how),
> I would probably agree with someone who said that this is not an appropriate
> topic for the kernel group.

I think he is suggesting modifying LD_LIBRARY_PATH and placing a
dynamically linked .so library that would contain the intercepted calls
in a directory at the front of LD_LIBRARY_PATH.

I don't know if this would require recompiling all of glibc with the
intercepted calls and then using that as a substitute. If that's not the
case, then this could essentially be done with ANY of the commercial
UNIXes that support LD_LIBRARY_PATH and its variants as well, including
SunOS, IRIX, etc. Which raises the question: why hasn't it been done
by the Condor group? (Why are they trying to _edit_ binaries to get this
to work?)

The advantage of GNU/Linux in this sense is that it uses glibc, so the
source is available and it's much easier to create an intercept library
like this, but it wouldn't be impossible for the commercial Unixes.

If it's not an appropriate topic for the kernel group, please accept my
apologies for starting the thread. It's sometimes hard to figure out
whether something is best done in libc or in the kernel. We are talking
about intercepting kernel calls and it would be done in the kernel under
NT (which does just about everything in the kernel), so I hope I wasn't
totally off track. Flames to /dev/null, please (and certainly not to
Andy in any event.)

-

Andy Glew

Jan 18, 1999, 3:00:00 AM

to Werner Krebs, Andy Glew

> Which raises the question of why hasn't it been done
> by the Condor group? (Why are they trying to _edit_ binaries to get this
> to work.)

(1) It may be unfair to say that the Condor group is doing the binary
editing. I saw a student class project presentation that was doing so.

(2) Binary editing allows you to intercept (a) statically linked binaries
(as long as you have symbol table information) and (b) already running
programs. So, for example, you could start up a SAS or SPSS job
interactively, realize that it is going to take too long, go away for
lunch, realize that it still hasn't finished, and then say "install the
system call interception and checkpointing library, and go and migrate
yourself off onto the network somewhere".

> The advantage of GNU/Linux in this sense is that it uses glibc, so the
> source is available and it's much easier to create an intercept library
> like this, but it wouldn't be impossible for the commercial Unixes.

Source code certainly helps. However, not having source code forces
you to make modularity decisions that are sometimes better.

> If it's not an appropriate topic for the kernel group, please accept my
> apologies for starting the thread. It's sometimes hard to figure out
> whether something is best done in libc or in the kernel. We are talking
> about intercepting kernel calls and it would be done in the kernel under
> NT (which does just about everything in the kernel)

Actually, on Windows this sort of thing is not done in the kernel,
but in a totally different process - essentially in a debugger process,
which dynamically installs code in the user process to accomplish it.

The question is not "whether something is best done in libc or in the kernel".
The question is whether it is better done in still a different module.

> Flames to /dev/null, please (and certainly not to Andy in any event.)

Sorry for the splurge on this mailing list.

Michael Elizabeth Chastain

Jan 18, 1999, 3:00:00 AM

to werner...@yale.edu

Hi Werner,

> Andy Glew writes about the need for an NT-like (gasp!) API in
> GNU/Linux that would allow trapping of OS system calls without
> recompiling code.

First I am going to chide you for mindlessly writing "GNU/Linux"
when you are talking about the Linux Kernel, not a whole system.
Consider yourself chided.

The API exists and it's named "ptrace". Works great, too.

I have a trace-and-replay program based on ptrace. The tracer is similar
to strace.

The replayer is the cool part. It takes control whenever the target
process executes a system call, annuls the original system call, and
overwrites the target process registers and address space with the values
that I want to be in there.
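
As a rough illustration of the mechanism (not the replayer itself), here is a minimal sketch of a ptrace loop that regains control at every system call stop of a child and simply counts them. The function name count_syscall_stops is hypothetical, and the sketch assumes a reasonably modern Linux: PTRACE_O_TRACESYSGOOD is a later addition than the kernels discussed in this thread. Each system call normally produces two stops, one on entry and one on exit.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] under ptrace; return the number of syscall stops, or -1. */
long count_syscall_stops(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);  /* ask to be traced */
        execvp(argv[0], argv);                  /* stops with SIGTRAP on success */
        _exit(127);
    }

    int status;
    if (waitpid(child, &status, 0) < 0)
        return -1;
    if (!WIFSTOPPED(status))
        return 0;   /* exec failed or tracing was refused */

    /* Mark syscall stops as SIGTRAP|0x80 so they differ from real signals. */
    ptrace(PTRACE_SETOPTIONS, child, NULL, (void *)(long)PTRACE_O_TRACESYSGOOD);

    long stops = 0;
    long sig = 0;   /* 0 on the first pass swallows the exec SIGTRAP */
    for (;;) {
        if (ptrace(PTRACE_SYSCALL, child, NULL, (void *)sig) < 0)
            return -1;
        if (waitpid(child, &status, 0) < 0)
            return -1;
        if (!WIFSTOPPED(status))
            break;                       /* child exited or was killed */
        if (WSTOPSIG(status) == (SIGTRAP | 0x80)) {
            stops++;                     /* syscall entry or exit */
            sig = 0;
        } else {
            sig = WSTOPSIG(status);      /* forward any real signal */
        }
    }
    return stops;
}
```

A tracer that wanted to annul or rewrite a call would inspect and modify the child's registers at the stop where this sketch only increments a counter.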

I also run gdb (or any other debugger) as a client program and filter
all its calls to ptrace. Effectively, the replayer is a proxy server
for ptrace. It shows gdb a picture of the target process replaying
its execution.

All this runs in user space with a stock Linux kernel, stock target
binaries, and stock gdb.

It's been running like this for three years. I released the source code
under GPL in November 1995. As far as I know, three people in the entire
world have ever run it, counting me.

One of the other two guys put up a mud server and traced it. He sent me
the trace file, and I ran gdb on it. I re-executed his program, I
set breakpoints anywhere I wanted, I inspected data at any breakpoint.
Hmmm, there's a structure that looks funny, I'll just restart and set an
earlier breakpoint.

During those three years of no interest, the linux kernel interface has
shifted again and again. The replayer needs a table of every system call
and how it affects memory, and that table needs more entries every week
(thanks to ioctl). So I have a great demo, if you have 1.3.42 kernel
headers to compile it against.

ftp://ftp.shout.net/pub/users/mec/misc/mec-0.3.tar.gz

There's more.

If I put memory-access rule checking in at replay time, I can do better
than e-fence, on stock binaries with no recompilation. Hell, I can do
better than *Purify* on *stock binaries* and without tangling with their
object-code-insertion patents.

I have enough information available in the proxy ptrace filter to
implement PTRACE_SINGLESTEP_BACKWARDS. How would you like to have that
capability in gdb? "Execute backwards until this data watchpoint
changes." Imagine a graphical debugger with a scrollbar for time,
where the top is "beginning of execution" and the bottom is "end of
execution."

And remember, you are doing all this on a trace file that the user of
your program sent in from the field without changing a *damn thing*
on their system, except for running the trace wrapper program. They
don't even need symbols on their executable, as long as you have an
identical executable that does have symbols.

Your customer's Apache tips over every two weeks under heavy load?
Tell them to run it under the tracer and send you a trace file the next
time it tips over.

You need to debug your real-time embedded program? Trace it, run it in
real time, then take the trace file back to your high-powered workstation.

This is radical paradigm-shifting technology. It's the best program I
ever wrote. It's probably the best program I ever *will* write in my
entire life.

The entire reason I got involved in linux development was to reach a
point where I could talk about this technology and get more than two
people to download the damn demo and try it out. To get to a place
where the gdb maintainers at cygnus would respond to my letters.

It hurts to talk about this. It brings tears to my eyes.

I suppose it's off-topic, too, because it is a user space program.
No kernel hooks needed.

Time to get back to xconfig bugs.

Michael Elizabeth Chastain
<mailto:m...@shout.net>
"love without fear"

Andy Glew

Jan 18, 1999, 3:00:00 AM

to Werner Krebs, Andy Glew, linux-...@vger.rutgers.edu

> (2) Binary editing allows you to intercept (a) statically linked binaries
> (as long as you have symbol table information)

BTW, binary editing allows you to intercept system calls for statically
linked programs even if you don't have the symbol tables: use code analysis
to actually find the instructions used to enter the kernel (trap, etc.), and
intercept those, since the kernel API is pretty well known.
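
As a deliberately naive sketch of that idea on i386, where the kernel is entered with `int $0x80` (bytes 0xCD 0x80), the function below scans a code buffer for that byte pair. The name find_int80 is hypothetical, and a plain byte scan is only an approximation: it can match data or the middle of another instruction, which is why real tools disassemble the code first.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Record up to max offsets of the byte pair CD 80 ("int $0x80") in code.
 * Returns the total number of matches, which may exceed max.
 */
size_t find_int80(const unsigned char *code, size_t len,
                  size_t *offsets, size_t max)
{
    size_t found = 0;

    assert(code != NULL || len == 0);
    for (size_t i = 0; i + 1 < len; i++) {
        if (code[i] == 0xCD && code[i + 1] == 0x80) {
            if (found < max)
                offsets[found] = i;   /* candidate site to patch */
            found++;
        }
    }
    return found;
}
```

Each recorded offset is a candidate trap site that a binary editor could redirect to its own stub.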

Ulrich Drepper

Jan 18, 1999, 3:00:00 AM

to Alan Cox

al...@lxorguk.ukuu.org.uk (Alan Cox) writes:

> You can already do it in Linux. LD_PRELOAD and LD_LIBRARY_PATH allow you
> to control and modify linkloading order.

You should not propagate the use of LD_PRELOAD so much. In the future,
every use of this envvar will add quite a severe penalty to the
execution time. Or better said: if a program is started with
LD_PRELOAD it cannot take advantage of the optimizations which will be
implemented sometime soon.

--
---------------. drepper at gnu.org ,-. 1325 Chesapeake Terrace
Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA
Cygnus Solutions `--' drepper at cygnus.com `------------------------

Werner Krebs

Jan 18, 1999, 3:00:00 AM

to Michael Elizabeth Chastain

Michael Elizabeth Chastain wrote:
>
> Hi Werner,
>
> > Andy Glew writes about the need for an NT-like (gasp!) API in
> > GNU/Linux that would allow trapping of OS system calls without
> > recompiling code.
>
> First I am going to chide you for mindlessly writing "GNU/Linux"
> when you are talking about the Linux Kernel, not a whole system.
> Consider yourself chided.

Except that I've also gotten email for saying "Linux" kernel when "it
was technically correct but nevertheless would not have hurt" to say
"GNU/Linux." Believe me, folks, I appreciate the difference and I
appreciate the politics, so let's avoid this topic as it is a sore spot.

> The API exists and it's named "ptrace". Works great, too.

Yes, but we're talking about checkpoint migration of CPU and I/O
intensive jobs to boxes that have free CPU resources. ptrace() will slow
things down significantly. At least, that was my experience on the Cray
Y-MP running Cray UNICOS, where the delay from ptrace() was VERY
noticeable. With GNU/Linux, err, Linux kernel, err GNU/Linux, it is
significantly less noticeable but still there, I'll bet.

So, help us add checkpoint migration facilities to GNU Queue. You can
throw in your ptrace facility for free.

> I suppose it's off-topic, too, because it is a user space program.
> No kernel hooks needed.
>
> Time to get back to xconfig bugs.
>
> Michael Elizabeth Chastain
> <mailto:m...@shout.net>
> "love without fear"

-

Michael Elizabeth Chastain

Jan 19, 1999, 3:00:00 AM

to werner...@yale.edu

Hi Werner,

No problem, I don't really want to go into GNU/Linux versus Linux
kernel either.

> Yes, but we're talking about checkpoint migration of CPU and I/O
> intensive jobs to boxes that have free CPU resources.

Yes, but ... you asked for an API. That is the documented API for
filtering system calls. You can do 'proxy filtering' for most
system calls, and I have a lot of details on the ones that cause
problems, like "exit" and "mmap".

If it's not fast enough, you can consider speeding up the implementation.
You can also consider interface enhancements. The documented interface
to ptrace is very old and very slow (one word per system call).

Also remember that any system call filtering, including ptrace,
has zero cost for "CPU intensive" user-space work. If you are compiling
or ray-tracing or cracking codes, you do a lot of work per system call
issued. If you want your 'cp -a' command to migrate around the cluster
it's going to slow down.

Anyways, I am not here to suggest that you checkpoint any certain way,
I am just here to throw an answer to your question so you can add the
answer into your knowledge base.

Other good things to know: you can access the child's memory fast by
reading /proc/$pid/mem. I don't know if you can write it reliably
(Andi Kleen wrote the code into 2.2 but you will be the first soldier
on the beach if you use it).
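
A minimal sketch of the read side, assuming a Linux /proc and a 64-bit off_t: the helper below (hypothetical name read_process_memory) opens /proc/<pid>/mem and reads a range at the given virtual address with pread(). For another process the kernel will normally require that the caller is tracing it; reading one's own memory this way needs no tracing.

```c
#define _XOPEN_SOURCE 700    /* for pread() under strict -std modes */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Read len bytes at virtual address addr of process pid; -1 on error. */
ssize_t read_process_memory(pid_t pid, const void *addr, void *buf, size_t len)
{
    char path[64];

    assert(buf != NULL);
    snprintf(path, sizeof path, "/proc/%ld/mem", (long)pid);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* The file offset is the target's virtual address. */
    ssize_t n = pread(fd, buf, len, (off_t)(uintptr_t)addr);
    close(fd);
    return n;
}
```

One read here moves an arbitrary number of bytes, where the classic peek interface moves a single word per ptrace() call.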

Michael

Raul Miller

Jan 19, 1999, 3:00:00 AM

to dre...@cygnus.com

Ulrich Drepper <dre...@cygnus.com> wrote:
> You should not propagate the use of LD_PRELOAD so much. In the future,
> every use of this envvar will add quite a severe penalty to the
> execution time. Or better said: if a program is started with
> LD_PRELOAD it cannot take advantage of the optimizations which will be
> implemented sometime soon.

What should be used to achieve this, then?

--
Raul

Michael Meissner

Jan 19, 1999, 3:00:00 AM

to Andy Glew

On Mon, Jan 18, 1999 at 04:57:41PM -0600, Andy Glew wrote:
> > (2) Binary editing allows you to intercept (a) statically linked binaries
> > (as long as you have symbol table information)
>
> BTW, binary editing allows you to intercept system calls for statically
> linked programs even if you don't have the symbol tables: use code analysis
> to actually find the instructions used to enter the kernel (trap, etc.), and
> intercept those, since the kernel API is pretty well known.

And if you are not careful (and/or too successful), you then get sued by
Rational/Purify for patent infringement (before it got gobbled up, Pure was
active in trying to enforce this patent, dunno if things are now changed with
Rational at the helm).

--
Michael Meissner, Cygnus Solutions (Massachusetts office)
4th floor, 955 Massachusetts Avenue, Cambridge, MA 02139, USA
meis...@cygnus.com, 617-354-5416 (office), 617-354-7161 (fax)

Michael Elizabeth Chastain

Jan 19, 1999, 3:00:00 AM

to al...@lxorguk.ukuu.org.uk

Hi Alan,

> Thats cos a million of us never knew it existed. I'd practically kill for
> that stuff ( Not quite , imagine a trail of slightly bruised people in my
> wake). Its value to authors could be huge.

I don't blame anyone. It's a world of information glut, you can't
download every demo that some unknown person brings to linux-kernel.
Plus you never know: can this unknown person actually deliver (10%
chance) or are they another town council candidate (90% chance).

Fresh Meat didn't exist back then. When I resume this project, I'll put
demos up on Fresh Meat, and that will be a good channel for people to
find out about it.

One reason I've been sitting on it is that the technology works great
in a closed-source world. It works even better if the vendor can tweak
the OS. For example, I have to jump through hoops because I can't set
ORIG_EAX through ptrace. Sun or SCO could drop this into their
operating system in one product cycle.

Questions for you: (1) Do you think this would be an interesting
presentation at the Linux Bazaar? (2) Do you think it's so interesting
that I could ask them to cover part of my travel expenses?

> This is something I've been pondering - strace has the same problems -
> it does suggest there should be a single good syscall/ioctl definition
> somewhere

I worked on the strace ioctl stuff for a while. That's when I wrote
Documentation/ioctl-number.txt.

My syscall table is pretty stable. Linus does not add new syscalls
very often, and he rarely changes existing syscalls because of binary
compatibility. He changed adjtimex back in 1.3.Something.

The ioctl's kill me. Dozens of people write drivers and many of them
write little private ioctl's. I have a fairly accurate and extensive
table that's current up to Linux 2.1.117.

Also I have no support right now if the target program mmap's a device.
I can probably do something about that if I have some kernel support.
I need stuff like PTRACE_MMAP, PTRACE_MUNMAP, and PTRACE_MPROTECT so
that I can play with the target's memory map. I only need to play with
it while the target is paused for tracing.

It's been challenging to figure out ways to get what I need without any
kernel support. In some ways it's been a discipline that made me do
the right thing. But I'm ready to start writing kernel patches in 2.3.XX
for the purpose of making my trace-and-replay easier. Things like:

- Make execve() set the registers to a defined value, rather than
  inheriting state from before the execve(). I know someone else
  wants this for his own reasons.

- Allow PTRACE_POKEUSR to set ORIG_EAX to -1 to abort a child's
  system call. This requires a change to a slow-path function in entry.S.

- Figure out some way to drain the ioctl swamp. I won't mind if the
  existing thousand ioctls get frozen if there is some way to handle
  new ones, such as an 'nioctl' call that takes the length as the
  fourth parameter.

- Minor changes to include/linux/*.h because I need to include those
  files -- glibc will never be complete enough for me. I am
  happy with the modern idea that I should snapshot include/linux/*.h
  and include the snapshot with my source. Right now I have about
  ten workarounds with #define games to use the 2.1.117 include files.
  This is actually a low number, because I include *every* file that
  provides part of the user-kernel ABI, and my program is written in
  C++! Kernel work in this direction would probably benefit glibc
  developers.

Michael

Tim Smith

Jan 19, 1999, 3:00:00 AM

to linux-...@vger.rutgers.edu

On Mon, 18 Jan 1999, Alan Cox wrote:
> You can already do it in Linux. LD_PRELOAD and LD_LIBRARY_PATH allow you to
> control and modify linkloading order. If you get a static linked commercial
> binary just ask them for the object modules under the LGPL terms the C
> library has to rebuild it (or just for a dynamic binary)

This presumes that the statically linked commercial binary was built using
an LGPL'ed library. That may almost always be true now, but will it be
after Codewarrior for Linux ships?

--Tim Smith

Alexander Kjeldaas

Jan 19, 1999, 3:00:00 AM

to Alan Cox, Michael Elizabeth Chastain

On Tue, Jan 19, 1999 at 12:19:21AM +0000, Alan Cox wrote:
>
> > shifted again and again. The replayer needs a table of every system call
> > and how it affects memory, and that table needs more entries every week
> > (thanks to ioctl). So I have a great demo, if you have 1.3.42 kernel
> > headers to compile it against.
>

> This is something I've been pondering - strace has the same problems - it
> does suggest there should be a single good syscall/ioctl definition somewhere
>

And from time to time, security issues regarding ioctls that are not
checking for privileges when they should are found. A central
ioctl-directory is a good place to have privilege information too.

astor

--
Alexander Kjeldaas, Guardian Networks AS, Trondheim, Norway
http://www.guardian.no/

Michael Elizabeth Chastain

Jan 19, 1999, 3:00:00 AM

to al...@lxorguk.ukuu.org.uk, as...@guardian.no

Hi Alexander,

> And from time to time, security issues regarding ioctls that are not
> checking for privileges when they should are found. A central
> ioctl-directory is a good place to have privilege information too.

This is much less frequent in 2.1 with the current uaccess.h
implementations of copy_from_user, copy_to_user, get_user, put_user.
At least on i386, these facilities cannot stomp on kernel memory no
matter *what* values the user specifies.

There were some problems around 2.1.77 or so with buggy sound driver
code that had lots of __get_user and __put_user.

In fact, I would advise someone who wants to do a security check of the
kernel (either a Good Guy or a Bad Guy) to make a list of the unchecked
functions in uaccess.h, grep the entire kernel source for these functions,
and validate all of the use cases. They should all have explicit
constraint checks.

Michael

Miquel van Smoorenburg

Jan 19, 1999, 3:00:00 AM

to linux-...@vger.rutgers.edu

In article <cistron.m10...@the-village.bc.nu>,
Alan Cox <al...@lxorguk.ukuu.org.uk> wrote:
>The libdl stuff itself is pretty portable, although your additional
>hooks do need to match the routines you are replacing yourself in structure
>size and the like.
>
>Things like the realvideo fixing program is a good example of stealing
>the "open" syscall and altering its behaviour to work around a bug in RV

"ltrace" is probably an even better example:

$ ltrace ls
atexit(0x400064e0) = 0
__libc_init_first(1, 0xbffffd83, 0, 0xbffffd86, 0xbffffd9e) = 0x400064e0
atexit(0x0804d440) = 0
setlocale(6, "") = "C"
bindtextdomain("fileutils", "/usr/share/locale") = "/usr/share/locale"
textdomain("fileutils") = "fileutils"
time(NULL) = 916736275
isatty(1) = 0
getenv("POSIXLY_CORRECT") = NULL
getenv("COLUMNS") = NULL
ioctl(1, 21523, 0xbffffc48, 0x4000b3f0, 0xbffffca4) = -1
getenv("POSIXLY_CORRECT") = NULL
getenv("TABSIZE") = NULL
getopt_long(1, 0xbffffca4, "abcdfgiklmnopqrstuw:xABCDFGI:LNQ"..., 0x0804d4ec, NULL) = -1
malloc(10800) = 0x0804ff68
malloc(12) = 0x080529a0

ltrace is a library tracer like strace is a system call tracer ..
it's pretty cool.

ftp://ftp.debian.org/debian/dists/unstable/main/source/utils/ltrace_*

Mike.
--
Indifference will certainly be the downfall of mankind, but who cares?

Simon Kenyon

Jan 19, 1999, 3:00:00 AM

to Michael Elizabeth Chastain

On 18-Jan-99 Michael Elizabeth Chastain wrote:
> I have a trace-and-replay program based on ptrace. The tracer is similar
> to strace.

i too would have been using it all day, every day
had i known that it exists

what needs to happen to get it to work on 2.2.x?
--
simon

Henning P. Schmiedehausen

Jan 19, 1999, 3:00:00 AM

to linux-...@vger.rutgers.edu

t...@tzs.net (Tim Smith) writes:

>On Mon, 18 Jan 1999, Alan Cox wrote:
>This presumes that the statically linked commercial binary was built using
>an LGPL'ed library. That may almost always be true now, but will it be
>after Codewarrior for Linux ships?

Considering that Be just dropped CodeWarrior in favour of egcs, I don't
think that the impact of it for Linux will be huge.

(Be stated that the code egcs generates is better than CodeWarrior's.)

Kind regards
Henning
--
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen -- h...@tanstaafl.de
TANSTAAFL! Consulting - Unix, Internet, Security

Hutweide 15 Fon.: 09131 / 50654-0 "There ain't no such
D-91054 Buckenhof Fax.: 09131 / 50654-20 thing as a free Linux"

Rogier Wolff

Jan 19, 1999, 3:00:00 AM1/19/99

to Alan Cox

Alan Cox wrote:
> > shifted again and again. The replayer needs a table of every system call
> > and how it affects memory, and that table needs more entries every week
> > (thanks to ioctl). So I have a great demo, if you have 1.3.42 kernel
> > headers to compile it against.
>
> This is something I've been pondering - strace has the same problems - it
> does suggest there should be a single good syscall/ioctl definition somewhere

Strace should take a config file that tells it the names of the
system calls, the number of arguments, and how to print them (and when).

I tried implementing that 8-5 years ago, but got bored/distracted
before it was finished.

Roger.

--
** R.E....@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* Never blow in a cat's ear because if you do, usually after three or *
* four times, they will bite your lips! And they don't let go for at *
* least a minute. -- Lisa Coburn, age 9

Michael Elizabeth Chastain

Jan 19, 1999, 3:00:00 AM1/19/99

to si...@koala.ie

Hi Simon,

> what needs to happen to get it to work on 2.2.x?

I need to write some more code to handle ELF execve(). ELF didn't
exist when I started this project.

It doesn't handle signals at all, although I have a good idea how
to do this. It also doesn't handle shared writeable memory segments
or mmap'ed devices; that is inherently hard. It handles mmap'ed
files just fine as long as the file doesn't change after the
target mmap's it.

I have a list of every system call and every ioctl in the system.
The system call list is fairly stable. Linus does not add new system
calls often, and he almost never changes the binary interface of an
existing system call (I think this has happened once or twice in four
years; it happened to adjtimex several years ago). The ioctl list is a
huge time sink because the ioctl system call does not have a parameter
that describes the size of the memory region that is being smashed.

It will speed up replay a lot if writing to /proc/$pid/mem works
reliably. Andi Kleen already did this but I haven't tested it.

I have a list of "it would be nice if ..." in task-kernel-lix.txt.
Some of these are features that other people have asked for too,
so they would be good kernel patches for the beginning of 2.3.XX:

ptrace ability to set ORIG_EAX to veto system calls
clear process registers on execve
some include/linux/*.h cleanup for C++ compatibility

Right now we are in the over-my-dead-body stage of 2.2.0 development,
though.

Michael Elizabeth Chastain
<mailto:m...@shout.net>
"love without fear"


Simon Kenyon

Jan 19, 1999, 3:00:00 AM1/19/99

to Michael Elizabeth Chastain

On 19-Jan-99 Michael Elizabeth Chastain wrote:
> Right now we are in the over-my-dead-body stage of 2.2.0 development,
> though.

i think i realised that this was a 2.3.x thing
i also think i realised that it was not ready now even if (by some fluke) Linus
were to say ok in 2.2.x
just giving you all the encouragement to push for whatever changes are required
when the time comes
--
simon

Andy Glew

Jan 19, 1999, 3:00:00 AM1/19/99

to Werner Krebs, Andy Glew, linux-...@vger.rutgers.edu

Urgghhh.... So maybe I have gone full circle.
If you want to do transparent migration and checkpointing
of statically linked binaries, then you need
to intercept system calls.

An "interposition device driver" is a fairly nice and easy
thing to do here - e.g. one that invokes arbitrary added
code when a system call is done. This code might run
in the kernel (gack); in a separate process, where it really
amounts to "debug breakpoint on trap" and will be
too slow for many applications, but fast enough for some
(such as the simulations I run); or perhaps in code/data
memory added to the original program as transparently
as possible.

This is a special case of intercepting an arbitrary API.
It is not as general. But it is more universally reliable
than the binary editing approach. It will work with
just about any binary, including statically linked binaries,
and binaries that do weird things like dynamically writing
code to invoke system calls - e.g. it might allow transparent
checkpointing and migration of things like a Java VM with
JIT compilation. Such completeness is good.

I apologize to Werner, for having said that such interposition
should not be in the kernel. My excuse is that I keep looking
for that "single point of leverage". In this case, it appears that
there are three main points of leverage: (1) a kernel syscall
interposition layer, (2) jump table editing, and (3) binary editing.
All have their respective advantages and disadvantages. None
is uniformly more powerful than the others.

So, back to where we started: is there a kernel syscall interposition
driver for LINUX? Another respondent implied that there might be.

MOLNAR Ingo

Jan 19, 1999, 3:00:00 AM1/19/99

to Andy Glew

On Tue, 19 Jan 1999, Andy Glew wrote:

> So, back to where we started: is there a kernel syscall interposition
> driver for LINUX? Another respondent implied that there might be.

Yes, it's called ptrace(); debugging tools like gdb and strace use it
extensively. There is another solution, used by the iBCS2
kernel module: a separate system call entry. So you can 'proxy' a system
call in an arbitrary way.
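
To make the ptrace() route concrete, here is a minimal sketch of the tracing
loop that strace-like tools are built around. It only counts the syscall stops
of a child program; a real proxy or checkpointer would read and rewrite
registers at each stop. The function name count_syscall_stops is invented for
this example; the ptrace requests and the PTRACE_O_TRACESYSGOOD option are the
standard Linux ones.

```c
#include <signal.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Run argv under ptrace and count its syscall stops. Each system call
 * normally produces two stops (entry and exit). Returns -1 on error.
 */
long count_syscall_stops(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: ask to be traced, then exec; the exec stops with SIGTRAP. */
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execvp(argv[0], argv);
        _exit(127);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFSTOPPED(status))
        return -1;

    /* Report syscall stops as SIGTRAP|0x80 so they differ from real signals. */
    ptrace(PTRACE_SETOPTIONS, pid, NULL, (void *)(long)PTRACE_O_TRACESYSGOOD);

    long stops = 0;
    int deliver = 0;    /* signal to pass on when resuming; 0 means none */
    for (;;) {
        if (ptrace(PTRACE_SYSCALL, pid, NULL, (void *)(long)deliver) < 0)
            return -1;
        if (waitpid(pid, &status, 0) < 0)
            return -1;
        if (!WIFSTOPPED(status))
            break;                       /* tracee exited or was killed */
        if (WSTOPSIG(status) == (SIGTRAP | 0x80)) {
            stops++;                     /* a proxy would inspect registers here */
            deliver = 0;
        } else {
            deliver = WSTOPSIG(status);  /* pass ordinary signals through */
        }
    }
    return stops;
}
```

Because the tracer is a separate process that is woken at every stop, this is
the "too slow for many applications, but fast enough for some" variant
discussed earlier in the thread.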

-- mingo

Alexander Kjeldaas

Jan 20, 1999, 3:00:00 AM1/20/99

to Michael Elizabeth Chastain, al...@lxorguk.ukuu.org.uk, as...@guardian.no

On Tue, Jan 19, 1999 at 02:55:47AM -0600, Michael Elizabeth Chastain wrote:
> Hi Alexander,
>
> > And from time to time, security issues regarding ioctls that are not
> > checking for privileges when they should are found. A central
> > ioctl-directory is a good place to have privilege information too.
>
> This is much less frequent in 2.1 with the current uaccess.h
> implementations of copy_from_user, copy_to_user, get_user, put_user.
> At least on i386, these facilities cannot stomp on kernel memory no
> matter *what* values the user specifies.
>

This isn't the problem. The problem is ioctl calls which should have
a capable(SOMETHING) check, but don't. Errors like that are probably
easier to spot if they are specified centrally.

astor

--
Alexander Kjeldaas, Guardian Networks AS, Trondheim, Norway
http://www.guardian.no/


Michael Elizabeth Chastain

Jan 20, 1999, 3:00:00 AM1/20/99

to as...@guardian.no

astor> This isn't the problem. The problem is ioctl calls which should have
astor> a capable(SOMETHING) check, but don't. Errors like that are probably
astor> easier to spot if they are specified centrally.

I am thinking about a registration facility

system_call_register( ... )
system_call_unregister( ... )
/proc/system_call_list

ioctl_register( ... )
ioctl_unregister( .... )
/proc/ioctl_list

Then strace and mec-trace could read the lists via the /proc files
and would not need their own tables that go out of date all the time.
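
A sketch of the consumer side, to show what a tracer would gain. Note that
/proc/ioctl_list does not exist; the one-entry-per-line format assumed here
(hex number, name, argument size in bytes) and the names struct ioctl_desc and
parse_ioctl_line are invented purely for illustration.

```c
#include <stdio.h>

/* One row of the hypothetical /proc/ioctl_list: number, name, argument size. */
struct ioctl_desc {
    unsigned long cmd;
    char name[32];
    unsigned long arg_size;
};

/*
 * Parse one line of the assumed format, e.g. "0x5401 TCGETS 36".
 * Returns 0 on success and -1 if the line does not match.
 */
int parse_ioctl_line(const char *line, struct ioctl_desc *out)
{
    /* %31s leaves room for the terminating NUL in name[32]. */
    if (sscanf(line, "%lx %31s %lu", &out->cmd, out->name, &out->arg_size) != 3)
        return -1;
    return 0;
}
```

A tracer would call this once per line at startup and then look ioctl numbers
up in the resulting table instead of in a compiled-in one.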

Michael

Michael Elizabeth Chastain

Jan 20, 1999, 3:00:00 AM1/20/99

to linux-...@vger.rutgers.edu

Andy Glew writes ...

> So, back to where we started: is there a kernel syscall interposition
> driver for LINUX? Another respondent implied that there might be.

Come here, I have something to show you. :)

Steven Roberts

Jan 21, 1999, 3:00:00 AM1/21/99

to Alexander Kjeldaas

Alexander Kjeldaas wrote:
>
> On Tue, Jan 19, 1999 at 02:55:47AM -0600, Michael Elizabeth Chastain wrote:
> > Hi Alexander,
> >
> > > And from time to time, security issues regarding ioctls that are not
> > > checking for privileges when they should are found. A central
> > > ioctl-directory is a good place to have privilege information too.
> >
> > This is much less frequent in 2.1 with the current uaccess.h
> > implementations of copy_from_user, copy_to_user, get_user, put_user.
> > At least on i386, these facilities cannot stomp on kernel memory no
> > matter *what* values the user specifies.
> >
>
> This isn't the problem. The problem is ioctl calls which should have
> a capable(SOMETHING) check, but don't. Errors like that are probably
> easier to spot if they are specified centrally.
>
> astor

Speaking of ioctls, are there any good places to find documentation of
all of them? Sometimes I have found it in the driver source, sometimes in
man pages, sometimes in the doc directory of the general tree, etc.

Just wondering if there is any available documentation about them in a
standard location.

Steve

Oliver Xymoron

Jan 21, 1999, 3:00:00 AM1/21/99

to Michael Elizabeth Chastain

On Thu, 21 Jan 1999, Michael Elizabeth Chastain wrote:

> > If these functions register callbacks, we can do away with a fair amount
> > of crufty switch code as well.
>
> Oh yeah. The networking code has a lot of this, where dispatch
> functions have switches with 40 codes in them just to pass them down to
> the next layer. Probably none of this is fast-path code so it's more
> an esthetic thing.

Exactly. The trick is with majors like 0 where minors are different
devices. Dispatching the callbacks becomes a bit uglier (a mask for the
minor field?). For 2.3 we might have enough major numbers to make this a
non-issue, though.

Turning code like switch blocks into data is generally a big win (except
in performance critical cases like reading an IP header). It becomes much
easier for the maintainer to make structural changes and global
optimizations and the code becomes shorter, simpler, and more readable.
And possibly faster.

--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."

Michael Elizabeth Chastain

Jan 21, 1999, 3:00:00 AM1/21/99

to as...@guardian.no, stro...@ata-sd.com

There's an overview list in Documentation/ioctl-number.txt. That's
about the best you'll get.

Michael

Oliver Xymoron

Jan 21, 1999, 3:00:00 AM1/21/99

to Michael Elizabeth Chastain

On Wed, 20 Jan 1999, Michael Elizabeth Chastain wrote:

> I am thinking about a registration facility
>

> ioctl_register( ... )
> ioctl_unregister( .... )
> /proc/ioctl_list

If these functions register callbacks, we can do away with a fair amount
of crufty switch code as well. I'd recommend that at least the ioctl
register functions (if not the syscall version) take an array of arguments
so that you can register multiple ioctls at once. Something like:

static const ioctl_reg my_ioctls[...] = {
        ....
};

void init_module()
{
        ...
        ioctl_register(my_ioctls...);
        ...
}

void cleanup_module()
{
        ...
        ioctl_unregister(my_ioctls...);
        ...
}
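
For illustration only, the outline above could be fleshed out in user space
roughly as follows. Nothing here is an existing kernel interface: the layout
of struct ioctl_reg, the fixed-size registry, ioctl_dispatch() and the two demo
handlers are all assumptions made for this sketch.

```c
#include <stddef.h>

/* Hypothetical registration record: one entry per ioctl a driver handles. */
typedef int (*ioctl_handler)(void *arg);

struct ioctl_reg {
    unsigned int cmd;        /* ioctl number */
    const char *name;        /* for a listing such as /proc/ioctl_list */
    size_t arg_size;         /* bytes touched through the argument pointer */
    ioctl_handler handler;   /* callback that replaces one switch case */
};

#define MAX_IOCTLS 64
static const struct ioctl_reg *registry[MAX_IOCTLS];
static size_t registry_len;

/* Register n entries at once; returns 0, or -1 if the table is full. */
int ioctl_register(const struct ioctl_reg *regs, size_t n)
{
    if (n > MAX_IOCTLS - registry_len)
        return -1;
    for (size_t i = 0; i < n; i++)
        registry[registry_len++] = &regs[i];
    return 0;
}

/* Nonzero if p is one of the n entries of regs. */
static int belongs_to(const struct ioctl_reg *p,
                      const struct ioctl_reg *regs, size_t n)
{
    for (size_t j = 0; j < n; j++)
        if (p == &regs[j])
            return 1;
    return 0;
}

/* Drop every registered entry that belongs to regs. */
void ioctl_unregister(const struct ioctl_reg *regs, size_t n)
{
    size_t kept = 0;
    for (size_t i = 0; i < registry_len; i++)
        if (!belongs_to(registry[i], regs, n))
            registry[kept++] = registry[i];
    registry_len = kept;
}

/* Table lookup instead of a switch; -1 stands in for an error code. */
int ioctl_dispatch(unsigned int cmd, void *arg)
{
    for (size_t i = 0; i < registry_len; i++)
        if (registry[i]->cmd == cmd)
            return registry[i]->handler(arg);
    return -1;
}

/* Demo "driver" with made-up ioctl numbers. */
static int demo_get(void *arg) { *(int *)arg = 42; return 0; }
static int demo_set(void *arg) { return *(int *)arg >= 0 ? 0 : -1; }

const struct ioctl_reg my_ioctls[] = {
    { 0x1001, "DEMO_GET", sizeof(int), demo_get },
    { 0x1002, "DEMO_SET", sizeof(int), demo_set },
};
```

The same table that drives dispatch also carries the name and size that a
tracer needs, which is the point of registering data instead of writing
switch statements.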

--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."

Michael Elizabeth Chastain

Jan 22, 1999, 3:00:00 AM1/22/99

to oxym...@waste.org

Hi Oliver,

> If these functions register callbacks, we can do away with a fair amount
> of crufty switch code as well.

Oh yeah. The networking code has a lot of this, where dispatch
functions have switches with 40 codes in them just to pass them down to
the next layer. Probably none of this is fast-path code so it's more
an esthetic thing.

Michael

Zygo Blaxell

Jan 22, 1999, 3:00:00 AM1/22/99

to

In article <36A75DB9...@ata-sd.com>,
Steven Roberts <stro...@ata-sd.com> wrote:
>speaking of ioctl's, are there any good spots to find documentation of
>all of the ioctl's. sometimes I have found them in the driver src,
>sometimes man pages, sometimes doc directory in the general tree, etc...
>
>Just wondering if there is any available documentation about them in a
>standard location.

Since the ioctl()'s are implemented all over the place (in the driver,
in a 3rd-party module, etc...), the documentation is similarly fragmented.

It might be a good idea to document all of the ioctl()'s that are implemented
somewhere in the standard kernel in a documentation repository included in
the kernel source.

The only difficulty with the latter idea that I know of is that some of the
documentation work will overlap with the standard man pages packages and in
some ways greatly overlap with the C libraries. AFAIK this is fairly minor.

--
Zygo Blaxell (with a name like that, who needs a nick?)
Linux Engineer (my favorite official job title so far)
Corel Corporation (whose opinions sometimes differ from those shown above)
zy...@corel.ca (also zbla...@furryterror.org)

Pavel Machek

Jan 22, 1999, 3:00:00 AM1/22/99

to Alan Cox, Michael Elizabeth Chastain

Hi!

> > It's been running like this for three years. I released the source code
> > under GPL in November 1995. As far as I know, three people in the entire
> > world have ever run it, counting me.
>
> Thats cos a million of us never knew it existed. I'd practically kill for
> that stuff ( Not quite , imagine a trail of slightly bruised people in my
> wake). Its value to authors could be huge.
>
> > shifted again and again. The replayer needs a table of every system call
> > and how it affects memory, and that table needs more entries every week
> > (thanks to ioctl). So I have a great demo, if you have 1.3.42 kernel
> > headers to compile it against.
>
> This is something I've been pondering - strace has the same problems - it
> does suggest there should be a single good syscall/ioctl definition somewhere

BTW things like the network block device would benefit from ioctls being
standardized, too. That way, I would be able to pass ioctls across the
network, and I would be able to create a network char device and play
sounds over the network with it :-).
Pavel

PS: How many things would break if we forced ioctls to _always_ pass

struct foo {
int len;
char data[len];
} ?

--
I'm really pa...@atrey.karlin.mff.cuni.cz. Pavel
Look at http://atrey.karlin.mff.cuni.cz/~pavel/ ;-).

Michael Elizabeth Chastain

Jan 22, 1999, 3:00:00 AM1/22/99

to al...@lxorguk.ukuu.org.uk, pa...@bug.ucw.cz

Hi Pavel,

> PS: How many things would break if we forced ioctls to _always_ pass
>
> struct foo {
> int len;
> char data[len];
> } ?

Everything would break. Remember that lots of programs use terminal
control ioctls.

So far I've seen two design ideas. One is a new 'nioctl' system call
which takes 'len' as a fourth parameter. The other is to add
'ioctl_register' and 'ioctl_unregister' to the kernel, so that the
kernel knows this information and can export it, even though the
information is not present in the API.
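
To show what the first idea buys a tracer, here is a user-space mock of such a
call. It is not a real system call: the signature of nioctl(), the struct
touched_region record and the zero-filling "handler" are assumptions made for
this sketch. The point is only that the region the call may touch is known
from the arguments alone, without any per-ioctl table.

```c
#include <stddef.h>
#include <string.h>

/* What a tracer would log for one call: the exact region it may touch. */
struct touched_region {
    void *addr;
    size_t len;
};

static struct touched_region last_region;

/*
 * Mock of the proposed nioctl(). The stand-in handler zero-fills the
 * caller's buffer and never writes past len bytes.
 */
int nioctl(int fd, unsigned long cmd, void *arg, size_t len)
{
    (void)fd;   /* unused in this mock */
    (void)cmd;  /* unused in this mock */

    /* The length is part of the call, so it can be recorded up front. */
    last_region.addr = arg;
    last_region.len = len;

    memset(arg, 0, len);
    return 0;
}
```

Existing programs keep calling plain ioctl(), which is why this leaves the
old calls untouched instead of changing what they pass.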

Michael

Michael Elizabeth Chastain

Jan 22, 1999, 3:00:00 AM1/22/99

to stro...@ata-sd.com

Hi Steven,

> (hmmm... thinking I may have just voleentered :)

It's a dandelion problem -- you pull on it, and you find a whole root
system underneath it.

The problem is: "how do you document an entity where hundreds of unrelated
people check in code everywhere, and your desire for documentation
exceeds the willingness of those hundreds of people to write it."

When Linus decides he wants more documentation, he can make that a
criterion for accepting a patch. He can lay out whatever rules he likes.
My suggestion is: "if your code has an interface to userland, your patch
must have a file in Documentation/ somewhere that documents the interface
to userland."

Alexander L. Belikoff

Jan 22, 1999, 3:00:00 AM1/22/99

to linux-...@vger.rutgers.edu

Pavel Machek <pa...@bug.ucw.cz> writes:

>
> PS: How many things would break if we forced ioctls to _always_ pass
>
> struct foo {
> int len;
> char data[len];
> } ?
>

Most things, I guess... :-) That's like changing the semantics of any
more or less standard system call.

OTOH, what can be done is to change all Linux-specific ioctls to take
the structure above and to leave the cross-platform ones intact (like
BSD screen ctl etc.). Then a program in question would have an ugly
table which would store the length of the 3rd arg based on the ioctl
number - still better than nothing.

--
Alexander L. Belikoff
Bloomberg L.P. / BFM Financial Research Ltd.
ab...@vallinor4.com, ab...@bfr.co.il

Steven Roberts

Jan 22, 1999, 3:00:00 AM1/22/99

to Michael Elizabeth Chastain

Michael Elizabeth Chastain wrote:
> There's an overview list in Documentation/ioctl-number.txt. That's
> about the best you'll get.
>

oh...

Well is there any effort currently underway to document them?
If not, is there a desire to have better docs? I would think
there would be...

(hmmm... thinking I may have just voleentered :)

well, I'm porting some code that will be needing ioctls to linux over
the next few months. I'll start docs on the ones I use, and if people
think it is a good idea, I may try starting up an effort to document
all of them (or at least the most common ones)

Steve

Kenneth Albanowski

Jan 22, 1999, 3:00:00 AM1/22/99

to Michael Elizabeth Chastain

On Fri, 22 Jan 1999, Michael Elizabeth Chastain wrote:

> Hi Pavel,
>
> > PS: How many things would break if we forced ioctls to _always_ pass
> >
> > struct foo {
> > int len;
> > char data[len];
> > } ?
>
> Everything would break. Remember that lots of programs use terminal
> control ioctls.
>
> So far I've seen two design ideas. One is a new 'nioctl' system call
> which takes 'len' as a fourth parameter. The other is to add
> 'ioctl_register' and 'ioctl_unregister' to the kernel, so that the
> kernel knows this information and can export it, even though the
> information is not present in the API.

These can't accomplish the same thing: ioctl_register won't help the ioctl
handler figure out how many bytes you passed it for a variable-length
block. If the registration mechanism would simplify some other stuff, go
right ahead, but I'd definitely like to see an nioctl, and similar
routines anywhere that length isn't explicitly passed.

Also, 2.0.x, at least, had some places where routines used the MM layer to
calculate a true maximum length for a buffer. While accurate, this is a
remarkable amount of work to go to because a length field wasn't
available, and also broke for uClinux. Note that this was actually in the
fs code, and not actual ioctls. sys_mount is one of these, but you might be
surprised at the others: every syscall that takes a filename.

--
Kenneth Albanowski (kja...@kjahds.com, CIS: 70705,126)

Michael Elizabeth Chastain

Jan 22, 1999, 3:00:00 AM1/22/99

to kja...@kjahds.com

Hi Kenneth,

> These can't accomplish the same thing: ioctl_register won't help the ioctl
> handler figure out how many bytes you passed it for a variable-length
> block.

That is true for a simple ioctl_register. A complete
ioctl_register is more complex: it has to handle variable-length blocks and
blocks with secondary ioctl blocks, like the 'struct ifreq' ioctls. And then
some ioctls take different parameters in different drivers.

> Also, 2.0.x, at least, had some places where routines used the MM layer to
> calculate a true maximum length for a buffer. While accurate, this is a
> remarkable amount of work to go to because a length field wasn't
> available, and also broke for uClinux. Note that this was actually in the
> fs code, and not actual ioctls. sys_mount is one of these, but you might be
> surprised at the others: every syscall that takes a filename.

I wouldn't be surprised -- I had to dig through all these cases.
sys_mount is just gratuitous lack of a length parameter. I handled
the filename stuff by reading to the end of the string or until
ptrace gave me errors back (end of target memory).

For my application I can get away with reading more than the actual
changes. For some of the hairier ioctls like FDRAWCMD, I just fall
all the way back to "snapshot the entire target memory space",
which works for me.

I am curious, what is uClinux's interest in this stuff?

Michael

Steven Roberts

Jan 22, 1999, 3:00:00 AM1/22/99

to Michael Elizabeth Chastain

Michael Elizabeth Chastain wrote:
> Hi Steven,
>
> > (hmmm... thinking I may have just voleentered :)
>
> It's a dandelion problem -- you pull on it, and you find a whole root
> system underneath it.
>
> The problem is: "how do you document an entity where hundreds of unrelated
> people check in code everywhere, and your desire for documentation
> exceeds the willingness of those hundreds of people to write it."

Yes, I realize it would be impossible to keep all of them documented,
but I would think that at least the basic ioctls (say, for changing the baud
rate on a serial port) would be fairly static, and thus the documentation
would stay correct.

>
> When Linus decides he wants more documentation, he can make that a
> criterion for accepting a patch. He can lay out whatever rules he likes.
> My suggestion is: "if your code has an interface to userland, your patch
> must have a file in Documentation/ somewhere that documents the interface
> to userland."
>

I wouldn't be trying to have a completely inclusive set, just something
that would at least contain a bunch of them so at least most programmers
wouldn't have to dig.

Steve

Robert Kiesling

Jan 22, 1999, 3:00:00 AM1/22/99

to m...@shout.net

> When Linus decides he wants more documentation, he can make that a
> criterion for accepting a patch. He can lay out whatever rules he likes.
> My suggestion is: "if your code has an interface to userland, your patch
> must have a file in Documentation/ somewhere that documents the interface
> to userland."

That probably could be done now, with the code as-is, without
too, too much difficulty. Any TeXperts that would like to
volunteer for this should make it known before I get started on
the project.

--
(Open Sourcism: n. A pathological form of volunteering.)
Robert Kiesling
kies...@ix.netcom.com

Kenneth Albanowski

Jan 22, 1999, 3:00:00 AM1/22/99

to Michael Elizabeth Chastain

On Fri, 22 Jan 1999, Michael Elizabeth Chastain wrote:

> Hi Kenneth,
>
> > These can't accomplish the same thing: ioctl_register won't help the ioctl
> > handler figure out how many bytes you passed it for a variable-length
> > block.
>
> That is true for a simple ioctl_register. A complex complete
> ioctl_register has to handle variable-length blocks, blocks with secondary
> ioctl blocks like 'struct ifreq' ioctls. And then some ioctls take
> different parameters in different drivers.

Indeed. My experience is that mechanisms which attempt to "describe" stuff
fall prey to complexity (and tend to invoke the halting problem if you
take them too far). An nioctl call with a length would be several orders
of magnitude less complex. (Quite literally.)

> > Also, 2.0.x, at least, had some places where routines used the MM layer to
> > calculate a true maximum length for a buffer. While accurate, this is a
> > remarkable amount of work to go to because a length field wasn't
> > available, and also broke for uClinux. Note that this was actually in the
> > fs code, and not actual ioctls. sys_mount is one of these, but you might be
> > surprised at the others: every syscall that takes a filename.
>
> I wouldn't be surprised -- I had to dig through all these cases.
> sys_mount is just gratuitous lack of a length parameter. I handled
> the filename stuff by reading to the end of the string or until
> ptrace gave me errors back (end of target memory).

Which amounts to the same thing, save that in the kernel you can query for
end-of-target-memory in a somewhat efficient manner.

> For my application I can get away with reading more than the actual
> changes. For some of the hairier ioctls like FDRAWCMD, I just fall
> all the way back to "snapshot the entire target memory space",
> which works for me.
>
> I am curious, what is uClinux's interest in this stuff?

Mainly: how to implement UNIX style syscalls without memory protection.
uClinux currently doesn't have any mechanism for efficiently finding what
memory a user process owns (this is partially a simple implementation
issue, and partially the lack of an MMU).

Actually, I'm interested in the entire aspect of checkpointing, ptrace,
etc., for uClinux and in general. uClinux adds a few interesting twists,
among them: binaries are executed directly from ROM whenever possible, and
ptrace needs some additional concepts to indicate the base address space
of a process. The first means ld.so style fixups are not an option, and
the second means any debugger will need a bit of tweaking. (On that note,
I wanted to set gdb up for this, but was sorely disappointed at the
quality of its ptrace-based remote server.)

--
Kenneth Albanowski (kja...@kjahds.com, CIS: 70705,126)


Brandon S. Allbery KF8NH

Jan 23, 1999, 3:00:00 AM1/23/99

to Michael Elizabeth Chastain

In message <1999012215...@duracef.shout.net>, Michael Elizabeth Chastain
writes:
+-----

| > PS: How many things would break if we forced ioctls to _always_ pass
| >
| > struct foo {
| > int len;
| > char data[len];
| > } ?
|
| Everything would break. Remember that lots of programs use terminal
| control ioctls.
|
| So far I've seen two design ideas. One is a new 'nioctl' system call
| which takes 'len' as a fourth parameter. The other is to add
| 'ioctl_register' and 'ioctl_unregister' to the kernel, so that the
| kernel knows this information and can export it, even though the
| information is not present in the API.

+--->8

Why am I having sudden memories of ioctl(s, I_STR)?

--
brandon s. allbery [os/2][linux][solaris][japh] all...@kf8nh.apk.net
system administrator [WAY too many hats] all...@ece.cmu.edu
carnegie mellon / electrical and computer engineering KF8NH
We are Linux. Resistance is an indication that you missed the point.

Matthias Urlichs

Jan 24, 1999, 3:00:00 AM1/24/99

to linux-...@vger.rutgers.edu

Michael Elizabeth Chastain <m...@shout.net> writes:
> - Figure out some way to drain the ioctl swamp. I won't mind if the
> existing thousand ioctls get frozen if there is some way to handle
> new ones, such as an 'nioctl' call that takes the length as the
> fourth parameter.
>
Other Unices use macros which encode the length into the ioctl number
(via _IO[RW] macros)...

--
Matthias Urlichs | noris network GmbH | sm...@noris.de | ICQ: 20193661
The quote was selected randomly. Really. | http://www.noris.de/~smurf/
--
Q: What is black and white and red all over?
A: Half a nun.

Matthias Urlichs

Jan 24, 1999, 3:00:00 AM1/24/99

to linux-...@vger.rutgers.edu

Michael Elizabeth Chastain <m...@shout.net> writes:
>

> I have a trace-and-replay program based on ptrace. The tracer is similar
> to strace.
> [ description ]

Umm, excuse me for asking, but why is this the first time I'm hearing
about this?

Anyway, it's three years old, doesn't support ELF format (according to the
README) and doesn't compile under glibc. The latter problem I can fix, but
WRT the former I'm out of my depth. :-(

--
Matthias Urlichs | noris network GmbH | sm...@noris.de | ICQ: 20193661
The quote was selected randomly. Really. | http://www.noris.de/~smurf/
--

How do you know that electrons are political?
Because you can never determine their exact position.

Olaf Titz

Jan 25, 1999, 3:00:00 AM1/25/99

to linux-...@vger.rutgers.edu

> PS: How many things would break if we forced ioctls to _always_ pass
>
> struct foo {
> int len;
> char data[len];
> } ?

At the user level, everything would break. ;-) But perhaps it would be
possible to redefine just the ioctl syscall (to take an additional
length parameter, perhaps) and let the library sort it out? This way
user programs would remain compatible.

Olaf

Jamie Lokier

Jan 25, 1999, 3:00:00 AM1/25/99

to Matthias Urlichs, linux-...@vger.rutgers.edu

Matthias Urlichs wrote:
> Other Unices use macros which encode the length into the ioctl number
> (via _IO[RW] macros)...

Good point. Linux does this too for most ioctls; see <asm-i386/ioctl.h>.
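
A small sketch of what that encoding gives a tracer. The _IOR, _IOC_DIR and
_IOC_SIZE macros are the real ones from the kernel's ioctl header (reachable
from user space as <linux/ioctl.h>); struct demo_info, DEMO_GET_INFO and the
two helper functions are hypothetical names used only for this example.

```c
#include <linux/ioctl.h>   /* _IOR, _IOC_DIR, _IOC_TYPE, _IOC_NR, _IOC_SIZE */

/* Hypothetical driver argument and command, for illustration only. */
struct demo_info {
    int version;
    char name[28];
};
#define DEMO_GET_INFO _IOR('d', 1, struct demo_info)

/* Size in bytes of the argument encoded in an _IO*-style ioctl number. */
unsigned int ioctl_arg_size(unsigned int cmd)
{
    return _IOC_SIZE(cmd);
}

/* Nonzero if the kernel writes to user memory (the caller "reads"). */
int ioctl_kernel_writes(unsigned int cmd)
{
    return (_IOC_DIR(cmd) & _IOC_READ) != 0;
}
```

The catch is the "most" in "most ioctls": older numbers such as the terminal
ioctls (TCGETS is 0x5401 on i386) predate the macros and decode to a size of
zero, so a tracer still needs a table for those.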

-- Jamie
