The linux-kernel mailing list FAQ

玄鹤

Nov 16, 2009, 9:43:52 AM

to freesky, micha...@ericsson.com

http://www.kernel.org/pub/linux/docs/lkml/

Before you consider posting to the linux-kernel mailing list, please
read at least the start of section 3 of this FAQ list.

These frequently asked questions are divided into various categories.
Please contribute any category and Q/A that you may find relevant. You
can also add your answer to any question that has already been
answered, if you have additional information to contribute.

The official site is: http://www.tux.org/lkml/ (this is on the east
coast of the U.S.A.). Many thanks to Sam Chessman and David Niemi for
hosting the FAQ on a high-bandwidth, professionally managed Linux
server. The following mirrors are available (and are updated at the
same time as the official site):

* http://www.atnf.csiro.au/~rgooch/linux/docs/lkml/ in Sydney,
Australia
* http://www.ras.ucalgary.ca/~rgooch/linux/docs/lkml/ in Calgary,
Canada
* http://www.kernel.org/pub/linux/docs/lkml/ on the west coast of
the U.S.A.

Hot off the Presses
* vger.kernel.org has enabled ECN. You may need to switch ISPs in order
to receive linux-kernel email. See the section on ECN for more
details.
* Two digest forms of linux-kernel (a normal digest every 100KB and a
once-daily digest) are available at http://lists.us.dell.com/.
* Go to http://www.atnf.csiro.au/~rgooch/linux/docs/kernel-newsflash.html
for newsflashes about official kernel releases.
NOTE: this page is no longer maintained. If there is an alternative
page, please let me know.
* Read this before complaining to linux-kernel about compile problems.
Chances are a thousand other people have noticed and the fix is
already published.

Index

* Basic Linux kernel documentation
* Contributors and some special expressions
* Related mailing lists
* Question Index
1. General questions
2. Driver specific questions
3. Mailing list questions
4. "How do I" questions
5. "Who's who" questions
6. CPU questions
7. OS questions
8. Compiler/binutils questions
9. Feature specific questions
10. "What's changed between kernels 2.0.x and 2.2.x" questions
11. Primer documents
12. Kernel Programming Questions
13. Mysterious kernel messages
14. Odd kernel behaviour
15. Programming Religion
16. User-space Programming Questions
* Answers
* Contributing

Basic Linux kernel documentation
The following are Linux kernel related documents, which you should
take a look at before you post to the linux-kernel mailing list:

* The Linux Kernel Hackers' Guide, compiled by Michael K. Johnson
of Red Hat fame. Includes among other documents selected Q/As from the
linux-kernel mailing list.
* The Linux Kernel book, by David A. Rusling, available in various
formats from the Linux Documentation Project and mirrors. Still being
worked on, but explains clearly the main structure of the Linux
kernel.
* The Linux FAQ by Robert Kiesling has many high quality Q/As.
* The Linux Kernel HOWTO by Brian Ward. Fundamental reading for
anybody wanting to post to the linux-kernel mailing list.
* Various Linux HOWTOs on specific questions, such as the BogoMips
mini-HOWTO by Wim van Dorst. These are all by definition LDP
documents.
* The Linux kernel source code for any particular kernel version
that you may be using. Note that there is a /Documentation directory
which holds some very useful text files about drivers, etc. Also check
the MAINTAINERS file in the kernel source root directory.
* Some drivers even have Web pages, with additional up to date
information e.g. the network drivers by Donald Becker, etc. Check the
Hardware section in the LDP site.
* Similarly, Linux implementations for some CPU architectures have
dedicated Web pages, mailing lists, and sometimes even a HOWTO e.g.
the Linux Alpha HOWTO by Neal Crook. Check the LDP site and its
mirrors for Web links to the various architecture specific sites.
* Linux device drivers, a book written by Alessandro Rubini. C.
Scott Ananian reviewed it for Amazon.com.
* Linux kernel internals, a book by Michael Beck (Editor) et al.
Also reviewed for Amazon.com.
* Another useful site is: http://www.kernelnewbies.org/
* Here is a general guide on how to ask questions in a way that
greatly improves your chances of getting a reply:
http://www.catb.org/~esr/faqs/smart-questions.html. If you have a bug
to report, you should also read http://www.chiark.greenend.org.uk/~sgtatham/bugs.html.
Extra instructions, specific to the Linux kernel are available
here.

Contributors and some special expressions
This is the list of contributors to this FAQ. They are listed in
alphabetic order of their abbreviations, used in the Answers sections
below to identify the author(s) of each answer.

* AC : Alan Cox
* AV : Alexander Viro
* ADB: Andrew D. Balsa
* CP : Colin Plumb
* DBE: Daniel Bergstrom
* DSM: David S. Miller (co-postmaster)
* DW : David Woodhouse
* JBG: Jan-Benedict Glaw
* KGB: Krzysztof G. Baranowski
* KO : Keith Owens
* MEA: Matti E. Aarnio (co-postmaster)
* MRW: Matthew Wilcox
* PG : Paul Gortmaker
* RC : Ralph Corderoy
* REG: Richard E. Gooch (FAQ maintainer)
* REW: Roger E. Wolff
* RML: Robert M. Love
* RRR: Rafael R. Reilova
* TAC: Thomas A. Cort
* TJ : Trevor Johnson
* TYT: Theodore Y. Ts'o
* VKh: Vassilii Khachaturov

Some English expressions for non-native English readers. Many of these
(and far more) may be obtained from the Jargon File:

* AFAIK = As Far As I Know
* AKA = Also Known As
* ASAP = As Soon As Possible
* BTW = By The Way (used to introduce some piece of information or
question that is on a different topic but may be of interest)
* COLA = comp.os.linux.announce (newsgroup)
* ETA = Estimated Time of Arrival
* FAQ = Frequently Asked Question
* FUD = Fear, Uncertainty and Doubt
* FWIW = For What It's Worth
* FYI = For Your Information
* IANAL = I Am Not A Lawyer
* IIRC = If I Recall Correctly
* IMHO = In My Humble Opinion
* IMNSHO = In My Not-So-Humble Opinion
* IOW = In Other Words
* LART = Luser Attitude Readjustment Tool (quoting Al Viro:
"Anything you use to forcibly implant the clue into the place where
luser's head is")
* LUSER = pronounced "loser", a user who is considered to indeed
be a loser (idiot, drongo, wanker, dim-wit, fool, etc.)
* OTOH = On The Other Hand
* PEBKAC = Problem Exists Between Keyboard And Chair
* ROTFL = Rolling On The Floor Laughing
* RSN = Real Soon Now
* RTFM = Read The Fucking Manual (original definition) or Read The
Fine Manual (if you want to pretend to be polite)
* TANSTAAFL = There Ain't No Such Thing As A Free Lunch
(contributed by David Niemi, quoting Robert Heinlein in his science
fiction novel 'The Moon is a Harsh Mistress')
* THX = Thanks (thank you)
* TIA = Thanks In Advance
* WIP = Work In Progress
* WRT = With Respect To

Related mailing lists
Some questions are better posted to related mailing lists on specific
subjects. Posting to these mailing lists helps reduce the volume on
the linux-kernel mailing list and also increases your chances of
having your message read by an expert on the subject. Some people do
not have the time to subscribe to the linux-kernel mailing list, as it
is too general for them. Some related lists are:

* The linu...@vger.kernel.org mailing list is for networking
user questions. Subscribe by sending
subscribe linux-net in the message body to
majo...@vger.kernel.org
* The net...@vger.kernel.org mailing list is for network
development (not user questions). Subscribe by sending
subscribe netdev in the message body to
majo...@vger.kernel.org

Question Index
Section 1 - General questions

1. Why do you use "GNU/Linux" sometimes and just "Linux" in other
parts of the FAQ?
2. What is an experimental kernel version?
3. What is a production kernel?
4. What is a feature freeze?
5. What is a code freeze?
6. What is a f.g.hhprei kernel?
7. Where do I get the latest kernel source?
8. Where do I get extra kernel patches?
9. What is a patch?
10. How do I make a patch suitable for the linux kernel list?
11. How do I apply a patch?
12. What's vger?
13. What is a CVS tree? Where can I find more information about CVS?
14. Is there a CVS tutorial?
15. How do I get my patch into the kernel?
16. Why does the kernel tarball contain a directory called linux/
instead of linux-x.y.z/ ?
17. What's the difference between the official kernels and Alan
Cox's -ac series of patches?
18. What does it mean for a module to be tainted?
19. What is this about GPLONLY symbols?
20. Do I have to use GIT to send patches?
21. Who maintains the kernel?
22. The kernel doesn't compile cleanly. What shall I do?

Section 2 - Driver specific questions

1. Driver such and such is broken!
2. Here is a new driver for hardware XYZ.
3. Is there support for my card TW-345 model C in kernel version
f.g.hh?
4. Who maintains driver such and such?
5. I want to write a driver for card TW-345 model C, how do I get
started?
6. I want to get the docs, but they want me to sign an NDA (Non-
Disclosure Agreement).
7. I want/need/must have a driver for card TW-345 model C! Won't
anybody write one for me?
8. What's this major/minor device number thing?
9. Why aren't WinModems supported?
10. Modern CPUs are very fast, so why can't I write a user mode
interrupt handler?
11. Do I need to test my driver against all distributions?

Section 3 - Mailing list questions

1. How do I subscribe to the linux-kernel mailing list?
2. How do I unsubscribe from the linux-kernel mailing list?
3. Do I have to be subscribed to post to the list?
4. Is there an archive for the list?
5. How can I search the archive for a specific question?
6. Are there other ways to search the Web for information on a
particular Linux kernel issue?
7. How heavy is the traffic on the list?
8. What kind of question can I ask on the list?
9. What posting style should I use for the list?
10. Is the list moderated?
11. Can I be ejected from the list?
12. Are there any implicit rules on this list that I should be aware
of?
13. How do I post to the list?
14. Does the list get spammed?
15. I am not getting any mail anymore from the list! Is it down or
what?
16. Is there an NNTP gateway somewhere for the mailing list?
17. I want to post a Great Idea (tm) to the list. What should I do?
18. There is a long thread going on about something completely
offtopic, unrelated to the kernel, and even some people who are in the
"Who's who" section of this FAQ are mingling in it. What should I do
to fight this "noise"?
19. Can we have the Subject: line modified to help mail filters?
20. Can we have a Reply-To: header automatically added to the list
traffic?
21. Can I post job offers/requests to the list?
22. Why do I get bounces when I send private email to some people?
23. Why don't you split the list, such as having one each for the
development and stable series?

Section 4 - "How do I" questions

1. How do I post a patch?
2. How do I capture an Oops?
3. How do I post an Oops?
4. I think I found a bug, how do I report it?
5. What information should go into a bug report?
6. I found a bug in an "old" version of the kernel, should I report
it?
7. How do I compile the kernel?
8. How do I check if the running kernel is tainted?

Section 5 - "Who's who" questions
Names are in alphabetical order (last name) to avoid stepping on toes.
If someone doesn't appear here, check /usr/src/linux/CREDITS.

1. Who is in charge here?
2. Why don't we have a Linux Kernel Team page, same as there are
for other projects?
3. Why doesn't <any of the below> answer my mails? Isn't that rude?
4. Why do I get bounces when I send private email to some of
these people?
5. Who is Matti Aarnio?
6. Who is H. Peter Anvin?
7. Who is Donald Becker?
8. Who is Alan Cox?
9. Who is Richard E. Gooch?
10. Who is Paul Gortmaker?
11. Who is Bill Hawes?
12. Who is Mark Lord?
13. Who is Larry McVoy?
14. Who is David S. Miller?
15. Who is Linus Torvalds?
16. Who is Theodore Y. Ts'o?
17. Who is Stephen Tweedie?
18. Who is Roger Wolff?

Some people haven't yet contributed a few lines about themselves,
and the policy of this FAQ dictates that nobody is going to write
about anybody else without authorization. Hence the missing links e.g.
if you are not Linus, don't insist, we are not going to add your
information about Linus.

Other OS developers:

* Who is Prof. Douglas Comer (Xinu)?
* Who is Richard M. Stallman aka RMS (GNU)?
* Who is Prof. Andrew Tanenbaum (MINIX)?

Section 6 - CPU questions
Is this a matter of taste or what?

1. What is the "best" CPU for GNU/Linux?
2. What is the fastest CPU for GNU/Linux?
3. I want to implement the Linux kernel for CPU Hyper123, how do I
get started?
4. Why is my Cyrix 6x86/L/MX detected by the kernel as a Cx486?
5. What about those x86 CPU bugs I read about?
6. I grabbed the standard kernel tarball from ftp.kernel.org or
some mirror of it, and it doesn't compile on the Sparc, what gives?
7. Does the Linux kernel execute the Halt instruction to power down
the CPU?
8. I have a non-Intel x86 CPU. What is the [best|correct] kernel
config option for my CPU?
9. What CPU types does Linux run on?

Section 7 - OS questions
OS theory and practical issues mix.

1. OS $toomuch has this Nice feature, so it must be better than GNU/
Linux.
2. Why doesn't the Linux kernel have a graphical boot screen like
$toomuch OS?
3. The kernel in OS CTE-variant has this Nice-very-nice feature,
can I port it to the Linux kernel?
4. How about adding feature Nice-also-very-nice to the Linux
kernel?
5. Are there more bugs in later versions of the Linux kernel,
compared to earlier versions?
6. Why does the Linux kernel source code keep getting larger and
larger?
7. The kernel source is HUUUUGE and takes too long to download.
Couldn't it be split in various tarballs?
8. What are the licensing/copying terms on the Linux kernel?
9. What are those references to "bazaar" and "cathedral"?
10. What is this "World Domination" thing?
11. What are the plans for future versions of the Linux kernel?
12. Why does it show BogoMips instead of MHz in the kernel boot
message?
13. I installed kernel x.y.z and package foo doesn't work anymore,
what should I do?
14. People talk about user space vs. kernel space. What's the
advantage of each?
15. What are threads?
16. Can I use threads with GNU/Linux?
17. You mean threads are implemented in user space? Why not in
kernel space? Wouldn't that be more efficient?
18. Can GNU/Linux machines be clustered?
19. How well does Linux scale for SMP?
20. Can I lock a process/thread to a CPU?
21. How efficient are threads under Linux?
22. How does the Linux networking/TCP stack work?
23. Can we put the networking/TCP stack into user-space?

Section 8 - Compiler/binutils questions
Kernel compilation problems.

1. I downloaded the newest kernel and it doesn't even compile!
What's wrong?
2. What are the recommended compiler/binutils for building kernels?
3. Why the recommended compiler? I like xyz-compiler better.
4. Can I compile the kernel with gcc 2.8.x, egcs, (add your xyz
compiler here)? What about optimizations? How do I get to use -O99,
etc.?
5. I compiled the kernel with xyz compiler and get the following
warnings/errors/strange behavior, should I post a bug report to the
list? Should I post a patch?
6. Why does my kernel compilation stop at random locations with:
"Internal compiler error: program cc1 caught fatal signal 11."?
7. What compiler flags should I use to compile modules?
8. Why do I get unresolved symbols like foo__ver_foo in modules?
9. Why do I get unresolved symbols with __bad_ in the name?

Section 9 - Feature specific questions
Miscellaneous kernel features questions.

1. GNU/Linux Y2K compliance?
2. What is the maximum file size supported under ext2fs? 2 GB?
3. GGI/KGI or the Graphics Interface in Kernel Space debate?
4. How do I get more than 16 SCSI disks?
5. What's devfs and why is it a Good Idea (tm)?
6. Linux memory management? Zone allocation?
7. How many open files can I have?
8. When will the Linux accept(2) bug be fixed?
9. What about STREAMS? I noticed Caldera has a STREAMS package,
when will that go in the kernel source proper?
10. I need encryption and steganography. Why isn't it in the kernel?
11. How about an undelete facility in the kernel?
12. How about tmpfs for Linux?
13. What is the maximum file size/filesystem size?
14. Linux uses lots of swap while I still have stuff in cache. Isn't
this wrong?
15. Why don't we add resource forks/streams to Linux filesystems
like NT has?
16. Why don't we internationalise kernel messages?

Section 10 - "What's changed between kernels 2.0.x and 2.2.x" questions

1. Size (source and executable)?
2. Can I use a 2.2.x kernel with a distribution based on a 2.0.x
kernel?
3. New filesystems supported?
4. Performance?
5. New drivers not available under 2.0.x?
6. What are those __initxxx macros?
7. I have seen many posts on a "Memory Rusting Effect". Under what
circumstances/why does it occur?
8. Why does ifconfig show incorrect statistics with 2.2.x kernels?
9. My pseudo-tty devices don't work any more. What happened?
10. Can I use Unix 98 ptys?
11. Capabilities?
12. Kernel API changes

Section 11 - Primer documents
Please, if you wish to contribute a Q/A in this section, provide a
very short answer defining the topic and then a URL to a longer text/
Web page. That way we can have various URLs for a single Q, each
with a different point of view. Another advantage of this approach is
that each contributor has to sit down and write a coherent HTML page
or text file. Having to structure a written answer gives ample time to
think about the issues and the topic as a whole. It also allows
frequent independent revisions, which would be impossible on the FAQ
itself.

Note that writing the longer text/Web page on some relevant Linux
kernel topic and providing a Q/A in this section earns you instant
Guru status. Some people would *kill* for this. Now go and write your
stuff. ;)

1. What's a primer document and why should I read it first?
2. How about having I/O completion ports?
3. What is the VFS and how does it work?
4. What's the Linux kernel's notion of time?
5. Is there any magic in /proc/scsi that I can use to rescan the
SCSI bus?

Section 12 - Kernel Programming Questions
Answers to common questions about kernel programming details. See also
Tigran Aivazian's page on kernel programming.

1. When is cli() needed?
2. Why do I sometimes see a cli()-sti() pair, and sometimes a
save_flags-cli()-restore_flags sequence?
3. Can I call printk() when interrupts are disabled?
4. What is the exact purpose of start_bh_atomic() and
end_bh_atomic()?
5. Is it safe to grab the global kernel lock multiple times?
6. When do I need to initialise variables?

Section 13 - Mysterious kernel messages
We sometimes get these messages in our system logs and wonder what
they mean...

1. What exactly does a "Socket destroy delayed" mean?
2. What do I do about "inconsistent MTRRs"?
3. Why does my kernel report lots of "DriveStatusError BadCRC"
messages?
4. Why does my kernel report lots of "APIC error" messages?

Section 14 - Odd kernel behaviour
The kernel behaves in ways that seem odd...

1. Why is kapmd using so much CPU time?
2. Why does the 2.4 kernel report Connection refused when
connecting to sites which work fine with earlier kernels?
3. Why does the kernel now report zero shared memory?
4. Why does lsmod report a use count of -1 for some modules? Is
this a bug?
5. Why doesn't the kernel see all of my RAM?
6. I've mounted a filesystem in two different places and it worked.
Why?

Section 15 - Programming Religion
Responses to suggestions about programming techniques and languages.

1. Why is the Linux kernel written in C/assembly?
2. Why don't we rewrite it all in assembly language for processor
Mega666?
3. Why don't we rewrite the Linux kernel in C++?
4. Why is the Linux kernel monolithic? Why don't we rewrite it as a
microkernel?
5. Why don't we replace all the goto's with C exceptions?
6. Why are the kernel developers so dismissive of new techniques?

Section 16 - User-space Programming Questions
Answers to common questions about user-space programming details, as
it relates to the kernel/user-space interface (i.e. system calls).
This does not cover questions on the C library nor any other library,
as those questions are not related to the kernel.

1. Why does setsockopt() double SO_RCVBUF?

Answers
Section 1 - General questions

1. Why do you use "GNU/Linux" sometimes and just "Linux" in other
parts of the FAQ?
* (ADB) In this FAQ, we have tried to use the word "Linux"
or the expression "Linux kernel" to designate the kernel, and GNU/
Linux to designate the entire body of GNU/GPL'ed OS software, as found
in the various distributions. We prefer to call a cat, a cat, and a
GNU, a GNU. ;-)
The purpose of the FAQ is to provide information on the
Linux kernel and avoid debates on e.g. semantics issues. Further
discussion of the relationship between GNU software and Linux can be
found at http://www.gnu.org/gnu/linux-and-gnu.html.
BTW, it seems many people forget that the linux kernel
mailing list is a forum for discussion of kernel-related matters, not
GNU/Linux in general; please do not bring up this subject on the list.
2. What is an experimental kernel version?
* (ADB) Linux kernel versions are divided into two series:
experimental (odd series e.g. 1.3.xx or 2.1.x) and production (even
series e.g. 1.2.xx, 2.0.xx, 2.2.x, 2.4.x and so on). The experimental
series are fast moving versions which are used to test new features,
algorithms, device drivers, etc. By their own nature the experimental
kernels may behave in unpredictable ways, so one may experience data
losses, random machine lockups, etc.
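The odd/even rule above can be sketched as a small shell function. This is only an illustration of the numbering scheme as this FAQ describes it; `kernel_series` is a hypothetical helper name, not a tool shipped with the kernel.

```shell
#!/bin/sh
# kernel_series is a hypothetical helper: it looks only at the second
# number of a version such as 2.1.105 and applies the odd/even rule.
kernel_series() {
    minor=${1#*.}        # drop the major number and the first dot
    minor=${minor%%.*}   # drop everything after the minor number
    case $minor in
        ''|*[!0-9]*) echo "unknown"; return 1 ;;
    esac
    if [ $((minor % 2)) -eq 1 ]; then
        echo "experimental"
    else
        echo "production"
    fi
}

kernel_series 2.1.105   # odd minor number
kernel_series 2.4.18    # even minor number
```

The first call prints "experimental" and the second prints "production".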
3. What is a production kernel?
* (ADB) Production or stable kernels have a well defined
feature set, a low number of known bugs, and tried and proven drivers.
They are released less frequently than the experimental kernels, but
even so some "vintages" are considered better than others. GNU/Linux
distributions are usually based on chosen stable kernel versions, not
necessarily the latest production version.
4. What is a feature freeze?
* (ADB) A feature freeze is when Linus announces on the
linux-kernel list that he will not consider any more features until
the release of a new stable kernel version. Usually the net effect of
such an announcement is that on the following days people on the list
propose a flurry of new features before Linus really enforces the
feature freeze. ;-)
5. What is a code freeze?
* (ADB) A code freeze is more restrictive than a feature
freeze; it means only severe bug fixes are accepted. This is a short
phase that usually precedes the creation of a new stable kernel tree.
6. What is a f.g.hhprei kernel?
* (ADB) These are intermediate pre-release versions of
version f.g.hh. Note that usually i < 5, but e.g. 2.0.34prei was
available with i = 1 to 16. Sometimes "pre" is replaced by the
initials of the developer putting together the kernel revision, e.g.
2.1.105ac4 means the 4th intermediate release of kernel version
2.1.105 by Alan Cox.
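As a rough illustration of the naming scheme above, the following sketch splits a version string into its base version, the tag ("pre" or a developer's initials) and the intermediate release number. `split_version` is a hypothetical helper, and it assumes the base version always has exactly three numeric fields.

```shell
#!/bin/sh
# split_version is a hypothetical helper, assuming names of the form
# f.g.hh followed by an optional tag and an optional revision number.
split_version() {
    base=$(printf '%s\n' "$1" | sed 's/^\([0-9]*\.[0-9]*\.[0-9]*\).*$/\1/')
    rest=${1#"$base"}                                # e.g. "pre16" or "ac4"
    tag=$(printf '%s\n' "$rest" | sed 's/[0-9]*$//') # strip trailing digits
    rev=${rest#"$tag"}                               # what is left is the number
    printf 'base=%s tag=%s rev=%s\n' "$base" "${tag:-none}" "${rev:-none}"
}

split_version 2.0.34pre16
split_version 2.1.105ac4
```

For the two examples taken from the answer above, this prints `base=2.0.34 tag=pre rev=16` and `base=2.1.105 tag=ac rev=4`.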
7. Where do I get the latest kernel source?
* (ADB) The primary site for the Linux kernel (experimental
and production) sources is hosted by Transmeta (the company Linus
Torvalds used to work for) on a dedicated Web server at http://www.kernel.org/.
This site is mirrored across the world, and has pointers to mirrors
for each country. You can go directly to a mirror for your country by
going to http://www.CODE.kernel.org/ where "CODE" is the appropriate
country code. For example, "au" is the country code for Australia, so
the principal mirror site for Australia is http://www.au.kernel.org/
* (REG) You may also access tarballs and patches directly
via ftp from ftp://ftp.CODE.kernel.org/pub/linux/kernel/ which is
where Linus distributes his kernels from. Other notable kernel hackers
have directories under the people directory, which is where they keep
their kernel patches. The testing directory is where Linus puts pre-
release patches. The pre-release patches are mainly intended for other
developers, so they can stay in sync with changes in Linus' source
tree. These are often highly experimental and may crash or cause
filesystem corruption. Use at your own risk.

Note that Linus and Marcelo are using GIT to manage their
kernel source trees, and it is more convenient for them to make
snapshots of their latest trees available via GIT, rather than make
patches. If you want access to these snapshots (which are merely a
work in progress, and may be buggy), there are several access methods
available:

CVS: :pserver:anon...@cvs.kernel.org:/home/cvs/linux-2.[45]
Subversion: svn://svn.kernel.org/linux-2.[46]/trunk

* (JBG) Linux is no longer maintained with the BitKeeper
source code management system, but with GIT, a tool Linus wrote after
BitKeeper was no longer available to all developers. You can browse
Linus's latest kernel source as well as all other people's projects
hosted on kernel.org. There's also a nice Overview of GIT and some
helper tools as well as a complete Tutorial to get you into using GIT.
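The country-code mirror scheme described in this answer can be sketched as follows. `kernel_mirror_urls` is a hypothetical helper that only builds the two URL patterns quoted above. It does not check that a mirror for the given code actually exists or is still reachable.

```shell
#!/bin/sh
# kernel_mirror_urls is a hypothetical helper: it substitutes a country
# code into the http and ftp patterns given in this FAQ.
kernel_mirror_urls() {
    printf 'http://www.%s.kernel.org/\n' "$1"
    printf 'ftp://ftp.%s.kernel.org/pub/linux/kernel/\n' "$1"
}

kernel_mirror_urls au
```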
8. Where do I get extra kernel patches?
* (REG) There are many places which provide various extra
patches to the kernel for new features. One fairly good archive is
available at: http://www.linuxhq.com/.
9. What is a patch?
* (RRR) A patch file (as it refers to the Linux kernel) is
an ASCII text file that contains the differences between the original
code and the new code, plus some additional information such as
filenames and line numbers. The patch program (man patch) can then
apply the patch to an existing kernel source tree.
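To see what such a file looks like, here is a minimal sketch that produces a patch for a one-line change. The file name `answer.c` and its contents are made up for illustration; only the standard diff program is assumed.

```shell
#!/bin/sh
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir orig new

# Two versions of a made-up source file that differ in one line.
printf 'int answer(void)\n{\n\treturn 41;\n}\n' > orig/answer.c
printf 'int answer(void)\n{\n\treturn 42;\n}\n' > new/answer.c

# diff exits with status 1 when the files differ, which is expected here.
diff -u orig/answer.c new/answer.c > answer.patch || [ $? -eq 1 ]
cat answer.patch
```

The output starts with two header lines naming the old and new file, followed by a hunk header beginning with `@@` that carries the line numbers. Removed lines are prefixed with `-` and added lines with `+`.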
10. How do I make a patch suitable for the linux kernel list?
* (REG) Here are some basic guidelines for posting patches.
For information on how to generate patches, see the entry by RRR
below.
o Ensure the patch does not have trailing control-M
characters on each line. A number of broken tools used to encode
patches add control-M for "DOS compatibility". This breaks many
versions of patch, so be sure to configure your tools properly, or use
unbroken tools, otherwise your patch will be silently deleted.
o Include the patch inline in your email, in plain
text. Do not post it as a base64 MIME attachment. Many people will not
be able to read your patch, and thus your patch will be deleted
without comment.
o This FAQ previously advised posting a URL to a patch
if the patch is large. This is no longer recommended. The preferred
way to submit a large patch is to break it up into logical chunks,
with a descriptive comment for each, and post each piece with a
subject line like
"[PATCH] cleanup of foo driver [1/5]".
Do not start a new thread for each chunk - rather,
post each chunk as a followup to the previous chunk. You may want to
begin with an explanatory post, and label it something like
"[PATCH] cleanup of foo driver [0/5]".
See Documentation/SubmittingPatches for more
information.
o If you want Linus or one of the primary maintainers
(i.e. Marcelo, David) to apply your patch, you must Cc: them
explicitly, otherwise your patch will be ignored.
o When sending patches to Linus or one of the primary
maintainers, you must include the patch inline, in plain text, no
matter how large the patch.
o If you want to send a patch to the list for comment,
and also send it to Linus/primary maintainer for inclusion, and the
patch is large, you may wonder how to reconcile the conflicting
requirements. The solution is obvious: post the URL to the mailing
list, wait for comments, and later send the patch, inline, to Linus/
primary maintainer. Yes, this is more work for you. No, we don't care.
o If you have a mailer that eats whitespace or causes
similar corruption, then FIX YOUR MAILER, don't expect to be able to
take the easy solution and MIME encode your patch.
Finally, I've seen one person question the veracity of
these guidelines, stating that the rules are rather more relaxed, and
this FAQ is being over zealous. Fortunately, the King Penguin himself
responded to this, so I include his words on this, so that there can
be no doubt:

If I get a patch in an attachment (other than a "Text/PLAIN" type
attachment with no mangling and that pretty much all mail readers and
all tools will see as a normal body), I simply WILL NOT apply it unless
I have strong reason to. I usually wont even bother looking at it,
unless I expected something special from the sender.

Really. Don't send patches as attachments.

Linus
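The subject-line convention for a patch split into several pieces, as recommended in the guidelines above, can be generated mechanically. The sketch below prints the subject for the explanatory post (number 0) and for each piece. `patch_subjects` is a hypothetical helper, not part of any mail tool.

```shell
#!/bin/sh
# patch_subjects is a hypothetical helper: given a title and the number
# of pieces, it prints one subject line per post, starting with [0/N]
# for the explanatory post.
patch_subjects() {
    title=$1
    total=$2
    i=0
    while [ "$i" -le "$total" ]; do
        printf '[PATCH] %s [%d/%d]\n' "$title" "$i" "$total"
        i=$((i + 1))
    done
}

patch_subjects "cleanup of foo driver" 5
```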

* A caveat applies for people using a Mozilla Mail client.
Andrew Morton noted that Mozilla mangles spaces in column zero when
patches are included in the message body. Fortunately, Mozilla Mail
sends patch attachments as type text/plain or text/x-patch (depending
on the presence of a file extension), so it's safe to send patches as
attachments instead.
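Before posting, it may be worth checking a patch file for the trailing control-M characters mentioned in the guidelines. The following is a minimal sketch; `has_cr` and `strip_cr` are hypothetical helper names, and the two sample files are created only for the demonstration.

```shell
#!/bin/sh
# A literal carriage return (control-M), built portably with printf.
cr=$(printf '\r')

# has_cr succeeds if the file contains at least one control-M.
has_cr() {
    grep -q "$cr" "$1"
}

# strip_cr writes the file to stdout with trailing control-M removed.
strip_cr() {
    sed "s/$cr\$//" "$1"
}

# Demonstration on two made-up patch fragments.
tmp=$(mktemp -d)
printf '+int x;\r\n-int y;\r\n' > "$tmp/dos.patch"
printf '+int x;\n-int y;\n' > "$tmp/unix.patch"

for p in "$tmp/dos.patch" "$tmp/unix.patch"; do
    if has_cr "$p"; then
        echo "${p##*/}: contains control-M characters"
    else
        echo "${p##*/}: clean"
    fi
done
```

This reports the first file as containing control-M characters and the second as clean.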
* (RRR) To make a patch you use the diff program (read the
info file for diff). The easiest way to do this is to set up two
source trees under /usr/src, set a symlink "/usr/src/linux" to point
to the modified tree, and diff one tree against the other. The file
/usr/src/linux/Documentation/CodingStyle has more specific information, read
it. Things to remember:
o Always specify unified (-u) diff format.
o Avoid making formatting changes to the source that
make the diff needlessly larger. Watch out for editors that convert
tabs to spaces or vice versa.
o Unless you have specific reasons, diff against the
latest official source tree. Otherwise, your patch is likely to be
ignored. Either way, specify in your post against what you've diff'ed.
o Make sure your diff includes only the intended
changes in your patch, not every other patch you have made to your
source tree. Usually patches are limited to a few files, or
directories. It is best to only diff the relevant files i.e. if I only
made changes to the file driver_xyz.c under drivers/net, then I would
use the following commands (assuming you have the original source tree
named "linux-2.1.105", and the modified tree pointed at by the symlink
"linux"):

cd /usr/src
diff -u linux-2.1.105/drivers/net/driver_xyz.c \
    linux/drivers/net/driver_xyz.c > my_patch

o The following two should go without saying: the
arguments to diff are first source (the original, unmodified file(s)),
and then destination (your modified version of the file(s)), otherwise
you get a reversed patch (and lots of people wondering what you're
smoking). Also, make sure your patch applies and compiles cleanly.
o Of course you need to set up two identical source
directories to be able to diff the tree later. A nice trick --
requiring a little bit of consideration, though -- is to create the
modified source tree from hard links to the original source tree:

tar xzvf linux-2.1.anything.tar.gz
mv linux linux-2.1.anything.orig
cp -al linux-2.1.anything.orig linux-2.1.anything

This will hardlink every source file from the
original tree to a new location; it is very fast, since it does not
need to create some 80+ megabytes of files. You can now apply patches
to the linux-2.1.anything source tree, since patch does not change the
original files but moves them to filename.orig, so the contents of the
hard-linked file will not be changed.

Assuming that your editor does the same thing, too
(moving original files to backup files before writing out changed
ones) you can also freely edit within the hardlinked tree. If your
editor does not handle files this way, you need to make a copy of each
file before editing it, like this:

cp driver_xyz.c temporary; mv temporary driver_xyz.c

You can use file permissions to remind you to do
this. Just remove write permissions from all the files in the
directory you are working in:

chmod -w *.c

The changed tree can be diffed at high speed, since
most files don't just have identical contents, they are identical
files in both trees. Naturally removing that tree is quite fast, too.
Thanks to Janos Farkas <che...@shadow.banki.hu> for this trick.
o Finally, review the patch file (the format is not
that complicated) before posting, and include all relevant information
as to the nature of the patch. In particular, specify: why is this
patch needed/useful, and what exactly does it fix/improve.
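
The hard-link workflow described above can be sketched as a short shell
session. This is only a minimal sketch on a tiny stand-in tree, not a real
kernel tree: the directory names (linux-demo.orig, linux-demo) and the file
contents are invented for illustration, and it assumes GNU cp (for -al)
and GNU diff are installed.

```shell
#!/bin/sh
# Sketch of the hard-link workflow on a tiny stand-in tree (not a real kernel).
work=$(mktemp -d)
cd "$work" || exit 1

# Stand-in for the unpacked, unmodified source tree.
mkdir -p linux-demo.orig/drivers/net
printf 'int xyz_init(void)\n{\n\treturn 0;\n}\n' \
    > linux-demo.orig/drivers/net/driver_xyz.c

# Hard-link copy: fast, because no file data is duplicated (GNU cp).
cp -al linux-demo.orig linux-demo

# Break the hard link before editing, so the original file stays untouched.
f=linux-demo/drivers/net/driver_xyz.c
cp "$f" "$f.tmp" && mv "$f.tmp" "$f"

# Simulate an editor that overwrites the file in place.
printf 'int xyz_init(void)\n{\n\treturn 1;\n}\n' > "$f"

# Original tree first, modified tree second. diff exits with status 1
# when the trees differ, which is expected here.
diff -urN linux-demo.orig linux-demo > my_patch || true

cat my_patch
```

If the cp/mv step is skipped, the in-place write changes the file in both
trees at once and the resulting patch comes out empty, which is exactly the
trap the write-permission reminder is meant to catch.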
11. How do I apply a patch?
* (TAC) (From /usr/src/linux/README) You can upgrade between
releases by patching. Patches are distributed in the traditional gzip
and the new bzip2 format. To install by patching, get all the newer
patch files, enter the top-level directory of the unpacked kernel
source tree and execute:
gzip -cd patchXX.gz | patch -p1 or:
bzip2 -dc patchXX.bz2 | patch -p1
(repeat xx for all versions bigger than the version of
your current source tree, in order) and you should be ok. You may want
to remove the backup files (xxx~ or xxx.orig), and make sure that
there are no failed patches (xxx# or xxx.rej). If there are, either
you or me has made a mistake.
Alternatively, the script patch-kernel can be used to
automate this process. It determines the current kernel version and
applies any patches found. Use it thus:
scripts/patch-kernel .
The first argument in the command is the location of the
kernel source. Patches are applied from the current directory, but an
alternative directory can be specified as the second argument.
* (RRR) To apply kernel patches please take a look at the
kernel README file (/usr/src/linux/README) under "Installing the
kernel". There is also a good explanation on the Linux HQ Project
site.
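
Before applying a patch for real, it can help to let patch do a trial run.
The following is a minimal sketch, assuming GNU patch (which provides the
--dry-run option) and gzip are installed. It builds a tiny stand-in tree and
its own demonstration patch; the names linux, new and patch-demo.gz are
invented for illustration.

```shell
#!/bin/sh
# Sketch: test a patch with --dry-run (GNU patch) before applying it for real.
work=$(mktemp -d)
cd "$work" || exit 1

# Tiny stand-in source tree and an updated copy of it.
mkdir -p linux/kernel new/kernel
printf 'version 1\n' > linux/kernel/demo.txt
printf 'version 2\n' > new/kernel/demo.txt

# Build a gzip-compressed patch, similar to how release patches are shipped.
diff -urN linux new | gzip -c > patch-demo.gz

cd linux || exit 1
# The first pass changes nothing; it only reports whether every hunk applies.
if gzip -cd ../patch-demo.gz | patch -p1 --dry-run > /dev/null; then
    gzip -cd ../patch-demo.gz | patch -p1
else
    echo "patch does not apply cleanly; tree left untouched" >&2
fi

# Any leftover reject files would indicate a failed hunk.
find . -name '*.rej' -print
cat kernel/demo.txt
```

The -p1 option strips the leading directory name from the paths recorded in
the patch, which is why the command is run from inside the top-level source
directory.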
12. What's vger?
* (REG) "vger" is the name of the machine which hosts the
LKML server. This server also hosts a number of other linux-related
mailing lists. More information about the server is available at
http://vger.kernel.org/
13. What is a CVS tree? Where can I find more information about CVS?
* (REG) "CVS" is short for Concurrent Versions System, a
Source Code Management system. Check out the CVS Bubbles page.
14. Is there a CVS tutorial somewhere?
* (ADB) Here is a CVS tutorial which you can find online:
o An interactive CVS tutorial.
Getting a general idea of how CVS works takes about 15
minutes (highly recommended). Note that there are various graphical
front ends to CVS, so you don't have to learn the usual assortment of
cryptic commands.
15. How do I get my patch into the kernel?
* (RRR) Depending on your patch there are several ways to
get it into the kernel. The first thing is to determine which
maintainer your code falls under (look in the MAINTAINERS file). If
your patch is only a small bugfix and you're sure that it is
'obviously correct', then by all means send it to the appropriate
maintainer and post it to the list. If there is urgency to the bugfix
(i.e. a major security hole) you can also send it to Linus directly,
but remember he's likely to ignore random patches unless they are
"obviously correct" to him, have the maintainer's approval, or have
been well tested and meet the first condition. In case you're
wondering what constitutes well tested, here's another important bit:
one purpose of the list is to get patches peer-reviewed and well-
tested. Now, if your patch is relatively big, i.e. a rewrite of a
large code section or a new device driver, then to conserve bandwidth
and disk-space just post an announcement to the list with a link to
the patch. Lastly, if you're not too sure about your patch yet, want
some feedback from the maintainer, or wish to avoid open-season
flaming on work-in-progress, then use private email.
* (REG) If there is no specific maintainer for the part of
the kernel you want to patch, then you have three main options:
o send it to linux-...@vger.kernel.org and hope
someone picks it up and feeds it to Linus, or maybe Linus himself will
pick it up (don't count on it)
o send it to linux-kernel and Cc: Linus Torvalds
<torv...@osdl.org> and hope Linus will apply it. Note that Linus
operates like a black box. Do not expect a response from him. You will
need to check patches he releases to see if he applied your patch. If
he doesn't apply your patch, you will need to resend it (often many
times). If after weeks or months and many patch releases he still
hasn't applied it, maybe you should give up. He probably doesn't like
it
o send it to linux-kernel and Cc: Alan Cox
<al...@redhat.com>. Alan is better at responding to email, and will
queue your patch and resend it to Linus periodically, so you can
forget about it. He also serves as a good taste tester. If Alan
accepts your patch, it's more likely that Linus will too. If he
doesn't like your patch, you will probably get an email saying so.
Expect it to be terse.
16. Why does the kernel tarball contain a directory called linux/
instead of linux-x.y.z/ ?
* (DW) Because that's the way Linus wants it. It makes
applying many consecutive patches simpler, because the directory
doesn't need to be renamed each time, and it also makes life easier
for Linus.
17. What's the difference between the official kernels and Alan
Cox's -ac series of patches?
* (REG, contributed by Erik Mouw) Alan's kernel can be seen
as a test bed for Linus' kernels. While Linus is very conservative and
only applies obvious and well tested patches to the 2.4 kernel, Alan
maintains a set of kernel patches that contains new concepts, more and/
or newer drivers, and more intrusive patches. If the patches prove
themselves stable, Alan submits them to Linus to include them into the
official kernel.
18. What does it mean for a module to be tainted?
* (REG, contributed by John Levon) Some vendors distribute
binary modules (i.e. modules without available source code under a
free software license). As the source is not freely available, any
bugs uncovered whilst such modules are loaded cannot be investigated
by the kernel hackers. All problems discovered whilst such a module is
loaded must be reported to the vendor of that module, not the Linux
kernel hackers and the linux-kernel mailing list. The tainting scheme
is used to identify bug reports from kernels with binary modules
loaded: such kernels are marked as "tainted" by means of the
MODULE_LICENSE tag. If a module is loaded that does not specify an
approved license, the kernel is marked as tainted. The canonical list
of approved license strings is in linux/include/linux/module.h.
"oops" reports marked as tainted are of no use to the
kernel developers and will be ignored. A warning is output when such a
module is loaded. Note that you may come across module source that is
under a compatible license, but does not have a suitable
MODULE_LICENSE tag. If you see a warning from modprobe or insmod for a
module under a compatible license, please report this bug to the
maintainers of the module, so that they can add the necessary tag.
* (KO) If a symbol has been exported with EXPORT_SYMBOL_GPL
then it appears as unresolved for modules that do not have a GPL
compatible MODULE_LICENSE string, and prints a warning. A module can
also taint the kernel if you do a forced load. This bypasses the
kernel/module verification checks and the result is undefined, when it
breaks you get to keep the pieces.
* (KO) According to Alan Cox, a license of "BSD without
advertisement clause" is not a suitable free software license. This
license type allows binary only modules without source code. Any
modules in the kernel tarball with this license should really be "Dual
BSD/GPL".
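
To check whether a running kernel is tainted before sending in an oops
report, something like the following may help. It is a rough sketch that
assumes the kernel exposes its taint flags as a number in
/proc/sys/kernel/tainted (this file is not mentioned in the FAQ text above),
with the lowest bit meaning a module without an approved licence was loaded
and the next bit meaning a module was force-loaded. The helper name
decode_taint is made up for this example.

```shell
#!/bin/sh
# decode_taint VALUE: explain the two lowest bits of a kernel taint value.
decode_taint() {
    value=$1
    if [ "$value" -eq 0 ]; then
        echo "not tainted"
        return 0
    fi
    # Bit 0: a module without an approved licence string was loaded.
    [ $((value & 1)) -ne 0 ] && echo "proprietary module loaded"
    # Bit 1: a module was force-loaded.
    [ $((value & 2)) -ne 0 ] && echo "module force-loaded"
    # Anything else is not decoded by this sketch.
    [ $((value & ~3)) -ne 0 ] && echo "other taint flags set"
    return 0
}

# On a running system, read the value from /proc where it is available.
if [ -r /proc/sys/kernel/tainted ]; then
    decode_taint "$(cat /proc/sys/kernel/tainted)"
fi
```

For example, decode_taint 3 reports both a proprietary module and a forced
load.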
19. What is this about GPLONLY symbols?
* (REG) By default, symbols are exported using
EXPORT_SYMBOL, so they can be used by loadable modules. During the 2.4
series, a new export directive EXPORT_SYMBOL_GPL was added. This is
almost the same thing, except that the symbol can only be accessed by
modules which have a GPL compatible licence (note that this includes
dual-licenced BSD/GPL code). This new directive was added for these
reasons:
o To clarify the ambiguous legal ground on which non-
GPL (particularly proprietary) modules lie. A strict reading of the
GPL prohibits loading proprietary modules into the kernel. While Linus
has consistently stated that proprietary modules are allowed (i.e. he
has granted an explicit exemption), it is not clear that he is able to
speak for all developers who have contributed to the Linux kernel.
While many think Linus' edict means that all contributed code falls
under this exemption granted by Linus, not everyone agrees that this
is a legally sound argument. The new EXPORT_SYMBOL_GPL directive makes
the licence conditions explicit, and thus removes the legal ambiguity.
o To allow choice for developers who wish, for their
own reasons, to contribute code which cannot be used by proprietary
modules. Just as a developer has the right to distribute code under a
proprietary licence, so too may a developer distribute code under an
anti-proprietary licence (i.e. strict GPL).
Note that Linus has stated that existing symbols will not
be switched to GPL-only. Developers of proprietary modules for Linux
need not fear. Furthermore, it is quite unlikely that Linus will look
favourably upon the introduction of new core driver APIs which are
restricted to GPL-only modules. This would not be in the best
interests of Linux. Linus has forwarded me a message he sent to
someone else to clarify his views. Note that since that time, several
developers have eroded the number of non-GPL only symbols by writing
new (usually better) infrastructure and interfaces and deprecating the
older interfaces. The newer interfaces are often tagged as GPL-only.
In addition, there are some "kernel janitors" who aggressively submit
patches to remove all symbols (whether GPL-only or not) which are not
used by code shipped with the kernel source tree.
20. Do I have to use GIT to send patches?
* (REG) Absolutely not. Some kernel developers, including
Linus and Marcelo, have chosen to use GIT to manage their kernel
source trees, but this does not mean you need to use GIT yourself to
maintain your trees or submit patches. Many notable kernel developers
continue to maintain their source trees using other tools and
techniques, and continue to send conventional patches.
21. Who maintains the kernel?
* (REG) Originally, Linus Torvalds maintained the kernel. As
the kernel has matured, he has delegated maintenance for older stable
versions to others, while he continues development of the latest
"bleeding edge" release. As of 27-MAY-2002, the following kernel
versions are maintained by these people:
o 2.0 David Weinehall <t...@acc.umu.se>
o 2.2 Alan Cox <al...@lxorguk.ukuu.org.uk>
o 2.4 Marcelo Tosatti <mtos...@redhat.com>
o 2.6 Linus Torvalds <torv...@osdl.org>
22. The kernel doesn't compile cleanly. What shall I do?
* (REG) First make sure you have the latest version of that
kernel series. Perhaps a pre-patch already has a fix. If not, search
the list archives for a fix. Don't contribute to noise on the list by
asking a question that may already have been answered.
If the problem has not yet been fixed, try digging into
the code yourself and post a fix to the mailing list. You'll be
famous! Beware that making broken code compile just for the sake of a
clean 'make bzImage modules' doesn't count as a fix, and your fix will
be discarded, ignored or flamed.

Section 2 - Driver specific questions

1. Driver such and such is broken!
* (RRR) Try to be more specific. Please, provide information
on your particular setup (see the Q: "How do I make a bug report?"). Also see
the Q: "kernel x.y.z broken!" below.
* (ADB) That's the worst possible way to start a thread.
Please try to reach the author of the driver first and report the
"broken" driver to him. Constructive criticism is welcome, usually.
2. Here is a new driver for hardware XYZ.
* (REW) Good work! Please try to find a few people that also
have the XYZ hardware and have them test it on their configuration
(e.g. by posting a message on a newsgroup). No, it won't go into the
standard kernel before some people have tested it.
Testing will take a while. In the mean time, kernel
development will continue, and you will have to rewrite your patch for
the most recent version before Linus might consider it.
As a whole new driver is most likely more than a few pages
long, we'd prefer it if you would put the actual driver up for ftp
instead of posting it to the list. Post the URL and the description
that tells us what your driver does for which hardware.
3. Is there support for my card TW-345 model C in kernel version
f.g.hh?
* (REW) First check if your card is detected at boot time.
It usually is. Second see if you might need to configure something
like modules.conf for your card. Third see if there is a file with the
card name in the kernel sources. (e.g. you have a Buslogic card, and
there is a buslogic.c file in the kernel sources, you're in luck.).
Next, grep for the manufacturer name through ALL the kernel sources.
And try the model number of your card. Also try to find the largest
chip on your card and grep for the chip number on that thing. Realize
that 53C80 chips might be named 5380 in the kernel. Other chips don't
have their middle name removed.
Nothing yet? Now check DejaNews, using the same arguments
you used to grep the kernel source. There is a 99.99% chance that
somebody has exactly the same card TW-345 model C.
Ok. That's what you can do without bothering anyone. If
all this doesn't lead somewhere, you should really ask this question
on a newsgroup like comp.os.linux.hardware.
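
The grep steps above can be rolled into one small helper. This is only a
sketch: the function name search_sources and the stand-in tree it searches
are invented for illustration, and it assumes a grep that supports recursive
search with -r. On a real system the first argument would be the top of
your kernel source tree.

```shell
#!/bin/sh
# search_sources DIR TERM...: list source files mentioning any of the terms,
# ignoring case. DIR would normally be the top of the kernel source tree.
search_sources() {
    dir=$1
    shift
    for term in "$@"; do
        grep -r -i -l -- "$term" "$dir"
    done | sort -u
}

# Stand-in tree so the sketch can run anywhere; the file names are invented.
tree=$(mktemp -d)
mkdir -p "$tree/drivers/scsi"
echo '/* NCR 5380 generic driver */' > "$tree/drivers/scsi/demo5380.c"
echo '/* unrelated */' > "$tree/drivers/scsi/other.c"

# Try the card name and both spellings of the chip number.
search_sources "$tree" "TW-345" "53C80" "5380"
```

Passing several spellings at once covers the case where the kernel drops
part of the chip name, as with the 53C80 above.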
4. Who maintains driver such and such?
* (RRR) Have a look at the /usr/src/linux/MAINTAINERS file,
this is the most authoritative source. Also check the source code for
the driver itself; in both cases, check the latest version of the
kernel that you have available. Some drivers have specific Web pages
and sometimes even a dedicated mailing list. Check those first. If you
cannot contact the maintainer then as a last resort post a short
message to the list. In any case, keep in mind that maintainers are
usually very busy people and most of them work on Linux for free and
in their spare time, so don't expect an immediate response. Some
maintainers get just too many mails in too small periods of time to be
able to answer them all, so please be kind to them.
5. I want to write a driver for card TW-345 model C, how do I get
started?
* (REW) Good initiative! First a piece of advice: are you up
to this? Ten times as many projects like this get started as get
finished. Also, make sure that you're not doing double work. Make sure
that such a driver is not already available: read Q/A 2.3 above...
First prepare yourself. Get the docs, read them (OK,
you're allowed to start skipping stuff if you've gotten to the part
"detailed register descriptions"). Next, get the Linux kernel source,
find a driver that drives similar hardware to the one you're going to
work on, and read THAT. (I usually use the smallest one I can find: wc
-l *.c | sort -n | head -4).
Ok. You've thought about it. Now the question is, do you
have technical documentation for your card? You can reverse engineer
the driver for MS operating systems, but having the documentation is
MUCH easier.
In the dark old ages (70s to middle of the 80s), you got a
complete technical description with every card you could get. This is
no longer the case. Anyway, contact your vendor and politely ask them
for the "device driver kit" or the "technical manual" for the card.
Try the head office and your local office at the same
time. Local offices occasionally have bad photo copies that they give
out before you get an official rejection from the head office. In that
case whom you got the documentation from becomes confidential
information. Don't put the guy's name in the source.
If you can't get the technical documentation, consider
giving up and investing in a competitor's product (and tell the
manufacturer about this). Not given up yet? Ok. Next step is to find
out what the DOS driver does. Try to get the card to work while you
run it in a Microsoft emulator (dosemu or WINE). This will allow you
to program these tools to log the I/O accesses of the driver. This
will give you a large list of I/O accesses that the driver did. If
you're good, you might be able to see patterns, and deduce how the
driver works. From there you might be able to write a working driver.
Good luck! You'll need it.
6. I want to get the docs, but they want me to sign an NDA (Non-
Disclosure Agreement).
* (REW) Some people find this a tremendous problem. Some
companies just want to know who has the docs to their hardware, and
don't mind if you write a GPL-ed driver. In that case, there is really
no problem: just tell them what you intend to do and ask them to
acknowledge in writing that they've understood what you're saying. In
that case, you can get your driver into the standard kernel, but you
cannot send out the docs to anybody who wants to work on the driver.
They will have to rely on the comments in the source.
Other companies (just like Netscape) have themselves signed
NDAs that forbid them to disclose information to you.
Some really think that they have trade secrets in the
interface towards the software, and intend to keep them secret. Those
won't allow you to write a driver and then put the source on the net.
Be careful with these.
* (ADB) The first and only NDA I ever received instantly
found its way to the wastebasket. I would advise anybody who gets an
NDA to refuse to sign it, if it refers to anything that may/will be
put under GNU/GPL. Of course, for contract work this doesn't apply.
7. I want/need/must have a driver for card TW-345 model C! Won't
anybody write one for me?
* (REW) Some Linux developers will settle for a beer, and
develop the driver for you. Others want a "free sample" of the
hardware and will then go ahead and write the driver.
If you need more than a few of the cards or you
manufacture the cards yourself, you can consider paying one of the
commercial Linux device driver companies to get a commercially backed,
officially maintained device driver.
8. What's this major/minor device number thing?
* (REG) Device numbers are the traditional Unix way to
provide a mapping between the filesystem and device drivers. A device
number is a combination of a major number and a minor number.
Currently Linux has 8 bit majors and minors. When you open a device
file (character or block device) the kernel takes the major number
from the inode and indexes into a table of driver structure pointers.
The specific driver structure is then used to call the driver open()
method, which in turn may interpret the minor number. There are two
tables: one for character devices and one for block devices, each with
at most 256 entries. Obviously, there must be agreement between device
numbers used in a driver and files in /dev. The kernel source has the
file Documentation/devices.tex which lists all the official major and
minor numbers. H. Peter Anvin (HPA) maintains this list. If you write
a new driver (for public consumption), you will need to get a major
number allocated by HPA. See the Q/A on devfs for an improved (IMHO)
mechanism for handling device drivers.
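
With 8 bit majors and minors, a device number fits in 16 bits. A small
sketch of the arithmetic follows, assuming the major number is kept in the
high byte and the minor number in the low byte. The helper names make_dev,
dev_major and dev_minor are made up for this example and are not kernel
interfaces. The last line assumes GNU stat, which prints the major and minor
numbers of a device file in hexadecimal with %t and %T.

```shell
#!/bin/sh
# With 8-bit majors and minors a device number fits in 16 bits:
# the major is the high byte, the minor is the low byte.
make_dev()  { echo $(( ($1 << 8) | $2 )); }
dev_major() { echo $(( ($1 >> 8) & 255 )); }
dev_minor() { echo $(( $1 & 255 )); }

# Example: major 8, minor 1.
dev=$(make_dev 8 1)
echo "dev=$dev major=$(dev_major "$dev") minor=$(dev_minor "$dev")"

# Show the numbers recorded in the filesystem for a device file (GNU stat).
stat -c '%n: major 0x%t minor 0x%T' /dev/null 2>/dev/null || true
```

The first echo prints dev=2049 major=8 minor=1, since 8 * 256 + 1 = 2049.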
9. Why aren't WinModems supported?
* (REG, quoting Edward S. Marshall) The problem is the lack
of specifications for this hardware. Most companies producing so-
called "WinModems" refuse to provide specifications which would allow
non-Microsoft operating systems to use them.
The basic issue is that they don't work like a traditional
modem; they don't have a DSP, instead making the CPU do all the work.
Hence, you can't talk to them like a traditional modem, and you need
to run the modem driver as a realtime task, or you'll have serious
data loss issues under any kind of load. They're simply a poor design.
* (REG) Note that some people have been putting effort into
reverse engineering some WinModems, so you may be lucky and find that
yours is now supported. If not, it's time to get a refund and buy a
real modem.
Note that modems have to be approved by the appropriate
statutory or regulatory body for standards compliance (to make sure
they don't send crap down the line and blow up the exchange). With
WinModems, the driver software needs to be certified as well as the
hardware. It's harder to get approval for Open Source drivers, since
it usually costs money to obtain approval. Also, in theory, it's
easier to modify an Open Source driver, so it would no longer be
compliant. In reality, 99.999% of users don't even know there is
source code for the driver, so "Standards Compliance" may well be a
smoke-screen for manufacturers who don't want to bother with non-WinTel
systems. If certification was the only problem, manufacturers could
release binary-only drivers.
* (DW) The good news is that a certain amount of WinModem
hardware is now supported. The bad news is that that is just the tip
of the iceberg. Although the WinModems can now be used, they have
functionality similar to that of a sound card - all the modulation and
demodulation has to be performed by the host CPU. Work is progressing
on this front too - see http://www.linmodems.org/ for more up-to-date
information.
10. Modern CPUs are very fast, so why can't I write a user mode
interrupt handler?
* (REG, quoting Pete Zaitcev) This is not a question of
having enough CPU cycles to waste them on mode switches. Rather, the
current Linux architecture does not allow it. User processes run with
interrupts enabled. Thus, any interrupt handler must deactivate the
particular interrupt source before a process is scheduled to run, or
an interrupt storm results. The deactivation is done in a device
specific manner, so at least a small device driver must be present in
kernel mode.
11. Do I need to test my driver against all distributions?
* (REG, MEA) There are minor detail changes in between each
kernel version (even in stable series), and depending on what
configuration options are used (basically SMP or not), certain things
like spinlocks may or may not reserve space in structures, and may or
may not need to be called (are even optimized away in non-SMP
systems), meaning that a binary driver compiled for SMP might not work
with a non-SMP kernel. And vice versa.
Also different vendors tend to inject different things
into their kernel patch-sets, which again may subtly change data
layouts, etc. In stable kernel series great pains are taken during
maintenance so that data layouts of in-kernel APIs (and API calls
themselves) are not changed. Nevertheless, something may change, causing
binary drivers to fail in mysterious ways.
Subtle memory changes may appear with i386-PAE mode (large
memory machines which can't map all of RAM into the kernel at the same
time).
Because of these differences, a driver compiled for one
version of the kernel, or one vendor's kernel, is not likely to work
with another kernel. Thus, if you are distributing a binary-only
driver, you will have a significant support load compiling drivers for
different kernels. If you are distributing a driver in source form,
then, provided the driver is well-written (i.e. does not make
assumptions about byte ordering or word sizes and uses standard kernel
interfaces), the driver should be portable across kernel versions and
architecture types. It will of course have to be compiled by end-users
for their particular kernel. Distribution maintainers are likely to
provide pre-compiled drivers, thus most end-users won't need to
compile the driver themselves.

Section 3 - Mailing list questions
The linux-kernel mailing list is for discussion of the development of
the Linux kernel itself. Questions about administration of a Linux
based system, programming on a Linux system or questions about a Linux
distribution are not appropriate.

"Test" messages are very, very inappropriate on the lkml or any other
list, for that matter. If you want to know whether the subscribe
succeeded, wait for a couple of hours after you get a reply from the
mailing list software saying it did. You'll undoubtedly get a number
of list messages. If you want to know whether you can post, you must
have something important to say, right? After you have read the
following paragraphs, compose a real letter, not a test message, in an
editor, saving the body of the letter in the off chance your post
doesn't succeed. Then post your letter to lkml. Please remember that
there are quite a number of subscribers, and it will take a while for
your letter to be reflected back to you. An hour is not too long to
wait.

(REG) The essential point to remember when posting to the linux-kernel
mailing list is that there are a lot of very busy people reading the
list. No matter how important you think you are, it is most likely
that there are many people on the list who are more important than
you. "Important" is not measured by the amount of money you have, how
much your question is worth to your company or how desperate you are
for an answer, rather, it is measured by how much you contribute to
the linux kernel.

With that in mind, you should make sure that you are not wasting the
time of other people on the list. Write for maximum efficiency of
reading. It doesn't matter if it takes twice as long for you to
compose a more readable message, if it halves the time a hundred key
kernel developers spend trying to decode your message. Ignoring good
taste and consideration is most likely to result in you being ignored.

1. How do I subscribe to the linux-kernel mailing list?
* (ADB) Think again before you subscribe. Do you really want
to get that much traffic in your mailbox? Are you so concerned about
Linux kernel development that you will patch your kernel once a week,
suffer through the oopses, bugs and the resulting time and energy
losses? Are you ready to join the Order of the Great Penguin, and be
called a "Linux geek" for the rest of your life? Maybe you're better
off reading the "Kernel coverage at LWN.net" summary at http://lwn.net/Kernel/.
OK, if you still want to read linux-kernel in its full
glory, send the line "subscribe linux-kernel your_email@your_ISP" in
the body of the message to majo...@vger.kernel.org (don't include
the " characters, and of course replace the fake email address with
your true address). You have been warned!
* (MEA) Quite often I see things like this summary report:

FAILED:
<smtp cedar-republic.com edm...@cedar-republic.com 60000>: ...\
<<- RCPT To:<edm...@cedar-republic.com>
->> 550 <edm...@cedar-republic.com>... we do not relay

Feeding this address to the page at http://vger.kernel.org/mxverify.html
shows that ONE of their backup MX servers refuses to relay
email through to them. Thus, whenever all the other servers are
unreachable, that one ruins their email connectivity.

Do make sure YOU don't have this very problem!
See http://vger.kernel.org/majordomo-info.html for
information on Majordomo.
2. How do I unsubscribe from the linux-kernel mailing list?
* (ADB) At the bottom of each and every message sent by the
linux-kernel mailing list server one can read:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
See http://vger.kernel.org/majordomo-info.html for
information on Majordomo.
3. Do I have to be subscribed to post to the list?
* (ADB) No, you don't have to be subscribed to the list to
post to it. The address of the list is linux-...@vger.kernel.org.
You should indicate in your message that you wish to be personally
CC'ed on the answers/comments posted to the list in response to your
posting.
* (REG) It is, however, generally considered good netiquette
to be subscribed to a list (or a newsgroup for that matter) and lurk
for a while before posting. That way you can learn what's considered
an appropriate post and what isn't.
Don't treat the list as your personal helpdesk. Remember
that the list is a community.
4. Is there an archive for the list?
* (REG) There are many. Here are some:
o http://www.uwsg.indiana.edu/hypermail/linux/kernel/
has a search by word/subject capability.
o http://marc.theaimsgroup.com/?l=linux-kernel keeps a
collection of Linux related list archives.
o http://lkml.org/ is another archive with latest
kernels, latest messages and hottest messages tables.
o http://groups.google.com/groups?hl=en&q=fa.linux.kernel&meta=
is a Google interface to the fa.linux.kernel newsgroup, which is in
turn fed from the mailing list.
o http://gossamer-threads.com/lists/linux/kernel/ has
an easy interface and an appealing format (click on a thread, shows
all posts in a thread with posts clearly delimited).
5. How can I search the archive for a specific question?
* (ADB) Use simple keywords which refer to the issue that
matters to you. For example, if you are investigating an oops that
happens whenever you plug in a network adapter NIC-007, use "NIC-007"
or "oops NIC-007". As soon as you have found a link to a message that
interests you, try to follow the thread. Remember that you will almost
always get more information by carefully searching the archive than by
posting a question to the list itself.
6. Are there other ways to search the Web for information on a
particular Linux kernel issue?
* (ADB) Sure. Before you check the list archives, you can
search DejaNews and AltaVista (simultaneously, if your browser allows
you to open various windows). You can also follow some links on the
Linux Documentation Project site.
7. How heavy is the traffic on the list?
* List traffic is very heavy; the average number of messages
per day is ~400 [07/2007 - 02/2008]. That's over 12 000 messages a
month!!!
* (ADB) You really don't want to read each and every posting
to the list. If you are concerned with list traffic, I suggest you
temporarily try the digest lists, which will be much easier on your
mailbox (thanks to A. Wik for this suggestion).
* (REG) There is a weekly summary called "Kernel coverage at
LWN.net" at http://lwn.net/Kernel/, which can save you a lot of time.
8. What kind of question can I ask on the list?
* (ADB) The basic rule is to avoid asking questions that
have been asked before, or that are irrelevant to other list users, or
that are off topic. Please use your good sense.
* (REG) Remember that this is a list for the discussion of
kernel development. If you have some ideas or bug reports to
contribute, this is the place. User space issues are not appropriate
for this forum. If you find a bug in the C library or some
application, it doesn't belong on linux-kernel.
9. What posting style should I use for the list?
* (REG, contributed by thun...@xs4all.nl) When following up
a post on the kernel mailing list, please think before you quote.
Since everybody else on the list also got the original post, don't
quote it entirely. Quote only the points that are really needed to
understand your arguments. Make sure the quoted part is recognizable
as such, by ensuring each quoted line starts with a > (or more >>, in
case of multi-level quoting). Don't quote signatures, entire patches,
entire config files or entire posts. Don't quote the standard
signature. The kernel-list is crowded enough already, let's take care!
* (REG) Be aware that your message is far more likely to be
deleted without being read if you have too much quoted material before
your reply.
* (REG) And please reply after the quoted text, not before
it (as per RFC 1855). It's very confusing to see a reply before the
quoted context. And it's embarrassing: it makes you look like a
newbie. Change your mailer if necessary, if the one you have makes it
hard to do reply-after-quoting.
I know some people like to quote the entire message they
are replying to, so they put their reply right at the top so people
won't give up after the first page of quoted material. Don't do it.
It's annoying. Just learn to stop quoting everything. No-one wants to
see it all anyway (list archives allow people to see everything if
they missed it). You're not helping yourself anyway, as you're more
likely to be ignored if you reply-before-quoting.
* (REG) Please don't use tabs or multiple spaces to quote
text. Use the "> " sequence instead. Using whitespace to quote text
makes it difficult to differentiate between what's quoted and the
reply. And don't try to be cute or "different" and use some other
character like "}" or whatever. Again, it's confusing. It wastes
people's time. Write for maximum efficiency of reading.
* (REG) Please try to have halfway reasonable spelling and
grammar. When reading text with really bad spelling or grammar, people
stall while trying to parse your post. Don't think you're being
"artistic" by stripping out all punctuation characters. Linux-kernel
is not an online gallery, it's a communications medium. Write for
maximum efficiency of reading.
* (REG) Please don't have long, inflammatory, controversial
or offensive signatures (see RFC 1855). The rule of thumb is no more
than 4 lines of 80 characters each.
* (PG) Don't attach huge files to your post. One major
culprit is people attaching their kernel .config file to their post.
These can be in excess of 1000 lines, and will grow more as kernel
options are continuously added. If the contents of your .config file
are relevant to your post then attach the output of

grep ^C .config

or

grep "=[ym]" .config

.
* (MEA) Some structures are forbidden as they appear to be
used way too much in SPAM mail. Specifically, messages with
Content-Type: text/html either as the only (primary) message, or as ANY of
the component sub-messages, are considered spam and rejected outright
without any info to the sender.
Also, any message with header matching the regular
expression: X-Mailing-List:.*@vger.kernel.org is considered to be
LOOPING somewhere, and is thus diverted to list-owner.
* (REG) If you are stuck using Microsoft Outlook or
Outlook Express, which have flawed quoting algorithms, you should
apply one of the following fixes:
o http://home.in.tum.de/~jain/software/oe-quotefix/
for Outlook Express
o http://home.in.tum.de/~jain/software/outlook-quotefix/
for Outlook
These fixes make these mailers more standards compliant.
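As a rough illustration of the .config advice above, here is a minimal Python sketch that does much the same job as the second grep: it keeps only the options set to y or m. The names enabled_options and sample are made up for this example, and the sample text is invented rather than taken from a real kernel tree. Unlike the grep, it matches the whole line, so a string option whose value merely contains "=y" or "=m" is not picked up.

```python
import re

# Matches options that are built in (=y) or built as modules (=m).
ENABLED = re.compile(r"^CONFIG_[A-Za-z0-9_]+=[ym]$")


def enabled_options(config_text):
    """Return the lines of a kernel .config that enable an option."""
    return [line for line in config_text.splitlines() if ENABLED.match(line)]


# Invented sample input, only to show what is kept and what is dropped.
sample = """\
#
# Automatically generated file
#
CONFIG_X86=y
# CONFIG_SMP is not set
CONFIG_EXT2_FS=m
CONFIG_LOG_BUF_SHIFT=14
"""

print("\n".join(enabled_options(sample)))
```

Numeric and string settings such as CONFIG_LOG_BUF_SHIFT are dropped here; if they matter to your report, the simpler "grep ^C" form keeps them.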
10. Is the list moderated?
* (ADB) No, the linux-kernel list is not moderated.
11. Can I be ejected from the list?
* (ADB) It is technically possible, but I have never heard
of anybody being ejected from the linux-kernel list.
* (REW) But you will if you post questions or answers that
are asked and answered on this FAQ. ;-)
* (MEA) Oh definitely, all you need to have is a
malfunctioning email system which does not accept email to you -- e.g.
check your domain backup MX servers by using the tool at:
http://vger.kernel.org/mxverify.html
It is known that over the years the keepers of vger's
lists have removed some people after getting sufficiently annoyed with
them, but there you really must try to exceed yourself, and will
likely get lots of peer pressure before getting kicked off.
Another way to quickly get yourself removed is to use the
program called "fetchmail" -- which in itself is not all that bad, but
apparently it is far too easy to accidentally re-post email to
addresses which the visible RFC 822 headers contain -- that is, what
the original sender used, like: To: linux-...@vger.kernel.org The
result is duplicate messages on the mailing list. If you let that
happen, you can be quite sure that your subscription will be removed
as soon as possible.
12. Are there any implicit rules on this list that I should be aware
of?
* (ADB) Here are a few implicit rules which you should be
aware of:
o Stick to the subject. This is a Linux kernel list,
mainly for developers.
o Use English only!
o Don't post in HTML format! If you are using IE or
Netscape, please turn off HTML formatting for your posts to the kernel
list.
o If you use that other OS, make sure your mailer
doesn't use Charset="Windows*" as those posts will be blocked.
o If you will be asking a question, before you post to
the list, try to find the answer in the available documentation or in
the list archives. Just remember that 99% of the questions on this
list have already been answered at least once. Usually the first
answer is the most detailed, so the archives contain far better
information than you will get from somebody who has answered the same
question a dozen times or more.
o Be precise, clear and concise, whether asking a
question or making a comment or announcing a bug, posting a patch or
whatever. Post facts, avoid opinions.
o Be nice, there is no need to be rude. Avoid
expressions that may be interpreted as aggressive towards other list
participants, even if the subject being treated is particularly
relevant to you and/or controversial.
o Don't drag on with controversies. Don't try to have
the last word. You will eventually have the last word, but meanwhile
you'll have lost all your sympathy credit.
o A line of code is worth a thousand words. If you
think of a new feature, implement it first, then post to the list for
comments.
o It's very easy to criticize someone else's code, but
when you write something for the first time, it's not that simple. If
you find a bug, a mistake, or something that could be perfected, don't
immediately post a comment such as "This piece of code is crap, how
did it get into the kernel?". Contact the author of the code, explain
the issue, and try to get the point across in a simple, humble way. Do
that a few times and you will get a lot of credit as a good code
debugger. Then when you write a piece of code people will pay
attention to you.
o Don't flame beginners that ask the wrong questions.
This adds noise to the list. Send them a private mail pointing them to
a source of information e.g. this FAQ.
* (MEA) If you post HTML, your email won't make it to the
lists (see section 3.9).
* (RC) Ensure your email doesn't match any of the regular
expressions in vger's Majordomo's taboo list of regular expressions
else it will be silently dropped. This matches seemingly innocuous
words like `Deutschland' as in `Sitecom Deutschland GmbH'.
* (REG) Don't post any religious or political material,
including in your signature. Doing it in the body of a message will
anger people, as it's always off-topic and is a waste of bandwidth
(remember that even in the 21st century, many people are still being
gouged by the second for bandwidth by their ISP or telco or both).
Including this unwanted material in your signature is less
obnoxious, but is pointless at best (preaching to the converted). Most
people will ignore it, and many will be prone to ignore the content of
your message, recognising you are a wanker. If you want to be taken
seriously, leave the soap-box at home. Limit your posts to technical
issues.
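Several of the rules in this section (no HTML parts, no Windows charsets, no message that already carries the list's X-Mailing-List header) can be checked mechanically before you hit send. The following is only a sketch using Python's standard email package. The function name posting_problems and the example.org addresses are invented for illustration, and the checks are a reading of the rules described here, not a copy of vger's actual filters.

```python
import re
from email import message_from_string


def posting_problems(raw_message):
    """Return a list of reasons the message may be rejected by the list."""
    msg = message_from_string(raw_message)
    problems = []

    # HTML anywhere in the message, even as one part of a multipart mail.
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            problems.append("contains a text/html part")
            break

    # Charsets whose name starts with "windows" are said to be blocked.
    for part in msg.walk():
        charset = part.get_content_charset()
        if charset and charset.lower().startswith("windows"):
            problems.append("uses a Windows charset")
            break

    # A list header that is already present suggests the mail is looping.
    for value in msg.get_all("X-Mailing-List", []):
        if re.search(r"@vger\.kernel\.org", value):
            problems.append("already carries an X-Mailing-List header")
            break

    return problems


# Invented test messages.
clean = (
    "From: fred@example.org\n"
    "Subject: test\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "\n"
    "Hello\n"
)
html = (
    "From: fred@example.org\n"
    "Subject: test\n"
    'Content-Type: text/html; charset="Windows-1252"\n'
    "\n"
    "<p>Hello</p>\n"
)

print(posting_problems(clean))
print(posting_problems(html))
```

An empty list does not guarantee acceptance: the taboo list of regular expressions mentioned above is not reproduced here.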
13. How do I post to the list?
* (REG) You send a message to the address linux-
ker...@vger.kernel.org
14. Does the list get spammed?
* (ADB) The linux-kernel list is no longer spammed; you will
rarely if ever find a commercial posting to the list itself. OTOH once
you post to the list, expect to get a few undesirable mails in the
following days. Unfortunately some people watch the list and think
it's a good idea to pick names from it. There are many ways to avoid
spam, check the dedicated anti spam sites on the list. I learned many
things this way.
* (REW) Although the list maintainers do their best to keep
the list spam free, it is not possible to do this 100%. Some of the
good kernel development people cannot keep up with the volume on
linux-kernel. But they do occasionally post. Therefore we need to keep the
submissions open for "everybody". Some of the other important people
have two or three Email addresses. They too need to post from
different addresses. Consequently something that looks like a
submission from a valid return address tends to go on the list. There
is nothing an automated filtering system can do about it.
The end result is about one spam a month. It happens. The
maintainer will get a flood of mail about it and he will block the
domain it came from. Please don't bother the list about it, don't add
noise. Don't post "This guy is a jerk if he spams this list". Don't
post "I traced him, you can mail bomb him at this address". Don't post
"I traced him, bother his postmaster at such and such".
15. I am not getting any mail anymore from the list! Is it down or
what?
* (ADB) Majordomo is an intelligent mail list server. If for
any reason your email address is unavailable, after some retries you
will be automatically unsubscribed.
* (REW) On the other hand, accidents with the mailing list
server have happened. These have wiped out the whole subscription list
once or twice. Just resubscribe. Majordomo will send you a nice note
saying you're still subscribed if suddenly everybody went quiet. Don't
post "Just testing: Is the list working? I didn't get any mail for a
few days now".
* (MEA) You may get unsubscribed because MTAs relaying
traffic to you get bounces for some reason. One thing to verify is
that your email routing data in the DNS is valid, e.g. feed your
address to the input box at: http://vger.kernel.org/mxverify.html
* (MEA) VGER and/or one of its fanout boxes may be in
overload. Usually system keepers notice the situation, and it becomes
fixed within 1-2 days without messages being lost, but we don't track
the entire world. Asking help from postm...@vger.kernel.org could
expedite the issue. Asking help on lists WILL NOT help, doing so just
puts more load on the system!
16. Is there an NNTP gateway somewhere for the mailing list?
* (RRR) Yes there is the newsgroup fa.linux.kernel, but you
can only read the mailing list there, not post directly. Posting to
the list must go by email to linux-...@vger.kernel.org.
Here's the dejanews URL, if your NNTP host doesn't have the
group: http://www.dejanews.com/bg.xp?level=fa.linux.kernel
* (REG) Unfortunately not all news servers have the fa
hierarchy. You can access fa.linux.kernel by going to
http://groups.google.com/groups?hl=en&q=fa.linux.kernel&meta=
* (REG, contributed by Gunter Ohrner) Yes, GMANE offers a
bidirectional gateway at nntp:gmane.linux.kernel at server
news.gmane.org with additional web access at http://blog.gmane.org/gmane.linux.kernel.
17. I want to post a Great Idea (tm) to the list. What should I do?
* (REG) OK, that's great. Now:
o First make sure that your idea is relevant to kernel
development. Perhaps your idea is better implemented in the C library,
or perhaps in a new library? Before posting to linux-kernel, be sure
it really is a kernel issue.
o OK, so you have this great idea for the kernel. Are
you sure someone hasn't thought of it before? Reading all of this
document is a good starting point. Also search the mailing list
archives to see if that topic has been raised before.
o Now you have verified that you have an idea no one has
suggested before. For the best response, code up an implementation/
kernel patch and post that to the kernel list when you announce your
idea. If you provide code, you can be sure someone will try it out and
give you comments. If you don't know anything about kernel hacking,
this is a good time to start learning:-) By the time you've
implemented your idea, you'll be able to call yourself a Linux Guru.
o If you really can't code something up, but still
would like your idea implemented, post a message to the kernel list.
Be as clear and precise as possible, so that people can understand
your idea quickly. If you are lucky, someone who likes your idea may
find the time to implement it. If nobody steps forward to implement
it, you're out of luck: remember, we're all volunteers and we all have
too many things to do as it is.
o If you get a negative response to your idea, don't
get offended, after all, we all have different notions on what is a
Good Idea (tm) and a Bad Idea (tm). If someone is rude to you, please
resist the temptation to carry on a war on the list. Instead, email
them privately saying that you don't like their rudeness. If everybody
is polite, but just strongly disagrees with your idea, be careful not
to push it too hard. If people haven't understood the point you are
making, try explaining it a different way. But if people understand
your idea but maintain it is flawed, it's time to stop pushing it.
Pushing harder will just get you ignored.
o If you're convinced you're right, despite what
everybody else says, stop talking about it and implement it! If you're
right, you'll have the last laugh.
* (ADB) Good code (i.e. documented, elegant, efficient) and
some benchmarking data showing your Great Idea performs well will go a
long way to show you're right.
18. There is a long thread going on about something completely
offtopic, unrelated to the kernel, and even some people who are in the
"Who's who" section of this FAQ are mingling in it. What should I do
to fight this "noise"?
* (REW, ADB) Ignore it.
* (REG) Don't send a response to the kernel list under any
circumstances. If you feel compelled to respond, do so privately
informing the person that the message was offtopic. Or set up a
procmail recipe to drop all messages for that thread: that way you'll
never see the thread again.
19. Can we have the Subject: line modified to help mail filters?
* (REG) The usual proposition is that a string like
[LINUX-KERNEL] is prepended to the subject line.
This question has been raised many times before, and the
answer has always been "no" or "there are better ways to filter
email". The majority of the developers, and all (?) of the list
maintainers take this position. Some of the reasons are:
o It would increase the size of the Subject: line.
This is a problem, as it limits the amount of useful information that
can be seen in the Subject: line, making it harder to scan through a
list of subject lines looking for interesting subjects.
o It can lead to the Subject: line from hell, since
some mailers and users don't behave sanely. Imagine the following:
RE: [LINUX-KERNEL] Re: [LINUX-KERNEL] RE: [LINUX-KERNEL] Re: [LINUX-KERNEL] Critical security flaw in 2.666.0
That's a lot of characters. The useful information
will very likely be lost due to line truncation.
o It doesn't work for cross-posted messages, as the
subject line for a single email will change depending on which list it
was sent via. Not only can this confuse simple-minded filtering
recipes, it can also break threaded mail readers (people may end up
reading the same message twice).
o Cross-posting will make the Subject: line from hell
problem more frequent. Imagine the following:
RE: [LINUX-KERNEL] Re: [LINUX-SCSI] RE: [LINUX-KERNEL] Re: [LINUX-SCSI] RE: [LINUX-KERNEL] Re: [LINUX-SCSI] Critical security flaw in 2.666.0
See? It just gets worse. Give it up, Subject: line
modification is a bad idea.
o The correct way to filter is to base your recipe on
the X-Mailing-List: line, which should always have "linux-
ker...@vger.kernel.org".
An example procmail recipe would look like this:

# Linux-kernel list
:0: /var/lib/emacs/lock/!home!fred!mfilter!linux!kernel
* ^X-Mailing-List:.*linux-kernel@vger\.kernel\.org
/home/fred/mfilter/linux/kernel

People subscribed to linux-kernel-
dig...@lists.us.dell.com, which uses GNU Mailman, may want to use
something like this:

# linux-kernel-digest
:0
* ^X-BeenThere: linux-kernel-digest@lists\.us\.dell\.com
/home/fred/mfilter/linux/kernel-digest

People using mailagent might try this in
their .rules file (thanks to Martin Smith
<mar...@sharrow.demon.co.uk>):

To CC: /linux-...@vger.kernel.org/
{ SPLIT -adi ~/Kernel }

Similarly to procmail you can omit the mail folder
from the split command. This causes the split messages to go back into
the mailagent queue for further processing.

Most mailers with filtering capabilities can be
similarly configured. If not, then you can simply install procmail. If
perchance you're running a damaged OS that can't filter properly, and
there is no procmail port for it, then you should either upgrade, or
accept that you won't be able to filter linux-kernel. Don't bother
asking for a subject line modification.
o If you really want to get the feel of a toy mailing
list, you can write a procmail recipe which will modify the Subject:
line.
An example procmail recipe would look like this:

# Linux-kernel list
:0 fhw
* ^X-Mailing-List:.*linux-kernel@vger\.kernel\.org
| sed -e 's/^Subject: /Subject: [TOY-LINUX-KERNEL] /'

Warning: if you do this, be careful to edit your
Subject: line when replying to messages from the list, otherwise you
risk being ignored or kill-filed.
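For readers whose mail setup is scripted rather than driven by procmail, the header-based filtering described above can be sketched in a few lines of Python. This is an illustration only: the function name folder_for, the folder names and the sample messages are all invented, and the two header values are simply the ones the procmail recipes above match on.

```python
from email import message_from_string

# (header name, text to look for, destination folder); folders are made up.
RULES = [
    ("X-Mailing-List", "linux-kernel@vger.kernel.org", "linux/kernel"),
    ("X-BeenThere", "linux-kernel-digest@lists.us.dell.com",
     "linux/kernel-digest"),
]


def folder_for(raw_message, default="inbox"):
    """Pick a folder from the list headers, never from the Subject: line."""
    msg = message_from_string(raw_message)
    for header, needle, folder in RULES:
        for value in msg.get_all(header, []):
            if needle in value.lower():
                return folder
    return default


# Invented sample messages.
from_list = (
    "X-Mailing-List: linux-kernel@vger.kernel.org\n"
    "Subject: Re: some thread\n"
    "\n"
    "body\n"
)
private = "Subject: [LINUX-KERNEL] not really from the list\n\nbody\n"

print(folder_for(from_list))
print(folder_for(private))
```

Because the decision is based on a header the list server adds, it keeps working for cross-posted mail and is not fooled by whatever a sender types into the Subject: line.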
20. Can we have a Reply-To: header automatically added to the list
traffic?
* (DW) Some mailing lists automatically add a Reply-To:
header to the mails which go through them, forcing people to reply to
the list, rather than replying personally to the original poster. This
is a bad idea for many reasons which won't be listed here. See Chip
Rosenthal's excellent summary Reply-To: Munging Considered Harmful for
more explanation.
21. Can I post job offers/requests to the list?
* (REG) Of course not! This is a technical development list,
not a job exchange. You may find this site useful: http://www.hotlinuxjobs.com/
22. Why do I get bounces when I send private email to some people?
* (REG) This could be for a variety of reasons, such as
temporary problems with mail delivery. Your email may also be blocked
(permanently rejected) by that individual or their ISP. This often
happens if you send email from a machine or domain which is listed in
the MAPS RBL, DUL and ORBS lists. These lists have been set up to
protect people against spam. See http://www.mail-abuse.org/ for more
information on these lists.
NOTE that these lists aren't trying to block you
personally, they are trying to block known spammers or spammer-
friendly sites (RBL and ORBS), or uncontrolled dial-up users (DUL). If
you are being blocked, it probably means you have the misfortune to be
using an ISP that is not a good net citizen and thus has been added to
the RBL or ORBS lists. In some cases, you may be blocked because your
ISP has volunteered their dial-up IP address ranges to the DUL, in
which case you should be using their approved mail relay rather than
sending email out directly from your host.
You must NOT post a message to the kernel list about this,
as the people there cannot and will not help you. Nor should you use
the list as a means of getting a message through to the individual you
are trying to contact. This is not what the list is for.
If you are intent on making a fool of yourself in public,
follow the same path as too many others before you, and complain on
the kernel list about how unfair it is that you are being blocked
because your ISP is bad. Expect sympathy from some, flames from others
and silence from most. The net gain will be that your mail will still
be blocked by the anti-spam lists, many people will ignore you in
future emails (because you've made a fool of yourself), and you may
find yourself in the killfiles of some people (i.e. you personally are
being blocked because some people are fed up with you and don't want
to hear anything more from you).
If you actually want your mails to no longer be blocked,
get your ISP to clean up their act, or switch to a decent ISP. If you
are required to use your ISP's mail relay, but it is crippled somehow,
complain to your ISP or switch to one with competent staff.
If your ISP is unresponsive and you don't have an
alternative ISP you could switch to, you'll just have to accept that
an increasing fraction of people will block your email (as more and
more people subscribe to the anti-spam lists). There's no point in
shouting at the people who are defending themselves against spam (no-
one is obliged to receive any and all email), go pester the spammers
instead.
23. Why don't you split the list, such as having one each for the
development and stable series?
* (REG, by "hacksaw") It's true that the lkml is a high
traffic list and can be a lot to handle. However, splitting the list
wouldn't help, since most developers would just subscribe to both
lists. In fact, there would then be extra traffic, because of the
number of issues that hit both the development and stable kernels, or
even farther back!

Section 4 - "How do I" questions

1. How do I post a patch?
* (ADB) I assume you made the patch following the general
instructions found above. Now write a short post describing your
patch, the version of the kernel it applies to, your tests, the
feedback you would like to get, etc. This should fit in 10 lines.
Attach your patch and a one line README file describing it very
succinctly, and mentioning your name and email (either as two ASCII
files or as a MIME encoded tarball). In the subject of your post, put:
[PATCH] <the driver name or piece of code patched>, kernel <kernel
version>. Send. Wait.
The small README file ensures that your patch will not
start circulating around the net without people noticing your name. If
you don't care about copyright and/or your patch is trivial, you can
skip tarring the files, just gzip the patch file and attach it to your
post.
* (REG) Note that Linus does not read linux-kernel very
much. So if you want him to see a patch, you will need to send it to
him directly (say by Cc:ing him if you post to the list). Note that
Linus likes to be able to read patches in plain ASCII, so anything
that is uuencoded or MIMEd is likely to go straight to the bit-bucket.
If because your patch is large you only send a URL, send a plain-text
copy to Linus privately.
Also note that Linus drops patches silently when he is too
busy (which is always:-), so if you don't see it in the next kernel
patch, send it again. Oh, and don't expect him to tell you he's
applied the patch, either.
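The subject convention and the plain-ASCII preference described above are easy to check with a script before posting. The following Python sketch is only an illustration; the helper names patch_subject and is_plain_ascii are made up, and the driver name used in the example is just a placeholder.

```python
def patch_subject(component, kernel_version):
    """Build a subject of the form: [PATCH] <component>, kernel <version>."""
    return "[PATCH] {}, kernel {}".format(component, kernel_version)


def is_plain_ascii(text):
    """Return True if the text can be sent as plain 7-bit ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


# Placeholder component and version, purely for illustration.
print(patch_subject("3c59x", "2.4.7"))
print(is_plain_ascii("--- a/file.c\n+++ b/file.c\n"))
```

If is_plain_ascii() returns False for your patch, look for stray non-ASCII characters in comments or in your name before sending.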
2. How do I capture an Oops?
* (REG, quoting Keith Owens) If an Oops is recoverable then
the text appears first in the kernel message buffer (/proc/kmsg). You
can use the dmesg command to print the contents but most of the time
klogd and syslogd will automatically capture the Oops and write it to
your log files.

Sometimes an Oops is so bad that the kernel is completely
hung. When this occurs, almost anything that requires kernel support
is also dead. In particular most interrupt driven subsystems are
unusable, especially after the dreaded "Aiee, killing interrupt
handler" message. Since most disk controllers use interrupts, no disk
I/O is possible so the Oops does not get written to the log files. The
same problem applies to logging over the network, most network cards
require interrupt handlers.

In a complete hang, you have four options.
o Write the Oops down by hand from the screen and type
it in after you have rebooted. This is the only option if you have not
planned for a kernel hang.
o If you plan ahead and install a serial console
linked to another machine (read
linux/Documentation/serial-console.txt) then you can capture the Oops report on the other
machine. By far the easiest and most reliable option.
o Since kernel 2.3.10 it has also been possible to use
a parallel port line printer as a console. You can either attach a
real printer, or another computer with EPP (Enhanced Parallel Port)
support, which pretends to be a printer.
o There have been patches on linux-kernel to save the
log somewhere in hardware. Unfortunately these patches are very
hardware specific. Search the l-k archives for "Oops assist", "OOPS
output over reboot" and "KMSGDUMP". Most of these patches require that
the keyboard still works and even that can be useless when the kernel
hangs.
Other operating systems can save the log even when the
machine hangs, why doesn't Linux? Any OS that can save the log after a
catastrophic kernel failure must do so without kernel support, that
typically means using the underlying hardware. Alas the ix86 hardware
does not provide enough support for this, in particular most BIOS will
clear memory on reset, destroying any data in storage.
3. How do I post an Oops?
* (ADB) Assuming you have found a genuine Oops (those are
rare nowadays, but they happen), you should post the relevant portions
of your system log, kernel configuration file and kernel symbol map,
plus a description of your hardware and the circumstances under which
the Oops occurred. Can the Oops be triggered by any particular method?
Did it happen after you changed any part of your hardware
configuration?
Don't post your oops report before you have checked
linux/Documentation/oops-tracing.txt, the relevant paragraphs in
linux/README, the ksymoops C program in linux/scripts/ksymoops which has
another README, and the gdb man and info pages (thanks to Paul Kimoto
for this tip). These documents describe the basic procedure for kernel
oops tracing. Good trace info makes it much easier to understand and
solve apparently weird oopses.
* (REG) Don't even bother posting an Oops if you haven't run
it through ksymoops to decode the symbol addresses. The report will be
ignored because it contains too little useful information.
Make sure you copy the correct System.map file into /boot
or into the modules directory, otherwise you will get incorrect
results.
* (REG, quoting "The Doctor What") There are some situations
that make a kernel oops useless. The two most obvious are if you are
overclocking your CPU or running VMWARE's vmmon. The reason is that
overclocking can introduce random bit errors, while VMWARE's vmmon has
the ability to (and does) change parts of the kernel. In both cases,
data in the kernel, as reported by the oops, won't be useful.
4. I think I found a bug, how do I report it?
* (ADB) A bug differs very slightly from an oops, actually.
An oops is when the kernel detects that something has gone wrong. A
bug is when something (in the kernel, presumably) doesn't behave the
way it should, either with a driver or in some kernel algorithm. If
you can detect this misbehaviour, you may or may not be getting an
oops.
Perhaps the most important step is to determine under
which conditions this misbehaviour can be triggered, and whether it is
reproducible.
5. What information should go in a bug report?
* (ADB) Does it affect system security? Is it related to a
specific driver/hardware configuration? Did you manage to identify the
piece(s) of kernel code concerned? It really depends on the kind of
bug you found.
* (TYT) Please follow general good bug reporting guidelines:
remember, the developers don't have access to your system, and they're
not mind readers. Tell us which kernel version, and what your hardware
is (if you're not sure, more detail is better than less). At the very
least, tell us what processor and motherboard you have, how much
memory, how many and what kind of disks (IDE, SCSI, etc.), what kind
of disk controllers you have, what other expansion boards (specify
whether they're PCI or ISA or some other bus). Also useful: what
version of gcc and binutils were used to compile the kernel.
Try to find a simple, reliable way to trigger the problem.
Telling the developer that they have to set up some complicated
application environment (especially if it involves some ghastly
expensive proprietary software like SAP or Oracle :-) may cause the
developer to hit the 'd' key and move on.
In general, raw data is better than jumping to
conclusions. If you want to give your guesses in your bug reports,
they're of course welcome, but this is not a substitute for raw data.
Many problems are not what they first seem. A hardware problem can
masquerade as a VM problem. A device driver or VM problem can cause
the filesystem code to notice a discrepancy, and flag a warning. Even
if you're sure that the problem isn't a hardware problem, or some
other theory that the developer advances, the scientific method
demands that you do a test to rule these sorts of things out.
Sometimes, you will get surprised.....
If you get a kernel oops message, it's useless unless you
give us the proper symbolic information. This used to mean sending
relevant pieces out of System.map. Fortunately, with the latest
syslogd/klogd, this is much simpler (check the man page of klogd to
see if your version has this feature; if it doesn't, you should
upgrade to the latest version, and probably to a modern distribution).
Make sure that you have the System.map file installed in the appropriate
place so that klogd can find it (the standard search path is the
/boot, /, and /usr/src/linux directories).
If the system oopses and then dies without a chance for
klogd to record the information into a syslog file, copy down the oops
message exactly, and then use the ksymoops (see the man page) to get
the symbolic information out. Remember, the raw numbers by themselves
will generally not be useful.
If you can, try to isolate the problem to a specific
kernel version. Knowledge that it worked in version 2.2.17, as well as
2.3.0-test6, but it stopped working in 2.3.0-test7-pre1, is extremely
helpful, and will save developers a lot of time. (If you're
comfortable dissecting patches, feel free to take apart the individual
file changes and try to isolate the problem to a particular change.)
* (REG) You did of course read REPORTING-BUGS from the
kernel source tree first, didn't you?
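Isolating a problem to a specific kernel version, as suggested above, need not mean testing every release in turn. If you know one version that works and a later one that fails, a binary search gets there in far fewer builds. The Python sketch below shows the idea under stated assumptions: first_bad is a made-up name, the version list is invented, and is_bad stands in for whatever manual test you run after booting each kernel.

```python
def first_bad(versions, is_bad):
    """Binary search for the earliest version where is_bad() is true.

    Assumes versions is ordered oldest to newest, versions[0] is good,
    versions[-1] is bad, and once the problem appears it stays in every
    later version.
    """
    lo, hi = 0, len(versions) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid
        else:
            lo = mid
    return versions[hi]


# Invented example: seven releases, with the problem appearing in the fifth.
versions = ["2.4.%d" % n for n in range(1, 8)]
tested = []


def is_bad(version):
    tested.append(version)  # record which kernels had to be built and booted
    return versions.index(version) >= versions.index("2.4.5")


print(first_bad(versions, is_bad), "found after testing", tested)
```

The same reasoning applies inside a single patch: split the file changes in half, test, and repeat on the half that still shows the problem.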
6. I found a bug in an "old" version of the kernel, should I report
it?
* (CP) Only if it hasn't been fixed yet. The best thing to
do is to try to repeat it with a new version of the kernel. If you can't,
you have to figure out if it's been fixed yet. The kernel release
announcements and patch descriptions from Jitterbug are also useful.
Failing that, look for discussion of the bug in linux-kernel and check
the patches between your kernel and the latest ones for relevant
changes.
If you can't find your bug mentioned, and you're not
running a truly ancient kernel, posting a bug report is worthwhile.
You can probably expect a request of the form "try it with the latest
kernel" or "try it with this patch" in response. If there's a reason
why you can't run the latest kernel (like it's your main dialin server
and you don't want to mess with it), saying it in your original report
will save some explaining later.
7. How do I compile the kernel?
* (REG) See the Kernel HOWTO for some information. Also,
there are people at http://www.kernelnewbies.org/ who are usually
willing to help.
* (William Stearns) The Buildkernel script walks you through
an entire kernel build, including downloading the necessary files,
patching the source, building the kernel and modules, installing the
lot into lilo, and optionally building pcmcia-cs, cipe, and freeswan
code for that kernel. Download and install either the tar or rpm
version of the script, and run one of the following commands:

buildkernel NEWESTSTABLE #To build the most recent stable
kernel.
buildkernel NEWESTBETA #To build the most recent beta
kernel.
buildkernel 2.4.7 #If you know the version you wish
to build.

8. How do I check if the running kernel is tainted?
* (Kalin Kozhuharov) The short answer is:

cat /proc/sys/kernel/tainted

If it is "0", then the running kernel is NOT tainted.
Otherwise - it is tainted.
See the Tainted kernel document from Novell/SUSE Linux for
more information.
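If you want to make the same check from a script, for instance before generating a bug report, a minimal Python sketch could look like this. The function names are made up for this example, and the only rule it applies is the one stated above: zero means not tainted, anything else means tainted.

```python
def is_tainted(value):
    """Interpret the text read from /proc/sys/kernel/tainted."""
    return int(value.strip()) != 0


def running_kernel_is_tainted(path="/proc/sys/kernel/tainted"):
    """Read the taint value of the running kernel (Linux only)."""
    with open(path) as f:
        return is_tainted(f.read())


# The parsing can be tried without touching /proc at all.
print(is_tainted("0\n"))
print(is_tainted("1\n"))
```

Splitting the parsing from the file access keeps the sketch usable on machines where /proc/sys/kernel/tainted does not exist.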

Section 5 - "Who's who" questions

1. Who is in charge here?
* (ADB) Do you mean "Who takes decisions relative to the
mailing list?" or do you mean "Who takes decisions relative to the
Linux kernel"? If the former: there is relatively little to decide
when it comes to the mailing list. Majordomo, once correctly set up,
will manage the list in an autonomous fashion. In any case, you can
always reach the Majordomo-owner for the list, if you have a very
specific question about the list mechanism itself. When it comes to
kernel development management and decision making, see the answer to
Question 7.8 below.
2. Why don't we have a Linux Kernel Team page, same as there are
for other projects?
* (ADB) Perhaps because there is no Linux Kernel Team, per
se. Also because so many people contributed to the Linux kernel that
it would be a tough task to set up and maintain such a page. Finally,
although this is not a rule, most Linux kernel contributors prefer to
keep a low profile, for various reasons.
3. Why doesn't <any of the below> answer my mails? Isn't that rude?
* (ADB) Probably because of sheer lack of time to answer
each email that gets sent to them. What would you do if you got 1000
mails in your mailbox, from one day to the next? They don't mean to be
rude, however.
One hint: if you attach to your mail a genuinely useful
piece of good quality code that you wrote, there is a good chance that
it will be answered (choose a good subject line, too). If you ask a
dozen beginner's questions, the truth is that there is zero chance
you will get even the simplest reply pointing to some source of
information.
Aside from that, you may get "mail rejected" error
messages if you try to contact some major contributors of the list. It
is due to the spam filtering systems they use. Please complain
about it to your ISP, and don't post to the list about spam!
* (REG) Some people also have very aggressive mail filtering
which rejects (non-list) messages from people they don't know, asking
for a re-send with a password (this stops SPAM dead). If you mail to
someone and receive such an automatic response, don't get upset.
Remember, a person's mailbox is their personal property.
Also, some people maintain "guru lists" and only read
posts on linux-kernel by someone on their guru list; other people's
posts go to /dev/null. This is done because there are too many
questions asked on linux-kernel which shouldn't be (which is why
people should read this FAQ first!), and people can't cope with the
load. If you post to the list and want to make sure a specific
individual will see the message, Cc: that person.
4. Why do I get bounces when I send private email to some of these
people?
* (REG) Some people, like Alan Cox, bounce messages. Read
this to find out why and what you can do about it.
5. Who is Matti Aarnio?
* (MEA) He is principally a ZMailer hacker, and a co-
postmaster of vger.kernel.org.
Sometimes he also finds cycles to hack on the kernel, and
you see some patches from him. (e.g. initial work on Large File
Summit; files over 2G in size, was his)
6. Who is H. Peter Anvin?
7. Who is Donald Becker?
8. Who is Alan Cox?
* (AC) Alan Cox supervises the 2.0.34/35/36 kernel releases,
works on the Mac68K port, the SGI port, 2.0 networking, modular sound,
video capture and helps collect up and sort patches to the kernel. He
gets to do all this and sleep because the nice guys at Red Hat pay him
to hack Linux.
9. Who is Richard E. Gooch?
* (REG himself) "I've written various utilities and kernel
patches which you can find here including the MTRR, devfs and fastpoll
patches. My PhD in Computer Science was on the topic of Astronomical
Visualization, which is my current research interest. This is what I
work on when I don't get distracted by kernel hacking. See my home
page to find out more about me."
10. Who is Paul Gortmaker?
* (ADB, OK'ed by Paul) Paul has contributed various pieces
of kernel code over the last few years, among other things the Real
Time Clock driver. He is also the maintainer of the 8390 based network
drivers (NE-2000, etc.), and wrote the Linux Ethernet HOWTO and the
Boot-Prompt HOWTO.
11. Who is Mark Lord?
12. Who is Larry McVoy?
13. Who is David S. Miller?
* (DSM) David Miller is mainly known for the porting work he
has done, primarily for the 32-bit and 64-bit Sparc platforms although
he has made significant contributions to the MIPS effort as well. He
is also the current maintainer of the IP networking layer in the
kernel and likes to address general performance and scalability
problems all over as his time permits.
14. Who is Linus Torvalds?
15. Who is Theodore Y. Ts'o?
* (TYTSO) Theodore Ts'o has over the years written,
rewritten, or supported Posix Job Control, the high level tty driver,
the serial driver, the ramdisk support, e2fsck/e2fsprogs, and other
bits and pieces of code in and near the kernel. He is currently a
member of the Technical Board of Linux International. His day job at
MIT is concerned with Kerberos and other network security and I/T
architecture issues. He is also a member of the Internet Engineering
Task Force, where he serves as a member of the Security Area
Directorate.
16. Who is Rogier Wolff?
* (REW himself) "I wrote the kmalloc that still drives
linux-2.0.x. I wrote the Specialix and Olicom device drivers. I
currently write Linux device drivers for a living. Contact me if you
need one."

Other OS developers
Rogier Wolff (REW) suggested we add a section on OS developers who
influenced/preceded the design of Linux.

* Who is Prof. Douglas Comer?
o (Prof. Comer) Dr. Douglas Comer is a full professor of
Computer Science at Purdue University, where he teaches courses on
operating systems and computer networks. He has written numerous
research papers and textbooks, and currently heads several networking
research projects.
He has been involved in TCP/IP and internetworking since
the late 1970s, and is an internationally recognized authority. He
designed and implemented X25NET and Cypress networks, and the Xinu
operating system. He is director of the Internetworking Research Group
at Purdue, editor of Software - Practice and Experience, and a former
member of the Internet Architecture Board.
Dr. Comer completed the original version of Xinu (and
wrote "The Xinu approach" book) in 1979. Since then, Xinu has been
expanded and ported to a wide variety of platforms, including: IBM PC,
Macintosh, Digital Equipment Corporation VAX and DECStation 3100, Sun
Microsystems Sun 2, Sun 3 and Sparcstations, and Intel Pentium. It has
been used as the basis for many research projects. Furthermore, Xinu
has been used as an embedded system in products by companies such as
Motorola, Mitsubishi, Hewlett-Packard, and Lexmark. There is a full
TCP/IP stack, and even the original version of Xinu (for the PDP-11)
supported arbitrary processes and network I/O.
* Who is Richard M. Stallman?
o (RMS) Richard Stallman is the founder of the GNU project,
launched in 1984 to develop the free operating system GNU (an acronym
for "GNU's Not Unix"), and thereby give computer users the freedom
that most of them have lost. GNU is free software: everyone is free to
copy it and redistribute it, as well as to make changes either large
or small.
Today, Linux-based variants of the GNU system, based on
the kernel Linux developed by Linus Torvalds, are in widespread use.
There are estimated to be over 10 million users of GNU/Linux systems
today.
Richard Stallman is the principal author of the GNU C
Compiler, a portable optimizing compiler which was designed to support
diverse architectures and multiple languages. The compiler now
supports over 30 different architectures and 7 programming languages.
Stallman also wrote the GNU symbolic debugger (GDB), GNU
Emacs, and various other GNU programs.
Stallman received the Grace Hopper Award from the
Association for Computing Machinery for 1991 for his development of
the first Emacs editor in the 1970s. In 1990 he was awarded a
MacArthur Foundation fellowship, and in 1996 an honorary doctorate
from the Royal Institute of Technology in Sweden. In 1998 he received
the Electronic Frontier Foundation's Pioneer award along with Linus
Torvalds.
* Who is Prof. Andrew Tanenbaum?
o (Prof. Tanenbaum) Andrew S. Tanenbaum has an S.B. degree
from MIT and a Ph.D. from the University of California at Berkeley. He
is currently a Professor of Computer Science at the Vrije Universiteit
in Amsterdam, The Netherlands, where he heads the Computer Systems
Group.
His current research focuses primarily on the design of
wide-area distributed systems that scale to millions of users. These
research projects have led to over 70 refereed papers in journals and
conference proceedings. He is also the author of five books.
Prof. Tanenbaum has also produced a considerable volume of
software. He was the principal architect of the Amsterdam Compiler
Kit, a widely-used toolkit for writing portable compilers, and MINIX,
a small UNIX-like operating system for operating systems courses.
Prof. Tanenbaum is a Fellow of the ACM, a Senior Member of
the IEEE, a member of the Royal Netherlands Academy of Arts and
Sciences, and winner of the ACM Karl V. Karlstrom Outstanding Educator
Award.

Section 6 - CPU questions

1. What is the "best" CPU for GNU/Linux?
* (REW) There is no "best" CPU. The choice of CPU always
depends on your price/performance/technical requirements. On the x86
side, we have Intel, AMD, Cyrix and IDT/Centaur, with various models
available. All of these work.
Besides the x86 processors, the Linux kernel runs on 68k
processors, MIPS R3000 and R4000, Power PC, ARM, Alpha and Sparc
processors. There are lots of different ways to build a computer
around a processor. If you have an x86, they built a PC around it.
Don't go around buying second hand R4000 computers because the Linux
kernel runs on the R4000 processor. Check the latest Linux kernel
revision to see if the specific computer you're buying is supported.
* (ADB) OK, the Linux kernel is a good start. Now, there is
a huge difference between kernel support and a ready-to-install
distribution. Only four architectures have widely available,
reasonably homogeneous distributions: x86 (or i386), Alpha, Sparc and
Power-PC. And the Alpha and Sparc distributions that exist still have
some rough edges. IOW, if you don't want to spend a lot of time
installing and fine-tuning GNU/Linux, and you have a limited budget,
your "best" choice is an x86 machine. If you have very specific needs
(e.g. a hand-held computer running Linux, where the low power ARM
architecture would be the ideal choice, or a workstation dedicated to
scientific applications, where an Alpha or a Sparc would provide
superior performance), check the various architectures, list your
specific requirements, and make a choice. Nowadays Alpha 21164
machines are much more affordable than one or two years ago, but it's
certainly harder to put one together than your average PC clone.
2. What is the fastest CPU for GNU/Linux?
* (REW, ADB) The CPU field is very active in terms of
technological developments. New CPU models, new architectures, new
manufacturing technologies keep pushing the state of the art. WRT GNU/
Linux, it is a general consensus that Alpha machines usually provide
the best floating point performance, when the actually shipping
hardware available at any given point in time is compared (June 1998:
the 21164/600).
However for non floating point applications the issue is
not as clear-cut. Very high clock rate x86 machines (e.g. Pentium-II/
400) provide impressive integer performance, for use in e.g. databases
or Web server applications.
For 3D rendering applications you may want to consider the
GNU/GPL Mesa OpenGL compatible library, which has support for some
graphics accelerator chips.
Also note that some applications are not CPU bound. Check
the exact bottleneck in your case.
3. I want to implement the Linux kernel for CPU Hyper123, how do I
get started?
* (ADB) Is Hyper123 supported by gcc, or at least is the
Hyper architecture supported by gcc? Do you have a target machine with
a well defined architecture? If you have answered yes to both
questions proceed to REW's answer. If you have answered no to either
or both, don't even bother getting started. This is a major project,
not exactly the kind of thing you do over the weekend. Quoting from a
SparcLinux paper by Miguel de Icaza:
"Thanks to having an international team of developers and
support people, when the first Linux/SPARC distribution on CD went out
we had a very strong port: a port that had taken only 22 months to
engineer and complete (starting from scratch up to releasing the
operating system on a bootable CD-ROM)."
* (REW) Ouch. Difficult task. Besides having to write
support for the processor, you will also have to write the boot
sequence to get things going. And a few device drivers.
You're not running away screaming yet? Good. Make sure you
get the programmer's manual for Hyper123, and data sheets for all the
peripheral ICs. Make sure you have the docs for the computer that
you're working on (addresses, registers for the stuff on the
motherboard).
After that, start on learning the processor, by writing
the boot program. Try booting a simple program that says "hello
world". That will also allow you to write a console device driver.
Next, there is the hard part: get Linux to compile and run
on the processor. Make a new arch directory and start putting things
in there that implement whatever needs implementing on your processor.
4. Why is my Cyrix 6x86/L/MX/MII detected by the kernel as a Cx486?
* (RRR, ADB) Cyrix 6x86 CPUs are different in many ways from
Pentium (tm) and AMD K5/K6 (tm) CPUs, so special code must be included
for adequate CPU detection, setup and reporting. Cyrix 6x86 support
isn't perfect in kernels 2.0.x up to 2.0.34. From 2.0.35 on things
should get much better ('cause we're working on it ;) ). Similarly,
late 2.1.1xx kernels should fully support the Cyrix CPUs. Please check
the Linux Cyrix 6x86 HOWTO site for details and patches.
5. What about those x86 CPU bugs I read about?
* (ADB) There are basically three known bugs that affect x86
processors, and each CPU design got its fair share it seems:
o The Intel Pentium F00F "Death" bug, affects ALL
Pentium and Pentium MMX CPUs. Linus implemented the Intel recommended
workaround for this bug a few days after the bug was first reported in
the newsgroups. All recent kernels will report and work around the bug.
o The AMD K6 "sig11" bug, affects only a few K6
revisions. Was diagnosed by Benoit Poulot-Cazajous. There is no
workaround, but you can get your processor exchanged by contacting
AMD. 2.2.x kernels will detect buggy K6 processors and report the
problem in the kernel boot message. Recently, a new K6 bug has been
reported on the linux-kernel list. Benoit is checking into it.
o The Cyrix 6x86(Classic, L, MX) "Coma" bug, affects
ALL Cyrix 6x86 CPUs. I proposed a simple workaround which is
implemented as a user space boot option, a few hours after the bug was
reported on the linux-kernel mailing list. See the Linux Cyrix 6x86
HOWTO site for details. Cyrix was notified of the bug, and their new
MII CPUs are not affected by this problem anymore.
6. I grabbed the standard kernel tarball from ftp.kernel.org or
some mirror of it, and it doesn't compile on the Sparc, what gives?
* (DSM) Often the Sparc port diverges due to the sheer high
rate of changes which occur to that port. Also changes can happen to
major interfaces in the kernel and the Sparc port is not updated at
the same time. Eventually the Sparc port maintainers do try to merge
all of their work into the standard tree, at which time it will
compile. In any event, trees which will compile just fine are
available via two mechanisms, the vger CVS tree (accessible via read-
only anonymous CVS) and pre made tarballs of known working stable or
test kernel trees. Check:
o ftp://vger.kernel.org/pub/linux/README.CVS and
o ftp://vger.kernel.org/pub/linux/Sparc/kernel/v2.{0,1}/
7. Does the Linux kernel execute the Halt instruction to power down
the CPU?
* (REG, ADB) Yes. The Linux kernel will execute the Halt
instruction when the machine is idle (check the code for the idle_task
in sched.c). It has done so since the earliest i386 implementation,
even though on the i386 we didn't care about power saving; it's just
that halting the CPU is the Right Thing (tm) to do when there is no
other task that must be run.
On the Pentium, K6 and C6 CPUs, power consumption gets
automatically reduced from an average 12-24 Watts operating power down
to 2-3 Watts when the processor is Halted. On the Cyrix 6x86 CPUs,
Halt state power consumption can be further reduced down to 150 mW by
enabling the Suspend-on-Halt feature.
Reduced power consumption means cooler, more reliable
machine operation and longer component life. And it saves trees too.
8. I have a non-Intel x86 CPU. What is the [best|correct] kernel
config option for my CPU?
* (ADB) For 386 class machines, compile as a 386. For 486-
class machines, compile as a 486.
For the Cyrix 6x86 family CPUs and the AMD K5 and K6, you
should probably compile the kernel as a Pentium or PPro. The only
difference between the Pentium (-M586) and PPro (-M686) compile
options is in the string operations (AFAIK). The Pentium option uses a
header file that breaks down the complex string opcodes into simpler
operations (which are faster on the Intel Pentium and Pentium MMX).
The PPro option uses the complex opcodes, but should be
slightly faster than a Pentium because the PPro has deeper, smarter
pipelines.
The same rules apply to the 6x86 family and the K5/K6, but
the difference in speed is minimal between the Pentium and PPro kernel
config options on these CPUs (PPro should be slightly better).
The 486 kernel config option (-M486) should not be used
for anything above a 486-class CPU. This option sets code alignment
options that work well on the 486, but that cause excessive NOP
padding on 586 and above class machines. Usually, the 6x86 speculative
execution capabilities will just optimize this padding at run time,
but the NOP opcodes still take precious L1/L2 cache space (same
applies to the K6; I am not 100% sure of what the K5 does).
The 386 config option (-M386) does not suffer from
excessive padding, but does not produce code optimized for recent x86
CPUs either, so it is also deprecated, except for kernels included in
GNU/Linux distributions which must run on the widest possible range of
machines.
9. What CPU types does Linux run on?
* (REG) Quite a few. Below is the list for kernel 2.4.18.
Note that for some CPUs advanced development is kept outside the
mainline kernel, and changes are merged into the mainline
periodically. The WWW pages for these projects are listed as well.
o alpha, by DEC (now Compaq). http://www.alphalinux.org/
o ARM http://www.arm.linux.org.uk/
o Cris (AXIS) http://developer.axis.com/software/linux/
o x86 (32 bit, aka IA32, aka i386)
o x86-64 (64 bit extension to x86 by AMD) http://www.x86-64.org/
o IA64 (aka Itanium aka Itanic, by Intel and HP)
http://www.linuxia64.org/
o M68K (Motorola M68000 family) http://www.linux-m68k.org/
o MIPS (32 bit and 64 bit) http://www.linux.sgi.com/
o PA-RISC (by HP) http://www.parisc-linux.org/
o Power PC 32 bit http://www.penguinppc.org/ and 64
bit http://penguinppc64.org/
o S390/S390x (IBM mainframe) http://linux.s390.org/
o SuperH (by Hitachi) http://www.linux-sh.org/
o Sparc (32 and 64 bits) http://sunsite.tut.fi/SPARCLinux/
and http://www.ultralinux.org/
o OpenRISC (unfinished) http://www.opencores.org/projects.cgi/web/or1k/openrisc_1200
o Emotion Engine (SONY Playstation 2) http://playstation2-linux.com/
o ColdFire by Motorola (incompatible derivative of
MC68000) http://www.uclinux.org/ports/coldfire/
o VAX (DEC) http://linux-vax.sourceforge.net/
o TMS320 Digital Signal Processor (Texas Instruments)
http://www.dsplinux.net/
o 8088 / 8086 / 80286 (INTEL) http://elks.sourceforge.net/
o ITRON (Japanese CPU used by DoCoMo in 3G mobile
phones) http://www.emblix.org/english/etop.html
o General CPU http://www.cyut.edu.tw/~ckhung/resource/linux_ports.html

Section 7 - OS questions

1. OS $toomuch has this Nice feature, so it must be better than GNU/
Linux.
* (ADB) Sorry, but this simply means that OS $toomuch was
designed with a given set of objectives and priorities, and GNU/Linux
was designed with a different one. Neither is better than the other
and also note that I am not referring to the respective
implementations. But please, no OS comparisons on the linux-kernel
list. Check the newsgroups instead, particularly
comp.os.linux.advocacy which is dedicated to that kind of debate.
2. Why doesn't the Linux kernel have a graphical boot screen like
$toomuch OS?
* (ADB) Because it doesn't need one. You can add that
feature to the boot loader code, if you want to. The Linux kernel has
no graphics primitives, just like any UNIX kernel.
3. The kernel in OS CTE-variant has this Nice-very-nice feature,
can I port it to the Linux kernel?
* (ADB) Sure, you can do (almost) anything you want with
Free Software. Oh, OS CTE-variant is not Free Software?
4. How about adding feature Nice-also-very-nice to the Linux
kernel?
* (ADB) You should probably read the definition of creeping
featurism first. Related concepts, in increasing order of obfuscation:
the KISS rule-of-thumb, the "Small is Beautiful" concept, Occam's
Razor and Complexity Theory. A good book to read on these concepts as
they apply to OS design is "The Mythical Man-Month" by Frederick P.
Brooks, Jr.
5. Are there more bugs in later versions of the Linux kernel,
compared to earlier versions?
* (ADB) There are no more known bugs in later kernel
versions than in earlier kernel versions. However, the Linux kernel
source code has been growing at a constant rate. As a general rule,
large pieces of code tend to have undetected bugs. OTOH, the core code
for the Linux kernel seems to have stabilized at around 16 thousand
lines of C code, according to Larry McVoy.
* (REW) I'd say more than 23 thousand lines in 2.1.x. Add
together the totals from kernel, mm, arch/<somearch>/, subtract fpu-
emulation.
6. Why does the Linux kernel source code keep getting larger and
larger?
* (ADB) There are four causes for this unbounded growth:
1. New architectures are implemented. This is usually
OK, because the code that is specific to each architecture is (in
theory, at least) separate from the others. Common code doesn't grow.
2. New drivers are implemented. Again, this is OK,
because each driver has different source files, and those are
selectively compiled in the kernel executable or built as modules
according to the specified kernel configuration.
3. Old code gets adequately documented. Adding comments
and documentation increases the size of the source, but it's still a
Good Idea (tm).
4. Creeping featurism. It's generally considered a Bad
Idea (tm) to keep adding more features to an already working piece of
code.
7. The kernel source is HUUUUGE and takes too long to download.
Couldn't it be split in various tarballs?
* (REG) The kernel (as of 2.1.110) has about 1.5 million
lines of code in *.c, *.h and *.S files. Of those, about 253 k lines
(17%) are in the architecture-specific subdirectories, and about 815 k
lines (54%) are in platform-independent drivers. If, like most people,
you are only interested in i386, you could save about 230 k lines by
removing the other architecture-specific trees. That is a 15% saving,
which is not that much, really. The "core" kernel and filesystems take
up about 433 k lines, or around 29%.
If you want to start pruning drivers away, the problem
becomes much harder, since most of that code is architecture
independent. Or at least, is supposed to be/will be. There is some
driver code which probably should be moved to an i386-specific
subdirectory, and perhaps over time it will be (it will take a lot of
work!), but you need to be careful. PCI cards for example should be
architecture independent. Throwing out the non i386-specific drivers
will save around 97 k lines, a saving of about 6%.
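
The percentages quoted above can be rechecked with a few lines of arithmetic. This is only a sketch using the rounded line counts from the answer (in thousands of lines); the names are illustrative.

```python
# Line counts in thousands, as quoted in the answer for kernel 2.1.110.
ARCH_SPECIFIC = 253
DRIVERS = 815
CORE_AND_FS = 433
TOTAL = ARCH_SPECIFIC + DRIVERS + CORE_AND_FS  # about 1.5 million lines


def percent(k_lines):
    """Share of the whole tree, rounded to the nearest percent."""
    return round(100 * k_lines / TOTAL)


for label, k_lines in [
    ("architecture-specific", ARCH_SPECIFIC),
    ("platform-independent drivers", DRIVERS),
    ("core kernel and filesystems", CORE_AND_FS),
    ("saved by removing the other architecture trees", 230),
    ("saved by throwing out the non i386-specific drivers", 97),
]:
    print(f"{label}: {k_lines} k lines, {percent(k_lines)}%")
```

The three main categories add up to the whole tree, which is why their shares sum to 100%.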
But the most important argument for/against splitting the
kernel sources is not about how much space/download time you could
save. It's about the work involved for Linus or whoever will be
putting together the kernel releases. Building tarballs (compressed
tarfiles) of the whole kernel already represents a considerable amount
of work; splitting it into various architecture-dependent tarballs
would dramatically increase this effort and would probably pose
serious maintainability problems too.
If you are really desperate for a reduced kernel, set up
some automated procedure yourself, which takes the patches which are
made available, applies them to a base tree and then tars up the tree
into multiple components. Once you've done all this, make it available
to the world as a public service. There will be others who will
appreciate your efforts.
Under no circumstances should you complain to the kernel
list. I promise you that Linus and the core developers will completely
ignore such messages, so whinging about it is a complete waste of
bandwidth. The only message on this subject that should be posted is
an announcement of a new service providing split kernel sources.
8. What are the licensing/copying terms on the Linux kernel?
* (RRR) In the root directory of the Linux kernel source
tree (e.g. /usr/src/linux/), you will find a file COPYING. The file
states that the Linux kernel is placed under the GNU General Public
License (version 2), a copy of which is provided. If still in doubt,
post to the appropriate forums (such as gnu.misc.discuss) or ask a
lawyer, but don't ask about it on the linux-kernel list.
9. What are those references to "bazaar" and "cathedral"?
* (ADB) These terms are used to describe two different
development models adopted by the Free Software community, and were
first coined by Eric S. Raymond. You should check his original
article.
Note that Eric's article describes two among an infinite
range of possible different development models. You could for example
create new "Versailles", "Great Wall of China" or "Pyramid of Kheops"
software development models. As long as the end result is under a GNU/
GPL license, it will still be Free Software.
10. What is this "World Domination" thing?
* (ADB) Geek humor? Please don't take this seriously! This
is just a way of saying that there are more and more people using GNU/
Linux all over the world i.e. that the Free Software movement is
gaining momentum. Note that the "Free" in Free Software refers to
freedom, just about the opposite of what's implied by "World
Domination".
* (REW) This is a reference to an interview of Linus some
years ago. After being pretty modest about the success that Linux was
enjoying he concluded the interview with the remark: "Of course, what
I really want is total world domination."
I've been browsing the net for the reference for this.
http://www.ukuug.org/newsletter/63/ne...@uk63-5.shtml and
http://www.linuxgazette.com/issue15/lg_toc15.html are close but not
quite close enough.
Linus has referred back to this remark often enough.
11. What are the plans for future versions of the Linux kernel?
* (ADB) Linus would be the best person to ask, but I don't
know if he would have the time and patience to answer this question.
There are some development issues that can be mentioned, though:
1. PnP support in the kernel. Right now one can get PnP
support using the isapnptools user space package and manually tuning
the I/O, IRQ and DMA channel allocation, but future Linux kernels will
do that for you.
2. Improved SMP support.
3. Improved 64 bit code support.
4. Improved POSIX support.
5. Improved APM support.
12. Why does it show BogoMips instead of MHz in the kernel boot
message?
* (ADB) On some processor architectures it is very difficult
to find out the clock speed of the processor, and since the kernel
does not depend on determining the MHz rating of a processor to
operate correctly, MHz simply do not get calculated at boot time.
OTOH, BogoMips get calculated because the kernel bases itself on
BogoMips data to implement small time delays (busy loops) needed by
various drivers in different circumstances. Note that neither BogoMips
nor MHz measure processor performance in any way. See the BogoMips
HOWTO by Wim van Dorst for an accurate description of BogoMips. Also
take a look at the Linux Benchmarking HOWTO (shameless plug) if you
want some basic information on Linux performance measurements.
Sometimes your BogoMips reading will vary by as much as
30%, from one kernel to another. This is due to changes in the
alignment of the BogoMips calibration loop, which interacts with cache
behavior. Richard B. Johnson has recently proposed a small patch that
takes care of this problem.
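
To illustrate why a calibrated busy loop is useful for short delays, here is a user-space sketch of the general idea: double a loop count until one run takes at least a chosen interval, then derive loops per second. This is not the kernel's calibration code, and the names busy_loop, calibrate and loops_for_delay are invented for this example.

```python
import time


def busy_loop(n):
    """Spin n times and return the elapsed time in seconds."""
    start = time.perf_counter()
    while n > 0:
        n -= 1
    return time.perf_counter() - start


def calibrate(run_loop=busy_loop, tick=0.01):
    """Double the loop count until one run lasts at least `tick` seconds,
    then return the measured loops per second."""
    loops = 1
    while True:
        elapsed = run_loop(loops)
        if elapsed >= tick:
            return loops / elapsed
        loops *= 2


def loops_for_delay(seconds, loops_per_second):
    """Number of busy-loop iterations that approximate the given delay."""
    return round(seconds * loops_per_second)


if __name__ == "__main__":
    rate = calibrate()
    print("loops per second:", int(rate))
    print("loops for a 1 ms delay:", loops_for_delay(0.001, rate))
```

The result says how fast this particular loop runs on this machine, which is all a delay routine needs. It is not a performance measurement.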
13. I installed kernel x.y.z and package foo doesn't work anymore,
what should I do?
* (RRR) Check out the /usr/src/linux/Documentation/Changes
and make sure you have the recommended versions (or newer) of the
relevant software. This is very important. A lot of things are
evolving on Linux and newer versions of the kernel may break older
packages (especially on the development kernels). If you are using
development kernels keep an eye for reports on the kernel list. If all
else fails post a bug report (see Q/A on bug reports) to the list.
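
A version check of this kind can be scripted. The sketch below compares dotted version strings numerically. The function names and the sample version numbers are purely illustrative and are not taken from Documentation/Changes.

```python
import re


def parse_version(text):
    """Turn a string such as '2.1.85' into a tuple of integers."""
    return tuple(int(number) for number in re.findall(r"\d+", text))


def is_new_enough(installed, required):
    """True if the installed version is the required one or newer."""
    return parse_version(installed) >= parse_version(required)


print(is_new_enough("2.1.121", "2.1.85"))
print(is_new_enough("2.0.7", "2.1.85"))
```

Letters and other suffixes in a version string are ignored here, so treat the result as a hint rather than a verdict.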
14. People talk about user space vs. kernel space. What's the
advantage of each?
* (REG) User space is what all user (including root)
programs run in. It is fully virtual memory (i.e. normally swappable).
The X server is in user space, for example. So is your shell. Kernel
space is the domain of the kernel (wow!), device drivers and hardware
interrupt handlers. Kernel memory is non-swappable (i.e. it's REAL
RAM), and hence should be used sparingly. Also, operations performed
in kernel space are not pre-emptive: this means other processes are
prevented from running until the operation completes.
Some people think that it's better to implement stuff in
kernel space ("so that everyone has it"). In general this is a Bad
Idea (tm) (see "creeping featurism" above), since kernel space
resources are more "heavy" than user space resources. For example,
coding a Mandelbrot fractal generator in kernel space is a *really
stupid* idea.
The job of the kernel is to provide a safe and simple
interface to hardware and give different processes a fair share of the
resources, and to arbitrate access to resources/hardware.
Many ideas are best implemented in user space, with
perhaps the absolute minimum of kernel support. The only exceptions to
this principle are where it is particularly complicated or inefficient
to implement the solution in user space only. This is why filesystems
are in the kernel (you *could* put them in user space implemented as
daemons), because a kernel implementation is *much* faster.
Note that you can make user space memory non-swappable by
using the mlock(2) system call. This is a privileged operation and
should not be used trivially.
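
As a rough sketch of what calling mlock(2) looks like from user space, the following uses Python's ctypes module to call the C library directly. It assumes a Unix-like system where the C library exports mlock and munlock; the helper name try_mlock is made up here. On many systems an unprivileged process may still lock a small amount of memory, subject to a resource limit, so the call can succeed or fail depending on your setup.

```python
import ctypes
import os

# Assumption: the running process is linked against a C library with mlock/munlock.
libc = ctypes.CDLL(None, use_errno=True)
libc.mlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
libc.mlock.restype = ctypes.c_int
libc.munlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
libc.munlock.restype = ctypes.c_int


def try_mlock(buf):
    """Try to pin the buffer in RAM; return (ok, message)."""
    if libc.mlock(ctypes.addressof(buf), ctypes.sizeof(buf)) != 0:
        return False, os.strerror(ctypes.get_errno())
    return True, "locked"


buf = ctypes.create_string_buffer(4096)
ok, message = try_mlock(buf)
print(ok, message)
if ok:
    # Release the lock again so the pages become swappable.
    libc.munlock(ctypes.addressof(buf), ctypes.sizeof(buf))
```

Checking the return value matters, because a failed lock leaves the memory swappable without any other sign.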
15. What are threads?
* (ADB) Briefly, threads are "lightweight" processes
(see the definition of a process in the Linux Kernel book) that can
run in parallel. This is a lot different from standard UNIX processes
that are scheduled and run in sequential fashion. More threads
information can be found here or in the excellent Linux Parallel
Processing HOWTO by Professor Hank Dietz.
* (REG) When people talk about threads, they usually mean
kernel threads i.e. threads that can be scheduled by the kernel. On
SMP hardware, threads allow you to do things truly concurrently (this
is particularly useful for large computations). However, even without
SMP hardware, using threads can be good. It can allow you to divide
your problems into more logical units, and have each of those run
separately. For example, you could have one thread read blocking on a
socket while another reads something from disk. Neither operation has
to delay the other. Read "Threads Primer" by Bill Lewis and Daniel J.
Berg (Prentice Hall, ISBN 0-13-443698-9).
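
The socket-and-disk example can be sketched in user space as follows. This uses Python's threading module rather than a C threads library, purely as an illustration. The function names, the temporary file and the socket pair standing in for a network connection are assumptions of this sketch.

```python
import os
import socket
import tempfile
import threading


def wait_on_socket(sock, results):
    # Blocks here until the other end sends something.
    results["socket"] = sock.recv(1024)


def read_from_disk(path, results):
    with open(path, "rb") as f:
        results["disk"] = f.read()


def run_demo():
    results = {}
    reader_end, writer_end = socket.socketpair()
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello from disk")
        path = tmp.name
    try:
        t_sock = threading.Thread(target=wait_on_socket, args=(reader_end, results))
        t_disk = threading.Thread(target=read_from_disk, args=(path, results))
        t_sock.start()
        t_disk.start()
        # The disk read completes even though the socket thread is still blocked.
        t_disk.join()
        writer_end.sendall(b"ping")
        t_sock.join()
    finally:
        reader_end.close()
        writer_end.close()
        os.unlink(path)
    return results


print(run_demo())
```

The disk thread finishes first because nothing is sent on the socket until it has been joined, which shows that neither operation delays the other.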
16. Can I use threads with GNU/Linux?
* (REG) Yes! The Linux kernel has the clone(2) system call,
which provides the underlying mechanism for implementing a threads
library. And Xavier Leroy has provided us with LinuxThreads, a POSIX
1003.1c implementation of threads for the Linux kernel.
If you have a libc 5 system, you'll need to install
LinuxThreads if it is not already installed. You can get the
LinuxThreads library here.
If you have a libc 6 (aka glibc 2) system, you shouldn't
need to do anything. Glibc has LinuxThreads merged in.
17. You mean threads are implemented in kernel space in GNU/Linux?
Why not a hybrid kernel/user space implementation? Wouldn't that be
more efficient?
* (REG) It is not clear that there is any significant
benefit for Linux to have a hybrid threading library. If we look at
Solaris Threads, they have a hybrid scheme, and claim that is an
advantage. Well, yes, I suppose so, given their environment (the
Solaris 2 kernel). They have a very heavy kernel, so a pure kernel
space implementation would be too slow (remember the time it takes to
enter/exit the kernel). Linux, on the other hand, has a very efficient
kernel, so the difference between a kernel context switch under Linux
and a user space context switch under Solaris 2 is pretty small. Also
note that Solaris Threads took a long time to get right, because of
problems such as signal delivery to threads. With a pure kernel
threads implementation, signal delivery is much simpler. Fixing the
signal delivery problems with Solaris Threads increased the complexity
of their library, leading to bloat and performance loss. We don't want
to make the same mistakes.
Now, you may argue that a hybrid scheme under Linux would
be even better. Maybe. Prove it. Code it and benchmark it. In any
case, this is a discussion that is not relevant to the kernel, since a
hybrid scheme is built on top of kernel threads (Solaris 2 builds
their threads on top of LWPs (Light Weight Processes) too). It's a
user space issue, so please, keep it off the kernel list.
BTW: if you do manage to code something up and it is much
faster than pure kernel space threads, you may need some kind of extra
kernel support (depending on how you implement things). If that
happens, then come and talk about it on the kernel list.
The Linux philosophy is to optimize the kernel first, so
that all possible implementations can share the benefits.
18. Can GNU/Linux machines be clustered?
* (REG) Different people mean different things when they
talk about clustering. Some people want transparent fault tolerance
and load balancing of general applications, others want parallel
processing of a single job. Most people who talk about fault tolerance
expect hardware and OS support of this (if a node goes down, the OS
will automatically migrate the application to another node). This is
not (yet) available for Linux.
You can write a fault tolerant application for a network
of computers without direct OS support: you just need to structure
your application appropriately. Note that a fault tolerant distributed
application may also be a parallel, multithreaded application.
The Beowulf project provides an API and system management
software to write parallel distributed applications on a network of
Linux machines. The main emphasis here is on parallelism to get
maximum processing power, although fault tolerance is possible too. An
example of a Beowulf clustered Linux system is Avalon, which has just
been listed among the world's 500 most powerful supercomputers.
Beowulf clusters deliver GFLOPS using arrays of commodity
computers. It is an incredibly cheap and elegant way to get
significant computing power for e.g. scientific applications.
* (ADB) Also check the Parallel Processing HOWTO by
Professor Hank Dietz.
* (REG) In June 2000, Mission Critical Linux released
Kimberlite which they describe as an "open source linux clustering
capability". Tim Burke, their Cluster Architect, describes it thus:
A Kimberlite Cluster provides support for two server nodes
connected to a shared SCSI or Fibre Channel storage subsystem, in an
active-active failover environment. The software provides the ability
to detect when either node leaves the cluster, and will automatically
trigger recovery scripts which perform the procedures necessary to
restart applications on the remaining node. When the node rejoins the
cluster, applications can be moved back to it, manually or
automatically, if required. Sample recovery scripts are provided.
Kimberlite is designed to deliver the highest levels of data integrity
and be extremely robust. It is suitable for deployment in any
environment that requires high availability for un-modified Linux
applications.
19. How well does Linux scale for SMP?
* (REG) Reasonably well. Kernel version 2.2 has much better
scalability than version 2.0. People are running 4 processor Intel
Xeon machines and 14 processor UltraSparc machines. Version 2.2 still
has a global kernel lock, but this is often released quite quickly
(for example, when the process blocks waiting for a resource and/or
data), so the net result is that it is quite unlikely for two
processors to compete for the global lock. Experiments with 14
processor UltraSparc machines show that Linux scales well, indicating
that the current locking strategy is not hurting us for these
machines.
Also consider that for parallel processing jobs, the
kernel is not involved, so even Kernel v2.0 scaled well for these
applications. When we talk about SMP scalability, we are referring to
how many IO operations the kernel can perform at the same time.
Unfortunately some hysterical NT supporters continue to
spread FUD that Linux does not scale well on SMP. Efforts to insert a
bit of truth have generally fallen on deaf ears. If someone tells you
that NT scales better than Linux, ignore them. They're operating in a
fact-free zone. Tests indicate that NT has trouble scaling to 4
processors. There really is no competition.
Note that kernel version 2.3 has replaced the remaining
global kernel lock with finer grained locking. This allows Linux to
scale well to 64 processor machines and beyond.
20. Can I lock a process/thread to a CPU?
* (RML) Yes, as of 2.5.8 the Linux kernel supports binding a
process to a particular CPU. Patches exist for the 2.4 kernel series
but are not yet merged (as of 28-APR-2002). This is called "task CPU
affinity" and the interface was implemented via the following
syscalls:
int sched_setaffinity(pid_t pid, unsigned int len,
unsigned long *mask)
int sched_getaffinity(pid_t pid, unsigned int len,
unsigned long *mask)
which set and get a given task's CPU affinity,
respectively. Utilities for manipulating affinity and the patches for
2.4 are available at kernel.org. The interface allows any task's
affinity to be retrieved, although only the task's owner (or root) can
change the affinity. The calls ensure the task has been successfully
scheduled onto a valid CPU before returning.
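As a rough illustration of the get/set pattern, here is a sketch that
uses Python's os.sched_getaffinity and os.sched_setaffinity. These are
assumed to be available (Linux, Python 3.3 or later) and take a set of
CPU numbers rather than the len/mask pair shown in the prototypes
above, so this shows the idea rather than the raw syscall interface.

```python
import os

pid = 0  # 0 means "the calling process"

original = os.sched_getaffinity(pid)   # set of CPU numbers we may run on
target = min(original)                 # pick a CPU we are already allowed on

os.sched_setaffinity(pid, {target})    # bind to that single CPU
pinned = os.sched_getaffinity(pid)
print("allowed before:", sorted(original))
print("allowed now:   ", sorted(pinned))

os.sched_setaffinity(pid, original)    # restore the original mask
restored = os.sched_getaffinity(pid)
```

Choosing the target from the existing mask avoids asking for a CPU the
process is not permitted to use.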
21. How efficient are threads under Linux?
* (REG) Incredibly. Compared with all the other kernel-based
thread implementations, Linux is probably the fastest. Each thread
takes only 8 kiB of kernel memory for the stack and thread creation
and context switching is very fast. I have measured less than 1
microsecond context switch times on an old Pentium/MMX 200 (see
http://www.atnf.csiro.au/~rgooch/benchmarks/linux-scheduler.html for
more details). However, the Linux scheduler is designed to work well
with a small number of running threads. Best results are obtained when
the number of running threads equals the number of processors.
Avoid the temptation to create large numbers of threads in
your application. Threads should only be used to take advantage of
multiple processors or for specialised applications (e.g. low-latency
real-time), not as a way of avoiding programmer effort (writing a
state machine or an event callback system is quite easy). A good rule
of thumb is to have up to 1.5 threads per processor and/or one thread
per RT input stream. On a single processor system, a normal
application would have at most two threads; an application with over
10 threads is seriously flawed, and hundreds or thousands of threads
is progressively more insane.
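To make the "event callback system" alternative concrete, here is a
minimal single-threaded sketch using Python's selectors module. The
two local socket pairs stand in for input streams and the callback
name on_readable is hypothetical; a real server would register its
listening and client sockets instead.

```python
import selectors
import socket

sel = selectors.DefaultSelector()
received = {}

def on_readable(sock, name):
    # Callback: runs only when the kernel reports that data is waiting.
    data = sock.recv(1024)
    received[name] = received.get(name, b"") + data

# Hypothetical input streams: two local socket pairs.
pairs = {name: socket.socketpair() for name in ("a", "b")}
for name, (reader, _writer) in pairs.items():
    reader.setblocking(False)
    sel.register(reader, selectors.EVENT_READ, data=name)

pairs["a"][1].sendall(b"hello")
pairs["b"][1].sendall(b"world")

# One thread services every stream; no stream blocks another.
while len(received) < len(pairs):
    for key, _events in sel.select(timeout=1.0):
        on_readable(key.fileobj, key.data)

sel.close()
for reader, writer in pairs.values():
    reader.close()
    writer.close()
```

The same loop scales to many streams without creating a thread for
each one.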
A common request is to modify the Linux scheduler to
better handle large numbers of running processes/threads. This is
always rejected by the kernel developer community because it is,
frankly, stupid to have large numbers of threads. Many noted and
respected people will extol the virtues of large numbers of threads.
They are wrong. Some languages and toolkits create a thread for each
object, because it fits into a particular ideology. A thread per
object may be appealing in the abstract, but is in fact inefficient in
the real world. Linux is not a good computer science project. It is,
however, good engineering. Understand the distinction, and you will
understand why many widely acclaimed ideas in computer science are
held with contempt in the Linux kernel developer community.
22. How does the Linux networking/TCP stack work?
* (REG) The best guide may be found in the Linux kernel
sources. A popular reference is "TCP/IP Illustrated" (volumes 1 to 3)
by W. Richard Stevens, which explains much of the theory and practice
behind TCP/IP. This material is based on the BSD implementation, which
differs from Linux in fundamental ways. Nevertheless, it is an
excellent reference.
23. Can we put the networking/TCP stack into user-space?
* (REG) The short answer is no, because this would slow it
down (see the monolithic versus microkernel debate for reasons why).
The longer answer involves the motivations behind the question. Some
people want to inspect every packet, and think it's easier to do in
user-space. In fact, the kernel has a network packet filtering API
(Linux Socket Filter (LSF), which is an easier-to-use implementation
of the Berkeley Packet Filter (BPF)). The LSF allows you to capture
some or all packets and pass them to user-space. This yields the
advantages of a kernel-based networking stack, but still allows you to
inspect packets in user-space if needed.
One reason people want to inspect packets is to perform
firewalling. In this case, a far superior solution is available, using
the Netfilter infrastructure. This is a kernel-level firewalling/NAT
solution which is fast and reliable. You may create both stateful and
stateless firewalling configurations. This infrastructure was
introduced during the 2.3.x development cycle.

Section 8 - Compiler/binutils questions

1. I downloaded the newest kernel and it doesn't even compile!
What's wrong?
* (REG) First check the kernel newsflash page at
http://www.atnf.csiro.au/~rgooch/linux/docs/kernel-newsflash.html
where late-breaking patches may be posted.
* (DW) Do not post any details of the compile failure to the
mailing list unless you have first checked the archives to ensure that
the question hasn't been asked already.
Normally, if Linus allows a simple typo into a release
kernel which prevents it from compiling, a patch is posted to the list
within hours, yet still there are clueless idiots who continue to ask
about it for weeks thereafter.
Do not do this. We will find out where you live, and we
will come to your house and knock on your door at three o'clock in the
morning to ask you stupid questions. Repeatedly, if needs be.
REW's note below also says this; but evidently not
explicitly enough. Some people are just too stupid, I guess.
* (RRR) Make sure you are compiling with the recommended
version of gcc with default optimizations flags (IOW, leave the
Makefiles alone) and a recent binutils. The binutils package is the
one that contains the assembler (gas) and linker (ld). See
Documentation/Changes for more info. If that works, then experiment
with different compilers and optimizations.
* (REW) Linus cannot test every permutation of drivers and
options. He's a selfish little guy. He just compiles the version that
runs on his computers, and then releases it. Actually, he sometimes
doesn't even compile it before releasing. He's a busy man. Give him a
break. Wait for half a day. Someone will post a patch that will fix it
within that time. If that doesn't happen for more than a day, fix it
yourself, and post the patch to linux-kernel. If you don't have the
expertise to do this yourself, please wait for another day, before
reporting the problem.
Please check whether it has been reported before. Most
companies have a help desk that keeps the end users from bothering the
developers. Linux is different: You get to talk to the developers. But
don't waste everybody's time by posting stuff that has been reported
already.
* (DBE) Not all ports of the Linux kernel to different
hardware platforms are fully merged into the official tree at
kernel.org. If you have problems compiling a kernel for a non-i386
architecture please check the related Web pages and mailing-lists for
that specific port.
2. What are the recommended compiler/binutils for building kernels?
* (REG) This depends on the kernel version. Until
26-OCT-2000, gcc 2.7.2.3 was the recommended compiler for all kernels.
On that date, Linus announced that gcc 2.91.66 (aka egcs 1.1.2) is the
recommended compiler for 2.4.x kernels up to version 2.4.9. Gcc 2.95.3
is the recommended compiler for kernel 2.4.10 and later.
The recommended binutils is 2.9.1.0.25. Avoid binutils
versions from 2.8.1.0.25 to 2.9.1.0.2; these were beta releases and
are known to be buggy.
Always see the Documentation/Changes file for details.
3. Why the recommended compiler? I like xyz-compiler better.
* (RRR) Quick Answer: it's what Linus uses. Real Answer: the
recommended compiler has been extensively tested and proven to be a
very stable compiler. What is at issue is not whether other compilers
can optimize better, but whether they will compile the kernel
correctly. Current kernels and compilers are very complex pieces of
software. There are just too many ways that the two can interact and
cause trouble (a recent example: gcc 2.8.x and kernel 2.0.x). By
keeping constant one of the variables - the compiler - kernel
developers can concentrate on the kernel. If both the compiler and
kernel are changing then it's anyone's guess what went wrong.
4. Can I compile the kernel with gcc 2.8.x, egcs, (add your xyz
compiler here)? What about optimizations? How do I get to use -O99,
etc.?
* (RRR) Sure, it's your kernel. But if it doesn't work, you
get to fix it. Seriously now, there is really no point in compiling a
production kernel with an experimental compiler. Production kernels
should only be compiled with the recommended compiler. Newer compilers
are known to break the 2.0 series kernels; known symptoms of this
breakage are hwclock and the X server segfaulting.
Compiling a 2.0 kernel with egcs or gcc 2.8, even after
applying the workaround of copying the ioport.c file from a late 2.1
kernel to 2.0, is not recommended and will inevitably lead to
unpredictable kernel behaviour.
Regarding 2.1 kernels, they usually compile fine with
other compiler versions, but do NOT complain to the list if you are
not using the recommended compiler. Linux developers have enough work
tracking kernel bugs without also being swamped with compiler-related
bugs.
If you want to play with the optimization options, you
need to hack the Makefile in arch/i386/Makefile (assuming you have an
x86 processor), but if it breaks... well, you should know the answer
by now. Also keep in mind that some demented optimizations (such as
-O99) may even produce slower and bigger kernels, due to gratuitous
loop unrolling and function inlining.
* (ADB) I think the standard Paul Gortmaker disclaimer (?)
is: "If it breaks, you get to keep the pieces." ;-)
5. I compiled the kernel with xyz compiler and get the following
warnings/errors/strange-behavior, should I post a bug report to the
list? Should I post a patch?
* (RRR) In general, no, unless you get these with the
recommended compiler. There are a few exceptions:
Everyone welcomes code cleanup patches. For instance,
newer compilers may complain a lot more, and some of these warnings
may even be warranted (e.g. ambiguous use of else statements). Fixing
these is a good thing.
There could be some aging code around that makes too many
assumptions about the compiler (especially true about inline
assembly); some of the newer compilers break these statements. Fixing
these is also a good thing, but be very sure you are really fixing
a bug in the kernel. Workarounds for other compilers will be ignored
(if the compiler is buggy, fix the compiler!).
6. Why does my kernel compilation stop at random locations with:
"Internal compiler error: program cc1 caught fatal signal 11."?
* (REW) Sometimes bad hardware causes this. Read the Web
page at http://www.BitWizard.nl/sig11/ about this. The important word
here is random. If it stops at the same place every time, the kernel
source might have a glitch or your compiler might be bad. The Web page
is mostly about the random error source: hardware. There are a number
of different error messages that you can get if you have bad or
marginal hardware.
* (ADB) Overclocked processors very often fail long
compilations with a sig11, because a long gcc compilation puts more
strain on the processor. As the processor heats up, it may attain a
point where internal timings get out of spec. At this point, something
gives and you get a sig11.
Also, some old K6 revisions would sig11 when compiling
large programs if more than 32 MB of RAM were installed on the Linux
box. AMD
will exchange these faulty processors for free. Benoit Poulot-Cazajous
correctly diagnosed the problem and devised an ingenious test for this
bug that is run at boot time in 2.2.x kernels.
7. What compiler flags should I use to compile modules?
* (REG) At the very least, you need these: -O2 -DMODULE
-D__KERNEL__ -DLINUX -Dlinux
* (KO) I don't advise compiling modules by hand if the
directory is in the kernel source tree. The rest of the Makefile
system will not know about the extra modules so it will not recompile
them if the config changes nor will it install the modules. The best
method (until the 2.5 Makefile rewrite) is to add the directory into
the kernel Makefile system.
Create a kernel Makefile in your new directory. Example:

#
# Example Makefile for your own modules
#

SUB_DIRS :=
MOD_SUB_DIRS := $(SUB_DIRS)
ALL_SUB_DIRS := $(SUB_DIRS)

M_OBJS := example-module1.o example-module2.o

include $(TOPDIR)/Rules.make

Edit the Makefile in the parent directory to add your
subdirectory to the SUB_DIRS list. make dep, make modules and make
modules_install will automatically handle your modules.
* (VKh) If you have a local Makefile for building a module
that is not properly linked into the kernel tree, you can still
"ride" on the master Makefile.

This way the compilation options of your particular kernel
configuration do not have to be hardwired into the local Makefile.
I.e., once you reconfigure the kernel, a local "make" will rebuild
your driver with the correct set of new flags.

This is what you can do on 2.2 (Makefile excerpt follows):

EXTRA_CFLAGS := -DDEBUG -DLINUX -I/usr/src/foo/include
MI_OBJS := your-module.o
O_TARGET := your-module.o
O_OBJS := your1.o your2.o

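# Reuse the Linux kernel master Makefile for this directory.
# If MAKING_MODULES is defined, pull in the standard kernel rules;
# otherwise re-enter the kernel tree and ask it to build the modules
# in this directory. The command line under "all::" must start with a
# tab character.
ifdef MAKING_MODULES
include $(TOPDIR)/Rules.make
else
all::
	cd '/usr/src/linux' && $(MAKE) modules SUBDIRS=$(PWD)
endif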

In 2.4 the syntax is different. Rename MI_OBJS to obj-m
and O_OBJS to obj-y to achieve the same goal there:

obj-m := your-module.o
O_TARGET := your-module.o
obj-y := your1.o your2.o

8. Why do I get unresolved symbols like foo__ver_foo in modules?
* (KO) If /proc/ksyms or the output from depmod -ae contains
symbols like "foo__ver_foo" then you have been bitten by the broken
Makefile code for symbol versioning. The only safe way to recover is
to save your config, delete everything, restore the config and
recompile. Do this:

mv .config ..
make mrproper
mv ../.config .
make oldconfig
make dep clean bzImage modules
# install, boot

9. Why do I get unresolved symbols with __bad_ in the name?
* (REG) This is an indication that a function has been
called with an invalid parameter. In some cases, these invalid
parameters can be detected at compile time (through clever use of
preprocessor tricks), so the preprocessor will modify the called
function name into an invalid one. This will prevent the final link
stage from completing (or will prevent the module from loading).
OK, so now that you know why, go forth and pester the
maintainer of the section of code that is making the invalid function
call. You should check the CREDITS and MAINTAINERS files to determine
the maintainer.

Section 9 - Feature specific questions

1. GNU/Linux Y2K compliance?
* (ADB) Y2K compliance under GNU/Linux is a multi-level
problem.
1. Applications. Check your application sources for
routines that only operate on/test the last two digits of the year
field/variable(s). Obviously the problem here is that 2000 > 1999, but
00 < 99. Unfortunately, poor programming practices are just as common
and unavoidable as death and taxes...
2. Libraries. Libc5 and glibc are known to be Y2K
compliant. Alan Cox mentioned that libc4 had some problems.
3. Kernel. The Linux kernel is Y2K compliant. BTW the
code snippet in arch/i386/kernel/time.c will force those non-Y2K
compliant RTC implementations to the correct date on 00:00:00 Jan 1,
2000. It's been there for quite some time now, nice and quiet; added
by Alan Modra circa 1994!
4. BIOS. On x86 PC machines, upon boot some BIOSes will
wrap back to 1900; later versions will correctly wrap the RTC clock to
2000. This is a rather critical problem in embedded systems if they
are not running Linux; if they are running Linux this is solved by
Alan Modra's code snippet mentioned above. :-)
5. Hardware. The standard PC RTC chip will not wrap the
century. Wrapping must be done in software/BIOS. The chip will store
the century data, but it just won't increment it on 00:00:00 Jan 1,
2000. Same issue as BIOS WRT embedded systems.
Testing the kernel, the BIOS and the RTC hardware is
relatively easy if you are allowed to reboot the machine; just enter
the CMOS setup routine and set the time to Dec 31 1999, 23:58:00. Boot
and check what happens.
Checking applications and libraries takes a lot more
work... Especially checking applications when you don't have access to
the source code :( The only way is simulation. But this is getting off
topic: if you don't have access to the source code, then it's not
relevant to GNU/Linux. ;)
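The application-level failure described in item 1 (comparing only the
last two digits of the year) can be shown in a few lines. This is a
toy sketch in Python; the function names are invented for the
illustration.

```python
def later_two_digit(year_a, year_b):
    # Buggy: compares only the last two digits of each year.
    return (year_a % 100) > (year_b % 100)

def later_full(year_a, year_b):
    # Correct: compares the full four-digit years.
    return year_a > year_b

print(later_two_digit(2000, 1999))   # False: 00 is not greater than 99
print(later_full(2000, 1999))        # True
```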
2. What is the maximum file size supported under ext2fs? 2 GB?
* (REW, AC) In the 2.0.x kernels, maximum file size (not to
be confused with partition sizes, which can be much larger) under
ext2fs is 2GB. Larger files are only supported on 64-bit architectures
(Alpha and UltraSPARC) in late 2.1.1xx kernels.
Files larger than 2GB are difficult to support on 32-bit
architectures. This will probably be implemented in the 2.3 kernel
series.
3. GGI/KGI or the Graphics Interface in Kernel Space debate?
* (REG, ADB) GGI/KGI information can be found here. The
GGI/KGI developers warn against useless debates on the kernel list.
4. How do I get more than 16 SCSI disks?
* (REG) Get kernel version 2.2.0 or later.
5. What's devfs and why is it a Good Idea (tm)?
* (REG) OK, pushing my own barrow here. Devfs allows device
drivers to have a direct link with device special files (what you see
in /dev). The current dependence on major/minor numbers to provide
this link poses scalability and performance problems. Devfs also only
has device nodes for devices that you have available. Read the devfs
FAQ for more details. Note that devfs went into the official 2.3.46
kernel.
6. Linux memory management? Zone allocation?
* (ADB) Rik van Riel has set up a nice page on Linux memory
management. It has a link to an excellent tutorial on virtual memory.
7. How many open files can I have?
* (REG) With kernels 2.0.x you can have 256 open FDs (file
descriptors). With 2.2.x you can have 1024. Various patches exist
which allow you to increase these limits. Note that this can break
select(2).
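On a running system you can query the per-process descriptor limits
rather than rely on the figures above. The sketch below uses Python's
resource module (a wrapper around getrlimit(2)); it reports the limits
in force for the current process, which may differ from the kernel
defaults quoted here.

```python
import resource

# RLIMIT_NOFILE is the per-process limit on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit on open FDs:", soft)
print("hard limit on open FDs:", hard)
```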
8. When will the Linux accept(2) bug be fixed?
* (REG) Firstly, this is not a bug in the Linux kernel,
despite the fact that the "Sendmail 8.9.0 Known Bugs List" states
there is a bug with Linux accept(2). The Linux accept(2) call can
return the ETIMEDOUT error when there are system resource problems.
This is not wrong, just different from what Sendmail expects. Since
accept(2) is not part of the POSIX standard, it cannot be claimed that
Linux is violating it. I'm told that the Single UNIX Specification,
Version 2 (SUSv2), which is much newer, implicitly prohibits
ETIMEDOUT. Nevertheless, the networking hackers are not inclined to
change this behaviour. They seem to prefer to follow POSIX in this,
perhaps following the maxim "the great thing about standards is that
there are so many to choose from". Note also that BSD documents
slightly different behaviour from SUSv2. It is prudent for an
application to deal gracefully with unexpected error codes.
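One way for an application to cope is to treat some accept() errors as
transient and keep serving. The sketch below is in Python; the choice
of which errno values count as transient is an assumption made for
this example, not a list taken from POSIX, SUSv2 or the Sendmail
documentation, and accept_once is a hypothetical helper name.

```python
import errno

# Assumption: this set is a policy choice for the sketch only.
TRANSIENT = {errno.ETIMEDOUT, errno.ECONNABORTED}

def accept_once(listener):
    """Return (conn, addr), or None if accept() failed in a retryable way."""
    try:
        return listener.accept()
    except OSError as exc:
        if exc.errno in TRANSIENT:
            return None      # log it and carry on serving
        raise                # anything else is treated as a real error
```

A server loop would call accept_once() repeatedly and simply skip the
iterations that return None.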
9. What about STREAMS? I noticed Caldera has a STREAMS package,
when will that go in the kernel source proper?
* (REG) STREAMS allow you to "push" filters onto a network
stack. The idea is that you can have a very primitive network stream
of data, and then "push" a filter ("module") that implements TCP/IP or
whatever on top of that. Conceptually, this is very nice, as it allows
clean separation of your protocol layers. Unfortunately, implementing
STREAMS poses many performance problems. Some Unix STREAMS based
server telnet implementations even ran the data up to user space and
back down again to a pseudo-tty driver, which is very inefficient.
STREAMS will never be available in the standard Linux
kernel; it will remain a separate implementation with some add-on
kernel support (that comes with the STREAMS package). Linus and his
networking gurus are unanimous in their decision to keep STREAMS out
of the kernel. They have stated several times on the kernel list when
this topic comes up that even optional support will not be included.
* (REW, quoting Larry McVoy) "It's too bad, I can see why
some people think they are cool, but the performance cost - both on
uniprocessors and even more so on SMP boxes - is way too high for
STREAMS to ever get added to the Linux kernel."
Please stop asking for them, we have agreement amongst the
head guy, the networking guys, and the fringe folks like myself that
they aren't going in.
* (REG, quoting Dave Grothe, the STREAMS guy) STREAMS is a
good framework for implementing complex and/or deep protocol stacks
having nothing to do with TCP/IP, such as SNA. It trades some
efficiency for flexibility. You may find the Linux STREAMS package
(LiS) to be quite useful if you need to port protocol drivers from
Solaris or UnixWare, as Caldera did.
The Linux STREAMS (LiS) package is available for download
if you want to use STREAMS for Linux. The following site also contains
a dissenting view, which supports STREAMS.
10. I need encryption and steganography*. Why isn't it in the
kernel?
* (TJ) Note that this section was written in 2000/2001, and
the laws in various countries have changed since then. Updates would
be appreciated.
In France and Russia, strong encryption is essentially
illegal; using it there requires a license which is seldom granted.
The United States has cumbersome restrictions on exporting such
software (it's considered a "munition"; see
http://www.epic.org/crypto/export_controls/). Having these features in
the standard kernel would therefore cause
great inconvenience to people in those countries. However, separate
programs and patches to the kernel are available at:
o ftp://ftp.csua.berkeley.edu/pub/cypherpunks/filesystems/linux/
o http://www.freeswan.org/
o http://www.quick.com.au/ftp/pub/sjg/
o http://www.ssh.org/
o http://web.mit.edu/kerberos/www/
o http://tcfs.dia.unisa.it/
o ftp://ftp.tik.ee.ethz.ch/pub/packages/skip/
(*) Steganography is disguising sensitive data as noise in
a digitized image, sound file, or the like.
11. How about an undelete facility in the kernel?
* (REG) This idea keeps being raised again and again. There
is no need for kernel support to do this. You can easily do it in user
space. There are replacement versions of the rm utility which will
move/copy files to a wastebasket area instead of actually deleting. If
you're really keen, you could implement a wrapper for the unlink
system call, and use LD_PRELOAD to override the function for all
applications. This has been done by Manuel Arriaga and is called
"libtrash". It is available at: http://m-arriaga.net/software/libtrash/
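The "replacement rm" approach needs nothing from the kernel. Here is a
minimal sketch in Python of a function that moves a file into a
wastebasket directory instead of deleting it. The function name, the
directory name and the renaming scheme are all invented for this
illustration; it is not how libtrash works internally.

```python
import os
import shutil
import tempfile

def trash(path, wastebasket):
    """Move path into the wastebasket directory instead of deleting it."""
    os.makedirs(wastebasket, exist_ok=True)
    base = os.path.basename(path)
    dest = os.path.join(wastebasket, base)
    n = 0
    while os.path.exists(dest):        # avoid overwriting an earlier file
        n += 1
        dest = os.path.join(wastebasket, "%s.%d" % (base, n))
    shutil.move(path, dest)
    return dest

# Demonstration in a scratch directory (all names are hypothetical).
work = tempfile.mkdtemp()
victim = os.path.join(work, "notes.txt")
with open(victim, "w") as f:
    f.write("important")
saved = trash(victim, os.path.join(work, ".wastebasket"))
```

"Undeleting" is then just a matter of moving the file back out of the
wastebasket.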
12. How about tmpfs for Linux?
* (REG) The 2.4 series kernels have introduced a tmpfs. The
old SysV shared memory code has been replaced with a new shm
file-system, which is much simpler and cleaner, thanks to the improved
VFS. Since the shm code can be shared to create a tmpfs, this was
done. You may find tmpfs useful if you have an embedded system which
has the root file-system on a read-only medium but needs a writable
file-system.
* (REG) Prior to the introduction of tmpfs, many people
asked for its development, on the grounds that it would be faster
than /tmp in a conventional file-system. This was never considered a
valid reason for tmpfs development, because the Linux ext2 filesystem
is so good that it outperforms tmpfs (memory-based filesystems) in
other operating systems. Jim Nance (jln...@avanticorp.com) has posted
a comparison to linux-kernel. Here is an extract of his message:

The original question is enough of an FAQ that I thought it would be
good to have real numbers rather than just my assurances that Linux
has a fast FS layer. Therefore I wrote a benchmarking program that
creates/writes/destroys files and ran it under several operating
systems and on several types of file systems. I have included that
program as an attachment to this mail. Here are the results:

OS                 Hardware  FS Type  Loops/Second
--------------------------------------------------
Linux 2.2.5-ac6    1         nfs             16.33
Linux 2.2.5-ac6    1         arla            73.67
Linux 2.2.5-ac6    1         ext2         15383.32
Solaris 2.6        2         afs             71.33
Solaris 2.6        2         nfs             10.00
Solaris 2.6        2         ufs             23.67
Solaris 2.6        2         tmpfs         9162.32
Digital Unix 4.0D  3         afs             49.33
Digital Unix 4.0D  3         nfs             14.67
Digital Unix 4.0D  3         ufs             28.67
Digital Unix 4.0D  3         memfs         3062.66
Linux 2.0.33       4         afs             69.33
Linux 2.0.33       4         nfs             15.00
Linux 2.0.33       4         ext2          2218.33

Hardware:
1 -> 333 MHz PII, 512M ram, Compaq WDE4360W disk
2 -> Ultra450 class Sun server (300MHz?)
3 -> Personal Workstation 600 AU. 600 MHz alpha. 1.5G ram
4 -> 75 MHz Pentium, 32M ram, Seagate ST31200N disk

Notice how Linux writing to an ext2 file system is significantly
faster than any other OS/FS combination. The next closest is Solaris
writing to tmpfs, and it's still far behind ext2. It's also good to
notice how slow both Solaris and Digital Unix are on their local file
systems. This is probably why both have a RAM-based file system.

Please note that this benchmark is intended to measure the time it
takes to create and delete files, which is expensive on most non-linux
systems. It does not indicate anything about the data I/O rate to an
existing file.

It would be interesting to see a comparison between Linux
ext2fs and tmpfs.
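The benchmarking program itself was an attachment and is not
reproduced here. For anyone who wants to try a similar measurement,
the following Python sketch counts create/write/delete cycles per
second in a given directory. It is an assumption about the general
shape of such a test, not Jim Nance's program, and its numbers are not
comparable with the table above.

```python
import os
import tempfile
import time

def create_delete_loops(directory, seconds=0.2, payload=b"x" * 1024):
    """Return create/write/delete cycles per second in `directory`."""
    loops = 0
    path = os.path.join(directory, "bench.tmp")
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        with open(path, "wb") as f:    # create and write
            f.write(payload)
        os.unlink(path)                # destroy
        loops += 1
    return loops / seconds

with tempfile.TemporaryDirectory() as d:
    rate = create_delete_loops(d)
    print("%.2f loops/second" % rate)
```

To compare filesystems, pass a directory on each mounted filesystem of
interest instead of the temporary directory.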
* (REG, by Adam Sulmicki) If after reading all the above you
still feel you need tmpfs, and you're stuck in the stone age with a
2.2 kernel, read on. However, keep in mind it is more of a hack than
true tmpfs.
The magic way to do it is:
o compile ramdisk support into kernel, the option is:
CONFIG_BLK_DEV_RAM=y
o Run the following command to create a 2 MB ext2 RAM
disk: /sbin/mke2fs -vm0 /dev/ram 2048
o mount it: mount /dev/ram /tmp
And you are done.
13. What is the maximum file size/filesystem size?
* (REG) Maximum file size depends on the block size on your
filesystem. For ext2 (and UFS, SysVFS and similar filesystems), the
limits are:

Block size    Maximum file size (GiBytes)
512 B            2
1 kiB           16
2 kiB          128
4 kiB         1024
8 kiB         8192 (PAGE_SIZE must be >= 8 kiB)

plus a small amount. The limitation is due to the classic
triply-indirect addressing scheme. In the future, ext2 will have
extent-based addressing, which will overcome this problem.

The limit for a single filesystem (partition) on a 32 bit
CPU is 4 Gi blocks. Each block is 512 Bytes, so that works out to 2
TiB. For 64 bit CPUs, the limits are bigger than you can imagine.
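The 2 TiB figure for 32 bit CPUs follows directly from the two numbers
given above, as this short Python calculation shows (the variable
names are just labels for the sketch):

```python
BLOCKS = 2**32          # 4 Gi blocks addressable with a 32 bit block number
BLOCK_BYTES = 512       # bytes per block, as stated above

limit_bytes = BLOCKS * BLOCK_BYTES
limit_tib = limit_bytes / 2**40
print(limit_tib)        # 2.0
```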
14. Linux uses lots of swap while I still have stuff in cache. Isn't
this wrong?
* (MRW) Not really. Linux will page out processes which
haven't been used for a long time (e.g. lpd on many systems) in favour
of retaining data from files which have been used recently (e.g.
header files while compiling a big program). This is more efficient.
Trust us, we're engineers.
15. Why don't we add resource forks/streams to Linux filesystems
like NT has?
* (REG) Resource forks (aka "named streams") are a way of
storing multiple "streams" of data in a file. Each stream may be read,
written and seeked just like a file with only one stream of data.
Resource forks are used to store ancillary data with files (such as
which icon to display for the file when using a graphical
filemanager). These extra streams of data may be manipulated by any
user who has write access to the file, just as the "primary" stream
can be manipulated.
Unix only supports one "stream" of data per file. Adding
support for multiple streams to the Linux kernel is not considered to
be especially difficult. However, files with multiple data streams
would break a large number of user-space programmes (which currently
only manipulate the "primary" stream) and protocols (such as ftp,
http, email, NFS and many more). A number of new utilities would need
to be written, and a large number of shell scripts would have to be
audited for correctness in a multiple-stream world. Because of this
massive breakage, many kernel developers consider resource forks to be
a bad idea.
Rather than add kernel support, a user-space library could
be written which provides easy management of multiple streams of data
for applications, while still storing the data in a single Unix file.
If someone wants to write such a library, please do so. Once it's
completed, send an email to the FAQ maintainer.
Note that the GNUstep/Foundation library has the NSBundle
class, which provides this functionality. A number of APIs to this
class for different languages are available:
o Objective-C has GNUstep at: http://www.gnustep.org/
o Java has JIGS at: http://www.gnustep.it/jigs/
o Smalltalk has StepTalk at: http://www.gnustep.org/experience/StepTalk.html
o Scheme has gstep-guile at:
http://www.tiptree.demon.co.uk/gstep/guile/gstep_guile_toc.html
Note that a separate problem is the storage of "extended
attributes". These are attributes like file permissions (such as ACLs
and POSIX capability sets), which have limited size, and tend to be
read and written atomically (i.e. you can't read or write part of the
attribute nor seek in it). These usually require special privileges to
modify. Also, you normally don't want to copy these attributes when
copying files around, thus these extended attributes don't present the
problems of massive breakage that resource forks would.
16. Why don't we internationalise kernel messages?
* (REG) There are several reasons why this should not be
done:
o It would bloat the kernel sources
o It would drastically increase the cost of
maintaining the kernel message database
o Kernel message output would slow down
o English is the language in which the kernel sources
are written, and thus is the language in which kernel messages are
written. Developers cannot be expected to provide translations
o Bug reports should be submitted in English, and that
includes kernel messages. If kernel messages were to be output in some
other language, most developers could not help in fixing bugs
o Translation can be performed in user-space; there is
no need to change the kernel
Finally, it will not be done. No core developer supports
this. Neither does Linus. Don't even ask.

Section 10 - "What's changed between kernels 2.0.x and 2.2.x" questions

1. Size (source and executable)?
* (REW) I use the following to quickly estimate the size of
a project:
cat `find . -name \*.c -o -name \*.h -o -name \*.S` | wc -l
I get 811985 (lines of code, including comments and blank
lines) when I run this on the 2.0.33 kernel source, and 1460508 when I
run this on a 2.1.106 kernel.
This means that the Linux kernel qualifies as an
"extremely large" software product, requiring the effort of 200 to 500
programmers for 5 to 10 years. [Richard Fairley: Software Engineering
Concepts, McGraw-Hill, 1985, page 11].
Actually, the Linux kernel is now 7 years old, and has
seriously involved 100 to 1000 programmers. (i.e. not counting those
that have contributed a "one line fix"). This is my personal guess, so
feel free to disagree or tell me otherwise.
* (ADB) I can't compare actual kernel footprints of 2.0.x
vs. 2.1.x, but I think it's worth mentioning that 2.1.x kernels have
the ability to "jettison" kernel initialization code, freeing the
corresponding physical memory. So, even though the executable is
certainly larger for 2.1.x kernels, you may actually get a smaller
memory footprint.
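The find/wc estimate given earlier in this answer can also be reproduced without a shell. The following is only a sketch: the function name is made up, and it counts newline characters (which is what wc -l does) in every .c, .h and .S file below a directory.

```python
import os

SOURCE_SUFFIXES = (".c", ".h", ".S")

def count_source_lines(top):
    """Count newline characters in every .c, .h and .S file below top."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            if name.endswith(SOURCE_SUFFIXES):
                # Read as bytes so odd encodings in old sources do not matter.
                with open(os.path.join(dirpath, name), "rb") as handle:
                    total += handle.read().count(b"\n")
    return total
```

Calling count_source_lines("/usr/src/linux") should give a figure comparable to the shell pipeline; like that pipeline, it counts comments and blank lines too.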
2. Can I use a 2.2.x kernel with a distribution based on a 2.0.x
kernel?
* (REW) Yes. However some applications may need upgrading.
Read /usr/src/linux/Documentation/Changes before you complain about
something not working. Also note that the 2.1.(x+1) kernel may need a
different set of upgrades than 2.1.x, so you should check the Changes
file every single time you upgrade your Linux kernel.
3. New filesystems supported?
* NTFS (read-only). Allows read-only access to Windows NT
(tm) partitions.
* Coda. Coda is an advanced experimental distributed file
system with features such as server replication and disconnected
operation for laptops. Note that Coda is also available for 2.0.x
kernels as an add-on package. Check the Coda Web site for more
information.
4. Performance?
* (REG) Here are some performance optimizations which are
only available on 2.2.x kernels:
o MTRRs. MTRRs (Memory Type Range Registers) are registers in PPro
and Pentium II CPUs which define memory regions with distinct
properties. The default mode for PCI memory accesses is "uncacheable",
which means memory and I/O addresses on a PCI peripheral are not
cached. For linear frame buffers, a better mode is "write-combining",
which allows the CPU to re-order and slightly delay writes to memory
so that they can be done in blocks. If you are writing to the PCI bus,
you then use PCI burst mode transfers, which are a few times faster.
o Finer grained locking. Most instances of the global
SMP spinlock have been replaced with finer grained locking. This gives
much better concurrency.
o User buffer checks. The old, painful way of checking
whether user buffers passed to syscalls were legal has been replaced
by a kernel exception handler. The kernel now assumes a buffer is OK.
If it is not, an exception handler catches the fault and returns
-EFAULT to user space. The advantage is that legal buffers no longer
need to be carefully checked, which is much faster. The old scheme
also suffered from race conditions under SMP.
o New directory entry cache (dcache). This makes file
lookups much faster.
Example: time find /usr -name gcc -print
2.1.104: cold cache: 0.180u 0.460s 0:15.02 4.2% 0+0k 0+0io 85pf+0w
2.1.104: warm cache: 0.100u 0.150s 0:00.25 100.0% 0+0k 0+0io 72pf+0w
2.0.33: cold cache: 0.100u 0.660s 0:14.87 5.1% 0+0k 0+0io 85pf+0w
2.0.33: warm cache: 0.090u 0.600s 0:00.69 100.0% 0+0k 0+0io 72pf+0w
Note /usr had 17750 files/directories. With a cold cache (no
disc blocks cached) there is very little difference. However, once the
cache is warm, we see a fourfold reduction in system time (0.600s down
to 0.150s). This is because inode lookups are not needed when a dcache
entry is available. Tests were performed on a Pentium/MMX 200.
5. New drivers not available under 2.0.x?
* (XXX) Please add your answer here...
6. What are those __initxxx macros?
* (KGB) __initfunc() for example is a macro used to put its
first argument (a function) into a special ELF section that is dropped
from memory once the driver's initialization is over.
So if you write an initialization function, whose code
will never be used again after your driver is initialized, you can use
__initfunc() around its declaration in order to reduce your kernel
memory footprint by a few KB of memory. Similarly, __initdata() is
used for variables, arrays, strings, etc. For implementation details
and examples please consult the file include/linux/init.h from a 2.2.x
source tree.
The main idea here is that the kernel memory is not
swappable. Jettisoning useless code represents a nice way to save RAM.
7. I have seen many posts on a "Memory Rusting Effect". Under what
circumstances/why does it occur?
* (ADB) AFAIK the expression was coined by Bill Metzenthen,
who also provided many data points measuring this phenomenon. It
describes the not-so-sane behaviour of the MM layer in kernels 2.1.x
(where x is approx. > 50) in small memory systems (e.g. 8 MB
machines). Alan Cox recommends the following procedure to detect the
Memory Rusting Effect:
(if you have > 8 MB of RAM installed, first reboot with mem=8M
as a kernel boot parameter)

time make zImage
find / -print
time make zImage

Comparing the results of the time measurements will show
whether the memory has been rusted or not by the find.
Note that just comparing the time it takes to compile a
sample kernel under 2.0.x versus 2.1.x is the wrong way to measure the
Memory Rusting Effect. So don't post this kind of data to the list (we
already know it takes longer under 2.1.x in small memory machines).
The Memory Rusting Effect is being investigated by a
number of kernel hackers, and it seems practically solved as of kernel
2.1.111.
8. Why does ifconfig show incorrect statistics with 2.2.x kernels?
* (TJ) This is in linux/Documentation/Changes that comes
with the kernel sources:

"For support for new features like IPv6, upgrade to the latest
net-tools. This will also fix other problems. For example,
the format of /proc/net/dev changed; as a result, an older ifconfig
will incorrectly report errors."

9. My pseudo-tty devices don't work any more. What happened?
* (TJ) Support for ptys using a major number of 4 was
dropped in Linux 2.1.115. Replace your device files with ones using
the new major numbers, 2 and 3. They will work with later 1.3 versions
of Linux, and any 2.x version.
* (REG) If you use devfs, then this problem magically goes
away.
10. Can I use Unix 98 ptys?
* (TJ, with much information provided by H. Peter Anvin)
Yes, but only if you have a kernel and libc which support them, and if
your applications are written and compiled to use them. They will be
supported by Linux 2.2 and glibc 2.1. This is in Documentation/Changes
that comes with the kernel sources.
There is also the new standalone libpt by Duncan Simpson
which implements the Unix98 PTY API independently of libc (check the
Incoming directory on metalab.unc.edu/Linux and mirrors). You still
need to have your apps compiled to use this API, of course.
11. Capabilities?
* (TJ) There's a FAQ on capabilities under Linux at
ftp://ftp.guardian.no/pub/free/linux/capabilities/capfaq.txt.
12. Kernel API changes
* (REG) Some parts of the kernel API (programming interface)
have changed from v2.0 to v2.2. This is relevant to the authors of 3rd
party device drivers, filesystems and other code. So called "3rd
party" code is any kernel code which is not distributed with the
official kernel tarball that Linus distributes. A quick reference for
programmers wishing to port their code to v2.2 is available here. Note
that this document is not relevant for programmes running in user
space.
If you want to port your drivers to the 2.4 series kernel,
then read this, which tells you how to port code from 2.2 to 2.4.

Section 11 - Primer documents

1. What's a primer document and why should I read it first?
* (REG) From time to time various technical debates start on
the linux kernel list. Some of these are about quite important topics,
however often these debates are repeated every few months or so and
much of the same ground is covered each time around. Other times,
questions about how some part of the Linux kernel works are posted.
Often we see the same old questions time and time again. Don't get me
wrong: these are often reasonable questions, it's just that seeing
them over and over is something we'd rather avoid.
This section has some primer document links on various
topics that should be read before starting a debate or posing a
question (which itself can lead to a debate). This is not an attempt
to censor debate, rather, it's an attempt to get you familiar with the
current arguments so that you can contribute something new without
going over old ground. If it's just a question you have, hopefully we
can explain it clearly once, in a single document, and then point
everybody to it.
2. How about having I/O completion ports?
* (REG) The existing UNIX semantics - select(2) and poll(2)
- for polling for activity on FDs do not scale very well: the overhead
is too high with large numbers of FDs. Here is a primer document which
explains some of the problems and explores some solutions.
3. What is the VFS and how does it work?
* (REG) The VFS (Virtual FileSystem or Virtual Filesystem
Switch, depending on who you talk to) is basically the Linux
filesystem layer. It incorporates the dentry cache and standard UNIX
file semantics. It also contains a "switch" to specific filesystem
types (ext2, vfat, iso9660 and so on), which is why Linux supports so
many different filesystems. Read this VFS primer document if you want
to know more.
4. What's the Linux kernel's notion of time?
* (ADB) I have tried to put together some information on
this topic, which you can find here. Colin Plumb is working on new
code for the Linux kernel software clock.
5. Is there any magic in /proc/scsi that I can use to rescan the
SCSI bus?
* (TJ) The text below is from drivers/scsi/scsi.c.
/*
 * Usage: echo "scsi add-single-device 0 1 2 3" >/proc/scsi/scsi
 * with "0 1 2 3" replaced by your "Host Channel Id Lun".
 * Consider this feature BETA.
 * CAUTION: This is not for hotplugging your peripherals. As
 * SCSI was not designed for this you could damage your
 * hardware !
 * However perhaps it is legal to switch on an
 * already connected device. It is perhaps not
 * guaranteed this device doesn't corrupt an ongoing data transfer.
 */
For a typical discussion of this topic, see
http://jpj.net/~trevor/linux/rescan_scsi.txt.gz.
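For scripting, the command text from the comment above can be built programmatically. This is only a sketch: the function names and the proc_path parameter are made up for illustration, and the cautions in the quoted comment apply just the same.

```python
def add_single_device_command(host, channel, scsi_id, lun):
    """Build the "Host Channel Id Lun" command text for one device."""
    return "scsi add-single-device %d %d %d %d" % (host, channel, scsi_id, lun)

def request_scan(host, channel, scsi_id, lun, proc_path="/proc/scsi/scsi"):
    """Write the command to proc_path, as echo would (needs root on a real system)."""
    with open(proc_path, "w") as handle:
        handle.write(add_single_device_command(host, channel, scsi_id, lun) + "\n")

print(add_single_device_command(0, 1, 2, 3))
```

Keeping the string builder separate from the write makes it possible to check the command before anything touches /proc.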

Section 12 - Kernel Programming Questions

1. When is cli() needed?
* (ADB) cli() is a kernel wide function that disables
maskable interrupts, whereas sti() is the equivalent function that
enables maskable interrupts. Some routines must be run with interrupts
disabled, because some peripherals need a guaranteed access sequence,
or because the routine is not reentrant and could be reentered from an
interrupt, etc. You should never use cli() in a user space
program/daemon.
* (REW) The use of cli() is no longer encouraged. On a
single processor, this simply clears an internal CPU flag, which is
ANDed with the Maskable Interrupt Request pin. On SMP systems it is
quite troublesome to keep ALL processors from servicing interrupts if
one processor wants to do something uninterrupted. Currently we try to
do locking on a much finer scale. For example, you should put a
spinlock on the record that describes THIS INSTANCE of the device that
needs the handling without accesses to other registers (e.g. from the
interrupt routine). Besides preventing the overhead of trying to keep
the other CPUs from handling interrupts, this allows the other CPUs to
service interrupts from a second card of the same type in the same
machine.
2. Why do I see sometimes a cli()-sti() pair, and sometimes a
save_flags-cli()-restore_flags sequence?
* (RRR) The cli()-sti() pair assumes that interrupts were
enabled when execution of the code began, and thus proceeds to
reenable them at the end. The save_flags-cli-restore_flags sequence
doesn't make this assumption. Since the interrupt flag is one of the
flags saved by save_flags(), it will be correctly restored to its
previous state by restore_flags(). This is critical for code that may
be called with interrupts either on or off.
Using save_flags-cli-restore_flags does incur a very
slight overhead compared to the cli()-sti() pair, which may be
significant for speed critical code (apart from being superfluous if
it's known a priori that the code will never be called with interrupts
off).
* (REG) Note that on UP systems cli(), sti() and
restore_flags() operate immediately. However, on SMP systems, these
functions may have to wait for the global IRQ lock (when another CPU
has disabled interrupts). Other than this difference, these functions
are SMP safe. It is also safe to call cli() multiple times on one CPU:
the global IRQ lock is only grabbed the first time.
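The difference between the two sequences can be illustrated with a toy model. This is not kernel code: ToyCPU and its methods are invented here and only track a single interrupt-enable flag (in the real kernel, save_flags() is a macro that stores into its argument rather than returning a value).

```python
class ToyCPU:
    """Toy model of one CPU's interrupt-enable flag (not kernel code)."""

    def __init__(self, interrupts_enabled=True):
        self.interrupts_enabled = interrupts_enabled

    def cli(self):
        self.interrupts_enabled = False

    def sti(self):
        self.interrupts_enabled = True

    def save_flags(self):
        return self.interrupts_enabled

    def restore_flags(self, flags):
        self.interrupts_enabled = flags


def critical_with_cli_sti(cpu):
    cpu.cli()
    # ... critical section ...
    cpu.sti()  # unconditionally re-enables interrupts


def critical_with_save_restore(cpu):
    flags = cpu.save_flags()
    cpu.cli()
    # ... critical section ...
    cpu.restore_flags(flags)  # puts the flag back as the caller left it


# Both callers enter with interrupts already disabled.
caller_a = ToyCPU(interrupts_enabled=False)
critical_with_cli_sti(caller_a)
caller_b = ToyCPU(interrupts_enabled=False)
critical_with_save_restore(caller_b)
print(caller_a.interrupts_enabled, caller_b.interrupts_enabled)  # True False
```

The first caller gets interrupts switched on behind its back; the second keeps the state it entered with, which is the property described in the answer above.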
3. Can I call printk() when interrupts are disabled?
* (REG) Yes, you can, although you should be careful. Older
kernels had the infamous cli()-sti() pair in printk(), so you would
get enabled interrupts when returning from printk(), whether printk()
was called with interrupts disabled or enabled; whereas recent kernels
(e.g. 2.1.107) restore the flags when printk() is finished. You have
to know which version of the kernel you are coding for. Read the
Source, Luke. Also note that in 2.2.x kernels, printk() grabs a
spinlock for SMP machines to avoid any possible deadlocks.
4. What is the exact purpose of start_bh_atomic() and
end_bh_atomic()?
* (REG, quoting Krzysztof G. Baranowski) To protect your
code from being interrupted by a bottom half handler. It is mostly
used in syscalls and functions called from userspace and is better
than a cli/sti pair, because most of the time there is no need to mask
interrupts at the hardware level.
5. Is it safe to grab the global kernel lock multiple times?
* (REG) Yes. The global kernel lock is recursive per
process. That means a process can grab the lock multiple times and not
deadlock. The lock is released when unlock_kernel() is called as many
times as lock_kernel() was called.
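The counting behaviour described above can be sketched with a toy class. This is an illustration only, not how the kernel implements the lock: ToyKernelLock and its owner argument are invented here, and a real lock would make other owners wait instead of raising an error.

```python
class ToyKernelLock:
    """Toy per-owner recursive lock, modelled on the description above."""

    def __init__(self):
        self.owner = None
        self.depth = 0

    def lock_kernel(self, owner):
        if self.owner is not None and self.owner != owner:
            raise RuntimeError("held by another owner; a real lock would wait")
        self.owner = owner
        self.depth += 1

    def unlock_kernel(self, owner):
        if self.owner != owner:
            raise RuntimeError("not the lock holder")
        self.depth -= 1
        if self.depth == 0:
            self.owner = None  # released only after the last unlock

    def is_held(self):
        return self.depth > 0


lock = ToyKernelLock()
lock.lock_kernel("process A")
lock.lock_kernel("process A")    # second grab by the same owner: no deadlock
lock.unlock_kernel("process A")
print(lock.is_held())            # True: one unlock is still outstanding
lock.unlock_kernel("process A")
print(lock.is_held())            # False
```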
6. When do I need to initialise variables?
* (REG) All variables should be initialised (implicitly or
explicitly) before they are read from. Automatic variables are placed
on the stack, and thus will have a random initial value. This means
that you need to manually initialise them.
Static variables are placed in the .bss section, which is
initialised to zero by the kernel (at the start of the boot sequence).
If the initial value of a static variable should be zero, you don't
need to do anything. If it should be a non-zero value, you will need
to initialise it. Note that you should not explicitly initialise a
static variable to zero, as this will increase the size of the kernel
image, which causes problems for embedded systems.

Section 13 - Mysterious kernel messages

1. What exactly does a "Socket destroy delayed" mean?
* (TJ, from a post by Henner Eisen) Sometimes you may get:
Jul 25 22:14:02 zero kernel: Socket destroy delayed (r=212 w=0)
in /var/log/messages.
It means that the kernel cannot free the internal data
structures associated with a released socket because there are still
socket data buffers (in the above case 212 bytes read memory)
accounted to the socket. For this reason, destroying is delayed and
tried again later. At some point, after the remaining sk_buffs
accounted to the socket are freed, destroying should succeed. Also:
"It keeps spitting that out about every 5 seconds or so. The only way
to fix it is to reboot. It doesn't happen very often, but I'd like to
find out what's causing it."
This might indicate that some kernel entity (i.e. a protocol module or
network device driver) which is responsible for freeing an sk_buff
fails to do so. To help track down the problem, try to find out under
which circumstances the messages start to appear (in particular, which
program closed a socket right before the messages appeared, which
network protocol it used, and which network device drivers are
involved).
2. What do I do about "inconsistent MTRRs"?
* (REG) Sometimes you may get:
mtrr: your CPUs had inconsistent ... MTRR settings
mtrr: probably your BIOS does not setup all CPUs

In English, using "had" as past or past perfect tense
commonly implies that the condition no longer exists. While it isn't
absolutely proper, it is very common. The MTRRs were inconsistent, but
they aren't anymore. The kernel fixed them up. Everything is fine
now.
3. Why does my kernel report lots of "DriveStatusError BadCRC"
messages?
* (REG, contributed by Mark Hahn) You may see messages like:

kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
kernel: hda: dma_intr: error=0x84 { DriveStatusError BadCRC }

In UDMA modes, each transfer is checksum'ed for integrity
(like Ultra2 SCSI, and more robust than normal SCSI's parity
checking). When a transfer fails this test, it is retried, and this
warning is reported. Seeing these warnings occasionally is not unusual
or even a bad thing - they just inflate your logs a little. If this
really bothers you, you can comment out the warning in the driver.
Seeing lots of these warnings (multiple per second) is
almost certainly a sign that your IDE hardware is broken. For
reference, all IDE cabling must:
o have a cable length of 18" or less
o have both ends plugged in (no stub)
o be 80-conductor cable if you're using a mode > udma33.
IDE modes are generally also generated from the system
clock, so if you're overclocking (for instance, running AGP at 75
MHz), you're violating IDE specs, and should not expect correct
behavior. Similarly, it's possible for your controller's driver to get
timing parameters wrong, but this is certainly not the first
explanation to adopt.
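To judge whether the warnings are occasional or arrive "multiple per second", the log can be tallied per timestamp. This is a rough sketch under stated assumptions: the function names are made up, the lines are assumed to start with a classic 15 character syslog timestamp ("Mon DD HH:MM:SS"), and the sample timestamps and host name are hypothetical.

```python
from collections import Counter

def badcrc_per_second(log_lines):
    """Map each syslog timestamp (one-second resolution) to its BadCRC count."""
    counts = Counter()
    for line in log_lines:
        if "BadCRC" in line:
            counts[line[:15]] += 1  # "Mon DD HH:MM:SS" prefix
    return counts

def looks_broken(log_lines, threshold=2):
    """True if any single second logged at least `threshold` BadCRC errors."""
    return any(n >= threshold for n in badcrc_per_second(log_lines).values())

sample = [
    "Jul 25 22:14:02 zero kernel: hda: dma_intr: error=0x84 { DriveStatusError BadCRC }",
    "Jul 25 22:14:02 zero kernel: hda: dma_intr: error=0x84 { DriveStatusError BadCRC }",
    "Jul 25 22:20:41 zero kernel: hda: dma_intr: error=0x84 { DriveStatusError BadCRC }",
]
print(looks_broken(sample))  # True: two errors share one second
```

The threshold of 2 is just one reading of "multiple per second"; if your syslog daemon collapses repeated messages, the counts will understate the real rate.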
4. Why does my kernel report lots of "APIC error" messages?
* (REG, contributed by Mark Hahn) You may get messages like:
APIC error on CPU1: 00(08).
APIC is the hardware that ia32 systems use to communicate
between CPUs to handle low-level events like interrupts and TLB
flushes. APIC messages are checksummed, and automatically retried when
they fail. This message indicates that a transaction failed; it's only
a problem when there are many of them. The APIC checksum is quite
weak, so even a few failures are a cause for concern, since they imply
that some corruption has likely gone undetected.

Assuming you're not forcing your motherboard to use an
invalid system clock (i.e. AGP other than 66 MHz), this is strictly a
physical design flaw in your motherboard. The Abit BP6 is notorious
for this flaw, but it's not unheard of on other boards (such as the
Gigabyte BXD), and it's possible on any board that uses APICs.

You can force the kernel not to use the APIC with
the "noapic" kernel option. This also forces CPU0 to handle all
interrupts.

Section 14 - Odd kernel behaviour

1. Why is kapmd using so much CPU time?
* (REG) Don't worry, it's not stealing valuable CPU time
from other processes. It's just consuming idle cycles (normally
charged to the idle task, which is displayed differently in top).
Normally, when your system is idle, the system idle task
is run, and this is shown as idle time (i.e. the "unused" CPU time is
not charged to a specific process). With APM (Advanced Power
Management), a special idle task (kapmd) is required so that greater
power saving techniques can be enabled. So now, the "unused" CPU time
is charged to the kapmd task instead.
2. Why does the 2.4 kernel report Connection refused when
connecting to sites which work fine with earlier kernels?
* (DW) The 2.4 kernel is designed to make your Internet
Experience more pleasurable. One of the ways in which it does so is by
implementing Explicit Congestion Notification - a new method defined
in RFC 3168 for improving TCP performance in the presence of
congestion by allowing routers to provide an early warning of traffic
flow problems.
Unfortunately, there are bugs in some firewall products
which cause them to reject incoming packets with ECN enabled. If your
own firewall is broken in this respect, you should check with your
vendor for a fix.
If the site to which you cannot connect is not under your
control, then after you have contacted the administrator of the
offending site to let them know about their problem, you can disable
ECN in the 2.4 kernel either by disabling the CONFIG_INET_ECN option
and recompiling the kernel, or by executing the following command as
root:
# echo 0 > /proc/sys/net/ipv4/tcp_ecn
* (REG) Fixes are available from some router vendors, and
have been since at least mid-2000. These are not "feature
patches" (which may add new features and have new bugs), but purely
bug fixes, and thus should be safe to use, even for the most paranoid.
If you have problems connecting to a site, please contact their
support. Note that some major sites are known to have lied about fixes
from their router/firewall vendor, so if you hear the excuse "we are
waiting on a fix from our vendor", be skeptical. While there is a
workaround available (see above), it is important to encourage sites
and ISPs to be ECN tolerant. This doesn't mean that these sites need
to support ECN (although it's in their interests), but they need to
fix buggy routers so that ECN-enabled systems can fall back to non-ECN
mode, rather than having refused or timed out connections. The
specific RFC that these buggy routers are violating is: RFC 793.
vger.kernel.org is running an ECN-enabled kernel. This
means if your email account is with an ISP which has a buggy router,
you will not be able to receive linux-kernel mail (as well as other
mailing lists hosted on vger). You should check if your ISP is ECN
tolerant, and get them to fix their routers or switch to another ISP.

Patches for the following products are available:
o CISCO PIX. Patch available for download here. Patch
information:

Bug ID: CSCds23698
Headline: PIX sends RSET in response to tcp connections with ECN bits set
Product: PIX
Component: fw
Severity: 2
Status: R [Resolved]
Version Found: 5.1(1)
Fixed-in Version: 5.1(2.206) 5.1(2.207) 5.2(1.200)

o CISCO Local Director. Patch available for download
here. Patch information:

Bug ID: CSCds40921
Headline: LD rejects syn with reserved bits set in flags field of TCP hdr
Product: ld
Component: rotor
Severity: 3
Status: R [Resolved]
Version Found: 3.3(3)
Fixed-in Version: 3.3.3.107

Further information may be obtained from http://gtf.org/garzik/ecn/
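The /proc/sys/net/ipv4/tcp_ecn workaround mentioned in this answer can also be checked and applied from a script. A minimal sketch, with made-up function names; the path parameter exists only so the code can be tried against an ordinary file.

```python
ECN_SYSCTL = "/proc/sys/net/ipv4/tcp_ecn"

def ecn_enabled(path=ECN_SYSCTL):
    """Return True if the sysctl file holds a non-zero value."""
    with open(path) as handle:
        return int(handle.read().strip()) != 0

def disable_ecn(path=ECN_SYSCTL):
    """Equivalent of `echo 0 > path`; needs root on a real system."""
    with open(path, "w") as handle:
        handle.write("0\n")
```

As with the echo command, the setting does not survive a reboot, so it would need to be reapplied from a boot script.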
3. Why does the kernel now report zero shared memory?
* (REG, contributed by Erik Mouw) Yes, the processes still
share memory, but due to changes to the VM in 2.4 it became too CPU
intensive to calculate the total amount of shared memory. In order not
to break the userland tools, the "MemShared" field in /proc/meminfo
was set to 0.
4. Why does lsmod report a use count of -1 for some modules? Is
this a bug?
* (REW) There are several possibilities. First:
* (DW) No, this is not necessarily a bug. A module may
report a use count of -1 if it has a can_unload function, which is
called when necessary by the system to determine if it is safe to
unload the module.
* (REW) But then again, it could be a bug anyway. In that
case, you'd normally see the usage count at 0 (or more when it's
actually used), and when "something" happens, the usage may drop below
zero. If you can repeat this, please drop the driver maintainer an
Email. Some modules lack the code to unload. They will deliberately
set their usage count to -1 to prevent unloading.
5. Why doesn't the kernel see all of my RAM?
* (REG, based on contribution from Mark Hahn) Some
distributions (like RedHat 6.1) are quite old, and use a 2.2 kernel
which has not fundamentally changed since mid-to-late 1998. Way back
then, the safe thing for the kernel to do was trust the standard BIOS
memory detection mechanism. That BIOS call returns memory size as a 16
bit count of 1 KiB chunks, leading to a 64 MiB limit. Modern kernels
(2.4 is the current stable kernel) use more modern BIOS calls that can
detect all your memory, and even keep track of which memory is used by
the BIOS itself. So your best option is to install a modern kernel.
You can work around the 64 MiB limit with obsolete kernels by telling
the kernel how much memory you have, using the mem= boot argument.
For example, if you have 128 MiB of RAM, you would type mem=128M at
the lilo prompt, or you can have lilo use the argument automatically
(add append="mem=128M" to your /etc/lilo.conf file).
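The 64 MiB ceiling is simply the largest value a 16 bit counter of 1 KiB chunks can hold. A small sketch of that arithmetic, plus a helper that formats the mem= argument (both function names are invented for illustration):

```python
def bios_limit_kib(counter_bits=16, chunk_kib=1):
    """Largest memory size reportable by a counter of the given width."""
    return ((1 << counter_bits) - 1) * chunk_kib

def mem_argument(installed_mib):
    """Format the boot argument that tells the kernel how much RAM exists."""
    return "mem=%dM" % installed_mib

limit = bios_limit_kib()
print(limit, "KiB, about", round(limit / 1024), "MiB")
print(mem_argument(128))
```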
6. I've mounted a filesystem in two different places and it worked.
Why?
* (AV, paraphrased by William Stearns) Because you've asked
the kernel to do that. Yes, it works. No, it's not a bug. To unmount
it from either mountpoint, simply run umount <mountpoint>. Repeat for
each mountpoint on which you do not wish the filesystem mounted.

Section 15 - Programming Religion

1. Why is the Linux kernel written in C/assembly?
* (ADB) For many reasons, some practical, others
theoretical. The practical reasons first: when Linus began writing
Linux, what he had available was a 386, Minix (a minimal OS designed
by Andrew Tanenbaum for OS design teaching purposes) and gcc. The
theoretical reasons: some small parts of any OS kernel will always be
written in assembly language, because they are too dependent on the
hardware to be coded in C; for example, CPU and virtual memory setup.
Or because we are dealing with very short routines that must be
implemented in the fastest possible code e.g. the stubs for the "top
half" interrupt handlers. WRT C, OS designers (since Thompson and
Ritchie first wrote UNIX) have traditionally used C to implement as
many OS kernel routines as possible. In this sense C can be considered
the "canonical" language for OS kernel implementation, and
particularly for UNIX variants.
2. Why don't we rewrite it all in assembly language for processor
Mega666?
* (ADB) Basically because we wouldn't gain much in terms of
efficiency, but would lose a lot in terms of ease of maintenance and
readability of the source code. Gcc is actually quite efficient, when
we look at the assembler code generated. You are referred to Andrew
Tanenbaum's book "Structured Computer Organization", 3rd ed., pages
401-404, for a more detailed comparison of the use of high level
languages vs. assembly language in the implementation of OS's. There
are a number of references on the subject at the end of the book, too.
3. Why don't we rewrite the Linux kernel in C++?
* (ADB) Again, this has to do with practical and theoretical
reasons. On the practical side, when Linux got started gcc didn't have
an efficient C++ implementation, and some people would argue that even
today it doesn't. Also there are many more C programmers than C++
programmers around. On theoretical grounds, examples of OS's
implemented in Object Oriented languages are rare (Java-OS and Oberon
System 3 come to mind), and the advantages of this approach are not
quite clear cut (for OS design, that is; for GUI implementation KDE is
a good example that C++ beats plain C any day).
* (REW) In the dark old days, in the time that most of you
hadn't even heard of the word "Linux", the kernel was once modified to
be compiled under g++. That lasted for a few revisions. People
complained about the performance drop. It turned out that compiling a
piece of C code with g++ would give you worse code. It shouldn't have
made a difference, but it did. Been there, done that.
* (REG) Today (Nov-2000), people claim that compiler
technology has improved so that g++ is no longer a worse compiler
than gcc, and so feel this issue should be revisited. In fact, there
are five issues. These are:
o Should the kernel use object-oriented programming
techniques? Actually, it already does. The VFS (Virtual Filesystem
Switch) is a prime example of object-oriented programming techniques.
There are objects with public and private data, methods and
inheritance. This just happens to be written in C. Another example of
object-oriented programming is Xt (the X Intrinsics Toolkit), also
written in C. What's important about object-oriented programming is
the techniques, not the languages used.
o Should the kernel be rewritten in C++? This is
likely to be a very bad idea. It would require a very large amount of
work to rewrite the kernel (it's a large piece of code). There is no
point in just compiling the kernel with g++ and writing the odd
function in C++; this would just result in a confusing mix of C and
C++ code. Either the kernel is left in C, or it's all moved to C++.
To justify the enormous effort in rewriting the
kernel in C++, significant gains would need to be demonstrated. The
onus is clearly on whoever wants to push the rewrite to C++ to show
such gains.
o Is it a good idea to write a new driver in C++? The
short answer is no, because there isn't any support for C++ drivers in
the kernel.
o Why not add a C++ interface layer to the kernel to
support C++ drivers? The short answer is why bother, since there
aren't any C++ drivers for Linux. However, if you are bold enough to
consider writing a driver in C++ and a support layer, be aware that
this is unlikely to be well received in the community. Most of the
kernel developers are unconvinced of the merits of C++ in general, and
consider C++ to generate bloated code. Also, it would result in a
confusing mix of C and C++ code in the kernel. Any C++ code in the
kernel would be a second-class citizen, as it would be ignored by most
kernel developers when changes to internal interfaces are made. A C++
support layer would frequently be broken by such changes (as
whoever is making the changes would probably not bother fixing the C++
code to match), and thus would require a strong commitment from
someone to regularly maintain it.
o Can we make the kernel headers C++-friendly? This is
the first step required for supporting C++ drivers, and on the face
seems quite reasonable (it is not a C++ support layer). This has the
problem that C++ reserves keywords which are valid variable or field
names in C (such as private and new). Thus, C++ is not 100% backwards
compatible with C. In effect, the C++ standards bodies would be
dictating what variable names we're allowed to have. From past
behaviour, the C++ standards people have not shown a commitment to
100% backwards compatibility. The fear is that C++ will continue to
expand its claim on the namespace. This would generate an ongoing
maintenance burden on the kernel developers.
Note that someone once submitted a patch which
performed this "cleaning up". It was ~250 kB in size, and was quite
invasive. The patch did not generate much enthusiasm.
Apparently, someone has had the temerity to label
the above paragraph as "a bit fuddy". So Erik Mouw did a short back-of-
the-envelope calculation to show that searching the kernel sources for
possible C++ keywords is a nightmare. Here is his calculation and
comments (dates April, 2002):

      % find /usr/src/linux-2.4.19-pre3-rmap12h -name "*.[chS]" |\
        xargs cat | wc -l
      4078662

      So there's over 4 million lines of kernel source. Let's assume 10% is
      comments, so there's about 3.6 million lines left. Each of those lines
      has to be checked for C++ keywords. Assume that you can do about 5
      seconds per line (very optimistic), work 24 hours per day, and 7 days
      a week:

                         5 s     1 hour      1 day      1 week
      3600000 lines * ------ * -------- * ---------- * -------- = 29.8 weeks
                       line     3600 s     24 hours     7 days

      Sounds like a nightmare to me. You can automate large parts of this,
      but you'll need to write a *very* intelligent search-and-replace tool
      for that. Better use that time in a more efficient way by learning C.

Note that this is the time required to do a proper
manual audit of the code. You could cheat and forgo the auditing
process, and instead just compile with C++ and fix all compiler
errors, figuring that the compiler can do most of the work. This would
still be a major effort, and has the problem that there may be uses of
some C++ keywords which don't generate a compiler error, but do
generate unintended code. In other words, introduced bugs. That is not
a risk the kernel development community is prepared to take.
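
To make the keyword problem concrete, here is a minimal sketch. The
structure and function below are hypothetical (they are not taken from the
kernel sources); they only show that names such as private, new, this and
class are ordinary identifiers in C, so a header like this compiles as C
but is rejected by a C++ compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical structure, not from the kernel. `private` and `new` are
 * plain identifiers in C but reserved words in C++. */
struct page_like {
    unsigned long private;      /* per-owner data; legal field name in C */
    struct page_like *new;      /* replacement entry; legal field name in C */
};

/* `this` and `class` are likewise plain identifiers in C.
 * Stores `class` in the private field and returns the previous value. */
static unsigned long swap_private(struct page_like *this, unsigned long class)
{
    unsigned long old;

    assert(this != NULL);
    old = this->private;
    this->private = class;
    return old;
}
```

Making such a header C++-friendly means renaming every field and parameter
of this kind, and every user of them, which is the maintenance burden
described above.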

My personal view is that C++ has its merits, and makes
object-oriented programming easier. However, it is a more complex
language and is less mature than C. The greatest danger with C++ is in
fact its power. It seduces the programmer, making it much easier to
write bloatware. The kernel is a critical piece of code, and must be
lean and fast. We cannot afford bloat. I think it is fair to say that
it takes more skill to write efficient C++ code than C code. Not every
contributor to the Linux kernel is an uber-guru, and thus will not
know the various tricks and traps for producing efficient C++ code.
* (REG) Finally, while Linus maintains the development
kernel, he is the one who makes the final call. In case there are any
doubts on what his opinion is, here is what he said in 2004:

In fact, in Linux we did try C++ once already, back in
1992.

It sucks. Trust me - writing kernel code in C++ is a
BLOODY STUPID IDEA.

The fact is, C++ compilers are not trustworthy. They were
even worse in 1992, but some fundamental facts haven't changed:
o the whole C++ exception handling thing is
fundamentally broken. It's _especially_ broken for kernels.
o any compiler or language that likes to hide things
like memory allocations behind your back just isn't a good choice for
a kernel.
o you can write object-oriented code (useful for
filesystems etc) in C, _without_ the crap that is C++.
In general, I'd say that anybody who designs his kernel
modules for C++ is either
o (a) looking for problems
o (b) a C++ bigot that can't see what he is writing is
really just C anyway
o (c) was given an assignment in CS class to do so.
Feel free to make up (d).
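
As an illustration of the point made several times above, that
object-oriented techniques can be written in plain C, here is a small
sketch in the style of an operations table. All names (file_like,
file_ops, zero_file) are hypothetical; they mimic the general shape of
such tables and are not the kernel's actual VFS definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct file_like;

/* "Virtual method table": one function pointer per method. */
struct file_ops {
    size_t (*read)(struct file_like *f, char *buf, size_t len);
};

/* "Base class": public method table plus data owned by the implementation. */
struct file_like {
    const struct file_ops *ops;
    void *private_data;
};

/* "Derived class": inherits by embedding the base as its first member. */
struct zero_file {
    struct file_like base;
    size_t reads;               /* extra state of the subclass */
};

/* Method implementation: fills the buffer with zeros and counts the call. */
static size_t zero_read(struct file_like *f, char *buf, size_t len)
{
    /* The downcast is valid because base is the first member. */
    struct zero_file *z = (struct zero_file *)f;

    memset(buf, 0, len);
    z->reads++;
    return len;
}

static const struct file_ops zero_ops = { zero_read };

/* "Constructor" for the derived type. */
static void zero_file_init(struct zero_file *z)
{
    z->base.ops = &zero_ops;
    z->base.private_data = NULL;
    z->reads = 0;
}

/* Generic code dispatches through the table without knowing the type. */
static size_t file_read(struct file_like *f, char *buf, size_t len)
{
    assert(f != NULL && f->ops != NULL && f->ops->read != NULL);
    return f->ops->read(f, buf, len);
}
```

A caller only ever sees struct file_like and file_read(); adding another
"subclass" means writing another ops table, with no change to the generic
code.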
4. Why is the Linux kernel monolithic? Why don't we rewrite it as a
microkernel?
* (REG) The short answer is why should we? The longer answer
is that experience has shown that microkernels have poor performance
compared to monolithic kernels. Microkernels have a fundamental design
problem, where different components of the kernel cannot interact
without passing a privilege barrier (which is expensive). Microkernel
advocates claim this is a feature, as it increases modularity and
protects one part of the kernel from another. Whether this is a
feature or a mis-feature is in the eye of the beholder, but it is
clear that there is a performance cost inherent in the microkernel
design. This is a cost the Linux kernel developers (and apparently,
the users) are unwilling to bear.
There are projects which have ported the Linux kernel to
generic microkernels (such as Mach3), usually making Linux a
"personality". There are also other projects to create microkernel-
based Unix-like implementations. Here is a short list:
o MkLinux was funded by Apple, and runs Linux on
PowerPC Macs. It is available at: http://www.mklinux.org/. An x86
version is also available. Note that there is now a native Linux
kernel for the PowerPC which is much faster, and is actively
maintained. MkLinux has become a historical footnote.
o The Hurd is a microkernel-based Unix, and is
supposed to be the promised GNU kernel. It sits on top of Mach3. The
Debian Project provides a full distribution for the Hurd.
o FIASCO is another project for creating MicroKernel
LINUX. See http://os.inf.tu-dresden.de/fiasco/ for details.
There is a historical Usenet thread related to this
subject, dating back from 1992, with posts from Linus, Andrew
Tanenbaum, Roger Wolff, Theodore Y. Ts'o, David Miller and others. Nice
reading on a rainy afternoon. It's fascinating to see how some
predictions (which seemed rather reasonable at the time) have proved
wrong over the years (for example, that we would all be using RISC
chips by 1998).
5. Why don't we replace all the goto's with C exceptions?
* (REG) Admittedly, all those goto's do look a bit ugly.
However, they are usually limited to error paths, and are used to
reduce the amount of code required to perform cleanup operations.
Replacing these with Politically Correct if-then-else blocks would
require duplication of cleanup code. So switching to if-then-else
blocks might be good Computer Science theory, but using goto's is good
Engineering. Since the Linux kernel is one designed to be used, rather
than to demonstrate theory, sound engineering principles take
priority.
So now we come to the suggestion for replacing the goto's
with C exception handlers. There are two main problems with this. The
first is that C exceptions, like any other powerful abstraction, hide
the costs of what is being done. They may save lines of source code,
but can easily generate much more object code. Object code size is the
true measure of bloat. A second problem is the difficulty in
implementing C exceptions in kernel-space. This is covered in more
detail below.
* (REG, quoting Keith Owens) The exceptions patch has to use
assembler to walk the stack frames. Exceptions are being touted as a
replacement for goto in new driver code but the sample patch only
works for i386. No arch independent code can use exceptions until you
have arch specific code that does the equivalent of longjmp for _all_
architectures.
Doing longjmp in the kernel is _hard_, I know because I
had to do it for kdb on i386 and ia64. The kernel does things
differently from user space and sometimes the arch maintainers decide
to change the internal register usage. They are allowed to do this
because it only affects the kernel, but any change to kernel register
usage will probably require a corresponding change to setjmp/longjmp.
So you have arch dependent code which has to be done for
all architectures before any driver can use it and the code has to be
kept up to date by each arch maintainer. Tell me again why the
existing mechanisms are not working and why we need exceptions? IOW,
what existing problem justifies all the extra arch work and
maintenance?
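
The goto-based cleanup idiom discussed in this answer can be sketched as
follows. This is a user-space illustration with hypothetical names
(device_like, grab, device_setup), not code from any driver. The fail_at
parameter exists only so the error paths can be exercised. Each label
undoes exactly the steps that succeeded before the failure, so the cleanup
code is written once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct device_like { void *regs; void *ring; void *irq; };

/* Pretend allocation: step number `fail_at` is forced to fail. */
static void *grab(int step, int fail_at)
{
    return step == fail_at ? NULL : malloc(16);
}

/* Acquires three resources in order. Returns 0 on success, -1 on failure
 * with everything already acquired released again. */
static int device_setup(struct device_like *d, int fail_at)
{
    assert(d != NULL);
    d->regs = d->ring = d->irq = NULL;

    d->regs = grab(1, fail_at);
    if (!d->regs)
        goto out;
    d->ring = grab(2, fail_at);
    if (!d->ring)
        goto free_regs;
    d->irq = grab(3, fail_at);
    if (!d->irq)
        goto free_ring;
    return 0;                   /* success: caller now owns all three */

free_ring:                      /* unwind in reverse order of acquisition */
    free(d->ring);
    d->ring = NULL;
free_regs:
    free(d->regs);
    d->regs = NULL;
out:
    return -1;
}

/* Releases everything acquired by a successful device_setup(). */
static void device_teardown(struct device_like *d)
{
    free(d->irq);
    free(d->ring);
    free(d->regs);
    d->irq = d->ring = d->regs = NULL;
}
```

Written with nested if-then-else blocks, the free() calls for regs and ring
would have to be repeated in each failure branch.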
6. Why are the kernel developers so dismissive of new techniques?
* (REG) This is a complaint that is raised periodically,
usually shortly after some debate or flamewar following on from a
suggestion to use a "new" technique. Often one or more noted kernel
developers will shoot down the idea with a dismissive "that's a dumb
idea" or "all pain, no gain", without a detailed explanation of why
it's a bad idea. This does indeed look arrogant and dismissive, and
gives the impression that the kernel developers are a pack of old dogs
unwilling to learn new tricks. This perception is compounded by
proclamations made by various computer science teachers about the
positive value of the proposed new technique.
It should be noted, however, that kernel developers are
exceptionally busy people, and generally prefer writing code to
engaging in lengthy discussions about why some idea is not good (at
least for the kernel). Further, it's fairly likely that the "new"
technique that is being proposed has already been evaluated, and found
to be inadequate/inappropriate for the kernel. Or perhaps the
developer has had prior experience with this technique and found it
lacking.
If you are convinced that your favourite technique has
value, you have to prove it. You can't demand that other people spend
the time explaining to you why they think it's a bad idea. You have to
do the hard work yourself to show you're right. Code up a patch and
benchmark it compared to the standard kernel. Be prepared to defend
your patch in a broader context, and demonstrate that it doesn't have
costly side-effects. Remember that many micro-optimisations result in
macro slowdowns.
Finally, some personal advice. Coding up a controversial
patch and proving you're right is a time-consuming task. Because of
this, avoid pushing ideas which you read in a book or heard from some
CS notable. Stick to pushing ideas with which you have either had prior
experience, or which you have spent a lot of time thinking about. This will
increase your chances of picking a winner, and decrease your
frustration levels.

Section 16 - User-space programming questions

1. Why does setsockopt() double SO_RCVBUF?
* (REG) This yields behaviour similar to that of the *BSD versions of
Unix. Andi Kleen has stated:

Linux counts internal headers in the buffer. BSD does not. This
is a heuristic for compatibility (half of the buffer reserved for
headers). To compensate the TCP window offering is halved.
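
The doubling can be observed from user space with a small round trip
through setsockopt() and getsockopt(). The helper name rcvbuf_roundtrip is
hypothetical. On a typical Linux system the value read back is twice the
requested size, provided the request is below the net.core.rmem_max limit.
Other systems may simply return the requested value.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Requests a receive buffer of `requested` bytes on a fresh TCP socket and
 * returns the size the kernel reports back, or -1 on any error. */
static int rcvbuf_roundtrip(int requested)
{
    int got = -1;
    socklen_t len = sizeof got;
    int fd;

    assert(requested > 0);
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &requested, sizeof requested) < 0 ||
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len) < 0)
        got = -1;
    close(fd);
    return got;
}
```

Code that sizes its reads from the value returned by getsockopt() should
therefore not assume it equals the value it passed to setsockopt().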

Contributing
Contributions are welcome on this FAQ. These can be submitted,
preferably in diff -u format, (against this HTML document source) by
Email to Richard (see the Contributors section above).

Sometimes, we may feel your contribution is controversial and/or
incomplete and/or could be improved somehow. Also, the turnaround time
has a wide range, from hours to months, depending on how busy Richard
is. Please do not email him to chase changes as it slows him down.
Suggestions and patches are queued, and will be processed eventually.
Acknowledgements are usually sent when the change is made. Please be
patient, FAQ updates are rarely urgent. Note that small, "obviously
correct" patches are more likely to be processed faster, and often
jump the queue ahead of larger patches.

Last updated on 17 Oct 2009 by Richard Gooch. This document is GPL'ed
by its various contributors.

玄鹤

Nov 16, 2009, 10:12:41 AM11/16/09

to freesky

http://linux.sheup.com/linux/39/linux25236.htm

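Revision history A. Current FAQ maintainers:
    bx_bird      -> interrupt management (updated)
    freshground  -> memory FAQ
    sirx         -> process management and threads, kernel synchronization
    garycao      -> kernel introduction and basic configuration, kernel
                    debugging, kernel programming
    xshell       -> kernel module programming, boot and initialization

Contents:
    Part 1   About the Linux kernel FAQ
    Part 2   Index
    Part 3   Forum questions
    Part 4   Basic Linux documentation
    Part 5   Members
    Part 6   Linux kernel
             1. Kernel introduction and basic configuration
             2. Boot and initialization
             3. Memory management
             4. Kernel synchronization
             5. Process management and threads
             6. Interrupts
             7. Kernel module programming
             8. PCI
             9. Networking
             10. File systems
             11. Hardware
             12. Kernel debugging
    Part 7   Linux device drivers
    Part 8   Linux kernel programming
    Part 9   Linux resources
    Part 10  Notes on the compilation of this FAQ

Part 1 About the Linux kernel FAQ
The Linux kernel FAQ collects the common kernel-related questions that
have been discussed in the kernel technology board of the China Linux
Forum. When you run into a problem while working with the kernel, check
here first to see whether it has already been discussed. If you still
have questions after reading, or the FAQ does not cover your problem,
please leave a message at KernelTech_CN.

Part 2 Index

Part 3 Forum questions

Part 4 Basic Linux documentation
    * Linux Device Drivers, 2nd edition
    * Understanding the Linux Kernel (English edition), Daniel P. Bovet &
      Marco Cesati, O'Reilly
    * Intel's x86 architecture manuals
    * The official Linux documentation site
    * Linux assembly resources
    * Related links at tux.org
    * The Linux Kernel Hackers' Guide
    * The Linux Kernel
    * The Linux FAQ
    * The Linux Kernel HOWTO
    * Kernelhacking-HOWTO
    * Various Linux HOWTOs on specific questions
    * BogoMips mini-HOWTO by Wim van Dorst
    * the network drivers by Donald Becker
    * the Linux Alpha HOWTO by Neal Crook
    * www.kernelnewbies.org
    * How to ask questions so that you get answers faster
    * How to report bugs effectively

Part 5 Members
Listed here are the members taking part in compiling the FAQ, with their
introductions (to be added): freshgound, sirx, bx_bird, garycao, xshell

Part 6 Linux kernel

I Kernel introduction and basic configuration
(lkml means the Linux Kernel Mailing List)

1. Q: What is an experimental kernel?
   A: (lkml) Linux kernel versions are split into two development trees:
   the experimental tree (odd version numbers, such as 1.3.xx or 2.1.x)
   and the stable tree (even version numbers, such as 1.2.xx or 2.0.xx).
   The experimental tree changes quickly and is used to test new
   features, algorithms, device drivers and so on. Experimental kernels
   may behave strangely, which can lead to data loss, machine lockups and
   the like.
2. Q: What is a stable kernel?
   A: (lkml) The stable tree has well-defined features, fewer bugs and
   reliable drivers. Although it is updated much more slowly than the
   experimental tree, there are usually some "best versions" that are
   better than the others. Linux distributions are usually based on these
   "best versions", not necessarily on the latest one.
3. Q: What does a kernel version number f.g.hhprei mean?
   A: (lkml) It is an intermediate release of Linux version f.g.hh.
   Usually i < 5, but there are exceptions: 2.0.34prei had i = 1 to 16.
   Sometimes "pre" is replaced by the initials of the maintainer who
   released it. For example, 2.1.105ac4 is Alan Cox's fourth intermediate
   release of kernel 2.1.105.
4. Q: Where can I download the latest Linux kernel source?
   A: (lkml) The main site for Linux kernel source (experimental and
   stable trees) is http://www.kernel.org/, hosted by Transmeta (the
   company Linus Torvalds works for). The site has mirrors in many
   countries. You can reach the mirror for your country at
   http://www.CODE.kernel.org/, where CODE is the country code. For
   example, "cn" is the country code for China, so the main Chinese
   mirror is http://www.cn.kernel.org/.
   You can also use FTP at ftp://ftp.CODE.kernel.org/pub/linux/kernel/ to
   get kernel patches; this is where Linus releases his kernels. Other
   well-known kernel hackers have their own directories under "people",
   where they can store their kernel patches. "testing" is where Linus
   puts his pre-release patches. Pre-release patches are mainly for other
   developers, so that they can stay in sync with Linus's tree. A kernel
   with these patches applied may be even more dangerous than an
   experimental kernel: it may crash, or even corrupt your file systems.
   Be particularly aware of this risk when using them.
5. Q: Where can I download additional Linux kernel patches?
   A: (lkml) Many places provide additional patches that add new
   features to the Linux kernel. Here is a good site.
6. Q: What is a patch?
   A: A patch file (here, a Linux kernel patch) is an ASCII text file
   containing the differences between the original code and the new code,
   plus some extra information such as file names and line numbers. The
   patch program (man patch) applies it to an existing kernel source
   tree.
7. Q: How do I make a Linux kernel patch?
   A: (lkml) Use the diff program (info diff). The simplest way is to
   keep two source trees under /usr/src and create a link
   "/usr/src/linux" pointing to the modified tree, then run diff to
   compare the two trees. The file
   /usr/src/linux/Documentation/CodingStyle has more details; read it.
   Remember:
     * Always use the unified (-u) diff format.
     * Do not reformat the source, so that the diff does not grow
       needlessly. Check your editor settings: do not convert tabs to
       spaces or vice versa.
     * Unless you have a special reason, make your patch against the
       latest official source tree; otherwise it is likely to be
       ignored. Failing that, state in your post which Linux version the
       patch is against.
     * Make sure the patch contains only the changes it needs, not every
       change you have made to the tree. A patch is usually confined to
       a few files or directories, and it is best to diff only the
       relevant files.
   For example, if I only modified driver_xyz.c in drivers/net, I would
   use the following commands (assuming the original tree is in
   "linux-2.4.18" and the "linux" link points to the modified tree):

     cd /usr/src
     diff -u linux-2.4.18/drivers/net/driver_xyz.c \
         linux/drivers/net/driver_xyz.c > my_patch

8. Q: How do I apply a patch?
   A: (lkml) (From /usr/src/linux/README) You can upgrade your Linux
   source by applying patches. Patches are distributed in the traditional
   gzip format or the newer bzip2 format. To upgrade by patching, get all
   the newer patch files, enter the directory containing the kernel
   source you want to upgrade, and execute:

     gzip -cd patchXX.gz | patch -p0

   or:

     bzip2 -dc patchXX.bz2 | patch -p0

   (repeat xx, in order, for every version newer than your current
   source tree) and you are done. You may want to remove the backup files
   (xxx~ or xxx.orig), and make sure there are no failed patches (xxx# or
   xxx.rej). If there are, either you or I have made a mistake.
   You can also use patch-kernel to automate these steps. It determines
   the current kernel version and applies every patch it finds:

     linux/scripts/patch-kernel linux

   The first argument is the kernel source directory. Patches are taken
   from the current directory, or from the directory given as the second
   argument.
   See the "Installing the kernel" section of the kernel README
   (/usr/src/linux/README) for more information. The Linux HQ Project
   site also has a good explanation of patches.
9. Q: Everybody talks about the "CVS tree at vger". What is vger, and
   what is this CVS tree for?
   A: (lkml) vger.kernel.org is a mailing list server and web site
   serving the Linux community. A CVS server for a Linux kernel
   development tree is at http://vger.samba.org/ (translator's note: it
   no longer seems to be reachable?); it is a mirror of the latest main
   source tree at vger.kernel.org. Note that the CVS tree is not the
   official source tree; it was set up mainly for the convenience of some
   senior Linux kernel hackers. The official Linux source tree,
   maintained by Linus Torvalds, lives at http://www.kernel.org/, which
   has mirrors around the world. David Miller (maintainer of the vger CVS
   tree) periodically submits patches generated from the CVS tree to
   Linus. The vger CVS tree also holds experimental and test patches for
   Sparc, Sparc64 and some networking code.
10. Q: Where can I find more information about CVS?
    A: (lkml) Many GNU/Linux distributions include CVS. You can also
    visit the CVS Bubbles page.
11. Q: Is there a CVS tutorial?
    A: (lkml) Here is one you can visit: An interactive CVS tutorial.
    Fifteen minutes on this site will give you a rough idea of the CVS
    workflow (recommended). Since many graphical front ends for CVS now
    exist, you need not learn the cryptic CVS commands.
12.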
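Q: How do I get my patch into the official Linux kernel?
    A: (lkml) There are several ways, depending on what your patch
    contains. First, find out who maintains the code you changed (see the
    MAINTAINERS file).
      * If your patch is just a small bug fix and you are sure it is
        "obviously correct", send it to the appropriate maintainer and
        post it to lkml.
      * If it is an urgent bug fix (for example, a major security hole),
        you can also send it directly to Linus, but remember that he may
        ignore random patches unless he considers them "obviously
        correct".
      * If your patch is large, for example a big rewrite or a new
        device driver, save network bandwidth and disk space by sending
        lkml a message describing the patch, with a link to it.
      * Finally, if you are not sure the patch is correct and want
        feedback from a maintainer, use private email.
    If the part of the kernel your patch touches has no clear
    maintainer, you have three options:
      * Send it to linux-...@vger.kernel.org and hope that someone picks
        it up and submits it to Linus, or that Linus himself sees it
        (don't count on it).
      * Send it to linux-kernel and Cc: Linus Torvalds
        <torv...@transmeta.com>, hoping that Linus applies it. Note that
        nobody knows how Linus works, and he will not necessarily reply
        to you. Check the patches Linus releases to see whether he has
        applied yours. If he has not, you may need to resend it (often
        many times). If after weeks or months and many patch releases he
        still has not applied it, perhaps you should give up. He may not
        like it.
      * Send it to linux-kernel and Cc: Alan Cox. Alan is better at
        replying to mail. He will queue your patch and periodically
        forward it to Linus, so you no longer need to worry about being
        ignored. He generally has good taste: if Alan accepts your patch,
        Linus will probably accept it too. If he does not like your
        patch, you will probably get an email explaining why.
13. Q: Why does the Linux kernel tarball unpack into a directory called
    linux rather than linux-x.y.z?
    A: (lkml) Because Linus wants it that way. It makes applying
    successive patches easier, since the directory name does not change
    each time, and it makes Linus's life easier.
14. Q: What is the difference between the official Linux kernel and Alan
    Cox's kernels (the ac patch series)?
    A: (lkml) Alan's kernels can be seen as a test version of Linus's
    kernel. Linus is conservative and accepts only obvious, well-tested
    patches for 2.4, whereas Alan maintains a series of patches that
    contain new ideas, more and/or newer drivers, and more intrusive
    changes. Once these patches have proved to be stable, Alan submits
    them to Linus, and they generally go into the official Linux kernel.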
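15. Q: Who maintains the Linux kernel?
    A: (lkml) Originally Linus Torvalds maintained all kernels. As Linux
    matured, he delegated the maintenance of the older stable versions
    to other people while he continued to maintain the latest development
    version. As of 27 May 2002, the kernel versions were maintained by
    the following people:
      2.0  David Weinehall <t...@acc.umu.se>
      2.2  Alan Cox
      2.4  Marcelo Tosatti <mar...@conectiva.com.br>
      2.5  Linus Torvalds <torv...@transmeta.com>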
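16. Q: How do I build my own kernel?
    A: (linux天字一号) Let's go step by step.
    ********** step 1: build a new kernel **********
    1. Download the latest stable kernel from
       http://www.kernel.org/pub/linux/kernel/v2.4/linux-2.4.12.tar.gz.
    2. Move the old directory out of the way with
       mv /usr/src/linux /usr/src/linux.old, then unpack the source with
       tar xfzv linux-2.4.12.tar.gz -C /usr/src (the tarball unpacks
       into a directory named linux).
    3. Change to /usr/src/linux and run make menuconfig.
    4. You can save right away without changing anything. You can also
       try configuring the kernel options: most are self-explanatory,
       and for the rest you can read the help text.
    5. Run make dep.
    6. Run make bzImage to build a compressed kernel.
    7. When the build has finished, the new kernel bzImage is in
       /usr/src/linux/arch/i386/boot/.
    8. Run make modules. When the build has finished, copy
       /usr/src/linux/modules to /lib/modules/2.4.12 so that the system
       can load these drivers and modules automatically (on 2.4, make
       modules_install performs this installation step). If automatic
       loading does not work, you will have to load them with insmod.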
    ********** step 2: use the new kernel **********
    1. Copy the kernel to /boot with
       cp /usr/src/linux/arch/i386/boot/bzImage /boot.
    2. Edit the lilo configuration with vi /etc/lilo.conf so that it
       looks roughly like this:

         boot=/dev/hda
         map=/boot/map
         install=/boot/boot.b
         lba32                  # large disk support
         timeout=50             # boot prompt timeout
         default=linux          # default boot entry
         other=/dev/hda1        # the DOS operating system
             label=win
             table=/dev/hda
         image=/boot/vmlinuz
             label=linux
             root=/dev/hda2     # root file system of the old system
             read-only
         # Everything above already exists on the original system;
         # add the lines below.
         image=/boot/bzImage
             label=new          # name of the new system
             root=/dev/hda2     # root file system; not necessarily hda2,
                                # use the same one as the old system
             read-only

    3. Run lilo.

II Boot and initialization
1
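Q: Which code is involved in booting the kernel?
  A: Taking x86 as the example:
     arch/i386/boot/bootsect.S         the boot sector;
     arch/i386/boot/setup.S, video.S   initialize the hardware data from
                                       the BIOS, enter protected mode and
                                       load the kernel image;
     arch/i386/kernel/head.S           page table initialization.
2 Q: Can parameters be passed to the kernel at boot? Which ones?
  A: Yes. Normally you add them when lilo boots the chosen kernel, for
  example root=/dev/hda??. See the BootPrompt-HOWTO.
3 Q: Why can't the address range 0-64K be used during kernel boot?
  A: It is reserved for the BIOS data area. During boot, the boot code
  uses the BIOS data area and the related BIOS calls to obtain the
  hardware information needed for initialization. Late in the boot
  process the region can be overwritten, which also means that no BIOS
  calls can be made once the kernel has booted.
4 Q: Where is the dividing line, in K, between a big kernel and a normal
  kernel?
  A: A kernel of up to 508K is a normal kernel. It can be loaded directly
  at 0x10000 and then decompressed to 0x100000. Otherwise it is loaded at
  0x100000 and decompressed in place.
5 Q: Where does the figure of 508K come from?
  A: The boot code moves itself from where it was loaded, linear address
  0x7C00 (0x07C0:0000), to 0x9000:0000 (linear 0x90000, i.e. the 576K
  mark) and continues running there. A normal kernel therefore has to fit
  within 576K - 64K (reserved by the BIOS) - 4K parameter area (what
  exactly is stored there is open to question) = 508K. The parameter area
  holds the command line received at boot and the hardware information
  the BIOS passes to the kernel.
6 Q: How do I build the different kinds of kernel image?
  A: make bzImage and make zImage build the big kernel and the normal
  kernel respectively.
7 Q: Where are the command line parameters passed at boot stored?
  A: ?????
8 Q: How is memory laid out at the start of kernel boot?
  A:
          +------------------------------+
          | Reserved for BIOS            |  Do not use. Reserved for BIOS EBDA.
  0A0000  +------------------------------+
          | Reserved for BIOS            |  Do not use. Reserved for BIOS EBDA.
  09A000  +------------------------------+
          | Stack/heap/cmdline           |  For use by the kernel real-mode code.
  098000  +------------------------------+
          | Kernel setup                 |  The kernel real-mode code.
  090200  +------------------------------+
          | Kernel boot sector           |  The kernel legacy boot sector.
  090000  +------------------------------+
          | Protected-mode kernel        |  The bulk of the kernel image.
  010000  +------------------------------+
          | Boot loader                  |  <- Boot sector entry point 0000:7C00
  001000  +------------------------------+
          | Reserved for MBR/BIOS        |
  000800  +------------------------------+
          | Typically used by MBR        |
  000600  +------------------------------+
          | BIOS use only                |
  000000  +------------------------------+

III Memory management
(compiled by Freshground)

IV Kernel synchronization
1 Q: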
On a single-CPU system, should spin_lock be avoided? If it is used, does
  it reduce system performance?
  A: (minihorse)
  > I have a question related to spin locking on UP systems. Before that
  > I would like to point out my understanding of the background stuff
  > 1. spinlocks should be used in interrupt handlers
  It should be used in the interrupt handler if you need to prevent any
  race conditions with other interrupt/non-interrupt context code that
  may be executing on some other CPU on an SMP system. Thus spinlocks
  need to be held for as short a duration as possible. You would need to
  use the spin_lock_irqsave/spin_unlock_irqrestore variant pair to
  prevent your interrupt handler from running on the same processor while
  holding the lock. This may be needed if the interrupt handler may try
  to acquire the same lock, thus causing a deadlock.
  > 2. interrupts can preempt kernel code
  > 3. spinlocks are turned to empty when the kernel is compiled without
  > SMP support.
  >
  > If a particular driver is running (not the interrupt handler part)
  > and at this time an interrupt occurs, the handler has to be invoked
  > now. Won't the preemption cause race conditions/inconsistencies? Is
  > any other mechanism used? Please correct me if I have not understood
  > any part of this correctly.
  On a UP kernel the spin_lock_irqsave/spin_unlock_irqrestore pair
  expands to save_flags(flag); cli() / restore_flags(flag). The masking
  of interrupts on the processor between spin_lock_irqsave and
  spin_unlock_irqrestore prevents the user context code from being
  preempted by the interrupt handler.
2 Q. I want to protect a data structure in kernel code, so that only one
  process can operate on it at any moment. In a user process I could use
  a semaphore; what about in the kernel?
  A. (BNN) The critical region is a well-known issue which applies to
  many areas, including system software. Below are my 2 cents; I hope
  they are helpful.
  ------------------------------------------------------------------
  * Why is protection needed in kernel mode?
    In a word, all data structures are shared by all processes, including
    interrupt handlers.
  * How to protect critical data structures?
    Make sure the operations on those data structures are ATOMIC. For
    traditional operating systems (compared to real-time OSes), the
    kernel is non-preemptive. (We will consider the SMP model later.) In
    other words, within the kernel, you do NOT have to worry about the
    consistency of your critical data structures. The only thing we have
    to take care of is the interrupt handlers, including the timer. In
    other words, within kernel mode, an interrupt can still divert your
    execution flow. In the case that an interrupt has to work on some
    data structures which are shared by some kernel threads or functions,
    we then still have to use either
      * cli/sti (disable/enable interrupts),
      * raising the corresponding IPL level,
      * or some lock mechanism
    to keep the data structures consistent.
  Basically, different CPU architectures provide different **one clock
  cycle** instructions in order to achieve atomicity and thus implement
  mutex or semaphore mechanisms. In other words, those instructions
  provide primitives such as atomic_add, atomic_dec and test_and_set. I
  personally am more familiar with the powerpc arch. Within powerpc, an
  instruction pair called "reserve the memory bus" and "release the
  memory bus" is provided. With those two instructions, Linux can provide
  any ATOMIC operation. For x86, conceptually, the thing should be the
  same. Atomicity is very important for kernel data structures. However,
  the size of the atomic granularity is also very important. For a
  real-time operating system, which requires that the kernel also be
  preemptive, we need to make sure the critical area is protected in case
  another, higher-priority process/thread takes the CPU away.
3 Q. There are three processes P1, P2 and P3. P1 does P(mutex) first, so
  P2 and P3 block. When P1 releases the mutex with V(mutex), which of P2
  and P3 enters first, and by what rule?
  A. (xshell) First wait, first get (FWFG).
4 Q. A question about atomic operations:

    #define atomic_set(v,i) (((v)->counter) = (i))

    static __inline__ void atomic_add(int i, volatile atomic_t *v)
    {
            __asm__ __volatile__(
                    LOCK "addl %1,%0"
                    :"=m" (__atomic_fool_gcc(v))
                    :"ir" (i), "m" (__atomic_fool_gcc(v)));
    }

  Why does atomic_set(v,i) not need LOCK (or assembly)? And why is
  atomic_add() not written like this:

    #define atomic_add(i,v) (((v)->counter) += (i))

  A. (nxin) On current architectures an add takes at least three steps:
  fetch the data from memory, perform the addition, and store the result
  back to memory. It cannot complete in one clock cycle, so in a
  multiprocessor environment special assembly instructions are needed to
  guarantee atomicity. A set, on the other hand, only stores data to
  memory, and that operation is always atomic. Of course, if the data is
  longer than the machine word or is not aligned on a word boundary, the
  operation may not be atomic.

V Process management and threads
1. Q: How do I put a user process to sleep and wake it up from the
   kernel?
   A: (xshell) See the implementations of interruptible_sleep_on and
   wake_up_interruptible for putting a given process to sleep and waking
   it. interruptible_sleep_on puts the current process into the sleeping
   state and onto a wait queue; processes on this queue can be woken by
   interruption. wake_up_interruptible wakes the processes on the wait
   queue. Details.
2. Q: Take read: while the user has typed nothing, the system call
   blocks. Has the process (process 1) then left the kernel execution
   state and been suspended, with the kernel scheduling another process
   (process 2) to run? And how does process 1 get the CPU back when the
   user types something? Does it wait until process 2 has used up its
   time slice and a reschedule happens?
   A: (linux_tao) The process executes the system call in kernel mode.
   When the system call blocks, the process moves to the "asleep in
   memory" state and the kernel schedules other processes. When the event
   the sleeping process is waiting for occurs, the process is woken and
   becomes "ready to run in memory". It is then scheduled and enters the
   "kernel running" state, in which it completes the read call. When the
   read call is complete, it returns to user mode.
   (sirx) There is no need to wait for process 2's time slice to run out.
   If the interrupt process 1 is waiting for arrives while process 2 is
   executing a system call, rescheduling has to wait until process 2's
   system call has finished. If the interrupt arrives while process 2 is
   running in user mode, rescheduling happens immediately. See "Linux
   Kernel Notes 2: Process Scheduling".
3. Q: Are threads in Linux implemented by the kernel? Or are they created
   at user level and managed by the kernel?
   A: (sirx) Linux has two kinds of threads, user-mode and kernel-mode,
   but the kernel only creates and manages kernel threads, which it uses
   for work that has to be done repeatedly. User threads are the threads
   inside an application; they are created and scheduled by a user-mode
   thread library, and the kernel never knows that they exist. (This
   holds for purely user-level thread libraries. With LinuxThreads, each
   thread is created with clone() and is scheduled by the kernel.)
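4. Q: On Linux, when a process has created several threads:
   a. If fork is called in the main process (not in a thread), how does
      the child inherit the parent's threads? All of them, or none?
   b. If fork is called inside one of the threads, does the child inherit
      all of the parent's threads, only the thread that called fork, or
      none?
   A: (to be answered)
5. Q. How long is the scheduling time slice on current systems?
   A: (sirx) Compared with 2.2.x kernels, the time slice in kernel 2.4.x
   is shorter: 100ms for the highest-priority processes, 60ms for
   processes at the default priority, and 10ms for the lowest-priority
   processes.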
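
The 100 ms / 60 ms / 10 ms figures in the answer above follow from the
counter formula that this FAQ quotes later in the same section from
sched.c in kernel 2.4.14. Below is a small user-space sketch of that
arithmetic. The function names are hypothetical, and it assumes HZ = 100,
so that one tick is 10 ms.

```c
#include <assert.h>

/* Per-recalculation allotment in ticks for a given nice value,
 * following the quoted formula ((20 - nice) >> 2) + 1. */
static int nice_to_ticks(int nice)
{
    assert(nice >= -20 && nice <= 20);
    return ((20 - nice) >> 2) + 1;
}

/* New counter = half of the unused remainder + the per-nice allotment. */
static int recalc_counter(int remaining, int nice)
{
    return (remaining >> 1) + nice_to_ticks(nice);
}

/* Time slice in milliseconds for a process that used up its old counter,
 * assuming 10 ms per tick. */
static int timeslice_ms(int nice)
{
    return recalc_counter(0, nice) * 10;
}
```

With these definitions timeslice_ms(0) gives 60, timeslice_ms(-19) gives
100 and timeslice_ms(20) gives 10. A process at nice -19 that never uses
its counter sees it grow 10, 15, 17, 18, 19 and then stay at 19, which is
consistent with the FAQ's remark that the counter cannot exceed 20.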
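6. Q. How does Linux ensure a reasonably fast response to I/O events, and
   does the response time depend on the length of the time slice?
   A: (sirx) When an I/O event occurs, the corresponding interrupt
   handler is activated. If it finds a process waiting for this I/O
   event, it wakes that process and sets the need_resched flag of the
   currently running process. When the interrupt handler returns, the
   scheduler is invoked, and the process that was waiting for the I/O
   event (very probably) gets to run. This ensures a relatively fast
   response to I/O events (of the order of milliseconds). As this shows,
   when an I/O event occurs the process handling it preempts the current
   process, so the response time does not depend on the length of the
   time slice.
7. Q. How do high-priority (nice) and low-priority processes differ in
   execution? For example, a process with priority -19 (the highest) and
   one with priority 20 (the lowest)?
   A: (sirx) The absolute amount of CPU time a process gets depends on
   its initial counter value, which is computed as follows (sched.c in
   kernel 2.4.14):

     p->counter = (p->counter >> 1) + ((20 - p->nice) >> 2) + 1

   From the formula, a standard process (p->nice of 0) gets an initial
   counter of 6, i.e. a time slice of 60ms. The highest-priority process
   (nice -19) gets an initial counter of 10, a time slice of 100ms. The
   lowest-priority process (nice 20) gets an initial counter of 1, a time
   slice of 10ms. So the highest-priority process gets 10 times the
   execution time of the lowest-priority process, and nearly twice that
   of a normal-priority process. These figures apply when the processes
   do no I/O. With I/O, a process is often forced to sleep waiting for
   the I/O to complete, and the CPU time actually used is hard to
   compare. Note that each time counter is recalculated, half of its
   remaining value is added to the new value. This bonus only benefits
   processes that give up the CPU voluntarily through SCHED_YIELD, since
   only they still have counter left at recalculation time. Their counter
   grows after the recalculation, but it can never exceed 20.
8. Q: What synchronization mechanisms does Linux provide?
   A: (garycao) (see semaphore.h and pthread.h under /usr/include/)
   They generally fall into the following groups:
   1. Semaphores:
        int sem_init (sem_t *sem, int pshared, unsigned int value);
        int sem_wait (sem_t * sem);
        int sem_post (sem_t * sem);
        int sem_destroy (sem_t * sem);
   2. Mutexes:
        pthread_mutex_init
        pthread_mutex_lock
        pthread_mutex_unlock
        pthread_mutex_destroy
   3. Condition variables:
        pthread_cond_init
        pthread_cond_wait
        pthread_cond_signal
        pthread_cond_broadcast
        pthread_cond_destroy

VI Interrupt management
(compiled by bx_bird)

VII Module programming
1 Q: Is there a good tutorial for getting started with module
  programming?
  A: Search Google for lkmpg (the Linux Kernel Module Programming Guide),
  or use "Complete Linux Loadable Kernel Modules" as the search term.
  Many of the drivers in the kernel source are also good examples of
  module programming.
2 Q: What is the difference between loading a driver as a module and
  building it into the kernel?
  A: Using a module reduces the size of the kernel and makes it more
  flexible to use.
3 Q: Where are modules installed by make modules_install after building
  the kernel?
  A: For 2.4.X in /lib/modules/`uname -r`/kernel/; for 2.2.X in
  /lib/modules/`uname -r`/.
4 Q: What tools and methods are there for debugging modules?
  A: The simplest is printk plus dmesg. kgdb is also a good choice.
5 Q: How do I load a module I have compiled?
  A: insmod ./your.o; modprobe ./your.o also works.
6 Q: Why does insmod say it cannot find the module file when I load a
  module?
  A: Give insmod the full path of the module to load, or add a new alias
  to /etc/modules.conf. P.S. If you build 2.4.X on rh6.2, after
  installing the modules run cd /lib/modules/`uname -r`/; mv kernel/* .;
  and then depmod -a to build modules.dep, so that the modules can be
  loaded automatically.
7 Q: How are the common macros in module programming used?
  A: MODULE_PARM(var,type) declares a parameter var of type type, whose
  value can be given at insmod time as var=???.
  The following information can be viewed with /sbin/modinfo a_module:
    MODULE_PARM_DESC(var,desc)   description of an optional parameter;
    MODULE_AUTHOR(name)          the module's author;
    MODULE_DESCRIPTION(desc)     description of what the module does.
8 Q: What compiler options are commonly used for modules?
  A: CFLAGS=-DMODULE -D__KERNEL__ -I/lib/modules/`uname -r`/build/include -O2
  (this compiles the module for the kernel you are currently running)
9 Q: What should I do about a version error when loading a kernel module?
  A: It means that the kernel you are running checks version information
  when a module is loaded. You can turn this option off in the kernel
  configuration menu. Alternatively, add version information when
  compiling, i.e. #include <linux/modversions.h>, then recompile and load
  the module again.
10 Q: How do I use system calls in a kernel module?
  A: Add these lines to the module:
       #define __KERNEL_SYSCALLS__
       #include <linux/unistd.h>
       int errno;
11 Q: How do I do file operations in a kernel module?
  A: Use the open, read, close and similar system calls. Before using
  these file system calls you need to call set_fs(KERNEL_DS);
12 Q: How do I make the function symbols of my module invisible to the
  rest of the kernel?
  A: Add EXPORT_NO_SYMBOLS as the last line of the module.

VIII PCI
1. Q: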
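Where are the prototypes of pci_read_config_byte, pci_read_config_word
   and so on?
   A: (hyl) There are several variables of type pci_ops in
   /arch/i386/kernel/pci-pc.c. Details.

IX Networking
1. Q: How can I quickly find the code that implements connect and socket?
   A: (nxin) Download and unpack the glibc source. socket is implemented
   in sysdeps/unix/sysv/linux/i386/socket.S, and most Linux system calls
   are in sysdeps/unix/sysv/linux/ or sysdeps/unix/sysv/linux/i386/.
   System calls all end up executing int $0x80. Some share a single entry
   point: socket and connect both go through sys_socketcall, whereas
   select calls sys_select.
2. Q: A question for the experts: when a LAN card receives a packet,
   should it compare the MAC address?
   A: (nxin) A network card can work in several modes, which determine
   whether it accepts broadcasts, frames for a given MAC address, frames
   for given multicast addresses, or everything. It is normally not set
   to accept everything, because that increases the load on the system.
   If you need that, set the card to promiscuous mode and it will accept
   everything.

X File systems
1 Q: How many files can Linux have open?
  A: (jkl) The number of files each process can open is limited by
  rlimit; the default is 1024 and the upper bound is 1024*1024. The total
  number of open files in the system is limited by the file-max and
  inode-max parameters, which can be read and adjusted in /proc/sys/fs/.

XI Hardware
1. Q: How are disk sectors and disk blocks defined, and how are they
   told apart in use?
   A: (m.ouyang) Basically I think the hard sector is a physical
   parameter obtained from the device, and block size is just block size.
   When the fs layer passes requests to the IDE drivers, all the
   information is expressed in sectors. The block size setting will still
   affect I/O requests. For example, if you double the block size, you
   may cut the number of I/O requests (IDE interrupts) nearly in half. So
   in your own IDE/ATAPI driver you can define the block size as you
   want, but not the sector size; just keep the block size a multiple of
   the sector size.

XII Kernel debugging
1. Q: Is anyone familiar with Linux kernel debugging techniques?
   A: (garycao) There are several methods:
     * Use kgdb over a serial port.
     * bdi2000: gdb can debug over Ethernet (the bdi2000's Ethernet
       port; the bdi2000 itself debugs through BDM or JTAG). (Note: x86
       has no BDM or JTAG debug interface.)
     * Use the Wind River vision probe (click) to debug the kernel. The
       biggest inconvenience is that you cannot see the source, but if
       you are comfortable with assembly you can still debug.
2. Q: Why are two machines needed to debug the kernel with kgdb?
   A: kgdb needs gdb to process the source and analyse the debugging
   information produced by gcc. While the kernel is being debugged, gdb
   cannot run on the test machine, so gdb has to be run on a machine with
   a normally working kernel.
3. Q: Can I set breakpoints in interrupt handlers?
   A: Certainly. Breakpoints can be set anywhere in the kernel. However,
   kgdb cannot set breakpoints in the parts of the kernel that kgdb
   itself is using, such as the kgdb serial line interrupt handler and
   the "interprocessor" interrupt handler.
4. Q: Why should the kernel and modules be compiled on the development
   machine rather than on the test machine?
   A: gdb needs access to the source files and to vmlinux or the module
   object files. Since gdb runs on the development machine, these files
   have to be available there. The test machine only needs vmlinux and
   the module object files. If a kernel or module is built on the
   development machine, only those files have to be copied to the test
   machine. If it is built on the test machine, both the source files
   and the object files have to be copied to the development machine. So
   it is simpler to compile on the development machine.

Part 7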
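Linux device drivers
1 Q: How do I add my driver to the kernel?
A: (garycao) In Linux 2.4 you need to modify two files, config.in (it may also be named Config.in) and Makefile. For example, suppose you put your source file mydriver.c in the drivers/char directory:
1. Edit drivers/char/Config.in and add a line at a suitable place: tristate 'XXXXXXXX' CONFIG_XXXX
2. Then edit drivers/char/Makefile and add at a suitable place: obj-$(CONFIG_XXXX) += mydriver.o
You can then select and configure your driver in make menuconfig.

VIII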
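Linux kernel programming
1. Q: What does the .align pseudo-op mean in the assembly parts of the Linux kernel source?
A: (rush) From the gas documentation: For example `.align 3' advances the location counter until it is a multiple of 8. If the location counter is already a multiple of 8, no change is needed.
So with .align 3, 2 to the power 3 is 8, which means alignment on an 8-byte boundary. If the current byte offset is 5, the variable after .align 3 is placed at 8, and 3 padding bytes (offsets 5, 6 and 7) are inserted automatically. The padding is zero bytes; in a code section nop opcodes are inserted instead, if I understand correctly.
2. Q: What is prof_shift in start_kernel used for?
    if (prof_shift) {
        prof_buffer = (unsigned int *) memory_start;
        /* only text is profiled */
        prof_len = (unsigned long) &_etext - (unsigned long) &_stext;
        prof_len >>= prof_shift;
        memory_start += prof_len * sizeof(unsigned int);
        memset(prof_buffer, 0, prof_len * sizeof(unsigned int));
    }
What is prof_buffer allocated for, and what does prof mean in the source?
A: (xshell) prof means profile: this is the kernel collecting statistics on how its own code is used. The code checks whether you asked for profiling by putting profile=n (an integer) on the boot command line. With prof_shift=n, the kernel text is divided into buckets of 2^n bytes each, so that you can see how frequently each part of the kernel code is used. prof_buffer is of course the memory area where the kernel keeps those counters. The following function in init/main.c takes the profile boot parameter from the command line and assigns it to prof_shift:
    static int __init profile_setup(char *str)
    {
        int par;
        if (get_option(&str, &par)) prof_shift = par;
        return 1;
    }
Related to this are /proc/profile and /usr/sbin/readprofile. You must of course boot with profile=integer, otherwise they are not available.
3. Q: A question about the flags T, t, B, b, ... in System.map?
A: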
(ytang) From the binutils documentation:
The symbol type. At least the following types are used; others are, as well, depending on the object file format. If lowercase, the symbol is local; if uppercase, the symbol is global (external).
A  The symbol's value is absolute, and will not be changed by further linking.
B  The symbol is in the uninitialized data section (known as BSS).
C  The symbol is common. Common symbols are uninitialized data. When linking, multiple common symbols may appear with the same name. If the symbol is defined anywhere, the common symbols are treated as undefined references. For more details on common symbols, see the discussion of -warn-common in Linker options.
D  The symbol is in the initialized data section.
G  The symbol is in an initialized data section for small objects. Some object file formats permit more efficient access to small data objects, such as a global int variable as opposed to a large global array.
I  The symbol is an indirect reference to another symbol. This is a GNU extension to the a.out object file format which is rarely used.
N  The symbol is a debugging symbol.
R  The symbol is in a read only data section.
S  The symbol is in an uninitialized data section for small objects.
T  The symbol is in the text (code) section.
U  The symbol is undefined.
V  The symbol is a weak object. When a weak defined symbol is linked with a normal defined symbol, the normal defined symbol is used with no error. When a weak undefined symbol is linked and the symbol is not defined, the value of the weak symbol becomes zero with no error.
W  The symbol is a weak symbol that has not been specifically tagged as a weak object symbol. When a weak defined symbol is linked with a normal defined symbol, the normal defined symbol is used with no error. When a weak undefined symbol is linked and the symbol is not defined, the value of the weak symbol becomes zero with no error.
4. Q: Why is copy_from_user needed?
A: (xshell)
1.
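The fixup in copy_from_user is there so that copy_from_user can still return normally when a page fault occurs in interrupt context; its return value is the number of bytes that have not yet been copied. Outside interrupt context, a page fault on a user-space or kernel-space address is still handled the usual way, by paging the data in. This is what makes copy_from_user generally usable.
2. Page faults can occur in the kernel address space, but no real paging takes place; this only follows the MMU fault mechanism, and some fault checks are performed along the way. The answer can be found at vmalloc_fault: in do_page_fault() in fault.c.
5. Q: What does the wmb function do?
A: (xshell) wmb is a write memory barrier. Placing it in an instruction sequence means that, when execution reaches that point, all pending writes are pushed out to where they belong, and every write issued before the wmb is guaranteed to be performed before any write issued after it.
6. Q: What is the difference between # and ## in macros?
A: (jkl) In a macro definition the "#" prefix turns the argument into a string. For example, with #define test(x) #x, test(123) is replaced by the string "123". "##" joins arguments. For example, with #define test(x,y) x##y, test(123,456) is replaced by 123456. Therefore BI(0x0,0) is replaced by BUILD_IRQ(0x00), and "pushl $"#nr"-256\n\t" is replaced by "pushl $""0x00""-256\n\t". The cpp.info manual explains this in detail.
7. Q: What does __attribute__ mean?
A: (jkl) __attribute__ is a gcc keyword used to describe attributes of variables and functions, for example:
__attribute__((regparm(0))) int printk(const char * fmt, ...) __attribute__ ((format (printf, 1, 2)));
This forbids printk from receiving its arguments in registers, treats argument 1 of printk as a printf format string, and type-checks the arguments starting from argument 2.
8. Q: Where is _end defined? (the _end variable at line 162 of arch/i386/kernel/setup.c in 2.0.34)
A: (jkl) _end is defined by the linker ld, and every ELF application can use this symbol. See the vmlinux.lds file. Many things can be defined in an ld script; info ld explains how to write one.
9. Q: What does .previous mean?
A: (jkl) The .previous pseudo-op makes the section that preceded the current one the current section again. Because ELF allows custom sections to be defined with .section, the effect of .previous here is to restore .text as the current section. More precisely, it restores the section that was current before the present .section directive.
10. Q: What does volatile mean?
A: (onfirelinux) volatile indicates that a variable may change at any time because of events outside the program.
11. Q: What is a signal?
A: (sirx) Signals are a standard means of interprocess communication on UNIX and existed already in the earliest UNIX systems. Signals allow the kernel and other processes to notify a process that a particular event has occurred. Modern UNIX systems offer other means of interprocess communication too, but because signals are comparatively simple and effective, they are still widely used.

IX Linux resources

X Statements about compiling this FAQ
1. About the FAQ: a FAQ should contain basic material and be brief and to the point. Anything that needs to be developed at length belongs in the collection of featured documents.
2. About author attribution: out of respect for the authors, the author's name is given wherever possible. If an answer was worked out in a discussion among more than three people, or the author cannot be determined, no name is given.
3. The FAQ follows the "release early, release often" principle.
4. In style, the FAQ follows established formats wherever possible, such as http://www.tux.org/lkml/
--
Who am I?
※ Source: CAU BBS (农大BBS) http://bbs.cau.edu.cn [FROM: 61.149.2.231]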

(Source: http://www.sheup.com)

玄鹤

Nov 16, 2009, 10:16:58 AM

to freesky

http://www.xmsc.com.cn/InfoView/Article_81609.html

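An Analysis of the Linux Thread Implementation [with figures]

Author: Yang Shazhou (杨沙洲)
Ever since multithreaded programming appeared on Linux, the development of multithreaded Linux applications has been tied to two issues: compatibility and efficiency. Starting from thread models, this article analyzes the implementation and shortcomings of LinuxThreads, currently the most popular thread library on Linux, and describes how the Linux community views and addresses these two issues.

I. Background: threads and processes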

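By the textbook definition, a process is the smallest unit of resource management and a thread is the smallest unit of program execution. In operating system design, the main reasons for evolving threads out of processes were better SMP support and lower (process/thread) context-switch overhead.

However the line is drawn, a process needs at least one thread as the body that executes its instructions. The process manages resources (CPU, memory, files and so on) and assigns threads to run on a CPU. A process can of course own several threads. On an SMP machine it can then use several CPUs at once to run them, achieving the greatest possible parallelism and improving efficiency. Even on a single-CPU machine, designing a program with a multithreaded model, just as the multiprocess model once replaced the single-process model, makes the design cleaner, the functionality more complete and the program more efficient, for example by using several threads to respond to several inputs. What the multithreaded model achieves here could also be achieved with multiple processes, but a thread context switch costs far less than a process context switch. Semantically, responding to several inputs at once really is a matter of sharing every resource except the CPU.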

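Corresponding to these two purposes of threads, two thread models were developed: kernel-level threads and user-level threads. The criterion is mainly whether the thread scheduler lives inside or outside the kernel. The former is better at using multiple processors concurrently; the latter is more concerned with context-switch overhead. Current commercial systems usually combine the two: they provide kernel threads to meet the needs of SMP systems, and they also support a second thread mechanism implemented in user space by a thread library, in which case one kernel thread acts as the scheduler for several user-level threads. As with many technologies, a "hybrid" usually gives better efficiency but is also harder to implement. Following a "keep it simple" design philosophy, Linux never planned to implement the hybrid model, although its implementation adopts a "hybrid" of a different kind.

As for the concrete implementation, threads can be implemented inside the operating system kernel or outside it. The latter obviously requires that the kernel implement at least processes, while the former generally requires that the kernel support processes as well. The kernel-level thread model obviously requires the former kind of support, whereas the user-level thread model is not necessarily built on the latter. As noted above, this difference arises because the two classifications use different criteria.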

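When the kernel supports both processes and threads, a "many-to-many" thread-process model becomes possible: a thread of a process is scheduled by the kernel and at the same time acts as the scheduler for a pool of user-level threads, choosing a suitable user-level thread to run in its context. This is the "hybrid" thread model mentioned above. It meets the needs of multiprocessor systems and also keeps scheduling overhead to a minimum. Most commercial operating systems (such as Digital Unix, Solaris and Irix) use this model, which can fully implement the POSIX 1003.1c standard. Threads implemented outside the kernel fall into "one-to-one" and "many-to-one" models. The former maps each thread to one kernel process (possibly a lightweight process), so thread scheduling is the same as process scheduling and is left to the kernel. The latter implements multithreading entirely outside the kernel, with scheduling done in user mode as well. The latter is how the pure user-level thread model mentioned earlier is implemented. Such a user-space thread scheduler only needs to switch thread stacks, so its scheduling overhead is very small. However, kernel signals (synchronous or asynchronous) are delivered per process and cannot be directed at a particular thread, so this approach cannot be used on multiprocessor systems, and the demand for those keeps growing. In practice, therefore, pure user-level thread implementations have almost disappeared except for algorithm research.

The Linux kernel provides only lightweight processes, which limits the implementation of more efficient thread models. Linux has, however, put a great deal of work into reducing process scheduling overhead, which makes up for this limitation to some extent. LinuxThreads, currently the most popular thread mechanism, uses the "one-to-one" thread-process model: scheduling is left to the kernel, while a thread management mechanism, including signal handling, is implemented at user level. How Linux and LinuxThreads work together is the focus of this article.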

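II. Lightweight processes in the Linux 2.4 kernel

The earliest definitions of a process comprised three parts: a program, resources, and its execution. The program usually means the code. At the operating system level, the resources usually include memory, I/O resources, signal handling and so on. The execution of the program is usually understood as the execution context, including the use of the CPU, and this later developed into the thread. Before the thread concept appeared, operating system designers, seeking to reduce the cost of process switching, gradually revised the notion of a process. They allowed the resources a process holds to be separated from the process itself, and they allowed some processes to share part of their resources, such as files, signals, data memory and even code. This led to the concept of the lightweight process. The Linux kernel has implemented lightweight processes since version 2.0.x. An application uses the single clone() system call interface and chooses, through different arguments, whether to create a lightweight process or an ordinary one. Inside the kernel, after its arguments have been passed and interpreted, the clone() call invokes do_fork(), the kernel function that is also the final implementation of the fork() and vfork() system calls:

<linux-2.4.20/kernel/fork.c>
int do_fork(unsigned long clone_flags, unsigned long stack_start,
            struct pt_regs *regs, unsigned long stack_size)

clone_flags is the bitwise OR of the following macros: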

<linux-2.4.20/include/linux/sched.h>
#define CSIGNAL         0x000000ff  /* signal mask to be sent at exit */
#define CLONE_VM        0x00000100  /* set if VM shared between processes */
#define CLONE_FS        0x00000200  /* set if fs info shared between processes */
#define CLONE_FILES     0x00000400  /* set if open files shared between processes */
#define CLONE_SIGHAND   0x00000800  /* set if signal handlers and blocked signals shared */
#define CLONE_PID       0x00001000  /* set if pid shared */
#define CLONE_PTRACE    0x00002000  /* set if we want to let tracing continue on the child too */
#define CLONE_VFORK     0x00004000  /* set if the parent wants the child to wake it up on mm_release */
#define CLONE_PARENT    0x00008000  /* set if we want to have the same parent as the cloner */
#define CLONE_THREAD    0x00010000  /* Same thread group? */
#define CLONE_NEWNS     0x00020000  /* New namespace group? */
#define CLONE_SIGNAL    (CLONE_SIGHAND | CLONE_THREAD)

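In do_fork(), different clone_flags lead to different behavior. LinuxThreads calls clone() with (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND) to create a "thread", which means shared memory, a shared file system access count, a shared file descriptor table, and shared signal handling. This section looks at how the Linux kernel implements the sharing of each of these resources.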
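As a rough user-space illustration of the flag combination described above, the following sketch calls the glibc clone() wrapper with the four sharing flags and checks that a write made by the child is visible to the parent. The names child_main and clone_shared_counter and the 64 KB stack size are made up for this example, and adding SIGCHLD as the exit signal is an assumption made here only so that plain waitpid() can reap the child; it is not taken from LinuxThreads.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (64 * 1024)   /* arbitrary size for this sketch */

/* Runs in the child: bump a counter that lives in the parent's memory. */
static int child_main(void *arg)
{
    int *counter = arg;
    *counter += 1;
    return 0;
}

/* Returns the counter as seen by the parent after the child has exited,
 * or -1 if allocation, clone() or waitpid() fails. */
int clone_shared_counter(void)
{
    int counter = 0;
    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* The four sharing flags named in the text, plus SIGCHLD (assumption)
     * so that an ordinary waitpid() can wait for the child. */
    int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD;

    /* The stack grows downward on i386/x86-64, so pass the top of the buffer. */
    pid_t pid = clone(child_main, stack + CHILD_STACK_SIZE, flags, &counter);
    if (pid == -1) {
        free(stack);
        return -1;
    }
    if (waitpid(pid, NULL, 0) == -1) {
        free(stack);
        return -1;
    }
    free(stack);
    return counter;   /* 1 when CLONE_VM made the child's write visible */
}
```

Without CLONE_VM the child would increment its own copy of the counter and the parent would still see 0.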

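1. CLONE_VM

do_fork() calls copy_mm() to set the mm and active_mm fields of task_struct. These two mm_struct pointers correspond to the memory space associated with the process. If CLONE_VM was specified for do_fork(), copy_mm() sets the mm and active_mm of the new task_struct to the same values as those of current and increases the user count of that mm_struct (mm_struct::mm_users). In other words, the lightweight process shares the memory address space with its parent. The figure below shows where mm_struct sits within a process:

[Figure 1: the position of mm_struct within a process (image not available)]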

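2. CLONE_FS

task_struct uses fs (struct fs_struct *) to record the root directory and current directory of the file system the process lives in. do_fork() calls copy_fs() to copy this structure; for a lightweight process it only increments the fs->count counter, so the child shares the same fs_struct with its parent. In other words, a lightweight process has no file system information of its own: when any thread of the process changes the current directory, the root directory or similar information, every other thread is affected directly.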

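3. CLONE_FILES

A process may have a number of files open. The process structure task_struct uses files (struct files_struct *) to hold the information on the file structures (struct file) the process has opened, and do_fork() calls copy_files() to handle this attribute. A lightweight process shares this structure with its parent, and copy_files() then only increments the files->count counter. Because of this sharing, any thread can access the open files maintained by the process, and operations on them are immediately visible to the other threads of the process.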

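4. CLONE_SIGHAND

Every Linux process can define for itself how signals are handled. In task_struct, sig (struct signal_struct) holds an array of struct k_sigaction that stores this configuration, and copy_sighand() in do_fork() is responsible for copying it. For a lightweight process nothing is copied; only the signal_struct::count counter is incremented and the structure is shared with the parent. In other words, the child handles signals in exactly the same way as the parent, and each can change the other's handlers.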
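The four cases above follow one pattern: if the CLONE_* flag is set, bump a reference count and share the parent's structure; otherwise make a private copy. The sketch below shows that pattern in plain user-space C. It is not kernel code: struct fs_info, copy_fs_sketch and SKETCH_CLONE_FS are hypothetical stand-ins for fs_struct, copy_fs() and CLONE_FS, and a single integer stands in for the real directory information.

```c
#include <stdlib.h>

#define SKETCH_CLONE_FS 0x00000200UL   /* same value as CLONE_FS in the listing */

/* Hypothetical stand-in for fs_struct: a reference count plus some state. */
struct fs_info {
    int count;    /* number of tasks using this structure */
    int cwd_id;   /* placeholder for root/current directory information */
};

/* Share the parent's structure when the flag is set, otherwise duplicate it.
 * Returns the structure the child should use, or NULL if allocation fails. */
struct fs_info *copy_fs_sketch(unsigned long clone_flags, struct fs_info *parent)
{
    if (clone_flags & SKETCH_CLONE_FS) {
        parent->count++;          /* lightweight process: share, count one more user */
        return parent;
    }

    struct fs_info *child = malloc(sizeof(*child));
    if (child == NULL)
        return NULL;
    child->count = 1;             /* ordinary fork: private copy with a single user */
    child->cwd_id = parent->cwd_id;
    return child;
}
```

With sharing, a change made through either pointer is seen by both tasks, which is exactly why a chdir in one LinuxThreads thread affects the others.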

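do_fork() does a great deal of work that is not described in detail here. On SMP systems, every process is placed on the same CPU as its parent when it is forked; a CPU is chosen only when the process is scheduled.

Although Linux supports lightweight processes, that does not mean it supports kernel-level threads, because Linux "threads" and "processes" actually sit at the same scheduling level and share one process identifier space. This limitation makes it impossible to implement the POSIX thread mechanism in the full sense on Linux, so the many attempts at Linux thread libraries can only implement most of the POSIX semantics and come as close as possible in functionality.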

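III. The LinuxThreads thread mechanism

LinuxThreads is currently the most widely used thread library on Linux. It was developed by Xavier Leroy (Xavier...@inria.fr) and is shipped bundled with GLIBC. It implements a "one-to-one" thread model based on kernel lightweight processes: each thread entity corresponds to one kernel lightweight process, and the management of threads is implemented in a library outside the kernel.

1. Thread descriptor data structure and implementation limits

LinuxThreads defines a struct _pthread_descr_struct data structure to describe a thread and uses the global array __pthread_handles to describe and reference the threads belonging to the process. In the first two entries of __pthread_handles, LinuxThreads defines two global system threads, __pthread_initial_thread and __pthread_manager_thread, and it uses __pthread_main_thread to denote the parent thread of __pthread_manager_thread (initially __pthread_initial_thread).

struct _pthread_descr_struct is a circular doubly linked list structure. The list containing __pthread_manager_thread has only that one element. In fact __pthread_manager_thread is a special thread, and LinuxThreads uses only three of its fields: errno, p_pid and p_priority. The list containing __pthread_main_thread chains together all user threads of the process. After a series of pthread_create() calls, the __pthread_handles array looks as shown in the figure below:

[Figure 2: the __pthread_handles array after a series of pthread_create() calls (image not available)]

A newly created thread first takes up an entry in the __pthread_handles array and is then linked, through the list pointers in its descriptor, into the list headed by __pthread_main_thread. The use of this list is covered in the discussion of thread creation and release.

LinuxThreads follows the POSIX 1003.1c standard, which sets some limits on thread library implementations, such as the maximum number of threads per process and the size of the thread-private data area. The LinuxThreads implementation basically observes these limits but also changes some of them, generally by relaxing or enlarging them to make programming more convenient. The limit macros are mainly collected in sysdeps/unix/sysv/linux/bits/local_lim.h (the file location differs between platforms) and include the following:

* Number of private data keys per process: POSIX defines _POSIX_THREAD_KEYS_MAX as 128; LinuxThreads uses PTHREAD_KEYS_MAX, 1024.
* Number of destructor passes allowed when private data is released: LinuxThreads agrees with POSIX and defines PTHREAD_DESTRUCTOR_ITERATIONS as 4.
* Number of threads per process: POSIX defines 64; LinuxThreads raises this to 1024 (PTHREAD_THREADS_MAX).
* Minimum size of a thread's run-time stack: POSIX does not specify one; LinuxThreads uses PTHREAD_STACK_MIN, 16384 (bytes).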
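On a current system the corresponding limits can be queried at run time with sysconf() rather than read from local_lim.h. The following is a minimal sketch; struct thread_limits and query_thread_limits are names invented for this example, and the values returned depend on the C library in use (a modern library may report -1 for the thread count, meaning no fixed limit).

```c
#include <limits.h>
#include <unistd.h>

/* Hypothetical container for the four limits discussed in the text. */
struct thread_limits {
    long keys_max;                /* thread-specific data keys per process */
    long destructor_iterations;   /* destructor passes at thread exit */
    long threads_max;             /* threads per process, -1 if indeterminate */
    long stack_min;               /* minimum thread stack size in bytes */
};

/* Ask the C library for its thread-related limits. */
struct thread_limits query_thread_limits(void)
{
    struct thread_limits l;
    l.keys_max = sysconf(_SC_THREAD_KEYS_MAX);
    l.destructor_iterations = sysconf(_SC_THREAD_DESTRUCTOR_ITERATIONS);
    l.threads_max = sysconf(_SC_THREAD_THREADS_MAX);
    l.stack_min = sysconf(_SC_THREAD_STACK_MIN);
    return l;
}
```

POSIX requires at least 128 keys and 4 destructor iterations, so a conforming library should report values no smaller than those.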

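2. The manager thread

One advantage of the "one-to-one" model is that thread scheduling is done by the kernel, while other work such as thread cancellation and synchronization between threads is done in the user-space thread library. LinuxThreads constructs a dedicated manager thread for each process to handle thread-related management work. The manager thread is created (with __clone()) and started when the process calls pthread_create() for the first time to create a thread.

Within a process, the manager thread and the other threads communicate through a "manager pipe" (manager_pipe[2]). The pipe is created before the manager thread. Once the manager thread has been started successfully, the read end and the write end of the pipe are assigned to two global variables, __pthread_manager_reader and __pthread_manager_request. From then on every user thread sends its requests to the manager thread through __pthread_manager_request. The manager thread itself does not use __pthread_manager_reader directly: the read end of the pipe (manager_pipe[0]) is passed to the manager thread as one of the arguments of __clone(). The manager thread's main job is to listen on the read end of the pipe and react to the requests it takes from it.

The flow for creating the manager thread is as follows:
(the global variable __pthread_manager_request is initially -1)

[Figure 3: flow of creating the manager thread (image not available)]

After initialization, __pthread_manager_thread records the lightweight process number and the thread id allocated and managed outside the kernel; the value 2*PTHREAD_THREADS_MAX+1 cannot collide with the id of any regular user thread. The manager thread runs as a child thread of the thread that called pthread_create(), whereas the user thread that pthread_create() was asked to create is created by the manager thread calling clone(), so it is actually a child thread of the manager thread. (Child thread here should be understood as child process.)

__pthread_manager() contains the manager thread's main loop. After a series of initialization steps it enters a while(1) loop. In the loop, the thread polls (__poll()) the read end of the manager pipe with a timeout of 2 seconds. Before handling a request it checks whether its parent thread (the main thread that created the manager) has exited, and if so it terminates the whole process. If there are exited child threads that need cleaning up, it calls pthread_reap_children() to do so.

Only then does it read the request from the pipe and perform the corresponding operation according to the request type (switch-case). The handling of individual requests is fairly clear in the source and is not repeated here.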
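The request pipe and the timed poll described above can be sketched in user space as follows. This is an illustration of the mechanism, not LinuxThreads source: struct request, REQ_CREATE_SKETCH, send_request and manager_poll_once are hypothetical names, and the request carries only two integers instead of the real request structure.

```c
#include <poll.h>
#include <unistd.h>

/* Hypothetical request kinds; the real library defines its own set. */
enum request_kind { REQ_CREATE_SKETCH = 1, REQ_EXIT_SKETCH = 2 };

struct request {
    int kind;   /* what the manager is asked to do */
    int arg;    /* request-specific argument */
};

/* User-thread side: write one fixed-size request to the pipe's write end. */
int send_request(int write_fd, const struct request *req)
{
    ssize_t n = write(write_fd, req, sizeof(*req));
    return n == (ssize_t)sizeof(*req) ? 0 : -1;
}

/* Manager side, one loop iteration: wait up to timeout_ms for a request.
 * Returns 1 if a request was read into *out, 0 on timeout, -1 on error. */
int manager_poll_once(int read_fd, int timeout_ms, struct request *out)
{
    struct pollfd pfd = { .fd = read_fd, .events = POLLIN };
    int ready = poll(&pfd, 1, timeout_ms);
    if (ready < 0)
        return -1;
    if (ready == 0)
        return 0;   /* timeout: a manager would use this to check on its parent */

    ssize_t n = read(read_fd, out, sizeof(*out));
    return n == (ssize_t)sizeof(*out) ? 1 : -1;
}
```

A manager loop in this style would call manager_poll_once(fd, 2000, &req) repeatedly and dispatch on req.kind in a switch statement whenever it returns 1.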

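3. Thread stacks

In LinuxThreads the manager thread's stack is separate from the user threads' stacks. The manager thread uses malloc() to allocate a region of THREAD_MANAGER_STACK_SIZE bytes on the process heap as its own stack.

How user thread stacks are allocated depends on the architecture and is mainly controlled by two macros. One is NEED_SEPARATE_REGISTER_STACK, which is used only on the IA64 platform. The other is FLOATING_STACK, used on i386 and a few other platforms, in which case the system decides where a user thread's stack is placed and protects it. In addition, the user can request a user-defined stack through the thread attribute structure. For reasons of space, only the two stack layouts used on i386 are analyzed here: the FLOATING_STACK layout and the user-defined layout.

With FLOATING_STACK, LinuxThreads uses mmap() to obtain 8 MB from the kernel (the default maximum stack size on i386; if a resource limit (rlimit) is in effect, the limit is used instead) and uses mprotect() to make the first page inaccessible. The figure below shows how this 8 MB area is divided up:

[Figure 4: layout of the 8 MB thread stack area (image not available)]

The protected page at the low address is used to detect stack overflow.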
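The combination of mmap() and mprotect() described above can be tried in user space with a few lines of C. This is a minimal sketch under stated assumptions: alloc_guarded_stack and SKETCH_STACK_SIZE are names invented here, the mapping is anonymous and private, and nothing is done about the thread descriptor that LinuxThreads places at the top of the area.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define SKETCH_STACK_SIZE (8UL * 1024 * 1024)   /* 8 MB, as in the text */

/* Map `size` bytes and make the lowest page inaccessible so that a stack
 * growing down into it faults instead of silently corrupting memory.
 * Returns the base address of the mapping, or NULL on failure. */
void *alloc_guarded_stack(size_t size)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || size <= (size_t)page)
        return NULL;

    void *base = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* Guard page at the low end of the area. */
    if (mprotect(base, (size_t)page, PROT_NONE) != 0) {
        munmap(base, size);
        return NULL;
    }
    return base;
}
```

The usable stack then runs from base + page size up to base + size; touching any byte in the first page raises SIGSEGV.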

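For a user-specified stack, the pointer is aligned, the top of the thread stack is set and the bottom is computed. No protection is applied; correctness is the user's responsibility.

Whichever layout is used, the thread descriptor is always located at the top of the stack, immediately adjacent to the stack itself.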

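4. Thread ids and process ids

Every LinuxThreads thread has both a thread id and a process id. The process id is the process number maintained by the kernel, while the thread id is allocated and maintained by LinuxThreads. The thread id of __pthread_initial_thread is PTHREAD_THREADS_MAX, that of __pthread_manager_thread is 2*PTHREAD_THREADS_MAX+1, and that of the first user thread is PTHREAD_THREADS_MAX+2. After that, the thread id of the n-th user thread follows the formula

tid=n*PTHREAD_THREADS_MAX+n+1

This allocation scheme guarantees that no two threads of a process (including threads that have already exited) have the same thread id. The thread id type pthread_t is defined as an unsigned long integer (unsigned long int), which also ensures that thread ids do not repeat within any reasonable running time.

Looking up the thread data structure from a thread id is done in the pthread_handle() function. It simply takes the thread id modulo PTHREAD_THREADS_MAX; the result is the thread's index in __pthread_handles.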
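The id arithmetic above is easy to check by hand or in a few lines of C. The sketch below follows the article's formula as stated, with PTHREAD_THREADS_MAX taken as 1024; the names nth_user_thread_id, handle_index and SKETCH_THREADS_MAX are invented for the example.

```c
#define SKETCH_THREADS_MAX 1024UL   /* PTHREAD_THREADS_MAX in LinuxThreads */

/* Thread id of the n-th user thread (n >= 1), per the formula in the text. */
unsigned long nth_user_thread_id(unsigned long n)
{
    return n * SKETCH_THREADS_MAX + n + 1;
}

/* Index into the handle table: the thread id modulo the table size. */
unsigned long handle_index(unsigned long tid)
{
    return tid % SKETCH_THREADS_MAX;
}
```

With these definitions the initial thread (id 1024) maps to index 0, the manager thread (id 2*1024+1) to index 1, and the first user thread (id 1026) to index 2, which matches the statement that the first two entries of __pthread_handles are the two system threads.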

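5. Thread creation

After pthread_create() sends a REQ_CREATE request to the manager thread, the manager thread calls pthread_handle_create() to create the new thread. Once the stack has been allocated and the thread attributes set, it calls __clone() with pthread_start_thread() as the entry function to create and start the new thread. pthread_start_thread() reads its own process id, stores it in the thread descriptor, and configures scheduling according to the policy recorded there. When everything is ready it calls the real thread function, and after that function returns it calls pthread_exit() to clean up.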

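6. Shortcomings of LinuxThreads

Because of limitations in the Linux kernel, the difficulty of implementation and other reasons, LinuxThreads is not fully POSIX compliant; the README in its distribution says so.

1) The process id problem

This is the most critical shortcoming, and its cause lies in the "one-to-one" model of LinuxThreads.

The Linux kernel does not support threads in the true sense. LinuxThreads implements thread support with lightweight processes, which the kernel scheduler sees exactly as it sees ordinary processes. These lightweight processes have their own process ids and have the same capabilities as ordinary processes with respect to scheduling, signal handling, I/O and so on. From the point of view of someone reading the source, the Linux kernel's clone() simply does not implement support for the CLONE_PID flag.

In the kernel's do_fork(), CLONE_PID is handled like this:

    if (clone_flags & CLONE_PID) {
        if (current->pid)
            goto fork_out;
    }

This code shows that the current Linux kernel accepts the CLONE_PID flag only when the pid is 0. In practice CLONE_PID is used only during SMP initialization, when processes are created by hand.

According to POSIX, all threads of a process should share one process id and one parent process id, which cannot be achieved under the current "one-to-one" model.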

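2) Signal handling problems

Asynchronous signals are delivered by the kernel per process, every LinuxThreads thread is a process as far as the kernel is concerned, and no "thread group" is implemented. As a result some semantics do not conform to POSIX; for example, sending a signal to all threads of a process is not implemented. The README explains this.

If the kernel does not provide real-time signals, LinuxThreads uses SIGUSR1 and SIGUSR2 as its internal restart and cancel signals, so applications cannot use these two signals that were originally reserved for users. Linux kernels after 2.1.60 all support the extended real-time signals (from _SIGRTMIN to _SIGRTMAX), so the problem does not arise there.

The default action of some signals is hard to implement in the current architecture. With SIGSTOP and SIGCONT, for example, LinuxThreads can only suspend a single thread and cannot suspend the whole process.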

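3) The thread count problem

LinuxThreads defines the maximum number of threads per process as 1024, but in practice the number is also limited by the total number of processes in the whole system, again because threads are really kernel processes.

Kernel 2.4.x uses a completely new way of computing the total number of processes, so that the total is essentially limited only by the amount of physical memory. The formula is in the fork_init() function in kernel/fork.c:

max_threads = mempages / (THREAD_SIZE/PAGE_SIZE) / 8

On i386, THREAD_SIZE=2*PAGE_SIZE, PAGE_SIZE=2^12 (4KB), and mempages = physical memory size / PAGE_SIZE. For a machine with 256M of memory, mempages=256*2^20/2^12=256*2^8, so the maximum number of threads is 4096.
However, to make sure that the processes of any one user (other than root) cannot take up more than half of physical memory, fork_init() goes on to set:

init_task.rlim[RLIMIT_NPROC].rlim_cur = max_threads/2;
init_task.rlim[RLIMIT_NPROC].rlim_max = max_threads/2;

All of these process-count checks are performed in do_fork(), so for LinuxThreads the total number of threads is limited by all three factors at once: PTHREAD_THREADS_MAX, max_threads, and the per-user RLIMIT_NPROC.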
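The arithmetic in this section can be reproduced with a small helper. The sketch below hard-codes the i386 values quoted in the text (4 KB pages, THREAD_SIZE equal to two pages); max_threads_for and per_user_limit are hypothetical names, and the code only mirrors the formula, it is not the kernel's fork_init().

```c
#define SKETCH_PAGE_SIZE   4096UL                   /* 2^12 bytes on i386 */
#define SKETCH_THREAD_SIZE (2 * SKETCH_PAGE_SIZE)   /* kernel stack per task */

/* max_threads = mempages / (THREAD_SIZE / PAGE_SIZE) / 8 */
unsigned long max_threads_for(unsigned long mem_bytes)
{
    unsigned long mempages = mem_bytes / SKETCH_PAGE_SIZE;
    return mempages / (SKETCH_THREAD_SIZE / SKETCH_PAGE_SIZE) / 8;
}

/* The default RLIMIT_NPROC applied to each non-root user. */
unsigned long per_user_limit(unsigned long max_threads)
{
    return max_threads / 2;
}
```

For 256 MB this gives 65536 pages, 4096 tasks system-wide and 2048 per user, so on such a machine the LinuxThreads limit of 1024 threads per process is the first of the three bounds to be reached.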

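4) The manager thread problem

The manager thread easily becomes a bottleneck, a weakness common to this kind of structure. The manager thread is also responsible for cleaning up after user threads. So although the manager thread blocks most signals, if it ever dies the user threads have to be cleaned up by hand. Moreover, the user threads do not know the state of the manager thread, and later requests such as thread creation will go unanswered.

5) The synchronization problem

Thread synchronization in LinuxThreads is largely built on signals. Synchronizing through the kernel's complex signal handling machinery has always been a performance problem.

6) Other POSIX compatibility problems

Many Linux system calls are, by their semantics, per process, such as nice, setuid and setrlimit. In the current LinuxThreads these calls affect only the calling thread.

7) The real-time problem

Threads were introduced partly with real-time behavior in mind, but LinuxThreads does not support it for now; scheduling options, for example, are not yet implemented. This is true not only of LinuxThreads: standard Linux gives little consideration to real-time behavior in general.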

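IV. Other thread implementations

The problems of LinuxThreads, especially the compatibility problems, have seriously hindered cross-platform applications on Linux (such as Apache) from adopting multithreaded designs, so the use of threads on Linux has remained at a fairly low level. Many people in the Linux community are working to improve thread performance, with efforts that include user-level thread libraries as well as thread libraries that combine kernel-level and user-level improvements. The two most promising projects at present are NPTL (Native Posix Thread Library), led by RedHat, and NGPT (Next Generation Posix Threading), funded by IBM. Both aim at full POSIX 1003.1c compatibility and do work both inside and outside the kernel; NGPT does so to implement a many-to-many thread model, while NPTL, as described below, keeps a 1:1 model. Both make up for the shortcomings of LinuxThreads to some extent, and both are completely new designs started from scratch.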

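1. NPTL

The design goals of NPTL can be summarized as follows:

* POSIX compatibility
* Use of SMP architectures
* Low startup cost
* Low link cost (programs that do not use threads should not be affected by the thread library)
* Binary compatibility with LinuxThreads applications
* Scalability in software and hardware
* Support for multiple architectures
* NUMA support
* Integration with C++

In its implementation, NPTL still uses the 1:1 thread model and, together with glibc and the latest Linux 2.5.x development kernels, has been optimized in signal handling, thread synchronization, memory management and other areas. Unlike LinuxThreads, NPTL does not use a manager thread; the management of kernel threads is done directly in the kernel, which also improves performance.

Mainly because of kernel issues, NPTL is still not 100% POSIX compatible, but in terms of performance it is already a great improvement over LinuxThreads.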

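2. NGPT

IBM's open source NGPT project released the stable version 2.2.0 on January 10, 2003, but much of the related documentation is still lacking. As far as is known, NGPT is an M:N model implemented on top of the GNU Pth (GNU Portable Threads) project, and GNU Pth is a classic user-level thread library implementation.

According to a notice on the official NGPT web site in March 2003, in view of the increasingly wide acceptance of NPTL and to avoid confusion caused by different thread library versions, NGPT will not be developed further and will receive only supportive maintenance. In other words, NGPT has given up competing with NPTL to become the next-generation Linux POSIX thread library standard.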

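3. Other efficient thread mechanisms

Scheduler Activations must be mentioned here. This multithreaded kernel architecture, published through the ACM in 1991, influenced the design of many multithreaded kernels, including Mach 3.0, NetBSD and the commercial Digital Unix (now called Compaq Tru64 UNIX). Its essence is to use user-level thread scheduling while reducing as far as possible the system call requests that user level makes to the kernel, since those requests are often a major source of run-time overhead. A thread mechanism with this structure combines the flexibility and efficiency of user-level threads with the practicality of kernel-level threads. For that reason, several open source operating system communities, including Linux and FreeBSD, are doing related research in an effort to implement Scheduler Activations in their own systems.

Collected and edited by http://www.Stcore.com

(Source: http://www.hackhome.com/)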

玄鹤

Nov 16, 2009, 10:19:28 AM

to freesky

http://www.wangchao.net.cn/bbsdetail_36117.html

Linux 2.4 Process Scheduling Analysis, Part 1
