
[Lse-tech] Re: 10.31 second kernel compile


Dave Hansen
Mar 13, 2002, 4:44:43 PM
Martin J. Bligh wrote:
>>Due to the final link and compress stage, there is a fair amount of idle
>>time at the end of the run. It's going to be hard to push that number
>>lower by adding cpus.
>
> I think we need to fix the final phase .... anyone got any ideas
> on parallelizing that?
The final linking stage in the makefile looks like this:

vmlinux: piggy.o $(OBJECTS)
$(LD) $(ZLINKFLAGS) -o vmlinux $(OBJECTS) piggy.o

ld has a "-r" option
`--relocateable'
Generate relocatable output---i.e., generate an output
file that can in turn serve as input to `ld'. This is
often called partial linking. As a side effect, in
environments that support standard Unix magic numbers,
this option also sets the output file's magic number
to `OMAGIC'. If this option is not specified, an
absolute file is produced. When linking C++ programs,
this option will not resolve references to
constructors; to do that, use -Ur.

If we link in chunks, we can parallelize this.
Imagine 26 object files: [a-z].o

ld -r -o abcd.o [abcd].o
ld -r -o efgh.o [efgh].o
...
ld -r -o abcdefgh.o {abcd,efgh,...}.o

then, instead of the old final link stage:
$(LD) $(ZLINKFLAGS) -o vmlinux {abcdefgh,...}.o piggy.o

The final link will still take a while, but we will have at least broken
up SOME of the work. I'm going to see if this will actually work now.
Any comments?

--
Dave Hansen
have...@us.ibm.com

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Keith Owens
Mar 13, 2002, 8:07:15 PM
On Wed, 13 Mar 2002 13:44:43 -0800,
Dave Hansen <have...@us.ibm.com> wrote:
>The final linking stage in the makefile looks like this:
>
>vmlinux: piggy.o $(OBJECTS)
> $(LD) $(ZLINKFLAGS) -o vmlinux $(OBJECTS) piggy.o
>
>If we link in chunks, we can parallelize this.
>Image 26 object files: [a-z].o
>
>ld -r -o abcd.o [abcd].o
>ld -r -o efgh.o [efgh].o
>...
>ld -r -o abcdefgh.o {abcd,efgh,...}.o
>
>then, instead of the old final link stage:
>$(LD) $(ZLINKFLAGS) -o vmlinux {abcdefgh,...}.o piggy.o
>
>The final link will still take a while, but we will have at least broken
>up SOME of the work. I'm going to see if this will actually work now.
>Any comments?

I'm sorry Dave, you can't do that ;) The init_call order is controlled
by link order, change the link order and you corrupt the kernel
initialization order, double plus ungood. The link of vmlinux requires
that $(OBJECTS) be exactly as coded.

Momchil Velikov
Mar 14, 2002, 8:21:38 AM
>>>>> "Anton" == Anton Blanchard <an...@samba.org> writes:
Anton> That's due to the way we manipulate the ppc hashed page table. Every
Anton> time we update the linux page tables we have to update the hashed
Anton> page table. There are some obvious optimisations we need to make,

Out of curiosity, why is there a need to update the Linux page tables?
Don't the pte/pmd/pgd family of functions provide enough abstraction
to maintain _only_ the hashed page table?

Regards,
-velco

Dipankar Sarma
Mar 14, 2002, 8:16:09 AM
On Thu, Mar 14, 2002 at 10:27:26PM +1100, Anton Blanchard wrote:
>
> > > 554 .d_lookup
> >
> > Did you try the dcache patches?
>
> Not for this, I did do some benchmarking of the RCU dcache patches a
> while ago which I should post.

Please do ;-) This shows why we need to ease the pressure on dcache_lock.

>
> > Can you publish lockmeter stats?
>
I didn't get a chance to run lockmeter; I tend to use the kernel profiler
and a hacked readprofile (originally from tridge) that displays
profile hits vs. assembly instruction. That's usually good enough to work
out which spinlocks are a problem.

Is this a PPC-only hack? Also, where can I get it?

Thanks
--
Dipankar Sarma <dipa...@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

Hanna Linder
Mar 14, 2002, 1:21:49 PM

--On Thursday, March 14, 2002 22:27:26 +1100 Anton Blanchard <an...@samba.org> wrote:

>> > 554 .d_lookup
>>
>> Did you try the dcache patches?
>
> Not for this, I did do some benchmarking of the RCU dcache patches a
> while ago which I should post.
>

There are two dcache patches. The one I wrote based on Al Viro's
suggestion for fast path walking is especially good for NUMA
systems hit by cache bouncing (d_lookup is the main culprit in
the dcache). Martin had some initial results that looked very
good.

Following is the clean 2.5.6 version, which is also available at
http://sf.net/projects/lse under the Read-Copy Update section.

Hanna Linder (han...@us.ibm.com)
IBM Linux Technology Center

-----------
diff -Nru -X dontdiff linux-2.5.6/fs/dcache.c linux-2.5.6-fw/fs/dcache.c
--- linux-2.5.6/fs/dcache.c Thu Mar 7 18:18:13 2002
+++ linux-2.5.6-fw/fs/dcache.c Fri Mar 8 13:50:43 2002
@@ -704,13 +704,22 @@

struct dentry * d_lookup(struct dentry * parent, struct qstr * name)
{
+ struct dentry * dentry;
+ spin_lock(&dcache_lock);
+ dentry = __d_lookup(parent,name);
+ spin_unlock(&dcache_lock);
+ return dentry;
+}
+
+struct dentry * __d_lookup(struct dentry * parent, struct qstr * name)
+{
+
unsigned int len = name->len;
unsigned int hash = name->hash;
const unsigned char *str = name->name;
struct list_head *head = d_hash(parent,hash);
struct list_head *tmp;

- spin_lock(&dcache_lock);
tmp = head->next;
for (;;) {
struct dentry * dentry = list_entry(tmp, struct dentry, d_hash);
@@ -732,10 +741,8 @@
}
__dget_locked(dentry);
dentry->d_vfs_flags |= DCACHE_REFERENCED;
- spin_unlock(&dcache_lock);
return dentry;
}
- spin_unlock(&dcache_lock);
return NULL;
}

diff -Nru -X dontdiff linux-2.5.6/fs/namei.c linux-2.5.6-fw/fs/namei.c
--- linux-2.5.6/fs/namei.c Thu Mar 7 18:18:24 2002
+++ linux-2.5.6-fw/fs/namei.c Fri Mar 8 13:56:25 2002
@@ -268,8 +268,41 @@
static struct dentry * cached_lookup(struct dentry * parent, struct qstr * name, int flags)
{
struct dentry * dentry = d_lookup(parent, name);
+
+ if (dentry && dentry->d_op && dentry->d_op->d_revalidate) {
+ if (!dentry->d_op->d_revalidate(dentry, flags) && !d_invalidate(dentry)) {
+ dput(dentry);
+ dentry = NULL;
+ }
+ }
+ return dentry;
+}

+/*for fastwalking*/
+static inline void undo_locked(struct nameidata *nd)
+{
+ if(nd->flags & LOOKUP_LOCKED){
+ dget(nd->dentry);
+ mntget(nd->mnt);
+ spin_unlock(&dcache_lock);
+ nd->flags &= ~LOOKUP_LOCKED;
+ }
+}
+
+/*
+ * For fast path lookup while holding the dcache_lock.
+ * SMP-safe
+ */
+static struct dentry * cached_lookup_nd(struct nameidata * nd, struct qstr * name, int flags)
+{
+ struct dentry * dentry = NULL;
+ if(!(nd->flags & LOOKUP_LOCKED))
+ return cached_lookup(nd->dentry, name, flags);
+
+ dentry = __d_lookup(nd->dentry, name);
+
if (dentry && dentry->d_op && dentry->d_op->d_revalidate) {
+ undo_locked(nd);
if (!dentry->d_op->d_revalidate(dentry, flags) && !d_invalidate(dentry)) {
dput(dentry);
dentry = NULL;
@@ -279,6 +312,34 @@
}

/*
+ * Short-cut version of permission(), for calling by
+ * path_walk(), when dcache lock is held. Combines parts
+ * of permission() and vfs_permission(), and tests ONLY for
+ * MAY_EXEC permission.
+ *
+ * If appropriate, check DAC only. If not appropriate, or
+ * short-cut DAC fails, then call permission() to do more
+ * complete permission check.
+ */
+static inline int exec_permission_lite(struct inode *inode)
+{
+ umode_t mode = inode->i_mode;
+
+ if ((inode->i_op && inode->i_op->permission))
+ return -EACCES;
+
+ if (current->fsuid == inode->i_uid)
+ mode >>= 6;
+ else if (in_group_p(inode->i_gid))
+ mode >>= 3;
+
+ if (mode & MAY_EXEC)
+ return 0;
+
+ return -EACCES;
+}
+
+/*
* This is called when everything else fails, and we actually have
* to go to the low-level filesystem to find out what we should do..
*
@@ -472,7 +533,9 @@
struct qstr this;
unsigned int c;

- err = permission(inode, MAY_EXEC);
+ err = exec_permission_lite(inode);
+ if(err)
+ err = permission(inode, MAY_EXEC);
dentry = ERR_PTR(err);
if (err)
break;
@@ -507,6 +570,7 @@
case 2:
if (this.name[1] != '.')
break;
+ undo_locked(nd);
follow_dotdot(nd);
inode = nd->dentry->d_inode;
/* fallthrough */
@@ -523,16 +587,20 @@
break;
}
/* This does the actual lookups.. */
- dentry = cached_lookup(nd->dentry, &this, LOOKUP_CONTINUE);
+ dentry = cached_lookup_nd(nd, &this, LOOKUP_CONTINUE);
if (!dentry) {
+ undo_locked(nd);
dentry = real_lookup(nd->dentry, &this, LOOKUP_CONTINUE);
err = PTR_ERR(dentry);
if (IS_ERR(dentry))
break;
}
/* Check mountpoints.. */
- while (d_mountpoint(dentry) && __follow_down(&nd->mnt, &dentry))
- ;
+ if(d_mountpoint(dentry)){
+ undo_locked(nd);
+ while (d_mountpoint(dentry) && __follow_down(&nd->mnt, &dentry))
+ ;
+ }

err = -ENOENT;
inode = dentry->d_inode;
@@ -543,6 +611,7 @@
goto out_dput;

if (inode->i_op->follow_link) {
+ undo_locked(nd);
err = do_follow_link(dentry, nd);
dput(dentry);
if (err)
@@ -555,7 +624,8 @@
if (!inode->i_op)
break;
} else {
- dput(nd->dentry);
+ if (!(nd->flags & LOOKUP_LOCKED))
+ dput(nd->dentry);
nd->dentry = dentry;
}
err = -ENOTDIR;
@@ -575,6 +645,7 @@
case 2:
if (this.name[1] != '.')
break;
+ undo_locked(nd);
follow_dotdot(nd);
inode = nd->dentry->d_inode;
/* fallthrough */
@@ -586,7 +657,8 @@
if (err < 0)
break;
}
- dentry = cached_lookup(nd->dentry, &this, 0);
+ dentry = cached_lookup_nd(nd, &this, 0);
+ undo_locked(nd);
if (!dentry) {
dentry = real_lookup(nd->dentry, &this, 0);
err = PTR_ERR(dentry);
@@ -626,11 +698,14 @@
else if (this.len == 2 && this.name[1] == '.')
nd->last_type = LAST_DOTDOT;
return_base:
+ undo_locked(nd);
return 0;
out_dput:
+ undo_locked(nd);
dput(dentry);
break;
}
+ undo_locked(nd);
path_release(nd);
return_err:
return err;
@@ -734,6 +809,36 @@
nd->dentry = dget(current->fs->pwd);
read_unlock(&current->fs->lock);
return 1;
+}
+
+int path_lookup(const char *name, unsigned int flags, struct nameidata *nd)
+{
+ nd->last_type = LAST_ROOT; /* if there are only slashes... */
+ nd->flags = flags;
+ if (*name=='/'){
+ read_lock(&current->fs->lock);
+ if (current->fs->altroot && !(nd->flags & LOOKUP_NOALT)) {
+ nd->mnt = mntget(current->fs->altrootmnt);
+ nd->dentry = dget(current->fs->altroot);
+ read_unlock(&current->fs->lock);
+ if (__emul_lookup_dentry(name,nd))
+ return 0;
+ read_lock(&current->fs->lock);
+ }
+ spin_lock(&dcache_lock); /*to avoid cacheline bouncing with d_count*/
+ nd->mnt = current->fs->rootmnt;
+ nd->dentry = current->fs->root;
+ read_unlock(&current->fs->lock);
+ }
+ else{
+ read_lock(&current->fs->lock);
+ spin_lock(&dcache_lock);
+ nd->mnt = current->fs->pwdmnt;
+ nd->dentry = current->fs->pwd;
+ read_unlock(&current->fs->lock);
+ }
+ nd->flags |= LOOKUP_LOCKED;
+ return (path_walk(name, nd));
}

/*
diff -Nru -X dontdiff linux-2.5.6/include/linux/dcache.h linux-2.5.6-fw/include/linux/dcache.h
--- linux-2.5.6/include/linux/dcache.h Thu Mar 7 18:18:30 2002
+++ linux-2.5.6-fw/include/linux/dcache.h Fri Mar 8 13:50:43 2002
@@ -220,6 +220,7 @@

/* appendix may either be NULL or be used for transname suffixes */
extern struct dentry * d_lookup(struct dentry *, struct qstr *);
+extern struct dentry * __d_lookup(struct dentry *, struct qstr *);

/* validate "insecure" dentry pointer */
extern int d_validate(struct dentry *, struct dentry *);
diff -Nru -X dontdiff linux-2.5.6/include/linux/fs.h linux-2.5.6-fw/include/linux/fs.h
--- linux-2.5.6/include/linux/fs.h Thu Mar 7 18:18:19 2002
+++ linux-2.5.6-fw/include/linux/fs.h Fri Mar 8 13:50:43 2002
@@ -1273,12 +1273,15 @@
* - require a directory
* - ending slashes ok even for nonexistent files
* - internal "there are more path compnents" flag
+ * - locked when lookup done with dcache_lock held
*/
#define LOOKUP_FOLLOW (1)
#define LOOKUP_DIRECTORY (2)
#define LOOKUP_CONTINUE (4)
#define LOOKUP_PARENT (16)
#define LOOKUP_NOALT (32)
+#define LOOKUP_LOCKED (64)
+
/*
* Type of the last component on LOOKUP_PARENT
*/
@@ -1309,13 +1312,7 @@
extern int FASTCALL(path_init(const char *, unsigned, struct nameidata *));
extern int FASTCALL(path_walk(const char *, struct nameidata *));
extern int FASTCALL(link_path_walk(const char *, struct nameidata *));
-static inline int path_lookup(const char *path, unsigned flags, struct nameidata *nd)
-{
- int error = 0;
- if (path_init(path, flags, nd))
- error = path_walk(path, nd);
- return error;
-}
+extern int FASTCALL(path_lookup(const char *, unsigned, struct nameidata *));
extern void path_release(struct nameidata *);
extern int follow_down(struct vfsmount **, struct dentry **);
extern int follow_up(struct vfsmount **, struct dentry **);

Linus Torvalds
Mar 14, 2002, 2:05:50 PM
In article <87wuwfx...@fadata.bg>,

Momchil Velikov <ve...@fadata.bg> wrote:
>
>Out of curiosity, why is there a need to update the Linux page tables?
>Don't the pte/pmd/pgd family of functions provide enough abstraction
>to maintain _only_ the hashed page table?

No. The IBM hashed page tables are not page tables at all, they are
really just a bigger 16-way set-associative in-memory TLB.

You can't actually sanely keep track of VM layout in them.

Those POWER4 machines are wonderful things, but they have a few quirks:

- it's so expensive that anybody who is slightly price-conscious gets a
farm of PC's instead. Oh, well.

- the CPU module alone is something like .5 kilowatts (translation:
don't expect it in a nice desktop factor, even if you could afford
it).

- IBM nomenclature really is broken. They call disks DASD devices, and
they call their hash table a page table, and they just confuse
themselves and everybody else for no good reason. They number bits
the wrong way around, for example (and big-endian bit ordering really
_is_ clearly inferior to little-endian, unlike byte-ordering. Watch
the _same_ bits in the _same_ register change name in the 32 vs
64-bit architecture manuals, and puke)

But with all their faults, they do have this really studly setup with 8
big, fast CPU's on a single module. A few of those modules and you get
some ass-kick performance numbers. As you can see.

Linus

Chris Wedgwood
Mar 15, 2002, 7:16:56 AM
On Thu, Mar 14, 2002 at 07:33:40PM +0100, Daniel Phillips wrote:
> On March 14, 2002 02:21 pm, Momchil Velikov wrote:
> > Out of curiosity, why is there a need to update the Linux page
> > tables? Don't the pte/pmd/pgd family of functions provide enough
> > abstraction to maintain _only_ the hashed page table?
>
> No, it's hardwired to the x86 tree view of page translation.

What about doing soft TLB reloads then?


--cw

Linus Torvalds
Mar 15, 2002, 1:20:17 PM
In article <E16la2m-0000SX-00@starship>,

Daniel Phillips <phil...@bonn-fries.net> wrote:
>On March 14, 2002 02:21 pm, Momchil Velikov wrote:
>>
>> Out of curiosity, why is there a need to update the Linux page tables?
>> Don't the pte/pmd/pgd family of functions provide enough abstraction
>> to maintain _only_ the hashed page table?
>
>No, it's hardwired to the x86 tree view of page translation.

No no no.

If you think that, then you don't see the big picture.

In fact, when I did the 3-level page tables for Linux, no x86 chips that
could _use_ three levels actually existed.

The Linux MM was actually _designed_ for portability when I did the port
to alpha (oh, that's a long time ago). I even wrote my masters thesis on
why it was done the way it was done (the only actual academic use I ever
got out of the whole Linux exercise ;)

Yes, a tree-based page table matches a lot of hardware architectures very
well. And it's _not_ just x86: it also matches soft-fill TLB's better
than alternatives (newer sparcs and MIPS), and matches a number of other
architecture specifications (eg alpha, m68k).

So on about 50% of architectures (and 99.9% of machines), the Linux MM
data structures can be made to map 1:1 to the hardware constructs, so
that you avoid duplicate information.

But more importantly than that, the whole point really is that the page
table tree as far as Linux is concerned is nothing but an _abstraction_
of the VM mapping hardware. It so happens that a tree format is the only
sane format to keep full VM information that works well with real loads.

Whatever the hardware actually does, Linux considers that to be nothing
but an extended TLB. When you can make the MM software tree map 1:1
with the extended TLB (as on x86), you win in memory usage and in
cheaper TLB invalidates, but you _could_ (if you wanted to) just keep
two separate trees. In fact, with the rmap patches, that's exactly what
you see: the software tree is _not_ 1:1 with the hardware tree any more
(but it _is_ a proper superset, so that you can still get partial
sharing and still get the cheaper TLB updates).

Are there machines where the sharing between the software abstraction
and the hardware isn't as total? Sure. But if you actually know how
hashed page tables work on ppc, you'd be aware of the fact that they
aren't actually able to do a full VM mapping - when a hash chain gets too
long, the hardware is no longer able to look it up ("too long" being 16
entries on a PPC, for example).

And that's a common situation with non-tree VM representations - they
aren't actually VM representations, they are just caches of what the
_real_ representation is. And what do we call such caches? Right: they
are nothing but a TLB.

So the fact is, the Linux tree-based VM has _nothing_ to do with x86
tree-basedness, and everything to do with the fact that it's the only
sane way to keep VM information.

The fact that it maps 1:1 to the x86 trees with the "folding" of the mid
layer was a design consideration, for sure. Being efficient and clever
is always good. But the basic reason for tree-ness lies elsewhere.
(The basic reasons for tree-ness is why so many architectures _do_ use a
tree-based page table - you should think of PPC and ia64 as the sick
puppies who didn't understand. Read the PPC documentation on virtual
memory, and you'll see just _how_ sick they are).

Linus
