Minutes compiled by Hanna Linder (han...@us.ibm.com); please post
corrections to lse-...@lists.sf.net.
Object Based Reverse Mapping:
(Dave McCracken, Ben LaHaise, Rik van Riel, Martin Bligh, Gerrit Huizenga)
Dave coded up an initial patch for partial object-based rmap,
which he sent to linux-mm yesterday. Rik pointed out there is a scalability
problem with the full object-based approach. However, a hybrid approach
between regular rmap and object-based rmap may not be too radical for
the 2.5/2.6 timeframe.
Ben said none of the users have been complaining about
performance with the existing rmap. Martin disagreed and said that he, Linus,
and Andrew Morton have all agreed there is a problem.
One of the problems Martin is already hitting on machines with many CPUs and
large memory is the space consumed by all the pte-chains filling up
memory and killing the machine. There is also a performance impact from
maintaining the chains.
Ben said they shouldn't be using fork; bash is the
main user of fork and should be changed to use clone instead.
Gerrit said bash is not used as much as Ben might think on
these large systems running real-world applications.
Ben said he doesn't see the large-system problems with
the users he talks to and doesn't agree the full object-based rmap
is needed. Gerrit explained we have very complex workloads running on
very large systems and we are already hitting the space consumption
problem, which is a blocker for running Linux on them.
Ben said none of the distros are supporting these large
systems right now. Martin said UL is already starting to support
them. Then it degraded into a distro discussion and Hanna asked
for them to bring it back to the technical side.
In order to show the problem with object-based rmap you have to
add VM pressure to existing benchmarks to see what happens. Martin
agreed to run multiple benchmarks on the same systems to simulate this.
Cliff White of the OSDL offered to help Martin with this.
At the end Ben said the solution for now needs to be
a hybrid with existing rmap. Martin, Rik, and Dave all agreed with Ben.
Then we all agreed to move on to other things.
*ActionItem - someone needs to change bash to use clone instead of fork.
Scheduler Hang as discovered by restarting a large Web application
multiple times:
Rick Lindsley / Hanna Linder
We were seeing a hard hang after restarting a large web
serving application 3-6 times on the 2.5.59 (and up) kernels
(also seen as far back as 2.5.44). It was mainly caused when two
threads each have interrupts disabled and one is spinning on a lock that
the other is holding. The one holding the lock has sent an IPI to all
the other processors telling them to flush their TLBs. But the one
waiting for the spinlock has interrupts turned off and does not receive
that IPI request. So they both sit there waiting forever.
The final fix will be in kernel.org mainline kernel version 2.5.63.
Here are the individual patches which should apply with fuzz to
older kernel versions:
http://linux.bkbits.net:8080/linux-2.5/cs...@1.1005?nav=index.html
http://linux.bkbits.net:8080/linux-2.5/cs...@1.1004?nav=index.html
Shared Memory Binding:
Matt Dobson -
Shared memory binding API (new): a way for an
application to bind shared memory to nodes. The motivation
is support for large databases that want more control
over their shared memory.
The current allocation scheme is that each process gets
a chunk of shared memory from the same node the process
is located on. Instead of page faulting around to different
nodes dynamically, this API will allow a process to specify
which node or set of nodes to bind the shared memory to.
Work in progress.
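As a rough illustration of the kind of interface being discussed (shm_bind() and its arguments are invented here, not Matt's actual proposal):

#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical syscall wrapper: bind future faults on the segment to a
 * set of nodes, instead of whichever node the faulting process runs on. */
int shm_bind(int shmid, const unsigned long *nodemask, unsigned long maxnode);

int example(void)
{
        int shmid = shmget(IPC_PRIVATE, 64 << 20, IPC_CREAT | 0600);
        unsigned long nodemask = (1UL << 0) | (1UL << 2);   /* nodes 0 and 2 */
        char *p;

        if (shmid < 0)
                return -1;

        /* Ask the kernel to satisfy this segment's pages from nodes 0 and 2. */
        shm_bind(shmid, &nodemask, 8 * sizeof(nodemask));

        p = shmat(shmid, NULL, 0);      /* pages now come from the bound nodes */
        p[0] = 1;                       /* first fault allocates on node 0 or 2 */
        shmdt(p);
        return 0;
}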
Martin - gcc 2.95 vs 3.2:
Martin has done some testing which indicates that gcc 3.2 produces
slightly worse code for the kernel than 2.95 and takes a bit
longer to do so. gcc 3.2 -Os produces larger code than gcc 2.95 -O2.
On his machines -O2 was faster than -Os, but on a CPU with smaller
caches the inverse may be true. More testing may be needed.
Ben is right. I think IBM and the other big iron companies would be
far better served looking at what they have done with running multiple
instances of Linux on one big machine, like the 390 work. Figure out
how to use that model to scale up. There is simply not a big enough
market to justify shoveling lots of scaling stuff in for huge machines
that only a handful of people can afford. That's the same path which
has sunk all the workstation companies, they all have bloated OS's and
Linux runs circles around them.
In terms of the money and in terms of installed seats, the small Linux
machines out number the 4 or more CPU SMP machines easily 10,000:1.
And with the embedded market being one of the few real money makers
for Linux, there will be huge pushback from those companies against
changes which increase memory footprint.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
Scalability done properly should not degrade performance on smaller
machines, Pee Cees, or even microscopic organisms.
On Fri, Feb 21, 2003 at 04:16:18PM -0800, Larry McVoy wrote:
> In terms of the money and in terms of installed seats, the small Linux
> machines out number the 4 or more CPU SMP machines easily 10,000:1.
> And with the embedded market being one of the few real money makers
> for Linux, there will be huge pushback from those companies against
> changes which increase memory footprint.
There's quite a bit of commonality with large x86 highmem there, as
the highmem crew is extremely concerned about the kernel's memory
footprint and is looking to trim kernel memory overhead from every
aspect of its operation they can. Reducing kernel memory footprint
is a crucial part of scalability, in both scaling down to the low end
and scaling up to highmem. =)
-- wli
In your humble opinion.
Unfortunately, as I've pointed out to you before, this doesn't work in
practice. Workloads may not be easily divisible amongst machines, and
you're just pushing all the complex problems out for every userspace
app to solve itself, instead of fixing it once in the kernel.
The fact that you were never able to do this before doesn't mean it's
impossible, it just means that you failed.
> In terms of the money and in terms of installed seats, the small Linux
> machines out number the 4 or more CPU SMP machines easily 10,000:1.
> And with the embedded market being one of the few real money makers
> for Linux, there will be huge pushback from those companies against
> changes which increase memory footprint.
And the profit margin on the big machines will outpace the smaller
machines by a similar ratio, inverted. The high-end space is where most
of the money is made by the Linux distros, by selling products like SLES
or Advanced Server to people who can afford to pay for it.
M.
mjb> Unfortunately, as I've pointed out to you before, this doesn't work
mjb> in practice. Workloads may not be easily divisible amongst
mjb> machines, and you're just pushing all the complex problems out for
mjb> every userspace app to solve itself, instead of fixing it once in
mjb> the kernel.
Please permit an observer from the sidelines a few comments.
I think all four of you are right, for different reasons.
>
> Scalability done properly should not degrade performance on smaller
> machines, Pee Cees, or even microscopic organisms.
s/should/must/ in the above. That must be a guiding principle.
>
>
> On Fri, Feb 21, 2003 at 04:16:18PM -0800, Larry McVoy wrote:
> > In terms of the money and in terms of installed seats, the small Linux
> > machines out number the 4 or more CPU SMP machines easily 10,000:1.
> > And with the embedded market being one of the few real money makers
> > for Linux, there will be huge pushback from those companies against
> > changes which increase memory footprint.
>
> There's quite a bit of commonality with large x86 highmem there, as
> the highmem crew is extremely concerned about the kernel's memory
> footprint and is looking to trim kernel memory overhead from every
> aspect of its operation they can. Reducing kernel memory footprint
> is a crucial part of scalability, in both scaling down to the low end
> and scaling up to highmem. =)
>
>
> -- wli
Since the time between major releases of the kernel seems to be two to
three years now (counting to where the new kernel is really stable),
it is probably worthwhile to think about what high-end systems will
be like when 3.0 is expected.
My guess is that a trend will be machines with increasingly greater cpu
counts with access to the same memory. Why? Because if it can be done,
it will be done. The ability to put more cpus on a single chip may
translate into a Moore's law of increasing cpu counts per machine. And
as Martin points out, the high end machines are where the money is.
In my own unsophisticated opinion, Larry's concept of Cache Coherent
Clusters seems worth further development. And Martin is right about the
need for fixing it in the kernel, again IMHO. But how to fix it in the
kernel? Would something similar to OpenMosix or OpenSSI in a future
kernel be appropriate to get Larry's CCCluster members to cooperate? Or
is it possible to continue the scalability race when CPU counts get to
256, 512, etc.?
Just some thoughts from the sidelines.
Best regards,
Steven
My opinion has nothing to do with it, go benchmark them and see for
yourself. I'm in a pretty good position to back up my statements with
data, we support BitKeeper on AIX, Solaris, IRIX, HP-UX, Tru64, as well
as a pile of others, so we have both the hardware and the software to
do the comparisons. I stand by my statement above and so does anyone else
who has done the measurements. It is much much more pleasant to have
Linux versus any other Unix implementation on the same platform. Let's
keep it that way.
> Unfortunately, as I've pointed out to you before, this doesn't work in
> practice. Workloads may not be easily divisible amongst machines, and
> you're just pushing all the complex problems out for every userspace
> app to solve itself, instead of fixing it once in the kernel.
"fixing it", huh? Your "fixes" may be great for your tiny segment of
the market but they are not going to be welcome if they turn Linux into
BloatOS 9.8.
> The fact that you were never able to do this before doesn't mean it's
> impossible, it just means that you failed.
Thanks for the vote of confidence. I think the thing to focus on,
however, is that *no one* has ever succeeded at what you are trying
to do. And there have been many, many attempts. Your opinion, it
would appear, is that you are smarter than all of the people in all
of those past failed attempts, but you'll forgive me if I'm not
impressed with your optimism.
> > In terms of the money and in terms of installed seats, the small Linux
> > machines out number the 4 or more CPU SMP machines easily 10,000:1.
> > And with the embedded market being one of the few real money makers
> > for Linux, there will be huge pushback from those companies against
> > changes which increase memory footprint.
>
> And the profit margin on the big machines will outpace the smaller
> machines by a similar ratio, inverted.
Really? How about some figures? You'd need HUGE profit margins to
justify your position, how about some actual hard cold numbers?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
Nope, I was referring to this:
>> > Ben is right. I think IBM and the other big iron companies would be
>> > far better served looking at what they have done with running multiple
>> > instances of Linux on one big machine, like the 390 work. Figure out
>> > how to use that model to scale up. There is simply not a big enough
>> > market to justify shoveling lots of scaling stuff in for huge machines
>> > that only a handful of people can afford.
Which I totally disagree with.
>> >That's the same path which
>> > has sunk all the workstation companies, they all have bloated OS's and
>> > Linux runs circles around them.
Not the fact that Linux is capable of stellar things, which I totally
agree with.
> I'm in a pretty good position to back up my statements with
> data, we support BitKeeper on AIX, Solaris, IRIX, HP-UX, Tru64, as well
> as a pile of others, so we have both the hardware and the software to
> do the comparisons. I stand by statement above and so does anyone else
> who has done the measurements.
Oh, I don't doubt it - But I'd be amused to see the measurements,
if you have them to hand.
> It is much much more pleasant to have Linux versus any other Unix
> implementation on the same platform. Let's keep it that way.
Absolutely.
>> Unfortunately, as I've pointed out to you before, this doesn't work in
>> practice. Workloads may not be easily divisible amongst machines, and
>> you're just pushing all the complex problems out for every userspace
>> app to solve itself, instead of fixing it once in the kernel.
>
> "fixing it", huh? Your "fixes" may be great for your tiny segment of
> the market but they are not going to be welcome if they turn Linux into
> BloatOS 9.8.
They won't - the maintainers would never allow us to do that.
>> The fact that you were never able to do this before doesn't mean it's
>> impossible, it just means that you failed.
>
> Thanks for the vote of confidence. I think the thing to focus on,
> however, is that *noone* has ever succeeded at what you are trying
> to do. And there have been many, many attempts. Your opinion, it
> would appear, is that you are smarter than all of the people in all
> of those past failed attempts, but you'll forgive me if I'm not
> impressed with your optimism.
Who said that I was going to single-handedly change the world? What's
different with Linux is the development model. That's why *we* will
succeed where others have failed before. There's some incredible intellect
all around Linux, but that's not all it takes, as you've pointed out.
>> > In terms of the money and in terms of installed seats, the small Linux
>> > machines out number the 4 or more CPU SMP machines easily 10,000:1.
>> > And with the embedded market being one of the few real money makers
>> > for Linux, there will be huge pushback from those companies against
>> > changes which increase memory footprint.
>>
>> And the profit margin on the big machines will outpace the smaller
>> machines by a similar ratio, inverted.
>
> Really? How about some figures? You'd need HUGE profit margins to
> justify your position, how about some actual hard cold numbers?
I don't have them to hand, but if you think anyone's making money on
PCs nowadays, you're delusional (with respect to hardware). With respect
to Linux, what makes you think distros are going to make large amounts
of money from a freely replicatable OS, for tiny embedded systems?
Support for servers, on the other hand, is a different game ...
M.
The path to hell is paved with good intentions.
> > Really? How about some figures? You'd need HUGE profit margins to
> > justify your position, how about some actual hard cold numbers?
>
> I don't have them to hand, but if you think anyone's making money on
> PCs nowadays, you're delusional (with respect to hardware).
Let's see, Dell has a $66B market cap, revenues of $8B/quarter and
$500M/quarter in profit.
Lots of people working for companies who haven't figured out how to do
it as well as Dell *say* it can't be done but numbers say differently.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
And how much of that was profit on PCs running Linux?
M.
While I totally agree with your points, I want to mention that
although this ratio is true, the exact opposite ratio applies to
the price of the service contracts a company can land with the big
machines :-)
While I understand these numbers are on the mark, there is a tertiary
issue to realize.
Dell makes money on many things other than thin-margin PCs. And lo'
and behold one of those things is selling the larger Intel based
servers and support contracts to go along with that. And so you're
nearly supporting Martin's arguments for supporting large servers
better under Linux by bringing up Dell's balance sheet :-)
Or PCs period, they make tons of bucks on servers and associated
support contracts.
Intel can use PAE to "turn back the clock" on ia32. Although googling
doesn't support this speculation, I am willing to bet Intel will
eventually unveil a new PAE that busts the 64GB barrier -- instead of
trying harder to push consumers to 64-bit processors. Processor speed,
FSB speed, PCI bus bandwidth, all these are issues -- but ones that
pale in comparison to the long term effects of highmem on the market.
Enterprise customers will see this as a signal to continue building
around ia32 for the next few years, thoroughly damaging 64-bit
technology sales and development. I bet even IA64 suffers...
at Intel's own hands. Rumors of a "Pentium64" at Intel are constantly
floating around The Register and various rumor web sites, but Intel
is gonna miss that huge profit opportunity too by trying to hack the
ia32 ISA to scale up to big iron -- where it doesn't belong.
Being cynical, one might guess that Intel will treat IA64 as a loss
leader until the other 64-bit competition dies, keeping ia32 at the
top end of the market via silly PAE/PSE hacks. When the existing
64-bit competition disappears, five years down the road, compilers
will have matured sufficiently to make using IA64 boxes feasible.
If you really want to scale, just go to 64-bits, darn it. Don't keep
hacking ia32 ISA -- leave it alone, it's fine as it is, and will live
a nice long life as the future's preferred embedded platform.
64-bit. alpha is old tech, and dead. *sniff* sparc64 is mostly
old tech, and mostly dead. IA64 isn't, yet. x86-64 is _nice_ tech,
but who knows if AMD will survive competition with Intel. PPC64 is
the wild card in all this. I hope it succeeds.
Jeff,
feeling like a silly, random rant after a long drive
...and from a technical perspective, highmem grots up the code, too :)
I did some digging trying to find that ratio before I posted last night
and couldn't. You obviously think that the servers are a significant
part of their business. I'd be surprised at that, but that's cool,
what are the numbers? PC's, monitors, disks, laptops, anything with less
than 4 cpus is in the little bucket, so how much revenue does Dell generate
on the 4 CPU and larger servers?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
It's not a question of revenue, it's one of profit. Very few people buy
desktops for use with Linux, compared to those that buy them for Windows.
The profit on each PC is small, thus I still think a substantial proportion
of the profit made by hardware vendors from Linux is on servers rather than
desktop PCs. The numbers will be smaller for high end machines, but the
profit margins are much higher.
M.
That's all handwaving and has no meaning without numbers. I could care less
if Dell has 99.99% margins on their servers, if they only sell $50M of servers
a quarter that is still less than 10% of their quarterly profit.
So what are the actual *numbers*? Your point makes sense if and only if
people sell lots of servers. I spent a few minutes in google: worldwide
server sales are $40B at the moment. The overwhelming majority of that
revenue is small servers. Let's say that Dell has 20% of that market,
that's $2B/quarter. Now let's chop off the 1-2 CPU systems. I'll bet
you long long odds that that is 90% of their revenue in the server space.
Supposing that's right, that's $200M/quarter in big iron sales. Out of
$8000M/quarter.
I'd love to see data which is different than this but you'll have a tough
time finding it. More and more companies are looking at the cost of
big iron and deciding it doesn't make sense to spend $20K/CPU when they
could be spending $1K/CPU. Look at Google, try selling them some big
iron. Look at Wall Street - abandoning big iron as fast as they can.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
But we're talking about linux ... and we're talking about profit, not
revenue. I'd guess that 99% of their desktop sales are for Windows.
And I'd guess they make 100 times as much profit on a big server as they
do on a desktop PC.
Would be nice if someone had real numbers, but I doubt they're published
except in non-free corporate research reports.
M.
You are thinking in today's terms. Find the asymptote and project out.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
OK, I predict that Linux will take over the whole of the high end server
market ... if people stop complaining about us fixing scalability. That
should give some nicer numbers ....
M.
Extending the useful life of current hardware will shift profit even
further towards support contracts, and away from hardware sales.
Imagine the performance gain a webserver serving mostly static
content, with light database and scripting usage, is going to see
moving from a 2.4 to a 2.6 kernel. Zero-copy and filesystem
improvements alone will extend its useful life dramatically, in my
opinion.
John.
I think people overestimate the number of large boxes badly. Several IDE
pre-patches didn't work on highmem boxes. It took *ages* for people to
actually notice there was a problem. The desktop world is still 128-256Mb
and some of the crap people push is problematic even there. In the embedded
space where there is a *ton* of money to be made by smart people a lot
of the 2.5 choices look very questionable indeed - but not all by any
means, we are for example close to being able to dump the block layer,
shrink stacks down by using IRQ stacks and other good stuff.
I'm hoping the Montavista and IBM people will swat each others bogons 8)
Alan
Err, here's a news flash. Dell has just one server with more than
4 CPUs and it tops out at 8. Everything else is clusters. And they
call any machine that doesn't have a head a server; they have servers
starting at $299. Yeah, that's right, $299.
http://www.dell.com/us/en/bsd/products/series_pedge_servers.htm
How much do you want to bet that more than 95% of their server revenue
comes from 4CPU or less boxes? I wouldn't be surprised if it is more
like 99.5%. And you can configure yourself a pretty nice quad xeon box
for $25K. Yeah, there is some profit in there but nowhere near the huge
margins you are counting on to make your case.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
On Sat, Feb 22, 2003 at 06:20:19PM +0000, Alan Cox wrote:
> I think people overestimate the number of large boxes badly. Several IDE
> pre-patches didn't work on highmem boxes. It took *ages* for people to
> actually notice there was a problem. The desktop world is still 128-256Mb
> and some of the crap people push is problematic even there. In the embedded
> space where there is a *ton* of money to be made by smart people a lot
> of the 2.5 choices look very questionable indeed - but not all by any
> means, we are for example close to being able to dump the block layer,
> shrink stacks down by using IRQ stacks and other good stuff.
Well, I've never seen IDE in a highmem box, and there's probably a good
reason for it. The space trimmings sound pretty interesting. IRQ stacks
in general sound good just to mitigate stackblowings due to IRQ pounding.
On Sat, Feb 22, 2003 at 06:20:19PM +0000, Alan Cox wrote:
> I'm hoping the Montavista and IBM people will swat each others bogons 8)
Sounds like a bigger win for the bigboxen, since space matters there,
but large-scale SMP efficiency probably doesn't make a difference to
embedded (though I think some 2x embedded systems are floating around).
-- wli
Sounds like low-capacity boxen meant to minimize colocation costs via
rackspace minimization.
On Sat, Feb 22, 2003 at 11:56:42AM -0800, Larry McVoy wrote:
> How much do you want to bet that more than 95% of their server revenue
> comes from 4CPU or less boxes? I wouldn't be surprised if it is more
> like 99.5%. And you can configure yourself a pretty nice quad xeon box
> for $25K. Yeah, there is some profit in there but nowhere near the huge
> margins you are counting on to make your case.
Ask their marketing dept. or something. I can maximize utility
integrals and find Nash equilibria, but can't tell you Dell's secrets.
-- wli
Smaller cleaner code is a win for everyone, and it often pays off in ways
that are not immediately obvious. For example, having your entire kernel
working set and running app fit in the L2 cache happens to be very
good news for most people.
Alan
OK, so now you've slid from talking about PCs to 2-way to 4-way ...
perhaps because your original argument was fatally flawed.
The work we're doing on scalability has big impacts on 4-way systems
as well as the high end. We're also simultaneously dramatically improving
stability for smaller SMP machines by finding and reproducing races in
5 minutes that smaller machines might hit once every year or so, and
running high-stress workloads that thrash the hell out of various
subsystems exposing bugs.
Some applications work well on clusters, which will give them cheaper
hardware, at the expense of a lot more complexity in userspace ...
depending on the scale of the system, that's a tradeoff that might go
either way.
For applications that don't work well on clusters, you have no real
choice but to go with the high-end systems. I'd like to see Linux
across the board, as would many others.
You don't believe we can make it scale without screwing up the low end,
I do believe we can do that. Time will tell ... Linus et al are not
stupid ... we're not going to be able to submit stuff that screwed up
the low-end, even if we wanted to.
M.
It's all vague handwaving because people either don't know real numbers,
or sure as heck won't post them on a public list...
Jeff
IDE on big boxes? Is that crack I smell burning? A desktop with 4 GB
is a fun toy, but bigger than *I* need, even for development purposes.
But I don't think EMC, Clariion (low end EMC), Shark, etc. have any
IDE products for my 8-proc 16 GB machine... And running pre-patches in
a production environment that might expose this would be a little
silly as well.
Probably a bad example to extrapolate large system numbers from.
gerrit
At least the SGI Altix does have an IDE/ATAPI CDROM drive :)
oh, come on. the issue is whether memory is fast and flat.
most "scalability" efforts are mainly trying to code around the fact
that any ccNUMA (and most 4-ways) is going to be slow/bumpy.
it is reasonable to worry that optimizations for imbalanced machines
will hurt "normal" ones. is it worth hurting uni by 5% to give
a 50% speedup to IBM's 32-way? I think not, simply because
low-end machines are more important to Linux.
the best way to kill Linux is to turn it into an OS best suited
for $6+-digit machines.
> For applications that don't work well on clusters, you have no real
ccNUMA worst-case latencies are not much different from decent
cluster (message-passing) latencies. getting an app to work on a cluster
is a matter of programming will.
regards, mark hahn.
PAE is a relatively minor insult compared to the FPU, the 50,000 psi
register pressure, variable-length instruction encoding with extremely
difficult to optimize for instruction decoder trickiness, the nauseating
bastardization of segmentation, the microscopic caches and TLB's, the
lack of TLB context tags, frankly bizarre and just-barely-fixable gate
nonsense, the interrupt controller, and ISA DMA.
I've got no idea why this particular system-level ugliness which is
nothing more than a routine pitstop in any bring your own barfbag
reading session of x86 manuals fascinates you so much.
At any rate, if systems (or any other) programming difficulties were
any concern at all, x86 wouldn't be used at all.
On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
> Enterprise customers will see this as a signal to continue building
> around ia32 for the next few years, thoroughly damaging 64-bit
> technology sales and development. I bet even IA64 suffers...
> at Intel's own hands. Rumors of a "Pentium64" at Intel are constantly
> floating around The Register and various rumor web sites, but Intel
> is gonna miss that huge profit opportunity too by trying to hack the
> ia32 ISA to scale up to big iron -- where it doesn't belong.
What power do you suppose we have to resist any of this? Intel, the
800lb gorilla, shoves what it wants where it wants to shove it, and
all the "exit only" signs in the world attached to our backsides do
absolutely nothing to deter it whatsoever.
On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
> Being cynical, one might guess that Intel will treat IA64 as a loss
> leader until the other 64-bit competition dies, keeping ia32 at the
> top end of the market via silly PAE/PSE hacks. When the existing
> 64-bit competition disappears, five years down the road, compilers
> will have matured sufficiently to make using IA64 boxes feasible.
Sounds relatively natural. I don't have a good notion of the legality
boundaries wrt. to antitrust, but I'd assume they would otherwise do
whatever it takes to either defeat or wipe out competitors.
On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
> If you really want to scale, just go to 64-bits, darn it. Don't keep
> hacking ia32 ISA -- leave it alone, it's fine as it is, and will live
> a nice long life as the future's preferred embedded platform.
Take this up with Intel. The rest of us are at their mercy.
Good luck finding anyone there to listen to it, you'll need it.
On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
> 64-bit. alpha is old tech, and dead. *sniff* sparc64 is mostly
> old tech, and mostly dead. IA64 isn't, yet. x86-64 is _nice_ tech,
> but who knows if AMD will survive competition with Intel. PPC64 is
> the wild card in all this. I hope it succeeds.
Alpha is old, dead, and kicking most other cpus' asses from the grave.
I always did like DEC hardware. =(
I'm not sure what's so nice about x86-64; another opcode-prefix
controlled extension atop the festering pile of existing x86 crud
sounds every bit as bad as any other attempt to prolong x86. Some of
the system-device-level cleanups like the HPET look nice, though.
This success/failure stuff sounds a lot like economics, which is
pretty much even further out of our control than the weather or the
government. What prompted this bit?
-- wli
Not even close, by several orders of magnitude.
-- wli
Linux has a key feature that most other OSes lack: it can easily be
recompiled, by anyone, for a particular architecture. So there is no
particular reason why optimizing for a high-end system has to kill
performance on uni-processor machines.
For instance, don't locks simply get compiled away to nothing on
uni-processor machines?
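They essentially do. A simplified sketch of the idea, assuming the usual CONFIG_SMP switch (this is not the literal <linux/spinlock.h> contents, and the SMP-side helpers are placeholders):

/* Simplified illustration -- the real header has debug and preempt
 * variants, but the UP case collapses essentially like this. */
#ifdef CONFIG_SMP

typedef struct { volatile unsigned int lock; } spinlock_t;
#define spin_lock(x)    arch_spin_lock(x)       /* atomic test-and-set loop */
#define spin_unlock(x)  arch_spin_unlock(x)     /* release store */

#else   /* !CONFIG_SMP */

typedef struct { } spinlock_t;                  /* zero-size on UP */
#define spin_lock(x)    do { (void)(x); } while (0)  /* compiles to nothing */
#define spin_unlock(x)  do { (void)(x); } while (0)

#endif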
--
Ben Greear <gre...@candelatech.com> <Ben_Greear AT excite.com>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear
Scalability is not just NUMA machines by any stretch of the imagination.
It's 2x, 4x, 8x SMP as well.
> it is reasonable to worry that optimizations for imbalanced machines
> will hurt "normal" ones. is it worth hurting uni by 5% to give
> a 50% speedup to IBM's 32-way? I think not, simply because
> low-end machines are more important to Linux.
We would never try to propose such a change, and never have.
Name a scalability change that's hurt the performance of UP by 5%.
There isn't one.
> ccNUMA worst-case latencies are not much different from decent
> cluster (message-passing) latencies. getting an app to work on a cluster
> is a matter of programming will.
It's a matter of repeatedly reimplementing a bunch of stuff in userspace,
instead of doing things in kernel space once, properly, with all the
machine specific knowledge that's needed. It's *so* much easier to
program over a single OS image.
M.
Nice attempt at deflection but it won't work. Your position is that
there is no money in PCs, only in big iron. Last I checked, "big iron"
doesn't include $25K 4 way machines, now does it? You claimed that
Dell was making the majority of their profits from servers. To refresh
your memory: "I bet they still make more money on servers than desktops
and notebooks combined". Are you still claiming that? If so, please
provide some data to back it up because, as Mark and others have pointed
out, the bulk of their servers are headless desktop machines in tower
or rackmount cases. I fail to see how there are better margins on the
same hardware in a rackmount box for $800 when the desktop costs $750.
Those rack mount power supplies and cases are not as cheap as the desktop
ones, so I see no difference in the margins.
Let's get back to your position. You want to shovel stuff in the kernel
for the benefit of the 32 way / 64 way etc boxes. I don't see that as
wise. You could prove me wrong. Here's how you do it: go get oprofile
or whatever that tool is which lets you run apps and count cache misses.
Start including before/after runs of each microbench in lmbench and
some time sharing loads with and without your changes. When you can do
that and you don't add any more bus traffic, you're a genius and
I'll shut up.
But that's a false promise because by definition, fine grained threading
adds more bus traffic. It's kind of hard to not have that happen, the
caches have to stay coherent somehow.
> Some applications work well on clusters, which will give them cheaper
> hardware, at the expense of a lot more complexity in userspace ...
> depending on the scale of the system, that's a tradeoff that might go
> either way.
Tell it to Google. That's probably one of the largest applications in
the world; I was the 4th engineer there, and I didn't think that the
cluster added complexity at all. On the contrary, it made things go
one hell of a lot faster.
> You don't believe we can make it scale without screwing up the low end,
> I do believe we can do that.
I'd like a little more than "I think I can, I think I can, I think I can".
The people who are saying "no you can't, no you can't, no you can't" have
seen this sort of work done before and there is no data which shows that
it is possible and all sorts of data which shows that it is not.
Show me one OS which scales to 32 CPUs on an I/O load and run lmbench
on a single CPU. Then take that same CPU and stuff it into a uniprocessor
motherboard and run the same benchmarks under Linux. The Linux one
will blow away the multithreaded one. Come on, prove me wrong, show
me the data.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
This is *exactly* the reasoning that every OS marketing weenie has used
for the last 20 years to justify their "feature" of the week.
The road to slow bloated code is paved one cache miss at a time. You
may quote me on that. In fact, print it out and put it above your
monitor and look at it every day. One cache miss at a time. How much
does one cache miss add to any benchmark? .001%? Less.
But your pet features didn't slow the system down. Nope, they just made
the cache smaller, which you didn't notice because whatever artificial
benchmark you ran didn't happen to need the whole cache.
You need to understand that system resources belong to the user. Not the
kernel. The goal is to have all of the kernel code running under any
load be less than 1% of the CPU. Your 5% number up there would pretty
much double the amount of time we spend in the kernel for most workloads.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
I could ask the SGI Eagan folks to do that with an Altix and an IA64
Whitebox - oh wait, both OSes would be Linux...
Err, I think you're wrong. It's been a long time since I looked, but I'm
pretty sure myrinet had single digit microseconds. Yup, google rocks,
7.6 usecs, user to user. Last I checked, Sequent's worst case was around
there, right?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
On your part or mine? seemingly yours.
> Your position is that
> there is no money in PC's only in big iron. Last I checked, "big iron"
> doesn't include $25K 4 way machines, now does it?
I would call 4x a "big machine" which is what I originally said.
> You claimed that
> Dell was making the majority of their profits from servers.
I think that's probably true (nobody can be certain, as we don't have the
numbers).
> To refresh
> your memory: "I bet they still make more money on servers than desktops
> and notebooks combined". Are you still claiming that?
Yup.
> If so, please
> provide some data to back it up because, as Mark and others have pointed
> out, the bulk of their servers are headless desktop machines in tower
> or rackmount cases.
So what? they're still servers. I can no more provide data to back it up
than you can to contradict it, because they don't release those figures.
Note my sentence began "I bet", not "I have cast iron evidence".
> Let's get back to your position. You want to shovel stuff in the kernel
> for the benefit of the 32 way / 64 way etc boxes.
Actually, I'm focussed on 16-way at the moment, and have never run on,
or published numbers for anything higher. If you need to exaggerate
to make your point, then go ahead, but it's pretty transparent.
> I don't see that as wise. You could prove me wrong.
> Here's how you do it: go get oprofile
> or whatever that tool is which lets you run apps and count cache misses.
> Start including before/after runs of each microbench in lmbench and
> some time sharing loads with and without your changes. When you can do
> that and you don't add any more bus traffic, you're a genius and
> I'll shut up.
I don't feel the need to do that to prove my point, but if you feel the
need to do it to prove yours, go ahead.
> But that's a false promise because by definition, fine grained threading
> adds more bus traffic. It's kind of hard to not have that happen, the
> caches have to stay coherent somehow.
Adding more bus traffic is fine if you increase throughput. Focussing
on just one tiny aspect of performance is ludicrous. Look at the big
picture. Run some non-micro benchmarks. Analyse the results. Compare
2.4 vs 2.5 (or any set of patches I've put into the kernel of your choice)
on UP, 2P, or whatever you care about.
You seem to think the maintainers are morons that we can just slide crap
straight by ... give them a little more credit than that.
> Tell it to Google. That's probably one of the largest applications in
> the world; I was the 4th engineer there, and I didn't think that the
> cluster added complexity at all. On the contrary, it made things go
> one hell of a lot faster.
As I've explained to you many times before, it depends on the system.
Some things split easily, some don't.
>> You don't believe we can make it scale without screwing up the low end,
>> I do believe we can do that.
>
> I'd like a little more than "I think I can, I think I can, I think I can".
> The people who are saying "no you can't, no you can't, no you can't" have
> seen this sort of work done before and there is no data which shows that
> it is possible and all sorts of data which shows that it is not.
The only data that's relevant is what we've done to Linux. If you want
to run the numbers, and show some useful metric on a semi-realistic
benchmark, I'd love to see them.
> Show me one OS which scales to 32 CPUs on an I/O load and run lmbench
> on a single CPU. Then take that same CPU and stuff it into a uniprocessor
> motherboard and run the same benchmarks on under Linux. The Linux one
> will blow away the multi threaded one.
Nobody has ever really focussed before on an OS that scales across the
board from UP to big iron ... a closed development system is bad at
resolving that sort of thing. The real interesting comparison is UP
or 2x SMP on Linux with and without the scalability changes that have
made it into the tree.
> Come on, prove me wrong, show me the data.
I don't have to *prove* you wrong. I'm happy in my own personal knowledge
that you're wrong, and things seem to be going along just fine, thanks.
If you want to change the attitude of the maintainers, I suggest you
generate the data yourself.
M.
Fine, stick 'em all together. I bet it's either an improvement or
doesn't even register on the scale. Knock yourself out.
M.
Sequent hardware is very old. Go time a Regatta.
M.
the only public info I've seen is "round-trip in as little as 40ns",
which is too vague to be useful. and sounds WAY optimistic - perhaps
that's just between two CPUs in a single brick. remember that
LMBench shows memory latencies of O(100ns) for even fast uniprocessors.
Oh, it's definitely different hardware. Maybe the 16550-related portion
of the ASIC is the same :) but just do an lspci to see huge differences in
motherboard chipsets, on-board parts, more complicated BIOS, remote
management bells and whistles, etc. Even the low-end rackmounts.
But the better margins come simply from the mentality, IMO. Desktops
just aren't "as important" to a business compared to servers, so IT
shops are willing to spend more money to not only get better hardware,
but also the support services that accompany it. Selling servers
to enterprise data centers means bigger, more concentrated cash pool.
Jeff
You are going to drag 1994 technology into this to compare against
something in 2003? Hmm. You might win on that comparison. But yeah,
Sequent way back then was in that ballpark. World has moved forwards
since then...
gerrit
> > Ben said none of the distros are supporting these large
> > systems right now. Martin said UL is already starting to support
> > them.
>
> Ben is right. I think IBM and the other big iron companies would be
> far better served looking at what they have done with running multiple
> instances of Linux on one big machine, like the 390 work. Figure out
> how to use that model to scale up. There is simply not a big enough
> market to justify shoveling lots of scaling stuff in for huge machines
> that only a handful of people can afford. That's the same path which
> has sunk all the workstation companies, they all have bloated OS's and
> Linux runs circles around them.
Larry, it isn't that Linux isn't being scaled in the way you suggest.
But for the people who really care about scalability, having a single
system image is not the most important thing, so making it look like
one system is secondary.
Linux clusters are currently among the top 5 supercomputers in the
world. And there the question is how you make 1200 machines look
like one. And how you handle the reliability issues. When MTBF
becomes a predictor for how many times a week someone needs to replace
hardware, the problem is very different from a simple SMP.
And there seems to be a fairly substantial market for huge machines,
for people who need high performance. All kinds of problems
require enormous amounts of data crunching.
So far the low hanging fruit on large clusters is still with making
the hardware and the systems actually work. But increasingly having
a single high performance distributed filesystem is becoming
important.
But look at projects like bproc, mosix, and lustre. Not the best
things in the world but the work is getting done. Scalability is
easy. The hard part is making it look like one machine when you are
done.
Eric
> LSE Con Call Minutes from Feb21
>
> Minutes compiled by Hanna Linder han...@us.ibm.com, please post
> corrections to lse-...@lists.sf.net.
>
> Object Based Reverse Mapping:
> (Dave McCracken, Ben LaHaise, Rik van Riel, Martin Bligh, Gerrit Huizenga)
>
> Ben said none of the users have been complaining about
> performance with the existing rmap. Martin disagreed and said Linus,
> Andrew Morton and himself have all agreed there is a problem.
> One of the problems Martin is already hitting on high cpu machines with
> large memory is the space consumption by all the pte-chains filling up
> memory and killing the machine. There is also a performance impact of
> maintaining the chains.
Note: rmap chains can be restricted to an arbitrary length, or an
arbitrary total count, trivially. All you have to do is impose a fixed
limit on the number of processes that can map a page simultaneously.
The selection of which chain entry to unmap can be a bit tricky but is
relatively straightforward. Why doesn't someone who is seeing
this just hack this up?
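One possible shape of that hack, purely as a sketch: RMAP_CHAIN_MAX, rmap_chain_length(), and unmap_one_victim_pte() are invented names, the prototypes are schematic rather than the exact 2.5 ones, and the victim selection is the tricky part mentioned above.

#define RMAP_CHAIN_MAX 64       /* arbitrary per-page cap, for illustration */

/* Wrapper around the normal pte-chain insertion: if the chain is already
 * at the cap, unmap one existing mapping first.  The evicted process just
 * takes a minor fault and re-maps the page later; the page stays resident. */
static void add_rmap_with_cap(struct page *page, pte_t *ptep)
{
        while (rmap_chain_length(page) >= RMAP_CHAIN_MAX)
                unmap_one_victim_pte(page);     /* pick and drop one mapping */

        page_add_rmap(page, ptep);              /* usual chain insertion */
}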
One phrase ... "price:performance ratio". That's all it's about.
The only thing that will kill 32-bit big iron is the availability of
cheap 64 bit chips. It's a free-market economy.
It's ugly to program, but it's cheap, and it works.
M.
What's nice about x86-64 is that it runs existing 32 bit apps fast and
doesn't suffer from the blisteringly small caches that were part of your
rant. Plus, x86-64 binaries are not horrifically bloated like ia64.
Not to mention that the amount of reengineering in compilers like
gcc required to get decent performance out of it is actually sane.
> sounds every bit as bad any other attempt to prolong x86. Some of
> the system device -level cleanups like the HPET look nice, though.
HPET is part of one of the PCYY specs and even available on 32 bit x86;
there are just not that many bug-free implementations yet. Since x86-64 made
it part of the base platform and is testing it from launch, the implementations
actually have a chance of being debugged in the mass market versions.
-ben
--
Don't email: <a href=mailto:"aa...@kvack.org">aa...@kvack.org</a>
I've run some numbers on this. Looks like it reclaims most of the
fork/exec/exit rmap overhead.
The testcase is applying and removing 64 kernel patches using my patch
management scripts. I use this because
a) It's a real workload, which someone cares about and
b) It's about as forky as anything is ever likely to be, without being a
stupid microbenchmark.
Testing is on the fast P4-HT, everything in pagecache.
2.4.21-pre4: 8.10 seconds
2.5.62-mm3 with objrmap: 9.95 seconds (+1.85)
2.5.62-mm3 without objrmap: 10.86 seconds (+0.91)
Current 2.5 is 2.76 seconds slower, and this patch reclaims 0.91 of those
seconds.
So who stole the remaining 1.85 seconds? Looks like pte_highmem.
Here is 2.5.62-mm3, with objrmap:
c013042c find_get_page 601 10.7321
c01333dc free_hot_cold_page 641 2.7629
c0207130 __copy_to_user_ll 687 6.6058
c011450c flush_tlb_page 725 6.4732
c0139ba0 clear_page_tables 841 2.4735
c011718c pte_alloc_one 910 6.5000
c013b56c do_anonymous_page 954 1.7667
c013b788 do_no_page 1044 1.6519
c015b59c d_lookup 1096 3.2619
c013ba00 handle_mm_fault 1098 4.6525
c0108d14 system_call 1116 25.3636
c0137240 release_pages 1828 6.4366
c013a1f4 zap_pte_range 2616 4.8806
c013f5c0 page_add_rmap 2776 8.3614
c0139eac copy_page_range 2994 3.5643
c013f70c page_remove_rmap 3132 6.2640
c013adb4 do_wp_page 6712 8.4322
c01172e0 do_page_fault 8788 7.7496
c0106ed8 poll_idle 99878 1189.0238
00000000 total 158601 0.0869
Note one second spent in pte_alloc_one().
Here is 2.4.21-pre4, with the following functions uninlined
pte_t *pte_alloc_one(struct mm_struct *mm, unsigned long address);
pte_t *pte_alloc_one_fast(struct mm_struct *mm, unsigned long address);
void pte_free_fast(pte_t *pte);
void pte_free_slow(pte_t *pte);
c0252950 atomic_dec_and_lock 36 0.4800
c0111778 flush_tlb_mm 37 0.3304
c0129c3c file_read_actor 37 0.2569
c025282c strnlen_user 43 0.5119
c012b35c generic_file_write 46 0.0283
c0114c78 schedule 48 0.0361
c0129050 unlock_page 53 0.4907
c0140974 link_path_walk 57 0.0237
c0116740 copy_mm 62 0.0852
c0130740 __free_pages_ok 62 0.0963
c0126afc handle_mm_fault 63 0.3424
c01254c0 __free_pte 67 0.8816
c0129198 __find_get_page 67 0.9853
c01309c4 rmqueue 70 0.1207
c011ae0c exit_notify 77 0.1075
c0149b34 d_lookup 81 0.2774
c0126874 do_anonymous_page 83 0.3517
c0126960 do_no_page 86 0.2087
c01117e8 flush_tlb_page 105 0.8750
c0106f54 system_call 138 2.4643
c01255c8 copy_page_range 197 0.4603
c0130ffc __free_pages 204 5.6667
c0125774 zap_page_range 262 0.3104
c0126330 do_wp_page 775 1.4904
c0113c18 do_page_fault 864 0.7030
c01052f8 poll_idle 6803 170.0750
00000000 total 11923 0.0087
Note the lack of pte_alloc_one_slow().
So we need the page table cache back.
We cannot put it in slab, because slab does not do highmem.
I believe the best way to solve this is to implement a per-cpu LIFO head
array of known-to-be-zeroed pages in the page allocator. Populate it with
free_zeroed_page(), grab pages from it with __GFP_ZEROED.
This is a simple extension to the existing hot and cold head arrays, and I
have patches, and they don't work. Something in the pagetable freeing path
seems to be putting back pages which are not fully zeroed, and I didn't get
onto debugging it.
It would be nice to get it going, because a number of architectures can
perhaps nuke their private pagetable caches.
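A minimal sketch of what such a per-cpu cache might look like; apart from free_zeroed_page() and __GFP_ZEROED, which are named above, every identifier here is invented, highmem/kmap details are ignored, and this is not the actual patch:

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/percpu.h>

#define __GFP_ZEROED    0x8000u         /* hypothetical new gfp flag */
#define ZERO_CACHE_MAX  16              /* arbitrary per-cpu depth */

struct zeroed_page_cache {
        int          count;
        struct page *pages[ZERO_CACHE_MAX];     /* LIFO of known-zero pages */
};

static DEFINE_PER_CPU(struct zeroed_page_cache, zero_cache);

/* Caller guarantees the page is already zeroed (e.g. a freed page table). */
void free_zeroed_page(struct page *page)
{
        struct zeroed_page_cache *zc = &get_cpu_var(zero_cache);

        if (zc->count < ZERO_CACHE_MAX)
                zc->pages[zc->count++] = page;  /* push onto the per-cpu LIFO */
        else
                __free_page(page);              /* cache full: normal free path */
        put_cpu_var(zero_cache);
}

/* Allocate a page that is guaranteed to be zero-filled. */
struct page *alloc_zeroed_page(unsigned int gfp_mask)
{
        struct zeroed_page_cache *zc = &get_cpu_var(zero_cache);
        struct page *page = NULL;

        if ((gfp_mask & __GFP_ZEROED) && zc->count)
                page = zc->pages[--zc->count];  /* pop: skip clear_page() */
        put_cpu_var(zero_cache);

        if (!page) {
                page = alloc_page(gfp_mask & ~__GFP_ZEROED);
                if (page)
                        clear_page(page_address(page)); /* lowmem assumed here */
        }
        return page;
}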
I shall drop the patches in next-mm/experimental and look hopefully
at Dave ;)
gerrit
Really? "Several orders of magnitude"? Show me the data.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
On Sun, Feb 23, 2003 at 12:01:43AM -0800, Larry McVoy wrote:
> Really? "Several orders of magnitude"? Show me the data.
I was assuming ethernet when I said that.
-- wli
> On Sat, 22 Feb 2003 20:17:24 EST, Benjamin LaHaise wrote:
> > On Sat, Feb 22, 2003 at 02:18:20PM -0800, William Lee Irwin III wrote:
> > > I'm not sure what's so nice about x86-64; another opcode prefix
> > > controlled extension atop the festering pile of existing x86 crud
> >
> > What's nice about x86-64 is that it runs existing 32 bit apps fast and
> > doesn't suffer from the blisteringly small caches that were part of your
> > rant. Plus, x86-64 binaries are not horrifically bloated like ia64.
> > Not to mention that the amount of reengineering in compilers like
> > gcc required to get decent performance out of it is actually sane.
>
> Four or five years ago the claim was that IA64 would solve all the large
> memory problems. Commercial viability and substantial market presence
> is still lacking. x86-64 has the same uphill battle. It has a better
> architecture for highmem and potentially better architecture for large
> systems in general (compared to IA32, not substantially better than, say,
> IA64 or PPC64). It also has at least one manufacturer looking at high
> end systems. But until those systems have some recognized market share,
> the boys with the big pockets aren't likely to make the ubiquitous.
> The whole thing about expenses to design and develop combined with the
> ROI model have more influence on their deployment than the fact that it
> is technically a useful architecture.
Gerrit, you missed the prior poster's point. IA64 has the same fundamental
problem as the Alpha, PPC, and Sparc processors: it doesn't run x86
binaries.
The 8086/8088 CPU was nothing special when it was picked to be used in the
IBM PC, but once it was picked it hit a critical mass that has meant that
compatibility with it is critical for a new CPU. The 286 and 386 CPUs were
arguably inferior to other options available at the time, but they had one
feature that absolutely trumped everything else: they could run existing
programs with no modifications faster than anything else available. With
the IA64 Intel forgot this (or decided their name value was so high that
they were immune to the issue). x86-64 takes the same approach that the 286
and 386 did and will be used by people who couldn't care less about 64-bit
stuff simply because it looks to be the fastest x86 CPU available (and if
the SMP features work as advertised it will again give a big boost to the
price/performance of SMP machines due to much cheaper MLB designs). If it
were being marketed by Intel it would be a shoo-in, but AMD does have a bit
of an uphill struggle.
David Lang
If I didn't know this mattered I wouldn't bother with the barfbags.
I just wouldn't deal with it.
-- wli
On Sat, Feb 22, 2003 at 08:17:24PM -0500, Benjamin LaHaise wrote:
> What's nice about x86-64 is that it runs existing 32 bit apps fast and
> doesn't suffer from the blisteringly small caches that were part of your
> rant. Plus, x86-64 binaries are not horrifically bloated like ia64.
> Not to mention that the amount of reengineering in compilers like
> gcc required to get decent performance out of it is actually sane.
Rant? It was just a catalogue of other things that are nasty. The
point was that PAE's not special, it's one of a very long list of
very ugly uglinesses, and my list wasn't anywhere near exhaustive.
But yes, more cache is good. Unfortunately the amount of baggage from
32-bit x86 stuff still puts a good chunk of systems programming into
the old bring your own barfbag territory.
On Sat, Feb 22, 2003 at 02:18:20PM -0800, William Lee Irwin III wrote:
>> sounds every bit as bad any other attempt to prolong x86. Some of
>> the system device -level cleanups like the HPET look nice, though.
On Sat, Feb 22, 2003 at 08:17:24PM -0500, Benjamin LaHaise wrote:
> HPET is part of one of the PCYY specs and even available on 32 bit x86,
> there are just not that many bug free implements yet. Since x86-64 made
> it part of the base platform and is testing it from launch, they actually
> have a chance at being debugged in the mass market versions.
Well, it beats the heck out of the TSC and the PIT, and x86-64 is
apparently supposed to have it "for real".
I'm not excited at all about another opcode prefix and pagetable format.
-- wli
> > On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
> >> ia32 big iron. sigh. I think that's so unfortunately in a number
> >> of ways, but the main reason, of course, is that highmem is evil :)
>
> One phrase ... "price:performance ratio". That's all it's about.
> The only thing that will kill 32-bit big iron is the availability of
> cheap 64 bit chips. It's a free-market economy.
>
> It's ugly to program, but it's cheap, and it works.
Not all heavy-duty problems demand 64 bit; some fit nicely into 32 bit.
There are, however, different 32-bit architectures into which they fit more or
less nicely. SIMD may or may not give a boost, just as 64 bit in itself
may or may not.
This is just like clustering vs. SMP: it depends on the application.
Cheers,
Magnus
> Note: rmap chains can be restricted to an arbitrary length, or an
> arbitrary total count trivially. All you have to do is allow a fixed
> limit on the number of people who can map a page simultaneously.
>
> The selection of which chain to unmap can be a bit tricky but is
> relatively straight forward. Why doesn't someone who is seeing
> this just hack this up?
I'm not sure how useful this feature would be. Also,
there are a bunch of corner cases in which you cannot
limit the number of processes mapping a page, think
about eg. mlock, nonlinear vmas and anonymous memory.
All in all I suspect that the cost of such a feature
might be higher than any benefits.
cheers,
Rik
--
Engineers don't grow up, they grow sideways.
http://www.surriel.com/ http://kernelnewbies.org/
I have a plan for that (UKVA) ... we reserve a per-process area with
kernel type protections (either at the top of user space, changing
permissions appropriately, or inside kernel space, changing per-process
vs global appropriately).
This area is permanently mapped into each process, so that there's no
kmap_atomic / tlb_flush_one overhead ... it's highmem backed still.
In order to do fork efficiently, we may need space for 2 sets of
pagetables (12Mb on PAE).
Dave McCracken had an earlier implementation of that, but we never saw
an improvement (quite possibly because the fork double-space wasn't
there) - Dave Hansen is now trying to get something working with current
kernels ... will let you know.
M.
> On Sat, 22 Feb 2003, Eric W. Biederman wrote:
>
> > Note: rmap chains can be restricted to an arbitrary length, or an
> > arbitrary total count trivially. All you have to do is allow a fixed
> > limit on the number of people who can map a page simultaneously.
> >
> > The selection of which chain to unmap can be a bit tricky but is
> > relatively straight forward. Why doesn't someone who is seeing
> > this just hack this up?
>
> I'm not sure how useful this feature would be.
The problem. There is no upper bound to how many rmap
entries there can be at one time. And the unbounded
growth can overwhelm a machine.
The goal is to provide an overall system cap on the number
of rmap entries.
> Also,
> there are a bunch of corner cases in which you cannot
> limit the number of processes mapping a page, think
> about eg. mlock, nonlinear vmas and anonymous memory.
Unless something has changed, for nonlinear vmas and anonymous
memory we have been storing enough information in the page tables to
recover the page for ages.
For mlock we want a cap on the number of pages that are locked,
so it should not be a problem. But even then we don't have to
guarantee the page is constantly in the process's page table, simply
that the mlocked page is never swapped out.
> All in all I suspect that the cost of such a feature
> might be higher than any benefits.
Cost? What Cost?
The simple implementation is to walk the page lists and unmap
the pages that are least likely to be used next.
This is not something new; we have been doing this in 2.4.x and
earlier for years. Before, it just never freed up rmap entries in
addition to preparing a page to be paged out.
Eric
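A toy model of what Eric is describing, a hard system-wide cap on
reverse-mapping entries recycled by unmapping colder pages, might look
like the sketch below. It is plain userspace C with made-up names and a
trivial reclaim policy, not kernel code:

#include <stdio.h>
#include <stdlib.h>

#define RMAP_CAP   8          /* system-wide cap on rmap entries */
#define NR_PAGES   4
#define NR_TASKS   6

struct rmap_entry {
    int task;                  /* which task maps the page */
    struct rmap_entry *next;   /* next entry on the page's chain */
};

struct page {
    int id;
    struct rmap_entry *chain;  /* pte-chain analogue */
};

static struct page pages[NR_PAGES];
static int rmap_in_use;

/* Unmap every mapping of one page, returning its entries to the pool. */
static void unmap_page(struct page *pg)
{
    while (pg->chain) {
        struct rmap_entry *e = pg->chain;
        pg->chain = e->next;
        printf("  reclaimed mapping of page %d by task %d\n", pg->id, e->task);
        free(e);
        rmap_in_use--;
    }
}

/* "Least likely to be used next" is faked here as the lowest-numbered page. */
static void reclaim_some_entries(void)
{
    for (int i = 0; i < NR_PAGES && rmap_in_use >= RMAP_CAP; i++)
        unmap_page(&pages[i]);
}

/* Record a task -> page mapping, respecting the global cap. */
static void map_page(struct page *pg, int task)
{
    if (rmap_in_use >= RMAP_CAP)
        reclaim_some_entries();

    struct rmap_entry *e = malloc(sizeof(*e));
    if (!e) {
        perror("malloc");
        exit(1);
    }
    e->task = task;
    e->next = pg->chain;
    pg->chain = e;
    rmap_in_use++;
    printf("task %d maps page %d (entries in use: %d)\n",
           task, pg->id, rmap_in_use);
}

int main(void)
{
    for (int i = 0; i < NR_PAGES; i++)
        pages[i].id = i;

    /* Every task maps every page; the cap forces periodic reclaim. */
    for (int t = 0; t < NR_TASKS; t++)
        for (int i = 0; i < NR_PAGES; i++)
            map_page(&pages[i], t);
    return 0;
}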
David.L> Gerrit, you missed the prior poster's point. IA64 has the
David.L> same fundamental problem as the Alpha, PPC, and Sparc
David.L> processors: it doesn't run x86 binaries.
This simply isn't true. Itanium and Itanium 2 have full x86 hardware
built into the chip (for better or worse ;-). The speed isn't as good
as the fastest x86 chips today, but it's faster (~300MHz P6) than the
PCs many of us are using and it certainly meets my needs better than
any other x86 "emulation" I have used in the past (which includes
FX!32 and its relatives for Alpha).
--david
Why?
The x86 is a hell of a lot nicer than the ppc32, for example. On the
x86, you get good performance and you can ignore the design mistakes (ie
segmentation) by just basically turning them off.
On the ppc32, the MMU braindamage is not something you can ignore, you
have to write your OS for it and if you turn it off (ie enable soft-fill
on the ones that support it) you now have to have separate paths in the
OS for it.
And the baroque instruction encoding on the x86 is actually a _good_
thing: it's a rather dense encoding, which means that you win on icache.
It's a bit hard to decode, but who cares? Existing chips do well at
decoding, and thanks to the icache win they tend to perform better - and
they load faster too (which is important - you can make your CPU have
big caches, but _nothing_ saves you from the cold-cache costs).
The low register count isn't an issue when you code in any high-level
language, and it has actually forced x86 implementors to do a hell of a
lot better job than the competition when it comes to memory loads and
stores - which helps in general. While the RISC people were off trying
to optimize their compilers to generate loops that used all 32 registers
efficiently, the x86 implementors instead made the chip run fast on
varied loads and used tons of register renaming hardware (and looking at
_memory_ renaming too).
IA64 made all the mistakes anybody else did, and threw out all the good
parts of the x86 because people thought those parts were ugly. They
aren't ugly, they're the "charming oddity" that makes it do well. Look
at them the right way and you realize that a lot of the grottyness is
exactly _why_ the x86 works so well (yeah, and the fact that they are
everywhere ;).
The only real major failure of the x86 is the PAE crud. Let's hope
we'll get to forget it, the same way the DOS people eventually forgot
about their memory extenders.
(Yeah, and maybe IBM will make their ppc64 chips cheap enough that they
will matter, and people can overlook the grottiness there. Right now
Intel doesn't even seem to be interested in "64-bit for the masses", and
maybe IBM will be. AMD certainly seems to be serious about the "masses"
part, which in the end is the only part that really matters).
Linus
Nobody ever seems to have solved the threading impact of UKVA's. I told
Andrea about it almost a year ago, and his reaction was "oh, duh!" and
couldn't come up with a solution either.
The thing is, you _cannot_ have a per-thread area, since all threads
share the same TLB. And if it isn't per-thread, you still need all the
locking and all the scalability stuff that the _current_ pte_highmem
code needs, since there are people with thousands of threads in the same
process.
Until somebody _addresses_ this issue with UKVA, I consider UKVA to be a
pipe-dream of people who haven't thought it through.
Linus
Linus> Look at them the right way and you realize that a lot of the
Linus> grottyness is exactly _why_ the x86 works so well (yeah, and
Linus> the fact that they are everywhere ;).
But does x86 really work so well? Itanium 2 on 0.13um performs a lot
better than P4 on 0.13um. As far as I can guess, the only reason P4
comes out on 0.13um (and 0.09um) before anything else is due to the
latter part you mention: it's where the volume is today.
--david
> > On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
> >> ia32 big iron. sigh. I think that's so unfortunate in a number
> >> of ways, but the main reason, of course, is that highmem is evil :)
>
> One phrase ... "price:performance ratio". That's all it's about.
> The only thing that will kill 32-bit big iron is the availability of
> cheap 64 bit chips. It's a free-market economy.
>
> It's ugly to program, but it's cheap, and it works.
I guess ugly to program is in the eye of the beholder. The big platforms
have always seemed much worse to me: every box feels free to
change things in arbitrary ways for no good reason, and the OS and
other low-level software must know exactly which motherboard they are
running on to work properly.
Gratuitous incompatibilities are the ugliest thing I have ever seen.
The warts a real platform accumulates because it is designed to actually
be used are much less ugly.
Eric
Care to share those impressive benchmark numbers (for macro-benchmarks)?
Would be interesting to see the difference, and where it wins.
Thanks,
M
I don't see why that's an issue - the pagetables are per-process, not
per-thread.
Yes, that was a stalling point for sticking kmap in there, which was
amongst my original plotting for it, but the stuff that's per-process
still works.
I'm not suggesting kmapping them dynamically (though it's rather like
permanent kmap), I'm suggesting making enough space so we have them all
there for each process all the time. None of this tiny little window
shifting around stuff ...
M.
gzip doesn't work because it's not unpackable from an arbitrary point. x86
in many ways is compressed, with common codes carefully bitpacked. A
horrible CISC design constraint for size has come full circle and turned
into a very nice memory/cache optimisation.
IA64 *can* run IA32 binaries, just more slowly than native IA64 code.
gerrit
They did that already ... IBM were demonstrating such a thing a couple of
years ago. Don't see it helping with icache though, as it unpacks between
memory and the processor, IIRC.
M.
I could be wrong, but I always thought that Sparc and a lot of other
architectures could mark arbitrary areas of memory (such as the
stack) as non-executable, whereas x86 only lets you have one
non-executable segment.
John.
On WHAT benchmark?
Itanium 2 doesn't hold a candle to a P4 on any real-world benchmarks.
As far as I know, the _only_ things Itanium 2 does better on is (a) FP
kernels, partly due to a huge cache and (b) big databases, entirely
because the P4 is crippled with lots of memory because Intel refuses to do
a 64-bit version (because they know it would totally kill ia-64).
Last I saw P4 was kicking ia-64 butt on specint and friends.
That's also ignoring the fact that ia-64 simply CANNOT DO the things a P4
does every single day. You can't put an ia-64 in a reasonable desktop
machine, partly because of pricing, but partly because it would just suck
so horribly at things people expect not to suck (games spring to mind).
And I further bet that using a native distribution (ie totally ignoring
the power and price and bad x86 performance issues), ia-64 will work a lot
worse for people simply because the binaries are bigger. That was quite
painful on alpha, and ia-64 is even worse - to offset the bigger binaries,
you need a faster disk subsystem etc just to not feel slower than a
bog-standard PC.
Code size matters. Price matters. Real world matters. And ia-64 at least
so far falls flat on its face on ALL of these.
> As far as I can guess, the only reason P4
> comes out on 0.13um (and 0.09um) before anything else is due to the
> latter part you mention: it's where the volume is today.
It's where all the money is ("ia-64: 5 billion dollars in the red and
still sinking") so of _course_ it's where the efforts get put.
Linus
Exactly. Which means that UKVA has all the same problems as the current
global map.
There are _NO_ differences. Any problems you have with the current global
map you would have with UKVA in threads. So I don't see what you expect to
win from UKVA.
> Yes, that was a stalling point for sticking kmap in there, which was
> amongst my original plotting for it, but the stuff that's per-process
> still works.
Exactly what _is_ "per-process"? The only thing that is per-process is
stuff that is totally local to the VM, by the linux definition.
And the rmap stuff certainly isn't "local to the VM". Yes, it is torn down
and built up by the VM, but it needs to be traversed by global code.
Linus
The x86 has that stupid "executablility is tied to a segment" thing, which
means that you cannot make things executable on a page-per-page level.
It's a mistake, but it's one that _could_ be fixed in the architecture if
it really mattered, the same way the WP bit got fixed in the i486.
I'm definitely not saying that the x86 is perfect. It clearly isn't. But a
lot of people complain about the wrong things, and a lot of people who
tried to "fix" things just made them worse by throwing out the good parts
too.
Linus
On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
> The x86 is a hell of a lot nicer than the ppc32, for example. On the
> x86, you get good performance and you can ignore the design mistakes (ie
> segmentation) by just basically turning them off.
We "basically" turn it off, but I was recently reminded it existed,
as LDT's are apparently wanted by something in userspace. There seem
to be various other unwelcome reminders floating around performance
critical paths as well.
I vaguely remember segmentation being the only way to enforce
execution permissions for mmap(), which we just don't bother doing.
On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
> On the ppc32, the MMU braindamage is not something you can ignore, you
> have to write your OS for it and if you turn it off (ie enable soft-fill
> on the ones that support it) you now have to have separate paths in the
> OS for it.
The hashtables don't bother me very much. They can relatively easily
be front-ended by radix tree pagetables anyway, and if it sucks, well,
no software in the world can save sucky hardware. Hopefully later models
fix it to be fast or disablable. I'm more bothered by x86 lacking ASN's.
On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
> And the baroque instruction encoding on the x86 is actually a _good_
> thing: it's a rather dense encoding, which means that you win on icache.
> It's a bit hard to decode, but who cares? Existing chips do well at
> decoding, and thanks to the icache win they tend to perform better - and
> they load faster too (which is important - you can make your CPU have
> big caches, but _nothing_ saves you from the cold-cache costs).
I'm not so sure; between things like cacheline-aligning branch targets and
space/time tradeoffs where smaller instructions run slower than
longer sequences of instructions, this stuff gets pretty strange. It
still comes out smaller in the end, but by a smaller-than-expected though
probably still significant margin. There's a good chunk of the
instruction set that should probably just be dumped outright, too.
On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
> The low register count isn't an issue when you code in any high-level
> language, and it has actually forced x86 implementors to do a hell of a
> lot better job than the competition when it comes to memory loads and
> stores - which helps in general. While the RISC people were off trying
> to optimize their compilers to generate loops that used all 32 registers
> efficiently, the x86 implementors instead made the chip run fast on
> varied loads and used tons of register renaming hardware (and looking at
> _memory_ renaming too).
Invariably we get stuck diving into assembly anyway. =)
This one is basically me getting irked by looking at disassemblies of
random x86 binaries and seeing vast amounts of register spilling. It's
probably not a performance issue aside from code bloat esp. given the
amount of trickery with the weird L1 cache stack magic and so on.
On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
> IA64 made all the mistakes anybody else did, and threw out all the good
> parts of the x86 because people thought those parts were ugly. They
> aren't ugly, they're the "charming oddity" that makes it do well. Look
> at them the right way and you realize that a lot of the grottyness is
> exactly _why_ the x86 works so well (yeah, and the fact that they are
> everywhere ;).
Count me as "not charmed". We've actually tripped over this stuff, and
for the most part you've been personally squashing the super low-level
bugs like the NT flag business and vsyscall segmentation oddities.
IA64 suffers from truly excessive featuritis and there are relatively
good chances some (or all) of them will be every bit as unused and
hated as segmentation if it actually survives.
On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
> The only real major failure of the x86 is the PAE crud. Let's hope
> we'll get to forget it, the same way the DOS people eventually forgot
> about their memory extenders.
We've not really been able to forget about segments or ISA DMA...
The pessimist in me has more or less already resigned me to PAE as
a fact of life.
On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
> (Yeah, and maybe IBM will make their ppc64 chips cheap enough that they
> will matter, and people can overlook the grottiness there. Right now
> Intel doesn't even seem to be interested in "64-bit for the masses", and
> maybe IBM will be. AMD certainly seems to be serious about the "masses"
> part, which in the end is the only part that really matters).
ppc64 is sane in my book (not vendor nepotism; the other "vanilla RISC"
machines get the same rating). No idea about the marketing stuff.
-- wli
Linus> Look at them the right way and you realize that a lot of the
Linus> grottyness is exactly _why_ the x86 works so well (yeah, and
Linus> the fact that they are everywhere ;).
>> But does x86 really work so well? Itanium 2 on 0.13um performs a
>> lot better than P4 on 0.13um. As far as I can guess, the only
>> reason P4 comes out on 0.13um (and 0.09um) before anything else
>> is due to the latter part you mention: it's where the volume is
>> today.
Martin> Care to share those impressive benchmark numbers (for
Martin> macro-benchmarks)? Would be interesting to see the
Martin> difference, and where it wins.
You can do it two ways: you can look at the numbers Intel has publicly
projected for Madison, or you can compare McKinley with a 0.18um Pentium 4.
--david
This is just for PTEs ... for which at the moment we have two choices:
1. Stick them in lowmem (fills up the global space too much).
2. Stick them in highmem - too much overhead doing k(un)map_atomic
as measured by both myself and Andrew.
Using UKVA for PTEs seems to be a better way to implement pte-highmem to me.
If you're walking another process's pagetables, you just kmap them as now,
but I think this will avoid most of the kmap'ing (if we have space for two
sets of pagetables so we can do a little bit of trickery at fork time).
>> Yes, that was a stalling point for sticking kmap in there, which was
>> amongst my original plotting for it, but the stuff that's per-process
>> still works.
>
> Exactly what _is_ "per-process"? The only thing that is per-process is
> stuff that is totally local to the VM, by the linux definition.
The pagetables.
> And the rmap stuff certainly isn't "local to the VM". Yes, it is torn
> down and built up by the VM, but it needs to be traversed by global code.
Sorry, subject was probably misleading ... I'm just talking about the
PTEs here, not sticking anything to do with rmap into UKVA.
Partially object-based rmap is cool for other reasons, that have little to
do with this. ;-)
M.
Another term for "UKVA for pagetables only" is "recursive pagetables",
if this helps clarify anything.
-- wli
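To make the term concrete: the trick is to point one slot of the page
directory back at the page directory page itself, after which every
pagetable entry appears in a fixed virtual window and can be located with
plain address arithmetic, no kmap needed. A small sketch for classic
two-level (non-PAE) i386 paging follows; the slot choice, names, and
sample addresses are purely illustrative, not what any particular patch
used:

#include <stdio.h>
#include <stdint.h>

#define SELF_SLOT   1023u                          /* pgd[1023] -> the pgd    */
#define PT_BASE     ((uint32_t)SELF_SLOT << 22)    /* 0xFFC00000: all PTE pages */
#define PGD_BASE    (PT_BASE | (SELF_SLOT << 12))  /* 0xFFFFF000: the pgd page  */

static uint32_t pte_vaddr(uint32_t va)  /* where the PTE mapping va shows up */
{
    return PT_BASE + (va >> 12) * sizeof(uint32_t);
}

static uint32_t pde_vaddr(uint32_t va)  /* where the PDE mapping va shows up */
{
    return PGD_BASE + (va >> 22) * sizeof(uint32_t);
}

int main(void)
{
    uint32_t samples[] = { 0x08048000u, 0xBFFFF000u, 0xC0100000u };

    for (unsigned i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
        printf("va %#010x: pte at %#010x, pde at %#010x\n",
               samples[i], pte_vaddr(samples[i]), pde_vaddr(samples[i]));
    return 0;
}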
Ummm ... I'm not exactly happy working with Intel's own projections on the
performance of their Itanium chips ... seems a little unscientific ;-)
Presumably when you said "Itanium 2 on 0.13um performs a lot better than P4
on 0.13um." you were referring to some benchmarks you have the results of?
If you can't publish them, fair enough. But if you can, I'd love to see how
it compares ... Itanium seems to be "more interesting" nowadays, though I
can't say I'm happy about the complexity of it.
M.
Linus> Last I saw P4 was kicking ia-64 butt on specint and friends.
I don't think so. According to Intel [1], the highest clock frequency
for a 0.18um part is 2GHz (both for Xeon and P4, for Xeon MP it's
1.5GHz). The highest reported SPECint for a 2GHz Xeon seems to be 701
[2]. In comparison, a 1GHz McKinley gets a SPECint of 810 [3].
--david
[1] http://www.intel.com/support/processors/xeon/corespeeds.htm
[2] http://www.specbench.org/cpu2000/results/res2002q1/cpu2000-20020128-01232.html
[3] http://www.specbench.org/cpu2000/results/res2002q3/cpu2000-20020711-01469.html
Yes, it's not the same clock speed, but if that's the clock speed they can
achieve on that process, it's equivalent. The P4 covers a LOT of sins by
ratcheting up its speed; what matters is the final capability, not the
capability/clock (if capability/clock were what mattered, the AMD chips
would have put Intel out of business and the P4 would be as common as
ia-64).
David Lang
David.L> I would call a 15% lead over the ia64 pretty substantial.
Huh? Did you misread my mail?
2 GHz Xeon: 701 SPECint
1 GHz Itanium 2: 810 SPECint
That is, Itanium 2 is 15% faster.
--david
I saw the L2/L3 compressed cache thing, and I thought "doh!", and I watched, but
I've not seen it for a long time. What happened to it?
David Lang
Got anything more real-world than SPECint type microbenchmarks?
M.
> On 22 Feb 2003 18:20:19 GMT, Alan Cox wrote:
> > I think people overestimate the number of large boxes badly. Several IDE
> > pre-patches didn't work on highmem boxes. It took *ages* for people to
> > actually notice there was a problem. The desktop world is still 128-256Mb
>
> IDE on big boxes? Is that crack I smell burning? A desktop with 4 GB
> is a fun toy, but bigger than *I* need, even for development purposes.
> But I don't think EMC, Clariion (low end EMC), Shark, etc. have any
> IDE products for my 8-proc 16 GB machine... And running pre-patches in
> a production environment that might expose this would be a little
> silly as well.
I don't disagree with most of your point; however, there certainly are
legitimate uses for big boxes with small (IDE) disks. Those which first
come to mind are all computational problems, in which a small dataset is
read from disk and then processors beat on the data. More or less common
examples are graphics transformations (original and final data
compressed), engineering calculations such as finite element analysis,
rendering (raytracing) type calculations, and data analysis (things like
setiathome or automated medical image analysis).
IDE drives are very cost effective, and low cost motherboard RAID is
certainly useful for preserving the results of large calculations on small
(relatively) datasets.
--
bill davidsen <davi...@tmr.com>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
That hardly counts as reasonably performant: the slowest mainstream chips
from Intel and AMD are clocked well over 1 GHz. At least x86-64 will
improve the performance of the 32 bit databases people have already
invested large amounts of money in, and it will do so without the need
for a massive outlay of funds for a new 64 bit license. Why accept
more than 10x the cost to migrate to ia64 when a new x86-64 will improve
the speed of existing applications, and improve scalability with the
transparent addition of a 64 bit kernel?
-ben
--
Don't email: aa...@kvack.org
> Mark Hahn wrote:
> > oh, come on. the issue is whether memory is fast and flat.
> > most "scalability" efforts are mainly trying to code around the fact
> > that any ccNUMA (and most 4-ways) is going to be slow/bumpy.
> > it is reasonable to worry that optimizations for imbalanced machines
> > will hurt "normal" ones. is it worth hurting uni by 5% to give
> > a 50% speedup to IBM's 32-way? I think not, simply because
> > low-end machines are more important to Linux.
> >
> > the best way to kill Linux is to turn it into an OS best suited
> > for $6+-digit machines.
>
> Linux has a key feature that most other OS's lack: It can (easily, and by all)
> be recompiled for a particular architecture. So, there is no particular reason why
> optimizing for a high-end system has to kill performance on uni-processor
> machines.
This is exactly correct, although building the optimal kernel for a
machine is still somewhat of an art rather than a science. You have to choose the
trade-offs carefully.
> For instance, don't locks simply get compiled away to nothing on
> uni-processor machines?
Preempt causes most of the issues of SMP with few of the benefits. There
are loads for which it's ideal, but for general use it may not be the
right feature. I ran it back when it was just a patch, but lately I'm
convinced it's for special occasions.
--
bill davidsen <davi...@tmr.com>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
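On the "locks simply get compiled away" question quoted above: on a UP
build the lock type can have no members and the lock/unlock helpers no
body, so the compiler emits nothing for them. A stripped-down sketch of
that pattern; MY_CONFIG_SMP and the my_* names are made up, and this is
not the kernel's actual spinlock code:

#include <stdio.h>

#ifdef MY_CONFIG_SMP

typedef struct { volatile int locked; } my_spinlock_t;

static inline void my_spin_lock(my_spinlock_t *l)
{
    /* Real SMP code would use the arch's atomic op; busy-wait here. */
    while (__sync_lock_test_and_set(&l->locked, 1))
        ;
}

static inline void my_spin_unlock(my_spinlock_t *l)
{
    __sync_lock_release(&l->locked);
}

#else /* UP, no preemption: the lock has no body at all */

typedef struct { } my_spinlock_t;   /* empty struct (GCC extension) */

static inline void my_spin_lock(my_spinlock_t *l)   { (void)l; }
static inline void my_spin_unlock(my_spinlock_t *l) { (void)l; }

#endif

static my_spinlock_t counter_lock;
static long counter;

int main(void)
{
    my_spin_lock(&counter_lock);
    counter++;                      /* critical section */
    my_spin_unlock(&counter_lock);
    printf("counter = %ld\n", counter);
    return 0;
}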
Note that preemption was pushed by the embedded people Larry was advocating
for, not the big-machine crowd .... ironic, eh?
M.
Linus> Last I saw P4 was kicking ia-64 butt on specint and friends.
>> I don't think so. According to Intel [1], the highest
>> clockfrequency for a 0.18um part is 2GHz (both for Xeon and P4,
>> for Xeon MP it's 1.5GHz). The highest reported SPECint for a
>> 2GHz Xeon seems to be 701 [2]. In comparison, a 1GHz McKinley
>> gets a SPECint of 810 [3].
Martin> Got anything more real-world than SPECint type
Martin> microbenchmarks?
SPECint a microbenchmark? You seem to be redefining the meaning of
the word (last time I checked, lmbench was a microbenchmark).
Ironically, Itanium 2 seems to do even better in the "real world" than
suggested by benchmarks, partly because of the large caches and memory
bandwidth and, I'm guessing, partly because of its straightforward
micro-architecture (e.g., a synchronization operation takes on the
order of 10 cycles, compared to on the order of dozens or hundreds of
cycles on the Pentium 4).
BTW: I hope I don't sound too negative on the Pentium 4/Xeon. It's
certainly an excellent performer for many things. I just want to
point out that Itanium 2 also is a good performer, probably more so
than many on this list seem to be willing to give it credit for.
--david
> > We would never try to propose such a change, and never have.
> > Name a scalability change that's hurt the performance of UP by 5%.
> > There isn't one.
>
> This is *exactly* the reasoning that every OS marketing weenie has used
> for the last 20 years to justify their "feature" of the week.
>
> The road to slow bloated code is paved one cache miss at a time. You
> may quote me on that. In fact, print it out and put it above your
> monitor and look at it every day. One cache miss at a time. How much
> does one cache miss add to any benchmark? .001%? Less.
>
> But your pet features didn't slow the system down. Nope, they just made
> the cache smaller, which you didn't notice because whatever artificial
> benchmark you ran didn't happen to need the whole cache.
Clearly this is the case: the benefit of a change must balance the
negative effects. Making the code paths longer hurts free cache; having
more of them should not. More code is not always slower code, and doesn't
always have more impact on cache use. You identify something which must be
considered, but it's not the only thing to consider. Linux should be
stable, not moribund.
> You need to understand that system resources belong to the user. Not the
> kernel. The goal is to have all of the kernel code running under any
> load be less than 1% of the CPU. Your 5% number up there would pretty
> much double the amount of time we spend in the kernel for most workloads.
Who profits? For most users a bit more system time resulting in better
disk performance would be a win, or at least not a loss. This isn't black
and white.
On Sat, 22 Feb 2003, Larry McVoy wrote:
> Let's get back to your position. You want to shovel stuff in the kernel
> for the benefit of the 32 way / 64 way etc boxes. I don't see that as
> wise. You could prove me wrong. Here's how you do it: go get oprofile
> or whatever that tool is which lets you run apps and count cache misses.
> Start including before/after runs of each microbench in lmbench and
> some time sharing loads with and without your changes. When you can do
> that and you don't add any more bus traffic, you're a genius and
> I'll shut up.
Code only costs when it's executed. Linux is somewhat heading toward a place
where a distro has a few useful configs, and then people who care about the
last bit of whatever they see as a bottleneck can build their own from
"make config." So it is possible to add features for big machines without
any impact on the builds which don't use the features. It goes without
saying that this is hard. I would guess that it results in more bugs as
well, if one path or another is "the less-traveled way."
>
> But that's a false promise because by definition, fine grained threading
> adds more bus traffic. It's kind of hard to not have that happen, the
> caches have to stay coherent somehow.
Clearly. And things which require more locking will pay some penalty for
this. But a quick scan of this list on the keyword "lockless" will show that
people are thinking about this.
I don't think developers will buy ignoring part of the market to
completely optimize for another. Linux will grow by being ubiquitous, not
by winning some battle and losing the war. It's not a niche-market OS.
--
bill davidsen <davi...@tmr.com>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
Oh, ok. We did that for alpha, and it was a good deal there (it's actually
architected for alpha). So yes, I don't mind doing it for the page tables,
and it should work fine on x86 too (it's not necessarily a very portable
approach, since it requires that the pmd- and the pte- tables look the
same, which is not always true).
So sure, go ahead with that part.
Linus
Ehh, and this is with how much cache?
Last I saw, the Itanium 2 machines came with 3MB of integrated L3 caches,
and I suspect that whatever 0.13 Itanium numbers you're looking at are
with the new 6MB caches.
So your "apples to apples" comparison isn't exactly that.
The only thing that is meaningful is "performance at the same time of
general availability". At which point the P4 beats the Itanium 2 senseless
with a 25% higher SpecInt. And last I heard, by the time Itanium 2 is up
at 2GHz, the P4 is apparently going to be at 5GHz, comfortably keeping
that 25% lead.
Linus
> >>>>> On Sun, 23 Feb 2003 14:48:48 -0800 (PST), David Lang <david...@digitalinsight.com> said:
>
> David.L> I would call a 15% lead over the ia64 pretty substantial.
>
> Huh? Did you misread my mail?
>
> 2 GHz Xeon: 701 SPECint
> 1 GHz Itanium 2: 810 SPECint
>
> That is, Itanium 2 is 15% faster.
According to pricewatch I could buy ten 2GHz Xeons for about the cost of
one Itanium 2 900MHz.
That's not even considering the cost of the motherboards I'd need to plug
those into.
-dean
I've been thinking about this recently, and it turns out that the whole
point is moot with a fixed-address vsyscall page: non-exec stacks are
trivially circumvented by using the vsyscall page as a known starting
point for the exploit. All the other tricks of changing the starting
stack offset and using randomized load addresses don't help at all,
since the exploit can merely use the vsyscall page to perform various
operations. Personally, I'm still a fan of the shared library vsyscall
trick, which would allow us to randomize its load address and defeat
this problem.
-ben
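One way to see the "known starting point" property Ben describes is to
print where the kernel parks its vsyscall/vDSO page and run the program a
few times; a constant address is exactly the kind of fixed landmark he is
worried about. The sketch assumes a reasonably modern glibc with
getauxval(); older userland would have to parse /proc/self/auxv by hand:

#include <stdio.h>
#include <sys/auxv.h>

int main(void)
{
    /* AT_SYSINFO_EHDR points at the kernel-provided syscall page. */
    unsigned long vdso = getauxval(AT_SYSINFO_EHDR);

    if (vdso)
        printf("vDSO/vsyscall page mapped at %#lx\n", vdso);
    else
        printf("no AT_SYSINFO_EHDR entry found\n");
    return 0;
}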
http://www-3.ibm.com/chips/techlib/techlib.nsf/products/CodePack
If you are thinking of this, it does look like people were not using it; I
know I'm not. It reduces memory for instructions, but that is all, and
memory, it seems, is not a problem, at least not for instructions.
It does not exist in newer CPUs from IBM; I don't know the official reason
for the removal.
If you really do mean compressed cache, I don't think anybody has done
that for real.
Eh? By that logic there's no bound to the number of vmas that can exist
at a given time. But there is a bound on the number that a single process
can force the system into using, and that limit also caps the number of
rmap entries the process can bring into existence. Virtual address space
is not free, and there are already mechanisms in place to limit it which,
given that the number of rmap entries is directly proportional to the amount
of virtual address space in use, probably need proper configuration.
> The goal is to provide an overall system cap on the number
> of rmap entries.
No, the goal is to have a stable system under a variety of workloads that
performs well. User-exploitable worst-case behaviour is a bad idea. Hybrid
solves that at the expense of added complexity.
-ben
--
Don't email: aa...@kvack.org
> If you really do mean compressed cache I don't think anybody has done
> that for real.
People are doing this *for real* -- it really depends on what you define
as compressed.
ARM Thumb is definitely a compression function for code.
x86 native instructions are compressed compared to the RISC-like micro-ops
which a processor like the Athlon, P3, or P4 actually executes. For similar
operations, an x86 would average probably 1.5 bytes to encode what a
32-bit RISC would need 4 bytes to encode.
-dean
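Taking dean's 1.5-versus-4-byte densities at face value, the icache
arithmetic is easy to play with; the 16KB cache size below is only an
example, not a claim about any particular part:

#include <stdio.h>

int main(void)
{
    /* How many instructions fit at the two code densities quoted above. */
    const double icache_bytes        = 16 * 1024;
    const double x86_bytes_per_insn  = 1.5;
    const double risc_bytes_per_insn = 4.0;

    printf("x86:  ~%.0f instructions\n", icache_bytes / x86_bytes_per_insn);
    printf("RISC: ~%.0f instructions\n", icache_bytes / risc_bytes_per_insn);
    return 0;
}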
Dean> On Sun, 23 Feb 2003, David Mosberger wrote:
>> >>>>> On Sun, 23 Feb 2003 14:48:48 -0800 (PST), David Lang <david...@digitalinsight.com> said:
David.L> I would call a 15% lead over the ia64 pretty substantial.
>> Huh? Did you misread my mail?
>> 2 GHz Xeon: 701 SPECint
>> 1 GHz Itanium 2: 810 SPECint
>> That is, Itanium 2 is 15% faster.
Dean> according to pricewatch i could buy ten 2GHz Xeons for about
Dean> the cost of one Itanium 2 900MHz.
Not if you want comparable cache-sizes [1]:
Intel Xeon MP, 2MB L3 cache: $3692
Itanium 2, 1 GHZ, 3MB L3 cache: $4226
Itanium 2, 1 GHZ, 1.5MB L3 cache: $2247
Itanium 2, 900 MHZ, 1.5MB L3 cache: $1338
Intel basically prices things by the cache size.
--david
[1]: http://www.intel.com/intel/finance/pricelist/
Now wait a minute. I thought you worked at Transmeta.
There were no development and debugging costs associated with getting
all those different kinds of gates working, and all the segmentation
checking right?
Wouldn't it have been easier to build the system, and shift the effort
where it would really do some good, if you didn't have to support
all that crap?
An extra base/bounds check doesn't take any die area? An extra exception
source doesn't complicate exception handling?
> And the baroque instruction encoding on the x86 is actually a _good_
> thing: it's a rather dense encoding, which means that you win on icache.
> It's a bit hard to decode, but who cares? Existing chips do well at
> decoding, and thanks to the icache win they tend to perform better - and
> they load faster too (which is important - you can make your CPU have
> big caches, but _nothing_ saves you from the cold-cache costs).
I *really* thought you worked at Transmeta.
Transmeta's software-decoding is an extreme example of what all modern
x86 processors are doing in their L1 caches, namely predecoding the
instructions and storing them in expanded form. This varies from
just adding boundary tags (Pentium) and instruction type (K7) through
converting them to uops and caching those (P4).
This exactly undoes any L1 cache size benefits. The win, of course, is
that you don't have as much shifting and aligning on your i-fetch path,
which all the fixed-instruction-size architectures already started with.
So your comments only apply to the L2 cache.
And for the expense of all the instruction predecoding logic between
L2 and L1, don't you think someone could build an instruction compressor
to fit more into the die-size-limited L2 cache? With the sizes cache lines
are getting to these days, you should be able to do pretty well.
It seems like 6 of one, half dozen of the other, and would save the
compiler writers a lot of pain.
> The low register count isn't an issue when you code in any high-level
> language, and it has actually forced x86 implementors to do a hell of a
> lot better job than the competition when it comes to memory loads and
> stores - which helps in general. While the RISC people were off trying
> to optimize their compilers to generate loops that used all 32 registers
> efficiently, the x86 implementors instead made the chip run fast on
> varied loads and used tons of register renaming hardware (and looking at
> _memory_ renaming too).
I don't disagree that chip designers have managed to do very well with
the x86, and there's nothing wrong with making a virtue out of a necessity,
but that doesn't make the necessity good.
I was about to raise the same point. L1 dcache access tends to be a
cycle-limiting bottleneck, and as early as the original Pentium, the
x86 had to go to a 2-access-per-cycle L1 dcache to avoid bottlenecking
with only 2 pipes!
The low register count *does* affect you when using a high-level language,
because if you have too many live variables floating around, you start
suffering. Handling these spills is why you need memory renaming.
It's true that x86 processors have had fancy architectural features
sooner than similar-performance RISCs, but I think there's a fair case
that that's because they've *needed* them. Why do the P4 and K7/K8 have
such enormous reorder buffers, able to keep around 100 instructions
in flight at a time? Because they need it to extract parallelism out
of an instruction stream serialized by a miserly register file.
They've developed some great technology to compensate for the weaknesses,
but it's sure nice to dream of an architecture with all that great
technology but with fewer initial warts. (Alpha seemed like the
best hope, but *sigh*. Still, however you apportion blame for its
demise, performance was clearly not one of its problems.)
I think the same claim applies much more powerfully to the ppc32's MMU.
It may be stupid, but it is only visible from inside the kernel, and
a fairly small piece of the kernel at that.
It could be scrapped and replaced with something better without any
effect on existing user-level code at all.
Do you think you can replace the x86's register problems as easily?
> The only real major failure of the x86 is the PAE crud.
So you think AMD extended the register file just for fun?
Hell, the "PAE crud" is the *same* problem as the tiny register
file. Insufficient virtual address space leading to physical > virtual
kludges.
And, as you've noticed, there are limits to the physical/virtual
ratio above which it gets really painful. And the 64G:4G ratio of PAE
is mirrored in the 128:8 ratio of P4 integer registers.
I wish the original Intel designers could have left a "no heroic measures"
living will, because that design is on more life support than Darth Vader.
On Sun, 23 Feb 2003, David Mosberger wrote:
> >>>>> On Sun, 23 Feb 2003 17:06:29 -0800 (PST), dean gaudet <dean-list-l...@arctic.org> said:
>
> Dean> On Sun, 23 Feb 2003, David Mosberger wrote:
> >> >>>>> On Sun, 23 Feb 2003 14:48:48 -0800 (PST), David Lang <david...@digitalinsight.com> said:
>
> David.L> I would call a 15% lead over the ia64 pretty substantial.
>
> >> Huh? Did you misread my mail?
>
> >> 2 GHz Xeon: 701 SPECint
> >> 1 GHz Itanium 2: 810 SPECint
>
> >> That is, Itanium 2 is 15% faster.
>
> Dean> according to pricewatch i could buy ten 2GHz Xeons for about
> Dean> the cost of one Itanium 2 900MHz.
>
> Not if you want comparable cache-sizes [1]:
Somehow I doubt you're quoting Xeon numbers w/2MB of cache above. In
fact, here's a 701 SPECint with only 512KB of cache @ 2GHz:
http://www.spec.org/osg/cpu2000/results/res2002q1/cpu2000-20020128-01232.html
My point was that if you had comparable die sizes the 15% "advantage"
would disappear. There's a hell of a lot which could be done with the
approximately double die size that the Itanium 2 has compared to any of
the commodity x86 parts. But then the cost per part would be
correspondingly higher... which is exactly what is shown in the Intel cost
numbers.
A fairer comparison would be your Itanium 2 number with this:
http://www.spec.org/osg/cpu2000/results/res2002q4/cpu2000-20021021-01742.html
2MB L2 Xeon @ 2GHz, scores 842.
Is this the Itanium 2 number you're quoting us?
http://www.spec.org/osg/cpu2000/results/res2002q3/cpu2000-20020711-01469.html
'cause that's with 3MB L3.