
Best CPU platform(s) for FPGA synthesis


jjoh...@cs.ucf.edu

Jul 26, 2007, 6:19:04 PM
OK, the questions apply primarily to FPGA synthesis (Altera Quartus
fitter for StratixII and HardCopyII), but I'm interested in feedback
regarding all EDA tools in general.


Context: I'm suffering some long Quartus runtimes on their biggest
StratixII and second-biggest HardCopyII device. Boss has given me
permission to order a new desktop/workstation/server. Immediate goal
is to speed up Quartus, but other long-term value considerations will
be taken into account.


True or false?
--------------------
Logic synthesis (analyze/elaborate/map) is mostly integer operations?
Place and Route (quartus_fit) is mostly double-precision floating-point?
Static Timing Analysis (TimeQuest) is mostly double-precision floating-point?
RTL simulation is mostly integer operations?
SDF / gate-level simulation is mostly double-precision floating-point?


AMD or Intel?
-------------------
Between AMD & Intel's latest multicore CPUs,
- Which offers the best integer performance?
- Which offers the best floating-point performance?
Specific models within the AMD/Intel family?
Assume cost is no object, and each uses its highest-performing memory
interface, but disk access is (necessary evil) over a networked drive.
(Small % of total runtime anyway.)


Multi-core, multi-processor, or both? 32-bit or 64-bit? Linux vs.
Windows? >2GB of RAM?
---------------------------------------------------------------------
Is Quartus (and the others) more efficient in any one particular
environment? I prefer Linux, but the OS is now secondary to pure
runtime performance (unless it is a major contributor). Can any of
them make use of more than 2GB of RAM? More than 4GB? Useful limit on
the number of processors/cores?


Any specific box recommendations?

Thanks a gig,

jj

sh...@cadence.com

Jul 26, 2007, 7:01:22 PM
On Jul 26, 6:19 pm, jjohn...@cs.ucf.edu wrote:
>
> True or false?
> --------------------
> Logic synthesis (analyze/elaborate/map) is mostly integer operations?
Yes.

> Place and Route (quartus_fit) is mostly double-precision floating-point?

I don't know why they would use floating point if they don't have to.

> Static Timing Analysis (TimeQuest) is mostly double-precision floating-
> point?

I seriously doubt it. I don't see a need for floating point there
when delays can use scaled integers.

> RTL simulation is mostly integer operations?

Yes.

> SDF / gate-level simulation is mostly double-precision floating-point?

No, or at least not in any implementation I am familiar with. All the
delays are scaled up so that integers can be used for them.
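To make the scaled-delay idea concrete, here is a minimal sketch in C
(my illustration only -- the femtosecond unit and the types are
assumptions, not any vendor's actual representation):

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

typedef int64_t delay_fs;             /* 1 unit = 1 fs; int64 covers ~9200 s */
#define PS(x) ((delay_fs)(x) * 1000)  /* picoseconds -> femtoseconds */

int main(void)
{
    delay_fs cell = PS(123);          /* 123 ps cell delay          */
    delay_fs wire = PS(47);           /* 47 ps interconnect delay   */
    delay_fs path = cell + wire;      /* exact integer addition: no */
                                      /* FP rounding, associative   */
    printf("path delay = %" PRId64 " fs\n", path);
    return 0;
}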

In simulation (assuming something with state-of-the art performance),
the CPU operations themselves are not very important anyway. It is
not compute-bound, it is memory-access-bound. What you need is big
caches and fast access to memory for when the cache isn't big enough.
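A toy microbenchmark makes the point (my sketch, not from any
simulator): chase randomly ordered links through a working set far
bigger than the cache, and each step costs roughly a full memory
latency no matter how little arithmetic rides along.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)                   /* 16M entries: ~128 MB of links */

static size_t xrand(void)             /* small LCG; rand() may be 15-bit */
{
    static unsigned long long s = 88172645463325252ULL;
    s = s * 6364136223846793005ULL + 1442695040888963407ULL;
    return (size_t)(s >> 33);
}

int main(void)
{
    size_t *next = malloc(N * sizeof *next);
    size_t i, j, tmp, pos;

    if (!next) return 1;
    for (i = 0; i < N; i++) next[i] = i;
    for (i = N - 1; i > 0; i--) {     /* Sattolo's shuffle: one big cycle */
        j = xrand() % i;
        tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    clock_t t0 = clock();
    for (pos = 0, i = 0; i < N; i++)
        pos = next[pos];              /* serial chain of cache misses */
    clock_t t1 = clock();

    printf("%.1f ns/access (pos=%zu)\n",
           1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / N, pos);
    free(next);
    return 0;
}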


> Is Quartus (and the others) more efficient in any one particular
> environment? I prefer Linux, but the OS is now secondary to pure
> runtime performance (unless it is a major contributor). Can any of
> them make use of more than 2GB or RAM? More than 4GB?

64-bit Linux can make use of more than 4GB of RAM. But don't use 64-
bit executables unless your design is too big for 32-bit tools,
because they will run slower on the same machine.

> Useful limit on
> the number of processors/cores?

Most of these tools are not multi-threaded, so the only way you will
get a speedup is if you have multiple jobs at the same time. Event-
driven simulation in particular is not amenable to multi-threading,
despite much wishful thinking for the last few decades.

Jon Beniston

Jul 27, 2007, 10:18:06 AM

> > Static Timing Analysis (TimeQuest) is mostly double-precision floating-
> > point?
>
> I seriously doubt it. I don't see a need for floating point there
> when delays can use scaled integers.

Dynamic range?

Cheers,
Jon


Nial Stewart

Jul 27, 2007, 10:25:27 AM
I think that memory performance is the limiting factor for
FPGA synthesis and P&R.

This machine had a single core AMD 64 processor which I recently replaced with
a slightly faster dual core processor.

I ran a fairly quick FPGA build through Quartus to get a time for a
before and after comparison before I did the swap.

The before and after times were exactly the same :-(

I think the amount and speed of memory is crucial, it's probably
worth paying as much attention to that as to the processor.


Nial.


Frank Buss

Jul 27, 2007, 10:34:15 AM
Nial Stewart wrote:

> I ran a fairly quick FPGA build through Quartus to get a time for a
> before and after comparison before I did the swap.

Did you change the setting "use up to x number of CPUs" (don't remember
the exact name) somewhere in the project settings?

--
Frank Buss, f...@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de

Patrick Dubois

Jul 27, 2007, 10:56:27 AM
On Jul 26, 6:19 pm, jjohn...@cs.ucf.edu wrote:
> AMD or Intel?
> -------------------
> Between AMD & Intel's latest multicore CPUs,
> - Which offers the best integer performance?
> - Which offers the best floating-point performance?
> Specific models within the AMD/Intel family?
>
> Assume cost is no object, and each uses its highest-performing memory
> interface, but disk access is (necessary evil) over a networked drive.
> (Small % of total runtime anyway.)
>
> Multi-core, multi-processor, or both? 32-bit or 64-bit? Linux vs.
> Windows? >2GB of RAM?

If cost is no object, then go with the Intel quad-core running at 3
GHz: the QX6850. Each pair of cores shares 4 MB of L2 cache (8 MB
total), which is, according to several reports in this forum, the
single most important factor.

I would say go with 4GB of RAM, although if you're using the biggest
chips, you might need more. Keep in mind that 32-bit Windows will only
be able to use at most 3GB of those 4GB, and each application will only
be able to access at most 2GB. So you might consider 64-bit Windows or
64-bit Linux if necessary.

Patrick

Kai Harrekilde-Petersen

Jul 27, 2007, 12:17:26 PM
Jon Beniston <j...@beniston.com> writes:

Not a likely problem. Even a 32bit int would be big enough for holding
up to a ridiculous 4.3 seconds, assuming 1psec resolution.

As far as I know, everything in the simulate, synth, P&R, and STA
chain can be performed with adequate resolution using integers.

Crosstalk and inductive effects might require floating point help, but
I would not be surprised if even that could be approximated well with
fixed-point arithmetic.


Kai
--
Kai Harrekilde-Petersen <khp(at)harrekilde(dot)dk>

Eric Smith

Jul 27, 2007, 12:43:37 PM
sh...@cadence.com writes:
> 64-bit Linux can make use of more than 4GB of RAM. But don't use 64-
> bit executables unless your design is too big for 32-bit tools,
> because they will run slower on the same machine.

Although that might be true for some specific cases, in general on Linux
native 64-bit executables tend to run faster than 32-bit executables.
But I haven't benchmarked 32-bit vs. 64-bit FPGA tools.

jjoh...@cs.ucf.edu

Jul 27, 2007, 1:17:37 PM

Thanks everyone, this is really interesting, but please don't stop
posting if you have more insights to share!

FWIW, my runtimes in Quartus are dominated by P&R (quartus_fit); on
Linux, they run about 20% faster on my 2005-era 64-bit Opteron than on
my 2004-era 32-bit Xeon (both with a 32-bit build of Quartus). Another
test run of a double-precision DSP simulation (compiled C) ran
substantially slower on the Opteron, which I thought was supposed to
have better floating-point performance than Xeons of that era. Maybe
it was just a case of the gcc -O5 optimization switches being totally
tuned to Intel instead of AMD, or maybe my Quartus P&R step is
primarily dominated by integer calculations.

I originally suspected P&R might have a lot of floating-point
calculations (even prior to signal-integrity considerations) if they
were doing any kind of physical synthesis (e.g., delay calculation
based on distance and fanout); ditto for STA, because that's usually
an integral part of the P&R loops. I also suspected that if floating-
point operations (at least multiplies, add/subtract, and MACs) could
be done in a single cycle, there would be no advantage to using
integer arithmetic instead (especially if manual, or somewhat explicit
integer scaling is required).

On the other hand, in something like a router, integers give you more
exact location info w.r.t. stuff like grid coordinates than
floating-point does.
that SystemC standardized on 64-bit time to run longer simulations,
but SystemC is a different animal in that regard anyway. Nonetheless,
I also seem to recall that its implementation of time was 64-bit
integers (scaled), because the average FPU operations are really only
linear over the 53-bit mantissa part. Assuming they want linear
representation of time ticks, I can see the appeal of using 64-bit
integers in simulation.
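A two-line sanity check of that 53-bit point (demo code, not SystemC's
actual implementation):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    double   t  = 9007199254740992.0;   /* 2^53 ticks */
    uint64_t ti = 9007199254740992ULL;
    double   u  = t + 1.0;              /* store to force double rounding */

    printf("double: t + 1 == t   -> %d\n", u == t);          /* prints 1 */
    printf("uint64: ti + 1 == ti -> %d\n", (ti + 1) == ti);  /* prints 0 */
    return 0;
}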

As far as event-driven simulations are concerned, I totally understand
how hard it is to make good use of multithreading or multiprocessing,
because everything is so tightly coupled in that evaluate/update/
reschedule loop. If you were working at a much higher level
(behavioral/transaction), where the number of low-level events is
lower and the computation behind "complex" events took up a much
larger portion of the evaluate/update/reschedule loop, then multicore/
multiprocessing solutions might be more effective for simulation.
(Agree/disagree?) It seems that as you get more coarse-grained with
the simulation, that even distributed processing (multiple machines on
a network) becomes more feasible. Obviously the scheduler has one
"core" and has to reside in one CPU/memory space, but if it has less
work to do, then it can handle less frequent communication with the
event-processing CPUs in another space.

Back to Quartus in particular and Windows in general... Quartus
supports the new "number_of_cpus" or some similar variable, but only
seems to use it in small sections of quartus_fit (I think Altera is
just taking baby steps in this area).

That appears to be related to the number of processors inside one box.
If a single CPU is just hyperthreaded, the processor takes care of
instruction distribution unrelated to a variable like number_of_cpus,
right? And if there are two single-core processors in a box, obviously
it will utilize "number_of_cpus=2" as expected. Does anyone know how
that works with multi-core CPUs? E.g., if I have two quad-core CPUs in
one box, will setting "number_of_cpus=7" make optimal use of 7 cores
while leaving me one to work in a shell or window?

Does anyone know if Quartus makes better use of multiple processors in
a partitioned bottom-up flow compared to a single top-down compile
flow?

In 32-bit Windows, is that 3GB limit for everything running at one
time? i.e., is 4GB a waste on a Windows machine? Can it run multiple
2GB processes and go beyond 3 or 4GB? Or is 3GB an absolute O/S limit,
and 2GB an absolute process limit in Windows?

In 32-bit Linux, can it run 4GB per process and as many simultaneous
processes of that size as the virtual memory will support?

In going to 64-bit apps and O/S versions, should the tools run equally
fast as long as the processor is truly 64-bit?


Thanks again for all the insights and interesting discussion.


jj

Jon Beniston

Jul 27, 2007, 1:24:01 PM
On 27 Jul, 17:17, Kai Harrekilde-Petersen <k...@harrekilde.dk> wrote:
> Jon Beniston <j...@beniston.com> writes:
> >> > Static Timing Analysis (TimeQuest) is mostly double-precision floating-
> >> > point?
>
> >> I seriously doubt it. I don't see a need for floating point there
> >> when delays can use scaled integers.
>
> > Dynamic range?
>
> Not a likely problem. Even a 32bit int would be big enough for holding
> up to a ridiculous 4.3 seconds, assuming 1psec resolution.

I think you're a factor of 1000 out.

For an ASIC STA, gate delays must be specified at a much finer
resolution than 1ps.
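(To spell out the arithmetic: 2^32 ps = 4.29 x 10^9 ps ~ 4.3 ms, not
4.3 s. Covering 4.3 s at 1 ps resolution takes about 42 bits, which is
one reason 64-bit scaled integers are the comfortable choice.)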

Cheers,
Jon


Kai Harrekilde-Petersen

Jul 27, 2007, 4:47:50 PM
jjoh...@cs.ucf.edu writes:

> Thanks everyone, this is real interesting, but please don't stop
> posting if you have more insights to share!

[snip]

> In 32-bit Linux, can it run 4GB per process and as many simultaneous
> processes of that size as the virtual memory will support?

As I recall, 32-bit Linux has a limit around 3.0-3.5GB per process.
On 64-bit Linux, I have used 8+GB for a single process doing gate-level
simulations.

Kai Harrekilde-Petersen

Jul 27, 2007, 4:56:30 PM
Jon Beniston <j...@beniston.com> writes:

> On 27 Jul, 17:17, Kai Harrekilde-Petersen <k...@harrekilde.dk> wrote:
>> Jon Beniston <j...@beniston.com> writes:
>> >> > Static Timing Analysis (TimeQuest) is mostly double-precision floating-
>> >> > point?
>>
>> >> I seriously doubt it. I don't see a need for floating point there
>> >> when delays can use scaled integers.
>>
>> > Dynamic range?
>>
>> Not a likely problem. Even a 32bit int would be big enough for holding
>> up to a ridiculous 4.3 seconds, assuming 1psec resolution.
>
> I think you're a factor of 1000 out.

Duh, brain fart indeed!

> For an ASIC STA, gate delays must be specified at a much finer
> resolution than 1ps.

I don't recall seeing sub-psec resolution in the 130nm libraries I
have seen, but that doesn't imply that it cannot be so.

But I stand by my argument: the actual resolution should not matter
much, as the total clock delays and cycle times should scale pretty
much with the library resolution. Otherwise, there wouldn't be a point
in choosing such a fast technology (who in their right mind would use
a 45nm process for implementing a 32kHz RTC, unless they had to?)

Paul Uiterlinden

Jul 27, 2007, 5:02:01 PM
jjoh...@cs.ucf.edu wrote:

> In 32-bit Linux, can it run 4GB per process and as many simultaneous
> processes of that size as the virtual memory will support?

Below is what I have read about it in "Self-Service Linux®"
http://www.phptr.com/content/images/013147751X/downloads/013147751X_book.pdf
I have no experience with it.

<quote>
3.2.2.1.6 The Kernel Segment

The only remaining segment in a process' address space to discuss is the
kernel segment. The kernel segment starts at 0xc0000000 and is
inaccessible by user processes. Every process contains this segment,
which makes transferring data between the kernel and the process'
virtual memory quick and easy. The details of this segment’s contents,
however, are beyond the scope of this book.

Note:

You may have realized that this segment accounts for one quarter of the
entire address space for a process. This is called 3/1 split address
space. Losing 1GB out of 4GB isn't a big deal for the average user, but
for high-end applications such as database managers or Web servers,
this can become an issue. The real solution is to move to a 64-bit
platform where the address space is not limited to 4GB, but due to the
large amount of existing 32-bit x86 hardware, it is advantageous to
address this issue. There is a patch known as the 4G/4G patch, which
can be found at ftp.kernel.org/pub/linux/kernel/people/akpm/patches/ or
http://people.redhat.com/mingo/4g-patches. This patch moves the 1GB
kernel segment out of each process’ address space, thus providing the
entire 4GB address space to applications.
<end quote>
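A quick way to see the per-process ceiling on a given machine is a
throwaway probe like this (my sketch, not from the book; build it as a
32-bit binary, e.g. with gcc -m32):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t chunk = 64u << 20;            /* 64 MB per allocation */
    unsigned count = 0;

    /* Leak on purpose: we only want to see where the address space
       ends. Expect roughly 46-48 chunks (~3 GB) on a stock 3G/1G-split
       kernel. */
    while (malloc(chunk) != NULL)
        count++;

    printf("got %u chunks = %u MB\n", count, count * 64);
    return 0;
}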

--
Paul Uiterlinden
www.aimvalley.nl
e-mail addres: remove the not.

PeteS

Jul 28, 2007, 11:52:07 AM

The last time I checked the speed of a full FPGA build, the cache did
indeed have the single largest effect, which is hardly surprising. A
cache access is typically one internal bus cycle (not a cpu cycle) which
is an order of magnitude faster than an external memory access cycle.

Properly optimised code that makes good use of the I-cache will run
much faster than inlined code, incidentally.

Cheers

PeteS

comp.arch.fpga

Jul 28, 2007, 8:35:15 AM
On Jul 27, 7:17 pm, jjohn...@cs.ucf.edu wrote:
> Thanks everyone, this is real interesting, but please don't stop
> posting if you have more insights to share!

>


> I originally suspected P&R might have a lot of floating-point
> calculations (even prior to signal-integrity considerations) if they
> were doing any kind of physical synthesis (e.g., delay calculation
> based on distance and fanout); ditto for STA, because that's usually
> an integral part of the P&R loops. I also suspected that if floating-
> point operations (at least multiplies, add/subtract, and MACs) could
> be done in a single cycle, there would be no advantage to using
> integer arithmetic instead (especially if manual, or somewhat explicit
> integer scaling is required).
>
> On the other hand, in something like a router, you can get more exact
> location info wrt stuff like grid coordinates than you can with
> floating-point. As far as dynamic range is concerned, I seem to recall
> that SystemC standardized on 64-bit time to run longer simulations,
> but SystemC is a different animal in that regard anyway. Nonetheless,
> I also seem to recall that its implementation of time was 64-bit
> integers (scaled), because the average FPU operations are really only
> linear over the 53-bit mantissa part. Assuming they want linear
> representation of time ticks, I can see the appeal of using 64-bit
> integers in simulation.

Any operations on large netlists are completely memory and pointer
dominated. There are lots of random-access pointer indirections in
data sets that are much larger than the cache. The computations done
once you have the data do not matter at all. You need hundreds of CPU
cycles to access the delay parameters of two gates in a netlist;
summing them up can be done for free while the CPU waits on the next
load instruction.

On the other hand, if your dynamic range is needed for summing up
small values, floating point does not help at all.
1e12 + 1 = 1e12 in 32-bit floating point. For operations like that, a
32-bit integer actually has about 7 bits more dynamic range (a 24-bit
significand vs. 31 magnitude bits).
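To see the absorption concretely (demo code, nothing more):

#include <stdio.h>

int main(void)
{
    float f = 1e12f;
    float g = f + 1.0f;        /* store to force float precision */

    /* float has a 24-bit significand: at 1e12 the spacing between
       representable values is ~65536, so adding 1 changes nothing. */
    printf("1e12f + 1.0f == 1e12f -> %d\n", g == f);      /* prints 1 */

    /* A 32-bit int stays exact right up to 2^31 - 1. */
    int i = 2000000000;
    printf("int: i + 1 == i       -> %d\n", i + 1 == i);  /* prints 0 */
    return 0;
}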

> As far as event-driven simulations are concerned, I totally understand
> how hard it is to make good use of multithreading or multiprocessing,

Why? In a larger design there will always be many active processes at
each timestep. These can be distributed to individual processors. All
operations can be on shared memory because each signal has only one
driver.
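Sketched in C with OpenMP (my illustration of that scheme, not code
from any real simulator -- compile with cc -fopenmp):

#include <stdio.h>

typedef struct process process_t;
struct process {
    void (*eval)(process_t *self);    /* read inputs, compute output   */
    void (*update)(process_t *self);  /* commit output, schedule fanout */
    int in, out, out_next;            /* one driven signal per process */
};

static void inv_eval(process_t *p)   { p->out_next = !p->in; }
static void inv_update(process_t *p) { p->out = p->out_next; }

static void simulate_timestep(process_t **active, int n)
{
    int i;
    /* Evaluate phase: each process writes only its own signal's
       "next value" slot, so there are no write conflicts. */
    #pragma omp parallel for schedule(dynamic)
    for (i = 0; i < n; i++)
        active[i]->eval(active[i]);
    /* Update phase: commit and reschedule -- the central-scheduler
       part that stays serial. */
    for (i = 0; i < n; i++)
        active[i]->update(active[i]);
}

int main(void)
{
    process_t p = { inv_eval, inv_update, 1, 0, 0 };
    process_t *active[] = { &p };
    simulate_timestep(active, 1);
    printf("out = %d\n", p.out);      /* prints 0 */
    return 0;
}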

Kolja Sulimma

Ioiod

Aug 1, 2007, 12:29:06 AM

<sh...@cadence.com> wrote in message
news:1185490882.4...@b79g2000hse.googlegroups.com...

> On Jul 26, 6:19 pm, jjohn...@cs.ucf.edu wrote:
>> Is Quartus (and the others) more efficient in any one particular
>> environment? I prefer Linux, but the OS is now secondary to pure
>> runtime performance (unless it is a major contributor). Can any of
>> them make use of more than 2GB or RAM? More than 4GB?
>
> 64-bit Linux can make use of more than 4GB of RAM. But don't use 64-
> bit executables unless your design is too big for 32-bit tools,
> because they will run slower on the same machine.

Interesting -- on an AMD Athlon X2/5200+ running RHEL Linux 4 update 4
x86_64, just about all Synopsys Design Compiler jobs run FASTER in
64-bit mode than in 32-bit mode, between 5-10% faster. The penalty is
a slightly larger RAM footprint, just as you noted. The X2/5200+ is
spec'd the same as an Opteron 1218 (2.6GHz, 2x1MB L2 cache).

This trend was pretty much consistent across all our Linux EDA-tools.

On Solaris SPARC, 64-bit mode was definitely slower than 32-bit mode,
by about 10-20%. For the life of me, I can't understand why the AMD
would run 64-bit mode faster than its 32-bit mode -- but for every
other machine architecture, 64-bit mode is almost always slower.

I forgot to re-run my 32-bit vs 64-bit benchmark on Intel Core 2 Duo
machines. For 64-bit, the Intel E6850 (4MB L2 cache, 3.0GHz) ran
anywhere from 50-60% faster than the AMD X2/5200+. Don't worry, no
production machines were overclocked (for obvious official sign-off
reasons); it was just an admin's corner-cubicle experiment.

> Most of these tools are not multi-threaded, so the only way you will
> get a speedup is if you have multiple jobs at the same time. Event-
> driven simulation in particular is not amenable to multi-threading,
> despite much wishful thinking for the last few decades.

When I ran two separate (unrelated) jobs simultaneously on the AMD and
Intel machines, the AMD machine handled dual-tasking much better. AMD
only dropped 5-7% for each job. The E6600 fared a lot worse -- anywhere
from a 10-30% performance drop. (Though not as bad as the Pentium III-
and Pentium 4-based Xeons.)

I'm wondering if the E6600's unified 4MB L2 cache thrashes badly in
dual-tasking. Or maybe the better way to look at it: in single-tasking,
the 4MB L2 cache is 4X more than the AMD Opteron's 1MB cache per CPU
core.


Ioiod

Aug 1, 2007, 12:30:16 AM

"Eric Smith" <er...@brouhaha.com> wrote in message
news:m33az9z...@donnybrook.brouhaha.com...

I think that should be qualified to say 64-bit x86_64 Linux binaries
run faster than the same binaries compiled for 32-bit x86 Linux.

For other CPU architectures (MIPS, SPARC, PowerPC, etc.), the opposite
is generally true.


glen herrmannsfeldt

Aug 1, 2007, 10:40:34 PM
Ioiod wrote:

(snip)

> On Solaris SPARC, 64-bit mode was definitely slower than
> 32-bit mode, by about 10-20%.
> For the life of me, I can't understand why the AMD would
> run 64-bit mode faster than its 32-bit mode
> -- but for every other machine architecture,
> 64-bit mode is almost always slower.

It might be because more registers are available, and IA32
code is register starved.

-- glen

Paul Leventis

Aug 2, 2007, 12:08:15 AM
Hi JJ,

Here is a rather long but detailed reply to your questions courtesy of
Adrian, one of our parallel compile experts.

You were correct in guessing that quartus_fit included floating-point
operations, but as other writers here have responded, memory accesses
are easily as important in terms of runtime, if not more so. By
contrast, quartus_sta is dominated by integer operations and memory
accesses. Incidentally, this is why quartus_fit will produce a
different fit on different OS's while quartus_sta will not - integer
operations are exact across all platforms, but the different compilers
optimize floating-point operations differently between Windows and
Linux, which results in a different fit.

Quartus II's new NUM_PARALLEL_PROCESSORS is required to enable any
kind of parallel compilation. We do not offer any support for
HyperThreaded processors and actually recommend our users disable it
in the BIOS, as it can decrease memory system performance even for a
normal, non-parallel compilation. By contrast, multi-core machines
yield good results. If you have an Intel Core 2 Duo, for example,
you'd set NUM_PARALLEL_PROCESSORS to 2. If you have two dual-core
Opterons, you'd set it to 4, and so on.
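(For readers hunting for the knob: it lives in the project's .qsf as a
global assignment -- example value mine; check your version's handbook
for the exact syntax:

    set_global_assignment -name NUM_PARALLEL_PROCESSORS 4
)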

Currently, some parts of quartus_fit, quartus_tan and quartus_sta can
take advantage of parallel compilation, though the best improvement is
usually in quartus_fit. Small designs and those with easy timing and
routability constraints will typically not see much improvement, but
larger and harder-to-fit circuits (the designs that need it the most!)
can see substantial reductions. While the speedups are currently
modest and nowhere near linear with the number of processors used,
they have improved with every release since Quartus 6.1 and we plan to
continue this in future releases.

We do not currently support additional parallel features during
incremental compilation; ie, different partitions will not be mapped
and fit completely in parallel; the fitter will get as much benefit
from parallel compilation as it would without any partitions.

One gotcha with parallel compilation is related to my first point
about Quartus having lots of memory accesses. On some current systems,
the memory system can become a significant bottleneck. For example, an
Intel Core 2 Quad chip has two shared L2 caches, which enables very
fast communication between cores (1,2) and (3,4), but relatively slow
communication between (1,3) and (2,4) since those memory requests must
all share the front-side bus. In this case, setting
NUM_PARALLEL_PROCESSORS to 4 may even give a worse result than setting
it to 2 by forcing half the communication to take place over this
slower FSB. Even with only two processors in use, the OS may sometimes
schedule the processes on cores (1,3) and (2,4) unless you specify
otherwise. Solutions to this problem can be found at www.altera.com/support.
Not all platforms are affected; you'll have to try it and see.
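On Linux you can also pin a run yourself, either with taskset(1) or
with a tiny wrapper like the sketch below (my example, which assumes
cores 0 and 1 share an L2 -- check /proc/cpuinfo for your box's
numbering):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    cpu_set_t set;

    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 1;
    }
    CPU_ZERO(&set);
    CPU_SET(0, &set);                 /* cores 0 and 1: assumed to */
    CPU_SET(1, &set);                 /* share an L2 on this box   */
    if (sched_setaffinity(0, sizeof set, &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    execvp(argv[1], argv + 1);        /* e.g. ./pin quartus_fit ... */
    perror("execvp");
    return 1;
}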

At present, Quartus II supports a maximum of four processors (or
cores), so your dual quad-core configuration will mostly go unused.
However, your intuition about leaving a processor free is correct; if
you have a four-core system and set NUM_PARALLEL_PROCESSORS to 3, you
will never see Quartus take more than 75% of your computer's CPU.

As for different OS's, the 32-bit Windows version of Quartus is a
little faster than the Linux version; the differences are largely due
to the quality and settings of the optimizing C compilers we use on
these two platforms, and varies somewhat between various Quartus
executables. 64-bit versions of Quartus are slightly slower than 32-
bit versions due to the increase in working set size (memory) from 64-
bit pointers; this in turn reduces cache hits and thus slows down the
program. This behaviour is true of most 64-bit applications.

Note: You can run 32-bit Quartus in 64-bit Windows/Linux with no such
performance penalty, and gain access to 4 GB of addressable memory.
This should meet user needs for all but the largest and most
complicated of Stratix III designs. See information on memory
requirements of Quartus at http://www.altera.com/products/software/products/quartus2/memory/qts-memory.html.
Also, I've posted on this topic previously (http://tinyurl.com/36boga).

Regards,

Paul Leventis
Altera Corp.

Paul Leventis

Aug 2, 2007, 12:22:23 AM
On Jul 27, 10:34 am, Frank Buss <f...@frank-buss.de> wrote:
> Nial Stewart wrote:
> > I ran a fairly quick FPGA build through Quartus to get a time for a
> > before and after comparison before I did the swap.
>
> Did you change the setting "use up to x number of CPUs" (don't remember
> the exact name) somewhere in the project settings?

Yes, turning on multiple CPU support (NUM_PARALLEL_PROCESSORS setting)
will help :-)

It will also depend on whether this is a slow or fast compile. A toy
design will see no speed-up, since the run time will be dominated by
aspects of the compiler that are normally a small portion of run time
-- reading databases from disk, setting up data structures, etc. It
is only the key time-consuming algorithms that have been parallelized
(and only some of them at that). Gains will be the largest on large
designs with complicated timing assignments.

Jon Beniston

Aug 2, 2007, 5:22:20 AM

>
> Any specific box recommendations?
>

I'd recommend running them on MicroBlaze.. good opportunities for h/w
acceleration ;-)

Jon


jjoh...@cs.ucf.edu

Aug 2, 2007, 3:15:44 PM

Yo, Adrian! ;) and Paul and everyone else, that's some great info and
is very much appreciated.

Since quartus_fit is dominating my runtime (EP2S180 and HC230), and
quartus_fit gains the most from extra CPUs, it makes sense for me to
go at least to 4 CPUs (I currently only have dual-processor boxes,
thus the need to go shopping). Do you know if the HardCopyII fitter
also makes use of multiple processors?

When Quartus does spawn jobs off to up to 4 processors, can each one
of those spawned jobs use up to 4GB?

In the case of Quartus supporting a max of 4 processors, at the very
least an 8-processor box would allow me to run two copies of Quartus
at the same time (e.g., different designs, or different flavors of the
same design). 8 processors on 64-bit Linux w/ 16GB of RAM with 32-bit
Quartus would seem to be a well-balanced setup if most Quartus jobs
remain under 2GB, correct?

Since memory access is such a big part of the overall runtime,
obviously the faster memory buses on newer machines will help. (Good
thing, because the clock speed increase alone, from an Opteron 250 to
a newer Opteron 2218, isn't much: 2.4GHz to 2.8GHz.)

Since the databases for big chips get so large (and memory accesses
apparently so random), does a larger data cache buy you much? The L1
I&D caches are relatively small on both AMD and Intel, although
Opteron is 2x (64K Instr, 64K Data) larger than Intel's.

For the L2 cache, Intel's is 2x larger than AMD's on a per-core basis.
Since Intel shares two caches between neighboring cores (as you say
1&2 or 3&4 can share quickly, but slow from 1/3 and 2/4), whereas
Opterons have a dedicated cache per core, would Opterons see a speedup
from less contention for the cache, or a slowdown from having to go
outside the local caches in order to share data? (I guess a function
of how often the quartus_fit algorithms need to share data, right?)

If I were trying to run two Quartus jobs simultaneously on one 8-CPU
machine (with NUM_PARALLEL_PROCESSORS = 4 for each run), I would expect
competition for external memory to be huge, and thus statistically
some benefit to Intel's larger cache. And with more "stuff" cached,
that the higher clock speeds on current Intel CPUs might give the
runtime advantage to Intel. On the other hand, AMD has the Direct
Connect Architecture and HyperTransport, so...

I know you vendor guys are reluctant to publish benchmark info, but
from the currently-available, mainstream, small-server perspective
with 8 processors, I'm kind of pushed toward the following CPU
choices:

4 dual-core Opteron 2218's (2.6 GHz, 90nm process, 2MB L2 cache as 1MB dedicated per core)
4 dual-core Opteron 2220's (2.8 GHz, 90nm process, 2MB L2 cache as 1MB dedicated per core)
4 dual-core Intel 5160's (3.0 GHz, 65nm process, 1333 MHz FSB, 4MB shared L2 cache)
2 quad-core Intel X5355's (2.66 GHz, 65nm process, 1333 MHz FSB, 8MB L2 cache, shared 4MB per core pair)

Of those, is there an obvious bang for the buck advantage (weighted
more toward bang than buck) for any one of those in particular?

-------
P.S. Those QX6850's are hard to come by; Dell's overclocked XPS720's
look sweet, but my company won't spring for overclocked boxes...


Thanks again, very very much!

Wei Wang

Aug 2, 2007, 4:42:09 PM

Is there such a setting for Xilinx ISE as well?

thx, -wei

Wei Wang

Aug 2, 2007, 4:45:08 PM

Why only 3GB max of 4GB? thanks, -Wei

Wei Wang

Aug 2, 2007, 5:54:40 PM

Found similar memory recommendations for Xilinx's largest XC5VLX330
FPGA,
http://www.xilinx.com/ise/products/memory.htm#v5lx
only Linux-64 machines are supported, memory recommendation: typical
7.2GB and peak 10.6GB.

steve...@xilinx.com

Aug 2, 2007, 6:12:43 PM
"Wei Wang" <camw...@gmail.com> wrote in message
news:1186091680.6...@z24g2000prh.googlegroups.com...

> Found similar memory recommendations for Xilinx's largest XC5VLX330
> FPGA,
> http://www.xilinx.com/ise/products/memory.htm#v5lx
> only Linux-64 machines are supported, memory recommendation: typical
> 7.2GB and peak 10.6GB.

This web page needs to be updated: NT64 is also supported, but runtime
will be faster on Linux64, so that's what we recommend.

Steve


MM

Aug 2, 2007, 7:43:16 PM
Hi Steve,

Could you give us (Xilinx users) some more detailed recommendations on what
would be the best platform to run ISE/EDK tools when working on midsize to
big designs? Tell us what you are using @ Xilinx? :)

Thanks,
/Mikhail


<steve...@xilinx.com> wrote in message news:f8tksu$ca...@cnn.xilinx.com...

steve...@xilinx.com

Aug 2, 2007, 8:00:24 PM
I can give you some general recommendations. For the best place and
route runtimes, use a 64-bit Linux system. If your design is small
enough to fit into 4G of memory (LX110 or smaller), and you are not
programming devices (the 32-bit cable drivers don't work on a 64-bit
system), you can use the 32-bit executables to save memory. Otherwise,
go ahead and use the 64-bit executables. They use more memory and the
runtime is similar.

As mentioned earlier, synthesis, map, place and route do not use
multithreading, so you will not get an advantage using multiple
processors for a single design. However, ProjNav is multithreaded, so
if you are doing different tasks, other processors will be used. In
addition, upcoming software releases will use those processors.

Steve

"MM" <mb...@yahoo.com> wrote in message
news:5hf8n2F...@mid.individual.net...

Eric Smith

Aug 2, 2007, 9:00:31 PM
Steve Lass wrote:
> I can give you some general recommendations. For the best place and
> route runtimes, use a 64bit Linux system. If your design is small
> enough to fit into 4G of memory (LX110 or smaller), and you are not
> programming devices (the 32bit cable drivers don't work on a 64bit
> system), you can use the 32bit executables to save memory.
> Otherwise, go ahead and use the 64bit executables. They use more
> memory and the runtime is similar.

Note that it works just fine to install 32-bit ISE on a 64-bit Linux
system, and to install the 64-bit cable drivers.

In my experience, the open source user-space-only cable interface works
far better than the Xilinx-supplied cable drivers anyhow:

http://www.rmdir.de/~michael/xilinx/

MM

Aug 3, 2007, 2:12:57 AM
<steve...@xilinx.com> wrote in message news:f8tr6p$c9...@cnn.xilinx.com...

> I can give you some general recommendations. For the best place and route
> runtimes, use a 64-bit Linux system. If your design is small enough to fit
> into 4G of memory (LX110 or smaller), and you are not programming devices
> (the 32-bit cable drivers don't work on a 64-bit system), you can use the
> 32-bit executables to save memory. Otherwise, go ahead and use the 64-bit
> executables. They use more memory and the runtime is similar.

Is there a 64-bit version of EDK? If not, can I mix 64-bit ISE with
32-bit EDK?

Thanks,
/Mikhail


Andreas Ehliar

Aug 3, 2007, 2:01:52 AM
On 2007-08-02, Wei Wang <camw...@gmail.com> wrote:
> Why only 3GB max of 4GB? thanks, -Wei

The short answer is that the upper 1GB is reserved for the kernel.
If you want a bit more detail you can look at for example the
following article:
http://kerneltrap.org/node/2450

/Andreas

Wei Wang

Aug 3, 2007, 6:04:06 AM
> > <steve.l...@xilinx.com> wrote in message
> >news:f8tksu$ca...@cnn.xilinx.com...
> >> "Wei Wang" <camww...@gmail.com> wrote in message

> >>news:1186091680.6...@z24g2000prh.googlegroups.com...
> >>> Found similar memory recommendations for Xilinx's largest XC5VLX330
> >>> FPGA,
> >>>http://www.xilinx.com/ise/products/memory.htm#v5lx
> >>> only Linux-64 machines are supported, memory recommendation: typical
> >>> 7.2GB and peak 10.6GB.
>
> >> This web page needs to be updated: NT64 is also supported, but runtime
> >> will be faster on Linux64, so that's what we recommend.
>
> >> Steve

What I found was very interesting: it was taking me 12 hours to run
the MAP process before, but yesterday it only took ~3 hours to run
MAP, and PAR only took ~40 mins as well.

I was trying to figure out the reasons, and found in the *.map and
*.mrp files that there was always one MAP phase which took a very long
time, ~10+ hours, and that phase was always very memory hungry. I was
using Linux64 with 2GB real memory and 4GB swap, and the real 2GB was
much smaller than the required peak memory of 10.6GB. Yesterday, I was
running ISE 9.1i for the XC5VLX330 on another Linux64 machine with 11G
real memory and 8G swap, and there wasn't any MAP phase that took a
ridiculous ~10+ hours.

Can Xilinx guys shed some more light on the runtime of MAP and PAR,
wrt different memory sizes and CPU cores?

Patrick Dubois

Aug 3, 2007, 10:32:25 AM
On Aug 2, 3:15 pm, jjohn...@cs.ucf.edu wrote:

> P.S. Those QX6850's are hard to come by; Dell's overclocked XPS720's
> look sweet, but my company won't spring for overclocked boxes...

Polywell has some desktop computers with QX6850 available. Although
since you're looking at an 8-way workstation (!), QX6850 is probably
not an option. Polywell has AMD or Intel workstations with the CPUs
you're looking at as well.

For one socket, Intel clearly has the edge over AMD, I think. For
multi-socket workstations/servers, however, I'm not so sure. Benchmarks
are harder to find. I would suspect that the HyperTransport bus would
help AMD close the gap with Intel a little. Their integrated memory
controller probably helps as well in a multi-socket machine.

I searched for benchmarks for the newest 90-nm Opteron but couldn't
find any unfortunately...

Patrick


steve...@xilinx.com

Aug 3, 2007, 12:39:55 PM
"Wei Wang" <camw...@gmail.com> wrote in message
news:1186135446.3...@g4g2000hsf.googlegroups.com...

> Can Xilinx guys shed some more light on the runtime of MAP and PAR,
> wrt different memory sizes and CPU cores?
>

Even though our memory requirement table lists devices, memory is more
dependent on the design and the timing constraints. Since we can't predict
what is in your design, we just give you the typical and max numbers from
our collected test cases.

One example of a constraint change that will reduce memory: instead of
creating a bunch of individual from-to timespecs, you can create
timegroups with the endpoints, then put one timespec on that, as
sketched below.
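For instance, in UCF terms (a from-memory sketch -- double-check the
exact syntax against the Constraints Guide for your ISE version, and
the group names are made up):

# Group the endpoints once...
INST "module_a/*" TNM = "grp_src";
INST "module_b/*" TNM = "grp_dst";
# ...then one grouped timespec replaces many individual FROM:TO specs:
TIMESPEC "TS_src2dst" = FROM "grp_src" TO "grp_dst" 5 ns;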

Also, ISE 9.2i is getting an average of 27% improvement in memory
utilization.

I don't have any data regarding runtime of different CPU cores.

Steve

