
Why is SRAM faster than DRAM ???


Mark Thorson

Apr 18, 1995
I thought I knew the answer to this one, but maybe the answer has
changed over the years. I _think_ the answer is that it takes
longer to charge or discharge the capacitor in the DRAM memory cell
than for transistors to actively drive the two nodes in an SRAM
cell to their complementary states.

But it just occurred to me that the extra loading on the long
row and column wires in the DRAM array (because it has more memory
cells) is responsible for the difference in speed.

Yet another reason could be marketing. The SRAM guys know they can't
win in density so they work on speed. The DRAM guys know they can't
win in speed, so they work on density. This brings up the question,
how fast could DRAM go if it were really pushed? Could DRAM ever be
competitive with SRAM in speed? If not, how close could it come?

Del Cecchi

Apr 18, 1995
Or maybe it takes a pretty long time to sense a signal that is only a few
millivolts on the bitline, among other things. If the DRAM uses the one-device
cell, the small signal will always make it slower.

--

Del Cecchi

Mark Johnson

Apr 18, 1995
In article <eeeD78...@netcom.com> e...@netcom.com (Mark Thorson) writes:
>
> I thought I knew the answer to this one, but maybe the answer has
> changed over the years. I _think_ the answer is that it takes
> longer to charge or discharge the capacitor in the DRAM memory cell
> than for transistors to actively drive the two nodes in an SRAM
> cell to their complementary states.
>
> But it just occurred to me that the extra loading on the long
> row and column wires in the DRAM array (because it has more memory
> cells) is responsible for the difference in speed.
>
> Yet another reason could be marketing. The SRAM guys know they can't
> win in density so they work on speed. The DRAM guys know they can't
> win in speed, so they work on density. This brings up the question,
> how fast could DRAM go if it were really pushed? Could DRAM ever be
> competitive with SRAM in speed? If not, how close could it come?
>

I'd recommend paying attention to replies from people at
VLSI Technology (vlsi.com) and/or people who say they
used to work at a company called Visic. (Note the absence
of the letter "H".)

In the early 1980's Visic built, and offered for sale,
a very high speed DRAM. Their DRAM was the same density
as the biggest DRAM you could buy at the time (64 kbits),
which was 4x bigger than the biggest SRAM of the time.
Their 64k DRAM was also _faster_ (35ns) than the fastest
16k SRAM of the time (45ns).

Unfortunately, Visic's DRAM was also more expensive than
the other, slower (120ns) 64k DRAMs. And it was more
expensive than 45ns 16k SRAMs.

Visic's DRAM never caught on. People avoided it in droves.
Eventually, Visic was absorbed by VLSI Technology Inc.

WHY did Visic's fast DRAM fail? I dunno.
Maybe it had something to do with customers and their
preferences in choosing among speed, density, and price.
Maybe it was perceived as single-source and/or
non-JEDEC-standard. The folks from Visic itself,
or VLSI Technology, can probably give you a better
answer.

Anand Natrajan

Apr 19, 1995
Mark Thorson asks:
:
: Why is SRAM faster than DRAM?

I may be wrong here, but doesn't it have something to do with charge motion?
SRAM chips are typically CMOS chips, in which charge is induced on
substrates due to applied voltage. But DRAM chips have capacitors, which
require charge to move on large current paths to reach the capacitor
surfaces. The difference in current paths may be the cause of the difference
in speed.

Anand

Mark Johnson

Apr 19, 1995

ger...@il.us.swissbank.com (Gerald Gleason) writes (in comp.arch):

>> BTW, what are state of the art SRAMs using for storage cells? I know you
>> can use as many as 8 transistors, but what is the lower bound? 3? I
>> assume that you're always trading space for speed, but I'm wondering whether
>> there is a "best possible" for the requirements of the SRAM market, or
>> whether different products pick different points on the size/speed curve.

wil...@emc.com (Paul Wilson) writes (in comp.arch):

$ The last ones I knew were using 4 transistors, but I think some went
$ back to 6 - since the transistors were getting almost as small as
$ the resistors they were using to replace them and offered better
$ performance. I've never seen 8, tho', and I'd love to see it done
$ with 3. :)


Presently the highest density SRAMs available are 4Mbits and they use
3D integration: silicon on insulator. Four NMOS transistors are
built using normal process technology, then an insulator is deposited,
then a film of poly-crystalline silicon. Two really low-performance
PMOS transistors (big VT, small Gm) are built in the polycrystalline
silicon above the four NMOS devices. In essence this gives a 6-transistor
cell in the area of a 4-transistor, 0-resistor cell, at the
cost of some pretty strange, extremely nonstandard, extra mask steps.
But boy is it dense. The process is sometimes called "TFT", referring
to the 3D integrated PMOS's as "thin-film transistors". Fortunately
for manufacturers and users, these pullup transistors don't affect
performance at all. No impact on speed OR power, as long as you
can turn OFF the one that's supposed to be off. So the crummy
characteristics of the TFT pmos don't hurt the datasheet specs.

At the 1Meg level, many people use the 4-transistor, 2-resistor
(4T+2R) cell. The resistor layer is much less expensive to fabricate
than the SOI/TFT process, but it's still nonstandard --- only SRAMs
and chips that want lots of on board SRAM (like microprocessors) use
these resistors. Resistor processes are hard to control in fab;
since a resistor is always "turned on", you want its resistance
to be high (to keep RAM standby power low) but not infinity.

Then there's the 6 transistor cell. 4 NMOS, 2 PMOS. No extra mask
steps, nothing but the standard logic/gate array/ASIC/microprocessor
process. Extremely popular. At constant lithographic feature size,
the biggest of the three. Still, people love and use it like mad.
When you don't have resistors or TFT's to worry about, you have more
time to spend getting the lithographic feature size down, which makes
*everything* (not just the SRAM cell) smaller. That's what people
say who don't want the headaches of resistors or TFTs.

Still fewer transistors are possible, but generally they
sacrifice speed by doing single-ended reads and/or single-ended
writes. One of the cutest SRAM designs is the Lambda-Diode cell,
which uses only three transistors. Although originally developed
and published in Silicon FET (MOSFET _and_ JFET) technologies,
this cell is enjoying a new round of popularity in Gallium Arsenide.

John Caywood of Intel published a strange and amusing 5-T cell
SRAM in the 70's, and Ching-Lin Jiang of Mostek published an
even stranger 3-T SRAM cell in the 80's. Dig through your old
ISSCC digests and take a peep. Mostek was apparently a hotbed
of wacky SRAM cell innovation; they are responsible for the 4T+2R
cell *and* also for the supremely nonintuitive no-VDD cell which
doesn't route VDD to the memory array at all. Instead refresh
power is drawn from the bitlines.

--- Mark Johnson

Duncan Elliott

Apr 20, 1995
DRAM can be built just as fast as SRAM.

In article <eeeD78...@netcom.com>, Mark Thorson <e...@netcom.com> wrote:
>I thought I knew the answer to this one, but maybe the answer has
>changed over the years. I _think_ the answer is that it takes
>longer to charge or discharge the capacitor in the DRAM memory cell
>than for transistors to actively drive the two nodes in an SRAM
>cell to their complementary states.

Richard Foss of MOSAID likes to describe the SRAM memory
cell as `a DRAM cell with built-in refresh'. A 4
transistor (4T) SRAM cell is a 4T DRAM cell with two
pull-up resistors added. The pull-ups can alternatively be
two p-channel transistors (6T cell) which is commonly used
in ASIC processes. For high density SRAMs, the pull-ups
are two thin-film transistors built above the basic 4
transistor circuit.

Of course, 1T DRAM cells are the most common. You'll find
4T DRAM cells primarily in ASIC memory.

Riding the trade-offs among power, area, number of IC
process steps, and speed, there could potentially be a lot
of overlap between DRAM and SRAM. Of course you'd be
hard-pressed to build an SRAM cell as small as a 1T DRAM
cell.

Here are some of the design trade-offs which make most DRAMs
slower than most SRAMs (1&2 are as you mentioned):

1. longer wordlines for higher density; longer RC delays
2. longer bitlines for higher density; high bitline capacitance
3. small cell size; small cell charge to dump on bitline
4. 1T cell; single ended charge transfer
(a 4T cell has differential outputs)
5. low-ish power per active bitline; slow sense and restore (relative to SRAM)
6. transistors and back-bias optimized for low leakage; slower logic
7. dynamic cells; as little as 0.2% refresh overhead.
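
To put a rough number on 1 and 2: a wordline (or bitline) is a distributed
RC line, so its delay grows roughly as the square of the number of cells
hanging on it. A small C sketch of the Elmore delay, with invented per-cell
R and C values rather than figures from any real process:

#include <stdio.h>

/* Distributed-RC (Elmore) delay of a wordline, treated as N identical
 * R-C segments, one per memory cell hanging on the line.  The per-cell
 * values below are invented placeholders, not data for a real process. */
static double elmore_delay(int cells, double r_per_cell, double c_per_cell)
{
    double delay = 0.0;
    int i;
    /* Elmore: each segment's capacitance sees all the resistance upstream. */
    for (i = 1; i <= cells; i++)
        delay += i * r_per_cell * c_per_cell;
    return delay;      /* roughly R*C*N*(N+1)/2, i.e. quadratic in length */
}

int main(void)
{
    double r = 20.0;      /* ohms of poly per cell pitch (assumed)      */
    double c = 5e-15;     /* farads loading the line per cell (assumed) */

    printf("256-cell wordline : %6.2f ns\n", elmore_delay(256, r, c) * 1e9);
    printf("1024-cell wordline: %6.2f ns\n", elmore_delay(1024, r, c) * 1e9);
    return 0;
}

With these made-up numbers, four times the cells costs about sixteen times
the delay, which is largely why real arrays are subdivided and the poly
wordlines strapped with metal.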

The fact that a 1T DRAM cell only moves a bitline by
several 100mV while the SRAM cell can move the bitlines
rail-to-rail is not key. SRAMs sense the bitlines early
for speed, rather than waiting for the maximum differential
voltage. In fact, some SRAMs use DRAM style sense amps.
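
Similarly for 3 and 4: a 1T read is just charge sharing between the small
cell capacitor and a much larger bitline capacitance. A rough sketch in C,
with assumed (not measured) component values, lands in that same
couple-of-hundred-millivolt range:

#include <stdio.h>

/* Charge sharing when a 1T DRAM cell is read: the cell capacitor is
 * connected to a bitline precharged to Vdd/2, and the two settle to a
 * common voltage.  All component values are assumed for illustration. */
int main(void)
{
    double c_cell = 30e-15;   /* storage capacitor, ~30 fF (assumed)    */
    double c_bit  = 250e-15;  /* bitline capacitance, ~250 fF (assumed) */
    double vdd    = 3.3;      /* supply (assumed)                       */
    double v_pre  = vdd / 2;  /* bitline precharge level                */

    double v_one  = vdd;      /* ideal stored "1" (leakage ignored)     */
    double v_zero = 0.0;      /* stored "0"                             */

    /* final bitline voltage = capacitance-weighted average of the two */
    double sig1 = (c_cell * v_one  + c_bit * v_pre) / (c_cell + c_bit) - v_pre;
    double sig0 = (c_cell * v_zero + c_bit * v_pre) / (c_cell + c_bit) - v_pre;

    printf("reading a 1 moves the bitline by %+.0f mV\n", sig1 * 1e3);
    printf("reading a 0 moves the bitline by %+.0f mV\n", sig0 * 1e3);
    return 0;
}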

> This brings up the question,
>how fast could DRAM go if it were really pushed? Could DRAM ever be
>competitive with SRAM in speed? If not, how close could it come?

A 4T DRAM memory could be essentially as fast as SRAM and it
shouldn't be larger. In a processor or digital ASIC IC
process, you could expect to save area with DRAM. Since
SRAM IC processes put the pull-ups over top of the basic 4
transistors, the DRAM isn't going to be much smaller. In
that case, why not add the `built-in refresh'?

--
Duncan Elliott, Department of Electrical and Computer Engineering
du...@eecg.toronto.edu University of Toronto
http://www.eecg.toronto.edu/~dunc Toronto, Canada M5S 1A4

Charles Grosjean

Apr 20, 1995
mjoh...@netcom12.netcom.com (Mark Johnson) writes:

>John Caywood of Intel published a strange and amusing 5-T cell
>SRAM in the 70's, and Ching-Lin Jiang of Mostek published an
>even stranger 3-T SRAM cell in the 80's. Dig through your old
>ISSCC digests and take a peep. Mostek was apparently a hotbed
>of wacky SRAM cell innovation; they are responsible for the 4T+2R
>cell *and* also for the supremely nonintuitive no-VDD cell which
>doesn't route VDD to the memory array at all. Instead refresh
>power is drawn from the bitlines.

Richard Lyon published a paper in I think the 1987 VLSI Conference Proceedings
(Stanford, edited by Paul Losleben (sp?)) describing a 4T SRAM with a pair of
cross-coupled NFET's and a pair of PMOS readout transistors going to the bit-
lines. To hold state, the bit-lines are held at Vdd and the PFET's are biased
below threshold to keep the cell alive; to read out, precharge the bitlines
and then turn on the PFET transistors. A nifty cell, but only marginally
smaller than a 6T cell using "university" SCMOS design rules, depending on how
interesting the layout is.

Charles

Dirk Grunwald

Apr 21, 1995

People seem to be ignoring hybrid DRAMs, like the RAMTRON EDRAM. In
this model, you have a large DRAM bank and a smaller, less dense SRAM
bank that acts as a cache. You then have many (4K) lines bringing the
DRAM data down to the SRAM. This gives you amazing bandwidth on chip,
something that can't be duplicated in so-called 'lumped' cache models.

LEE BRIAN

Apr 22, 1995

As well as Rambus, MDRAMs, RamLink...

I think for most processor applications, latency is not much of a concern
any more. By this, I mean general UNIX-type workstations, not real-time
embedded systems. If the processors really get much faster, you can always
context switch on cache misses. Perhaps more support should be put into
the mainstream RISCs for this, i.e., an instruction to write state to
memory and flush pipeline? This instruction itself can probably be
executed while a new context is fetched. How about a separate port to
a "context RAM" which stores all context information? This can be made
of the fastest SRAMs there are out there. Of course you can
always do software/compiler based prefetching.

Perhaps the bigger concern is bandwidth. If there is no way you can
load a cache line within a few hundred (!) clock cycles, the system
starts slowing down. But then again, if you have non-blocking loads
and lockup-free caches you should be even better off.

So I guess my main idea is that given a new processor arch and
memory technology available today like Rambus/MDRAM, we do not
require a proportionally fast core memory system since we can
tolerate the latency and lower bandwidth (compared to the processor
speed) for most general purpose systems.

(sorry if I sound unintelligible, I just thought I'd get into these
discussions...)

regards,
bjl

--
Brian Jonathan Lee (aka "hojo") | "Beef satay?!?! Not beef satay again!!!!"
b...@ecf.toronto.edu | "XMen! XMen! Rescue Kitty from the caves!"
b...@eecg.toronto.edu | "Evil thy name is NETREK!"

Samuel Ng

Apr 22, 1995
I guess the answer is in the complementary and regenerative structure of an SRAM cell.
Being regenerative, it should flip state quicker than a capacitor. By taking advantage
of the complementary output, one can always design a faster differential sense-amp.
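
One way to put a number on what regeneration buys: in a latch-type
(cross-coupled) sense amp, the differential grows roughly exponentially once
the amp is fired, so the sense time scales with the logarithm of
(full swing / initial signal). A toy C model with an assumed regeneration
time constant:

#include <stdio.h>
#include <math.h>

/* Toy model of a latch-type sense amplifier: once enabled, the differential
 * grows as dV(t) = dV0 * exp(t/tau), with tau ~ C/gm.  The time constant and
 * swings below are assumed values, purely for illustration. */
static double sense_time(double dv0, double swing, double tau_ns)
{
    return tau_ns * log(swing / dv0);   /* time to regenerate to full swing */
}

int main(void)
{
    double tau = 0.5;     /* regeneration time constant, ns (assumed) */
    printf("200 mV start -> %.2f ns to 3.3 V swing\n", sense_time(0.200, 3.3, tau));
    printf(" 20 mV start -> %.2f ns to 3.3 V swing\n", sense_time(0.020, 3.3, tau));
    return 0;
}

So a ten times smaller starting signal costs a couple of extra time
constants, not ten times the sensing time, which squares with the earlier
remark that some SRAMs use DRAM-style sense amps.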

Sam


Kurt Shoens

Apr 25, 1995
b...@ecf.toronto.edu (LEE BRIAN) writes:
>I think for most processor applications, latency is not much of a concern
>any more. [...] If the processors really get much faster, you can always

>context switch on cache misses.

Context switching or thread switching on cache misses seems difficult to
justify, unless you are thinking of doing micro-threading within a single
application, in which case prefetching is probably simpler.

In full-scale context switching, there is considerable state to save (like
the register set), and even a single memory miss while saving/restoring state
means you would have been better off synchronously waiting for the original
cache miss. Also, what
if there's nothing else to context switch to? For example, if you're running
just one thing, there may be nothing else useful to do during memory reads.
--
Kurt Shoens

LEE BRIAN

Apr 27, 1995

My response was aimed at general multiuser/multitasking systems,
i.e., something a university would buy, a compute server for a
corporation, etc. I agree that it won't be fully utilized and the
throughput will be low if you don't give it enough to do.

I also agree that there will be a lot of overhead in doing this; that
is why I suggested that processor manufacturers should implement
faster context switching and perhaps pipeline the saving of state
with computation. As processor clock rates go beyond 200MHz and
memory speeds remain at 60ns, taking 8 cycles to clear the pipeline
will probably be worth it.
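
A back-of-the-envelope version of that trade-off (all of the cycle counts
below are assumptions, not measurements): switching only pays when the stall
would have been longer than the cost of getting out and back, and only when
another thread is ready to run.

#include <stdio.h>

/* Rough break-even for switching threads on a cache miss.
 * Every number here is an assumption chosen for illustration. */
int main(void)
{
    double cycle_ns    = 5.0;    /* 200 MHz clock                          */
    double switch_cost = 8.0;    /* cycles to drain the pipeline (assumed) */
    double resume_cost = 8.0;    /* cycles to refill it later (assumed)    */

    /* you also need another runnable thread, or the switch buys nothing */
    double breakeven_cycles = switch_cost + resume_cost;

    printf("switching pays once a miss stalls you for more than %.0f cycles"
           " (%.0f ns)\n", breakeven_cycles, breakeven_cycles * cycle_ns);
    printf("a raw 60 ns access is %.0f cycles; a full cache miss (precharge,\n"
           "RAS, CAS, line fill, bus turnaround) is usually several times that\n",
           60.0 / cycle_ns);
    return 0;
}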

Preston Briggs

Apr 28, 1995
>>>I think for most processor applications, latency is not much of a concern
>>>any more. [...] If the processors really get much faster, you can always
>>>context switch on cache misses.

That's been Burton Smith's argument for quite some time, except we
want to switch on every instruction (which happens to cover all the
memory references too). Agarwal, leading the Alewife project, likes
to switch on cache misses.

>>In full-scale context switching, there is considerable state to save (like
>>the register set)

>I also agree that there will be a lot of overhead in doing this; that
>is why I suggested that processor manufacturers should implement
>faster context switching and perhaps pipeline the saving of state
>with computation. As processor clock rates go beyond 200MHz and
>memory speeds remain at 60ns, taking 8 cycles to clear the pipeline
>will probably be worth it.

You don't have to save state and you don't have to clear the pipeline.
We have 128 hardware contexts per processor, each basically a full set
of registers plus status word, so switching between (up to 128)
threads is quick. Plus, since we plan to switch on every tick, the
pipeline has operations in flight from many threads at once.

Note that the 128 hard contexts need not be that expensive either.
Since any particular thread will issue instructions at a very low
rate, there's plenty of time to read the operands from the appropriate
"registers" and to write the results back. So, rather than pipelining
the context save/restore (as suggested above), we pipeline register
references.

Or, compare to a machine with register windows. Since the processor
might demand access to any register in the entire set at any time, the
entire set of registers (256 or 512 or whatever) must be implemented
in some low-latency (expensive) fashion. As usual, we don't care
about latency; we want bandwidth. Fortunately, high bandwidth is
cheaper than low latency.
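
As a toy illustration of the issue-every-tick idea (this is only a cartoon
of latency hiding with many contexts, not the actual Tera pipeline, and the
latency and memory-reference mix are invented):

#include <stdio.h>

/* Toy barrel-processor model: one instruction issues per tick, round-robin
 * over the threads that are not waiting on memory.  Every MEM_EVERYth
 * instruction of a thread is a memory reference that parks the thread for
 * MEM_LATENCY ticks.  Both constants are assumptions. */
#define MEM_LATENCY 70   /* cycles per memory reference (assumed) */
#define MEM_EVERY    3   /* one reference every 3 instructions (assumed) */

static double utilization(int threads, int ticks)
{
    int ready_at[128] = {0}, issued_insts[128] = {0};
    int t, i, issued = 0, next = 0;

    for (t = 0; t < ticks; t++) {
        for (i = 0; i < threads; i++) {            /* find a ready thread */
            int cand = (next + i) % threads;
            if (ready_at[cand] <= t) {
                issued++;
                issued_insts[cand]++;
                if (issued_insts[cand] % MEM_EVERY == 0)
                    ready_at[cand] = t + MEM_LATENCY;  /* park this thread */
                next = (cand + 1) % threads;
                break;
            }
        }
    }
    return (double)issued / ticks;
}

int main(void)
{
    int n;
    for (n = 1; n <= 128; n *= 2)
        printf("%3d threads -> pipeline busy %3.0f%% of ticks\n",
               n, 100.0 * utilization(n, 100000));
    return 0;
}

With these particular made-up numbers a few dozen ready threads already
saturate the issue slot; the extra contexts are headroom.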

Preston Briggs

Cliff Click

May 1, 1995
pre...@Tera.COM (Preston Briggs) writes:

> Or, compare to a machine with register windows. Since the processor
> might demand access to any register in the entire set at any time, the
> entire set of registers (256 or 512 or whatever) must be implemented
> in some low-latency (expensive) fashion. As usual, we don't care
> about latency; we want bandwidth. Fortunately, high bandwidth is
> cheaper than low latency.


I think actually we want low latency, but it costs too much.
(almost by definition low latency means high bandwidth, because
if I can get a reply in 1 ns, then I can get 10 replies in 10 ns
(and if I can't then I don't have low latency)).

We settle for high bandwidth, high latency (10 replies in 10 ns,
but none in 1 ns) because it's almost as good and it's cheaper.
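
The same point in Little's-law form, with numbers chosen to echo the
1 ns / 10 ns example: delivered bandwidth is outstanding requests divided by
latency, so a high-latency memory only looks high-bandwidth if you keep many
requests in flight.

#include <stdio.h>

/* Little's law applied to memory requests: throughput = in_flight / latency.
 * The numbers are assumptions echoing the 1 ns / 10 ns example above. */
int main(void)
{
    double latency_ns = 10.0;    /* time for any single reply (assumed) */
    int in_flight;

    for (in_flight = 1; in_flight <= 10; in_flight *= 10)
        printf("%2d outstanding requests -> %4.1f replies per ns ... "
               "but still none earlier than %.0f ns\n",
               in_flight, in_flight / latency_ns, latency_ns);
    return 0;
}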


Cliff
--
Cliff Click Compiler Research Scientist
Cambridge Research Office, Hewlett-Packard Laboratories
One Main Street, 10th Floor, Cambridge, MA 02142
(617) 225-4915 Work (617) 225-4930 Fax
cli...@hpl.hp.com http://bellona.cs.rice.edu/MSCP/cliff.html

Brian N. Miller

May 3, 1995
In article <3nra6c$g...@sparrow.tera.com>, pre...@Tera.COM (Preston Briggs) writes:

|We have 128 hardware contexts per processor, each basically a full set
|of registers plus status word, so switching between (up to 128)
|threads is quick.

Is CPU real-estate really that cheap? Now, or in the next decade?


|Plus, since we plan to switch on every tick, the
|pipeline has operations in flight from many threads at once.
|Note that the 128 hard contexts need not be that expensive either.

Wouldn't a cache go stir-crazy trying to serve 128 potentially
disparate memory streams? Or does this fancy machine not need a cache?

Preston Briggs

May 4, 1995
>|We have 128 hardware contexts per processor, each basically a full set
>|of registers plus status word, so switching between (up to 128)
>|threads is quick.
>
>Is CPU real-estate really that cheap? Now, or in the next decade?

Hard questions to answer...

Our processors (CPU seems a misnomer for a parallel machine) have no
data cache, so we're able to invest those transistors in hardware
contexts. In any case, each processor is spread across several chips
(recall that we're building a super, not a workstation).

It's not yet available, but hopefully soon _and_ in the next decade.

>|Plus, since we plan to switch on every tick, the
>|pipeline has operations in flight from many threads at once.
>|Note that the 128 hard contexts need not be that expensive either.
>
>Wouldn't a cache go stir-crazy trying to serve 128 potentially
>disparate memory streams? Or does this fancy machine not need a cache?

There's a multi-level instruction cache for each processor, but no
data cache. Why not? We use parallelism to cover latency to shared
memory, so it's not necessary. It also avoids the problems of
maintaining coherency, allows cheap synchronization, avoids the need
for compiler (or programmer) cache management and data layout.

Preston Briggs

( )

May 8, 1995
In article <3oblv2$k...@sparrow.tera.com>, pre...@Tera.COM (Preston Briggs) writes:
>
>
> There's a multi-level instruction cache for each processor, but no
> data cache. Why not? We use parallelism to cover latency to shared
> memory, so it's not necessary. It also avoids the problems of
> maintaining coherency, allows cheap synchronization, avoids the need
> for compiler (or programmer) cache management and data layout.

Hi, someone from Sweden recently gave a talk in which
he compared socialism in Sweden with a computer architecture
that he was describing, and that phrase just sprang to
mind when I read the above: "~everyone is equally poor~". :-)

Although the multiple hw contexts can certainly keep the processors
constantly busy, how much work gets done on each thread
when it is stalled on every single memory reference?
Is the shared memory global or locally distributed to each
processor (NUMA)? Must be an interesting memory system design.

Zahir

Preston Briggs

May 10, 1995
[about the Tera machine]

>Although the multiple hw contexts can certainly keep the processors
>constantly busy, how much work gets done on each thread
>when it is stalled on every single memory reference?
>Is the shared memory global or locally distributed to each
>processor (NUMA)? Must be an interesting memory system design.

Per-thread performance will probably be something like a Sparc.
Unless we get some parallelism from the thread, which'll let us do a
lot better. We also get some benefit, sometimes, from shoving around
64 bits at a time, plus having certain interesting instructions.
Lots of interesting bit-twiddling things, for instance, plus general
purpose registers (hold integer or FP values, for people who like that
kind of thing). Bit-manipulation instructions let us do things like
copy a NULL-terminated string of characters at a rate of 8 bytes every
three instructions. (Talking thread level here -- with parallelism,
can do a lot more interesting things).

Basically though, the machine is aimed at throughput. You give it
enough work to do, and it'll keep the pipelines, network, memory, and
I/O busy.

All memory is shared, UMA, and distributed across a packet-switched, 3D,
toroidally-connected network. Each memory unit is a GByte, and quite
heavily pipelined.

Preston Briggs

Allan Gottlieb

May 12, 1995
Here is some information about Tera from the 2nd edition of Almasi and
Gottlieb. There is too much included on HEP (the first Burton Smith
design) for me to post it here.


:h4.The Tera Architecture
:spot id=tera.
.pi /tera
:p.
The idea of using rapid context switching to mask memory access
and other machine latencies is not unique to the HEP design. Indeed,
the dataflow architectures discussed in Chapter &bkarchs. represent
another attempt to utilize this idea, as does the Sparcle processor
used in the MIT Alewife project (page :spotref refid=alewife.). Not
surprisingly, Burton Smith, the HEP architect has continued to
champion the idea in his subsequent designs, Horizon :bibref
refid=kueh88. and Tera :bibref refid=alve90..
:p.
Tera was designed to achieve three goals:
:ol compact.
:li.High speed&emdash.a fast clock and many processors
:li.Wide applicability&emdash.suitable to a broad spectrum of problems
:li.Good compilability&emdash.easy target for compilers
:eol.
Many factors, at both the architectural and detailed-design level,
contribute to meeting these goals, and we will just touch on a few.
:p.
As we described in Chapter &bkinte., Tera is topologically a
(depopulated) 3-D mesh. That is, Tera consists of a three-dimensional
grid of switches with processors, memories, and I/O caches connected
to :hp1.some:ehp1.
of the switches. A 256-processor system would contain 4096
switches arranged in a :f.16 times 16 times 16:ef. mesh that wraps
around in all three dimensions. Connected to these switches would be
256 computational processors, 256 I/O processors, 512 data memories,
and 256 I/O memories.
:p.
As with HEP, a Tera single processor can execute multiple instruction
streams interleaved on a cycle-by-cycle basis. For HEP an individual
stream could rarely have more than one instruction executing simultaneously.
This meant that to get full performance out of a HEP processor, at
least 11 active streams were needed. With the technology improvements
in processor speeds outstripping memory-access times, the
corresponding number for Tera would be about 70. While there are
undoubtedly applications that can furnish 70 active streams, this
would limit applicability. Moreover, even the most important stream
could execute little faster than one instruction per 70 clocks and
previous multistream processors have been criticized for poor
single-stream performance. Tera uses two techniques to lessen this
problem.
First, it is a superscalar design: each instruction typically
specifies three operations. The second technique is novel: each
instruction contains a count specifying the number of
subsequent instructions that are guaranteed :hp1.not:ehp1. to depend
on the current instruction. Since independent instructions can
be active simultaneously, the processor can continue to issue
instructions from a single stream at full speed until a dependency is
encountered. The traditional approach to utilizing this observation
is to employ register scoreboarding; the novelty in Tera is that the
dependencies are made explicit in the architecture and are the
responsibility of the software. Since the dependency count is a 3-bit
field, up to 8 instructions (24 operations) from a given stream can be
executing concurrently and thus only 9 streams would be needed to
extract the full performance of the processor and each stream would
execute at a rate of one operation per 3 clocks. It will be
interesting to see how close in practice Tera will come to this best
case.
:p.
Tera expands upon the HEP concept of tagged memory. In addition to
full/empty bits, Tera memory has tags to support application-specific
lightweight traps and indirect memory references (the latter without
the processor complications that RISC advocates have rightfully
criticized). The memory also supports the Ultracomputer fetch-and-add
operation (but the network does not combine multiple references; see
page :spotref refid=faa2.).
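
For concreteness, here is the arithmetic from the lookahead paragraph above,
restated as a trivial C program (the 70-clock memory latency is the
approximate figure quoted there, not a measured value):

#include <stdio.h>

/* Restating the Tera lookahead arithmetic with the round numbers above:
 * how many streams does one processor need to stay busy? */
int main(void)
{
    int memory_latency = 70;  /* clocks per memory reference (approximate) */
    int lookahead      = 8;   /* 3-bit field: up to 8 independent instructions */
    int ops_per_inst   = 3;   /* each LIW instruction specifies ~3 operations  */

    int streams_no_lookahead = memory_latency;                       /* ~70 */
    int streams_with = (memory_latency + lookahead - 1) / lookahead; /*  ~9 */

    printf("streams needed, no lookahead     : %d\n", streams_no_lookahead);
    printf("streams needed, 8-deep lookahead : %d\n", streams_with);
    printf("best case per stream: 1 instruction (%d operations) every %d clocks"
           " = 1 operation per %d clocks\n",
           ops_per_inst, streams_with, streams_with / ops_per_inst);
    return 0;
}
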
--
Allan Gottlieb gott...@nyu.edu
New York University http://cs.nyu.edu/cs/faculty/gottlieb/
