NVIDIA Retiring from the Graphics Card Business?


Quadibloc

Sep 20, 2022, 9:02:59 PM
That sounds like a rather odd takeaway from today's big announcement
from NVIDIA of its new and improved generation of graphics
cards.
Contrary to the rumors, the new 4090 operates within the same
power envelope as the 3090 did, instead of guzzling a whole lot
more power.
The new cards are said to be 2x as powerful as their predecessors -
and even 3x or 4x as powerful for some purposes!

However, not only were just the 4090 and 4080 announced today, but
they referred to the 3050, 3060, and 3070 as ideal solutions for mainstream
gamers... as if, even though that kind of performance improvement means
the 4060, if it existed, would outshine the 3090, *there isn't ever going
to _be_ a 4070 or 4060, let alone a 4050*.

Even the 4080 is priced such that not everyone is going to feel
the need to spring for a graphics card at that level.
And, certainly, AMD is going to find it difficult to provide a similar level
of performance improvement in their new products. Is this intended to
provide breathing room, so that AMD can continue to exist, thus
keeping antitrust problems at bay?

Perhaps I just misunderstood the announcement, but "we've got this
wonderful new technology, but we're not going to bother to market
it to any great extent" sounds very strange.

John Savard

Quadibloc

Sep 20, 2022, 9:06:20 PM
On Tuesday, September 20, 2022 at 7:02:59 PM UTC-6, Quadibloc wrote:

> Perhaps I just misunderstood the announcement, but "we've got this
> wonderful new technology, but we're not going to bother to market
> it to any great extent" sounds very strange.

A little more thought has allowed me to come up with a motivation
for this unusual introduction.
The big supply chain crunch for the previous generation of video
cards has been *really* painful for both Nvidia and all the OEMs
that make video cards using its chips. So Nvidia is doing two things
for them and itself:
- making it easier to sell off their unsold stock, and
- squeezing as many video card buyers as possible to buy the
more expensive, higher profit margin, cards.

John Savard

John Levine

Sep 20, 2022, 10:36:33 PM
According to Quadibloc <jsa...@ecn.ab.ca>:
>On Tuesday, September 20, 2022 at 7:02:59 PM UTC-6, Quadibloc wrote:
>
>> Perhaps I just misunderstood the announcement, but "we've got this
>> wonderful new technology, but we're not going to bother to market
>> it to any great extent" sounds very strange.
>
>A little more thought has allowed me to come up with a motivation
>for this unusual introduction.

I wonder if it is also the recent change in Ethereum from proof of
work to proof of stake, wiping out a large market for GPUs.

--
Regards,
John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Thomas Koenig

Sep 21, 2022, 12:54:21 AM
Quadibloc <jsa...@ecn.ab.ca> schrieb:
https://dilbert.com/strip/2007-08-05 comes to mind.

Quadibloc

Sep 21, 2022, 1:18:34 AM
And I didn't even mention Shader Execution Reordering, which is claimed to be
as important an innovation for GPUs as out-of-order execution was for CPUs.

Not exactly modest, are they...

John Savard

Quadibloc

Sep 21, 2022, 1:28:32 AM
Assuming it's as good as they say, how is AMD going to survive if they
can't license this on reasonable terms?

One thing I see is that they could probably get the same benefits
for regular rendering (but not ray tracing) with Cache Access
Buffering: don't re-order the computations, but issue the fetches
ahead of time, in a similarly re-ordered sequence, to go quicker. Since
AMD isn't really competing with Nvidia on ray-tracing, this could
work. Intel, on the other hand, seems more ambitious.

John Savard

Stephen Fuld

Sep 21, 2022, 1:58:20 AM
Perhaps their yields are low and they want to maximize the revenue on
their limited supply of chips. As yields improve, they will use more of
them at the lower price points.


--
- Stephen Fuld
(e-mail address disguised to prevent spam)


John Dallman

Sep 21, 2022, 7:03:23 AM
In article <tgdtbe$25nd$1...@gal.iecc.com>, jo...@taugh.com (John Levine)
wrote:

> I wonder if it is also the recent change in Ethereum from proof of
> work to proof of stake, wiping out a large market for GPUs.

Bitcoin, and many other blockchains, are still PoW. The companies with
lots of mining kit seem more likely to change targets than give up.

John

Anton Ertl

Sep 21, 2022, 10:13:29 AM
They don't use graphics cards for Bitcoin; instead, they use ASICs
(more energy-efficient for Bitcoin).

So the crypto crash supposedly has resulted in a lot of stock at
graphics-chip and/or graphics-card manufacturers, and the Ethereum
switch has compounded this. Former Ethereum miners are supposedly
flooding the market for used graphics cards.

But looking at the retail prices and at ebay, I don't really see this;
ok, the prices are lower than at the peaks of the crypto boom, but
they are still much higher than in earlier times and after the last
crypto crash. E.g., in summer 2019 the middle-class AMD RX570 cards
could be had for EUR 130, while the entry-level RX 6500XT currently
costs upwards of EUR 200. And the prices I see on ebay for an
RX6600XT are only lower than retail prices for offers coming from the
USA (and the shipping cost would absorb much of the benefit).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

EricP

Sep 21, 2022, 10:58:09 AM
A little searching for Shader Execution Reordering turns up no
technical details specifically on it (though ray-trace reordering
goes back quite some time), but it did turn up this 40-second video of SER.
Looks like SIMD with OoO within and across multiple rows.

https://developer.nvidia.com/rtx/path-tracing/nvapi/get-started




Quadibloc

Sep 21, 2022, 11:14:01 AM
On Wednesday, September 21, 2022 at 8:58:09 AM UTC-6, EricP wrote:
> Quadibloc wrote:
> > And I didn't even mention Shader Execution Reordering, which is claimed to be
> > as important an innovation for GPUs as out-of-order execution was for CPUs.
> >
> > Not exactly modest, are they...

> A little searching for Shader Execution Reordering turns up no
> technical details specifically on it (though ray trace reordering
> goes back quite some time) but did find this 40 sec video of SER.
> Looks like SIMD with OoO within and across multiple rows.
>
> https://developer.nvidia.com/rtx/path-tracing/nvapi/get-started

Interesting. From Jensen Huang's description, it sounded to me as
if it wasn't really related to OoO; calculations were regrouped,
after a delay, so that, in the case of ray-tracing, more similar calculations
could be bunched together to be executed in a single wave, and for
all forms of rendering, calculations accessing data from cache would
be regrouped to make cache access more efficient.

So it isn't really anything to do with dependencies and hazards, for
example.
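That regrouping can be sketched in a few lines of Python. This is only a
toy model of the idea: the wave width, shader names, and cost accounting
are invented for illustration, not taken from any Nvidia documentation.

```python
from collections import defaultdict

WAVE_WIDTH = 8  # lanes per SIMD wave (illustrative, not the real width)

def waves_in_order(hits):
    """Threads keep arrival order: a wave that mixes shader ids has to
    run one masked pass per distinct id (divergence)."""
    passes = 0
    for i in range(0, len(hits), WAVE_WIDTH):
        passes += len(set(hits[i:i + WAVE_WIDTH]))
    return passes

def waves_regrouped(hits):
    """Bucket threads by shader id first, so each wave runs one shader
    in a single coherent pass."""
    counts = defaultdict(int)
    for shader in hits:
        counts[shader] += 1
    return sum((n + WAVE_WIDTH - 1) // WAVE_WIDTH for n in counts.values())

# Rays hitting four materials in interleaved order:
hits = ['diffuse', 'glass', 'metal', 'emissive'] * 8  # 32 hits
# in arrival order: 16 masked passes; regrouped: 4 coherent waves
```

The delay before execution that Jensen Huang described is, in this
model, simply the time spent accumulating enough hits to fill the
buckets.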

John Savard

Quadibloc

Sep 21, 2022, 11:17:21 AM
On Tuesday, September 20, 2022 at 11:58:20 PM UTC-6, Stephen Fuld wrote:

> Perhaps their yields are low and they want to maximize the revenue on
> their limited supply of chips. As yields improve, they will use more of
> them at the lower price points.

Oh, of course - since the highest-end cards were released first in previous
generations as well. But this time, there was no mention of the lower-end
cards coming later, *and* a specific nod to the old lower-end cards still being
a good solution for ordinary gamers.

Sure, they could have gone further because the yields are particularly
low this time, what with it being a new process and all. But market conditions
caused by the GPU shortage, which has now suddenly ended, seemed to me
a more likely suspect, as something that was radically different compared to
previous releases.

John Savard

MitchAlsup

Sep 21, 2022, 11:20:31 AM
Threads within a WARP are reassigned to different WARPs as control
flow warrants. Thus instead of running ½ the threads for 1 instruction
and then the other ½ in a different cycle, you pay the energy cost to move
threads from WARP to WARP in order that all the WARPs are full and you
then get 100% throughput instead of 50% throughput in the face of flow
control decisions.
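That 50% vs. 100% arithmetic can be worked through in a toy model.
The warp width and the 50/50 divergence pattern below are illustrative
choices of mine, not Nvidia's actual parameters.

```python
WARP = 4  # threads per warp (illustrative)

def throughput_masked(warps):
    """No repacking: each warp issues both sides of the branch with
    lanes masked, so a 50/50 divergent branch wastes half the lanes."""
    issued = active = 0
    for warp in warps:
        for taken in (True, False):
            live = sum(1 for t in warp if t == taken)
            if live:
                issued += WARP   # a full warp's worth of lanes is occupied
                active += live   # but only the matching lanes do work
    return active / issued

def throughput_repacked(warps):
    """Repacking: threads with the same branch outcome are gathered from
    all warps into full warps before issue."""
    issued = active = 0
    for taken in (True, False):
        n = sum(t == taken for warp in warps for t in warp)
        if n:
            issued += -(-n // WARP) * WARP   # round up to whole warps
            active += n
    return active / issued

# Two warps, each half taken / half not taken:
divergent = [[True, False, True, False], [False, True, False, True]]
# masked: 0.5 utilization; repacked: 1.0
```

The energy cost Mitch mentions is the movement of thread state between
warps, which this model deliberately ignores.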

Quadibloc

Sep 21, 2022, 11:22:39 AM
Meanwhile, I see that AMD is going to announce their next generation,
the 7000 series of video cards, on November 3.
With 50% more performance per watt!
What with Nvidia claiming 2x to 4x more performance - to be fair, though,
only 2x for the most important case - for the same wattage, it seems as
though AMD is going to take a bit of a beating this time.

However, I do remember reading a news item about how AMD released
new drivers that doubled the performance of not only their current cards,
but also a lot of old ones. So maybe it's Nvidia that is catching up.

John Savard

EricP

Sep 21, 2022, 5:08:52 PM
Hmmm... this looks like what I was musing about for packing multiple
ops into a wide FPU or LSQ on the fly, in that case for VVM.

Say the FPU has 32 Reservation Stations, each with an instruction
code and 2 operands, and a Float Calculation Unit (FCU) can perform
2*FP64 or 4*FP32 or 8*FP16 at once, or even in combinations;
then a relatively straightforward set of schedulers could pick
multiple RS with the same instruction to launch at once, and a set of
muxes gets the operands into the scheduler-selected FCU lanes.

Ideally this wide result set is written back and forwarded as one.
One difficulty for FPU and for a dynamically packing LSQ is that
this lane renaming takes place on the fly by the FPU/LSQ schedulers.
It needs a wide wake-up matrix that indicates to up to 8 dependents
and 8 register write ports which result lane to pull their result from.

Just speculating... the Nvidia SER looks like it might be scheduling
and shuffling threads between calculation lanes in a similar manner,
and possibly deals with similar issues for result forwarding/writeback.
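The packing part of that scheduler could be sketched roughly as follows.
The datapath width, opcode names, and the greedy grouping policy are all
my own assumptions; the hard parts (wake-up, lane renaming, forwarding)
are exactly what this sketch leaves out.

```python
from collections import defaultdict

DATAPATH_BITS = 128                # FCU width: 2*FP64 = 4*FP32 = 8*FP16
PRECISION_BITS = {'fp64': 64, 'fp32': 32, 'fp16': 16}

def pack_launches(stations):
    """Greedy packing sketch: group ready reservation-station entries by
    (opcode, precision), then launch up to one datapath's worth per
    issue.  Entries are (opcode, precision) pairs; returns one
    (opcode, station-indices) group per FCU launch."""
    groups = defaultdict(list)
    for idx, (op, prec) in enumerate(stations):
        groups[(op, prec)].append(idx)
    launches = []
    for (op, prec), idxs in groups.items():
        lanes = DATAPATH_BITS // PRECISION_BITS[prec]
        for i in range(0, len(idxs), lanes):
            launches.append((op, idxs[i:i + lanes]))
    return launches

# Six fp32 adds and three fp64 muls pack into 4 launches instead of 9:
rs = [('fadd', 'fp32')] * 6 + [('fmul', 'fp64')] * 3
```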



MitchAlsup

Sep 21, 2022, 6:06:09 PM
This is an excellent reason that VVM has each lane compute its own
flow of control (if-then-else) so it can cancel the instructions that should
not be "performed" while performing those that are not cancelled.
<
So, my basic plan is that all instructions in the loop are inserted, then
as various flow control decisions are made, instruction selection is
performed by predication (not by branching). The instructions are already
there, they just need to be turned on and off on a per iteration basis.
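In scalar Python that selection-by-predication looks like the sketch
below. Nothing here is VVM-specific, and the arm computations are made
up; the point is only that both arms exist every iteration and the
predicate picks one, so no lane ever branches.

```python
def predicated_loop(xs, ys):
    """Per-lane predication sketch: each lane evaluates its own
    condition, both arms are present in the loop body, and the
    predicate selects which result is kept."""
    out = []
    for x, y in zip(xs, ys):   # one iteration per lane
        pred = x > y           # each lane computes its own control flow
        then_val = x - y       # arm enabled when pred is on
        else_val = y - x       # arm enabled when pred is off
        out.append(then_val if pred else else_val)
    return out

# Both arms are "inserted"; predication turns them on/off per iteration:
# predicated_loop([3, 1], [2, 5]) computes the absolute differences.
```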
<
But secondarily, each lane gets its own ALU+FPU (FCU) so it can compute
what is required without needing to know what its neighbors are doing.
<
> It needs a wide wake-up matrix that indicates to up to 8 dependents
> and 8 register write ports which result lane to pull their result from.
<
Big fast wide stuff does need a lot of register ports (or at least the
forwarding). AND while you can increase the read ports by duplication,
you cannot with the write ports.
>
> Just speculating... the Nvidia SER looks like it might be scheduling
> and shuffling threads between calculation lanes in a similar manner,
> and possibly deals with similar issues for result forwarding/writeback.
<
Basically, what you are "getting at" is that this kind of reshuffling adds
latency to the data flow. The observation is correct.

Quadibloc

Sep 21, 2022, 11:50:54 PM
On Tuesday, September 20, 2022 at 11:18:34 PM UTC-6, Quadibloc wrote:
Also, I didn't mention the controversy about the 12 GB and 16 GB versions of
the 4080 using different GPU chips.
Some have said that the 12 GB 4080 is a 4070 in disguise. I don't agree; I
think it's the other way around, and the 16 GB 4080 is a 4080 Ti in disguise.
So the only thing I have to criticize Nvidia on there is sloppy and confusing
nomenclature, not failing to give good performance.

John Savard

Quadibloc

Sep 22, 2022, 6:45:28 AM
On Wednesday, September 21, 2022 at 9:22:39 AM UTC-6, Quadibloc wrote:
> Meanwhile, I see that AMD is going to announce their next generation,
> the 7000 series of video cards, on November 3.
> With 50% more performance per watt!
> What with Nvidia claiming 2x to 4x more performance - to be fair, though,
> only 2x for the most important case - for the same wattage, it seems as
> though AMD is going to take a bit of a beating this time.

Now I see that despite Nvidia's claim of 2x to 4x improvement in some
areas, their advancement in actual performance on games was only
claimed to be 60%, so AMD is comparable.

John Savard

EricP

Sep 23, 2022, 3:01:17 PM
When I looked at OoO predication I came to the conclusion that, while at an
ISA level a predicate-FALSE instruction is considered a NOP, at the uArch
level "canceled" instructions can still have housekeeping to perform.

All uOps are marked "Enabled" or "Disabled" by the predicate instruction,
and each uOp has two sets of actions, OnEnable and OnDisable.
The OnEnable actions are the normal instruction operations.
OnDisable actions might be as simple as marking the Instruction Queue
entry as "Done", or more complicated, such as propagating an old value
or recovering allocated uOp resources such as LSQ entries.

The situation where an instruction could really be canceled
is when the predicate value is already resolved at Dispatch
(hand-off from the in-order front end to the OoO back end), where the
Dispatcher can sometimes optimize the uOp to a NOP if,
for example, it has no dest register, like a ST.
In that case the Dispatcher can insert a NOP marked Done into the IQ.

It seems to me similar rules would apply for in-order
pipelines even if the uOp OnDisable action just diddles the scoreboard.

Dynamically grouping their threads into OnEnable and OnDisable action sets
would optimize the FU utilization for enabled instructions, as single
uOps wouldn't have a mixture of Onxxx actions in different lanes.
Disabled uOp sets can bypass the calculation unit altogether,
leaving it free for actual calculations.
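A minimal sketch of that Enable/Disable split, in Python. The uOp
encoding, the scoreboard marking, and the choice of OnDisable action
are my own strawman, not anyone's actual design.

```python
def retire_uop(uop, enabled, regfile, scoreboard):
    """A predicated-off uOp is not a free NOP at the uArch level: it
    skips the calculation but still does its housekeeping, here leaving
    the old destination value in place for dependents and marking the
    Instruction Queue entry Done in the scoreboard."""
    if enabled:
        # OnEnable: the normal instruction operation
        regfile[uop['dest']] = uop['fn'](*(regfile[s] for s in uop['srcs']))
    # Housekeeping happens either way: dependents still get woken up
    scoreboard[uop['dest']] = 'Done'
    return regfile[uop['dest']]

# A predicated add r3 = r1 + r2:
add_uop = {'dest': 'r3', 'srcs': ['r1', 'r2'], 'fn': lambda a, b: a + b}
```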

> <
> So, my basic plan is that all instructions in the loop are inserted, then
> as various flow control decisions are made, instruction selection is
> performed by predication (not by branching). The instructions are already
> there, they just need to be turned on and off on a per iteration basis.

Yes

> <
> But secondarily, each lane gets its own ALU+FPU (FCU) so it can compute
> what is required without needing to know what its neighbors are doing.

Yes.

All of their dynamic packing and/or lane shuffling does assume that
there is a hardware advantage to having 8 lane SIMD calculation units
as opposed to 8 independent FP units.

Though the LSQ would still gain from packing multiple operations into
a single large one so it still might have some isolated lane shuffling.

> <
>> It needs a wide wake-up matrix that indicates to up to 8 dependents
>> and 8 register write ports which result lane to pull their result from.
> <
> Big fast wide stuff does need a lot of register ports (or at least the
> forwarding). AND while you can increase the read ports by duplication,
> you cannot with the write ports.
>> Just speculating... the Nvidia SER looks like it might be scheduling
>> and shuffling threads between calculation lanes in a similar manner,
>> and possibly deals with similar issues for result forwarding/writeback.
> <
> Basically, what you are "getting at" is that this kind of reshuffling adds
> latency to the data flow. The observation is correct.

Yes, I was actually thinking of forwarding as a potential critical
path if they want to be able to launch back-to-back executes.
Adding muxes into that path might cause "issues".
If they don't do back-to-back executes (maybe it is like a
barrel processor between warps), it might not be an issue.

All that dynamic lane shuffling presupposes that the savings in gates
for an 8 lane SIMD unit is worth the shuffle cost to keep all lanes
busy vs having 8 independent FP64 and 8 ALU units.
With 78 billion transistors, having 8 full FP64 units per shader
seems unlikely to break the gate budget, but it might break the power budget.

Quadibloc

Sep 23, 2022, 6:51:44 PM
On Tuesday, September 20, 2022 at 11:28:32 PM UTC-6, Quadibloc wrote:
> On Tuesday, September 20, 2022 at 11:18:34 PM UTC-6, Quadibloc wrote:
> > And I didn't even mention Shader Execution Reordering, which is claimed to be
> > as important an innovation for GPUs as out-of-order execution was for CPUs.

> Assuming it's as good as they say, how is AMD going to survive if they
> can't license this on reasonable terms?

According to TechPowerUp, SER from Nvidia was the same as something previously
described by Intel for their Xe-HPG cards, so at least Intel is in no trouble.

John Savard

EricP

Sep 23, 2022, 10:23:21 PM
I looked for related research papers written by people at Nvidia
and came across one co-authored by Michael Shebanow, now at Nvidia
and who I believe worked with Mitch at Motorola on the 88100 & 88110
RISC cpu projects and other projects since.

Improving GPU Performance via Large Warps and
Two-Level Warp Scheduling, 2011
http://people.inf.ethz.ch/omutlu/pub/large-gpu-warps_micro11.pdf

The above says "Fung et al. [7, 8] were the first to propose the idea
of combining threads from different warps to address underutilized
SIMD resources due to branch divergence on GPU cores."

[7] Fung et al. Dynamic warp formation and scheduling
for efficient GPU control flow. In MICRO-40, 2007
https://www.ece.ubc.ca/~aamodt/publications/papers/wwlfung.micro2007.pdf

[8] Fung et al. Dynamic warp formation: Efficient MIMD control
flow on SIMD graphics hardware. ACM TACO, 6(2):1–37, June 2009
https://dl.acm.org/doi/pdf/10.1145/1543753.1543756

I have only had a chance to quickly scan them but they all seem to be
talking about variations of "dynamic warp formation and scheduling".

There are many other related papers on the subject and this has been
an active research topic for 15 years.

