
question about memory-bandwidth and logical cores


fir

Aug 20, 2017, 4:49:15 AM
if you have some memory bandwidth on a given physical core (say it is 7 GB/s), then when you have 2 physical cores you get it on each core separately (I mean you get 2 x 7 GB/s);
if you have 6 physical cores you will get 6 x 7 GB/s - this is as far as I know, but as far as I know it is true - and it is very important


my question is - is it the same with logical cores? I mean, does having 2 logical cores in one physical core double this 7 GB/s memory bandwidth? (I expect probably not, and that it is shared between the two logical cores, but I'm not sure.) If someone has such info, let me know

Marcel Mueller

Aug 20, 2017, 5:36:24 AM
There is no such rule at all, not even for physical cores. And it has
nothing to do with C++ programming.

So you should ask in a group matching your particular hardware.


Marcel

fir

Aug 20, 2017, 5:58:58 AM
there is such a rule, and it has very much to do with C++ programming

(it is silly to think that knowledge of how memory bandwidth is distributed across physical/logical cores has nothing to do with C++ programming; it's basically fundamental)

Paavo Helde

Aug 20, 2017, 11:38:59 AM
On 20.08.2017 11:48, fir wrote:
> if you have some memory bandwidth on a given physical core (say it is 7 GB/s), then when you have 2 physical cores you get it on each core separately (I mean you get 2 x 7 GB/s);
> if you have 6 physical cores you will get 6 x 7 GB/s - this is as far as I know, but as far as I know it is true - and it is very important

It depends on the memory access pattern. If the program accesses
limited amounts of memory, fitting in the L1 cache, then indeed each
physical core can operate fully in parallel with the others (modulo
false sharing), as each physical core has its own L1 cache.

However, if the program is memory-bound and accesses large amounts of
memory not fitting in the L3 cache, then physical cores begin to
interfere with each other because the L3 cache is typically shared
between physical cores. If the L3 cache gets filled up the cores start
to fight over it and the program performance does not scale up with the
number of used threads any more.

>
> my question is - is it the same with logical cores? I mean, does having 2 logical cores in one physical core double this 7 GB/s memory bandwidth? (I expect probably not, and that it is shared between the two logical cores, but I'm not sure.) If someone has such info, let me know
>

I think the L1 cache is typically shared by the logical cores, so the
memory bandwidth does not really double for logical CPUs.

hth
Paavo

fir

Aug 20, 2017, 1:19:37 PM
it is easy to measure physically: say, do a big memset on 1 core, then on 2 physical cores (half the size on core A, half on core B), and then on two logical cores - with 2 physical cores it will be just twice as fast; do it on logical cores and you will see (probably it is as you said, and on logical cores it will not be twice as fast - it will be exactly as slow as on one)

(I mostly believe in this kind of test personally, but I am working on a machine with only 2 physical cores.. theoretically I am not fully sure that a machine with 6 physical cores will do 'memsets' 6 times faster, but I think it probably will)

Vir Campestris

Aug 20, 2017, 4:27:18 PM
It's a general programming question, and this forum is for specific C++
problems.

The answer depends entirely on the hardware. Read about NUMA.

Andy

fir

Aug 20, 2017, 4:53:32 PM
it is impossible to talk about specific things without general fundamentals imo.. note also that this group is not only about the C++ language but also about "programming in C++" (otherwise there would be 2 groups, and I don't see the second one), and this
topic fits programming in C++
(though it also fits programming in Pascal or Java; in C++ more people tend to know bandwidth details)

David Brown

Aug 21, 2017, 2:58:34 AM
Of course L1 cache (content and bandwidth) is shared by the logical
cores - they are /logical/ cores, not physical cores. They share almost
everything - instruction decoders, pipelines, execution units, buffers,
etc. They have separate logical sets of ISA registers (the registers
visible to the programmer), but on devices like x86 chips (where the ISA
has few registers) there are many more physical hardware registers that
are mapped at different times - and the logical cores share them too.

Cores and caches are organised as a hierarchy in multi-core devices.
The highest bandwidths are at the closest steps: physical cores to
their L1 caches, cores to cores within a core cluster (if the chip has
this level), L1 caches to their L2 caches, L2 caches to the L3 cache
(usually shared amongst all cores on the chip), and finally the
off-chip bandwidth. Usually the off-chip bandwidth is shared amongst
all cores, but for multi-module chips like AMD's new devices, each chip
in the module has its own buses off the module.

In other words - it is complicated, depends totally on the level of
cache you are talking to, and details are specific to the device
architecture.

And as has been pointed out, it has /nothing/ to do with C++ - it is a
general architecture issue, independent of language. Unless you are
targeting a specific chip (such as fine-tuning for a particular
supercomputer model), you use the same general rules for all languages
and all chips: aim for locality of reference in your critical data
structures. Keep the structures small. Avoid sharing and false sharing
between threads. Use an OS that is aware of the memory architecture of
your processor, and of the geometry of its logical and physical cores.

(I am replying to you here, for your interest. I have long ago seen it
as pointless trying to talk to Fir.)

Paavo Helde

Aug 21, 2017, 6:05:43 AM
Thanks for clarifying this; it is more or less consistent with my
understanding.

I had an impression that there are still separate cpu instruction
pipelines for logical processors - they are executing different code
after all - is this not so?

I agree it is pointless to discuss with Fir, but there is no rule one
should do meaningful things all the time ;-) Some of his absurd ideas
contain some interesting moments...

Cheers
Paavo


Scott Lurndal

Aug 21, 2017, 9:01:12 AM
Paavo Helde <myfir...@osa.pri.ee> writes:
>On 21.08.2017 9:58, David Brown wrote:

>> In other words - it is complicated, depends totally on the level of
>> cache you are talking to, and details are specific to the device
>> architecture.

>I had an impression that there are still separate cpu instruction
>pipelines for logical processors - they are executing different code
>after all - is this not so?
>

The whole point of SMT (e.g. hyperthreading) is to get higher
utilization of the core resources. The hyperthreads/logical processors
share all the resources of the core (except that each logical processor
keeps separate state - e.g. registers, page table base address,
etc.). The caches, store buffers, and pipelines are shared.

David Brown

Aug 21, 2017, 10:17:04 AM
On 21/08/17 12:05, Paavo Helde wrote:

>
> I had an impression that there are still separate cpu instruction
> pipelines for logical processors - they are executing different code
> after all - is this not so?
>

They will have to keep some parts separate, so that they can track
independent instruction streams. How much is duplicated, and how much
is shared, is going to vary a bit between implementations.


fir

Aug 21, 2017, 11:41:25 AM
brown is a total lama, I wouldn't listen to that fella (unless someone wants to get stupider)

as for those bandwidths, imo it is probably clear;
most preferably do some test with memset if you have logical cores at home

it is a binary thing imo, like with physical cores and sse/avx

with 2 physical cores, when you do memset, you will get it twice as fast when you use 2 cores [tried it myself, believe me]

when using avx, even though it has instructions to store 8 integers at once, you will get a 0% speed bonus (compared to using 8 sequential 32-bit mov stores) [tried it myself, believe me]

logical cores are either like AVX or like physical cores (I guess, from what is said here and from other things I maybe heard and vaguely remember, that it unfortunately goes like AVX - no additional memory bandwidth)
[haven't tried it myself yet, got no logical cores on board]

Paavo Helde

Aug 21, 2017, 3:29:09 PM
On 21.08.2017 18:41, fir wrote:
>
> brown is a total lama, I wouldn't listen to that fella (unless someone wants to get stupider)

Calling somebody a Tibetan Lama is a compliment in my book!

fir

Aug 21, 2017, 4:26:46 PM
well, I'm not sure if this brown is a Tibetan lama, but for sure he is a lama

David Brown

Aug 21, 2017, 4:44:53 PM
Don't forget that Fir does not believe in correct spelling, or using the
conventional meanings for words. You can try to guess what he is trying
to say, or just ignore him. Certainly don't try to offer help, advice
or answers to his questions - that just results in insults. I suspect
it is because he can't cope with the idea that someone knows more than
he does - he asks more in the hope that other people will confirm that
they don't know either. Then he can make more posts replying to himself
with less and less intelligible content, and he can imagine that he is
the only person smart enough to talk to.

Sometimes his posts inspire interesting questions or other posts,
however. If Fir listens in and learns something, that's okay.

Vir Campestris

Aug 21, 2017, 4:49:21 PM
On 21/08/2017 16:41, fir wrote:
> with 2 physical cores, when you do memset, you will get it twice as fast when you use 2 cores [tried it myself, believe me]

Depends on the exact processor you have. Even different Intel ones will
give different results.

Did you read about NUMA?

Andy

fir

Aug 21, 2017, 5:09:04 PM
you mean there are machines on the market (I mean x86/x64 architecture) that do not multiply memory bandwidth with each physical core?

(if so, I would need to be warned very carefully about what not to spend my money on ;c ) [as memory bandwidth is totally critical for system efficiency; in short, system efficiency = memory bandwidth]

if you have something other than x86/x64 I don't much care

David Brown

Aug 22, 2017, 4:40:35 AM
On 21/08/17 23:08, fir wrote:
> On Monday, 21 August 2017 at 22:49:21 UTC+2, Vir Campestris wrote:
>> On 21/08/2017 16:41, fir wrote:
>>> with 2 physical cores, when you do memset, you will get it twice as
>>> fast when you use 2 cores [tried it myself, believe me]
>>
>> Depends on the exact processor you have. Even different Intel ones
>> will give different results.
>>
>> Did you read about NUMA?
>>
>> Andy
>
> you mean there are machines on the market (I mean x86/x64
> architecture) that do not multiply memory bandwidth with each physical
> core?

That is not what he said - he asked you if you had read about NUMA.
Clearly you have not. You should do so.

An important point is that /some/ bandwidths scale by physical cores -
others do not. Typically each physical core has its own L1 cache, with
dedicated bandwidth to that level - the total core-to-L1 bandwidth
therefore scales with physical cores. But the bandwidth further out -
L2 to L3, L3 to memory - can be grouped by clusters of cores, and may
have shared buses out to memory. The level of sharing varies by
architecture. Typically Intel has flatter sharing, giving more even and
more predictable accesses, while AMD has hierarchies giving better
scaling of total memory bandwidth with core count, but less uniform
access (hence "NUMA").