
Scanning memory forward vs. reverse


Bonita Montero

Jan 28, 2024, 5:19:37 AM
With my thread pool there's some code that scans a deque in reverse
order with find_if and reverse iterators to check whether a "this"
pointer is inside the deque. I could also have scanned the deque in
forward order, but a matching element is more likely to be found when
searching from the back first.
A deque usually consists of a number of linear parts in memory. This
led me to the question whether scanning memory forward or backward is
faster. I tried to test this with the program below:

#include <iostream>
#include <vector>
#include <atomic>
#include <chrono>

using namespace std;
using namespace chrono;

atomic_char aSum;

int main()
{
    constexpr size_t GB = 1ull << 30;
    vector<char> vc( GB );
    auto sum = []( auto begin, auto end, ptrdiff_t step )
    {
        auto start = high_resolution_clock::now();
        char sum = 0;
        for( auto p = begin; end - p >= step; sum += *p, p += step );
        ::aSum.store( sum, memory_order_relaxed );
        cout << duration_cast<nanoseconds>( high_resolution_clock::now() -
            start ).count() / 1.0e6 << "ms" << endl;
    };
    constexpr size_t STEP = 100;
    sum( vc.begin(), vc.end(), STEP );
    sum( vc.rbegin(), vc.rend(), STEP );
}

On my Windows Zen4 7950X machine scanning memory in both directions
runs at the same speed. On my Linux Zen2 3990X machine scanning
forward is 22% faster. On my small Linux PC, an HP EliteDesk Mini
with a Skylake Pentium G4400, scanning forward is about 38% faster.
I would have guessed at first that the prefetchers between the memory
levels are equally effective in both directions. So I'd like to see
some results from you.

Marcel Mueller

Jan 28, 2024, 5:33:13 AM
On 28.01.24 at 11:19, Bonita Montero wrote:
> A deque usually consists of a number of linear parts in memory. This
> led me to the question whether scanning memory forward or backward is
> faster. I tried to test this with the program below:

> On my Windows Zen4 7950X machine scanning memory in both directions
> runs at the same speed. On my Linux Zen2 3990X machine scanning
> forward is 22% faster. On my small Linux PC, an HP EliteDesk Mini
> with a Skylake Pentium G4400, scanning forward is about 38% faster.

> I would have guessed at first that the prefetchers between the memory
> levels are equally effective in both directions. So I'd like to see
> some results from you.

Reverse memory access is typically slower simply because the last bytes
of a cache line (after a cache miss) arrive last. If you read forward,
execution can continue as soon as the first few bytes of the cache line
have arrived; the rest of the line is filled in parallel.

But the details depend on many other factors, first of all the placement
of the memory chunks and the prefetching technique used (if any).


Marcel

Bonita Montero

Jan 28, 2024, 8:08:09 AM
On 28.01.2024 at 11:32, Marcel Mueller wrote:

> Reverse memory access is typically slower simply because the
> last data of a cache line (after a cache miss) arrives at last.

I tested this, and for all offsets within a cache line I get the
same timing on all three of my computers:

#include <iostream>
#include <vector>
#include <chrono>
#include <atomic>

using namespace std;
using namespace chrono;

#if defined(__cpp_lib_hardware_interference_size)
constexpr size_t CL_SIZE = hardware_constructive_interference_size;
#else
constexpr size_t CL_SIZE = 64;
#endif

atomic_char aSum;

int main()
{
    constexpr size_t
        BLOCK_SIZE = 1ull << 20,
        BLOCKS = BLOCK_SIZE / CL_SIZE,
        ROUNDS = 1000;
    vector<char> block( BLOCK_SIZE );
    for( size_t offset = 0; offset != CL_SIZE; ++offset )
    {
        auto start = high_resolution_clock::now();
        char sum = 0;
        for( size_t round = ROUNDS; round--; )
            for( size_t i = offset; i < BLOCK_SIZE; sum += block[i], i += CL_SIZE );
        ::aSum.store( sum, memory_order_relaxed );
        cout << offset << ": " <<
            duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count()
            / ((double)BLOCKS * ROUNDS) << endl;
    }
}

Chris M. Thomasson

Jan 28, 2024, 2:19:02 PM
On 1/28/2024 5:07 AM, Bonita Montero wrote:
> Am 28.01.2024 um 11:32 schrieb Marcel Mueller:
>
>> Reverse memory access is typically slower simply because the
>> last data  of a cache line (after a cache miss) arrives at last.
>
> I tested this, and for all offsets within a cache line I get the
> same timing on all three of my computers:
>
> [code snipped]
>
>     vector<char> block( BLOCK_SIZE );

Try padding and aligning the blocks. iirc, std::vector works with
alignas. Actually, it's pretty nice.

Bonita Montero

Jan 29, 2024, 3:56:55 AM
On 28.01.2024 at 20:18, Chris M. Thomasson wrote:

> Try padding and aligning the blocks. iirc, std::vector works with
> alignas. Actually, it's pretty nice.

I'm testing all 64 offsets. If logical offset zero ends up at physical
offset one within the cache line, it doesn't matter, since physical
offset zero is then occupied by logical offset 63.

Chris M. Thomasson

Jan 29, 2024, 5:13:13 PM
You don't want to straddle any cache lines. Properly aligning and
padding the blocks gets around that...

Bonita Montero

Jan 30, 2024, 5:48:02 AM
On 29.01.2024 at 23:12, Chris M. Thomasson wrote:
> On 1/29/2024 12:56 AM, Bonita Montero wrote:
>> On 28.01.2024 at 20:18, Chris M. Thomasson wrote:
>>
>>> Try padding and aligning the blocks. iirc, std::vector works with
>>> alignas. Actually, it's pretty nice.
>>
>> I'm testing all 64 offsets. If logical offset zero ends up at physical
>> offset one within the cache line, it doesn't matter, since physical
>> offset zero is then occupied by logical offset 63.
>>
>>
>
> You don't want to straddle any cache lines. ...

I'm testing all 64 offsets, and for my measurement it doesn't matter
whether the beginning of the block is at offset zero inside a cache
line, since the results show equal access times for all offsets.
If the results had differed, proper alignment might have made sense.

Vir Campestris

Jan 31, 2024, 10:56:15 AM
On 28/01/2024 10:19, Bonita Montero wrote:
> On my Windows Zen4 7950X machine scanning memory in both directions
> runs at the same speed. On my Linux Zen2 3990X machine scanning
> forward is 22% faster. On my small Linux PC, an HP EliteDesk Mini
> with a Skylake Pentium G4400, scanning forward is about 38% faster.
> I would have guessed at first that the prefetchers between the memory
> levels are equally effective in both directions. So I'd like to see
> some results from you.

On my Linux box with an AMD Ryzen 5 3400G the second number is about
11% slower. But that's very approximate: the machine is doing something
else right now, and that's the average of several runs, where the ratio
varies between 97% and 130%.

Andy