
speed of unaligned accesses that cross page-boundaries


Bonita Montero

Mar 20, 2022, 1:04:37 PM
I just wanted to measure the performance of unaligned accesses.
For accesses within a page the difference was below the measurement
error. Then I read that some processors put a penalty on unaligned
accesses that cross page-boundaries. So I wrote a little benchmark
that tries to measure this. But on my PCs, a Ryzen Threadripper
3990X, a Ryzen 7 1800X and a Phenom X4 945, all measurements were
also within the measurement error.
So I'm interested whether someone out there has a PC where unaligned
accesses crossing a page boundary have a performance penalty. So here's
the benchmark in C++20:

#if defined(_MSC_VER)
    #include <Windows.h>
#elif defined(__unix__)
    #include <unistd.h>
#endif
#include <iostream>
#include <vector>
#include <cstdint>
#include <chrono>
#include <atomic>

using namespace std;
using namespace chrono;
using namespace chrono_literals;

using T = uint64_t;

atomic<T> aSum;

int main()
{
    auto bench = [&]<typename T>()
        requires is_scalar_v<T>
    {
        auto probe = [&]( atomic<T> *pAT ) -> double
        {
            T sum = 0;
            size_t rounds = 0;
            nanoseconds nsSum( nanoseconds( 0 ) );
            do
            {
                constexpr size_t INTERVAL = 1'000'000;
                auto start = high_resolution_clock::now();
                for( size_t r = INTERVAL; r--; )
                    sum += pAT->load( memory_order_relaxed );
                nsSum += duration_cast<nanoseconds>( high_resolution_clock::now() - start );
                rounds += INTERVAL;
            } while( nsSum < 250ms );
            ::aSum = sum;
            return (double)(ptrdiff_t)rounds / (int64_t)nsSum.count();
        };
        auto getPageSize = []() -> size_t
        {
#if defined(_MSC_VER)
            SYSTEM_INFO si;
            GetSystemInfo( &si );
            return si.dwPageSize;
#elif defined(__unix__)
            return (size_t)sysconf( _SC_PAGESIZE );
#endif
        };
        size_t pageSize = getPageSize();
        vector<char> vec( 0x1000 + sizeof(T) );
        auto align_ptr = [&]( char *p ) { return (char *)((size_t)(p + pageSize) & ~(pageSize - 1)); };
        atomic<T> *aligned = (atomic<T> *)align_ptr( &vec.front() );
        double tUnaligned = probe( aligned - 1 ), tAligned = probe( aligned );
        cout << (int)(tUnaligned / tAligned * 100.0 + 0.5) << "%" << endl;
    };
    bench.operator ()<T>();
}

Please post your results here.

Bonita Montero

Mar 20, 2022, 1:09:11 PM
>     vector<char> vec( 0x1000 + sizeof(T) );
>     auto align_ptr = [&]( char *p ) { return (char *)((size_t)(p + pageSize) & ~(pageSize - 1)); };

    vector<char> vec( pageSize + sizeof(T) );
    auto align_ptr = [&]( char *p ) { return (char *)((size_t)(p + pageSize) & -(ptrdiff_t)pageSize); };

Scott Lurndal

Mar 20, 2022, 2:12:27 PM
Bonita Montero <Bonita....@gmail.com> writes:
>I just wanted to measure the performance of unaligned accesses.
>For accesses within a page the difference was below the measurement
>error. Then I read that some processors put a penalty on unaligned
>accesses that cross page-boundaries.

If the second page misses in the TLB, there will be a TLB
fill penalty which for an application running in a virtual
machine will require 23 memory accesses to walk the page
table (anywhere from 3 to 5 for bare-metal translation table
walks depending on page size).

On a TLB Hit, there still may be a latency hit to obtain
the cache line for the first line of the second page.

Both of these will only hit when the relevant conditions
exist, so they'll be difficult to measure without access
to the TLB and cache flush instruction(s) on your target
architecture.
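
For what it's worth, the cache side of this can be forced from user space
on x86 with _mm_clflush from <immintrin.h>; the TLB cannot be flushed from
user space, so only the cache-miss case is reachable this way. A minimal
sketch (the page-aligned buffer and the iteration count are just
illustrative, not taken from the benchmark in this thread):

#include <immintrin.h>   // _mm_clflush, _mm_mfence (x86 only)
#include <chrono>
#include <cstdint>
#include <cstring>
#include <iostream>

int main()
{
    using namespace std::chrono;
    alignas( 4096 ) static char buf[2 * 4096];   // two adjacent pages
    char *boundary = buf + 4096;                 // the straddling load starts at boundary - 1
    volatile uint64_t sink = 0;

    constexpr size_t ROUNDS = 1'000'000;
    auto start = high_resolution_clock::now();
    for( size_t r = ROUNDS; r--; )
    {
        // evict both lines touched by the straddling load, so every iteration misses L1D
        _mm_clflush( boundary - 64 );
        _mm_clflush( boundary );
        _mm_mfence();
        uint64_t v;
        std::memcpy( &v, boundary - 1, sizeof v );   // unaligned load crossing the boundary
        sink = sink + v;
    }
    auto ns = duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count();
    std::cout << (double)ns / ROUNDS << " ns/round (dominated by the forced misses)\n";
}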

Bonita Montero

Mar 20, 2022, 2:36:55 PM
> If the second page misses in the TLB, there will be a TLB
> fill penalty which for an application running in a virtual
> machine will require 23 memory accesses to walk the page
> table (anywhere from 3 to 5 for bare-metal translation table
> walks depending on page size).

I'm measuring the penalty of crossing a page boundary where
both pages are already in the TLB - so this is irrelevant here.

> On a TLB Hit, there still may be a latency hit to obtain
> the cache line for the first line of the second page.

On all three of my PCs the times for an access crossing a page
boundary, an aligned access and an unaligned access within a page
are all the same.
I first thought that with an access crossing a page boundary the
CPU checks for duplicate loads in the queue of outstanding
OoO loads and satisfies them all from the same load. So I modified
my code a bit to have a configurable number of accesses to different
page boundaries:

#if defined(_WIN32)
    #include <Windows.h>
#elif defined(__unix__)
    #include <unistd.h>
    #include <sys/mman.h>   // mmap
#endif
#include <iostream>
#include <vector>
#include <cstdint>
#include <chrono>
#include <atomic>
#include <string_view>

using namespace std;
using namespace chrono;
using namespace chrono_literals;

using T = uint64_t;

atomic<T> aSum;

int main()
{
#if defined(_WIN32)
    SetThreadAffinityMask( GetCurrentThread(), 1 );
    if( !SetThreadPriority( GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL ) )
        SetThreadPriority( GetCurrentThread(), THREAD_PRIORITY_HIGHEST );
#endif

    auto bench = [&]<typename T, size_t NBoundaries>()
        requires is_scalar_v<T>
    {
        auto probe = [&]( vector<void *> const &addrs ) -> double
        {
            T sum = 0;
            size_t rounds = 0;
            nanoseconds nsSum( nanoseconds( 0 ) );
            size_t
                interval = 1'000'000 / addrs.size(),
                roundsPerInterval = interval * addrs.size();
            do
            {
                auto start = high_resolution_clock::now();
                for( size_t r = interval; r--; )
                    for( void *p : addrs )
                        sum += ((atomic<T> *)p)->load( memory_order_relaxed );
                nsSum += duration_cast<nanoseconds>( high_resolution_clock::now() - start );
                rounds += roundsPerInterval;
            } while( nsSum < 250ms );
            ::aSum = sum;
            return (double)(ptrdiff_t)rounds / (int64_t)nsSum.count();
        };
        auto getPageSize = []() -> size_t
        {
#if defined(_WIN32)
            SYSTEM_INFO si;
            GetSystemInfo( &si );
            return si.dwPageSize;
#elif defined(__unix__)
            return (size_t)sysconf( _SC_PAGESIZE );
#endif
        };
        size_t pageSize = getPageSize();
        auto allocPages = [&]( size_t nPages ) -> void *
        {
#if defined(_WIN32)
            return VirtualAlloc( nullptr, nPages * pageSize, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE );
#elif defined(__unix__)
            return mmap( nullptr, nPages * pageSize, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0 );
#endif
        };
        void *p = allocPages( NBoundaries + 1 );
        vector<void *> addrs;
        double times[3];
        ptrdiff_t offset = -1;
        do
        {
            addrs.resize( 0 );
            addrs.reserve( NBoundaries );
            for( size_t b = 0; b != NBoundaries; ++b )
                addrs.emplace_back( (void *)((size_t)p + pageSize + offset) );
            times[offset + 1] = probe( addrs );
        } while( ++offset <= 1 );
        auto pct = []( double tRel, double tBase ) -> int { return (int)((tRel / tBase - 1.0) * 100.0 + 0.5); };
        cout << "crossing page boundaries: " << pct( times[0], times[1] ) << "%" << endl;
        cout << "within page: " << pct( times[2], times[1] ) << "%" << endl;
    };
    bench.operator ()<T, 64>();
}

But this code also gives me the same access times for unaligned accesses
within a page, aligned accesses and accesses crossing a page boundary
on all three of my PCs.

> Both of these will only hit when the relevent conditions
> exist, so they'll be difficult to measure without access
> to the TLB and cache flush instruction(s) on your target
> architecture.

The TLB privilege checks are done on every access, even when the
translation is already cached in the TLB.

Chris M. Thomasson

Mar 20, 2022, 7:13:44 PM
On 3/20/2022 10:04 AM, Bonita Montero wrote:
> I just wanted to measure the performance of unaligned accesses.
> For accesses within a page the difference was below the measurement
> error. Then I read that some processors put a penalty on unaligned
> accesses that cross page-boundaries. So I wrote a little benchmark
> that tries to measure this. But for my PCs, a Ryzen Threadripper
> 3990X, a Ryzen 7 1800X and a Phenom X4 945 all measurements were
> also within the measurement error.
> So I'm interested if someone out there has a PC where unaligned
> accesses crossing a page has a performance penalty. So here's the
> benchmark in C++20:
[...]

IIRC, an unaligned atomic RMW would trigger a bus lock. Actually,
somebody used it to trigger a system-wide membar; take a look at QPI
quiescence:

https://chaelim.github.io/2017-04-29-qpi-quiescence


I cannot find the damn original paper right now, but the Windows API
actually has a function that allows one to build asymmetric synchronization:

https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-flushprocesswritebuffers


Bonita Montero

Mar 21, 2022, 1:58:37 AM
> Iirc, an unaligned atomic RMW would trigger a bus lock. Actually,
> somebody used it to trigger a system wide membar, take a look at QPI
> quiescence:

That's a completely different topic.

Juha Nieminen

Mar 21, 2022, 1:59:33 AM
Bonita Montero <Bonita....@gmail.com> wrote:
> I just wanted to measure the performance of unaligned accesses.
> For accesses within a page the difference was below the measurement
> error. Then I read that some processors put a penalty on unaligned
> accesses that cross page-boundaries. So I wrote a little benchmark
> that tries to measure this. But for my PCs, a Ryzen Threadripper
> 3990X, a Ryzen 7 1800X and a Phenom X4 945 all measurements were
> also within the measurement error.

IIRC both Intel and AMD got rid of the unaligned access penalty
at some point (relatively recently). You would need an older CPU
in order to get the penalty.

(Btw, I assume you know that unaligned access is undefined behavior per the standard.)
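
The usual standard-conformant way to express such an access is std::memcpy
into a properly aligned object; mainstream compilers lower it to a single
machine load on x86 and ARMv8. A small sketch (the helper name and the
offset are just illustrative):

#include <cstdint>
#include <cstring>

// Read a uint64_t from an arbitrarily aligned address without UB;
// compilers typically emit a plain 8-byte load for this on x86/ARMv8.
inline uint64_t load_u64( const void *p )
{
    uint64_t v;
    std::memcpy( &v, p, sizeof v );
    return v;
}

int main()
{
    char buf[64] = {};
    return (int)load_u64( buf + 3 );   // deliberately misaligned source address
}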

Bonita Montero

Mar 21, 2022, 3:56:56 AM
> IIRC both Intel and AMD got rid of the unaligned access penalty
> at some point (relatively recently). You would need an older CPU
> in order to get the penalty.

I didn't want to measure the penalty of unaligned accesses in general,
but of those unaligned accesses that cross a page boundary.

In my first code in this thread I repeatedly accessed the same word
which crosses a page boundary, and there was no difference in access
times between such accesses and aligned accesses. Then I thought the
processor could bypass further loads or join multiple loads in the
load queue so that I wouldn't notice the penalty of the first access.
So in my second code I accessed a row of addresses which crossed
different page boundaries; but this code had a bug and also didn't
measure a difference. I corrected it, and with the corrected code I
got a penalty of nearly 50% when accessing an unaligned address
crossing a page boundary. So I was right to assume that my CPUs, even
my old 2009 Phenom II X4 945, all join multiple loads to the same
address in the load queue!

Scott Lurndal

Mar 21, 2022, 10:49:42 AM
"Chris M. Thomasson" <chris.m.t...@gmail.com> writes:
>On 3/20/2022 10:04 AM, Bonita Montero wrote:
>> I just wanted to measure the performance of unaligned accesses.
>> For accesses within a page the difference was below the measurement
>> error. Then I read that some processors put a penalty on unaligned
>> accesses that cross page-boundaries. So I wrote a little benchmark
>> that tries to measure this. But for my PCs, a Ryzen Threadripper
>> 3990X, a Ryzen 7 1800X and a Phenom X4 945 all measurements were
>> also within the measurement error.
>> So I'm interested if someone out there has a PC where unaligned
>> accesses crossing a page has a performance penalty. So here's the
>> benchmark in C++20:
>[...]
>
>Iirc, an unaligned atomic RMW would trigger a bus lock. Actually,
>somebody used it to trigger a system wide membar, take a look at QPI
>quiescence:

The keyword there is 'atomic' which requires the lock prefix on x86.

Note that there are other cases where Intel and AMD processors will
assert the system-wide bus lock - for example, if a competing spin
lock doesn't win for some period of time, the core will assert the
bus lock to ensure forward progress.

This can seriously impact performance[*], as we found out when we
put one of our large SMP machines at LLNL a couple of decades ago.

[*] Granted, this was a degenerate case tested for explicitly by
the LLNL researchers, where all 256 cores were spinning on the
same spinlock in a NUMA machine with a 10x difference
in latency between local and remote memory.

Scott Lurndal

Mar 21, 2022, 10:53:22 AM
The barrel-shifter penalty is gone. However, it is unavoidable
that such an access will need to touch _two_ L1D cache lines, and
the hardware thread/core cannot do both accesses in parallel. Note that
load-to-use latencies on x86 processors are about 4 clocks when
they hit in L1D. Much of that additional latency will be hidden
by out-of-order execution, unless there is an immediate dependency
upon the value loaded.
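
One way to create such an immediate dependency is a pointer-chasing loop,
where the address of each load is the value of the previous load, so
out-of-order execution cannot overlap them. A rough sketch (not the
benchmark from this thread; the self-referencing slot and the volatile
handoff are only there to keep the optimizer from folding the chain):

#include <chrono>
#include <cstdint>
#include <cstring>
#include <iostream>

char *volatile chainStart;   // volatile: the optimizer cannot assume what it reads back

int main()
{
    using namespace std::chrono;
    alignas( 64 ) static char buf[128] = {};
    char *p = buf + 63;                  // an 8-byte slot straddling a cache-line boundary
    std::memcpy( p, &p, sizeof p );      // the slot points to itself
    chainStart = p;

    constexpr size_t ROUNDS = 100'000'000;
    char *q = chainStart;                // opaque to the optimizer
    auto start = high_resolution_clock::now();
    for( size_t r = ROUNDS; r--; )
    {
        char *next;
        std::memcpy( &next, q, sizeof next );   // the address of each load depends on the previous one
        q = next;
    }
    auto ns = duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count();
    std::cout << (double)ns / ROUNDS << " ns per dependent load (" << (void *)q << ")\n";
}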

Scott Lurndal

Mar 21, 2022, 1:00:13 PM
Clock   Operation
-----   -----------------
  1     CAM the VA against the L1D TLB to get the PA  [assumes TLB hit]
  2     Using bits from the PA to select the set, CAM the ways  [assumes L1D hit]
  3     Load the data from SRAM onto the core bus.
  4     Load the data from the core bus into the register.


This needs to be done for both cache lines, sequentially.

Now, given that BM is using a C++ compiler, it is likely
the compiler is hoisting the load and/or reordering instructions
to cover that expected latency. Adding in the core's out-of-order
execution, I doubt that BM can derive any valid latency information
about misaligned cache-line- or page-crossing accesses using C++. Note that
an access that crosses a page boundary must by definition also
cross a cache-line boundary.

Tim Rentsch

Apr 18, 2022, 11:59:49 PM
I assume BM is talking about unaligned access in the actual
machine. An access that is unaligned in the actual machine may
be an aligned access as far as the abstract machine is concerned.

Bonita Montero

Apr 19, 2022, 12:45:26 AM
On 21.03.2022 at 06:59, Juha Nieminen wrote:
> Bonita Montero <Bonita....@gmail.com> wrote:
>> I just wanted to measure the performance of unaligned accesses.
>> For accesses within a page the difference was below the measurement
>> error. Then I read that some processors put a penalty on unaligned
>> accesses that cross page-boundaries. So I wrote a little benchmark
>> that tries to measure this. But for my PCs, a Ryzen Threadripper
>> 3990X, a Ryzen 7 1800X and a Phenom X4 945 all measurements were
>> also within the measurement error.
>
> IIRC both Intel and AMD got rid of the unaligned access penalty
> at some point (relatively recently). ...


No, for a long time. Unaligned accesses are quite common for
I/O purposes. The only unaligned access that is inefficient
is one that crosses a page boundary (almost twice the latency).

Juha Nieminen

Apr 19, 2022, 8:51:41 AM
Bonita Montero <Bonita....@gmail.com> wrote:
>> IIRC both Intel and AMD got rid of the unaligned access penalty
>> at some point (relatively recently). ...
>
> No, for a long time. Unaligned accesses are quite common for
> I/O purposes. The only unaligned access that is inefficient
> is when you cross a page-boundary (almost twice the lantency).

AFAIK Intel got rid of the penalty with the Sandy Bridge architecture
(i5-2400, i5-2500, i7-2600, i7-2700, etc). While that was 11 years ago,
that's "relatively recently" considering Intel's entire history.

Bonita Montero

Apr 19, 2022, 9:17:39 AM
Maybe you're right. I just ran my unaligned benchmark on an old
Phenom X4 945 under Ubuntu 20.04:

0x200: 50%
0x400: 51%
0x800: 50%
0x1000: 50%
0x2000: 50%
0x4000: 59%
0x8000: 59%
0x10000: 69%
0x20000: 95%
0x40000: 95%
0x80000: 94%
0x100000: 90%
0x200000: 92%
0x400000: 92%
0x800000: 91%
0x1000000: 92%
0x2000000: 91%
0x4000000: 92%
0x8000000: 92%

The percentage is the speed of unaligned accesses relative to aligned
accesses at the same block size. With larger block sizes the memory
or higher-level cache access time becomes a larger part of the total
access time, and since a whole 64-byte cache line is fetched anyway
and the access mostly touches only one cache line, the relative
penalty shrinks.

Scott Lurndal

Apr 19, 2022, 10:11:35 AM
Bonita Montero <Bonita....@gmail.com> writes:
>Am 21.03.2022 um 06:59 schrieb Juha Nieminen:

>>
>> IIRC both Intel and AMD got rid of the unaligned access penalty
>> at some point (relatively recently). ...
>
>
>No, for a long time. Unaligned accesses are quite common for
>I/O purposes.

Actually, they are very uncommon for I/O purposes. 99.99% of
all MMIO accesses are aligned. 99.9999% of all DMA accesses
are aligned.

>The only unaligned access that is inefficient
>is when you cross a page-boundary (almost twice the lantency).

Actually, it's when you cross a cache line boundary that
the latency starts to increase. A TLB miss may add
additional latency.

Bonita Montero

Apr 19, 2022, 10:15:26 AM
> Actually, they are very uncommon for I/O purposes. ...

Data structures for file or network-I/O often have unaligned
basic types.
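
A typical case is an on-wire or on-disk record whose layout leaves
multi-byte fields at odd offsets; a hypothetical example (the record
layout is invented, not from any real protocol):

#include <cstdint>
#include <cstring>

// Hypothetical on-wire record: a 1-byte tag followed immediately by a
// 64-bit payload, so the payload sits at offset 1 in the byte stream.
struct WireRecord
{
    uint8_t  tag;
    uint64_t payload;   // unaligned in the serialized form
};

// Parse the record from a raw byte buffer; memcpy keeps it well-defined
// even though the source address of the payload is misaligned.
inline WireRecord parseRecord( const unsigned char *bytes )
{
    WireRecord r;
    r.tag = bytes[0];
    std::memcpy( &r.payload, bytes + 1, sizeof r.payload );
    return r;
}

int main()
{
    unsigned char buf[9] = { 0x01, 1, 2, 3, 4, 5, 6, 7, 8 };
    return (int)parseRecord( buf ).payload;
}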

> Actually, it's when you cross a cache line boundary that
> the latency starts to increase. ...

No, absolutely not.

Chris M. Thomasson

Apr 27, 2022, 4:52:17 AM
Straddling a cache line is very bad... NO, try not to do it!

Bonita Montero

Apr 27, 2022, 2:02:42 PM
That hasn't been a problem on x86 and ARMv8 CPUs for a long time.
Only crossing a page boundary involves two privilege checks.

Bonita Montero

Apr 28, 2022, 2:19:24 AM
Try this:

#if defined(_WIN32)
    #include <Windows.h>
#elif defined(__unix__)
    #include <sys/mman.h>
    #include <pthread.h>
#endif
#include <iostream>
#include <string_view>
#include <memory>
#include <thread>
#include <vector>
#include <latch>
#include <atomic>
#include <chrono>
#include <cstdint>
#include <semaphore>

using namespace std;
using namespace chrono;

int main()
{
    constexpr size_t
#if defined(__cpp_lib_hardware_interference_size)
        CL_SIZE = hardware_destructive_interference_size,
#else
        CL_SIZE = 64,
#endif
        BLOCK_SIZE = 0x1000,
        ROUNDS = 10'000'000;
#if defined(_WIN32)
    char *begin = (char *)VirtualAlloc( nullptr, BLOCK_SIZE, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE );
#elif defined(__unix__)
    char *begin = (char *)mmap( nullptr, BLOCK_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0 );
#endif
    char *end = begin + BLOCK_SIZE;
    atomic_uint readyCountDown;
    binary_semaphore semReady( false );
    counting_semaphore semRun( 0 );
    atomic_uint synch;
    atomic_uint64_t nsSum;
    auto theThread = [&]( ptrdiff_t offset )
    {
        if( readyCountDown.fetch_sub( 1, memory_order_relaxed ) == 1 )
            semReady.release();
        semRun.acquire();
        if( synch.fetch_sub( 1, memory_order_relaxed ) != 1 )
            while( synch.load( memory_order_relaxed ) );
        auto start = high_resolution_clock::now();
        for( size_t r = ROUNDS; r--; )
            for( char *p = begin + CL_SIZE; p != end; p += CL_SIZE )
                (void)((atomic_uint &)p[offset]).load( memory_order_relaxed );
        nsSum.fetch_add( (uint64_t)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count(), memory_order_relaxed );
    };
    unsigned hc = thread::hardware_concurrency();
    vector<jthread> threads;
    threads.reserve( 2 );
    for( ptrdiff_t offset = 0; offset >= -1; --offset )
    {
        cout << "offset: " << offset << endl;
        for( unsigned nThreads = 1; nThreads <= 2; ++nThreads )
        {
            readyCountDown.store( nThreads, memory_order_relaxed );
            synch.store( nThreads, memory_order_relaxed );
            nsSum.store( 0, memory_order_relaxed );
            for( unsigned t = 0; t != nThreads; ++t )
                threads.emplace_back( theThread, offset );
            semReady.acquire();
            auto setAff = []( jthread::native_handle_type handle, unsigned cpu )
            {
#if defined(_WIN32)
                if( !SetThreadAffinityMask( handle, (DWORD_PTR)1 << cpu ) )
                    ExitProcess( EXIT_FAILURE );
#elif defined(__unix__)
                cpu_set_t cpuSet;
                CPU_ZERO(&cpuSet);
                CPU_SET(cpu, &cpuSet);
                if( pthread_setaffinity_np( handle, sizeof cpuSet, &cpuSet ) )
                    exit( EXIT_FAILURE );
#endif
            };
            for( size_t t = 0; t != nThreads; ++t )
                setAff( threads[t].native_handle(), hc / 2 * (unsigned)t );
            semRun.release( nThreads );
            threads.resize( 0 );
            cout << "\t" << nThreads << ": " << (double)(int64_t)nsSum.load( memory_order_relaxed ) / ((int)nThreads * 1.0e9) << endl;
        }
    }
}

On my computer (Ryzen Threadripper 3990X) this shows that accesses
crossing a cacheline boundary are only about 5% slower, and it doesn't
matter whether one or two threads access the memory. I thought
that with two threads the accesses might be slower because unaligned
accesses might occupy more than one load unit and more threads might
occupy four load units - but my CPU has only two.

Bonita Montero

Apr 28, 2022, 6:19:11 AM
There was a bug in my code. Now it's correct:
    static
    struct offset_t
    {
        ptrdiff_t offset;
        char const *description;
    } const offsets[] =
    {
        {  0, "aligned" },
        {  1, "unaligned" },
        { -1, "unaligned, crossing cacheline boundary" }
    };
    for( offset_t const &off : offsets )
    {
        cout << off.description << ":" << endl;
        for( unsigned nThreads = 1; nThreads <= 2; ++nThreads )
        {
            readyCountDown.store( nThreads, memory_order_relaxed );
            synch.store( nThreads, memory_order_relaxed );
            nsSum.store( 0, memory_order_relaxed );
            for( unsigned t = 0; t != nThreads; ++t )
                threads.emplace_back( theThread, off.offset );
            semReady.acquire();
            auto setAff = []( jthread::native_handle_type handle, unsigned cpu )
            {
#if defined(_WIN32)
                if( !SetThreadAffinityMask( handle, (DWORD_PTR)1 << cpu ) )
                    ExitProcess( EXIT_FAILURE );
#elif defined(__unix__)
                cpu_set_t cpuSet;
                CPU_ZERO(&cpuSet);
                CPU_SET(cpu, &cpuSet);
                if( pthread_setaffinity_np( handle, sizeof cpuSet, &cpuSet ) )
                    exit( EXIT_FAILURE );
#endif
            };
            for( size_t t = 0; t != nThreads; ++t )
                setAff( threads[t].native_handle(), hc / 2 * (unsigned)t );
            semRun.release( nThreads );
            threads.resize( 0 );
            cout << "\t" << nThreads << ": " << (double)(int64_t)nsSum.load( memory_order_relaxed ) / ((int)nThreads * 1.0e9) << endl;
        }
    }
}

But the access times are still nearly the same, i.e. crossing
a cacheline boundary is nearly free.

Chris M. Thomasson

Apr 28, 2022, 7:01:07 PM
On 4/28/2022 3:19 AM, Bonita Montero wrote:
> There was a bug in my code. Now it's correct:
[...]
> But the access-times are nearly still the same, i.e. crossing
> a cacheline-boundary is nearly for free.

One big problem is that it can lead to false sharing. If threads, say A
and B, are working on their own data sets, all padded to and aligned on
cache lines, fine. Now, if they were not properly padded and aligned,
thread B can interfere with thread A and vice versa via false sharing.
This is not good at all. Remember that old hyperthreading aliasing
issue? ;^)
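
The usual cure is to pad and align each thread's hot data to its own cache
line; a small sketch, assuming C++17 or later (the two-counter layout is
just illustrative):

#include <atomic>
#include <cstdint>
#include <iostream>
#include <new>       // hardware_destructive_interference_size
#include <thread>

#if defined(__cpp_lib_hardware_interference_size)
constexpr std::size_t CL_SIZE = std::hardware_destructive_interference_size;
#else
constexpr std::size_t CL_SIZE = 64;
#endif

// Each counter occupies its own cache line, so the two writer threads
// never invalidate each other's line (no false sharing).
struct alignas(CL_SIZE) PaddedCounter
{
    std::atomic<std::uint64_t> value{ 0 };
};

int main()
{
    PaddedCounter counters[2];
    auto writer = []( PaddedCounter &c )
    {
        for( std::size_t r = 10'000'000; r--; )
            c.value.fetch_add( 1, std::memory_order_relaxed );
    };
    std::thread a( writer, std::ref( counters[0] ) );
    std::thread b( writer, std::ref( counters[1] ) );
    a.join();
    b.join();
    std::cout << counters[0].value.load() + counters[1].value.load() << "\n";
}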

Bonita Montero

Apr 28, 2022, 11:52:36 PM
> One big problem is that it can lead to false-sharing. ...

Totally different story ...

Chris M. Thomasson

May 4, 2022, 9:07:12 PM
On 4/28/2022 8:52 PM, Bonita Montero wrote:
>> One big problem is that it can lead to false-sharing. ...
>
> Totally different story ...

Have you ever had to deal with a false sharing bug? The program works,
but it's slow..... Slower than a snail hiking up a mountain of salt?
Well, shit.

Bonita Montero

May 5, 2022, 12:12:58 AM
I'm talking about unaligned accesses and not false sharing.

Chris M. Thomasson

May 5, 2022, 12:27:55 AM
Straddling a cache line can cause false sharing...

Bonita Montero

May 5, 2022, 1:37:01 AM
That's not what I'm talking about.

Öö Tiib

May 5, 2022, 7:34:59 AM
But don't you see, BM does not want to talk about it. He uses
unaligned accesses only in his single-threaded software.

Chris M. Thomasson

May 5, 2022, 3:16:23 PM
Yeah, I am being rather intrusive here. Sorry everybody. ;^o