On 20/12/2023 18:08, Theo wrote:
> Vir Campestris <vir.cam...@invalid.invalid> wrote:
>> This is not the right group for this - but I don't know where is.
>> Suggestions on a postcard please...
> I'm crossposting this to comp.arch, where they may have some ideas.
> For 'series length 8B/16B/32B' do you mean 8 bytes? ie 8B is a single 64
> bit word transferred?
Yes. My system has a 64 bit CPU and 64 bit main memory.
> What instruction sequences are being generated for the 8/16/32/64 byte
> loops? I'm wondering if the compiler is using different instructions,
> eg using MMX, SSE, AVX to do the operations. Maybe they are having
> different caching behaviour?
It's running the same loop each time, just with different values for
the loop sizes.
> It would help if you could tell us the compiler and platform you're using,
> including version.
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
Which of course tells you I'm running Ubuntu!
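Since the platform matters here, it may also help to report the cache sizes the OS believes the CPU has. Below is a minimal sketch, assuming a glibc system such as Ubuntu: the `_SC_LEVEL*` names are glibc extensions rather than POSIX, so they are guarded with `#ifdef`, and `cache_size_bytes` is a name made up for illustration.

```cpp
#include <unistd.h>

// Returns the size in bytes of one cache level as reported by sysconf().
// Hypothetical helper: level 1 is the L1 data cache, 2 and 3 are L2 and L3.
// Returns -1 for an unsupported level or when the constant is not defined.
// The _SC_LEVEL* constants are glibc extensions (an assumption about the
// platform); sysconf() itself may report 0 or -1 if the value is unknown.
long cache_size_bytes(int level)
{
    switch (level) {
#ifdef _SC_LEVEL1_DCACHE_SIZE
    case 1: return sysconf(_SC_LEVEL1_DCACHE_SIZE);
#endif
#ifdef _SC_LEVEL2_CACHE_SIZE
    case 2: return sysconf(_SC_LEVEL2_CACHE_SIZE);
#endif
#ifdef _SC_LEVEL3_CACHE_SIZE
    case 3: return sysconf(_SC_LEVEL3_CACHE_SIZE);
#endif
    default: return -1;
    }
}
```

Printing `cache_size_bytes(1)`, `cache_size_bytes(2)` and `cache_size_bytes(3)` alongside the timings would give readers something to compare the set sizes against.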
On 20/12/2023 18:58, MitchAlsup wrote:
> Can we see the code ??
> Can you present a table of the timing results ??
I've run this with finer increments on the size, but here are my
results for powers of 2. Size is the count of 64-bit words in the set.
Size 1 gave 3.82242e+09 bytes/second.
Size 2 gave 3.80533e+09 bytes/second.
Size 4 gave 2.68017e+09 bytes/second.
Size 8 gave 2.33751e+09 bytes/second.
Size 16 gave 2.18424e+09 bytes/second.
Size 32 gave 2.10243e+09 bytes/second.
Size 64 gave 1.99371e+09 bytes/second.
Size 128 gave 1.98475e+09 bytes/second.
Size 256 gave 2.01653e+09 bytes/second.
Size 512 gave 2.00884e+09 bytes/second.
Size 1024 gave 2.02713e+09 bytes/second.
Size 2048 gave 2.01803e+09 bytes/second.
Size 4096 gave 3.26472e+09 bytes/second.
Size 8192 gave 3.85126e+09 bytes/second.
Size 16384 gave 3.85377e+09 bytes/second.
Size 32768 gave 3.85293e+09 bytes/second.
Size 65536 gave 2.06793e+09 bytes/second.
Size 131072 gave 2.06845e+09 bytes/second.
The program carries on to larger sizes, but the results are roughly
stable from here on.
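Because Size counts 64-bit words and not bytes, the table is easier to set against cache sizes once it is converted. The sketch below does only that arithmetic; `working_set_bytes` and `human_bytes` are names invented for illustration. It says nothing about what the caches on any particular machine actually are.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Bytes touched by one pass over a set of `set_size` 64-bit words.
constexpr std::size_t working_set_bytes(std::size_t set_size)
{
    return set_size * sizeof(std::uint64_t);
}

// Formats a byte count with binary units when it divides exactly,
// e.g. 32768 becomes "32 KiB". Counts that do not divide exactly
// are left in the largest unit that does.
std::string human_bytes(std::size_t bytes)
{
    static const char *units[] = {"B", "KiB", "MiB", "GiB"};
    int u = 0;
    while (bytes >= 1024 && bytes % 1024 == 0 && u < 3) {
        bytes /= 1024;
        ++u;
    }
    return std::to_string(bytes) + " " + units[u];
}
```

On this reckoning Size 4096 is a 32 KiB working set, Size 32768 is 256 KiB, and Size 65536 is 512 KiB.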
I have put the code in a signature block, so there's no risk of
someone quoting it all back. I've commented it, though no doubt not in
all the right places! I'd be interested to know what results other
people get.
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    // If your computer is much slower or faster than mine
    // you may need to adjust this value.
    constexpr size_t NextCount = 1 << 28;
    std::vector<uint64_t> CacheStore(NextCount, 0);
    // Get a raw pointer to the vector.
    // On my machine (Ubuntu, g++) this improves
    // performance. Using vector's operator[]
    // results in a function call.
    uint64_t *CachePtr = &CacheStore[0];
    // SetSize is the count of the uint64_t items to be tested.
    // I assume that when this is too big for a cache the data
    // will overflow to the next level.
    // Each loop it doubles in size. I've run it with smaller
    // increments too, and the behaviour
    // is still confusing.
    for (size_t SetSize = 1; SetSize < NextCount; SetSize <<= 1)
    {
        size_t loopcount = 0;
        size_t j = NextCount / SetSize;
        auto start = std::chrono::steady_clock::now();
        // The outer loop repeats enough times so that the data
        // written by the inner loops of various sizes is
        // approximately constant.
        for (size_t k = 0; k < j; ++k)
        {
            // The inner loop modifies data
            // within a set of words.
            for (size_t l = 0; l < SetSize; ++l)
            {
                // read-modify-write some data
                // (an increment will do).
                // Comment this out
                // to confirm that the looping is not
                // the cause of the anomaly
                ++CachePtr[l];
                // this counts the actual number
                // of memory accesses.
                // rounding errors mean that for
                // different SetSize values
                // the count is not completely
                // consistent.
                ++loopcount;
            }
        }
        // Work out how long the loops took in microseconds,
        // then scale to seconds. The cast is needed because
        // steady_clock's native tick is not microseconds.
        auto delta =
            std::chrono::duration_cast<std::chrono::microseconds>(
                std::chrono::steady_clock::now() - start).count()
            / 1e6;
        // calculate how many bytes per second, and print.
        std::cout << "Size " << SetSize << " gave "
                  << (double)loopcount * (double)sizeof(uint64_t) /
                         delta << " bytes/second." << std::endl;
    }
    return 0;
}
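For anyone who wants to reproduce this with optimisation turned on, here is a sketch of one timed run as a standalone function. It is not the code that produced the table above: the name `measure_bytes_per_second` and its parameters are made up, and it goes through a `volatile` pointer on the assumption that you want the compiler to keep every load and store and not collapse or vectorise the loop. That choice may well change the numbers.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// One timed run: sweeps the first `set_size` words of `data` repeatedly
// until about `total` read-modify-writes have been done, and returns
// bytes per second. Hypothetical helper, not taken from the post.
// Returns 0.0 for arguments that make no sense.
double measure_bytes_per_second(std::vector<std::uint64_t> &data,
                                std::size_t set_size, std::size_t total)
{
    if (set_size == 0 || set_size > data.size() || total < set_size)
        return 0.0;

    // volatile forces one real load and one real store per word.
    volatile std::uint64_t *p = data.data();
    const std::size_t passes = total / set_size;

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t k = 0; k < passes; ++k)
        for (std::size_t l = 0; l < set_size; ++l)
            p[l] = p[l] + 1;
    // duration<double> yields seconds whatever the clock's tick is.
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;

    const double bytes = static_cast<double>(passes) *
                         static_cast<double>(set_size) *
                         sizeof(std::uint64_t);
    return elapsed.count() > 0.0 ? bytes / elapsed.count() : 0.0;
}
```

Calling it several times per size and reporting the best or median figure would help separate a genuine cache effect from scheduling noise.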