
Ideas for a log in memory


Thiago Adams

Aug 15, 2021, 8:09:19 AM

I would like to have a fast data structure to keep logs
in memory.

The requirements are not fixed, but it would be nice if:

1 - Print recent entries first

09:10 Event at 9:10
08:00 Event at 8:00
...
...

2 - I think the total size could be fixed, removing old messages
when new ones are created.

3 - Avoiding malloc for each entry.


I was also wondering how terminal buffers are implemented.
For instance:

c:\command 1
c:\command 2
...
c:\command N

What happens when the buffer is full?
Is all the memory shifted each time a new command is added?

The solution I was thinking about was some kind of circular buffer,
but I can see there are many possible variations, so I am asking for
ideas.



Paul N

Aug 15, 2021, 2:04:38 PM

If the lines are all of roughly the same length you can use a circular buffer: just an array of strings of a fixed total length. There's no need to move the strings around; just overwrite old ones with the new ones, and keep track of where you've got to and where you started.
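
Something like this minimal sketch, for instance (the sizes and names here are just made up for illustration):

#include <stdio.h>
#include <string.h>

#define LINE_COUNT   8
#define LINE_MAX_LEN 128

static char lines[LINE_COUNT][LINE_MAX_LEN];
static size_t head = 0;   /* index of the oldest entry */
static size_t count = 0;  /* number of stored entries */

void log_push(const char *msg)
{
    size_t slot = (head + count) % LINE_COUNT;
    strncpy(lines[slot], msg, LINE_MAX_LEN - 1);
    lines[slot][LINE_MAX_LEN - 1] = '\0';
    if (count < LINE_COUNT)
        count++;
    else
        head = (head + 1) % LINE_COUNT;  /* overwrite the oldest */
}

void log_print_newest_first(void)
{
    for (size_t i = count; i-- > 0; )
        puts(lines[(head + i) % LINE_COUNT]);
}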

If it's wasteful to use the same length for each message, perhaps store them as a sort of linked list, eg:

[A][B]08:00 Event at 8:00[C][D]09:10 Event at 9:10[E] etc

where B stores some sort of pointer to C, C a pointer to B, D a pointer to E etc. Keep track of where the first and last pointers are. When storing a new message, remove enough old messages to leave space for the new one. The text of a message should be able to wrap around from the end to the start of the array.

Just a suggestion!
Paul.

Lew Pitcher

Aug 15, 2021, 2:41:52 PM

With a sufficiently large block of memory, you could use a character-granularity circular buffer to
store free-form, sentinel-terminated lines. No need for linked lists or fixed-length arrays.
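
Roughly like this sketch of the write side (sizes and names invented for illustration; reading back has to cope with old lines that get partially overwritten):

#include <stddef.h>

#define RING_SIZE 4096

static char ring[RING_SIZE];
static size_t wpos = 0;  /* next write position */

static void ring_putc(char c)
{
    ring[wpos] = c;
    wpos = (wpos + 1) % RING_SIZE;  /* wrap-around per character */
}

void ring_put_line(const char *s)
{
    while (*s)
        ring_putc(*s++);
    ring_putc('\0');  /* the sentinel terminating each line */
}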

--
Lew Pitcher
"In Skills, We Trust"

Thiago Adams

Aug 15, 2021, 3:14:50 PM

Yes, I am trying something like that.
The problem is that this design does not help with printing in reverse order.

I am thinking of a kind of linked list where each node points to the previous
one. Nodes are placed on top of the same contiguous memory, and new nodes can
overwrite old ones. This is similar to an allocator, but instead of reserving some
amount of memory by size up front, I place the characters and compute the size while
writing them. The problem is that maybe it is too complicated..





Mark Bluemel

Aug 16, 2021, 3:37:19 AM

This is not exactly a novel requirement.
Multiple systems I work with have such things, but a) the code is proprietary and b) it's not stuff I work with, so I can't give much insight.
However, I'd expect there are suitable open-source examples which you could find with a quick web search. What have you already looked at, and why did they not meet your needs?

Bonita Montero

Aug 16, 2021, 8:18:39 AM

On 15.08.2021 at 14:09, Thiago Adams wrote:

> 3 - Avoiding malloc for each entry.

Logs usually don't consume enough memory for this to
really be necessary.

Bart

Aug 16, 2021, 8:57:45 AM

Especially if the events are 70 seconds (and maybe 70 minutes) apart.

John Dill

Aug 16, 2021, 9:57:35 AM

I believe one name for this is a bipartite buffer, or bip buffer for short.

/*!
 * \brief Defines the bip buffer overflow policy.
 *
 * This enumeration defines the available policies to handle when the
 * inserted object will overflow the bip buffer.
 *
 * <ul>
 * <li> \c c_bip_buffer_discard - This policy will not insert an object
 *      that would overflow the bip buffer.
 * <li> \c c_bip_buffer_overwrite - This policy invalidates objects
 *      until enough memory is available to write it in the bip buffer.
 * <li> \c c_bip_buffer_auto_resize - This policy automatically resizes
 *      the bip buffer when an object overflows the buffer.
 * </ul>
 */
enum c_bip_buffer_overflow_policy
{
    c_bip_buffer_discard,
    c_bip_buffer_overwrite,
    c_bip_buffer_auto_resize
};

/*!
 * \struct c_bip_buffer
 * \brief This structure implements a bipartite buffer.
 */
struct c_bip_buffer
{
    /*! \brief This variable stores the bip buffer's allocated memory. */
    void* buffer;

    /*! \brief This variable points to the first object in the bip buffer. */
    void* head;

    /*! \brief This variable points to the end of the bip buffer. */
    void* tail;

    /*! \brief This variable points to the end of the bipartite region. */
    void* bip;

    /*!
     * \brief This variable stores the end of the bip buffer's memory
     * buffer.
     */
    void* end;

    /*!
     * \brief This variable stores the number of bytes used by objects
     * stored in the bip buffer.
     */
    size_t size;

    /*! \brief This variable stores the allocated buffer size. */
    size_t alloc_size;

    /*! \brief This variable stores the bip buffer overflow policy. */
    enum c_bip_buffer_overflow_policy overflow_policy;

    /*!
     * \brief This variable is the buffer load factor.
     *
     * The load factor specifies the maximum percentage of used space
     * when resizing the buffer. This value is used only when an object
     * overflows the buffer and the overflow policy is set to
     * \c c_bip_buffer_auto_resize. It is primarily used to define
     * the bip buffer resize behavior in response to bursts of data.
     *
     * Using a low load factor during a burst sequence of data will
     * result in fewer expensive buffer resize operations. However, this
     * can result in large memory inefficiency when the data rate is
     * reduced back to normal levels.
     *
     * If the bip buffer is used to store a large group of objects for
     * a long period of time, and the amount of memory to allocate is
     * unknown, then the load factor will determine how often the buffer
     * resizes to make space for more objects.
     */
    float load_factor;

    /*! \brief This variable is a queue of object descriptors. */
    struct c_queue* object_queue;
};

You need a queue of object descriptors that define the size and
location of each object in the bip buffer that you manage. I decided
to store that externally, but you could store it internally as well.

You may also want to consider alignment issues if you want to
generalize it to storing objects of varying alignments.

Hope it helps,
John D.

Thiago Adams

Aug 16, 2021, 6:01:14 PM

I have implemented two ideas.

1 - circular buffer of chars.
Each entry writes its text plus a null terminator at the end.
"abcd"
"123"
Result:
abcd*123* (circular buffer of chars, * means \0)

To print from new entries to old, I need to search backwards
until I find *. Then I move forward, printing characters until
I find * again, and continue the same process from the previous
backward position. It is a linear zig-zag.

For instance, I need to go backwards 3 2 1 * and then forwards 1 2 3.
Then backwards d c b a * and forwards a b c d,

to print the final result (newest on top):

"123"
"abcd"

2 - circular buffer of blocks of chars.
Each block has a fixed number of chars.
To insert entries I fill blocks, adding a new one if necessary.

For instance, with blocks of 5 chars:

[abcd*][123**]

zeros are added as padding.

To print from new to old I do something similar to (1), but block
by block instead of char by char. If a block ends with *, it is
an end block and the next one is the beginning of an entry.

Blocks can be small, like 10 chars, or around 80 chars to get almost
one block per entry (line of log).

I am not sure which one is the best for my usage.

1 - is memory efficient.
2 - is faster to print from new to old.
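
For illustration, here is a compressed sketch of the zig-zag print from (1), assuming for simplicity that the buffer has not wrapped yet (names are made up):

#include <stdio.h>

void print_newest_first(const char *buf, size_t end)
{
    size_t pos = end;  /* one past the '\0' of the newest entry */
    while (pos > 0) {
        size_t start = pos - 1;           /* skip the entry's own '\0' */
        while (start > 0 && buf[start - 1] != '\0')
            start--;                      /* walk back to the entry start */
        puts(&buf[start]);                /* walk forward to print it */
        pos = start;                      /* continue before this entry */
    }
}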


Bart

Aug 16, 2021, 8:13:38 PM

Some stats are missing from your posts:

* How frequently will messages be generated?

* How many bytes on average will each be?

* How many messages do you want to record before overwriting them?

* How often are you going to be printing them out?

Because on a typical desktop machine this is unlikely to be challenging,
or slow down the main application, unless you have many 1000s of
messages per second and the messages occupy 10s or 100s of MB.

But if it's only so many per minute, you could even just dump them to a
file.

Or is this going to run on some tiny system?

Thiago Adams

Aug 16, 2021, 8:59:36 PM

This is for an HTTP server. I will log requests, events, errors...
The frequency depends on how many clients there will be in the future.


> * How many bytes on average will each be?
20-120 cols. Average maybe 80.

> * How many messages do you want to record before overwriting them?
I guess 4 pages of info is OK. (This also depends on frequency.)
For more than that I would need to read log files or read "pages of logs".

> * How often are you going to be printing them out?
When the sysadmin (via web browser) asks: "I want to see the logs".
Maybe also on the console for debug purposes.

> Because on a typical desktop machine this is unlikely to be challenging,
> or slow down the main application, unless you have many 1000s of
> messages per second and the messages occupy 10s or 100s of MB.
>
> But if it's only so many per minute, you could even just dump them to a
> file.
>
> Or is this going to run on some tiny system?
A machine on AWS, where CPU and memory cost money.

I also need high-frequency info, but in that case I will use a
fixed array of strings to print what each thread is doing or whether some
thread is blocked, and maybe some other data structure, like a dashboard,
to collect some useful values.




Christian Hanné

Aug 17, 2021, 2:13:19 AM

deque<pair<timestamp, string>>
That's all.
In C: hundreds of lines of code.

Bart

Aug 17, 2021, 10:03:42 AM

What would be the problem with using malloc for this?

I don't see it being a particular bottleneck.

However the requirements seem such that you could just use a fixed array.

256 entries (you said 4 pages - of 60 lines?) of 256 characters each
(you said 20-120) would mean 64KB total memory allocated.

I think 'aws' means Amazon Web Services (according to a quick google).
Would 64KB cost that much?

If 64KB is OK, then that is a very simple solution. You just need to
maintain a couple of indices into that 256-array of char[256] lines.

But if malloc is OK, then you can have a 256-array of char* instead,
which is what I'd prefer.
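
For example, a sketch of that char* variant (names invented; error handling kept minimal):

#include <stdlib.h>
#include <string.h>

#define SLOTS 256

static char *slot[SLOTS];
static size_t next = 0;  /* next slot to (re)use */

void log_push(const char *msg)
{
    free(slot[next]);               /* free the entry being overwritten */
    slot[next] = malloc(strlen(msg) + 1);
    if (slot[next])
        strcpy(slot[next], msg);
    next = (next + 1) % SLOTS;
}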

Thiago Adams

Aug 17, 2021, 10:11:08 AM

Yes, I agree that this is a viable solution.

The block idea is similar, but it also allows using two or more blocks in case
the string is bigger than one block (like using two lines for one message).

Chris M. Thomasson

Aug 17, 2021, 4:10:22 PM

I remember doing one with a singly-linked list per thread. A log entry
would go into its thread's local list. There was a logger thread that
intermittently tried to gather all of the elements out of each thread's
log list.

Bonita Montero

Aug 18, 2021, 4:31:53 AM

A single log with a single lock would be sufficient, because logs
don't have even a mediocre access frequency, and inserting and
deleting are done in almost no time, so there's almost no chance
of a collision.

Thiago Adams

Aug 18, 2021, 7:28:43 AM

I want to send or print new parts of the log.
While I am sending or printing, everyone is waiting on a lock.

One alternative is to make a copy of the log (memcpy) and then free the lock,
or to create a queue of logs that buffers entries while the log is locked for printing.

So basically it is a competition between memcpy'ing N bytes and malloc'ing
each log entry.
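
A sketch of the memcpy option with POSIX threads (names invented; the point is that writers only wait for the memcpy, not for the slow I/O):

#include <pthread.h>
#include <string.h>

#define LOG_SIZE 4096

static char log_buf[LOG_SIZE];
static pthread_mutex_t log_mtx = PTHREAD_MUTEX_INITIALIZER;

void log_send(void (*send)(const char *buf, size_t n))
{
    char snapshot[LOG_SIZE];

    pthread_mutex_lock(&log_mtx);
    memcpy(snapshot, log_buf, LOG_SIZE);  /* writers wait only for this */
    pthread_mutex_unlock(&log_mtx);

    send(snapshot, LOG_SIZE);             /* slow I/O outside the lock */
}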






Thiago Adams

Aug 18, 2021, 7:47:38 AM

I think sometimes we want to see logs per thread, but all together is also
useful. One alternative is to put a thread ID in the message and later
create a filter.

Chris M. Thomasson

Aug 18, 2021, 6:47:11 PM

The per-thread solution gets around a worker thread having to wait on a
lock. The main logger thread just grabs each thread's log using a
single atomic operation per thread. Then the log thread can dump to a
file. So there are no locks; the worker threads never have to wait.
Also, I remember creating a logger process instead of a thread in the
process. It was all lock-free and had great performance.
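
Something along these lines in C11 atomics - a guess at the shape of that design, not the original code (one list head per worker thread):

#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

struct entry {
    struct entry *next;
    char msg[100];
};

/* conceptually one of these per worker thread */
static _Atomic(struct entry *) head = NULL;

void worker_log(const char *msg)  /* called by the owning worker */
{
    struct entry *e = malloc(sizeof *e);
    if (!e) return;
    strncpy(e->msg, msg, sizeof e->msg - 1);
    e->msg[sizeof e->msg - 1] = '\0';
    e->next = atomic_load_explicit(&head, memory_order_relaxed);
    while (!atomic_compare_exchange_weak_explicit(&head, &e->next, e,
               memory_order_release, memory_order_relaxed))
        ;  /* retry: e->next was updated to the current head */
}

struct entry *logger_grab_all(void)  /* called by the logger thread */
{
    /* one atomic op takes the whole list (newest first) */
    return atomic_exchange_explicit(&head, NULL, memory_order_acquire);
}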

Chris M. Thomasson

Aug 18, 2021, 6:48:00 PM

Is that true for all programs? How do you know what their requirements are?

Chris M. Thomasson

Aug 18, 2021, 6:49:51 PM

My rationale for going the per-thread route was that a worker thread
should never have to wait, block, or be interrupted to create a log
entry. All of the hard work, writing to a file, etc., was done by the
dedicated logger thread. I even did one that used a logger process.

Bonita Montero

Aug 18, 2021, 11:41:16 PM

For every kind of logging.

Chris M. Thomasson

Aug 18, 2021, 11:46:51 PM

I cannot agree with you wrt every program's needs...

Bonita Montero

Aug 19, 2021, 12:06:31 AM

Logging _never_ has such a high frequency or such coarse locking granularity
that you need something different from what I said.

Chris M. Thomasson

Aug 19, 2021, 12:18:58 AM

How do you know?

Bonita Montero

Aug 19, 2021, 1:31:31 AM

On 19.08.2021 at 06:18, Chris M. Thomasson wrote:

>> Logging _never_ has such a high frequency or such coarse locking granularity
>> that you need something different from what I said.

> How do you know?

Look at this:

#include <iostream>
#include <utility>
#include <string>
#include <deque>
#include <thread>
#include <mutex>
#include <condition_variable> // condition_variable, cv_status
#include <ctime>
#include <vector>
#include <random>
#include <atomic>
#include <cstdint>

using namespace std;
using namespace chrono;

int main()
{
    mutex mtxLog;
    using log_entry_t = pair<time_t, string>;
    deque<log_entry_t> log;
    vector<thread> loggers;
    condition_variable cvStop;
    bool stop = false;
    size_t const MAX_LOG_SIZE = 1'000'000;
    nanoseconds nsCreate = nanoseconds( 0 ),
                nsInsert = nanoseconds( 0 );
    auto logThread = [&]()
    {
        mt19937_64 mt( (random_device())() );
        uniform_int_distribution<int> uidMsGap( 10, 50 );
        uniform_int_distribution<size_t> uidMsgLength( 10, 100 );
        uniform_int_distribution<int> uidMsgChars( 'a', 'z' );
        for( ; ; )
        {
            unique_lock<mutex> ul( mtxLog );
            cv_status cvs = cvStop.wait_for( ul, milliseconds( uidMsGap( mt ) ) );
            if( stop )
                return;
            if( cvs != cv_status::timeout )
                continue;
            ul.unlock();
            time_point<high_resolution_clock> begin = high_resolution_clock::now();
            time_t ts = time( &ts );
            if( ts == (time_t)-1 )
                continue;
            string msg( uidMsgLength( mt ), ' ' );
            for( char &c : msg )
                c = (char)uidMsgChars( mt );
            nanoseconds durCreate = duration_cast<nanoseconds>( high_resolution_clock::now() - begin );
            ul.lock();
            nsCreate += durCreate;
            begin = high_resolution_clock::now();
            if( log.size() == MAX_LOG_SIZE )
                log.pop_front();
            log.push_back( log_entry_t( ts, move( msg ) ) );
            nsInsert += duration_cast<nanoseconds>( high_resolution_clock::now() - begin );
        }
    };
    unsigned const MAX_THREADS = 10'000;
    for( unsigned t = MAX_THREADS; t--; )
        loggers.emplace_back( logThread );
    this_thread::sleep_for( seconds( 10 ) );
    stop = true;
    cvStop.notify_all();
    for( thread &thr : loggers )
        thr.join();
    cout << ((double)(int64_t)nsInsert.count() / (int64_t)nsCreate.count()) / (1.0e9 / 100.0) << "%" << endl;
}

This program spawns 10,000 threads which create log entries every
10 to 50 ms. The CPU load is quite high on my computer (3990X,
64-core, SMT disabled). I sum up the times to create log entries as
well as to insert them. At the end I output the summed time to
insert the entries divided by the time to create them, in percent.
The result is almost zero.

Bonita Montero

Aug 19, 2021, 1:32:49 AM

On 19.08.2021 at 07:30, Bonita Montero wrote:

>             ul.lock();
>             nsCreate += durCreate;
>             begin = high_resolution_clock::now();

Should be:

nsCreate += durCreate;
begin = high_resolution_clock::now();
ul.lock();

Robert Latest

Aug 19, 2021, 1:50:33 AM

Thiago Adams wrote:
> 3 - Avoiding malloc for each entry.

Why?

Bonita Montero

Aug 19, 2021, 2:12:54 AM

On 19.08.2021 at 07:30, Bonita Montero wrote:

>             log.push_back( log_entry_t( ts, move( msg ) ) );

log.emplace_back( ts, move( msg ) );

Bonita Montero

Aug 19, 2021, 2:52:27 AM


Now I added some code so that all threads start almost synchronously:

#include <iostream>
#include <utility>
#include <string>
#include <deque>
#include <thread>
#include <mutex>
#include <condition_variable> // condition_variable, cv_status
#include <ctime>
#include <vector>
#include <random>
#include <atomic>
#include <cstdint>

using namespace std;
using namespace chrono;

int main()
{
    mutex mtx;
    using log_entry_t = pair<time_t, string>;
    deque<log_entry_t> log;
    vector<thread> loggers;
    condition_variable cvRun, cvStop;
    bool run = false,
         stop = false;
    size_t const MAX_LOG_SIZE = 1'000'000;
    nanoseconds nsCreate = nanoseconds( 0 ),
                nsInsert = nanoseconds( 0 );
    auto logThread = [&]()
    {
        {
            unique_lock<mutex> ul( mtx );
            cvRun.wait( ul, [&]() { return run; } );
        }
        mt19937_64 mt( (random_device())() );
        uniform_int_distribution<int> uidMsGap( 10, 50 );
        uniform_int_distribution<size_t> uidMsgLength( 10, 100 );
        uniform_int_distribution<int> uidMsgChars( 'a', 'z' );
        for( ; ; )
        {
            unique_lock<mutex> ul( mtx );
            cv_status cvs = cvStop.wait_for( ul, milliseconds( uidMsGap( mt ) ) );
            if( stop )
                return;
            if( cvs != cv_status::timeout )
                continue;
            ul.unlock();
            time_point<high_resolution_clock> begin = high_resolution_clock::now();
            time_t ts = time( &ts );
            if( ts == (time_t)-1 )
                continue;
            string msg( uidMsgLength( mt ), ' ' );
            for( char &c : msg )
                c = (char)uidMsgChars( mt );
            nanoseconds durCreate = duration_cast<nanoseconds>( high_resolution_clock::now() - begin );
            nsCreate += durCreate;
            begin = high_resolution_clock::now();
            ul.lock();
            if( log.size() == MAX_LOG_SIZE )
                log.pop_front();
            log.emplace_back( ts, move( msg ) );
            nsInsert += duration_cast<nanoseconds>( high_resolution_clock::now() - begin );
        }
    };
    unsigned const MAX_THREADS = 10'000;
    for( unsigned t = MAX_THREADS; t--; )
        loggers.emplace_back( logThread );
    run = true, cvRun.notify_all();
    this_thread::sleep_for( seconds( 10 ) );
    stop = true, cvStop.notify_all();
    for( thread &thr : loggers )
        thr.join();
    cout << ((double)(int64_t)nsInsert.count() / (int64_t)nsCreate.count()) / (1.0e9 / 100.0) << "%" << endl;
}

Thiago Adams

Aug 19, 2021, 7:38:00 AM

This is not strictly necessary.
It was an idea for optimization, because with circular buffers (when full)
one entry is deleted whenever another one is created.

We could also think of it as an optimized allocator that knows that,
when it is full, the old memory can be discarded.

Scott Lurndal

Aug 19, 2021, 10:03:48 AM

"Chris M. Thomasson" <chris.m.t...@gmail.com> writes:

>>>>> 3 - Avoiding malloc for each entry.

>> So, basically it is a competition between memcpy N bytes x malloc
>> each log entry.
>
>The per-thread solution gets around a worker thread having to wait on a
>lock. The main logger thread just grabs all of the threads logs using
>single atomic operation per thread. Then the log thread can dump to a
>file. So, there are no locks. The worker threads never have to wait.
>Also, I remember creating a logger process, instead of a thread in the
>process. Its was all lock free and had great performance.

In our production code, we always log to a file with
append semantics. The operating system guarantees the atomicity
of each write in multithreaded applications.

If using stdio, the stream is set to line buffered.

For critical logs, the appropriate techniques to force data to
disk are used (e.g. bypass stdio and use O_DIRECT).
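
For instance, a minimal POSIX sketch of the append-mode variant (error handling kept minimal):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int log_open(const char *path)
{
    /* O_APPEND: the kernel positions each write at EOF atomically */
    return open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
}

void log_line(int fd, const char *line)
{
    if (write(fd, line, strlen(line)) < 0) {
        /* handle or ignore the error as appropriate */
    }
}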

Thiago Adams

Aug 19, 2021, 10:19:44 AM

In my case I also want to print and/or send the newest parts
of the log.

Using just a file and no memory, I would have to block writes
and read the last N bytes from the file to send them. Instead I think
using log memory + file can help: for printing or sending,
the memory is used.

Bart

Aug 19, 2021, 10:58:07 AM

The nearest I've done was to log transactions in a POS system.

Infrequent updates, but in a locale liable to power cuts. Each new entry
of multiple lines was appended to a file, opened and closed each time,
which recorded a day's transactions.

Scott Lurndal

Aug 19, 2021, 12:17:02 PM

Thiago Adams <thiago...@gmail.com> writes:
>On Thursday, August 19, 2021 at 11:03:48 AM UTC-3, Scott Lurndal wrote:
>> "Chris M. Thomasson" <chris.m.t...@gmail.com> writes:
>>
>> >>>>> 3 - Avoiding malloc for each entry.
>> >> So, basically it is a competition between memcpy N bytes x malloc
>> >> each log entry.
>> >
>> >The per-thread solution gets around a worker thread having to wait on a
>> >lock. The main logger thread just grabs all of the threads logs using
>> >single atomic operation per thread. Then the log thread can dump to a
>> >file. So, there are no locks. The worker threads never have to wait.
>> >Also, I remember creating a logger process, instead of a thread in the
>> >process. Its was all lock free and had great performance.
>> In our production code, we always log to a file with
>> append semantics. The operating system guarantees the atomicity
>> of each write in multithreaded applications.
>>
>> If using stdio, the stream is set to line buffered.
>>
>> For critical logs, the appropriate techniques to force data to
>> disk are used (e.g. bypass stdio and use O_DIRECT).
>
>In my case I also want to print and or send the newest parts
>of the log.

$ man 1 tail

$ tail /path/to/logfile > /dev/printer

$ tail /path/to/logfile | nc target_host port

Chris M. Thomasson

Aug 19, 2021, 2:08:54 PM

On 8/18/2021 10:30 PM, Bonita Montero wrote:
> On 19.08.2021 at 06:18, Chris M. Thomasson wrote:
>
>>> Logging _never_ has such a high frequency or such coarse locking granularity
>>> that you need something different from what I said.
>
>> How do you know?
> [...]

I will take a look at your code. Am a bit busy right now. However, my
main point is: how do you know what an unknown program's requirements
are wrt logging?

Chris M. Thomasson

Aug 19, 2021, 2:11:59 PM

Perhaps, to avoid a call to malloc for each entry? If you malloc every
little thing, well, this forces things to be dependent on the
performance characteristics of the particular implementation of malloc.
Fwiw, I always try to amortize calls to malloc.

Bonita Montero

Aug 19, 2021, 2:15:42 PM

On 19.08.2021 at 20:08, Chris M. Thomasson wrote:

> I will take a look at your code. Am a bit busy right now. However, my
> main point is that how do you know what an unknown programs requirements
> are wrt logging?

I think 10,000 threads with 10 to 50 ms delays between log entries for
each thread is a rather high frequency.

I've changed the code further to become more accurate and more
efficient:

#include <iostream>
#include <utility>
#include <string>
#include <deque>
#include <thread>
#include <mutex>
#include <condition_variable> // condition_variable, cv_status
#include <ctime>
#include <vector>
#include <random>
#include <atomic>
#include <cstdint>

#if defined(_MSC_VER)
    #pragma warning(disable: 26111)
#endif

using namespace std;
using namespace chrono;

int main()
{
    mutex mtx;
    using log_entry_t = pair<time_t, string>;
    deque<log_entry_t> log;
    vector<thread> loggers;
    condition_variable cvRun, cvStop;
    bool run = false,
         stop = false;
    size_t const MAX_LOG_SIZE = 1'000'000;
    atomic_int64_t nsCreate = 0,
                   nsInsert = 0;
    auto logThread = [&]()
    {
        {
            unique_lock<mutex> ul( mtx );
            cvRun.wait( ul, [&]() { return run; } );
        }
        mt19937_64 mt( (random_device())() );
        uniform_int_distribution<int> uidMsGap( 10, 50 );
        uniform_int_distribution<size_t> uidMsgLength( 10, 100 );
        uniform_int_distribution<int> uidMsgChars( 'a', 'z' );
        for( ; ; )
        {
            unique_lock<mutex> ul( mtx );
            cv_status cvs = cvStop.wait_for( ul, milliseconds( uidMsGap( mt ) ) );
            if( stop )
                return;
            if( cvs != cv_status::timeout )
                continue;
            ul.unlock();
            time_point<high_resolution_clock> begin = high_resolution_clock::now();
            time_t ts = time( &ts );
            if( ts == (time_t)-1 )
                continue;
            string msg( uidMsgLength( mt ), ' ' );
            for( char &c : msg )
                c = (char)uidMsgChars( mt );
            nsCreate += duration_cast<nanoseconds>( high_resolution_clock::now() - begin ).count();
            begin = high_resolution_clock::now();
            ul.lock();
            if( log.size() == MAX_LOG_SIZE )
                log.pop_front();
            log.emplace_back( ts, move( msg ) );
            ul.unlock();
            nsInsert += duration_cast<nanoseconds>( high_resolution_clock::now() - begin ).count();
        }
    };
    unsigned const MAX_THREADS = 10'000;
    for( unsigned t = MAX_THREADS; t--; )
        loggers.emplace_back( logThread );
    run = true, cvRun.notify_all();
    this_thread::sleep_for( seconds( 10 ) );
    stop = true, cvStop.notify_all();
    for( thread &thr : loggers )
        thr.join();
    cout << (double)nsInsert / nsCreate / (1.0e9 / 100.0) << "%" << endl;
    cout << ((double)nsCreate + nsInsert) / 1.0e9 << endl;
}

Thiago Adams

Aug 19, 2021, 3:40:18 PM

Like. 👍

Thiago Adams

Aug 19, 2021, 11:34:58 PM

On Monday, August 16, 2021 at 7:01:14 PM UTC-3, Thiago Adams wrote:
> On Sunday, August 15, 2021 at 9:09:19 AM UTC-3, Thiago Adams wrote:
> > I would like to have a fast data struct to keep logs
> > in memory.
> >
> > The requirements are not fixed, but it would be nice if:
> >
> > 1 - Print recent entries first
> >
> > 09:10 Event at 9:10
> > 08:00 Event at 8:00
> > ...
> > ...
> >
> > 2 - I think the total size could be fixed, removing old messages
> > when new ones are created.
> >
> > 3 - Avoiding malloc for each entry.
> >
> >
> > I also was wondering how terminals buffer are implemented.
> > For instance:
> >
> > c:\command 1
> > c:\command 2
> > ...
> > c:\command N
> >
> > what happens when the buffer ends?
> > The memory is moved all the way when a new command is created?
> >
> > The solution I was thinking about was using some circular buffer
> > but I can see it can have some variations and then I am asking for
> > ideas.
> I have implemented two ideas.
>
> 1 - circular buffer of chars.
> Each entry write text and the null terminator at end.
> "abcd"
> "123"
> Result:
> abcd*123* (circular buffer of chars * means \0)
>
> To print from new entries to old, I need to search backwards
> until I find *. Then I move forward to print the characters until
> I find * again. I continue from the previous backward position
> the same process. It is a linear zig zag.
>
> For instance I need to go backwards 3 2 1 * and then 1 2 3 forward.
> Then d c b a * backwards and a b c d forward.
>
> to print the final result of (new on the top)
>
> "123"
> "abcd"
>
> 2 - circular buffer of blocks of chars.
> Each block have a fixed number of chars.
> To insert entries I fill blocks and add new one if necessary.
>
> for instance blocks of 5 chars.
>
> [abcd*][123**]
>
> zeros are added.
>
> To print from new to old I do similar of (1) but instead of char
> by char is block by block. if the block end with * that means it
> is a end block and the next one is the beginning of a entry.
>
> Blocks can be like 10 chars, or like 80 chars to represent almost
> one block per entry (line of log).
>
> I am not sure with one is the best for my usage.
>
> 1 - is memory efficient.
> 2 - is faster to print from new to old.

Idea no. 3.

Circular buffer of chars + circular buffer of indexes.

The circular buffer of indexes holds the start index of each
log entry in the circular buffer of chars.

Sample: (* means '\0')

AB*12* (circular buffer of chars)
[0 3] (circular buffer of indexes; we have 2 entries)
0 = start index of "AB*"
3 = start index of "12*"

When the buffer is full we will overwrite old entries.
Let's say adding "CD" will overwrite AB. When this happens,
the writing position will be >= the old index, so we pop the
old index (0 in this case). We repeat this while writing chars.

At the end of this process we push the index of the new entry,
which in this case is also 0 by coincidence.

cd*12*
[3 0]
3 = index of "12"
0 = index of "cd"

The next one to overwrite is 12, at index 3.

We can choose any size for the buffer of chars.
The size of the buffer of indexes will be, in the worst case, the
size of the buffer of chars / 2, considering we add 1 char + \0.
But we can use an approximation of, say, 20 bytes per entry.
This extra circular buffer of indexes makes it fast to navigate logs
forwards or backwards, compared with the zig-zag necessary
to navigate backwards using a single buffer of chars.

I think this data structure can be used as a FIFO of variable-size text.
There is a limit on the total size of text in the FIFO and a limit on the number of entries.
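
A sketch of idea 3 in C (illustrative names; this version evicts old entries up front rather than while writing, which has the same effect):

#include <stdio.h>
#include <string.h>

#define CBUF 64              /* capacity of the char ring */
#define IMAX (CBUF / 2)      /* worst case: one index per "x\0" entry */

static char   cbuf[CBUF];
static size_t starts[IMAX];  /* ring of entry start offsets */
static size_t ifirst = 0, icount = 0;
static size_t wpos = 0;      /* next char write position */
static size_t used = 0;      /* chars currently occupied */

void fifo_push(const char *msg)
{
    size_t need = strlen(msg) + 1;       /* text + '\0' */
    if (need >= CBUF)
        return;                          /* entry too big for the ring */
    /* pop oldest entries until the new one fits and an index is free */
    while (CBUF - used < need || icount == IMAX) {
        size_t s = starts[ifirst];
        size_t nxt = icount > 1 ? starts[(ifirst + 1) % IMAX] : wpos;
        used -= (nxt + CBUF - s) % CBUF; /* length of the evicted entry */
        ifirst = (ifirst + 1) % IMAX;
        icount--;
    }
    starts[(ifirst + icount) % IMAX] = wpos;
    icount++;
    for (size_t i = 0; i < need; i++) {  /* copy, wrapping per char */
        cbuf[wpos] = msg[i];
        wpos = (wpos + 1) % CBUF;
    }
    used += need;
}

void fifo_print_newest_first(void)
{
    for (size_t i = icount; i-- > 0; ) {
        size_t p = starts[(ifirst + i) % IMAX];
        while (cbuf[p]) {
            putchar(cbuf[p]);
            p = (p + 1) % CBUF;
        }
        putchar('\n');
    }
}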

Bonita Montero

Aug 20, 2021, 4:06:28 AM

On 19.08.2021 at 20:08, Chris M. Thomasson wrote:
I've inserted a counter that counts the number of log entries
enqueued. At the end I printed the number of log entries per
second: 270,000 - with almost zero enqueuing overhead! Any
questions?

Bonita Montero

Aug 20, 2021, 4:06:49 AM

207,000
> questions?

Robert Latest

Aug 20, 2021, 9:26:06 AM

Chris M. Thomasson wrote:
> On 8/18/2021 10:50 PM, Robert Latest wrote:
>> Thiago Adams wrote:
>>> 3 - Avoiding malloc for each entry.
>>
>> Why?
>>
>
> Perhaps, to avoid a call to malloc for each entry? If you malloc every
> little thing, well, this forces things to be dependent on the
> performance characteristics of the particular implementation of malloc.

Absolutely. One should neither malloc() "every little thing" without thinking
about it, nor ban the use of malloc() altogether. That said, one
can probably rely on malloc() being one of the most aggressively optimized
functions in any given implementation, simply because it is so ubiquitous.

> Fwiw, I always try to amortize calls to malloc.

What do you mean by "amortize"?

Chris M. Thomasson

Aug 20, 2021, 4:42:01 PM

An example:

Instead of calling malloc for each individual node structure, why not
use one call per, say, 1,000,000 node structures?
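
For instance, a minimal single-threaded sketch (names and the payload size are invented; note the pool never returns memory to malloc):

#include <stdlib.h>

struct node {
    struct node *next;
    char payload[120];
};

#define BATCH 1000000u

static struct node *free_list = NULL;

struct node *node_alloc(void)
{
    if (!free_list) {
        /* one malloc for BATCH nodes instead of BATCH mallocs */
        struct node *batch = malloc(BATCH * sizeof *batch);
        if (!batch) return NULL;
        for (size_t i = 0; i < BATCH; i++) {
            batch[i].next = free_list;
            free_list = &batch[i];
        }
    }
    struct node *n = free_list;
    free_list = n->next;
    return n;
}

void node_free(struct node *n)
{
    n->next = free_list;  /* recycled into the pool, never freed */
    free_list = n;
}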

Bonita Montero

Aug 21, 2021, 3:06:23 AM

On 20.08.2021 at 10:06, Bonita Montero wrote:

> I've inserted a counter that counts the number of log entries
> enqueued. At the end I printed the number of log entries per
> second: 270,000 - with almost zero enqueuing overhead! Any
> questions?

Here's another improvement:

#include <iostream>
#include <utility>
#include <string>
#include <deque>
#include <thread>
#include <mutex>
#include <condition_variable> // condition_variable, cv_status
#include <ctime>
#include <vector>
#include <random>
#include <atomic>
#include <cstdint>

using namespace std;
using namespace chrono;
using namespace chrono_literals;

int main()
{
    mutex mtx;
    using log_entry_t = pair<time_t, string>;
    deque<log_entry_t> log;
    vector<thread> loggers;
    condition_variable cvRun, cvStop;
    bool run = false,
         stop = false;
    size_t const MAX_LOG_SIZE = 1'000'000;
    atomic_int64_t nsCreate = 0,
                   nsInsert = 0;
    atomic_int64_t nEnqueued = 0;
    auto logThread = [&]()
    {
        {
            unique_lock<mutex> ul( mtx );
            cvRun.wait( ul, [&]() { return run; } );
        }
        mt19937_64 mt( (random_device())() );
        uniform_int_distribution<int> uidMsGap( 10, 50 );
        uniform_int_distribution<size_t> uidMsgLength( 10, 100 );
        uniform_int_distribution<int> uidMsgChars( 'a', 'z' );
        for( ; ; ++nEnqueued )
        {
            milliseconds wait = milliseconds( uidMsGap( mt ) );
            nanoseconds waited = nanoseconds( 0 );
            unique_lock<mutex> ul( mtx );
            for( ; ; )
            {
                milliseconds rndWaited = duration_cast<milliseconds>( waited + 500us );
                if( rndWaited >= wait )
                    break;
                time_point<high_resolution_clock> waitBegin = high_resolution_clock::now();
                cv_status cvs = cvStop.wait_for( ul, wait - rndWaited );
                if( stop )
                    return;
                if( cvs == cv_status::timeout )
                    break;
                waited += duration_cast<nanoseconds>( high_resolution_clock::now() - waitBegin );
            }
            string msg;
            if( log.size() == MAX_LOG_SIZE )
                msg = move( log.front().second ),
                log.pop_front();
            ul.unlock();
            time_point<high_resolution_clock> begin = high_resolution_clock::now();
            time_t timeStamp = time( &timeStamp );
            if( timeStamp == (time_t)-1 )
                continue;
            msg.resize( uidMsgLength( mt ) );
            for( char &c : msg )
                c = (char)uidMsgChars( mt );
            nsCreate.fetch_add( duration_cast<nanoseconds>( high_resolution_clock::now() - begin ).count(),
                memory_order_relaxed );
            begin = high_resolution_clock::now();
            ul.lock();
            log.emplace_back( timeStamp, move( msg ) );
            ul.unlock();
            nsInsert.fetch_add( duration_cast<nanoseconds>( high_resolution_clock::now() - begin ).count(),
                memory_order_relaxed );
        }
    };
    unsigned const MAX_THREADS = 10'000;
    for( unsigned t = MAX_THREADS; t--; )
        loggers.emplace_back( logThread );
    run = true, cvRun.notify_all();
    unsigned const SECS = 10;
    this_thread::sleep_for( seconds( SECS ) );
    stop = true, cvStop.notify_all();
    for( thread &thr : loggers )
        thr.join();
    cout << (double)nsInsert / nsCreate / (1.0e9 / 100.0) << "%" << endl;
    cout << (double)nEnqueued / (double)SECS << endl;
}

Along with the time_t timestamp I allocate a string object with a
size from 10 to 100 characters. Before, I destroyed the string object
at the front of the deque every time the queue became full.
Now I move the string object out of the front entry before destruction,
unlock the queue, and resize it to a new size, so that at some point
the length of 100 is reached and the recycled string object isn't
relocated further. This costs somewhat more memory,
but at some point there won't be any further allocations or
deallocations.
On my Threadripper 3900X on Win10 with SMT disabled I get nearly
the maximum throughput possible under these parameters (10 to 50 ms
delay per thread between generated log entries): 330,000 log
entries per second (the maximum would be 333,000 entries per second)
at a CPU load of 42%.

Thiago Adams

Aug 21, 2021, 8:06:09 AM

Can you describe your sample?

You create N threads that will generate data and
then add it to the log using

lock()
add_log()
unlock()

Is this correct?

What happens if I need to print some parts of the log?

lock()
print_log()
unlock()

Like this?
So while I am printing, N threads will be waiting for the lock to be free?

(Your code is not working on VC++2019: Unhandled exception at 0x7641B512 in
ConsoleApplication.exe: Microsoft C++ exception: std::system_error at memory
location 0x012FF1C8)


Thiago Adams

Aug 21, 2021, 8:08:21 AM

On Saturday, August 21, 2021 at 9:06:09 AM UTC-3, Thiago Adams wrote:
..
> (Your code is not working on VC++2019: Unhandled exception at 0x7641B512 in
> ConsoleApplication.exe: Microsoft C++ exception: std::system_error at memory
> location 0x012FF1C8)

Changing MAX_THREADS to a small number makes it work.

Bonita Montero

Aug 21, 2021, 9:07:02 AM

On 21.08.2021 at 14:06, Thiago Adams wrote:

> You create N threads that will generate data and
> add then do log using
>
> lock()
> add_log()
> unlock().

No, I wait for a time ranging from 10 to 50 ms with a condvar, unlock
the queue, create the new queue element, lock the queue, and insert the
new queue element (roughly speaking).

> Is this correct?
> What happens if I need to print some parts of the log?

The best would be to lock the queue, copy it (to minimize the time
the lock is held), unlock it, and then process the copied queue.

> (Your code is not working on VC++2019: Unhandled exception at 0x7641B512 in
> ConsoleApplication.exe: Microsoft C++ exception: std::system_error at memory
> location 0x012FF1C8)

Take the latest version:
    int64_t nsCreateT = 0,
            nsInsertT = 0;
    for( ; ; ++nEnqueued )
    {
        milliseconds wait = milliseconds( uidMsGap( mt ) );
        nanoseconds waited = nanoseconds( 0 );
        unique_lock<mutex> ul( mtx );
        for( ; ; )
        {
            milliseconds rndWaited = duration_cast<milliseconds>( waited + 500us );
            if( rndWaited >= wait )
                break;
            time_point<high_resolution_clock> waitBegin = high_resolution_clock::now();
            cv_status cvs = cvStop.wait_for( ul, wait - rndWaited );
            if( stop )
            {
                nsCreate.fetch_add( nsCreateT );
                nsInsert.fetch_add( nsInsertT );
                return;
            }
            if( cvs == cv_status::timeout )
                break;
            waited += duration_cast<nanoseconds>( high_resolution_clock::now() - waitBegin );
        }
        string msg;
        time_point<high_resolution_clock> begin = high_resolution_clock::now();
        if( log.size() == MAX_LOG_SIZE )
            msg = move( log.front().second ),
            log.pop_front();
        nsCreateT += duration_cast<nanoseconds>( high_resolution_clock::now() - begin ).count();
        ul.unlock();
        time_t timeStamp = time( &timeStamp );
        if( timeStamp == (time_t)-1 )
            continue;
        msg.resize( uidMsgLength( mt ) );
        for( char &c : msg )
            c = (char)uidMsgChars( mt );
        nsCreateT += duration_cast<nanoseconds>( high_resolution_clock::now() - begin ).count();
        begin = high_resolution_clock::now();
        ul.lock();
        log.emplace_back( timeStamp, move( msg ) );
        ul.unlock();
        nsInsertT += duration_cast<nanoseconds>( high_resolution_clock::now() - begin ).count();
    }
};
unsigned const MAX_THREADS = 10'000;
for( unsigned t = MAX_THREADS; t--; )
    loggers.emplace_back( logThread );
run = true, cvRun.notify_all();
unsigned const SECS = 10;
this_thread::sleep_for( seconds( SECS ) );
stop = true, cvStop.notify_all();
for( thread &thr : loggers )
    thr.join();
cout << (double)nsInsert / nsCreate / (1.0e9 / 100.0) << "%" << endl;
cout << (double)nEnqueued / (double)SECS << endl;
}

A small "problem" with this version is that there might be allocations
because of expanding the deque at the end while inserting the new
element. So I decided to write a new version with a map, where the key is
the timestamp. This has the advantage that I can use the extract()
method of std::map to extract the first element, assign it a new
timestamp key and a new message, and insert it at the end afterwards.
Here's the code:

#include <iostream>
#include <utility>
#include <string>
#include <map>
#include <thread>
#include <mutex>
#include <condition_variable> // condition_variable, cv_status
#include <ctime>
#include <vector>
#include <random>
#include <atomic>
#include <cstdint>

using namespace std;
using namespace chrono;
using namespace chrono_literals;

int main()
{
    mutex mtx;
    using map_t = map<time_t, string>;
    map_t log;
    vector<thread> loggers;
    condition_variable cvRun, cvStop;
    bool run = false,
         stop = false;
    size_t const MAX_LOG_SIZE = 1'000'000;
    atomic_int64_t nsCreate = 0,
                   nsInsert = 0;
    atomic_int64_t nEnqueued = 0;
    auto logThread = [&]()
    {
        {
            unique_lock<mutex> ul( mtx );
            cvRun.wait( ul, [&]() { return run; } );
        }
        mt19937_64 mt( (random_device())() );
        uniform_int_distribution<int> uidMsGap( 10, 50 );
        uniform_int_distribution<size_t> uidMsgLength( 10, 100 );
        uniform_int_distribution<int> uidMsgChars( 'a', 'z' );
        int64_t nsCreateT = 0,
                nsInsertT = 0,
                nEnqueuedT = 0;
        for( ; ; )
        {
            milliseconds wait = milliseconds( uidMsGap( mt ) );
            nanoseconds waited = nanoseconds( 0 );
            unique_lock<mutex> ul( mtx );
            for( ; ; )
            {
                milliseconds rndWaited = duration_cast<milliseconds>( waited + 500us );
                if( rndWaited >= wait )
                    break;
                time_point<high_resolution_clock> waitBegin = high_resolution_clock::now();
                cv_status cvs = cvStop.wait_for( ul, wait - rndWaited );
                if( stop )
                {
                    nsCreate.fetch_add( nsCreateT );
                    nsInsert.fetch_add( nsInsertT );
                    nEnqueued.fetch_add( nEnqueuedT );
                    return;
                }
                if( cvs == cv_status::timeout )
                    break;
                waited += duration_cast<nanoseconds>( high_resolution_clock::now() - waitBegin );
            }
            time_point<high_resolution_clock> begin = high_resolution_clock::now();
            map_t::node_type node;
            if( log.size() == MAX_LOG_SIZE )
                node = log.extract( log.begin() );
            nsCreateT += duration_cast<nanoseconds>( high_resolution_clock::now() - begin ).count();
            ul.unlock();
            begin = high_resolution_clock::now();
            time_t timeStamp = time( &timeStamp );
            if( timeStamp == (time_t)-1 )
            {
                if( node )
                    ul.lock(),
                    log.insert( move( node ) );
                continue;
            }
            string msg;
            if( node )
                node.key() = timeStamp,
                msg = move( node.mapped() );
            msg.resize( uidMsgLength( mt ) );
            for( char &c : msg )
                c = (char)uidMsgChars( mt );
            if( node )
                node.mapped() = move( msg );
            nsCreateT += duration_cast<nanoseconds>( high_resolution_clock::now() - begin ).count();
            begin = high_resolution_clock::now();
            ul.lock();
            if( node )
                log.insert( move( node ) );
            else
                log.emplace( timeStamp, move( msg ) );
            ul.unlock();
            nsInsertT += duration_cast<nanoseconds>( high_resolution_clock::now() - begin ).count();
            ++nEnqueuedT;
        }
    };
    unsigned const MAX_THREADS = 10'000;
    for( unsigned t = MAX_THREADS; t--; )
        loggers.emplace_back( logThread );
    run = true, cvRun.notify_all();
    unsigned const SECS = 10;
    this_thread::sleep_for( seconds( SECS ) );
    stop = true, cvStop.notify_all();
    for( thread &thr : loggers )
        thr.join();
    cout << (double)nsInsert / nsCreate / (1.0e9 / 100.0) << "%" << endl;
    cout << (double)nEnqueued / (double)SECS << endl;
}

After some time, when all MAX_LOG_SIZE elements have been assigned and
their message strings' capacity has become 100 chars, there won't be any
further memory allocations or deallocations.
This solution gets the maximum performance of 333,000 elements per
second and slightly lower CPU load than the version with the deque.

Thiago Adams

Aug 21, 2021, 10:16:05 AM

On Saturday, August 21, 2021 at 10:07:02 AM UTC-3, Bonita Montero wrote:
> On 21.08.2021 at 14:06, Thiago Adams wrote:
>
> > You create N threads that will generate data and
> > add then do log using
> >
> > lock()
> > add_log()
> > unlock().
> No, I wait for a time ranging from 10 to 50ms with a condvar, unlock
> the queue, create the new queue-element, lock the queue, insert the
> new queue-element (roughly said).

Why are you doing this? I think it would be better if you wrote the reasons,
using words instead of C++ code, to explain which data structure is locked,
what part of the job is done by the caller (the thread that creates the
log entry), and what is done inside the log.


Using a circular buffer of chars, for instance, inserting a log entry is:

1)

/*no malloc here*/
lock
/*no malloc here*/
push_each_char(log_text);
unlock

If log_text is 120 chars, everyone will wait until those 120 chars
are copied.

Compare with a circular buffer of pointers to char:

2)

char* newtext = malloc(strlen(log_text) + 1);
if (newtext)
{
    strcpy(newtext, log_text);
}

lock
/*no malloc if using a fixed size*/
push_char_ptr(newtext); //very fast
unlock

In 2 we know that lock/unlock is very fast, just copying a pointer.
But the caller side uses malloc, which is itself something
that locks and unlocks (because we know malloc is thread-safe).
So it is hard to say which one blocks less or is faster, because malloc
is a kind of "black box".

To print the log, we can lock, copy, and release:

allocate mem
lock
copy text to mem
unlock


Using a circular buffer of chars (1), the copy can be one or two memcpys
of consecutive chars.

Using a circular buffer of char* (2), each entry is copied separately,
and the chars of one entry can be far away from the others.
We also need to free the popped old entries, and free is also thread-safe
(so it may lock too).

3)
Another option for logs is a message system.
It is similar to 2: we can use malloc on the caller side to create the
message and then push the message into a message queue.

The log system consumes the messages and inserts them into a data
structure.

We can build this on a linked list. The node is
created on the caller side, then the node goes into a message queue.

When it is consumed, the node goes to its final destination: the linked
list of nodes (the log itself).

/*caller*/
create node with the text

lock
push node into the message queue
unlock

The node will be removed later by a consuming thread.

In the consuming thread we lock, grab the nodes, unlock, and move them to
the final queue.

The advantage of (3) is that callers are not blocked for the whole
duration of printing: they can still push new messages into the message
queue while the final linked list is being printed.

Which is better, 1, 2 or 3?
I don't know...




Bonita Montero

Aug 21, 2021, 11:25:21 AM

On 21.08.2021 at 16:15, Thiago Adams wrote:

> lock
> /*no malloc if using a fixed size*/
> push_char_prt(newtext); //very fast
> unlock

If you have a fixed-size message that's possible. But this wouldn't
make a major difference, since I reuse the string objects and
increasingly saturate their capacity.

> At 2 we know that lock unlock is very fast just to copy a pointer.

The performance of lock/unlock depends on whether there's contention
or not. If there is, there's kernel intervention, which has huge costs.

> using a circular buffer of chars (1) can be one or two memcpy
> of consecutive chars.

In C++ you could use a circular vector of chars along with a circular
vector of pairs of timestamps and string views.

Rest of your text unread.

Bonita Montero

Aug 21, 2021, 1:42:39 PM

That's what you want, a ring-buffered logger. First the ring-buffer
class (it lacks iterators, hence it's not as efficient as it could be):

#pragma once
#include <ctime>
#include <vector>
#include <stdexcept>
#include <new>
#include <utility>
#include <cstring>

template<typename T>
struct ring_buffer
{
    ring_buffer( std::size_t size );
    T &operator []( std::size_t idx );
    T &front();
    void pop_front();
    template<typename ... Args>
    T &emplace_back( Args &&...args );
    bool full();
    std::size_t size();
private:
    using vec_t = std::vector<T>;
    using it_t = typename vec_t::iterator;
    vec_t m_ring;
    it_t m_read, m_write;
    bool m_full;
};

template<typename T>
ring_buffer<T>::ring_buffer( std::size_t size ) :
    m_ring( size ),
    m_read( m_ring.begin() ),
    m_write( m_ring.begin() ),
    m_full( false )
{
    if( !size )
        throw std::invalid_argument( "ring-buffer size must be != 0" );
}

template<typename T>
T &ring_buffer<T>::operator []( std::size_t idx )
{
    using namespace std;
    if( !m_full && m_read == m_write )
        throw logic_error( "ring-buffer empty" );
    if( m_write > m_read )
        if( idx < (size_t)(m_write - m_read) )
            return m_read[idx];
        else
            throw invalid_argument( "ring-buffer index too large" );
    if( idx < (size_t)(m_ring.end() - m_read) )
        return m_read[idx];
    if( (idx -= m_ring.end() - m_read) >= (size_t)(m_write - m_ring.begin()) )
        throw invalid_argument( "ring-buffer index too large" );
    return m_ring[idx];
}

template<typename T>
T &ring_buffer<T>::front()
{
    if( !m_full && m_read == m_write )
        throw std::invalid_argument( "ring-buffer empty" );
    return *m_read;
}

template<typename T>
void ring_buffer<T>::pop_front()
{
    if( !m_full && m_read == m_write )
        throw std::logic_error( "ring-buffer empty" );
    if( ++m_read == m_ring.end() )
        m_read = m_ring.begin();
    m_full = false; // the buffer can't be full right after a pop
}

template<typename T>
template<typename ... Args>
T &ring_buffer<T>::emplace_back( Args &&...args )
{
    using namespace std;
    if( m_full )
        throw logic_error( "ring buffer full" );
    T &t = *m_write;
    t.~T();
    new( &t )T( forward<Args>( args ) ... );
    if( ++m_write == m_ring.end() )
        m_write = m_ring.begin();
    m_full = m_write == m_read;
    return t;
}

template<typename T>
std::size_t ring_buffer<T>::size()
{
    if( m_full )
        return m_ring.size();
    if( m_read == m_write ) // not full and read == write: empty
        return 0;
    if( m_read < m_write )
        return m_write - m_read;
    return (m_ring.end() - m_read) + (m_write - m_ring.begin());
}

template<typename T>
inline
bool ring_buffer<T>::full()
{
    return m_full;
}

Now the circular log main file:

#if defined(_MSC_VER)
    #define _CRT_SECURE_NO_WARNINGS
#endif
#include "ring_buffer.h"
#include <iostream>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <atomic>
#include <chrono>
#include <random>
#include <algorithm>

using namespace std;
using namespace chrono;
using namespace chrono_literals;

int main( int argc, char **argv )
{
    struct log_entry_t
    {
        time_t timestamp;
        char msg[100];
        log_entry_t() :
            timestamp( -1 )
        {
            msg[0] = 0;
        }
        log_entry_t( time_t timestamp, char const *msg ) :
            timestamp( timestamp )
        {
            strncpy( this->msg, msg, 100 );
        }
        log_entry_t( log_entry_t const &le ) :
            timestamp( le.timestamp )
        {
            strncpy( this->msg, le.msg, 100 );
        }
    };
    using ring_t = ring_buffer<log_entry_t>;
    mutex mtx;
    size_t const MAX_LOG_SIZE = 1'000'000;
    ring_t logRing( MAX_LOG_SIZE );
    condition_variable cvReady, cvRun, cvStop;
    unsigned ready = 0;
    bool run = false,
         stop = false;
    atomic_int64_t nsCreate = 0,
                   nsInsert = 0;
    atomic_int64_t nEnqueued = 0;
    auto waitStart = [&]
    {
        unique_lock<mutex> ul( mtx );
        ++ready;
        cvReady.notify_one();
        cvRun.wait( ul, [&]() { return run; } );
    };
    auto wait = [&]( unique_lock<mutex> &ul, milliseconds ms ) -> bool
    {
        nanoseconds waited = nanoseconds( 0 );
        for( ; ; )
        {
            milliseconds rndWaited = duration_cast<milliseconds>( waited + 500us );
            if( rndWaited >= ms )
                return !stop;
            time_point<high_resolution_clock> waitBegin = high_resolution_clock::now();
            cv_status cvs = cvStop.wait_for( ul, ms - rndWaited );
            if( stop )
                return false;
            if( cvs == cv_status::timeout )
                return true;
            waited += duration_cast<nanoseconds>( high_resolution_clock::now() - waitBegin );
        }
    };
    auto logThread = [&]()
    {
        waitStart();
        mt19937_64 mt( (random_device())() );
        uniform_int_distribution<int> uidMsGap( 10, 50 );
        uniform_int_distribution<size_t> uidMsgLength( 10, 100 );
        uniform_int_distribution<int> uidMsgChars( 'a', 'z' );
        int64_t nsCreateT = 0,
                nsInsertT = 0;
        for( ; ; )
        {
            unique_lock<mutex> ul( mtx );
            if( !wait( ul, milliseconds( uidMsGap( mt ) ) ) )
            {
                nsCreate.fetch_add( nsCreateT );
                nsInsert.fetch_add( nsInsertT );
                return;
            }
            ul.unlock();
            time_point<high_resolution_clock> begin = high_resolution_clock::now();
            time_t timeStamp = time( &timeStamp );
            if( timeStamp == (time_t)-1 )
                continue;
            char msg[100];
            size_t msgLength = uidMsgLength( mt );
            if( msgLength < 100 )
                msg[msgLength] = 0;
            for( ; msgLength--; msg[msgLength] = (char)uidMsgChars( mt ) );
            nsCreateT += duration_cast<nanoseconds>( high_resolution_clock::now() - begin ).count();
            begin = high_resolution_clock::now();
            ul.lock();
            if( logRing.full() )
                logRing.pop_front();
            logRing.emplace_back( timeStamp, msg );
            ul.unlock();
            nsInsertT += duration_cast<nanoseconds>( high_resolution_clock::now() - begin ).count();
            ++nEnqueued;
        }
    };
    auto printThread = [&]()
    {
        waitStart();
        vector<log_entry_t> cpyLog;
        for( ; ; )
        {
            unique_lock<mutex> ul( mtx );
            if( !wait( ul, 1s ) )
                return;
            size_t logSize = logRing.size();
            vector<log_entry_t> cpyLog( logSize );
            for( size_t i = 0; i != logSize; ++i )
                cpyLog[i] = logRing[i];
            ul.unlock();
            sort( cpyLog.begin(), cpyLog.end(),
                []( log_entry_t &left, log_entry_t &right )
                {
                    return left.timestamp < right.timestamp;
                } );
        }
    };
    unsigned const MAX_THREADS = 10'000;
    vector<thread> loggers;
    thread printer;
    for( unsigned t = MAX_THREADS; t--; )
        loggers.emplace_back( logThread );
    if( argc > 1 )
        printer = thread( printThread );
    for( ; ; )
    {
        unique_lock<mutex> ul( mtx );
        if( ready == MAX_THREADS + (argc > 1) )
            break;
        cvReady.wait( ul );
    }
    run = true, cvRun.notify_all();
    unsigned const SECS = 10;
    this_thread::sleep_for( seconds( SECS ) );
    stop = true, cvStop.notify_all();
    for( thread &thr : loggers )
        thr.join();
    if( argc > 1 )
        printer.join();
    cout << (double)nsInsert / nsCreate / (1.0e9 / 100.0) << "%" << endl;
    cout << (double)nEnqueued / (double)SECS << endl;
}

This code has 10,000 logging threads and optionally one print thread
(if there's a cmdline param) that gets a copy of the log once a second and
sorts it; the elements might not be strictly sorted, because the
timestamp is taken while the queue is not locked, so the creation of
log entries among different threads can race.

Kaz Kylheku

Aug 21, 2021, 6:45:42 PM

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

Robert Latest

Aug 23, 2021, 3:14:30 AM

Sounds reasonable. But is it? Only if your overhead logic for handing out 1M
pointers into that block is actually faster than simply calling malloc() 1M
times. The only way to find out is to profile it on the target implementation.
I've never written anything so performance-critical that I would have bothered
with that.

Chris M. Thomasson

Aug 23, 2021, 4:23:47 PM

On 8/23/2021 12:14 AM, Robert Latest wrote:
> Chris M. Thomasson wrote:
>> On 8/20/2021 6:25 AM, Robert Latest wrote:
>>> What do you mean by "amortize"?
>>>
>>
>> An example:
>>
>> Instead of calling malloc for each individual node structure, why not
>> use one per, say, 1000000 node structures?
>
> Sounds reasonable. But is it? Only if your overhead logic of handing out 1m
> pointers into that block is actually faster than simply calling malloc() 1m
> times.

IIRC, way back around 20 years ago, I had a multi-layered approach: a
malloc'ed large chunk of nodes would be handled by a wait-free region
allocator. Very fast. Freed objects would go into distributed lock-free
LIFO caches. Threads would even have their own per-thread caches. It
basically went something like this:

allocate object:

Try to allocate from the per-thread cache (no sync)
Try to allocate from the distributed cache (lock-free)
Try to allocate from the region allocator (wait-free)
Try to call malloc to allocate another batch of objects. (who knows?)
If all of those fail, return NULL

deallocate object:

Try to push object into per-thread cache (no sync)
Try to push object into distributed cache (lock-free)
If those both fail, then decrement the reference count of its batch and,
if it drops to zero, deallocate the entire batch in a single call
to free (wait-free decrement); if free is called, then (who knows?)
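
A toy two-layer sketch of that idea in C11 - just the per-thread cache backed by batched malloc, with the lock-free shared layers omitted and invented names (note a node freed by another thread would land in the freeing thread's cache here):

#include <stdlib.h>

struct node { struct node *next; char payload[100]; };

static _Thread_local struct node *tl_cache = NULL;  /* layer 1: no sync */

struct node *node_alloc(void)
{
    if (tl_cache) {                       /* fast path: thread-local pop */
        struct node *n = tl_cache;
        tl_cache = n->next;
        return n;
    }
    /* slow path: one malloc buys a batch of 64 nodes for this thread */
    struct node *batch = malloc(64 * sizeof *batch);
    if (!batch) return NULL;
    for (int i = 1; i < 64; i++) {
        batch[i].next = tl_cache;
        tl_cache = &batch[i];
    }
    return &batch[0];
}

void node_free(struct node *n)
{
    n->next = tl_cache;                   /* push back into local cache */
    tl_cache = n;
}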



> The only way to find out is profiling it on the target implemenation.
> I've never written anything so performance-critical that I would have bothered
> with that.

Way back, I was heavily into creating highly scalable algorithms. I
posted some of them over the years.

Chris M. Thomasson

Aug 30, 2021, 12:30:14 AM

Fwiw, I am more into fractals now, vector fields, math... graphics,
etc... Fwiw, here is some of my work:

real time vector field based DLA, click around on the screen for fun...

http://fractallife247.com/fdla/

Here is my code for the basic field:

http://fractallife247.com/test/vfield/
(click around...)

Here are some of my shaders:

https://www.shadertoy.com/user/Chris_M_Thomasson

:^)

Check this out, WebGL:

http://fractallife247.com/test/webgl/


Love this one, try dragging a point around the screen, see what happens...

https://www.shadertoy.com/view/MdcyRs

;^)

Chris M. Thomasson

Aug 30, 2021, 12:31:26 AM