> I could redesign my "cacheline ping-pong" code to use TSX-RTM without
> having tested this on my PC; so could anyone here run the numbers on
> that?
So here's the code:
#if defined(_MSC_VER)
#include <Windows.h>
#include <intrin.h>
#elif defined(__unix__)
#include <sys/sysinfo.h>
#include <sched.h>
#include <pthread.h>
#include <immintrin.h>
#endif
#include <iostream>
#include <thread>
#include <cstddef>
#include <atomic>
#include <functional>
#include <chrono>
#include <vector>
#include <cstdlib>
#include <cmath>
#include <array>
unsigned getNumberOfProcessors();
bool hasTSX();
using namespace std;
using namespace chrono;
// atomic fetch-and-add ("lock xadd" on x86)
inline
size_t fetchAdd( size_t volatile &v, size_t a )
{
#if defined(_MSC_VER)
    #if defined(_M_X64)
    return (size_t)_InterlockedExchangeAdd64( &(__int64 &)v, (__int64)a );
    #elif defined(_M_IX86)
    return (size_t)_InterlockedExchangeAdd( &(long &)v, (long)a );
    #else
        #error unsupported architecture
    #endif
#elif defined(__GNUC__) || defined(__clang__)
    return __sync_fetch_and_add( &v, a );
#else
    #error unsupported architecture
#endif
}
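// Aside, for illustration only (not used by the benchmark): a portable
// std::atomic version compiles to the same "lock xadd" on x86; the name
// fetchAddStd is mine.
inline
size_t fetchAddStd( atomic<size_t> &v, size_t a )
{
    return v.fetch_add( a );
}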
// atomic compare-and-swap: returns the previous value of v
inline
size_t compareExchange( size_t volatile &v, size_t c, size_t x )
{
#if defined(_MSC_VER)
    #if defined(_M_X64)
    return (size_t)_InterlockedCompareExchange64( &(__int64 &)v,
        (__int64)x, (__int64)c );
    #elif defined(_M_IX86)
    return (size_t)_InterlockedCompareExchange( &(long &)v, (long)x,
        (long)c );
    #else
        #error unsupported architecture
    #endif
#elif defined(__GNUC__) || defined(__clang__)
    return __sync_val_compare_and_swap( &v, c, x );
#else
    #error unsupported architecture
#endif
}
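// Likewise for illustration only: the CAS-increment retry loop that
// thrXchg builds below, written with std::atomic; compare_exchange_weak
// reloads cmp with the current value on failure. casAdd is my name.
inline
size_t casAdd( atomic<size_t> &v, size_t a )
{
    size_t cmp = v.load();
    while( !v.compare_exchange_weak( cmp, cmp + a ) );
    return cmp;
}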
// transactional fetch-and-add; on abort control returns from _xbegin()
// with a status != _XBEGIN_STARTED, and _xend() must only execute inside
// a transaction, so check the status and retry
inline
void rtmFetchAdd( size_t volatile &v, size_t a )
{
    for( ; ; )
        if( _xbegin() == _XBEGIN_STARTED )
        {
            v += a;
            _xend();
            return;
        }
}
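// RTM gives no forward-progress guarantee, so a transaction can abort
// forever under heavy contention; production code would bound the
// retries and fall back to the locked primitive. A sketch (hypothetical
// helper, not used by the benchmark):
inline
void rtmFetchAddWithFallback( size_t volatile &v, size_t a )
{
    for( int tries = 10; tries; --tries )
        if( _xbegin() == _XBEGIN_STARTED )
        {
            v += a;
            _xend();
            return;
        }
    fetchAdd( v, a ); // guaranteed progress
}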
int main( int argc, char **argv )
{
    if( argc < 2 )
        return -1;
    // argv[1]: CPU clock frequency in GHz
    double secondsPerClockCycle = 1.0 / (atof( argv[1] ) * 1.0e9);
    auto thrXadd = []( uint8_t volatile &run, size_t adds,
        size_t volatile &atm, atomic<size_t> &misses )
    {
        while( !run );
        for( size_t i = adds; i; --i )
            fetchAdd( atm, 1 );
    };
    auto thrXchg = []( uint8_t volatile &run, size_t adds,
        size_t volatile &atm, atomic<size_t> &misses )
    {
        while( !run );
        size_t missed = 0;
        for( size_t i = adds, cmp = atm; i; --i )
        {
            for( size_t res; ; )
                if( (res = compareExchange( atm, cmp, cmp + 1 )) == cmp )
                {
                    ++cmp;
                    break;
                }
                else
                {
                    cmp = res;
                    ++missed;
                }
        }
        misses.fetch_add( missed );
    };
    auto rtmAdd = []( uint8_t volatile &run, size_t adds,
        size_t volatile &atm, atomic<size_t> &misses )
    {
        while( !run );
        for( size_t i = adds; i; --i )
            rtmFetchAdd( atm, 1 );
    };
    using threadfunc = void (*)( uint8_t volatile &, size_t,
        size_t volatile &, atomic<size_t> & );
    array<threadfunc, 3> atf;
    array<char const *, 3> threadDescr;
    size_t nTests;
    size_t const ADDS = 10'000'000;
    unsigned nProcessors = getNumberOfProcessors();
    atf[0] = thrXadd;
    atf[1] = thrXchg;
    atf[2] = rtmAdd;
    threadDescr[0] = "xadd-thread";
    threadDescr[1] = "cmpxchg-thread";
    threadDescr[2] = "rtm-thread";
    // skip the RTM test when the CPU has no TSX support
    nTests = hasTSX() ? atf.size() : atf.size() - 1;
    for( size_t m = 0; m != nTests; ++m )
    {
        cout << threadDescr[m] << ":" << endl;
        for( unsigned nThreads = 1; nThreads <= nProcessors; ++nThreads )
        {
            atomic<size_t> misses( 0 );
            uint8_t run = false;
            size_t atm = 0;
            vector<thread> threads;
            for( unsigned i = 0; i != nThreads; ++i )
            {
                threads.emplace_back( atf[m], ref( run ), ADDS,
                    ref( atm ), ref( misses ) );
                // pin each thread to its own logical CPU
#if defined(_MSC_VER)
                SetThreadAffinityMask( threads[i].native_handle(),
                    (DWORD_PTR)1 << i );
#elif defined(__unix__)
                cpu_set_t cpuset;
                CPU_ZERO( &cpuset );
                CPU_SET( i, &cpuset );
                pthread_setaffinity_np( threads[i].native_handle(),
                    sizeof cpuset, &cpuset );
#endif
            }
            time_point<high_resolution_clock> start = high_resolution_clock::now();
            run = true;
            for( unsigned i = 0; i != nThreads; ++i )
                threads[i].join();
            uint64_t ns = (uint64_t)duration_cast<nanoseconds>(
                high_resolution_clock::now() - start ).count();
            double secondsPerAdd = (double)ns / nThreads / ADDS / 1.0e9;
            cout << "threads: " << nThreads
                 << " cycles: " << secondsPerAdd / secondsPerClockCycle
                 << " misses-ratio: "
                 << (int)(100.0 * (size_t)misses / nThreads / ADDS) << "%" << endl;
        }
        cout << endl;
    }
}
unsigned getNumberOfProcessors()
{
#if defined(_MSC_VER)
    SYSTEM_INFO si;
    GetSystemInfo( &si );
    return (unsigned)si.dwNumberOfProcessors;
#elif defined(__unix__)
    return (unsigned)get_nprocs();
#endif
}
bool hasTSX()
{
#if defined(_MSC_VER)
    int regs[4];
    __cpuidex( regs, 7, 0 );
    // CPUID.(EAX=7,ECX=0):EBX bit 11 signals RTM support
    return regs[1] & (1 << 11);
#else
    // no detection here yet; see the note below the code
    return true;
#endif
}
With GCC the code has to be compiled with -mrtm. I was too lazy to code
proper RTM detection for Linux, so on Linux machines the code will crash
without RTM support. But RTM is what I'm after here; so could someone
with a TSX-/RTM-enabled CPU please compile this on their machine and
paste the output here? What I'm interested in is whether rtmFetchAdd()
is faster than compareExchange() or fetchAdd().
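For reference, RTM detection with GCC/Clang could look roughly like the
sketch below, assuming <cpuid.h> with __get_cpuid_count is available
(i.e. a reasonably recent compiler); it is a drop-in replacement for the
MSVC-only branch of hasTSX() above:

#include <cpuid.h>

// CPUID.(EAX=7,ECX=0):EBX bit 11 signals RTM support
bool hasTSX()
{
    unsigned eax, ebx, ecx, edx;
    if( !__get_cpuid_count( 7, 0, &eax, &ebx, &ecx, &edx ) )
        return false;
    return ebx & (1u << 11);
}

A build and run on Linux would then be something like this (file name
and clock frequency are placeholders; the argument is the CPU clock in
GHz):

g++ -O2 -mrtm -pthread pingpong.cpp -o pingpong
./pingpong 3.6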