Proposals for implementing a scheduler in TSAN


Доронин Олег

May 11, 2018, 12:40:58 AM5/11/18
to thread-s...@googlegroups.com, Evgeny Kalishenko, Максим Хижинский, vyu...@gmail.com
Motivation:
At the moment TSan is designed in such a way that if there are errors in multithreaded code, they will be found only if the OS scheduler happens to produce the interleaving that triggers them. Such an approach is quite good for lock-based algorithms, because "lock" operations take up a significant part of the time. For atomic operations and lock-free algorithms, however, potential contention has a lower probability and the OS scheduler may not always reproduce the problems. We propose to implement a scheduler in TSan that carries out manual modelling of different executions, as is done in RRD (https://github.com/dvyukov/relacy), and also simulates the order of variable visibility. Partial discussions can be viewed at https://github.com/dvyukov/relacy/issues/7, https://github.com/dvyukov/relacy/pull/6, https://github.com/dorooleg/compiler-rt/issues/1

There are several approaches:

1. Write a scheduler based on fibers (Linux has makecontext, swapcontext). This is how it is done in RRD; a rough sketch follows the list below.
Advantages:
1. It is easy to implement
2. The ability to focus on specific parts of the source code (if we have a server application, it is likely to contain a "while (true)" loop, which makes it difficult to simulate different executions. If we are interested in a certain region of the code, we can mark it with annotations and simulate different executions only between these annotations. Moreover, we can use the fork mechanism to isolate code fragments, because as a rule only one real thread exists in the application before such fragments, and fork will create a copy of the entire application)
3. Easy application debugging (since the execution is effectively sequential)
Disadvantages:
1. The TLS (thread local storage) problem ("swapcontext" does not handle TLS. One way to solve this is to keep a per-thread copy of TLS and copy it on every switch, but that hurts performance badly. The other way is to substitute the TLS base address for each thread, for example on x86_64 via %fs:0. By the way, it is not clear to me why TSan computes the TLS address through asm("mov %%fs:%c1,%0" : "=r"(descr_addr) : "i"(kThreadSelfOffset));
// Perhaps it should be calculated as described here: https://www.uclibc.org/docs/tls.pdf ? Could anyone explain?)
2. Low performance (this approach can be slow: for example, if we have a 20-core processor and the application runs 20 threads, multiplexing them onto fibers means everything runs on a single core, so the application will run 20 times slower. On the other hand, we can run 20 different application instances, or write a scheduling algorithm that can simulate several paths in parallel)
3. Looping (a fiber can go into an endless loop, for instance on a spinlock or for other reasons. To deal with the hang problem, we can use a timer after which the fiber is forcibly removed from execution).
4. A partly unnatural thread environment (unfortunately, a fiber is not a natural environment for a thread, which can strongly affect the operation of the program).
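
Not actual TSan code, just a rough sketch of how such a cooperative fiber scheduler could look on Linux with ucontext; the names are illustrative, and in a real implementation the yields would be driven from interceptors rather than called explicitly:

#include <ucontext.h>
#include <cstdio>
#include <cstdlib>

static ucontext_t scheduler_ctx;
static ucontext_t fiber_ctx[2];
static char stacks[2][64 * 1024];
static bool done[2];
static int current;

// A simulated "thread" hands control back to the scheduler at every
// interception point.
static void yield_to_scheduler() {
  swapcontext(&fiber_ctx[current], &scheduler_ctx);
}

static void worker(int id) {
  for (int i = 0; i < 3; i++) {
    std::printf("fiber %d step %d\n", id, i);
    yield_to_scheduler();  // in TSan this would happen inside interceptors
  }
  done[id] = true;
  yield_to_scheduler();  // finished; never resumed again
}

int main() {
  for (int i = 0; i < 2; i++) {
    getcontext(&fiber_ctx[i]);
    fiber_ctx[i].uc_stack.ss_sp = stacks[i];
    fiber_ctx[i].uc_stack.ss_size = sizeof stacks[i];
    fiber_ctx[i].uc_link = &scheduler_ctx;  // fall back to the scheduler
    makecontext(&fiber_ctx[i], (void (*)())worker, 1, i);
  }
  // The scheduler repeatedly picks a runnable fiber (here at random);
  // enumerating these choices is what explores different interleavings.
  while (!done[0] || !done[1]) {
    do { current = std::rand() % 2; } while (done[current]);
    swapcontext(&scheduler_ctx, &fiber_ctx[current]);
  }
  return 0;
}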

2. We can write a scheduler based on POSIX threads. To manage the threads we will use a wait-notify mechanism based on a spinlock and create a dedicated thread that manages the others. Only one thread runs at a time. It would probably be desirable to run several threads simultaneously, but it is not yet clear how to do that in a way that actually helps. A sketch of this approach follows the list below.
Advantages:
1. This approach can work faster than fibers since we have "hot threads" waiting on a spinlock.
2. No need to monitor TLS (this is already done).
3. More natural environment in terms of OS multithreading
Disadvantages:
1. We need to create an additional thread to control the rest which affects performance.
2. It will not be possible to use fork as proposed for fibers, because after the fork we will have only one thread
3. No understanding how to run multiple threads at the same time.
4. Need to monitor thread hangs.
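
A hedged sketch of approach 2 (again not TSan code, all names are made up): the application threads are real pthreads, but only the thread currently holding the token in "active" is allowed to make progress, and at every interception point it hands the token to a randomly chosen live thread while everyone else spins "hot":

#include <pthread.h>
#include <atomic>
#include <cstdio>
#include <cstdlib>

static const int kThreads = 2;
static std::atomic<int> active{0};            // id of the thread allowed to run
static bool alive[kThreads] = {true, true};   // only touched by the active thread

// Hand control to a randomly chosen live thread, then wait for our next turn.
static void scheduler_yield(int self, bool exiting) {
  if (exiting) alive[self] = false;
  int next = self;
  int start = std::rand() % kThreads;
  for (int i = 0; i < kThreads; i++) {
    int cand = (start + i) % kThreads;
    if (alive[cand]) { next = cand; break; }
  }
  active.store(next, std::memory_order_release);
  if (exiting) return;                        // do not wait for another turn
  while (active.load(std::memory_order_acquire) != self) { /* hot spin */ }
}

static void *worker(void *arg) {
  int self = (int)(long)arg;
  while (active.load(std::memory_order_acquire) != self) { /* wait for first turn */ }
  for (int i = 0; i < 3; i++) {
    std::printf("thread %d step %d\n", self, i);
    scheduler_yield(self, /*exiting=*/false);  // in TSan: called from interceptors
  }
  scheduler_yield(self, /*exiting=*/true);
  return nullptr;
}

int main() {
  pthread_t tids[kThreads];
  for (long i = 0; i < kThreads; i++)
    pthread_create(&tids[i], nullptr, worker, (void *)i);
  for (int i = 0; i < kThreads; i++)
    pthread_join(tids[i], nullptr);
  return 0;
}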

The problem of global memory must be taken into account in both approaches: after the first iteration of the application the global memory will have been changed. To preserve the initial state of the program we will use the fork mechanism.
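
A minimal sketch of the fork idea (plain fork/wait, no TSan integration): each simulated execution runs in a child process, so whatever it does to global memory is discarded and the next iteration starts from the original state:

#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int global_counter = 0;  // stands in for arbitrary mutable global state

static void run_one_execution(unsigned iteration) {
  global_counter += (int)iteration;  // the child freely mutates globals
  std::printf("iteration %u saw counter %d\n", iteration, global_counter);
}

int main() {
  for (unsigned i = 0; i < 3; i++) {
    pid_t pid = fork();
    if (pid == 0) {               // child: gets a fresh copy of all globals
      run_one_execution(i);
      _exit(0);
    }
    int status = 0;
    waitpid(pid, &status, 0);     // parent: its globals stay untouched
  }
  std::printf("parent counter is still %d\n", global_counter);
  return 0;
}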

Now, there are several questions to the community:
1. Does it make sense to add this to TSan? Could it be useful?
2. Which scheduling approach is better to choose (or should we implement both)?
3. What pitfalls are not taken into account?
4. Are there any other options for how to do it?
5. What can be done with TLS for fibers?
6. What is the best way to activate the scheduling mechanism in TSan (an "#include" plus an additional "DEFINE", enabled by default, or scheduler type selected through TSAN_OPTIONS)?

Dmitry Vyukov

May 11, 2018, 4:24:56 AM5/11/18
to thread-s...@googlegroups.com, Evgeny Kalishenko, Максим Хижинский
On Thu, May 10, 2018 at 10:49 PM, Доронин Олег <doro...@yandex.ru> wrote:
> Motivation:
> At the moment TSan is designed in such a way that if there are errors in multithreaded code, they will be found only if the OS scheduler happens to produce the interleaving that triggers them. Such an approach is quite good for lock-based algorithms, because "lock" operations take up a significant part of the time. For atomic operations and lock-free algorithms, however, potential contention has a lower probability and the OS scheduler may not always reproduce the problems. We propose to implement a scheduler in TSan that carries out manual modelling of different executions, as is done in RRD (https://github.com/dvyukov/relacy), and also simulates the order of variable visibility. Partial discussions can be viewed at https://github.com/dvyukov/relacy/issues/7, https://github.com/dvyukov/relacy/pull/6, https://github.com/dorooleg/compiler-rt/issues/1

Hi Олег,

TSan must not lose its ability to work with arbitrary programs out of the box; it's one of its main advantages. Fibers are quite problematic in this context:


> There are several approaches:
>
> 1. Write a scheduler based on fibers (Linux has makecontext, swapcontext). This is how it is done in RRD.
> Advantages:
> 1. It is easy to implement

This looks questionable to me. Compared to threads, fibers will require more work: manual scheduling, context switches, turning threads into fibers, etc. So this looks like strictly more work compared to threads.


> 2. The ability to focus on specific parts of the source code (if we have a server application, it is likely to contain a "while (true)" loop, which makes it difficult to simulate different executions. If we are interested in a certain region of the code, we can mark it with annotations and simulate different executions only between these annotations.

This part is the same for threads, right? We can use annotations for
threads too.

> Moreover, we can use the fork mechanism to isolate code fragments, because as a rule only one real thread exists in the application before such fragments, and fork will create a copy of the entire application)

This is different.


> 3. Easy application debugging (since the execution is effectively sequential)
> Disadvantages:
> 1. The TLS (thread local storage) problem ("swapcontext" does not handle TLS. One way to solve this is to keep a per-thread copy of TLS and copy it on every switch, but that hurts performance badly. The other way is to substitute the TLS base address for each thread, for example on x86_64 via %fs:0. By the way, it is not clear to me why TSan computes the TLS address through asm("mov %%fs:%c1,%0" : "=r"(descr_addr) : "i"(kThreadSelfOffset));
> // Perhaps it should be calculated as described here: https://www.uclibc.org/docs/tls.pdf ? Could anyone explain?)

Note that besides TLS there are also signal masks, gettid, and most likely a very long tail of other ways to observe the difference between fibers and threads.


> 2. Low performance (this approach can be slow: for example, if we have a 20-core processor and the application runs 20 threads, multiplexing them onto fibers means everything runs on a single core, so the application will run 20 times slower. On the other hand, we can run 20 different application instances, or write a scheduling algorithm that can simulate several paths in parallel)
> 3. Looping (a fiber can go into an endless loop, for instance on a spinlock or for other reasons. To deal with the hang problem, we can use a timer after which the fiber is forcibly removed from execution).

This is a problem. There surely are waits that we will not intercept, and tsan must not hang in such cases. Note that an asynchronous switch is very problematic, I am not even sure how to do it. Relacy is all synchronous, only the thread itself can decide to switch.


> 4. A partly unnatural thread environment (unfortunately, a fiber is not a natural environment for a thread, which can strongly affect the operation of the program).
>
> 2. We can write a scheduler based on POSIX threads. To manage the threads we will use a wait-notify mechanism based on a spinlock and create a dedicated thread that manages the others. Only one thread runs at a time. It would probably be desirable to run several threads simultaneously, but it is not yet clear how to do that in a way that actually helps.
> Advantages:
> 1. This approach can work faster than fibers since we have "hot threads" waiting on a spinlock.
> 2. No need to monitor TLS (this is already done).
> 3. More natural environment in terms of OS multithreading
> Disadvantages:
> 1. We need to create an additional thread to control the rest which affects performance.

Not necessary. I would even say adding an additional thread would be wrong. The pthread_mutex_unlock interceptor simply decides whether it wants to continue running the current thread or switch to the newly unblocked thread.
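
A hedged sketch of that idea (the wrapper and hook names are invented for illustration and are not the real TSan interceptor API): the decision is taken right in the unlock path, with no dedicated control thread:

#include <pthread.h>
#include <cstdlib>

// Hypothetical scheduler hooks; a real implementation would mark the waiter
// on m as runnable and park the current thread. Stubbed out here.
static void scheduler_wake_waiter(pthread_mutex_t *) {}
static void scheduler_yield_current() {}

int wrapped_pthread_mutex_unlock(pthread_mutex_t *m) {
  int res = pthread_mutex_unlock(m);
  scheduler_wake_waiter(m);      // the thread blocked on m becomes runnable
  if (std::rand() % 2)           // coin flip: keep running, or switch to it now
    scheduler_yield_current();
  return res;
}

int main() {
  pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
  pthread_mutex_lock(&m);
  wrapped_pthread_mutex_unlock(&m);
  return 0;
}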

> 2. It will not be possible to use fork as proposed for fibers, because after the fork we will have only one thread

True. But note that fork is problematic for multiple other reasons,
for example, if a program has created a socket, after a fork, both
processes will refer to the same socket and chaos will ensue.
Resources can't generally be forked.

> 3. No understanding how to run multiple threads at the same time.

There is some research on this:
https://dl.acm.org/citation.cfm?id=2254128
https://www.cs.rutgers.edu/~santosh.nagarakatte/papers/pldi12_needlepoint.pdf
and probably more.

> 4. Need to monitor thread hangs.

The same for fibers. But for threads it's much simpler. If we suspect
hangs, we can unblock other threads and let more threads run in
parallel.


> The problem of global memory must be taken into account in both approaches: after the first iteration of the application the global memory will have been changed. To preserve the initial state of the program we will use the fork mechanism.
>
> Now, there are several questions to the community:
> 1. Does it make sense to add this to TSan? Could it be useful?
> 2. Which scheduling approach is better to choose (or should we implement both)?
> 3. What pitfalls are not taken into account?
> 4. Are there any other options for how to do it?
> 5. What can be done with TLS for fibers?
> 6. What is the best way to activate the scheduling mechanism in TSan (an "#include" plus an additional "DEFINE", enabled by default, or scheduler type selected through TSAN_OPTIONS)?


It would definitely be useful.

Since this is a large task, we need to figure out what would be an MVP
(minimal viable product). I think we need to start with adding
required annotations to synchronization primitives and thread control
functions (a-la what was done for TSan v1:
https://github.com/dvyukov/data-race-test-1/blob/master/earthquake/earthquake_wrap.h).
Then this already can be turned into a useful thing by just adding
random sleeps in the annotations (a-la
https://github.com/dvyukov/data-race-test-1/blob/master/earthquake/earthquake_core.c).
And this is commit-able by itself, so this is a good starting point.
After that I would add systematic scheduling with threads, as it's simpler and will work with any program. But note that some programs will always use only the random mode, because they are too huge for any systematic analysis.
After that I would look at whether we still want fibers for some cases or not.
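
As an illustration of the "random sleeps" starting point, here is a hedged sketch (the hook name is invented; in TSan it would live in the annotations/interceptors rather than in user code). Even this much makes the race in the demo fire far more often than under the plain OS schedule:

#include <unistd.h>
#include <cstdlib>
#include <cstdio>
#include <thread>

// Hypothetical hook that synchronization annotations would call.
static void on_sync_operation() {
  if (std::rand() % 4 == 0)      // perturb only a fraction of the operations
    usleep(std::rand() % 1000);  // sleep up to ~1ms
}

static int shared;               // intentionally racy for the demo

int main() {
  auto body = [] {
    for (int i = 0; i < 100; i++) {
      on_sync_operation();       // in real use this sits at sync points
      shared++;                  // unsynchronized increment: a data race
    }
  };
  std::thread t1(body), t2(body);
  t1.join();
  t2.join();
  std::printf("%d\n", shared);
  return 0;
}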

Yes, the runtime behavior should be controlled with flags passed in
TSAN_OPTIONS.

FTR, some research in this area:

https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2007-149.pdf
https://www.semanticscholar.org/paper/Delay-bounded-scheduling-Emmi-Qadeer/6e69ad3daf1d4dfe98b83a4ed448cfa0ff016102
https://research.cs.wisc.edu/wpis/papers/cav08.pdf
https://dl.acm.org/citation.cfm?id=1375625
https://link.springer.com/chapter/10.1007/978-3-642-00768-2_11
https://dl.acm.org/citation.cfm?id=1736040
https://dl.acm.org/citation.cfm?id=1453121
https://dl.acm.org/citation.cfm?id=1321679

Доронин Олег

May 18, 2018, 5:27:35 AM5/18/18
to thread-s...@googlegroups.com, Evgeny Kalishenko, Максим Хижинский, vyu...@gmail.com
> TSan must not lose its ability to work with arbitrary programs out of the box; it's one of its main advantages. Fibers are quite problematic in this context:

I agree that this is a drawback of fibers, but at the same time they can be useful

> This looks questionable to me. Compared to threads, fibers will require more work: manual scheduling, context switches, turning threads into fibers, etc. So this looks like strictly more work compared to threads.

As practice shows, you're right.

> This part is the same for threads, right? We can use annotations for
> threads too.

At the moment I don't see how this can be used with real threads, because we want to preserve the global state of the program. I'll keep thinking about this issue.

> Note that besides TLS there are also signal masks, gettid, and most likely a very long tail of other ways to observe the difference between fibers and threads.

When swapping TLS through %fs:0 I ran into the fact that TSan computes TLS through %fs:16, but after switching to %fs:0 it worked. More research is needed to understand what is going on here.
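
For reference, a small sketch of what reading the thread pointer through %fs:0 looks like on x86_64 Linux/glibc (as far as I understand, glibc keeps a pointer to the TCB itself at offset 0 from %fs; this only illustrates the layout under discussion and is not TSan code):

#include <cstdio>
#include <cstdint>

static uintptr_t read_thread_pointer() {
  uintptr_t tp;
  // On x86_64/glibc the word at %fs:0 is a pointer to the TCB itself.
  asm("mov %%fs:0, %0" : "=r"(tp));
  return tp;
}

int main() {
  std::printf("thread pointer: 0x%lx\n", (unsigned long)read_thread_pointer());
  return 0;
}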

> This is a problem. There surely are waits that we will not intercept, and tsan must not hang in such cases. Note that an asynchronous switch is very problematic, I am not even sure how to do it. Relacy is all synchronous, only the thread itself can decide to switch.

Thanks for the comment. It is really a problem; I'll think about how to solve it. As an option, we can track all memory reads and writes and switch threads synchronously, but that won't solve all the problems.

> Not necessary. I would even say adding an additional thread would be wrong. The pthread_mutex_unlock interceptor simply decides whether it wants to continue running the current thread or switch to the newly unblocked thread.

I agree

Thanks for the links to the articles. I'll take the time to study them.


Current progress

1. What is done at the moment:

There is already an implementation of a simple thread scheduling system, based on the Relacy Race Detector approach: https://github.com/llvm-mirror/compiler-rt/compare/master...dorooleg:schedulers
The current version supports:
Fiber + TLS copy [fiber_tls_copy]
Fiber + TLS swap (writing to %fs:0) [fiber_tls_swap]
Pthread [pthread]
Different scheduling algorithms:
Random [random]
Random with different distributions [random_with_different_distributions]
Full path search [full_search]
Full path search with fixed window [fixed_window]
Full states search [all_states]

Example of use: export TSAN_OPTIONS="scheduler_platform=fiber_tls_swap scheduler_type=full_path"

But there are a number of defects:
- Problems with C code building. Need to get rid of STL
- The problem with hangs has not been solved
- There are places where you can speed up the algorithms
- Useful annotations not added yet
- Mutexes are not handled
- Nothing was implemented from the articles
- There may be other problems. We need to look more closely

2. The class diagram is in the attachment

3. An example for which TSan without the scheduler does not detect the data race in my working environment:
#include <atomic>
#include <iostream>
#include <thread>

std::atomic_int d;
int a = 0;

int main()
{
    std::thread t1([]() { ++d; ++a; ++d; });
    std::thread t2([]() { ++d; ++a; ++d; });
    t1.join();
    t2.join();
    std::cout << d << a << std::endl;
}

Another interesting example, notable for its possible results: the program output can be anywhere from 2 to 10 (the minimum of 2 is reached when each thread's stale store wipes out the other thread's accumulated increments at just the right moments). It can be explored with the full path search and with the random scheduler with different distributions.

#include <atomic>
#include <iostream>
#include <thread>

std::atomic_int value{0};

int main()
{
    std::thread t1([&]() {
        for (int j = 0; j < 5; j++) {
            auto r = value.load();
            r++;
            value.store(r);
        }
    });
    std::thread t2([&]() {
        for (int j = 0; j < 5; j++) {
            auto r = value.load();
            r++;
            value.store(r);
        }
    });

    t1.join();
    t2.join();

    std::cout << value.load() << std::endl;
}






diagram_classes.svg

Dmitry Vyukov

Jun 7, 2018, 8:56:11 AM6/7/18
to thread-sanitizer, Evgeny Kalishenko, Максим Хижинский, vyu...@gmail.com
On Fri, May 18, 2018 at 11:27 AM, Доронин Олег <doro...@yandex.ru> wrote:
>> TSan must not lose its ability to work with arbitrary programs out of the box; it's one of its main advantages. Fibers are quite problematic in this context:
>
> I agree that this is a drawback of fibers, but at the same time they can be useful
>
>> This looks questionable to me. Compared to threads, fibers will require more work: manual scheduling, context switches, turning threads into fibers, etc. So this looks like strictly more work compared to threads.
>
> As practice shows, you're right.
>
>> This part is the same for threads, right? We can use annotations for
>> threads too.
>
> At the moment I don't see how this can be used with real threads, because we want to preserve the global state of the program. I'll keep thinking about this issue.

Recovering from a vacation...

The same way as with fibers -- don't do any scheduling before we reach
the interesting region and after we leave it.
Yes, we can't use STL inside of the tsan runtime; the STL can be instrumented too.
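
A hedged sketch of the region idea (the annotation names are invented for illustration; they are not existing TSan annotations): scheduling hooks simply do nothing outside the annotated region, so the rest of the program runs under the ordinary OS scheduler:

#include <atomic>
#include <cstdio>

static std::atomic<bool> scheduling_enabled{false};

// Hypothetical user-facing annotations marking the interesting region.
void AnnotateSchedulerRegionBegin() { scheduling_enabled = true; }
void AnnotateSchedulerRegionEnd() { scheduling_enabled = false; }

// Every scheduler hook bails out early outside the region.
void scheduler_hook() {
  if (!scheduling_enabled.load(std::memory_order_relaxed))
    return;
  std::printf("scheduling decision point\n");  // pick the next thread / yield here
}

int main() {
  scheduler_hook();                 // no-op: outside the region
  AnnotateSchedulerRegionBegin();
  scheduler_hook();                 // inside the region: scheduling is active
  AnnotateSchedulerRegionEnd();
  return 0;
}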

> - The problem with hangs has not been solved
> - There are places where you can speed up the algorithms
> - Useful annotations not added yet
> - Mutexes are not handled
> - Nothing was implemented from the articles
> - There may be other problems. We need to look more closely

I would suggest starting small. Threads + a random scheduler is enough for starters and will allow us to shake out lots of common stuff.
These two examples are very nice!

Dmitry Vyukov

Jun 7, 2018, 8:57:45 AM6/7/18
to thread-sanitizer, Evgeny Kalishenko, Максим Хижинский
[drop invalid email address]