> On Nov 19, 2019, at 04:44, Éloi Rivard <azm...@gmail.com> wrote:
>
> Hello, There are a few things I wonder about ZODB and concurrency, and I am not sure of the answers the documentation gives.
> The documentation states that database connections, transaction managers and transactions are not thread safe. It also states that we cannot share databases between processes. But can we share databases and storages between threads?
A Connection instance, and all the persistent objects accessed through it, may be used from exactly one thread (or greenlet!) at a time; the same goes for the transaction/manager associated with the connection.
A DB instance is there to vend Connection instances, and it can do so from any number of threads concurrently. How the DB, Connection, and storage cooperate to make this work is an implementation detail.
So this example should be fine:
> import ZODB, threading
> db = ZODB.DB(None)
> def work():
>     with db.transaction() as conn:
>         do_stuff()
>
> threading.Thread(target=work).start()
> threading.Thread(target=work).start()
But this would be broken:
def process_chunk(container, chunk_number):
    chunk = container[chunk_number]
    # Do stuff with chunk

db = get_the_db()
conn = db.open()
root = conn.root()
# Both threads receive the same root object, and with it the same Connection.
threading.Thread(target=process_chunk, args=(root, 1)).start()
threading.Thread(target=process_chunk, args=(root, 2)).start()
That's broken because it's sharing the same persistent object (the root) between threads without any sort of locking to ensure that the underlying Connection (hidden in _p_jar) is only used from one thread at a time.
The cross-thread sharing is pretty blatant here. A more subtle trap is shared caches of persistent objects.
cache = {}  # shared by every thread
db = get_the_db()

def process_chunk(chunk_number):
    result = []
    conn = db.open()
    root = conn.root()
    chunk = root[chunk_number]
    for part in chunk:
        answer = cache.get(part)
        if answer is None:
            answer = part.find_or_compute_answer()
            if answer._p_jar is None:
                conn.add(answer)  # could be new, could be previously saved
            cache[part] = answer
        result.append(answer)
    return result

threading.Thread(target=process_chunk, args=(1,)).start()
threading.Thread(target=process_chunk, args=(2,)).start()
If any of the chunks use the same part, we could be accessing objects from different connections in different threads simultaneously.
Sometimes a shared cache like that can be found at module level. That's destined to eventually result in a ConnectionStateError, but because of ZODB's caching, it may not be until the application is under significant load in a production setting. (Don't do that.)
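One possible way out of the shared-cache trap, sketched here under assumptions rather than offered as a ZODB recipe, is to give every connection its own cache so that cached objects never cross connections. FakeConnection and cache_for are hypothetical names used only for illustration, and the sketch assumes real Connection objects can be used as weak dictionary keys.

```python
import threading
import weakref


class FakeConnection:
    """Hypothetical stand-in for a ZODB Connection (illustration only)."""


# One private cache per connection. An entry disappears when its
# connection is garbage collected.
_caches = weakref.WeakKeyDictionary()
_caches_lock = threading.Lock()  # guards the outer mapping only


def cache_for(conn):
    """Return the cache dict that belongs to this connection alone."""
    with _caches_lock:
        return _caches.setdefault(conn, {})


def worker(conn, key, value):
    # Each thread works with exactly one connection and its cache.
    cache_for(conn)[key] = value


conn_a, conn_b = FakeConnection(), FakeConnection()
threads = [
    threading.Thread(target=worker, args=(conn_a, "part", "answer-a")),
    threading.Thread(target=worker, args=(conn_b, "part", "answer-b")),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Both threads cache the same key, yet neither ever sees the other's answer. In a real application the cached values would still be tied to the connection's transaction state, so entries for objects added in an aborted transaction would need to be discarded.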
Speaking of ZODB's caching, very often many of the cross-threading issues can be hidden by it: if an object is already in memory, there will be no need to use the _p_jar at all and ZODB is taken out of the equation entirely. Then you're just left with the normal concerns about using objects from multiple threads. (All of the examples above are probably that way.) That's all fine and good until an object gets ghosted (usually at the most inopportune time) or your access pattern changes just enough that an object that was usually in memory before, or was never accessed at all before, suddenly has to be loaded. I suggest having your tests frequently make use of DB.cacheMinimize() (e.g., between requests/transactions) to help smoke out issues like that; you can even make adjustments to the DB's connection pool so that sequential requests don't get the same Connection object.
Is all of that to say there's no way to parallelize work using a single Connection and transaction? Not at all. It can be done, but (like anything involving concurrency) it takes careful design. It's certainly easier to stick to the one-thread-per-connection/one-connection-per-thread/nothing-shared-except-the-DB model that's encouraged by default.
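As a rough illustration of what such a design might look like (one option among several, not a prescribed pattern), a single owner thread can do all the reading and writing through the Connection and hand only plain, non-persistent copies to worker threads. In this sketch a plain dict stands in for conn.root(), and compute and process_all are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for data reachable from conn.root(). In real code
# these values would be persistent objects tied to one Connection.
root = {1: [1, 2, 3], 2: [10, 20, 30]}


def compute(plain_chunk):
    """Pure function: touches only the plain list it was handed."""
    return sum(plain_chunk)


def process_all(root, chunk_numbers):
    # Owner thread: the only code that reads from `root`.
    plain = {n: list(root[n]) for n in chunk_numbers}
    # Workers see plain copies only, never anything from `root`.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {n: pool.submit(compute, chunk) for n, chunk in plain.items()}
        answers = {n: f.result() for n, f in futures.items()}
    # Owner thread again: the only code that writes to `root`.
    root["answers"] = answers
    return answers


answers = process_all(root, [1, 2])
```

The workers never hold a reference to anything that came out of the database, so nothing they do can reach a _p_jar.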
HTH,
Jason