Shared library compiled each time on the cluster's nodes

Francesco Lombardi

Sep 12, 2017, 5:19:04 PM9/12/17
to rootp...@googlegroups.com

Dear all,
The problem is that rootpy regenerates (under a new name) the shared libraries in ~/.cache/rootpy/x86_64-.../dicts each time, so a huge number of processes (more than 400 in the queue) all try to read from and write to the same disk, stalling the disk itself.

The processes remain in this condition for hours:
```
rootpy.stl  : INFO     generating dictionary for std::vector<ReconstructedPosition> ...
ROOT.TUnixSystem.ACLiC: INFO     creating shared library /home/userhome/.cache/rootpy/x86_64-60403/dicts/812d0818ef1d4c71.so
rootpy.stl  : INFO     generating dictionary for std::vector<Peak> ...
ROOT.TUnixSystem.ACLiC: INFO     creating shared library /home/userhome/.cache/rootpy/x86_64-60403/dicts/74edb360299dab24.so
rootpy.stl  : INFO     generating dictionary for std::vector<Pulse> ...
ROOT.TUnixSystem.ACLiC: INFO     creating shared library /home/userhome/.cache/rootpy/x86_64-60403/dicts/a70dfff531c29809.so
rootpy.stl  : INFO     generating dictionary for std::vector<TriggerSignal> ...
ROOT.TUnixSystem.ACLiC: INFO     creating shared library /home/userhome/.cache/rootpy/x86_64-60403/dicts/adff6edd077a1873.so
```
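
For reference, these messages appear the first time an STL type is requested through rootpy.stl, roughly like this (the stl.string example is from the rootpy README; a custom class such as Peak would additionally need its header visible to ACLiC):

```python
import rootpy.stl as stl

# Requesting a templated type compiles a dictionary on first use and
# caches the resulting .so under ~/.cache/rootpy/<arch>/dicts
StrVector = stl.vector(stl.string)
v = StrVector()
v.push_back("hello")
```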

The question is: why doesn't it reuse the same library, or where is the problem?
Is there a way to define a fixed path or set of files?

--
    .---.
   | o_o |
   | \_/ |
  //   \ \
 (|     | )
/'\_   _/'\
\___)=(___/

Noel Dawe

Sep 13, 2017, 1:08:09 AM9/13/17
to rootp...@googlegroups.com
Hi Francesco,

If the dictionaries don't already exist in ~/.cache/rootpy/[...] then rootpy will attempt to build them. Before doing this a lock is acquired, so no other parallel process will clobber the dictionary generation. Once a dictionary is generated the lock is released, and any other process asking whether that specific dictionary exists will now see that it does and will not attempt to generate it.

This is all fine if you only have a few processes running, but as you scale up you can hit problems like yours: you essentially have 399 processes all waiting to acquire a lock on the same directory. If there are only a few dictionaries to generate and they are quick, it should still be fine; the job that happens to acquire the lock first will generate them and the rest will use those.

If any job happens to wait longer than 60 seconds, it will assume the lock is stale (locks can sometimes be left around if a previous process crashed hard), break it, and acquire the lock itself. Some of your 400 jobs could be hitting this limit and then end up clobbering each other.
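
For illustration only (not rootpy's actual code), here is a minimal sketch of the lockfile scheme described above, including the stale-lock timeout. The race in breaking a stale lock is exactly how waiters can end up clobbering each other:

```python
import os
import time

STALE_AFTER = 60  # seconds; matches the timeout mentioned above

def acquire_lock(lock_path, poll=0.5):
    """Spin until we create the lockfile; break it if it looks stale."""
    while True:
        try:
            # O_EXCL makes creation atomic: exactly one process succeeds
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return
        except FileExistsError:
            try:
                age = time.time() - os.path.getmtime(lock_path)
            except FileNotFoundError:
                continue  # holder just released it; retry immediately
            if age > STALE_AFTER:
                # Several waiters can reach this point at once and all
                # "break" the lock -- this is the clobbering scenario.
                try:
                    os.unlink(lock_path)
                except FileNotFoundError:
                    pass
                continue
            time.sleep(poll)

def release_lock(lock_path):
    os.unlink(lock_path)
```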

Try setting ROOTPY_GRIDMODE=1 in your job environment (or DEBUG=1) and rootpy will save its user data in /tmp/random_unique_tmpdir. Then each job will generate its own dictionaries. Plus, if /tmp is locally mounted on each node in your batch system, they won't all be writing to the same disk.
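
A minimal sketch of doing this from inside the job script, assuming the variable must be set before the first rootpy import (setting it in the batch submission environment instead avoids that ordering concern entirely):

```python
import os

# Must happen before anything imports rootpy (assumption: the cache
# location is decided at import time)
os.environ['ROOTPY_GRIDMODE'] = '1'

import rootpy.stl as stl  # dictionaries now go to a per-job tmpdir
```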

Best,
Noel


