rez-env hanging when certain packages are appended to the request


brent...@gmail.com

Jan 20, 2022, 2:30:28 PM
to rez-config
Has anyone else encountered this?  I have a fairly complex package request list to my rez-env call.  It's 10 packages that expand to 226 in the actual rez environment.  However, when I add something fairly trivial to the request list, rez-env hangs.  Or at least it doesn't resolve in under 20 minutes before I kill it.  In this case, I'm adding a custom built pyside2 package to the request.

What's odd is that if I pair my pyside2 request with an additional "qt-5" request, the resolve works although qt-5 is already in the requires list of pyside2.  Or, if I move the pyside2 request earlier in the list, the resolve also works.

In another test, instead of requesting "pyside2", I requested the exact version that I get in my resolved environment using one of the two methods described above.  In this case, I requested "pyside2-5.12.6.x.2.0.0.0".  The resolve also hangs.

We've encountered a lot more of these rez-env hangs caused by adding specific packages to the request.  I know that how rez finds the intersection of compatible packages is complicated so I don't think there is a straightforward answer.  My two questions are:
  1. Anyone else encounter this behavior?
  2. What's the best way to troubleshoot rez-env hangs?

allan.johns

Feb 7, 2022, 6:05:04 PM
to rez-config
What you're seeing is a very long solve as opposed to a hang, though of course that's just a technical distinction to the end user!

This can happen when a package coming in at the end of a resolve introduces a requirement that conflicts with one that came in nearer the beginning. This effectively "unrolls" the solver to its earlier state. The cumulative effect on solve time is exponential and is affected by the depth at which this occurs and by the number of packages involved. It's a bit more complicated than that, but that's the crux of it.

You'll generally be able to avoid this by being more explicit with versioning where possible, and by moving the request to the start of the list rather than the end.
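As a rough illustration of that advice, the sketch below rewrites a request list so that one explicitly versioned package comes first. The helper names and the example request are made up for this post, and the name parsing is a simplification (it assumes package names consist only of letters, digits and underscores).

```python
import re


def package_name(request):
    """Return the package-name part of a request such as 'qt-5' or '!foo'.

    Simplifying assumption: names contain only letters, digits and underscores.
    """
    match = re.match(r"[!~]*([A-Za-z0-9_]+)", request)
    return match.group(1) if match else request


def move_to_front(request, pinned):
    """Put `pinned` first, dropping any other request for the same package."""
    name = package_name(pinned)
    rest = [r for r in request if package_name(r) != name]
    return [pinned] + rest


# Hypothetical request: pyside2 was originally appended at the end, unversioned.
request = ["maya-2022", "usd", "pyside2"]
print("rez-env " + " ".join(move_to_front(request, "pyside2-5.12")))
```

The printed line is only the command you would then try by hand; nothing here calls rez itself.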

If you want to troubleshoot further, all the info is there (use `rez-env -vvvv` to get all relevant output), but making sense of this involves understanding of the solver, and that requires coming to terms with https://github.com/nerdvegas/rez/blob/master/src/rez/SOLVER.md

Hth
A

ps - I _think_ there is opportunity to improve this in the solver, but doing anything in there is complicated and I've not been able to dedicate the necessary time.

Fede Naum

Feb 16, 2022, 5:24:31 AM
to rez-config
Hi Brent,

To answer 1): yes, we do experience that behaviour every now and then, and we have one like that this week.
For context, our environments also involve ~300 packages, and we have more than ~75,500 versions created for the 2,200 individual rez packages we have.
So, as you guessed and Allan confirmed, the work the solver is doing is not trivial, and the addition of just one package, or of a different version of that package (one that has different dependencies or a different version number), can throw the solver into that seemingly eternal resolve path.
In our experience, the reasons most of the time are:
  • packages with really wide ranges
  • non-mutually-exclusive variants (i.e. Qt/MayaQT/Qt_vfx, or OpenImageIO with our namespace ALOpeImageIO)
  • packages with asymmetric variants
  • packages with some conditional requirements.
2) The troubleshooting approaches Allan mentioned are the ones that we use the most. Another technique we use is to set REZ_MAX_FAILS to a relatively small value, e.g. 20, so that the solver will at least stop and show you the failure on one of the paths. That can give you a clue about where to start looking, but it can also mislead you a bit.
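A minimal sketch of how one attempt could be wrapped so that it never runs forever: it sets REZ_MAX_FAILS for the child process and adds a wall-clock timeout on top. The function name, the default values and the `base_cmd` parameter are my own; it also assumes that `rez-env` accepts `-c` to run a command string instead of opening a shell (worth checking against `rez-env -h`) and that a `true` command exists, which is not the case on Windows.

```python
import os
import subprocess


def try_resolve(request, max_fails=20, timeout=120, base_cmd=("rez-env",)):
    """Run one resolve attempt; return (status, output).

    status is 'resolved', 'failed' or 'timeout'.
    """
    # Cap the number of failed solver attempts for this one invocation only.
    env = dict(os.environ, REZ_MAX_FAILS=str(max_fails))
    # "-c true" is assumed to make rez-env run a no-op and exit.
    cmd = list(base_cmd) + list(request) + ["-c", "true"]
    try:
        proc = subprocess.run(
            cmd,
            env=env,
            timeout=timeout,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
    except subprocess.TimeoutExpired:
        return "timeout", ""
    status = "resolved" if proc.returncode == 0 else "failed"
    return status, proc.stdout
```

Calling `try_resolve(["maya-2022", "pyside2"])` (a made-up request) would then give you either the solver output to read or a clear "timeout" marker, rather than a terminal you have to kill after 20 minutes.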

But as Allan mentioned, the first thing we do is turn up the verbosity.

If that does not help, try to narrow down the minimum set of packages that causes rez to go into this sort of never-resolving state.

Unfortunately, this is a trial-and-error exercise, but if you know what has changed recently, try that first; or, if you know the dependencies of the packages, try playing with the packages that have non-fully mutually exclusive variants or asymmetric variants.

Either start removing some of the requested packages until you find the culprit, or do the inverse: start from a small request and keep adding packages until the resolve stops completing.

Once you have found the culprit, start by playing with the version ranges to see if you can narrow them further.

Many times the problem will be solved by:

  • restricting the version range of a package, usually, the lower bound.
  • adding an anti-package (`!package`) for one of the variants.
  • removing unnecessary dependencies.
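To make the first two fixes concrete, here is a sketch of what they might look like in a package.py. Every package name and version below is invented for illustration, and using `!` requirements inside variants to make them mutually exclusive is my reading of the anti-package suggestion rather than something spelled out above.

```python
# package.py for a hypothetical package; all names and versions are made up.
name = "my_tool"
version = "2.1.0"

requires = [
    # Lower bound raised from a wide "openexr-2" so the solver has
    # fewer candidate versions to walk through.
    "openexr-2.5+<3",
]

variants = [
    # Each variant excludes the other Qt flavour, so the two variants
    # can no longer both look acceptable to the solver.
    ["qt_flavour_a-5", "!qt_flavour_b"],
    ["qt_flavour_b-5", "!qt_flavour_a"],
]
```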

Hope this helps
Fede

PS: Since this troubleshooting is so time-consuming, I have lately been thinking about writing an external tool that takes a rez-env line that is not resolving (our default max_fails = 1000, but I guess we would use something more like 100) and then does a kind of binary search to find which packages are preventing the full request from resolving.
I.e. if we have 100 packages in the request and we find that it resolves with the first 50, then we try with 75, and so on. Let's say that adding package 51 causes it not to resolve: then we can remove that package from the list and try again with 75, excluding that package, and if it resolves then go to 99... I guess it can end up with a list of packages that cause the request not to resolve.
Honestly, I think in some cases it might mislead you, but I guess the goal is not to find the exact root cause but to give you an idea of where to look without going through the process manually. What do you think?
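The idea in this PS could be sketched roughly as follows. This is not an existing tool: `resolves` is a callable you would have to supply (for example one that runs rez-env with a low max_fails and a timeout), and the bisection assumes that once a prefix of the request fails, every longer prefix fails too. That assumption does not always hold, which is one way the result can mislead.

```python
def first_failing_prefix(packages, resolves):
    """Smallest k such that packages[:k] fails, or None if the full list resolves."""
    if resolves(packages):
        return None
    # Invariant: packages[:lo] resolves (an empty request is assumed to),
    # packages[:hi] does not.
    lo, hi = 0, len(packages)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if resolves(packages[:mid]):
            lo = mid
        else:
            hi = mid
    return hi


def find_culprits(packages, resolves):
    """Repeatedly bisect, removing the package at each tipping point."""
    remaining = list(packages)
    culprits = []
    while remaining:
        k = first_failing_prefix(remaining, resolves)
        if k is None:
            break
        culprits.append(remaining.pop(k - 1))
    return culprits


# Stand-in for a real resolve: "d" clashes with "a", and "f" clashes with "b".
clashes = [("a", "d"), ("b", "f")]


def fake_resolves(request):
    return not any(x in request and y in request for x, y in clashes)


print(find_culprits(list("abcdefg"), fake_resolves))
```

Note that it reports the later package of each clashing pair, i.e. where the request tips over, not necessarily the package that should be changed.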

brent...@gmail.com

Feb 16, 2022, 12:26:58 PM
to rez-config
Thanks all.  I will certainly use the verbose solver in the future.  That is one thing I miss from the rez1 days -- seeing all those tries that resulted in conflicts and I could just copy/paste them to get more information.  But I understand that the rez2 solving algorithm is more complex than that and that interpreting the verbose output is more of an art.

I do something similar to that binary search tool you're thinking of writing.  Mine is more brute force -- it takes the requested packages of the failed resolve and creates multiple rez-env commands, each time adding an additional package to the request.  That helps me find that tipping point.  What I then do is inspect that problematic package and start adding its requirements to the rez-env call one by one.  I keep doing that until I find the package that when added or removed toggles the hanging rez-env.  After that, what to do about it is again more of an art.  Usually, it's either move it in the resolve or tighten its version specification.  Given the interconnectedness of the packages, there is usually more than one way to fix it.

I get some pushback from teams too.  If I tell them that they need to add "openexr-2.5" to their package.py so that some workflow that uses their package won't hang, I'm told, "But our package doesn't use openexr!" Or, "Our package can use any version of openexr-2!"  I think this is where "workflow" packages might be useful (used to be called bundles in rez1).  It's a package.py that has all the requested packages for a given environment.  The user only has to specify that workflow package, not construct a 20-package rez-env line, and no one cares if we need to move the package order around as long as they get the right environment.
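A sketch of what such a workflow package could look like. The package name, the versions and the in-house package are all hypothetical, and I have not verified that the order of `requires` steers the solver in exactly the same way as the order on the rez-env command line.

```python
# package.py for a hypothetical "workflow" package: it ships no payload,
# it only gathers the requests for one environment in one place.
name = "comp_workflow"
version = "1.0.0"
description = "All packages needed for the (made-up) comp workflow."

requires = [
    # Explicitly versioned requests that are known to steer the solve go first.
    "openexr-2.5",
    "pyside2-5.12",
    "qt-5",
    # ...followed by the rest of the environment.
    "comp_tools",  # hypothetical in-house package
]
```

Users would then run `rez-env comp_workflow`, and any reordering or version tightening happens inside this one file instead of in every team's package.py.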

Fede Naum

Feb 17, 2022, 7:18:07 AM
to rez-config
After that, what to do about it is again more of an art.
 
LoL! Yes, it is an art! We are in the same boat :)

I get some pushback from teams too.  If I tell them that they need to add "openexr-2.5" to their package.py so that some workflow that uses their package won't hang, I'm told, "But our package doesn't use openexr!" Or, "Our package can use any version of openexr-2!" 
 
Yes, usually the easiest workaround is to add that extra dependency to the package to push the solver that way, but their pushback is more than reasonable. Anyway, if you keep looking, in the majority of our cases those issues can be solved by increasing the lower bound of one of the requirements of the given package.
Because of the huge matrix of combinations that rez has to choose from, it can go into this never-ending resolve, so the fewer versions and paths the solver has to explore, the faster it will find a solve or a conflict.
With that lesson learned, we have advised developers, when they release a new version of a tool, to try to shrink the version ranges of its dependencies (even when the tool might work with an older version of some other package).

Also, in those cases the top-level rez-env request is what needs to be more specific. In your case, if you add `openexr-2.5` at the beginning of the rez-env line, it will at least try that solver path first and reach a solve or a conflict faster.

 
I think this is where "workflow" packages might be useful (used to be called bundles in rez1).  It's package.py that has all the requested packages for a given environment.  The user only has to specify that workflow package.py, not construct a 20-package rez-env line, and no one cares if we need to move the package order around as long as they get the right environment.

We have not used bundles here, but we have a tool on top of rez to manage that, and we call the collection of rez packages + settings a "preset".
Since it has a DB underneath, after changes that end up in some hanging situation we can roll back to the previous working version, or compare against it, and then it is easy (easier) to spot the potential culprits of the hang.
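The comparison step can be as simple as a dictionary diff. The sketch below is not our actual tool: it assumes a preset can be reduced to a mapping from package name to requested version, and the example data is invented.

```python
def diff_presets(previous, current):
    """Compare two {package_name: version_request} mappings.

    Returns (added, removed, changed); changed maps a name to (old, new).
    """
    added = {n: current[n] for n in current.keys() - previous.keys()}
    removed = {n: previous[n] for n in previous.keys() - current.keys()}
    changed = {
        n: (previous[n], current[n])
        for n in previous.keys() & current.keys()
        if previous[n] != current[n]
    }
    return added, removed, changed


# Invented example: last working preset versus the one that now hangs.
working = {"openexr": "2.5", "qt": "5", "usd": "21.11"}
hanging = {"openexr": "2", "qt": "5", "usd": "21.11", "pyside2": ""}

added, removed, changed = diff_presets(working, hanging)
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```

Anything in `added` or `changed` is a first candidate to pin more tightly or to move earlier in the request.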
