Pip requirements and remote caching


Matthieu Poncin

Jan 8, 2019, 7:13:44 AM
to Bazel/Python Special Interest Group
Hi, I am trying to speed up our CI with remote caching, but external pip dependencies are invalidating the cache across build machines.
Some of the tests are cached properly, but a few specific pip dependencies invalidate the cache for many of our tests.

We have multiple py_binary and py_test targets that declare external dependencies using the pip_import rules as defined here:

One difference from this doc, however, is that I have to use clean with --expunge to test reproducibility between build machines:
bazel clean --expunge
This is because a plain bazel clean does not remove the pip requirements downloaded from the WORKSPACE.

What I found is that many of our tests are invalidated because pip modules recompile C libraries, and the compilation output (.so) differs after each recompilation.
Here is a short list of offenders from one of our tests:
  • PyYAML_3_13
  • SQLAlchemy_1_1_9
  • psycopg2_2_7_4
  • pycparser_2_13
For example, this can be reproduced by taking the rules_python repo and adding `PyYAML` as a dependency in the requirements and BUILD files of the examples/helloworld project. The test will not be remotely cached between "bazel clean --expunge" runs.
This is because pip recompiles _yaml.so and the library file is different after every recompilation.

Here is a simpler example that shows the root cause:
> pip install PyYAML
> cat env3/lib/python3.6/site-packages/_yaml.cpython-36m-darwin.so | md5
3be29a8a2eccb29c9f18488ae6ae949f

> pip uninstall PyYAML
> pip install PyYAML
> cat env3/lib/python3.6/site-packages/_yaml.cpython-36m-darwin.so | md5
13f3334bcbfc2f90dc084e669c28d842
Installing PyYAML twice produces a different output each time.
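
To extend the md5 comparison above from one file to a whole install tree, a minimal sketch could hash every file under two directories and list the ones that differ. The function names here are hypothetical, and SHA-256 is used in place of md5 only as a matter of preference:

```python
import hashlib
from pathlib import Path


def file_digests(root):
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def changed_files(root_a, root_b):
    """Return relative paths that differ in content or exist on one side only."""
    a = file_digests(root_a)
    b = file_digests(root_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

Pointing changed_files at the site-packages directories of two separate installs should narrow down which files are worth inspecting further.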


I am not sure how to work around this issue. How would you handle this situation? Any advice would be welcome.

6f6...@gmail.com

Jan 8, 2019, 10:11:53 AM
to Matthieu Poncin, Bazel/Python Special Interest Group
Hi Matthieu,

I’ve been debugging a lot of remote cache misses like these lately. Two things that have helped are diffing the Bazel execution logs (see https://docs.bazel.build/versions/master/remote-execution-caching-debug.html, at the end where it says “Comparing Execution Logs”) and diffoscope (https://diffoscope.org).

In your case, since you’ve already found a file that is different, I would suggest running diffoscope on both .so’s to see what is different. Typically it’s either system configuration issues (like dynamically linking to a different version of a library) or timestamps in the files themselves.

I hope this helps.

Cheers,

-Oscar
> **Confidentiality**
> The information contained in this e-mail is confidential, may be privileged and is intended solely for the use of the named addressee. Access to this e-mail by any other person is not authorised. If you are not the intended recipient, you should not disclose, copy, distribute, take any action or rely on it and you should please notify the sender by reply. Any opinions expressed are not necessarily those of the company.
>
> We may monitor all incoming and outgoing emails in line with current legislation. We have taken steps to ensure that this email and attachments are free from any virus, but it remains your responsibility to ensure that viruses do not adversely affect you.
>
>
> --
> You received this message because you are subscribed to the Google Groups "Bazel/Python Special Interest Group" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to bazel-sig-pyth...@googlegroups.com.
> To post to this group, send email to bazel-si...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/bazel-sig-python/d3875ae2-f748-4d02-ae35-1daed2d6a0f7%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Nicolas Lopez

Jan 8, 2019, 10:23:25 AM
to 6f6...@gmail.com, ago...@google.com, Matthieu Poncin, Bazel/Python Special Interest Group

Matthieu Poncin

Jan 9, 2019, 8:34:58 AM
to Bazel/Python Special Interest Group
Thank you Oscar for the help!
Diffoscope was indeed helpful. Interestingly, objdump and nm did not show any difference between the two builds of the libYAML extension, but diffoscope showed two things changing in the file:

[Attached image: Screen Shot 2019-01-09 at 13.45.22.png]

The last two diffs apparently contain the path to the build folder, so I was able to get rid of that change by setting a fixed build folder for pip with:
pip install -b /tmp/build-dir PyYAML
Coincidentally someone opened an issue on the same day about the same thing here: https://github.com/bazelbuild/rules_python/issues/154
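
To check whether a build path like this has leaked into a compiled library, a rough sketch could pull the printable strings out of the binary, much like the `strings` tool, and keep those that contain a given path fragment. The function name and the minimum string length are assumptions made for illustration:

```python
import re


def embedded_strings(data, needle, min_len=4):
    """Return printable ASCII runs in `data` (bytes) that contain `needle` (bytes)."""
    # Runs of printable ASCII characters at least min_len bytes long.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [
        match.group().decode("ascii")
        for match in pattern.finditer(data)
        if needle in match.group()
    ]
```

Running it on the bytes of each .so with the build directory as the needle should show whether the fixed build folder is the only path left in the file.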

However, I have no clue what the first diff might be.

In any case, I doubt that I will be able to fix all of our external dependencies this way.
One alternative I can think of is to prevent pip from building libraries and depending on system libraries.
This is at least possible for libyaml by adding the following option in the requirements.txt file:
PyYAML==3.13 --global-option="--without-libyaml"

If I use pip directly, it correctly installs PyYAML without libyaml. Sadly, rules_python does not seem to take this option from the requirements file into account, and _yaml.so is still included in the Bazel builds.
I wonder whether anyone has handled this situation differently, perhaps by building the pip modules manually and including them in the repo?

Austin Schuh

Jan 9, 2019, 1:09:55 PM
to Matthieu Poncin, Bazel/Python Special Interest Group
I've had good luck with objdump -s, or -g, or -G to figure out what changed.  It tends to be in the debug symbols.

Austin


Matthieu Poncin

Jan 21, 2019, 6:41:45 AM
to Bazel/Python Special Interest Group
Thank you for the suggestions! After going through all of our requirements, I found three types of issues with pip requirements breaking the remote cache:
  1. The build directory changes for each build and induces changes in the compiled libraries. This would be fixed by: https://github.com/bazelbuild/rules_python/issues/154
  2. The pip version currently used by rules_python is quite old and generates metadata files nondeterministically (https://bitbucket.org/pypa/wheel/pull-requests/47/make-metadata-generation-deterministic/diff). This PR would partially fix it: https://github.com/bazelbuild/rules_python/pull/136 . That is not enough, however, because wheel still has a bug with extras in metadata requirements; I submitted a quick fix here: https://github.com/pypa/wheel/pull/284
  3. Some pip modules are built with the "optimize" option. "suds-jurko" is one of them, and setting the following flag to 0 fixes it: https://bitbucket.org/jurko/suds/src/94664ddd46a61d06862fa8fb6ba7b9e054214f57/setup.cfg?at=default&fileviewer=file-view-default#setup.cfg-31. I do not know whether this could be fixed directly in rules_python; any pointers on that would be appreciated.
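
On point 3, the thread does not establish why the "optimize" option breaks reproducibility. One plausible cause, stated here only as an assumption, is that byte-compiled files record the source file's modification time by default, so the same source installed at two different moments yields different bytes. A small sketch for Python 3.7 or later (the helper name is hypothetical) compares the default timestamp mode with the hash-based mode of py_compile:

```python
import os
import py_compile
import tempfile
from pathlib import Path
from py_compile import PycInvalidationMode


def compile_bytes(source, mtime, mode):
    """Byte-compile `source` from a file with the given mtime; return the .pyc bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp, "mod.py")
        src.write_text(source)
        os.utime(src, (mtime, mtime))  # simulate installs at different times
        out = py_compile.compile(
            str(src),
            cfile=str(Path(tmp, "mod.pyc")),
            dfile="mod.py",  # fixed display name, so the temp path is not embedded
            doraise=True,
            invalidation_mode=mode,
        )
        return Path(out).read_bytes()


# Same source, different mtimes: timestamp mode stores the mtime in the header.
ts_a = compile_bytes("x = 1\n", 1_000_000_000, PycInvalidationMode.TIMESTAMP)
ts_b = compile_bytes("x = 1\n", 1_000_000_100, PycInvalidationMode.TIMESTAMP)

# Hash mode stores a hash of the source instead of the mtime.
h_a = compile_bytes("x = 1\n", 1_000_000_000, PycInvalidationMode.UNCHECKED_HASH)
h_b = compile_bytes("x = 1\n", 1_000_000_100, PycInvalidationMode.UNCHECKED_HASH)

print("timestamp mode identical:", ts_a == ts_b)
print("hash mode identical:", h_a == h_b)
```

If this is indeed the cause, it would explain why turning the flag off helps: no bytecode is written at install time, so no timestamp ends up in the cached outputs.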

I hope this helps anyone else trying to use remote caching with their Python rules.
Matthieu