Draft PEP: PyPI cost solutions: CI, mirrors, containers, and caching to scale

Wes Turner

Mar 10, 2020, 9:50:56 AM3/10/20
to pypa-dev

This informational PEP is intended to be a reference
for CI services and CI implementors;
and a request for guidelines, tools, and best practices.

Working titles; seeking feedback:

- Guide for PyPI CI Service Providers
- Request from and advisory for CI Services and CI implementors
- PyPI cost solutions: CI, mirrors, containers, and caching to scale
- PyPI-dependent CI Service Provider and implementor Guide

See "Open Issues":

> - Does this need to be a PEP?
>  - No: It's merely an informational advisory and a request
>    for consideration of sustainable resource utilization practices.
>  - Yes: It might as well be maintained as the document to be
>    sent to CI services which are unnecessarily using significant
>    amounts of bandwidth.


PEP: 9999
Title: PyPI-dependent CI Service Provider and Implementor Guide
Author: Wes Turner
Sponsor: *[Full Name <email at example.com>]*
BDFL-Delegate:
Discussions-To: https://groups.google.com/forum/#!forum/pypa-dev
Status: Draft
Type: Informational
Content-Type: text/x-rst
Requires: *[NNN]*
Created: 2020-03-07
Resolution:


Abstract
========

Continuous Integration (CI) build and test services
can help reduce the costs of hosting PyPI by running local mirrors
and by advising clients on how to efficiently re-build
software hundreds or thousands of times a month
without re-downloading everything from PyPI every time.

This informational PEP is intended to be a reference
for CI services and CI implementors;
and a request for guidelines, tools, and best practices.

Motivation
==========

- The costs of maintaining PyPI continue to grow rapidly.
- CI builds impose significant load upon PyPI.
- Frequently re-downloading the exact same packages
  is wasting PyPI and CI services' time, money, and bandwidth.
- Perhaps the primary issue is lack of awareness
  of solutions for reducing resource requirements
  and thereby costs for all involved.
- Many thousands of projects over-utilize donated resources
  when CI services could solve the problem centrally
  and more efficiently.


Request from and advisory for CI Services and CI Implementors
==============================================================

Dear CI Service,

1. Please consider running local package mirrors and enabling use of local
   package mirrors by default for clients' CI builds.
2. Please advise clients regarding more efficient containerized
   software build and test strategies.

Running local package mirrors will conserve the generously donated
resources of PyPI (the Python Package Index, a service maintained by
the PyPA, a group within the non-profit Python Software Foundation).
As of March 2020, PyPI costs roughly $800,000 USD a month to operate,
even with generously donated resources.

If you would prefer to instead (or also) donate to the PSF,
earmarked donations are very welcome and will be publicly
acknowledged.

Data locality through caching is key to efficient software
distribution. There are a number of opportunities to cache package
downloads and thereby (1) reduce bandwidth requirements, and
(2) reduce build times:

- ~/.cache/pip -- pip's default cache; this does not persist across
  hermetically isolated container invocations
- Network-local package repository mirror
- Container image
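
For example, a CI image could point pip at a network-local mirror via
``/etc/pip.conf``, or via the equivalent ``PIP_INDEX_URL`` environment
variable (the mirror hostname here is a placeholder)::

    # /etc/pip.conf
    [global]
    index-url = https://mirror.internal.example/pypi/simple/

    # or, equivalently, as an environment variable:
    #   export PIP_INDEX_URL=https://mirror.internal.example/pypi/simple/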

There are many package mirroring solutions for Python packages
and other packages and containers:

- A full mirror:
  - bandersnatch: https://pypi.org/project/bandersnatch/
- A partial mirror:
  - pulp: https://pulpproject.org/
    - Pulp also handles RPM, Debian, Puppet, Docker, and OSTree
- A transparent proxy cache mirror:
  - devpi: https://pypi.org/project/devpi/
  - A dumb HTTPS cache with a maximum file size:
    - Squid?
- IPFS
  - IPFS for software package repository mirroring is an active area of
    research.
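
As a concrete example, devpi acts as a transparent proxy cache out of
the box via its ``root/pypi`` index; a minimal sketch (initialization
commands vary by devpi-server version)::

    # on the mirror host
    pip install devpi-server
    devpi-init          # older releases: devpi-server --init
    devpi-server --host 0.0.0.0 --port 3141

    # on a build node: install through the caching index
    pip install --index-url http://mirror-host:3141/root/pypi/+simple/ requests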

Containers:

- OCI Container Registry
  - Notary (TUF; signs registry content rather than hosting it): https://github.com/theupdateframework/notary
  - Amazon Elastic Container Registry: https://aws.amazon.com/ecr/
  - Azure Container Registry: https://azure.microsoft.com/en-us/services/container-registry/
  - Docker registry: https://docs.docker.com/registry/deploying/
  - DockerHub: https://hub.docker.com/
  - GitLab Container Registry:
    https://docs.gitlab.com/ce/user/packages/container_registry/
  - Google Container Registry: https://gcr.io
  - Red Hat Quay Container Registry: https://quay.io
- Container Build Services
  - Any CI Service can be used to build and upload a container

There are several approaches to making individual containerized
Python software package builds more efficient:

A. Build a named container image containing the necessary dependencies,
   upload the container image to a container registry, and reuse
   the image for subsequent builds of your package(s), as sketched
   below.
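
   A minimal sketch of this flow, with placeholder registry, image,
   and file names (``Dockerfile.deps`` is assumed to install only the
   pinned dependencies)::

      docker build -f Dockerfile.deps -t registry.example.com/myorg/app-deps:2020-03 .
      docker push registry.example.com/myorg/app-deps:2020-03

      # subsequent CI builds then start from the prebuilt image:
      #   FROM registry.example.com/myorg/app-deps:2020-03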
B. Automate updates of pinned dependency versions using a
   free or paid service that regularly audits dependency specifications
   stored in source code repositories and sends pull requests
   to update the pinned versions.
C. Create a multi-stage Dockerfile that downloads all of the
   (version-pinned) dependencies in an initial stage and uses
   ``COPY`` to bring them into a later stage which builds and tests
   the package (see the sketch below)

   - [ ] TODO: what's the best way to do this?
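
   One possible shape, as a sketch (file names and the test command
   are placeholders; the first stage is served from Docker's layer
   cache until requirements.txt changes)::

      FROM python:3.8-slim AS deps
      COPY requirements.txt .
      # download all pinned dependencies as wheels, without installing
      RUN pip wheel --wheel-dir=/wheels -r requirements.txt

      FROM python:3.8-slim AS build
      COPY --from=deps /wheels /wheels
      COPY . /src
      WORKDIR /src
      # install only from the local wheel directory; no calls to PyPI
      RUN pip install --no-index --find-links=/wheels -r requirements.txt
      # run the test suite (assumes pytest is among the pinned deps)
      RUN python -m pytest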

D. Use a Docker build-time cache mount (see the sketch below)

   - This requires ``DOCKER_BUILDKIT=1`` to be set
     so that ``# syntax=docker/dockerfile:experimental``
     and ``RUN --mount=type=cache,target=/root/.cache/pip`` work
   - [ ] TODO: what's the best way to do this?
   - "build time only -v option"
     https://github.com/moby/moby/issues/14080
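
   A sketch of what this can look like, assuming BuildKit is enabled
   (base image and requirements file are illustrative)::

      # syntax=docker/dockerfile:experimental
      FROM python:3.8-slim
      COPY requirements.txt .
      # pip's cache persists across builds on the same BuildKit host,
      # so repeat builds avoid re-downloading everything from PyPI
      RUN --mount=type=cache,target=/root/.cache/pip \
          pip install -r requirements.txt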

E. Use a container build tool that supports mounting volumes at build
   time (podman, buildah) and mount in the ~/.cache/pip directory
   for all builds so that your build doesn't need to re-download
   everything from PyPI on every CI build (see the example below).
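
   For example, with podman (paths are illustrative; the ``:z`` suffix
   is only needed on SELinux-enabled hosts)::

      podman build --volume "$HOME/.cache/pip:/root/.cache/pip:z" \
          --tag myproject .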



Security Implications
=====================

- Any external dependency is a security risk.
- When software dependencies are not cached,
  the devops workflow cannot run while the external dependency is
  unavailable.
- TUF (The Update Framework) may help mitigate cache-poisoning risks;
  CNCF Notary implements cryptographic signing with TUF, and PyPI has
  accepted PEP 458 for TUF-based signing of repository metadata.


How to Teach This
=================

- [ ] A more detailed guide explaining how to do multi-stage builds
  that cache dependencies?
- [ ] Update packaging.python.org?
- [ ] Expand upon the instructions herein


Reference Implementation
========================

- [ ] Does anyone have examples of CI services that are doing this well
  / correctly? E.g. with proxy-caching on by default


Rejected Ideas
==============

[Why certain ideas that were brought up while discussing this PEP were not ultimately pursued.]


Open Issues
===========

- Request for guidelines, tools, and best practices.
- Does this need to be a PEP?
  - No: It's merely an informational advisory and a request
    for consideration of sustainable resource utilization practices.
  - Yes: It might as well be maintained as the document to be
    sent to CI services which are unnecessarily using significant
    amounts of bandwidth.


References
==========

[A collection of URLs used as references throughout the PEP.]


Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.

Jeremy Stanley

Mar 10, 2020, 2:49:42 PM3/10/20
to pypa...@googlegroups.com
[Apologies if you receive multiple copies of this, it seems that
Google Groups may silently discard posts with PGP+MIME signatures.]

On 2020-03-10 06:50:56 -0700 (-0700), Wes Turner wrote:
[...]
> Reference Implementation
> ========================
>
> - [ ] Does anyone have examples of CI services that are doing this well
> / correctly? E.g. with proxy-caching on by default
[...]

The CI system in OpenDev does this. In fact, we tried a number of
the aforementioned approaches over the years:

1. Building a limited mirror based on package dependencies; this was
inconvenient as projects needed to wait for the package to be pulled
into the mirror set before it could be used by jobs (because we also
wanted to prevent jobs from accidentally bypassing the mirror and
hitting PyPI directly), and curating the growing list of package
names/versions became cumbersome.

2. Maintaining a full mirror of PyPI via bandersnatch; we did this
for years, but it was unstable (especially early on; serial handling
in PyPI's API got better over time) and so needed a fair amount of
attention. The real reason we stopped, though, was that some AI/ML
projects (I'm not pointing fingers, but you know who you are) started
dumping giant nightly snapshots of their datasets into PyPI, and we
didn't want to have to deal with multi-terabyte filesystem coherency
issues or month-long re-bootstrapping periods. bandersnatch
eventually grew an option for filtering out specific projects, but
it required a full rebuild to purge all the previously fetched
files, which we didn't want to deal with (and this would have become
an ongoing game of Whack-a-Mole with any new projects following
similar processes).

3. Using a caching proxy; this has turned out to be the
lowest-effort solution for us, occasional changes in pip and related
toolchain aside.

OpenDev's Zuul (project gating CI) service utilizes resources across
roughly a dozen different cloud providers, so we've found the best
way to reduce nondeterministic network failures is to cache as much
as possible locally within every provider/region. We configure Apache
on a persistent virtual machine in each of these via Ansible, and
this is what the relevant configuration currently looks like for
PyPI caching:

<URL: https://opendev.org/opendev/system-config/src/commit/b2b0cc1c834856afa5511ca9a489d0dfbc6ba948/playbooks/roles/mirror/templates/mirror.vhost.j2#L36-L88 >

Early in the setup phase, before jobs might want to start pulling
anything from PyPI, we install an /etc/pip.conf file onto the job
nodes from this template, with the local mirror hostname substituted
appropriately:

<URL: https://opendev.org/zuul/zuul-jobs/src/commit/de04f76d57ffd5737dea6c6eb3af4c26f2fe08a6/roles/configure-mirrors/templates/etc/pip.conf.j2 >

You'll notice that extra-index-url is set to a wheel_mirror URL;
that's a separate cache we build to accelerate jobs which rely on
packages that don't publish wheels to PyPI for the various platforms
we offer (a variety of Linux distributions). We collect common
Python package dependencies for projects running jobs, perform test
installations of them in a separate periodic job, check to see if
they or their transitive dependency set require building a wheel
from sdist rather than downloading a prebuilt one from PyPI, and
then add all of those to a central cache. We do this for each
available Python version across all the distros/releases for which
we maintain node images. The wheels are stored globally in AFS (the
infamous Andrew Filesystem) and then local OpenAFS caches are served
from Apache in every configured cloud provider (the configuration
for it appears immediately below the PyPI proxy cache in the vhost
template linked earlier).
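
The net effect is roughly of this shape (the hostname and wheel path
here are illustrative placeholders, not copied from the actual
template):

    [global]
    index-url = https://mirror.<region>.opendev.org/pypi/simple/
    extra-index-url = https://mirror.<region>.opendev.org/wheel/<distro-arch>/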

Of course, we don't just cache PyPI; we also mirror and/or cache
Linux distribution package repositories, Dockerhub/Quay, NPM
packages, Git repositories and whatever else is of interest to our
users. Every time a job has to touch the greater Internet to
retrieve build resources, that's one more opportunity for unexpected
failure and further waste of our generously donated build capacity,
so it's in our best interests and those of our users to implement
and take advantage of local caches anywhere we can safely do so
without undue compromise to the integrity of build results.
--
Jeremy Stanley

Wes Turner

Mar 10, 2020, 4:01:27 PM3/10/20
to Jeremy Stanley, pypa-dev
Thank you for your feedback and for sustainably caching build dependencies.

Presumably a caching proxy for all users of a CI service (with Apache, Squid, Nginx, etc.) would need to have a current SSL cert and would be mediating any requests to other servers.

Is it also possible to configure clients to use a caching proxy using just environment variables?


Jeremy Stanley

Mar 12, 2020, 2:34:49 PM3/12/20
to pypa-dev
[Sorry about the formatting, I've tried several times now to reply
by E-mail and Google keeps silently discarding my message once
accepted.]

On 2020-03-10 16:01:14 -0400 (-0400), Wes Turner wrote:
[...]

> Presumably a caching proxy for all users of a CI service (with
> Apache, Squid, Nginx, ?) would need to have a current SSL cert,
> and would be mediating any requests to other servers.

Yes, we have Let's Encrypt (automated through Ansible calling
acme.sh and injecting TXT RRs for DNS validation) issuing an X.509
cert for the mirror server in each environment. The source code for
all of that is available in the letsencrypt-.* roles at
https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles
currently.


> Is it also possible to configure clients to use a caching proxy
> using just environment variables?
[...]

Making sure environment variables get properly exported to child
processes and across process isolation boundaries is error-prone,
though we do provide a convenience script which can be sourced by or
invoked within shells on our job nodes if they need to know relevant
mirror information for more complex use cases:

<URL: https://opendev.org/opendev/base-jobs/src/commit/bbb1bc829351e94dde4e1f4aee13533181dc5ff3/roles/mirror-info/templates/mirror_info.sh.j2 >

If you're talking about a full proxy via $HTTP(S)_PROXY or the like,
that's just not safely achievable in our case due to lack of viable
access control (open proxies are a prime target for abuse by
hooligans). We operate all our infrastructure on global addresses
with job nodes provided out of diverse public cloud providers around
the World, so don't have dedicated address pools. The scale we
operate at (presently over 1K nodes at a time with an average
lifespan on the order of minutes) means the churn from trying to
automatically add and remove their individual addresses in packet
filtering rules would be incredibly problematic. We can't put
reusable credentials on the nodes or in job payloads which could be
exercised by untrusted proposed changes from random passers-by (our
focus is on hosting openly developed community collaborations and
testing proposed changes from anyone, anywhere). We really don't
want to have to orchestrate elaborate tunnels or allocate dedicated
private networks and funky routing between job nodes and mirrors for
a variety of reasons, not the least of which is that we're a
volunteer-run cooperative without a ton of people to throw at
complex solutions.

For a CI system which is operating in a restricted network, where
you have control over which systems are able to reach the proxy or
whose jobs are allowed to run, a full proxy might be an appropriate
(and in that case potentially simpler) choice, sure. Though even then
there are pitfalls; see farther down in the Apache config from my
earlier reply, where we jump through a variety of hoops to
successfully proxy-cache Dockerhub. A simple proxy isn't really an
option for some services.
--
Jeremy Stanley