New AutoPkg dev release: 2.3.2 (will be 2.4 soon)

Nick McSpadden

Dec 20, 2021, 6:03:49 PM12/20/21
to autopkg...@googlegroups.com
https://github.com/autopkg/autopkg/releases/tag/v2.3.2-b

I've created a new pre-release version of AutoPkg based on the dev_recipe_map branch. This branch contains a radical redesign of the recipe-loading logic that dramatically speeds up locating recipes.

Instead of searching for recipes or processors by traversing the file system every time we need to go find something, we now generate a static map file of all repos and recipes on-disk. This static map (cleverly titled "recipe_map.json") is rebuilt by walking the file system during certain operations (adding/removing repos, adding new recipes, or adding overrides), and it stores all recipes in two ways (a rough sketch of the resulting file follows the list below):

1. A mapping of all recipe identifiers to the absolute path of each recipe;
2. A mapping of all recipe shortnames ("GoogleChrome.download") to the absolute path of each recipe.
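For illustration, a stripped-down map might look roughly like this (shown as a Python literal; the key names and layout are my shorthand, not necessarily the exact on-disk format):

```python
# Illustrative contents of ~/Library/AutoPkg/recipe_map.json (shape approximate):
recipe_map = {
    "identifiers": {
        "com.github.autopkg.download.googlechrome":
            "/Users/you/Library/AutoPkg/RecipeRepos/com.github.autopkg.recipes/"
            "GoogleChrome/GoogleChrome.download.recipe",
    },
    "shortnames": {
        "GoogleChrome.download":
            "/Users/you/Library/AutoPkg/RecipeRepos/com.github.autopkg.recipes/"
            "GoogleChrome/GoogleChrome.download.recipe",
    },
}
```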

Whenever you do any operation in AutoPkg, it consults only this recipe map and knows where to go next. This has reduced the load time for running a recipe from multiple seconds (a number that scales with how many recipes you have on-disk) to fractions of a second.
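A minimal sketch of what that lookup amounts to - a couple of dictionary reads instead of a filesystem walk (key names assumed to match the example above; this is not the actual implementation):

```python
import json
import os

MAP_PATH = os.path.expanduser("~/Library/AutoPkg/recipe_map.json")

def find_recipe_path(name_or_identifier):
    """Resolve a recipe name or identifier to an absolute path via the map only."""
    with open(MAP_PATH) as f:
        recipe_map = json.load(f)
    for section in ("identifiers", "shortnames"):  # assumed key names
        path = recipe_map.get(section, {}).get(name_or_identifier)
        if path:
            return path
    return None  # not in the map: rebuild it, or fall back to the old search
```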

Graham Pugh provided this sample output to show the very stark difference with this branch:
Master:
```
% autopkg run -v GoogleChrome.jamf
**load_recipe time: 22.676719694
**verify_parent_trust time: 114.754227371
**process_cli_overrides time: 114.754611623
**verify time: 172.079330759
**process time: 216.01272573499998
```

With the new dev version:
```
./autopkg run -v GoogleChrome.jamf
**load_recipe time: 0.03708735200000002
**verify_parent_trust time: 0.696461484
**process_cli_overrides time: 0.696903272
**verify time: 0.711556611
**process time: 44.696209888
```

22 seconds -> 0.03 seconds to load recipes.

FUTURE DIRECTION
So, what's next for this?

I want to replace just about all GitHub Search API calls with static maps. Rather than relying on the API to give us information (which it only sometimes does), what if we took this static map idea a step further? The AutoPkg GitHub repo itself could store a static mapping of all recipes and all repos across the org, and clients would simply fetch that static map when searching for recipes.
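Just to sketch the client side of that idea (the URL and the map layout below are invented for illustration - nothing like this is published today):

```python
import json
import urllib.request

# Hypothetical artifact a scheduled job could publish for the whole org.
ORG_MAP_URL = "https://raw.githubusercontent.com/autopkg/autopkg/master/org_recipe_map.json"

def search_org_map(term):
    """An 'autopkg search' that greps a cached org-wide map instead of the search API."""
    with urllib.request.urlopen(ORG_MAP_URL, timeout=30) as resp:
        org_map = json.load(resp)
    # Assume the map relates recipe shortnames to the repo that hosts them.
    return {name: repo for name, repo in org_map.get("shortnames", {}).items()
            if term.lower() in name.lower()}
```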

To combine this functionality with the recipe map, we'd also need the override to contain a bit more info. Right now, the override just contains the chain of parents and their hashes as they were generated _at the time_, but what if it also referenced the recipe map and stored the list of everything the recipe would ever need to execute successfully? With a full repo map as well as a local recipe map, we could easily triangulate exactly where all the resources required to run a recipe are located, and then fetch the ones we don't have. If we store that information in the override itself, then ephemeral CI environments would have everything they need to run the recipe right there in the override, rather than having to make a lot of guesses or assumptions.
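To make that concrete, a CI runner could do something like the following with such an override (the `ParentRecipeRepos` key is hypothetical - overrides don't carry this today - and this assumes a plist-format override):

```python
import plistlib
import subprocess

def prefetch_repos_for_override(override_path):
    """Read a (hypothetical) repo list out of an override and repo-add each one."""
    with open(override_path, "rb") as f:
        override = plistlib.load(f)
    # Hypothetical key: every repo needed to resolve the parent chain and any
    # shared processors, recorded at the time the override was created.
    for repo in override.get("ParentRecipeRepos", []):
        subprocess.run(["autopkg", "repo-add", repo], check=True)
```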

CONTEXT
Right now, AutoPkg interacts with GitHub in a few ways:
1. You use `autopkg search`, which generally does what it says;
2. You use `autopkg info -p`, which tries to search for the parent repos of a recipe and fetch all of them.

The problem is that the GitHub search API occasionally just... doesn't. This API is rate limited really heavily, and the limits are especially harsh for large organizations with lots of outbound traffic from one set of IPs. If a large organization is talking to the GitHub API often, you can be rate limited by sheer volume of traffic. And when the API rate limits you, it doesn't return useful results to AutoPkg.
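If you want to see how tight that budget is for your own IP or token, GitHub's `/rate_limit` endpoint reports the remaining search quota (and checking it doesn't count against the quota):

```python
import json
import urllib.request

# Inspect how much GitHub search API quota is left for this IP (or token, if
# you add an Authorization header).
with urllib.request.urlopen("https://api.github.com/rate_limit", timeout=30) as resp:
    limits = json.load(resp)

search = limits["resources"]["search"]
print(f"search API: {search['remaining']}/{search['limit']} remaining, resets at {search['reset']}")
```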

In a CI environment, if you rely on `autopkg info -p` to automatically pull your repos, this means that occasionally Github just doesn't give you anything. So AutoPkg will fail to pull its parent repos for recipe chains properly, and that means that recipes just occasionally fail for no reason. Trying it again usually just works, without making any changes. At Facebook/Meta, where I use this feature heavily, I see this very frequently.

So frequently, in fact, that it actually reduces the overall reliability of the automatic parent fetching feature. 

I still believe that its intended goal is a good one: avoiding a hardcoded list of repos to check out in a CI environment (or any environment). It makes sense for AutoPkg to be able to dynamically figure out what repos you need and then go get them. The evidence simply shows that we can't rely on GitHub's search API for that.


I'd love to hear people's thoughts and see people's test results.

--
--
Nick McSpadden
nmcsp...@gmail.com

Erik Gomez

Dec 20, 2021, 9:15:16 PM12/20/21
to autopkg...@googlegroups.com
Speed improvements are awesome but I'm curious about your CI issue. How often does autopkg run in your environment?

We have an idempotent GitHub initialization script that sets and pulls all of our repos and I can't think of a single time when that particular part has failed. 

Thanks,
Erik Gomez

Allister Banks

Dec 20, 2021, 9:17:17 PM12/20/21
to autopkg-discuss
I'm all for improved experiences and centralized optimizations. We don't have that many repos or ephemeral CI/CD nodes, so we wouldn't directly benefit beyond the experiential improvements while iterating locally.
I'm sure I didn't follow the full history as this evolved in Slack, but assuming the next iteration involves a .json map of all repos under the entire organization, generated centrally by CI, there's also the "brute force" idea of just... generating a tarball of all recipe repos as well. You're proposing hosting a dynamically/continuously updated artifact after 'traversing'/indexing all repos - 'in for a penny, in for a pound', I'd say. It can 'only' be a couple dozen updates on an average day, and for probably barely a MB you could pre-resolve (or pre-cache a baseline to incrementally update against) everything the org knows about.
We mirror other repos for config mgmt code so we know when things change upstream, but we don't do that for anything AutoPkg dependency-wise, because trust recipes give us enough confidence that a run will halt on a hash mismatch. Having one-stop shopping for ~every possible recipe we'd interact with could also help from a forensic perspective, I'd think; paying a one-time network penalty could be faster still than the code's resolution process, and it should definitely help in an ephemeral build node situation.
Pardon if this is just gobbledegook distraction, but there may be a 'there' there. Allister

Nick McSpadden

Dec 21, 2021, 12:43:31 AM12/21/21
to autopkg...@googlegroups.com
Responding to this point from Erik specifically:

> How often does autopkg run in your environment?

About 30 recipes nightly (~1 AM PST).

> We have an idempotent GitHub initialization script that sets and pulls all of our repos and I can't think of a single time when that particular part has failed. 

That's the thing - git pulls and general git operations against GitHub are not rate limited (or at least, the limit is high enough compared to the API's that I've never heard of anyone hitting it in the normal course of git operations). So if you already know which repos you need, pulling them via `autopkg repo-add` is a generally reliable operation - it just works.

`autopkg search` is one single API call, followed by a git pull if you choose to add a repo. The problem really lies with `autopkg info -p`, because it has to make a bunch of API searches. It looks for the parent of each recipe in the chain, and if that parent isn't already present, it does an API search for its identifier, looks for a matching recipe in a matching repo, then does a git pull. Then it evaluates the next one in the chain.
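Roughly speaking - this is a simplified sketch, not AutoPkg's actual code - each missing parent costs at least one call shaped like the one below, and a rate-limited response is indistinguishable from "no such recipe":

```python
import json
import urllib.error
import urllib.parse
import urllib.request

def find_repos_for_identifier(identifier, token=None):
    """Ask the GitHub code search API which repos contain a recipe identifier."""
    query = urllib.parse.urlencode({"q": f'"{identifier}" extension:recipe user:autopkg'})
    req = urllib.request.Request(
        f"https://api.github.com/search/code?{query}",
        headers={"Accept": "application/vnd.github+json"},
    )
    if token:
        req.add_header("Authorization", f"token {token}")
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            results = json.load(resp)
    except urllib.error.HTTPError:
        return []  # rate limited (or auth required): looks just like "not found"
    return sorted({item["repository"]["full_name"] for item in results.get("items", [])})
```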

The problem is that if Github's API gives you a bad response, AutoPkg generally treats it like it got nothing back and will fail because it can't find a recipe that's in the recipe chain it needs. 

This idea works perfectly if the GitHub API never fails, but unfortunately it's expensive in terms of API hits. When I first built it, I didn't expect to hit errors this often (mostly because, testing individually at small scale, I never did). But recently the GitHub search API has changed quite a bit: it behaves differently than it did before and no longer returns precise results. AutoPkg has to make a bunch of guesses, and the logic I have in there is kinda... iffy at best. It leans on a pile of assumptions that generally hold only because AutoPkg doesn't have a lot of variance in how it constructs recipes, names, identifiers, and repos.

The next problem here is that AutoPkg has to make the same searches all over again when it actually runs the recipe, because it uses the same logic (`locate_recipe()` code) to look for custom Processors. This is a lot of what adds the time, but it also makes it way easier to use shared processors because it inherently finds them at runtime.

Actually, the change in this branch might already help this issue, because it makes fewer API calls total while parsing the recipe. But one way or the other, I have definitive evidence that the GitHub API returns useful results less often than it did in the past, without any other changes on my end. I'm curious whether others using just-in-time fetching have similar problems; a few people in #autopkg Slack have already offered corroborating anecdotes.

Nick McSpadden

Dec 21, 2021, 12:48:24 AM12/21/21
to autopkg...@googlegroups.com
To Allister's point:

> It can 'only' be a couple dozens of updates on the average day, and for probably barely a MB you could pre-resolve (or pre-cache a baseline to incrementally update against) everything the org knows about.

This is pretty much the goal. I was investigating GitHub Actions that run once or twice a day (the org doesn't see very many actual recipe additions/deletions) and generate new static maps for the entire ecosystem. Every CI setup and/or every org would then just download the list of everything we know about - sort of the equivalent of caching the Munki catalog, or a yum metadata.xml file. If we treat the AutoPkg GitHub org more like a package manager in which the recipes themselves are the packages, this concept makes thematic sense. Having metadata about all of the contents of the ecosystem makes it super cheap to know what's where.
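The generator itself could be little more than a scheduled job that checks out every repo in the org and walks the files - something along these lines (a rough sketch; it only handles plist-format .recipe files and skips the yaml variants):

```python
import json
import plistlib
from pathlib import Path

def build_repo_map(repos_root):
    """Walk checked-out recipe repos and index identifier/shortname -> path."""
    identifiers, shortnames = {}, {}
    for recipe_path in Path(repos_root).rglob("*.recipe"):
        try:
            with open(recipe_path, "rb") as f:
                recipe = plistlib.load(f)
        except Exception:
            continue  # skip anything that isn't a plist recipe
        identifier = recipe.get("Identifier")
        if identifier:
            identifiers[identifier] = str(recipe_path)
        shortname = recipe_path.name[: -len(".recipe")]  # "GoogleChrome.download"
        shortnames[shortname] = str(recipe_path)
    return {"identifiers": identifiers, "shortnames": shortnames}

if __name__ == "__main__":
    print(json.dumps(build_repo_map("/tmp/autopkg-org-checkout"), indent=2))
```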

The harder question will be how to handle non-org repos (like Facebook's, for example). My gut sense (just spitballing ideas, I have not tried any of this in code) suggests that each repo could host its own static map, and users could clone it and it would be munged into the general org-wide map during search/location resolution. So we could provide AutoPkg with the code to generate this repo map and any third party/public/private non-autopkg-org repo could leverage this functionality for a very minor cost.
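The munging itself would be cheap - roughly a dictionary overlay per repo map, assuming the same two-section layout sketched earlier:

```python
def merge_recipe_maps(base_map, repo_map):
    """Overlay one repo's map onto the combined map; the repo's entries win."""
    merged = {section: dict(base_map.get(section, {}))
              for section in ("identifiers", "shortnames")}
    for section in ("identifiers", "shortnames"):
        merged[section].update(repo_map.get(section, {}))
    return merged
```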

Erik Gomez

Dec 21, 2021, 1:08:58 AM12/21/21
to autopkg...@googlegroups.com
Non-org repos and internal recipes, right?

Thanks,
Erik Gomez

Graham Pugh

Dec 21, 2021, 4:26:26 AM12/21/21
to autopkg-discuss
Hi Nick,

This is great news. But... is the 2.3.2-b release the same as the one I was testing a few weeks ago? I am not seeing the speed improvements I saw with that earlier "alpha" - speed seems the same as with 2.3.1.

Where is 'recipe_map.json' written? I looked in ~/Library/AutoPkg, ~/Library/AutoPkg/RecipeRepos and ~/Library/AutoPkg/Cache but didn't find anything (or, is it deleted after a run?).

Cheers
Graham

Nick McSpadden

Dec 21, 2021, 10:57:04 AM12/21/21
to autopkg...@googlegroups.com

Graham, this is the same release as what you tested earlier - I've only made a few commits since then and they've mostly been minor details. You already tested the bulk of the logic.

> Where is 'recipe_map.json' written? I looked in ~/Library/AutoPkg, ~/Library/AutoPkg/RecipeRepos and ~/Library/AutoPkg/Cache but didn't find anything (or, is it deleted after a run?).

Sorry, I should've provided more instructions.

recipe_map.json is only created/updated when you add or remove a repo, or add a new override. The easiest way to create this is to just delete and re-add any existing repo (like 'recipes'). It's in ~/Library/AutoPkg/recipe_map.json.

Nick McSpadden

Dec 21, 2021, 1:31:22 PM12/21/21
to autopkg...@googlegroups.com
I've re-released this beta with a single fix as 2.4.1:
https://github.com/autopkg/autopkg/releases/tag/v2.4.1-b