[RFC PATCH] multi-pack-index: allow operating without pack files

11 views
Skip to first unread message

Johannes Berg

unread,
Aug 20, 2021, 3:56:04 PM8/20/21
to g...@vger.kernel.org, bup-...@googlegroups.com
Technically, multi-pack-index doesn't need pack files to exist,
but add_packed_git() today checks whether it exists or not.

In bup, a git pack format based backup tool, we'd really like
to take advantage of the multi-pack-index, since bup needs it
to save new objects to the repository efficiently (to check if
something already exists), and uses git to access the repo, so
the multi-pack-index can make more efficient.

Alternatively, bup has its own 'midx' format, of which multiple
can exist in a repository, predating the multi-pack-index.

All of this works well as long as the bup repository is just a
normal git repository. However, I've been adding encrypted and
encrypted remote repositories to bup, where the pack files are
not local, similar to promisor remotes, but not really done in
the same way.

In this case, the local storage is only the idx files, no pack
files (it's just a cache), and we access the pack files and
objects within in different ways. Unfortunately, in this case
we also cannot reuse bup's midx format very well: it only has
information on which objects exists, not where to find them,
and so reading from the repository requires reading all of the
idx files, something that git's multi-pack-index solves.

While we'll need to add read access to git's multi-pack-index
to bup, having a call to 'git multi-pack-index' write it would
be nice and save some duplication. However, in the case of the
remote/encrypted repositories, git currently cannot do that as
it requires the pack files to exist.

Add a command-line option to be able to not require pack files
to exist, to make that easier (rather than requiring writing
some dummy pack files, git even accepts empty files.)

Signed-off-by: Johannes Berg <joha...@sipsolutions.net>
---
Documentation/git-multi-pack-index.txt | 6 +++++-
builtin/multi-pack-index.c | 5 ++++-
midx.c | 9 ++++++---
midx.h | 1 +
packfile.c | 10 ++++++++--
packfile.h | 2 ++
6 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index ffd601bc17b4..23db70fbebc2 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -10,7 +10,7 @@ SYNOPSIS
--------
[verse]
'git multi-pack-index' [--object-dir=<dir>] [--[no-]progress]
- [--preferred-pack=<pack>] <subcommand>
+ <subcommand> [<subcommand options>]

DESCRIPTION
-----------
@@ -40,6 +40,10 @@ write::
multiple packs contain the same object. If not given,
ties are broken in favor of the pack with the lowest
mtime.
+
+ --no-require-packs::
+ Don't require pack files to exist, useful only for
+ certain non-repository caches.
--

verify::
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 8ff0dee2ecbb..2c9293b20c49 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -7,7 +7,7 @@
#include "object-store.h"

#define BUILTIN_MIDX_WRITE_USAGE \
- N_("git multi-pack-index [<options>] write [--preferred-pack=<pack>]")
+ N_("git multi-pack-index [<options>] write [--preferred-pack=<pack>] [--no-require-packs]")

#define BUILTIN_MIDX_VERIFY_USAGE \
N_("git multi-pack-index [<options>] verify")
@@ -68,6 +68,9 @@ static int cmd_multi_pack_index_write(int argc, const char **argv)
OPT_STRING(0, "preferred-pack", &opts.preferred_pack,
N_("preferred-pack"),
N_("pack for reuse when computing a multi-pack bitmap")),
+ OPT_BIT(0, "no-require-packs", &opts.flags,
+ N_("don't require pack files to exist"),
+ MIDX_DONT_REQUIRE_PACKS),
OPT_END(),
};

diff --git a/midx.c b/midx.c
index 902e1a7a7d9d..98b3cb33201f 100644
--- a/midx.c
+++ b/midx.c
@@ -468,6 +468,7 @@ struct write_midx_context {
uint32_t num_large_offsets;

int preferred_pack_idx;
+ unsigned flags;
};

static void add_pack_to_midx(const char *full_path, size_t full_path_len,
@@ -482,9 +483,10 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,

ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc);

- ctx->info[ctx->nr].p = add_packed_git(full_path,
- full_path_len,
- 0);
+ ctx->info[ctx->nr].p = _add_packed_git(full_path,
+ full_path_len,
+ 0,
+ !(ctx->flags & MIDX_DONT_REQUIRE_PACKS));

if (!ctx->info[ctx->nr].p) {
warning(_("failed to add packfile '%s'"),
@@ -924,6 +926,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
ctx.nr = 0;
ctx.alloc = ctx.m ? ctx.m->num_packs : 16;
ctx.info = NULL;
+ ctx.flags = flags;
ALLOC_ARRAY(ctx.info, ctx.alloc);

if (ctx.m) {
diff --git a/midx.h b/midx.h
index 8684cf0fefe8..aa6382d99386 100644
--- a/midx.h
+++ b/midx.h
@@ -41,6 +41,7 @@ struct multi_pack_index {

#define MIDX_PROGRESS (1 << 0)
#define MIDX_WRITE_REV_INDEX (1 << 1)
+#define MIDX_DONT_REQUIRE_PACKS (1 << 2)

char *get_midx_rev_filename(struct multi_pack_index *m);

diff --git a/packfile.c b/packfile.c
index 9ef6d9829280..dfe994205914 100644
--- a/packfile.c
+++ b/packfile.c
@@ -687,7 +687,8 @@ void unuse_pack(struct pack_window **w_cursor)
}
}

-struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
+struct packed_git *_add_packed_git(const char *path, size_t path_len, int local,
+ int require_pack)
{
struct stat st;
size_t alloc;
@@ -717,7 +718,7 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
p->pack_promisor = 1;

xsnprintf(p->pack_name + path_len, alloc - path_len, ".pack");
- if (stat(p->pack_name, &st) || !S_ISREG(st.st_mode)) {
+ if (require_pack && (stat(p->pack_name, &st) || !S_ISREG(st.st_mode))) {
free(p);
return NULL;
}
@@ -734,6 +735,11 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
return p;
}

+struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
+{
+ return _add_packed_git(path, path_len, local, 1);
+}
+
void install_packed_git(struct repository *r, struct packed_git *pack)
{
if (pack->pack_fd != -1)
diff --git a/packfile.h b/packfile.h
index 3ae117a8aef0..a921077a05ef 100644
--- a/packfile.h
+++ b/packfile.h
@@ -96,6 +96,8 @@ void close_object_store(struct raw_object_store *o);
void unuse_pack(struct pack_window **);
void clear_delta_base_cache(void);
struct packed_git *add_packed_git(const char *path, size_t path_len, int local);
+struct packed_git *_add_packed_git(const char *path, size_t path_len, int local,
+ int require_pack);

/*
* Unlink the .pack and associated extension files.
--
2.31.1

Derrick Stolee

unread,
Aug 22, 2021, 8:34:48 PM8/22/21
to Johannes Berg, g...@vger.kernel.org, bup-...@googlegroups.com
On 8/20/2021 3:55 PM, Johannes Berg wrote:
> Technically, multi-pack-index doesn't need pack files to exist,
> but add_packed_git() today checks whether it exists or not.

Having a multi-pack-index is supposed to indicate that we have
these objects in the objects/pack directory within the specified
pack-files.

I understand your goal to relax a condition of the multi-pack-index
file, but it's triggered by a flag during write and that choice
isn't persisted into the file. There is no way for a later Git
process to understand that the multi-pack-index doesn't actually
guarantee object existence.

And in a completely other side: one would think that including
a pack-file in the multi-pack-index would allow deleting the .idx
file, but there are a few reasons why we do not (including
interactions with third-party tools).

So, I'm not necessarily on board with this change unless
something is added to the multi-pack-index file (such as a new
version 2 and an optional chunk understood by that version) that
tells future Git processes that the .pack files might not exist.
I'm still not sure what Git should do about that other than stop
reading the multi-pack-index and ignore its contents.

> -struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
> +struct packed_git *_add_packed_git(const char *path, size_t path_len, int local,
> + int require_pack)

The only obvious thing that I noticed in the code is that we
typically use <function name>_1() as a way to create a static
version that is called by the global version.

> struct packed_git *add_packed_git(const char *path, size_t path_len, int local);
> +struct packed_git *_add_packed_git(const char *path, size_t path_len, int local,
> + int require_pack);

...oh, but that's not what you're doing. What you could do
instead is convert the 'local' parameter into a 'flags' parameter
(I think we have started to prefer 'enum's recently) and create
MIDX_FLAG_LOCAL and MIDX_FLAG_PACKS_OPTIONAL flag values. That
avoids multiple methods and minimizes the change to existing
callers.

Thanks,
-Stolee

Martin Fick

unread,
Aug 22, 2021, 9:11:17 PM8/22/21
to Derrick Stolee, Johannes Berg, g...@vger.kernel.org, bup-...@googlegroups.com
On Sunday, August 22, 2021 8:34:43 PM MDT Derrick Stolee wrote:
> On 8/20/2021 3:55 PM, Johannes Berg wrote:
> > Technically, multi-pack-index doesn't need pack files to exist,
> > but add_packed_git() today checks whether it exists or not.
>
> Having a multi-pack-index is supposed to indicate that we have
> these objects in the objects/pack directory within the specified
> pack-files.

Hmm, isn't it a normal supported use case for repacking to potentially delete
packfiles which are in the MIDX (I'm specifically thinking about when someone
runs git gc with an older git version which knows nothing about MIDX files)?

-Martin

--
The Qualcomm Innovation Center, Inc. is a member of Code
Aurora Forum, hosted by The Linux Foundation

Taylor Blau

unread,
Aug 23, 2021, 12:05:53 AM8/23/21
to Derrick Stolee, Johannes Berg, g...@vger.kernel.org, bup-...@googlegroups.com
On Sun, Aug 22, 2021 at 08:34:43PM -0400, Derrick Stolee wrote:
> On 8/20/2021 3:55 PM, Johannes Berg wrote:
> > Technically, multi-pack-index doesn't need pack files to exist,
> > but add_packed_git() today checks whether it exists or not.
>
> Having a multi-pack-index is supposed to indicate that we have
> these objects in the objects/pack directory within the specified
> pack-files.
>
> I understand your goal to relax a condition of the multi-pack-index
> file, but it's triggered by a flag during write and that choice
> isn't persisted into the file. There is no way for a later Git
> process to understand that the multi-pack-index doesn't actually
> guarantee object existence.

We're going to run into problems much earlier than that, though: the
MIDX needs to load information about objects from packs in order to
break ties when multiple copies of the same object exist in multiple
packs (according to that pack's mtime).

So I'm not sure how we would even write a MIDX without physical packs on
disk that we can open and stat, let along how we would teach Git to
handle a situation where packs that did exist when writing a MIDX went
away when we tried to read from the same MIDX later on.

Thanks,
Taylor

Johannes Berg

unread,
Aug 23, 2021, 4:21:07 AM8/23/21
to Martin Fick, Derrick Stolee, g...@vger.kernel.org, bup-...@googlegroups.com
On Sun, 2021-08-22 at 19:11 -0600, Martin Fick wrote:
> On Sunday, August 22, 2021 8:34:43 PM MDT Derrick Stolee wrote:
> > On 8/20/2021 3:55 PM, Johannes Berg wrote:
> > > Technically, multi-pack-index doesn't need pack files to exist,
> > > but add_packed_git() today checks whether it exists or not.
> >
> > Having a multi-pack-index is supposed to indicate that we have
> > these objects in the objects/pack directory within the specified
> > pack-files.
>
> Hmm, isn't it a normal supported use case for repacking to potentially delete
> packfiles which are in the MIDX (I'm specifically thinking about when someone
> runs git gc with an older git version which knows nothing about MIDX files)?

Yeah, the multi-pack-index contains all the pack names that are there
(and then which one the object is in), and when git tries to use it,
then it doesn't really seem to have any issues?

You can "git repack -a -d" with an old version, or save the multi-pack-
index temporarily elsewhere and copy it back after repack, and things
work just fine, except of course "git multi-pack-index verify" will
complain loudly.

johannes

Johannes Berg

unread,
Aug 23, 2021, 4:23:13 AM8/23/21
to Taylor Blau, Derrick Stolee, g...@vger.kernel.org, bup-...@googlegroups.com
On Mon, 2021-08-23 at 00:05 -0400, Taylor Blau wrote:
>
> We're going to run into problems much earlier than that, though: the
> MIDX needs to load information about objects from packs in order to
> break ties when multiple copies of the same object exist in multiple
> packs (according to that pack's mtime).

Huh, I guess I never ran into a need - we make sure in bup that each
object only exists once, that's kind of the point :)

I guess we could break ties by "lower hash of the pack" instead of
"mtime"? It doesn't really matter how they're broken, as long as it's
consistent?

Arguably, mtime is not a good measure anyway, since the same repo
elsewhere would have different mtimes.

> So I'm not sure how we would even write a MIDX without physical packs on
> disk that we can open and stat, let along how we would teach Git to
> handle a situation where packs that did exist when writing a MIDX went
> away when we tried to read from the same MIDX later on.

It handles that just fine on the read side.

johannes

Johannes Berg

unread,
Aug 23, 2021, 5:22:05 AM8/23/21
to Derrick Stolee, g...@vger.kernel.org, bup-...@googlegroups.com
On Sun, 2021-08-22 at 20:34 -0400, Derrick Stolee wrote:
> On 8/20/2021 3:55 PM, Johannes Berg wrote:
> > Technically, multi-pack-index doesn't need pack files to exist,
> > but add_packed_git() today checks whether it exists or not.
>
> Having a multi-pack-index is supposed to indicate that we have
> these objects in the objects/pack directory within the specified
> pack-files.

Yeah, so, like I tried to explain in the patch email, this is
(partially) for a non-git use case.

The way you could think of it is that I'm trying to make a local index
cache, so we know locally which objects we have, but the actual packs
and their contents are stored in some slow/expensive (remote) storage
system.

With the local cache, I'd like to understand whether or not I have a
given object (in particular to avoid storing it again if I see it), and
also I'd like to know _where_ it is (which pack and where in the pack)
so that I can access only parts, instead of downloading everything.

The server isn't necessarily able to run something 'smart', it might
just be S3 blob storage, but storing pack files, not individual objects.

> I understand your goal to relax a condition of the multi-pack-index
> file, but it's triggered by a flag during write and that choice
> isn't persisted into the file. There is no way for a later Git
> process to understand that the multi-pack-index doesn't actually
> guarantee object existence.

It does, since the pack files are recorded. I've tested this briefly,
and it doesn't really seem to have any problems with it.

While I understand the concern, I'd also envision that this tool
command-line option is never set when this is used on normal git
repositories, only when used in this special use case I tried explaining
above.

> And in a completely other side: one would think that including
> a pack-file in the multi-pack-index would allow deleting the .idx
> file, but there are a few reasons why we do not (including
> interactions with third-party tools).

Well, arguably you can always restore it from the pack file, so you can
always delete it ;) Not sure git is set up for that though. Also, even
old versions of git would probably choke since they don't read the
multi-pack-index.

> So, I'm not necessarily on board with this change unless
> something is added to the multi-pack-index file (such as a new
> version 2 and an optional chunk understood by that version) that
> tells future Git processes that the .pack files might not exist.
> I'm still not sure what Git should do about that other than stop
> reading the multi-pack-index and ignore its contents.

It seems to do that today, or at least try to find in the idx if it goes
looking for an object and the multi-pack-index references a pack file
that doesn't exist - and as Martin also pointed out, that really is
necessary for this to work with older versions of git.

> [snip comments on the code]

Fair enough. I think we need to figure out the conceptual question
though - whether or not this is something that's acceptable in git in
the first place.

Like I said, the intent isn't to break git repositories or to even use
this on a git repository that's also normally being read by git (even
though I don't see any problems with that, but my testing wasn't very
deep). The intent (and the reason for this being [RFC]) is to see if git
upstream would be amenable to such a change that lets other tools use
git's index - especially multi-pack-index - machinery to operate on such
a thing as the "index cache" I described above.

In particular, in bup's [1] use case, we currently support our own
"midx" format which is similar to multi-pack-index, but only contains
*object existence*, not *object location*. Thus, there are two problems:

1. In a normal bup/git repository, when bup is used to write objects
(it only ever writes packs), then the midx is also updated to
allow bup itself to check for object existence without opening all
the *.idx files, allowing it to deduplicate objects efficiently.
In this case, bup always uses git to retrieve objects pack, and
this leads to inefficiencies: git will not be able to use the midx
files (and they don't contain object location anyway), but git
will open all the *.idx files. If we ask it, it might maintain its
own multi-pack-index.

Obviously, in this mode, there's no reason to have this patch - we
can just ask git to create the multi-pack-index and add code to
bup to understand how to read it in order to check for object
existence (for deduplication).

2. However, I'm working on adding encrypted repositories to bup, and
there the objects are stored differently - still in something like
a pack file, but it's encrypted, in a way that's done at the
object level.

In this mode, bup maintains an object index in a clear-text cache,
i.e. all the *.idx files, and currently adds its own midx to be
able to efficiently check for object existence.

However, in this case, bup cannot use git to access objects, since
there's no real git directory, only a folder containing the *.idx
files, the data is "elsewhere".

Since the *.idx file format is still the same (no reason to change
it), with this patch we can use git to efficiently (in both
developer time, I don't need to rewrite it, and runtime, since
it's in C not python) ask git to write the multi-pack-index that
then bup can use to do both: a) check for object existence during
further backups while adding new objects, and b) look for object
location while accessing the repository for reading.

I completely understand that this is a non-git use case, and I'm
entirely happy for you to tell me to get lost, but I figured it was such
a small change I had to try, rather than rewriting the entire multi-
pack-index maintenance code for bup (or importing git's there).

Really for (2) I don't even care about using git's multi-pack-index,
since git never uses it, but I wanted to have the same between (1) and
(2), and if we want git - called on a bup/git repo - to have a multi-
pack-index, then we need git's format, of course.

The other hack I considered was to just create empty *.pack files for
each *.idx file in the index cache because then git is (currently) happy
to accept this, so it never really validates anything there even when
the pack files "exist", but that felt even worse and like something git
might break with just a little more validity checking.


So basically I had choices between:
A. implement multi-pack-index in bup, to cover both (1) and (2)
B. implement a new midx format in bup, to cover just (2)
C. call unmodified git in bup, to cover just (1)
D. this patch to cover both (1) and (2)
E. create dummy *.pack files for the *.idx files in the cache, so
that git will accept it.

Obviously, I can also do any combination of these, e.g. B & C, but this
was the least amount of work ;)


But really, it's up to you, I'm just asking. If you feel this doesn't
fit into git's machinery (despite being so simple), I'll just do
"something else", not sure which one I might do. Quite possibly A., if
only because then I can write tests with "git multi-index-pack verify"
:-)

johannes

[1] https://bup.github.io/, not to be confused with bupstash.io

Reply all
Reply to author
Forward
0 new messages