[git] persistent_cache

21 views
Skip to first unread message

RjOllos

unread,
Aug 14, 2017, 7:15:06 PM8/14/17
to Trac Development
I've been wondering about the purpose of [git] persistent_cache and what the trade-offs are with enabling it. I've looked through the code, but don't have a clear idea yet. I'd like to improve the TracRepositoryAdmin documentation for the [git] peristent_cache option and the cached [repositories] section attribute (#12406).

The tradeoffs for the cached attribute look straightforward to me. There is more overhead in setting up explicit synchronization, but once properly configured a cached repository is always preferred as it will offer better performance. So the tradeoff is time to configure vs performance.

Can anyone describe the purpose of persistent_cache and the trade-offs with enabling it?

Would it make sense to have persistent_cache as a per-repository attribute, as in #12406? Similarly, should git_fs_encoding be a per-repository attribute? The other [git] options appear they should be global rather than per-repository, but I wonder about those two.

- Ryan

#12406: https://trac.edgewall.org/ticket/12406


Peter Stuge

unread,
Aug 17, 2017, 7:29:36 AM8/17/17
to trac...@googlegroups.com
RjOllos wrote:
> I've been wondering about the purpose of [git] persistent_cache and what
> the trade-offs are with enabling it. I've looked through the code, but
> don't have a clear idea yet.

For background I started writing a new git backend for Trac based on
libgit2/pygit2 years ago, but didn't complete it because of the mismatch
between Trac repository datamodel requirements/expectations and the Git
data model. :\

My WIP is still at http://git.stuge.se/?p=trac.git;a=commitdiff;h=ea5437b


I don't exactly remember what persistent_cache does, but can talk
about those data model incompatibilities, because I think they are
the purpose for caching.

Trac wants to look up all previous *and* all later commits relative
to a given commit in a repository, while the Git data model only
stores previous commits.

Traversing commit history in reverse chronological order is quite
efficient in the Git data model, forward not so much.

So "later commits" can only be looked up efficiently if Trac caches
which commits are followed by which other commits, and that cache
must be updated with every new commit added (or removed!) in the
repository.

If you don't know the id of the new commit then you have to resync,
ie. scan all commits, which takes a long time for larger repos.


> once properly configured a cached repository is always preferred as
> it will offer better performance

...mmmhh.. Only if the cache is always reliably updated! If there is
ever a single lapse then Trac gets stuck at that point, until manual
resync.


> Can anyone describe the purpose of persistent_cache and the
> trade-offs with enabling it?

I am unsure, but the #11971 description mentions:

"1. The git connector constructs revisions cache for parents, children,
branches and tags. The construction is performed each request if
persistent_cache option is disabled. We could improve it."


> Would it make sense to have persistent_cache as a per-repository
> attribute, as in #12406?

I think it would, since the persistent cache is much more fragile.
It could be desirable to enable it selectively.


> Similarly, should git_fs_encoding be a per-repository attribute?

Clear yes on this one. One Trac instance may be serving many repos for
a team working in very different environments, where one repo uses a
special encoding. It would be super annoying to require a separate
Trac instance for that one repo. Or worse, if there are many repos
with different encodings - though I think that's less likely.



Maybe #11164 is also related to persistent_cache?

#11164: https://trac.edgewall.org/ticket/11164
#11971: https://trac.edgewall.org/ticket/11971


Thanks for your work on Trac. Hope I could help a bit.

//Peter

Peter Suter

unread,
Aug 17, 2017, 12:52:48 PM8/17/17
to trac...@googlegroups.com
On 17.08.2017 13:27, Peter Stuge wrote:
> For background I started writing a new git backend for Trac based on
> libgit2/pygit2 years ago, but didn't complete it because of the mismatch
> between Trac repository datamodel requirements/expectations and the Git
> data model. :\
>
> My WIP is still at http://git.stuge.se/?p=trac.git;a=commitdiff;h=ea5437b
Are you aware of / do you have any thoughts on TracPygit2Plugin?
https://trac-hacks.org/wiki/TracPygit2Plugin
https://trac.edgewall.org/ticket/10606

>> Can anyone describe the purpose of persistent_cache and the
>> trade-offs with enabling it?
> I am unsure, but the #11971 description mentions:
>
> "1. The git connector constructs revisions cache for parents, children,
> branches and tags. The construction is performed each request if
> persistent_cache option is disabled. We could improve it."
>
I don't know Git or the Trac VCS model all that well. Just from reading
the source it seems that enabling the persistent_cache leads to storing
a reference to this (in-memory) revision cache in a Python class
variable (StorageFactory.__dict_nonweak).
Presumably the goal is to reuse that cache across requests to improve
performance.
So the trade-off would be better performance vs. higher memory usage
plus, as you pointed out, a higher risk for something to go wrong due to
the increased complexity.
Also it only helps if the same Python interpreter is used for another
request (not the case in e.g. CGI), right?

I'm not sure if it makes sense to use both this in-memory cache and
Trac's DB CachedRepository at the same time or if that's redundant and
pointless.
Any thoughts on that?

Peter

Jun Omae

unread,
Aug 17, 2017, 3:38:40 PM8/17/17
to trac...@googlegroups.com
On Tue, Aug 15, 2017 at 8:15 AM, RjOllos <rjo...@gmail.com> wrote:
> I've been wondering about the purpose of [git] persistent_cache and what the
> trade-offs are with enabling it. I've looked through the code, but don't
> have a clear idea yet. I'd like to improve the TracRepositoryAdmin
> documentation for the [git] peristent_cache option and the cached
> [repositories] section attribute (#12406).
>
> The tradeoffs for the cached attribute look straightforward to me. There is
> more overhead in setting up explicit synchronization, but once properly
> configured a cached repository is always preferred as it will offer better
> performance. So the tradeoff is time to configure vs performance.
>
> Can anyone describe the purpose of persistent_cache and the trade-offs with
> enabling it?

First, tracopt.versioncontrol.git.* constructs cache for all
revisions, children of each revision, parents of each revision and all
refs of target Git repository in-memory.

Second, checking whether the cache is fresh is determined by whether
any refs in cached refs are changed (using git show-ref). If cache is
not fresh, it reconstructs cache.

Enabling [git] persistent_cache, it checks whether it needs to refresh
the cache when Repository.sync() is invoked via explicit
synchronization. If explicit synchronization is not set up, new
commits in the repository wouldn't be shown.

Disabling [git] persistent_cache, it checks whether it needs to
refresh the cache each web request. That is `git show-ref` command is
invoked each web request. However, new commits in the repository would
be shown even if no explicit synchronization.

I consider persistent_cache option is not needed for per-repository.

> Would it make sense to have persistent_cache as a per-repository attribute,
> as in #12406? Similarly, should git_fs_encoding be a per-repository
> attribute? The other [git] options appear they should be global rather than
> per-repository, but I wonder about those two.

I think it would be good to be able to use git_fs_encoding option for
both global and per-repository. The encoding is typically utf-8
however Trac currently cannot handle if some repositories use non
utf-8 encoding.

--
Jun Omae <jun...@gmail.com> (大前 潤)

RjOllos

unread,
Aug 17, 2017, 9:40:45 PM8/17/17
to Trac Development


On Thursday, August 17, 2017 at 3:38:40 PM UTC-4, Jun Omae wrote:
Enabling [git] persistent_cache, it checks whether it needs to refresh
the cache when Repository.sync() is invoked via explicit
synchronization. If explicit synchronization is not set up, new
commits in the repository wouldn't be shown.

Disabling [git] persistent_cache, it checks whether it needs to
refresh the cache each web request. That is `git show-ref` command is
invoked each web request. However, new commits in the repository would
be shown even if no explicit synchronization.

We have a sync_per_request attribute that determines when the db cache is synchronized. If you've setup explicit synchronization then I assume you'd want persistent_cache = True and sync_per_request = False. If you haven't setup explicit synchronization, you'd want persistent_cache = False and sync_per_request = True.

Would it make sense to just remove the persistent_cache option and use sync_per_request to determine whether the Git revision cache is reconstructed on every web request?

- Ryan

Jun Omae

unread,
Aug 17, 2017, 11:47:37 PM8/17/17
to trac...@googlegroups.com
I consider that is not good.

GitCachedRepository.sync() checks whether each revision of the
repository is cached in revision table to execute a SELECT query. That
would lead performance down if too many revisions.

However, I think we could improve the sync() for already synchronized
cached git repository, to check each ref is cached before checking
whether all revisions are cached. I agree removing persistent_cache
option if improving sync() by the following patch.

This patch is for 1.0-stable:

diff --git a/tracopt/versioncontrol/git/PyGIT.py
b/tracopt/versioncontrol/git/PyGIT.py
index 47f397389..32aac9184 100644
--- a/tracopt/versioncontrol/git/PyGIT.py
+++ b/tracopt/versioncontrol/git/PyGIT.py
@@ -596,6 +596,11 @@ class Storage(object):
key=lambda (name, rev, head): (not head, name))
return [(name, rev) for name, rev, head in branches]

+ def get_refs(self):
+ for refname, rev in self.rev_cache.refs_dict.iteritems():
+ if refname != 'HEAD':
+ yield refname, rev
+
def get_commits(self):
return self.rev_cache.rev_dict

diff --git a/tracopt/versioncontrol/git/git_fs.py
b/tracopt/versioncontrol/git/git_fs.py
index 031f68c2b..c33eb7a5d 100644
--- a/tracopt/versioncontrol/git/git_fs.py
+++ b/tracopt/versioncontrol/git/git_fs.py
@@ -100,6 +100,21 @@ class GitCachedRepository(CachedRepository):
return count > 0
return False

+ def needs_sync():
+ max_holders = 999
+ revs = sorted(set(rev for refname, rev in repos.git.get_refs()))
+ for idx in xrange(0, len(revs), max_holders):
+ revs_ = revs[idx:idx + max_holders]
+ holders = ','.join(('%s',) * len(revs_))
+ args = [self.id]
+ args.extend(revs_)
+ query = 'SELECT COUNT(*) FROM revision ' \
+ 'WHERE repos=%s AND rev IN (' + holders + ')'
+ for count, in self.env.db_query(query, args):
+ if count < len(revs_):
+ return True
+ return False
+
def traverse(rev, seen):
revs = []
merge_revs = []
@@ -121,9 +136,7 @@ class GitCachedRepository(CachedRepository):
revs[idx:idx] = traverse(rev, seen)
return revs

- while True:
- repos.sync()
- repos_youngest = repos.youngest_rev or ''
+ def sync_revs():
updated = False
seen = set()

@@ -148,9 +161,13 @@ class GitCachedRepository(CachedRepository):
if feedback:
feedback(rev)

- if updated:
- continue # sync again
+ return updated

+ while True:
+ repos.sync()
+ repos_youngest = repos.youngest_rev or ''
+ if needs_sync() and sync_revs():
+ continue # sync again
if meta_youngest != repos_youngest:
with self.env.db_transaction as db:
db("""

Peter Stuge

unread,
Aug 18, 2017, 11:12:16 AM8/18/17
to trac...@googlegroups.com
Peter Suter wrote:
> > For background I started writing a new git backend for Trac based on
> > libgit2/pygit2 years ago, but didn't complete it because of the mismatch
> > between Trac repository datamodel requirements/expectations and the Git
> > data model. :\
> >
> > My WIP is still at http://git.stuge.se/?p=trac.git;a=commitdiff;h=ea5437b
>
> Are you aware of / do you have any thoughts on TracPygit2Plugin?
> https://trac-hacks.org/wiki/TracPygit2Plugin
> https://trac.edgewall.org/ticket/10606

I wasn't aware. I think that development may have started after my work.

I haven't focused on this topic since then, so haven't seen it.

Thoughts - well I think that the aggressive caching of repo
information into the Trac database is fundamentally broken, and
that all repo plugin implementations will suffer from that.

I would expect TracPygit2Plugin to be better than my WIP, because Jun
is much more familiar with Trac internals, but I am unable to do a
proper comparison.


> I don't know Git or the Trac VCS model all that well.

The primary mismatch is that Trac expects to traverse commit
history equally easily in both directions, while Git only stores
history in reverse chronological order.

Another smaller point is that Git allows commit histories outside of
branches. A tag or even a plain commit id can also be the end point
for a series of commits.


> Just from reading
> the source it seems that enabling the persistent_cache leads to storing
> a reference to this (in-memory) revision cache in a Python class
> variable (StorageFactory.__dict_nonweak).
> Presumably the goal is to reuse that cache across requests to improve
> performance.

For anything but minimal repositories on private Trac instances, the
fact that git_fs spawns a git process is absolutely outrageous - let
alone the large number of processes that were spawned when I looked
at this.

I think the proper solution is to move some of what Trac tries to do
with metadata caching out of Trac and closer to the repo, and have
the Trac repo plugin take advantage both of pygit2 and of that
external parent/child index. Trac should already have access to all
needed information, it should not need to build caches, Trac is just
the wrong place for that, IMHO.


> So the trade-off would be better performance vs. higher memory usage
> plus, as you pointed out, a higher risk for something to go wrong due to
> the increased complexity.
> Also it only helps if the same Python interpreter is used for another
> request (not the case in e.g. CGI), right?

Anything but the most trivial instances will use long-running
processes, typically either tracd or using WSGI.


> I'm not sure if it makes sense to use both this in-memory cache and
> Trac's DB CachedRepository at the same time or if that's redundant and
> pointless.
> Any thoughts on that?

I get the impression that the persistent_cache is stored, and not
just in-memory - but it is possible that I am wrong about this.


Thanks

//Peter

RjOllos

unread,
Aug 21, 2017, 2:24:12 AM8/21/17
to Trac Development

That would be good, because otherwise we should provide a better description of persistent_cache and guidance for when it should be enabled. Having fewer configuration options will make Trac easier to setup and configure.

I created a ticket for your change:
https://trac.edgewall.org/ticket/12895

- Ryan
Reply all
Reply to author
Forward
0 new messages