[PATCH 04/16] Implement repository wide locking

69 views
Skip to first unread message

Zoran Zaric

unread,
Oct 21, 2012, 7:51:37 PM10/21/12
to bup-...@googlegroups.com, Zoran Zaric
Add git.lock(), git.unlock() and git.is_locked(), which handle locking
of the repository.

Signed-off-by: Zoran Zaric <z...@zoranzaric.de>
---
lib/bup/git.py | 14 ++++++++++++++
lib/bup/t/tgit.py | 20 ++++++++++++++++++++
2 files changed, 34 insertions(+)

diff --git a/lib/bup/git.py b/lib/bup/git.py
index d3cc13a..18fafbe 100644
--- a/lib/bup/git.py
+++ b/lib/bup/git.py
@@ -1129,3 +1129,17 @@ class NeededObjects():
return self.pack_bitarrays[name]
else:
return None
+
+def is_locked():
+ """Check if the bup repository is currently locked."""
+ return os.path.exists(repo('buplock'))
+
+def lock():
+ """Lock the bup repository."""
+ if not is_locked():
+ open(repo('buplock'), 'w').close()
+
+def unlock():
+ """Unlock the bup repository."""
+ if is_locked():
+ os.unlink(repo('buplock'))
diff --git a/lib/bup/t/tgit.py b/lib/bup/t/tgit.py
index b564736..19ea02e 100644
--- a/lib/bup/t/tgit.py
+++ b/lib/bup/t/tgit.py
@@ -164,3 +164,23 @@ def test_check_repo_or_die():
WVPASSEQ(e.code, 15)
else:
WVFAIL()
+
+
+@wvtest
+def test_lock():
+ os.environ['BUP_DIR'] = bupdir = 'pybuptest.tmp'
+ subprocess.call(['rm','-rf', bupdir])
+ git.init_repo(bupdir)
+
+ git.check_repo_or_die()
+
+ locked = git.is_locked()
+ WVPASSEQ(locked, False)
+
+ git.lock()
+ locked = git.is_locked()
+ WVPASSEQ(locked, True)
+
+ git.unlock()
+ locked = git.is_locked()
+ WVPASSEQ(locked, False)
--
1.7.12.4

Zoran Zaric

unread,
Oct 21, 2012, 7:51:35 PM10/21/12
to bup-...@googlegroups.com, Zoran Zaric

Signed-off-by: Zoran Zaric <z...@zoranzaric.de>
---
lib/bup/helpers.py | 20 ++++++++++++++++++++
lib/bup/t/thelpers.py | 9 +++++++++
2 files changed, 29 insertions(+)

diff --git a/lib/bup/helpers.py b/lib/bup/helpers.py
index 880c19a..e09c6de 100644
--- a/lib/bup/helpers.py
+++ b/lib/bup/helpers.py
@@ -732,3 +732,23 @@ def version_tag():
if n.startswith('tag: bup-'):
return n[9:]
return 'unknown-%s' % _version.COMMIT[:7]
+
+class BitArray():
+ """Provide a memory efficient set() like interface for indices."""
+ def __init__(self, size):
+ self._data = 0
+ self._size = size
+
+ def add(self, i):
+ if i != None:
+ self._data = self._data | (1 << i)
+
+ def remove(self, i):
+ if i != None and i in self:
+ self._data = self._data ^ (1 << i)
+
+ def __contains__(self, i):
+ if i != None:
+ return ((self._data & (1 << i)) >> i) == 1
+ else:
+ return False
diff --git a/lib/bup/t/thelpers.py b/lib/bup/t/thelpers.py
index 6e8252e..6e71fb4 100644
--- a/lib/bup/t/thelpers.py
+++ b/lib/bup/t/thelpers.py
@@ -85,3 +85,12 @@ def test_graft_path():
WVPASSEQ(graft_path(all_graft_points, path), "/opt/user")
WVPASSEQ(graft_path([(matching_full_path, new_path)], path),
"/opt")
+
+@wvtest
+def test_bit_array():
+ ba = BitArray(10)
+ WVPASSEQ((10 in ba), False)
+ ba.add(10)
+ WVPASSEQ((10 in ba), True)
+ ba.remove(10)
+ WVPASSEQ((10 in ba), False)
--
1.7.12.4

Zoran Zaric

unread,
Oct 21, 2012, 7:51:36 PM10/21/12
to bup-...@googlegroups.com, Zoran Zaric

Signed-off-by: Zoran Zaric <z...@zoranzaric.de>
---
Documentation/bup-repack.md | 67 ++++++++++++++++++++++
README.md | 5 ++
cmd/repack-cmd.py | 132 ++++++++++++++++++++++++++++++++++++++++++++
lib/bup/git.py | 69 +++++++++++++++++++++++
t/test.sh | 48 +++++++++++++++-
5 files changed, 319 insertions(+), 2 deletions(-)
create mode 100644 Documentation/bup-repack.md
create mode 100755 cmd/repack-cmd.py

diff --git a/Documentation/bup-repack.md b/Documentation/bup-repack.md
new file mode 100644
index 0000000..2bc07f5
--- /dev/null
+++ b/Documentation/bup-repack.md
@@ -0,0 +1,67 @@
+% bup-save(1) Bup %BUP_VERSION%
+% Zoran Zaric <z...@zoranzaric.de>
+% %BUP_DATE%
+
+# NAME
+
+bup-repack - repack a repository to free up space
+
+# SYNOPSIS
+
+bup repack [-n] [-q] [-#] [-f]
+
+# DESCRIPTION
+
+`bup repack` repacks all objects in a repository into new
+packfiles. It traverses the history and saves only needed
+objects.
+
+`bup repack` iterates over the existing packfiles and
+deletes each, after writing the last needed object to the
+new packfile. Because of this repacking a repository can
+use the biggest existing packfile's filesize as additional
+diskspace.
+
+# OPTIONS
+
+-n,--dry-run
+: don't do anything just print out what would be done
+
+-q, --quiet
+: disable progress messages.
+
+-*#*, --compress=*#*
+: set the compression level to # (a value from 0-9, where
+ 9 is the highest and 0 is no compression). The default
+ is 1 (fast, loose compression). WARNING: Changing the
+ compression level will change all objects and will result
+ in duplicate objects if the new compression level isn't
+ set on new saves.
+
+-f, --force
+: ignore free space check.
+
+
+# EXAMPLE
+
+ $ bup index -ux /etc
+ Indexing: 1981, done.
+
+ $ bup save -r myserver: -n my-pc-backup --bwlimit=50k /etc
+ Reading index: 1981, done.
+ Saving: 100.00% (998/998k, 1981/1981 files), done.
+
+ $ bup repack
+ Traversing my-pc-backup to find needed objects...
+ Traversing objects: 54323, done.
+ Writing new packfiles...
+ Writing objects: 28115, done.
+
+
+# SEE ALSO
+
+`bup-save`(1)
+
+# BUP
+
+Part of the `bup`(1) suite.
diff --git a/README.md b/README.md
index b3a937e..4b0e516 100644
--- a/README.md
+++ b/README.md
@@ -332,6 +332,11 @@ mailing list (see below) if you'd like to help.
We'll have to do it in a totally different way. There are lots of
options. For now: make sure you've got lots of disk space :)

+ UPDATE: With `bup repack` you actually can prune away old backups. Making
+ backups unreachable from backup-sets doesn't have a UI yet though and is
+ left as an exercise for the reader. *hint* *hint* `git filter-branch` can be
+ useful for this.
+
- bup has never been tested on anything but Linux, MacOS, and Windows+Cygwin.

There's nothing that makes it *inherently* non-portable, though, so
diff --git a/cmd/repack-cmd.py b/cmd/repack-cmd.py
new file mode 100755
index 0000000..02f0bbd
--- /dev/null
+++ b/cmd/repack-cmd.py
@@ -0,0 +1,132 @@
+#!/usr/bin/env python
+import sys, os
+from bup import git, options
+from bup.helpers import *
+
+def run(argv):
+ # at least in python 2.5, using "stdout=2" or "stdout=sys.stderr" below
+ # doesn't actually work, because subprocess closes fd #2 right before
+ # execing for some reason. So we work around it by duplicating the fd
+ # first.
+ fd = os.dup(2) # copy stderr
+ try:
+ p = subprocess.Popen(argv, stdout=fd, close_fds=False)
+ return p.wait()
+ finally:
+ os.close(fd)
+
+
+optspec = """
+bup repack
+--
+q,quiet don't show progress meter
+v,verbose increase log output (can be used more than once)
+n,dry-run don't do anything, just print what would be done
+f,force ignore the space check
+#,compress= set compression level to # (0-9, 9 is highest) [1] (See WARNING in manpage!)
+"""
+o = options.Options(optspec)
+(opt, flags, extra) = o.parse(sys.argv[1:])
+
+git.check_repo_or_die()
+
+handle_ctrl_c()
+
+if not opt.force:
+ # this only works on unix
+ vfs_stats = os.statvfs(git.repo())
+ free_space = vfs_stats.f_bsize * vfs_stats.f_bavail
+ if not opt.force and free_space < git.max_pack_size * 2:
+ o.fatal("insufficent space")
+
+cp = git.CatPipe()
+
+opt.progress = (istty2 and not opt.quiet)
+refs = git.list_refs()
+refnames = [name for name, sha in refs]
+
+pl = git.PackIdxList(git.repo('objects/pack'))
+total_objects = len(pl)
+
+needed_objects = git.NeededObjects(pl)
+
+# Find needed objects reachable from commits
+traversed_objects_counter = 0
+
+for refname in refnames:
+ if not refname.startswith('refs/heads/'):
+ continue
+ log('Traversing %s to find needed objects...\n' % refname[11:])
+ for date, sha in ((date, sha.encode('hex')) for date, sha in
+ git.rev_list(refname)):
+ for type, sha_ in git.traverse_commit(cp, sha, needed_objects):
+ traversed_objects_counter += 1
+ qprogress('Traversing objects (%d/%d)\r' %
+ (traversed_objects_counter, total_objects))
+
+# Find needed objects reachable from tags
+tags = git.tags()
+if len(tags) > 0:
+ for key in tags:
+ log('Traversing tag %s to find needed objects...\n' % ", ".join(tags[key]))
+ for type, sha in git.traverse_commit(cp, sha, needed_objects):
+ traversed_objects_counter += 1
+ qprogress('Traversing objects (%d/%d)\r' %
+ (traversed_objects_counter, total_objects))
+skipped_objects = total_objects - traversed_objects_counter
+if skipped_objects == 0:
+ progress('Traversing objects (%d/%d), done.\n' %
+ (traversed_objects_counter, total_objects))
+else:
+ progress('Traversing objects (%d/%d), done. Skipped %d\n' %
+ (traversed_objects_counter, total_objects, skipped_objects))
+
+
+if traversed_objects_counter == 0:
+ o.fatal('No reachable objects found.')
+
+
+if not opt.dry_run:
+ blob_writer = git.PackWriter(compression_level=opt.compress)
+ w = git.PackWriter(compression_level=opt.compress)
+
+log('Writing new packfiles...\n')
+par2 = False
+written_object_counter = 0
+for pack in needed_objects.packs:
+ ba = needed_objects.get_bitarray_for_pack(pack.name)
+ for offset, sha in pack.hashes_sorted_by_ofs():
+ idx = pack._idx_from_hash(sha)
+ if idx in ba:
+ it = iter(cp.get(sha.encode('hex')))
+ type = it.next()
+ content = "".join(it)
+ if not opt.dry_run:
+ if opt.verbose:
+ print "writing %s %s" % (sha.encode('hex'), type)
+ if type == 'blob':
+ blob_writer._write(sha, type, content)
+ else:
+ w._write(sha, type, content)
+ needed_objects.remove(sha.encode('hex'))
+ written_object_counter += 1
+ qprogress('Writing objects: %d\r' % written_object_counter)
+ else:
+ it = iter(cp.get(sha.encode('hex')))
+ type = it.next()
+ if opt.verbose:
+ print "not writing %s %s" % (sha.encode('hex'), type)
+ if not opt.dry_run:
+ os.unlink(pack.name)
+ os.unlink(pack.name[:-3] + "pack")
+ if os.path.exists(pack.name[:-3] + "par2"):
+ par2 = True
+ os.unlink(pack.name[:-3] + "par2")
+progress('Writing objects: %d, done.\n' % written_object_counter)
+
+if not opt.dry_run:
+ blob_writer.close()
+ w.close()
+ if par2:
+ run(['bup', 'fsck', '-g'])
+
diff --git a/lib/bup/git.py b/lib/bup/git.py
index 8048524..d3cc13a 100644
--- a/lib/bup/git.py
+++ b/lib/bup/git.py
@@ -1060,3 +1060,72 @@ def tags():
tags[c].append(name) # more than one tag can point at 'c'

return tags
+
+def traverse_commit(cp, sha_hex, needed_objects):
+ if sha_hex not in needed_objects:
+ needed_objects.add(sha_hex)
+ yield ('commit', sha_hex)
+
+ it = iter(cp.get(sha_hex))
+ type = it.next()
+ assert(type == 'commit')
+ tree_sha = "".join(it).split("\n")[0][5:].rstrip(" ")
+ for obj in traverse_objects(cp, tree_sha, needed_objects):
+ yield obj
+
+
+def traverse_objects(cp, sha_hex, needed_objects):
+ if sha_hex not in needed_objects:
+ needed_objects.add(sha_hex)
+ it = iter(cp.get(sha_hex))
+ type = it.next()
+
+ if type == 'commit':
+ yield ('commit', sha_hex)
+
+ tree_sha = "".join(it).split("\n")[0][5:].rstrip(" ")
+
+ for obj in traverse_objects(cp, tree_sha, needed_objects):
+ yield obj
+
+ if type == 'tree':
+ yield ('tree', sha_hex)
+
+ for (mode,mangled_name,sha) in tree_decode("".join(it)):
+ for obj in traverse_objects(cp, sha.encode('hex'),
+ needed_objects):
+ yield obj
+
+ elif type == 'blob':
+ yield ('blob', sha_hex)
+
+class NeededObjects():
+ def __init__(self, pack_idx_list):
+ self.packs = [pack for pack in pack_idx_list.packs
+ if isinstance(pack, PackIdx)]
+ self.pack_bitarrays = dict()
+ for pack in self.packs:
+ self.pack_bitarrays[pack.name] = BitArray(len(pack))
+
+ def __contains__(self, sha):
+ for pack in self.packs:
+ idx = pack._idx_from_hash(sha.decode('hex'))
+ if idx in self.pack_bitarrays[pack.name]:
+ return True
+ return False
+
+ def add(self, sha):
+ for pack in self.packs:
+ idx = pack._idx_from_hash(sha.decode('hex'))
+ self.pack_bitarrays[pack.name].add(idx)
+
+ def remove(self, sha):
+ for pack in self.packs:
+ idx = pack._idx_from_hash(sha.decode('hex'))
+ self.pack_bitarrays[pack.name].remove(idx)
+
+ def get_bitarray_for_pack(self, name):
+ if name in self.pack_bitarrays:
+ return self.pack_bitarrays[name]
+ else:
+ return None
diff --git a/t/test.sh b/t/test.sh
index 2f1f24c..b1f1e9b 100755
--- a/t/test.sh
+++ b/t/test.sh
@@ -215,8 +215,6 @@ WVSTART "save/git-fsck"
(
set -e
cd "$BUP_DIR" || exit 1
- #git repack -Ad
- #git prune
(cd "$TOP/t/sampledata" && WVPASS bup save -vvn master /) || WVFAIL
n=$(git fsck --full --strict 2>&1 |
egrep -v 'dangling (commit|tree|blob)' |
@@ -484,3 +482,49 @@ WVPASSEQ "$(bup ls compression/latest/ | sort)" "$(ls $TOP/Documentation | sort)
COMPRESSION_9_SIZE=$(du -s $D | cut -f1)

WVPASS [ "$COMPRESSION_9_SIZE" -lt "$COMPRESSION_0_SIZE" ]
+
+WVSTART 'repack'
+D=repack.tmp
+export BUP_DIR="$TOP/$D/.bup"
+rm -rf $D
+mkdir $D
+dd if=/dev/urandom of=$D/repack-file bs=1M count=10
+bup init
+
+# Index and save test tree to source bupdir
+bup index -ux "$D"
+bup save -n repack "$D"
+bup tag foo repack
+
+sleep 3
+
+bup index -ux "$D"
+bup save -n repack "$D"
+
+WVPASS bup repack -f
+
+bup index -ux "$D"
+bup save -n repack "$D"
+
+WVPASS bup repack -f
+WVPASS bup fsck
+
+WVPASS bup restore repack/latest$TOP/$D/repack-file -C $D/out
+WVPASS diff $D/repack-file $D/out/repack-file
+
+bup index -ux "$D"
+bup save -n repack "$D"
+
+WVPASS bup repack -n -f
+
+bup index -ux "$D"
+bup save -n repack "$D"
+if bup fsck --par2-ok; then
+ bup fsck -g
+
+ WVPASS bup repack -f
+
+ WVPASSEQ $(bup ls repack/ | wc -l) "7"
+ WVPASSEQ $(ls "$BUP_DIR/objects/pack" | grep "pack$" | wc -l) $(ls "$BUP_DIR/objects/pack" | grep "par2$" | grep -v "vol" | wc -l)
+fi
+
--
1.7.12.4

Zoran Zaric

unread,
Oct 21, 2012, 7:51:34 PM10/21/12
to bup-...@googlegroups.com, Zoran Zaric

Signed-off-by: Zoran Zaric <z...@zoranzaric.de>
---
lib/bup/git.py | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/lib/bup/git.py b/lib/bup/git.py
index 3406edf..8048524 100644
--- a/lib/bup/git.py
+++ b/lib/bup/git.py
@@ -290,6 +290,10 @@ class PackIdxV1(PackIdx):
for i in xrange(self.fanout[255]):
yield buffer(self.map, 256*4 + 24*i + 4, 20)

+ def hashes_sorted_by_ofs(self):
+ return sorted(((self._ofs_from_idx(i), str(obj))
+ for i, obj in enumerate(self)))
+

class PackIdxV2(PackIdx):
"""Object representation of a Git pack index (version 2) file."""
@@ -325,6 +329,10 @@ class PackIdxV2(PackIdx):
for i in xrange(self.fanout[255]):
yield buffer(self.map, 8 + 256*4 + 20*i, 20)

+ def hashes_sorted_by_ofs(self):
+ return sorted(((self._ofs_from_idx(i), str(obj))
+ for i, obj in enumerate(self)))
+

_mpi_count = 0
class PackIdxList:
--
1.7.12.4

Zoran Zaric

unread,
Oct 21, 2012, 7:51:40 PM10/21/12
to bup-...@googlegroups.com, Tim Riemenschneider, Tim Riemenschneider, Zoran Zaric
From: Tim Riemenschneider <g...@tim-riemenschneider.de>

running bup repack on an empty repository left the repository locked.

Signed-off-by: Tim Riemenschneider <deb...@tim-riemenschneider.de>
Reviewed-by: Zoran Zaric <z...@zoranzaric.de>
Signed-off-by: Zoran Zaric <z...@zoranzaric.de>
---
cmd/repack-cmd.py | 1 +
t/test.sh | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/cmd/repack-cmd.py b/cmd/repack-cmd.py
index 8ab2b6c..5d44810 100755
--- a/cmd/repack-cmd.py
+++ b/cmd/repack-cmd.py
@@ -90,6 +90,7 @@ try:


if traversed_objects_counter == 0:
+ git.unlock()
o.fatal('No reachable objects found.')


diff --git a/t/test.sh b/t/test.sh
index 73836a6..8ea6730 100755
--- a/t/test.sh
+++ b/t/test.sh
@@ -539,4 +539,4 @@ touch $D/foo
bup index -ux $D
touch $BUP_DIR/buplock
WVPASSEQ "$(bup save -n foo $D 2>&1 | tail -n 1)" "error: the repository is currently locked"
-WVPASSEQ "$(bup repack 2>&1 | tail -n 1)" "error: the repository is currently locked"
+WVPASSEQ "$(bup repack -f 2>&1 | tail -n 1)" "error: the repository is currently locked"
--
1.7.12.4

Zoran Zaric

unread,
Oct 21, 2012, 7:51:41 PM10/21/12
to bup-...@googlegroups.com, Tim Riemenschneider, Tim Riemenschneider, Zoran Zaric
From: Tim Riemenschneider <g...@tim-riemenschneider.de>

otherwise these would be left behind

Signed-off-by: Tim Riemenschneider <deb...@tim-riemenschneider.de>
[z...@zoranzaric.de: Remove the test that counts packfiles]
Signed-off-by: Zoran Zaric <z...@zoranzaric.de>
---
cmd/repack-cmd.py | 3 +++
t/test.sh | 17 +++++++++++++++++
2 files changed, 20 insertions(+)

diff --git a/cmd/repack-cmd.py b/cmd/repack-cmd.py
index 5d44810..63b81cf 100755
--- a/cmd/repack-cmd.py
+++ b/cmd/repack-cmd.py
@@ -130,6 +130,9 @@ try:
if os.path.exists(pack.name[:-3] + "par2"):
par2 = True
os.unlink(pack.name[:-3] + "par2")
+ if os.path.exists(pack.name[:-3] + "vol000+200.par2"):
+ par2 = True
+ os.unlink(pack.name[:-3] + "vol000+200.par2")
progress('Writing objects: %d, done.\n' % written_object_counter)

if not opt.dry_run:
diff --git a/t/test.sh b/t/test.sh
index 8ea6730..c544c73 100755
--- a/t/test.sh
+++ b/t/test.sh
@@ -526,6 +526,12 @@ if bup fsck --par2-ok; then

WVPASSEQ $(bup ls repack/ | wc -l) "7"
WVPASSEQ $(ls "$BUP_DIR/objects/pack" | grep "pack$" | wc -l) $(ls "$BUP_DIR/objects/pack" | grep "par2$" | grep -v "vol" | wc -l)
+
+ WVPASS bup repack -f
+ # One .par2 control-file per pack
+ WVPASSEQ $(ls "$BUP_DIR/objects/pack" | grep "pack$" | wc -l) $(ls "$BUP_DIR/objects/pack" | grep "par2$" | grep -v "vol" | wc -l)
+ # One par2-volume file per pack
+ WVPASSEQ $(ls "$BUP_DIR/objects/pack" | grep "pack$" | wc -l) $(ls "$BUP_DIR/objects/pack" | grep "par2$" | grep "vol" | wc -l)
fi

WVSTART 'lock'
@@ -540,3 +546,14 @@ bup index -ux $D
touch $BUP_DIR/buplock
WVPASSEQ "$(bup save -n foo $D 2>&1 | tail -n 1)" "error: the repository is currently locked"
WVPASSEQ "$(bup repack -f 2>&1 | tail -n 1)" "error: the repository is currently locked"
+
+WVSTART 'unlock-empty'
+D=unlock.tmp
+export BUP_DIR="$TOP/$D/.bup"
+rm -rf $D
+mkdir $D
+bup init
+
+WVPASSEQ "$(bup repack -f 2>&1 | tail -n 1)" "error: No reachable objects found."
+WVPASSEQ $(ls "$BUP_DIR/buplock" | wc -l) "0"
+WVPASSNE "$(bup repack -f 2>&1 | tail -n 1)" "error: the repository is currently locked"
--
1.7.12.4

Zoran Zaric

unread,
Oct 21, 2012, 7:51:33 PM10/21/12
to bup-...@googlegroups.com
This patchset adds the repack command. Repack introduces the possibility to get
rid of unneeded objects.

It is based on current master (4cdbfa3c59b73b926632b1ab7b9683d7dba22e1e).

Zoran Zaric

unread,
Oct 21, 2012, 7:51:39 PM10/21/12
to bup-...@googlegroups.com, Zoran Zaric

Signed-off-by: Zoran Zaric <z...@zoranzaric.de>
---
cmd/repack-cmd.py | 3 +++
cmd/save-cmd.py | 4 ++++
t/test.sh | 12 ++++++++++++
3 files changed, 19 insertions(+)

diff --git a/cmd/repack-cmd.py b/cmd/repack-cmd.py
index b1e04c0..8ab2b6c 100755
--- a/cmd/repack-cmd.py
+++ b/cmd/repack-cmd.py
@@ -30,6 +30,9 @@ o = options.Options(optspec)

git.check_repo_or_die()

+if git.is_locked():
+ o.fatal("the repository is currently locked")
+
handle_ctrl_c()

if not opt.force:
diff --git a/cmd/save-cmd.py b/cmd/save-cmd.py
index fb45427..4714a36 100755
--- a/cmd/save-cmd.py
+++ b/cmd/save-cmd.py
@@ -27,6 +27,10 @@ o = options.Options(optspec)
(opt, flags, extra) = o.parse(sys.argv[1:])

git.check_repo_or_die()
+
+if git.is_locked():
+ o.fatal("the repository is currently locked")
+
if not (opt.tree or opt.commit or opt.name):
o.fatal("use one or more of -t, -c, -n")
if not extra:
diff --git a/t/test.sh b/t/test.sh
index b1f1e9b..73836a6 100755
--- a/t/test.sh
+++ b/t/test.sh
@@ -528,3 +528,15 @@ if bup fsck --par2-ok; then
WVPASSEQ $(ls "$BUP_DIR/objects/pack" | grep "pack$" | wc -l) $(ls "$BUP_DIR/objects/pack" | grep "par2$" | grep -v "vol" | wc -l)
fi

+WVSTART 'lock'
+D=lock.tmp
+export BUP_DIR="$TOP/$D/.bup"
+rm -rf $D
+mkdir $D
+bup init
+
+touch $D/foo
+bup index -ux $D
+touch $BUP_DIR/buplock
+WVPASSEQ "$(bup save -n foo $D 2>&1 | tail -n 1)" "error: the repository is currently locked"
+WVPASSEQ "$(bup repack 2>&1 | tail -n 1)" "error: the repository is currently locked"
--
1.7.12.4

Zoran Zaric

unread,
Oct 21, 2012, 7:51:38 PM10/21/12
to bup-...@googlegroups.com, Zoran Zaric

Signed-off-by: Zoran Zaric <z...@zoranzaric.de>
---
cmd/repack-cmd.py | 171 ++++++++++++++++++++++++++++--------------------------
1 file changed, 89 insertions(+), 82 deletions(-)

diff --git a/cmd/repack-cmd.py b/cmd/repack-cmd.py
index 02f0bbd..b1e04c0 100755
--- a/cmd/repack-cmd.py
+++ b/cmd/repack-cmd.py
@@ -45,88 +45,95 @@ opt.progress = (istty2 and not opt.quiet)
refs = git.list_refs()
refnames = [name for name, sha in refs]

-pl = git.PackIdxList(git.repo('objects/pack'))
-total_objects = len(pl)
-
-needed_objects = git.NeededObjects(pl)
-
-# Find needed objects reachable from commits
-traversed_objects_counter = 0
-
-for refname in refnames:
- if not refname.startswith('refs/heads/'):
- continue
- log('Traversing %s to find needed objects...\n' % refname[11:])
- for date, sha in ((date, sha.encode('hex')) for date, sha in
- git.rev_list(refname)):
- for type, sha_ in git.traverse_commit(cp, sha, needed_objects):
- traversed_objects_counter += 1
- qprogress('Traversing objects (%d/%d)\r' %
- (traversed_objects_counter, total_objects))
-
-# Find needed objects reachable from tags
-tags = git.tags()
-if len(tags) > 0:
- for key in tags:
- log('Traversing tag %s to find needed objects...\n' % ", ".join(tags[key]))
- for type, sha in git.traverse_commit(cp, sha, needed_objects):
- traversed_objects_counter += 1
- qprogress('Traversing objects (%d/%d)\r' %
- (traversed_objects_counter, total_objects))
-skipped_objects = total_objects - traversed_objects_counter
-if skipped_objects == 0:
- progress('Traversing objects (%d/%d), done.\n' %
- (traversed_objects_counter, total_objects))
-else:
- progress('Traversing objects (%d/%d), done. Skipped %d\n' %
- (traversed_objects_counter, total_objects, skipped_objects))
-
-
-if traversed_objects_counter == 0:
- o.fatal('No reachable objects found.')
-
-
-if not opt.dry_run:
- blob_writer = git.PackWriter(compression_level=opt.compress)
- w = git.PackWriter(compression_level=opt.compress)
-
-log('Writing new packfiles...\n')
-par2 = False
-written_object_counter = 0
-for pack in needed_objects.packs:
- ba = needed_objects.get_bitarray_for_pack(pack.name)
- for offset, sha in pack.hashes_sorted_by_ofs():
- idx = pack._idx_from_hash(sha)
- if idx in ba:
- it = iter(cp.get(sha.encode('hex')))
- type = it.next()
- content = "".join(it)
- if not opt.dry_run:
+git.lock()
+
+try:
+ pl = git.PackIdxList(git.repo('objects/pack'))
+ total_objects = len(pl)
+
+ needed_objects = git.NeededObjects(pl)
+
+ # Find needed objects reachable from commits
+ traversed_objects_counter = 0
+
+ for refname in refnames:
+ if not refname.startswith('refs/heads/'):
+ continue
+ log('Traversing %s to find needed objects...\n' % refname[11:])
+ for date, sha in ((date, sha.encode('hex')) for date, sha in
+ git.rev_list(refname)):
+ for type, sha_ in git.traverse_commit(cp, sha, needed_objects):
+ traversed_objects_counter += 1
+ qprogress('Traversing objects (%d/%d)\r' %
+ (traversed_objects_counter, total_objects))
+
+ # Find needed objects reachable from tags
+ tags = git.tags()
+ if len(tags) > 0:
+ for key in tags:
+ log('Traversing tag %s to find needed objects...\n' %
+ ", ".join(tags[key]))
+ for type, sha in git.traverse_commit(cp, sha, needed_objects):
+ traversed_objects_counter += 1
+ qprogress('Traversing objects (%d/%d)\r' %
+ (traversed_objects_counter, total_objects))
+ skipped_objects = total_objects - traversed_objects_counter
+ if skipped_objects == 0:
+ progress('Traversing objects (%d/%d), done.\n' %
+ (traversed_objects_counter, total_objects))
+ else:
+ progress('Traversing objects (%d/%d), done. Skipped %d\n' %
+ (traversed_objects_counter, total_objects, skipped_objects))
+
+
+ if traversed_objects_counter == 0:
+ o.fatal('No reachable objects found.')
+
+
+ if not opt.dry_run:
+ blob_writer = git.PackWriter(compression_level=opt.compress)
+ w = git.PackWriter(compression_level=opt.compress)
+
+ log('Writing new packfiles...\n')
+ par2 = False
+ written_object_counter = 0
+ for pack in needed_objects.packs:
+ ba = needed_objects.get_bitarray_for_pack(pack.name)
+ for offset, sha in pack.hashes_sorted_by_ofs():
+ idx = pack._idx_from_hash(sha)
+ if idx in ba:
+ it = iter(cp.get(sha.encode('hex')))
+ type = it.next()
+ content = "".join(it)
+ if not opt.dry_run:
+ if opt.verbose:
+ print "writing %s %s" % (sha.encode('hex'), type)
+ if type == 'blob':
+ blob_writer._write(sha, type, content)
+ else:
+ w._write(sha, type, content)
+ needed_objects.remove(sha.encode('hex'))
+ written_object_counter += 1
+ qprogress('Writing objects: %d\r' % written_object_counter)
+ else:
+ it = iter(cp.get(sha.encode('hex')))
+ type = it.next()
if opt.verbose:
- print "writing %s %s" % (sha.encode('hex'), type)
- if type == 'blob':
- blob_writer._write(sha, type, content)
- else:
- w._write(sha, type, content)
- needed_objects.remove(sha.encode('hex'))
- written_object_counter += 1
- qprogress('Writing objects: %d\r' % written_object_counter)
- else:
- it = iter(cp.get(sha.encode('hex')))
- type = it.next()
- if opt.verbose:
- print "not writing %s %s" % (sha.encode('hex'), type)
+ print "not writing %s %s" % (sha.encode('hex'), type)
+ if not opt.dry_run:
+ os.unlink(pack.name)
+ os.unlink(pack.name[:-3] + "pack")
+ if os.path.exists(pack.name[:-3] + "par2"):
+ par2 = True
+ os.unlink(pack.name[:-3] + "par2")
+ progress('Writing objects: %d, done.\n' % written_object_counter)
+
if not opt.dry_run:
- os.unlink(pack.name)
- os.unlink(pack.name[:-3] + "pack")
- if os.path.exists(pack.name[:-3] + "par2"):
- par2 = True
- os.unlink(pack.name[:-3] + "par2")
-progress('Writing objects: %d, done.\n' % written_object_counter)
-
-if not opt.dry_run:
- blob_writer.close()
- w.close()
- if par2:
- run(['bup', 'fsck', '-g'])
+ blob_writer.close()
+ w.close()
+ if par2:
+ run(['bup', 'fsck', '-g'])
+
+finally:
+ git.unlock()

--
1.7.12.4

Zoran Zaric

unread,
Oct 21, 2012, 7:51:47 PM10/21/12
to bup-...@googlegroups.com, Zoran Zaric

Signed-off-by: Zoran Zaric <z...@zoranzaric.de>
---
lib/bup/helpers.py | 26 +++++++++++++++++++++++++-
lib/bup/t/thelpers.py | 5 +++++
2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/lib/bup/helpers.py b/lib/bup/helpers.py
index e09c6de..0d834f6 100644
--- a/lib/bup/helpers.py
+++ b/lib/bup/helpers.py
@@ -1,7 +1,7 @@
"""Helper functions and classes for bup."""

import sys, os, pwd, subprocess, errno, socket, select, mmap, stat, re, struct
-import heapq, operator, time, platform
+import heapq, operator, time, platform, __builtin__
from bup import _version, _helpers
import bup._helpers as _helpers

@@ -733,6 +733,30 @@ def version_tag():
return n[9:]
return 'unknown-%s' % _version.COMMIT[:7]

+def _bin(value):
+ assert(value >= 0)
+ binmap = {
+ '0': '0000',
+ '1': '0001',
+ '2': '0010',
+ '3': '0011',
+ '4': '0100',
+ '5': '0101',
+ '6': '0110',
+ '7': '0111',
+ '8': '0000',
+ '9': '1001',
+ 'a': '1010',
+ 'b': '1011',
+ 'c': '1100',
+ 'd': '1101',
+ 'e': '1110',
+ 'f': '1111'
+ }
+ return '0b' + ''.join(binmap[x] for x in ('%x' % (value,))).lstrip('0') or '0'
+
+bin = getattr(__builtin__, "bin", _bin)
+
class BitArray():
"""Provide a memory efficient set() like interface for indices."""
def __init__(self, size):
diff --git a/lib/bup/t/thelpers.py b/lib/bup/t/thelpers.py
index 6e71fb4..b2dce90 100644
--- a/lib/bup/t/thelpers.py
+++ b/lib/bup/t/thelpers.py
@@ -94,3 +94,8 @@ def test_bit_array():
WVPASSEQ((10 in ba), True)
ba.remove(10)
WVPASSEQ((10 in ba), False)
+
+@wvtest
+def test_bin():
+ WVPASSEQ(bin(1), '0b1')
+ WVPASSEQ(bin(1337), '0b10100111001')
--
1.7.12.4

Zoran Zaric

unread,
Oct 21, 2012, 7:51:46 PM10/21/12
to bup-...@googlegroups.com, Zoran Zaric

Signed-off-by: Zoran Zaric <z...@zoranzaric.de>
---
t/test.sh | 2 ++
1 file changed, 2 insertions(+)

diff --git a/t/test.sh b/t/test.sh
index 804d16b..4fd4ed5 100755
--- a/t/test.sh
+++ b/t/test.sh
@@ -486,6 +486,7 @@ WVPASS [ "$COMPRESSION_9_SIZE" -lt "$COMPRESSION_0_SIZE" ]
WVSTART 'repack'
D=repack.tmp
export BUP_DIR="$TOP/$D/.bup"
+export GIT_DIR="$TOP/$D/.bup"
rm -rf $D
mkdir $D
dd if=/dev/urandom of=$D/repack-file bs=1M count=10
@@ -535,6 +536,7 @@ if bup fsck --par2-ok; then
fi
bup save -n repack "$D"
bup save -n repack "$D"
+WVPASSEQ "$(git fsck --unreachable)" ""
# Force create a midx-file
bup midx -f
WVPASS bup repack -f
--
1.7.12.4

Zoran Zaric

unread,
Oct 21, 2012, 7:51:48 PM10/21/12
to bup-...@googlegroups.com, Zoran Zaric

Signed-off-by: Zoran Zaric <z...@zoranzaric.de>
---
lib/bup/helpers.py | 4 ++++
lib/bup/t/thelpers.py | 2 ++
2 files changed, 6 insertions(+)

diff --git a/lib/bup/helpers.py b/lib/bup/helpers.py
index 0d834f6..c8a5bbc 100644
--- a/lib/bup/helpers.py
+++ b/lib/bup/helpers.py
@@ -776,3 +776,7 @@ class BitArray():
return ((self._data & (1 << i)) >> i) == 1
else:
return False
+
+ def ratio(self):
+ return float(bin(self._data).count("1")) / self._size
+
diff --git a/lib/bup/t/thelpers.py b/lib/bup/t/thelpers.py
index b2dce90..c347216 100644
--- a/lib/bup/t/thelpers.py
+++ b/lib/bup/t/thelpers.py
@@ -92,8 +92,10 @@ def test_bit_array():
WVPASSEQ((10 in ba), False)
ba.add(10)
WVPASSEQ((10 in ba), True)
+ WVPASSEQ(ba.ratio(), 0.1)
ba.remove(10)
WVPASSEQ((10 in ba), False)
+ WVPASSEQ(ba.ratio(), 0)

@wvtest
def test_bin():
--
1.7.12.4

Zoran Zaric

unread,
Oct 21, 2012, 7:51:49 PM10/21/12
to bup-...@googlegroups.com, Zoran Zaric
Reported-by: Yung-Chin Oei <yung...@yungchin.nl>
Signed-off-by: Zoran Zaric <z...@zoranzaric.de>
---
cmd/repack-cmd.py | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/cmd/repack-cmd.py b/cmd/repack-cmd.py
index 82ecd0c..ec73f83 100755
--- a/cmd/repack-cmd.py
+++ b/cmd/repack-cmd.py
@@ -104,6 +104,10 @@ try:
written_object_counter = 0
for pack in needed_objects.packs:
ba = needed_objects.get_bitarray_for_pack(pack.name)
+ if abs(ba.ratio() - 1) < 0.000001:
+ if opt.verbose:
+ print "All objects from packfile %s are needed" % pack.name
+ continue
for offset, sha in pack.hashes_sorted_by_ofs():
idx = pack._idx_from_hash(sha)
if idx in ba:
--
1.7.12.4

Zoran Zaric

unread,
Oct 21, 2012, 7:51:43 PM10/21/12
to bup-...@googlegroups.com, Tim Riemenschneider, Zoran Zaric
From: Tim Riemenschneider <g...@tim-riemenschneider.de>

Instead of running 'bup fsck -g' (and thus require bup being in the PATH)
use the environment variable BUP_MAIN_EXE
(otherwise the testcase with par2 fails, too)

Signed-off-by: Tim Riemenschneider <g...@tim-riemenschneider.de>
Reviewed-by: Zoran Zaric <z...@zoranzaric.de>
Signed-off-by: Zoran Zaric <z...@zoranzaric.de>
---
cmd/repack-cmd.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cmd/repack-cmd.py b/cmd/repack-cmd.py
index 11c5211..1bda2dd 100755
--- a/cmd/repack-cmd.py
+++ b/cmd/repack-cmd.py
@@ -140,7 +140,7 @@ try:
blob_writer.close()
w.close()
if par2:
- run(['bup', 'fsck', '-g'])
+ run([os.environ['BUP_MAIN_EXE'], 'fsck', '-g'])

finally:
git.unlock()
--
1.7.12.4

Zoran Zaric

unread,
Oct 21, 2012, 7:51:44 PM10/21/12
to bup-...@googlegroups.com, Tim Riemenschneider, Zoran Zaric
From: Tim Riemenschneider <g...@tim-riemenschneider.de>

otherwise subsequent bup-runs fire warnings of missing .idx files used by the
(old) midx-file(s)

Signed-off-by: Tim Riemenschneider <g...@tim-riemenschneider.de>
Reviewed-by: Zoran Zaric <z...@zoranzaric.de>
Signed-off-by: Zoran Zaric <z...@zoranzaric.de>
---
cmd/repack-cmd.py | 5 +++++
t/test.sh | 6 ++++++
2 files changed, 11 insertions(+)

diff --git a/cmd/repack-cmd.py b/cmd/repack-cmd.py
index 1bda2dd..82ecd0c 100755
--- a/cmd/repack-cmd.py
+++ b/cmd/repack-cmd.py
@@ -145,3 +145,8 @@ try:
finally:
git.unlock()

+# If there are any *.midx-files, they are obsolete now (they reference deleted *.idx-files)
+import glob
+for midx in glob.glob(os.path.join(git.repo('objects/pack'), 'midx-*.midx')):
+ unlink(midx)
+
diff --git a/t/test.sh b/t/test.sh
index c544c73..771964c 100755
--- a/t/test.sh
+++ b/t/test.sh
@@ -533,6 +533,12 @@ if bup fsck --par2-ok; then
# One par2-volume file per pack
WVPASSEQ $(ls "$BUP_DIR/objects/pack" | grep "pack$" | wc -l) $(ls "$BUP_DIR/objects/pack" | grep "par2$" | grep "vol" | wc -l)
fi
+bup save -n repack "$D"
+bup save -n repack "$D"
+# Force create a midx-file
+bup midx -f
+WVPASS bup repack -f
+WVPASSNE "$(bup save -n repack "$D" 2>&1 | head -n 1 | cut -d' ' -f1,2,4)" "warning: index missing"

WVSTART 'lock'
D=lock.tmp
--
1.7.12.4

Zoran Zaric

unread,
Oct 21, 2012, 7:51:42 PM10/21/12
to bup-...@googlegroups.com, Tim Riemenschneider, Zoran Zaric
From: Tim Riemenschneider <g...@tim-riemenschneider.de>

repack only worked, if the git.PackIdxList still contained all the
*.idx files, one per *.pack file. If they got replaced by a *.midx file,
not only it traversed objects multiple times, these objects are not
considered for repack. (Since I think that the other pack-files are
untouched, no data-loss should occur)

This fix tells PackIdxList to ignore the midx-files.

Signed-off-by: Tim Riemenschneider <g...@tim-riemenschneider.de>
Reviewed-by: Zoran Zaric <z...@zoranzaric.de>
Signed-off-by: Zoran Zaric <z...@zoranzaric.de>
---
cmd/repack-cmd.py | 1 +
1 file changed, 1 insertion(+)

diff --git a/cmd/repack-cmd.py b/cmd/repack-cmd.py
index 63b81cf..11c5211 100755
--- a/cmd/repack-cmd.py
+++ b/cmd/repack-cmd.py
@@ -52,6 +52,7 @@ git.lock()

try:
pl = git.PackIdxList(git.repo('objects/pack'))
+ pl.refresh(skip_midx = True)
total_objects = len(pl)

needed_objects = git.NeededObjects(pl)
--
1.7.12.4

Zoran Zaric

unread,
Oct 21, 2012, 7:51:45 PM10/21/12
to bup-...@googlegroups.com, Tim Riemenschneider, Zoran Zaric
From: Tim Riemenschneider <g...@tim-riemenschneider.de>

(containing a recipe to reproduce it)

Signed-off-by: Tim Riemenschneider <g...@tim-riemenschneider.de>
[z...@zoranzaric.de: Remove grep-writing-objects-tests because we don't
rewrite the packfile anymore]
Signed-off-by: Zoran Zaric <z...@zoranzaric.de>
---
t/test.sh | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)

diff --git a/t/test.sh b/t/test.sh
index 771964c..804d16b 100755
--- a/t/test.sh
+++ b/t/test.sh
@@ -563,3 +563,27 @@ bup init
WVPASSEQ "$(bup repack -f 2>&1 | tail -n 1)" "error: No reachable objects found."
WVPASSEQ $(ls "$BUP_DIR/buplock" | wc -l) "0"
WVPASSNE "$(bup repack -f 2>&1 | tail -n 1)" "error: the repository is currently locked"
+
+WVSTART "repack-and-midx"
+D=repack-midx.tmp
+export BUP_DIR="$TOP/$D/.bup"
+rm -rf $D
+mkdir $D
+bup init
+touch $D/foo
+bup index -ux "$D"
+bup save -n repack-midx --strip "$D"
+bup index --fake-invalid -ux "$D"
+bup save -n repack-midx --strip "$D"
+# We have 2 packs...
+WVPASSEQ $(ls "$BUP_DIR/objects/pack" | grep "pack$" | wc -l) "2"
+# ... which have a total of 4 objects
+# (blob + tree + commit in 1st pack, commit only (referencing 1st tree+blob) in second pack)
+WVPASSEQ "$(GIT_DIR=$BUP_DIR git count-objects -v | grep 'in-pack')" "in-pack: 4"
+# force a midx, this "hides" the normal idx in repack
+WVPASS bup midx -f
+# Full output of broken repack:
+#Traversing repack-midx to find needed objects...
+#Traversing objects (6/4), done. Skipped -2
+#Writing new packfiles...
+#Writing objects: 0, done.
--
1.7.12.4

Zoran Zaric

unread,
Oct 22, 2012, 8:27:03 PM10/22/12
to bup-...@googlegroups.com
On Mon, Oct 22, 2012 at 01:51:38AM +0200, Zoran Zaric wrote:
>
> Signed-off-by: Zoran Zaric <z...@zoranzaric.de>
> ---
> cmd/repack-cmd.py | 171 ++++++++++++++++++++++++++++--------------------------
> 1 file changed, 89 insertions(+), 82 deletions(-)

In my local branch I squashed this into "Add repack command".

Oei, Yung-Chin

unread,
Oct 23, 2012, 9:49:32 PM10/23/12
to Zoran Zaric, bup-...@googlegroups.com
On 22 October 2012 00:51, Zoran Zaric <z...@zoranzaric.de> wrote:
> +def lock():
> + """Lock the bup repository."""
> + if not is_locked():
> + open(repo('buplock'), 'w').close()

I'm a bit worried about this bit for two reasons. There's the somewhat
theoretical objection that there's no atomicity between checking
is_locked() and doing the open(), which I guess I'm not super-worried
about[*]. But the other thing is that lock() doesn't fail if
is_locked() == True, which I find kind of unintuitive.

As an alternative, I was wondering, would it be overkill to use some
locking library instead? (I had a quick browse, but I wouldn't know
which one to choose, actually...)

Thanks, YC


[*]: If you want, instead of using a file you could use a directory,
and then do something like "try: os.mkdir(repo('buplock'))" and that
would be atomic on most systems - but I haven't any serious experience
using locks, so probably someone will pop up now to say that's not
bulletproof either...

Oei, Yung-Chin

unread,
Oct 23, 2012, 9:49:58 PM10/23/12
to Zoran Zaric, bup-...@googlegroups.com
On 22 October 2012 00:51, Zoran Zaric <z...@zoranzaric.de> wrote:
> +class BitArray():
> + """Provide a memory efficient set() like interface for indices."""
> + def __init__(self, size):
> + self._data = 0
> + self._size = size

Just out of curiosity: do you have future plans with self._size here?
Other than that, Reviewed-by: Yung-Chin Oei <yung...@yungchin.nl>

Oei, Yung-Chin

unread,
Oct 23, 2012, 9:52:26 PM10/23/12
to Zoran Zaric, bup-...@googlegroups.com
On 22 October 2012 00:51, Zoran Zaric <z...@zoranzaric.de> wrote:
>
> Signed-off-by: Zoran Zaric <z...@zoranzaric.de>
> ---
> cmd/repack-cmd.py | 3 +++
> cmd/save-cmd.py | 4 ++++
> t/test.sh | 12 ++++++++++++
> 3 files changed, 19 insertions(+)

Sorry, I missed out on the earlier discussions about locking on the
list, and I only just read up on them, so this comes a bit late I'm
afraid.

The thing is I'm not sure if I agree that a save should not lock the
repo: if a repack must lock the repo, and a save has to respect that
lock, then doesn't that mean it should also work the other way around
- otherwise you could start a repack during a save? On the other hand,
I understand the argument (from the earlier discussion) that you might
not want locked saves, because that would prevent concurrent saves.

Would it make sense to have something like a shared lock for saves,
and an exclusive lock for repacks?

Thanks, YC

Oei, Yung-Chin

unread,
Oct 23, 2012, 9:53:51 PM10/23/12
to Zoran Zaric, bup-...@googlegroups.com
On 22 October 2012 00:51, Zoran Zaric <z...@zoranzaric.de> wrote:
>
> Signed-off-by: Zoran Zaric <z...@zoranzaric.de>
> ---
> Documentation/bup-repack.md | 67 ++++++++++++++++++++++
> README.md | 5 ++
> cmd/repack-cmd.py | 132 ++++++++++++++++++++++++++++++++++++++++++++
> lib/bup/git.py | 69 +++++++++++++++++++++++
> t/test.sh | 48 +++++++++++++++-


Ok, so I've kind of a newbie how-to-review-patches question about this one...

I've pored over the repack code several times now, and I'm mostly
happy to say it's working well, but... this patch on its own would
still have some issues, which are fixed by the later patches in this
series. I think at least patches 11/16 and 16/16 fix potential
data-losing bugs.

So the question is - does signing this patch imply stating that it's
good on its own, too? Because in that case maybe I'd rather you squash
a few of the other patches together with this one.

Thanks! YC

Oei, Yung-Chin

unread,
Oct 23, 2012, 9:54:33 PM10/23/12
to Zoran Zaric, bup-...@googlegroups.com
On 22 October 2012 00:51, Zoran Zaric <z...@zoranzaric.de> wrote:
> Reported-by: Yung-Chin Oei <yung...@yungchin.nl>
> Signed-off-by: Zoran Zaric <z...@zoranzaric.de>

Can this one get a mention in the commit message (and maybe also in
the code), that without this repack might lose the data in the pack
concerned? Otherwise it might seem an optional enhancement, and get
removed later on...

Zoran Zaric

unread,
Oct 24, 2012, 5:00:24 AM10/24/12
to Oei, Yung-Chin, bup-...@googlegroups.com
On Wed, Oct 24, 2012 at 02:52:26AM +0100, Oei, Yung-Chin wrote:
> On 22 October 2012 00:51, Zoran Zaric <z...@zoranzaric.de> wrote:
> >
> > Signed-off-by: Zoran Zaric <z...@zoranzaric.de>
> > ---
> > cmd/repack-cmd.py | 3 +++
> > cmd/save-cmd.py | 4 ++++
> > t/test.sh | 12 ++++++++++++
> > 3 files changed, 19 insertions(+)
>
> Sorry, I missed out on the earlier discussions about locking on the
> list, and I only just read up on them, so this comes a bit late I'm
> afraid.

No, the patches are here for review, so feedback is very welcome.


> The thing is I'm not sure if I agree that a save should not lock the
> repo: if a repack must lock the repo, and a save has to respect that
> lock, then doesn't that mean it should also work the other way around
> - otherwise you could start a repack during a save?

That's a good point.


> On the other hand, I understand the argument (from the earlier discussion)
> that you might not want locked saves, because that would prevent concurrent
> saves.
>
> Would it make sense to have something like a shared lock for saves,
> and an exclusive lock for repacks?

That sounds good.

Rob suggested to use the python-lockfile library instead of implementing our
own. Has anybody used it before and knows how to implement a shared lock?


> Thanks, YC

Thanks for your feedback,
Zoran

Zoran Zaric

unread,
Oct 24, 2012, 5:01:52 AM10/24/12
to Oei, Yung-Chin, bup-...@googlegroups.com
On Wed, Oct 24, 2012 at 02:54:33AM +0100, Oei, Yung-Chin wrote:
> On 22 October 2012 00:51, Zoran Zaric <z...@zoranzaric.de> wrote:
> > Reported-by: Yung-Chin Oei <yung...@yungchin.nl>
> > Signed-off-by: Zoran Zaric <z...@zoranzaric.de>
>
> Can this one get a mention in the commit message (and maybe also in
> the code), that without this repack might lose the data in the pack
> concerned? Otherwise it might seem an optional enhancement, and get
> removed later on...

I'll add that.


> Other than that, Reviewed-by: Yung-Chin Oei <yung...@yungchin.nl>

Thanks

Zoran Zaric

unread,
Oct 24, 2012, 5:06:31 AM10/24/12
to Oei, Yung-Chin, bup-...@googlegroups.com
On Wed, Oct 24, 2012 at 02:49:58AM +0100, Oei, Yung-Chin wrote:
> On 22 October 2012 00:51, Zoran Zaric <z...@zoranzaric.de> wrote:
> > +class BitArray():
> > + """Provide a memory efficient set() like interface for indices."""
> > + def __init__(self, size):
> > + self._data = 0
> > + self._size = size
>
> Just out of curiosity: do you have future plans with self._size here?

I use it in BitArray.ratio() that is added in
"Add a method to calculate the set bits ratio in a BitArray"


> Other than that, Reviewed-by: Yung-Chin Oei <yung...@yungchin.nl>

Thanks

Simon Sapin

unread,
Oct 24, 2012, 6:49:20 AM10/24/12
to Zoran Zaric, bup-...@googlegroups.com
Le 22/10/2012 01:51, Zoran Zaric a �crit :
> + def add(self, i):
> + if i != None:
> + self._data = self._data | (1 << i)
> +
> + def remove(self, i):
> + if i != None and i in self:
> + self._data = self._data ^ (1 << i)

Hi,

Every mutation here copies the whole set and is thus O(n), though I have
no idea is this is good enough or how big n is.

If this ever becomes a bottleneck, it is possible with a very simple C
module to have a bit array with both compact storage and O(1) mutation.

The bitarray module does so, although it also does much more:

http://pypi.python.org/pypi/bitarray/

Cheers,
--
Simon Sapin

Zoran Zaric

unread,
Oct 24, 2012, 7:21:20 AM10/24/12
to Simon Sapin, bup-...@googlegroups.com
On Wed, Oct 24, 2012 at 12:49:20PM +0200, Simon Sapin wrote:
> Le 22/10/2012 01:51, Zoran Zaric a �crit :
> >+ def add(self, i):
> >+ if i != None:
> >+ self._data = self._data | (1 << i)
> >+
> >+ def remove(self, i):
> >+ if i != None and i in self:
> >+ self._data = self._data ^ (1 << i)
>
> Hi,
>
> Every mutation here copies the whole set and is thus O(n), though I
> have no idea is this is good enough or how big n is.

A packfile has a max of 200*1000 objects (lib.bup.git.max_pack_objects).


> If this ever becomes a bottleneck, it is possible with a very simple
> C module to have a bit array with both compact storage and O(1)
> mutation.
>
> The bitarray module does so, although it also does much more:
>
> http://pypi.python.org/pypi/bitarray/

I'd rather keep the python implementation. If it becomes a bottleneck we can
add the dependency later.

Any thoughts?


> Cheers,
> --
> Simon Sapin

Thanks,
Zoran

Zoran Zaric

unread,
Oct 24, 2012, 7:52:53 AM10/24/12
to Oei, Yung-Chin, bup-...@googlegroups.com
On Wed, Oct 24, 2012 at 11:00:24AM +0200, Zoran Zaric wrote:
> On Wed, Oct 24, 2012 at 02:52:26AM +0100, Oei, Yung-Chin wrote:
> > On 22 October 2012 00:51, Zoran Zaric <z...@zoranzaric.de> wrote:
> > >
> > > Signed-off-by: Zoran Zaric <z...@zoranzaric.de>
> > > ---
> > > cmd/repack-cmd.py | 3 +++
> > > cmd/save-cmd.py | 4 ++++
> > > t/test.sh | 12 ++++++++++++
> > > 3 files changed, 19 insertions(+)

> Rob suggested to use the python-lockfile library instead of implementing our
> own. Has anybody used it before and knows how to implement a shared lock?

ok it doesnt [1]:

> Note: The current implementation doesn’t provide for shared vs. exclusive
> locks. It should be possible for multiple reader processes to hold the
> lock at the same time

Any other lockfile implementation suggestions?

[1] http://packages.python.org/lockfile/lockfile.html

Simon Sapin

unread,
Oct 24, 2012, 7:58:35 AM10/24/12
to Zoran Zaric, bup-...@googlegroups.com
Le 24/10/2012 13:21, Zoran Zaric a �crit :
> I'd rather keep the python implementation. If it becomes a bottleneck we can
> add the dependency later.
>
> Any thoughts?

Of course, this was a suggestion *if it becomes a bottleneck*.

With a very unscientific benchmark, it already looks "pretty fast" on my
laptop:

python2.7 -m timeit -s 'i = 100000; a = 1<<(2*i)' 'b = a | 1<<i'
100000 loops, best of 3: 8.14 usec per loop

--
Simon Sapin

Zoran Zaric

unread,
Oct 24, 2012, 8:04:57 AM10/24/12
to Simon Sapin, bup-...@googlegroups.com
I'd keep the implementation in the patch for now.

Oei, Yung-Chin

unread,
Oct 24, 2012, 8:31:37 AM10/24/12
to Zoran Zaric, bup-...@googlegroups.com
On 24 October 2012 12:52, Zoran Zaric <z...@zoranzaric.de> wrote:
> Any other lockfile implementation suggestions?

There's fcntl.lockf:
http://docs.python.org/library/fcntl.html#fcntl.lockf ...but I believe
that's posix-only - is that a problem?

Zoran Zaric

unread,
Oct 24, 2012, 8:39:44 AM10/24/12
to Oei, Yung-Chin, bup-...@googlegroups.com
On Wed, Oct 24, 2012 at 11:00:24AM +0200, Zoran Zaric wrote:
> On Wed, Oct 24, 2012 at 02:52:26AM +0100, Oei, Yung-Chin wrote:
> > On 22 October 2012 00:51, Zoran Zaric <z...@zoranzaric.de> wrote:
> > >
> > > Signed-off-by: Zoran Zaric <z...@zoranzaric.de>
> > > ---
> > > cmd/repack-cmd.py | 3 +++
> > > cmd/save-cmd.py | 4 ++++
> > > t/test.sh | 12 ++++++++++++
> > > 3 files changed, 19 insertions(+)
> >
> > Sorry, I missed out on the earlier discussions about locking on the
> > list, and I only just read up on them, so this comes a bit late I'm
> > afraid.
>
> No, the patches are here for review, so feedback is very welcome.
>
>
> > The thing is I'm not sure if I agree that a save should not lock the
> > repo: if a repack must lock the repo, and a save has to respect that
> > lock, then doesn't that mean it should also work the other way around
> > - otherwise you could start a repack during a save?
>
> That's a good point.
>
>
> > On the other hand, I understand the argument (from the earlier discussion)
> > that you might not want locked saves, because that would prevent concurrent
> > saves.
> >
> > Would it make sense to have something like a shared lock for saves,
> > and an exclusive lock for repacks?
>
> That sounds good.

When save locks (and I think it should) it should lock on a branch-level. Two
concurrent saves to the same branch could result in one of the commits being
lost.

So save should lock a branch, repack all.

Oei, Yung-Chin

unread,
Oct 24, 2012, 9:02:52 AM10/24/12
to Zoran Zaric, bup-...@googlegroups.com
On 24 October 2012 13:39, Zoran Zaric <z...@zoranzaric.de> wrote:
> When save locks (and I think it should) it should lock on a branch-level. Two
> concurrent saves to the same branch could result in one of the commits being
> lost.
>
> So save should lock a branch, repack all.

Ah right, that makes a lot of sense. With the current code, I believe
(without having tested) that indeed that problem would not really be
caught: git update-ref would refuse to commit the second of the two
saves, because the ref would not match the provided old-ref, but we
never check its return value. [*]

Also, it looks like python-lockfile could support this scheme.


[*]: As a short-term fix, should we just change the call to
git-update-ref by not providing it the old-ref? If we do that, both
commits would succeed, and bup doesn't care too much about the order
of parents anyway, right?

Rob Browning

unread,
Oct 27, 2012, 12:33:20 PM10/27/12
to Zoran Zaric, Oei, Yung-Chin, bup-...@googlegroups.com
Zoran Zaric <z...@zoranzaric.de> writes:

> Rob suggested to use the python-lockfile library instead of implementing our
> own. Has anybody used it before and knows how to implement a shared lock?

...and note that I know nothing about that particular library. The main
library I know anything about is Debian's liblockfile (not python), but
I imagine we may have a python implementation too (haven't looked).

For the algorithm used there, see lockfile_create(3):

http://linux.die.net/man/3/lockfile_create

That particular algorithm (not implementation) is mandated by policy on
Debian systems, but only for apps that might interact, and that wouldn't
be apply to bup right now.

--
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4

Rob Browning

unread,
Oct 27, 2012, 12:41:57 PM10/27/12
to Zoran Zaric, Oei, Yung-Chin, bup-...@googlegroups.com
Zoran Zaric <z...@zoranzaric.de> writes:

> Any other lockfile implementation suggestions?

One other thing to keep in mind -- we should make sure that whatever
algorithm or library we choose handles all the cases we care about. For
example, would bup locking be expected to work correctly across NFS --
across X, Y, and Z platforms, etc.?

As a case in in point, lockfile_create(3)'s algorithm is designed to be
NFS safe, but I don't know to what extent it may or may not be portable.
Reply all
Reply to author
Forward
0 new messages