more git updates..

Linus Torvalds

Apr 9, 2005, 3:50:10 PM

Sorry guys,
several of you have sent me small fixes and scripts to "git", but I've
been busy on breaking/changing the core infrastructure, so I didn't get
around to looking at the scripts yet.

The good news is, the data structures/indexes haven't changed, but many of
the tools to interface with them have new (and improved!) semantics:

In particular, I changed how "read-tree" works, so that it now mirrors
"write-tree", in that instead of actually changing the working directory,
it only updates the index file (aka the "current directory cache" file)
from the tree.

To actually change the working directory, you'd first get the index file
setup, and then you do a "checkout-cache -a" to update the files in your
working directory with the files from the sha1 database.

Also, I wrote the "diff-tree" thing I talked about:

torvalds@ppc970:~/git> ./diff-tree 8fd07d4b7778cd0233ea0a17acd3fe9d710af035 8c6d29d6a496d12f1c224db945c0c56fd60ce941 | tr '\0' '\n'
<100664 4870bcf91f8666fc788b07578fb7473eda795587 Makefile
>100664 5493a649bb33b9264e8ed26cc1f832989a307d3b Makefile
<100664 9e1bee21e17c134a2fb008db62679048fc819528 cache.h
>100664 56ef561e590fd99e938bd47fd1f2c7ed46126ff0 cache.h
<100664 fd690acc02ef9c06d7c4c3541f69b10ca4b4f8c9 cat-file.c
>100664 6e6d89291ced17a406e64b97fe8bb96a22eefc9d cat-file.c
+100664 fd00e5603dcc4a93acceda0b8cb914fabc8645d5 checkout-cache.c
<100664 a4a8c3d9ef0c4cc6c82b96b5d1a91ac6d3bed466 commit-tree.c
>100664 236ceb7646e3f5d110fd83f815b82e94cc5b2927 commit-tree.c
+100664 01c92f2620a8e13e7cb7fd98ee644c6b65eeccb7 fsck-cache.c
<100664 0eaa053919e0cc400ab9bc40d9272360117e6978 init-db.c
>100664 815743e92dad7e451c65bab01448ee8ae9deeb56 init-db.c
<100664 e7bfaadd5d2331123663a8f14a26604a3cdcb678 read-cache.c
>100664 71d0cb6fe9b7ff79e3b2c5a61e288ac9f62b39dc read-cache.c
<100664 ec0f167a6a505659e5af6911c97f465506534c34 read-tree.c
>100664 f5c50ba79d02f002b9675fd8f129fa388e3282c6 read-tree.c
<100664 00a29c403e751c2a2a61eb24fa2249c8956d1c80 show-diff.c
>100664 b963dd738989bc92bf02352bbedad13a74e66a7d show-diff.c
<100664 aff074c63ac827801a7d02ff92781365957f1430 update-cache.c
>100664 3a672397164d5ff27a19a6888b578af96824ede7 update-cache.c
<100664 7abeeba116b2b251c12ae32c7b38cb048199b574 write-tree.c
>100664 9525c6fc975888a394477339db86216cd5bd5d7c write-tree.c

(ie the output of "diff-tree" has the same NUL-termination, but if you
insist on getting ASCII output, you can just use "tr" to change the NUL
into a NL).

The format of the "diff-tree" output is that the first character is "-"
for "remove file", "+" for "add file" and "<"/">" for "change file" (where
the "<" shows the old state, and ">" shows the new state).

Btw, the NUL-termination makes this really easy to use even in shell
scripts, ie you can do

diff-tree <sha1> <sha1> | xargs -0 do_something

and you'll get each line as one nice argument to your "do_something"
script. So a do_diff could be based on something like

#!/bin/sh
# Each argument is one diff-tree record: "<status><mode> <sha1> <filename>"
while [ "$1" != "" ]; do
	filename="$(echo "$1" | cut -d' ' -f3-)"
	first_sha="$(echo "$1" | cut -d' ' -f2)"
	second_sha="$(echo "$2" | cut -d' ' -f2)"
	c="$(echo "$1" | cut -c1)"
	case "$c" in
	"+")
		echo diff -u /dev/null "$filename($first_sha)";;
	"-")
		echo diff -u "$filename($first_sha)" /dev/null;;
	"<")
		# a "<" record is followed by its ">" partner in $2
		echo diff -u "$filename($first_sha)" "$filename($second_sha)"
		shift;;
	*)
		echo WHAT?
		exit 1;;
	esac
	shift
done

which really shows what a horrid shell-person I am (I still use the old
tools I learnt to use fifteen years ago. I bet you can do it trivially in
perl or something sane, and I'm just stuck in the stone age of UNIX).

That makes it _very_ easy to parse. The example above is the diff between
the initial commit and one of the more recent trees, so it has changes to
everything, but a more normal thing would be

torvalds@ppc970:~/git> diff-tree 787763499dc4f8cc345bc6ed8ee1e0ae31adedd6 5b0c2695634b5bab2f5d63c7bb30f7e5815af470 | tr '\0' '\n'
<100664 01c92f2620a8e13e7cb7fd98ee644c6b65eeccb7 fsck-cache.c
>100664 81aa7bee003264ea302db835158e725eefa4012d fsck-cache.c

which tells you that the last commit changed just one file (it's from this
one:

torvalds@ppc970:~/git> cat-file commit `cat .dircache/HEAD`
tree 5b0c2695634b5bab2f5d63c7bb30f7e5815af470
parent 81c53a1d3551f358860731481bb2d87179d221e6
author Linus Torvalds <torv...@ppc970.osdl.org> Sat Apr 9 12:02:30 2005
committer Linus Torvalds <torv...@ppc970.osdl.org> Sat Apr 9 12:02:30 2005

Make "fsck-cache" print out all the root commits it finds.

Once I do the reference tracking, I'll also make it print out all the
HEAD commits it finds, which is even more interesting.

in case you care).

I've rsync'ed the new git repository to kernel.org, it should all be there
in /pub/linux/kernel/people/torvalds/git.git/ (and it looks like the
mirror scripts already picked it up on the public side too).

Can you guys re-send the scripts you wrote? They probably need some
updating for the new semantics. Sorry about that ;(

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Linus Torvalds

Apr 9, 2005, 4:00:14 PM

On Sat, 9 Apr 2005, Linus Torvalds wrote:
>
> To actually change the working directory, you'd first get the index file
> setup, and then you do a "checkout-cache -a" to update the files in your
> working directory with the files from the sha1 database.

Btw, this will not overwrite any old files, so if you have an old version
of something, you'd need to do "checkout-cache -f -a" (and order matters:
the "-f" must come first). This time I actually have a big comment at the
top of the checkout-cache.c file trying to explain the logic.

Petr Baudis

Apr 9, 2005, 4:10:08 PM

Hello,

Dear diary, on Sat, Apr 09, 2005 at 09:45:52PM CEST, I got a letter
where Linus Torvalds <torv...@osdl.org> told me that...

> The good news is, the data structures/indexes haven't changed, but many of
> the tools to interface with them have new (and improved!) semantics:
>
> In particular, I changed how "read-tree" works, so that it now mirrors
> "write-tree", in that instead of actually changing the working directory,
> it only updates the index file (aka the "current directory cache" file)
> from the tree.
>
> To actually change the working directory, you'd first get the index file
> setup, and then you do a "checkout-cache -a" to update the files in your
> working directory with the files from the sha1 database.

That's great. I was planning to do something about this, since the old
behaviour really annoyed me. I think I will like this, even though I
haven't looked at the code itself yet (I'm just on my way to).

> Also, I wrote the "diff-tree" thing I talked about:

..snip..

Hmm, I wonder, is this better done in C instead of a simple shell
script, like my gitdiff.sh? I'd say it is more flexible and probably
hardly performance-critical to have this scripted, and not difficult at
all provided you have ls-tree. But maybe I'm just too fond of my
script... ;-) (Ok, there's some trouble when you want to have newlines
and spaces in file names, and join appears to be awfully ignorant about
this... :[ )

BTW, do we care about changed modes? If so, they should probably have
their place in the diff-tree output.

BTW#2, I hope you will merge my ls-tree anyway, even though there is no
user for it currently... I should quickly figure out some. :-)

> Can you guys re-send the scripts you wrote? They probably need some
> updating for the new semantics. Sorry about that ;(

I'll try to merge ASAP.

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

Linus Torvalds

Apr 9, 2005, 5:10:11 PM

On Sat, 9 Apr 2005, Linus Torvalds wrote:
>
> I suspect that I have to change the file format. Maybe make the "tree"
> object a two-level thing, and have a "directory" object.
>
> Then a "tree" object would point to a "directory" object, which would in
> turn point to the individual files (and other "directory" objects, of
> course). That way a commit that only changes a few files will only need to
> create a few new "directory" objects, instead of creating one huge "tree"
> object.

Actually, I guess I wouldn't have to change the format. I could just
extend the existing "tree" object to be able to point to other trees, and
that's it.

The downside of that is that then a tree wouldn't have a canonical format
any more: you could have two trees that have the exact same content, but
they'd have different names. They should obviously merge very easily (and
thus you could create a new merge that _does_ have a common name), but
it's ugly.

I'll have to think about it. It's good to notice these issues early, this
was the first time I had actually tried to check in a kernel-sized tree
for real.

Linus

Linus Torvalds

Apr 9, 2005, 5:10:08 PM

On Sat, 9 Apr 2005, Petr Baudis wrote:
>
> > Also, I wrote the "diff-tree" thing I talked about:
> ..snip..
>
> Hmm, I wonder, is this better done in C instead of a simple shell
> script, like my gitdiff.sh?

With 17,000 files in the kernel, and most commits just changing a small
number of them, I actually think "diff-tree" matters. You use "join"
(which is quite reasonable), but let's put it this way: just the list of
files in the current kernel is about half a megabyte of data. Ie your
temporary files that you use in the "ls-tree + ls-tree + join" is actually
going to be quite sizeable.

My goal here is that the speed of "git" really should be almost totally
independent of the size of the project. You clearly cannot avoid _some_
size-dependency: my "diff-tree" clearly also has to work through the same
1MB of data, but I think it's worth making the constant factor be as small
as humanly possible.

I just tried checking in a kernel tree tar-file, and the initial checkin
(which is all the compression and the sha1 calculations for every single
file) took about 1:35 (minutes, not hours ;).

Doing a commit (trivial change to the top-level Makefile) and then doing a
"diff-tree" between those two things took 0.05 seconds using my C thing. Ie
we're talking so fast that we really don't care.

Doing a "show-diff" takes 0.15 secs or so (that's all the "stat" calls),
and now that I test it out I realize that the most expensive operation is
actually _writing_ the "index" file out. These are the two most expensive
steps:

torvalds@ppc970:~/lx-test/linux-2.6.12-rc2> time update-cache Makefile

real 0m0.283s
user 0m0.171s
sys 0m0.113s

torvalds@ppc970:~/lx-test/linux-2.6.12-rc2> time write-tree
5ca21c9d808fa4bee1eb6948a59dfb9c7d73f36a

real 0m0.441s
user 0m0.354s
sys 0m0.087s

ie with the current infrastructure it looks like I can do a "patch +
commit" in less than one second on the kernel, and 0.75 secs of that is
because the "tree" file actually grows pretty large:

cat-file tree 5ca21c9d808fa4bee1eb6948a59dfb9c7d73f36a | wc -c

says that the uncompressed tree-file is 950,874 bytes. Compressing it
means that the archival version of it is "just" 462,546 bytes, but this is
really the part that is going to eat _tons_ of disk-space.

In other words, each "commit" file is very small and cheap, but since
almost every commit will also imply a totally new tree-file, "git" is
going to have an overhead of half a megabyte per commit. Oops.
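
To put a number on that overhead, here is a rough back-of-the-envelope sketch. The function name is made up for illustration, and the commit count of 1000 is a hypothetical figure, not a measurement; only the 462,546-byte tree size comes from the message above.

```shell
#!/bin/sh
# Estimate disk space used by tree objects alone when every commit
# writes one complete tree file.
# $1 = number of commits (hypothetical), $2 = bytes per compressed tree file
tree_overhead_bytes() {
    echo $(( $1 * $2 ))
}

# 462546 is the compressed tree size quoted above; 1000 commits is an
# illustrative figure, not a measurement.
tree_overhead_bytes 1000 462546
```

Under those assumptions the tree files alone come to roughly 460 MB, before counting any file contents.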

Damn, that's painful. I suspect I will have to change the format somehow.

One option (which I haven't tested yet) is that since the tree-file is
already sorted, I could always write it out with the common subdirectory
part "collapsed", ie instead of writing

...
include/asm-i386/mach-default/bios_ebda.h
include/asm-i386/mach-default/do_timer.h
...

I'd write just

...
///bios_ebda.h
///do_timer.h
...

since the directory names are implied by the predecessor.
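
A minimal sketch of that "collapsed" encoding, assuming a sorted list of paths on stdin, one per line, and no newlines in file names. The function name collapse_paths is hypothetical, and the rule used here (one "/" per leading directory component shared with the previous path) is one reading of the example above, not a format git ever used.

```shell
#!/bin/sh
# Rewrite each path so that leading directory components shared with the
# previous path are replaced by a bare "/" each.
collapse_paths() {
    awk -F/ '
    {
        dirs = NF - 1                 # directory components in this path
        shared = 0
        # count leading directory components shared with the previous path;
        # appending "" forces string comparison
        while (shared < dirs && shared < prev_dirs &&
               ($(shared + 1) "") == (prev[shared + 1] ""))
            shared++
        out = ""
        for (i = 1; i <= shared; i++)
            out = out "/"
        for (i = shared + 1; i <= NF; i++) {
            out = out $i
            if (i < NF)
                out = out "/"
        }
        print out
        for (i = 1; i <= dirs; i++)
            prev[i] = $i
        prev_dirs = dirs
    }'
}
```

Feeding it the two header paths from the example yields the first path unchanged and "///do_timer.h" for the second.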

However, that doesn't help with the 20-byte sha1 associated with each
file, which is also obviously uncompressible, so with 17,000+ files, we
have a minimum overhead of about 350kB per tree-file.

So even if I did the pathname compression, it wouldn't help all that much.
I'd only be removing the only part of the file that _is_ very
compressible, and I'd probably end up with something that isn't all that
far away from the 450kB+ it is now.

I suspect that I have to change the file format. Maybe make the "tree"
object a two-level thing, and have a "directory" object.

Then a "tree" object would point to a "directory" object, which would in
turn point to the individual files (and other "directory" objects, of
course). That way a commit that only changes a few files will only need to
create a few new "directory" objects, instead of creating one huge "tree"
object.

Sadly, that will make "diff-tree" potentially more expensive. On the other
hand, maybe not: it will also speed it _up_, since directories that are
totally shared will be trivially seen as such and need no further
operation.

Thoughts? That would break the current repository formats, and I'd have to
create a converter thing (which shouldn't be that bad, of course).

I don't have to do it right now. In fact, I'd almost prefer for the
current thing to become good enough that it's not painful to work with,
since right now I'm using it to develop itself. Then I can convert the
format with an automated script later, before I actually start working on
the kernel...

> BTW, do we care about changed modes? If so, they should probably have
> their place in the diff-tree output.

They're there. If you want to ignore them, you can just notice that the
sha1 matches between two lines, and then you don't even have to diff them.
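
A small sketch of that check, assuming the two records of a "<"/">" pair are passed as separate arguments in the "<status><mode> <sha1> <filename>" layout shown earlier. The function name and the three labels it prints are illustrative only.

```shell
#!/bin/sh
# Classify a "<"/">" record pair from diff-tree.
# $1 = old record, $2 = new record, each "<status><mode> <sha1> <filename>"
classify_change() {
    old=${1#?}                 # drop the leading "<"
    new=${2#?}                 # drop the leading ">"
    old_mode=${old%% *}
    new_mode=${new%% *}
    old=${old#* }              # drop the mode field
    new=${new#* }
    old_sha=${old%% *}
    new_sha=${new%% *}
    if [ "$old_sha" = "$new_sha" ]; then
        echo "mode-only"       # same blob, so no content diff is needed
    elif [ "$old_mode" = "$new_mode" ]; then
        echo "content-only"
    else
        echo "mode-and-content"
    fi
}
```

A caller that does not care about modes can skip the content diff whenever this prints "mode-only".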

Linus

Paul Jackson

Apr 9, 2005, 6:10:09 PM

Linus wrote:
> the NUL-termination makes this really easy to use even in shell

grumble ...

> I still use the old tools I learnt to use fifteen years ago

newcomer ;)

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

Ralph Corderoy

Apr 9, 2005, 7:30:14 PM

Hi Linus,

> Btw, the NUL-termination makes this really easy to use even in shell
> scripts, ie you can do
>
> diff-tree <sha1> <sha1> | xargs -0 do_something
>
> and you'll get each line as one nice argument to your "do_something"
> script. So a do_diff could be based on something like
>
> #!/bin/sh

Watch out for when xargs invokes do_something more than once and the `<'
is parsed by a different one than the `>'. A `while read ...; do ...
done' would avoid that, but wouldn't like the NULs instead of LFs.
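
One way to sketch that `while read' approach is to turn the NULs into LFs first, which is only safe under the assumption that no file name contains a newline. The function name pair_changes and the wording of its output are made up for this example.

```shell
#!/bin/sh
# Read NUL-terminated diff-tree records on stdin and print one line per
# change. A single loop sees every record, so a "<" record can never be
# separated from its ">" partner the way separate xargs invocations could.
pair_changes() {
    tr '\0' '\n' | while IFS= read -r record; do
        status=${record%"${record#?}"}   # first character of the record
        fields=${record#?}               # "<mode> <sha1> <filename>"
        fields=${fields#* }              # drop the mode
        sha=${fields%% *}
        name=${fields#* }
        case "$status" in
        "+") printf 'add %s %s\n' "$sha" "$name" ;;
        "-") printf 'remove %s %s\n' "$sha" "$name" ;;
        "<") old_sha=$sha ;;             # remember until the ">" arrives
        ">") printf 'change %s %s %s\n' "$old_sha" "$sha" "$name" ;;
        *)   echo "unexpected record: $record" >&2; exit 1 ;;
        esac
    done
}
```

It would be used as `diff-tree <sha1> <sha1> | pair_changes'. File names with embedded newlines would still need a NUL-aware reader.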

Cheers,

Ralph.

Linus Torvalds

Apr 9, 2005, 7:40:11 PM

On Sat, 9 Apr 2005, Linus Torvalds wrote:
>
> Actually, I guess I wouldn't have to change the format. I could just
> extend the existing "tree" object to be able to point to other trees, and
> that's it.

Done, and pushed out. The current git.git repository seems to do all of
this correctly.

NOTE! This means that each "tree" file basically tracks just a single
directory. The old style of "every file in one tree file" still works, but
fsck-cache will warn about it. Happily, the git archive itself doesn't
have any subdirectories, so git itself is not impacted by it.

Now, this means that I should add a "recursive" option to "diff-tree", but
I haven't done so yet. So right now if I change the top-level Makefile,
_and_ change kernel/exit.c, then the "tree diff" between the two commit
trees ends up looking like:

torvalds@ppc970:~/lx-test/linux-2.6.12-rc2> diff-tree 7bec1223736d7e02c755e9a365984b3cbfa1e6e9 d64817f809a60cd960d3078ae91b4d19cb649501 | tr '\0' '\n'
<100644 e1e7f7430c0297f22042cff58da5ca73ef121b95 Makefile
>100644 8ee21134577e98fb642dffc5b797a0121645c543 Makefile
<40000 2239383d00ae746f5e79ceccf8ac3fbca62f949d kernel
>40000 a8fad219cb78a6b6a05a10f8643d615fefc8160f kernel

ie it shows that the Makefile blob has changed, and the kernel directory
has changed. You then need to recurse into the kernel tree to see what the
changes were there:

torvalds@ppc970:~/lx-test/linux-2.6.12-rc2> diff-tree 2239383d00ae746f5e79ceccf8ac3fbca62f949d a8fad219cb78a6b6a05a10f8643d615fefc8160f | tr '\0' '\n'
<100644 1a50b58453679b6fee8de4f744f4befc39397bb1 exit.c
>100644 e8df1325bf25816827a1a64404ad533a97bfdae2 exit.c

but it clearly all seems to work. And it means that a subdirectory that
didn't change at all (the common case) will be able to re-use the old sha1
file when you create a tree (this may in fact make "diff-tree" much less
important, since now it tends to handle objects that are just a few kB in
size, rather than almost a megabyte).

So in this case, the "commit cost" of changing two files was two small
tree files (1468 and 679 bytes respectively for the kernel/ and top-level
directory) and the commit file itself (251 bytes). In addition to the
actual data files that were changed, of course.

Goodie. Big difference between that and the 460kB of the old monolithic
tree file.
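
The re-use of unchanged subdirectories can be illustrated with a toy sketch. This is not git's object format: tree_id and hash_stdin are hypothetical helpers, the listing layout is invented, dotfiles are skipped, and cksum stands in for SHA1 when sha1sum is not installed.

```shell
#!/bin/sh
# Stand-in content hash: prefers sha1sum, falls back to POSIX cksum.
hash_stdin() {
    if command -v sha1sum >/dev/null 2>&1; then
        sha1sum | cut -d' ' -f1
    else
        cksum | cut -d' ' -f1
    fi
}

# Print an id for directory $1 that depends only on the names and ids of
# its entries, so an untouched subdirectory keeps the same id.
tree_id() {
    (
        cd "$1" || exit 1
        for entry in *; do
            [ -e "$entry" ] || continue      # empty directory
            if [ -d "$entry" ]; then
                printf 'tree %s %s\n' "$(tree_id "./$entry")" "$entry"
            else
                printf 'blob %s %s\n' "$(hash_stdin < "$entry")" "$entry"
            fi
        done | hash_stdin
    )
}
```

Changing kernel/exit.c in such a layout changes the ids of kernel/ and of the top-level directory, while every other subdirectory keeps its old id and would need no new tree file.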

Paul Jackson

Apr 9, 2005, 8:50:07 PM

Ralph wrote:
> Watch out for when xargs invokes do_something more than once and the `<'
> is parsed by a different one than the `>'.

It will take a pretty long list to do that. It seems that
GNU xargs on top of a Linux kernel has a 128 KByte ARG_MAX.

In the old days, with 4 KByte ARG_MAX limits, this would have
bitten us pretty quickly.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

Bernd Eckenfels

Apr 9, 2005, 9:20:04 PM

In article <2005040917394...@engr.sgi.com> you wrote:
> Ralph wrote:
>> Watch out for when xargs invokes do_something more than once and the `<'
>> is parsed by a different one than the `>'.
> It will take a pretty long list to do that. It seems that
> GNU xargs on top of a Linux kernel has a 128 KByte ARG_MAX.
> In the old days, with 4 KByte ARG_MAX limits, this would have
> bitten us pretty quickly.

Nevertheless I think it is more parser friendly to have single records for
diffs.

Greetings
Bernd

Paul Jackson

Apr 9, 2005, 9:40:08 PM

Bernd wrote:
> more parser friendly to have single records for diffs.

good point

[looks like you trimmed the cc list - folks around here don't like that ;)]

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

Paul Jackson

Apr 9, 2005, 10:10:08 PM

Linus wrote:
> Damn, that's painful. I suspect I will have to change the format somehow.

The sha1 (ascii) digests for 16817 files take:

689497 bytes before compression
397475 bytes after minigzip

The pathnames, relative to top of tree, for these 16817
files take:

503983 bytes before compression
85786 bytes after minigzip compression

I doubt any fancifying up of the pathname storage will gain much.

However going from binary to ascii sha1 digest might help (compresses
better, I suspect - I'll have to write a few lines of code to see).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

Paul Jackson

Apr 9, 2005, 10:20:11 PM

> Then a "tree" object would point to a "directory" object,

Ah - light bulb flickers - in _separate_ files.

Yes, that obviously makes a difference.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

Paul Jackson

Apr 9, 2005, 10:30:11 PM

From before:

The sha1 (ascii) digests for 16817 files take:

689497 bytes before compression
397475 bytes after minigzip

New numbers:

The sha1 (binary) digests for 16817 files take:

336340 bytes before compression
334943 bytes after minigzip

So compressing binary digests isn't worth a darn, and compressing ascii
digests gets them down to within 18% of binary digests in size.
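
The uncompressed figures can be checked with plain shell arithmetic. The helper names below are made up; the 41 bytes per ASCII entry (40 hex characters plus one separator byte) is an inference from the fact that it reproduces the 689497 total, not something stated above.

```shell
#!/bin/sh
# Total bytes needed for one digest per file.
# $1 = number of files, $2 = bytes per digest entry
digest_bytes() {
    echo $(( $1 * $2 ))
}

# How much larger $1 is than $2, as a whole percentage (rounded down).
percent_larger() {
    echo $(( ($1 - $2) * 100 / $2 ))
}

digest_bytes 16817 20         # binary SHA1: 20 bytes each
digest_bytes 16817 41         # ASCII SHA1: assumed 40 hex chars + 1 separator
percent_larger 397475 334943  # compressed ASCII vs compressed binary
```

The first two calls reproduce the 336340 and 689497 byte counts, and the last one gives the "within 18%" figure when measured against the compressed binary size.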

Petr Baudis

Apr 9, 2005, 10:50:07 PM

Dear diary, on Sun, Apr 10, 2005 at 01:31:10AM CEST, I got a letter
where Linus Torvalds <torv...@osdl.org> told me that...

> On Sat, 9 Apr 2005, Linus Torvalds wrote:
> >
> > Actually, I guess I wouldn't have to change the format. I could just
> > extend the existing "tree" object to be able to point to other trees, and
> > that's it.
>
> Done, and pushed out. The current git.git repository seems to do all of
> this correctly.

..snip..

Ok, so now I can dare announce it, I hope. I hacked my branch of git
somewhat, kept in sync with Linus, and now I have something to show.
Please see it at

http://pasky.or.cz/~pasky/dev/git/

It is basically a set of (still rather crude) shell scripts upon Linus'
git, which make it sanely usable by mere humans for actual version
tracking. Its usage _is_ going to change, so don't get too used to it
(that'd be hard anyway, I suspect), but it should be working nicely.

I have described most of the interesting parts and some basic usage in
the README at that page. It wraps commits, supports log retrieval and
comfortable diffing between any two trees. And on top of that, it can do
some basic remote repositories - it will pull (rsync) from them and it
can make the local copy track them - on pull, it will be updated
accordingly (and your local commits on the tracked branch will get
orphaned).

I didn't attach a patch against Linus since I think it's pretty much
useless now. It's available as against-linus.patch on the web, and
you can apply it to the latest git tree (NOT 0.03). But it's probably
a better idea to wget my tree. You can then watch us making progress by

gitpull.sh linus
gitpull.sh pasky

and see where we differ by:

gitdiff.sh linus pasky

(This is how the against-linus.patch was generated. I'd easily generate
even 0.03 patch this way, but I forgot to merge the fsck at that time,
so it would suck.)

(Note that the tree you wget is set up to track my branch. If you want
to stop tracking it (basically necessary now if you want to do local
commits), do:

cp .dircache/HEAD .dircache/HEAD.local
gittrack.sh

The cp says something like "I want to pick up where the tracked
branch left off". Otherwise, untracking would return you to your "local"
branch, which is just some ancient predecessor of the pasky branch here
anyway.)

Note that I haven't really tested it on anything but git itself yet, so I'm
not sure how it will cope, especially with directories - I tried to make
it aware of them though. I will do some more practical testing tomorrow.

Otherwise, I will probably try to consolidate the usage and
documentation now, and beautify the scripts. I might start pondering
some merging too. Oh, and gitpatch.sh. :-)

Have fun and please share your opinions,

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

Junio C Hamano

Apr 10, 2005, 4:00:11 AM

Listing the file paths and their sigs included in a tree to make
a snapshot of a tree state sounds fine, and diffing two trees by
looking at the sigs between two such files sounds fine as well.

But I am wondering what your plans are to handle renames---or
does git already represent them?

Petr Baudis

Apr 10, 2005, 5:50:09 AM

Dear diary, on Sun, Apr 10, 2005 at 11:28:54AM CEST, I got a letter
where Junio C Hamano <jun...@cox.net> told me that...
> >>>>> "CL" == Christopher Li <lk...@chrisli.org> writes:
>
> CL> On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
> >>
> >> But I am wondering what your plans are to handle renames---or
> >> does git already represent them?
> >>
>
> CL> Rename should just work. It will create a new tree object and you
> CL> will notice that in the entry that changed, the hash for the blob
> CL> object is the same.
>
> Sorry, I was unclear. But doesn't that imply that a SCM built
> on top of git storage needs to read all the commit and tree
> records up to the common ancestor to show tree diffs between two
> forked tree?

No. See diff-tree output and
http://pasky.or.cz/~pasky/dev/git/gitdiff-do for how it's done.
Basically, you just take the two trees and compare them linearly (do a
normal diff on them, essentially). Then the differences you spot this way
are everything that needs to appear in the patch.

> I suspect that another problem is that noticing the move of the
> same SHA1 hash from one pathname to another and recognizing that
> as a rename would not always work in the real world, because
> sometimes people move files *and* make small changes at the same
> time. If git is meant to be an intermediate format to suck
> existing kernel history out of BK so that the history can be
> converted for the next SCM chosen for the kernel work, I would
> imagine that there needs to be a way to represent such a case.
> Maybe convert a file rename as two git trees (one tree for pure
> move which immediately followed by another tree for edit) if it
> is not a pure move?

Actually, this could be possible too I think. We will have to make
diff-tree two-pass, but it is already so blinding fast that I guess that
doesn't hurt too much. I might try to get my hands on that.

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

Junio C Hamano

Apr 10, 2005, 5:40:11 AM

>>>>> "CL" == Christopher Li <lk...@chrisli.org> writes:

CL> On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
>>
>> But I am wondering what your plans are to handle renames---or
>> does git already represent them?
>>

CL> Rename should just work. It will create a new tree object and you
CL> will notice that in the entry that changed, the hash for the blob
CL> object is the same.

Sorry, I was unclear. But doesn't that imply that a SCM built
on top of git storage needs to read all the commit and tree
records up to the common ancestor to show tree diffs between two
forked tree?

I suspect that another problem is that noticing the move of the
same SHA1 hash from one pathname to another and recognizing that
as a rename would not always work in the real world, because
sometimes people move files *and* make small changes at the same
time. If git is meant to be an intermediate format to suck
existing kernel history out of BK so that the history can be
converted for the next SCM chosen for the kernel work, I would
imagine that there needs to be a way to represent such a case.
Maybe convert a file rename as two git trees (one tree for pure
move which immediately followed by another tree for edit) if it
is not a pure move?

Christopher Li

Apr 10, 2005, 5:10:05 AM

On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
>
> But I am wondering what your plans are to handle renames---or
> does git already represent them?
>

Rename should just work. It will create a new tree object and you
will notice that in the entry that changed, the hash for the blob
object is the same.
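
A sketch of how a script might spot such a rename, assuming newline-separated diff-tree records on stdin (after the usual `tr') in which a removed path and an added path carry the same blob hash. The function name is hypothetical, and this only catches pure renames: a file that is moved and edited in the same change gets a new hash and goes unnoticed.

```shell
#!/bin/sh
# Print "old -> new" for every removed path whose blob hash reappears
# on an added path.
find_pure_renames() {
    awk '
    {
        status = substr($1, 1, 1)
        sha = $2 ""                        # force a string key
        name = $0
        sub(/^[^ ]+ [^ ]+ /, "", name)     # keep what follows mode and sha1
        if (status == "-")
            removed[sha] = name
        else if (status == "+")
            added[sha] = name
    }
    END {
        for (sha in removed)
            if (sha in added)
                print removed[sha] " -> " added[sha]
    }'
}
```

If several paths share one hash, only the last one seen on each side is reported, which is good enough for a sketch but not for a real tool.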

Chris

Petr Baudis

Apr 10, 2005, 5:50:08 AM

Dear diary, on Sun, Apr 10, 2005 at 07:53:40AM CEST, I got a letter
where Christopher Li <lk...@chrisli.org> told me that...

> On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
> >
> > But I am wondering what your plans are to handle renames---or
> > does git already represent them?
> >
>
> Rename should just work. It will create a new tree object and you
> will notice that in the entry that changed, the hash for the blob
> object is the same.

Which is of course wrong when you want to do proper merging, examine
per-file history, etc. One solution which springs to my mind is to have
a UUID accompany each blob and tree; that will take a relatively large
amount of space though, and I'm not sure it is really worth it.

How many renames were there in the 64k commits so far anyway?

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

Wichert Akkerman

Apr 10, 2005, 5:50:12 AM

Previously Christopher Li wrote:
> Rename should just work. It will create a new tree object and you
> will notice that in the entry that changed, the hash for the blob
> object is the same.

What if you rename and change a file within a changeset?

Wichert.

--
Wichert Akkerman <wic...@wiggy.net> It is simple to make things.
http://www.wiggy.net/ It is hard to make things simple.

Christopher Li

Apr 10, 2005, 6:10:07 AM

On Sat, Apr 09, 2005 at 04:31:10PM -0700, Linus Torvalds wrote:
>
> Done, and pushed out. The current git.git repository seems to do all of
> this correctly.
>
> NOTE! This means that each "tree" file basically tracks just a single
> directory. The old style of "every file in one tree file" still works, but
> fsck-cache will warn about it. Happily, the git archive itself doesn't
> have any subdirectories, so git itself is not impacted by it.

That is really cool stuff. The way I read it, correct me if I am wrong,
git is a user-space versioned file system: "tree" <--> directory and
"blob" <--> file, with "commit" describing the version history.

Git always writes out a full new version of a blob when there is any
update to it. At first I thought that wasted a lot of space, especially
when there is only a tiny change. But the more I think about it, the
more sense it makes. Kernel source files are usually small objects, and
files are stored compressed anyway. A very useful thing to gain from it
is that we can truncate the older history, e.g. we can have an option
not to sync the pre-2.4 change sets and only grab them if we need to.
Most of the time we are only interested in the recent change sets.

There is one problem though. What about SHA1 hash collisions? Even
though the chance is very remote, you don't want to lose data due to a
"software" error. I think it is OK not to handle that case right now.
On the other hand, it would be nice to detect it and give out a big
error message if it really happens.

Something like the following patch, perhaps with a way to turn it off.

Chris

Index: git-0.03/read-cache.c
===================================================================
--- git-0.03.orig/read-cache.c 2005-04-09 18:42:16.000000000 -0400
+++ git-0.03/read-cache.c 2005-04-10 02:48:36.000000000 -0400
@@ -210,8 +210,22 @@
 	int fd;
 
 	fd = open(filename, O_WRONLY | O_CREAT | O_EXCL, 0666);
-	if (fd < 0)
-		return (errno == EEXIST) ? 0 : -1;
+	if (fd < 0) {
+		void *map;
+		static int error(const char * string);
+
+		if (errno != EEXIST)
+			return -1;
+		fd = open(filename, O_RDONLY);
+		if (fd < 0)
+			return -1;
+		map = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
+		if (map == MAP_FAILED)
+			return -1;
+		if (memcmp(buf, map, size))
+			return error("Ouch, Strike by lighting!\n");
+		return 0;
+	}
 	write(fd, buf, size);
 	close(fd);
 	return 0;
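
The same idea can be sketched in shell, as a rough illustration rather than a replacement for the patch. store_object is a made-up name; `set -C' (noclobber) plays the role of O_EXCL, and `cmp -s' plays the role of the memcmp.

```shell
#!/bin/sh
# store_object SRC DST: write SRC to DST unless DST already exists.
# If DST exists, succeed only when its contents are identical to SRC.
store_object() {
    src=$1
    dst=$2
    [ -r "$src" ] || return 1
    # With noclobber set, ">" refuses to overwrite an existing file.
    if ( set -C; cat "$src" > "$dst" ) 2>/dev/null; then
        return 0
    fi
    # Already there (or unwritable): accept only a byte-for-byte match.
    if cmp -s "$src" "$dst"; then
        return 0
    fi
    echo "cannot store $dst: exists with different contents or is unwritable" >&2
    return 1
}
```

Storing the same content twice succeeds quietly, while different content under the same object name is reported instead of being silently dropped.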

Christopher Li

Apr 10, 2005, 6:20:10 AM

On Sun, Apr 10, 2005 at 11:41:53AM +0200, Petr Baudis wrote:
> Dear diary, on Sun, Apr 10, 2005 at 07:53:40AM CEST, I got a letter
> where Christopher Li <lk...@chrisli.org> told me that...
> > On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
> > >
> > > But I am wondering what your plans are to handle renames---or
> > > does git already represent them?
> > >
> >
> > Rename should just work. It will create a new tree object and you
> > will notice that in the entry that changed, the hash for the blob
> > object is the same.
>
> Which is of course wrong when you want to do proper merging, examine
> per-file history, etc. One solution which springs to my mind is to have
> a UUID accompany each blob and tree; that will take relatively lot of
> space though, and I'm not sure it is really worth it.

It should just use the two-step rename + change; then it is tractable
with git as it is now.

Chris

Christopher Li

Apr 10, 2005, 6:20:08 AM

On Sun, Apr 10, 2005 at 02:28:54AM -0700, Junio C Hamano wrote:
> >>>>> "CL" == Christopher Li <lk...@chrisli.org> writes:
>
> CL> On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
> >>
> >> But I am wondering what your plans are to handle renames---or
> >> does git already represent them?
> >>
>
> CL> Rename should just work. It will create a new tree object and you
> CL> will notice that in the entry that changed, the hash for the blob
> CL> object is the same.
>
> Sorry, I was unclear. But doesn't that imply that a SCM built
> on top of git storage needs to read all the commit and tree
> records up to the common ancestor to show tree diffs between two
> forked tree?
>
> I suspect that another problem is that noticing the move of the
> same SHA1 hash from one pathname to another and recognizing that
> as a rename would not always work in the real world, because
> sometimes people move files *and* make small changes at the same
> time. If git is meant to be an intermediate format to suck
> existing kernel history out of BK so that the history can be
> converted for the next SCM chosen for the kernel work, I would
> imagine that there needs to be a way to represent such a case.
> Maybe convert a file rename as two git trees (one tree for pure
> move which immediately followed by another tree for edit) if it
> is not a pure move?
>

Git is not an SCM yet. A rename + change set should be handled internally
as a pure rename plus the extra delta. The current git doesn't have
per-file change history. From git's point of view, one file was deleted
and another file appeared with the same content.

It is up to the top-level SCM to handle that correctly.
Renaming a directory will be even more fun.

Chris

Ralph Corderoy

Apr 10, 2005, 6:30:14 AM

Hi Paul,

> Ralph wrote:
> > Watch out for when xargs invokes do_something more than once and the
> > `<' is parsed by a different one than the `>'.
>
> It will take a pretty long list to do that. It seems that GNU xargs
> on top of a Linux kernel has a 128 KByte ARG_MAX.

I didn't realise it was that long, but one pair of files to diff takes
128 bytes of that.

$ wc -c <<\E
> <100664 aff074c63ac827801a7d02ff92781365957f1430 update-cache.c
> >100664 3a672397164d5ff27a19a6888b578af96824ede7 update-cache.c
> E
128

So that's space for 1024 pairs. (Doesn't envp take some up too?) That
doesn't seem enough for diffs between revisions, but good enough for
most uses that people will get caught out when it fails.

$ bzip2 -dc patch-2.6.10.bz2 | grep -c '^diff '
5384
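
A rough sketch of that arithmetic in Python, assuming the 128 KByte ARG_MAX quoted above and ignoring the per-argument bookkeeping that a real xargs also counts. The helper `pairs_per_invocation` and its `env_bytes` parameter are hypothetical names used only for illustration.

```python
ARG_MAX = 128 * 1024  # bytes; the figure quoted above for GNU xargs on Linux

# One changed file as diff-tree prints it: a '<' line and a '>' line.
pair = (
    "<100664 aff074c63ac827801a7d02ff92781365957f1430 update-cache.c\n"
    ">100664 3a672397164d5ff27a19a6888b578af96824ede7 update-cache.c\n"
)


def pairs_per_invocation(arg_max: int, pair_bytes: int, env_bytes: int = 0) -> int:
    """How many '<'/'>' pairs fit before xargs must start another command."""
    return (arg_max - env_bytes) // pair_bytes


pair_bytes = len(pair)
capacity = pairs_per_invocation(ARG_MAX, pair_bytes)
changed_files = 5384  # the grep count above for patch-2.6.10
invocations = -(-changed_files // capacity)  # ceiling division

print(pair_bytes)   # 128
print(capacity)     # 1024
print(invocations)  # 6
```

Any space taken by the environment only lowers the capacity, so a diff of that size would be split across several invocations either way.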

Cheers,

Ralph.

Rutger Nijlunsing

Apr 10, 2005, 7:30:13 AM

On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:

> Listing the file paths and their sigs included in a tree to make
> a snapshot of a tree state sounds fine, and diffing two trees by
> looking at the sigs between two such files sounds fine as well.
>
> But I am wondering what your plans are to handle renames---or
> does git already represent them?

git doesn't represent transitions (or deltas), but only state. So it's
not (much) more than a .tar file from a version-management perspective;
the only difference being that a git-tree has a comment field and a
predecessor-reference, which are currently not used in determining the
'patch' between two trees.

Deltas are derived by comparing different versions and determining
the difference by reverse-engineering the differences which got us
from version A to version B.

Deltas are currently described as patch(1)es. Patches don't have the
concept of 'renaming', so even after determining that file X has been
renamed to Y, we have no container for this fact. A patch(1) only
contains local-file-edits: substitute lines by other lines.

Deltas are not needed to follow a tree; deltas are useful for merging
branches of versions, and for reviewing purposes. This is comparable
to using tar for version management: it is very common to tar up the
current version of your project weekly as a poor man's version management
for a one-person project.

So what is needed is a way to represent deltas which can contain more
than only traditional patches. I would propose a simple format:
the shell-script in a fixed-format.

Shell-patch format in EBNF:
  <shellpatch>  ::= ( <comment>? <command>* )*
  <comment>     ::= <commentline>+
      The comment contains the text describing the function of the
      patch following it.
  <commentline> ::= "# " <text>
  <command>     ::=
      "mv " <pathname> " " <pathname> "\n" |
      "cp " <filename> " " <filename> "\n" |
      "chmod " <mode> " " <pathname> "\n" |
      "patch <<__UNIQUE_STRING__\n" <patch> "__UNIQUE_STRING__\n"
      (where UNIQUE_STRING must not be contained in patch)
  <filename>    ::= <pathname>
      (but pointing to a file)
  <pathname>    ::= a pathname relative to '.';
      escaping special characters the shell-way;
      may not contain '..'.

Example:
# Rename file b to a1, and change a line.
mv b a1
patch <<__END__
*** a1 Sun Apr 10 11:43:37 2005
--- a2 Sun Apr 10 11:43:41 2005
***************
*** 1,4 ****
  1
  2
! from
  3
--- 1,4 ----
  1
  2
! to
  3
__END__

Advantages:
- ASCII!
- a shell-patch is executable without extra tooling
- a shell-patch is readable and therefore reviewable
- a shell-patch is forward-compatible: a shell-patch acts
like a patch (since patch(1) ignores garbage around patch :),
but not backwards-compatible.
- extensible
- the heavy-lifting is done by 'patch'
Disadvantages:
- no deltas for binary files

Open issues:
- <comment> could be made more structured; maybe containing fields
like Subject:, Author:, Signed-By:, certificates, ...
(BitKeeper seems to be using "# " <field> ":" <value> "\n" lines)
- patch(1) doesn't know any directories. Should shell-patch
know directories? This implies commands working on directories too
(like directory renaming, mode changing, ...). Otherwise directories
are implicit (a file in a directory implies the existence of that
directory). Also implies mkdir and rmdir as shell-patch commands.
- extra commands might be useful to conserve more state (changes):
ln -s -- for symbolic links;
ln -- for hard links;
chown -- for ownership;
chattr -- for storing extended attributes
touch -- for setting timestamps (probably creation time only,
since mtime is something git relies on)
...and for the really adventurous:
sed 's,<fromstring>,<tostring>,' -- for substitutions
(this is something darcs supports, but which I think is too
bothersome to use since it is difficult to reverse-engineer
from two random trees)
Why a fixed format at all?
- This way, the executable shell-patch can be proven to be
harmless to the machine: 'rm -rf /' is a valid shell-script,
but not a valid shell-patch (since 'rm' is not a valid command,
random flags like '-rf' are not supported, and '/' is an absolute
pathname).
- A fixed format enables tooling to support such a patch format;
for example creating the reverse-patch, merging patches (yeah,
'cat' also merges patches...).
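
The "provably harmless" point can be illustrated with a small checker, sketched here in Python. It is an assumption-laden subset of the grammar above: it handles only the single-line `mv`, `cp` and `chmod` commands, accepts only octal modes, and instead of shell-style escaping it allows just a conservative character set in pathnames. The function names are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Assumption: no shell escaping is supported; pathnames are limited to
# characters that have no special meaning to the shell.
_SAFE_WORD = re.compile(r"[A-Za-z0-9._/+-]+")
_OCTAL_MODE = re.compile(r"[0-7]{3,4}")


def is_safe_path(word: str) -> bool:
    """True for a relative pathname without '..' that cannot be read as a flag."""
    if not _SAFE_WORD.fullmatch(word) or word.startswith("-"):
        return False
    path = PurePosixPath(word)
    return not path.is_absolute() and ".." not in path.parts


def is_valid_command(line: str) -> bool:
    """Check one shell-patch command line (without its trailing newline)."""
    words = line.split(" ")
    if len(words) != 3:
        return False
    command, first, second = words
    if command in ("mv", "cp"):
        return is_safe_path(first) and is_safe_path(second)
    if command == "chmod":
        return _OCTAL_MODE.fullmatch(first) is not None and is_safe_path(second)
    return False


for candidate in ("mv b a1", "chmod 644 a1", "rm -rf /", "cp /etc/passwd x"):
    print(candidate, "->", is_valid_command(candidate))
```

A whitelist like this rejects anything it does not recognise, which is what makes the format checkable before it is ever handed to a shell.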

...what has this to do with git? Not much and everything, depending
on how you look at it. 'git' is 'tar', and 'shell-patch' is 'patch';
both orthogonal concepts but very usable in combination. We'll look at
getting from two git trees to a shell-patch.

Diffing the trees would look not only at each file and its hash, but
also the other way around: which hash values are used more than once.
For files with the same hash value, compare the contents (and the rest
of the attributes); this is needed since the mapping from file contents
to sha1 is one-way. When the contents are the same, the
shell-patch-command to generate is obviously a 'cp'.

For example, we have got two trees in git (pathname -> hash value):
tree1/file1 -> 1234
tree1/file2 -> 4567
and
tree2/file1 -> 3456
tree2/file3 -> 4567
tree2/file4 -> 4567

..this could generate shell-patch:

# Comments-go-here
mv tree2/file2 tree2/file3
cp tree2/file3 tree2/file4
patch tree1/file1 <<__FILE_PATCH__
(patch-goes-here)
__FILE_PATCH__

...by an algorithm which starts by determining all renames, then all
copies, and finally all patches.
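
One way such an algorithm could look, as a Python sketch under simplifying assumptions: each tree is a plain mapping from pathname to hash, equal hashes are taken to mean equal contents (the content comparison described above is skipped), pathnames are relative to the tree root, deletions are ignored because the grammar has no command for them, and the patch bodies are left out. The function name `shell_patch_commands` is hypothetical.

```python
def shell_patch_commands(old: dict, new: dict) -> list:
    """Derive mv/cp/patch commands that turn tree `old` into tree `new`."""
    commands = []

    # Content that stays put: hash -> a path holding it in both trees.
    located = {}
    for path in sorted(old):
        if new.get(path) == old[path]:
            located.setdefault(old[path], path)

    # Old paths that disappeared, grouped by hash: candidates for a rename.
    removed_by_hash = {}
    for path in sorted(old):
        if path not in new:
            removed_by_hash.setdefault(old[path], []).append(path)

    # Pass 1: renames. A new path whose hash matches a vanished old path.
    pending = []
    for path in sorted(p for p in new if p not in old):
        sources = removed_by_hash.get(new[path])
        if sources:
            commands.append(f"mv {sources.pop(0)} {path}")
            located.setdefault(new[path], path)
        else:
            pending.append(path)

    # Pass 2: copies. A new path whose content already exists somewhere.
    unresolved = set()
    for path in pending:
        source = located.get(new[path])
        if source is not None:
            commands.append(f"cp {source} {path}")
        else:
            unresolved.add(path)  # genuinely new content

    # Pass 3: patches for changed files and for genuinely new content.
    for path in sorted(new):
        if path in unresolved or (path in old and old[path] != new[path]):
            commands.append(f"patch {path}")
    return commands


tree1 = {"file1": "1234", "file2": "4567"}
tree2 = {"file1": "3456", "file3": "4567", "file4": "4567"}
print("\n".join(shell_patch_commands(tree1, tree2)))
```

For the two example trees this yields a rename of file2 to file3, a copy of file3 to file4, and a patch for file1. A file that was both renamed and edited shows up as a new path with unknown content, which is the case the rest of the thread argues about.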

Comments?

--
Rutger Nijlunsing ---------------------- linux-kernel at tux.tmfweb.nl
never attribute to a conspiracy which can be explained by incompetence
----------------------------------------------------------------------

tony...@intel.com

Apr 10, 2005, 7:50:07 AM

>as a pure rename plus the extra delta. The current git doesn't have
>per-file change history. From git's point of view, one file was deleted
>and another file appeared with the same content.
>
>It is up to the top-level SCM to handle that correctly.
>Renaming a directory will be even more fun.

But from a git perspective it will be very efficient. Imagine that
Linus decides to rename arch/i386 as arch/x86 ... at the git repository
level this just requires a changeset, a new top level tree, and a new
tree for the arch directory showing that i386 changed to x86. That's
all ... every file below that is unchanged, so the blobs for the files
are all the same.
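
A toy content-addressed store makes the count concrete. This Python sketch is not git's on-disk format: objects are kept in a dict, a tree is hashed over simple text lines, and `store` and `write_tree` are hypothetical helpers. What it does share with the description above is that a directory's name lives in its parent's tree, not in the directory's own object.

```python
import hashlib


def store(objects: dict, kind: str, payload: bytes) -> str:
    """Save an object under the SHA-1 of its kind and contents."""
    digest = hashlib.sha1(kind.encode() + b"\0" + payload).hexdigest()
    objects[digest] = (kind, payload)
    return digest


def write_tree(objects: dict, tree: dict) -> str:
    """Store a nested dict (name -> bytes for files, dict for directories)."""
    entries = []
    for name in sorted(tree):
        node = tree[name]
        if isinstance(node, dict):
            entries.append(f"tree {write_tree(objects, node)} {name}")
        else:
            entries.append(f"blob {store(objects, 'blob', node)} {name}")
    return store(objects, "tree", "\n".join(entries).encode())


objects = {}
before = {
    "Makefile": b"all:\n",
    "arch": {"i386": {"kernel": {"setup.c": b"int setup;\n"}}},
}
write_tree(objects, before)
count_before = len(objects)

# Rename arch/i386 to arch/x86 without touching anything below it.
after = {"Makefile": b"all:\n", "arch": {"x86": before["arch"]["i386"]}}
write_tree(objects, after)

print(len(objects) - count_before)  # 2: a new top-level tree and a new arch tree
```

The subtree formerly called i386 hashes to the same value under its new name, so it and everything below it is reused; adding the changeset object gives the three new objects described above.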

-Tony

Ralph Corderoy

Apr 10, 2005, 8:00:17 AM

Hi,

Christopher Li wrote:
> On Sat, Apr 09, 2005 at 04:31:10PM -0700, Linus Torvalds wrote:
> > NOTE! This means that each "tree" file basically tracks just a
> > single directory. The old style of "every file in one tree file"
> > still works, but fsck-cache will warn about it. Happily, the git
> > archive itself doesn't have any subdirectories, so git itself is not
> > impacted by it.
>
> That is really cool stuff. My way to read it, correct me if I am
> wrong, git is a user space version file system. "tree" <--> directory
> and "blob" <--> file. "commit" to describe the version history.

See the Venti filesystem in Bell Labs's Plan 9 OS. It too uses SHA-1.

http://www.cs.bell-labs.com/sys/doc/venti/venti.pdf

Abstract

This paper describes a network storage system, called Venti,
intended for archival data. In this system, a unique hash of a
block's contents acts as the block identifier for read and write
operations. This approach enforces a write-once policy, preventing
accidental or malicious destruction of data. In addition, duplicate
copies of a block can be coalesced, reducing the consumption of
storage and simplifying the implementation of clients. Venti is a
building block for constructing a variety of storage applications
such as logical backup, physical backup, and snapshot file systems.

We have built a prototype of the system and present some preliminary
performance results. The system uses magnetic disks as the storage
technology, resulting in an access time for archival data that is
comparable to non-archival data. The feasibility of the write-once
model for storage is demonstrated using data from over a decade's
use of two Plan 9 file systems.

Cheers,

Ralph.

tony...@intel.com

Apr 10, 2005, 8:10:05 AM

>In other words, each "commit" file is very small and cheap, but since
>almost every commit will also imply a totally new tree-file, "git" is
>going to have an overhead of half a megabyte per commit. Oops.
>
>Damn, that's painful. I suspect I will have to change the format somehow.

Having dodged that bullet with the change to make tree files point at
other tree files ... here's another (potential) issue.

A changeset that touches just one file a few levels down from the top
of the tree (say arch/i386/kernel/setup.c) will make six new files in
the git repository (one for the changeset, four tree files, and a new
blob for the new version of the file). More complex changes make more
files ... but say the average is ten new files per changeset since most
changes touch few files. With 60,000 changesets in the current tree, we
will start out our git repository with about 600,000 files. Assuming
the first byte of the SHA1 hash is random, that means an average of 2343
files in each of the objects/xx directories. Give it a few more years at
the current pace, and we'll have over 10,000 files per directory. This
sounds like a lot to me ... but perhaps filesystems now handle large
directories enough better than they used to for this to not be a problem?
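
The estimate can be reproduced with a few lines of Python. The figures (60,000 changesets, ten objects per changeset, a one-byte fan-out) come from the message above; the helper name `objects_for_single_file_change` is hypothetical.

```python
def objects_for_single_file_change(path: str) -> int:
    """New objects when one file changes: changeset + trees + blob."""
    directories_below_top = path.count("/")
    trees = directories_below_top + 1  # every directory on the path, plus the top
    return 1 + trees + 1


changesets = 60_000
objects_per_changeset = 10       # the rough average assumed above
fanout_directories = 16 * 16     # objects/00 .. objects/ff

total_objects = changesets * objects_per_changeset
per_directory = total_objects // fanout_directories

print(objects_for_single_file_change("arch/i386/kernel/setup.c"))  # 6
print(total_objects)                                               # 600000
print(per_directory)                                               # 2343
```

Reaching 10,000 files per directory under the same assumptions takes 2,560,000 objects, or roughly 256,000 changesets.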

Or maybe the files should be named objects/xx/yy/zzzzzzzzzzzzzzzz?

-Tony

Linus Torvalds

Apr 10, 2005, 11:50:11 AM

On Sun, 10 Apr 2005, Junio C Hamano wrote:
>
> But I am wondering what your plans are to handle renames---or
> does git already represent them?

You can represent renames on top of git - git itself really doesn't care.
In many ways you can just see git as a filesystem - it's content-
addressable, and it has a notion of versioning, but I really really
designed it coming at the problem from the viewpoint of a _filesystem_
person (hey, kernels is what I do), and I actually have absolutely _zero_
interest in creating a traditional SCM system.

So to take renaming a file as an example - why do you actually want to
track renames? In traditional SCM's, you do it for two reasons:

- space efficiency. Most SCM's are based on describing changes to a file,
and compress the data by doing revisions on the same file. In order to
continue that process past a rename, such an SCM _has_ to track
renames, or lose the delta-based approach.

The most trivial example of this is "diff", ie a rename ends up
generating a _huge_ diff unless you track the rename explicitly.

GIT doesn't care. There is _zero_ space efficiency in trying to track
renames. In fact, it would add overhead to the system, not lessen it.
That's because GIT fundamentally doesn't do the "delta-within-a-file"
model.

- annotate/blame. This is a valid concern, but the fact is, I never use
it. It may be a deficiency of mine, but I simply don't do the per-line
thing when I debug or try to find who was responsible. I do "blame" on
a much bigger-picture level, and I personally believe (pretty strongly)
that per-line annotations are not actually a good thing - they come not
because people _want_ to do things at that low level, but because
historically, you didn't _have_ the bigger-picture thing.

In other words, pretty much every SCM out there is based on SCCS
"mentally", even if not in any other model. That's why people think
per-line blame is important - you have that mental model.

So consider me deficient, or consider me radical. It boils down to the
same thing. Renames don't matter.

That said, if somebody wants to create a _real_ SCM (rather than my notion
of a pure content tracker) on top of GIT, you probably could fairly easily
do so by imposing a few limitations on a higher level. For example, most
SCM's that track renames require that the user _tell_ them about the
renames: you do a "bk mv" or a "svn rename" or something.

If you want to do the same on top of GIT, then you should think of GIT as
what it is: GIT just tracks contents. It's a filesystem - although a
fairly strange one. How would you track renames on top of that? Easy: add
your own fields to the GIT revision messages: GIT enforces the header, but
you can add anything you want to the "free-form" part that follows it.
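
As an illustration only, here is how a higher-level tool might read such fields back, sketched in Python. The `rename: old => new` line format is invented for this example and is not something git defines; the sketch assumes only that the header and the free-form part are separated by a blank line.

```python
def parse_renames(commit_text: str) -> list:
    """Collect (old, new) pairs from hypothetical 'rename:' lines in the message."""
    _header, _, body = commit_text.partition("\n\n")
    renames = []
    for line in body.splitlines():
        if line.startswith("rename: "):
            old, separator, new = line[len("rename: "):].partition(" => ")
            if separator:
                renames.append((old, new))
    return renames


example = (
    "tree 8fd07d4b7778cd0233ea0a17acd3fe9d710af035\n"
    "author A. Hacker <hacker@example.org>\n"
    "\n"
    "Rename the i386 port\n"
    "\n"
    "rename: arch/i386 => arch/x86\n"
)
print(parse_renames(example))
```

Because the convention lives entirely in the message text, tools that do not know about it still see an ordinary commit.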

Same goes for any other information where you care about what happens
"within" a file. GIT simply doesn't track it. You can build things on top
of GIT if you want to, though. They may not be as efficient as they would
be if they were built _into_ GIT, but on the other hand GIT does a lot of
other things a hell of a lot faster thanks to its design.

So whether you agree with the things that _I_ consider important probably
depends on how you work. The real downside of GIT may be that _my_ way of
doing things is quite possibly very rare.

But it clearly is the only right way. The fact that everybody else does it
some other way only means that they are wrong.

Linus

Linus Torvalds

Apr 10, 2005, 12:10:10 PM

On Sat, 9 Apr 2005 tony...@intel.com wrote:
>
> With 60,000 changesets in the current tree, we will start out our git
> repository with about 600,000 files. Assuming the first byte of the
> SHA1 hash is random, that means an average of 2343 files in each of the
> objects/xx directories. Give it a few more years at the current pace,
> and we'll have over 10,000 files per directory. This sounds like a lot
> to me ... but perhaps filesystems now handle large directories enough
> better than they used to for this to not be a problem?

The good news is that git itself doesn't really care. I think it's
literally _one_ function ("get_sha1_filename()") that you need to change,
and then you need to write a small script that moves files around, and
you're pretty much done.

Also, I did actually debate that issue with myself, and decided that even
if we do have tons of files per directory, git doesn't much care. The
reason? Git never _searches_ for them. Assuming you have enough memory to
cache the tree, you just end up doing a "lookup", and inside the kernel
that's done using an efficient hash, which doesn't actually care _at_all_
about how many files there are per directory.

So I was for a while debating having a totally flat directory space, but
since there are _some_ downsides (linear lookup for cold-cache, and just
that "ls -l" ends up being O(n**2) and things), I decided that a single
fan-out is probably a good idea.

> Or maybe the files should be named objects/xx/yy/zzzzzzzzzzzzzzzz?

Hey, I may end up being wrong, and yes, maybe I should have done a
two-level one. The good news is that we can trivially fix it later (even
dynamically - we can make the "sha1 object tree layout" be a per-tree
config option, and there would be no real issue, so you could make small
projects use a flat version and big projects use a very deep structure
etc). You'd just have to script some renames to move the files around..
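
A Python sketch of what a configurable layout could look like. This is not the C `get_sha1_filename()` mentioned above; `object_path` and `migration_renames` are hypothetical names, and the `fanout` tuple (hex digits per directory level) stands in for the per-tree config option.

```python
def object_path(sha1_hex: str, fanout: tuple = (2,)) -> str:
    """Map a hex SHA-1 to its path for a given fan-out layout."""
    parts = []
    position = 0
    for width in fanout:
        parts.append(sha1_hex[position:position + width])
        position += width
    parts.append(sha1_hex[position:])  # remaining digits name the file
    return "/".join(["objects"] + parts)


def migration_renames(sha1_list, old_fanout, new_fanout):
    """List the (old, new) renames needed to switch layouts."""
    renames = []
    for sha1_hex in sha1_list:
        old = object_path(sha1_hex, old_fanout)
        new = object_path(sha1_hex, new_fanout)
        if old != new:
            renames.append((old, new))
    return renames


sha = "8fd07d4b7778cd0233ea0a17acd3fe9d710af035"
print(object_path(sha))          # single fan-out: objects/xx/...
print(object_path(sha, (2, 2)))  # two levels: objects/xx/yy/...
print(object_path(sha, ()))      # totally flat
```

Since the path is a pure function of the hash, changing layouts is a matter of renaming files; no object contents need rewriting.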

Linus

Petr Baudis

Apr 10, 2005, 12:30:12 PM

Hello,

so I "released" git-pasky-0.1, my set of patches and scripts upon
Linus' git, aimed at human usability and to an extent a SCM-like usage.

You can get it at

http://pasky.or.cz/~pasky/dev/git/git-pasky-base.tar.bz2

and after unpacking and building (make) do

git pull pasky

to get the latest changes from my branch. If you already have some git
from my branch which can do pulling, you can bring yourself up to date
by doing just

gitpull.sh pasky

(but this style of usage is deprecated now). Please see the README for
some details regarding usage etc. You can find the changes from the last
announcement in the ChangeLog (the previous announcement corresponds to
commit id 5125d089ad862f16a306b4942155092e1dce1c2d). The most important
change is probably recursive diff addition, and making git ignore the
nsec of ctime and mtime, since it is totally unreliable and likes to
taint random files as modified.

My near future plans include especially some merge support; I think it
should be rather easy, actually. I'll also add some simple tagging
mechanism. I've decided to postpone the file moving detection, since
there's no big demand for it now. ;-)

I will also need to do more testing on the linux kernel tree.
Committing patch-2.6.7 on 2.6.6 kernel and then diffing results in

$ time gitdiff.sh `parent-id` `tree-id` >p
real 5m37.434s
user 1m27.113s
sys 2m41.036s

which is pretty horrible, it seems to me. Any benchmarking help is of
course welcomed, as well as any other feedback.

BTW, what would be the best (most complete) source for the BK tree
metadata? Should I dig it from the BKCVS gateway, or is there a better
source? Where did you get the sparse git database from, Linus? (BTW, it
would be nice to get sparse.git with the directories as separate.)

Have fun,

Linus Torvalds

Apr 10, 2005, 1:00:13 PM

On Sun, 10 Apr 2005, Petr Baudis wrote:
>
> Where did you get the sparse git database from, Linus? (BTW, it
> would be nice to get sparse.git with the directories as separate.)

When we were trying to figure out how to avert the BK disaster, one of
Tridge's concerns (and, in my opinion, the only really valid one) was that
you couldn't get the BK data in some SCM-independent way.

So I wrote some very preliminary scripts (on top of BK itself) to extract
the data, to show that BK could generate a SCM-neutral file format (a very
stupid one and horribly useless for anything but interoperability, but
still...). I was hoping that that would convince Tridge that trying to
muck around with the internal BK file format was not worth it, and avert
the BK trainwreck.

Larry was ok with the idea to make my export format actually be natively
supported by BK (ie the same way you have "bk export -tpatch"), but Tridge
wanted to instead get at the native data and be difficult about it. As a
result, not only can I no longer use BK, but we also don't have a nice
export format from BK.

Yeah, I'm a bit bitter about it.

Anyway, the sparse data came out of my hack. It's very inefficient, and I
estimated that doing the same for the kernel would have taken ten solid
days of conversion, mainly because my hack was really just that: a quick
hack to show that BK could do it. Larry could have done it a lot better.

I'll re-generate the sparse git-database at some point (and I'll probably
do so from the old GIT database itself, rather than re-generating it from
my old BK data).

Linus

Rutger Nijlunsing

Apr 10, 2005, 1:10:12 PM

On Sun, Apr 10, 2005 at 08:44:56AM -0700, Linus Torvalds wrote:
>
>
> On Sun, 10 Apr 2005, Junio C Hamano wrote:
> >
> > But I am wondering what your plans are to handle renames---or
> > does git already represent them?
>
> You can represent renames on top of git - git itself really doesn't care.
> In many ways you can just see git as a filesystem - it's content-
> addressable, and it has a notion of versioning, but I really really
> designed it coming at the problem from the viewpoint of a _filesystem_
> person (hey, kernels is what I do), and I actually have absolutely _zero_
> interest in creating a traditional SCM system.
>
> So to take renaming a file as an example - why do you actually want to
> track renames? In traditional SCM's, you do it for two reasons:
>
> - space efficiency. Most SCM's are based on describing changes to a file,

[snip]

> - annotate/blame. This is a valid concern, but the fact is, I never use

[snip]

- merging.
When the parent tree renames a file, it's easier for an out-of-tree
patch to get up-to-date.

- reviewing.
A huge patch with 2000 added lines and 1990 removed lines is more
difficult to review than a rename + 10 lines patch.

> So consider me deficient, or consider me radical. It boils down to the
> same thing. Renames don't matter.

When you've got no out-of-tree patches since you've got the
parent-of-all-trees, then they don't matter, that's true :)

> So whether you agree with the things that _I_ consider important probably
> depends on how you work. The real downside of GIT may be that _my_ way of
> doing things is quite possibly very rare.

--
Rutger Nijlunsing ---------------------------------- eludias ed dse.nl

never attribute to a conspiracy which can be explained by incompetence
----------------------------------------------------------------------

Ingo Molnar

Apr 10, 2005, 1:40:07 PM

* Petr Baudis <pa...@ucw.cz> wrote:

> I will also need to do more testing on the linux kernel tree.
> Committing patch-2.6.7 on 2.6.6 kernel and then diffing results in
>
> $ time gitdiff.sh `parent-id` `tree-id` >p
> real 5m37.434s
> user 1m27.113s
> sys 2m41.036s
>
> which is pretty horrible, it seems to me. Any benchmarking help is of
> course welcomed, as well as any other feedback.

it seems from the numbers that your system doesnt have enough RAM for
this and is getting IO-bound?

Ingo

Rik van Riel

Apr 10, 2005, 1:40:14 PM

On Sat, 9 Apr 2005, Linus Torvalds wrote:

> I've rsync'ed the new git repository to kernel.org, it should all be there
> in /pub/linux/kernel/people/torvalds/git.git/ (and it looks like the
> mirror scripts already picked it up on the public side too).

GCC 4 isn't very happy. Mostly sign changes, but also something
that looks like a real error:

gcc -g -O3 -Wall -c -o fsck-cache.o fsck-cache.c
fsck-cache.c: In function 'main':
fsck-cache.c:59: warning: control may reach end of non-void function 'fsck_tree' being inlined
fsck-cache.c:62: warning: control may reach end of non-void function 'fsck_commit' being inlined

I assume that fsck_tree and fsck_commit should complain loudly
if they ever get to that point - but since I'm not quite sure
there's no patch, sorry.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

Paul Jackson

Apr 10, 2005, 1:40:12 PM

Ralph wrote:
> but good enough for
> most uses that people will get caught out when it fails.

Exactly.

If Linus persists in this diff-tree output format, using two lines for
changed files, then I will have to add the following sed script to my
arsenal:

sed '/^</ { N; s/\n>/ / }'

It collapses pairs of lines:

<100664 4870bcf91f8666fc788b07578fb7473eda795587 Makefile
>100664 5493a649bb33b9264e8ed26cc1f832989a307d3b Makefile

to the single line:

<100664 4870bcf91f8666fc788b07578fb7473eda795587 Makefile 100664 5493a649bb33b9264e8ed26cc1f832989a307d3b Makefile
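
For those who do not know sed, the same pairing can be sketched in Python. The function name `collapse_pairs` is hypothetical; it follows the sed script's behaviour of joining a `<` line with the `>` line that follows it and passing every other line through unchanged.

```python
def collapse_pairs(lines):
    """Join each '<' line with the '>' line after it, as the sed script does."""
    collapsed = []
    iterator = iter(lines)
    for line in iterator:
        if not line.startswith("<"):
            collapsed.append(line)
            continue
        following = next(iterator, None)
        if following is None:
            collapsed.append(line)  # dangling '<' at end of input
        elif following.startswith(">"):
            collapsed.append(line + " " + following[1:])
        else:
            collapsed.extend([line, following])  # not a pair; keep both lines
    return collapsed


diff_tree_output = [
    "<100664 4870bcf91f8666fc788b07578fb7473eda795587 Makefile",
    ">100664 5493a649bb33b9264e8ed26cc1f832989a307d3b Makefile",
    "+100664 fd00e5603dcc4a93acceda0b8cb914fabc8645d5 checkout-cache.c",
]
for record in collapse_pairs(diff_tree_output):
    print(record)
```

With one record per line, tools like xargs can no longer separate the two halves of a change.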

However, more people will get bit by this git glitch than know sed.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

Ingo Molnar

Apr 10, 2005, 1:50:07 PM

* Willy Tarreau <wi...@w.ods.org> wrote:

> > > I will also need to do more testing on the linux kernel tree.
> > > Committing patch-2.6.7 on 2.6.6 kernel and then diffing results in
> > >
> > > $ time gitdiff.sh `parent-id` `tree-id` >p
> > > real 5m37.434s
> > > user 1m27.113s
> > > sys 2m41.036s
> > >
> > > which is pretty horrible, it seems to me. Any benchmarking help is of
> > > course welcomed, as well as any other feedback.
> >
> > it seems from the numbers that your system doesnt have enough RAM for
> > this and is getting IO-bound?
>

> Not the only problem, without I/O, he will go down to 4m8s (u+s) which
> is still in the same order of magnitude.

probably not the only problem - but if we are lucky then his system was
just thrashing within the kernel repository and then most of the overhead
is the _unnecessary_ IO that happened due to that (which causes CPU
overhead just as much). The dominant system time suggests so, to a
certain degree. Maybe this is wishful thinking.

Ingo Molnar

Apr 10, 2005, 1:50:08 PM

* Rik van Riel <ri...@redhat.com> wrote:

> GCC 4 isn't very happy. Mostly sign changes, but also something that
> looks like a real error:
>
> gcc -g -O3 -Wall -c -o fsck-cache.o fsck-cache.c
> fsck-cache.c: In function 'main':
> fsck-cache.c:59: warning: control may reach end of non-void function 'fsck_tree' being inlined
> fsck-cache.c:62: warning: control may reach end of non-void function 'fsck_commit' being inlined
>
> I assume that fsck_tree and fsck_commit should complain loudly if they
> ever get to that point - but since I'm not quite sure there's no
> patch, sorry.

i sent a patch for most of the sign errors, but the above is a case of gcc
not noticing that the function can never ever exit the loop, and thus
cannot get to the 'return' point.

Ingo

Willy Tarreau

Apr 10, 2005, 1:50:08 PM

On Sun, Apr 10, 2005 at 07:33:49PM +0200, Ingo Molnar wrote:
>
> * Petr Baudis <pa...@ucw.cz> wrote:
>
> > I will also need to do more testing on the linux kernel tree.
> > Committing patch-2.6.7 on 2.6.6 kernel and then diffing results in
> >
> > $ time gitdiff.sh `parent-id` `tree-id` >p
> > real 5m37.434s
> > user 1m27.113s
> > sys 2m41.036s
> >
> > which is pretty horrible, it seems to me. Any benchmarking help is of
> > course welcomed, as well as any other feedback.
>
> it seems from the numbers that your system doesnt have enough RAM for
> this and is getting IO-bound?

Not the only problem, without I/O, he will go down to 4m8s (u+s) which
is still in the same order of magnitude.
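
The u+s figure is just the user and system times from the quoted run added together; a short Python check, using the numbers from that output (the helper name `to_seconds` is hypothetical):

```python
def to_seconds(minutes: int, seconds: float) -> float:
    return 60 * minutes + seconds


real = to_seconds(5, 37.434)
user = to_seconds(1, 27.113)
system = to_seconds(2, 41.036)

cpu = user + system           # time actually spent computing
not_on_cpu = real - cpu       # time spent waiting, e.g. for I/O

cpu_minutes, cpu_seconds = divmod(cpu, 60)
print(f"u+s  = {int(cpu_minutes)}m{cpu_seconds:.3f}s")
print(f"wait = {not_on_cpu:.3f}s")
```

So even with all waiting removed, about three quarters of the wall-clock time would remain.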

willy

Paul Jackson

Apr 10, 2005, 2:30:17 PM

Tony wrote:
> Or maybe the files should be named objects/xx/yy/zzzzzzzzzzzzzzzz?

I tend to size these things with the square root of the number of
leaf nodes. If I have 2,560,000 leaves (your 10,000 files in each
of 16*16 directories), then I will aim for 1600 directories of
1600 leaves each.

My backup is sized for about this number of leaves, and it uses:

xxx/xxxzzzzzzzzzzzzzzzz

(I repeat the xxx in the leaf name - easier to code.)

I don't think there is any need for two levels. There are 4096
different values of three digit hex numbers. That's ok in one
directory.

The only question would be 'xx' or 'xxx' - two or three digits.

This one is on the cusp in my view - either works.
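
The square-root rule and the two candidate fan-outs can be compared with a few lines of Python; the leaf count is the 10,000 files in each of 16*16 directories from above, and the variable names are only illustrative.

```python
import math

leaves = 10_000 * 16 * 16          # 2,560,000 objects
balanced = math.isqrt(leaves)      # directories == leaves per directory

two_digit_dirs = 16 ** 2           # 'xx'
three_digit_dirs = 16 ** 3         # 'xxx'

print(balanced)                    # 1600
print(two_digit_dirs, leaves // two_digit_dirs)      # 256 directories, 10000 leaves each
print(three_digit_dirs, leaves // three_digit_dirs)  # 4096 directories, 625 leaves each
```

The balanced figure of 1600 falls between 256 and 4096, which is why either choice is defensible at this size.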

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

Petr Baudis

Apr 10, 2005, 3:00:09 PM

Dear diary, on Sun, Apr 10, 2005 at 07:45:12PM CEST, I got a letter
where Ingo Molnar <mi...@elte.hu> told me that...

>
> * Willy Tarreau <wi...@w.ods.org> wrote:
>
> > > > I will also need to do more testing on the linux kernel tree.
> > > > Committing patch-2.6.7 on 2.6.6 kernel and then diffing results in
> > > >
> > > > $ time gitdiff.sh `parent-id` `tree-id` >p
> > > > real 5m37.434s
> > > > user 1m27.113s
> > > > sys 2m41.036s
> > > >
> > > > which is pretty horrible, it seems to me. Any benchmarking help is of
> > > > course welcomed, as well as any other feedback.
> > >
> > > it seems from the numbers that your system doesnt have enough RAM for
> > > this and is getting IO-bound?
> >
> > Not the only problem, without I/O, he will go down to 4m8s (u+s) which
> > is still in the same order of magnitude.
>
> probably not the only problem - but if we are lucky then his system was
> just thrashing within the kernel repository and then most of the overhead
> is the _unnecessary_ IO that happened due to that (which causes CPU
> overhead just as much). The dominant system time suggests so, to a
> certain degree. Maybe this is wishful thinking.

It turns out that the forks for doing all the cuts and such are what is
bogging it down so awfully (doing diff-tree takes 0.48s ;-). I do about
15 forks per change, I guess, and for some reason cut takes a lot of
time on its own.

I've rewritten the cuts with the use of bash arrays and other smart
stuff. I somehow don't feel comfortable using this and prefer the
old-fashioned ways, but it would be plain unusable without this.

Now I'm down to

real 1m21.440s
user 0m32.374s
sys 0m42.200s

and I kinda doubt it is possible to cut this down much further. Almost no
disk activity, I have almost everything cached by now, apparently.

Anyway, you can git pull to get the optimized version.

Thanks for the help,

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

Paul Jackson

Apr 10, 2005, 3:10:09 PM

Linus wrote:
> It's a filesystem - although a
> fairly strange one.

Ah ha - that explains the read-tree and write-tree names.

The read-tree pulls stuff out of this file system into
your working files, clobbering local edits. This is like
the read(2) system call, which clobbers stuff in your
read buffer.

The write-tree pushes stuff down into the file system,
just like write(2) pushes data into the kernel.

I was getting all kinds of frustrated yesterday trying
to use Linus's git commands, coming at these names with my
SCM hat on.

That way of thinking really doesn't work well here.

I will have to look more closely at pasky's GIT toolkit
if I want to see an SCM style interface.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

Willy Tarreau

Apr 10, 2005, 3:20:12 PM

On Sun, Apr 10, 2005 at 08:45:22PM +0200, Petr Baudis wrote:

> It turns out to be the forks for doing all the cuts and such that are
> bogging it down so awfully (doing diff-tree takes 0.48s ;-). I do about
> 15 forks per change, I guess, and for some reason cut takes a lot of
> time on its own.
>
> I've rewritten the cuts with the use of bash arrays and other smart
> stuff. I somehow don't feel comfortable using this and prefer the
> old-fashioned ways, but it would be plain unusable without this.

I've encountered the same problem in a config-generation script a while
ago. Fortunately, bash provides enough ways to remove most of the forks,
but the result is less portable.

I've downloaded your code, but it does not compile here because of the
tv_nsec fields in struct stat (2.4, glibc 2.2), so I cannot use it to
get the most up to date version to take a look at the script. Basically,
all the 'cut' and 'sed' can be removed, as well as the 'dirname'. You
can also call mkdir only if the dirs don't exist. I really think you
should end up with only one fork in the loop to call 'diff'.

> Now I'm down to
>
> real 1m21.440s
> user 0m32.374s
> sys 0m42.200s
>
> and I kinda doubt it is possible to cut this down much further. Almost no
> disk activity, I have almost everything cached by now, apparently.

It is very common to cut times by a factor of 10 or more when replacing
common unix tools with pure shell. Dynamic library initialization also
takes a lot of time nowadays, and you probably have localisation enabled,
which is costly too. Sometimes, just unsetting a few variables at the top
of the shell script might remove some useless overhead.
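
To make the fork savings concrete, here is a minimal sketch of how 'dirname', 'cut' and an unconditional 'mkdir' can be replaced with shell builtins. The function names are made up for illustration and nothing here is taken from the git-pasky scripts; it sticks to plain sh rather than bash arrays.

```sh
# Hypothetical helpers for illustration; none of these names come from git-pasky.

# dir_of PATH: what `dirname PATH` prints for a simple relative path.
dir_of() {
    case "$1" in
        */*) printf '%s\n' "${1%/*}" ;;
        *)   printf '%s\n' "." ;;
    esac
}

# field N LINE: roughly `echo LINE | cut -d' ' -f N`, done by word splitting.
# N must not exceed the number of fields in LINE.
field() {
    _n=$1
    set -f               # keep '*' and '?' in LINE from being globbed
    set -- $2            # split LINE on whitespace into $1, $2, ...
    set +f
    shift $((_n - 1))
    printf '%s\n' "$1"
}

# ensure_dir DIR: only fork mkdir when the directory is actually missing.
ensure_dir() {
    [ -d "$1" ] || mkdir -p "$1"
}
```

Calling these through $(...) still costs a subshell, so in a hot loop the expansions are better written inline, e.g. dir=${path%/*}.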

> Anyway, you can git pull to get the optimized version.
>
> Thanks for the help,

Willy

Paul Jackson

Apr 10, 2005, 3:30:12 PM

> Some thing like the following patch, may be turn off able.

Take out an old envelope and compute on it the odds of this
happening.

Say we have 10,000 kernel hackers, each producing one
new file every minute, for 100 hours a week. And we've
cloned a small army of Andrew Mortons to integrate
the resulting tsunami of patches. And Linus is well
cared for in the state funny farm.

What is the probability that this check will fire even
once, between now and 10 billion years from now, when
the Sun has become a red giant destroying all life on
planet Earth?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

Sean

Apr 10, 2005, 4:00:20 PM

On Sun, April 10, 2005 12:55 pm, Linus Torvalds said:

> Larry was ok with the idea to make my export format actually be natively
> supported by BK (ie the same way you have "bk export -tpatch"), but
> Tridge wanted to instead get at the native data and be difficult about
> it. As a result, not only can I not use BK any more, but we also don't
> have a nice export format from BK.
>
> Yeah, I'm a bit bitter about it.
>

Linus,

With all due respect, Larry could have dealt with this years ago and
removed the motivation for Tridge and others to pursue reverse
engineering. Instead he chose to insult and question the motives of
everyone that wanted open-source access to the Linux history data. The
blame for the current situation falls firmly on the choice to use a
closed-source SCM for Linux and the actions of the company that owned it.

Sean

Paul Jackson

Apr 10, 2005, 4:50:10 PM

Good lord - you don't need to use arrays for this.

The old-fashioned ways have their ways. Both the 'set'
command and the 'read' command can split args and assign
to distinct variable names.

Try something like the following:

diff-tree -r "$id1" "$id2" |
sed -e '/^</ { N; s/\n>/ / }' -e 's/./& /' |
while read -r op mode1 sha1 name1 mode2 sha2 name2
do
    # ... various common stuff ...
    case "$op" in
    "+")
        # ...
        ;;
    "-")
        # ...
        ;;
    "<")
        # a "<" line has been joined with its ">" partner by the sed above
        test "$name1" = "$name2" || die "mismatched names"
        label1=$(mkbanner "$loc1" "$id1" "$name1" "$mode1" "$sha1")
        label2=$(mkbanner "$loc2" "$id2" "$name1" "$mode2" "$sha2")
        diff -L "$label1" -L "$label2" -u "$loc1" "$loc2"
        ;;
    esac
done

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

Linus Torvalds

Apr 10, 2005, 4:50:10 PM

On Sun, 10 Apr 2005, Petr Baudis wrote:
>
> It turns out to be the forks for doing all the cuts and such that are
> bogging it down so awfully (doing diff-tree takes 0.48s ;-). I do about
> 15 forks per change, I guess, and for some reason cut takes a lot of
> time on its own.

Heh.

Can you pull my current repo, which has "diff-tree -R" that does what the
name suggests, and which should be faster than the 0.48 sec you see..

It may not matter a lot, since actually generating the diff from the file
contents is what is expensive, but remember my goal: I want the expense of
a diff-tree to be relative to the size of the diff, so that implies that
small diffs have to be basically instantaneous. So I care.

So I just tried the 2.6.7->2.6.8 diff, and for me the new recursive
"diff-tree" can generate the _list_ of files changed in zero time:

real 0m0.079s
user 0m0.067s
sys 0m0.024s

but then _doing_ the diff is pretty expensive (in this case 3800+ files
changed, so you have to unpack 7600+ objects - and even unpacking isn't
the expensive part, the expense is literally in the diff operation
itself).

Me, the stuff I automate is the small steps. Doing a single checkin. So
that's the case I care about going fast, when a "diff-tree" will likely
have maybe five files or something. That's why I want the small
incremental cases to go fast - if it takes me a minute to generate a diff
for a _release_, that's not a big deal. I make one release every other
month, but I work with lots of small patches all the time.

Anyway, with a fast diff-tree, you should be able to generate the list of
objects for a fast "merge". That's next.

(And by "merge", I of course mean "suck". I'm talking about the old CVS
three-way merge, and you have to specify the common parent explicitly and
it won't handle any renames or any other crud. But it would get us to
something that might actually be useful for simple things. Which is why
"diff-tree" is important - it gives the information about what to tell
merge).

Linus

Linus Torvalds

Apr 10, 2005, 5:00:13 PM

On Sun, 10 Apr 2005, Paul Jackson wrote:
>
> Ah ha - that explains the read-tree and write-tree names.
>
> The read-tree pulls stuff out of this file system into
> your working files, clobbering local edits. This is like
> the read(2) system call, which clobbers stuff in your
> read buffer.

Yes. Except it's a two-stage thing, where the staging area is always the
"current directory cache".

So a "read-tree" always reads the tree information into the directory
cache, but does not actually _update_ any of the files it "caches". To do
that, you need to do a "checkout-cache" phase.

Similarly, "write-tree" writes the current directory cache contents into a
set of tree files. But in order to have that match what is actually in
your directory right now, you need to have done an "update-cache" phase
before you did the "write-tree".

So there is always a staging area between the "real contents" and the
"written tree".

> That way of thinking really doesn't work well here.
>
> I will have to look more closely at pasky's GIT toolkit
> if I want to see an SCM style interface.

Yes. You really should think of GIT as a filesystem, and of me as a
_systems_ person, not an SCM person. In fact, I tend to detest SCM's. I
think the reason I worked so well with BitKeeper is that Larry used to do
operating systems. He's also a systems person, not really an SCM person.
Or at least he's in between the two.

My operations are like the "system calls". Useless on their own: they're
not real applications, they're just how you read and write files in this
really strange filesystem. You need to wrap them up to make them do
anything sane.

For example, take "commit-tree" - it really just says that "this is the
new tree, and these other trees were its parents". It doesn't do any of
the actual work to _get_ those trees written.

So to actually do the high-level operation of a real commit, you need to
first update the current directory cache to match what you want to commit
(the "update-cache" phase).

Then, when your directory cache matches what you want to commit (which is
NOT necessarily the same thing as your actual current working area - if
you don't want to commit some of the changes you have in your tree, you
should avoid updating the cache with those changes), you do stage 2, ie
"write-tree". That writes a tree node that describes what you want to
commit.

Only THEN, as phase three, do you do the "commit-tree". Now you give it
the tree you want to commit (remember - that may not even match your
current directory contents), and the history of how you got here (ie you
tell commit what the previous commit(s) were), and the changelog.

So a "commit" in SCM-speak is actually three totally separate phases in my
filesystem thing, and each of the phases (except for the last
"commit-tree" which is the thing that brings it all together) is actually
in turn many smaller parts (ie "update-cache" may have been called
hundreds of times, and "write-tree" will write several tree objects that
point to each other).
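
As a rough summary of the three phases, the sketch below only prints the commands a wrapper script might run; it executes nothing. The helper name commit_plan is hypothetical, and two details are assumptions rather than something stated in this thread: that "write-tree" prints the new tree ID on stdout, and that "commit-tree" reads the changelog from stdin.

```sh
# commit_plan CHANGELOG PARENT FILE...: print the three phases; nothing is run.
# (Hypothetical helper; the exact argument forms are assumptions.)
commit_plan() {
    _changelog=$1
    _parent=$2
    shift 2
    # Phase 1: bring the directory cache in line with what should be committed.
    printf 'update-cache'
    printf ' %s' "$@"
    printf '\n'
    # Phase 2: write the cache out as tree objects (assumed to print the tree ID).
    printf '%s\n' 'tree=$(write-tree)'
    # Phase 3: tie the tree to its parent commit and the changelog.
    printf 'commit-tree $tree -p %s < %s\n' "$_parent" "$_changelog"
}
```

Leaving a file out of the phase 1 list is what keeps its changes out of the commit, which is the point made above about the cache not having to match the working area.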

Similarly, a "checkout" really is about first finding the tree ID you want
to check out, and then bringing it into the "directory cache" by doing a
"read-tree" on it. You can then actually update the directory cache
further: you might "read-tree" _another_ project, or you could decide that
you want to keep one of the files you already had.

So in that scenario, after doing the read-tree you'd do an "update-cache"
on the file you want to keep in your current directory structure, which
updates your directory cache to be a _mix_ of the original tree you now
want to check out _and_ of the file you want to use from your current
directory. Then doing a "checkout-cache -a" will actually do the actual
checkout, and only at that point does your working directory really get
changed.

Btw, you don't even have to have any working directory files at all. Let's
say that you have two independent trees, and you want to create a new
commit that is the join of those two trees (where one of the trees takes
precedence). You'd do a "read-tree <a> <b>", which will create a directory
cache (but not check out) that is the union of the <a> and <b> trees (<b>
will override). And then you can do a "write-tree" and commit the
resulting tree - without ever having _any_ of those files checked out.

Linus

Petr Baudis

Apr 10, 2005, 5:30:19 PM

Dear diary, on Sun, Apr 10, 2005 at 09:13:19PM CEST, I got a letter
where Willy Tarreau <wi...@w.ods.org> told me that...

> On Sun, Apr 10, 2005 at 08:45:22PM +0200, Petr Baudis wrote:
>
> > It turns out to be the forks for doing all the cuts and such that are
> > bogging it down so awfully (doing diff-tree takes 0.48s ;-). I do about
> > 15 forks per change, I guess, and for some reason cut takes a lot of
> > time on its own.
> >
> > I've rewritten the cuts with the use of bash arrays and other smart
> > stuff. I somehow don't feel comfortable using this and prefer the
> > old-fashioned ways, but it would be plain unusable without this.
>
> I've encountered the same problem in a config-generation script a while
> ago. Fortunately, bash provides enough ways to remove most of the forks,
> but the result is less portable.
>
> I've downloaded your code, but it does not compile here because of the
> tv_nsec fields in struct stat (2.4, glibc 2.2), so I cannot use it to
> get the most up to date version to take a look at the script. Basically,

Ok, I decided to stop this nsec madness (since it broke show-diff
anyway at least on my ext3), and you get it only if you pass -DNSEC
to CFLAGS now. Hope this fixes things for you. :-)

BTW, I regularly update the public copy as accessible on the web.

> all the 'cut' and 'sed' can be removed, as well as the 'dirname'. You
> can also call mkdir only if the dirs don't exist. I really think you
> should end up with only one fork in the loop to call 'diff'.

You still need to extract the file by cat-file too. ;-) And rm the files
after it compares them (so that we don't fill /tmp with crap like
certain awful programs like to do). But I will conditionalize the mkdir
calls, thanks for the suggestion - I think that's the last bit to be
squeezed from this loop (I'll still check on the read proposal - I
considered it before and turned it down for some reason, can't remember
why anymore, though).

Thanks,

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

Linus Torvalds

Apr 10, 2005, 5:40:11 PM

On Sun, 10 Apr 2005, Linus Torvalds wrote:
>
> Can you pull my current repo, which has "diff-tree -R" that does what the
> name suggests, and which should be faster than the 0.48 sec you see..

Actually, I changed things around. Everybody hated the "<" ">" lines, so I
put a changed thing on a line of its own with a "*" instead.

So you'd now see lines like

*100644->100644 1874e031abf6631ea51cf6177b82a1e662f6183e->e8181df8499f165cacc6a0d8783be7143013d410 CREDITS

which means that the CREDITS file has changed, and it shows you the mode
-> mode transition (that didn't change in this case) and the sha1 -> sha1
transition.

So now it's always just one line per change. Furthermore, the filename is
always field 3, if you use spaces as delimiters, regardless of whether
it's a +/-/* field.
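
A minimal sketch of picking such a line apart without forking, assuming the output has already gone through tr '\0' '\n'. The function and variable names are made up, and the layout of "+" and "-" lines (a single mode and a single sha1) is an assumption carried over from the older output format.

```sh
# parse_change LINE: split one diff-tree line into the globals
# op, mode_old, mode_new, sha_old, sha_new and name. (Hypothetical helper.)
parse_change() {
    # read splits on whitespace; anything after field 2 lands in name
    read -r _modes _shas name <<EOF
$1
EOF
    op=${_modes%"${_modes#?}"}      # first character: +, - or *
    _modes=${_modes#?}
    case "$op" in
        '*')    # changed: "old->new" for both the mode and the sha1
            mode_old=${_modes%%'->'*}; mode_new=${_modes##*'->'}
            sha_old=${_shas%%'->'*};   sha_new=${_shas##*'->'}
            ;;
        '+')    # added: only a new side exists
            mode_old=; sha_old=; mode_new=$_modes; sha_new=$_shas
            ;;
        '-')    # removed: only an old side exists
            mode_old=$_modes; sha_old=$_shas; mode_new=; sha_new=
            ;;
    esac
}
```

It has to be called directly, not inside $(...), since it reports its results through shell variables.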

So let's say you want to merge two trees (dst1 and dst2) from a common
parent (src), what you would do is:

- get the list of files to merge:

diff-tree -R <dst1> <dst2> | tr '\0' '\n' > merge-files

- Which of those were changed by <src> -> <dstX>?

diff-tree -R <src> <dst1> | tr '\0' '\n' | join -j 3 - merge-files > dst1-change
diff-tree -R <src> <dst2> | tr '\0' '\n' | join -j 3 - merge-files > dst2-change

- Which of those are common to both? Let's see what the merge list is:

join dst1-change dst2-change > merge-list

and hopefully you'd usually be working on a very small list of files by
then (everything else you'd just pick from one of the destination trees
directly - you've got the name, the sha-file, everything: no need to even
look at the data).
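
A minimal sketch of the last step of the recipe above, assuming the two <src> -> <dstX> outputs have already been saved one change per line (after the tr). It uses cut, sort and comm on the file name instead of join, which is a substitution made here because join needs its inputs sorted on the join field. It also skips the initial <dst1>-vs-<dst2> filter, so files changed identically on both sides stay in the list. The function names are made up.

```sh
# names_changed FILE: sorted, unique file names (field 3 onward) from saved
# diff-tree output, one change per line. (Hypothetical helper.)
names_changed() {
    cut -d' ' -f3- "$1" | LC_ALL=C sort -u
}

# changed_in_both SRC_DST1 SRC_DST2: names that show up in both change lists,
# i.e. the files that would need a real three-way merge.
changed_in_both() (
    a=$(mktemp) && b=$(mktemp) || exit 1
    names_changed "$1" > "$a"
    names_changed "$2" > "$b"
    LC_ALL=C comm -12 "$a" "$b"      # lines common to both sorted lists
    rm -f "$a" "$b"
)
```

Everything that appears in only one of the two lists can be taken straight from the tree that changed it, without looking at file contents.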

Does this sound sane? Pasky? Wanna try a "git merge" thing? Starting off
with the user having to tell what the common parent tree is - we can try
to do the "automatically find best common parent" crud later. THAT may be
expensive.

(Btw, this is why I think "diff-tree" is more important than actually
generating the real diff itself - the above uses diff-tree three times
just to cut down to the point where _hopefully_ you don't actually need to
generate very many diffs at all. So I want "diff-tree" to be really fast,
even if it then can take a minute to actually generate a big diff between
releases etc).

Christopher Li

Apr 10, 2005, 6:00:14 PM

I totally agree that the odds are really, really small.
That is why it is not worth handling the case. People who hit it
can just add a newline or something to avoid it, if
it happens at all.

It is a little peace of mind to know for sure that it did
not happen. I am just paranoid.

Chris

Luck, Tony

Apr 10, 2005, 6:10:12 PM

>Also, I did actually debate that issue with myself, and decided that even
>if we do have tons of files per directory, git doesn't much care. The
>reason? Git never _searches_ for them. Assuming you have enough memory to
>cache the tree, you just end up doing a "lookup", and inside the kernel
>that's done using an efficient hash, which doesn't actually care _at_all_
>about how many files there are per directory.

So long as the hash *is* efficient when the directory is packed full of
38 character filenames made only of [0-9a-f] ... which might not match
the test cases under which the hash was picked :-) When there are some
full-sized kernel git images, someone should do a sanity check.
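
For anyone scripting against the layout, here is a small sketch of the mapping being discussed, inferred from the 38-character file names: the first two hex digits of the 40-digit sha1 name the directory and the remaining 38 name the file. The function name and the default directory are assumptions for illustration.

```sh
# object_path SHA1 [DIR]: map a 40-hex-digit object name to DIR/xx/<38 digits>.
# (Hypothetical helper; DIR defaults to an assumed .dircache/objects.)
object_path() {
    [ "${#1}" -eq 40 ] || return 1
    _rest=${1#??}            # everything after the first two digits
    _head=${1%"$_rest"}      # the first two digits
    printf '%s/%s/%s\n' "${2:-.dircache/objects}" "$_head" "$_rest"
}
```

Keeping the mapping in one helper like this is what would make a later change of layout cheap for scripts that use it.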

>Hey, I may end up being wrong, and yes, maybe I should have done a
>two-level one. The good news is that we can trivially fix it later (even
>dynamically - we can make the "sha1 object tree layout" be a per-tree
>config option, and there would be no real issue, so you could make small
>projects use a flat version and big projects use a very deep structure
>etc). You'd just have to script some renames to move the files around.

It depends on how many eco-system shell scripts get built that need to
know about the layout ... if some shell/perl "libraries" encode this
filename layout (and people use them) ... then switching later would
indeed be painless.

-Tony

Petr Baudis

Apr 10, 2005, 6:20:10 PM

Dear diary, on Mon, Apr 11, 2005 at 12:07:37AM CEST, I got a letter
where "Luck, Tony" <tony...@intel.com> told me that...
..snip..

> >Hey, I may end up being wrong, and yes, maybe I should have done a
> >two-level one. The good news is that we can trivially fix it later (even
> >dynamically - we can make the "sha1 object tree layout" be a per-tree
> >config option, and there would be no real issue, so you could make small
> >projects use a flat version and big projects use a very deep structure
> >etc). You'd just have to script some renames to move the files around.
>
> It depends on how many eco-system shell scripts get built that need to
> know about the layout ... if some shell/perl "libraries" encode this
> filename layout (and people use them) ... then switching later would
> indeed be painless.

FWIW, my short-term plans include support for monotone-like hash ID
shortening - it's enough to use the shortest leading unique part of the
ID to identify the revision. I will poke into the object repository for
that. I also already have Randy Dunlap's git lsobj, which will list all
objects of a specified type (very useful especially when looking for
orphaned commits and such rather lowlevel work).
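
The idea of the shortest leading unique part can be sketched as follows, assuming the known IDs are available as a plain list, one per line. This is only an illustration with made-up names, not the planned git-pasky code, and it rescans the whole list for every prefix length.

```sh
# shortest_prefix ID LIST_FILE: print the shortest leading part of ID that no
# other ID in LIST_FILE starts with. (Hypothetical helper.)
shortest_prefix() {
    _id=$1 _list=$2 _prefix='' _rest=$1
    while [ -n "$_rest" ]; do
        _tail=${_rest#?}
        _prefix=$_prefix${_rest%"$_tail"}    # grow the prefix by one character
        _rest=$_tail
        _clash=0
        while IFS= read -r _other; do
            [ "$_other" = "$_id" ] && continue
            case "$_other" in
                "$_prefix"*) _clash=1; break ;;
            esac
        done < "$_list"
        if [ "$_clash" -eq 0 ]; then
            printf '%s\n' "$_prefix"
            return 0
        fi
    done
    return 1
}
```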

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

Christopher Li

Apr 10, 2005, 6:20:06 PM

On Sun, Apr 10, 2005 at 01:57:33PM -0700, Linus Torvalds wrote:
>
> > That way of thinking really doesn't work well here.
> >
> > I will have to look more closely at pasky's GIT toolkit
> > if I want to see an SCM style interface.
>
> Yes. You really should think of GIT as a filesystem, and of me as a
> _systems_ person, not an SCM person. In fact, I tend to detest SCM's. I
> think the reason I worked so well with BitKeeper is that Larry used to do
> operating systems. He's also a systems person, not really an SCM person.
> Or at least he's in between the two.
>

Yes, I was puzzled for a while about how to use git until I realized that
it is a versioned file system.

BTW, one thing I learned from ext3 is that it is very useful to have some
compatibility flags for future development. I think if we want to reserve some
room in the file format for further development of git, now is the right time
to do it, before it gets big, e.g. an optional variable-size header in "tree"
including format version, capabilities, etc. I can see the counter-argument
that it is not as important as for a real file system, because it is a lot
easier to take it offline to upgrade the whole tree.

On the other hand, it costs almost nothing in terms of space and CPU time:
most directories do not reach a file system block boundary, so a few extra
bytes are almost free. If carefully planned, it will make future upgrades of
git a lot smoother.

What do you think?

Chris

Petr Baudis

Apr 10, 2005, 6:30:13 PM

Dear diary, on Sun, Apr 10, 2005 at 10:38:11PM CEST, I got a letter
where Linus Torvalds <torv...@osdl.org> told me that...

> On Sun, 10 Apr 2005, Petr Baudis wrote:
> >
> > It turns out to be the forks for doing all the cuts and such that are
> > bogging it down so awfully (doing diff-tree takes 0.48s ;-). I do about
> > 15 forks per change, I guess, and for some reason cut takes a lot of
> > time on its own.
>
> Heh.
>
> Can you pull my current repo, which has "diff-tree -R" that does what the
> name suggests, and which should be faster than the 0.48 sec you see..

Funnily enough, now after some more cache teasing it's ~0.185. Yours is
still ~0.17, though. :/ (That might be because of the format changes,
though, since you do less printing now.) (BTW, all those measurements
are done on my AMD K6 running at 1600MHz, 512M RAM, about 200M available
for caches.)

Just out of interest, did you have a look at my diff-tree -r
implementation and decide that you don't like it, or were you not aware
of it?

I will probably take most of your diff-tree change, but I'd prefer to do
the sha1->tree mapping directly in diff_tree().

> It may not matter a lot, since actually generating the diff from the file
> contents is what is expensive, but remember my goal: I want the expense of
> a diff-tree to be relative to the size of the diff, so that implies that
> small diffs have to be basically instantaneous. So I care.

Me too, of course.

> So I just tried the 2.6.7->2.6.8 diff, and for me the new recursive
> "diff-tree" can generate the _list_ of files changed in zero time:
>
> real 0m0.079s
> user 0m0.067s
> sys 0m0.024s
>
> but then _doing_ the diff is pretty expensive (in this case 3800+ files
> changed, so you have to unpack 7600+ objects - and even unpacking isn't
> the expensive part, the expense is literally in the diff operation
> itself).
>
> Me, the stuff I automate is the small steps. Doing a single checkin. So
> that's the case I care about going fast, when a "diff-tree" will likely
> have maybe five files or something. That's why I want the small
> incremental cases to go fast - if it takes me a minute to generate a diff
> for a _release_, that's not a big deal. I make one release every other
> month, but I work with lots of small patches all the time.

I see.

> Anyway, with a fast diff-tree, you should be able to generate the list of
> objects for a fast "merge". That's next.
>
> (And by "merge", I of course mean "suck". I'm talking about the old CVS
> three-way merge, and you have to specify the common parent explicitly and
> it won't handle any renames or any other crud. But it would get us to
> something that might actually be useful for simple things. Which is why
> "diff-tree" is important - it gives the information about what to tell
> merge).

I currently already do a merge when you track someone's source - it will
throw away your previous HEAD record though, so if you committed some
local changes after the previous pull, you will get orphaned commits and
the changes will turn to uncommitted ones. I have some ideas regarding
how to do it properly (and do any arbitrary merging, for that matter), I
hope to get to it as soon as I catch up with you. :-)

BTW, the three-way merge comes from RCS. That reminds me, is there any
tool which will take .rej files and throw them into the file to create
rcsmerge-like conflicts? Perhaps it's the fault of my bad tools, but I
prefer working with inline rejects much more than with .rej files (except
for actually noticing the rejects).

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

Linus Torvalds

Apr 10, 2005, 6:40:11 PM

On Sun, 10 Apr 2005, Christopher Li wrote:
>
> BTW, one thing I learned from ext3 is that it is very useful to have some
> compatibility flags for future development. I think if we want to reserve some
> room in the file format for further development of git

Way ahead of you.

This is (one reason) why all git objects have the type embedded inside of
them. The format of all objects is totally regular: they are all
compressed with zlib, they are all named by the sha1 file, and they all
start out with a magic header of "<typename> <typesize><nul byte>".
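
To illustrate the header layout, the sketch below builds "<typename> <size><nul byte>" in front of a file's contents and hashes the result. Note the assumptions: hashing the uncompressed header plus contents is how later versions of git name objects, and the code discussed in this thread may compute the name differently; the zlib compression step is left out entirely. It relies on the coreutils sha1sum tool, and object_id is a made-up name.

```sh
# object_id TYPE FILE: sha1 over "<type> <size>\0" followed by FILE's contents.
# (Hypothetical helper; not necessarily how the April 2005 code names objects.)
object_id() {
    _size=$(wc -c < "$2") || return 1
    _size=$((_size + 0))     # drop any padding some wc versions add
    { printf '%s %d\0' "$1" "$_size"; cat "$2"; } | sha1sum | cut -d' ' -f1
}
```

A new object type would simply use a different first argument, which is the extensibility described above.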

So if I want to create a new kind of tree object that does the same thing
as the old one but has some other layout, I'd just call it something else.
Like "dir". That was what I initially planned to do about the change to
recursive tree objects, but it turned out to actually be a lot easier to
just encode it in the old type (that way the routines that read it don't
even have to care about old/new types - it's all the same to them).

Linus

Petr Baudis

Apr 10, 2005, 6:40:10 PM

Dear diary, on Sun, Apr 10, 2005 at 08:42:53PM CEST, I got a letter
where Christopher Li <lk...@chrisli.org> told me that...

> I totally agree that the odds are really, really small.
> That is why it is not worth handling the case. People who hit it
> can just add a newline or something to avoid it, if
> it happens at all.
>
> It is a little peace of mind to know for sure that it did
> not happen. I am just paranoid.

BTW, I've merged the check to git-pasky some time ago, you can disable
it in the Makefile. It is by default on now, until someone convinces me
it actually affects performance measurably.

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

Christopher Li

Apr 10, 2005, 7:10:07 PM

On Sun, Apr 10, 2005 at 03:38:39PM -0700, Linus Torvalds wrote:
>
>
> On Sun, 10 Apr 2005, Christopher Li wrote:
> >
> > BTW, one thing I learned from ext3 is that it is very useful to have some
> > compatibility flags for future development. I think if we want to reserve some
> > room in the file format for further development of git
>
> Way ahead of you.
>
> This is (one reason) why all git objects have the type embedded inside of
> them. The format of all objects is totally regular: they are all
> compressed with zlib, they are all named by the sha1 file, and they all
> start out with a magic header of "<typename> <typesize><nul byte>".
>
> So if I want to create a new kind of tree object that does the same thing
> as the old one but has some other layout, I'd just call it something else.
> Like "dir". That was what I initially planned to do about the change to
> recursive tree objects, but it turned out to actually be a lot easier to
> just encode it in the old type (that way the routines that read it don't
> even have to care about old/new types - it's all the same to them).

Ha, that is right. You putting the new type into the same object tricked me
into thinking I had to do it the same way. I totally forgot that I can
introduce new types of objects. It is even cleaner. Cool.

How about deleting trees from the caches? I don't need to delete stuff from
the official tree. It is more for my local version control.
Here is the use case:
- I check out git.git.
- I use quilt to build my series of patches, git-hack1, git-hack2.. git-hack6.
Let's say those are stored in the git cache as well.
- I pick some of them and come up with a clean one, "submit.patch".
- submit.patch gets merged into the official git tree.
- Now I want to get rid of hack1 to hack6, but how?

One way to do it is to never commit hack1 to hack6 into git or the cache.
They stay as quilt patches only. But it is very tempting to let quilt use git
instead of the .pc/ directory; quilt could then be reduced to a particular
use of patch and git.

Chris

Bernd Eckenfels

Apr 10, 2005, 7:10:08 PM

In article <2005041011190...@engr.sgi.com> you wrote:
> (I repeat the xxx in the leaf name - easier to code.)

It is a bit OT, but just a note: there are file systems (hash functions) out
there that don't like a lot of similarly named files. For example, NTFS with
its 8.3 short names.

Greetings
Bernd

Linus Torvalds

Apr 10, 2005, 7:20:10 PM

On Mon, 11 Apr 2005, Petr Baudis wrote:
>
> I currently already do a merge when you track someone's source - it will
> throw away your previous HEAD record though

Not only that, it doesn't do what I consider a "merge".

A real merge should have two or more parents. The "commit-tree" command
already allows that: just add any arbitrary number of "-p xxxxxxxxx"
switches (well, I think I limited it to 16 parents, but that's just a
totally random number, there's nothing in the file format or anything
else that limits it).
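
A small sketch of how a wrapper script might turn a list of parent IDs into those "-p" switches. The helper name is made up, the limit of 16 is the one mentioned above, and the usage line in the comment assumes the changelog is fed to "commit-tree" on stdin.

```sh
# parent_args SHA1...: print "-p <sha1> " for each parent, refusing more than
# the 16 parents mentioned above. (Hypothetical helper.)
parent_args() {
    if [ "$#" -gt 16 ]; then
        echo "too many parents: $#" >&2
        return 1
    fi
    for _p in "$@"; do
        printf '%s %s ' -p "$_p"
    done
}

# Assumed usage, not run here; the expansion is left unquoted on purpose so
# that it splits into separate arguments:
#   commit-tree "$tree" $(parent_args "$mine" "$theirs") < changelog
```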

So while you've merged my "data", you've not actually merged my
revision history in your tree.

And the reason a real merge _has_ to show both parents properly is that
unless you do that, you can never merge sanely another time without
getting lots of clashes from the previous merge. So it's important that a
merge really shows both trees it got data from.

This is, btw, also the reason I haven't merged with your tree - I want to
get to the point where I really _can_ merge without throwing away the
information. In fact, at this point I'd rather not merge with your tree at
all, because I consider your tree to be "corrupt" thanks to lacking the
merge history.

So you've done the data merge, but not the history merge.

And because you didn't do the history merge, there's no way to
automatically find out what point of my tree you merged _with_. See?

And since I have no way to see what point in time you merged with me, now
I can't generate a nice 3-way diff against the last common ancestor of
both of our trees.

So now I can't do a three-way merge with you based on any sane ancestor,
unless I start guessing which ancestor of mine you merged with. Now, that
"guess" is easy enough to do with a project like "git" which currently has
just a few tens of commits and effectively only two parallel development
trees, but the whole point is to get to a system where that isn't true..

Linus

Linus Torvalds

Apr 10, 2005, 7:30:12 PM

On Sun, 10 Apr 2005, Christopher Li wrote:
>
> How about deleting trees from the caches? I don't need to delete stuff from
> the official tree. It is more for my local version control.

I have a plan. Namely to have a "list-needed" command, which you give one
commit, and a flag implying how much "history" you want (*), and then it
spits out all the sha1 files it needs for that history.

Then you delete all the other ones from your SHA1 archive (easy enough to
do efficiently by just sorting the two lists: the list of "needed" files
and the list of "available" files).

Script that, and call the command "prune-tree" or something like that, and
you're all done.

(*) The amount of history you want might be "none", which is to say that
you don't want to go back in time, so you want _just_ the list of tree and
blob objects associated with that commit.

Or you might want a "linear" history, which would be the longest path
through the parent changesets to the root.

Or you might want "all", which would follow all parents and all trees.

Or you might want to prune the history tree by date - "give me all
history, but cut it off when you hit a parent that was done more than 6
months ago".

This "list-needed" thing is not just for pruning history either. If you
have a local tree "x", and you want to figure out how much of it you need
to send to somebody else who has an older tree "y", then what you'd do is
basically "list-needed x" and remove the set of "list-needed y". That
gives you the answer to the question "what's the minimum set of sha1 files
I need to send to the other guy so that he can re-create my top-of-tree".
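
Both the pruning step and the "what do I need to send" step come down to a set difference over sorted lists of sha1 names. A minimal sketch, assuming the lists already exist as plain files with one name per line ("list-needed" itself is only a plan at this point, so nothing here calls it); the function name is made up.

```sh
# only_in_first A B: names listed in file A but not in file B.
# (Hypothetical helper; the inputs stand in for "list-needed" output.)
only_in_first() (
    a=$(mktemp) && b=$(mktemp) || exit 1
    LC_ALL=C sort -u "$1" > "$a"
    LC_ALL=C sort -u "$2" > "$b"
    LC_ALL=C comm -23 "$a" "$b"      # keep lines unique to the first list
    rm -f "$a" "$b"
)

# Pruning:  only_in_first available needed      -> objects that can be deleted
# Sending:  only_in_first needed-x needed-y     -> objects the other side lacks
```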

My second plan is to make somebody else so fired up about the problem that
I can just sit back and take patches. That's really what I'm best at.
Sitting here, in the (rain) on the patio, drinking a foofy tropical drink,
and pressing the "apply" button. Then I take all the credit for my
incredible work.

Hint, hint.

Linus

Paul Jackson

Apr 10, 2005, 7:30:11 PM

Useful explanation - thanks, Linus.

Is this picture and description accurate:

==============================================================

    < working directory files (foo.c) >
        ^                        |
        |  upward ops            |  downward ops
        |  ----------            |  ------------
        |  checkout-cache        |  update-cache
        |  show-diff             v
    < current directory cache (".dircache/index") >
        ^                        |
        |  upward ops            |  downward ops
        |  ----------            |  ------------
        |  read-tree             |  write-tree
        |                        |  commit-tree
        |                        v
    < git filesystem (blobs, trees, commits: .dircache/{HEAD,objects}) >

==============================================================

The checkout-cache and show-diff ops read their meta-data from
the cache, and the actual file contents from the git filesystem.
Similarly, the update-cache op writes meta-data into the cache,
and may create new files in the git filesystem.

The cache (but not the git filesystem) stores transient
information (ctime, mtime, dev, ino, uid, gid, and size)
about each working file update-cache has copied into the git
filesystem so that checkout-cache and show-diff can detect
changes in the contents of working files just from a stat,
without actually rereading the file.
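
A rough sketch of that stat-only check, assuming the cache entry is just a dict of the seven fields named above. The helper names `snapshot` and `looks_changed` are hypothetical, and the real index is a binary file, not a Python dict.

```python
import os

# The per-file fields the description above says the cache records.
STAT_FIELDS = ("st_ctime", "st_mtime", "st_dev", "st_ino",
               "st_uid", "st_gid", "st_size")


def snapshot(path):
    """Record the stat fields for `path`, as a cache entry would."""
    st = os.stat(path)
    return {field: getattr(st, field) for field in STAT_FIELDS}


def looks_changed(path, recorded):
    """Compare a fresh stat against the recorded one, without reading the file."""
    return snapshot(path) != recorded
```

A mismatch only says the file may have changed; deciding what changed still needs the contents from the object store.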

In some sense, the cache holds the git filesystem inodes,
and the git filesystem holds the data blocks. Except that:
(1) the cache just holds the current "view" into the git
filesystem,
(2) objects in the filesystem have an "inode" number (their
<sha1> value) that is persistent whether in view or not,
(3) objects in the filesystem are not removed just because
nothing in the cache references them,
(4) objects in the filesystem can reference other objects,
that are typically also in the filesystem, but that can
still be reliably self-identified even if found in the
wild of say one's email inbox, and
(5) the view in the directory cache can itself be made into
a filesystem object - using commit-tree.

==============================================================

Minor question:

I must have an old version - I got 'git-0.03', but
it doesn't have 'checkout-cache', and its 'read-tree'
directly writes my working files.

How do I get a current version? Well, one way I see,
and that's to pick up Pasky's:

http://pasky.or.cz/~pasky/dev/git/git-pasky-base.tar.bz2

Perhaps that's the best way?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

Paul Jackson

Apr 10, 2005, 7:40:09 PM

Petr wrote:
> That reminds me, is there any
> tool which will take .rej files and throw them into the file to create
> rcsmerge-like conflicts?

Check out 'wiggle'
http://www.cse.unsw.edu.au/~neilb/source/wiggle/

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

Petr Baudis

Apr 10, 2005, 7:40:11 PM

Dear diary, on Mon, Apr 11, 2005 at 01:10:58AM CEST, I got a letter

where Linus Torvalds <torv...@osdl.org> told me that...
>
>

> On Mon, 11 Apr 2005, Petr Baudis wrote:
> >
> > I currently already do a merge when you track someone's source - it will
> > throw away your previous HEAD record though
>
> Not only that, it doesn't do what I consider a "merge".
>
> A real merge should have two or more parents. The "commit-tree" command
> already allows that: just add any arbitrary number of "-p xxxxxxxxx"
> switches (well, I think I limited it to 16 parents, but that's just a
> totally random number, there's nothing in the file format or anything
> else that limits it).
>
> So while you've merged my "data", you've not actually merged my
> revision history in your tree.

Well, that's exactly what I was (am) going to do. :-) That's also why I
said that I (virtually) throw the local commits away now. Instead, if
there were any local commits, I will do git merge:

commit-tree $(write-tree) -p $local_head -p $tracked_tree

Note that I will need to make this two-phase - first applying the
changes, then doing the commit; between those two phases, the user
should resolve potential conflicts and check if the merge went right.

I think I will name the first phase git merge and the second phase will
be just git commit, and I will store the merge information in
.dircache/. (BTW, I think the directory name is pretty awful; what about
.git/ ?)

> And the reason a real merge _has_ to show both parents properly is that
> unless you do that, you can never merge sanely another time without
> getting lots of clashes from the previous merge. So it's important that a
> merge really shows both trees it got data from.
>
> This is, btw, also the reason I haven't merged with your tree - I want to
> get to the point where I really _can_ merge without throwing away the
> information. In fact, at this point I'd rather not merge with your tree at
> all, because I consider your tree to be "corrupt" thanks to lacking the
> merge history.
>
> So you've done the data merge, but not the history merge.
>
> And because you didn't do the history merge, there's no way to
> automatically find out what point of my tree you merged _with_. See?
>
> And since I have no way to see what point in time you merged with me, now
> I can't generate a nice 3-way diff against the last common ancestor of
> both of our trees.
>
> So now I can't do a three-way merge with you based on any sane ancestor,
> unless I start guessing which ancestor of mine you merged with. Now, that
> "guess" is easy enough to do with a project like "git" which currently has
> just a few tens of commits and effectively only two parallel development
> trees, but the whole point is to get to a system where that isn't true..

Well, I've wanted to get the basic things working first before doing git
merge. (Especially since until recently, diff-tree was PITA to work
with, and before that it didn't even exist.) If you want, I can rebuild
my tree with doing the merging properly, after I have git merge working.

(BTW, it would be useful to have a tool which just blindly takes what
you give it on input and throws it to an object of given type; I will
need to construct arbitrary commits during the rebuild if I'm to keep
the correct dates.)

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

Linus Torvalds

Apr 10, 2005, 7:50:07 PM

On Sun, 10 Apr 2005, Paul Jackson wrote:
>

> Useful explanation - thanks, Linus.

Hey. You're welcome. Especially when you create good documentation for
this thing.

Because:

> Is this picture and description accurate:

[ deleted, but I'll probably try to put it in an explanation file
somewhere ]

Yes. Excellent.

> Minor question:
>
> I must have an old version - I got 'git-0.03', but
> it doesn't have 'checkout-cache', and its 'read-tree'
> directly writes my working files.

Yes. Crappy old tree, but it can still read my git.git directory, so you
can use it to update to my current source base.

However, from a usability angle, my source-base really has been
concentrating _entirely_ on just the plumbing, and if you actually want a
faucet or a toilet _connected_ to the plumbing, you're better off with
Pasky's tree, methinks:

> How do I get a current version? Well, one way I see,
> and that's to pick up Pasky's:
>
> http://pasky.or.cz/~pasky/dev/git/git-pasky-base.tar.bz2
>
> Perhaps that's the best way?

Indeed. He's got a number of shell scripts etc to automate the boring
parts.

Linus

Linus Torvalds

Apr 10, 2005, 7:50:08 PM

On Mon, 11 Apr 2005, Petr Baudis wrote:
>
> (BTW, it would be useful to have a tool which just blindly takes what
> you give it on input and throws it to an object of given type; I will
> need to construct arbitrary commits during the rebuild if I'm to keep
> the correct dates.)

Hah. That's what "COMMITTER_NAME" "COMMITTER_EMAIL" and "COMMITTER_DATE"
are there for.

There's two things to commits: when (and by whom) it was committed to a
tree, and when the changes were really done.

So set the COMMITTER_xxx things to the person/time you want to consider
the _original_ one, and let "commit-tree" author you as the creator of the
commit itself. The regular "ChangeLog" thing should only show the author
and original time, but it's nice to see who created the commit itself.

I did this very much on purpose: see how I always try to attribute
authorship in BK to the person who actually wrote the code. At the same
time, I think it's interesting from a tracking standpoint to also see
when/where that change got introduced into a tree.

I _tried_ to get this right in the sparse tree conversion. I won't
guarantee that it's all correct, but the top commit in the sparse tree
looks like this:

tree 67607f05a66e36b2f038c77cfb61350d2110f7e8
parent 9c59995fef9b52386e5f7242f44720a7aca287d7
author Christopher Li <spa...@chrisli.org> Sat Apr 2 09:30:09 PST 2005
committer Linus Torvalds <torv...@ppc970.osdl.org> Thu Apr 7 20:06:31 2005

...

exactly because I tracked when I committed it to the sparse tree
_separately_ from tracking when it was created.

So when I re-create the sparse-tree, I'll also end up re-writing the
"committer" information. And that's proper. That's really saying "this
sha1 object was created by Xxxx at time Xxxx".
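
To make the author/committer split concrete, here is a small sketch that reads the header lines of a commit object laid out like the one shown above. It assumes the header ends at the first blank line and that a merge simply carries several "parent" lines; `parse_commit_header` is a hypothetical helper, not a git command.

```python
def parse_commit_header(text):
    """Split the header lines of a commit object into fields."""
    info = {"tree": None, "parents": [], "author": None, "committer": None}
    for line in text.splitlines():
        if not line:
            break  # assumed: a blank line ends the header, the message follows
        key, _, value = line.partition(" ")
        if key == "parent":
            info["parents"].append(value)  # zero, one or many parents
        elif key in info:
            info[key] = value
    return info
```

The "author" field answers "who made the change, and when", while "committer" answers "who created this sha1 object, and when".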

Btw, the "COMMITTER_xxxx" environment variables are very confusingly
named. They actually go into the _author_ line in the commit object. I'm a
total retard, and I really don't know why I called it "COMMITTER_xxx"
instead of "AUTHOR_xxx".

Linus "retard" Torvalds

Petr Baudis

Apr 10, 2005, 8:00:25 PM

Dear diary, on Mon, Apr 11, 2005 at 01:46:50AM CEST, I got a letter

where Linus Torvalds <torv...@osdl.org> told me that...
>
>

> On Mon, 11 Apr 2005, Petr Baudis wrote:
> >
> > (BTW, it would be useful to have a tool which just blindly takes what
> > you give it on input and throws it to an object of given type; I will
> > need to construct arbitrary commits during the rebuild if I'm to keep
> > the correct dates.)
>
> Hah. That's what "COMMITTER_NAME" "COMMITTER_EMAIL" and "COMMITTER_DATE"
> are there for.
>
> There's two things to commits: when (and by whom) it was committed to a
> tree, and when the changes were really done.
>
> So set the COMMITTER_xxx things to the person/time you want to consider
> the _original_ one, and let "commit-tree" author you as the creator of the
> commit itself. The regular "ChangeLog" thing should only show the author
> and original time, but it's nice to see who created the commit itself.

I already use those - look at my ChangeLog. (That's because for certain
reasons I'm working on git in a half-broken chrooted environment.)

When rebuilding the tree from scratch, I wanted to do it
transparently - that is, so that no one could notice that I rebuilt it,
since it effectively still _is_ the original tree from the data
standpoint, just the history flow is actually correct this time.

> Btw, the "COMMITTER_xxxx" environment variables are very confusingly
> named. They actually go into the _author_ line in the commit object. I'm a
> total retard, and I really don't know why I called it "COMMITTER_xxx"
> instead of "AUTHOR_xxx".

So, who will fix it in his tree first! ;-)

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

Petr Baudis

Apr 10, 2005, 8:00:21 PM

Dear diary, on Sun, Apr 10, 2005 at 11:39:02PM CEST, I got a letter

where Linus Torvalds <torv...@osdl.org> told me that...

> On Sun, 10 Apr 2005, Linus Torvalds wrote:
> >
> > Can you pull my current repo, which has "diff-tree -R" that does what the
> > name suggests, and which should be faster than the 0.48 sec you see..
>
> Actually, I changed things around. Everybody hated the "<" ">" lines, so I
> put a changed thing on a line of its own with a "*" instead.
>
> So you'd now see lines like
>
> *100644->100644 1874e031abf6631ea51cf6177b82a1e662f6183e->e8181df8499f165cacc6a0d8783be7143013d410 CREDITS
>
> which means that the CREDITS file has changed, and it shows you the mode
> -> mode transition (that didn't change in this case) and the sha1 -> sha1
> transition.
>
> So now it's always just one line per change. Furthermore, the filename is
> always field 3, if you use spaces as delimiters, regardless of whether
> it's a +/-/* field.

That's great, just when I finally managed to properly fix the xargs
boundary case in gitdiff-do (without throwing away the NUL-termination).
You know how to please people! ;-)

(Not that I'd have *anything* against the change. The logic is simpler
and you'll be actually able to work with diff-tree a little sanely.)

BTW, it is quite handy to have the entry type in the listing (guessing
that from mode in the script just doesn't feel right and doing explicit
cat-file kills the performance). I would also really prefer the fields
separated by tabs. It looks nicer on the screen (aligned, e.g. modes and
type are varsized), and is also easier to parse (cut defaults to tabs as
delimiters, for example).
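
For reference, a sketch of parsing the one-line-per-change format quoted above, using spaces as delimiters. It assumes the "+" and "-" lines carry a single mode and sha1 in fields 1 and 2, which is inferred from the "filename is always field 3" remark; `parse_diff_tree_line` is a hypothetical helper.

```python
def parse_diff_tree_line(line):
    """Parse one diff-tree record ('+', '-' or '*') into a dict."""
    kind, rest = line[0], line[1:]
    # Split at most twice so a filename containing spaces stays whole.
    mode, sha, name = rest.split(" ", 2)
    if kind == "*":
        old_mode, new_mode = mode.split("->")
        old_sha, new_sha = sha.split("->")
    elif kind == "+":
        old_mode, old_sha, new_mode, new_sha = None, None, mode, sha
    elif kind == "-":
        old_mode, old_sha, new_mode, new_sha = mode, sha, None, None
    else:
        raise ValueError("unrecognised diff-tree line: %r" % line)
    return {"kind": kind, "name": name,
            "old_mode": old_mode, "new_mode": new_mode,
            "old_sha": old_sha, "new_sha": new_sha}
```

With tab-separated fields, only the `split(" ", 2)` call would need to change.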

> So let's say you want to merge two trees (dst1 and dst2) from a common
> parent (src), what you would do is:
>
> - get the list of files to merge:
>
> diff-tree -R <dst1> <dst2> | tr '\0' '\n' > merge-files

...oh, I probably forgot to ask - why did you choose -R instead of -r?
It looks rather alien to me; if it starts by 'diff', my hand writes -r
without thinking.

> - Which of those were changed by <src> -> <dstX>?
>
> diff-tree -R <src> <dst1> | tr '\0' '\n' | join -j 3 - merge-files > dst1-change
> diff-tree -R <src> <dst2> | tr '\0' '\n' | join -j 3 - merge-files > dst2-change
>
> - Which of those are common to both? Let's see what the merge list is:
>
> join dst1-change dst2-change > merge-list
>
> and hopefully you'd usually be working on a very small list of files by
> then (everything else you'd just pick from one of the destination trees
> directly - you've got the name, the sha-file, everything: no need to even
> look at the data).

Ok, this looks reasonable. (Provided that I DWYM regarding the joins.)
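
The join steps quoted above amount to a few set intersections on filenames. A minimal sketch, assuming each diff-tree run has already been reduced to a list of the filenames it reported; `merge_list` is a hypothetical name, and sets stand in for join(1) on sorted files.

```python
def merge_list(diff_dst1_dst2, diff_src_dst1, diff_src_dst2):
    """Return the files that need a real content merge.

    Each argument is an iterable of filenames reported by one diff-tree
    run: <dst1> vs <dst2>, <src> vs <dst1>, and <src> vs <dst2>.
    """
    differing = set(diff_dst1_dst2)                   # "merge-files"
    dst1_change = differing & set(diff_src_dst1)      # "dst1-change"
    dst2_change = differing & set(diff_src_dst2)      # "dst2-change"
    return sorted(dst1_change & dst2_change)          # "merge-list"
```

Anything in `differing` but not in the result changed on one side only, so it can be taken directly from that side's tree.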

> Does this sound sane? Pasky? Wanna try a "git merge" thing? Starting off
> with the user having to tell what the common parent tree is - we can try
> to do the "automatically find best common parent" crud later. THAT may be
> expensive.

I will definitely try "git merge", but maybe not this night anymore
(it's already 1:32 here now).

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

Randy.Dunlap

Apr 10, 2005, 8:20:07 PM

On Sun, 10 Apr 2005 16:23:11 -0700 Paul Jackson wrote:

or Chris Mason's 'rej' program:
ftp://ftp.suse.com/pub/people/mason/rej/

---
~Randy

Petr Baudis

Apr 10, 2005, 8:20:08 PM

Dear diary, on Mon, Apr 11, 2005 at 01:14:57AM CEST, I got a letter
where Paul Jackson <p...@engr.sgi.com> told me that...

> Useful explanation - thanks, Linus.
>
> Is this picture and description accurate:
>
> ==============================================================
>
>
>         < working directory files (foo.c) >
>           ^
>           ^                                |
>           | upward ops     | downward ops  |
>           | ----------     | ------------  |
>           | checkout-cache | update-cache  |
>           | show-diff      |               v
>                                            v
>         < current directory cache (".dircache/index") >
>           ^
>           ^                                |
>           | upward ops     | downward ops  |
>           | ----------     | ------------  |
>           | read-tree      | write-tree    |
>           |                | commit-tree   |
>           |                                v
>                                            v
>         < git filesystem (blobs, trees, commits: .dircache/{HEAD,objects}) >

Well, except that from a purely technical standpoint commit-tree has
no place in this picture - it creates a new object in the git
filesystem based on its input data, regardless of the directory
cache or current tree. It probably still belongs where it is from the
workflow standpoint, though.

..snip..

> Minor question:
>
> I must have an old version - I got 'git-0.03', but
> it doesn't have 'checkout-cache', and its 'read-tree'
> directly writes my working files.
>
> How do I get a current version? Well, one way I see,
> and that's to pick up Pasky's:
>
> http://pasky.or.cz/~pasky/dev/git/git-pasky-base.tar.bz2
>
> Perhaps that's the best way?

You can take mine, and do:

git pull pasky
git pull linus
cp .dircache/HEAD .dircache/HEAD.local

Now, your tree and git filesystem are up to date.

git track local

Now, when you do git pull pasky, your working tree will not be updated
automatically anymore.

git track linus

Now, you start tracking Linus' tree instead. Note that the initial
update will blow away the scripts in your current tree, so before you do
the last two steps you will probably want to clone the tree and set PATH
to the one still tracking me, so you get all the comfort. ;-)

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

Paul Jackson

Apr 10, 2005, 8:30:10 PM

Linus writes:
> Hey. You're welcome. Especially when you create good documentation for
> this thing.

Glad to be of service. Sounds like the umbrella in your foofy
drink will come in handy - keeping off the rain.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

Petr Baudis

Apr 10, 2005, 8:30:11 PM

Dear diary, on Mon, Apr 11, 2005 at 02:20:52AM CEST, I got a letter

where Linus Torvalds <torv...@osdl.org> told me that...

> Btw, does anybody have strong opinions on the license? I didn't put in a
> COPYING file exactly because I was torn between GPLv2 and OSL2.1.
>
> I'm inclined to go with GPLv2 just because it's the most common one, but I
> was wondering if anybody really had strong opinions. For example, I'd
> really make it "v2 by default" like the kernel, since I'm sure v3 will be
> fine, but regardless of how sure I am, I'm _not_ a gambling man.

Oh, I wanted to ask about this too. I'd mostly prefer GPLv2 (I have no
problem with the version restriction, I usually do it too), it's the one
I'm mostly familiar with and OSL appears to be incompatible with GPL (at
least FSF says so about OSL1.0), which might create various annoying
issues. I hate when licenses get in my way and prevent me from
including some useful code.

Linus Torvalds

Apr 10, 2005, 8:30:12 PM

Btw, does anybody have strong opinions on the license? I didn't put in a
COPYING file exactly because I was torn between GPLv2 and OSL2.1.

I'm inclined to go with GPLv2 just because it's the most common one, but I
was wondering if anybody really had strong opinions. For example, I'd
really make it "v2 by default" like the kernel, since I'm sure v3 will be
fine, but regardless of how sure I am, I'm _not_ a gambling man.

Linus

Petr Baudis

Apr 10, 2005, 8:40:12 PM

Dear diary, on Sun, Apr 10, 2005 at 10:38:11PM CEST, I got a letter

where Linus Torvalds <torv...@osdl.org> told me that...

..snip..

> Can you pull my current repo, which has "diff-tree -R" that does what the
> name suggests, and which should be faster than the 0.48 sec you see..

Am I just missing something, or your diff-tree doesn't handle
added/removed directories?

(Mine does! *hint* *hint* It also doesn't bother with dynamic
allocation, but someone might consider the static path buffer ugly.
Anyway, I hacked it with a plan to do a massive cleanup of the file
later.)

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

Christopher Li

Apr 10, 2005, 8:40:08 PM

I see. It just needs some basic set operations (+, -, and)
and some way to select a set:

sha5--->
/
/
sha1-->sha2-->sha3--
\ /
\ /
>sha4

list sha1 # all the files in changeset sha1
# {sha1}
list sha1,sha1 # same as above
list sha1,sha2 # all the files in between changeset sha1
# and changeset sha2
# {sha1, sha2} in the example
list sha1,sha3 # {sha1, sha2, sha3, sha4}

list sha1,any # all the changesets reachable from sha1
# {sha1, ... sha5, ...}

new sha1,sha2 # all the files added between sha1 and sha2 (+)
changed sha1,sha2 # all the files changed between sha1 and sha2 (>) (<)
deleted sha1,sha2 # all the files deleted between sha1 and sha2 (-)

before time # all the files before time
after time # all the files after time

So in my example, the files I want to delete are:

{list hack1, base}+ {list hack2, base} ... {list hack6, base} \
- [list official_merge, base ]
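
One way to read the "list a,b" selector is "every changeset that is an ancestor of b and a descendant of a, both ends included". The sketch below follows that reading. The `parents` map is a hypothetical stand-in for the commit objects, and the parent links used to try it out are only one guess at the diagram above.

```python
def ancestors(parents, start):
    """Return `start` plus every changeset reachable through parent links."""
    seen = set()
    stack = [start]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        stack.extend(parents[name])
    return seen


def list_between(parents, old, new):
    """Changesets on some path from `old` to `new`, both ends included."""
    return {c for c in ancestors(parents, new)
            if old in ancestors(parents, c)}
```

With these two helpers, the "+" and "-" of the selectors are ordinary set union and difference on the returned sets.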

On Sun, Apr 10, 2005 at 04:21:08PM -0700, Linus Torvalds wrote:
>
>
> > the official tree. It is more for my local version control.
>
> I have a plan. Namely to have a "list-needed" command, which you give one
> commit, and a flag implying how much "history" you want (*), and then it
> spits out all the sha1 files it needs for that history.
>
> Then you delete all the other ones from your SHA1 archive (easy enough to
> do efficiently by just sorting the two lists: the list of "needed" files
> and the list of "available" files).
>
> Script that, and call the command "prune-tree" or something like that, and
> you're all done.
>
> (*) The amount of history you want might be "none", which is to say that
> you don't want to go back in time, so you want _just_ the list of tree and
> blob objects associated with that commit.

That will be {list head}

>
> Or you might want a "linear" history, which would be the longest path
> through the parent changesets to the root.

That will be {list head,root}

>
> Or you might want "all", which would follow all parents and all trees.

That will be {list any, root}

>
> Or you might want to prune the history tree by date - "give me all
> history, but cut it off when you hit a parent that was done more than 6
> months ago".

That is {after -6month }

>
> This "list-needed" thing is not just for pruning history either. If you
> have a local tree "x", and you want to figure out how much of it you need
> to send to somebody else who has an older tree "y", then what you'd do is
> basically "list-needed x" and remove the set of "list-needed y". That
> gives you the answer to the question "what's the minimum set of sha1 files
> I need to send to the other guy so that he can re-create my top-of-tree".
>

That is {list x, any} - {list y, any}

> My second plan is to make somebody else so fired up about the problem that
> I can just sit back and take patches. That's really what I'm best at.
> Sitting here, in the (rain) on the patio, drinking a foofy tropical drink,
> and pressing the "apply" button. Then I take all the credit for my
> incredible work.

Sounds like a good plan.

Chris

Linus Torvalds

Apr 10, 2005, 9:20:07 PM

On Mon, 11 Apr 2005, Petr Baudis wrote:
>

> Dear diary, on Sun, Apr 10, 2005 at 10:38:11PM CEST, I got a letter
> where Linus Torvalds <torv...@osdl.org> told me that...
> ..snip..
> > Can you pull my current repo, which has "diff-tree -R" that does what the
> > name suggests, and which should be faster than the 0.48 sec you see..
>
> Am I just missing something, or your diff-tree doesn't handle
> added/removed directories?

You're not missing anything, I did it that way on purpose. I thought it
would be easier to do the expansion in the caller (who knows what it is
they want to do with the end result).

But now that I look at merging, I realize that was actually the wrong
thing to do. A merge algorithm definitely wants to see the expanded tree,
since it will compare/join several of the diff-tree output things.

So I'll either fix it or decide to just go with your version instead. I'm
not overly proud.

Linus

Petr Baudis

Apr 10, 2005, 10:10:07 PM

Hello,

here goes git-pasky-0.2, my set of patches and scripts upon
Linus' git, aimed at human usability and to an extent a SCM-like usage.

If you already have a previous git-pasky version, just git pull pasky
to get it. Otherwise, you can get it from:

http://pasky.or.cz/~pasky/dev/git/

Please see the README there and/or the parent post for detailed
instructions. You can find the changes from the last announcement
in the ChangeLog (releases have separate commits so you can find them
easily; they are also tagged for purpose of diffing etc).

This release contains mostly bugfixes, performance enhancements
(especially w.r.t. git diff), and some merges with Linus (except for
diff-tree, where I merged only the new output format). New features
are trivial - support for tagging and short SHA1 ids; you can use
only the start of the SHA1 hash long enough to be unambiguous.
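
A minimal sketch of how such a short id could be resolved, assuming the full ids are available as a list of hex strings. `resolve_short_id` is a hypothetical helper and says nothing about how the git-pasky scripts actually do it.

```python
def resolve_short_id(prefix, known_ids):
    """Expand an unambiguous SHA1 prefix to the full id."""
    matches = [full for full in known_ids if full.startswith(prefix)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise KeyError("no object id starts with %r" % prefix)
    raise ValueError("%r is ambiguous: %d ids match" % (prefix, len(matches)))
```

A prefix that matches several ids is rejected rather than guessed at, so the caller has to supply more digits.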

My immediate plan is implementing git merge, which I will do tomorrow,
if no one does it before then, that is. ;-)

Any feedback/opinions/suggestions/patches (especially patches) are
welcome.

Have fun,

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

Daniel Barkalow

Apr 10, 2005, 10:50:06 PM

On Mon, 11 Apr 2005, Petr Baudis wrote:

> Hello,
>
> here goes git-pasky-0.2, my set of patches and scripts upon
> Linus' git, aimed at human usability and to an extent a SCM-like usage.

Incidentally, the git-pasky-base tarball you have up has its checked-out
tree partway between 0.1 and 0.2, and doesn't compile. (The included HEAD
version in .dircache is fine, if the user has some way to bootstrap)

-Daniel
*This .sig left intentionally blank*

Nur Hussein

Apr 11, 2005, 1:30:10 AM

> Btw, does anybody have strong opinions on the license? I didn't put in a
> COPYING file exactly because I was torn between GPLv2 and OSL2.1.

I think GPLv2 would create the least amount of objection in the
community, so I'd probably want to go with that.

Nur Hussein

bert hubert

Apr 11, 2005, 3:00:15 AM

On Sun, Apr 10, 2005 at 03:38:39PM -0700, Linus Torvalds wrote:

> compressed with zlib, they are all named by the sha1 file, and they all

Now I know this is a conscious decision, but recent zlib allows you to write
out gzip content, at a cost of 14 bytes I think per file, by adding 32 to
the window size. This in turn would allow users to zcat your objects at
ease.

You get confirmation of completeness of the file for free, as gzip encodes
the length of the file at the end.

Perhaps something to consider.
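
A small sketch of the trade-off, using Python's zlib binding rather than the C API git itself calls. One caveat on the numbers: as far as zlib's documented interface goes, adding 16 to the window bits selects the gzip wrapper when compressing, while adding 32 is the decompression setting that auto-detects zlib or gzip input. The function names are hypothetical.

```python
import gzip
import zlib


def deflate_zlib(data):
    """Compress with the plain zlib wrapper."""
    return zlib.compress(data)


def deflate_gzip(data):
    """Compress with a gzip header and trailer, still using zlib itself."""
    # wbits = 16 + MAX_WBITS asks zlib to emit the gzip wrapper.
    compressor = zlib.compressobj(wbits=16 + zlib.MAX_WBITS)
    return compressor.compress(data) + compressor.flush()


def inflate_either(blob):
    """Decompress zlib or gzip input, letting zlib detect which it is."""
    return zlib.decompress(blob, wbits=32 + zlib.MAX_WBITS)
```

The gzip trailer stores a CRC-32 and the uncompressed length (modulo 2^32), which is where the completeness check comes from; the output of `deflate_gzip` can also be read back with `gzip.decompress` or zcat.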

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services

Christer Weinigel

Apr 11, 2005, 3:30:15 AM

bert hubert <a...@ds9a.nl> writes:

> On Sun, Apr 10, 2005 at 03:38:39PM -0700, Linus Torvalds wrote:
>
> > compressed with zlib, they are all named by the sha1 file, and they all
>
> Now I know this is a conscious decision, but recent zlib allows you to write
> out gzip content, at a cost of 14 bytes I think per file, by adding 32 to
> the window size. This in turn would allow users to zcat your objects at
> ease.
>
> You get confirmation of completeness of the file for free, as gzip encodes
> the length of the file at the end.

I would very much like it if git used normal gzip files with a .gz
extension. Doing it this way means that the compression methods can
be extended in the future. I.e:

ab/1234567890.gz gzip compressed
ab/1234567890.xd xdelta compressed

I find the xdelta encoding very attractive since it can probably
reduce the size of the repository drastically. A compression script
could run nightly and xdelta compress everything that's older than
a few months (to figure out what files to create the delta from, just
look at the commit files and compare the parent tree to the current
tree).

Of course, this means that a dumb wget won't work all that well to
synchronize two trees, but it might be worthwhile anyway.

/Christer

--
"Just how much can I get away with and still go to heaven?"

Freelance consultant specializing in device driver programming for Linux
Christer Weinigel <chri...@weinigel.se> http://www.weinigel.se

Ingo Molnar

Apr 11, 2005, 3:50:13 AM

* Linus Torvalds <torv...@osdl.org> wrote:

> Btw, does anybody have strong opinions on the license? I didn't put in
> a COPYING file exactly because I was torn between GPLv2 and OSL2.1.
>
> I'm inclined to go with GPLv2 just because it's the most common one,
> but I was wondering if anybody really had strong opinions. For
> example, I'd really make it "v2 by default" like the kernel, since I'm
> sure v3 will be fine, but regardless of how sure I am, I'm _not_ a
> gambling man.

is there any fundamental problem with going with v2 right now, and then
once v3 is out and assuming it looks ok, all newly copyrightable bits
(new files, rewrites, substantial contributions, etc.) get a v3
copyright? (and the collection itself would be v3 too) That method
wouldnt make it fully v3 automatically once v3 is out, but with time
there would be enough v3 bits in it to make it essentially v3. This way
we wouldnt have to blanket trust v3 before having seen it, and wouldnt
be stuck at v2 either.

Ingo

Florian Weimer

Apr 11, 2005, 4:50:11 AM

* Ingo Molnar:

> is there any fundamental problem with going with v2 right now, and then
> once v3 is out and assuming it looks ok, all newly copyrightable bits
> (new files, rewrites, substantial contributions, etc.) get a v3
> copyright? (and the collection itself would be v3 too) That method
> wouldnt make it fully v3 automatically once v3 is out, but with time
> there would be enough v3 bits in it to make it essentially v3.

Almost certainly, v3 will be incompatible with v2 because it adds
further restrictions. This means that your proposal would result in
software which is not redistributable by third parties.

Ingo Molnar

Apr 11, 2005, 5:00:13 AM

* Petr Baudis <pa...@ucw.cz> wrote:

> Hello,
>
> here goes git-pasky-0.2, my set of patches and scripts upon Linus'
> git, aimed at human usability and to an extent a SCM-like usage.

works fine on FC4, with only minor issues: 'git' in the tarball didnt have
the x permission. Also, your scripts assume they are in $PATH. When
trying out a tarball one doesnt usually do a 'make install' but tries
stuff locally. Also, 'make install' doesnt seem to install the git
script itself, is that intentional?

Ingo

Anton Altaparmakov

Apr 11, 2005, 5:30:14 AM

On Mon, 2005-04-11 at 01:04 +0200, Bernd Eckenfels wrote:
> In article <2005041011190...@engr.sgi.com> you wrote:
> > (I repeat the xxx in the leaf name - easier to code.)
>
> It is a bit OT, but just a note: there are file systems (hash functions) out
> there who dont like a lot of files named the same way. For example NTFS with
> the 8.3 short names.

Since you mention NTFS, there is no need to worry about that for Linux.
Certainly the Linux kernel NTFS driver is never going to create 8.3
short names. (It doesn't create names at all at the moment but my grand
plan is that it will only ever create file names in the Win32 and/or
POSIX name spaces. The DOS name space is a thing of the past IMO.)

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

Petr Baudis

Apr 11, 2005, 7:00:20 AM

Dear diary, on Mon, Apr 11, 2005 at 10:40:00AM CEST, I got a letter
where Florian Weimer <f...@deneb.enyo.de> told me that...

> * Ingo Molnar:
>
> > is there any fundamental problem with going with v2 right now, and then
> > once v3 is out and assuming it looks ok, all newly copyrightable bits
> > (new files, rewrites, substantial contributions, etc.) get a v3
> > copyright? (and the collection itself would be v3 too) That method
> > wouldnt make it fully v3 automatically once v3 is out, but with time
> > there would be enough v3 bits in it to make it essentially v3.
>
> Almost certainly, v3 will be incompatible with v2 because it adds
> further restrictions. This means that your proposal would result in
> software which is not redistributable by third parties.

Hmm, what would be actually the point in introducing further
restrictions? Anyone who then wants to get around them will just
distribute the software with the "any later version" provision under
GPLv2, and GPLv3 will have no impact except for new software with "v3 or
any later version" provision. What am I missing?

I've been doing a lot of LKML catching up, and I remember someone
suggesting using GPLv2 (for kernel, but should apply to git too), with a
provision to let someone trusted (Linus) decide when GPLv3 is out
whether you can use GPLv3 for the kernel too. Does it make sense? And is
it even legally doable without sending signed written documents to
Linus' tropical hacienda?

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

Petr Baudis

Apr 11, 2005, 6:30:14 AM

Dear diary, on Mon, Apr 11, 2005 at 10:50:51AM CEST, I got a letter
where Ingo Molnar <mi...@elte.hu> told me that...

>
> * Petr Baudis <pa...@ucw.cz> wrote:
>
> > Hello,
> >
> > here goes git-pasky-0.2, my set of patches and scripts upon Linus'
> > git, aimed at human usability and to an extent a SCM-like usage.
>
> works fine on FC4, only minor issues: 'git' in the tarball didnt have
> the x permission.

Sorry, fixed in the tarball. It is in the diffs but I have no git patch
yet to apply the mode changes.

> Also, your scripts assume they are in $PATH. When
> trying out a tarball one doesnt usually do a 'make install' but tries
> stuff locally.

Hmm, I think I will need to make something like

exedir=$(dirname "$0")

at the top of each script and then do all the git calls with ${exedir}
prepended. That should fix the issue, right?

> Also, 'make install' doesnt seem to install the git script itself, is
> that intentional?

Oops, I actually didn't even notice that there _is_ any install target
in the Makefile already. ;-) I will add the relevant stuff to it.

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

Petr Baudis

Apr 11, 2005, 6:30:15 AM

Dear diary, on Mon, Apr 11, 2005 at 04:46:42AM CEST, I got a letter
where Daniel Barkalow <bark...@iabervon.org> told me that...

> On Mon, 11 Apr 2005, Petr Baudis wrote:
>
> > Hello,
> >
> > here goes git-pasky-0.2, my set of patches and scripts upon
> > Linus' git, aimed at human usability and to an extent a SCM-like usage.
>
> Incidentally, the git-pasky-base tarball you have up has its checked-out
> tree partway between 0.1 and 0.2, and doesn't compile. (The included HEAD
> version in .dircache is fine, if the user has some way to bootstrap)

Oops, I'm sorry. It appears some diffs just slipped out from the tracked
tree, perhaps I was pulling once when git diff was broken and I didn't
notice it. Now there is a newer tarball there, it is not a pure 0.2
anymore though - if you use the COMMITTER_* env variables, they are now
AUTHOR_*.

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

Ingo Molnar

Apr 11, 2005, 7:40:11 AM

i think all of the 'repository size' and 'bandwidth' concerns could be
solved via a new (and pretty much simple and transparent) object type:
the 'combo-blob'.

Summary:
--------

This is a space/bandwidth-efficient blob that 'includes' arbitrary
portions of (one, two, or more) simple blobs by reference [1], with byte
granularity, plus an optional followup portion that includes the full
constructed state, uncompressed. [2] It can also conserve more RAM
compared to the current repository format.

Representation:
---------------

A combo-blob would have the 'simplest possible' and thus most obvious
representation: a list (the 'include-table') of "include X bytes at
offset Y from parent Z" operations:

<parent-blob-ID> <offset> <size>
[optional full constructed state]

e.g.:

6d11b2dd7f169c29664ac0553090865b7b020973 0 64444
6d374c972c04a0b1894cc6898dffa8ab0b273fcb 0 100
6d11b2dd7f169c29664ac0553090865b7b020973 64544 163656

'punches' 100 bytes out of blob 6d1* at offset 64444, and replaces it
with blob 6d3*'s 100 bytes. [offset/size would be stored in a binary
form to have constant record sizes.]

in OS terms it's similar to an iovec representation. [3]
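
A minimal Python sketch of how such an include-table could be turned back
into file contents. Everything here is an assumption for illustration: the
Record tuple, the construct() helper and the tiny in-memory store are
hypothetical, and real parent IDs would be SHA-1 names rather than "old" and
"new". Note that the third record resumes at offset 22, which is the punch
offset (11) plus the punched size (11).

```python
from typing import Dict, List, Tuple

# One include-table record: (parent blob ID, offset into parent, size in bytes).
Record = Tuple[str, int, int]


def construct(table: List[Record], store: Dict[str, bytes]) -> bytes:
    """Concatenate the byte ranges named by each record, in order."""
    parts = []
    for parent_id, offset, size in table:
        parent = store[parent_id]
        if offset < 0 or size < 0 or offset + size > len(parent):
            raise ValueError(f"record out of range for parent {parent_id}")
        parts.append(parent[offset:offset + size])
    return b"".join(parts)


# Hypothetical miniature store; each line of C below is exactly 11 bytes.
store = {
    "old": b"int a = 1;\nint b = 2;\nint c = 3;\n",
    "new": b"int b = 7;\n",
}
# Keep the first 11 bytes of "old", punch out the next 11 and take them from
# "new" instead, then resume in "old" right after the punched range.
table = [("old", 0, 11), ("new", 0, 11), ("old", 22, 11)]
result = construct(table, store)
```

The list of (buffer, offset, length) triples is what makes the iovec
comparison natural: the constructed file is just the ranges laid end to end.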

The hash of a combo-blob is calculated off the include-table alone: i.e.
it's _not_ equivalent to the hash of the included contents. I.e. you
cannot 'collapse' a combo-blob after the fact, it's an immutable part of
the history of the repository, similar to other stored objects. You can
freely cache/uncache (blow-up/collapse) it on the other hand.
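
To make the naming rule concrete, here is a small Python sketch that hashes
only the include-table. The record layout (20 raw ID bytes followed by two
big-endian 64-bit integers) is an assumption; the proposal only says that
offset and size would be binary with a constant record size. pack_table()
and combo_id() are hypothetical names.

```python
import hashlib
import struct
from typing import List, Tuple

Record = Tuple[str, int, int]  # (parent ID as 40 hex digits, offset, size)


def pack_table(table: List[Record]) -> bytes:
    """Encode each record as 20 raw ID bytes + two big-endian 64-bit ints."""
    out = bytearray()
    for parent_hex, offset, size in table:
        out += bytes.fromhex(parent_hex)
        out += struct.pack(">QQ", offset, size)
    return bytes(out)


def combo_id(table: List[Record]) -> str:
    """Name the combo-blob by hashing the include-table only."""
    return hashlib.sha1(pack_table(table)).hexdigest()


# Parent IDs borrowed from the example; the offsets and sizes are made up.
parent_a = "6d11b2dd7f169c29664ac0553090865b7b020973"
parent_b = "6d374c972c04a0b1894cc6898dffa8ab0b273fcb"
table = [(parent_a, 0, 10), (parent_b, 0, 4)]
name = combo_id(table)
```

Because the optional constructed contents never enter the hash, attaching or
dropping the cached portion leaves the name unchanged, while any change to a
record yields a different object.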

[ NOTE: further below you can find a 'Notes' section as well, which
might address some of the issues/ideas you might have at this point. ]

Cons:
-----

there are a number of disadvantages:

- performance hit. Linus is perfectly right, in terms of performance,
nothing beats having full objects.

Hence i kept the option to include the full constructed blob [4]
(uncompressed) as well in the combo-blob. When all combo-blobs are
'blown up' then they can be better in terms of performance than the
current repository format. [they still carry the small slice & dice
information as well]

the performance hit can be reduced in a finegrained way by introducing
occasional full objects in the history. E.g. after every 8 steps one
would include a full blob, to limit the number of blobs necessary to
construct a previously unconstructed combo-blob. This would still cut
the overhead of the current format substantially.

clearly, the most important cache is the current directory cache,
which this abstraction does not hurt.

- complexity. It's all pretty straightforward, but checking the
consistency of a combo object is not as simple as checking the
consistency of a simple object, as it would have to recursively check
all parent IDs as well. I think it's worth the price though.

- repository has optional components: the 'blown up' (cached) portion of
a combo-blob can be freely destructed. This means that two
repositories can now not only differ in their directory-cache, but
also in their objects/ hierarchy. I dont think this is a big issue,
BYMMV.
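
The recursion and the "full blob every 8 steps" idea from the list above
can be sketched in Python as follows. This is only an illustration under
assumptions: the store is a plain dict, a full blob is a bytes value, a
combo-blob is a list of records, and depth(), resolve() and
should_store_full() are hypothetical helpers. Range validation is omitted.

```python
from typing import Dict, List, Tuple, Union

Record = Tuple[str, int, int]          # (parent ID, offset, size)
Obj = Union[bytes, List[Record]]       # full blob, or combo-blob include-table

MAX_DEPTH = 8  # the "every 8 steps" figure from the proposal


def depth(obj_id: str, store: Dict[str, Obj]) -> int:
    """Number of combo levels that must be resolved to reach full blobs."""
    obj = store[obj_id]
    if isinstance(obj, bytes):
        return 0
    return 1 + max((depth(p, store) for p, _, _ in obj), default=0)


def resolve(obj_id: str, store: Dict[str, Obj]) -> bytes:
    """Recursively build the contents; a missing parent raises KeyError."""
    obj = store[obj_id]
    if isinstance(obj, bytes):
        return obj
    return b"".join(resolve(p, store)[off:off + size] for p, off, size in obj)


def should_store_full(parents: List[str], store: Dict[str, Obj]) -> bool:
    """True when a new combo-blob on these parents would exceed MAX_DEPTH."""
    return 1 + max((depth(p, store) for p in parents), default=0) > MAX_DEPTH
```

The consistency cost shows up in resolve(): checking one combo-blob means
walking every parent, so one lost or damaged parent makes the whole chain
above it unreadable unless a full copy exists somewhere along the way.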

Pros:
-----

- the main advantage is space/bandwidth: it's pretty much as efficient as
it gets: it can be used to represent compressed binary deltas. A fully
trimmed (uncached) repository is very efficient.

- the optional 'fully constructed' portion is not compressed, so once a
repository is 'cached', it is faster to process (in areas outside the
current directory cache) than the current repository format. (In fact,
when a previously unused portion of a repository is accessed _first_,
it is IO-bound by nature - so we can very well spend the extra CPU
cycles on uncompressing things.)

- a 'combo' blob will be more memory-efficient as well. So with given
amount of RAM one could access more history, with a small CPU cost -
as long as the level of 'history recursion' is kept in check (e.g. via
the previously mentioned 'at most 8-deep combinations').
Straightforward iovecs could be passed to Linux system-calls, when
constructing a 'view' of a file, without having to cache every step of
the file's history.

- a combo-blob directly represents the way humans code: combining
pre-existing pieces of information and adding relatively low amount of
new stuff. Having a natural representation for the type of activity
that a tool supports cannot hurt.

( - combo-blobs enable a per-chunk (or per-line) edit history. It's not
an important feature though. )

Notes:
------

[1] the combo-blob is not a 'delta' thing. It combines pre-existing
parents. One of the parents may of course be a 'delta' that acts upon
the other parent - but the combo-blob does not know and does not care.
(A combo-blob might as well represent an act of someone consolidating
multiple small files into a big file, or splitting up a big file into
smaller files. Or a combo-blob might represent the trimming of a
preexisting file.)

[2]: a combo-blob is conceptually still a simple object with blob data
in it, nothing more. It can be referenced in other object types
equivalently to other blobs. It just happens to be a combination of
existing blobs, and hence the 'git filesystem' has to work harder (but
still quite efficiently) to get to the contents.

[3]: a combo-blob might reference any parent blob, including combo
blobs. This means that e.g. multiple small deltas can be represented
via:

where combo-blob-#2 is thus a combination of blob-#1,blob-#2,blob-#3.

[4] alternatively, it might also make sense to extend the simple
combo-blob concept with the concept of a 'cache-blob': a cache-blob
'blows up' combo blobs in that it fully constructs the blob contents,
but it is otherwise identical to the blob it caches. Simple (non-combo)
blob types are a cache of themselves.

Ingo

Petr Baudis

Apr 11, 2005, 10:00:13 AM

Hello,

here goes git-pasky-0.3, my set of patches and scripts upon
Linus' git, aimed at human usability and to an extent a SCM-like usage.

If you already have a previous git-pasky version, just git pull pasky
to get it (but see below!!!). Otherwise, you can get it from:

http://pasky.or.cz/~pasky/dev/git/

Please see the README there and/or the parent post for detailed
instructions. You can find the changes from the last announcement
in the ChangeLog (releases have separate commits so you can find them
easily; they are also tagged for purpose of diffing etc).

This release is mainly focused on bugfixes. Especially, it fixes git
diff, which was totally broken in the previous release and would only
diff every other file (forgot to remove one shift from the times when
changes were reported two-line from diff-tree). Very sorry about that.

This implies that git pull was broken too, though - if you pulled a
tracked branch, git diff wouldn't produce the complete diff for patch to
apply. If you didn't do any local changes, it is fortunately easy to
repair:

git diff | patch -p0 -R

(The unapplied changes appear as reverted in your local tree when
compared with the cache.) You will need to edit the diff if you did
some local changes.

Another change breaking some compatibility concerns the commit
environment variables - s/COMMITTER_*/AUTHOR_*/. Otherwise it is the usual
bunch of merges with Linus and some really minor stuff. Oh, and make
install works.

One annoying thing is rsync error when pulling from Linus - it tries
to sync the tags/ directory and I don't know how to safely silence it
except throwing away all stderr. I will probably make it fetch the list
of .dircache and rsync only things which are really there.

Any feedback/opinions/suggestions/patches (especially patches) are
welcome. You can also stop by at #git either on FreeNode or on OFTC (I
will be around only from 20:00 CET on, though).

H. Peter Anvin

Apr 11, 2005, 10:10:13 AM

Followup to: <20050410065...@64m.dyndns.org>
By author: Christopher Li <lk...@chrisli.org>
In newsgroup: linux.dev.kernel
>
> There is one problem though. How about the SHA1 hash collision?
> Even if the chance is very remote, you don't want to lose some data due
> to "software" error. I think it is OK not to handle that
> case right now. On the other hand, it will be nice to detect that
> and give out a big error message if it really happens.
>

If you're actually worried about it, it'd be better to just use a
different hash, like one of the SHA-2's (probably a better choice
anyway), instead of SHA-1.
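
For what it's worth, detecting the collision case at write time is cheap.
The Python sketch below is an assumption-laden illustration, not how git
does it: the object store is a dict, write_object() and CollisionError are
hypothetical names, and the name is the hash of the raw bytes. The algo
parameter shows that swapping in a SHA-2 hash is a one-word change here.

```python
import hashlib
from typing import Dict


class CollisionError(Exception):
    """Raised when two different contents map to the same object name."""


def write_object(store: Dict[str, bytes], data: bytes, algo: str = "sha1") -> str:
    """Store data under its hash; refuse to overwrite different content."""
    name = hashlib.new(algo, data).hexdigest()
    existing = store.get(name)
    if existing is not None and existing != data:
        raise CollisionError(f"{algo} collision on {name}")
    store[name] = data
    return name
```

Writing the same content twice is a no-op that returns the same name; only a
name that already holds different bytes triggers the big error message.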

-hpa

Paul Jackson

Apr 11, 2005, 10:50:10 AM

Hmmm ... I have this strong sense that I am about 2 hours away from
smacking my forehead and groaning "Duh - so that's what Ingo meant!"

However, one must play out one's destiny.

Could you provide an example scenario, which results in the creation of
a combo-blob?

The best I can come up with is the following.

Let's say Nick changes one line in the middle of kernel/sched.c
(yeah - I know - unlikely scenario - he usually changes more
than that - nevermind that detail.)

In the days Before Combo Blobs (BCB), git would have been told that
kernel/sched.c was to be picked up, and would have wrapped it up in a
zlib'd blob, sha1summed it, seen it was a new sum, and added that blob
to its objects (or something like this -- I'm still a little fuzzy on
these git details.)

But Nick just downloaded the latest git 1.5.11.1 which has added support
for combo blobs, so now, guessing here, instead of wrapping up the new
sched.c, git instead unwraps the old one, diff's with the new, notices a
couple of long sequences that are unchanged, wraps up both of those
sequences as a couple of relatively large blobs, and wraps up the new
lines that Nick just coded in the middle as a small blob, and puts all
three in the object store, along with another small combo-blob, tying
them all together.
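
A rough Python sketch of that guessed workflow, with one deliberate
difference: unchanged sequences are referenced as (offset, size) ranges of
the old blob rather than wrapped up as new blobs, which is closer to the
include-table in the proposal. Using difflib as the diff engine, the dict
store, and the helper names blob_id() and make_combo() are all assumptions
made for illustration.

```python
import hashlib
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

Record = Tuple[str, int, int]  # (parent ID, offset, size)


def blob_id(data: bytes) -> str:
    """Illustrative name: SHA-1 of the raw bytes."""
    return hashlib.sha1(data).hexdigest()


def make_combo(old: bytes, new: bytes, store: Dict[str, bytes]) -> List[Record]:
    """Describe `new` as ranges of `old` plus small blobs of fresh bytes."""
    old_id = blob_id(old)
    store[old_id] = old
    table: List[Record] = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            table.append((old_id, i1, i2 - i1))
        elif j2 > j1:  # "replace" or "insert": store only the new bytes
            fresh = new[j1:j2]
            fresh_id = blob_id(fresh)
            store[fresh_id] = fresh
            table.append((fresh_id, 0, j2 - j1))
        # "delete" contributes nothing: that range of `old` is not included
    return table
```

A byte-level diff like this can produce many tiny records for a one-line
change; a line-based diff would give the "two long sequences plus one small
blob" shape described above.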

So far, not too bad. Haven't gained anything, and required the
unpacking of a zlib blob we didn't require before, and the running and
analyzing of a diff we didn't require before, but the end result is only
moderately worse - four object blobs instead of one, but of total size
not much larger (well, total size typically 3 disk blocks worse, due to
a slight increase in fragmentation from using 4 blocks to store what
used to be in one.)

But now I get stuck. Unless I throw in something like the interleaved
delta compression that's at the heart of Marc Rochkind's old SCCS code
(and Larry's rewrite thereof), I don't see how we ever come to the
practical realization that any of these four new blobs can ever be
reused.

So explain to me again how we ever gain anything with these combo blobs,
while I take a prophylactic aspirin, so the forehead whack won't hurt as
much.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

Adam J. Richter

Apr 11, 2005, 12:10:11 PM

On 2005-04-11, Linus Torvalds wrote:
>I'm inclined to go with GPLv2 just because it's the most common one [...]

You may want to use a file from GPL'ed monotone that
implements a substantial diff optimization described in the August
1989 paper by Sun Wu, Udi Manber and Gene Myers ("An O(NP) Sequence
Comparison Algorithm"). According to the file, that implementation
was a port of some Scheme code written by Aubrey Jaffer to C++ by
Graydon Hoare. (By the way, I would prefer that git just punt to
user level programs for diff and merge when all of the versions
involved are different or at least have a very thin interface
for extending the facility, because I would like to do some character
based merge stuff.)

It looks to me like the anti-patent provisions of OSLv2.1
could be circumvented by an offender creating a separate company
to do patent litigation. So, I think you'll find that the software
reuse benefits (both to GIT and to other software projects) of the
more widely used GPL outweigh the anti-patent benefits of OSLv2.1.

Although I like the idea of anti-patent provisions, such
as those in OSLv2.1, I think mutual compatibility of free
software is probably more consequential, even from a purely
political perspective.

Perhaps you might want to consider offering the code
under the distributor's choice of either license if you want
to offer the very minor benefits of slightly easier compliance
to those who do not litigate software patents, or, perhaps more
importantly, the ability of the software to be copied into
OSLv2.1 projects (if there are any).

__ ______________
Adam J. Richter \ /
ad...@yggdrasil.com | g g d r a s i l

Ingo Molnar

Apr 11, 2005, 11:30:14 AM

here are some stats: of the last 34160 files modified in the Linux
kernel tree in the past 1 year, the file sizes total to 1 GB, and the
average file-size per file committed is 31220 bytes. The changes
themselves amount to:

22404 files changed, 1996494 insertions(+), 1396644 deletions(-)

(the # of files changed is lower because one file can be modified
multiple times)

the Linux kernel has an average line-length of 36 bytes, so even without
analyzing the commits themselves, the actual size of changes is around
70 MB content added, 50 MB content removed. The patches (plus commit
comments, and email headers) add up to 250 MB.

So the combo-blob representation would have an uncompressed content
somewhere between 130MB and 250MB: 200 MB would be a good guess i think.
That's 20% of the 1+ GB the full-blob representation would give, and it
would be nearly as compressible.
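
The arithmetic behind these estimates can be replayed with a few lines of
Python. The input numbers are the ones quoted in this message; treating MB
as decimal megabytes is an assumption about the units.

```python
FILES_COMMITTED = 34160
AVG_FILE_BYTES = 31220
INSERTIONS = 1996494
DELETIONS = 1396644
AVG_LINE_BYTES = 36

MB = 1_000_000  # decimal megabytes, an assumption about the units used

full_blob_total = FILES_COMMITTED * AVG_FILE_BYTES   # bytes stored as full blobs
added = INSERTIONS * AVG_LINE_BYTES                  # bytes of inserted lines
removed = DELETIONS * AVG_LINE_BYTES                 # bytes of deleted lines

print(f"full blobs:    {full_blob_total / MB:.0f} MB")
print(f"added:         {added / MB:.1f} MB")
print(f"removed:       {removed / MB:.1f} MB")
print(f"200 MB / full: {200 * MB / full_blob_total:.0%}")
```

This gives roughly 1.07 GB of full blobs, about 72 MB added and 50 MB
removed, and a 200 MB guess that is a little under one fifth of the
full-blob total, in line with the rounded figures above.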

Ingo

Ingo Molnar

Apr 11, 2005, 1:10:09 PM

* Linus Torvalds <torv...@osdl.org> wrote:

> > Also, with a 'replicate the full object on every 8th commit'
> > rule the risk would be somewhat mitigated as well.
>
> ..but not the complexity.
>
> The fact is, I want to trust this thing. Dammit, one reason I like GIT
> is that I can mentally visualize the whole damn tree, and each step is
> so _simple_. That's extra important when the object database itself is
> so inscrutable - unlike CVS or SCCS or formats like that, it's damn
> hard to visualize from looking at a directory listing.

ok. Meanwhile i found another counter-argument: the average committed
file size is 36K, which with gzip -9 would compress down to roughly 8K,
with the commit message being another block. That's 2+1 blocks used per
commit, while with deltas one could at most cut this down to 1+1+1
blocks - just as much space! So we would be almost even with the more
complex delta approach, just by increasing the default compression ratio
from 6 to 9. (but even with the default we are not that bad.)
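
The block counting can be written down as a short Python sketch. The 4 KiB
block size is an assumption (the message implies it by treating 8K as two
blocks), zlib stands in for gzip, and blocks() and object_blocks() are
hypothetical helpers.

```python
import zlib

BLOCK = 4096  # assumed filesystem block size; the message implies 8K = 2 blocks


def blocks(nbytes: int, block: int = BLOCK) -> int:
    """Blocks occupied by a file of nbytes (at least one, rounded up)."""
    return max(1, -(-nbytes // block))


def object_blocks(data: bytes, level: int) -> int:
    """Blocks used by one zlib-compressed object at the given level."""
    return blocks(len(zlib.compress(data, level)))


# The figures from the message: an 8K compressed file plus a commit object.
full_file_commit = blocks(8 * 1024) + blocks(1)       # 2 + 1
# A delta-style commit: small delta, small combo record, commit object.
delta_commit = blocks(1) + blocks(1) + blocks(1)      # 1 + 1 + 1
```

Both variants come out at three blocks, which is the point being made: once
every object costs at least one block, small deltas stop paying for
themselves. object_blocks() can be run on real files to compare level 6
against level 9.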

case closed i guess. (The network bandwidth issue can/could indeed be
solved independently, without any impact to the fundamentals, as you
suggested.)

Ingo

Chris Wedgwood

Apr 11, 2005, 2:20:09 PM

On Mon, Apr 11, 2005 at 09:01:51AM -0700, Linus Torvalds wrote:

> I disagree. Yes, the thing is designed to be replicated, so most of
> the time the easiest thing to do is to just rsync with another copy.

It's not clear how any of this is going to give me something like

bk changes -R

or
bk changes -L

functionality. I'm guessing I will have to sync locally and check
between two trees in those cases? Or at least sync enough metadata as
to make this possible... but not the entire tree right?

Linus Torvalds

Apr 11, 2005, 11:40:11 AM

On Mon, 11 Apr 2005, Ingo Molnar wrote:
>
> to construct the combo blob later on, we do have to unpack sched.c (and
> if it's already a combo-blob that is not cached then we'd have to unpack
> all parents until we arrive at some full blob).

I really don't want to have this. Having chains of dependencies is really
painful, and now if _any_ of them gets corrupted, you're screwed.

Yes, GIT already has chains, but they are the minimal possible (ie we have
the path-name-dependent tree chain, which I tried to avoid but really
couldn't). The "commit" chain can grow to arbitrary sizes, but losing any
entry but the top one really doesn't lose any data - you lost your place
in history, but at least you're not totally screwed. You still have your
data, you just can't find your way to the root (but you can, for example,
effectively re-create the whole commit chain if you want to without having
to touch any of the data blobs).

So I would very strongly suggest that we do not have dependent combo
blobs, but that if you want to, a better "network protocol" might be quite
possible. Ie send diffs over the network, and re-create the blobs on the
other side. You can trivially check that you got it right, because if you
didn't, the name of the result won't match ;)
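
A small Python sketch of that idea, under stated assumptions: the delta
format (copy/data operations) is invented for illustration, the name is
taken to be the SHA-1 of the rebuilt bytes (the real object naming may hash
something slightly different), and apply_delta() and receive() are
hypothetical helpers.

```python
import hashlib
from typing import List, Tuple

# A toy delta op: ("copy", offset, size) from the base, or ("data", bytes).
Op = Tuple


def apply_delta(base: bytes, ops: List[Op]) -> bytes:
    """Rebuild the target from a base blob and a list of delta ops."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, size = op
            out += base[offset:offset + size]
        elif op[0] == "data":
            out += op[1]
        else:
            raise ValueError(f"unknown op {op[0]!r}")
    return bytes(out)


def receive(base: bytes, ops: List[Op], expected_name: str) -> bytes:
    """Rebuild the blob and accept it only if its name matches."""
    blob = apply_delta(base, ops)
    name = hashlib.sha1(blob).hexdigest()
    if name != expected_name:
        raise ValueError(f"name mismatch: got {name}, expected {expected_name}")
    return blob
```

The receiver ends up storing an ordinary full blob, so the on-disk format
keeps no dependency chains; the delta exists only on the wire.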

Please?

Linus

Randy.Dunlap

Apr 11, 2005, 12:00:21 PM

On Sun, 10 Apr 2005 16:38:00 -0700 (PDT) Linus Torvalds wrote:

Please go into a little more detail about how to do this step...
that seems to be the most basic concept that I am missing.
i.e., how to find the "latest/current" tree (version/commit)
and check it out (read-tree, checkout-cache, etc.).

Even if I use Pasky's tools, I'd like to understand this step.

---
~Randy

Ingo Molnar

Apr 11, 2005, 12:10:12 PM

* Ingo Molnar <mi...@elte.hu> wrote:

>
> * Linus Torvalds <torv...@osdl.org> wrote:
>
> > > to construct the combo blob later on, we do have to unpack sched.c (and
> > > if it's already a combo-blob that is not cached then we'd have to unpack
> > > all parents until we arrive at some full blob).
> >
> > I really don't want to have this. Having chains of dependencies is
> > really painful, and now if _any_ of them gets corrupted, you're
> > screwed.
>

> if a repository is corrupted then it pretty much needs to be dropped
> anyway. Also, with a 'replicate the full object on every 8th commit'
> rule the risk would be somewhat mitigated as well.

another thing is that if the repository is 'cached' (which would
normally be the case for work files), then it would be more resilient
against corruption as the full uncompressed file would be included at
the end of the combo-blob.

Ingo