On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
> Listing the file paths and their sigs included in a tree to make
> a snapshot of a tree state sounds fine, and diffing two trees by
> looking at the sigs between two such files sounds fine as well.
>
> But I am wondering what your plans are to handle renames---or
> does git already represent them?
git doesn't represent transitions (or deltas), but only state. So it's not (much) more than a .tar file from a version-management perspective; the only difference being that a git-tree has a comment field and a predecessor-reference, which are currently not used in determining the 'patch' between two trees.
Deltas are derived by comparing different versions and determining the difference by reverse-engineering the differences which got us from version A to version B.
Deltas are currently described as patch(1)es. Patches don't have the concept of 'renaming', so even after determining that file X has been renamed to Y, we have no container for this fact. A patch(1) only contains local-file-edits: substitute lines by other lines.
Deltas are not needed to follow a tree; deltas are useful for merging branches of versions, and for reviewing purposes. This is comparable to using tar for version-management: it is very common to weekly tar your current version of your project as a poor-mans-version management for one-person one-project.
So what is needed is a way to represent deltas which can contain more than only traditional patches. I would propose a simple format: the shell-script in a fixed-format.
Shell-patch format in EBNF:

<shellpatch>  ::= ( <comment>? <command>* )*
<comment>     ::= <commentline>+
    The comments contain the text describing the function of the
    patch following it.
<commentline> ::= "# " <text>
<command>     ::= "mv " <pathname> " " <pathname> "\n"
                | "cp " <filename> " " <filename> "\n"
                | "chmod " <mode> <pathname> "\n"
                | "patch <<__UNIQUE_STRING__\n" <patch> "__UNIQUE_STRING__\n"
    (where UNIQUE_STRING must not be contained in patch)
<filename>    ::= <pathname>   (but pointing to a file)
<pathname>    ::= a pathname relative to '.'; escaping special characters
                  the shell-way; may not contain '..'.
Example:

# Rename file b to a1, and change a line.
mv b a1
patch <<__END__
*** a1  Sun Apr 10 11:43:37 2005
--- a2  Sun Apr 10 11:43:41 2005
***************
*** 1,4 ****
  1
  2
! from
  3
--- 1,4 ----
  1
  2
! to
  3
__END__
Advantages:
- ASCII!
- a shell-patch is executable without extra tooling
- a shell-patch is readable and therefore reviewable
- a shell-patch is forward-compatible: a shell-patch acts like a patch
  (since patch(1) ignores garbage around patch :), but not
  backwards-compatible.
- extensible
- the heavy-lifting is done by 'patch'

Disadvantages:
- no deltas for binary files
Open issues:
- <comment> could be made more structured; maybe containing fields
  like Subject:, Author:, Signed-By:, certificates, ...
  (BitKeeper seems to be using "# " <field> ":" <value> "\n" lines)
- patch(1) doesn't know any directories. Should shell-patch know
  directories? This implies commands working on directories too
  (like directory renaming, mode changing, ...). Otherwise
  directories are implicit (a file in a directory implies the
  existence of that directory). Also implies mkdir and rmdir as
  shell-patch commands.
- extra commands might be useful to conserve more state(changes):
    ln -s   -- for symbolic links;
    ln      -- for hard links;
    chown   -- for permissions;
    chattr  -- for storing extended attributes
    touch   -- for setting timestamps (probably creation time only,
               since mtime is something git relies on)
  ...and for the really adventurous:
    sed 's,<fromstring>,<tostring>,' -- for substitutions
        (this is something darcs supports, but which I think is too
        bothersome to use since it is difficult to reverse engineer
        from two random trees)

Why a fixed format at all?
- This way, the executable shell-patch can be proven to be harmless to
  the machine: 'rm -rf /' is a valid shell-script, but not a valid
  shell-patch (since 'rm' is not a valid command, random flags like
  '-rf' are not supported, and '/' is an absolute pathname).
- A fixed format enables tooling to support such a patch format; for
  example creating the reverse-patch, merging patches (yeah, 'cat'
  also merges patches...).
...what has this to do with git? Not much and everything, depending on how you look at it. 'git' is 'tar', and 'shell-patch' is 'patch'; both orthogonal concepts but very usable in combination. We'll look at getting from two git trees to a shell-patch.
Diffing the trees would not only go from each file to its hash, but also the other way around: from each hash value to the files using it, to find hash values which are used more than once. For files with the same hash value, compare the contents (and the rest of the attributes); this is needed since the mapping from file contents to sha1 is one-way. When the contents are the same, the shell-patch command to generate is obviously a 'cp'.
For example, we have got two trees in git (pathname -> hash value):

  tree1/file1 -> 1234
  tree1/file2 -> 4567

and

  tree2/file1 -> 3456
  tree2/file3 -> 4567
  tree2/file4 -> 4567
...by an algorithm which starts by determining all renames, then all copies, and finally all patches.
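That ordering can be sketched for the two example trees roughly as follows. This is only an illustration, not a finished tool: it assumes bash 4 or later (associative arrays) and pathnames without whitespace, the name tree_delta is made up, and the bare "patch <path>" line stands in for a real patch(1) here-document. Newly added and deleted files are not handled.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: derive shell-patch commands from two
# "pathname -> hash" maps. Renames first, then copies, then patches.
# Assumes bash 4+ and pathnames without whitespace.
declare -A tree1=( [file1]=1234 [file2]=4567 )
declare -A tree2=( [file1]=3456 [file3]=4567 [file4]=4567 )

tree_delta() {
    local -A by_hash=() moved=() handled=()
    local path src new_paths
    # Index the old tree by hash value (one old path per hash is kept).
    for path in "${!tree1[@]}"; do
        by_hash[${tree1[$path]}]=$path
    done
    # Sort the new paths so the output order is deterministic.
    new_paths=$(printf '%s\n' "${!tree2[@]}" | sort)

    # Pass 1, renames: a new path whose content used to live at an old
    # path that no longer exists (each old path is renamed only once).
    for path in $new_paths; do
        [[ -n ${tree1[$path]+x} ]] && continue
        src=${by_hash[${tree2[$path]}]-}
        [[ -z $src ]] && continue
        [[ -n ${tree2[$src]+x} || -n ${moved[$src]+x} ]] && continue
        echo "mv $src $path"
        moved[$src]=$path
        handled[$path]=1
    done

    # Pass 2, copies: remaining new paths whose content is already known.
    for path in $new_paths; do
        [[ -n ${tree1[$path]+x} || -n ${handled[$path]+x} ]] && continue
        src=${by_hash[${tree2[$path]}]-}
        [[ -z $src ]] && continue
        echo "cp ${moved[$src]-$src} $path"
    done

    # Pass 3, patches: same path in both trees, different content.
    # A real shell-patch would emit a patch(1) here-document here.
    for path in $new_paths; do
        if [[ -n ${tree1[$path]+x} && ${tree1[$path]} != "${tree2[$path]}" ]]; then
            echo "patch $path"
        fi
    done
}

tree_delta
```

For the example trees this prints "mv file2 file3", "cp file3 file4" and "patch file1", in that order.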
Comments?
-- Rutger Nijlunsing ---------------------- linux-kernel at tux.tmfweb.nl never attribute to a conspiracy which can be explained by incompetence ---------------------------------------------------------------------- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
>handle by pure rename only plus the extra delta. The current git don't
>have per file change history. From git's point of view some file deleted
>and the other file appeared with same content.
>
>It is the top level SCM to handle that correctly.
>Rename a directory will be even more fun.
But from a git perspective it will be very efficient. Imagine that Linus decides to rename arch/i386 as arch/x86 ... at the git repository level this just requires a changeset, a new top level tree, and a new tree for the arch directory showing that i386 changed to x86. That's all ... every file below that didn't change, so the blobs for the files are all the same.
Christopher Li wrote:
> On Sat, Apr 09, 2005 at 04:31:10PM -0700, Linus Torvalds wrote:
> > NOTE! This means that each "tree" file basically tracks just a
> > single directory. The old style of "every file in one tree file"
> > still works, but fsck-cache will warn about it. Happily, the git
> > archive itself doesn't have any subdirectories, so git itself is not
> > impacted by it.
>
> That is really cool stuff. My way to read it, correct me if I am
> wrong, git is a user space version file system. "tree" <--> directory
> and "blob" <--> file. "commit" to describe the version history.
See the Venti filesystem in Bell Labs's Plan 9 OS. It too uses SHA-1.
This paper describes a network storage system, called Venti, intended for archival data. In this system, a unique hash of a block's contents acts as the block identifier for read and write operations. This approach enforces a write-once policy, preventing accidental or malicious destruction of data. In addition, duplicate copies of a block can be coalesced, reducing the consumption of storage and simplifying the implementation of clients. Venti is a building block for constructing a variety of storage applications such as logical backup, physical backup, and snapshot file systems.
We have built a prototype of the system and present some preliminary performance results. The system uses magnetic disks as the storage technology, resulting in an access time for archival data that is comparable to non-archival data. The feasibility of the write-once model for storage is demonstrated using data from over a decade's use of two Plan 9 file systems.
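The core idea in that abstract, a block addressed by the hash of its contents, fits in a few lines of shell. What follows is only an illustration of the principle, not Venti's (or git's) actual storage format: store_block is a made-up name, and sha1sum from GNU coreutils is assumed to be installed.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a write-once, content-addressed block store.
# store_block DIR: read one block from stdin, store it in DIR under the
# SHA-1 of its contents, and print that SHA-1 as the block identifier.
store_block() {
    local dir=$1 tmp id
    tmp=$(mktemp "$dir/tmp.XXXXXX") || return 1
    cat >"$tmp"
    id=$(sha1sum <"$tmp" | cut -d' ' -f1)
    if [[ -e $dir/$id ]]; then
        rm -f "$tmp"            # duplicate block: coalesce with the stored copy
    else
        mv "$tmp" "$dir/$id"    # first write; an existing block is never replaced
    fi
    echo "$id"
}

# fetch_block DIR ID: write the block with identifier ID to stdout.
fetch_block() {
    cat "$1/$2"
}
```

Storing the same contents twice returns the same identifier and leaves a single file behind, which is the coalescing of duplicates the abstract describes.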
>In other words, each "commit" file is very small and cheap, but since
>almost every commit will also imply a totally new tree-file, "git" is
>going to have an overhead of half a megabyte per commit. Oops.
>
>Damn, that's painful. I suspect I will have to change the format somehow.
Having dodged that bullet with the change to make tree files point at other tree files ... here's another (potential) issue.
A changeset that touches just one file a few levels down from the top of the tree (say arch/i386/kernel/setup.c) will make six new files in the git repository (one for the changeset, four tree files, and a new blob for the new version of the file). More complex changes make more files ... but say the average is ten new files per changeset since most changes touch few files. With 60,000 changesets in the current tree, we will start out our git repository with about 600,000 files. Assuming the first byte of the SHA1 hash is random, that means an average of 2343 files in each of the objects/xx directories. Give it a few more years at the current pace, and we'll have over 10,000 files per directory. This sounds like a lot to me ... but perhaps filesystems now handle large directories enough better than they used to for this to not be a problem?
Or maybe the files should be named objects/xx/yy/zzzzzzzzzzzzzzzz?
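The per-directory numbers are easy to re-check with shell arithmetic. A small sketch, assuming the hash bytes are uniformly distributed (files_per_dir is a name invented for this example):

```shell
#!/usr/bin/env bash
# files_per_dir TOTAL LEVELS: average number of files per leaf directory
# when each directory level fans out on one byte (two hex digits, 256
# values) of the hash. Integer division, so the result is rounded down.
files_per_dir() {
    local total=$1 levels=$2 dirs=1
    while (( levels-- > 0 )); do
        dirs=$(( dirs * 256 ))
    done
    echo $(( total / dirs ))
}

files_per_dir 600000 1   # objects/xx
files_per_dir 600000 2   # objects/xx/yy
```

With 600,000 files this gives 2343 per directory for objects/xx and 9 per directory for objects/xx/yy.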
> But I am wondering what your plans are to handle renames---or
> does git already represent them?
You can represent renames on top of git - git itself really doesn't care. In many ways you can just see git as a filesystem - it's content-addressable, and it has a notion of versioning, but I really really designed it coming at the problem from the viewpoint of a _filesystem_ person (hey, kernels is what I do), and I actually have absolutely _zero_ interest in creating a traditional SCM system.
So to take renaming a file as an example - why do you actually want to track renames? In traditional SCM's, you do it for two reasons:
- space efficiency. Most SCM's are based on describing changes to a file, and compress the data by doing revisions on the same file. In order to continue that process past a rename, such an SCM _has_ to track renames, or lose the delta-based approach.
The most trivial example of this is "diff", ie a rename ends up generating a _huge_ diff unless you track the rename explicitly.
GIT doesn't care. There is _zero_ space efficiency in trying to track renames. In fact, it would add overhead to the system, not lessen it. That's because GIT fundamentally doesn't do the "delta-within-a-file" model.
- annotate/blame. This is a valid concern, but the fact is, I never use it. It may be a deficiency of mine, but I simply don't do the per-line thing when I debug or try to find who was responsible. I do "blame" on a much bigger-picture level, and I personally believe (pretty strongly) that per-line annotations are not actually a good thing - they come not because people _want_ to do things at that low level, but because historically, you didn't _have_ the bigger-picture thing.
In other words, pretty much every SCM out there is based on SCCS "mentally", even if not in any other model. That's why people think per-line blame is important - you have that mental model.
So consider me deficient, or consider me radical. It boils down to the same thing. Renames don't matter.
That said, if somebody wants to create a _real_ SCM (rather than my notion of a pure content tracker) on top of GIT, you probably could fairly easily do so by imposing a few limitations on a higher level. For example, most SCM's that track renames require that the user _tell_ them about the renames: you do a "bk mv" or a "svn rename" or something.
If you want to do the same on top of GIT, then you should think of GIT as what it is: GIT just tracks contents. It's a filesystem - although a fairly strange one. How would you track renames on top of that? Easy: add your own fields to the GIT revision messages: GIT enforces the header, but you can add anything you want to the "free-form" part that follows it.
Same goes for any other information where you care about what happens "within" a file. GIT simply doesn't track it. You can build things on top of GIT if you want to, though. They may not be as efficient as they would be if they were built _into_ GIT, but on the other hand GIT does a lot of other things a hell of a lot faster thanks to its design.
So whether you agree with the things that _I_ consider important probably depends on how you work. The real downside of GIT may be that _my_ way of doing things is quite possibly very rare.
But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong.
> With 60,000 changesets in the current tree, we will start out our git
> repository with about 600,000 files. Assuming the first byte of the
> SHA1 hash is random, that means an average of 2343 files in each of the
> objects/xx directories. Give it a few more years at the current pace,
> and we'll have over 10,000 files per directory. This sounds like a lot
> to me ... but perhaps filesystems now handle large directories enough
> better than they used to for this to not be a problem?
The good news is that git itself doesn't really care. I think it's literally _one_ function ("get_sha1_filename()") that you need to change, and then you need to write a small script that moves files around, and you're pretty much done.
Also, I did actually debate that issue with myself, and decided that even if we do have tons of files per directory, git doesn't much care. The reason? Git never _searches_ for them. Assuming you have enough memory to cache the tree, you just end up doing a "lookup", and inside the kernel that's done using an efficient hash, which doesn't actually care _at_all_ about how many files there are per directory.
So I was for a while debating having a totally flat directory space, but since there are _some_ downsides (linear lookup for cold-cache, and just that "ls -l" ends up being O(n**2) and things), I decided that a single fan-out is probably a good idea.
> Or maybe the files should be named objects/xx/yy/zzzzzzzzzzzzzzzz?
Hey, I may end up being wrong, and yes, maybe I should have done a two-level one. The good news is that we can trivially fix it later (even dynamically - we can make the "sha1 object tree layout" be a per-tree config option, and there would be no real issue, so you could make small projects use a flat version and big projects use a very deep structure etc). You'd just have to script some renames to move the files around..
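To make the layout choice concrete, here is a shell analogue of the hash-to-pathname mapping with a configurable fan-out depth. It is a sketch only: object_path is a hypothetical helper, not git's get_sha1_filename(), and the sample hash is simply the commit id quoted elsewhere in this thread.

```shell
#!/usr/bin/env bash
# object_path SHA1 [LEVELS]: hypothetical mapping from an object name to
# its file under objects/. LEVELS is the fan-out depth: 0 is flat,
# 1 is objects/xx/..., 2 is objects/xx/yy/..., and so on.
object_path() {
    local sha=$1 levels=${2:-1} path=objects
    while (( levels-- > 0 )); do
        path+=/${sha:0:2}   # the next two hex digits become a directory
        sha=${sha:2}        # the remainder is left for the file name
    done
    echo "$path/$sha"
}

object_path 5125d089ad862f16a306b4942155092e1dce1c2d      # single fan-out
object_path 5125d089ad862f16a306b4942155092e1dce1c2d 2    # two levels
```

Switching a repository between layouts would then amount to renaming every file from object_path "$id" "$old" to object_path "$id" "$new".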
to get the latest changes from my branch. If you already have some git from my branch which can do pulling, you can bring yourself up to date by doing just
gitpull.sh pasky
(but this style of usage is deprecated now). Please see the README for some details regarding usage etc. You can find the changes from the last announcement in the ChangeLog (the previous announcement corresponds to commit id 5125d089ad862f16a306b4942155092e1dce1c2d). The most important change is probably recursive diff addition, and making git ignore the nsec of ctime and mtime, since it is totally unreliable and likes to taint random files as modified.
My near future plans include especially some merge support; I think it should be rather easy, actually. I'll also add some simple tagging mechanism. I've decided to postpone the file moving detection, since there's no big demand for it now. ;-)
I will also need to do more testing on the linux kernel tree. Committing patch-2.6.7 on 2.6.6 kernel and then diffing results in
    $ time gitdiff.sh `parent-id` `tree-id` >p
    real    5m37.434s
    user    1m27.113s
    sys     2m41.036s
which is pretty horrible, it seems to me. Any benchmarking help is of course welcomed, as well as any other feedback.
BTW, what would be the best (most complete) source for the BK tree metadata? Should I dig it from the BKCVS gateway, or is there a better source? Where did you get the sparse git database from, Linus? (BTW, it would be nice to get sparse.git with the directories as separate.)
> Where did you get the sparse git database from, Linus? (BTW, it
> would be nice to get sparse.git with the directories as separate.)
We were trying to figure out how to avert the BK disaster, and one of Tridge's concerns (and, in my opinion, the only really valid one) was that you couldn't get the BK data in some SCM-independent way.
So I wrote some very preliminary scripts (on top of BK itself) to extract the data, to show that BK could generate a SCM-neutral file format (a very stupid one and horribly useless for anything but interoperability, but still...). I was hoping that that would convince Tridge that trying to muck around with the internal BK file format was not worth it, and avert the BK trainwreck.
Larry was ok with the idea to make my export format actually be natively supported by BK (ie the same way you have "bk export -tpatch"), but Tridge wanted to instead get at the native data and be difficult about it. As a result, not only can I not use BK any more, but we also don't have a nice export format from BK.
Yeah, I'm a bit bitter about it.
Anyway, the sparse data came out of my hack. It's very inefficient, and I estimated that doing the same for the kernel would have taken ten solid days of conversion, mainly because my hack was really just that: a quick hack to show that BK could do it. Larry could have done it a lot better.
I'll re-generate the sparse git-database at some point (and I'll probably do so from the old GIT database itself, rather than re-generating it from my old BK data).
On Sun, Apr 10, 2005 at 08:44:56AM -0700, Linus Torvalds wrote:
> On Sun, 10 Apr 2005, Junio C Hamano wrote:
> > But I am wondering what your plans are to handle renames---or
> > does git already represent them?
>
> You can represent renames on top of git - git itself really doesn't care.
> In many ways you can just see git as a filesystem - it's
> content-addressable, and it has a notion of versioning, but I really really
> designed it coming at the problem from the viewpoint of a _filesystem_
> person (hey, kernels is what I do), and I actually have absolutely _zero_
> interest in creating a traditional SCM system.
>
> So to take renaming a file as an example - why do you actually want to
> track renames? In traditional SCM's, you do it for two reasons:
>
> - space efficiency. Most SCM's are based on describing changes to a file,
[snip]
> - annotate/blame. This is a valid concern, but the fact is, I never use
[snip]
- merging. When the parent tree renames a file, it's easier for an out-of-tree patch to get up-to-date.
- reviewing. A huge patch with 2000 added lines and 1990 removed lines is more difficult to review than a rename + 10 lines patch.
> So consider me deficient, or consider me radical. It boils down to the
> same thing. Renames don't matter.
When you've got no out-of-tree patches since you've got the parent-of-all-trees, then they don't matter, that's true :)
> So whether you agree with the things that _I_ consider important probably
> depends on how you work. The real downside of GIT may be that _my_ way of
> doing things is quite possibly very rare.
-- 
Rutger Nijlunsing ---------------------------------- eludias ed dse.nl
never attribute to a conspiracy which can be explained by incompetence
----------------------------------------------------------------------
On Sat, 9 Apr 2005, Linus Torvalds wrote:
> I've rsync'ed the new git repository to kernel.org, it should all be there
> in /pub/linux/kernel/people/torvalds/git.git/ (and it looks like the
> mirror scripts already picked it up on the public side too).
GCC 4 isn't very happy. Mostly sign changes, but also something that looks like a real error:
    gcc -g -O3 -Wall -c -o fsck-cache.o fsck-cache.c
    fsck-cache.c: In function 'main':
    fsck-cache.c:59: warning: control may reach end of non-void function 'fsck_tree' being inlined
    fsck-cache.c:62: warning: control may reach end of non-void function 'fsck_commit' being inlined
I assume that fsck_tree and fsck_commit should complain loudly if they ever get to that point - but since I'm not quite sure there's no patch, sorry.
-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
However, more people will get bit by this git glitch than know sed.
-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@engr.sgi.com> 1.650.933.1373, 1.925.600.0401
> > > I will also need to do more testing on the linux kernel tree.
> > > Committing patch-2.6.7 on 2.6.6 kernel and then diffing results in
> > >
> > > $ time gitdiff.sh `parent-id` `tree-id` >p
> > > real 5m37.434s
> > > user 1m27.113s
> > > sys 2m41.036s
> > >
> > > which is pretty horrible, it seems to me. Any benchmarking help is of
> > > course welcomed, as well as any other feedback.
> >
> > it seems from the numbers that your system doesnt have enough RAM for
> > this and is getting IO-bound?
>
> Not the only problem, without I/O, he will go down to 4m8s (u+s) which
> is still in the same order of magnitude.
probably not the only problem - but if we are lucky then his system was just thrashing within the kernel repository and then most of the overhead is the _unnecessary_ IO that happened due to that (which causes CPU overhead just as much). The dominant system time suggests so, to a certain degree. Maybe this is wishful thinking.
> GCC 4 isn't very happy. Mostly sign changes, but also something that
> looks like a real error:
>
> gcc -g -O3 -Wall -c -o fsck-cache.o fsck-cache.c
> fsck-cache.c: In function 'main':
> fsck-cache.c:59: warning: control may reach end of non-void function 'fsck_tree' being inlined
> fsck-cache.c:62: warning: control may reach end of non-void function 'fsck_commit' being inlined
>
> I assume that fsck_tree and fsck_commit should complain loudly if they
> ever get to that point - but since I'm not quite sure there's no
> patch, sorry.
i sent a patch for most of the sign errors, but the above is a case of gcc not noticing that the function can never ever exit the loop, and thus cannot get to the 'return' point.
Tony wrote:
> Or maybe the files should be named objects/xx/yy/zzzzzzzzzzzzzzzz?
I tend to size these things with the square root of the number of leaf nodes. If I have 2,560,000 leaves (your 10,000 files in each of 16*16 directories), then I will aim for 1600 directories of 1600 leaves each.
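The square-root rule is easy to check in the shell. A minimal sketch (isqrt is a helper written for this example, an integer square root by Newton's iteration, since bash has no built-in one):

```shell
#!/usr/bin/env bash
# isqrt N: integer square root of N (rounded down), by Newton's iteration.
isqrt() {
    local n=$1 x=$1 y
    if (( n < 2 )); then
        echo "$n"
        return
    fi
    y=$(( (x + 1) / 2 ))
    while (( y < x )); do
        x=$y
        y=$(( (x + n / x) / 2 ))
    done
    echo "$x"
}

# 10,000 files in each of 16*16 directories.
leaves=$(( 10000 * 16 * 16 ))
side=$(isqrt "$leaves")
echo "$leaves leaves -> $side directories of $side leaves each"
```

This prints "2560000 leaves -> 1600 directories of 1600 leaves each". For comparison, three hex digits give 16 * 16 * 16 = 4096 directories, and two give 256.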
My backup is sized for about this number of leaves, and it uses:
xxx/xxxzzzzzzzzzzzzzzzz
(I repeat the xxx in the leaf name - easier to code.)
I don't think there is any need for two levels. There are 4096 different values of three digit hex numbers. That's ok in one directory.
The only question would be 'xx' or 'xxx' - two or three digits.
This one is on the cusp in my view - either works.
-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@engr.sgi.com> 1.650.933.1373, 1.925.600.0401
> > > > I will also need to do more testing on the linux kernel tree.
> > > > Committing patch-2.6.7 on 2.6.6 kernel and then diffing results in
> > > >
> > > > $ time gitdiff.sh `parent-id` `tree-id` >p
> > > > real 5m37.434s
> > > > user 1m27.113s
> > > > sys 2m41.036s
> > > >
> > > > which is pretty horrible, it seems to me. Any benchmarking help is of
> > > > course welcomed, as well as any other feedback.
> > >
> > > it seems from the numbers that your system doesnt have enough RAM for
> > > this and is getting IO-bound?
> >
> > Not the only problem, without I/O, he will go down to 4m8s (u+s) which
> > is still in the same order of magnitude.
>
> probably not the only problem - but if we are lucky then his system was
> just thrashing within the kernel repository and then most of the overhead
> is the _unnecessary_ IO that happened due to that (which causes CPU
> overhead just as much). The dominant system time suggests so, to a
> certain degree. Maybe this is wishful thinking.
It turns out to be the forks for doing all the cuts and such that are bogging it down so awfully (doing diff-tree takes 0.48s ;-). I do about 15 forks per change, I guess, and for some reason cut takes a lot of time on its own.
I've rewritten the cuts with the use of bash arrays and other smart stuff. I somehow don't feel comfortable using this and prefer the old-fashioned ways, but it would be plain unusable without this.
Now I'm down to
    real    1m21.440s
    user    0m32.374s
    sys     0m42.200s
and I kinda doubt if it is possible to cut this much down. Almost no disk activity, I have almost everything cached by now, apparently.
Anyway, you can git pull to get the optimized version.
Linus wrote:
> It's a filesystem - although a
> fairly strange one.
Ah ha - that explains the read-tree and write-tree names.
The read-tree pulls stuff out of this file system into your working files, clobbering local edits. This is like the read(2) system call, which clobbers stuff in your read buffer.
The write-tree pushes stuff down into the file system, just like write(2) pushes data into the kernel.
I was getting all kind of frustrated yesterday trying to use Linus's git commands, coming at these names with my SCM hat on.
That way of thinking really doesn't work well here.
I will have to look more closely at pasky's GIT toolkit if I want to see an SCM style interface.
-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@engr.sgi.com> 1.650.933.1373, 1.925.600.0401
On Sun, Apr 10, 2005 at 08:45:22PM +0200, Petr Baudis wrote:
> It turns out to be the forks for doing all the cuts and such that are
> bogging it down so awfully (doing diff-tree takes 0.48s ;-). I do about
> 15 forks per change, I guess, and for some reason cut takes a lot of
> time on its own.
>
> I've rewritten the cuts with the use of bash arrays and other smart
> stuff. I somehow don't feel comfortable using this and prefer the
> old-fashioned ways, but it would be plain unusable without this.
I've encountered the same problem in a config-generation script a while ago. Fortunately, bash provides enough ways to remove most of the forks, but the result is less portable.
I've downloaded your code, but it does not compile here because of the tv_nsec fields in struct stat (2.4, glibc 2.2), so I cannot use it to get the most up to date version to take a look at the script. Basically, all the 'cut' and 'sed' can be removed, as well as the 'dirname'. You can also call mkdir only if the dirs don't exist. I really think you should end up with only one fork in the loop to call 'diff'.
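As an illustration of the kind of replacement meant here (I have not seen the script, so the function names and the sample paths are made up), 'dirname' and a simple 'cut' can be done with parameter expansion alone, and mkdir can be skipped when the directory is already there:

```shell
#!/usr/bin/env bash
# Hypothetical fork-free replacements for common helpers in a loop.

# dir_of PATH: like dirname(1) for simple relative paths.
dir_of() {
    case $1 in
        */*) echo "${1%/*}" ;;   # strip the last path component
        *)   echo . ;;           # no slash: the current directory
    esac
}

# first_field LINE: like cut -d' ' -f1.
first_field() {
    echo "${1%% *}"              # drop everything from the first space on
}

# ensure_dir DIR: only fork mkdir when the directory is really missing.
ensure_dir() {
    [[ -d $1 ]] || mkdir -p "$1"
}
```

Inside a hot loop the expansions would be written inline (dir=${path%/*}) rather than through $(dir_of ...), since command substitution itself forks a subshell.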
> Now I'm down to
>
> real 1m21.440s
> user 0m32.374s
> sys 0m42.200s
>
> and I kinda doubt if it is possible to cut this much down. Almost no
> disk activity, I have almost everything cached by now, apparently.
It is very common to cut times by a factor of 10 or more when replacing common unix tools by pure shell. Dynamic library initialization also takes a lot of time nowadays, and probably you have localisation which is big too. Sometimes, just wiping a few variables at the top of the shell might remove some useless overhead.
> Anyway, you can git pull to get the optimized version.
> Some thing like the following patch, may be turn off able.
Take out an old envelope and compute on it the odds of this happening.
Say we have 10,000 kernel hackers, each producing one new file every minute, for 100 hours a week. And we've cloned a small army of Andrew Mortons to integrate the resulting tsunami of patches. And Linus is well cared for in the state funny farm.
What is the probability that this check will fire even once, between now and 10 billion years from now, when the Sun has become a red giant destroying all life on planet Earth?
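For those without an envelope handy, here is one way to run the numbers. It rests on assumptions that are mine, not the patch's: that the check in question guards against two different files getting the same SHA-1, that SHA-1 behaves like a random 160-bit function, that a year has 52 weeks, and that the birthday approximation p ~ n^2 / (2 * 2^160) is good enough. collision_odds is a name made up for this sketch.

```shell
#!/usr/bin/env bash
# Back-of-the-envelope estimate under the assumptions stated above.
collision_odds() {
    awk 'BEGIN {
        # files = hackers * minutes/hour * hours/week * weeks/year * years
        n = 10000 * 60 * 100 * 52 * 1e10
        # Birthday approximation for a 160-bit hash: p ~ n^2 / (2 * 2^160)
        p = n * n / (2 * 2 ^ 160)
        printf "%.2e files, collision probability about %.1e\n", n, p
    }'
}

collision_odds
```

That comes to roughly 3 * 10^19 files and odds of a few in ten billion that the check fires even once before the Sun gives out.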
-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@engr.sgi.com> 1.650.933.1373, 1.925.600.0401
On Sun, April 10, 2005 12:55 pm, Linus Torvalds said:
> Larry was ok with the idea to make my export format actually be natively
> supported by BK (ie the same way you have "bk export -tpatch"), but
> Tridge wanted to instead get at the native data and be difficult about
> it. As a result, not only can I not use BK any more, but we also don't
> have a nice export format from BK.
>
> Yeah, I'm a bit bitter about it.
Linus,
With all due respect, Larry could have dealt with this years ago and removed the motivation for Tridge and others to pursue reverse engineering. Instead he chose to insult and question the motives of everyone that wanted open-source access to the Linux history data. The blame for the current situation falls firmly on the choice to use a closed-source SCM for Linux and the actions of the company that owned it.
Good lord - you don't need to use arrays for this.
The old-fashioned ways have their ways. Both the 'set' command and the 'read' command can split args and assign to distinct variable names.
Try something like the following:
    diff-tree -r $id1 $id2 |
        sed -e '/^</ { N; s/\n>/ / }' -e 's/./& /' |
        while read op mode1 sha1 name1 mode2 sha2 name2
        do
            ... various common stuff ...
            case "$op" in
            "+")
                ...
                ;;
            "-")
                ...
                ;;
            "<")
                test $name1 = $name2 || die mismatched names
                label1=$(mkbanner "$loc1" $id1 "$name1" $mode1 $sha1)
                label2=$(mkbanner "$loc2" $id2 "$name1" $mode2 $sha2)
                diff -L "$label1" -L "$label2" -u "$loc1" "$loc2"
                ;;
            esac
        done
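The splitting done by 'read' and 'set' can be tried in isolation. In this sketch the sample line and the two function names are invented for illustration (it is not real diff-tree output), and the here-string (<<<) is a bash feature; piping into 'while read' as above is the portable form.

```shell
#!/usr/bin/env bash
# Two old-fashioned ways to split a line into fields without cut(1).

# split_with_read LINE: read assigns whitespace-separated fields to
# named variables; the last variable receives the rest of the line.
split_with_read() {
    local mode sha name
    read -r mode sha name <<<"$1"
    echo "$name has mode $mode"
}

# split_with_set LINE: set -- splits the unquoted expansion into the
# positional parameters $1, $2, ... (glob characters would expand too).
split_with_set() {
    set -- $1
    echo "$# fields, the last one is $3"
}

line='100644 5125d089ad862f16a306b4942155092e1dce1c2d Makefile'
split_with_read "$line"
split_with_set "$line"
```

Neither function starts an external program, which is the point: the per-line cost is a builtin, not a fork.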
-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <p...@engr.sgi.com> 1.650.933.1373, 1.925.600.0401
> It turns out to be the forks for doing all the cuts and such that are
> bogging it down so awfully (doing diff-tree takes 0.48s ;-). I do about
> 15 forks per change, I guess, and for some reason cut takes a lot of
> time on its own.
Heh.
Can you pull my current repo, which has "diff-tree -R" that does what the name suggests, and which should be faster than the 0.48 sec you see..
It may not matter a lot, since actually generating the diff from the file contents is what is expensive, but remember my goal: I want the expense of a diff-tree to be relative to the size of the diff, so that implies that small diffs have to be basically instantaneous. So I care.
So I just tried the 2.6.7->2.6.8 diff, and for me the new recursive "diff-tree" can generate the _list_ of files changed in zero time:
    real    0m0.079s
    user    0m0.067s
    sys     0m0.024s
but then _doing_ the diff is pretty expensive (in this case 3800+ files changed, so you have to unpack 7600+ objects - and even unpacking isn't the expensive part, the expense is literally in the diff operation itself).
Me, the stuff I automate is the small steps. Doing a single checkin. So that's the case I care about going fast, when a "diff-tree" will likely have maybe five files or something. That's why I want the small incremental cases to go fast - if it takes me a minute to generate a diff for a _release_, that's not a big deal. I make one release every other month, but I work with lots of small patches all the time.
Anyway, with a fast diff-tree, you should be able to generate the list of objects for a fast "merge". That's next.
(And by "merge", I of course mean "suck". I'm talking about the old CVS three-way merge, and you have to specify the common parent explicitly and it won't handle any renames or any other crud. But it would get us to something that might actually be useful for simple things. Which is why "diff-tree" is important - it gives the information about what to tell merge).
> Ah ha - that explains the read-tree and write-tree names.
> The read-tree pulls stuff out of this file system into your working files, clobbering local edits. This is like the read(2) system call, which clobbers stuff in your read buffer.
Yes. Except it's a two-stage thing, where the staging area is always the "current directory cache".
So a "read-tree" always reads the tree information into the directory cache, but does not actually _update_ any of the files it "caches". To do that, you need to do a "checkout-cache" phase.
Similarly, "write-tree" writes the current directory cache contents into a set of tree files. But in order to have that match what is actually in your directory right now, you need to have done an "update-cache" phase before you did the "write-tree".
So there is always a staging area between the "real contents" and the "written tree".
> That way of thinking really doesn't work well here.
> I will have to look more closely at pasky's GIT toolkit > if I want to see an SCM style interface.
Yes. You really should think of GIT as a filesystem, and of me as a _systems_ person, not an SCM person. In fact, I tend to detest SCM's. I think the reason I worked so well with BitKeeper is that Larry used to do operating systems. He's also a systems person, not really an SCM person. Or at least he's in between the two.
My operations are like the "system calls". Useless on their own: they're not real applications, they're just how you read and write files in this really strange filesystem. You need to wrap them up to make them do anything sane.
For example, take "commit-tree" - it really just says that "this is the new tree, and these other trees were its parents". It doesn't do any of the actual work to _get_ those trees written.
So to actually do the high-level operation of a real commit, you need to first update the current directory cache to match what you want to commit (the "update-cache" phase).
Then, when your directory cache matches what you want to commit (which is NOT necessarily the same thing as your actual current working area - if you don't want to commit some of the changes you have in your tree, you should avoid updating the cache with those changes), you do stage 2, ie "write-tree". That writes a tree node that describes what you want to commit.
Only THEN, as phase three, do you do the "commit-tree". Now you give it the tree you want to commit (remember - that may not even match your current directory contents), and the history of how you got here (ie you tell commit what the previous commit(s) were), and the changelog.
So a "commit" in SCM-speak is actually three totally separate phases in my filesystem thing, and each of the phases (except for the last "commit-tree" which is the thing that brings it all together) is actually in turn many smaller parts (ie "update-cache" may have been called hundreds of times, and "write-tree" will write several tree objects that point to each other).
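The three phases above can be sketched as one small wrapper. This is only an illustration under stated assumptions: it uses a present-day git, where "update-cache" is spelled "git update-index" (write-tree and commit-tree kept their names), and the function name three_phase_commit is made up here.

```shell
#!/bin/sh
# Assumption: a present-day git, where "update-cache" is spelled
# "git update-index"; write-tree and commit-tree kept their names.

# three_phase_commit FILE CHANGELOG [PARENT]
# Prints the id of the new commit object on stdout.
three_phase_commit() {
    # Phase 1: make the directory cache match what we want to commit.
    git update-index --add "$1" || return 1

    # Phase 2: write the cache out as tree object(s); prints the top tree id.
    tree=$(git write-tree) || return 1

    # Phase 3: tie the tree to its history; the changelog comes from stdin.
    if [ -n "$3" ]; then
        printf '%s\n' "$2" | git commit-tree "$tree" -p "$3"
    else
        printf '%s\n' "$2" | git commit-tree "$tree"
    fi
}
```

Inside an initialized repository with an author identity configured, c=$(three_phase_commit a.txt "first") yields a commit id, and a later call can pass "$c" as the parent. Note that nothing here records the new commit as the current head; like the rest of the plumbing, that bookkeeping is left to whatever wraps it.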
Similarly, a "checkout" really is about first finding the tree ID you want to check out, and then bringing it into the "directory cache" by doing a "read-tree" on it. You can then actually update the directory cache further: you might "read-tree" _another_ project, or you could decide that you want to keep one of the files you already had.
So in that scenario, after doing the read-tree you'd do an "update-cache" on the file you want to keep in your current directory structure, which updates your directory cache to be a _mix_ of the original tree you now want to check out _and_ of the file you want to use from your current directory. Then doing a "checkout-cache -a" will actually do the actual checkout, and only at that point does your working directory really get changed.
Btw, you don't even have to have any working directory files at all. Let's say that you have two independent trees, and you want to create a new commit that is the join of those two trees (where one of the trees takes precedence). You'd do a "read-tree <a> <b>", which will create a directory cache (but not check out) that is the union of the <a> and <b> trees (<b> will override). And then you can do a "write-tree" and commit the resulting tree - without ever having _any_ of those files checked out.
> On Sun, Apr 10, 2005 at 08:45:22PM +0200, Petr Baudis wrote:
> > It turns out to be the forks for doing all the cuts and such what is bogging it down so awfully (doing diff-tree takes 0.48s ;-). I do about 15 forks per change, I guess, and for some reason cut takes a lot of time on its own.
> > I've rewritten the cuts with the use of bash arrays and other smart stuff. I somehow don't feel comfortable using this and prefer the old-fashioned ways, but it would be plain unusable without this.
> I've encountered the same problem in a config-generation script a while ago. Fortunately, bash provides enough ways to remove most of the forks, but the result is less portable.
> I've downloaded your code, but it does not compile here because of the tv_nsec fields in struct stat (2.4, glibc 2.2), so I cannot use it to get the most up to date version to take a look at the script. Basically,
Ok, I decided to stop this nsec madness (since it broke show-diff anyway at least on my ext3), and you get it only if you pass -DNSEC to CFLAGS now. Hope this fixes things for you. :-)
BTW, I regularly update the public copy as accessible on the web.
> all the 'cut' and 'sed' can be removed, as well as the 'dirname'. You can also call mkdir only if the dirs don't exist. I really think you should end up with only one fork in the loop to call 'diff'.
You still need to extract the file by cat-file too. ;-) And rm the files after it compares them (so that we don't fill /tmp with crap like certain awful programs like to do). But I will conditionalize the mkdir calls, thanks for the suggestion - I think that's the last bit to be squeezed from this loop (I'll yet check on the read proposal - I considered it before and turned down for some reason, can't remember why anymore, though).
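As a rough sketch of the two savings discussed here (dropping the 'dirname' fork and calling mkdir only when needed), something like the following would do. The helper names dir_of and ensure_dir are hypothetical and not taken from the actual script.

```shell
#!/bin/sh
# Fork-free replacements for 'dirname' and an unconditional 'mkdir -p'.

# dir_of PATH: directory part of PATH via parameter expansion, no fork.
# Covers the common cases only; paths with trailing slashes are not handled.
dir_of() {
    case "$1" in
        */*) d=${1%/*}; printf '%s\n' "${d:-/}" ;;  # strip last component
        *)   printf '%s\n' . ;;                      # bare file name
    esac
}

# ensure_dir DIR: fork mkdir only when DIR does not exist yet.
ensure_dir() {
    [ -d "$1" ] || mkdir -p "$1"
}
```

In a loop over many changed files that mostly share a handful of directories, ensure_dir "$(dir_of "$name")" still costs a subshell for the command substitution in some shells, so assigning the expansion directly (dir=${name%/*}) is the cheapest form when every name is known to contain a slash.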