Examples
--------
.example 1
----
----
I'm happy to see the draft spec! A procedural question. Could we put
the tDiff spec in version control somewhere? I've taken the liberty of
putting it into a GitHub gist here:
https://gist.github.com/749949
so that I or others can patch it more easily. But I'd be happy with any
other version-controlled technology.
Some comments:
* You've done a good job of retaining the flavor of classic diff files.
* I'd suggest "hunk" rather than "chunk" based on
http://www.gnu.org/software/diffutils/manual/
* The specification that hunk order is arbitrary may need
qualification. Order can affect context lines, for example. The
semantics of matching would also need to be carefully specified in order
to avoid order effects.
* The specification that comment blocks may contain information relevant
to interpretation is something I'd argue against. I'd prefer for
extended syntax to be managed cleanly via an extended or forked
specification.
* It would be worth having a specified header to the entire file that
facilitates file type sniffing, and has some room for hinting at the
specification variant in force. Ideally, there won't be variants, but
we should allow some scope for evolution.
* What should happen if the user has a column called "<ROW_NUM>"?
* Do you have thoughts on how NULL values should be represented?
* There'll probably need to be some specification for what constitutes a
match for types that are represented only approximately in text form, or
whose representation may vary.
* I think you've basically made an argument for *not* using a tabular
representation, and there are no real traces of it left in the format,
other than the "|" character. I can accept this, but will argue the
case for another round or two :-). Comments on this follow.
* What's the CR/LF quoting policy? Having no quoting at all would work,
but that needs to be made explicit or implementations will diverge. It
might be good to find a well-specified micro-format to appeal to.
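To make the point about approximately represented types concrete, here
is a minimal sketch in Python of one possible matching rule. The
function name cells_match and the default tolerance are my own
assumptions; nothing in the draft spec defines them. Cells are compared
as text first, and only if that fails are they compared as floating
point numbers within a relative tolerance.

```python
import math


def cells_match(expected: str, actual: str, rel_tol: float = 1e-9) -> bool:
    """Return True if two textual cell values should count as the same.

    Hypothetical rule, not taken from the draft spec: exact text equality
    first, then numeric equality within a relative tolerance.
    """
    if expected == actual:
        return True
    try:
        a, b = float(expected), float(actual)
    except ValueError:
        # At least one side is not numeric, so only exact text could match.
        return False
    return math.isclose(a, b, rel_tol=rel_tol)
```

With a rule like this, "0.1" and "0.10" would match even though the
text differs, while "1.0" and "1.1" would not. Whatever rule is chosen,
it would need to be written into the spec so that implementations agree.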
> That means each chunk is itself a tDiff document. But in
> that case, each chunk would need its own column header.
I'd suggest that what should be in a hunk are a set of changes that are
easier to understand as a group than individually. For example, in a
classic diff, changes to a set of contiguous lines may be grouped
together. Such a hunk isn't intended to be broken into independent
pieces, although it could be by "atomizing" it at the cost of increased
verbosity.
> Worse still,
> each "diff line" within a chunk is itself a chunk, according to the
> recursive definition of chunk. So, potentially each diff line would
> need its own header, which seems to defeat any benefits of column
> headers.
If I understand correctly, I think this is going too far. Producing a
good diff is a trade-off. It is nice to have independent hunks, but it
is also nice to have hunks that group coordinated changes. Producing
such hunks is a non-trivial optimization problem, not a problem for the
spec, but I think the spec should allow for hunks that group a bunch of
changes that are easier to review as a group, and less verbose to
express as a group.
> Also, if diffs are sparse, then a full rectangle becomes
> very awkward to read, especially in the case of tables with many
> columns. Imagine that the table has 300 columns (I routinely deal with
> DB tables like this).
Me too. Think of this: imagine changing columns A and B across all rows,
and then changing a few other values in columns C, D, E, and F here and
there. I'd like a diff that gives me a hunk showing the A and B changes
in a tabular, easy-to-scan form (omitting irrelevant columns), and then
hunks for the other changes in less regular form to examine
individually. I don't think diffkit or coopy needs to generate such
diffs today or tomorrow, but it would be good if the spec were expressive
enough to allow them.
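As a rough illustration of the kind of grouping I mean, here is a small
Python sketch. It is only a heuristic under assumed names
(split_changes, the threshold parameter, and the shape of the changes
dictionary are all hypothetical), not something the spec would mandate.
Columns that change in more than a given fraction of the changed rows go
into one tabular hunk; everything else is left to be reported
individually.

```python
from collections import Counter


def split_changes(changes, threshold=0.5):
    """Split cell changes into a tabular part and a sparse remainder.

    `changes` maps a row key to {column: (old, new)}.
    Returns (dense_columns, tabular, sparse), where `tabular` and `sparse`
    have the same shape as `changes`.
    """
    # Count in how many changed rows each column appears.
    counts = Counter(col for cells in changes.values() for col in cells)
    dense = sorted(c for c, n in counts.items() if n / len(changes) > threshold)

    tabular, sparse = {}, {}
    for key, cells in changes.items():
        in_table = {c: v for c, v in cells.items() if c in dense}
        leftover = {c: v for c, v in cells.items() if c not in dense}
        if in_table:
            tabular[key] = in_table
        if leftover:
            sparse[key] = leftover
    return dense, tabular, sparse
```

For the scenario above, the A and B changes would land in the tabular
part and the scattered C, D, E, and F changes in the sparse part. A real
generator would want something smarter, but the spec only needs to be
able to express the result.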
> Now imagine that many rows have diffs, but that
> only 1 or 2 columns in each row have column diffs. Having a full
> rectangle is pretty nasty looking in this case.
Yes, the spec should definitely allow column specification down to the
hunk level at least. The current draft meets that by going down to the
line level.
> Finally, having one
> column header for the whole document can make the document very
> awkward to read in a text editor.
Agreed.
> Suppose there are hundreds or
> thousands of rows in the tDiff document. If the column names appear
> only in a header, and there is one header for the whole document, and
> the user is reading the document within a text editor, she will have
> great difficulty keeping track of column identity. That's because the
> header will scroll off the top as she scrolls down through rows.
> Likewise for attempting to use line oriented Unix tools like vi, sed
> or grep. In other words, using a column header has really bad
> locality.
>
True. This is a general problem with formats like CSV. If sed or grep
support is a priority, then there's no alternative but per-line
repetition of column names. An alternative would be to provide a tool
that reads diffs that may not be fully denormalized, and outputs fully
denormalized diffs.
> Instead of using column headers, I chose to denormalize the column
> names into the cells, using key-value pairs in each cell:
>
>
> - | col1=1
> + | col1=2 | col2=1111| col3= x | col4=aaaa
> = | col1=3 | col3=x->y
>
> I think the drawbacks to this method are obvious. But this approach
> also solves the 3 serious issues with the first scheme, and so I think
> it's more viable.
>
Hmm. How bad would it be to *also* support a line like the following:
@ | col1 | col2| col3
such that lines within the same hunk can then drop the "colN=" prefix?
This would mean that the lines of that hunk are not divisible. I
think that is ok, and perfectly in the spirit of classic diff. However,
I can live without it if you think it is a terrible idea. One cost is
implementation complexity. For *applying* patches, implementation of
both methods would be required, but it is easy. For *generating*
patches, implementation of the new form is perhaps difficult (an
optimization problem), but it suffices to implement the simpler method
in order to meet the spec.
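To show that the applying side really is easy, here is a minimal Python
sketch that expands lines relying on an "@" header back into the
"colN=" form. It rests on assumptions the draft does not settle: the
function name denormalize is made up, a blank line is taken to end a
hunk, and cells are assumed not to contain the "|" character (no
quoting is handled).

```python
def denormalize(lines):
    """Expand hunk lines that rely on an '@' column header into key=value form."""
    columns = None  # header of the current hunk, if any
    out = []
    for line in lines:
        if not line.strip():
            columns = None  # assumption: a blank line ends the hunk
            out.append(line)
            continue
        op, *cells = [part.strip() for part in line.split("|")]
        if op == "@":
            columns = cells  # remember the header, do not emit it
            continue
        if columns is None:
            out.append(line)  # no header in force: already key=value form
            continue
        if len(cells) != len(columns):
            raise ValueError(f"expected {len(columns)} cells in {line!r}")
        pairs = [f"{name}={value}" for name, value in zip(columns, cells)]
        out.append(" | ".join([op] + pairs))
    return out
```

For example, the lines "@ | col1 | col2| col3" and "+ | 2 | 1111 | x"
would come out as the single line "+ | col1=2 | col2=1111 | col3=x". A
patch applier could run this as a first pass and then only ever deal
with the fully denormalized form.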
Cheers,
Paul
Thanks for looking into the details of what diff does, Joe. I've
weakened the language in the draft spec to this:
* When there is a choice in how to express a difference between two
tables, generators are encouraged to choose an expression that minimizes
ordering effects between hunks.
My current text is here: https://gist.github.com/749949 but it is not
yet in a consistent state. I'll mail the list when I have something
worth reviewing.
Best,
Paul
I updated the draft spec for a tabular diff format. HTML view is here:
http://share.find.coop/doc/tdiff_spec_draft.html
Forkable source is here:
https://gist.github.com/749949
What I need now is a tool for diffing spec versions so I can summarize
what changed :-).
* Columns used for identifying rows (primary keys, roughly speaking) are
now identified syntactically.
* Columns used for identifying rows are no longer assumed to be primary
keys.
* For the complete example you gave, I've switched the source table
order based on what classic diff does. I may have this wrong.
* I've spelled out some quoting rules, and taken the opportunity to
shorten some reserved names.
Obviously feel free to revert things I've broken.
Cheers,
Paul