tDiff spec: early rough draft


joe

Dec 19, 2010, 9:10:50 PM
to diffkit-user
Everyone,

I know that Paul will dig into this, but I'm hoping that others will
review it also. Some people on the list have already thought about a
lot of these issues and their feedback would be very welcome. The
informal spec. is below. I wanted to first just rough it out, and if
we can reach consensus then we'll go back and fill in some missing
details. In the interest of not making this message any longer than it
already is, I'm posting my commentary to the spec. in a separate
message after this one.

The spec. is written in AsciiDoc (http://www.methods.co.nz/asciidoc/).
If you're not familiar with AsciiDoc I would suggest you look into
it-- it's a very productive format. You can read my asciidoc spec.
document as-is, or you can download AsciiDoc and run the asciidoc tool
over the text file to generate a nice looking html document.

-----------


tDiff Format Specification
==========================
:Author: Paul Fitzpatrick, Joe Panico
:Email: paul@somewhere, diffki...@googlegroups.com
:Date: 2010-12-19
:Revision: 0.1

[abstract]
Purpose
-------
This document defines the tDiff format: a text encoding for describing
the differences between two tables of data.

tDiff output is intended for two audiences: end-users who are looking
for an easily readable report that describes the differences between
two tables; and tools that will interpret tDiff output in order to
perform further processing or analysis (e.g. "patch" tools).

Scope
-----
tDiff should be able to describe the differences between *any types*
of data tables. That includes: tables that have an identifiable unique
key (e.g. RDBMS tables with primary key) and tables that don't (RDBMS
tables without primary key, or heaps); tables where row ordering is
meaningful, whether explicit or implied, such as spreadsheets or CSV
files; and tables where row ordering has no inherent meaning (RDBMS
tables). Further, tDiff should be able to represent the differences
between two different *types* of table, e.g. an RDBMS table versus a
spreadsheet table.

Limitations
-----------
tDiff is designed to describe the differences in *content* between
two tables-- it does not have any facilities for describing the
differences in *structure* between two tables. A future revision will
expand the specification to include structure or schema information.

tDiff is designed to describe text (character) content. It has no
facilities to describe differences in content that does not have a
natural default character representation. A future revision may expand
the specification to include arbitrary binary data.

tDiff can represent the differences between exactly *two* tables. It
cannot represent an n-way compare where n > 2.


Representation
--------------
tDiff documents use the UTF-8 character encoding.


Top-level structure
-------------------
tDiff documents comprise any number of comments and diff
"chunks" (hunks?), interleaved in any order. The intent is that
comments hold meta information about diffs or environmental
information that might be useful to a human or machine interpreter.
Each diff chunk describes a contiguous block of data differences
between the two sources, where the meaning of contiguous depends on the
type of sources involved. No relationship between any chunk or comment
is implied by its order in the stream relative to any other chunk or
comment. Ordering is arbitrary or at the discretion of the
interpreter. The intent is that, generally, each chunk could stand on
its own as an independent tDiff document, for the sake of any type of
interpretation.

Comment
-------
Comment blocks are delimited using: /* */ (C style). Any content can
occur within a comment block. Machine interpreters may read the
contents of comment blocks-- the comment blocks may include
information that affects how the interpreter reads the comment or even
how it interprets diff chunks.

Chunk [Hunk?]
-------------
A chunk is a series of one or more adjacent diff lines, where each
line represents the diffs in a single row from the source tables. The
diff lines within a chunk should be separated by only the newline
characters that terminate each diff line, so that they all appear as
adjacent lines within a text editor. Within a tDiff document, each
chunk must be offset from any other chunk by delimiting it, both
before and after, with standalone newlines. A standalone newline is a
single Unix newline ('\n') on its own line.

.document example (high level)
----

/*
* this comment could describe the whole document
*/

/*
* this comment could describe the following chunk (chunk1)
*/
* chunk1, context1
+ chunk1, diff1
- chunk1, diff2
= chunk1, diff3
* chunk1, context2

/*
* this comment could describe the following chunk (chunk2), even though it's
* not immediately adjacent to the chunk (there is an intervening newline).
*/

+ chunk2, diff1
- chunk2, diff2

/*
* final, wrap up, comment
*/

----
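The top-level structure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the spec: `split_tdiff` is a hypothetical helper name, comments are assumed not to nest, and diff lines are assumed never to contain a literal `/*`. Comments are stripped first, because comment lines that start with `*` would otherwise look like context lines.

```python
import re

# C-style comment blocks; DOTALL lets a comment span several lines.
COMMENT_RE = re.compile(r"/\*.*?\*/", re.DOTALL)

def split_tdiff(text):
    """Return (comments, chunks); each chunk is a list of adjacent lines."""
    comments = COMMENT_RE.findall(text)
    body = COMMENT_RE.sub("", text)
    chunks, current = [], []
    for line in body.split("\n"):
        if line.strip():
            current.append(line)      # still inside the same chunk
        elif current:
            chunks.append(current)    # a blank line closes the chunk
            current = []
    if current:
        chunks.append(current)
    return comments, chunks

doc = "/*\n* note\n*/\n* c1\n+ d1\n\n/* x */\n\n- d2\n"
comments, chunks = split_tdiff(doc)
print(len(comments), chunks)
```

Run against the high-level example above, a splitter like this would report two chunks and four comments.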

Diff line
---------
A diff line describes the diffs in a single row of the two tables that
were compared. In order to interpret this specification, one table is
designated the *source* or *LHS* (Left Hand Side) and the other table
is designated the *destination* or *RHS* (Right Hand Side). Each row
is identified, on both the LHS and RHS, by its *key*. There are three
types of line diffs, described from the perspective of the source
(LHS) being the reference (or correct) table:

- Missing row: the row, as identified by its key, was present on the
LHS, but not present on the RHS.
- Extra row: the row, as identified by its key, was not present on the
LHS, but was present on the RHS.
- Column diffs: the row, as identified by its key, was present in both
the LHS and RHS tables. But some of the non-key columns had values on
the RHS that were different from those on the LHS.

Each diff line occupies its own line in the document, and begins with
one of three characters, depending on the type of diff it represents.
These three characters are called "line type" characters:

- Missing row diff begins with plus: _'+'_. That's because the diffs
are described from the perspective of the LHS being the reference
table; in order to make the RHS look like the LHS we would have to add
the missing row to the RHS.
- Extra row diff begins with minus: _'-'_. In order to make the RHS
look like the LHS we would have to remove the extra row from the RHS.
- Column diffs begin with equals: _'='_. The same row (according to
the key) was found on both the LHS and the RHS (the row keys were
equal). In order to make the RHS look like the LHS we would have to
update some of the column values on the RHS.
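Read this way, the three line types form a small patch language for the RHS. The sketch below is one possible reading, not a reference implementation. It assumes a table is held as a dict mapping each row key to a dict of column values, and that diff lines have already been parsed into `(line_type, key, columns)` tuples; that tuple layout and the name `apply_diff` are hypothetical.

```python
def apply_diff(rhs, diff_lines):
    """Return a patched copy of rhs that should look like the LHS."""
    patched = {key: dict(row) for key, row in rhs.items()}
    for line_type, key, columns in diff_lines:
        if line_type == "+":
            # Missing row: present on the LHS only, so add it to the RHS.
            patched[key] = dict(columns)
        elif line_type == "-":
            # Extra row: present on the RHS only, so remove it.
            del patched[key]
        elif line_type == "=":
            # Column diffs: each value is an (old, new) pair.
            for name, (old, new) in columns.items():
                patched[key][name] = new
        else:
            raise ValueError(f"unknown line type: {line_type!r}")
    return patched

rhs = {1: {"c2": "0000"}, 3: {"c2": "2222", "c3": "x"}}
diff = [
    ("-", 1, {}),
    ("+", 2, {"c2": "1111", "c3": "x"}),
    ("=", 3, {"c3": ("x", "y")}),
]
print(apply_diff(rhs, diff))
```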

The line type character can be left or right padded with any amount of
whitespace, for readability. The line type character is followed by
any number of name-value pairs, where the names represent column
names, and the values are the values for the corresponding column name
in that particular row. The name is separated from the value by an
equals ('=') sign. The name-value pairs, as well as the line type
character, are delimited with a pipe _'|'_ character. If the name or
value contains a pipe, dash, greater than, or any other character that
might make it confusing or difficult to read, the token can be single
quoted. If the original value was single quoted to begin with, then
the encoded value requires double single quotes. The document
generator must ensure that the key columns (though not explicitly
tagged as such) occur in the diff line before all non-key columns.

.diff line examples
----
+ | key-name1=value1| key-name2=value2| nonkey-name3=value3| nonkey-name4=value4
- | key-name1=value5| key-name2=value6
= | key-name1=value7| key-name2=value8| nonkey-name3=old-value9->new-value9| nonkey-name4=old-value10->new-value10
----

In the case of column diffs, for each cell that was different between
the LHS and the RHS, both the old and new values are displayed. The
old value must come first, followed by '->' (dash greater than),
followed by the new value. For all three diff line types, the
generator may include source (LHS) name-value pairs that are not
strictly needed, but help with row identification.
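A rough sketch of how an interpreter might read one diff line follows. It deliberately ignores quoting, so it assumes that names and values contain no pipe and that plain values contain no `->`; `parse_diff_line` is a hypothetical name. It also accepts the `*` line type used for context rows.

```python
def parse_diff_line(line):
    """Parse one diff line into (line_type, {name: value or (old, new)})."""
    fields = [field.strip() for field in line.split("|")]
    line_type = fields[0]
    if line_type not in ("+", "-", "=", "*"):
        raise ValueError(f"unknown line type: {line_type!r}")
    columns = {}
    for field in fields[1:]:
        # The first '=' separates the column name from its value.
        name, _, value = field.partition("=")
        if line_type == "=" and "->" in value:
            old, _, new = value.partition("->")
            columns[name] = (old, new)    # changed cell: old value first
        else:
            columns[name] = value         # key or context cell
    return line_type, columns

print(parse_diff_line("= | key-name1=value7| nonkey-name3=old-value9->new-value9"))
```

Because Python dicts keep insertion order, the key columns stay ahead of the non-key columns, as the generator is required to emit them.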

<ROW_NUM> Pseudo Column
-----------------------
Some sources of table data have a meaningful notion of row or line
number (e.g. spreadsheets, CSV files) but others do not (RDBMS
tables). In many cases, the representation of the row number is
external to the table data itself. In order to accommodate these
sources, document generators can make use of the <ROW_NUM> pseudo
column. In most cases, the <ROW_NUM> will form the key for the table,
and so, just like any other key columns, it should occur within the
diff line before any non-key name-value pairs.

.pseudo column examples
----
+ | <ROW_NUM>=3 | nonkey-name3=value3| nonkey-name4=value4
- | <ROW_NUM>=4
= | <ROW_NUM>=5 | nonkey-name3=old-value9->new-value9| nonkey-name4=old-value10->new-value10
----

Column Numbers
--------------
Some sources of table data do not use named columns-- for instance,
CSV files that have not designated a column header row. In those
cases, column names are replaced with column number designators. The
designators follow this pattern:

<COL_NUM1>, <COL_NUM2>, ...

N.B. The ordinal numbers embedded within column number designators are
1's based, not 0's based. In other words, counting starts at 1, not at
zero.

.column number example
----
= | <COL_NUM1>=value1| <COL_NUM2>=value2| <COL_NUM3>=old-value3->new-value3| <COL_NUM4>=old-value4->new-value4
----
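For a header-less source, a generator could derive the designators from the row width. A minimal sketch; both function names are made up for illustration:

```python
def column_designators(width):
    """Return the 1-based designators for a table with `width` columns."""
    return [f"<COL_NUM{n}>" for n in range(1, width + 1)]

def name_row(values):
    """Pair each cell of a header-less row with its column designator."""
    return dict(zip(column_designators(len(values)), values))

print(name_row(["value1", "value2", "value3"]))
```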


Context
-------

Though not always strictly required to fully and accurately describe
differences between the LHS and RHS, generators are allowed to include
extra rows and extra column values as contextual information. These
can help human readers orient themselves within the data, and might
help machine interpreters resolve what would otherwise be ambiguities in applying
a tDiff document. In the case of missing row or extra row lines, extra
column data can be included within the row as name-value pairs, just
like any other name-value pairs. In the case of a column diff line,
extra name-value pairs can be included within the line, but they must
occur after the key name-value pairs (as is the case for all non-key
name-value pairs). These contextual name-value pairs should also
include only the old value, since both the old value and new value are
ostensibly the same. In other words, contextual column values should
not include "->new-value". In addition to contextual column
information, a chunk may contain any number of contextual rows. These
rows appear with the line type character _'*'_. Streaks of contextual
rows can prefix and suffix the block of actual (+, -, =) real diff
lines, but cannot be interspersed amongst them. In other words, a
chunk is:

<streak of context lines>

<streak of real (+ or - or =) diff lines>

<streak of context lines>

.context example
----
* | key-name1=value-1| key-name2=value-1| nonkey-name3=value-1| nonkey-name4=value-1
* | key-name1=value0| key-name2=value0| nonkey-name3=value0| nonkey-name4=value0
+ | key-name1=value1| key-name2=value2| nonkey-name3=value3| nonkey-name4=value4
- | key-name1=value5| key-name2=value6
= | key-name1=value7| key-name2=value8| nonkey-name3=old-value9->new-value9| nonkey-name4=old-value10->new-value10
* | key-name1=value11| key-name2=value11| nonkey-name3=value11| nonkey-name4=value11
* | key-name1=value12| key-name2=value12| nonkey-name3=value12| nonkey-name4=value12
----
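The chunk shape described above (context streak, real diff streak, context streak) is regular enough to check with a single pattern over the line type characters. This is only a sketch. It assumes every chunk carries at least one real diff line, which the draft does not state outright, and `is_valid_chunk` is a hypothetical name.

```python
import re

# Leading context lines, one or more real diff lines, trailing context lines.
CHUNK_SHAPE = re.compile(r"\**[+\-=]+\**")

def is_valid_chunk(lines):
    """Check that context lines only prefix or suffix the real diff lines."""
    # Line type characters may be padded, so strip leading whitespace first.
    types = "".join(line.lstrip()[:1] for line in lines)
    return CHUNK_SHAPE.fullmatch(types) is not None

print(is_valid_chunk(["* a", "+ b", "- c", "* d"]))   # context wraps the diffs
print(is_valid_chunk(["+ a", "* b", "= c"]))          # context in the middle
```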

Examples
--------

.example 1
----

In example one, both tables are in an RDBMS, both tables have the same
column names, and the primary key is column1. The generator requested
no context.

lhs:                                rhs:
column1,column2,column3,column4     column1,column2,column3,column4
----------------------------        1, 0000, x, aaaa
2, 1111, x, aaaa                    ----------------------------
3, 2222, y, aaaa                    3, 2222, x, aaaa
4, 0000, z, bbbb                    4, 3333, x, aaaa
5, 4444, z, bbbb                    5, 4444, x, aaaa
6, 5555, u, aaaa                    6, 5555, x, aaaa
7, 0000, v, aaaa                    ----------------------------
8, 1111, x, aaaa                    ----------------------------

/*
* this is the tDiff document for example 1, using 1 chunk only and no context
*/
- | column1=1
+ | column1=2| column2=1111| column3=x| column4=aaaa
= | column1=3| column3=x->y
= | column1=4| column2=3333->0000| column3=x->z| column4=aaaa->bbbb
= | column1=5| column3=x->z| column4=aaaa->bbbb
= | column1=6| column3=x->u
+ | column1=7| column2=0000| column3=v| column4=aaaa
+ | column1=8| column2=1111| column3=x| column4=aaaa
/*
* end of tDiff document
*/

/*
* here is a tDiff document that is equivalent to the document above, except
* that it uses 8 chunks, more comments, and adds in some context
*/
/*
* chunk 1: notice that columns 2,3,4 are context-- not strictly necessary
* to specify a remove
*/
- | column1=1| column2=0000| column3=x| column4=aaaa

/*
* chunk 2: notice that the chunks are separated by standalone newline
*/
+ | column1=2| column2=1111| column3=x| column4=aaaa

/*
* chunk 3: notice that column2 and column4 are merely context
*/
= | column1=3| column2=2222| column3=x->y| column4=aaaa

/*
* chunk 4: notice that the column diff line is surrounded by context rows, and
* that the context rows describe the values on the RHS.
*/
* | column1=3| column2=2222| column3=x| column4=aaaa
= | column1=4| column2=3333->0000| column3=x->z| column4=aaaa->bbbb
* | column1=5| column2=4444| column3=x| column4=aaaa

/*
* chunk 5
*/
= | column1=5| column3=x->z| column4=aaaa->bbbb

/*
* chunk 6
*/
= | column1=6| column3=x->u

/*
* chunk 7
*/
+ | column1=7| column2=0000| column3=v| column4=aaaa

/*
* chunk 8
*/
+ | column1=8| column2=1111| column3=x| column4=aaaa

/*
* end of tDiff document
*/
----
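For readers who want to check example 1 mechanically, here is a rough generator sketch. It is an illustration only, not DiffKit's implementation. It assumes both tables share the same columns, that rows are dicts with the key column first, and that keys sort naturally as strings; `table` and `generate_tdiff` are hypothetical names. The rows are transcribed from the lhs and rhs tables of example 1.

```python
COLUMNS = ["column1", "column2", "column3", "column4"]

def table(rows):
    """Turn tuples of cell values into dicts keyed by column name."""
    return [dict(zip(COLUMNS, row)) for row in rows]

LHS = table([
    ("2", "1111", "x", "aaaa"), ("3", "2222", "y", "aaaa"),
    ("4", "0000", "z", "bbbb"), ("5", "4444", "z", "bbbb"),
    ("6", "5555", "u", "aaaa"), ("7", "0000", "v", "aaaa"),
    ("8", "1111", "x", "aaaa"),
])
RHS = table([
    ("1", "0000", "x", "aaaa"), ("3", "2222", "x", "aaaa"),
    ("4", "3333", "x", "aaaa"), ("5", "4444", "x", "aaaa"),
    ("6", "5555", "x", "aaaa"),
])

def generate_tdiff(lhs, rhs, key):
    """Yield diff lines (no context) that would make rhs look like lhs."""
    lhs_rows = {row[key]: row for row in lhs}
    rhs_rows = {row[key]: row for row in rhs}
    for k in sorted(lhs_rows.keys() | rhs_rows.keys()):
        if k not in rhs_rows:
            # Missing row: emit every LHS column so the row can be added.
            pairs = [f"{c}={v}" for c, v in lhs_rows[k].items()]
            yield "+ | " + "| ".join(pairs)
        elif k not in lhs_rows:
            # Extra row: the key alone is enough to remove it.
            yield f"- | {key}={k}"
        else:
            # Column diffs: old (RHS) value first, then new (LHS) value.
            changed = [f"{c}={rhs_rows[k][c]}->{new}"
                       for c, new in lhs_rows[k].items()
                       if c != key and rhs_rows[k][c] != new]
            if changed:
                yield "= | " + "| ".join([f"{key}={k}"] + changed)

for line in generate_tdiff(LHS, RHS, "column1"):
    print(line)
```

The printed lines should match the eight diff lines of the single-chunk document in example 1.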

joe panico

Dec 19, 2010, 9:16:59 PM
to diffkit-user
Looks like the tables of data in the example were horribly mangled.
Let me try those tables again:

Examples
--------

.example 1
----

----

joe

Dec 20, 2010, 6:27:16 AM
to diffkit-user
Here are some comments to go along with the draft spec. They explain
(rationalize :0) the decisions that went into the design.

First, I chose to use the word "chunk" for no particular reason. I
have no idea what is the difference in meaning between "chunk" and
"hunk". If you look at the wikipedia entry for "diff", you'll see that
they use both words, and it looks to me like they use them
interchangeably. Neither word is really appealing, because "chunk"
sounds like it needs to go on a diet, and "hunk" sounds like it needs
to go on baywatch.

The most difficult aspect of the spec. to deal with was the
requirement that the tDiff format be table based. There are different
ways to interpret this. One way is for the table to be fully
normalized, with the column names appearing only once for the whole
document in a column header. That's the way people are used to seeing
tables in a spreadsheet, for instance:

type char| col1| col2| col3 | col4
-        | 1   |     |      |
+        | 2   | 1111| x    | aaaa
=        | 3   |     | x->y |
...

The nice thing about this method is that the results are always
rectangular. So it's easier to load the results into table based
viewers, like spreadsheets. But there are some serious (fatal?)
drawbacks also. One goal is to have each diff chunk be able to stand
on its own. That means each chunk is itself a tDiff document. But in
that case, each chunk would need its own column header. Worse still,
each "diff line" within a chunk is itself a chunk, according to the
recursive definition of chunk. So, potentially each diff line would
need its own header, which seems to defeat any benefits of column
headers. Also, if diffs are sparse, then a full rectangle becomes
very awkward to read, especially in the case of tables with many
columns. Imagine that the table has 300 columns (I routinely deal with
DB tables like this). Now imagine that many rows have diffs, but that
only 1 or 2 columns in each row have column diffs. Having a full
rectangle is pretty nasty looking in this case. Finally, having one
column header for the whole document can make the document very
awkward to read in a text editor. Suppose there are hundreds or
thousands of rows in the tDiff document. If the column names appear
only in a header, and there is one header for the whole document, and
the user is reading the document within a text editor, she will have
great difficulty keeping track of column identity. That's because the
header will scroll off the top as she scrolls down through rows.
Likewise for attempting to use line oriented Unix tools like vi, sed
or grep. In other words, using a column header has really bad
locality.

Instead of using column headers, I chose to denormalize the column
names into the cells, using key-value pairs in each cell:


- | col1=1
+ | col1=2 | col2=1111| col3= x | col4=aaaa
= | col1=3 | col3=x->y

I think the drawbacks to this method are obvious. But this approach
also solves the 3 serious issues with the first scheme, and so I think
it's more viable.

cheers,

Joe




Paul Fitzpatrick

Dec 21, 2010, 10:20:29 AM
to diffki...@googlegroups.com
Hi Joe,

I'm happy to see the draft spec! A procedural question. Could we put
the tDiff spec in version control somewhere? I've taken the liberty of
putting it into a github gist here:
https://gist.github.com/749949
so that I or others can patch it more easily. But I'd be happy with any
other version-controlled technology.

Some comments:

* You've done a good job of retaining the flavor of classic diff files.

* I'd suggest "hunk" rather than "chunk" based on
http://www.gnu.org/software/diffutils/manual/

* The specification that hunk order is arbitrary may need
qualification. Order can affect context lines, for example. The
semantics of matching would also need to be carefully specified in order
to avoid order effects.

* The specification that comment blocks may contain information relevant
to interpretation is something I'd argue against. I'd prefer for
extended syntax to be managed cleanly via an extended or forked
specification.

* It would be worth having a specified header to the entire file that
facilitates file type sniffing, and has some room for hinting at the
specification variant in force. Ideally, there won't be variants, but
we should allow some scope for evolution.

* What should happen if the user has a column called "<ROW_NUM>"?

* Do you have thoughts on how NULL values should be represented?

* There'll probably need to be some specification for what constitutes a
match for types that are represented only approximately in text form, or
whose representation may vary.

* I think you've basically made an argument for *not* using a tabular
representation, and there are no real traces of it left in the format,
other than the "|" character. I can accept this, but will argue the
case for another round or two :-). Comments on this follow.

* What's the CR/LF quoting policy? No quoting works, but needs to be
made explicit or implementations will diverge. Might be good to find a
well-specified micro-format to appeal to.

> That means each chunk is itself a tDiff document. But in
> that case, each chunk would need its own column header.

I'd suggest that what should be in a hunk are a set of changes that are
easier to understand as a group than individually. For example, in a
classic diff, changes to a set of contiguous lines may be grouped
together. Such a hunk isn't intended to be broken into independent
pieces, although it could be by "atomizing" it at the cost of increased
verbosity.

> Worse still,
> each "diff line" within a chunk is itself a chunk, according to the
> recursive definition of chunk. So, potentially each diff line would
> need its own header, which seems to defeat any benefits of column
> headers.

If I understand correctly, I think this is going too far. Producing a
good diff is a trade-off. It is nice to have independent hunks, but it
is also nice to have hunks that group coordinated changes. Producing
such hunks is a non-trivial optimization problem, not a problem for the
spec, but I think the spec should allow for hunks that group a bunch of
changes that are easier to review as a group, and less verbose to
express as a group.

> Also, if diffs are sparse, then a full rectangle becomes
> very awkward to read, especially in the case of tables with many
> columns. Imagine that the table has 300 columns (I routinely deal with
> DB tables like this).

Me too. Think of this: imagine changing columns A and B across all rows,
and then changing a few other values in column C, D, E, and F here and
there. I'd like a diff that gives me a hunk showing the A and B changes
in a tabular, easy-to-scan form (omitting irrelevant columns), and then
hunks for the other changes in less regular form to examine
individually. I don't think diffkit or coopy needs to generate such
diffs today or tomorrow, but it would be good if the spec was expressive
enough to allow them.

> Now imagine that many rows have diffs, but that
> only 1 or 2 columns in each row have column diffs. Having a full
> rectangle is pretty nasty looking in this case.

Yes, the spec should definitely allow column specification down to the
hunk level at least. The current draft meets that by going down to the
line level.

> Finally, having one
> column header for the whole document can make the document very
> awkward to read in a text editor.

Agreed.

> Suppose there are hundreds or
> thousands of rows in the tDiff document. If the column names appear
> only in a header, and there is one header for the whole document, and
> the user is reading the document within a text editor, she will have
> great difficulty keeping track of column identity. That's because the
> header will scroll off the top as she scrolls down through rows.
> Likewise for attempting to use line oriented Unix tools like vi, sed
> or grep. In other words, using a column header has really bad
> locality.
>

True. This is a general problem with formats like CSV. If sed or grep
support is a priority, then there's no alternative but per-line
repetition of column names. An alternative would be to provide a tool
that reads diffs that may not be fully denormalized, and outputs fully
denormalized diffs.

> Instead of using column headers, I chose to denormalize the column
> names into the cells, using key-value pairs in each cell:
>
>
> - | col1=1
> + | col1=2 | col2=1111| col3= x | col4=aaaa
> = | col1=3 | col3=x->y
>
> I think the drawbacks to this method are obvious. But this approach
> also solves the 3 serious issues with the first scheme, and so I think
> it's more viable.
>

Hmm. How bad would it be to *also* support a line like the following:
@ | col1 | col2| col3
such that lines within the same hunk can then drop the "colN=" prefix?
This would mean that the lines of that hunk are not divisible. I
think that is ok, and perfectly in the spirit of classic diff. However,
I can live without it if you think it is a terrible idea. One cost is
implementation complexity. For *applying* patches, implementation of
both methods would be required, but it is easy. For *generating*
patches, implementation of the new form is perhaps difficult (an
optimization problem), but it suffices to implement the simpler method
in order to meet the spec.
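To make the cost on the *applying* side concrete, here is a rough sketch of expanding such a hunk back into the denormalized form. The `@` line is only the proposal above, not part of the draft, so the details are assumptions: cell values are taken to be positional and pipe-delimited, an empty cell is taken to mean "column not mentioned on this line", and `denormalize_hunk` is a hypothetical name.

```python
def denormalize_hunk(lines):
    """Expand a hunk that starts with an '@' header into name=value lines."""
    if not lines or not lines[0].lstrip().startswith("@"):
        return list(lines)            # already denormalized
    names = [name.strip() for name in lines[0].split("|")[1:]]
    expanded = []
    for line in lines[1:]:
        fields = [field.strip() for field in line.split("|")]
        # Pair each cell with its header name and skip the empty cells.
        pairs = [f"{name}={value}"
                 for name, value in zip(names, fields[1:]) if value]
        expanded.append(f"{fields[0]} | " + "| ".join(pairs))
    return expanded

hunk = ["@ | col1 | col2| col3", "+ | 2 | 1111| x", "= | 3 | | x->y"]
for line in denormalize_hunk(hunk):
    print(line)
```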

Cheers,
Paul

joe

Dec 22, 2010, 9:46:22 AM
to diffkit-user
Paul,

> putting it into a github gist here: https://gist.github.com/749949

That location is fine. Haven't tried git yet, but it looks like it's
about time ;)

Procedurally-- I think you should now take the scalpel and chainsaw to
my original document and create a whole new draft that reflects all of
your ideas below.

> * I'd suggest "hunk" rather than "chunk" based on http://www.gnu.org/software/diffutils/manual/

Works for me.

> * The specification that hunk order is arbitrary may need

I think we are actually in agreement here. The confusion arises from
my use of "hunk" terminology. I was assuming that the context is part
of the hunk, whereas you are assuming that the context decorates the
hunk and that the hunk strictly contains diffs. In either case, it's
clear that the context has to be adjacent to the diffs it is
contextualizing. But beyond that, there should not be any need to
imply relationships based on ordering.

> * The specification that comment blocks may contain information relevant
OK. I was allowing loose comments as a way of keeping the spec. super
minimalist. Essentially, the comment becomes a garbage pail for
anything you might want to be in the document that isn't a diff. But I
certainly buy your argument that it's not clean.

>* It would be worth having a specified header to the entire file that
Agreed. I was going to dump that into a comment, but if we ditch the
idea of the universal comment, then some type of header is
appropriate. Spec. it out.

>* What should happen if the user has a column called "<ROW_NUM>"?
I do not believe that any RDBMS allows use of <> in object names. It's
conceivable that a spreadsheet or csv file could have such a column,
but in those cases is there any character combo that is absolutely
collision proof?

>* Do you have thoughts on how NULL values should be represented?
My first (inchoate) thought was to use the convention adopted by most
delimited DB dump files. There, null values show up as nothing (no
chars) between adjoining delimiters (e.g. ||). Empty strings would
need to be quoted (e.g. |''|).
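A rough sketch of that convention, combined with the single-quote rule from the draft. The function name and the exact set of characters that trigger quoting are assumptions, as is reading "double single quotes" as doubling each embedded quote:

```python
def encode_value(value):
    """Encode one cell value for a diff line; None stands for NULL."""
    if value is None:
        return ""                     # NULL: nothing between the delimiters
    if value == "" or any(tok in value for tok in ("|", "'", "->")):
        # Quote empty strings and anything that could be misread,
        # doubling embedded single quotes.
        return "'" + value.replace("'", "''") + "'"
    return value

for cell in (None, "", "plain", "a|b", "it's"):
    print(repr(cell), "becomes", encode_value(cell))
```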

>* There'll probably need to be some specification for what constitutes a ...
I don't understand what you are saying here. The diff file is only
showing diffs, not how the matches were performed to determine the
diff. Even if the match was fuzzy, the diffs should still be
represented faithfully, no?

>* I think you've basically made an argument for *not* using a tabular
I really didn't want to, but that's what happened after I struggled
with the details of trying to make tabular work. Fundamentally, I was
not able to square the circle :-/.

> but will argue the case for another round or two :-).
Please do. I just need someone to show me how it works, in the face of
all the problems I pointed out.

>I'd suggest that what should be in a hunk are a set of changes that are ...
makes sense.

>* What's the CR/LF quoting policy?
Good point. I think the resulting diff text file looks best if we
actually encode CR/LF/NL into printing (non-white) characters. That
way, lines don't get broken up.

>If I understand correctly, I think this is going too far. Producing a
I probably didn't express my intent well here. I think we are in
agreement on this one.

>Hmm. How bad would it be to *also* support a line like the following:
>@ | col1 | col2| col3

OK, so it sounds like your proposal is to support both normalized and
denormalized column names at the hunk level. I can live with that.

Please take a whack at the next draft revision. Also, what do you
think of the idea of requesting feedback from other projects after we
have ourselves reached a consensus? In particular, there are a few
people associated with the Maatkit project who have already done a
huge amount of work on table diff/patch over the last few years. Seems
like we should really ask them what they think.

cheers,

Joe

Paul F

Jan 3, 2011, 10:22:20 AM
to diffkit-user


On Dec 22 2010, 9:46 am, joe <trur...@gmail.com> wrote:
> Paul,
>
> > putting it into a github gist here: https://gist.github.com/749949
>
> That location is fine. Haven't tried git yet, but it looks like it's
> about time ;)
>
> Procedurally-- I think you should now take the scalpel and chainsaw to
> my original document and create a whole new draft that reflects all of
> your ideas below.

Will do. Getting back to this after a holiday break.

> > * The specification that hunk order is arbitrary may need
>
> I think we are actually in agreement here. The confusion arises from
> my use of "hunk" terminology. I was assuming that the context is part
> of the hunk, whereas you are assuming that the context decorates the
> hunk and that the hunk strictly contains diffs. In either case, it's
> clear that the context has to be adjacent to the diffs it is
> contextualizing. But beyond that, there should not be any need to
> imply relationships based on ordering.

As best I remember my concern here, it was this. Each hunk has a side-
effect (a modification of the table). Contexts contain information
about the expected state of the table. If the context of one hunk
overlaps with a part of the table that is modified by another hunk,
then I'd expect to see order dependencies. Any reference to row
numbers would be particularly prone to ordering issues. Any idea what
regular diff does?

>
> > * The specification that comment blocks may contain information relevant
>
> OK. I was allowing loose comments as a way of keeping the spec. super
> minimalist. Essentially, the comment becomes a garbage pail for
> anything you might want to be in the document that isn't a diff. But I
> certainly buy your argument that it's not clean.

Ok, I'll remove this from the spec and see how it goes.

> >* It would be worth having a specified header to the entire file that
>
> Agreed. I was going to dump that into a comment, but if we ditch the
> idea of the universal comment, then some type of header is
> appropriate. Spec. it out.

Will do.

> >* What should happen if the user has a column called "<ROW_NUM>"?
>
> I do not believe that any RDBMS allows use of <> in object names. It's
> conceivable that a spreadsheet or csv file could have such a column,
> but in those cases is there any character combo that is absolutely
> collision proof?

We can avoid collisions if we spec out quoting rules for user column
names with reserved characters, and avoid restricting the set of
column names available.

>
> >* Do you have thoughts on how NULL values should be represented?
>
> My first (inchoate) thought was to use the convention adopted by most
> delimited DB dump files. There, null values show up as nothing (no
> chars) between adjoining delimiters (e.g. ||). Empty strings would
> need to be quoted (e.g. |''|).

Seems fine. I couldn't do that with CSV, since those two
representations were specified to mean the same thing. There are
definite advantages to making up a new format :-).

Hmm. That means we could possibly also bring back a special value I
wanted to have for COOPY, that meant something like "not specified".
This can allow simpler, more regular diffs. I'll put it in the spec
and see what you think.

[snip]

> >* I think you've basically made an argument for *not* using a tabular
>
> I really didn't want to, but that's what happened after I struggled
> with the details of trying to make tabular work. Fundamentally, I was
> not able to square the circle :-/.

:-)

> >Hmm. How bad would it be to *also* support a line like the following:
> >@ | col1 | col2| col3
>
> OK, so it sounds like your proposal is to support both normalized and
> denormalized column names at the hunk level. I can live with that.

Good.

> Please take a whack at the next draft revision. Also, what do you
> think of the idea of requesting feedback from other projects after we
> have ourselves reached a consensus?

I'm very much in favor.

Best,
Paul

joe

Jan 8, 2011, 6:14:55 AM
to diffkit-user
Paul,

>Each hunk has a side- effect (a modification of the table).

Absolutely.

>Any reference to row numbers would be particularly prone to ordering issues. Any idea what regular diff does?

Undoubtedly each patch tool implements its own best-guess heuristic.
I've read up a bit on the topic, and the spirit of context patch files
is that they are a "best effort" specification, not a mathematically
guaranteed contract. I didn't find any mention that the hunks within a
unidiff file are meant to be completely independent of one another, in
the sense that you could decompose a unidiff file into a collection of
unidiff files, one file per hunk, apply those unidiff files, and
achieve exactly the same effect as if you had applied the original
file. In fact, in looking at the developer notes for one particular
unidiff patch implementation (svn patch), it looks to me as though the
implementation is fairly dependent on all of the hunks within a
unidiff file being applied in a single transaction:

http://www.mail-archive.com/d...@subversion.apache.org/msg00002.html

If you read the description of their patch algorithm, it sounds as
though they apply all of the hunks in the file in a two-pass process. In
the first pass, they use the line numbering and context for each hunk
to identify its location and span within the original document. Since
the original document is maintained intact until the very end of the
whole patch, the hunk identities from the first pass represent a
snapshot of the original state of each hunk. In the second pass they
then apply the patch for each hunk to its original snapshot.
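
The two-pass idea described above can be sketched in a few lines. This is only my reading of the description, not svn's actual code: the hunk layout (`start`, `old`, `new`), the `fuzz` parameter, and the function names are assumptions made for this example, and a document is modelled as a plain list of lines.

```python
def locate(original, hunk, fuzz=3):
    """Pass 1 helper: find where hunk['old'] sits in the untouched original."""
    old = hunk["old"]
    hint = hunk["start"]
    # Try the hinted position first, then progressively more distant offsets.
    for offset in sorted(range(-fuzz, fuzz + 1), key=abs):
        pos = hint + offset
        if pos >= 0 and original[pos:pos + len(old)] == old:
            return pos
    raise ValueError(f"hunk hinted at line {hint} does not match")


def apply_two_pass(original, hunks):
    """Resolve all hunks against the original, then rewrite in one sweep."""
    # Pass 1: every hunk is located in the original snapshot, so no hunk
    # sees the side-effects of another.
    located = sorted(((locate(original, h), h) for h in hunks),
                     key=lambda pair: pair[0])
    # Pass 2: copy the original, substituting each located span.
    result, cursor = [], 0
    for pos, hunk in located:
        if pos < cursor:
            raise ValueError("overlapping hunks")
        result.extend(original[cursor:pos])
        result.extend(hunk["new"])
        cursor = pos + len(hunk["old"])
    result.extend(original[cursor:])
    return result
```

Because positions are resolved before anything is modified, the order in which the hunks are listed does not matter here, which is exactly the property that is lost if each hunk is applied to the already-modified document.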

It sounds like we need to erase any mention in the tDiff spec
regarding the independence of hunks. In the first place, as you point
out, it might not be possible or practical to implement. Secondly, it
doesn't really seem to be consistent with original diff.

Cheers,

Joe

On Jan 3, 10:22 am, Paul F <paul.michael.fitzpatr...@gmail.com> wrote:
> On Dec 22 2010, 9:46 am, joe <trur...@gmail.com> wrote:
>
> > Paul,
>
[snip]

Paul Fitzpatrick

Jan 9, 2011, 12:41:01 PM
to diffki...@googlegroups.com

> It sounds like we need to erase any mention in the tDiff spec
> regarding the independence of hunks. In the first place, as you point
> out, it might not be possible or practical to implement. Secondly, it
> doesn't really seem to be consistent with original diff.
>

Thanks for looking into the details of what diff does, Joe. I've
weakened the language in the draft spec to this:

* When there is a choice in how to express a difference between two
tables, generators are encouraged to choose an expression that minimizes
ordering effects between hunks.
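
To make "ordering effects" concrete, here is a small sketch comparing hunks that name rows by key with hunks that name rows by position. The hunk dictionaries and function names are hypothetical and are not tDiff syntax; the point is only that the keyed form gives one result regardless of application order, while the positional form does not.

```python
from itertools import permutations


def apply_keyed(rows, hunk):
    """Apply a hunk that identifies its row by key (rows: dict key -> value)."""
    rows = dict(rows)
    if hunk["op"] == "delete":
        del rows[hunk["key"]]
    else:  # "update"
        rows[hunk["key"]] = hunk["value"]
    return rows


def apply_positional(rows, hunk):
    """Apply a hunk that identifies its row by current position (rows: list)."""
    rows = list(rows)
    if hunk["op"] == "delete":
        del rows[hunk["index"]]
    else:  # "update"
        rows[hunk["index"]] = hunk["value"]
    return rows


def outcomes(apply, table, hunks):
    """Collect the distinct results of applying the hunks in every order."""
    results = []
    for order in permutations(hunks):
        current = table
        for hunk in order:
            current = apply(current, hunk)
        if current not in results:
            results.append(current)
    return results
```

A generator following the guideline would prefer the first form whenever the table offers a usable key: `outcomes` then returns a single result, whereas "delete row 0" combined with "update row 1" yields two different tables depending on which hunk runs first.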

My current text is here: https://gist.github.com/749949 but it is not
yet in a consistent state. I'll mail the list when I have something
worth reviewing.

Best,
Paul

joe panico

Jan 9, 2011, 12:53:20 PM
to diffki...@googlegroups.com
Thanks Paul, we're making steady progress.

Paul Fitzpatrick

Jan 12, 2011, 4:33:02 PM
to diffki...@googlegroups.com
Hi Joe,

I updated the draft spec for a tabular diff format. HTML view is here:
http://share.find.coop/doc/tdiff_spec_draft.html
Forkable source is here:
https://gist.github.com/749949

What I need now is a tool for diffing spec versions so I can summarize
what changed :-).

* Columns used for identifying rows (primary keys, roughly speaking) are
now identified syntactically.
* Columns used for identifying rows are no longer assumed to be primary
keys.
* For the complete example you gave, I've switched the source table
order based on what classic diff does. I may have this wrong.
* I've spelled out some quoting rules, and taken the opportunity to
shorten some reserved names.

Obviously feel free to revert things I've broken.

Cheers,
Paul

Aaron Schumacher

Feb 4, 2015, 9:41:57 AM
to diffki...@googlegroups.com
Hello all! Am I correct in thinking that DiffKit and tDiff are both quite dead and that any current work would be found in the coopy project? I think I need to dive into that project's documentation, but it would be great to know from this list if DiffKit/tDiff have direct descendants there, or if coopy is entirely independent in its origins. Are there other projects I should look into for distributed collaborative data editing and version control? I'm particularly interested in the datomic data model as a possible approach to these problems. Thanks!

- Aaron

Paul Fitzpatrick

Feb 4, 2015, 10:41:58 AM
to diffki...@googlegroups.com
Hi Aaron,

For the tDiff tabular diff format, Joe and I collaborated on that back
in 2011, and I extended it a little afterwards:
http://share.find.coop/doc/patch_format_tdiff.html
We had been working independently at the time, he on diffkit, I on
coopy, and thought it would be neat to hammer out a mutually supported
format. coopy can use the tdiff format for diffing and patching, but
I've seen no signs of others picking it up. I'm impressed that you
found it :-)

I've been seeing more traction for an alternate, simpler "highlighter"
diff format I cobbled together:
http://dataprotocols.org/tabular-diff-format/
It is optimized for display as a table, which tends to be just way
easier to read, at least for the communities I work with. coopy
supports this format too for diffing and patching, as does daff, a newer
project of mine.
https://github.com/paulfitz/daff
daff is a simplification of coopy, trying to get at the essentials and
make them available nicely packaged for as many languages as possible.
So far there are javascript, python, ruby, and php packages. (Btw, if
there's any java packaging guru listening, I have a java version of daff
that I'd love to package, but would need a hand)

I have no insight into the current status of DiffKit. For other
projects, I imagine you are aware of dat:
https://github.com/maxogden/dat/
It isn't, as far as I can tell, actually useful for anything but one-way
distribution right now, but they hope to eventually get there and talk a
lot about the distributed case.

The datomic data model is definitely very interesting!

Cheers,
Paul

On 02/04/2015 03:41 PM, Aaron Schumacher wrote:
> Hello all! Am I correct in thinking that DiffKit and tDiff are both
> quite dead and that any current work would be found in the coopy
> <http://share.find.coop/> project? I think I need to dive into that
> project's documentation, but it would be great to know from this list
> if DiffKit/tDiff have direct descendants there, or if coopy is
> entirely independent in its origins. Are there other projects I should
> look into for distributed collaborative data editing and version
> control? I'm particularly interested in the datomic data model
> <http://www.infoq.com/articles/Datomic-Information-Model> as a
> possible approach to these problems. Thanks!
>
> - Aaron

Paulo Aurelio

Oct 14, 2015, 4:17:47 PM
to diffkit-user, j...@panmachine.biz
Hi Joe, How are you?

I would like to congratulate you on the Diffkit project. I've been studying it for the last week.

Do you have any news about the project? Are you going to continue working on it?

I am very interested in using Diffkit.

Thanks.
Paulo.

Paulo Aurelio

Oct 19, 2015, 9:19:13 PM
to diffkit-user
Hi Paul. I have read some code from your projects such as coopy and daff. Congratulations... it seems that you have continued Joe's work on diffkit and made great improvements... really cool...

Do you have any idea whether you or Joe will provide the tDiff output format for diffkit?

Another question is about daff. Is it possible to use it to read data directly from a database, like diffkit does?

Thanks in advance...
Paulo

Paul Fitzpatrick

Oct 20, 2015, 9:46:50 AM
to diffki...@googlegroups.com
Hi Paulo,

I haven't heard from Joe for a long time, and haven't seen any changes
to diffkit since 2011. Daff can't yet work directly from a database,
no (with the tiny exception of sqlite). Coopy can handle PostgreSQL
and MySQL.

Cheers,
Paul