RDF fixity


Stefano Cossu

May 1, 2018, 6:40:51 PM
to fedora-tech
Hello,
Looking at both the current Fedora 4 documentation and the Fedora specs,
I see that there is a very clear notion of fixity for non-RDF sources
(LDP-NR, AKA binaries) but not for RDF documents (LDP-RS).

This prompts several questions:

- Is integrity of RDF data as important as non-RDF data integrity?
- If the above is true (I personally assume so), how can RDF data
  integrity be monitored?
- If calculating a checksum is the best way to monitor the integrity of
  an LDP-RS, how can the challenges that RDF presents in this regard be
  addressed:
  - How shall the triples be ordered for calculating the checksum?
  - Shall the checksum include inbound links?
  - Shall it include fragment URIs (which are to be regarded as
    separate resources but tied to the lifecycle of the parent) or even
    other subjects stored in the LDPR graph?
- Has anyone ever discussed a possible canonical representation of LDP
  resources (which have their own specificities compared to plain RDF)
  that may help define a convention for calculating a digest from them?
  If not, shall this community start that discussion?
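
To make the ordering question concrete, here is a minimal sketch, not an agreed convention. It assumes a graph with no blank nodes, represents each triple as three N-Triples-style term strings, and hashes the sorted, de-duplicated lines with SHA-256. The function name `naive_graph_digest` and the example URIs are illustrative assumptions.

```python
import hashlib


def naive_graph_digest(triples):
    """SHA-256 over sorted triple lines. Assumes the graph has no blank nodes."""
    # Sorting removes the dependency on the order in which triples arrive;
    # the set removes duplicate statements.
    lines = sorted({f"{s} {p} {o} ." for s, p, o in triples})
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


g1 = [
    ("<http://example.org/a>", "<http://example.org/p>", '"x"'),
    ("<http://example.org/a>", "<http://example.org/q>", "<http://example.org/b>"),
]
g2 = list(reversed(g1))  # same statements, different order

print(naive_graph_digest(g1) == naive_graph_digest(g2))
```

Sorting makes the digest independent of triple order, but it does nothing about blank node labels or about equivalent literals written with different escapes, which is where the harder questions start.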

Thoughts would be appreciated.

Thanks,
Stefano


--
Stefano Cossu
Director of Application Services, Collections

The Art Institute of Chicago
116 S. Michigan Ave.
Chicago, IL 60603
312-499-4026

Esmé Cowles

May 1, 2018, 7:02:13 PM
to fedor...@googlegroups.com
Stefano-

Given the vagaries of not just order, but other variabilities in RDF expressions (blank nodes, quoting options, etc., etc.) I think it would be challenging to define a canonical form of RDF. But I do think it could be done, and I agree that having a canonical form and taking the checksum of it would be useful. Especially if the canonicalization algorithm were sufficiently well-documented that clients could reliably canonicalize and checksum locally, it seems like it could provide the same kind of fixity guarantees that LDP-NRs have.

-Esmé
> --
> You received this message because you are subscribed to the Google Groups "Fedora Tech" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to fedora-tech...@googlegroups.com.
> To post to this group, send email to fedor...@googlegroups.com.
> Visit this group at https://groups.google.com/group/fedora-tech.
> For more options, visit https://groups.google.com/d/optout.

Benjamin Armintor

May 1, 2018, 7:32:59 PM
to fedor...@googlegroups.com


reesj

May 2, 2018, 10:40:18 AM
to Fedora Tech
Interesting idea that I think speaks to our notions of a repository's purpose.

Fixity in the context of binaries to me is the realm of preservation (my binaries have not changed), but for metadata (and maybe binaries) perhaps the question is about authenticity (this RDF is an untampered product of my repository at a specific point in time).

In our current context metadata (bibliographic) is mutable and the repository is not the authoritative source, but what happens when the metadata is part of a research object, etc. that is meant to be immutable?

Stefano Cossu

May 2, 2018, 5:26:02 PM
to Fedora Tech
Thanks for the additional ideas and sources (I have only given the paper a quick scan so far, but it seems very interesting; I will get into the details soon). It seems there are additional concerns that I did not mention in my initial message:

1. Blank nodes
2. Concerns over authenticity of the source

I also ran into another paper on the subject [1], which I actually found by looking at an implementation of the described algorithm [2], which goes by the name of RGDA1.

I will get back to this thread once I finish reading these papers, but so far I am glad that there is interest in the topic (as I imagined there would be in a community that deals with RDF *and* preservation...).

Best,
Stefano

[1] http://www.hpl.hp.com/techreports/2003/HPL-2003-235R1.pdf
[2] http://rdflib.readthedocs.io/en/stable/apidocs/rdflib.html#rdflib.compare.IsomorphicGraph

Benjamin J. Armintor

May 2, 2018, 5:32:19 PM
to fedor...@googlegroups.com
Just as an aside: blank nodes are a specific concern of the paper I linked; I think you'll be interested in it.

--

Simeon Warner

May 7, 2018, 7:26:07 PM
to fedor...@googlegroups.com
This is a really good question and brings to mind a number of concerns I
have with linked data (as opposed to RDF per se) for preservation. As
Stefano notes, there is an implementation of RGDA1 bnode
canonicalization in Python's rdflib.compare._TripleCanonicalizer and I
leveraged that successfully when making an RDF diffing tool for another
project [1].

That still leaves the question of whether you want to be embedding the
server URI in persistent data. I think it would be interesting to
explore an implementation of Fedora that could keep the server URI out
of the data -- we (as a community) are really terrible at keeping
repository URIs stable, let alone back-end server URIs!

Cheers,
Simeon

[1] https://github.com/zimeon/rdiffb

Stefano Cossu

May 8, 2018, 1:23:03 PM
to fedor...@googlegroups.com, Simeon Warner
Simeon,
I too took a crack at this implementation in lakesuperior [1] and it
seems to work, sending a Digest header for LDP-NR as well as for LDP-RS
resources; however the RDFLib implementation does not produce a digest
in a defined format, which is a limitation at the moment.
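
For illustration only, a digest in a defined format could be built by hashing the bytes of an already canonicalized serialization and emitting it in the `algorithm=value` style of the Digest header. This is a sketch under assumptions: `digest_header` is a hypothetical helper, not part of lakesuperior or RDFLib, and the choice of SHA-256 with base64 encoding is an assumption about the desired format.

```python
import base64
import hashlib


def digest_header(canonical_bytes: bytes) -> str:
    """Return a Digest header value for an already canonicalized payload."""
    raw = hashlib.sha256(canonical_bytes).digest()
    # Assumed format: algorithm name, "=", base64 of the raw digest bytes.
    return "SHA-256=" + base64.b64encode(raw).decode("ascii")


# Hypothetical canonical serialization of a one-triple graph.
payload = b"<http://example.org/a> <http://example.org/p> <http://example.org/b> .\n"
print("Digest:", digest_header(payload))
```

The point is that any client holding the same canonical bytes can recompute the value and compare it, which an implementation-specific internal hash does not allow.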

The question you bring up about the server URI is a very good one and I
have some ideas in this regard but no definitive position. I think this
is one of the questions that the broader community around LDP should be
consulted about.

The best approach I could think of hinges on the concept of an LDP
resource as a self-contained, portable graph. By portable I mean that
it can be represented with URIs relative to the resource URI; in fact,
the LDP best practices encourage that [2]. Using relative URIs does not
require consistency of internal or public identifiers. Moreover, two
resources in the same repository, or in different repositories, with
different URIs can be checked for identity.

My question is: if only the subject changes, can two such graphs with
different URIs and in different locations be considered identical? By
the same logic, shall http://localhost:8000/resource and
https://my-public-dn.org/resource (the same LDP resource accessed
directly on the server and on a reverse proxy) be considered identical?
Again, I think this might be a question larger than this group.
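
One way to picture the portable-graph idea is the following sketch, which rewrites the resource URI and its fragment URIs as relative references before hashing. It assumes no blank nodes and N-Triples-style term strings; `relativize`, `portable_digest` and `sample` are hypothetical names, and the Dublin Core predicates are only example data. Whether two such graphs should count as identical remains the open question above.

```python
import hashlib


def relativize(term, base):
    """Rewrite the resource IRI and its fragment IRIs relative to `base`."""
    prefix = "<" + base
    # Only the resource itself ("<base>") and its fragments ("<base#x>")
    # are rewritten; every other term is left untouched.
    if term.startswith(prefix) and term[len(prefix):][:1] in (">", "#"):
        return "<" + term[len(prefix):]
    return term


def portable_digest(triples, base):
    """SHA-256 over sorted triple lines after relativizing against `base`."""
    lines = sorted({
        " ".join(relativize(t, base) for t in triple) + " ."
        for triple in triples
    })
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def sample(base):
    """The same two statements, expressed against a given resource URI."""
    return [
        (f"<{base}>", "<http://purl.org/dc/terms/title>", '"Example"'),
        (f"<{base}#part>", "<http://purl.org/dc/terms/isPartOf>", f"<{base}>"),
    ]


local_base = "http://localhost:8000/resource"
public_base = "https://my-public-dn.org/resource"

same = portable_digest(sample(local_base), local_base) == portable_digest(
    sample(public_base), public_base
)
print(same)
```

With this convention the digest depends only on the statements and not on where the resource is served from; the cost is that the digest alone can no longer tell the two locations apart.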

Best,
Stefano

[1] https://github.com/scossu/lakesuperior/pull/74
[2] https://www.w3.org/TR/ldp-bp/#use-relative-uris

Jared Whiklo

May 8, 2018, 1:55:41 PM
to fedor...@googlegroups.com
Because "someone" posted an interesting paper on producing RDF Digests
and then went ahead and wrote an implementation in Ruby but didn't
circle back here with that information, I will

https://github.com/barmintor/rdf-digest

Which got me wondering if I could do that in PHP and lo and behold

https://github.com/whikloj/RdfHashing

I will note that with the PR[1] I have opened on Ben's work I was able
to get the same SHA-256 value from both implementations.

It certainly does not help with the question of "what" your graph
contains, but if you can define the graph it can tell you if it matches
a separate graph.

cheers,
jared

[1] https://github.com/barmintor/rdf-digest/pull/1
Jared Whiklo
jwh...@gmail.com
--------------------------------------------------
For every action, there is an equal and opposite government program.


Stefano Cossu

May 8, 2018, 5:11:18 PM
to fedor...@googlegroups.com

I knew barmintor hid something under his sleeve...

To add to the conversation, see attached discussion with one of the authors of the paper mentioned in Ben's implementation:

Dear Stefano Cossu,

In regard to your questions:

1. Are you aware of more recent developments and research on the topic?

Yes - there were problems found by Dr. Miquel Ceriani that render the algorithm in its proposed form invalid; please see the counter examples given after the signature of this email.

2. Are you aware of any existing implementation in Python of your algorithm or a more recent version?

Unfortunately I have no such knowledge, sorry!
Would you mind keeping us in the loop if you try to correct the algorithm and / or data structure?

Thank you and kind regards
Edzard Hoefig


Prof. Dr.-Ing. Edzard Hoefig
Beuth University
Luxemburger Str. 10, 13353 Berlin
Room B 216, Phone: +49 (0) 30 4504 2784


Dr. Ceriani counter example 1

I will represent the RDF graphs in Turtle syntax, considering the
prefix ex: bound to the fictitious domain http://example.org/.

# Counter Example 1

Graph 1:

   _:a ex:p _:c .
   _:b ex:p _:c .
   _:c ex:p _:c .

Graph 2:

   _:a ex:p _:b .
   _:b ex:p _:a .
   _:c ex:p _:c .

The two RDF graphs represented are different (and not just for the
used labels) but the algorithm generates in both cases the following
string (spaces added for readability):

   { * [ - ( * [ - ] ) ] } { * [ - ( * [ - ] ) ] } { * [ - ] }


Counter Example 1 uses self-referential blank nodes. This seems to be something that we didn’t take into account. It might be easily fixed by introducing a special symbol (we are proposing „!“) for the original subject node when constructing the transitive blank node labels. Just record the original node before line 4 in Algorithm 1 and return the special symbol instead of an empty string when and only when you terminate on the original node on line 4 in Algorithm 2.
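
That Graph 1 and Graph 2 really are different can be checked by brute force. The sketch below tries every relabeling of blank nodes, which is only practical for tiny graphs and is not the algorithm from the paper; `isomorphic` and `relabeled` are illustrative names.

```python
from itertools import permutations


def isomorphic(g1, g2):
    """Brute-force check: try every bijection between blank node labels."""
    def bnodes(g):
        return sorted({t for s, _, o in g for t in (s, o) if t.startswith("_:")})

    b1, b2 = bnodes(g1), bnodes(g2)
    if len(g1) != len(g2) or len(b1) != len(b2):
        return False
    target = set(g2)
    for perm in permutations(b2):
        mapping = dict(zip(b1, perm))
        # Non-blank terms are not in the mapping and map to themselves.
        renamed = {(mapping.get(s, s), p, mapping.get(o, o)) for s, p, o in g1}
        if renamed == target:
            return True
    return False


graph1 = {("_:a", "ex:p", "_:c"), ("_:b", "ex:p", "_:c"), ("_:c", "ex:p", "_:c")}
graph2 = {("_:a", "ex:p", "_:b"), ("_:b", "ex:p", "_:a"), ("_:c", "ex:p", "_:c")}
# Graph 1 again, with different blank node labels only.
relabeled = {("_:x", "ex:p", "_:z"), ("_:y", "ex:p", "_:z"), ("_:z", "ex:p", "_:z")}

print(isomorphic(graph1, relabeled), isomorphic(graph1, graph2))
```

Any digest scheme that is meant to identify graphs up to blank node relabeling has to agree with this check: equal for `graph1` and `relabeled`, different for `graph1` and `graph2`.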

Dr. Ceriani counter example 2

# Counter Example 2

Graph 3:

   ex:A ex:p _:ab .
   ex:B ex:p _:ab .
   ex:C ex:p _:c .

Graph 4:

   ex:A ex:p _:a .
   ex:B ex:p _:bc .
   ex:C ex:p _:bc .

These two RDF graphs are also different but the algorithm will generate in
both cases the following string:

   { * } { * } { ex:A [ - ( * ) ] } { ex:B [ - ( * ) ] } { ex:C [ - ( * ) ] }

Counter Example 2 is a different beast. We still believe that the general idea of constructing the transitive neighbourhood for identification of a blank node is correct, but this one shows that we cannot do it in a „forward“ manner (by traversing along the predicates), as a blank node is obviously also characterized by its incoming edges (see attached picture). We would need to use a completely different data structure for storing the RDF data (a graph structure that allows for „backward“ traversal along the predicates).
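
The role of incoming edges in Counter Example 2 can be illustrated with a small sketch that labels each blank node by the (subject, predicate) pairs pointing at it. This is not the corrected algorithm, only a demonstration under the assumption that every subject is an IRI; `incoming_signatures` is a hypothetical helper.

```python
def incoming_signatures(graph):
    """Describe each blank node by the sorted (subject, predicate) pairs that point at it."""
    sigs = {}
    for s, p, o in graph:
        if o.startswith("_:"):
            sigs.setdefault(o, []).append((s, p))
    # Drop the blank node labels and keep only the label-independent signatures.
    return sorted(tuple(sorted(pairs)) for pairs in sigs.values())


graph3 = [("ex:A", "ex:p", "_:ab"), ("ex:B", "ex:p", "_:ab"), ("ex:C", "ex:p", "_:c")]
graph4 = [("ex:A", "ex:p", "_:a"), ("ex:B", "ex:p", "_:bc"), ("ex:C", "ex:p", "_:bc")]

print(incoming_signatures(graph3))
print(incoming_signatures(graph4))
```

Seen from the blank nodes "backward", Graph 3 groups ex:A with ex:B while Graph 4 groups ex:B with ex:C, so the two signature lists differ even though the forward-only strings are identical.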

On 07.05.2018 at 18:36, Stefano Cossu <sco...@artic.edu> wrote:

Dear Edzard Höfig and Ina Schieferdecker,
I have read with interest your paper about RDF hashing [1]. I am
currently interested in implementing a similar algorithm in Python, thus
I have a couple of questions:

1. Are you aware of more recent developments and research on the topic?
2. Are you aware of any existing implementation in Python of your
algorithm or a more recent version?

I am currently using RDFLib's implementation [2] which seems to have
some issues with producing a SHA256-compatible output [3], so

Thank you for your help.

Sincerely,
Stefano Cossu

[1] http://ceur-ws.org/Vol-1259/proceedings.pdf#page=65
[2] http://rdflib.readthedocs.io/en/stable/apidocs/rdflib.html#rdflib.compare.IsomorphicGraph
[3] https://github.com/RDFLib/rdflib/issues/825





Stefano Cossu

May 8, 2018, 5:15:55 PM
to fedor...@googlegroups.com, Jared Whiklo
And, great to know that there is a PHP implementation too. I might as
well try to copycat it in Python too, after I figure out how prof.
Ceriani's issue may affect real-world scenarios in Fedora.

Stefano

