Merged blank node identifiers


Stian Soiland-Reyes

Jan 20, 2017, 8:35:45 AM
to Web Data Commons
Hi, I had a look at the html-embedded-jsonld.list from the announced October 2016 crawl, and looked at a randomly picked file:


This seems to reuse the same blank node identifiers again and again, munging together statements about unrelated nodes. For instance:



Looking at the file we see an incremental use of _:b0, _:b1, _:b2 etc. before then starting again at _:b0.

So it seems these data files are not really useful as RDF files, as all statements that share an identifier (e.g. every _:b0) would be merged into a single blank node.




Stian Soiland-Reyes

Jan 20, 2017, 8:40:41 AM
to Web Data Commons
I know they are in different graphs, but that's just a workaround for those writing queries. Blank nodes with the same identifier are the same node across graphs within a dataset, and each .nq file represents a (sub)dataset.


> Blank nodes can be shared between graphs in an RDF dataset.

Robert Meusel

Jan 21, 2017, 11:58:46 AM
to Web Data Commons
Hi Stian,

Your observation is totally right. This issue stems from how the extraction is done (in parallel) and exists for all the different kinds of structured data, not only for embedded JSON-LD. It also affects the former extractions.

In order to make the blank nodes unique across and within the files, you need to combine them with the URL (fourth part of the quad).
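A minimal sketch of that workaround in Python, under a few assumptions: every line is a complete quad whose last term is the page URL in angle brackets, and blank node labels look like _:b0. The function name and the label scheme (a short SHA-1 prefix of the URL) are made up for illustration and are not part of the WDC tooling:

```python
import hashlib
import re

# One N-Quads term at a time: IRIs and quoted strings are matched so they can be
# skipped; only a bare blank node label fills group 1.
TERM = re.compile(r'<[^>]*>|"(?:[^"\\]|\\.)*"|_:(\w+)')
# Assumption: the graph URL is the last <...> term before the closing " ."
GRAPH = re.compile(r"<([^<>]*)>\s*\.\s*$")


def uniquify_bnodes(line: str) -> str:
    """Combine every blank node label in one quad with a hash of its graph URL."""
    graph = GRAPH.search(line)
    if graph is None:
        return line  # no graph term found; leave the line untouched
    digest = hashlib.sha1(graph.group(1).encode("utf-8")).hexdigest()[:12]

    def rewrite(match: re.Match) -> str:
        if match.group(1) is None:
            return match.group(0)  # IRI or quoted string: keep as-is
        return f"_:g{digest}_{match.group(1)}"

    return TERM.sub(rewrite, line)
```

Applied line by line, the same _:b0 from two different pages ends up as two different labels, while all statements from one page still agree with each other.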

I hope this helps,
Robert

Stian Soiland-Reyes

Jan 21, 2017, 3:19:46 PM
to web-data...@googlegroups.com
Thanks, presumably this will be corrected for the next release..?

Perhaps an improvement to the extraction is to embed something like a UUID per parsed document in the bnode identifier? You could do UUID v3/v5 hash of the graph uri so you get consistent bnode identifiers from the same document.

(As a side note this would also allow you to arbitrarily split/join the NQ files as long as you keep graph statements gathered)
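To illustrate the UUID idea, here is a rough sketch using Python's standard uuid module. The function name bnode_label and the exact label layout are only assumptions for the example, not what the extraction code does:

```python
import uuid


def bnode_label(graph_uri: str, local_id: str) -> str:
    """Build a blank node label that is stable per document and distinct across documents."""
    # UUID v5 is a SHA-1 based hash of (namespace, name), so the same graph URI
    # always gives the same UUID, no matter which worker parses the document.
    doc_uuid = uuid.uuid5(uuid.NAMESPACE_URL, graph_uri)
    # .hex drops the dashes; the leading "b" keeps the label starting with a letter.
    return f"_:b{doc_uuid.hex}_{local_id}"
```

Because the label depends only on the graph URI and the per-document counter, re-running the extraction on the same document would give the same identifiers.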

Btw, what happens if the source document uses a named JSON-LD @graph?


Robert Meusel

Jan 22, 2017, 5:54:12 AM
to Web Data Commons, soilan...@cs.manchester.ac.uk
This might be something we should consider for the next extraction. In the meantime, it should not hinder you from applying the workaround just discussed. In case you want to contribute, you can always create a branch and open a pull request in the repo. The code is freely available.
 
In case a named JSON-LD graph is used, e.g.:

{
  "@type": "Person",
  "name": "Robert Millar"
}


The context is resolved and the vocabulary it specifies is used to create the quads, e.g.: http://xmlns.com/foaf/0.1/Person

Cheers,
Robert