Download nanopubs in bulk – best approach?


Piotr Sowiński

Feb 20, 2023, 4:10:35 AM
to nanopu...@googlegroups.com

Hi!

I'm looking to use the nanopub dataset as a whole for research on RDF engines' performance, because the dataset is big and clearly licensed. However, the only interfaces I found for downloading the nanopubs are various APIs and apps such as this one (https://np.petapico.org/nanopubs.html?page=10904), which only provides compressed packages of 1000 nanopubs each.

I don't want to unnecessarily overload anyone's servers by making thousands of API calls, so – is there any way to download "the whole thing"? Or should I just use the existing APIs? I would be grateful for any hints. :)

-- 
Piotr Sowiński
Systems Research Institute 
Polish Academy of Sciences

Tobias Kuhn

Feb 21, 2023, 8:09:38 AM
to psowi...@gmail.com, nanopu...@googlegroups.com
Hi Piotr,

Great to hear that you find this nanopublication dataset interesting.

There is a complete dump of the nanopublications from a few years back
available on Zenodo: https://zenodo.org/record/1213293

It's a bit outdated, but in terms of number of nanopublications, it
doesn't make too much of a difference (because of some early large
datasets).

But we should make an update of this dump some time soon.

However, you should also feel free to issue a few thousand API requests to the server network. That's what these servers do anyway to keep each other updated, so you don't need to worry too much that you are overloading them.

Best regards,
Tobias



Piotr Sowiński

Feb 21, 2023, 9:37:52 AM
to Tobias Kuhn, nanopu...@googlegroups.com
Thank you Tobias, the dump will be perfect for my purposes :)

Thank you also for the clarification about APIs and the state of the
dataset.

--
Piotr Sowiński

jhpoelen

Feb 21, 2023, 10:10:50 AM
to Nanopublications
Hey y'all -

Glad to hear the chatter on dumps and data packages.

My follow-up question -

How do you recommend to cite, and verify, the collection of nanopubs and the process that published them?

Note that most DOIs do not offer a way to verify the authenticity of the cited content.

thx,

-jorrit 

Tobias Kuhn

Feb 22, 2023, 1:35:00 AM
to Jorrit Poelen, Piotr Sowiński, nanopu...@googlegroups.com
Hi Jorrit and all,
> How do you recommend to cite, and verify, the collection of nanopubs and
> the process that published them?

Good question!

As a concrete example of how this can be done: a while back, around the
same time when the data dump was last published, I experimented with
that and published a paper about how to cite and verify nanopub datasets
and subsets thereof:
https://link.springer.com/chapter/10.1007/978-3-319-68288-4_26

See references 25 to 28, where I cite several such datasets. The Trusty
URIs of these index nanopubs in the references allow for recursively
verifying the whole set of nanopublications.
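For the curious, the core of that verification is a content hash embedded in the URI itself. A much-simplified Python sketch of the idea follows; the real Trusty URI spec first canonicalizes the RDF content and defines an exact encoding and module prefixes, so treat this as an illustration of the principle only:

```python
import base64
import hashlib

def trusty_style_hash(content: bytes) -> str:
    """Base64url-encoded SHA-256 digest, roughly how Trusty URIs derive
    their artifact code (the real spec canonicalizes RDF first)."""
    digest = hashlib.sha256(content).digest()
    return base64.urlsafe_b64encode(digest).decode().rstrip("=")

def verify(uri: str, content: bytes) -> bool:
    """Check that the URI's trailing artifact code matches the content.
    Assumes an 'RA'-style module prefix on the code."""
    code = uri.rsplit("/", 1)[-1]
    if code.startswith("RA"):
        code = code[2:]
    return code == trusty_style_hash(content)
```

Because an index nanopub's content lists the (trusty) URIs of its members, verifying the index hash transitively commits to every member, which is what makes the recursive verification of a whole dataset possible.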

The process that created these nanopublications is supposed to be in the
provenance parts of the individual nanopublications. But this part is
under-developed in practice, and we could learn from the system you
demonstrated in one of the last nanopub calls, where you automatically
record such processes and represent them in RDF (if I remember correctly).

Regards,
Tobias

Evelo, Chris (BIGCAT)

Feb 22, 2023, 10:53:24 AM
to Tobias Kuhn, Jorrit Poelen, Piotr Sowiński, nanopu...@googlegroups.com
Dear all,

This is a very important and difficult challenge indeed.

Typically there are two questions:
1) What is the provenance (i.e. where does this come from and how was it extracted)?
2) What is the evidence that the statement actually is correct?

My problem with that is what I think of as provenance trails. An example might help explain.
- Suppose we have a nanopub parallel to a mapping in BridgeDb that says "this gene codes for this protein"
- So the provenance we provide is something like "we took this from ENSEMBL vx.y and used this script"
- Evidence that that mapping *is indeed in ENSEMBL* can come from an independent script or somebody actually checking
- But.... that does not give me evidence that the statement in ENSEMBL is in fact correct
- For that I would need the provenance for that specific ENSEMBL build and evidence trusted by their experts
- And that would probably lead you through a few extra steps.
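Such a first step could be written down as a nanopublication along these lines. This is a hand-written TriG sketch; the URIs, the codesFor predicate, and the script node are illustrative placeholders, not actual BridgeDb output:

```trig
@prefix np: <http://www.nanopub.org/nschema#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix ex: <http://example.org/np1#> .

ex:Head {
  ex:np1 a np:Nanopublication ;
    np:hasAssertion ex:assertion ;
    np:hasProvenance ex:provenance ;
    np:hasPublicationInfo ex:pubinfo .
}

ex:assertion {
  # "this gene codes for this protein"
  <http://identifiers.org/ensembl/ENSG00000139618>
    <http://example.org/codesFor>
    <http://identifiers.org/uniprot/P51587> .
}

ex:provenance {
  # data provenance: where the mapping was taken from and how
  ex:assertion prov:wasDerivedFrom <http://www.ensembl.org/> ;
    prov:wasGeneratedBy ex:extractionScript .
}

ex:pubinfo {
  ex:np1 prov:generatedAtTime
    "2023-02-22T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
}
```

The provenance graph here records only the extraction step ("we took this from ENSEMBL and used this script"), which is exactly the layer that stops short of evidence that the statement in ENSEMBL itself is correct.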

Now my problem is not only that this is complex; it also actually lowers trust. If I were to evaluate this nanopub as a researcher and did not find the evidence I was looking for ("is this correct?"), but instead found a technical statement that "yes, it was imported correctly", I would be disappointed. Any ideas how to solve this?

Best, Chris


Barend Mons

Feb 22, 2023, 11:07:01 AM
to Evelo Chris (BIGCAT), Erik Schultes, Tobias Kuhn, Jorrit Poelen, Piotr Sowiński, nanopu...@googlegroups.com
Hi Chris, I see the potential complexity here, and especially the layered evidence trail.
However, I think the solution may also be in that same 'layered' approach.
After all, let's enable computers to do the same thing as we do as humans (but more systematically and consistently ;)

- When I first look at a knowledge graph (for instance in Euretos) I do not check the provenance (sources) at first, but 'assume' that they have done their mining and curation correctly.
- If there is any reason for doubt, the first step is to simply call up the sources they provide for each cardinal assertion (triple in the graph), and if the evidence that the triple is correct is convincing for me, I go on taking it seriously.
- Only when there is reasonable doubt that the triple may be problematic or wrong do I dig further and make up my own opinion about it.

Now machines could do that (if we structure everything correctly) more systematically, but they do not have our 'intrinsic' background knowledge to make an informed judgement. If they see "hydroxychloroquine treats Covid" and they see 'The Lancet' among the provenance (the mining is correct, the journal is reputable), how would they know it is still crap?

Well, there are many ways to think of, but one very simple one is human annotation of cardinal assertions: if 20 'ORCIDs' have all refuted a triple (and only Trump and Majory Green still believe it), the machine (and people) will scratch their heads….

My two cents
B

Prof. dr. Barend Mons
LUMC & LACDR
Scientific Director GO FAIR Foundation
President of CODATA
Founding editor of FAIR Connect


ZOOM: 
Pw: 79779

Visiting address: 
1st floor Poortgebouw-Noord, room 050F
Rijnsburgerweg 10
2333 AA Leiden
The Netherlands

E-mail: baren...@go-fair.org
Mobile: +31 6 24879779
Skype: dnerab
Website: https://www.gofair.foundation
ORCID: 0000-0003-3934-0072



Evelo, Chris (BIGCAT)

Feb 22, 2023, 11:13:00 AM
to Barend Mons, Erik Schultes, Tobias Kuhn, Jorrit Poelen, Piotr Sowiński, nanopu...@googlegroups.com

Thanks Barend,

That is helpful, and I agree.

However, it also links to what I see in research data management in general. We focus on getting the data "in", far less on how to use it again. Yes, we could build a tool that follows that provenance trail and discovers the evidence at the end. Or the lack thereof… Like, being able to quickly follow the trail for "mouth masks are not a good idea during a pandemic" would probably have helped our RIVM a lot 😊. So in general, I would like to see a bit more effort on tools that reuse what we create.

Best, Chris

jhpoelen

Feb 22, 2023, 3:53:00 PM
to Nanopublications

For whatever it is worth:

In my mind there are at least two kinds of provenance -

1. (data provenance or origin) how/when/by whom/by what was the digital data sourced or transformed. E.g., I used curl to download https://example.org/data.zip, then used unzip to extract table.tsv from the zip file. Then I generated a new digital object using "cut -f1".

2. (knowledge provenance or origin) what is the evidence cited to support a specific claim? Who supported the claim, who refuted the claim? E.g., Dr Dena Abbasi claims that dog treats are an effective way to pacify dogs and cites various sources, all of them published in journals with an impact factor > 2 (whatever that means these days). Also, Dr Lupus made a similar claim 20 years ago. The claim has not been refuted, except by Felix, the neighbor's cat.

So, 1. is really about resource locations, transformations, and bits and bytes.

2. relates to associations between claims, people, published knowledge, etc.

In my mind these two are very different, and likely require different approaches.

And I have an interest in 1., and have built tools (e.g., https://github.com/bio-guoda/preston) to help capture data provenance. In my mind, 2. is hard to solve at scale (using robots) unless 1. is tackled at scale (using robots).
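Kind 1. can largely be captured with content hashes alone. Here is a minimal, Preston-inspired sketch of that idea; Preston itself records provenance as RDF, so this only shows the core mechanism of hash-linking processing steps:

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Content identifier: hex SHA-256 of the bytes."""
    return hashlib.sha256(data).hexdigest()

def record_step(activity: str, inputs: list, output: bytes, log: list) -> bytes:
    """Append a provenance record linking the input content hashes to
    the output content hash, then pass the output along."""
    log.append({
        "activity": activity,
        "used": [sha256_hex(i) for i in inputs],
        "generated": sha256_hex(output),
    })
    return output
```

Chaining record_step calls turns the curl/unzip/cut example above into a verifiable trail: each record's "generated" hash reappears as a later record's "used" hash, so any tampering with an intermediate artifact is detectable.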

I am probably repeating stuff that was already said, or am making a common mistake, so I am eager to hear your thoughts on this.

-jorrit

https://jhpoelen.nl

Tobias Kuhn

Feb 23, 2023, 2:33:06 AM
to Jorrit Poelen, Evelo, Chris (BIGCAT), Barend Mons, Erik Schultes, Piotr Sowiński, nanopu...@googlegroups.com
Dear all,

Excellent discussion!

I like Jorrit's distinction between data provenance and knowledge
provenance. It feels very true, but I have a hard time wrapping my head
around how these two relate. They seem to be on different levels about
the same processes, so it's *not* that the overall provenance is a mixed
chain of data provenance and knowledge provenance steps but it's also
*not* that each happens in its own separate provenance trail.

I believe it's more something like a single provenance trail that you
can look at either at the level of data or at the level of knowledge.
The knowledge is encoded in data somehow, and what connects the two is
the interpretation of the data that gives us knowledge. And oftentimes
steps in the provenance trail transform the data while the knowledge
stays the same.

This sounds like something others must have figured out a long time ago,
but I am not aware of any conceptual/formal model that would describe
exactly this distinction of data provenance and knowledge provenance.

Does anyone know of such a model? (assuming my reasoning above is sound,
which it might not be...)

Regards,
Tobias

