Lost Green OA articles from Semantic Scholar

Richard Orr

Apr 27, 2021, 3:48:50 PM
to Unpaywall discussion

Hello Unpaywall users,


Over the last two weeks, one of the largest repositories we index, Semantic Scholar, removed most of the articles it had been hosting. The end result for Unpaywall is that about 1 million formerly Green OA articles are now Closed. This is about 12% of all Green OA. We're working on finding new locations for as many articles as we can.


The total number of articles removed from Semantic Scholar was about 8 million, but most of them are still OA because we had other locations.


Richard


--
Richard Orr
Lead Developer, Unpaywall
Our Research: We build tools to make scholarly research more open, connected, and reusable—for everyone.

Bryan Newbold

Apr 27, 2021, 4:09:05 PM
to Richard Orr, Unpaywall discussion
Many Open Access papers hosted on pdfs.semanticscholar.org have been
crawled and included in the Internet Archive Wayback Machine. Linking
to an archived snapshot might be an easy workaround for folks in some
cases.

For example, this direct PDF URL now redirects to a Semantic Scholar
landing page (HTML):

http://pdfs.semanticscholar.org/d506/6ad240a6f27a24efdbdf224fd8b669c25e55.pdf

But there is an archived snapshot from 2019:

http://web.archive.org/web/2019*/http://pdfs.semanticscholar.org/d506/6ad240a6f27a24efdbdf224fd8b669c25e55.pdf

If anybody has any questions about finding this type of capture via
Wayback API, or accessing the original capture content from
web.archive.org without the archival context HTML header (e.g., for text
and data mining), I am happy to help.
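
As a rough sketch of the kind of lookup described above (an illustration, not necessarily how Internet Archive staff would do it): the public availability endpoint at archive.org/wayback/available returns the closest capture for a URL as JSON, and adding the `id_` modifier after the timestamp in a web.archive.org URL requests the capture as originally served, without the Wayback banner. The helper names below are hypothetical, and the code only builds URLs and parses a response body, so fetching is left to whatever HTTP client you prefer.

```python
import json
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url, timestamp=None):
    """Build the request URL asking for the capture closest to `timestamp`."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp  # e.g. "2019" or "20190101"
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return the timestamp of the closest available capture, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"]


def raw_capture_url(timestamp, url):
    """URL for the original bytes of a capture (note the `id_` modifier)."""
    return f"http://web.archive.org/web/{timestamp}id_/{url}"
```

A typical flow would be: fetch `availability_query(pdf_url, "2019")`, pass the response body to `closest_snapshot`, and, if it returns a timestamp, download `raw_capture_url(timestamp, pdf_url)`.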

Maybe also worth noting that we have mirrored copies of both the
Unpaywall corpus and the Semantic Scholar metadata corpus ("Semantic
Scholar Open Research Corpus") as datasets on archive.org, which makes
historical comparisons possible:

https://archive.org/details/ia_biblio_metadata

Please note and respect the licensing and attribution notes for all of
these bulk metadata datasets.

--bryan
(Internet Archive Staff)


d.sm...@herts.ac.uk

Apr 28, 2021, 4:15:17 AM
to Unpaywall discussion
Hi Richard,

Do you know why these articles were removed from Semantic Scholar?

Just to spell it out so it's clear in my mind, 8 million articles have disappeared from Semantic Scholar, most of which Unpaywall had alternative locations for and so which remain available as Green OA items, but 1 million are lost from Semantic Scholar and no alternative locations for these exist. And that 1 million represent 12% of all total global green OA?

Have I understood your message correctly?

Best wishes, Danny

Danny Smith
University of Hertfordshire

Jason Priem

Apr 28, 2021, 3:07:07 PM
to d.sm...@herts.ac.uk, Unpaywall discussion
Hi Danny,
Jason here, also of Unpaywall. My responses are inline below:

On Wed, Apr 28, 2021 at 1:15 AM d.sm...@herts.ac.uk <d.sm...@herts.ac.uk> wrote:
> Hi Richard,
>
> Do you know why these articles were removed from Semantic Scholar?

We talked to them a bit about it, but not enough to give this question a good answer. It's probably best to ask them directly...I wouldn't want to speak for them and get their info wrong.

> Just to spell it out so it's clear in my mind, 8 million articles have disappeared from Semantic Scholar, most of which Unpaywall had alternative locations for and so which remain available as Green OA items, but 1 million are lost from Semantic Scholar and no alternative locations for these exist. And that 1 million represent 12% of all total global green OA?
>
> Have I understood your message correctly?

Yep, correct! As Richard mentioned though, I think we will be able to find new locations for many of those in due time. We'll keep y'all updated on that.
Best,
Jason


--
Jason Priem, cofounder
Our Research: We build tools to make scholarly research more open, connected, 
and reusable—for everyone.

Bianca Kramer

Apr 29, 2021, 5:26:21 AM
to Jason Priem, d.sm...@herts.ac.uk, Unpaywall discussion
Hi Jason, Richard,

Thanks for the heads-up and further explanations. Would you consider releasing a new data dump (earlier than regularly scheduled) given this?

kind regards, Bianca  




Jason Priem

Apr 29, 2021, 11:07:20 AM
to Unpaywall discussion
Hi Bianca,
Perhaps, but probably not right away, because we're still thinking we'll be able to replace a lot of the newly lost content pretty soon. If this turns out to be a "blip" in the amount of OA that's quickly remedied, then doing a data dump in the middle of the blip would of course be counterproductive. A lot will depend on what we find over the next few weeks, in terms of replacing the lost content.
Best,
Jason

Eric Jeangirard

Apr 29, 2021, 11:20:19 AM
to Jason Priem, Unpaywall discussion
Hi!

Considering this massive drop from Semantic Scholar, does it really make sense to name an article "green" whose only known oa_location is on Semantic Scholar?

Eric

Jason Priem

Apr 29, 2021, 11:28:22 AM
to Unpaywall discussion
Hi Eric,
No, it doesn't. So we don't. We're removing the non-functional Semantic Scholar (S2) links from the database. It may take us a few more days to get all those links removed, but once that's done, everything will be up to date, and you'll be able to see the up-to-date article status in the REST API and Data Feed.

An article that used to have only one oa_location, and that location was from S2, will now have zero oa_locations. And since it has zero oa_locations, the oa_status will be listed as "closed," not "green".
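
To make that rule concrete, here is a minimal sketch of its effect on a single record. The record shape is a simplified assumption (a dict with an `oa_status` string and an `oa_locations` list whose entries carry a `url`), the function names are hypothetical, and the logic only covers the case described here. It is not Unpaywall's actual code, and any status other than the "no locations left" case is passed through unchanged.

```python
from urllib.parse import urlparse

S2_DOMAIN = "semanticscholar.org"


def is_s2_url(url):
    """True if the URL's host is semanticscholar.org or a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    return host == S2_DOMAIN or host.endswith("." + S2_DOMAIN)


def drop_s2_locations(record):
    """Return a copy of `record` without S2 locations.

    If no locations remain, the status becomes "closed"; otherwise the
    existing status is left as it was.
    """
    kept = [loc for loc in record["oa_locations"] if not is_s2_url(loc["url"])]
    updated = dict(record, oa_locations=kept)
    if not kept:
        updated["oa_status"] = "closed"
    return updated
```

For example, a record whose only location is on pdfs.semanticscholar.org comes back with an empty `oa_locations` list and `oa_status` set to "closed", while a record that also has a repository copy elsewhere keeps that copy and its existing status.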
j

Eric Jeangirard

Apr 29, 2021, 11:33:16 AM
to Jason Priem, Unpaywall discussion
Thanks a lot for your reply Jason.

Jason Priem

Apr 29, 2021, 11:36:40 AM
to Unpaywall discussion
@eric no problem, glad it helped!
j
