How to replace Google Scholar scraping of author publications with ORCID + Crossref + Unpaywall?


Russell Jarvis

Feb 7, 2021, 6:05:10 AM
to orcid-a...@googlegroups.com
Hi orcid-api community,

Motivation of question:
To promote clear and accessible writing in science, I have created a web app that analyses the readability of scientific texts on an author-by-author basis and then ranks academic authors in the same field by readability. A description of the tool is on GitHub (https://github.com/russelljjarvis/ScienceAccess). Unfortunately, the current version of the application relies on scraping Google Scholar, which I don't like: it involves solving captchas, so it cannot scale or give a satisfying user experience.

As an alternative, I am trying to chain together the ORCID API, Crossref and Unpaywall in Python. This approach would probably work okay, but the ORCID API and Unpaywall both require you to register for their APIs, so it seems that any user of my application would also have to register for API access. I worry that this requirement would hinder the viability of my application (which is not-for-profit, free and open-source software anyway).

Technical Background to question:
By emailing contact-us at ORCID, I learned that you can get ORCID iD information for a specific author name by opening this link in the browser:

https://pub.orcid.org/v3.0/csv-search/?q=

For example if I wanted to search for Sarah Adams

https://pub.orcid.org/v3.0/csv-search/?q=Sarah Adams

If you paste that into a web browser, most browsers will download a text or JSON file populated with the relevant ORCID iD information for Sarah Adams.

However, I think this approach relies on redirection and browser tricks, so I cannot naively do the same thing programmatically using Python requests. For example, the following approach won't work:

In Python, search for an author's ORCID iD by their first and last name:

    import requests

    NAME="Sarah Adams"

    url="https://pub.orcid.org/v3.0/csv-search/?q="+str(NAME)
    response = requests.get(url)

Also one might use a selenium web driver in which case this approach also wouldn't work:

    from selenium import webdriver
    driver = webdriver.Chrome(executable_path='/home/user/git/etudier/chromedriver')

    url="https://pub.orcid.org/v3.0/csv-search/?q="+str(NAME)

    driver.get(url)

Does anyone have ideas on how to replace Google Scholar scraping of author publications with ORCID + Crossref without forcing users of the application to sign up for API access?

Thanks for any advice.

Russell.


--


Russell Jarvis
PhD
phone: 61444576301
email: russel...@protonmail.com

Michael Roberts

Feb 7, 2021, 9:21:26 AM
to Russell Jarvis, orcid-a...@googlegroups.com
Hi Russell,

I think there may be a way around this, but before I dive into helping, I have one quick question which may provide you with a possible solution or workaround.

Have you tried the Python requests call with different request headers? Sometimes changing a header to application/json or similar can give you the data in your desired format.
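
A minimal sketch of this suggestion, using only the Python standard library so that it has no dependencies. The `Accept: text/csv` value is an assumption about what the csv-search endpoint expects, and `build_search_request` is a hypothetical helper name. The request is only built here, not sent.

```python
import urllib.parse
import urllib.request


def build_search_request(name, accept="text/csv"):
    """Build (but do not send) a GET request for the csv-search endpoint."""
    # urlencode turns the space in "Sarah Adams" into "+", giving a valid URL.
    query = urllib.parse.urlencode({"q": name})
    url = "https://pub.orcid.org/v3.0/csv-search/?" + query
    # The Accept header tells the server which response format the client wants.
    return urllib.request.Request(url, headers={"Accept": accept})


req = build_search_request("Sarah Adams")
print(req.full_url)
print(req.get_header("Accept"))
```

The request could then be sent with `urllib.request.urlopen(req)`, or the same URL and headers could be passed to `requests.get(url, headers=...)`.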

Michael Roberts
asencis Ltd (https://asencis.com)

Sent from my iPhone

On 7 Feb 2021, at 11:05, Russell Jarvis <coloure...@gmail.com> wrote:


--
You received this message because you are subscribed to the Google Groups "ORCID API Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to orcid-api-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/orcid-api-users/CADAFsyNNvKd6wNNabcn0HawVfN5XLOcJcwDh5Krkh5Jp_%3DjiVw%40mail.gmail.com.

Pedro Costa

Feb 8, 2021, 8:49:00 AM
to ORCID API Users
Hi Russell,

Thanks for posting about this.

The community might be able to help you out with your technical implementation.

I just wanted to point out we have a tutorial on how to use the Public API to search for records in the ORCID Registry which you might find useful:


Note that while you do need to register (for free) to use the Public API, the users whom you may want to search do not. However, this only allows you to read information which ORCID record owners set as public.

Hope this helps.

Pedro Costa
QA Lead

Antonin Delpeuch (lists)

Feb 10, 2021, 6:43:26 AM
to orcid-a...@googlegroups.com
Hi Russell,

This project seems similar in spirit to what we have been doing in
Dissemin for the past few years:
https://dissem.in/
If you think there can be any scope for reusing that, let me know!

Best,
Antonin


Russell Jarvis

Feb 15, 2021, 6:45:53 AM
to orcid-a...@googlegroups.com
I am currently finding that the Dissemin API fulfils a lot of my needs, but I still might have a use case for the ORCID API. Specifically, I just need to get author affiliations.

When I log in to ORCID and use the web portal to register for API tokens, I get two things, a Client ID and a Client secret, but I don't know how to use them to get author affiliations.

I just want to read records, I don't want to write records.

Thanks again for any help.

Also, I looked into the response headers of an unauthenticated API call:
```
import requests
NAME="Sarah Adams"
url="https://pub.orcid.org/v3.0/csv-search/?q="+str(NAME)
response = requests.get(url)
print(response.headers)

# I get this:

{'date': 'Mon, 15 Feb 2021 11:26:31 GMT', 'content-type': 'application/vnd.orcid+xml; qs=5;charset=UTF-8', 'transfer-encoding': 'chunked', 'connection': 'keep-alive', 'set-cookie': '__cfduid=ddeb0b300d41910e79e342881e66cf1ca1613388391; expires=Wed, 17-Mar-21 11:26:31 GMT; path=/; domain=.orcid.org; HttpOnly; SameSite=Lax, X-Mapping-fjhppofk=814C785051499CB634650A2359C0B50C; path=/', 'cache-control': 'no-cache, no-store, max-age=0, must-revalidate', 'expires': '0', 'pragma': 'no-cache', 'x-xss-protection': '1; mode=block', 'access-control-allow-origin': '*', 'x-content-type-options': 'nosniff', 'x-frame-options': 'DENY', 'cf-cache-status': 'DYNAMIC', 'cf-request-id': '08470a3ad10000fea517a1c000000001', 'expect-ct': 'max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"', 'server': 'cloudflare', 'cf-ray': '621eaca48e85fea5-MEL'}

print(response.content)
b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n<error xmlns="http://www.orcid.org/ns/error">\n    <response-code>406</response-code>\n    <developer-message>400 Bad Request: There is an issue with your data or the API endpoint. 405 Method Not Allowed: Endpoint and method mismatch. 415 Unsupported Media Type: data must be in XML or JSON format.</developer-message>\n    <user-message>ORCID could not process the data, because they were invalid.</user-message>\n    <error-code>9001</error-code>\n    <more-info>https://members.orcid.org/api/resources/troubleshooting</more-info>\n</error>\n'

```
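
The error body above is XML, so its fields can be read programmatically rather than by eye. The sketch below is one possible way to do that with the standard library; `summarize_orcid_error` is a hypothetical helper, and the sample bytes are a shortened copy of the response shown above. Note that 406 is the HTTP status "Not Acceptable", which is what a server returns when it cannot produce a response in a format the request's Accept header allows.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute of the error body in this thread.
ORCID_ERROR_NS = {"err": "http://www.orcid.org/ns/error"}


def summarize_orcid_error(xml_bytes):
    """Pull the main fields out of an ORCID XML error body."""
    root = ET.fromstring(xml_bytes)

    def field(tag):
        node = root.find(f"err:{tag}", ORCID_ERROR_NS)
        return node.text if node is not None else None

    return {
        "response_code": field("response-code"),
        "error_code": field("error-code"),
        "user_message": field("user-message"),
    }


# Shortened copy of the body printed by response.content above.
body = (
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
    b'<error xmlns="http://www.orcid.org/ns/error">\n'
    b'    <response-code>406</response-code>\n'
    b'    <user-message>ORCID could not process the data, because they were invalid.</user-message>\n'
    b'    <error-code>9001</error-code>\n'
    b'</error>\n'
)
summary = summarize_orcid_error(body)
print(summary["response_code"], summary["error_code"])
```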

Pedro Costa

Feb 16, 2021, 7:01:15 AM
to ORCID API Users
Hi Russell,

You can use ORCID's Public API to read information defined as publicly visible by record owners. You can read the whole record like so:

curl -i -H 'Content-Type: application/orcid+xml' -H 'Authorization: Bearer  [Your /read-public access token]' 'https://pub.sandbox.orcid.org/v3.0/0000-0001-2345-6789/record'

Or you can read just the affiliations section you may want to check. Here I'm reading the employment section:

curl -i -H 'Content-Type: application/orcid+xml' -H 'Authorization: Bearer  [Your /read-public access token]  ' 'https://pub.sandbox.orcid.org/v3.0/0000-0001-2345-6789/employment'
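
The curl commands above need a /read-public access token. A rough sketch of how a Client ID and Client secret might be exchanged for one is shown below. It assumes that the token endpoint accepts the client-credentials grant with the `/read-public` scope; `build_read_public_token_request` is a hypothetical helper, the credentials are placeholders, and the request is only built, not sent. The token URL must belong to the same environment (production or sandbox) as the API host that is queried afterwards.

```python
import urllib.parse
import urllib.request


def build_read_public_token_request(client_id, client_secret,
                                    token_url="https://orcid.org/oauth/token"):
    """Build (but do not send) a POST asking for a /read-public token."""
    # Assumption: client-credentials grant with the /read-public scope.
    body = urllib.parse.urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "grant_type": "client_credentials",
        "scope": "/read-public",
    }).encode("ascii")
    return urllib.request.Request(
        token_url,
        data=body,
        headers={"Accept": "application/json"},
        method="POST",
    )


# Placeholder credentials, in the same style as the rest of this thread.
token_request = build_read_public_token_request("APP-X", "X")
print(token_request.get_method(), token_request.full_url)
print(token_request.data.decode("ascii"))
```

If the call succeeds, the `access_token` field of the JSON response is the value that goes after `Bearer` in the Authorization header.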

You can learn more about this with these tutorials:



And you can see all our tutorials in GitHub here:


Let us know if we can assist with anything else.

Pedro Costa


Russell Jarvis

Feb 19, 2021, 2:31:54 AM
to Pedro Costa, orcid-a...@googlegroups.com
Hi Pedro and orcid-id community,

The process of getting access tokens to work is a bit time consuming and hard, and I keep going around in circles at this stage.

I got some tokens with

 curl -i -L -k -H 'Accept: application/json' --data 'client_id=APP-X&client_secret=X&grant_type=authorization_code&redirect_uri=https://pub.orcid.org/v2.0/&code=X' https://orcid.org/oauth/token

Response: 

Heaps of output but down the bottom:
CF-RAY: X-MEL

{"access_token":"access_token","token_type":"bearer","refresh_token":"X","expires_in":631138518,"scope":"/authenticate","name":"Russell Jarvis","orcid":"X"}

So I finally have my access_token, but I can't get it to work with

curl -i -H 'Content-Type: application/orcid+xml' -H 'Authorization: Bearer  access_token' 'https://pub.sandbox.orcid.org/v3.0/0000-0001-2345-6789/record'

or with the request that reads just the employment section:

curl -i -H 'Content-Type: application/orcid+xml' -H 'Authorization: Bearer  access_token  ' 'https://pub.sandbox.orcid.org/v3.0/0000-0001-2345-6789/employment'

But it does not work in these contexts.

Am I missing something?


Pedro Costa

Feb 19, 2021, 4:44:06 AM
to ORCID API Users
Hi Russell,

It seems that you used Public API client credentials for the Production environment (https://orcid.org) to get permission from the ORCID record and obtain the access token, but then you tried GET requests against the Sandbox environment (https://sandbox.orcid.org). The examples from my previous post need to be modified accordingly. Try this instead:

curl -i -H 'Content-Type: application/orcid+xml' -H 'Authorization: Bearer  access_token' 'https://pub.orcid.org/v3.0/0000-0003-0281-2849/record'
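
One way to avoid mixing environments is to keep the hosts for each environment together and build every URL from that single table. This is only a sketch: `ENVIRONMENTS` and `public_api_url` are hypothetical names, and the sandbox token URL is an assumption derived from the sandbox host mentioned above.

```python
# Hosts per ORCID environment. The production token URL and both API hosts
# appear in this thread; the sandbox token URL is an assumption.
ENVIRONMENTS = {
    "production": {
        "token_url": "https://orcid.org/oauth/token",
        "api_base": "https://pub.orcid.org/v3.0",
    },
    "sandbox": {
        "token_url": "https://sandbox.orcid.org/oauth/token",
        "api_base": "https://pub.sandbox.orcid.org/v3.0",
    },
}


def public_api_url(environment, orcid_id, section="record"):
    """Return the Public API URL for one section of a record."""
    return f"{ENVIRONMENTS[environment]['api_base']}/{orcid_id}/{section}"


# A token obtained from the production token_url should only be used with
# URLs built from the production api_base, and likewise for the sandbox.
print(public_api_url("production", "0000-0003-0281-2849"))
print(public_api_url("sandbox", "0000-0001-2345-6789", "employment"))
```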

Hope this helps.

Kind regards,

Owen Stephens

Feb 19, 2021, 5:05:34 AM
to Russell Jarvis, orcid-a...@googlegroups.com
Hi Russell

Is there a reason you are authenticating rather than using the public API?
For the use case you describe I can't see what benefits you get from authenticating

With the public API you wouldn't need to worry about the access token etc. just the request. E.g.

Will get you my public ORCID information - which is (as far as I understand it) exactly the same as you'd get if you were authenticated. If you needed to write data, or if authenticating gave you access to additional data (as a trusted organisation or a trusted individual) then authenticating would be necessary, but that doesn't seem to be what you require?

The default response here is XML so I didn't bother to set any headers, but if you want JSON

--header 'Accept: application/vnd.orcid+json'
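
In Python, an unauthenticated read with that Accept header might look like the sketch below. It is an illustration rather than the code from the app mentioned next: `fetch_public_record` is a hypothetical helper, the iD is the one used earlier in this thread, and the demonstration swaps in a fake opener with a made-up, heavily trimmed payload so that nothing is sent over the network.

```python
import io
import json
import urllib.request

API_BASE = "https://pub.orcid.org/v3.0"


def fetch_public_record(orcid_id, opener=urllib.request.urlopen):
    """GET a public record as JSON without any Authorization header."""
    request = urllib.request.Request(
        f"{API_BASE}/{orcid_id}/record",
        headers={"Accept": "application/vnd.orcid+json"},
    )
    with opener(request) as response:
        return json.load(response)


# Offline demonstration: a stand-in opener replaces the real network call.
sent = []


def fake_opener(request):
    sent.append(request)
    # Hypothetical payload; a real record contains many more keys.
    return io.BytesIO(b'{"orcid-identifier": {"path": "0000-0003-0281-2849"}}')


record = fetch_public_record("0000-0003-0281-2849", opener=fake_opener)
print(record["orcid-identifier"]["path"])
print(sent[0].has_header("Authorization"))
```

Calling `fetch_public_record(orcid_id)` without the `opener` argument would send the real request.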

The app I built (demo at http://powerful-chamber-19570.herokuapp.com/ and code at http://github.com/ostephens/orcid-node) only uses the public API

Best wishes

Owen

Russell Jarvis

Feb 19, 2021, 6:11:49 PM
to Owen Stephens, orcid-a...@googlegroups.com
Owen this is awesome!

I personally don't want anything to do with the very technical authentication process. I thought authentication was necessary but I am very happy if it isn't.

If it turns out it is not even necessary for reading public records, I would say this is a really big flaw in the orcid documentation, which either gives no examples of basic public reading without authenticating, or massively under represents this simpler access method, such that no examples of this are discoverable with search engines. 

Russell Jarvis

Feb 20, 2021, 6:28:03 PM
to Owen Stephens, orcid-a...@googlegroups.com

Thanks for that insight Owen.

I think the Python code block below is the answer to my long-running forum question. One question, though: is it correct to assume that first name, middle initial and family name are separated by a plus, i.e. ?q=first_name+middle_initial+family_name?

In the code below, if the name is Brian+H+Smith, it finds the ORCID iD for Brian+J+Smith: http://orcid.org/0000-0003-0498-1910.

I am getting the sense that this is not so much an ORCID problem as a problem with optional human participation in ORCID. Possibly Brian H Smith has not fully signed up to and participated in ORCID.

I think perhaps the list of ORCID iDs for Brian Smith is just ranked in terms of research prominence, ORCID participation, etc.? The expression temp['result'][0] in the code below takes the top ORCID iD from the list.

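import requests

headers = {
    'Accept': 'application/vnd.orcid+json',
}

def name_to_orcid_id(NAME):
    """Return the top-ranked ORCID iD for a name dict, or None if nothing matches."""
    orcid_id = None
    plus_initial = NAME['name']['first']
    initial = plus_initial.split(" ")
    if len(initial) == 2:
        # Keep the first name plus the first letter of the middle name.
        first_name = initial[0] + " " + initial[1][0]
    else:
        first_name = plus_initial
    # Join with spaces: requests URL-encodes each space as "+", whereas a
    # literal "+" would be sent as "%2B" and become part of the search term.
    name = first_name + " " + NAME['name']['last']
    params = {'q': name}
    response = requests.get('https://pub.orcid.org/v3.0/search/', headers=headers, params=params)
    response.raise_for_status()
    temp = response.json()
    # 'result' can be missing or empty when nothing matches.
    if temp.get('result'):
        orcid_id = temp['result'][0]['orcid-identifier']['path']
    return orcid_id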
   

Owen Stephens

Feb 21, 2021, 4:19:06 PM
to Russell Jarvis, orcid-a...@googlegroups.com
Hi Russell,

I'm not quite sure exactly how the results and order of the results are decided (perhaps something that Pedro can advise on) but I wouldn't be surprised if they were ranked based on closeness of match.

A few things that might be useful:

If you use
https://pub.orcid.org/v3.0/expanded-search/
You'll get back a basic record that might save you an extra fetch

If you only want the first result you can use the "rows" parameter (i.e. rows=1)

You can limit your search to specific fields. So a more precise search which returns a single row might be:

https://pub.orcid.org/v3.0/expanded-search/?q=given-names:Brian+H+AND+family-name:Smith&rows=1

Although that still finds 0000-0003-0498-1910 so how far these are useful to you I'm not sure.
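
For reference, the fielded search above could be assembled in Python roughly as follows. This is a sketch with a hypothetical helper name, `expanded_search_url`; it relies only on the endpoint, field names and `rows` parameter mentioned above, and it lets `urlencode` take care of the `+` and `%3A` escaping instead of writing them by hand.

```python
import urllib.parse


def expanded_search_url(given_names, family_name, rows=1):
    """Build an expanded-search URL restricted to the name fields."""
    # Write the query with plain spaces; urlencode converts them to "+".
    query = f"given-names:{given_names} AND family-name:{family_name}"
    params = urllib.parse.urlencode({"q": query, "rows": rows})
    return "https://pub.orcid.org/v3.0/expanded-search/?" + params


url = expanded_search_url("Brian H", "Smith")
print(url)
```

The printed URL is equivalent to the one above, with each colon percent-encoded as `%3A`.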

Ultimately, of course, names aren't unique - which is the problem that ORCID tries to resolve :)

Hope some of that is helpful

Owen
--
Sent from Postbox

Owen Stephens

Feb 21, 2021, 4:20:35 PM
to Russell Jarvis, orcid-a...@googlegroups.com
Sorry - finally I forgot to include a link to some of the documentation which covers the things that I mentioned in the last email

https://info.orcid.org/documentation/api-tutorials/api-tutorial-searching-the-orcid-registry/

The information on that page is all true for the public API as far as I know

Owen

Pedro Costa

Feb 22, 2021, 5:30:54 AM
to ORCID API Users
Hi Russell,

I should clarify this -- you only need to use the 3-step OAuth process (which requires interaction from ORCID users) if you want to collect authenticated ORCID iDs. If all you're interested in is searching for publicly-visible information in the ORCID Registry you can just send your queries to the ORCID Public API without any need for authentication.

You can make requests like:

- curl -H 'Accept: application/orcid+json' 'https://pub.sandbox.orcid.org/v3.0/search/?q=family-name:Smith' -k
- curl -H 'Accept: application/orcid+xml' 'https://pub.sandbox.orcid.org/v3.0/expanded-search/?q=first-name:John' -k

You can learn more about, and find more examples of searching in the documentation Owen pointed out:


The documentation includes a list of the indexed fields you can use, such as "given-names", "family-name", and "given-and-family-names". Note that we do not have a field for middle initials.

We use the SOLR search engine. Search results are ordered based on relevance according to the search terms used.

Just let us know if you have any other questions.

Pedro Costa
QA Lead, ORCID