Regex searches in PMC?


A Biologist

Jul 3, 2024, 7:40:06 AM
to Europe PMC Developer Forum
Dear Sir/Madam,

I'd like to search the plant science literature and return only those articles in which the word "three" appears four or more times in the full-text Methods section. Is there any way to do this at all? (I mean with any language, online app or helper tool, perhaps using Europe PMC?) I presume that I would need to search PMC (although another online database would be OK), and I would prefer an R solution (although another language would be fine as well, perhaps using JSON and regex). Might this be possible using R Biotea, or with a SPARQL query with regular expressions against an RDF database? At the moment I'm just enquiring whether it's possible at all, and which is the best/easiest direction to go. I've looked at scite, which accepts JSON and regex, but it apparently only searches citations rather than full-text Methods sections.
I've used the R packages "europepmc", "euPMC" and "tidypmc", but it appears that they only work for articles for which the full text is available (and at the moment I only know how to access those with PMC IDs), which is only around 1% of all articles.
It seems strange that a PMC query will (apparently) search for the word "three" in the Methods sections of all articles (including those without open access) but will not search for "(?:\\bthree\\b.*){4,}". Or am I wrong about this?

Yours sincerely,

Jeremy Clark

Madhumiethaa Jayaprabha Palanisamy

Jul 4, 2024, 6:50:16 AM
to Europe PMC Developer Forum, A Biologist
Hi Jeremy,

Thank you for reaching out. As you pointed out, there is no direct way to use regular expressions in the initial search queries. You may have to use a combination of searches and post-processing.

There's an R package for querying Europe PMC (europepmc), available here. As detailed on the Europe PMC RESTful Web Service page, you can search within specific sections of the full text and retrieve the results as ID lists. You can also retrieve the full-text XML for a given PMC ID.

Here's one approach you can try using the R package:
1. Retrieve the IDs of articles with "three" in the Methods section using epmc_search:
     epmc_search(query = 'METHODS:"three"', limit = 10, output = "id_list")
   The query matches millions of articles, so set a small limit while testing. output = "id_list" returns a list of IDs and sources.
2. For the retrieved IDs, get the full text in XML format using epmc_ftxt:
     full_text <- epmc_ftxt(pmcid)
3. Extract the Methods section from the XML and apply a regular expression as needed.
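
The steps above could be strung together roughly as follows. This is only a minimal sketch, not a tested pipeline: count_word, methods_text and screen_articles are hypothetical helper names, the XPath assumes JATS-style XML in which Methods is a top-level <sec> marked by its sec-type attribute or its title, and the pmcid column of the id_list result and the OPEN_ACCESS:y filter in the usage line are assumptions to check against the package and search syntax documentation. It counts whole-word matches instead of using a single "{4,}" pattern, because "." does not cross line breaks in many regex engines.

```r
# Count whole-word, case-insensitive occurrences of `word` in a single string.
count_word <- function(text, word = "three") {
  pattern <- paste0("\\b", word, "\\b")
  hits <- gregexpr(pattern, text, ignore.case = TRUE, perl = TRUE)[[1]]
  if (hits[1] == -1L) 0L else length(hits)
}

# Collect the text of top-level <sec> elements that look like a Methods section.
# Assumption: the section is tagged via sec-type or has "method" in its title.
methods_text <- function(doc) {
  xpath <- paste0(
    "//body/sec[contains(@sec-type, 'methods') or ",
    "contains(translate(title, 'METHODS', 'methods'), 'method')]"
  )
  secs <- xml2::xml_find_all(doc, xpath)
  paste(xml2::xml_text(secs), collapse = " ")
}

# Search, fetch full text where available, and keep articles with enough hits.
screen_articles <- function(query, limit = 10, min_hits = 4) {
  ids <- europepmc::epmc_search(query = query, limit = limit, output = "id_list")
  pmcids <- ids$pmcid[!is.na(ids$pmcid)]  # full text is fetched by PMC ID
  hits <- vapply(pmcids, function(id) {
    # Full text may be unavailable for some records; treat failures as NA.
    doc <- tryCatch(europepmc::epmc_ftxt(id), error = function(e) NULL)
    if (is.null(doc)) return(NA_integer_)
    count_word(methods_text(doc), "three")
  }, integer(1))
  names(hits)[!is.na(hits) & hits >= min_hits]
}

# Example call (needs network access and the europepmc and xml2 packages):
# screen_articles('METHODS:"three" AND OPEN_ACCESS:y', limit = 10)
```

Since the initial search only guarantees at least one "three" in Methods, the count has to be done locally, which limits this approach to articles whose full-text XML can actually be downloaded.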

It may also be useful to look at the full-text fields available for searching, which are listed on our search syntax page.

Hope you find this useful. Let us know if you have further queries.


Kind regards,
Madhumiethaa
