Incorrectly encoded characters in FTP preprint_fulltext for SciELO Preprints


Daniel Ecer

Mar 25, 2026, 9:39:56 AM
to Europe PMC Developer Forum
Hi,

I am trying to compile a dataset for SciELO Preprints using the full-text XML that Europe PMC provides (since SciELO Preprints does not provide it).

However, a number of characters are not encoded correctly.


For example, the XML for PPR458699 contains something like this:
<aff id="A2"><label>2</label>BioISI�Biosystems &amp; Integrative Sciences Institute, Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal</aff>

This is not an isolated case. It affects all 424 files I downloaded, and it also affects fields other than the affiliation.

An AI assistant suggested running ftfy to fix this (it told me it had checked and that ftfy does not produce false positives for those files).
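
To illustrate the kind of check and repair involved, here is a minimal standard-library sketch. It is not ftfy and not its algorithm; it only covers the simplest assumed case, where UTF-8 bytes were decoded as Latin-1 somewhere upstream. The function names and the sample string are made up for illustration. Note that if a file really contains U+FFFD (the replacement character shown above), the original byte is already gone and re-decoding cannot bring it back, so the sketch only reports those lines.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, inserted by decoders for undecodable bytes


def find_replacement_chars(xml_text):
    """Return (line_number, line) pairs for lines that contain U+FFFD."""
    return [
        (number, line)
        for number, line in enumerate(xml_text.splitlines(), start=1)
        if REPLACEMENT_CHAR in line
    ]


def try_undo_latin1_mojibake(text):
    """Reverse the 'UTF-8 bytes read as Latin-1' mistake when possible.

    If the round trip fails (the text is already correct, or it contains
    characters outside Latin-1), the input is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


if __name__ == "__main__":
    # Hypothetical sample modelled on the affiliation quoted above.
    sample = '<aff id="A2"><label>2</label>BioISI\ufffdBiosystems</aff>'
    for number, line in find_replacement_chars(sample):
        print(f"line {number}: contains U+FFFD: {line}")

    # Simulated mojibake: the UTF-8 bytes of "ê" shown as two Latin-1 characters.
    print(try_undo_latin1_mojibake("Ci\u00c3\u00aancias"))
```

A real clean-up would still need a per-file review, since a blanket re-decoding can only fix text whose bytes survived intact.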

Thank you
Daniel

Islam Hassan

Mar 26, 2026, 9:29:18 AM
to Europe PMC Developer Forum, d.e...@elifesciences.org
Dear Daniel,

Thanks for bringing this to our attention.
We will take a look and get back to you as soon as possible.

Yours sincerely,
Islam Hassan