Hi,
I am trying to compile a dataset for SciELO Preprints using the full-text XML that EuropePMC provides (since SciELO Preprints does not).
However, a number of values are not encoded correctly.
The XML for PPR458699 contains something like this:
<aff id="A2"><label>2</label>BioISI�Biosystems & Integrative Sciences Institute, Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal</aff>
This is not an isolated case: all 424 files I downloaded are affected, and the problem also appears in fields other than the affiliation.
An AI assistant suggested running ftfy to fix this (it reports that it checked and found no false positives for those files).
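In case it helps to reproduce the problem, here is a minimal sketch of how the affected spots could be located in the downloaded files before any fixer is applied. It uses only the Python standard library. The function names, the `*.xml` file pattern, and the set of "suspicious" sequences are my own assumptions, not anything provided by EuropePMC or ftfy, and the pattern is a rough heuristic rather than a complete mojibake detector. One caveat: where a file contains a literal U+FFFD replacement character, the original character may already be lost, so a tool that works by re-decoding the text might not be able to restore it.

```python
import re
from pathlib import Path

# Heuristic (assumption): U+FFFD is the Unicode replacement character; the
# other two alternatives are common signatures of UTF-8 bytes that were
# decoded as Latin-1/Windows-1252 ("Ã" + continuation byte, or "â€").
SUSPECT = re.compile("\ufffd|\u00c3[\u0080-\u00bf]|\u00e2\u20ac")


def suspect_snippets(text, context=15):
    """Return a short snippet of surrounding text for each suspicious match."""
    return [
        text[max(0, m.start() - context): m.end() + context]
        for m in SUSPECT.finditer(text)
    ]


def scan_directory(xml_dir):
    """Map each XML file name in xml_dir to its list of suspicious snippets.

    Files without any match are left out of the report.
    """
    report = {}
    for path in sorted(Path(xml_dir).glob("*.xml")):
        # errors="replace" keeps the scan going if a file is not valid UTF-8;
        # invalid bytes then show up as U+FFFD and are reported as well.
        text = path.read_text(encoding="utf-8", errors="replace")
        hits = suspect_snippets(text)
        if hits:
            report[path.name] = hits
    return report
```

Calling `scan_directory` on the folder holding the downloaded files (the folder name is whatever was used locally) gives a per-file list of snippets, which makes it easier to see which fields are affected and to compare the text before and after running a fixer.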
Thank you
Daniel