Issue with diacritic character in bibliographic section


sn...@bsd.uchicago.edu

Jun 9, 2023, 4:59:52 PM
to ProfilesRNS
In our prod environment, we have a researcher with a citation for PMID: 35644154

One of the authors is Julian Panés. The UI is displaying the author name as 'Pan?s J'.

Looking at [ProfilesRNS].[Profile.Data].[Publication.PubMed.AllXML] I see 'Pan?s'. In [ProfilesRNS].[Profile.Data].[Publication.PubMed.Author], it's also 'Pan?s'.

In our staging environment I note the author name as it appears in the bibliographic section is correct, i.e. the diacritic appears.

In stage, [ProfilesRNS].[Profile.Data].[Publication.PubMed.AllXML] for this PMID has only an empty PMIDList tag in column [X], i.e. <PMIDList/>.

In stage, [ProfilesRNS].[Profile.Data].[Publication.PubMed.Author] for this PMID has the diacritic present.

I'd like to correct this issue in prod, but I'm not sure where to look next. Ideas?

sn...@bsd.uchicago.edu

Jun 22, 2023, 2:57:49 PM
to ProfilesRNS
This issue feels like it is on the Harvard side. As an example, independently of ProfilesRNS, I submitted PMID "36242764" to:

In the resulting XML, the author surname for Julian Panés shows up as 'Pan?s'.

Brown, Nicholas William

Jun 23, 2023, 10:17:30 AM
to profi...@googlegroups.com

Can you try using the following URL instead and see if that fixes the issue?

 

http://weberdemo.hms.harvard.edu/disambiguation/services/GetPMIDs/GetPubMedXML

 

Thanks

 

Nick


sn...@bsd.uchicago.edu

Jun 23, 2023, 12:34:58 PM
to ProfilesRNS
Unfortunately, using the new endpoint for PMID 36242764, the author surname is returned as 'Pan,s' rather than the expected 'Panés'.

sn...@bsd.uchicago.edu

Jun 23, 2023, 12:51:50 PM
to ProfilesRNS
Hit send too soon. I noted there's no encoding specified in the resulting XML, i.e. something like <?xml version="1.0" encoding="UTF-8"?>.

Of course, if data is being processed upstream of where the actual XML result is constructed, the characters could be getting mangled there. I am by no means an internationalization expert, and have been burned before by saving intermediate data as ASCII by mistake.
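To illustrate the kind of mangling described above, here is a minimal Python sketch. It is not the actual test code or anything from ProfilesRNS: the XML fragment and the function names are made up for illustration. It shows two things: an XML parser treats bytes with no encoding declaration as UTF-8 by default, and forcing ASCII when writing to disk is enough to turn 'Panés' into 'Pan?s'.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment standing in for one author record; the bytes are UTF-8
# and, like the service response discussed above, carry no encoding declaration.
XML_BYTES = "<Author><LastName>Panés</LastName></Author>".encode("utf-8")


def last_name(xml_bytes: bytes) -> str:
    """Parse raw bytes; with no declaration the parser assumes UTF-8."""
    return ET.fromstring(xml_bytes).findtext("LastName")


def dump_lossy(text: str, path: str) -> None:
    """The mistake: forcing ASCII replaces each non-ASCII character with '?'."""
    with open(path, "w", encoding="ascii", errors="replace") as fh:
        fh.write(text)


def dump_safe(xml_bytes: bytes, path: str) -> None:
    """Write the bytes untouched so no re-encoding can happen on the way to disk."""
    with open(path, "wb") as fh:
        fh.write(xml_bytes)
```

Writing the response in binary mode (or opening the file with encoding="utf-8") keeps the diacritic, so a '?' in a file on disk does not by itself prove the service returned bad data.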

sn...@bsd.uchicago.edu

Jun 23, 2023, 3:39:00 PM
to ProfilesRNS
Nick et al, my apologies. 

My testing code itself had an encoding bug when I dumped the file to disk for review. Once I corrected that, the encoding looks correct to me using the different API endpoint. Is that endpoint usable for production purposes?

Brown, Nicholas William

Jun 23, 2023, 5:01:59 PM
to profi...@googlegroups.com

That endpoint is suitable for production purposes; it hits the same database and is as stable as the other endpoint.

sn...@bsd.uchicago.edu

Jul 3, 2023, 4:24:20 PM
to ProfilesRNS
Nick --

With the new endpoint, [ProfilesRNS].[Profile.Data].[Publication.PubMed.AllXML] looks as expected. 

However, in the UI, the author list in the bibliography section for PMID 35788348 appears to be 'doubled up': both 'Pan?s J' and 'Panés J' appear.

Looking at [ProfilesRNS].[Profile.Data].[Publication.PubMed.Author], about 1.5% of author last/first names are corrupted with a '?' character, which in our case is about 40K rows, including PMID 35788348 for example.

In [ProfilesRNS].[Profile.Data].[Publication.PubMed.Author], can we safely either remove the impacted rows or perhaps set [ValidYN] = N? And will this eventually clear up the UI once the jobs run?

Thanks,

Stephan

Brown, Nicholas William

Jul 5, 2023, 12:36:32 PM
to profi...@googlegroups.com

Stephan

 

You can delete the problem rows from the Author table. This should fix the issue in the UI once the nightly jobs run.
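As a rough sketch of that cleanup, the Python below counts and then deletes rows whose names contain a literal '?'. It runs against an in-memory SQLite table as a stand-in, not against the real [Publication.PubMed.Author] table: the table name, the column names (pmid, last_name, first_name) and the 'Doe' row are assumptions made up for the demo, so check them against your actual schema and take a backup before deleting anything in prod.

```python
import sqlite3

# '?' is not a LIKE wildcard (only % and _ are), so this matches a literal '?'.
PATTERN = "%?%"


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny stand-in table; column names are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE author (pmid INTEGER, last_name TEXT, first_name TEXT)")
    conn.executemany(
        "INSERT INTO author VALUES (?, ?, ?)",
        [
            (35788348, "Pan?s", "J"),   # corrupted form seen in the UI
            (35788348, "Panés", "J"),   # correct form from the new endpoint
            (35788348, "Doe", "A"),     # made-up row for the demo
        ],
    )
    return conn


def count_corrupted(conn: sqlite3.Connection) -> int:
    """Count rows where either name column contains a '?'."""
    sql = "SELECT COUNT(*) FROM author WHERE last_name LIKE ? OR first_name LIKE ?"
    return conn.execute(sql, (PATTERN, PATTERN)).fetchone()[0]


def delete_corrupted(conn: sqlite3.Connection) -> int:
    """Delete the corrupted rows and return how many were removed."""
    sql = "DELETE FROM author WHERE last_name LIKE ? OR first_name LIKE ?"
    cur = conn.execute(sql, (PATTERN, PATTERN))
    conn.commit()
    return cur.rowcount
```

Running the count first gives a number to compare against the roughly 40K rows mentioned earlier before committing to the delete.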
