I'm getting deleted newlines when I'm highlighting. How can I convert these to spaces?


Derek Thomas

Jun 14, 2021, 1:00:56 PM
to Hypothes.is Forum
Screen Shot 2021-06-14 at 8.59.37 PM.png 
Turns into:
Thisframework features two main assets: (1) astrong multilingual baseline consisting of anXLM-R (Conneau et al., 2020) model pre-trained on millions of tweets in over thirty lan-guages, alongside starter code to subsequentlyfine-tune on a target task; and (2) a set ofunified sentiment analysis Twitter datasets ineight different languages.

Is there a way to get these newlines to be converted into spaces?
It appears to be this way in the API and in the extension.

Thanks,
Derek 

mdiro...@hypothes.is

Jun 15, 2021, 7:58:16 AM
to Hypothes.is Forum
Hi Derek,

This is likely due to the digital text in the PDF itself and not something Hypothesis is doing. You could check by opening the PDF in your browser without Hypothesis and copy/pasting text from it into a Word document (or similar). My guess is that you'll see the same behavior. You're also welcome to send the PDF to sup...@hypothes.is and we can test it for you.

Best,
Michael

Derek Thomas

Jun 15, 2021, 11:47:08 AM
to Hypothes.is Forum
Yeah, I think you are right. I get the same results when I copy/paste. That's unfortunate.

Thanks,
Derek

mdiro...@hypothes.is

Jun 15, 2021, 8:28:22 PM
to Hypothes.is Forum
If you haven't started annotating yet (or if future PDFs give you that problem) you can try https://docdrop.org/ocr/ and choose the "Force OCR" option. A new digital text layer might fix the issue for you, but the new document will be distinct from the old document and annotations on one won't show up on the other.

I might be able to go a step further and make the two documents equivalent to each other if you want me to try. Send along the PDF if so.

Best,
Michael

Derek Thomas

Jun 16, 2021, 3:48:37 AM
to Hypothes.is Forum
Thanks so much!
https://arxiv.org/pdf/2012.15547.pdf
95% of my PDFs are from arXiv. All of them are data-science related and in a similar style.

mdiro...@hypothes.is

Jun 16, 2021, 9:22:05 AM
to Hypothes.is Forum
You're welcome! I used the "Force OCR" option at the link I provided and made a new PDF. This one has spaces instead of newlines, but that means the hyphenated words also get newlines instead of spaces. I'm not sure which way is better.
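
If you ever have the extracted text with its line breaks still intact (for example, from a PDF text extractor rather than from a highlight), the trade-off above can be handled in post-processing. Here is a minimal Python sketch, not anything Hypothesis does itself: the function name `unwrap_lines` is made up for illustration, and it assumes that a hyphen directly before a line break always marks a word split across lines. That assumption will wrongly join a real compound that happens to break at its hyphen (say, "XLM-" at the end of one line and "R" at the start of the next).

```python
import re


def unwrap_lines(text: str) -> str:
    """Turn hard-wrapped text into one line (hypothetical helper)."""
    # Rejoin words split across a line break: "lan-\nguages" -> "languages".
    # Assumption: a hyphen right before a newline is a line-break hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Replace each remaining line break (plus adjacent whitespace) with one space.
    text = re.sub(r"\s*\n\s*", " ", text)
    return text.strip()


sample = (
    "pre-trained on millions of tweets in over thirty lan-\n"
    "guages, alongside starter code to subsequently\n"
    "fine-tune on a target task"
)
print(unwrap_lines(sample))
```

Hyphens that sit in the middle of a line, like the ones in "pre-trained" and "fine-tune" here, are left alone. This does not help with text that has already lost its line breaks, such as the highlight quoted at the top of the thread.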

Best,
Michael 

2012.15547-el6n0_ocr_force.pdf