Extract HTML from PDF


Timur Gadzo

Feb 27, 2024, 1:33:56 PM
to pdfium
Hi. Is it possible to extract HTML (rich text, formatted text) from a PDF?
If I open a PDF in Chrome and copy from it, only the plain text seems to be copied, without formatting. Why?

DJRecipe

Feb 27, 2024, 9:08:14 PM
to Timur Gadzo, pdfium
PDFs use PostScript-like drawing operators (with absolute coordinates) and a bunch of encoded streams and dictionaries.

That is to say, there is no markup language to extract.

You would have to write your own converter or find a PDF-to-HTML converter online.
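
To give a rough idea of what "writing your own" involves, here is a minimal Python sketch. It assumes you have already pulled positioned text runs (text, coordinates, font size, bold/italic flags) out of the page with some PDF text API; the `TextRun` type and `runs_to_html` function are hypothetical names made up for this illustration, not part of PDFium. It only rebuilds the layout with absolutely positioned spans, which is roughly all the information a PDF page gives you.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class TextRun:
    """Hypothetical container for one run of text taken from a PDF page."""
    text: str
    x: float          # left edge in PDF points (origin at bottom-left)
    y: float          # baseline in PDF points (origin at bottom-left)
    font_size: float  # in points
    bold: bool = False
    italic: bool = False


def runs_to_html(runs, page_width, page_height):
    """Emit one absolutely positioned <span> per text run."""
    parts = [
        f'<div style="position:relative;'
        f'width:{page_width:g}pt;height:{page_height:g}pt">'
    ]
    for run in runs:
        # PDF y grows upward, CSS top grows downward. Subtracting the font
        # size is only an approximation of the top of the text box.
        top = page_height - run.y - run.font_size
        style = (
            f"position:absolute;left:{run.x:g}pt;top:{top:g}pt;"
            f"font-size:{run.font_size:g}pt"
        )
        if run.bold:
            style += ";font-weight:bold"
        if run.italic:
            style += ";font-style:italic"
        parts.append(f'<span style="{style}">{escape(run.text)}</span>')
    parts.append("</div>")
    return "\n".join(parts)
```

For example, `runs_to_html([TextRun("Hello", 72, 700, 12, bold=True)], 612, 792)` produces a page-sized `<div>` holding a single bold span. Paragraphs, lists, and tables would have to be guessed from the coordinates, which is the hard part real converters deal with.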
