Parse PDF into tiddlers

tib...@sunyit.edu

Aug 31, 2017, 4:46:35 PM
to TiddlyWiki
Hi All,

I figured this may be a stretch but wanted to ask: is there any way to import a PDF into plain text tiddlers?

Thanks

TonyM

Sep 1, 2017, 12:02:18 AM
to TiddlyWiki
Such a tool would be helpful.

Personally I would look into tools to turn PDFs into text and then import that, because there are many issues going from a highly formatted document type to plain text. Not to mention text inside images, where some OCR is needed.

Foxit Reader and its Pro version are great for PDF work, but I'm not sure they will help you.

Regards
Tony

Jeremy Ruston

Sep 1, 2017, 6:21:26 AM
to tiddl...@googlegroups.com
The trouble with PDF is that, contrary to expectations, it is closer to an image file format than a text document format. In other words, it generally doesn’t know anything about paragraphs, headers, or footers; all it knows about are simple instructions to draw a given letter at given coordinates. (Worse than that, some PDFs are actually just embedded bitmaps.)

That means that converting a PDF into a conventional document is more akin to “optical character recognition” than ordinary file format conversion. It takes machine learning or sophisticated heuristics for software to figure out the structural relationships behind the document image. There is some effective software available to do this conversion, but it tends to be expensive because it’s such a hard problem and the capability is so valuable.

Best wishes

Jeremy.
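
To illustrate the point above about letters drawn at coordinates, here is a minimal Python sketch of the simplest kind of heuristic such a converter needs. It assumes the glyphs have already been extracted as hypothetical `(x, y, char)` tuples (real PDF tools expose this differently), that y grows upward as in PDF page coordinates, and that glyphs within `y_tolerance` units of each other belong to the same line. It makes no attempt at word spacing, columns, or paragraphs, which is where the hard part starts.

```python
def glyphs_to_lines(glyphs, y_tolerance=2.0):
    """Group (x, y, char) glyphs into lines of text.

    Hypothetical input format: each glyph is (x, y, char), with y
    increasing towards the top of the page.
    """
    lines = []  # each entry: (reference_y, [(x, char), ...])
    # Visit glyphs from the top of the page down, left to right.
    for x, y, ch in sorted(glyphs, key=lambda g: (-g[1], g[0])):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            # Close enough vertically: treat as part of the current line.
            lines[-1][1].append((x, ch))
        else:
            # Otherwise start a new line at this height.
            lines.append((y, [(x, ch)]))
    # Within each line, order the glyphs by x and join the characters.
    return ["".join(ch for _, ch in sorted(row)) for _, row in lines]


# Glyphs arrive in drawing order, which need not be reading order.
sample = [(20, 680, "o"), (30, 700.5, "i"), (10, 680, "y"), (10, 700, "H")]
print(glyphs_to_lines(sample))
```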

tib...@sunyit.edu

Sep 6, 2017, 3:16:36 PM
to TiddlyWiki
I found a tool that can convert a PDF textbook to epub, with minimal loss/corruption. Do we have an epub import tool lying around? I think Steve Schneider did something with epub in the spring.

@TiddlyTweeter

Sep 6, 2017, 3:27:06 PM
to TiddlyWiki
Epubs, like docx files, are assemblages. They are NOT really one file. They are wrappers. Find out how to UN-wrap it. See what you get in that specific epub after unwrapping. Tell us what is in it. j.

codacoder...@outlook.com

Sep 6, 2017, 3:36:48 PM
to TiddlyWiki


An epub is a zip file. If you unzip an epub file, it will show you (among other things) its raw HTML content.
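
As a rough sketch of that unzip step, the Python below opens an epub with the standard `zipfile` module, picks out the HTML/XHTML members, and reduces each to plain text. The function names `html_to_text` and `epub_to_tiddlers` are made up for illustration, and the one-tiddler-per-HTML-file mapping with `title` and `text` fields is an assumption, not an existing TiddlyWiki importer. The tag stripping is deliberately crude: it keeps every piece of text it finds, including any `<title>` or `<style>` content.

```python
import zipfile
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(markup):
    """Crudely reduce HTML to whitespace-normalised plain text."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    # Join the pieces with spaces, then collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


def epub_to_tiddlers(epub_file):
    """Return one {"title", "text"} dict per HTML file inside the epub.

    epub_file may be a path or a binary file-like object.
    """
    tiddlers = []
    with zipfile.ZipFile(epub_file) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".html", ".xhtml", ".htm")):
                markup = archive.read(name).decode("utf-8", errors="replace")
                tiddlers.append({"title": name, "text": html_to_text(markup)})
    return tiddlers
```

Assuming the resulting list is written out with `json.dump` to a `.json` file, it should be possible to import that file into TiddlyWiki as a set of tiddlers, though the titles and text would likely need tidying by hand first.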