[TW5] Problem with images embedded in TiddlyWiki

Michael Wiktowy

Mar 16, 2015, 10:52:31 AM
to tiddl...@googlegroups.com
Hello,

Recently I have been trying to convert over a PDF of a staff directory into a TiddlyWiki (TW) file. It has a picture of each employee and a short description. For ease of distribution (and since there doesn't seem to be a universal standard for containerizing a web directory), I wanted to keep all the images embedded in the TW ... contrary to the predominant suggestion to keep binary files as external links. To build the TW file, I would directly cut and paste the pictures from the PDF into Chrome. That allowed me to import each binary picture as a tiddler that I could transclude into each person's profile.

However, I am noticing two issues (maybe interrelated):
1) The tiddler can be identified as image/gif, image/x-icon, image/jpeg or image/png and the resulting tiddler displays fine either way. Not a problem per se, just an agnosticism that I was not expecting. It may indicate some sloppy handling of MIME types somewhere, which could be problematic from a security standpoint ... especially if zip files are being introduced in 5.1.8. A huge infection of the Dyre worm just made its way around here, hidden in an emailed zip file. I would be extra careful to force a download-only action on zip files and not allow the browser to interpret them as one of the many HTML container formats that are simply HTML directory trees zipped up.

2) The resulting file is about 10X the size (source PDF=3MB and resulting tiddler with not much extra info=28MB). I was expecting about a 33% increase in size due to the text encoding and a fixed overhead of about 1MB from the empty TiddlyWiki size ... maybe 5MB but not 28. I am wondering whether the technique of cutting and pasting is converting my nice compressed images in the PDF to raw bitmaps and whether there is something I can do.
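The expected figure in point 2 can be checked with rough arithmetic. This is a minimal sketch, assuming base64 stores every 3 bytes of binary data as 4 text characters and treating the whole 3MB PDF as image data; the function name is only illustrative.

```python
def expected_wiki_size_mb(binary_mb, empty_wiki_mb=1.0):
    """Estimate the size of a wiki file that embeds binary data as base64 text."""
    # base64 turns every 3 input bytes into 4 ASCII characters (about 33% growth)
    return binary_mb * 4 / 3 + empty_wiki_mb

estimate = expected_wiki_size_mb(3.0)
print(f"expected: about {estimate:.0f} MB, observed: 28 MB")
```

The gap between the estimate and the observed 28MB suggests the image data itself grew before it was encoded, not that the encoding is at fault.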

Any thoughts on how better to compress these tiddlers?

/Mike

Michael Wiktowy

Mar 16, 2015, 1:13:04 PM
to tiddl...@googlegroups.com
In looking at this further ... when I copy the raw base64-encoded text from a tiddler and run it through a separate online decoder, a sample image turns out to be a PNG file (verified by looking at the first line in a text editor) of about 1/3 MB. So it is no surprise that ~80 images result in a 28MB file. Zipping this image file does not achieve any further reduction in size.
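The same check can be done locally, without an online decoder. This is a minimal sketch, assuming the tiddler text is plain base64; the signature table and the function name are illustrative and cover only a few common formats.

```python
import base64

# Leading "magic" bytes for some of the formats mentioned in this thread.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
}

def sniff_image(b64_text):
    """Decode base64 text; return (detected MIME type or None, size in bytes)."""
    raw = base64.b64decode(b64_text)
    for magic, mime in SIGNATURES.items():
        if raw.startswith(magic):
            return mime, len(raw)
    return None, len(raw)

# Tiny stand-in for a tiddler's text field: a PNG signature plus padding bytes.
sample = base64.b64encode(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8).decode("ascii")
print(sniff_image(sample))
```

Because the type is read from the content itself, the result does not depend on whatever MIME type or file extension has been attached to the data.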

I am shocked at the compression that is achieved in the PDF file though. While there is some commonality between photos (they are all head shots with a white border around them), I don't know how they are getting losslessly compressed so effectively within the PDF file beyond the compression that PNG is offering. Even zipping up the entire TW only shrinks things down to 20 MB.

I am also surprised that no matter what extension I used, the image viewer opened the file and labelled it as the image type matching that extension, when it obviously ignored the extension completely and used some file magic to identify and decode it correctly. So this is not browser-only behaviour.

Just some thoughts.

/Mike

Jeremy Ruston

Mar 16, 2015, 1:28:54 PM
to TiddlyWiki
Hi Michael

I can confirm that visiting tiddlywiki.com/ and changing the MIME type of Motovun Jack.jpg to "image/png" doesn't prevent the image from displaying. But it's not anything special that TiddlyWiki is doing; I think it is just that browsers in practice ignore the MIME type of images, and instead sniff the content.

In terms of the ZIP support, all that's been added is the automatic recognition of the MIME type from the file extension. The implication is that if you do embed a ZIP file in a TiddlyWiki, then it is slightly more likely that the browser will get the file type correct if you create a download link for it. But the actual embedding of ZIP files has been possible since binary tiddlers were supported.
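As an illustration of recognising a MIME type from a file extension, here is a minimal sketch. The lookup table and function name are hypothetical and are not TiddlyWiki's actual implementation.

```python
import os

# Hypothetical lookup table for illustration; not TiddlyWiki's real table.
EXTENSION_TO_MIME = {
    ".zip": "application/zip",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".gif": "image/gif",
}

def mime_from_filename(filename, default="application/octet-stream"):
    """Guess a MIME type from the file extension alone, ignoring the content."""
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSION_TO_MIME.get(ext, default)

print(mime_from_filename("staff-directory.zip"))
```

A lookup like this only labels the data. It does not inspect or change what is embedded, which is why the embedding itself worked before the extension was recognised.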

It sounds like compressing the images to JPG will get your file size down considerably. That kind of bulk operation is a bit easier under Node.js, but even in the browser you should be able to experiment with a few images.

Best wishes

Jeremy.

Michael Wiktowy

Mar 16, 2015, 1:40:25 PM
to tiddl...@googlegroups.com, jeremy...@gmail.com
On Monday, March 16, 2015 at 1:28:54 PM UTC-4, Jeremy Ruston wrote:
Hi Michael

I can confirm that visiting tiddlywiki.com/ and changing the MIME type of Motovun Jack.jpg to "image/png" doesn't prevent the image from displaying. But it's not anything special that TiddlyWiki is doing; I think it is just that browsers in practice ignore the MIME type of images, and instead sniff the content.

I figured as much.
  
It sounds like compressing the images to JPG will get your file size down considerably. That kind of bulk operation is a bit easier under Node.js, but even in the browser you should be able to experiment with a few images.

I am currently getting the images by copying and pasting right out of the 5MB PDF. So I am a little bit confused as to how the files are ballooning in size in just the copying operation. Still investigating, but I am starting to think it is not anything TiddlyWiki is doing. It is likely that Adobe Reader copies a pre-expanded, resized image to the clipboard, so when I paste that into TiddlyWiki, it is not getting the tiny source file but some file data that has been through several layers of reinterpretation. I need to find a way to dissect the PDF with the tools that I have at work.

Thanks for the info.
/Mike

Jeremy Ruston

Mar 18, 2015, 3:48:21 PM
to Michael Wiktowy, TiddlyWiki
Hi Michael

Just a thought, but Google suggests that there are tools that can extract images from PDFs automatically. For example:


It may be worth giving them a try,

Best wishes

Jeremy.

Michael Wiktowy

Mar 18, 2015, 4:47:46 PM
to tiddl...@googlegroups.com, mwik...@gmail.com, jeremy...@gmail.com
I can find, and already know of, several tools to totally dissect PDFs. That is no problem and I will do some investigations on that at home. However, the problem was making do with the tools that I had at work ... and getting software approved for use at work is always an issue.

I was actually going to use Inkscape to break apart a sample PDF page and see if I can examine the source image. It has quite good PDF import/editing capability but can only handle one page at a time.

/Mike

Michael Wiktowy

Mar 18, 2015, 4:54:38 PM
to tiddl...@googlegroups.com, mwik...@gmail.com, jeremy...@gmail.com
... and to address the online extractors listed in that search result ... the document that I am trying to process is business-internal. There would be some privacy concerns to using free online services.

/Mike

RichardWilliamSmith

Mar 18, 2015, 10:04:26 PM
to tiddl...@googlegroups.com, mwik...@gmail.com, jeremy...@gmail.com
Sorry if this is too simplistic but if you want to get the images out and you're happy to do it by hand, you could just grab them from the screen - if you're using a mac this is super-easy using shift-command-4 and then selecting the portion of the screen. It makes png images and you can drag them all into TW at once.

Regards,
Richard

Michael Wiktowy

Mar 18, 2015, 11:26:33 PM
to tiddl...@googlegroups.com
The problem was not that I couldn't get the pictures out of the PDF. The problem was that simply highlighting them, copying them and then pasting them into TiddlyWiki unexpectedly and transparently converted them to PNG encoded images from JPG encoded images. I have since used the "Save As" option in Evince at home to save them to a format of my choice. Saving them as JPG reduced the size by an order of magnitude again.

I guess my understanding that images embedded in a PDF have their format preserved was incorrect ... or the format was ignored by Chrome when copying and pasting.

/Mike