possible bug: --extract-only option deletes destination directory

JohnV474

Mar 11, 2012, 3:57:20 PM

to chm2pdf

I hope I've observed correctly. I'm new (but learning), so if I am
mistaken I apologize.

When I run "chm2pdf --extract-only name_of_file.chm", or
"chm2pdf --extract-only --verbose", it appears that chm2pdf is creating a
temporary directory, but then deleting it, and reporting success.

chm2pdf --help says that the --extract-only option should put the
HTML files in CHM2PDF_WORK_DIR. I looked at the script and I can't tell
whether this is defined. I did see a reference to /tmp, so I started
watching that folder while the script ran.

So, I entered "chm2pdf --extract-only --verbose name_of_file.chm".
Here's the output:

$ chm2pdf --extract-only --verbose name_of_file.chm
CHM2PDF_WORK_DIR = /tmp/tmpg9acxv/name_of_file
CHM2PDF_ORIG_DIR = /tmp/tmpV0E3DU/name_of_file
Correcting links in the HTML files...
$

(Note: I edited out the filename.)

While that was running, I watched /tmp and saw both /tmp/tmpg9acxv
and /tmp/tmpV0E3DU being created. If I open one of them before
chm2pdf returns the shell prompt to me, I find all of the html files.
I can quickly copy them.

Then, both directories get deleted. I do not find them anywhere else.

If this is a bug, I wanted to report it. I may not be using the
script appropriately as I learn. If I find a solution I will post it
here.

Thanks.
-JV474

JohnV474

Mar 11, 2012, 4:27:27 PM

to chm2pdf

I found an imperfect solution.

I opened /usr/bin/chm2pdf and went to the end and commented out the
following lines:

# shutil.rmtree(CHM2PDF_TEMP_WORK_DIR)
# shutil.rmtree(CHM2PDF_TEMP_ORIG_DIR)

Upon running the script as I had done before, the /tmp directories are
no longer removed. I can find the extracted html files.

Two outstanding issues:
1) links between files do not work.
2) images in output files do not appear.

One directory contains files named temp####.html. Based on the
timestamp, these are the final output. With those filenames, the
existing links between files do not work.

The other directory contains several files I do not understand, and
within a subfolder, the extracted html files. However, the links
between html files do not work because the filenames do not maintain
the case (uppercase/lowercase) of the original files in the CHM file.
Also, the images in the html files do not appear.

I may simply convert to PDF and call it a day, but had hoped to have a
folder of html files instead. I just don't like that .chm is not very
portable.

This is my first attempt at finding a bug and taking steps to correct
it myself (first time reading/tinkering with someone else's code,
etc.), so please be patient.

-JV474

JohnV474

Mar 11, 2012, 4:43:07 PM

to chm2pdf

Correction:

The files in CHM2PDF_ORIG_DIR do show the images properly. Links work
unless the target file's name no longer matches, due to being
converted to lowercase.
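
The case mismatch described above could be worked around by looking up each
link target case-insensitively against the files that actually exist on
disk. The following is only a minimal sketch of that idea, not code from
chm2pdf: the function names build_case_map and resolve_link are
hypothetical, and it assumes links are plain relative paths with an optional
"#fragment".

```python
import os


def build_case_map(root):
    """Map lowercased relative paths to the spelling found under root."""
    case_map = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            # Links in HTML use forward slashes regardless of platform.
            rel = rel.replace(os.sep, "/")
            case_map[rel.lower()] = rel
    return case_map


def resolve_link(href, case_map):
    """Return the on-disk spelling of href, or href unchanged if unknown."""
    # Split off a fragment such as "#top" so only the path is looked up.
    path, sep, fragment = href.partition("#")
    actual = case_map.get(path.lower())
    if actual is None:
        return href
    return actual + sep + fragment
```

Rewriting the href attributes in each HTML file with resolve_link would then
make the links match the lowercased filenames. If two files differ only by
case, the map keeps just one of them, so this is a heuristic rather than a
complete fix.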

Here is the output, for reference:
$ chm2pdf --extract-only --verbose name_of_file.chm
CHM2PDF_WORK_DIR = /tmp/tmpC8FiuD/name_of_file
CHM2PDF_ORIG_DIR = /tmp/tmppRTswH/name_of_file

Correcting links in the HTML files...
$

In the above output, the CHM2PDF_WORK_DIR is the one that contains
temp####.html, and the CHM2PDF_ORIG_DIR is the one that contains the
images and everything else.

-JV474

Reto

Mar 12, 2012, 4:11:55 PM

to chm...@googlegroups.com

Hi JohnV474!

May I ask which version of the script you are using? The one installed from e.g. the Ubuntu repository, or the one downloaded from http://code.google.com/p/chm2pdf/ ?
If you want to use the --extract-only option, I suggest you download the script from the Google Code page, as it works without temp directories.
Another solution is, like you did, to comment out the rmtree lines.
Or, you can change
CHM2PDF_TEMP_WORK_DIR=tempfile.mkdtemp()
CHM2PDF_TEMP_ORIG_DIR=tempfile.mkdtemp()
to e.g. (like the chm2pdf-0.9.1.tar.gz version)
CHM2PDF_TEMP_WORK_DIR='/tmp/chm2pdf/work'
CHM2PDF_TEMP_ORIG_DIR='/tmp/chm2pdf/orig'
This avoids "polluting" your system with random directory names.
The best solution would be not to delete the directories when the --extract-only option is used.
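
A minimal sketch of that last idea, assuming the script knows whether --extract-only was given. The helper cleanup_temp_dirs and its extract_only parameter are hypothetical names for illustration; only the two CHM2PDF_TEMP_* variables and the mkdtemp/rmtree calls come from the script lines quoted in this thread.

```python
import shutil
import tempfile


def cleanup_temp_dirs(temp_dirs, extract_only):
    """Remove temp_dirs unless extract_only is set; return the dirs kept."""
    if extract_only:
        # Keep the extracted files so the user can inspect or copy them.
        return list(temp_dirs)
    for path in temp_dirs:
        shutil.rmtree(path, ignore_errors=True)
    return []


if __name__ == "__main__":
    CHM2PDF_TEMP_WORK_DIR = tempfile.mkdtemp()
    CHM2PDF_TEMP_ORIG_DIR = tempfile.mkdtemp()
    kept = cleanup_temp_dirs(
        [CHM2PDF_TEMP_WORK_DIR, CHM2PDF_TEMP_ORIG_DIR], extract_only=True
    )
    for path in kept:
        print("Extracted files kept in", path)
```

With this approach the random directory names remain, so the script would also need to print them (as --verbose already does) for the user to find the files.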

Here is the list of errors I found. If you think you could be affected by one of them, please try the code I posted at https://code.launchpad.net/~reto-knaak under the code section:
* LP: #890870 multiple page problem
* LP: #890873 links not working in the PDF with upper/lower case spelling error
* LP: #890882 another broken link issue: solved in patch for 890873
* LP: #890874 Images not rendered in PDF due to upper/lower case spelling error
* LP: #890877 table background color removed
* LP: #890878 no effort is done in chm2pdf to delete javascript
* LP: #630520 chm2pdf crashed if BeautifulSoup is used but not installed
* LP: #894193 Trouble if chm contains path with spaces
* LP: #896692 Last page of CHM incompletely rendered or missing
* LP: #500262 Errors if file name contains escaped special characters (space, parenthesis etc)

I hope this helps...

It's a pity that the developers abandoned this script, because it does a great job (at least for me!).

Chris Karakas

Mar 13, 2012, 9:59:42 AM

to chm...@googlegroups.com

John,

you probably have a version of chm2pdf that came from a Linux
distribution, like Debian for example.

If you try the version from my homepage

http://www.karakas-online.de/forum/viewtopic.php?t=10275

you should not encounter the problem you describe.

However, I must warn you that the way "my" version creates temporary
directories is considered a security vulnerability (AFAIR the problem is
that the names of the temp dirs are predictable).

If you are the only person who uses chm2pdf, then this should not pose a
problem IMO. However, I can imagine that there are scenarios where such
"openness" in creating temporary directories might lead to privilege
escalation (i.e. someone else becoming root on your computer).

If you are interested in the details, which lie some years back, you
should check the Debian mailing lists, as well as the bug reporting
system at the Google Groups page of chm2pdf.

The decision to practically destroy the --extract-only option, taken by
the Debian people, was the main reason for my withdrawal from further
development of this project. I understood the reasons, but it was so
frustrating an interruption for me - and, as you see, it takes *years*
for me to recover (if at all) from interruptions.

"No man ever steps in the same river twice, for it's not the same river
and he's not the same man." - Heraclitus.

Regards

Chris Karakas
http://www.karakas-online.de

Chris Karakas

Mar 13, 2012, 10:22:43 AM

to chm...@googlegroups.com

The idea of the --extract-only and --dontextract options is to help you
in a situation like the one you are encountering: use the --extract-only
option to get the "original" version of the HTML files inside your CHM
(to be found in the directory pointed to by CHM2PDF_ORIG_DIR, in my
version of chm2pdf at
http://www.karakas-online.de/forum/viewtopic.php?t=10275), go check for
the causes of your problem (in your case: capital or small letters),
remedy the problem, then rerun chm2pdf with --dontextract.

Say "thank you" to Debian, who destroyed them without offering
alternatives for this use case.

I am not inclined to spend time on fixing something that was
deliberately broken.

And say "thank you" to your CHM creator for not distinguishing between
capital and small letters (breaking the HTML standard, but possibly
still conforming to CHM...).

Regards

Chris Karakas
http://www.karakas-online.de

Reto

Mar 14, 2012, 6:42:10 AM

to chm2pdf

Hi Chris!

Thank you very much for your great script!

Ciao
Reto

Elvis Donald Attro

Apr 7, 2016, 9:19:55 PM

to chm2pdf

I've just spent an hour (almost going crazy) wondering why the /tmp/chm2pdf/work and /tmp/chm2pdf/orig directories referred to in the source were never created, and why these /tmp/tmp#### folders instead appeared and quickly disappeared once the job was done, with no clue that the Debian version I was using was totally different from Chris' version. Outstanding!!!
Thanks to this thread, everything is clear now.
And I totally agree with the idea behind the --extract-only option, because things often go wrong (just like now) and you need to go directly to the files and get your hands dirty. And the way it has been broken, wow! I understand the security issue, but they could have at least tried to provide an alternative, maybe by letting the user pass the destination folder as a parameter to the --extract-only option, just for this special case.
Anyway, thanks to y'all for this thread, because I was really going to spend a lot of time on this for nothing.
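
The idea of letting the user choose the destination folder could look roughly like the sketch below (Python 3). Everything here is an assumption for illustration and not part of any chm2pdf version: the --extract-dir option, the parse_args and choose_work_dir helpers, and the use of argparse. A separate option is used rather than an optional argument to --extract-only, because an optional argument would be ambiguous with the CHM filename that follows it.

```python
import argparse
import os
import tempfile


def parse_args(argv):
    """Parse a reduced, hypothetical command line."""
    parser = argparse.ArgumentParser(prog="chm2pdf-sketch")
    parser.add_argument("--extract-only", action="store_true",
                        help="extract the HTML files and stop")
    parser.add_argument("--extract-dir", metavar="DIR", default=None,
                        help="hypothetical: directory to extract into")
    parser.add_argument("chm_file")
    return parser.parse_args(argv)


def choose_work_dir(extract_dir):
    """Return (path, is_temporary) for the extraction directory."""
    if extract_dir:
        # The user picked the location, so it must not be deleted afterwards.
        os.makedirs(extract_dir, exist_ok=True)
        return os.path.abspath(extract_dir), False
    # Fall back to an unpredictable name, as the Debian version does.
    return tempfile.mkdtemp(), True


if __name__ == "__main__":
    args = parse_args(["--extract-only", "--extract-dir", "out", "book.chm"])
    work_dir, is_temporary = choose_work_dir(args.extract_dir)
    print(work_dir, "temporary" if is_temporary else "kept")
```

Since the user names the directory explicitly, the predictable-name concern would apply only to locations the user chose deliberately, and the cleanup step can skip any directory that is not temporary.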
