Regarding HTML to PDF conversion using jodconverter 2.2.2 for Hindi

Arindam Lahiri

Jun 20, 2009, 9:15:05 AM

to JODConverter

Lately I have been having problems converting a Java string containing
Unicode HTML (with Hindi text) to PDF using the JODConverter Java
library 2.2.2. The PDF shows garbage characters where the Hindi
characters were in the HTML.
The character encoding of the string itself seems fine: it is stored as
a CLOB in an Oracle database, and when the CLOB is retrieved again the
Hindi displays without any problem.
I also copied this Hindi text into an HTML page (.html file) and
converted it to PDF using the JODConverter CLI, but that PDF has garbage
characters too. However, putting the Hindi text in an ODT document and
converting that to PDF with the JODConverter CLI works perfectly.

Curt Arnold

Jun 21, 2009, 1:33:18 AM

to jodcon...@googlegroups.com

XML has very explicit rules for determining a document's encoding: it
is UTF-8, UTF-16 or UCS-4 unless explicitly stated otherwise, and an
XML processor never tries to determine the encoding from the local
machine's locale. I'm not so sure that is the case with HTML, and it
may differ between applications. For reference, see: http://www.w3.org/TR/REC-html40/charset.html

You don't mention what encoding your HTML is saved in or how it was
edited. I assume that the page properly renders when you load it into
a web browser, but that just may be due to the browser's convention
being in sync with how you created your content and OpenOffice.org's
guessing wrong.

I would try:

a) Testing with a UTF-16 HTML document. That should eliminate any
encoding confusion.

b) Adding a <meta> tag to the HTML to specify the encoding if it is
saved in a local file. I think this has less chance of being effective.

c) Adjusting the encoding used for the HTML file and/or your locale
settings.
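
To illustrate the kind of mismatch described above, here is a minimal sketch in plain Java (no JODConverter involved). It encodes a short Hindi string as UTF-8 and then decodes the same bytes as ISO-8859-1, which stands in for an application guessing the wrong charset. The class and method names are made up for the illustration, and the choice of ISO-8859-1 is an assumption; what OpenOffice.org actually guesses depends on its version and the locale.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {
    // Encode text in one charset, then decode the same bytes in another,
    // as a reader would if it guessed the encoding wrongly.
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        byte[] bytes = text.getBytes(writtenAs);
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        // Five Devanagari characters, written as escapes so the source
        // file's own encoding does not matter.
        String hindi = "\u0939\u093F\u0902\u0926\u0940";

        // Writer and reader agree on UTF-8: the text survives.
        String same = reinterpret(hindi, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(hindi.equals(same)); // prints true

        // Writer used UTF-8, reader assumed ISO-8859-1: every 3-byte
        // sequence turns into three unrelated Latin-1 characters.
        String garbled = reinterpret(hindi, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(hindi.equals(garbled)); // prints false
    }
}
```

A browser may render the same file correctly simply because its guess happens to match how the file was saved.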

Mirko Nasato

Jun 21, 2009, 5:46:54 AM

to jodcon...@googlegroups.com

Hi Arindam,

If you open your HTML file manually in OOo to start with, you can see
whether it displays correctly, even before exporting it to PDF. Then
adjust the charset in the HTML file if required, as Curt explained. I
would actually try the meta tag approach first, e.g.

<meta http-equiv="content-type" content="text/html; charset=UTF-8">

Kind regards

Mirko


Arindam Lahiri

Jun 21, 2009, 3:17:06 PM

to JODConverter

Thanks, Curt and Mirko, for the tip. The meta tag was incorrect in the
HTML file, and the command line started working perfectly after I
corrected it. But another bizarre thing is happening. I am generating
Hindi HTML text from a web-based HTML editor in my Struts application.

A few lines from an action class:

String strHindiHTML = (String) addAppForm.get("hindiHTML");
OpenOfficeConverter oOpenOfficeConverter = new OpenOfficeConverter(oOpenOfficeConnection);
ByteArrayOutputStream oByteArrayOutputStream = new ByteArrayOutputStream();
oOpenOfficeConverter.convertFromHTMLToPDF(strHindiHTML, oByteArrayOutputStream);
ByteArrayInputStream oByteArrayInputStream = new ByteArrayInputStream(oByteArrayOutputStream.toByteArray());
I then use oByteArrayInputStream to store the PDF stream in an Oracle
BLOB.

The class OpenOfficeConverter is just a utility class for converting
the HTML stream to a PDF stream, and the function used here is:
public void convertFromHTMLToPDF(String oHTMLText, OutputStream oOutputStream)
{
    try
    {
        DocumentFormat inputDocumentFormat = new DocumentFormat("HTML", DocumentFamily.TEXT, "text/html", "html");
        inputDocumentFormat.setExportFilter(DocumentFamily.TEXT, "HTML (StarWriter)");
        DocumentFormat outputDocumentFormat = new DocumentFormat("Portable Document Format", DocumentFamily.TEXT, "application/pdf", "pdf");
        outputDocumentFormat.setExportFilter(DocumentFamily.TEXT, "writer_pdf_Export");
        DocumentConverter oDocumentConverter = new OpenOfficeDocumentConverter(this.oOpenOfficeConnection);
        oDocumentConverter.convert(new ByteArrayInputStream(oHTMLText.getBytes()), inputDocumentFormat, oOutputStream, outputDocumentFormat);
        ........................................

The converted PDF stream shows the Hindi text with many ? characters.
That is, the PDF does contain Hindi characters, but they are rendered
incorrectly and interspersed with question marks. To debug, I also
replaced the last line with:
oDocumentConverter.convert(new ByteArrayInputStream(oHTMLText.getBytes()), inputDocumentFormat, new FileOutputStream("c:\\test1234.pdf"), outputDocumentFormat);

But the file test1234.pdf has the same problem. I also thought there
might be an encoding issue, so I tried oHTMLText.getBytes("UTF-8"), but
to no avail. Meanwhile, this same strHindiHTML string is stored in an
Oracle CLOB, retrieved from the database again, and renders flawlessly
in a browser.
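
For reference, here is a small plain-Java sketch of two things that could plausibly produce this symptom; neither is confirmed as the cause in this thread. First, getBytes() with no argument uses the platform default charset, and a charset that cannot represent Devanagari substitutes ? for each character (US-ASCII is used below purely as an example of such a charset). Second, if the editor returns only an HTML fragment, the bytes carry no charset declaration even when encoded as UTF-8. The class name HtmlBytes and its methods are hypothetical helpers, not part of JODConverter.

```java
import java.nio.charset.StandardCharsets;

class HtmlBytes {
    // Hypothetical helper: wrap an HTML fragment in a complete document
    // whose meta tag declares UTF-8 (assumes the fragment has no <html>/<head>).
    static String wrapWithCharset(String fragment) {
        return "<html><head>"
             + "<meta http-equiv=\"content-type\" content=\"text/html; charset=UTF-8\">"
             + "</head><body>" + fragment + "</body></html>";
    }

    // Encode explicitly as UTF-8 so the bytes match the declared charset,
    // instead of relying on the platform default used by getBytes().
    static byte[] toUtf8Document(String fragment) {
        return wrapWithCharset(fragment).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Five Devanagari characters, written as escapes.
        String hindi = "\u0939\u093F\u0902\u0926\u0940";

        // Encoding with a charset that lacks Devanagari loses the text:
        // each character is replaced by '?'.
        byte[] lossy = hindi.getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(lossy, StandardCharsets.US_ASCII));

        // Explicit UTF-8 plus a matching meta tag keeps the text intact.
        byte[] document = toUtf8Document("<p>" + hindi + "</p>");
        String decoded = new String(document, StandardCharsets.UTF_8);
        System.out.println(decoded.contains(hindi)); // prints true
    }
}
```

The resulting byte array could then be passed to the stream-based convert() call in place of oHTMLText.getBytes().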

I am not able to figure out the problem. I would like to seek your
advice and expertise.

With Highest Regards
Arindam


Mirko Nasato

Jun 22, 2009, 8:54:18 AM

to jodcon...@googlegroups.com

Hi Arindam,

2009/6/21 Arindam Lahiri <arindam...@gmail.com>

>
> But the file test1234.pdf has the same problem. I also thought that
> maybe there is some encoding issue so I used oHTMLText.getBytes
> ("UTF-8") but to no use meanwhile this strHindiHTML string is stored
> in oracle clob, again retrieved from database and renders flawlessly
> in browser.
>

Again, start by making sure that the HTML renders correctly in
OpenOffice.org, by opening the file manually in Writer.

Have a look at the FAQ: http://code.google.com/p/jodconverter/wiki/FAQ

Kind regards

Mirko

Arindam Lahiri

Jun 22, 2009, 11:20:59 AM

to JODConverter

Dear Mirko,

It renders perfectly when opened manually in OpenOffice 3.1.0, and the
command-line conversion from HTML to PDF using JODConverter is perfect
too. The conversion also works perfectly using java.io.File with the
JODConverter 2.2.2 Java library, but it does not work properly using
streams. That is what bothers me most, as streams are how I intend to
use it in my web application.

With Best Regards
Arindam


Mirko Nasato

Jun 22, 2009, 12:19:37 PM

to jodcon...@googlegroups.com

Hi Arindam,

2009/6/22 Arindam Lahiri <arindam...@gmail.com>

>
> It renders perfectly manually in OpenOffice 3.1.0 and I have done
> conversion through command line also from html to pdf using
> jodconverter which is perfect too. The conversion is also perfect
> using java.io.File and Java Library of jodconverter 2.2.2 but is not
> properly taking place using streams and that is what bothers me the
> most as this is the way I intend to use it in my web application.
>

Just use files then; OOo works with files anyway. The convert() method
that accepts streams will create temporary files internally.
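
A minimal sketch of the file-based route for a web application, assuming the HTML arrives as a Java String: write it to a temporary file in an explicit encoding, convert file to file, then read the PDF bytes back. Only the file handling is shown as running code. The JODConverter call appears as a comment, since it needs a running OpenOffice.org instance, and the class name HtmlTempFile and the temp-file prefixes are made up for the illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class HtmlTempFile {
    // Write the HTML string to a temporary .html file as UTF-8.
    // The HTML is assumed to declare charset=UTF-8 in its meta tag.
    static Path writeHtml(String html) throws IOException {
        Path input = Files.createTempFile("jod-input-", ".html");
        Files.write(input, html.getBytes(StandardCharsets.UTF_8));
        return input;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head>"
                + "<meta http-equiv=\"content-type\" content=\"text/html; charset=UTF-8\">"
                + "</head><body>\u0939\u093F\u0902\u0926\u0940</body></html>";

        Path input = writeHtml(html);
        Path output = Files.createTempFile("jod-output-", ".pdf");
        try {
            // With a connected converter, the File-based call reported as
            // working in this thread would go here, for example:
            //   oDocumentConverter.convert(input.toFile(), output.toFile());
            // The PDF bytes for the BLOB could then be read with:
            //   byte[] pdf = Files.readAllBytes(output);
            System.out.println("HTML written to " + input);
        } finally {
            // Clean up the temporary files whether or not conversion succeeded.
            Files.deleteIfExists(input);
            Files.deleteIfExists(output);
        }
    }
}
```

Writing the file yourself keeps control over the encoding on disk, rather than leaving it to whatever the stream-based path does internally.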

Kind regards

Mirko

Bill

Jun 27, 2009, 2:11:13 PM

to JODConverter

There are two conversions involved here. One is the input filter that
converts HTML to ODT internally, and I've seen lots of problems with
that filter, particularly where style sheets are involved. In fact,
I've given up trying to use OpenOffice for HTML conversion, and am now
using wkpdf for that (or wkhtmltopdf on non-Mac systems), using
OpenOffice only for office-style formats. wkpdf has the whole WebKit
bag of HTML parsing and rendering tricks, so it's pretty good.
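
For anyone wanting to try the same route from Java, here is a rough sketch of calling wkhtmltopdf as an external process. It assumes the wkhtmltopdf binary is installed and on the PATH, and that the input is a UTF-8 HTML file; the file names are placeholders and the class name WkHtmlToPdf is made up for the illustration.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class WkHtmlToPdf {
    // Build the command line: wkhtmltopdf --encoding utf-8 <input> <output>.
    // --encoding sets the default text encoding assumed for the input.
    static List<String> command(String inputHtml, String outputPdf) {
        return Arrays.asList("wkhtmltopdf", "--encoding", "utf-8", inputHtml, outputPdf);
    }

    // Run the conversion and return the process exit code.
    // Throws IOException if the binary cannot be started.
    static int run(String inputHtml, String outputPdf)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(inputHtml, outputPdf))
                .inheritIO() // forward the tool's output to this process's console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Only print the command here; call run(...) where the tool is installed.
        System.out.println(String.join(" ", command("input.html", "output.pdf")));
    }
}
```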

Bill

Alex Drizen

Jul 2, 2009, 3:18:23 PM

to jodcon...@googlegroups.com

Mirko,

Quick question: I was going to get started with Rimu Hosting as you recommend, but which Linux distribution is best for OpenOffice?

Centos5
Debian 4.0 (aka Etch)
Debian 5.0 (aka Lenny, RimuHosting recommended distro)
Ubuntu 9.04 (Jaunty Jackalope, from 2009-04)
Ubuntu 8.10 (Intrepid Ibex, from 2008-10)
Ubuntu 8.04 (Hardy Heron, 5 yr long term support (LTS))
Fedora 10