How to set charset for txt file during conversion to pdf


Janusz Prokulewicz

unread,
Apr 21, 2010, 10:39:02 AM4/21/10
to JODConverter
Hello,

I'm using JODConverter to convert files in my application, and I've run into
an issue when converting a .txt file containing Chinese (Simplified)
characters.
I tried opening the .txt file in OpenOffice, where I can choose which
charset to use when importing the file. When I select one of these:
Chinese simplified (apple mac)
Chinese simplified (EUC-CN)
Chinese simplified (GB-18030)
and then save it as PDF, everything works: the Chinese characters appear
correctly both in the imported document and in the saved .pdf file.

When I send the same .txt file through JODConverter, it creates a .pdf
with the wrong characters. I get the same broken result if I select the
UTF-8 character set when importing the .txt file into OO.

My question is:
how can I programmatically set which charset JODConverter should use
when opening the input .txt file?

JODConverter is poorly documented, and I couldn't find anything useful
on this topic.

Thanks in advance,

Janusz


Janusz Prokulewicz

unread,
Apr 27, 2010, 9:01:13 AM4/27/10
to JODConverter
So far I have tried something like this:

-----------------------------------------------------
OfficeDocumentConverter converter =
        new OfficeDocumentConverter(this.officeManager);

String ext = FileConverterUtils.getFileExtension(inputFile.getName());
if ("txt".equals(ext)) {
    log.info("setting custom encoding type");
    Map<String, Object> loadProperties = new HashMap<String, Object>();
    loadProperties.put("Hidden", true);
    loadProperties.put("ReadOnly", true);
    loadProperties.put("FilterName", "Text (encoded)");
    loadProperties.put("FilterOptions", "GB2312");

    converter.setDefaultLoadProperties(loadProperties);
}
converter.convert(inputFile, outputFile);
-----------------------------------------------------

I've also tried setting the encoding to 'GB-18030' (in several variants,
like 'gb18030', etc.), but that didn't work either.

Any ideas, anybody ?


Mirko Nasato

unread,
Apr 27, 2010, 5:56:23 PM4/27/10
to JODConverter
First of all, you'll have to find out which FilterOptions strings are
accepted by OpenOffice.org, because JODConverter just passes those
options straight to OOo. I know "utf8" works, but I don't know all the
possible values; they're likely only documented in the OOo source code.

Secondly, you probably don't want setDefaultLoadProperties(), because
that applies to all conversions; instead, customize only the txt format
in the DocumentFormatRegistry.

Lastly, if you'd like to contribute to improving the JODConverter
documentation, your help is welcome.

Kind regards

Mirko

Janusz Prokulewicz

unread,
Apr 28, 2010, 4:11:29 AM4/28/10
to JODConverter
Hello,

thanks for the answer.

I figured out how it works. There were two problems:
first, the string identifying the Simplified Chinese encoding was not
GB2312 but gb_2312.

Second, the way I passed the txt load settings to the converter. Below
is the working version of my code (part of the method that converts
documents):


-----------------------------------------------------
OfficeDocumentConverter converter = null;

String inputExt = FileConverterUtils.getFileExtension(inputFile.getName());

if ("txt".equals(inputExt)) {

    String txtCharset = CharsetDetector.detectCharset(inputFile);
    if ("gb2312".equals(txtCharset.toLowerCase())) {
        log.info("setting custom encoding type (GB_2312)");

        SimpleDocumentFormatRegistry sdfr = new SimpleDocumentFormatRegistry();

        DocumentFormat txt = new DocumentFormat("Plain Text", "txt", "text/plain");
        txt.setInputFamily(DocumentFamily.TEXT);
        Map<String, Object> txtLoadProperties = new LinkedHashMap<String, Object>();
        txtLoadProperties.put("Hidden", true);
        txtLoadProperties.put("ReadOnly", true);
        txtLoadProperties.put("FilterName", "Text (encoded)");
        txtLoadProperties.put("FilterOptions", "gb_2312");
        txt.setLoadProperties(txtLoadProperties);
        sdfr.addFormat(txt);

        DocumentFormat pdf = new DocumentFormat("Portable Document Format",
                "pdf", "application/pdf");
        pdf.setStoreProperties(DocumentFamily.TEXT,
                Collections.singletonMap("FilterName", "writer_pdf_Export"));
        pdf.setStoreProperties(DocumentFamily.SPREADSHEET,
                Collections.singletonMap("FilterName", "calc_pdf_Export"));
        pdf.setStoreProperties(DocumentFamily.PRESENTATION,
                Collections.singletonMap("FilterName", "impress_pdf_Export"));
        pdf.setStoreProperties(DocumentFamily.DRAWING,
                Collections.singletonMap("FilterName", "draw_pdf_Export"));
        sdfr.addFormat(pdf);

        converter = new OfficeDocumentConverter(this.officeManager, sdfr);
    } else {
        converter = new OfficeDocumentConverter(this.officeManager);
    }

} else {
    converter = new OfficeDocumentConverter(this.officeManager);
}

converter.convert(inputFile, outputFile);
-----------------------------------------------------

I used the org.mozilla.intl.chardet.CharsetDetector class to detect the
file's encoding.
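If you don't have jchardet on the classpath, the detection step can be approximated with plain JDK classes: try each candidate charset with a strict decoder and return the first one that decodes the bytes without errors. This is a simplified stand-in, not the detector used above; the class name and candidate list are mine:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Simplified stand-in for the jchardet detection step: try candidate
// charsets in order and return the first one that decodes cleanly.
public class SimpleCharsetDetector {

    // The candidate list is an assumption; extend it for your inputs.
    // UTF-8 goes first because it rejects most non-UTF-8 byte patterns.
    private static final String[] CANDIDATES = {"UTF-8", "GB2312", "ISO-8859-1"};

    public static String detectCharset(byte[] data) {
        for (String name : CANDIDATES) {
            CharsetDecoder decoder = Charset.forName(name).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                decoder.decode(ByteBuffer.wrap(data));
                return name;
            } catch (CharacterCodingException e) {
                // decoding failed: try the next candidate
            }
        }
        return "ISO-8859-1"; // latin-1 accepts any byte sequence
    }

    public static void main(String[] args) {
        byte[] gb = "\u4e2d\u6587".getBytes(Charset.forName("GB2312"));
        System.out.println(detectCharset(gb)); // prints GB2312
    }
}
```

Note this is only a heuristic: a GB2312 file whose bytes happen to be valid UTF-8 would be misdetected, which is exactly why a statistical detector like jchardet is the better tool.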

I hope this helps other folks dealing with similar problems, at least a
little. Of course it would be useful to write proper JODConverter
documentation, but as always there's no time for that.


Greetings,
Janusz

project2501

unread,
Apr 30, 2010, 9:10:33 AM4/30/10
to JODConverter
Thank you for this!

I noticed the same problem with OpenOffice 3.2. I was using
pyodconverter, but I'm switching to the new JODConverter for its process
management. My Chinese .doc got converted to PDF with ?'s instead of
Chinese characters. It seems to be the same problem, but I still have to
try your solution.
I'm using chardet too (the Python version, though), and I'll be building
a JODConverter webapp for conversions.

Mirko Nasato

unread,
May 1, 2010, 5:48:56 AM5/1/10
to JODConverter
Specifying a charset is required for *.txt files because, being just
text content, they don't embed any information about their encoding in
the file itself. But for *.doc and other formats you don't need to
specify the charset explicitly.
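A quick way to see this: the same bytes decode to different characters depending on which charset you assume, and nothing in a plain text file records which one is right. A minimal JDK-only illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetAmbiguity {
    public static void main(String[] args) {
        // Bytes of "\u4e2d\u6587" ("Chinese") encoded as GB2312;
        // nothing in the byte stream says which encoding was used.
        byte[] data = "\u4e2d\u6587".getBytes(Charset.forName("GB2312"));

        // Assuming the wrong charset silently produces mojibake.
        System.out.println(new String(data, StandardCharsets.ISO_8859_1)); // prints ÖÐÎÄ
        System.out.println(new String(data, Charset.forName("GB2312")));   // prints 中文
    }
}
```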

Kind regards

Mirko

Samuel Thibault

unread,
May 1, 2010, 6:39:10 AM5/1/10
to jodcon...@googlegroups.com
Mirko Nasato, on Sat 01 May 2010 02:48:56 -0700, wrote:
> Specifying a charset is required for *.txt files because they don't
> embed any info about their encoding in the file itself, being just
> text content.

Can't the charset default to the current locale encoding?

Samuel

Mirko Nasato

unread,
May 1, 2010, 7:05:42 AM5/1/10
to JODConverter
On May 1, 12:39 pm, Samuel Thibault <samuel.thiba...@ens-lyon.org>
wrote:
> Mirko Nasato, on Sat 01 May 2010 02:48:56 -0700, wrote:
>
> > Specifying a charset is required for *.txt files because they don't
> > embed any info about their encoding in the file itself, being just
> > text content.
>
> Can't the charset default to the current locale encoding?
>
Yes, but how does that help if you have *.txt files in different
charsets?

Kind regards

Mirko

Samuel Thibault

unread,
May 1, 2010, 7:07:56 AM5/1/10
to jodcon...@googlegroups.com
Mirko Nasato, on Sat 01 May 2010 04:05:42 -0700, wrote:
> On May 1, 12:39 pm, Samuel Thibault <samuel.thiba...@ens-lyon.org>
> wrote:
> > Mirko Nasato, on Sat 01 May 2010 02:48:56 -0700, wrote:
> >
> > > Specifying a charset is required for *.txt files because they don't
> > > embed any info about their encoding in the file itself, being just
> > > text content.
> >
> > Can't the charset default to the current locale encoding?
> >
> Yes, but how does that help if you have *.txt files in different
> charsets?

It helps for the files that are in your usual charset, which is probably
the common case in many situations.

Samuel