Parsing ODT files (with Pantomime?)

Bastien

Jun 3, 2014, 7:27:40 PM
to clo...@googlegroups.com
Hi all,

I'm trying to get the content of an ODT file as plain text.

I've found Pantomime, but I don't understand how to use it.

Can anyone put me on the right track with a minimal working
example?

Thanks in advance!

--
Bastien

Denis Fuenzalida

Jun 3, 2014, 8:58:20 PM
to clo...@googlegroups.com
Hi Bastien,

ODT files from OpenOffice/LibreOffice are just Zip files containing a number of XML files, plus folders for any images or media you've inserted into the document. The text itself is in a file called "content.xml" inside the archive.
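
To illustrate the Zip layout described above, here is a minimal sketch that pulls the raw content.xml out of an ODT file using only JDK classes (Java 9+ for readAllBytes). The class name OdtContentXml is made up for this example, and decoding as UTF-8 is an assumption, since the real encoding is whatever the XML declaration says.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper class, not part of any library mentioned in this thread.
public class OdtContentXml {

    /** Returns the raw XML stored as "content.xml" inside the ODT (Zip) file. */
    public static String read(String odtPath) throws IOException {
        try (ZipFile zip = new ZipFile(odtPath)) {
            ZipEntry entry = zip.getEntry("content.xml");
            if (entry == null) {
                throw new IOException("No content.xml in " + odtPath);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                // Assumption: the entry is UTF-8 encoded.
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java OdtContentXml file.odt");
            return;
        }
        System.out.println(read(args[0]));
    }
}
```

The result is still XML markup, not plain text, so the tags have to be stripped or parsed in a further step.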

There's a plain Java parser for ODT files on this very old post in one of the Oracle blogs which may be handy: https://blogs.oracle.com/prasanna/entry/openoffice_parser_extracting_text_from

Regards,

Denis

Denis Fuenzalida

Jun 3, 2014, 9:52:07 PM
to clo...@googlegroups.com
I've created a small gist which shows how to use the ODFDOM API, which is much simpler to use:

https://gist.github.com/dfuenzalida/a1e9755e9b2e7f638620

Jeffrey Cummings

Jun 3, 2014, 9:53:59 PM
to clo...@googlegroups.com
You may want to look at Docjure https://github.com/mjul/docjure

It parses .xlsx files, so it may be able to parse .odt files. It uses the Apache POI Java library for parsing.

Jeff 

Bastien

Jun 4, 2014, 5:03:05 AM
to Denis Fuenzalida, clo...@googlegroups.com
Hi Denis,

Denis Fuenzalida <denis.fu...@gmail.com> writes:

> I've created a small gist which shows how to use the ODFDOM API which
> is much simpler to use:
>
> https://gist.github.com/dfuenzalida/a1e9755e9b2e7f638620

Thanks a lot for this! I tested it and I can get the human readable
text from an arbitrary .odt file.

My next goal is to get what getTextContent gives me, but with new
lines preserved -- I used "getContentDom" so that I can get the full
DOM and find new lines by replacing <text:p...>...</text:p> ... but
this feels dirty. Do you know if I can get paragraphs easily? Or
should I manipulate the DOM with some other XML parsing tool?

Thanks for your time,

--
Bastien
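
One possible way to keep paragraph breaks, sketched with only the JDK's DOM API: walk the elements in the ODF text namespace, keep text:p and text:h, and join their text with newlines. The class and method names here are hypothetical. The paragraphs method takes any org.w3c.dom.Document, so it should also accept the DOM returned by getContentDom if that object implements the interface (an assumption, not checked against ODFDOM). Known limits: text:tab and text:line-break elements contribute nothing to getTextContent, and a paragraph nested inside another (a footnote body, for instance) would be listed twice.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class for illustration only.
public class OdtParagraphs {

    // Namespace bound to the "text:" prefix in OpenDocument files.
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    /** Parses a content.xml stream into a namespace-aware DOM. */
    public static Document parse(InputStream contentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(contentXml);
    }

    /** Returns the text of each text:p and text:h element, in document order. */
    public static List<String> paragraphs(Document dom) {
        List<String> result = new ArrayList<>();
        NodeList nodes = dom.getElementsByTagNameNS(TEXT_NS, "*");
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            String name = node.getLocalName();
            if ("p".equals(name) || "h".equals(name)) {
                result.add(node.getTextContent());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Tiny stand-in for a real content.xml.
        String xml = "<office:document-content"
            + " xmlns:office=\"urn:oasis:names:tc:opendocument:xmlns:office:1.0\""
            + " xmlns:text=\"" + TEXT_NS + "\">"
            + "<office:body><office:text>"
            + "<text:h>Title</text:h>"
            + "<text:p>First <text:span>paragraph</text:span>.</text:p>"
            + "<text:p>Second paragraph.</text:p>"
            + "</office:text></office:body></office:document-content>";
        Document dom = parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(String.join("\n", paragraphs(dom)));
    }
}
```

Joining the returned list with "\n" gives one line per paragraph or heading.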

Bastien

Jun 4, 2014, 5:04:27 AM
to 'Jeffrey Cummings' via Clojure
Hi Jeffrey,

"'Jeffrey Cummings' via Clojure" <clo...@googlegroups.com> writes:

> You may want to look at Docjure https://github.com/mjul/docjure
>
>
> It parses .xlsx files it may be able to parse .odt files.

Thanks, but I don't see anything in docjure about parsing .odt files.
Or am I missing something?

--
Bastien

Alex Ott

Jun 4, 2014, 6:23:32 AM
to clo...@googlegroups.com
Hi

Pantomime doesn't currently support text extraction, but you can use https://github.com/alexott/clj-tika (although it is outdated); it uses Apache Tika for text extraction.


--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clo...@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to
clojure+u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups "Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
With best wishes,                    Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
Skype: alex.ott

Bastien

Jun 4, 2014, 8:33:51 AM
to Alex Ott, clo...@googlegroups.com
Hi Alex,

Alex Ott <ale...@gmail.com> writes:

> Pantomime right now doesn't support the text extraction, but you can
> take the https://github.com/alexott/clj-tika (outdate although) - it
> uses the Apache Tika for text extraction

Thanks -- I stumbled upon clj-tika but didn't understand how to use
it. Would you have a minimal example? The README is pretty terse.

Thanks in advance,

--
Bastien

Alex Ott

Jun 4, 2014, 9:01:05 AM
to clo...@googlegroups.com
>lein try clj-tika "1.2.0"

user=> (use 'tika)
user=> (def res (parse "https://www.oasis-open.org/committees/download.php/25054/07-08-22-MetaData-Examples.odt"))
#'user/res

res is a map consisting of:
 - :text -> the extracted text
 - all other keys -> metadata from the document

Bastien

Jun 4, 2014, 9:32:59 AM
to Alex Ott, clo...@googlegroups.com
Hi Alex,

Alex Ott <ale...@gmail.com> writes:

> res - the map consisting of:
> - :text -> extracted text
> - all other fields - metadata from document

Works like a charm, thanks a bunch!

--
Bastien