Read Microsoft Word .doc files in Clojure

Joshua Mendoza

Jan 2, 2014, 2:49:30 AM
to clo...@googlegroups.com
Hi!

I've been looking for libraries or resources to read MS .doc files in Clojure, but found none. Has anyone tried, used, or come across a library that can read them?

I found a lot of information that the government publishes in .doc files, and I want to process it automatically with Clojure.

The closest thing I know of is Incanter, but it only reads XLS files, which doesn't help here.

Well, any help would be great.

Thank you!

Dennis Haupt

Jan 2, 2014, 11:08:13 AM
to clo...@googlegroups.com
Use Apache POI and write a small wrapper around it.
That is what I did.
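
A minimal sketch of what such a wrapper could look like, not the wrapper mentioned above. It assumes Apache POI's poi-scratchpad artifact (which contains the HWPF classes for the legacy binary .doc format) is on the classpath; the namespace sample.poi and the function name doc->text are made up for illustration.

```clojure
(ns sample.poi
  (:require [clojure.java.io :as io])
  (:import (org.apache.poi.hwpf.extractor WordExtractor)))

;; Hypothetical helper: the name and namespace are illustrative only.
(defn doc->text
  "Returns the plain text of the legacy .doc file at `path` as one string."
  [path]
  ;; with-open closes the extractor first, then the input stream.
  (with-open [in        (io/input-stream path)
              extractor (WordExtractor. in)]
    (.getText extractor)))
```

Calling (doc->text "report.doc") should return the document text as a single string, where "report.doc" is a placeholder path. WordExtractor only handles the old binary format, so .docx files would need a different POI class.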


2014/1/2 Joshua Mendoza <joshu...@gmail.com>
--
--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clo...@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to
clojure+u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups "Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure+u...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Frank Hale

Jan 2, 2014, 11:14:32 AM
to clo...@googlegroups.com
One solution is to use ClojureCLR and the OpenXML SDK.

Ron Toland

Jan 2, 2014, 6:33:11 PM
to clo...@googlegroups.com
If all you need is the text, you could use Apache Tika to extract it: http://tika.apache.org/

There's a simple clojure lib to get you started: https://github.com/alexott/clj-tika

I've used it to pull text out of .doc, .pdf, and .odt files.

Ron

Brendan Younger

Jan 3, 2014, 10:12:24 PM
to clo...@googlegroups.com
I've used the Java code in TextExtractor (http://stackoverflow.com/questions/10250617/java-apache-poi-can-i-get-clean-text-from-ms-word-doc-files) with good success in Clojure projects. Either throw it in as Java source or convert it to Clojure code. You'll probably want the tika-parsers jar instead of the tika-app jar, though.

Brendan

Divya Shravanthi

Dec 5, 2014, 5:32:54 AM
to clo...@googlegroups.com
Hi Ron,

Could you please share an example of how to pull simple text from pdf/doc files? I couldn't find a proper tutorial for clj-tika.

Thanks

Gary Verhaegen

Dec 5, 2014, 7:26:06 AM
to clo...@googlegroups.com
clj-tika seems to be abandoned (and is marked as deprecated). You will probably be better off using Tika directly through interop.
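
A rough sketch of the interop route, assuming tika-core plus a Tika parsers artifact are on the classpath (the exact artifact names depend on the Tika version). The namespace sample.tika-interop and the function name extract-text are made up for illustration; the Tika facade class and its parseToString and setMaxStringLength methods come from Tika itself.

```clojure
(ns sample.tika-interop
  (:require [clojure.java.io :as io])
  (:import (org.apache.tika Tika)))

;; Hypothetical helper: the name and namespace are illustrative only.
(defn extract-text
  "Returns the plain text of the file at `path` (.doc, .pdf, .odt, ...).
  Tika detects the file type on its own."
  [path]
  ;; The facade truncates output at 100,000 characters by default;
  ;; -1 lifts that limit so long documents come back whole.
  (let [tika (doto (Tika.) (.setMaxStringLength -1))]
    (.parseToString tika (io/file path))))
```

Something like (extract-text "report.doc") should then return the text as a string, with "report.doc" standing in for a real path. Formats that need the parsers artifact will only work if that jar is actually present.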

Ron Toland

Dec 5, 2014, 10:16:43 AM
to clo...@googlegroups.com
Divya,

Here's a simple example for extracting text from an input stream (which you can open on any file):

(ns sample.tika
  (:require [clj-tika.core :as tika]))

(defn extract-text
  "Extracts the text from the input stream"
  [input-stream]
  (tika/parse input-stream))

Ron
