Parsing old Microsoft Word format (.doc)


Konstantin Pečaļka

Dec 16, 2014, 3:44:43 AM
to opencorporat...@googlegroups.com
Hi all,

I'm looking for a way to extract text from .doc files (not .docx, which can be handled with XML libraries).

The website from this mission, http://missions.opencorporates.com/missions/641, contains quite simple .doc files: mostly plain text wrapped in .doc. I can't find a good Python library to parse .doc (I haven't tried searching for a Ruby one), but I tried the catdoc utility on files from that site -- it works and extracts all the text.

Is it possible to run external programs from scraper scripts on the Turbot server? If so, can the catdoc utility be installed?
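
For reference, this is roughly what calling catdoc from a Python scraper could look like. It is only a sketch: it assumes catdoc is on the PATH and that external processes are allowed, which is exactly the open question here. The function name `doc_to_text` and the `command` parameter are made up for illustration; `-d utf-8` is catdoc's option for the output charset.

```python
import shutil
import subprocess
from typing import Sequence


def doc_to_text(path: str, command: Sequence[str] = ("catdoc", "-d", "utf-8")) -> str:
    """Run an external converter on `path` and return what it prints to stdout.

    `command` is the converter plus its options; the file path is appended
    as the last argument. The default assumes catdoc is installed.
    """
    # Fail early with a clear message if the converter is not available.
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} is not installed or not on PATH")

    # check=True raises CalledProcessError if the converter exits non-zero.
    result = subprocess.run(
        [*command, path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    # Decode leniently so one bad byte does not abort the whole scrape.
    return result.stdout.decode("utf-8", errors="replace")
```

Keeping the converter in a parameter means another tool that prints text to stdout (antiword, for example) could be swapped in without touching the scraper logic.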

Chris Taggart

Dec 16, 2014, 11:47:56 AM
to opencorporat...@googlegroups.com
That looks like a good suggestion to add catdoc -- and it looks pretty lightweight. An alternative might be poi.

Because the bots run in Docker, we need to build the images in advance, with all the required libraries -- there's a balance between keeping them lightweight and keeping them functional.

Could you possibly do a before and after sample with catdoc? i.e. input file and output file?

Thanks

Chris

Chris Taggart

Dec 16, 2014, 12:41:40 PM
to opencorporat...@googlegroups.com
Possibly http://tika.apache.org/ is a better alternative. 

Not sure the current Debian build of catdoc is the best one to use (see http://www.wagner.pp.ru/~vitus/software/catdoc/)

Konstantin Mochalov

Dec 17, 2014, 2:47:59 AM
to Chris Taggart, opencorporat...@googlegroups.com
Yes, catdoc is quite quirky. I compiled it under Mac OS yesterday and it crashes.

I tried catdoc, antiword and Tika; here are the outputs for one of the .docs on minfin.gov.by:

Tika looks cool. It is somewhat heavyweight, but it is the most flexible: it can output XML and supports lots of formats (http://tika.apache.org/1.6/formats.html). Its binary is 30 MB, but it needs no quirky compilation and requires only Java. I think this software would be useful for scrapers.
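
A rough sketch of how a scraper might drive the Tika command-line jar, in case it helps. Assumptions: Java is installed, and the jar location is passed in by the caller (there is no agreed path yet). The helper names `tika_command` and `tika_extract` are invented for this example; `--text`, `--xml` and `--encoding` are options of the tika-app command line.

```python
import subprocess
from typing import List

# Output switches of the tika-app command line: plain text or XHTML.
TIKA_OUTPUT_FLAGS = {"text": "--text", "xml": "--xml"}


def tika_command(jar_path: str, doc_path: str, output: str = "text") -> List[str]:
    """Build the argument list for one tika-app run (hypothetical helper)."""
    if output not in TIKA_OUTPUT_FLAGS:
        raise ValueError(f"unsupported output format: {output!r}")
    # Ask for UTF-8 explicitly so the result does not depend on the locale.
    return ["java", "-jar", jar_path, "--encoding=UTF-8",
            TIKA_OUTPUT_FLAGS[output], doc_path]


def tika_extract(jar_path: str, doc_path: str, output: str = "text") -> str:
    """Run Tika on `doc_path` and return the extracted content as a string."""
    result = subprocess.run(
        tika_command(jar_path, doc_path, output),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if Java or Tika fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Starting a JVM per document is slow, so for a large batch of files something longer-lived would probably be preferable; this only shows the simplest case.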


Seb Bacon

Dec 17, 2014, 3:07:17 AM
to Konstantin Mochalov, Chris Taggart, opencorporat...@googlegroups.com
Thanks for doing this comparison Konstantin. It's one reason we've not
picked a particular solution yet for Turbot.

Tika feels like a good choice to me, for the reasons you outline. It's
used by various other projects, e.g. ElasticSearch, and is mature
software. We'll have a look at implementing that today.
-------------------------------------------------------
OpenCorporates :: The Open Database of the Corporate World
http://opencorporates.com
Blog: http://blog.opencorporates.com
Twitter: http://twitter.com/OpenCorporates

OpenCorporates is published by Chrinon Ltd, a company dedicated to
improving and publishing public data under an open licence that allows
and encourages reuse, including commercially. Registered in England,
number 07444723.



Willem de Groot

Dec 20, 2014, 11:44:43 AM
to opencorporat...@googlegroups.com, incredib...@gmail.com, chris....@opencorporates.com

Seb Bacon

Dec 22, 2014, 3:15:34 AM
to Willem de Groot, opencorporat...@googlegroups.com, Konstantin Pečaļka, Chris Taggart
On 20 December 2014 at 16:44, Willem de Groot <gwi...@gmail.com> wrote:
> Would piping these .docs through an online service be acceptable?
>
> https://translate.google.com/translate?hl=en&sl=ru&tl=en&u=http%3A%2F%2Fwww.minfin.gov.by%2Fupload%2Finsurance%2Flicens%2Fbagach.doc

As long as it is within the terms of service, and can be expected
to be reasonably reliable. I would expect the Google Drive API to
be the way to go with this -- I very much doubt it is within their ToS
to do it any other way programmatically.

This would have the convenience of a cloud service; on the other hand,
developers would have to manage their own API keys when in development
mode. Not a huge obstacle, so I do think this is worth considering.

Any other thoughts on this?

Thanks

Seb