Sean
Sean -- what do you want to do (import? how?)? If it's just indexing, I
have to believe it's pretty easy to do regardless with any text-based
format (which XML is)?
Also, it seems to me you ought to be asking the OOXML question somewhere
else. Is there some antiword-dev list?
Finally, Mac OS X has some sort of utility for working with (I think
mostly converting) different document formats. I forget the name, but it
might be worth looking into if you don't mind platform-specific differences?
Bruce
>
> Sean Takats wrote:
>
>> Did some testing tonight with various command line tools, and
>> antiword looks very, very good on a sampling of ancient and new Mac
>> and Windows Word docs. But does antiword (or any of these other
>> utilities) currently support or plan to support the new XML Word
>> format (.docx)? We obviously want to support older .doc formats, but
>> ideally we would only need to bundle a binary that supports the new
>> version, too.
>
> Sean -- what do you want to do (import? how?)? If it's just indexing,
> I have to believe it's pretty easy to do regardless with any
> text-based format (which XML is)?
>
Believing it's easy and having a working, open-source cross-platform
solution are obviously quite different.
> Also, it seems to me you ought to be asking the OOXML question
> somewhere else. Is there some antiword-dev list?
>
I've already contacted the antiword developer, but I also asked here
about other possibilities in an effort to be more transparent about
future directions and in order to mine the Zotero community's more
focused expertise on the question.
> Finally, Mac OS X has some sort of utility for working with (I think
> mostly converting) different document formats. I forget the name, but
> it might be worth looking into if you don't mind platform-specific
> differences?
>
I think we very much mind platform-specific differences and want to
avoid them to the extent possible.
> Bruce
>
> >
> Believing it's easy and having a working, open-source cross-platform
> solution are obviously quite different.
OK, but a solution for what? You still haven't given that critical bit.
>> Also, it seems to me you ought to be asking the OOXML question
>> somewhere else. Is there some antiword-dev list?
>>
> I've already contacted the antiword developer, but I also asked here
> about other possibilities in an effort to be more transparent about
> future directions and in order to mine the Zotero community's more
> focused expertise on the question.
Good :-)
Bruce
Just did some checking, and hacking text out of OOXML is not too
difficult. We need to read the zip file (which Mozilla can do
already) and pull out word/document.xml (and endnotes/footnotes/
headers/footers if we want them). Then all the text is in <w:t> tags, and
<w:p> tags indicate paragraphs. (I haven't read the spec, but this is
what I see just looking at the XML.) In the end, it's not any harder
than indexing HTML. It takes two regexps to convert to serviceable
plaintext, or we can do it the proper way with one of Mozilla's XML
parsers.
Simon
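
To make the two-regexp idea above concrete, here is a minimal sketch in
Python rather than Mozilla-side JavaScript. It assumes only what the
message describes (a zip containing word/document.xml, text in <w:t>,
paragraphs delimited by <w:p>); the function names are made up for
illustration, and footnotes, headers, and similar parts are ignored.

```python
import html
import re
import zipfile

# Regexp 1: paragraph boundaries (each </w:p> ends a paragraph).
PARAGRAPH_END = re.compile(r"</w:p>")
# Regexp 2: the contents of <w:t> runs. The optional attribute group lets
# <w:t xml:space="preserve"> match, while <w:tab/>, <w:tbl>, and
# self-closing <w:t/> do not.
TEXT_RUN = re.compile(r"<w:t(?:\s[^>]*)?(?<!/)>(.*?)</w:t>", re.DOTALL)


def document_xml_to_text(xml):
    """Convert the markup of word/document.xml to rough plaintext."""
    paragraphs = []
    for chunk in PARAGRAPH_END.split(xml):
        text = "".join(TEXT_RUN.findall(chunk))
        if text:
            # Undo XML escaping such as &amp; and &lt;.
            paragraphs.append(html.unescape(text))
    return "\n".join(paragraphs)


def docx_to_text(path):
    """Read word/document.xml out of a .docx (a zip file) and flatten it."""
    with zipfile.ZipFile(path) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    return document_xml_to_text(xml)
```

This is good enough for feeding an indexer, but it is still regexps over
XML: it relies on the conventional "w:" prefix and would miss text if a
file declared the namespace under a different prefix.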
Yeah, that's what I meant by "easy." You could even write a trivial
little XSLT that would just extract all the text.
BTW, whatever you come up with, it would be nice if it worked with ODF too ;-)
Bruce
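
For the parser-based route, and with ODF in mind, a rough sketch in
Python (not XSLT, and not what Zotero actually ships) might look like the
following. The function names are hypothetical. It assumes the usual
package layouts: word/document.xml with w:p and w:t elements for OOXML,
and content.xml with text:p and text:h elements for ODF.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace URIs in ElementTree's "{uri}localname" notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
T = "{urn:oasis:names:tc:opendocument:xmlns:text:1.0}"


def ooxml_paragraphs(root):
    """Yield the text of each w:p paragraph by joining its w:t runs."""
    for para in root.iter(W + "p"):
        yield "".join(run.text or "" for run in para.iter(W + "t"))


def odf_paragraphs(root):
    """Yield the text of each text:h heading and text:p paragraph."""
    for elem in root.iter():
        if elem.tag in (T + "h", T + "p"):
            # itertext() also collects text inside inline children
            # such as text:span.
            yield "".join(elem.itertext())


def extract_text(path):
    """Return rough plaintext from a .docx or .odt file."""
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
        if "word/document.xml" in names:
            root = ET.fromstring(archive.read("word/document.xml"))
            paragraphs = ooxml_paragraphs(root)
        elif "content.xml" in names:
            root = ET.fromstring(archive.read("content.xml"))
            paragraphs = odf_paragraphs(root)
        else:
            raise ValueError("not a recognized OOXML or ODF text document")
    return "\n".join(p for p in paragraphs if p)
```

Because it matches on namespace URIs rather than prefixes, this does not
care how a given file names its prefixes. Paragraphs nested inside other
paragraphs (text boxes, for example) would be emitted twice, which is
probably harmless for indexing.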
> For indexing, Bruce.
If you're just using it for indexing, is there not a dedicated indexer
you could use? For example, something like Lucene?
See also:
<http://ferret.davebalmain.com/trac/>
I realize, of course, you're probably bound by language issues.
Bruce
Yes, we're bound by language issues.
It seems antiword can be made to work. I think I wrote it off originally
because 1) I had trouble getting the encoding mappings working without a
full installation and 2) it doesn't write to a file natively (and we
can't do it through the shell (e.g. "antiword foo.doc > foo.txt") via
Mozilla). But we're already modifying the pdfinfo source to have it
write to a file (to get page number info--pdftotext already writes to a
file natively), so we can probably just make similar changes to antiword
to 1) hard-code the UTF-8 encoding (or at least use a file in the local
directory) and 2) write to a text file, and then distribute custom
binaries as we're doing for the Xpdf utilities. The dev branch already
has a UI and infrastructure for downloading and installing custom
binaries from zotero.org.
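
For comparison, this is roughly what the shell redirect does, written as
a small Python wrapper. It is only an illustration of capturing a tool's
stdout into a file without a shell, not the plan described above (which
is to patch antiword itself), and run_to_file is a made-up name.

```python
import subprocess


def run_to_file(command, out_path):
    """Run command (a list of arguments, no shell) and save stdout to out_path.

    Returns the process exit code so the caller can detect failures.
    """
    with open(out_path, "wb") as out:
        result = subprocess.run(command, stdout=out, stderr=subprocess.PIPE)
    return result.returncode


# Hypothetical usage, assuming antiword and its UTF-8.txt mapping file
# are installed where antiword can find them:
#   run_to_file(["antiword", "-m", "UTF-8.txt", "foo.doc"], "foo.txt")
```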