Re: [RavenDB] Indexing MS Office documents


Oren Eini (Ayende Rahien)

Dec 14, 2012, 3:04:54 PM
to rav...@googlegroups.com
You would need to extract the textual data from the office docs and save that to RavenDB, then it would be searchable.


On Fri, Dec 14, 2012 at 6:13 PM, Sebastian Good <seba...@palladiumconsulting.com> wrote:
I understand that some modules of Lucene allow the indexing of MS Office documents, e.g. Microsoft Word, PowerPoint, and Excel. Is it possible to use this functionality from within RavenDB? I imagine it would simply be in the vein of "full text" searching, whereby words from those documents were exposed in a simple index.
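For illustration, the extraction step suggested in the reply above might look roughly like the following. This is only a sketch under stated assumptions: it uses Python's standard library rather than anything from RavenDB or Lucene, it handles only the newer .docx format (a ZIP archive whose body lives in word/document.xml), and the helper names and the "record" shape are hypothetical. The older binary .doc, .xls, and .ppt formats would need a dedicated library.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main body of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(data: bytes) -> str:
    """Return the paragraph text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


def make_sample_docx(paragraphs) -> bytes:
    """Build a bare-bones .docx-like archive, for demonstration only.

    The paragraph strings are not XML-escaped, so keep them plain.
    """
    body = "".join(f"<w:p><w:r><w:t>{p}</w:t></w:r></w:p>" for p in paragraphs)
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main">'
        f"<w:body>{body}</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


# Hypothetical document shape: the extracted text is kept as an ordinary
# string field, which is what a full-text index would then be defined over.
sample = make_sample_docx(["Quarterly report", "Revenue grew."])
record = {"FileName": "report.docx", "Text": extract_docx_text(sample)}
```

The point of the sketch is only that the database ends up storing plain text next to the file name; how the text is obtained is a separate concern from how it is indexed.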


clayton collie

Dec 14, 2012, 3:07:34 PM
to rav...@googlegroups.com
You may want to check out http://tika.apache.org/. I think someone may have had a .NET port working at some point.

Matt Johnson

Jan 7, 2013, 5:27:44 PM
to ravendb
A recent SO question got me thinking lately about Raven backing a CMS
with ability to search over attachment documents in various formats,
and I thought I would revive this thread from last month.

I wonder how hard it would be to create a bundle for raven that kicks
in when you store an attachment and auto-generates the indexable text?


After some research, I ran across this: http://kevm.github.com/tikaondotnet/
I don't like that it's Java based, but the IKVM wrapper seems to be
fairly good for this particular library. Any opinions of how this
might be on performance? IKVM would basically be running inside
raven. Not sure if I like that idea or not...

I suppose it could also be built using a combination of iTextSharp for
PDF extraction and NPOI (http://npoi.codeplex.com/) for MS Office
stuff. That's all C# at least, but Tika is much more comprehensive in
the formats it supports: http://tika.apache.org/1.2/formats.html

Thoughts?


On Dec 14 2012, 1:07 pm, clayton collie <gbo...@gmail.com> wrote:
> You may want to check out http://tika.apache.org/. I think someone may

Chris Marisic

Jan 7, 2013, 6:34:09 PM
to rav...@googlegroups.com
IKVM usually brings the kiss of death with regard to performance. That said, assuming the recent additions to Raven let it better handle slow indexes while other indexes continue functioning properly, the slowness might not really matter. I'm not sure that's directly applicable, though, since I'm not sure where this work would sit in between, or whether poor performance due to IKVM could drag down the Raven server if it runs inside a bundle.

Matt Johnson

Jan 7, 2013, 7:07:21 PM
to ravendb
I'm doing some preliminary testing, and it looks reasonably solid.
I'm using an AbstractAttachmentPutTrigger and doing the work in
AfterCommit so that any slowness won't interfere with the transaction.

If all looks good, expect a Tika bundle in the Raven.Contrib
project. :)
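The idea described above, committing the attachment first and doing the slow extraction afterwards, can be sketched generically as follows. This is not the RavenDB trigger API: AttachmentStore, put, and wait_for_extraction are hypothetical names, the store is an in-memory stand-in, and the extractor is whatever function you plug in. It only illustrates why a slow or failing extractor need not hold up the write itself.

```python
import queue
import threading


class AttachmentStore:
    """Hypothetical in-memory attachment store with deferred text extraction."""

    def __init__(self, extract):
        self._extract = extract          # callable: bytes -> str
        self.attachments = {}            # key -> raw bytes
        self.extracted_text = {}         # key -> indexable text
        self._pending = queue.Queue()
        # One background worker drains the queue outside the write path.
        threading.Thread(target=self._run, daemon=True).start()

    def put(self, key, data):
        # The "commit": the attachment is stored immediately.
        self.attachments[key] = data
        # The "after commit" step only enqueues work, so put() stays fast.
        self._pending.put(key)

    def _run(self):
        while True:
            key = self._pending.get()
            try:
                self.extracted_text[key] = self._extract(self.attachments[key])
            except Exception:
                # A failed extraction leaves the attachment intact but unindexed.
                self.extracted_text[key] = ""
            finally:
                self._pending.task_done()

    def wait_for_extraction(self):
        """Block until all queued extractions have finished."""
        self._pending.join()


store = AttachmentStore(lambda data: data.decode("utf-8").upper())
store.put("notes.txt", b"hello raven")
store.wait_for_extraction()
```

A real bundle would also need to decide what happens to queued work if the server restarts before extraction completes; this sketch ignores that.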

Matt Johnson

Jan 8, 2013, 12:01:00 AM
to rav...@googlegroups.com
What do you think about an IFilter implementation instead?

http://stackoverflow.com/questions/4905271/indexing-pdf-xls-doc-ppt-using-lucene-net


Chris Marisic

Jan 8, 2013, 8:51:00 AM
to rav...@googlegroups.com
If IFilters are how Microsoft searches inside Office documents, that seems like the winner. Now whether it makes you want to jump off a bridge, that's the real question.