A recent SO question got me thinking about using Raven to back a CMS,
with the ability to search over attachments in various formats,
so I thought I would revive this thread from last month.
I wonder how hard it would be to create a bundle for Raven that kicks
in when you store an attachment and auto-generates the indexable text?
After some research, I ran across this:
http://kevm.github.com/tikaondotnet/
I don't like that it's Java based, but the IKVM wrapper seems to be
fairly good for this particular library. Any opinions on how this
might perform? IKVM would basically be running inside
Raven. Not sure if I like that idea or not...
I suppose it could also be built using a combination of iTextSharp for
PDF extraction and NPOI (http://npoi.codeplex.com/) for MS Office
stuff. That's all C# at least, but Tika is much more comprehensive in
the supported formats:
http://tika.apache.org/1.2/formats.html
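To make the idea a bit more concrete, here is a rough sketch of the shape I have in mind, written in Python only to keep it short. Everything in it is hypothetical: `on_attachment_put`, the `EXTRACTORS` registry and the `IndexableText` metadata key are names I made up for illustration, not Raven's actual bundle API. In a real bundle each extractor would wrap iTextSharp, NPOI or Tika; only plain text is handled here.

```python
import os
from typing import Callable, Dict

# Hypothetical registry: file extension -> function that turns raw bytes into text.
# Real extractors (PDF, Office formats) would be registered the same way.
Extractor = Callable[[bytes], str]


def extract_plain_text(data: bytes) -> str:
    """Decode plain-text attachments, replacing undecodable bytes."""
    return data.decode("utf-8", errors="replace")


EXTRACTORS: Dict[str, Extractor] = {
    ".txt": extract_plain_text,
}


def on_attachment_put(key: str, data: bytes, metadata: dict) -> dict:
    """Hypothetical hook that runs when an attachment is stored.

    Returns a copy of the metadata, with an 'IndexableText' entry added
    when an extractor is registered for the attachment's extension.
    """
    ext = os.path.splitext(key)[1].lower()
    extractor = EXTRACTORS.get(ext)
    result = dict(metadata)
    if extractor is not None:
        result["IndexableText"] = extractor(data)
    return result


if __name__ == "__main__":
    print(on_attachment_put("notes/readme.TXT", b"hello raven", {}))
```

Unknown formats simply pass through without indexable text, so new extractors could be added one at a time.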
Thoughts?
On Dec 14 2012, 1:07 pm, clayton collie <gbo...@gmail.com> wrote:
> You may want to check out http://tika.apache.org/. I think someone may