As I understand it Oracle Text in 10gR2 is not able to text index .docx
files generated by MS Office 2007. As the use of this format is only
going to increase (and we have to allow for this type of file) have any
of you come across this problem and did you devise a workaround for it?
Our application accepts CVs from candidates each of which has to be
indexed.
We are running on RHEL4 / 10gR2.
cheers
--
jeremy
Save as xml?
If this can be done within Oracle then yes - how do we do it? If not
then it's a bit impractical - candidates will attach their CV document
and won't want to faff around with saving in different versions.
I understand that docx is actually "zipped" xml - so also wondering if
there is anything in Oracle 10g that might support "unzipping" a BLOB?
--
jeremy
Why not design your app so that whatever format the input "is
submitted in" you convert it into a standard one before trying to add
it ( and index it ) inside oracle?
>
> I understand that docx is actually "zipped" xml - so also wondering if
> there is anything in Oracle 10g that might support "unzipping" a BLOB?
>
Have you looked at the documentation?
It seems a bit odd to want to do that as Oracle will index so many
document types - we make more work for ourselves I guess and for what
actual benefit?
> Have you looked at the documentation?
Not the entire product no - is unzipping a BLOB something that is
available then?
--
jeremy
Is that so? Then this really sucks. Is this because Mickeysoft is
protecting that document format?
No idea on the reasons - discovered that this was the case with the
current app testing on 10gR2. The format has been out for some time; I am
surprised Oracle hasn't been quicker to respond. A check in the Oracle
technet (?) forum turned up a response from someone saying they were
aware of the problem in Oracle and hoped to have a solution, backported
to earlier releases (so I assume 10g), but that no dates had been
announced.
--
jeremy
Isn't that a likely side effect when the vendor buys in the text search
filters and releases a product before the format in question
becomes current? Or to put it another way, I expect it will get fixed
in due course. Meanwhile I'm sure that you already have certain
requirements - for example the language the CV is in, it not being in
WordStar and all the rest. Temporarily requiring Office 2007 users to
save in a compatible file format shouldn't be a great barrier to
anyone who wants to apply for a job.
Niall Litchfield
http://www.orawin.info/
In the "war for talent" which is very real in some sectors/geographies,
there is a very strong business case for removing as many barriers as
possible to make applying straightforward - the candidate should not
have to jump through hoops.
Right now the priority for us is devising a way in which we can continue
to accept that format of document and find a way to extract the text.
--
jeremy
You might be able to convert the docx files to doc files (damn Mickeysoft,
I've run into this too, where someone sends me a docx or spreadsheet and I
can't read it because I didn't buy the most recent version of MS Office).
Then import the doc files. If you have one Windows box with MS Office that
can read docx then you could write a little program that would take the
file and convert it. Then import that file. You could automate the
process: a program detects that a file is in the dir, converts the file
from docx to doc, imports it into the Oracle db and moves the file to a
finished directory, then gets the next file, and so on.
I have done some VBA programming and I am sure I could help out if need be.
I can be contacted at jim dot scuba dot kennedy at gmail dot com.
Jim
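The loop Jim describes could be sketched roughly as below. This is a
minimal sketch, not anything posted in the thread: the function name
process_incoming and the convert hook are hypothetical, and the actual
docx-to-doc conversion (for example Word automation) and the load into
Oracle are assumed to happen elsewhere.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def process_incoming(incoming: Path, finished: Path,
                     convert: Callable[[Path], Path]) -> List[Path]:
    """Convert every .docx in `incoming`, then move the original to `finished`.

    `convert` is a hypothetical hook: it receives the .docx path and returns
    the path of the converted file. Importing that file into the database
    is left to the caller.
    """
    finished.mkdir(parents=True, exist_ok=True)
    converted = []
    # Sorted so files are handled in a predictable order.
    for source in sorted(incoming.glob("*.docx")):
        converted.append(convert(source))
        # Moving the original out of the way marks it as done.
        shutil.move(str(source), str(finished / source.name))
    return converted
```

Run from a scheduler or a simple polling loop, this picks up whatever has
arrived since the last pass and leaves non-docx files alone.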
So, if I was in your shoes, I'd either use automation to convert
the .docx to PDF, then insert the PDF, or use automation to crack open
the .docx file (actually, a zipped set of XML files) and insert/index
the document.xml file. The first path is easy and not complicated, but
you'll need Word 2007 to perform its magic. The second path is a lot
more risky, but it would be interesting to see how difficult it is to
handle. In terms of being able to index/search for keywords, it might
not be that bad at all.
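For the second path, here is a rough sketch of what "cracking open" a
.docx might look like outside the database. It assumes the main document
part is stored as word/document.xml (as Word 2007 writes it) and that body
text sits in w:t elements inside w:p paragraphs; the function name
docx_to_text is made up for illustration. Headers, footnotes and anything
stored in other parts of the archive are ignored.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by Office 2007 documents.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(data: bytes) -> str:
    """Return the plain text of an in-memory .docx, one paragraph per line."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs; join the w:t pieces.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

The resulting plain text could then be stored in a CLOB and indexed like
any other text column, which is all a keyword search needs.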
I'm not entirely sure that I buy that in general - how many formats/
languages etc etc do you support? Especially given that anyone who can
save in .docx format can already save in .doc and probably .pdf format
directly.
> Right now the priority for us is devising a way in which we can continue
> to accept that format of document and find a way to extract the text.
I constructed a test in 11.1.0.6 (which was released post Office 2007)
and the same restriction applies. I suspect that I might go down the
route of developing an automated document converter - per Jim's
suggestion - and running that using the external procedure call
functionality. Bear in mind though that there can be some loss of
fidelity in the converted document (though you should get all the text
fine).
Niall
Oracle Text currently uses Verity KeyView document filters which they
licensed ca. 9.2.0.4 instead of (now discontinued) Inso Corp.'s.
Verity probably already added support for 2007 formats but integrating
this support into all supported Oracle releases and patchsets can
definitely take a while even if they are used unchanged. For some
reason, Oracle never created one-offs for Text filtering components
and always delivered new versions in patchsets.
If you can afford a Windows-based Oracle instance with COM Automation
option, you can set up Office 2007 there and use COM Automation to
invoke Office apps from the Oracle instance. You can then create a db
link to that Windows instance, create a function that will take a BLOB
as input and return a BLOB with converted document as output, and call
this function via the db link from other instances. The function would
write the source document into a file, use COM Automation to
instantiate an Office application and load the document, save it in
desired format, and then read it back into the resulting BLOB.
Shouldn't be too complex to implement.
Regards,
Vladimir M. Zakharychev
N-Networks, makers of Dynamic PSP(tm)
http://www.dynamicpsp.com
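The BLOB-in, BLOB-out shape of that function can be illustrated outside
Oracle with a short Python sketch. It is only an illustration of the
write-file, convert, read-back round trip: convert_blob and the
convert_file hook are hypothetical names, and the step that would drive
Office through COM Automation is deliberately left as a callable supplied
by the caller.

```python
import tempfile
from pathlib import Path
from typing import Callable


def convert_blob(data: bytes,
                 convert_file: Callable[[Path, Path], None],
                 src_suffix: str = ".docx",
                 dst_suffix: str = ".doc") -> bytes:
    """Write `data` to a temp file, convert it, and return the result bytes.

    `convert_file(src, dst)` is a hypothetical hook that must create `dst`
    from `src`, e.g. by asking an Office application to open and re-save it.
    """
    with tempfile.TemporaryDirectory() as workdir:
        src = Path(workdir) / ("input" + src_suffix)
        dst = Path(workdir) / ("output" + dst_suffix)
        src.write_bytes(data)
        convert_file(src, dst)
        # Read the result before the temporary directory is cleaned up.
        return dst.read_bytes()
```

Keeping the conversion behind a single hook means the same wrapper works
whichever converter ends up being used.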
Ah - the beauty of prohibiting cutting edge technology!
Just curious - why do you think technology over 5 years old
can handle this year's proprietary formats?
> As the use of this format is only
> going to increase (and we have to allow for this type of file) have any
> of you come across this problem and did you devise a workaround for it?
>
Really? Why will its use increase?
> Our application accepts CVs from candidates each of which has to be
> indexed.
>
> We are running on RHEL4 / 10gR2.
>
Update to Microsoft Windows 2008 server with SQL Server 2008.
--
Regards,
Frank van Bortel
Top-posting in UseNet newsgroups is one way to shut me up
There was a switch between 10G Rel1 and Rel2.
>[snip!]
> So, if I was in your shoes, I'd either use automation to convert
> the .docx to PDF, then insert the PDF, or use automation to crack open
> the .docx file (actually, a zipped set of XML files) and insert/index
> the document.xml file. The first path is easy and not complicated, but
> you'll need Word 2007 to perform its magic. The second path is a lot
> more risky, but it would be interesting to see how difficult it is to
> handle. In terms of being able to index/search for keywords, it might
> not be that bad at all.
You do realize PDFs are indexed without problem?
I fail to see the benefit of "cracking open" the xml once you
have the PDF.
I don't know anything about this stuff (except for the com stuff I
worked on in some proprietary language that will be removed from the
language, so I have to rewrite it all eventually anyways), but I found
this illuminating: http://www.joelonsoftware.com/items/2008/02/19.html
Seems to support Vladimir's and Jim's suggestions.
I know I'm not impressed by any place that requires me to send a Word
resume.
jg
--
@home.com is bogus.
And today's rant is pretty interesting too: http://www.joelonsoftware.com/items/2008/03/17.html
As I said, EITHER pdf OR cracking open the xml. I think I even laid
out some reasonable rationale why you might choose one over the other.
As opposed to, say, telling someone to start over with a new database.
>
> As I said, EITHER pdf OR cracking open the xml. I think I even laid
> out some reasonable rationale why you might choose one over the other.
> As opposed to, say, telling someone to start over with a new database.
You did indeed, and I missed it. Apologies.
Through software updates of course - I am not expecting Oracle to
predict what Microsoft does.
> > As the use of this format is only
> > going to increase (and we have to allow for this type of file) have any
> > of you come across this problem and did you devise a workaround for it?
>
> Really? Why will its use increase?
>
Its use will increase because increasingly MS Office users will be
using Office 2007, in the same way that Windows 2000 users become XP
users become Vista users. When you save a document in Word 2007 and
it defaults to .docx, most users aren't even going to be aware of the
file extension, let alone its significance.
> > Our application accepts CVs from candidates each of which has to be
> > indexed.
>
> > We are running on RHEL4 / 10gR2.
>
> Update to Microsoft Windows 2008 server with SQL Server 2008.
>
Ah I knew there'd be a simple solution.
--
jeremy
That statement is entirely yours - the market is not at all happy,
either with Microsoft Office incompatibility issues or with
Microsoft Vista.