Issue with Tika & Javaloader

Sean Coyne

Jan 2, 2012, 1:40:07 PM
to javaloa...@googlegroups.com
Hi,

I'm trying to extract text from various files using Tika and I am having an issue loading the jar with Javaloader.  If the Tika jar is put in the CF classpath, and loaded using createObject, it works perfectly.  When loading the same jar using Javaloader I get an error when trying to instantiate a class.  I have tested this on CF9, both standard and multiserver installs.

I have created a test case to demo this issue which is available here: http://dl.dropbox.com/u/1647505/tika.zip  The zip file includes Javaloader and the Tika jar and can be unzipped under any webroot and executed as is.

Here is the code I am trying to run.  It should simply extract the text from the PDF and dump it to the screen, however it does not make it past the line "tika.init()".  If I take the identical code, put the jar on the classpath and change tika = javaloader.create("org.apache.tika.Tika"); to tika = createObject("java","org.apache.tika.Tika"); then it works.

<cfscript>
// create instance of javaloader
javaloader = createObject("component","javaloader.JavaLoader").init([ 
getDirectoryFromPath(getCurrentTemplatePath()) & "tika-app-1.0.jar" 
]);

// grab a handle on the file
f = createObject("java","java.io.File").init( getDirectoryFromPath(getCurrentTemplatePath()) & "example.pdf" );

// create instance of tika
tika = javaloader.create("org.apache.tika.Tika");
tika = tika.init();

// parse it
content = tika.parseToString(f);

// output results
writeDump(var = content, label = "Content");
</cfscript>

And here is the error and stack trace.

Object instantiation exception.
An exception occurred while instantiating a Java object. The class must not be an interface or an abstract class. Error: ''.
The error occurred in /Users/sean/Sites/_scribble/tika/index.cfm: line 13
11 : // create instance of tika
12 : tika = javaloader.create("org.apache.tika.Tika");
13 : tika = tika.init();
14 :
15 : // parse it

java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at coldfusion.runtime.java.JavaProxy.CreateObject(JavaProxy.java:166)
at coldfusion.runtime.java.JavaProxy.invoke(JavaProxy.java:80)
at coldfusion.runtime.CfJspPage._invoke(CfJspPage.java:2360)
at cfindex2ecfm848441849.runPage(/Users/sean/Sites/_scribble/tika/index.cfm:13)
at coldfusion.runtime.CfJspPage.invoke(CfJspPage.java:231)
at coldfusion.tagext.lang.IncludeTag.doStartTag(IncludeTag.java:416)
at coldfusion.filter.CfincludeFilter.invoke(CfincludeFilter.java:65)
at coldfusion.filter.ApplicationFilter.invoke(ApplicationFilter.java:381)
at coldfusion.filter.RequestMonitorFilter.invoke(RequestMonitorFilter.java:48)
at coldfusion.filter.MonitoringFilter.invoke(MonitoringFilter.java:40)
at coldfusion.filter.PathFilter.invoke(PathFilter.java:94)
at coldfusion.filter.LicenseFilter.invoke(LicenseFilter.java:27)
at coldfusion.filter.ExceptionFilter.invoke(ExceptionFilter.java:70)
at coldfusion.filter.ClientScopePersistenceFilter.invoke(ClientScopePersistenceFilter.java:28)
at coldfusion.filter.BrowserFilter.invoke(BrowserFilter.java:38)
at coldfusion.filter.NoCacheFilter.invoke(NoCacheFilter.java:46)
at coldfusion.filter.GlobalsFilter.invoke(GlobalsFilter.java:38)
at coldfusion.filter.DatasourceFilter.invoke(DatasourceFilter.java:22)
at coldfusion.filter.CachingFilter.invoke(CachingFilter.java:62)
at coldfusion.CfmServlet.service(CfmServlet.java:200)
at coldfusion.bootstrap.BootstrapServlet.service(BootstrapServlet.java:89)
at jrun.servlet.FilterChain.doFilter(FilterChain.java:86)
at coldfusion.monitor.event.MonitoringServletFilter.doFilter(MonitoringServletFilter.java:42)
at coldfusion.bootstrap.BootstrapFilter.doFilter(BootstrapFilter.java:46)
at jrun.servlet.FilterChain.doFilter(FilterChain.java:94)
at jrun.servlet.FilterChain.service(FilterChain.java:101)
at jrun.servlet.ServletInvoker.invoke(ServletInvoker.java:106)
at jrun.servlet.JRunInvokerChain.invokeNext(JRunInvokerChain.java:42)
at jrun.servlet.JRunRequestDispatcher.invoke(JRunRequestDispatcher.java:286)
at jrun.servlet.ServletEngineService.dispatch(ServletEngineService.java:543)
at jrun.servlet.jrpp.JRunProxyService.invokeRunnable(JRunProxyService.java:203)
at jrunx.scheduler.ThreadPool$ThreadThrottle.invokeRunnable(ThreadPool.java:428)
at jrunx.scheduler.WorkerThread.run(WorkerThread.java:66)
Caused by: java.lang.NullPointerException
at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:134)
at org.apache.tika.mime.MimeTypes.getDefaultMimeTypes(MimeTypes.java:455)
at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:145)
at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:237)
at org.apache.tika.Tika.<init>(Tika.java:93)
... 37 more

Sean Coyne

Jan 2, 2012, 3:21:37 PM
to javaloa...@googlegroups.com
In case it's relevant, I'll also note that CF 9 ships with Tika 0.6 on the classpath (the core library and the parsers, not the entire project), and we are trying to use 1.0.  The 0.6 version included with CF could not perform the extraction we needed.

Andrew Myers

Jan 2, 2012, 6:55:33 PM
to javaloa...@googlegroups.com
Hi Sean,

Interesting one. I've had a bit of a play with it but I haven't been
able to figure it out yet.

It looks to me that when it tries to determine the mime type of the
document you're parsing, it is unable to load them from the
tika-mimetypes.xml file that is bundled with the tika jar.

Mark - the Java code in Tika that reads this file is at line 134 here:

https://github.com/apache/tika/blob/trunk/tika-core/src/main/java/org/apache/tika/mime/MimeTypesFactory.java

It looks to be using the ClassLoader to reference it. Can you see
anything here that might cause it to bork when using JavaLoader?

Regards,
Andrew.

Mark Mandel

Jan 2, 2012, 8:52:25 PM
to javaloa...@googlegroups.com
So this is a bit of a doozy.

I'm seeing that:
javaloader.create("org.apache.tika.mime.MimeTypesReader").getClass().getPackage();

returns null. 

Which is super weird.

Digging in deeper.

Mark

Mark Mandel

Jan 2, 2012, 9:07:49 PM
to javaloa...@googlegroups.com
I think this may be a bug in the underlying ClassLoader - not sure if it implements getPackage() properly. Am heading out (on holidays) for a bit, and will then dig a bit deeper.

Mark
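
For readers following along: on JVMs of that era, Class.getPackage() returns null when the defining class loader never registered a Package for the class, so any chained call such as getPackage().getName() throws a NullPointerException. The sketch below assumes that this is what the NullPointerException in MimeTypesFactory.create comes from (the thread does not quote the Tika source), and shows one defensive way to derive a package resource prefix from the class name instead. The class and method names (PackagePaths, resourcePrefix) are hypothetical and are not part of Tika or JavaLoader.

```java
public class PackagePaths {

    /**
     * Builds a resource prefix such as "java/util/" from the class name,
     * so it never dereferences getPackage(). Intended for ordinary
     * (non-array, non-primitive) classes.
     */
    public static String resourcePrefix(Class<?> type) {
        String name = type.getName();
        int lastDot = name.lastIndexOf('.');
        if (lastDot < 0) {
            return ""; // class lives in the default (unnamed) package
        }
        return name.substring(0, lastDot).replace('.', '/') + "/";
    }

    public static void main(String[] args) {
        // getPackage() may be null on older JVMs if the loader never defined the package.
        Package pkg = java.util.List.class.getPackage();
        System.out.println("getPackage():     " + (pkg == null ? "null" : pkg.getName()));
        System.out.println("resourcePrefix(): " + resourcePrefix(java.util.List.class));
    }
}
```

This only illustrates the failure mode on the calling side; the fix Mark describes later in the thread was made in JavaLoader itself.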

Mark Mandel

Jan 3, 2012, 1:27:53 AM
to javaloa...@googlegroups.com
There we go - got it working.

Neat, hadn't managed to see one like this before.


Grab the develop branch. You should find that that works for you. The POC you posted worked for me once I ran it against this new code.

Let me know how you go.

Mark

Sean Coyne

Jan 3, 2012, 8:03:07 AM
to javaloa...@googlegroups.com
Awesome Mark! I'll give this a try today, thanks!

Sean Coyne

Jan 3, 2012, 8:39:47 AM
to javaloa...@googlegroups.com
Thanks again Mark! This worked great.

Jeff Coughlin

Jan 3, 2012, 7:50:12 AM
to javaloa...@googlegroups.com
Sweet.  Thanks Mark!

Sean and I are working on this project together.  I won't actually be able to test this for another couple hours, but we are very grateful (crossing fingers it works).  I'll buy you a beer when I see you at cf.objective :)

--
Jeff Coughlin

Sean Coyne

Jan 3, 2012, 10:35:13 AM
to javaloa...@googlegroups.com
OK, so that did work great, until we tried to parse docx and xlsx files.  PDFs, DOC, even JPG and MOV files were parsed fine, but the Open XML files throw an error.  When I try parsing these files using the same Tika JAR file on the command line, it parses correctly, so it doesn't seem to be a Tika issue, but when parsing via CF, it throws a "Could not initialize class org.apache.poi.openxml4j.opc.internal.unmarshallers.PackagePropertiesUnmarshaller" error.  Sometimes it even crashes JRUN to the point that it needs to be restarted.  It does this for any docx file, but I have attached an example.  Just change the example code from before to load the docx file:

f = createObject("java","java.io.File").init( getDirectoryFromPath(getCurrentTemplatePath()) & "example.docx" );

Here is the stack trace:

01/03 10:27:40 error ROOT CAUSE: 
java.lang.ExceptionInInitializerError
at org.apache.poi.openxml4j.opc.internal.unmarshallers.PackagePropertiesUnmarshaller.<clinit>(PackagePropertiesUnmarshaller.java:49)
at org.apache.poi.openxml4j.opc.OPCPackage.init(OPCPackage.java:153)
at org.apache.poi.openxml4j.opc.OPCPackage.<init>(OPCPackage.java:140)
at org.apache.poi.openxml4j.opc.Package.<init>(Package.java:54)
at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:99)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:206)
at org.apache.tika.parser.pkg.ZipContainerDetector.detectOfficeOpenXML(ZipContainerDetector.java:121)
at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:73)
at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:60)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:113)
at org.apache.tika.Tika.parseToString(Tika.java:380)
at org.apache.tika.Tika.parseToString(Tika.java:451)
at org.apache.tika.Tika.parseToString(Tika.java:431)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at coldfusion.runtime.StructBean.invoke(StructBean.java:508)
at coldfusion.runtime.CfJspPage._invoke(CfJspPage.java:2393)
at cfindex2ecfm848441849.runPage(/Users/sean/Sites/_scribble/tika/index.cfm:16)
Caused by: java.lang.ClassCastException: org.dom4j.DocumentFactory cannot be cast to org.dom4j.DocumentFactory
at org.dom4j.DocumentFactory.getInstance(DocumentFactory.java:97)
at org.dom4j.tree.AbstractNode.<clinit>(AbstractNode.java:39)
... 49 more
example.docx

Mark Mandel

Jan 3, 2012, 3:49:27 PM
to javaloa...@googlegroups.com

Ah good old dom4j.

Have a search through the mailing list - we've covered this one before.

You will need to switch out the thread context classloader, and all should be well.

Mark

Sent from my mobile doohickey.
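
Some background on the advice above: an error of the form "X cannot be cast to X" usually means the same class name was loaded by two different class loaders, and libraries that look classes up through the thread context class loader can end up with the server's copy rather than the one from the loaded jar. Switching the context class loader for the duration of a call, and always restoring it afterwards, is the general technique. The Java sketch below shows the pattern only; it is not JavaLoader's implementation, and the names ContextClassLoaderSwitch and callWith are hypothetical.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.function.Supplier;

public class ContextClassLoaderSwitch {

    /**
     * Runs the given work with the supplied loader installed as the
     * current thread's context class loader, then restores the previous
     * loader even if the work throws.
     */
    public static <T> T callWith(ClassLoader loader, Supplier<T> work) {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        current.setContextClassLoader(loader);
        try {
            return work.get();
        } finally {
            current.setContextClassLoader(previous);
        }
    }

    public static void main(String[] args) {
        // An empty URLClassLoader stands in for the loader that holds the Tika jar.
        ClassLoader custom = new URLClassLoader(new URL[0]);
        ClassLoader seen = callWith(custom, () -> Thread.currentThread().getContextClassLoader());
        System.out.println("custom loader active inside the call: " + (seen == custom));
    }
}
```

The try/finally matters on an application server because request threads are pooled: a loader left in place would leak into whatever request the thread handles next.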

Sean Coyne

Jan 3, 2012, 5:27:22 PM
to javaloa...@googlegroups.com
Thanks Mark, we have it indexing those files now.  Thanks for your help.

deer421

Feb 11, 2014, 3:21:35 PM
to javaloa...@googlegroups.com
Hi Mark,
I am using the latest Tika 1.4 and javaloader 1.1 on CF10. When I called:

loader.create('org.apache.tika.Tika').init();

I got the same error message:


Object instantiation exception.
An exception occurred while instantiating a Java object. The class must not be an interface or an abstract class. Error: ''.

Any thoughts? Thanks for your help.

Sean Coyne

Feb 11, 2014, 4:24:27 PM
to javaloa...@googlegroups.com

deer421

Feb 12, 2014, 10:15:26 AM
to javaloa...@googlegroups.com
Thanks. Is there an example of how to use the switchThreadContextClassLoader() function to instantiate Tika?

deer421

Feb 12, 2014, 2:26:05 PM
to javaloa...@googlegroups.com
Ok. I got it.

function tikaInit() {
  return create('org.apache.tika.Tika').init();
}

res = javaloader.switchThreadContextClassLoader(tikaInit);

Tim Parker

Mar 4, 2015, 12:36:15 AM
to javaloa...@googlegroups.com
I tried this... and all seems well if I initialize JavaLoader with 'loadColdFusionClassPath=true' - but if I try 'loadColdFusionClassPath=false', the tikaInit method crashes:

java.lang.ClassNotFoundException: coldfusion.runtime.java.JavaProxy
java.net.URLClassLoader$1.run(URLClassLoader.java:366)
java.net.URLClassLoader$1.run(URLClassLoader.java:355)
java.security.AccessController.doPrivileged(Native Method)
java.net.URLClassLoader.findClass(URLClassLoader.java:354)
java.lang.ClassLoader.loadClass(ClassLoader.java:425)
sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
java.lang.ClassLoader.loadClass(ClassLoader.java:358)
java.lang.ClassLoader.findSystemClass(ClassLoader.java:1059)
com.compoundtheory.classloader.NetworkClassLoader.loadClass(NetworkClassLoader.java:488)
java.lang.ClassLoader.loadClass(ClassLoader.java:358)
coldfusion.runtime.java.JavaProxyFactory.getProxy(JavaProxyFactory.java:89)
coldfusion.runtime.ProxyFactory.getProxy(ProxyFactory.java:65)
coldfusion.runtime.CFPage.createObjectProxy(CFPage.java:4947)
coldfusion.runtime.CFPage.CreateObject(CFPage.java:4911)
coldfusion.runtime.CFPage.CreateObject(CFPage.java:4852)
coldfusion.runtime.CFPage.CreateObject(CFPage.java:4830)
coldfusion.runtime.CFPage.CreateObject(CFPage.java:4787)
$funcCREATEJAVAPROXY.runFunction(...\javaloader\JavaLoader.cfc:403)
coldfusion.runtime.UDFMethod.invoke(UDFMethod.java:472)
coldfusion.filter.SilentFilter.invoke(SilentFilter.java:47)
coldfusion.runtime.UDFMethod$ReturnTypeFilter.invoke(UDFMethod.java:405)
coldfusion.runtime.UDFMethod$ArgumentCollectionFilter.invoke(UDFMethod.java:368)
coldfusion.filter.FunctionAccessFilter.invoke(FunctionAccessFilter.java:55)
coldfusion.runtime.UDFMethod.runFilterChain(UDFMethod.java:321)
coldfusion.runtime.UDFMethod.invoke(UDFMethod.java:220)
coldfusion.runtime.CfJspPage._invokeUDF(CfJspPage.java:2582)
$funcCREATE.runFunction(...\javaloader\JavaLoader.cfc:87)
coldfusion.runtime.UDFMethod.invoke(UDFMethod.java:472)
coldfusion.filter.SilentFilter.invoke(SilentFilter.java:47)
coldfusion.runtime.UDFMethod$ReturnTypeFilter.invoke(UDFMethod.java:405)
coldfusion.runtime.UDFMethod$ArgumentCollectionFilter.invoke(UDFMethod.java:368)
coldfusion.filter.FunctionAccessFilter.invoke(FunctionAccessFilter.java:55)
coldfusion.runtime.UDFMethod.runFilterChain(UDFMethod.java:321)
coldfusion.runtime.UDFMethod.invoke(UDFMethod.java:220)
coldfusion.runtime.CfJspPage._invokeUDF(CfJspPage.java:2582)
$funcTIKAINIT.runFunction(my-code.cfc:23)

This occurs on both ACF10.0.15 and ACF9.0.2 (both using jdk1.7.0_76)

The classpath given to JavaLoader.init includes only tika 1.7 (tika-app-1.7.jar)

I'm trying to exclude the CF classpath because I'm trying to resolve a crash in Tika when I try to use it to parse .DOCX files (java.lang.ClassCastException: org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to javax.xml.parsers.DocumentBuilderFactory) - maybe I'm on the wrong path for resolving that issue - I'd know for sure if JavaLoader wasn't crashing...
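
One possible reading of the trace above, offered as an assumption rather than a diagnosis: while the context class loader is switched, ColdFusion resolves its own coldfusion.runtime.java.JavaProxy class through a loader whose parent chain does not include the ColdFusion classes, so the lookup fails. The Java sketch below shows only the general visibility rule, namely that a loader can see what its parent chain can see plus its own URLs. The names ParentVisibilityDemo and canSee are hypothetical.

```java
import java.net.URL;
import java.net.URLClassLoader;

public class ParentVisibilityDemo {

    /** Returns true if the given loader can load the named class. */
    public static boolean canSee(ClassLoader loader, String className) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String appClass = ParentVisibilityDemo.class.getName();
        ClassLoader appLoader = ParentVisibilityDemo.class.getClassLoader();

        // Parent chain includes the application loader: application classes are visible.
        ClassLoader withParent = new URLClassLoader(new URL[0], appLoader);
        // No parent: only bootstrap classes (java.lang.* and friends) are visible.
        ClassLoader noParent = new URLClassLoader(new URL[0], null);

        System.out.println("withParent sees app class: " + canSee(withParent, appClass));
        System.out.println("noParent sees app class:   " + canSee(noParent, appClass));
        System.out.println("noParent sees String:      " + canSee(noParent, "java.lang.String"));
    }
}
```

If that reading is right, it would explain why the same call succeeds with loadColdFusionClassPath=true, where the ColdFusion classes are reachable from the loader.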

Sean Coyne

Mar 4, 2015, 11:35:59 AM
to javaloa...@googlegroups.com
Tim, I know that for the docx, xlsx, pptx, etc. files I had to use JavaLoader's "switchThreadContextClassLoader" method (https://github.com/markmandel/JavaLoader/wiki/Switching-the-ThreadContextClassLoader).

I load javaloader with loadColdFusionClassPath = true

So I call my parsing method like this:

javaloader.switchThreadContextClassLoader(parseOpenXmlFile, { filePath = arguments.filePath })

where "parseOpenXmlFile" looks like this:

<cffunction name="parseOpenXmlFile" access="private" output="false" returntype="string">
    <cfargument name="filePath" required="true" type="string" />

    <cfscript>
    // grab a new instance of tika
    var tika = javaloader.create("org.apache.tika.Tika").init();

    // parse the file
    var returnValue = tika.parseToString(createObject("java","java.io.File").init(arguments.filePath));

    // return the parsed string
    return returnValue;
    </cfscript>
</cffunction>

This way I can use Tika directly for non-OpenXML files, but for OpenXML files (".docx,.xlsx,.pptx,.docm,.xlsm,.pptm") I switch out the ThreadContextClassLoader and load up Tika that way.

Sean