[groovy-user] How to run scraper example from GinA?


siegfried

Jan 21, 2008, 9:58:53 PM
to us...@groovy.codehaus.org

 

I’d like to run the following program from “Groovy In Action”. I downloaded the latest neko parser. As you can see, I tried experimenting with an older version of neko as well. I’m getting the stack trace shown below. I’m running on Windows.

Thanks,

Siegfried

/*
 * Begin commands to execute this file using Groovy with bash
 * if [ `uname` == 'Linux' ]
 * then
 * export CLASSPATH=${GROOVY_HOME}/embeddable/groovy-all-1.5.0.jar:bin:.
 * else
 * export CLASSPATH=${GROOVY_HOME}\\embeddable\\groovy-all-1.5.0.jar\;bin\;.\;c\:\\dev\\html_parsers\\neko\\nekohtml-0.9.5\\nekohtml.jar\;c\:\\dev\\xerces\\Xerces-J-bin.2.9.1\\xerces-2_9_1\\*.jar
 * export CLASSPATH=${GROOVY_HOME}\\embeddable\\groovy-all-1.5.0.jar\;bin\;.\;c\:\\dev\\html_parsers\\neko\\nekohtml-1.9.6\\nekohtml.jar\;c\:\\dev\\xerces\\Xerces-J-bin.2.9.1\\xerces-2_9_1\\*.jar
 * fi
 * groovyc Listing_13_14_NekoHtml.groovy
 * java Listing_13_14_NekoHtml
 * End commands to execute this file using Groovy with bash
 */
import org.cyberneko.html.parsers.SAXParser

def url = 'http://java.sun.com'

def html = new XmlSlurper(new SAXParser()).parse(url)

def bolded = html.'**'.findAll{ it.name() == 'B' }
def out = bolded.A*.text().collect{ it.trim() }
out.removeAll([''])
out[2..5].each{ println it }

java Listing_13_14_NekoHtml

already have bin directory
>>> a serious error occurred: org/apache/xerces/parsers/AbstractSAXParser
>>> stacktrace:
java.lang.NoClassDefFoundError: org/apache/xerces/parsers/AbstractSAXParser
            at java.lang.ClassLoader.defineClass1(Native Method)
            at java.lang.ClassLoader.defineClass(ClassLoader.java:620)
            at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
            at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
            at java.net.URLClassLoader.access$000(URLClassLoader.java:56)
            at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
            at java.security.AccessController.doPrivileged(Native Method)
            at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
            at org.codehaus.groovy.tools.RootLoader.oldFindClass(RootLoader.java:142)
            at org.codehaus.groovy.tools.RootLoader.loadClass(RootLoader.java:114)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:299)
            at groovy.lang.GroovyClassLoader.loadClass(GroovyClassLoader.java:621)
            at groovy.lang.GroovyClassLoader.loadClass(GroovyClassLoader.java:479)
            at org.codehaus.groovy.control.ResolveVisitor.resolveToClass(ResolveVisitor.java:425)
            at org.codehaus.groovy.control.ResolveVisitor.resolve(ResolveVisitor.java:178)
            at org.codehaus.groovy.control.ResolveVisitor.visitClass(ResolveVisitor.java:719)
            at org.codehaus.groovy.control.ResolveVisitor.startResolving(ResolveVisitor.java:71)
            at org.codehaus.groovy.control.CompilationUnit$5.call(CompilationUnit.java:527)
            at org.codehaus.groovy.control.CompilationUnit.applyToSourceUnits(CompilationUnit.java:772)
            at org.codehaus.groovy.control.CompilationUnit.compile(CompilationUnit.java:438)
            at org.codehaus.groovy.control.CompilationUnit.compile(CompilationUnit.java:417)
            at org.codehaus.groovy.tools.FileSystemCompiler.compile(FileSystemCompiler.java:56)
            at org.codehaus.groovy.tools.FileSystemCompiler.main(FileSystemCompiler.java:220)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
            at java.lang.reflect.Method.invoke(Method.java:597)
            at org.codehaus.groovy.tools.GroovyStarter.rootLoader(GroovyStarter.java:101)
            at org.codehaus.groovy.tools.GroovyStarter.main(GroovyStarter.java:130)
java.lang.NoClassDefFoundError: Listing_13_14_NekoHtml
Exception in thread "main"
Compilation exited abnormally with code 1 at Mon Jan 21 19:31:43

Paul King

Jan 22, 2008, 4:43:18 AM
to us...@groovy.codehaus.org
siegfried wrote:
>
>
> I’d like to run the following program from “Groovy In Action”. I
> downloaded the latest neko parser. As you can see, I tried experimenting
> with an older version of neko as well. I’m getting the stack trace shown
> below. I’m running on Windows.
>
> Thanks,
>
> Siegfried

By following the pom trail from here:

http://repo1.maven.org/maven2/nekohtml/nekohtml/1.9.6/nekohtml-1.9.6.pom

I end up with the following CLASSPATH:

set CLASSPATH=.;lib/nekohtml.jar;lib/xercesImpl-2.8.1.jar

I left off the xml-apis and xml-resolver jars as they aren't needed for this scenario.

Result at the moment is:

JavaFX Script Technology
Poster Sessions at Mobile & Embedded Developer Days
Ask the Experts (Jan. 21-25): Developing and Deploying Java SE-Based Applications in Solaris
Free Evaluation: Sun Java Real-Time System 2.0 Update 1

Cheers, Paul.
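
The trace in the first message comes from groovyc (FileSystemCompiler) failing to load org.apache.xerces.parsers.AbstractSAXParser, the Xerces class that NekoHTML's SAXParser extends, so the Xerces jar was apparently not visible at compile time. One likely cause is the `*.jar` classpath entry: the Java 6 classpath wildcard is a bare `*` after the directory, and listing the jar explicitly, as in the CLASSPATH above, avoids the question altogether. A minimal sketch for checking what the classpath really provides; the helper name `missingClasses` is made up for illustration:

```groovy
// Hypothetical helper (not from the thread): return the names that the
// script's own class loader cannot load.
List<String> missingClasses(List<String> names) {
    def loader = getClass().classLoader
    names.findAll { name ->
        try {
            loader.loadClass(name)
            return false            // loaded fine, so not missing
        } catch (ClassNotFoundException e) {
            return true             // the class itself is absent
        } catch (LinkageError e) {
            return true             // e.g. a superclass is absent (NoClassDefFoundError)
        }
    }
}

// The two classes the listing depends on, directly and indirectly.
def required = [
    'org.cyberneko.html.parsers.SAXParser',
    'org.apache.xerces.parsers.AbstractSAXParser'
]
def missing = missingClasses(required)
println(missing ? "Not on the classpath: ${missing.join(', ')}" : 'All parser classes found')
```

Run it with the same CLASSPATH that is used for groovyc; any name it prints still needs a jar.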


siegfried

Jan 29, 2008, 1:43:05 PM
to us...@groovy.codehaus.org

(1) I have a Java program that can execute the XPath
"/html/body/table[2]/tr/td[3]/table/tr/td[2]/table[1]/tr/td/font/b" using
the standard Java APIs and produce the desired output.

However, after examining the HTML, this XPath does not look correct. It
should be

/html/body/table[2]/tr/td[3]/table/tr/td[2]/table[2]/tr/td/font/b

(the difference is the last table step: table[2] instead of table[1]).
That XPath does not work, however.

Now I'm experimenting in Groovy with neko, as in Listing 13.14 from GinA. It
is working, thanks to Paul!

This Groovy code produces the correct results (which I discovered by trial
and error):
import org.cyberneko.html.parsers.SAXParser
//import org.apache.xpath.XPathAPI
import groovy.xml.dom.DOMCategory

def url = 'http://www.co.boulder.co.us/assessor/asrproprecords/assess_propdesc.asp?accountno=R0035899&uniq_acctno=1&occ=1'

def html = new XmlSlurper(new SAXParser()).parse(url)

// this works in Groovy but not with the standard Java XPath
def out = html.BODY.TABLE.TR.TD[3].TABLE.TR.TD.TABLE.TR.TD.FONT
// this works in Groovy too (but not in standard Java)
out = html.BODY.TABLE.TR.TD.TABLE.TR.TD.TABLE.TR.TD.FONT
println "begin:"
int ii = 0
out.each{ println "("+(ii++)+"):"+it + '\n' }
println "(just the info I want only):"+out[6] + '\n'
println "\ndone"

Why doesn't the GPath look more like the standard XPath that works with
Java? I tried using the neko DOM from Java but it keeps dying (I guess that
is a question for another mailing list).

The XPath that works with standard Java does not look right either. When I
populate an SWT tree control I get a nice picture of the XML, and from it I
figure this GPath should work (but it does not!):

def out = html.BODY.TABLE[2].TR.TD[3].TABLE.TR.TD[2].TABLE[2].TR.TD.FONT;

Why does this not work with Groovy? Neither
"/html/body/table[2]/tr/td[3]/table/tr/td[2]/table[2]/tr/td/font/b" nor
"/html/body/table/tr/td/table/tr/td/table/tr/td/font/b" works with standard
Java!
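
One possible source of the mismatch is indexing: XPath positions start at 1, while GPath indices start at 0, so `TABLE[2]` in GPath is the third table. In addition, the XmlSlurper result is the root element itself (hence `html.BODY`, with no HTML step), and a step without an index matches every child of that name. A minimal sketch on made-up markup, using the default XML parser instead of NekoHTML so that it runs without extra jars:

```groovy
import groovy.xml.XmlSlurper   // Groovy 3+; on older versions (e.g. the 1.5.0 used in this thread) drop this import

// Made-up sample markup with upper-case element names, as in the thread's GPath expressions.
def sample = '''<HTML><BODY>
  <TABLE><TR><TD>a1</TD><TD>a2</TD></TR></TABLE>
  <TABLE><TR><TD>b1</TD><TD>b2</TD></TR></TABLE>
</BODY></HTML>'''

def html = new XmlSlurper().parseText(sample)

// XPath /HTML/BODY/TABLE[2]/TR/TD[1] becomes zero-based in GPath,
// and the path starts below the root element:
assert html.BODY.TABLE[1].TR.TD[0].text() == 'b1'

// Without an index a step matches every child of that name, so TD here
// spans both tables and TD[3] is the fourth cell overall.
assert html.BODY.TABLE.TR.TD.size() == 4
assert html.BODY.TABLE.TR.TD[3].text() == 'b2'

// An index past the end gives an empty result, not an exception.
assert html.BODY.TABLE[2].TR.TD.size() == 0
```

Under these rules, subtracting one from each index of the XPath that matches the page structure would be the first thing to try in the GPath version.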

(2) How can I dump my HTML DOM to standard output so I can see how
XmlSlurper and neko have modified the original HTML?
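
One way to do this is to feed the slurped result back through StreamingMarkupBuilder, which writes it out as markup again. A minimal sketch on made-up inline markup; the helper name `dumpXml` is hypothetical:

```groovy
import groovy.xml.XmlSlurper            // Groovy 3+; drop this import on older versions
import groovy.xml.StreamingMarkupBuilder

// Hypothetical helper: serialise a slurped node back to markup text.
String dumpXml(node) {
    new StreamingMarkupBuilder().bind { mkp.yield(node) }.toString()
}

// Made-up sample; in the thread's case pass the result of
// new XmlSlurper(new SAXParser()).parse(url) instead.
def page = new XmlSlurper().parseText('<HTML><BODY><P>hi<B>there</B></P></BODY></HTML>')
println dumpXml(page)
```

The output is not indented, but it shows the element names and nesting exactly as the parser delivered them.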

(3) Could the problem be due to the fact that the neko DOM classes keep
giving me stack traces in my Java program, so I have to use JTidy
instead? I don't think so. I've looked at the actual HTML and, clearly, I
want the second table, and it seems like I should be able to specify the
second table. It does work to specify TD[3] in the GPath, but specifying
anything else does not work (so far)!

(4) Could there be bugs in GPath that are causing these inconsistencies? Do
you think there are bugs in the standard XPath API?

(5) Can I specify a GPath at execution time or must it be compiled?
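
A GPath does not have to be fixed at compile time, because Groovy allows property names to be computed (`node."$name"`). A minimal sketch that walks a dotted path string step by step; the helper name `resolve` and the small path syntax it accepts (names with an optional `[n]` index) are assumptions made for illustration:

```groovy
import groovy.xml.XmlSlurper   // Groovy 3+; drop this import on older versions

// Hypothetical helper: follow a path such as 'BODY.TABLE[1].TR.TD[0]'
// from the given root using dynamic property access.
def resolve(root, String path) {
    path.tokenize('.').inject(root) { node, String step ->
        def m = (step =~ /(\w+)(?:\[(\d+)\])?/)
        if (!m.matches()) {
            throw new IllegalArgumentException("Unsupported step: $step")
        }
        def children = node."${m.group(1)}"          // same as node.NAME
        m.group(2) != null ? children[m.group(2).toInteger()] : children
    }
}

// Made-up sample markup; the path could equally come from args or a config file.
def tree = new XmlSlurper().parseText(
    '<HTML><BODY><TABLE><TR><TD>a1</TD></TR></TABLE><TABLE><TR><TD>b1</TD></TR></TABLE></BODY></HTML>')
println resolve(tree, 'BODY.TABLE[1].TR.TD[0]').text()
```

Another option is to evaluate the whole expression as Groovy source at run time with GroovyShell, which accepts any GPath but also executes whatever code the string contains.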

Thanks very much (sorry this was so long!)
Siegfried
