XPath and LARGE XML files.

Kyle Spraggs

Dec 6, 2012, 1:25:27 PM
to scrip...@googlegroups.com
Does the XPath driver attempt to load the entire XML file into memory or does it use streaming? The file in question is a 621 MB file and I'm failing pretty miserably at getting Scriptella to parse it.

Kyle Spraggs

Dec 6, 2012, 2:21:26 PM
to scrip...@googlegroups.com
It looks like the entire DOM is loaded which obviously won't work very well for a 600 megabyte file. I noticed a post by Martin Anderson about a staxpath driver. Does anyone happen to know if he ended up posting that somewhere?

Alexandr Berdnik

Dec 6, 2012, 2:22:44 PM
to scrip...@googlegroups.com
Hi Kyle,
I believe the Scriptella xpath driver uses DOM, so there is no option for stream processing.


--
You received this message because you are subscribed to the Google Groups "Scriptella ETL" group.
To view this discussion on the web visit https://groups.google.com/d/msg/scriptella/-/_Z25E_kUg6wJ.
To post to this group, send email to scrip...@googlegroups.com.
To unsubscribe from this group, send email to scriptella+...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/scriptella?hl=en.

Florian Kammermann

Dec 7, 2012, 2:59:19 AM
to scrip...@googlegroups.com
If you are looking for another solution, I have had very good experiences with http://vtd-xml.sourceforge.net/.
It would also be great to see this working together with Scriptella.

Fyodor Kupolov

Dec 7, 2012, 5:44:17 AM
to scrip...@googlegroups.com
Hi guys,

Yes, the xpath driver builds a DOM and runs queries against it. This is done mainly to improve performance in subqueries, because XPathExpression.evaluate(InputSource) builds a DOM on every call. AFAIK the JDK does not provide stream-based XPath evaluation.
I would also consider http://code.google.com/p/jlibs/wiki/XMLDog. The author claims that it uses SAX and evaluates all the given XPaths in one pass over the document.
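
To illustrate the trade-off described above, here is a minimal sketch using only the JDK's javax.xml.xpath API. It is not the Scriptella driver code; the class name DomXPathSketch and the sample XML are made up for the example. The first path parses the document into a DOM once and reuses it for every query, the second hands an InputSource to XPathExpression.evaluate, which has to parse the input again on each call.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomXPathSketch {
    // Hypothetical sample data standing in for a real (much larger) file.
    static final String XML =
        "<Items><Item id=\"1\">Sword</Item><Item id=\"2\">Shield</Item></Items>";

    /** Parses the whole document into a DOM once; memory use grows with file size. */
    static Document parseOnce(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Counts the nodes matched by an XPath against an already-built DOM (no re-parse). */
    static int countNodes(Document dom, String expr) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expr, dom, XPathConstants.NODESET);
        return nodes.getLength();
    }

    /** Evaluates against an InputSource: the input is parsed again on every call. */
    static String evaluateFromSource(String xml, String expr) throws Exception {
        XPathExpression compiled = XPathFactory.newInstance().newXPath().compile(expr);
        return compiled.evaluate(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document dom = parseOnce(XML);
        System.out.println(countNodes(dom, "/Items/Item"));
        System.out.println(evaluateFromSource(XML, "/Items/Item[@id='2']"));
    }
}
```

Either way the full document ends up in memory, which is why neither variant helps with a file of several hundred megabytes.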

Regards,
Fyodor

Fyodor Kupolov

Dec 7, 2012, 5:50:50 AM
to scrip...@googlegroups.com
Hi Florian,

I've just looked at the vtd-xml project description and it looks very impressive, especially the benchmark. But a big downside is that it is offered under the GPL or a commercial license, so one should take this into account when evaluating it.

Regards,
Fyodor

Kyle Spraggs

Dec 7, 2012, 12:00:43 PM
to scrip...@googlegroups.com
Is there a reason there isn't a built-in Scriptella driver for parsing large XML files? It seems to me that would be a fairly common use case.

Fyodor Kupolov

Dec 7, 2012, 12:08:49 PM
to scrip...@googlegroups.com
Hi Kyle,

Well, when time is limited it's a matter of priorities. My primary use cases were focused on JDBC, so most of the effort went into that area. Secondly, I was expecting Sun, and now Oracle, to improve the built-in XPath implementation in future JDK versions, but as we can see the most recent version still uses DOM. Lastly, there are licensing issues: Scriptella uses the Apache license, so adding an LGPL/GPL dependency is not an option.

Best,
Fyodor

Kyle Spraggs

Dec 7, 2012, 12:27:53 PM
to scrip...@googlegroups.com
Perfectly understandable. Thanks for all the work you've done. 

Cory Comer

Dec 7, 2012, 4:18:51 PM
to scrip...@googlegroups.com
We've been poking around with this, and it's pretty simple to get this lib running with the Janino driver, as seen below. One thing I would love to do is take the resulting node and create a new DOM NodeList (or something similar) that we could pass to the normal Scriptella xpath driver, so that it maps the elements and attributes the way it normally does. Is that possible?

node is a com.ximpleware.VTDNav object. From poking around the docs, it looks like we could dump the XML and use javax.xml.parsers.DocumentBuilder to build a document from it, but then we would need to somehow get that document mapped to element/attribute variables?

<!DOCTYPE etl SYSTEM "http://scriptella.javaforge.com/dtd/etl.dtd">
<etl>

    <connection id="java" driver="janino"/>
    <connection id="log" driver="text"/>
   
    <query connection-id="java">
        import com.ximpleware.*;
        import com.ximpleware.xpath.*;
        import java.io.*;

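        // Read the whole file into memory; VTD-XML parses from a byte array.
        File f = new File("Rift/Items.xml");
        byte[] b = new byte[(int) f.length()];
        DataInputStream in = new DataInputStream(new FileInputStream(f));
        try {
            in.readFully(b); // a single read() call may return before the buffer is full
        } finally {
            in.close();
        }

        VTDGen vg = new VTDGen();
        vg.setDoc(b);
        vg.parse(true); // set namespace awareness to true

        VTDNav vn = vg.getNav();

        AutoPilot ap = new AutoPilot(vn);
        ap.selectXPath("//Items/Item");

        // evalXPath() returns -1 when there are no more matches.
        while (ap.evalXPath() != -1) {
            set("node", vn); // expose the navigator, positioned on the current Item
            next();          // run the nested script for this match
        }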
       
        <script connection-id="log">
            ${node}
        </script>

    </query>
   
</etl>
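
The DocumentBuilder idea mentioned above could look roughly like the sketch below. It assumes the XML of one matched element is already available as a String (getting it out of the VTDNav is not shown), and the mapping rule used here, where attributes and direct child elements become variables, is an assumption for illustration and not Scriptella's actual implementation. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class FragmentToVariables {
    /** Builds a small DOM for one fragment and flattens it into name/value pairs. */
    static Map<String, String> toVariables(String fragmentXml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(fragmentXml)))
                .getDocumentElement();

        Map<String, String> vars = new LinkedHashMap<String, String>();

        // Attributes of the matched element, e.g. id="7" becomes id -> 7.
        NamedNodeMap attrs = root.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            vars.put(attr.getNodeName(), attr.getNodeValue());
        }

        // Direct child elements, e.g. <Name>Sword</Name> becomes Name -> Sword.
        for (Node child = root.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                vars.put(child.getNodeName(), child.getTextContent());
            }
        }
        return vars;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical fragment; the real Items.xml layout is not shown in this thread.
        System.out.println(toVariables("<Item id=\"7\"><Name>Sword</Name><Level>3</Level></Item>"));
    }
}
```

Because only one fragment is parsed at a time, the DOM stays small even when the source file is large. Each entry could then be passed to set(name, value) before calling next() in the Janino query.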

Kyle Spraggs

Dec 13, 2012, 1:30:51 PM
to scrip...@googlegroups.com
I can confirm that VTD-XML with the Janino driver works quite well. I inserted 88,000 entries from a 660 megabyte XML file in ~ 69 seconds using it. The goal now is to create a driver that can be a drop-in replacement to XPath.


sarah wilkes

Jan 4, 2013, 8:41:39 AM
to scrip...@googlegroups.com
I can suggest Liquid XML Studio. It has a large file editor that can handle GB- and TB-sized files, so it may work for you. It also deals with XPath, XQuery, etc.

http://www.liquid-technologies.com/xml-studio.aspx


Florian Kammermann

Mar 13, 2013, 11:56:14 AM
to scrip...@googlegroups.com
> The goal now is to create a driver that can be a drop-in replacement to XPath.
Did you already start with this work?

Kyle Spraggs

Mar 27, 2013, 1:49:31 PM
to scrip...@googlegroups.com
No, I lack the Java knowledge and time to pick it up.

Florian Kammermann

Apr 26, 2013, 12:36:37 PM
to scrip...@googlegroups.com
Ok, I did a simple driver implementation with vtd-xml.

@scriptella guys: how can I contribute the code?

Fyodor Kupolov

Apr 26, 2013, 3:52:19 PM
to scrip...@googlegroups.com

I think the easiest way for the integration would be to start a project on github or google code. And I'll add the link from Scriptella website.

Cory Comer

Apr 26, 2013, 4:27:13 PM
to scrip...@googlegroups.com
If you get a project going on github, definitely let us know, I would love to take a look and contribute if possible as well.

I've encountered a number of situations where I've had to chunk large XML files or just use the Janino driver and a stream reader to handle them. Having a driver that supports vtd-xml would be fantastic.
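
For reference, the stream-reader approach mentioned above can be sketched with the JDK's StAX API (javax.xml.stream). This is a minimal example and not code from any Scriptella driver; the Item element comes from the XPath used earlier in the thread, while the name attribute and the class name are assumptions made up for the example.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class StaxItemReader {
    /** Pulls events one at a time; only the current event is held in memory. */
    static List<String> readItemNames(Reader in) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        List<String> names = new ArrayList<String>();
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "Item".equals(reader.getLocalName())) {
                    // "name" is a hypothetical attribute used for illustration.
                    names.add(reader.getAttributeValue(null, "name"));
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Items><Item name=\"Sword\"/><Item name=\"Shield\"/></Items>";
        System.out.println(readItemNames(new StringReader(xml)));
    }
}
```

In a real job the names would not be collected in a list; each match would be handed on immediately (for example via set(...) and next() in a Janino query) so that memory use stays flat.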

Florian Kammermann

May 7, 2013, 2:39:09 PM
to scrip...@googlegroups.com
OK, here it is:

I hope I implemented everything as it should be.
I tried to adapt the existing XPath unit tests, but not all of them ran.

So I adapted what I could. For me, the most important test is the one with the large XML file.

@scriptella guys: we can refine the code on GitHub until it is good enough to merge into Scriptella.

Fyodor Kupolov

May 8, 2013, 5:20:06 AM
to scrip...@googlegroups.com
Hi Florian,

Thank you for the contribution. I will definitely add a link to the driver.

I've reviewed the code and have a few questions/suggestions:
VtdXmlXPathConnection:
- executeQuery: a small optimization. Instead of "XPathQueryExecutor exec = queriesCache.get(queryContent);" I would use "cache_queries ? queriesCache.get(queryContent) : null" to avoid an unnecessary lookup.

XPathQueryExecutor:
- What's the purpose of the ThreadLocal<VTDNav> context? I see the value being set, but I cannot see where it is read. Did I miss anything?
- I see that an instance of AutoPilot is created several times. Can't it be instantiated once in the constructor and reused afterwards?

VTDNavCreator:
- I would recommend making the code more flexible by working with URLs rather than File. This way remote protocols like ftp or http will be supported out of the box. You can convert a URL to a stream by calling url.openStream().
- Could you please elaborate on the logic in createVTDNavFromXmlFile? Why is the whole file read into memory? When possible, we should work with streams or sources of streams. Also consider IOUtils.toByteArray(InputStream is, long maxLength) if it is required to read the file into RAM.
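
A minimal sketch of the URL suggestion, using only the JDK. The readAll helper below is a hypothetical stand-in written for this example; it is not the IOUtils.toByteArray method mentioned above, and throwing when the limit is exceeded is an assumption about the desired behavior.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

class UrlBytes {
    /** Reads the content behind a URL into memory, refusing anything above maxLength bytes. */
    static byte[] readAll(URL url, long maxLength) throws IOException {
        InputStream in = url.openStream(); // works for file:, http:, ftp:, ...
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            long total = 0;
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
                if (total > maxLength) {
                    throw new IOException("Content exceeds " + maxLength + " bytes: " + url);
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a tiny temporary file and read it back through a file: URL.
        File tmp = File.createTempFile("sample", ".xml");
        tmp.deleteOnExit();
        FileOutputStream out = new FileOutputStream(tmp);
        try {
            out.write("<Items/>".getBytes("UTF-8"));
        } finally {
            out.close();
        }
        byte[] data = readAll(tmp.toURI().toURL(), 1024);
        System.out.println(data.length + " bytes read");
    }
}
```

The resulting byte array could be passed to VTDGen.setDoc as in the Janino example earlier in the thread, so the driver would no longer depend on java.io.File.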

Regards,
Fyodor

Florian Kammermann

May 16, 2013, 12:56:38 PM
to scrip...@googlegroups.com
I opened issues for your input at https://github.com/floriankammermann/scriptella-vtdxml/issues and resolved them.
Check out the closed issues.

Will try to add more tests in the future.

What is important for me is iterating on two levels.
First, select /root/level1 and iterate over all level1 elements.
Second, on every level1, select ./level2 and iterate over its level2 elements.
Then write records that combine level1 data with level2 data.
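
The two-level iteration described above can be sketched with the JDK's DOM-based XPath API. This only illustrates the intended access pattern; it is not the vtd-xml driver code, and the id attribute, the record format, and the class name are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TwoLevelIteration {
    /** Produces one record per level2 element, prefixed with data from its level1 parent. */
    static List<String> combine(String xml) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        List<String> records = new ArrayList<String>();

        // Outer query: all level1 elements.
        NodeList level1 = (NodeList) xpath.evaluate("/root/level1", dom, XPathConstants.NODESET);
        for (int i = 0; i < level1.getLength(); i++) {
            Node parent = level1.item(i);
            String parentId = xpath.evaluate("@id", parent); // hypothetical attribute

            // Inner query: evaluated relative to the current level1 node.
            NodeList level2 = (NodeList) xpath.evaluate("./level2", parent, XPathConstants.NODESET);
            for (int j = 0; j < level2.getLength(); j++) {
                records.add(parentId + ":" + level2.item(j).getTextContent());
            }
        }
        return records;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root>"
                + "<level1 id=\"a\"><level2>x</level2><level2>y</level2></level1>"
                + "<level1 id=\"b\"><level2>z</level2></level1>"
                + "</root>";
        System.out.println(combine(xml));
    }
}
```

In Scriptella terms the outer loop corresponds to a query and the inner loop to a nested query, with the combined record written by the innermost script.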

I also need a deeper understanding of the Scriptella code and its callbacks
for a better implementation.

Regards 
Florian

s.mo...@gmail.com

Sep 5, 2014, 2:08:43 AM
to scrip...@googlegroups.com
@Florian Kammermann:

Tried your add-on and it works fine. There is one problem left: if there are namespaces defined on an element, I cannot use the $ notation to get the content of the element.

Example: ns:title

$ns:title does not work. Is there another possibility to get the content of the element?

Florian Kammermann

Sep 6, 2014, 4:32:07 AM
to scrip...@googlegroups.com
Can you provide your XML and your Scriptella script?