Multiple ways of generating XPath expressions (was: Simplifying generated XPath when there is only a single sibling (#2))

Mark Butler

Jun 23, 2013, 5:39:52 AM

to joox...@googlegroups.com

Dear Lukas,

First, thanks for writing and open-sourcing jOOX.

For any node in a document there are many XPath expressions that select it. I am working on a tool that automatically generates XPath expressions for HTML pages to support information extraction. I have been looking at Selenium IDE, as it generates several different XPath expressions for a web page element.

What I need to do is slightly different, as I potentially want to specify a group of elements, whereas Selenium IDE specifies a single element, but Selenium serves as a good model.

So my question is: do you have any interest in me contributing these approaches back to jOOX, as they may be of use to others?

This may seem a bit abstract so I will explain the other algorithms:

1. Full path via sibling number

Already implemented by jOOX.

2. Relative path via sibling number

First we generate the node set for the full path. Then we walk up the tree one path segment at a time, starting from the last segment, and evaluate each candidate relative path until one returns the same node set as the full path. For example, if the full path is

/html/body/table/tr[1]/td[2]/p[1]

then we might try

//p[1] matches too many ...

//td[2]/p[1] matches too many ...

//tr[1]/td[2]/p[1]

If there is only a single table on the page, this could be sufficient: it returns the same node set as the full path, so we accept it as a "minimal path expression".
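The search described above can be sketched with the JDK's built-in DOM and XPath APIs. This is only an illustration, not jOOX code: the class name `RelativePathSketch`, its methods, and the sample document are assumptions made for the example, and splitting the path on `/` only works for simple paths whose predicates are plain sibling indexes.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of jOOX.
final class RelativePathSketch {

    // Sample page with a single table (an assumption made for this example).
    static final String SAMPLE = "<html><body><table>"
            + "<tr><td><p>a</p></td><td><p>b</p><p>c</p></td></tr>"
            + "<tr><td><p>d</p></td><td><p>e</p></td></tr>"
            + "</table></body></html>";

    /** Parses an XML string into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Evaluates an XPath expression and collects the selected nodes by identity. */
    static Set<Node> select(Document doc, String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            Set<Node> result = Collections.newSetFromMap(new IdentityHashMap<Node, Boolean>());
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    /** Returns the shortest "//suffix" of fullPath that selects the same node set. */
    static String shorten(Document doc, String fullPath) {
        Set<Node> expected = select(doc, fullPath);
        // Naive split: assumes no "/" appears inside a predicate.
        String[] segments = fullPath.substring(1).split("/");
        String suffix = "";
        for (int i = segments.length - 1; i > 0; i--) {
            suffix = "/" + segments[i] + suffix;
            String candidate = "/" + suffix; // e.g. "//td[2]/p[1]"
            Set<Node> actual = select(doc, candidate);
            if (actual.size() == expected.size() && actual.containsAll(expected)) {
                return candidate;
            }
        }
        return fullPath; // no shorter expression was found
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println(shorten(doc, "/html/body/table/tr[1]/td[2]/p[1]"));
    }
}
```

Each candidate costs one XPath evaluation over the whole document, so the work grows with the depth of the node. That is one reason to keep this kind of search separate from the cheap canonical path.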

3. Relative path via ID

Here, as before, we walk up the tree one path segment at a time, but if a node has an ID, we check whether there is only one instance of that ID in the document. If so, we can accept the path anchored at that node as a minimal path expression, e.g.

//p[1]

//td[2]/p[1]

//tr[@id='specification']/td[2]/p[1]

Found minimal path via ID
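A rough sketch of the ID variant, again using only the JDK DOM API. The class `IdPathSketch` and its methods are hypothetical. It treats a plain attribute named `id` as the ID (not a DTD-typed ID) and assumes ID values contain no apostrophes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of jOOX.
final class IdPathSketch {

    // Sample document (an assumption made for this example).
    static final String SAMPLE = "<html><body><table>"
            + "<tr id=\"header\"><td><p>h</p></td></tr>"
            + "<tr id=\"specification\"><td><p>a</p></td><td><p>b</p></td></tr>"
            + "</table></body></html>";

    /** Parses an XML string into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** 1-based position of e among its siblings with the same element name. */
    static int siblingIndex(Element e) {
        int index = 1;
        for (Node n = e.getPreviousSibling(); n != null; n = n.getPreviousSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && n.getNodeName().equals(e.getNodeName())) {
                index++;
            }
        }
        return index;
    }

    /** True if e has an "id" attribute whose value occurs exactly once in the document. */
    static boolean hasUniqueId(Element e) {
        String id = e.getAttribute("id");
        if (id.isEmpty()) {
            return false;
        }
        NodeList all = e.getOwnerDocument().getElementsByTagName("*");
        int count = 0;
        for (int i = 0; i < all.getLength(); i++) {
            if (id.equals(((Element) all.item(i)).getAttribute("id"))) {
                count++;
            }
        }
        return count == 1;
    }

    /** Walks up from target and anchors the path at the first ancestor-or-self with a unique id. */
    static String pathViaId(Element target) {
        String suffix = "";
        for (Node n = target; n instanceof Element; n = n.getParentNode()) {
            Element e = (Element) n;
            if (hasUniqueId(e)) {
                return "//" + e.getNodeName() + "[@id='" + e.getAttribute("id") + "']" + suffix;
            }
            suffix = "/" + e.getNodeName() + "[" + siblingIndex(e) + "]" + suffix;
        }
        return suffix; // no unique id found: this is the full indexed path
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        Element target = (Element) doc.getElementsByTagName("p").item(2);
        System.out.println(pathViaId(target));
    }
}
```

Unlike the sibling-number search, no node-set comparison is needed here: a unique ID plus indexed steps below it already identifies a single node.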

4. Full path / relative path via attribute

HTML pages often use class attributes, which CSS selectors match, to identify classes of nodes. In the XML case, we can generalize this to attributes: when we generate the path expression, if a node is the only sibling with a specific attribute, then we can use that attribute instead of a sibling index. We can then calculate the full path as in (1) or the relative path as in (2), as before.
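To make the attribute idea concrete, here is a small sketch of a full path in which a step uses an attribute predicate whenever the attribute value is unique among same-named siblings. The class `AttributePathSketch` is hypothetical, the attribute name is passed in explicitly (here `class`) to keep the result deterministic, and attribute values are assumed to contain no apostrophes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of jOOX.
final class AttributePathSketch {

    // Sample document (an assumption made for this example).
    static final String SAMPLE = "<table>"
            + "<tr><td class=\"name\">Widget</td><td class=\"price\">9.99</td></tr>"
            + "<tr><td class=\"note\">x</td><td class=\"note\">y</td></tr>"
            + "</table>";

    /** Parses an XML string into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** One path step: name[@attribute='value'] if that value is unique among same-named siblings, else name[index]. */
    static String step(Element e, String attribute) {
        String value = e.getAttribute(attribute);
        int position = 0;
        int index = 0;
        int sameValue = 0;
        for (Node s = e.getParentNode().getFirstChild(); s != null; s = s.getNextSibling()) {
            if (s.getNodeType() != Node.ELEMENT_NODE || !s.getNodeName().equals(e.getNodeName())) {
                continue;
            }
            position++;
            if (s.isSameNode(e)) {
                index = position;
            }
            if (!value.isEmpty() && value.equals(((Element) s).getAttribute(attribute))) {
                sameValue++;
            }
        }
        return sameValue == 1
                ? e.getNodeName() + "[@" + attribute + "='" + value + "']"
                : e.getNodeName() + "[" + index + "]";
    }

    /** Full path from the document root down to target. */
    static String fullPath(Element target, String attribute) {
        String path = "";
        for (Node n = target; n instanceof Element; n = n.getParentNode()) {
            path = "/" + step((Element) n, attribute) + path;
        }
        return path;
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        Element price = (Element) doc.getElementsByTagName("td").item(1);
        System.out.println(fullPath(price, "class"));
    }
}
```

When two siblings share the attribute value, as with the two `note` cells in the sample, the step falls back to the sibling index.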

5. Content

In Selenium, you might want to locate a control on a web page to add data to a form, press a button, etc. In the web extraction framework I am working on, I often need to do this to select the next page of results, as on a Google results page. Here's an example (I am working with Chinese):

//a[text()='下一頁']/@href
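For illustration, an expression like this one can be evaluated with the JDK XPath API as follows. The class `ContentPathSketch`, the sample markup, and the link targets are all made up for the example, and the link text is assumed to contain no apostrophes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of jOOX.
final class ContentPathSketch {

    // Sample results page with "previous page" and "next page" links (made up for this example).
    static final String SAMPLE = "<html><body>"
            + "<a href=\"/search?page=1\">上一頁</a>"
            + "<a href=\"/search?page=3\">下一頁</a>"
            + "</body></html>";

    /** Returns the href of the first link whose text equals linkText, or "" if there is none. */
    static String hrefOfLink(String xml, String linkText) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            String expression = "//a[text()='" + linkText + "']/@href";
            // The two-argument evaluate() returns the string value of the first selected node.
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate link lookup", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hrefOfLink(SAMPLE, "下一頁"));
    }
}
```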

Clearly, calculating the alternatives will take time, so this would be done separately from the current getXPath method, which is efficient.

What do you think? If you have some interest, we can take it a little further and discuss some design alternatives before I propose an implementation.

For my previous pull request, do you want to discuss design alternatives for that too, so I can revise my submission?

Best wishes, and thanks again for jOOX!

Mark

Lukas Eder

Jun 25, 2013, 5:58:34 PM

to joox...@googlegroups.com, Mark Butler

Hi Mark,

Sorry for the delay, I had been a bit busy with my other project, jOOQ...

2013/6/23 Mark Butler <markhen...@gmail.com>

Dear Lukas,

First, thanks for writing and open-sourcing jOOX.

For any node in a document there are many XPath expressions that select it. I am working on a tool that automatically generates XPath expressions for HTML pages to support information extraction. I have been looking at Selenium IDE, as it generates several different XPath expressions for a web page element.

What I need to do is slightly different, as I potentially want to specify a group of elements, whereas Selenium IDE specifies a single element, but Selenium serves as a good model.

So my question is: do you have any interest in me contributing these approaches back to jOOX, as they may be of use to others?

Yes, of course. The current Match.xpath() implementation is a canonical one. It will never produce an ambiguous XPath expression. But depending on the document domain, there may be better (i.e. more readable) XPath expressions, of course.

This may seem a bit abstract so I will explain the other algorithms:

1. Full path via sibling number

Already implemented by jOOX.

Yes, this is clearly a must-have.

2. Relative path via sibling number

First we generate the node set for the full path. Then we walk up the tree one path segment at a time, starting from the last segment, and evaluate each candidate relative path until one returns the same node set as the full path. For example, if the full path is

/html/body/table/tr[1]/td[2]/p[1]

then we might try

//p[1] matches too many ...
//td[2]/p[1] matches too many ...

//tr[1]/td[2]/p[1]

If there is only a single table on the page, this could be sufficient: it returns the same node set as the full path, so we accept it as a "minimal path expression".

OK, so does this work in a somewhat efficient way?

Also, do you think that //p[1] could be a useful XPath expression? Indexing after // has quite different semantics from indexing within "full paths"...
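The difference in semantics is easy to check: in XPath 1.0, //p[1] selects every p that is the first p child of its parent, whereas (//p)[1] selects the first p in document order. A small sketch that counts the selected nodes, where the class `DescendantIndexSketch` and the sample document are assumptions made for the example:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of jOOX.
final class DescendantIndexSketch {

    // Two parents, each with its own first <p> child.
    static final String SAMPLE = "<r><a><p>1</p><p>2</p></a><b><p>3</p></b></r>";

    /** Number of nodes selected by a node-set expression in the given XML. */
    static int count(String xml, String nodeSetExpression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Double n = (Double) XPathFactory.newInstance().newXPath()
                    .evaluate("count(" + nodeSetExpression + ")", doc, XPathConstants.NUMBER);
            return n.intValue();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate " + nodeSetExpression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println("//p[1]    -> " + count(SAMPLE, "//p[1]"));
        System.out.println("(//p)[1]  -> " + count(SAMPLE, "(//p)[1]"));
        System.out.println("/r/a/p[1] -> " + count(SAMPLE, "/r/a/p[1]"));
    }
}
```

On this sample, //p[1] selects two nodes while the parenthesized and full-path forms select one each, which is why the shortened candidates have to be compared against the full path's node set before being accepted.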

3. Relative path via ID

Here, as before, we walk up the tree one path segment at a time, but if a node has an ID, we check whether there is only one instance of that ID in the document. If so, we can accept the path anchored at that node as a minimal path expression, e.g.

//p[1]
//td[2]/p[1]
//tr[@id='specification']/td[2]/p[1]

Found minimal path via ID

That's a nice idea, especially because jOOX already has some ID-related API methods. It's good to strengthen the notion of XML IDs in such a context.

4. Full path / relative path via attribute

HTML pages often use class attributes, which CSS selectors match, to identify classes of nodes. In the XML case, we can generalize this to attributes: when we generate the path expression, if a node is the only sibling with a specific attribute, then we can use that attribute instead of a sibling index. We can then calculate the full path as in (1) or the relative path as in (2), as before.

Could you provide an example of this? Do note that HTML is not a primary use-case for jOOX, though.

5. Content

In Selenium, you might want to locate a control on a web page to add data to a form, press a button, etc. In the web extraction framework I am working on, I often need to do this to select the next page of results, as on a Google results page. Here's an example (I am working with Chinese):

//a[text()='下一頁']/@href

Clearly, calculating the alternatives will take time, so this would be done separately from the current getXPath method, which is efficient.

Hmm, in an XML context, I'd see this as a secondary use-case.

What do you think? If you have some interest, we can take it a little further and discuss some design alternatives before I propose an implementation.

Sure!

In my opinion, jOOX shouldn't assume anything about the document structure (except maybe for IDs). This means that there is an infinite number of useful XPath calculation algorithms, depending on the setup. This again means that the actual algorithm should be as pluggable as possible, similar to the existing Mapper and Filter APIs. Given that Match objects are transformed to XPath strings, Mapper will be the most appropriate type again.

So what about an XPathMapper and an XPathMapper.Builder to construct an XPathMapper instance, given some configuration flags? That might be a versatile solution... It will be along the lines of this issue here:

https://github.com/jOOQ/jOOX/issues/122

I'm currently not quite sure how to start implementing builder patterns and / or DSL elements in jOOX, as this is quite a new area for this library (unlike jOOQ...). But nonetheless, if you agree to contribute this under the Apache Software License 2.0, I would be more than happy to incorporate any submission along the lines of a configurable org.joox.Mapper into the library, attributing authorship to you.

https://github.com/jOOQ/jOOX/blob/master/jOOX/LICENSE.txt

For my previous pull request, do you want to discuss design alternatives for that too, so I can revise my submission?

I think that a single XPathMapper API should be sufficient to cover this use-case as well. So, one possibility is to write:

Mapper<String> mapper = JOOX
    .xpath()
    .indexAll(false) // [#122]
    .shortcut(true) // 2
    .shortcutOnID(true) // 3
    .shortcutOnContent("下一頁")
    .build();
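To experiment with this shape outside jOOX, a standalone sketch might look like the following. Everything here is hypothetical: the builder calls above are only a proposal in this thread, and the class `XPathMapperSketch` below returns a plain `java.util.function.Function<Element, String>` instead of an org.joox.Mapper. Only two of the proposed flags are sketched (`indexAll` and an ID shortcut), and the ID shortcut simply assumes that `id` values are unique in the document.

```java
import java.io.StringReader;
import java.util.function.Function;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical builder, loosely modelled on the proposal above; not jOOX API.
final class XPathMapperSketch {

    // Sample document (an assumption made for this example).
    static final String SAMPLE = "<library><book id=\"b1\"><title>A</title>"
            + "<author>x</author><author>y</author></book></library>";

    private boolean indexAll = true;
    private boolean shortcutOnId = false;

    static XPathMapperSketch xpath() {
        return new XPathMapperSketch();
    }

    /** If false, the index is omitted for elements that have no same-named sibling. */
    XPathMapperSketch indexAll(boolean value) {
        this.indexAll = value;
        return this;
    }

    /** If true, the path starts at the nearest ancestor-or-self that has an "id" attribute. */
    XPathMapperSketch shortcutOnId(boolean value) {
        this.shortcutOnId = value;
        return this;
    }

    Function<Element, String> build() {
        final boolean all = indexAll;
        final boolean onId = shortcutOnId;
        return target -> {
            String suffix = "";
            for (Node n = target; n instanceof Element; n = n.getParentNode()) {
                Element e = (Element) n;
                if (onId && e.hasAttribute("id")) {
                    // Assumes the id value is unique in the document.
                    return "//" + e.getNodeName() + "[@id='" + e.getAttribute("id") + "']" + suffix;
                }
                suffix = "/" + step(e, all) + suffix;
            }
            return suffix;
        };
    }

    private static String step(Element e, boolean indexAll) {
        int position = 0;
        int total = 0;
        for (Node s = e.getParentNode().getFirstChild(); s != null; s = s.getNextSibling()) {
            if (s.getNodeType() == Node.ELEMENT_NODE && s.getNodeName().equals(e.getNodeName())) {
                total++;
                if (s.isSameNode(e)) {
                    position = total;
                }
            }
        }
        return (indexAll || total > 1) ? e.getNodeName() + "[" + position + "]" : e.getNodeName();
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Element author = (Element) parse(SAMPLE).getElementsByTagName("author").item(1);
        System.out.println(xpath().build().apply(author));
        System.out.println(xpath().indexAll(false).build().apply(author));
        System.out.println(xpath().indexAll(false).shortcutOnId(true).build().apply(author));
    }
}
```

Capturing the flags when build() is called keeps the returned function immutable, so one builder can be reconfigured and reused to produce several mappers.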

Feel free to experiment a little. We can then discuss the API suggestion.

And sorry again for the delay.

Cheers

Lukas
