First, thanks for writing and open-sourcing jOOX.
For any node in a document, there are many XPath expressions that select it. I am working on a tool that automatically generates XPath expressions for HTML pages to support information extraction. I have been looking at Selenium IDE, as it generates several different XPath expressions for a web page element.
What I need to do is slightly different, as I potentially want to specify a group of elements, whereas Selenium IDE specifies a single element, but Selenium serves as a good model.
So my question is: do you have any interest in me contributing these approaches back to jOOX, as they may be of use to others?
This may seem a bit abstract, so I will explain the algorithms:
1. Full path via sibling number
Already implemented by jOOX.
2. Relative path via sibling number
First we generate the node set for the full path. Then we walk up the tree one path segment at a time, building a relative path from the trailing segments and evaluating it, until it returns the same node set as the full path. For example, if the full path is
/html/body/table/tr[1]/td[2]/p[1]
we might try
//p[1] matches too many ...
//td[2]/p[1] matches too many ...
//tr[1]/td[2]/p[1]
If there is only a single table on the page, this could be sufficient: it returns the same node set, so we accept it as a "minimal path expression".
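To make the walk concrete, here is a rough sketch using only the JDK's built-in DOM and XPath classes. The class and method names (RelativePathBySiblingNumber, minimalPath, and so on) are hypothetical and not part of jOOX. It assumes well-formed XHTML and path segments that contain no "/" inside predicates.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper for illustration; not part of jOOX. */
final class RelativePathBySiblingNumber {

    /** Parses well-formed XML (or XHTML) from a string. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document", e);
        }
    }

    /** Evaluates an XPath expression and returns the matched nodes. */
    static List<Node> select(Document doc, String expression) {
        try {
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<Node> nodes = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                nodes.add(hits.item(i));
            }
            return nodes;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    /** Two duplicate-free node lists are equal as sets if sizes match and one contains the other. */
    static boolean sameNodeSet(List<Node> a, List<Node> b) {
        return a.size() == b.size()
                && a.stream().allMatch(n -> b.stream().anyMatch(n::isSameNode));
    }

    /**
     * Tries "//" plus the last 1, 2, 3, ... segments of the full path and
     * returns the first candidate that selects the same nodes as the full path.
     * Assumes fullPath starts with "/" and has no "/" inside predicates.
     */
    static String minimalPath(Document doc, String fullPath) {
        List<Node> target = select(doc, fullPath);
        String[] segments = fullPath.substring(1).split("/");
        for (int start = segments.length - 1; start > 0; start--) {
            String candidate = "//" + String.join("/",
                    Arrays.copyOfRange(segments, start, segments.length));
            if (sameNodeSet(target, select(doc, candidate))) {
                return candidate;
            }
        }
        return fullPath; // no shorter path is equivalent
    }
}
```

Each candidate costs one XPath evaluation over the whole document, so the cost grows with the depth of the node.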
3. Relative path via ID
As before, we walk up the tree one path segment at a time, but if a node has an ID, we check whether that ID occurs only once in the document. If so, we can anchor the path at that node and accept it as a minimal path expression, e.g.
//p[1]
//td[2]/p[1]
//tr[@id='specification']/td[2]/p[1]
Found minimal path via ID
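A minimal sketch of the ID variant, again with hypothetical names (RelativePathById, pathViaId) and only JDK classes. It assumes the attribute is literally named "id" and skips IDs containing a single quote, since the generated expression quotes the value with single quotes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper for illustration; not part of jOOX. */
final class RelativePathById {

    /** Parses well-formed XML (or XHTML) from a string. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document", e);
        }
    }

    /** Counts the elements whose "id" attribute equals the given value. */
    static int countId(Document doc, String id) {
        NodeList all = doc.getElementsByTagName("*");
        int count = 0;
        for (int i = 0; i < all.getLength(); i++) {
            if (id.equals(((Element) all.item(i)).getAttribute("id"))) {
                count++;
            }
        }
        return count;
    }

    /** Builds "name[n]", where n is the 1-based position among same-named siblings. */
    static String indexedSegment(Element e) {
        int index = 1;
        for (Node s = e.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
            if (s instanceof Element && s.getNodeName().equals(e.getNodeName())) {
                index++;
            }
        }
        return e.getTagName() + "[" + index + "]";
    }

    /**
     * Walks from the target towards the root. The first element carrying a
     * document-unique id becomes the anchor; if none is found, the full
     * indexed path is returned.
     */
    static String pathViaId(Element target) {
        Document doc = target.getOwnerDocument();
        String tail = "";
        for (Node n = target; n instanceof Element; n = n.getParentNode()) {
            Element e = (Element) n;
            String id = e.getAttribute("id"); // "" when the attribute is absent
            if (!id.isEmpty() && id.indexOf('\'') < 0 && countId(doc, id) == 1) {
                return "//" + e.getTagName() + "[@id='" + id + "']" + tail;
            }
            tail = "/" + indexedSegment(e) + tail;
        }
        return tail;
    }
}
```

Unlike the sibling-number variant, this needs no XPath evaluation per candidate, only a scan for duplicate IDs.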
4. Full path / relative path via attribute
HTML pages commonly use class attributes, which CSS selectors target, to identify classes of nodes. In the XML case, we can generalize this to arbitrary attributes: when we generate the path expression, if a node is the only sibling with a specific attribute, we can use that attribute instead of a sibling index. We can then calculate the full path as in [1] or the relative path as in [2].
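One possible reading of this rule, sketched with hypothetical names (AttributePath, segment, fullPath): a segment uses the attribute only when no other sibling element of the same name carries the same value, and falls back to the sibling index otherwise. The whole attribute value is compared, so multi-valued class attributes are not split.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper for illustration; not part of jOOX. */
final class AttributePath {

    /** Parses well-formed XML (or XHTML) from a string. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document", e);
        }
    }

    /**
     * Returns "name[@attr='value']" if the value is unique among same-named
     * siblings, otherwise "name[n]" with the 1-based sibling index.
     */
    static String segment(Element e, String attr) {
        String value = e.getAttribute(attr); // "" when the attribute is absent
        boolean unique = !value.isEmpty() && value.indexOf('\'') < 0;
        boolean before = true;
        int index = 1;
        for (Node s = e.getParentNode().getFirstChild(); s != null; s = s.getNextSibling()) {
            if (s.isSameNode(e)) {
                before = false;
                continue;
            }
            if (!(s instanceof Element) || !s.getNodeName().equals(e.getNodeName())) {
                continue;
            }
            if (before) {
                index++; // same-named sibling preceding e
            }
            if (value.equals(((Element) s).getAttribute(attr))) {
                unique = false; // another sibling shares the value
            }
        }
        return e.getTagName() + (unique ? "[@" + attr + "='" + value + "']" : "[" + index + "]");
    }

    /** Builds the full path from the root, preferring attribute predicates. */
    static String fullPath(Element target, String attr) {
        String path = "";
        for (Node n = target; n instanceof Element; n = n.getParentNode()) {
            path = "/" + segment((Element) n, attr) + path;
        }
        return path;
    }
}
```

The resulting segments can be fed into the relative-path walk in the same way as the indexed ones.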
5. Content
In Selenium, you might want to locate a control on a web page in order to add data to a form, press a button, etc. In the web extraction framework I am working on, I often need to do this to select the next page of results, as on a Google results page. Here's an example (I am working with Chinese):
//a[text()='下一頁']/@href
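A content-based expression could be generated along these lines. This is a sketch with hypothetical names (PathByContent, pathByText, evaluate): the expression is accepted only if it selects exactly the target element, and text containing a single quote is skipped because the value is quoted with single quotes.

```java
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper for illustration; not part of jOOX. */
final class PathByContent {

    /** Parses well-formed XML (or XHTML) from a string. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document", e);
        }
    }

    /**
     * Returns "//name[text()='...']" if that expression selects the target
     * and nothing else; otherwise an empty Optional.
     */
    static Optional<String> pathByText(Element target) {
        String text = target.getTextContent();
        if (text.isEmpty() || text.indexOf('\'') >= 0) {
            return Optional.empty();
        }
        String candidate = "//" + target.getTagName() + "[text()='" + text + "']";
        try {
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(candidate, target.getOwnerDocument(), XPathConstants.NODESET);
            boolean unique = hits.getLength() == 1 && hits.item(0).isSameNode(target);
            return unique ? Optional.of(candidate) : Optional.empty();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Generated an invalid XPath: " + candidate, e);
        }
    }

    /** Evaluates an expression to its string value, e.g. an attribute value. */
    static String evaluate(Document doc, String expression) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }
}
```

Appending "/@href" to the generated expression and passing it to evaluate then yields the link target of the "next page" anchor.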
Clearly, calculating the alternatives will take time, so this would be done separately from the current getXPath method, which is efficient.
What do you think? If you have some interest, we can take it a little further and discuss some design alternatives before I propose an implementation.
For my previous pull request, do you want to discuss design alternatives for that too, so I can revise my submission?