Element Click selector behaving oddly

Andrew Brown

Aug 2, 2014, 1:42:20 AM
to web-s...@googlegroups.com
I want to use the element click selector (in the very latest version on github) to click on each of a list of items which will then use javascript to change the content I need to scrape.  For some reason, this doesn't seem to be working like I expect it to.  It appears to click on each of the links to cause the content to change, but it doesn't scrape the data after each click.  In one instance it only scraped the data after the last click, and in another case, it seemed to scrape the last data 3 times somehow.

I created a jsfiddle to show an example of the type of page I want to scrape here: http://jsfiddle.net/Xb3gX/4/

Here is the scraper:
{"startUrl":"http://fiddle.jshell.net/Xb3gX/4/show/","selectors":[{"parentSelectors":["_root"],"type":"SelectorElementClick","multiple":true,"id":"click-list","selector":"div.sample","clickElementSelector":"li.clickme a","delay":"2000"},{"parentSelectors":["click-list"],"type":"SelectorText","multiple":false,"id":"content","selector":"#content","regex":"","delay":""}],"_id":"click-sample"}

It returns three rows:
Four
Four
Four

When I expect it to return
One
Two
Three
Four

Am I using the element click selector incorrectly?  I'd appreciate any help.

Mārtiņš Balodis

Aug 3, 2014, 3:24:38 PM
to Andrew Brown, web-s...@googlegroups.com
The element click selector should return a list of elements to its child selectors. In this case it is returning only the single div#content, in its last state. It returns only one record because the selector compares DOM elements and keeps only the unique ones. That was useful for the scroll selector, but in this case it seems to cause problems. In the case where you got 3 results, was it the same setup? I added a bug report here:
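
To illustrate the point about unique DOM elements, here is a minimal sketch in plain JavaScript. It is not the extension's actual code: the `content` object, the `click` function and both collection strategies are hypothetical stand-ins, with a plain object playing the role of the shared #content node.

```javascript
// Hypothetical stand-in for the page: one shared #content node whose text
// changes on every click. A plain object is used instead of a real DOM node.
const content = { id: "content", text: "" };
const labels = ["One", "Two", "Three", "Four"];

// Simulates clicking the nth link, which rewrites the same node in place.
function click(n) {
  content.text = labels[n];
}

// Strategy A: collect the node after each click and keep unique nodes by identity.
function collectUniqueByIdentity() {
  const seen = new Set();
  for (let i = 0; i < labels.length; i++) {
    click(i);
    seen.add(content); // same object every time, so the Set holds one entry
  }
  // The text is read only after all clicks, so it shows the last state.
  return Array.from(seen, (node) => node.text);
}

// Strategy B: copy the node's text right after each click.
function collectSnapshots() {
  const rows = [];
  for (let i = 0; i < labels.length; i++) {
    click(i);
    rows.push(content.text);
  }
  return rows;
}

console.log(collectUniqueByIdentity()); // [ 'Four' ]
console.log(collectSnapshots()); // [ 'One', 'Two', 'Three', 'Four' ]
```

Strategy A reproduces the single "last state" record described above, while Strategy B gives the four rows the original post expected.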



Andrew Brown

Aug 6, 2014, 12:37:18 AM
to web-s...@googlegroups.com
Hey Mārtiņš, thanks for being so fast on these fixes! I've found one other issue with this feature. Let me know what you think. Often, the content I want to scrape is populated by JavaScript as in the example, but the first set of data is already there and the first element click does not change the content at all. I only want to scrape the data after the element click, just in case it was not there to start with, but I do not want to scrape it twice. I have updated the example page and scraper to show what I mean. I tested this with the latest update to the code from today.

{"selectors":[{"parentSelectors":["_root"],"type":"SelectorElementClick","multiple":true,"id":"click-list","selector":"div.sample","clickElementSelector":"li.clickme a","delay":"2000"},{"parentSelectors":["click-list"],"type":"SelectorText","multiple":false,"id":"content","selector":"#content","regex":"","delay":""}],"startUrl":"http://fiddle.jshell.net/Xb3gX/5/show/","_id":"click-sample"}

I can remove duplicate rows after the scraping, but it makes sense to me that the child selectors of the element click would go only after their parent, and thus ignore content that was already there.  What do you think?
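
For reference, removing duplicate rows after scraping could look roughly like the sketch below. The `dedupeRows` helper and the sample rows are hypothetical and only assume that the exported data is an array of flat objects, one per row.

```javascript
// Removes rows whose field values all match an earlier row, keeping the
// first occurrence of each.
function dedupeRows(rows) {
  const seen = new Set();
  return rows.filter((row) => {
    // Sort the keys so that property order does not affect the comparison.
    const key = JSON.stringify(
      Object.keys(row).sort().map((k) => [k, row[k]])
    );
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

// Hypothetical export in which the initial content was scraped twice.
const scraped = [
  { content: "One" },
  { content: "One" },
  { content: "Two" },
  { content: "Three" },
];

console.log(dedupeRows(scraped).map((row) => row.content));
```

This also drops rows that are legitimately identical, which is one reason skipping the pre-click content in the selector itself would be cleaner.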

Andrew Brown

Aug 6, 2014, 1:37:09 AM
to web-s...@googlegroups.com
I looked at the code and found where this is happening: line 110 in SelectorElementClick.js. In my opinion, it would make more sense to capture the content before the click with a separate selector (if it is wanted), then do all the clicks and capture the data after each click, but I'm not sure how realistic that is. I was able to comment out those lines to prevent it from scraping the content before a click, but I couldn't get the "before click" contents scraped by a regular text selector. Here is a sitemap that I would expect to capture the data once before the click, then once after each click:

{"selectors":[{"parentSelectors":["_root","click-list"],"type":"SelectorText","multiple":false,"id":"content","selector":"#content","regex":"","delay":""},{"parentSelectors":["_root"],"type":"SelectorElementClick","multiple":true,"id":"click-list","selector":"div.sample","clickElementSelector":"ul li.clickme a","delay":"500"}],"startUrl":"http://fiddle.jshell.net/Xb3gX/6/show/","_id":"click-sample-before-click"}

Unfortunately, it just seems to do the same thing as a sitemap where the 'content' selector has only one parent ('click-list'). What do you think about this? It makes sense to me not to capture the elements before the click, or at least to make that optional, but it's up to you.

Mārtiņš Balodis

Aug 7, 2014, 6:14:03 AM
to Andrew Brown, web-s...@googlegroups.com
In the given example, the #button1 data is already loaded and the button is also clickable. Some sites make the button that loaded the data inactive. In those cases, selecting elements before clicking is needed. The selector was built to handle this kind of site. Here is an example:

{"_id":"click-pagination","startUrl":"http://fiddle.jshell.net/martinsbalodis/43jvkh5y/show/","selectors":[{"parentSelectors":["_root"],"type":"SelectorElementClick","multiple":true,"id":"element_blocks","selector":"div.item","clickElementSelector":"a","delay":"500"},{"parentSelectors":["element_blocks"],"type":"SelectorText","multiple":false,"id":"data","selector":"_parent_","regex":"","delay":""}]}

Right now jsfiddle has removed your examples, so I cannot really look at them. But if you don't need the first button to be clicked, why can't you write the click selector so that it doesn't click the first button? For example, a selector that doesn't click the first button might look like this:
"a:not(:contains(1))" - all buttons except the one with text 1
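
One thing to keep in mind with this selector: jQuery's `:contains` matches substrings, so it also excludes any button whose text merely includes "1". The sketch below imitates that filter in plain JavaScript; the button labels are hypothetical and no real DOM or jQuery is involved.

```javascript
// Hypothetical button labels standing in for the <a> elements on the page.
const buttons = ["1", "2", "3", "10"];

// Mirrors a:not(:contains(1)): drop every label containing the substring "1".
const bySubstring = buttons.filter((text) => !text.includes("1"));

// Stricter variant: drop only the label that is exactly "1".
const byExactText = buttons.filter((text) => text !== "1");

console.log(bySubstring); // [ '2', '3' ]
console.log(byExactText); // [ '2', '3', '10' ]
```

On a page with ten or more numbered buttons, the substring form would skip "10" as well, so an exact-text or position-based filter may be safer there.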

I'll probably add a configuration option to make the initial element selection optional.

The last sitemap you gave, which should extract data both before and after clicking, is not working as you expected because the "content" selector is overriding itself. Making it multiple:true would prevent this.
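
As a sketch of that change, the snippet below rebuilds the "click-sample-before-click" sitemap from earlier in the thread with only the "content" selector's `multiple` flag flipped to true, and prints it as JSON. It has not been run against the extension, so treat it as an illustration of the suggested edit rather than a confirmed fix.

```javascript
// Sitemap from earlier in the thread. The only change is "multiple": true
// on the "content" selector, as suggested above.
const sitemap = {
  selectors: [
    {
      parentSelectors: ["_root", "click-list"],
      type: "SelectorText",
      multiple: true, // was false in the original sitemap
      id: "content",
      selector: "#content",
      regex: "",
      delay: "",
    },
    {
      parentSelectors: ["_root"],
      type: "SelectorElementClick",
      multiple: true,
      id: "click-list",
      selector: "div.sample",
      clickElementSelector: "ul li.clickme a",
      delay: "500",
    },
  ],
  startUrl: "http://fiddle.jshell.net/Xb3gX/6/show/",
  _id: "click-sample-before-click",
};

// Serialize to a single JSON line, the same shape as the sitemaps posted above.
const sitemapJson = JSON.stringify(sitemap);
console.log(sitemapJson);
```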



