Scraping data-* attributes

Andrew Brown

Jul 31, 2014, 11:02:49 PM
to web-s...@googlegroups.com
I'm still a bit stuck on getting data-* attributes to scrape properly. If I have a list like this:

<ul>
  <li data-type="man">
    <div class="name">Andrew</div>
  </li>
  <li data-type="woman">
    <div class="name">Mary</div>
  </li>
</ul>

Should I scrape this with one element selector under root that selects the li and grabs the attribute (with no further path, just an attribute name), and then get the name using a text selector for div.name? Or should I have two root-level selectors, one for the attribute and one for the name?

I can't get any data-* attributes to scrape, no matter what I try. Here's an example of what I tried (from a w3schools sample page):

{"startUrl":"http://www.w3schools.com/tags/tryhtml5_global_data.htm","selectors":[{"parentSelectors":["_root"],"type":"SelectorElementAttribute","multiple":true,"id":"animal type","selector":"ul li","extractAttribute":"data-animal-type"},{"parentSelectors":["_root"],"type":"SelectorText","multiple":true,"id":"animal","selector":"ul li","regex":""}],"_id":"data-test"}

I would like it to show:

animal type    animal
bird           Salmon
fish           Owl
spider         Tarantula

but instead I just get:
animal type    animal
               Salmon
               Owl
               Tarantula

Any tips for how to use attribute selection properly to get data- attributes off of list items?

Mārtiņš Balodis

Aug 1, 2014, 2:58:19 PM
to web-s...@googlegroups.com
Hi,
It seems that the attribute selector doesn't work properly on data-* attributes. I added a bug report here:

To extract both the type and the name, create an Element selector for the li element and give it two child selectors, one for the name and one for the attribute. In this case the child selectors must also select the li element itself, which is possible in the development version.

I fixed the data-* bug and updated the development version, in which this should now work. You can read how to install it here:
https://groups.google.com/forum/#!topic/web-scraper/I_RySlqHrvQ

This is the sitemap for the development version:

{"_id":"data-test","startUrl":"http://www.w3schools.com/tags/tryhtml5_global_data.htm","selectors":[{"parentSelectors":["_root"],"type":"SelectorElement","multiple":true,"id":"li-elem","selector":"li","delay":""},{"parentSelectors":["li-elem"],"type":"SelectorElementAttribute","multiple":false,"id":"type","selector":"_parent_","extractAttribute":"data-animal-type","delay":""},{"parentSelectors":["li-elem"],"type":"SelectorText","multiple":false,"id":"animal","selector":"_parent_","regex":"","delay":""}]}

Andrew Brown

Aug 2, 2014, 1:34:21 AM
to web-s...@googlegroups.com
Thank you very much!  This seems to have resolved this issue.  I'm impressed how quickly you responded!  Thanks again!