Need help extracting "tooltip" text from table


Wael Kassem

Dec 29, 2014, 10:04:24 AM
to web-s...@googlegroups.com
Hi,
I need help extracting data from a certain website.

I managed to extract the main table with the headings:
Job Title, Employer, Location, Date

However, I also want to add a fifth column that at least extracts the text from the "tooltip" that appears when you hover the mouse over the job title. Better still, if possible, I would like to extract the text from the section under "Description" that appears when you click each job title (I don't know whether this is doable).

Any help is appreciated.

Thanks in advance.


Wael Kassem

Dec 29, 2014, 10:14:56 AM
to web-s...@googlegroups.com
Hi,
This is the sitemap:
{"selectors":[{"parentSelectors":["_root"],"type":"SelectorTable","multiple":true,"id":"Table","selector":"table.ListBorder","tableHeaderRowSelector":"tr.ListHeadlineRow","tableDataRowSelector":"tr:nth-of-type(n+2)","columns":[{"header":"Job Title","name":"Job Title","extract":true},{"header":"Employer","name":"Employer","extract":true},{"header":"Location","name":"Location","extract":true},{"header":"Date","name":"Date","extract":true}],"delay":"2000"},{"parentSelectors":["Next"],"type":"SelectorLink","multiple":true,"id":"Next","selector":"td.ListBottomRow a:nth-of-type(1)","delay":"2000"},{"parentSelectors":["_root"],"type":"SelectorHTML","multiple":true,"id":"Tooltip","selector":"div.tooltip:nth-of-type(n)","regex":"","delay":""}],"startUrl":"https://www.hirelebanese.com/searchresults.aspx?order=date&keywords=&category=&type=&duration=&country=117,241,258,259,260&state=&city=&emp=&pg=1&s=-1","_id":"hirelebanese"}

However, not all tooltips are extracted, and even those that are extracted do not match the correct row/job title (as you can see in the attached CSV file).
So I don't know how to proceed.
Thanks
hirelebanese (6).csv

Diego herreruela quintana

Jan 5, 2015, 5:55:27 AM
to web-s...@googlegroups.com
Try this:

{"selectors":[{"parentSelectors":["Next"],"type":"SelectorLink","multiple":true,"id":"Next","selector":"td.ListBottomRow a:nth-of-type(1)","delay":"2000"},{"parentSelectors":["_root"],"type":"SelectorLink","multiple":true,"id":"jobtitle","selector":"table.ListBorder a:nth-of-type(1)","delay":""},{"parentSelectors":["jobtitle"],"type":"SelectorText","multiple":false,"id":"job","selector":"div.HeadlineText span","regex":"","delay":""},{"parentSelectors":["jobtitle"],"type":"SelectorText","multiple":false,"id":"jobtype","selector":"span#category","regex":"","delay":""},{"parentSelectors":["jobtitle"],"type":"SelectorText","multiple":false,"id":"location","selector":"span#location","regex":"","delay":""},{"parentSelectors":["jobtitle"],"type":"SelectorText","multiple":false,"id":"employee type","selector":"span#employee_type","regex":"","delay":""},{"parentSelectors":["jobtitle"],"type":"SelectorText","multiple":false,"id":"descrip","selector":"span#description","regex":"","delay":""},{"parentSelectors":["jobtitle"],"type":"SelectorText","multiple":false,"id":"profile","selector":"span#profile","regex":"","delay":""}],"startUrl":"https://www.hirelebanese.com/searchresults.aspx?order=date&keywords=&category=&type=&duration=&country=117,241,258,259,260&state=&city=&emp=&pg=1&s=-1","_id":"hirele"}

This is for the first web page only. If you need more, you need categories.
hirele.csv

Mārtiņš Balodis

Jan 5, 2015, 8:10:56 AM
to web-scraper
Hi,
Two selectors that both have the multiple option checked cannot be joined in pairs, because the extension cannot know how to do that in every case. For example, you could have selected the table and the navigation buttons; there would be no logical way to join them.
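
As a rough illustration of the pairing problem (not the extension's actual code), the Python sketch below uses made-up job titles and tooltips. It assumes one job has no tooltip, which is enough to shift every later pair when two independent result lists are joined by position.

```python
# Hypothetical data: four jobs on the page, but only three tooltips found.
job_titles = ["Accountant", "Sales Manager", "Nurse", "Driver"]
tooltips = ["Accountant tooltip", "Nurse tooltip", "Driver tooltip"]

# Two independent result lists can only be paired by position, which
# shifts every tooltip after the first job that has none.
paired_by_position = list(zip(job_titles, tooltips))

# Extracting both values from the same parent element keeps them together,
# and a missing tooltip stays an explicit gap instead of shifting the rest.
rows = [
    {"title": "Accountant", "tooltip": "Accountant tooltip"},
    {"title": "Sales Manager", "tooltip": None},
    {"title": "Nurse", "tooltip": "Nurse tooltip"},
    {"title": "Driver", "tooltip": "Driver tooltip"},
]
paired_by_row = [(row["title"], row["tooltip"]) for row in rows]

print(paired_by_position)
print(paired_by_row)
```

In the positional pairing, "Sales Manager" ends up with the Nurse tooltip and "Driver" is dropped entirely, which resembles the mismatch reported earlier in the thread.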

On this site the tooltip is hidden in the link's title attribute. You can use an Element selector to select the table rows and then select the tooltip within each row. Here is a sitemap that shows how that could be done. Note that the site's JavaScript removes the title attribute from the link on load, so the only way to make this sitemap work is to disable JavaScript in the browser.

{"selectors":[{"parentSelectors":["Next"],"type":"SelectorLink","multiple":true,"id":"Next","selector":"td.ListBottomRow a:nth-of-type(1)","delay":"2000"},{"parentSelectors":["_root"],"type":"SelectorElement","multiple":true,"id":"table-row","selector":"table.ListBorder tr:nth-of-type(n+4):has(a)","delay":""},{"parentSelectors":["table-row"],"type":"SelectorText","multiple":false,"id":"title","selector":"td:nth-of-type(1)","regex":"","delay":""},{"parentSelectors":["table-row"],"type":"SelectorText","multiple":false,"id":"employer","selector":"td:nth-of-type(2)","regex":"","delay":""},{"parentSelectors":["table-row"],"type":"SelectorElementAttribute","multiple":false,"id":"tooltip","selector":"a","extractAttribute":"title","delay":""}],"startUrl":"https://www.hirelebanese.com/searchresults.aspx?order=date&keywords=&category=&type=&duration=&country=117,241,258,259,260&state=&city=&emp=&pg=1&s=-1","_id":"hirelebanese"}
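
For readers who want to see the same row-scoped idea outside the extension, here is a minimal Python sketch using only the standard library's `html.parser`. The `SAMPLE_HTML` markup and the `RowParser` class are hypothetical stand-ins, not the site's real markup or anything Web Scraper provides. Because a static parser never runs the page's scripts, the `title` attribute is still present in the raw HTML, much as it is with JavaScript disabled in the browser.

```python
from html.parser import HTMLParser

# Hypothetical markup loosely modelled on the result table; the real page differs.
SAMPLE_HTML = """
<table class="ListBorder">
  <tr class="ListHeadlineRow"><td>Job Title</td><td>Employer</td></tr>
  <tr><td><a href="/job/1" title="Prepare monthly reports">Accountant</a></td><td>Acme</td></tr>
  <tr><td><a href="/job/2">Driver</a></td><td>Globex</td></tr>
</table>
"""


class RowParser(HTMLParser):
    """Collects one record per table row that contains a link."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.current = None  # record for the row currently being parsed
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.current = {"cells": [], "tooltip": None, "has_link": False}
        elif tag == "td" and self.current is not None:
            self.in_cell = True
            self.current["cells"].append("")
        elif tag == "a" and self.current is not None:
            self.current["has_link"] = True
            # The tooltip is read from the link's title attribute, if present.
            self.current["tooltip"] = dict(attrs).get("title")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tr" and self.current is not None:
            # Rows without a link (such as the header row) are skipped.
            if self.current["has_link"]:
                cells = self.current["cells"]
                self.rows.append({
                    "title": cells[0],
                    "employer": cells[1],
                    "tooltip": self.current["tooltip"],
                })
            self.current = None

    def handle_data(self, data):
        if self.in_cell and self.current is not None:
            self.current["cells"][-1] += data.strip()


parser = RowParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows
for row in rows:
    print(row)
```

Each record is built from a single row, so the title, employer, and tooltip cannot drift apart, and a link without a `title` attribute simply yields `None` for that row.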


Wael Kassem

Jan 5, 2015, 9:47:03 AM
to web-s...@googlegroups.com
Hi Diego,
Thank you very much for the sitemap. Today I managed to do something similar, but for all pages, as you mentioned. However, I used a table selector as the parent selector, while you used a Link selector. I don't know if that will affect anything, but the output looks fine to me.

{"selectors":[{"parentSelectors":["_root","Page"],"type":"SelectorElement","multiple":true,"id":"Table","selector":"table.ListBorder","delay":""},{"parentSelectors":["Table"],"type":"SelectorLink","multiple":true,"id":"Job Title","selector":"a:nth-of-type(1)","delay":""},{"parentSelectors":["Job Title"],"type":"SelectorText","multiple":false,"id":"Description","selector":"span#description","regex":"","delay":""},{"parentSelectors":["Job Title"],"type":"SelectorText","multiple":false,"id":"Company","selector":"span#company","regex":"","delay":""},{"parentSelectors":["Job Title"],"type":"SelectorText","multiple":false,"id":"Location","selector":"span#location","regex":"","delay":""},{"parentSelectors":["Job Title"],"type":"SelectorText","multiple":false,"id":"Salary","selector":"span#salary","regex":"","delay":""},{"parentSelectors":["Job Title"],"type":"SelectorText","multiple":false,"id":"Category","selector":"span#category","regex":"","delay":""},{"parentSelectors":["Job Title"],"type":"SelectorText","multiple":false,"id":"Date posted","selector":"span#date","regex":"","delay":""},{"parentSelectors":["Job Title"],"type":"SelectorText","multiple":false,"id":"Employee Type","selector":"span#employee_type","regex":"","delay":""},{"parentSelectors":["Job Title"],"type":"SelectorText","multiple":false,"id":"Gender","selector":"span#lblSex","regex":"","delay":""},{"parentSelectors":["_root","Page"],"type":"SelectorLink","multiple":false,"id":"Page","selector":"td.ListBottomRow a:nth-of-type(3)","delay":""}],"startUrl":"https://www.hirelebanese.com/searchresults.aspx?order=date&keywords=&category=&type=&duration=&country=117,241,258,259,260&state=&city=&emp=&pg=2","_id":"hirelebanesev2"}

Mārtiņš, thank you for your tip on disabling JavaScript! Over the past few days I was playing with Web Scraper on various websites, trying to see why I can sometimes extract the tooltips and sometimes not, and it didn't occur to me at all to disable JavaScript. It's working like a charm now!
And again, thanks for this tool; it works very well :)