xpath queries repeatedly returning first element when looping through divs

Scott Larsen

Sep 16, 2020, 2:18:16 AM
to Selenium Users

Anyone run into anomalies when using Selenium to scrape using xpath queries vs css_selectors?  I'm scraping job listings from ZipRecruiter which are each contained in a div with a class of "job_content" so the following breaks each job into a separate web object.

jobListings = driver.find_elements_by_xpath("//div[@class='job_content']")

The weirdness starts when I loop through the web objects and run queries on the individual listings. CSS selector queries properly return the respective data for each listing, but xpath queries repeatedly return the very first match on the overall page and not the first match within the current web object. Am I misunderstanding xpath or the structure of the xpath query, or am I missing something else?

These queries...
print(jobListings[0].find_element_by_xpath("//a[@class='t_org_link name']").get_attribute('innerHTML'))
print(jobListings[1].find_element_by_xpath("//a[@class='t_org_link name']").get_attribute('innerHTML'))
print(jobListings[2].find_element_by_xpath("//a[@class='t_org_link name']").get_attribute('innerHTML'))
print(jobListings[0].find_element_by_css_selector('a.job_link').get_attribute('href'))
print(jobListings[1].find_element_by_css_selector('a.job_link').get_attribute('href'))
print(jobListings[2].find_element_by_css_selector('a.job_link').get_attribute('href'))
print(jobListings[0].find_element_by_xpath("//a[@class='t_location_link location']").get_attribute('innerHTML'))
print(jobListings[1].find_element_by_xpath("//a[@class='t_location_link location']").get_attribute('innerHTML'))
print(jobListings[2].find_element_by_xpath("//a[@class='t_location_link location']").get_attribute('innerHTML'))

... return this....

ENGEO
ENGEO
ENGEO
https://www.ziprecruiter.com/k/l/AALfG3swCYkX8J-sDTYYCtuv7G_kvytS8WZYwzIe4evgqopOOB-oKTp3aBlarJl0rpP4XHemwERiGwRqWiIODqLl4l_yOvbvEozpvMIQgy6c_RicyJaKkZCitVDBmbER-Szg172lrxnr6QdeKScpRksccMlsu3oiMpLQp1zHL0IYuybdEwI17robRto9Ew
https://www.ziprecruiter.com/ek/l/AALAFf9nBGAp-D2952WqAglyMF4mtfbHX4FUTEeUWPczJG9JLjoght4St3fYy1u0nLvGbFQkBGs-KAJATrrqr5frBHehi5T4mECAoq39TvI5VnvBYABD88YFfEEswOUt2KrOBgXM_MwW8zjg-WOUdLLvMd18hzgBAT9rXGodoPQxy2DdjSLTOJ-L_r277g
https://www.ziprecruiter.com/ek/l/AAJL10qG4ZVYj_NkDo7ZoskE4CthCbZ1iiPK6ZEAdPPnwHeVWETh9_y8qq1YJdJhAUnXCysUahEbus-I9CdowkwJdSkw_7BTlwPuzMaW7xtYgeYpqZBGy1drIA_Q1PVuKQAtMctUVtoJ2DdBbHPgZmLx3nWRh3BSjgeUfLygKgYhGjOCDFEvJHSuxfn1jg
San Ramon, CA
San Ramon, CA
San Ramon, CA

joseph...@gmail.com

Sep 16, 2020, 12:21:08 PM
Have a look at this bit of xpath documentation, specifically the part about the dot, period, whatever you want to call the thing at the end of this sentence.


The previous find that you used to get your list is being ignored when you start your next find with "//", as the system sees that as a completely new xpath and starts searching from the top of the DOM.  Try making your second set of finds start with ".//" so the dot tells it to start where it currently is in the document.  Thus, some of your examples would become the following:

print(jobListings[0].find_element_by_xpath(".//a[@class='t_org_link name']").get_attribute('innerHTML'))
print(jobListings[1].find_element_by_xpath(".//a[@class='t_org_link name']").get_attribute('innerHTML'))
print(jobListings[2].find_element_by_xpath(".//a[@class='t_org_link name']").get_attribute('innerHTML'))

I have had success with this method in the past.
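
The same fix can be written as a loop over every listing. This is only a sketch: `to_relative` and `scrape_listings` are hypothetical helper names, and it assumes the newer `find_element(By.XPATH, ...)` call style instead of the `find_element_by_xpath` calls used above. The class names and selectors are the ones from the original post.

```python
def to_relative(xpath: str) -> str:
    """Turn a leading '//' into './/' so the search starts at the context element."""
    return "." + xpath if xpath.startswith("//") else xpath


def scrape_listings(driver):
    """Return (company, location, link) tuples, one per job listing on the current page."""
    # Imported here so to_relative() can be used without Selenium installed.
    from selenium.webdriver.common.by import By

    rows = []
    # This first query is meant to search the whole page, so '//' is correct here.
    for listing in driver.find_elements(By.XPATH, "//div[@class='job_content']"):
        # These queries must stay inside the current listing, so they start with './/'.
        company = listing.find_element(
            By.XPATH, to_relative("//a[@class='t_org_link name']"))
        location = listing.find_element(
            By.XPATH, to_relative("//a[@class='t_location_link location']"))
        # CSS selectors are always evaluated relative to the element they are called on.
        link = listing.find_element(By.CSS_SELECTOR, "a.job_link")
        rows.append((
            company.get_attribute("innerHTML"),
            location.get_attribute("innerHTML"),
            link.get_attribute("href"),
        ))
    return rows
```

Calling `scrape_listings(driver)` after the results page has loaded should give one tuple per listing instead of the first listing's values repeated.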

Good luck!

Scott Larsen

Sep 16, 2020, 2:27:59 PM
Thank you so much Joseph!  I'm a bit confused because I would think that when I loop through each div it's only feeding in the code from that div, but xpath/Selenium is obviously referencing the whole page.  Regardless, adding the period works.

Thanks again,
Scott

joseph...@gmail.com

Sep 17, 2020, 5:05:57 PM
This is a common misconception about how Selenium works with xpath, and it confuses a lot of people.  This is how I understand it.

When Selenium uses an xpath search, it DOES hand the current node, the DOM, and the xpath query to the xpath system.  It does NOT modify the given xpath in any way to force it to use the current node, though.  It just lets xpath do what xpath does without any interference.

The documentation I linked you to can also be a little confusing at first when looking at the first bit of documentation about //, which says:

-- //
-- Selects nodes in the document from the current node that match the selection no matter where they are

The part that is often unclear is that, if you don't put anything before the // in the query, xpath treats it as an absolute path and starts at the top of the document, no matter what the current node is.  This can be seen in other examples from the documentation, for example:

-- //book

-- Selects all book elements no matter where they are in the document

-- //*

-- Selects all elements in the document

Those examples both start at the top of the document because there is no xpath stuff in front of the //, so the top of the document default applies.
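
Here is a small sketch of the difference that runs without a browser, using Python's built-in `xml.etree.ElementTree`. The two-listing page and the "Acme" company are made up for illustration. ElementTree only supports a subset of xpath and refuses a leading `//` on an element, so the search from the document root stands in for what `//` does in Selenium.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the page: two listings, each with its own company link.
PAGE = """
<html><body>
  <div class="job_content"><a class="t_org_link name">ENGEO</a></div>
  <div class="job_content"><a class="t_org_link name">Acme</a></div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)
listings = root.findall(".//div[@class='job_content']")

# Searching from the top of the document (the effect of a leading '//'):
# the first link on the page comes back no matter which listing you are on.
from_root = [root.find(".//a[@class='t_org_link name']").text for _ in listings]

# Searching from the current listing (the effect of a leading './/'):
# each listing returns its own link.
from_listing = [div.find(".//a[@class='t_org_link name']").text for div in listings]

print(from_root)     # ['ENGEO', 'ENGEO']
print(from_listing)  # ['ENGEO', 'Acme']
```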

Selenium allows the use of xpath, but xpath is a separate thing from Selenium.  To use xpath well, you have to learn how xpath itself works, and try not to assume that Selenium is influencing xpath in any way.  Every Selenium find using xpath is a standalone query, but you can have the xpath query use the dot to start at the current node found from a previous find.  It took me a while to learn this too.  :)

Joe