1) <b>Some Institute name</b><Br><br> 2) some address<Br> city, st zip<br> 3) 4) United States <Br> 5) 6) Phone: 7) 8) (123) 456-7890<Br> 9) 10 <br> 11) Web address: <a href="Http://www.xyz.org" target="_Blank">www.xyz.org</a><Br>
<br><br>
<A href="javascript:history.back();">Back to Search Results</ a><br><br>
I want to scrape and collect the data in lines 1-11, i.e., name, address, city, st, zip, United States, and phone number; from line 11 I want the website URL: 'http://www.xyz.org'
I can find the beginning of this section of code by doing this:
doc.css('h2').each do |elem|
  puts elem.content
end
which displays 'Association Detail'
I am having problems using this as the starting point to parse the data in lines 1-11 which contain the specific 'Association Detail' details. I've tried it with 'xpath' and 'search' according to the example here: http://rdoc.info/projects/tenderlove/nokogiri
but there's something I'm just not getting right when I try to use other elements to get the info from.
My system is Windows XP, Ruby 1.8.6, Nokogiri 1.4.0
> Thanks in advance for any help.
Searching by CSS usually involves things like matching tags based on their 'class' attribute or 'id' attribute. Because the <h2> tag doesn't have any attributes, your selector is simply matching by tag name, so you could do this instead:
doc.xpath('//h2').each do |h2|
  puts h2.content
end
That uses xpath notation to find all h2 tags on the page. Then you might write something like this:
require 'nokogiri'

doc = Nokogiri::HTML.parse(html)

doc.xpath('//h2').each do |h2|
  if h2.content == "Association Detail"
    puts "---"
    puts h2.next.content
    puts "---"
  end
end
Knowing you can do that will enable you to write something like this:
results = []

doc.xpath('//h2').each do |h2|
  if h2.content == "Association Detail"
    curr_elmt = h2

    # Walk the siblings after the <h2> until the "Web address:" text shows up.
    while (curr_elmt = curr_elmt.next)
      curr_content = curr_elmt.content
      results << curr_content
      break if curr_content.include?("Web address:")
    end
  end
end

results.each do |result|
  puts "--start--"
  puts result
  puts "--end--"
  puts
end
output=
--start-- DETAIL DIRECTORY RESULTS --end--
--start-- Some Institute name --end--
--start--
--end--
--start--
--end--
--start--
some address city, st zip
United States
Phone:
(123) 456-7890 ) Web address: www.xyz.orgBack to Search Results a>Search Again --end--
As you can see, the html is pretty bad, so your results aren't that great. You will have to figure out how to extract the data you need from those strings.
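As a rough sketch of that clean-up step, something like the following might be a starting point. The `raw` array is a hypothetical stand-in for the scraped strings (it is not the real output), and the phone pattern assumes the "(ddd) ddd-dddd" layout shown in the sample:

```ruby
# Hypothetical raw strings, shaped roughly like the scraped results above.
raw = [
  "Some Institute name",
  "",
  "some address\n city, st zip\n",
  "United States",
  "Phone:",
  "(123) 456-7890"
]

# Collapse runs of whitespace into single spaces and drop blank entries.
cleaned = raw.map { |s| s.gsub(/\s+/, ' ').strip }.reject { |s| s.empty? }

# Pick out the entry that looks like a "(ddd) ddd-dddd" phone number.
phone = cleaned.find { |s| s =~ /\A\(\d{3}\) \d{3}-\d{4}\z/ }

puts cleaned.inspect
puts phone
```

The real page will likely need extra rules (for example, splitting "city, st zip" into its parts), so treat this only as a pattern to adapt.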
7stud's approach works, but Mark's doesn't (currently). Here's the file I created which will get me all the raw data I want (still have to process to get to final form).
  doc.xpath('//h2').each do |h2|
    if h2.content == "Association Detail"
      curr_elmt = h2
      while (curr_elmt = curr_elmt.next)
        # Remove newlines and tabs, collapse repeated spaces, trim the ends.
        curr_content = curr_elmt.content.gsub(/\n|\t|\r/, '').squeeze(' ').strip
        results << curr_content unless curr_content.empty?
        break if curr_content.include?("Back to Search Results")
      end
    end
  end

  results.each do |result|
    # Do while result is not a blank string
    puts "--start--"
    puts result
    puts "--end--"
  end

  return results
end
---------------------------------------
So I just 'require' this file, and can then do:
> info = scrape 1234
where 'info' is the array 'results'. I can then process that to my heart's delight.
Thanks 7stud for your help. I would, however, like to know if Mark's way can be made to work too.
Not quite. following-sibling:: is an axis, and it needs to be followed by a node test. Therefore following-sibling::text() is the set of all sibling text nodes after the div. After that, it's just a matter of indexing.
> doesn't seem applicable. And in fact, when I run your code, it doesn't
> work:
As I just posted in another message, it works for me. I wonder what's different about my environment. Are you using Nokogiri 1.4.0?
$ nokogiri -v
HI. You're using libxml2 version 2.6.16 which is over 4 years old and has plenty of bugs. We suggest that for maximum HTML/XML parsing pleasure, you upgrade your version of libxml2 and re-install nokogiri. If you like using libxml2 version 2.6.16, but don't like this warning, please define the constant I_KNOW_I_AM_USING_AN_OLD_AND_BUGGY_VERSION_OF_LIBXML2 before requring nokogiri.
/usr/local/lib/ruby/gems/1.8/gems/nokogiri-1.4.0/lib/nokogiri/xml/builder.rb:272: warning: parenthesize argument(s) for future version
---
nokogiri: 1.4.0
warnings: []
So it could be something with that, or maybe it has something to do with the fact that ruby 1.8.7 back ports some stuff from ruby 1.9.
--
Posted via http://www.ruby-forum.com/.
On Nov 9, 10:51 pm, 7stud -- <bbxx789_0...@yahoo.com> wrote:
> Mark Thomas wrote:
> > As I just posted in another message, it works for me. I wonder what's
> > different about my environment. Are you using Nokogiri 1.4.0?
> Yes, however I get a warning message that informs me that I'm using an
> outdated version of libxml2:
> $ ruby -v
> ruby 1.8.6 (2007-03-13 patchlevel 0) [i686-darwin8.11.1]
> $ nokogiri -v
> HI. You're using libxml2 version 2.6.16 which is over 4 years old and has
> plenty of bugs. We suggest that for maximum HTML/XML parsing pleasure, you
> upgrade your version of libxml2 and re-install nokogiri. If you like using
> libxml2 version 2.6.16, but don't like this warning, please define the
> constant I_KNOW_I_AM_USING_AN_OLD_AND_BUGGY_VERSION_OF_LIBXML2 before
> requring nokogiri.
> /usr/local/lib/ruby/gems/1.8/gems/nokogiri-1.4.0/lib/nokogiri/xml/builder.rb:272:
> warning: parenthesize argument(s) for future version
> ---
> nokogiri: 1.4.0
> warnings: []
> libxml:
>   compiled: 2.6.16
>   loaded: 2.6.16
>   binding: extension
> So it could be something with that, or maybe it has something to do with
> the fact that ruby 1.8.7 back ports some stuff from ruby 1.9.
OK, when I put Mark's code in a file and ran it (versus entering it in an irb session) it DOES work. However, it doesn't capture the website url, which 7stud's approach does. I haven't figured out how to do it with this approach, and merely adding more items in xpaths doesn't work.
So Mark, how can your approach be used to capture the url at the end of the data section?
You'll need to modify that last line. Unlike the other items, the URL is not in a text node; it is the href attribute of the first <a> element. So try:
:url => "#{prefix}a[1]/@href"
Can you install a newer version of libxml2? As you can see from http://xmlsoft.org/news.html, your version dates back to 2004 with tons of bug fixes (including XPath fixes) since.
I've looked into installing newer versions of libxml2 and libxslt, but it looks complicated and fraught with danger on Mac OS X.
Yes, this allows me to capture the url I want (and sometimes ones I don't want), and I'm able to post-process xpaths to get everything I need.
> On Nov 10, 10:29 pm, Mark Thomas <m...@thomaszone.com> wrote:
> > > Here's the file I used with Mark's approach:
> > You'll need to modify that last line. Unlike the other items, the URL
> > is not in a text node, it is the href attribute of the first <a>
> > element. So try:
> > :url => "#{prefix}a[1]/@href"
> Yes, this allows me to capture the url I want (and sometimes ones I
> don't want), and I'm able to post-process xpaths to get everything I
> need.
> Now, I just need to understand completely WHY/HOW it works. :-)
Let's take the first one as an example. I noticed that everything was after a div with the class "sectionHeaderText", so I started with that:
//div[@class="sectionHeaderText"]
The double slash is a wildcard that means the div can be anywhere. The part in brackets is called a predicate, and it constrains the expression. I like to think of it as a "such that" clause. So you can read the above as "a div such that the class is 'sectionHeaderText'." (Actually, it's the set of all divs for which it is true, so if you had multiple divs with the same class, it would return them all)
Then I noticed that the items you wanted were not children of the div. The div closes before you get to the text you want. Even <br> tags are treated as <br/>, which is self-closing. Therefore almost everything you want is at the same nesting depth, or in XPath terminology, they are siblings. "following-sibling" is an XPath "axis" (see the W3Schools XPath tutorial for details on these things). The name, though, was inside a <b> element, so I used this XPath expression to get the following sibling that happens to be a <b> element: