How to extract href by anchor regex?

sprite
Aug 31, 2010, 7:38:28 PM
to nokogiri-talk
I have a link element that is structured like this:
<a href="whatever"><b>Anchor Text</b></a>
I need to find the link by matching its anchor text against a regex
and then grab the href. What's the easiest way to do this?

Mike Dalessio
Sep 1, 2010, 4:11:52 PM
to nokogi...@googlegroups.com
Hello!

Great question. One of the lesser-known (and lesser-documented, :-\) features of Nokogiri is the ability to define custom XPath functions in Ruby.

This functionality is mentioned here:


but the docs certainly could be much more verbose. Here's an example of how to do what you're talking about:

    #! /usr/bin/env ruby
    
    require "rubygems"
    require "nokogiri"
    
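    # Handler object: its public methods become callable from XPath expressions.
    class RegexHelper
      # Returns true if any node in node_set has text content matching the regex.
      def content_matches_regex(node_set, regex_string)
        node_set.any? { |node| node.content =~ /#{regex_string}/ }
      end
    end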
    
    html = Nokogiri::HTML <<-HTML
    <html><body>
      <div>
        <a href="whatever"><b>Anchor Text</b></a>
        <a href="fuzzy">no match here</a>
    HTML
    
    # step 1: find nodes whose content matches a regular expression
    node_set = html.xpath("//a[content_matches_regex(., '.*chor.*')]", RegexHelper.new)
    puts node_set.map(&:to_s) # => '<a href="whatever"><b>Anchor Text</b></a>'
    
    # step 2: find href
    puts node_set.map { |node| node["href"] }.inspect # => ["whatever"]
    
    # step 3: do both at the same time
    puts html.xpath("//a[content_matches_regex(., '.*chor.*')]/@href", RegexHelper.new).map(&:to_s).inspect # => ["whatever"]
    
I'll add a chore to the GitHub issues backlog to better document this functionality.
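
To see what the handler method itself does, independent of XPath, here is a minimal sketch that exercises the same `content_matches_regex` logic against stand-in objects. `FakeNode` is a hypothetical Struct invented for this illustration (it is not part of Nokogiri); the only assumption is that each node responds to `content`, as in the example above.

```ruby
# FakeNode is a hypothetical stand-in for a Nokogiri node, used only for
# illustration. The handler just needs objects that respond to #content.
FakeNode = Struct.new(:content, :href)

class RegexHelper
  # Returns true if any node's text content matches the pattern,
  # which is passed in as a plain string (as it would be from XPath).
  def content_matches_regex(node_set, regex_string)
    pattern = Regexp.new(regex_string)
    node_set.any? { |node| node.content =~ pattern }
  end
end

helper = RegexHelper.new
links = [
  FakeNode.new("Anchor Text", "whatever"),
  FakeNode.new("no match here", "fuzzy"),
]

# Mimic the XPath predicate: test each link on its own, wrapped in a
# one-element collection, and keep the href of those that match.
matching = links.select { |link| helper.content_matches_regex([link], "chor") }
matching_hrefs = matching.map(&:href)

p matching_hrefs
```

Because `=~` matches anywhere in the string, the pattern `chor` selects the same link as `.*chor.*`; anchors such as `^` or `$` would be needed to constrain where the match occurs.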

Christian Gregertsen
Sep 1, 2010, 5:42:28 PM
to nokogi...@googlegroups.com
Thanks
