Get the whole site tree with Ruby

Роман Ярыгин

May 12, 2015, 1:42:52 AM

to rubyonra...@googlegroups.com

Hello!

I need to grab all of a site's data along with its full tree structure. Every page has links to its child pages. How do I build the site tree with Nokogiri? It has to visit pages recursively and scrape all the directory links, but I can't work out the full algorithm. How should I do that?

P.S. I don't need to "save the whole site to disk with HTTrack". The data will be processed and copied to the new, redesigned version of the original site.

Vladimir Gordeev

May 12, 2015, 6:09:51 AM

to rubyonra...@googlegroups.com

At which point did you get stuck?

Simply GET the index page, parse it with Nokogiri, select the <a> tags you are interested in, extract the URLs from their href attributes, and recursively GET those URLs.
Each page type should have its own function that performs the GET and the parsing.

If you have to fetch a large number of pages, you need to store your crawl state in a database. For example, keep a separate table of URLs to be parsed (with the URL as a unique key) and mark each row as "to be parsed" or "already parsed". Of course, you need to normalize all URLs to avoid duplicates in the table.

You could also have asked in ror2ru.
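
As a rough illustration of that table, here is an in-memory stand-in that uses a Hash instead of a database. The class name `UrlTable`, its methods, and the normalization rules (lowercase the host, drop the fragment, treat an empty path as "/") are assumptions made for this sketch; a real crawler may need different rules.

```ruby
require 'uri'

# Hypothetical normalization: lowercase host, no fragment, "/" for empty path.
def normalize_url(url)
  uri = URI.parse(url)
  uri.host = uri.host.downcase if uri.host
  uri.fragment = nil
  uri.path = '/' if uri.path.empty?
  uri.to_s
end

# In-memory stand-in for a table with the URL as a unique key.
class UrlTable
  def initialize
    @rows = {} # normalized url => :to_be_parsed or :already_parsed
  end

  # Adds a URL unless an equivalent one is already present.
  # Returns true when a new row was inserted.
  def add(url)
    key = normalize_url(url)
    return false if @rows.key?(key)
    @rows[key] = :to_be_parsed
    true
  end

  # Returns the first URL still waiting to be parsed, or nil.
  def next_pending
    @rows.key(:to_be_parsed)
  end

  def mark_parsed(url)
    @rows[normalize_url(url)] = :already_parsed
  end

  def pending_count
    @rows.count { |_, state| state == :to_be_parsed }
  end
end

table = UrlTable.new
first  = table.add('http://Example.com/about#team') # new row
second = table.add('http://example.com/about')      # same page after normalization
third  = table.add('http://example.com')            # new row, stored with path "/"
```

With a real database the same idea maps to a unique index on the normalized URL column plus a state column.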

Vladimir Gordeev

May 12, 2015, 6:20:00 AM

to rubyonra...@googlegroups.com

Some time ago I solved a similar problem (though I needed continuous grabbing) by organizing several workers: https://medium.com/@vladimir_vg/dsl-74d0fcf03cae (in Russian).
You probably don't need anything that complex, but you may get some ideas from it.

Роман Ярыгин

May 12, 2015, 7:21:27 PM

to rubyonra...@googlegroups.com

I'm stuck on exactly the recursive algorithm. I can't figure out how to write that recursive function.

Scott Ribe

May 12, 2015, 8:36:34 PM

to rubyonra...@googlegroups.com, Роман Ярыгин

On May 12, 2015, at 5:21 PM, Роман Ярыгин <330...@gmail.com> wrote:
>
> I'm stuck on exactly the recursive algorithm. I can't figure out how to build that recursive function.

It’s recursion, you call it again…

def start
  get_subtree('/')
end

def get_subtree(url)
  # fetch the page
  # parse it
  # for each link
    # normalize the link
    # if link not already visited
      # add link to table of visited links
      get_subtree(link)
    # end
  # end
end
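
One way the outline above might be filled in is sketched below. To keep it runnable without a network, the fetch-and-parse step is passed in as a callable, here named `links_for` (a hypothetical parameter, not part of the outline), and a small in-memory hash named `SITE` stands in for real pages. The function returns a nested Hash, which is one possible representation of the site tree.

```ruby
require 'set'

# Builds a nested Hash describing the site tree below url.
# links_for is any callable that takes a URL and returns the (already
# normalized) URLs that page links to; in a real crawler it would wrap an
# HTTP GET plus Nokogiri parsing. visited plays the role of the
# "table of visited links".
def get_subtree(url, links_for, visited = Set.new([url]))
  children = {}
  links_for.call(url).each do |link|
    next unless visited.add?(link) # add? returns nil if link was already seen
    children[link] = get_subtree(link, links_for, visited)
  end
  children
end

# A tiny in-memory "site" standing in for real pages (made-up data).
SITE = {
  '/'    => ['/a', '/b'],
  '/a'   => ['/', '/a/1'],
  '/a/1' => ['/a'],
  '/b'   => ['/a/1']
}.freeze

tree = { '/' => get_subtree('/', ->(u) { SITE.fetch(u, []) }) }
# tree == { '/' => { '/a' => { '/a/1' => {} }, '/b' => {} } }
```

Each page appears once, under the first parent that reached it, so links back to the root or to an already visited page do not cause infinite recursion. On very deep sites an explicit queue instead of recursion avoids exhausting the call stack.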

--
Scott Ribe
scott...@elevated-dev.com
http://www.elevated-dev.com/
https://www.linkedin.com/in/scottribe/
(303) 722-0567 voice

Роман Ярыгин

May 12, 2015, 9:44:52 PM

to rubyonra...@googlegroups.com, 330...@gmail.com

Yeah, thanks. I figured it out. Now I'm stuck on a million other problems, but that's another topic =)
