trying to scrape website


Rob Carr

Mar 17, 2015, 10:10:24 AM
to nokogi...@googlegroups.com
 

I'm currently working on my first proper project outside of Codecademy/Baserails and could use some pointers. I'm using the scraper from one of the Baserails projects as a base to work from. My aim is to get the string "Palms Trax" and store it in an array called dj, and to get the string "Solid Steel Radio Show" and store it in an array called source. My plan was to extract all the lines from the details section into a subarray and then filter them into the dj and source arrays, but if there is a better way of doing it, please tell me. I've been trying various selector combinations such as '.details none.li.div' and 'ul details none.li.div.a', but can't seem to stumble on the right one.

Also, could someone please explain to me why the code

    page.css('ol').each do |line|
      subarray = line.text.strip.split(" - ")
    end

only works if I declare subarray earlier, outside of the loop? In the Baserails project I am working from, this did not seem to be necessary.
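The behaviour described above is consistent with Ruby's block scoping: a variable first assigned inside a `do ... end` block is local to that block, while a variable that already exists outside is shared with it. The following is a minimal sketch of that rule, not the Baserails code; the sample strings and the names `lines`, `inner`, `outer` and `rows` are hypothetical stand-ins for the text of each `ol` node.

```ruby
# Hypothetical sample lines standing in for the text of each <ol> node.
lines = ["Artist A - Track A", "Artist B - Track B"]

# Case 1: `inner` is first assigned inside the block, so it is block-local
# and no longer exists once the block has finished.
lines.each do |line|
  inner = line.strip.split(" - ")
end
inner_visible = !defined?(inner).nil?

# Case 2: `outer` exists before the block, so the block assigns to that
# same variable. Each pass overwrites it, so only the last line survives.
outer = []
lines.each do |line|
  outer = line.strip.split(" - ")
end

# Keeping every row instead of only the last one: map returns a new array
# with one [artist, track] pair per line.
rows = lines.map { |line| line.strip.split(" - ") }
```

If the Baserails example only used its variable inside the block, it would not have needed the earlier declaration, though that is a guess about code not shown here.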

Here is the relevant HTML:

    
    <!-- Infos -->
    <ul class="details none">
      <li><span>Source</span><div> <a href="http://solidsteel.ninjatune.net/" target="_blank">Solid Steel Radio Show</a></div></li>
      <li><span>Date</span><div>2015.02.27</div></li>
      <li><span>Artist</span><div><a href="http://www.electronic-battle-weapons.com/mix-artist/palms-trax/" rel="tag">Palms Trax</a></div></li>
      <li><span>Genres</span><div>
        <a href="http://www.electronic-battle-weapons.com/mix-genre/deep-house/" rel="tag">Deep House</a>
        <a href="http://www.electronic-battle-weapons.com/mix-genre/experimental/" rel="tag">Experimental</a>
        <a href="http://www.electronic-battle-weapons.com/mix-genre/house/" rel="tag">House</a>
        <a href="http://www.electronic-battle-weapons.com/mix-genre/minimal/" rel="tag">Minimal</a>
        <a href="http://www.electronic-battle-weapons.com/mix-genre/techno/" rel="tag">Techno</a>
      </div></li>
      <li><span>Categories</span><div>
        <a href="http://www.electronic-battle-weapons.com/releases-category/radio-shows/" rel="tag">Radio Shows</a>
        <a href="http://www.electronic-battle-weapons.com/releases-category/solid-steel-radio-show/" rel="tag">Solid Steel Radio Show</a>
      </div></li>
      <li><span>File Size</span><div> 135 MB</div></li>
      <li><span>File Format</span><div> MP3 Stereo 44kHz 320Kbps</div></li>
    </ul>

and my code so far:

 
    require "open-uri"
    require "nokogiri"
    require "csv"

    # store url to be scraped
    url = "http://www.electronic-battle-weapons.com/mix/solid-steel-palms-trax/"

    # parse the page
    page = Nokogiri::HTML(open(url))

    # initialize empty arrays
    details = []
    dj = []
    source = []
    artist = []
    track = []
    subarray = []

    # store data in arrays
    page.css('ul details none.li.div').each do |line|
      details = line.text.strip
    end

    puts details

    page.css('ol').each do |line|
      subarray = line.text.strip.split(" - ")
    end
