thufir@ARRAKIS:~/projects/rss2mysql$
thufir@ARRAKIS:~/projects/rss2mysql$ ruby scrape.rb
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/923db4577e5ffbfb/440cc76e1d3dc4f0?show_docid=440cc76e1d3dc4f0"
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/a3bd76032df6507d/37d98a0d3efeaae4?show_docid=37d98a0d3efeaae4"
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/0c6d521382cda99d/244a1c70d6ea0878?show_docid=244a1c70d6ea0878"
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/1f6885d8416db1a6/260089ad5b9e133b?show_docid=260089ad5b9e133b"
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/f5031e66d7819c94/ee4025141f0e926c?show_docid=ee4025141f0e926c"
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/725d2b507b595cd2/cf5df2ad24b92ed8?show_docid=cf5df2ad24b92ed8"
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/a5a8597adf18bc65/11496209df7695d1?show_docid=11496209df7695d1"
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/b357b950b39c0c06/d7612a26a60056b0?show_docid=d7612a26a60056b0"
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/b357b950b39c0c06/db37b5cf1c92ffa0?show_docid=db37b5cf1c92ffa0"
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/4146e867f9efc1c1/64ee4b38fc3f3e78?show_docid=64ee4b38fc3f3e78"
thufir@ARRAKIS:~/projects/rss2mysql$
thufir@ARRAKIS:~/projects/rss2mysql$ nl scrape.rb
require 'rubygems'
require 'activerecord'
require 'yaml'
require 'item'
require 'open-uri'
require 'pp'

# Load the connection settings and connect to the development database.
db = YAML::load(File.open('database.yml'))
ActiveRecord::Base.establish_connection(
  :adapter  => db["development"]["adapter"],
  :host     => db["development"]["host"],
  :username => db["development"]["username"],
  :password => db["development"]["password"],
  :database => db["development"]["database"])

# Fetch the page behind each item's URL and store the raw HTML on the item.
items = Item.find(:all)
items.each do |item|
  open(item.url,
       "User-Agent" => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.15) Gecko/2009102815 Ubuntu/9.04 (jaunty) Firefox/3.0.15") { |f|
    item.html = f.readlines.join
    item.save
    pp item.url
  }
end
thufir@ARRAKIS:~/projects/rss2mysql$
thufir@ARRAKIS:~/projects/rss2mysql$ mysql -u ruby -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 81
Server version: 5.0.75-0ubuntu10.2 (Ubuntu)
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
mysql> select url, html from rss2mysql.items;
+-----------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| url | html |
+-----------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/923db4577e5ffbfb/440cc76e1d3dc4f0?show_docid=440cc76e1d3dc4f0 | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html >
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link REL="SHORTCUT ICON" HREF="/groups/img/3 |
| http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/a3bd76032df6507d/37d98a0d3efeaae4?show_docid=37d98a0d3efeaae4 | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html >
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link REL="SHORTCUT ICON" HREF="/groups/img/3 |
| http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/0c6d521382cda99d/244a1c70d6ea0878?show_docid=244a1c70d6ea0878 | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html >
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link REL="SHORTCUT ICON" HREF="/groups/img/3 |
| http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/1f6885d8416db1a6/260089ad5b9e133b?show_docid=260089ad5b9e133b | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html >
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link REL="SHORTCUT ICON" HREF="/groups/img/3 |
| http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/f5031e66d7819c94/ee4025141f0e926c?show_docid=ee4025141f0e926c | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html >
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link REL="SHORTCUT ICON" HREF="/groups/img/3 |
| http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/725d2b507b595cd2/cf5df2ad24b92ed8?show_docid=cf5df2ad24b92ed8 | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html >
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link REL="SHORTCUT ICON" HREF="/groups/img/3 |
| http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/a5a8597adf18bc65/11496209df7695d1?show_docid=11496209df7695d1 | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html >
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link REL="SHORTCUT ICON" HREF="/groups/img/3 |
| http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/b357b950b39c0c06/d7612a26a60056b0?show_docid=d7612a26a60056b0 | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html >
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link REL="SHORTCUT ICON" HREF="/groups/img/3 |
| http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/b357b950b39c0c06/db37b5cf1c92ffa0?show_docid=db37b5cf1c92ffa0 | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html >
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link REL="SHORTCUT ICON" HREF="/groups/img/3 |
| http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/4146e867f9efc1c1/64ee4b38fc3f3e78?show_docid=64ee4b38fc3f3e78 | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html >
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link REL="SHORTCUT ICON" HREF="/groups/img/3 |
+-----------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
10 rows in set (0.00 sec)
mysql> Aborted
thufir@ARRAKIS:~/projects/rss2mysql$
thanks,
Thufir
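For anyone trying to reproduce this: the script above reads five keys from the "development" section of database.yml, but the file itself is not shown in the post. A minimal sketch of a file with that shape follows. Only the key names come from scrape.rb; the values are placeholders (the username and database name are guessed from the mysql session above, the adapter, host and password are made up).

```ruby
require 'yaml'

# Hypothetical database.yml contents. Only the key names are taken from
# scrape.rb; every value here is a placeholder or a guess.
config_text = <<YAML
development:
  adapter: mysql
  host: localhost
  username: ruby
  password: secret
  database: rss2mysql
YAML

db = YAML.load(config_text)
settings = db["development"]

# The same lookups scrape.rb performs before calling establish_connection.
%w[adapter host username password database].each do |key|
  puts "#{key}: #{settings[key]}"
end
```

In the real script the text would come from File.open('database.yml') rather than an inline string.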
mysql> describe rss2mysql.items;
http://dizzy.co.uk/ruby_on_rails/cheatsheets/rails-migrations#database_mapping
OK, thanks. The truncated html is fixed by using a text column instead
of string.
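For reference, Rails maps a string column to VARCHAR(255) on MySQL, which would explain why only the first few lines of each page showed up in the select output. A minimal sketch of a migration that switches the column to text, written in the Rails 2.x style of the time. The class name is made up; the table and column names are taken from the query above, and the original migration is not shown in the thread, so treat this as one possible way to do it rather than what was actually run.

```ruby
require 'rubygems'
require 'active_record'

# Hypothetical migration name; only the items table and the html column
# come from the thread.
class ChangeItemsHtmlToText < ActiveRecord::Migration
  def self.up
    # text maps to a MySQL TEXT column instead of VARCHAR(255)
    change_column :items, :html, :text
  end

  def self.down
    # going back to string cannot keep html longer than the string limit
    change_column :items, :html, :string
  end
end
```

Inside a Rails project this would be picked up by rake db:migrate. Rows that were already truncated stay truncated until the html is fetched again.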
In terms of design, I'm considering the pros and cons of breaking the
html off into another table. Perhaps the scraped data should sit with
the html in its own table? I doubt I'd see a performance difference for
the amount of data I'll be working with, but does it matter whether a
relatively large text field and perhaps five string fields are added
onto the existing table, or whether there's a 1:1 relation to another
table? Also, perhaps there would be a 1:many relation between the raw
html and the scraped data, but I'm not sure about that.
Also, I don't want to accidentally re-fetch the html and end up with a
bunch of 404 error pages, so would it make sense to add a boolean
indicating whether the html has been fetched? Or just restrict fetching
(?) the html to when feeds are grabbed?
thanks,
Thufir
It depends on the conceptual structure of the application. I doubt that
performance would be enough of an issue to worry about.
> Also, perhaps there would be a 1:many relation between
> the raw html and scraped data, but I'm not sure about that.
Depends on the data!
>
> Also, I don't want to accidentally re-fetch the html and end up with a
> bunch of 404 error pages, so would it make sense to add a boolean
> indicating whether html had been fetched? Or, just restrict fetching
> (?) the html to when feeds are grabbed?
Do you need a boolean? Just test whether the HTML field is null.
>
>
> thanks,
>
> Thufir
Best,
--
Marnen Laibow-Koser
http://www.marnen.org
mar...@marnen.org
Yeah, figured as much. Partly as an exercise, I'll try for a base
model of "items" with a related model of "pages."
[...]
> > Also, I don't want to accidentally re-fetch the html
> > and end up with a
> > bunch of 404 error pages
[...]
> Do you need a boolean? Just test whether the HTML field is null.
Ah, right.
-Thufir
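The null test suggested above could look roughly like the sketch below. It uses a Struct as a stand-in for the ActiveRecord Item model so that it runs without a database, and a fake fetcher so that it runs without a network; fetch_missing_html and fetcher are made-up names. One fact worth knowing here: open-uri raises OpenURI::HTTPError on a 404 (and other error statuses), so an error page is not silently stored as html.

```ruby
require 'open-uri'
require 'stringio'

# Stand-in for the ActiveRecord Item model so the sketch runs without a
# database; with the real model, call item.save after assigning html.
Item = Struct.new(:url, :html)

# fetch_missing_html and fetcher are hypothetical names. fetcher is any
# callable that takes a URL and returns the page body as a string.
def fetch_missing_html(items, fetcher)
  items.each do |item|
    next unless item.html.nil?   # already fetched, leave it alone
    begin
      item.html = fetcher.call(item.url)
    rescue OpenURI::HTTPError => e
      # Nothing is stored, so html stays nil and can be retried later.
      warn "skipping #{item.url}: #{e.message}"
    end
  end
end

# Fake fetcher for the demo: one URL answers, the other raises a 404.
fake_fetcher = lambda do |url|
  raise OpenURI::HTTPError.new("404 Not Found", StringIO.new) if url =~ /gone/
  "<html>fresh copy of #{url}</html>"
end

items = [
  Item.new("http://example.com/a", "<html>stored earlier</html>"),
  Item.new("http://example.com/b", nil),
  Item.new("http://example.com/gone", nil)
]

fetch_missing_html(items, fake_fetcher)
items.each { |item| puts "#{item.url}: #{item.html.inspect}" }
```

In scrape.rb the fetcher would wrap the existing open(item.url, ...) call. With the real model the filter could probably be pushed into the query instead, for example Item.find(:all, :conditions => { :html => nil }) in the ActiveRecord 2.x syntax the script already uses.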