
scrape html gives partial result


Thufir

Nov 13, 2009, 3:27:09 AM
Why does scrape.rb store only the first few lines of html, rather than
the entire document? The html prints in its entirety, so how do I
persist all of it? It looks like I have a misplaced } on line 21, but
moving it to line 18 didn't give better results.


thufir@ARRAKIS:~/projects/rss2mysql$
thufir@ARRAKIS:~/projects/rss2mysql$ ruby scrape.rb
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/923db4577e5ffbfb/440cc76e1d3dc4f0?show_docid=440cc76e1d3dc4f0"
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/a3bd76032df6507d/37d98a0d3efeaae4?show_docid=37d98a0d3efeaae4"
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/0c6d521382cda99d/244a1c70d6ea0878?show_docid=244a1c70d6ea0878"
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/1f6885d8416db1a6/260089ad5b9e133b?show_docid=260089ad5b9e133b"
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/f5031e66d7819c94/ee4025141f0e926c?show_docid=ee4025141f0e926c"
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/725d2b507b595cd2/cf5df2ad24b92ed8?show_docid=cf5df2ad24b92ed8"
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/a5a8597adf18bc65/11496209df7695d1?show_docid=11496209df7695d1"
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/b357b950b39c0c06/d7612a26a60056b0?show_docid=d7612a26a60056b0"
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/b357b950b39c0c06/db37b5cf1c92ffa0?show_docid=db37b5cf1c92ffa0"
"http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/4146e867f9efc1c1/64ee4b38fc3f3e78?show_docid=64ee4b38fc3f3e78"
thufir@ARRAKIS:~/projects/rss2mysql$
thufir@ARRAKIS:~/projects/rss2mysql$ nl scrape.rb
1 require 'rubygems'
2 require 'activerecord'
3 require 'yaml'
4 require 'item'
5 require 'open-uri'
6 require 'pp'


7 db = YAML::load(File.open('database.yml'))

8 ActiveRecord::Base.establish_connection(
9 :adapter => db["development"]["adapter"],
10 :host => db["development"]["host"],
11 :username => db["development"]["username"],
12 :password => db["development"]["password"],
13 :database => db["development"]["database"])


14 items = Item.find(:all)

15 items.each do |item|
16 open(item.url,
17 "User-Agent" => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.15) Gecko/2009102815 Ubuntu/9.04 (jaunty) Firefox/3.0.15"){|f|
18 item.html = f.readlines.join
19 item.save
20 pp item.url
21 }
22 end
thufir@ARRAKIS:~/projects/rss2mysql$
thufir@ARRAKIS:~/projects/rss2mysql$ mysql -u ruby -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 81
Server version: 5.0.75-0ubuntu10.2 (Ubuntu)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> select url, html from rss2mysql.items;
+-----------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| url | html |
+-----------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/923db4577e5ffbfb/440cc76e1d3dc4f0?show_docid=440cc76e1d3dc4f0 | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html >
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link REL="SHORTCUT ICON" HREF="/groups/img/3 |
| http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/a3bd76032df6507d/37d98a0d3efeaae4?show_docid=37d98a0d3efeaae4 | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html >
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link REL="SHORTCUT ICON" HREF="/groups/img/3 |
| http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/0c6d521382cda99d/244a1c70d6ea0878?show_docid=244a1c70d6ea0878 | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html >
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link REL="SHORTCUT ICON" HREF="/groups/img/3 |
| http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/1f6885d8416db1a6/260089ad5b9e133b?show_docid=260089ad5b9e133b | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html >
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link REL="SHORTCUT ICON" HREF="/groups/img/3 |
| http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/f5031e66d7819c94/ee4025141f0e926c?show_docid=ee4025141f0e926c | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html >
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link REL="SHORTCUT ICON" HREF="/groups/img/3 |
| http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/725d2b507b595cd2/cf5df2ad24b92ed8?show_docid=cf5df2ad24b92ed8 | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html >
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link REL="SHORTCUT ICON" HREF="/groups/img/3 |
| http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/a5a8597adf18bc65/11496209df7695d1?show_docid=11496209df7695d1 | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html >
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link REL="SHORTCUT ICON" HREF="/groups/img/3 |
| http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/b357b950b39c0c06/d7612a26a60056b0?show_docid=d7612a26a60056b0 | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html >
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link REL="SHORTCUT ICON" HREF="/groups/img/3 |
| http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/b357b950b39c0c06/db37b5cf1c92ffa0?show_docid=db37b5cf1c92ffa0 | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html >
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link REL="SHORTCUT ICON" HREF="/groups/img/3 |
| http://groups.google.ca/group/ruby-talk-google/browse_thread/thread/4146e867f9efc1c1/64ee4b38fc3f3e78?show_docid=64ee4b38fc3f3e78 | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html >
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link REL="SHORTCUT ICON" HREF="/groups/img/3 |
+-----------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
10 rows in set (0.00 sec)

mysql> Aborted
thufir@ARRAKIS:~/projects/rss2mysql$

thanks,

Thufir


bra...@gmail.com

Nov 13, 2009, 10:10:39 PM
On Fri, Nov 13, 2009 at 3:27 AM, Thufir <hawat....@gmail.com> wrote:
> Why does scrape.rb just result in a few lines of html, rather than an
> entire document?  The html can be printed in its entirety, but how to
> persist it?

mysql> describe rss2mysql.items;

http://dizzy.co.uk/ruby_on_rails/cheatsheets/rails-migrations#database_mapping

Thufir

Nov 13, 2009, 10:46:50 PM
On Nov 13, 7:10 pm, brab...@gmail.com wrote:

> On Fri, Nov 13, 2009 at 3:27 AM, Thufir <hawat.thu...@gmail.com> wrote:
> > Why does scrape.rb just result in a few lines of html, rather than an
> > entire document?  The html can be printed in its entirety, but how to
> > persist it?
>
> mysql> describe rss2mysql.items;
>
> http://dizzy.co.uk/ruby_on_rails/cheatsheets/rails-migrations#databas...

Ok, thanks, that problem is solved by using text instead of string.
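
For anyone hitting the same symptom, here is a minimal plain-Ruby sketch of
what was probably going on. It assumes the html column was created as a
Rails string, which maps to a 255-character VARCHAR on MySQL, and that
MySQL was running in non-strict mode, where longer values are cut off with
only a warning. The helper name and the sample page are made up for
illustration; nothing here touches a real database.

```ruby
# Default length Rails gives a :string column on MySQL (VARCHAR(255)).
STRING_LIMIT = 255

# Hypothetical helper: mimics what a length-limited column keeps of a value
# when the database truncates instead of raising an error.
def truncate_like_varchar(value, limit = STRING_LIMIT)
  value[0, limit]
end

# Made-up page, far longer than the column limit.
page = "<html>" + ("x" * 1000) + "</html>"
stored = truncate_like_varchar(page)

puts page.length    # 1013
puts stored.length  # 255

# The schema fix would be a migration step along the lines of
#   change_column :items, :html, :text
# (table and column names taken from the thread).
```

The stored rows above stop partway through the <link> tag after roughly
250 characters, which is consistent with that limit.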

In terms of design, I'm considering the pros/cons for breaking off the
html to another table. Perhaps the scraped data should be with the
html in its own table? I doubt I'd see a performance differential for
the amount of data I'll be working with, but does it matter whether a
relatively large text field, and perhaps about five string fields are
added onto the existing table, or whether there's a 1:1 relation to
another table? Also, perhaps there would be a 1:many relation between
the raw html and the scraped data, but I'm not sure about that.

Also, I don't want to accidentally re-fetch the html and end up with a
bunch of 404 error pages, so would it make sense to add a boolean
indicating whether html had been fetched? Or, just restrict fetching
(?) the html to when feeds are grabbed?


thanks,

Thufir

Marnen Laibow-Koser

Nov 13, 2009, 11:33:20 PM
Thufir wrote:
[...]

> In terms of design, I'm considering the pros/cons for breaking off the
> html to another table. Perhaps the scraped data should be with the
> html in its own table? I doubt I'd see a performance differential for
> the amount of data I'll be working with, but does it matter whether a
> relatively large text field, and perhaps about five string fields are
> added onto the existing table, or whether there's a 1:1 relation to
> another table?

It depends on the conceptual structure of the application. I doubt that
performance would be enough of an issue to worry about.

> Also, perhaps there would be a 1:many relation between
> the raw html and scraped data, but I'm not sure about that.

Depends on the data!

>
> Also, I don't want to accidentally re-fetch the html and end up with a
> bunch of 404 error pages, so would it make sense to add a boolean
> indicating whether html had been fetched? Or, just restrict fetching
> (?) the html to when feeds are grabbed?

Do you need a boolean? Just test whether the HTML field is null.
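
A rough sketch of that null test, in plain Ruby so it runs without a
database. The Item struct stands in for the ActiveRecord model from the
thread, and fetch_missing and the lambda fetcher are hypothetical names;
in the real script the fetcher would be the open-uri call.

```ruby
# Hypothetical stand-in for the ActiveRecord Item model in the thread.
Item = Struct.new(:url, :html)

# Fetch html only for items that do not have any yet.
# `fetcher` is any callable that takes a url and returns the page as a string.
def fetch_missing(items, fetcher)
  items.select { |item| item.html.nil? }.each do |item|
    item.html = fetcher.call(item.url)
  end
end

items = [
  Item.new("http://example.com/a", nil),                  # never fetched
  Item.new("http://example.com/b", "<html>cached</html>") # already stored
]

fetched = fetch_missing(items, ->(url) { "<html>fetched #{url}</html>" })
puts fetched.map(&:url).inspect  # ["http://example.com/a"]

# With the ActiveRecord of that era, the same filter could be pushed into
# the query, e.g. Item.find(:all, :conditions => { :html => nil }).
```

Already-fetched rows are never requested again, so a page that later turns
into a 404 cannot overwrite the stored copy.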

>
>
> thanks,
>
> Thufir

Best,
--
Marnen Laibow-Koser
http://www.marnen.org
mar...@marnen.org
--
Posted via http://www.ruby-forum.com/.

Thufir

Nov 15, 2009, 3:45:14 AM
On Nov 13, 8:33 pm, Marnen Laibow-Koser <mar...@marnen.org> wrote:
[...]
> > In terms of design, I'm considering the pros/cons for breaking off the
> > html to another table.
[...]

> It depends on the conceptual structure of the
> application.  I doubt that
> performance would be enough of an issue to worry about.

Yeah, figured as much. Partly as an exercise, I'll try for a base
model of "items" with a related model of "pages."
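
One way the items/pages split might look, sketched with plain structs so
it runs on its own. The attribute names and attach_page are assumptions;
with ActiveRecord the link would more likely be declared as has_one :page
on Item and belongs_to :item on Page.

```ruby
# Plain-Ruby stand-ins for the two models; the names are illustrative only.
Item = Struct.new(:url, :page)   # feed entry, small columns only
Page = Struct.new(:item, :html)  # the large raw html, one per item

# Attach a page to an item, keeping both sides of the 1:1 link in sync.
def attach_page(item, html)
  page = Page.new(item, html)
  item.page = page
  page
end

item = Item.new("http://example.com/thread/1", nil)
attach_page(item, "<html>...</html>")

puts item.page.html.length        # 16
puts item.page.item.equal?(item)  # true
```

An item without a page then simply has a nil page, which doubles as the
"not fetched yet" test.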

[...]


> > Also, I don't want to accidentally re-fetch the html
> > and end up with a
> > bunch of 404 error pages

[...]


> Do you need a boolean?  Just test whether the HTML field is null.

Ah, right.


-Thufir
