
[ask]How to remove HTML part of a text


Darmanto Lie

Dec 25, 2009, 8:57:58 AM

I'm using Hpricot to parse HTML, and my output is:

Größe: 37 - 54<br><a href="#" onclick="javascript:mass =
window.open('/nutzwert/masstabellen.html','Masstabelle','status=no,toolbar=no,scrollbars=yes,location=no,menu=no,width=800,height=600')">Ma&szlig;tabelle</a>

Does anyone know how to get the string "Größe: 37 - 54" and remove the
rest of that string?

thanks
--
Posted via http://www.ruby-forum.com/.

Darmanto Lie

Dec 25, 2009, 9:41:11 AM

Sorry for double posting; it seems there is no edit-post feature...

I have a problem with HTML parsing. I'll try to explain it as clearly
as I can, and I hope someone can help me with this.

I've been given the task of fetching specific data from an HTML page. I'm
planning to use the Hpricot plugin to do this.
It's an online shop page, and I have to fetch the clothing size information.

The product information part of the page can be in either of these 2
formats:

<table>
<tr>
... Some informations ...
</tr>
<tr>
<td>Available in:</td>
</tr>
<tr>
<td>... (The data I want to fetch) ...</td>
</tr>
</table>

OR

<table>
<tr>
... Some informations ...
</tr>
<tr>
<td>... Content ...</td>
<td>Available in:</td>
</tr>
<tr>
<td>... Content ...</td>
<td>... (The data I want to fetch) ...</td>
</tr>
</table>

The clue is: the row whose data I want to fetch is always preceded by
a row containing the string "Available in".
And I want to fetch NOT the content of the whole row, BUT the content of
the last cell (<td>) inside that row.

It's complex, and I have no idea what to do here. Can someone help me
with this?
Thanks for your help.

PS: The table snippet I posted above may be contained inside another
table.
Apparently, the online shop uses tables for page layout.
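
The rule described above (find the row with an "Available in" cell, then take the last cell of the row after it) could be sketched roughly as follows. This is only a sketch under assumptions: it uses Ruby's bundled REXML rather than Hpricot, so it assumes the fragment is well-formed enough to parse as XML, and the helper name cell_after_label and the sample markup are made up for illustration. Real shop pages would probably need a more tolerant HTML parser.

```ruby
require 'rexml/document'

# Hypothetical helper (not from the thread): returns the text of the last
# <td> in the row that directly follows a row with a cell containing `label`.
# Assumes `markup` is well-formed enough for REXML to parse.
def cell_after_label(markup, label = 'Available in')
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, '//tr').each do |row|
    cells = row.elements.to_a('td')
    # Check only each cell's own leading text, so an outer layout row that
    # merely wraps a nested table is not mistaken for the label row.
    next unless cells.any? { |td| td.text.to_s.include?(label) }

    next_row = row.next_element
    next unless next_row && next_row.name == 'tr'

    last_cell = next_row.elements.to_a('td').last
    return last_cell.text.to_s.strip if last_cell
  end
  nil # no label row found
end

# Made-up sample in the second layout from the post.
sample = <<~HTML
  <table>
    <tr><td>Some information</td></tr>
    <tr><td>Content</td><td>Available in:</td></tr>
    <tr><td>Content</td><td>37 - 54</td></tr>
  </table>
HTML

puts cell_after_label(sample) # prints "37 - 54"
```

The same method handles the first layout as well, because it always takes the last cell of the following row, however many cells that row has.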

Darmanto Lie

Dec 25, 2009, 10:13:41 AM

Sorry for the triple post... problem solved: I used Nokogiri instead of
Hpricot :)

W. James

Dec 25, 2009, 10:40:50 PM

Darmanto Lie wrote:

> i'm using Hpricot to parsing html and my output is:
>
> Größe: 37 - 54<br><a href="#" onclick="javascript:mass =
> window.open('/nutzwert/masstabellen.html','Masstabelle','status=no,toolbar=no,scrollbars=yes,location=no,menu=no,width=800,height=600')">Ma&szlig;tabelle</a>
>
> does anyone knows how to get the string "Größe: 37 - 54" and remove
> the rest of that string ?
>
> thanks

"Größe: 37 - 54<br>"[ /^(.*?)</, 1 ]
==>"Größe: 37 - 54"
"Größe: 37 - 54<br>"[ /^(.*?)(?=<)/]
==>"Größe: 37 - 54"
"Größe: 37 - 54<br>"[ /^[^<]*/]
==>"Größe: 37 - 54"
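
The three patterns behave differently when the string contains no tag at all: the first two return nil, while the third returns the whole string. A small wrapper along these lines might make that explicit. The method name text_before_first_tag is hypothetical, and the sample string is shortened from the original post.

```ruby
# Hypothetical helper: keep only the text that precedes the first "<".
# [^<]* can match the empty string, so String#[] never returns nil here.
def text_before_first_tag(fragment)
  fragment[/\A[^<]*/].strip
end

puts text_before_first_tag('Größe: 37 - 54<br><a href="#">Maßtabelle</a>')
# prints "Größe: 37 - 54"

puts text_before_first_tag('no markup here')
# prints "no markup here"

# For comparison, the capture-group form needs a "<" to match at all:
p 'no markup here'[/^(.*?)</, 1]
# prints nil
```

Using \A instead of ^ anchors at the start of the string rather than at the start of any line, which only matters if the fragment spans several lines.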
