
HTML tag Parsing and extracting data.


majid...@gmail.com

Oct 29, 2007, 8:10:35 AM
Hi,

I am new to Tcl. Let me explain what I am trying to do; so far I have
tried and failed.

I would like to parse an HTML page such as

http://www.cmcelectronics.ca/En/Careers/job_display_en.php?JOB_ID=511
or
http://www.sanjel.com/careers/jobDesc.cfm?numJobBoardID=440

and extract the text under headings such as "Duties & Responsibilities:",
"Description:", "Summary" or "Responsibilities".
I am looking for code that is generic enough that, if I pass it
"descriptions", "description", or any other keyword like the ones
above, it finds that heading and extracts the information related to
it.

I hope this makes clear what I want to do. Please show me some example
code. If I am posting to the wrong group, please let me know which
group I should post to in order to get a practical example.

Thanks.
Best Regards,
Majid.

Gerald W. Lester

Oct 29, 2007, 9:23:43 AM

You are going to have to write something specific (one or more procedures),
since each of the pages you gave as an example has a different format. You
may want to look at the htmlparse package in Tcllib for a tool to help.
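
A rough sketch of how that might look with htmlparse, assuming Tcllib is installed and that section headings on the page are marked up with <b>, <strong> or <h1>..<h6>. The scrape namespace and its procs are made up for illustration; only ::htmlparse::parse and ::htmlparse::mapEscapes come from the package.

```tcl
package require htmlparse

# Illustrative state, not part of htmlparse.
namespace eval scrape {
    variable keyword ""    ;# lower-cased heading text to look for
    variable capturing 0   ;# 1 while inside the wanted section
    variable result ""     ;# accumulated section text
}

# Callback for ::htmlparse::parse. It is called once per tag with the tag
# name, "/" for a closing tag, the tag's attributes, and the text that
# follows the tag.
proc scrape::onTag {tag slash param text} {
    variable keyword
    variable capturing
    variable result
    set tag [string tolower $tag]
    if {[regexp {^(b|strong|h[1-6])$} $tag] && $slash eq ""} {
        # A new heading starts: capture only if it mentions the keyword.
        set capturing [expr {[string first $keyword [string tolower $text]] >= 0}]
    } elseif {$capturing} {
        append result $text " "
    }
}

# Return the text found under the first heading that mentions $kw.
proc scrape::section {html kw} {
    variable keyword [string tolower $kw]
    variable capturing 0
    variable result ""
    ::htmlparse::parse -cmd ::scrape::onTag $html
    # Decode entities such as &amp; and collapse runs of whitespace.
    regsub -all {\s+} [::htmlparse::mapEscapes $result] { } result
    return [string trim $result]
}
```

Calling scrape::section $html "description" would then return whatever text sits between the matching heading and the next one. Pages that mark headings some other way (table cells, CSS classes) would need a different test in the callback.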


--
+--------------------------------+---------------------------------------+
| Gerald W. Lester |
|"The man who fights for his ideals is the man who is alive." - Cervantes|
+------------------------------------------------------------------------+

sch...@uni-oldenburg.de

Oct 29, 2007, 9:45:03 AM

majidkha...@gmail.com wrote:
> Hi,
>
> I am new in TCL . Let me tell you what I am want to do which so far I
> am trying to but failed.
>
> I would like to parse an HTML page say it is .
>
> http://www.cmcelectronics.ca/En/Careers/job_display_en.php?JOB_ID=511
> or
> http://www.sanjel.com/careers/jobDesc.cfm?numJobBoardID=440
>
> and I want to extract the data/info of "Duties & Responsibilities:" ,
> "Description:" , "Summary" or "Responsibilities" and etc etc..
> So I am looking for the code which should be generic enough in a sense
> that if we pass "descriptions" or "description" or any keyword whic I
> mentioned above or could be any then it looks and extract the
> information related to that keyword or heading..

Basically, 'generic' is hard here because real-world HTML has turned
into tag soup and is rarely annotated in any consistent way (and it gets
even harder if JavaScript is involved).

A simple sledgehammer would be regexp; a little more sophisticated would
be something like Tcllib's htmlparse, or tDOM in HTML mode. Even
tclwebtest might be helpful.
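
To give an idea of the sledgehammer variant, here is a minimal sketch in plain Tcl. It assumes that a section runs from the keyword to the next <b>, <strong> or <h1>..<h6> opening tag, which will not hold for every page; extractSection is a made-up name, not a library command.

```tcl
# Hypothetical helper: return the text that follows the first occurrence of
# $keyword, up to the next <b>, <strong> or <h1>..<h6> opening tag, with all
# markup stripped. Returns "" if the keyword is not found.
proc extractSection {html keyword} {
    # Backslash-escape every non-word character so the keyword is matched
    # literally rather than as a regular expression.
    regsub -all {\W} $keyword {\\&} kw
    if {![regexp -nocase -indices -- $kw $html idx]} {
        return ""
    }
    # Keep everything after the keyword.
    set rest [string range $html [expr {[lindex $idx 1] + 1}] end]
    # Cut at the next heading-like opening tag, if there is one.
    if {[regexp -nocase -indices {<(?:b|strong|h[1-6])[\s>]} $rest stop]} {
        set rest [string range $rest 0 [expr {[lindex $stop 0] - 1}]]
    }
    # Replace tags with spaces, collapse whitespace, drop a leading colon.
    regsub -all {<[^>]*>} $rest { } rest
    regsub -all {\s+} $rest { } rest
    return [string trim [string trimleft $rest ": "]]
}

set page {<p><b>Description:</b> Maintain avionics software.<br>Write tests.</p><b>Summary</b><p>Other</p>}
puts [extractSection $page "description"]
# prints: Maintain avionics software. Write tests.
```

The search is done in separate steps on purpose: mixing greedy and non-greedy quantifiers in one Tcl regular expression often does not match what one expects. Note that a heading written as "Duties &amp; Responsibilities" in the page source has to be passed in that form.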

See: http://wiki.tcl.tk/2204
http://wiki.tcl.tk/tdom

Michael

chi...@singnet.com.sg

Oct 30, 2007, 8:58:03 PM

I think tDOM is the best tool for web scraping. The best part of tDOM
is that it can handle HTML as well as well-formed XML.
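
As a rough illustration (a sketch, not the code from my blog), something along these lines might work with tDOM, assuming the section heading is a <b>, <strong> or <h1>..<h6> element and the section body consists of the siblings that follow it. sectionText is a made-up name.

```tcl
package require tdom

# Hypothetical helper: return the text between the first heading-like
# element that mentions $keyword and the next heading-like sibling.
proc sectionText {html keyword} {
    set headings {h1 h2 h3 h4 h5 h6 b strong}
    set doc [dom parse -html $html]
    set root [$doc documentElement]
    set out ""
    foreach node [$root selectNodes {//h1|//h2|//h3|//h4|//h5|//h6|//b|//strong}] {
        # Case-insensitive substring test on the heading's text.
        if {[string first [string tolower $keyword] \
                [string tolower [$node asText]]] < 0} {
            continue
        }
        # Walk the following siblings until the next heading-like element.
        for {set sib [$node nextSibling]} {$sib ne ""} {set sib [$sib nextSibling]} {
            if {[$sib nodeType] eq "ELEMENT_NODE"} {
                if {[lsearch -exact $headings [$sib nodeName]] >= 0} break
                append out [$sib asText] " "
            } elseif {[$sib nodeType] eq "TEXT_NODE"} {
                append out [$sib nodeValue] " "
            }
        }
        break
    }
    $doc delete
    regsub -all {\s+} $out { } out
    return [string trim $out]
}
```

Pages that put the heading and the body in different table cells or nested containers would need a different walk (for example, starting from the heading's parent), so treat this as a starting point only.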

My blog has some sample code that you can use as a guideline:
http://chihungchan.blogspot.com/search/label/Web%20Scraping


ewils...@gmail.com

Oct 31, 2007, 11:46:56 AM
On Oct 31, 10:58 am, chih...@singnet.com.sg wrote:
> ...

> I think tDOM is the best tool for web scraping. The best part of tDOM
> is that it can handle well formed xml as well as html.
>
> My blog has some sample codes that you can use that as a guideline
> http://chihungchan.blogspot.com/search/label/Web%20Scraping

Hey, nice blog with real-world Tcl use.

Regards
E Wilson

