reading data from a web site

hw

Nov 19, 2019, 2:15:05 PM
to begi...@perl.org
Hi,

how can I read data from a web site which is using multiple frames and some
javascript?

When using a web browser, I need to log in to the web site and follow a
couple links to finally get to the page I want. This page I need to get data
from has a frame with a select list and another frame displaying a table. By
default, the table contains data corresponding to the first entry in the
select list. Selecting an entry from the select list reloads the table in the
other frame once I click on the entry.

I need to automatically pick all the entries from the select list one after
another so that the table is being updated. Once I've read the new table, the
next entry in the select list needs to be picked to get the table updated,
until there are no more entries. The order in which the entries are being
picked can be random.

Once all the available tables are read, only the first entry in the select
needs to be read once per day. "Reading the table" means that I need to put
the data in the table into a database. It would help if I could save all the
tables to files and convert the files later; the problem is getting the
tables.

Is this even possible? There doesn't seem to be any useful support for
javascript with WWW::Mechanize, and even frames seem to be an issue.

I can only see that some java (or javascript or whatever it is) function is
being called when clicking on an entry in the select list. Its only purpose
seems to be to display a banner showing "Loading ..." with the dots
moving while the table is being loaded. Yet the table is being updated. I
can't see how, or whether there are GET or POST requests being sent by the web
browser. The only way to update seems to be to somehow fake a click on an
entry in the select list.

Any ideas?

Olivier

Nov 19, 2019, 9:45:05 PM
to hw, begi...@perl.org
hw <h...@gc-24.de> writes:

> Hi,
>
> how can I read data from a web site which is using multiple frames and some
> javascript?

Provided that the web site does not change too often and that they don't
implement stupid "security" features, this should not be too complicated.

Each frame is a web page, with its own URL. So you can examine the source
code of the web page to find the URL of the first frame and second frame.

Then you can use any Perl library you like to load that URL and parse it
for what you are looking for.

Then use that data to load the second frame with a URL modified to
include the type of data you have selected.

Being frames makes it much easier; you should not have to care about the
javascript too much.
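
The frame-by-frame approach described above might be sketched with WWW::Mechanize roughly like this. It is only an outline under some assumptions: the URL of the frameset page is passed on the command line, and the server actually returns the frameset to a non-browser client. WWW::Mechanize treats `<frame src=...>` and `<iframe src=...>` as links, so `find_all_links` can list them. The helper name `frame_urls` is made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# Return the absolute URLs of all <frame> and <iframe> tags on the
# page currently loaded in $mech. (frame_urls is a hypothetical helper.)
sub frame_urls {
    my ($mech) = @_;
    return map { $_->url_abs->as_string }
        $mech->find_all_links( tag_regex => qr/^i?frame$/ );
}

# Hypothetical usage: pass the URL of the frameset page as the first argument.
if ( my $start = shift @ARGV ) {
    my $mech = WWW::Mechanize->new;    # autocheck is on: dies on HTTP errors
    $mech->get($start);

    # Collect the frame URLs first, then fetch each frame as its own page.
    for my $url ( frame_urls($mech) ) {
        $mech->get($url);
        printf "%s (%d bytes)\n", $url, length $mech->content;
    }
}
```

If the list comes back empty, the server is probably not sending the frameset to this client at all, and the approach does not apply as is.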

Olivier
--

hw

Nov 20, 2019, 2:00:05 PM
to begi...@perl.org
On Wednesday, November 20, 2019 3:29:00 AM CET Olivier wrote:
> hw <h...@gc-24.de> writes:
> > Hi,
> >
> > how can I read data from a web site which is using multiple frames and
> > some
> > javascript?
>
> Provided that the web site does not change too often and that they don't
> implement stupid "security" features, this should not be too complicated.
>
> Each frame is a web page, with its own URL. So you can examine the source
> code of the web page to find the URL of the first frame and second frame.
>
> Then you can use any Perl library you like to load that URL and parse it
> for what you are looking for.
>
> Then use that data to load the second frame with a URL modified to
> include the type of data you have selected.
>
> Being frames makes it much easier; you should not have to care about the
> javascript too much.

The web site seems to be created by a program running on the server, i.e.
there is not really a web site. When I access it with lynx or with
WWW::Mechanize, the answer from the server says that neither frames nor
javascript are supported, and it is not possible to log in.

Can WWW::Mechanize somehow trick the server into assuming that frames and
javascript are supported by the client?
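
One thing that could be tried here, as a guess rather than a known fix, is to make WWW::Mechanize send a browser-like User-Agent header. This only helps if the server decides what to send based on that header, which is an assumption; if the check is done by running javascript in the client, it will change nothing. The `agent` method is inherited from LWP::UserAgent, the Firefox string is just an example value, and `browser_like_mech` is a name made up for this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# Build a mech object that identifies itself like a desktop browser.
# (browser_like_mech is a hypothetical helper; the agent string is an example.)
sub browser_like_mech {
    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->agent(
        'Mozilla/5.0 (X11; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0'
    );
    return $mech;
}

# Hypothetical usage: pass the URL of the login page as the first argument
# and compare the output with what lynx or a default mech receives.
if ( my $url = shift @ARGV ) {
    my $mech = browser_like_mech();
    $mech->get($url);
    print $mech->content;
}
```

WWW::Mechanize keeps cookies in memory by default, so a session cookie set by the login page would be sent back on later requests made with the same object.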

Mike

Nov 20, 2019, 7:00:04 PM
to begi...@perl.org

Shlomi Fish

Nov 21, 2019, 2:30:04 AM
to hw, begi...@perl.org

Olivier

Nov 21, 2019, 2:45:05 AM
to h...@gc-24.de, begi...@perl.org
hw <h...@gc-24.de> wrote:

> On Wednesday, November 20, 2019 3:29:00 AM CET Olivier wrote:
> > hw <h...@gc-24.de> writes:
> > > Hi,
> > >
> > > how can I read data from a web site which is using multiple frames and
> > > some
> > > javascript?
> >
> > Provided that the web site does not change too often and that they don't
> > implement stupid "security" features, this should not be too complicated.
> >
> > Each frame is a web page, with its own URL. So you can examine the source
> > code of the web page to find the URL of the first frame and second frame.
> >
> > Then you can use any Perl library you like to load that URL and parse it
> > for what you are looking for.
> >
> > Then use that data to load the second frame with a URL modified to
> > include the type of data you have selected.
> >
> > Being frames makes it much easier; you should not have to care about the
> > javascript too much.
>
> The web site seems to be created by a program running on the server, i.e.
> there is not really a web site. When I access it with lynx or with
> WWW::Mechanize, the answer from the server says that neither frames nor
> javascript are supported, and it is not possible to log in.

Of course lynx cannot process frames. But that is not what I meant to
tell you.

Open the web page with your browser (Firefox, Chromium, whatever), then
press CTRL-U to display the source. In that source, you should see some
<frame or maybe <iframe tags, each of which contains a URL.

Copy that URL and paste it into a separate window of your web browser.
You should see the list of topics you can select from. In fact, it
should display the contents of the first frame.

If it does not, you are not in good shape.

If it works, go back to the source code and locate the second <frame
tag, find the URL, copy, new window, paste.

The concept is to access the contents of the frames directly, without
accessing the main page.

Best regards,

Olivier


>
> Can WWW::Mechanize somehow trick the server into assuming that frames and
> javascript are supported by the client?
--

hw

Nov 21, 2019, 6:30:04 AM
to begi...@perl.org
On Thursday, November 21, 2019 7:57:49 AM CET Shlomi Fish wrote:
> On Wed, 20 Nov 2019 19:52:46 +0100
>
> [...]
> > Can WWW::Mechanize somehow trick the server into assuming that frames and
> > javascript are supported by the client?
>
> Hi!
>
> See:
>
> https://github.com/shlomif/Freenode-programming-channel-FAQ/blob/master/FAQ_with_ToC__generated.md#how-can-i-write-code-to-perform-operations-on-web-sites-for-me-that-otherwise-should-be-done-manually
>
> (short URL: https://is.gd/ExAHTa )

There seems to be some agreement that selenium needs to be used. That seems
to be some kind of huge IDE thing which may work or not and which may require
years of trying to figure it out, if it's at all possible ...

Selenium doesn't even seem to be available in Fedora ...

hw

Nov 21, 2019, 7:45:04 AM
to begi...@perl.org
When I do that, the login page is being displayed, and nothing happens when I
press Ctrl-U. Maybe it's because the page is already made with frames?

When I look at the source of the frame that contains the fields to enter a
username and a password, I can see that there are inputs for those, like this:

<INPUT TYPE="TEXT" NAME="usrlogn" VALUE="" MAXLENGTH="15" SIZE="8">&nbsp;

The only URL is probably the one displayed in the address bar of the web
browser when looking at the source of the frame. That URL seems to point at
the program running on the web server with parameters in the URL which have
been created by the program. One of the parameters seems to be a session ID.

Instead of viewing the source of the frame, I can open the frame in another tab.
How does that help me? There is no way to automatically get the URL for the
frame because the parameters are being created by the program on the web
server, and they are only valid for a short time.

> Copy that URL and paste it into a separate window of your web browser.
> You should see the list of topics you can select from. In fact, it
> should display the contents of the first frame.

Well, yes, I can see the source of the frame that has the select list. That
doesn't help me either because to get the data I want, I need to select
entries from the select list. Selecting such an entry results in another
frame being updated; that frame shows a table.

I can get the URL of that frame from the frame info of the web browser and
download the frame and convert its table into a CSV and put the data into a
database, but I cannot get the URL of the frame other than copying it
manually from the frame info of the web browser.

> If it does not, you are not in good shape.
>
> If it works, go back to the source code and locate the second <frame
> tag, find the URL, copy, new window, paste.
>
> The concept is to access the contents of the frames directly, without
> accessing the main page.
>
> Best regards,
>
> Olivier
>
> > Can WWW::Mechanize somehow trick the server into assuming that frames and
> > javascript are supported by the client?

Like I said, there are no frames to do anything with when the web site is
being accessed with WWW::Mechanize.

I can only see that when I select an entry from the select list, the web
browser sends a POST request for a subdocument and then right away makes a GET
request for a style sheet. Unfortunately, the browser doesn't tell me what
the POST request looks like. It should have something to do with what is
selected from the list ...
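
If the select list sits inside an ordinary HTML form, the POST the browser sends may simply be that form being submitted, and WWW::Mechanize could then repeat it once per entry. The sketch below rests on several assumptions that have not been verified against this site: the frame with the list can be loaded by the script at all (after logging in with the same object), the list is a plain `<select>` inside a `<form>`, and the response to the submit is the table page. If the request is built entirely by javascript, this will not work as written. The helper `first_select` and the `table-NNN.html` file names are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# Find the first named <select> on the current page and return its name
# followed by its option values. HTML::Form reports <select> inputs with
# type 'option'. (first_select is a hypothetical helper.)
sub first_select {
    my ($mech) = @_;
    for my $form ( $mech->forms ) {
        for my $input ( $form->inputs ) {
            next unless $input->type eq 'option' && defined $input->name;
            return ( $input->name, grep { defined } $input->possible_values );
        }
    }
    return;
}

# Hypothetical usage: pass the URL of the frame holding the select list.
if ( my $url = shift @ARGV ) {
    my $mech = WWW::Mechanize->new;
    $mech->get($url);

    my ( $name, @values ) = first_select($mech);
    die "no select list found\n" unless defined $name;

    my $n = 0;
    for my $value (@values) {
        $mech->form_with_fields($name);    # select the form holding the list
        $mech->field( $name, $value );     # pick one entry
        $mech->submit;                     # assumed to return the table page
        $mech->save_content( sprintf 'table-%03d.html', ++$n );
        $mech->back;                       # return to the page with the list
    }
}
```

The saved files can be converted to CSV or loaded into the database in a separate step. To see what the browser really sends, the network tab of its developer tools shows the body of each POST request.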