
How to grab HTML source?


Mike Darrett

May 2, 2003, 6:19:23 PM
Hello,

I'm trying to get a whole bunch of HTML source from a website (which I
do not own, but am merely trying to read):

https://www.blahblah.com/cgi-bin/Num=xxxx

where I have a list of numbers xxxx in a file.


I would love to run lynx and simply redirect the html dump to a file,
but alas, I do not have access to lynx either. So, perhaps I can
write a PHP script to do this?

readfile( "https://www.blahblah.com/cgi-bin/Num=xxxx" ) does not work:

Warning: readfile("https://www.blahblah.com/cgi-bin/Num=xxxx") - No
such file or directory

What am I doing wrong?

Thanks,

Mike Darrett

sid

May 2, 2003, 9:33:08 PM

"Mike Darrett" <mike-...@darrettenterprises.com> wrote in message
news:d945119c.03050...@posting.google.com...

hey mike - maybe this will give you a clue:

http://www.php.net/manual/en/function.file.php

all you do is stick the html source in an array - each element holds a
line - then you just run through the array and print it.
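A minimal sketch of that approach, assuming the page can be opened by file() at all (for an http:// or https:// URL that needs the fopen wrappers to be enabled). The helper name dump_lines() and the file name in the usage line are made up for illustration:

```php
<?php
// Hypothetical helper: read a file (or a URL, if the fopen wrappers allow it)
// into an array of lines, then join the lines back into one string.
function dump_lines($source)
{
    $lines = @file($source);     // one array element per line, newlines kept
    if ($lines === false) {
        return false;            // the source could not be opened
    }
    $out = '';
    foreach ($lines as $line) {
        $out .= $line;
    }
    return $out;
}
```

Usage would be something like echo dump_lines('page.html'); a false return value means the source could not be read.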


Nikolai Chuvakhin

May 2, 2003, 10:41:07 PM
mike-...@darrettenterprises.com (Mike Darrett) wrote
in message news:<d945119c.03050...@posting.google.com>...

>
> I would love to run lynx and simply redirect the html dump to a file,
> but alas, I do not have access to lynx either.

How about wget?

> readfile( "https://www.blahblah.com/cgi-bin/Num=xxxx" ) does not work:
>
> Warning: readfile("https://www.blahblah.com/cgi-bin/Num=xxxx") - No
> such file or directory
>
> What am I doing wrong?

Can you open those URLs in your browser? Is authorization of any kind
required there? (The reason I am asking about authorization is that
you have a secure URL...) If no authorization is needed, try using
sockets:

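$host = 'www.blahblah.com';
$path = '/cgi-bin/Num=xxxx';
// Note that we should use port 443 for secure HTTP connection.
// A plain fsockopen() on that port does not negotiate SSL by itself, so
// use the ssl:// transport (needs PHP 4.3.0+ built with OpenSSL support).
$remote = fsockopen('ssl://' . $host, 443, $errno, $errstr, 30);
if ($remote) {
    $local = fopen('xxxx.htm', 'w');
    fputs($remote, "GET " . $path . " HTTP/1.0\r\nHost: " . $host . "\r\n\r\n");
    // The response headers end up in the file together with the HTML.
    while (!feof($remote)) {
        fputs($local, fgets($remote, 10240));
    }
    fclose($remote);
    fclose($local);
} else {
    echo "Oops... Can't read it anyway... ($errno: $errstr)";
}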

Cheers,
NC

Randell D.

May 3, 2003, 12:06:24 AM

"Mike Darrett" <mike-...@darrettenterprises.com> wrote in message
news:d945119c.03050...@posting.google.com...

I'll ignore the rights or wrongs of what you are doing for a moment and point
out the problems you're likely to face:

- I notice that you are trying to read a secure page (the s in https://
tells me this)... this might be in part your problem...

- I notice that the files you want to read reside under a /cgi-bin/
directory - it's possible that the author could have some tests to check
which browser you are using to visit - hence your script is unlikely to
deliver those header variables and might fail. In addition, there might be
cookies, which your script is unlikely to process and thus the author of the
server programs could have fixed it so that what you see in your browser is
different from a non-browser connection...

Technically... what you are doing should work so I would first try what you
want to do on an insecure server... make sure that works first before you
start working on a secure server...
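If the server does check the browser or expect a cookie, those headers can be sent from a script as well. The following is only a sketch, assuming a PHP version that has stream contexts (stream_context_create()) and working URL wrappers. The helper name browser_context(), the User-Agent string and the cookie value are placeholders, not anything the real site is known to require:

```php
<?php
// Hypothetical helper: build a stream context that sends browser-like headers.
function browser_context($userAgent, $cookie)
{
    $options = array(
        'http' => array(   // the https:// wrapper reads the 'http' options too
            'method'     => 'GET',
            'user_agent' => $userAgent,
            'header'     => "Cookie: $cookie\r\n",
        ),
    );
    return stream_context_create($options);
}

// Placeholder values, for illustration only.
$context = browser_context('Mozilla/5.0 (compatible; ExampleFetcher)', 'session=PLACEHOLDER');

// With the wrappers enabled, the context goes in as the third argument:
// $html = file_get_contents('https://www.blahblah.com/cgi-bin/Num=xxxx', false, $context);
```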

anyway... laters
randelld

Comex

May 3, 2003, 7:34:10 AM

From the PHP manual:
*Tip:* You can use a URL as a filename with this function if the _fopen
wrappers_ have been enabled. See _fopen()_ for more details on how to
specify the filename and _Appendix I_ for a list of supported URL protocols.

So the wrappers could be disabled.
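One way to check is to ask PHP directly. This is a rough sketch that assumes PHP 5.2 or later, since stream_get_wrappers() and filter_var() are newer than the PHP 4 releases; the function name url_wrapper_status() is made up for illustration:

```php
<?php
// Report whether URL-aware fopen wrappers are on and whether https:// is usable.
function url_wrapper_status()
{
    return array(
        // php.ini setting that lets readfile(), file() and fopen() take URLs
        'allow_url_fopen' => filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN),
        // https:// is only registered when PHP has OpenSSL support
        'https_wrapper'   => in_array('https', stream_get_wrappers()),
        'openssl_loaded'  => extension_loaded('openssl'),
    );
}

print_r(url_wrapper_status());
```

If allow_url_fopen is off, or https is missing from the wrapper list, readfile() on an https:// URL cannot work no matter how the URL is written.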

--
1.41421356237309504880168872420969807856967187537694
Nope, it's not true that the above isn't false, unless it is.
The search is more profitable than the find.
Never, ever back down.


Mike Darrett

May 5, 2003, 7:33:46 PM
n...@iname.com (Nikolai Chuvakhin) wrote in message news:<32d7a63c.03050...@posting.google.com>...


I did try this; I get the "Oops" error message.

Since it's not exactly a file, I do wonder whether I can open the
website as a "file"...???

Don't worry too much about the legals of what I'm doing. This is
actually the website of one of my distributors; they're being
difficult when I ask them for a price list for their software. (You
would think they'd want to help me as much as possible, since I'm a
reseller... ah well.) Turns out I can grab the prices directly from
the https://...

Mike

Jason G

May 6, 2003, 2:42:06 AM
Very Simple.

Most systems have cURL installed on them now. If yours does not, you can
get it from curl.haxx.se

Write a script that shell_exec()'s curl...

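/////////////////////////////////////////////////
$aNums = array('123', '125', '1234', '1456');

foreach ($aNums as $sNum)
{
    // Keep the command on one line, and quote the URL and the output
    // file name so the shell treats them literally.
    $sCmd = 'curl ' . escapeshellarg("https://www.blahblah.com/cgi-bin/Num=$sNum")
          . ' > ' . escapeshellarg("$sNum.html");
    shell_exec($sCmd);
}
////////////////////////////////////////////////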
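Since the original question mentions that the numbers already sit in a file, the hard-coded array could be filled from that file instead. A small sketch, assuming one number per line; the helper name urls_from_file() and the file name in the usage line are hypothetical:

```php
<?php
// Hypothetical helper: read one number per line and return the URL for each.
function urls_from_file($listFile)
{
    $urls  = array();
    $lines = @file($listFile);
    if ($lines === false) {
        return $urls;            // list file missing or unreadable
    }
    foreach ($lines as $line) {
        $num = trim($line);      // drop the newline and stray spaces
        if ($num === '') {
            continue;            // skip blank lines
        }
        $urls[] = 'https://www.blahblah.com/cgi-bin/Num=' . urlencode($num);
    }
    return $urls;
}
```

Each entry of urls_from_file('numbers.txt') can then be handed to whichever fetching method turns out to work on the server.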
