
Awget - gawk function to download webpage into a variable


silent...@gmail.com
Nov 7, 2014, 2:12:17 PM

I often do web scraping and API retrieval involving tens of thousands of web pages. Originally I wrote in a shell language and used Awk one-liners extensively for the data processing, but it became ridiculously slow when most of the shell script was calls to Awk. So I started writing in Awk directly, with a few system/getline calls to external shell programs -- one of which is wget, to download the web page for processing. For example:

command = "wget -q -O- \"http:..\" > temp_file"
system(command)

This method works but requires 1. calling an external program, 2. saving to a temp file, and then 3. reading that file back into a variable using Gawk's readfile() function (or getline). It's a lot of I/O and maintenance overhead.
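
The read-back in step 3 is just a getline loop over the temp file, something like this (a minimal sketch; page and line are only placeholder names):

page = ""
while ((getline line < "temp_file") > 0)
    page = page line "\n"
close("temp_file")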

Ideally I wanted a way to load a webpage directly into a variable without calling an external program or saving/reading anything to/from a file. And that is what the following function awget() does.

It borrows heavily from Peteris Krumins's "get_youtube_vids.awk" script which showed the way.

Note the example URL is a redirect to en.wikipedia.org which the code detects and handles correctly.

Any suggestions or ideas on how to improve the function, or where it might fail, are welcome.

------------------

BEGIN {
    url = "http://www.wikipedia.org/wiki/Awk"
    s = awget(url)
    print s
}

# awget (replicate "wget -q -O- http:...")
# Download URL and return as a variable
# Handles redirects.
#
# Adapted from Peteris Krumins's "get_youtube_vids.awk"
# https://code.google.com/p/lawker/source/browse/fridge/gawk/www/get_youtube_vids.awk
#
function awget(url,    urlhost, urlrequest, c, i, a, p, f, j, output, foO, headerS, matches, inetfile, request, loop)
{
    # Parse URL into host and request path
    c = split(url, a, "/")
    urlhost = a[3]                        # www.domain.com
    i = 3
    while (i < c) {
        f++
        i++
        p[f] = a[i]
    }
    urlrequest = "/" join(p, 1, f, "/")   # /wiki/Awk

    inetfile = "/inet/tcp/0/" urlhost "/80"
    request = "GET " urlrequest " HTTP/1.0\r\n"    # HTTP/1.0: no chunked responses to deal with
    request = request "Host: " urlhost "\r\n\r\n"

    do {
        get_headers(inetfile, request, headerS)
        if ("Location" in headerS) {
            close(inetfile)
            if (match(headerS["Location"], /http:\/\/([^\/]+)(\/.+)/, matches)) {
                foO["InetFile"] = "/inet/tcp/0/" matches[1] "/80"
                foO["Host"]     = matches[1]
                foO["Request"]  = matches[2]
            }
            else {
                foO["InetFile"] = ""
                foO["Host"]     = ""
                foO["Request"]  = ""
            }
            # Follow the redirect: rebuild connection and request from the Location header
            inetfile   = foO["InetFile"]
            urlhost    = foO["Host"]
            urlrequest = foO["Request"]
            request    = "GET " urlrequest " HTTP/1.0\r\n" "Host: " urlhost "\r\n\r\n"
            if (inetfile == "") {
                print "Failed 1 (" url "), could not parse Location header!" > "/dev/stderr"
                return -1
            }
        }
        loop++
    } while (("Location" in headerS) && loop < 5)
    if ("Location" in headerS) {
        print "Failed 2 (" url "), caught in Location loop!" > "/dev/stderr"
        return -1
    }

    # Read the response body
    while ((inetfile |& getline) > 0) {
        j++
        output[j] = $0
    }
    close(inetfile)
    if (length(output) == 0)
        return -1
    else
        return join(output, 1, j, "\n")
}

function get_headers(Inet, Request, headerS,    Matches, OLD_RS)
{
    delete headerS

    # save the global record separator
    OLD_RS = RS

    print Request |& Inet

    # get the HTTP status line
    if ((Inet |& getline) > 0) {
        headerS["_status"] = $2
    }
    else {
        print "Failed reading from the net. Quitting!" > "/dev/stderr"
        exit 1
    }

    RS = "\r\n"
    while ((Inet |& getline) > 0) {
        # FS=": " could split the headers, but header values may themselves
        # contain ": ", so a match() is safer.
        if (match($0, /([^:]+): (.+)/, Matches)) {
            headerS[Matches[1]] = Matches[2]
        }
        else { break }    # the blank line ends the header block
    }
    RS = OLD_RS
}

# From Awk manual
#
function join(array, start, end, sep,    result, i)
{
    if (sep == "")
        sep = " "
    else if (sep == SUBSEP)    # magic value
        sep = ""
    result = array[start]
    for (i = start + 1; i <= end; i++)
        result = result sep array[i]
    return result
}


Janis Papanagnou
Nov 7, 2014, 5:36:08 PM

On 07.11.2014 20:12, silent...@gmail.com wrote:

[ Please restrict your line lengths in Usenet posting
to less than 80 characters per line! ]

> I often do web scraping and API retrieval involving 10s of thousands of web
> pages. Originally I wrote in a shell language and used Awk 1-liners
> extensively for the data processing but it became ridiculously slow when
> most of the shell script is calls to Awk.

Between the two extremes - calling awk ONE-LINERS and doing all the rest in
shell - there is a third option: let the shell do the file handling and run
the tools (wget/curl/...), and let awk do the text processing of the files.
I also do regular web scraping, and this method works best with regard to
performance, clean separation of tasks, and maintainability.

> So I started writing in Awk
> directly with a few system/getline calls to external shell programs -- one
> of which is wget to download the webpage for processing. For example
>
> command = "wget -q -O- \"http:..\" > temp_file" system(command)
>
> This method works but requires 1. calling an external program 2. saving to
> a temp file and then 3. reading that file back into a variable using Gawk's
> readfile() function (or, getline). It's a lot of IO, and maintenance
> overhead.
>
> Ideally I wanted a way to load a webpage directly into a variable without
> calling an external program or saving/reading anything to/from a file.

But why? - What's the point in re-implementing existing tools in awk?

If you have heavy text processing of the HTML files, that will typically cost
much more time than the awk invocation, let alone the unavoidable and slow
I/O for the web page(s).

curl web_address | awk ' { s = s $0 RS } END { do_processing(s) }'

with appropriate do_processing() of the web file contents in variable s.
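
A hypothetical do_processing() might, for example, just pull out the page
title (a sketch using gawk's match() with an array argument):

function do_processing(s,    t) {
    # illustration only: print the contents of the first <title> element
    if (match(s, /<title>([^<]*)<\/title>/, t))
        print "Title:", t[1]
}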

<OT>
For your task of many web addresses, use shell loops or curl's feature of
generating URLs from count patterns (numeric ranges). E.g.

while IFS= read -r web_address
do
curl "${web_address}" | awk ' { s = s $0 RS } END { do_processing(s) }'
done <<EOT
webaddr1
webaddr2
...
EOT

</OT>

Janis

> And that is what the following function awget() does.
>
> It borrows heavily from Peteris Krumins's "get_youtube_vids.awk" script
> which showed the way.
>
> Note the example URL is a redirect to en.wikipedia.org which the code
> detects and handles correctly.
>
> Welcome any suggestions or ideas how to improve or where the function might
> fail.
> [...]


silent...@gmail.com
Nov 8, 2014, 2:19:05 AM

Janis, there is a point when it makes sense for a shell script to migrate to a language: Awk, Perl, Python, Lua, etc. Since my shell script was already heavily using awk one-liners for its core functionality (similar to the example you gave), it made sense to go 100% Awk and drop the shell entirely. Any needed non-awk functions such as wget are either rewritten in Awk, like this example, or simply invoked via a system(wget) call. If I were using Perl it would have wget-like functionality built in, but that's the point, right? Most languages have native equivalents of the Unix tools or the ability to create them. Gawk has networking and file functions for this reason, in support of its core mission as a text processing language. If text processing is the main purpose of the script, consider writing it entirely in Awk. I found it really improved things. But it depends on your program; it's not black and white. Regards.

Janis Papanagnou
Nov 8, 2014, 4:19:02 AM

On 08.11.2014 08:19, silent...@gmail.com wrote:
> Janis, there is a point when it makes sense for a shell script to migrate
> to a language: Awk, Perl, Python, Lua etc.. Since my shell script was
> already heavily using awk 1-liners for its core functionality (similar to
> the example you gave), it made sense to choose %100 Awk and drop the shell
> entirely.

No, it doesn't. (But YMMV, of course.) As I tried to explain, there's a
crucial difference between a shell script inefficiently calling dozens of
awk one-liners vs. splitting the tasks between shell scripts and awk in a
sensible way.

> Any needed non-awk functions such as wget rewritten in Awk like
> this example or simply invoked via a system(wget) call.

Awk's system() function is nothing but a (crippled) shell invocation,
and rewriting already existing specialized tools is an error-prone way
to reinvent the wheel, often also in a crippled way.

> If I was using Perl
> it would have a wget functionality built-in but that's the point right?

Perl is not Awk. Perl has chosen to provide a complex and full-featured
object-oriented interface to the OS, with some Unix tools re-implemented.

As said, I doubt that trying to make Awk into something like Perl is a
sensible path, especially if you provide the functionality inefficiently
as source code, or if specialized tools for your task already exist that
you can use.

> Most languages have native functionality of unix tools or the ability to
> create them.

(In which universe?) Languages (on Unix) and the Unix tools use the Unix API;
languages may have libraries to link against (typically efficient binaries).
Recreating existing functionality is typically done by invoking existing
functionality, either through libraries or processes (fork/exec).

> Gawk has networking and file functions for this reason, in
> support of its core mission as a text processing language.

I doubt that Awk supports networking in order to support re-inventing the
wheel. Awk also has co-processes and piped getline that you can use to hook
in existing external functionality in an efficient way, e.g. "curl ..." | getline line
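
A minimal sketch of that piped-getline pattern (the URL is only a placeholder):

BEGIN {
    cmd = "curl -s http://www.example.org/"
    page = ""
    while ((cmd | getline line) > 0)
        page = page line "\n"
    close(cmd)
    # page now holds the whole document for further processing
}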

> If text
> processing is the main purpose/function of the script, consider writing it
> entirely in Awk.

Indeed. But your approach does not focus on text processing; rather it
reimplements the wheel. Implementing HTTP protocol primitives may be a nice
exercise but is unnecessary (and error-prone; e.g. is the URI syntax really
as simple as you implemented it? I seem to recall having seen "official"
patterns that were far more complex.)
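
For reference, RFC 3986 (Appendix B) gives a generic regular expression for
URIs; a sketch of using it with gawk's match(), with the group numbers taken
from that appendix:

function parse_uri(uri, parts,    m) {
    # regex from RFC 3986, Appendix B
    if (match(uri, /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/, m)) {
        parts["scheme"]    = m[2]
        parts["authority"] = m[4]
        parts["path"]      = m[5]
        parts["query"]     = m[7]
        parts["fragment"]  = m[9]
        return 1
    }
    return 0
}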

> I found it really improved things. But it depends on your
> program, it's not black and white. Regards.

That's what I tried to explain; it's not black and white. There's more
than an ill-designed shell programm with tons of awk one-liners on one
side and and re-implementing all functions, specifically the already
existing reliable ones, in Awk on the other side. YMMV, of course.

Janis

PS: Could you please regard Netiquette in Usenet posting. This newsgroup
is not a google web forum. (Informative links can be found, e.g., at
http://cfaj.freeshell.org/google/index.shtml.) - Thanks!

silent...@gmail.com
Nov 8, 2014, 11:36:48 AM

Yeah, well, I'm writing awk scripts, not shell scripts. As for why awk has networking: why do you think, if not to do something as basic as downloading a web page? You've basically made one helpful comment: that the URL may be more complex than the function allows. I appreciate that. Regards.

Anton Treuenfels
Nov 8, 2014, 11:44:57 AM

"Janis Papanagnou" <janis_pa...@hotmail.com> wrote in message
news:m3kn64$j7b$1...@news.m-online.net...

A whole lot of stuff, but much of it comes across to me as irritable. The OP
solved a problem in a way that satisfies and pleases him, and the response
seems a bit cranky that it wasn't done some other way. Give a new poster a
break. If what he wrote fails in some way predicted by a prophet of doom, it
looks to me that chances are good he'll be able to fix it himself.

Anton Treuenfels

silent...@gmail.com
Nov 8, 2014, 4:58:19 PM

Hello Anton Treuenfels

Thank you for the support. The awk program I'm working on will poll somewhere around a quarter million pages and API requests, limited to 3 sites, so there won't be problems with exotic URLs or site behavior. It will run on a shared cluster, so resource friendliness is an issue, though the system is more than adequate given enough sleep delay between pages. My tests using this function have had zero problems, with increased speed and presumably decreased server load. I'm happy there is networking ability available in Gawk, so as not to load wget a quarter million times.
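
The pacing itself can be as simple as a sleep between fetches; a sketch,
assuming gawk's bundled "time" extension is available (otherwise
system("sleep 2") does the same):

@load "time"    # provides sleep()

function polite_get(url,    s) {
    s = awget(url)
    sleep(2)    # example delay; tune to the site's tolerance
    return s
}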

Regards.

Joe User
Nov 8, 2014, 8:30:36 PM

On Sat, 08 Nov 2014 13:58:18 -0800, silentlamb67 wrote:

> The awk program I'm working on will poll somewhere around a quarter
> million pages and API requests, limited to 3 sites so there won't be
> problems with exotic URLs or site behavior.

If you are worried about having to spawn a wget process for each page,
you could investigate wget running as a coprocess.

Something like this could work (untested):

wget --input-file=- <file_of_URLs -O - 2>&1 | gawk -f x.awk

The file_of_URLs could be a FIFO.

x.awk would have to interpret wget output interspersed with the HTML of
the source pages. That way x.awk could always know what URL was being read.
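
A rough sketch of x.awk under that assumption; the "--<date>--  <URL>" marker
lines come from wget's merged stderr, do_processing() is a hypothetical
handler, and a real script would need to filter wget's other status lines too:

# x.awk (sketch, untested)
/^--[0-9]{4}-[0-9]{2}-[0-9]{2} / {     # wget log line announcing the next URL
    if (url != "")
        do_processing(url, page)
    url  = $NF                         # last field is the URL being fetched
    page = ""
    next
}
{ page = page $0 "\n" }
END {
    if (url != "")
        do_processing(url, page)
}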

I think the resources of spawning wget are small compared to the
resources to read the URL's and process them.

Think about the reliability wget brings to this task. Imagine how hard
it would be to properly implement HTTP timeouts and forwarding alone in
awk.


--
It takes a big man to cry, but it takes a
bigger man to laugh at that man.

-- Jack Handey

Janis Papanagnou
Nov 9, 2014, 7:40:23 PM

On 08.11.2014 17:44, Anton Treuenfels wrote:
>
> The OP solved a problem in a way that satisfies and pleases him, [...]
> If what he wrote fails in some way predicted by a prophet of doom, it
> looks to me that chances are good he'll be able to fix it himself.

Sure, and no one prevented or prevents him from doing what he likes.

But you may have forgotten that Usenet is a forum for discussions
of technical topics; if necessary, re-read the thread of postings
under that precept, ignoring your feelings and just focussing on
the written content. (Needless to say, this is just a suggestion;
you may of course also do what you want.)

Janis

silent...@gmail.com
Nov 13, 2014, 4:56:41 PM

This particular awk program isn't using a list of URLs; rather, URLs are generated dynamically from the previous pages, with the program making decisions about where to go next based on what it finds. But the co-process is an interesting method and I will keep it in mind. The code above does recognize forwarding (the Location: header field). Not sure about timeouts; no problems noticed so far, but something to consider should it come up.
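
Pulling the next URLs out of a page returned by awget() can be done with a
match() loop; a simplified sketch (the href pattern here is deliberately naive):

function extract_links(page, links,    n, m) {
    while (match(page, /href="(http:\/\/[^"]+)"/, m)) {
        links[++n] = m[1]
        page = substr(page, RSTART + RLENGTH)
    }
    return n    # number of links collected in links[]
}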

I'm not saying this code competes with or replaces wget, which has many features, but I found it useful when programming entirely in Awk: it avoids external programs and gives a clear understanding of the sites being connected to. If I were doing web crawling of unknown sites, this code would be a poor choice. In this case it's used for hundreds of thousands of connections to 3 websites of known behavior and reliability. And it's not mission critical, etc.; use your best judgement, the code is there :)

Mike Sanders
Nov 15, 2014, 5:08:35 AM

Janis Papanagnou <janis_pa...@hotmail.com> wrote:

> an error prone way to reinvent the wheel...

[...]

> rather it reimplements the wheel...

Just thinking aloud here (& certainly nothing personal), but...
good grief, it's OKAY to reinvent the wheel - tear it apart,
rebuild it, repaint it, refit it, subvert it, extend it.
Get under the hood & see how things work, in my thinking.

I simply do not understand the logic of saying this is wrong;
it's not as if one can't revert to more traditional ways of
doing things, such as in a production scenario.

And what of the value of experience for its own sake?
It's fine to reinvent the wheel, like gawk does with awk after all...

--
Mike Sanders
www: http://freebsd.hypermart.net
gpg: 0xD94D4C13

Janis Papanagnou
Nov 15, 2014, 7:58:42 AM

On 15.11.2014 11:08, Mike Sanders wrote:
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>
>> an error prone way to reinvent the wheel...
> [...]
>> rather it reimplements the wheel...
>
> Just thinking aloud here (& certainly nothing personal), but...
> good grief, its OKAY to reinvent to the wheel - [...]

Mike, anyone can do whatever he thinks. There's no argument here.
I said this already upthread since it wasn't obvious to everyone.

But we were discussing whether it's *sensible* to "reinvent the
wheel" - I'm really astonished that this theme needs a dispute
at all - specifically in this case, where the given arguments
seem to indicate that the OP was just "doing a lot wrong", which
led him to doing it "differently wrong" instead of trying to
understand where the problem actually lay.

>
> I simply do not understand the logic of saying this its wrong,

If that's not obvious, and the explanations given don't suffice to
support understanding of the issue (or call it a view, if you like),
you may resort to googling the keywords to find explanations that
are easier to understand than what I tried to explain.
(Possible keywords: "reinventing the wheel", "not-invented-here
syndrome", "anti-pattern", etc., might be appropriate choices.)

> [...]
>
> And what of the value of experience for its own sake?
> Its fine to reinvent the wheel, like gawk does with awk after all...

This is very different from what has been said in this thread.
(When gawk was written (around 1986) there wasn't anything other
than the commercial awks. Of course it's good to have free and
open source awks available, especially if you can use them on
other platforms than the commercial ones as well. It's a gain.)

The question we had here was different; I won't repeat it. All
has already been said a week ago. I won't even start trying to
prevent anyone from doing anything. But I *do* suggest methods
and software design principles to the interested readers that I
think are "better" than methods that proved to be inferior for
decades. Pick your choice. I don't mind anyway.

Janis

Mike Sanders
Nov 15, 2014, 8:29:20 AM

Janis Papanagnou <janis_pa...@hotmail.com> wrote:

> Mike, anyone can do whatever he thinks. There's no argument here.
> I said this already upthread since it wasn't obvious to everyone.

Not a problem, just me ranting (too much coffee for me tonight).

> The question we had here was different; I won't repeat it. All
> has already been said a week ago. I won't even start trying to
> prevent anyone from doing anything. But I *do* suggest methods
> and software design principles to the interested readers that I
> think are "better" than methods that proved to be inferior for
> decades. Pick your choice. I don't mind anyway.

Janis, you have always been an excellent teacher of all things
[g]awk (your posts have taught me much). I can only say, sometimes,
it's good to explore new ideas more fully *before* we decide their worth.

But I'll step back now & keep reading; nothing more about it from me.