reg exp for urls

drsc...@gmail.com

Sep 27, 2010, 4:07:58 PM

Does anyone have a good regexp to match all url's on a given web page?

DrS

Alexandre Ferrieux

Sep 27, 2010, 4:29:58 PM

On Sep 27, 10:07 pm, drscr...@gmail.com wrote:
> Does anyone have a good regexp to match all url's on a given web page?
>
> DrS

If you just want absolute URLs starting with http://, then

["'](http://[^"']*)["']

should do. However, if you want to catch <a href="foo.htm">, and not
<script> foo="bar.htm"; </script>, you'd be better off using a ready-
made HTML parser (there are several of them in the wiki).
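
A minimal sketch of how the quoted-URL pattern above could be run over a whole page with regexp -all -inline. The html string and the urls variable are made up for illustration, and this only picks up quoted absolute http:// URLs, with the limitations already mentioned.

```tcl
# Hypothetical page fragment, for illustration only.
set html {<a href="http://example.com/a">A</a> <img src='http://example.com/b.png'>}

# -all -inline returns a flat list: full match, group 1, full match, group 1, ...
set urls {}
foreach {whole url} [regexp -all -inline {["'](http://[^"']*)["']} $html] {
    lappend urls $url
}
puts $urls   ;# http://example.com/a http://example.com/b.png
```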

-Alex

mallikarjunarao

Sep 28, 2010, 7:44:10 AM

On Sep 28, 1:07 am, drscr...@gmail.com wrote:
> Does anyone have a good regexp to match all url's on a given web page?
>
> DrS

set res [regexp {((http://|https://)(([a-z\.\-]*)))} $str match url loc add]

Bruce

Sep 28, 2010, 9:44:26 AM

From an RE perspective, there are extra unneeded parens,
and from a matching standpoint that RE will fail horribly
on many, many, many URLs.

Bruce

Sep 28, 2010, 10:26:06 AM

FYI

http://daringfireball.net/2010/07/improved_regex_for_matching_urls

drsc...@gmail.com

Sep 28, 2010, 12:48:30 PM

On 9/28/2010 10:26 AM, Bruce wrote:
>
> FYI
>
> http://daringfireball.net/2010/07/improved_regex_for_matching_urls
>

Thanks for the link. I copied the expression over to tcl but it does
not seem to work - not sure which parts need translating to tcl's regexp
version:

% set p {(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|
[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+
(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))}

% regexp $p http://www.yahoo.com
0

DrS

Alexandre Ferrieux

Sep 28, 2010, 3:42:58 PM

Remove the \b just after the beginning. In Tcl-land it means Backspace;
in that text it presumably means Word Boundary, which Tcl's regexp
spells \y.
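
A small sketch of the difference, using a deliberately cut-down pattern rather than the full expression from the link. The variable names are made up for illustration. In Tcl's advanced regular expressions \b is the backspace character, while \y matches at a word boundary (\m and \M match the start and end of a word).

```tcl
set url {http://www.yahoo.com}

# \b asks for a literal backspace character before "http", so nothing matches.
set as_backspace [regexp {\bhttp} $url]

# \y is Tcl's word-boundary assertion, so this matches at the start of the string.
set as_boundary [regexp {\yhttp} $url]

puts "$as_backspace $as_boundary"   ;# 0 1
```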

-Alex

Arjen Markus

Sep 29, 2010, 4:11:16 AM

On 28 sep, 18:48, drscr...@gmail.com wrote:
> On 9/28/2010 10:26 AM, Bruce wrote:
>
>
>
> > FYI
>
> >http://daringfireball.net/2010/07/improved_regex_for_matching_urls
>
> Thanks for the link. I copied the expression over to tcl but it does
> not seem to work - not sure which parts need translating to tcl's regexp
> version:
>
> % set p {(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|
> [a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+
> (?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))}
>
> % regexp $p http://www.yahoo.com
> 0
>
> DrS

Doesn't this illustrate why regular expressions are not the best
choice for this task?

Regards,

Arjen

drsc...@gmail.com

Sep 29, 2010, 10:17:42 AM

On 9/29/2010 4:11 AM, Arjen Markus wrote:
>
> Doesn't this illustrate why regular expressions are not the best
> choice for
> this task?
>

Plus, it turns out, it does not work very well, especially when the page
contains javascript, which is almost universal.

DrS

Alexandre Ferrieux

Sep 29, 2010, 10:57:43 AM

On Sep 29, 4:17 pm, drscr...@gmail.com wrote:
>
> Plus, it turns out, it does not work very well, especially when the page
> contains javascript, which is almost universal.

Ah but if you want to extract all possibly dynamically generated URLs
by some Javascript code, your problem becomes roughly equivalent to
the Turing halting problem. A tough TIP even for Tcl9 :P

-Alex

drsc...@gmail.com

Sep 29, 2010, 11:07:36 AM

On 9/29/2010 10:57 AM, Alexandre Ferrieux wrote:
> Ah but if you want to extract all possibly dynamically generated URLs
> by some Javascript code, your problem becomes roughly equivalent to
> the Turing halting problem. A tough TIP even for Tcl9 :P
>

I'd be happy if it extracted some url's consistently, like earlier when
you indicated for the href's. This one unfortunately gives me results
that need lots of post-cleanup work and it is not worth it. For
example, on the main yahoo.com page, it matches all sorts of stuff that
are not url's.

I think I will go back to your original regexp example and improve it a
bit or use string op's.

Thanks,

DrS

jr4412

Sep 29, 2010, 11:13:48 AM

On Sep 29, 4:07 pm, drscr...@gmail.com wrote:
> I think I will go back to your original regexp example and improve it a
> bit or use string op's.

having a nose around on the net for screen-scraper scripts might
unearth some 'goodies'.

Arjen Markus

Sep 30, 2010, 3:03:49 AM

On 29 sep, 16:57, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:

All it takes is a decent implementation of TIP 131 ...

Regards,

Arjen

mallikarjunarao

Sep 30, 2010, 7:56:14 AM

(http://|https://)([a-z\.\/\(_\)\-0-9#\?=%\&]*) also matches
almost all URLs

Alexandre Ferrieux

Sep 30, 2010, 9:17:00 AM

On Sep 30, 1:56 pm, mallikarjunarao <malli....@gmail.com> wrote:
>
> (http://|https://)([a-z\.\/\(_\)\-0-9#\?=%\&]*) also matches
> almost all URLs

Well, if you want to use this simple regexp, then make it really
simple ;-)

https?://[-a-z./(_)0-9#?=%&]*

(no use backslashing inside the []. Only special chars are ] and -,
handled by position)

-Alex

Bruce

Sep 30, 2010, 9:54:11 AM

mallikarjunarao wrote:
> On Sep 28, 7:26 pm, Bruce <Bruce_do_not_...@example.com> wrote:
>> Bruce wrote:
>>> mallikarjunarao wrote:
>>>> On Sep 28, 1:07 am, drscr...@gmail.com wrote:
>>>>> Does anyone have a good regexp to match all url's on a given web page?
>>>>> DrS
>>>> set res [regexp {((http://|https://)(([a-z\.\-]*)))} $str match url
>>>> loc add]
>>> from an RE perspective, there are extra unneeded parens
>>> and from a matching standpoint that RE will fail horribly
>>> on many,many,many URLs
>>> Bruce
>> FYI
>>
>> http://daringfireball.net/2010/07/improved_regex_for_matching_urls
>
> (http://|https://)([a-z\.\/\(_\)\-0-9#\?=%\&]*) also matches
> almost all URLs

(https?://) is shorter than the alternation above,

and even though they would resolve the same, URLs can contain uppercase
letters as well
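
One way to cover the uppercase case without widening the character class is regexp's -nocase switch. This is only a sketch: the sample text and variable names are made up, and the pattern is the simplified one from earlier in the thread, so it still misses many legal URL characters.

```tcl
# Hypothetical input, for illustration only.
set text {Visit HTTP://Example.COM/Path?x=1 or https://tcl.tk/man/tcl8.5/ today}
set re {https?://[-a-z./(_)0-9#?=%&]*}

# Case-sensitive: the uppercase URL is skipped entirely.
set strict [regexp -all -inline -- $re $text]

# Case-insensitive: both URLs are found.
set loose [regexp -all -inline -nocase -- $re $text]

puts $strict   ;# https://tcl.tk/man/tcl8.5/
puts $loose    ;# HTTP://Example.COM/Path?x=1 https://tcl.tk/man/tcl8.5/
```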

Bruce

mallikarjunarao

Oct 1, 2010, 1:10:49 AM

On Sep 30, 6:17 pm, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:

thanx

Andreas Kupries

Oct 7, 2010, 12:01:56 AM

mallikarjunarao <mall...@gmail.com> writes:

http://docs.activestate.com/activetcl/8.5/tcllib/uri/uri.html

While they are not documented, the package uses a number of REs internally.

--
So long,
Andreas Kupries <akup...@shaw.ca>
<http://www.purl.org/NET/akupries/>
Developer @ <http://www.activestate.com/>
-------------------------------------------------------------------------------

tom.rmadilo

Oct 7, 2010, 6:45:43 PM

On Oct 6, 9:01 pm, Andreas Kupries <akupr...@shaw.ca> wrote:

> mallikarjunarao <malli....@gmail.com> writes:
> > On Sep 28, 1:07 am, drscr...@gmail.com wrote:
> >> Does anyone have a good regexp to match all url's on a given web page?
>
> >> DrS
>
> > set res [regexp {((http://|https://)(([a-z\.\-]*)))} $str match url
> > loc add]
>
> http://docs.activestate.com/activetcl/8.5/tcllib/uri/uri.html
>
> While not documented the package uses a number of RE's internally.

RFC 2396, Appendix B has a regular expression which works unchanged
with Tcl's [regexp], producing the same results as the RFC's example.

It can't be used to find URIs in a web page, but it does a better job
at URI validation than the regular expressions above.

tom.rmadilo

Oct 7, 2010, 2:46:55 PM

On Oct 6, 9:01 pm, Andreas Kupries <akupr...@shaw.ca> wrote:

> mallikarjunarao <malli....@gmail.com> writes:
> > On Sep 28, 1:07 am, drscr...@gmail.com wrote:
> >> Does anyone have a good regexp to match all url's on a given web page?
>
> >> DrS
>
> > set res [regexp {((http://|https://)(([a-z\.\-]*)))} $str match url
> > loc add]
>
> http://docs.activestate.com/activetcl/8.5/tcllib/uri/uri.html
>
> While not documented the package uses a number of RE's internally.

I noticed that the uri package mentions RFC 2396, basically
recognizing that the code is out of date. It would be nice to redo
this package, but the current API uses relatively random terms for the
basic URI components. Would the tcllib maintainers consider an
additional package, maybe rfc2396 which would try to implement the
part and provide an API for extensions to different namespaces?

Anyway, RFC 2396 has a great URI regular expression. It isn't for
finding URIs in a web page, but for breaking a URI you have already
found into its components. It works out of the box with Tcl's [regexp].
Here is the RFC example and what Tcl produces:

RFC 2396                   URI Generic Syntax                August 1998

B. Parsing a URI Reference with a Regular Expression

As described in Section 4.3, the generic URI syntax is not sufficient
to disambiguate the components of some forms of URI. Since the
"greedy algorithm" described in that section is identical to the
disambiguation method used by POSIX regular expressions, it is
natural and commonplace to use a regular expression for parsing the
potential four components and fragment identifier of a URI reference.

The following line is the regular expression for breaking-down a URI
reference into its components.

   ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
    12            3  4          5       6  7        8 9

The numbers in the second line above are only to assist readability;
they indicate the reference points for each subexpression (i.e., each
paired parenthesis). We refer to the value matched for subexpression
<n> as $<n>. For example, matching the above expression to

   http://www.ics.uci.edu/pub/ietf/uri/#Related

results in the following subexpression matches:

   $1 = http:
   $2 = http
   $3 = //www.ics.uci.edu
   $4 = www.ics.uci.edu
   $5 = /pub/ietf/uri/
   $6 = <undefined>
   $7 = <undefined>
   $8 = #Related
   $9 = Related

where <undefined> indicates that the component is not present, as is
the case for the query component in the above example. Therefore, we
can determine the value of the four components and fragment as

   scheme    = $2
   authority = $4
   path      = $5
   query     = $7
   fragment  = $9

and, going in the opposite direction, we can recreate a URI reference
from its components using the algorithm in step 7 of Section 5.2.

And Tcl:
% set re {^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?}
% set uri {http://www.ics.uci.edu/pub/ietf/uri/#Related}
% regexp $re $uri 0 1 2 3 4 5 6 7 8 9
1
% set 0
http://www.ics.uci.edu/pub/ietf/uri/#Related
% set 1
http:
% set 2
http
% set 3
//www.ics.uci.edu
% set 4
www.ics.uci.edu
% set 5
/pub/ietf/uri/
% set 6
% set 7
% set 8
#Related
% set 9
Related
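
For everyday use, the session above could be wrapped in a small helper. This is only a sketch: uri_split is a made-up name, not part of tcllib's uri package, and it assumes Tcl 8.5 or later for dict. Unlike the RFC's <undefined>, Tcl sets the variable for an unmatched subexpression to the empty string, so an absent query and an empty one look the same here.

```tcl
# Hypothetical helper: split a URI reference into the five RFC 2396 components.
proc uri_split {uri} {
    set re {^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?}
    # Groups 1, 3, 6 and 8 include the delimiters; only 2, 4, 5, 7 and 9 are kept.
    regexp $re $uri -> s1 scheme s3 authority path s6 query s8 fragment
    return [dict create scheme $scheme authority $authority \
                path $path query $query fragment $fragment]
}

set parts [uri_split {http://www.ics.uci.edu/pub/ietf/uri/#Related}]
puts [dict get $parts scheme]      ;# http
puts [dict get $parts authority]   ;# www.ics.uci.edu
puts [dict get $parts path]        ;# /pub/ietf/uri/
puts [dict get $parts fragment]    ;# Related
```

If the distinction between an absent and an empty component matters, regexp's -indices switch should help, since it reports -1 -1 for a subexpression that did not take part in the match.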