DrS
If you just want absolute URLs starting with http://, then
["'](http://[^"']*)["']
should do. However, if you want to catch <a href="foo.htm">, and not
<script> foo="bar.htm"; </script>, you'd be better off using a ready-
made HTML parser (there are several of them in the wiki).
-Alex
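For what it's worth, here is a minimal sketch of how the quoted-absolute-URL expression could be applied to a whole page with [regexp -all -inline]. The sample markup and the variable names (html, urls) are made up for illustration, not taken from anyone's actual page:

```tcl
# Sample markup; the URLs and file names are invented for this sketch.
set html {<a href="http://example.com/a.htm">A</a>
<img src='http://example.com/b.png'>
<a href="foo.htm">relative link</a>}

# -all -inline returns a flat list: full match, then submatch 1, repeated
# once per match. We keep only the submatch (the URL without its quotes).
set urls {}
foreach {quoted url} [regexp -all -inline {["'](http://[^"']*)["']} $html] {
    lappend urls $url
}
puts $urls   ;# http://example.com/a.htm http://example.com/b.png
```

The relative href is skipped because the expression insists on http:// right after the opening quote.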
set res [regexp {((http://|https://)(([a-z\.\-]*)))} $str match url loc add]
From an RE perspective, there are extra, unneeded parens,
and from a matching standpoint that RE will fail horribly
on many, many, many URLs.
Bruce
Thanks for the link. I copied the expression over to tcl but it does
not seem to work - not sure which parts need translating to tcl's regexp
version:
% set p {(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|
[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+
(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))}
% regexp $p http://www.yahoo.com
0
DrS
Remove the \b just after the beginning. It means backspace in
Tcl-land; in that text it presumably means word boundary, which Tcl
spells \y.
-Alex
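A tiny sketch of the difference, using a shortened pattern rather than the full expression above (the sample string and variable names are only for illustration):

```tcl
set str {see http://www.yahoo.com for details}

# In Tcl's regexp, \b is the backspace character, so this pattern
# requires a literal backspace in front of "http" and does not match.
set withB [regexp {\bhttp://} $str]

# \y matches at the beginning or end of a word, which is what \b
# means in Perl-style expressions.
set withY [regexp {\yhttp://} $str]

puts "$withB $withY"   ;# 0 1
```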
Doesn't this illustrate why regular expressions are not the best
choice for this task?
Regards,
Arjen
Plus, it turns out, it does not work very well, especially when the page
contains JavaScript, which is almost universal.
DrS
Ah but if you want to extract all possibly dynamically generated URLs
by some JavaScript code, your problem becomes roughly equivalent to
the Turing halting problem. A tough TIP even for Tcl9 :P
-Alex
I'd be happy if it extracted some URLs consistently, as in your earlier
example for the hrefs. This one unfortunately gives me results
that need a lot of post-cleanup work, and it is not worth it. For
example, on the main yahoo.com page, it matches all sorts of stuff that
is not a URL.
I think I will go back to your original regexp example and improve it a
bit, or use string ops.
Thanks,
DrS
Having a nose around on the net for screen-scraper scripts might
unearth some 'goodies'.
All it takes is a decent implementation of TIP 131 ...
Regards,
Arjen
(http://|https://)([a-z\.\/\(_\)\-0-9#\?=%\&]*) also matches
almost all URLs.
Well, if you want to use this simple regexp, then make it really
simple ;-)
https?://[-a-z./(_)0-9#?=%&]*
(No need for backslashes inside the []: of the characters used here,
only ] and - are special, and they are handled by position.)
-Alex
(https?://) is shorter than the alternation above.
Also, even though they would resolve the same, URLs can contain
uppercase letters as well.
Bruce
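To illustrate the uppercase point, here is a rough sketch that runs the simplified expression over a made-up snippet, once as-is and once with -nocase (the markup and the names html, re, strict and loose are assumptions for the example):

```tcl
# Invented sample markup containing a mixed-case URL.
set html {<a href="http://www.Example.com/Page?id=1&x=2">x</a> see https://tcl.tk/}

set re {https?://[-a-z./(_)0-9#?=%&]*}

# Case-sensitive: the first match stops at the first uppercase letter.
set strict [regexp -all -inline $re $html]

# -nocase lets the a-z range cover uppercase letters too.
set loose [regexp -all -inline -nocase $re $html]

puts [lindex $strict 0]   ;# http://www.
puts [lindex $loose 0]    ;# http://www.Example.com/Page?id=1&x=2
```

Both runs still stop at any character outside the bracket list, so this stays a quick filter and not a general URL matcher.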
http://docs.activestate.com/activetcl/8.5/tcllib/uri/uri.html
While not documented, the package uses a number of REs internally.
--
So long,
Andreas Kupries <akup...@shaw.ca>
<http://www.purl.org/NET/akupries/>
Developer @ <http://www.activestate.com/>
-------------------------------------------------------------------------------
RFC 2396, Appendix B has a regular expression which works with Tcl's
[regexp], producing the same results.
It can't be used to find URIs in a web page, but does a better job at
URI validation than the above regular expressions.
I noticed that the uri package mentions RFC 2396, basically
recognizing that the code is out of date. It would be nice to redo
this package, but the current API uses relatively random terms for the
basic URI components. Would the tcllib maintainers consider an
additional package, maybe rfc2396 which would try to implement the
part and provide an API for extensions to different namespaces?
Anyway, RFC 2396 has a great URI regular expression. It isn't for
finding URIs in a web page, but for breaking a found URI into components.
It works out of the box with Tcl's [regexp]. Here is the RFC example
and what Tcl produces:
RFC 2396                   URI Generic Syntax                August 1998

B. Parsing a URI Reference with a Regular Expression

   As described in Section 4.3, the generic URI syntax is not sufficient
   to disambiguate the components of some forms of URI.  Since the
   "greedy algorithm" described in that section is identical to the
   disambiguation method used by POSIX regular expressions, it is
   natural and commonplace to use a regular expression for parsing the
   potential four components and fragment identifier of a URI reference.

   The following line is the regular expression for breaking-down a URI
   reference into its components.

      ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
       12            3  4          5       6  7        8 9

   The numbers in the second line above are only to assist readability;
   they indicate the reference points for each subexpression (i.e., each
   paired parenthesis).  We refer to the value matched for subexpression
   <n> as $<n>.  For example, matching the above expression to

      http://www.ics.uci.edu/pub/ietf/uri/#Related

   results in the following subexpression matches:

      $1 = http:
      $2 = http
      $3 = //www.ics.uci.edu
      $4 = www.ics.uci.edu
      $5 = /pub/ietf/uri/
      $6 = <undefined>
      $7 = <undefined>
      $8 = #Related
      $9 = Related

   where <undefined> indicates that the component is not present, as is
   the case for the query component in the above example.  Therefore, we
   can determine the value of the four components and fragment as

      scheme    = $2
      authority = $4
      path      = $5
      query     = $7
      fragment  = $9

   and, going in the opposite direction, we can recreate a URI reference
   from its components using the algorithm in step 7 of Section 5.2.

And Tcl:
% set re {^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?}
% set uri {http://www.ics.uci.edu/pub/ietf/uri/#Related}
% regexp $re $uri 0 1 2 3 4 5 6 7 8 9
1
% set 0
http://www.ics.uci.edu/pub/ietf/uri/#Related
% set 1
http:
% set 2
http
% set 3
//www.ics.uci.edu
% set 4
www.ics.uci.edu
% set 5
/pub/ietf/uri/
% set 6
% set 7
% set 8
#Related
% set 9
Related
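The session above could be wrapped up along these lines. This is only a sketch: the proc name uriSplit and the dict keys are my own choice (borrowed from the RFC's component names), not an existing tcllib API, and it assumes Tcl 8.5 or later for [dict]:

```tcl
# Hypothetical helper: split a URI reference into the five components
# named in RFC 2396, Appendix B, and return them as a dict.
proc uriSplit {uri} {
    set re {^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?}
    # "->" receives the full match. Subexpressions 1, 3, 6 and 8 only
    # carry the delimiters ("http:", "//host", "?query", "#frag"), so
    # they are dumped into the throwaway variable "_".
    regexp $re $uri -> _ scheme _ authority path _ query _ fragment
    return [dict create scheme $scheme authority $authority \
        path $path query $query fragment $fragment]
}

set parts [uriSplit {http://www.ics.uci.edu/pub/ietf/uri/?q=tcl#Related}]
puts [dict get $parts authority]   ;# www.ics.uci.edu
puts [dict get $parts query]       ;# q=tcl
```

Note that Tcl sets the variable of an unmatched subexpression to the empty string, so this sketch cannot tell an absent query from an empty one; [regexp -indices] would be one way to distinguish them.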