http://code.google.com/p/flaxcode/source/detail?r=1326
Added:
/trunk/flax/crawler
/trunk/flax/crawler/COPYING
/trunk/flax/crawler/README
/trunk/flax/crawler/__init__.py
/trunk/flax/crawler/crawler.py
/trunk/flax/crawler/sql_crawler.py
/trunk/flax/crawler/stdurl.py
/trunk/flax/crawler/test
/trunk/flax/crawler/test/digits
/trunk/flax/crawler/test/digits/1.jpg
/trunk/flax/crawler/test/digits/2.jpg
/trunk/flax/crawler/test/digits/3.jpg
/trunk/flax/crawler/test/digits/4.jpg
/trunk/flax/crawler/test/digits/5.jpg
/trunk/flax/crawler/test/digits/6.jpg
/trunk/flax/crawler/test/digits/7.jpg
/trunk/flax/crawler/test/digits/8.jpg
/trunk/flax/crawler/test/digits/9.jpg
/trunk/flax/crawler/test/digits.php
/trunk/flax/crawler/test/empty.mp3
/trunk/flax/crawler/test/find.html
/trunk/flax/crawler/test/find2.html
/trunk/flax/crawler/test/find3.html
/trunk/flax/crawler/test/index.html
/trunk/flax/crawler/test/meta.html
/trunk/flax/crawler/test/redirect.php
/trunk/flax/crawler/test/robots.txt
/trunk/flax/crawler/test/rss.xml
/trunk/flax/crawler/test/test.doc
/trunk/flax/crawler/test/test.jpg
/trunk/flax/crawler/test/test.pdf
/trunk/flax/crawler/test/test.png
=======================================
--- /dev/null
+++ /trunk/flax/crawler/COPYING Tue Jul 27 03:56:04 2010
@@ -0,0 +1,340 @@
+ GNU GENERAL PUBLIC LICENSE
+ Version 2, June 1991
+
+ Copyright (C) 1989, 1991 Free Software Foundation, Inc.
+ 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+ Preamble
+
+ The licenses for most software are designed to take away your
+freedom to share and change it. By contrast, the GNU General Public
+License is intended to guarantee your freedom to share and change free
+software--to make sure the software is free for all its users. This
+General Public License applies to most of the Free Software
+Foundation's software and to any other program whose authors commit to
+using it. (Some other Free Software Foundation software is covered by
+the GNU Library General Public License instead.) You can apply it to
+your programs, too.
+
+ When we speak of free software, we are referring to freedom, not
+price. Our General Public Licenses are designed to make sure that you
+have the freedom to distribute copies of free software (and charge for
+this service if you wish), that you receive source code or can get it
+if you want it, that you can change the software or use pieces of it
+in new free programs; and that you know you can do these things.
+
+ To protect your rights, we need to make restrictions that forbid
+anyone to deny you these rights or to ask you to surrender the rights.
+These restrictions translate to certain responsibilities for you if you
+distribute copies of the software, or if you modify it.
+
+ For example, if you distribute copies of such a program, whether
+gratis or for a fee, you must give the recipients all the rights that
+you have. You must make sure that they, too, receive or can get the
+source code. And you must show them these terms so they know their
+rights.
+
+ We protect your rights with two steps: (1) copyright the software, and
+(2) offer you this license which gives you legal permission to copy,
+distribute and/or modify the software.
+
+ Also, for each author's protection and ours, we want to make certain
+that everyone understands that there is no warranty for this free
+software. If the software is modified by someone else and passed on, we
+want its recipients to know that what they have is not the original, so
+that any problems introduced by others will not reflect on the original
+authors' reputations.
+
+ Finally, any free program is threatened constantly by software
+patents. We wish to avoid the danger that redistributors of a free
+program will individually obtain patent licenses, in effect making the
+program proprietary. To prevent this, we have made it clear that any
+patent must be licensed for everyone's free use or not licensed at all.
+
+ The precise terms and conditions for copying, distribution and
+modification follow.
+
+ GNU GENERAL PUBLIC LICENSE
+ TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
+
+ 0. This License applies to any program or other work which contains
+a notice placed by the copyright holder saying it may be distributed
+under the terms of this General Public License. The "Program", below,
+refers to any such program or work, and a "work based on the Program"
+means either the Program or any derivative work under copyright law:
+that is to say, a work containing the Program or a portion of it,
+either verbatim or with modifications and/or translated into another
+language. (Hereinafter, translation is included without limitation in
+the term "modification".) Each licensee is addressed as "you".
+
+Activities other than copying, distribution and modification are not
+covered by this License; they are outside its scope. The act of
+running the Program is not restricted, and the output from the Program
+is covered only if its contents constitute a work based on the
+Program (independent of having been made by running the Program).
+Whether that is true depends on what the Program does.
+
+ 1. You may copy and distribute verbatim copies of the Program's
+source code as you receive it, in any medium, provided that you
+conspicuously and appropriately publish on each copy an appropriate
+copyright notice and disclaimer of warranty; keep intact all the
+notices that refer to this License and to the absence of any warranty;
+and give any other recipients of the Program a copy of this License
+along with the Program.
+
+You may charge a fee for the physical act of transferring a copy, and
+you may at your option offer warranty protection in exchange for a fee.
+
+ 2. You may modify your copy or copies of the Program or any portion
+of it, thus forming a work based on the Program, and copy and
+distribute such modifications or work under the terms of Section 1
+above, provided that you also meet all of these conditions:
+
+ a) You must cause the modified files to carry prominent notices
+ stating that you changed the files and the date of any change.
+
+ b) You must cause any work that you distribute or publish, that in
+ whole or in part contains or is derived from the Program or any
+ part thereof, to be licensed as a whole at no charge to all third
+ parties under the terms of this License.
+
+ c) If the modified program normally reads commands interactively
+ when run, you must cause it, when started running for such
+ interactive use in the most ordinary way, to print or display an
+ announcement including an appropriate copyright notice and a
+ notice that there is no warranty (or else, saying that you provide
+ a warranty) and that users may redistribute the program under
+ these conditions, and telling the user how to view a copy of this
+ License. (Exception: if the Program itself is interactive but
+ does not normally print such an announcement, your work based on
+ the Program is not required to print an announcement.)
+
+These requirements apply to the modified work as a whole. If
+identifiable sections of that work are not derived from the Program,
+and can be reasonably considered independent and separate works in
+themselves, then this License, and its terms, do not apply to those
+sections when you distribute them as separate works. But when you
+distribute the same sections as part of a whole which is a work based
+on the Program, the distribution of the whole must be on the terms of
+this License, whose permissions for other licensees extend to the
+entire whole, and thus to each and every part regardless of who wrote it.
+
+Thus, it is not the intent of this section to claim rights or contest
+your rights to work written entirely by you; rather, the intent is to
+exercise the right to control the distribution of derivative or
+collective works based on the Program.
+
+In addition, mere aggregation of another work not based on the Program
+with the Program (or with a work based on the Program) on a volume of
+a storage or distribution medium does not bring the other work under
+the scope of this License.
+
+ 3. You may copy and distribute the Program (or a work based on it,
+under Section 2) in object code or executable form under the terms of
+Sections 1 and 2 above provided that you also do one of the following:
+
+ a) Accompany it with the complete corresponding machine-readable
+ source code, which must be distributed under the terms of Sections
+ 1 and 2 above on a medium customarily used for software interchange; or,
+
+ b) Accompany it with a written offer, valid for at least three
+ years, to give any third party, for a charge no more than your
+ cost of physically performing source distribution, a complete
+ machine-readable copy of the corresponding source code, to be
+ distributed under the terms of Sections 1 and 2 above on a medium
+ customarily used for software interchange; or,
+
+ c) Accompany it with the information you received as to the offer
+ to distribute corresponding source code. (This alternative is
+ allowed only for noncommercial distribution and only if you
+ received the program in object code or executable form with such
+ an offer, in accord with Subsection b above.)
+
+The source code for a work means the preferred form of the work for
+making modifications to it. For an executable work, complete source
+code means all the source code for all modules it contains, plus any
+associated interface definition files, plus the scripts used to
+control compilation and installation of the executable. However, as a
+special exception, the source code distributed need not include
+anything that is normally distributed (in either source or binary
+form) with the major components (compiler, kernel, and so on) of the
+operating system on which the executable runs, unless that component
+itself accompanies the executable.
+
+If distribution of executable or object code is made by offering
+access to copy from a designated place, then offering equivalent
+access to copy the source code from the same place counts as
+distribution of the source code, even though third parties are not
+compelled to copy the source along with the object code.
+
+ 4. You may not copy, modify, sublicense, or distribute the Program
+except as expressly provided under this License. Any attempt
+otherwise to copy, modify, sublicense or distribute the Program is
+void, and will automatically terminate your rights under this License.
+However, parties who have received copies, or rights, from you under
+this License will not have their licenses terminated so long as such
+parties remain in full compliance.
+
+ 5. You are not required to accept this License, since you have not
+signed it. However, nothing else grants you permission to modify or
+distribute the Program or its derivative works. These actions are
+prohibited by law if you do not accept this License. Therefore, by
+modifying or distributing the Program (or any work based on the
+Program), you indicate your acceptance of this License to do so, and
+all its terms and conditions for copying, distributing or modifying
+the Program or works based on it.
+
+ 6. Each time you redistribute the Program (or any work based on the
+Program), the recipient automatically receives a license from the
+original licensor to copy, distribute or modify the Program subject to
+these terms and conditions. You may not impose any further
+restrictions on the recipients' exercise of the rights granted herein.
+You are not responsible for enforcing compliance by third parties to
+this License.
+
+ 7. If, as a consequence of a court judgment or allegation of patent
+infringement or for any other reason (not limited to patent issues),
+conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License. If you cannot
+distribute so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you
+may not distribute the Program at all. For example, if a patent
+license would not permit royalty-free redistribution of the Program by
+all those who receive copies directly or indirectly through you, then
+the only way you could satisfy both it and this License would be to
+refrain entirely from distribution of the Program.
+
+If any portion of this section is held invalid or unenforceable under
+any particular circumstance, the balance of the section is intended to
+apply and the section as a whole is intended to apply in other
+circumstances.
+
+It is not the purpose of this section to induce you to infringe any
+patents or other property right claims or to contest validity of any
+such claims; this section has the sole purpose of protecting the
+integrity of the free software distribution system, which is
+implemented by public license practices. Many people have made
+generous contributions to the wide range of software distributed
+through that system in reliance on consistent application of that
+system; it is up to the author/donor to decide if he or she is willing
+to distribute software through any other system and a licensee cannot
+impose that choice.
+
+This section is intended to make thoroughly clear what is believed to
+be a consequence of the rest of this License.
+
+ 8. If the distribution and/or use of the Program is restricted in
+certain countries either by patents or by copyrighted interfaces, the
+original copyright holder who places the Program under this License
+may add an explicit geographical distribution limitation excluding
+those countries, so that distribution is permitted only in or among
+countries not thus excluded. In such case, this License incorporates
+the limitation as if written in the body of this License.
+
+ 9. The Free Software Foundation may publish revised and/or new versions
+of the General Public License from time to time. Such new versions will
+be similar in spirit to the present version, but may differ in detail to
+address new problems or concerns.
+
+Each version is given a distinguishing version number. If the Program
+specifies a version number of this License which applies to it and "any
+later version", you have the option of following the terms and conditions
+either of that version or of any later version published by the Free
+Software Foundation. If the Program does not specify a version number of
+this License, you may choose any version ever published by the Free Software
+Foundation.
+
+ 10. If you wish to incorporate parts of the Program into other free
+programs whose distribution conditions are different, write to the author
+to ask for permission. For software which is copyrighted by the Free
+Software Foundation, write to the Free Software Foundation; we sometimes
+make exceptions for this. Our decision will be guided by the two goals
+of preserving the free status of all derivatives of our free software and
+of promoting the sharing and reuse of software generally.
+
+ NO WARRANTY
+
+ 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
+FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
+OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
+PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
+OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
+TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
+PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
+REPAIR OR CORRECTION.
+
+ 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
+REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
+INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
+OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
+TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
+YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
+PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGES.
+
+ END OF TERMS AND CONDITIONS
+
+ How to Apply These Terms to Your New Programs
+
+ If you develop a new program, and you want it to be of the greatest
+possible use to the public, the best way to achieve this is to make it
+free software which everyone can redistribute and change under these terms.
+
+ To do so, attach the following notices to the program. It is safest
+to attach them to the start of each source file to most effectively
+convey the exclusion of warranty; and each file should have at least
+the "copyright" line and a pointer to where the full notice is found.
+
+ <one line to give the program's name and a brief idea of what it does.>
+ Copyright (C) <year> <name of author>
+
+ This program is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 2 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program; if not, write to the Free Software
+ Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+
+
+Also add information on how to contact you by electronic and paper mail.
+
+If the program is interactive, make it output a short notice like this
+when it starts in an interactive mode:
+
+ Gnomovision version 69, Copyright (C) year name of author
+ Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
+ This is free software, and you are welcome to redistribute it
+ under certain conditions; type `show c' for details.
+
+The hypothetical commands `show w' and `show c' should show the appropriate
+parts of the General Public License. Of course, the commands you use may
+be called something other than `show w' and `show c'; they could even be
+mouse-clicks or menu items--whatever suits your program.
+
+You should also get your employer (if you work as a programmer) or your
+school, if any, to sign a "copyright disclaimer" for the program, if
+necessary. Here is a sample; alter the names:
+
+ Yoyodyne, Inc., hereby disclaims all copyright interest in the program
+ `Gnomovision' (which makes passes at compilers) written by James Hacker.
+
+ <signature of Ty Coon>, 1 April 1989
+ Ty Coon, President of Vice
+
+This General Public License does not permit incorporating your program into
+proprietary programs. If your program is a subroutine library, you may
+consider it more useful to permit linking proprietary applications with the
+library. If this is what you want to do, use the GNU Library General
+Public License instead of this License.
=======================================
--- /dev/null
+++ /trunk/flax/crawler/README Tue Jul 27 03:56:04 2010
@@ -0,0 +1,17 @@
+============
+flax.crawler
+============
+
+Works with Python 2.6
+
+The module crawler.py is a web crawling engine including an in-memory
+implementation of the data abstraction classes.
+
+The module sql_crawler.py is an SQL database reference implementation of the
+data abstraction classes.
+
+The directory 'test' contains a web site used in the test for crawler.py (see
+the source). For the test to work, a virtual host should be set up on localhost
+so that the URL http://test/ maps to the 'test' directory.
+
+For more information, contact to...@flax.co.uk
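
The "data abstraction classes" mentioned above are plain objects that the engine calls through a small set of methods. As a rough illustration only, the sketch below mirrors the method names of the in-memory URL pool in crawler.py (add_url, check_url, next_url) and its api_lock convention. The class name InMemoryURLPool is hypothetical, URLs are plain strings rather than StdURL instances, robots.txt scheduling is left out, and the code is written to run on current Python rather than being the module's own implementation.

```python
import random
import threading


class DuplicateURL(Exception):
    """Raised when the pool has already seen a URL."""


class InMemoryURLPool(object):
    """Hypothetical stand-in for a URL pool; URLs are plain strings here."""

    # Assumption: callers serialize access by acquiring this lock,
    # following the api_lock convention described in the docstrings.
    api_lock = threading.Lock()

    def __init__(self):
        self._urls = []      # URLs still to be crawled
        self._seen = set()   # every URL ever added

    def add_url(self, url):
        """Queue a URL and remember that it has been seen."""
        self._urls.append(url)
        self._seen.add(url)

    def check_url(self, url):
        """Raise DuplicateURL if the URL was added before."""
        if url in self._seen:
            raise DuplicateURL(url)

    def next_url(self):
        """Remove and return a random queued URL, or None if none are left."""
        if not self._urls:
            return None
        return self._urls.pop(random.randrange(len(self._urls)))
```

A replacement backed by a database, which is the role the README gives to sql_crawler.py, would presumably keep the same method names and swap the list and set for queries.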
=======================================
--- /dev/null
+++ /trunk/flax/crawler/__init__.py Tue Jul 27 03:56:04 2010
@@ -0,0 +1,3 @@
+import crawler
+from stdurl import StdURL
+
=======================================
--- /dev/null
+++ /trunk/flax/crawler/crawler.py Tue Jul 27 03:56:04 2010
@@ -0,0 +1,759 @@
+# Copyright (C) 2010 Lemur Consulting Ltd
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License along
+# with this program; if not, write to the Free Software Foundation, Inc.,
+# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+
+""" Module for web crawling.
+"""
+
+from urllib2 import urlopen, Request, URLError, HTTPError
+from httplib import IncompleteRead
+from robotparser import RobotFileParser
+from time import time, sleep
+from hashlib import md5
+from random import randint
+from re import compile as re_compile, IGNORECASE
+from new import instancemethod
+from threading import Thread, Lock, current_thread
+from inspect import currentframe
+from sys import exc_info, exc_clear
+
+from stdurl import StdURL
+
+
+silent = True # if False, output debug to stdout
+user_agent = "FlaxBot/0.1 (see http://www.flax.co.uk/)"
+default_delay = 4 # default time between requests for a domain
+http_threads = 10 # number of crawler threads
+
+
+class CrawlerError (Exception):
+ """ Base class for errors encountered whilst crawling.
+ """
+ pass
+
+
+class NoRobots (CrawlerError):
+ """ Exception raised when no robots.txt has been fetched for a domain.
+ """
+ pass
+
+
+class URLNotAllowed (CrawlerError):
+ """ Exception raised when a URL is disallowed by robots.txt.
+ """
+ pass
+
+
+class URLNotFollowed (CrawlerError):
+ """ Exception raised when a URL is not followed (downloaded).
+ """
+ pass
+
+
+class DuplicateURL (CrawlerError):
+ """ Exception raised when a duplicate URL is encountered.
+ """
+ pass
+
+
+class DuplicateResource (CrawlerError):
+ """ Exception raised when a duplicate resource is detected.
+ """
+ pass
+
+
+class NotHandled (CrawlerError):
+ """ Exception raised when a parser can not handle a resource.
+ """
+ pass
+
+
+class DefaultDumper (object):
+ """ Default implementation of a dumper, which maintains a count of dumped
+ resources and the total number of characters.
+
+ If the attribute api_lock is a Lock, calls to the API are synchronized.
+ """
+
+ api_lock = Lock()
+
+ def __init__(self):
+ self.count = 0
+ self.chars = 0
+
+ def dump_resource(self, resource):
+ """ Dump a resource.
+ """
+ self.count += 1
+ self.chars += len(resource.content)
+
+
+class DefaultURLPool (object):
+ """ Default implementation of a URL pool, maintaining URLs in memory.
+ Also maintains a dictionary of URL to error encountered.
+
+ If the attribute api_lock is a Lock, calls to the API are synchronized.
+ """
+
+ api_lock = Lock()
+
+ def __init__(self):
+ self._urls = list()
+ self._seen = set()
+ self._robots = list()
+ self.repeat_count = 0
+ self.link_count = 0
+ self.redirect_count = 0
+
+ def add_url(self, url):
+ """ Add a StdURL to the pool.
+ """
+ self._urls.append(url)
+ self._seen.add(url)
+ robots_url = StdURL("http://{0}/robots.txt".format(url.netloc))
+ if robots_url not in self._seen:
+ self._robots.append(robots_url)
+ self._seen.add(robots_url)
+
+ def add_link(self, source, target):
+ """ Add a link between the source StdURL and the target StdURL. Note
+ that add_url is called for the target later if it is to be added to
+ the pool of URLs.
+ """
+ assert source in self._seen
+ assert target is not None
+ self.link_count += 1
+
+ def add_redirect(self, source, target):
+ """ Add a redirect from the source StdURL to the target StdURL.
+ """
+ assert source in self._seen
+ assert target is not None
+ self.redirect_count += 1
+
+ def check_url(self, url):
+ """ If the URL pool has seen the specified StdURL, raise DuplicateURL.
+ """
+ if url in self._seen:
+ self.repeat_count += 1
+ raise DuplicateURL()
+
+ def next_url(self):
+ """ Return a StdURL from the to-do collection. If none are left,
+ return None.
+ """
+ if len(self._robots) > 0:
+ url = self._robots[0]
+ self._robots.remove(url)
+ return url
+ if len(self._urls) == 0:
+ return None
+ url = self._urls[randint(0, len(self._urls) - 1)]
+ self._urls.remove(url)
+ return url
+
+
+class DefaultErrorHandler (object):
+ """ Default implementation of an error handler.
+
+ If the attribute api_lock is a Lock, calls to the API are synchronized.
+ """
+
+ api_lock = Lock()
+
+ def error(self, url, e):
+ """ Record the error, e, against the specified StdURL.
+ """
+ pass
+
+
+class DefaultFollowDecider (object):
+ """ Default implementation of a follow decider (for deciding which URLs to
+ download).
+
+ If the attribute api_lock is a Lock, calls to the API are synchronized.
+ """
+
+ api_lock = Lock()
+
+ def __init__(self, content_type_re, same_domain=False):
+ """ Only follow resources with a content type matching the regexp, and
+ if same_domain is True, only follow link targets on the same domain
+ as the original URL.
+ """
+ self._re = re_compile(content_type_re)
+ self._same_domain = same_domain
+
+ def follow_resource(self, resource):
+ """ If the resource should not be followed, raise URLNotFollowed. This
+ is called twice, once after the HEAD request (when resource.content
+ is None) and again after the GET request.
+ """
+ if resource.content is not None:
+ return
+ if not self._re.match(resource.content_type()):
+ raise URLNotFollowed()
+
+ def follow_url(self, resource, target):
+ """ If the specified URL (an instance of StdURL) should not be
+ followed, raise URLNotFollowed.
+ """
+ if self._same_domain and resource.origin_url.netloc != target.netloc:
+ raise URLNotFollowed()
+
+
+class DefaultDuplicateDetector (object):
+ """ Default implementation of a duplicate detector, using an in-memory
+ set of ETags and content hashes.
+
+ If the attribute api_lock is a Lock, calls to the API are synchronized.
+ """
+
+ api_lock = Lock()
+
+ def __init__(self):
+ self.etags = set()
+ self.hash_set = set()
+
+ def duplicate_resource(self, resource):
+ """ Check a web resource for duplication. This will be called twice,
+ once after the HEAD request (when resource.content is None) and
+ again after the GET request.
+ """
+ if resource.content is None:
+ # check the ETag, if there is one
+ etag = resource.headers.get("ETag")
+ if etag is not None:
+ if etag in self.etags:
+ raise DuplicateResource()
+ self.etags.add(etag)
+ return
+ # now check and update the hash set
+ hasher = md5()
+ hasher.update(resource.content)
+ value = hasher.digest()
+ if value in self.hash_set:
+ raise DuplicateResource()
+ self.hash_set.add(value)
+
+
+class DefaultHtmlParser (object):
+ """ Default implementation of an HTML link parser, using a (not very
+ sophisticated) regular expression.
+
+ If the attribute api_lock is a Lock, calls to the API are synchronized.
+ """
+
+ content_types = ("text/html", "application/xhtml+xml")
+
+ def __init__(self):
+ # note that {0}, {1} are not part of the regular expression
+ re = "<\s*{0}\s+{1}\s*=\s*(\"[^\"]+\"|'[^']+')\s*/?>"
+ self._a = re_compile(re.format("a", "href"), IGNORECASE)
+ self._img = re_compile(re.format("img", "src"), IGNORECASE)
+ meta = "<\s*meta\s+name\s*=\s*(\"robots\"|'robots')\s+" \
+ "content\s*=\s*(\"[^\"]+\"|'[^']+')\s*/?>"
+ self._meta = re_compile(meta, IGNORECASE)
+
+ def parse_resource(self, resource):
+ """ Yield target URLs from the given HTML content, and set noindex and
+ nofollow on the resource if necessary. Raise NotHandled if the
+ resource can not be handled by this parser.
+ """
+ if resource.content_type() not in DefaultHtmlParser.content_types:
+ raise NotHandled()
+ meta = self._meta.findall(resource.content)
+ if len(meta) > 0:
+ content = meta[0][1].lower()
+ if "noindex" in content:
+ resource.noindex = True
+ if "nofollow" in content:
+ resource.nofollow = True
+ for url in self._a.findall(resource.content) + \
+ self._img.findall(resource.content):
+ yield url.strip("\"'")
+
+
+class DefaultThrottle (object):
+ """ Default implementation of a throttle, maintaining an in-memory map from
+ domain to last fetch time.
+
+ If the attribute api_lock is a Lock, calls to the API are synchronized.
+ """
+
+ api_lock = Lock()
+
+ def __init__(self):
+ self.hosts = dict()
+
+ def last_time(self, netloc):
+ """ Return the last time a request was made to the specified domain,
+ and record that a request is being made now. Returns 0 if this is
+ the first request.
+ """
+ t = self.hosts.get(netloc)
+ self.hosts[netloc] = time()
+ return t or 0
+
+
+class DefaultRobotManager (object):
+ """ Default implementation of a manager for robots.txt information.
+ Maintains an in-memory map from netloc to an instance of
+ RobotFileParser.
+
+ If the attribute api_lock is a Lock, calls to the API are synchronized.
+ """
+
+ api_lock = Lock()
+
+ def __init__(self):
+ self._robots = dict()
+
+ def parse_robots(self, netloc, content):
+ """ Parse the given robots.txt content and store against the given
+ domain. If content is None, any URL will be allowed.
+ """
+ robot = RobotFileParser()
+ if content is not None:
+ robot.parse(content.split("\n"))
+ self._robots[netloc] = robot
+
+ def check_robots(self, url):
+ """ If no attempt has yet been made to fetch robots.txt for the domain
+ of the specified StdURL, raise NoRobots. Otherwise, if access to
+ the URL is not allowed according to the stored robots.txt, raise
+ URLNotAllowed. Otherwise, return the crawl delay to use (currently
+ always default_delay; no delay is read from robots.txt).
+ """
+ robot = self._robots.get(url.netloc)
+ if robot is None:
+ raise NoRobots()
+ if not robot.can_fetch(user_agent, url.path):
+ raise URLNotAllowed()
+ return default_delay
+
+
+dump = DefaultDumper()
+pool = DefaultURLPool()
+follow = DefaultFollowDecider(".*")
+duplicate = DefaultDuplicateDetector()
+parsers = (DefaultHtmlParser(), )
+throttle = DefaultThrottle()
+robots = DefaultRobotManager()
+error = DefaultErrorHandler()
+
+
+def _sync(arg, *args):
+ """ Synchronize calls to an instancemethod (defaults to __call__ if an
+ object is passed) using the api_lock attribute of the bound object (or
+ passed object). If api_lock is not set, no synchronization is done.
+ """
+ if isinstance(arg, instancemethod):
+ obj = arg.im_self
+ fn = arg
+ else:
+ obj = arg
+ fn = arg.__call__
+ lock = getattr(obj, "api_lock", None)
+ if lock is not None:
+ lock.acquire()
+ try:
+ return fn(*args)
+ finally:
+ if lock is not None:
+ lock.release()
+
+
+class HTTPResource (object):
+ """ Class for storing the results of a successful HTTP GET.
+ """
+
+ def __init__(self, origin_url, url, headers):
+ self.origin_url = origin_url
+ self.url = url
+ self.headers = headers
+ self.content = None
+ self.noindex = False
+ self.nofollow = False
+
+ def content_type(self):
+ """ Return the primary content type from the Content-Type header.
+ """
+ content_type = self.headers.get("Content-Type")
+ if content_type is None:
+ return None
+ return content_type.split(";")[0]
+
+ def check(self):
+ """ Check for duplicate (redirected) URL and content, and whether to
+ follow (raises exceptions if not).
+ """
+ if self.url != self.origin_url:
+ _sync(pool.check_url, self.url)
+ _sync(duplicate.duplicate_resource, self)
+ _sync(follow.follow_resource, self)
+
+
+class _Courier (Request):
+ """ Class for using urllib2 to request web page content via HTTP.
+ """
+
+ def __init__(self, url):
+ """ Initialise the courier for the given page.
+ """
+ Request.__init__(self, str(url).replace(" ", "%20"))
+ self._method = None
+ self.add_header("User-Agent", user_agent)
+
+ def get_method(self):
+ """ Specifies the method used by urllib2 when making an HTTP request.
+ """
+ return self._method
+
+ def fetch(self, method):
+ """ Send a request of the specified type (HEAD, GET) for the given
+ URL. Returns the HTTP response.
+
+ Can raise URLError, HTTPError or IncompleteRead.
+ """
+ self._method = method
+ return urlopen(self)
+
+
+class CrawlerThread (Thread):
+ """ Worker thread for crawling URLs.
+ """
+
+ def __init__(self):
+ Thread.__init__(self, target=_crawl)
+ self.waiter = None
+ self._lock = Lock()
+ self._lock.acquire()
+
+ def wait(self):
+ """ Make the thread wait for notification.
+ """
+ self._lock.acquire()
+
+ def notify(self):
+ """ Wake the thread.
+ """
+ self._lock.release()
+
+
+def _crawl():
+ """ Crawl URLs obtained from _iter_urls(), scraping and adding to the
+ URL pool as we go.
+ """
+ for url in _iter_urls():
+ _debug("Crawling", url)
+ try:
+ if url.path == '/robots.txt':
+ _get_robots(url)
+ else:
+ _get_url(url)
+ except (CrawlerError, URLError, IncompleteRead) as e:
+ _debug(url)
+ _sync(error.error, url, e)
+ except:
+ # error is not lost - see _debug()
+ stop()
+
+def _get_robots(url):
+ """ Fetch a robots.txt URL and send the content to the robots module.
+ """
+ # initialise the throttle for this domain
+ _sync(throttle.last_time, url.netloc)
+ # make a GET request for robots.txt
+ _debug("HTTP GET", url)
+ courier = _Courier(url)
+ try:
+ response = courier.fetch("GET")
+ except HTTPError as e:
+ if e.code == 404:
+ content = None
+ else:
+ raise
+ else:
+ content = response.read()
+ response.close()
+ # send content to robots for parsing
+ _sync(robots.parse_robots, url.netloc, content)
+
+def _get_url(url):
+ """ Fetch a URL. Note that this is synchronized by the engine such that
+ at most one call per domain (url.netloc) is in progress at a time.
+ """
+ # check robots.txt
+ delay = _sync(robots.check_robots, url) or default_delay
+ # hit the throttle and wait if necessary
+ t = _sync(throttle.last_time, url.netloc)
+ wait = t + delay - time()
+ if wait > 0:
+ _debug("Sleep for", wait)
+ sleep(wait)
+ _sync(throttle.last_time, url.netloc)
+ # make a HEAD request to check the headers
+ courier = _Courier(url)
+ _debug("HTTP HEAD", url)
+ response = courier.fetch("HEAD")
+ resource = HTTPResource(url, StdURL(response.url), response.headers)
+ # check for a redirect
+ if resource.url != resource.origin_url:
+ _sync(pool.add_redirect, resource.origin_url, resource.url)
+ # check whether to reject on (redirected) URL, headers or content type
+ resource.check()
+ # make a GET request for the resource, and replace details just in case
+ _debug("HTTP GET", url)
+ response = courier.fetch("GET")
+ resource.url = StdURL(response.url)
+ resource.headers = response.headers
+ resource.content = response.read()
+ response.close()
+ # check whether to reject on (redirected) URL, headers or content
+ resource.check()
+ # attempt to parse the content
+ parent = StdURL(resource.url)
+ for parser in parsers:
+ try:
+ for rel_url in _sync(parser.parse_resource, resource):
+ target = StdURL(rel_url, parent)
+ try:
+ if target.scheme != parent.scheme:
+ raise URLNotFollowed()
+ except URLNotFollowed:
+ _debug(target)
+ continue
+ _sync(pool.add_link, url, target)
+ try:
+ _sync(pool.check_url, target)
+ except DuplicateURL:
+ _debug(target)
+ continue
+ try:
+ _sync(follow.follow_url, resource, target)
+ except URLNotFollowed:
+ _debug(target)
+ else:
+ if not resource.nofollow:
+ _sync(pool.add_url, target)
+ except NotHandled:
+ exc_clear()
+ else:
+ break
+ # dump the resource (if allowed)
+ if not resource.noindex:
+ _debug("Dump", url, resource.content_type())
+ _sync(dump.dump_resource, resource)
+
+_debug_lock = Lock()
+def _debug(*args):
+ """ Output a log line using the given arguments, including exception
+ details if called in the context of an 'except' clause.
+ """
+ ty, e, tb = exc_info()
+ if e is None and silent:
+ return
+ if ty is not None:
+ while tb.tb_next is not None:
+ tb = tb.tb_next
+ f = tb.tb_frame
+ lineno = tb.tb_lineno
+ exc_clear()
+ else:
+ frame = currentframe()
+ f = frame.f_back
+ lineno = f.f_lineno
+ _debug_lock.acquire()
+ print round(time() - t0, 1), current_thread().name,
+ print "{0}:{1} {2}".format(f.f_code.co_filename, lineno, f.f_code.co_name),
+ print "|",
+ if e is not None:
+ print e.__class__.__name__,
+ if len(str(e)) > 0:
+ print "({0})".format(str(e)),
+ print " ".join([str(x) for x in args])
+ _debug_lock.release()
+
+
+_threads = list()
+_waiters = set() # locks for blocking workers on the next URL
+_lock = Lock() # for locking waiter and domain map code
+_domain_map = dict() # for looking up threads working on a domain
+_halt = False # if a thread sets this to True, stop everything gracefully
+
+t0 = 0
+
+def start():
+ """ Start the crawler, returning when all crawler threads have terminated.
+ """
+ global t0
+ t0 = time()
+ for _ in xrange(http_threads):
+ _threads.append(CrawlerThread())
+ for thread in _threads:
+ thread.start()
+ while len(_threads) > 0:
+ thread = _threads[0]
+ thread.join()
+ _threads.remove(thread)
+
+def stop():
+ """ Gracefully stop the crawler prematurely. Crawler threads will still
+ operate until all URLs handed out by the URL pool have been handled.
+ """
+ global _halt
+ _halt = True
+ _debug("Stop")
+
+def _wait():
+ """ Cause the current thread to wait, or if all threads are waiting
+ notify them, so that the engine can halt gracefully.
+ """
+ waiter = Lock()
+ waiter.acquire()
+ _waiters.add(waiter)
+ if len(_waiters) == len(_threads):
+ for waiter in _waiters:
+ waiter.release()
+ else:
+ _lock.release()
+ waiter.acquire()
+ _lock.acquire()
+
+def _iter_urls():
+ """ Yield URLs obtained from the URL Pool module, causing threads to
+ wait when no more URLs are available, and wait on threads fetching a URL
+ from the same domain.
+ """
+ # workers call this function; _lock ensures only one at a time
+ _lock.acquire()
+ while len(_waiters) < len(_threads):
+ _lock.release()
+ url = _sync(pool.next_url) if not _halt else None
+ _lock.acquire()
+ if url is None:
+ _wait() # wait for a URL to be ready
+ continue
+ if len(_waiters) > 0:
+ _waiters.pop().release()
+ # check domain map for worker threads already working the domain
+ if url.netloc in _domain_map:
+ # add me as a waiter for the working thread, and make me wait
+ thread = _domain_map[url.netloc]
+ while thread.waiter is not None:
+ thread = thread.waiter
+ thread.waiter = current_thread()
+ _debug("Add as waiter for", thread.name)
+ _lock.release()
+ current_thread().wait()
+ _lock.acquire()
+ else:
+ # add me to the domain map
+ _domain_map[url.netloc] = current_thread()
+ _lock.release()
+ yield url
+ _lock.acquire()
+ # remove me from the domain map, and wake and promote a waiter
+ if current_thread().waiter is not None:
+ _domain_map[url.netloc] = current_thread().waiter
+ _debug("Wake", current_thread().waiter.name)
+ current_thread().waiter.notify()
+ current_thread().waiter = None
+ else:
+ del _domain_map[url.netloc]
+ _lock.release()
+
+if __name__ == "__main__":
+ from sys import argv
+
+ if "-v" in argv[1:]:
+ silent = False
+ if "-q" in argv[1:]:
+ default_delay = 0
+
+ class DomainFollowDecider (DefaultFollowDecider):
+ """ Test implementation.
+ """
+
+ def __init__(self, content_type_re, domain_set):
+ DefaultFollowDecider.__init__(self, content_type_re)
+ self.domain_set = domain_set
+
+ def follow_resource(self, resource):
+ """ Do not follow URLs that are outside the set of allowed domains.
+ """
+ DefaultFollowDecider.follow_resource(self, resource)
+ if resource.origin_url.netloc not in self.domain_set:
+ raise URLNotFollowed()
+
+ class TestErrorHandler (DefaultErrorHandler):
+ """ Test implementation.
+ """
+
+ api_lock = Lock()
+
+ def __init__(self):
+ DefaultErrorHandler.__init__(self)
+ self.count = 0
+ self.by_type = dict()
+
+ def error(self, url, e):
+ """ Add the URL to a set held against the error type.
+ """
+ self.count += 1
+ if type(e) not in self.by_type:
+ self.by_type[type(e)] = set()
+ self.by_type[type(e)].add(url)
+
+ def __str__(self):
+ """ Return a summary of the logged errors as a string.
+ """
+ s = "{0} logged errors\n".format(self.count)
+ for e_type, e_set in self.by_type.iteritems():
+ for url in e_set:
+ s += "{0} {1}\n".format(e_type.__name__, url)
+ return s
+
+ follow = DomainFollowDecider("^text/html$|^image/.*",
+ set(["test", "doesnotexist.net.uk"]))
+ error = TestErrorHandler()
+ pool.add_url(StdURL("http://test/"))
+
+ start()
+
+ if not silent:
+ print
+ print error
+
+ for u, y in (("http://test/does_not_exist.html", HTTPError),
+ ("http://doesnotexist.net.uk/page.html", NoRobots),
+ ("http://test/empty.mp3", URLNotFollowed),
+ ("http://test/test.jpg", URLNotAllowed)):
+ assert StdURL(u) in error.by_type[y]
+ assert dump.count == 12
+ assert error.count == 10
+ assert pool.repeat_count == 3
+ assert pool.link_count == 23
+ assert pool.redirect_count == 1
+ assert len(duplicate.etags) == 8
+ assert len(duplicate.hash_set) == 13
+ assert dump.chars == 542791
+ # test throttle
+ count = dump.count + error.count - pool.repeat_count
+ print time() - t0, default_delay * count
+ assert time() - t0 >= default_delay * count
+ print "Test passed"
+
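The `_sync` helper in crawler.py serialises calls through an optional `api_lock` attribute on the object that owns the callable. A minimal sketch of the same idea for Python 3, where bound methods expose `__self__` rather than `im_self`. The names `sync_call` and `Counter` are hypothetical and not part of the crawler, and `nullcontext` assumes Python 3.7 or later:

```python
import threading
from contextlib import nullcontext


def sync_call(target, *args):
    """Call target(*args) while holding the owner's api_lock, if it has one."""
    # For a bound method the lock lives on the instance; for any other
    # callable it is looked up on the callable object itself.
    owner = getattr(target, "__self__", target)
    lock = getattr(owner, "api_lock", None)
    # With no api_lock the call runs unsynchronised, as in the original.
    with (lock if lock is not None else nullcontext()):
        return target(*args)


class Counter(object):
    """Hypothetical sub-module whose calls must not interleave."""

    api_lock = threading.Lock()

    def __init__(self):
        self.value = 0

    def bump(self, n):
        # read-modify-write that is only safe under api_lock
        current = self.value
        self.value = current + n
        return self.value


if __name__ == "__main__":
    counter = Counter()
    print(sync_call(counter.bump, 2))
    print(sync_call(len, [1, 2, 3]))
```

Using a `with` block means the lock is released on every exit path, which is what the `try`/`finally` in `_sync` achieves by hand.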
=======================================
--- /dev/null
+++ /trunk/flax/crawler/sql_crawler.py Tue Jul 27 03:56:04 2010
@@ -0,0 +1,427 @@
+# Copyright (C) 2010 Lemur Consulting Ltd
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License along
+# with this program; if not, write to the Free Software Foundation, Inc.,
+# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+
+""" Reference implementation of crawler, storing URLs in an SQL database
+ (using sqlite3). Note (at least):
+
+ * It's quite inefficient - SQL operations should be batched, results cached
+ * No error handling - e.g. URLs are abandoned if they raise IncompleteRead
+ * The default HTML parser doesn't understand meta redirects
+"""
+
+import crawler
+from crawler import DefaultFollowDecider, DefaultHtmlParser, URLNotAllowed, \
+ NoRobots, DuplicateResource, DuplicateURL, URLNotFollowed,\
+ _debug
+from robotparser import RobotFileParser
+from time import time
+from pickle import dumps, loads
+from threading import Lock
+from hashlib import md5
+from sqlite3 import connect, Row, DatabaseError, Binary
+from os import unlink
+from os.path import isfile
+
+from stdurl import StdURL
+
+
+schema = """
+CREATE TABLE content (id INTEGER PRIMARY KEY AUTOINCREMENT,
+ url_id INTEGER NOT NULL UNIQUE,
+ content BLOB NOT NULL,
+ hash VARCHAR(32));
+CREATE TABLE domain (id INTEGER PRIMARY KEY AUTOINCREMENT,
+ netloc VARCHAR(256) NOT NULL UNIQUE,
+ robots BLOB,
+ time INTEGER DEFAULT 0);
+CREATE TABLE header (id INTEGER PRIMARY KEY AUTOINCREMENT,
+ url_id INTEGER NOT NULL,
+ name VARCHAR(4096),
+ value VARCHAR(4096));
+CREATE TABLE link (id INTEGER PRIMARY KEY AUTOINCREMENT,
+ source_id INTEGER NOT NULL,
+ target_id INTEGER NOT NULL,
+ count INTEGER NOT NULL);
+CREATE TABLE redirect (id INTEGER PRIMARY KEY AUTOINCREMENT,
+ source_id INTEGER NOT NULL UNIQUE,
+ target_id INTEGER NOT NULL);
+CREATE TABLE url (id INTEGER PRIMARY KEY AUTOINCREMENT,
+ domain_id INTEGER NOT NULL,
+ url VARCHAR(4096) NOT NULL UNIQUE,
+ time INTEGER);
+CREATE TABLE error (id INTEGER PRIMARY KEY AUTOINCREMENT,
+ url_id INTEGER NOT NULL UNIQUE,
+ type VARCHAR(64),
+ error VARCHAR(4096));
+CREATE UNIQUE INDEX link_idx ON link (source_id, target_id);
+CREATE UNIQUE INDEX redirect_idx ON redirect (source_id, target_id);
+"""
+
+class NoRow (Exception):
+ """ Exception raised when a SELECT statement yields no rows.
+ """
+
+ def __init__(self, statement):
+ Exception.__init__(self)
+ self.statement = statement
+
+ def __str__(self):
+ return "No matching row found: {0}".format(self.statement)
+
+
+class SQLImplementation (object):
+ """ SQL implementation of sub-module APIs.
+ """
+
+ api_lock = Lock()
+
+ def __init__(self, path):
+ """ Open a connection to the sqlite3 database at the given path.
+ """
+ self.db = connect(path, check_same_thread=False)
+ self.db.row_factory = Row
+ self.cursor = self.db.cursor()
+
+ def execute(self, statement, *args):
+ """ Execute the given SQL statement, substituting the remaining
+ arguments into the statement. Returns the id of the inserted
+ row, if any.
+ """
+ try:
+ self.cursor.execute(statement, args)
+ self.db.commit()
+ return self.cursor.lastrowid
+ except DatabaseError as e:
+ _debug(statement)
+ raise e
+
+ def select(self, statement, *args):
+ """ Execute the given SELECT SQL statement, substituting the
+ remaining arguments into the statement. Returns the columns of the
+ first result row only (or the bare value if there is a single column),
+ or raises NoRow if no matching rows are found.
+ """
+ try:
+ self.cursor.execute(statement, args)
+ result = self.cursor.fetchone()
+ if result is None:
+ raise NoRow(statement)
+ if len(result) == 1:
+ return result[0]
+ return result
+ except DatabaseError as e:
+ _debug(statement)
+ raise e
+
+ def select_iter(self, statement, *args):
+ """ Execute the given SELECT SQL statement, substituting the
+ remaining arguments into the statement. Yields result rows as sqlite3 Row objects.
+ """
+ try:
+ self.cursor.execute(statement, args)
+ for result in self.cursor.fetchall():
+ yield result
+ except DatabaseError as e:
+ _debug(statement)
+ raise e
+
+ def close(self):
+ """ Close the database connection.
+ """
+ self.db.close()
+
+ def initialise(self):
+ """ Create SQL tables.
+ """
+ self.cursor.executescript(schema)
+
+ def _select_url(self, url, select_time=False):
+ """ Select a URL id from the database. If select_time, include the
+ time column.
+ """
+ fs = "id" if not select_time else "id, time"
+ return self.select("SELECT " + fs + " FROM url WHERE url=?", str(url))
+
+ def _insert_url(self, url, t=None):
+ """ Insert a URL into the database. If t is specified, set the URL
+ time to that value. Returns the database id of the URL.
+ """
+ assert url.netloc != ""
+ try:
+ domain_id = self.select("SELECT id FROM domain WHERE netloc=?",
+ url.netloc)
+ except NoRow:
+ domain_id = self.execute("INSERT INTO domain(netloc) VALUES (?)",
+ url.netloc)
+ return self.execute("INSERT INTO url(domain_id, url, time) " \
+ "VALUES (?, ?, ?)", domain_id, str(url), t)
+
+ def dump_resource(self, resource):
+ """ Dump the resource to the database. Check for redirects.
+ """
+ url_id = self._select_url(resource.url)
+ for name, value in resource.headers.items():
+ self.execute("INSERT INTO header(url_id, name, value) " \
+ "VALUES (?, ?, ?)", url_id, name, value)
+ content = Binary(resource.content)
+ self.execute("INSERT INTO content(url_id, content, hash) " \
+ "VALUES (?, ?, ?)", url_id, content, resource.hash)
+
+ def add_url(self, url):
+ """ Add a URL, referencing the domain (domain is created if it
+ does not already exist).
+ """
+ try:
+ url_id, t = self._select_url(url, select_time=True)
+ except NoRow:
+ self._insert_url(url, 0)
+ else:
+ if t is not None:
+ raise DuplicateURL()
+ self.execute("UPDATE url SET time=0 WHERE id=?", url_id)
+
+ def add_link(self, source, target):
+ """ Add a link by referencing the source and target URLs.
+ """
+ source_id = self._select_url(source)
+ try:
+ target_id = self._select_url(target)
+ except NoRow:
+ target_id = self._insert_url(target)
+ try:
+ link_id, count = self.select("SELECT id, count FROM link " \
+ "WHERE source_id=? " \
+ "AND target_id=?",
+ source_id, target_id)
+ except NoRow:
+ self.execute("INSERT INTO link(source_id, target_id, count) " \
+ "VALUES (?, ?, 1)", source_id, target_id)
+ else:
+ self.execute("UPDATE link SET count=? WHERE id=?",
+ count + 1, link_id)
+
+ def add_redirect(self, source, target):
+ """ Add a redirect by referencing the source and target URLs.
+ """
+ orig_id = self._select_url(source)
+ try:
+ url_id = self._select_url(target)
+ except NoRow:
+ url_id = self._insert_url(target)
+ self.execute("INSERT INTO redirect(source_id, target_id) " \
+ "VALUES (?, ?)", orig_id, url_id)
+
+ def check_url(self, url):
+ """ Check whether a URL has been seen by lookup into the URL table.
+ """
+ try:
+ _, t = self._select_url(url, select_time=True)
+ if t is not None:
+ raise DuplicateURL()
+ except NoRow:
+ pass
+
+ def next_url(self):
+ """ Return a URL to fetch - try to return a URL referencing the
+ domain with the oldest 'time'. Update the 'time' on the returned URL with
+ the current timestamp, and only return URLs whose time is 0 (queued).
+ If no request has yet been made to the domain of the URL, return the
+ robots.txt URL instead (recording the current timestamp on the domain).
+ """
+ global limit
+ if limit is not None:
+ if limit == 0:
+ return None
+ else:
+ limit -= 1
+ try:
+ domain_id, url_id, url = self.select("SELECT domain_id, " \
+ "url.id, url FROM domain, url WHERE url.domain_id=domain.id " \
+ "AND url.time=0 ORDER BY domain.time LIMIT 1")
+ except NoRow:
+ return None
+ url = StdURL(url)
+ try:
+ self.select("SELECT time FROM domain WHERE id=? AND time > 0",
+ domain_id)
+ except NoRow:
+ url = StdURL("http://{0}/robots.txt".format(url.netloc))
+ self.execute("UPDATE domain SET time=? WHERE id=?",
+ int(time()), domain_id)
+ else:
+ self.execute("UPDATE url SET time=? WHERE id=?",
+ int(time()), url_id)
+ return url
+
+ def error(self, url, e):
+ """ Record the error against the URL.
+ """
+ url_id = self._select_url(url)
+ self.execute("INSERT INTO error(url_id, type, error) VALUES (?, ?, ?)",
+ url_id, e.__class__.__name__, str(e))
+
+ def duplicate_resource(self, resource):
+ """ Check a web resource for duplication.
+ """
+ if resource.content is None:
+ # check the ETag, if there is one
+ etag = resource.headers.get("ETag")
+ if etag is not None:
+ try:
+ self.select("SELECT id FROM header WHERE name=? " \
+ "AND value=?", "ETag", etag)
+ raise DuplicateResource()
+ except NoRow:
+ pass
+ return
+ # check the hash
+ hasher = md5()
+ hasher.update(resource.content)
+ resource.hash = hasher.hexdigest()
+ try:
+ self.select("SELECT id FROM content WHERE hash=?", resource.hash)
+ raise DuplicateResource()
+ except NoRow:
+ pass
+
+ def last_time(self, netloc):
+ """ Return the last time a request was made to the specified
+ domain, and record that a request is being made now. Returns 0 if
+ this is the first request.
+ """
+ domain_id, t = self.select("SELECT id, time FROM domain " \
+ "WHERE netloc=?", netloc)
+ self.execute("UPDATE domain SET time=? WHERE id=?",
+ int(time()), domain_id)
+ return t
+
+ def parse_robots(self, netloc, content):
+ """ Parse the given robots.txt content and store against the given
+ domain. If content is None, any URL will be allowed.
+ """
+ robot = RobotFileParser()
+ if content is not None:
+ robot.parse(content.split("\n"))
+ self.execute("UPDATE domain SET robots=? WHERE netloc=?",
+ dumps(robot), netloc)
+
+ def check_robots(self, url):
+ """ If no attempt has yet been made to fetch robots.txt for the
+ domain of the specified URL, raise NoRobots. Otherwise, if access to the
+ specified URL is not allowed according to the stored robots.txt, raise
+ URLNotAllowed. Otherwise, return the delay to apply (currently always
+ crawler.default_delay; a crawl delay in robots.txt is not read).
+ """
+ robots = self.select("SELECT robots FROM domain WHERE netloc=?",
+ url.netloc)
+ if robots is None:
+ raise NoRobots()
+ robot = loads(str(robots))
+ if not robot.can_fetch(crawler.user_agent, url.path):
+ raise URLNotAllowed()
+ return crawler.default_delay
+
+ def stats(self):
+ """ Output database stats to stdout.
+ """
+ n = self.select("SELECT COUNT(*) FROM content")
+ print n, "URLs downloaded"
+ n_e = self.select("SELECT COUNT(*) FROM error WHERE type='HTTPError'")
+ print n_e, "HTTP errors:"
+ for source, target, e in self.select_iter("SELECT src.url, tgt.url, " \
+ "error FROM error, url tgt, link, url src WHERE type='HTTPError' "\
+ "AND tgt.id=error.url_id AND tgt.id=target_id " \
+ "AND src.id=source_id"):
+ print "{0} => {1} ({2})".format(source, target, e)
+
+
+if __name__ == "__main__":
+ from sys import argv
+
+ class SingleURLFollower:
+ """ Trivial follow decider that only allows the initial URL.
+ """
+ def __init__(self, url):
+ self.url = StdURL(url)
+
+ def follow_url(self, resource, target):
+ """ Follows no target URLs from links.
+ """
+ assert resource is not None
+ assert target is not None
+ raise URLNotFollowed()
+
+ def follow_resource(self, resource):
+ """ Allow only the initial URL to be followed.
+ """
+ if resource.origin_url != self.url:
+ raise URLNotFollowed()
+
+ if "-v" in argv[1:]:
+ crawler.silent = False
+ if "-q" in argv[1:]:
+ crawler.default_delay = 0
+ if "-t" in argv[1:]:
+ crawler.http_threads = 1
+ initialise = "-i" in argv[1:]
+ domain = "-d" in argv[1:]
+ single_url = "-u" in argv[1:]
+ limit = 5 if "-l" in argv[1:] else None
+ stats = "-s" in argv[1:]
+
+ for arg in argv[1:]:
+ if arg[0] == "-":
+ argv.remove(arg)
+
+ if len(argv) < (3 if not stats else 2):
+ print """Usage: [-v|-q|-i|-d|-u|-l|-t|-s] <db path> <initial URL>
+
+Flags: -v Output debug messages
+ -q Set default delay to 0
+ -i Initialise database (erases any data)
+ -d Do not follow links out of initial domain
+ -u Do not follow any URLs (single URL mode)
+ -l Limit the number of URLs crawled to 5
+ -t Run only one crawler thread
+ -s Don't crawl, but output database stats
+"""
+ exit()
+
+ if initialise and isfile(argv[1]):
+ unlink(argv[1])
+
+ sql = SQLImplementation(argv[1])
+
+ if initialise:
+ sql.initialise()
+ if stats:
+ sql.stats()
+ else:
+ crawler.dump = sql
+ crawler.pool = sql
+ crawler.dns = sql
+ crawler.follow = DefaultFollowDecider("^text/html$|^image/.*", domain)\
+ if not single_url else SingleURLFollower(argv[2])
+ crawler.duplicate = sql
+ crawler.parsers = (DefaultHtmlParser(), )
+ crawler.throttle = sql
+ crawler.robots = sql
+ crawler.error = sql
+ sql.add_url(StdURL(argv[2]))
+ crawler.start()
+
+ sql.close()
+
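The module docstring of sql_crawler.py notes that SQL operations should be batched. `add_link` currently issues a SELECT followed by an INSERT or UPDATE, committing after each statement. A minimal sketch of one way to do the count update in a single transaction, written for Python 3's `sqlite3`. The function name `bump_link` is hypothetical, and the schema fragment is copied from the `link` table in the diff:

```python
import sqlite3

# The link table and its unique index, as defined in sql_crawler.py.
LINK_SCHEMA = """
CREATE TABLE link (id INTEGER PRIMARY KEY AUTOINCREMENT,
                   source_id INTEGER NOT NULL,
                   target_id INTEGER NOT NULL,
                   count INTEGER NOT NULL);
CREATE UNIQUE INDEX link_idx ON link (source_id, target_id);
"""


def bump_link(db, source_id, target_id):
    """Record one more link from source_id to target_id in one transaction."""
    with db:  # commits once on success, rolls back if a statement fails
        cur = db.execute(
            "UPDATE link SET count = count + 1 "
            "WHERE source_id=? AND target_id=?", (source_id, target_id))
        if cur.rowcount == 0:
            # no existing row was updated, so this is the first such link
            db.execute(
                "INSERT INTO link(source_id, target_id, count) "
                "VALUES (?, ?, 1)", (source_id, target_id))


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.executescript(LINK_SCHEMA)
    for source, target in ((1, 2), (1, 2), (1, 3)):
        bump_link(db, source, target)
    print(db.execute("SELECT source_id, target_id, count FROM link "
                     "ORDER BY target_id").fetchall())
    db.close()
```

The increment happens inside the database, so the read-then-write gap in the original only matters here if several connections write at once, which this sketch does not address.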
=======================================
--- /dev/null
+++ /trunk/flax/crawler/stdurl.py Tue Jul 27 03:56:04 2010
@@ -0,0 +1,113 @@
+# Copyright (C) 2010 Lemur Consulting Ltd
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License along
+# with this program; if not, write to the Free Software Foundation, Inc.,
+# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+
+""" Module including a standard URL class.
+"""
+
+from urlparse import urljoin, urlsplit
+
+
+def stdurl(raw_url):
+ """ Convert the argument to a StdURL.
+ """
+ if raw_url is None or raw_url == "":
+ return None
+ return StdURL(raw_url)
+
+
+class StdURL (object):
+ """Class representing a URL, the scheme of which is assumed to be HTTP.
+ """
+
+ def __init__(self, url, parent=None):
+ """ Create a StdURL instance for the given URL.
+
+ If parent is specified, then the URL is resolved relative to it.
+ """
+ if parent is not None:
+ # resolve the url relative to the parent
+ if isinstance(parent, StdURL):
+ parent = str(parent)
+ if isinstance(url, StdURL):
+ url = str(url)
+ url = urljoin(parent, url)
+ # in case a StdURL is passed as the url argument (copy)
+ if isinstance(url, StdURL):
+ parts = url
+ else:
+ parts = urlsplit(url.strip())
+ # pull out most parts from urlsplit
+ self.scheme = parts.scheme
+ self.hostname = parts.hostname
+ self.port = parts.port
+ self.netloc = parts.netloc
+ self.path = parts.path
+ self.query = parts.query
+ # compute some extra properties based on above
+ if self.path.find(".") == -1:
+ self.extension = ""
+ else:
+ self.extension = self.path.split(".")[-1]
+ query_str = "?{0}".format(self.query) if self.query else ""
+ self.selector = "{0}{1}".format(self.path, query_str)
+
+ def __eq__(self, other):
+ """ Two StdURL instances are equal if they have the same scheme,
+ host, port, path and query.
+ """
+ if other is None:
+ return False
+ return self.scheme == other.scheme and self.netloc == other.netloc \
+ and self.selector == other.selector
+
+ def __ne__(self, other):
+ """ Two StdURL instances are unequal if they differ in scheme,
+ host, port, path or query.
+ """
+ if other is None:
+ return True
+ return self.scheme != other.scheme or self.netloc != other.netloc \
+ or self.selector != other.selector
+
+ def __hash__(self):
+ """ Returns a suitable hash value for the StdURL.
+ """
+ return hash(str(self))
+
+ def __str__(self):
+ """ Ignore any URL fragment (#foo) for the string representation.
+ """
+ return "{0}://{1}{2}".format(self.scheme, self.netloc, self.selector)
+
+
+if __name__ == "__main__":
+ url1 = StdURL("http://www.google.com")
+ url2 = StdURL("http://www.google.com")
+ url3 = StdURL("http://www.google.co.uk")
+ assert url1.scheme == "http"
+ assert url1 == url2
+ assert not url1 != url2
+ assert url2 != url3
+ assert not url2 == url3
+ assert url1 != None
+ s = set([url1, url2, url3])
+ assert len(s) == 2
+ url4 = StdURL("http://www.abc.com/")
+ assert StdURL("mailto:a...@foo.com", url4).scheme == "mailto"
+ assert StdURL("foo.html", url4) == StdURL("http://www.abc.com/foo.html")
+ assert StdURL("http://foo", url4) == StdURL("http://foo")
+ print "TEST PASSED"
+
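`StdURL` resolves a URL against an optional parent and drops the fragment when producing its string form. A minimal sketch of that normalisation for Python 3, where `urljoin` and `urlsplit` live in `urllib.parse` rather than `urlparse`. The function name `canonical` is hypothetical and not part of stdurl.py:

```python
from urllib.parse import urljoin, urlsplit


def canonical(url, parent=None):
    """Return scheme://netloc/path?query for url, ignoring any #fragment."""
    if parent is not None:
        # resolve relative references against the parent URL
        url = urljoin(parent, url)
    parts = urlsplit(url.strip())
    # the "selector" is the path plus the query string, if there is one
    selector = parts.path + ("?" + parts.query if parts.query else "")
    return "{0}://{1}{2}".format(parts.scheme, parts.netloc, selector)


if __name__ == "__main__":
    print(canonical("foo.html#top", "http://www.abc.com/"))
    print(canonical("digits.php?1", "http://test/index.html"))
```

Two URLs that differ only in their fragment map to the same string, which is the property `__eq__` and `__hash__` rely on.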
=======================================
--- /dev/null
+++ /trunk/flax/crawler/test/digits/1.jpg Tue Jul 27 03:56:04 2010
Binary file, no diff available.
=======================================
--- /dev/null
+++ /trunk/flax/crawler/test/digits/2.jpg Tue Jul 27 03:56:04 2010
Binary file, no diff available.
=======================================
--- /dev/null
+++ /trunk/flax/crawler/test/digits/3.jpg Tue Jul 27 03:56:04 2010
Binary file, no diff available.
=======================================
--- /dev/null
+++ /trunk/flax/crawler/test/digits/4.jpg Tue Jul 27 03:56:04 2010
Binary file, no diff available.
=======================================
--- /dev/null
+++ /trunk/flax/crawler/test/digits/5.jpg Tue Jul 27 03:56:04 2010
Binary file, no diff available.
=======================================
--- /dev/null
+++ /trunk/flax/crawler/test/digits/6.jpg Tue Jul 27 03:56:04 2010
Binary file, no diff available.
=======================================
--- /dev/null
+++ /trunk/flax/crawler/test/digits/7.jpg Tue Jul 27 03:56:04 2010
Binary file, no diff available.
=======================================
--- /dev/null
+++ /trunk/flax/crawler/test/digits/8.jpg Tue Jul 27 03:56:04 2010
Binary file, no diff available.
=======================================
--- /dev/null
+++ /trunk/flax/crawler/test/digits/9.jpg Tue Jul 27 03:56:04 2010
Binary file, no diff available.
=======================================
--- /dev/null
+++ /trunk/flax/crawler/test/digits.php Tue Jul 27 03:56:04 2010
@@ -0,0 +1,7 @@
+<?php
+
+header("Content-type: image/jpg");
+echo file_get_contents("digits/" . $_SERVER['QUERY_STRING'] . ".jpg");
+
+?>
+
=======================================
--- /dev/null
+++ /trunk/flax/crawler/test/find.html Tue Jul 27 03:56:04 2010
@@ -0,0 +1,11 @@
+<html>
+ <head>
+ <title>SES Test site</title>
+ </head>
+ <body>
+ <large>FIND ME</large>
+ <p>This is some text for testing the HTML strings code.</p>
+ <p>This is another paragraph of test text for the HTML strings code.</p>
+ </body>
+</html>
+
=======================================
--- /dev/null
+++ /trunk/flax/crawler/test/find2.html Tue Jul 27 03:56:04 2010
@@ -0,0 +1,10 @@
+<html>
+ <head>
+ <title>SES Test site</title>
+ </head>
+ <body>
+ <large>FIND ME, TOO</large>
+ <p>An interesting term</p>
+ </body>
+</html>
+
=======================================
--- /dev/null
+++ /trunk/flax/crawler/test/find3.html Tue Jul 27 03:56:04 2010
@@ -0,0 +1,10 @@
+<html>
+ <head>
+ <title>SES Test site</title>
+ </head>
+ <body>
+ <large>FIND ME, TOO</large>
+ <p>A perplexing term</p>
+ </body>
+</html>
+
=======================================
--- /dev/null
+++ /trunk/flax/crawler/test/index.html Tue Jul 27 03:56:04 2010
@@ -0,0 +1,47 @@
+<html>
+ <head>
+ <title>SES Test site</title>
+ <meta name="robots" content="INDEX, FOLLOW">
+ </head>
+ <body>
+ <a href="http://test/find.html">Test allowed_hosts - should follow</a>
+ <br>
+ <a href="http://www.flax.co.uk/">Test allowed_hosts - should NOT follow</a>
+ <br>
+ Should follow all of these:
+ <a href="digits.php?1">1</a>
+ <a href="digits.php?2">2</a>
+ <a href="digits.php?3">3</a>
+ <a href="digits.php?4">4</a>
+ <a href="digits.php?5">5</a>
+ <a href="digits.php?6">6</a>
+ <a href="digits.php?7">7</a>
+ <a href="digits.php?8">8</a>
+ <a href="digits.php?9">9</a>
+ <br>
+ Should not crawl this image <img src="test.jpg">
+ <br>
+ Should crawl this image <img src="test.png">
+ <br>
+ Two of the same non-existent URL
+ <a href="does_not_exist.html">404</a>
+ <a href="does_not_exist.html">404</a>
+ <br>
+ <a href="http://doesnotexist.net.uk/page.html">Server Not Found</a>
+ <br>
+ <a href="empty.mp3">Should not crawl based on content type</a>
+ <br>
+ <a href="meta.html">META ROBOTS test</a>
+ <br>
+ <a href="test.doc">Word document test</a>
+ <br>
+ <a href="test.pdf">PDF test</a>
+ <br>
+ <a href="rss.xml">Here is an RSS feed</a>
+ <br>
+ <a href="mailto:webmaster@localhost">Mail</a>
+ <br>
+ <a href="redirect.php">Redirect</a>
+ </body>
+</html>
+
=======================================
--- /dev/null
+++ /trunk/flax/crawler/test/meta.html Tue Jul 27 03:56:04 2010
@@ -0,0 +1,11 @@
+<html>
+ <head>
+ <title>SES Test site</title>
+ <meta name="robots" content="NOINDEX, NOFOLLOW">
+ </head>
+ <body>
+ Should not follow links, or index (dump)
+ <a href="meta.html"></a>
+ </body>
+</html>
+
=======================================
--- /dev/null
+++ /trunk/flax/crawler/test/redirect.php Tue Jul 27 03:56:04 2010
@@ -0,0 +1,1 @@
+<?php header("Location: /find.html") ?>
=======================================
--- /dev/null
+++ /trunk/flax/crawler/test/robots.txt Tue Jul 27 03:56:04 2010
@@ -0,0 +1,2 @@
+User-agent: *
+Disallow: /test.jpg
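The `check_robots` implementations in this commit return the default delay rather than a value read from robots.txt. A minimal sketch of how the fixture above could be checked, and a crawl delay honoured, using Python 3's `urllib.robotparser`. It assumes Python 3.6 or later for `crawl_delay`. The function name `check`, the agent name `flax`, and the `WITH_DELAY` text are hypothetical and not part of the commit:

```python
from urllib.robotparser import RobotFileParser

# The robots.txt fixture from this commit.
FIXTURE = "User-agent: *\nDisallow: /test.jpg\n"
# A hypothetical variant that also requests a crawl delay.
WITH_DELAY = "User-agent: *\nCrawl-delay: 2\nDisallow: /test.jpg\n"


def check(robots_txt, agent, path, default_delay=1):
    """Return the delay to use for path, or None if fetching is disallowed."""
    robot = RobotFileParser()
    robot.parse(robots_txt.split("\n"))
    if not robot.can_fetch(agent, path):
        return None
    # crawl_delay() returns None when robots.txt does not specify one
    delay = robot.crawl_delay(agent)
    return default_delay if delay is None else delay


if __name__ == "__main__":
    print(check(FIXTURE, "flax", "/test.jpg"))
    print(check(FIXTURE, "flax", "/test.png"))
    print(check(WITH_DELAY, "flax", "/test.png"))
```

Returning `None` for a disallowed path is a simplification for the sketch; the crawler itself signals that case by raising `URLNotAllowed`.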
=======================================
--- /dev/null
+++ /trunk/flax/crawler/test/rss.xml Tue Jul 27 03:56:04 2010
@@ -0,0 +1,50 @@
+<?xml version="1.0" encoding="utf-8"?>
+<rss version="2.0">
+<channel>
+ <title>Sample Feed</title>
+ <description>For documentation <em>only</em></description>
+ <link>find2.html</link>
+ <language>en</language>
+ <copyright>Copyright 2004, Mark Pilgrim</copyright>
+ <managingEditor>edi...@example.org</managingEditor>
+ <webMaster>webm...@example.org</webMaster>
+ <pubDate>Sat, 07 Sep 2002 0:00:01 GMT</pubDate>
+ <category>Examples</category>
+ <generator>Sample Toolkit</generator>
+ <docs>http://feedvalidator.org/docs/rss2.html</docs>
+ <cloud domain="rpc.example.com"
+ port="80"
+ path="/RPC2"
+ registerProcedure="pingMe"
+ protocol="soap"/>
+ <ttl>60</ttl>
+ <image>
+ <url>http://example.org/banner.png</url>
+ <title>Example banner</title>
+ <link>find3.html</link>
+ <width>80</width>
+ <height>15</height>
+ </image>
+ <textInput>
+ <title>Search</title>
+ <description>Search this site:</description>
+ <name>q</name>
+ <link>http://example.org/mt/mt-search.cgi</link>
+ </textInput>
+ <item>
+ <title>First item title</title>
+ <link>find3.html</link>
+ <description>
+ Watch out for
+ <span style="background: url(javascript:window.location='http://example.org/')">
+ nasty tricks</span>
+ </description>
+ <author>ma...@example.org</author>
+ <category>Miscellaneous</category>
+ <comments>http://example.org/comments/1</comments>
+ <enclosure url="http://example.org/audio/demo.mp3" length="1069871" type="audio/mpeg"/>
+ <guid>http://example.org/guid/1</guid>
+ <pubDate>Thu, 05 Sep 2002 0:00:01 GMT</pubDate>
+ </item>
+</channel>
+</rss>
=======================================
--- /dev/null
+++ /trunk/flax/crawler/test/test.doc Tue Jul 27 03:56:04 2010
Binary file, no diff available.
=======================================
--- /dev/null
+++ /trunk/flax/crawler/test/test.jpg Tue Jul 27 03:56:04 2010
Binary file, no diff available.
=======================================
--- /dev/null
+++ /trunk/flax/crawler/test/test.pdf Tue Jul 27 03:56:04 2010
Binary file, no diff available.
=======================================
--- /dev/null
+++ /trunk/flax/crawler/test/test.png Tue Jul 27 03:56:04 2010
Binary file, no diff available.