On Fri, Apr 27, 2012 at 9:21 PM, Seth Shaw <seth.e.s...
> We have not actively captured linked pages in the email sets we receive.
> If I were to pursue that path we would likely do a regex
> script to identify URLs and toss the list to either WGet or HTTrack and
> then bundle the capture alongside the email. Of course, you may
> well accidentally capture quite a few phishing and spam URLs, so you might
> want to have a human check the list first.
There are lots of problems when trying to do this for anything other
than PIM; an enterprise system might receive emails that contain links
which the addressee is authorized to access, but which the enterprise is
not. The URL might point to content whose license does not permit
archiving. The links might not be idempotent; the URL might rely
on cookies or data stores not present on the email system, etc.
Conversely, some messages may refer to the content of links at a specific
point in time, so delaying until the content can be manually reviewed may
result in the wrong version of a page being captured.
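
For what it's worth, the extraction step Seth describes could be sketched
roughly as below. This is only a minimal illustration, not anything either of
us actually runs: the function name extract_urls and the URL pattern are my
own assumptions, and the pattern is deliberately loose, so it will miss some
URLs and mangle others. It returns the message's Date header alongside the
URLs so that the gap between send time and capture time can at least be
recorded.

```python
import re
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical, deliberately loose pattern: a URL runs until whitespace,
# an angle bracket, or a quote character.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(raw_message):
    """Return (sent_date, urls) for one RFC 822 message given as a string."""
    msg = message_from_string(raw_message)

    # Keep the send date so the capture can be compared against it later.
    sent = None
    date_header = msg.get("Date")
    if date_header:
        try:
            sent = parsedate_to_datetime(date_header)
        except (TypeError, ValueError):
            sent = None  # unparseable Date header

    urls = []
    for part in msg.walk():
        # Only text/plain, text/html, etc.; skip containers and attachments.
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset name in the header; fall back to UTF-8.
            text = payload.decode("utf-8", errors="replace")
        for match in URL_PATTERN.findall(text):
            url = match.rstrip(".,;:!?)")  # trim trailing sentence punctuation
            if url not in urls:  # de-duplicate, preserving first-seen order
                urls.append(url)
    return sent, urls
```

The resulting list could be written one URL per line to a file for a human
to review, and the reviewed file then handed to wget with its -i option.
None of this addresses the authorization, licensing, or cookie problems
above; it only produces the candidate list.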