On Mon, 7 May 2012 16:48:35 +0100, "Derek.Moody"
<derek...@casterbridge.net> wrote:
> <a href="mailto etc
> - as it'll fool most of the address harvesting spiders.
Not mine. I wrote a little BASIC proggy many years ago as a proof of
concept as to why Argonet's user directory was an invitation. It
built a list of pages and scanned them, plus subpages three levels
deep, unless a link led away from Argo (determined by IP lookup). From
that, it extracted all mailto links and decoded glyphs, nospam
inserts, and the like. It also looked for text "at" and "@" to try to
retrieve non-link addresses. It found some, and decoded them. ;-) It
also filtered out common bogus addresses (whitehouse.gov, localhost, etc).
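For illustration, here is a minimal Python sketch of the sort of decode pass described above. The original was BASIC and its rules are not known, so every rule, regex and name here (demunge, extract, BOGUS_DOMAINS) is an assumption, not a reconstruction. It is mostly handy for checking how well your own munging holds up.

```python
import html
import re
from urllib.parse import unquote

# Hypothetical filter list; only these two examples come from the post.
BOGUS_DOMAINS = {"whitehouse.gov", "localhost"}

# Deliberately simple address pattern (assumption, not RFC 5322).
ADDRESS_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*")


def demunge(candidate):
    """Apply a few common un-munging rules to one candidate string."""
    text = html.unescape(candidate)  # "&#64;" becomes "@", "&#46;" becomes "."
    text = unquote(text)             # "%40" becomes "@"
    # "[at]", "(at)" or a bare " at " becomes "@"; likewise for "dot".
    text = re.sub(r"\s*[\[({]\s*at\s*[\])}]\s*|\s+at\s+", "@", text, flags=re.I)
    text = re.sub(r"\s*[\[({]\s*dot\s*[\])}]\s*|\s+dot\s+", ".", text, flags=re.I)
    # Drop "nospam" / "removespam" inserts along with ONE adjoining separator.
    text = re.sub(
        r"[._-](?:no|remove)[._-]?spam|(?:no|remove)[._-]?spam[._-]?",
        "",
        text,
        flags=re.I,
    )
    return text


def extract(candidate):
    """Return a cleaned, lower-cased address, or None if nothing usable."""
    match = ADDRESS_RE.search(demunge(candidate))
    if match is None:
        return None
    address = match.group(0).lower()
    domain = address.split("@", 1)[1]
    return None if domain in BOGUS_DOMAINS else address
```

Note that demunge is meant for short candidate strings, not whole pages: the bare " at " rule would happily turn ordinary prose into junk if you fed it running text.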
The only thing it completely skipped was emails in images, 'cos my
maths isn't up to OCR code. Wouldn't have thought that'd be a laugh
in BASIC anyway. ;-)
I guess the level of result from a spambot depends upon the
competence of the programmer...
Best wishes,
Rick.
PS: The secret? Run the scan on a subset and ask it to keep
everything that "looks like" an address. Then go through the results
manually, coding rules as you go. Then let it loose, for most people
munge their addresses
in very similar ways. God, imagine giving it "wiki recent changes" as
the input!
PPS: Don't think I have the code, it'll be on my dead 2GB drive. It
was, however, a rainy Sunday's worth of code. It isn't hard, just a
bloody big array and a heap of decode rules. If I wrote it again, I'd
dump likely addresses to a file and process them afterwards, instead
of doing it during the scan...
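The dump-first, process-afterwards idea could be sketched roughly as below, in Python rather than BASIC. This is only a guess at the shape of it: LOOSE_RE, RULES, scan and process are made-up names, and the patterns are assumptions rather than anything recovered from the original program.

```python
import re

# Loose "looks like an address" pattern: word, some spelling of "at",
# word, then one or more "dot"-separated words. Assumed, not original.
AT = r"(?:\s*@\s*|\s*[\[(]\s*at\s*[\])]\s*|\s+at\s+)"
DOT = r"(?:\.|\s*[\[(]\s*dot\s*[\])]\s*|\s+dot\s+)"
LOOSE_RE = re.compile(rf"[\w.+-]+{AT}[\w-]+(?:{DOT}[\w-]+)+", re.IGNORECASE)

# Hypothetical decode rules, applied in order during the second pass.
RULES = [
    (r"\s*[\[(]\s*at\s*[\])]\s*|\s+at\s+", "@"),
    (r"\s*[\[(]\s*dot\s*[\])]\s*|\s+dot\s+", "."),
    (r"\s*@\s*", "@"),
]


def scan(pages, dump_path):
    """Pass 1: append every loose match to dump_path, one per line."""
    count = 0
    with open(dump_path, "a", encoding="utf-8") as dump:
        for page in pages:
            for match in LOOSE_RE.finditer(page):
                # Collapse whitespace so each candidate stays on one line.
                dump.write(" ".join(match.group(0).split()) + "\n")
                count += 1
    return count


def process(dump_path, rules=RULES):
    """Pass 2: decode each dumped candidate, keep the well-formed ones."""
    found = set()
    with open(dump_path, encoding="utf-8") as dump:
        for line in dump:
            candidate = line.strip()
            for pattern, replacement in rules:
                candidate = re.sub(pattern, replacement, candidate, flags=re.I)
            if re.fullmatch(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", candidate):
                found.add(candidate.lower())
    return sorted(found)
```

The point of splitting it this way is that the dump file can be eyeballed and the rule list extended without re-running the scan, which fits the "run a subset, then code rules by hand" approach in the PS.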