
Filtering web proxy


Oleg Broytmann

Apr 17, 2000
to Python Mailing List
Hello!

I want a filtering web proxy. I can write one myself, but if there is a
thing already... well, I don't want to reinvent the wheel. If there is
such a thing (free and opensource, of course), I'll extend it for my needs.

Well, what are my needs? I am a brain-damaged idiot who does not want
to see banners, pop-ups and all the crap, so I run Navigator with
graphics/Java/Javascript/cookies turned off 99% of the time. But for some
sites I need to go to the preferences dialog and turn them on manually,
then turn them back off before visiting another site. (I need frames,
tables and other features, so I am not using lynx; yes, I know there are
"links" and "w3m", and yes, I use them from time to time; anyway, I run
Navigator.)
It is tiresome. I want to automate the task.

So I want to turn graphics/scripts/cookies on for good, but filter the
crap in the proxy. For 99% of sites I'll just remove the junk. For some
sites (a list of which I'll curate manually) the proxy will pass HTML,
graphics and cookies unchanged (or a bit modified, 'cause in any case I
don't want to eat Doubleclick's cookies attached to their ads).
Once I tried Junkbuster, but found it inadequate. My need is simple -
just a short list of "white" sites. Never tried adzapper...

I have written a dozen HTML parsers in Python, so I can write one more
and turn it into a proxy, but maybe I can start with some already-debugged
code?
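A minimal sketch of the whitelist idea above (not a full proxy; the
WHITELIST contents and the two filter rules are hypothetical, just to
show the pass-through-or-strip logic):

```python
import re
from urllib.parse import urlparse

# Hypothetical whitelist of "white" sites whose pages pass through untouched.
WHITELIST = {"example.com"}

SCRIPT_RE = re.compile(r"<script\b.*?</script>", re.IGNORECASE | re.DOTALL)
IMG_RE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)

def filter_page(url, html):
    """Return the page unchanged for whitelisted hosts; otherwise strip
    scripts and images, in the spirit of the proxy described above."""
    host = urlparse(url).hostname or ""
    if host in WHITELIST:
        return html
    return IMG_RE.sub("", SCRIPT_RE.sub("", html))
```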

Oleg.
----
Oleg Broytmann http://phd.russ.ru/~phd/ ph...@mail.com
Programmers don't die, they just GOSUB without RETURN.

Ng Pheng Siong

Apr 17, 2000
According to Oleg Broytmann <ph...@mail.com>:

> Once I tried Junkbuster, but found it inadequate. My need is simple -
> just a short list of "white" sites. Never tried adzapper...

I use Alfajor and am happy with it. I only need cookie filtering,
which Alfajor does very well.

Cheers.

--
Ng Pheng Siong <ng...@post1.com> * http://www.post1.com/home/ngps


Patrick Phalen

Apr 17, 2000
[Oleg Broytmann, on Mon, 17 Apr 2000]
:: I want a filtering web proxy. I can write one myself, but if there is a
:: thing already... well, I don't want to reinvent the wheel. If there is
:: such a thing (free and opensource, of course), I'll extend it for my needs.

Try CTC (Cut the Crap). It's written in Python and is GNU'd.

http://softlab.ntua.gr/~ckotso/CTC/


Oleg Broytmann

Apr 17, 2000
to Patrick Phalen

Looks promising, thanks. It also recommends Alfajor:
http://www.andrewcooke.free-online.co.uk/jara/alfajor/. Python, GPL'd...

Oleg. (All opinions are mine and not of my employer)
----
Oleg Broytmann Foundation for Effective Policies p...@phd.russ.ru

Robert W. Cunningham

Apr 17, 2000
Oleg Broytmann wrote:

> I want a filtering web proxy. I can write one myself, but if there is a
> thing already... well, I don't want to reinvent the wheel. If there is
> such a thing (free and opensource, of course), I'll extend it for my needs.

> <snip>


> I have written a dozen HTML parsers in Python, so I can write one more
> and turn it into a proxy, but maybe I can start with some already-debugged
> code?

If you want a porting project, there is the FilterProxy program, written in
Perl. It is a true daemon and implements several nifty features, such as a
modular architecture that lets users create their own filters. Present
filters include a banner-ad blocker and a cool HTTP compressor (a remote
FilterProxy instance on the far end of a modem link gzips all HTTP traffic
for a roughly 5x size reduction, which Netscape decompresses on the fly).
There is also a template to simplify writing new modules, and if I ever get
around to it I'm going to add simple stream encryption (just for fun).
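The compression trick is easy to reproduce with Python's standard gzip
module; the exact ratio depends entirely on the page (the 5x figure is the
poster's, not a guarantee), but markup-heavy pages are repetitive and
compress well:

```python
import gzip

# A deliberately repetitive page, standing in for real markup-heavy HTML.
page = b"<html><body>" + b"<div class='ad'>banner</div>" * 200 + b"</body></html>"
compressed = gzip.compress(page)

assert gzip.decompress(compressed) == page  # lossless round trip
ratio = len(page) / len(compressed)         # well over 5x for this page
```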

FilterProxy is still alpha, but it is already fairly potent. Future
improvements may include support for chaining proxies, so that local and/or
remote instances may be used in combination with an ISP's HTTP caching proxy
server.

If you have a broadband connection (cable modem, DSL, LAN), FilterProxy
introduces a noticeable but not objectionable delay. It costs only a little
more than an IPchains firewall in terms of CPU time. And the Blocker
module is getting very good at removing some of the more complex ad formats
(including some Java and Javascript devils).

It would be an excellent comparison to have Python and Perl versions
side-by-side, so the power of each can be more easily observed.

I don't recall the URL, but it is on Freshmeat.


-BobC

Neil Schemenauer

Apr 17, 2000
Erno Kuusela <er...@iki.fi> wrote:
>an html parser would need to work incrementally, unless you want to
>wait for the whole document to be transferred over the network before
>seeing any of it rendered.

Yes, and if your connection is fast enough that you don't need
incremental loading, you probably don't care too much about ads.
In my experience, filtering ads greatly enhances your experience
if you're browsing on a slow connection.

>i guess you could do it incrementally with sgmllib (iirc you feed it a
>file object?), but you run into the fact that a big part of
>the html documents on the web are malformed and rely on the
>error correcting heuristics of the major browsers to function...

Right, and there seems to be a lot of bad HTML code out there.
Unfortunately, I don't think you can easily make sgmllib parse
incrementally. Someone please correct me if I'm wrong.
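As it happens, feed() does accept data in chunks: sgmllib buffers an
incomplete tag between calls, and its descendant html.parser (used in this
sketch, since sgmllib never made it past Python 2) behaves the same way:

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Record start tags as they arrive."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

parser = TagCollector()
# Feed the document in arbitrary chunks, even splitting a tag in half;
# the parser buffers the partial tag until the rest arrives.
for chunk in ("<html><bo", "dy><p>hi</p", "></body></html>"):
    parser.feed(chunk)
parser.close()
# parser.tags == ['html', 'body', 'p']
```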

Is the situation with XML the same as HTML? Are XML documents
forced to adhere to the standard or are parsers supposed to try
to do something intelligent with whatever crap they get fed?

>one starting point could be the "gray proxy" (i forget what it was
>really called). that was written on top of medusa, i think there was
>an announcement here? probably a year or so ago. it parsed the html
>and changed all the colors to grayscale, and did the same for
>images. medusa isn't free though.. (except the version in zope?)

You can try my "munchy" proxy. It's at:

http://www.enme.ucalgary.ca/~nascheme/python/

Saying that it parses HTML is a bit of a stretch, however. It
just uses a couple of regexes. I'm sure Tim Peters would love
it. :)

In my experience, filtering ads at the HTML level is more
effective than filtering at the request level (like junkbuster).
My list of blocked URLs is very short but still catches ads on
almost all the sites I visit. Also, when ads are filtered I
usually cannot tell by looking at the page. Of course, YMMV.
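A sketch of that regex approach (the blocklist here is hypothetical; the
real patterns live in the proxy's configuration):

```python
import re

# Hypothetical blocklist; a handful of patterns catches most ad servers.
BLOCKED = re.compile(r"doubleclick\.net|/ads?/|banner", re.IGNORECASE)
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)

def strip_ads(html):
    """Remove <img> tags whose attributes match the blocklist, leaving
    the rest of the page untouched so the filtering is invisible."""
    return IMG_TAG.sub(
        lambda m: "" if BLOCKED.search(m.group(0)) else m.group(0), html)
```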


Neil

--
HTML needs a rant tag. --Alan Cox

David Porter

Apr 17, 2000
to Python List
* Oleg Broytmann <p...@phd.russ.ru>:
> Hello!

>
> I want a filtering web proxy. I can write one myself, but if there is a
> thing already... well, I don't want to reinvent the wheel. If there is
> such a thing (free and opensource, of course), I'll extend it for my needs.
>
> Once I tried Junkbuster, but found it inadequate. My need is simple -
> just a short list of "white" sites. Never tried adzapper...

You tried the *wrong* junkbuster. There is an alternate version at
http://www.waldherr.org/junkbuster/ which is the best ad filter that I have
ever used (and still use). It comes with a very complete ad blocking list
and it turns blocked images into transparent 1x1 gifs, both unlike the
official version.

I have tried adzapper and CTC (Cut The Crap), but both of them were
unreliable. (I should mail the authors about this RSN.) They would, for no
apparent reason, just stop working. That is, I would select a URL and
nothing would happen... I found myself constantly bypassing the proxy
because of this. I also dislike the concept of creating a new file every
time I want to block a site; one central config suits me better.

I would be highly surprised if the junkbuster version I mentioned is not
satisfactory to you. Give it a go!


__David__


Erno Kuusela

Apr 18, 2000
>>>>> "Oleg" == Oleg Broytmann <p...@phd.russ.ru> writes:

Oleg> Hello! I want a filtering web proxy. I can write one
Oleg> myself, but if there is a thing already... well, I don't
Oleg> want to reinvent the wheel. If there is such a thing (free
Oleg> and opensource, of course), I'll extend it for my needs.

i am using junkbuster. it's simple, and works well.

it only does url blocking (so banners), but i never have javascript
turned on anyway. it can block cookies selectively (i have everything
but slashdot blocked), hide/spoof user-agent, and use other http
proxies.

it doesn't do html parsing, and it's not written in python though.
but i haven't missed anything from it so it being written
in c hasn't bothered me much. with javascript/java turned
off there's not so much need for html parsing...

Oleg> I have written a dozen HTML parsers in Python, so I can
Oleg> write one more and turn it into a proxy, but maybe I can
Oleg> start with some already-debugged code?

an html parser would need to work incrementally, unless you want to
wait for the whole document to be transferred over the network before
seeing any of it rendered.

i guess you could do it incrementally with sgmllib (iirc you feed it a
file object?), but you run into the fact that a big part of
the html documents on the web are malformed and rely on the
error correcting heuristics of the major browsers to function...

one starting point could be the "gray proxy" (i forget what it was
really called). that was written on top of medusa, i think there was
an announcement here? probably a year or so ago. it parsed the html
and changed all the colors to grayscale, and did the same for
images. medusa isn't free though.. (except the version in zope?)

mozilla allows you to block/allow javascript by domain, iirc.

-- erno


Amit Patel

Apr 18, 2000
Neil Schemenauer <nasc...@enme.ucalgary.ca> wrote:
|
| Yes, and if your connection is fast enough that you don't need
| incremental loading, you probably don't care too much about ads.
| In my experience, filtering ads greatly enhances your experience
| if you're browsing on a slow connection.

Even with a fast connection, there are still things like filtering out
pop-ups, blocking cookies from certain sites, and forcing pages to be
cacheable (by modifying Cache-Control headers .. evil evil!) that can
be useful. One really handy thing I wrote was highlighting keywords
you searched for: for example, you search for "big dog" and visit a
page, and it'll highlight "big" and "dog" on that page. However,
Google just added this feature (when you use its cached link), so I
don't need it so much in a proxy.
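The highlighting filter can be approximated in a few lines. This naive
sketch rewrites the whole string; a real proxy would have to confine it to
text nodes so it never touches markup or attributes:

```python
import re

def highlight(page, terms):
    """Wrap each search term in <b> tags, case-insensitively.
    Naive: does not protect markup, so only run it on text nodes."""
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    return pattern.sub(lambda m: "<b>%s</b>" % m.group(0), page)

highlight("My big dog met a bigger dog.", ["big", "dog"])
# 'My <b>big</b> <b>dog</b> met a <b>big</b>ger <b>dog</b>.'
```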

| Is the situation with XML the same as HTML? Are XML documents
| forced to adhere to the standard or are parsers supposed to try
| to do something intelligent with whatever crap they get fed?

I believe XML and XHTML parsers are required to reject malformed input.
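That matches the XML 1.0 spec: well-formedness errors are fatal, so a
conforming parser stops rather than guessing. Python's ElementTree (a
later stdlib addition, used here just to demonstrate) shows the contrast:

```python
import xml.etree.ElementTree as ET

# Well-formed XML parses fine...
elem = ET.fromstring("<p>hello</p>")
assert elem.text == "hello"

# ...but an unclosed tag is a fatal error, not something to repair:
try:
    ET.fromstring("<p>hello")
    raised = False
except ET.ParseError:
    raised = True
assert raised
```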

| Saying that it parses HTML is a bit of a stretch, however. It
| just uses a couple of regexes. I'm sure Tim Peters would love
| it. :)

Regexps can "parse" HTML tags. There was something called REX.py that
was posted here somewhere. It's based on REX for Perl, which is based
on Robert D. Cameron's "REX: XML Shallow Parsing with Regular
Expressions", Technical Report TR 1998-17, School of Computing
Science, Simon Fraser University, November 1998.

The idea is that REX is a humongous hairy regexp (sorry, Timbot!) that
will match one tag or non-tag at a time. You just keep feeding it
data and it keeps giving you tags/non-tags. With that, I hope to
build a filtering proxy that tokenizes all the HTML and does evil
transformations to it, incrementally.
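A drastically simplified stand-in for Cameron's expression conveys the
idea: every match is either a run of text or a single tag, so a stream
can be tokenized match by match. (The full REX also copes with comments,
CDATA sections, and partial tags, which this toy regexp does not.)

```python
import re

# Each match is a text run or a complete tag; nothing falls through.
TOKEN = re.compile(r"[^<]+|<[^>]*>")

def shallow_tokens(chunk):
    """Tokenize a chunk of HTML into alternating text and tag strings."""
    return TOKEN.findall(chunk)

shallow_tokens("<p>hi <b>there</b></p>")
# ['<p>', 'hi ', '<b>', 'there', '</b>', '</p>']
```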


- Amit

P.S. Here's the regexp, just for Tim Peters:

'[^<]+|<(?:!(?:--(?:[^-]*-(?:[^-][^-]*-)*->?)?|\\[CDATA\\[(?:[^\\]]*](?:[^\\]]+])*]+(?:[^\\]>][^\\]]*](?:[^\\]]+])*]+)*>)?|DOCTYPE(?:[\\n\\t\\r]+(?:[A-Za-z_:]|[^\\x00-\\x7F])(?:[A-Za-z0-9_:.-]|[^\\x00-\\x7F])*(?:[\\n\\t\\r]+(?:(?:[A-Za-z_:]|[^\\x00-\\x7F])(?:[A-Za-z0-9_:.-]|[^\\x00-\\x7F])*|"[^"]*"|\'[^\']*\'))*(?:[\\n\\t\\r]+)?(?:\\[(?:<(?:!(?:--[^-]*-(?:[^-][^-]*-)*->|[^-](?:[^\\]"\'><]+|"[^"]*"|\'[^\']*\')*>)|\\?(?:[A-Za-z_:]|[^\\x00-\\x7F])(?:[A-Za-z0-9_:.-]|[^\\x00-\\x7F])*(?:\\?>|[\\n\\r\\t][^?]*\\?+
(?:[^>?][^?]*\\?+)*>))|%(?:[A-Za-z_:]|[^\\x00-\\x7F])(?:[A-Za-z0-9_:.-]|[^\\x00-\\x7F])*;|[\\n\\t\\r]+)*](?:[\\n\\t\\r]+)?)?>?)?)?|\\?(?:(?:[A-Za-z_:]|[^\\x00-\\x7F])(?:[A-Za-z0-9_:.-]|[^\\x00-\\x7F])*(?:\\?>|[\\n\\r\\t][^?]*\\?+(?:[^>?][^?]*\\?+)*>)?)?|/(?:(?:[A-Za-z_:]|[^\\x00-\\x7F])(?:[A-Za-z0-9_:.-]|[^\\x00-\\x7F])*(?:[\\n\\t\\r]+)?>?)?|(?:(?:[A-Za-z_:]|[^\\x00-\\x7F])(?:[A-Za-z0-9_:.-]|[^\\x00-\\x7F])*(?:[\\n\\t\\r]+(?:[A-Za-z_:]|[^\\x00-\\x7F])(?:[A-Za-z0-9_:.-]|[^\\x00-\\x7F])*(?:[\\n\\t\\r]+)?=?(?
:[ \\n\\t\\r]+)?(?:"[^<"]*"|\'[^<\']*\'|\\w+))*(?:[ \\n\\t\\r]+)?/?>?)?)'

--
Amit J Patel, Computer Science Department, Stanford University
http://www-cs-students.stanford.edu/~amitp/



Oleg Broytmann

Apr 18, 2000
to Python Mailing List
Hello!

Thanks to all who replied. I'll look into all the code. Right now proxy3
(MURI) looks best, with "munchy" following.

On Mon, 17 Apr 2000, David Porter wrote:
> I would be highly surprised if the junkbuster version I mentioned is not
> satisfactory to you. Give it a go!

It doesn't block Javascript.

Mark Nottingham

May 10, 2000
to pytho...@python.org
On Tue, Apr 18, 2000 at 02:49:54AM +0000, Amit Patel wrote:

> Even with a fast connection, there are still things like filtering out
> pop-ups, blocking cookies from certain sites, forcing pages to be
> cachable (by modifying Cache-Control headers .. evil evil!)

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Naughty! There's enough confusion out there already... *grin*

--
Mark Nottingham
http://www.mnot.net/


Eduard Hiti

May 12, 2000
There is the classic 'Proxomitron', which has nothing to do with
Python, but can do things with HTML you wouldn't believe. Runs
on Win32. Find it at http://members.tripod.com/Proxomitron.

Also there is WBI from IBM. It's not so much a finished product
as a library for Java, with the most extensive support for
parsing and rewriting incoming and outgoing HTTP I've ever seen.
There are a couple of quite impressive demos for this library. You
can look at it at http://www.almaden.ibm.com/cs/wbi/.



