Filter on HTML content?


SeanC

Nov 15, 2010, 3:53:47 PM
to GMail Power Users
Hi,

I have two questions, one specific, one more general:

1. Is it possible to filter on certain words that appear in the HTML
of an email? For example, I'd like to catch any message that has
".info/" anywhere in its HTML. I have tried, but cannot make this
work. (Is it because the images/HTML don't get downloaded until I tell
GMail to download them?)

Here's a sample snippet of such HTML containing that string: <a
href="http://vergeboard.grossesseravenna.info/filled/
133221/1000/1006-008-0-2/preventer" style="text-decoration: none">


2. More general question: There are some emails -- the sneaky ones --
that are very persistent. For example, I get one from a company
called LowerMyBills that is primarily an image ad with lots of random
text (see full email below). I can filter on the word "LowerMyBills"
in the subject line, but I am surprised that the GMail filters do not
catch this email, and many others like it. I was hoping to use the
".info/" string above, as almost everything in my inbox containing it
is an unwanted email. Any thoughts or ideas on how to catch this kind
of thing (or, for that matter, why GMail's behind-the-scenes filters
don't)?

Thanks.

Here's an example of that email:

Delivered-To: gmaila...@gmail.com
Received: by 10.223.100.4 with SMTP id w4cs106542fan;
Mon, 15 Nov 2010 06:45:01 -0800 (PST)
Received: by 10.91.182.17 with SMTP id j17mr3500651agp.
105.1289832300984;
Mon, 15 Nov 2010 06:45:00 -0800 (PST)
Return-Path: <comcast...@comcast.net>
Received: from qmta14.emeryville.ca.mail.comcast.net
(qmta14.emeryville.ca.mail.comcast.net 1.2.3.4])
by mx.google.com with ESMTP id d23si8304215and.
111.2010.11.15.06.45.00;
Mon, 15 Nov 2010 06:45:00 -0800 (PST)
Received-SPF: pass (google.com: domain of comcast...@comcast.net
designates 1.2.3.4 as permitted sender) client-ip=1.2.3.4;
Authentication-Results: mx.google.com; spf=pass (google.com: domain of
comcast...@comcast.net designates 1.2.3.4 as permitted sender)
smtp.mail=comcast...@comcast.net
Received: from omta17.emeryville.ca.mail.comcast.net ([76.96.30.73])
by qmta14.emeryville.ca.mail.comcast.net with comcast
id XRuJ1f0021afHeLAESl06p; Mon, 15 Nov 2010 14:45:00 +0000
Received: from sz0077.ev.mail.comcast.net ([1.2.3.4])
by omta17.emeryville.ca.mail.comcast.net with comcast
id XSkz1f00H2sJR5G8dSkz77; Mon, 15 Nov 2010 14:45:00 +0000
Return-Path: preutil...@admireriar.com
Received: from imta04.westchester.pa.mail.comcast.net (LHLO
imta04.westchester.pa.mail.comcast.net) (76.96.62.37) by
sz0077.ev.mail.comcast.net with LMTP; Mon, 15 Nov 2010 14:44:59 +0000
(UTC)
Received: from mx232.admireriar.com ([92.253.241.232])
by imta04.westchester.pa.mail.comcast.net with comcast
id XSkz1f01Z51ZPf804Skz73; Mon, 15 Nov 2010 14:45:00 +0000
From: LowerMyBills <univers...@admireriar.com>
Subject: =?ISO-8859-1?Q?
=02=0C=1B=04=18=1F=16Bank=02s=20Forced=20to=20Forgive=20Credit=20Card=20Debt=20=7C=20See=08=20If=20You=20Qualify=20for=20Relief?
=
MIME-Version: 1.0
X-UID: comcastaddress
Content-Transfer-Encoding: 8bit
Content-Type: text/html; charset="iso-8859-1"
X-Zimbra-Forwarded: comcast...@comcast.net
Message-ID:
<1201487840.1268914.1289...@sz0077a.emeryville.ca.mail.comcast.net>
X-Mailer: Zimbra 6.0.5_GA_2431.RHEL5_64
Date: Mon, 15 Nov 2010 14:44:59 +0000 (UTC)
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
<META content="MSHTML 6.00.2900.2523" name=GENERATOR>
</HEAD>
<BODY>
<p align="center"><b><font face="Verdana" size="2">
<a href="http://vergeboard.grossesseravenna.info/filled/
133221/1000/1006-008-0-2/preventer" style="text-decoration: none">
<font color="#f36c0b"></font></a></font></b></p>
<p align="center">
<a href="http://jewellers.grossesseravenna.info/understand/
133221/1000/1006-008-0-2/strandedness">
<img border="0" src="http://Rhodope.grossesseravenna.info/images/
lmbd4.gif"></a><br>
<a href="http://region.grossesseravenna.info/produce/40195/1000/2000/
tentmaker">
<img border="0" src="http://bandy.grossesseravenna.info/images/
lmbd2.jpg"></a><br>
<a href="http://destructor.grossesseravenna.info/understand/
40085/1000/2000/departmentalise">
<img border="0" src="http://Sperry.grossesseravenna.info/images/
bolt4.jpg"></a></p>
<center><img
src="http://grossesseravenna.info/L2F7KZYW8comcastaddress6N8GS.mtmso/
133221/1000/008/comcast...@comcast.net/"
/></center>
<div style="display:none;visibility:hidden;"
<br>
multiple lines of random text were here
<br>
multiple lines of random text were here
<br>
multiple lines of random text were here
<br>
multiple lines of random text were here
</DIV>
</BODY></HTML>


Andrew Ingraham

Nov 17, 2010, 12:24:54 PM
to gmail-po...@googlegroups.com
I'm not a "power user", so don't take my word on any of this.

> 1. Is it possible to filter on certain words that appear in the HTML
> of an email?  For example, I'm looking to filter on any message that
> has ".info/" in any HTML appearing in the message.

As far as I know, filtering works on the entire email message text (or
close to it), which is why filtering on a word you expect to find in
the body also catches that word if it appears only in the subject or
in the sender's name. Hence, I would think Gmail is searching the
HTML code too. But I could be wrong.

If the HTML is being pulled in from another source when you view the
message, then I think you'd be out of luck. Is the string there if you
do Show Original? That shows you what Gmail has to work with, before
pulling in images or other code.

Filtering on ".info/" might not work as expected because Gmail's
search/filter function seems to skip (most) punctuation, and Gmail
doesn't have wildcards. I have tried in vain several times to get
searches to work when the search term included some non-alphanumeric
characters, since Gmail ignores them.

If for some reason it sees ".info/" or "info" as part of a longer
"word", then it won't find that either because Gmail's search finds
whole words only.
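
To illustrate the two points above, here is a minimal Python sketch of a matcher that discards punctuation and compares whole words only. It is a toy model of the behavior described, not Gmail's actual search algorithm, and the names `tokens` and `matches` are made up for this example.

```python
import re


def tokens(text):
    """Split text into lowercase alphanumeric words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches(query, text):
    """Return True if the query's words appear consecutively as whole words."""
    q = tokens(query)
    t = tokens(text)
    n = len(q)
    return n > 0 and any(t[i:i + n] == q for i in range(len(t) - n + 1))


# ".info/" collapses to the single word "info" once punctuation is dropped.
print(tokens(".info/"))

# Whole-word matching: "information" is a different word from "info".
print(matches(".info/", "more information here"))

# With punctuation gone, a plain "info" matches as well as a ".info/" URL.
print(matches(".info/", "for your info"))
print(matches(".info/", "http://example.info/page"))
```

Under this model a query for ".info/" cannot be told apart from a query for the bare word "info", which is one reason such a filter might not behave as expected.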

I haven't a clue if an add-on to Gmail provides enhanced filtering
capabilities, since I don't use any.

Another alternative is to access the account over IMAP from your PC,
which gives you a range of email clients that may offer better
filtering capabilities than Gmail itself does.
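
As a rough idea of what client-side filtering could look like, here is a minimal Python sketch that takes a raw message (for example, one already fetched over IMAP) and lists the ".info" hostnames linked from its HTML parts. It uses only the standard library; the names `LinkCollector` and `info_hosts`, the sample message, and the `example.com` addresses are assumptions made for illustration.

```python
from email import message_from_string
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect every href/src attribute value found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Drop whitespace introduced by line wrapping inside the URL.
                self.urls.append("".join(value.split()))


def info_hosts(raw_message):
    """Return the sorted .info hostnames linked from the HTML parts."""
    msg = message_from_string(raw_message)
    hosts = set()
    for part in msg.walk():
        if part.get_content_type() != "text/html":
            continue
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        collector = LinkCollector()
        collector.feed(payload.decode(charset, errors="replace"))
        for url in collector.urls:
            host = (urlsplit(url).hostname or "").lower()
            if host.endswith(".info"):
                hosts.add(host)
    return sorted(hosts)


# Hypothetical, shortened message modeled on the example in this thread.
sample = """From: LowerMyBills <sender@example.com>
Subject: test
MIME-Version: 1.0
Content-Type: text/html; charset="iso-8859-1"

<html><body>
<a href="http://vergeboard.grossesseravenna.info/filled/
133221/1000/1006-008-0-2/preventer">x</a>
<img src="http://example.com/logo.gif">
</body></html>
"""

print(info_hosts(sample))
```

A message for which `info_hosts` returns a non-empty list could then be moved or labeled by whatever client or script is doing the fetching.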

> 2. More general question:  There are some emails -- the sneaky ones --
> that are very persistent.  ...
> ... I am surprised that the GMail filters do not
> catch this email, and many others like it.

I trust you have been "training" your Gmail spam filter, by clicking
on Report Spam as often as possible.

I've had periods where many obvious spam messages weren't being
caught. After consistently marking them for a few weeks, the spam
filter finally caught on. In almost all cases it learns quickly, but
a couple of cases were oddly persistent.

Using (and training) the spam filter is said to be far preferable to
constructing your own filter.

It appears you are trying to filter on the ".info" top-level Internet
domain. I'd think there could be a lot of non-spam from that domain
too. Do you really want to block everything from .info?

Perhaps instead of the .info top-level domain, you could search for
the next-level domain, if it is consistently there on the troublesome
emails.
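
As a small illustration of the next-level-domain idea, the Python sketch below groups the hostnames from the sample email above by their last two labels. The helper name `second_level` is made up for this example, and taking the last two labels is a naive assumption that breaks for suffixes such as `.co.uk`.

```python
from collections import Counter


def second_level(host):
    """Return the last two labels of a hostname (naive: ignores .co.uk etc.)."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host.lower()


# Hostnames that appear in the href/src attributes of the sample email.
hosts = [
    "vergeboard.grossesseravenna.info",
    "jewellers.grossesseravenna.info",
    "Rhodope.grossesseravenna.info",
    "region.grossesseravenna.info",
    "bandy.grossesseravenna.info",
    "destructor.grossesseravenna.info",
    "Sperry.grossesseravenna.info",
    "grossesseravenna.info",
]

counts = Counter(second_level(h) for h in hosts)
print(counts.most_common())
```

In this sample every link shares one next-level domain even though the subdomain changes each time, so that domain would be the narrower thing to match on, provided it stays the same from one message to the next.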

Regards,
Andy

SeanC

Nov 17, 2010, 3:20:16 PM
to GMail Power Users
Thanks, Andy.

Yes. I thought about the issue of eliminating all ".info" domains, but
I've got other whitelisting going on, and it appears that anything
I've ever received with .info and which wasn't already on my safe list
was, in fact, spam.

I did look, and the ".info/" does appear when I "Show Original" in
GMail. However, it only appears inside an href or img tag. Even if I
just try to filter on "info" with no punctuation, it still doesn't
work. It seems to be related to the string being inside a tag, but
I'm not sure.
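
One possible explanation, which nothing in this thread confirms, is that the filter only sees the text between tags and not the attribute values. The Python sketch below shows what such a text-only view of the message would contain. The class and function names are made up, the URLs are shortened from the sample email, and the visible sentence is invented for the example.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Accumulate only the character data between tags, ignoring attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html):
    """Return the text content of an HTML fragment with whitespace collapsed."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


# Shortened URLs from the sample email; the visible sentence is hypothetical.
snippet = (
    '<p align="center"><a href="http://jewellers.grossesseravenna.info/understand/">'
    '<img border="0" src="http://Rhodope.grossesseravenna.info/images/lmbd4.gif">'
    "</a><br>See if you qualify</p>"
)

text = visible_text(snippet)
print("info" in snippet)  # True: the raw source contains the string
print("info" in text)     # False: the text-only view does not
```

If the filter worked on something like `text` rather than the raw source, neither ".info/" nor "info" would ever match this message.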

I have and will continue to train the GMail filter as you suggested.

Finally, one last question: when the GMail filter gets trained by my
marking an item as spam, is it getting trained just for ME or is it
occurring for all GMail users?

Thanks very much for your response.

Regards,

Sean

On Nov 17, 10:24 am, Andrew Ingraham <andrew.ingra...@gmail.com>
wrote:
> I'm not a "power user", so don't take my word on any of this.
>
> > 1. Is it possible to filter on certain words that appear in the HTML
> > of an email?  For example, I'm looking to filter on any message that
> > has ".info/" in any HTML appearing in the message.
>
> As far as I know, filtering works on the entire email message text (or
> close to it), which is why filtering on a word you expect to find in
> the body, also catches that word if it appears only in the subject or
> in the sender's name.  Hence, I would think Gmail is searching the
> HTML code too.  But I could be wrong.

Andrew Ingraham

Nov 17, 2010, 4:27:20 PM
to gmail-po...@googlegroups.com
> Finally, one last question:  when the GMail filter gets trained by my
> marking an item as spam, is it getting trained just for ME or is it
> occurring for all GMail users?

As far as I know, only your account is affected. What is spam to me
might not be spam to someone else.

I suppose Google might make use of everyone's spam preferences in some
way to tailor all their filters, over time. Even if they do this,
I've got to believe that one person's settings must have a minuscule
effect on everyone else's accounts.

Andy

SeanC

Nov 26, 2010, 3:02:05 PM
to GMail Power Users
Ok. Thanks very much.