Spamhalter getting overwhelmed by HTML meta tags

Marco Old

Aug 6, 2021, 7:25:22 PM

Spamhalter has not been working well; its accuracy has degraded over
the past couple of years. There was a long thread about this in 2019.

I followed all of the hints from that thread, but Spamhalter still
misses many spam emails.

Looking at "Explain Spam Classification", I see that almost all of the
words used to classify the email are HTML markup and styling terms
rather than message content. Words like "style", "arial", "margin",
"sans-serif", "font-family", "text-align" and so on.
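
To illustrate how markup can crowd out content words, here is a minimal
Python sketch. It is not Spamhalter's tokenizer, whose internals are not
described in this thread; `naive_tokens`, `text_tokens`, and the sample
message are hypothetical and only show the general effect.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample message body (not taken from a real email).
SAMPLE = ('<p style="font-family: arial, sans-serif; margin: 0; '
          'text-align: center">Cheap pills today</p>')

# Words of two or more characters; hyphens allowed after the first letter.
WORD = re.compile(r"[a-z][a-z-]+")


def naive_tokens(raw):
    """Tokenize the raw source, markup and styling included."""
    return WORD.findall(raw.lower())


class _TextOnly(HTMLParser):
    """Collects only the text between tags, ignoring tags and attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_tokens(raw):
    """Tokenize only the visible text of an HTML fragment."""
    parser = _TextOnly()
    parser.feed(raw)
    parser.close()
    return WORD.findall(" ".join(parser.chunks).lower())


print(naive_tokens(SAMPLE))  # mostly styling terms, plus three content words
print(text_tokens(SAMPLE))   # only the three content words
```

In this toy message, seven of the ten naive tokens are styling terms, which
is the same pattern the word list above shows.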

So I train an email as Spam and those words get into the
classification for Spam messages. Then, on the next email, I train
the email as not Spam and those words are removed from the
classification. The next email is then not considered Spam.
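
The back-and-forth described above can be sketched with a toy per-word
counter. This is a generic word-frequency model assumed for illustration,
not Spamhalter's actual algorithm; `train`, `spamminess`, and the word
lists are hypothetical.

```python
from collections import Counter

spam_counts = Counter()  # messages trained as spam that contain each word
ham_counts = Counter()   # messages trained as not-spam that contain each word


def train(tokens, is_spam):
    """Count each distinct word once per trained message."""
    target = spam_counts if is_spam else ham_counts
    target.update(set(tokens))


def spamminess(token):
    """Fraction of trained messages containing the word that were spam."""
    s, h = spam_counts[token], ham_counts[token]
    if s + h == 0:
        return None  # word never seen in training
    return s / (s + h)


# Styling terms appear in both HTML messages; content words differ.
markup = ["style", "arial", "margin", "font-family"]
train(markup + ["pills", "cheap"], is_spam=True)
train(markup + ["invoice", "meeting"], is_spam=False)

for tok in ["font-family", "pills", "invoice"]:
    print(tok, spamminess(tok))
```

In this toy model the shared styling terms land at 0.5 after one spam and
one not-spam message, so they carry no signal either way, while the content
words stay distinctive.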

Has anyone noticed this?

Euler German

Aug 7, 2021, 2:27:48 PM

On article <5ogrgg99d7l9nucvu...@4ax.com>, Marco Old
wrote (at least in part):

> So I train an email as Spam and those words get into the
> classification for Spam messages and then on the next email, I train
> the email as not Spam and those words are removed from the
> classification. Then the next email is not considered Spam.
>
>

Maybe you're not "training" Spamhalter correctly. There's a big
difference between selecting one or more misclassified messages and
MOVING them to the Suspicious or junk mail folder, and picking
Spamhalter classification > Train message(s) as Spam from the menu.
The same applies the other way around: MOVING message(s) from the
Suspicious or junk mail folder to any other folder is much more
effective than Train message(s) as Not-Spam. There's a technical
explanation for each method, but in a nutshell that's how it works.

OTOH, if that is not your case, you may benefit from Spamhalter's
database cleanup, which removes stale data from the corpus. Pick it
from Tools > Spam and content controls > Spamhalter... > Cleanup...

My current Spamhalter training strategy and settings:

(*) Train on classification errors only (smaller database)
( ) Train always (larger database, self-trained) <- no need if you're
    running standalone or on a small LAN.

Spam level (%): 50 Not-spam boost: 1

Spamhalter has been running flawlessly here since version 1.0 with
these settings.

--
Kind regards,
Euler German

Please, reply preferably to the list.
Reply-To: partially ROT13, invalid=com
Due to spam I'm filtering-out GoogleGroups. Sorry. :(

Marco Old

Aug 25, 2021, 6:05:49 PM

Euler,

Thanks for the hints. I already had that training strategy selected,
but I had the default settings for Spam level and Not-spam boost.

I changed them to your recommendation and we will see
what happens.

I will be sure to drag the spam messages into the Junk folder.

Marco

Euler German

Aug 26, 2021, 8:32:05 AM

On article <0ffdig5vmtk4ct0bm...@4ax.com>, Marco Old
wrote (at least in part):

> I will be sure to drag the spam messages into the Junk folder.
>
>

You may also use Quick Actions for this (I'm a keyboard guy). Look at
Folder > Quick actions > Define quick actions...

Marco Old

Oct 24, 2021, 4:44:33 PM

Update:

Helped by the residents of this group, I've got Spamhalter working
much better now.

I cleared out all of the previous cached data, selected

(o) Train on classification errors only

set "Spam Level %" to 50

and set "Not-spam boost" to 1

as recommended in other posts.

Then I made sure to ONLY drag spam emails into the spam folder and
NEVER use the right-click menu item "Train message(s) as Spam".

After a few weeks of dragging spam emails, Spamhalter is now working
very well, with almost 100% accuracy in detecting Spam and not-Spam.

Thanks to all.