New way of detecting spam and detecting non-spam

marc perkel

Sep 21, 2016, 12:05:45 PM
to rspamd
Here's a new trick that could be added to rspamd to seriously increase spam and ham detection accuracy.

I am the founder of Junk Email Filter - a spam filtering service. I was thinking about patenting this idea but was recently diagnosed with lung cancer and am probably not going to be around long enough to get a patent. And the patent was only going to be for licensing to the big boys.

Anyhow - here's a link to how this works. It's free for anyone who wants to develop this idea.


This is a learning system - but it's not Bayesian learning. I'm using Redis to make this work. It is especially good at recognizing good email as good, which avoids false positives.

The idea behind it is that there are an infinite number of words and phrases used in normal email that spammers never say. For example - if I emailed you with the subject "let's get some lunch" - it's good email because spammers never say that.

How do I know what spammers never say? I store what they do say and see if it's NOT IN THE LIST. 
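A minimal sketch of the "not in the list" lookup, assuming the stored spam phrases live in a plain Python set (the post says Redis is used; the set only stands in for it here). The function names, the phrase length of up to three words, and the sample messages are illustrative assumptions, not the actual Junk Email Filter implementation.

```python
import re

def phrases(text, max_words=3):
    """Return every run of 1..max_words consecutive words in text, lowercased."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    found = set()
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            found.add(" ".join(words[i:i + n]))
    return found

def learn_spam(spam_seen, text):
    """Record every phrase of a known spam message (hypothetical helper)."""
    spam_seen.update(phrases(text))

def never_said_by_spammers(spam_seen, text):
    """Return the phrases of a new message that are NOT in the spam list."""
    return phrases(text) - spam_seen

# Illustrative training data, invented for this example.
spam_seen = set()
learn_spam(spam_seen, "Get cheap pills now")
learn_spam(spam_seen, "Some lunch deals, cheap")

# Phrases in the new subject line that no stored spam has used.
unseen = never_said_by_spammers(spam_seen, "Let's get some lunch")
print(sorted(unseen))
```

Each individual word except "let's" has appeared in spam here, yet multi-word phrases such as "get some lunch" have not, and that is the kind of evidence the post treats as a sign of good email. How those unseen phrases are weighted into a final decision is not covered by this sketch.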

Anyhow - a better explanation is in the link.

I've been running this for 9 months now, filtering thousands of domains, and it works really well. I'd like to see this in general use because you all might improve it beyond what I'm seeing.

Feel free to contact me with any questions.

Vsevolod Stakhov

Sep 22, 2016, 2:27:07 PM
to marc perkel, rspamd
You are trying to describe the hidden Markov model used for many
statistical systems. In this model, you are combining probabilities
without real knowledge of the underlying data.

Rspamd uses the OSB model in its Bayes classifier. OSB is mostly described
in this paper: http://www.siefkes.net/ie/winnow-spam.pdf
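For readers unfamiliar with OSB (orthogonal sparse bigrams), here is a minimal sketch of how such features can be generated: each word is paired with each of the few words preceding it, and the skipped positions are marked. The window size of 5, the `<skip>` marker, and the function name are assumptions chosen for illustration; this is not rspamd's actual tokenizer code.

```python
def osb_tokens(words, window=5):
    """Pair each word with each of the up to window-1 words before it."""
    tokens = []
    for i, word in enumerate(words):
        for dist in range(1, window):
            j = i - dist
            if j < 0:
                break  # ran off the start of the message
            # One "<skip>" marker per word skipped between the pair.
            skips = " <skip>" * (dist - 1)
            tokens.append(f"{words[j]}{skips} {word}")
    return tokens

tokens = osb_tokens("let's get some lunch".split())
for token in tokens:
    print(token)
```

The resulting pairs, rather than single words, are what a classifier would then count per class.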

I've tested OSB-Winnow and OSB-Bayes and found that OSB-Bayes is more
precise: http://osbf-lua.luaforge.net/papers/NIST_TREC_2005_paper.pdf

In this paper, the authors say that there was a significant bug in
their OSBF implementation that caused Winnow to perform better. There are
also slides about OSBF-Bayes in particular:
http://osbf-lua.luaforge.net/papers/osbf-eddc_slides.pdf

Rspamd uses the same approach from the very beginning of the project
(and now it uses Redis to store statistical tokens).

--
Vsevolod Stakhov

Marc Perkel

Sep 22, 2016, 3:15:33 PM
to Vsevolod Stakhov, rspamd
It's unfortunate that you don't comprehend that this is not Bayesian filtering.

Vsevolod Stakhov

Sep 22, 2016, 5:05:31 PM
to Marc Perkel, rspamd
The concrete method of processing tokens does not matter. It could be some probability combinator, or a neural network, or just naive fingerprinting (which is what you are suggesting).

Sent from my iPhone

--
You received this message because you are subscribed to the Google Groups "rspamd" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rspamd+un...@googlegroups.com.
Visit this group at https://groups.google.com/group/rspamd.

Marc Perkel

Sep 22, 2016, 5:52:04 PM
to Vsevolod Stakhov, rspamd
If you want to discuss this, it would be useful if you actually read what I'm proposing before forming an opinion about what you think it is. This is very different from any method currently in use, and discussing it requires some understanding of what I'm doing.

Eugene Syromyatnikov

Sep 22, 2016, 6:38:53 PM
to Marc Perkel, Vsevolod Stakhov, rspamd
On Thu, Sep 22, 2016 at 9:52 PM, Marc Perkel <grinz...@gmail.com> wrote:
> If you want to discuss this it would be useful if you actually read what I'm
> proposing first before forming an opinion about what you think it is. This
> is very very different than any methods currently being used and it requires
> some understanding of what I'm doing in order to discuss it.
Can you please explain how the proposed method differs from
Markovian discrimination with corpus-dependent weights?

--
Eugene "eSyr" Syromyatnikov
mailto:evg...@gmail.com
xmpp:eSyr@jabber.{ru|org}

Marc Perkel

Sep 22, 2016, 7:17:03 PM
to Eugene Syromyatnikov, Vsevolod Stakhov, rspamd


On 09/22/16 15:38, Eugene Syromyatnikov wrote:
> On Thu, Sep 22, 2016 at 9:52 PM, Marc Perkel <grinz...@gmail.com> wrote:
>> If you want to discuss this it would be useful if you actually read what I'm
>> proposing first before forming an opinion about what you think it is. This
>> is very very different than any methods currently being used and it requires
>> some understanding of what I'm doing in order to discuss it.
> Can you please explain how the proposed method different from
> markovian discrimination with weight dependent on corpus?
>

Markovian discrimination is a Bayesian variation. Mine isn't anything
close to Bayesian. Not even remotely close.
