How to get a grip on the Google spams of the last six months or so
seems to be along the lines of trawling the newsgroups, pulling
down their posts since about that date, then extracting from
each one whether it has "KINDLE EPUB EBOOK" or Thai text in it.
Mostly they can all be identified because they set an
X-Content-Transfer-Encoding base64 header, so that the content
is a sort of inscrutable block unless it's decoded.
So it looks like it's possible to identify which ones are these
spams just off the subject and other headers, then pretty
definitively off the format.
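As a quick sketch of that format check, in Python: the header name
follows the observation above, while the marker list and the
single-part body assumption are mine.

    import base64
    import email

    def looks_like_google_spam(raw_message: bytes) -> bool:
        # The giveaway: an X-Content-Transfer-Encoding base64 header
        # over a body that is one inscrutable base64 blob.
        msg = email.message_from_bytes(raw_message)
        if msg.get('X-Content-Transfer-Encoding', '').lower() != 'base64':
            return False
        try:
            body = base64.b64decode(msg.get_payload())
        except Exception:
            return False
        text = body.decode('utf-8', errors='replace').upper()
        # Marker strings from the subjects; an assumed, partial list.
        return any(marker in text for marker in ('KINDLE', 'EPUB', 'EBOOK'))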
Then each one of these has a Google posting account in
one of the Google headers.
Injection-Info:
google-groups.googlegroups.com;
posting-host=146.70.11.7; posting-account=hJ31DwoAAADGk9KnJ0tR36KM3U7DAsJC
It's that posting-account that basically identifies the abuser,
though whether it's the abuser's own or an innocent dupe's,
compromised after something like OAuth, is undetermined.
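Extracting it is then just a pattern match over that header,
something like this, where the character class is a guess at the
token alphabet:

    import re

    def posting_account(injection_info: str):
        # Pull the posting-account token out of an Injection-Info header.
        m = re.search(r'posting-account=([A-Za-z0-9+/=_-]+)', injection_info)
        return m.group(1) if m else None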
Another sort of indicating bit is that they start with
X-Forwarded-Encrypted / X-Received / X-Forwarded-Encrypted / X-Received
headers pointing at some SMTP id's, which looks like an SMTP
gateway, and suggests alternate forms of message injection,
while the Path of each of the posts indicates it as coming from
postnews.google.com .
Not all do, though. Those without the gateway headers look like
usual Google posters' posts, so it seems like an automation of
some Groups API on the Google side.
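So two cheap fingerprints fall out of that, sketched here over a
parsed email.message.Message, with header names as observed above:

    def via_smtp_gateway(msg) -> bool:
        # The "encrypted SMTP" variant: gateway-style received headers.
        return 'X-Forwarded-Encrypted' in msg and 'X-Received' in msg

    def via_google_injection(msg) -> bool:
        # The usual case: Path says the post entered at Google.
        return 'postnews.google.com' in msg.get('Path', '')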
So anyways the idea is to (a sketch follows the list):
get the list of groups on a usenet server (call it GROUPS)
get the count of headers since a few weeks ago
get the overview of headers
find likely spams
make a list of spammed groups
get the spammed groups' count of headers since October
get the spammed groups' overview of headers since October
find likely spams
pull down the headers
extract the posting-account
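A rough sketch of that loop, using Python's nntplib (which assumes
Python 3.11 or earlier, since the module was removed in 3.12); the
host is a placeholder, and clamping the article range to the date
window is left out:

    import nntplib

    server = nntplib.NNTP('news.example.com')   # placeholder server
    resp, groups = server.list()                # GROUPS

    for g in groups:
        try:
            resp, count, first, last, name = server.group(g.group)
        except nntplib.NNTPError:
            continue
        # OVER yields (article-number, overview-dict) pairs for the range.
        resp, overviews = server.over((first, last))
        for artnum, over in overviews:
            subject = nntplib.decode_header(over.get('subject', ''))
            refs = over.get('references', '')
            if not refs and any(m in subject.upper()
                                for m in ('KINDLE', 'EPUB', 'EBOOK')):
                print(name, artnum, subject)

    server.quit()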
Then part of the challenge is not including any threaded replies,
in the sense that some people replied to these posts to reject them,
so an algorithm to mark spams has to avoid
Type I/II errors, the false positives/negatives. I.e., such posts,
the replies, don't have the same characteristics
in their own content.
(Or anything that contains bit.ly links or "common exact-links in the spam".
Also, all those "Case Analysis and Case Study Solution" spams
look kind of similar. In fact, when the spam started up I thought to
myself "hey, I wonder if that's those 'Case Analysis and Case Study
Solution' spammers".)
For example, on 10/2/23 I replied to a spam, so looking at it,
b0796889-551b-4637...@googlegroups.com
I would want to disambiguate spam-reply rejections from spams.
I'm not sure yet whether the spams with the same subject headers are
actually threaded replies or just have the same subject; I imagine
that they just have the same subject header and aren't threaded replies.
So, there would basically be, for cross-checking, "likely spam"
(not replies, no References) and "likely not-spam" (replies, References).
Otherwise my reply looks just like one of the spams, with a
Content-Transfer-Encoding and all that, but not the "encrypted SMTP"
bit.
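That cross-check then amounts to one rule over the overview fields;
field names follow NNTP OVER output, and the marker list is the same
assumption as above:

    def classify(over):
        # Spams match the content heuristics and are not replies;
        # rejections can match on content but carry References.
        subject = over.get('subject', '').upper()
        is_reply = bool(over.get('references', '').strip())
        matches = any(m in subject for m in ('KINDLE', 'EPUB', 'EBOOK'))
        if matches and not is_reply:
            return 'likely spam'
        if matches and is_reply:
            return 'likely not-spam'
        return 'other'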
"Note: Meta title tags should typically be around ...", is
one of the blast-fax mail-merge prompts that slips out,
with the idea of that finding that quote in the source
code will probably indicate the origin of the software.
So, the idea is to key off of the posting-account, then compute
counts for these sorts of relations
posting-account -> email-address
posting-account -> targeted-group
then to compute how many spams were sent, by whom, to where,
and whether posting-account <-> email-address is 1-1 or fraudulent,
then to result in a neat list of posting-accounts to batch up in a
sort of report and send up to Google as a curated sort of spam report.
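The counting itself is then straightforward, assuming the header pass
above has reduced each spam to a (posting-account, from-address,
group) triple; the sample data is a placeholder, reusing the
posting-account shown earlier:

    from collections import Counter, defaultdict

    spams = [
        ('hJ31DwoAAADGk9KnJ0tR36KM3U7DAsJC', 'someone@example.com',
         'some.spammed.group'),
    ]  # placeholder triples

    counts = Counter(acct for acct, _, _ in spams)
    addresses = defaultdict(set)
    groups = defaultdict(set)
    for acct, addr, grp in spams:
        addresses[acct].add(addr)
        groups[acct].add(grp)

    for acct, n in counts.most_common():
        # More than one From address per posting-account suggests fraud.
        flag = '1-1' if len(addresses[acct]) == 1 else 'suspect'
        print(acct, n, flag, sorted(groups[acct]))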
Then there's an idea that this thus results in a sort of spam rule,
toward making a "federated spam rules" type of thing,
with regards to things like "spam blacklists" and those
kinds of things, and heuristics or rules, vis-a-vis the fact
that neural-net classifiers are inscrutable, so instead there
is to be a sort of "open quality rules" for relating messages
to groups.
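Purely speculatively, such an "open quality rule" might just be a
declarative record a server could publish and pool; every field name
here is made up for illustration:

    rule = {
        'id': 'google-groups-base64-blast-2023',
        'match': {
            'header': {'X-Content-Transfer-Encoding': 'base64'},
            'subject-any': ['KINDLE', 'EPUB', 'EBOOK'],
            'references': 'absent',   # exclude threaded reply rejections
        },
        'action': 'flag',
        'evidence': 'curated posting-account report',
    }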
So, we can identify the spams, and, sort of, trace them to their
origins, across all the Usenet groups.