New Usenet Archive

Jason Evans

Feb 7, 2022, 8:05:38 AM

Hi all,

For the past month, I have been downloading and sorting Usenet archives from
a news server (with their permission) of everything from 2003 until today.
My next step is to decide how to upload them to archive.org.

Here is the current archive that runs from the 80's and 90's until around
2003: https://archive.org/details/usenethistorical

Each newsgroup hierarchy has its own entry. I'm thinking about something
different, and I want your input on how to do that.

Here is my plan. The following newsgroup hierarchies will have their own
entries:

Big-8:
comp
sci
news
misc
talk
humanities
soc
rec

uk

de

alt will be broken down into subgroups because it's so huge.

alt-a-e
alt-f-j
alt-k-o
alt-p-t
alt-u-z

For example, alt.folklore.computers would be found in alt-f-j.

The rest of the hierarchies will be grouped together since they are
generally smaller and more likely to be nothing but spam.

Misc Newsgroup hierarchies-a-e
Misc Newsgroup hierarchies-f-j
Misc Newsgroup hierarchies-k-o
Misc Newsgroup hierarchies-p-t
Misc Newsgroup hierarchies-u-z
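
As a rough illustration of the grouping proposed above, here is a minimal Python sketch that maps a newsgroup name to the entry it would land in. The function names, the "other" bucket for names that do not start with a letter, and the choice to bucket alt groups by the first letter of their second component are assumptions made for the example; the plan itself does not specify them. "rec" is included among the stand-alone hierarchies on the assumption that the full Big-8 is meant.

```python
# Hypothetical helper, not part of the proposal: maps a newsgroup name to
# the archive.org entry described in the plan above.
STANDALONE = {"comp", "sci", "news", "misc", "talk", "humanities", "soc",
              "rec",  # assumption: the full Big-8 is intended
              "uk", "de"}

# Letter ranges used for both the alt buckets and the misc buckets.
RANGES = ["a-e", "f-j", "k-o", "p-t", "u-z"]


def letter_range(name: str) -> str:
    """Return the range (e.g. 'f-j') containing the first letter of name."""
    first = name[:1].lower()
    for r in RANGES:
        low, high = r.split("-")
        if low <= first <= high:
            return r
    # Assumption: names starting with a digit or other character go to a
    # separate bucket; the plan does not say where they belong.
    return "other"


def archive_entry(newsgroup: str) -> str:
    """Return the name of the archive entry for a newsgroup."""
    parts = newsgroup.lower().split(".")
    top = parts[0]
    if top in STANDALONE:
        return top
    if top == "alt":
        # Bucket by the first letter of the component after "alt".
        second = parts[1] if len(parts) > 1 else ""
        return "alt-" + letter_range(second)
    return "Misc Newsgroup hierarchies-" + letter_range(top)
```

With this sketch, `archive_entry("alt.folklore.computers")` gives `"alt-f-j"`, matching the example above.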

Here are my questions for you folks:

1. Does this make sense, or would breaking everything down by individual
hierarchy be better?

2. If I do it this way, are there any other hierarchies that should stand
alone rather than be grouped with the misc. groups?

One final note. In case you're wondering, I am not archiving any binary
groups or any group that I think could get deleted because of the extremely
distasteful subject matter. I think you can get my gist about what I mean.
Everything else is here. Even the stupid spammy revenge froups.

Jason

Adam H. Kerman

Feb 7, 2022, 11:03:07 AM

Jason Evans <jse...@mailfence.com> wrote:

>For the past month, I have been downloading and sorting Usenet archives from
>a news server (with their permission) of everything from 2003 until today.
>My next step is to decide how to upload them to archive.org.

So you'd be relying upon their indexing and its likely inability to tell
the difference between the article body, the .sig, and headers?

We've already got that. Google indexed Usenet articles as if they were
posted on the Web in the first place as the lousy Google Groups Web
interface was treated like a real Web page. Within Google Groups itself,
searching became seriously hideous because Google stopped devoting staff
resources to making sure the indexes were being maintained. The indexing
services weren't great but they were better than what they became.

An extremely serious problem with Google Groups indexing of the article
body, when it was working, was it didn't do a great job distinguishing
between the author's own text and the quoted text if it was a followup.
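
To make the distinction concrete, here is a minimal Python sketch of separating an article body into the author's own lines, quoted lines, and the signature. It relies on two common Usenet conventions: quoted lines start with ">", and the signature follows a line consisting of exactly "-- " (dash, dash, space). The function name is hypothetical, and real articles break these conventions often enough that this is only a starting point.

```python
def split_body(body: str):
    """Split an article body into (own_lines, quoted_lines, signature_lines)."""
    lines = body.splitlines()

    # Look for the conventional signature delimiter, a line of exactly "-- ".
    # The last such line is used, so an earlier stray delimiter does not
    # swallow the rest of the body.
    sig_start = None
    for i, line in enumerate(lines):
        if line == "-- ":
            sig_start = i

    sig = []
    if sig_start is not None:
        sig = lines[sig_start + 1:]
        lines = lines[:sig_start]

    own, quoted = [], []
    for line in lines:
        # Treat any line whose first non-blank character is ">" as quoted.
        if line.lstrip().startswith(">"):
            quoted.append(line)
        else:
            own.append(line)
    return own, quoted, sig
```

An indexer built along these lines could index only the first list, so a search for a phrase would find the article where it was written rather than every followup that quoted it.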

Usenet archives lack decent indexes. Is there a way for you to upload a
very small archive, then work on the indexing and presentation of the
articles so it in some way resembles walking the thread tree? Can the
index be developed along with the archive, and then tested tested tested
to avoid another Google Groups?
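
For what walking the thread tree could mean in practice, here is a minimal Python sketch that builds a parent-to-children index from each article's References header and then walks it depth first. It assumes the last message ID in References is the direct parent, which is the usual convention. The function names and the input shape (pairs of message ID and References value) are made up for the example.

```python
from collections import defaultdict


def build_thread_index(articles):
    """Build a mapping parent_id -> [child_ids] from (message_id, references).

    Articles with no References header are stored under the key None,
    meaning they start a thread.
    """
    children = defaultdict(list)
    for message_id, references in articles:
        refs = (references or "").split()
        # Assumption: the last ID in References is the direct parent.
        parent = refs[-1] if refs else None
        children[parent].append(message_id)
    return children


def walk(children, parent=None, depth=0):
    """Yield (depth, message_id) pairs in depth-first thread order."""
    for message_id in children.get(parent, []):
        yield depth, message_id
        yield from walk(children, message_id, depth + 1)
```

One known gap in this sketch: a followup whose parent is missing from the archive is never reached from the root, so a fuller version would have to promote such orphans to top level.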

>. . .

>One final note. In case you're wondering, I am not archiving any binary
>groups or any group that I think could get deleted because of the extremely
>distasteful subject matter. I think you can get my gist about what I mean.
>Everything else is here. Even the stupid spammy revenge froups.

Are you literally saying that you're archiving cancellable spam and
those various smaller-scale attacks on Usenet with articles uploaded by
the thousands from anonymizing servers that aren't preventing abuse?

Revenge froups weren't any more spammy than any other part of Usenet.
Spam is spam regardless of the newsgroup.

Thomas Hochstein

Feb 7, 2022, 12:30:03 PM

Adam H. Kerman wrote:

> So you'd be relying upon their indexing and its likely inability to tell
> the difference between the article body, the .sig, and headers?

AFAIS, <https://archive.org/details/usenethistorical> has just zipped mbox
archives, one per group, with no way to browse, search or index anything.

Jason Evans

Feb 7, 2022, 1:14:29 PM

Adam H. Kerman wrote:

> So you'd be relying upon their indexing and its likely inability to tell
> the difference between the article body, the .sig, and headers?
>
> We've already got that. Google indexed Usenet articles as if they were
> posted on the Web in the first place as the lousy Google Groups Web
> interface was treated like a real Web page. Within Google Groups itself,
> searching became seriously hideous because Google stopped devoting staff
> resources to making sure the indexes were being maintained. The indexing
> services weren't great but they were better than what they became.
>

There are two differences between what I'm doing and what Google is doing.

First, I am archiving the raw source articles in the same format as those
already on archive.org: plain-text MBOX files. If you're doing
research, download the newsgroup that you want and let your mail client or
whatever you want to use for MBOX files do the heavy lifting for you when it
comes to sorting and searching.

Second, Google no longer provides headers, which are important for research.
I am providing everything.
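
For anyone who would rather script this than use a mail client, here is a minimal sketch using the `mailbox` module from Python's standard library. It searches one header across every article in an MBOX file. The function name and any file name passed to it are hypothetical, and it assumes the archive has already been unpacked to a plain MBOX file.

```python
import mailbox


def search_headers(mbox_path, header, needle):
    """Yield (Message-ID, Subject) for articles whose header contains needle.

    The match is a case-insensitive substring test on the raw header value.
    """
    # create=False: fail instead of silently creating an empty mailbox
    # when the path is wrong.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            value = msg.get(header)
            if value is not None and needle.lower() in str(value).lower():
                yield msg.get("Message-ID"), msg.get("Subject")
    finally:
        box.close()
```

For example, `list(search_headers("some-group.mbox", "From", "example"))` would list the articles from matching posters. Because the full headers are in the file, the same approach works for Path, References, or any other header.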

> An extremely serious problem with Google Groups indexing of the article
> body, when it was working, was it didn't do a great job distinguishing
> between the author's own text and the quoted text if it was a followup.
>
> Usenet archives lack decent indexes. Is there a way for you to upload a
> very small archive, then work on the indexing and presentation of the
> articles so it in some way resembles walking the thread tree? Can the
> index be developed along with the archive, and then tested tested tested
> to avoid another Google Groups?

I don't have the time or energy to create a website to host this stuff that
would also do a good job of indexing everything. What I'm doing is providing
the files free of charge to archive.org so if someone else wants to do that,
they can.

Jason Evans

Feb 7, 2022, 1:16:22 PM

That is exactly what I have. My question is, is it better to have them on
archive.org with one entry per hierarchy or to group them like I suggested?

Adam H. Kerman

Feb 7, 2022, 1:43:46 PM

I saw that they were zipped. Jason stated he's doing something different.

So if he's merely presenting Usenet articles as text files, or digestified
somehow but still as text files, I was questioning how he was going to rely
upon archive.org's own indexing processes.

Adam H. Kerman

Feb 7, 2022, 1:49:30 PM

I didn't mean to volunteer you to perform work you weren't willing to
do. I apologize for that. My comment, stating the obvious, was pointing
out what we don't have.

I don't have an opinion on whether your proposed grouping is better or
worse.

Julien ÉLIE

Feb 8, 2022, 2:46:40 PM

Hi Jason,

> Here is the current archive that runs from the 80's and 90's until around
> 2003: https://archive.org/details/usenethistorical

As noted by another person (who spoke about that archive in a French
newsgroup), the encoding of bodies is wrong. All non-ASCII characters
are munged :-/
Seen in fr.* and de.*, and I bet it is the same for all hierarchies.

--
Julien ÉLIE

« J'oubliais qu'Assurancetourix a une nouvelle corde à sa harpe ! »
(Astérix)

Jason Evans

Feb 9, 2022, 2:07:20 AM

Julien ÉLIE wrote:

>
> Hi Jason,
>
>> Here is the current archive that runs from the 80's and 90's until around
>> 2003: https://archive.org/details/usenethistorical
>
> As noted by another person (who spoke about that archive in a French
> newsgroup), the encoding of bodies is wrong. All non-ASCII characters
> are munged :-/
> Seen in fr.* and de.*, and I bet it is the same for all hierarchies.
>

Hi Julien,

This doesn't really answer the question that I asked in my original article
about organizing Usenet hierarchies for archive.org.

However, to respond to your comment, I picked this article at random from
fr.usenet.distribution. This is a screenshot
(https://pasteboard.co/YA9d6r01LUnP.png) using Thunderbird from one of the
archives that I created. You can see that the French letters can be read
correctly because this article is from last year and encoded in UTF-8. Even
some of the old articles in this particular archive that are encoded in
iso-8859-15 appear correctly.

The problem is that when you go back far enough, either plain ASCII is used
or some non-standard encoding and then the non-English characters are
munged. My colleague, Tristan, has been doing some work on this when it
comes to this issue with Esperanto on the early Usenet.
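
To illustrate the kind of handling this calls for, here is a minimal Python sketch of decoding an article body when the declared charset is missing or unusable. The function name and the fallback order (UTF-8 first, then ISO-8859-15) are assumptions for the example, not a description of how the archives were produced, and no guess of this kind can recover text written in a truly non-standard encoding.

```python
def decode_body(raw: bytes, declared_charset=None,
                fallbacks=("utf-8", "iso-8859-15")):
    """Decode raw article bytes, returning (text, charset_used).

    The declared charset, if any, is tried first, then each fallback in turn.
    """
    candidates = ([declared_charset] if declared_charset else [])
    candidates += list(fallbacks)
    for charset in candidates:
        try:
            return raw.decode(charset), charset
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes are not valid in this charset.
            # LookupError: Python does not know the charset name.
            continue
    # latin-1 maps every byte to a character, so this cannot fail, but the
    # result may be wrong if the text was written in another 8-bit encoding.
    return raw.decode("latin-1"), "latin-1"
```

Trying UTF-8 first is deliberate: arbitrary 8-bit text is rarely valid UTF-8, so a successful UTF-8 decode is a fairly strong signal, whereas ISO-8859-15 accepts any byte sequence and so can only serve as a last guess.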

Jason

Julien ÉLIE

Feb 9, 2022, 12:34:47 PM

Hi Jason,

> The problem is that when you go back far enough, either plain ASCII is used
> or some non-standard encoding and then the non-English characters are
> munged. My colleague, Tristan, has been doing some work on this when it
> comes to this issue with Esperanto on the early Usenet.

Yes, apparently, the problem is only for old archives (of last century
or so). When no encoding is specified, non-ASCII chars get munged.
Thanks for the screenshot and information that recent articles are
correctly archived.

> This doesn't really answer the question that I asked in my original
> article about organizing Usenet hierarchies for archive.org.

I don't have a strong opinion about that. I would tend to prefer a
breakdown by individual hierarchy, as any kind of mixing of
hierarchies may not be what users want.

--
Julien ÉLIE

« You know what I did before I married? Anything I wanted to. »