The story from which I excerpt is at
<http://www.zdnet.com/intweek/stories/news/0,4164,2640446,00.html>.
"Another dot-com, this time Deja.com, is
seeking shelter from the Internet economy by way of a merger. The New
York company's business units - its longstanding Deja News Usenet
discussion business and its newer Precision Buying Service buying
guide - will be sold separately, according to sources familiar with
the negotiations."
[...]
"However it has not grown fast enough, or done well enough, to support
the buying service that Deja has been developing for the past two
years, sources said. The company claimed about 5 million unique users,
90 percent of whom use the Usenet service. About 10 percent to 20
percent use the Precision Buying Service."
Insert standard grumbles about why Usenet archives need to be
maintained without profit-motive. I wonder what sort of resources it
consumes; somewhere in the back of my mind is the idea that it might
be worthwhile to start a nonprofit Usenet archive *now* (although
that's probably something that should've been done ages ago).
-Rich
--
Rich Lafferty ----------------------------------------
Nocturnal Aviation Division, IITS Computing Services
Concordia University, Montreal, QC
ri...@bofh.concordia.ca -------------------------------
> Insert standard grumbles about why Usenet archives need to be maintained
> without profit-motive. I wonder what sort of resources it consumes;
> somewhere in the back of my mind is the idea that it might be
> worthwhile to start a nonprofit Usenet archive *now* (although that's
> probably something that should've been done ages ago).
Depends on how much you want to archive, but text Usenet is probably
running around 2GB a day, very roughly. Other people likely have better
numbers.
So, with compression, figure 500GB a year would do fine, at least for
a while.
50GB a year would probably be enough to handle the Big Eight at least
right now.
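
(As a quick back-of-the-envelope check of those figures; both the
2GB/day feed size and the compression ratio are rough assumptions:)

    # Rough sanity check of the volume estimates above.  Both the
    # 2GB/day feed size and the compression ratio are guesses.
    gb_per_day = 2
    raw_per_year = gb_per_day * 365        # ~730 GB/year uncompressed
    # Even modest compression on text (say 1.5:1) lands in the
    # ~500 GB/year ballpark quoted above; gzip usually does better.
    compressed_per_year = raw_per_year / 1.5
    print(raw_per_year, round(compressed_per_year))   # 730 487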
--
Russ Allbery (r...@stanford.edu) <http://www.eyrie.org/~eagle/>
Would it be possible to do this in some kind of distributed manner,
like Pluribus or Napster? So, say, I could volunteer to archive a set
of groups that comes to 4GB or so, whilst someone with more
resources could archive a greater set? That way the load is spread,
and by careful overlapping, you get redundancy for free.
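
(A rough Python sketch of how that split might work; the group sizes,
capacities, and names below are invented for illustration, and each
group is handed to two volunteers so the overlap provides the
redundancy:)

    # Assign each group to REPLICAS volunteers, least-loaded first,
    # so overlapping copies give redundancy.  All numbers invented.
    groups = {"comp.lang.c": 1.2, "sci.physics": 0.8,
              "rec.arts.sf.written": 2.5}               # GB/year, guesses
    volunteers = {"alice": 4.0, "bob": 50.0, "carol": 6.0}  # GB/year offered
    REPLICAS = 2

    assignments = {v: [] for v in volunteers}
    load = {v: 0.0 for v in volunteers}
    for group, size in sorted(groups.items(), key=lambda g: -g[1]):
        # Hand each group to the REPLICAS least-loaded volunteers
        # that still have room for it.
        takers = sorted(volunteers, key=lambda v: load[v])
        for v in [t for t in takers if load[t] + size <= volunteers[t]][:REPLICAS]:
            assignments[v].append(group)
            load[v] += size
    print(assignments)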
--
r...@frottage.org
The only real problem, as near as I can tell, is the interface. If someone
can design a decent interface that doesn't involve having to run a
full-blown web server on each archiving box, that would go a long way.
NNTP isn't the right tool, unfortunately.
Something like a traditional spool, but with each directory broken up so
it doesn't get too large, would be a nice, convenient storage format,
although not ideal for compression purposes. If you ran an rsync daemon,
people could easily grab copies of the groups you archived so that there
would be some redundancy....
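
(A minimal sketch of the mirroring side in Python, assuming the
archiving site exports a read-only rsync module; the host and module
names here are made up:)

    # Pull a copy of a peer archiver's groups from their rsync daemon.
    # "archive.example.org" and the "usenet" module are hypothetical.
    import subprocess

    PEER = "rsync://archive.example.org/usenet/"
    MIRROR = "/archive/mirror/"
    subprocess.run(["rsync", "-az", "--delete", PEER, MIRROR], check=True)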
I suspect disk would be the easy part of the equation, though. I
shudder to think of what Deja uses to *search* through that as quickly
as they do.
> 50GB a year would probably be enough to handle the Big Eight at least
> right now.
Awfully tempting, although obtaining 50GB/year plus cycles to search
would be Slightly Nontrivial. *sigh* It'd be hard to pick and choose
non-big-8 groups, too. I can see why Deja just took anything they
could get that wasn't a binary.
>> Depends on how much you want to archive, but text Usenet is probably
>> running around 2GB a day, very roughly. Other people likely have
>> better numbers.
> I suspect disk would be the easy part of the equation, though. I shudder
> to think of what Deja uses to *search* through that as quickly as they
> do.
Yes, that's harder. The nice thing about that, though, is that if you
have the content, you can deal with the search at your leisure, try
different indexes, wait for someone else to come along and do it for you,
etc. As long as the content is there, the search problem will eventually
solve itself.
> Awfully tempting, although obtaining 50GB/year plus cycles to search
> would be Slightly Nontrivial. *sigh* It'd be hard to pick and choose
> non-big-8 groups, too. I can see why Deja just took anything they could
> get that wasn't a binary.
The cooperative approach would be the best, I think. Each person takes a
little chunk. I could probably find resources around here to archive some
bits, particularly in hierarchies like comp.* and sci.* that have
long-term educational merit....
OK, so here's one scenario:
The usenet archive is a cash cow, but instead of focusing on that they
are taking all the profits and dumping them into The Suits' Latest Pet
Project. (That's what happened at Cygnus for many years, with
s/usenet archive/open source/ and s/buying guide/closed source/).
Or the other scenario is that:
Both are losing money, but the buying guide is somewhat more
doomed than the usenet archive.
Either way, separating the two strikes me as a good thing for the
usenet archive, although whether it is going to go bankrupt anyway is
less clear to me.
> it might be worthwhile to start a nonprofit Usenet archive *now*
Well, the idea of looking into whether this can be distributed, a la
freenet, strikes me as an interesting avenue to pursue (I might start
out by reading some of the freenet/gnutella/&c docs to see what
they're doing about things like searching and keeping one node from
getting overloaded and such). Might be possible to have the storage
be existing news servers (with some way of limiting how much
bandwidth/CPU/IO they are volunteering) with the user interface being
the new part (not that I've tried to design this in any detail...).
I'm left with a feeling that it doesn't need to be that Massive. It
seems as though time would be better put into getting friends of Usenet
to offer a certain amount of resources which are known to be reliable
than into worrying about redundancy and dialups and users
with fragile systems, especially when it becomes important to ensure
that the content of a particular message isn't changed (and while
there are means by which we can tell if a message changes, I'm not
aware of any that'd let us change it back). It's the difference
between storage and an archive, in other words.
Of course, magnetic platters aren't a particularly *reliable* way of
making archives; electronic archiving is to a large extent something I
don't entirely understand, although our head archivist has explained
the general problems to me a few times. :-) I'd say that the first
priority would be a system that puts Usenet into a reliable storage
medium, after which we can start worrying about providing access to
it. :-/ Unfortunately, that doesn't have many bells and whistles
attached.
I don't think the problem with Deja was centralization, though -- just
that their interests were in making money from a Usenet archive,
rather than being in maintaining an archive of Usenet for its own
sake. They *do* have 10TB of data kicking around, though. When did
they start archiving?
Brewster Kahle's Internet Archive
(http://www.archive.org/collections/index.html#Usenet)
is working on this currently. I actually encouraged him
to try to hire Russ away from Stanford, bastard that I am.
Last I heard, though, he was hiring, so maybe it's time
for People With Clue (which would not be me) to put their
money where their mouths are.
Or not.
Rage away,
meg
--
Meg Worley _._ m...@steam.stanford.edu _._ Comparatively Literate
Interface is the job of the client. HTTP is a perfectly reasonable
transport agent, and any of the lightweight servers would do. Web servers
start getting bulky when they add features beyond the basic listen-
parse-return cycle.
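
(For instance, Python's bundled HTTP server covers that basic cycle in
a few lines; the archive root below is an assumed path, and a real
deployment would want logging, rate limits, and so on:)

    # Minimal read-only HTTP front end over an archive tree.
    import functools
    from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

    handler = functools.partial(SimpleHTTPRequestHandler,
                                directory="/archive/news")  # assumed path
    ThreadingHTTPServer(("", 8080), handler).serve_forever()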
> Something like a traditional spool, but with each directory broken up so
> it doesn't get too large, would be a nice, convenient storage format,
> although not ideal for compression purposes. If you ran an rsync daemon,
> people could easily grab copies of the groups you archived so that there
> would be some redundancy....
$NEWSROOT/hierarchy/YYYY/MM/DD/ for instance? High-traffic groups *need*
to split daily (e.g. rec.arts.sf.written) to maintain a reasonable number
of articles per directory. And it's all same-language text (or close
enough); compression will be perfectly reasonable.
No, I've thought better of it. Use $NEWSROOT/hierarchy/YYYY/MM/DD.tgz instead.
You get a number of files any filesystem can handle, an individual download
which is not too large, and your interface tool can handle the threading and
correlation and so forth. Side benefit, all of the tools to manipulate this
sort of thing are easy to build.
It would be nice to build a Subject index at the end of each month, too.
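
(A Python sketch of the nightly packing step under that layout, plus
the monthly Subject index; the paths and header parsing are simplified
assumptions:)

    # Pack one day's articles into DD.tgz and append Subjects to a
    # per-month index.  Layout follows the scheme above; error
    # handling omitted for brevity.
    import email, os, tarfile

    def pack_day(newsroot, hierarchy, yyyy, mm, dd):
        day_dir = os.path.join(newsroot, hierarchy, yyyy, mm, dd)
        with tarfile.open(day_dir + ".tgz", "w:gz") as tar:
            tar.add(day_dir, arcname=dd)
        index = os.path.join(newsroot, hierarchy, yyyy, mm, "subjects.idx")
        with open(index, "a") as idx:
            for name in sorted(os.listdir(day_dir)):
                with open(os.path.join(day_dir, name), "rb") as f:
                    msg = email.message_from_binary_file(f)
                idx.write("%s/%s\t%s\n" % (dd, name, msg.get("Subject", "")))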
-dsr-
Hey, those guys stole my idea! :-) That could be very useful. I wonder
if they know about the Deja situation.
Well, it is for scholarship and research purposes only and you need a
password to access the materials. At least assuming the generic
policies at http://www.archive.org/proposal.html#Terms apply to the
usenet collection.
While that's still of some use, it does strike me as a somewhat
different service from the fully public one that deja.com provides.
If Deja is going bankrupt, maybe we/someone could form a non-profit to
purchase their database for "pennies on the dollar"?
And stashing the data in Freenet or MojoNation would work. All that
is "needed" is to convert every article into a single unique "pathname".
Heck, /usenet/messageid would work perfectly.
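
(A sketch of that mapping in Python; hashing is just one way to
normalize the characters in Message-IDs that are awkward in path
names, and the /usenet prefix follows the suggestion above:)

    # Map a Message-ID to a unique, path-friendly key.  Message-IDs
    # are already globally unique; the hash only normalizes awkward
    # characters, and the two-level split keeps directories small.
    import hashlib

    def article_key(message_id):
        digest = hashlib.sha1(message_id.encode("utf-8")).hexdigest()
        return "/usenet/%s/%s" % (digest[:2], digest[2:])

    # e.g. article_key("<87abc123.fsf@example.org>")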
--
Mark Atwood | Freedom from want, freedom from fear, freedom from choice.
m...@pobox.com | Is that the freedom you want?
http://www.pobox.com/~mra
>If Deja is going bankrupt, maybe we/someone could form a non-profit to
>purchase their database for "pennies on the dollar"?
They're not going bankrupt. They're being acquired. The article on
ZDNet that started all this implied that the deal was pretty much done
but that the details weren't public yet, at least for the archiving
segment.
--
-------------------------------------------------------------------------
ka...@eyrie.org | Please do not e-mail me copies of material posted
Kate Wrightson | to newsgroups. I read the groups to which I post.
Sorry to follow up to a post of mine, and a somewhat old one at that,
but a closer reading of the story confirms this.
The key word is "profitable".
"The company believes the profitable Usenet business unit"....
http://www.zdnet.com/intweek/stories/news/0,4164,2640446,00.html