Storing news in compressed form

Andy Clews

Mar 6, 1990, 9:01:47 AM

We are running News B2.11 on a Sequent S81, DYNIX 3.0.15.

We are having recurring problems with news stuff filling our disk, even
with fairly fascist expiry for some groups.

It occurs to me that it would be a useful thing to be able to store news
articles in compressed form (a la compress(1)), and the article would be
uncompressed for reading and batching to our feed site as necessary. I
know that compressed-batch feeding can be done, but I am not aware of
being able to store articles in compressed form for reading by users.
This would save us a hell of a lot of space. Is this a possibility in B
news? In fact this question is more relevant to C News because we'll be
migrating to that fairly soon.

Any help appreciated.
--
Andy Clews, Computing Service, Univ. of Sussex, Brighton BN1 9QN, England
JANET: an...@syma.sussex.ac.uk BITNET: andy%syma.sus...@uk.ac

Henry Spencer

Mar 8, 1990, 12:14:09 PM

In article <23...@syma.sussex.ac.uk> an...@syma.sussex.ac.uk (Andy Clews) writes:
>... I am not aware of
>being able to store articles in compressed form for reading by users.
>This would save us a hell of a lot of space. Is this a possibility in B
>news? In fact this question is more relevant to C News because we'll be
>migrating to that fairly soon.

We thought about this, but decided against it in the end. The space savings
would be small, since compress does not do well on small files. All the
news readers would have to be changed to know about it, a prospect we
dread. And it would greatly increase overhead in processing and reading
news. Our conclusion was that it didn't seem worth the trouble.
--
MSDOS, abbrev: Maybe SomeDay | Henry Spencer at U of Toronto Zoology
an Operating System. | uunet!attcan!utzoo!henry he...@zoo.toronto.edu
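
To put rough numbers on the small-file point above, here is a minimal Python sketch. It is not taken from B News or C News. It uses zlib (deflate) as a stand-in for compress(1), which is LZW, so the absolute ratios will differ. The synthetic articles, the word list, the 1024-byte fragment size, and every function name are assumptions made up for illustration.

```python
import random
import zlib

# Hypothetical word list, chosen only to produce text-like input.
WORDS = ("news article disk space expire feed batch spool reader group "
         "compress inode site header posting thread unix system kernel file").split()


def make_article(n: int, rng: random.Random) -> bytes:
    """Build a small synthetic article of roughly 1 KB (not real news data)."""
    headers = (
        f"From: user{n}@example.invalid\n"
        f"Newsgroups: news.software.b\n"
        f"Subject: Re: Storing news in compressed form\n"
        f"Message-ID: <{n}@example.invalid>\n\n"
    )
    body = " ".join(rng.choice(WORDS) for _ in range(150))
    return (headers + body + "\n").encode("ascii")


def on_disk(size: int, fragment: int = 1024) -> int:
    """Round a file size up to whole fragments (1024 bytes is an assumption)."""
    return -(-size // fragment) * fragment


rng = random.Random(0)
articles = [make_article(n, rng) for n in range(200)]

raw = sum(len(a) for a in articles)
# One compressed stream per article: each stream starts with no history.
per_file = sum(len(zlib.compress(a)) for a in articles)
# One compressed stream over all articles: later articles reuse earlier text.
bundled = len(zlib.compress(b"".join(articles)))

# Space occupied when every article is its own file, rounded to fragments.
raw_on_disk = sum(on_disk(len(a)) for a in articles)
per_file_on_disk = sum(on_disk(len(zlib.compress(a))) for a in articles)

print("raw bytes:              ", raw)
print("compressed per file:    ", per_file)
print("compressed as one bundle:", bundled)
print("raw, fragment-rounded:  ", raw_on_disk)
print("per file, rounded:      ", per_file_on_disk)
```

Compressing each article separately gives the compressor no history to draw on, and the result is still rounded up to a whole fragment on disk. How much this matters on a real spool depends on the actual article size distribution, which this sketch does not model.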

Leslie Mikesell

Mar 9, 1990, 2:00:25 PM

In article <1990Mar8.1...@utzoo.uucp> he...@utzoo.uucp (Henry Spencer) writes:
>>being able to store articles in compressed form for reading by users.
>We thought about this, but decided against it in the end. The space savings
>would be small, since compress does not do well on small files. All the
>news readers would have to be changed to know about it, a prospect we
>dread.

How about using something like zoo format to bundle many articles into
a single file? With an external index (which could be rebuilt from the
internal info if needed) this could be very efficient to handle, except
for expiring individual articles. Grouping by expiration date would
avoid the issue and setting an arbitrary size limit on the bundle would
ensure that you could do an efficient in-memory merge of any two bundles
as articles are deleted. In many cases, it might not be necessary to
unbundle/rebundle for the next site - just store it somewhere and update
the local index.

Perhaps the best approach to changing the news readers would be to start
with the nntp server and then modify it and the client programs to work
with local IPC where a network is not used.

Les Mikesell
l...@chinet.chi.il.us
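
The bundle-plus-external-index idea above can be sketched as follows. This is a minimal Python illustration, not the zoo format: the record layout (a fixed header, then the message ID, then a zlib-compressed body), the function names, and the use of an in-memory dict as the "external index" are all assumptions. Because each record carries its own ID and length, the index can be rebuilt by scanning the bundle, as the post suggests.

```python
import io
import struct
import zlib

# Hypothetical record header: message-ID length (2 bytes), payload length (4 bytes).
HEADER = struct.Struct(">HI")


def append_article(bundle, index, msg_id, text):
    """Append one compressed article to the bundle and record its offset."""
    bundle.seek(0, io.SEEK_END)
    offset = bundle.tell()
    key = msg_id.encode("ascii")
    payload = zlib.compress(text)
    bundle.write(HEADER.pack(len(key), len(payload)) + key + payload)
    index[msg_id] = offset


def read_article(bundle, index, msg_id):
    """Use the external index to jump straight to one article."""
    bundle.seek(index[msg_id])
    key_len, payload_len = HEADER.unpack(bundle.read(HEADER.size))
    bundle.seek(key_len, io.SEEK_CUR)  # skip the stored message ID
    return zlib.decompress(bundle.read(payload_len))


def rebuild_index(bundle):
    """Recover the index from the bundle's internal headers alone."""
    index = {}
    bundle.seek(0)
    while True:
        offset = bundle.tell()
        header = bundle.read(HEADER.size)
        if len(header) < HEADER.size:
            break
        key_len, payload_len = HEADER.unpack(header)
        msg_id = bundle.read(key_len).decode("ascii")
        bundle.seek(payload_len, io.SEEK_CUR)  # skip the body without inflating it
        index[msg_id] = offset
    return index
```

A bundle here is any seekable binary file object, for example `io.BytesIO()` or a file opened with mode `"r+b"`. Since each body is compressed on its own, the gain in this sketch comes mainly from using one file (and one inode) for many articles; deleting a single article still means rewriting the bundle, which is the expiry problem the post raises.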

Michael R. Johnston

Mar 9, 1990, 10:44:49 PM

In article <23...@syma.sussex.ac.uk> an...@syma.sussex.ac.uk (Andy Clews) writes:

>It occurs to me that it would be a useful thing to be able to store news
>articles in compressed form (a la compress(1)), and the article would be
>uncompressed for reading and batching to our feed site as necessary.

This would be nice except for the fact that you'd have to modify ALL your
newsreaders/shell scripts/programs that rely upon the fact that news is stored
uncompressed. Assuming you accomplish this it'd work just fine. Of course then
you'd have to deal with the extraordinary overhead of uncompressing all the
articles in a group just to do an '=' in rn. I don't think it's practical on
MOST machines.

--
Michael R. Johnston mi...@lilink.com
Lilink Communications rutgers!lilink!mikej
"Affordable Unix Solutions" (516) 285-4148

Henry Spencer

Mar 10, 1990, 6:36:56 PM

In article <1990Mar9.1...@chinet.chi.il.us> l...@chinet.chi.il.us (Leslie Mikesell) writes:
>How about using something like zoo format to bundle many articles into
>a single file? ...

There are a lot of things that could be done along these lines, but they
all have the basic problem of breaking newsreaders, plus the subsidiary
problem that it then becomes much more difficult to use Unix tools like
egrep on news articles. The extra CPU time eaten up in compression and
decompression, especially in news readers, is not a trivial issue either.
Yes, there are people who would see a net benefit, but overall it seemed
like too much work for limited return.

David Schachter

Mar 10, 1990, 8:03:52 PM

In article <1990Mar10.0...@lilink.com> mi...@lilink.UUCP (Michael R. Johnston) writes:
>In article <23...@syma.sussex.ac.uk> an...@syma.sussex.ac.uk (Andy Clews) writes:
>>it would be a useful thing to be able to store news articles in compressed
>>form
>
>you'd have to deal with the extraordinary overhead of uncompressing all the
>articles in a group just to do an '=' in rn. I don't think it's practical on
>MOST machines.

A counter-argument: many computers, particularly 386 PCs, have inadequate
disk throughput to keep the CPU busy. On such systems, uncompressing uses
otherwise wasted CPU cycles, and might offer a performance boost, by reducing
disk bandwidth requirements. This site (llustig) is an example.

-- David Schachter
llustig!da...@mips.com
...!uunet!mips!llustig!david
da...@llustig.UUCP (MAYBE)

Palo Alto, California, USA

Geoffrey Welsh

Mar 11, 1990, 11:51:22 AM

In article <1990Mar10.2...@utzoo.uucp> he...@utzoo.uucp (Henry Spencer) writes:
>In article <1990Mar9.1...@chinet.chi.il.us> l...@chinet.chi.il.us (Leslie Mikesell) writes:
>>How about using something like zoo format to bundle many articles into
>>a single file? ...
>
>There are a lot of things that could be done along these lines, but they
>all have the basic problem of breaking newsreaders, plus the subsidiary
>problem that it then becomes much more difficult to use Unix tools like
>egrep on news articles. The extra CPU time eaten up in compression and
>decompression, especially in news readers, is not a trivial issue either.
>Yes, there are people who would see a net benefit, but overall it seemed
>like too much work for limited return.

Actually, I think it'd be useful to have an (optional) feature allowing
a sysadmin to compress articles in much the same way some man entries are
compressed.

Sure, it'd be a shame to have to uncompress them every time someone reads
them, but the priorities (CPU time vs. disk space) may be different between
a major computing facility with lots of space, plenty of readers, and a
severe pinch on CPU time and, say, a personal system which carries a full
feed for distribution to a couple of sites and on which only one or two
people actually read the news.

Surely it wouldn't be too hard to add this feature to existing news
software?

Geoff

UUCP: watmath!xenitec!zswamp!root | 602-66 Mooregate Crescent
Internet: ro...@zswamp.fidonet.org | Kitchener, Ontario
FidoNet: SYSOP, 1:221/171 | N2M 5E6 CANADA
Data: (519) 742-8939 | (519) 741-9553
My comments do not represent and should not obligate anyone but myself.

Tom Limoncelli

Mar 11, 1990, 12:07:47 PM

Compressed news would work really well if it was handled by the file
system, and could be transparent to the applications.

Method #1: Dynamically uncompress a file when it's opened. Use a LRU
algorithm to decide when to re-compress it. Of course, if you grep
through your entire partition you will find the entire disk
uncompressed.

Method #2: Keep all the files compressed, but handle read/write/seek
so that they deal with files that are compressed. For example, a seek
to the beginning of the file would run quickly, forward seeks would be
slow, and backward seeks would be either slow or really slow depending
on where the current file position is. Writes that aren't to the
beginning of a new file could fail. For the typical "grep through a
directory" no seeks are done (unless you consider "read" a forward
seek).

I rather like method #2; but it's not something that I would implement
because :) I'm not low on diskspace right now. :-)

How many news-only applications would these affect? I don't know how
the CPU:$ and disk:$ ratios are where you are, but for me CPU power is
a lot cheaper than disk space. Your mileage may vary... and my mileage
may be changing in a week or two. :-(

-Tom
--
Tom Limoncelli The computer industry should spend more time in front of
tlim...@drew.uucp their computers. Remember when "Look & Feel"
tlim...@drew.Bitnet was what you tried to do on a date?
lim...@pilot.njin.net
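
The seek behaviour described under Method #2 can be sketched in user space. This is a minimal Python illustration, not a file system implementation: the class name `CompressedReader`, its `restarts` counter, and the choice of zlib are assumptions. It shows why a seek to the start is cheap, a forward seek costs a decompress-and-discard pass, and a backward seek means starting over from the beginning.

```python
import io
import zlib


class CompressedReader:
    """Read-only view of a zlib-compressed byte string with slow seeks."""

    def __init__(self, compressed: bytes):
        self._compressed = compressed
        self.restarts = 0  # how many times a backward seek forced a restart
        self._rewind()

    def _rewind(self):
        """Go back to the start of the stream with a fresh decompressor."""
        self._src = io.BytesIO(self._compressed)
        self._inflater = zlib.decompressobj()
        self._buffer = b""
        self._pos = 0
        self._eof = False

    def read(self, n: int) -> bytes:
        """Return up to n uncompressed bytes from the current position."""
        while len(self._buffer) < n and not self._eof:
            chunk = self._src.read(4096)
            if chunk:
                self._buffer += self._inflater.decompress(chunk)
            else:
                self._buffer += self._inflater.flush()
                self._eof = True
        out, self._buffer = self._buffer[:n], self._buffer[n:]
        self._pos += len(out)
        return out

    def seek(self, target: int) -> int:
        """Move to an absolute uncompressed offset; return the new position."""
        if target < self._pos:
            # Backward seek: nothing to do but start again from byte 0.
            self._rewind()
            self.restarts += 1
        while self._pos < target:
            # Forward seek: inflate and throw away everything in between.
            if not self.read(min(4096, target - self._pos)):
                break  # target lies past the end of the data
        return self._pos
```

A sequential reader such as grep never triggers a restart, which matches the observation that the typical "grep through a directory" case does no real seeks. Writes are left out entirely, in line with the suggestion that writes other than to a new file could simply fail.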

Roy Smith

Mar 11, 1990, 2:40:53 PM

lim...@pilot.njin.net (Tom Limoncelli) writes:
> Compressed news would work really well if it was handled by the file
> system, and could be transparent to the applications.

I'm not sure I want LRU-cached compression/decompression in my
kernel (actually, I'm sure I don't want it) but you might want to make each
news article a unix-domain socket. When you open the file, the process
controlling the socket would somehow cons up the contents of the article and
deliver it to you. It might have a plain-text copy somewhere, or it might
have to uncompress a compressed copy. In fact, it could do anything it
wanted to get the text, including contacting an NNTP server, making rrn
obsolete, since any regular old disk-file-based newsreader would now be able
to do NNTP. You might choose, on a per-group basis, to keep a local copy of
the article uncompressed for 36 hours after initial reception, compress it
after that and keep the compressed form around for another 3 days, and
delete it after that, but keep a pointer to a copy on an archival NNTP
server for a whole month.

I'm not that familiar with how unix domain sockets work. Is there
anything which would prevent you from having thousands (or even tens of
thousands) of them in existence at one time (i.e. one per new article)? Are
you only limited by inodes, or will you exhaust some network resource first?

This may be a nutso scheme, but at least it preserves the semantics
of what the news spool directory tree looks like to a news user agent.
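
The per-group schedule mentioned above (36 hours uncompressed, 3 more days compressed, then only a pointer to an archival server) amounts to a small table of age thresholds. A minimal Python sketch, with assumptions labelled: the names `Tier` and `tier_for` are made up, and "a whole month" is read as 30 days counted from initial reception, which the post does not actually specify.

```python
from datetime import timedelta
from enum import Enum


class Tier(Enum):
    PLAIN = "local, uncompressed"
    COMPRESSED = "local, compressed"
    REMOTE = "pointer to archival NNTP server"
    GONE = "expired"


PLAIN_FOR = timedelta(hours=36)      # from the post
COMPRESSED_FOR = timedelta(days=3)   # from the post, counted after PLAIN_FOR
REMOTE_UNTIL = timedelta(days=30)    # assumption: 30 days from reception


def tier_for(age: timedelta) -> Tier:
    """Decide where an article of the given age should be fetched from."""
    if age < PLAIN_FOR:
        return Tier.PLAIN
    if age < PLAIN_FOR + COMPRESSED_FOR:
        return Tier.COMPRESSED
    if age < REMOTE_UNTIL:
        return Tier.REMOTE
    return Tier.GONE
```

The process answering opens on an article would consult something like `tier_for` and then return the plain copy, inflate the compressed copy, or fetch from the remote server, without the newsreader seeing any difference.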

--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
r...@alanine.phri.nyu.edu -OR- {att,philabs,cmcl2,rutgers,hombre}!phri!roy
"My karma ran over my dogma"

Eric P. Scott

Mar 12, 1990, 2:11:27 AM

In article <1990Mar11....@llustig.uucp> da...@llustig.UUCP
(David Schachter) writes:
>A counter-argument: many computers, particularly 386 PCs, have inadequate
>disk throughput to keep the CPU busy. On such systems, uncompressing uses
>otherwise wasted CPU cycles, and might offer a performance boost, by reducing
>disk bandwidth requirements. This site (llustig) is an example.

Gee, I dunno. wet (a 386 with pretty decent disk bandwidth)
spends a lot of CPU time decompressing batches. If the phone
time weren't an issue, I'd flush the compression. Disk space
isn't in short supply. Inodes are.
-=EPS=-
--
C News: Canada's answer to acid rain

Gary Bridgewater

Mar 14, 1990, 2:45:33 AM

In article <Mar.11.12.07....@pilot.njin.net> lim...@pilot.njin.net (Tom Limoncelli) writes:
>Compressed news would work really well if it was handled by the file
>system, and could be transparent to the applications.
>
>Method #1: Dynamically uncompress a file when it's opened. Use a LRU
>algorithm to decide when to re-compress it. Of course, if you grep
>through your entire partition you will find the entire disk
>uncompressed.

This has the basis for a fairly big win. Doing one file at a time is not
space-effective, since a large percentage of the news files are within the
N/C size of compress (a previous posting the last time this issue came
up made that clear).
What is needed is an NFS style filesystem spoofer that pretends to be
/usr/spool/news externally. Internally, it keeps larger compressed chunks
in a big file (or group of files) indexed in some clever way (hashed article
IDs spring to mind). In essence, it becomes a news database server. This
is similar to the concept of SQL servers. The intrepid developer would, of
course, add TCP support (could be RPC based but that leaves out a lot of
systems) so the thing was naturally distributable, but not necessarily so.
All existing software continues to work, to the extent that it is needed.
However, expire would be built in (news comes in, old news goes out) as
would the history function (you already hashed the article id in order
to store it).
This would use a fair number of cycles on the server but I really wonder if it
would use more than inews/expire use already?
It would be fighting the file system optimizer, in some sense, but it might
end up beating up the disks less, in the long run.
--
Gary Bridgewater, Data General Corporation, Sunnyvale California
ga...@sv.dg.com or {amdahl,aeras,amdcad}!dgcad!gary
The impossible we understand right away - the obvious takes a little longer.
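
The "news database server" outlined above can be sketched as an in-memory store. This is a minimal Python illustration under stated assumptions, not a design anyone in the thread implemented: the class name `ChunkedSpool`, the chunk sizes, SHA-1 as the hash of the message ID, and zlib as the compressor are all made up for the example. Articles are compressed in multi-article chunks, the hashed-ID index doubles as the history check, and expiry drops whole chunks as new ones arrive.

```python
import hashlib
import zlib
from collections import OrderedDict


class ChunkedSpool:
    """Toy article store: multi-article compressed chunks plus a hashed-ID index."""

    def __init__(self, articles_per_chunk=50, max_chunks=100):
        self.articles_per_chunk = articles_per_chunk
        self.max_chunks = max_chunks
        self.chunks = OrderedDict()  # chunk number -> compressed bytes
        self.index = {}              # hashed ID -> (chunk number, offset, length)
        self.pending = []            # (hashed ID, text) not yet compressed
        self.next_chunk = 0

    @staticmethod
    def _key(msg_id):
        return hashlib.sha1(msg_id.encode("utf-8")).digest()

    def seen(self, msg_id):
        """History check: is this article currently held?"""
        key = self._key(msg_id)
        return key in self.index or any(k == key for k, _ in self.pending)

    def store(self, msg_id, text):
        """Accept an article unless it is a duplicate; return True if stored."""
        if self.seen(msg_id):
            return False
        self.pending.append((self._key(msg_id), text))
        if len(self.pending) >= self.articles_per_chunk:
            self._seal()
        return True

    def _seal(self):
        """Compress the pending articles as one chunk, then expire old chunks."""
        offset = 0
        for key, text in self.pending:
            self.index[key] = (self.next_chunk, offset, len(text))
            offset += len(text)
        joined = b"".join(text for _, text in self.pending)
        self.chunks[self.next_chunk] = zlib.compress(joined)
        self.pending = []
        self.next_chunk += 1
        while len(self.chunks) > self.max_chunks:
            old, _ = self.chunks.popitem(last=False)  # oldest chunk goes out
            self.index = {k: v for k, v in self.index.items() if v[0] != old}

    def fetch(self, msg_id):
        """Return the article text; raises KeyError if unknown or expired."""
        key = self._key(msg_id)
        for k, text in self.pending:
            if k == key:
                return text
        chunk_no, offset, length = self.index[key]
        data = zlib.decompress(self.chunks[chunk_no])
        return data[offset:offset + length]
```

One simplification to note: this sketch forgets an article's ID as soon as its chunk expires, so a late duplicate would be accepted again. A real history mechanism would need to remember IDs longer than it keeps the article bodies. Reading one article also inflates its whole chunk, which is the CPU cost the earlier posts worry about.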