
best way to archive old groups


matty i

Aug 13, 2001, 9:25:24 AM
i have a lot of old mail lying around, most of which i rarely, if
ever, need to look at. all of this mail is in nnml format (as, at one
time, it was an active mailing list i read or some other
correspondence). i'm wondering if there's a good format to convert
this into, that i can compress, and will allow access on a limited
basis. it would be especially interesting if that limited access were
easily searchable, as i'd imagine that would be the primary use.

in that same vein, wouldn't a mail format which kept postings for the
last month in normal nnml, but "archived" anything older by some other
means be a neat feature? it might be similar to what m$ claims they
do in outlook (i realized after typing), but, i never really believed
they did anything when they said "archiving old messages" anyways.

--
tom: i am awake. ~ matt ittigson
barkeep: your eyes are closed. ~
tom: who you gonna believe? ~ itti...@pobox.com.spam.bad
- miller's crossing ~

Kai Großjohann

Aug 13, 2001, 3:12:42 PM
matty i <why.should.w...@usenet.sigh> writes:

> i have a lot of old mail lying around, most of which i rarely, if
> ever, need to look at. all of this mail is in nnml format (as, at one
> time, it was an active mailing list i read or some other
> correspondence). i'm wondering if there's a good format to convert
> this into, that i can compress, and will allow access on a limited
> basis. it would be especially interesting if that limited access were
> easily searchable, as i'd imagine that would be the primary use.

I just leave it in nnml format. Disk space is cheap these days. When
I got a lot of mail that I wanted to keep, I went through the group
every couple of months and moved all messages except the last 500 or
so into an archive group manually.

Another possibility would be to use nnfolder groups for archiving.
There, each group is one file, so there is less lost disk space due to
fragmentation. nnfolder groups can also be gzipped, but I'm not sure
how to do this with Gnus. But it's possible.
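One speculative way to get at gzipped folders from Emacs, sketched here as a guess rather than a tested Gnus recipe: jka-compr (auto-compression-mode) makes Emacs read and write .gz files transparently, so a hand-gzipped nnfolder file can at least be visited and searched from Emacs; whether nnfolder itself finds it under the .gz name is exactly the open question above.

```elisp
;; Speculative sketch, not a tested Gnus recipe: with jka-compr
;; enabled, Emacs opens and saves gzipped files transparently.
(auto-compression-mode 1)

;; Then, outside Emacs, something like:
;;   gzip ~/Mail/archive/old-list
;; and to make the folder plain again:
;;   gunzip ~/Mail/archive/old-list.gz
```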

Various ways exist to search old mails. I hesitate to mention nnir.el
because I've written it and it requires a search engine. But together
with Nevin Kapur's (?) great nnir-grepmail.el, you can use grepmail to
search your mail and don't need an additional search engine.

kai
--
~/.signature: No such file or directory

matty i

Aug 13, 2001, 3:37:27 PM
Kai.Gro...@CS.Uni-Dortmund.DE (Kai Großjohann) writes:

> matty i <why.should.w...@usenet.sigh> writes:
>
> > i have a lot of old mail lying around, most of which i rarely, if
> > ever, need to look at. all of this mail is in nnml format (as, at
> > one time, it was an active mailing list i read or some other
> > correspondence). i'm wondering if there's a good format to
> > convert this into, that i can compress, and will allow access on a
> > limited basis. it would be especially interesting if that limited
> > access were easily searchable, as i'd imagine that would be the
> > primary use.
>
> I just leave it in nnml format. Disk space is cheap these days.

but it creates loads of rather small files (which i don't believe is
particularly fast for disk access) and is harder to manage (by
compression or backups).

> When I got a lot of mail that I wanted to keep, I went through the
> group every couple of months and moved all messages except the last
> 500 or so into an archive group manually.

so there's no (pre-written) magic to make gnus do this?

> Another possibility would be to use nnfolder groups for archiving.
> There, each group is one file, so there is less lost disk space due
> to fragmentation. nnfolder groups can also be gzipped, but I'm not
> sure how to do this with Gnus. But it's possible.

pointer to more information about it?

Karl Kleinpaste

Aug 13, 2001, 4:12:30 PM
matty i <why.should.w...@usenet.sigh> writes:
> but it creates loads of rather small files (which i don't believe is
> particularly fast for disk access) and is harder to manage (by
> compression or backups).

If there's some reason to need to write down the filenames by
hand, perhaps; but given that files in need of either compression or
being backed up are typically discovered by a suitable automatic
process, why do you care if there is 1 file or 1000 files?

I am not trying to be critical, but I am trying to see if you are
perhaps concerned with an unsuitable choice of technical detail.

> so there's no (pre-written) magic to make gnus do this?

The concept of "archival" seems to vary rather a bit with each person,
as to what they want to archive, how, and to where.

That said, there is quite a bit of ready-made infrastructure by which
to accomplish automatic expiry. Take a peek at my bit of tutorial on
setting this up, posted last week:
news:vxkhevk4p...@cinnamon.vanillaknot.com

Having done it once, it becomes pretty trivial to do it again for as
many groups as you have reason to archive.

matty i

Aug 13, 2001, 4:25:37 PM
Karl Kleinpaste <ka...@charcoal.com> writes:

> matty i <why.should.w...@usenet.sigh> writes:
> > but it creates loads of rather small files (which i don't believe
> > is particularly fast for disk access) and is harder to manage (by
> > compression or backups).
>
> If there's some reason to need to write down the filenames by
> hand, perhaps; but given that files in need of either compression or
> being backed up are typically discovered by a suitable automatic
> process, why do you care if there is 1 file or 1000 files?

it's easier to lose one file out of 1000 (some process dies before
completion, the disk gets corrupted, there's some weird parsing bug in
the script that does something, running out of disk space, etc) than
to lose the one big file that's expected. it's also more likely that
you'll notice something going wrong with the one file, than with one
file out of 1000 (in the example of some kind of corruption).

they aren't major issues, but it seems to me easier to back up (and
compress) one file than 5000 files.

also, if i'm searching through my mail directory for something,
searching through 5000 small files is much slower than one bigger
file.

> I am not trying to be critical, but I am trying to see if you are
> perhaps concerned with an unsuitable choice of technical detail.

i could be convinced that there isn't much reason to have one file
over a bunch of small ones. i guess, more than anything else, it
seems common sense to archive it that way.

> > so there's no (pre-written) magic to make gnus do this?
>
> The concept of "archival" seems to vary rather a bit with each
> person, as to what they want to archive, how, and to where.
>
> That said, there is quite a bit of ready-made infrastructure by
> which to accomplish automatic expiry. Take a peek at my bit of
> tutorial on setting this up, posted last week:
> news:vxkhevk4p...@cinnamon.vanillaknot.com
>
> Having done it once, it becomes pretty trivial to do it again for as
> many groups as you have reason to archive.

thanks for the pointer.

matty i

Aug 13, 2001, 4:29:07 PM
Karl Kleinpaste <ka...@charcoal.com> writes:

[...]

> That said, there is quite a bit of ready-made infrastructure by
> which to accomplish automatic expiry. Take a peek at my bit of
> tutorial on setting this up, posted last week:
> news:vxkhevk4p...@cinnamon.vanillaknot.com

you didn't address compression at all in that post. kai's answer of
disk is cheap is probably a good enough answer, but ...

Nevin Kapur

Aug 13, 2001, 4:57:03 PM
matty i <why.should.w...@usenet.sigh> writes:

> i have a lot of old mail lying around, most of which i rarely, if
> ever, need to look at. all of this mail is in nnml format (as, at one
> time, it was an active mailing list i read or some other
> correspondence). i'm wondering if there's a good format to convert
> this into, that i can compress, and will allow access on a limited
> basis. it would be especially interesting if that limited access were
> easily searchable, as i'd imagine that would be the primary use.

My setup is similar to yours. Here's what I do:

- My primary mail is stored in nnml format.  I archive all mail in
  nnfolder format (no compression).
- I've written a function that looks at an article and decides where it
  should eventually end up.  (One can decide where an article should
  end up by looking at any header in the original article.)
- I set nnmail-expiry-target to this function.
- I control when to expire by customizing the group parameter
  expiry-wait.
- To search I use grepmail and Kai Großjohann's nnir.el with my
  nnir-grepmail.el.


> in that same vein, wouldn't a mail format which kept postings for the
> last month in normal nnml, but "archived" anything older by some other
> means be a neat feature?

This is easy enough to achieve by just frobbing expiry-wait and expiry-target.


Here's the aforementioned function:

(require 'cl)  ; for caddr and cadddr

(defvar nk/nnmail-expiry-targets nil
  "A list of (\"header\" \"regexp\" \"target\" \"date format\") entries.
If the header matches the regexp, the message will be expired to the
target group, with the date in the required format appended.  If no
date format is specified, \"%Y\" is assumed.")

(defun nk/nnmail-expiry-target (group)
  "Return a target expiry group depending on the last match in
`nk/nnmail-expiry-targets'."
  (let* (header
         (case-fold-search nil)
         (from (or (message-fetch-field "from") ""))
         (to (or (message-fetch-field "to") ""))
         (date (date-to-time
                (or (message-fetch-field "date") (current-time-string))))
         (target 'delete))
    ;; Note that the last match will be returned.
    (dolist (regexp-target-pair nk/nnmail-expiry-targets target)
      (setq header (car regexp-target-pair))
      (cond
       ;; If the header is "from" or "to", match either field.
       ((and (or (string-match header "from") (string-match header "to"))
             (or (string-match (cadr regexp-target-pair) from)
                 (and (string-match message-dont-reply-to-names from)
                      (string-match (cadr regexp-target-pair) to))))
        (setq target (concat (caddr regexp-target-pair)
                             "-"
                             (format-time-string
                              (or (cadddr regexp-target-pair) "%Y") date))))
       ;; Guard against the header being absent, which would make
       ;; string-match signal an error on nil.
       ((string-match (cadr regexp-target-pair)
                      (or (message-fetch-field header) ""))
        (setq target (concat (caddr regexp-target-pair)
                             "-"
                             (format-time-string
                              (or (cadddr regexp-target-pair) "%Y") date))))))))

and my settings for it:

(setq nk/nnmail-expiry-targets
      '(("from" ".*" "nnfolder:Archive" "%Y-%b")
        ("from" "xyz" "nnfolder:XYZ")
        ("from" "abc" "nnfolder:ABC")))

(setq nnmail-expiry-target 'nk/nnmail-expiry-target)

--
Nevin

Harry Putnam

Aug 13, 2001, 7:15:13 PM
matty i <why.should.w...@usenet.sigh> writes:

> Karl Kleinpaste <ka...@charcoal.com> writes:
>
> [...]
>
> > That said, there is quite a bit of ready-made infrastructure by
> > which to accomplish automatic expiry. Take a peek at my bit of
> > tutorial on setting this up, posted last week:
> > news:vxkhevk4p...@cinnamon.vanillaknot.com
>
> you didn't address compression at all in that post. kai's answer of
> disk is cheap is probably a good enough answer, but ...

I agree with Kai about the disk space. And if searching is likely to
be the main usage, any kind of compression has got to hinder that to
one degree or another.

Here is one scenario: I run rsync against my Mail and News setup
several times a day. It's sort of a mirror but not really. The
regular Mail/News setup changes constantly and I set expiry to
comfortable limits since everything that comes thru is being archived
to a separate archive directory structure. The archive grows and
grows and all old groups or even long since deleted groups still exist
in the archive. I keep stuff in the regular setup long enough by my
reckoning, it varies from group to group. But it all accumulates in
the archive.

In the Archive:
Mail tops out at 786MB and News at 720MB for a total of 1506MB.

But with 47 Gigs of storage on that machine, it's not too big really.
I have home made tools for searching it but that is one area that I'm
not satisfied with yet. Gnus can access any of the groups in the
archive by nndoc or nndir depending on group format.

You could even access the whole Archive with nneething, and step thru
it like dired.

Currently considering ways to grab small bits of the archive and
display them in a gnus nnir-like buffer in an ephemeral group,
grabbing the bits by search expression, something like one does on
groups.google.

So in summary, I find it an unnecessary complication to consider
compression, and find rsync to be a good choice as archiver. I
believe it is best to get groups of this size out of gnus control,
since the display engine is worked to death in large groups unless you
have a very plain display, and even then it works gnus hard.

Karl K. has stated that he finds gnus can handle large groups easily,
but that has not been my experience. I may possibly have non-sensible
stuff in my setup, but I find gnus to be very very slow above 3-4
thousand messages, unbearable in the teens and higher, in opening,
limiting, sorting, threading etc etc. Gnus keeps track of an awful

Karl Kleinpaste

Aug 13, 2001, 8:45:59 PM
Harry Putnam <rea...@newsguy.com> writes:
> Karl K. has stated that he finds gnus can handle large groups easily,
> but that has not been my experience. I may possibly have non-sensible
> stuff in my setup, but I find gnus to be very very slow above 3-4
> thousand messages, unbearable in the teens and higher, in opening,
> limiting, sorting, threading etc

No, what I said is that I don't find groupings of articles so large to
be reasonable in the first place, hence my conception of a "large"
mail group is evidently about an order of magnitude lower than yours.

I just dug around in my archives briefly to determine that the two
very largest groups I have are 4900 and 3300, but those are outliers;
otherwise the real, working upper bound I've got is more on the order
2000 or 2500. And I practically never just plain enter those groups.
Any time I have something in there that I need to find, it's time for
nnir -- the ritualized gouging-with-a-blunt-instrument of coughing up
5000+ articles in a *Summary* buffer and then trying actually to
_find_ something in a horde that large...well, it's not something I'm
emotionally conditioned to do.

I strongly suspect that threading is the true time-killer in your use,
because threading is at very best O(n log n) and, depending on the
algorithm used, may in fact be O(n^2). (I know that there was brief
mention of Gnus' threading on the ding list, but I didn't consider it
very carefully.) Small surprise that sucking in umpteen thousand
articles and cross-referencing every single one against every single
other for potential parent/child relationships blows chunks,
particularly in an interpreted language like elisp.

Samuel Padgett

Aug 13, 2001, 10:18:57 PM
Karl Kleinpaste <ka...@charcoal.com> writes:

> I strongly suspect that threading is the true time-killer in your use,
> because threading is at very best O(n log n) and, depending on the
> algorithm used, may in fact be O(n^2).

Maybe Gnus could not thread extremely large groups then? It'd be nice
if there was a variable to control it. (It could be a cousin of
`font-lock-maximum-size' and `line-number-display-limit'.)
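A hypothetical sketch of what such a variable might look like, modeled on the `line-number-display-limit' pattern. Neither name below exists in Gnus; both are made up purely for illustration.

```elisp
;; Hypothetical -- these names are invented for illustration only.
(defvar gnus-summary-maximum-threaded-size 2000
  "Don't build threads for groups with more articles than this.")

;; Summary generation could then do something like:
;; (when (> article-count gnus-summary-maximum-threaded-size)
;;   (setq gnus-show-threads nil))
```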

Sam
--
May all your PUSHes be POPped.

Karl Kleinpaste

Aug 13, 2001, 11:21:35 PM
Samuel Padgett <spad...@nc.rr.com> writes:
> Maybe Gnus could not thread extremely large groups then? It'd be nice
> if there was a variable to control it.

That's a possibility, but the problem is that really big groups are
exactly those which most desperately *need* threading.

If you want a horrifying example of thread depth which would surely be
utterly inexplicable without threading, take a peek at the 2800-line
summary at http://www.cs.cmu.edu/~karl/www/histogram/guns.summary,
especially the last 300 or so lines.

Karl Kleinpaste

Aug 13, 2001, 11:23:56 PM
Karl Kleinpaste <ka...@charcoal.com> writes:
> [a screwed-up URL ]

Make that:
http://www.cs.cmu.edu/~karl/histogram/guns.summary

Harry Putnam

Aug 14, 2001, 1:45:16 AM

Karl Kleinpaste <ka...@charcoal.com> writes:

[...]

> I strongly suspect that threading is the true time-killer in your use,
> because threading is at very best O(n log n) and, depending on the
> algorithm used, may in fact be O(n^2). (I know that there was brief
> mention of Gnus' threading on the ding list, but I didn't consider it
> very carefully.) Small surprise that sucking in umpteen thousand
> articles and cross-referencing every single one against every single
> other for potential parent/child relationships blows chunks,
> particularly in an interpreted language like elisp.

Eh? I have no idea what Karl just said, but I think he and I agree
.. he he.

Harry wrote:
> . . . . . . . . . . . . . . . . . . . . . . . I set expiry to
> comfortable limits since everything that comes thru is being archived
> to a separate archive directory structure.

Groups above 2-3 thousand are not practical in gnus, and in fact are
not reflective of human communication patterns.

However, huge archives are practical for the purpose mentioned by OP.
That is, to search and use small portions of. Karl (as a known
archive maven and card-carrying hoarder) has apparently arrived at a
way to do that within gnus, using nnir/swish++ or the like. I'm still
bumbling around looking for a good way to interface that usage pattern
such that the archives are external to gnus but the display of small
portions is done inside gnus, similar to nnweb.

Henrik Holm

Aug 14, 2001, 2:48:59 AM
[Karl Kleinpaste]

How do you get the `thread' view?

(what I mean is the collection of |, -, `, and + that connect
the articles together:)

,----
| O +---> [ 16: Christopher Morton ]
| O | +---> [ 44: ki...@rightwinger.com]
| O | | +---> [ 83: Aaron R. Kulkis ] Re: Glen Yeadon,
| O | | | `---> [ 48: Gunner ] Re: Glen Yea
| O | | +---> [ 82: Aaron R. Kulkis ] Re: Glen Yeadon,
| O | | `---> [ 43: Christopher Morton ] Re: Glen Yeadon,
| O | `---> [ 33: silverback ]
| O | +---> [ 101: Aaron R. Kulkis ] Re: Glen Yeadon,
| O | `---> [ 27: Christopher Morton ] Re: Glen Yeadon,
| O | `---> [ 44: silverback ]
| O | `---> [ 37: Christopher Morton ]
| O | `---> [ 54: silverback ]
| O | `---> [ 11: Christopher Morton ]
`----


Henrik.

Kai Großjohann

Aug 14, 2001, 7:41:36 AM
Henrik Holm <h.h...@spray.no> writes:

> How do you get the `thread' view?

Recent Oort has %B for gnus-summary-line-format.

Kai Großjohann

Aug 14, 2001, 7:40:16 AM
Karl Kleinpaste <ka...@charcoal.com> writes:

> I strongly suspect that threading is the true time-killer in your use,
> because threading is at very best O(n log n) and, depending on the
> algorithm used, may in fact be O(n^2). (I know that there was brief
> mention of Gnus' threading on the ding list, but I didn't consider it
> very carefully.) Small surprise that sucking in umpteen thousand
> articles and cross-referencing every single one against every single
> other for potential parent/child relationships blows chunks,
> particularly in an interpreted language like elisp.

I find that generating the summary buffer is the time killer, even
though it's probably just O(n). But then, I've never tried groups
larger than a couple of thousand, either.

Karl Kleinpaste

Aug 14, 2001, 8:43:24 AM
Kai.Gro...@CS.Uni-Dortmund.DE (Kai Großjohann) writes:
> Recent Oort has %B for gnus-summary-line-format.

And it's from a patch to 5.8 that was posted here maybe 2 weeks ago.

Kai Großjohann

Aug 14, 2001, 1:10:52 PM
matty i <why.should.w...@usenet.sigh> writes:

> Kai.Gro...@CS.Uni-Dortmund.DE (Kai Großjohann) writes:
>
>> When I got a lot of mail that I wanted to keep, I went through the
>> group every couple of months and moved all messages except the last
>> 500 or so into an archive group manually.
>
> so there's no (pre-written) magic to make gnus do this?

No. But here's another idea: if the mails get into the group via
splitting in the first place, then you can (easily) tell Gnus to just
split them into a different group each month.

For example:

(setq nnmail-split-methods
      `((,(format-time-string "mail.gnus.%Y-%m") "To:.*di...@gnus.org")
        (,(format-time-string "mail.misc.%Y-%m") "")))

If you restart Emacs every day, that's it. If you don't restart Emacs
every day, then just wrap the above into a function and use
midnight.el to have Emacs execute the function after midnight.
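A minimal sketch of that wrapper, assuming midnight.el's `midnight-hook'; the group names are from Kai's example above, and the address is truncated in this archive.

```elisp
;; Sketch only: recompute the dated split targets once a day via
;; midnight.el, so a long-running Emacs rolls over at month changes.
(require 'midnight)

(defun my-update-split-methods ()
  "Point `nnmail-split-methods' at this month's groups."
  (setq nnmail-split-methods
        `((,(format-time-string "mail.gnus.%Y-%m") "To:.*di...@gnus.org")
          (,(format-time-string "mail.misc.%Y-%m") ""))))

(my-update-split-methods)                      ; set it now
(add-hook 'midnight-hook #'my-update-split-methods)
```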

Okay. Now you have one group per month, rather than just two groups
(the archive group and the main group). But maybe it's good enough?

Henrik Holm

Aug 15, 2001, 4:37:04 AM
[Kai Großjohann]

> Henrik Holm <h.h...@spray.no> writes:
>
> > How do you get the `thread' view?
>
> Recent Oort has %B for gnus-summary-line-format.

Ah, I guess that means 'more recent than v0.03' :)


Henrik.

Samuel Padgett

Aug 15, 2001, 3:03:30 PM
Karl Kleinpaste <ka...@charcoal.com> writes:

> Samuel Padgett <spad...@nc.rr.com> writes:
>> Maybe Gnus could not thread extremely large groups then? It'd be nice
>> if there was a variable to control it.
>
> That's a possibility, but the problem is that really big groups are
> exactly those which most desperately *need* threading.

Good point.

Kai Großjohann

Aug 15, 2001, 5:40:34 PM
Samuel Padgett <spad...@nc.rr.com> writes:

> Karl Kleinpaste <ka...@charcoal.com> writes:
>
>> Samuel Padgett <spad...@nc.rr.com> writes:
>>> Maybe Gnus could not thread extremely large groups then? It'd be nice
>>> if there was a variable to control it.
>>
>> That's a possibility, but the problem is that really big groups are
>> exactly those which most desperately *need* threading.
>
> Good point.

Maybe some optimization is possible by turning threading off
initially, then only threading the right thread with `A t' (I think)
when you have found an interesting article.

Henrik Holm

Aug 16, 2001, 3:30:16 AM
[Kai Großjohann]

> Maybe some optimization is possible by turning threading
> off initially, then only threading the right thread with
> `A t' (I think) when you have found an interesting article.

ITYM `A T' (gnus-summary-refer-thread)?
(`A t' is gnus-article-babel)

I tried the former one -- but it doesn't seem to work when the
buffer isn't already threaded.

An alternative might be finding an article in the thread
you're interested in in order to get the exact Subject,
limiting to this subject line (`/ s <subject-regexp> RET'),
and then do `C-M-t'. The downside is that you need to know
regexps -- but in practice, limiting to a word or a substring
is often sufficient, in my experience.

Henrik

Kai Großjohann

Aug 16, 2001, 6:01:15 AM
Henrik Holm <h.h...@spray.no> writes:

> [Kai Großjohann]
>
>> Maybe some optimization is possible by turning threading
>> off initially, then only threading the right thread with
>> `A t' (I think) when you have found an interesting article.
>
> ITYM `A T' (gnus-summary-refer-thread)?
> (`A t' is gnus-article-babel)

No, I meant `T t'. Sorry.

0 new messages