RFC usenet article archive program with keyword lookup

Skip to first unread message

Chuqui Q. Koala

Jan 7, 1985, 11:46:12 PM1/7/85
My thanks to all of the people out there that responded to my request for
ways of dealing with directory structures and large numbers of files. Most
responses were either variations of the /f/o/o/foo format or suggested
machine dependent things like hashing and playing with inode numbers. The
RFC below incorporates a way of dealing with a directory structure that
I think is rather elegant. I've got parts of the code for the following
program definition written (mostly the keyword generation stuff) so I'm
interested in hearing about potential problems, features I might have
missed and general comments and technical discussions before I find them
out the hard way. Please respond directly to me, I'll post updates to
the document or other comments as neccessary.

thanks, as they say, in advance.



Archiving and accessing usenet articles with keyword lookup
chuq von rospach
national semiconductor
last revision: 1/7/85

This is a preliminary description of a program that can generate and
maintain an archive of usenet articles and allow looking up articles
based on the article-id, subject lines, or keywords pulled out of the
article itself.

The basic calling sequence of this program is:

arc [-q | -v ] <command_string>

<command string is the actualy command to execute. Available commands
are the following (minimum typing needed to be unique can be used):

add <filename> [<filename> [...]]
adds the given unix filenames to the database. Must be done by
the archive superuser. All files must be in usenet article format.
If the filename is a directory all files in it (including more
directories) are handled appropriately. This command is only
available to the archive superuser.

totape <id> [<id> [...]]
the given id numbers are taken out of the database area and
placed in a directory so that they can be backed to secondary
storage. All pointers to the articles are updated to show that
they are not accessible from the system. This command is only
available to the archive superuser.

fromtape <id> [<id> [...]]
takes the given id numbers out of the secondary storage directory
and puts them back in the database, updating the pointers as
appropriate to make them accessible. This command is only available
to the archive superuser.

ignore <keyword>
removes any references from the system for the given keyword and
adds the keyword to the IGNORE file so that it will no longer be
used. Removing a keyword from IGNORE allows it to be used for
lookups, but does not regenerate keywords from the existing
database. There ought to be a way of doing this, but I'm lazy.

lookup [-i] <keyword>
returns the filenames (article id numbers for the -i flag) of
all articles which are referenced by the given keyword. This is
useful for pulling references out of the database, using commands
like "print `arc lookup f77`". Case is not significant.

Future enhancement (maybe): allowing booleans for the keywords,
as in "arc lookup (net.unix-wizards | net.unix) & (f77 | fortran)"
or some such syntax. This can be done without a lot of problem
with sort, uniq, and other tools, I think, using lists of article
id's in various combinations.

subject [-i] <subject_string>
does for subject lines what keywords do. <subject_string> can
be a substring, i.e. a string such as "vms" will match all
subjects with "vms" in it. Case is not significant.

article [-i] <article_id>
does for the article-id of an article what keywords do. <article_id>
can be a substring, so "orca" will return filenames to all articles
posted by site "orca" (also "orcan", etc...).

idfile <id> [<id> [...]]
returns the filename needed to access an article with a given
id number.

Prints a full list of keywords known by the system. This list is
in 'dbm' format, which means that the order they are printed out
will look random to all logical beings.


<id> - an article id, guaranteed to be unique in the system. This is given
to the article when entered into the system, and is simply an
incremented counter (there is, unfortunately, no data in the
usenet header guaranteed to be unique without a lot of
pain, such as an article-id and a posting time or some such
garbage). Basic definition of <id> is "typedef id long;"

<keyword> - a keyword is any string of characters allowed by the function
isalnum() (see ctype(3)) bounded by any non-allowed characters.
the underbar (_) and period(.) characters are also allowed.
All uppercase characters are mapped to lower case. Newsgroups
the article was posted to and any words in the Keywords header
line will also be stored as keywords.

<subject_string> - any legal Unix string. All upper case characters are
mapped to lower case.

<article_id> - a string of the format <article_number>@<site>.<domain>
as defined in RFC822, or any substring. All upper case characters
are mapped to lower case.


The following defines are used for filenames in the system. Hardcoded
filenames will NOT be used. Period. (This is a hint to those that might
try to 'fix' this software later).

Define default value use

ARC /usr/spool/newsarc home directory for this thing.
IGNORE ignore Keywords to not build lookup tables
for in the database.
ARTICLES articles stores <id> <article_id> one per
line for each article in database.
SUBJECTS subjects stores <id> <subject> one per line
for each article in the database.
KEYWORDS keywords Directory used to store keyword
lookup table. All files accessed
only through the FILEMAP pointers.
TAPE Tape Directory used to store articles
that will be moved to magtape or
other secondary storage.
DATA data Directory used to store the article
database. All files are accessed
only through the FILEMAP pointers.
FILEMAP filemap dbm pointer file to access keywords
and article database. See below.

The Filemap pointer file

Filemap is a dbm format file that is used to map keywords and id numbers
to the filenames needed to access them. The basic format is as follows:

key: I<id>
data: <filename>

key: K<keyword>
data: <filename>

<id> filenames are relative to DATA, <keyword> filenames are relative to
KEYWORD so that the system can be moved around easily by simply changing
a few defines and recompiling. The reason I'm doing this pointer lookup
at all is because I expect to have to change the internal structures of
both KEYWORD and DATA as they grow. Unix directory lookups are quadratic
in nature so if the directories get too large things slow down
significantly. By using the lookup file I can change one pointer in a
single place-- otherwise I'd have to track down references all over
the place when I want to change it. As I'm setting things up all I need
to do is change one function in the program and write a quick and dirty
program to relink the files appropriately.

DATA internal structure

For now, I'm going to use the following algorithm to store files in the DATA
directory. Because I only need to generate a filename once (all other
references are made through FILEMAP) generation can be expensive. We need
to be careful to keep directories at a reasonable size, so a fairly bushy
result is needed. The current algorithm should also grow the number of
directories based on the number of files so the inodes lost to directories
will stay proportional to the number of files in the system.

char *idfile(id_num)
id id_num;
char map[] = "abcdefghij";
static char filename[MAXLEN];
char id_string[MAXLEN];
int i = 0, j = 0;

p = id_string[0];

while (id_string[i])
filename[j++] = map[id_string[i++] - '0'];
filename[j++] = '/';
filename[j] = '\0';

KEYWORD filename generation

Because I expect to see many fewer keywords than articles (I HOPE!!!)
the generation of filenames is much simpler. Basically it is generated
by taking keyword[0] as a subdirectory, as in f/foo.

From the ministry of silly talks: Chuq Von Rospach
{allegra,cbosgd,decwrl,hplabs,ihnp4,seismo}!nsc!chuqui nsc!chu...@decwrl.ARPA

Now look here Mister "I'm not just a word processor"...

Reply all
Reply to author
0 new messages