For the development of web pages it is important to know the word
frequency of a page. Together with the GNU diction program you can check
and maintain easily the readability of a page.
So i need a cross referencer. I could not find a simple program for
win32. Somewhere around 1982 I developed a PASCAL cross reference
program, which is some time ago ;). Never done anything in C though.
I wanted to do this in C, mainly because of this newsgroup, which I find
to have a lot experts, providing usually good comments.
So I Googled and borrowed code from a lot of places, putting it together
in one file. The code is here (360 lines):
http://requirements-management.nl/files/download/xref-sources-0.99.zip
The header looks like this.
/****************************************************************/
/* a. File : xref.c */
/* b. Version : v0-99 */
/* c. Author : Hans Lodder (2007-09-28) */
/* d. Modified : Hans Lodder (2010-01-16) */
/* e. Description : Cross references an input file */
/* Can exclude keywords through L switch. */
/* f. Usage : xref [-h] {[-Len] | [-Ldu] | [-Lde] | [ -LC]} <file>*/
/* g. Default : Exclude English keywords -Len */
/* h. Algorithm : Uses a binary tree and linked lists */
/* i. Business requirements: */
/* 01. Produces a cross reference of any UTF-8 */
/* input file. */
/* 02. By default xref neglects frequent */
/* English keywords. */
/* 03. No input results in no output. */
/* 04. Fool data is isolated as early as */
/* feasible without */
/* sacrificing the design. */
/* 05. A modular and maintainable design */
/* j. Tools : gcc (4.4.0) -Wall -Wextra -pedantic -o xref.exe xref.c*/
/* splint (3.1.1) -weak xref.c */
/* indent (2.2.9) -gnu xref.c */
/* notepad++ (5.6.6) xref.c */
/* h. Test cases : 01. No input No output */
/* 02. 1 word 1 line file listing, */
/* 1 line output list */
/* 03. 2 equal words 1 line output list, */
/* 2 line numbers */
/* 04. 2 different words 2 line output list */
/* i. Remarks : */
/*****************************************************************/
If appropriate I am more than willing to add it to this message. Please
let me know.
All comments are welcomed. I would appreciate it if they not only
signaled issues, but also possible solutions.
Kind regards, and thanks in advance,
Hans Lodder
It's hard to be motivated to do a review of code not posted here so
you may get only a few replies. I'll made two suggestions after
commenting that the code looks good overall.
(1) An unbalanced tree is a good first approximation, but your list of
common words (I'd not call them keywords -- too confusing) is sorted
so you get, in essence, a linked list search not a tree-based one!
(2) If you include \r in the excluded characters, your files will work
on non Windows systems. Alternatively, find a way to distribute a
plain text version of the common word list so that it has native
line endings wherever it is used.
<snip>
--
Ben.
Thanks Ben! The complete code of xref.c
/********************************************************************************/
/* a. File : xref.c */
/* b. Version : v0-99 */
/* c. Author : Hans Lodder (2007-09-28) */
/* d. Modified : Hans Lodder (2010-01-16) */
/* e. Description : Cross references an input file */
/* Can exclude keywords through L switch. */
/* f. Usage : xref [-h] {[-Len] | [-Ldu] | [-Lde] | [ -LC]} <file> */
/* g. Default : Exclude English keywords -Len */
/* h. Algorithm : Uses a binary tree and linked lists */
/* i. Business requirements: */
/* 01. Produces a cross reference of any UTF-8 input file. */
/* 02. By default xref neglects frequent English keywords. */
/* 03. No input results in no output. */
/* 04. Fool data is isolated as early as feasible without */
/* sacrificing the design. */
/* 05. A modular and maintainable design */
/* j. Tools : gcc (4.4.0) -Wall -Wextra -pedantic -o xref.exe
xref.c */
/* splint (3.1.1) -weak xref.c */
/* indent (2.2.9) -gnu xref.c */
/* notepad++ (5.6.6) xref.c */
/* h. Test cases : 01. No input No output */
/* 02. 1 word 1 line file listing, */
/* 1 line output list */
/* 03. 2 equal words 1 line output list, 2 line numbers */
/* 04. 2 different words 2 line output list */
/* i. Remarks : */
/********************************************************************************/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <stdbool.h>
#define WORD_WIDTH 15
#define NUM_WIDTH 5
#define LINE_BUF_SIZE 65536
#ifndef NDEBUG
#define NDEBUG
#undef NDEBUG
#endif
/* Node structure for each list of line numbers */
struct List
{
int lnum; /* line number of word */
struct List *next;
};
typedef struct List List;
/* Node structure for tree of distinct words */
struct Tree
{
char word[WORD_WIDTH];
int cnt_word; /* number of instances of word */
List *lstptr;
struct Tree *left, *right;
};
typedef struct Tree Tree;
/* Tree and list functions */
static Tree *addword (Tree *, const char *);
static List *addline (List *, int);
static Tree *find_node (Tree *, const char *);
static void print_header (void);
static void print_tree (Tree *);
static void print_list (List *);
static Tree *IsKeyword (Tree *, const char *);
static int ndistinct = 0;
int
main (int argc, char *argv[])
{
char linebuf[LINE_BUF_SIZE];
Tree *words = NULL;
Tree *keywords = NULL;
int lineno = 0;
FILE *fp;
char lang_excl_keywords[BUFSIZ] = "";
char *s;
/* command line processing */
while (--argc > 0 && (*++argv)[0] == '-')
{
for (s = argv[0] + 1; *s != '\0'; s++)
{
switch (*s)
{
case 'L':
--argc;
++argv;
strcpy (lang_excl_keywords, argv[0]);
break;
default:
fprintf (stderr, "xref: -I- version xref-0.99\n");
fprintf (stderr,
"xref: -I- create a document cross reference; optional:
exclude words\n");
fprintf (stderr,
"xref: -I- default: exclude common en-US words\n");
fprintf (stderr,
"xref: usage: xref -L <excluded words file> <stdin>
<stdout>\n");
fprintf (stderr, "xref: valid option is '-L'\n");
fprintf (stderr,
" L : letter code for the excluded keywords [C, de,
en, nl]\n");
fprintf (stderr, "\n");
fprintf (stderr,
"xref: mail requests and bugs to hans.lodder at
requirements-management.nl\n");
break;
}
}
}
/* Do optional input redirection */
if (argc == 1)
{
assert (strlen (argv[0]) != 0);
if (freopen (argv[0], "r", stdin) == NULL)
{
fprintf (stderr, "cannot open file '%s'", argv[0]);
exit (EXIT_FAILURE);
}
}
/* read in the keywords, excepted from the cross reference */
if (strlen (lang_excl_keywords) == 0)
strcat (lang_excl_keywords, "en.xrf");
else
strcat (lang_excl_keywords, ".xrf");
if ((fp = fopen (lang_excl_keywords, "r")) == NULL)
{
fprintf (stderr, "cannot open file '%s'", lang_excl_keywords);
exit (EXIT_FAILURE);
}
/* Process each line of text file */
while (fgets (linebuf, LINE_BUF_SIZE, fp) != NULL)
{
char *wordptr, *lineptr = linebuf;
char bad_chars[] =
" \t\n\f\v\\\"~!@#$%^&*()+'" "|`1234567890-={}[]:;<>?,./";
/* Process each word in line */
while ((wordptr = strtok (lineptr, bad_chars)) != NULL)
{
if (strlen (wordptr) > WORD_WIDTH)
wordptr[WORD_WIDTH - 1] = '\0';
/* assert (keywords != NULL); */
keywords = addword (keywords, wordptr);
lineptr = NULL;
}
}
/* terminate keyword processing */
if (fp != NULL)
(void) fclose (fp);
/* initialize the processing of words */
ndistinct = 0; /* re-initialize counter */
/* Process each line of text file */
while (fgets (linebuf, LINE_BUF_SIZE, stdin) != NULL)
{
char *wordptr, *lineptr = linebuf;
char bad_chars[] =
" \t\n\f\v\\\"~!@#$%^&*()+'" "|`1234567890-={}[]:;<>?,./";
assert (lineptr != NULL);
++lineno;
/* Process each word in line */
while ((wordptr = strtok (lineptr, bad_chars)) != NULL)
{
Tree *node;
assert (wordptr != NULL);
if (IsKeyword (keywords, wordptr) == NULL)
{
#ifdef NDEBUG
fprintf (stderr, "File: '%s', line '%d': wordptr= '%s'\n",
__FILE__, __LINE__, wordptr);
#endif
if (strlen (wordptr) > WORD_WIDTH)
wordptr[WORD_WIDTH - 1] = '\0';
assert (wordptr != NULL);
words = addword (words, wordptr);
node = find_node (words, wordptr);
(node->cnt_word)++; /* increment the number of instances of word */
node->lstptr = addline (node->lstptr, lineno);
}
lineptr = NULL;
}
}
/* Print results */
if (ndistinct > 0)
printf ("No. of distinct words: %d.\n\n", ndistinct);
if (words != NULL)
print_header ();
print_tree (words);
/* gcc guarantees that all space is returned to the OS, so it is
unnecessary to free allocated memory and file pointers. */
return 0;
}
Tree *
addword (Tree * tp, const char *word)
{
if (tp == NULL)
{
/* Add new entry */
assert (tp == NULL);
++ndistinct;
if ((tp = malloc (sizeof (Tree))) == NULL)
{
fprintf (stderr, "xref: -F- Out of memory");
exit (EXIT_FAILURE);
}
assert (tp != NULL);
strncpy (tp->word, word, WORD_WIDTH);
tp->cnt_word = 0;
tp->left = tp->right = NULL;
tp->lstptr = NULL;
}
else if (strcmp (tp->word, word) < 0)
tp->right = addword (tp->right, word);
else if (strcmp (tp->word, word) > 0)
tp->left = addword (tp->left, word);
return tp;
}
List *
addline (List * lp, int n)
{
if (lp == NULL)
{
/* Append new line number to list */
List *newp = malloc (sizeof (List));
if (newp == NULL)
{
fprintf (stderr, "xref: -F- Out of memory");
exit (EXIT_FAILURE);
}
assert (newp != NULL);
newp->lnum = n;
newp->next = NULL;
return newp;
}
else if (lp->lnum != n)
lp->next = addline (lp->next, n);
return lp;
}
void
print_header (void) /* print column headers */
{
const char *txt_keyword = "Keyword";
const char *txt_count = "Count";
const char *txt_perc = "( % )";
const char *txt_lineno = "Line number(s)";
const char *txt_hyphen = "-----------";
printf ("%-*.*s %-*s %-*s %-s\n", WORD_WIDTH, WORD_WIDTH, txt_keyword,
NUM_WIDTH, txt_count, NUM_WIDTH + NUM_WIDTH - 1, txt_perc,
txt_lineno);
printf ("%-*.*s %-*.5s %-*.8s %-s\n", WORD_WIDTH, WORD_WIDTH - 8,
txt_hyphen, NUM_WIDTH, txt_hyphen, NUM_WIDTH + NUM_WIDTH - 1,
txt_hyphen, txt_hyphen);
}
void
print_tree (Tree * tp)
{
if (tp != NULL)
{
/* In order traversal */
print_tree (tp->left);
printf ("%-*.*s: %*d (%5.1f%%)", WORD_WIDTH, WORD_WIDTH, tp->word,
NUM_WIDTH, tp->cnt_word,
(float) (tp->cnt_word * 100.0 / ndistinct));
print_list (tp->lstptr);
(void) putchar ('\n');
print_tree (tp->right);
}
}
void
print_list (List * lp)
{
const int INDENT = WORD_WIDTH + WORD_WIDTH + 1;
const int NUMS_PER_LINE = 8;
int count;
for (count = 0; lp != NULL; lp = lp->next, ++count)
{
printf ("%*d", NUM_WIDTH, lp->lnum);
if ((count + 1) % NUMS_PER_LINE == 0 && lp->next != NULL)
printf ("\n%*c", INDENT, ' ');
}
}
Tree *
find_node (Tree * tp, const char *s)
{
/* Binary search for word */
if (strcmp (tp->word, s) > 0)
return find_node (tp->left, s);
else if (strcmp (tp->word, s) < 0)
return find_node (tp->right, s);
else
return tp;
}
/* new function */
Tree *
IsKeyword (Tree * kwp, const char *s) /* keywords not to be counted */
{
int comp, search = 1;
while ((kwp != NULL) && (bool) search)
{
if ((comp = strcmp (kwp->word, s)) == 0)
search = 0;
else if (comp > 0)
kwp = kwp->left;
else if (comp < 0)
kwp = kwp->right;
else
assert (0);
}
if (!(bool) search) /* found a keyword */
return kwp;
else
return NULL;
}
/* End of file xref.c */
Thanks Ben! I will include the complete code. Here its is:
/********************************************************************************/
/* a. File : xref.c */
/* b. Version : v0-99 */
/* c. Author : Hans Lodder (2007-09-28) */
/* d. Modified : Hans Lodder (2010-01-16) */
/* e. Description : Cross references an input file */
/* Can exclude keywords through L switch. */
/* f. Usage : xref [-h] {[-Len] | [-Ldu] | [-Lde] | [ -LC]} <file> */
/* g. Default : Exclude English keywords -Len */
/* h. Algorithm : Uses a binary tree and linked lists */
/* i. Business requirements: */
/* 01. Produces a cross reference of any UTF-8 input file. */
/* 02. By default xref neglects frequent English keywords. */
/* 03. No input results in no output. */
/* 04. Fool data is isolated as early as feasible without */
/* sacrificing the design. */
/* 05. A modular and maintainable design */
/* j. Tools : gcc (4.4.0) -Wall -Wextra -pedantic -o xref.exe
xref.c */
/* splint (3.1.1) -weak xref.c */
/* indent (2.2.9) -gnu xref.c */
/* notepad++ (5.6.6) xref.c */
/* h. Test cases : 01. No input No output */
/* 02. 1 word 1 line file listing, */
/* 1 line output list */
/* 03. 2 equal words 1 line output list, 2 line numbers */
/* 04. 2 different words 2 line output list */
/* i. Remarks : */
snip
>Thanks Ben! I will include the complete code. Here its is:
snip 370+ lines
Changing your name and reposting almost 400 lines after five days
doesn't win many friends.
--
Remove del for email
What?
--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"