Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Cross reference tool

10 views
Skip to first unread message

News groups

unread,
Feb 10, 2010, 9:53:11 AM2/10/10
to
Hi!

For the development of web pages it is important to know the word
frequency of a page. Together with the GNU diction program you can check
and maintain easily the readability of a page.

So i need a cross referencer. I could not find a simple program for
win32. Somewhere around 1982 I developed a PASCAL cross reference
program, which is some time ago ;). Never done anything in C though.

I wanted to do this in C, mainly because of this newsgroup, which I find
to have a lot experts, providing usually good comments.

So I Googled and borrowed code from a lot of places, putting it together
in one file. The code is here (360 lines):

http://requirements-management.nl/files/download/xref-sources-0.99.zip

The header looks like this.

/****************************************************************/
/* a. File : xref.c */
/* b. Version : v0-99 */
/* c. Author : Hans Lodder (2007-09-28) */
/* d. Modified : Hans Lodder (2010-01-16) */
/* e. Description : Cross references an input file */
/* Can exclude keywords through L switch. */
/* f. Usage : xref [-h] {[-Len] | [-Ldu] | [-Lde] | [ -LC]} <file>*/
/* g. Default : Exclude English keywords -Len */
/* h. Algorithm : Uses a binary tree and linked lists */
/* i. Business requirements: */
/* 01. Produces a cross reference of any UTF-8 */
/* input file. */
/* 02. By default xref neglects frequent */
/* English keywords. */
/* 03. No input results in no output. */
/* 04. Fool data is isolated as early as */
/* feasible without */
/* sacrificing the design. */
/* 05. A modular and maintainable design */
/* j. Tools : gcc (4.4.0) -Wall -Wextra -pedantic -o xref.exe xref.c*/
/* splint (3.1.1) -weak xref.c */
/* indent (2.2.9) -gnu xref.c */
/* notepad++ (5.6.6) xref.c */
/* h. Test cases : 01. No input No output */
/* 02. 1 word 1 line file listing, */
/* 1 line output list */
/* 03. 2 equal words 1 line output list, */
/* 2 line numbers */
/* 04. 2 different words 2 line output list */
/* i. Remarks : */
/*****************************************************************/

If appropriate I am more than willing to add it to this message. Please
let me know.

All comments are welcomed. I would appreciate it if they not only
signaled issues, but also possible solutions.

Kind regards, and thanks in advance,

Hans Lodder

Ben Bacarisse

unread,
Feb 10, 2010, 11:46:04 AM2/10/10
to
News groups <j.hans....@planet.nl> writes:
<snip>

> So i need a cross referencer. I could not find a simple program for
> win32. Somewhere around 1982 I developed a PASCAL cross reference
> program, which is some time ago ;). Never done anything in C though.
>
> I wanted to do this in C, mainly because of this newsgroup, which I
> find to have a lot experts, providing usually good comments.
>
> So I Googled and borrowed code from a lot of places, putting it
> together in one file. The code is here (360 lines):
>
> http://requirements-management.nl/files/download/xref-sources-0.99.zip

It's hard to be motivated to do a review of code not posted here so
you may get only a few replies. I'll made two suggestions after
commenting that the code looks good overall.

(1) An unbalanced tree is a good first approximation, but your list of
common words (I'd not call them keywords -- too confusing) is sorted
so you get, in essence, a linked list search not a tree-based one!

(2) If you include \r in the excluded characters, your files will work
on non Windows systems. Alternatively, find a way to distribute a
plain text version of the common word list so that it has native
line endings wherever it is used.

<snip>
--
Ben.

News groups

unread,
Feb 11, 2010, 2:02:36 AM2/11/10
to
Ben Bacarisse schreef:

Thanks Ben! The complete code of xref.c

/********************************************************************************/


/* a. File : xref.c */
/* b. Version : v0-99 */
/* c. Author : Hans Lodder (2007-09-28) */
/* d. Modified : Hans Lodder (2010-01-16) */
/* e. Description : Cross references an input file */
/* Can exclude keywords through L switch. */
/* f. Usage : xref [-h] {[-Len] | [-Ldu] | [-Lde] | [ -LC]} <file> */
/* g. Default : Exclude English keywords -Len */
/* h. Algorithm : Uses a binary tree and linked lists */
/* i. Business requirements: */

/* 01. Produces a cross reference of any UTF-8 input file. */
/* 02. By default xref neglects frequent English keywords. */


/* 03. No input results in no output. */

/* 04. Fool data is isolated as early as feasible without */


/* sacrificing the design. */
/* 05. A modular and maintainable design */
/* j. Tools : gcc (4.4.0) -Wall -Wextra -pedantic -o xref.exe
xref.c */
/* splint (3.1.1) -weak xref.c */
/* indent (2.2.9) -gnu xref.c */
/* notepad++ (5.6.6) xref.c */
/* h. Test cases : 01. No input No output */
/* 02. 1 word 1 line file listing, */
/* 1 line output list */

/* 03. 2 equal words 1 line output list, 2 line numbers */


/* 04. 2 different words 2 line output list */
/* i. Remarks : */

/********************************************************************************/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <stdbool.h>

#define WORD_WIDTH 15
#define NUM_WIDTH 5

#define LINE_BUF_SIZE 65536

#ifndef NDEBUG
#define NDEBUG
#undef NDEBUG
#endif

/* Node structure for each list of line numbers */
struct List
{
int lnum; /* line number of word */
struct List *next;
};
typedef struct List List;

/* Node structure for tree of distinct words */
struct Tree
{
char word[WORD_WIDTH];
int cnt_word; /* number of instances of word */
List *lstptr;
struct Tree *left, *right;
};
typedef struct Tree Tree;

/* Tree and list functions */
static Tree *addword (Tree *, const char *);
static List *addline (List *, int);
static Tree *find_node (Tree *, const char *);
static void print_header (void);
static void print_tree (Tree *);
static void print_list (List *);
static Tree *IsKeyword (Tree *, const char *);

static int ndistinct = 0;

int
main (int argc, char *argv[])
{
char linebuf[LINE_BUF_SIZE];
Tree *words = NULL;
Tree *keywords = NULL;
int lineno = 0;
FILE *fp;
char lang_excl_keywords[BUFSIZ] = "";
char *s;

/* command line processing */

while (--argc > 0 && (*++argv)[0] == '-')
{
for (s = argv[0] + 1; *s != '\0'; s++)
{
switch (*s)
{
case 'L':
--argc;
++argv;
strcpy (lang_excl_keywords, argv[0]);
break;
default:
fprintf (stderr, "xref: -I- version xref-0.99\n");
fprintf (stderr,
"xref: -I- create a document cross reference; optional:
exclude words\n");
fprintf (stderr,
"xref: -I- default: exclude common en-US words\n");
fprintf (stderr,
"xref: usage: xref -L <excluded words file> <stdin>
<stdout>\n");
fprintf (stderr, "xref: valid option is '-L'\n");
fprintf (stderr,
" L : letter code for the excluded keywords [C, de,
en, nl]\n");
fprintf (stderr, "\n");
fprintf (stderr,
"xref: mail requests and bugs to hans.lodder at
requirements-management.nl\n");
break;
}
}
}

/* Do optional input redirection */

if (argc == 1)
{
assert (strlen (argv[0]) != 0);
if (freopen (argv[0], "r", stdin) == NULL)
{
fprintf (stderr, "cannot open file '%s'", argv[0]);
exit (EXIT_FAILURE);
}
}

/* read in the keywords, excepted from the cross reference */
if (strlen (lang_excl_keywords) == 0)
strcat (lang_excl_keywords, "en.xrf");
else
strcat (lang_excl_keywords, ".xrf");

if ((fp = fopen (lang_excl_keywords, "r")) == NULL)
{
fprintf (stderr, "cannot open file '%s'", lang_excl_keywords);
exit (EXIT_FAILURE);
}

/* Process each line of text file */
while (fgets (linebuf, LINE_BUF_SIZE, fp) != NULL)
{
char *wordptr, *lineptr = linebuf;
char bad_chars[] =
" \t\n\f\v\\\"~!@#$%^&*()+'" "|`1234567890-={}[]:;<>?,./";

/* Process each word in line */
while ((wordptr = strtok (lineptr, bad_chars)) != NULL)
{
if (strlen (wordptr) > WORD_WIDTH)
wordptr[WORD_WIDTH - 1] = '\0';

/* assert (keywords != NULL); */
keywords = addword (keywords, wordptr);

lineptr = NULL;
}
}

/* terminate keyword processing */
if (fp != NULL)
(void) fclose (fp);

/* initialize the processing of words */

ndistinct = 0; /* re-initialize counter */
/* Process each line of text file */
while (fgets (linebuf, LINE_BUF_SIZE, stdin) != NULL)
{
char *wordptr, *lineptr = linebuf;
char bad_chars[] =
" \t\n\f\v\\\"~!@#$%^&*()+'" "|`1234567890-={}[]:;<>?,./";

assert (lineptr != NULL);
++lineno;

/* Process each word in line */
while ((wordptr = strtok (lineptr, bad_chars)) != NULL)
{
Tree *node;

assert (wordptr != NULL);
if (IsKeyword (keywords, wordptr) == NULL)
{
#ifdef NDEBUG
fprintf (stderr, "File: '%s', line '%d': wordptr= '%s'\n",
__FILE__, __LINE__, wordptr);
#endif
if (strlen (wordptr) > WORD_WIDTH)
wordptr[WORD_WIDTH - 1] = '\0';

assert (wordptr != NULL);
words = addword (words, wordptr);
node = find_node (words, wordptr);
(node->cnt_word)++; /* increment the number of instances of word */

node->lstptr = addline (node->lstptr, lineno);
}

lineptr = NULL;
}
}

/* Print results */

if (ndistinct > 0)
printf ("No. of distinct words: %d.\n\n", ndistinct);

if (words != NULL)
print_header ();
print_tree (words);

/* gcc guarantees that all space is returned to the OS, so it is
unnecessary to free allocated memory and file pointers. */

return 0;
}

Tree *
addword (Tree * tp, const char *word)
{
if (tp == NULL)
{
/* Add new entry */
assert (tp == NULL);

++ndistinct;
if ((tp = malloc (sizeof (Tree))) == NULL)
{
fprintf (stderr, "xref: -F- Out of memory");
exit (EXIT_FAILURE);
}
assert (tp != NULL);
strncpy (tp->word, word, WORD_WIDTH);
tp->cnt_word = 0;

tp->left = tp->right = NULL;
tp->lstptr = NULL;
}
else if (strcmp (tp->word, word) < 0)
tp->right = addword (tp->right, word);
else if (strcmp (tp->word, word) > 0)
tp->left = addword (tp->left, word);

return tp;
}

List *
addline (List * lp, int n)
{
if (lp == NULL)
{
/* Append new line number to list */
List *newp = malloc (sizeof (List));

if (newp == NULL)
{
fprintf (stderr, "xref: -F- Out of memory");
exit (EXIT_FAILURE);
}
assert (newp != NULL);
newp->lnum = n;
newp->next = NULL;
return newp;
}
else if (lp->lnum != n)
lp->next = addline (lp->next, n);

return lp;
}

void
print_header (void) /* print column headers */
{
const char *txt_keyword = "Keyword";
const char *txt_count = "Count";
const char *txt_perc = "( % )";
const char *txt_lineno = "Line number(s)";
const char *txt_hyphen = "-----------";

printf ("%-*.*s %-*s %-*s %-s\n", WORD_WIDTH, WORD_WIDTH, txt_keyword,
NUM_WIDTH, txt_count, NUM_WIDTH + NUM_WIDTH - 1, txt_perc,
txt_lineno);
printf ("%-*.*s %-*.5s %-*.8s %-s\n", WORD_WIDTH, WORD_WIDTH - 8,
txt_hyphen, NUM_WIDTH, txt_hyphen, NUM_WIDTH + NUM_WIDTH - 1,
txt_hyphen, txt_hyphen);
}

void
print_tree (Tree * tp)
{
if (tp != NULL)
{
/* In order traversal */
print_tree (tp->left);
printf ("%-*.*s: %*d (%5.1f%%)", WORD_WIDTH, WORD_WIDTH, tp->word,
NUM_WIDTH, tp->cnt_word,
(float) (tp->cnt_word * 100.0 / ndistinct));
print_list (tp->lstptr);
(void) putchar ('\n');
print_tree (tp->right);
}
}

void
print_list (List * lp)
{
const int INDENT = WORD_WIDTH + WORD_WIDTH + 1;
const int NUMS_PER_LINE = 8;
int count;

for (count = 0; lp != NULL; lp = lp->next, ++count)
{
printf ("%*d", NUM_WIDTH, lp->lnum);
if ((count + 1) % NUMS_PER_LINE == 0 && lp->next != NULL)
printf ("\n%*c", INDENT, ' ');
}
}

Tree *
find_node (Tree * tp, const char *s)
{
/* Binary search for word */
if (strcmp (tp->word, s) > 0)
return find_node (tp->left, s);
else if (strcmp (tp->word, s) < 0)
return find_node (tp->right, s);
else
return tp;
}

/* new function */

Tree *
IsKeyword (Tree * kwp, const char *s) /* keywords not to be counted */
{
int comp, search = 1;

while ((kwp != NULL) && (bool) search)
{
if ((comp = strcmp (kwp->word, s)) == 0)
search = 0;
else if (comp > 0)
kwp = kwp->left;
else if (comp < 0)
kwp = kwp->right;
else
assert (0);
}

if (!(bool) search) /* found a keyword */
return kwp;
else
return NULL;

}

/* End of file xref.c */

Hans Lodder

unread,
Feb 16, 2010, 2:36:13 AM2/16/10
to
Ben Bacarisse schreef:

Thanks Ben! I will include the complete code. Here its is:

/********************************************************************************/


/* a. File : xref.c */
/* b. Version : v0-99 */
/* c. Author : Hans Lodder (2007-09-28) */
/* d. Modified : Hans Lodder (2010-01-16) */
/* e. Description : Cross references an input file */
/* Can exclude keywords through L switch. */
/* f. Usage : xref [-h] {[-Len] | [-Ldu] | [-Lde] | [ -LC]} <file> */
/* g. Default : Exclude English keywords -Len */
/* h. Algorithm : Uses a binary tree and linked lists */
/* i. Business requirements: */

/* 01. Produces a cross reference of any UTF-8 input file. */
/* 02. By default xref neglects frequent English keywords. */


/* 03. No input results in no output. */

/* 04. Fool data is isolated as early as feasible without */


/* sacrificing the design. */
/* 05. A modular and maintainable design */
/* j. Tools : gcc (4.4.0) -Wall -Wextra -pedantic -o xref.exe
xref.c */
/* splint (3.1.1) -weak xref.c */
/* indent (2.2.9) -gnu xref.c */
/* notepad++ (5.6.6) xref.c */
/* h. Test cases : 01. No input No output */
/* 02. 1 word 1 line file listing, */
/* 1 line output list */

/* 03. 2 equal words 1 line output list, 2 line numbers */


/* 04. 2 different words 2 line output list */
/* i. Remarks : */

Barry Schwarz

unread,
Feb 16, 2010, 9:00:53 PM2/16/10
to
On Tue, 16 Feb 2010 08:36:13 +0100, Hans Lodder
<j.hans....@planet.nl> wrote:

snip

>Thanks Ben! I will include the complete code. Here its is:

snip 370+ lines

Changing your name and reposting almost 400 lines after five days
doesn't win many friends.

--
Remove del for email

Keith Thompson

unread,
Feb 17, 2010, 1:46:27 AM2/17/10
to
Hans Lodder <j.hans....@planet.nl> writes:
[...]

> #ifndef NDEBUG
> #define NDEBUG
> #undef NDEBUG
> #endif
[...]

What?

--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

0 new messages