Actually, I seem to recall saying that the primary reason is that
repeated words are a grammatical/semantic problem, which is not
ispell's function. The non-English issue is secondary.
In any case, it's trivial to check a document for repeated words. For
example:
#!/bin/sh
cat "$@" | detex -w | uniq -d | egrep -v '^$'
This will provide a list of all words that appear repeated in a document.
Assuming that they are relatively common words, such as "the", you can
then use emacs to search for them by either using a regular
expression, or just looking for "the the" and "the^Jthe". (I suspect
that you could also construct an emacs regular expression that would
detect duplicated words, but the above pipe popped into my head first.)
With only a tiny bit more work, the pipe could be modified to provide
information such as the character offset of the offending duplication,
at the expense of not using detex. For example:
#!/bin/sh
cat "$@" | tr -c "A-Za-z';" \\012 | \
awk '{offset++;if($0==last0&&$0!="")print offset, $0
last0=$0;offset+=length($0)}'
The disadvantage of this version is that it will report duplicated TeX
constructs, and miss duplications that are separated by TeX commands.
I'd recommend starting with this script, and then running the first
version before submission, just to make sure there aren't any
weirdnesses.
Both of these scripts have been minimally tested from the command
line.
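If a line number is more convenient than a character offset, the same
idea can be sketched entirely in awk. This is only an illustration under
stated assumptions: the function name dupwords is made up for the
example, words are compared case-insensitively after everything except
the letters a-z is stripped, and TeX is not understood at all (so \the
counts as "the", and "the. The" across a sentence boundary is reported
as a repeat).

```shell
#!/bin/sh
# dupwords (hypothetical name): print "LINE: WORD" for every word that
# immediately repeats the previous word, including repeats that straddle
# a line break.  Reads the named files, or standard input if none.
dupwords() {
    awk '{
        for (i = 1; i <= NF; i++) {
            w = tolower($i)
            gsub(/[^a-z]/, "", w)   # assumption: only a-z make up a word
            if (w != "" && w == last)
                print NR ": " w     # NR is the line holding the second copy
            last = w                # an empty token resets the comparison
        }
    }' "$@"
}
```

After sourcing the file, something like `dupwords chapter1.tex` would
list the suspect lines; the reported line is the one containing the
second copy of the word.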
--
Geoff Kuenning g.kue...@ieee.org ge...@ITcorp.com
Claudio
--
----------------------------------------------------------------------
Claudio Fleiner fle...@icsi.berkeley.edu
International Computer Science Institute, 1947 Center Street, Berkeley
Tel: (510) 643-9153, Fax: (510) 643-7684
This would be a useful feature to have. It should be optional of course.
Be aware that repeated words *do* happen in English as well. ...And I don't
just mean in that contorted sentence with five `and's in a row.
Donald Arseneau as...@reg.triumf.ca
> der der in German, I suppose). Well, I know that Anglo-centrism is very
> un-PC, and in the TeX world the hypersensitivity of Germans is a concern,
Being German, you triggered my hypersensitivity...
Please use another reason; `der der' and other repetitions of words
(that might be grammatically correct) are considered bad style in
German texts anyhow. Any German author who has retained a bit of love
for his or her language will shy away from such constructs.
I would think the problem is more that this is outside the realm of a
_spell_ checker and belongs to a _style_ checker. But I have yet to
see a good style checker. (Well, I mean automatic ones. The
biological ones, e.g., barbara beeton or Rosemary Baily, are great. :-)
Joachim
--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Joachim Schrod Email: sch...@iti.informatik.th-darmstadt.de
Computer Science Department
Technical University of Darmstadt, Germany
So if anyone has put together an unofficial hack as a patch for ispell to do
this, I would greatly appreciate it if you could email it to me before I submit
the final copy of my dissertation later this week. My readers have informed
me that the problem is quite prevalent.
Thanks in advance.
_____________________________________________________________________________
'There was a master come unto the earth, | Ulick Stafford,
born in the holy land of Indiana, | Dept of Chemical Engineering,
in the mystical hills east of Fort Wayne'.| Notre Dame, IN 46556
http://ulix.rad.nd.edu/Ulick.html | Ulick.S...@nd.edu
Here's mine.
#!/usr/bin/perl
undef $/;                               # slurp each file in one read
while ( $ARGV = shift ) {
    if (!open(ARGV, '<', $ARGV)) { warn "$ARGV: $!\n"; next; }
    $_ = <ARGV>;
    # bracket every run of repeated words with \200 markers
    s/\b(\s?)(([A-Za-z]\w*)(\s+\3)+\b)/$1\200$2\200/g || next;
    @lines = split(/\n/);
    $NR = 0;
    @hits = ();
    for (@lines) {
        $NR++;
        push(@hits, sprintf("%5d %s", $NR, $_)) if /\200/;
    }
    $_ = join("\n",@hits);
    s/\200([^\200]+)\200/[* $1 *]/g;    # turn the markers into [* ... *]
    print "$ARGV:\n$_\n";
}
Using the emacs regular expression syntax, you can write expressions
that are not regular expressions in the sense of automata theory. In
particular, one part of an emacs regular expression can reference a
string matched by an earlier part of the expression. The following
function approximates what is wanted.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;
;;; Function to find duplicated words in text
;;;
(defun find-double-words ()
  "Interactively replace each doubled word in the buffer with a single copy."
  (interactive)
  (save-excursion
    (goto-char (point-min))
    (query-replace-regexp
     "\\b\\(\\w+\\)\\b[ \t\n]+\\b\\1\\b"
     "\\1")))
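For anyone outside Emacs, the same back-reference idea can be sketched
with grep. This is an illustration only, and it rests on assumptions:
the function name finddups is made up, and the grep in use must support
-w (GNU and BSD grep do, but it is not required by POSIX). Because grep
works line by line, a repeat that straddles a line break is not found,
and the match is case-sensitive, so "The the" is missed.

```shell
#!/bin/sh
# finddups (hypothetical name): print "LINE:text" for each line holding
# a word that is immediately followed by itself.  \1 refers back to
# whatever the first \( ... \) group matched; -w makes grep accept a
# match only if it begins and ends on word boundaries.
finddups() {
    grep -n -w '\([A-Za-z][A-Za-z]*\)[[:space:]][[:space:]]*\1' "$@"
}
```

Sourced into a shell, `finddups thesis.tex` would print each offending
line with its line number. The -w option is what keeps "the then" and
"bathe the" from being reported.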
Mark R. Tuttle tut...@crl.dec.com
You're wrong and you're right.
"\W\(\w+\)\W\1" is an emacs regular expression that will find
duplicated words [1]. Admittedly, it is an extension to ordinary
regular expressions, but Geoff did say "emacs regular expression".
Brendan
[1] Not that it's any good to Ulick, as he doesn't use Emacs.
Brendan Halpin |Email: HAL...@VAX.OX.AC.UK
Dept of Applied Social Studies |PGP: Finger hal...@vax.ox.ac.uk
Oxford University, Wellington Sq.,| or hal...@gramsci.apsoc.ox.ac.uk
Oxford OX1 2ER, UK |Phone: +44 865 270347 (work) / 726758 (home)
Here's a perl program called stutter I whipped together some time ago to
do this:
#!/usr/local/bin/perl
#
# Find two consecutive words repeated in document, e.g. and and
# Also works if stutter on adjacent lines
#
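$lastword = "";
while (<>)
{
    chomp;                  # strip the newline once, before the line is reported
    @words = split;
    foreach $word (@words)
    {
        if ($word eq $lastword)
        {
            print "$ARGV $.: $word [$_]\n";
        }
        $lastword = $word;
    }
    if (eof)
    {
        close(ARGV);        # reset $. so line numbers restart for each file
    }
}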
Buy a copy of microspell from
Trigram Systems
5840 Northumberland St.
Pittsburgh, PA 15217
412-422-8976
for under $100 and load it on a PC. It has a TeX-specific mode
(ask for that one when you order) and will get the job done.
Dave Rogers