ispell check for repeated words

Geoff Kuenning

Oct 17, 1994, 3:28:45 PM

> A while back I asked about the possibility of ispell checking for
> for repeated words, a bad habit of mine that MSWord takes care of. However,
> I was informed that such a feature would not be included because
> because repeated words occur occasionally in languages other than English

Actually, I seem to recall saying that the primary reason is that
repeated words are a grammatical/semantic problem, which is not
ispell's function. The non-English issue is secondary.

In any case, it's trivial to check a document for repeated words. For
example:

#!/bin/sh
cat "$@" | detex -w | uniq -d | egrep -v '^$'

This will provide a list of all words that appear repeated in a document.
Assuming that they are relatively common words, such as "the", you can
then use emacs to search for them by either using a regular
expression, or just looking for "the the" and "the^Jthe". (I suspect
that you could also construct an emacs regular expression that would
detect duplicated words, but the above pipe popped into my head first.)
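
For anyone without detex to hand, the same idea (split the text into words, then report any word that immediately follows itself) can be sketched in a few lines of Python. This is only an illustration, not part of ispell or detex: the name `repeated_words` is made up for the example, the word pattern is an assumption, and no TeX stripping is attempted, so control sequences will show up as words.

```python
import re
from itertools import groupby


def repeated_words(text):
    """Return each word that occurs two or more times in a row, once per run.

    Hypothetical helper: words are runs of letters and apostrophes, and the
    comparison is case-sensitive, so "The the" is not reported.
    """
    words = re.findall(r"[A-Za-z']+", text)
    # groupby() collects adjacent equal words; keep the runs longer than one.
    return [word for word, run in groupby(words) if len(list(run)) > 1]


print(repeated_words("checking for\nfor repeated words"))  # prints ['for']
```

Because the words are compared as one flat sequence, a repetition split across a line break is caught as well.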

With only a tiny bit more work, the pipe could be modified to provide
information such as the character offset of the offending duplication,
at the expense of not using detex. For example:

#!/bin/sh
cat "$@" | tr -c "A-Za-z';" \\012 | \
awk '{offset++;if($0==last0&&$0!="")print offset, $0
last0=$0;offset+=length($0)}'

The disadvantage of this version is that it will report duplicated TeX
constructs, and miss duplications that are separated by TeX commands.
I'd recommend starting with this script, and then running the first
version before submission, just to make sure there aren't any
weirdnesses.
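
The offset-reporting variant can be sketched the same way in Python. Again this is a hedged illustration rather than a drop-in replacement: `repeated_word_offsets` is a hypothetical name, the offsets are 0-based character indexes of the second copy (a convention chosen for the example), and the code knows nothing about TeX, so it shares the disadvantages described above.

```python
import re


def repeated_word_offsets(text):
    """Yield (offset, word) for each word identical to the word before it.

    offset is the 0-based index at which the repeated (second) word starts.
    Hypothetical helper; words are runs of letters and apostrophes.
    """
    previous = None
    for match in re.finditer(r"[A-Za-z']+", text):
        word = match.group()
        if word == previous:
            yield match.start(), word
        previous = word


for offset, word in repeated_word_offsets("such a feature because\nbecause it helps"):
    print(offset, word)
```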

Both of these scripts have been minimally tested from the command
line.
--
Geoff Kuenning g.kue...@ieee.org ge...@ITcorp.com

Claudio Fleiner

Oct 17, 1994, 4:49:18 PM

In article <37uj9e$k...@delphi.cs.ucla.edu>,
Geoff Kuenning <ge...@ficus.cs.ucla.edu> wrote:
>(I suspect
>that you could also construct an emacs regular expression that would
>detect duplicated words, but the above pipe popped into my head first.)
>

Nope, it is not possible to detect duplicate words with
regular expressions. You could extend them to allow
the detection of duplicate words, but then they are
no longer regular expressions.

Claudio

--
----------------------------------------------------------------------
Claudio Fleiner fle...@icsi.berkeley.edu
International Computer Science Institute, 1947 Center Street, Berkeley
Tel: (510) 643-9153, Fax: (510) 643-7684

Donald Arseneau

Oct 17, 1994, 8:58:00 PM

In article <37u40n$6...@news.nd.edu>, ul...@ulix.rad.nd.edu (Ulick Stafford) writes...

>A while back I asked about the possibility of ispell checking for
>for repeated words, ... [but this] feature would not be included because
>because [sic:-)] repeated words occur occasionally in languages other
>than English

This would be a useful feature to have. It should be optional of course.
Be aware that repeated words *do* happen in English as well. ...And I don't
just mean in that contorted sentence with five `and's in a row.

Donald Arseneau as...@reg.triumf.ca

Joachim Schrod

Oct 18, 1994, 8:02:00 AM

In article <37u40n$6...@news.nd.edu>, ul...@ulix.rad.nd.edu (Ulick Stafford) writes:

> der der in German, I suppose). Well, I know that Anglo-centrism is very
> un-PC, and in the TeX world the hypersensitivity of Germans is a concern,

Being German, you triggered my hypersensitivity...

Please use another reason; `der der' or other repetitions of words
(that might be grammatically correct) are considered bad style in
German texts anyhow. Any German author who has retained a bit of love
for his or her language will shy away from such constructs.

I would think the problem is more that this is outside the realm of a
_spell_ checker and belongs to a _style_ checker. But I have yet to
see a good style checker. (Well, I mean automatic ones. The
biological ones, e.g., barbara beeton or Rosemary Baily, are great. :-)

Joachim

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Joachim Schrod Email: sch...@iti.informatik.th-darmstadt.de
Computer Science Department
Technical University of Darmstadt, Germany

Ulick Stafford

Oct 17, 1994, 11:08:07 AM

A while back I asked about the possibility of ispell checking for
for repeated words, a bad habit of mine that MSWord takes care of. However,
I was informed that such a feature would not be included because
because repeated words occur occasionally in languages other than English (well
der der in German, I suppose). Well, I know that Anglo-centrism is very
un-PC, and in the TeX world the hypersensitivity of Germans is a concern,
but nonetheless, such a feature would greatly increase the usefulness
of the extremely useful ispell program.

So if anyone has put together an unofficial hack as a patch for ispell to do
this, I would greatly appreciate it if you could email it to me before I submit
the final copy of my dissertation later this week. My readers have informed
me that the problem is quite prevalent.

Thanks in advance.
_____________________________________________________________________________
'There was a master come unto the earth, | Ulick Stafford,
born in the holy land of Indiana, | Dept of Chemical Engineering,
in the mystical hills east of Fort Wayne'.| Notre Dame, IN 46556
http://ulix.rad.nd.edu/Ulick.html | Ulick.S...@nd.edu

Tom Christiansen

Oct 19, 1994, 9:32:12 AM

:-> In comp.lang.perl, k...@syd.dit.csiro.au (Ken Yap) writes:
:In article <37uj9e$k...@delphi.cs.ucla.edu>:
:|> A while back I asked about the possibility of ispell checking for
:|> for repeated words, a bad habit of mine that MSWord takes care of. However,
:|> I was informed that such a feature would not be included because
:|> because repeated words occur occasionally in languages other than English
:
:Here's a perl program called stutter I whipped together some time ago to
:do this:

Here's mine.

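#!/usr/bin/perl
undef $/;                       # slurp mode: read each file as one string
$* = 1;                         # Perl 4 multi-line matching (later Perls use /m instead)
while ( $ARGV = shift ) {
    if (!open ARGV) { warn "$ARGV: $!\n"; next; }
    $_ = <ARGV>;
    # Bracket every run of a repeated word with \200 markers; skip clean files.
    s/\b(\s?)(([A-Za-z]\w*)(\s+\3)+\b)/$1\200$2\200/g || next;
    split(/\n/);                # Perl 4 idiom: the result lands in @_
    $NR = 0;
    @hits = ();
    for (@_) {
        $NR++;
        # Keep only the lines that contain a marker, prefixed by line number.
        push(@hits, sprintf("%5d %s", $NR, $_)) if /\200/;
    }
    $_ = join("\n",@hits);
    s/\200([^\200]+)\200/[* $1 *]/g;    # turn the markers into [* ... *]
    print "$ARGV:\n$_\n";
}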

Mark R. Tuttle

Oct 17, 1994, 7:08:48 PM

One person speculated that an emacs regular expression could detect
duplicate words, and another person replied that regular expressions
cannot detect duplicate words. Both are correct.

Using the emacs regular expression syntax, you can write expressions
that are not regular expressions in the sense of automata theory. In
particular, one part of an emacs regular expression can reference a
string matched by an earlier part of the expression. The following
function approximates what is wanted.

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;
;;; Function to find duplicated words in text
;;;
(defun find-double-words ()
  (interactive)
  (save-excursion
    (goto-char (point-min))
    (query-replace-regexp
     "\\b\\(\\w+\\)\\b[ \t\n]+\\b\\1\\b"
     "\\1")))

Mark R. Tuttle tut...@crl.dec.com

Brendan Halpin

Oct 19, 1994, 3:42:15 AM

fle...@ICSI.Berkeley.EDU (Claudio Fleiner) writes:
> Geoff Kuenning <ge...@ficus.cs.ucla.edu> wrote:
>>(I suspect
>>that you could also construct an emacs regular expression that would
>>detect duplicated words, but the above pipe popped into my head first.)
>>
> Nope, it is not possible to detect duplicate words with
> regular expressions. You could extend them to allow
> the detection of duplicate words, but then they are
> no longer regular expressions.

You're wrong and you're right.

"\W\(\w+\)\W\1" is an emacs regular expression that will find
duplicated words [1]. Admittedly, it is an extension to ordinary
regular expressions, but Geoff did say "emacs regular expression".
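
Back-references of this kind are not peculiar to Emacs. As a rough illustration, here is a pattern in the same spirit written for Python's re module, where groups use unescaped parentheses and `\1` refers back to whatever the first group matched. The names `DOUBLE_WORD`, `find_double_words`, and `collapse_double_words` are invented for the example, and using `\b` and `\s+` instead of `\W` is a choice made here, not something taken from the posts.

```python
import re

# \1 must match exactly the text captured by group 1. That memory of an
# earlier match is the extension beyond automata-theory regular expressions.
DOUBLE_WORD = re.compile(r"\b(\w+)\s+\1\b")


def find_double_words(text):
    """Return each word that is immediately repeated after whitespace."""
    return [match.group(1) for match in DOUBLE_WORD.finditer(text)]


def collapse_double_words(text):
    """Replace each doubled word with a single copy, without prompting."""
    return DOUBLE_WORD.sub(r"\1", text)


print(find_double_words("included because\nbecause repeated words occur"))
print(collapse_double_words("the the cat"))
```

The `\b` anchors keep a pair like "the then" from being reported, and `\s+` lets the repetition span a line break.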

Brendan

[1] Not that it's any good to Ulick, as he doesn't use Emacs.

Brendan Halpin |Email: HAL...@VAX.OX.AC.UK
Dept of Applied Social Studies |PGP: Finger hal...@vax.ox.ac.uk
Oxford University, Wellington Sq.,| or hal...@gramsci.apsoc.ox.ac.uk
Oxford OX1 2ER, UK |Phone: +44 865 270347 (work) / 726758 (home)

Ken Yap

Oct 19, 1994, 1:44:55 AM

In article <37uj9e$k...@delphi.cs.ucla.edu>:

|> A while back I asked about the possibility of ispell checking for
|> for repeated words, a bad habit of mine that MSWord takes care of. However,
|> I was informed that such a feature would not be included because
|> because repeated words occur occasionally in languages other than English

Here's a perl program called stutter I whipped together some time ago to
do this:

#!/usr/local/bin/perl
#
# Find two consecutive words repeated in document, e.g. and and
# Also works if stutter on adjacent lines
#

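$lastword = "";
while (<>)
{
    @words = split;
    foreach $word (@words)
    {
        if ($word eq $lastword)
        {
            # Strip the newline from a copy, so that a second hit on the
            # same line does not lose a character of the displayed text.
            ($line = $_) =~ s/\n$//;
            print "$ARGV $.: $word [$line]\n";
        }
        $lastword = $word;
    }
    if (eof)
    {
        close(ARGV);    # resets $. so line numbers restart for each file
    }
}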

PROF D. Rogers (EAS FAC)

Oct 21, 1994, 2:42:50 PM

In article <37u40n$6...@news.nd.edu> ul...@ulix.rad.nd.edu (Ulick Stafford) writes:
!A while back I asked about the possibility of ispell checking for
!for repeated words, a bad habit of mine that MSWord takes care of. However,
!I was informed that such a feature would not be included because
!because repeated words occur occasionally in languages other than English (well
!der der in German, I suppose). Well, I know that Anglo-centrism is very
!un-PC, and in the TeX world the hypersensitivity of Germans is a concern,
!but nonetheless, such a feature would greatly increase the usefulness
!of the extremely useful ispell program.
!
!So if anyone has put together an unofficial hack as a patch for ispell to do
!this, I would greatly appreciate it if you could email it to me before I submit
!the final copy of my dissertation later this week. My readers have informed
!me that the problem is quite prevalent.
!
!Thanks in advance.
!_____________________________________________________________________________
! 'There was a master come unto the earth, | Ulick Stafford,
! born in the holy land of Indiana, | Dept of Chemical Engineering,
! in the mystical hills east of Fort Wayne'.| Notre Dame, IN 46556
! http://ulix.rad.nd.edu/Ulick.html | Ulick.S...@nd.edu
!

Buy a copy of microspell from

Trigram Systems
5840 Northumberland St.
Pittsburgh, PA 15217
412-422-8976

for under $100 and load it on a PC. It has a TeX-specific mode
(ask for that one when you order) and will get the job done.

Dave Rogers