
Elementary but surprisingly difficult.


Albert van der Horst

May 24, 2008, 4:05:48 PM
During the Euler stuff I came upon the following sub-problem.
It is elementary enough to be in a library somewhere.
An area given as (addr_start, addr_end) contains items
of length ``len''. I want to get rid of the duplicates.

Sorting the items is pretty easy, qsort can handle that.
Now, eliminating the duplicates, we need only compare one
item to the next.
Something along the lines of compact (start, end, len -- end' )
The new end should be returned, of course.

Even using variables/locals, I find this difficult.
(More difficult than e.g. making a table for the number of
divisors of N, which sounds much less trivial.)

--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- like all pyramid schemes -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

Slava Pestov

May 24, 2008, 5:02:20 PM
On May 24, 3:05 pm, Albert van der Horst <alb...@spenarnc.xs4all.nl>
wrote:

> Even using variables/locals, I find this difficult.
> (More difficult than e.g. making a table for the number of
> divisors of N, which sounds much less trivial.)

It's not difficult if you have a library of sequence operations:

http://factor-language.blogspot.com/2008/05/removing-duplicates-from-sequence.html

Slava

Bernd Paysan

May 24, 2008, 5:36:14 PM
Slava Pestov wrote:

> On May 24, 3:05 pm, Albert van der Horst <alb...@spenarnc.xs4all.nl>
> wrote:
>> Even using variables/locals, I find this difficult.
>> (More difficult than e.g. making a table for the number of
>> divisors of N, which sounds much less trivial.)
>
> It's not difficult if you have a library of sequence operations:

Nice filter, but doing that in-place is not difficult, either.

I give you a solution for characters in a string, and Albert can generalize
it. It assumes that the string is sorted, so it will only remove
sequences of the same character:

: uniquify ( addr u -- addr u' )
  over >r  bounds dup dup c@ 2swap 1+ ?DO
     I c@ tuck <> IF  swap 1+ 2dup c! swap  THEN  LOOP
  drop r> swap over - ;

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Slava Pestov

May 24, 2008, 6:23:29 PM
On May 24, 4:36 pm, Bernd Paysan <bernd.pay...@gmx.de> wrote:
> : uniquify ( addr u -- addr u' )
> over >r bounds dup dup c@ 2swap 1+ ?DO
> I c@ tuck <> IF swap 1+ 2dup c! swap THEN LOOP
> drop r> swap over - ;

This is why stack-based languages have a reputation of unreadability.

Slava

Bruce McFarling

May 24, 2008, 7:24:38 PM

``uniquify'' ... I dunno, that word looks pretty readable to me.

Its *definition* is underfactored, of course, and underfactored
definitions are always hard to read, but

uniquify ( addr u -- addr u' )

trim duplicate characters from a sorted string, in place

... seems pretty readable to me.

mark4

May 24, 2008, 8:08:52 PM
On May 24, 1:05 pm, Albert van der Horst <alb...@spenarnc.xs4all.nl>
wrote:

I didn't test this but it looks right (don't take my word for it
though :)

assumes the string is 0-terminated though, which is most un-forthlike :P

: remove-dups ( a1 -- )          \ a1: sorted, 0-terminated string
   dup                           \ ( write read )
   begin
      begin
         1+                      \ advance the read pointer
         2dup c@ swap c@ <>      \ until its char differs from the write char
      until
      swap 1+ swap               \ advance the write pointer
      2dup c@ dup rot c!         \ copy the char down, keep a copy of it
      0=                         \ done once the 0 terminator is copied
   until 2drop ;

van...@vsta.org

May 24, 2008, 8:11:30 PM
Bruce McFarling <agi...@netscape.net> wrote:
>> This is why stack-based languages have a reputation of unreadability.
> ...

> Its *definition* is underfactored, of course, and underfactored
> definitions are always hard to read, ...

So, could you try your hand at a properly factored definition? This is
a good example of the kind of thing I never code cleanly in Forth.

Thanks,
Andy Valencia

Doug Hoffman

May 24, 2008, 9:18:17 PM

I both agree and disagree with you in this case. It is not the kind of
code I would normally write, especially if in a hurry to get the job
done. Of course I could spend the time to pick it apart and see exactly
what is going on. But why bother? The word works. Some people, like
Bernd, can probably glance at code like this and easily/quickly see
exactly what makes it tick. I am not one of those people, even though I
like and use Forth.

I will hang on to 'uniquify' because I can see where it could be useful
to me in the future. (Thanks Bernd!).

-Doug

Doug Hoffman

May 24, 2008, 10:52:43 PM
On May 24, 9:18 pm, Doug Hoffman <no.spam> wrote:
> Slava Pestov wrote:
> > On May 24, 4:36 pm, Bernd Paysan <bernd.pay...@gmx.de> wrote:
> >> : uniquify ( addr u -- addr u' )
> >>   over >r  bounds dup dup c@ 2swap 1+ ?DO
> >>      I c@ tuck <> IF  swap 1+ 2dup c! swap  THEN  LOOP
> >>   drop r> swap over - ;
>
> > This is why stack-based languages have a reputation of unreadability.
>
> I both agree and disagree with you in this case.  It is not the kind of
> code I would normally write, especially if in a hurry to get the job
> done.  Of course I could spend the time to pick it apart and see exactly
> what is going on.  But why bother?  The word works.  Some people, like

> Bernd, can probably glance at code like this and easily/quickly see
> exactly what makes it tick.  I am not one of those people, even though I
> like and use Forth.

Having said that, I thought I should show the way I would normally
attack this kind of problem (context is a resource-rich PC
environment). The Mops Neon-based object library has some powerful
string handling methods. So I would start with that and come up with
something like the following. It may not be as readable to you as
your Factor solution. But then for *me* this is far more readable
than the Factor code because I know nothing about Factor.


(Note that Mops string objects maintain 2 pointers into the string,
position and limit, or pos and lim. Manipulation of these pointers is
the key to using string objects.)

:class ustring super{ string+ }
:m 2nd: ( -- c )  \ 2nd char
   1 skip: self  1st: self ;m
:m del1: 1 deleten: self ;m
:m backup1: pos: self 1 - >pos: self ;m
:m unique:
   BEGIN
      pos: self 1 + lim: self <
   WHILE
      1st: self 2nd: self =
      IF  del1: self backup1: self  THEN
   REPEAT ;m
;class

ustring s

" abbcdefffghiijkkkkkkklmnooz" put: s
unique: s
all: s type
abcdefghijklmnoz ok

-Doug

John Passaniti

May 25, 2008, 1:54:39 AM
Albert van der Horst wrote:
> Sorting the items is pretty easy, qsort can handle that.
> Now eliminating the duplicates, we need only compare one
> item to the next.

What if the original ordering must be preserved? By sorting, you've
destroyed that ordering. You would now have to annotate each item so
that you could later sort again. So the computational cost of this now
requires an annotation pass and two sorts... if the original ordering
must be preserved.

A less general way to deal with this would be to construct a parallel
data structure that records the values. In the case where each item
is relatively small (say, a 16-bit integer or an 8-bit character),
a bitmap is an obvious data structure. As you move through the list of
items, check first if the item is set in the bitmap. If it is, then
it's a duplicate. If it isn't, then mark the bitmap. No sorting is
necessary and duplicate detection occurs in a single pass.
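
For byte-sized items this takes only a few lines of Forth. A minimal
sketch of such a bitmap (the names seen, bit-addr, mark-seen and
duplicate? are mine, nothing standard):

create seen 32 allot                  \ 256 bits, one per byte value
: clear-seen ( -- ) seen 32 erase ;
: bit-addr ( c -- mask addr )         \ bit mask and byte address for c
   8 /mod seen + >r  1 swap lshift  r> ;
: seen? ( c -- f ) bit-addr c@ and 0<> ;
: mark-seen ( c -- ) bit-addr dup c@ rot or swap c! ;
: duplicate? ( c -- f )               \ true if c was seen before; marks it
   dup seen? if drop true else mark-seen false then ;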

As the size of the items being checked increases, storing it as a bitmap
gets more costly. At that point, you can then use hashing (ideally
picking a hash algorithm that works well for the data set) or you can
use a bitmap representation that exploits any sparseness inherent in the
data. In the case of hashing, storing hash collisions in a list would
be one strategy. In the case of a sparse bitmap, a radix trie might
work well.

Removing the actual duplicates is a separate matter. I would use two
pointers-- one for reading the current item, the other for where to
write it. As long as both pointers are the same, there have been no
duplicates and thus no reason to do a write. When the pointers are no
longer the same, then a write is needed. This could be slightly more
intelligent by delaying the write until the next duplicate, allowing a
block move instead of an item-by-item move.
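
Sketched in Forth on top of the byte-map above (compact-uniques is my
name; this is the plain item-by-item version, run clear-seen first):

: compact-uniques ( addr u -- addr u' )
   over >r                     \ save the base address
   over swap bounds ?do        \ write pointer on stack, I is the read pointer
      i c@ duplicate? 0= if    \ first occurrence: keep it
         i c@ over c!  1+      \ copy the byte down, advance the write pointer
      then
   loop
   r> tuck - ;                 \ u' = write - base

create sample char a c, char b c, char a c, char c c,
clear-seen  sample 4 compact-uniques type   \ prints abc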

Albert van der Horst

May 25, 2008, 5:36:32 AM
In article <ebemg5-...@vimes.paysan.nom>,

Bernd Paysan <bernd....@gmx.de> wrote:
>Slava Pestov wrote:
>
>> On May 24, 3:05 pm, Albert van der Horst <alb...@spenarnc.xs4all.nl>
>> wrote:
>>> Even using variables/locals, I find this difficult.
>>> (More difficult than e.g. making a table for the number of
>>> divisors of N, which sounds much less trivial.)
>>
>> It's not difficult if you have a library of sequence operations:
>
>Nice filter, but doing that in-place is not difficult, either.
>
>I give you a solution for characters in a string, and Albert can generalize
>it. It is supposed that the string is sorted, so it will only remove
>sequenes of the same character:
>
>: uniquify ( addr u -- addr u' )
> over >r bounds dup dup c@ 2swap 1+ ?DO
> I c@ tuck <> IF swap 1+ 2dup c! swap THEN LOOP
> drop r> swap over - ;

The main problem is that when the contents cannot be placed
on the stack, they cannot be compared with <> etc.
The above solution sidesteps that technical problem.
The algorithmic solution you gave is of course easy enough to come up
with, although it looks ugly enough even for this simple case.

>
>--
>Bernd Paysan
>"If you want it done right, you have to do it yourself"
>http://www.jwdt.com/~paysan/

Albert van der Horst

May 25, 2008, 5:39:06 AM
In article <d9ff9$4838be72$cdd0851a$13...@DIALUPUSA.NET>,

Doug Hoffman <dhof...@talkamerica.net> wrote:
>Slava Pestov wrote:
>> On May 24, 4:36 pm, Bernd Paysan <bernd.pay...@gmx.de> wrote:
>>> : uniquify ( addr u -- addr u' )
>>> over >r bounds dup dup c@ 2swap 1+ ?DO
>>> I c@ tuck <> IF swap 1+ 2dup c! swap THEN LOOP
>>> drop r> swap over - ;
>>
>> This is why stack-based languages have a reputation of unreadability.
>
>I both agree and disagree with you in this case. It is not the kind of
>code I would normally write, especially if in a hurry to get the job
>done. Of course I could spend the time to pick it apart and see exactly
>what is going on. But why bother? The word works. Some people, like
>Bernd, can probably glance at code like this and easily/quickly see
>exactly what makes it tick. I am not one of those people, even though I
>like and use Forth.

No, it doesn't work for me. It is so hard to understand that adapting it
to what I want is very hard.
(Compare this to C: pc++ , and now pc points to a struct instead of
a char, no sweat.)

>I will hang on to 'uniquify' because I can see where it could be useful
>to me in the future. (Thanks Bernd!).
>
>-Doug

Albert van der Horst

May 25, 2008, 5:42:42 AM
In article <ac0bbd8f-e6b1-443c...@b9g2000prh.googlegroups.com>,

To be exact, the array contains data structures representing
a circuit composed of identical condensers. Each item is 8 bytes, and
nothing may be assumed about the content. :-(

Groetjes Albert

John Passaniti

May 25, 2008, 7:55:37 AM
Bruce McFarling wrote:
>> This is why stack-based languages have a reputation of unreadability.
>
> ``uniquify'' ... I dunno, that word looks pretty readable to me.

Wow! This is truly an amazing breakthrough! In the past, reading code
meant... reading code. What a chore! Worrying about balancing control
structures, thinking about how each word affects the stack, considering
how words might screw around with the input stream, etc.

But now, thanks to the awesome power of simply redefining English words
so they mean what we want, we've solved that problem. Now, readability
is merely the ability to read the name of a definition and a comment
stating the purpose of the definition!

Albert van der Horst

May 25, 2008, 7:46:35 AM
In article <8a7_j.3356$%T3....@fe105.usenetserver.com>,

John Passaniti <nn...@JapanIsShinto.com> wrote:
>Albert van der Horst wrote:
>> Sorting the items is pretty easy, qsort can handle that.
>> Now eliminating the duplicates, we need only compare one
>> item to the next.
>
>What if the original ordering must be preserved? By sorting, you've
>destroyed that ordering. You would now have to annotate each item so
>that you could later sort again. So the computational cost of this now
>requires an annotation pass and two sorts... if the original ordering
>must be preserved.

Sorry dude, I have a set problem here. The order is irrelevant,
but I can't afford to waste time on duplicates.
<SNIP>

Groetjes Albert

Albert van der Horst

May 25, 2008, 7:43:50 AM
In article <ebemg5-...@vimes.paysan.nom>,
Bernd Paysan <bernd....@gmx.de> wrote:
>Slava Pestov wrote:
>
>> On May 24, 3:05 pm, Albert van der Horst <alb...@spenarnc.xs4all.nl>
>> wrote:
>>> Even using variables/locals, I find this difficult.
>>> (More difficult than e.g. making a table for the number of
>>> divisors of N, which sounds much less trivial.)
>>
>> It's not difficult if you have a library of sequence operations:
>
>Nice filter, but doing that in-place is not difficult, either.
>
>I give you a solution for characters in a string, and Albert can generalize
>it. It is supposed that the string is sorted, so it will only remove
>sequenes of the same character:
>
>: uniquify ( addr u -- addr u' )
> over >r bounds dup dup c@ 2swap 1+ ?DO
> I c@ tuck <> IF swap 1+ 2dup c! swap THEN LOOP

----------------------------------------------
WANT CASE-INSENSITIVE CASE-INSENSITIVE

WANT TUCK
WANT BOUNDS
WANT DUMP

CREATE AAP 1 C, 1 C, 2 C, 2 C, 3 C, 4 C, 5 C, 5 C, 5 C, 5 C,

: uniquify ( addr u -- addr u' )
over >r bounds dup dup c@ 2swap 1+ ?DO
I c@ tuck <> IF swap 1+ 2dup c! swap THEN LOOP
drop r> swap over - ;

AAP 10 uniquify .

AAP 100 DUMP

----------------------------------------------


albert@apple:~/PROJECT/euler> lina -a
"unique.frt" INCLUDED
4
0805,124C: 0102 0304 |....|
0805,1250: 0504 0505 0505 0000 0800 0000 756E 6971 |............uniq|
0805,1260: 7569 6679 68A7 0408 8012 0508 0000 0000 |uifyh...........|
0805,1270: 2C12 0508 5812 0508 0A05 0508 0000 0000 |,...X...........|
0805,1280: 98A3 0408 B8A1 0408 740A 0508 48A4 0408 |........t...H...|
0805,1290: 48A4 0408 98A5 0408 A8A4 0408 E4AF 0408 |H...............|
0805,12A0: 2C99 0408 3400 0000 6899 0408 98A5 0408 |,...4...h.......|
OK
BYE

----------------------------------------------

So is this a bug in lina, or does the code fail the first test
I throw at it?
(I expect 5, of course.)

The code is not transparent enough to be of
any value as a specification of the algorithm, or, in this
case, to debug.

>--
>Bernd Paysan

Groetjes Albert

Albert van der Horst

May 25, 2008, 7:52:29 AM
In article <8a7_j.3356$%T3....@fe105.usenetserver.com>,
John Passaniti <nn...@JapanIsShinto.com> wrote:
<SNIP>

>
>Removing the actual duplicates is a separate matter. I would use two
>pointers-- one for reading the current item, the other for where to
>write it. As long as both pointers are the same, there have been no
>duplicates and thus no reason to do a write. When the pointers are no
>longer the same, then a write is needed. This could be slightly more
>intelligent by delaying the write until the next duplicate, allowing a
>block move instead of an item-by-item move.

That is a good description and much in line with my thoughts.
Programming it in Forth is a different matter.
I started out by saying that I have technical difficulties.

Groetjes Albert.

Albert van der Horst

May 25, 2008, 9:40:27 AM
In article <k1fab...@spenarnc.xs4all.nl>,

Albert van der Horst <alb...@spenarnc.xs4all.nl> wrote:
>In article <8a7_j.3356$%T3....@fe105.usenetserver.com>,
>John Passaniti <nn...@JapanIsShinto.com> wrote:
><SNIP>
>>
>>Removing the actual duplicates is a separate matter. I would use two
>>pointers-- one for reading the current item, the other for where to
>>write it. As long as both pointers are the same, there have been no
>>duplicates and thus no reason to do a write. When the pointers are no
>>longer the same, then a write is needed. This could be slightly more
>>intelligent by delaying the write until the next duplicate, allowing a
>>block move instead of an item-by-item move.
>
>That is a good description and much in line with my thoughts.
>Programming it in Forth is a different matter.
>I started with telling that I have technical difficulties.

I did something I didn't want to do.
I wrote it in C first, then translated it to Forth:
I could write the C code almost without thinking, in mere minutes.
Mechanically translating it to Forth also took mere minutes.

-----------------------------------------

sometype *unique( sometype aap[], int len )
{
    sometype *ps; sometype *pd;

    for (pd=ps=aap; ps<aap+len; pd++,ps++)
    {
        while( ps < aap+len-1 && ps[0] == ps[1] ) ps++;
        *pd = *ps; /* memcpy( pd, ps, sizeof (sometype) ); */
    }
    return pd+1;
}

\ Remove duplicates from BEGIN to END. Leave new END (adr1, adr2 -- adr3)
: doit >R
begin DUP ( pd ps)
DUP R@ < WHILE
BEGIN DUP CELL+ R@ < IF DUP @ OVER CELL+ @ = ELSE 0 THEN WHILE
CELL+
REPEAT
SWAP OVER @ OVER !
CELL+ SWAP CELL+
REPEAT
CELL+
;

-----------------------------------------

Neither of the above code fragments was debugged.
Debugging the Forth code was relatively easy:

---------------------------------------
: doit >R
DUP
BEGIN ( pd ps)
DUP R@ < WHILE
BEGIN DUP CELL+ R@ < IF DUP @ OVER CELL+ @ .S = ELSE 0 THEN .S WHILE
CELL+
REPEAT
SWAP OVER @ OVER !
CELL+ SWAP CELL+
REPEAT DROP RDROP
;

CREATE AAP 1 , 1 , 2 , 2 , 3 , 4 , 5 , 5 , 5 , 5 ,

AAP DUP 10 CELLS +
.S
doit
.S
CR
AAP - 1 CELLS /MOD . .
CR
AAP 110 DUMP

------------------------------------------

This code passed.

Now it turns out, as usual, that I didn't need this in the first
place. :-(.

Groetjes Albert.

Bernd Paysan

May 25, 2008, 2:59:24 PM
Albert van der Horst wrote:
> The main problem is that when the contents cannot be placed
> on the stack, they cannot be compared with <> etc.

I don't see the big problem. If you can't place the content on the stack,
place a pointer. And compare with whatever compare function you need. If
you want this to be a library function like qsort, define that your array is an
array of pointers to the actual structures, and that you have a deferred
word for the actual comparison. This will compare for <= (whatever that is
for your objects), and is used for both the qsort and the duplicate
removal. You just have to make sure that the comparison order is the right
one, i.e. if you expect a to be <= b, compare b <= a for duplicate removal
(will only be true if b=a).
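
A sketch of that in Forth, over an array of cell-sized "pointers" with
a deferred le? (prune-dups, wp and num-le? are my names; in the demo the
cells hold plain numbers that le? compares directly):

defer le? ( x1 x2 -- f )             \ true when the first item is <= the second
variable wp                          \ write pointer

: prune-dups ( addr n -- addr n' )   \ addr: n sorted cells
   dup 2 < if exit then
   over wp !                         \ write pointer starts at the base
   cells over + over cell+ ?do       \ read from the second cell onward
      i @  wp @ @  le? 0= if         \ not (cur <= prev), so a new value
         1 cells wp +!  i @ wp @ !   \ advance the write pointer, keep it
      then
   1 cells +loop
   wp @ cell+ over - 1 cells / ;     \ new count

: num-le? ( n1 n2 -- f ) > 0= ;
' num-le? is le?
create nums 1 , 1 , 2 , 3 , 3 ,
nums 5 prune-dups                    \ leaves nums 3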

Bernd Paysan

May 26, 2008, 3:25:18 AM
Albert van der Horst wrote:
> : uniquify ( addr u -- addr u' )
> over >r bounds dup dup c@ 2swap 1+ ?DO
> I c@ tuck <> IF swap 1+ 2dup c! swap THEN LOOP
> drop r> swap over - ;
>
> AAP 10 uniquify .

[gives 4]

Yes, that's an off-by-one bug; add 1+ at the end of uniquify.

Jonah Thomas

May 26, 2008, 3:26:22 PM
Albert van der Horst <alb...@spenarnc.xs4all.nl> wrote:

> Sorting the items is pretty easy, qsort can handle that.
> Now eliminating the duplicates, we need only compare one
> item to the next.
> Something along compact (start, end, len -- end' )
> The new end should be returned, of course.
>
> Even using variables/locals, I find this difficult.
> (More difficult than e.g. making a table for the number of
> divisors of N, which sounds much less trivial.)

There are lots of sort algorithms available. If you find the right one, you can delete duplicates when the sort notices them. Then you don't have to sort them.

I started to do that. I found a simple comb-sort that was supposed to be pretty fast, and I looked at how to make simple modifications so it would close up duplicates instead of sorting them. Then I checked back here and found that you no longer wanted the routine at all. Oh well.

I'm not sure it would have been worth doing. For a long sequence you might have a long block move for every individual duplicate. Done the simplest way, it would be a block move on average half as long as the whole sequence. It would be easy to add a little complication and make it a block move on average a quarter as long as the whole sequence. Maybe a different sort would work better.

Here's the combsort, in case anybody has any interest.

-------
\ Combsort
\ Adapted from code by Wayne Conrad and Nick Estes
\ http://yagni.com/combsort/forth/index.php

defer less

: newgap ( gap -- gap )
10 13 */
dup 9 11 within if
drop 11
then 1 max ;

: swap-by-pointer? ( p1 p2 -- flag )
>r dup @ r@ @ less if
r@ @ over @ r> ! swap ! -1 else
r> 2drop 0 then ;

: comb ( gap 0 array length -- flag )
bounds ?do
over i + i swap-by-pointer? or
1 cells +loop nip ;

: combsort ( array length -- )
dup
begin
newgap dup >r cells
0 2over r@ - cells comb
r@ swap 0= r> 2 < and
until
drop 2drop ;
-----

Anton Ertl

May 26, 2008, 3:29:33 PM
Bernd Paysan <bernd....@gmx.de> writes:
>Albert van der Horst wrote:
>> The main problem is that when the contents cannot be placed
>> on the stack, they cannot be compared with <> etc.
>
>I don't see the big problem. If you can't place the content on the stack,
>place a pointer. And compare with whatever compare function you need. If
>you want this a library function like qsort, define that your array is an
>array of pointers to the actual structures, and that you have a deferred
>word for the actual comparison.

No! Don't use a global variable (like a deferred word) for that; it's
not reentrant. Instead, pass the xt of the comparison word on the
stack, or pass the right comparison word with the data (e.g., by using
an OO extension).
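
A sketch of that reentrant variant, with the xt on the stack and the
write pointer on the data stack instead of in a variable (prune-dups is
my name; uses ANS LOCALS|):

: prune-dups ( addr n xt -- addr n' )   \ xt: ( x1 x2 -- f ) tests <=
   locals| xt n addr |
   n 2 < if addr n exit then
   addr                                 \ write pointer
   addr n cells bounds cell+ ?do        \ read from the second cell onward
      i @ over @ xt execute 0= if       \ current > last kept, so a new value
         cell+  i @ over !              \ advance the write pointer, store it
      then
   1 cells +loop
   cell+ addr - 1 cells /  addr swap ;

: le ( n1 n2 -- f ) > 0= ;
create nums 1 , 1 , 2 , 2 , 5 ,
nums 5 ' le prune-dups                  \ leaves nums 3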

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2008: http://www.euroforth.org/ef08.html

Bill Spight

May 30, 2008, 9:57:12 AM
On Sun, 25 May 2008 01:54:39 -0400, John Passaniti wrote:

> Albert van der Horst wrote:
>> Sorting the items is pretty easy, qsort can handle that.
>> Now eliminating the duplicates, we need only compare one
>> item to the next.
>
> What if the original ordering must be preserved? By sorting, you've
> destroyed that ordering.

Errr. "abcbdecadec" Eliminate duplicates but preserve the original order?
Impossible.

John Passaniti

May 31, 2008, 11:58:42 PM
Bill Spight wrote:
>> What if the original ordering must be preserved? By sorting, you've
>> destroyed that ordering.
>
> Errr. "abcbdecadec" Eliminate duplicates but preserve the original order?
> Impossible.

"abcde"

I apparently just did the impossible.

Here, allow me to use a different string:

"edecdbea"

Using my awesome powers of removing duplicates without sorting and
without changing the order as I progress through the string, I get:

"edcba"

Later, Albert van der Horst restated the problem not as removing
duplicates from an array, but removing duplicates from a set. As a set
problem, order doesn't matter. But still, I continue to wonder why this
discussion involves any sorting. It doesn't make the problem any easier
and it invariably requires pointless memory moves of either the contents
of the array or a set of addresses to that contents. My solution
required only a single pass through the array and the creation of a
parallel data structure to mark values that were previously seen.

Marcel Hendrix

Jun 1, 2008, 2:41:13 AM
John Passaniti <nn...@JapanIsShinto.com> wrote Re: Elementary but surprisingly difficult.

> Bill Spight wrote:
>>> What if the original ordering must be preserved? By sorting, you've
>>> destroyed that ordering.

>> Errr. "abcbdecadec" Eliminate duplicates but preserve the original order?
>> Impossible.

> "abcde"

> I apparently just did the impossible.

"badec"

[..]

-marcel

Jonah Thomas

Jun 1, 2008, 12:31:17 PM
John Passaniti <nn...@JapanIsShinto.com> wrote:
> Bill Spight wrote:

> >> What if the original ordering must be preserved? By sorting, you've
> >> destroyed that ordering.
> >
> > Errr. "abcbdecadec" Eliminate duplicates but preserve the original order?
> > Impossible.
>
> "abcde"
>
> I apparently just did the impossible.

How do we even decide what it means to preserve the original order when we remove some items? abcde keeps a subset that's in the original order. bdeca does also.



> Here, allow me to use a different string:
>
> "edecdbea"

Now you have one that has no alphabetised subset that's in the same order as any subset of the original. So there are sequences that would take a very strange definition to say you can alphabetise them and then remove duplicates and keep the same ordering.

> Using my awesome powers of removing duplicates without sorting and
> without changing the order as I progress through the string, I get:
>
> "edcba"
>
> Later, Albert van der Horst restated the problem not as removing
> duplicates from an array, but removing duplicates from a set. As a set
> problem, order doesn't matter. But still, I continue to wonder why this
> discussion involves any sorting. It doesn't make the problem any easier
> and it invariably requires pointless memory moves of either the contents
> of the array or a set of addresses to that contents. My solution
> required only a single pass through the array and the creation of a
> parallel data structure to mark values that were previously seen.

It does seem kind of unforthy, doesn't it? You start with an existing library routine that makes the problem easier, and use it regardless of possible inefficiency.

I can sort of see how to do your thing. First attempt. For each item in the list past the first, check it against all the previous unique items. If it doesn't match any of them, move it to the first spot past the last unique one and increase the unique count by one.

You have to do up to N(N-1)/2 comparisons, and up to N-2 moves.

If moves were costly it might be possible to move the last unique items into the first nonunique spots, not moving the ones that are already in place. But why would moves be costly? More costly than saving items in a queue waiting to move them?

If sorting was cheap but compares were costly, you could sort first and then compare only items that are close. But why would compares be expensive when they're cheap while sorting?

Could you get fewer comparisons by sorting? It seems plausible to me. The worst case, N(N-1)/2 when they're all unique, compares every item against every previous item. Once it's partially sorted, maybe a good sort routine would do fewer compares? But you have to compare < and = both. I dunno.

It seems like 10 years ago I'd have spent as much of a Sunday afternoon as it took to find out. Now I'm just as happy if somebody else does it. Is that what it's like to get old?

Bruce McFarling

Jun 1, 2008, 2:15:25 PM
On Jun 1, 12:31 pm, Jonah Thomas <jethom...@gmail.com> wrote:

Why are we moving items?

If we are looking for a generic routine, surely a simple linked list
approach would handle anything for which someone can come up with a
"are the things pointed to at these two addresses equal" function.

1. Create linked list anchored on variable LIST, composed of
* pointer to next
* pointer to item

0 pointer to next flags end of list.

For each item in list, go through its sublist, comparing each item in
the sublist with that item. For each equality, move that linked list
item to a linked list anchored on the variable DUPLIST.

When you get to either a zero next pointer for the just completed
word, or a zero next pointer for the next item in the base list,
you're done.
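
A sketch of that in Forth, with two-cell nodes laid out as [ next |
item ] and a deferred equality test (the names are mine; this version
simply drops the cut nodes instead of keeping them on DUPLIST):

defer equal? ( x1 x2 -- f )
: item ( node -- x ) cell+ @ ;
: cut-next ( node -- ) dup @ @ swap ! ;   \ unlink the node's successor

: prune-after ( node -- )      \ remove later duplicates of node's item
   dup item swap               \ ( x node )
   begin dup @ while           \ while a successor exists
      over  over @ item  equal? if
         dup cut-next          \ duplicate: unlink it, stay put
      else
         @                     \ unique: advance
      then
   repeat 2drop ;

: prune-list ( first-node -- ) \ pass 0 for an empty list
   begin dup while dup prune-after @ repeat drop ;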

Jonah Thomas

Jun 1, 2008, 5:52:26 PM
Bruce McFarling <agi...@netscape.net> wrote:

> Why are we moving items?
>
> If we are looking for a generic routine, surely a simple linked list
> approach would handle anything for which someone can come up with a
> "are the things pointed to at these two addresses equal" function.
>
> 1. Create linked list anchored on variable LIST, composed of
> * pointer to next
> * pointer to item
>
> 0 pointer to next flags end of list.
>
> For each item in list, go through its sublist, comparing each item in
> the sublist with that item. For each equality, move that linked list
> item to a linked list anchored on the variable DUPLIST.
>
> When you get to either a zero next pointer for the just completed
> word, or a zero next pointer for the next item in the base list,
> you're done.

If the point is to wind up with just the unique items, then if you start with an array of addresses of all the items and you move the addresses of unique ones to the end of the list of current unique ones each time, you can do that with ! . If you have to manipulate linked lists it's more work, though the complications are all under-the-hood.

Here's a first pass at coding it.

defer equal ( x1 x2 -- flag )
: check-equal ( addr1 addr2 -- flag )
@ swap @ equal ;

: none-equal ( addr after-end start -- flag )
   do
      dup i check-equal if drop 0 unloop exit then
   1 cells +loop drop -1 ;

: uniquify ( 1-past-end-addr start-addr -- 1-cell-past-new-end-addr start-addr )
   cell+ tuck 1 cells - tuck cell+ do
      over i 2over none-equal if
         i @ swap !
         swap cell+ swap 0
      then
      drop
   1 cells +loop ;


' = is equal
create test 1 , 5 , 7 , 1 , 5 , 7 , 6 , 8 , 6 , 7 ,

test 10 cells bounds uniquify
: display ( end start -- )
?do i @ . 1 cells +loop ;


It may be more trouble to put a bunch of addresses into an array than to put them into a linked list. Particularly if you already have them in a linked list. But it's a little less effort to loop through an array than a linked list, and a little less effort to move an item to an array element than to link it into a list.

Maybe there's a better way still.

Albert van der Horst

Jun 1, 2008, 6:57:16 PM
In article <e98f56b0-9bd1-44b3...@y38g2000hsy.googlegroups.com>,
Bruce McFarling <agi...@netscape.net> wrote:
<SNIP>

>
>For each item in list, go through its sublist, comparing each item in
>the sublist with that item. For each equality, move that linked list
>item to a linked list anchored on the variable DUPLIST.
>
>When you get to either a zero next pointer for the just completed
>word, or a zero next pointer for the next item in the base list,
>you're done.

Maybe the original question with getting rid of duplicates had to do
with an Euler problem where a Gigabyte machine risks running out of
memory on packed items. I don't remember anymore ... ;-)

Groetjes Albert

John Passaniti

Jun 1, 2008, 9:35:22 PM
Jonah Thomas wrote:
> How do we even decide what it means to preserve the original order
> when we remove some items? abcde keeps a subset that's in the
> original order. bdeca does also.

I didn't have any problem at all defining order. Again, the original
message from Albert-- the one I first replied to-- described the problem
as removing duplicates from an array. Now I hate to get all Computer
Science on everyone, but the very definition of array denotes an
ordering; you can say the "first" item in an array, and it has clear and
unambiguously meaning.

So no, if preserving the original order mattered, "bdeca" wouldn't work,
because the original order had a "a" as the *first* element. Albert
didn't say the original order mattered-- indeed in a subsequent message
to the one I replied to, he said he was thinking more of sets than
arrays. (More formally, he wanted to convert a "bag" or "multiset" into
a proper set.) But being a human stuck in a causal universe, I can't
respond to messages from the future.

>> "edecdbea"
>
> Now you have one that has no alphabetised subset that's in the same
> order as any subset of the original. So there are sequences that
> would take a very strange definition to say you can alphabetise them
> and then remove duplicates and keep the same ordering.

Again, I'm struggling to figure out why sorting came into this
discussion. The moment anyone talks about sorting (or in your case,
implicitly talks about sorting by using the phrase "alphabetized
subset"), then I just raise my hands and say, "why."

>> Later, Albert van der Horst restated the problem not as removing
>> duplicates from an array, but removing duplicates from a set. As a
>> set problem, order doesn't matter. But still, I continue to wonder
>> why this discussion involves any sorting. It doesn't make the
>> problem any easier and it invariably requires pointless memory
>> moves of either the contents of the array or a set of addresses to
>> that contents. My solution required only a single pass through the
>> array and the creation of a parallel data structure to mark values
>> that were previously seen.
>
> It does seem kind of unforthy, doesn't it? You start with an existing
> library routine that makes the problem easier, and use it regardless
> of possible inefficiency.

I wouldn't say it's "unforthy" because that implies that a skilled
programmer would do something different in Forth than in other
languages. I would say that throwing in a sort is simply bad (or
unnecessary) programming in any language. Forth programmers like to
tell themselves they are doing things in some sense "better" than their
non-Forth brethren, but that's just feel-good mythology.

> I can sort of see how to do your thing. First attempt. For each item
> in the list past the first, check it against all the previous unique
> items. If it doesn't match any of them, move it to the first spot
> past the last unique one and increase the unique count by one.

I'm not sure where counting the number of unique items comes into this.
The original statement said to remove duplicates, not count them. And
no, the way you expressed my solution seems... weird.

Let's take the simple case of removing duplicate 8-bit values. Create a
bitmap large enough to hold every value (256/8=32 bytes) and clear it.
Now for each item in the array, check if the appropriate bit for that
value is set. If so, you have a duplicate. If it's not set, then set
it. Done. One single pass through the array. Nothing is compared more
than once.

In the case of removing duplicates from items that are larger than 8-bit
bytes, you choose a data structure that can efficiently represent the
presence. For 16-bit values, you need 8k for a straight bitmap. If
that's too large, there are other representations available (such as a
binary trie). Hashing can also be used, with appropriate handling of
collisions.
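
For cell-sized items a toy open-addressing hash set is also only a few
lines (a sketch; #slots, slots and seen-before? are my names, keys must
be nonzero, and the table is assumed never to fill up):

1024 constant #slots                   \ a power of two
create slots #slots cells allot
: clear-slots ( -- ) slots #slots cells erase ;
: seen-before? ( x -- f )              \ true if x already present, else insert it
   dup #slots 1 - and                  \ crude hash: the low bits
   begin
      dup cells slots + @              \ ( x i val )
      dup 0= if drop cells slots + ! false exit then
      2 pick = if 2drop true exit then
      1+ #slots 1 - and                \ linear probe with wraparound
   again ;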

> You have to do up to N(N-1)/2 comparisons, and up to N-2 moves.

No, I do N comparisons in the case of storing in a data structure (like
a bitmap) without collisions. With collisions (like in a hash table),
the number of comparisons would depend on the number of collisions
(which reflects back on the hashing function used) and how collisions
were resolved, but would typically be pretty close to N.

> If moves were costly it might be possible to move the last unique
> items into the first nonunique spots, not moving the ones that are
> already in place. But why would moves be costly? More costly than
> saving items in a queue waiting to move them?

I don't understand. No queue is necessary in what I'm describing. In
my solution, I discuss two entirely separate things-- detection of
duplicates, and removing the duplicates. In my case, removing
duplicates requires nothing more than two pointers. As you step through
the array, two pointers ("read" and "write") point to the current item
in the array. When a duplicate item is found, the "read" pointer is
advanced past the item and the write pointer is left where it is (on the
duplicate item). As long as there are duplicates, you keep advancing
the read pointer. When you find a non-duplicate item, you copy the item
from the read pointer to the write pointer.

This can be made smarter by delaying copies so that larger blocks of
memory can be moved all at once, instead of each item individually.

> If sorting was cheap but compares were costly, you could sort first
> and then compare only items that are close. But why would compares be
> expensive when they're cheap while sorting?

I don't understand. To sort requires comparisons. So if your
comparisons are expensive, sorting will just add more expensive
comparisons. They aren't necessary.

> Could you get fewer comparisons by sorting? It seems plausible to me.

No. Sorting itself *is* executing comparisons. So the moment you add a
sort, you've *added* comparisons.

The only thing a sort buys you is if for some reason you can't store a
parallel data structure to mark the presence of items. Then and only
then would a sort be useful.

John Passaniti

Jun 1, 2008, 9:38:19 PM
Albert van der Horst wrote:
> Maybe the original question with getting rid of duplicates had to do
> with an Euler problem where a Gigabyte machine risks running out of
> memory on packed items. I don't remember anymore ... ;-)

Your original question was abstract; you may have had a specific
application in mind, but it wasn't presented as such.

Marc Olschok

Jun 2, 2008, 2:06:31 PM
John Passaniti <nn...@japanisshinto.com> wrote:
> Jonah Thomas wrote:
> > How do we even decide what it means to preserve the original order
> > when we remove some items? abcde keeps a subset that's in the
> > original order. bdeca does also.
>
> I didn't have any problem at all defining order. Again, the original
> message from Albert-- the one I first replied to-- described the problem
> as removing duplicates from an array. Now I hate to get all Computer
> Science on everyone, but the very definition of array denotes an
> ordering; you can say the "first" item in an array, and it has clear and
> unambiguously meaning.

>
> So no, if preserving the original order mattered, "bdeca" wouldn't work,
> because the original order had a "a" as the *first* element.

By that reasoning, your original solution of reducing "abcbdecadec"
to "abcde" also does not work because the original order had "d"
as the *fifth* element.

So perhaps it is not that obvious from the notion of array,
what kind of order is to be understood.

Of course an array can be identified with a map f: I --> X
from a linearly ordered index set I to a set X describing the possible
entries. In this way, the map f "is" the order; but then no true reduction
would preserve it.

Another choice is the "order of first appearance", where an order
is imposed on f(I) via

x < y <==> min{ i in I | f(i)=x } < min{ i in I | f(i)=y }

With respect to this order the word "abcbdecadec" gives

a < b < c < d < e

So your solution "abcde" does preserve it but "bdeca" does not.

Marc

Bruce McFarling

Jun 2, 2008, 5:43:35 PM
On Jun 1, 5:52 pm, Jonah Thomas <jethom...@gmail.com> wrote:
> If the point is to wind up with just the unique items, then if you start with an array of addresses of all the items and you move the addresses of unique ones to the end of the list of current unique ones each time, you can do that with ! . If you have to manipulate linked lists it's more work though the complications are all under-the-hood.

But 'remove duplicates from tail of list' means that the next element
in the list, if there is one, is always unique, because if it was not
unique, it would already have been removed.

Indeed, if you don't need the list of duplicates, the only linked list
*manipulation* required is:

: cut-next-item ( &item -- ) DUP >R @ @ R> ! ;  \ point this link at next's next

Bruce McFarling

Jun 2, 2008, 5:47:33 PM
On Jun 1, 6:57 pm, Albert van der Horst <alb...@spenarnc.xs4all.nl>
wrote:

> Maybe the original question with getting rid of duplicates had to do
> with an Euler problem where a Gigabyte machine risks running out of
> memory on packed items. I don't remember anymore ... ;-)

Yes, the same algorithm working by repacking a vector in place is a
lot more memory moves than working from an external linked list, so
it's easy to see how it could be bested if there is not ``ITEMS 2*
CELLS'' space free for the linked list.

John Passaniti

Jun 2, 2008, 5:52:22 PM
On Jun 2, 2:06 pm, Marc Olschok <nob...@nowhere.invalid> wrote:
> So perhaps it is not that obvious from the notion of array,
> what kind of order is to be understood.

Yes, I suppose the less practically-oriented might enjoy coming up
with all manner of notions of array order. But for those people who
have less free time, the notion of order in an array is pretty obvious
and intuitive. At least it's obvious and intuitive to a programmer.
I'm sure the mathematicians in da house have some exquisitely bizarre
notions of order that can be used to completely derail this
discussion.

> Of course an array can be identified with a map f: I --> X
> from a linear ordered index set I to a set X describing the possible
> entries. In this way, the map f "is" the order; but then no true reduction
> would preserve it.

When someone writes a message saying, 'I have an array and I want to
remove duplicates," pedantic discussions over subtle notions of order
probably aren't the first thing that comes to mind as being useful.

It was I who originally brought up the question of order. I noted (in
Albert's initial message) that his use of sorting would destroy
order. Or at least destroy the common notion of order than anyone who
doesn't over-think the problem would have. I didn't know if order was
important (in a subsequent message, he indicated it was not).

> x < y <==> min{ i in I | f(i)=x } < min{ i in I | f(i)=y }
>
> With respect to this order the word "abcbdecadec" gives
>
> a < b < c < d < e
>
> So your solution "abcde" does preserve it but "bdeca" does not.

Correct, and if we were to use a Fibonacci sequence to generate array
indices, modulo the number of pieces of Pez candy I just ate, and
use that as a seed for a random number generator that generated
indices-- then my solution also doesn't preserve that ordering. Got
me there.

Bruce McFarling

Jun 2, 2008, 5:53:52 PM
On Jun 2, 2:06 pm, Marc Olschok <nob...@nowhere.invalid> wrote:
> Another choice is the "order of first appearance", where an order
> is imposed on f(I) via

> x < y <==> min{ i in I | f(i)=x } < min{ i in I | f(i)=y }

> With respect to this order the word "abcbdecadec" gives

> a < b < c < d < e

> So your solution "abcde" does preserve it but "bdeca" does not.

Order is a relative concept ... the only possible ambiguity is whether
the *first* of several instances is the unique one and the
*successive* ones are the duplicates. That is the most natural
specification, since "the first of" exists for both the duplicated and
non-duplicated entries. If so, preserving ordering means preserving:

index(x(1) in array1) < index(y(1) in array1)
=> index(x in array2) < index(y in array2)

Jonah Thomas

Jun 2, 2008, 6:45:08 PM
John Passaniti <nn...@JapanIsShinto.com> wrote:
> Jonah Thomas wrote:

> >> "edecdbea"
> >
> > Now you have one that has no alphabetised subset that's in the same
> > order as any subset of the original. So there are sequences that
> > would take a very strange definition to say you can alphabetise them
> > and then remove duplicates and keep the same ordering.
>
> Again, I'm struggling to figure out why sorting came into this
> discussion. The moment anyone talks about sorting (or in your case,
> implicitly talks about sorting by using the phrase "alphabetized
> subset"), then I just raise my hands and say, "why."

They wanted to sort and then remove duplicates. It was an approach to solve the problem of removing duplicates.



> > I can sort of see how to do your thing. First attempt. For each item
> > in the list past the first, check it against all the previous unique
> > items. If it doesn't match any of them, move it to the first spot
> > past the last unique one and increase the unique count by one.
>
> I'm not sure where counting the number of unique items comes into this.
> The original statement said to remove duplicates, not count them. And
> no, the way you expressed my solution seems... weird.

Yes, it wasn't your solution at all.



> Let's take the simple case of removing duplicate 8-bit values. Create a
> bitmap large enough to hold every value (256/8=32 bytes) and clear it.
> Now for each item in the array, check if the appropriate bit for that
> value is set. If so, you have a duplicate. If it's not set, then set
> it. Done. One single pass through the array. Nothing is compared more
> than once.

Yes, I see that.

create flags 256 allot           \ one byte per possible char; 0 means present

: uniquefy ( c-addr len -- )
   flags 256 1 fill              \ mark every char absent
   bounds ?do
      0 i c@ flags + c!          \ mark this char present
   loop ;

: present? ( char -- flag )
   flags + c@ 0= ;

So you can easily test whether any item is present in the string. You don't care whether it's present more than once; all you need to know is whether it's present one or more times.

As you point out, you could do it with 32 bytes instead of 256 bytes at the cost of complexifying the code a little bit.

> In the case of removing duplicates from items that are larger than 8-bit
> bytes, you choose a data structure that can efficiently represent the
> presence. For 16-bit values, you need 8k for a straight bitmap. If
> that's too large, there are other representations available (such as a
> binary trie). Hashing can also be used, with appropriate handling of
> collisions.

And for cell-size values you need half a gigabyte. But maybe it's only a small subset of the 32-bit values? For strings of indefinite size you need an indefinite length? Still, your beautiful simple approach is good for a lot of cases.

> > Could you get fewer comparisons by sorting? It seems plausible to me.
>
> No. Sorting itself *is* executing comparisons. So the moment you add a
> sort, you've *added* comparisons.
>
> The only thing a sort buys you is if for some reason you can't store a
> parallel data structure to mark the presence of items. Then and only
> then would a sort be useful.

Yes, and even then you don't really need a sort, you can do it the way I coded it. Instead of sorting your duplicates and then discarding the extras, you just consider them discarded immediately. But with the unique items unsorted you have potentially a lot of comparisons to do for each new item. If that matters you could sort fresh each time you get a new unique item. Use a sort routine that's good for almost-sorted lists; you only have to place one new item each time.

With the unique-so-far items sorted you could treat the list as a balanced binary tree to check the next possibly-unique item. If it isn't unique you find that out. If it is unique you find out where to insert it. Use MOVE to shift items right to provide a space for it. More code but the result should be faster than what I did.

John Passaniti

Jun 2, 2008, 7:33:49 PM
Jonah Thomas wrote:
>> Again, I'm struggling to figure out why sorting came into this
>> discussion. [...]

>
> They wanted to sort and then remove duplicates. It was an approach to
> solve the problem of removing duplicates.

Cute.

Yes, there is a universe of approaches. There is a smaller universe of
good approaches.

> And for cell-size values you need half a gigabyte. But maybe it's
> only a small subset of the 32-bit values? For strings of indefinite
> size you need an indefinite length? Still, your beautiful simple
> approach is good for a lot of cases.

No, you don't need half a gigabyte. As I wrote, there are
representations other than straight bitmaps that could be used. You
could use a (binary) radix trie. You could use a hash table with
chaining for collisions.

>> The only thing a sort buys you is if for some reason you can't
>> store a parallel data structure to mark the presence of items.
>> Then and only then would a sort be useful.
>
> Yes, and even then you don't really need a sort, you can do it the
> way I coded it. Instead of sorting your duplicates and then
> discarding the extras, you just consider them discarded immediately.

What does this buy you? In what case would such an approach make sense
over a simpler one-pass solution?

> With the unique-so-far items sorted you could treat the list as a
> balanced binary tree to check the next possibly-unique item. If it
> isn't unique you find that out. If it is unique you find out where to
> insert it. Use MOVE to shift items right to provide a space for it.
> More code but the result should be faster than what I did.

If the goal here is to simply come up with as many possible ideas for
how to approach the problem, then bravo. We then can store the data not
just in a binary tree (balanced or not), but now we can use other kinds
of trees (AVL, splay, red/black), probabilistic data structures like
skip-lists, a variety of radix tries, and an entire world of data
structures with a variety of interesting properties. Why would could
even use the items in the array as training data for a neural net that
would recognize previously-seen items. Or how about we view this as a
parsing problem and build an extensible deterministic finite automata
that recognizes previously-seen items in the array as transitions in
some abstract lexical space. Maybe a genetic approach-- we build a
population of thousands of programs that interpret the items in the
array, and from each population, choose the ones that were best able to
recognize previously-seen items and cross-breed them.

Oh, and then we'll sort it. That should help.

Bruce McFarling

Jun 3, 2008, 12:36:14 PM
On Jun 2, 6:45 pm, Jonah Thomas <jethom...@gmail.com> wrote:
> Yes, and even then you don't really need a sort, you can do it the way I coded it. Instead of sorting your duplicates and then discarding the extras, you just consider them discarded immediately. But with the unique items unsorted you have potentially a lot of comparisons to do for each new item.

Yes, you have many of the comparisons that you would have to do to
sort the array ... however, since you "know" the provisional right
place for each item as soon as you compare it to the most recently
found unique item in the vector, you can finish the job with exactly
one comparison for each pair of items in the vector.

That is, the first item in the array is the first item in the result.

Go through the tail, compare each item in the tail to the first item,
if it is equal, skip to the next, if it is not equal, put it in the
next sequential position in the tail.

Repeat until there is no tail.

You're done.

So with a code-address setting a target for a C-PUT, and a string
providing the source and limit for a C-GET? ... and a value CVALUE the
inner loop is something like:

( ca u ) OVER >C-TARGET >S$
C-GET? IF
   TO CVALUE BEGIN
      C-GET? ( c TRUE | FALSE ) WHILE
      DUP CVALUE = IF DROP ELSE C-PUT THEN
   REPEAT
THEN

... then the outer loop is just getting the target string and
repeating the inner loop if it is longer than 1 char.

... it's not the comparisons where that could be beaten by recourse to a
side data structure, but the repeated moves to get to the current head
of the string ... with a side data structure, like a bitmap, those
moves can be avoided.

But if the original question was not what is the most computationally
efficient way to do it, but why it's hard to do ... the easiest way to
do it does not look all that hard to me, so I'm wondering what I'm
missing.

Albert van der Horst

Jun 3, 2008, 5:25:58 PM
In article <f806c245-9057-493d...@m3g2000hsc.googlegroups.com>,

John Passaniti <john.pa...@gmail.com> wrote:
>
>It was I who originally brought up the question of order. I noted (in
>Albert's initial message) that his use of sorting would destroy
>order. Or at least destroy the common notion of order than anyone who
>doesn't over-think the problem would have. I didn't know if order was
>important (in a subsequent message, he indicated it was not).

And of course it was implicit in the way the problem was stated.

Jonah Thomas

Jun 4, 2008, 1:41:12 PM
Bruce McFarling <agi...@netscape.net> wrote:
> Jonah Thomas <jethom...@gmail.com> wrote:
> > Yes, and even then you don't really need a sort, you can do it the way I coded it. Instead of sorting your duplicates and then discarding the extras, you just consider them discarded immediately. But with the unique items unsorted you have potentially a lot of comparisons to do for each new item.
>
> Yes, you have many of the comparisons that you would have to do to
> sort the array ... however, since you "know" the provisional right
> place for each item as soon as you compare it to the most recently
> found unique item in the vector, you can finish the job with exactly
> one comparison for each pair of items in the vector.
>
> That is, the first item in the array is the first item in the result.
>
> Go through the tail, compare each item in the tail to the first item,
> if it is equal, skip to the next, if it is not equal, put it in the
> next sequential position in the tail.
>
> Repeat until there is no tail.
>
> You're done.

That was my first approach. It works, and in the worst case it requires every pair of items to be compared, for N(N-1)/2 comparisons.



> So with a code-address setting a target for a C-PUT, and a string
> providing the source and limit for a C-GET? ... and a value CVALUE the
> inner loop is something like:
>
> ( ca u ) OVER >C-TARGET >S$
> C-GET? IF
> TO CVALUE BEGIN
> C-GET? ( c TRUE | FALSE ) WHILE
> DUP CVALUE = IF C-PUT ELSE DROP THEN
> REPEAT
> THEN
>
> ... then the outer loop is just getting the target string and
> repeating the inner loop if it is longer than 1 char.

Yes. I started with an array and put the new unique items at the end of the unique list which I kept at the start of the array. That destroyed the information about how many copies there are of each item, which nobody had said was important.

> ... its not the comparisons where that could be beat by recourse to a
> side data structure, but in repeated moves to get to the current head
> of the string ... with a side data structure, like a bitmap, those
> moves can be avoided.

Yes, that's clearly the way to go when the side structure doesn't add too much space or complexity.

> But if the original question was not what is the most computationally
> efficient way to do it, but why its hard to do ... the easiest way to
> do it does not look all that hard to me, so I'm wondering what I'm
> missing.

I think it involved going down a blind alley. When you have a sort routine handy then it seems intuitive that sorting might make the problem simpler because then all the copies will be right next to each other. But if you start with an array with duplicates and you want to wind up with an array of uniques, you still have to move things around. You have less remaining work to do comparing, but it isn't as simple as it "ought" to be.

Doing it without a side structure and starting with an array, I thought it was easier to use the array than create a threaded list etc. You can sort the unique items and then the number of comparisons goes down from N(N-1)/2 to around (N-1)(log U)/2. That looks good. But inserting an item into the sorted unique array requires a move that's on average U/2 items long. So the more unique items there are the faster the comparisons go, but the slower the inserts get. So it isn't obvious when it would be an improvement. If you put the unique items in a binary tree or something instead of just an array you could search fast and insert fast, and you might lose some time balancing the tree.

Oh well. I reviewed Knuth some on searching, so it was probably good for my soul.

Marc Olschok

Jun 4, 2008, 5:24:52 PM
John Passaniti <john.pa...@gmail.com> wrote:
> On Jun 2, 2:06 pm, Marc Olschok <nob...@nowhere.invalid> wrote:
> > So perhaps it is not that obvious from the notion of array,
> > what kind of order is to be understood.
>
> Yes, I suppose the less practically-oriented might enjoy coming up
> with all manner of notions of array order. But for those people who
> have less free time, the notion of order in an array is pretty obvious
> and intuitive. At least it's obvious and intuitive to a programmer.
> I'm sure the mathematicians in da house have some exquisitely bizarre
> notions of order that can be used to completely derail this
> discussion.

That was neither my aim nor my point: I tried to make sense of the
following exchange:

Your remark from <8a7_j.3356$%T3....@fe105.usenetserver.com>:


| What if the original ordering must be preserved? By sorting, you've

| destroyed that ordering.[...]

Bill's example from <YIT%j.2924$jI5...@flpi148.ffdc.sbc.com>:


| Errr. "abcbdecadec" Eliminate duplicates but preserve the original order?
| Impossible.

Your claim from <U7p0k.112$WY1...@fe107.usenetserver.com>:


| "abcde"
| I apparently just did the impossible.

Jonah's question <20080601123117.6...@gmail.com>


| How do we even decide what it means to preserve the original order
| when we remove some items?
| abcde keeps a subset that's in the original order. bdeca does also.

Your answer from <p9I0k.5326$3R....@fe085.usenetserver.com>:


| Now I hate to get all Computer Science on everyone, but the
| very definition of array denotes an ordering; you can say
| the "first" item in an array, and it has clear and unambiguously meaning.
|
| So no, if preserving the original order mattered, "bdeca" wouldn't work,
| because the original order had a "a" as the *first* element.

So when I find three practically-oriented programmers arguing as above
about whether an order-preserving reduction can be done and which of the
above does it, it is perhaps not too bizarre to wonder whether they agree
about what they mean by "order", particularly since your above explanation
("I know what is first, so I have an order") does not seem to have anything
scientific about it.

In particular, the only "notion of order" which "the very definition
of array" might denote(?) is the one, in which the index set I is ordered
(for example I = {0, 1, 2, 3} with the natural ordering) and the array
is identified with a map f: I --> X, where X contains the possible entries.
This does not give an order relation on X, but it corresponds to the
notion of an ordered sequence. This is indeed the intuitive notion
of order which comes with an array --- and it proves Bill's above claim
that "abcbdecadec" cannot be reduced while preserving the order.

So your claim that "abcde" is an order-preserving reduction must
refer not to the above "intuitive" notion which "the very definition
of an array" gives us, but to some other notion of "order".
The description "you can say the 'first' item in an array" is obviously
not enough to accept "abcde" but reject "abedc".

So I guessed that you perhaps meant the order relation given by

x < y <==> min{ i in I | f(i)=x } < min{ i in I | f(i)=y }

which compares on the basis of the position of first appearance of an
item in the array.
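
For example, in "abcbdecadec" the first appearances are a at index 0,
b at 1, c at 2, d at 4 and e at 5, so this relation orders the items as
a < b < c < d < e. Listing the unique items according to it accepts
"abcde" and rejects both "bdeca" and "abedc".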

Now this is also intuitive; but it _differs_ from the other intuitive
notion. So the moral is: even practically minded programmers benefit from
saying explicitly what they are talking about.

Marc

John Passaniti

unread,
Jun 4, 2008, 7:51:40 PM6/4/08
to
Marc Olschok wrote:
> So I guessed that you perhaps meant the order relation given by
>
> x < y <==> min{ i in I | f(i)=x } < min{ i in I | f(i)=y }
>
> which compares on the basis of the position of first appearance of an
> item in the array.

Yes. That and your other supporting text is the longest and most
complicated way possible of explicitly expressing what should be
intuitive to anyone reading this thread. Congratulations.

> Now this is also intuitive; but it _differs_ from the other intuitive
> notion. So the moral is: even practically minded programmers benefit from
> saying explicitly what they are talking about.

Kind of. The real moral here is that bored people who delight in
over-thinking a problem are capable of squeezing out exquisitely
abstruse meanings from common notions.

I'll try this one more time because I just got a new laptop and I can
use some typing practice to get used to it:

The original statement of the problem was that you have an array, and
that you want to delete duplicates. My statements about preserving
order were in the context of that original problem. So if someone (like
Bill) has a notion of order that is incompatible with removing *any*
items, then he's no longer in the same context.

That is, "intuitive" necessarily exists within a context. If others
want to shift the context to a pedantic discussion about mathematical
notions of arrays, then they can play games with what is and isn't
intuitive. But for those of us who think continuity is kewl, we'll keep
our conversation in the same context.

Jonah's notion of order is at least worth discussing in this thread
because he's allowing for items from the array to be deleted. But his
notion of order is not at all intuitive-- at least to a programmer.
Allow me to demonstrate:

create x 4 , 3 , 6 , 3 , 4 , 4 ,

There, we have an array. Now my question: What is the address of the
first item in the array?

I say the address of the first item is this:

x

Crazy, huh? But that's the radical kind of guy I am.

Jonah is apparently saying that the first item in the array could be at
either the next one or two cells after x.

And now for the fun part. Since you're defending this as an intuitive
notion of order, please give me a real-world example where this notion
of order would be intuitive. Note that I didn't ask for a contrived
problem. I'm asking where this notion of order is *intuitive.*


Marc Olschok

unread,
Jun 5, 2008, 12:49:52 PM6/5/08
to

I am not sure if I understand the meaning of the above, but I guess
it comes from assigning to each array f: I --> X a map s: X --> I such
that s(x) is the first index with f(s(x))=x, assuming that f is surjective.
The order x < y <==> s(x) < s(y) is then the same as above.

Marc

Bruce McFarling

unread,
Jun 5, 2008, 2:02:01 PM6/5/08
to
On Jun 5, 12:49 pm, Marc Olschok <nob...@nowhere.invalid> wrote:

Yes, a^-1(item)=index, shorthand index(item), is the inverse function
of a(index)=item, which of course requires a 1-1 mapping, so each
member of a group of otherwise identical items must be distinguished,
which can only be done arbitrarily, in terms of the index of each
member of the group. The most natural mapping for a group of items
equal to g assigns a group index gi to each member g_gi, so that the
member of the group with the lowest index is g_1, and so on up the
group, giving index(g_gi) < index(g_gj) iff gi < gj.

Marc Olschok

unread,
Jun 7, 2008, 2:29:23 PM6/7/08
to
John Passaniti <nn...@japanisshinto.com> wrote:
> Marc Olschok wrote:
> > So I guessed that you perhaps meant the order relation given by
> >
> > x < y <==> min{ i in I | f(i)=x } < min{ i in I | f(i)=y }
> >
> > which compares on the basis of the position of first appearance of an
> > item in the array.
>
> Yes. That and your other supporting text is the longest and most
> complicated way possible of explicitly expressing what should be
> intuitive to anyone reading this thread. Congratulations.

Perhaps it should be, but it obviously was not. The need for precision
usually comes up in such situations. Observe that in the above you could
use max instead of min, producing an order which is no less "intuitive"
(or is it not :-).

>
> > Now this is also intuitive; but it _differs_ from the other intuitive
> > notion. So the moral is: even practically minded programmers benefit from
> > saying explicitly what they are talking about.
>
> Kind of. The real moral here is that bored people who delight in
> over-thinking a problem are capable of squeezing out exquisitely
> abstruse meanings from common notions.
>
> I'll try this one more time because I just got a new laptop and I can
> use some typing practice to get used to it:
>
> The original statement of the problem was that you have an array, and
> that you want to delete duplicates. My statements about preserving
> order were in the context of that original problem. So if someone (like
> Bill) has a notion of order that is incompatible with removing *any*
> items, then he's no longer in the same context.
>
> That is, "intuitive" necessarily exists within a context. If others
> want to shift the context to a pedantic discussion about mathematical
> notions of arrays, then they can play games with what is and isn't
> intuitive. But for those of us who think continuity is kewl, we'll keep
> our conversation in the same context.

Since the original problem did not mention order anyway, it is difficult
to decide whether any notion of order fits into the context.

For simplicity, let's assume that for an array x: I --> X the
index set I is just some set of the form n = {0, ..., n-1}.
So the original question asked how, given an array x: n --> X,
to produce a map v: k --> n with k <= n, such that the composition
vx: k --> X is injective ("no duplicates") but still has the same image as x.
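
(With Bill's "abcbdecadec" this means n = 11; the choice v = (0,1,2,4,5)
gives the composition "abcde", and v = (1,4,5,6,7) gives "bdeca". Both
are injective and have the same image as x.)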

Part of the previous confusion arose out of the overworked word "order".

An array x: n --> X does in general _not_ (by its "very definition") give
an _order_relation_ on X (<http://en.wikipedia.org/wiki/Partial_order>).

However, it _does_ correspond (by its "very definition") to an
_ordered_sequence_ in X (<http://en.wikipedia.org/wiki/Sequence>).

These two concepts, although related, are different.
My guess is that, except for you, everybody regarded the latter concept
as the more intuitive, and I would expect the same of programmers.
Its main advantage is that it is indeed intrinsic to the array, and does
not involve the choices necessary for imposing some order relation on X.

>
> Johan's notion of order is at least worth discussing in this thread
> because he's allowing for items from the array to be deleted. But his
> notion of order is not at all intuitive-- at least to a programmer.
> Allow me to demonstrate:
>
> create x 4 , 3 , 6 , 3 , 4 , 4 ,
>
> There, we have an array. Now my question: What is the address of the
> first item in the array?
>
> I say the address of the first item is this:
>
> x
>
> Crazy, huh? But that's the radical kind of guy I am.

Not more radical than saying that the address of the last item is
x 5 cells +


>
> Jonah is apparently saying that the first item in the array could be at
> either the next one or two cells after x.

No.
Jonah, having grasped "array = ordered sequence", simply uses the well
known notion of subsequence (<http://en.wikipedia.org/wiki/Subsequence>).

In the above notation, it means that the reindexing map v: k --> n
is monotone. Consequently, both

create v 0 , 1 , 2 ,

or

create v 1 , 2 , 4 ,

among some other possibilities, are acceptable to him.
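
Spelled out in Forth, the reindexing is just a gather loop. An untested
sketch, with REINDEX being my own name:

: reindex ( x v k dest -- )   \ dest[i] := x[v[i]] for i < k
   swap 0 ?DO                 ( x v dest )
      over I cells + @        \ fetch v[i]
      3 pick swap cells + @   \ fetch x[v[i]]
      over I cells + !        \ store into dest[i]
   LOOP
   2drop drop ;

With your CREATE X from before, x v 3 pad reindex leaves 4 3 6 or
3 6 4 at PAD, depending on which v was compiled.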

>
> And now for the fun part. Since you're defending this as an intuitive
> notion of order, please give me a real-world example where this notion
> of order would be intuitive. Note that I didn't ask for a contrived
> problem. I'm asking where this notion of order is *intuitive.*

What shall I say? I hope I have at least managed to express that the two
notions "array" and "ordered sequence" are essentially the same, and
hence that Jonah's interpretation of "order preserving" is more intuitive
than the interpretation based on the notion of "order relation", which
has to be specified in a rather ad hoc manner.

Marc

William James

unread,
Jun 7, 2008, 4:46:44 PM6/7/08
to
On May 24, 4:36 pm, Bernd Paysan <bernd.pay...@gmx.de> wrote:
> Slava Pestov wrote:
> > On May 24, 3:05 pm, Albert van der Horst <alb...@spenarnc.xs4all.nl>
> > wrote:
> >> Even using variables/locals, I find this difficult.
> >> (More difficult then e.g. making a table for the number of
> >> dividers of N, which sounds much less trivial.)
>
> > It's not difficult if you have a library of sequence operations:
>
> Nice filter, but doing that in-place is not difficult, either.
>
> I give you a solution for characters in a string, and Albert can generalize
> it. It is supposed that the string is sorted, so it will only remove
> sequenes of the same character:
>
> : uniquify ( addr u -- addr u' )
> over >r bounds dup dup c@ 2swap 1+ ?DO
> I c@ tuck <> IF swap 1+ 2dup c! swap THEN LOOP
> drop r> swap over - ;

Ruby

E:\>irb --prompt xmp
"Are you a bonehead? Then you'll be proud to produce
Forth spaghetti-code.".split(//).uniq.join
==>"Are youabnhd?T'lptc\nFsgi-."


John Passaniti

unread,
Jun 9, 2008, 3:10:53 PM6/9/08
to
Marc Olschok wrote:
> Perhaps it should be, but it obviously was not. The need for precision
> usually comes up in such situations.

False. This is an informal newsgroup where people casually state
problems. This is not a group design session for the space shuttle.

> Observe that in the above you could
> use max instead of min, producing an order which is no
> less "intuituive" (or is it not :-).

You obviously are enjoying complicating the discussion. I'm not interested.

> Since the original problem did not mention order anyway, it is difficult
> to decide whether any notion of order fits into the context.

Correct. I brought up order as an aside-- pointing out that sorting
would (potentially) destroy order. Anyone who had graduated past
"Hello, World!" knew what sense of order I was using.

> Part of the previous confusion arose out of the overworked word "order".

Again, there was no confusion. By saying that sorting (potentially)
destroyed order, the sense of order I was discussing was clear.
The feigned confusion came from those who wanted to be pedantic.

> However, it _does_ correspond (by its "very definition") to an
> _ordered_sequence_ in X (<http://en.wikipedia.org/wiki/Sequence>).
>
> These two concepts, although related, are different.
> My guess is that, except for you, everybody regarded the latter concept
> as the more intuitive, and I would expect the same of programmers.

I'm not sure why you think I'm not talking about an ordered sequence.
Probably because you've so complicated the discussion that you've lost
what I've written. No matter.

> Its main advantage is that it is indeed intrinsic to the array, and does
> not involve the choices necessary for imposing some order relation on X.

Yep. That's my argument.

> What shall I say? I hope I have at least managed to express that the two
> notions "array" and "ordered sequence" are essentially the same, and
> hence that Jonah's interpretation of "order preserving" is more intuitive
> than the interpretation based on the notion of "order relation", which
> has to be specified in a rather ad hoc manner.

You must be a blast at parties.

Marc Olschok

unread,
Jun 9, 2008, 3:27:13 PM6/9/08
to
John Passaniti <nn...@japanisshinto.com> wrote:
> Marc Olschok wrote:
>[...]
> > Part of the previous confusion arose out of the overworked word "order".
>
> Again, there was no confusion. By saying that sorting (potentially)
> destroyed order, the sense of order I was discussing was clear.
> The feigned confusion came from those who wanted to be pedantic.
>
> > However, it _does_ correspond (by its "very definition") to an
> > _ordered_sequence_ in X (<http://en.wikipedia.org/wiki/Sequence>).
> >
> > These two concepts, although related, are different.
> > My guess is that, except for you, everybody regarded the latter concept
> > as the more intuitive, and I would expect the same of programmers.
>
> I'm not sure why you think I'm not talking about an ordered sequence.

Because you rejected Jonah's suggestion, which was based on that notion,
and instead opted for the order relation that I described before.

> Probably because you've so complicated the discussion that you've lost
> what I've written. No matter.

Not at all. But I am not really interested in chasing a moving target.

>
> > Its main advantage is that it is indeed intrinsic to the array, and does
> > not involve the choices necessary for imposing some order relation on X.
>
> Yep. That's my argument.

It seems to be now. Glad to see that.

Marc
