Competition anyone?

6 views
Skip to first unread message

Robert AH Prins

unread,
Apr 9, 1998, 3:00:00 AM4/9/98
to

Hi all,

I don't know if anyone is interested spending some time on this during the
Easter period, I've been working on it for about the last three years, and
it can be done quite fast... My first attempt took about 26 minutes on a
66Mhz 486, adding about 26 minutes to the original running time of the
program, it now takes quite a lot less, but I would like to see if any of
you, being new to the problem, can devise an even faster method, and give us
all an explanation as to how you arrived at your solution.

As for the program of which this routine is a part, I'm currently adding a
little more comment to it, and might upload it to Garbo, as a prime example of
"How to use TP to make things faster" (or equally possible "This is the worst
abuse of TP I've seen, but I must admit that it is pretty fast")

Input : An N-line text file, with a 3-character string on each line
Assumptions: N <= 4095 and lines are implicitely numbered from 1 to N,
i.e. your program has to number them internally.
NOTE: In real life each line contains a lot more data, so
arrays are out of 64K limit question...
C, the number of different 3-character strings <= 63

Wanted:

For all Z >= 5, up to C:

The shortest series of lines from the input file, and their positions in it,
containing Z different 3 character strings, with the restriction that a such
a string may not be part of a longer string, i.e. you are not allowed to
include the two series of 5 different strings (i.e. AAA BBB CCC DDD EEE &
BBB CCC DDD EEE FFF) that are included in a series of 6 different strings
(i.e. AAA ... FFF). You also have to find every occurance of this shortest
series.

FYI, below is my input file, in a flowed format, i.e. you have replace the
blanks with CR/LF's, the 3-char strings are, as you might guess nationalities
and there are 35 of them in 1209 lines, with the longest series, from "L" to
"E" spanning lines 183 to 1170 - both occur only once, BUT is purely
accidental, so no use may be made of it.

The output for this longest series would be:

! Nats ! Span ! From ! To
! 35 ! 988 ! 183 ! 1170 ! L YU D GR B A GB NL IL DK !
! ! ! ! ! S SF RL TR I WAN USA AFG CH F !
! ! ! ! ! IND PAK NZ CDN IRL SGP P DZ J N !
! ! ! ! ! CZ HR PL RSM E !

(To add just a little difficulty, the series has to be split in groups of
10 nationalities per line)

BTW, there are two series of 5 in 5, no less than six of 9 in 19, but for all
others, there is just one shortest series.

And finally, I didn't start using BASM until I had brought down the running
time to about 3 minutes (on the same 66MHz 486!). BASM is great, but it is
not a substitute for improving a suboptimal algorithm!

And finally finally, some of you may smell a rat - he cannot solve it, so
he's asking us to do it for him. No, and to prove it, I will send a UUencoded
password encryped ZIP to the first person that sends me <pri...@wcg.co.uk>
an email before tomorrow evening (aka 1998-04-10-18.00). This person must
agree that his address will appear in a follow-up message, with the usual
"nospam" qualifiers, and I would prefer it to be someone who regularly
posts to c.l.p.b!

Robert
--
Robert AH Prins
pri...@wcg.co.uk

As for the condensed input file, here it is:

NL NL NL D D D D D D D DK DK DK DK NL S S S S S S S S S S S S S S S S S S S
S S S S S S S S S S S S N NL NL NL NL NL NL NL NL D D D DK NL DK S S S S N N
N N N N N N N D D N N D N N N N N N USA N N N SF SF SF S SF SF SF SF SF SF
SF SF SF SF SF SF SF SF S S S S S S S S S S S S S S S S DK DK D D D D TR D D
D D GB D F F F F YU D D NL D NL NL NL NL NL NL DK D DK DK DK NL S S S S S S
DK DK DK D NL NL NL NL NL NL NL NL D D D D NL D A L YU YU YU D GR GR GR GR
GR GR B GR GR GR GR D YU YU YU YU YU YU A A GB D D GB D NL NL NL NL NL NL NL
IL D D D D DK DK S S NL S S S NL S S S S SF DK DK DK DK DK DK DK DK DK DK NL
NL NL NL NL NL NL NL NL NL D D D D D NL D D D S D S S S S S S S DK DK DK DK
DK DK DK DK S S D D D D D D RL A A A GR TR GR GR GR GR I TR D A YU NL NL D
IL D D D D D A A YU GR GR GR GR GR D D NL NL NL NL NL NL D D D D GR YU NL NL
NL A D A A A A TR A GR GR GR GR WAN GB GR GR YU YU YU YU TR TR IL D A D D D
D D DK DK DK D S SF S NL NL NL NL NL D D D D D USA D AFG YU GR GR GR GR GR
GR GR YU GR D D D D D NL NL NL NL NL NL NL D D D D CH DK SF SF S D TR TR TR
TR TR TR TR TR TR TR TR TR TR TR TR TR B TR TR TR TR TR TR TR TR I TR TR TR
TR TR TR TR TR TR TR TR TR D TR TR TR GB GR F GR GR GR GR NL NL NL NL NL NL
NL GB GB GB GB GB GB B B F B NL NL NL NL B B B B B F F F F GB GB GB GB GB GB
B IND GB GB GB NL NL NL NL NL NL D D GB D I I I I I I I I I I I I I I I GR
GR GR GR GR D I I I I D D NL NL NL GB GB GB GB GB GB GB GB GB GB GB GB GB GB
GB GB GB GB GB GB GB GB GB GB GB PAK GB GB GB GB GB GB GB GB GB NZ CDN GB GB
GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB
GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB TR GB GB GB GB GB GB IRL GB GB
GB GB GB GB GB GB GB GB GB B B NL NL NL NL NL NL NL D D D D USA D D D I I I
I SF SF GB GR GR I NL A NL I NL D D NL NL NL GB GB GB GB GB GB GB GB SGP GB
GB GB GB GB GB D GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB
GB GB GB GB D GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB
GB GB GB GB GB NL D D D D A D I I NL GB I I D I A GR GR GR F GR GB I I I D A
D D F NL NL NL NL NL D D D D D NL GB GB GB GB GB GB GB GB GB GB GB GB GB GB
GB P GB GB GB GB GB GB GB GB GB GB I GB GB GB GB GB GB GB GB GB GB GB GB GB
GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB
GB GB DZ GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB I GB GB GB GB GB GB GB
GB GB GB GB GB GB D GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB NL NL NL
J D D D D D D D I I I GB NL GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB
GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB
GB GB GB GB GB I GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB
GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB IND GB GB GB GB GB GB GB GB GB
GB GB GB GB GB GB GB GB GB GB GB GB GB NL D TR D GB D A N A I I I GB D GR D
D D B I I I I NL D CZ D D D NL S DK S S S S S D D D D D D D D P D P D D D NL
NL NL NL NL NL GB GB GB GB GB GB GB F GB GB GB GB B GB GB GB GB NL NL HR NL
TR D D D D D D D D D D D D D D D D D D D D D D D D D PL D D D D D D D D D D
I I I I I GR A I RSM DK D D I D D D NL GB GB GB GB GB GB E GB GB GB GB DK B
NL NL NL NL NL D D D D D GB D NL D D D D CDN D D D NL D D D D D D B B B B NL

-----== Posted via Deja News, The Leader in Internet Discussion ==-----
http://www.dejanews.com/ Now offering spam-free web-based newsreading

Derek T. Asari

unread,
Apr 13, 1998, 3:00:00 AM4/13/98
to

[ Rest of message deleted ]
-----------------------------------------------------------------
This is a strange one. You want to find a faster, more efficient
routine, but you don't supply an exact time for comparison. You
also don't supply an algorithm, making it difficult to estimate
relative efficiency or even if my method is different from yours.
I can't see why you didn't supply a general description of how
you did it, just so that people wouldn't "re-invent the wheel".

Well, in spite of this, I thought I'd take shot at it and
continued reading. Your problem seems to be a kind of minimum
sum problem, but I had problems understanding the conditional of
excluding sub-runs of a larger run. Specifically, what defines a
run from which sub-runs can be eliminated. Rather than derive an
algorithm which incorporates an incorrect testing procedure, I
changed the goal to simply finding all runs meeting the criteria.
I figured that I could make up some ground by improving the way
it finds these runs and the way it determines the size of the
span.

The program I wrote finds the smallest span(s) containing n
nationalities where 5<=n<=MaxNationalities without regard to
eliminating spans on the basis of it being part of a larger span.
It processes your example data in about 9m:20s running in
interpreted BASIC on a little handheld using an 8088 equivalent
processor operating at 4.77 mHz. Since the method involves
finding every span containing a certain number of nationalities,
elimination can be added as a simple test once the exact
conditions involved are known. A compiled version executed on a
486 machine should take less than 10 seconds to complete.

20 seconds of the total execution time is used to convert your
raw data into a form of run length encoding. By changing "N N N N
N N" to 6,1 (where 1 represents the index of the symbol "N" in a
table), your 1209 lines are reduced to 322 groups. I believe
this would improve any method of data processing since it
significantly reduces the amount of data to be manipulated. It
also means that the number of lines and the position of boundary
lines can be calculated rathered than be determined by enumeration.

Anyway, I think something along these lines would be faster than
the 3 minutes mentioned. But then again, you might be doing it
this way already.
-----------------------------------------------------------------
Derek Asari
eq...@cleveland.freenet.edu

Robert AH Prins

unread,
Apr 16, 1998, 3:00:00 AM4/16/98
to eq...@cleveland.freenet.edu

In article <6gssk8$f8u$1...@pale-rider.INS.CWRU.Edu>,
eq...@cleveland.Freenet.Edu (Derek T. Asari) wrote:

Hi Derek,

> [ Rest of message deleted ]
> -----------------------------------------------------------------
> This is a strange one. You want to find a faster, more efficient
> routine, but you don't supply an exact time for comparison.

Sorry, mea maxima culpa, I should have given the current time, which is
~21 seconds on the same 486 66.

> You also don't supply an algorithm, making it difficult to estimate
> relative efficiency or even if my method is different from yours.
> I can't see why you didn't supply a general description of how
> you did it, just so that people wouldn't "re-invent the wheel".

My teacher never gave me any clues. If that resulted in me re-inventing
the wheel, it would just show that neither of us could come up with
another method, however bad it would be. It might also result in both of
us using basically a similar method, but using other implementations...

If I had given my algorithm, you might have assumed that it is wrong to
start with, causing you to look not a non-existant better method, because
it isn't my algorithm that is wrong, but rather my implementation may be
leaving a lot of room for improvement, without changing the fundamentals!

> Well, in spite of this, I thought I'd take shot at it and
> continued reading.

Thank you

> Your problem seems to be a kind of minimum sum problem,

Yes I suppose it is!

> but I had problems understanding the conditional of
> excluding sub-runs of a larger run. Specifically, what defines a
> run from which sub-runs can be eliminated.

Maybe I wasn't clear enough. I want to eliminate all sub-runs that are part of
a longer run, i.e. if I'm looking for nuns of five nationalities, I do not
want to include runs of five nationalities that are embedded in runs of 6 or
more nationalities, so

x A B C D E y with both x AND y in {A, B, B, D, E} is a run that must
be included, but

x A B C D E y with either x OR y not in {A, B, C, D, E} is a run that
must be excluded, as this would make x..E or A..y a run of 6 (or even
x..y a run of 7!)

> Rather than derive an
> algorithm which incorporates an incorrect testing procedure, I
> changed the goal to simply finding all runs meeting the criteria.

As my teacher used to say, go for the general solution first, and narrow it
down later, instead of devising a specialized solution that is difficult to
adapt to similar problems.

> I figured that I could make up some ground by improving the way
> it finds these runs and the way it determines the size of the
> span.
>
> The program I wrote finds the smallest span(s) containing n
> nationalities where 5<=n<=MaxNationalities without regard to
> eliminating spans on the basis of it being part of a larger span.
> It processes your example data in about 9m:20s running in
> interpreted BASIC on a little handheld using an 8088 equivalent
> processor operating at 4.77 mHz. Since the method involves
> finding every span containing a certain number of nationalities,
> elimination can be added as a simple test once the exact
> conditions involved are known. A compiled version executed on a
> 486 machine should take less than 10 seconds to complete.

I'd love to see your algorithm (and not to be hypocritical) the BASIC
program, you should have received my routine as a separate email, as it
is rather long, and may not be overly interesting for others.

However, if anyone else wants so see it, send _ME_ <pri...@wcg.co.uk> an
email and if I get more than five requests, I'll post it in the group, if I
get less, you all get your private copy of the routine, or IF YOU EXPLICITLY
ASK FOR IT the full source of the entire program + data (115K zipped, or 133K
if you also want the post processors (to some reordering, sorting and to
convert the output files into RTF format))

> 20 seconds of the total execution time is used to convert your
> raw data into a form of run length encoding. By changing "N N N N
> N N" to 6,1 (where 1 represents the index of the symbol "N" in a
> table), your 1209 lines are reduced to 322 groups. I believe
> this would improve any method of data processing since it
> significantly reduces the amount of data to be manipulated. It
> also means that the number of lines and the position of boundary
> lines can be calculated rathered than be determined by enumeration.
>
> Anyway, I think something along these lines would be faster than
> the 3 minutes mentioned. But then again, you might be doing it
> this way already.

I am indeed, keeping only the first and last item of runs of the same
nationality, which brings me back to what I wrote before: If I had
mentioned RLL, you might have assumed this to be of no or limited use
and you might have wasted time on trying to find a non-existant better
algorithm!

As for other "entries" to the competition, this morning I received one from
Brent Beach, who claims he can do it in 8 seconds on a 486 33, using pure
Pascal. He didn't include his solution, but given the correct results for the
longest runs he included, I have no doubt that it's correct. It makes me onder
what I've missed all these years...

Regards,

Robert
--
Robert AH Prins
pri...@wcg.co.uk

-----== Posted via Deja News, The Leader in Internet Discussion ==-----

Reply all
Reply to author
Forward
0 new messages