Database Challenge

Nicholas Kormanik

Dec 31, 2009, 5:34:30 PM

to SemWare, nkor...@gmail.com

There are 12 records in this mini database, in two columns. The first
column holds social security numbers; the second holds names.
Unfortunately Jane Doe appears three times, with three different
versions of her name but the same social security number.

Challenge: Remove the duplicates, where social security is the same,
and keep any one of the names. The final result should be whittled
down to 10 records.

(The real-life problem has 6.5 million records and lots of duplicates,
with various versions of names.)

025-60-4044 joe average
004-16-4077 jane doe
014-27-9076 mike smith
098-43-2098 rodolfo pilas
073-15-6005 gustavo boksar
004-16-4077 jane a. doe
147-79-9074 bea busaniche
165-63-0189 pablo medrano
124-96-7092 jeff aaron
004-16-4077 jane anne doe
172-30-6069 michael peters
059-85-1062 leroy baker

Carlo Hogeveen

Dec 31, 2009, 6:58:40 PM

to sem...@googlegroups.com

I had the macro DelDub3 lying around; you can find it here now:

http://www.xs4all.nl/~hyphen/tse/index.html

Note that to use it for your specific example,
you first need to sort on social security number
and make sure the "-" character is part of TSE's wordset.

A special Dutch greeting from 2010 to our 2009 western neighbours in the
Americas.
You'll get there :-)
Carlo

> -----Original message-----
> From: sem...@googlegroups.com [mailto:sem...@googlegroups.com] On behalf of
> Nicholas Kormanik
> Sent: Thursday, 31 December 2009 23:35
> To: SemWare
> CC: nkor...@gmail.com
> Subject: [TSE] Database Challenge

Nicholas Kormanik

Jan 1, 2010, 4:55:50 AM

to SemWare

Thank you so much Carlo.

We made it! 2010. Hope a good year for all.

Nicholas

Leo

Jan 1, 2010, 5:13:06 AM

to SemWare

The solution is not so difficult in principle.

There may be a problem with performance. Holding 6.5 million records
of, say, 25 bytes in memory takes about 2 GB. It might work, depending
on your machine, but using TSE only, the sort needs to be disk-based
anyway.

And then, when your file is sorted on social security number, there is
no need to read all the data into memory: a record-by-record process
that remembers the previous social security number could write out only
the non-duplicate records.
AWK is an excellent and free language for this sort of thing.
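
The record-by-record idea can be sketched as follows. This is a minimal Python sketch rather than AWK, and it assumes the input is already sorted on the first field and that fields are separated by whitespace; the function name unique_sorted is made up for illustration.

```python
import io

def unique_sorted(lines):
    """Yield the first record for each social security number.

    Assumes the input is already sorted on the first field, so all
    records sharing a number sit next to each other.
    """
    previous_ssn = None
    for line in lines:
        fields = line.split(None, 1)
        if not fields:              # skip blank lines
            continue
        ssn = fields[0]
        if ssn != previous_ssn:     # first record of a new number
            yield line
            previous_ssn = ssn

# Small sorted sample taken from the challenge data.
sample = io.StringIO(
    "004-16-4077 jane doe\n"
    "004-16-4077 jane a. doe\n"
    "004-16-4077 jane anne doe\n"
    "014-27-9076 mike smith\n"
    "025-60-4044 joe average\n"
)
result = list(unique_sorted(sample))
```

Only one line and the previous number are held at a time, so memory use should not depend on the size of the file.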

If getting the file sorted is a problem, in AWK you can store the
social security numbers in an associative array, only adding
non-duplicates, and at the end write the array out. (As a side effect,
the program could report in a log file all the variations in names
belonging to the same social security number.)
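
A rough sketch of the associative array approach, written in Python with a dict standing in for the AWK array. It assumes whitespace-separated fields and keeps the first name seen for each number; dedupe_by_ssn is a hypothetical name, and the "log" is simply returned as a second dict.

```python
def dedupe_by_ssn(lines):
    """Return (kept, variations) for input in any order.

    kept maps each social security number to the first name seen.
    variations maps a number to every distinct name seen for it,
    but only for numbers that occur with more than one name.
    """
    kept = {}
    names_seen = {}
    for line in lines:
        fields = line.split(None, 1)
        if not fields:                      # skip blank lines
            continue
        ssn = fields[0]
        name = fields[1].strip() if len(fields) > 1 else ""
        if ssn not in kept:                 # only add non-duplicates
            kept[ssn] = name
        names = names_seen.setdefault(ssn, [])
        if name not in names:
            names.append(name)
    variations = {s: n for s, n in names_seen.items() if len(n) > 1}
    return kept, variations

records = [
    "025-60-4044 joe average",
    "004-16-4077 jane doe",
    "014-27-9076 mike smith",
    "098-43-2098 rodolfo pilas",
    "073-15-6005 gustavo boksar",
    "004-16-4077 jane a. doe",
    "147-79-9074 bea busaniche",
    "165-63-0189 pablo medrano",
    "124-96-7092 jeff aaron",
    "004-16-4077 jane anne doe",
    "172-30-6069 michael peters",
    "059-85-1062 leroy baker",
]
kept, variations = dedupe_by_ssn(records)
```

No sort is needed, but the dict grows with the number of distinct social security numbers, so this trades memory for the sort step.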

Now that I think of it, a Unix-like sort can remove duplicates
directly:
sort -u
It requires a bit of study to get the options right so that it sorts on
fields (not the whole record) and to specify the output file (-o, I
believe).

Leo Mulders

knud van eeden

Jan 1, 2010, 2:03:09 PM

to sem...@googlegroups.com

Regarding unique:

Computer: Editor: TSE: Line: Operation: Select: Unique: How to get only the unique (possibly greater than 255 characters) lines (in a highlighted block)?
http://www.knudvaneeden.com/tinyurl.php?urlKey=url000384

with friendly greetings,
Knud van Eeden

Why Tea

Jan 1, 2010, 6:55:58 PM

to SemWare

Very well explained, Leo. Tse is not the right tool for the job here.
A memory based editor is not suitable for text manipulation in huge
files. Knud

Try downloading a Windows port of the Unix utilities here:
http://unxutils.sourceforge.net/.

Assuming the database is called database.txt, what Nicholas wants can
be done with commands like the ones below: sort on field one (delimited
by a space), then compare at most 11 (eleven) characters for
uniqueness.

1) sort -k 1 database.txt | uniq -w 11 | wc -l
- this gives a count of 10 unique lines

2) sort -k 1 database.txt | uniq -w 11 > database_unique.txt
- database_unique.txt holds the unique social security numbers

3) sort -k 1 database.txt | uniq -w 11 -d
- shows only duplicated lines, one per social security number

The next challenge is for someone to write a Tse macro to do the same
job on the 6.5 million records and compare the execution time.

Regards,
/Why Tea

knud van eeden

Jan 1, 2010, 7:33:48 PM

to sem...@googlegroups.com

Hi,

Computer: Editor: TSE: Line: Operation: Select: Unique: How to get only the unique values from a given column in a given text file?
http://www.knudvaneeden.com/tinyurl.php?urlKey=url000385

shows everything done in TSE only in about 5 to 10 minutes.

with friendly greetings,
Knud van Eeden

knud van eeden

Jan 1, 2010, 8:40:30 PM

to sem...@googlegroups.com

> There may be a problem with performance. Holding 6.5 million records
> of, say, 25 bytes in memory takes about 2 GB.

(6.5 * 10^6) * (10^2 / 4) bytes
is about 2 * 10^8 bytes, or about 200 * 10^6 bytes,
so fortunately it is only about 200 megabytes.

I noticed this difference by accident: my tests ran smoothly in TSE alone, and
when I compared the 2 GB estimate with my example data file, the sizes did not match.

So it can be done in memory in TSE alone
(with, say, 1 gigabyte of RAM installed on your computer).
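
The estimate is easy to check with a few lines of Python. The 25 bytes per record is Leo's rough figure, and the result counts raw text only, not any per-line overhead an editor may add.

```python
records = 6_500_000       # record count from the original post
bytes_per_record = 25     # Leo's rough per-record estimate

total_bytes = records * bytes_per_record
total_megabytes = total_bytes / 10**6   # 1 megabyte taken as 10^6 bytes
```

That comes to 162.5 megabytes, which rounds to the "about 200 megabytes" above and is far below 2 GB.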

mst...@gmail.com

Jan 1, 2010, 10:09:14 PM

to SemWare

I was able to use the 2.0 version of TSE way back in the day when it
was called QEdit to work on a Social Security death file. It took a
bit of work because the sort function was exterior to QEdit. As I
remember it the file had to be broken up in multiple parts and then
put back together.

Just recently I worked on the same file, and a macro isn't really
necessary. Just sort by the SSNo including dashes, then at the top of
the file search for "990-". This will take you down through most of
the file; then do a search for "000-" from there. You might have to
hunt back through a few pages to the lowest SSNo. to find where the
duplicates start; then it is just a matter of deleting from there to
the bottom of the file.

knud van eeden

Jan 2, 2010, 1:57:04 AM

to sem...@googlegroups.com

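If you use Cygwin sort.exe, a one-liner will do (a 200-megabyte file in under one minute):

sort.exe -k 1.<begin column>,1.<end column> -u <name of the file you want to sort>

E.g.

c:\cygwin\bin\sort.exe -k1.1,1.11 -u c:\temp\ddd.txt

will output the unique records of the file c:\temp\ddd.txt, sorted and compared on character columns 1 to 11 of the first field. Do not add -n: a numeric comparison stops at the first dash, so every number sharing its first three digits would be treated as a duplicate.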

knud van eeden

Jan 2, 2010, 6:44:59 PM

to sem...@googlegroups.com

Awk: Record: Field: Unique: How to get the unique records based on the first field?
http://www.knudvaneeden.com/tinyurl.php?urlKey=url000386

Current state:

Given a 200-megabyte file with about 6 million records:

1. Gnu Awk: 5 seconds
2. Gnu sort.exe: 1 minute
3. TSE algorithm using sort + unique: 5 minutes.

Why Tea

Jan 2, 2010, 7:06:39 PM

to SemWare

Well done Knud. Leo was spot on too in his analysis about the tool
selection.

On Jan 3, 10:44 am, knud van eeden <knud_van_ee...@yahoo.com> wrote:
> Awk: Record: Field: Unique: How to get the unique records based on the first field?
> http://www.knudvaneeden.com/tinyurl.php?urlKey=url000386

knud van eeden

Jan 2, 2010, 7:25:05 PM

to sem...@googlegroups.com

Well Why Tea, I would say yes and no ;-)

If a (supposedly one-time) job can be done in, say, 5 minutes,
and with Larry's frequency algorithm idea maybe even in under 1 minute
(if that does not work out I might program something similar myself;
the Awk solution is based on the same smart frequency idea),
and if it can be further automated in TSE, e.g. by adding

Ask( "what is the begin column of your data" ... ) // e.g. 1 (=begin column of social security data)
Ask( "what is the end column of your data" ... ) // e.g. 11 (=end column of social security data)
Ask( "what is the filename containing your data" ) // e.g. mydatabase.txt (contains the social security data)
Now call your TSE macro ...

then I think I will stick to TSE when that job has to be done. Put it in a hierarchical TSE menu entry
and it is easy to find: not much thinking (what was that Awk program again??), just pressing a few buttons.
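
To make the role of the prompted values concrete, here is a minimal sketch in Python, not in SAL. The begin and end columns become parameters (1-based, inclusive), reading the named file is left out, and unique_by_columns is a hypothetical name rather than an existing TSE macro.

```python
def unique_by_columns(lines, begin_col, end_col):
    """Keep the first line for each distinct key found in columns
    begin_col..end_col (1-based, inclusive)."""
    seen = set()
    kept = []
    for line in lines:
        key = line[begin_col - 1:end_col]   # slice out the key columns
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept

rows = [
    "025-60-4044 joe average",
    "004-16-4077 jane doe",
    "004-16-4077 jane a. doe",
    "014-27-9076 mike smith",
    "004-16-4077 jane anne doe",
]
kept_rows = unique_by_columns(rows, 1, 11)
```

With columns 1 to 11 the key is exactly the social security number, and the original order of the remaining lines is preserved.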

At the end of the day the means used (Awk, sort, Perl, TSE, ...) should be pretty much equivalent,
if in the same time range, and then your personal choice (e.g. TSE) might prevail.

Of course if the data files are very very large then it becomes another ballgame,
and you might be forced to stick to e.g. GNU sort.exe
(because that is very robust, I know because I tested it with sorting several
gigabytes large log files), and maybe GNU Awk (not tested that yet).

with friendly greetings,
Knud van Eeden

Why Tea

Jan 2, 2010, 8:44:20 PM

to SemWare

I knew you would disagree :) It's nice that we can discuss this
rationally; at least we can agree to disagree. I dislike the phrase
"yes and no" though, it sounds like Bush lingo :-)

My thought is: if you can do it with an existing tool, why write a new
one? If you can write it with something freely available and share it,
why not? If you can get the job done without loading a huge file into
memory, why not? In addition, experience with "real" scripts can be put
on a CV, but a Tse macro probably won't look as good.

Please don't get me wrong, Tse is a brilliant piece of software and
has stood the test of time. I still use it almost every day for what it
does best. But sometimes there are better alternatives for some tasks
that are worth considering.

You mentioned that you have collected a lot of editors but haven't
spent much time with them apart from Tse. It could be the same with
macro and scripting languages. The dilemma is that the more one uses a
tool, the better and more efficient one becomes with it, and then there
seems to be no reason to learn and explore something else *in depth*.
I have seen the same phenomenon in many of my colleagues, who won't
touch anything apart from the tools they already know. In the extreme
case, there are still people who won't leave good ol' DOS.

/Why Tea

knud van eeden

Jan 2, 2010, 9:49:11 PM

to sem...@googlegroups.com

Computer: Editor: TSE: Line: Operation: Select: Unique: How to get only unique values from given column in given text file? Frequency method
http://www.knudvaneeden.com/tinyurl.php?urlKey=url000387

Current state:

GNU Awk on a 200 megabytes file: 5 seconds
GNU Awk on a 1 gigabyte file: 22 seconds

GNU sort on a 200 megabytes file: 1 minute
GNU sort on a 1 gigabyte file: 5 minutes

TSE unique using frequency method on a 200 megabytes file: 70 seconds
TSE sort + unique on a 200 megabytes file: 5 minutes

Happy new year

Larry

Jan 2, 2010, 9:52:16 PM

to sem...@googlegroups.com

Since I am barely able to read anymore, and since most programs do not offer
displays that respect the Windows High Contrast mode, I am unable to explore
other means for accomplishing certain tasks on text files.

I used to be like you 100%! That's how I found QEdit. If Sam had gone the
way of other shareware authors in the early '90s and never developed TSE, I
would likely still be using QEdit (or Boxer).

As I say, I used to do what you do now. I could not count the number of
programs I played with then, looking for solutions to all kinds of problems,
and usually finding a little something wrong with every program I tried.
Some few were 'marvelous', like QModem, Bluewave, most of the
Fido-associated programs, and others. Then, I had limitless patience because
I could read the screens and manuals and could usually set them up just
right.

When working with text in TSE and finally realizing that if I wanted
something, I would most likely have to do it myself, I finally took the
plunge and learned some simple SAL. Over time, I learned more, but mostly I
learned how to frame a problem and then chase the solution using SAL.

But, the most important things of all are that I get nearly immediate
results, and if I have a problem with a macro, I can fix it (usually)
without asking for help or depending on a program author to make a special
change just for me. As best as I can remember, an author responded to a
suggestion from me only once before I joined the SemWare List, and that was
the author of the SilverXpress mail reader, who added the incremental
extension numbering system to his mail packets.

Would it not make more sense to use the macro language of a program first to
try for a solution, then, if none can be made, go to the NET to try to find
another program that would do the trick?

On the other hand, I am always glad to read your suggestions and program
referrals. It is greatly appreciated and I hope you continue.

--------------------------------------------------
From: "Why Tea" <ytl...@gmail.com>
Sent: Saturday, January 02, 2010 7:44 PM
To: "SemWare" <sem...@googlegroups.com>

Subject: [TSE] Re: Database Challenge

> I knew you would disagree :) It's nice we could discuss this

knud van eeden

Jan 2, 2010, 10:37:54 PM

to sem...@googlegroups.com

In programming there exist only 4 fundamental forms

1. Series

2. Choice

3. Loop

4. Sub

(see e.g. the books
John Motil - "Programming principles"
and
Antony Dvorak - "Basic in Action")

Most computer programs you build are just combinations of these 4 fundamental forms.
And via your programs you can build and create almost anything (yourself).
At the end of the day most computer languages are pretty much equivalent, because each implements
those 4 fundamental forms.
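
As a small illustration, all four forms show up in even a tiny routine for counting unique keys. This is a Python sketch; the comments mark which form each line is meant to illustrate, and count_unique is a made-up name.

```python
def count_unique(keys):             # Sub: a named, reusable routine
    seen = set()                    # Series: statements run one after another
    count = 0
    for key in keys:                # Loop: repeat for every key
        if key not in seen:         # Choice: act only on keys not seen before
            seen.add(key)
            count += 1
    return count

unique_count = count_unique(["004-16-4077", "014-27-9076", "004-16-4077"])
```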

Which computer language or tool you use, and whether you want to develop it yourself, is pretty much a matter of personal choice. It might vary depending on how much time you have and on your priorities and possibilities.
It depends on your goals. E.g. you might say "I want to program everything (also) in TSE". So then you might try to achieve that.

If you look at this database challenge you see that the same solution has been achieved with at least 3 different tools (TSE, Awk, GNU sort). While implementing it you probably learned a lot (e.g. new algorithms, thinking about their behavior, transfer of information between different computer languages, ...).
Similarly, you can for sure program it in SQL (e.g. using SELECT, DISTINCT and ORDER BY), in a spreadsheet (using VBScript), Emacs (Emacs-LISP), SlickEdit (Slick-C), VI (VIML language), Perl, Ruby, Python, Scheme, C++, Java, JavaScript, and so on. Sometimes one would do that (e.g. for time comparisons: which language is the fastest?), but the time needed will usually grow at least linearly with the number of tools (do you have that time available? maybe not) and the results are usually equivalent.
So you might stick to one solution only, and then maybe the simplest.
It is similar with natural languages. You might speak French, German, and so on. But speaking English is so much easier, because it has such a simple grammar (compare that with Russian grammar, e.g.), and it costs much less energy and thinking. So at the end of the day you stick to English only.
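
Going back to the SQL route mentioned above, a sketch using Python's built-in sqlite3 module could look like this. The table and column names are made up for illustration. Note that a plain SELECT DISTINCT over both columns would keep every name variation, so this sketch groups on the number instead and lets MIN(name) pick one name per group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")          # throwaway in-memory database
conn.execute("CREATE TABLE people (ssn TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [
        ("025-60-4044", "joe average"),
        ("004-16-4077", "jane doe"),
        ("014-27-9076", "mike smith"),
        ("004-16-4077", "jane a. doe"),
        ("004-16-4077", "jane anne doe"),
    ],
)
# One row per number; MIN(name) gives a deterministic choice of name.
unique_rows = conn.execute(
    "SELECT ssn, MIN(name) FROM people GROUP BY ssn ORDER BY ssn"
).fetchall()
conn.close()
```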

Programming is similar to being a writer or an artist. You usually have to solve problems creatively yourself.
If you are proficient in programming and have a lot of fun with it, you might be more inclined to do it yourself. If not, you might be more inclined or even forced to use existing tools.
I think I am pretty proficient in the TSE language and can program almost anything myself if I have to, and it is a lot of fun all the time. It is simple, and allows one to follow a minimalist design.

One should certainly stay open-minded to new solutions, because if you always say no to them you might miss something interesting. If there are new developments anywhere, one will probably jump on that bandwagon.

So at the end of the day it is all a matter of choice, and there are many different choices possible.

Why Tea

Jan 3, 2010, 6:14:08 AM

to SemWare

Knud, thanks for the explanation of computer programming
principles. I think you were probably talking about text manipulation
specifically, not programming in general. The choice of a language is
obviously important for the platform and the tasks it's going to
perform. For example, you will not use Java or scripts in any type of
real-time programming. But you might use Java in OSS (Operations
Support System) for its large collections of classes. For text
manipulation, there is little need to roll your own as there are many
tools and scripting languages designed specifically for that purpose;
the Tse macro language is among them.

I have always believed in the principle of going with what you know
best. A good tool will do you no good if you can't get it to work for
you. What I'm trying to say is that if one immerses himself or herself
too much in one tool, then he or she will not be proficient in anything
else, and the choice will always be narrowed down to ONE. As discussed
in a previous thread, being able to program just "Hello World!" in a
language is hardly enough to appreciate the power and capabilities of
that language.

Like you, I am fortunate to be able to speak and write a few human
languages. English is easy to master because it's ubiquitous
(some call it American invasion), not because of its simple grammar.
In comparison, I believe Chinese has an even simpler grammar, as it
doesn't have different tenses and everything is gender neutral; nouns
don't change between singular and plural, verbs stay the same
irrespective of time, etc. But most non-native speakers find it hard
to pronounce because it's a tonal language with 4 different intonations
per word.