Database Challenge

Nicholas Kormanik

unread,

Jan 1, 2010, 5:37:25 AM1/1/10

to

There are 12 records in this mini database. Two columns. First
column are social security numbers. Second column are names.
Unfortunately Jane Doe appears three times, with three different
versions of her name, but having the same social security number.

Challenge: Remove the duplicates, where social security is the same,
and keep any one of the names. Final result will be whittled down to
10 records.

(Real life problem has 6.5 million records, and lots of duplicates,
with various versions of names.)

025-60-4044 joe average
004-16-4077 jane doe
014-27-9076 mike smith
098-43-2098 rodolfo pilas
073-15-6005 gustavo boksar
004-16-4077 jane a. doe
147-79-9074 bea busaniche
165-63-0189 pablo medrano
124-96-7092 jeff aaron
004-16-4077 jane anne doe
172-30-6069 michael peters
059-85-1062 leroy baker

David Reiss

unread,

Jan 2, 2010, 5:03:42 AM1/2/10

to

In[28]:= data = Import["/Users/dreiss/Desktop/data.csv"]

Out[28]= {{"025-60-4044 ", " joe average "}, {"004-16-4077 ",
" jane doe "}, {"014-27-9076 ",
" mike smith "}, {"098-43-2098 ",
" rodolfo pilas "}, {"073-15-6005 ",
" gustavo boksar "}, {"004-16-4077 ",
" jane a. doe "}, {"147-79-9074 ",
" bea busaniche "}, {"165-63-0189 ",
" pablo medrano "}, {"124-96-7092 ",
" jeff aaron "}, {"004-16-4077 ",
" jane anne doe "}, {"172-30-6069 ",
" michael peters "}, {"059-85-1062 ", " leroy baker "}}

In[31]:= First /@ GatherBy[data, First] // Length

Out[31]= 10

Adriano Pascoletti

unread,

Jan 2, 2010, 5:04:37 AM1/2/10

to

Copy and paste from your message to the string data, then compute

First /@ GatherBy[ImportString[data, "Lines"], StringTake[#, 11] &]

data = "025-60-4044 joe average

004-16-4077 jane doe
014-27-9076 mike smith
098-43-2098 rodolfo pilas
073-15-6005 gustavo boksar
004-16-4077 jane a. doe
147-79-9074 bea busaniche
165-63-0189 pablo medrano
124-96-7092 jeff aaron
004-16-4077 jane anne doe
172-30-6069 michael peters

059-85-1062 leroy baker";

First /@ GatherBy[ImportString[data, "Lines"], StringTake[#1, 11] &]

{"025-60-4044 joe average", "004-16-4077 jane doe",
"014-27-9076 mike smith", "098-43-2098 rodolfo pilas",
"073-15-6005 gustavo boksar", "147-79-9074 bea busaniche",
"165-63-0189 pablo medrano", "124-96-7092 jeff aaron",
"172-30-6069 michael peters", "059-85-1062 leroy baker"}

Adriano Pascoletti

2010/1/1 Nicholas Kormanik <nkor...@gmail.com>

Bill Rowe

unread,

Jan 2, 2010, 5:05:10 AM1/2/10

to

On 1/1/10 at 5:37 AM, nkor...@gmail.com (Nicholas Kormanik) wrote:

>There are 12 records in this mini database. Two columns. First
>column are social security numbers. Second column are names.
>Unfortunately Jane Doe appears three times, with three different
>versions of her name, but having the same social security number.

>Challenge: Remove the duplicates, where social security is the
>same, and keep any one of the names. Final result will be whittled
>down to 10 records.

>(Real life problem has 6.5 million records, and lots of duplicates,
>with various versions of names.)

>025-60-4044 joe average
>004-16-4077 jane doe
>014-27-9076 mike smith
>098-43-2098 rodolfo pilas
>073-15-6005 gustavo boksar
>004-16-4077 jane a. doe
>147-79-9074 bea busaniche
>165-63-0189 pablo medrano
>124-96-7092 jeff aaron
>004-16-4077 jane anne doe
>172-30-6069 michael peters
>059-85-1062 leroy baker

=46irst, transform the database into a Mathematica list. This can
probably be easily accomplished by reading in the data using
Import. I say probably since you have provided no details of the
file format. In any case once the database is put into the form
of a Mathematica list such as:

data=
{

{"025-60-4044","joe average"},
{"004-16-4077","jane doe"},
{"014-27-9076","mike smith"},
{"098-43-2098","rodolfo pilas"},
{"073-15-6005","gustavo boksar"},
{"004-16-4077","jane a. doe"},
{"147-79-9074","bea busaniche"},
{"165-63-0189","pablo medrano"},
{"124-96-7092","jeff aaron"},
{"004-16-4077","jane anne doe"},
{"172-30-6069","michael peters"},

{"059-85-1062","leroy baker"}};

The desired result can be achieved in one line:

In[2]:= First /@ GatherBy[data, First]

Out[2]= {{"025-60-4044", "joe average"}, {"004-16-4077",

"jane doe"}, {"014-27-9076", "mike smith"}, {"098-43-2098",

"rodolfo pilas"}, {"073-15-6005", "gustavo boksar"}, {"147-79-9074",

"bea busaniche"}, {"165-63-0189", "pablo medrano"}, {"124-96-7092",

"jeff aaron"}, {"172-30-6069", "michael peters"}, {"059-85-1062",
"leroy baker"}}

Here, I've assumed you are using version 7 of Mathematica. If
not, you will need to sort the data base by the first item,
break them into sets using Split then take the first. Something like:

In[3]:= First /@
Split[data[[Ordering[First /@ data]]],
StringMatchQ[#1[[1]], #2[[1]]] &]

Out[3]= {{"004-16-4077", "jane doe"}, {"014-27-9076",
"mike smith"}, {"025-60-4044", "joe average"}, {"059-85-1062",
"leroy baker"}, {"073-15-6005", "gustavo boksar"}, {"098-43-2098",
"rodolfo pilas"}, {"124-96-7092", "jeff aaron"}, {"147-79-9074",
"bea busaniche"}, {"165-63-0189", "pablo medrano"}, {"172-30-6069",
"michael peters"}}

dh

unread,

Jan 2, 2010, 5:05:21 AM1/2/10

to

Hi Nicholas,
you may achieve this using Union with SameTest:
dat = {

{"025-60-4044" , "joe average"},
{"004-16-4077 ", "jane doe"},
{"014-27-9076" , " mike smith"},
{"098-43-2098 " , "rodolfo pilas"},
{"073-15-6005 " , "gustavo boksar"},

{"004-16-4077 " , "jane a.doe"},

{"147-79-9074 " , "bea busaniche"},
{"165-63-0189 " , "pablo medrano"},
{"124-96-7092 " , "jeff aaron"},
{"004-16-4077 " , "jane anne doe"},
{"172-30-6069" , " michael peters"},
{"059-85-1062" , " leroy baker"}
};

Union[dat, SameTest -> (#1[[1]] == #2[[1]] &)]

Daniel

DrMajorBob

unread,

Jan 2, 2010, 5:06:49 AM1/2/10

to

That's a trivial version of the REAL problem. Suppose the same social
occurs for names that aren't similar. Further, suppose that when the
social is the same and names ARE similar, each data record includes useful
information the others do not. One has a telephone number, another has an
address, etc. Suppose the (approximately) same name and address sometimes
appears with different socials and phone numbers. Suppose all these
problems occur frequently.

Good luck with that!

For the simpler problem you've stated, this does the trick:

data = {{"025-60-4044", "joe average"},

{"004-16-4077", "jane doe"},
{"014-27-9076", "mike smith"},
{"098-43-2098", "rodolfo pilas"},
{"073-15-6005", "gustavo boksar"},
{"004-16-4077", "jane a.doe"},
{"147-79-9074", "bea busaniche"},
{"165-63-0189", "pablo medrano"},
{"124-96-7092", "jeff aaron"},
{"004-16-4077", "jane anne doe"},
{"172-30-6069", "michael peters"},
{"059-85-1062", "leroy baker"}};

SplitBy[Sort@data, First][[All, 1]]

{{"004-16-4077", "jane a.doe"}, {"014-27-9076",
"mike smith"}, {"025-60-4044", "joe average"}, {"059-85-1062",
"leroy baker"}, {"073-15-6005", "gustavo boksar"}, {"098-43-2098",
"rodolfo pilas"}, {"124-96-7092", "jeff aaron"}, {"147-79-9074",
"bea busaniche"}, {"165-63-0189", "pablo medrano"}, {"172-30-6069",
"michael peters"}}

Bobby

On Fri, 01 Jan 2010 04:37:54 -0600, Nicholas Kormanik
<nkor...@gmail.com> wrote:

>
> There are 12 records in this mini database. Two columns. First
> column are social security numbers. Second column are names.
> Unfortunately Jane Doe appears three times, with three different
> versions of her name, but having the same social security number.
>
> Challenge: Remove the duplicates, where social security is the same,
> and keep any one of the names. Final result will be whittled down to
> 10 records.
>
> (Real life problem has 6.5 million records, and lots of duplicates,
> with various versions of names.)
>
>
> 025-60-4044 joe average
> 004-16-4077 jane doe
> 014-27-9076 mike smith
> 098-43-2098 rodolfo pilas
> 073-15-6005 gustavo boksar
> 004-16-4077 jane a. doe
> 147-79-9074 bea busaniche
> 165-63-0189 pablo medrano
> 124-96-7092 jeff aaron
> 004-16-4077 jane anne doe
> 172-30-6069 michael peters
> 059-85-1062 leroy baker
>
>
>
>

--
DrMaj...@yahoo.com

Bob Hanlon

unread,

Jan 2, 2010, 5:08:27 AM1/2/10

to

Your data must be read in as strings

data = {{"025-60-4044", "joe average"},
{"004-16-4077", "jane doe"},
{"014-27-9076", "mike smith"},
{"098-43-2098", "rodolfo pilas"},
{"073-15-6005", "gustavo boksar"},
{"004-16-4077", "jane a.doe"},
{"147-79-9074", "bea busaniche"},
{"165-63-0189", "pablo medrano"},
{"124-96-7092", "jeff aaron"},
{"004-16-4077", "jane anne doe"},
{"172-30-6069", "michael peters"},
{"059-85-1062", "leroy baker"}};

Here as some of the ways

Union[data, SameTest -> (#1[[1]] == #2[[1]] &)]

DeleteDuplicates[data, #1[[1]] == #2[[1]] &]

First /@ GatherBy[data, First]

First /@ SplitBy[SortBy[data, First], First]

Bob Hanlon

---- Nicholas Kormanik <nkor...@gmail.com> wrote:

=============

Leonid Shifrin

unread,

Jan 2, 2010, 5:05:43 AM1/2/10

to

Hi Nicholas,

given your table as a nested list of two columns:

In[1]:=
ssnNames =
{{"025-60-4044", "004-16-4077", "014-27-9076", "098-43-2098",
"073-15-6005", "004-16-4077", "147-79-9074", "165-63-0189",
"124-96-7092", "004-16-4077", "172-30-6069",
"059-85-1062"}, {"joe average", "jane doe", "mike smith",
"rodolfo pilas", "gustavo boksar", "jane a.doe", "bea busaniche",
"pablo medrano", "jeff aaron", "jane anne doe", "michael peters",
"leroy baker"}};

The simplest is probably to use GatherBy and pick the first element in each
of the generated sublists with the same ssn-s (since you don't care which
name to choose for the same ssn)

In[2]:= Transpose[GatherBy[Transpose@ssnNames, First][[All, 1]]]

Out[2]= {{"025-60-4044", "004-16-4077", "014-27-9076", "098-43-2098",
"073-15-6005", "147-79-9074", "165-63-0189", "124-96-7092",
"172-30-6069", "059-85-1062"}, {"joe average", "jane doe",
"mike smith", "rodolfo pilas", "gustavo boksar", "bea busaniche",
"pablo medrano", "jeff aaron", "michael peters", "leroy baker"}}

Regards,
Leonid

Pratim Vakish

unread,

Feb 18, 2010, 5:17:30 AM2/18/10

to

Hello,

I am trying to solve a nonlinear optimization problem with two decision variables.
I would like to generate a graphical display of the feasible set defined by a nonlinear constraint of the problem and to analyze the convexity (or not) of the feasible set.

Could anybody help me for this?

Regards,

Pratim

Pratim Vakish

unread,

Feb 19, 2010, 3:32:46 AM2/19/10

to

Hello,

I am trying to solve a nonlinear optimization problem with two decision variables with Mathematica.

Bill Rowe

unread,

Feb 19, 2010, 3:35:04 AM2/19/10

to

On 2/18/10 at 5:17 AM, prati...@hotmail.com (Pratim Vakish) wrote:

>I am trying to solve a nonlinear optimization problem with two

>decision variables. I would like to generate a graphical display of

>the feasible set defined by a nonlinear constraint of the problem
>and to analyze the convexity (or not) of the feasible set.

>Could anybody help me for this?

ContourPlot would likely be a good choice for producing the
desired graphic. If you need more help, you are going to have to
provide more detail of the problem you are trying to solve.