Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

How to cluster data

0 views
Skip to first unread message

Luca

unread,
Dec 22, 2009, 6:15:07 AM12/22/09
to
Dear all, excuse me for this maybe "too simple" question..

I am trying to find a software/algorythm that can

"cluster" simple data on an excel sheet

Example:
Variable a Variable b Variable c
Case 1 1 0 0
Case 2 0 1 1
Case 3 1 0 0
Case 4 1 1 0
Case 5 0 1 1


The systems recognizes that there are 3 possible clusters:

the first with cases that has Variable a as true,
the second has Variables b and c
the third is "all the rest"

Variable a Variable b Variable c
Case 1 1 0 0
Case 3 1 0 0

Case 2 0 1 1
Case 5 0 1 1

Case 4 1 1 0

Thank you in advance

C. (http://symcbean.blogspot.com/)

unread,
Dec 22, 2009, 6:48:20 AM12/22/09
to

This is a PHP newsgroup - so algorithm design for MS Excel is way off
topic.

Or do you mean that you want to create a spreadsheet using PHP, and
you want the output formatted in a particular way? (thats 2 seperate
problems - which one can't you solve?)

C.

Taliesin Nuin

unread,
Dec 23, 2009, 9:58:45 AM12/23/09
to

You really don't give enough detail to give a full answer. The reason
being that depending on what you want this for, there are better or
worse ways of doing this. For example, if you are just regarding the
Excel spreadsheet as a one-time source data and you wont be referring to
it again, and if you'll be using this data on an ongoing basis, then for
all my data sorting needs, I stick things in SQL and then everything
becomes simple (to me ;). Will the data always be 0 or 1? If it is, then
I can do some nice things with binary arithmetic to group the rows. But
anyway, answering exactly the question that you asked, here is one
algorithm that will do what you want (I'll do it in PHP since this is
the PHP group):

<?php
$source_data = array(
array('identifier' => 'Case 1', 'a' => 1, 'b' => 0, 'c' => 0),
array('identifier' => 'Case 2', 'a' => 0, 'b' => 1, 'c' => 1),
array('identifier' => 'Case 3', 'a' => 1, 'b' => 0, 'c' => 0),
array('identifier' => 'Case 4', 'a' => 1, 'b' => 1, 'c' => 0),
array('identifier' => 'Case 5', 'a' => 0, 'b' => 1, 'c' => 1));

$clustered_data = array();

foreach($source_data as $row)
{
$key = $row['a'] . $row['b'] . $row['c'];
$clustered_data[$key][] = $row;
}

var_Dump($clustered_data);
?>

Now this may well not be perfect for you. I don't know the wider context
of what you're trying. However, it will give you your clusters.
Hopefully you can amend this to get the format that you want, e.g. you
might want a more useful key for the array. You probably don't want to
store the whole duplicate row, instead just amending it to store the
'identifier' part so your $clustered_data just lists the key and the
'identifier' field. I'm not going to bother getting into these details
without knowing more about your specific context.

I'll mention two likely problems however, depending on what you're
actually trying to do in practice. Firstly, you would also have a
problem if you had other than just 1s and 0s in your table because of
the risk of key collisions: e.g. 11 in variable a and 0 in variable b
looks the same as 1 in variable a and 10 in variable b when you stick
them together as strings. There are easy ways around that now you have
the algorithm. THe other likely scenario is where you have a variable
number of columns in each scenario. In this case, you'll probably need
to have a pre-scan of the data first to work out the number of columns
and generate your key in a more dynamic manner.

Hope all this helps. Please post back to comment.

Taliesin Nuin.

Taliesin Nuin

unread,
Dec 23, 2009, 10:00:48 AM12/23/09
to

By the way, it's not necessarily a "too simple" question. I had a
similar problem recently and had to work it out for myself. Like many
things, it's obvious in retrospect.

Taliesin.

0 new messages