[Haskell-cafe] FGL/Haskell and Hierarchical Clustering/dendograms

Nikolas Borrel-Jensen

unread,

Dec 23, 2009, 5:57:58 AM12/23/09

to haskel...@haskell.org, andr...@gmail.com

Hi! I have some trouble implementing single-linkage clustering algorithm by
using a minimum-spanning tree, so I would appreciate if some of you could
give me some advise.

I am implementing a single-linkage clustering algorithm, and my approach is
to use minimum spanning trees for that task. I am using the library FGL (
http://web.engr.oregonstate.edu/~erwig/fgl/haskell/), and I have managed to
compute a minimum spanning tree from an arbitrary fully connected graph with
5 nodes. I get [ [(4,0) ] , [ (3,1) , (4,0) ] , [ (1,1) , (3,1) , (4,0) ] ,
[ (2,3) , (4,0) ] , [ (5,12) , (2,3) , (4,0) ] ], which is the root path
tree of the minimum spanning tree created by the function msTreeAt.

>From that I would create a dendrogram. [ (1,1) , (3,1) , (4,0) ] is telling
that node 1,3 and 4 has the same cost, namely cost 1. Therefore these are
merged at level 1. At level 1 we now have 3 clusters: (1,3,4), 2 and 5. Now
the second lowest should be merged, that is 2 and 4. BUT because 4 is
already merged in the cluster (1,3,4), we should merge (1,3,4) and 2 at
level 3 (because the cost is 3). Now at level 3 we have 2 clusters,
(1,2,3,4) and 5. Now we merge the last one at level 12: (1,2,3,4,5), and we
are finished.

I have very hard to see, how this could be done efficiently without pointers
(as in C). I have thought of just saving the nodes from the start of the
root path, and traversing it, but a lot of searching should be done all the
time.

Can you please give me some advise on that?

Kind regards

Nikolas Borrel-Jensen
Computer Science
University Of Copenhagen

Ketil Malde

unread,

Dec 23, 2009, 6:53:39 AM12/23/09

to Nikolas Borrel-Jensen, andr...@gmail.com, haskel...@haskell.org

Nikolas Borrel-Jensen <nikola...@gmail.com> writes:

> I have very hard to see, how this could be done efficiently without pointers
> (as in C). I have thought of just saving the nodes from the start of the
> root path, and traversing it, but a lot of searching should be done all the
> time.

I must admit I didn't follow your examples. But when I implemented
single linkage clustering, I maintained a list of current clusters.
Each cluster held a Set of its nodes, and traversing the list of edges
from least cost to greatest, the clusters containing the end points of
each edge was identified, and, if different, merged.

It's probably possible to do it more efficiently, but I don't think it's
too bad.

-k
--
If I haven't seen further, it is by standing in the footprints of giants
_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Matt Morrow

unread,

Dec 23, 2009, 11:53:24 PM12/23/09

to Nikolas Borrel-Jensen, andr...@gmail.com, haskel...@haskell.org

Hi Nikolas,

Interesting problem. I'd do something like the following, where
the initial spanning tree from you example (re-tree-ified) is:

{-
ghci> :t t
t :: Tree (Id, Cost)
g
ghci> ppT t
(4,0)
|
+- (3,1)
| |
| `- (1,1)
|
`- (2,3)
|
`- (5,12)
-}

and which results in the tree:

{-
ghci> let s = agglom fst snd t
ghci> :t s
s :: Tree (Cost, [Id])
ghci> ppT s
(0,[4])
|
+- (1,[3,1])
|
`- (3,[2])
|
`- (12,[5])
-}

which can then be flattened/etc as needed by further steps of the algo.

The code for `agglom':

-----------------------------------------------------------------------------
import Data.Tree
import Data.List

type Id = Int
type Cost = Int

t :: Tree (Id,Cost)
t = Node (4,0)
[Node (3,1) [Node (1,1) []]
,Node (2,3) [Node (5,12) []]]

ppT :: Show a => Tree a -> IO ()
ppT = putStrLn . drawTree . fmap show

-- | Compress the incoming @Tree a@ with @accumEq@.
agglom :: Eq cost
=> (a -> b)
-> (a -> cost)
-> Tree a -> Tree (cost,[b])
agglom proj cost = go
where accum = accumEq proj cost
go (Node a []) = Node (cost a,[proj a]) []
go (Node a ts) = let b = proj a
c = cost a
(bs,ss) = accum c ts
in Node (c,b:bs) (fmap go ss)

-- | Repeatedly @splitEq@, and return a pair
-- whose /first/ element is a list of the projected
-- @b@s from those root values along paths from
-- the roots of the trees in the incoming @[Tree a]@
-- which have @cost@ equal to the third function parameter,
-- and whose /second/ element is the (concatenation of the)
-- list(s) gotten from each of the @splitEq@ calls.
accumEq :: Eq cost
=> (a -> b)
-> (a -> cost) -> cost
-> [Tree a] -> ([b],[Tree a])
accumEq proj cost c = go [] []
where split ts = splitEq proj cost c ts
go xs ys [] = (xs,ys)
go xs ys ts = let (eqs,neqs) = split ts
in case eqs of
[]-> ([],ts)
_ -> let (bs,tss) = unzip eqs
in go (bs++xs)
(neqs++ys)
(concat tss)

-- | Split the incoming trees into
-- (1) a @[(b,Tree a)]@ of each @b@ is the
-- @proj@ected value from an @a@ where
-- the @cost@ of that @a@ is equal to
-- the third function parameter, and (2)
-- the members of the incoming @[Tree a]@
-- whose roots' costs are /not/ equal to
-- the third function parameter.
splitEq :: Eq cost
=> (a -> b)
-> (a -> cost) -> cost
-> [Tree a] -> ([(b,[Tree a])],[Tree a])
splitEq proj cost c = foldl' go ([],[])
where go (!eqs,!neqs)
t@(Node a ts)
| c==cost a = ((proj a,ts):eqs,neqs)
| otherwise = (eqs,t:neqs)
-----------------------------------------------------------------------------

Cheers,
Matt

Matt Morrow

unread,

Dec 24, 2009, 12:31:46 AM12/24/09

to Nikolas Borrel-Jensen, andr...@gmail.com, haskel...@haskell.org

For completeness, you might then do the actual clustering something like:

------------------------------------------------------------------------
import Data.Tree
import Data.List
import Data.Function

-- ... code from before ...

cluster :: Ord cost

=> (a -> b)
-> (a -> cost)

-> Tree a -> Cluster (cost,[b])
cluster proj cost t =
-- List can't be empty since Tree can't.
let o:os = sortBy (compare `on` fst)
. flatten
. agglom proj cost
$ t
in foldl' cons (One o) os

data Cluster a
= One a
| Many [Cluster a]
deriving(Eq,Ord,Read,Show)

instance Functor Cluster where
fmap f (One a) = One (f a)
fmap f (Many cs) = Many ((fmap . fmap) f cs)

cons :: Cluster a -> a -> Cluster a
cons c a = Many [c,One a]

{-
ghci> let c = cluster fst snd t

ghci> :t c
c :: Cluster (Cost, [Id])
ghci> c
Many [Many [Many [One (0,[4]),One (1,[3,1])],One (3,[2])],One (12,[5])]

ghci> :t fmap snd c
fmap snd c :: Cluster [Id]
ghci> fmap snd c
Many [Many [Many [One [4],One [3,1]],One [2]],One [5]]

ghci> :t fmap fst c
fmap fst c :: Cluster Cost
ghci> fmap fst c
Many [Many [Many [One 0,One 1],One 3],One 12]
-}
-------------------------------------------------------------------------------

Matt

Nikolas Borrel-Jensen

unread,

Dec 28, 2009, 8:15:50 PM12/28/09

to Matt Morrow, andr...@gmail.com, haskel...@haskell.org

Thank you very much for your reply! I have been looking at the code, and
there are two problems, as I can see. First, trying with the example

t1 :: Tree (Id, Cost)
t1 = Node (4,0)
[Node (3,2) [Node (1,12) []]
,Node (2,3) [Node (5,1) [Node (6,2) [Node (7,2) [] ]]]]

printed as

(4,0)
|
+- (3,2)
| |
| `- (1,12)
|
`- (2,3)
|
`- (5,1)
|
`- (6,2)
|
`- (7,2)

your function 'cluster fst snd t1' returns

Many [Many [Many [Many [Many [One (0,[4]),One (1,[5])],One (2,[3])],One
(2,[6,7])],One (3,[2])],One (12,[1])]

I can't see how this representation is giving the hierarchical clusters. The
example above should resolve into

level 1: [[(2,3),(5,1)],[(6,2)],[(7,2)],[(4,0)], [(3,2)], [(1,12)]]

level 2: [[(2,3),(5,1),(6,2),(7,2)], [(4,0),(3,2)], [(1,12)]]

level 3: [[(2,3),(5,1),(6,2),(7,2),(4,0),(3,2)], [(1,12)]]

level 4 (or (cost) level 12): [[(2,3),(5,1),(6,2),(7,2),(4,0),(3,2),(1,12)]]

By doing it this way, we cluster all nodes connected with edges less than or
equal x at (cost) level x. Clearly, we can have level 1:
[[(1,1),(2,1)],[(3,1),(4,1)],...] if the edges between [(1,1),(2,1)] and
[(3,1),(4,1)] are greater than 1.

Second, I don't think it is trivial to tree-i-fy the root path tree. I have
done the function treeifyMST, which surely isn't efficient, since the list
encounteredNodes is traversed as many times as the number of nodes (a binary
search tree would be more efficient). But more important, the tree isn't
correct, since each path is connected at the root of the tree.

Example (LRTree Int): [ [(1,0)],[(5,1),(1,0) ], [(2,2),(1,0)] ,
[(3,3),(2,2),(1,0)] , [(4,4),(2,2),(1,0)] ] -> [ [(5,1),(1,0) ] ,
[(3,3),(2,2),(1,0)] , [(4,4),(2,2),(1,0)] ]

In my code, all 3 paths are branching at the root (1,0), but should for the
last two paths branch at node (2,2). How should I cope with that in an
efficient way?

I wonder if if it is easier to implement it from the ground using the
approach given at
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html?

---------------------------------------------------------------------
--TO DO: now all paths are connected at the root of the tree. Should be
patched at the right places inside the tree. The search in the list
encounteredNodes is not efficient.
treeifyMST :: LRTree Int -> Tree (Id,Cost)
treeifyMST rootpathtree =
let
(LP rpt:rpts) = rootpathtree
root = head rpt
revrootpathtree = reverse rootpathtree
in
Node root (constructTree [] revrootpathtree)
where
constructTree :: [Int] -> LRTree Int -> [Tree (Id,Cost)]
constructTree encounteredNodes (LP x:[]) = []
constructTree encounteredNodes (LP x:xs) =
let
path1 = x !! 0
path2 = x !! 1
id1 = fst path1
id2 = fst path2
in
case (L.find (==id1) encounteredNodes) of
-- because we have encountered an already processed id, we can skip this
sublist
Just _ -> constructTree (id2:encounteredNodes) xs
-- new id, meaning that we have encountered a new path
Nothing -> let
lenpath = length x
revpath = reverse $ take (lenpath-1) x
tree = listToNode revpath
in
tree:constructTree (id2:encounteredNodes) xs
constructTree _ _ = []

listToNode (p:ps:[]) = Node p [Node ps []]
listToNode (p:ps) = Node p [listToNode ps]
---------------------------------------------------------------------

Nikolas