Clustering output depends on order of barcodes

msma...@gmail.com

Dec 10, 2018, 9:16:53 AM

to Bartender

Hi Lu,

Is it expected that the output of the clustering step may depend on the order of barcodes in the input file? I originally stumbled on this in my real data, but as a test, I generated a list of 1e6 random barcodes of length 15, and I ran them through bartender's clustering with the barcodes shuffled in two different ways. In one case bartender identified 682381 clusters, and in the other 682379 clusters. Certainly a small difference, but I'm wondering if this is expected. I'm guessing that the problem is a few barcodes that are equidistant to multiple clusters, and which clusters those barcodes get merged into depends on which ones are considered first. I'd appreciate it if you could clarify this for me.
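For anyone who wants to reproduce this kind of test, here is a minimal sketch of how the input could be prepared. The function names `random_barcodes` and `shuffled` are made up for illustration, the example uses 1,000 barcodes instead of 1e6 to keep it quick, and it does not write bartender's input format or call bartender itself.

```python
import random

def random_barcodes(n, length, seed=0):
    """Return n random barcodes of the given length over the alphabet ACGT."""
    rng = random.Random(seed)
    return ["".join(rng.choice("ACGT") for _ in range(length)) for _ in range(n)]

def shuffled(items, seed):
    """Return a shuffled copy of items; the input list is left unchanged."""
    out = list(items)
    random.Random(seed).shuffle(out)
    return out

# Same barcodes, two different orders (hypothetical small-scale version of the test).
barcodes = random_barcodes(1000, 15)
run_a = shuffled(barcodes, seed=1)
run_b = shuffled(barcodes, seed=2)
```

Both `run_a` and `run_b` contain exactly the same barcodes, so any difference in the clustering result can only come from the input order.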

Many thanks,

Michael

赵路

Dec 13, 2018, 10:43:32 PM

to Bartender

---------- Forwarded message ---------
From: 赵路 <luzha...@gmail.com>
Date: Thu, Dec 13, 2018 at 10:43 PM
Subject: Re: Clustering output depends on order of barcodes
To: Michael Manhart <msma...@gmail.com>

Hi Michael,

Thanks for reporting this behavior. Based on your description and the current implementation, I'm not sure what the real cause of this small variation is in your case. It could come from several sources. For example, the underlying data structure APIs may not have point or value stability (the code should be examined, and any such dependence removed, if this exists). Also, the greedy clustering algorithm in each bucket is not invariant to the order of the sequences, mostly for the reason you pointed out: there exist barcodes that are equidistant to multiple clusters. To be frank, I'm not surprised by this behavior. Even classic clustering algorithms, such as k-means, can give different results for different initial states. In your experiment, the difference in the number of clusters is very small. Could you share with me the size distribution of the clusters that only show up in one setting? I suspect that most of them are very small.
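To make the equidistance point concrete, here is a toy sketch in Python. It is not bartender's actual algorithm: `greedy_cluster`, the Hamming threshold of 1, and the three example barcodes are all assumptions chosen only to show how a first-come greedy assignment can depend on input order, and how one might then tabulate the sizes of clusters that appear in only one run.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def greedy_cluster(barcodes, max_dist=1):
    """Toy greedy clustering (illustrative, not bartender's implementation).

    Each barcode joins the first existing center within max_dist;
    otherwise it becomes the center of a new cluster.
    """
    clusters = {}  # center -> list of member barcodes
    for bc in barcodes:
        for center in clusters:
            if hamming(bc, center) <= max_dist:
                clusters[center].append(bc)
                break
        else:
            clusters[bc] = [bc]
    return clusters

# AAAT is one mismatch away from both AAAA and AATT,
# while AAAA and AATT are two mismatches apart.
order_a = ["AAAA", "AATT", "AAAT"]
order_b = ["AAAT", "AAAA", "AATT"]

clusters_a = greedy_cluster(order_a)
clusters_b = greedy_cluster(order_b)

# Size distribution of clusters whose center appears in only one run.
only_a = Counter(len(clusters_a[c]) for c in clusters_a.keys() - clusters_b.keys())
only_b = Counter(len(clusters_b[c]) for c in clusters_b.keys() - clusters_a.keys())

print(len(clusters_a), len(clusters_b))
print(dict(only_a), dict(only_b))
```

With the first order, AAAA and AATT are both seen before AAAT and each founds its own cluster, so there are two clusters. With the second order, AAAT is seen first and absorbs both neighbors, so there is one.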

Best,
