Merging barcodes across time points


msma...@gmail.com

Mar 8, 2018, 3:04:15 PM3/8/18
to Bartender
Hi Lu,

I have a few more questions about how bartender clusters across time points --- in particular, the frequencies of my clustered barcodes are fluctuating a lot between time points, including some barcodes that appear only at a single intermediate time point but are absent at all other time points.  I'm wondering if it's some artifact of the clustering. 

First, when I use bartender_combiner_com, should I list input files forward in time or backward in time?  I was a bit confused about whether it should be forward or backward, since the documentation on github says it should be forward, but the algorithm seems to process them in reverse order, in which case it is considering the least-diverse set of barcodes (the last time point) first.  Shouldn't the algorithm consider the first time point first, and then add unmatched clusters from t to the list at t + 1 (with zero reads at that time point)?

Related, I'm wondering if it would be better to simply pool the reads from all time points together, cluster them, and then determine how many reads from each cluster came from each time point, rather than clustering at each time point separately and then matching.

Thank you very much for all your help!

Michael

赵路

Mar 25, 2018, 1:36:27 AM3/25/18
to Michael Manhart, Bartender
On Thu, Mar 8, 2018 at 12:04 PM, <msma...@gmail.com> wrote:
Hi Lu,

I have a few more questions about how bartender clusters across time points --- in particular, the frequencies of my clustered barcodes are fluctuating a lot between time points, including some barcodes that appear only at a single intermediate time point but are absent at all other time points.  I'm wondering if it's some artifact of the clustering. 
Bartender will keep all clusters whose frequencies are larger than a certain threshold, even if a cluster does not appear at other time points. You can set the frequency cutoff with the "-c" option. These abnormal clusters might be due to clustering errors or might be present in the original data. To better understand the clustering result, it is necessary to check these abnormal clusters case by case. Unfortunately, there is no rule for handling these abnormal clusters that is general enough for all kinds of experiments.

First, when I use bartender_combiner_com, should I list input files forward in time or backward in time?  I was a bit confused about whether it should be forward or backward, since the documentation on github says it should be forward, but the algorithm seems to process them in reverse order, in which case it is considering the least-diverse set of barcodes (the last time point) first.  Shouldn't the algorithm consider the first time point first, and then add unmatched clusters from t to the list at t + 1 (with zero reads at that time point)?
From my observation, merging from the last time point or the first does not matter much in terms of accuracy. The reason I chose to merge from the back is that it is easier to implement and felt more intuitive to me.

Related, I'm wondering if it would be better to simply pool the reads from all time points together, cluster them, and then determine how many reads from each cluster came from each time point, rather than clustering at each time point separately and then matching.
This is a great idea. It should theoretically have better accuracy, although it is more computationally expensive and introduces extra complications to the clustering process. I believe you can easily leverage Bartender to achieve this. Here is one idea: for each read (putative barcode), add the time-point information to its UMI and then do the clustering. A simple post-clustering script can then identify the frequency contribution from each time point. Please let me know your thoughts.
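A rough sketch of the idea (the input layout and the function name here are just illustrative, not a fixed Bartender format):

```python
# Before pooling reads from all time points into one clustering run, prefix
# each UMI with its time point so the origin of every read survives clustering.

def tag_umis(reads_by_timepoint):
    """reads_by_timepoint: {"t0": [(barcode, umi), ...], ...}
    Returns pooled (barcode, tagged_umi) pairs ready for one clustering run."""
    pooled = []
    for timepoint, reads in reads_by_timepoint.items():
        for barcode, umi in reads:
            pooled.append((barcode, f"{timepoint}_{umi}"))  # UMI "1" at t0 -> "t0_1"
    return pooled

pooled = tag_umis({"t0": [("AAAAAAA", "1"), ("AAACCCC", "2")],
                   "t1": [("AAAAAAA", "1")]})
```

After clustering the pooled reads once, the time point of every member read can be recovered from the tagged UMI.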

Thank you very much for all your help!

Michael

--
You received this message because you are subscribed to the Google Groups "Bartender" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bartenderRandomBarcode+unsubscri...@googlegroups.com.
To post to this group, send email to bartenderRandomBarcode@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bartenderRandomBarcode/0675e068-959b-4621-9286-c51be25270bf%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Sincerely,
 
Lu

赵路

Apr 3, 2018, 12:33:42 PM4/3/18
to Michael Manhart, Bartender
Hey Michael,

There is one output file with the suffix "barcode.csv" in the clustering result. Below are the steps to split each cluster's size across time points:
1. Find all raw barcodes that belong to each cluster using the clustering result in "***barcode.csv", and find all UMIs that belong to each cluster using the input file.
2. For each time point, pool all UMIs that belong to one cluster and count the number of unique UMIs in the pool. This count is the cluster's size at that time point.

Here is an example.

Raw input:
AAAAAAA,t0_1
AAAAAAA,t0_2
AAACCCC,t0_2
AAAAAAA,t1_1
AAACCCC,t1_2

Cluster output:
center, size, cluster.id
AAACCCC, 2, 1
AAAAAAA, 3, 2

barcode.csv file:
raw_barcode, freq, cluster.id
AAAAAAA, 3, 2
AAACCCC, 2, 1

Calculate the number of unique UMIs at each time point. In this example, cluster 1 has one unique UMI at each time point (t0_2 at t0, t1_2 at t1). Cluster 2 has two unique UMIs (t0_1, t0_2) at t0 and one unique UMI (t1_1) at t1. So you will get the following result:
cluster.id, t0, t1
1, 1, 1
2, 2, 1
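In code, the two steps look roughly like this (the in-memory structures stand in for the parsed files; all names are illustrative):

```python
# Map each raw barcode to its cluster (from the barcode.csv assignments),
# then count unique UMIs per (cluster, time point) from the raw input.
from collections import defaultdict

def per_timepoint_sizes(raw_input, barcode_to_cluster):
    """raw_input: (barcode, umi) pairs with UMIs like "t0_1".
    barcode_to_cluster: raw barcode -> cluster.id, from the barcode.csv file."""
    umis = defaultdict(set)              # (cluster.id, time point) -> UMI set
    for barcode, umi in raw_input:
        timepoint = umi.split("_")[0]    # "t0_1" -> "t0"
        umis[(barcode_to_cluster[barcode], timepoint)].add(umi)
    return {key: len(pool) for key, pool in umis.items()}

sizes = per_timepoint_sizes(
    [("AAAAAAA", "t0_1"), ("AAAAAAA", "t0_2"), ("AAACCCC", "t0_2"),
     ("AAAAAAA", "t1_1"), ("AAACCCC", "t1_2")],
    {"AAACCCC": 1, "AAAAAAA": 2})
# sizes reproduces the per-time-point counts worked out above
```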

Hopefully I explained it clearly.

Lu



On Mon, Apr 2, 2018 at 2:36 PM, <msma...@gmail.com> wrote:
Thanks, Lu.  Using the UMI feature to map pooled reads back to their original time points is exactly what I was thinking, but I am a bit confused about how to do it.  For example, let Pooled_barcode.txt be my master set of barcodes pooled across time points:

AAAAAAA,t0_1
AAACCCC,t0_2
AAAAAAA,t1_1
AAACCCC,t1_2

The UMIs at the end of each line indicate both the time point (t0 or t1) and a unique ID number within that time point (1 or 2).  I then cluster this set of barcodes together.  The output file Pooled_barcode.csv contains information on how each unique read mapped to a cluster:

Unique.reads,Frequency,Cluster.ID
AAACCCC,2,1
AAAAAAA,2,2

The problem is that while I know that two pooled reads mapped to cluster 1, I don't know which reads these were.  So I suppose I can go back to the original list of pooled barcodes (Pooled_barcode.txt), go through each barcode one by one, look it up in the list of unique reads and mapped clusters (Pooled_barcode.csv), and determine the frequency of clusters at each time point.  This last part --- looking up each original barcode one by one in the list of clusters --- seems rather time-consuming; is there any other way to do it?
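Concretely, here is a sketch of the lookup I have in mind (file contents are inlined and all names are made up for illustration):

```python
# Build a barcode -> cluster table once from the Pooled_barcode.csv rows,
# then tally cluster frequencies per time point from Pooled_barcode.txt.
from collections import Counter

def timepoint_cluster_freqs(pooled_txt_lines, cluster_rows):
    """pooled_txt_lines: lines of Pooled_barcode.txt, e.g. "AAAAAAA,t0_1".
    cluster_rows: (unique_read, cluster_id) pairs from Pooled_barcode.csv."""
    cluster_of = dict(cluster_rows)      # lookup table, built in one pass
    freqs = Counter()
    for line in pooled_txt_lines:
        barcode, umi = line.split(",")
        timepoint = umi.split("_")[0]    # "t0_1" -> "t0"
        freqs[(timepoint, cluster_of[barcode])] += 1
    return freqs

freqs = timepoint_cluster_freqs(
    ["AAAAAAA,t0_1", "AAACCCC,t0_2", "AAAAAAA,t1_1", "AAACCCC,t1_2"],
    [("AAACCCC", 1), ("AAAAAAA", 2)])
```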

Many thanks,
Michael



--
Sincerely,
 
Lu