combiner tool loses barcodes appearing only at later time points?


msma...@gmail.com

May 3, 2017, 7:17:53 PM
to Bartender
Hi,

I am using the bartender_combiner tool to merge three sets of barcode counts.  However, it seems that the merged cluster data produced by this tool depends on the order of input cluster files --- I think any barcodes *not* appearing in the first listed input file (e.g., the first time point) are excluded.  For example, running

bartender_combiner_com -f data1_cluster.csv,data1_quality.csv,data2_cluster.csv,data2_quality.csv,data3_cluster.csv,data3_quality.csv -o merged_data

will exclude barcodes that appear in data2 or data3 but *not* in data1, while running

bartender_combiner_com -f data2_cluster.csv,data2_quality.csv,data1_cluster.csv,data1_quality.csv,data3_cluster.csv,data3_quality.csv -o merged_data

instead excludes barcodes from data1 and data3.

Is this indeed what is going on here, and if so, is it actually the intended behavior?  It seems to me that the merged cluster file should list all barcodes appearing at *any* time point, not just the first time point, so that way a barcode that was erroneously missed in the first measurement can still be tracked later.

Many thanks,
Michael Manhart

赵路

May 4, 2017, 1:05:01 AM
to msma...@gmail.com, Bartender
Hi Michael,

Thanks for using Bartender.

First of all, the combiner tool was originally designed to handle time-series data, and the input files should be sorted in chronological order. If the data is not a time series, each data point should at least be roughly a superset of the one that follows it.

Second, Bartender does not remove any barcode cluster by default. There is an option (-c) that sets the threshold for removing unmatched barcode clusters at each time point. The default value is 1, which means no barcode cluster is removed during merging.

Third, the combiner tool merges the input data points starting from the last input data point (file). At a high level, it merges the current data point with the previous one, then matches the merged clusters against the data point before that. It continues this process until it reaches the first input data point.
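
As a rough illustration of the merge order described above (not Bartender's actual code), the toy sketch below walks a list of data points from the last one back to the first. Exact string equality stands in for Bartender's cluster matching, and the function name merge_backward and the dict-based data layout are hypothetical.

```python
from typing import Dict, List


def merge_backward(points: List[Dict[str, int]]) -> Dict[str, List[int]]:
    """Toy merge that walks the data points from last to first.

    Each data point maps a barcode to its read count. Barcodes are matched
    by exact equality, which is a simplification of real cluster matching.
    Returns barcode -> list of counts, one entry per data point, in input order.
    """
    n = len(points)
    merged: Dict[str, List[int]] = {}
    # Start from the last data point and move toward the first one.
    for idx in range(n - 1, -1, -1):
        for barcode, count in points[idx].items():
            # A barcode not seen so far starts with zeros for every data point.
            row = merged.setdefault(barcode, [0] * n)
            row[idx] = count
    return merged


if __name__ == "__main__":
    # Made-up barcodes and counts, for illustration only.
    t1 = {"AAAA": 5, "CCCC": 2}
    t2 = {"AAAA": 7, "GGGG": 1}
    t3 = {"TTTT": 3}
    for barcode, counts in sorted(merge_backward([t1, t2, t3]).items()):
        print(barcode, counts)
```

In this toy version every barcode is kept no matter which data point it first appears in, so the result does not depend on the order of the inputs.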

May I ask how you can tell that Bartender removes barcodes?

Thank you.
Best,
Lu

--
You received this message because you are subscribed to the Google Groups "Bartender" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bartenderRandomBarcode+unsub...@googlegroups.com.
To post to this group, send email to bartenderRandomBarcode@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bartenderRandomBarcode/8bcda5ce-9eca-43fa-b3f0-5be0cc78b27b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.




msma...@gmail.com

May 8, 2017, 3:00:20 PM
to Bartender, msma...@gmail.com
Hi Lu,

Thanks for your reply.

It is true my data sets are not chronological (they are replicate sequencing runs from the same sample), and so no single data set has a superset of all barcodes observed in the other two.  I ran bartender on each data set separately, and now I just want to merge the results so I can see how similar or different the replicates are (e.g., how well the barcode frequencies in each are correlated).
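
One simple way to quantify how similar two replicates are is a Pearson correlation between their barcode frequencies. The sketch below is a minimal pure-Python version; it assumes the two count lists are already aligned by barcode (entry i in both lists refers to the same barcode), and the function names and sample counts are made up for illustration.

```python
import math
from typing import List, Sequence


def to_frequencies(counts: Sequence[int]) -> List[float]:
    """Convert raw read counts into frequencies that sum to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no reads in this data set")
    return [c / total for c in counts]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up counts for four barcodes in two replicates.
    replicate_a = [120, 40, 0, 15]
    replicate_b = [110, 55, 3, 10]
    r = pearson(to_frequencies(replicate_a), to_frequencies(replicate_b))
    print(f"Pearson r = {r:.3f}")
```

Barcodes missing from one replicate enter as zeros, which is why a merged file that silently drops barcodes would distort this kind of comparison.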

Here's how I know barcodes are being lost in the merged files from bartender_combiner_com.  First, if I count the total number of reads for each data set in the combined *_cluster.csv file (i.e., summing the counts in the last three columns over all unique barcodes), these numbers don't match the total number of reads for each data set that I obtain from the original data files (before merging).  In particular, the total read count matches *only* for whichever data set is listed first on the command line for bartender_combiner_com, while the other two read counts are always lower than they should be (indicating loss of barcodes).  I have tested this by changing the order of the data sets on the command line for bartender_combiner_com.

Second, no matter what order of input files I use, in the combined *_cluster.csv file there are many barcodes with counts of the form "1,0,0," meaning the barcode was seen in the first data set but not the other two, but never any of the forms "0,1,0" or "0,0,1."  This also indicates that the merged file misses barcodes from the second and third input files if they are not also in the first one.
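
The two checks described above can be scripted. The sketch below assumes the combined *_cluster.csv has a single header row and that the last three columns hold the per-data-set counts, as described in this message; summarize_combined is a hypothetical helper name, and the sample rows and column names are made up rather than taken from a real Bartender file.

```python
import csv
import io
from collections import Counter
from typing import List, Tuple


def summarize_combined(handle, n_sets: int = 3) -> Tuple[List[int], Counter]:
    """Return per-data-set read totals and a tally of presence patterns.

    A presence pattern is a tuple of 0/1 flags, e.g. (1, 0, 0) for a barcode
    seen only in the first data set.
    """
    reader = csv.reader(handle)
    next(reader)  # assumption: exactly one header row
    totals = [0] * n_sets
    patterns: Counter = Counter()
    for row in reader:
        if not row:
            continue
        # Assumption: the trailing n_sets columns are the per-data-set counts.
        counts = [int(value) for value in row[-n_sets:]]
        for i, count in enumerate(counts):
            totals[i] += count
        patterns[tuple(int(count > 0) for count in counts)] += 1
    return totals, patterns


if __name__ == "__main__":
    # Placeholder header and rows, for illustration only.
    sample = io.StringIO(
        "id,center,score,set1,set2,set3\n"
        "1,AAAA,0.0,4,3,0\n"
        "2,CCCC,0.0,1,0,0\n"
        "3,GGGG,0.0,2,0,5\n"
    )
    totals, patterns = summarize_combined(sample)
    print("reads per data set:", totals)
    for pattern, n_barcodes in sorted(patterns.items()):
        print(pattern, n_barcodes)
```

Comparing the returned totals with the read totals of the individual input files gives the first check; patterns such as (0, 1, 0) or (0, 0, 1) never appearing gives the second.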

Is there a way to fix this problem?

Michael




赵路

May 9, 2017, 12:47:03 AM
to Michael Manhart, Bartender
Hi Michael,

Thank you for the information. May I ask which bartender version you are using? The best version is bartender-1.1, which I'm still actively maintaining.

If you can share part of your input with me, it will be very helpful for finding the problem.
Thank you.
Best,
Lu


M. Manhart

May 9, 2017, 7:17:58 PM
to 赵路, Bartender
Hi Lu,

I am indeed using bartender-1.1.

Here are three small fastq files to illustrate the situation.  They each contain 9 reads, with no overlap in the barcodes between the three files.  I run the following commands to extract, cluster, and combine the barcodes:

bartender_extractor_com -f Test1.fastq -o Test1 -p GCCGG[14-16]TATCT
bartender_extractor_com -f Test2.fastq -o Test2 -p GCCGG[14-16]TATCT
bartender_extractor_com -f Test3.fastq -o Test3 -p GCCGG[14-16]TATCT

bartender_single_com -f Test1_barcode.txt -o Test1 -c 1
bartender_single_com -f Test2_barcode.txt -o Test2 -c 1
bartender_single_com -f Test3_barcode.txt -o Test3 -c 1

bartender_combiner_com -f Test1_cluster.csv,Test1_quality.csv,Test2_cluster.csv,Test2_quality.csv,Test3_cluster.csv,Test3_quality.csv -o Test123 -c 1

The resulting cluster file Test123_cluster.csv contains only the 8 unique barcodes from Test1, and none of the barcodes from Test2 or Test3, which are all unique from the barcodes in Test1.  For instance, the first barcode listed in Test2 is ACGATGGATCACCAT, and it does not appear in the combined cluster file.  However, if we simply reorder the input files for the combiner step, e.g., give Test2 as the first one:

bartender_combiner_com -f Test2_cluster.csv,Test2_quality.csv,Test1_cluster.csv,Test1_quality.csv,Test3_cluster.csv,Test3_quality.csv -o Test213 -c 1

Then the output is completely different, and only the barcodes from Test2 are in the combined file.  It seems like the code uses the set of barcodes listed first on the command line as the reference, and it matches barcodes in the subsequent input files to these as it merges.  But then if a barcode is not in that first file, but *is* in subsequent files, then it will not appear in the combined file.
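
The hypothesis above can be checked directly by comparing the cluster centers in each input file with those in the combined file. The sketch below assumes each *_cluster.csv has a header row with a column named Center that holds the barcode sequence; that column name is an assumption and may need adjusting. Exact string comparison also ignores any change of cluster center during merging, and apart from ACGATGGATCACCAT the sample barcodes are made up.

```python
import csv
import io
from typing import List, Set


def read_centers(handle, column: str = "Center") -> Set[str]:
    """Collect the barcode centers from a cluster CSV with a header row."""
    return {row[column] for row in csv.DictReader(handle)}


def missing_from_combined(input_centers: Set[str], combined_centers: Set[str]) -> List[str]:
    """Centers present in an input file but absent from the combined file."""
    return sorted(input_centers - combined_centers)


if __name__ == "__main__":
    # Placeholder file contents; real files would be opened with open(path, newline="").
    test2 = io.StringIO("Cluster.ID,Center\n1,ACGATGGATCACCAT\n2,TTGACCGATAGGCTA\n")
    combined = io.StringIO("Cluster.ID,Center\n1,GGCATTACGGATCCA\n2,TTGACCGATAGGCTA\n")
    lost = missing_from_combined(read_centers(test2), read_centers(combined))
    print("missing from combined file:", lost)
```

Running this once per input file shows which inputs lose barcodes for a given ordering on the command line.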

Michael


Test1.fastq
Test2.fastq
Test3.fastq

赵路

May 10, 2017, 2:04:24 AM
to M. Manhart, Bartender
Hi Michael,

Thank you for providing the test files. I tried them and Bartender works as expected. Attached is the combined result.

Could you tell me how you installed Bartender? Did you use the binary file directly, or did you build and install it by following the README?

Also, please make sure there are no spaces in the list of input files; otherwise it won't work correctly.
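
Assuming the remark about spaces refers to the comma-separated list passed to -f (an interpretation, not something confirmed in this thread), one way to avoid stray spaces is to build that argument programmatically. The helper names below are hypothetical; the file-name pattern and the -f and -o options follow the commands quoted earlier in this thread.

```python
from typing import List


def combiner_file_list(prefixes: List[str]) -> str:
    """Build the comma-separated cluster/quality file list for -f."""
    parts: List[str] = []
    for prefix in prefixes:
        parts.append(f"{prefix}_cluster.csv")
        parts.append(f"{prefix}_quality.csv")
    joined = ",".join(parts)
    # Reject any whitespace, whether it comes from a prefix or the joining.
    if any(ch.isspace() for ch in joined):
        raise ValueError("file list must not contain whitespace")
    return joined


def combiner_command(prefixes: List[str], output_prefix: str) -> List[str]:
    """Argument vector for the combiner, suitable for subprocess.run."""
    return ["bartender_combiner_com", "-f", combiner_file_list(prefixes), "-o", output_prefix]


if __name__ == "__main__":
    print(" ".join(combiner_command(["Test1", "Test2", "Test3"], "Test123")))
```

Passing the result as a list to subprocess.run keeps the whole file list as a single argument, so the shell cannot split it.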

Also, please download the latest Bartender-1.1. I have put in several fixes over the past two months.

Please let me know if you have any questions regarding the results or Bartender.
Best,
Lu
Test1_barcode.txt
Test2.fastq
Test3_barcode.csv
Test3_barcode.txt
Test3_cluster.csv
Test3_quality.csv
Test3.fastq
Test123_cluster.csv
Test1_barcode.csv
Test1_cluster.csv
Test1_quality.csv
Test1.fastq
Test2_barcode.csv
Test2_barcode.txt
Test2_cluster.csv
Test2_quality.csv

M. Manhart

May 13, 2017, 3:57:04 PM
to 赵路, Bartender
Hi Lu,

I reinstalled the latest version (built and installed according to the instructions), and that does solve this problem, at least for this test case.  However, when I run the combiner on my real data, it takes forever --- I ran it on three data sets, each with about 3e5 clusters, and it ran for about 15 hours before I just killed it.  There should be substantial albeit imperfect overlap between the barcodes in these sets.  Is the combiner normally very time-consuming, or is something going wrong?

Michael


赵路

May 13, 2017, 4:39:27 PM
to M. Manhart, Bartender
Hey M.Manhart,

It's great that you got the correct version. I tested the combiner with simulated data and real data (with about half a million barcodes). The datasets I used are all time-series data and contain at least 20 time points, and the combiner is reasonably fast (around 4-5 minutes). I believe something is going wrong, but I can't tell what without more information. Would you mind sharing the clustering log and the combiner log with me? If you can share some real data, that would be very helpful.

Thank you
Best,
Lu

M. Manhart

May 16, 2017, 4:54:56 PM
to 赵路, Bartender
Hi Lu,

Thanks for the information.  I can share two data files with you over Dropbox if that would be convenient.

Here are my commands and the stderr/stdout from running these data sets:

$ bartender_extractor_com -f Run3/Read_data/13_0_S1_L001_R1_001.fastq -o sample13 -p GCCGG[14-16]TATCT

Running bartender extractor
/home/mmanhart/Tools/bartender-1.1-master/bartender_extractor Run3/Read_data/13_0_S1_L001_R1_001.fastq sample13 1 "(GCCG.|GCC.G|GC.GG|G.CGG|.CCGG)([ATCGN]{14,16})(TATC.|TAT.T|TA.CT|T.TCT|.ATCT)" GCCGG TATCT 3
Totally there are 4092380 reads in Run3/Read_data/13_0_S1_L001_R1_001.fastq file!
Totally there are 3933803 valid barcodes from Run3/Read_data/13_0_S1_L001_R1_001.fastq file
Totally there are 3933803 valid barcodes whose quality pass the quality condition
The estimated sequence error from the prefix and suffix parts is 0.0102028
00:00:09

$ bartender_extractor_com -f Run3/Read_data/14_0_S2_L001_R1_001.fastq -o sample14 -p GCCGG[14-16]TATCT

Running bartender extractor
/home/mmanhart/Tools/bartender-1.1-master/bartender_extractor Run3/Read_data/14_0_S2_L001_R1_001.fastq sample14 1 "(GCCG.|GCC.G|GC.GG|G.CGG|.CCGG)([ATCGN]{14,16})(TATC.|TAT.T|TA.CT|T.TCT|.ATCT)" GCCGG TATCT 3
Totally there are 3763961 reads in Run3/Read_data/14_0_S2_L001_R1_001.fastq file!
Totally there are 3650191 valid barcodes from Run3/Read_data/14_0_S2_L001_R1_001.fastq file
Totally there are 3650191 valid barcodes whose quality pass the quality condition
The estimated sequence error from the prefix and suffix parts is 0.00844517
00:00:08
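
The extractor logs above show the -p pattern GCCGG[14-16]TATCT expanded into a regular expression that tolerates one mismatch in each five-base flank. The sketch below applies that same expression with Python's re module purely as an illustration; Bartender's own matching code may behave differently in details, the function name is hypothetical, and the example reads are made up around a barcode mentioned earlier in this thread.

```python
import re
from typing import Optional

# Regular expression copied from the extractor log in this thread.
PATTERN = re.compile(
    r"(GCCG.|GCC.G|GC.GG|G.CGG|.CCGG)"
    r"([ATCGN]{14,16})"
    r"(TATC.|TAT.T|TA.CT|T.TCT|.ATCT)"
)


def extract_barcode(read: str) -> Optional[str]:
    """Return the variable region between the two flanks, or None if no match."""
    match = PATTERN.search(read)
    return match.group(2) if match else None


if __name__ == "__main__":
    exact = "AAGCCGGACGATGGATCACCATTATCTGG"       # both flanks exact
    one_mismatch = "AAGCTGGACGATGGATCACCATTATCTGG"  # one mismatch in the left flank
    unrelated = "ACGTACGTACGT"                     # too short to contain the pattern
    for read in (exact, one_mismatch, unrelated):
        print(read, "->", extract_barcode(read))
```

Both of the first two reads yield ACGATGGATCACCAT, while the third yields None.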

$ bartender_single_com -f sample13_barcode.txt -o sample13 -c 1

Running bartender
Loading barcodes from the file
It takes 00:00:02 to load the barcodes from sample13_barcode.txt
Start to clustering barcode with length 14
Using two sample unpooled test
transforming the barcodes into clusters
Initial number of unique reads:  7469
The distance threshold is 2
Clustering iteration 1
Clustering iteration 2
Identified 4584 barcodes with length 14
Start to clustering barcode with length 15
Using two sample unpooled test
transforming the barcodes into clusters
Initial number of unique reads:  967178
The distance threshold is 2
Clustering iteration 1
Clustering iteration 2
Clustering iteration 3
Identified 376906 barcodes with length 15
Start to clustering barcode with length 16
Using two sample unpooled test
transforming the barcodes into clusters
Initial number of unique reads:  6858
The distance threshold is 2
Clustering iteration 1
Clustering iteration 2
Clustering iteration 3
Identified 4114 barcodes with length 16
The clustering process takes 00:00:17
start to dump clusters to file with prefix sample13
There is no pcr effects in the original data
The estimated error rate is 0.0135624
The overall running time 00:00:29 seconds.

$ bartender_single_com -f sample14_barcode.txt -o sample14 -c 1

Running bartender
Loading barcodes from the file
It takes 00:00:02 to load the barcodes from sample14_barcode.txt
Start to clustering barcode with length 14
Using two sample unpooled test
transforming the barcodes into clusters
Initial number of unique reads:  6593
The distance threshold is 2
Clustering iteration 1
Clustering iteration 2
Identified 3974 barcodes with length 14
Start to clustering barcode with length 15
Using two sample unpooled test
transforming the barcodes into clusters
Initial number of unique reads:  804699
The distance threshold is 2
Clustering iteration 1
Clustering iteration 2
Clustering iteration 3
Identified 324366 barcodes with length 15
Start to clustering barcode with length 16
Using two sample unpooled test
transforming the barcodes into clusters
Initial number of unique reads:  6033
The distance threshold is 2
Clustering iteration 1
Clustering iteration 2
Clustering iteration 3
Identified 3349 barcodes with length 16
The clustering process takes 00:00:12
start to dump clusters to file with prefix sample14
There is no pcr effects in the original data
The estimated error rate is 0.0127857
The overall running time 00:00:23 seconds.
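
The per-length cluster counts in the clustering logs above can be totalled with a small parser. The sketch below relies only on the "Identified N barcodes with length L" lines as they appear in these logs; the function name is hypothetical.

```python
import re
from typing import Dict

# Matches log lines such as "Identified 4584 barcodes with length 14".
IDENTIFIED = re.compile(r"Identified (\d+) barcodes with length (\d+)")


def clusters_by_length(log_text: str) -> Dict[int, int]:
    """Map barcode length -> number of clusters reported in the log."""
    return {int(length): int(count) for count, length in IDENTIFIED.findall(log_text)}


if __name__ == "__main__":
    # Lines copied from the sample13 clustering log above.
    log = (
        "Identified 4584 barcodes with length 14\n"
        "Identified 376906 barcodes with length 15\n"
        "Identified 4114 barcodes with length 16\n"
    )
    per_length = clusters_by_length(log)
    print(per_length)
    print("total clusters:", sum(per_length.values()))
```

For the sample13 log this gives 385604 clusters in total, in line with the "about 3e5 clusters" per data set mentioned earlier in the thread.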

$ bartender_combiner_com -f sample13_cluster.csv,sample13_quality.csv,sample14_cluster.csv,sample14_quality.csv -o samples13_14 -c 1

Running bartender_combiner
Current generation 1
Finished merging generation 1
Current generation 0

The last command for the combiner just gets stuck there and never seems to finish.  I don't have this problem, however, on some small test data sets I tried, just this real data.  I would really appreciate it if you can figure out what's wrong!

Michael

赵路

May 16, 2017, 8:34:51 PM
to M. Manhart, Bartender
Thanks. Will take a look.

赵路

May 19, 2017, 12:36:01 PM
to M. Manhart, Bartender
Hey Manhart,
The log and commands look good to me. Would you please share part of your raw data with me?
Thank you!
best,
lu
