Parsing paired barcoded reads from MiSeq: empty reads


Luis Morgado

Apr 13, 2017, 9:42:36 AM
to LotuS rRNA pipeline
Dear Falk,

I want to use SDM to demultiplex the raw fastq files from a paired-end MiSeq run of ITS2 fungal data, but I'm running into some problems.

I use the following command:

sdm -i ~EL01_S1_L001_R1_001.fastq.gz,~EL01_S1_L001_R2_001.fastq.gz -map ~map_EL01.txt -o_fastq ~lotus_test -paired 2 -SyncReadPairs T -o_demultiplex ~lotus_test/

the map file looks like this:
#SampleID    BarcodeSequence    Barcode2ndPair    ForwardPrimer    ReversePrimer
L1    NAACAAC    GTCTCTNN    GTGARTCATCGARTCTTTG    TCCTCCGCTTATTGATATGC
L2    NNAACCGA    GATTCGN    GTGARTCATCGARTCTTTG    TCCTCCGCTTATTGATATGC
L3    NNNCCGGAA    TTCATTNNN    GTGARTCATCGARTCTTTG    TCCTCCGCTTATTGATATGC

After running the command I get one FWD file and one REV file for each barcode, which is what I want, but the files are empty.
Can you figure out what I am doing wrong?
I tried mothur and it works, but it's rather slow...

Thank you in advance.
Luis

Falk Hildebrand

Apr 14, 2017, 4:32:23 AM
to LotuS rRNA pipeline
Hey Luis,
try using the absolute paths to the input files instead of ~xx.fna. On my system this would actually have to be ~/xx.fna, but I would rather go for the absolute path. Also make sure the output dir exists. Finally, I would recommend running the lotus pipeline with the mapping file and the -saveDemultiplex 1 option. This saves you from having to create dirs etc., and it simply aborts the run after the demultiplexing.
However, the big problem is the barcodes. Do not use N in these: sdm will not recognize it as a valid "any nucleotide". Instead, just leave the N's out; sdm will check the first X nucleotides for the start of a valid barcode.
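
As an illustration of removing the N's, here is a minimal Python sketch that strips leading and trailing N characters from the barcode columns of a tab-separated mapping file. It is not part of sdm or LotuS; the function name strip_n_from_barcodes is hypothetical, and the column names BarcodeSequence and Barcode2ndPair are taken from the map posted above.

```python
# Assumed column names, as in the mapping file shown in this thread.
BARCODE_COLUMNS = ("BarcodeSequence", "Barcode2ndPair")


def strip_n_from_barcodes(map_text: str) -> str:
    """Return the mapping text with leading/trailing N's removed from barcodes."""
    lines = map_text.splitlines()
    if not lines:
        return ""
    header = lines[0].split("\t")
    # Positions of the barcode columns in the header row.
    barcode_idx = [i for i, name in enumerate(header) if name in BARCODE_COLUMNS]
    cleaned = [lines[0]]
    for line in lines[1:]:
        if not line.strip():
            cleaned.append(line)  # keep blank lines untouched
            continue
        fields = line.split("\t")
        for i in barcode_idx:
            if i < len(fields):
                fields[i] = fields[i].strip("Nn")
        cleaned.append("\t".join(fields))
    return "\n".join(cleaned) + "\n"


example = (
    "#SampleID\tBarcodeSequence\tBarcode2ndPair\n"
    "L1\tNAACAAC\tGTCTCTNN\n"
)
print(strip_n_from_barcodes(example), end="")
```

Only N's at the ends of a barcode are removed; an N in the middle of a barcode is left alone so that it can be reviewed by hand.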
hth,
Falk

Luis Morgado

Apr 14, 2017, 12:06:22 PM
to LotuS rRNA pipeline
Hi Falk,
thank you so much for the prompt reply.
I did include the full paths in the commands; I just shortened them in the post.
Removing the N's worked like a charm. I'm rather curious how the software/algorithm works. Does it look for the letter pattern in the first and last X letters? Is there an option for a log file that says how many sequences were parsed/excluded?

Also, I have a bunch of fastq.gz files that it would be great to run in one go. I tried to do this by listing the respective fastq pair for each barcode in the map file, but got the following error:
"This is sdm (simple demultiplexer) 1.36 beta.

 Could not auto detect input format. First non-empty line of your file looked like:
??D 9[?JҕpJ(n?J??????N?D?,??D??EL01_S1_L001_R1?_001.fastq?sD F`?J?m?J(n?J'N???????l?
"

my map file looks like this:
SampleID    BarcodeSequence    Barcode2dPair    ForwardPrimer    ReversePrimer    fastqFile
L1    AACAAC    GTCTCT    GTGARTCATCGARTCTTTG    TCCTCCGCTTATTGATATGC    EL01_S1_L001_R1_001.fastq.gz,EL01_S1_L001_R2_001.fastq.gz
L2    AACCGA    GATTCG    GTGARTCATCGARTCTTTG    TCCTCCGCTTATTGATATGC    EL01_S1_L001_R1_001.fastq.gz,EL01_S1_L001_R2_001.fastq.gz
L3    CCGGAA    TTCATT    GTGARTCATCGARTCTTTG    TCCTCCGCTTATTGATATGC    EL01_S1_L001_R1_001.fastq.gz,EL01_S1_L001_R2_001.fastq.gz
....
....
L192    TGATCC    TGATCC    GTGARTCATCGARTCTTTG    TCCTCCGCTTATTGATATGC    EL04_S4_L001_R1_001.fastq.gz,EL04_S4_L001_R2_001.fastq.gz
L193    AACAAC    GTCTCT    GTGARTCATCGARTCTTTG    TCCTCCGCTTATTGATATGC    EL05_S5_L001_R1_001.fastq.gz,EL05_S5_L001_R2_001.fastq.gz
L194    AACCGA    GATTCG    GTGARTCATCGARTCTTTG    TCCTCCGCTTATTGATATGC    EL05_S5_L001_R1_001.fastq.gz,EL05_S5_L001_R2_001.fastq.gz

the command was:
sdm -i /Volumes/Untitled/OMG/MycoSoil/EcoServ/Bioinformatic_analyses/ParsingBarcodes_fromFastq -fastqVersion 1 -map /Volumes/Untitled/OMG/MycoSoil/EcoServ/Bioinformatic_analyses/ParsingBarcodes_fromFastq/completeMap.txt -paired 2 -SyncReadPairs T  -saveDemultiplex 1 -o_demultiplex /Volumes/Untitled/OMG/MycoSoil/EcoServ/Bioinformatic_analyses/ParsingBarcodes_fromFastq/allSamplesDemultiplexed_test/

Thanks again for your help and great software.
Luis



Falk Hildebrand

Apr 19, 2017, 4:54:07 AM
to LotuS rRNA pipeline
Hey Luis,
yes, normally you can include several fastq files in the map, as in the lotus documentation on the website. The format of the map looks ok to me (just make sure it's all tab separated), and the gz files worked for you before. Can you check the first lines of your input file?
If this doesn't work, try unzipping the first files (so having EL01_S1_L001_R1_001.fastq,EL01_S1_L001_R2_001.fastq) to test whether it somehow doesn't recognize the .gz.
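
One way to check the first lines of an input, sketched in Python under some assumptions: the helper name describe_input is hypothetical, gzip files are detected by their two magic bytes, and a FASTQ file is assumed to start with an "@" header line. The sketch also reports when the path is a directory rather than a file. It is independent of sdm.

```python
import gzip
import os

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def describe_input(path: str, n_lines: int = 4) -> dict:
    """Report what kind of input `path` is and show its first text lines."""
    if os.path.isdir(path):
        return {"kind": "directory", "first_lines": [], "looks_like_fastq": False}
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == GZIP_MAGIC
    opener = gzip.open if is_gzip else open
    first_lines = []
    # errors="replace" keeps the sketch from crashing on binary content.
    with opener(path, "rt", errors="replace") as handle:
        for _ in range(n_lines):
            line = handle.readline()
            if not line:
                break
            first_lines.append(line.rstrip("\n"))
    return {
        "kind": "gzip" if is_gzip else "plain",
        "first_lines": first_lines,
        "looks_like_fastq": bool(first_lines) and first_lines[0].startswith("@"),
    }
```

Calling describe_input on each path passed to sdm shows whether it is a readable FASTQ file, a compressed one, or something else entirely.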

Check the log file (given via -log to sdm), which has extensive stats on read quality etc., though I'm not sure if this is active in the -saveDemultiplex 1 mode.
Please let me know if unzipping helps, because that would be a bug I need to fix.
hth,
Falk

Falk Hildebrand

Apr 19, 2017, 5:01:34 AM
to LotuS rRNA pipeline
About the algorithm: if you don't allow for barcode errors, a hash tree of the barcodes is built and each read's barcode is looked up in it. If you allow for errors, then a direct comparison with counting of mismatches is performed, which would also allow for redundant codes and "n" characters. Of course one doesn't want these (redundant bases) in barcodes in the first place. best, Falk
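
The two strategies described above can be sketched roughly as follows. This is an illustrative Python sketch, not sdm's actual implementation: a plain dict stands in for the hash structure, the barcode is assumed to start at position 0 of the read, an N in a barcode is treated as matching any base, and all function names are hypothetical.

```python
from typing import Dict, Optional


def build_barcode_index(barcodes: Dict[str, str]) -> Dict[str, str]:
    """Map barcode sequence -> sample ID for exact lookups."""
    return {seq.upper(): sample for sample, seq in barcodes.items()}


def match_exact(read: str, index: Dict[str, str]) -> Optional[str]:
    """Exact mode: look up the read prefix in the hash, longest barcodes first."""
    read = read.upper()
    for length in sorted({len(bc) for bc in index}, reverse=True):
        sample = index.get(read[:length])
        if sample is not None:
            return sample
    return None


def count_mismatches(barcode: str, prefix: str) -> int:
    """Count differing positions; an N in the barcode matches any base."""
    return sum(1 for b, r in zip(barcode, prefix) if b != "N" and b != r)


def match_with_errors(read: str, barcodes: Dict[str, str],
                      max_mismatches: int = 1) -> Optional[str]:
    """Error-tolerant mode: compare against every barcode, keep the unique best."""
    best_sample, best, ambiguous = None, max_mismatches + 1, False
    for sample, barcode in barcodes.items():
        prefix = read[:len(barcode)].upper()
        if len(prefix) < len(barcode):
            continue  # read is shorter than this barcode
        mismatches = count_mismatches(barcode.upper(), prefix)
        if mismatches < best:
            best_sample, best, ambiguous = sample, mismatches, False
        elif mismatches == best and best <= max_mismatches:
            ambiguous = True  # two barcodes fit equally well
    return None if ambiguous else best_sample


barcodes = {"L1": "AACAAC", "L2": "AACCGA", "L3": "CCGGAA"}
index = build_barcode_index(barcodes)
print(match_exact("AACAACGTGARTC", index))
print(match_with_errors("AACAATGTGARTC", barcodes, max_mismatches=1))
```

The exact lookup costs one hash access per barcode length, while the error-tolerant mode has to compare the read against every barcode, which is why the two cases are handled differently.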

Luis Morgado

Apr 20, 2017, 8:31:02 AM
to LotuS rRNA pipeline
Hi Falk,
Thanks for your reply.
I tried with the unzipped files and it gave me the same output.
I confirmed that the map file is all tab delimited, so that should not be the problem.
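
A tab-delimited map can be verified with a small check like the following Python sketch. It assumes only that every non-empty line should have the same number of tab-separated fields as the header; the function name check_map_columns is hypothetical and the check is independent of sdm.

```python
from typing import List, Tuple


def check_map_columns(map_text: str) -> List[Tuple[int, str]]:
    """Return (line_number, problem) pairs for lines that do not match the header."""
    problems = []
    lines = map_text.splitlines()
    if not lines:
        return problems
    expected = len(lines[0].split("\t"))
    if expected < 2:
        problems.append((1, "header contains no tab characters"))
    for number, line in enumerate(lines[1:], start=2):
        if not line.strip():
            continue  # ignore blank lines
        found = len(line.split("\t"))
        if found != expected:
            problems.append((number, f"{found} fields, expected {expected}"))
    return problems
```

An empty result means every line has the same number of tab-separated fields as the header; lines separated by spaces instead of tabs show up with too few fields.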

Regarding the demultiplexing process: I have short tags on both ends of the reads. Does SDM account for reads/barcodes that might be in 'antisense'?
Kind regards,
Luis

Falk Hildebrand

Apr 24, 2017, 4:04:09 AM
to LotuS rRNA pipeline
Hey Luis,
yes, sdm accounts for antisense barcodes (even a mix of both), but it is still recommended that you check in your fastq files that both the fwd and rev BC can be read from the read sequence exactly as you have written them in the mapping file. I.e. search the read1 and read2 files for each barcode as you have it in the mapping file. If it is found at the start of the read, that's the orientation you want in the mapping file.
I'll try to see if I can reproduce the bug you describe in the next few days.
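
The orientation check described above could be done with a short script such as this Python sketch. It assumes standard four-line FASTQ records and counts how many reads begin with a barcode as written versus its reverse complement; the function names and the window parameter (extra bases allowed before the barcode) are hypothetical.

```python
import gzip
from typing import Dict, Iterable, Iterator

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def iter_fastq_sequences(path: str) -> Iterator[str]:
    """Yield the sequence line of each 4-line FASTQ record (plain or .gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:
                yield line.strip()


def count_barcode_orientations(sequences: Iterable[str], barcode: str,
                               window: int = 0) -> Dict[str, int]:
    """Count reads whose start contains the barcode or its reverse complement."""
    barcode = barcode.upper()
    rc = reverse_complement(barcode)
    counts = {"as_written": 0, "reverse_complement": 0}
    for seq in sequences:
        start = seq[:len(barcode) + window].upper()
        if barcode in start:
            counts["as_written"] += 1
        if rc in start:
            counts["reverse_complement"] += 1
    return counts


reads = ["AACAACGTGARTCATCG", "GTTGTTAAAACCCC", "CCCCCCCCCC"]
print(count_barcode_orientations(reads, "AACAAC"))
```

For real data, pass iter_fastq_sequences(path) for the read1 file with the forward barcode and for the read2 file with the reverse barcode; the orientation with the clearly higher count is the one to write in the mapping file.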
best,
Falk

Luis Morgado

Apr 25, 2017, 12:20:04 PM
to LotuS rRNA pipeline
Hi Falk,

Thank you for your reply.
I have another question regarding demultiplexing, this time to test a different workflow.
If I want to demultiplex my dual-barcoded reads after merging R1 and R2, is the command below the right one? I only want to parse the reads that have both barcodes.

sdm -i_fastq <full path to file> -map <full path to map> -onlyPair 2 -saveDemultiplex 1 -o_demultiplex <full path for demultiplex folder>

Best regards,
Luis



Falk Hildebrand

Apr 27, 2017, 5:58:00 AM
to LotuS rRNA pipeline
Hey Luis,
just as an important note: I am upgrading the lotus pipeline (I think there might be a small issue with the installer atm), but please use the latest sdm 1.37 for dual barcodes. Earlier versions had a bug that caused some misassignments for dual barcodes, given a specific mapping setup.
Try using this command:

sdm -i_fastq <full path to files> -map <full path to map> -paired 2 -saveDemultiplex 1 -o_demultiplex <full path for demultiplex folder>

The -onlyPair option will only output one of the reads.
best,
Falk

Falk Hildebrand

Apr 28, 2017, 2:36:17 PM
to LotuS rRNA pipeline
Hey Luis,
can you also check if the new version alleviates your problems with the demultiplexing?
best,
Falk

Luis Morgado

May 3, 2017, 6:48:03 AM
to LotuS rRNA pipeline
Hi Falk,
Sorry for the delayed reply.
I did some tests, but it seems that none of the issues were solved with the new version.
When I run the command to demultiplex multiple (non-merged) fastq/fastq.gz files through the map file, it gives me the same output:

"
This is sdm (simple demultiplexer) 1.37 beta.


 Could not auto detect input format. First non-empty line of your file looked like:
?*? 6B?J??!J???J??????V?D??D??EL01_S1_L001_R2?_001.fastq??? ]B?J??!JM1?J??????V    <??    <??EL02_S2_L001_R1?_001.fastq??? {B?J??!JM1?Jp?????^    <??    <??EL02_S2_L001_R2?_001.fastq??m ?B?J??!J???J??????^t?ė?t?ė?EL03_S3_L001_R1?_001.fastq??r ?B?J??!J???Jr?????ft?ė?t?ė?EL03_S3_L001_R2?_001.fastq??- ?B?J??!JN1?Jb?????f1$^"1$^?EL04_S4_L001_R1?_001.fastq?$? C?J??!JN1?JA?????n1$^;?1$^?EL04_S4_L001_R2?_001.fastq?? NC?J??!JN1?J?????o????Y??????EL05_S5_L001_R1?_001.fastq??_ pC?J??!JN1?J
         ????w????7
                   ?????EL05_S5_L001_R2?_001.fastq?.? ?C?J?N:JN1?J?????3Oz??f?z??f?EL06_S1_L001_R1?_001.fastq?O? ?C?J?N:JN1?J?????3Wz??fx$z??f?EL06_S1_L001_R2?_001.fastq??V ?C?J?N:JN1?J?????SW???gh&???g?EL07_S2_L001_R1?_001.fastq??N ?C?J?N:JO1?J?????S_???g?(???g?EL07_S2_L001_R2?_001.fastq?ڍ ?C?J?N:JO1?Jh????s_?Е??*?Е??EL08_S3_L001_R1?_001.fastq?9?     D?J?N:JO1?J5????sg?Е?/-?Е??EL08_S3_L001_R2?_001.fastq?.s 0D?J?N:JO1?J%?????g??\wq ??\w?EL09_S4_L001_R1?_001.fastq?x LD?J?N:JO1?JR?????o??\w{0??\w?EL09_S4_L001_R2?_001.fastq?|? jD?J?N:J??Jr????{n??bW?2??bW?EL10_S5_L001_R1?_001.fastq?Q? ?D?J?N:J??J>????{v??bWf ??bW?EL10_S5_L001_R2?_001.fastq? ?D?J-J?]?J    ????r?'
"
I tried both the fastq and the fastq.gz files.

When I try to demultiplex merged reads with the command
sdm -i_fastq <full path to fastq> -map <full path to map file> -paired 2 -saveDemultiplex 1 -o_demultiplex <full path to output file>

it gives me the following output:
"
This is sdm (simple demultiplexer) 1.37 beta.

No output file will be written
NO filtering will be done on your reads (just rewriting / log files created).
Writing demultiplexed files to: /Volumes/Untitled/OMG/MycoSoil/EcoServ/Bioinformatic_analyses/BioInfo_tests/Test1/test_sdm_barcode_parsing
Unequal number of files (1) and option-set paired files (2).
 Aborting...
"

I tried the option "-paired 1" and it gives me a demultiplexed output (with a few more sequences than the files demultiplexed with mothur). Does -paired 1 parse barcodes from only one end?

Cheers,
Luis

Falk Hildebrand

May 5, 2017, 1:09:22 PM
to LotuS rRNA pipeline
Just a quick note on this, the command "sdm -i_fastq <full path to fastq> -map <full path to map file> -paired 2 -saveDemultiplex 1 -o_demultiplex <full path to output file>"

should be

sdm -i_path <full path to fastq> -map <full path to map file> -paired 2 -saveDemultiplex 1 -o_demultiplex <full path to output file>

best,
Falk