classification of a single read results in multiple classifications...


Robert Player

Jan 11, 2017, 4:38:45 PM
to CLARK Users

So I'm testing out CLARK, going through the README step by step:

   - I put together a target list (T) shown below
$ cat targets.fp.ti 
/data/home/playera1/Projects/EMA/META_pipeline/zzz_CLARK_testing/targets/40050.fna 40050
/data/home/playera1/Projects/EMA/META_pipeline/zzz_CLARK_testing/targets/101688.fna 101688
/data/home/playera1/Projects/EMA/META_pipeline/zzz_CLARK_testing/targets/347495.fna 347495
/data/home/playera1/Projects/EMA/META_pipeline/zzz_CLARK_testing/targets/1110693.fna 1110693
/data/home/playera1/Projects/EMA/META_pipeline/zzz_CLARK_testing/targets/885275.fna 885275
/data/home/playera1/Projects/EMA/META_pipeline/zzz_CLARK_testing/targets/1070417.fna 1070417

   - I made my database directory (D)

   - here is my input:
$ cat ecoli.fastq 
@M02057:50:000000000-AN9K6:1:1101:13867:1030 1:N:0:5
NTACTGGCGATGTTATCCAGCGATATATCGAACAGTTTGTCGCCAATTCCAACAAAATGACCGCCGTCGGGGCGTGCGGGCTGATCGTCACGGCGTNNTTGNTGNTNTACTCCATCGATAGCGCGTTGAATACCATCTGGCGCAGNNAACGAGCGCNACCCAAAATNNNNNNNTNNGNNNTGTACTGGATGATTTTAACGC
+
#8ACCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGF##:B>#:3#:#:BFFGGGGGGGGGFGGGGGGGGGGGGGGGFDGGGGGGG##8<FCGGGGD#8=FGGGGGG#######8##8###14;CGGFFG:CFGGGGGGGGE

Running the command:
CLARK -n 10 -k 20 -T targets.fp.ti -D test_db_dir/ -O ecoli.fastq -R results

I get the following results.csv file:
$ cat results.csv 
Object_ID, Length, Assignment
M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275

What in the world is going on? I mean, it is the correct classification, but why are there 10 lines in the output for a single read?!
Isn't the output supposed to have just one classification line per read?
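A quick way to confirm that the extra lines are repeats of the same read is to count rows per read ID. The following is only a minimal sketch: `rows_per_read` is a hypothetical helper, not part of CLARK, and it assumes the CSV layout shown above (one header line, read ID in the first column, optionally prefixed with '@').

```python
import csv
from collections import Counter


def rows_per_read(lines):
    """Count result rows per read ID.

    `lines` is any iterable of CSV lines (an open file works too).
    Assumption: the first line is the header and the first column is
    the read ID, possibly carrying the FASTQ '@' prefix.
    """
    reader = csv.reader(lines)
    next(reader, None)  # skip the "Object_ID, Length, Assignment" header
    counts = Counter()
    for row in reader:
        if not row:
            continue  # ignore blank lines
        read_id = row[0]
        if read_id.startswith("@"):
            read_id = read_id[1:]  # treat '@id' and 'id' as the same read
        counts[read_id] += 1
    return counts
```

Passing an open handle on results.csv to `rows_per_read` and keeping only the entries with a count above 1 lists the reads that were reported more than once.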

Rachid OUNIT

Jan 11, 2017, 6:25:06 PM
to Robert Player, CLARK Users
Hello Robert,

It seems you have encountered a bug; thank you for reporting it. It appears to be caused by the multithreading algorithm, and I will look into it.
Could you rerun CLARK with 1 thread (-n 1) instead of 10?

Cheers,
Rachid

RP

Jan 12, 2017, 8:59:47 AM
to CLARK Users, play...@gmail.com
Yep, I didn't even make that connection, though it was obvious. It seems the rows written by the extra threads (lines 3-11 in n10results.csv) also use the full header (with the '@' symbol) as the read name.

n = 10
$ CLARK -n 10 -k 20 -T targets.fp.ti -D test_db_dir/ -O ecoli.fastq -R n10results
$ cat n10results.csv 
Object_ID, Length, Assignment
M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275

n = 1
$ CLARK -n 1 -k 20 -T targets.fp.ti -D test_db_dir/ -O ecoli.fastq -R n1results
$ cat n1results.csv 
Object_ID, Length, Assignment
M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275

Any idea how long this fix will take?
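In the meantime, a possible stopgap is to post-process the multi-threaded output. This is a minimal sketch under the assumption that the extra rows differ from the first one only by the leading '@', as in n10results.csv above; `dedupe_clark_rows` is a hypothetical helper, not something CLARK provides.

```python
def dedupe_clark_rows(lines):
    """Return the header plus one row per distinct result line.

    Assumption: duplicates are exact repeats once a leading '@' is
    stripped from the read ID, as seen in the n = 10 output.
    """
    seen = set()
    cleaned = []
    for index, line in enumerate(lines):
        line = line.rstrip("\n")
        if index == 0:
            cleaned.append(line)  # keep the header untouched
            continue
        if not line:
            continue  # skip blank lines
        if line.startswith("@"):
            line = line[1:]  # normalise '@id,...' to 'id,...'
        if line not in seen:
            seen.add(line)
            cleaned.append(line)  # first occurrence wins, order preserved
    return cleaned
```

Applied to the lines of n10results.csv, this should leave the same two lines as n1results.csv. It does not address the underlying bug, and it would also collapse any rows that are legitimately identical.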

RP

Jan 12, 2017, 9:08:16 AM
to CLARK Users, play...@gmail.com
This probably has to do with the same multi-threading issue...
When I try to use n > 1 with paired-end (PE) reads, I get a 'Segmentation fault (core dumped)' error:

$ CLARK -n 10 -k 20 -T targets.fp.ti -D test_db_dir/ -P R1.fa R2.fa -R PEn10results
CLARK version 1.2.3 (UCR CS&E. Copyright 2013-2016 Rachid Ounit, roun...@cs.ucr.edu
Loading database [test_db_dir/db_central_k20_t6_s1610612741_m0.tsk.*] ...
Loading done (database size: 1648 MB read, with sampling factor 2)
Mode: Default, Processing file: R1.fa_ConcatenatedByCLARK.fa, using 10 CPU.
Segmentation fault (core dumped)

The R1.fa_ConcatenatedByCLARK.fa file is also left behind in my working directory.

When n = 1 the results are fine, and the concatenated CLARK file is removed from my working directory.
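Until this is fixed, a wrapper could retry paired-end runs with a single thread when the multi-threaded run dies with a segmentation fault. The sketch below is an assumption-laden illustration: `clark_paired_cmd` and `run_with_single_thread_fallback` are hypothetical names, the flags are only the ones used in the commands above, and the crash check relies on POSIX reporting a signal-killed process as a negative return code.

```python
import signal
import subprocess


def clark_paired_cmd(threads, k, targets, db_dir, r1, r2, out_prefix):
    """Build the paired-end CLARK command line used in this thread."""
    return ["CLARK", "-n", str(threads), "-k", str(k),
            "-T", targets, "-D", db_dir, "-P", r1, r2, "-R", out_prefix]


def run_with_single_thread_fallback(threads, *args, run=subprocess.run):
    """Run CLARK; if it segfaults with threads > 1, rerun with -n 1.

    `run` is injectable so the logic can be exercised without CLARK
    installed. On POSIX, subprocess reports death by signal N as
    returncode == -N.
    """
    result = run(clark_paired_cmd(threads, *args))
    if threads > 1 and result.returncode == -signal.SIGSEGV:
        result = run(clark_paired_cmd(1, *args))
    return result
```

For example, `run_with_single_thread_fallback(10, 20, "targets.fp.ti", "test_db_dir/", "R1.fa", "R2.fa", "PEn10results")` would first try 10 threads and fall back to 1. The leftover R1.fa_ConcatenatedByCLARK.fa file from the crashed run is not cleaned up by this sketch.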