classification of a single read results in multiple classifications...

Robert Player

Jan 11, 2017, 4:38:45 PM

to CLARK Users

So I'm testing out CLARK, going through the README step by step:

- I put together a target list (T) shown below

$ cat targets.fp.ti 
/data/home/playera1/Projects/EMA/META_pipeline/zzz_CLARK_testing/targets/40050.fna	40050
/data/home/playera1/Projects/EMA/META_pipeline/zzz_CLARK_testing/targets/101688.fna	101688
/data/home/playera1/Projects/EMA/META_pipeline/zzz_CLARK_testing/targets/347495.fna	347495
/data/home/playera1/Projects/EMA/META_pipeline/zzz_CLARK_testing/targets/1110693.fna	1110693
/data/home/playera1/Projects/EMA/META_pipeline/zzz_CLARK_testing/targets/885275.fna	885275
/data/home/playera1/Projects/EMA/META_pipeline/zzz_CLARK_testing/targets/1070417.fna	1070417

- I made my database directory (D)

- here is my input:

$ cat ecoli.fastq 
@M02057:50:000000000-AN9K6:1:1101:13867:1030 1:N:0:5
NTACTGGCGATGTTATCCAGCGATATATCGAACAGTTTGTCGCCAATTCCAACAAAATGACCGCCGTCGGGGCGTGCGGGCTGATCGTCACGGCGTNNTTGNTGNTNTACTCCATCGATAGCGCGTTGAATACCATCTGGCGCAGNNAACGAGCGCNACCCAAAATNNNNNNNTNNGNNNTGTACTGGATGATTTTAACGC
+
#8ACCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGF##:B>#:3#:#:BFFGGGGGGGGGFGGGGGGGGGGGGGGGFDGGGGGGG##8<FCGGGGD#8=FGGGGGG#######8##8###14;CGGFFG:CFGGGGGGGGE

Running the command:

CLARK -n 10 -k 20 -T targets.fp.ti -D test_db_dir/ -O ecoli.fastq -R results

I get the following results.csv file:

$ cat results.csv 
Object_ID, Length, Assignment
M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275

What in the world is going on? I mean it is the correct classification, but why are there so many lines in the output?!

Isn't the output supposed to just be 1 read classification per line?
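For anyone who wants to check their own runs for the same symptom, here is a rough sketch that compares the number of reads in a FASTQ file against the number of data rows in the CLARK results file. It assumes a strict four-lines-per-record FASTQ (no wrapped sequences) and a results CSV with a single header line, as in the output above. The function names are my own and are not part of CLARK.

```python
import csv
import io


def count_fastq_records(handle):
    """Count reads, assuming exactly four non-empty lines per FASTQ record."""
    n_lines = sum(1 for line in handle if line.strip())
    if n_lines % 4 != 0:
        raise ValueError("line count is not a multiple of 4; not strict FASTQ?")
    return n_lines // 4


def count_result_rows(handle):
    """Count data rows in a results CSV, skipping the single header line."""
    reader = csv.reader(handle)
    next(reader, None)  # skip "Object_ID, Length, Assignment"
    return sum(1 for row in reader if row)


# Toy data for illustration only (not taken from a real run).
toy_fastq = io.StringIO("@read1\nACGT\n+\nFFFF\n")
toy_results = io.StringIO(
    "Object_ID, Length, Assignment\n"
    "read1,4,885275\n"
    "@read1,4,885275\n"
)

n_reads = count_fastq_records(toy_fastq)
n_rows = count_result_rows(toy_results)
if n_rows != n_reads:
    print(f"mismatch: {n_reads} read(s) but {n_rows} result row(s)")
```

With real files, the same functions can be called on `open("ecoli.fastq")` and `open("results.csv", newline="")`. A mismatch between the two counts points to duplicated (or missing) rows.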

Rachid OUNIT

Jan 11, 2017, 6:25:06 PM

to Robert Player, CLARK Users

Hello Robert,

It seems you have encountered a bug; thank you for reporting it. It appears to be caused by the multithreading algorithm. I will look into it.

Could you rerun CLARK with 1 thread (-n 1) instead of 10?

Cheers,

Rachid

RP

Jan 12, 2017, 8:59:47 AM

to CLARK Users, play...@gmail.com

Yep, I didn't even make that connection, though it was obvious. It also looks like the rows written by the extra threads (lines 3-11 in n10results.csv) use the full header, including the '@' symbol, as the read name.

n = 10

$ CLARK -n 10 -k 20 -T targets.fp.ti -D test_db_dir/ -O ecoli.fastq -R n10results
$ cat n10results.csv

Object_ID, Length, Assignment
M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275
@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275

n = 1

$ CLARK -n 1 -k 20 -T targets.fp.ti -D test_db_dir/ -O ecoli.fastq -R n1results
$ cat n1results.csv

Object_ID, Length, Assignment
M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275

Any idea how long this fix will take?
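In the meantime, running with -n 1 avoids the problem. For result files that were already produced with more threads, something like the sketch below might work as a stopgap. It strips a single leading '@' from the read name and drops rows that are exact duplicates. It assumes read IDs are unique in the input, so identical rows can only come from the bug. The function name is my own, not anything from CLARK.

```python
import csv
import io


def dedupe_results(in_handle, out_handle):
    """Copy a results CSV, normalizing read names and dropping repeated rows."""
    reader = csv.reader(in_handle)
    writer = csv.writer(out_handle, lineterminator="\n")

    header = next(reader, None)
    if header is not None:
        writer.writerow(header)

    seen = set()
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Remove one leading '@' so both forms of the read name compare equal.
        if row[0].startswith("@"):
            row[0] = row[0][1:]
        key = tuple(row)
        if key in seen:
            continue  # assumed to be a duplicate written by an extra thread
        seen.add(key)
        writer.writerow(row)


# Two rows copied from n10results.csv above, used as a small demo.
demo_in = io.StringIO(
    "Object_ID, Length, Assignment\n"
    "M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275\n"
    "@M02057:50:000000000-AN9K6:1:1101:13867:1030,201,885275\n"
)
demo_out = io.StringIO()
dedupe_results(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

This only cleans up the output; it does not tell you whether the classification itself was affected by the threading bug, so comparing against a -n 1 run is still the safer check.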

RP

Jan 12, 2017, 9:08:16 AM

to CLARK Users, play...@gmail.com

This is probably related to the same multithreading issue...

When I try to use n > 1 with paired-end (PE) reads, I get a 'Segmentation fault (core dumped)' error:

$ CLARK -n 10 -k 20 -T targets.fp.ti -D test_db_dir/ -P R1.fa R2.fa -R PEn10results
CLARK version 1.2.3 (UCR CS&E. Copyright 2013-2016 Rachid Ounit, roun...@cs.ucr.edu) 
Loading database [test_db_dir/db_central_k20_t6_s1610612741_m0.tsk.*] ...
Loading done (database size: 1648 MB read, with sampling factor 2)
Mode: Default,	Processing file: R1.fa_ConcatenatedByCLARK.fa,	 using 10 CPU.
Segmentation fault (core dumped)

After the crash, the R1.fa_ConcatenatedByCLARK.fa file is also left behind in my working directory.

When n = 1 the results are fine, and the concatenated CLARK file is removed from my working directory.
