Give partis count information for input seqs

nbst...@gmail.com

Nov 21, 2017, 4:21:28 PM
to partis
Is there a way to tell partis the count for each input sequence? By count, I mean the number of reads that have the same sequence (after running basic QC). Prior to running partis, I have been using pRESTO for basic QC (e.g. merging read pairs, Q-score filtering), and at the end I collapse all identical reads. The count for each sequence is then stored in the header line of each entry in the fastq file, so the headers end up looking something like: @blah_blah_blah|DUPCOUNT=26|more_blah_blah_blah . Is there a way to tell partis that the count for this sequence is given by 'DUPCOUNT' and is 26? I looked through the documentation, but didn't see anything there. Sorry if I missed something. It would be nice if this information were passed on to the 'duplicates' column in the output csv files.

Best,

Nicolas

Duncan Ralph

Nov 21, 2017, 4:40:37 PM
to Nicolas Strauli, partis
No, you're right, there isn't, since we don't use multiplicity information at all. We should, though, at some point.

If you just want the information available in the output file, you can make it part of the unique ID, i.e. whatever comes after the '>' and before any whitespace or pipes.
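
A minimal sketch of that workaround, assuming pRESTO-style headers with pipe-separated KEY=VALUE annotations. The function name and the `_DUPCOUNT-26` suffix format are made up for illustration and are not anything partis defines; the point is only that the count ends up before the first pipe or whitespace, so it survives as part of the ID.

```python
def id_with_count(header, field="DUPCOUNT"):
    """Fold the count annotation into the sequence ID (hypothetical helper).

    '@read1|DUPCOUNT=26|other' -> 'read1_DUPCOUNT-26'
    """
    # Drop the fastq/fasta marker and split off the pipe-separated annotations.
    fields = header.lstrip("@>").strip().split("|")
    base = fields[0].split()[0]  # ID proper: text before any whitespace
    count = 1  # assumption: a header without the field is a single read
    for annotation in fields[1:]:
        key, _, value = annotation.partition("=")
        if key == field:
            count = int(value)
    # No pipes or whitespace in the result, so the whole string is kept as the ID.
    return f"{base}_{field}-{count}"


new_id = id_with_count("@blah_blah_blah|DUPCOUNT=26|more_blah_blah_blah")
print(new_id)
```

You would apply this to every header line before handing the file to partis, then split the suffix back off when reading the output.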

--
You received this message because you are subscribed to the Google Groups "partis" group.
To unsubscribe from this group and stop receiving emails from it, send an email to partis+unsubscribe@googlegroups.com.
To post to this group, send email to par...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/partis/91f5fb0b-1fe4-427a-a143-b5fcb67e9ec1%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Nicolas Strauli

Nov 21, 2017, 5:26:56 PM
to Duncan Ralph, partis
Ah, now I see that I was interpreting the 'duplicates' column incorrectly. My 'unique_ids' are coded as integers, so I interpreted the 'duplicates' column as counts. Now I see that they are lists of unique IDs. Thanks. And I'll try your hack.

Duncan Ralph

Nov 21, 2017, 7:51:48 PM
to Nicolas Strauli, partis
Oh, sorry, I was forgetting about the 'duplicates' column. So let me amend that to say "we don't _use_ multiplicity in any intelligent way to, for instance, inform annotation or partitioning". The 'duplicates' column just lists the sequences that partis found to be identical after removing the non-coding regions 5' of v and 3' of j. So I guess it might make sense to allow adding your initial duplicates to that number? But then it might get confusing which is coming from where, and you can probably do that afterward anyway.
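
Doing it afterward could look roughly like the sketch below. It assumes you kept a mapping from each unique ID to its upstream read count, and that the 'duplicates' cell is a colon-separated list of unique IDs; the separator is an assumption to check against your own output file, and both helper names are hypothetical.

```python
def parse_duplicates(cell, sep=":"):
    """Split a 'duplicates' cell into a list of unique IDs (separator is assumed)."""
    return cell.split(sep) if cell else []


def total_counts(own_counts, duplicates):
    """Add upstream read counts for each sequence and everything collapsed into it.

    own_counts: {uid: read count from the upstream collapse step}
    duplicates: {uid: [uids that were collapsed into uid]}
    """
    totals = {}
    for uid, dups in duplicates.items():
        # IDs missing from own_counts are treated as single reads.
        totals[uid] = own_counts.get(uid, 1) + sum(own_counts.get(d, 1) for d in dups)
    return totals


own_counts = {"a": 26, "b": 3, "c": 1}
duplicates = {"a": parse_duplicates("b:c"), "d": parse_duplicates("")}
totals = total_counts(own_counts, duplicates)
print(totals)
```

Keeping the two sources separate until this step avoids the "which is coming from where" confusion.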

Duncan Ralph

May 1, 2020, 3:07:28 PM
to Nicolas Strauli, partis
In case anyone was using the info in this thread, it's now out of date. We now do use multiplicity information in some places (mostly selection metrics like lbi/consensus distance), and multiplicity info should be passed in using the input meta info.
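
A rough sketch of preparing such a meta info file from pRESTO-style headers. The layout here (a mapping from each sequence ID to a dict with a `multiplicity` key, written as JSON, which is also valid YAML) and the option used to pass it in (I believe `--input-metafname`) are assumptions to verify against the current partis manual; `build_meta` is a hypothetical helper.

```python
import json


def build_meta(headers, field="DUPCOUNT"):
    """Map each sequence ID to {'multiplicity': count} (assumed meta info layout)."""
    meta = {}
    for header in headers:
        fields = header.lstrip("@>").strip().split("|")
        uid = fields[0]  # the part before the first pipe is what ends up as the ID
        for annotation in fields[1:]:
            key, _, value = annotation.partition("=")
            if key == field:
                meta[uid] = {"multiplicity": int(value)}
    return meta


headers = ["@seq1|DUPCOUNT=26|more_blah", "@seq2|DUPCOUNT=1"]
meta = build_meta(headers)
print(json.dumps(meta, indent=2))
```

Write the printed text to a file and point partis at it; headers without the count field are simply left out of the mapping.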
