Grabbing Specific String Values Separated by Underscore


Tj Idowu

Jan 13, 2022, 3:28:29 PM
to Nextflow

Hello,

I am trying to figure out how to separate and categorize some files based on string values in their names. For example, if the files are named:
X_filename_f0_r1-1_1_R1.fastq
X_filename_f0_r0-1_1_R1.fastq
X_filename_f0_r0-1_2_R1.fastq
X_filename_f0_r1-0_3_R1.fastq

X_filename_f1_r1-1_1_R1.fastq
X_filename_f1_r0-1_1_R1.fastq
X_filename_f1_r0-1_2_R1.fastq
X_filename_f1_r1-0_3_R1.fastq

X_filename_f2_r1-1_1_R1.fastq
X_filename_f2_r0-1_1_R1.fastq
X_filename_f2_r0-1_2_R1.fastq
X_filename_f2_r1-0_3_R1.fastq

X_filename_f3_r1-1_1_R1.fastq
X_filename_f3_r0-1_1_R1.fastq
X_filename_f3_r0-1_2_R1.fastq
X_filename_f3_r1-0_3_R1.fastq

How can I separate the files based on f0, f1, f2, f3 and have each category saved under a different name? Then I want to separate the same files based on the numbers after the "r" but before the underscore. There are a lot of different numbers so I can't be specific in the script about the numbers to select. These same files will then be separated again based on the number before "_R1". So in the end, I will have a category of files based on f-number then a subcategory based on r number then another subcategory under that based on a number.

Thank you

drhp...@gmail.com

Jan 14, 2022, 7:34:08 AM
to Nextflow

Hi TJ!

Could you send a mock-up of how you expect the channels to look at the end? It's a little tricky to understand exactly what you want.

But to answer the first question, this worked for me using NF 21.10.6:

Create dummy files for testing:

mkdir -p fastq
touch fastq/X_filename_f0_r1-1_1_R1.fastq
touch fastq/X_filename_f0_r0-1_1_R1.fastq
touch fastq/X_filename_f0_r0-1_2_R1.fastq
touch fastq/X_filename_f0_r1-0_3_R1.fastq

touch fastq/X_filename_f1_r1-1_1_R1.fastq
touch fastq/X_filename_f1_r0-1_1_R1.fastq
touch fastq/X_filename_f1_r0-1_2_R1.fastq
touch fastq/X_filename_f1_r1-0_3_R1.fastq

1) Group reads by "f" id

Main script

#!/usr/bin/env nextflow

ch_files = Channel.fromPath( './fastq/*.fastq' )

ch_files
    .map { it -> [ it.baseName.tokenize('_')[2], it ] }
    .groupTuple()
    .set { ch_grouped_files }
ch_grouped_files.view()

Output

[f0, [/home/harshil/testing/nf/fastq/X_filename_f0_r0-1_1_R1.fastq, /home/harshil/testing/nf/fastq/X_filename_f0_r0-1_2_R1.fastq, /home/harshil/testing/nf/fastq/X_filename_f0_r1-1_1_R1.fastq, /home/harshil/testing/nf/fastq/X_filename_f0_r1-0_3_R1.fastq]]
[f1, [/home/harshil/testing/nf/fastq/X_filename_f1_r1-1_1_R1.fastq, /home/harshil/testing/nf/fastq/X_filename_f1_r1-0_3_R1.fastq, /home/harshil/testing/nf/fastq/X_filename_f1_r0-1_2_R1.fastq, /home/harshil/testing/nf/fastq/X_filename_f1_r0-1_1_R1.fastq]]

2) Group reads by sample ("f" id plus "r" id)

Main script

#!/usr/bin/env nextflow

ch_files = Channel.fromPath( './fastq/*.fastq' )
ch_files
    .map { it -> [ it.baseName.tokenize('_')[2..3].join('_'), it ] }
    .groupTuple()
    .set { ch_grouped_files }
ch_grouped_files.view()

Output:
[f0_r0-1, [/home/harshil/testing/nf/fastq/X_filename_f0_r0-1_1_R1.fastq, /home/harshil/testing/nf/fastq/X_filename_f0_r0-1_2_R1.fastq]]
[f0_r1-1, [/home/harshil/testing/nf/fastq/X_filename_f0_r1-1_1_R1.fastq]]
[f1_r1-1, [/home/harshil/testing/nf/fastq/X_filename_f1_r1-1_1_R1.fastq]]
[f0_r1-0, [/home/harshil/testing/nf/fastq/X_filename_f0_r1-0_3_R1.fastq]]
[f1_r1-0, [/home/harshil/testing/nf/fastq/X_filename_f1_r1-0_3_R1.fastq]]
[f1_r0-1, [/home/harshil/testing/nf/fastq/X_filename_f1_r0-1_2_R1.fastq, /home/harshil/testing/nf/fastq/X_filename_f1_r0-1_1_R1.fastq]]

You can play with which entries you want to target by changing the indexes in the snippets above.
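
For the full three-level split from the original question (f id, then r id, then the number before "_R1"), one possible approach is to emit each token as its own tuple element and group on all three at once. This is only a sketch along the same lines as the snippets above: the channel name ch_nested_files and the variable names are made up for illustration, and it assumes every file name follows the X_filename_<f>_<r>_<n>_R1 pattern.

```groovy
#!/usr/bin/env nextflow

ch_files = Channel.fromPath( './fastq/*.fastq' )

ch_files
    .map { f ->
        // X_filename_f0_r0-1_2_R1 -> [X, filename, f0, r0-1, 2, R1]
        def tokens = f.baseName.tokenize('_')
        // Emit the f id, r id and trailing number as separate elements
        // so each level stays available downstream (assumed name layout)
        [ tokens[2], tokens[3], tokens[4], f ]
    }
    // Group on the first three elements together
    .groupTuple( by: [0, 1, 2] )
    .set { ch_nested_files }
ch_nested_files.view()
```

Each emitted item should then have the shape [f id, r id, number, [files]], so no r values or numbers need to be hard-coded in the script.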

Cheers,

Harshil



Tj Idowu

Feb 18, 2022, 10:45:10 AM
to Nextflow
Hello Harshil,

Thank you so much for your help. You were right, I just had to change the index range to fix my problem.

Tj
