Juicer on multiple fastq.gz files


Jenny

Jan 8, 2020, 7:59:58 PM
to 3D Genomics
Hello,
I have multiple fastq.gz files from a single sample, and I want to run Juicer to combine all of the reads into one Hi-C map without first merging the fastq files into a single file. Is this possible, and how should I organize the directories?
Thanks,
Jenny

Neva Durand

Jan 9, 2020, 8:29:45 AM
to Jenny, 3D Genomics
Yes. Just be sure the read 1 and read 2 files of each pair have the same stem (and have R1 and R2 in the name), and put them all in the fastq directory. Look at the documentation here for more info:

GitHub.com/aidenlab/juicer/wiki

--
Neva Cherniavsky Durand, Ph.D.
Pronouns: she, her, hers
Assistant Professor, Aiden Lab

Jiani Yin

Jan 9, 2020, 2:21:33 PM
to Neva Durand, 3D Genomics
Hi Neva,
Thank you for your rapid response.

So in the fastq/ directory, I can for example have:
Batch1_R1.fastq.gz
Batch1_R2.fastq.gz
Batch2_R1.fastq.gz
Batch2_R2.fastq.gz
Batch3…
And then, when I run juicer.sh, it will take all the fastq.gz files and generate a single map?

Thanks,
Jenny

Neva Durand

Jan 9, 2020, 3:08:11 PM
to Jiani Yin, 3D Genomics
Yes, exactly. 
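
Before starting a long run, it can help to confirm that every read 1 file in the fastq directory has a read 2 partner with the same stem. The following is a minimal sketch of such a check, not part of Juicer. It assumes the pairs are marked with "_R1" and "_R2" as in the example file names above. The find_pairs helper and the glob patterns are hypothetical and chosen for illustration only.

```python
import re
from pathlib import Path

# Assumption: read 1 files carry "_R1" followed by "_" or "." in their name,
# and the matching read 2 file has the same name with "_R2" instead.
R1_TOKEN = re.compile(r"_R1(?=[_.])")


def find_pairs(fastq_dir):
    """Return (pairs, unpaired) for the fastq files found in fastq_dir.

    pairs is a list of (read1_name, read2_name) tuples.
    unpaired is a list of file names that have no partner.
    """
    fastq_dir = Path(fastq_dir)
    pairs, unpaired = [], []

    # Look for a read 2 partner for every read 1 file.
    for r1 in sorted(fastq_dir.glob("*_R1*.fastq*")):
        r2 = r1.with_name(R1_TOKEN.sub("_R2", r1.name, count=1))
        if r2 != r1 and r2.exists():
            pairs.append((r1.name, r2.name))
        else:
            unpaired.append(r1.name)

    # Report read 2 files that were not matched to any read 1 file.
    matched_r2 = {r2_name for _, r2_name in pairs}
    for r2 in sorted(fastq_dir.glob("*_R2*.fastq*")):
        if r2.name not in matched_r2:
            unpaired.append(r2.name)

    return pairs, unpaired
```

Calling find_pairs("fastq") from the directory that contains fastq/ returns the matched pairs and any leftover files. An empty second list suggests the naming is consistent.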

Etienne Danis

May 12, 2021, 1:58:42 PM
to 3D Genomics
Dear Neva,

Thank you very much for creating this whole suite of tools for Hi-C data processing!
And thank you very much for patiently answering the questions from so many users of your tools!

Based on your previous answer (see above), the juicer.sh script can process all the fastq files from a sample as long as they are located in the same directory and each pair has the same stem name (with R1 and R2 in the name).
Is this the case for all juicer.sh scripts (I'm trying to decide between the CPU juicer.sh script and the LSF juicer.sh script)?
Is it still necessary to use the mega.sh script to combine all the data into a single .hic file?
When is mega.sh useful to run?

Thank you very much in advance for your help!

Best regards,
Etienne

Neva Durand

May 12, 2021, 2:01:28 PM
to Etienne Danis, 3D Genomics
Hello,

With a lot of reads, the CPU version will take an inordinate amount of time to run. However, the LSF script is not really maintained because we don't have an LSF system with which to debug it. If you have a billion-read map you will have to use LSF. For 100M reads, you can try CPU, but it might take a long time to run, particularly the merge sort.

The mega.sh script is for combining replicates.

Best
Neva



--
Neva Cherniavsky Durand, Ph.D. | she, her, hers
Assistant Professor |  Molecular and Human Genetics
Aiden Lab | Baylor College of Medicine

Etienne Danis

May 12, 2021, 4:03:15 PM
to 3D Genomics
Thank you very much, Neva, for your very prompt answer!
This is very good to know. It makes sense. 

Since I do not have access to an HPC cluster running SLURM or another scheduler, I'm wondering whether I should use AWS.
(I was initially planning to run the first part of Juicer, up to the .hic file, on LSF and then run HiCCUPS on AWS.)
Do the updates made to juicer.sh (for versions other than the LSF one) slightly or drastically improve loop calling, or do they just remove minor bugs or speed up the processing?
Could the quality of my results be improved (better sensitivity or resolution, for example) if I used a more recent version of Juicer?

If I understood your last answer correctly, juicer.sh can process multiple fastq files from different sequencing runs of the same sample (batch1_R1.fastq.gz, batch1_R2.fastq.gz, batch2_R1.fastq.gz, batch2_R2.fastq.gz, batch3_R1.fastq.gz, batch3_R2.fastq.gz, ...) and create one .hic file with all the data combined, while mega.sh can combine .hic files from different replicates of the sample.
For example, I now have 1.2 billion reads for one sample split into 9 distinct fastq files (18 in total with R1 and R2). juicer.sh will automatically analyze all 9 fastq files and give me one .hic file. At this stage, I won't need to use mega.sh.
But suppose that later I collect Hi-C data from the same sample at a different cell passage, get another 1.2 billion reads split into multiple fastq files, and combine those into one .hic file. If I then want to merge this second .hic file with the first one, I would have to run mega.sh to combine the two.
Am I correct?

Thank you very much in advance!
Best,
Etienne
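
To make the two cases in this question concrete, here is a minimal sketch, assuming each juicer.sh run lives in its own directory under a common parent and leaves its results in an aligned/ subdirectory. The parent directory layout, the merged_nodups.txt file name, and the runs_ready_for_mega helper are assumptions for illustration and are not confirmed in this thread, so check them against the Juicer wiki before relying on them. The sketch only lists which runs look complete enough to be combined as replicates.

```python
from pathlib import Path


def runs_ready_for_mega(top_dir, result_name="merged_nodups.txt"):
    """List the run directories under top_dir that contain aligned/<result_name>.

    Assumption (not confirmed in this thread): each replicate was processed
    by its own juicer.sh run in top_dir/<run>/ and wrote its deduplicated
    read list to top_dir/<run>/aligned/<result_name>.
    """
    top_dir = Path(top_dir)
    ready = []
    for result in sorted(top_dir.glob(f"*/aligned/{result_name}")):
        if result.is_file():
            # result is top_dir/<run>/aligned/<result_name>, so the run
            # directory is two levels up from the file.
            ready.append(result.parent.parent.name)
    return ready
```

For the scenario described here, a call such as runs_ready_for_mega("my_sample") would be expected to return one name per cell passage once both juicer.sh runs have finished ("my_sample" is a hypothetical directory name).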

