MiSeq vs NextSeq

Lea Jessop

Aug 7, 2017, 12:11:46 PM

to HiC-Pro

I'm struggling to get my Hi-C to work. I finally generated a library that gave good QC metrics when I sequenced it on the MiSeq. Now I've run it on a NextSeq, and 77% of the mapped reads are reported as duplicates.

Ideas or suggestions welcome!

Lea

plotHiCContactRanges_miseq.pdf

plotHiCFragment_miseq.pdf

plotHiCContactRanges_nextseq.pdf

plotHiCFragment_nextseq.pdf

plotMapping_nextseq.pdf

plotMappingPairing_nextseq.pdf

nservant

Aug 7, 2017, 4:07:01 PM

to HiC-Pro

Hi Lea,
A few comments.
First, I see some bugs in your R plots. I updated the R code a couple of weeks ago in the devel branch of HiC-Pro. I have not had time to release a new version yet, but you can still download and use the devel version.
Then, regarding your MiSeq results, everything does look good, although 10,000 reads is not much.
About your NextSeq run: first, I think something is wrong with the LIGATION_SITE parameter in your configuration file. You have 0% trimmed reads, which is clearly not expected.
I think that if you provide the correct ligation motif, you should rescue a significant proportion of the unmapped reads, and therefore get many more valid pairs.
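For readers unsure what the ligation motif should look like, here is a minimal Python sketch that derives it from a restriction site, assuming a palindromic enzyme that leaves a 5' overhang and a standard fill-in plus blunt-end ligation protocol. The function name `ligation_site` is hypothetical and is not part of HiC-Pro, and the enzymes shown are only examples since the thread does not say which one was used.

```python
def ligation_site(recognition_site: str) -> str:
    """Return the junction expected after fill-in and blunt-end ligation.

    `recognition_site` marks the top-strand cut with '^', e.g. 'A^AGCTT'.
    Assumption: palindromic site cut to leave a 5' overhang (or a blunt end).
    """
    cut = recognition_site.index("^")
    site = recognition_site.replace("^", "").upper()
    if cut > len(site) / 2:
        raise ValueError("only 5' overhang or blunt cutters are handled")
    # After fill-in, the left fragment extends to the bottom-strand cut,
    # and the right fragment starts at the top-strand cut.
    left = site[: len(site) - cut]
    right = site[cut:]
    return left + right


for name, site in [("HindIII", "A^AGCTT"), ("MboI/DpnII", "^GATC")]:
    print(name, ligation_site(site))
```

The resulting string is the kind of value that would go after `LIGATION_SITE =` in the configuration file; it should be checked against the enzyme actually used in the experiment.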
Finally, regarding your proportion of duplicates, I agree that this is a bit surprising given your first MiSeq run.
I think it would be good to double-check it, for instance by running another QC tool such as FastQC, or just Picard MarkDuplicates on the BAM file.
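As a quick independent sanity check alongside those tools, one could also count exact duplicate read pairs directly in the FASTQ files. The sketch below is only a rough approximation under stated assumptions: it matches pairs by identical sequence, which is not how Picard (alignment positions) or FastQC (a sampled per-file estimate) work, so the numbers will not agree exactly. The function names are hypothetical.

```python
import gzip
from collections import Counter
from itertools import islice


def read_sequences(path):
    """Yield the sequence line of each record in a FASTQ file (gzipped or not)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        # FASTQ records span 4 lines; the sequence is the 2nd line of each.
        for line in islice(handle, 1, None, 4):
            yield line.strip()


def duplicate_fraction(r1_path, r2_path, max_pairs=None):
    """Fraction of read pairs whose (R1, R2) sequences were already seen.

    `max_pairs` limits how many pairs are read, to keep memory use bounded.
    """
    pairs = zip(read_sequences(r1_path), read_sequences(r2_path))
    counts = Counter(islice(pairs, max_pairs))
    total = sum(counts.values())
    return 0.0 if total == 0 else 1 - len(counts) / total


# Example call (file names are placeholders):
# duplicate_fraction("sample_R1.fastq.gz", "sample_R2.fastq.gz", max_pairs=1_000_000)
```

Because sequencing errors make true duplicates look different, this exact-match count tends to give a lower bound rather than a precise rate.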
Let me know if you find any explanation.
Good luck
N

Lea Jessop

Aug 11, 2017, 3:09:00 PM

to HiC-Pro

I've managed to identify and remove duplicates using Picard. Now I'd like to run the HiC-Pro pipeline in sequential mode, starting from the BAM files that have duplicates removed, but I'm getting an error.

I use the following:

/DCEG/Resources/Tools/HiC-Pro/2.7.8/opt/HiC-Pro_2.7.8/bin/HiC-Pro -i /DCEG/Branches/LTG/Chanock/Lea/HiC/HiC_ACHN_06122017/testingBAMs/rawdata/ -o /DCEG/Branches/LTG/Chanock/Lea/HiC/HiC_ACHN_06122017/testingBAMs/ -c /DCEG/Branches/LTG/Chanock/Lea/HiC/HiC_ACHN_06122017/testingBAMs/config-hicpro.txt -s proc_hic -s quality_checks -p

and I get

Run HiC-Pro 2.7.8 parallel mode

find: File system loop detected; `rawdata/rawdata' is part of the same file system loop as `rawdata'.

make: *** [make_cluster_script] Error 1

I'm using only the sample_S0_L002_R1_001_hg19.bwt2merged.bam and sample_S0_L002_R2_001_hg19.bwt2merged.bam files generated when I ran the pipeline on the fastq.gz files. I ran Picard MarkDuplicates with REMOVE_DUPLICATES=TRUE to generate a new BAM file for each of the two reads. These two BAM files are what I'm trying to use as input.

Any idea what I'm doing wrong here?

Thanks

Lea

nservant

Aug 11, 2017, 5:00:24 PM

to HiC-Pro

Hi Lea,
Out of curiosity, how many duplicates did you remove with Picard? It would be useful to see whether this agrees with what HiC-Pro found.
As for the error, it occurs because your input folder sits inside your output folder. HiC-Pro first creates a 'rawdata' link in its output folder, but here a 'rawdata' folder already exists there, so the link ends up pointing into itself and creates a kind of infinite loop.
Anyway, change the output folder and it should work!
Best
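The collision described above can be caught before launching the pipeline. This is a minimal sketch based only on the explanation in this thread, not on HiC-Pro's source code; `check_hicpro_dirs` is a hypothetical helper, and the example paths are placeholders.

```python
from pathlib import Path


def check_hicpro_dirs(input_dir, output_dir):
    """Return a list of problems that could make the 'rawdata' link collide."""
    inp = Path(input_dir).resolve()
    out = Path(output_dir).resolve()
    problems = []
    if inp == out:
        problems.append("input and output are the same folder")
    if out in inp.parents:
        problems.append("input folder is inside the output folder")
    link = out / "rawdata"
    # A real folder named 'rawdata' in the output would clash with the link.
    if link.exists() and not link.is_symlink():
        problems.append(f"{link} already exists and is not a link")
    return problems


# Placeholder paths mirroring the layout in the failing command.
print(check_hicpro_dirs("project/rawdata", "project"))
```

An empty list means none of these particular issues were detected; it does not guarantee that the run will succeed.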

Lea Jessop

Aug 15, 2017, 10:06:06 AM

to HiC-Pro

Thanks.

Correcting that did get the pipeline to produce the two scripts. Unfortunately, I could not get them to run properly and generate the correct analysis output files. I'm still not sure why, but in going through all the log files I discovered a very silly mistake in my original analysis that is the likely cause of the high number of duplicates. I'm embarrassed to admit it, but after I split the fastq files I left the original fastq.gz files in the rawdata folder. Given this oversight, I'd actually expect the fraction of duplicates in the alignment to be even higher than what HiC-Pro reported. Nevertheless, I'm re-running the analysis now and will post the results when they come.

Lea

Lea Jessop

Aug 16, 2017, 9:43:03 AM

to HiC-Pro

Re-analysis complete. I'm getting 46% duplicates from HiC-Pro, a bit higher than what I saw from FastQC, which ranged from 15% to 36%. So, as is almost always the case, the issues were user induced.
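As a rough consistency check on the numbers in this thread, the arithmetic below estimates what duplicate fraction to expect when every read pair is supplied twice (the original files plus their split copies). It assumes every pair was exactly doubled, that all extra copies are detected as duplicates, and that the percentages from the two runs are computed on comparable sets of reads; none of this is confirmed in the thread. The function name is hypothetical.

```python
def doubled_input_duplicate_fraction(true_fraction, copies=2):
    """Expected duplicate fraction when each read pair is present `copies` times.

    If a fraction `true_fraction` of N pairs are real duplicates, there are
    (1 - true_fraction) * N unique pairs among copies * N pairs in total.
    """
    unique = 1 - true_fraction
    return 1 - unique / copies


# Using the 46% reported after the re-analysis:
print(f"{doubled_input_duplicate_fraction(0.46):.2f}")
```

Under these assumptions the result is about 0.73, in the same range as the 77% reported for the first NextSeq analysis.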