Quality pairs with singleton not mapping

Clifford Rostomily

unread,

Oct 18, 2022, 1:28:37 PM10/18/22

to HiC-Pro

Hello,

We're seeing something weird where a large proportion of our reads (40-60%) are being flagged as pairs with singleton during the bowtie2 global step. I re-ran HiC-pro with the remove singletons option as false and found that the pairs with singleton are classified as "Unique_singleton_alignments."

I then isolated 20 of these "Unique_singleton_alignments" reads and checked them with the UCSC Blat tool. This showed that showed that:

1) 8 of the 20 read pairs can be classified into the type 1, in which one of the two paired-reads is chimeric (i.e., portion of the read mapped to one genomic region, the other portion mapped to a different genomic region);

2) 11 of the 20 read pairs can be classified into the type 2, in which both of the two paired-reads can be uniquely mapped to the mouse genome;

3) only 1 of the 20 read pairs can be classified into the type 3, in which one of the two paired-reads cannot find perfect match in the genome (<95% match).

I am wondering if this could be an issue with our bowtie2 parameters in our config? We are just using the default options:

BOWTIE2_GLOBAL_OPTIONS = --very-sensitive -L 30 --score-min L,-0.6,-0.2 --end-to-end --reorder
BOWTIE2_LOCAL_OPTIONS = --very-sensitive -L 20 --score-min L,-0.6,-0.2 --end-to-end --reorder

Also of note is that this is a DNase experiment, and we have adjusted parameters for that as suggested.

Any thoughts?

Thanks,

Cliff

nservant

unread,

Oct 18, 2022, 1:40:49 PM10/18/22

to HiC-Pro

Hi Cliff

Thank you very much for your feedbacks. Very interesting ...

My comments below

1) 8 of the 20 read pairs can be classified into the type 1, in which one of the two paired-reads is chimeric (i.e., portion of the read mapped to one genomic region, the other portion mapped to a different genomic region);

Arff, that's good to know. In the test I did on DNase Hi-C data, I didn't have so many reads like this, but I guess it depends on the fragment size, and on the reads size ?

I worked a lot with small reads (<50bp). How long are your reads ?

Maybe you can trim them ? or adjust the bowtie2 parameters to run it in local alignment mode (instead of global end-to-end mapping).

Then, in many Hi-C analysis, people are now moving to bwa-mem to align Hi-C reads (which allow soft clipping, and should work better in your cae).

HiC-Pro is not compatible with bwa (and will not be). However, this is something we will work on in the context of the nf-core-hic pipeline instead.

2) 11 of the 20 read pairs can be classified into the type 2, in which both of the two paired-reads can be uniquely mapped to the mouse genome;

Did check how theu are flagged in the BAM files after bowtie2 ?

Indeed, the default mapping parameter we are using are stringent.

For instance, you can simple try something like ;

--very-sensitive --end-to-end --reorder

by removing the penalities. It should be better.

3) only 1 of the 20 read pairs can be classified into the type 3, in which one of the two paired-reads cannot find perfect match in the genome (<95% match).

ok

Hope it helps !

Best

Nicolas

Clifford Rostomily

unread,

Oct 18, 2022, 5:32:42 PM10/18/22

to HiC-Pro

Nicolas,

Thanks for your prompt reply!

I worked a lot with small reads (<50bp). How long are your reads ?

Our reads are a bit longer (~150bp).

Did check how they are flagged in the BAM files after bowtie2 ?

To isolate the problem reads I just rewrote the mergeSAM.py script to save the problem singleton reads instead of just counting them. I did not check their BAM flags. Looking at them now I see that they all have a flag of 16, 0, or 4. Here is an example problem read pair:

MG01HX02:1180:HJ5N7CCX2:4:1101:28006:1450 16 chrX 77392295 24 151M * 0 0 AGTAGATGCTGAACCTATATATACATATAATCACTACCACAGCTTACATATGTCCTCTCTGGAAACACAAGTGTACATACTCACTCACAGCAAATATGTGTGTGTGTGCGTGTGTGTGTGTGTGTGTTTTATGTATGTATGTATGTATGCN AJFAFFJJFJ<FJJFJFJ<J<JFFFJJFAJFAAJ<-A-JJF<7<JJJJFJJJJA<7JFJFJJJ<JJJAAAJFJJF<FFJJJJJJJJJFFF-AJJJJJFFJJJFJFJFJJJJJFJFFFF-JFAJJJJJJJJJJJAFAF-<JJFAFJFAAAA# AS:i:-11 XN:i:0 XM:i:4 XO:i:0 XG:i:0 NM:i:4 MD:Z:35C19A34A59A0 YT:Z:UU RG:Z:BMG
MG01HX02:1180:HJ5N7CCX2:4:1101:28006:1450 4 * 0 0 ** 0 0 NTTGCCTTTGTACGTCTATGCAATTTAGTATTTGAGGCCTCAAAATCATTTCTGATTGCATATAAACATTAGACAGAGCCAGCAGGGTAGGAGAGTGATTGCTGTGCAGAGTGGGAGCGAGGCCACAGAGGATTTACACAACTACTAAATG #AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJFJJJJJFJ<JJJJJJJJJJJJJJJFJJJFJJJJJJJJJJJJJJJJJAFAJJJJJJJJJAJJJ-< YT:Z:UU RG:Z:BMG

For instance, you can simple try something like ;

--very-sensitive --end-to-end --reorder

We supplied the --very sensitive flag, are you suggesting the defaults for --score-min and -L should perform better (--score-min L,-0.6,-0.6 -L 20)?

Does re-running with the following bowtie2 options seem reasonable, or would you suggest I still use end-to-end in the global step?

BOWTIE2_GLOBAL_OPTIONS = --very-sensitive-local --local --reorder
BOWTIE2_LOCAL_OPTIONS = --very-sensitive-local --local --reorder

Thanks,

Cliff

nservant

unread,

Oct 19, 2022, 3:25:59 AM10/19/22

to HiC-Pro

I think your reads are too long, that's why you have some many chimeric reads.

You do not need to have reads of 150nt to uniquely align them on the genome ... 50bp or 75bp are usually enough, except if your biological questions is related to repeated elements of course ... in that case, longer reads could help.

I would suggest to trim them to 50/75 bp and to rerun HiC-Pro to see the difference.

Regarding the bowtie2 option, you can indeed use the default one, that's fine.

For the local strategy, you can try it if you want to use the full length of the reads (150nt), otherwise if you trim them at 50bp, you can use the global setting.

Between the two solutions, I would advice the reads trimming ... instead of the local mapping.

Imagine that you have a fragment/reads like this, with a ligation close to the extremity ;

|---------------|-------------------------------------------------------------------------------------| - fragment

---------------------------------------> <----------------------------------------- - reads

-----------------------> <----------------------------------------- - local alignment

--------------> <----------------------------------------- - what you want

If you use bowtie2 in local mode, it migth align with a better score the longest part of the reads (the 3' part), and thus create a lot of 'false dangling end'.

But actually, you want to force the alignment from the 5' of the reads, and you cannot do that with bowtie2 (while it seems you can with bwa-mem)

If you trim the reads and align in end-to-end mode, you should therefore improve the mapping by focusing on the 5' end of the reads. It will not be perfect, but it should be much better.

Hope it's clear :)

N

maharshich...@gmail.com

unread,

Jan 7, 2023, 12:30:45 PM1/7/23

to HiC-Pro

Hi Cliff,

Did you try Nicolas' solution of trimming your reads upto 50-75bp? Did it help?

Clifford Rostomily

unread,

Jan 13, 2023, 6:02:32 PM1/13/23

to HiC-Pro

Hey,

Sorry I did not try this, and have not been working on this project. I believe someone in the lab may be working on it still and I can see if they got it working. I'm unsure if their plan was to trim or just switch to bwa-mem though.