ChIP DiffBind

James Saliba

unread,

Apr 24, 2023, 6:49:26 PM4/24/23

to GenPipes

Hello,

I am confused about how to setup the configuration file for DiffBind.
I have 2 treatments, each has 2 inputs and 2 IPs (bio replicates)

During normal analysis, for every treatment, i set the treatments' 2 inputs at 1 and its 2 IPs at 2.

But i would like to compare the two treatments now. Do i set one treatment set as 1 and another as 2 and do i also number their inputs or does DiffBind not take inputs?

Finally, can i start the job on the existing files if i run -s 17

Thanks!

Mareike Janiak

unread,

Apr 27, 2023, 10:36:29 AM4/27/23

to GenPipes

Hi James,

The GenPipes documentation has a section on the design file format used for the chipseq pipeline:

https://genpipes.readthedocs.io/en/genpipes-v4.3.2/user_guide/pipelines/gp_chipseq.html

I hope that this clears things up, but let us know if you still have questions and we'll do our best to help.

As for re-starting a job on existing files, yes, if all previous steps have completed successfully and you would just like to repeat the differential binding step, you should be able to achieve this by specifying -s {step_number}. Note that if you have already run this step once successfully, you may need to also use the -f flag to force the step to repeat.

Best,

Mareike

Mareike Janiak

unread,

Apr 27, 2023, 1:47:50 PM4/27/23

to GenPipes

Follow-up from James:

"Hi Mareike,

My question is about section 17 (differential binding). The setup of the readset and design file is clear for the general pipeline but not for section 17. Section 17s are unclear and clash with the initial file setup.

The chip instructions tells us to put the input at 1 and IP at 2 for each comparison. Thats fine and works for regular ChIP.

But step 17 (used to compare 2 IPs) where we compare between 2 treatments each having its own input. Its saying to put one treatment at 1 and the other at 2. But what about their respective inputs? Do we number their inputs too for that column?

Best,

James"

James Saliba

unread,

Apr 27, 2023, 1:51:01 PM4/27/23

to GenPipes

Hi Mareike,

My question is about section 17 (differential binding). The setup of the readset and design files is clear for the general pipeline (calssic chip example below) but not for section 17.

The initial chip instructions tells us to put the input at 1 and IP at 2 for each treatment separately. Thats fine and worked.

But step 17 (used to compare between 2 IP treatments) is unclear about what to do with the inputs. Its saying to put one treatment at 1 and the other at 2. But what about their respective inputs? Do we number their inputs too for that column? or just leave the inputs at 0.

ank you
Thank you

Best,
James

Mareike Janiak

unread,

Apr 27, 2023, 2:18:30 PM4/27/23

to GenPipes

Hi James,

Thanks for clarifying your question and sending the table as an example.

I think there is a misunderstanding about how the design file is used within the pipeline. The design file is not used at all for most of the steps of the pipeline, because it is only needed to complete the differential binding analysis (step 17). During the other steps, the information from the readset file is used to identify the input and IP samples.

The differential binding analysis is also not intended to compare input to IP samples, but to compare treatments only. The inputs do not need to be mentioned in the design file, because only the peaks (generated from comparing input to IP in the previous steps) are compared at this point.

In the table you've shared, the final column is closest to what you want your design file to look like, but it is not necessary to include the inputs with 0s:

Design file:

Sample MarkName A_vs_B
TreatmentA H3K27ac 1

TreatmentB H3K27ac 2

All of the other steps distinguish between the input and IP via the information in the readset file (first 4 columns):

Sample Readset MarkName MarkType

TreatmentA Readset1 H3K27ac N

TreatmentA Readset2 Input I

TreatmentB Readset3 H3K27ac N

TreatmentB Readset4 Input I

I hope this helps! We're working on an update in response to your other question and will have that available soon.

Best,

Mareike

James Saliba

unread,

Apr 27, 2023, 3:54:40 PM4/27/23

to GenPipes

Hi Mareike,

That makes so much sense, thank you!

Best,
James

James Saliba

unread,

Apr 27, 2023, 4:55:19 PM4/27/23

to GenPipes

Hi Mareike,

I have all the above samples but in biological replicates so two of each. Would you please tell me how to edit the files as the way i did it step 17 gave an error:
"At leaset two treatments and controls should be defined. Skipping differential binding analysis for AvsB

What i did:

Sample Readset MarkName MarkType

TreatmentA Readset1 H3K27ac N

TreatmentA Readset2 Input I

TreatmentB Readset3 H3K27ac N

TreatmentB Readset4 Input I

TreatmentA Readset5 H3K27ac N

TreatmentA Readset6 Input I

TreatmentB Readset7 H3K27ac N

TreatmentB Readset8 Input I

This generates only 4 BAM files total. InputA InputB, treatmentA treatmentB
However, since the design file requires 2 files for each treatment. It does not accept this

Sample MarkName A_vs_B
TreatmentA H3K27ac 1

TreatmentB H3K27ac 2

Therefore it needs

Sample MarkName A_vs_B
TreatmentA1 H3K27ac 1

TreatmentB1 H3K27ac 2

TreatmentA2 H3K27ac 1

TreatmentB2 H3K27ac 2

But the above means my readsets need to be named:

Sample Readset MarkName MarkType

TreatmentA1 Readset1 H3K27ac N

TreatmentA1 Readset2 Input I

TreatmentB1 Readset3 H3K27ac N

TreatmentB1 Readset4 Input I

TreatmentA2 Readset5 H3K27ac N

TreatmentA2 Readset6 Input I

TreatmentB2 Readset7 H3K27ac N

TreatmentB2 Readset8 Input I

However this means the software will not know that the samples are biological replicates when doing peak calls for the steps before 17.

Is this they way it is. 2 separate types of analysis for when you want peak calls with replicates vs when we need diffbinding?
Or did i edit wrongly?
Thank you

Best,
James

On Thursday, April 27, 2023 at 2:18:30 PM UTC-4 mareike...@computationalgenomics.ca wrote:

José Héctor Gálvez

unread,

Apr 27, 2023, 5:27:15 PM4/27/23

to GenPipes

Hello James,

I think that this is more likely a limitation of DiffBind than our pipeline. From the DiffBind documentation, by default DiffBind expects at least 3 replicates per experimental group:

“This call will set up any "default" contrasts by examining the project metadata factors and assuming we want to look at the differences between any two sample groups with at least three replicates in each side of the comparison (that is, any factor that has two different values where there are at least three samples that share each value.)”

This makes sense, since DiffBind does a normalization step within the differential binding analysis to improve results, which means that it expects experimental groups to have an n>1 . I think we have adapted it to work with 2 samples per experimental group, but the statistics of differential binding analysis probably break apart if you have an n=1.

In general, any differential analysis benefits greatly from having an n >1, in general, we tend to recommend an n >3 if possible, but even more if the effect you are looking for is not very pronounced. Anything below 3 replicates per experimental group will likely result in a lot of false positives and negatives that will be hard to correct for.

With that being said, it is important to a make a distinction between technical and biological replicates and how they translate to “readsets” in GenPipes:

You can consider a technical replicate to be the exact same library/sample sequenced more than once. This is what we call a “readset” in GenPipes and we merge it by default because, in principle, the data is coming from the same sequencing library, therefore there are no biological differences. An example of this would be a library that pooled spread across several sequencing lanes in a sequencing flowcell to achieve higher depth of coverage. Since it is the same library (readset) we can merge it within GenPipes to have a single alignment file
A biological replicate can have several meanings that are highly dependent on your experimental design. On a very basic level, a biological replicate is a sample that meets the same experimental conditions, but crucially, is not actually the same sequencing library. Using a very simple naïve example, if you have an experiment where you want to compare binding sites between two cell lines, you might set up your experiment to extract DNA from aliquots of the same cell line at different time points in the day… each of those aliquots is therefore a biological replicate (in your experiment, they would belong in the same experimental group since they come from the same tissue). However they are not the same readset, since each aliquot might have slight biological variation depending on the cells, it would not be advisable to merge the data coming from the sequencer. In that case, when working with GenPipes you need to label these kinds of replicates as different Samples.

Remember, in the ChIP seq pipeline, your experimental groups have to have an n>1 of samples, not readsets.

So to keep the example going, you have an ChIP experiment comparing two cell lines: A and B. You sample the cell lines at 2 different points in the day, so you have: A1, A2, B1, and B2. Finally, when preparing your ChIP libraries, each for each sample you have your mark (H3K27ac) plus it’s input. You now have 8 different Samples:

A1_ H3K27ac
A1_ Normal
A2_ H3K27ac
A2_ Normal
B1_ H3K27ac
B1_ Normal
B2_ H3K27ac
B2_ Normal

Notice that so far, each sample consists of one readset. In your design file (like Mareike explained) you should have something like this and it should work:

Sample MarkName A_vs_B

A1 H3K27ac 1

B1 H3K27ac 2

A2 H3K27ac 1

B2 H3K27ac 2

Now, you might be wondering what is the point of readsets and how you should use them. To extend our example, let’s assume that when you send your samples for sequencing, the technicians preparing the libarires load each of your samples in two sequencing lanes, which means that for each you get 2 sets of paired fastq files (or equivalent) from the sequencer, 1 pair for lane 1 and 1 pair for lane 2.

A1_ H3K27ac_lane1
A1_ Normal_lane1
A1_ H3K27ac_lane2
A1_ Normal_lane2 … etc

Since the only difference between them is just the sequencing lane, it doesn’t make sense to consider these true biological replicates, they are just a technical replicate that is a result of how samples were loaded into the sequencer. Therefore, they are readsets, you will want to merge them in the end, because they don’t reflect your experimental design.

Based on the information you shared with us so far, it seems that what you have in your examples are not actually readsets but different samples, which is why, when you are merging them, GenPipes is erroring out. You are essentially reducing your experimental groups to an n=1 and therefore all the differential binding analyses will not work.

I hope this makes sense and let us know if you still have any other questions.

Best,

Jose Hector Galvez

Bioinformatics Manager - Technical Development

Canadian Centre for Computational Genomics

Office Tel. (514) 746-8812

--
You received this message because you are subscribed to the Google Groups "GenPipes" group.
To unsubscribe from this group and stop receiving emails from it, send an email to genpipes+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/genpipes/c51b0732-36f1-4cfd-8938-0e7350402224n%40googlegroups.com.

James Saliba

unread,

Apr 29, 2023, 12:49:16 PM4/29/23

to GenPipes

Hi Jose,
Thank you for your detailed response, it was very illuminating!

Best,
James

Reply all

Reply to author

Forward