collapse value

eddieasalinas

May 25, 2012, 3:04:06 PM

to MOSAiCS User Group

When preprocessing the eland input, what is a good value for
[collapse]?

The sample command (on the page http://www.stat.wisc.edu/~keles/Software/mosaics/),
under the section called "Preprocess Eland output to bin-level files",
uses 3 as the collapse value.

Why was 3 chosen for the value?

Is 3 a kind of threshold collapse value for declaring a peak?

-Eddie

Dongjun Chung

May 26, 2012, 6:22:35 PM

to MOSAiCS User Group

Hi Eddie,

You no longer need to use the Perl script to process eland files.
In more recent versions of the mosaics package, you can process various
types of aligned read files much more easily using the constructBins()
function. The "collapse" parameter in the Perl script corresponds to
the "capping" parameter in the constructBins() function. Please check the
vignette of a recent version of the mosaics package, e.g.,

http://www.bioconductor.org/packages/2.10/bioc/html/mosaics.html

There is no theory or rigorous rule for determining the "capping" value,
and it is not related to the peak calling step either. Capping is used
only in the preprocessing step, and its purpose is to remove possible PCR
amplification artifacts.

capping = 3 means that at most 3 reads are allowed to map to exactly
the same nucleotide position on the genome. capping = 3 is
appropriate for data with low sequencing depth because we usually
observe only one read at each nucleotide position. However, for data
with high sequencing depth (e.g., HiSeq), a small capping value could
instead introduce bias, and in this case we recommend not using
capping at all. You can turn capping off by setting it to a
non-positive value, e.g., capping = 0.
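
To illustrate the idea only, here is a minimal base-R sketch of what capping does to reads that start at the same position. It is not the code used inside constructBins(); the read positions and the helper cap_reads() are made up for the example.

```r
# Hypothetical start positions of aligned reads on one chromosome and strand.
read_pos <- c(100, 100, 100, 100, 100, 250, 250, 400)

# cap_reads() is an illustrative helper, not a mosaics function.
# It keeps at most `capping` reads per position; a non-positive
# value disables capping, as described above.
cap_reads <- function(pos, capping = 3) {
  tab <- table(pos)
  positions <- as.numeric(names(tab))
  counts <- as.integer(tab)
  if (capping > 0) {
    counts <- pmin(counts, capping)
  }
  rep(positions, times = counts)
}

capped <- cap_reads(read_pos, capping = 3)    # 5 reads at position 100 become 3
uncapped <- cap_reads(read_pos, capping = 0)  # all 8 reads are kept
```

With low sequencing depth most positions carry a single read, so the cap rarely applies; with deep data, many genuine reads can pile up on one position and would be discarded by a small cap.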

Thanks,
Dongjun


eddieasalinas

May 29, 2012, 11:11:59 AM

to mosaics_u...@googlegroups.com

Hi Dongjun,

Thank you!

I need to pay attention to which R vignette I pull up from Google searches; usually a version that's *not* the most recent comes up, which didn't help me. Thank you for helping me make sure that I use the most recent version.

I have used the function you mentioned to construct bins.

Then, I used the "readBins" function to read in the bins using the command below:
binData <- readBins(type=c("chip","M","GC","N"),
    fileName=c("./XXX.eland_fragL35_bin35.txt",
               "./XXX.eland_fragL35_bin35.txt",
               "./XXX.eland_fragL35_bin35.txt",
               "./XXX.eland_fragL35_bin35.txt"))

I thought to use this function as suggested by the vignette.

Then I tried to use the mosaicsFit function for fitting (also per the vignette).
Note that this is a one-sample analysis. I am trying to speed things up by using the parallel package and all available cores.

However, I get an error "Error in .rlmFit_OS(parEst = fitParam, mean_thres = meanThres, Y = binData@tagCount, :
insufficient # of proper strata! Cannot proceed!"

Do you have an idea what I'm doing wrong?

> myFit<-mosaicsFit(binData, analysisType="OS", bgEst="automatic" ,parallel=TRUE, nCore=32)
Use 'parallel' package for parallel computing.
Info: background estimation method is determined based on data.
Info: background estimation based on bins with low tag counts.
Info: one-sample analysis.
Info: use adaptive griding.
Info: fitting background model...
Info: grid = 0.01
Info: grid = 0.02
Info: grid = 0.04
Info: grid = 0.1
Info: grid = 0.2
Info: grid = 0.5
Error in .rlmFit_OS(parEst = fitParam, mean_thres = meanThres, Y = binData@tagCount, :
insufficient # of proper strata! Cannot proceed!
>

-Eddie

Dongjun Chung

May 30, 2012, 5:13:15 PM

to MOSAiCS User Group

Hi Eddie,

I guess that this problem might be related to your choice of fragLen
and binSize.

In your output, I see that you set both fragLen and binSize to 35. I
guess that 35 bp is the read length. fragLen actually means the
fragment length, which in many cases is around 200 bp. I also
recommend setting the bin size equal to the fragment length, e.g., 200
bp.

Could you please try larger values, such as 200, for fragLen and
binSize and let me know whether it works?
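
As a rough sketch of that suggestion: the input file name below is a placeholder, the argument names follow the constructBins() calls shown elsewhere in this thread, and expected_bin_file() is a hypothetical helper that only mirrors the output naming pattern seen in your file names (e.g., "_fragL35_bin35.txt").

```r
# Assumed values following the suggestion above; adjust them to your library.
frag_len <- 200
bin_size <- 200
infile <- "XXX.eland"  # placeholder for the aligned read file

# Hypothetical helper: builds the bin-level file name that the output
# naming pattern in this thread suggests.
expected_bin_file <- function(infile, frag_len, bin_size) {
  sprintf("%s_fragL%d_bin%d.txt", infile, frag_len, bin_size)
}

# Only run the preprocessing if mosaics is installed and the file exists.
if (requireNamespace("mosaics", quietly = TRUE) && file.exists(infile)) {
  mosaics::constructBins(infile = infile, fileFormat = "eland_result",
                         byChr = FALSE, fragLen = frag_len, binSize = bin_size)
}

bin_file <- expected_bin_file(infile, frag_len, bin_size)
```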

Thanks,
Dongjun

eddieasalinas

May 31, 2012, 1:21:00 PM

to mosaics_u...@googlegroups.com

Hi Dongjun,

Thank you for your help. I have made some progress...

Using a two-sample analysis, I was able to get peak and bed files. Thank you.

However, for a one-sample analysis, I am still unable to get results.

I am getting the same "insufficient strata" error as before.

This time I am using 146, the proper tag size!

Any thoughts?

See below:

[salinasea2@p66]/analysis/MOSAiCS>R

R version 2.15.0 (2012-03-30)

ISBN 3-900051-07-0

Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.

You are welcome to redistribute it under certain conditions.

Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.

Type 'contributors()' for more information and

'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or

'help.start()' for an HTML browser interface to help.

Type 'q()' to quit R.

> library('parallel')

> library('mosaics')

Loading required package: Rcpp

> constructBins(infile="XXXXX_5_cat.sorted.sam.eland",fileFormat="eland_result",byChr=FALSE,fragLen=146, binSize=146)

------------------------------------------------------------

Info: setting summary

------------------------------------------------------------

Name of aligned read file: XXXXX_5_cat.sorted.sam.eland

Aligned read file format: Eland result

Directory of processed bin-level files: ./

Construct bin-level files by chromosome? N

List of chromosomes to be excluded?

Fragment length: 146

Bin size: 146

------------------------------------------------------------

Info: reading the aligned read file and processing it into bin-level files...

Info: done!

------------------------------------------------------------

Info: processing summary

------------------------------------------------------------

Directory of processed bin-level files: ./

Processed bin-level file: XXXXX_5_cat.sorted.sam.eland_fragL146_bin146.txt

------------------------------------------------------------

> binData<-readBins(type=c("chip","M","GC","N"),fileName=c("XXXXX_5_cat.sorted.sam.eland_fragL146_bin146.txt","XXXXX_5_cat.sorted.sam.eland_fragL146_bin146.txt","XXXXX_5_cat.sorted.sam.eland_fragL146_bin146.txt","XXXXX_5_cat.sorted.sam.eland_fragL146_bin146.txt"))

Info: reading and preprocessing bin-level data...

Info: data contains more than one chromosome.

Info: done!

------------------------------------------------------------

Info: preprocessing summary

------------------------------------------------------------

[Note] Bins with ambiguous sequences will be excluded from the analysis.

Coordinates before & after preprocessing:

chr10.fa: 0 - 135524792 -> 0 - 135524792

chr11.fa: 0 - 134946486 -> 0 - 134946486

chr11_gl000202_random.fa: 0 - 39274 -> 0 - 39274

chr12.fa: 0 - 133841266 -> 0 - 133840974

chr13.fa: 0 - 115109466 -> 0 - 115109174

chr14.fa: 0 - 107289268 -> 0 - 107289268

chr15.fa: 0 - 102521054 -> 0 - 102521054

chr16.fa: 0 - 90294722 -> 0 - 90294430

chr17_ctg5_hap1.fa: 0 - 1680606 -> 0 - 1680314

chr17.fa: 0 - 81194980 -> 0 - 81194980

chr17_gl000203_random.fa: 0 - 37376 -> 0 - 37084

chr17_gl000204_random.fa: 0 - 79424 -> 0 - 79424

chr17_gl000205_random.fa: 0 - 174616 -> 0 - 174470

chr17_gl000206_random.fa: 0 - 40588 -> 292 - 40588

chr18.fa: 0 - 78016998 -> 0 - 78016998

chr18_gl000207_random.fa: 0 - 4088 -> 0 - 4088

chr19.fa: 0 - 59118904 -> 0 - 59118904

chr19_gl000208_random.fa: 0 - 92564 -> 0 - 92564

chr19_gl000209_random.fa: 0 - 158702 -> 0 - 158410

chr1.fa: 0 - 249240542 -> 0 - 249240250

chr1_gl000191_random.fa: 0 - 106142 -> 0 - 105996

chr1_gl000192_random.fa: 0 - 547354 -> 292 - 547354

chr20.fa: 0 - 62965420 -> 0 - 62965274

chr21.fa: 0 - 48119848 -> 0 - 48119702

chr21_gl000210_random.fa: 0 - 27448 -> 292 - 27156

chr22.fa: 0 - 51243664 -> 0 - 51243372

chr2.fa: 0 - 243189426 -> 0 - 243189134

chr3.fa: 0 - 197956290 -> 0 - 197955998

chr4_ctg9_hap1.fa: 0 - 590424 -> 0 - 590424

chr4.fa: 0 - 191044212 -> 0 - 191044212

chr4_gl000193_random.fa: 0 - 187464 -> 0 - 187464

chr4_gl000194_random.fa: 0 - 190384 -> 0 - 190384

chr5.fa: 0 - 180904950 -> 0 - 180904804

chr6_apd_hap1.fa: 0 - 4622214 -> 0 - 4621922

chr6_cox_hap2.fa: 0 - 4793910 -> 0 - 4793764

chr6_dbb_hap3.fa: 0 - 4609220 -> 0 - 4609220

chr6.fa: 0 - 171051264 -> 0 - 171051264

chr6_mann_hap4.fa: 0 - 4648932 -> 0 - 4648640

chr6_mcf_hap5.fa: 0 - 4833330 -> 0 - 4833330

chr6_qbl_hap6.fa: 0 - 4611848 -> 0 - 4611556

chr6_ssto_hap7.fa: 0 - 4927354 -> 0 - 4927062

chr7.fa: 0 - 159128320 -> 0 - 159128320

chr7_gl000195_random.fa: 0 - 182938 -> 0 - 182792

chr8.fa: 0 - 146301198 -> 0 - 146301198

chr8_gl000196_random.fa: 0 - 38836 -> 438 - 38544

chr8_gl000197_random.fa: 0 - 37084 -> 0 - 37084

chr9.fa: 0 - 141129586 -> 0 - 141129440

chr9_gl000198_random.fa: 0 - 90082 -> 0 - 90082

chr9_gl000199_random.fa: 0 - 169506 -> 0 - 169360

chr9_gl000200_random.fa: 0 - 186880 -> 0 - 186588

chr9_gl000201_random.fa: 0 - 36062 -> 0 - 36062

chrM.fa: 0 - 16644 -> 0 - 16644

chrUn_gl000211.fa: 0 - 166002 -> 0 - 166002

chrUn_gl000212.fa: 0 - 186296 -> 0 - 186150

chrUn_gl000213.fa: 0 - 164104 -> 146 - 163812

chrUn_gl000214.fa: 0 - 137386 -> 146 - 137240

chrUn_gl000215.fa: 0 - 171550 -> 0 - 170966

chrUn_gl000216.fa: 0 - 172280 -> 0 - 172134

chrUn_gl000217.fa: 0 - 172134 -> 146 - 171988

chrUn_gl000218.fa: 0 - 161038 -> 0 - 161038

chrUn_gl000219.fa: 0 - 179142 -> 0 - 179142

chrUn_gl000220.fa: 0 - 161768 -> 0 - 161768

chrUn_gl000221.fa: 0 - 155198 -> 0 - 155052

chrUn_gl000222.fa: 0 - 186442 -> 146 - 186150

chrUn_gl000223.fa: 0 - 179434 -> 292 - 179142

chrUn_gl000224.fa: 0 - 179726 -> 0 - 179434

chrUn_gl000225.fa: 0 - 210970 -> 0 - 210824

chrUn_gl000226.fa: 0 - 15038 -> 0 - 15038

chrUn_gl000227.fa: 0 - 128334 -> 0 - 128334

chrUn_gl000228.fa: 0 - 129064 -> 0 - 128772

chrUn_gl000229.fa: 0 - 19710 -> 146 - 19710

chrUn_gl000230.fa: 0 - 43508 -> 0 - 43216

chrUn_gl000231.fa: 0 - 27448 -> 0 - 27156

chrUn_gl000232.fa: 0 - 40588 -> 0 - 40588

chrUn_gl000233.fa: 0 - 45844 -> 0 - 45260

chrUn_gl000234.fa: 0 - 40588 -> 0 - 40588

chrUn_gl000235.fa: 0 - 34310 -> 0 - 34310

chrUn_gl000236.fa: 0 - 41318 -> 0 - 41172

chrUn_gl000237.fa: 0 - 45844 -> 146 - 45698

chrUn_gl000238.fa: 0 - 39712 -> 0 - 39420

chrUn_gl000239.fa: 0 - 33726 -> 0 - 33726

chrUn_gl000240.fa: 0 - 41464 -> 0 - 41318

chrUn_gl000241.fa: 0 - 42194 -> 0 - 42194

chrUn_gl000242.fa: 0 - 43216 -> 0 - 43216

chrUn_gl000243.fa: 0 - 42340 -> 0 - 42340

chrUn_gl000244.fa: 0 - 38836 -> 292 - 38690

chrUn_gl000245.fa: 0 - 36062 -> 0 - 36062

chrUn_gl000246.fa: 0 - 37814 -> 0 - 37814

chrUn_gl000247.fa: 0 - 36354 -> 0 - 36354

chrUn_gl000248.fa: 0 - 39712 -> 0 - 39712

chrUn_gl000249.fa: 0 - 38252 -> 0 - 37960

chrX.fa: 0 - 155260050 -> 0 - 155259904

chrY.fa: 0 - 59363016 -> 0 - 59362870

------------------------------------------------------------

> myFit<-mosaicsFit(binData, analysisType="OS", bgEst="automatic" )

Info: background estimation method is determined based on data.

Info: background estimation based on bins with low tag counts.

Info: one-sample analysis.

Info: use adaptive griding.

Info: fitting background model...

Info: grid = 0.01

Info: grid = 0.02

Info: grid = 0.04

Info: grid = 0.1

Info: grid = 0.2

Info: grid = 0.5

Error in .rlmFit_OS(parEst = fitParam, mean_thres = meanThres, Y = binData@tagCount, :

insufficient # of proper strata! Cannot proceed!

>

-Eddie

Dongjun Chung

Jun 3, 2012, 4:26:03 PM

to mosaics_u...@googlegroups.com

Hi Eddie,

The command lines indicate that you used the same file ("XXXXX_5_cat.sorted.sam.eland_fragL146_bin146.txt") for ChIP, M, GC, and N, and I think that this causes the problem. I guess so because of this command:

> binData <- readBins( type=c("chip","M","GC","N"),
fileName=c("XXXXX_5_cat.sorted.sam.eland_fragL146_bin146.txt",
"XXXXX_5_cat.sorted.sam.eland_fragL146_bin146.txt",
"XXXXX_5_cat.sorted.sam.eland_fragL146_bin146.txt",
"XXXXX_5_cat.sorted.sam.eland_fragL146_bin146.txt"))

If this is the case, you need to use proper M, GC, and N files. From the output, I guess that you are working on the human genome. Are you using hg19 as the reference genome? If so, I have preprocessed M, GC, and N files for hg19 with fragment length = bin size = 200 bp here:

http://www.stat.wisc.edu/~chungdon/hg19/

Could you please generate the bin-level file for ChIP sample using fragment length = bin size = 200 bp and try the one-sample analysis using the proper M, GC, and N files?
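
A quick sanity check before calling readBins() could catch this kind of mix-up. The sketch below is only an illustration: check_bin_files() is a hypothetical helper, not part of mosaics, and all file names are placeholders.

```r
# Hypothetical helper: stops if the same file is supplied for more than
# one data type (ChIP, M, GC, N).
check_bin_files <- function(type, fileName) {
  stopifnot(length(type) == length(fileName))
  dup <- duplicated(fileName)
  if (any(dup)) {
    stop("the same file is used for more than one type: ",
         paste(unique(fileName[dup]), collapse = ", "))
  }
  invisible(TRUE)
}

types <- c("chip", "M", "GC", "N")
files <- c("chip_fragL200_bin200.txt",  # placeholder ChIP bin-level file
           "map_fragL200_bin200.txt",   # placeholder M file
           "GC_fragL200_bin200.txt",    # placeholder GC file
           "N_fragL200_bin200.txt")     # placeholder N file

ok <- check_bin_files(types, files)

# Passing one file four times, as in the command above, raises an error.
bad <- try(check_bin_files(types, rep(files[1], 4)), silent = TRUE)
```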

Thanks,
Dongjun

eddieasalinas

Jun 4, 2012, 7:26:01 AM

to mosaics_u...@googlegroups.com

Hi Dongjun,

Yes, the alignment is/was against hg19.

I used the link that you gave me and downloaded the hg19 data.

I tried to use the downloaded files with the ChIP data (for a one-sample analysis).

I first got an error that there were no chromosomes in common.

To address that, I removed ".fa" from the chromosome names in the bin files generated by the mosaics constructBins function.
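
In R, that kind of renaming can be sketched as follows. The chromosome IDs are examples of the names in my earlier output, and strip_fa() is just an illustrative helper, not a mosaics function.

```r
# Illustrative helper: drop a trailing ".fa" from chromosome IDs so that
# they match names such as "chr1" used in the downloaded hg19 files.
strip_fa <- function(chr) {
  sub("\\.fa$", "", chr)
}

chr_ids <- c("chr1.fa", "chr17_gl000203_random.fa", "chrM.fa")
clean_ids <- strip_fa(chr_ids)
```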

After that, I retried the analysis, but now I get a different error, "...differing number of rows."

See the log below. Any ideas?

Should I reformat the hg19 bin files with the value of 146 from the ChIP data? Would that make a difference?
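
One way to check whether the bin sizes of two bin-level files agree is sketched below. This assumes each file has been loaded into a data frame with a chromosome column and a coordinate column; the column names, the toy data, and infer_bin_size() are all assumptions made for the example.

```r
# Illustrative helper: infer the bin size as the smallest positive step
# between sorted bin coordinates within each chromosome.
infer_bin_size <- function(bins) {
  steps <- unlist(lapply(split(bins$coord, bins$chr),
                         function(x) diff(sort(x))))
  min(steps[steps > 0])
}

# Toy data standing in for a ChIP file (bin size 146) and an M file (bin size 200).
chip_bins <- data.frame(chr = "chr1", coord = c(0, 146, 292, 438))
map_bins <- data.frame(chr = "chr1", coord = c(0, 200, 400, 600))

chip_size <- infer_bin_size(chip_bins)
map_size <- infer_bin_size(map_bins)
same_bin_size <- chip_size == map_size
```

If the two sizes differ, the bins of the two files do not line up, which would be consistent with getting a different number of rows per file.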

[username@xyx]/data/xxxxx/job_single>R

R version 2.15.0 (2012-03-30)

> library('mosaics')

Loading required package: Rcpp

> binData<-readBins(

+ type=c("chip","M","GC","N"),

+ fileName=c("./Sample_GH489_sorted.eland_fragL146_bin146.nofa.txt",

+ "./map_fragL200_bin200.txt",

+ "./GC_fragL200_bin200.txt",

+ "./N_fragL200_bin200.txt")

+ )

Info: reading and preprocessing bin-level data...

Info: data contains more than one chromosome.

Info: done!

------------------------------------------------------------

Info: preprocessing summary

------------------------------------------------------------

[Note] Bins with ambiguous sequences will be excluded from the analysis.

Coordinates before & after preprocessing:

chr1: 0 - 249239958 -> 715400 - 249236600

chr10: 0 - 135524792 -> 4365400 - 135517200

chr11: 0 - 134946486 -> 4365400 - 134933200

chr12: 0 - 133840244 -> 4365400 - 133838200

chr13: 0 - 115109904 -> Inf - -Inf

chr14: 0 - 107289414 -> Inf - -Inf

chr15: 0 - 102521054 -> Inf - -Inf

chr16: 0 - 90292678 -> 4365400 - 90286400

chr17: 0 - 81193082 -> 0 - 81190600

chr18: 0 - 78016852 -> 715400 - 78007800

chr19: 0 - 59118174 -> 4365400 - 59115400

chr2: 0 - 243184608 -> 715400 - 243177600

chr20: 0 - 62964982 -> 4365400 - 62955200

chr21: 0 - 48119410 -> Inf - -Inf

chr22: 0 - 51239722 -> Inf - -Inf

chr3: 0 - 197956144 -> 4365400 - 197946800

chr4: 0 - 191044212 -> 715400 - 191041000

chr5: 0 - 180900716 -> 715400 - 180894000

chr6: 0 - 171047468 -> 4365400 - 171039000

chr7: 0 - 159128466 -> 715400 - 159125400

chr8: 0 - 146301490 -> 715400 - 146292000

chr9: 0 - 141146522 -> 715400 - 141138200

chrX: 0 - 154929360 -> 4365400 - 154920600

chrY: 0 - 59033056 -> 715400 - 59027800

------------------------------------------------------------

There were 48 warnings (use warnings() to see them)

> warnings()

Warning messages:

1: In Y > 0 & M == 0 :