Generating bin-level GC, N, Mapability for HG19+EBV

124 views
Skip to first unread message

Kyle McChesney

unread,
Apr 22, 2014, 11:53:11 AM4/22/14
to mosaics_u...@googlegroups.com
Hello,

I am attempting to analyse some chip-seq data using MOSAICS. My main goal is to include the Epstein-Barr genome in hg19, and have it treated as another chromosome in the analysis. I have downloaded the preprocessing scripts from the website, and have been successful in generating the bin-level GC and N data for all of the chromosomes (including chrEBV as I have it named). I have tried to generate the mappability using the PeakSeq scripts linked on the pre-process page. Currently I have each chromosome in a seperate .fa file, and the PeakSeq python script has been running for quite some time (days). I tried to run the Parrallel version, but it seems to require connection to a different institutions network.

I have the feeling that I am doing something wrong...

Should I merg all of the hg19 chromosomes into on fa or multi-fasta file and then generate all the bin level data like that?
Or should I keep them separate and then call readBins for each chromosome?

Sorry for the rather open-ended question!
Thanks,
Kyle

Dongjun Chung

unread,
Apr 23, 2014, 10:56:43 PM4/23/14
to mosaics_u...@googlegroups.com
 Hi Kyle,



On Tue, Apr 22, 2014 at 11:53 AM, Kyle McChesney <mbio...@gmail.com> wrote:

Hello,

I am attempting to analyse some chip-seq data using MOSAICS. My main goal is to include the Epstein-Barr genome in hg19, and have it treated as another chromosome in the analysis. I have downloaded the preprocessing scripts from the website, and have been successful in generating the bin-level GC and N data for all of the chromosomes (including chrEBV as I have it named). I have tried to generate the mappability using the PeakSeq scripts linked on the pre-process page. Currently I have each chromosome in a seperate .fa file, and the PeakSeq python script has been running for quite some time (days). I tried to run the Parrallel version, but it seems to require connection to a different institutions network.


The description about what you did basically sounds correct to me (although there could be some more details). It also makes sense that calculating mappability score takes much longer time than calculating GC & N scores because calculating GC & N scores requires only one scan of the genome while calculating mappability score requires comparison of all the 36bp sequences (for example) across the genome (although PeakSeq developer implemented it in a smart way).

Can you recognize any progress (such as partial output files) of PeakSeq MappabilityMap python script? If so, it might be just matter of time. In the case you have any problem or error regarding PeakSeq MappabilityMap python, you might need to contact the PeakSeq developer for technical supports.
 
Should I merg all of the hg19 chromosomes into on fa or multi-fasta file and then generate all the bin level data like that?
Or should I keep them separate and then call readBins for each chromosome?


After you generate mappability score and so on, I recommend to combine all the chromosomes and generate a single bin-level file for each of M, GC, and N, as well as for each of ChIP & input data. Genome-wide analysis requires these genome-level files and in our experience, genome-wide analysis generally provides significantly more stable model fitting compared to chromosome-wise analysis.

Please let me know if you have any questions.

Thanks,
Dongjun

Kyle McChesney

unread,
Apr 24, 2014, 11:29:24 AM4/24/14
to mosaics_u...@googlegroups.com
Hello Dongjun,

Thanks for getting back to me. You were absolutely right about the Mappability script, it took a long time to run, but in the end I seem to have all the chr*b.out files mentioned in the pre-processing steps.

 I am still about confused about how/when to combine things when creating the GC,M,N bin files.

For example, I have all the the chromosomes in separate fa files.
I also have an fa file which I created with the following command:
(In the directory with all of the operate fa files)
$ cat *.fa >> hg19-combined.fa

So the combined fa file is more of a multi-fasta file, which still retains the header line (">chromosome *") for each of the chromosomes.

So for generating the GC and N, which of the following steps should I take?
1. Run perl script:
 cal_binary_GC_N_score.pl chr[id].fa [chrid] i

For each of the separate files, and then combine them later (before or after running the process_score.pl)

Or option 2.
Run cal_binary_GC_N_score.pl on the combined multifasta file.

And for the mappability, how should I go about combining that?
Should I run the cal_binary_map_score.py script on each seperate chr*b.out file and then combine them? Or merge all of the chr*b.out files before running them through cal_binary_map_score.py.

Thank you so much for your help, my apologies for al the questions!
-Kyle

--
You received this message because you are subscribed to a topic in the Google Groups "MOSAiCS User Group" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/mosaics_user_group/Vn9lKQC42TI/unsubscribe.
To unsubscribe from this group and all its topics, send an email to mosaics_user_gr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Dongjun Chung

unread,
Apr 26, 2014, 10:35:50 PM4/26/14
to mosaics_u...@googlegroups.com
Hi Kyle,

It is great to hear that generating mappability scores is going well.

Regarding the generation of bin-level files, it should be the option #1,
i.e., generate separate chromosome files & combine them.

In the final files for M, GC, and N, mosaics expects the three columns:

chrID, coord, score

In other words, you need to add chrID in the first column.
I think that this should be easily handled using a perl script or linux commands.

I am really sorry for such inconvenience.
We will work on improving this preprocessing part
and try to prepare the perl script to generate genome-wide version of M, GC, and N scores.

Thanks,
Dongjun


--
You received this message because you are subscribed to the Google Groups "MOSAiCS User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mosaics_user_gr...@googlegroups.com.

Kyle McChesney

unread,
Apr 29, 2014, 10:58:10 AM4/29/14
to mosaics_u...@googlegroups.com
Hey Dongjun,

This all sounds great, I should have no problem combining all of these files.
Thanks for all your help!

-Kyle
To unsubscribe from this group and all its topics, send an email to mosaics_user_group+unsub...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "MOSAiCS User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mosaics_user_group+unsub...@googlegroups.com.

Kyle McChesney

unread,
May 2, 2014, 11:48:13 AM5/2/14
to mosaics_u...@googlegroups.com
Hey Dongjun,

Sorry to bother you again, but I have run into some new issues. I successfully merged everything, and was able to construct and read the bin-level data for all fields (chip, input, M, GC, N). When I attempted to run:

> mosaicsFit(bin1, analysisType = "TS", parallel=TRUE, nCore=10)

I received the following output:

Use 'parallel' package for parallel computing.
Info: background estimation method is determined based on data.
Info: background estimation based on bins with low tag counts.
Info: two-sample analysis (with mappability & GC content).
Info: use adaptive griding.
Info: fitting background model...
Info: grid = 0.01
Info: grid = 0.02
Info: grid = 0.04
Info: grid = 0.1
Info: grid = 0.2
Info: grid = 0.5
Info: done!
Info: fitting one-signal-component model...
  RCPP_FUNCTION_5 is deprecated, it will be removed permanently in july 2014
Error in while (abs(logLik[iter] - logLik[iter - 1]) > eps & iter < 10) { :
missing value where TRUE/FALSE needed

Here is my entire R script

require(parallel)
library("mosaics")
bin1 <- readBins(type = c("chip", "input", "M", "GC", "N"), fileName = c("chip.sam_fragL200_bin50.txt", "input.sam_fragL200_bin50.txt", "hg19_ebv_M_fragL200_bin50.txt", "hg19_ebv_GC_fragL200_bin50.txt", "hg19_ebv_N_fragL200_bin50.txt"),parallel=FALSE, nCore=8)
Info: reading and preprocessing bin-level data...
Info: assume that data contains more than one chromosome
Info: done!

------------------------------------------------------------
Info: preprocessing summary
------------------------------------------------------------
[Note] Bins with ambiguous sequences will be excluded from the analysis.
Coordinates before & after preprocessing:
chr1:     0 - 249240750    ->    9800 - 249240750
chr10:     0 - 135524850    ->    59800 - 135524850
chr11:     0 - 134946650    ->    59800 - 134946650
##
## Rest of Chromosomes
##
chrX:     0 - 155260600    ->    59800 - 155260600
chrY:     0 - 59363600    ->    9800 - 59363600
------------------------------------------------------------

> bin1
Summary: bin-level data (class: BinData)
----------------------------------------
- # of chromosomes in the data: 94
- total effective tag counts: 53211162
  (sum of ChIP tag counts of all bins)
- control sample is incorporated
- mappability score is incorporated
- GC content score is incorporated
- uni-reads are assumed
----------------------------------------
chrID:    # of bins, total effective tag counts
chr1:    4505951, 4460158
chr10:    2626437, 2599939
chr11:    2622660, 2390072
##
## Rest of Chromosomes
##
chr9_gl000199_random:    3394, 26182
chr9_gl000200_random:    3731, 0
chrY:    513221, 100143
----------------------------------------

> fit1 <- mosaicsFit(bin1, analysisType = "TS", parallel=TRUE, nCore=10)
Use 'parallel' package for parallel computing.
Info: background estimation method is determined based on data.
Info: background estimation based on bins with low tag counts.
Info: two-sample analysis (with mappability & GC content).
Info: use adaptive griding.
Info: fitting background model...
Info: grid = 0.01
Info: grid = 0.02
Info: grid = 0.04
Info: grid = 0.1
Info: grid = 0.2
Info: grid = 0.5
Info: done!
Info: fitting one-signal-component model...
  RCPP_FUNCTION_5 is deprecated, it will be removed permanently in july 2014
Error in while (abs(logLik[iter] - logLik[iter - 1]) > eps & iter < 10) { :
  missing value where TRUE/FALSE needed

Any ideas?

Kyle McChesney

unread,
May 6, 2014, 12:58:52 PM5/6/14
to mosaics_u...@googlegroups.com
I looked through the old threads and found people having similar issues.
I tried doing and IO fit, and ran into a different error (when trying to call mosaicsPeak())

> fit1 <- mosaicsFit(bin1, analysisType = "IO", parallel=TRUE, nCore=2)

Use 'parallel' package for parallel computing.
Info: background estimation method is determined based on data.
Info: background estimation based on bins with low tag counts.
Info: two-sample analysis (Input only).

Info: use adaptive griding.
Info: fitting background model...
Info: done!
Info: fitting one-signal-component model...
  RCPP_FUNCTION_5 is deprecated, it will be removed permanently in july 2014
  RCPP_FUNCTION_5 is deprecated, it will be removed permanently in july 2014
  RCPP_FUNCTION_5 is deprecated, it will be removed permanently in july 2014
  RCPP_FUNCTION_5 is deprecated, it will be removed permanently in july 2014
  RCPP_FUNCTION_5 is deprecated, it will be removed permanently in july 2014
  RCPP_FUNCTION_5 is deprecated, it will be removed permanently in july 2014
  RCPP_FUNCTION_5 is deprecated, it will be removed permanently in july 2014
  RCPP_FUNCTION_5 is deprecated, it will be removed permanently in july 2014
  RCPP_FUNCTION_5 is deprecated, it will be removed permanently in july 2014
Info: fitting two-signal-component model...
  RCPP_FUNCTION_6 is deprecated, it will be removed permanently in july 2014
  RCPP_FUNCTION_6 is deprecated, it will be removed permanently in july 2014
  RCPP_FUNCTION_6 is deprecated, it will be removed permanently in july 2014
  RCPP_FUNCTION_6 is deprecated, it will be removed permanently in july 2014
  RCPP_FUNCTION_6 is deprecated, it will be removed permanently in july 2014
  RCPP_FUNCTION_6 is deprecated, it will be removed permanently in july 2014
  RCPP_FUNCTION_6 is deprecated, it will be removed permanently in july 2014
  RCPP_FUNCTION_6 is deprecated, it will be removed permanently in july 2014
  RCPP_FUNCTION_6 is deprecated, it will be removed permanently in july 2014
Info: calculating BIC of fitted models...

  RCPP_FUNCTION_5 is deprecated, it will be removed permanently in july 2014
  RCPP_FUNCTION_6 is deprecated, it will be removed permanently in july 2014
Info: done!
> fit1
Summary: MOSAiCS model fitting (class: MosaicsFit)
------------------------------
--------------------
analysis type: two-sample analysis (Input only)
parameters used: k = 3, d = 0.25
BIC of one-signal-component model = 35120124
BIC of two-signal-component model = 34351225
--------------------------------------------------
> peak1 <- mosaicsPeak(fit1, signalModel = "2S", FDR = 0.05, maxgap = 200, minsize = 50, thres = 10)
Info: use two-signal-component model.
Info: calculating posterior probabilities...
  RCPP_FUNCTION_6 is deprecated, it will be removed permanently in july 2014
Info: calling peaks...
Error in peak_info_list[[1]] : subscript out of bounds


Reply all
Reply to author
Forward
0 new messages