phastcons placental as bigwig

447 views
Skip to first unread message

António Domingues

unread,
May 16, 2014, 6:53:45 AM5/16/14
to gen...@soe.ucsc.edu
Dear UCSC genome browser team,

I want to do some conservation analysis and for the later stages of my
analysis, it is required a bigwig file with all chromosomes, but I am
running into problems in the the conversion from wig to bigWig.

First I downloaded all the data (and removed the contig files):
cd /projects/seq-work/user/all/conservation/human/placentalMammals
rsync -avz --progress
rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/ ./

Then I merge all the wig files:
# merge chr
zcat chr*.placental.wigFix.gz | grep -v ^track | gzip >
all_chr.placental.wigFix.gz

But when converting to bigwig there is just an error "killed" after some
time:
wigToBigWig -clip all_chr.placental.wigFix.gz hg19.genome
all_chr.placental.bw
killed

I imagine this is a memory issue. So 3 questions:

1. Is this really a memory issue? My computer does have 16GB of RAM.
2. Is there a more efficient way of doing this? (or, what am I doing wrong)
3. for phastCons100way there is an available bigWig file. Is there
something similar for the placental species?

Best,
António

--
António Miguel de Jesus Domingues, PhD
Postdoctoral researcher
Deep Sequencing Group - SFB655
Biotechnology Center (Biotec)
Technische Universität Dresden
Fetscherstraße 105
01307 Dresden

Phone: +49 (351) 458 82362
Email: antonio.domingues(at)biotec.tu-dresden.de
--
The Unbearable Lightness of Molecular Biology

Jonathan Casper

unread,
May 16, 2014, 7:48:37 PM5/16/14
to António Domingues, gen...@soe.ucsc.edu

Hello António,

Thank you for your question about converting wig files to bigWig. It is quite possibly that your problem is due to a memory issue - creating large bigWigs can consume a substantial amount of RAM, particular when you start with a large file like this one. That said, one of our engineers notes that the "kill" might instead happen due to resource limitations set in your environment. You can run "ulimit -a" for the bash shell or "limit" for cshell variants to see what limits, if any, are in place. Some limits are soft limits, and can be increased by the user.

If you cannot get enough memory available (it seems likely - my test run of your command used over 30GB of memory), you might have better luck converting the individual wig files into bigWigs and then using the new "bigWigCat" utility developed by Daniel Zerbino to combine them. Splitting the format conversion into multiple steps should make it easier for your computer to handle. Here is the usage message for bigWigCat:

bigWigCat v 4 - merge non-overlapping bigWig files
directly into bigWig format
usage:
   bigWigCat out.bw in1.bw in2.bw ...
Where in*.bw is in big wig format
and out.bw is the output indexed big wig file.
options:
   -itemsPerSlot=N - Number of data points bundled at lowest level. Default 1024

Our engineer adds that you would need to use the -fixedSummaries and -keepAllChromosomes to create the input bigWig files. Here is a sample shell script that will convert each wig file into a bigWig and merge them with bigWigCat.
#/bin/tcsh
foreach f in (chr*.placental.wigFix.gz)
  zcat f | grep -v ^track > f:r.temp
  wigToBigWig -clip -fixedSummaries -keepAllChromosomes f:r.temp f:r.bw
  rm f:r.temp
end
# Note the f:r syntax just trims the extension from the filename.

#To concatenate:
bigWigCat all_chr.placental.bw hg19.genome chr*.placental.wigFix.bw

For the 100-way alignment, we only provide scores computed from the full alignment - not subclasses like primate or placental mammal. If you are interested in computing your own phastCons scores from a subset of the species, you can find the multiple alignments and more information about our methods at http://hgdownload.soe.ucsc.edu/goldenPath/hg19/multiz100way/.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those addresses will be archived in publicly-accessible forums for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group





--


António domingues

unread,
May 19, 2014, 1:22:11 PM5/19/14
to gen...@soe.ucsc.edu
Thanks a lot. I was not aware of bigWigCat - that alone should solve the problem. The script comes in quite handy as well.


António
-- 
António Miguel de Jesus Domingues, PhD
Postdoctoral researcher
Deep Sequencing Group - SFB655
Biotechnology Center (Biotec)
Technische Universität Dresden 
Fetscherstraße 105
01307 Dresden

Phone: +49 (351) 458 82362
Email: antonio.domingues(at)biotec.tu-dresden.de
--
The Unbearable Lightness of Molecular Biology

     
    For the 100-way alignment, we only provide scores computed from the full
    alignment - not subclasses like primate or placental mammal. If you are
    interested in computing your own phastCons scores from a subset of the
    species, you can find the multiple alignments and more information about
    our methods at http://hgdownload.soe.ucsc.edu/goldenPath/hg19/multiz100way/.
     
    I hope this is helpful. If you have any further questions, please reply to
    gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those
    addresses will be archived in publicly-accessible forums for the benefit of
    other users. If your question contains sensitive data, you may send it
    instead to genom...@soe.ucsc.edu.
     
    --
    Jonathan Casper
    UCSC Genome Bioinformatics Group
     
     
    On Fri, May 16, 2014 at 3:53 AM, António Domingues <

     

      Matthew Speir <msp...@soe.ucsc.edu> May 16 04:11PM -0700  

      Hello Roshan,
       
      Thank you for your question about the RepeatMasker and Conservation
      tracks in the UCSC Genome Browser. You may not be seeing the repeat or
      the conservation information because the RepeatMasker and Conservation
      tracks are not turned on. I highly recommend taking advantage of some
      the introductory resources that are available for the browser. I would
      start by watching these tutorials: http://www.openhelix.com/ucsc. You
      can find information on further training and tutorials here:
      http://genome.ucsc.edu/training.html.
       
      Data sets in the Genome Browser are stored in tracks. To view a
      particular data set, you will need to turn on the corresponding track.
      Information on repeats, such as the AluY element that you mentioned, is
      contained in the RepeatMasker track. The Conservation track contains
      conservation information based alignments between the human genome and
      99 different vertebrate genomes. You can use the following steps to turn
      on the RepeatMasker and Conservation tracks:
       
      1. Navigate to the gateway page for the Human/hg19 assembly,
      http://genome.ucsc.edu/cgi-bin/hgGateway?db=hg19.
       
      2. In the search term box type chr16:47,662,162-47,662,468 (note
      this is the position you gave for the AluY element).
       
      3. Click Submit.
       
      4. Scroll down the page to the section labeled Comparative Genomics.
       
      5. Click the label for the Conservation track.
       
      6. Check the boxes next to the species that you want to be
      displayed as part of this conservation track.
       
      7. At the top of the page, set the maximum display mode for this
      track to 'Full'.
       
      8. Click Submit.
       
      9. Scroll down the page to the section labeled Repeats.
       
      10. Click the label for the RepeatMasker track.
       
      11. At the top of the page, set the display mode to full.
       
      12. Click submit.
       
      If you have correctly turned on the RepeatMasker and Conservation
      tracks, you should see some thing like this:
      http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=mspeeeer&hgS_otherUserSessionName=hg19_mlqAluYcons.
      In this session, the first track is the UCSC Genes track showing the
      PHKB gene, the second track is the RepeatMasker track showing the AluY
      element, and the last track is the Conservation track showing only the
      primate alignments. If you would like more information on how the data
      for a track was generated, you can look at the track description page.
      You can get to the description for a track by clicking the track labels,
      just as you did to turn on these tracks, and scrolling down past the
      track settings.

       
      I hope this is helpful. If you have any further questions, please reply
      to gen...@soe.ucsc.edu. All messages sent to that address are archived
      on a publicly-accessible Google Groups forum. If your question includes
      Matthew Speir
      UCSC Genome Bioinformatics Group
       
       
      On 5/15/14, 5:22 AM, roshan fatima wrote:

       

      Matthew Speir <msp...@soe.ucsc.edu> May 16 12:41PM -0700  

      Hello Christian,
       
      Thank your for your follow-up questions about the TNNI3 gene. Currently,
      the RefSeq track is generated through an automated pipeline. We download
      the transcripts from RefSeq, align them to the genome, and then filter
      the alignments according to the parameters described in the methods
      section of the track description page,
      http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=refGene. Since this
      track is not manually curated, it is unlikely that this issue with TNNI3
      or other proteins will be fixed in the RefSeq Genes track. In the
      future, we hope to add a track that contains RefSeq's alignments to help
      resolve issues where our re-alignments are different from the RefSeq
      alignments. In the mean time, you could use the GENCODE genes track,
      which does not have this issue with the TNNI3 protein. The GENCODE
      track,
      http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=wgEncodeGencodeV19,
      contains manual annotations merged with evidence-based automated
      annotations.

       
      I hope this is helpful. If you have any further questions, please reply
      to gen...@soe.ucsc.edu. All messages sent to that address are archived
      on a publicly-accessible Google Groups forum. If your question includes
      Matthew Speir
      UCSC Genome Bioinformatics Group
       
       

       

      Roberto Munita <robert...@gmail.com> May 16 03:21PM -0400  

      Hi,
       
      I was trying to use the data from Retroposed Genes V5 track. But my problem
      is that I can´t find a table that have pseudogenes alignments information
      with the gene name (I searched in the table browser).
       
      For example NM_004643.3-18 is "retro-PABPN1" and if you search in the
      genome browser: "NM_004643.3-18"and you click over the pseudogene the
      information say:
       
      Source Gene:
       
      NM_004643.3-18 PABPN1 Homo sapiens poly(A) binding protein, nuclear 1
      (PABPN1), mRNA.
       
      Where I can found that information for all the Retroposed Genes V5 track?
       
      I hope you could help me,
       
      Thanks for your help!
       
      Roberto
       
       
      Enviado con MailTrack<https://mailtrack.io/install?source=signature&lang=es&referral=robert...@gmail.com&idSignature=23>

       

      Matthew Speir <msp...@soe.ucsc.edu> May 16 11:56AM -0700  

      Hi Hervé,
       
      Thank you for bringing this incorrect link on the dm3 gateway page to
      our attention, I will change it soon. We've been informed that the DHGP
      website has been retired. As an alternative, you can check out the
      websites for the Berkeley Drosophila Genome Project (BDGP),
      http://www.fruitfly.org/, and FlyBase, http://flybase.org/. The
      information from the DHGP site often mirrored what was on these BDGP and
      FlyBase sites. If you would like more information on the dm3 assembly
      and the heterochromatin sequences used in this assembly, I recommend
      looking at the following resources:
       
      * BDGP Release notes,
      http://www.fruitfly.org/data/sequence/README.RELEASE5
      * DHGP publication 'Sequence finishing and mapping of Drosophila
      melanogaster heterochromatin',
      http://www.ncbi.nlm.nih.gov/pubmed/17569867

       
      I hope this is helpful. If you have any further questions, please reply
      to gen...@soe.ucsc.edu. All messages sent to that address are archived
      on a publicly-accessible Google Groups forum. If your question includes
      Matthew Speir
      UCSC Genome Bioinformatics Group
       
       
      On 5/15/14, 3:09 PM, Hervé Pagès wrote:

       

      Jonathan Casper <jca...@soe.ucsc.edu> May 16 10:28AM -0700  

      Hello Kie,
       
      Thank you for reporting this issue, and I'm sorry for the confusion it is
      causing you! We're aware of the problem and are planning to address it, but
      I don't have a timetable for when the issue will be fixed. In the meantime,
      you can recover the proper highlighting by simply refreshing the browser
      page.

       
      I hope this is helpful. If you have any further questions, please reply to
      gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those
      addresses will be archived in publicly-accessible forums for the benefit of
      other users. If your question contains sensitive data, you may send it
      instead to genom...@soe.ucsc.edu.
       
      --
      Jonathan Casper
      UCSC Genome Bioinformatics Group
       
       

       

      --

      To unsubscribe from this group and stop receiving emails from it, send an email to genome+un...@soe.ucsc.edu.

      Galt Barber

      unread,
      May 19, 2014, 3:30:36 PM5/19/14
      to António domingues, gen...@soe.ucsc.edu
      Correction: 

      The chrom.sizes file which you called hg19.genome was added incorrectly
      to the bigWigCat command,
      but it must be in the wigToBigWig command instead.

      Here are the corrected lines:

      wigToBigWig -clip -fixedSummaries -keepAllChromosomes f:r.temp hg19.genome f:r.bw

      bigWigCat all_chr.placental.bw chr*.placental.wigFix.bw

      Sorry for the confusion.
      Please note that bigWigCat will not actually work unless wigToBigWig
      is run with both the -fixedSummaries and -keepAllChromosomes options.
      The -keepAllChromsomes options means that the full chromosome index
      is present in every bigWig, so merging them later is possible.

      FYI bigWigCat is the work of Daniel Zerbino.

      -Galt


      --


      Reply all
      Reply to author
      Forward
      0 new messages