phastcons placental as bigwig

447 views

Skip to first unread message

António Domingues

unread,

May 16, 2014, 6:53:45 AM5/16/14

to gen...@soe.ucsc.edu

Dear UCSC genome browser team,

I want to do some conservation analysis and for the later stages of my
analysis, it is required a bigwig file with all chromosomes, but I am
running into problems in the the conversion from wig to bigWig.

First I downloaded all the data (and removed the contig files):
cd /projects/seq-work/user/all/conservation/human/placentalMammals
rsync -avz --progress
rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/ ./

Then I merge all the wig files:
# merge chr
zcat chr*.placental.wigFix.gz | grep -v ^track | gzip >
all_chr.placental.wigFix.gz

But when converting to bigwig there is just an error "killed" after some
time:
wigToBigWig -clip all_chr.placental.wigFix.gz hg19.genome
all_chr.placental.bw
killed

I imagine this is a memory issue. So 3 questions:

1. Is this really a memory issue? My computer does have 16GB of RAM.
2. Is there a more efficient way of doing this? (or, what am I doing wrong)
3. for phastCons100way there is an available bigWig file. Is there
something similar for the placental species?

Best,
António

--
António Miguel de Jesus Domingues, PhD
Postdoctoral researcher
Deep Sequencing Group - SFB655
Biotechnology Center (Biotec)
Technische Universität Dresden
Fetscherstraße 105
01307 Dresden

Phone: +49 (351) 458 82362
Email: antonio.domingues(at)biotec.tu-dresden.de
--
The Unbearable Lightness of Molecular Biology

Jonathan Casper

unread,

May 16, 2014, 7:48:37 PM5/16/14

to António Domingues, gen...@soe.ucsc.edu

Hello António,

Thank you for your question about converting wig files to bigWig. It is quite possibly that your problem is due to a memory issue - creating large bigWigs can consume a substantial amount of RAM, particular when you start with a large file like this one. That said, one of our engineers notes that the "kill" might instead happen due to resource limitations set in your environment. You can run "ulimit -a" for the bash shell or "limit" for cshell variants to see what limits, if any, are in place. Some limits are soft limits, and can be increased by the user.

If you cannot get enough memory available (it seems likely - my test run of your command used over 30GB of memory), you might have better luck converting the individual wig files into bigWigs and then using the new "bigWigCat" utility developed by Daniel Zerbino to combine them. Splitting the format conversion into multiple steps should make it easier for your computer to handle. Here is the usage message for bigWigCat:

bigWigCat v 4 - merge non-overlapping bigWig files
directly into bigWig format
usage:
   bigWigCat out.bw in1.bw in2.bw ...
Where in*.bw is in big wig format
and out.bw is the output indexed big wig file.
options:
   -itemsPerSlot=N - Number of data points bundled at lowest level. Default 1024

Our engineer adds that you would need to use the -fixedSummaries and -keepAllChromosomes to create the input bigWig files. Here is a sample shell script that will convert each wig file into a bigWig and merge them with bigWigCat.

#/bin/tcsh
foreach f in (chr*.placental.wigFix.gz)
  zcat f | grep -v ^track > f:r.temp
  wigToBigWig -clip -fixedSummaries -keepAllChromosomes f:r.temp f:r.bw
  rm f:r.temp
end
# Note the f:r syntax just trims the extension from the filename.

#To concatenate:
bigWigCat all_chr.placental.bw hg19.genome chr*.placental.wigFix.bw

For the 100-way alignment, we only provide scores computed from the full alignment - not subclasses like primate or placental mammal. If you are interested in computing your own phastCons scores from a subset of the species, you can find the multiple alignments and more information about our methods at http://hgdownload.soe.ucsc.edu/goldenPath/hg19/multiz100way/.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those addresses will be archived in publicly-accessible forums for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group

--

António domingues

unread,

May 19, 2014, 1:22:11 PM5/19/14

to gen...@soe.ucsc.edu

Thanks a lot. I was not aware of bigWigCat - that alone should solve the problem. The script comes in quite handy as well.

António

-- 
António Miguel de Jesus Domingues, PhD
Postdoctoral researcher
Deep Sequencing Group - SFB655
Biotechnology Center (Biotec)
Technische Universität Dresden 
Fetscherstraße 105
01307 Dresden

Phone: +49 (351) 458 82362
Email: antonio.domingues(at)biotec.tu-dresden.de
--
The Unbearable Lightness of Molecular Biology

On 05/17/2014 07:07 PM, gen...@soe.ucsc.edu wrote:

Today's topic summary

Group: http://groups.google.com/a/soe.ucsc.edu/group/genome/topics

phastcons placental as bigwig [1 Update]

Query regarding repeat element display match with sequence [1 Update]

TNNI3 [1 Update]

Retroposed Genes V5 Source Gene name! [1 Update]

what happened to the DHGP website? (linked from the dm3 Browser Gateway) [1 Update]

Reproducible bug with highlight feature [1 Update]

phastcons placental as bigwig

Jonathan Casper <jca...@soe.ucsc.edu> May 16 04:48PM -0700

<http://placental.wigfix.bw/>

For the 100-way alignment, we only provide scores computed from the full
alignment - not subclasses like primate or placental mammal. If you are
interested in computing your own phastCons scores from a subset of the
species, you can find the multiple alignments and more information about
our methods at http://hgdownload.soe.ucsc.edu/goldenPath/hg19/multiz100way/.

I hope this is helpful. If you have any further questions, please reply to
gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those
addresses will be archived in publicly-accessible forums for the benefit of
other users. If your question contains sensitive data, you may send it
instead to genom...@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group

On Fri, May 16, 2014 at 3:53 AM, António Domingues <

Query regarding repeat element display match with sequence

Matthew Speir <msp...@soe.ucsc.edu> May 16 04:11PM -0700

Hello Roshan,

Thank you for your question about the RepeatMasker and Conservation
tracks in the UCSC Genome Browser. You may not be seeing the repeat or
the conservation information because the RepeatMasker and Conservation
tracks are not turned on. I highly recommend taking advantage of some
the introductory resources that are available for the browser. I would
start by watching these tutorials: http://www.openhelix.com/ucsc. You
can find information on further training and tutorials here:
http://genome.ucsc.edu/training.html.

Data sets in the Genome Browser are stored in tracks. To view a
particular data set, you will need to turn on the corresponding track.
Information on repeats, such as the AluY element that you mentioned, is
contained in the RepeatMasker track. The Conservation track contains
conservation information based alignments between the human genome and
99 different vertebrate genomes. You can use the following steps to turn
on the RepeatMasker and Conservation tracks:

1. Navigate to the gateway page for the Human/hg19 assembly,
http://genome.ucsc.edu/cgi-bin/hgGateway?db=hg19.

2. In the search term box type chr16:47,662,162-47,662,468 (note
this is the position you gave for the AluY element).

3. Click Submit.

4. Scroll down the page to the section labeled Comparative Genomics.

5. Click the label for the Conservation track.

6. Check the boxes next to the species that you want to be
displayed as part of this conservation track.

7. At the top of the page, set the maximum display mode for this
track to 'Full'.

8. Click Submit.

9. Scroll down the page to the section labeled Repeats.

10. Click the label for the RepeatMasker track.

11. At the top of the page, set the display mode to full.

12. Click submit.

If you have correctly turned on the RepeatMasker and Conservation
tracks, you should see some thing like this:
http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=mspeeeer&hgS_otherUserSessionName=hg19_mlqAluYcons.
In this session, the first track is the UCSC Genes track showing the
PHKB gene, the second track is the RepeatMasker track showing the AluY
element, and the last track is the Conservation track showing only the
primate alignments. If you would like more information on how the data
for a track was generated, you can look at the track description page.
You can get to the description for a track by clicking the track labels,
just as you did to turn on these tracks, and scrolling down past the
track settings.

I hope this is helpful. If you have any further questions, please reply

to gen...@soe.ucsc.edu. All messages sent to that address are archived
on a publicly-accessible Google Groups forum. If your question includes

sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group

On 5/15/14, 5:22 AM, roshan fatima wrote:

TNNI3

Matthew Speir <msp...@soe.ucsc.edu> May 16 12:41PM -0700

Hello Christian,

Thank your for your follow-up questions about the TNNI3 gene. Currently,
the RefSeq track is generated through an automated pipeline. We download
the transcripts from RefSeq, align them to the genome, and then filter
the alignments according to the parameters described in the methods
section of the track description page,
http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=refGene. Since this
track is not manually curated, it is unlikely that this issue with TNNI3
or other proteins will be fixed in the RefSeq Genes track. In the
future, we hope to add a track that contains RefSeq's alignments to help
resolve issues where our re-alignments are different from the RefSeq
alignments. In the mean time, you could use the GENCODE genes track,
which does not have this issue with the TNNI3 protein. The GENCODE
track,
http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=wgEncodeGencodeV19,
contains manual annotations merged with evidence-based automated
annotations.

I hope this is helpful. If you have any further questions, please reply

to gen...@soe.ucsc.edu. All messages sent to that address are archived
on a publicly-accessible Google Groups forum. If your question includes

sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group

Retroposed Genes V5 Source Gene name!

Roberto Munita <robert...@gmail.com> May 16 03:21PM -0400

Hi,

I was trying to use the data from Retroposed Genes V5 track. But my problem
is that I can´t find a table that have pseudogenes alignments information
with the gene name (I searched in the table browser).

For example NM_004643.3-18 is "retro-PABPN1" and if you search in the
genome browser: "NM_004643.3-18"and you click over the pseudogene the
information say:

Source Gene:

NM_004643.3-18 PABPN1 Homo sapiens poly(A) binding protein, nuclear 1
(PABPN1), mRNA.

Where I can found that information for all the Retroposed Genes V5 track?

I hope you could help me,

Thanks for your help!

Roberto

Enviado con MailTrack<https://mailtrack.io/install?source=signature&lang=es&referral=robert...@gmail.com&idSignature=23>

what happened to the DHGP website? (linked from the dm3 Browser Gateway)

Matthew Speir <msp...@soe.ucsc.edu> May 16 11:56AM -0700

Hi Hervé,

Thank you for bringing this incorrect link on the dm3 gateway page to
our attention, I will change it soon. We've been informed that the DHGP
website has been retired. As an alternative, you can check out the
websites for the Berkeley Drosophila Genome Project (BDGP),
http://www.fruitfly.org/, and FlyBase, http://flybase.org/. The
information from the DHGP site often mirrored what was on these BDGP and
FlyBase sites. If you would like more information on the dm3 assembly
and the heterochromatin sequences used in this assembly, I recommend
looking at the following resources:

* BDGP Release notes,
http://www.fruitfly.org/data/sequence/README.RELEASE5
* DHGP publication 'Sequence finishing and mapping of Drosophila
melanogaster heterochromatin',
http://www.ncbi.nlm.nih.gov/pubmed/17569867

I hope this is helpful. If you have any further questions, please reply

to gen...@soe.ucsc.edu. All messages sent to that address are archived
on a publicly-accessible Google Groups forum. If your question includes

sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group

On 5/15/14, 3:09 PM, Hervé Pagès wrote:

Reproducible bug with highlight feature

Jonathan Casper <jca...@soe.ucsc.edu> May 16 10:28AM -0700

Hello Kie,

Thank you for reporting this issue, and I'm sorry for the confusion it is
causing you! We're aware of the problem and are planning to address it, but
I don't have a timetable for when the issue will be fixed. In the meantime,
you can recover the proper highlighting by simply refreshing the browser
page.

I hope this is helpful. If you have any further questions, please reply to
gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those
addresses will be archived in publicly-accessible forums for the benefit of
other users. If your question contains sensitive data, you may send it
instead to genom...@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group

--

To unsubscribe from this group and stop receiving emails from it, send an email to genome+un...@soe.ucsc.edu.

Galt Barber

unread,

May 19, 2014, 3:30:36 PM5/19/14

to António domingues, gen...@soe.ucsc.edu

Correction:

The chrom.sizes file which you called hg19.genome was added incorrectly
to the bigWigCat command,
but it must be in the wigToBigWig command instead.

Here are the corrected lines:

wigToBigWig -clip -fixedSummaries -keepAllChromosomes f:r.temp hg19.genome f:r.bw

bigWigCat all_chr.placental.bw chr*.placental.wigFix.bw

Sorry for the confusion.

Please note that bigWigCat will not actually work unless wigToBigWig

is run with both the -fixedSummaries and -keepAllChromosomes options.

The -keepAllChromsomes options means that the full chromosome index
is present in every bigWig, so merging them later is possible.

FYI bigWigCat is the work of Daniel Zerbino.

-Galt

--

Reply all

Reply to author

Forward

0 new messages