Re: Doubt about vcfsubset mode

53 views

Skip to first unread message

Fabien Campagne

unread,

Jan 23, 2014, 9:03:26 AM1/23/14

to Amruta Nambiar, goby-fr...@googlegroups.com

Hello,

I am copying the Goby forum so that others may find the answer if they encounter a similar problem.

I noticed two problems with the test data and command line. When I ran goby on mac with the command line you provided against the test set you provided I got the complete exception, which I reproduce below (please include the top lines of exception in problem reports, this is the most useful for diagnostic):

net.sf.samtools.SAMFormatException: Invalid GZIP header
at net.sf.samtools.util.BlockGunzipper.unzipBlock(BlockGunzipper.java:72)
at net.sf.samtools.util.BlockCompressedInputStream.inflateBlock(BlockCompressedInputStream.java:379)
at net.sf.samtools.util.BlockCompressedInputStream.readBlock(BlockCompressedInputStream.java:361)
at net.sf.samtools.util.BlockCompressedInputStream.available(BlockCompressedInputStream.java:109)
at net.sf.samtools.util.BlockCompressedInputStream.read(BlockCompressedInputStream.java:234)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:264)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:306)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
at java.io.InputStreamReader.read(InputStreamReader.java:167)
at java.io.BufferedReader.fill(BufferedReader.java:136)
at java.io.BufferedReader.readLine(BufferedReader.java:299)
at java.io.BufferedReader.readLine(BufferedReader.java:362)
at edu.cornell.med.icb.goby.util.GrepReader.getNextLine(GrepReader.java:110)
at edu.cornell.med.icb.goby.util.GrepReader.read(GrepReader.java:68)

...

To me, this indicates that the input file was not compressed with BGZip. We'll add to Goby's documentation to make clear that compressed files must be compressed with BGzip (we use tabix indexing on these files, and only BGzip compression supports this). Now, this is important, because goby decompresses VCF files with the BGZip code implemented in samtools. The fix is to decompress your file with gzip, and recompress with bgzip (included in the tabix distribution).

After that, the second problem is minor: you used a filename after -s, when goby expects a suffix. The documentation indicates that

The output suffix to construct an output filename for each input file.

The output filename will be input-filename - extensions + suffix +

vcf.gz

Try -s -split, which creates input-split.vcf.gz

After these two changes, I was able to correctly split your input file with Goby 2.3.3. Hope this helps.

Attached is the sample split with goby with this command line:

goby 1g vcf-subset ~/Downloads/input.vcf.gz -c HG00096 -s -split

Best, Fabien

On Thu, Jan 23, 2014 at 7:05 AM, Amruta Nambiar <amru...@gmail.com> wrote:

Hi,

I am a Bioinformatics enthusiast from India.I wanted to subset 1000 genome data. I have tried vcftools-vcf-subset. But as vcftools’s vcf-subset is slow I installed Goby.

I have given the command as given below for executing

  java -Xmx3g -jar goby.jar -m vcf-subset /users/anambiar/Goby/Input/input.vcf.gz -c HG00096 -s /users/anambiar/Goby/Output/

I have tried with –m and –mode for specifying the mode

input.vcf.gz is the input file. I have taken first 100 lines of 1000Genome vcf file and created input.vcf.gz

But the above command is giving an error message

        at java.io.BufferedReader.fill(BufferedReader.java:154)

        at java.io.BufferedReader.readLine(BufferedReader.java:317)

        at java.io.BufferedReader.readLine(BufferedReader.java:382)

        at edu.cornell.med.icb.goby.util.GrepReader.getNextLine(GrepReader.java:110)

        at edu.cornell.med.icb.goby.util.GrepReader.read(GrepReader.java:68)

        at java.io.Reader.read(Reader.java:140)

        at it.unimi.dsi.io.FastBufferedReader.noMoreCharacters(FastBufferedReader.java:276)

        at it.unimi.dsi.io.FastBufferedReader.readLine(FastBufferedReader.java:333)

        at it.unimi.dsi.io.LineIterator.hasNext(LineIterator.java:82)

        at edu.cornell.med.icb.goby.readers.vcf.VCFParser.readHeader(VCFParser.java:429)

        at edu.cornell.med.icb.goby.modes.VCFSubsetMode.processOneFile(VCFSubsetMode.java:182)

        at edu.cornell.med.icb.goby.modes.VCFSubsetMode.access$000(VCFSubsetMode.java:54)

        at edu.cornell.med.icb.goby.modes.VCFSubsetMode$1.action(VCFSubsetMode.java:157)

        at edu.cornell.med.icb.goby.util.BasenameParallelRegion$1.run(BasenameParallelRegion.java:52)

        at edu.rit.pj.IntegerForLoop.commonRun(IntegerForLoop.java:448)

        at edu.rit.pj.ParallelRegion.execute(ParallelRegion.java:307)

        at edu.rit.pj.ParallelRegion.execute(ParallelRegion.java:203)

        at edu.cornell.med.icb.goby.util.BasenameParallelRegion.run(BasenameParallelRegion.java:43)

        at edu.rit.pj.ParallelTeamThread.run(ParallelTeamThread.java:110)

net.sf.samtools.SAMFormatException: Invalid GZIP header

        at net.sf.samtools.util.BlockGunzipper.unzipBlock(BlockGunzipper.java:72)

Is the command I gave correct. I have attached the input file along with this mail. Your reply will be of great help to me.

Regards,

Amruta

--
Fabien Campagne, PhD -- http://campagnelab.org

Assistant Professor, Dept. of Physiology and Biophysics
Institute for Computational Biomedicine
Associate Director, Biomedical Informatics Core,
Clinical Translational Science Center

Weill Medical College of Cornell University
phone: (646)-962-5613 1305 York Avenue
fax: (646)-962-0383 Box 140
New York, NY 10021

Do you speak next-gen?

See how GobyWeb can help simplify your NGS projects at http://gobyweb.campagnelab.org