To me, this indicates that the input file was not compressed with bgzip. We'll update Goby's documentation to make clear that compressed input files must be compressed with bgzip (we use tabix indexing on these files, and only bgzip compression supports it). This matters because Goby decompresses VCF files with the bgzip code implemented in samtools, which is what raised the "Invalid GZIP header" error in your stack trace. The fix is to decompress your file with gzip and recompress it with bgzip (included in the tabix distribution).
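As a rough sketch of that fix (this is not part of Goby; the function names and file names are made up for illustration), the first function below checks whether a file starts with a bgzip (BGZF) block header, and the second rewrites a plain gzip file with bgzip. It assumes od, gzip and bgzip are on your PATH, and that the "BC" subfield is the first extra field in the header, which is how bgzip writes it.

```shell
# Succeed if FILE starts with a BGZF block header.
# A BGZF block is a gzip member with the FEXTRA flag set (bytes 1f 8b 08 04)
# and a "BC" extra subfield at byte offset 12.
is_bgzf() {
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    subfield=$(od -An -c -j12 -N2 "$1" | tr -d ' \n')
    [ "$magic" = "1f8b0804" ] && [ "$subfield" = "BC" ]
}

# Decompress a plain gzip file and recompress it with bgzip.
# $1 is the existing file, $2 is the new file to write (hypothetical names).
recompress_bgzf() {
    gzip -dc "$1" | bgzip -c > "$2"
}
```

For example, `is_bgzf input.vcf.gz || recompress_bgzf input.vcf.gz input-bgzf.vcf.gz` would only recompress when the header check fails.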
After that, the second problem is minor: you passed a path after -s, whereas Goby expects a suffix. The documentation for this option says:
The output suffix to construct an output filename for each input file.
The output filename will be input-filename - extensions + suffix + vcf.gz
Try -s -split, which creates input-split.vcf.gz
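To make the naming rule concrete, here is a minimal sketch of how I read it. This is not Goby's actual code: the function name is hypothetical, and it assumes the only extensions removed are .gz and .vcf, with a dot placed before vcf.gz as in the example above.

```shell
# Hypothetical illustration of "input-filename - extensions + suffix + vcf.gz".
# $1 is the input filename, $2 is the value given to -s.
sketch_output_name() {
    base=$1
    base=${base%.gz}     # drop a trailing .gz, if present
    base=${base%.vcf}    # drop a trailing .vcf, if present
    printf '%s%s.vcf.gz\n' "$base" "$2"
}

sketch_output_name input.vcf.gz -split    # prints input-split.vcf.gz
```

Under the same assumption, a directory path given as the suffix is simply appended to the input name, which does not produce the output location you intended.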
After these two changes, I was able to correctly split your input file with Goby 2.3.3. Hope this helps.
Attached is the sample split with goby with this command line:
goby 1g vcf-subset ~/Downloads/input.vcf.gz -c HG00096 -s -split
Best, Fabien
Hi,
I am a bioinformatics enthusiast from India. I wanted to subset 1000 Genomes data. I tried vcf-subset from vcftools, but because it is slow I installed Goby.
I ran the following command:
java -Xmx3g -jar goby.jar -m vcf-subset /users/anambiar/Goby/Input/input.vcf.gz -c HG00096 -s /users/anambiar/Goby/Output/
I have tried both -m and --mode to specify the mode.
input.vcf.gz is the input file. I took the first 100 lines of a 1000 Genomes VCF file to create input.vcf.gz.
But the above command gives this error message:
at java.io.BufferedReader.fill(BufferedReader.java:154)
at java.io.BufferedReader.readLine(BufferedReader.java:317)
at java.io.BufferedReader.readLine(BufferedReader.java:382)
at edu.cornell.med.icb.goby.util.GrepReader.getNextLine(GrepReader.java:110)
at edu.cornell.med.icb.goby.util.GrepReader.read(GrepReader.java:68)
at java.io.Reader.read(Reader.java:140)
at it.unimi.dsi.io.FastBufferedReader.noMoreCharacters(FastBufferedReader.java:276)
at it.unimi.dsi.io.FastBufferedReader.readLine(FastBufferedReader.java:333)
at it.unimi.dsi.io.LineIterator.hasNext(LineIterator.java:82)
at edu.cornell.med.icb.goby.readers.vcf.VCFParser.readHeader(VCFParser.java:429)
at edu.cornell.med.icb.goby.modes.VCFSubsetMode.processOneFile(VCFSubsetMode.java:182)
at edu.cornell.med.icb.goby.modes.VCFSubsetMode.access$000(VCFSubsetMode.java:54)
at edu.cornell.med.icb.goby.modes.VCFSubsetMode$1.action(VCFSubsetMode.java:157)
at edu.cornell.med.icb.goby.util.BasenameParallelRegion$1.run(BasenameParallelRegion.java:52)
at edu.rit.pj.IntegerForLoop.commonRun(IntegerForLoop.java:448)
at edu.rit.pj.ParallelRegion.execute(ParallelRegion.java:307)
at edu.rit.pj.ParallelRegion.execute(ParallelRegion.java:203)
at edu.cornell.med.icb.goby.util.BasenameParallelRegion.run(BasenameParallelRegion.java:43)
at edu.rit.pj.ParallelTeamThread.run(ParallelTeamThread.java:110)
net.sf.samtools.SAMFormatException: Invalid GZIP header
at net.sf.samtools.util.BlockGunzipper.unzipBlock(BlockGunzipper.java:72)
Is the command I gave correct? I have attached the input file to this mail. Your reply will be of great help to me.
Regards,
Amruta