Convert Sam to qlx

476 views
Skip to first unread message

Fabio Luciani

unread,
Mar 20, 2012, 2:41:13 AM3/20/12
to viral-to...@googlegroups.com
Hi

My idea is to use vphaser for some projects in which I have already generated Mosaik alignment files and of course sam files.
I would like to try phaser without re-running all the alignment part and hence I was thinking to use the script you provide
samToQlx.pl

I am however confused because the input for this files are:

perl qlxToSam.pl <qlxinput.qlx> <assembly.fasta> <samoutput.sam>


I can's figure out what is this assembly.fasta
Is the assembly that you generate using other softwares from your group, like RC454, or is a file that I can generate in another way?
I have sam file and reference, and I have also the MOSAIK aligned files in mosaik format.

thanks


Fabio Luciani

unread,
Mar 20, 2012, 3:16:24 AM3/20/12
to viral-to...@googlegroups.com
ok, I have tried like this

../samToQlx.pl file.sam read.fna read.qual ref_assembly.fas out.qlx
and I got a very large list  like this

Use of uninitialized value in concatenation (.) or string at ../samToQlx.pl line 248, <BAMFILE> line 196.
substr outside of string at ../samToQlx.pl line 248, <BAMFILE> line 196.

but I got an output .qlx

however, when I tried this qlx in
perl ../vphaser.pl -out.qlx -o output

 i got many of these

Use of uninitialized value in numeric le (<=) at ../vphaser.pl line 2086, <FFILE> line 54948.


Something is wrong with initialization!
I would appreciate help her,e thanks

Fabio

Patrick Charlebois

unread,
Mar 20, 2012, 1:14:47 PM3/20/12
to Broad Viral Tool Users
Hi Fabio,

my first guess would be that you have an issue with your quality input
file. Is it in spaced integer format or ASCII? Because the script
seems to be unable to build the NQS string, which is done from the
quality input.

The quality file format should look like:

>READID1
30 40 40 40 40 30 20 20 15 25 ...
>READID2
30 40 40 40 40 30 20 20 15 25 ...

If you have the qualities in ASCII strings, you can use the following
perl subroutine to convert them (takes an ascii string as input and
returns the integer string):

sub convertAsc
{
my $ascstr = shift;
my $qualstr = ord(substr($ascstr,0,1)) - 33;
for (my $pos = 1; $pos < length($ascstr); $pos++)
{
$qualstr .= " ".(ord(substr($ascstr,$pos,1)) - 33);
}
return $qualstr;
}

If this is not what your issue is, it would likely help if I can see
some sample of your input file.

Thanks,
Patrick

Fabio Luciani

unread,
Mar 20, 2012, 6:04:08 PM3/20/12
to viral-to...@googlegroups.com
Hi Patrick

Thanks for the answer.
Please see the head of the qual file
>G5ZVNFF01A0ZHG rank=0000076 x=304.0 y=530.5 length=41#AGTGTGTGCGT
32 30 27 27 27 31 31 33 32 30 23 24 22 22 25 28 28 23 17
>G5ZVNFF01BAD7S rank=0000082 x=411.0 y=1254.0 length=42#ACGTACACACT
23 23 23 23 23 23 29 23 24 23 30 30 27 19 19 19 27 31 32 32
30 27 27 25 25 27 24 24 14 14
>G5ZVNFF01A2RT4 rank=0001072 x=324.0 y=2010.0 length=59#ACGTACACACT
22 25 22 17 17 17 17 26 27 27 27 27 17 17 17 19 27 24 24 23
23 24 24 25 27 28 28 28 28 27 18 19 11 11 11 11 14 14 22 14
17 17 19 19 17 19 17 17 19 19 18 19 23 27 28 28 26 28 28
>G5ZVNFF01A2W0F rank=0001293 x=326.0 y=525.0 length=45#ACGTACACACT
37 37 37 36 27 25 25 25 27 27 27 30 29 27 30 21 22 21 22 21
15 15 15 19 15 19 18 21 27 28 28 27 25 27 24 19 19 19 24 24
24 25 23 21 21
>G5ZVNFF01AUU29 rank=0002402 x=234.5 y=1619.0 length=74#ACGTACACACT
32 28 19 19 19 16 19 25 29 26 29 21 23 24 30 40 40 40 40 40
40 40 40 40 40 35 28 29 26 26 26 31 34 16 16 16 16 25 29 26
26 26 35 38 40 39 36 32 27 23 21 23 29 35 32 27 19 19 24 14
14 16 16 30 30 32 32 32 35 35 35 36 34 34
>G5ZVNFF01A8D35 rank=0002964 x=388.5 y=2019.0 length=74#ACGTACACACT
32 32 31 27 24 24 15 13 14 22 22 17 17 17 17 17 24 28 28 28
29 29 30 28 28 28 28 27 22 22 17 17 15 15 13 14 19 22 27 27
27 28 28 27 29 27 19 19 19 24 23 23 23 21 21 21 23 28 28 28
20 19

to me it seems ok.

As I am not a perl person, I need to figure the script you sent out.
However, it seems that my qual file is a spaced integer, don't you think?
It is weird because I never had problem with these qual score files, whcih have been sent to me from the people that ran the 454 Titanium.
I am a bit confused on this
thanks
Fabio


Thanks

Patrick Charlebois

unread,
Mar 20, 2012, 6:56:27 PM3/20/12
to Broad Viral Tool Users
I think the problem is actually in the read names. That might be
something that I forgot to mention in the documentation (if so I'll
add it) but the names of the reads are used directly the way they are
written as hash keys and for the qlx name. The format these names have
in your file would totally destroy the structure of the qlx file since
they have spaces in them (and special characters that would likely be
bad for perl code).

Made me realize that I never consider this problem because my script
extracting read fastas and quals from Newbler trims everything from
the read name after the initial ID, so I never have anything else than
the unique ID part to worry about.

If you want to try to see if this fixes your problem, you could just
do these perl one-liners on your read fasta and qual files directly
from the command-line, simply replacing the file names with yours
(better to copy your file first just in case you want to keep the
headers for something else):

perl -p -i -e 's/>(.+?) .+/>$1/g' <filename.fa>
perl -p -i -e 's/>(.+?) .+/>$1/g' <filename.qual>

These will trim your read ids (it's important to trim on both the
fasta and qual because their names must match).

Since someone else is asking about this and their problem might come
from the very same issue, it might be good for me to just include this
as part of my process somewhere. Either add an additional 'data prep'
script that would verify all that, or just try to include it as part
of my other scripts in a way that it could handle this properly. I'm
just worried about all the formats people could end up having, if for
instance someone had reads named
>Read 1
>Read 2
...

then my script would crash because all reads would be named the same.
I think for now I will just update the documentation to make it very
clear that the first part of the read id must contain only a unique
identifier to the read, and supply these perl one-liners to clear the
files from anything trailing the names if necessary.

Let me know if this fixes your problem or not, I'd like to know if
there is other issues.

Thanks!
Patrick

Fabio Luciani

unread,
Mar 20, 2012, 11:28:36 PM3/20/12
to viral-to...@googlegroups.com
Dear Patrick

I tried your idea of
using this command line


perl -p -i -e 's/>(.+?) .+/>$1/g' <filename.fa>
perl -p -i -e 's/>(.+?) .+/>$1/g' <filename.qual>

but unfortunately this did not fix the problem as I still get many of these

Use of uninitialized value in length at ../samToQlx.pl line 232, <BAMFILE> line 9087.
Use of uninitialized value in length at ../samToQlx.pl line 232, <BAMFILE> line 9088.
Use of uninitialized value in length at ../samToQlx.pl line 232, <BAMFILE> line 9089.
Use of uninitialized value in length at ../samToQlx.pl line 232, <BAMFILE> line 9090.
substr outside of string at ../samToQlx.pl line 213, <BAMFILE> line 9091.
Use of uninitialized value in concatenation (.) or string at ../samToQlx.pl line 213, <BAMFILE> line 9091.
substr outside of string at ../samToQlx.pl line 197, <BAMFILE> line 9091.
Use of uninitialized value in concatenation (.) or string at ../samToQlx.pl line 197, <BAMFILE> line 9091.
substr outside of string at ../samToQlx.pl line 213, <BAMFILE> line 9091.
Use of uninitialized value in concatenation (.) or string at ../samToQlx.pl line 213, <BAMFILE> line 9091.
substr outside of string at ../samToQlx.pl line 197, <BAMFILE> line 9091.
Use of uninitialized value in concatenation (.) or string at ../samToQlx.pl line 197, <BAMFILE> line 9091.
Use of uninitialized value in length at ../samToQlx.pl line 232, <BAMFILE> line 9091.
Argument "RG:Z:ZHPBF16GGSA" isn't numeric in numeric eq (==) at ../samToQlx.pl line 164, <BAMFILE> line 9092.
Use of uninitialized value in subtraction (-) at ../samToQlx.pl line 179, <BAMFILE> line 9092.
Use of uninitialized value in subtraction (-) at ../samToQlx.pl line 183, <BAMFILE> line 9092.
Use of uninitialized value in length at ../samToQlx.pl line 185, <BAMFILE> line 9092.
Use of uninitialized value in length at ../samToQlx.pl line 221, <BAMFILE> line 9092.
Use of uninitialized value in substr at ../samToQlx.pl line 238, <BAMFILE> line 9092.


any idea?
It looks that some positions have more trouble than others/

The reads now are 

head 1.GAC.454Reads_RLMID7_47317T20W8.fna
G5ZVNFF01A0ZHG
TCGCGTCTCTCAGCACACA
G5ZVNFF01BAD7S
TGTGTGCGTGTCGCGTCTCTCAGCACACAG
G5ZVNFF01A2RT4
ATCAAGGACAGCAAGGCAGTTCATGCTGACATGGGGTACGTGATAGAAGTAGAAGAACG
G5ZVNFF01A2W0F
AGCTGGAAGACTAATACATAGGGCGAGATGTACAGAACTCCACTT
G5ZVNFF01AUU29
TGGTAACATTCAAGACAGCTCATGCAAGAAGCAGGAAGTGGTCGTACTAGGATCACAAGA

head 1.GAC.454Reads_RLMID7_47317T20W8.qual
G5ZVNFF01A0ZHG
32 30 27 27 27 31 31 33 32 30 23 24 22 22 25 28 28 23 17
G5ZVNFF01BAD7S
23 23 23 23 23 23 29 23 24 23 30 30 27 19 19 19 27 31 32 32
30 27 27 25 25 27 24 24 14 14
G5ZVNFF01A2RT4
22 25 22 17 17 17 17 26 27 27 27 27 17 17 17 19 27 24 24 23
23 24 24 25 27 28 28 28 28 27 18 19 11 11 11 11 14 14 22 14
17 17 19 19 17 19 17 17 19 19 18 19 23 27 28 28 26 28 28
G5ZVNFF01A2W0F



Thanks

Fabio

Fabio Luciani

unread,
Mar 20, 2012, 11:32:31 PM3/20/12
to viral-to...@googlegroups.com
I forgot to add that I do get a qlx find, despite these warnings
However, when I tried to run vphaser.pl I get all errors like this

Use of uninitialized value in numeric le (<=) at ../vphaser.pl line 2086, <FFILE> line 54948.
.Use of uninitialized value in numeric eq (==) at ../vphaser.pl line 1584, <FFILE> line 54948.
Use of uninitialized value in concatenation (.) or string at ../vphaser.pl line 1596, <FFILE> line 54948.
009244?==AA>4//045>>?866==?;=8855252225(8997700.1151,,,,,,133/0000777995300!?>8...88>=;;8888<9<`: Expected to end ref at  but ended at -2 instead!!


Fabio

Patrick Charlebois

unread,
Mar 21, 2012, 10:08:05 AM3/21/12
to Broad Viral Tool Users
Did you actually remove the > from the read names or is that just a
typo from what you pasted here? If that's not the issue, would it be
possible for you to send me just a small subset (say 5 reads and quals
and the sam file containing only these 5) of your data? It would
likely help me figure out what the problem is.

Thanks,
Patrick
> ...
>
> read more »

Patrick Charlebois

unread,
Mar 24, 2012, 2:44:36 PM3/24/12
to Broad Viral Tool Users
I sent you an email. The issue seems to be with the .sam file itself
that has different length of DNA string and qual strings. Not sure
what caused that when you generated it.

On Mar 21, 10:08 am, Patrick Charlebois <patri...@broadinstitute.org>
wrote:
> ...
>
> read more »

Fabio Luciani

unread,
Mar 24, 2012, 5:01:20 PM3/24/12
to viral-to...@googlegroups.com
HIpatrcik
weird, I have use dMOSAIK.
It must be the qual score file.
I have generated them with NExtgene.
Well, not me the poeple who ran the sample.
Maybe I shoudl try other samples
F

Fabio Luciani

unread,
Mar 24, 2012, 5:23:24 PM3/24/12
to viral-to...@googlegroups.com
Hi Patrick

I will try with few more MOSAIKs run, and also see whether this is an issue prior to SAM.
It is unfortunate because I ran MOSAIK for all my samples already and not a problem.
Will keep you informed.
I will now try rc454 and check
Fabio


On 25/03/2012, at 5:44 AM, Patrick Charlebois wrote:

Patrick Charlebois

unread,
Mar 24, 2012, 5:29:56 PM3/24/12
to Broad Viral Tool Users
Out of curiosity, what version of Mosaik are you using?

I know that this was developped with 1.1.0013 and it's still what I'm
using here. I know that some prior versions of Mosaik had a bad SAM
output. If you're using a newer version then in theory it should be
better, but who knows. I have not tested compatibility of everything
with other version of Mosaik.

Patrick
> ...
>
> read more »

Fabio Luciani

unread,
Mar 24, 2012, 5:39:31 PM3/24/12
to viral-to...@googlegroups.com
I am using 14, That is why I am worried.
I have other SAM files built with 10,11,12,13
I have been running Mosaik since 2009, that is why I am worried.
F

Fabio Luciani

unread,
Mar 24, 2012, 5:43:38 PM3/24/12
to viral-to...@googlegroups.com
one thing I noticed, the files I sent to you, I had to make them by sorting reads into sam file.
Basically, theorder in which reads were into the fasta and qual, was not the order in which they appear in the sam file.
IS that something that could affect results?
Fabio

On 25/03/2012, at 8:29 AM, Patrick Charlebois wrote:

Patrick Charlebois

unread,
Mar 24, 2012, 6:01:37 PM3/24/12
to Broad Viral Tool Users
I don't believe that should do anything for the samToQlx.pl tool. It
reads the data into memory by read id, so the order should not matter.

I wonder if a mistake could've happen during the sorting of reads in
the same file that might've resulted in some of the weird things I
saw, especially something like the DNA and Qual string not having any
tab between them. That seems like a weird error to me.
> ...
>
> read more »

Fabio Luciani

unread,
Mar 24, 2012, 6:14:02 PM3/24/12
to viral-to...@googlegroups.com
than I can I just send to you via coudstore an example that I did not modify?
Just a final check.
I woud like to send to you a realtively small example of fasta, qual and sam
maybe on HCV data that I checked more carefully
if this is ok I can send them to you know
Fabio

Patrick Charlebois

unread,
Mar 24, 2012, 7:48:24 PM3/24/12
to Broad Viral Tool Users
Sure, would be fine.
> ...
>
> read more »

Fabio Luciani

unread,
Mar 24, 2012, 8:00:54 PM3/24/12
to viral-to...@googlegroups.com
Thanks Patrick
very much appreciated

Fabio Luciani

unread,
Mar 30, 2012, 8:05:11 PM3/30/12
to viral-to...@googlegroups.com, Patrick Charlebois
Dear Patrick

I sent to you the data via cludstor, can you please confirm that you received it?
Also, I am still stuck with the SamtoQlx and I am getting convinced that it is the fact that all my qual and fna files are generated with NextgenE

thanks
Fabio


On 25/03/2012, at 10:48 AM, Patrick Charlebois wrote:

Jonathan Jacobs

unread,
Jul 11, 2012, 4:01:37 PM7/11/12
to viral-to...@googlegroups.com, Patrick Charlebois
Was this issue ever fixed? I am having the same issue. I followed your advice about the read naming only including alphanumeric titles - required a bit of mucking around with our data . Actually went back to our FASTQ files coming off our sequencer, and ran this script to remove the colons and spaces from the titles of reads in the FastQ

#!/usr/bin/perl
open FILE, shift(@ARGV) or die $!;
my $file = join '', <FILE>;
close FILE;
while($file =~ /(@.+?)(\n.+?\n\+\n.+?\n)/g){
$title = $1; $data = $2;
$title =~ s/\W//g;
print $title,$data;
}

Following that - we imported the FASTQ into CLCbio, redid our reference mapping, exported the SAM files and a new FASTQ for just the reads that mapped. Then split that FASTQ files into FASTA and QUAL files using BioPieces' write_454 tool (very handy). We then ran sam2qlx.pl and got the same error as before:

substr outside of string at samToQlx.pl line 198, <samFILE> line 60564.
Use of uninitialized value in concatenation (.) or string at samToQlx.pl line 198, <samFILE> line 60564.

I forged ahead and tried to get Vphaser.pl to run and got this error:

Use of uninitialized value in string eq at vphaser.pl line 1856, <FFILE> line 363360.
Use of uninitialized value in string eq at vphaser.pl line 1872, <FFILE> line 363360.
Use of uninitialized value in string eq at vphaser.pl line 1872, <FFILE> line 363360.
Use of uninitialized value in string eq at vphaser.pl line 2014, <FFILE> line 363360.
Use of uninitialized value in string eq at vphaser.pl line 2014, <FFILE> line 363360.

This looped endlessly and had to be ^C killed. $info->{RefBase} is the empty/uninitialized string. So, I modded the Vphaser.pl script a bit to dump the values of $info out using the Data::Dumper package. This is what I get when that error gets thrown the first time. The code I added was to the processReadInfo() subroutine (line 1809) and this is what I added around line 1830.

unless ($info->{RefBase}){
print STDERR '$info->{RefBase} not defined!',"\n";
print Dumper(\%$info),"\n\n",$info->{ReadBase},"\n";
exit;
}

This results in the following output after a huge number of "." (periods) get printed.

............................. 
Summarizing reads...
.$info->{RefBase} not defined!
$VAR1 = {
          'BQ' => 32,
          'ID' => 'RXOA21727745',
          'NQS' => '1',
          'ReadEnd' => '96',
          'Polarity' => '-',
          'LocusInfo' => {
                           'ID' => [],
                           'Window' => [
                                         undef,
                                         205,
                                         205,
                                         205,
                                  ... lots of increasingly larger integers...
                                         1690,
                                         1690,
                                         1690,
                                         1690,
                                         1690,
                                         1690,
                                         1690,
                                         1690,
                                         1690,
                                         1690,
                                         1690,
                                         1690,
                                         1690,
                                         1690,
                                         1690
                                       ]
                         },
          'RefEnd' => 1,
          'HQ' => {
                    'DeleteLocus' => -1,
                    'BQSum' => 0,
                    'DeletePos' => -1,
                    'State' => 'R',
                    'InsertLocus' => -1,
                    'LastNuc' => 'N',
                    'GapLength' => 0,
                    'InsertPos' => -1
                  },
          'RawBase' => 'A',
          'ModelCounts' => {
                             'HQ' => {},
                             'All' => {}
                           },
          'RefBase' => undef,
          'Run' => 0,
          'PosEnd' => 95,
          'Region' => 0,
          'ReadBase' => 'T',
          'ReadLength' => '169',
          'NQSSeq' => [
                        '1',
                        '1',
                        '1',
                        '1',
                        '1',
                        '1',
                   ..... lots of data.... 
                        '0',
                        '0',
                        '0',
                        '0',
                        '0',
                        '0'
                      ],
          'Increment' => -1,
          'BQSeq' => [
                       32,
                       32,
                       32,
                       35,
                       .... Lots of Quality Data...
                       29,
                       22
                     ],
          'Switch' => 0,
          'CurRead' => 0,
          'Reads' => 60560,
          'RefPos' => 97,
          'EmpiricalCounts' => {
                                 'HQ' => {},
                                 'All' => {}
                               },
          'RefSeq' => [],
          'ReadPos' => 0,
          'All' => {
                     'DeleteLocus' => -1,
                     'BQSum' => 0,
                     'DeletePos' => -1,
                     'State' => 'R',
                     'InsertLocus' => -1,
                     'LastNuc' => 'N',
                     'GapLength' => 0,
                     'InsertPos' => -1
                   },
          'ReadSeq' => [
                         'T',
                         LOTS OF SEQUENCE... 
                         'C'
                       ]
        };

Use of uninitialized value in print at vphaser.pl line 1834, <FFILE> line 363360.


I don't know what to make of this - but it looks like $LocusInfo->Window[0] is undef and this may be the problem (I've bold/highlighted that above)... or maybe that isn't the issue. 


Patrick - can you make out from the above Data::Dumper what the issue may be?  Is this a BUG that can be fixed or is it something with our data files?

Cheers, 

Jonathan Jacobs
MRIGlobal


Patrick Charlebois

unread,
Jul 11, 2012, 4:23:47 PM7/11/12
to viral-to...@googlegroups.com, Patrick Charlebois
Hi Jonathan,

I am really not sure what to make out of this. I do not know the very details of V-Phaser's code (I did not code this script) but I would think the problem is likely from the input file seeing as I never encountered something like this.

It seems that you have an issue with samToQlx.pl before running V-Phaser, so my guess is that something must be off in there that is causing V-Phaser to act up afterwards. One thing that could be worth looking into for the Qlx file is to be sure that every set of reference-read-nqs-qual sequences have the same length (for a given read I mean, different reads can have different length but within one read all 4 strings should have the same length, but the error you have makes me think this might not be the case, which would cause V-Phaser to have an 'empty' data somewhere that might create the problems).

If this is not the issue, would it be possible to send me a small subset of your data that causes the problem for me to look into? If you can send me like 5 reads in your sam file that send you an error when you try to convert to qlx (and the reference you're using), I could try figuring out what's wrong in this particular case.

We should have a replacement for V-Phaser available relatively soon that will not be using qlx files, hopefully this will solve these format issues!

Thanks,
Patrick

Jonathan Jacobs

unread,
Jul 11, 2012, 4:49:38 PM7/11/12
to viral-to...@googlegroups.com, Patrick Charlebois
Thank you for your quick reply Patrick.

I just noticed my QLX files looked different than the test set after I posted. So I rolled my MOSAIK back to version 13 and regenerated my QLX file. They now look fine - but I am still getting a VPHASER.pl issue (below) when I run perl vphaser.pl:

Use of uninitialized value in numeric eq (==) at vphaser.pl line 2060, <FFILE> line 362586.
Use of uninitialized value in numeric eq (==) at vphaser.pl line 2060, <FFILE> line 362586.
Use of uninitialized value in numeric eq (==) at vphaser.pl line 2060, <FFILE> line 362586.
Use of uninitialized value in numeric eq (==) at vphaser.pl line 2060, <FFILE> line 362586.
Use of uninitialized value in numeric eq (==) at vphaser.pl line 2060, <FFILE> line 362586.

repeatededly. It eventually stops and says  "defining models...." and now it's crunching away, pinged at 100% CPU usage for the last 20minutes on one of my processors. I don't know how long I should wait, but I'm running a 12-core system with 32GB of ram - so I don't mind letting it run overnight. Hopefully I'll have some good news tomorrow.


Jonathan Jacobs

unread,
Jul 12, 2012, 9:32:59 AM7/12/12
to viral-to...@googlegroups.com
OK - quick update. Vphaser seemed to run to completion (I think) and generated a series of files. It took about 4 hours on my 12-core system. I remains to be seen if the software works as advertised though. The files it generated were:

DQ380149ZH501S2.vph_calls.txt
DQ380149ZH501S2.vph_model.txt
DQ380149ZH501S2.vph_out.txt
DQ380149ZH501S2.vph_snp.txt

The next step was to run vprofiler.pl - I initially couldn't get this to work either though; the documentation on how to run it is just not there.

Diving into the test data I noticed that the vprofiler.pl script uses a input file that, in itself, is sort of like a FASTA file with file locations, etc of all the previously generated files and a couple of files with their own format requirements. (GeneList and Haplolist). I went ahead and constructed files that matched those formats for RVFV (the virus we are studying), and then ran vprofiler and got this result:

jacobs$ perl vprofiler.pl -i ~/DQ380149-vprofiler-input.txt -o ~/vpro -noendvariant=10 -nt -codon
invalid VPhaser input file for RVFV ~/DQ380149ZH501S2.vph_calls.txt at vprofiler.pl line 192, <INPUT> line 11.

Turns out that using ~/filename instead of the full path was causing the problem. The script cannot interpret ~/filename, and using the full path worked (/Users/jjacobs/filename). So I modded the vprofiler-input.txt file with full path names to the files, and ran it. Again I got a huge number of exceptions and undefined/uninitialized variable warnings. It ended with 

substr outside of string at bioinformatics/VpSoftwarePackage/vprofiler.pl line 823, <AAALIGN> line 23426.
Use of uninitialized value $newqualstr in substr at bioinformatics/VpSoftwarePackage/vprofiler.pl line 771, <AAALIGN> line 23426.
Use of uninitialized value $newqualstr in substr at bioinformatics/VpSoftwarePackage/vprofiler.pl line 823, <AAALIGN> line 23426.

This goes on repeatedly for quite a while and then ends with :

Error in library(gplots) : there is no package called ‘gplots’
Execution halted
Error in library(gplots) : there is no package called ‘gplots’
Execution halted
Error in library(gplots) : there is no package called ‘gplots’
Execution halted
mv: rename /Users/jjacobs/vpro_RVFV_Heatmap_All//Users/jjacobs/vpro_RVFV_Heatmap_Syn/RVFV_Full_Heatmap.pdf to /Users/jjacobs/vpro_RVFV_Heatmap_All/RVFV_Full_Heatmap_Syn.pdf: No such file or directory
mv: rename /Users/jjacobs/vpro_RVFV_Heatmap_All//Users/jjacobs/vpro_RVFV_Heatmap_NonSyn/RVFV_Full_Heatmap.pdf to /Users/jjacobs/vpro_RVFV_Heatmap_All/RVFV_Full_Heatmap_NonSyn.pdf: No such file or directory
rm: /Users/jjacobs/vpro_RVFV_Heatmap_All//Users/jjacobs/vpro_RVFV_Heatmap_NonSyn: No such file or directory
rm: /Users/jjacobs/vpro_RVFV_Heatmap_All//Users/jjacobs/vpro_RVFV_Heatmap_Syn: No such file or directory
rm: /Users/jjacobs/vpro_RVFV_Heatmap_All//Users/jjacobs/vpro_RVFV_Heatmap_Textfiles_Folder: No such file or directory

Which is strange. A - gnuplot and rplots is definitely installed an in the path. Not sure what gplots is - when I run "gem install gplots" nothing comes up as a match. B - the mv/rm commands are doubling the pathnames (see above). Not sure what this might be either.

In the end - I give up. I hope this software eventually works. From your published results it looks very good, but so far on our systems it doesn't work. Perhaps it's a OSX issue - or that we are not working with 454 data at the start? I'm not quite sure, but I've given a solid shot and about 8 - 10 hours of troubleshooting so far.

My recommendation would be to use something like BioPerl or BioPython to handle data exchange, file importing and reformating, etc. The documentation is also in need of updating.

Best regards - I'll keep an eye out for any updates you may come out with down the road. I hope this software does get updated, we could make good use of it for our work on Rift Valley virus data.


Michael C. Zody

unread,
Jul 12, 2012, 9:47:59 AM7/12/12
to viral-to...@googlegroups.com
Jonathan,

Thanks for your feedback.

The original version of V-phaser was designed for 454 data. The performance was never optimized for the read depths we get with Illumina data. We're currently working on a new version that should have better time and memory performance on deeper sequencing and will also handle paired reads. Look for that in the next few months.

As to the other errors, I see you're running on Mac OS. Often programming packages we take for granted on Unix aren't installed on Macs. It's a real pain, and we've never really tested this code on Mac OS, so we haven't seen these issues before. We'll take a look and try to get some notes on how to bring the install up to what we're expecting.

Lastly, I think where there are some residual errors you're seeing in the last run, it would be useful if you could extract the referenced line in the input file so we can see if this is really an upstream data problem in the pipeline that we're not catching properly in V-profiler versus a problem with V-profiler's algorithm implementation (although where it references the same line over and over, we may have just left a filehandle open).

Mike

Patrick Charlebois

unread,
Jul 13, 2012, 6:59:28 PM7/13/12
to viral-to...@googlegroups.com
Hi Jonathan,

thanks for the feedback. If you look through the V-Phaser_V-Profiler_UserGuide.doc file in the package, the whole documentation on how to run V-Profiler should be present in the second section, including command line, options and input file format.

It's true that ~ cannot be used in the input files. This ~ to specify the path has to be converted by the shell script when the command line is interpreted, but when it is in an input file that is used directly by perl the conversion is not done by the shell and so it is really looking for a file with ~ in the name. As a rule of thumb I would recommend not to use the ~ notation to specify your home directory in an input file but only directly in command lines.

The uninitialized variable problem to me really seems like an input file issue. Did you get your read alignment files from Mosaik or did you convert a Sam file to Qlx format? Although if V-Phaser managed to run properly the format should be ok in theory, but I remember there was some issue with that too before. It might also be simply from how the input was done, if you didn't have the documentation to guide you. My best guess would likely be that it is a file format conversion issue though. 

For the last part with gplot, the script does not look for a gplot library itself, but it calls to R in order to generate the Heatmaps. The R version that is recommended in the documentation is 2.9. Did you try with this version? A different version might have trouble with the calls made to the Heatmap functions in R.

If you have more questions or want me to try running the scripts on some of your data let me know.

Patrick Charlebois

unread,
Aug 30, 2012, 9:22:38 PM8/30/12
to viral-to...@googlegroups.com
I have just released a major update to RC454 and mostly the samToQlx and qlxToSam format conversion tools. Everything has now moved to using the latest version of Mosaik, and the format conversion tools should be much better at handling external sam files. Hopefully most of the issues raised in this thread have been fixed by this. You can re-download the package, and look at the thread specifically about the subject for some extra details.

Patrick

Jonathan Jacobs

unread,
Aug 30, 2012, 10:11:22 PM8/30/12
to viral-to...@googlegroups.com
Excellent! Thank you. I just finished doing another major round of NGS of Rift Valley Fever virus and was gearing up to dive into the data over the next couple weeks. This is good timing!

Cheers,

**********
Jonathan Jacobs, PhD
(c) 240 447 4039

"You have read everything, drunk nothing" -- Éluard

Joseph Hughes

unread,
Nov 26, 2012, 10:55:47 AM11/26/12
to viral-to...@googlegroups.com
Hi,

I have generated a sam file using Stampy which I would like to convert to qlx to be able to use V-phaser. The reads are paired-end. I Have converted the fastq to fasta and qual (see below).
>M00599_3_000000000-A1F3P_1_1101_14422_1594
CTTCAAGATAGAAAAAGGAAAAGTAGTTAAATCAGTTGAATTGAATGCCCCTAATTATCACTACGAGGAGTGCTCCTGTTATCCTGATGCAGGCGAAGTCATGTGTGTGTGCAGGGATAACTGGCATGG
>M00599_3_000000000-A1F3P_1_1101_16552_1603
CCCCACCGGAACAGAGTAGGATGCAGTTCTCTTCTCTAACTGTGAACATAAGGGGTTCAGGAATGAGAGTACTTGTGAGAGGCAACTCCCCTGTGTTCAACTATAACAAGGCAACCAAGAGGCTC
>M00599_3_000000000-A1F3P_1_1101_16435_1632
GCTCTCGTGCGTACTGGGATGGACCCCAGAATGTGCTCTCTGATGCAAGGATCAACTCTCCCGAGGAGATCTGGAGCTGCTGGTGCGGCAGTGAAGGGAGTTGGAACGATGGTAATGGAGCTAATTCG
>M00599_3_000000000-A1F3P_1_1101_15894_1644
TGCAACCAAAGCATTACTGAACAGGCTGTAACTTCAGTGACATTAGCGGGCAATTCATCTCTTTGCCCTATTAGTGGGTGGGCTATATACAGTAAGGATAACGGTATAAGAATTGGTTCCAAGGGGG
>M00599_3_000000000-A1F3P_1_1101_15062_1659
TGTGATAGCAATCTCCTTCACAATTGGCATCAACCTGTACCCCACTCTGAATCCCCATAGACTTCCCTCTCAGAAAACTTGCACGGTCTGGAGCTATGAAGGCCCCATTGAAACTGAAAGTG

>M00599_3_000000000-A1F3P_1_1101_14422_1594
38 38 40 40 40 39 40 39 39 39 40 39 40 37 40 40 39 39 40 40 40 40 40 32 37 37 39 39 40 39 39 40 40 40 39 39 40 39 37 34 38 39 39 40 39 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 39 39 39 39 39 39 37 39 39 37 39 39 37 37 39 39 37 39 39 38 39 37 39 37 33 37 37 29 27 35 36 39 34 36 37 37 35 37 35 35 37 30 35 36 27 35 35 38 38 36 36 36 36 36 36 29 34 36 32 34 32 28 29 32 27 
>M00599_3_000000000-A1F3P_1_1101_16552_1603
38 38 39 39 39 39 39 39 39 39 39 39 39 40 40 40 39 39 39 37 38 39 40 40 40 39 40 40 40 40 40 40 40 40 40 40 39 39 40 40 40 40 40 39 34 36 39 39 39 39 40 39 40 40 39 37 39 40 39 39 40 40 37 38 38 36 38 39 39 38 35 38 39 40 37 38 39 28 33 33 13 28 34 35 39 37 39 39 37 35 28 37 37 39 39 38 38 38 37 37 38 38 37 33 35 29 35 35 13 28 35 36 36 35 38 36 34 32 34 36 34 38 26 32 27 
>M00599_3_000000000-A1F3P_1_1101_16435_1632
37 37 39 39 39 40 39 39 39 40 39 36 36 39 37 40 40 40 39 39 39 39 40 40 40 39 39 39 39 39 39 39 39 40 37 39 40 40 40 40 40 39 39 40 39 40 39 39 39 37 39 39 40 40 33 36 39 39 40 40 39 40 39 36 36 36 8 34 30 35 38 39 39 38 28 30 37 39 33 35 37 39 13 22 30 37 39 33 33 21 8 33 35 20 28 35 36 36 27 30 32 32 32 32 20 11 20 18 9 27 34 36 36 34 11 28 20 9 25 34 32 36 25 32 25 34 32 34 
>M00599_3_000000000-A1F3P_1_1101_15894_1644
38 38 40 40 40 40 40 39 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 39 39 39 40 40 40 40 40 40 37 38 39 40 39 39 39 40 40 40 39 39 39 39 39 39 39 39 39 40 40 40 37 39 40 40 40 40 40 40 40 39 39 39 39 39 39 39 39 39 39 39 39 39 39 39 39 38 38 38 38 38 38 38 37 37 37 38 37 36 38 38 38 38 38 35 28 28 32 26 34 36 38 36 36 36 38 38 34 34 36 36 34 38 38 29 30 30 33 27 
>M00599_3_000000000-A1F3P_1_1101_15062_1659
37 37 39 40 40 37 39 39 40 40 39 35 37 36 36 39 39 40 35 38 35 36 38 39 39 39 40 37 39 38 39 40 40 40 40 37 39 37 37 39 39 40 40 39 39 39 40 40 40 33 37 39 39 40 39 40 40 40 40 33 38 39 37 33 38 39 39 35 34 37 39 39 39 40 34 38 37 37 38 39 39 39 39 37 39 39 37 36 39 39 37 37 37 39 35 37 39 37 39 39 39 39 39 39 36 35 36 37 37 37 37 37 37 37 37 35 36 35 35 36 33 20 


When I try to convert to qlx, I get the following error:

Use of uninitialized value in concatenation (.) or string at samToQlx.pl line 248, <samFILE> line 16002.
Use of uninitialized value within %readseqs in length at samToQlx.pl line 221, <samFILE> line 16003.
Use of uninitialized value in pattern match (m//) at samToQlx.pl line 225, <samFILE> line 16003.
Use of uninitialized value $1 in length at samToQlx.pl line 226, <samFILE> line 16003.
Use of uninitialized value within %readseqs in length at samToQlx.pl line 235, <samFILE> line 16003.
Use of uninitialized value $nqsstr in substr at samToQlx.pl line 238, <samFILE> line 16003.
substr outside of string at samToQlx.pl line 248, <samFILE> line 16003.
Use of uninitialized value in concatenation (.) or string at samToQlx.pl line 248, <samFILE> line 16003.

A .qlx file is created but the binary line is missing:

>Read M00599:3:000000000-A1F3P:1:1101:14422:1594 1 129 0 + NA_cons 719 847
TGTTGCATTTGACCCAAGGGACCTGCTGGGAACAAATGTACACACCTGGAGGGGAGGTAAGAAATGATGATGTTGATCAGAGTTTAATTATTGCTGCTAGAAATATTGTTAGGAGAGCAACAGTATCAG
CTTCAAGATAGAAAAAGGAAAAGTAGTTAAATCAGTTGAATTGAATGCCCCTAATTATCACTACGAGGAGTGCTCCTGTTATCCTGATGCAGGCGAAGTCATGTGTGTGTGCAGGGATAACTGGCATGG

GGIIIHIHHHIHIFIIHHIIIIIAFFHHIHHIIIHHIHFCGHHIHIIIIIIIIIIIIIIIIIIIHHHHHHFHHFHHFFHHFHHGHFHFBFF><DEHCEFFDFDDF?DE<DDGGEEEEEE>CEACA=>A<

>Read M00599:3:000000000-A1F3P:1:1101:14422:1594 1 97 0 + NA_cons 1032 1128
GTCATCTGTCAAAAGAGAAGAAGAAGTGCTCACAGGCAACCTCCAAACATTGAAAATAAGAGTGCATGAAGGATATGAGGAATTCACAATGGTTGGG
GGAAGAACAAAAAGCACTAGTTCCAGAAGCGGCTTTGAGATGATTTGGGATCCGAATGGGTGGACTGAAACGGACAGTAGCTTCTCAGTGAAACAAG

??CDCBFCFC.@77....77....CCEC5*88//-=?EFC-?A5--A555,,/FHHFEFGED/D?=DHHFDHEC/C9/C8-0/@FFGFFHHFF?HFF

>Read M00599:3:000000000-A1F3P:1:1101:16552:1603 1 125 0 + PB2_cons 1898 2022
CCCCACCGGAACAGAGTAGGATGCAGTTCTCTTCTCTAACTGTGAACATAAGGGGTTCAGGAATGAGAGTACTTGTGAGAGGCAACTCCCCTGTGTTCAACTATAACAAGGCAACCAAGAGGCTC
CCCCACCGGAACAGAGTAGGATGCAGTTCTCTTCTCTAACTGTGAACATAAGGGGTTCAGGAATGAGAGTACTTGTGAGAGGCAACTCCCCTGTGTTCAACTATAACAAGGCAACCAAGAGGCTC

GGHHHHHHHHHHHIIIHHHFGHIIIHIIIIIIIIIIHHIIIIIHCEHHHHIHIIHFHIHHIIFGGEGHHGDGHIFGH=BB.=CDHFHHFD=FFHHGGGFFGGFBD>DD.=DEEDGECACECG;A<


Any help/advice is greatly appreciated.

Joseph




Joseph Hughes

unread,
Nov 26, 2012, 11:42:10 AM11/26/12
to viral-to...@googlegroups.com
I've partly solved the problem. I had renamed the fasta and qual files IDs as suggested in a post above but had not changed the sam file IDs.
However,  I am still getting a large number of error messages:
Use of uninitialized value $letter in string eq at samToQlx.pl line 192, <samFILE> line 73872.
Use of uninitialized value $letter in string eq at samToQlx.pl line 201, <samFILE> line 73872.
Use of uninitialized value $letter in string eq at samToQlx.pl line 208, <samFILE> line 73872.
Use of uninitialized value $cigarstr in length at samToQlx.pl line 192, <samFILE> line 73872.
Use of uninitialized value $letter in string eq at samToQlx.pl line 192, <samFILE> line 73873.
Use of uninitialized value $letter in string eq at samToQlx.pl line 201, <samFILE> line 73873.
Use of uninitialized value $letter in string eq at samToQlx.pl line 208, <samFILE> line 73873.
Use of uninitialized value $cigarstr in length at samToQlx.pl line 192, <samFILE> line 73873.
Use of uninitialized value $1 in length at samToQlx.pl line 226, <samFILE> line 73875.
Use of uninitialized value $1 in length at samToQlx.pl line 226, <samFILE> line 73877.
Use of uninitialized value $1 in length at samToQlx.pl line 226, <samFILE> line 73878.


However, at least now the qlx file contains the four lines expected.

Any advice on how to solve the above would be much appreciated.

Joseph


Joseph Hughes

unread,
Nov 27, 2012, 10:23:07 AM11/27/12
to viral-to...@googlegroups.com
I've continued despite the above error  messages as my qlx file lloked good:
>Read M00599_3_000000000-A1F3P_1_1101_14422_1594 1 129 226 + NA_cons 719 847
TGTTGCATTTGACCCAAGGGACCTGCTGGGAACAAATGTACACACCTGGAGGGGAGGTAAGAAATGATGATGTTGATCAGAGTTTAATTATTGCTGCTAGAAATATTGTTAGGAGAGCAACAGTATCAG
CTTCAAGATAGAAAAAGGAAAAGTAGTTAAATCAGTTGAATTGAATGCCCCTAATTATCACTACGAGGAGTGCTCCTGTTATCCTGATGCAGGCGAAGTCATGTGTGTGTGCAGGGATAACTGGCATGG
111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
GGIIIHIHHHIHIFIIHHIIIIIAFFHHIHHIIIHHIHFCGHHIHIIIIIIIIIIIIIIIIIIIHHHHHHFHHFHHFFHHFHHGHFHFBFF><DEHCEFFDFDDF?DE<DDGGEEEEEE>CEACA=>A<

>Read M00599_3_000000000-A1F3P_1_1101_14422_1594 1 97 226 + NA_cons 1032 1128
GTCATCTGTCAAAAGAGAAGAAGAAGTGCTCACAGGCAACCTCCAAACATTGAAAATAAGAGTGCATGAAGGATATGAGGAATTCACAATGGTTGGG
GGAAGAACAAAAAGCACTAGTTCCAGAAGCGGCTTTGAGATGATTTGGGATCCGAATGGGTGGACTGAAACGGACAGTAGCTTCTCAGTGAAACAAG
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
??CDCBFCFC.@77....77....CCEC5*88//-=?EFC-?A5--A555,,/FHHFEFGED/D?=DHHFDHEC/C9/C8-0/@FFGFFHHFF?HFF

>Read M00599_3_000000000-A1F3P_1_1101_16552_1603 1 125 228 + PB2_cons 1898 2022
CCCCACCGGAACAGAGTAGGATGCAGTTCTCTTCTCTAACTGTGAACATAAGGGGTTCAGGAATGAGAGTACTTGTGAGAGGCAACTCCCCTGTGTTCAACTATAACAAGGCAACCAAGAGGCTC
CCCCACCGGAACAGAGTAGGATGCAGTTCTCTTCTCTAACTGTGAACATAAGGGGTTCAGGAATGAGAGTACTTGTGAGAGGCAACTCCCCTGTGTTCAACTATAACAAGGCAACCAAGAGGCTC
11111111111111111111111111111111111111111111111111111111111111111111111111100000000000111111111111111110000000000011111111110
GGHHHHHHHHHHHIIIHHHFGHIIIHIIIIIIIIIIHHIIIIIHCEHHHHIHIIHFHIHHIIFGGEGHHGDGHIFGH=BB.=CDHFHHFD=FFHHGGGFFGGFBD>DD.=DEEDGECACECG;A<

Vphaser runs without any error messages, I get lots of ..... on the screen:
perl vphaser.pl -i S15v2.qlx -o S15vphaserM1 -model 1 &
 
It took about 3 hours to process using the model 1. But unfortunately, the output files are all empty. Is this something to do with using Illumina data or paired-end data?

Any advice is welcome,
Joseph

Michael C. Zody

unread,
Nov 28, 2012, 10:40:16 AM11/28/12
to viral-to...@googlegroups.com
Joseph,

Nice debugging catch on the read name matching. That would have taken us some time to figure out, I'm sure.

The uninitialized variable warnings you were getting are, I think, a non-damaging bug. I notice it doesn't happen on every line. I'm guessing it's because you have unaligned reads in your sam file and samToQlx.pl forgets to check for the unaligned cigar string character. Can you check some of those lines and see if they are, in fact, unaligned reads? If that's the problem, we can implement a fix so it stops spitting out spurious warnings.

As for vphaser, I'm not sure what's going on here. In theory it should work on Illumina data, but it won't use the read pairing in the phasing algorithm (we have a new version that we're almost ready to release that will). Are you sure it's exiting properly and not running out of memory? Can you try feeding it a subset of your data with maybe 5000 reads in it and see if you get output? That might tell us whether the problem is something fundamental with your input files that we can help you fix or a problem with the data size.

thanks

Mike

Joseph Hughes

unread,
Nov 28, 2012, 2:45:36 PM11/28/12
to viral-to...@googlegroups.com
Hi Michael,

I'll remove the unmapped reads from the bam file and re-run to see if I get the same error. I will also try with a subset of the data in vphaser.
I'll keep you posted.

Thanks,
Joseph

Joseph Hughes

unread,
Nov 29, 2012, 9:39:28 AM11/29/12
to viral-to...@googlegroups.com
Hi Mike,

I have removed the unmapped reads from the sam file but I still get the error on line 226, e.g.:
Use of uninitialized value $1 in length at samToQlx.pl line 226, <samFILE> line 119794.

I think it is something to do with the fact that it is paired-end data and I have tried to figure out why some reads give errors. It seems that in some cases $rawreadstr would need to be reverse-complemented to provide a match in $readseqs{$readid} =~ /(.*?)$rawreadstr(.*)/;
Is that possible? 

Joseph

Joseph Hughes

unread,
Nov 29, 2012, 10:32:40 AM11/29/12
to viral-to...@googlegroups.com
I've uploaded some test files here: https://github.com/josephhughes/TestSets to see whether you also get the same error with samToQlx.pl

Cheers,
Joseph

Michael C. Zody

unread,
Nov 30, 2012, 11:05:42 AM11/30/12
to viral-to...@googlegroups.com
I'll try on Monday and see what I get. The matching should be on name, but there might be an issue with the way the paired reads are treated. I know we have run this successfully, so I'm not 100% sure what's going on for you.

Are you running our latest release? We made numerous changes to samToQlx.pl earlier this month. I don't think any of them affect your problems, but I would like to start from the same base code.

Mike

Joseph Hughes

unread,
Dec 1, 2012, 8:51:39 AM12/1/12
to viral-to...@googlegroups.com
Hi Mike,

I downloaded it just last Monday from the Broad Institute website.

Joseph

Michael C. Zody

unread,
Dec 4, 2012, 7:03:22 PM12/4/12
to viral-to...@googlegroups.com
Joseph,

I'm just now getting to this. I need your reference fasta file to run samToQlx.pl.

Mike

Joseph Hughes

unread,
Dec 5, 2012, 11:24:38 AM12/5/12
to viral-to...@googlegroups.com
Hi Mike,

Sorry about that. It is now in the TestSets folder.

Cheers,
Joseph

Michael C. Zody

unread,
Dec 5, 2012, 3:10:29 PM12/5/12
to viral-to...@googlegroups.com
Okay, well now I feel like an idiot. Your test data works fine when I run it, but that's because when we made all the fixes to samToQlx.pl in RC454, I didn't realize there was a separate version distributed with V-Phaser. The short answer for now is that you should try downloading the RC454 package (if you don't already have it) and running that version of samToQlx.pl instead of the one in the V-Phaser distribution.

Normally I would just put a new version of V-Phaser up right away, but the version in V-Phaser appears to be several major revisions back from the one in RC454 (to the point that the new version no longer requires the seq and qual files to run and the line where you're getting an error is no longer in the code in any recognizable form). I need to confirm that there aren't any lost features that some V-Phaser or V-Profiler functions may require.

If you can get through the V-Phaser run using the RC454 release of samToQlx.pl, though, then I'm guessing it will be fine.

Sorry it took this long to get to this point.

Mike
Reply all
Reply to author
Forward
0 new messages