How to Include User Sequences

Ben Karin

Jan 29, 2015, 3:08:26 PM
to phl...@googlegroups.com
I have an alignment and am trying to use PHLAWD to search GenBank and add aligned sequences to the existing alignment. My goal is to use PHLAWD with PUmPER on top of my starting alignment. How do you set up the config file to accept user sequences?

I see that this is possible based on the manual on the website where there is an empty page titled "User Sequences". This would be extremely useful for me.

Thanks,

Ben

Matthew Nelsen

Jan 29, 2015, 3:32:14 PM
to phl...@googlegroups.com
Hey Ben,

Here is some info Cody Hinchliff provided me with a while back regarding the same question:

----------------------------------------------------------------------------------------
You can specify a user fasta file in the phlawd config file. The format for that command should be:

userfasta = <path to file>

You may also need to add a line with just the word "assignleftovers" to the config file to tell phlawd to insert user sequences with taxon names it doesn't recognize, depending on which version you are using. (If you are using a version that requires that additional command, it will tell you that there were X leftover sequences not added.)
----------------------------------------------------------------------------------------



Ben Karin

Jan 29, 2015, 5:07:20 PM
to phl...@googlegroups.com
Thanks, that worked perfectly! 

Do you know if there is a way to download all sequences on GenBank for each taxon instead of removing duplicates? I would like to keep every sequence within my search on GenBank.

Ben

Matthew Nelsen

Jan 29, 2015, 6:31:04 PM
to phl...@googlegroups.com
Hi Ben,

My apologies - I think I may have misunderstood your previous email. The solution I'd sent was to add extra sequences to the phlawd search/pipeline (i.e., sequences not already in NCBI, outgroup taxa, etc.) prior to alignment. I don't think there is a way to add new seqs to an existing alignment in phlawd.

I am not sure if there is a simple way to search for and keep ALL GIs for a given locus for each taxon (instead of the longest/"best" sequence) within phlawd... this might require fiddling with the source code. Hopefully someone else will have a solution.

-Matt

Ben Karin

Jan 29, 2015, 6:45:42 PM
to phl...@googlegroups.com
Your first post did answer my initial question exactly. 

It would be nice if it is possible to keep all sequences matching the search, but I understand that this is not really the primary purpose of phlawd.

Ben

Milton Tan

Feb 19, 2015, 3:17:41 PM
to phl...@googlegroups.com
I was curious about the taxon name recognition of user sequence files. Is there a required format for the FASTA file of user sequences so that PHLAWD can recognize them?

Ben Karin

Feb 19, 2015, 3:56:13 PM
to phl...@googlegroups.com
I used a fasta file of 3 unaligned sequences with short and simple names and it worked great. Names like:

>Genus1_species1_MuseumNumber
ACTGACTGACTGACTGACTGACTGACTGACTGACTG
>Genus2_species2_MuseumNumber
CTGACTGACTGACTGACTGACTGAC

It didn't work when I tried with aligned sequences that had gaps in them. I used both of these lines in the config file:

userfasta = <path to fasta file>
assignleftovers
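
Since the aligned sequences with gaps did not work here, one option is to strip the gap characters before pointing userfasta at the file. The following is only a minimal sketch: the function name strip_gaps and the choice of "-" and "." as gap characters are assumptions for illustration, not anything PHLAWD itself provides.

```python
def strip_gaps(fasta_text, gap_chars="-."):
    """Return FASTA text with gap characters removed from sequence lines."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(line)  # header lines are left untouched
        else:
            for ch in gap_chars:
                line = line.replace(ch, "")
            if line:  # drop lines that were empty or consisted only of gaps
                out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical aligned input using the naming style shown above.
aligned = (
    ">Genus1_species1_MuseumNumber\nACTG--ACTG\n"
    ">Genus2_species2_MuseumNumber\n--CTGACT--\n"
)
unaligned = strip_gaps(aligned)
```

The result can be written to a new file and used as the userfasta path. The output always uses LF line breaks.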

Cody Hinchliff

Feb 19, 2015, 9:27:47 PM
to phl...@googlegroups.com
Milton,

Try opening your file in a text editor where you can choose the type of line breaks, and save it with Unix-style line breaks (that is, you want LF, not CR or CRLF). If you are on a Mac, you can use TextWrangler to do that.
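
If a text editor is not convenient, the same conversion can be scripted. This is a minimal sketch, not part of PHLAWD; the function names and file paths are hypothetical.

```python
def to_unix_line_endings(data: bytes) -> bytes:
    """Convert CRLF and bare CR line breaks to LF."""
    # Replace CRLF first so the CR in each pair is not converted twice.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(src_path, dst_path):
    """Read src_path as bytes and write an LF-only copy to dst_path."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(to_unix_line_endings(data))


# Small in-memory example mixing CRLF, bare CR, and LF line breaks.
sample = b">Genus1_species1\r\nACTG\r>Genus2_species2\nCTGA\r\n"
converted = to_unix_line_endings(sample)
```

Calling convert_file("user_seqs.fasta", "user_seqs_lf.fasta") (both names made up here) would write an LF-only copy that can then be given as the userfasta path.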

Milton Tan

Feb 26, 2015, 6:33:31 PM
to phl...@googlegroups.com
Sorry my question was vague. I wanted to make sure that the species name could be read from the fasta files properly, and was curious if there was a specific convention for this, or how specifically phlawd reads species names since my experience is that programs can be pretty specific with how text is formatted. Obviously they should match NCBI but besides that I wasn't sure what phlawd's assumptions were as far as how to find the name in the text following '>'.

Cody Hinchliff

Feb 27, 2015, 11:12:57 AM
to phl...@googlegroups.com
It should just read the whole line until it hits an LF char and then it assumes the next line is sequence data. The source code is available on GitHub if you want to have a look: https://github.com/chinchliff/phlawd/blob/master/src/fasta_util.cpp#L52

In general, it is still wise to avoid any whitespace or special characters (other than possibly underscores) in FASTA species names.
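
As a rough illustration of that advice, a header line could be cleaned up as follows. This is a sketch under assumptions: the function name is hypothetical, the set of allowed characters (letters, digits, underscores) is a conservative guess, and nothing here establishes how PHLAWD matches an underscored name against NCBI names.

```python
import re


def sanitize_header(header):
    """Replace each run of characters other than letters, digits, or
    underscores with a single underscore, and trim stray underscores."""
    name = header.lstrip(">").strip()
    name = re.sub(r"[^A-Za-z0-9_]+", "_", name)
    return ">" + name.strip("_")


# Hypothetical header containing spaces and special characters.
cleaned = sanitize_header(">Genus species (voucher #123)")
```

Here cleaned is ">Genus_species_voucher_123". Only header lines (those starting with ">") should be passed through this; sequence lines are left alone.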

Milton Tan

Feb 27, 2015, 1:10:24 PM
to phl...@googlegroups.com
Thanks, I have just renamed the sequences to only the organism name to remove all doubt. I have another question. I have some taxa that I have multiple sequences for. Right now PHLAWD is telling me it doesn't accept duplicate taxa. Can I get PHLAWD to ignore duplicate taxa in the userfasta file?

Cody Hinchliff

Feb 27, 2015, 2:04:05 PM
to phl...@googlegroups.com
Each sequence needs to have a unique name. It really doesn't matter what taxon the sequences are from, just give them unique names (e.g. taxonA_seq1, taxonA_seq2) and it won't consider them duplicates.
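
The renaming can be automated along these lines. This is a minimal sketch with a hypothetical function name; the _seqN suffix follows the example names above, and whether PHLAWD still matches a suffixed name to an NCBI taxon is not guaranteed.

```python
from collections import Counter


def make_unique(names):
    """Append _seqN to every name that occurs more than once;
    names that occur only once are returned unchanged."""
    totals = Counter(names)  # how often each name occurs overall
    seen = Counter()         # how many copies have been emitted so far
    unique = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            unique.append(f"{name}_seq{seen[name]}")
        else:
            unique.append(name)
    return unique


renamed = make_unique(["taxonA", "taxonB", "taxonA"])
```

Here renamed is ["taxonA_seq1", "taxonB", "taxonA_seq2"]. The sketch does not check whether a suffixed name collides with a name already in the file (for example an existing "taxonA_seq1"), so that case would need handling separately.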

Milton Tan

Feb 27, 2015, 3:13:10 PM
to phl...@googlegroups.com
Hmmm, PHLAWD doesn't seem to recognize the extra text after the names as separate. It says these taxa are not in the NCBI database as a number or a name, even if they are. It also tries to add these as leftover sequences. I'm not sure what to do here.

Milton Tan

Feb 27, 2015, 6:09:14 PM
to phl...@googlegroups.com
Well I got around this by removing all sequences with duplicate taxon names. After running PHLAWD, the output alignments have duplicate taxa as well. I am pretty sure this is not what PHLAWD is supposed to do, so I am confused.