Wrong sequences found


Thijs Janzen

Jan 28, 2017, 12:33:19 PM
to phyloGenerator Users
Hi Will,

I am using PhyloGenerator as a scraper to collect FASTA files for a large number of species at once. Currently, I'm collecting data on Lamprologini cichlids.

If I search GenBank for the gene "dlx2a", for instance, I find a reasonable number of hits (https://www.ncbi.nlm.nih.gov/nuccore/?term=dlx2a++lamprologini).

Furthermore, for Lamprologus callipterus, for instance, there are specific sequences for this gene.

However, if I let PhyloGenerator scrape for the gene dlx2a (or for "dlx2a"), it does not return any dlx2a genes, and instead defaults to the much more common sequences of NADH and RAG1 (which are interesting in themselves, but not the focus of this search).

This is very misleading - only the description after the GenBank accession number gives away that these sequences are, in fact, not dlx2a genes. Checking the GenBank accession numbers confirms that the sequences stored when writing "output" are indeed not the sequences of interest.

Is this a bug in the program, or am I doing something wrong?

Thanks in advance,

Thijs

Thijs Janzen

Jan 28, 2017, 1:40:03 PM
to phyloGenerator Users
This is with PhyloGenerator 1 btw.



Thijs Janzen

Jan 29, 2017, 2:56:10 PM
to phyloGenerator Users
I have managed a partial workaround by using phyloGenerator 2, which seems to find sequences much more accurately.
At least, that is what I'm inclined to believe.
Unfortunately, phyloGenerator 2 does not store the full FASTA output per sequence, only the GenBank accession code and the sequence.
Would it be straightforward to edit the download.rb script so that it can access this information and print it to file? That would be a great help, and would allow me to verify (at least quickly by eye) that everything went correctly.

Thanks in advance!

Thijs




Will

Jan 30, 2017, 2:05:49 PM
to Thijs Janzen, phyloGenerator Users
Hello Thijs,

Thanks for this. Would it be possible for you to send me the species and the exact gene names that you were searching for with pG1? It sounds like this is related to GenBank not always having perfect information about sequences, but it's difficult for me to be certain without knowing what you've typed in. You would also likely have more luck using the referenceDownload method, but I can give you more advice once you give me a species list.

I'm glad pG2 is working for you, but I'm a bit confused as to exactly what you want. A FASTA file is simply a sequence title and the DNA sequence - are you saying you want pG2 to save the file such that the title of the sequence is exactly what it would be if you downloaded it from GenBank? If so, yes, it's quite simple to modify the program in that way and I'll whack that together for you over my lunch break.

Cheers,

Will

---

Need a phylogeny? Try phyloGenerator: original or new version
Measuring phylogenetic structure? Try install.packages('pez')

Will Pearse
Assistant Professor of Biology, Utah State University
Skype: will.pearse

Will

Jan 30, 2017, 2:50:47 PM
to Thijs Janzen, phyloGenerator Users
Hello Thijs,

Thanks again for getting in touch. You can download a version of pG2 that should give you GenBank files in the format you want here (https://github.com/willpearse/phyloGenerator2/tree/raw_dwn). Do let me know about pG1 though; for what it's worth, I'm certain this is related to what's on GenBank and not something in pG1, but I'd be grateful for your species list so I can check regardless.


Cheers,

Will


Thijs Janzen

Jan 30, 2017, 3:07:13 PM
to phyloGenerator Users, thijs...@gmail.com
Hi Will,

You are awesome!
Thanks a lot for your reply!
I'm unsure what exactly caused it, but I have been playing a bit with the Python code of pG1, and found that some genes are apparently not found when searched for with the parameter [Genes]. For instance, try these search strings:
Altolamprologus calvus[Organism] AND S18[Genes]     (which doesn't give a hit)
Altolamprologus calvus[Organism] AND S18[All Fields]  (which does generate hits)
So indeed, it seems to be a GenBank side flaw, not pG related.
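
For anyone who wants to repeat this comparison, here is a minimal Python sketch that builds the two search strings above. The helper name `build_term` is hypothetical and is not part of pG1; the field tags are copied from the message as written. The resulting strings can be pasted into the GenBank search box.

```python
def build_term(organism: str, gene: str, gene_field: str = "Genes") -> str:
    """Build a GenBank search string restricted to one organism.

    gene_field is the field tag applied to the gene name, e.g. "Genes"
    (the restrictive form from the message) or "All Fields" (matches the
    name anywhere in the entry). build_term is a hypothetical helper.
    """
    return f"{organism}[Organism] AND {gene}[{gene_field}]"


# The two searches compared in the message above.
strict_term = build_term("Altolamprologus calvus", "S18")
loose_term = build_term("Altolamprologus calvus", "S18", gene_field="All Fields")

print(strict_term)
print(loose_term)
```

Running both terms for every species and comparing the hit counts would show which genes are only reachable through the looser search.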

My goal is to index genes of the tribe Lamprologini, with the intent of building a tree (I will do that separately from the pG suite). To do so, I am now tracking which genes cover a sufficient number of species, and hence am downloading a large number of FASTA files per gene, per species. To verify that the downloaded sequences are, in fact, the sequences they claim to be, it really helps to read the description. Oftentimes, the name of the gene is given in the description. This is also how I spotted the error in the first place (pG1 would say it downloaded sequences for the gene dlx2a, but the descriptions of said sequences described them as RH1 or ND2 or something else entirely).
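
The by-eye check described above can be roughly automated. The following is a minimal sketch, assuming each FASTA header has the form `>ACCESSION description`; the function names and the example records are invented for illustration and are not real GenBank entries or part of pG1/pG2. It flags every header whose description does not mention the expected gene name.

```python
def read_fasta_headers(lines):
    """Yield (accession, description) for each '>' header line."""
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # Everything up to the first space is treated as the accession.
            accession, _, description = line[1:].partition(" ")
            yield accession, description


def flag_mismatches(lines, gene_names):
    """Return headers whose description mentions none of gene_names."""
    wanted = [g.lower() for g in gene_names]
    return [
        (acc, desc)
        for acc, desc in read_fasta_headers(lines)
        if not any(g in desc.lower() for g in wanted)
    ]


# Made-up records for illustration only; not real GenBank entries.
example = [
    ">XX000001.1 Lamprologus callipterus dlx2a gene, partial cds",
    "ACGTACGT",
    ">XX000002.1 Lamprologus callipterus NADH dehydrogenase subunit 2 (ND2) gene",
    "TTGACCAA",
]

suspect = flag_mismatches(example, ["dlx2a"])
print(suspect)
```

This is only a quick screen: a plain substring match can miss synonyms of a gene name and can also match unrelated text, so flagged and unflagged records may both still need a manual look.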

Thanks a bunch!

Thijs


Will

Jan 30, 2017, 3:16:20 PM
to Thijs Janzen, phyloGenerator Users
Hello Thijs,

Thanks for this. It does indeed sound like this problem is caused by GenBank. Your first search is more restrictive (from memory, pG1's Python backend has a 'fussy' option that does basically this, and I believe it is the default when the program runs), whereas the second returns anything that mentions "S18" somewhere in the entry. So you could imagine a (silly, and very unlikely I hope :D !) example where someone uploaded a sequence and annotated a region as "absolutely nothing to do with S18", and your second search would find that sequence.
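
The difference between the two kinds of search can be illustrated with a toy model in Python. This is only a sketch of the idea, not how GenBank or pG1 actually index entries; the records, field names, and function names are all invented for illustration.

```python
# Toy records; the field names and contents are invented for illustration.
records = [
    {"id": "rec1", "gene": "S18", "note": "ribosomal protein S18"},
    {"id": "rec2", "gene": "ND2", "note": "absolutely nothing to do with S18"},
]


def match_gene_field(recs, name):
    """Restrictive search: the gene annotation itself must equal name."""
    return [r["id"] for r in recs if r["gene"] == name]


def match_all_fields(recs, name):
    """Loose search: name may appear anywhere in any field."""
    return [r["id"] for r in recs if any(name in str(v) for v in r.values())]


strict_hits = match_gene_field(records, "S18")
loose_hits = match_all_fields(records, "S18")
print(strict_hits, loose_hits)
```

The restrictive search depends entirely on the gene annotation being present and correct, while the loose search picks up any entry that mentions the name, including the silly one.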

From what you've said, pG2 is doing the job for you, yes? If I could make a recommendation as well, it would be to see if you get better results using the referenceDownload options in pG1/2. You'll find the cleaning it can do on your sequences quite useful (well, I find it useful!) and hopefully that will speed you along a little bit. It will also serve as a check against these sorts of problems: sequences from other regions shouldn't pass the referenceDownload step, and so you shouldn't have these problems. I wrote the add-on for you so that it should (...) still be able to do all those checks for you on the sequences as well as outputting things in GenBank format for you to have a closer look at.

Let me know how you get on. It sounds like you're doing something fun :D


Cheers,

Will


Thijs Janzen

Jan 30, 2017, 4:03:03 PM
to Will, phyloGenerator Users


> Let me know how you get on. It sounds like you're doing something fun :D

I sure am! It's a revision of this manuscript: http://biorxiv.org/content/early/2016/11/03/085431  
now with more genes, more species and a better tree (hopefully).

Thanks for all the help!

Thijs

Will

Jan 30, 2017, 8:01:40 PM
to Thijs Janzen, phyloGenerator Users

No worries. Best of luck with the paper - it looks great!

Will
