parsing taxonomy from custom format

Jean-Francois Martin

Aug 24, 2017, 8:07:35 AM
to MetacodeR
I really enjoy Metacoder, it is so helpful! Thank you for your effort in providing such a useful tool.
This is kind of a noob question, but I have not found a solution yet: I have an issue with parsing taxonomy from a custom format, and as I am not at ease at all with regex, I spent the last four hours trying to work out a solution for the following format. Would you mind helping with that, please? I suspect the space between the genus and species names in the species descriptor does not help, but I could not figure out how to handle it.
Thank you !
Jef

>10952483 Root;k__Viridiplantae_33090;p__Streptophyta_35493;c__sub__rosids_71275;o__Rosales_3744;f__Moraceae_3487;g__Ficus_3493;s__Ficus padana_100562;
ACACGCCGTTGCTCCCCCCCCCCCCCCCGCAAAACCCCCGTCCCGTTCCTGGCGGGGCGAGGGGGGACCATGGGGGGCGGAAAATGACCTCCCGTGCGATTTTCGACCCGCGGTTGGTCCAAAAATCGAGTCCCCTGTCACGTCGTCTTGGCAACAGGTAGTCGATCATTCGGTGCCACCGCCACGTGCGTCGGACACGCATCGGGACTCCGACAGACCCCAACGCGCCCGTCACGGGTGCCTCCAACGC
>119926030 Root;k__Viridiplantae_33090;p__Streptophyta_35493;c__undef__0;o__Caryophyllales_3524;f__Amaranthaceae_3563;g__Atriplex_3550;s__Atriplex canescens_35922;
ACGCATCGCGTCTCCCCCCACCACCCCGTGTGGATGGGGAGGAGGATGATGGCCTCCCATGCCTCACCGGGCGTGGATGGCCTAAATAAGGAGCCCCCGGTTACGAAGTGCCGCGGCGATTGGTGGAATACAAGGCCTAGCCTAGGATGAAACGGTAATCGCGCACATCGTAGCTCTTGAGGACTCGCAGGACCCTTACTTGTTTGCCCTTAGGGGCGGCAAAACCGTTGCGA

Zachary Foster

Aug 24, 2017, 1:55:53 PM
to MetacodeR
No problem, I am glad you like it!

Yeah, that format is a bit challenging because "Root" does not carry the same info as the rest and because of the mixture of spaces and underscores in the names: "o__Caryophyllales_3524", "s__Atriplex canescens_35922", and "c__undef__0" are all slightly different formats. That makes it hard to write a regex that matches all three formats, but I found one that works.
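
For readers without the attachment, here is a minimal sketch of one way to handle the three entry formats described above. It is written in Python rather than R and is not the code from the attached project; the pattern `ENTRY` and the function `parse_header` are hypothetical names made up for illustration. It assumes every entry after "Root" has the shape rank letter, two underscores, name, one or more underscores, numeric ID, which holds for the two sample headers but may not hold for the full dataset.

```python
import re

# Hypothetical pattern for one rank entry such as "o__Rosales_3744",
# "s__Ficus padana_100562", "c__sub__rosids_71275" or "c__undef__0".
# The name is matched lazily so that the trailing "_<digits>" is always
# captured as the ID, even when the name contains spaces or underscores.
ENTRY = re.compile(r"^(?P<rank>[a-z])__(?P<name>.+?)_+(?P<taxid>\d+)$")


def parse_header(header):
    """Split a FASTA header into a sequence ID and (rank, name, taxid) tuples."""
    # The sequence ID ends at the first space; later spaces belong to names.
    seq_id, _, lineage = header.lstrip(">").partition(" ")
    taxa = []
    for entry in lineage.split(";"):
        entry = entry.strip()
        # Skip the empty piece after the final ";" and the bare "Root" entry.
        if not entry or entry == "Root":
            continue
        match = ENTRY.match(entry)
        if match is None:
            raise ValueError(f"unrecognized entry: {entry!r}")
        taxa.append((match["rank"], match["name"], int(match["taxid"])))
    return seq_id, taxa
```

Calling `parse_header` on the first sample header returns the ID "10952483" and one tuple per rank. Raising an error on an unrecognized entry, instead of skipping it, is a deliberate choice here: it makes any further inconsistencies in a larger file show up immediately.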

Attached is an R project with an Rmd file that has the code to parse that sample both the current way and the new way that is in development. I attached a PDF of the output here too. If there are more inconsistencies in the entire dataset that are not in these two headers, it might not work on the whole dataset. If that happens, let me know.

-Zach
2017_08_24--matrin_Jean_francois--parsing.pdf
2017_08_24--matrin_Jean_francois--parsing.zip

Jean-Francois Martin

Aug 24, 2017, 2:29:00 PM
to MetacodeR
Thanks a lot for your fast answer.
Unfortunately, there seem to be more inconsistencies in the full-length file (see links below)...
Below are the links to the two files I would like to use (the first one is a direct link to the fasta file; the second is the file rbcL_all_Jan2016.rdp.fa within the zip file).

Those reference files are used for ITS2 and rbcL barcoding for plants (not my work), and I anticipate this format may become more widely used soon, as the authors provide a script to prepare any marker in this format.
I hope you can think of a way to accommodate this format; unfortunately, I have to say I cannot really help.
Another option would be to modify the fasta file itself, of course. Anything would work for me.
Thanks for your time anyway!
Jef

https://github.com/molbiodiv/meta-barcoding-dual-indexing/blob/master/precomputed/viridiplantae_all_2014.rdp.fa
https://figshare.com/articles/rbcL_rdp_trained_reference_database_zip/3827631

Jean-Francois Martin

Aug 25, 2017, 12:58:37 AM
to MetacodeR
Night helps clarify thoughts, and the more I think about it, the more I realise that using a format that is not really RDP compliant is an issue, not only for Metacoder but also for other parts of the analysis, such as dada2. I am sure you know it already, but dada2 has a function to assign taxonomy to sequences derived from amplicon sequencing, and it uses, among others, an RDP fasta format as reference, hence I would have the same issue there. I will therefore put my energy into modifying the format of those rbcL and ITS2 files to make them compliant instead of trying to adapt your code and dada2 to make it work; that sounds much more reasonable!
Thanks again anyway for your time, I look forward to playing with the future developments!
Jef
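
As an illustration of the header rewrite described above, here is a minimal Python sketch. It assumes the target is a header made only of semicolon-separated taxon names (">Name1;Name2;...;"), which is my understanding of the layout that dada2 taxonomy reference files use; that assumption should be checked against the dada2 documentation before relying on it. The names `ENTRY`, `simplify_header` and `convert_fasta` are hypothetical. Whether to keep the species level, the "undef" placeholders, or the original sequence ID is left open; this sketch keeps the first two and drops the ID.

```python
import re

# Hypothetical entry pattern: rank letter, "__", name, "_" (one or more),
# numeric ID. Only the name is kept.
ENTRY = re.compile(r"^[a-z]__(?P<name>.+?)_+\d+$")


def simplify_header(header):
    """Rewrite a header as ">Name1;Name2;...;" without prefixes and IDs."""
    _, _, lineage = header.lstrip(">").partition(" ")
    names = []
    for entry in lineage.split(";"):
        entry = entry.strip()
        # Skip the empty piece after the final ";" and the bare "Root" entry.
        if not entry or entry == "Root":
            continue
        match = ENTRY.match(entry)
        if match is None:
            raise ValueError(f"unrecognized entry: {entry!r}")
        names.append(match["name"])
    return ">" + ";".join(names) + ";"


def convert_fasta(lines):
    """Yield FASTA lines with header lines rewritten, sequences unchanged."""
    for line in lines:
        line = line.rstrip("\n")
        yield simplify_header(line) if line.startswith(">") else line
```

For a real file, `convert_fasta` can be fed an open file handle and its output written line by line to a new file, so the whole reference never needs to be held in memory.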

Jean-Francois Martin

Aug 25, 2017, 1:08:46 AM
to MetacodeR
Meanwhile, if you think of a very easy way to make the rbcL file work with metacoder, I would be happy to know, as I would like to explore the in silico PCR function on this dataset asap. If there is no obvious way, do not waste more time on this, of course.
Thanks,
Jef

Zachary Foster

Aug 25, 2017, 11:33:27 AM
to MetacodeR
Hi Jef,

Oddly, it looks like the GitHub version can parse the whole file using the code I wrote for those two headers, but the current CRAN version cannot, for some reason. It would be great if you can get it into RDP format, since there is already a parser for RDP in the GitHub version of metacoder. I think I have this working in the GitHub version. I will send an example soon. There are a lot of taxa in the whole data set (> 43,000)!

Jean-Francois Martin

Sep 8, 2017, 9:06:47 AM
to MetacodeR
It looks like it works indeed, although my computer does not have enough memory (32 GB)...
Can you confirm?
Thanks.
Jef

Zachary Foster

Sep 8, 2017, 4:55:50 PM
to MetacodeR
Hi Jef,

I was able to parse the dataset and it used 3 GB of RAM. Trying to plot all of it did freeze up my computer, which might have been due to running out of RAM. I will have to look into it. Did it stop working during the parsing or the plotting? Thanks,

-Zach

Jean-Francois Martin

Sep 9, 2017, 1:08:32 PM
to MetacodeR
It stopped during the parsing

Zachary Foster

Sep 20, 2017, 12:44:30 PM
to MetacodeR
Hi Jef,

I am sorry for the delay! I have a lot of other stuff going on right now and I overlooked your response. Hmm, I am not sure why it stopped during parsing. Does your computer say it's using 30 GB of RAM during parsing? Mine says only 3 GB.

I recommend using the dev versions of metacoder and taxa. They are usually stable.

I have a few ideas for making the plotting less RAM-intensive, but I don't know when I will have time to implement them. I was able to subset the data, although it took a few minutes, so you might have to subset the part of the data you are interested in before plotting. Plotting a dataset that big is typically a mess anyway.

I don't know if I can do much without reproducing your problem. What exactly happens when it stops during parsing? Does the whole computer freeze or just R?

-Zach