I tried to follow the StrainPhlAn tutorial. The previous thread was getting too long, so I decided to open a new one.
I have samtools 0.1.19 installed and have added dump_file.py to my PATH. The run now goes on for longer before failing with a different error, and the error message does not give me much useful information. Interestingly, when I ran the last "failed" dump_file.py command by itself, it finished without error.
Can someone please help? Thanks, John
$ sample2markers.py --ifn_samples ES4_MetaG_S5_L001_R1_001.sam.bz2 --input_type sam --output_dir marker/ --nproc 8 &
[1] 114110
$ Traceback (most recent call last):
  File "/usr/local/metaphlan2/strainphlan_src/ooSubprocess.py", line 244, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/metaphlan2/strainphlan_src/sample2markers.py", line 381, in run_sample
    quiet=args['quiet'])
  File "/usr/local/metaphlan2/strainphlan_src/sample2markers.py", line 306, in sam2markers
    stderr=error_pipe)
  File "/usr/local/metaphlan2/strainphlan_src/ooSubprocess.py", line 181, in chain
    %(' | '.join(self.chain_cmds), return_code))
ooSubprocessException: Failed when executing the command: dump_file.py --input_file ES4_MetaG_S5_L001_R1_001.sam.bz2 | samtools view -bS - | samtools sort -o - marker/ES4_MetaG_S5_L001_R1_001.sam.bz2.bam.sorted | samtools mpileup -u - | bcftools view -c -g -p 1.1 -
return code: 255
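One way to narrow down a failure like this is to run the stages of the failing pipeline one at a time and look at each exit code and stderr. The sketch below is not part of StrainPhlAn: `run_stages` and `FAILED_PIPELINE` are hypothetical names, and the stage list is simply copied from the error message above. It keeps each stage's full output in memory, so it is best tried on a small test file.

```python
import subprocess
import sys

# Stages copied from the ooSubprocessException message; adjust paths as needed.
FAILED_PIPELINE = [
    ["dump_file.py", "--input_file", "ES4_MetaG_S5_L001_R1_001.sam.bz2"],
    ["samtools", "view", "-bS", "-"],
    ["samtools", "sort", "-o", "-",
     "marker/ES4_MetaG_S5_L001_R1_001.sam.bz2.bam.sorted"],
    ["samtools", "mpileup", "-u", "-"],
    ["bcftools", "view", "-c", "-g", "-p", "1.1", "-"],
]


def run_stages(stages):
    """Run commands as a pipeline, feeding each stage's stdout to the next.

    Returns a list of (command, return_code, stderr_text) tuples and stops
    at the first stage that fails. A return code of None means the
    executable could not be found on the PATH.
    """
    results = []
    data = b""  # the first stage gets empty stdin
    for cmd in stages:
        try:
            proc = subprocess.run(
                cmd, input=data, stdout=subprocess.PIPE, stderr=subprocess.PIPE
            )
        except FileNotFoundError:
            results.append((cmd, None, "executable not found"))
            break
        results.append((cmd, proc.returncode, proc.stderr.decode(errors="replace")))
        if proc.returncode != 0:
            break
        data = proc.stdout  # hand this stage's output to the next one
    return results


def report(results):
    """Print one line per stage that was run, followed by any stderr text."""
    for cmd, code, err in results:
        print(" ".join(cmd), "->", code, file=sys.stderr)
        if err.strip():
            print(err.strip(), file=sys.stderr)
```

Calling `report(run_stages(FAILED_PIPELINE))` from the directory where sample2markers.py was run would show which stage is the first to exit with a non-zero code, which is the information the combined "return code: 255" hides.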
Thanks for the reply. I don't see options for "samtools_exe" or "bcftools_exe"; see the help output below. Does that mean I don't have the latest version? We installed the software not long ago.
Is "--sam2file_ext" what you meant?
John
[johnny@usslrlx0002 strainphlan]$ /usr/local/metaphlan2/strainphlan_src/sample2markers.py -h
usage: sample2markers.py [-h] --ifn_samples IFN_SAMPLES [IFN_SAMPLES ...]
                         [--ifn_markers IFN_MARKERS] --output_dir OUTPUT_DIR
                         [--nprocs NPROCS] [--min_read_len MIN_READ_LEN]
                         [--min_align_score MIN_ALIGN_SCORE]
                         [--min_base_quality MIN_BASE_QUALITY]
                         [--error_rate ERROR_RATE]
                         [--marker2file_ext MARKER2FILE_EXT]
                         [--sam2file_ext SAM2FILE_EXT] [--verbose]
                         --input_type {fastq,sam}

optional arguments:
  -h, --help            show this help message and exit
  --ifn_samples IFN_SAMPLES [IFN_SAMPLES ...]
  --ifn_markers IFN_MARKERS
  --output_dir OUTPUT_DIR
  --nprocs NPROCS
  --min_read_len MIN_READ_LEN
  --min_align_score MIN_ALIGN_SCORE
  --min_base_quality MIN_BASE_QUALITY
  --error_rate ERROR_RATE
  --marker2file_ext MARKER2FILE_EXT
  --sam2file_ext SAM2FILE_EXT
  --verbose             Show all information. Default "not set".
  --input_type {fastq,sam}
                        The input type: fastq, sam. Sam files can be obtained
                        from the previous run of this script or
                        strainphlan.py).
Hi Tin,

I have 128 sequencing files belonging to 8 samples. That is, each sample was sequenced in 8 lanes, paired-end (see below). They are most likely from the same strain. I understand you said that with fewer than 3 strains the tree cannot be built, but why did it say "ERROR: Problem reading number of species and sites"?

Have a good weekend,
John

$ grep -c '>' s__Lactobacillus_johnsonii.fasta
17
$ grep '>' s__Lactobacillus_johnsonii.fasta
>MOCK_MetaG_S1_L003_R2_001
>MOCK_MetaG_S1_L004_R2_001
>MOCK_MetaG_S1_L006_R2_001
>MOCK_MetaG_S1_L008_R1_001
>MOCK_MetaG_S1_L001_R1_001
>MOCK_MetaG_S1_L005_R2_001
>MOCK_MetaG_S1_L002_R1_001
>MOCK_MetaG_S1_L002_R2_001
>MOCK_MetaG_S1_L006_R1_001
>MOCK_MetaG_S1_L004_R1_001
>MOCK_MetaG_S1_L003_R1_001
>MOCK_MetaG_S1_L007_R2_001
>MOCK_MetaG_S1_L008_R2_001
>GCF_L_johnsonii
>MOCK_MetaG_S1_L007_R1_001
>MOCK_MetaG_S1_L005_R1_001
>MOCK_MetaG_S1_L001_R2_001
ooSubprocess: raxmlHPC-PTHREADS-SSE3 -s /data/IBM_data/shotgun/strainphlan/trees/L_johnsonii2/s__Lactobacillus_johnsonii.fasta -w /data/IBM_data/shotgun/strainphlan/trees/L_johnsonii2 -n s__Lactobacillus_johnsonii.tree -p 1234 -m GTRCAT -T 16
ERROR: Problem reading number of species and sites
2016-10-21 13:36:46,734 | INFO | __main__ | build_tree | 1122 | Cannot build the tree! The number of samples is too few or there is some error with raxmlHMP
2016-10-21 13:36:47,118 | INFO | __main__ | strainer | 1511 | Finished!

On Fri, Oct 21, 2016 at 11:34 AM, Duy Tin Truong <duytin...@gmail.com> wrote:

Hi John,

How many samples did you use, and can you count how many strain samples could be constructed:

grep -c '>' s__Lactobacillus_johnsonii.fasta

If there were fewer than 3 strains, raxml could not build the tree. Besides, you can use the "--relaxed_parameters?" option to reduce stringency.

Cheers,
Tin

On Fri, Oct 21, 2016 at 6:19 PM Johnny Li <ql1...@gmail.com> wrote:

forgot the link
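The `grep -c '>'` check suggested above can be extended slightly: counting the records and also comparing their lengths shows whether the file looks like an alignment with enough sequences in it. This is a minimal sketch, assuming an uncompressed plain-text FASTA file; `fasta_lengths` and `summarize` are hypothetical helpers, not part of StrainPhlAn, and the threshold of 3 is taken from the reply about raxml. It does not by itself explain the "Problem reading number of species and sites" message.

```python
def fasta_lengths(path):
    """Return {record_name: sequence_length} for a plain-text FASTA file.

    The record name is the first word after '>'. Records that share a name
    are merged into one entry, so unique names are assumed.
    """
    lengths = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                parts = line[1:].split()
                name = parts[0] if parts else ""
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


def summarize(lengths, min_records=3):
    """Summarize record count and whether all sequences have equal length."""
    return {
        "records": len(lengths),
        "enough_records": len(lengths) >= min_records,
        "equal_lengths": len(set(lengths.values())) <= 1,
    }
```

For the file in this thread, `summarize(fasta_lengths("s__Lactobacillus_johnsonii.fasta"))` should report 17 records if it agrees with the grep count; unequal lengths would point to a problem with the alignment itself rather than with the number of samples.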
On Fri, Oct 21, 2016 at 11:14 AM, Johnny Li <ql1...@gmail.com> wrote:

Hi Tin,

I ran strainphlan.py to create trees (see below). The script ran with no errors, but it did not create the tree file (RAxML_bestTree.s__Eubacterium_siraeum.tree) mentioned in the tutorial. Instead, I got the following files: arguments.txt, s__Lactobacillus_johnsonii.fasta, s__Lactobacillus_johnsonii.info, s__Lactobacillus_johnsonii.marker_pos, and s__Lactobacillus_johnsonii.polymorphic. Do you know what I have done wrong?

In addition, here is where I got the genomic sequence from the NCBI RefSeq database (in gz, not bz2). Can you help?

thanks
John

strainphlan.py --ifn_samples marker/*.markers --ifn_markers marker_fasta_files/s__Lactobacillus_johnsonii_markers.fasta --ifn_ref_genomes GCF_L_johnsonii.fna.gz --output_dir trees/L_johnsonii --clades s__Lactobacillus_johnsonii --marker_in_clade 0.2 --nprocs_main 16

On Fri, Oct 21, 2016 at 2:11 AM, Duy Tin Truong <duytin...@gmail.com> wrote:

Hi John,

Blastn is required for adding the reference genomes, and muscle is for marker alignment. Besides, you can produce all trees without adding reference genomes by specifying "--clades all"; otherwise you have to run strainphlan for each species. To reduce the stringency, you can look at some suggestions in "--relaxed_parameters?".

Cheers,
Tin

On Thu, Oct 20, 2016 at 6:04 PM Johnny Li <ql1...@gmail.com> wrote:

Sorry, I sent too soon.

Is blastn required by the muscle alignment? I looked at muscle, and it doesn't seem to require blastn. I wonder what blastn is for here?

thanks
John

]$ strainphlan.py --ifn_samples *.markers --ifn_markers s__Salmonella_enterica_markers.fasta --ifn_ref_genomes GCF_000353585.1_S._enterica_Tennessee_CDC07-0191_cds_from_genomic.fna.gz --output_dir . --clades s__Salmonella_enterica --marker_in_clade 0.2
2016-10-20 10:41:08,843 | ERROR | __main__ | check_dependencies | 1529 | Cannot find blastn in the executable path!

On Thu, Oct 20, 2016 at 10:56 AM, Johnny Li <ql1...@gmail.com> wrote:

Hi again,

I was doing this step, but got the error for
On Thu, Oct 20, 2016 at 10:36 AM, Johnny Li <ql1...@gmail.com> wrote:

Hi Tin,

I have a question that is perhaps quite naive. If I identified 10 species in clades.txt, can I extract markers for all 10 species at once? In the tutorial, it was shown one at a time. It seems to me that strainphlan also handles one species at a time for tree generation and viewing?

thanks
John

On Wed, Oct 19, 2016 at 10:48 AM, Duy Tin Truong <duytin...@gmail.com> wrote:

Hi John,

Yes, that is "--print_clades_only". In addition, you can change --marker_in_clade to 0.5.

Cheers,
Tin

On Wed, Oct 19, 2016 at 5:41 PM Johnny Li <ql1...@gmail.com> wrote:

Tin,

thanks. I don't see a --print-clades option. There is one called --print_clades_only, which is probably not the one you referred to. What is your recommendation for the cutoff for the --marker_in_clade option? 0.8 does seem to be quite stringent.

I may have more questions when I get to building trees.

thanks
John

On Wed, Oct 19, 2016 at 9:59 AM, Duy Tin Truong <duytin...@gmail.com> wrote:

Hi John,

Running extract_markers.py is necessary only if you need to add reference genomes to your trees; otherwise you can skip it. You need to run that step once for each clade to which you want to add reference genomes. Besides, the "--print_clades" option will only print the clades with a ratio of present markers greater than 0.8 (this can be changed with --marker_in_clade), i.e. if a clade X has 200 markers and the number of markers present in a sample is 100, then the clade will not be printed. There are other options in strainphlan.py, which can be seen with "-h", to reduce the stringency.

Finally, all_markers.fasta contains all the markers of the database. If you want to extract consensus markers for each sample, you have to open the "*.markers" files produced by sample2markers.py by using msgpack.load(open("the_sample.markers")).

Hope this helps.

Cheers,
Tin

On Wed, Oct 19, 2016 at 4:36 PM Johnny Li <ql1...@gmail.com> wrote:

Hi Tin,

thank you so much for being so patient. It works now.

I now have a question on extracting clade-specific markers. After one identifies all clades in the samples:

strainphlan.py --ifn_samples *.markers --output_dir . --print_clades_only > clades.txt

if clades.txt has 100 species, does one need to run

extract_markers.py --mpa_pkl mpa_v20_m200.pkl --ifn_markers all_markers.fasta --clade s__Eubacterium_siraeum --ofn_markers s__Eubacterium_siraeum.markers.fasta

100 times (replacing s__Eubacterium_siraeum with a different strain name each time)? I guess one can use a for loop ...

BTW, I ran one of my samples, and it generated a clades.txt file that contains the following. Is it normal to have no bacteria? If there are no bacteria, can I still go ahead with extracting strain-specific markers? In other words, does all_markers.fasta contain only bacterial markers?

$ more clades.txt
s__Abelson_murine_leukemia_virus
s__Avian_endogenous_retrovirus_EAV_HP
s__Avian_myelocytomatosis_virus
s__Porcine_type_C_oncovirus
s__Saccharomyces_cerevisiae_killer_virus_M1

thanks
John

On Sat, Oct 8, 2016 at 3:55 AM, Duy Tin Truong <duytin...@gmail.com> wrote:

Hi John,

It was included in the samtools 0.1.19:

Cheers,
Tin
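The for loop mentioned in the question about running extract_markers.py once per clade could look roughly like the sketch below. It only builds the command lines, using the flags exactly as they appear in the thread; `read_clades`, `extract_marker_commands`, and `format_command` are hypothetical helper names, and the sketch assumes clades.txt holds one clade name per line, as in the `more clades.txt` output quoted above.

```python
import shlex


def read_clades(path):
    """Read one clade name per line, skipping blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def extract_marker_commands(clades, mpa_pkl="mpa_v20_m200.pkl",
                            all_markers="all_markers.fasta"):
    """Build one extract_markers.py argument list per clade name."""
    commands = []
    for clade in clades:
        clade = clade.strip()
        if not clade:
            continue
        commands.append([
            "extract_markers.py",
            "--mpa_pkl", mpa_pkl,
            "--ifn_markers", all_markers,
            "--clade", clade,
            "--ofn_markers", clade + ".markers.fasta",
        ])
    return commands


def format_command(cmd):
    """Render an argument list as a shell-safe string for inspection."""
    return " ".join(shlex.quote(arg) for arg in cmd)
```

Printing `format_command(cmd)` for each entry of `extract_marker_commands(read_clades("clades.txt"))` lets you review the commands first; each list can then be passed to `subprocess.run(cmd, check=True)` to run them one after another.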