Following our emails yesterday, I experimented with the command on some of my datasets. For one of them I would have expected NO reverse primers to be found, because the reads are so short. However, the log output (below) shows that out of 505,695 sequences only 18,007 contained no reverse primer.
The reverse primer is written as TGATCCTTCTGCAGGTTCACCTAC in the mapping file, but appears as GTAGGTGAACCTGCAGAAGGATCA in the inflated fasta file.
Details for removal of reverse primers
Original fasta filepath: denoised_seqs_inflated.fna
Total seqs in fasta: 505695
Mapping filepath: /Users/serena/Documents/Academia/E_Soi/Archive/Mapping_Files/Soi_Map2.txt
Truncation option: truncate_only
Mismatches allowed: 0
Total seqs written: 505695
SampleIDs not found: 0
Reverse primers not found: 18007
Am I right in thinking, then, that this file will need to be re-processed, given that in the majority of cases the reverse primer has been located and removed (albeit on very short reads)? (Also, the reverse primer is written as TGATCCTTCTGCAGGTTCACCTAC in the mapping file, but it was removed as its reverse complement, GTAGGTG..., without me changing it.)
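To double-check that the two primer strings really are reverse complements of each other, here is a minimal Python sketch. The helper name reverse_complement is my own and is not part of the truncate_reverse_primer script, and it only handles plain A/C/G/T (no IUPAC ambiguity codes).

```python
# Complement table for unambiguous DNA bases only (assumption: no IUPAC codes).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (illustrative helper)."""
    return seq.upper().translate(COMPLEMENT)[::-1]

mapping_primer = "TGATCCTTCTGCAGGTTCACCTAC"   # as written in the mapping file
found_in_reads = "GTAGGTGAACCTGCAGAAGGATCA"   # as seen in the inflated fasta file

print(reverse_complement(mapping_primer) == found_in_reads)  # prints True
```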
Can I do much in terms of assignments with read lengths this short (see sample at bottom)? Would it be better to leave the primer in?
For most of my datasets the truncate_reverse_primer command has failed to find and remove any reverse primers (showing the following):
Details for removal of reverse primers
Original fasta filepath: Denoised_seqs.fna
Total seqs in fasta: 5064
Mapping filepath: /Users/serena/Documents/Academia/Cerco/mapping_info_Cerco/Cerco_only_corrected.txt
Truncation option: truncate_only
Mismatches allowed: 0
Total seqs written: 5064
SampleIDs not found: 0
Reverse primers not found: 5064
I suspected it might be a problem with the orientation of the primer in the mapping file (although the script found the reverse complement in the example above). I therefore experimented with the orientation by reverse complementing and complementing the primer, saving each version as a new mapping file and re-running.
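A quick way to see which orientation actually occurs in a read, without editing the mapping file each time, might look like the sketch below. This is not how truncate_reverse_primer works internally (I have not checked); the helper names are made up for illustration, matching is exact (0 mismatches), and read_tail is just the 3' end of one of the sample sequences at the bottom of this message.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def orientations(primer):
    """Return the four possible orientations of a primer, keyed by name."""
    comp = primer.translate(COMPLEMENT)
    return {
        "as_written": primer,
        "reverse": primer[::-1],
        "complement": comp,
        "reverse_complement": comp[::-1],
    }

def find_orientations(read, primer):
    """Map each orientation found in the read (exact match) to its start position."""
    hits = {}
    for name, candidate in orientations(primer).items():
        pos = read.find(candidate)
        if pos != -1:
            hits[name] = pos
    return hits

primer = "TGATCCTTCTGCAGGTTCACCTAC"  # as written in the mapping file
read_tail = "GTCGTAACAAGGTTTCCGTAGGTGAACCTGCAGAAGGATCACTGAGACTACC"

hits = find_orientations(read_tail, primer)
print(hits)
for name, pos in hits.items():
    # Everything before the primer is what a truncate_only run would keep.
    print(name, "->", read_tail[:pos])
```

For this read only the reverse complement matches, and cutting at that position leaves the same 3' end as in the truncated sample sequences.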
In most cases some reverse primers were then found, i.e. the 'Reverse primers not found' count was lower than the total number of sequences. Can I presume that the script has therefore worked?
Would you expect it to have removed the reverse primer from the majority of sequences? For example, one dataset has ~5000 sequences and 4,400 instances where the reverse primer was not found. Do you think it is worth re-running the rest of the processing steps given that only around 10% of the data appear to have had the reverse primer removed?
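For reference, the share of reads in which a primer was found follows directly from the two counts in each log. A small sketch using the numbers quoted in this message (the ~5000 / 4,400 figures are rounded, and the function name is just for illustration):

```python
def fraction_found(total_seqs, primers_not_found):
    """Fraction of sequences in which the reverse primer was located."""
    if total_seqs <= 0:
        raise ValueError("total_seqs must be positive")
    return (total_seqs - primers_not_found) / total_seqs

# (total seqs in fasta, reverse primers not found), taken from the logs above
datasets = {
    "denoised_seqs_inflated.fna": (505695, 18007),
    "Denoised_seqs.fna": (5064, 5064),
    "example dataset (rounded figures)": (5000, 4400),
}

for name, (total, not_found) in datasets.items():
    print(f"{name}: {fraction_found(total, not_found):.1%} with primer found")
```

With the rounded figures, the last dataset comes out slightly above the 10% I quoted.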
Did I even need to change the orientation of the primer in the mapping file? Presumably it is ok to have changed this in order to find the reverse primers? It is just strange that it changed it automatically in one dataset and failed in another until I manually changed the mapping file.
Many thanks for your help
Serena
#Sequences after inflated_denoise_wrapper but before truncate_reverse_primer:
>Soils.LoddBcrop.crop8_8323 GV2FPVH01ECN49 orig_bc=CGTACTCAGA new_bc=CGTACTCAGA bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCAGAAGGATCACTGAGACTACCAAGGCACACAGGGGATAGG
>Soils.ClaxMcrop.crop9_251032 GV2Q3QI01BW9UE orig_bc=CTATAGCGT new_bc=CTATAGCGT bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCAGAAGGATCACTGAGACTACCAAGGCACACAGGGGATAGG
>Soils.LoddAcrop.crop7_314059 GV2Q3QI01CH43A orig_bc=CGCGTATA new_bc=CGCGTATA bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCAGAAGGATCACTGAGACTACCAAGGCACACAGGGGATAGG
>Soils.Oward.crop.crop12_324340 GV2Q3QI01C8JOX orig_bc=TCGATAGTGA new_bc=TCGATAGTGA bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCAGAAGGATCACTGAGACTACCAAGGCACACAGGGGATAGG
>Soils.LoddAcrop.crop7_387555 GV2Q3QI01D13VF orig_bc=CGCGTATA new_bc=CGCGTATA bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCAGAAGGATCACTGAGACTACCAAGGCACACAGGGGATAGG
>Soils.EdgBaside.aside6_486595 GW10GNM03DBVAN orig_bc=CGCAGTACG new_bc=CGCAGTACG bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCAGAAGGATCACTGAGACTACCAAGGCACACAGGGGATAGG
>Soils.Yettcrop.crop17_228673 GV2Q3QI01EA8BB orig_bc=ACGACAGC new_bc=ACGACAGC bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCAGAAGGATCACTGAGACTACCAAGGCACACAGGGGATAGG
>Soils.ClaxMcrop.crop9_243228 GV2Q3QI01CPN7Y orig_bc=CTATAGCGT new_bc=CTATAGCGT bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCAGAAGGATCACTGAGACTACCAAGGCACACAGGGGATAGG
#Sequences after truncate_reverse_primers
>Soils.LoddBcrop.crop8_8323 GV2FPVH01ECN49 orig_bc=CGTACTCAGA new_bc=CGTACTCAGA bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC
>Soils.ClaxMcrop.crop9_251032 GV2Q3QI01BW9UE orig_bc=CTATAGCGT new_bc=CTATAGCGT bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC
>Soils.LoddAcrop.crop7_314059 GV2Q3QI01CH43A orig_bc=CGCGTATA new_bc=CGCGTATA bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC
>Soils.Oward.crop.crop12_324340 GV2Q3QI01C8JOX orig_bc=TCGATAGTGA new_bc=TCGATAGTGA bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC
>Soils.LoddAcrop.crop7_387555 GV2Q3QI01D13VF orig_bc=CGCGTATA new_bc=CGCGTATA bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC
>Soils.EdgBaside.aside6_486595 GW10GNM03DBVAN orig_bc=CGCAGTACG new_bc=CGCAGTACG bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC
>Soils.Yettcrop.crop17_228673 GV2Q3QI01EA8BB orig_bc=ACGACAGC new_bc=ACGACAGC bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC
>Soils.ClaxMcrop.crop9_243228 GV2Q3QI01CPN7Y orig_bc=CTATAGCGT new_bc=CTATAGCGT bc_diffs=0
GCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC