Dear Alexis,
(and anyone that can help)
I saw your reply: "...discussed this with Alexey yesterday and we both think that it
would make more sense to move the partitioning to an external tool
that then just passes two partitions to the downstream ML tool".
Can you please comment if the idea below could work with LocARNA 2.0
(Will, 2024 Methods Mol Biol doi:10.1007/978-1-0716-3519-3_10) forgi and
RAxML-NG or you have any other suggestions?
Thank you so much!
Daniel
1) Use forgi to read locarna output and convert in partition.txt
(Thiel et al. 2019
https://doi.org/10.12688/f1000research.18458.2)
#If I understood correctly:
#save as file locarna2partition.py
#!/usr/bin/env python3
import forgi
from Bio import AlignIO
alignment = AlignIO.read("locarna_output/results/result.stk", "stockholm")
ss = alignment.column_annotations.get('secondary_structure') or alignment.column_annotations.get('SS_cons')
if ss is None:
raise ValueError("No secondary structure annotation found in Stockholm file.")
ss_clean = ss.replace('-', '.').replace('~', '.').replace(':',
'.').replace('<', '(').replace('>', ')').replace('{',
'(').replace('}', ')').replace('[', '(').replace(']', ')')
bg, = forgi.load_rna(ss_clean)
stems, loops = set(), set()
for elem, coords in bg.defines.items():
pos = [i for j in range(0, len(coords), 2) for i in range(coords[j], coords[j+1]+1)]
(stems if elem.startswith('s') else loops).update(pos)
def ranges(s):
p = sorted(s)
r, start, prev = [], p[0], p[0]
for x in p[1:]:
if x == prev + 1:
prev = x
else:
r.append(f"{start}-{prev}" if start != prev else str(start))
start = prev = x
r.append(f"{start}-{prev}" if start != prev else str(start))
return ",".join(r)
with open("partition.txt", "w") as f:
f.write(f"GTR+G, loops = {ranges(loops)}\n")
f.write(f"GTR+G, stems = {ranges(stems)}\n")
#end of file
#install forgi and run
pip install biopython forgi
python3 locarna2partition.py
#The partition.txt output should fit RAxML-NG inference using structure-derived partitions:
raxml-ng --all -msa reference_aligned.fasta --model partition.txt
--bs-trees autoMRE --seed 12345 --threads $THREADS --prefix
referencetree
2) Last, what do you think then to use the reference tree for phylogenetic
placement of short reads using the locarna guide with witch.py
(
https://doi.org/10.1093/bioadv/vbad024) and epa-ng:
witch.py -b reference_aligned.fasta -e locarna_output/results/result.tree -q
short_sequences.fasta -d witch_output -o merged
seqkit seq -m $(seqkit fx2tab -l ref_aligned.fasta | awk '{print $2}'
| sort -nr | head -1) witch_output/merged.masked.fasta >
query_aligned.fasta
#epa-ng phylogenetic placement:
mkdir epa_output
epa-ng --ref-msa reference_aligned.fasta --tree referencetree.treefile --query
query_aligned.fasta --model GTR+G --outdir epa_output --threads
$THREADS
#thanks a lot!
#any other inputs or software suggestions are largely appreciated