HTS: What is the correct way to access the sequence of a softclipped alignment?

Juan Monroy-Nieto

Jul 3, 2024, 1:42:18 AM
to biogo-user
When an alignment is soft-clipped, indexing into the read sequence by reference position is off by the soft-clipped length: record.Pos is the position of the first aligned base, so it assumes the clipped bases are masked, while the extraction methods return the whole read, clipped bases included. Both extraction methods for a sequence therefore return the same shifted base.

As an example, suppose the CIGAR string starts with a soft clip of 3 bases. The soft-clipped leading bases are shown as "?".
```txt
     T
XXXXXTXXXX //alignment
???XXWXXTX // record.Seq.Expand()
```

```go
// Both return W instead of the expected T.
r.Seq.Expand()[target-1-r.Pos] // target is 1-based
r.Seq.At(target - 1 - r.Pos)
```

Using record.Start() as an alternative starting point does not work either, because it returns the same value as record.Pos.

I have validated these findings on real data by comparing all values against the Tablet GUI. Currently I compute the shift by parsing the CIGAR string myself. This is less than ideal because it relies on a regexp that I, the user, defined.


The following is a real-world example:
```
# Tablet alignment info, seq soft-clipped
M02260:32:000000000-BLYVN:1:1101:18942:8135
From: 1,095 U1,095 to 1,383 U1,383
Length: 289 U289 (1 mismatch)
Cigar: 12S289M
Read Group: foo
Read direction is REVERSE

>M02260:32:000000000-BLYVN:1:1101:18942:8135
TTTACAACTTAGTATGGCAATATTTATATTCATTAAGAAAAGATAGAGCT
CCATTAGTGTTTTATTGGATTCCTTGGTTTGGTTCTGCAGCTTCATATGG
TCAACAACCTTATGAATTTTTTGAATCATGTCGTCAAAAGTATGGTGATG
TATTTTCATTTATGTTATTAGGGAAAATTATGACGGTTTATTTAGGTCCA
AAAGGTCATGAATTTGTTTTTAATGCTAAATTATCTGATGTTTCTGCTGA
AGATGCTTATAAACATTTAACTACTCCAGTTTTCGGTAA


# HTS/bam parsed
GCTCTTCCGATC //this 12bp portion is included but position still starts at 1094
TTTACAACTTAGTATGGCAATATTTATATTCATTAAGAAAAGATAGAGCT
CCATTAGTGTTTTATTGGATTCCTTGGTTTGGTTCTGCAGCTTCATATGG
TCAACAACCTTATGAATTTTT*T*GAATCATGTCGTCAAAAGTATGGTGATG //search target in between 2 stars
TATTTTCATTTATGTTATTAGGGAAAATTATGACGGTTTATTTAGGTCCA
AAAGGTCATGAATTTGTTTTTAATGCTAAATTATCTGATGTTTCTGCTGA
AGATGCTTATAAACATTTAACTACTCCAGTTTTCGGTAA
```

Dan Kortschak

Jul 3, 2024, 2:07:50 AM
to biogo...@googlegroups.com
If you can provide a minimal code reproducer, that would be really
helpful, but what I would suggest is that you get the offset by using
the Cigar.Lengths method, but only on the portion of the sam.Cigar that
includes the softclip.
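
To make the idea concrete without depending on the library, here is a minimal standard-library sketch of the same offset calculation. It walks the CIGAR operations and adds up the read bases consumed before the target, so a leading soft clip is accounted for, and so are insertions and deletions further along. The names cigarOp, consumes and readIndex are hypothetical stand-ins for illustration only; they are not biogo types or functions. With biogo itself the same offset would presumably come from the parsed sam.Cigar rather than from a hand-written type.

```go
package main

import "fmt"

// cigarOp is a hypothetical stand-in for one CIGAR operation.
// It is NOT the biogo sam.CigarOp type.
type cigarOp struct {
	kind byte // one of M, I, D, N, S, H, P, =, X
	n    int  // run length of the operation
}

// consumes reports whether an operation kind advances the read and/or the
// reference, following the table in the SAM specification.
func consumes(kind byte) (read, ref bool) {
	switch kind {
	case 'M', '=', 'X':
		return true, true
	case 'I', 'S':
		return true, false
	case 'D', 'N':
		return false, true
	}
	return false, false // H and P consume neither
}

// readIndex maps a 0-based reference position to a 0-based index into the
// full (unclipped-by-S) read sequence. pos is the 0-based position of the
// first aligned base. ok is false when refPos is not covered by an aligned
// base, for example when it falls in a deletion or outside the alignment.
func readIndex(pos int, cigar []cigarOp, refPos int) (idx int, ok bool) {
	readAt, refAt := 0, pos
	for _, op := range cigar {
		read, ref := consumes(op.kind)
		if read && ref && refPos >= refAt && refPos < refAt+op.n {
			return readAt + (refPos - refAt), true
		}
		if read {
			readAt += op.n
		}
		if ref {
			refAt += op.n
		}
	}
	return 0, false
}

func main() {
	// 12S289M with a 0-based start of 1094, as in the Tablet example above.
	cigar := []cigarOp{{'S', 12}, {'M', 289}}
	idx, ok := readIndex(1094, cigar, 1094)
	fmt.Println(idx, ok) // prints "12 true": the first aligned base sits after the 12 clipped bases
}
```

For a 1-based target, pass target-1 as refPos and use the returned index with the expanded sequence. Hard clips (H) are skipped on purpose, since hard-clipped bases are not stored in the read sequence.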
