A-tail extraction is not currently built into STAR, since the procedure depends strongly on the library prep.
However, A-tails (or any other kind of untemplated tails) can be extracted from STAR output using very simple logic.
STAR’s alignments are “extended” as far as possible to match the genome on both ends. Any portion of the read that remains unmapped at either end is reported as a “soft-clipped” S operation in the CIGAR string of the SAM output. To identify an A-tail, you need to check that the soft-clipped portion of the sequence is made up of ‘A’s or ‘T’s, depending on the strand. The details of this check depend on your sequencing protocol.
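As a rough illustration of this logic (this is not the attached awk script), here is a minimal Python sketch. It assumes a dUTP-style library in which mate 1 is reverse complementary to the RNA, so the tail shows up as soft-clipped A's at the right end of reverse-strand alignments and as soft-clipped T's at the left end of forward-strand alignments. The function name `a_tail_from_mate1` and the `min_fraction` threshold are hypothetical and would need tuning for real data.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def a_tail_from_mate1(flag, pos, cigar, seq, min_fraction=0.8):
    """Return (coord, strand, tail_length, n_tail_bases) or None.

    flag, pos, cigar, seq are the FLAG, POS, CIGAR and SEQ fields of one
    SAM record. strand is 1 for '+' and 2 for '-'.
    min_fraction is an illustrative threshold (an assumption, not a STAR
    parameter): the minimum share of A/T bases in the soft clip.
    """
    if not flag & 0x40:  # only mate 1 carries the tail in this layout
        return None
    # Hard clips are not present in SEQ, so drop them before looking at ends.
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar) if op != "H"]
    if not ops:
        return None
    # Number of reference bases covered by the alignment.
    ref_len = sum(n for n, op in ops if op in "MDN=X")

    if flag & 0x10:
        # Mate 1 on the reverse strand: RNA is on '+', and SEQ (shown on the
        # forward strand) ends with the A-tail.
        n, op = ops[-1]
        if op != "S":
            return None
        n_tail = seq[-n:].count("A")
        coord, strand = pos + ref_len, 1  # next base after the last aligned base
    else:
        # Mate 1 on the forward strand: RNA is on '-', and SEQ starts with T's.
        n, op = ops[0]
        if op != "S":
            return None
        n_tail = seq[:n].count("T")
        coord, strand = pos - 1, 2  # base just before the first aligned base
    if n_tail < min_fraction * n:
        return None
    return coord, strand, n, n_tail
```

Other protocols would need the mate and strand tests flipped accordingly.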
I have attached a simple awk script that extracts poly-A tails for a standard dUTP protocol, in which the second mate is the first (upstream) fragment of the RNA, with the same sequence as the RNA, while the first mate is the second (downstream) fragment, with a sequence reverse complementary to that of the RNA.
You can run the script:
> awk -f extractAtails_1stMate2.awk Aligned.out.sam > Atails.txt
The Atails.txt file will contain the coordinates of the A-sites (i.e. the next base after the last transcribed base) detected in each A-tailed read, with the following columns:
chr coord strand(1+,2-) AtailLength NumberOfAs
You can then collapse these A-sites and count the number of reads per site.
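A minimal Python sketch of this collapsing step, assuming Atails.txt is whitespace-delimited with the columns listed above (the function name `count_reads_per_site` is hypothetical):

```python
from collections import Counter


def count_reads_per_site(lines):
    """Count A-tailed reads per (chr, coord, strand) site.

    Each line is expected to hold: chr coord strand AtailLength NumberOfAs.
    Lines whose second field is not an integer (e.g. a header) are skipped.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or not fields[1].isdigit():
            continue
        counts[(fields[0], int(fields[1]), fields[2])] += 1
    return counts
```

To use it, open Atails.txt, pass the file object to `count_reads_per_site`, and iterate over `counts.most_common()` to list the sites by read support.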
Cheers
Alex
On Thursday, April 18, 2013 2:57:27 PM UTC-4, Charles wrote:
Hi,
I work in a bioinformatics core at UCLA and I use STAR for aligning
RNAseq reads.
I am working on an experiment where the researcher is looking to detect
polyT sequences and where they align to identify alternative
polyadenylation.
Does STAR automatically trim reads with these sequences and then align
them or do some special parameters need to be set in place first?
Thank you,
Charles Blum